Introducing GRAPE — an open-source Genomic Relatedness Detection Pipeline
GRAPE is an end-to-end open-source Genomic RelAtedness detection PipelinE that can be freely used for scientific and commercial purposes (GitHub, Paper). GRAPE is accurate for close and distant relatives and is ready for production use.
What is relatedness detection and how does it work?
On average, each person has 1,500 living relatives up to the 10th degree of kinship, where the degree is the number of steps from one individual to another in a genealogical tree.
Relatedness degree is predicted for a given pair of individuals using their genetic data. Individuals inherit their DNA equally from each parent, 50% from the mother and 50% from the father. Parental DNA comes in chunks. Some of these chunks are passed on to the next generation, possibly broken during the recombination process.
Two individuals are considered relatives if they have identical chunks of DNA inherited from a common ancestor. These chunks are called identical-by-descent (IBD) segments. Closer relatives share more IBD segments, with greater total and average length. This information is the basis for predicting relationship degrees.
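The per-pair statistics mentioned above (number of shared segments, total length, and average length) can be sketched in a few lines of Python. This is only an illustration: the `(id1, id2, length_cm)` tuple layout and the function name `summarize_ibd` are hypothetical and do not correspond to the output format of any particular IBD detection tool.

```python
from collections import defaultdict


def summarize_ibd(segments):
    """Aggregate IBD segments per unordered pair of individuals.

    `segments` is an iterable of (id1, id2, length_cm) tuples.
    This layout is a hypothetical simplification for illustration.
    """
    lengths = defaultdict(list)
    for id1, id2, length_cm in segments:
        # Sort the IDs so that (A, B) and (B, A) count as the same pair.
        pair = tuple(sorted((id1, id2)))
        lengths[pair].append(length_cm)
    return {
        pair: {
            "count": len(values),
            "total_cm": sum(values),
            "mean_cm": sum(values) / len(values),
        }
        for pair, values in lengths.items()
    }


# Toy input: A and B share two long segments, A and C share one short segment.
segments = [("A", "B", 40.0), ("B", "A", 60.0), ("A", "C", 8.0)]
summary = summarize_ibd(segments)
```

In this toy input, the pair A and B would look like much closer relatives than A and C, because they share more segments with a larger total and average length.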
GRAPE overview
- GRAPE adopts best practices for software development, including the Snakemake workflow management system, Docker containerization, the Funnel task execution service, and CI/CD with automatic testing.
- GRAPE requires a single multisample VCF file as input and has a separate workflow for downloading and checking the integrity of reference datasets.
- As real-world datasets are often heterogeneous and inconsistent, GRAPE incorporates various data preprocessing and quality control (QC) options.
- GRAPE has a modular architecture that allows switching between tools.
- GRAPE supports the adjustment of tools for better control of precision and recall.
Preprocessing
Comprehensive data preprocessing and QC are essential for ensuring accurate outcomes and low false-positive rates. The pipeline's preprocessing includes optional phasing, imputation, conversion from the hg38 to the hg19 assembly, and SNP and sample QC. QC removes all indels and rare SNPs, checks for strand and SNP ID mismatches, and removes all suspicious samples, namely those with high missingness or high homozygosity.
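As a rough illustration of the sample QC step, the sketch below flags samples with high missingness or high homozygosity. It is not GRAPE's implementation: the genotype encoding (alternate-allele counts 0, 1, 2, with `None` for a missing call), the function name `sample_qc`, and both thresholds are assumptions chosen for the example.

```python
def sample_qc(genotypes, max_missing=0.1, max_hom=0.95):
    """Split samples into (kept, removed) lists.

    `genotypes` maps a sample ID to a list of alternate-allele counts
    (0, 1, 2) or None for a missing call. The encoding and the default
    thresholds are hypothetical, chosen only for this example.
    """
    kept, removed = [], []
    for sample, calls in genotypes.items():
        called = [g for g in calls if g is not None]
        # Fraction of sites without a genotype call.
        missing_rate = 1 - len(called) / len(calls) if calls else 1.0
        # Fraction of called sites that are homozygous (0 or 2 alt alleles).
        hom_rate = sum(g != 1 for g in called) / len(called) if called else 1.0
        if missing_rate > max_missing or hom_rate > max_hom:
            removed.append(sample)
        else:
            kept.append(sample)
    return kept, removed


genotypes = {
    "s1": [0, 1, 2, 1, 0, 1, 2, 1, 0, 1],           # passes both checks
    "s2": [0, None, None, None, 1, 2, 1, 0, 1, 1],  # 30% missing calls
    "s3": [0, 0, 2, 2, 0, 0, 2, 2, 0, 0],           # fully homozygous
}
kept, removed = sample_qc(genotypes)
```

Loosening or tightening the two thresholds trades the number of retained samples against the risk of spurious matches from low-quality data.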
Inferring degrees of relatedness by IBD detection
Tools used to determine relatedness degree between individuals are based on either estimating IBD regions or population allele frequencies. GERMLINE, Refined IBD, RaPID, PhasedIBD, and IBIS detect IBD segments from a common ancestor. Other tools such as KING and PLINK use population allele frequencies.
After a careful analysis of these tools, we chose IBIS as the most accurate and efficient tool for inferring IBD segments. IBIS does not require phasing or imputation, and it lets users control the false-positive rate and sensitivity by adjusting the parameters of IBD segment detection. As KING is a widely known tool for determining close degrees of kinship, it is optionally available in the GRAPE pipeline.
Relatedness estimation by shared IBD segments
The total length of IBD segments decreases with each generation due to the recombination process. Because the expected shared length falls quickly as the degree of relatedness grows, distant degrees become hard to tell apart, and it is difficult to predict relatedness accurately from the total length of IBD segments alone.
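A small calculation makes the decay concrete. The sketch below uses the common approximation that the expected fraction of the genome shared IBD halves with each additional degree of relatedness, that is, roughly 2^-d for degree d. Both this approximation and the autosomal map length of 3400 cM are assumptions made for the illustration, not values taken from GRAPE.

```python
ASSUMED_GENOME_CM = 3400.0  # rough autosomal genetic map length; an assumption


def expected_shared_fraction(degree):
    """Approximate expected fraction of the genome shared IBD at a given degree."""
    if degree < 1:
        raise ValueError("degree must be a positive integer")
    return 0.5 ** degree


def expected_total_cm(degree, genome_cm=ASSUMED_GENOME_CM):
    """Approximate expected total IBD length in centimorgans."""
    return expected_shared_fraction(degree) * genome_cm


# Tabulate the approximation for degrees 1 through 10.
for degree in range(1, 11):
    print(f"degree {degree:2d}: ~{expected_total_cm(degree):8.2f} cM expected")
```

Under these assumptions, the expected total drops from well over a thousand centimorgans for first-degree relatives to only a few centimorgans at the 10th degree, where adjacent degrees differ by very little in absolute terms.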
GRAPE uses the ERSA tool to estimate relatedness degrees from the IBD segments detected by IBIS. ERSA calculates the most likely relationship degree from the lengths and the total number of IBD segments identified between two samples. ERSA was originally intended for estimating distant degrees (4+), yet additional post-processing allows the tool to be used for close degrees as well (1–3).
GRAPE’s detection power
To assess GRAPE’s accuracy, we used the ped-sim simulation software to generate a few pedigrees from unrelated individuals in the 1000 Genomes Project. The method uses sex-specific genetic maps to simulate pedigree structures.
The pipeline was tested with both strict and relaxed parameters controlling the false-positive rate. Thanks to its comprehensive preprocessing module, GRAPE proved to be highly reliable and capable of producing almost zero false positives when needed.
The pipeline proved to be able to detect relatedness up to the 10th degree and is accurate up to the 8th degree. GRAPE is ready to operate on highly diverse real-world datasets because of its performance and flexibility with settings and parameters.
Comparison with TRIBES
TRIBES is an earlier open-source pipeline for relatedness detection. It combines the GERMLINE tool for IBD segment detection with a calculation, for each pair, of the proportion of the genome in which zero alleles are inferred to be IBD (IBD0). GERMLINE requires phased data; if the data is not phased, TRIBES can phase it with the EAGLE tool. This part of TRIBES is similar to one of the two main GRAPE workflows for IBD segment detection. In contrast to GRAPE, TRIBES estimates degrees of relationship according to expected ranges of the IBD0 proportion. GRAPE uses the ERSA algorithm, which, to our knowledge, is a more advanced approach.
We compared TRIBES with GRAPE on the same simulated dataset. TRIBES demonstrated higher detection power for distant relationships, up to the 12th degree. Given that many relatives of the 13th degree and beyond do not share any IBD segments, this is near the theoretical limit. However, the precision-recall analysis shows that TRIBES produces more false-positive matches than GRAPE for degrees 7–11. TRIBES also lacks preprocessing and evaluation options that would let users control the false-positive rate by varying pipeline parameters, which is a crucial drawback. It prevents the TRIBES pipeline from being adapted to applications where the desired sensitivity/specificity trade-off varies with business or research objectives.
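For readers who want to reproduce this kind of comparison on their own simulated data, a per-degree precision and recall calculation might look like the sketch below. It assumes exact-degree matching between predicted and true labels, which may differ from the evaluation procedure used in the GRAPE paper; the function name and the dictionary layout are hypothetical.

```python
def precision_recall_by_degree(true_degrees, predicted_degrees):
    """Compute (precision, recall) for every degree seen in either input.

    Both arguments map a pair of sample IDs to a relatedness degree.
    A prediction counts as correct only if the degree matches exactly,
    which is a simplifying assumption for this example.
    """
    degrees = set(true_degrees.values()) | set(predicted_degrees.values())
    result = {}
    for degree in sorted(degrees):
        predicted = {p for p, d in predicted_degrees.items() if d == degree}
        actual = {p for p, d in true_degrees.items() if d == degree}
        true_positives = len(predicted & actual)
        # None marks an undefined ratio (no predictions or no true pairs).
        precision = true_positives / len(predicted) if predicted else None
        recall = true_positives / len(actual) if actual else None
        result[degree] = (precision, recall)
    return result


true_degrees = {("A", "B"): 1, ("A", "C"): 2, ("B", "C"): 2}
predicted_degrees = {("A", "B"): 1, ("A", "C"): 2, ("C", "D"): 2}
metrics = precision_recall_by_degree(true_degrees, predicted_degrees)
```

In the toy data, the spurious pair C and D lowers precision for degree 2, while the missed pair B and C lowers recall; this is the trade-off that adjustable pipeline parameters are meant to control.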
Conclusion
The performance analysis demonstrates that GRAPE requires 30 minutes to process a dataset of 10,000 samples with the primary IBIS-only option. The pipeline was also run on 100,000 samples and processed the entire dataset in 22 hours, confirming that IBIS is a simple and efficient tool. We have developed GRAPE as user-friendly, fast, and accurate software, and we invite you to visit our GitHub repository, test the pipeline, and help us make it better. You can find more information in the F1000Research GRAPE Software Article.