Global Aggregation of Ancestry Aware Allele Frequencies From Genotyping Data
Abstract
Population-specific allele frequencies are an indispensable part of variant annotation for clinical purposes. The 1000 Genomes Project initially established 5 continental super-populations and, in phase 3, 26 ethnic populations, representing human diversity by selecting around 100 representatives of each to aggregate their respective allele frequencies. Because variant annotation for clinical phenotypes uses the population-specific abundance of a variant observed in a patient to discover or rank genotype-phenotype relationships, it will become increasingly important to establish a consensus on three points: the ancestry/population assignment of the patient, the accuracy of the allele frequencies used for that purpose, and the seamless incorporation of admixed samples into the process.
Here we present a collaboration between Illumina, InsideDNA.me and academic partners to potentially harness the power of millions of genotyped samples to significantly improve the accuracy of population-based allele frequencies. We developed a cost-effective, scalable, cloud-based analysis pipeline that combines ancestry deconvolution via the open-source tool iAdmix with population-specific allele frequency aggregation, in a manner that does not require actual patient genotyping results to be deposited in a central database. It was explicitly developed to allow admixed samples to contribute to global allele frequencies. Each clinical researcher who can contribute their patients' data to the effort will be able to aggregate their own samples and only has to deposit the aggregated, and therefore anonymized, results into central data repositories. Average analysis costs per sample, with 650k genotypes analyzed against 26 reference populations represented by allele frequencies at over 380k loci, currently do not exceed $0.05. Additional functionality could be added to the analysis pipeline, for instance a central repository of sample names to cross-check for duplicates, or a counter for the number of homozygous alternate alleles encountered for a given variant.
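The aggregation idea described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline and does not call iAdmix: it assumes each sample arrives with an alternate-allele count (0, 1 or 2) at one locus and a set of ancestry fractions that some deconvolution step has already estimated, and it attributes the sample's alleles to each population in proportion to those fractions. That proportional weighting, the function names, and the use of the population codes CEU and YRI are all assumptions made for illustration.

```python
from collections import defaultdict


def aggregate_site(samples):
    """Tally ancestry-weighted allele counts at one locus.

    samples: iterable of (alt_allele_count, ancestry) pairs, where
    alt_allele_count is 0, 1 or 2 and ancestry maps a population label
    to that sample's estimated ancestry fraction (assumed to sum to 1).
    Returns (alt, total): weighted alternate-allele and total-allele
    counts per population. No individual genotype is kept.
    """
    alt = defaultdict(float)
    total = defaultdict(float)
    for alt_count, ancestry in samples:
        for population, fraction in ancestry.items():
            # Assumption: a diploid sample contributes 2 alleles, split
            # across populations in proportion to its ancestry fractions.
            alt[population] += fraction * alt_count
            total[population] += fraction * 2
    return dict(alt), dict(total)


def merge_tallies(first, second):
    """Combine two (alt, total) tallies from independent contributors."""
    merged_alt = defaultdict(float)
    merged_total = defaultdict(float)
    for alt, total in (first, second):
        for population, count in alt.items():
            merged_alt[population] += count
        for population, count in total.items():
            merged_total[population] += count
    return dict(merged_alt), dict(merged_total)


def allele_frequencies(tally):
    """Convert an (alt, total) tally into per-population frequencies."""
    alt, total = tally
    return {
        population: alt.get(population, 0.0) / count
        for population, count in total.items()
        if count > 0
    }


# Hypothetical data: two contributors aggregate locally, then only the
# tallies are combined.
site_a = aggregate_site([(2, {"CEU": 1.0}), (1, {"CEU": 0.5, "YRI": 0.5})])
site_b = aggregate_site([(0, {"YRI": 1.0})])
combined = allele_frequencies(merge_tallies(site_a, site_b))
```

Because the tallies are plain sums, merging them gives the same result as aggregating all samples in one place, which is what lets each contributor deposit only counts. The sketch ignores which parental haplotype carries the alternate allele, so it should be read as a simplification rather than a statement of how the pipeline weights admixed samples.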
Authors
- Frank Boellmann (Illumina)
- Vikas Bansal (University of California San Diego)
- Andrey Khmelevskiy (InsideDNA.me)
- Alexander Stepakov (InsideDNA.me)
Topic Areas
Large scale data management, cloud computing; Bringing sequence to the clinic (i.e., diagnostics, cancer, inherited disorders); Global engagement and partnerships
Session
OS-1 » Human Forensics & Ancestry (10:00 - Tuesday, 16th May, La Fonda Ballroom)