Evaluating the effects of using the GRCh38 human genome reference in multiple heterogeneous secondary sequencing analysis pipelines
Abstract
Research sequencing at the Baylor College of Medicine Human Genome Sequencing Center (HGSC) involves the sequencing and analysis of over 2,000 whole-genome samples per month from numerous projects and collaborators, including the Alzheimer's Disease Sequencing Project (ADSP), Trans-Omics for Precision Medicine (TOPMed), and the Centers for Common Disease Genomics (CCDG). The analysis pipelines that the HGSC must support and manage concurrently have highly heterogeneous project requirements: samples and projects combine various sequencer technologies, sequencing applications, coverage levels, and genome references used in mapping. Maintaining pipeline support for current project requirements while developing support for new and increasingly heterogeneous requirements has prompted a number of systematic changes in the way secondary analysis is performed at the HGSC. In particular, HgV, the workflow manager for primary and secondary analysis pipelines at the HGSC, has been updated to support CCDG-compliant pipelines, which use the GRCh38 human genome reference from the Genome Reference Consortium. This update has required changes to multiple other programs and pipeline methods to address genome coordinates that are incompatible with, and reference features that are absent from, the already-supported hg19 and hs37d5 human genome references. xAtlas is a variant caller for SNPs and small indels used by the HGSC in resequencing pipelines; it features a retrainable, logistic regression-based model for evaluating candidate variants and supports reading CRAM files. xAtlas variant calls have been evaluated for over 30 trios from recent projects and for other well-characterized samples, and subsequent model retraining based on these samples has improved variant recall for NA12878 to over 99% for SNPs and over 93% for indels when compared against high-confidence variant sets from NIST.
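As a rough illustration of the recall figures quoted above, the sketch below computes recall as the fraction of truth-set variants that also appear in a call set. It is not the HGSC evaluation procedure: the function name `variant_recall`, the exact-match comparison on (chromosome, position, ref, alt) tuples, and the toy variants are all assumptions made for the example. Real comparisons against high-confidence sets usually also normalize variant representation and restrict to confident regions, which this sketch omits.

```python
def variant_recall(called, truth):
    """Return the fraction of truth-set variants present in the call set.

    Both arguments are sets of (chrom, pos, ref, alt) tuples; matching is
    by exact tuple equality, which is a simplifying assumption.
    """
    if not truth:
        raise ValueError("truth set is empty; recall is undefined")
    true_positives = len(called & truth)
    return true_positives / len(truth)


# Hypothetical toy data, not taken from any real sample.
truth_set = {
    ("chr1", 100, "A", "G"),
    ("chr1", 200, "C", "T"),
    ("chr2", 50, "G", "GA"),
    ("chr2", 75, "T", "C"),
}
call_set = {
    ("chr1", 100, "A", "G"),
    ("chr1", 200, "C", "T"),
    ("chr2", 75, "T", "C"),
    ("chr3", 10, "A", "T"),  # a call absent from the truth set
}

toy_recall = variant_recall(call_set, truth_set)  # 3 of 4 truth variants found
```

Calls that are absent from the truth set, such as the `chr3` entry here, lower precision but leave recall unchanged.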
Running more diverse sets of TOPMed- and CCDG-compliant pipelines has also provided ample data for a large-scale quality control assessment of current sequencing analysis techniques across an assortment of pipelines. For 188 whole-genome ADSP samples run on multiple pipelines, results from AlignStats, a program that calculates alignment and coverage statistics, and VerifyBamID, which estimates contamination rate, indicate that pipeline metrics either remain the same or vary only slightly between the hs37d5 and GRCh38 references (R² ≥ 0.97 for all calculated metrics). Studying the effects of using the GRCh38 reference in analysis pipelines has also helped in outlining best practices for calculating sequence and coverage metrics most accurately with different references and other pipeline specifications, and in developing analysis pipeline support for the Illumina NovaSeq sequencer. The resulting set of secondary analysis pipelines supports the GRCh38 reference across multiple projects, with performance and accuracy improvements over previous pipeline versions.
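The cross-reference comparison described above can be sketched as a per-metric coefficient of determination computed over paired samples. The abstract does not say how its R² values were calculated, so this example assumes the squared Pearson correlation between a metric measured on the same samples under each reference. The function name `r_squared` and the sample values are hypothetical.

```python
from math import fsum


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = fsum(x) / len(x)
    mean_y = fsum(y) / len(y)
    sxy = fsum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = fsum((a - mean_x) ** 2 for a in x)
    syy = fsum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant sequence has no defined correlation")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical mean-coverage values for the same five samples, one list per
# reference; these numbers are illustrative only.
coverage_hs37d5 = [30.1, 32.4, 28.9, 35.0, 31.2]
coverage_grch38 = [30.0, 32.6, 28.7, 35.3, 31.0]

coverage_r2 = r_squared(coverage_hs37d5, coverage_grch38)
```

A value near 1 means the metric ranks and scales samples almost identically under both references. It does not by itself show that the absolute values agree, since a constant offset between references leaves R² unchanged.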
Authors
- Jesse Farek (Baylor College of Medicine / Human Genome Sequencing Center)
- Olga Krasheninina (Baylor College of Medicine / Human Genome Sequencing Center)
- Waleed Nasser (Baylor College of Medicine / Human Genome Sequencing Center)
- Kimberly Walker (Baylor College of Medicine / Human Genome Sequencing Center)
- Adam Mansfield (Baylor College of Medicine / Human Genome Sequencing Center)
- Donna M. Muzny (Baylor College of Medicine / Human Genome Sequencing Center)
- Richard A. Gibbs (Baylor College of Medicine / Human Genome Sequencing Center)
- William J. Salerno (Baylor College of Medicine / Human Genome Sequencing Center)
Topic Areas
Comparative genomics, re-sequencing, SNPs, structural variation; Large scale data management, cloud computing; Human, non-human, and infectious disease applications
Session
OS-2 » Human Genomic Applications (13:00 - Tuesday, 16th May, La Fonda Ballroom)