Accurate targeted assembly and variant calling from NanoString Hyb & Seq data
Abstract
The Hyb & Seq technology is a novel single-molecule approach for targeted DNA resequencing developed by NanoString. Given a short fragment of a reference genome (e.g., 100-200 bp), Hyb & Seq aims to find all mutations in this segment without the tedious library preparation and amplification steps. These steps, which are critical parts of many other sequencing technologies, introduce various biases and inflate error rates, making variant calling challenging. Moreover, they are expensive and time-consuming, which makes the corresponding technologies inapplicable for many important real-life purposes (e.g., rapid clinical diagnostics).
Hyb & Seq uses cyclic nucleic acid hybridization of fluorescent barcoded k-mers and works as follows. Genomic DNA or RNA is first “gapped” to generate a single-stranded region and captured onto a flow cell. Optical barcodes are hybridized to these single-molecule targets, and the bases at each hybridized target are read out, yielding a short k-base read referred to as a k-mer (currently, k=6). The hybridized probes are eluted, and the cycle is repeated until all the regions of interest have been read a predefined number of times. As a result of the sequencing process, we obtain a set of error-prone k-mers from the known set of target sequences along with their multiplicities (the number of times a particular hybridization event was observed).
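To make the shape of this output concrete, the following minimal Python sketch collapses a list of observed k-mers into a table of multiplicities and enumerates the k-mers of a target sequence. The function names and the toy sequences are hypothetical and are not part of the Hyb & Seq software; only k=6 is taken from the description above.

```python
from collections import Counter

K = 6  # k-mer length stated in the abstract


def kmer_multiplicities(observed_kmers, k=K):
    """Collapse observed k-mers into {k-mer: multiplicity}.

    Hypothetical helper: entries whose length is not k are skipped.
    """
    return Counter(kmer for kmer in observed_kmers if len(kmer) == k)


def reference_kmers(target, k=K):
    """Return the set of all k-mers contained in a target sequence."""
    return {target[i:i + k] for i in range(len(target) - k + 1)}


# Toy input: three hybridization events observed on one molecule.
events = ["ACGTAC", "ACGTAC", "CGTACG"]
table = kmer_multiplicities(events)
```

Here `table` maps each distinct k-mer to the number of times it was observed, which is the kind of input a downstream classification step would consume.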
We propose the Hyb & Seq SubAssembler pipeline, designed to cope with the challenges of Hyb & Seq data. The pipeline consists of two stages. The first stage is the Demultiplexer algorithm, which assigns each barcode to a reference gene while ignoring possible variations. The second stage is the Variant Caller algorithm, which detects putative variations and performs a secondary classification that takes those variations into account.
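The abstract does not specify how the Demultiplexer works. As an illustration of the first stage only, the sketch below assumes one simple scoring rule: each reference gene is scored by the total multiplicity of observed k-mers that occur in it, and the highest-scoring gene wins. The scoring rule, the function names, and the toy reference sequences are all assumptions, not the authors' algorithm.

```python
from collections import Counter


def kmer_set(seq, k=6):
    """Return the set of all k-mers contained in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def demultiplex(barcode_kmers, references, k=6):
    """Assign one barcode's k-mer table to the best-matching reference.

    Assumed scoring rule (not from the abstract): sum the multiplicities
    of observed k-mers that occur in each reference, then take the maximum.
    Returns (best_reference_name, scores_by_reference).
    """
    scores = {}
    for name, seq in references.items():
        ref = kmer_set(seq, k)
        scores[name] = sum(m for kmer, m in barcode_kmers.items() if kmer in ref)
    best = max(scores, key=scores.get)
    return best, scores


# Toy references and one barcode's observed k-mers (hypothetical data).
references = {"geneA": "ACGTACGTAA", "geneB": "TTTTGGGGCC"}
barcode = Counter({"ACGTAC": 3, "GTACGT": 2, "TTTGGG": 1})
best, scores = demultiplex(barcode, references)
```

Because the score relies on exact k-mer matches, a k-mer that overlaps a true variant would simply not contribute, which is one way a first stage could ignore variations and leave them for a later variant-calling stage.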
Our experiments on simulated data showed that the Demultiplexer achieves 100% classification accuracy in all considered cases. The first stage is therefore highly reliable and can be used with confidence to detect a known gene. The Variant Caller achieves more than 90% classification accuracy, i.e., it reconstructs the correct variant for more than 90% of barcodes.
Authors
- Alexander Shlemov (Saint Petersburg State University)
- Andrey Bzikadze (Saint Petersburg State University)
- Anton Korobeynikov (Saint Petersburg State University)
Topic Areas
Sequencing strategies and technology advancements using the various NGS platforms; De novo sequencing, re-sequencing, Human seq., RNA seq., metagenomics, etc.; Bringing sequence to the clinic (i.e., diagnostics, cancer, inherited disorders)
Session
PS-2 » Poster Session B (20:00 - Tuesday, 16th May, Mezzanine & New Mexico Room)