Rapid and Portable Genome Classification System
Abstract
Identifying viruses quickly is a constant concern in today’s world. From the West Africa Ebola outbreak in 2014 to the recent Zika spread, rapid identification of viral RNA and DNA is necessary to provide insight for prevention. The current inaccessibility of whole genome sequencing analysis systems that can detect emerging outbreaks at the earliest stages of spreading infection can lead to otherwise avoidable increases in mortality. This includes diseases that result from environmental contamination, exposure to zoonotic pathogens, or purposeful acts of bioterrorism. The availability of advanced detection methods will lead to enhanced biosecurity and help government safety professionals provide more effective strategies to ensure public health and safety.
Next-generation sequencing (NGS) technologies generate large amounts of data. In a clinical or public-health setting, rapid turnaround in identification and analysis is critical, so speed must increase without loss of accuracy. Another challenge is the storage this expansive amount of genomic data requires. Emerging technical solutions attempt to address how to store, access, and process this information.
Noblis’ approach to these challenges is to use Bloom filters to compactly represent large sets of information and test those sets for membership (within some computable error bound). For example, a genome that takes 30 MB as a FASTA file can be stored as a Bloom filter in about 5 MB while still retaining usable set membership testing (this can be achieved with a false positive rate higher or lower than 5% and a 0% false negative rate). If we envision a reference genome as a set of k-mers, we can then test the k-mers from a sample for membership in a reference Bloom filter. Even with a high false positive rate, the raw statistics from the cumulative set membership tests will effectively classify the sample and provide a similarity metric to the reference genome. Using a collection of Bloom filter classifiers, we train a machine learning classification model on very large libraries of reference genomes and then rapidly classify a given sample against that model, with the model providing a ranked order of likely organism matches. Using data sets available from NCBI SRA on the recent Zika virus outbreak, we conduct a small-scale binary classification trial of Zika virus read sets against early reference models. Linear classification of read sets against a library of Bloom filters built from about 100 complete genomes of different viral strains completes within a clinically acceptable time scale because the algorithm processes the data at very high speed. We expect the highly ranked results to be as good as those from more computationally intensive approaches, with reasonable disk utilization.
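The k-mer membership test described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class and function names (BloomFilter, kmers, build_reference_filter, similarity, rank_references) are hypothetical, the SHA-256 double-hashing scheme is an assumption, and the trained machine learning model from the abstract is replaced here by a plain ranking on the fraction of sample k-mers found in each reference filter.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size Bloom filter (hypothetical sketch, not the authors' code)."""

    def __init__(self, expected_items: int, fp_rate: float):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        n = max(1, expected_items)
        self.num_bits = max(8, math.ceil(-n * math.log(fp_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Assumption: derive all bit positions from one SHA-256 digest
        # via double hashing (h1 + i * h2).
        digest = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # May return a false positive, never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def kmers(sequence: str, k: int):
    """Yield every overlapping k-mer of the sequence."""
    sequence = sequence.upper()
    return (sequence[i:i + k] for i in range(len(sequence) - k + 1))


def build_reference_filter(genome: str, k: int, fp_rate: float = 0.05) -> BloomFilter:
    """Store the k-mer set of a reference genome in a Bloom filter."""
    bf = BloomFilter(len(genome) - k + 1, fp_rate)
    for kmer in kmers(genome, k):
        bf.add(kmer)
    return bf


def similarity(read: str, reference: BloomFilter, k: int) -> float:
    """Fraction of the read's k-mers reported as present in the reference."""
    total = hits = 0
    for kmer in kmers(read, k):
        total += 1
        hits += kmer in reference
    return hits / total if total else 0.0


def rank_references(read: str, references: dict, k: int) -> list:
    """Rank reference names by similarity score, best match first."""
    scores = {name: similarity(read, bf, k) for name, bf in references.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Because a Bloom filter never produces false negatives, a read drawn from a reference genome scores 1.0 against that reference, while unrelated references score near the configured false positive rate. That gap is why a high false positive rate can still separate candidate organisms.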
Authors
- Masooda Omari (Noblis)
- Tyler Barrus (Noblis)
- Mark Sanders (Noblis)
- Shane Mitchell (Noblis)
- Sterling Thomas (Noblis)
Topic Areas
Sequencing applications for metagenomics, transcriptomics, diagnostics, and biosurveillance; Large scale data management, cloud computing; Analysis for metagenomics, antimicrobial resistance, and forensics
Session
PS-2 » Poster Session B (20:00 - Tuesday, 16th May, Mezzanine & New Mexico Room)