Mining Frequented Regions for Pan-Genome Analysis
Abstract
We consider the problem of identifying regions within a pan-genome de Bruijn graph that are traversed by many sequence paths. We define such regions and the subpaths that traverse them as frequented regions (FRs). In this work we formalize the FR problem, discuss its computational complexity, and describe an efficient algorithm for mining FRs. We evaluate our algorithm on a variety of data sets and compare it to existing tools. We illustrate the biological relevance of FRs by using our algorithm to identify introgressions in yeast that aid in alcohol tolerance. We also explore FR-based classification of strains within the yeast population and use feature selection to find discriminative FRs that can be used for visualization and other traditional analyses, such as the construction of phylogenies. Overall, mining FRs is shown to be an effective approach to pan-genome analysis, and our algorithm is shown to outperform and scale better than existing tools.
Authors
- Alan Cleary (Montana State University)
- Joann Mudge (National Center for Genome Resources)
- Thiru Ramaraj (National Center for Genome Resources)
- Brendan Mumey (Montana State University)
Topic Areas
Comparative genomics, re-sequencing, SNPs, structural variation, Large scale data management, cloud computing
Session
OS-5 » Metagenomics, Informatics, Assembly & Analysis (14:00 - Wednesday, 17th May, La Fonda Ballroom)