Mining Frequented Regions for Pan-Genome Analysis
  
	
  
    	  		  		    		Abstract
    		
			    
			    
		     
		    
			    
				    
We consider the problem of identifying regions within a pan-genome de Bruijn graph that are traversed by many sequence paths. We define such regions, together with the subpaths that traverse them, as frequented regions (FRs). In this work we formalize the FR problem, discuss its computational complexity, and describe an efficient algorithm for mining FRs. We evaluate our algorithm on a variety of data sets and compare it to existing tools. We illustrate the biological relevance of FRs by using our algorithm to identify introgressions in yeast that aid in alcohol tolerance. We also explore FR-based classification of strains within the yeast population and use feature selection to find discriminative FRs that can be used for visualization and for other traditional analyses, such as the construction of phylogenies. Overall, we show that mining FRs is an effective approach to pan-genome analysis and that our algorithm performs and scales better than existing tools.
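
The idea of a region "traversed by many sequence paths" can be illustrated with a small sketch. The abstract does not state the formal FR definition or the mining algorithm, so this is only a simplified illustration under assumptions: nodes are taken to be k-mers of an uncompressed de Bruijn graph, a path is said to support a candidate region if it visits at least a given fraction of the region's nodes, and the names `build_paths`, `region_support`, and `min_fraction` are hypothetical.

```python
from typing import Dict, List, Set


def build_paths(sequences: Dict[str, str], k: int) -> Dict[str, List[str]]:
    """Map each sequence name to its path of k-mer nodes.

    Assumption: each distinct k-mer is one node of the de Bruijn graph.
    """
    return {
        name: [seq[i:i + k] for i in range(len(seq) - k + 1)]
        for name, seq in sequences.items()
    }


def region_support(
    paths: Dict[str, List[str]], region: Set[str], min_fraction: float
) -> List[str]:
    """Return the sorted names of paths that visit at least
    min_fraction of the nodes in the candidate region.

    This is a toy support criterion, not the authors' FR definition.
    """
    needed = min_fraction * len(region)
    return sorted(
        name
        for name, path in paths.items()
        if len(region & set(path)) >= needed
    )


# Toy example with three short sequences and k = 3.
sequences = {"s1": "ACGTAC", "s2": "ACGTTC", "s3": "GGGTAC"}
paths = build_paths(sequences, k=3)
supporting = region_support(paths, {"ACG", "CGT"}, min_fraction=1.0)
print(supporting)
```

A region would count as "frequented" in this toy setting when the number of supporting paths reaches some chosen threshold; how candidate regions are found efficiently is the part the work itself addresses and is not sketched here.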
			    
		     
		        
  
  Authors
  
- Alan Cleary (Montana State University)
- Joann Mudge (National Center for Genome Resources)
- Thiru Ramaraj (National Center for Genome Resources)
- Brendan Mumey (Montana State University)

Topic Areas
		
- Comparative genomics, re-sequencing, SNPs, structural variation
- Large scale data management, cloud computing
	
  
  Session
	
OS-5 » Metagenomics, Informatics, Assembly & Analysis (14:00 - Wednesday, 17th May, La Fonda Ballroom)
  
  
	
  
			