Discovery of small-scale variation in genome composition and structure, and characterizing this variation in the context of human disease, is an area of intense research interest. Evidence from whole-exome sequencing of genetic disease samples suggests a significant contribution from rare de novo or somatic variants to diseases such as autism and cancer. Variant discovery is dominated by strategies that align short (100-200 nucleotide) Illumina reads to a reference genome sequence. This approach has several related weaknesses. First, many reads align to the reference genome poorly, or not at all, due to repetitive DNA, novel sequence, or misassemblies in the reference. These reads contain interesting and potentially critical data that reference-based methods effectively discard. Second, mapping-based methods are insensitive to certain classes of variants, such as 5-200 nucleotide indels and many structural variants. Finally, while the human genome is a suitable reference for many biomedical applications of variant discovery, there are many research contexts in agriculture, veterinary medicine, and related fields where reference genomes are either unavailable or of insufficient quality for reference-based variant discovery.
We are developing a novel alignment-free k-mer based method, Kevlar, for discovery of de novo and somatic variants. In simulations, we have shown that novel mutations generally produce many k-mers not present in the reference genome. Accordingly, our method is based on analysis of k-mer abundances directly from raw reads, which we can achieve in very low memory using a novel k-mer banding strategy. k-mers unique to a diseased individual, or (more generally) of differential abundance in case samples versus controls, point directly to loci of probable interest. Reads containing novel k-mers are loaded into an assembly graph, which can be partitioned into disconnected components representing distinct variants, and subsequently assembled, filtered, refined, or directly analyzed. Initial testing on trial data sets suggests that there may be between 10 and 100 Mbp of novel, non-erroneous sequence in samples from the 1000 Genomes YRI trio, and preliminary results have confirmed many high-confidence variants, including a de novo Alu insertion that has been validated experimentally.
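The idea of novel k-mers and read partitioning can be illustrated with a minimal Python sketch. It is not Kevlar's implementation: the function names, the `min_case` and `max_control` thresholds, and the use of exact in-memory counters are assumptions made for illustration, and reverse complements are ignored for brevity. Reads are grouped with a union-find structure whenever they share a novel k-mer, which stands in for partitioning an assembly graph into disconnected components.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every k-length substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_kmers(reads, k):
    """Count k-mer abundances directly from raw reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def novel_kmers(case_reads, control_read_sets, k, min_case=2, max_control=0):
    """Return k-mers abundant in the case sample and absent from every control.

    min_case and max_control are hypothetical thresholds.
    """
    case_counts = count_kmers(case_reads, k)
    control_counts = [count_kmers(reads, k) for reads in control_read_sets]
    return {
        kmer
        for kmer, n in case_counts.items()
        if n >= min_case and all(c[kmer] <= max_control for c in control_counts)
    }


def partition_reads(reads, novel, k):
    """Group reads that share at least one novel k-mer (union-find)."""
    parent = list(range(len(reads)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_read_with = {}  # novel k-mer -> index of first read containing it
    kept = []
    for idx, read in enumerate(reads):
        hits = [km for km in kmers(read, k) if km in novel]
        if not hits:
            continue  # read carries no novel k-mer, so it is not of interest
        kept.append(idx)
        for km in hits:
            if km in first_read_with:
                parent[find(idx)] = find(first_read_with[km])
            else:
                first_read_with[km] = idx

    groups = defaultdict(list)
    for idx in kept:
        groups[find(idx)].append(reads[idx])
    return list(groups.values())
```

Each returned group holds the reads that support one putative variant and could then be assembled or filtered independently.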
The Kevlar method is being developed as an open source research software project, and is freely available at https://github.com/dib-lab/kevlar. The initial implementation is optimized for family studies (autism trios and quads), but Kevlar’s k-mer banding strategy supports scaling to very large cohort studies (such as cancer case/control studies) even when available memory is limited.
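The abstract does not spell out how k-mer banding works. One plausible reading, sketched below purely as an assumption, is that each k-mer is assigned to one of several bands by a hash, and each pass over the reads counts only the k-mers in a single band, so peak memory shrinks roughly in proportion to the number of bands. The names `band_of`, `count_band`, and `count_in_bands`, and the choice of CRC32 as the hash, are hypothetical.

```python
import zlib
from collections import Counter


def band_of(kmer, num_bands):
    """Assign a k-mer to a band with a deterministic hash (CRC32 is an assumption)."""
    return zlib.crc32(kmer.encode()) % num_bands


def count_band(reads, k, num_bands, band):
    """Count only the k-mers that fall in the given band."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if band_of(kmer, num_bands) == band:
                counts[kmer] += 1
    return counts


def count_in_bands(reads, k, num_bands):
    """Yield one counter per band; only one band is held in memory at a time.

    reads must be re-iterable (for example a list, or a file reopened per pass).
    """
    for band in range(num_bands):
        yield count_band(reads, k, num_bands, band)
```

Under this scheme the cost of lower memory is one additional pass over the reads per band, and the bands are disjoint, so per-band results can be processed or merged independently.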
Comparative genomics, re-sequencing, SNPs, structural variation, Human, non-human, and infectious disease applications