Importance of k-mer size selection to de novo assembly of Neisseria gonorrhoeae genomes
Abstract
A k-mer is a substring of length k, and counting the occurrences of all such substrings is an essential component of several bioinformatic methods, including genome and transcriptome assembly, repeat detection, read depth estimation, metagenomic sequencing, mutation identification, and error correction of sequence reads. De novo assembly of whole genome shotgun next-generation sequencing data benefits from high-quality input with high coverage. The structure of an assembly graph is highly dependent on the k-mer size used for assembly, and the ideal k-mer size depends on the read length, read depth, and sequence complexity. Running de novo assemblers without first using k-mer size selection tools results in poorly assembled genomes. This is because Neisseria gonorrhoeae genomes contain many repeats, which affect the choice of k-mer size and, in turn, contig size. To demonstrate how k-mer size affects assemblies, we compared two k-mer size selection tools, KAnalyze and KmerGenie. Each tool was developed to run on multiple platforms and process large datasets. We then tested these k-mer counters on a small dataset of raw Illumina sequence reads from 12 Neisseria gonorrhoeae isolates. SPAdes analyses conducted with the default k-mer values produced a large number of contigs. However, when we first analyzed this dataset with the k-mer tools to determine the optimal k-mer size, the SPAdes analyses generated fewer contigs, and the original contig size was reduced by half. Furthermore, our preliminary results indicated that KmerGenie outperformed KAnalyze. To further assess the utility of k-mer counting tools for the de novo assembly of gonococcal genomes, we applied these and additional k-mer tools (MerCat, JELLYFISH, and KAT) to a larger dataset of 50 sequences. The results indicated that the number of contigs was indeed reduced, and these tools greatly improved our ability to assemble gonococcal genomic data.
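As a rough illustration of the k-mer counting step described in the abstract, the following minimal Python sketch counts every overlapping substring of length k in a sequence. It is not the method used by KAnalyze, KmerGenie, or any other tool named above. The function names `count_kmers` and `distinct_kmers_by_k` and the toy sequence are hypothetical, and the sketch ignores reverse complements, quality filtering, and memory efficiency, which dedicated k-mer counters handle.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every overlapping substring of length k in `sequence`."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    seq = sequence.upper()
    # A sequence of length n contains n - k + 1 overlapping k-mers;
    # if the sequence is shorter than k, the range is empty.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def distinct_kmers_by_k(sequence: str, k_values) -> dict:
    """Number of distinct k-mers observed for each candidate k (hypothetical helper)."""
    return {k: len(count_kmers(sequence, k)) for k in k_values}


if __name__ == "__main__":
    toy_read = "ATGATGATGC"  # made-up sequence containing a short tandem repeat
    for kmer, n in sorted(count_kmers(toy_read, 3).items()):
        print(kmer, n)
    print(distinct_kmers_by_k(toy_read, [3, 5, 7]))
```

In the toy sequence the repeated unit ATG is counted several times at k = 3, which is the kind of multiplicity signal that repeat detection and k-mer size selection build on.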
Authors
- Eshaw Vidyaprakash (Centers for Disease Control and Prevention)
- David Trees (Centers for Disease Control and Prevention)
Topic Areas
De novo sequencing, re-sequencing, Human seq., RNA seq., metagenomics, etc.; Whole genome assemblers and integration of next generation data; Human Genomics and genome improvement
Session
PS-1 » Poster Session A (19:00 - Tuesday, 16th May, Mezzanine & New Mexico Room)