Next-generation DNA sequencing (NGS) has revolutionized the way we do biology across its fields, be it phylogeny, population genetics, or functional genomics. Today, genome-scale data are cheaper and easier to obtain than ever before, deluging researchers and fostering many applications. Over the decades, millions of specimens have accumulated in zoological museums across the globe. Some of these specimens are primary types, the name-bearing representatives of their species; others represent very rare or even extinct species. Such collections embody a hidden treasure of genomic information that will transform biodiversity research if unlocked. However, the specimens were not preserved with genome sequencing in mind, and those collected a hundred years ago contain only small amounts of badly degraded DNA. This problem is exacerbated by cross-contamination from fresher samples in large-scale experiments, so the conventional NGS data processing pipeline leads to erroneous phylogenetic placement of old specimens and requires manual intervention. Focusing on butterflies, we improve poor NGS datasets by overcoming two major problems: uneven coverage and cross-contamination.
Investigating century-old type specimens of cryptic species from the butterfly genera Astraptes and Urbanus, we obtained their mitochondrial DNA COI barcodes using PCR followed by Sanger sequencing. When NGS genomic reads from these specimens were mapped to a reference genome, the resulting phylogenetic placement differed from that in barcode trees. Manual inspection of the NGS genomic mapping revealed a high fraction of heavily clipped reads, suggesting random matches of reads from contaminant DNA. Furthermore, due to non-random DNA degradation, completeness of mapping in old specimens is poor compared to fresh specimens. Trees built from such gapped alignments group old samples together because sequence conservation is higher in the mapped regions than in the missing regions. To address both problems, we removed possible contaminant reads, identified by the low fraction of positions in the read that mapped, and selected gene-coding regions that are covered in all specimens. The phylogenetic placement of old specimens then becomes consistent between trees constructed from COI barcodes and from nuclear genomes, solving taxonomic problems and opening the way to the description of new species.
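The two filtering steps described above can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes reads are available as SAM-style CIGAR strings and that per-specimen coverage is a simple list of depths, and the names `mapped_fraction`, `keep_read`, `regions_covered_in_all`, and the thresholds `min_fraction` and `min_depth` are hypothetical choices made for illustration.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def mapped_fraction(cigar: str) -> float:
    """Fraction of a read's bases that are aligned to the reference.

    M, = and X count as aligned read bases; I (insertion) and S/H
    (soft/hard clipping) count as read bases that are not aligned.
    D, N and P consume no read bases and are ignored.
    """
    aligned = 0
    total = 0
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "M=X":
            aligned += n
            total += n
        elif op in "ISH":
            total += n
    return aligned / total if total else 0.0


def keep_read(cigar: str, min_fraction: float = 0.9) -> bool:
    """Keep a read only if enough of it maps (threshold is an assumption)."""
    return mapped_fraction(cigar) >= min_fraction


def regions_covered_in_all(coverage, regions, min_depth=1):
    """Return names of regions covered at every position in every specimen.

    coverage: dict of specimen name -> list of per-position depths
    regions:  dict of region name -> (start, end), 0-based, end-exclusive
    """
    shared = []
    for name, (start, end) in regions.items():
        ok = True
        for depths in coverage.values():
            window = depths[start:end]
            # A region running past the end of the coverage track is not covered.
            if len(window) != end - start or any(d < min_depth for d in window):
                ok = False
                break
        if ok:
            shared.append(name)
    return shared
```

For example, a read with CIGAR `30S40M30S` has only 40% of its positions mapped and would be discarded at the default threshold, while `95M5S` would be kept.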
Mitogenomes of fresh specimens can be constructed using de novo sequence assemblers from NGS reads baited by available reference mitogenomes. In old specimens, DNA degradation is severe and coverage is uneven, resulting in incomplete and fragmented de novo assembly. For instance, the coverage of the Burara striata mitogenome ranges from 200 to 36000, yielding 20 disconnected fragments. We developed a battery of in-house scripts to connect the fragments and fill in the gaps. As a result, we can obtain complete, high-quality mitogenomes from degraded specimens. Moreover, because mitogenomes are highly covered (~100X), they can be used to evaluate data quality. From SNPs in COI barcodes, we are able to identify possible contaminating species. Reads from contaminants can be filtered out by including their genomes in the mapping procedure. High coverage of mitogenomes facilitates the detection and analysis of errors. By comparing k-mers, polymorphisms and their quality scores, specific sequence damage and sequencing errors are revealed to aid in error correction of nuclear genomes.
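As a rough illustration of how k-mer comparison can expose errors in a highly covered mitogenome, the sketch below counts k-mers across reads and flags those seen only rarely. It is a generic technique shown under stated assumptions, not the in-house scripts mentioned above: the function names `count_kmers` and `rare_kmers` and the `min_count` threshold are hypothetical, and real data would also need quality scores and reverse complements taken into account.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer in the reads, skipping k-mers that contain N."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


def rare_kmers(counts, min_count):
    """K-mers seen fewer than min_count times.

    At high coverage, true sequence is seen many times, so rare k-mers
    are candidates for sequencing errors, damage, or contaminant reads.
    """
    return {kmer for kmer, n in counts.items() if n < min_count}
```

With five copies of the read `ACGTAC` and one copy of `ACGAAC`, calling `rare_kmers(count_kmers(reads, 4), 2)` flags the three 4-mers that span the single differing base.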
Sequencing strategies and technology advancements using the various NGS platforms, Comparative genomics, re-sequencing, SNPs, structural variation