Predicting phenotype from genotype charts the future of biological research. Success in such prediction depends on the ability to routinely sequence and analyze the genomes of thousands of individuals from select model organisms. In this quest, butterflies and moths, with relatively small genomes but complex life cycles and diverse wing patterns, are emerging as powerful models. However, genomic studies of butterflies are hindered by technical difficulties in both experiments and computations. The two major obstacles to obtaining de novo butterfly genomes are the preparation of high-quality mate-pair libraries and high heterozygosity, which can reach 5% and complicates genome assembly. I initiated butterfly sequencing projects in the lab and successfully overcame these difficulties. While some published moth genomes required 100 Illumina lanes to obtain a decent-quality assembly, we can get a better de novo genome for under $4000 within two months.
Our genomic pipeline starts from a field-collected butterfly preserved in RNAlater solution. From the same specimen, we prepare RNA-seq libraries, 250 bp and 500 bp paired-end libraries, and 2 kbp, 5 kbp, and 10 kbp mate-pair libraries using a modified Cre-loxP-based protocol. Meticulous attention to detail is needed at every step of mate-pair library preparation to ensure the highest DNA yield and to avoid fragmenting the DNA. The libraries are pooled at a ratio of 20:40:20:10:5:5 and sequenced for 150 bp from both ends. Genome assembly suffers from high heterozygosity because it is difficult to distinguish duplications in the genome from haplotypes that sometimes differ by 5%. However, an assembler can use the fact that the maternal and paternal copies are each expected to show half the overall genomic coverage by the reads. Platanus performs best among assemblers, and we further improve the results by detecting and removing redundant equivalent segments from homologous chromosomes. Another unique feature of our lab, that the experiments and computations are done by the same person, ensures seamless feedback between the two and contributes to the success of our efficient and cost-effective genomic pipeline. This pipeline, supplemented by computational analysis to predict structural and functional features from DNA and protein sequences, allowed us to obtain the first representative genomes for all butterfly families except one, which had been sequenced previously.
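The half-coverage signal mentioned above can be illustrated with a minimal sketch. It is not the method used by Platanus or by our pipeline. The function name, the tolerance, and the use of the median depth as an estimate of the full (both-haplotype) coverage are assumptions made for illustration; the median is only a reasonable estimate when most contigs are collapsed across haplotypes.

```python
from statistics import median

def classify_contigs(depths, tolerance=0.25):
    """Label contigs by read depth relative to the full (collapsed) depth.

    depths: dict mapping contig name -> mean read depth.
    tolerance: hypothetical relative tolerance around each expected ratio.
    """
    # Assumption: the median depth approximates the coverage of regions
    # where both haplotypes were assembled into a single sequence.
    full = median(depths.values())
    labels = {}
    for name, depth in depths.items():
        ratio = depth / full
        if abs(ratio - 0.5) <= tolerance * 0.5:
            # About half coverage: likely one of two separately assembled haplotypes.
            labels[name] = "haplotype"
        elif abs(ratio - 1.0) <= tolerance:
            # About full coverage: both haplotypes collapsed into one contig.
            labels[name] = "collapsed"
        else:
            # Anything else, e.g. repeats or duplications with excess coverage.
            labels[name] = "other"
    return labels

# Made-up depths, for illustration only.
depths = {"c1": 100.0, "c2": 98.0, "c3": 52.0, "c4": 49.0, "c5": 103.0, "c6": 210.0}
labels = classify_contigs(depths)
```

Contigs labeled "haplotype" would be candidates for the redundancy check between homologous chromosomes, whereas a true duplication assembled into one sequence would tend to show excess rather than halved coverage.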
Scaling up these efforts, we obtained genomes of over 1500 specimens. We optimized the procedure to prepare paired-end libraries in a high-throughput manner while minimizing costs, and we adapted it to the sequencing of century-old dry specimens from museum collections. It costs about $200 to process and sequence an average butterfly specimen at 10X coverage. We further established computational pipelines to assemble genomes, protein-coding sequences, and mitogenomes. By mapping the reads onto the genome of a close relative and then calling SNPs, we obtain a relatively complete (>80%) assembly. Alternatively, guided by the protein sequences from a more distant reference, we apply a combination of reference-based and de novo assembly to obtain the coding genes. Taken together, these developments allow us to harvest rich datasets from butterfly genomes and use them to tackle interesting biological questions about population genetics, phylogeny, and the genetic basis of phenotypic traits.
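One simple way to quantify the completeness of a reference-guided assembly is the fraction of reference positions that received a called base. The sketch below assumes that uncalled positions are written as N in the consensus sequence; the text does not specify how completeness is measured, so this definition and the function name are illustrative assumptions.

```python
def completeness(consensus: str) -> float:
    """Return the fraction of positions with a called base (A, C, G or T)."""
    if not consensus:
        return 0.0
    # Count unambiguous bases; N and any other symbol count as uncalled.
    called = sum(1 for base in consensus.upper() if base in "ACGT")
    return called / len(consensus)

# Toy consensus with 16 called bases out of 20 positions.
example = "ACGTNNACGTACGTNNACGT"
fraction = completeness(example)
```

Under this definition, the toy sequence sits exactly at the 80% mark, since 16 of its 20 positions are called.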
Sequencing strategies and technology advancements using the various NGS platforms; de novo sequencing, re-sequencing, human sequencing, RNA-seq, metagenomics, etc.; de novo assemblers for short reads, hybrid assemblers