Predicting phenotype from genotype represents the epitome of biological questions, and comparative genomics holds the promise of making it possible. However, the high heterozygosity (1-4%) of many eukaryotes is an obstacle to assembling their genomes. We devised a cost-effective strategy that overcomes the high heterozygosity problem to sequence and assemble a complete de novo genome from a single butterfly specimen collected in the wild. Prior to our work, reference genomes were available for a single butterfly family. Within a couple of years, we obtained complete genomes for representatives of all butterfly families and 10x coverage genomic reads for over 1600 specimens. Many of these are pinned museum specimens stored dry at room temperature for up to 150 years, which presents technical challenges, and some of them represent extinct species.
Capitalizing on this technical advance, we mine the data to further our understanding of butterfly biology. From phylogeny construction based on genomic sequences to hypotheses about genotypic determinants of phenotypic traits, we answer questions about speciation, population history and migration, introgression, behavior, food digestion, and morphological traits.
We observed that unique gene expansions and putative horizontally transferred genes frequently give rise to unusual phenotypic traits. For instance, expansion of prenyl transferases in Swallowtails is linked to terpene production in the osmeterium, a fleshy organ found only in caterpillars of this family that functions to repel predators. Expansion of chitinase-like proteins that may function as cellulases in Skippers correlates with their caterpillars feeding on cellulose-rich but nutrient-poor grasses.
Studying butterfly populations, we find that introgression (i.e., gene exchange through hybridization) is rampant between closely related butterfly species. Introgressive hybridization is a fast way to pass advantageous traits, such as mimetic wing patterns, between species. Population genomics of migratory butterfly species reveals patterns of dispersal by migration between Caribbean islands (island hopping) and bottlenecks that significantly reduce genetic diversity in this process. The analysis of genetic diversity across all known populations of a species informs us about its geographic origins and reveals population structure and diversification towards the periphery of the range.
Large-scale genomic sequencing of butterflies allows us to establish criteria for genetic determinants of speciation. Comparing many sister species and populations of the same species across a geographic boundary (suture zone), we find that two criteria -- fixation index and fraction of introgressed regions -- can discriminate between species and populations, giving the first numerical estimates of species boundaries in genome-scale data. Looking for genes that are conserved within but differ between closely related species, we find that the following functional systems may be driving speciation in butterflies: circadian clock regulators, zinc-finger DNA-binding proteins, and immunity cascades. Processes guided by these systems may be responsible for adaptation to different latitudes, prezygotic isolation, and autoimmune problems in hybrids between species.
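The two criteria named above can be sketched in a few lines. The example below is an assumption-laden illustration rather than the method used in this work: it uses Hudson's estimator of the fixation index (computed as a ratio of averages across sites), which is one of several common estimators, and it treats the fraction of introgressed regions as the share of genomic windows flagged by some upstream test. The function names, allele frequencies, sample sizes, and window flags are all hypothetical, and no species-versus-population cutoff is implied.

```python
def hudson_fst(freqs_pop1, freqs_pop2, n1, n2):
    """Hudson's Fst estimator, computed as a ratio of averages across sites.

    freqs_pop1, freqs_pop2: per-site frequencies of the same allele in each population.
    n1, n2: number of sampled chromosomes in each population (must be > 1).
    """
    numerator = 0.0
    denominator = 0.0
    for p1, p2 in zip(freqs_pop1, freqs_pop2):
        numerator += (
            (p1 - p2) ** 2
            - p1 * (1 - p1) / (n1 - 1)
            - p2 * (1 - p2) / (n2 - 1)
        )
        denominator += p1 * (1 - p2) + p2 * (1 - p1)
    return numerator / denominator if denominator else float("nan")


def introgressed_fraction(window_flags):
    """Share of genomic windows flagged as introgressed by an upstream test."""
    flags = list(window_flags)
    return sum(1 for flag in flags if flag) / len(flags) if flags else float("nan")


# Hypothetical inputs: allele frequencies at four sites in two taxa,
# 20 sampled chromosomes each, and ten windows of which two are flagged.
taxon_a = [1.0, 0.9, 0.1, 0.0]
taxon_b = [0.0, 0.2, 0.8, 1.0]
fst = hudson_fst(taxon_a, taxon_b, n1=20, n2=20)
fraction = introgressed_fraction([False] * 8 + [True] * 2)
print(f"Fst = {fst:.3f}, introgressed fraction = {fraction:.2f}")
```

A pair of taxa would then be placed in a two-dimensional space defined by these two numbers. Where the boundary between species and populations falls in that space is an empirical result that this sketch does not attempt to reproduce.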
Finally, genomics offers ultimate solutions to taxonomic problems, allowing us to confidently associate old primary type specimens with recently collected specimens and thus determine which species in cryptic (i.e., best distinguished by DNA sequences) species complexes already have names and which are new.
De novo sequencing, re-sequencing, Human seq., RNA seq., metagenomics, etc.; Genome annotation and pathway identification tools and pipelines; Comparative genomics, re-sequencing, SNPs, structural variation