The analysis and interpretation of forensic mixtures using conventional STR (short tandem repeat) data remain a significant problem, recent developments in probabilistic genotyping notwithstanding. The clonal nature of next-generation sequencing (NGS) allows quantitative analysis of the composition of mixtures by counting the sequence reads corresponding to the individual contributors. The use of haploid lineage markers such as mtDNA or Y-chromosome polymorphisms in NGS analysis simplifies the interpretation of mixtures significantly, because each individual contributes only one sequence rather than the two allelic sequences of a nuclear marker. This makes deconvolution of mixtures more straightforward and requires fewer assumptions.
Many of the properties of mtDNA make it valuable as a genetic marker in the analysis of forensic specimens with limited and/or degraded DNA, as well as mixtures. These properties include high copy number per cell, haploid nature, and matrilineal inheritance. However, most current forensic mtDNA analyses rely solely on polymorphisms in the HVI and HVII regions. Sequencing the whole mtDNA genome provides a much higher power of discrimination. To this end, we have developed a probe hybrid capture/Illumina NGS system for the whole mtDNA genome and applied this system to a variety of contrived mixtures as well as some forensic samples.
While complete mtDNA sequences can easily be reconstructed when samples contain DNA exclusively from a single individual, disentangling the constituent haplotypes from a mixture is complicated by limitations in sequence data and by haplotype similarity. Typical sequence reads from current high-throughput sequencers are limited to a few hundred base pairs, while the human mtDNA genome is over 16 kilobases in length. Moreover, the DNA recovered from many forensic samples consists of short fragments. Thus, individual reads represent only a small fragment of an individual contributor's mtDNA genome, and contig assembly approaches cannot span regions where contributors share identical sequence. Approaches that assign variants to contributors based on their frequency within the mixture can be effective in two-contributor mixtures, but only when the mixture proportions are sufficiently distinct.
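To illustrate why frequency-based assignment depends on distinct mixture proportions, the following is a minimal sketch, not a method described in this abstract. The function name, the `min_gap` threshold, the positions, and the example fractions are all hypothetical; it assumes a two-contributor mixture whose major proportion is already known.

```python
def assign_variants_by_frequency(variant_fractions, major_prop, min_gap=0.1):
    """Label each variant by the contributor whose proportion it best matches.

    variant_fractions: dict mapping position -> observed variant read fraction.
    major_prop: assumed proportion of the major contributor (hypothetical input).
    min_gap: hypothetical minimum difference between the two proportions
             below which frequency alone is treated as uninformative.
    """
    minor_prop = 1.0 - major_prop
    # When the two proportions are too close, frequencies cannot separate them.
    if major_prop - minor_prop < min_gap:
        return {pos: "ambiguous" for pos in variant_fractions}

    # A variant carried by both contributors is expected at a fraction near 1.
    expected = {"major": major_prop, "minor": minor_prop, "shared": 1.0}
    labels = {}
    for pos, frac in variant_fractions.items():
        labels[pos] = min(expected, key=lambda name: abs(expected[name] - frac))
    return labels


# Hypothetical observed fractions at three positions.
observed = {73: 0.81, 152: 0.19, 263: 0.99}
distinct = assign_variants_by_frequency(observed, major_prop=0.8)
balanced = assign_variants_by_frequency(observed, major_prop=0.52)
```

With an 80:20 mixture the three variants separate cleanly, whereas at 52:48 every variant is reported as ambiguous, which is the limitation noted above.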
Phylogenetic-based approaches for interpreting mtDNA sequence data from mixed samples have the capacity to reconstruct the constituent sequences. We have implemented such an approach in the software package mixemt to address two fundamental questions in mixture interpretation. First, how many individuals (contributors) are present in a potentially mixed sample, and in what relative proportions? Second, what are the variants associated with each contributing haplotype?
The software, mixemt, makes use of the large catalog of defined mitochondrial haplogroups to identify the distinct haplotypes present within a mixture, and uses an expectation-maximization-based algorithm that co-estimates the overall mixture proportions and the source haplogroup for each read. The haplotype of each contributor is reconstructed by assigning each read to the haplogroup from which it most likely originated. Through this strategy, reads carrying novel variants, which provide the most power to discriminate between individuals, are partitioned by contributor using common, well-described variants. We demonstrate that our method can reliably detect haplogroups, estimate mixture proportions, and assign reads to contributors in in silico and in vitro mixture samples.
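The co-estimation idea can be sketched as a generic expectation-maximization loop for a finite mixture. This is an illustration under stated assumptions and not the mixemt implementation: the function `em_mixture` is hypothetical, and the per-read likelihoods under each candidate haplogroup are assumed to be supplied as input (how they would be derived from haplogroup-defining variants is not shown).

```python
def em_mixture(read_likelihoods, n_iter=200, tol=1e-9):
    """Estimate mixture proportions and per-read source by EM.

    read_likelihoods[r][h] is the assumed probability of read r given
    candidate haplogroup h; each row needs at least one positive entry.
    """
    n_reads = len(read_likelihoods)
    n_haps = len(read_likelihoods[0])
    props = [1.0 / n_haps] * n_haps  # start from uniform proportions
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability that each read came from each haplogroup.
        resp = []
        for row in read_likelihoods:
            weighted = [p * lik for p, lik in zip(props, row)]
            total = sum(weighted)
            resp.append([w / total for w in weighted])
        # M-step: new proportions are the average responsibilities.
        new_props = [sum(r[h] for r in resp) / n_reads for h in range(n_haps)]
        converged = max(abs(a - b) for a, b in zip(new_props, props)) < tol
        props = new_props
        if converged:
            break
    # Assign each read to its most probable source haplogroup.
    assignments = [max(range(n_haps), key=lambda h: r[h]) for r in resp]
    return props, assignments


# Toy input: 7 reads favouring haplogroup 0, 3 favouring haplogroup 1,
# and a third candidate haplogroup that no read supports.
toy_reads = [[0.98, 0.01, 0.01]] * 7 + [[0.01, 0.98, 0.01]] * 3
proportions, read_sources = em_mixture(toy_reads)
```

In this toy case the estimated proportions settle near 0.7 and 0.3 for the first two candidates, and the unsupported third candidate shrinks toward zero, which shows how a fit of this kind can also indicate the number of contributors.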
Quality standards for new technologies and mixed data sets; Analysis for metagenomics, antimicrobial resistance, and forensics; Human, non-human, and infectious disease applications