This work focuses simultaneously on both the scaffolding and gap filling phases of de nouveau genome assembly. Given a set of contigs and their relationships--overlaps and/or remoteness in terms of distances between them--we propose an optimization-based approach for finding the genome sequence as a longest sequence that is consistent with the given contig and linkage information. Specifically, we define a graph, which we call a contig graph, that encodes information about contigs and overlaps and mate-pair distances between them, and reduce the scaffolding problem to the problem of finding a longest simple path in that graph such that as many as possible mate-pairs distances are satisfied. Since both conditions cannot generally be simultaneously satisfied, our objective function is a linear combination of the path length and penalties for distance mismatches.
Unlike the shortest path problem with non-negative weights, for which efficient polynomial-time algorithms exist, the longest weighted path problem is NP-hard. We solve this problem by reformulating it as a mixed integer linear program (MILP) and develop a method that exactly solves the resulting program on genomes of up to 165 contigs and up to 6682 binary variables.
An advantage of our approach is that the modeling of scaffolding as a longest path problem allows one to solve simultaneously several subtasks specific for this problem like: contig orientation and ordering, repeats, gap filling, and scaffold extension, which in other approaches are targeted as separate problems. We are not aware of previous approaches on scaffolding based on the longest path problem reduction. A drawback of the typically used strategy of constructing a set of disjoint paths, rather than a single path, is that it would require additional steps of gap filling and scaffold extension, involving additional work. Moreover, it would make impossible to find a provably optimal final solution, since, even if each separate problem is implemented optimally, their combination may not be optimal.
We tested this model on a set of chloroplast and bacteria genome data and showed that it allows to assemble the complete genome as a single scaffold. Compared to the publicly available scaffolding tools that we have tested, our solution produces assemblies of significantly higher quality.
Whole genome assemblers and integration of next generation dataTopic #1 , Next generation finishing tools, technologies and pipelines , De novo assemblers for short reads, hybrid assemblers