Introduction: Whole-genome sequencing (WGS) has become critical to the characterization of enteric bacterial outbreak clusters. However, projects that analyze multiple genomes can be affected by systematic biases arising from suboptimal machine maintenance or reagent quality, among other causes. This study evaluates tools for correcting bias in WGS data from enteric pathogen outbreaks.
Methods: Illumina MiSeq WGS data from three outbreak clusters (E. coli ser. O26, Salmonella enterica ser. Pomona, and S. enterica ser. Hartford) were selected for analysis based on preliminary evidence of bias, such as low average base quality (Q33 score < 30.0). WGS reads from each outbreak were corrected using Prinseq v0.20.3, Musket v1.1, Blue v4.0.30319, BayesHammer from SPAdes v3.8.1, or run_assembly_trimClean.pl from CG-Pipeline v0.3.2. Forward and reverse reads were visualized in R, and WGS coverage of the reference genomes was quantified in LyveSET v1.1.4f.
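The two bias indicators used here, average base quality and the percentage of Ns at each read position, can be illustrated with a minimal Python sketch. It is not the authors' pipeline (reads were visualized in R); the function names `fastq_records` and `summarize` are hypothetical, and the sketch assumes uncompressed four-line FASTQ records with Phred+33 quality encoding, which is how "Q33" is read here.

```python
from collections import Counter
import io

def fastq_records(handle):
    """Yield (sequence, quality_string) pairs from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        seq = handle.readline().strip()
        handle.readline()       # '+' separator line, ignored
        qual = handle.readline().strip()
        yield seq, qual

def summarize(handle, offset=33):
    """Return (percent Ns per read position, mean base quality).

    offset=33 assumes Phred+33 encoding (an assumption, not from the study).
    """
    n_counts = Counter()   # number of N calls at each position
    depth = Counter()      # number of reads covering each position
    qual_sum = 0
    base_total = 0
    for seq, qual in fastq_records(handle):
        for i, base in enumerate(seq):
            depth[i] += 1
            if base in "Nn":
                n_counts[i] += 1
        qual_sum += sum(ord(c) - offset for c in qual)
        base_total += len(qual)
    pct_n = {i: 100.0 * n_counts[i] / depth[i] for i in sorted(depth)}
    mean_q = qual_sum / base_total if base_total else float("nan")
    return pct_n, mean_q

if __name__ == "__main__":
    # Two toy reads; the first has an N at its last position.
    demo = io.StringIO("@r1\nACGN\n+\nIIII\n@r2\nACGT\n+\nII5I\n")
    pct_n, mean_q = summarize(demo)
    print(pct_n, mean_q)
```

Under these assumptions, a read set would be flagged when `mean_q` falls below 30.0 or when any position in `pct_n` shows a spike of Ns, such as the single-position artifacts described in the Results.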
Results: Of the ten E. coli O26 isolates, four reverse read sets had average quality scores below 30.0. One O26 reverse read set had 15% ambiguous nucleotide artifacts (Ns) at a single position, and another had up to 3.5% Ns at multiple positions. Only Blue successfully removed all Ns and increased quality scores above 30. Blue increased median mapped coverage of the reference genome from 58x to 62x. Out of four Salmonella ser. Pomona isolates, three reverse read sets had average quality scores < 30. One set of reverse reads from Pomona exhibited composition artifacts at two positions, with 10.4% and 3.8% Ns. Another Pomona isolate had 3.5% Ns at one position on its forward reads. BayesHammer minimized artifact site Ns to less than 1% on forward reads and less than 3% on reverse reads. Blue reduced Ns to less than 0.1% on forward and reverse reads while raising read qualities above 30.0. BayesHammer and Blue increased median coverage to the Pomona reference genome, from 96x to 125x and 147x, respectively. Out of six Salmonella ser. Hartford isolates, two had Q33 qualities averaging less than 30.0 while three sets of forward reads from the same run date had 7.0% to 7.5% Ns artifacts at one site. BayesHammer and Blue both reduced Ns at the site to less than 0.1%, while Blue raised average read qualities above 30.0. BayesHammer and Blue raised median coverage of the Hartford reference genome from 76x to 83x and 87x, respectively. Prinseq, Musket, and CG-Pipeline did not reduce artifact site Ns, did not increase quality scores, and caused some loss of coverage (–1x) across the three clusters.
Conclusions: In enteric bacterial outbreaks, systematic biases and artifacts may appear across WGS projects that include multiple related isolates, necessitating rapid correction of biased reads. Ideally, the same correction tool should be applied to all isolates within an outbreak cluster without loss of sequence coverage. By removing N composition artifacts and improving coverage, BayesHammer and Blue outperformed Musket, Prinseq, and CG-Pipeline.