Background: Low-cost, rapid next generation sequencing (NGS) is revolutionizing public health microbiology as whole genome sequencing (WGS) replaces traditional phenotypic and genotypic characterization methods. Since September 2013, CDC, in collaboration with federal partners, has sequenced all clinical isolates of Listeria monocytogenes (LM) to transform PulseNet’s current pulsed-field gel electrophoresis (PFGE)-based surveillance into a WGS-based infrastructure. One key part of PulseNet’s PFGE-based surveillance is the use of PFGE pattern names to identify related isolates without the need to look at gel images. Currently, no such pattern naming scheme exists for whole genome sequence data. As part of this project, we developed an LM whole-genome multi-locus sequence typing (wgMLST) database and a nomenclature for whole-genome sequence types in BioNumerics 7.6, and tested the utility of this approach in surveillance.
Methods: Whole-genome sequence types, hereafter ‘WGS zip-codes’, were generated for a set of 6450 genomes from LM isolates. All genomes were assessed for quality (coverage, sequence quality, assembly length, and percent core MLST) and then analyzed using a BioNumerics 7.6 nomenclature plugin. To develop the nomenclature, we created a single linkage tree from the set of LM genomes and used 10%, 5%, 2.5%, 1.0%, 0.5%, and 0.25% as our similarity thresholds, corresponding to distances of approximately 300, 150, 75, 30, 15, and 7 alleles, respectively. This resulted in a six-digit WGS zip-code (e.g. 1.1.1.1.1.1) in which each digit corresponds, in order, to one of the defined similarity thresholds, allowing us to identify related isolates without accessing actual sequence data. These thresholds were then assessed for sensitivity and specificity for identifying outbreak clusters using a dataset of 2810 genomes from clinical isolates and 71 historical outbreaks. For each outbreak, we determined the associated WGS zip-code. Isolates within the outbreak that share this WGS zip-code are true positives, while the remaining isolates in the outbreak are false negatives. Isolates in the dataset that share the true positive WGS zip-code but are not part of any outbreak cluster are counted as false positives. Isolates in the dataset that have unique WGS zip-codes and are not part of any outbreak cluster are counted as true negatives.
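The nested naming idea can be illustrated with a minimal sketch. It is not the BioNumerics plugin; it assumes a precomputed pairwise allele-distance matrix, treats single linkage clustering at a cutoff as connected components of the "distance <= cutoff" graph, and numbers sub-clusters within each parent cluster in order of first appearance (a numbering rule chosen here for illustration only). The function names and the toy distances are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Approximate allele-distance cutoffs from the Methods, coarsest to finest.
THRESHOLDS = (300, 150, 75, 30, 15, 7)


def single_linkage_clusters(dist: Sequence[Sequence[int]], cutoff: int) -> List[int]:
    """Return one cluster label per isolate.

    Isolates are joined whenever their distance is <= cutoff, which gives the
    same groups as cutting a single linkage tree at that height.
    """
    n = len(dist)
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= cutoff:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)
    return [find(i) for i in range(n)]


def assign_zip_codes(dist: Sequence[Sequence[int]],
                     thresholds: Sequence[int] = THRESHOLDS) -> List[str]:
    """Build one dotted code per isolate, one digit per threshold."""
    n = len(dist)
    prefixes: List[Tuple[int, ...]] = [()] * n
    for cutoff in thresholds:
        labels = single_linkage_clusters(dist, cutoff)
        # Assumption: digits restart at 1 inside every parent cluster.
        digits_by_parent: Dict[Tuple[int, ...], Dict[int, int]] = {}
        updated = []
        for i in range(n):
            seen = digits_by_parent.setdefault(prefixes[i], {})
            digit = seen.setdefault(labels[i], len(seen) + 1)
            updated.append(prefixes[i] + (digit,))
        prefixes = updated
    return [".".join(str(d) for d in p) for p in prefixes]


# Toy allele distances for four hypothetical isolates A, B, C, D.
toy = [
    [0, 5, 20, 400],
    [5, 0, 25, 410],
    [20, 25, 0, 390],
    [400, 410, 390, 0],
]
codes = assign_zip_codes(toy)
```

In this toy example, A and B stay together at every level, C separates from them at the 15-allele cutoff, and D is already distinct at the 300-allele cutoff, so related isolates can be recognized by comparing code prefixes alone.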
Results: Specificity and sensitivity were calculated as weighted averages across 71 outbreaks at each similarity threshold. Specificity and sensitivity were (97.28%, 99.73%); (98.11%, 99.73%); (99.20%, 97.84%); (99.78%, 91.91%); (99.89%, 83.56%); and (99.97%, 70.08%) for the 10%, 5%, 2.5%, 1.0%, 0.5%, and 0.25% thresholds, respectively.
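The calculation can be sketched as follows, under several assumptions that the abstract does not spell out: the zip-code associated with an outbreak is taken to be the most common code among its isolates (truncated to the threshold level being tested), non-outbreak isolates that do not share that code count as true negatives, and the weighted average is computed by pooling counts across outbreaks, which is equivalent to weighting each outbreak by its denominator. All names and the toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, Optional, Tuple


def truncate(code: str, level: int) -> str:
    """Keep the first `level` digits of a dotted zip-code."""
    return ".".join(code.split(".")[:level])


def outbreak_counts(codes: Dict[str, str],
                    outbreak_of: Dict[str, Optional[str]],
                    outbreak: str,
                    level: int) -> Tuple[int, int, int, int]:
    """Return (TP, FN, FP, TN) for one outbreak at one threshold level."""
    members = [i for i, o in outbreak_of.items() if o == outbreak]
    background = [i for i, o in outbreak_of.items() if o is None]
    # Assumption: the outbreak's code is the most common one among its isolates.
    ref = Counter(truncate(codes[i], level) for i in members).most_common(1)[0][0]
    tp = sum(truncate(codes[i], level) == ref for i in members)
    fn = len(members) - tp
    fp = sum(truncate(codes[i], level) == ref for i in background)
    tn = len(background) - fp
    return tp, fn, fp, tn


def weighted_sens_spec(codes: Dict[str, str],
                       outbreak_of: Dict[str, Optional[str]],
                       level: int) -> Tuple[float, float]:
    """Pool counts over all outbreaks (a denominator-weighted average)."""
    outbreaks = sorted({o for o in outbreak_of.values() if o is not None})
    tp = fn = fp = tn = 0
    for ob in outbreaks:
        a, b, c, d = outbreak_counts(codes, outbreak_of, ob, level)
        tp, fn, fp, tn = tp + a, fn + b, fp + c, tn + d
    return tp / (tp + fn), tn / (tn + fp)


# Toy data: two hypothetical outbreaks (X, Y) and three sporadic isolates.
codes = {
    "a": "1.1.1.1.1.1", "b": "1.1.1.1.1.1", "c": "1.1.1.1.2.1",
    "d": "2.1.1.1.1.1", "e": "2.1.1.1.1.1",
    "f": "1.1.1.1.1.2", "g": "3.1.1.1.1.1", "h": "4.1.1.1.1.1",
}
outbreak_of = {"a": "X", "b": "X", "c": "X", "d": "Y", "e": "Y",
               "f": None, "g": None, "h": None}

for level in (4, 5, 6):  # digit positions for the 1.0%, 0.5%, 0.25% thresholds
    sens, spec = weighted_sens_spec(codes, outbreak_of, level)
    print(level, round(sens, 3), round(spec, 3))
```

The toy data reproduce the qualitative trade-off reported above: tighter thresholds split outbreak isolate "c" away from its cluster (lower sensitivity) while excluding the sporadic isolate "f" (higher specificity).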
Conclusion: Currently, a threshold of up to 25 allele differences is used in the initial assessment of clusters to identify potential outbreaks. This is close to the 1% similarity threshold in our analysis (approximately 30 alleles), which identifies outbreaks with a sensitivity of >91% and a specificity of >99%, confirming the utility of our current initial cluster assessment threshold. Our next steps include an assessment of the stability of the similarity thresholds as new isolates are added to the single linkage tree and a validation of these thresholds for wgMLST schemes of other foodborne bacteria.
Gene editing, synthetic genomics, forensics, and biosurveillance; Global engagement and partnerships