Whole Genome Sequence Data Anonymization Using Bloom Cipher Application
Abstract
The whole genome sequence (WGS) of an unknown sample can provide a vast amount of information, including sample identification, when compared to a known dataset. However, the knowledge gained is only as good as the known dataset the sample is compared against. Through Noblis’ internal research and our relationship with private companies, we’ve learned that private industry holds a large trove of rare organisms (many of them sequenced) but is unwilling to share the data because of the legal implications of these pathogens being associated with past or future illnesses. To help bridge the gap between private and public data sharing until the legal aspects are worked through, Noblis has developed a novel approach that allows access to this large body of private data without placing the private sector at undue risk. Noblis’ Bloom Cipher application uses a probabilistic data structure called a Bloom Filter to perform a one-way encoding of the data. Once encoded by the private sector, the data is anonymized, yet some basic analysis of it can still be performed.
Bloom Filters are a fitting data structure for this problem because they allow rapid insertion and lookup of data while maintaining a zero percent false negative rate. The false positive rate of a Bloom Filter is defined in advance, relative to the number of elements expected to be added to the filter. The Bloom Cipher application encodes WGS data by k-merizing the read set and inserting the individual k-mers into a pre-assembled Bloom Filter. For each k-mer, the algorithm chooses between the k-mer and its reverse complement, which reduces the number of elements to be tracked and standardizes what is added. Once inserted, the data effectively cannot be recovered in its original form. Once the data has been encoded by the Bloom Cipher application and delivered, Noblis can check whether an organism is likely present in the Bloom Filter. This is done by comparing the Bloom Filter to a reference genome using the same lookup process.
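The encoding and lookup steps described above can be illustrated with a minimal Python sketch. It is not the Bloom Cipher implementation: the class and function names (BloomFilter, canonical, encode_reads, fraction_present) are hypothetical, the rule of keeping the lexicographically smaller of a k-mer and its reverse complement is one common convention assumed here, and the SHA-256 double-hashing scheme and the textbook sizing formulas are likewise assumptions.

```python
import hashlib
import math

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the smaller of a k-mer and its reverse complement (assumed convention)."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmers(seq: str, k: int):
    """Yield canonical k-mers of seq, skipping windows with non-ACGT characters."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if set(window) <= set("ACGT"):
            yield canonical(window)


class BloomFilter:
    """Hypothetical Bloom filter sized from an expected item count and target false positive rate."""

    def __init__(self, expected_items: int, fp_rate: float):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, h = (m / n) ln 2 hash functions.
        self.m = max(1, math.ceil(-expected_items * math.log(fp_rate) / math.log(2) ** 2))
        self.h = max(1, round(self.m / expected_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive h bit positions from two halves of one SHA-256 digest.
        digest = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.h)]

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def encode_reads(reads, k: int, expected_items: int, fp_rate: float) -> BloomFilter:
    """One-way encode a read set: only the filter's bits leave the data owner."""
    bf = BloomFilter(expected_items, fp_rate)
    for read in reads:
        for km in kmers(read, k):
            bf.add(km)
    return bf


def fraction_present(bf: BloomFilter, reference: str, k: int) -> float:
    """Fraction of a reference's canonical k-mers that the filter reports as present."""
    ref_kmers = list(kmers(reference, k))
    if not ref_kmers:
        return 0.0
    return sum(km in bf for km in ref_kmers) / len(ref_kmers)
```

In this sketch, a reference whose k-mers were all inserted always scores 1.0 because the filter has no false negatives, and the reverse-complement strand scores the same because both strands map to the same canonical k-mers. A score below 1.0 for an unrelated reference is only probabilistic, since individual lookups can be false positives.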
The Bloom Cipher application uses standardized Bloom Filters. As such, Noblis will also be able to determine how similar two anonymized, encoded Bloom Filters are to one another by calculating the Jaccard Index between the two. A Jaccard Index close to 1 signifies that the contents of the ciphers are closely related and contain many of the same k-mers (possibly the same organism or species). A result closer to 0 indicates that the sequences in the ciphers are not closely related (different organisms or species). While we’re currently evaluating organisms relevant to the food industry, the Bloom Cipher application has many other potential uses, including for organisms associated with national security.
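One way such a comparison could be computed is sketched below in Python. The abstract does not say how the Jaccard Index is derived from the filters, so this is an assumption: the function name jaccard_bits is hypothetical, and the index is taken directly over the set bits of two filters that share the same size and hash functions (the "standardized" property), which only approximates the Jaccard Index of the underlying k-mer sets.

```python
def jaccard_bits(a: bytes, b: bytes) -> float:
    """Jaccard Index over the set bits of two equally sized Bloom filter bit arrays.

    Assumes both filters were built with the same size and hash functions;
    otherwise the bit positions are not comparable.
    """
    if len(a) != len(b):
        raise ValueError("filters must have the same size to be compared")
    x = int.from_bytes(a, "big")
    y = int.from_bytes(b, "big")
    union = bin(x | y).count("1")         # bits set in either filter
    intersection = bin(x & y).count("1")  # bits set in both filters
    if union == 0:
        return 1.0  # convention chosen here: two empty filters count as identical
    return intersection / union
```

Identical filters give 1.0 and filters with no set bits in common give 0.0. Because unrelated k-mers can set the same bit, the bit-level value tends to overstate similarity as the filters fill up.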
Authors
- Tyler Barrus (Noblis)
- Mark Sanders (Noblis)
- Shane Mitchell (Noblis)
- Danielle Montoya (Noblis)
- Sterling Thomas (Noblis)
Topic Areas
Analysis for metagenomics, antimicrobial resistance, and forensics; Human, non-human, and infectious disease applications; Gene editing, synthetic genomics, forensics, and biosurveillance
Session
OS-5 » Metagenomics, Informatics, Assembly & Analysis (14:00 - Wednesday, 17th May, La Fonda Ballroom)