Jonathan Dursi
CanDIG
Jonathan Dursi has over twenty-five years of experience using large-scale computing to advance science. His personal research has focused on astrophysical fluids with the DOE ASCI ASAP program and on bioinformatics with the Ontario Institute for Cancer Research. He has also worked to support other researchers at Canada’s largest HPC centre, SciNet, and as Compute Canada’s first CTO. He currently works at Toronto’s Hospital for Sick Children on the CanDIG project, helping build a platform for national-scale analysis of locally-controlled private genomics data. He is very interested in tools that have the potential to make big scientific computing more productive and powerful, and occasionally blogs on these topics at http://www.dursi.ca.
Tackling the “wicked problems” of cancer and rare diseases against the already complex landscape of human biology requires health researchers to have access to as much health and genomic data as possible in order to see connections and test hypotheses. While a torrent of genomic data is now being produced at sites across Canada, accessibility to researchers means more than having it sit on a disk somewhere with the right permission bits set. It has to be discoverable, analyzable, available, and linked to vital metadata for it to be useful in improving human health.
In this talk we present the Canadian Distributed Infrastructure for Genomics (CanDIG, http://distributedgenomics.ca), a fully distributed platform that allows national-scale, privacy-maintaining analyses of locally-controlled data sets. CanDIG is a CFI Cyberinfrastructure-funded project with initial sites in Toronto, Montréal, and Vancouver. With participating members including the top generators of genomic data for Canadian patients, CanDIG will serve as a foundational platform for all of these institutions’ collaborative ventures, with a long-term goal of extending data sharing to a wider base of researchers across Canada.
The CanDIG platform represents a new approach to making sensitive genomic data available for analysis across multiple providers. By moving computation to the data, we enable truly national-scale analysis of private health data in Canada, where provincial health data privacy protections can make it difficult for clinical data to leave the province in which it was collected. With privacy protections built in from the very beginning, we make it easier for health data stewards to justify allowing their data to be part of some remote analyses. Granular control of the amount of data and information being released, and to whom, is a fundamental part of the overall design. Our efforts build on and contribute back to the efforts of the international Global Alliance for Genomics and Health (GA4GH, http://genomicsandhealth.org), using standardized RESTful APIs for data access to provide interoperability with a wide range of tools, and web-era authentication and authorization standards (OpenID Connect and UMA) to ensure privacy and security of all data.