Confidential Computing approach for secure-by-design data collaboration in genomics
Genetic Testing Market: different business models and siloed datasets
Of at least 250 genetic testing companies (GTCs) worldwide, only a few have databases that exceed one million samples. The business model of the top direct-to-consumer (DTC) GTCs differs significantly from the one adopted by most of the industry.
The key factor is that leading GTCs occupy the best-selling, most revenue-generating niche of Ancestry and DNA Matching services, which allows them to accumulate extensive genomics databases. As stated in our first article, our market research of 165 genetic companies showed that only about 15% of them provide both Ancestry and DNA Matching products.
Large genomics databases provide greater value to GTCs’ customers because GTCs can offer more accurate personalised reports and find more DNA relatives. GTCs also leverage these comprehensive databases to enter clinical trial markets and participate in drug discovery, thereby diversifying and increasing their revenue streams.
Small and medium-sized GTCs rarely have such strategic capabilities because their databases are too small. From both scientific and business perspectives, the value of a genomics dataset grows non-linearly with its size, so data collaboration could substantially increase the value of their datasets. However, GTCs’ databases remain locked in silos because genetic information is highly sensitive.
Protection of data-in-use as a bottleneck for data collaboration
Overcoming barriers to data collaboration has been a goal of the largest IT companies and public institutions for decades. Collaboration can be constrained by differences in storage standards, country-specific regulations, and stakeholders’ interests. However, as mentioned previously, data privacy is the biggest concern.
To ensure privacy in a multi-party scenario, data has to be protected in three different states: at-rest, in-transit, and in-use. There are several reliable approaches for protecting data-at-rest and data-in-transit, such as encrypted databases for storing information and the TLS protocol for secure data transfers. Data-in-use, by contrast, has been the most vulnerable state.
The issue is that data analysis requires decryption, and once decrypted, the data is exposed to disclosure, leakage, and modification.
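The point can be illustrated with a minimal Python sketch. It relies on a toy XOR stream cipher derived from SHA-256, which is an assumption made purely for illustration: it is not secure and is not tied to any product mentioned in this article. The genotype string and all names are hypothetical.

```python
import hashlib
import secrets


def keystream(key: bytes, length: int) -> bytes:
    """Derive a pseudo-random byte stream from the key (toy construction, not secure)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(out[:length])


def xor_cipher(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt by XOR with the keystream (the same operation both ways)."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))


key = secrets.token_bytes(32)
genotypes = b"AGGTCAAGTT"  # hypothetical genotype string

# Data-at-rest: only the ciphertext is stored.
stored = xor_cipher(key, genotypes)

# Counting alleles on the ciphertext would be meaningless, so the analysis
# first has to recover the plaintext in memory. This is the data-in-use state.
in_use = xor_cipher(key, stored)
count_a = in_use.count(b"A")
print(count_a)
```

While `in_use` sits in ordinary process memory, anyone with sufficient access to the host can read or alter it. Protection of data-in-use is meant to close that gap.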
Data protection by law and data protection by design
Data protection techniques fall into two broad and complementary categories: ‘protection by law’ and ‘protection by design’. Protection by law is built on the principles of responsible data sharing and refers to protocols that the stakeholders participating in a collaboration have to sign and follow. This method can be time-consuming and stakeholder-selective, and it cannot guarantee security in all instances.
Data protection by design means building a technology solution into every stage of data management so that its security controls can withstand both external and internal threats. The problem is that, despite several generations of secure multi-party computing technologies, none was efficient and fast enough to be adopted on a large scale. The existing privacy-enhancing solutions lacked comprehensive hardware-backed technologies that could prove the security-by-design concept.
Confidential Computing is a novel hardware-based technology that enables secure-by-design data collaboration
The initiative to advance the protection of data-in-use has been led by Intel, AMD, Microsoft, HP, and IBM since 2003. In 2019, these efforts were consolidated into the Confidential Computing Consortium, formed under the Linux Foundation by global IT leaders, which introduced a game-changing technology: Confidential Computing.
Confidential Computing is the protection of data-in-use by performing computations in a hardware-based Trusted Execution Environment (TEE). A TEE is a secure area of the main processor that essentially operates as a black box, the so-called secure ‘enclave’. Data and code can be transferred to the enclave, where computations are executed in hardware isolation.
Once inside the black box, the code can no longer be modified, and all computations occur only inside the black box. The results of code execution can take the form of analytical insights or trained models. After the code is executed, the TEE black box is destroyed, along with all the code and data it contained. Hence, a TEE ensures data integrity, data confidentiality, and code integrity, and protects against internal and external threats, including threats from the cloud provider.
The disruptive advantage of Confidential Computing is the TEE’s capability to connect data from several sources with no data disclosure among the data owners. This is a new paradigm that enables secure-by-design multi-party analytics. Confidential Computing opens up previously impossible data collaboration scenarios with extraordinary opportunities for data owners and software developers.
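The enclave lifecycle and the multi-party scenario described above can be sketched in plain Python. This is only a simulation under stated assumptions: the `SimulatedEnclave` class, the `measure` helper, the `pooled_mean` analysis, and the owner names are all hypothetical, and nothing here provides real hardware isolation or uses a real TEE SDK. The code fingerprint loosely mimics the idea that data owners can check which code will run before releasing their data.

```python
import hashlib
from typing import Callable, Dict, List


def pooled_mean(inputs: Dict[str, List[float]]) -> float:
    """Analysis code agreed on by all parties: mean over the pooled values."""
    values = [v for owner_values in inputs.values() for v in owner_values]
    return sum(values) / len(values)


def measure(code: Callable) -> str:
    """Toy code fingerprint: hash of the function's bytecode (illustrative only)."""
    return hashlib.sha256(code.__code__.co_code).hexdigest()


class SimulatedEnclave:
    """Toy stand-in for a TEE. It mimics the lifecycle only; no real isolation."""

    def __init__(self, code: Callable[[Dict[str, List[float]]], float]):
        self._code = code
        self.measurement = measure(code)  # fixed when the code is loaded
        self._inputs: Dict[str, List[float]] = {}
        self._destroyed = False

    def load_data(self, owner: str, values: List[float], expected: str) -> None:
        """Accept data only if the owner's expected fingerprint matches."""
        if self._destroyed:
            raise RuntimeError("enclave has been destroyed")
        if expected != self.measurement:
            raise ValueError("code fingerprint mismatch, data not released")
        self._inputs[owner] = list(values)

    def run(self) -> float:
        """Execute the loaded code, return only the result, then wipe everything."""
        if self._destroyed:
            raise RuntimeError("enclave has been destroyed")
        result = self._code(self._inputs)
        self._inputs.clear()
        self._code = None
        self._destroyed = True
        return result


enclave = SimulatedEnclave(pooled_mean)
expected = measure(pooled_mean)  # each owner computes this independently
enclave.load_data("gtc_a", [1.0, 2.0, 3.0], expected)
enclave.load_data("gtc_b", [5.0], expected)
result = enclave.run()
print(result)
```

In this sketch each owner only ever receives the pooled result, never the other party's values. In a real deployment that guarantee would have to come from the hardware and its attestation mechanism, not from Python object boundaries.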
As an early adopter of Confidential Computing, GENXT developed a solution that implements new approaches for secure data collaboration in genomics
GENXT Confidential Computing Network is designed to connect GTCs, bioinformatics software developers, and research centres into a peer-to-peer ecosystem for secure-by-design multi-party analytics of genomic and clinical data.
The first application of the Confidential Computing Network is built to improve DNA Relatives Matching, which was previously available only on locally controlled databases. The network connects GTCs to one another, allowing them to add value to their DNA relatives matching services by leveraging each other’s databases to find both close and distant DNA relatives in a secure and privacy-preserving way. The architecture of the Confidential Computing Network ensures that personal genetic data stays undisclosed among GTCs, GENXT, and any third parties.
The solution therefore offers a new way for previously competing companies to collaborate. In the next article, we will provide more details about our Peer-to-Peer DNA Relatives Matching service.
Other use-case scenarios of the Confidential Computing Network will include a federated learning engine for training models on genetic data, predicting phenotypes, and identifying disease-gene associations; patient recruitment for clinical trials; and a marketplace for third-party bioinformatics software.
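To give a sense of what federated learning involves, here is a minimal Python sketch of federated averaging on a one-feature linear model. It is a generic illustration and not a description of the GENXT engine. The two parties, their data (a made-up score paired with a made-up trait value, following y = 2x + 1), and all function names are assumptions. Each party trains locally on its own records, and only the model parameters are shared and averaged.

```python
from typing import List, Tuple

Dataset = List[Tuple[float, float]]  # (feature, label) pairs held by one party


def local_update(w: float, b: float, data: Dataset,
                 lr: float = 0.1, epochs: int = 5) -> Tuple[float, float]:
    """Run a few gradient-descent steps for y = w*x + b on one party's data."""
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def federated_average(updates: List[Tuple[float, float]],
                      sizes: List[int]) -> Tuple[float, float]:
    """Average the parties' parameters, weighted by their sample counts."""
    total = sum(sizes)
    w = sum(u[0] * n for u, n in zip(updates, sizes)) / total
    b = sum(u[1] * n for u, n in zip(updates, sizes)) / total
    return w, b


def train(parties: List[Dataset], rounds: int = 30) -> Tuple[float, float]:
    """Alternate local training and averaging; raw records never leave a party."""
    w, b = 0.0, 0.0
    for _ in range(rounds):
        updates = [local_update(w, b, data) for data in parties]
        w, b = federated_average(updates, [len(d) for d in parties])
    return w, b


# Hypothetical records held separately by two parties.
party_a = [(-1.0, -1.0), (0.0, 1.0), (1.0, 3.0)]
party_b = [(-2.0, -3.0), (-1.0, -1.0), (1.0, 3.0), (2.0, 5.0)]

w, b = train([party_a, party_b])
print(round(w, 4), round(b, 4))
```

Only `w` and `b` cross party boundaries in this sketch. Running the averaging step inside a TEE would be one way to also keep the individual updates hidden from the coordinator, though that combination is an assumption here and is not something the text above specifies.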