25
LifeCODE Databank The Optimal Place for Safe and Efficient Health Data Storage and Exchange July 2, 2018

LifeCODE Databank - WuXi NextCODE · WuXi NextCODE is building the premier global genomics data platform9. Our systems have been at the frontier of genomic research and clinical applications

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: LifeCODE Databank - WuXi NextCODE · WuXi NextCODE is building the premier global genomics data platform9. Our systems have been at the frontier of genomic research and clinical applications

LifeCODE Databank

The Optimal Place for Safe and EfficientHealth Data Storage and ExchangeJuly 2, 2018

Page 2: LifeCODE Databank - WuXi NextCODE · WuXi NextCODE is building the premier global genomics data platform9. Our systems have been at the frontier of genomic research and clinical applications

Table of Contents

Announcing LifeCODE Databank ............................................................................................... 3

Platform Architecture ........................................................................................................................ 7

LifeCODE Blockchain ....................................................................................................................... 9

Permissioned Network ................................................................................................................ 11

Cambrian and Privacy .................................................................................................................. 12

Private Transactions ...................................................................................................................... 13

Data Privacy ........................................................................................................................................... 15

Reliable Storage ................................................................................................................................... 16

Searchable Encryption ..................................................................................................................... 18

Genomic Data Storage, Search, and Analysis ...................................................................... 19

Phenotypic Data Integration .......................................................................................................... 20

Data Exchange and Customer Interaction ............................................................................. 20

LifeCODE Token .................................................................................................................................. 22

Summary ................................................................................................................................................. 23

References .............................................................................................................................................. 24

2 LifeCODE Databank

Page 3: LifeCODE Databank - WuXi NextCODE · WuXi NextCODE is building the premier global genomics data platform9. Our systems have been at the frontier of genomic research and clinical applications

Announcing LifeCODE Databank

Genomic testing is ushering in a new age of medicine. Patients can be treated more precisely and

effectively when their care is tailored to their own individual genetic makeup1,2. Determining genetic

risk for cardiovascular disease and cancer empowers individuals and their healthcare providers to

prevent those diseases or detect and begin treating them earlier. As the cost for genomic sequencing

becomes more affordable, more and more people around the world are opting to get genetic testing.

As a result, there is a growing need for a safe and efficient way to store healthcare data, while

still allowing for it to be widely shared. Genomic data is extremely sensitive. Most people are not

aware that our DNA contains information about our life expectancy, our proclivity to depression or

schizophrenia, our complete ethnic ancestry, our expected intelligence, and maybe even our political

inclinations3. Within a decade or two, the genome will likely reveal even more. In this context, ensuring

that people’s genomic data is stored responsibly and anonymously is vital for maintaining privacy. It is

also vital for scientific progress.

Personalized medicine is possible only because we can analyze genomic data from hundreds of

thousands of people. Such data allows scientists to do important research to better understand the

role of genetics in diseases and the immune system, and to create better treatments for specific

groups4,5. Only with this data can we start routinely customizing treatments for individuals, thereby

cutting costs, increasing efficiency, and providing patients the best care possible.

Most sequenced genomes are currently stored in strict access-controlled repositories. Access to

this data could help researchers identify disease-causing genetic variants and aid the discovery of

new drug targets. Numerous recent examples show how discovery of new disease-causing genes

lead to better treatment for cancer, cardiovascular disease, and metabolic conditions. Yet, concerns

over genetic data privacy may, understandably, deter individuals from contributing their genomes to

scientific studies6-8 thereby slowing medical progress.

WuXi NextCODE has emerged as one of the companies leading the charge in helping fulfill the Human

Genome Project’s original aspirations. WuXi NextCODE is uniquely positioned to securely provide an

ideal solution to the dilemma of protecting and managing genomic personal data.

3 LifeCODE Databank

Page 4: LifeCODE Databank - WuXi NextCODE · WuXi NextCODE is building the premier global genomics data platform9. Our systems have been at the frontier of genomic research and clinical applications

With the launch of the WuXi NextCODE LifeCODE Databank, we combine our unparalleled experience

in genomic sequencing and analysis with the bold new technology of blockchain, allowing patients,

hospitals, companies, and other stakeholders to further advance the genomic revolution.

About WuXi NextCODE

WuXi NextCODE is building the premier global genomics data platform9. Our systems have been at the

frontier of genomic research and clinical applications for more than 20 years and we have analyzed

data from more than 600,000 genomes. With offices in Shanghai, Cambridge (Massachusetts),

and Reykjavik, Iceland, we serve the leading population genomics, genomic diagnostics, precision

medicine, and wellness initiatives and enterprises around the world10-12.

Our capabilities span study design, sequencing, secondary analysis, storage, interpretation, scalable

analytics, and AI and deep learning — all backed by the proven and most widely used technology for

organizing, mining, and sharing genome sequence data.

The dramatic advances in genomics mean that individuals can now obtain their own genomic data

and corresponding analysis reports relatively easily and affordably. As a result, huge amounts of

genomic data are being generated and stored in either personal devices and/or those institutions that

provide genomic sequencing services or patient cohorts for studies.

Many big genomic data projects exist, and many institutions are gathering and building their own

databases. But what is being lost is the growing amount of data created when individuals have

their genomes sequenced, not as part of a study, but for their own health or interest. What is more,

individuals have no real-time control over how their genomic data is stored and used. Nor do they

have the option of selectively allowing their data to be used for purposes of their own choosing. As

noted privacy issues are also a key concern causing many to avoid genomic sequencing in the first

place. Confidence in data security together with the freedom to pick and choose the research projects

to be involved in, may greatly increase participation.

Meanwhile, petabytes of phenotype data, including clinical results, daily behavior, diet patterns, and

general health information is also being generated at hospitals, research institutions, pharmaceutical

R&D, and wearable IoTs13-15. Such data is typically stored in centralized databases, which raises

security and data abuse concerns. Moreover, scalability and efficient data sharing and exchange are

hard to achieve based on current research and the various practices of healthcare institutions.

On the other hand, how to find an efficient way to search and exchange qualified health data has

4 LifeCODE Databank

Page 5: LifeCODE Databank - WuXi NextCODE · WuXi NextCODE is building the premier global genomics data platform9. Our systems have been at the frontier of genomic research and clinical applications

always been a big challenge that scientific, clinical, and pharmaceutical researchers face. For example,

studies of medicines or treatments for some rare diseases is difficult due to a lack of adequate patients

for discovery research and clinical trials.

For common diseases meanwhile, which are often very genetically complex, we need very large

cohorts of patients to do effective research and clinical studies.

LifeCODE guarantees that data ownership belongs to only the individuals who upload their own

health data. This data is immediately and fully encrypted by default once it is uploaded to LifeCODE.

To ensure security, any access to this data is strictly monitored and recorded in LifeCODE’s blockchain

so that all data transactions are fully traceable. Most importantly, of course, is that the data owner

controls access and authorizes any transactions or sharing.

For instance, a pharmaceutical company could offer a data owner a certain amount of LifeCODE

“tokens” for allowing the company access to designated genomic data (e.g. VCF) that has been

searched for and found in LifeCODE. This offer will be sent through the platform and the data owner

would receive a request for authorization. The transaction is complete once permission is granted or

rejected. Data owners will always remain anonymous to the data searchers as their personal genetic

information is filtered and encrypted in LifeCODE. Again, all the health data transactions are recorded

in LifeCODE blockchain for tracing, assuring that the system is secure.

WuXi NextCODE’s LifeCODE is a global cloud-based and blockchain-enabled platform developed to

provide an ideal solution to this dilemma.

LifeCODE’s key features are that it:

◉ Allows an individual to declare ownership of their own data. All data sharing and uses must be

authorized by the data’s owner.

◉ Includes genomic and phenotypic data.

◉ Enables end-to-end blockchain technology.

◉ Enables data exchange with LifeCODE tokens.

◉ Enables natural language searches.

◉ Provides easy-to-use service APIs for businesses, individuals, and systems.

◉ Provides manageable security services.

◉ Provides access to one of the world’s premier genomic data analysis solutions — the Genomic

Ordered Relational (GOR) architecture.

5 LifeCODE Databank

Page 6: LifeCODE Databank - WuXi NextCODE · WuXi NextCODE is building the premier global genomics data platform9. Our systems have been at the frontier of genomic research and clinical applications

◉ Features a searchable encrypted database distributed in cloud.

◉ Provides data privacy and protection. All data is encrypted throughout the system.

Transactions are anonymous and traceable in the blockchain network. Features a searchable

encrypted database distributed in cloud.

◉ Guarantees data authenticity. The data can never be changed without its owner’s

acknowledgement. Hash technique provided.

◉ Guarantees permanent and secure data storage. Data is distributed, saved, and backed up

in the cloud.

◉ Guarantees full data recovery in case of a system malfunction. All data is synched in

distributed cloud storage and can be recovered in the event of disaster.

The LifeCODE ecosystem provides rapid and efficient access to data and services for all genomic

stakeholders, including hospitals, laboratories, pharmaceutical companies, contract research

organizations, information technology vendors, IoTs, patients, and others.

For example, WeGene, a China-based company focusing on personal genetic testing and analysis

services, has significantly improved the accuracy of their analysis and the quality of user reports

through data aggregation with the LifeCODE system and the powerful toolkits it provides.

Figure 1. Potential participants in the LifeCODE ecosystem

6 LifeCODE Databank

Page 7: LifeCODE Databank - WuXi NextCODE · WuXi NextCODE is building the premier global genomics data platform9. Our systems have been at the frontier of genomic research and clinical applications

Platform Architecture

LifeCODE’s platform integrates both genomic and phenotypic health data, establishing an ecosystem

for data exchange, interoperability, and a wider range of services for all stakeholders. It achieves this

while providing the highest level of data quality and privacy protection.

Security

General Services

Miscellaneous

Searchable- Encrypted data Repositories

LifeCODE Blockchain

Storage

Healthcare Applications

Platform API Gateway

Genomic & Phenotype Services

Network Access Control

Artificial Intelligence

Platform SDK

GOR

Cambrian engine

SEDB

Network Manager

Storages Nodes

Platform Applications

API Gateway

GORpipe

XDS

3rd Party Applications

Authentication Center

DRGs

CSA

Data Exchange Adapters

Configuration Manager

RiskEngine

CCDA

BKMS

Machine Learning

Inegration Specifications

HBase

Transaction Manager

RDBMs

Crypto Enclave

Backup Nodes

Data Anonymization

OCR

Transaction Viewer

Recovery Management

Applicatioin MAC

OAuth

Intel® SGX

MPI

Cyphers

Secure Connection

Figure 2 Components in the LifeCODE Data & Services Platform

7 LifeCODE Databank

Page 8: LifeCODE Databank - WuXi NextCODE · WuXi NextCODE is building the premier global genomics data platform9. Our systems have been at the frontier of genomic research and clinical applications

The LifeCODE platform comprises services, applications, and subsystems of general and specific

functions. The securely protected searchable-encrypted data repositories of genomic and phenotypic

data form the platform’s base. An application programming interface (API) and related business

APIs are designed to provide customers with additional analysis and exchange capabilities. Security

modules, using proven and newly-developed techniques and schemes, will be applied to data

quality and privacy protection. The LifeCODE blockchain is one of the platform’s core components,

functioning from end–to–end to ensure privacy while allowing patients to access and share their own

data, case by case, if and when they want.

Individuals, hospitals, research institutes, and pharmaceutical companies are the major participants

in the LifeCODE ecosystem. Typically, once individuals upload their personal genomic and phenotypic

health data to the platform, business participants can access the data only after the owner grants

permission. Participants such as insurance companies or commercial healthcare service providers can

also benefit from the platform by offering professional services in exchange for LifeCODE tokens.

Figure 3 Data and token flow in the LifeCODE ecosystem

Insurance Companies

Healthcare Companies

Institutes

Pharmaceuticals

HospitalsIndividuals

8 LifeCODE Databank

Page 9: LifeCODE Databank - WuXi NextCODE · WuXi NextCODE is building the premier global genomics data platform9. Our systems have been at the frontier of genomic research and clinical applications

Every data exchange within the LifeCODE platform will generate profit, that is, redeemable tokens, to

users. For instance, after successfully uploading their personal genomic data, individuals will receive

LifeCODE tokens that can be exchanged for discounted healthcare services, such as those offered by

health insurers or healthcare facilities. (NOTE: These organizations, while they may accept LifeCODE

tokens, will never receive a patient’s personal data unless that patient approves the data transfer.)

LifeCODE tokens will also be awarded to individuals from research institutions who are willing to

award LifeCODE tokens in exchange for access to genomic data, which will require the permission

of the patient. All transactions will be carried out on the LifeCODE platform and will be recorded in

the LifeCODE blockchain. This provides more genomic data to researchers, while allowing patients to

keep control of their own medical information.

LifeCODE Blockchain

LifeCODE blockchain is a permissioned blockchain based on the J.P. Morgan’s Quorum platform, an

enterprise-focused version of Ethereum16-17. It allows creation of private Genomic Smart Contracts

with Solidity and supports private transactions through Cambrian, a peer-to-peer encrypted message

exchange for the direct transfer of data among network participants.

Through the LifeCODE Cambrian and Searchable Encrypted Data Repository (SEDR) protocols, we

can create a massive, interactive genomic data bank wherein people can securely store, access, and

share their personal health data via private cryptographic keys. Data owners have complete control.

For example, they can choose to gain revenue from their genomic data or they can donate it.

The LifeCODE blockchain platform consists of four main components:

Crypto Enclave: Provides the Blockchain-associated Key Management System (BKMS), encryption,

and decryption of transaction data.

Consensus Engine: Uses a multiple voting-based consensus mechanism for faster block times, on-

demand block creation and transaction finality, which means that once confirmations are final, the

block cannot be bifurcated. That is, the transaction will not be revoked or rolled back.

Network Manager: Controls access to the blockchain network, enabling a permissioned network

to be created.

9 LifeCODE Databank

Page 10: LifeCODE Databank - WuXi NextCODE · WuXi NextCODE is building the premier global genomics data platform9. Our systems have been at the frontier of genomic research and clinical applications

Transaction Manager: Allows access to encrypted transaction data for private transactions; also

manages local data storage and communication with other Transaction Managers.

LifeCODE’s Consensus Engine implementation is based on a two-step strategy.

The first step is to adopt a Raft-based consensus18 to implement a closed-membership/consortium

blockchain, which is currently in production. Raft is a consensus algorithm for managing a replicated

log. It produces a result equivalent to multi-Paxos, and it is as efficient as Paxos.

Figure 4 Components of the raft replicated state machine approach19

10 LifeCODE Databank

Page 11: LifeCODE Databank - WuXi NextCODE · WuXi NextCODE is building the premier global genomics data platform9. Our systems have been at the frontier of genomic research and clinical applications

Raft implements consensus by first electing a distinguished leader and giving the leader complete

responsibility for managing the replicated log. The leader accepts log entries from clients,

replicates them on other nodes, and tells nodes when it is safe to apply log entries to their state

machines. A leader can fail or become disconnected from the other nodes, in which case a new

leader will be elected.

The second step is to use a Genomic Byzantine Fault Tolerance (GBFT) consensus algorithm with

instant transaction finality, a manageable validator set, and higher throughput. Our GBFT is inspired

by Hyperledger’s SBFT, and the Istanbul and NCCU BFTs20-21. This provides high-performance

Byzantine state machine replication, allowing the processing of thousands of requests per second

with sub-millisecond increases in latency. This consensus mechanism is made for genomics research

applications that require high scalability, robust security, and near absolute privacy.

Permissioned NetworkLifeCODE blockchain is engineered to be permissioned, which means the network places restrictions

on who is allowed to participate in the consensus mechanism. Participants must obtain permission

or receive an invitation to join. LifeCODE blockchain provides transparent governance within the

consortium network.

LifeCODE blockchain uses the same protocols and features as the Ethereum blockchain but maintains

the fail-safe by relying on permitted nodes. As a result, participants can benefit from blockchain

technology without the possibility of catastrophic breach.

In general, permissioned blockchain networks are also better performing and more cost-effective

than unpermissioned ones. Unpermissioned blockchain networks are public spaces and, as such, they

must face all the challenges of public goods governance, especially when it comes to ensuring the

networks’ evolution through updates to its rulebook or mechanisms of interaction. It’s not uncommon

to see large numbers of unpermissioned blockchains needing to be hard-forked when there is a

departure from consensus. As a consequence, innovation is slower among these networks.

Further, they run the risk of creating new defects as their security and consensus challenges are

continuously evolving.

Node permissioning is a feature in LifeCODE that allows only a set of permissioned nodes to connect

to the network. Permissioning is granted based on the Network Manager smart contract and the

remote key of the Geth node. The remote keys are placed in a structure-array of the smart contract.

11 LifeCODE Databank

Page 12: LifeCODE Databank - WuXi NextCODE · WuXi NextCODE is building the premier global genomics data platform9. Our systems have been at the frontier of genomic research and clinical applications

Cambrian and PrivacyBesides using permissions, LifeCODE seeks to further improve on privacy by introducing Quorum

blockchain’s public and private transactions. The public transactions act like normal Ethereum transactions.

The private transactions, however, are encrypted, and the payload details are not disclosed.

Personal health record confidentiality and privacy has long been a concern of health and medical

institutions, one that regular blockchains have so far failed to address adequately. LifeCODE brings a

new level of security through Cambrian, a general-purpose encryption system, to transfer information

securely and privately. It is comparable to a network of Message Transfer Agents (MTA) where

messages are encrypted with PGP.

Based on the Cambrian system, the BKMS will be built up to secure the encrypted keys which are

used for health data cryptography. As a result, on-demand access to health data is granted only when

the data owner authorizes it.

The Cambrian system consists of two modules: The TransactionManager and Enclave. Much of the

cryptographically-heavy work is done within the enclave, and the communication relay is carried out

in-between.

Figure 5 The LifeCODE Blockchain Architecture

12 LifeCODE Databank

Page 13: LifeCODE Databank - WuXi NextCODE · WuXi NextCODE is building the premier global genomics data platform9. Our systems have been at the frontier of genomic research and clinical applications

In addition to privacy for transactions, there is also privacy for smart contracts, a particular sensitivity

for research organizations. Such smart contracts could contain research strategies, transaction data,

or sensitive information.

Besides providing private transactions, we cryptographically conceal the genomic assets of a

transaction by applying “zero knowledge proofs.” At the same time, the entire network will still be

allowed to validate the integrity of the transaction payload. Zero Knowledge Proofs (ZKPs) 22 do

this by having one party prove to another party that a given statement is true, without conveying any

information apart from the fact that the statement is indeed true. Only the counter-parties (and those

granted access) can view the details of the transaction.

The ZKP has been recognized as one of the best ways to preserve privacy and confidentiality on

blockchain while achieving speed, scalability, and network stability. This technology will extend

LifeCODE’s existing privacy protections to add cryptographically assured, private settlement of

digitized assets on a distributed ledger, without a central intermediary.

Private TransactionsTransaction Manager stores and allows access to encrypted transaction data and exchanges

encrypted payloads with other participants’ transaction managers. It achieves transaction privacy by:

1) Enabling transaction senders to mark who the transaction is for, via the ‘PrivateFor’ transaction

parameter, which essentially creates a ‘Private Transaction’.

2) Replacing the payload on Private Transactions with a hash of the encrypted payload, so that

the original payload is not visible to users who are not transaction participants.

3) Storing private data off-chain, in encrypted form, in a separate state database, and by

distributing that encrypted data to the parties that are members of the transaction participants.

13 LifeCODE Databank

Page 14: LifeCODE Databank - WuXi NextCODE · WuXi NextCODE is building the premier global genomics data platform9. Our systems have been at the frontier of genomic research and clinical applications

Above is a diagram of how private transactions are processed within LifeCODE, Party A sends a

private Transaction AB to Party B (but not to Party C) and goes through the following simplified steps:

1) Party A generates a new symmetric key and a random nonce. Encrypts the transaction using

this key.

2) Party A calculates SHA3–512 hash from the encrypted transaction payload.

3) Party A encrypts the encrypted transaction again and the symmetric key from step 1 using

party B’s public key.

4) Party A transmits the encrypted payload hash from step 2 on-chain, then transmits the

transaction and symmetric key (both encrypted) to party B off-chain.

5) All parties make a call to their local Transaction Manager to determine if they hold the

transaction or not.

6) Because Party C does not hold the transaction key, they will receive a NotARecipient message

and will skip the transaction. As a result, Party C’s private state database will not be updated.

7) Party B decrypts the symmetric key and the transaction using their private key, then generates

a hash and confirms that the hash matches the on-chain hash to make sure that transaction

data is correct.

8) Party B fully decrypts the transaction using the symmetric key from step 1.

9) Parties A and B then send the decrypted payload to the EVM for contract code execution.

This will update Parties A and B’s private state databases only.

Figure 6 The LifeCODE Blockchain private transaction process

14 LifeCODE Databank

Page 15: LifeCODE Databank - WuXi NextCODE · WuXi NextCODE is building the premier global genomics data platform9. Our systems have been at the frontier of genomic research and clinical applications

Data Privacy

LifeCODE guarantees the highest-level of protection on data authenticity, privacy, and permanent

data storage. All owner-authorized health data and associated data exchanges will be securely

recorded and stored to prevent data leaks, abuses, and losses.

Features are as follows:

1) Incorporation of cloud technologies (e.g. cloud storage / cloud computing), which are involved

in all LifeCODE transactions. Because protecting the cloud environment is crucial, we employ

the most advanced tools for that. All LifeCODE’s network infrastructure is equipped with

state-of-art secure hardware and software, including strict access control policies. Auditing

mechanisms are also applied to protect LifeCODE from general network attacks such as

DDoS, web application attacks, vulnerability attacks, and malicious access to the platform.

All internal and external data transmission across the platform will be based on secure

connections, protecting sensitive data from most man-in-the-middle attacks.

2) In the LifeCODE platform, all health data is encrypted by an asymmetric cryptography. The

associated keys are used specifically for health data encryption that can be either provided

by data owners themselves or generated automatically during data uploading. The keys

are managed by the LifeCODE Blockchain-Associated Key Management System (BKMS).

LifeCODE will record all of the key transactions through blockchain technology. Access to

the key for data decryption is allowed only when the data owner’s permission is proved in

blockchain.

3) Strict data access control policies will be applied in the LifeCODE platform. No individuals or

organizations will be allowed to access the original data owner’s privacy information,

unless the owner authorizes access via LifeCODE’s blockchain-based private transactions.

Blockchain will also record all related data access and processing operations.

4) For data analysis and research purposes, health data must be exported to intermediate

searchable-encrypted data repositories. This makes it essential that the data’s owner

authorizes any use of their data. The privacy-sensitive portion of the data will be encrypted

in the same way as the original data. The heterogeneous data bank server applications will

be guarded by both system kernel-level Mandatory Access Control (MAC) and Intel Software

Guard Extensions (SGX) 23,24, protecting data in memory from malware and other attackers.

15 LifeCODE Databank

Page 16: LifeCODE Databank - WuXi NextCODE · WuXi NextCODE is building the premier global genomics data platform9. Our systems have been at the frontier of genomic research and clinical applications

Reliable Storage

The LifeCODE platform offers reliable storage (OSS/HDFS/NFS) in a secure cloud environment

that contains different types of health data matching with various forms of data persistence

implementations.

1) Object Storage Service (OSS) provides highly extensible and robust storage for genomic

data. In addition, the proven scalabilities and rich APIs are two important OSS features that

help in managing large-size files (e.g. Variant Calling Format (VCF) files).

2) High-performance data storage is required to store structured health data with huge numbers

of records and segments (e.g. genomic data), as well as those with complex mapping and

correlation (e.g. phenotypic data). LifeCODE incorporates the Hadoop Distributed File System

(HDFS) for this purpose.

3) The Network File System (NFS) and local storage will be used by worker nodes running

analysis services and APIs, providing low-latency high-throughput data access. Such NFS

storage fits well for computation-intense scenarios.

All files, streams, and objects will be encrypted before being saved to the database server

applications, cipher applications, transparent secure IO drivers, and the file system features. Both

symmetric and asymmetric cryptographic schemes will be applied for protection approaches, with the

keys being managed by the LifeCODE Blockchain-associated Key Management Services (BKMS).

To prevent data loss, we will apply full data recovery mechanisms. Multiple data storage centers in

different geographical areas will be set up for saving platform data from a single point of failure of

certain clustered nodes. Each data center consists of a group of storage nodes, backup nodes, and

recovery management nodes. The efficient synchronization, timely backup, and robust failover will

ensure permanent and safe data storage.

16 LifeCODE Databank

Page 17: LifeCODE Databank - WuXi NextCODE · WuXi NextCODE is building the premier global genomics data platform9. Our systems have been at the frontier of genomic research and clinical applications

Data Center in Area 1

Storage Nodes

Back Up Nodes

Recovery Management

Data Center in Area 3

Storage Nodes

Back Up Nodes

Recovery Management

Data Center in Area 2

Storage Nodes

Back Up Nodes

Recovery Management

Data Center in Area 4

Storage Nodes

Back Up Nodes

Recovery Management

Figure 7 The geographically distributed recoverable storage array

17 LifeCODE Databank

Page 18: LifeCODE Databank - WuXi NextCODE · WuXi NextCODE is building the premier global genomics data platform9. Our systems have been at the frontier of genomic research and clinical applications

Searchable Encryption

To support analysis and other uses of the data in LifeCODE, the searchable encryption schemes and

implementations will be used in several aspects.

A Searchable Encryption Database (SEDB) is being developed based on several academic

research results including asymmetric encryption algorithms, tag-based fingerprint extraction, and

homomorphic encryption.25,26 Uploaded genomic and clinical data is tagged into a multi-dimensional

feature matrix while the main body of data is encrypted.2728

1) The tags generated will be used as indices for searching and other query operations by

keywords. Characteristic models will be developed for feature tag generation as well as pseudo

record obfuscation processing steps.

2) Data records will be entirely or partially encrypted before being stored in the Searchable

Encrypted Database (SEDB) according to corresponding security and performance demands.

The data content cryptography resides with each data owner, giving owners full control of their

own information.

Figure 8 The searchable-encryption database scheme

18 LifeCODE Databank

Page 19: LifeCODE Databank - WuXi NextCODE · WuXi NextCODE is building the premier global genomics data platform9. Our systems have been at the frontier of genomic research and clinical applications

A transparent secure persistence layer driver will also be developed and applied to the database

server. This driver encrypts and decrypts the input/output (IO) stream automatically and transparently

when the database server is reading from or writing to storage. Combined with memory protection,

searchable data is fully encrypted outside the database application scope.

Genomic Data Storage, Search, and Analysis

The LifeCODE platform leverages WuXi NextCODE’s proprietary Genomic Ordered Relational (GOR)

database management system, which was designed specifically to store and analyze huge amounts of

genetic code and to index new genetic variations that appear in the hundreds of thousands of genomes

sequenced by WuXi NextCODE’s partners and in every public data resource. The GOR architecture is

the only big data architecture built from the ground up for genomics. The difference between the GOR

architecture and other relational database management systems is that GOR has a specific frame of

reference — a position on the genome — that facilitates the management of massive amounts of data.

GOR is specifically designed for genomic data storage, query, search, indexing, and many other analyses.

Over the last 10 years, this system has handled queries and searches, as well as many analysis

protocols of genomic data, from over 300,000 people, with an average about 5 million variants of each

individual. It is considered one of the most powerful bioinformatics tools for genomic studies. With

GOR, LifeCODE has the capability and confidence to serve platform customers performing research and

analyses on large-scale aggregated genomic data in a practical way with high performance.

To assist with search, LifeCODE also offers GORpipe, CSA APIs, and Risk Engine APIs to platform users.

GORpipe is a query tool built to compliment GOR. It provides sequence variant annotation, spatial

overlap between genomic features, filtering and aggregating genomic variants data, and more.

GORpipe uses a declarative query language combining SQL-like and shell pipe syntax.

The Clinical Sequence Analyzer (CSA) is another tool for the deep analysis of genomic data. It

is particularly useful when seeking potential correlations between genomic variants and clinical

phenotypes or diseases. CSA has backend services and APIs that convert complex queries to GOR

and GORpipe automation tasks.

Risk Engine is also a powerful tool. It uses algorithms and models based on scientific studies. It can

provide reports of risk levels and instructions from genomic results for potential and diagnosed

diseases or defects.

19 LifeCODE Databank

Page 20: LifeCODE Databank - WuXi NextCODE · WuXi NextCODE is building the premier global genomics data platform9. Our systems have been at the frontier of genomic research and clinical applications

Phenotypic Data Integration

Phenotypic data is the other core data of the LifeCODE platform. By aggregating, classifying,

integrating, and indexing phenotype data, LifeCODE will be able to offer customers more useful data

overall. Because phenotypic data is so variable, LifeCODE uses multiple approaches to manage and

integrate it.

1) HL7 compatible specifications, currently under design, will be used to build general

exchangeable and extensible data structures for both high-quality phenotype data and that

from less-developed systems or sources. The internationally well-adopted ICD10 and ICD9-

CM3 systems for diseases and operations are involved in phenotype data filtering, cleaning, and

mapping processes. These systems also connect genomic data to related clinical genomic data.

2) The data from trusted sources verified by permissioned LifeCODE blockchain is aggregated,

indexed, and linked by logic-based and experience-based protocols and adapters.

Carefully designed and tested extraction transformation and loading (ETL) procedures and

distributed terminology systems DTS APIs will be applied to the data integration in recent

implementations. Artificial intelligence (AI) technologies and machine learning-supported

automations will also play a major role in improving the data quality.

3) A Master Patient Index (MPI) mechanism has been designed to link an individual’s phenotypic

data to an indexed keychain. Binding with the LifeCODE blockchain, this keychain grants the

individual full authority to use their own phenotypic health data, while preventing data abuse.

In addition, MPI will enable data from different healthcare institutions to be integrated, with

appropriate permissions.

Data Exchange and Customer Interaction

Data exchange and interaction among users is another big opportunity for those using the LifeCODE

platform. The platform APIs and adapters are designed to handle data access via pluggable, scalable,

micro-service APIs. The LifeCODE platform Software Development Kit (SDK) will also be available for

third parties to develop their own APIs. This will make the LifeCODE ecosystem more open and active,

offering more benefits to all platform users.

20 LifeCODE Databank

Page 21: LifeCODE Databank - WuXi NextCODE · WuXi NextCODE is building the premier global genomics data platform9. Our systems have been at the frontier of genomic research and clinical applications

Many adapters are to be implemented, such as data format conversion, inter-system logic correlation

mapping, persistent storing, etc. As LifeCODE foresees the significance of standardizing data

connection and exchange, it is possible to develop more adapters.

As previously mentioned, LifeCODE blockchain is widely involved in most steps of data exchange

workflows. The most common transactions will likely be those in which individuals upload their

data to LifeCODE (directly or through authorized institutions or agents where the data is stored),

authorize others to use their genomic and phenotype data, use services provided by the platform

or third parties, compare personal health data with gene databank reference and other people, and

more. Customers, such as research institutes and hospitals that require access to certain health data,

can also rely on LifeCODE blockchain to deliver their requests to data owners, allow them to make

contracts with individuals for data transmission and use, give rewards to data owners and related

service providers, and more.

Figure 9 Snapshots and QR code for Laiyin Tribe

21 LifeCODE Databank

Page 22: LifeCODE Databank - WuXi NextCODE · WuXi NextCODE is building the premier global genomics data platform9. Our systems have been at the frontier of genomic research and clinical applications

Laiyin Tribe is the application available to all LifeCODE Databank users. The goal is to develop a safe,

decentralized, and visible personal health data center. It is a handy tool to make personal health data

self-manageable and ensure that the ownership of health data stays with the original individual. A

view of users’ health status will also be provided based on information available.

Laiyin Tribe also provides a way to capitalize users’ personal health data, using smart contracts in

LifeCODE blockchain, which are valued by the LifeCODE Token and stored in the personal LifeCODE

Blockchain Wallet. These health data as assets will continuously bring benefits to data owners

through the LifeCODE Databank. Ultimately, Laiyin Tribe will be the tool that will aggregate all forms

of personal health data and that will allow data owners to manage their data themselves.

LifeCODE Token

The LifeCODE blockchain has a built-in native token (LifeCODE Token, symbol: LCT). The current

LifeCODE tokens are issued by LifeCODE Databank, which was designed in accordance with

Ethereum’s ERC-20 protocol, with a maximum issuance of 3,000,000,000 pieces.

LCT is the token of the LifeCODE blockchain, an equity certificate on the LifeCODE Genomics

ecological chain, and an economic means of LifeCODE Genomics ecosystem. LCT’s main role is to

provide liquidity for digital asset transactions on applications, which are built based on LifeCODE, and

to serve as a payment mechanism for transactions on LCT Genomics ecosystem.

If users upload their genome data to the LifeCODE blockchain platform and their genome data is

used, they will receive tokens. If research institutions conduct studies (i.e. creating a rare disease risk

model) based on the Genomic Smart Contract, and their research yields good results, they will also

receive tokens. Professional medical institutions or individuals who use LifeCODE Databank’s data

or research results, or create their own applications based on the LifeCODE blockchain, will need to

make payments using LifeCODE Tokens (LCTs).

22 LifeCODE Databank

Page 23: LifeCODE Databank - WuXi NextCODE · WuXi NextCODE is building the premier global genomics data platform9. Our systems have been at the frontier of genomic research and clinical applications

Summary

Health information boundaries among hospitals, pharmaceutical R&Ds, institutions, doctors

and patients, and ordinary individuals are seriously hampering healthcare services from being

efficiently improved.

Unlocking these boundaries, sharing all resources among the LifeCODE data bank ecosystem

will profit all platform participants. Your data is your own asset. It should be accessible to you

and managed by you and secured by the best privacy protection available. With LifeCODE, it is.

23 LifeCODE Databank

Page 24: LifeCODE Databank - WuXi NextCODE · WuXi NextCODE is building the premier global genomics data platform9. Our systems have been at the frontier of genomic research and clinical applications

References 1. Venter, JC. et al. The sequence of the human genome. Science Vol. 291, Issue 5507, pp. 1304-

1351 (2001) doi: 10.1126/science.1058040

2. Metzker, ML. Sequencing technologies—the next generation. Nature Reviews Genetics volume

11, pages 31–46 (2010) doi: 10.1038/nrg2626

3. Marioni, RE. et al. Molecular genetic contributions to socioeconomic status and intelligence.

Intelligence Volume 44, Pages 26-32 (2014) doi: 10.1016/j.intell.2014.02.006

4. Raghupathi, W. & Raghupathi, V. Big data analytics in healthcare: promise and potential.

Health Information Science and Systems 2014, 2:3 (2014) doi: 10.1186/2047-2501-2-3

5. Ginsburg, GS & Willard, HF. Genomic and personalized medicine: foundations and

applications. Translational research Volume 154, Issue 6, Pages 277–287 (2009) doi: 10.1016/j.

trsl.2009.09.005

6. McEwen, JE., Boyer, JT. & Sun, KY. Evolving approaches to the ethical management of genomic

data. Trends in Genetics Volume 29, Issue 6, Pages 375-382 (2013) doi: 10.1016/j.tig.2013.02.001

7. Kahn, SD. On the Future of Genomic Data. Science Vol. 331, Issue 6018, pp. 728-729 (2011) doi:

10.1126/science.1197891

8. Greenbaum, D. et al. Genomics and Privacy: Implications of the New Reality of Closed Data for

the Field. PLoS Computational Biology 7(12), e1002278 (2011) doi: 10.1371/journal.pcbi.1002278

9. Buhr, S. WuXi NextCODE aims for the genomics database “gold standard” with new $240

million. (2017) https://techcrunch.com/2017/09/07/wuxi-nextcode-aims-for-the-genomics-

database-gold-standard-with-new-240-million/

10. Marx, V. The DNA of a nation. Nature volume 524, pages 503–505 (2015) doi: 10.1038/524503a

11. Lu, D. et al. Ancestral origins and genetic history of Tibetan highlanders. The American Journal

of Human Genetics Volume 99, Issue 3, Pages 580-594 (2016) doi: 10.1016/j.ajhg.2016.07.002

12. Pauwels, E. & Vidyarthi, A. Who Will Own the Secrets in Our Genes?: A US-China Race in

Artificial Intelligence and Genomics. Book, Wilson Center (2017)

13. Jain, SH. et al. The Digital Phenotype Nature Biotechnology volume 33, pages 462–463 (2015)

doi: 10.1038/nbt.3223

14. Shaywitz, D. Where Will Healthcare’s Data Scientists Find The Rich Phenotypic Data

They Need? (2014) https://www.forbes.com/sites/davidshaywitz/2014/10/10/where-will-

healthcares-data-scientists-find-the-rich-phenotypic-data-they-need/#68a712d530fe

15. Ross, MK., Wei, W. & Ohno-Machado, L. “Big data” and the electronic health record. Yearbook

of Medical Informatics 2014; 9(1), pages 97–104 (2014) doi: 10.15265/IY-2014-0003

16. Vitalik Buterin et al. A Next-Generation Smart Contract and Decentralized Application Platform

2013) https://github.com/ethereum/wiki/wiki/White-Paper

24 LifeCODE Databank

Page 25: LifeCODE Databank - WuXi NextCODE · WuXi NextCODE is building the premier global genomics data platform9. Our systems have been at the frontier of genomic research and clinical applications

17. JP Morgan Chase. Quorum Whitepaper (2016) https://github.com/jpmorganchase/quorum-docs

18. Diego Ongaro and John Ousterhout. In Search of an Understandable Consensus Algorithm.

USENIX Annual Technical Conference, pages 305-319 (2014)

19. Heidi Howard et al. Raft Refloated: Do We Have Consensus? ACM SIGOPS Operating Systems

Review - Special Issue on Repeatability and Sharing of Experimental Artifacts, pages 12-21 (2015)

20. Hyperledger. Hyperledger Architecture, Volume 1: Introduction to Hyperledger Business

Blockchain Design Philosophy and Consensus, https://www.hyperledger.org/resources/

publications#white-papers

21. Miguel Castro and Barbara Liskov. Practical Byzantine Fault Tolerance and Proactive Recovery.

ACM Transactions on Computer Systems (TOCS), pages 398-461 (2002)

22. Thomas Espel, Laurent Katz, Guillaume Robin. Proposal for Protocol on a Quorum Blockchain

with Zero Knowledge (2017) https://eprint.iacr.org/2017/1093.pdf

23. Costan, V. & Devadas, S. Intel SGX Explained. IACR Cryptology ePrint Archive, Report

2016/086 (2016)

24. McKeen, F. et al. Intel® software guard extensions (intel® sgx) support for dynamic memory

management inside an enclave. HASP 2016 Proceedings of the Hardware and Architectural

Support for Security and Privacy 2016 Article No. 10 (2016) doi: 10.1145/2948618.2954331

25. Popa, RA. & Zeldovich, N. Multi-Key Searchable Encryption. IACR Cryptology ePrint Archive (2013)

26. Salam, MI. et al. Implementation of searchable symmetric encryption for privacy‐preserving

keyword search on cloud storage. Human-centric Computing and Information Sciences (2015)

5:19 (2015) doi: 10.1186/s13673-015-0039-9

27. Shimizu, K, Nuida, K. & Rätsch, G. Efficient privacy-preserving string search and an application

in genomics. Bioinformatics Volume 32, Issue 11, Pages 1652–1661 (2016) doi: 10.1093/

bioinformatics/btw050

28. Ziegeldorf, JH. et al. BLOOM: Bloom filter based oblivious outsourced matchings. BMC Medical

Genomics BMC series - open, inclusive and trusted 201710(Suppl 2):44 (2017) doi: 10.1186/

s12920-017-0277-y

25 LifeCODE Databank