Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
LifeCODE Databank
The Optimal Place for Safe and EfficientHealth Data Storage and ExchangeJuly 2, 2018
Table of Contents
Announcing LifeCODE Databank ............................................................................................... 3
Platform Architecture ........................................................................................................................ 7
LifeCODE Blockchain ....................................................................................................................... 9
Permissioned Network ................................................................................................................ 11
Cambrian and Privacy .................................................................................................................. 12
Private Transactions ...................................................................................................................... 13
Data Privacy ........................................................................................................................................... 15
Reliable Storage ................................................................................................................................... 16
Searchable Encryption ..................................................................................................................... 18
Genomic Data Storage, Search, and Analysis ...................................................................... 19
Phenotypic Data Integration .......................................................................................................... 20
Data Exchange and Customer Interaction ............................................................................. 20
LifeCODE Token .................................................................................................................................. 22
Summary ................................................................................................................................................. 23
References .............................................................................................................................................. 24
2 LifeCODE Databank
Announcing LifeCODE Databank
Genomic testing is ushering in a new age of medicine. Patients can be treated more precisely and
effectively when their care is tailored to their own individual genetic makeup1,2. Determining genetic
risk for cardiovascular disease and cancer empowers individuals and their healthcare providers to
prevent those diseases or detect and begin treating them earlier. As the cost for genomic sequencing
becomes more affordable, more and more people around the world are opting to get genetic testing.
As a result, there is a growing need for a safe and efficient way to store healthcare data, while
still allowing for it to be widely shared. Genomic data is extremely sensitive. Most people are not
aware that our DNA contains information about our life expectancy, our proclivity to depression or
schizophrenia, our complete ethnic ancestry, our expected intelligence, and maybe even our political
inclinations3. Within a decade or two, the genome will likely reveal even more. In this context, ensuring
that people’s genomic data is stored responsibly and anonymously is vital for maintaining privacy. It is
also vital for scientific progress.
Personalized medicine is possible only because we can analyze genomic data from hundreds of
thousands of people. Such data allows scientists to do important research to better understand the
role of genetics in diseases and the immune system, and to create better treatments for specific
groups4,5. Only with this data can we start routinely customizing treatments for individuals, thereby
cutting costs, increasing efficiency, and providing patients the best care possible.
Most sequenced genomes are currently stored in strict access-controlled repositories. Access to
this data could help researchers identify disease-causing genetic variants and aid the discovery of
new drug targets. Numerous recent examples show how discovery of new disease-causing genes
lead to better treatment for cancer, cardiovascular disease, and metabolic conditions. Yet, concerns
over genetic data privacy may, understandably, deter individuals from contributing their genomes to
scientific studies6-8 thereby slowing medical progress.
WuXi NextCODE has emerged as one of the companies leading the charge in helping fulfill the Human
Genome Project’s original aspirations. WuXi NextCODE is uniquely positioned to securely provide an
ideal solution to the dilemma of protecting and managing genomic personal data.
3 LifeCODE Databank
With the launch of the WuXi NextCODE LifeCODE Databank, we combine our unparalleled experience
in genomic sequencing and analysis with the bold new technology of blockchain, allowing patients,
hospitals, companies, and other stakeholders to further advance the genomic revolution.
About WuXi NextCODE
WuXi NextCODE is building the premier global genomics data platform9. Our systems have been at the
frontier of genomic research and clinical applications for more than 20 years and we have analyzed
data from more than 600,000 genomes. With offices in Shanghai, Cambridge (Massachusetts),
and Reykjavik, Iceland, we serve the leading population genomics, genomic diagnostics, precision
medicine, and wellness initiatives and enterprises around the world10-12.
Our capabilities span study design, sequencing, secondary analysis, storage, interpretation, scalable
analytics, and AI and deep learning — all backed by the proven and most widely used technology for
organizing, mining, and sharing genome sequence data.
The dramatic advances in genomics mean that individuals can now obtain their own genomic data
and corresponding analysis reports relatively easily and affordably. As a result, huge amounts of
genomic data are being generated and stored in either personal devices and/or those institutions that
provide genomic sequencing services or patient cohorts for studies.
Many big genomic data projects exist, and many institutions are gathering and building their own
databases. But what is being lost is the growing amount of data created when individuals have
their genomes sequenced, not as part of a study, but for their own health or interest. What is more,
individuals have no real-time control over how their genomic data is stored and used. Nor do they
have the option of selectively allowing their data to be used for purposes of their own choosing. As
noted privacy issues are also a key concern causing many to avoid genomic sequencing in the first
place. Confidence in data security together with the freedom to pick and choose the research projects
to be involved in, may greatly increase participation.
Meanwhile, petabytes of phenotype data, including clinical results, daily behavior, diet patterns, and
general health information is also being generated at hospitals, research institutions, pharmaceutical
R&D, and wearable IoTs13-15. Such data is typically stored in centralized databases, which raises
security and data abuse concerns. Moreover, scalability and efficient data sharing and exchange are
hard to achieve based on current research and the various practices of healthcare institutions.
On the other hand, how to find an efficient way to search and exchange qualified health data has
4 LifeCODE Databank
always been a big challenge that scientific, clinical, and pharmaceutical researchers face. For example,
studies of medicines or treatments for some rare diseases is difficult due to a lack of adequate patients
for discovery research and clinical trials.
For common diseases meanwhile, which are often very genetically complex, we need very large
cohorts of patients to do effective research and clinical studies.
LifeCODE guarantees that data ownership belongs to only the individuals who upload their own
health data. This data is immediately and fully encrypted by default once it is uploaded to LifeCODE.
To ensure security, any access to this data is strictly monitored and recorded in LifeCODE’s blockchain
so that all data transactions are fully traceable. Most importantly, of course, is that the data owner
controls access and authorizes any transactions or sharing.
For instance, a pharmaceutical company could offer a data owner a certain amount of LifeCODE
“tokens” for allowing the company access to designated genomic data (e.g. VCF) that has been
searched for and found in LifeCODE. This offer will be sent through the platform and the data owner
would receive a request for authorization. The transaction is complete once permission is granted or
rejected. Data owners will always remain anonymous to the data searchers as their personal genetic
information is filtered and encrypted in LifeCODE. Again, all the health data transactions are recorded
in LifeCODE blockchain for tracing, assuring that the system is secure.
WuXi NextCODE’s LifeCODE is a global cloud-based and blockchain-enabled platform developed to
provide an ideal solution to this dilemma.
LifeCODE’s key features are that it:
◉ Allows an individual to declare ownership of their own data. All data sharing and uses must be
authorized by the data’s owner.
◉ Includes genomic and phenotypic data.
◉ Enables end-to-end blockchain technology.
◉ Enables data exchange with LifeCODE tokens.
◉ Enables natural language searches.
◉ Provides easy-to-use service APIs for businesses, individuals, and systems.
◉ Provides manageable security services.
◉ Provides access to one of the world’s premier genomic data analysis solutions — the Genomic
Ordered Relational (GOR) architecture.
5 LifeCODE Databank
◉ Features a searchable encrypted database distributed in cloud.
◉ Provides data privacy and protection. All data is encrypted throughout the system.
Transactions are anonymous and traceable in the blockchain network. Features a searchable
encrypted database distributed in cloud.
◉ Guarantees data authenticity. The data can never be changed without its owner’s
acknowledgement. Hash technique provided.
◉ Guarantees permanent and secure data storage. Data is distributed, saved, and backed up
in the cloud.
◉ Guarantees full data recovery in case of a system malfunction. All data is synched in
distributed cloud storage and can be recovered in the event of disaster.
The LifeCODE ecosystem provides rapid and efficient access to data and services for all genomic
stakeholders, including hospitals, laboratories, pharmaceutical companies, contract research
organizations, information technology vendors, IoTs, patients, and others.
For example, WeGene, a China-based company focusing on personal genetic testing and analysis
services, has significantly improved the accuracy of their analysis and the quality of user reports
through data aggregation with the LifeCODE system and the powerful toolkits it provides.
Figure 1. Potential participants in the LifeCODE ecosystem
6 LifeCODE Databank
Platform Architecture
LifeCODE’s platform integrates both genomic and phenotypic health data, establishing an ecosystem
for data exchange, interoperability, and a wider range of services for all stakeholders. It achieves this
while providing the highest level of data quality and privacy protection.
Security
General Services
Miscellaneous
Searchable- Encrypted data Repositories
LifeCODE Blockchain
Storage
Healthcare Applications
Platform API Gateway
Genomic & Phenotype Services
Network Access Control
Artificial Intelligence
Platform SDK
GOR
Cambrian engine
SEDB
Network Manager
Storages Nodes
Platform Applications
API Gateway
GORpipe
XDS
3rd Party Applications
Authentication Center
DRGs
CSA
Data Exchange Adapters
Configuration Manager
RiskEngine
CCDA
BKMS
Machine Learning
Inegration Specifications
HBase
Transaction Manager
RDBMs
Crypto Enclave
Backup Nodes
Data Anonymization
OCR
Transaction Viewer
Recovery Management
Applicatioin MAC
OAuth
Intel® SGX
MPI
Cyphers
Secure Connection
Figure 2 Components in the LifeCODE Data & Services Platform
7 LifeCODE Databank
The LifeCODE platform comprises services, applications, and subsystems of general and specific
functions. The securely protected searchable-encrypted data repositories of genomic and phenotypic
data form the platform’s base. An application programming interface (API) and related business
APIs are designed to provide customers with additional analysis and exchange capabilities. Security
modules, using proven and newly-developed techniques and schemes, will be applied to data
quality and privacy protection. The LifeCODE blockchain is one of the platform’s core components,
functioning from end–to–end to ensure privacy while allowing patients to access and share their own
data, case by case, if and when they want.
Individuals, hospitals, research institutes, and pharmaceutical companies are the major participants
in the LifeCODE ecosystem. Typically, once individuals upload their personal genomic and phenotypic
health data to the platform, business participants can access the data only after the owner grants
permission. Participants such as insurance companies or commercial healthcare service providers can
also benefit from the platform by offering professional services in exchange for LifeCODE tokens.
Figure 3 Data and token flow in the LifeCODE ecosystem
Insurance Companies
Healthcare Companies
Institutes
Pharmaceuticals
HospitalsIndividuals
8 LifeCODE Databank
Every data exchange within the LifeCODE platform will generate profit, that is, redeemable tokens, to
users. For instance, after successfully uploading their personal genomic data, individuals will receive
LifeCODE tokens that can be exchanged for discounted healthcare services, such as those offered by
health insurers or healthcare facilities. (NOTE: These organizations, while they may accept LifeCODE
tokens, will never receive a patient’s personal data unless that patient approves the data transfer.)
LifeCODE tokens will also be awarded to individuals from research institutions who are willing to
award LifeCODE tokens in exchange for access to genomic data, which will require the permission
of the patient. All transactions will be carried out on the LifeCODE platform and will be recorded in
the LifeCODE blockchain. This provides more genomic data to researchers, while allowing patients to
keep control of their own medical information.
LifeCODE Blockchain
LifeCODE blockchain is a permissioned blockchain based on the J.P. Morgan’s Quorum platform, an
enterprise-focused version of Ethereum16-17. It allows creation of private Genomic Smart Contracts
with Solidity and supports private transactions through Cambrian, a peer-to-peer encrypted message
exchange for the direct transfer of data among network participants.
Through the LifeCODE Cambrian and Searchable Encrypted Data Repository (SEDR) protocols, we
can create a massive, interactive genomic data bank wherein people can securely store, access, and
share their personal health data via private cryptographic keys. Data owners have complete control.
For example, they can choose to gain revenue from their genomic data or they can donate it.
The LifeCODE blockchain platform consists of four main components:
Crypto Enclave: Provides the Blockchain-associated Key Management System (BKMS), encryption,
and decryption of transaction data.
Consensus Engine: Uses a multiple voting-based consensus mechanism for faster block times, on-
demand block creation and transaction finality, which means that once confirmations are final, the
block cannot be bifurcated. That is, the transaction will not be revoked or rolled back.
Network Manager: Controls access to the blockchain network, enabling a permissioned network
to be created.
9 LifeCODE Databank
Transaction Manager: Allows access to encrypted transaction data for private transactions; also
manages local data storage and communication with other Transaction Managers.
LifeCODE’s Consensus Engine implementation is based on a two-step strategy.
The first step is to adopt a Raft-based consensus18 to implement a closed-membership/consortium
blockchain, which is currently in production. Raft is a consensus algorithm for managing a replicated
log. It produces a result equivalent to multi-Paxos, and it is as efficient as Paxos.
Figure 4 Components of the raft replicated state machine approach19
10 LifeCODE Databank
Raft implements consensus by first electing a distinguished leader and giving the leader complete
responsibility for managing the replicated log. The leader accepts log entries from clients,
replicates them on other nodes, and tells nodes when it is safe to apply log entries to their state
machines. A leader can fail or become disconnected from the other nodes, in which case a new
leader will be elected.
The second step is to use a Genomic Byzantine Fault Tolerance (GBFT) consensus algorithm with
instant transaction finality, a manageable validator set, and higher throughput. Our GBFT is inspired
by Hyperledger’s SBFT, and the Istanbul and NCCU BFTs20-21. This provides high-performance
Byzantine state machine replication, allowing the processing of thousands of requests per second
with sub-millisecond increases in latency. This consensus mechanism is made for genomics research
applications that require high scalability, robust security, and near absolute privacy.
Permissioned NetworkLifeCODE blockchain is engineered to be permissioned, which means the network places restrictions
on who is allowed to participate in the consensus mechanism. Participants must obtain permission
or receive an invitation to join. LifeCODE blockchain provides transparent governance within the
consortium network.
LifeCODE blockchain uses the same protocols and features as the Ethereum blockchain but maintains
the fail-safe by relying on permitted nodes. As a result, participants can benefit from blockchain
technology without the possibility of catastrophic breach.
In general, permissioned blockchain networks are also better performing and more cost-effective
than unpermissioned ones. Unpermissioned blockchain networks are public spaces and, as such, they
must face all the challenges of public goods governance, especially when it comes to ensuring the
networks’ evolution through updates to its rulebook or mechanisms of interaction. It’s not uncommon
to see large numbers of unpermissioned blockchains needing to be hard-forked when there is a
departure from consensus. As a consequence, innovation is slower among these networks.
Further, they run the risk of creating new defects as their security and consensus challenges are
continuously evolving.
Node permissioning is a feature in LifeCODE that allows only a set of permissioned nodes to connect
to the network. Permissioning is granted based on the Network Manager smart contract and the
remote key of the Geth node. The remote keys are placed in a structure-array of the smart contract.
11 LifeCODE Databank
Cambrian and PrivacyBesides using permissions, LifeCODE seeks to further improve on privacy by introducing Quorum
blockchain’s public and private transactions. The public transactions act like normal Ethereum transactions.
The private transactions, however, are encrypted, and the payload details are not disclosed.
Personal health record confidentiality and privacy has long been a concern of health and medical
institutions, one that regular blockchains have so far failed to address adequately. LifeCODE brings a
new level of security through Cambrian, a general-purpose encryption system, to transfer information
securely and privately. It is comparable to a network of Message Transfer Agents (MTA) where
messages are encrypted with PGP.
Based on the Cambrian system, the BKMS will be built up to secure the encrypted keys which are
used for health data cryptography. As a result, on-demand access to health data is granted only when
the data owner authorizes it.
The Cambrian system consists of two modules: The TransactionManager and Enclave. Much of the
cryptographically-heavy work is done within the enclave, and the communication relay is carried out
in-between.
Figure 5 The LifeCODE Blockchain Architecture
12 LifeCODE Databank
In addition to privacy for transactions, there is also privacy for smart contracts, a particular sensitivity
for research organizations. Such smart contracts could contain research strategies, transaction data,
or sensitive information.
Besides providing private transactions, we cryptographically conceal the genomic assets of a
transaction by applying “zero knowledge proofs.” At the same time, the entire network will still be
allowed to validate the integrity of the transaction payload. Zero Knowledge Proofs (ZKPs) 22 do
this by having one party prove to another party that a given statement is true, without conveying any
information apart from the fact that the statement is indeed true. Only the counter-parties (and those
granted access) can view the details of the transaction.
The ZKP has been recognized as one of the best ways to preserve privacy and confidentiality on
blockchain while achieving speed, scalability, and network stability. This technology will extend
LifeCODE’s existing privacy protections to add cryptographically assured, private settlement of
digitized assets on a distributed ledger, without a central intermediary.
Private TransactionsTransaction Manager stores and allows access to encrypted transaction data and exchanges
encrypted payloads with other participants’ transaction managers. It achieves transaction privacy by:
1) Enabling transaction senders to mark who the transaction is for, via the ‘PrivateFor’ transaction
parameter, which essentially creates a ‘Private Transaction’.
2) Replacing the payload on Private Transactions with a hash of the encrypted payload, so that
the original payload is not visible to users who are not transaction participants.
3) Storing private data off-chain, in encrypted form, in a separate state database, and by
distributing that encrypted data to the parties that are members of the transaction participants.
13 LifeCODE Databank
Above is a diagram of how private transactions are processed within LifeCODE, Party A sends a
private Transaction AB to Party B (but not to Party C) and goes through the following simplified steps:
1) Party A generates a new symmetric key and a random nonce. Encrypts the transaction using
this key.
2) Party A calculates SHA3–512 hash from the encrypted transaction payload.
3) Party A encrypts the encrypted transaction again and the symmetric key from step 1 using
party B’s public key.
4) Party A transmits the encrypted payload hash from step 2 on-chain, then transmits the
transaction and symmetric key (both encrypted) to party B off-chain.
5) All parties make a call to their local Transaction Manager to determine if they hold the
transaction or not.
6) Because Party C does not hold the transaction key, they will receive a NotARecipient message
and will skip the transaction. As a result, Party C’s private state database will not be updated.
7) Party B decrypts the symmetric key and the transaction using their private key, then generates
a hash and confirms that the hash matches the on-chain hash to make sure that transaction
data is correct.
8) Party B fully decrypts the transaction using the symmetric key from step 1.
9) Parties A and B then send the decrypted payload to the EVM for contract code execution.
This will update Parties A and B’s private state databases only.
Figure 6 The LifeCODE Blockchain private transaction process
14 LifeCODE Databank
Data Privacy
LifeCODE guarantees the highest-level of protection on data authenticity, privacy, and permanent
data storage. All owner-authorized health data and associated data exchanges will be securely
recorded and stored to prevent data leaks, abuses, and losses.
Features are as follows:
1) Incorporation of cloud technologies (e.g. cloud storage / cloud computing), which are involved
in all LifeCODE transactions. Because protecting the cloud environment is crucial, we employ
the most advanced tools for that. All LifeCODE’s network infrastructure is equipped with
state-of-art secure hardware and software, including strict access control policies. Auditing
mechanisms are also applied to protect LifeCODE from general network attacks such as
DDoS, web application attacks, vulnerability attacks, and malicious access to the platform.
All internal and external data transmission across the platform will be based on secure
connections, protecting sensitive data from most man-in-the-middle attacks.
2) In the LifeCODE platform, all health data is encrypted by an asymmetric cryptography. The
associated keys are used specifically for health data encryption that can be either provided
by data owners themselves or generated automatically during data uploading. The keys
are managed by the LifeCODE Blockchain-Associated Key Management System (BKMS).
LifeCODE will record all of the key transactions through blockchain technology. Access to
the key for data decryption is allowed only when the data owner’s permission is proved in
blockchain.
3) Strict data access control policies will be applied in the LifeCODE platform. No individuals or
organizations will be allowed to access the original data owner’s privacy information,
unless the owner authorizes access via LifeCODE’s blockchain-based private transactions.
Blockchain will also record all related data access and processing operations.
4) For data analysis and research purposes, health data must be exported to intermediate
searchable-encrypted data repositories. This makes it essential that the data’s owner
authorizes any use of their data. The privacy-sensitive portion of the data will be encrypted
in the same way as the original data. The heterogeneous data bank server applications will
be guarded by both system kernel-level Mandatory Access Control (MAC) and Intel Software
Guard Extensions (SGX) 23,24, protecting data in memory from malware and other attackers.
15 LifeCODE Databank
Reliable Storage
The LifeCODE platform offers reliable storage (OSS/HDFS/NFS) in a secure cloud environment
that contains different types of health data matching with various forms of data persistence
implementations.
1) Object Storage Service (OSS) provides highly extensible and robust storage for genomic
data. In addition, the proven scalabilities and rich APIs are two important OSS features that
help in managing large-size files (e.g. Variant Calling Format (VCF) files).
2) High-performance data storage is required to store structured health data with huge numbers
of records and segments (e.g. genomic data), as well as those with complex mapping and
correlation (e.g. phenotypic data). LifeCODE incorporates the Hadoop Distributed File System
(HDFS) for this purpose.
3) The Network File System (NFS) and local storage will be used by worker nodes running
analysis services and APIs, providing low-latency high-throughput data access. Such NFS
storage fits well for computation-intense scenarios.
All files, streams, and objects will be encrypted before being saved to the database server
applications, cipher applications, transparent secure IO drivers, and the file system features. Both
symmetric and asymmetric cryptographic schemes will be applied for protection approaches, with the
keys being managed by the LifeCODE Blockchain-associated Key Management Services (BKMS).
To prevent data loss, we will apply full data recovery mechanisms. Multiple data storage centers in
different geographical areas will be set up for saving platform data from a single point of failure of
certain clustered nodes. Each data center consists of a group of storage nodes, backup nodes, and
recovery management nodes. The efficient synchronization, timely backup, and robust failover will
ensure permanent and safe data storage.
16 LifeCODE Databank
Data Center in Area 1
Storage Nodes
Back Up Nodes
Recovery Management
Data Center in Area 3
Storage Nodes
Back Up Nodes
Recovery Management
Data Center in Area 2
Storage Nodes
Back Up Nodes
Recovery Management
Data Center in Area 4
Storage Nodes
Back Up Nodes
Recovery Management
Figure 7 The geographically distributed recoverable storage array
17 LifeCODE Databank
Searchable Encryption
To support analysis and other uses of the data in LifeCODE, the searchable encryption schemes and
implementations will be used in several aspects.
A Searchable Encryption Database (SEDB) is being developed based on several academic
research results including asymmetric encryption algorithms, tag-based fingerprint extraction, and
homomorphic encryption.25,26 Uploaded genomic and clinical data is tagged into a multi-dimensional
feature matrix while the main body of data is encrypted.2728
1) The tags generated will be used as indices for searching and other query operations by
keywords. Characteristic models will be developed for feature tag generation as well as pseudo
record obfuscation processing steps.
2) Data records will be entirely or partially encrypted before being stored in the Searchable
Encrypted Database (SEDB) according to corresponding security and performance demands.
The data content cryptography resides with each data owner, giving owners full control of their
own information.
Figure 8 The searchable-encryption database scheme
18 LifeCODE Databank
A transparent secure persistence layer driver will also be developed and applied to the database
server. This driver encrypts and decrypts the input/output (IO) stream automatically and transparently
when the database server is reading from or writing to storage. Combined with memory protection,
searchable data is fully encrypted outside the database application scope.
Genomic Data Storage, Search, and Analysis
The LifeCODE platform leverages WuXi NextCODE’s proprietary Genomic Ordered Relational (GOR)
database management system, which was designed specifically to store and analyze huge amounts of
genetic code and to index new genetic variations that appear in the hundreds of thousands of genomes
sequenced by WuXi NextCODE’s partners and in every public data resource. The GOR architecture is
the only big data architecture built from the ground up for genomics. The difference between the GOR
architecture and other relational database management systems is that GOR has a specific frame of
reference — a position on the genome — that facilitates the management of massive amounts of data.
GOR is specifically designed for genomic data storage, query, search, indexing, and many other analyses.
Over the last 10 years, this system has handled queries and searches, as well as many analysis
protocols of genomic data, from over 300,000 people, with an average about 5 million variants of each
individual. It is considered one of the most powerful bioinformatics tools for genomic studies. With
GOR, LifeCODE has the capability and confidence to serve platform customers performing research and
analyses on large-scale aggregated genomic data in a practical way with high performance.
To assist with search, LifeCODE also offers GORpipe, CSA APIs, and Risk Engine APIs to platform users.
GORpipe is a query tool built to compliment GOR. It provides sequence variant annotation, spatial
overlap between genomic features, filtering and aggregating genomic variants data, and more.
GORpipe uses a declarative query language combining SQL-like and shell pipe syntax.
The Clinical Sequence Analyzer (CSA) is another tool for the deep analysis of genomic data. It
is particularly useful when seeking potential correlations between genomic variants and clinical
phenotypes or diseases. CSA has backend services and APIs that convert complex queries to GOR
and GORpipe automation tasks.
Risk Engine is also a powerful tool. It uses algorithms and models based on scientific studies. It can
provide reports of risk levels and instructions from genomic results for potential and diagnosed
diseases or defects.
19 LifeCODE Databank
Phenotypic Data Integration
Phenotypic data is the other core data of the LifeCODE platform. By aggregating, classifying,
integrating, and indexing phenotype data, LifeCODE will be able to offer customers more useful data
overall. Because phenotypic data is so variable, LifeCODE uses multiple approaches to manage and
integrate it.
1) HL7 compatible specifications, currently under design, will be used to build general
exchangeable and extensible data structures for both high-quality phenotype data and that
from less-developed systems or sources. The internationally well-adopted ICD10 and ICD9-
CM3 systems for diseases and operations are involved in phenotype data filtering, cleaning, and
mapping processes. These systems also connect genomic data to related clinical genomic data.
2) The data from trusted sources verified by permissioned LifeCODE blockchain is aggregated,
indexed, and linked by logic-based and experience-based protocols and adapters.
Carefully designed and tested extraction transformation and loading (ETL) procedures and
distributed terminology systems DTS APIs will be applied to the data integration in recent
implementations. Artificial intelligence (AI) technologies and machine learning-supported
automations will also play a major role in improving the data quality.
3) A Master Patient Index (MPI) mechanism has been designed to link an individual’s phenotypic
data to an indexed keychain. Binding with the LifeCODE blockchain, this keychain grants the
individual full authority to use their own phenotypic health data, while preventing data abuse.
In addition, MPI will enable data from different healthcare institutions to be integrated, with
appropriate permissions.
Data Exchange and Customer Interaction
Data exchange and interaction among users is another big opportunity for those using the LifeCODE
platform. The platform APIs and adapters are designed to handle data access via pluggable, scalable,
micro-service APIs. The LifeCODE platform Software Development Kit (SDK) will also be available for
third parties to develop their own APIs. This will make the LifeCODE ecosystem more open and active,
offering more benefits to all platform users.
20 LifeCODE Databank
Many adapters are to be implemented, such as data format conversion, inter-system logic correlation
mapping, persistent storing, etc. As LifeCODE foresees the significance of standardizing data
connection and exchange, it is possible to develop more adapters.
As previously mentioned, LifeCODE blockchain is widely involved in most steps of data exchange
workflows. The most common transactions will likely be those in which individuals upload their
data to LifeCODE (directly or through authorized institutions or agents where the data is stored),
authorize others to use their genomic and phenotype data, use services provided by the platform
or third parties, compare personal health data with gene databank reference and other people, and
more. Customers, such as research institutes and hospitals that require access to certain health data,
can also rely on LifeCODE blockchain to deliver their requests to data owners, allow them to make
contracts with individuals for data transmission and use, give rewards to data owners and related
service providers, and more.
Figure 9 Snapshots and QR code for Laiyin Tribe
21 LifeCODE Databank
Laiyin Tribe is the application available to all LifeCODE Databank users. The goal is to develop a safe,
decentralized, and visible personal health data center. It is a handy tool to make personal health data
self-manageable and ensure that the ownership of health data stays with the original individual. A
view of users’ health status will also be provided based on information available.
Laiyin Tribe also provides a way to capitalize users’ personal health data, using smart contracts in
LifeCODE blockchain, which are valued by the LifeCODE Token and stored in the personal LifeCODE
Blockchain Wallet. These health data as assets will continuously bring benefits to data owners
through the LifeCODE Databank. Ultimately, Laiyin Tribe will be the tool that will aggregate all forms
of personal health data and that will allow data owners to manage their data themselves.
LifeCODE Token
The LifeCODE blockchain has a built-in native token (LifeCODE Token, symbol: LCT). The current
LifeCODE tokens are issued by LifeCODE Databank, which was designed in accordance with
Ethereum’s ERC-20 protocol, with a maximum issuance of 3,000,000,000 pieces.
LCT is the token of the LifeCODE blockchain, an equity certificate on the LifeCODE Genomics
ecological chain, and an economic means of LifeCODE Genomics ecosystem. LCT’s main role is to
provide liquidity for digital asset transactions on applications, which are built based on LifeCODE, and
to serve as a payment mechanism for transactions on LCT Genomics ecosystem.
If users upload their genome data to the LifeCODE blockchain platform and their genome data is
used, they will receive tokens. If research institutions conduct studies (i.e. creating a rare disease risk
model) based on the Genomic Smart Contract, and their research yields good results, they will also
receive tokens. Professional medical institutions or individuals who use LifeCODE Databank’s data
or research results, or create their own applications based on the LifeCODE blockchain, will need to
make payments using LifeCODE Tokens (LCTs).
22 LifeCODE Databank
Summary
Health information boundaries among hospitals, pharmaceutical R&Ds, institutions, doctors
and patients, and ordinary individuals are seriously hampering healthcare services from being
efficiently improved.
Unlocking these boundaries, sharing all resources among the LifeCODE data bank ecosystem
will profit all platform participants. Your data is your own asset. It should be accessible to you
and managed by you and secured by the best privacy protection available. With LifeCODE, it is.
23 LifeCODE Databank
References 1. Venter, JC. et al. The sequence of the human genome. Science Vol. 291, Issue 5507, pp. 1304-
1351 (2001) doi: 10.1126/science.1058040
2. Metzker, ML. Sequencing technologies—the next generation. Nature Reviews Genetics volume
11, pages 31–46 (2010) doi: 10.1038/nrg2626
3. Marioni, RE. et al. Molecular genetic contributions to socioeconomic status and intelligence.
Intelligence Volume 44, Pages 26-32 (2014) doi: 10.1016/j.intell.2014.02.006
4. Raghupathi, W. & Raghupathi, V. Big data analytics in healthcare: promise and potential.
Health Information Science and Systems 2014, 2:3 (2014) doi: 10.1186/2047-2501-2-3
5. Ginsburg, GS & Willard, HF. Genomic and personalized medicine: foundations and
applications. Translational research Volume 154, Issue 6, Pages 277–287 (2009) doi: 10.1016/j.
trsl.2009.09.005
6. McEwen, JE., Boyer, JT. & Sun, KY. Evolving approaches to the ethical management of genomic
data. Trends in Genetics Volume 29, Issue 6, Pages 375-382 (2013) doi: 10.1016/j.tig.2013.02.001
7. Kahn, SD. On the Future of Genomic Data. Science Vol. 331, Issue 6018, pp. 728-729 (2011) doi:
10.1126/science.1197891
8. Greenbaum, D. et al. Genomics and Privacy: Implications of the New Reality of Closed Data for
the Field. PLoS Computational Biology 7(12), e1002278 (2011) doi: 10.1371/journal.pcbi.1002278
9. Buhr, S. WuXi NextCODE aims for the genomics database “gold standard” with new $240
million. (2017) https://techcrunch.com/2017/09/07/wuxi-nextcode-aims-for-the-genomics-
database-gold-standard-with-new-240-million/
10. Marx, V. The DNA of a nation. Nature volume 524, pages 503–505 (2015) doi: 10.1038/524503a
11. Lu, D. et al. Ancestral origins and genetic history of Tibetan highlanders. The American Journal
of Human Genetics Volume 99, Issue 3, Pages 580-594 (2016) doi: 10.1016/j.ajhg.2016.07.002
12. Pauwels, E. & Vidyarthi, A. Who Will Own the Secrets in Our Genes?: A US-China Race in
Artificial Intelligence and Genomics. Book, Wilson Center (2017)
13. Jain, SH. et al. The Digital Phenotype Nature Biotechnology volume 33, pages 462–463 (2015)
doi: 10.1038/nbt.3223
14. Shaywitz, D. Where Will Healthcare’s Data Scientists Find The Rich Phenotypic Data
They Need? (2014) https://www.forbes.com/sites/davidshaywitz/2014/10/10/where-will-
healthcares-data-scientists-find-the-rich-phenotypic-data-they-need/#68a712d530fe
15. Ross, MK., Wei, W. & Ohno-Machado, L. “Big data” and the electronic health record. Yearbook
of Medical Informatics 2014; 9(1), pages 97–104 (2014) doi: 10.15265/IY-2014-0003
16. Vitalik Buterin et al. A Next-Generation Smart Contract and Decentralized Application Platform
2013) https://github.com/ethereum/wiki/wiki/White-Paper
24 LifeCODE Databank
17. JP Morgan Chase. Quorum Whitepaper (2016) https://github.com/jpmorganchase/quorum-docs
18. Diego Ongaro and John Ousterhout. In Search of an Understandable Consensus Algorithm.
USENIX Annual Technical Conference, pages 305-319 (2014)
19. Heidi Howard et al. Raft Refloated: Do We Have Consensus? ACM SIGOPS Operating Systems
Review - Special Issue on Repeatability and Sharing of Experimental Artifacts, pages 12-21 (2015)
20. Hyperledger. Hyperledger Architecture, Volume 1: Introduction to Hyperledger Business
Blockchain Design Philosophy and Consensus, https://www.hyperledger.org/resources/
publications#white-papers
21. Miguel Castro and Barbara Liskov. Practical Byzantine Fault Tolerance and Proactive Recovery.
ACM Transactions on Computer Systems (TOCS), pages 398-461 (2002)
22. Thomas Espel, Laurent Katz, Guillaume Robin. Proposal for Protocol on a Quorum Blockchain
with Zero Knowledge (2017) https://eprint.iacr.org/2017/1093.pdf
23. Costan, V. & Devadas, S. Intel SGX Explained. IACR Cryptology ePrint Archive, Report
2016/086 (2016)
24. McKeen, F. et al. Intel® software guard extensions (intel® sgx) support for dynamic memory
management inside an enclave. HASP 2016 Proceedings of the Hardware and Architectural
Support for Security and Privacy 2016 Article No. 10 (2016) doi: 10.1145/2948618.2954331
25. Popa, RA. & Zeldovich, N. Multi-Key Searchable Encryption. IACR Cryptology ePrint Archive (2013)
26. Salam, MI. et al. Implementation of searchable symmetric encryption for privacy‐preserving
keyword search on cloud storage. Human-centric Computing and Information Sciences (2015)
5:19 (2015) doi: 10.1186/s13673-015-0039-9
27. Shimizu, K, Nuida, K. & Rätsch, G. Efficient privacy-preserving string search and an application
in genomics. Bioinformatics Volume 32, Issue 11, Pages 1652–1661 (2016) doi: 10.1093/
bioinformatics/btw050
28. Ziegeldorf, JH. et al. BLOOM: Bloom filter based oblivious outsourced matchings. BMC Medical
Genomics BMC series - open, inclusive and trusted 201710(Suppl 2):44 (2017) doi: 10.1186/
s12920-017-0277-y
25 LifeCODE Databank