Understanding Disease Informatics System (UDIS)
A Foundation to Integrate Exploratory Translational Data
Martin Strahm, F. Hoffmann-La Roche
UDIS - A Foundation to Integrate Exploratory Translational Data
• Design criteria
– Data sources
– Data types
– Meta data
– User community
• Implementation and challenges
– High dimensional data
– Meta data
• Conclusion
DESIGN CRITERIA
Some considerations we had when starting to build UDIS
Design Criteria: Data Sources
Experimental Medicine Studies
Genetic variations of ~1000 (~2500) human individuals from ~30 ethnicities.
~20 cancer types
~100-1000 samples per type
~ 10 data types per sample
600 cell lines
Animal models, Xenografts, Tissue Profiles
Diagram axes: Pre-clinical vs. Clinical/Human Cohorts; In-house vs. Public
Design Criteria: Data Types
DNA
Sequence variation
Copy number variation
Design Criteria: Data Types
mRNA
Expression: RNA Chip
Expression: RNA Seq
Design Criteria: Meta Data
• Meta Data
– Gender
– Disease (MeSH)
– Disease (ICD10)
– Tissue
– Organ
– Treatment
– …
• Low Dimensional Data
– Clinical chemistry: Albumin, Bilirubin, Aspartate transaminase, …
– EQ-5D: Mobility, Self-Care, Usual Activities, Pain/Discomfort, Anxiety/Depression, …
Design Criteria: User Community
• Expert Analyst / Statistician
– Batch download / API
– Interface to R, Spotfire, GenePattern
• Bench Scientist / Biologist
– Simple view on simple data set
Design Criteria: User Community
• Some simple questions for a broad user community
– In which tissue is my gene of interest expressed?
– What is the frequency of my SNP of interest in a population?
– How frequent is a somatic mutation in a cancer type?
• Most questions are complicated and need expert analysis
– Correlation between disease status and SNPs in a heterogeneous population.
– Finding biomarkers for disease progression.
IMPLEMENTATION AND CHALLENGES
Implementation: High Dimensional Data
Table SEQ_VARIATION
PK: SEQ_VAR_ID
Columns: GENOME_ID, DBSNP_ID, CHROMOSOME, SEQ_START, SEQ_END, SEQ_REF, SEQ_ALT
Table SEQ_VARIATION_RESULT
PK: SEQ_VARIATION_RESULT_ID
Columns: EXPERIMENT_ID
FK1: SEQ_VAR_ID
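The two-table layout above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration only: the table and column names come from the slide, while the column types, the in-memory database, the placeholder dbSNP identifier and the sample rows are assumptions, and nothing here reflects the database engine actually used for UDIS.

```python
import sqlite3

# Table and column names follow the slide. Column types, the in-memory
# database and the sample rows are assumptions made for illustration.
SCHEMA = """
CREATE TABLE SEQ_VARIATION (
    SEQ_VAR_ID  INTEGER PRIMARY KEY,
    GENOME_ID   INTEGER,
    DBSNP_ID    TEXT,
    CHROMOSOME  TEXT,
    SEQ_START   INTEGER,
    SEQ_END     INTEGER,
    SEQ_REF     TEXT,
    SEQ_ALT     TEXT
);
CREATE TABLE SEQ_VARIATION_RESULT (
    SEQ_VARIATION_RESULT_ID INTEGER PRIMARY KEY,
    EXPERIMENT_ID           INTEGER,
    SEQ_VAR_ID              INTEGER REFERENCES SEQ_VARIATION (SEQ_VAR_ID)
);
"""


def build_demo_db():
    """Create the two tables and load one variation seen in three experiments."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # 'rs0000001' is a made-up placeholder, not a real dbSNP record.
    conn.execute(
        "INSERT INTO SEQ_VARIATION VALUES (1, 1, 'rs0000001', '1', 100, 100, 'A', 'G')"
    )
    conn.executemany(
        "INSERT INTO SEQ_VARIATION_RESULT (EXPERIMENT_ID, SEQ_VAR_ID) VALUES (?, ?)",
        [(10, 1), (11, 1), (12, 1)],
    )
    conn.commit()
    return conn


def experiments_per_variation(conn):
    """Count result rows per variation, keyed by DBSNP_ID."""
    rows = conn.execute(
        """
        SELECT v.DBSNP_ID, COUNT(r.SEQ_VARIATION_RESULT_ID)
        FROM SEQ_VARIATION v
        LEFT JOIN SEQ_VARIATION_RESULT r ON r.SEQ_VAR_ID = v.SEQ_VAR_ID
        GROUP BY v.SEQ_VAR_ID
        """
    ).fetchall()
    return dict(rows)
```

Each variation is stored once, and every observation of it adds one narrow row to SEQ_VARIATION_RESULT, so the result table is the one that grows with the number of experiments.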
Implementation: High Dimensional Data
[Figure: bar chart of row counts, axis "Rows" from 0 to 30'000'000'000. Series: BDS (RESULTS) and 1KG (VARIATIONS). Bar labels: 558'000'000 and 30'000'000'000.]
Implementation: Meta Data
Source: UDIS CV
Property name: Smoking_status
Property values: Current smoker; Former smoker; Non-smoker

Source: TCGA
Property name: Tabacco_smoking_history
Property values: Lifelong Non-smoker; Current smoker; Current reformed smoker for > 15 years; Current reformed smoker for < or = 15 years; Current Reformed Smoker, Duration Not Specified
Implementation: Meta Data
Source: CDISC
Property name: SMKCLAS
Property values: NEVER SMOKED; SMOKER; EX SMOKER

Source: Other
Property name: Smoking status
Property values: Neversmoker; Non-smoker; Passive smoker; Current smoker; Former smoker; Smoker (not further specified)
Implementation: Meta Data
• Initial work
– Decide on the terms and granularity used in your system. Multiple standards exist for common concepts and may be reused.
• For each data source
– Map property names used externally to internal property names
– Map property values used externally to internal property values
– Convert external data format (XML, CSV, XLS, RDB, …) to input format
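The per-source mapping steps can be sketched as two lookup tables, one for property names and one for property values. This is a hypothetical illustration: the external names and values are copied from the smoking-status slides, but the choice of which external value maps to which internal value, the function `harmonize` and the exception `UnmappedTerm` are assumptions, not the UDIS implementation.

```python
# Internal controlled vocabulary, as listed on the UDIS CV slide.
INTERNAL_VALUES = {
    "Smoking_status": {"Current smoker", "Former smoker", "Non-smoker"},
}

# (source, external property name) -> internal property name
PROPERTY_NAME_MAP = {
    ("TCGA", "Tabacco_smoking_history"): "Smoking_status",
    ("CDISC", "SMKCLAS"): "Smoking_status",
}

# External value -> internal value. These assignments are assumptions made
# for illustration; in practice a curator decides each one.
PROPERTY_VALUE_MAP = {
    ("CDISC", "SMKCLAS"): {
        "NEVER SMOKED": "Non-smoker",
        "SMOKER": "Current smoker",
        "EX SMOKER": "Former smoker",
    },
    ("TCGA", "Tabacco_smoking_history"): {
        "Lifelong Non-smoker": "Non-smoker",
        "Current smoker": "Current smoker",
        "Current reformed smoker for > 15 years": "Former smoker",
        "Current reformed smoker for < or = 15 years": "Former smoker",
        "Current Reformed Smoker, Duration Not Specified": "Former smoker",
    },
}


class UnmappedTerm(LookupError):
    """Raised when a property name or value still needs manual curation."""


def harmonize(source, name, value):
    """Translate one external (name, value) pair into the internal vocabulary."""
    key = (source, name)
    if key not in PROPERTY_NAME_MAP:
        raise UnmappedTerm(f"no internal property for {source}:{name}")
    internal_name = PROPERTY_NAME_MAP[key]
    try:
        internal_value = PROPERTY_VALUE_MAP[key][value]
    except KeyError:
        raise UnmappedTerm(f"no internal value for {source}:{name}={value!r}") from None
    # Guard against mapping tables that drift away from the vocabulary.
    if internal_value not in INTERNAL_VALUES[internal_name]:
        raise ValueError(f"{internal_value!r} is not a valid {internal_name}")
    return internal_name, internal_value
```

Raising an error for unmapped terms, rather than guessing, keeps values such as "Passive smoker" visible as open curation tasks. Note that the sketch collapses the TCGA duration categories into one internal value, which is the granularity trade-off decided during the initial work.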
Implementation: Meta Data
Source: FF
Property name    Property values
TOTCOM_1         INTEGER
TOTSOC_1         INTEGER
TOTCOMSOC_1      INTEGER
TOTC_1           INTEGER
TOTD_1           INTEGER
TOTE_1           INTEGER
TOT_CS_M1        INTEGER
DIAG_CS_M1       YES|NO
TOT_TSA_M1       INTEGER
DIAG_TSA_M1      YES|NO
Implementation: Meta Data
Source: FF
Property name    Property values
MOBILITY         INTEGER
SELFCARE         INTEGER
ACTIVITY         INTEGER
PAIN             INTEGER
ANXIETY          INTEGER
EQ_VAS           INTEGER
EQ5D_SCORE       INTEGER
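Declared value types such as INTEGER and YES|NO can be checked when a study is loaded. The sketch below is hypothetical: the property names and declared types are taken from the two FF slides (only a subset is listed), while the function `parse_value` and its case-insensitive handling of YES/NO are assumptions for illustration.

```python
# Declared types for a subset of the FF properties shown on the slides.
PROPERTY_TYPES = {
    "TOT_CS_M1": "INTEGER",
    "DIAG_CS_M1": "YES|NO",
    "TOT_TSA_M1": "INTEGER",
    "DIAG_TSA_M1": "YES|NO",
    "MOBILITY": "INTEGER",
    "EQ_VAS": "INTEGER",
}


def parse_value(name, raw):
    """Validate a raw string against the declared type of property `name`.

    Returns an int for INTEGER properties and the upper-cased token for
    enumerated properties; raises ValueError when the value does not fit.
    """
    declared = PROPERTY_TYPES[name]  # KeyError signals an unknown property
    text = raw.strip()
    if declared == "INTEGER":
        return int(text)  # int() raises ValueError on non-numeric input
    allowed = declared.split("|")
    token = text.upper()  # assumption: enumerated values are case-insensitive
    if token not in allowed:
        raise ValueError(f"{name}: {raw!r} is not one of {allowed}")
    return token
```

A check like this confirms only that a value is well-formed. It does not say what TOT_CS_M1 means, which is the part that needs biomedical knowledge.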
CONCLUSION
Conclusion
• Data volume is challenging, but can be managed with current technology
• Variety of annotation standards is extremely challenging and requires a lot of human work
– Huge number of property names in individual studies (thousands)
– Deep biomedical knowledge required to understand annotation
– Standardized exchange format and vocabulary would help
• We do not have the perfect solution