Understanding Disease Informatics System (UDIS): A Foundation to Integrate Exploratory Translational Data
Martin Strahm, F. Hoffmann-La Roche

Eagle Bioinformatics Symposium: 1.Martin Strahm: Understanding Disease Informatics Systems: A Foundation to Integrate Exploratory Translational Data


DESCRIPTION

UDIS is the solution we built in-house to integrate -omics data in one place. The system brings together clinical (human cohort) and pre-clinical (animal, cell line) data from internal and public sources. It serves as a data source for simple visualizations suitable for many scientists and, via batch download, for expert analysts. The first challenge in building such a system is the huge amount of data that has to be stored and made accessible with reasonable performance. The second is the variety of data sources, which store their data in different formats and use different vocabularies. We overcame the first challenge quite easily by bringing the right talent together. The second is currently addressed by a reasonably large investment in resources for data loading and curation, but would benefit tremendously from a single industry-wide standard.



UDIS - A Foundation to Integrate Exploratory Translational Data

• Design criteria
– Data sources
– Data types
– Meta data
– User community

• Implementation and challenges
– High dimensional data
– Meta data

• Conclusion


DESIGN CRITERIA

Some considerations we had when starting to build UDIS


Design Criteria: Data Sources

Experimental Medicine Studies

Genetic variations of ~1000 (~2500) human individuals from ~30 ethnicities.

~20 cancer types

~100-1000 samples per type

~10 data types per sample

600 cell lines

Animal models, Xenografts, Tissue Profiles

[Diagram axes: Pre-clinical vs. Clinical/Human Cohorts; In-house vs. Public]


Design Criteria: Data Types

DNA

Sequence variation

Copy number variation


Design Criteria: Data Types

mRNA

Expression: RNA Chip

Expression: RNA Seq


Design Criteria: Meta Data

Meta Data
– Gender
– Disease (MeSH)
– Disease (ICD10)
– Tissue
– Organ
– Treatment
– …

Low Dimensional Data
– Clinical chemistry: Albumin, Bilirubin, Aspartate transaminase, …
– EQ-5D: Mobility, Self-Care, Usual Activities, Pain/Discomfort, Anxiety/Depression
– …


Design Criteria: User Community

Bench Scientist / Biologist
– Simple view on simple data set

Expert Analyst / Statistician
– Batch download / API
– Interface to R, Spotfire, GenePattern


Design Criteria: User Community

• Some simple questions for a broad user community
– In which tissue is my gene of interest expressed?
– What is the frequency of my SNP of interest in a population?
– How frequent is a somatic mutation in a cancer type?

• Most questions are complicated and need expert analysis
– Correlation between disease status and SNPs in a heterogeneous population.
– Finding biomarkers for disease progression.
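As an illustration of one of the simple questions above (the frequency of a SNP in a population), here is a minimal Python sketch. It is not taken from UDIS: the function name and the input encoding (number of alternate alleles per diploid individual, `None` for a missing call) are assumptions made for this example.

```python
def alt_allele_frequency(genotypes):
    """Return the alternate-allele frequency for one SNP.

    `genotypes` holds one entry per diploid individual: the number of
    alternate alleles (0, 1 or 2), or None when the genotype was not
    called. This encoding is an assumption for the sketch.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes for this SNP")
    # Each diploid individual contributes two alleles to the total.
    return sum(called) / (2 * len(called))


# Four called individuals carrying 0, 1, 2 and 1 alternate alleles.
frequency = alt_allele_frequency([0, 1, 2, None, 1])
```

Missing calls are dropped from both numerator and denominator; other conventions are possible.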


IMPLEMENTATION AND CHALLENGES


Implementation: High Dimensional Data

SEQ_VARIATION
PK: SEQ_VAR_ID
Columns: GENOME_ID, DBSNP_ID, CHROMOSOME, SEQ_START, SEQ_END, SEQ_REF, SEQ_ALT

SEQ_VARIATION_RESULT
PK: SEQ_VARIATION_RESULT_ID
Columns: EXPERIMENT_ID
FK1: SEQ_VAR_ID
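The two tables on this slide can be sketched as follows. This is a hedged illustration using Python's built-in `sqlite3` module, not the actual UDIS implementation: the slide gives only table names, column names and key roles, so the column types, the choice of SQLite, the demo rows (including the placeholder identifier `rs_demo`) and the helper function names are all assumptions.

```python
import sqlite3

# Table and column names follow the slide; the column types are assumed.
SCHEMA = """
CREATE TABLE SEQ_VARIATION (
    SEQ_VAR_ID  INTEGER PRIMARY KEY,
    GENOME_ID   INTEGER,
    DBSNP_ID    TEXT,
    CHROMOSOME  TEXT,
    SEQ_START   INTEGER,
    SEQ_END     INTEGER,
    SEQ_REF     TEXT,
    SEQ_ALT     TEXT
);
CREATE TABLE SEQ_VARIATION_RESULT (
    SEQ_VARIATION_RESULT_ID INTEGER PRIMARY KEY,
    EXPERIMENT_ID           INTEGER,
    SEQ_VAR_ID              INTEGER REFERENCES SEQ_VARIATION (SEQ_VAR_ID)
);
"""


def build_demo_db():
    """Create an in-memory database with one made-up variation and two results."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO SEQ_VARIATION VALUES (1, 1, 'rs_demo', '1', 100, 100, 'A', 'G')"
    )
    conn.executemany(
        "INSERT INTO SEQ_VARIATION_RESULT VALUES (?, ?, ?)",
        [(1, 10, 1), (2, 11, 1)],
    )
    return conn


def experiments_with_variant(conn, dbsnp_id):
    """Return the experiment IDs in which the given variation was observed."""
    rows = conn.execute(
        """
        SELECT r.EXPERIMENT_ID
        FROM SEQ_VARIATION_RESULT AS r
        JOIN SEQ_VARIATION AS v ON v.SEQ_VAR_ID = r.SEQ_VAR_ID
        WHERE v.DBSNP_ID = ?
        ORDER BY r.EXPERIMENT_ID
        """,
        (dbsnp_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

Each variation is stored once in SEQ_VARIATION, and every observation in an experiment adds only a narrow row to SEQ_VARIATION_RESULT, which is the table that grows with the number of samples.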


Implementation: High Dimensional Data

[Bar chart of row counts. Axis: Rows, 0 to 30'000'000'000. Bar values: 558'000'000 and 30'000'000'000. Labels: BDS (RESULTS), 1KG (VARIATIONS).]


Implementation: Meta Data

Source: UDIS CV
Property name: Smoking_status
Property values: Current smoker; Former smoker; Non-smoker

Source: TCGA
Property name: Tabacco_smoking_history
Property values: Lifelong Non-smoker; Current smoker; Current reformed smoker for > 15 years; Current reformed smoker for < or = 15 years; Current Reformed Smoker, Duration Not Specified


Implementation: Meta Data

Source: CDISC
Property name: SMKCLAS
Property values: NEVER SMOKED; SMOKER; EX SMOKER

Source: Other
Property name: Smoking status
Property values: Neversmoker; Non-smoker; Passive smoker; Current smoker; Former smoker; Smoker (not further specified)


Implementation: Meta Data

Initial work

Decide on terms and granularity used in your system. There are multiple standards out there which may be used for common concepts.

For each data source

– Map property names used externally to internal property names
– Map property values used externally to internal property values
– Convert external data format (XML, CSV, XLS, RDB, …) to input format
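The per-source steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the UDIS loader: the property names and values come from the smoking-status slides (with the TCGA name spelled as on the slide), but the assignment of each external value to an internal value is a guess made for this example, and the `sample_id` column, the record layout and the function name `load_csv` are hypothetical.

```python
import csv
import io

# External (source, property name) -> internal property name.
PROPERTY_NAME_MAP = {
    ("TCGA", "Tabacco_smoking_history"): "Smoking_status",
    ("CDISC", "SMKCLAS"): "Smoking_status",
}

# External values -> internal controlled vocabulary.
# These assignments are illustrative assumptions, not curated mappings.
PROPERTY_VALUE_MAP = {
    ("TCGA", "Tabacco_smoking_history"): {
        "Lifelong Non-smoker": "Non-smoker",
        "Current smoker": "Current smoker",
        "Current reformed smoker for > 15 years": "Former smoker",
        "Current reformed smoker for < or = 15 years": "Former smoker",
        "Current Reformed Smoker, Duration Not Specified": "Former smoker",
    },
    ("CDISC", "SMKCLAS"): {
        "NEVER SMOKED": "Non-smoker",
        "SMOKER": "Current smoker",
        "EX SMOKER": "Former smoker",
    },
}


def load_csv(source, csv_text):
    """Convert one external CSV export into internal records.

    Returns (records, unmapped). Anything without a mapping is collected
    in `unmapped` for manual curation instead of being guessed.
    """
    records, unmapped = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = {"source": source, "sample_id": row["sample_id"]}
        for column, raw in row.items():
            if column == "sample_id":
                continue
            key = (source, column)
            name = PROPERTY_NAME_MAP.get(key)
            value = PROPERTY_VALUE_MAP.get(key, {}).get(raw)
            if name is None or value is None:
                unmapped.append((source, column, raw))
                continue
            record[name] = value
        records.append(record)
    return records, unmapped
```

Collecting unmapped entries rather than dropping them silently reflects the point that curation needs human review; a real loader would also need readers for the other formats listed (XML, XLS, RDB).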


Implementation: Meta Data

Source: FF

Property name    Property values
TOTCOM_1         INTEGER
TOTSOC_1         INTEGER
TOTCOMSOC_1      INTEGER
TOTC_1           INTEGER
TOTD_1           INTEGER
TOTE_1           INTEGER
TOT_CS_M1        INTEGER
DIAG_CS_M1       YES|NO
TOT_TSA_M1       INTEGER
DIAG_TSA_M1      YES|NO


Implementation: Meta Data

Source: FF

Property name    Property values
MOBILITY         INTEGER
SELFCARE         INTEGER
ACTIVITY         INTEGER
PAIN             INTEGER
ANXIETY          INTEGER
EQ_VAS           INTEGER
EQ5D_SCORE       INTEGER
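The property lists on the two FF slides pair each property name with a value type (INTEGER or YES|NO). A loader could use such a declaration to check incoming values, as in the following Python sketch. The dictionary is copied from the slides; the function name `parse_value`, the choice to return booleans for YES|NO, and the tolerance for lowercase input and surrounding whitespace are assumptions made for this example.

```python
# Declared value types, as listed on the two FF slides.
PROPERTY_TYPES = {
    "TOTCOM_1": "INTEGER",
    "TOTSOC_1": "INTEGER",
    "TOTCOMSOC_1": "INTEGER",
    "TOTC_1": "INTEGER",
    "TOTD_1": "INTEGER",
    "TOTE_1": "INTEGER",
    "TOT_CS_M1": "INTEGER",
    "DIAG_CS_M1": "YES|NO",
    "TOT_TSA_M1": "INTEGER",
    "DIAG_TSA_M1": "YES|NO",
    "MOBILITY": "INTEGER",
    "SELFCARE": "INTEGER",
    "ACTIVITY": "INTEGER",
    "PAIN": "INTEGER",
    "ANXIETY": "INTEGER",
    "EQ_VAS": "INTEGER",
    "EQ5D_SCORE": "INTEGER",
}


def parse_value(name, raw):
    """Parse a raw string according to the declared type of `name`.

    Raises KeyError for an undeclared property and ValueError for a
    value that does not fit the declared type.
    """
    kind = PROPERTY_TYPES[name]
    text = raw.strip()
    if kind == "INTEGER":
        return int(text)  # int() raises ValueError on non-integer text
    if kind == "YES|NO":
        upper = text.upper()
        if upper not in ("YES", "NO"):
            raise ValueError(f"{name}: expected YES or NO, got {raw!r}")
        return upper == "YES"
    raise ValueError(f"{name}: unsupported declared type {kind!r}")
```

A type check of this kind catches format errors only; knowing what a name such as TOT_CS_M1 means still requires the domain knowledge mentioned in the conclusion.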


CONCLUSION


Conclusion

• Data volume is challenging, but can be managed with current technology

• Variety of annotation standards is extremely challenging and requires a lot of human work

– Huge number of property names in individual studies (thousands)

– Deep biomedical knowledge required to understand the annotations

– Standardized exchange format and vocabulary would help

• We do not have the perfect solution


Doing now what patients need next