Eagle Bioinformatics Symposium: 1. Martin Strahm: Understanding Disease Informatics Systems: A Foundation to Integrate Exploratory Translational Data

Understanding Disease Informatics System (UDIS)
A Foundation to Integrate Exploratory Translational Data
Martin Strahm, F. Hoffmann – La Roche

UDIS - A Foundation to Integrate Exploratory Translational Data

• Design criteria
– Data sources
– Data types
– Meta data
– User community

• Implementation and challenges
– High dimensional data
– Meta data

• Conclusion

DESIGN CRITERIA
Some considerations we had when starting to build UDIS

Design Criteria: Data Sources

Experimental Medicine Studies

Genetic variations of ~1000 (~2500) human individuals from ~30 ethnicities.

~20 cancer types

~100-1000 samples per type

~ 10 data types per sample

600 cell lines

Animal models, Xenografts, Tissue Profiles

Pre-clinical Clinical/Human Cohorts

In-house

Public

http://www.ncbi.nlm.nih.gov/geo/

Design Criteria: Data Types

DNA

Sequence variation
Copy number variation

Design Criteria: Data Types

mRNA

Expression: RNA Chip

Expression: RNA Seq

Design Criteria: Meta Data

• Meta Data
– Gender
– Disease (MeSH)
– Disease (ICD10)
– Tissue
– Organ
– Treatment
– …

• Low Dimensional Data
– Clinical chemistry: Albumin, Bilirubin, Aspartate transaminase, …
– EQ-5D: Mobility, Self-Care, Usual Activities, Pain/Discomfort, Anxiety/Depression, …

Design Criteria: User Community

Expert Analyst / Statistician

Bench Scientist / Biologist

Simple view on simple data set

Batch download / API
Interface to R, Spotfire, GenePattern

Design Criteria: User Community

• Some simple questions for a broad user community
– In which tissue is my gene of interest expressed?
– What is the frequency of my SNP of interest in a population?
– How frequent is a somatic mutation in a cancer type?

• Most questions are complicated and need expert analysis
– Correlation between disease status and SNPs in a heterogeneous population.
– Finding biomarker for disease progression.
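One of the simple questions above, the frequency of a SNP in a population, reduces to counting alleles. The following is a minimal sketch, not the UDIS implementation; the function name `allele_frequencies` and the genotype data are made up for illustration, and diploid genotypes are assumed.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Return the frequency of each allele across diploid genotypes.

    `genotypes` is an iterable of 2-tuples, one per individual,
    e.g. ("A", "G") for a heterozygous carrier.
    """
    # Flatten all genotypes into one stream of alleles and count them.
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no genotypes supplied")
    return {allele: n / total for allele, n in counts.items()}


# Hypothetical genotypes of one SNP in five individuals (made-up data).
population = [("A", "A"), ("A", "G"), ("G", "G"), ("A", "A"), ("A", "G")]
freqs = allele_frequencies(population)
```

The harder questions on this slide (for example correlating disease status with SNPs in a heterogeneous population) need stratification and statistical testing, which a simple count like this does not provide.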

IMPLEMENTATION AND CHALLENGES

Implementation: High Dimensional Data

SEQ_VARIATION
PK: SEQ_VAR_ID
Columns: GENOME_ID, DBSNP_ID, CHROMOSOME, SEQ_START, SEQ_END, SEQ_REF, SEQ_ALT

SEQ_VARIATION_RESULT
PK: SEQ_VARIATION_RESULT_ID
Columns: EXPERIMENT_ID, SEQ_VAR_ID (FK1)
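The two tables on this slide can be tried out in an in-memory SQLite database. This is only a sketch: the table and column names come from the slide, but the column types, the choice of SQLite, and the inserted rows (including the placeholder identifiers `example_genome` and `rs_example`) are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Column names follow the slide; the column types are assumed.
conn.executescript("""
CREATE TABLE SEQ_VARIATION (
    SEQ_VAR_ID  INTEGER PRIMARY KEY,
    GENOME_ID   TEXT,
    DBSNP_ID    TEXT,
    CHROMOSOME  TEXT,
    SEQ_START   INTEGER,
    SEQ_END     INTEGER,
    SEQ_REF     TEXT,
    SEQ_ALT     TEXT
);
CREATE TABLE SEQ_VARIATION_RESULT (
    SEQ_VARIATION_RESULT_ID INTEGER PRIMARY KEY,
    EXPERIMENT_ID           INTEGER,
    SEQ_VAR_ID              INTEGER REFERENCES SEQ_VARIATION (SEQ_VAR_ID)
);
""")

# One made-up variation, observed in three made-up experiments.
conn.execute(
    "INSERT INTO SEQ_VARIATION VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    (1, "example_genome", "rs_example", "1", 1000, 1000, "A", "G"),
)
conn.executemany(
    "INSERT INTO SEQ_VARIATION_RESULT (EXPERIMENT_ID, SEQ_VAR_ID) VALUES (?, ?)",
    [(101, 1), (102, 1), (103, 1)],
)

# Count how many experiment results reference each variation.
rows = conn.execute("""
    SELECT v.DBSNP_ID, COUNT(r.SEQ_VARIATION_RESULT_ID)
    FROM SEQ_VARIATION v
    JOIN SEQ_VARIATION_RESULT r ON r.SEQ_VAR_ID = v.SEQ_VAR_ID
    GROUP BY v.SEQ_VAR_ID
""").fetchall()
```

In this layout each variation is described once and every observation of it adds one row to SEQ_VARIATION_RESULT, so the result table grows with the number of experiments.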

Implementation: High Dimensional Data

[Figure: bar chart of table sizes. Axis: Rows, from 0 to 30,000,000,000. Series: BDS (RESULTS) and 1KG (VARIATIONS). Bar labels: 558,000,000 and 30,000,000,000.]

Implementation: Meta Data

Source: UDIS CV
Property name: Smoking_status
Property values: Current smoker; Former smoker; Non-smoker

Source: TCGA
Property name: Tobacco_smoking_history
Property values: Lifelong Non-smoker; Current smoker; Current reformed smoker for > 15 years; Current reformed smoker for < or = 15 years; Current Reformed Smoker, Duration Not Specified

Source: CDISC
Property name: SMKCLAS
Property values: NEVER SMOKED; SMOKER; EX SMOKER

Source: Other
Property name: Smoking status
Property values: Neversmoker; Non-smoker; Passive smoker; Current smoker; Former smoker; Smoker (not further specified)

Initial work

• Decide on the terms and granularity used in your system. Multiple standards exist that may be used for common concepts.

For each data source

• Map property names used externally to internal property names
• Map property values used externally to internal property values
• Convert external data format (XML, CSV, XLS, RDB, …) to input format
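The two mapping steps above can be sketched with plain lookup tables, using the smoking-status example from the previous slide. This is an illustration under assumptions, not the UDIS code: the names `NAME_MAP`, `VALUE_MAP` and `to_internal` are hypothetical, and the individual value assignments (for example "Neversmoker" to "Non-smoker") are guesses that a curator would have to confirm.

```python
# Internal property name, as listed for the UDIS controlled vocabulary.
INTERNAL_NAME = "Smoking_status"

# (source, external property name) -> internal property name
NAME_MAP = {
    ("CDISC", "SMKCLAS"): INTERNAL_NAME,
    ("Other", "Smoking status"): INTERNAL_NAME,
}

# (source, external value) -> internal value.
# These assignments are assumed for illustration only.
VALUE_MAP = {
    ("CDISC", "NEVER SMOKED"): "Non-smoker",
    ("CDISC", "SMOKER"): "Current smoker",
    ("CDISC", "EX SMOKER"): "Former smoker",
    ("Other", "Neversmoker"): "Non-smoker",
    ("Other", "Non-smoker"): "Non-smoker",
    ("Other", "Current smoker"): "Current smoker",
    ("Other", "Former smoker"): "Former smoker",
    # "Passive smoker" and "Smoker (not further specified)" are left
    # unmapped on purpose: the internal vocabulary has no obvious match.
}


def to_internal(source, name, value):
    """Translate one external (name, value) pair to internal terms.

    Returns (internal_name, internal_value). internal_value is None when
    no mapping has been curated, so the record can be flagged for review.
    Raises KeyError when the property name itself is unknown.
    """
    internal_name = NAME_MAP[(source, name)]
    internal_value = VALUE_MAP.get((source, value))
    return internal_name, internal_value


mapped = to_internal("CDISC", "SMKCLAS", "EX SMOKER")
unmapped = to_internal("Other", "Smoking status", "Passive smoker")
```

Returning `None` for an uncurated value, instead of guessing, keeps the decision about granularity with a human reviewer.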


Source: FF
Property name: Property values
TOTCOM_1: INTEGER
TOTSOC_1: INTEGER
TOTCOMSOC_1: INTEGER
TOTC_1: INTEGER
TOTD_1: INTEGER
TOTE_1: INTEGER
TOT_CS_M1: INTEGER
DIAG_CS_M1: YES|NO
TOT_TSA_M1: INTEGER
DIAG_TSA_M1: YES|NO

Source: FF
Property name: Property values
MOBILITY: INTEGER
SELFCARE: INTEGER
ACTIVITY: INTEGER
PAIN: INTEGER
ANXIETY: INTEGER
EQ_VAS: INTEGER
EQ5D_SCORE: INTEGER
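The FF properties above declare only two kinds of value, INTEGER and YES|NO. When converting such a source to an input format, each raw value can be checked against its declared kind. The sketch below assumes raw values arrive as strings; the helper `parse_value`, the dictionary `FF_TYPES` and the sample record are hypothetical.

```python
def parse_value(raw, value_type):
    """Check a raw string against its declared value type and convert it."""
    text = raw.strip()
    if value_type == "INTEGER":
        return int(text)  # raises ValueError if text is not an integer
    if value_type == "YES|NO":
        if text not in ("YES", "NO"):
            raise ValueError(f"expected YES or NO, got {text!r}")
        return text
    raise ValueError(f"unknown value type: {value_type!r}")


# Declared types for a few FF properties, copied from the listing above.
FF_TYPES = {"TOT_CS_M1": "INTEGER", "DIAG_CS_M1": "YES|NO", "EQ_VAS": "INTEGER"}

# Made-up raw record, as it might arrive from a CSV export.
record = {"TOT_CS_M1": "12", "DIAG_CS_M1": "YES", "EQ_VAS": "70"}
parsed = {name: parse_value(raw, FF_TYPES[name]) for name, raw in record.items()}
```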

CONCLUSION

Conclusion

• Data volume is challenging, but can be managed with current technology

• Variety of annotation standards is extremely challenging and requires a lot of human work

– Huge number of property names in individual studies (thousands)

– Deep biomedical knowledge required to understand annotation

– Standardized exchange format and vocabulary would help

• We do not have the perfect solution

Doing now what patients need next