The logistics for the next hour

•  ‘Materials’ section: presentation.pdf

•  muted

•  Take questions at the end: look for the chat box!

Chat box

Address chat to ‘Everyone’

Please enter your question into this box

Extracting gene-disease evidence from literature, genetics, genomics, and more...

Diversity of biomolecular data

Data generation

Therapeutic hypothesis

Public data

Data integration

Our Vision

A partnership to transform drug discovery through the systematic identification and

prioritisation of targets

https://www.opentargets.org

Timeline: 2014, 2016, 2017, 2018

•  Resource of integrated multiomics data

•  Added value (e.g. score) and links to original sources

•  Graphical web interface: easy to use

April 2018 release

21K targets

9.7K diseases

2.3 M associations

6.1 M evidence

Open Targets Platform

Evidence for our T-D associations

https://docs.targetvalidation.org/data-sources/data-sources

Europe PMC: a public database of life sciences research literature

33 million biomedical abstracts in Europe PMC

3 million genomic variants in dbVar

138 000 protein structures in PDBe

image by Jason D. Rowley

Literature as part of big data

What evidence does this contain?

Text-mining for discovery

Text-mining bioentities

Demo: Publication centric workflow

https://europepmc.org/

Annotation types

Open Targets demo

Data sources grouped into data types

Data types: Genetic associations, Somatic mutations, Drugs, Affected pathways, Differential RNA expression, Animal models, Text mining

Data sources shown: EVA, GWAS Catalog, PheWAS, Cancer Gene Census, EVA, Expression Atlas, PhenoDigm, Europe PMC, G2P

Open Targets Platform

•  Target centric and disease centric information

https://www.targetvalidation.org

•  Associations between targets and diseases

•  Ensembl Gene IDs e.g. ENSGXXXXXXXXXXX

•  UniProt IDs e.g. P15056

•  HGNC names e.g. DMD

•  Also non-coding RNA genes

Targets → genes or proteins
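
As a rough illustration of the identifier types listed above, the sketch below guesses which kind of target identifier a string is. The function name `classify_target_id` and the returned labels are hypothetical and not part of the Open Targets Platform. The Ensembl pattern assumes human gene IDs ("ENSG" plus 11 digits), and the UniProt pattern follows the accession format published in the UniProt help pages. Anything else is treated as a possible gene symbol without checking it against HGNC.

```python
import re

# Assumption: human Ensembl gene IDs are "ENSG" followed by 11 digits.
ENSEMBL_GENE = re.compile(r"^ENSG\d{11}$")

# UniProt accession format (6 or 10 characters), as documented by UniProt.
UNIPROT_ACC = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)


def classify_target_id(identifier: str) -> str:
    """Return a hypothetical label for the kind of target identifier."""
    token = identifier.strip()
    if ENSEMBL_GENE.match(token):
        return "ensembl_gene"
    if UNIPROT_ACC.match(token):
        return "uniprot"
    # Heuristic fallback: may be an HGNC symbol such as DMD, not validated here.
    return "symbol_or_other"


for example in ("ENSG00000141510", "P15056", "DMD"):
    print(example, classify_target_id(example))
```

This is a format check only: a string that looks like a valid accession may still not exist in the underlying database.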

•  Experimental Factor Ontology (EFO)

•  Controlled vocabulary (Alzheimers versus Alzheimer’s)

•  Hierarchy (relationships)

Diseases → EFO terms

•  Promotes consistency

•  Increases the richness of annotation

•  Allows for easier and automatic integration

Association score

Which targets have the most evidence for association with a disease?

What is the relative weight of the evidence for different targets associated with a disease?

Statistical integration, aggregation and scoring

A) per evidence (e.g. one SNP from a GWAS paper)

B) per data source (e.g. SNPs from the GWAS catalog)

C) per data type (e.g. Genetic associations)

D) overall

Four-tier scoring framework

Aggregating individual scores

Ranking target-disease association

Association score: the overall score across all data types

•  Based on the data sources

•  Different weights applied: genetic association = drugs = mutations = pathways > RNA expression > animal models = text mining

To find gene-disease associations, the text-mining algorithm searches for a gene and a disease mentioned in the same sentence.

•  Only primary research, no reviews or commentaries.

•  Only introduction, results, and discussion.

•  Associations have to appear more than once.

•  Defined vocabularies for genes and diseases (SwissProt, EFO)

•  Short or ambiguous entries are filtered (gene “A”, protein “Large”)

•  Term variations are included (“α” = “alpha”)
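
The sentence-level co-occurrence rule above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the Europe PMC pipeline. The vocabularies `GENES` and `DISEASES` are tiny hypothetical stand-ins for SwissProt and EFO, `MIN_TERM_LENGTH` is an assumed cut-off for "short" entries, the sentence splitter is a naive regular expression, and the article-type filter (primary research only) is left out.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical toy vocabularies mapping surface forms to canonical names.
GENES = {"BRAF": "BRAF", "TP53": "TP53", "p53": "TP53", "A": "A"}
DISEASES = {"melanoma": "melanoma", "diabetes": "diabetes"}

ALLOWED_SECTIONS = {"introduction", "results", "discussion"}
MIN_TERM_LENGTH = 3  # assumption: shorter entries are treated as ambiguous


def normalise(text):
    """Map a couple of term variations onto one spelling."""
    return text.replace("α", "alpha").replace("β", "beta")


def find_terms(sentence, vocabulary):
    """Return the canonical names of vocabulary terms found in a sentence."""
    found = set()
    for term, canonical in vocabulary.items():
        if len(term) < MIN_TERM_LENGTH:
            continue  # skip short or ambiguous entries such as gene "A"
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            found.add(canonical)
    return found


def cooccurrences(sections):
    """Count gene-disease pairs per sentence; keep pairs seen more than once."""
    counts = Counter()
    for name, text in sections.items():
        if name.lower() not in ALLOWED_SECTIONS:
            continue
        for sentence in re.split(r"(?<=[.!?])\s+", normalise(text)):
            genes = find_terms(sentence, GENES)
            diseases = find_terms(sentence, DISEASES)
            counts.update(product(genes, diseases))
    return {pair: n for pair, n in counts.items() if n > 1}


example_paper = {
    "introduction": "BRAF is frequently mutated in melanoma. TP53 is a tumour suppressor.",
    "results": "We found BRAF mutations in 40 melanoma samples. "
               "TP53 was altered in one melanoma sample.",
    "methods": "BRAF melanoma cell lines were cultured. TP53 melanoma lines too.",
}
print(cooccurrences(example_paper))
```

In this example only the BRAF and melanoma pair survives. It co-occurs in two sentences from allowed sections, whereas TP53 and melanoma co-occur once outside the methods section and are dropped.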

Text-mining evidence

Literature evidence in Open Targets - a target validation platform. Kafkas et al. 2017

For each association a confidence score is calculated. It takes into account where the association is found in the paper: title scores high, introduction scores low (known associations). 50-80% of other evidence types overlap with the literature mining data.

How accurate is text-mining?

APC:

•  antigen-presenting cells?

•  activated protein C?

•  anaphase-promoting complex?

•  argon plasma coagulation?

Annotation errors

Demo: reporting an annotation

Content in Europe PMC

Content for text-mining

Text-mining coverage

https://www.targetvalidation.org/

Demo: Disease centric workflow

What is the evidence for the association between a target and a disease?

Which targets are associated with a disease?

https://europepmc.org/

Demo: Publication centric workflow

Which articles on p53 cite clinical trials?

Which studies report gene mutations implicated in diabetes?

Programmatic access

https://europepmc.org/AnnotationsApi

Data citation

Conclusions

Help!

helpdesk@europepmc.org

http://bit.ly/EuropePMC-youtube

@EuropePMC_news

Chat box

Please enter your question into this box

Address chat to ‘Everyone’

Back up

Four-tier scoring framework

Calculated at 4 levels:

•  Evidence

•  Data source

•  Data type

•  Overall

Score: 0 to 1 (max)

[Figure: each data source score is multiplied by a weight factor; ΣH marks each aggregation with the harmonic sum, from data sources to data types to the overall association score.]

Data type: data sources (weight factor)

•  Genetic associations: EVA, UniProt, Gene2Phenotype, GWAS catalog, Genomics England, PheWAS catalog (*1.0 each)

•  Somatic mutations: Cancer Gene Census, EVA (somatic), IntOGen (*1.0 each)

•  Drugs: ChEMBL (*1.0)

•  Affected pathways: Reactome (*1.0)

•  RNA expression: Expression Atlas (*0.5)

•  Text mining: Europe PMC (*0.2)

•  Animal models: PhenoDigm (*0.2)

Association: $\Sigma_H = S_1 + \frac{S_2}{2^2} + \frac{S_3}{3^2} + \frac{S_4}{4^2} + \dots + \frac{S_i}{i^2}$

Factors affecting the relative strength of a piece of evidence

S = f * s * c

•  f, relative occurrence of a target-disease evidence

•  s, strength of the effect described by the evidence

•  c, confidence of the observation for the target-disease evidence

e.g. GWAS Catalog:

•  f = sample size (cases versus controls)

•  s = predicted functional consequence (VEP)

•  c = p value reported in the paper

Note: Each data set has its own scoring and ranking scheme

https://docs.targetvalidation.org/getting-started/scoring
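
A minimal sketch of the S = f * s * c product follows. It assumes each factor has already been scaled to the range 0 to 1, which keeps S in the same range. The slides do not say how a sample size, a VEP consequence or a p value is mapped onto that scale, so the input values below are purely illustrative and the function name `evidence_score` is hypothetical.

```python
def evidence_score(f: float, s: float, c: float) -> float:
    """Combine the three factors into one evidence score, S = f * s * c.

    Assumption: f, s and c are each already normalised to [0, 1].
    """
    for name, value in (("f", f), ("s", s), ("c", c)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return f * s * c


# Illustrative values only: a well-powered study (f), a moderate predicted
# consequence (s) and a moderately confident observation (c).
print(evidence_score(f=1.0, s=0.5, c=0.5))
```

Because the factors are multiplied, a weak value in any one of them pulls the whole evidence score down.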

Aggregating scores across the data

•  Using a mathematical function, the harmonic sum*

$\Sigma_H = S_1 + \frac{S_2}{2^2} + \frac{S_3}{3^2} + \frac{S_4}{4^2} + \dots + \frac{S_i}{i^2}$

where S1, S2, ..., Si are the individual evidence scores sorted in descending order

* PMID: 19107201, PMID: 20118918

•  Advantages:

A) accounts for replication

B) deflates the effect of large amounts of data, e.g. text mining
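
The harmonic sum can be sketched as below: scores are sorted in descending order and the score at rank i is divided by i squared. The function name `harmonic_sum` is illustrative, and the sketch leaves out the weight factors and any capping of the final score to the 0 to 1 range.

```python
def harmonic_sum(scores):
    """Sum scores sorted in descending order, dividing the i-th by i squared."""
    ordered = sorted(scores, reverse=True)
    return sum(score / rank ** 2 for rank, score in enumerate(ordered, start=1))


# A) Replication: a second piece of evidence raises the total.
print(harmonic_sum([1.0]), harmonic_sum([1.0, 0.5]))

# B) Deflation: a thousand weak pieces of evidence cannot add up without limit.
print(harmonic_sum([0.1] * 1000))
```

Since the sum of 1/i² converges to about 1.645, any number of scores of 0.1 stays below roughly 0.165. This is how the harmonic sum keeps high-volume sources such as text mining from dominating.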

In addition to T-D associations

•  Everything you wanted to know about…

… but were afraid to ask.

Disease profile page

Target profile page

Profile of a drug target

•  Description, synonyms

•  Gene Ontology

•  Protein structure

•  Protein interactions

•  Drugs

•  Pathways

•  RNA and protein baseline expression (Expression Atlas)

•  Variants, isoforms and genomic context

•  Mouse phenotypes

•  Similar targets

•  Gene tree

•  Bibliography

•  Open Targets Library/LINK

Extra, extra, extra! Cancer hallmarks in our latest release!

http://www.targetvalidation.org/target/ENSG00000141510

Profile of a disease

•  Classification

•  Drugs

•  Similar diseases

•  Bibliography

•  Open Targets Library/LINK

http://www.targetvalidation.org/disease/Orphanet_262

How to access all of this

Core bioinformatics pipelines

www.opentargets.org/projects

Experimental projects

Generate new evidence

CRISPR/Cas9

Organoids and iPS cells (cellular models for disease)

Integration of available data

•  Web interface

•  Batch search tool

•  REST API

•  Data dumps

Main data store: Elasticsearch

Web App* (AngularJS)

REST API**

Public access

* UI: first released in December 2015

** API: first released in April 2016

https://www.targetvalidation.org

https://api.opentargets.io
