Paul Groth (@pgroth) Web & Media Group Department of Computer Science VU University Amsterdam http://www.few.vu.nl/~pgroth Transparency in the Data Supply Chain

Transparency in the Data Supply Chain

DESCRIPTION

Domains such as drug discovery, data science, and policy studies increasingly rely on combining complex analysis pipelines with integrated data sources to reach conclusions. A key question then arises: what are these conclusions based upon? There is thus a tension between integrating data for analysis and understanding where that data comes from (its provenance). In this talk, I describe recent work that attempts to facilitate transparency by combining provenance tracked within databases with the data integration and analytics pipelines that feed them. I discuss this with respect to use cases from public policy as well as drug discovery. Given at: http://ccct.uva.nl/content/ccct-seminar-21-february-2014

Page 1: Transparency in the Data Supply Chain

Paul Groth (@pgroth)
Web & Media Group
Department of Computer Science
VU University Amsterdam
http://www.few.vu.nl/~pgroth

Transparency in the Data Supply Chain

Page 2: Transparency in the Data Supply Chain
Page 3: Transparency in the Data Supply Chain

Outline

• Data integration for analysis
  – i.e. remixing data
• The need for transparency
• Two solutions
• The future

Page 4: Transparency in the Data Supply Chain

http://[email protected]

@Open_PHACTS

Page 5: Transparency in the Data Supply Chain

Why?

Public Domain Drug Discovery Data: Pharma are accessing, processing, storing & re-processing

[Diagram labels: sources (Literature, PubChem, Genbank, Patents, Databases, Downloads); Data Integration; Data Analysis; Firewalled Databases; "Repeat @ each company x"]

Page 6: Transparency in the Data Supply Chain

Prioritised Research Questions

Number | sum | Nr of 1 | Question
15 | 12 | 9 | All oxido-reductase inhibitors active <100nM in both human and mouse
18 | 14 | 8 | Given compound X, what is its predicted secondary pharmacology? What are the on and off-target safety concerns for a compound? What is the evidence and how reliable is that evidence (journal impact factor, KOL) for findings associated with a compound?
24 | 13 | 8 | Given a target find me all actives against that target. Find/predict polypharmacology of actives. Determine ADMET profile of actives.
32 | 13 | 8 | For a given interaction profile, give me compounds similar to it.
37 | 13 | 8 | The current Factor Xa lead series is characterised by substructure X. Retrieve all bioactivity data in serine protease assays for molecules that contain substructure X.
38 | 13 | 8 | Retrieve all experimental and clinical data for a given list of compounds defined by their chemical structure (with options to match stereochemistry or not).
41 | 13 | 8 | A project is considering Protein Kinase C Alpha (PRKCA) as a target. What are all the compounds known to modulate the target directly? What are the compounds that may modulate the target directly? i.e. return all cmpds active in assays where the resolution is at least at the level of the target family (i.e. PKC) both from structured assay databases and the literature.
44 | 13 | 8 | Give me all active compounds on a given target with the relevant assay data
46 | 13 | 8 | Give me the compound(s) which hit most specifically the multiple targets in a given pathway (disease)
59 | 14 | 8 | Identify all known protein-protein interaction inhibitors

www.openphacts.org

Page 7: Transparency in the Data Supply Chain

Research question 15: All oxido reductase inhibitors active < 100nM in both human and mouse

From Mabel Loza - USC team

Page 8: Transparency in the Data Supply Chain

From Mabel Loza - USC team

Page 9: Transparency in the Data Supply Chain

From Mabel Loza - USC team

Page 10: Transparency in the Data Supply Chain

From Mabel Loza - USC team

Page 11: Transparency in the Data Supply Chain

Research question 15: All oxido reductase inhibitors active < 100nM in both human and mouse

ChEMBL:

Search for target "Oxidoreductase": 481 targets from different species

Select all the oxidoreductases and filter bioactivities with the criterion IC50 < 100 (no units could be selected): 11497 data points obtained

Table exported to an Excel spreadsheet and manually filtered
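
The manual filtering step above can be pictured with a minimal sketch. It assumes hypothetical activity records that carry an explicit unit, converts every IC50 to nM, and keeps only compounds below the threshold in both species. The record layout, unit labels, and function names are illustrative assumptions, not the ChEMBL export format.

```python
# Conversion factors to nanomolar; the unit labels are assumptions for this sketch.
TO_NM = {"nM": 1.0, "uM": 1_000.0, "mM": 1_000_000.0}


def to_nanomolar(value, unit):
    """Normalise an IC50 value to nM so that thresholds are comparable."""
    return value * TO_NM[unit]


def active_in_both(records, threshold_nm=100.0, species=("human", "mouse")):
    """Return compounds with an IC50 below the threshold in every listed species."""
    seen = {}
    for rec in records:
        if to_nanomolar(rec["ic50"], rec["unit"]) < threshold_nm:
            seen.setdefault(rec["compound"], set()).add(rec["species"])
    return sorted(c for c, found in seen.items() if set(species) <= found)


# Hypothetical records: compound B looks active if units are ignored (50 < 100),
# but 50 uM is 50000 nM, so it is not active in human.
records = [
    {"compound": "A", "species": "human", "ic50": 50, "unit": "nM"},
    {"compound": "A", "species": "mouse", "ic50": 0.02, "unit": "uM"},
    {"compound": "B", "species": "human", "ic50": 50, "unit": "uM"},
    {"compound": "B", "species": "mouse", "ic50": 10, "unit": "nM"},
]

print(active_in_both(records))
```

Filtering on the bare number, as in "IC50 < 100" with no unit, would wrongly keep compound B, which is why the exported table needed manual checking.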

From Mabel Loza - USC team

Page 12: Transparency in the Data Supply Chain

5 people

Working 6 hours

Page 13: Transparency in the Data Supply Chain

Problem: Data Integration

[Diagram: two integration approaches]
• Data warehouse: Data Sources; Extract, Transform, Load; Data Warehouse; Queries
• Mediator: Data Sources; Query Reformulation; Mediator; Queries
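
The two integration styles named on this slide, a data warehouse loaded through Extract-Transform-Load and a mediator that reformulates queries against the sources, can be contrasted with a toy sketch. The source names, record fields, and functions are hypothetical and only illustrate the difference in when data is touched.

```python
# Hypothetical toy sources; names and records are illustrative only.
SOURCE_A = [{"compound": "c1", "target": "t1"}, {"compound": "c3", "target": "t2"}]
SOURCE_B = [{"compound": "c2", "target": "t1"}]
SOURCES = {"source_a": SOURCE_A, "source_b": SOURCE_B}


def etl_load(sources):
    """Warehouse style: copy every record up front, tagged with its source."""
    warehouse = []
    for name, records in sources.items():
        for rec in records:
            warehouse.append({**rec, "source": name})
    return warehouse


def query_warehouse(warehouse, target):
    """Queries run against the single integrated store."""
    return [rec for rec in warehouse if rec["target"] == target]


def query_mediator(sources, target):
    """Mediator style: the query is sent to each source at query time."""
    results = []
    for name, records in sources.items():
        results.extend(
            {**rec, "source": name} for rec in records if rec["target"] == target
        )
    return results


warehouse = etl_load(SOURCES)
print(query_warehouse(warehouse, "t1"))
print(query_mediator(SOURCES, "t1"))
```

Both paths return the same rows here; the sketch keeps a source tag on every record because, in either style, that tag is the minimum needed to answer where a result came from.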

Page 14: Transparency in the Data Supply Chain

Using the Power of Open PHACTS, London, 22-23 April 2013

[Architecture diagram: Open PHACTS Core Platform]
• Content: Public Content, Commercial, Public Ontologies, User Annotations (provided as RDF, Nanopub, or Db, each with a VoID description)
• Core Platform: Data Cache (Virtuoso Triple Store); Semantic Workflow Engine; Identity Resolution Service; Identifier Management Service; Chemistry Registration, Normalisation & Q/C; index
• Example identifiers: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”
• Access: Linked Data API (RDF/XML, TTL, JSON); Domain Specific Services; Applications

Page 15: Transparency in the Data Supply Chain

Open PHACTS Explorer

Page 16: Transparency in the Data Supply Chain

Open PHACTS Explorer

?

Page 17: Transparency in the Data Supply Chain

Credits: Curt Tilmes, Peter Fox

Tilmes, C.; Fox, P.; Ma, X.; McGuinness, D.L.; Privette, A.P.; Smith, A.; Waple, A.; Zednik, S.; Zheng, J.G., "Provenance Representation for the National Climate Assessment in the Global Change Information System," IEEE Transactions on Geoscience and Remote Sensing, vol. 51, no. 11, pp. 5160-5168, Nov. 2013

Page 18: Transparency in the Data Supply Chain
Page 19: Transparency in the Data Supply Chain

Problem: I don’t trust your assessment. What is it based on?

Page 20: Transparency in the Data Supply Chain

Tension:

Integrated & Summarized Data

Transparency & Trust

Page 21: Transparency in the Data Supply Chain

Solution

Integrating and exposing provenance provided by multiple sources

Page 22: Transparency in the Data Supply Chain
Page 23: Transparency in the Data Supply Chain

provbook.org

Page 24: Transparency in the Data Supply Chain
Page 25: Transparency in the Data Supply Chain

National Climate Change Assessment Provenance

Page 26: Transparency in the Data Supply Chain
Page 27: Transparency in the Data Supply Chain
Page 28: Transparency in the Data Supply Chain

PROV: the database as a black box

Q

Page 29: Transparency in the Data Supply Chain

Goal

• the capability to trace back, for each query result, the complete list of sources and how they were combined to deliver a result.

Page 30: Transparency in the Data Supply Chain

Implement in a Graph Database at Scale

Marcin Wylot, Philippe Cudré-Mauroux
Exascale Lab, University of Fribourg

http://diuf.unifr.ch/main/xi/diplodocus

Page 31: Transparency in the Data Supply Chain

TripleProv [WWW 2014]

Page 32: Transparency in the Data Supply Chain

Provenance Polynomials
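
As a rough illustration of the idea, the sketch below answers a two-pattern query over a handful of hypothetical quads and builds one polynomial per result row: alternative sources for the same pattern are combined with "+", and joined patterns are combined with "*". The data, source labels l1 to l4, and function names are assumptions for this example; it is not the TripleProv implementation.

```python
from collections import defaultdict

# Hypothetical quads: (subject, predicate, object, source).
QUADS = [
    ("drugA", "inhibits", "targetX", "l1"),
    ("drugA", "inhibits", "targetX", "l2"),
    ("targetX", "foundIn", "human", "l3"),
    ("drugB", "inhibits", "targetX", "l4"),
]


def match(quads, predicate):
    """Group the sources that assert each (subject, object) pair for a predicate."""
    found = defaultdict(set)
    for s, p, o, src in quads:
        if p == predicate:
            found[(s, o)].add(src)
    return found


def join_with_provenance(quads, pred1, pred2):
    """Answer '?a pred1 ?b . ?b pred2 ?c' and return a polynomial per result row."""
    left, right = match(quads, pred1), match(quads, pred2)
    results = {}
    for (a, b), sources1 in left.items():
        for (b2, c), sources2 in right.items():
            if b == b2:
                # "+" lists alternative sources for one pattern; "*" joins patterns.
                factors = ["(" + " + ".join(sorted(s)) + ")" for s in (sources1, sources2)]
                results[(a, b, c)] = " * ".join(factors)
    return results


for row, polynomial in join_with_provenance(QUADS, "inhibits", "foundIn").items():
    print(row, polynomial)
```

Read this way, the polynomial for the drugA row says the result holds if either l1 or l2 is accepted, together with l3.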

Page 33: Transparency in the Data Supply Chain

Test on large messy data

• Billion Triple Challenge
  – Crawled from the linked open data cloud
• Web Data Commons
  – RDFa, Microdata extracted from Common Crawl
• 115 million triples (25 GB)
• 8 queries defined for BTC
  – T. Neumann and G. Weikum. Scalable join processing on very large RDF graphs. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pages 627–640. ACM, 2009.

Page 34: Transparency in the Data Supply Chain

External + Internal Provenance

• Unified queries over external and database provenance

• Adapting query results based on provenance

• Performance improvements
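
One way to picture "adapting query results based on provenance" is to keep only the result rows whose provenance is supported by a set of trusted sources. The sketch below assumes each row carries its provenance as a list of factors (one per joined pattern), each factor being a set of alternative sources. The rows, source labels, and function names are hypothetical.

```python
# Hypothetical results: row -> list of factors, each a set of alternative sources.
RESULTS = {
    ("drugA", "targetX", "human"): [{"l1", "l2"}, {"l3"}],
    ("drugB", "targetX", "human"): [{"l4"}, {"l3"}],
}


def supported(polynomial, trusted):
    """A row is supported if every joined factor has at least one trusted source."""
    return all(factor & trusted for factor in polynomial)


def adapt_results(results, trusted):
    """Drop rows whose provenance cannot be derived from trusted sources alone."""
    return {row: poly for row, poly in results.items() if supported(poly, trusted)}


print(sorted(adapt_results(RESULTS, trusted={"l1", "l3"})))
```

With only l1 and l3 trusted, the drugB row disappears because its first factor depends solely on l4.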

Page 35: Transparency in the Data Supply Chain

FUTURE

Page 36: Transparency in the Data Supply Chain

60% of time is spent on data preparation

Page 37: Transparency in the Data Supply Chain

Big Data is often lots of small data

http://www.data2semantics.org/prov-reconstruction-challenge/

Page 38: Transparency in the Data Supply Chain

Questions?

• More info:
  – openphacts.org
  – data2semantics.org
  – provbook.org
  – Paul Groth, "Transparency and Reliability in the Data Supply Chain," IEEE Internet Computing, vol. 17, no. 2, pp. 69-71, March-April 2013
  – Paul Groth, "The Knowledge-Remixing Bottleneck," IEEE Intelligent Systems, vol. 28, no. 5, pp. 44-48, Sept.-Oct. 2013
  – Marcin Wylot, Philippe Cudré-Mauroux and Paul Groth. TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store. WWW 2014

Page 39: Transparency in the Data Supply Chain

Backup

Page 40: Transparency in the Data Supply Chain

Hack SPARQL

Page 41: Transparency in the Data Supply Chain

What’s the overhead? Setup

Source and complete trace (i.e. triple level)

Page 42: Transparency in the Data Supply Chain

Annotations:

Propagate annotations through the query processing pipeline

Page 43: Transparency in the Data Supply Chain

What’s the overhead?