AI for Data Curation Yes, can we? Andrea Splendiani, AD, Information Systems London September 28, 2017 NIBR Informatics

Artificial Intelligence in Data Curation


Page 1: Artificial Intelligence in Data Curation

AI for Data Curation
Yes, can we?
Andrea Splendiani, AD, Information Systems
London, September 28, 2017

NIBR Informatics

Page 2: Artificial Intelligence in Data Curation

Business or Operating Unit/Franchise or Department

Agenda

1. Focus: metadata and reference data (what we do in context)
2. Knowledge Engineering and AI (some considerations at 10000ft)
3. Data curation: a use case for AI? (holistic view on a process, 1000ft)
4. Ideas and experiences (details)
5. Conclusions (reflections at 10000ft)

Page 3: Artificial Intelligence in Data Curation

Focus: metadata and reference data

1. What:
   – Annotation of datasets
   – Standards
   – Ontologies
   – Reference information

2. Why:
   – Support analysis
   – Support search and query answering
   – Support extraction
   – Building knowledge networks / information discovery and inference

3. Where:
   – Typically in research

Page 4: Artificial Intelligence in Data Curation

Knowledge Engineering and AI

Can Artificial Intelligence solve biology? (a stopper)

• 10 years ago: AI approaches to Systems Biology
• Ontology-based knowledge bases (Semantic Web)
• ANN/fuzzy systems are even older

Page 5: Artificial Intelligence in Data Curation

Knowledge Engineering and AI

Can Artificial Intelligence solve biology? (taken seriously)

• Now: AI and ML are at the height of their hype
• Interest in Life Sciences industries

Page 6: Artificial Intelligence in Data Curation

Knowledge Engineering and AI

• What helped the resurgence of ML?
   – Massive data available
   – Massive computational power available
   – Few technical improvements
   – Success stories (deep learning)

• Do these also apply to ontology/Semantic Web based systems?
   – UniProt: 5.7B triples in 2009, 30+B triples in 2017
   – EBI RDF Platform (2015)
   – Wikidata (2014?)

Source: https://tools.wmflabs.org/wikidata-todo/stats.php

Page 7: Artificial Intelligence in Data Curation

Knowledge Engineering and AI

• The way information is represented has implications for what is built on it (e.g.: analytics, data mining)
   – Networks: are parallel executions combined in AND or in OR?
   – Annotations: explicit mention of negative information

Page 8: Artificial Intelligence in Data Curation

Knowledge Engineering and AI

• Metadata is important in a data-centric world (and in at least part of ML applications)
• Knowledge representation matters, beyond metadata (examples: AND/OR in pathways, NOT in annotations…)
• We are starting to have large, distributed knowledge bases
   – Is there a role for AI systems based on logic/KR?
   – Can we combine symbolic and sub-symbolic reasoning?
   – Is this already happening?
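The AND/OR point can be made concrete with a small sketch. It assumes a hypothetical pathway step with two parallel upstream branches; the branch names and the function `downstream_active` are illustrative only. The same topology gives a different answer depending on how the representation says the branches combine.

```python
from typing import Callable, Dict, Iterable


def downstream_active(
    upstream: Dict[str, bool],
    combine: Callable[[Iterable[bool]], bool],
) -> bool:
    """Decide whether a downstream step is active, given its upstream branches."""
    return combine(upstream.values())


# Hypothetical state: one branch is active, the other is not.
upstream = {"branch_1": True, "branch_2": False}

active_if_and = downstream_active(upstream, all)  # both branches required
active_if_or = downstream_active(upstream, any)   # one branch is enough

print(active_if_and, active_if_or)
```

If the knowledge base records only "branch_1 and branch_2 lead to the step", an analysis built on top of it cannot tell these two cases apart.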

Page 9: Artificial Intelligence in Data Curation

Data curation

• Annotation
• Metadata
• Standards
• Model
• Literature
• Databases
• …

Source: BioCuration 2017 abstracts, via wordscloud.com

Page 10: Artificial Intelligence in Data Curation

An example: public data curation

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM701607
https://www.ebi.ac.uk/rdf/services/describe?uri=http%3A%2F%2Frdf.ebi.ac.uk%2Fresource%2Fbiosamples%2Fsample%2FSAMEA1189935

Page 11: Artificial Intelligence in Data Curation

An example: public data curation (data view)

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM701607

Property | Value | Ontology | Bio-Characteristic?
Sample_source_name | WT6 biological rep 1, Affy processing batch 2 | EFO_0000001 |
Organism | Mus musculus | EFO_0000001, NCBITaxon_10090 |
strain | 129S6/Sv/Ev | EFO_0000001 | Bio
genotype | wild type | EFO_0000001, EFO_0005168 | Bio
Sex | male | EFO_0000001, EFO_0001266, PATO_0000384 |
age | 6 weeks old | EFO_0000001 | Bio

https://www.ebi.ac.uk/rdf/services/describe?uri=http%3A%2F%2Frdf.ebi.ac.uk%2Fresource%2Fbiosamples%2Fsample%2FSAMEA1189935


Page 13: Artificial Intelligence in Data Curation

An example: public data curation

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM701607
https://www.ebi.ac.uk/rdf/services/describe?uri=http%3A%2F%2Frdf.ebi.ac.uk%2Fresource%2Fbiosamples%2Fsample%2FSAMEA1189935

Supports:
• Aggregation
• Analysis
• Search
• Link discovery
• “Machine learning”

Page 14: Artificial Intelligence in Data Curation

Can we use AI for data curation?

Why?
   – Data curation is an intellectually intensive and time-consuming activity
   – Given the increasing role and amount of data, curation risks becoming a bottleneck

[Figure: example of exponential growth in data]

Page 15: Artificial Intelligence in Data Curation

AI for data curation: characteristics and constraints

• Can we automate data curation?
• Difficult:
   – Missing data
   – Discretion (e.g.: level of granularity)
• Looks reasonable:
   – Repetition
   – Consistency
   – Data/distance evaluations (clustering/attractors)
• We need to combine human aspects and machine-automatable aspects

Page 16: Artificial Intelligence in Data Curation

AI for data curation
Framing the problem: what

• Should this value be normalized?
• Meaning, e.g.: is “age” the same as “years”?
• Confidence: is this information true?
• The need, e.g.: is this required information? When? Is this a valid identifier?

Example: extract from NCBI GEO GSM701607

Page 17: Artificial Intelligence in Data Curation

AI for data curation
Framing the problem: how

We consider curation activities as functions in a “curation space” that is exemplified via a “curation record”.

Curation record (one row per line, cells separated by “|”):

Validation state (Confidence): Valid | Valid | Valid
Curation goal (The need): Required | Required | Required | Required | Required
Semantic type¹ (Meaning): Identifier about Sample | ID² about Organism | Name about Organism | Name about Gender | Identifier about Gender | Description about Age | Age Unit about Age
Field name (the “location” in the source): ID | taxID | Organism | Gender | age
Value: GSM701607 | 10090 | Mus Musculus | 6 weeks old

¹ All semantic types are expressed via an ontology (here presented as a simplified definition).
² Identifiers also require a domain specification.

Example: extract from NCBI GEO GSM701607 (only a subset of the fields from the previous slide is considered)
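One possible way to hold such a curation record in code is sketched below. The class name `Cell` and the helper functions are assumptions made for illustration, not the authors' implementation. Which cells carry “Valid” and “Required” cannot be recovered from the slide text, so those attributes are left unset here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cell:
    """One column of a curation record (hypothetical structure)."""
    field_name: str                    # the "location" in the source
    semantic_type: str                 # e.g. "Identifier", "Name", "Description"
    about: str                         # the entity the semantic type is about
    value: Optional[str] = None
    goal: Optional[str] = None         # curation goal, e.g. "Required"
    validation: Optional[str] = None   # validation state, e.g. "Valid"


# Columns and values as listed for GSM701607; goal/validation left unset.
record: List[Cell] = [
    Cell("ID", "Identifier", "Sample", value="GSM701607"),
    Cell("taxID", "Identifier", "Organism", value="10090"),
    Cell("Organism", "Name", "Organism", value="Mus Musculus"),
    Cell("Gender", "Name", "Gender"),
    Cell("Gender", "Identifier", "Gender"),
    Cell("age", "Description", "Age", value="6 weeks old"),
    Cell("age", "Age Unit", "Age"),
]


def cells_about(rec: List[Cell], entity: str) -> List[Cell]:
    """All cells whose semantic type is about the given entity."""
    return [c for c in rec if c.about == entity]


def empty_cells(rec: List[Cell]) -> List[Cell]:
    """Cells that still have no value: candidates for curation operations."""
    return [c for c in rec if c.value is None]


print([c.semantic_type for c in cells_about(record, "Age")])
print(len(empty_cells(record)))
```

Keeping the source field name separate from the semantic type lets one source field (here `Gender` or `age`) feed several typed cells.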

Page 18: Artificial Intelligence in Data Curation

AI and data curation
Using a record to modularize curation processes

• Different classes of operations
   – Schema mapping (assigning a type)
   – Standard setting (assigning a goal)
   – Validation (setting a validation value)

Example records:

Validation state: Valid | Valid | Valid
Curation goal: Required | Required | Required | Required
Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender | Description about Age | Age Unit about Age
Field name: ID | Gender | age
Value: GSM701607 | 6 weeks old

Validation state: Valid | Valid
Curation goal: Required
Semantic type: Identifier about Sample | Name about Organism | Name about Gender
Field name: ID | Organism | Gender
Value: GSM701607 | Mus Musculus


Page 19: Artificial Intelligence in Data Curation

AI and data curation
Using a record to modularize curation processes

• Different classes of operations
   – Normalization (filling a column)
   – Enrichment (adding a column)

Normalization, before:

Validation state: Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender
Field name: ID | Gender
Value: GSM701607 | male

Normalization, after:

Validation state: Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender
Field name: ID | Gender
Value: GSM701607 | male | PATO:0000384

Enrichment, before:

Validation state: Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | ID about Organism | Name about Organism | Description about Age
Field name: ID | taxID | Organism | age
Value: GSM701607 | 10090 | Mus Musculus | 6 weeks old

Enrichment, after:

Validation state: Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | ID about Organism | Name about Organism | Description about Age | Identifier about Sample
Field name: ID | taxID | Organism | age | EBI ref.
Value: GSM701607 | 10090 | Mus Musculus | 6 weeks old | SAMEA1189935
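The two operations on this slide can be sketched as small functions over a record. This is a minimal illustration under stated assumptions: the record is a plain dictionary, the column names `Gender (name)` and `Gender (identifier)` and the lookup table `GENDER_IDS` are hypothetical, and only the values shown on the slide are used.

```python
from typing import Dict, Optional

Record = Dict[str, Optional[str]]

# Hypothetical lookup; PATO:0000384 is the identifier shown on the slide for "male".
GENDER_IDS = {"male": "PATO:0000384"}


def normalize_gender(record: Record) -> Record:
    """Normalization: fill an existing, empty column from another column."""
    out = dict(record)  # work on a copy, leave the input untouched
    name = out.get("Gender (name)")
    if name is not None and not out.get("Gender (identifier)"):
        out["Gender (identifier)"] = GENDER_IDS.get(name.lower())
    return out


def enrich(record: Record, column: str, value: str) -> Record:
    """Enrichment: add a column that was not in the record before."""
    if column in record:
        raise ValueError(f"column already present: {column}")
    out = dict(record)
    out[column] = value
    return out


before: Record = {
    "ID": "GSM701607",
    "Gender (name)": "male",
    "Gender (identifier)": None,
}
after = normalize_gender(before)

sample: Record = {
    "ID": "GSM701607",
    "taxID": "10090",
    "Organism": "Mus Musculus",
    "age": "6 weeks old",
}
enriched = enrich(sample, "EBI ref.", "SAMEA1189935")

print(after["Gender (identifier)"], enriched["EBI ref."])
```

Both functions return a new record instead of changing the input, which keeps each operation a separate, composable step.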

Page 20: Artificial Intelligence in Data Curation

Big picture
Quantity/Quality trade-off

[Figure: Quality/validity plotted against Time/cost]

• Is the optimal trade-off the same for all data?
• Can this change for the same data over time and across use cases?
• Can we embed a “cost function” in curation processes?

Page 21: Artificial Intelligence in Data Curation

Big picture
(Meta)data evolution, immutability

1. Initial condition: organism name present, missing ID
   Entity: 1234 | Information: V1 | Meta-Info: V1

   Validation state: Valid | Valid
   Curation goal: Required | Required
   Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender
   Field name: ID | Gender
   Value: GSM701607 | male

2. Initial condition: identifier extracted, not verified
   Entity: 1234 | Information: V2 | Meta-Info: V2

   Validation state: Valid | Valid
   Curation goal: Required | Required
   Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender
   Field name: ID | Gender
   Value: GSM701607 | male | PATO:0000384

3. Identifier extracted and verified
   Entity: 1234 | Information: V2 | Meta-Info: V3

   Validation state: Valid | Valid | Valid
   Curation goal: Required | Required
   Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender
   Field name: ID | Gender
   Value: GSM701607 | male | PATO:0000384
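The three states can be modelled as immutable snapshots in which information and meta-information are versioned separately. The sketch below is an assumption about how this could look, using frozen dataclasses; the class and function names are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    """An immutable state of one entity's record (hypothetical structure)."""
    entity: str
    information_version: int
    meta_version: int
    gender_name: Optional[str]
    gender_id: Optional[str]
    gender_id_valid: bool


def extract_identifier(s: Snapshot, identifier: str) -> Snapshot:
    """New information (the identifier) also implies new meta-information."""
    return replace(
        s,
        information_version=s.information_version + 1,
        meta_version=s.meta_version + 1,
        gender_id=identifier,
    )


def verify_identifier(s: Snapshot) -> Snapshot:
    """Verification changes only the meta-information."""
    return replace(s, meta_version=s.meta_version + 1, gender_id_valid=True)


v1 = Snapshot("1234", 1, 1, "male", None, False)
v2 = extract_identifier(v1, "PATO:0000384")
v3 = verify_identifier(v2)
history = (v1, v2, v3)  # earlier states stay available unchanged

for s in history:
    print(s.information_version, s.meta_version, s.gender_id, s.gender_id_valid)
```

Because snapshots are frozen, every curation step yields a new state, and the version pair records whether the data or only the knowledge about the data has changed.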

Page 22: Artificial Intelligence in Data Curation

Ideas and experiences
Some details

Page 23: Artificial Intelligence in Data Curation

Data and metadata transformations (deterministic actions + extractors)

• Curation processes can be expressed (by curators) in terms of rules
• Rules embed “atomic operations”, e.g.: extractors, transformations, …
• Simple rules go a very long way…

<!-- Outer rule: extract a value of type UNIT from a text -->
<ruleConfig method="Extract">
  <param name="setType" value="UNIT"/>
  <param name="setAmbiguous" value="true"/>
  <param name="setFullMatch" value="false"/>
  <param name="setResultInJson" value="false"/>
  <param name="setSimpleJson" value="false"/>
  <param name="setText">
    <!-- Nested rule: the text is read from the AgeDescription cell -->
    <ruleConfig method="GetCell">
      <param name="setAttr" value="AgeDescription"/>
      <param name="setBase" value="XCF_1"/>
    </ruleConfig>
  </param>
</ruleConfig>
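The composition of atomic operations in the rule above can be mimicked in a few lines of Python. This is a sketch, not the engine behind the XML: `get_cell`, `extract_units` and the unit vocabulary are hypothetical, and returning a list of matches is only one reading of `setAmbiguous="true"`.

```python
import re
from typing import Dict, List

# Hypothetical, deliberately small vocabulary of time units.
UNIT_PATTERN = re.compile(r"\b(day|week|month|year)s?\b", re.IGNORECASE)


def get_cell(row: Dict[str, str], attr: str) -> str:
    """Atomic operation: read one attribute of a row ('' if absent)."""
    return row.get(attr, "")


def extract_units(text: str) -> List[str]:
    """Atomic operation: find unit mentions in free text, singularized."""
    return [m.group(1).lower() for m in UNIT_PATTERN.finditer(text)]


def age_unit_rule(row: Dict[str, str]) -> List[str]:
    """Rule = Extract(UNIT) applied to GetCell(AgeDescription)."""
    return extract_units(get_cell(row, "AgeDescription"))


units = age_unit_rule({"AgeDescription": "6 weeks old"})
print(units)
```

The rule itself contains no logic beyond wiring one operation into the other, which is what makes such rules easy for curators to write.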

Page 24: Artificial Intelligence in Data Curation

Abstract rules and meta-rules

• Rules can rely on abstraction/inference for greater generality
• They can also be used to produce meta-information

Example rules (pseudo-syntax)

• Compute missing identifier:
  If (E.X.type=“Identifier” ^ E.X.Goal=“Required” ^ E.X.Value=“” ^ exists (E.Y: E.Y.type.about=E.X.type.about ^ E.Y.type=“Description” ^ E.Y.Value!=“”)) then E.X.Value=extract(isAbout(E.Y.type), E.Y.Value)

• Set a curation goal:
  If subClassOf(E.OrganismID.Value, NCBI_40674), then E.GenderID.Goal=“Required”

• Assert validity on condition:
  If one identifier is unambiguously extracted from a species name, then Validation State=Valid

Validation state: Valid | Valid
Curation goal: Required | Required | Required
Semantic type: Identifier about Sample | ID about Organism | Name about Organism | Name about Gender | Identifier about Gender
Field name (the “location” in the source): ID | taxID | Organism | Gender
Value: GSM701607 | 10090 | Mus Musculus
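The three example rules can be sketched as ordinary functions. Everything named here is an assumption for illustration: the `Cell` class, the one-entry name lookup, and the abbreviated parent map (the real NCBI lineage between 10090 and 40674, Mammalia, has many intermediate ranks). The slide's first rule tests for type “Description”; the sketch also accepts “Name” so that the organism cell of the record can serve as input.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Cell:
    kind: str            # "Identifier", "Name" or "Description"
    about: str           # entity the cell is about
    value: str = ""
    goal: str = ""
    validation: str = ""


# Hypothetical lookups, reduced to what the example needs.
NAME_TO_TAXID: Dict[str, List[str]] = {"mus musculus": ["10090"]}
PARENT: Dict[str, str] = {"10090": "40674"}  # abbreviated lineage


def extract(about: str, text: str) -> List[str]:
    """Candidate identifiers for a textual value about a given entity."""
    if about != "Organism":
        return []
    return NAME_TO_TAXID.get(text.strip().lower(), [])


def subclass_of(taxid: str, ancestor: str) -> bool:
    """Walk up the parent map until the ancestor is found or the map ends."""
    while taxid in PARENT:
        taxid = PARENT[taxid]
        if taxid == ancestor:
            return True
    return False


def compute_missing_identifiers(cells: List[Cell]) -> None:
    """Rules 1 and 3: fill required empty identifiers; mark Valid if unambiguous."""
    for x in cells:
        if not (x.kind == "Identifier" and x.goal == "Required" and x.value == ""):
            continue
        for y in cells:
            if y.about == x.about and y.kind in ("Description", "Name") and y.value:
                candidates = extract(y.about, y.value)
                if len(candidates) == 1:
                    x.value = candidates[0]
                    x.validation = "Valid"
                break


def set_gender_goal(organism_id: Cell, gender_id: Cell) -> None:
    """Rule 2: a gender identifier is required for organisms under 40674."""
    if subclass_of(organism_id.value, "40674"):
        gender_id.goal = "Required"


organism_id = Cell("Identifier", "Organism", goal="Required")
organism_name = Cell("Name", "Organism", value="Mus Musculus")
gender_id = Cell("Identifier", "Gender")

compute_missing_identifiers([organism_id, organism_name, gender_id])
set_gender_goal(organism_id, gender_id)
print(organism_id.value, organism_id.validation, gender_id.goal)
```

The second rule is a meta-rule in the sense of the slide: it does not change any value, it changes what the record is required to contain.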

Page 25: Artificial Intelligence in Data Curation

“Approximate” transformations

• Some transformations cannot (easily) be expressed in terms of rules
   – Complex and ad hoc relations
   – Discretionary elements

• Examples:
   – Entity de-duplication
      – Whether two mentions of homonymous authors refer to the same author is a complex function of an extended range of the authors’ features (where they work, contact information, subject of study, …)
   – Schema mapping
      – Determining the meaning of an attribute (e.g.: time) is a complex function of the values this attribute takes, as well as of other parameters (is this a duration, a time point, or an execution timestamp?)
      – Is “Sample tracking number” to be mapped to “Tracking number” or to “Identifier”?
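To show why de-duplication is a function of several features and not a single rule, here is a minimal, generic sketch. It is not how any particular product works: the feature names, the hand-set weights and the example mentions are all hypothetical, and a learning system would fit the weights from curator feedback.

```python
from difflib import SequenceMatcher
from typing import Dict

# Hypothetical, hand-set weights over author features.
WEIGHTS: Dict[str, float] = {"affiliation": 0.5, "email": 0.3, "subject": 0.2}


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; a missing value contributes nothing."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def same_author_score(m1: Dict[str, str], m2: Dict[str, str]) -> float:
    """Weighted evidence that two homonymous mentions are the same author."""
    return sum(
        w * similarity(m1.get(f, ""), m2.get(f, "")) for f, w in WEIGHTS.items()
    )


# Illustrative mentions of two authors with the same name.
mention_a = {"affiliation": "Institute A", "email": "j.doe@a.example", "subject": "genomics"}
mention_b = {"affiliation": "Institute A", "email": "j.doe@a.example", "subject": "genomics"}
mention_c = {"affiliation": "Company Z", "email": "jd@z.example", "subject": "geology"}

score_same = same_author_score(mention_a, mention_b)
score_diff = same_author_score(mention_a, mention_c)
print(round(score_same, 2), round(score_diff, 2))
```

The score is a confidence to be shown to a curator, not a decision; where to put the threshold is exactly the discretionary element mentioned above.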

Page 26: Artificial Intelligence in Data Curation

Implementation of de-duplication and schema mapping via Tamr

• One approach that we have chosen to provide approximate schema-mapping and de-duplication functions is via Tamr (tamr.com)

• Tamr is a data unification platform that combines machine learning with human expertise.
   – E.g.: to support schema mapping, Tamr combines several features:
      – Data distribution
      – Property names
      – Property metadata
   – It learns how to compose such functions via machine learning, through an iterative process where human experts can provide input and improve predictions
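The idea of combining several features into one mapping suggestion can be illustrated generically. The sketch below does not use or describe Tamr's API: it combines only two of the listed features (property names and value distribution, approximated by value overlap), with fixed weights where a real system would learn them. All function names and sample values are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def name_similarity(a: str, b: str) -> float:
    """Similarity of two property names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def value_overlap(a: List[str], b: List[str]) -> float:
    """Jaccard overlap of observed values, a crude proxy for data distribution."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0


def rank_targets(
    source_name: str,
    source_values: List[str],
    targets: Dict[str, List[str]],
    w_name: float = 0.5,
    w_values: float = 0.5,
) -> List[Tuple[str, float]]:
    """Candidate target attributes, best first, each with a confidence score."""
    scored = [
        (
            target,
            w_name * name_similarity(source_name, target)
            + w_values * value_overlap(source_values, values),
        )
        for target, values in targets.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Illustrative values only.
targets = {
    "Tracking number": ["TN-001", "TN-003"],
    "Identifier": ["GSM701607"],
}
ranking = rank_targets("Sample tracking number", ["TN-001", "TN-002"], targets)
print(ranking)
```

A curator confirming or rejecting the top suggestion provides the labelled examples from which the weights could be re-estimated.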

Page 27: Artificial Intelligence in Data Curation

Schema-mapping (Tamr)

Users are shown a range of potential mappings, each with a confidence score. They can confirm a suggestion or propose a different mapping. New predictions are routinely provided as more input accumulates.

[Figure: user interface for curators showing potential attribute matches]

Page 28: Artificial Intelligence in Data Curation

Entity de-duplication (Tamr)

Users are shown a set of potential duplicates, each with a confidence score. They can accept or reject these suggestions, thus providing training data and iteratively refining predictions.

[Figure: user interface for curators showing potential duplicates]

Page 29: Artificial Intelligence in Data Curation

Entity de-duplication (Tamr)

[Figure: details of the implementation of the de-duplication process (courtesy of Tamr)]

Page 30: Artificial Intelligence in Data Curation

Re-introducing logic

• Can we predict (or suggest) the association between parameters and entities in a template?
   – An ontology models the “real world”: entities, qualities, processes
   – Parameters are annotated with axioms based on this ontology
   – Inference provides multiple classifications of parameters, as well as possible/necessary associations between parameters and entities.

• Can this work?

Page 31: Artificial Intelligence in Data Curation

Re-introducing logic

[Figure: extract from an ontology representing entities and qualities]

[Figure: example of axiomatic mapping between a parameter and an entity and qualities ontology]

Deductions for parameter ReportID:
• must refer to: Report, Document, Descriptive Entity, Concrete Entity, Entity, Information Entity, Immaterial Entity
• may refer to: Report, InternalReport
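The “must refer to” / “may refer to” deductions can be approximated with a plain class hierarchy: what a parameter must refer to is the annotated class and all its ancestors, and what it may refer to is that class and its descendants. The sketch below assumes a toy hierarchy; the class names come from the slide, but the way they are arranged under each other is an assumption, and a real system would use an OWL reasoner, not a hand-written traversal.

```python
from typing import Dict, List, Set

# Hypothetical arrangement of the class names listed on the slide.
PARENTS: Dict[str, List[str]] = {
    "InternalReport": ["Report"],
    "Report": ["Document"],
    "Document": ["Descriptive Entity", "Concrete Entity"],
    "Descriptive Entity": ["Information Entity"],
    "Information Entity": ["Immaterial Entity"],
    "Immaterial Entity": ["Entity"],
    "Concrete Entity": ["Entity"],
    "Entity": [],
}


def ancestors_or_self(cls: str) -> Set[str]:
    """The class and everything it is (transitively) a subclass of."""
    seen: Set[str] = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def descendants_or_self(cls: str) -> Set[str]:
    """The class and everything that is (transitively) a subclass of it."""
    return {c for c in PARENTS if cls in ancestors_or_self(c)}


# Assume the parameter ReportID is annotated as referring to a Report.
must_refer_to = ancestors_or_self("Report")
may_refer_to = descendants_or_self("Report")
print(sorted(must_refer_to))
print(sorted(may_refer_to))
```

Under this toy arrangement the two sets match the deductions listed for ReportID.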

Page 32: Artificial Intelligence in Data Curation

Exploring automatic ontology matching

• 26 submissions. Algorithms covering structural approaches, axiomatic mappings and use of background knowledge

• Phenotype track sponsored by the Pistoia Alliance Ontologies Mapping Project

• Evaluation results for Phenotype track submitted to Journal of Biomedical Semantics

http://oaei.ontologymatching.org/2016

Page 33: Artificial Intelligence in Data Curation

Conclusions: On rules, standards and data ethnography

• Data curation: “AI” may help (not limited to ML)
   – Formal knowledge representation is part of the goal

• The need for explanations
   – We need to define (document) a process
   – We have theorems for proofs: can we do without?
   – Is there a role for “ML” gurus?

• The “human side” of data
   – Data normalization is based on assumptions (e.g.: what can be considered the same and what cannot): there is a cultural side to this.
   – Would we accept an AI “editor”?

Page 34: Artificial Intelligence in Data Curation

Acknowledgments

• NIBR
• Daniel Cronenberger
• Ming Fang
• Frederic Sutter
• Anosha Siripala
• Fabien Pernot
• Jean Marc von Allmen
• Martin Petracchi
• Dorothy Reilly
• Pierre Parisot
• Therese Vachon

• Tamr.com
• Pistoia Alliance Ontology Matching Project team

Page 35: Artificial Intelligence in Data Curation

Thank you