AI for Data Curation
Yes, can we?
Andrea Splendiani, AD, Information Systems
London, September 28, 2017
NIBR Informatics
Business or Operating Unit/Franchise or Department
Agenda
1. Focus: metadata and reference data
2. Knowledge Engineering and AI
3. Data curation: a use case for AI?
4. Ideas and experiences
5. Conclusions
What we do in context
Some considerations at 10000ft
Holistic view on a process (1000ft)
Details
Reflections at 10000ft
Focus: metadata and reference data
1. What:
– Annotation of datasets
– Standards
– Ontologies
– Reference information
2. Why:
– Support analysis
– Support search and query answering
– Support extraction
– Building knowledge networks / information discovery and inference
3. Where:
– Typically in research
Can Artificial Intelligence solve biology? (a stopper)
• 10 years ago: AI approaches to Systems Biology
• Ontology based knowledge-bases (Semantic Web)
• ANN/Fuzzy systems even older
Knowledge Engineering and AI
Can Artificial Intelligence solve biology? (taken seriously)
• Now: AI and ML are at the peak of the hype
• Interest in Life Sciences industries
Knowledge Engineering and AI
Knowledge Engineering and AI
• What helped the resurgence of ML?
– Massive data available
– Massive computational power available
– A few technical improvements
– Success stories (Deep learning)
• Do these also apply to Ontology/Sem-Web based systems?
– Uniprot: 5.7B triples in 2009, 30+B triples in 2017
– EBI RDF Platform (2015)
– Wikidata (2014?)
Source: https://tools.wmflabs.org/wikidata-todo/stats.php
Knowledge Engineering and AI
• The way information is represented has implications for what is built on it (e.g. analytics, data mining)
– Networks: are parallel executions combined with AND or with OR?
– Annotations: explicit mention of negative information
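To make the two representation issues above concrete, here is a minimal Python sketch. The toy pathway, the gene names and the annotation term are hypothetical; the point is only that the same data gives different answers depending on whether AND/OR semantics and negation are represented explicitly.

```python
# Hypothetical toy pathway: two parallel upstream branches feed one downstream step.
upstream_active = {"branch_a": True, "branch_b": False}

def downstream_active(inputs, join):
    """Evaluate a join node; `join` states how the parallel branches combine."""
    if join == "AND":
        return all(inputs.values())
    if join == "OR":
        return any(inputs.values())
    raise ValueError(f"unknown join semantics: {join}")

# Hypothetical annotations; the second one carries explicit negative information.
annotations = [
    {"gene": "G1", "term": "kinase activity", "negated": False},
    {"gene": "G2", "term": "kinase activity", "negated": True},
]

def genes_with(term, annots, honour_negation=True):
    """List genes annotated with `term`, optionally ignoring the negation flag."""
    return sorted(
        a["gene"]
        for a in annots
        if a["term"] == term and not (honour_negation and a["negated"])
    )

print(downstream_active(upstream_active, "AND"), downstream_active(upstream_active, "OR"))
print(genes_with("kinase activity", annotations))
print(genes_with("kinase activity", annotations, honour_negation=False))
```

An analysis that drops the join type or the negation flag would treat both readings as equivalent, which is the risk the slide points at.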
Knowledge Engineering and AI
• Metadata is important in a data-centric world (and in at least some ML applications)
• Knowledge representation matters, beyond metadata (examples: AND/OR in pathways, NOT in annotations…)
• We start to have large, distributed knowledge-bases
– Is there a role for AI systems based on logic/KR?
– Can we combine symbolic and sub-symbolic reasoning?
– Is this already happening?
Data curation
• Annotation
• Metadata
• Standards
• Model
• Literature
• Databases
• …
Source BioCuration 2017 Abstracts via wordscloud.com
An example: public data curation
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM701607
https://www.ebi.ac.uk/rdf/services/describe?uri=http%3A%2F%2Frdf.ebi.ac.uk%2Fresource%2Fbiosamples%2Fsample%2FSAMEA1189935
An example: public data curation (data view)
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM701607
Property | Value | Ontology | Bio-Characteristic?
Sample_source_name | WT6 biological rep 1, Affy processing batch 2 | EFO_0000001 |
Organism | Mus musculus | EFO_0000001, NCBITaxon_10090 |
strain | 129S6/Sv/Ev | EFO_0000001 | Bio
genotype | wild type | EFO_0000001, EFO_0005168 | Bio
Sex | male | EFO_0000001, EFO_0001266, PATO_0000384 |
age | 6 weeks old | EFO_0000001 | Bio
https://www.ebi.ac.uk/rdf/services/describe?uri=http%3A%2F%2Frdf.ebi.ac.uk%2Fresource%2Fbiosamples%2Fsample%2FSAMEA1189935
An example: public data curation
Supports:
• Aggregation
• Analysis
• Search
• Link discovery
• “Machine learning”
Can we use AI for Data Curation?
Why?
– Data curation is an intellectually intensive and time-consuming activity
– Given the increasing role and amount of data, curation risks becoming a bottleneck
Example of exponential growth in data
AI for data curation: characteristics and constraints
• Can we automate data curation?
• Difficult:
– Missing data
– Discretion (e.g. level of granularity)
• Looks reasonable:
– Repetition
– Consistency
– Data/distance evaluations (clustering/attractors)
• We need to combine human aspects and automatable aspects
AI for data curation
Framing the problem: what
Should this value be normalized?
Meaning. E.g.: is “age” same as “years”?
Confidence: is this information true ?
The need. E.g.: is this information required? When? Is this a valid identifier?
Example, extract from NCBI GEO GSM701607
AI for data curation
Framing the problem: how
We consider curation activities as functions in a “curation space” that is exemplified via a “curation record”
Validation state (Confidence): Valid | Valid | Valid
Curation goal (The need): Required | Required | Required | Required | Required
Semantic type¹ (Meaning): Identifier about Sample | ID² about Organism | Name about Organism | Name about Gender | Identifier about Gender | Description about Age | Age Unit about Age
Field Name (the “location” in the source): ID | taxID | Organism | Gender | age
Value: GSM701607 | 10090 | Mus Musculus | 6 weeks old
¹ All semantic types are expressed via an ontology (here presented as a simplified definition)
² Identifiers also require a domain specification
Example, extract from NCBI GEO GSM701607 (only a subset of fields from the previous slide are considered)
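One possible way to hold such a curation record in code is sketched below. The class and attribute names are assumptions made for illustration, not the representation used by the authors, and the goal and validation slots are left empty because the slide does not state which column each value belongs to.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SemanticType:
    kind: str    # e.g. "Identifier", "Name", "Description"
    about: str   # e.g. "Sample", "Organism", "Age"

@dataclass
class Column:
    semantic_type: SemanticType
    field_name: Optional[str] = None   # the "location" in the source, if any
    value: Optional[str] = None
    goal: Optional[str] = None         # curation goal, e.g. "Required"
    validation: Optional[str] = None   # validation state, e.g. "Valid"

# A subset of the GSM701607 example; the last column has no source field yet.
record = [
    Column(SemanticType("Identifier", "Sample"), "ID", "GSM701607"),
    Column(SemanticType("Identifier", "Organism"), "taxID", "10090"),
    Column(SemanticType("Name", "Organism"), "Organism", "Mus Musculus"),
    Column(SemanticType("Description", "Age"), "age", "6 weeks old"),
    Column(SemanticType("Age Unit", "Age")),
]

def missing(rec):
    """Semantic types that are part of the record but have no value yet."""
    return [c.semantic_type for c in rec if c.value is None]

print(missing(record))
```

Keeping the semantic type separate from the source field name is what lets a column exist before any source field has been mapped to it.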
AI and data curation
Using a record to modularize curation processes
• Different classes of operations
– Schema mapping (assign a type)
– Standard setting (assign a goal)
– Validation (setting a validation value)
Validation state: Valid | Valid | Valid
Curation goal: Required | Required | Required | Required
Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender | Description about Age | Age Unit about Age
Field Name: ID | Gender | age
Value: GSM701607 | 6 weeks old

Validation state: Valid | Valid
Curation goal: Required
Semantic type: Identifier about Sample | Name about Organism | Name about Gender
Field Name: ID | Organism | Gender
Value: GSM701607 | Mus Musculus

Validation state: Valid | Valid | Valid
Curation goal: Required | Required | Required | Required
Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender | Description about Age | Age Unit about Age
Field Name: ID | Gender | age
Value: GSM701607 | 6 weeks old
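The three classes of operations (schema mapping, standard setting, validation) can be read as small functions that each fill one slot of a column. The sketch below assumes a dictionary per column and a hard-coded lookup table; both are illustrative stand-ins, and a real system would consult an ontology or a learned model instead.

```python
def new_column(field_name, value):
    """A column as it arrives from the source: no type, goal or validation yet."""
    return {"field": field_name, "value": value, "type": None, "goal": None, "valid": None}

# Hypothetical lookup from source field names to (kind, about) semantic types.
FIELD_TYPES = {
    "ID": ("Identifier", "Sample"),
    "Gender": ("Name", "Gender"),
    "age": ("Description", "Age"),
}

def map_schema(col):
    """Schema mapping: assign a semantic type."""
    return {**col, "type": FIELD_TYPES.get(col["field"])}

def set_standard(col, required_types):
    """Standard setting: assign a curation goal."""
    return {**col, "goal": "Required" if col["type"] in required_types else None}

def validate(col, check):
    """Validation: set a validation value when the check passes."""
    return {**col, "valid": "Valid" if check(col["value"]) else None}

raw = new_column("ID", "GSM701607")
curated = validate(
    set_standard(map_schema(raw), {("Identifier", "Sample")}),
    lambda v: v.startswith("GSM"),   # assumed format check for GEO sample IDs
)
print(curated)
```

Each function returns a new dictionary, so the operations can be applied in any order and composed freely.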
AI and data curation
Using a record to modularize curation processes
• Different classes of operations
– Normalization (filling a column)
– Enrichment (adding a column)
Validation state: Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender
Field Name: ID | Gender
Value: GSM701607 | male

Validation state: Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender
Field Name: ID | Gender
Value: GSM701607 | male | PATO:0000384

Validation state: Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | ID² about Organism | Name about Organism | Description about Age
Field Name: ID | taxID | Organism | age
Value: GSM701607 | 10090 | Mus Musculus | 6 weeks old

Validation state: Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | ID² about Organism | Name about Organism | Description about Age | Identifier about Sample
Field Name: ID | taxID | Organism | age | EBI ref.
Value: GSM701607 | 10090 | Mus Musculus | 6 weeks old | SAMEA1189935
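Normalization and enrichment can be sketched the same way: the first fills a column that already exists in the record, the second adds a column that was not there. The two lookup tables below reuse the values shown in the records above; treating them as plain dictionaries is an assumption made for illustration.

```python
# Values taken from the example records; a real system would query a service.
GENDER_IDS = {"male": "PATO:0000384"}
SAMPLE_XREFS = {"GSM701607": "SAMEA1189935"}

record = {
    ("Identifier", "Sample"): "GSM701607",
    ("Name", "Gender"): "male",
    ("Identifier", "Gender"): None,      # column exists but is empty
}

def normalize(rec):
    """Normalization: fill an existing, empty column."""
    out = dict(rec)
    name = out.get(("Name", "Gender"))
    if out.get(("Identifier", "Gender")) is None and name in GENDER_IDS:
        out[("Identifier", "Gender")] = GENDER_IDS[name]
    return out

def enrich(rec):
    """Enrichment: add a column that was not part of the record."""
    out = dict(rec)
    xref = SAMPLE_XREFS.get(out[("Identifier", "Sample")])
    if xref is not None:
        out[("Identifier", "Sample", "EBI ref.")] = xref
    return out

curated = enrich(normalize(record))
print(curated)
```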
Big picture
Quantity/Quality tradeoff
[Figure: quality/validity versus time/cost]
• Is the optimal trade-off the same for all data?
• Can this change for the same data over time and across use cases?
• Can we embed a “cost function” in curation processes ?
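As a thought experiment on the last question, a cost function could be as simple as comparing the expected quality gain of each curation step with its cost. Everything in this sketch is hypothetical: the step names, the gains, the costs and the single `value_per_quality_unit` knob that stands in for the use case.

```python
# Hypothetical curation steps: (name, expected quality gain in [0, 1], cost in hours).
STEPS = [
    ("schema mapping", 0.30, 1.0),
    ("normalization", 0.25, 2.0),
    ("manual review", 0.10, 8.0),
]

def plan(steps, value_per_quality_unit):
    """Keep a step only if its expected gain is worth more than its cost."""
    return [
        name
        for name, gain, cost in steps
        if gain * value_per_quality_unit > cost
    ]

# The same data, curated for two use cases that value quality differently.
print(plan(STEPS, value_per_quality_unit=10))
print(plan(STEPS, value_per_quality_unit=100))
```

Changing the value assigned to quality changes which steps are worth running, which is one way the optimal trade-off could differ across data sets and over time.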
Big picture
(Meta) data evolution, immutability
Initial condition: organism name present, missing ID
Initial condition: identifier extracted, not verified
Identifier extracted and verified
Entity: 1234 | Information: V1 | Meta-Info: V1
Entity: 1234 | Information: V2 | Meta-Info: V2
Entity: 1234 | Information: V2 | Meta-Info: V3

Validation state: Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender
Field Name: ID | Gender
Value: GSM701607 | male

Validation state: Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender
Field Name: ID | Gender
Value: GSM701607 | male | PATO:0000384

Validation state: Valid | Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender
Field Name: ID | Gender
Value: GSM701607 | male | PATO:0000384
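The evolution above (information and meta-information versioned separately, with earlier states left untouched) can be sketched as an append-only history. The `Version` class and the wording of the meta-information values are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Version:
    entity: str
    information: dict   # curated values
    meta: dict          # meta-information, e.g. validation state

history = []  # append-only: earlier versions are never modified

def commit(entity, information, meta):
    """Record a new immutable snapshot instead of updating in place."""
    version = Version(entity, dict(information), dict(meta))
    history.append(version)
    return version

# Information V1 / Meta-Info V1: name present, identifier missing.
commit("1234", {"gender": "male"}, {"gender_id": "missing"})
# Information V2 / Meta-Info V2: identifier extracted, not verified.
commit("1234", {"gender": "male", "gender_id": "PATO:0000384"},
       {"gender_id": "extracted, not verified"})
# Information V2 / Meta-Info V3: same information, now verified.
commit("1234", {"gender": "male", "gender_id": "PATO:0000384"},
       {"gender_id": "verified"})

for number, version in enumerate(history, start=1):
    print(number, version.information, version.meta)
```

The third commit changes only the meta-information, which mirrors the step from V2/V2 to V2/V3.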
Ideas and experiences
Some details
Data and metadata transformations (deterministic actions + extractors)
• Curation processes can be expressed (by curators) in terms of rules
• Rules embed “atomic operations” e.g.: extractors, transformations,…
• Simple rules go a very long way…
<ruleConfig method="Extract">
  <param name="setType" value="UNIT"/>
  <param name="setAmbiguous" value="true"/>
  <param name="setFullMatch" value="false"/>
  <param name="setResultInJson" value="false"/>
  <param name="setSimpleJson" value="false"/>
  <param name="setText">
    <ruleConfig method="GetCell">
      <param name="setAttr" value="AgeDescription"/>
      <param name="setBase" value="XCF_1"/>
    </ruleConfig>
  </param>
</ruleConfig>
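To show how a nested rule of this kind might be evaluated, here is a small interpreter sketch. The semantics are guessed from the parameter names only: `GetCell` is assumed to read an attribute from a row, and `Extract` with `setType="UNIT"` is assumed to pull a time unit out of text using a hypothetical regular expression. The other parameters are ignored, and none of this is the actual rule engine.

```python
import re
import xml.etree.ElementTree as ET

CONFIG = """
<ruleConfig method="Extract">
  <param name="setType" value="UNIT"/>
  <param name="setAmbiguous" value="true"/>
  <param name="setFullMatch" value="false"/>
  <param name="setText">
    <ruleConfig method="GetCell">
      <param name="setAttr" value="AgeDescription"/>
      <param name="setBase" value="XCF_1"/>
    </ruleConfig>
  </param>
</ruleConfig>
"""

UNIT_PATTERN = r"\b(day|week|month|year)s?\b"   # assumed unit vocabulary

def run(rule, row):
    """Evaluate a rule element; nested rules are evaluated first."""
    params = {p.get("name"): p for p in rule.findall("param")}
    method = rule.get("method")
    if method == "GetCell":
        return row[params["setAttr"].get("value")]
    if method == "Extract":
        text = run(params["setText"].find("ruleConfig"), row)
        match = re.search(UNIT_PATTERN, text)
        return match.group(1) if match else None
    raise ValueError(f"unsupported method: {method}")

result = run(ET.fromstring(CONFIG.strip()), {"AgeDescription": "6 weeks old"})
print(result)
```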
Abstract rules and meta-rules
• Rules can rely on abstraction/inference for higher genericity
• They can also be used to produce meta-information
Example rules (pseudo-syntax)
• Compute missing identifier: If (E.X.type=“Identifier” and E.X.Goal=“Required” and E.X.Value=“” and exists (E.Y: E.Y.type.about=E.X.type.about and E.Y.type=“Description” and E.Y.Value!=“”)) then E.X.Value=extract(isAbout(E.Y.type), E.Y.Value)
• Set a curation goal: If subClassOf(E.OrganismID.Value, NCBI_40674) then E.GenderID.Goal=“Required”
• Assert validity on condition: If one identifier is unambiguously extracted from a species name, then Validation State=Valid
Validation state: Valid | Valid
Curation goal: Required | Required | Required
Semantic type: Identifier about Sample | ID about Organism | Name about Organism | Name about Gender | Identifier about Gender
Field Name (the “location” in the source): ID | taxID | Organism | Gender
Value: GSM701607 | 10090 | Mus Musculus
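The first two example rules can be turned into a runnable sketch as follows. The `NAME_TO_ID` table and the `MAMMAL_TAXA` set are hypothetical stand-ins for an extractor and for a `subClassOf` query against the NCBI taxonomy, and the sample record uses a Description column so that the first rule applies exactly as written.

```python
# Hypothetical stand-ins for an extractor and an ontology service.
NAME_TO_ID = {"Organism": {"Mus Musculus": "10090"}}
MAMMAL_TAXA = {"10090"}   # taxon IDs assumed to fall under NCBI_40674

def col(kind, about, value="", goal=None):
    return {"kind": kind, "about": about, "value": value, "goal": goal}

def extract(about, text):
    """Look up an identifier for `text` in the domain named by `about`."""
    return NAME_TO_ID.get(about, {}).get(text, "")

def compute_missing_identifiers(record):
    """Rule 1: fill a required, empty identifier from a description of the same thing."""
    for x in record:
        if x["kind"] == "Identifier" and x["goal"] == "Required" and x["value"] == "":
            for y in record:
                if y["about"] == x["about"] and y["kind"] == "Description" and y["value"] != "":
                    x["value"] = extract(y["about"], y["value"])
                    break

def require_gender_id_for_mammals(record):
    """Rule 2: if the organism is a mammal, the gender identifier becomes required."""
    organism_ids = {
        c["value"] for c in record if (c["kind"], c["about"]) == ("Identifier", "Organism")
    }
    if organism_ids & MAMMAL_TAXA:
        for c in record:
            if (c["kind"], c["about"]) == ("Identifier", "Gender"):
                c["goal"] = "Required"

record = [
    col("Identifier", "Organism", goal="Required"),
    col("Description", "Organism", "Mus Musculus"),
    col("Identifier", "Gender"),
]
compute_missing_identifiers(record)
require_gender_id_for_mammals(record)
print(record)
```

The second rule depends on the output of the first, so rule ordering (or repeated application until nothing changes) matters in a fuller implementation.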
“Approximate” transformations
• Some transformations cannot (easily) be expressed in terms of rules
– Complex and ad hoc relations
– Discretionary elements
• Examples:
– Entity de-duplication
  – Whether two mentions of homonymous authors refer to the same author is a complex function of a wide range of the authors’ features (where they work, contact information, subject of study, …)
– Schema mapping
  – Determining the meaning of an attribute (e.g. time) is a complex function of the values this attribute takes, as well as other parameters (is this a duration, a time point, or an execution timestamp?)
  – Is “Sample tracking number” to be mapped to “Tracking number” or to “Identifier”?
Implementation of de-duplication and schema mapping via Tamr
• One approach that we have chosen to provide approximate schema-mapping and de-duplication functions is via Tamr (tamr.com)
• Tamr is a data unification platform that combines machine learning with human expertise.
– E.g. to support schema mapping, Tamr combines several features:
  – Data distribution
  – Property names
  – Property metadata
– It learns how to compose such functions via machine learning, through an iterative process where human experts can provide input and improve predictions
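The idea of combining several features into one mapping score can be illustrated with a toy scorer. This is not Tamr's algorithm or API: it is a generic sketch that mixes property-name similarity with value overlap using a fixed weight, whereas a learned system would fit that combination from curator feedback. The target attributes and values are hypothetical.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """String similarity between two property names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def value_overlap(xs, ys):
    """Jaccard overlap of observed values, a crude proxy for data distribution."""
    xs, ys = set(xs), set(ys)
    union = xs | ys
    return len(xs & ys) / len(union) if union else 0.0

def suggest(source_name, source_values, targets, w_name=0.5):
    """Rank target attributes by a weighted mix of the two features."""
    scored = [
        (w_name * name_similarity(source_name, name)
         + (1 - w_name) * value_overlap(source_values, values), name)
        for name, values in targets.items()
    ]
    return sorted(scored, reverse=True)

targets = {
    "Tracking number": ["T-1", "T-2"],
    "Identifier": ["GSM701607"],
}
ranking = suggest("Sample tracking number", ["T-1", "T-3"], targets)
for score, name in ranking:
    print(f"{name}: {score:.2f}")
```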
Schema-mapping (Tamr)
Users are shown a range of potential mappings, each with a confidence score. They can confirm them or suggest different mappings. New predictions are routinely provided as more input accumulates.
User interface for curators showing potential attribute matches
Entity de-duplication (Tamr)
User interface for curators showing potential duplicates
Users are shown a set of potential duplicates with a confidence score. They can accept or reject such suggestions, thus providing training data and iteratively refining predictions.
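The accept/reject loop described here can be sketched generically. The code below is not Tamr's implementation: it scores candidate pairs by average string similarity and re-estimates a cut-off from curator decisions using a simple midpoint, which is an assumption chosen for brevity. The author records are made up.

```python
from difflib import SequenceMatcher
from itertools import combinations

def score(a, b):
    """Average string similarity over the fields two records share."""
    keys = a.keys() & b.keys()
    if not keys:
        return 0.0
    return sum(
        SequenceMatcher(None, a[k].lower(), b[k].lower()).ratio() for k in keys
    ) / len(keys)

def candidates(records, threshold):
    """Pairs of record indices whose score reaches the threshold, best first."""
    pairs = [
        (score(a, b), i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
    ]
    return sorted((p for p in pairs if p[0] >= threshold), reverse=True)

def refit_threshold(labelled, default=0.8):
    """Midpoint between the lowest accepted and the highest rejected score."""
    accepted = [s for s, ok in labelled if ok]
    rejected = [s for s, ok in labelled if not ok]
    if not accepted or not rejected:
        return default
    return (min(accepted) + max(rejected)) / 2

# Hypothetical author mentions.
records = [
    {"name": "A. Smith", "affiliation": "Univ X"},
    {"name": "A Smith", "affiliation": "Univ. X"},
    {"name": "B. Jones", "affiliation": "Inst Y"},
]
suggestions = candidates(records, threshold=0.8)
print([(i, j) for _, i, j in suggestions])
print(refit_threshold([(0.9, True), (0.6, False)]))
```

Each curator decision adds a labelled pair, so the threshold (or, in a fuller system, the whole scoring model) can be refit as feedback accumulates.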
Entity de-duplication (Tamr)
Details of the implementation of the deduplication process (courtesy of Tamr)
Re-introducing logic
• Can we predict (or suggest) the association between parameters and entities in a template?
– An ontology models the “real world”: entities, qualities, processes
– Parameters are annotated with axioms based on this ontology
– Inference provides multiple classifications of parameters, as well as possible/necessary associations between parameters and entities.
• Can this work?
Re-introducing logic
Extract from an ontology representing entities and qualities
Example of axiomatic mapping between a parameter and an entity and qualities ontology
Deductions for parameter ReportID:
must refer to: Report, Document, Descriptive Entity, Concrete Entity, Entity, Information Entity, Immaterial Entity
may refer to: Report, InternalReport
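One simple reading of these deductions is that "must refer to" collects the asserted class and all of its superclasses, while "may refer to" collects the asserted class and its subclasses. The sketch below follows that reading over a small, assumed class hierarchy; the real ontology is richer and is not reproduced here, so only a subset of the classes listed above appears.

```python
# Assumed fragment of a class hierarchy: child -> list of direct parents.
PARENTS = {
    "InternalReport": ["Report"],
    "Report": ["Document"],
    "Document": ["Information Entity"],
    "Information Entity": ["Entity"],
    "Entity": [],
}

def ancestors(cls):
    """The class itself plus every class reachable through parent links."""
    seen, stack = [], [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(PARENTS.get(current, []))
    return seen

def descendants(cls):
    """The class itself plus every class that has it as an ancestor."""
    return [c for c in PARENTS if cls in ancestors(c)]

# A parameter such as ReportID is assumed to be annotated as referring to a Report.
must_refer_to = ancestors("Report")
may_refer_to = descendants("Report")
print("must refer to:", must_refer_to)
print("may refer to:", may_refer_to)
```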
Exploring automatic ontology matching
• 26 submissions. Algorithms covering structural approaches, axiomatic mappings and use of background knowledge
• Phenotype track sponsored by the Pistoia Alliance Ontologies Mapping Project
• Evaluation results for Phenotype track submitted to Journal of Biomedical Semantics
http://oaei.ontologymatching.org/2016
Conclusions: On rules, standards and data ethnography
• Data Curation: “AI” may help (not limited to ML)
– Formal knowledge representation is part of the goal
• The need for explanations
– We need to define (document) a process
– We have theorems for proofs: can we do without?
– Is there a role for “ML” gurus?
• The “human side” of data
– Data normalization is based on assumptions (e.g. what can be considered the same, what not): there is a cultural side to this.
– Would we accept an AI “editor”?
Acknowledgments
• NIBR
• Daniel Cronenberger
• Ming Fang
• Frederic Sutter
• Anosha Siripala
• Fabien Pernot
• Jean Marc von Allmen
• Martin Petracchi
• Dorothy Reilly
• Pierre Parisot
• Therese Vachon
• Tamr.com
• Pistoia Alliance Ontology Matching Project team
Thank you