Reconciling succeeding taxonomic classifications

Nico M. Franz

School of Life Sciences, Arizona State University

Mingmin Chen, Shizhuo Yu, Bertram Ludäscher *

Department of Computer Science, University of California at Davis

ESA Annual Meeting 2012

November 14, 2012 – Knoxville, TN

* PI – NSF-IIS 1118088: A logic-based, provenance-aware system for merging scientific data under context and classification constraints.

Franz et al. 2012. Reconciling Succeeding Classifications, ESA 2012


Presentation on reconciling taxonomic concepts using the Euler approach, given at the 2012 Annual Meeting of the Entomological Society of America, Knoxville, TN.



Challenge – describing classification provenance beyond synonymy

Source: Weakley. 2005. Flora of the Carolinas, Virginia, and Georgia. Available at http://www.herbarium.unc.edu/flora.htm

Andropogon spp. in the Carolinas, from Hackel 1889 to Weakley 2005

Individual columns represent past classifications of Andropogon.


Individual rows represent equivalent taxonomic entities, (almost) regardless of their name labels.

Name/synonymy relationships are not sufficiently granular to capture this evolution of taxonomic views of Andropogon species.


Tracking classification provenance with concepts and articulations

Definition: A taxonomic concept is the underlying meaning of a scientific name as stated by a particular author and publication. It represents the author's full-blown view of how the name reaches out to un-/observed objects in nature.

Labeling: The abbreviation sec. for the Latin secundum, or "according to", is preceded by the full Linnaean name and followed by the specific author and publication.

Examples: Andropogon virginicus L. sec. Radford et al. (1968) [earlier, wider concept]

Andropogon virginicus L. sec. Weakley (2005) [later, narrower concept]

Utility: Representing multiple classifications (revisions) through concepts makes it possible to track their similarities and differences through articulations.

Source: Berendsohn. 1995. The concept of "potential taxa" in databases. Taxon 44: 207–212.


Five basic articulations between two concepts C1, C2 (set theory)

equivalence

proper inclusion

inverse proper inclusion

overlap

exclusion

Source: Franz & Peet. 2009. Towards a language for mapping relationships among taxonomic concepts. Syst. Biodiv. 7: 5–20.

Use of "OR" to express uncertainty. Example: C1 == OR > C2
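The five articulations can be illustrated with ordinary set operations. The following is a minimal sketch, assuming that each concept is modelled as a non-empty Python set of member identifiers; the function name `articulation` and the specimen identifiers are hypothetical and are not part of the talk or of CleanTax.

```python
def articulation(c1: set, c2: set) -> str:
    """Return the basic articulation of concept c1 relative to concept c2."""
    if not c1 or not c2:
        raise ValueError("concepts are assumed to be non-empty")
    if c1 == c2:
        return "equivalence"
    if c1 > c2:          # c1 is a proper superset of c2
        return "proper inclusion"
    if c1 < c2:          # c1 is a proper subset of c2
        return "inverse proper inclusion"
    if c1 & c2:          # shared members, but neither contains the other
        return "overlap"
    return "exclusion"


# Hypothetical members (e.g. specimen identifiers) of two concepts.
wide = {"s1", "s2", "s3"}
narrow = {"s1", "s2"}

# An uncertain articulation such as "C1 == OR > C2" is a set of allowed answers.
uncertain = {"equivalence", "proper inclusion"}
is_compatible = articulation(wide, narrow) in uncertain
```

In practice the members of a concept are rarely enumerable, which is why articulations are asserted by experts and then checked by reasoners rather than computed this way.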


How does it work? Connecting Hackel 1889 and Small 1933

Hackel 1889 (1-12)

Small 1933 (13-16)

Step 1: Transcribe two concept hierarchies and add unique IDs


Step 2: Create a table with all concept labels


Step 3: Create a table with corresponding parent/child relationships ('is_a')


Step 4: Create a table with a suitable set of articulations


[Figure annotations: Translation; Congruence]
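The four steps above can be sketched as three small tables. This is only an illustration under stated assumptions: the taxon names ("Aus", "Aus bus", "Aus cus") are placeholders rather than rows from the Hackel 1889 or Small 1933 classifications, and the `Concept` class and `check_tables` helper are hypothetical names, not CleanTax code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    concept_id: int   # Step 1: unique ID
    name: str
    source: str       # author and publication

    def label(self) -> str:
        return f"{self.name} sec. {self.source}"


# Step 2: table of concept labels (placeholder names, not the real data).
concepts = {c.concept_id: c for c in [
    Concept(1, "Aus", "Hackel 1889"),
    Concept(2, "Aus bus", "Hackel 1889"),
    Concept(3, "Aus cus", "Hackel 1889"),
    Concept(13, "Aus", "Small 1933"),
    Concept(14, "Aus bus", "Small 1933"),
]}

# Step 3: parent/child ('is_a') table as (child_id, parent_id) pairs.
is_a = [(2, 1), (3, 1), (14, 13)]

# Step 4: articulations as (left_id, allowed relations, right_id);
# more than one allowed relation expresses an "OR".
articulations = [(1, {"=="}, 13), (2, {"==", ">"}, 14)]


def check_tables(concepts, is_a, articulations):
    """Return a list of basic bookkeeping problems found in the tables."""
    problems = []
    for child, parent in is_a:
        if child not in concepts or parent not in concepts:
            problems.append(f"is_a refers to unknown ID: {child} -> {parent}")
        elif concepts[child].source != concepts[parent].source:
            problems.append(f"is_a crosses sources: {child} -> {parent}")
    for left, relations, right in articulations:
        if left not in concepts or right not in concepts:
            problems.append(f"articulation refers to unknown ID: {left}, {right}")
        elif concepts[left].source == concepts[right].source:
            problems.append(f"articulation within one source: {left}, {right}")
        if not relations:
            problems.append(f"articulation without relation: {left}, {right}")
    return problems
```

A check of this kind only catches bookkeeping errors; whether the articulations are logically consistent is a separate question for a reasoner.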

[Figure labels: Concept hierarchies; Articulations]


Technical challenges to creating articulations


Input of concept hierarchies

Lack of a server-based platform (e.g. Global Names Architecture)

Lack of user-friendly classification input / visualization tools

Input of articulations (goal: achieve a complete and consistent mapping)

Taxonomic experts will not input ∞ articulations

Taxonomic experts will miss relevant articulations ("mir")

Taxonomic experts could be uncertain of articulations ("possible worlds")

Taxonomic experts could posit logically inconsistent articulations

"CleanTax" is being developed to explore solutions to these challenges. [1]

[1] There is continuation/overlap with the "Exploring Taxonomic Concepts" project that focuses on character matching (DBI-1147266).


CleanTax – technical specifications


CleanTax = a set of Python scripts stored on bitbucket.org (initially developed by Dave Thau; now being developed further on many fronts)

CleanTax reads in concept/articulation tables from a PostgreSQL database

CleanTax transforms the input for processing by logic reasoners, including:

Prover9 / Mace4 theorem provers – first-order logic [thorough, yet slow]

OWL / HermiT – description logic, knowledge representation [complex]

DLV System – propositional logic, answer set programming [promising!]

CleanTax assesses consistency and completeness of articulations

Output of the set of maximally informative relationships – "mir"

Report, causal explanation, interactive repair of inconsistent articulations

Calculate multiple possible worlds (if ambiguous articulations are present)

CleanTax creates multiple user-preferred views of the input and merge taxonomies

Reduced Containment Graph – RCG; and Directed Acyclic Graph – DAG
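To make the ideas of consistency, "mir" and possible worlds concrete, here is a brute-force sketch. It is not how CleanTax or the reasoners listed above work; it assumes that a world can be described by which Venn regions of the concepts are inhabited, and it is only feasible for a handful of concepts. The function names `relation` and `possible_worlds` are hypothetical.

```python
from itertools import combinations, product


def relation(world, a, b):
    """Articulation of concept a relative to b, given the inhabited regions."""
    both = any(a in r and b in r for r in world)
    a_only = any(a in r and b not in r for r in world)
    b_only = any(b in r and a not in r for r in world)
    if not both:
        return "exclusion"
    if a_only and b_only:
        return "overlap"
    if a_only:
        return "proper inclusion"
    if b_only:
        return "inverse proper inclusion"
    return "equivalence"


def possible_worlds(concepts, constraints, pairs):
    """Return the distinct relation tuples for `pairs` over all consistent worlds.

    constraints: (a, allowed_relations, b) triples; several allowed relations = OR.
    An empty result means the constraints are inconsistent.
    """
    # One Venn region per non-empty combination of concepts.
    regions = [frozenset(c) for n in range(1, len(concepts) + 1)
               for c in combinations(concepts, n)]
    found = set()
    for mask in product([False, True], repeat=len(regions)):
        world = [r for r, inhabited in zip(regions, mask) if inhabited]
        # Every concept is assumed to be non-empty.
        if not all(any(c in r for r in world) for c in concepts):
            continue
        if all(relation(world, a, b) in allowed for a, allowed, b in constraints):
            found.add(tuple(relation(world, a, b) for a, b in pairs))
    return found


# Parent P properly includes child C in one taxonomy; C is congruent with X
# in the other. The relation of P to X is not given, but it can be inferred.
inferred = possible_worlds(
    ["P", "C", "X"],
    [("P", {"proper inclusion"}, "C"), ("C", {"equivalence"}, "X")],
    [("P", "X")],
)

# An "OR" articulation leaves more than one possible world.
ambiguous = possible_worlds(
    ["A", "B"], [("A", {"equivalence", "proper inclusion"}, "B")], [("A", "B")]
)
```

The search space grows doubly exponentially with the number of concepts, which is one reason dedicated reasoners are needed for real classifications.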


'Training' CleanTax on abstract examples

Initial expert-made set of articulations. New!

[Figure panels: Input; Output – raw HTML list of articulations ("look-up" + inferred)]

[Figure panels: Input; Output – 72 maximally informative relationships = mir]

Based on the mir, all theoretically possible articulations of the R32 lattice can be logically deduced.


Abstract Example 1 – Reduced Containment Graph of the merge

Blue circles: shared concepts

Black circles: unique concepts

Black solid arrows: expert input

Grey dashed arrows: deducible

Red solid arrows: newly inferred
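One way to read the distinction between needed and deducible arrows is as a transitive reduction of the containment graph. The sketch below rests on that assumption, which the slides do not confirm; the function name `reduce_containment` and the node names are hypothetical, and the input is assumed to be acyclic with equivalent concepts already merged into one node.

```python
def reduce_containment(edges):
    """Keep only edges (a, b) that are not implied by another path from a to b."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, set()).add(child)

    def reachable(start, skip_edge):
        # Depth-first search that ignores the edge under test.
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            for nxt in children.get(node, ()):
                if (node, nxt) == skip_edge or nxt in seen:
                    continue
                seen.add(nxt)
                stack.append(nxt)
        return seen

    return {(a, b) for a, b in edges if b not in reachable(a, (a, b))}


# "A includes C" follows from "A includes B" and "B includes C".
all_edges = {("A", "B"), ("B", "C"), ("A", "C")}
kept = reduce_containment(all_edges)
deducible = all_edges - kept
```

Edges in `deducible` correspond to what a viewer could draw as dashed arrows, while `kept` is the smaller graph that still carries the same containment information.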


More CleanTax training… our infamous Abstract Example 4

Example 4 – representing multiple 'possible worlds'

3 of the 5 articulations are disjunctive (OR)


Reduced Containment Graphs of 7 'possible worlds' (combined or's)

Example 4 – CleanTax infers 7 possible worlds (user can view / select / repair / rerun)

Legend of the Reduced Containment Graphs (RCGs): asserted by expert; implied articulations; inferred by CleanTax; shared concepts; unique concepts.


Exploring "views" of the merge - circular Euler diagrams of PW1

[Figure: table of mir and the corresponding circular Euler diagram, with identical information content]


Correspondence of circular and Directed Acyclic Diagrams

[Figure: typical Euler circles of PW1 and the Euler-DAG of PW1, with identical information content]


Real life examples


Real-life examples, I – reconciling two weevil classifications [1]

Curculionoidea sec. Kuschel 1995: concepts 117-157

Curculionoidea sec. Marvaldi & Morrone 2000: concepts 348-372

[1] Initial articulations provided by NMF.


Merge taxonomy of Kuschel 1995 / Marvaldi & Morrone 2000

CleanTax RCG – 1 newly inferred articulation + several inconsistencies

Microcerinae sec. M&M 2000 [363] are included in Brachycerinae sec. KU 1995 [148]

(yes, I missed that; Kuschel 1995 only mentions it in the text, not in the main taxon list)


Real-life examples, II – reconciling two weevil classifications

Curculionoidea sec. Crowson 1981: concepts 1-17

Curculionoidea sec. Marvaldi & Morrone 2000: concepts 348-372

Merge taxonomy of Crowson 1981 / Marvaldi & Morrone 2000

CleanTax RCG – 4 newly inferred articulations / does not depict overlap (><)

e.g. {Aglycyderidae [2], Allocorynidae [3], Oxycorynidae [17]} sec. Crowson 1981 are included in Belidae [353] sec. M&M 2000


Euler-DAG of the Crowson / Marvaldi & Morrone merge taxonomy

Solid lines – proper inclusion

Black solid line: given

Green solid line: inferred

Orange solid line: explanatory

[Red solid line: inconsistent]

Dashed lines – overlap

Black dashed line: given

Green dashed line: inferred

Orange dashed line: explanatory

Red dashed line: inconsistent

Concept boxes – concepts

Orange square box: shared

Black square box: unique

Dashed square box: combined

Dashed oval box: inconsistent


DAGs generate "combined concepts" – intersections of overlaps

Belidae sec. MM2000

Belidae sec. Cro1981

"Belidae" INT(Cro/MM)

Shared – [2,3,17,357]

New naming/viewing conventions – simple merges (shared, unique) *

Input, Concept A (A): Attelabidae CR81; label AttCR81 [9]

Input, Concept B (B): Attelabidae MM00; label AttMM00 [55]

Output, Concept A – Concept B (AB): Attelabidae CR81 – Attelabidae MM00; label AttCR81.AttMM00

* Simple extension to three or more congruent concepts.

New naming/viewing conventions – combined merges (overlap; T1, T2)

Input, Concept A (A): Belidae CR81; label BelCR81 [10]

Input, Concept B (B): Belidae MM00; label BelMM00 [353]

Combined concepts (shown as both Euler diagram and DAG):

Ab: BelCR81.belMM00

AB: BelCR81.BelMM00

aB: BelMM00.belCR81

Input (T1, T2, T3):

Concept A (A): Curculionidae CR81; label CurCR81

Concept B (B): Curculionidae KU95; label CurKU95

Concept C (C): Curculionidae s.s. MM00; label CurMM00

Combined concepts (shown as both Euler diagram and DAG):

Abc: CurCR81.curKU95.curMM00

aBc: CurKU95.curCR81.curMM00

abC: CurMM00.curCR81.curKU95

AbC: CurCR81.CurMM00.curKU95

aBC: CurKU95.CurMM00.curCR81

ABc: CurCR81.CurKU95.curMM00

ABC: CurCR81.CurKU95.CurMM00
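The labels of combined concepts follow a regular pattern that can be generated mechanically. The sketch below assumes the rule that can be read off the slide examples: included concepts come first with their capitalized abbreviation, excluded concepts follow with the first letter in lower case, and parts are joined with a period. The function name `combined_labels` is hypothetical.

```python
from itertools import product


def lower_first(text: str) -> str:
    """Lower-case only the first character, e.g. 'BelMM00' -> 'belMM00'."""
    return text[:1].lower() + text[1:]


def combined_labels(abbreviations):
    """Map region keys such as 'Ab' to labels such as 'BelCR81.belMM00'."""
    letters = [chr(ord("A") + i) for i in range(len(abbreviations))]
    labels = {}
    for mask in product([True, False], repeat=len(abbreviations)):
        if not any(mask):
            continue  # the region outside all concepts gets no label
        key = "".join(letter if included else letter.lower()
                      for letter, included in zip(letters, mask))
        included_parts = [a for a, inc in zip(abbreviations, mask) if inc]
        excluded_parts = [lower_first(a)
                          for a, inc in zip(abbreviations, mask) if not inc]
        labels[key] = ".".join(included_parts + excluded_parts)
    return labels


two = combined_labels(["BelCR81", "BelMM00"])
three = combined_labels(["CurCR81", "CurKU95", "CurMM00"])
```

For n overlapping input concepts this yields 2^n - 1 labels, which is 3 for the Belidae case and 7 for the three Curculionidae concepts.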


Future directions


Current workflow / "usability" (CleanTax on "Lore" server, UC Davis)

Input script

Output file

Inconsistency: repair, explanation

Possible worlds

Visualization: Euler-DAG

Interactive reduction of PWs (decision tree)


Shared, real use cases (Perelleschus) with ETC feature-based project

5 taxonomies, 48 concepts, expert articulations, plus textual feature diagnoses


Conclusions and outlook

Improvements to CleanTax will remove many of the technical challenges towards a full-blown taxon concept approach (improved tracking of classification provenance).

Other technical challenges are being addressed (server platform, algorithmic scalability, intensional/ostensive articulations, visualization [Euler, combined concepts], workflow integration).

Many non-technical challenges remain (in short: transparent/consistent use).

The current approach treats concepts as a 'black box' – the input data are simple and make no reference to type specimens, synapomorphies, diagnostic features, etc.

"Exploring Taxonomic Concepts" project will develop tools for a balanced view.

Nevertheless, the articulations can expose deep and varied semantic links among succeeding classifications.

CleanTax may be the first attempt to 'explain' classification provenance to logic reasoners. This could have considerable implications for future data integration.


Acknowledgments

Shawn Bowers, Dave Thau, Alan Weakley

NSF-IIS 1118088: "III-SMALL: A logic-based, provenance-aware system for merging scientific data under context and classification constraints"

"Euler" team, UC Davis