Transcript
Page 1: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

A new power balance is needed for trustworthy biodiversity data

Please

@taxonbytes

Nico Franz1 & Beckett W. Sterner1

With contributions by Edward Gilbert1, Andrew Johnston1,

Guanyang Zhang1, Bertram Ludäscher2 & Alan Weakley3

1 School of Life Sciences, Arizona State University

2 iSchool, University of Illinois at Urbana-Champaign

3 Herbarium, University of North Carolina at Chapel Hill

TDWG 2016 – Biodiversity Information Standards

December 09, 2016 – Instituto Tecnológico de Costa Rica (#TDWG16)

@ http://www.slideshare.net/taxonbytes/franz-sterner-tdwg-2016-new-power-balance-needed-for-trustworthy-biodiversity-data

Page 2: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

Largely derived from doi:10.3897/rio.2.e10610

Page 5: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

Premise: We agree that there are significant data quality issues

Aggregated Australian millipede data 'taken to the cleaners'

Aggregators respond to the charges

But this leaves open the question(s):

Who (exactly) is responsible for how much of each particular issue?

Page 6: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

We seem to disagree on the question of responsibility assignment(s)

Source: Belbin et al. 2013. A specialist's audit […]: An 'aggregator's' perspective. doi:10.3897/zookeys.305.5438 (p. 73)

Page 7: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

Often enough, aggregators respond by:

• Acknowledging the general issues and their relevance.

• Pointing to many issues that effectively reside "with the sources".

• Calling for more collaboration across all levels; as well as new tools and annotation options that "motivate and empower" the research community.

Source: Belbin et al. 2013. A specialist's audit […]: An 'aggregator's' perspective. doi:10.3897/zookeys.305.5438 (p. 74)

Page 11: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

Thesis: For taxonomy integration, this is both wrong and self-defeating

• Many aggregators are designed to impose a single taxonomic hierarchy – one at a time – onto all taxonomically annotated records.

• By design, these "backbones" are rarely attributable to individual (expert) authors, but instead are newly created systematic theories that only appear at the system level.

• Data are aggregated accordingly; yet backbone-driven modifications may newly disrupt the original integrity of submitted data packages.

• By deflecting on responsibilities, aggregators may cause additional self-harm.

Ultimately, the power balance – as presently built in – must shift to bring experts back into the process of licensing succinct, trustworthy data packages.

Page 13: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

Taxonomic views of a frequently revised organismal lineage

Source: Franz et al. 2016. Controlling the taxonomic variable: […]. RIO Journal. doi:10.3897/rio.2.e10610

• 9 schemata for the NA Cleistes/Cleistesiopsis complex (orchids, "pogonias")

Page 16: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

Snapshot of a more frequently revised organismal lineage

• 9 schemata for the NA Cleistes/Cleistesiopsis complex (orchids)

• Vertical sections identify taxonomic concept regions

• Colors identify lineages of taxonomic names (epithets) in use

• There is no consensus! Five incongruent schemata are used concurrently

Source: Franz et al. 2016. Controlling the taxonomic variable: […]. RIO Journal. doi:10.3897/rio.2.e10610

Page 17: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

Further diagnosis:

If incongruent taxonomies are endorsed – locally, provisionally, and democratically – then what is the impact for aggregated biodiversity data?

Page 18: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

Further diagnosis:

Taxonomy becomes a variable that we need to represent,

and thereby control for (at the system level)

Page 20: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

[Figure panels: The 'consensus'; The 'bible'. Side label: "Controlling the taxonomic variable"]

• Query: "Where do these orchid species occur?"

• Same set of 250 orchid specimens, according to 4 taxonomies.

Example: the Cleistes use case

Source: Franz et al. 2016. Controlling the taxonomic variable: […]. RIO Journal. doi:10.3897/rio.2.e10610

Page 24: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

[Figure panels: The 'consensus'; The 'bible'; The (formerly) federal 'standard'; The 'best', latest regional flora. Side label: "Controlling the taxonomic variable"]

Expert views are in conflict

"Just bad"

Source: Franz et al. 2016. Controlling the taxonomic variable: […]. RIO Journal. doi:10.3897/rio.2.e10610

Page 25: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

[Figure panels: The 'consensus'; The 'bible'; The (formerly) federal 'standard'; The 'best', latest regional flora. Side label: "Controlling the taxonomic variable"]

Impact: Name-based aggregation has created a novel synthesis that nobody believes in

"Just bad"

Source: Franz et al. 2016. Controlling the taxonomic variable: […]. RIO Journal. doi:10.3897/rio.2.e10610

Page 26: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

[Figure panels: The 'consensus'; The 'bible'; The (formerly) federal 'standard'; The 'best', latest regional flora. Side label: "Controlling the taxonomic variable"]

"Just bad"

Expert views are in conflict

Solution: Instead of aggregating an artificial 'consensus', …

Source: Franz et al. 2016. Controlling the taxonomic variable: […]. RIO Journal. doi:10.3897/rio.2.e10610

Page 27: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

[Figure panels: The 'consensus'; The 'bible'; The (formerly) federal 'standard'; The 'best', latest regional flora. Side label: "Controlling the taxonomic variable"]

"Just bad"

Expert views are reconciled

Solution: Instead of aggregating an artificial 'consensus', build translation services

Source: Franz et al. 2016. Controlling the taxonomic variable: […]. RIO Journal. doi:10.3897/rio.2.e10610

Page 29: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

Challenges:

How can we redesign aggregation to yield high-quality biodiversity data packages?

What does this mean for Darwin Core1

and how we use this aggregation standard?

1 Wieczorek et al. 2012. Darwin Core: an evolving […]. PLoS ONE 7(1): e29715. doi:10.1371/journal.pone.0029715

Page 30: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

Preview of solution with eight steps

• DwC is insufficient, and part of the problem

Page 31: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

# 1: Represent only taxonomic concept labels (TCLs) 1

• Syntax (TCL): taxonomic name [author, year, page] sec. source

1 Multi-taxonomy input/alignment visualizations generated with Euler/X toolkit: https://github.com/EulerProject/EulerX

Cleistes divaricata sec. Gregg & Catling 1993

Pogonia sec. Brown & Wunderlin 1997
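
The TCL syntax on this slide can be sketched in code. The following is a minimal Python illustration and not part of the talk or of the Euler/X toolkit: the `TCL` class and `parse_tcl` function are hypothetical names, and the optional [author, year, page] part is simply left inside the name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TCL:
    """A taxonomic concept label: a name as used by one source."""
    name: str    # taxonomic name, e.g. "Cleistes divaricata"
    source: str  # the "sec." (according to) source


def parse_tcl(label: str) -> TCL:
    """Split 'name sec. source' at the first ' sec. ' separator."""
    name, sep, source = label.partition(" sec. ")
    if not sep or not name.strip() or not source.strip():
        raise ValueError(f"not a taxonomic concept label: {label!r}")
    return TCL(name.strip(), source.strip())


a = parse_tcl("Cleistes divaricata sec. Gregg & Catling 1993")
b = parse_tcl("Cleistes divaricata sec. Smith et al. 2004")

same_name = a.name == b.name  # the two labels share a name
same_concept = a == b         # but they are different concept labels
```

A bare name without a "sec." source is rejected here on purpose, which mirrors the slide's point that only TCLs should be represented.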

Page 32: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

# 1: DwC score keeping: TCLs are optional; < 1% realized?

• TCL ~ DwC: nameAccordingTo

• SCAN: 19,722 of nearly 9 million records have TCLs (0.2%)

• Lack of enforcement of TCL use makes the standard less big data-ready

"Who authors GBIF's Backbone?" https://storify.com/taxonbytes/who-authors-gbif-s-backbone

Page 33: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

# 2: Represent each source coherently (Parent-Child relationships)

• Syntax (PC): TCL1 is a child/parent of TCL2 [where TCL1/2 = same source]

Cleistesiopsis bifaria sec. Pans. & de Barr. 2008

is a child of Cleistesiopsis sec. Pans. & de Barr. 2008
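
The same-source condition on parent-child statements can be checked mechanically. This is a small Python sketch under the assumption that TCLs are plain strings of the form "name sec. source"; `source_of` and `is_coherent_pc` are hypothetical helper names.

```python
def source_of(label: str) -> str:
    """Return the part of a TCL after ' sec. ' (empty if missing)."""
    return label.partition(" sec. ")[2].strip()


def is_coherent_pc(child: str, parent: str) -> bool:
    """A parent-child statement is only valid within a single source."""
    child_source = source_of(child)
    return child_source != "" and child_source == source_of(parent)


# The example from the slide: both TCLs come from the same source.
same_source = is_coherent_pc(
    "Cleistesiopsis bifaria sec. Pans. & de Barr. 2008",
    "Cleistesiopsis sec. Pans. & de Barr. 2008",
)

# Mixing sources is rejected; linking across sources is the job of
# articulations, not of parent-child statements.
mixed_sources = is_coherent_pc(
    "Cleistesiopsis bifaria sec. Pans. & de Barr. 2008",
    "Pogonia sec. Brown & Wunderlin 1997",
)
```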

Page 34: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

# 2: DwC score keeping: Not (adequately) represented

• PC ~ DwC: genus, family, order (etc.; higherClassification)

• However, higher-level names in DwC are not modeled as TCLs

• Taxonomic coherence of sources cannot be preserved with DwC alone

DwC record with higherClassification (BDJ)

Page 35: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

# 3: Do not force a single hierarchy onto all tip-level TCLs

• Syntax (PC): Tip-level TCL1 , TCL2 , etc. [where TCL1/2 = different sources]

Page 36: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

# 3: DwC score keeping: Optional; not (ever?) practiced

• No PC ~ DwC: infra-/specificEpithet only

• Typically, a single, 'unitary' higher-level classification is represented

• Combinations of algorithmic and social practices achieve the single hierarchy

"Who authors GBIF's Backbone?" https://storify.com/taxonbytes/who-authors-gbif-s-backbone

Page 37: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

# 4: Link TCLs via expert-provided RCC–5 articulations

• Syntax (RCC–5): TCL1 {==, >, <, ><, !} TCL2 [where TCL1/2 = diff. sources]

• RCC–5 = Region Connection Calculus

• 14 articulations provided by: http://tinyurl.com/Weakley-Flora-2015

Cleistes bifaria "Coastal Populations" sec. Smith et al. 2004

== (is congruent with)

Cleistesiopsis oricamporum sec. Brown & Pans. 2009

Page 38: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

Source: Thau, D.M. 2010. Reasoning about taxonomies. Thesis, UC Davis. http://gradworks.proquest.com/3422778.pdf

Region Connection Calculus (semantics: set constraints)

== < > >< !

• Two regions N, M are either:

• congruent (N == M)

• properly inclusive (N < M)

• inversely properly inclusive (N > M)

• overlapping (N >< M)

• exclusive of each other (N ! M)

• RCC–5 articulations answer the query: "can we join regions N and M?"

• Taxonomies have multiple RCC–5 alignable components: nodes (parents, children), node-associated traits, even node-anchoring specimens
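
The five relations can be illustrated with ordinary set comparisons. The sketch below assumes that each region is a non-empty finite set (for example, of specimen identifiers); the function name `rcc5` and the identifiers `s1` to `s4` are hypothetical. Euler/X infers articulations with logic reasoners rather than by direct comparison like this.

```python
def rcc5(n: frozenset, m: frozenset) -> str:
    """Return the RCC-5 relation between two non-empty regions."""
    if not n or not m:
        raise ValueError("regions must be non-empty")
    if n == m:
        return "=="  # congruent
    if n < m:
        return "<"   # properly inclusive: N is contained in M
    if n > m:
        return ">"   # inversely properly inclusive: N contains M
    if n & m:
        return "><"  # overlapping: shared and unshared members
    return "!"       # exclusive of each other


# Hypothetical specimen sets for two taxonomic concepts.
narrow = frozenset({"s1", "s2"})
broad = frozenset({"s1", "s2", "s3"})
other = frozenset({"s3", "s4"})

relations = (rcc5(narrow, broad), rcc5(broad, other), rcc5(narrow, other))
```

Exactly one of the five relations holds for any pair of non-empty regions, which is what lets an articulation answer whether N and M can be joined.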

Page 39: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

# 4: DwC score keeping: Not (adequately) represented

• RCC–5 ~ DwC: accepted(Scientific)Name(Usage), relationshipOfResource, taxonomicStatus (etc.; nomenclatural relationships)

• Nomenclatural relationships are type-focused, not region-focused

• "Taxonomic Concept Schema" yes! (however: http://www.tdwg.org/standards/117)

Source: Vane-Wright. 2003. Indifferent philosophy versus […]. Syst. Biodiv. 1: 3–11. doi:10.1017/S1477200003001063

Example: Milkweed butterflies

Page 40: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

Oscillating meanings of the epithet hyalites – 1911 to 2003

[Figure axis labels: Phenotypic diversity; Type-anchored name identity relations]

Source: Vane-Wright. 2003. Indifferent philosophy versus […]. Syst. Biodiv. 1: 3–11. doi:10.1017/S1477200003001063

Page 41: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

# 5: Identify occurrence records only to TCLs

Records: EKY39235 MTSU003611 NCSC00040204 …

Records: BOON8098 CLEMS0061133 WILLI39399 …

Records: GMUF-0039355 IBE006808 USCH58399 …

Records: CONV0006268 MDKY00006482 NCU00038930 …

Records: BRYV0023582, BRYV0023584 KHD00032030, MISS0016604 MMNS000227, NCSC00040206 USMS_000002923, USMS_000002924 VSC0053223, VSC0065528 …

Records: ARIZ393087 DBG39049 USCH51217 …

Records: NCU00040710 USCH96248 VSC0053218 …

Records: CLEMS0012881 FUGR0003293 GA023130 …

Records: BOON8100 NCSC00040210 SJNM45487 …

Records: GA023144 LSU00012494 MISS0016608 …

Records: IBE006810, IND-0012374, MMNS000227

Records: NY8654

• Syntax (ID): Occurrence / organism is identified to TCL

"CLEMS0012881" is identified to Cleistes divaricata sec. Smith et al. 2004 [additional ID metadata]

Page 42: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

DwC record with Identification metadata (BDJ)

# 5: DwC score keeping: ID metadata optional; > 50% realized

• ID ~ DwC: Identification, (date)identified(By), identificationReference

• SCAN: 4,715,277 of nearly 9 million records have ID metadata (52.5%)

• Enforcement…still also require use of TCLs

Page 43: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

# 6: Generate comprehensive, consistent RCC–5 alignments

• Euler/X is a toolkit that infers logically consistent RCC–5 alignments

Page 44: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

# 6: Generate comprehensive, consistent RCC–5 alignments

• Value added: MIR – set of Maximally Informative Relations containing the RCC–5 articulation for every possible TCL pair (scalability)

Reasoner inference

Page 45: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

# 7: Joining occurrence-to-TCL identifications & RCC–5 alignments

Records: BOON8098, CLEMS0061133, CONV0006268, EKY39235 GMUF-0039355, IBE006808, IBE006810, IND-0012374 MDKY00006482, MMNS000227, MTSU003611, NCSC00040204 NCU00038930, NY8654, USCH58399, WILLI39399 …

Records: ARIZ393087, BRYV0023582, BRYV0023584, DBG39049 KHD00032030, MISS0016604, MMNS00022, NCSC00040206 USMS_000002923, USMS_000002924, VSC0053223, VSC0065528 …

Records: BOON8100, CLEMS0012881, FUGR0003293 GA023130, GA023144, LSU00012494 MISS0016608, NCSC00040210, NCU00040710 SJNM45487, USCH96248, VSC0053218 …

• Specimen integration is fully driven by TCL-to-TCL RCC–5 signals
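
How the join might work can be sketched as follows. This is a hedged Python illustration, not the authors' implementation: `translate`, the record identifiers `REC-1` to `REC-3`, and the "Taxon A/B sec. Source 1" labels with their articulations are hypothetical. Only the congruence between the two real TCLs is taken from the slides.

```python
from typing import Dict, List, Tuple


def translate(
    identifications: Dict[str, str],
    articulations: Dict[Tuple[str, str], str],
    target: str,
) -> Tuple[List[str], List[str]]:
    """Sort records into (certain, ambiguous) members of the target TCL."""
    certain, ambiguous = [], []
    for record, tcl in identifications.items():
        rel = "==" if tcl == target else articulations.get((tcl, target))
        if rel in ("==", "<"):
            certain.append(record)    # source concept lies within the target
        elif rel in (">", "><"):
            ambiguous.append(record)  # only part of the source concept matches
        # "!" or no articulation: the record is left out
    return sorted(certain), sorted(ambiguous)


coastal = 'Cleistes bifaria "Coastal Populations" sec. Smith et al. 2004'
target = "Cleistesiopsis oricamporum sec. Brown & Pans. 2009"

identifications = {
    "REC-1": coastal,
    "REC-2": "Taxon A sec. Source 1",  # hypothetical TCL
    "REC-3": "Taxon B sec. Source 1",  # hypothetical TCL
}
articulations = {
    (coastal, target): "==",                   # from the slides
    ("Taxon A sec. Source 1", target): "><",  # hypothetical
    ("Taxon B sec. Source 1", target): "!",   # hypothetical
}

certain, ambiguous = translate(identifications, articulations, target)
```

Keeping the ambiguous records separate, instead of silently merging or dropping them, is one possible design choice; it leaves the decision with the user of the data.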

Page 46: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

[Figure panels: The 'consensus'; The 'bible'; The (formerly) federal 'standard'; The 'best', latest regional flora. Side label: "Controlling the taxonomic variable"]

Impact: "Please select your preference (A – D); we can perform all translations"

Source: Franz et al. 2016. Controlling the taxonomic variable: […]. RIO Journal. doi:10.3897/rio.2.e10610

Page 49: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

# 8: "Do you trust us now?" Aggregation as a translational service

• We can now respond to queries such as:

• "Show all specimens identified to the taxonomic name Cleistes divaricata"

• Returns many records; resolves an incongruent lineage of name usages

• "Now show specimens with the TCL Cleistesiopsis divaricata sec. Weakley 2015"

• Returns record subset resolving only one narrowly circumscribed concept

• "Now show specimens identified to the TCL Cleistes divaricata sec. RAB 1968, yet translated into the more granular TCLs sec. Weakley 2015"

• Returns (again) many records, yet represents and contrasts two treatments, as opposed to providing the ambiguous lineage view (above)

• "Show all specimens with ambiguous 2010/2015 TCL identifications…" (etc.)
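
The difference between the first two query types can be sketched in a few lines of Python. The record identifiers and their identifications below are hypothetical; the TCL strings are ones that appear in the slides, and `by_name` and `by_tcl` are illustrative helper names.

```python
from typing import List, Tuple

# Hypothetical occurrence records, each identified to one TCL.
records: List[Tuple[str, str]] = [
    ("REC-1", "Cleistes divaricata sec. RAB 1968"),
    ("REC-2", "Cleistes divaricata sec. Smith et al. 2004"),
    ("REC-3", "Cleistesiopsis divaricata sec. Weakley 2015"),
]


def by_name(recs: List[Tuple[str, str]], name: str) -> List[str]:
    """Name-based query: ignores which source the name follows."""
    return [rid for rid, tcl in recs if tcl.partition(" sec. ")[0] == name]


def by_tcl(recs: List[Tuple[str, str]], tcl: str) -> List[str]:
    """Concept-based query: matches exactly one name-plus-source label."""
    return [rid for rid, label in recs if label == tcl]


name_hits = by_name(records, "Cleistes divaricata")
tcl_hits = by_tcl(records, "Cleistesiopsis divaricata sec. Weakley 2015")
```

The name query returns records that follow two different sources, while the TCL query returns only records identified to one circumscription. The translated queries would additionally need RCC–5 articulations between the sources.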

Page 52: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

Conclusion – designing trusted biodiversity data services

• The Darwin Core standard for aggregating biodiversity data:

(1) Has under-utilized options for better representing taxonomic expertise

(2) Is part of a design paradigm that undermines the plurality of expertise

• Solutions are in development that realize data aggregation via translational services – not as disenfranchising "backbones" – and without disrupting the formation of expert-licensed, high-quality biodiversity data packages

• All of us – not just aggregators – "own" the responsibility of designing systems where the plurality of taxonomic expertise is fairly accommodated

Page 53: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

Acknowledgments & links to products

• Cleistes use case: Alan Weakley (UNC)

• Euler/X toolkit: Shizhuo Yu (UC Davis)

• Other data issues, discussions: Andrew Johnston, Guanyang Zhang

• NSF DEB–1155984, DBI–1342595 (PI Franz)

• NSF IIS–118088, DBI–1147273 (PI Ludäscher)

• Euler/X code @ https://github.com/EulerProject/EulerX

• Franz et al. 2016. Two influential primate classifications logically aligned. Systematic Biology 65(4): 561–582.

Page 54: Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

Interested in exploring multi-taxonomy and/or -phylogeny alignments?

Please contact me.

[email protected]

@taxonbytes

https://biokic.asu.edu/