Jan Christoph Meister University of Hamburg www.catma.de

Tags in the cloud: Crowdsourcing semantic annotation with CATMA

Page 1: Tags in  the cloud : Crowdsourcing semantic annotation with  CATMA

Jan Christoph Meister, University of Hamburg

www.catma.de

Page 2: Tags in  the cloud : Crowdsourcing semantic annotation with  CATMA

CATMA - an integrated textual markup and analysis tool

29.10.2012 CLARIN's Turn Towards The Literary Text

Page 3: Tags in  the cloud : Crowdsourcing semantic annotation with  CATMA

Text vs. sentence, or: What’s so different about processing texts?

• structural complexity: min TEXT > 2 (SENTENCE)

• structural activity: TEXT processing actualizes paradigmatic cross-reference across sentences

• structural dynamic: TEXT processing represents & simulates cognitive and empirical processes

TEXT yields more INTERPRETATIONS than SENTENCE

+CONTINGENCY: The more complex & dynamic structure, when activated during processing, results in a higher degree of contingency in functional „outcome“

Page 4: Tags in  the cloud : Crowdsourcing semantic annotation with  CATMA

The what and why of MarkUp: procedural, descriptive & discursive function

• discursive markup: enables human readers to interpret a text and to explore its hermeneutic potential in collaboration. „What might this text mean to us?“

• declarative markup: informs a human reader how to process a text as a communicative device. „How is this text put together and how does it function in its communicative universe?“

• procedural markup: instructs a (natural or artificial) text processor how to handle a text as a structured character string. „What is the correct operation to perform on this input?“

performative function

discursive function

Page 5: Tags in  the cloud : Crowdsourcing semantic annotation with  CATMA

Hermeneutic „must haves“ of discursive markup

facilitate collaboration & non-deterministic annotation
• allow for multiple markup
• allow for overlap
• allow for concurrent tagging

conceptualize markup as dynamic & recursive
• allow for extensibility
• allow for multiple (and even contradictory) markup
• seamlessly integrate markup and analysis & support the hermeneutic loop

Page 6: Tags in  the cloud : Crowdsourcing semantic annotation with  CATMA

MarkUp types & data models

opaque (implicit):
There is no such thing as “no-mark up”. (Coombs, Renear, DeRose 1987)

linear (inline, deterministic):
<SentenceStart>There</SentenceStart> is no such thing as “no-mark up.”

nested (inline, deterministic sequential):
<SentenceStart><Adverb>There</Adverb></SentenceStart> is no such thing as “no-mark up”.

relational (stand off, descriptive):
There is no such thing as “no-mark up”.
<1,5, word class = “Adverb”> <1,5, segment = “SentenceStart”> <1,5, POS = “verb phrase element”>

network (stand off, discursive):
<1,5, word class = “Adverb”> <1,38, speech act = “declaration”> <1,11, POS = “verb phrase”>
There is no such thing as “no-mark up”.
<1,5, word class = “Preposition”> <1,5, segment = “SentenceStart”> <1,8, POS = “noun phrase”>

Page 7: Tags in  the cloud : Crowdsourcing semantic annotation with  CATMA

Implementation in CATMA

www.catma.de

Page 8: Tags in  the cloud : Crowdsourcing semantic annotation with  CATMA

The CATMA/CLÉA approach to markup

text range based model
• a tag references a text range with a start and an end offset

external standoff markup
• markup is stored in external files or data bases to facilitate tagging and exchange of markup by multiple users
• markup is stored in a standoff manner to allow overlapping
• markup tolerates non-deterministic tagging & supports analytical operations that exploit semantic ambiguity
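The range-based standoff idea can be illustrated with a small Python sketch. It is not CATMA's implementation: the class `TagReference`, the helper functions, and the annotator names are hypothetical, and the offsets are 0-based and half-open (end exclusive), which is an assumption of this sketch rather than something stated on the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagReference:
    """One standoff annotation: a tag name plus a character range (hypothetical type)."""
    tag: str
    start: int   # inclusive start offset into the source text
    end: int     # exclusive end offset into the source text
    author: str


def overlaps(a: TagReference, b: TagReference) -> bool:
    """True if the two half-open ranges share at least one character."""
    return a.start < b.end and b.start < a.end


def tags_at(offset: int, refs: list) -> list:
    """Names of all tags whose range covers the given offset."""
    return [r.tag for r in refs if r.start <= offset < r.end]


source = "There is no such thing as no-mark up."

# Two annotators tag the same text independently; the source text is never modified.
refs = [
    TagReference("SentenceStart", 0, 5, author="annotator_a"),
    TagReference("Adverb", 0, 5, author="annotator_a"),
    TagReference("Preposition", 0, 5, author="annotator_b"),  # contradicts "Adverb"
    TagReference("noun phrase", 0, 8, author="annotator_b"),  # overlaps the others
]

print(source[refs[3].start:refs[3].end])
print(tags_at(2, refs))
print(overlaps(refs[0], refs[3]))
```

Because the tags live outside the text, contradictory and overlapping annotations can coexist without any nesting constraint.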

Page 9: Tags in  the cloud : Crowdsourcing semantic annotation with  CATMA

Example for overlapping markup in CATMA

(NB: In CATMA tag sets can be imported/exported; tags can be created / manipulated ad hoc during mark up)

Page 10: Tags in  the cloud : Crowdsourcing semantic annotation with  CATMA

TEI feature structure tag declaration & overlapping markup

<fs xml:id="CATMA_d7251f99-14e9-4c36-8ff7-24058ae81ce5" n="1_7985fdf0-77a5-4060-9a3d-2d977e0ab954" type="catma_tag">
  <f xml:id="CATMA_aa9b3727-187e-4fb8-9990-e7880912a409" name="catma_tagname">
    <string>Keynote_speaker&amp;affiliation</string>
  </f>
  <f xml:id="CATMA_564825ba-28b2-4dab-b136-b87c8a3d9e28" name="catma_displaycolor">
    <numeric value="-13421569"/>
  </f>
</fs>

<seg ana="#CATMA_0a252cc2-96d2-4ed4-8fb8-52380550ec0b #CATMA_d7251f99-14e9-4c36-8ff7-24058ae81ce5 #CATMA_8513fe2d-2e35-4d0a-a3a2-07528bcfa012">
  <ptr target="Abstracts.doc#range( /.21736, /.21888)" type="inclusion"/>
</seg>

Page 11: Tags in  the cloud : Crowdsourcing semantic annotation with  CATMA

Question 1: How can we model a collaborative mark up practice?

Page 12: Tags in  the cloud : Crowdsourcing semantic annotation with  CATMA

Answer 1: CATMA’s “n meta-data sets to 1 object-data instance” model

meta-data
• procedural
• declarative
• hermeneutic

object-data

Page 13: Tags in  the cloud : Crowdsourcing semantic annotation with  CATMA

Question 2: But how, on top of that, can we also model the recursive routines that characterize the humanistic workflow?

Page 14: Tags in  the cloud : Crowdsourcing semantic annotation with  CATMA

Example for recursion: a simple query across the object-data/meta-data divide

Step 1: object data query

Step 2: refinement by adding an additional meta-data constraint

Page 15: Tags in  the cloud : Crowdsourcing semantic annotation with  CATMA

... which is why (reg="\b\S*\Qez\E(?=\W)") where (tag="Keynote_speaker&affiliation") generates this:
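The two-step logic of such a query can be sketched in Python. This is an illustration only, not CATMA's query engine: the function names, the sample text, and the tag range are hypothetical, and the reading of `where` as "the regex hit lies fully inside a range carrying the tag" is an assumption.

```python
import re


def regex_hits(text, pattern):
    """Step 1: query the object data (the text) with a regular expression."""
    return [(m.start(), m.end()) for m in re.finditer(pattern, text)]


def constrain_by_tag(hits, tag, tag_ranges):
    """Step 2: keep only hits that lie fully inside a range carrying `tag`."""
    ranges = tag_ranges.get(tag, [])
    return [
        (start, end) for start, end in hits
        if any(lo <= start and end <= hi for lo, hi in ranges)
    ]


# Python's re module has no \Q...\E quoting, so re.escape() stands in for it.
PATTERN = r"\b\S*" + re.escape("ez") + r"(?=\W)"

# Hypothetical sample text and tag range, invented for this sketch.
text = "Keynote: Maria Lopez, University of Examplia. Afterwards Mr Gomez spoke briefly."
tag_ranges = {
    "Keynote_speaker&affiliation": [(text.index("Maria"), text.index(" Afterwards"))],
}

hits = regex_hits(text, PATTERN)
refined = constrain_by_tag(hits, "Keynote_speaker&affiliation", tag_ranges)
print([text[s:e] for s, e in hits])
print([text[s:e] for s, e in refined])
```

The refined result could itself be tagged and queried again, which is the recursive step the slides describe.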

Page 16: Tags in  the cloud : Crowdsourcing semantic annotation with  CATMA

Answer 2: CATMA’s dynamic data model, e.g. (n meta-data sets to 1 object-data instance) > n+1

meta-data
• procedural
• declarative
• hermeneutic

object-data

object-data

Page 17: Tags in  the cloud : Crowdsourcing semantic annotation with  CATMA

Question 3: How can we implement this practice in a system?

Page 18: Tags in  the cloud : Crowdsourcing semantic annotation with  CATMA

Answer 3: Call the big sister – CLÉA!

CLÉA Data Base Model

Page 19: Tags in  the cloud : Crowdsourcing semantic annotation with  CATMA

CATMA/CLÉA: User and resource administration

Page 20: Tags in  the cloud : Crowdsourcing semantic annotation with  CATMA

Manage corpora & source documents, markup collections and tag libraries

Page 21: Tags in  the cloud : Crowdsourcing semantic annotation with  CATMA

Annotate texts or corpora using pre-defined or ready-made tags

Page 22: Tags in  the cloud : Crowdsourcing semantic annotation with  CATMA

Build and execute queries on source text & tags, or any combination thereof

Page 23: Tags in  the cloud : Crowdsourcing semantic annotation with  CATMA

Visualize results

Page 24: Tags in  the cloud : Crowdsourcing semantic annotation with  CATMA

What’s in it for CLARIN?

• Import any text or corpus into CATMA/CLÉA
• Run standard analytical procedures automatically or interactively on upload (indexing, POS tagging etc.)
• Annotate and analyse texts or corpora collaboratively
• Share and export markup from the CATMA/CLÉA data base in multiple formats

CLÉA = Collaborative Literature Éxploration and Annotation

Page 25: Tags in  the cloud : Crowdsourcing semantic annotation with  CATMA

Mille grazie to my CATMA/CLÉA development team

• Evelyn Gius
• Malte Meister
• Marco Petris
• Lena Schüch

and to our funders

• University of Hamburg (2009)
• Google DH Awards (2010-2013)
• BMBF (2013-2016)

Page 26: Tags in  the cloud : Crowdsourcing semantic annotation with  CATMA

Tag definition

<fsDecl xml:id="CATMA_TAG_ID_1" type="test" baseTypes="catma_tag">
  <fsDescr>test - Test Tag</fsDescr>
  <fDecl xml:id="CATMA_TAG_DEF_1_PROP_1" name="catma_displaycolor" optional="false">
    <vRange><numeric value="-13408513"/></vRange>
  </fDecl>
  <fDecl xml:id="CATMA_TAG_DEF_1_PROP_2" name="user_defined_test_property" optional="false">
    <vRange><string/></vRange>
  </fDecl>
</fsDecl>

each Tag can have additional user defined properties

each Tag has a type

each Tag has a color

Page 27: Tags in  the cloud : Crowdsourcing semantic annotation with  CATMA

Tag instance

<fs xml:id="CATMA_TAG_INSTANCE_1" type="test">
  <f xml:id="CATMA_PROPERTY_1_1" name="catma_displaycolor">
    <numeric value="-13408513"/>
  </f>
  <f xml:id="CATMA_PROPERTY_1_2" name="user_defined_test_property">
    <string>instance specific test value</string>
  </f>
</fs>

a Tag instance can have individual values for the user defined properties

each Tag instance is of a type
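A tag instance of this shape can be read with Python's standard library, as in the following sketch. The function name `read_tag_instance` and the returned dictionary layout are hypothetical, and the fragment is assumed to carry no TEI namespace declaration, as on the slide.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

TAG_INSTANCE = """\
<fs xml:id="CATMA_TAG_INSTANCE_1" type="test">
  <f xml:id="CATMA_PROPERTY_1_1" name="catma_displaycolor">
    <numeric value="-13408513"/>
  </f>
  <f xml:id="CATMA_PROPERTY_1_2" name="user_defined_test_property">
    <string>instance specific test value</string>
  </f>
</fs>
"""


def read_tag_instance(xml_text):
    """Return the id, type and property values of one <fs> tag instance."""
    fs = ET.fromstring(xml_text)
    properties = {}
    for f in fs.findall("f"):
        value = f[0]  # the single value element inside <f>
        if value.tag == "numeric":
            properties[f.get("name")] = int(value.get("value"))
        else:
            properties[f.get("name")] = value.text or ""
    return {"id": fs.get(XML_ID), "type": fs.get("type"), "properties": properties}


instance = read_tag_instance(TAG_INSTANCE)
print(instance["type"], instance["properties"])
```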

Page 28: Tags in  the cloud : Crowdsourcing semantic annotation with  CATMA

Tag referencing

<seg ana="#CATMA_TAG_INSTANCE_1">
  <ptr target="mytext_utf8.txt#char=36168,36185" type="inclusion"/>
</seg>

The content of a range is referenced by a pointer to an external entity.

The URI fragment follows RFC 5147, which defines fragment identifiers for pointing into plain text.
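Resolving such a pointer might look like the sketch below. It handles only the `char=start,end` form shown on the slide; RFC 5147 also allows line-based ranges and open-ended ranges, which are ignored here. The function names are hypothetical, and mapping the two positions directly onto a Python slice assumes that positions count the gaps between characters starting at 0.

```python
import re
from urllib.parse import urldefrag

# Matches "char=START,END", optionally followed by ";"-separated integrity checks.
CHAR_RANGE = re.compile(r"^char=(\d+),(\d+)(?:;.*)?$")


def parse_char_range(target):
    """Split a pointer target into (document, start, end)."""
    document, fragment = urldefrag(target)
    match = CHAR_RANGE.match(fragment)
    if match is None:
        raise ValueError("unsupported fragment: " + repr(fragment))
    return document, int(match.group(1)), int(match.group(2))


def resolve(text, start, end):
    """Return the referenced character range from the already loaded source text."""
    return text[start:end]


document, start, end = parse_char_range("mytext_utf8.txt#char=36168,36185")
print(document, start, end)
```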

Page 29: Tags in  the cloud : Crowdsourcing semantic annotation with  CATMA

Potential problems and possible solutions

Range references based on character offsets are vulnerable to modifications of the content
• possible solution: automated adjustments with checksums and context information, and
• tracking of versioning and revision history in the source document header

The encoding of the tags is machine readable but not interoperable out of the box
• possible solution: defining the feature structure encoding of tags in terms of the Open Annotation framework
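One way the checksum-and-context idea could work is sketched below. This is not an existing CATMA feature: the `Anchor` type, the choice of SHA-256, and the ten-character left context are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Anchor:
    """A character range plus the information needed to find it again (hypothetical)."""
    start: int
    end: int
    doc_sha256: str  # checksum of the whole source at annotation time
    exact: str       # the annotated text itself
    prefix: str      # a few characters of left context


def checksum(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_anchor(text, start, end, context=10):
    return Anchor(start, end, checksum(text), text[start:end],
                  text[max(0, start - context):start])


def reanchor(anchor, new_text):
    """Return adjusted (start, end), or None if the range cannot be located."""
    if checksum(new_text) == anchor.doc_sha256:
        return anchor.start, anchor.end  # source unchanged, offsets still valid
    needle = anchor.prefix + anchor.exact
    candidates = []
    pos = new_text.find(needle)
    while pos != -1:
        candidates.append(pos + len(anchor.prefix))
        pos = new_text.find(needle, pos + 1)
    if not candidates:
        return None
    # Prefer the candidate closest to the original position.
    best = min(candidates, key=lambda s: abs(s - anchor.start))
    return best, best + len(anchor.exact)


old = "There is no such thing as no-mark up."
anchor = make_anchor(old, 9, 11)   # the first "no"
new = "Indeed: " + old             # content inserted before the range
print(reanchor(anchor, new))
```

When the text cannot be found again, the sketch returns `None` so that the annotation can be flagged for manual review instead of silently pointing at the wrong range.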
