Introduction of the ASV Subproject Report on recent state of work and other activities eTRACES Sponsor Meeting Leipzig, 2012/05/07 Marco Büchler Natural Language Processing Group Department of Computer Science University of Leipzig


Introduction of the ASV Subproject
Report on recent state of work and other activities

eTRACES Sponsor Meeting
Leipzig, 2012/05/07

Marco Büchler
Natural Language Processing Group

Department of Computer Science
University of Leipzig


What do you do with a million books?


We do not have any native speakers for ancient languages like ancient Greek and Latin ...


Agenda

Scope of ULEI's subproject

Who is involved?

ACID for the eHumanities as a paradigm for successful projects


Basics for ULEI's subproject


A fundamental problem: How to find relevant information in massive data?


Two initially associated documents


Documents are linked with a direction


Documents are linked with a direction: such as web links


Documents are linked in both directions: A loop


Detecting relevance: a document can be linked to by more than one document



Detecting relevance on an entire digital library


Computing relevance weights (by reliability) on an entire digital library

Source: http://en.wikipedia.org/wiki/PageRank

This strategy is known as Google's PageRank algorithm.
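The idea of computing relevance weights over a whole link graph can be sketched in a few lines. The following is a minimal power-iteration version of PageRank, not the production algorithm; the damping factor of 0.85, the fixed iteration count and the toy graph `toy_links` are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict {document: [documents it links to]}."""
    docs = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(docs)
    rank = {d: 1.0 / n for d in docs}  # start with a uniform distribution
    for _ in range(iterations):
        # Every document receives the same small base weight.
        new_rank = {d: (1.0 - damping) / n for d in docs}
        for d in docs:
            targets = links.get(d, [])
            if targets:
                # A document passes its weight on in equal shares to its targets.
                share = damping * rank[d] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A document without outgoing links spreads its weight evenly.
                share = damping * rank[d] / n
                for t in docs:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: A and B both link to C, and C links back to A.
toy_links = {"A": ["C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_links)
```

In this toy graph, C is linked to by two documents and ends up with the highest weight, while B, which nobody links to, ends up with the lowest.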


Some aspects of Google's PageRank algorithm

• Ranking is done by relevance weights (weighted links to a page)

• Benefit for humanities applications:
– Ranking does not necessarily need term weights such as tf.idf
– e. g. Shakespeare's „to be, or not to be“

In humanities data, however, there is no link structure like the one in web-based HTML files.
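To illustrate the tf.idf remark on this slide: a quotation such as „to be, or not to be“ consists only of very frequent words, so term weighting gives it almost no weight. The sketch below uses a plain logarithmic idf and a tiny made-up corpus (`toy_corpus`); both are assumptions for illustration, not data from the project.

```python
import math


def idf(term, documents):
    """Inverse document frequency: log(N / number of documents containing the term)."""
    df = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / df) if df else 0.0


# Hypothetical three-document corpus, already tokenised.
toy_corpus = [
    "to be or not to be that is the question".split(),
    "we ought not to be idle or afraid".split(),
    "it is better to be feared or loved but not hated".split(),
]

quote = "to be or not to be".split()
# Every word of the quotation occurs in every document, so each idf is log(1) = 0.
quote_weights = {term: idf(term, toy_corpus) for term in set(quote)}
# A rarer word such as "question" still receives a positive weight.
question_weight = idf("question", toy_corpus)
```

A link-based ranking does not depend on such term weights, which is why it remains usable for quotations built from function words.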


A similar problem: Two initial documents with text re-use


Given e. g. dating information: text re-use with direction I

Our assumption: a quotation always implies that the quoting author considers the quoted author relevant, either in a positive or a negative way.
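How dating information turns an undirected re-use pair into a directed link can be sketched as follows. This is a minimal illustration under the assumption that each work has a single known year; the function name, the work identifiers and the years are all hypothetical.

```python
def direct_reuse_links(reuse_pairs, year_of):
    """Orient each undirected re-use pair from the later (quoting) work
    to the earlier (quoted) work, analogous to a web link."""
    links = []
    for a, b in reuse_pairs:
        if year_of[a] == year_of[b]:
            continue  # the direction cannot be decided from dates alone
        quoting, quoted = (a, b) if year_of[a] > year_of[b] else (b, a)
        links.append((quoting, quoted))
    return links


# Hypothetical works and years (negative values mean BCE), for illustration only.
year_of = {"work_A": -700, "work_B": -380, "work_C": 100, "work_D": 100}
reuse_pairs = [
    ("work_A", "work_B"),
    ("work_C", "work_A"),
    ("work_B", "work_C"),
    ("work_C", "work_D"),
]
links = direct_reuse_links(reuse_pairs, year_of)
```

The resulting directed links have the same shape as the link graph a PageRank-style ranking expects; pairs with identical dates are left out because the dates give no direction.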


Given e. g. dating information: text re-use with direction II


Given e. g. dating information: text re-use with direction III


An old discipline: Text re-use in traditional humanities

Manually produced record of text re-use.


Some research objectives

In addition to Google's PageRank:
– Differentiate by
• Text re-use temperature
• Text re-use coverage
– Relevance by
• high score
• low score


Some answers to the initial questions/statements

What do you do with a million books?
– Cultural heritage of textual re-uses
– Text re-use graphs for something like a „Cultural Heritage aware PageRanking“

We do not have any native speakers for ancient languages like ancient Greek and Latin ...
– Crowd sourcing yields good-quality results on historical texts; however, humanists are not native speakers
– The „Cultural Heritage aware PageRanking“ approach aims to work with relevance as given by native speakers, even though they are no longer available today


Who is involved?


Active collaborators

eTRACES/ULEI (Prof. Dr. Gerhard Heyer)

'The Team'

Interface:projects (Dr. Uwe Crenze)

'The Business Partner'

Fragmentary texts (Dr. Monica Berti)

'The Humanist'

Perseus Digital Library (Prof. Dr. Gregory Crane)

'The Content Provider'


„ACID for the eHumanities“


A new paradigm for successful eHumanities projects

• The million dollar question:

How to manage an eHumanities project successfully?

• After 4 years of activities in the eHumanities, you need just four questions:

Acceptance: How do you get humanists' acceptance for your techniques?

Complexity: Understand the complexity of necessary subtasks! e. g.: What is the archetypus?

Interoperability: How can components or data interact with each other?

Diversity: Understand your data! e. g.: What does text re-use mean for your digital library?

The ACID paradigm for the eHumanities


„ACID for the eHumanities“: Interoperability


„ACID for the eHumanities“: (Data) Interoperability I

Perseus DdbDP (XML) vs. Epiduke (XML)

Source: Pansch, D. 2010, Data Integration Methods for Structural Heterogeneous Data in an eHumanities' Context, Bachelor thesis, 2010.


„ACID for the eHumanities“: (Data) Interoperability II

Source: Pansch, D. 2010, Data Integration Methods for Structural Heterogeneous Data in an eHumanities' Context, Bachelor thesis, 2010.


„ACID for the eHumanities“: (Data) Interoperability III

• Several kinds of interoperability issues on
– Horizontal:
• Data level
• Algorithm level
• Tool/application level
– Vertical:
• e. g. between data and algorithm


„ACID for the eHumanities“: Diversity


„ACID for the eHumanities“: (Node) Diversity

Understand your data: Understand the re-used text chunks.

(a knowledge thing)


„ACID for the eHumanities“: (Relation) Diversity

Understand your data: Understand how text is re-used in your data.

(an experience thing)


„ACID for the eHumanities“: Diversity - 6 levels of text re-use

Text re-use is about unsupervised quotation detection in textual data.

- Level 1: Pre-processing (Cleaned and prepared data)
- Level 2: Featuring (Digital fingerprint of a re-use unit)
- Level 3: Feature selection (Signature of a digital fingerprint)
- Level 4: Linking (Match of re-use units that have at least one feature in common)
- Level 5: Scoring (Weighting of linked re-use units)
- Level 6: Post-processing (e. g. post selection or views that depend on research questions)

Implemented in TRACER (http://etraces.e-humanities.net/TRACER):

- Tool available in 2013
- Teaching courses (full week) are planned for 2013
- More than one million permutations of implementations of the 6 levels possible (05/2012)
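The six levels can be read as a pipeline. The following is only a minimal sketch with one arbitrary choice per level (word bigrams as features, a hand-picked stop-feature list, Jaccard overlap as score, a fixed threshold). It is not TRACER's implementation; all function names, sample units and parameter values are assumptions made for illustration.

```python
import re
from itertools import combinations


def preprocess(text):
    """Level 1: lowercase the text and keep only alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def featurize(tokens):
    """Level 2: fingerprint of a re-use unit, here its set of word bigrams."""
    return set(zip(tokens, tokens[1:]))


def select(features, stop_features):
    """Level 3: signature, here the fingerprint minus uninformative features."""
    return features - stop_features


def link(signatures):
    """Level 4: pairs of units that share at least one feature."""
    return [(u, v) for u, v in combinations(sorted(signatures), 2)
            if signatures[u] & signatures[v]]


def score(u, v, signatures):
    """Level 5: weight a linked pair by the Jaccard overlap of its signatures."""
    a, b = signatures[u], signatures[v]
    return len(a & b) / len(a | b)


def postprocess(scored, threshold):
    """Level 6: keep only pairs at or above a chosen score threshold."""
    return [(u, v, s) for u, v, s in scored if s >= threshold]


# Hypothetical re-use units and stop features, for illustration only.
units = {
    "u1": "To be, or not to be, that is the question.",
    "u2": "He asked: to be or not to be?",
    "u3": "The weather was fine that day.",
}
stop_features = {("that", "is")}

signatures = {name: select(featurize(preprocess(text)), stop_features)
              for name, text in units.items()}
scored = [(u, v, score(u, v, signatures)) for u, v in link(signatures)]
result = postprocess(scored, threshold=0.3)
```

Each level is a separate function so that one choice can be swapped without touching the others; combining several alternatives per level is what produces the large number of possible configurations mentioned on the slide.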


„ACID for the eHumanities“: Acceptance


Interdisciplinary collaborations: The problem!


Computer scientists: change your perspective to understand humanists

How can we gain humanists' acceptance if text mining is a black box that cannot be looked into?


What we need!

Transparency: How to provide user-friendly insights into complex mining techniques and machine learning?


Jumping into the mining process: Level 0 – Initial request


Jumping into the mining process: Level 1 - Preprocessing


Jumping into the mining process: Level 2 - Featuring


Jumping into the mining process: Level 3 - Selection


Jumping into the mining process: Level 4 - Linking


Jumping into the mining process: Level 5 - Scoring


„ACID for the eHumanities“: Complexity


„ACID for the eHumanities“: Complexity I

• Archetypus detection means to identify the origin of a thought or a chunk of text (or at least the earliest occurrence).

• Sentiment (acceptance) detection means determining whether a text passage is re-used in a „positive“ or „negative“ way.

An example:
• German: „Gleich und gleich gesellt sich gern.“
• English: „Like will to like.“ / „Birds of a feather flock together.“ (“to bring like and like together”)

Question: How would/do you use this phrase regarding sentiments in your daily life?


„ACID for the eHumanities“: Complexity II

Hom. Od. 17 215-219: As he saw them, he spoke and addressed them, and reviled them in terrible and unseemly words, and stirred the heart of Odysseus: “Lo, now, in very truth the vile leads the vile. As ever, the god is bringing like and like together. Whither, pray, art thou leading this filthy wretch, thou miserable swineherd, ...


„ACID for the eHumanities“: Complexity III

• German phrase: „jemandem aufs Dach steigen“
• English (literally translated): „to climb onto someone's roof“
• English (semantically translated): „to put someone down“, „tell someone off“

• Understanding the example:
– Goes back to a German tradition between the 7th and 12th century
– Young men climbed onto the roof of someone who was not following the rules of the community in order to remove it
– Happened especially during (German) carnival and Shrove Tuesday
– There was no legal rule about it ...
– ... in the early Middle Ages, however, this became a fundamental part of early adaptations of constitutions


„ACID for the eHumanities“: Complexity III

Article 13 of the current German constitution

The home is inviolable.

Focus here: the task of tracing how constitutions evolve in different societies.


„ACID for the eHumanities“: Complexity IV

Decision on online observation by the German government

Article 13: The home is inviolable. vs.

judgement on online observation by federal institutions in the context of terrorism

... The interest protected by this fundamental right is the spatial sphere in which private life unfolds [...]. Besides private homes, business and office premises also fall within the scope of protection of Art. 13 GG [...]. The protection of the fundamental right is not limited to warding off physical intrusion into the home. Measures by which state authorities use special technical means to gain insight into events inside the home that are hidden from natural perception from outside the protected area must also be regarded as an interference with Art. 13 GG. These include not only acoustic or optical surveillance of living space [...], but also, for example, the measurement of electromagnetic emissions by which the use of an information technology system in the home can be monitored. This may also concern a system that works offline. ...

Source: http://www.bundesverfassungsgericht.de/entscheidungen/rs20080227_1bvr037007.html


„ACID for the eHumanities“: Complexity of text re-use research


Complex tasks strongly need collaboration!

Google group for Historical Text Re-use: http://groups.google.com/group/historical-text-re-use


Summary

Scope of ULEI's subproject

Who is involved?

ACID for the eHumanities as a paradigm for successful projects

From mission to vision


eTRACES/ASV: 'The team'

Gerhard Heyer

Maria Moritz

Petra Gamrath

Thomas Efer

Christian Kötteritzsch

Marco Büchler

Frederik Baumgardt