BiographyNed eScience Center 21 March 2013


Page 1: BiographyNed

BiographyNed

eScience Center 21 March 2013

Page 2: BiographyNed

Methodological Issues

How telling is the output of our tools?

• Selection made by (editors of) dictionaries
• Reliability of automated text analysis
• Introduction of biases in the methodology

Careful evaluation and detailed communication are required…

Page 3: BiographyNed

Statistics on available information

[Bar chart: individuals with available information (%), on a 0 to 120 axis, for the categories Name, Category, Gender, Date of Death, Date of Birth, Place of Birth, Place of Death, Occupation, Religion, Father, Mother, Claim to Fame, Partner and Text.]

Page 4: BiographyNed

Textual Information per person

Information                          Numbers
Average XML-files per individual     2.79
Texts                                78.75%
Words (total/person)                 288.83
Words (longest text/person)          229.04
Words (total/text)                   366.76
Words (longest text)/texts           290.83

Page 5: BiographyNed

Availability of Information in the portal

[Bar chart: number of individuals (axis 0 to 90,000) per category: Partner, Mother, Father, Claim to Fame, Religion, Occupation, Date of birth, Place of birth, Date of death, Place of death, Category, Name. Legend: Information Absent, Text available.]

Page 6: BiographyNed

Presence of information for governors of Dutch Indies (% on 71 individuals)

[Bar chart: presence of information (%, axis 0 to 100) per category: marriage, multiple marriage, partners, Children (number), Children (names), Age (start function), Place of Birth, Place of Death, Studies, Previous career, Reason job end, Last job, Family connections, Religion. Legend: metadata, text.]

Page 7: BiographyNed

Biography Portal of the Netherlands. The Sources

Page 8: BiographyNed

Overview

• History and Biography
• Where do eScience and History meet?
• Use Cases

Page 9: BiographyNed

Historical Research

The Art and Science of History: Drawing up a narrative from primary and secondary sources which approximates historical reality as well as possible.

Page 10: BiographyNed

Building Blocks and Concrete

• Building blocks: facts derived mainly from archival findings and existing literature

• Concrete: the methods historians use to put them together into a narrative/synthesis.

• The Narrative: a historical synthesis which cannot be scientifically proven (only made likely), based on facts which can be proven or falsified. There is necessarily a creative element in drawing up a narrative.

Page 11: BiographyNed

Example: Grand Pensionary Johan de Witt (1625-1672)

• Building blocks: born in 1625; son of Jacob and Anna van den Corput; appointed grand pensionary in 1653; murdered in The Hague in 1672; enemy of William (III) of Orange; William of Orange rewarded one of the instigators of the murder

• Concrete: (logic) Based on these last data it is likely that William ordered the death of Johan

• Narrative: William probably ordered the death of Johan <= proposition based on facts and reasoning

Page 12: BiographyNed

The House of History

Page 13: BiographyNed

The Importance of Provenance

The only way to falsify presented historical facts is by going back to the original source(s) and looking at those sources critically.

It is highly important to know exactly where each piece of information comes from.

Page 14: BiographyNed

Our Sources Here

• The Metadata: building blocks

• The entries in biographical dictionaries themselves: short historical narratives

Page 15: BiographyNed

Status of Biography in Academia and Society

• Despite efforts this century to better embed biography in academic theories and methods, some (e.g. some social historians) still do not consider it a worthy academic discipline, finding it too anecdotal and limited.

• Biography is the most popular non-fiction genre in bookstores (from both academic and lay authors)

Page 16: BiographyNed

Where do eScience and History meet? (I)

“And when the capsule biography of an individual is combined with 50,000 others, many of them relatively obscure, […] and when they are all powerfully searchable online, the social historian’s grumbles about biography’s limitations as an approach to historical study dissolves into nothingness.”

(Brian Harrison, 2004, former editor of the Oxford Dictionary of National Biography)

Page 17: BiographyNed

Where do eScience and History meet? (II)

A. Quantitative analyses of a larger group of people (prosopography). Surpassing the anecdotal.

B. Finding relations/networks between people which are otherwise hard to detect

Page 18: BiographyNed

Where do eScience and History meet? (III)

C. Insight into historiography and historical selectivity. Who was described/included and why?

“Undoubtedly I have deprived many interesting women by not including them. The only thing I can say to defend myself is this: history writing is also a process of ruthless selection.” (Els Kloek, head of the Biography Portal and main author of 1001 vrouwen)

D. Thematic research. E.g.: When did the discovery of America start to influence people’s lives?

Page 19: BiographyNed

BiographyNed Use Cases

In the initial stages of the research, a list of possible historical questions within those four themes was drawn up (subject to change). The demonstrator should be able to answer these questions, or at least point to a direction or trend.

Page 20: BiographyNed

Case I: Making life easier: Group portrait of the Governors-General

• Highest official in the Dutch Indies, 1610-1949
• 71 men (still a relatively small group)
• What can we say about these men as a group?
• Who was appointed and what qualities did he have to have?
• Etc.…

Page 21: BiographyNed

Case I: data mining

• Family connections (parents/wife/children, other relevant connections <= patronage)
• Place of Birth
• Education
• Religion
• Career (patterns)
• Age at appointment
• Duration of holding the office
• Reason for leaving the office
• Place of Death
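Two of the items above, age at appointment and duration of holding the office, are simple derivations once birth and appointment dates exist as structured data. The following is a minimal sketch under that assumption; the record layout, the field names and the two placeholder records are hypothetical and do not come from the Biography Portal.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean


@dataclass
class OfficeHolder:
    # Hypothetical record layout; the portal's real metadata fields may differ.
    name: str
    born: date
    appointed: date
    left_office: date


def age_at(born: date, when: date) -> int:
    """Age in completed years on the day `when`."""
    before_birthday = (when.month, when.day) < (born.month, born.day)
    return when.year - born.year - int(before_birthday)


def years_in_office(holder: OfficeHolder) -> float:
    """Duration of the term in years (using 365.25-day years)."""
    return (holder.left_office - holder.appointed).days / 365.25


# Placeholder records, invented purely to exercise the functions.
holders = [
    OfficeHolder("Person A", date(1600, 6, 1), date(1640, 1, 1), date(1645, 1, 1)),
    OfficeHolder("Person B", date(1700, 3, 15), date(1750, 3, 15), date(1753, 3, 15)),
]

ages = [age_at(h.born, h.appointed) for h in holders]
print("ages at appointment:", ages)
print("mean age at appointment:", mean(ages))
print("mean term in years:", round(mean(years_in_office(h) for h in holders), 1))
```

The same pattern would extend to the other items only where the portal holds them as dates or categories; free-text fields such as "reason for leaving the office" need the text extraction discussed later.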

Page 22: BiographyNed

Case I: Time and Effort

More than 1 full week to manually mine this information from the Biography Portal. Can a historian do this with (almost) the same results in under one hour if helped by the demonstrator?

Page 23: BiographyNed

Case II: Making things possible: The Dutch Nation & Identity

• Who were selected to be included in National Biographical Dictionaries and why? (what was their claim to fame?)

• Are there different perspectives on the same person over time, and how can this be explained?
• Who was deemed most important? (based on the length of the entries)
• What time periods are most represented?
• Is there a difference in claim to fame for people from different periods in history, or between men and women?
• Which words are used most often and can we link them to national identities?

Page 24: BiographyNed

Case II: More Questions …

• What events are mentioned most often and what does that say about the status quaestionis of how the Dutch see/saw themselves?

• What are the differences in the answers to these questions between several national biographical dictionaries?

• Are people and events described or appreciated differently over time? Does the perspective change?

• How does this relate to biographical dictionaries, nations and identities elsewhere in Europe?

Page 25: BiographyNed

Conversion to Linked Data

Page 26: BiographyNed

A crash course on Linked Data

Online machine-readable data with links
• Simple facts called ‘RDF Triples’

Thorbecke > hasBirthPlace > Zwolle

Some technology concepts:
• Schemas: To structure LD
• RDF Stores: To store LD
• SPARQL: To access LD

Huge growth in the past years:
• More than 300 data sources
• More than 30 billion triples
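To make the triple idea concrete, here is a minimal sketch in plain Python rather than a real RDF store: triples are tuples, and a wildcard match stands in for a SPARQL graph pattern. The `match` helper and every predicate name other than `hasBirthPlace` are illustrative assumptions.

```python
from typing import Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]

# The first triple is the example from the slide; the others are illustrative.
triples: List[Triple] = [
    ("Thorbecke", "hasBirthPlace", "Zwolle"),
    ("Thorbecke", "hasBirthYear", "1798"),
    ("Zwolle", "locatedIn", "Netherlands"),
]


def match(
    store: List[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield triples that fit the pattern; None is a wildcard (like ?x in SPARQL)."""
    for triple in store:
        if all(want is None or want == got for want, got in zip((s, p, o), triple)):
            yield triple


# Roughly: SELECT ?place WHERE { Thorbecke hasBirthPlace ?place }
birth_places = [o for _, _, o in match(triples, s="Thorbecke", p="hasBirthPlace")]
print(birth_places)
```

A real setup would keep the triples in an RDF store and query them with SPARQL; the tuple list only illustrates the data model.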

Page 27: BiographyNed

The conversion process

Data Preservation

Purely syntactic conversion
• Preserve the original structure of the data
• Prevent loss of information
• Allow for reinterpretation of the original data in the future

Page 28: BiographyNed

The conversion process

Conversion steps:
• Retrieval of XML dump of the Biography Portal
• Initial conversion to ‘crude’ RDF
– Using ClioPatria and the XMLRDF tool for ClioPatria
• RDF restructuring
• Linking to other sources
– Essential step in the ‘Linked Data’ philosophy
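The ‘crude’ conversion step can be pictured as a purely syntactic walk over the XML tree that emits one triple per attribute and per child element, without interpreting anything. The sketch below illustrates that idea only: it does not reproduce ClioPatria or its XMLRDF tool, and the XML fragment, element names and identifier scheme are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML fragment; the real Biography Portal dump has its own element names.
XML = """
<biography id="b1">
  <person>
    <name>Thorbecke</name>
    <birth when="1798-01-14" place="Zwolle"/>
  </person>
</biography>
"""


def crude_triples(element, subject, out):
    """Mirror the XML tree as triples: one per attribute, child link and text value."""
    for attr, value in element.attrib.items():
        out.append((subject, attr, value))
    for index, child in enumerate(element):
        # Hypothetical identifier scheme: parent id plus tag name and position.
        child_id = f"{subject}/{child.tag}{index}"
        out.append((subject, child.tag, child_id))
        text = (child.text or "").strip()
        if text:
            out.append((child_id, "value", text))
        crude_triples(child, child_id, out)
    return out


root = ET.fromstring(XML)
for triple in crude_triples(root, "bio:b1", []):
    print(triple)
```

Because nothing is interpreted or dropped at this stage, the later restructuring and linking steps can be redone without going back to the XML.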

Page 29: BiographyNed

The conversion process

Data schema:
• Based on the structure of the original XML files
• Needs to facilitate the coupling of different biographies of the same person, without compromising the original data
• Needs to facilitate the incorporation of several enrichments, following from NLP, Entity Reconciliation, etc.
• Compatible with existing schemas such as the Europeana Data Model, PROV, RDAgr2, FOAF, DC terms
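One way to read these schema requirements is as a data structure in which a person holds several biographical descriptions, each with its own provenance, and enrichments are added as further descriptions instead of overwriting the source data. The sketch below assumes exactly that reading; the class and field names are illustrative and are not the actual BiographyNed schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class BiographicalDescription:
    # Frozen: a description is never edited after it has been created.
    provenance: str                            # e.g. a source dictionary or an NLP tool
    metadata: Tuple[Tuple[str, str], ...] = () # (key, value) pairs as delivered


@dataclass
class Person:
    identifier: str
    descriptions: List[BiographicalDescription] = field(default_factory=list)

    def add(self, description: BiographicalDescription) -> None:
        """Attach a description; existing descriptions are left untouched."""
        self.descriptions.append(description)

    def values(self, key: str) -> Dict[str, str]:
        """All values for a metadata key, keyed by where they came from."""
        return {
            d.provenance: value
            for d in self.descriptions
            for k, value in d.metadata
            if k == key
        }


thorbecke = Person("Thorbecke")
thorbecke.add(BiographicalDescription("NNBW", (("birth", "1798"),)))
thorbecke.add(
    BiographicalDescription("NLP tool", (("birth", "1798-01-14"), ("birthPlace", "Zwolle")))
)
print(thorbecke.values("birth"))
```

Keeping the enrichment separate means a reader can always tell which value came from the dictionary and which from automated analysis.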

Page 30: BiographyNed

BiographyNed schema

[Diagram of the BiographyNed schema. Recoverable labels: Thorbecke; Biographical Description (twice); Provenance Meta Data (NNBW); Person Meta Data (“Thorbecke”); Biography Parts; Event: Birth, 1798; Enrichment: NLP Tool; Person Meta Data; Event: Birth, Zwolle, 1798-01-14.]

Example text in the diagram: “Johan Rudolph Thorbecke werd in 1798 geboren op 14 januari in Zwolle en komt uit een half-Duitse…” (“Johan Rudolph Thorbecke was born in 1798, on 14 January, in Zwolle and comes from a half-German…”)

Page 31: BiographyNed

Retrieving Information from Text

Page 32: BiographyNed

The texts in the Biography Portal

• Collection of biographical dictionaries
• Dutch, including from the 19th and early 20th century and even older quotes
• Sources (different dictionaries/collections) have their own style
• Metadata available (though large differences in completeness)

Page 33: BiographyNed

Challenges and Advantages

• Challenges:
– Little work on NLP and biographies
– Performance of Dutch NLP tools on variations of Dutch
• Advantages:
– High-quality metadata covering several categories of information (supervised machine learning)
– Within sources, clear and similar structure of texts

Page 34: BiographyNed

General Approach

• Start by using advantages:
– Use metadata to label information
– A basic IR system can be built using sentence number and lemmas as features
• Enhance performance with NLP tools
• Build upon information retrieved in the first steps to tackle more challenging tasks

Page 35: BiographyNed

A Basic System

• Supervised Machine Learning
• Two-step identification process (Wu and Weld 2007; 2010; Fader et al. 2011)
– Identify sentence that contains information
– Sequence tagging to identify information within the sentence
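The two steps can be sketched with a toy example: first choose the sentence most likely to hold the information, using sentence number and tokens as features, then locate the value inside it. This is not the project's system and makes several simplifying assumptions: a count-based scorer replaces a trained classifier, a regular expression replaces the sequence tagger, lowercased tokens stand in for lemmas, and the English sentences are invented.

```python
import re
from collections import Counter


def features(index, sentence):
    """Sentence number plus lowercased tokens (a stand-in for real lemmas)."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [f"pos={index}"] + [f"tok={t}" for t in tokens]


def train(labelled):
    """labelled: (index, sentence, has_birth_info) items, e.g. labelled via metadata."""
    weights = Counter()
    for index, sentence, positive in labelled:
        for feat in features(index, sentence):
            weights[feat] += 1 if positive else -1
    return weights


def find_birth_year(sentences, weights):
    # Step 1: pick the sentence most likely to contain the information.
    scored = [(sum(weights[f] for f in features(i, s)), i) for i, s in enumerate(sentences)]
    _, best_index = max(scored, key=lambda pair: pair[0])
    # Step 2: locate the value in that sentence (regex instead of a sequence tagger).
    found = re.search(r"\b(1[0-9]{3}|20[0-9]{2})\b", sentences[best_index])
    return found.group(1) if found else None


# Invented training sentences, for illustration only.
training = [
    (0, "She was born in 1801 in Utrecht.", True),
    (1, "She studied law in Leiden.", False),
    (0, "He was born in 1750 in Delft.", True),
    (2, "He died in 1810.", False),
]
weights = train(training)
biography = ["He was born in 1798 in Zwolle.", "He later became a professor."]
print(find_birth_year(biography, weights))
```

Splitting the task this way keeps the second step small: the tagger only has to work on sentences that were already judged relevant.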

Page 36: BiographyNed

Adding NLP

• Location & Date recognition (GeoNames)
• (other) Named Entities (VIAF enhanced with names from metadata)
• Depending on performance of the system, we’ll work on:
– Chunking, multiword recognition
– Parsing
– Word Sense Disambiguation

Page 37: BiographyNed

Metadata & Project Goals

• Duplicate detection (metadata and text)
• Events/Network discovery
– Education (begin, end, location)
– Occupation (begin, end, location)
– Relations (parents, partners)
• Temporal relations between events
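For the metadata side of duplicate detection, a simple starting point is to group records on a normalised name plus birth year and treat groups with more than one record as candidates for review. The sketch below assumes that approach; the record fields and the identifiers are invented, and a real system would also need fuzzy matching and the text itself.

```python
import re
import unicodedata
from collections import defaultdict


def name_key(name):
    """Order-insensitive, accent-insensitive key for a personal name."""
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    return frozenset(re.findall(r"[a-z]+", ascii_name.lower()))


def duplicate_candidates(records):
    """Group record ids that share a normalised name and a birth year."""
    groups = defaultdict(list)
    for record in records:
        groups[(name_key(record["name"]), record["birth_year"])].append(record["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


# Invented records; the field names are assumptions, not the portal's metadata schema.
records = [
    {"id": "a1", "name": "Thorbecke, Johan Rudolph", "birth_year": 1798},
    {"id": "b7", "name": "Johan Rudolph Thorbecke", "birth_year": 1798},
    {"id": "c3", "name": "Johan de Witt", "birth_year": 1625},
]
print(duplicate_candidates(records))
```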

Page 38: BiographyNed

Output first system

• Better coverage of categories mentioned above

• A timeline for a person’s life (birth, education, occupation, locations, death)

• Named Entities in text (dates, locations, persons)

Page 39: BiographyNed

Beyond the first system

The information provided by the first system can be used to:

1. Identify alternative descriptions of events (same time, location and/or participants)

2. Identify relations between events (same locations & time, consequent events, same participants, etc.)

3. Initial networks of people
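A minimal sketch of points 1 and 3, assuming the first system delivers events as simple records: events from different sources that agree on type, time, place and participants are treated as alternative descriptions, and people who share an event become linked in a network. The event records, source names and field names below are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Invented event records, as they might come out of a first extraction system.
events = [
    {"id": "e1", "source": "dictionary A", "type": "birth",
     "date": "1798-01-14", "place": "Zwolle", "people": {"Thorbecke"}},
    {"id": "e2", "source": "dictionary B", "type": "birth",
     "date": "1798-01-14", "place": "Zwolle", "people": {"Thorbecke"}},
    {"id": "e3", "source": "dictionary A", "type": "meeting",
     "date": "1840", "place": "The Hague", "people": {"Person X", "Person Y"}},
]


def alternative_descriptions(events):
    """Events from different sources with the same type, date, place and participants."""
    groups = defaultdict(list)
    for e in events:
        groups[(e["type"], e["date"], e["place"], frozenset(e["people"]))].append(e)
    return [
        sorted(e["id"] for e in group)
        for group in groups.values()
        if len({e["source"] for e in group}) > 1
    ]


def people_network(events):
    """Undirected edges between people who take part in the same event."""
    edges = defaultdict(set)
    for e in events:
        for a, b in combinations(sorted(e["people"]), 2):
            edges[(a, b)].add(e["id"])
    return dict(edges)


print(alternative_descriptions(events))
print(people_network(events))
```

Exact matching is the strictest possible criterion; the "and/or" in point 1 suggests that a real system would also accept partial agreement.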

Page 40: BiographyNed

Methodological issues and text interpretation

• Results should be reproducible
– Code release (including scripts, configurations, …)
– Documentation
– Open source data

• The setup should be modular
– Combine output of different tools
– Flexible choice of methods used

Page 41: BiographyNed

Evaluation Challenges (1/2)

• How to evaluate the extraction tools?
• Partial evaluation using metadata (10-fold cross-validation), but:
1. No precise indication of precision or recall (incomplete metadata…)
2. Biographies with rich metadata are not necessarily representative

Manually annotated data needed!
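The first caveat can be shown with a small sketch of k-fold splitting and precision/recall against metadata used as the gold standard. The helper names and the example facts are assumptions made for illustration; the point is that a correct extraction for a person whose metadata is missing is counted as an error.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists; every item is tested exactly once."""
    indices = list(range(n_items))
    for fold in range(k):
        test = indices[fold::k]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


def precision_recall(predicted, gold):
    """predicted/gold: sets of (person_id, category, value) facts."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented facts: the metadata (gold) has no birth place for person p2.
gold = {("p1", "birth_place", "Zwolle")}
predicted = {("p1", "birth_place", "Zwolle"), ("p2", "birth_place", "Delft")}

# The p2 fact counts as a false positive even if the text really says Delft.
print(precision_recall(predicted, gold))
print(sum(len(test) for _, test in k_fold_indices(25)))
```

Manually annotated texts remove this distortion because the gold standard then covers everything the text states, not only what the metadata happens to record.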

Page 42: BiographyNed

Evaluation Challenges (2/2)

• How to compare the performance of NLP tools?
– Little work on biographies, little or none on Dutch ones…
– How hard are older texts? Can we quantify?

Systematic comparison:
• English biographies (Wikipedia)
• Dutch biographies (Wikipedia)
• Biographies from the portal

Page 43: BiographyNed

Reproducibility/Replication

• What do results mean if they cannot be reproduced?

• What variation in results can be expected based on details not mentioned in papers?

• Which information is needed to replicate results or find the origin of differences?

Paper submitted to ACL 2013 (joint work with Marieke van Erp and others)

Page 44: BiographyNed

Representations (tools)

• How to represent and combine output of different tools?
– Compatibility (easy to convert output of external NLP tools)
– Flexibility (be able to contain alternative representations and interpretations)

Integrate representations in NIF (joint work with Jesper Hoeksema and Willem van Hage)

Page 45: BiographyNed

Representation (events)

• How to combine knowledge from the NLP community and Linked Data community?
– Combination of textual information with external resources
– Complete representation of information from text (location, retrieval method)

Paper submitted to workshop on Events: Definition, detection, coreference and representation (joint work with Marieke van Erp, Willem van Hage, Sara Tonelli, and others)

Page 46: BiographyNed

Current state of affairs

• Basic system using sentence number and lemmas for the main metadata categories (evaluation ongoing)

• Module for labeling locations and dates in text (adaptations to be made for modularity)

• Annotation effort started for evaluation (selection of approximately 700 texts)

Page 47: BiographyNed

Demonstrator

Page 48: BiographyNed

Interface: Focus

• The interface should be easy to use
• The demonstrator should inspire historians to undertake new research and give direction, rather than being the ‘closing factor’ in their research
• The interface should allow users to ‘fine-tune’ the results returned by an initial action

Page 49: BiographyNed

Interface: Options

• Query composition
• Faceted browsing
• A combination

Page 50: BiographyNed

Interface: Query composition

• Drop-down boxes to select ‘Verbs’, data elements and relations

Page 51: BiographyNed

Interface: Faceted browsing

• No explicit querying, but convergence of the data through browsing and selecting
• Provides better feedback to the user
• Allows for more direct and easier adjustment of the selected data

Page 52: BiographyNed

Interface: Faceted browsing

Page 53: BiographyNed

Interface: A combination

• Query composition combined with faceted browsing
• Create new facets by defining a query
– The result of the query is available as a subset of the data by selecting the defined facet
– As such, combinable with other facets
• Method to integrate ‘open’ querying of the data into a general interface and visualization
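The combination can be sketched as follows: a facet is a named filter, a query is an arbitrary filter, and registering a query under a name makes it selectable alongside the ordinary facets. This is only an illustration of the idea, not the demonstrator's design; the records, field names and helper functions are invented.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Facet = Callable[[Record], bool]

# Invented records; the field names are assumptions.
people: List[Record] = [
    {"name": "Person A", "birth_place": "Zwolle", "birth_year": 1798, "occupation": "politician"},
    {"name": "Person B", "birth_place": "Delft", "birth_year": 1625, "occupation": "politician"},
    {"name": "Person C", "birth_place": "Zwolle", "birth_year": 1850, "occupation": "painter"},
]

facets: Dict[str, Facet] = {
    # Ordinary facets: one field, one value.
    "occupation: politician": lambda r: r["occupation"] == "politician",
    "birth place: Zwolle": lambda r: r["birth_place"] == "Zwolle",
}


def define_facet(name: str, query: Facet) -> None:
    """Register the result of an arbitrary query as a new, selectable facet."""
    facets[name] = query


def select(records: List[Record], *facet_names: str) -> List[Record]:
    """Faceted browsing: keep the records that satisfy every selected facet."""
    return [r for r in records if all(facets[n](r) for n in facet_names)]


define_facet("born in the 19th century", lambda r: 1801 <= r["birth_year"] <= 1900)
selection = select(people, "birth place: Zwolle", "born in the 19th century")
print([r["name"] for r in selection])
```

Because a query-defined facet behaves like any other facet, the user can keep narrowing or widening the selection by browsing instead of rewriting the query.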

Page 54: BiographyNed

Interface: A combination

[Diagram labels: Question Analysis, Selection Process, Results, Data, Facets.]

Page 55: BiographyNed

Interface: Demonstrator

Time and place are primary elements

Results

?

Page 56: BiographyNed
Page 57: BiographyNed

Questions