Preserving meanings in (multilingual) text mining for
Cultural Heritage
Michel Généreux David B. Arnold
University of Brighton, UK
Overview
Date 24.10.06 ICS-FORTH, Greece Slide 2
•The extraction tool for English:
•Natural Language & Semantic Processing
•Experiment 1: triple extraction from the CIDOC-CRM documentation
•Experiment 2: triple extraction from free text
•Extension to other languages (French and German)
•Use of the tool in EPOCH
•(Cultural Heritage corpus creation)
•Discussion
Motivation for the extraction tool
• Extract and facilitate the mapping of (unstructured) textual information to CIDOC-CRM
• Assist more generic mapping tools (AMA)
• Develop a first prototype able to extract triples from English texts
• Evaluate the tool, using data from the CIDOC-CRM documentation
On the propositional nature of CIDOC-CRM
“The domain class is analogous to the grammatical subject of the phrase for
which the property is analogous to the verb. Property names in the CRM
are designed to be semantically meaningful and grammatically correct when read
from domain to range. In addition, the inverse property name, normally
given in parentheses, is also designed to be semantically meaningful and
grammatically correct when read from range to domain.”
Definition of the CIDOC Conceptual Reference Model, June 2005, page vi
[Diagram labels: ERCDM; XML Schema; CIDOC XML Schema; De facto standard model; CIDOC model; XML CIDOC Compliant Schema]
Mapping Tools
Assisting generic tools: EPOCH AMA
The idea is to represent in the simplest way the CIDOC model as well as the original model and to let the domain expert insert his/her knowledge in the mapping.
Text mining for CH
TEXT
CIDOC-CRM
Database
The creation of a CIDOC-CRM compatible database for CH is partially automated by extracting triples after analyzing free texts in natural language.
Text analysis
Triple extraction
Database creation
CRM ~ semantic representation
We have a painting of John Constable entitled “A country
lane” and identified by “203-1888”. It is a watercolour which
is 21.3 cm high and 18.3 cm wide. It was painted in
England.
The problem
• The capability of the CRM to describe so many formats
with so few properties is due to the fact that most actions
and events are not encoded as properties, but as paths with
the event as node in the middle. So "Van Gogh painted
this .." translates to two triples with a Production in the
centre. I.e. hundreds of action verbs have to be recognized
and mapped. If this is not understood no useful matching
can be done (anonymous reviewer)
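The reviewer's point can be illustrated with a minimal sketch: an action verb is not mapped to a single property but expanded into two triples that share an event node. The class and property labels (E12 Production, P14 carried out by, P108 has produced) are taken from the CIDOC-CRM definition; the lookup table `VERB_TO_EVENT`, the function `expand_action`, and the event identifier scheme are hypothetical and only illustrate the idea.

```python
# Hypothetical verb-to-event table. A real system would need hundreds of
# entries, as the reviewer notes; only two verbs are listed here.
VERB_TO_EVENT = {
    "paint": ("E12 Production", "P14 carried out by", "P108 has produced"),
    "draw": ("E12 Production", "P14 carried out by", "P108 has produced"),
}


def expand_action(subject, verb_lemma, obj, event_id=1):
    """Expand (subject, verb, object) into two triples joined by an event node.

    Returns an empty list when the verb is not in the table.
    """
    if verb_lemma not in VERB_TO_EVENT:
        return []
    event_class, agent_property, object_property = VERB_TO_EVENT[verb_lemma]
    event = f"{event_class}:{event_id}"  # illustrative identifier only
    return [
        (event, agent_property, subject),
        (event, object_property, obj),
    ]


if __name__ == "__main__":
    for triple in expand_action("Van Gogh", "paint", "this painting"):
        print(triple)
```

The event node sits in the middle of the path: the painter and the painting are both attached to the same Production rather than to each other.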
Some terms
• semantic association: measure of co-occurrence of terms
• lemma: base form of a word
• POS: part of speech
• clause: a coherent whole of POSs
• hypernym: a word more generic than a given word
• chunking: breaking sentences into clauses
• WordNet: structured lexical database
POS tagging and Phrase chunking
Getting Semantics through Syntax
“ CIDOC-CRM is great ”
Sentence: be(CIDOC-CRM,great)
    NP: CIDOC-CRM
    VP: be(x,great)
        V: be(x,y)
        AdjP: great
Grammar rules: Sentence → NP VP; VP → V AdjP
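The bottom-up composition on this slide, from be(x,y) to be(x,great) to be(CIDOC-CRM,great), can be sketched in a few lines. This is a toy illustration under stated assumptions: the three-word lexicon and the function `parse_sentence` are hypothetical, and only the two rules Sentence → NP VP and VP → V AdjP are covered.

```python
# Toy lexicon: each word maps to a (syntactic category, meaning) pair.
# The verb's meaning is a two-place function, mirroring be(x,y).
LEXICON = {
    "CIDOC-CRM": ("NP", "CIDOC-CRM"),
    "is": ("V", lambda x, y: f"be({x},{y})"),
    "great": ("AdjP", "great"),
}


def parse_sentence(tokens):
    """Compose the meaning of a three-word 'NP V AdjP' sentence."""
    (np_cat, np_sem), (v_cat, v_sem), (adj_cat, adj_sem) = (
        LEXICON[token] for token in tokens
    )
    if (np_cat, v_cat, adj_cat) != ("NP", "V", "AdjP"):
        raise ValueError("sentence does not match NP V AdjP")

    # VP -> V AdjP: fill the second argument, giving be(x,great).
    def vp_sem(x):
        return v_sem(x, adj_sem)

    # Sentence -> NP VP: fill the first argument, giving be(CIDOC-CRM,great).
    return vp_sem(np_sem)


if __name__ == "__main__":
    print(parse_sentence(["CIDOC-CRM", "is", "great"]))
```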
Getting Semantics directly through a Semantic Grammar
“ CIDOC-CRM is great ”
Assertion: be(CIDOC-CRM,great)
    Brand: CIDOC-CRM
    Action_Info: be(x,great)
        Action: be(x,y)
        Info: great
Grammar rules: Assertion → Brand Action_Info; Action_Info → Action Info
Getting Semantics from keywords and pattern-matching
<assertion> → <CIDOC-CRM> “is” <info>
<opinion> → “I think” <CIDOC-CRM> “is” <info>
Other noisy input is simply skipped.
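A minimal sketch of the keyword and pattern-matching idea, using Python's standard `re` module. The two patterns mirror the assertion and opinion templates above; the pattern list, the function name `match_keywords`, and the choice to treat `<info>` as a single word are assumptions made for illustration. Text around the match is skipped, as the slide describes for noisy input.

```python
import re

# More specific patterns come first so that "I think ..." is not
# reported as a plain assertion.
PATTERNS = [
    ("opinion", re.compile(r"\bI think (CIDOC-CRM) is (\w+)", re.IGNORECASE)),
    ("assertion", re.compile(r"\b(CIDOC-CRM) is (\w+)", re.IGNORECASE)),
]


def match_keywords(text):
    """Return (label, topic, info) for the first matching pattern, else None."""
    for label, pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            return label, match.group(1), match.group(2)
    return None


if __name__ == "__main__":
    print(match_keywords("Well, you know, I think CIDOC-CRM is great!"))
    print(match_keywords("Honestly CIDOC-CRM is great"))
```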
What about words not in WordNet (proper nouns)? Semantic Orientation
SO-A in practice
We compute Association using PMI:
[equation missing from this transcript]
which is positive when the words co-occur more often than chance would predict and negative when they co-occur less often.
We compute PMI in IR using hit counts over a large corpus:
[equation missing from this transcript]
N is the total number of documents in the corpus, smoothing makes PMI-IR 0 for words not in the corpus, and NEAR means a distance of at most 20 words (Turney and Littman, 2003).
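The PMI-IR equation itself is not reproduced in this text. As a hedged sketch: plain PMI is log2(p(w1, w2) / (p(w1) p(w2))), and the function below estimates it from hit counts with one possible smoothing. That smoothing was chosen only because it gives exactly 0 when all hit counts are 0, which is the property stated above. The function name, the argument names, and the specific +1 terms are assumptions, not the authors' formula.

```python
import math


def pmi_ir(hits_w1, hits_w2, hits_near, n_docs):
    """Smoothed PMI estimated from document hit counts.

    hits_w1, hits_w2: documents containing each word.
    hits_near: documents where the two words occur NEAR each other.
    n_docs: total number of documents (N).

    Assumed smoothing: add 1/N to the joint count and 1 to each single
    count, i.e. log2(N * (hits_near + 1/N) / ((hits_w1 + 1) * (hits_w2 + 1))).
    The numerator is written as N * hits_near + 1 to stay exact in
    floating point.
    """
    numerator = n_docs * hits_near + 1
    denominator = (hits_w1 + 1) * (hits_w2 + 1)
    return math.log2(numerator / denominator)


if __name__ == "__main__":
    print(pmi_ir(0, 0, 0, 1000))       # words absent from the corpus
    print(pmi_ir(10, 10, 10, 1000))    # words that always co-occur
    print(pmi_ir(100, 100, 0, 1000))   # frequent words that never co-occur
```

With this choice, words that never appear score 0, words that co-occur more than chance score above 0, and frequent words that never co-occur score below 0.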
Approach
1. Text cleaning
2. Tokenization and POS tagging
3. Clause chunking and pruning
4. NC regrouping
5. Intermediate triple (IT) creation
6. Referent resolution
7. Final triple (FT) creation
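Of the steps listed for this approach, referent resolution is the easiest to illustrate in isolation. The sketch below is a toy heuristic, not the tool's method: it assumes intermediate triples are (subject, predicate, object) tuples and replaces a third-person pronoun subject with the object of the most recent triple whose subject was not such a pronoun. The function name `resolve_referents` and the pronoun set are hypothetical.

```python
# Third-person pronouns that this toy heuristic tries to resolve.
THIRD_PERSON = {"it", "they"}


def resolve_referents(intermediate_triples):
    """Replace pronoun subjects with the most recently introduced object.

    Triples whose pronoun cannot be resolved are dropped.
    """
    resolved = []
    antecedent = None
    for subject, predicate, obj in intermediate_triples:
        if subject.lower() in THIRD_PERSON:
            if antecedent is None:
                continue  # nothing to resolve against
            resolved.append((antecedent, predicate, obj))
        else:
            resolved.append((subject, predicate, obj))
            antecedent = obj  # the new object becomes the likely referent
    return resolved


if __name__ == "__main__":
    triples = [
        ("we", "have", "painting"),
        ("it", "be", "watercolour"),
        ("it", "paint in", "England"),
    ]
    for triple in resolve_referents(triples):
        print(triple)
```

On the John Constable example used in these slides, both occurrences of "it" resolve to "painting" under this heuristic.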
Experiments (1)
•144 sentences
•184 final triples
•at least one final triple for 46 sentences
Lange Herzogstrasse is Wolfenbüttel
main shopping area. The street's
particular charm lies in its broad-face
half-timbered buildings, historic
merchant's houses; their central gables
still retain the distinctive hatches
through which goods could be hoisted
up to the attics for storage.
Experiments (2)
•A text of 3922 words and 173 sentences.
•197 intermediate triples and 79 final triples extracted.
We have a painting of John Constable entitled “A country
lane” and identified by “203-1888”. It is a watercolour which
is 21.3 cm high and 18.3 cm wide. It was painted in
England.
A shallower approach for English
We PP we
have VHP have
a DT a
painting NN painting
of IN of
John NP John
Constable NP Constable
entitled VVD entitle
" `` "
A DT a
country NN country
lane NN lane
" '' "
and CC and
identified VVD identify
by IN by
" `` "
203-1888 JJ @card@
" '' "
. SENT .
It PP it
is VBZ be
a DT a
watercolour NN <unknown>
which WDT which
is VBZ be
21.3 CD @card@
centimetres NNS centimetre
high JJ high
and CC and
18.3 CD @card@
centimetres NNS centimetre
wide JJ wide
. SENT .
It PP it
was VBD be
painted VVN paint
in IN in
England NP England
. SENT .
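The tagged listing above has three columns per line (token, POS tag, lemma) and marks sentence ends with the tag SENT. A minimal sketch of a shallow extraction over such output follows. The tag names (PP, tags starting with V or N, SENT) come from the listing; the functions `read_tagged` and `shallow_triple`, the fallback to the lowercased token when the lemma is missing or `<unknown>`, and the subject/verb/object heuristic are assumptions for illustration only.

```python
SAMPLE = """\
It PP it
is VBZ be
a DT a
watercolour NN
. SENT .
It PP it
was VBD be
painted VVN paint
in IN in
England NP England
. SENT .
"""


def read_tagged(text):
    """Parse 'token TAG lemma' lines into sentences of (token, tag, lemma)."""
    sentences, current = [], []
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip malformed lines
        token, tag = fields[0], fields[1]
        has_lemma = len(fields) > 2 and fields[2] != "<unknown>"
        lemma = fields[2] if has_lemma else token.lower()
        if tag == "SENT":
            sentences.append(current)
            current = []
        else:
            current.append((token, tag, lemma))
    if current:
        sentences.append(current)
    return sentences


def shallow_triple(sentence):
    """Return (subject, verb, object) lemmas, or None if the pattern fails.

    Subject: first pronoun or noun. Verb: last verb of the first verb
    group after the subject. Object: first noun after that verb group.
    """
    tags = [tag for _, tag, _ in sentence]
    subj = next((i for i, t in enumerate(tags)
                 if t == "PP" or t.startswith("N")), None)
    if subj is None:
        return None
    verb = next((i for i in range(subj + 1, len(tags))
                 if tags[i].startswith("V")), None)
    if verb is None:
        return None
    while verb + 1 < len(tags) and tags[verb + 1].startswith("V"):
        verb += 1  # move to the end of the verb group, e.g. "was painted"
    obj = next((i for i in range(verb + 1, len(tags))
                if tags[i].startswith("N")), None)
    if obj is None:
        return None
    return sentence[subj][2], sentence[verb][2], sentence[obj][2]


if __name__ == "__main__":
    for sent in read_tagged(SAMPLE):
        print(shallow_triple(sent))
```

Because the heuristic only relies on the coarse tag prefixes, it stays shallow: it ignores prepositions and modifiers and would need a different tag table for the French and German tagsets.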
Shallow extraction (English)
French
Nous avons une toile de John Constable intitulée «Une route de
campagne» et identifiée par «203-1888». C'est une aquarelle
qui a une largeur de 21.3 centimètres et une hauteur de 18.3
centimètres. Elle fût peinte en Angleterre.
Nous PRO:PER nous
avons VER:pres avoir
une DET:ART un
toile NOM toile
de PRP de
John NAM John
Constable NOM constable
intitulée ADJ <unknown>
" PUN:cit "
Une DET:ART un
route NOM route
de PRP de
campagne NOM campagne
" PUN:cit "
et KON et
identifiée VER:subp <unknown>
par PRP par
" PUN:cit "
203-1888 ABR @card@
" PUN:cit "
. SENT .
C' PRO:DEM ce
est VER:pres être
une DET:ART un
aquarelle NOM aquarelle
qui PRO:REL qui
a VER:pres avoir
une DET:ART un
largeur NOM largeur
de PRP de
21.3 NUM @card@
centimètres NOM <unknown>
et KON et
une DET:ART un
hauteur NOM hauteur
de PRP de
18.3 NUM @card@
centimètres NOM <unknown>
. SENT .
Elle PRO:PER elle
fût VER:futu <unknown>
peinte VER:pper peindre
en PRP en
Angleterre NAM Angleterre
. SENT .
Shallow extraction (French)
German
Wir haben einen Anstrich des John Constable erlauben
„Einen Landweg” und durch „203-1888” gekennzeichnet
werden. Es ist ein Aquarell, das 21.3 Zentimeter hoch und
18.3 Zentimeter breit ist. Es würde in England gemält.
Wir PPER wir
haben VAFIN haben
einen ART ein
Anstrich NN Anstrich
des ART d
John NE John
Constable NE <unknown>
erlauben VVFIN erlauben
" $( "
Einen NN Eine
Landweg NN Landweg
" $( "
und KON und
durch APPR durch
" $( "
203-1888 CARD @card@
" $( "
gekennzeichnet VVPP kennzeichnen
werden VAINF werden
. $. .
Es PPER es
ist VAFIN sein
ein ART ein
Aquarell NN Aquarell
, $, ,
das PRELS d
21.3 CARD @card@
Zentimeter NN Zentimeter
hoch ADJD hoch
und KON und
18.3 CARD @card@
Zentimeter NN Zentimeter
breit ADJD breit
ist VAFIN sein
. $. .
Es PPER es
würde VVFIN <unknown>
in APPR in
England NE England
gemält VVFIN <unknown>
. $. .
Shallow extraction (German)
Application: EPOCH CHARACTERISE
Question Answering for CH
User
Natural Language Interaction
NLP NLG
Through Natural Language Processing (NLP) and Generation (NLG), users can interact and query databases in CIDOC-CRM. Resources are available for a wide range of languages (multilingualism). By combining the mining and interactive tools, language technology automates the structuring and querying of heterogeneous and semi-structured information within the framework of CIDOC-CRM.
CIDOC-CRM
Database
Resources (1)
•Towards a Semantic Web for Heritage Resources
•http://www.digicult.info/downloads/html/1054648757/1054648757.html
•Art & Architecture Thesaurus Online
•http://www.getty.edu/research/conducting_research/vocabularies/aat/about.html
•National Monuments Record Thesauri
•http://thesaurus.english-heritage.org.uk/
•Networked Knowledge Organization Systems/Services
•http://nkos.slis.kent.edu/
•AGROVOC Multilingual Thesaurus
•http://www.fao.org/aims/ag_intro.htm
•Controlled vocabulary for the applied life sciences
•http://www.cabi-publishing.org/DatabaseSearchTools.asp?PID=277
•National Library of Medicine: http://www.nlm.nih.gov/mesh/
•MDA Archaeological Objects Thesaurus: http://www.mda.org.uk/archobj/archcon.htm
•EuroWordnet: http://nipadio.lsi.upc.es/wei.html
Resources (2)
•Multimatch: http://www.multimatch.org/contact.html
•HEREIN: http://www.european-heritage.net/sdx/herein/thesaurus/introduction.xsp
•MUSEUM: http://www.science.uva.nl/~kamps/museum/
•IKEM: http://www.ikem.be/
•Cultural Heritage: http://www.culturalheritage.net/
•Michael: http://www.michael-culture.org/project.html
•EUROVOC: http://europa.eu/eurovoc/
•AMA tools:
•http://www.epoch-net.org/index.php?option=com_content&task=view&id=74&Itemid=120
•MultiWordNet: http://multiwordnet.itc.it/english/home.php
Resources (3): Corpus and multiword term extraction
[Diagram labels: Abstract; Seeds; Triplets; URLs; Terms; Corpus; Corpus]
First seeds: archeology, art, historic, landscape, monument, museum, conservative
Second seeds: aic, amerisuites, arboretum, archaeology, archeological, archeology, archive, artifacts, blm, browse, cemeteries, conservation, copyright, crm, database, fax, fpi, hrs, iii, internet, interns, internship, metropark, minh, mini, minigrants, monuments, museums, ncptt, nlcs, nps, nsu, online, overview, parks, projects, ptt, services, website
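The diagram for this slide names a chain from seeds to triplets to URLs to a corpus, but does not say how the triplets are built. A minimal sketch follows, assuming that a triplet is a combination of three distinct seed words used as one web search query, as is common in seed-based corpus bootstrapping. The function `seed_triplets` and its parameters are hypothetical; retrieving URLs and building the corpus are out of scope here.

```python
import random
from itertools import combinations

# The short seed list shown on this slide.
FIRST_SEEDS = [
    "archeology", "art", "historic", "landscape",
    "monument", "museum", "conservative",
]


def seed_triplets(seeds, how_many, rng_seed=0):
    """Sample distinct three-word combinations of seeds.

    Each triplet is meant to serve as one search query. A fixed
    rng_seed keeps the sample reproducible.
    """
    all_triplets = list(combinations(sorted(set(seeds)), 3))
    rng = random.Random(rng_seed)
    return rng.sample(all_triplets, min(how_many, len(all_triplets)))


if __name__ == "__main__":
    for triplet in seed_triplets(FIRST_SEEDS, 5):
        print(" ".join(triplet))
```

Seven seeds give 35 possible triplets, so even a short seed list yields a varied set of queries.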
Archaeological Sites
Image for Archeology
Mailing lists and discussion
North Carolina
Back Issues
archaeological research
Financial Corp
Summer Institute
NCPTT Partners
Ministry of Culture
via e-mail
University Art
Hire Date
Preschooler Programs
Next page
de los
plot keywords
natural resources
Quick Links
Sculpture Conservation Studio
Museum Shop
Dot Net Nuke
Cast overview
megalithic tombs
Archaeological Institute of America
landscape painting
Sound Mix
Mc Culley
Department of Animal
Historical Center
historic sites
Find surnames
Previous page
Golf Course
Genealogy CDs
Children's and Juvenile Books
list archives
Historic Materials
Nine Inch Nails
favorite player
Dennis Montagna
Disaster Recovery by the Book
Preservation Officers
historic properties
U Minh
non-profit organization
Web Notify
Internet Service
Southwest Parks
Arts Council
archeology nationwide
download protocol
Science Center
web site
site provides
African Studies
Mini Grants
Travel Guides
programming language
Landscape Restoration
click the Order button
Ancient Near
Grant Programs
Personal Profile
Development Center
Break w
Remembering the Way
Kuala Lumpur
Guns N Roses
National Monuments
materials conservation
Hidden Treasures
para o
Command History
Architectural History
writing skills
x cm
Preservation Pioneer
Powered by LISTSERV
Remembering Nelson Hall
Please send
Financial Group
Local History
Art Center
Grant Recipients
Site Map
Grand Staircase-Escalante National
Blue Book
Can Tho
Resource guides
Park Net Accessibility FOIA Privacy
Fine art
Contact Lorine
South Carolina
Up the Heat on Research
Road Safety
web pages
22AM PDT
Total Transfers
Common Ground
Above: 100 random multiword terms out of the 1567 extracted
Discussion
•Benchmark for evaluation: CIDOC-CRM human-annotated texts
•Over-generation: properties expressed with ‘be’ and ‘have’
•How do we combine triples to form paths? (see *)
•Given modest results for English, how do we extend the previous method to other languages?
*The capability of the CRM to describe so many formats with so few properties is due to
the fact that most actions and events are not encoded as properties, but as paths with
the event as node in the middle. So "Van Gogh painted this .." translates to two triples
with a Production in the centre. I.e. hundreds of action verbs have to be recognized
and mapped. If this is not understood no useful matching can be done (anonymous
reviewer)
Thank you!
Questions?