Text Mining
Walter Daelemans
CNTS, Department of Linguistics, University of Antwerp
Centre for Dutch Language and Speech (CNTS)
• Part of the Department of Linguistics, University of Antwerp
• Staff: 2 tenured + 10-15 with temporary funding from EU, IWT, FWO, NTU, language industry, BOF, …
• Topics
  • Corpus linguistics (mainly Dutch)
  • Child language acquisition / computational psycholinguistics
  • Language technology
    • machine learning of language
    • shallow parsing
    • text mining
Information Overload
• Language is the most natural and most used knowledge representation formalism
• Non-structured or weakly structured information
  • Text
  • Databases with text fields
  • Web pages, e-mail messages, blogs, chat, …
• (Non-structured) information overload
  • Doubles every three months (Gardner)
  • Hampers knowledge management and business intelligence
• Translation bottleneck
Natural Language Understanding?
• Word meaning
  • Morphological analysis
  • Complex word interpretation
  • Word sense disambiguation
• Sentence meaning
  • Syntactic structure (parsing)
  • Sentence interpretation
• Discourse meaning
• World knowledge
  • Frames, scenarios, grounding, intentions, …
Fremdzugehen External train marriages
The box is in the pen
I eat a pizza with extra cheese
I eat a pizza with a fork
I eat a pizza with my daughter
The mayors didn’t want the students to strike because they feared violence
The mayors didn’t want the students to strike because they preached the revolution
State of the Art
• Robust, efficient, accurate, unrestricted language understanding will not be available for a long time
  • AI-complete problem
• Alternative: text mining, the automatic extraction of reusable knowledge from text, based on linguistic analysis of the text
• Approach
  • Text analysis tools (shallow instead of deep understanding)
  • Robust / efficient / accurate
  • Text mining applications: question answering, summarization, ontology extraction, information extraction, text categorization
  • For embedding in end-user applications related to knowledge search / management / discovery / communication
Examples
• Application areas:
  • Data mining (KDD) from unstructured and semi-structured data
  • (Corporate) knowledge management
  • “Intelligence”
• Example applications:
  • Email routing and filtering (spam filtering)
  • Finding protein interactions in biomedical text
  • Brokering
    • Matching on-line resumes and vacancies
    • Buying and selling property
    • …
Text Data Mining (Discovery)
• Find relevant information
  • Information extraction
  • Text categorization
• Analyze the text
  • Text mining
• Discover new information
  • Integrate different sources
  • Data mining
Don Swanson 1981: medical hypothesis generation
• stress is associated with migraines
• stress can lead to loss of magnesium
• calcium channel blockers prevent some migraines
• magnesium is a natural calcium channel blocker
• spreading cortical depression (SCD) is implicated in some migraines
• high levels of magnesium inhibit SCD
• migraine patients have high platelet aggregability
• magnesium can suppress platelet aggregability
• …
Magnesium deficiency implicated in migraine (?)
Text analysis output
CNTS text analysis tools: MBSP
• Flexible and adaptable
• Dutch and English
• State-of-the-art accuracy and efficiency
  • ~ 90% sentences / ~ 1000 words/sec
• Configurable combination of linguistic modules
• Modules developed using machine learning
  • TiMBL
• Adaptation through re-training and semi-supervised learning
• Client-server set-up
[Pipeline diagram: Text → Tokenisation → POS tagging → NP chunking → NER → Relation finding]
CNTS shallow understanding
Insulatard is an isophane insulin suspension (NPH).
Tokenisation:
Insulatard | is | an | isophane | insulin | suspension | ( | NPH | ) | .
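A minimal sketch of the tokenisation step, assuming a simple regular-expression tokeniser. This is not the MBSP tokeniser; the pattern and the function name `tokenise` are illustrative assumptions that only show how punctuation and brackets are split off from words.

```python
import re

# Assumed pattern: a run of word characters, or any single character
# that is neither a word character nor whitespace (punctuation, brackets).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenise(text):
    """Split text into word tokens and single-character punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenise("Insulatard is an isophane insulin suspension (NPH).")
print(tokens)
```

On the example sentence this yields ten tokens, with the brackets and the final period as separate tokens.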
POS tagging:
Insulatard/NNP is/VBZ an/DT isophane/JJ insulin/NN suspension/NN (/Punc NPH/NNP )/Punc ./Punc
NP chunking:
[NP Insulatard] [VP is] [NP an isophane insulin suspension ( NPH )]
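A rough sketch of chunking on top of POS tags. The MBSP modules are machine-learned (TiMBL); the hand-written rule below is a swapped-in simplification, and the tag sets `NP_TAGS` and `VP_TAGS` are assumptions. It groups adjacent tokens whose tags belong to the same class.

```python
# Assumed tag classes; a trained chunker would learn these boundaries.
NP_TAGS = {"DT", "JJ", "NN", "NNS", "NNP"}
VP_TAGS = {"VB", "VBZ", "VBD", "VBN", "VBG", "VBP"}


def chunk(tagged):
    """Group adjacent (word, tag) pairs into NP / VP chunks; other tokens stay alone."""
    chunks = []
    for word, tag in tagged:
        label = "NP" if tag in NP_TAGS else "VP" if tag in VP_TAGS else None
        if label is not None and chunks and chunks[-1][0] == label:
            chunks[-1][1].append(word)      # extend the current chunk
        else:
            chunks.append((label, [word]))  # start a new chunk
    return chunks


tagged = [
    ("Insulatard", "NNP"), ("is", "VBZ"), ("an", "DT"), ("isophane", "JJ"),
    ("insulin", "NN"), ("suspension", "NN"), ("(", "Punc"), ("NPH", "NNP"),
    (")", "Punc"), (".", "Punc"),
]
for label, words in chunk(tagged):
    print(label, " ".join(words))
```

Unlike the slide output, this rule keeps the bracketed abbreviation as a separate NP, and it would wrongly merge two noun phrases that happen to be adjacent. Both limitations are reasons to prefer a trained chunker.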
NER:
Insulatard = Medicine name
NPH = Hormone
Relation finding:
[SBJ Insulatard] is [PREDC an isophane insulin suspension ( NPH )]
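A heuristic sketch of relation finding over chunks, not the trained MBSP relation finder. It assumes the NP directly before a VP is its subject and the NP directly after it is a predicative complement (after a form of "be") or an object. `COPULA_FORMS` and `find_relations` are hypothetical names.

```python
# Assumed list of copula forms used to choose between PREDC and OBJ.
COPULA_FORMS = {"be", "am", "is", "are", "was", "were", "been", "being"}


def find_relations(chunks):
    """Return (role, noun phrase, verb phrase) triples from adjacent chunks."""
    relations = []
    for i, (label, words) in enumerate(chunks):
        if label != "VP":
            continue
        verb = " ".join(words)
        if i > 0 and chunks[i - 1][0] == "NP":
            relations.append(("SBJ", " ".join(chunks[i - 1][1]), verb))
        if i + 1 < len(chunks) and chunks[i + 1][0] == "NP":
            role = "PREDC" if words[-1].lower() in COPULA_FORMS else "OBJ"
            relations.append((role, " ".join(chunks[i + 1][1]), verb))
    return relations


chunks = [
    ("NP", ["Insulatard"]),
    ("VP", ["is"]),
    ("NP", ["an", "isophane", "insulin", "suspension", "(", "NPH", ")"]),
]
print(find_relations(chunks))
```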
Application: Question Answering
• Give answer to question (document retrieval: find documents relevant to query)
• Who invented the telephone?
  • Alexander Graham Bell
• When was the telephone invented?
  • 1876
QA System: Shapaqa
• Parse question: When was the telephone invented?
  • Which slots are given?
    • Verb: invented
    • Object: telephone
  • Which slots are asked?
    • Temporal phrase linked to verb
• Document retrieval on internet with given slot keywords
• Parsing of sentences with all given slots
• Count most frequent entry found in asked slot (temporal phrase)
Shapaqa: example
• When was the telephone invented?
• Google: invented AND “the telephone”
  • produces 835 pages
  • 53 parsed sentences with both input slots and with a temporal phrase
• is through his interest in Deafness and fascination with acoustics that the telephone was invented in 1876 , with the intent of helping Deaf and hard of hearing
• The telephone was invented by Alexander Graham Bell in 1876
• When Alexander Graham Bell invented the telephone in 1876 , he hoped that these same electrical signals could
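The counting step can be sketched on the three sentences quoted above. Shapaqa finds the asked slot by parsing; here, as a loudly simplified assumption, the temporal phrase is approximated by the pattern "in <four-digit year>" and the given slots by plain substring matching. The names `rank_answers`, `GIVEN_SLOTS` and `TEMPORAL` are illustrative.

```python
import re
from collections import Counter

# Sentences as quoted on the slide (the first one is truncated in the source).
sentences = [
    "is through his interest in Deafness and fascination with acoustics "
    "that the telephone was invented in 1876 , with the intent of helping "
    "Deaf and hard of hearing",
    "The telephone was invented by Alexander Graham Bell in 1876",
    "When Alexander Graham Bell invented the telephone in 1876 , he hoped "
    "that these same electrical signals could",
]

GIVEN_SLOTS = ["invented", "telephone"]
# Assumption: a regex stands in for the parsed temporal phrase.
TEMPORAL = re.compile(r"\bin (\d{4})\b")


def rank_answers(sentences, given_slots):
    """Count candidate fillers of the asked slot in sentences with all given slots."""
    counts = Counter()
    for sentence in sentences:
        lowered = sentence.lower()
        if all(slot in lowered for slot in given_slots):
            counts.update(TEMPORAL.findall(sentence))
    return counts.most_common()


print(rank_answers(sentences, GIVEN_SLOTS))
```

Ranking by frequency is what makes the approach tolerant of noisy web text: a wrong filler has to recur often to outrank the correct one.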
Shapaqa: frequency ranking
• So when was the phone invented?
• Internet answer is noisy, but robust
  • 17: 1876
  • 3: 1874
  • 2: ago
  • 2: later
  • 1: Bell
  • …
• System was developed quickly
• Precision 76% (Google 31%)
• International competition (TREC): MRR 0.45
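For readers unfamiliar with the MRR figure: mean reciprocal rank averages 1/rank of the first correct answer over all questions, counting 0 when no correct answer is returned. A small sketch follows; the ranks are hypothetical and are not Shapaqa's TREC results.

```python
def mean_reciprocal_rank(first_correct_ranks):
    """Average of 1/rank; a rank of None means no correct answer was returned."""
    total = sum(1.0 / rank for rank in first_correct_ranks if rank is not None)
    return total / len(first_correct_ranks)


# Hypothetical ranks of the first correct answer for four questions.
ranks = [1, 2, None, 4]
mrr = mean_reciprocal_rank(ranks)
print(mrr)  # (1 + 0.5 + 0 + 0.25) / 4 = 0.4375
```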
Application: Biomedical text mining (EU project BioMinT)
[Diagram: Medline abstracts, Text Analysis, Linguistic / Semantic Features, IR, IE, Templates, Factoids]
(Partial) Factoids
The mouse lymphoma assay (MLA) utilizing the Tk gene is widely used to identify chemical mutagens.
[Diagram: the sentence split into the chunks "The mouse lymphoma assay MLA", "utilizing", "the Tk gene", "is widely used to identify", "chemical mutagens", with the labels S, O, O, DNA part, CELL-LINE]
<!DOCTYPE MBSP SYSTEM 'mbsp.dtd'>
<MBSP>
<S cnt="s1">
  <NP rel="SBJ" of="s1_1"> <W pos="DT">The</W> <W pos="NN" sem="cell_line">mouse</W> <W pos="NN" sem="cell_line">lymphoma</W> <W pos="NN">assay</W> </NP>
  <W pos="openparen">(</W>
  <NP> <W pos="NN" sem="cell_line">MLA</W> </NP>
  <W pos="closeparen">)</W>
  <VP id="s1_1"> <W pos="VBG">utilizing</W> </VP>
  <NP rel="OBJ" of="s1_1"> <W pos="DT">the</W> <W pos="NN" sem="DNA_part">Tk</W> <W pos="NN" sem="DNA_part">gene</W> </NP>
  <VP id="s1_2"> <W pos="VBZ">is</W> <W pos="RB">widely</W> <W pos="VBN">used</W> </VP>
  <VP id="s1_3"> <W pos="TO">to</W> <W pos="VB">identify</W> </VP>
  <NP rel="OBJ" of="s1_3"> <W pos="JJ">chemical</W> <W pos="NNS">mutagens</W> </NP>
  <W pos="period">.</W>
</S>
</MBSP>
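A short sketch of how such output could be read back in, using Python's standard `xml.etree.ElementTree`. The fragment is a trimmed copy of the XML above (DOCTYPE and wrapper omitted); the helper names `words` and `relations` are assumptions, not part of MBSP.

```python
import xml.etree.ElementTree as ET

# Trimmed fragment of the shallow parser output shown above.
FRAGMENT = """
<S cnt="s1">
  <NP rel="SBJ" of="s1_1"><W pos="DT">The</W> <W pos="NN" sem="cell_line">mouse</W>
    <W pos="NN" sem="cell_line">lymphoma</W> <W pos="NN">assay</W></NP>
  <VP id="s1_1"><W pos="VBG">utilizing</W></VP>
  <NP rel="OBJ" of="s1_1"><W pos="DT">the</W> <W pos="NN" sem="DNA_part">Tk</W>
    <W pos="NN" sem="DNA_part">gene</W></NP>
</S>
"""


def words(element):
    """Join the text of all W elements inside a chunk."""
    return " ".join(w.text for w in element.iter("W"))


def relations(sentence):
    """Return (relation, noun phrase, verb phrase) for every NP with a rel attribute."""
    verbs = {vp.get("id"): words(vp) for vp in sentence.iter("VP")}
    return [
        (np.get("rel"), words(np), verbs.get(np.get("of")))
        for np in sentence.iter("NP")
        if np.get("rel")
    ]


sentence = ET.fromstring(FRAGMENT.strip())
print(relations(sentence))
```

The `of` attribute on each NP points at the `id` of its verb chunk, so a dictionary lookup is enough to pair arguments with verbs.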
Extracted IEX Templates from shallow parser output
NP(<X protein>) contain NP(Y "domain")
  EVENT: contain
  PROTEIN: <protein>
  DOMAIN: “domainf”
NP(<X protein>) be associated with NP(Y “disease”)
  EVENT: associated_with
  PROTEIN: <protein>
  DISEASE: “head”
NP(<X protein>) regulate NP(Y)
  EVENT: regulate
  PROTEIN: <protein>
  Y:
(): to be extracted, <>: semantic constraint, "": lexical constraint
Jee-Hyub Kim (Geneva)
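As an illustration of how the third template, NP(<X protein>) regulate NP(Y), might be applied to a subject-verb-object triple from the shallow parser, here is a hedged sketch. The dictionary layout, the function `fill_regulate_template` and the example phrases "protein A" and "gene B" are all hypothetical; the actual extraction system is not described in the slides.

```python
def fill_regulate_template(subject, verb_lemma, obj):
    """Fill the 'regulate' template, or return None if a constraint fails.

    subject and obj are dicts with a 'text' key and an optional 'sem' key.
    """
    if verb_lemma != "regulate":
        return None
    if subject.get("sem") != "protein":  # <>: semantic constraint on X
        return None
    # (): both noun phrases are extracted; Y has no constraint.
    return {"EVENT": "regulate", "PROTEIN": subject["text"], "Y": obj["text"]}


# Hypothetical parsed triple.
subject = {"text": "protein A", "sem": "protein"}
obj = {"text": "gene B"}
print(fill_regulate_template(subject, "regulate", obj))
```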
Application: Ontology Extraction
• Clustering of head nouns of Subject-Verb and Verb-Object relations
• Combine with pattern matching and heuristics
• Case study: Medline (4 million words, hepatitis), SwissProt corpus
• Results:
  • Better clusters with shallow parsing
  • Useful in knowledge management, thesaurus development, …
Ontobasis (IWT)
Example (SwissProt corpus)
gene | show | significant homology
amino_acid_sequence | have/indicate/lack/reveal/show | homology
protein | show | homology, immunoreactivity, reactivity, sequence similarity
protein | inhibit | catalytic activity, apoptosis, protein synthesis...
protein | exhibit | significant homology
protein | bind | copper, ubiquitin
protein | correspond | isoelectric point
induction | requires | protein synthesis
Edman degradation | of | intact protein
regulatory subunit | of | cAMP-dependent protein kinase
…
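One way to picture clustering of head nouns by their verb contexts is a set-overlap comparison. The sketch below assumes Jaccard similarity over the verbs each subject noun occurs with, taken from the rows above; the slides do not state which similarity measure or clustering method was actually used.

```python
from itertools import combinations

# Subject nouns and the verbs they occur with in the example rows.
subject_verbs = {
    "gene": {"show"},
    "amino_acid_sequence": {"have", "indicate", "lack", "reveal", "show"},
    "protein": {"show", "inhibit", "exhibit", "bind", "correspond"},
    "induction": {"requires"},
}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)


def similar_pairs(contexts, threshold):
    """Return noun pairs whose verb sets overlap at least `threshold`."""
    pairs = []
    for x, y in combinations(sorted(contexts), 2):
        score = jaccard(contexts[x], contexts[y])
        if score >= threshold:
            pairs.append((x, y, score))
    return pairs


# Assumed threshold, chosen only for illustration.
for x, y, score in similar_pairs(subject_verbs, 0.15):
    print(x, y, round(score, 2))
```

With so few rows the overlap rests on a single shared verb; on a corpus of millions of words the verb sets are large enough for such scores to separate noun classes.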
[Graph: nodes hepatitis, disease, infection, HBV, cirrhosis, liver, immunization, antibody, vaccination, culture, antisera; edge labels related_to, sim, produced by, prevented by]
Further development
• Semantic roles
• Faster adaptation to new domains
  • Domain semantics (NER / concept tagging)
  • Active learning / semi-supervised learning
• More analytic power
  • Negation, modality, quantification
  • Limited event and scenario recognition
Conclusions
Text Mining tasks benefit from text analysis
Understanding can be formulated as a flexible heterarchy of classifiers
These classifiers can be trained / adapted on annotated corpora and can eventually approximate deep understanding