Page 1: Text Mining Walter Daelemans CNTS Department of Linguistics University of Antwerp walter.daelemans@ua.ac.be

Text Mining

Walter Daelemans
CNTS, Department of Linguistics
University of Antwerp

walter.daelemans@ua.ac.be

Page 2

Centre for Dutch Language and Speech (CNTS)
  • Part of the Department of Linguistics, University of Antwerp
  • Staff: 2 tenured + 10-15 with temporary funding from EU, IWT, FWO, NTU, language industry, BOF, …
  • Topics
    • Corpus linguistics (mainly Dutch)
    • Child language acquisition / computational psycholinguistics
    • Language technology
      • machine learning of language
      • shallow parsing
      • text mining

Page 3

Information Overload
  • Language is the most natural and most used knowledge representation formalism
  • Non-structured or weakly structured information
    • Text
    • Databases with text fields
    • Web pages, e-mail messages, blogs, chat, …
  • (Non-structured) information overload
    • Doubles every three months (Gardner)
    • Hampers knowledge management and business intelligence
    • Translation bottleneck

Page 4

Natural Language Understanding?
  • Word meaning
    • Morphological analysis
    • Complex word interpretation
    • Word sense disambiguation
  • Sentence meaning
    • Syntactic structure (parsing)
    • Sentence interpretation
  • Discourse meaning
    • World knowledge: frames, scenarios, grounding, intentions, …

Examples
  • Fremdzugehen: "external train marriages" (German for "to be unfaithful", mis-segmented as Fremd + Zug + Ehen)
  • The box is in the pen
  • I eat a pizza with extra cheese
  • I eat a pizza with a fork
  • I eat a pizza with my daughter
  • The mayors didn’t want the students to strike because they feared violence
  • The mayors didn’t want the students to strike because they preached the revolution

Page 5

State of the Art
  • Robust, efficient, accurate, unrestricted language understanding will not be available for a long time
    • AI-complete problem
  • Alternative: text mining
    • Automatic extraction of reusable knowledge from text, based on linguistic analysis of the text

Page 6

Approach
  • Text analysis tools (shallow instead of deep understanding)
    • Robust / efficient / accurate
  • Text mining applications
    • Question answering
    • Summarization
    • Ontology extraction
    • Information extraction
    • Text categorization
  • For embedding in end-user applications related to knowledge search / management / discovery / communication

Page 7

Examples
  • Application areas:
    • Data mining (KDD) from unstructured and semi-structured data
    • (Corporate) knowledge management
    • “Intelligence”
  • Example applications:
    • E-mail routing and filtering (spam filtering)
    • Finding protein interactions in biomedical text
    • Brokering
      • Matching on-line resumes and vacancies
      • Buying and selling property
      • …

Page 8

Text Data Mining (Discovery)
  • Find relevant information
    • Information extraction
    • Text categorization
  • Analyze the text
    • Text mining
  • Discover new information
    • Integrate different sources
    • Data mining

Page 9

Don Swanson 1981: medical hypothesis generation
  • stress is associated with migraines
  • stress can lead to loss of magnesium
  • calcium channel blockers prevent some migraines
  • magnesium is a natural calcium channel blocker
  • spreading cortical depression (SCD) is implicated in some migraines
  • high levels of magnesium inhibit SCD
  • migraine patients have high platelet aggregability
  • magnesium can suppress platelet aggregability
  • …

Magnesium deficiency implicated in migraine (?)

Text analysis output
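The statements on this slide connect magnesium and migraine only indirectly, through shared intermediate concepts. A minimal sketch of that linking step follows, assuming text analysis has already reduced each statement to a pair of concepts. The pair encoding, the concept labels, and the function name `linking_terms` are illustrative assumptions, not Swanson's actual procedure.

```python
# Hypothetical encoding of the slide's statements as (concept, related concept) pairs.
MAGNESIUM_FACTS = {
    ("magnesium", "calcium channel blocking"),
    ("magnesium", "spreading cortical depression"),
    ("magnesium", "platelet aggregability"),
}
MIGRAINE_FACTS = {
    ("calcium channel blocking", "migraine"),
    ("spreading cortical depression", "migraine"),
    ("platelet aggregability", "migraine"),
    ("stress", "migraine"),
}


def linking_terms(a_facts, c_facts, a, c):
    """Return the intermediate terms B that occur in an A-B pair and in a B-C pair."""
    from_a = {b for (x, b) in a_facts if x == a}
    to_c = {b for (b, y) in c_facts if y == c}
    return sorted(from_a & to_c)


links = linking_terms(MAGNESIUM_FACTS, MIGRAINE_FACTS, "magnesium", "migraine")
print(links)
```

Each shared term is one piece of indirect evidence for the candidate hypothesis; "stress" is not returned here because the sketch has no magnesium-to-stress pair.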

Page 10

CNTS text analysis tools
  • MBSP
    • Flexible and adaptable
    • Dutch and English
    • State of the Art accuracy and efficiency
      • ~ 90% sentences / ~ 1000 words/sec
    • Configurable combination of linguistic modules
    • Modules developed using Machine Learning
      • TiMBL
    • Adaptation through re-training and semi-supervised learning
    • Client-server set-up

Page 11

CNTS shallow understanding

Text → Tokenisation → POS tagging → NP chunking → NER → Relation finding

Page 12

Text → Tokenisation → POS tagging → NP chunking → NER → Relation finding

Insulatard is an isophane insulin suspension (NPH).

Page 13

Text → Tokenisation → POS tagging → NP chunking → NER → Relation finding

Insulatard is an isophane insulin suspension (NPH).

Insulatard is an isophane insulin suspension ( NPH ) .
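The tokenisation step separates words from attached punctuation. The sketch below reproduces the token split for the example sentence with a single regular expression. It is a hand-written rule, assumed here for illustration only; it is not the tokeniser used in MBSP.

```python
import re

# A token is either a run of word characters (with internal hyphens)
# or any single character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def tokenise(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


tokens = tokenise("Insulatard is an isophane insulin suspension (NPH).")
print(" ".join(tokens))
```

A rule this simple also splits the full stop off abbreviations and separates decimal numbers at the point, which is one reason real tokenisers need more than one pattern.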

Page 14

Text → Tokenisation → POS tagging → NP chunking → NER → Relation finding

Insulatard is an isophane insulin suspension (NPH).

Insulatard  NNP
is          VBZ
an          DT
isophane    JJ
insulin     NN
suspension  NN
(           Punc
NPH         NNP
)           Punc
.           Punc

Page 15

Text → Tokenisation → POS tagging → NP chunking → NER → Relation finding

Insulatard is an isophane insulin suspension (NPH).

[NP Insulatard] [VP is] [NP an isophane insulin suspension ( NPH )]
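Chunking groups tagged tokens into non-overlapping phrases. As a rough illustration, the sketch below builds chunks from the part-of-speech tags with two hand-written patterns (determiner, adjectives, nouns for NP; a run of verb tags for VP). This is an assumption made for the example: the MBSP modules are machine-learned, and unlike the slide this sketch leaves the parenthesised "NPH" as a separate NP.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def chunk(tagged):
    """Group (word, tag) pairs into NP and VP chunks; other tokens get label None."""
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        # NP pattern: optional determiner, any adjectives, then at least one noun.
        if tagged[j][1] == "DT":
            j += 1
        while j < len(tagged) and tagged[j][1] == "JJ":
            j += 1
        k = j
        while k < len(tagged) and tagged[k][1] in NOUN_TAGS:
            k += 1
        if k > j:  # at least one noun was found, so emit an NP
            chunks.append(("NP", [w for w, _ in tagged[i:k]]))
            i = k
        elif tagged[i][1].startswith("VB"):
            k = i
            while k < len(tagged) and tagged[k][1].startswith("VB"):
                k += 1
            chunks.append(("VP", [w for w, _ in tagged[i:k]]))
            i = k
        else:
            chunks.append((None, [tagged[i][0]]))
            i += 1
    return chunks


tagged = [("Insulatard", "NNP"), ("is", "VBZ"), ("an", "DT"), ("isophane", "JJ"),
          ("insulin", "NN"), ("suspension", "NN"), ("(", "Punc"), ("NPH", "NNP"),
          (")", "Punc"), (".", "Punc")]
chunks = chunk(tagged)
for label, words in chunks:
    print(label, " ".join(words))
```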

Page 16

Text → Tokenisation → POS tagging → NP chunking → NER → Relation finding

Insulatard is an isophane insulin suspension (NPH).

Insulatard = Medicine name
NPH = Hormone

Page 17

Text → Tokenisation → POS tagging → NP chunking → NER → Relation finding

Insulatard is an isophane insulin suspension (NPH).

[SBJ Insulatard] is [PREDC an isophane insulin suspension ( NPH )]

Page 18

Application: Question Answering
  • Give an answer to a question (document retrieval: find documents relevant to a query)
  • Who invented the telephone?
    • Alexander Graham Bell
  • When was the telephone invented?
    • 1876

Page 19

QA System: Shapaqa
  • Parse question: When was the telephone invented?
    • Which slots are given?
      • Verb: invented
      • Object: telephone
    • Which slots are asked?
      • Temporal phrase linked to verb
  • Document retrieval on internet with given slot keywords
  • Parsing of sentences with all given slots
  • Count most frequent entry found in asked slot (temporal phrase)

Page 20

Shapaqa: example
  • When was the telephone invented?
  • Google: invented AND “the telephone”
    • produces 835 pages
    • 53 parsed sentences with both input slots and with a temporal phrase

is through his interest in Deafness and fascination with acoustics that the telephone was invented in 1876 , with the intent of helping Deaf and hard of hearing

The telephone was invented by Alexander Graham Bell in 1876

When Alexander Graham Bell invented the telephone in 1876 , he hoped that these same electrical signals could

Page 21

Shapaqa: frequency ranking
  • So when was the phone invented?
  • Internet answer is noisy, but robust
    • 17: 1876
    • 3: 1874
    • 2: ago
    • 2: later
    • 1: Bell
    • …
  • System was developed quickly
  • Precision 76% (Google 31%)
  • International competition (TREC): MRR 0.45
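The ranking step amounts to counting how often each filler appears in the asked slot and returning the most frequent one. The sketch below shows that counting on the counts listed on the slide, together with the reciprocal rank of an answer (MRR is the mean of this value over all test questions). The function names are illustrative assumptions, not Shapaqa's code.

```python
from collections import Counter


def rank_answers(candidates):
    """Order candidate answers by how often they were found in the asked slot."""
    return Counter(candidates).most_common()


def reciprocal_rank(ranked, correct):
    """Return 1/rank of the correct answer in the ranking, or 0.0 if it is absent."""
    for rank, (answer, _count) in enumerate(ranked, start=1):
        if answer == correct:
            return 1.0 / rank
    return 0.0


# Counts taken from the slide; the elided tail of the list is not reproduced.
found = ["1876"] * 17 + ["1874"] * 3 + ["ago"] * 2 + ["later"] * 2 + ["Bell"]
ranked = rank_answers(found)
best, count = ranked[0]
print(best, count, reciprocal_rank(ranked, "1876"))
```

Frequency counting is what makes the noisy web answers usable: wrong fillers such as "ago" or "Bell" do get extracted, but they are outvoted by the repeated correct date.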

Page 22

Application: Biomedical text mining (EU project BioMinT)

[Diagram labels: Medline abstracts, IR, IE, Text Analysis, Linguistic / Semantic Features, Templates, Factoids]

Page 23

(Partial) Factoids

The mouse lymphoma assay (MLA) utilizing the Tk gene is widely used to identify chemical mutagens.

[Diagram: the sentence split into the chunks "The mouse lymphoma assay", "MLA", "utilizing", "the Tk gene", "is widely used to identify", "chemical mutagens", annotated with the relation labels S, O, O and the entity labels DNA part and CELL-LINE]

Page 24

<!DOCTYPE MBSP SYSTEM 'mbsp.dtd'>
<MBSP>
<S cnt="s1">
  <NP rel="SBJ" of="s1_1">
    <W pos="DT">The</W>
    <W pos="NN" sem="cell_line">mouse</W>
    <W pos="NN" sem="cell_line">lymphoma</W>
    <W pos="NN">assay</W>
  </NP>
  <W pos="openparen">(</W>
  <NP>
    <W pos="NN" sem="cell_line">MLA</W>
  </NP>
  <W pos="closeparen">)</W>
  <VP id="s1_1">
    <W pos="VBG">utilizing</W>
  </VP>
  <NP rel="OBJ" of="s1_1">
    <W pos="DT">the</W>
    <W pos="NN" sem="DNA_part">Tk</W>
    <W pos="NN" sem="DNA_part">gene</W>
  </NP>
  <VP id="s1_2">
    <W pos="VBZ">is</W>
    <W pos="RB">widely</W>
    <W pos="VBN">used</W>
  </VP>
  <VP id="s1_3">
    <W pos="TO">to</W>
    <W pos="VB">identify</W>
  </VP>
  <NP rel="OBJ" of="s1_3">
    <W pos="JJ">chemical</W>
    <W pos="NNS">mutagens</W>
  </NP>
  <W pos="period">.</W>
</S>
</MBSP>
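Output in this form can be read with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` to list each NP that carries a `rel` attribute together with the VP it points to through `of`. The sample is a shortened fragment in the style of the slide (DOCTYPE omitted), and the helper names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened fragment in the style of the slide's parser output.
SAMPLE = """
<MBSP><S cnt="s1">
  <NP rel="SBJ" of="s1_1"><W pos="DT">The</W><W pos="NN" sem="cell_line">mouse</W>
    <W pos="NN" sem="cell_line">lymphoma</W><W pos="NN">assay</W></NP>
  <VP id="s1_1"><W pos="VBG">utilizing</W></VP>
  <NP rel="OBJ" of="s1_1"><W pos="DT">the</W><W pos="NN" sem="DNA_part">Tk</W>
    <W pos="NN" sem="DNA_part">gene</W></NP>
</S></MBSP>
"""


def words(elem):
    """Join the text of all W elements below elem."""
    return " ".join(w.text for w in elem.iter("W"))


def relations(xml_text):
    """Return (relation, argument phrase, verb phrase) for each NP linked to a VP."""
    root = ET.fromstring(xml_text.strip())
    verbs = {vp.get("id"): words(vp) for vp in root.iter("VP")}
    return [(np.get("rel"), words(np), verbs.get(np.get("of")))
            for np in root.iter("NP") if np.get("rel")]


rels = relations(SAMPLE)
for rel in rels:
    print(rel)
```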

Page 25

Extracted IEX Templates from shallow parser output

NP(<X protein>) contain NP(Y "domain")
  EVENT: contain
  PROTEIN: <protein>
  DOMAIN: “domainf”

NP(<X protein>) be associated with NP(Y “disease”)
  EVENT: associated_with
  PROTEIN: <protein>
  DISEASE: “head”

NP(<X protein>) regulate NP(Y)
  EVENT: regulate
  PROTEIN: <protein>
  Y:

(): to be extracted, <>: semantic constraint, "": lexical constraint

Jee-Hyub Kim (Geneva)
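A template of this kind can be applied by scanning the chunk sequence for an NP, a VP, and an NP in a row and checking the semantic and lexical constraints. The sketch below is one possible reading of the notation, not the extraction system itself: the chunk tuple layout, the function `match_template`, and the example sentence are all invented for illustration.

```python
# A chunk is (label, text, lemma, semantic_class). For an NP the lemma is the head
# noun, for a VP it is the main verb. All values below are made up for the example.
def match_template(chunks, verb_lemma, x_sem, y_word=None):
    """Match NP(<x_sem>) verb_lemma NP(Y), optionally requiring y_word inside Y."""
    hits = []
    triples = zip(chunks, chunks[1:], chunks[2:])
    for (l1, t1, _, s1), (l2, _, v, _), (l3, t3, _, _) in triples:
        if (l1, l2, l3) != ("NP", "VP", "NP"):
            continue
        if s1 != x_sem or v != verb_lemma:  # semantic and verb constraints
            continue
        if y_word is not None and y_word not in t3.split():  # lexical constraint
            continue
        hits.append({"EVENT": verb_lemma, x_sem.upper(): t1, "Y": t3})
    return hits


chunks = [
    ("NP", "The XYZ protein", "protein", "protein"),
    ("VP", "contains", "contain", None),
    ("NP", "a kinase domain", "domain", None),
]
hits = match_template(chunks, "contain", "protein", y_word="domain")
print(hits)
```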

Page 26

Application: Ontology Extraction
  • Clustering of head nouns of Subject-Verb and Verb-Object relations
  • Combine with pattern matching and heuristics
  • Case study: Medline (4 million words on hepatitis), SwissProt corpus
  • Results:
    • Better clusters with shallow parsing
    • Useful in knowledge management, thesaurus development, …
  • Ontobasis (IWT)

Page 27

Example (SwissProt corpus)

gene | show | significant homology
amino_acid_sequence | have/indicate/lack/reveal/show | homology
protein | show | homology, immunoreactivity, reactivity, sequence similarity
protein | inhibit | catalytic activity, apoptosis, protein synthesis...
protein | exhibit | significant homology
protein | bind | copper, ubiquitin
protein | correspond | isoelectric point
induction | requires | protein synthesis
Edman degradation | of | intact protein
regulatory subunit | of | cAMP-dependent protein kinase
…
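Clustering head nouns relies on the idea that nouns occurring with the same verbs are semantically related. The sketch below computes a Jaccard similarity between subject nouns from the verbs they occur with, using noun-verb pairs as far as they can be read off the example rows. Jaccard overlap is an assumption chosen for brevity; the slides do not say which similarity measure or clustering algorithm was used.

```python
from itertools import combinations

# (head noun, verb) pairs read off the example rows of the slide.
PAIRS = [
    ("gene", "show"),
    ("amino_acid_sequence", "have"), ("amino_acid_sequence", "indicate"),
    ("amino_acid_sequence", "lack"), ("amino_acid_sequence", "reveal"),
    ("amino_acid_sequence", "show"),
    ("protein", "show"), ("protein", "inhibit"), ("protein", "exhibit"),
    ("protein", "bind"), ("protein", "correspond"),
]


def contexts(pairs):
    """Map each noun to the set of verbs it occurs with."""
    table = {}
    for noun, verb in pairs:
        table.setdefault(noun, set()).add(verb)
    return table


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    return len(a & b) / len(a | b)


ctx = contexts(PAIRS)
sims = {(m, n): jaccard(ctx[m], ctx[n]) for m, n in combinations(sorted(ctx), 2)}
for pair, score in sims.items():
    print(pair, round(score, 2))
```

With so few rows every pair overlaps only on "show"; on a corpus of millions of words the context sets become large enough for the scores to separate related from unrelated nouns.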

Page 28

[Diagram: concept graph around "hepatitis". Nodes: disease, infection, HBV, cirrhosis, liver, immunization, antibody, vaccination, culture, antisera. Edge labels: related_to, sim, produced by, prevented by]

Page 29

Further development
  • Semantic roles
  • Faster adaptation to new domains
    • Domain semantics (NER / concept tagging)
    • Active learning / semi-supervised learning
  • More analytic power
    • Negation, modality, quantification
    • Limited event and scenario recognition

Page 30

Conclusions
  • Text mining tasks benefit from text analysis
  • Understanding can be formulated as a flexible heterarchy of classifiers
  • These classifiers can be trained / adapted on annotated corpora and can eventually approximate deep understanding

Page 31

Questions?

Walter Daelemans
A1.10 Campus Drie Eiken (September: Stadscampus)
walter.daelemans@ua.ac.be