

Text Mining

Walter Daelemans
CNTS, Department of Linguistics
University of Antwerp

walter.daelemans@ua.ac.be

Centre for Dutch Language and Speech (CNTS)
• Part of the Department of Linguistics, University of Antwerp
• Staff: 2 tenured + 10-15 with temporary funding from EU, IWT, FWO, NTU, language industry, BOF, …
• Topics
  • Corpus linguistics (mainly Dutch)
  • Child language acquisition / computational psycholinguistics
  • Language technology
    • machine learning of language
    • shallow parsing
    • text mining

Information Overload
• Language is the most natural and most used knowledge representation formalism
• Non-structured or weakly structured information
  • Text
  • Databases with text fields
  • Web pages, e-mail messages, blogs, chat, …
• (Non-structured) information overload
  • Doubles every three months (Gardner)
  • Hampers knowledge management and business intelligence
• Translation bottleneck

Natural Language Understanding?
• Word meaning
  • Morphological analysis
  • Complex word interpretation
  • Word sense disambiguation
• Sentence meaning
  • Syntactic structure (parsing)
  • Sentence interpretation
• Discourse meaning
• World knowledge
  • Frames, scenarios, grounding, intentions, …

Fremdzugehen → "External train marriages" (compound wrongly split as fremd + Zug + Ehen)

The box is in the pen

I eat a pizza with extra cheese

I eat a pizza with a fork

I eat a pizza with my daughter

The mayors didn’t want the students to strike because they feared violence

The mayors didn’t want the students to strike because they preached the revolution

State of the Art
• Robust, efficient, accurate, unrestricted language understanding will not be available for a long time
  • AI-complete problem
• Alternative: text mining
  • Automatic extraction of reusable knowledge from text, based on linguistic analysis of the text
• Approach
  • Text analysis tools (shallow instead of deep understanding)
    • Robust / efficient / accurate
  • Text mining applications
    • Question answering
    • Summarization
    • Ontology extraction
    • Information extraction
    • Text categorization
  • For embedding in end-user applications related to knowledge search / management / discovery / communication

Examples
• Application areas:
  • Data mining (KDD) from unstructured and semi-structured data
  • (Corporate) knowledge management
  • “Intelligence”
• Example applications:
  • E-mail routing and filtering (spam filtering)
  • Finding protein interactions in biomedical text
  • Brokering
    • Matching on-line resumes and vacancies
    • Buying and selling property
    • …

Text Data Mining (Discovery)
• Find relevant information
  • Information extraction
  • Text categorization
• Analyze the text
  • Text mining
• Discover new information
  • Integrate different sources
  • Data mining

Don Swanson 1981: medical hypothesis generation

• stress is associated with migraines
• stress can lead to loss of magnesium
• calcium channel blockers prevent some migraines
• magnesium is a natural calcium channel blocker
• spreading cortical depression (SCD) is implicated in some migraines
• high levels of magnesium inhibit SCD
• migraine patients have high platelet aggregability
• magnesium can suppress platelet aggregability
• …
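The chaining behind this example can be sketched in a few lines of Python. This is only an illustration: the facts are hand-encoded from the statements above, the relation names are invented for the sketch, and the function `linking_terms` is hypothetical rather than a reconstruction of Swanson's procedure. It looks for intermediate terms that are linked to both magnesium and migraine even though no single statement links the two directly.

```python
from collections import defaultdict

# Hand-encoded (subject, relation, object) facts paraphrasing the slide.
# The encoding and the relation names are assumptions made for this sketch.
FACTS = [
    ("stress", "associated_with", "migraine"),
    ("stress", "causes_loss_of", "magnesium"),
    ("calcium channel blockers", "prevent", "migraine"),
    ("magnesium", "is_a", "calcium channel blockers"),
    ("spreading cortical depression", "implicated_in", "migraine"),
    ("magnesium", "inhibits", "spreading cortical depression"),
    ("platelet aggregability", "high_in", "migraine"),
    ("magnesium", "suppresses", "platelet aggregability"),
]


def linking_terms(facts, a, c):
    """Return the intermediate terms that are linked to both a and c."""
    neighbours = defaultdict(set)
    for subj, _rel, obj in facts:
        # Treat every relation as an undirected association.
        neighbours[subj].add(obj)
        neighbours[obj].add(subj)
    return sorted((neighbours[a] & neighbours[c]) - {a, c})


links = linking_terms(FACTS, "magnesium", "migraine")
for term in links:
    print(f"magnesium -- {term} -- migraine")
```

Each printed chain is only a candidate hypothesis. Whether it is medically meaningful still has to be judged by an expert.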

Magnesium deficiency implicated in migraine (?)

Text analysis output

CNTS text analysis tools: MBSP
• Flexible and adaptable
• Dutch and English
• State-of-the-art accuracy and efficiency
  • ~ 90% sentences / ~ 1000 words/sec
• Configurable combination of linguistic modules
• Modules developed using machine learning
  • TiMBL
• Adaptation through re-training and semi-supervised learning
• Client-server set-up
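To give an idea of the memory-based learning behind these modules, here is a minimal nearest-neighbour sketch in Python. It is not TiMBL and does not use TiMBL's interface. It only illustrates the basic idea of storing training instances and classifying a new instance by its overlap with the stored ones, without the feature weighting and other refinements a real memory-based learner adds. The training instances are toy data made up for the example.

```python
from collections import Counter


def overlap_distance(x, y):
    """Number of feature positions in which two instances differ."""
    return sum(a != b for a, b in zip(x, y))


def classify(memory, instance, k=1):
    """Majority class among the k stored instances nearest to `instance`."""
    ranked = sorted(memory, key=lambda ex: overlap_distance(ex[0], instance))
    votes = Counter(label for _features, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy memory (hypothetical data):
# (previous word, focus word, next word) -> POS tag of the focus word.
MEMORY = [
    (("_", "the", "box"), "DT"),
    (("the", "box", "is"), "NN"),
    (("box", "is", "in"), "VBZ"),
    (("is", "in", "the"), "IN"),
    (("in", "the", "pen"), "DT"),
    (("the", "pen", "_"), "NN"),
]

# "pizza" was never seen, but its context matches a stored noun instance.
tag = classify(MEMORY, ("the", "pizza", "is"))
print(tag)
```

Because classification only needs the stored instances, adapting such a module to a new domain comes down to adding or replacing annotated examples and re-training.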

Text → Tokenisation → POS tagging → NP chunking → NER → Relation finding

CNTS shallow understanding


Insulatard is an isophane insulin suspension (NPH).

Tokenisation

Insulatard | is | an | isophane | insulin | suspension | ( | NPH | ) | .
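The tokenisation step can be approximated with a regular expression. The sketch below is an assumption made for illustration, not the MBSP tokeniser: the pattern `TOKEN_RE` and the function `tokenise` are hypothetical, and a real tokeniser also has to handle abbreviations, numbers and sentence boundaries.

```python
import re

# A word (with optional internal hyphens or apostrophes) or one punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def tokenise(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


tokens = tokenise("Insulatard is an isophane insulin suspension (NPH).")
print(tokens)
```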

POS tagging

Insulatard/NNP is/VBZ an/DT isophane/JJ insulin/NN suspension/NN (/Punc NPH/NNP )/Punc ./Punc

NP chunking

[NP Insulatard] [VP is] [NP an isophane insulin suspension ( NPH )]
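Chunking groups adjacent tagged tokens into phrases. The following Python sketch uses a hand-written rule over the POS tags of the example sentence. This is an assumption for illustration only: the CNTS chunker is a trained module, not a rule set, and the tag sets and the function `chunk` are hypothetical. Unlike the output above, this simple rule leaves the parenthesised "NPH" as a separate NP, and it would also merge two NPs that happen to be adjacent.

```python
NOUN_TAGS = {"NN", "NNS", "NNP"}
NP_TAGS = NOUN_TAGS | {"DT", "JJ"}
VERB_TAGS = {"VB", "VBZ", "VBD", "VBG", "VBN"}

# POS-tagged tokens of the example sentence.
TAGGED = [
    ("Insulatard", "NNP"), ("is", "VBZ"), ("an", "DT"), ("isophane", "JJ"),
    ("insulin", "NN"), ("suspension", "NN"), ("(", "Punc"), ("NPH", "NNP"),
    (")", "Punc"), (".", "Punc"),
]


def chunk(tagged):
    """Group adjacent tokens into NP and VP chunks; other tokens get label None."""
    chunks = []
    for word, tag in tagged:
        label = "NP" if tag in NP_TAGS else "VP" if tag in VERB_TAGS else None
        if chunks and label is not None and chunks[-1][0] == label:
            chunks[-1][1].append(word)  # extend the current chunk
        else:
            chunks.append((label, [word]))  # start a new chunk
    return chunks


chunks = chunk(TAGGED)
for label, words in chunks:
    print(label, " ".join(words))
```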

NER

Insulatard = Medicine name
NPH = Hormone

Relation finding

[SBJ Insulatard] is [PREDC an isophane insulin suspension ( NPH )]

Application: Question Answering
• Give an answer to a question (document retrieval: find documents relevant to a query)
  • Who invented the telephone? Alexander Graham Bell
  • When was the telephone invented? 1876

QA System: Shapaqa
• Parse question
  • When was the telephone invented?
  • Which slots are given?
    • Verb: invented
    • Object: telephone
  • Which slots are asked?
    • Temporal phrase linked to verb
• Document retrieval on the internet with the given slot keywords
• Parsing of sentences with all given slots
• Count most frequent entry found in the asked slot (temporal phrase)

Shapaqa: example
• When was the telephone invented?
• Google: invented AND “the telephone”
  • produces 835 pages
  • 53 parsed sentences with both input slots and with a temporal phrase

is through his interest in Deafness and fascination with acoustics that the telephone was invented in 1876 , with the intent of helping Deaf and hard of hearing

The telephone was invented by Alexander Graham Bell in 1876

When Alexander Graham Bell invented the telephone in 1876 , he hoped that these same electrical signals could
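The counting step can be sketched in Python on the three sentences above. This is a simplification under stated assumptions: the real system fills the slots from shallow-parser output, whereas the sketch checks the given slots by substring match and uses the hypothetical pattern `YEAR_RE` ("in" followed by four digits) as a crude stand-in for a temporal phrase.

```python
import re
from collections import Counter

# The three retrieved sentences shown above.
SENTENCES = [
    "is through his interest in Deafness and fascination with acoustics that "
    "the telephone was invented in 1876 , with the intent of helping Deaf "
    "and hard of hearing",
    "The telephone was invented by Alexander Graham Bell in 1876",
    "When Alexander Graham Bell invented the telephone in 1876 , he hoped "
    "that these same electrical signals could",
]

GIVEN_SLOTS = ("invented", "telephone")
# Assumption: a year after "in" stands in for the parsed temporal phrase.
YEAR_RE = re.compile(r"\bin (\d{4})\b")


def rank_answers(sentences, given_slots):
    """Count candidate answers in sentences that contain all given slots."""
    counts = Counter()
    for sentence in sentences:
        lowered = sentence.lower()
        if all(slot in lowered for slot in given_slots):
            counts.update(YEAR_RE.findall(sentence))
    return counts.most_common()


ranking = rank_answers(SENTENCES, GIVEN_SLOTS)
print(ranking)
```

The answer returned to the user is the top entry of the ranking. Wrong or irrelevant candidates can still appear, but they are outvoted as long as the correct answer is the most frequent one.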

Shapaqa: frequency ranking
• So when was the phone invented?
• Internet answer is noisy, but robust
  • 17: 1876
  • 3: 1874
  • 2: ago
  • 2: later
  • 1: Bell
  • …
• System was developed quickly
• Precision 76% (Google 31%)
• International competition (TREC): MRR 0.45
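MRR (mean reciprocal rank) rewards a system for ranking the correct answer high: each question scores 1/rank of the first correct answer, or 0 if no returned answer is correct, and the scores are averaged over all questions. A minimal Python sketch follows. The second question and its answers are hypothetical and not data from the talk.

```python
def mean_reciprocal_rank(ranked_answers, gold):
    """Average over questions of 1/rank of the first correct answer (0 if absent)."""
    total = 0.0
    for answers, correct in zip(ranked_answers, gold):
        for rank, answer in enumerate(answers, start=1):
            if answer == correct:
                total += 1.0 / rank
                break
    return total / len(gold)


# The first ranking reuses the list from the slide; the second is invented.
ranked = [
    ["1876", "1874", "ago", "later", "Bell"],
    ["Edison", "Alexander Graham Bell"],
]
gold = ["1876", "Alexander Graham Bell"]

mrr = mean_reciprocal_rank(ranked, gold)
print(mrr)  # (1/1 + 1/2) / 2
```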

Application: Biomedical text mining (EU project BioMinT)

[Figure: processing architecture with the components Medline abstracts, IR, Text Analysis, Linguistic / Semantic Features, IE, Templates, Factoids]

(Partial) Factoids

The mouse lymphoma assay (MLA) utilizing the Tk gene is widely used to identify chemical mutagens.

[S The mouse lymphoma assay (MLA)] [utilizing] [O the Tk gene] [is widely used to identify] [O chemical mutagens]

Semantic labels: CELL-LINE (mouse lymphoma, MLA), DNA part (Tk gene)

<!DOCTYPE MBSP SYSTEM 'mbsp.dtd'>
<MBSP>
<S cnt="s1">
  <NP rel="SBJ" of="s1_1">
    <W pos="DT">The</W>
    <W pos="NN" sem="cell_line">mouse</W>
    <W pos="NN" sem="cell_line">lymphoma</W>
    <W pos="NN">assay</W>
  </NP>
  <W pos="openparen">(</W>
  <NP>
    <W pos="NN" sem="cell_line">MLA</W>
  </NP>
  <W pos="closeparen">)</W>
  <VP id="s1_1">
    <W pos="VBG">utilizing</W>
  </VP>
  <NP rel="OBJ" of="s1_1">
    <W pos="DT">the</W>
    <W pos="NN" sem="DNA_part">Tk</W>
    <W pos="NN" sem="DNA_part">gene</W>
  </NP>
  <VP id="s1_2">
    <W pos="VBZ">is</W>
    <W pos="RB">widely</W>
    <W pos="VBN">used</W>
  </VP>
  <VP id="s1_3">
    <W pos="TO">to</W>
    <W pos="VB">identify</W>
  </VP>
  <NP rel="OBJ" of="s1_3">
    <W pos="JJ">chemical</W>
    <W pos="NNS">mutagens</W>
  </NP>
  <W pos="period">.</W>
</S>
</MBSP>
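In this output the `rel` and `of` attributes link each NP to the `id` of its verb chunk, so subject and object relations can be read off with a standard XML parser. The Python sketch below works on a trimmed copy of the XML (semantic attributes, the parenthetical and one VP left out to keep it short). The helper names `words` and `relations` are hypothetical and not part of MBSP.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the shallow-parser output shown above.
MBSP_XML = """
<MBSP><S cnt="s1">
  <NP rel="SBJ" of="s1_1"><W pos="DT">The</W><W pos="NN">mouse</W>
    <W pos="NN">lymphoma</W><W pos="NN">assay</W></NP>
  <VP id="s1_1"><W pos="VBG">utilizing</W></VP>
  <NP rel="OBJ" of="s1_1"><W pos="DT">the</W><W pos="NN">Tk</W>
    <W pos="NN">gene</W></NP>
  <VP id="s1_3"><W pos="TO">to</W><W pos="VB">identify</W></VP>
  <NP rel="OBJ" of="s1_3"><W pos="JJ">chemical</W>
    <W pos="NNS">mutagens</W></NP>
</S></MBSP>
"""


def words(chunk):
    """Join the words of a chunk element into one string."""
    return " ".join(w.text for w in chunk.iter("W"))


def relations(xml_text):
    """Return (relation, noun phrase, verb phrase) triples from the parse."""
    root = ET.fromstring(xml_text)
    verbs = {vp.get("id"): words(vp) for vp in root.iter("VP")}
    return [
        (noun_phrase.get("rel"), words(noun_phrase), verbs[noun_phrase.get("of")])
        for noun_phrase in root.iter("NP")
        if noun_phrase.get("rel")
    ]


triples = relations(MBSP_XML)
for triple in triples:
    print(triple)
```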

Extracted IEX templates from shallow parser output

NP(<X protein>) contain NP(Y "domain")
  EVENT: contain
  PROTEIN: <protein>
  DOMAIN: "domainf"

NP(<X protein>) be associated with NP(Y "disease")
  EVENT: associated_with
  PROTEIN: <protein>
  DISEASE: "head"

NP(<X protein>) regulate NP(Y)
  EVENT: regulate
  PROTEIN: <protein>
  Y:

(): to be extracted, <>: semantic constraint, "": lexical constraint

Jee-Hyub Kim (Geneva)
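How such a template is applied to shallow parser output can be sketched in Python for the pattern NP(<X protein>) regulate NP(Y). Everything in the sketch is an assumption made for illustration: the chunk dictionaries, the function `match_regulate` and the example input are hypothetical and do not reproduce the extraction system credited above.

```python
def match_regulate(chunks):
    """Apply the template NP(<X protein>) regulate NP(Y) to a chunk sequence."""
    events = []
    for left, verb, right in zip(chunks, chunks[1:], chunks[2:]):
        if (
            left["type"] == "NP"
            and left.get("sem") == "protein"       # semantic constraint <protein>
            and verb["type"] == "VP"
            and verb.get("lemma") == "regulate"    # verb of the template
            and right["type"] == "NP"              # Y is unconstrained
        ):
            events.append(
                {"EVENT": "regulate", "PROTEIN": left["text"], "Y": right["text"]}
            )
    return events


# Hypothetical chunked sentence: "ProteinA regulates the target gene".
CHUNKS = [
    {"type": "NP", "text": "ProteinA", "sem": "protein"},
    {"type": "VP", "text": "regulates", "lemma": "regulate"},
    {"type": "NP", "text": "the target gene"},
]

events = match_regulate(CHUNKS)
print(events)
```

The semantic constraint is what separates this from plain keyword matching: a sentence with "regulate" fills the template only if the subject NP was tagged as a protein.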

Application: Ontology Extraction
• Clustering of head nouns of Subject-Verb and Verb-Object relations
• Combine with pattern matching and heuristics
• Case study: Medline (4 million words, hepatitis), SwissProt corpus
• Results:
  • Better clusters with shallow parsing
  • Useful in knowledge management, thesaurus development, …
• Ontobasis (IWT)

Example (SwissProt corpus)

gene | show | significant homology
amino_acid_sequence | have/indicate/lack/reveal/show | homology
protein | show | homology, immunoreactivity, reactivity, sequence similarity
protein | inhibit | catalytic activity, apoptosis, protein synthesis...
protein | exhibit | significant homology
protein | bind | copper, ubiquitin
protein | correspond | isoelectric point
induction | requires | protein synthesis
Edman degradation | of | intact protein
regulatory subunit | of | cAMP-dependent protein kinase
…
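The idea of grouping head nouns by the verbs they occur with can be sketched in Python. The triples are a hand-picked subset of the example above, and Jaccard overlap of verb sets is a stand-in chosen for brevity, not necessarily the clustering method used in the project. The function names and the threshold are hypothetical.

```python
from itertools import combinations

# (subject head noun, verb, object) triples taken from the example above.
TRIPLES = [
    ("gene", "show", "significant homology"),
    ("amino_acid_sequence", "show", "homology"),
    ("amino_acid_sequence", "reveal", "homology"),
    ("protein", "show", "homology"),
    ("protein", "inhibit", "apoptosis"),
    ("protein", "bind", "copper"),
    ("induction", "require", "protein synthesis"),
]


def contexts(triples):
    """Map each subject noun to the set of verbs it occurs with."""
    ctx = {}
    for subj, verb, _obj in triples:
        ctx.setdefault(subj, set()).add(verb)
    return ctx


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)


def similar_pairs(triples, threshold=0.3):
    """Noun pairs whose verb contexts overlap at least `threshold`, best first."""
    ctx = contexts(triples)
    pairs = [
        (jaccard(ctx[x], ctx[y]), x, y) for x, y in combinations(sorted(ctx), 2)
    ]
    return sorted((p for p in pairs if p[0] >= threshold), reverse=True)


pairs = similar_pairs(TRIPLES)
for score, x, y in pairs:
    print(f"{x} ~ {y}: {score:.2f}")
```

Nouns that share verbs, such as gene and protein with "show", end up close together, while a noun like induction with no shared context stays apart.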

[Figure: extracted concept graph around "hepatitis". Nodes: disease, infection, HBV, cirrhosis, liver, immunization, antibody, vaccination, culture, antisera. Edge labels: related_to, sim, produced by, prevented by]

Further development
• Semantic roles
• Faster adaptation to new domains
  • Domain semantics (NER / concept tagging)
  • Active learning / semi-supervised learning
• More analytic power
  • Negation, modality, quantification
  • Limited event and scenario recognition

Conclusions

Text Mining tasks benefit from text analysis

Understanding can be formulated as a flexible heterarchy of classifiers

These classifiers can be trained / adapted on annotated corpora and can eventually approximate deep understanding

Questions?

Walter Daelemans

A1.10 Campus Drie Eiken (September: Stadscampus)

Walter.daelemans@ua.ac.be
