1
Unsupervised Natural Language Processing using Graph Models
The Structure Discovery Paradigm
Chris Biemann, University of Leipzig, Germany
Doctoral Consortium at HLT-NAACL 2007, Rochester, NY, USA
April 22, 2007
2
Outline

Review of traditional approaches
• Knowledge-intensive vs. knowledge-free
• Degrees of supervision
• Computational Linguistics vs. statistical NLP
A new approach
• The Structure Discovery Paradigm

Graph-based SD procedures
• Graph models for language processing
• Graph-based SD procedures
• Results in task-based evaluation
3
Knowledge-Intensive vs. Knowledge-Free
In traditional automated language processing, knowledge is involved whenever humans manually tell machines
• How to process language, by explicit knowledge
• How a task should be solved, by implicit knowledge
Knowledge can be provided by means of:
• Dictionaries, e.g. thesaurus, WordNet, ontologies, …
• (grammar) rules
• Annotation
4
Degrees of Supervision
Supervision means providing positive and negative training examples to Machine Learning algorithms, which use them as a basis for building a model that reproduces the classification on unseen data.
Degrees:
• Fully supervised (Classification): Learning is only carried out on fully labeled training set
• Semi-supervised: Unlabeled examples are also used for building a data model
• Weakly-supervised (Bootstrapping): A small set of labeled examples is grown and classifications are used for re-training
• Unsupervised (Clustering): No labeled examples are provided
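Not from the talk, but the two ends of this scale can be contrasted in a few lines of Python on toy 1-D data (function names and the naive k-means initialization are my own illustrative choices):

```python
def nearest_centroid_train(points, labels):
    """Fully supervised: class centroids come from labeled examples."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def kmeans_1d(points, k=2, iters=10):
    """Unsupervised: centroids emerge from the data alone (no labels)."""
    centroids = sorted(points)[:k]  # naive init: k smallest points
    for _ in range(iters):
        groups = {i: [] for i in range(k)}
        for x in points:
            nearest = min(range(k), key=lambda i: abs(x - centroids[i]))
            groups[nearest].append(x)
        centroids = [sum(g) / len(g) if g else centroids[i]
                     for i, g in groups.items()]
    return centroids
```

The supervised learner needs a label per point; the clusterer discovers the same two groups from the raw numbers, which is the situation SD operates in.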
5
Computational Linguistics and Statistical NLP
CL:
• Implementing linguistic theories with computers
• Rule-based approaches
• Rules found by introspection, not data-driven
• Explicit knowledge
• Goal: understanding language itself

Statistical NLP:
• Building systems that perform language processing tasks
• Machine Learning approaches
• Models are built by training on annotated datasets
• Implicit knowledge
• Goal: build robust systems with high performance
There is a continuum between the two rather than a sharp boundary.
6
Structure Discovery Paradigm

SD:
• Analyze raw data and identify regularities
• Statistical methods, clustering
• Knowledge-free, unsupervised
• Structures: as many as can be discovered
• Language-independent, domain-independent, encoding-independent
• Goal: Discover structure in language data and mark it in the data
7
Example: Discovered Structures
• Annotation on various levels
• Similar labels denote similar properties as found by the SD algorithms
• Similar structures in corpus are annotated in a similar way
“Increased interest rates lead to investments in banks.”
<sentence lang=12, subj=34.11>
  <chunk id=c25>
    <word POS=p3 m=0.0 s=s14>In=creas-ed</word>
    <MWU POS=p1 s=s33>
      <word POS=p1 m=5.1 s=s44>interest</word>
      <word POS=p1 m=2.12 s=s106>rate-s</word>
    </MWU>
  </chunk>
  <chunk id=c13>
    <MWU POS=p2>
      <word POS=p2 m=17.3 s=74>lead</word>
      <word POS=p117 m=11.98>to</word>
    </MWU>
  </chunk>
  <chunk id=c31>
    <word POS=p1 m=1.3 s=33>investment-s</word>
    <word POS=p118 m=11.36>in</word>
    <word POS=p1 m=1.12 s=33>bank-s</word>
  </chunk>
  <word POS=298> . </word>
</sentence>
8
Consequences of Working in SD
• Only input allowed is raw text data
• Machines are told how to algorithmically discover structure
• Self-annotation process by marking regularities in the data
• Structure Discovery process is iterated
[Diagram: Text Data → SD algorithm finds regularities by analysis → annotates data with regularities → further SD algorithms iterate on the enriched data]
9
Pros and Cons of Structure Discovery
Advantages:
• Cheap: only raw data needed
• Alleviation of the acquisition bottleneck
• Language- and domain-independent
• No data-resource mismatch (all resources leak)
Disadvantages:
• No control over self-annotation labels
• Congruence to linguistic concepts not guaranteed
• Much computing time needed
10
Building Blocks in SD
Hierarchical levels of basic units in text data:
• Letters
• Words
• Sentences
• Documents
These are assumed to be recognizable in the remainder.

SD allows for
• arbitrary numbers of intermediate levels
• grouping of basic units into complex units,
but these have to be found by SD procedures.
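The assumed hierarchy of basic units can be illustrated with a minimal Python sketch (the regex-based boundary rules are my own naive stand-ins; real boundary detection would itself be an SD task):

```python
import re

def units(text):
    """Decompose raw text into the basic-unit hierarchy assumed here:
    sentences -> words -> letters (one document = the input string)."""
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    words = [w for s in sentences for w in re.findall(r'\w+', s)]
    letters = [c for w in words for c in w]
    return sentences, words, letters
```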
11
Similarity and Homogeneity
For determining which units share structure, a similarity measure for units is needed. Two kinds of features are possible:
• Internal features: compare units based on the lower level units they contain
• Context features: compare units based on other units, of the same or another level, that surround them
A clustering based on unit similarity yields sets of units that are homogeneous w.r.t. structure
This is an abstraction process: Units are subsumed under the same label.
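A minimal sketch of context-feature similarity, assuming count-based context vectors and cosine similarity (the window size and whitespace tokenization are illustrative choices, not from the slides):

```python
from collections import Counter
from math import sqrt

def context_vectors(tokens, window=1):
    """Context features: represent each word type by counts of the
    words that surround its occurrences within the window."""
    vecs = {}
    for i, w in enumerate(tokens):
        ctx = vecs.setdefault(w, Counter())
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                ctx[tokens[j]] += 1
    return vecs

def cosine(a, b):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Here "cat" and "dog" come out maximally similar because they occur in identical contexts; clustering such a similarity graph yields the homogeneous sets described above.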
12
What is it good for? How do I know?
• Many structural regularities can be thought of, some are interesting, some are not.
• Structures discovered by SD algorithms will not necessarily match the concepts of linguists
• Working in the SD paradigm means over-generating structure acquisition methods and checking whether they are helpful
Methods for telling helpful from useless SD procedures:
• “Look at my nice clusters” approach: examine the data by hand. While good in the initial phase of testing, this is inconclusive: choice of clusters, coverage, …
• Task-based evaluation: use the obtained labels as features in a Machine Learning scenario and measure the contribution of each label type. Involves supervision, is indirect.
13
Graph models for SD procedures
Motivation for graph representation
• Graphs are an intuitive and natural way to encode language units as nodes and their similarities as edges, though other representations are possible
• Graph clustering can efficiently perform abstraction by grouping units into homogeneous sets with Chinese Whispers

Some graphs on basic units
• Word co-occurrence (neighbor/sentence), significance, higher orders
• Word context similarity based on local context vectors
• Sentence/document similarity on common words
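As an illustration of the sentence-based word co-occurrence graph, here is a sketch that scores edges with pointwise mutual information; the slides speak only of a "significance" measure, so PMI and the zero threshold are stand-in assumptions, not the measure actually used:

```python
from collections import Counter
from itertools import combinations
from math import log

def cooccurrence_graph(sentences, min_score=0.0):
    """Sentence-based word co-occurrence graph as an adjacency dict.
    Edge weight = PMI of the two words co-occurring in a sentence."""
    word_freq, pair_freq = Counter(), Counter()
    n = len(sentences)
    for sent in sentences:
        words = set(sent.split())          # each type counted once per sentence
        word_freq.update(words)
        pair_freq.update(frozenset(p) for p in combinations(sorted(words), 2))
    graph = {}
    for pair, f in pair_freq.items():
        a, b = sorted(pair)
        pmi = log((f / n) / ((word_freq[a] / n) * (word_freq[b] / n)))
        if pmi > min_score:                # keep only "significant" edges
            graph.setdefault(a, {})[b] = pmi
            graph.setdefault(b, {})[a] = pmi
    return graph
```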
14
Some graph-based SD procedures
• Language Separation:
  – Cluster the sentence-based significant word co-occurrence graph
  – Use word lists for language identification
• Induced POS:
  – Cluster the local stop-word context vector similarity graph
  – Cluster the second-order neighbor word co-occurrence graph
  – Train and apply a trigram tagger
• Word Sense Disambiguation:
  – Cluster the neighborhood of the target word in the sentence-based significant co-occurrence graph into sense clusters
  – Compare sense clusters with the local context for disambiguation
• Semantic classes:
  – Cluster the similarity graph of words and induced POS contexts
  – Use contexts for assigning semantic classes
15
“Look at my nice languages!” Cleaning CUCWeb
Latin:
In expeditionibus tessellata et sectilia pauimenta circumferebat.
Britanniam petiuit spe margaritarum: earum amplitudinem conferebat et interdum sua manu exigebat ..

Scripting:
@echo @cd $(TLSFDIR);$(CC) $(RTLFLAGS) $(RTL_LWIPFLAGS) -c $(TLSFSRC) …
@echo @cd $(TOOLSDIR);$(CC) $(RTLFLAGS) $(RTL_LWIPFLAGS) -c $(TOOLSSRC) ..

Hungarian:
A külügyminiszter a diplomáciai és konzuli képviseletek címjegyzékét és konzuli …
Köztestületek, jogi személyiséggel és helyi jogalkotási jogkörrel.

Esperanto:
Por vidi ghin kun internacia kodigho kaj kun kelkaj bildoj kliku tie chi ) La Hispana..
Ne nur pro tio, ke ghi perdigis la vivon de kelk-centmil hispanoj, sed ankau pro ghia efiko..

Human Genome:
1 atgacgatga gtacaaacaa ctgcgagagc atgacctcgt acttcaccaa ctcgtacatg
61 ggggcggaca tgcatcatgg gcactacccg ggcaacgggg tcaccgacct ggacgcccag
121 cagatgcacc …

Isoko (Nigeria):
(1) Ko Ileleikristi a re rowo ino Oghene yo Esanerovo?
(5) Ko Jesu o whu evao uruwhere?
16
Task-based unsuPOS evaluation

UnsuPOS tags are used as features; performance is compared to no POS and supervised POS. The tagger was induced in one CPU-day from the BNC.

• Kernel-based WSD: better than noPOS, equal to suPOS
• POS tagging: better than noPOS
• Named Entity Recognition: no significant differences
• Chunking: better than noPOS, worse than suPOS
17
Summary
• Structure Discovery Paradigm contrasted to traditional approaches:
  – no manual annotation, no resources (cheaper)
  – language- and domain-independent
  – iteratively enriching structural information by finding and annotating regularities
• Graph-based SD procedures
• Evaluation framework and results
19
Structure Discovery Machine I
From linguistics, we have the following intuitions that can lead to SD algorithms that capture the underlying structure:
• There are different languages
• Words belong to word classes
• Short sequences of words form multi-word units
• Words can be semantically decomposable (compounds)
• Words are subject to inflection
• Morphological congruence between words
• There are grammatical dependencies between words and sequences of words
• Words can have different semantic properties
• Semantic congruence between words
• A word can have several meanings
20
Structure Discovery Machine II

The following methods are SD algorithms:
• Language Identification: as introduced
• POS Induction: as introduced
• MWU detection by collocation extraction
• Unsupervised Compound Decomposition and Paraphrasing (work in progress)
• Unsupervised Morphology (MorphoChallenge): letter successor varieties
• Unsupervised Parsing: grammar induction based on POS and neighbor-based co-occurrences
• Semantic classes: similarity in context patterns of words and POS (work in progress)
• WSI+WSD: clustering co-occurrences + disambiguation (work in progress)
21
Chinese Whispers Graph Clustering

Explanations
• Nodes have a class and communicate it to their adjacent nodes
• A node adopts the majority class in its neighborhood
• Nodes are processed in random order for some iterations

Properties
• Time-linear in the number of edges: very efficient
• Randomized, non-deterministic
• Parameter-free
• Number of clusters found by the algorithm
• Small-world graphs converge fast
Algorithm:
initialize:
  forall vi in V: class(vi) = i;
while changes:
  forall v in V, randomized order:
    class(v) = highest-ranked class in neighborhood of v;
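A runnable Python version of this pseudocode, assuming the graph is an adjacency dict with edge weights and classes are ranked by summed edge weight; the fixed iteration count and seed stand in for the "while changes" test and are my simplifications:

```python
import random

def chinese_whispers(graph, iterations=20, seed=0):
    """Chinese Whispers: every node starts in its own class, then
    repeatedly adopts the highest-ranked (edge-weight-summed) class
    among its neighbors, visiting nodes in random order."""
    rng = random.Random(seed)
    classes = {v: v for v in graph}        # initialize: class(vi) = i
    nodes = list(graph)
    for _ in range(iterations):
        rng.shuffle(nodes)                 # randomized processing order
        for v in nodes:
            scores = {}
            for u, weight in graph[v].items():
                scores[classes[u]] = scores.get(classes[u], 0) + weight
            if scores:
                classes[v] = max(scores, key=scores.get)
    return classes
```

On a graph of two disconnected triangles, each triangle collapses into a single class and the two components end up with different labels; the number of clusters is found by the algorithm, not given as a parameter.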
[Diagram: example graph with nodes A–E carrying class labels L1–L4 and weighted edges, illustrating one update step]
22
Language Separation Evaluation
• Cluster the co-occurrence graph of a multilingual corpus
• Use words of the same class in a language identifier as lexicon
• Almost perfect performance
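The second step, using the clustered word lists as lexicons for identification, can be sketched as follows (coverage-based scoring is an assumption; the identifier actually evaluated may differ):

```python
def identify_language(sentence, lexicons):
    """Word-list language identifier: pick the language whose lexicon
    (a set of words from one co-occurrence cluster) covers the most
    tokens of the sentence."""
    tokens = sentence.lower().split()
    def coverage(lang):
        return sum(t in lexicons[lang] for t in tokens)
    return max(lexicons, key=coverage)
```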
[Chart: Precision, Recall and F-value for 7-lingual corpora, plotted against the number of sentences per language (100 to 100,000); all three values lie between 0.96 and 1.]
23
unsuPOS: Steps

... , sagte der Sprecher bei der Sitzung .
... , rief der Vorsitzende in der Sitzung .
... , warf in die Tasche aus der Ecke .

C1: sagte, warf, rief
C2: Sprecher, Vorsitzende, Tasche
C3: in
C4: der, die

... , sagte|C1 der|C4 Sprecher|C2 bei der|C4 Sitzung .
... , rief|C1 der|C4 Vorsitzende|C2 in|C3 der|C4 Sitzung .
... , warf|C1 in|C3 die|C4 Tasche|C2 aus der|C4 Ecke .

P(C3 | word, C2, C4)

... , sagte|C1 der|C4 Sprecher|C2 bei|C3 der|C4 Sitzung|C2 .
... , rief|C1 der|C4 Vorsitzende|C2 in|C3 der|C4 Sitzung|C2 .
... , warf|C1 in|C3 die|C4 Tasche|C2 aus|C3 der|C4 Ecke|C2 .
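The class assignment for an untagged word given its word form and neighboring classes can be illustrated with toy counts; this is a deliberately simplified stand-in for the trigram tagger, and the function names are my own:

```python
from collections import Counter

def train_counts(tagged_sentences):
    """Count (left_class, word, right_class, class) events from
    labelled text, where each sentence is a list of (word, class)."""
    counts = Counter()
    for sent in tagged_sentences:
        for i, (word, cls) in enumerate(sent):
            left = sent[i - 1][1] if i > 0 else None
            right = sent[i + 1][1] if i < len(sent) - 1 else None
            counts[(left, word, right, cls)] += 1
    return counts

def best_class(counts, left, word, right):
    """argmax_C P(C | word, left_class, right_class), from counts."""
    candidates = {c: n for (l, w, r, c), n in counts.items()
                  if w == word and l == left and r == right}
    return max(candidates, key=candidates.get) if candidates else None
```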
[Diagram: Unlabelled Text → distributional vectors for high-frequency words → Graph 1 → Partitioning 1, and NB-cooccurrences for medium-frequency words → Graph 2 → Partitioning 2; both partitionings are produced by Chinese Whispers Graph Clustering and combined into a Maxtag Lexicon → Partially Labelled Text → Trigram Viterbi Tagger → Fully Labelled Text]
24
unsuPOS: Ambiguity Example

Word  | cluster ID | cluster members (size)
------+------------+------------------------
I     | 166        | I (1)
saw   | 2          | past tense verbs (3818)
the   | 73         | a, an, the (3)
man   | 1          | nouns (17418)
with  | 13         | prepositions (143)
a     | 73         | a, an, the (3)
saw   | 1          | nouns (17418)
.     | 116        | . ! ? (3)
25
unsuPOS: Medline tagset

1 (13721): recombinogenic, chemoprophylaxis, stereoscopic, MMP2, NIPPV, Lp, biosensor, bradykinin, issue, S-100beta, iopromide, expenditures, dwelling, emissions, implementation, detoxification, amperometric, appliance, rotation, diagonal, …
2 (1687): self-reporting, hematology, age-adjusted, perioperative, gynaecology, antitrust, instructional, beta-thalassemia, interrater, postoperatively, verbal, up-to-date, multicultural, nonsurgical, vowel, narcissistic, offender, interrelated, …
3 (1383): proven, supplied, engineered, distinguished, constrained, omitted, counted, declared, reanalysed, coexpressed, wait, …
4 (957): mediates, relieves, longest, favor, address, complicate, substituting, ensures, advise, share, employ, separating, allowing, …
5 (1207): peritubular, maxillary, lumbar, abductor, gray, rhabdoid, tympanic, malar, adrenal, low-pressure, mediastinal, …
6 (653): trophoblasts, paws, perfusions, cerebrum, pons, somites, supernatant, Kingdom, extra-embryonic, Britain, endocardium, …
7 (1282): acyl-CoAs, conformations, isoenzymes, STSs, autacoids, surfaces, crystallins, sweeteners, TREs, biocides, pyrethroids, …
8 (1613): colds, apnea, aspergilloma, ACS, breathlessness, perforations, hemangiomas, lesions, psychoses, coinfection, terminals, headache, hepatolithiasis, hypercholesterolemia, leiomyosarcomas, hypercoagulability, xerostomia, granulomata, pericarditis, …
9 (674): dysregulated, nearest, longest, satisfying, unplanned, unrealistic, fair, appreciable, separable, enigmatic, striking, …
10 (509): differentiative, ARV, pleiotropic, endothermic, tolerogenic, teratogenic, oxidizing, intraovarian, anaesthetic, laxative, …
13 (177): ewe, nymphs, dams, fetuses, marmosets, bats, triplets, camels, SHR, husband, siblings, seedlings, ponies, foxes, neighbor, sisters, mosquitoes, hamsters, hypertensives, neonates, proband, anthers, brother, broilers, woman, eggs, …
14 (103): considers, comprises, secretes, possesses, sees, undergoes, outlines, reviews, span, uncovered, defines, shares, …
15 (87): feline, chimpanzee, pigeon, quail, guinea-pig, chicken, grower, mammal, toad, simian, rat, human-derived, piglet, ovum, …
16 (589): dually, rarely, spectrally, circumferentially, satisfactorily, dramatically, chronically, therapeutically, beneficially, already, …
18 (124): 1-min, two-week, 4-min, 8-week, 6-hour, 2-day, 3-minute, 20-year, 15-minute, 5-h, 24-h, 8-h, ten-year, overnight, 120-…
21 (12): July, January, May, February, December, October, April, September, June, August, March, November
23 (13): acetic, retinoic, uric, oleic, arachidonic, nucleic, sialic, linoleic, lactic, glutamic, fatty, ascorbic, folic
25 (28): route, angle, phase, rim, state, region, arm, site, branch, dimension, configuration, area, Clinic, zone, atom, isoform, …
247 (6): P<0_001, P<0_01, p<0_001, p<0_01, P<_001, P<0_0001
391 (119): alcohol, ethanol, heparin, cocaine, morphine, cisplatin, dexamethasone, estradiol, melatonin, nicotine, fibronectin, …