On-Demand Information Extraction
Satoshi Sekine, New York University
Summer/Fall 07
Introduction (http://nlp.cs.nyu.edu/sekine)
Research topics (overview)
• Pre-ODIE: IE and summarization, multi/single-document summarization, IR + clustering + summarization, cross-lingual QA, encyclopedia QA, paraphrase discovery, IE pattern discovery, Japanese spelling, 2nd-grade language test
• Toward on-demand IE: QA, IR, summarization, relation discovery, extended NE, attributes for ENE, English analyzer (OAK), Web people search, query logs and classes, information needs
What is Information Extraction?
• Automatically extract information for a specific scenario from unstructured text and put it into table format
• Example: management succession (newspaper)

  Date       Person       Company                  Position
  7/1/2003   John Smith   Smith Trade Corporation  CEO
  7/2/2003   Bill Brown   Bank of Manhattan        President
Problems in conventional IE
• Preparation of knowledge for a given scenario
  – Patterns, lexicons and semantic knowledge
    • e.g. "Company announced Person's promotion to Position"
  – Writing rules by hand or creating training data
• Restrictions
  – Preparation time (one month in MUC)
  – Specific/limited scenarios and domains
Our goal
• Turn one month into one minute (= automatic)
• IE on the fly
• How?
  – Use unsupervised learning methods
    • Pattern discovery, paraphrase discovery
  – Prepare as much knowledge and as many tools as possible, for as many scenarios as possible
    • Extended NE, semantic knowledge, relations
• Demo http://blueberry.cs.nyu.edu:8080/odie
ODIE – Flow Chart
1. Description of task (the user's topic description)
2. IR system retrieves relevant documents
3. Pattern discovery (using the language analyzer and ENE tagger) produces pattern sets
4. Table construction (using paraphrase discovery and co-reference) produces the final table
Automatic Discovery of IE Patterns (Sudo et al. HLT01; Sudo et al. ACL03)
• The bottleneck of IE is knowledge creation
• IE pattern example: "Company announced Person's promotion to Position"
• Key idea
  – Retrieve documents on the given topic
  – Collect relatively frequent patterns in the retrieved documents
• Research focus and results
  – Pattern format (predicate-argument, chain, sub-tree)
  – Computation time (tree mining algorithms)
  – Near-human performance was achieved
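The key idea above can be sketched in a few lines. This is a toy stand-in, assuming simple word n-grams instead of the dependency sub-trees used in the actual papers, and a hypothetical relative-frequency score (count in the retrieved documents discounted by background count):

```python
from collections import Counter

def discover_patterns(relevant_docs, background_docs,
                      min_len=2, max_len=4, top_k=5):
    """Rank word n-grams (a stand-in for sub-tree patterns) that are
    relatively frequent in topic-relevant documents."""
    def ngram_counts(docs):
        counts = Counter()
        for doc in docs:
            toks = doc.lower().split()
            for n in range(min_len, max_len + 1):
                for i in range(len(toks) - n + 1):
                    counts[tuple(toks[i:i + n])] += 1
        return counts

    rel = ngram_counts(relevant_docs)
    bg = ngram_counts(background_docs)
    # Illustrative score: relevant-set count discounted by background count.
    scored = {p: c / (1 + bg[p]) for p, c in rel.items()}
    return sorted(scored, key=scored.get, reverse=True)[:top_k]
```

On a topic like management succession, patterns such as ("company", "announced", "person") rise to the top because they recur across relevant articles but rarely in the background.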
IE Pattern Discovery - Result
• Scenario: succession
• Train: 1 yr. of newspaper
• Test: 148 documents (87 relevant, 61 irrelevant)
• Slots: Person: 135, Org: 172, Post: 215
Automatic Acquisition of Paraphrases (Sekine Gengo01; Shinyama et al. HLT02; Shinyama et al. IWP03)
• Purpose: find relationships between IE patterns
• Key idea
  – The same events are usually reported in different newspapers on the same day
  – The NE and numeric/date expressions in those reports are identical
  – Use them as anchors to acquire paraphrases
• We observed encouraging results, but found that it is not such an easy task
Paraphrase Acquisition - Procedure
1. NE tagging and co-reference on articles from newspaper A and newspaper B
2. Find comparable articles (article A, article B)
3. Identify anchors in each article
4. Find comparable sentences
5. Extract paraphrases (phrase A, phrase B)
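The steps above can be sketched as a toy function, assuming the NE anchors have already been identified as strings (the real system uses an NE tagger and co-reference); the function name and the two-anchor threshold are illustrative:

```python
def find_paraphrase_pairs(article_a, article_b, anchors):
    """Pair sentences from two same-day articles that share at least two
    named-entity/numeric anchors; the differing wording around the shared
    anchors yields paraphrase candidates."""
    def anchor_set(sentence):
        return frozenset(a for a in anchors if a in sentence)

    pairs = []
    for sent_a in article_a:
        for sent_b in article_b:
            shared = anchor_set(sent_a) & anchor_set(sent_b)
            if len(shared) >= 2 and sent_a != sent_b:
                pairs.append((sent_a, sent_b, sorted(shared)))
    return pairs
```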
Paraphrase Acquisition - Evaluation
• Two techniques: co-reference and the Argument Structure Database (ASD)
• One year of two Japanese newspapers
• 195 article pairs on arrests of murder suspects
• Results, with and without co-reference/ASD (obtained / precision (correct) / recall):
  – Sentence level (93): 55 / 75% (41) / 44%; 75 / 69% (52) / 56%
  – Phrase level (>100): 106 / 24% (25); 37 / 62% (23); 32 / 56% (18)
Relation Discovery (Hasegawa et al. ACL04)
• Motivation
  – Discover particular relations between NEs
  – e.g. country & president, merged companies
• Unsupervised method
  – The relationships are not known in advance
  – Find as many relationships as possible
• Idea: context-based clustering
  – Pairs of entities with similar contexts should be in the same relation
  – Similar contexts can also characterize the relations
Relation Discovery - Procedure
• Tag ENEs in the text and select two ENE types
• Collect the expressions between co-occurring NE pairs (within 5 words) as a context vector
• Cluster NE pairs based on their context vectors
• Label each cluster with its frequent context words
[Figure: contexts of 5 words or less between two named entities, e.g. "Company A is offering to buy Company B", "... is negotiating to acquire ...", "...'s interest in ...", "...'s planned purchase of ...", accumulated into a context vector.]
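The procedure can be sketched as follows, assuming the context vectors have already been accumulated into plain word-count dicts; the greedy single-link clustering and the 0.5 threshold are illustrative simplifications of the paper's method:

```python
import math

def cluster_entity_pairs(contexts, threshold=0.5):
    """contexts: {(e1, e2): {word: count}} accumulated from the <=5-word
    windows between the two NEs.  Greedy single-link clustering of NE
    pairs by cosine similarity of their context vectors."""
    def cosine(u, v):
        dot = sum(c * v.get(w, 0) for w, c in u.items())
        norm_u = math.sqrt(sum(c * c for c in u.values()))
        norm_v = math.sqrt(sum(c * c for c in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    clusters = []
    for pair, vec in contexts.items():
        for cluster in clusters:
            # join the first cluster containing a sufficiently similar pair
            if any(cosine(vec, contexts[p]) >= threshold for p in cluster):
                cluster.append(pair)
                break
        else:
            clusters.append([pair])
    return clusters
```

Pairs whose intervening words overlap (e.g. "agreed to buy" vs. "agreed to acquire") land in the same cluster, which can then be labeled with its frequent context words.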
Relation Discovery - Result

Major relation   Ratio   Common words (relative frequency)
President        17/23   President (1.0), president (0.415), ...
Senator          19/21   Sen. (1.0), Republican (0.214), ...
Prime Minister   15/16   Minister (1.0), minister (0.875), Prime (0.875), ...
Governor         15/16   Gov. (1.0), governor (0.458), Governor (0.3), ...
Secretary        6/7     Secretary (1.0), secretary (0.143), ...
Republican       5/6     Rep. (1.0), Republican (0.667), ...
Coach            5/5     coach (1.0), ...
M&A              10/11   buy (1.0), bid (0.382), ...
M&A              9/9     acquire (1.0), acquisition (0.583), buy (0.583), ...
Parent           7/7     parent (1.0), unit (0.476), own (0.143), ...
Relation Discovery - Evaluation and Error Analysis
• Evaluation result
  – Labeling accuracy: 10/10 = 100% (cluster level), 108/121 = 89% (NE-pair level)
  – (cf. recall: 140/(177+65) = 58% (all clusters))
• False alarms (mis-clustered NE pairs):
  – Common words from another cluster appeared in the 5-word context by accident (noise words)
  – e.g. "... Chechnya war may exhaust President Boris Yeltsin ..."
• Misses (undetected NE pairs):
  – Absence of common words in the 5-word contexts
  – Outer context might be helpful
  – e.g. "... Boris Yeltsin on end the fighting in Chechnya ..."
Another Paraphrase Discovery via Relation Discovery (Hasegawa et al. Gengo-05)
• Expressions found in the same relation cluster should contain paraphrases
• Two filters
  – Used by more than one pair of instances
  – Expressions contain a frequent term
• Example: A bought B / A has agreed to buy B / A, which is buying B / A's proposed acquisition of B / A's acquisition of B / A's agreement to buy B / A's purchase of B / A bid for B / A's takeover of B / A merger with B / A succeeded in buying B / B, which was acquired by A / B would become a subsidiary of A / B agreed to be bought by A
Yet Another Paraphrase Discovery Using Context and Keywords (Sekine IWP-05)
• Goal: eliminate the frequency threshold needed by the vector model in Hasegawa's method
• Algorithm
  – Find keywords in the context of each NE pair using TF-IDF
  – Cluster phrases based on shared NE instances (buy - acquire / buy - agree / buy - purchase)
Extended Named Entity (Sekine et al. LREC02; Sekine et al. LREC04)
• Named entities with 200 categories
• The definition (150 pages) has been released
• Automatic tagger
  – 130K dictionary entries
  – 2,000 rules
  – 70% accuracy
  – Improving...
Attributes for ENE (ver. 7.0.1), example for PERSON (20 attributes in total)

Attribute         Example of value                                                    Freq. (%)   ENE of value
Vocation          professional baseball player, economist, poet                      46 (100)    Vocation
Nationality       American, Chinese, Japanese                                        29 (63)     Country
Career            a professor at Yale University, The Princess of Wales              26 (57)     Vocation
Masterpiece       Guernica, Mona Lisa                                                25 (54)     Product, Facility
Graduate          M.A. in German at Cambridge, MK High School                        20 (44)     School
Hometown          Paris, Manchester, Shanghai                                        19 (41)     City
Native province   State of Illinois, Sichuan                                         18 (39)     Province
Previous stay     England, New York                                                  12 (26)     Location
Mentor            Andrea del Verrocchio, Michelangelo di Lodovico Buonarroti Simoni  10 (22)     Person
Death date        04,23,1704, 04/23/1704, unknown                                    10 (22)     Date
Era               Edo period, the 11th century                                       8 (17)      Era
Award             Academy Award, MVP, Nobel Prize                                    8 (17)      Award
Real name         Saint Nicholas                                                     8 (17)      Person
Another name      Santa, Father Christmas                                            8 (17)      Person
Title             Knight, an honorary degree at Yale                                 6 (13)      Title
Competition       World Series, 1955 piano competition in Paris                      6 (13)      Game
Place of birth    New York, Birmingham                                               5 (11)      Location
Father            John B. Kelly, Sr.                                                 5 (11)      Person
Cause of death    car accident, guillotine                                           5 (11)
On-demand Information Extraction - ODIE -
(Mid-size NSF project at NYU, 2003-08)
• Demo: http://blueberry.cs.nyu.edu:8080/odie
• Description of the task (topic): acquire, acquisition, merge, merger, buy, purchase
• Build a table in 1 minute
Docid           Company                                       Money          Date
nyt951113.0483  Chemicals Inc.,                               $400 million   Last year
nyt951002.0099  Tenneco Inc., Mobil Corp.                     $1.27 billion
nyt951010.0389  CoreStates Financial Corp., Meridian Bancorp  $3.1 billion
nyt950420.0056  Comdata Holdings Corp., Ceridian Corp.        $900 million   Last week
• nyt950831.0485 The move comes amid a flurry of acquisitions in the computer financial processing industry. Last week, check processor Ceridian Corp. agreed to buy Comdata Holdings Corp. for $900 million.
• nyt951113.0483 Last year, Terra Industries Inc., another large fertilizer maker, bought Agricultural Minerals & Chemicals Inc. for $400 million. ``Industry growth is primarily overseas,'' said Harvey Stober, an analyst with Donaldson, Lufkin & Jenrette in New York.
• nyt951002.0099 Tenneco Inc. said it agreed to buy Mobil Corp.'s plastics division for $1.27 billion as part of its strategy to expand into less cyclical, higher-growth businesses.
• nyt951010.0389 CoreStates Financial Corp. agreed to buy Meridian Bancorp for $3.1 billion in stock, creating a bank with a commanding role in Philadelphia and the industrial cities of eastern Pennsylvania.
ODIE - Evaluation Result
• Out of 20 topics, the created tables were judged very useful for 2 topics, useful for 12, and not useful for 6
• Correctness of the table fillers:

  Evaluation          Number of rows
  Correct             84
  Partially correct   4
  Incorrect           12
Research topics
Preemptive Information Extraction (Shinyama, Sekine NAACL06)
• We can even eliminate the query...
1. Find pairs (or triples) of entities that have similar relations in articles.
   e.g. the relation between "Katrina" and "New Orleans" in article A is similar to the relation between "Longwang" and "Taiwan" in article B.
2. Cluster relations based on their similarity.
3. A cluster of similar relations = a table.
Knowledge Discovery from Query Logs (work at Microsoft Research, Summer 06)
• Query logs contain a lot of knowledge
• Discover knowledge about ENEs...
• Query log (6 months, 1.5B queries), e.g.: bathroom+showers+pictures, miss+mundo, owen+mulligan+tyrone, news+press, free+wedding+planner, marc+arther+glen, preadolescente+follando, gokmenler+agricultural+machinery+co., erotika+com, academy+award+winner
• ENE dictionary (245K entries), e.g. AWARD: 100+greatest+britons, 100+worst+britons, aaass/orbis+books+prize+for+polish+studies, abel+prize, academy+award, academy+awards, acm+turing+award, agatha+award, agatha+awards, air+medal
Problems to be solved
• Problem 1: the observed contexts are very general; we want to see the important contexts for each category
• Problem 2: there is noise in the ENE dictionary; we want to get rid of it
• Problem 3: the list may not have enough coverage; we want to populate the ENE dictionary
• Problem 4: individual entities are verbose; we want to cluster them
Key Idea
• Find important contexts for each category
  – New formula (not TF-IDF, MI, log-likelihood ratio, ...):
      Score(context) = f_type(context) * log( g(context) / C )
      g(context) = f_type(context) / F_inst(context)
      C = f_type(context_top1000) / F_inst(context_top1000)
  – Results
    • ACADEMIC: #+syllabus, degrees+in+#, phd+in+#, masters+in+#, diploma+in+#, #+lecture+notes, masters+degree+in+#, courses+in+#
    • AWARD: #+winners, #+nominees, #+nominations, #+winner, #+award, who+won+#, winners+of+#, list+of+#+winners, winners+of+the+#
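The scoring formula above translates directly into code. A sketch, assuming two pre-computed tables (the names are illustrative): f_type, how often entities of the category occur with each context, and F_inst, the total occurrences of each context in the log:

```python
import math

def score_contexts(type_freq, inst_freq, top_n=1000):
    """Score(c) = f_type(c) * log( g(c) / C ),
    g(c) = f_type(c) / F_inst(c),
    C = the same ratio computed over the top-N contexts pooled together."""
    # normalizing constant C from the most frequent contexts
    top = sorted(type_freq, key=type_freq.get, reverse=True)[:top_n]
    C = sum(type_freq[c] for c in top) / sum(inst_freq[c] for c in top)
    return {c: type_freq[c] * math.log((type_freq[c] / inst_freq[c]) / C)
            for c in type_freq}
```

A category-specific context like "#+winners" gets a positive score, while a generic context like "free+#" (high instance count, low type ratio) is pushed below zero.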
Results
• Delete noise (Problem 2)
  – Entities that do not co-occur with important contexts are noise
  – BOOK: it, porno, space, night, we, working, candy, wheels, foundation, jazz, ghost, couples, the bible, giant, daddy, creation ...
• Find new entities (Problem 3)
  – Words that co-occur with important contexts are entities
  – AWARD: golden+globes, grammys, golden+globe, kentucky+derby, daytime+emmy, sag, sag+awards, american+idol, daytime+emmys
Web People Search - a SemEval 2007 task
• Input: first 100 results for a person-name search
• Output: clustering according to the actual people
• John Smith 1 (Captain)
  – Captain John Smith - www.apva.org
  – John Smith wikipedia - en.wikipedia...
  – ...
• John Smith 2 (Labour leader)
  – BBC: Labour leader John Smith - news...
  – John Smith wikipedia - en.wikipedia...
• John Smith 3 (IBM researcher)
• John Smith 4 (Film director)
• John Smith 5 (Shoe company)
• John Smith 6 (Writer)
• ...
WePS
• SemEval - formerly called "Senseval", running since 1998
  – 27 proposals / 18 final tasks
  – >100 teams, >125 systems, 110 papers
• WePS is the largest single task
  – 16 participants (38 people in ML)
  – http://nlp.uned.es/weps
• Data

  Training:                          Test:
  source     Av. entity  Av. doc.    source     Av. entity  Av. doc.
  Wikipedia  23.14       99.00       Wikipedia  56.50       99.30
  ECDL06     15.30       99.20       ACL06      31.00       98.40
  WEB03*     5.90        47.20       Census     50.30       99.10
  Total av.  10.76       71.20       Total av.  45.93       98.93
WePS (Result)

team                F (α=0.5)  purity  inv_purity  Presentation/Poster
CU_COMSEM           0.79       0.72    0.88        Tomorrow - PS2 - 11:15-12:30
COMBINED_BASELINE   0.78       0.64    1.00
IRST-BP             0.77       0.75    0.80        Tomorrow - PS2 - 11:15-12:30
PSNUS               0.77       0.73    0.82        Next presentation - 17:00-17:15
UVA                 0.69       0.81    0.60        Tomorrow - PS3 - 14:30-15:45
SHEF                0.66       0.60    0.82
FICO                0.67       0.53    0.90        Tomorrow - PS2 - 11:15-12:30
UNN                 0.66       0.60    0.73        Tomorrow - PS3 - 14:30-15:45
SCATTERED_BASELINE  0.64       1.00    0.47
AUG                 0.64       0.50    0.88        Tomorrow - PS2 - 11:15-12:30
SWAT-IV             0.62       0.55    0.71
UA-ZSA              0.61       0.58    0.64        Tomorrow - PS3 - 14:30-15:45
TITPI               0.60       0.45    0.89        Tomorrow - PS3 - 14:30-15:45
JHU1-13             0.58       0.45    0.82        Tomorrow - PS2 - 11:15-12:30
DFKI2               0.53       0.39    0.83        Tomorrow - PS2 - 11:15-12:30
WIT                 0.52       0.36    0.93        Tomorrow - PS3 - 14:30-15:45
UC3M_13             0.51       0.35    0.95        Tomorrow - PS3 - 14:30-15:45
UBC-AS              0.45       0.30    0.91
JOINED_BASELINE     0.45       0.29    1.00
WePS
• Systems
  – Basic strategies are similar: HTML analysis => pre-processing (POS, NE, ...) => feature selection (tokens, NEs, ...) => calculate similarities => make clusters => produce output
  – A summary is available
• Future
  – Second evaluation (fall 08 or spring 09)
  – Attribute extraction (DOB, vocation, ...)
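The shared recipe (features => similarities => clusters) can be reduced to a few lines. A toy version, assuming token-set Jaccard similarity, greedy single-link clustering, and an arbitrary 0.3 threshold in place of the richer features and clustering methods of the actual systems:

```python
def cluster_pages(pages, threshold=0.3):
    """Cluster search-result pages for one person name: token features,
    Jaccard similarity, greedy single-link clustering."""
    sets = [set(p.lower().split()) for p in pages]

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    clusters = []
    for i, s in enumerate(sets):
        for cluster in clusters:
            if any(jaccard(s, sets[j]) >= threshold for j in cluster):
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

Pages about the same John Smith share topical vocabulary ("jamestown", "colony") and cluster together, while pages about a different John Smith fall into their own cluster.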
Human Relationship Extraction (Goto, NHK)
• Extract human relationships from movie/TV summaries
• Display them as a network graph
• Components: NE for the domain, attributes for NEs, clustering of the relationships
English Analyzer (OAK system)

Modules: Sentence Splitter, Tokenizer, POS tagger, Chunker, NE tagger, Dependency Analyzer, Parser, Function Tagger, Regularizer

Pipeline (input: text; successive outputs with their formats):
• Sentence (plain)
• Tokenized sentence (plain)
• POS-tagged sentence (plain, PTB/tag)
• Chunked sentence, low-level constituents (PTB/tag, PTB/tree, Collins)
• NE-tagged sentence: org, loc, person, time, ... (CONLL, MUC)
• Dependencies between chunks and constituents (PTB/tag, PTB/tree, CONLL)
• Parse tree (PTB/tree)
• Parse tree with function labels (PTB/tree)
• Regularized structure, predicate-argument (Triplet, Glarf, OAK)
Research topics
Extra slides
IR, Clustering and Multi-document Summarization (Nobata et al. Gengo03; Nobata et al. DUC03; Murakami et al. Gengo04)
• Large number of retrieved documents
• Automatic clustering by sub-topic
  – 200 documents -> 5 clusters
• Summarize each cluster (multi-document summarization)
• Demo: http://apple.cs.nyu.edu/eliot
Cross-Lingual Question Answering (Sekine DARPA03; Sekine TALIP04)
• Developed in a month during DARPA's Surprise Language experiment
• Type a question in English -> translate the question into Hindi -> find the answer in Hindi newspapers -> translate the answer into English
• Demo: http://apple.cs.nyu.edu/clqa-eh
CLQA Architecture (components and data flow)
• Question examiner: takes the question in English; outputs keywords in English and the type of the expected answer
• Bilingual dictionary: translates the keywords into Hindi
• CLIR: retrieves relevant documents from the Hindi newspaper corpus
• NE tagger and numeric tagger: produce the NE-tagged corpus and numeric knowledge
• Answer finder: returns the answer and its context in Hindi
• Translation (Hindi -> English): returns the answer and context in English
• User interface
Huge Text Data & Knowledge Acquisition (Yoshihira et al. Gengo04; Ando et al. LREC04; Sekine Gengo04)
• Collecting a huge corpus (untagged)
  – Web pages: Japanese 350GB, English 300GB
  – Newspaper: 38 years, ~10GB
  – Encyclopedia
• Tool
  – Suffix array over the huge data
• Applications
  – Semantic knowledge acquisition by lexico-syntactic patterns
  – Automatic acquisition of NE instances
  – IE pattern and paraphrase acquisition
  – Collecting spelling variations
  – KWIC system (http://kwic.jp)
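A suffix array is the data structure behind KWIC lookup over a huge corpus: sort all suffix start positions once, and every query becomes a binary search. A minimal sketch; the naive sort shown here is fine for a demo, while gigabyte-scale data needs a linear-time construction:

```python
def build_suffix_array(text):
    """All suffix start positions, sorted lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def kwic(text, sa, query, width=10):
    """Every occurrence of `query` with `width` characters of context,
    found by binary search over the suffix array."""
    def bound(upper):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + len(query)]
            if prefix < query or (upper and prefix == query):
                lo = mid + 1
            else:
                hi = mid
        return lo
    lo, hi = bound(False), bound(True)
    return [text[max(0, i - width): i + len(query) + width]
            for i in sorted(sa[lo:hi])]
```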
Survey on Multi-document Summarization (Sekine et al. DUC03)
• Purpose
  – Multi-document summarization & IE
• Analyzed the summaries/categories of 100 document sets
• Categories of document sets
  – 13 categories: (multi|single)-(person|org|loc|...), other
  – Consistent tagging by human annotators
  – Should be useful for choosing a summarization strategy
• Table-format summaries
  – 40-45%: a table is natural; 36-38%: a table is OK
  – Encouraging results for on-demand IE
Spelling Variation in Japanese (Masuyama et al. Gengo04)
• A serious problem in real-world systems
  – 30% of questions to the encyclopedia QA system
  – 40% of zero hits on dictionary look-up
• A system to find the closest entry
  – 190,000 entries
  – Edit distance after narrowing down candidates
• If the target is in the dictionary, it is found in the top 3 100% of the time
• Demo: http://languagecraft.jp
• Planning to construct a variation dictionary from a Web corpus
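The closest-entry lookup can be sketched with the standard Levenshtein DP; the first-character filter below stands in for the system's real candidate-narrowing step and is only an assumption:

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic row-by-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def closest_entries(query, dictionary, k=3):
    """Narrow candidates (here: by first character, an illustrative
    stand-in), then rank by edit distance and return the top k."""
    cands = [w for w in dictionary if w[:1] == query[:1]] or dictionary
    return sorted(cands, key=lambda w: edit_distance(query, w))[:k]
```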
Encyclopedia QA (Sekine Gengo03; Sekine et al. Gengo05)
• Users can ask questions in natural language and get answers from an encyclopedia
• The system runs as a live demo (http://www.japanknowledge.com)
• Answers are entry words or numerals in the explanations
• Answers are prepared in advance by IE, and a pattern-based question matcher pinpoints the answer
2nd Grade Language Test
• Objectives
  – Package NLP technologies in a form that can be easily inspected by ordinary people
  – Observe the problems of NLP technologies by lowering the level of the target material
  – http://languagecraft.net/dennou
• Results

  Section   # of Q  Known  Correct  Cor. in known (%)  Cor. in total (%)
  Kanji     76      74     69       93.2               90.8
  Word K.   245     140    117      83.6               47.8
  Reading   34      22     10       45.5               29.4
  Total     355     235    196      83.4               55.2
Research topics
My Research Interests
• "Things" are central
  – People, organizations, locations, event names, vocations, product names, ...
  – How they are related
  – How they act, are used, or are evaluated
• This is "knowledge"
  – How to create "knowledge"?
    • Extract it from huge corpora
    • Fuse it with existing knowledge (e.g. encyclopedias)
[Diagram, built up over several slides: an example knowledge graph linking people, organizations, sports teams, cities, and countries through relations such as "belongs to", "a part of", "located in", "located nearby", "largest city", "has a lunch", "work in / commute in", "favorite team", "awarded", "now visiting", "friendship countries", "saw it play", "pay money", "born in", "nationality", "a member of", "live nearby", "same name", "cheer!", "boo!".]
Knowledge Creation
• Not only by hand
  – Cyc, EDR
• Learn from corpora
  – Frequency, TF-IDF: pattern discovery, NE discovery
  – Distributional similarity: NE discovery, attributes for NEs, paraphrase discovery
  – Lexico-syntactic patterns: hypernym-hyponym discovery, IE patterns
  – Alignment: paraphrase discovery
Topics
• Names
  – Create categories (e.g. product names, event names)
  – Find entities (e.g. instances of laptop names)
• Relations
  – Find relations (e.g. pros/cons of products, sentiment, opinions)
• Categorizing relations
  – Paraphrase, textual entailment (e.g. the same kind of opinion)
• Knowledge
  – Link and structure the extracted knowledge
  – Human interface