Upload
hester-preston
View
223
Download
3
Embed Size (px)
Citation preview
December 2005 CSA3180: Information Extraction I 1
CSA2050: Natural Language Processing
Information Extraction• Information Extraction• Named Entities• IE Systems• MUC• Finite State Machines• Pattern Recognition
December 2005 CSA3180: Information Extraction I 2
Classification at different granularities
• Text Categorization: – Classify an entire document
• Information Extraction (IE):– Identify and classify small units within
documents
• Named Entity Extraction (NE):– A subset of IE– Identify and classify proper names
• People, locations, organizations
December 2005 CSA3180: Information Extraction I 3
Martin Baker, a person
Genomics job
Employers job posting form
December 2005 CSA3180: Information Extraction I 5
foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.html
OtherCompanyJobs: foodscience.com-Job1
December 2005 CSA3180: Information Extraction I 6
Aggregator Websites
• Read in many web pages from different sites
• Extract information into a database
• Screen Scraping
• Can then return data matching particular queries
• Data mining can extract meaningful insight that might not have been obvious
December 2005 CSA3180: Information Extraction I 11
What is Information Extraction?Filling slots in a database from sub-segments of text.As a task:
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“
Richard Stallman, founder of the Free Software Foundation, countered saying…
NAME TITLE ORGANIZATION
December 2005 CSA3180: Information Extraction I 12
What is Information Extraction?Filling slots in a database from sub-segments of text.As a task:
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“
Richard Stallman, founder of the Free Software Foundation, countered saying…
NAME TITLE ORGANIZATIONBill Gates CEO MicrosoftBill Veghte VP MicrosoftRichard Stallman founder Free Soft..
IE
December 2005 CSA3180: Information Extraction I 13
What is Information Extraction?Information Extraction = segmentation + classification + association
As a familyof techniques:
Microsoft CorporationCEOBill GatesMicrosoftGatesMicrosoftBill VeghteMicrosoftVPRichard StallmanfounderFree Software Foundation
aka “named entity extraction”
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“
Richard Stallman, founder of the Free Software Foundation, countered saying…
December 2005 CSA3180: Information Extraction I 14
What is Information Extraction?Information Extraction = segmentation + classification + association
A familyof techniques:
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“
Richard Stallman, founder of the Free Software Foundation, countered saying…
Microsoft CorporationCEOBill GatesMicrosoftGatesMicrosoftBill VeghteMicrosoftVPRichard StallmanfounderFree Software Foundation
December 2005 CSA3180: Information Extraction I 15
What is Information Extraction?Information Extraction = segmentation + classification + association
A familyof techniques:
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“
Richard Stallman, founder of the Free Software Foundation, countered saying…
Microsoft CorporationCEOBill GatesMicrosoftGatesMicrosoftBill VeghteMicrosoftVPRichard StallmanfounderFree Software Foundation
December 2005 CSA3180: Information Extraction I 16
IE in Context
Create ontology
SegmentClassifyAssociateCluster
Load DB
Spider
Query,Search
Data mine
IE
Documentcollection
Database
Filter by relevance
Label training data
Train extraction models
December 2005 CSA3180: Information Extraction I 17
IE in Context: Formatting
Grammatical sentencesand some formatting & links
Text paragraphswithout formatting
Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.
December 2005 CSA3180: Information Extraction I 18
IE in Context: Formatting
Non-grammatical snippets,rich formatting & links Tables
December 2005 CSA3180: Information Extraction I 19
IE in Context: CoverageWeb site specific
Amazon.com Book Pages
Formatting
Genre specific
Resumes
Layout
December 2005 CSA3180: Information Extraction I 20
IE in Context: CoverageWide, non-specific
University Names
Language
December 2005 CSA3180: Information Extraction I 21
IE in Context: ComplexityClosed set
He was born in Alabama…
The big Wyoming sky…
U.S. states
Regular set
Phone: (413) 545-1323
The CALD main office can be reached at 412-268-1299
U.S. phone numbers
Complex pattern
University of ArkansasP.O. Box 140Hope, AR 71802
U.S. postal addresses
Headquarters:1128 Main Street, 4th FloorCincinnati, Ohio 45210
…was among the six houses sold by Hope Feldman that year.
Ambiguous patterns,needing context andmany sources of evidence
Person names
Pawel Opalinski, SoftwareEngineer at WhizBang Labs.
December 2005 CSA3180: Information Extraction I 22
IE in Context: Single Field/Record
Single entity
Person: Jack Welch
Binary relationship
Relation: Person-TitlePerson: Jack WelchTitle: CEO
N-ary record
“Named entity” extraction
Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.
Relation: Company-LocationCompany: General ElectricLocation: Connecticut
Relation: SuccessionCompany: General ElectricTitle: CEOOut: Jack WelshIn: Jeffrey Immelt
Person: Jeffrey Immelt
Location: Connecticut
December 2005 CSA3180: Information Extraction I 23
State of the Art
• Named entity recognition from newswire text– Person, Location, Organization, …– F1 in high 80’s or low- to mid-90’s
• Binary relation extraction– Contained-in (Location1, Location2)
Member-of (Person1, Organization1)– F1 in 60’s or 70’s or 80’s
• Web site structure recognition– Extremely accurate performance obtainable– Human effort (~10min?) required on each site
December 2005 CSA3180: Information Extraction I 24
IE Generations
• Hand-Built Systems – Knowledge Engineering [1980s– ]– Rules written by hand– Require experts who understand both the systems and the
domain– Iterative guess-test-tweak-repeat cycle
• Automatic, Trainable Rule-Extraction Systems [1990s– ]– Rules discovered automatically using predefined templates,
using automated rule learners– Require huge, labeled corpora (effort is just moved!)
• Statistical Models [1997 – ]– Use machine learning to learn which features indicate
boundaries and types of entities.– Learning usually supervised; may be partially unsupervised
December 2005 CSA3180: Information Extraction I 25
IE Techniques
Lexicons
AlabamaAlaska…WisconsinWyoming
Abraham Lincoln was born in Kentucky.
member?
Classify Pre-segmentedCandidates
Abraham Lincoln was born in Kentucky.
Classifier
which class?
Sliding Window
Abraham Lincoln was born in Kentucky.
Classifier
which class?
Try alternatewindow sizes:
Boundary Models
Abraham Lincoln was born in Kentucky.
Classifier
which class?
BEGIN END BEGIN END
BEGIN
Context Free Grammars
Abraham Lincoln was born in Kentucky.
NNP V P NPVNNP
NP
PP
VP
VP
S
Mos
t lik
ely
pars
e?
Finite State Machines
Abraham Lincoln was born in Kentucky.
Most likely state sequence?
December 2005 CSA3180: Information Extraction I 26
Trainable IE SystemsPros
• Annotating text is simpler & faster than writing rules.
• Domain independent
• Domain experts don’t need to be linguists or programers.
• Learning algorithms ensure full coverage of examples.
Cons• Hand-crafted systems
perform better, especially at hard tasks. (but this is changing)
• Training data might be expensive to acquire
• May need huge amount of training data
• Hand-writing rules isn’t that hard!!
December 2005 CSA3180: Information Extraction I 27
MUC: Genesis of IE
• DARPA funded significant efforts in IE in the early to mid 1990’s.
• Message Understanding Conference (MUC) was an annual event/competition where results were presented.
• Focused on extracting information from news articles:– Terrorist events– Industrial joint ventures– Company management changes
• Information extraction of particular interest to the intelligence community (CIA, NSA). (Note: early ’90’s)
December 2005 CSA3180: Information Extraction I 28
MUC
• Named entity• Person, Organization, Location
• Co-reference• Clinton President Bill Clinton
• Template element• Perpetrator, Target
• Template relation• Incident
• Multilingual
December 2005 CSA3180: Information Extraction I 29
MUC Typical Text
Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production of 20,000 iron and “metal wood” clubs a month
December 2005 CSA3180: Information Extraction I 30
MUC Typical Text
Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production of 20,000 iron and “metal wood” clubs a month
December 2005 CSA3180: Information Extraction I 31
MUC Templates
• Relationship• tie-up
• Entities:• Bridgestone Sports Co, a local concern, a Japanese trading
house
• Joint venture company• Bridgestone Sports Taiwan Co
• Activity• ACTIVITY 1
• Amount• NT$2,000,000
December 2005 CSA3180: Information Extraction I 32
MUC Templates
• ATIVITY 1– Activity
• Production
– Company• Bridgestone Sports Taiwan Co
– Product• Iron and “metal wood” clubs
– Start Date• January 1990
December 2005 CSA3180: Information Extraction I 33
Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house toproduce golf clubs to be supplied to Japan.
The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20million new Taiwan dollars, will start production in January 1990with production of 20,000 iron and “metal wood” clubs a month.
TIE-UP-1Relationship: TIE-UPEntities: “Bridgestone Sport Co.” “a local concern” “a Japanese trading house”Joint Venture Company: “Bridgestone Sports Taiwan Co.”Activity: ACTIVITY-1Amount: NT$200000000
Example from Fastus (1993)
December 2005 CSA3180: Information Extraction I 34
Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house toproduce golf clubs to be supplied to Japan.
The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20million new Taiwan dollars, will start production in January 1990with production of 20,000 iron and “metal wood” clubs a month.
TIE-UP-1Relationship: TIE-UPEntities: “Bridgestone Sport Co.” “a local concern” “a Japanese trading house”Joint Venture Company: “Bridgestone Sports Taiwan Co.”Activity: ACTIVITY-1Amount: NT$200000000
ACTIVITY-1Activity: PRODUCTIONCompany: “Bridgestone Sports Taiwan Co.”Product: “iron and ‘metal wood’ clubs”Start Date: DURING: January 1990
December 2005 CSA3180: Information Extraction I 35
Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house toproduce golf clubs to be supplied to Japan.
The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20million new Taiwan dollars, will start production in January 1990with production of 20,000 iron and “metal wood” clubs a month.
TIE-UP-1Relationship: TIE-UPEntities: “Bridgestone Sport Co.” “a local concern” “a Japanese trading house”Joint Venture Company: “Bridgestone Sports Taiwan Co.”Activity: ACTIVITY-1Amount: NT$200000000
ACTIVITY-1Activity: PRODUCTIONCompany: “Bridgestone Sports Taiwan Co.”Product: “iron and ‘metal wood’ clubs”Start Date: DURING: January 1990
December 2005 CSA3180: Information Extraction I 36
1.Complex Words: Recognition of multi-words and proper names
2.Basic Phrases:Simple noun groups, verb groups and particles
3.Complex phrases:Complex noun groups and verb groups
4.Domain Events:Patterns for events of interest to the application
Basic templates are to be built.
5. Merging Structures:Templates from different parts of the texts are merged if they provide information about the same entity or event.
set upnew Taiwan dollars
a Japanese trading househad set up
production of 20, 000 iron and metal wood clubs
[company][set up][Joint-Venture]with[company]
December 2005 CSA3180: Information Extraction I 37
Evaluating IE Accuracy
• Always evaluate performance on independent, manually-annotated test data not used during system development.
• Measure for each test document:– Total number of correct extractions in the solution template: N– Total number of slot/value pairs extracted by the system: E– Number of extracted slot/value pairs that are correct (i.e. in the
solution template): C
• Compute average value of metrics adapted from IR:– Recall = C/N– Precision = C/E– F-Measure = Harmonic mean of recall and precision
December 2005 CSA3180: Information Extraction I 38
Named Entities
• Named Entities• Person Name: Colin Powell, Frodo• Location Name: Middle East, Aiur• Organization: UN, DARPA
• Domain Specific vs. Open Domain
December 2005 CSA3180: Information Extraction I 39
Nymble (BBN Corporation)
• State of the art system
• Near-human performance ~90% accuracy
• Statistical system
• Approach: Hidden Markov Model (HMM)
December 2005 CSA3180: Information Extraction I 40
Nymble (BBN Corporation)
• Noisy channel paradigm• Originally, entities were marked in the raw text• Post noisy channel, annotation is lost
• Probability of most likely sequence of name classes (NC) given a sequence of words (W)
Pr(NC|W) = Pr(W,NC) / Pr(W)
since the a priori probability of the word sequence can be considered constant for any given sentence maximize just numerator
December 2005 CSA3180: Information Extraction I 41
Nymble (BBN Corporation)
Person
Organization
Not-A-Name
Start of Sentence
Five other classes
End of Sentence
December 2005 CSA3180: Information Extraction I 42
Automatic Content Extraction
• DARPA ACE Program
• Identify Entities– Named: Bilbo, San Diego, UNICEF– Nominal: the president, the hobbit– Pronominal: she
• Reference resolution– Clinton the president he
December 2005 CSA3180: Information Extraction I 43
Question Answering
The over-used pipeline paradigm:
Question
AnalysisQuestionInformation
Retrieval
Answer
Extraction
Answer
MergingAnswer
December 2005 CSA3180: Information Extraction I 44
Question Answering
• Feedback loops can be present for constraint relaxation purposes
• Not all QA systems adhere to the pipeline architecture
• Question answering flavors– Factoid vs. complex
– Who invented paper? vs. Which of Mr. Bush’s friends are Black Sabbath fans?
– Closed vs. open domain
December 2005 CSA3180: Information Extraction I 45
Answer Extraction
The over-used pipeline paradigm:
• Focus on open domain, factoid question answering
Question
AnalysisQuestionInformation
Retrieval
Answer
Extraction
Answer
MergingAnswer
December 2005 CSA3180: Information Extraction I 46
Practical Issues
• Web Spell Checking– Mispling– nucular
• Infrequent forms: – Niagra vs. Niagara, Filenes vs. Filene’s
• Google QA
• Genome, video, games
December 2005 CSA3180: Information Extraction I 47
Practical Issues
• Traditional Information Extraction• Either expert built or statistical
• Specific strategies for specific question types
• Person Bio vs. Location question types
• Ability to generalize to new questions and new question types
December 2005 CSA3180: Information Extraction I 48
Practical Issues
• Who invented Blah?
Blah was invented by PersonName
Blah was Verb by PersonNamewhere Verb is synonym to invented
Blah VerbPhrase by PersonName