Snowball: Extracting Relations from Large Plain-Text Collections
Eugene Agichtein, Luis Gravano
Department of Computer Science, Columbia University
Eugene Agichtein Columbia University•2
Extracting Relations from Documents
Text documents hide valuable structured information.
If we manage to extract this information:
• We can answer user queries more accurately
• We can run data mining tasks (e.g., finding trends)
GOAL: Extract All Tuples “Hidden” in the Document Collection
System must:
• Require minimal training for each new task
• Recover from noise
• Exploit redundancy of information in documents
Example Task: Organization/Location
Apple's programmers "think different" on a "campus" in Cupertino, Cal.
Nike employees "just do it" at what the company refers to as its "World Campus," near Portland, Ore.
Microsoft's central headquarters in Redmond is home to almost every product group and division.
Organization     Location
Microsoft        Redmond
Apple Computer   Cupertino
Nike             Portland
Brent Barlow, 27, a software analyst and beta-tester at Apple Computer headquarters in Cupertino, was fired Monday for "thinking a little too different."
Redundancy
Extracting Relations from Text Collections
• Related Work
• The Snowball System
• Evaluation Metrics
• Experimental Results
Related Work
• Traditional Information Extraction
  – MUCs (Message Understanding Conferences)
    • Significant (manual) training for each new task
• Bootstrapping
  – Riloff et al. (‘99), Collins & Singer (‘99)
    • (Named-entity recognition)
  – Brin (DIPRE) (‘98)
Extracting Relations from Text: DIPRE
[Diagram: Initial Seed Tuples → Occurrences of Seed Tuples → Generate Extraction Patterns → Generate New Seed Tuples → Augment Table]
Initial Seed Tuples:
ORGANIZATION   LOCATION
MICROSOFT      REDMOND
IBM            ARMONK
BOEING         SEATTLE
INTEL          SANTA CLARA
Extracting Relations from Text: DIPRE
Occurrences of seed tuples:
Computer servers at Microsoft’s headquarters in Redmond…
In mid-afternoon trading, shares of Redmond-based Microsoft fell…
The Armonk-based IBM introduced a new line…
The combined company will operate from Boeing’s headquarters in Seattle.
Intel, Santa Clara, cut prices of its Pentium processor.
Extracting Relations from Text: DIPRE
DIPRE Patterns:
• <STRING1>’s headquarters in <STRING2>
• <STRING2>-based <STRING1>
• <STRING1>, <STRING2>
Extracting Relations from Text: DIPRE
Generate new seed tuples; start new iteration:
ORGANIZATION   LOCATION
AG EDWARDS     ST LUIS
157TH STREET   MANHATTAN
7TH LEVEL      RICHARDSON
3COM CORP      SANTA CLARA
3DO            REDWOOD CITY
JELLIES        APPLE
MACWEEK        SAN FRANCISCO
Extracting Relations from Text: Potential Pitfalls
• Invalid tuples generated
  – Degrade quality of tuples on subsequent iterations
  – Must have automatic way to select high quality tuples to use as new seed
• Pattern representation
  – Patterns must generalize
Extracting Relations from Text Collections
• Related Work
  – DIPRE
• The Snowball System:
  – Pattern representation and generation
  – Tuple generation
  – Automatic pattern and tuple evaluation
• Evaluation Metrics
• Experimental Results
Extracting Relations from Text: Snowball
[Diagram: Initial Seed Tuples → Occurrences of Seed Tuples → Tag Entities → Generate Extraction Patterns → Generate New Seed Tuples → Augment Table]
Initial Seed Tuples:
ORGANIZATION   LOCATION
MICROSOFT      REDMOND
IBM            ARMONK
BOEING         SEATTLE
INTEL          SANTA CLARA
Extracting Relations from Text: Snowball
Occurrences of seed tuples:
Computer servers at Microsoft’s headquarters in Redmond…
In mid-afternoon trading, shares of Redmond-based Microsoft fell…
The Armonk-based IBM introduced a new line…
The combined company will operate from Boeing’s headquarters in Seattle.
Intel, Santa Clara, cut prices of its Pentium processor.
Problem: Patterns Excessively General
Pattern: <STRING2>-based <STRING1>
Today's merger with McDonnell Douglas positions Seattle-based Boeing to make major money in space.
…, a producer of apple-based jelly, ...
<jelly, apple> Incorrect!
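The over-generalization can be reproduced with a small sketch. It assumes the untyped pattern `<STRING2>-based <STRING1>` is rendered as a regular expression in which each placeholder matches any single word; the regular expression and the name `extract_based` are illustrative assumptions, not DIPRE's actual implementation.

```python
import re

# Hypothetical regex rendering of the pattern "<STRING2>-based <STRING1>".
# Each placeholder matches any single word, because an untyped pattern has
# no notion of what kind of entity the word is.
BASED_PATTERN = re.compile(r"(\w+)-based (\w+)")


def extract_based(text):
    """Return (STRING1, STRING2) pairs, read as (organization, location)."""
    return [(m.group(2), m.group(1)) for m in BASED_PATTERN.finditer(text)]


sentences = [
    "Today's merger with McDonnell Douglas positions Seattle-based Boeing "
    "to make major money in space.",
    "..., a producer of apple-based jelly, ...",
]

for sentence in sentences:
    # The first sentence yields a valid pair; the second yields <jelly, apple>.
    print(extract_based(sentence))
```

Nothing in the pattern distinguishes the two sentences, which is why restricting the placeholders to tagged entities is the natural next step.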
Extracting Relations from Text: Snowball
Tag Entities
Use MITRE’s Alembic Named Entity tagger
Computer servers at Microsoft’s headquarters in Redmond…
In mid-afternoon trading, shares of Redmond-based Microsoft fell…
The Armonk-based IBM introduced a new line…
The combined company will operate from Boeing’s headquarters in Seattle.
Intel, Santa Clara, cut prices of its Pentium processor.
Extracting Relations from Text
Computer servers at Microsoft's headquarters in Redmond...
Exxon, Irving, said it will boost its stake in the...
In midafternoon trading, shares of Irving-based Exxon fell…
The Armonk-based IBM has introduced a new line ...
The combined company will operate from Boeing's headquarters in Seattle.
Intel, Santa Clara, cut prices of its Pentium...
• <ORGANIZATION>’s headquarters in <LOCATION>
• <LOCATION>-based <ORGANIZATION>
• <ORGANIZATION>, <LOCATION>
PROBLEM: Patterns too specific: have to match text exactly.
Snowball: Pattern Representation
A Snowball pattern vector is a 5-tuple <left, tag1, middle, tag2, right>,
– tag1, tag2 are named-entity tags
– left, middle, and right are vectors of weighted terms.
Example: ORGANIZATION 's central headquarters in LOCATION is home to...
tag1 = ORGANIZATION
middle = {<'s 0.5>, <central 0.5>, <headquarters 0.5>, <in 0.5>}
tag2 = LOCATION
right = {<is 0.75>, <home 0.75>}
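One way to hold this 5-tuple in code is sketched below. The class name `SnowballPattern`, the use of dictionaries for term vectors, and the empty default for contexts are assumptions made for illustration; the slide does not show a left context for the example, so it is left empty here.

```python
from dataclasses import dataclass, field
from typing import Dict

# A term vector maps a term to its weight.
TermVector = Dict[str, float]


@dataclass
class SnowballPattern:
    """Hypothetical container for <left, tag1, middle, tag2, right>."""
    tag1: str
    tag2: str
    left: TermVector = field(default_factory=dict)
    middle: TermVector = field(default_factory=dict)
    right: TermVector = field(default_factory=dict)


# "ORGANIZATION 's central headquarters in LOCATION is home to..."
example = SnowballPattern(
    tag1="ORGANIZATION",
    tag2="LOCATION",
    middle={"'s": 0.5, "central": 0.5, "headquarters": 0.5, "in": 0.5},
    right={"is": 0.75, "home": 0.75},
)
print(example)
```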
Snowball: Pattern Generation
Tagged occurrences of seed tuples:
Computer servers at Microsoft’s central headquarters in Redmond…
In mid-afternoon trading, shares of Redmond-based Microsoft fell…
The Armonk-based IBM introduced a new line…
The combined company will operate from Boeing’s headquarters in Seattle.
Snowball Pattern Generation: Cluster Similar Occurrences
Occurrences of seed tuples converted to Snowball representation:
• {<servers 0.75> <at 0.75>} ORGANIZATION {<’s 0.5> <central 0.5> <headquarters 0.5> <in 0.5>} LOCATION
• {<shares 0.75> <of 0.75>} LOCATION {<- 0.75> <based 0.75>} ORGANIZATION {<fell 1>}
• {<the 1>} LOCATION {<- 0.75> <based 0.75>} ORGANIZATION {<introduced 0.75> <a 0.75>}
• {<operate 0.75> <from 0.75>} ORGANIZATION {<’s 0.7> <headquarters 0.7> <in 0.7>} LOCATION
Similarity Metric
$$P = \langle L_p, \mathit{tag}_1, M_p, \mathit{tag}_2, R_p \rangle, \qquad S = \langle L_s, \mathit{tag}_1, M_s, \mathit{tag}_2, R_s \rangle$$
$$\mathrm{Match}(P, S) = \begin{cases} L_p \cdot L_s + M_p \cdot M_s + R_p \cdot R_s & \text{if the tags match} \\ 0 & \text{otherwise} \end{cases}$$
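A minimal sketch of this metric, assuming each 5-tuple is a plain Python tuple `(left, tag1, middle, tag2, right)` whose contexts are dictionaries of term weights. The helper names `dot` and `match` and the empty left and right contexts in the example are assumptions for illustration.

```python
def dot(u, v):
    """Inner product of two sparse term vectors stored as dicts."""
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())


def match(p, s):
    """Sum of the three context inner products, or 0 if the tags differ."""
    lp, p_tag1, mp, p_tag2, rp = p
    ls, s_tag1, ms, s_tag2, rs = s
    if (p_tag1, p_tag2) != (s_tag1, s_tag2):
        return 0.0
    return dot(lp, ls) + dot(mp, ms) + dot(rp, rs)


pattern = ({}, "ORGANIZATION",
           {"'s": 0.7, "headquarters": 0.7, "in": 0.7},
           "LOCATION", {})
segment = ({}, "ORGANIZATION",
           {"'s": 0.5, "flashy": 0.5, "headquarters": 0.5, "in": 0.5},
           "LOCATION", {"is": 0.75, "near": 0.75})
reversed_tags = ({}, "LOCATION", {"-": 0.75, "based": 0.75}, "ORGANIZATION", {})

print(match(pattern, segment))        # three shared middle terms contribute
print(match(pattern, reversed_tags))  # tag order differs, so no match
```

Because unshared terms such as "flashy" simply contribute nothing, the segment still scores well against the pattern, which is the flexibility that exact string matching lacks.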
Snowball Pattern Generation: Clustering
The four occurrences group into two clusters:
Cluster with ORGANIZATION followed by LOCATION:
• {<servers 0.75> <at 0.75>} ORGANIZATION {<’s 0.5> <central 0.5> <headquarters 0.5> <in 0.5>} LOCATION
• {<operate 0.75> <from 0.75>} ORGANIZATION {<’s 0.7> <headquarters 0.7> <in 0.7>} LOCATION
Cluster with LOCATION followed by ORGANIZATION:
• {<shares 0.75> <of 0.75>} LOCATION {<- 0.75> <based 0.75>} ORGANIZATION {<fell 1>}
• {<the 1>} LOCATION {<- 0.75> <based 0.75>} ORGANIZATION {<introduced 0.75> <a 0.75>}
Snowball: Pattern Generation
Patterns are formed as centroids of the clusters. Filtered by minimum number of supporting tuples.
Patterns:
• ORGANIZATION {<’s 0.7> <in 0.7> <headquarters 0.7>} LOCATION
• LOCATION {<- 0.75> <based 0.75>} ORGANIZATION
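The centroid step might look like the sketch below. It assumes a centroid is the plain per-term average of the vectors in a cluster, and it approximates "supporting tuples" by the number of occurrences in the cluster; both are simplifications, and the names `centroid`, `patterns_from_clusters`, and `min_support` are hypothetical. Plain averaging keeps low-weight left and right terms that the slide's patterns omit.

```python
from collections import defaultdict


def centroid(vectors):
    """Average a list of sparse term vectors (dicts of term -> weight)."""
    totals = defaultdict(float)
    for vector in vectors:
        for term, weight in vector.items():
            totals[term] += weight
    return {term: total / len(vectors) for term, total in totals.items()}


def patterns_from_clusters(clusters, min_support=2):
    """Turn each sufficiently supported cluster into one pattern 5-tuple."""
    patterns = []
    for cluster in clusters:
        if len(cluster) < min_support:
            continue  # too little support: drop the cluster
        _, tag1, _, tag2, _ = cluster[0]
        left, middle, right = (
            centroid([occurrence[i] for occurrence in cluster]) for i in (0, 2, 4)
        )
        patterns.append((left, tag1, middle, tag2, right))
    return patterns


based_cluster = [
    ({"shares": 0.75, "of": 0.75}, "LOCATION",
     {"-": 0.75, "based": 0.75}, "ORGANIZATION", {"fell": 1.0}),
    ({"the": 1.0}, "LOCATION",
     {"-": 0.75, "based": 0.75}, "ORGANIZATION", {"introduced": 0.75, "a": 0.75}),
]
singleton_cluster = [
    ({"servers": 0.75, "at": 0.75}, "ORGANIZATION",
     {"'s": 0.5, "central": 0.5, "headquarters": 0.5, "in": 0.5}, "LOCATION", {}),
]

patterns = patterns_from_clusters([based_cluster, singleton_cluster])
print(patterns[0][2])  # middle context of the surviving pattern
```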
Snowball: Tuple Extraction
Using the patterns, scan the collection to generate new seed tuples:
Represent each new text segment in the collection as the context 5-tuple:
Netscape 's flashy headquarters in Mountain View is near
ORGANIZATION {<'s 0.5>, <flashy 0.5>, <headquarters 0.5>, <in 0.5>} LOCATION {<is 0.75>, <near 0.75>}
Find most similar pattern (if any):
ORGANIZATION {<'s 0.7>, <headquarters 0.7>, <in 0.7>} LOCATION
Snowball: Automatic Pattern Evaluation
Automatically estimate probability of a pattern generating valid tuples:
$$\mathrm{Conf}(\mathit{Pattern}) = \frac{\mathit{Positive}}{\mathit{Positive} + \mathit{Negative}}$$
Pattern “ORGANIZATION, LOCATION” in action:
Boeing, Seattle, said…                                        Positive
Intel, Santa Clara, cut prices…                               Positive
invest in Microsoft, New York-based analyst Jane Smith said   Negative
Pattern Confidence: e.g., Conf(Pattern) = 2/3 ≈ 67%
Seed tuples:
ORGANIZATION   LOCATION
MICROSOFT      REDMOND
IBM            ARMONK
BOEING         SEATTLE
INTEL          SANTA CLARA
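The worked example can be checked with a short sketch. It assumes a match counts as positive when it agrees with the seed table's location for that organization and negative when it contradicts it; ignoring organizations absent from the seed table is an added assumption, and the name `pattern_confidence` is hypothetical.

```python
SEED = {
    "MICROSOFT": "REDMOND",
    "IBM": "ARMONK",
    "BOEING": "SEATTLE",
    "INTEL": "SANTA CLARA",
}


def pattern_confidence(extracted, seed):
    """Positive / (Positive + Negative) over a pattern's extracted pairs."""
    positive = negative = 0
    for organization, location in extracted:
        known = seed.get(organization.upper())
        if known is None:
            continue  # assumption: unknown organizations give no evidence
        if known == location.upper():
            positive += 1
        else:
            negative += 1
    total = positive + negative
    return positive / total if total else 0.0


# Pairs produced by the pattern "ORGANIZATION, LOCATION" in the three sentences.
matches = [("Boeing", "Seattle"), ("Intel", "Santa Clara"), ("Microsoft", "New York")]
print(pattern_confidence(matches, SEED))  # two positives, one negative
```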
Snowball: Automatic Tuple Evaluation
$$\mathrm{Conf}(\mathit{Tuple}) = 1 - \prod_{i} \bigl(1 - \mathrm{Conf}(P_i)\bigr)$$
– Estimation of Probability(Correct(Tuple))
– A tuple will have high confidence if generated by multiple high-confidence patterns (P_i).
Apple's programmers "think different" on a "campus" in Cupertino, Cal.
Brent Barlow, 27, a software analyst and beta-tester at Apple Computer headquarters in Cupertino, was fired Monday for "thinking a little too different."
<Apple Computer, Cupertino>
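A minimal sketch of the tuple confidence, assuming the confidences of the patterns that generated a tuple are combined as one minus the product of their complements. The function name and the confidence values in the example are hypothetical, chosen only to show that a second supporting pattern raises the score.

```python
import math


def tuple_confidence(pattern_confidences):
    """1 - prod(1 - Conf(P_i)) over the patterns that generated the tuple."""
    return 1.0 - math.prod(1.0 - c for c in pattern_confidences)


# Hypothetical confidences for the patterns behind <Apple Computer, Cupertino>.
print(tuple_confidence([0.5]))       # supported by one pattern
print(tuple_confidence([0.5, 0.5]))  # supported by two patterns
```

The score can only grow as more patterns support the same tuple, which is how redundancy in the collection is turned into evidence.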
Snowball: Filtering Seed Tuples
Generate new seed tuples:
ORGANIZATION          LOCATION        CONF
AG EDWARDS            ST LUIS         0.93
AIR CANADA            MONTREAL        0.89
7TH LEVEL             RICHARDSON      0.88
3COM CORP             SANTA CLARA     0.8
3DO                   REDWOOD CITY    0.8
3M                    MINNEAPOLIS     0.8
MACWORLD              SAN FRANCISCO   0.7
------------------------------------------
157TH STREET          MANHATTAN       0.52
15TH CENTURY EUROPE   NAPOLEON        0.3
15TH PARTY CONGRESS   CHINA           0.3
MAD                   SMITH           0.3
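Filtering by confidence can be sketched as a simple threshold test. The cutoff of 0.6 below is a hypothetical value chosen to separate the two groups in the table; the deck does not state the threshold used on this slide, and the name `select_seed_tuples` is an assumption.

```python
def select_seed_tuples(candidates, min_confidence):
    """Keep (organization, location) pairs whose confidence meets the cutoff."""
    return [(org, loc) for org, loc, conf in candidates if conf >= min_confidence]


candidates = [
    ("AG EDWARDS", "ST LUIS", 0.93),
    ("AIR CANADA", "MONTREAL", 0.89),
    ("MACWORLD", "SAN FRANCISCO", 0.7),
    ("157TH STREET", "MANHATTAN", 0.52),
    ("MAD", "SMITH", 0.3),
]

# Hypothetical cutoff; only the surviving tuples seed the next iteration.
print(select_seed_tuples(candidates, min_confidence=0.6))
```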
Extracting Relations from Text Collections
• Related Work
• The Snowball System:
  – Pattern representation and generation
  – Tuple generation
  – Automatic pattern and tuple evaluation
• Evaluation Metrics
• Experimental Results
Task Evaluation Methodology
• Data: Large collection, extracted tables contain many tuples (> 80,000)
• Need scalable methodology:
  – Ideal set of tuples
  – Automatic recall/precision estimation
• Estimated precision using sampling
Collections used in Experiments
More than 300,000 real newspaper articles
Collection   Source                    Year
Training     The New York Times        1996
Training     The Wall Street Journal   1996
Training     The Los Angeles Times     1996
Test         The New York Times        1995
Test         The Wall Street Journal   1995
Test         The Los Angeles Times     1995, ’97
The Ideal Metric (1)
Creating the Ideal set of tuples
[Diagram: Ideal* is the overlap of "All tuples mentioned in the collection" and "Hoover’s directory (13K+ organizations)"]
* A perfect (ideal) system would be able to extract all these tuples
The Ideal Metric (2)
$$\text{Precision} = \frac{|\mathrm{Correct}(\mathit{Extracted} \cap \mathit{Ideal})|}{|\mathit{Extracted} \cap \mathit{Ideal}|}$$
$$\text{Recall} = \frac{|\mathrm{Correct}(\mathit{Extracted} \cap \mathit{Ideal})|}{|\mathit{Ideal}|}$$
[Diagram: sets Extracted and Ideal, with a region labeled "Correct location found"]
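The two ratios can be computed as in the sketch below. It assumes both tables map an organization to a single location, that the overlap of Extracted and Ideal is the set of organizations present in both, and that a tuple is correct when the two locations agree; these are assumptions about how the sets are compared, and the toy tables are made up for illustration.

```python
def ideal_precision_recall(extracted, ideal):
    """extracted, ideal: dicts mapping organization -> location."""
    shared = extracted.keys() & ideal.keys()
    correct = sum(1 for org in shared if extracted[org] == ideal[org])
    precision = correct / len(shared) if shared else 0.0
    recall = correct / len(ideal) if ideal else 0.0
    return precision, recall


# Toy tables (hypothetical data, not results from the experiments).
ideal = {
    "Microsoft": "Redmond",
    "IBM": "Armonk",
    "Intel": "Santa Clara",
    "Apple": "Cupertino",
}
extracted = {
    "Microsoft": "Redmond",
    "IBM": "Armonk",
    "Intel": "New York",   # wrong location: hurts precision
    "Jellies": "Apple",    # not in Ideal: ignored by both ratios
}

precision, recall = ideal_precision_recall(extracted, ideal)
print(precision, recall)
```

Tuples outside Ideal cannot be judged automatically, which is why precision over the whole extracted table still needs manual sampling.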
Estimate Precision by Sampling
• Sample extracted table
  – Random samples, each 100 tuples
• Manually check validity of tuples in each sample
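A sketch of the sampling procedure, under stated assumptions: samples are drawn without replacement within each sample, the manual check is represented by a caller-supplied function `is_valid`, and the per-sample fractions are averaged. The deck does not say how the samples are combined, so the averaging, the function names, and the toy table are all hypothetical.

```python
import random


def draw_samples(table, num_samples, sample_size=100, seed=0):
    """Draw independent random samples of tuples from the extracted table."""
    rng = random.Random(seed)
    return [rng.sample(table, sample_size) for _ in range(num_samples)]


def estimated_precision(samples, is_valid):
    """Mean fraction of valid tuples; is_valid stands in for the manual check."""
    fractions = [sum(1 for t in s if is_valid(t)) / len(s) for s in samples]
    return sum(fractions) / len(fractions)


# Toy extracted table; the "judged valid" set stands in for human judgments.
table = [(f"ORG{i}", f"LOC{i}") for i in range(500)]
judged_valid = set(table[:400])

samples = draw_samples(table, num_samples=3)
print(estimated_precision(samples, lambda t: t in judged_valid))
```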
Extracting Relations from Text Collections
• Related Work
• The Snowball System:
  – Pattern representation and generation
  – Tuple generation
  – Automatic pattern and tuple validation
• Evaluation Metrics
• Experimental Results
Experimental results: Test Collection
[Figure, panels (a) and (b)] Recall (a) and precision (b) using the Ideal metric, plotted against the minimal number of occurrences of test tuples in the collection
Experimental results: Sample and Check
[Figure, panels (a) and (b)] Recall (a) and precision (b) for varying minimum confidence threshold Tt.
NOTE: Recall is estimated using the Ideal metric, precision is estimated by manually checking random samples of result table.
Conclusions
We presented
• Our Snowball system:
  – Requires minimal training (handful of seed tuples)
  – Uses a flexible pattern representation
  – Achieves high recall/precision: > 80% of test tuples extracted
• Scalable evaluation methodology
Recent and Future Work
• Recent (presented in DMKD’00 workshop)
  – Alternative pattern representation
  – Combining representations
• Future Work
  – Evaluation on other extraction tasks
  – Extensions:
    • Non-binary relations
    • Relations with no key
    • HTML documents
Snowball Solutions
• Flexible pattern representation
• Pattern generation
• Automatic pattern and tuple evaluation
  – Able to recover from noise
  – Keeps only high quality tuples as new seed
Experimental Results: Training
[Figure, panels (a) and (b)] Recall (a) and precision (b) using the Ideal metric (training collection)
Sampling Results: Error Analysis
                                              Type of Error
System                  Correct   Incorrect   Location   Organization   Relationship
DIPRE                   74        26          3          28             5
Snowball (all tuples)   52        48          6          41             1
Snowball (t = 0.8)      93        7           3          4              0
Baseline                25        75          8          62             5
The tuples in the random samples were checked by hand to pinpoint the “culprits” responsible for incorrect tuples. Sample size is 100.
Sample Discovered Patterns
Pattern terms (Left, Middle, Right, in slide order)                          Conf
<NEAR 0.01> <IN 0.79> <HEADQUARTERS 0.03> <, 0.20>                           0.36
<OF 0.61> <, 0.61> <, 0.15>                                                  0.37
<- 0.53> <BASED 0.53> <, 0.25> <SAID 0.1>                                    1
<WHILE 0.01> <BASED 0.52> <IN 0.52> <, 0.43> <, 0.28>                        0.96
<- 0.70> <, 0.08>                                                            0.63
<FROM 0.01> <S 0.52> <' 0.52> <IN 0.24> <HEADQUARTERS 0.22> <AND 0.01>       0.69
Convergence of Snowball and DIPRE
[Figure, panels (a) and (b)] Precision (a) and Recall (b) of DIPRE and Snowball with increased iterations
Approximate Matching of Organizations
• Use Whirl (W. Cohen @ AT&T) to match similar organization names
• Self-join the Extracted table on the Organization attribute
• Join resulting table with the Test table, and compare values of Location attributes
Organization            Location
Microsoft Corp          Wash.
Microsoft Corporation   Redmond
Microsoft Corp.         WA
Apple Computer          Calif.
Apple Corp              Cupertino
Apple Computer Corp.    US
Location        Organization
Redmond         Microsoft
Armonk          IBM
Santa Clara     Intel
Mountain View   Netscape
Cupertino       Apple
Extracted, Extracted′
References
• Blum & Mitchell. Combining labeled and unlabeled data with co-training. Proceedings of the 1998 Conference on Computational Learning Theory.
• Brin. Extracting patterns and relations from the World-Wide Web. Proceedings of the 1998 International Workshop on Web and Databases (WebDB’98).
• Collins & Singer. Unsupervised models for named entity classification. EMNLP 1999.
• Riloff & Jones. Learning dictionaries for information extraction by multi-level bootstrapping. AAAI’99.
• Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. ACL’95.