46
1 Extracting Relations from Extracting Relations from Large Text Collections Large Text Collections Eugene Agichtein, Eugene Agichtein, Eleazar Eskin and Luis Eleazar Eskin and Luis Gravano Gravano Department of Computer Science Department of Computer Science Columbia University Columbia University Combining Strategies for

1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

  • View
    215

  • Download
    0

Embed Size (px)

Citation preview

Page 1: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

1

Extracting Relations Extracting Relations fromfrom

Large Text CollectionsLarge Text Collections

Eugene Agichtein, Eugene Agichtein, Eleazar Eskin and Luis Eleazar Eskin and Luis GravanoGravano

Department of Computer ScienceDepartment of Computer ScienceColumbia UniversityColumbia University

Combining Strategies for

Page 2: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•2

Extracting Relations From Extracting Relations From DocumentsDocuments

•Extract all tuples that appear in the document collection•Require minimal training•Resolve conflicting information•Exploit redundancy of information in documents

Text Documents hide valuable structured information

Page 3: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•3

Example Task: Example Task: Organization/LocationOrganization/Location

Apple's programmers "think different" on a "campus" in

Cupertino, Cal. Nike employees "just do it" at what the company refers to as its "World Campus," near Portland, Ore.

Microsoft's central headquarters in Redmond is home to almost every product group and division. Organization Location

Microsoft

Apple Computer

Nike

Redmond

Cupertino

Portland

Brent Barlow, 27, a software analyst and beta-tester at Apple Computer headquarters in Cupertino, was fired Monday for "thinking a little too different."

Page 4: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•4

Related WorkRelated Work

• Traditional Information ExtractionTraditional Information Extraction– MUCMUC

• BootstrappingBootstrapping– Riloff et al. (‘99), Collins & Singer (‘99)Riloff et al. (‘99), Collins & Singer (‘99)

– Blum & Mitchell (co-training) (‘98)Blum & Mitchell (co-training) (‘98)

– Brin (DIPRE) (‘98)Brin (DIPRE) (‘98)

Page 5: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•5

Extracting Relations from TextExtracting Relations from Text

ORGANIZATION LOCATIONMICROSOFT REDMONDIBM ARMONKINTEL SANTA CLARABOEING SEATTLE

Seed tuples:

Page 6: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•6

Extracting Relations from TextExtracting Relations from Text

Computer servers at Microsoft'sheadquarters in Redmond...Exxon, Irving, said it will boost itsstake in the...In midafternoon trading, shares ofIrving-based Exxon fell…The Armonk-based IBM has introduced anew line ...The combined company will operate fromBoeing's headquarters in Seattle.Intel, Santa Clara, cut prices of itsPentium...

Occurrences of seed tuples in text:

Page 7: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•7

Extracting Relations from TextExtracting Relations from Text

Computer servers at Microsoft'sheadquarters in Redmond...Exxon, Irving, said it will boost itsstake in the...In midafternoon trading, shares ofIrving-based Exxon fell…The Armonk-based IBM has introduced anew line ...The combined company will operate fromBoeing's headquarters in Seattle.Intel, Santa Clara, cut prices of itsPentium...

Tag Entities:

Page 8: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•8

Extracting Relations from TextExtracting Relations from Text

Computer servers at Microsoft'sheadquarters in Redmond...Exxon, Irving, said it will boost itsstake in the...In midafternoon trading, shares ofIrving-based Exxon fell…The Armonk-based IBM has introduced anew line ...The combined company will operate fromBoeing's headquarters in Seattle.Intel, Santa Clara, cut prices of itsPentium...

GeneratePatterns:

• <ORGANIZATION>’s headquarters in <LOCATION>

•<LOCATION> -based <ORGANIZATION>

•<ORGANIZATION> , <LOCATION>

Page 9: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•9

• <ORGANIZATION>’s headquarters in <LOCATION>

•<LOCATION> -based <ORGANIZATION>

•<ORGANIZATION> , <LOCATION>

Extracting Relations from TextExtracting Relations from Text

Generate new seed tuples:

Page 10: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•10

Extracting Relations from TextExtracting Relations from Text

• Potential Pitfalls:Potential Pitfalls:

– Inaccurate Generated PatternsInaccurate Generated Patterns

– Conflicting InformationConflicting Information

• Snowball Solutions:Snowball Solutions:

– Pattern representationPattern representation

– Pattern generationPattern generation

– Automatic patterns and tuples evaluationAutomatic patterns and tuples evaluation

– Different context representationsDifferent context representations

Page 11: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•11

SnowballSnowball Systems Systems

• Snowball-VSSnowball-VS (Vector Space) (Vector Space)– Original system (presented in DL’00)Original system (presented in DL’00)

• Snowball-SMTSnowball-SMT (consider order of (consider order of terms)terms)– estimating probability based on the estimating probability based on the

sequence of terms surrounding entities.sequence of terms surrounding entities.

– Uses Sparse Markov Transducers for Uses Sparse Markov Transducers for probability estimation.probability estimation.

Page 12: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•12

Snowball-VS: Pattern RepresentationSnowball-VS: Pattern Representation

ORGANIZATION 's central headquarters in LOCATION is home to...

Snowball-VS : Vector space model. Text context is represented as vectors of terms

Page 13: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•13

Snowball-VS: Pattern RepresentationSnowball-VS: Pattern Representation

ORGANIZATION 's central headquarters in LOCATION is home to...

LOCATIONORGANIZATION{<'s 0.5>, <central 0.5>, <headquarters 0.5>, < in 0.5>}

{<is 0.75>, <home 0.75> }

Snowball-VS : Vector space model. Text context is represented as vectors of terms

Page 14: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•14

Snowball-VS: Pattern GenerationSnowball-VS: Pattern Generation

Page 15: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•15

Snowball-VS: Tuple ExtractionSnowball-VS: Tuple Extraction

• Represent each text segment as the Represent each text segment as the context 5-tuple:context 5-tuple:

Netscape 's headquarters in Mountain View is near

Page 16: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•16

Snowball-VS: Tuple ExtractionSnowball-VS: Tuple Extraction

• Represent each text segment as the Represent each text segment as the context 5-tuple:context 5-tuple:

LOCATIONORGANIZATION{<'s 0.5>, <headquarters 0.5>, < in 0.5>}

{<is 0.75>, <near 0.75> }

Netscape 's headquarters in Mountain View is near

Page 17: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•17

Snowball-VS: Tuple ExtractionSnowball-VS: Tuple Extraction

• Represent each text segment as the Represent each text segment as the context 5-tuple:context 5-tuple:

• Find Most Similar Pattern:Find Most Similar Pattern:

LOCATIONORGANIZATION{<'s 0.5>, <headquarters 0.5>, < in 0.5>}

{<is 0.75>, <near 0.75> }

Netscape 's headquarters in Mountain View is near

Page 18: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•18

Snowball-VS: Tuple ExtractionSnowball-VS: Tuple Extraction

• Represent each text segment as the Represent each text segment as the context 5-tuple:context 5-tuple:

• Find Most Similar Pattern:Find Most Similar Pattern:

LOCATIONORGANIZATION{<'s 0.5>, <headquarters 0.5>, < in 0.5>}

{<is 0.75>, <near 0.75> }

Netscape 's headquarters in Mountain View is near

LOCATIONORGANIZATION{<'s 0.5>, <central 0.5>, <headquarters 0.5>, < in 0.5>}

{<is 0.75>, <home 0.75> }

Page 19: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•19

Snowball-VS: Pattern EvaluationSnowball-VS: Pattern Evaluation

Conf(Pattern) = Positive/(Positive + Negative) Conf(Pattern) = Positive/(Positive + Negative)

e.g., Conf(Pattern) = 2/3 = 66%e.g., Conf(Pattern) = 2/3 = 66%

Exxon, Irving, said +Intel, Santa Clara, cut prices +invest in Microsoft, New York-basedanalyst Jane Smith said

-

Pattern P, “ORGANIZATION, LOCATION” in action:

Pattern Confidence

Page 20: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•20

Snowball-VS: Tuple evaluationSnowball-VS: Tuple evaluation

Conf(T) = 1 - Conf(T) = 1 - (1 -Conf(P(1 -Conf(Pii))))

A tuple T will have high confidence if it is extracted A tuple T will have high confidence if it is extracted by multiple high-quality patterns, scaled by theby multiple high-quality patterns, scaled by thesimilarity of actual text segments to patterns. similarity of actual text segments to patterns.

Page 21: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•21

Snowball-SMT Snowball-SMT : Tuple Extraction: Tuple Extraction

Netscape 's headquarters in Mountain View is near

Conf(<Netscape, Mountain View>) = P( Positive | ‘s , headquarters, in )

Use probabilistic model that best describes the positive examples(seed tuples) and negative examples (conflicting tuples)

Exxon, Irving, said +Intel, Santa Clara, cut prices +invest in Microsoft, New York-basedanalyst Jane Smith said

-

Page 22: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•22

Combining Combining Snowball-VSSnowball-VS and and Snowball-Snowball-SMTSMT

Initial seed tuples

Page 23: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•23

Combining Combining Snowball-VSSnowball-VS and and Snowball-Snowball-SMTSMT

Run each system in parallel

Page 24: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•24

Combining Combining Snowball-VSSnowball-VS and and Snowball-Snowball-SMTSMT

Run each system in parallel

Page 25: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•25

Combining Combining Snowball-VSSnowball-VS and and Snowball-Snowball-SMTSMT

Run each system in parallel

Page 26: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•26

Combining Combining Snowball-VSSnowball-VS and and Snowball-Snowball-SMTSMT

Run each system in parallel

Page 27: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•27

Combining Combining Snowball-VSSnowball-VS and and Snowball-Snowball-SMTSMT

Run each system in parallel

Page 28: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•28

Combining Combining Snowball-VSSnowball-VS and and Snowball-Snowball-SMTSMT

Combination Strategies: •Union•Intersection•Weighed Mixture

Page 29: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•29

Combining Combining Snowball-VSSnowball-VS and and Snowball-Snowball-SMTSMT

Select tuples to use as new seed

Page 30: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•30

Combining Combining Snowball-VSSnowball-VS and and Snowball-Snowball-SMTSMT

Repeat the process

Page 31: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•31

Task Evaluation MethodologyTask Evaluation Methodology

• Data: (300,000 newspaper articles)Data: (300,000 newspaper articles)

• Methodology:Methodology:– IdealIdeal Set of tuples Set of tuples

– Automatic Recall/Precision estimationAutomatic Recall/Precision estimation

• Estimated precision using samplingEstimated precision using sampling(see paper)(see paper)

Page 32: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•32

TheThe Ideal Ideal metric metric

All tuples mentioned in the collection

Hoover’s directory(13K+ organizations)

Ideal

• Precision:Precision:Fraction of Fraction of IdealIdeal tuples that the system tuples that the system extracted that have correct values.extracted that have correct values.

• Recall:Recall:Fraction of Fraction of IdealIdeal tuples extracted from tuples extracted from

collection.collection.

Page 33: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•33

Experimental Results (Individual Experimental Results (Individual Systems) Systems)

Precision(a) and Recall(b) of Snowball-VS, Snowball-SMT and DIPRE.

(a) (b)

Extracted tables contain more than 80,000 tuples

Page 34: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•34

Experimental Results - CombinationExperimental Results - Combination

Precision and Recall (b) of the Intersection, Union and Mixture strategies compared to Snowball-VS

(a) (b)

Page 35: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•35

ConclusionsConclusions

• System works after minimal training System works after minimal training (handful of examples)(handful of examples)

• Able to extract 80% of test tuplesAble to extract 80% of test tuples

• Combining Models results in better Combining Models results in better tabletable

Page 36: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•36

Future WorkFuture Work

• EfficiencyEfficiency

• Evaluate on other extraction tasksEvaluate on other extraction tasks– non-binary relationsnon-binary relations

– relations with no keyrelations with no key

HTML documentsHTML documents

– link structurelink structure

– document structuredocument structure

Page 37: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•37

ReferencesReferences

• Blum & Mitchell. Combining labeled and unlabeled Blum & Mitchell. Combining labeled and unlabeled data with co-training. Proceedings of 1998 Conference data with co-training. Proceedings of 1998 Conference on Computational Learning Theory.on Computational Learning Theory.

• Brin. Extracting patterns and relations from the World-Brin. Extracting patterns and relations from the World-Wide Web. Proceedings on the 1998 International Wide Web. Proceedings on the 1998 International Workshop on Web abd Databases (WebDB’98)Workshop on Web abd Databases (WebDB’98)

• Collins & Singer. Unsupervised models for named Collins & Singer. Unsupervised models for named entity classification. EMNLP 1999entity classification. EMNLP 1999

• Riloff & Jones. Learning dictionaries for information Riloff & Jones. Learning dictionaries for information extraction by multi-level bootstrapping. AAAI’99extraction by multi-level bootstrapping. AAAI’99

• Yarowsky. Unsupervised word sense disambiguation Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. ACL’95rivaling supervised methods. ACL’95

Page 38: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

38

Snowball: Snowball: Extracting Relations from Extracting Relations from Large Plain-Text CollectionsLarge Plain-Text Collections

Eugene AgichteinEugene Agichtein and Luis Gravano and Luis GravanoDepartment of Computer ScienceDepartment of Computer ScienceColumbia UniversityColumbia University

Page 39: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•39

Snowball patterns incorporate Snowball patterns incorporate Named Entity tagsNamed Entity tags

Pattern 1 <ORGANIZATION>'s headquarters in <LOCATION>

Pattern 2 <LOCATION>-based <ORGANIZATION>}

Pattern 3 <ORGANIZATION>, <LOCATION>

Entities tagged using MITRE’s Alembic Workbench namedentity tagger.

Page 40: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•40

Experimental results: trainingExperimental results: training

(a) (b)

Recall (a) and precision (b) using the Ideal metric (training collection)

Page 41: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•41

Snowball: ContributionsSnowball: Contributions

• Pattern RepresentationPattern Representation

• Pattern GenerationPattern Generation

• Patterns and Tuples EvaluationPatterns and Tuples Evaluation

Page 42: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•42

Snowball patterns incorporate Snowball patterns incorporate Named Entity tagsNamed Entity tags

Entities tagged using MITRE’s Alembic Workbench named entity tagger.

Microsoft 's central headquarters in Redmond is home to almost every product group and division.

Today's merger with McDonnell Douglas

positions Seattle -based Boeing to make major money in space.

<ORGANIZATION>’s central headquarters in <LOCATION>

…, a producer of apple-based jelly, ...

<LOCATION>-based <ORGANIZATION>

Page 43: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•43

Snowball-VS: Pattern Similarity Snowball-VS: Pattern Similarity MetricMetric

{ Lp . Ls + Mp . Ms + Rp . Rs if the tags match

0 otherwise

Match(P, S) =

P = <Lp, T1, Mp, T2, Rp>

S = <Ls, T1, Ms, T2, Rs>

Page 44: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•44

Approximate MatchingApproximate Matching

• Use Whirl Use Whirl

Microsoft Corp. = MicrosoftMicrosoft Corp. = Microsoft = Microsoft = Microsoft CorporationCorporation

Page 45: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•45

Collections used in Experiments Collections used in Experiments

More than 300, 000 real newspaper articles

Collection Source Year

TrainingThe New York TimesThe Wall Street JournalThe Los Angeles Times

199619961996

TrainingThe New York TimesThe Wall Street JournalThe Los Angeles Times

199519951995,’97

Page 46: 1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University

Eugene Agichtein Columbia University•46

Sparse Markov TransducersSparse Markov Transducers

• Computes conditional probability: P(Positive|WComputes conditional probability: P(Positive|W11,W,W22,W,W33,W,W44) ) where W where Wii is the i-th word in the context between entities. is the i-th word in the context between entities.

• SMT’s compute mixture of sparse prediction trees.SMT’s compute mixture of sparse prediction trees.

Eskin, Grundy, and Singer ISMB-2000.