06/21/22 05:22 1 Copyright © 2000,2002 Weld, Kushmerick Information Carnivores

Information Carnivores


Page 1: Information Carnivores

Information Carnivores

Page 2: Information Carnivores

• Built by Selberg & Etzioni
• Released in Jun ‘95
• In 2000 aggregated 12 search engines:
  – LookSmart, About, Infoseek,
  – GoTo, Google, DirectHit,
  – RealNames, Webcrawler, AltaVista,
  – Excite, Lycos, Thunderstone
• History
  – Netbot
  – Go2net
  – Infospace
  – ???

Page 3: Information Carnivores

User enters query
→ Formulate queries
→ Lycos, Excite, . . .
→ Collate results
→ Remove duplicates
→ Post-process + rank
→ Download?
→ Present to user
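The flow above can be sketched in a few lines of Python. This is a minimal illustration, not the system's actual implementation: the `Hit` record, the `canonical` URL normalisation, the score-summing merge rule, and the two fake engines are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Hit:
    url: str
    title: str
    score: float  # relevance reported by the engine (assumed comparable across engines)


# Hypothetical engine interface: any function from a query string to a list of hits.
Engine = Callable[[str], List[Hit]]


def canonical(url: str) -> str:
    """Crude URL normalisation so the same page from two engines compares equal."""
    url = url.lower().rstrip("/")
    for scheme in ("http://", "https://"):
        if url.startswith(scheme):
            url = url[len(scheme):]
    return url


def metasearch(query: str, engines: Dict[str, Engine]) -> List[Hit]:
    # Formulate queries: in this sketch every engine receives the query unchanged.
    collated = [hit for engine in engines.values() for hit in engine(query)]
    # Remove duplicates: merge hits with the same canonical URL, summing scores
    # so that pages returned by several engines rise (an assumed ranking rule).
    merged: Dict[str, Hit] = {}
    for hit in collated:
        key = canonical(hit.url)
        if key in merged:
            first = merged[key]
            merged[key] = Hit(first.url, first.title, first.score + hit.score)
        else:
            merged[key] = hit
    # Post-process + rank: highest combined score first.
    return sorted(merged.values(), key=lambda h: h.score, reverse=True)


# Stand-ins for real engines, used only to exercise the pipeline.
def fake_lycos(query: str) -> List[Hit]:
    return [Hit("http://a.com/", "A", 0.5), Hit("http://b.com", "B", 0.6)]


def fake_excite(query: str) -> List[Hit]:
    return [Hit("http://A.com", "A again", 0.25)]


results = metasearch("wrapper induction", {"lycos": fake_lycos, "excite": fake_excite})
print([(h.url, h.score) for h in results])
```

The page returned by both fake engines is merged into one hit and ranked first; the optional "Download?" step is left out of the sketch.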

Page 4: Information Carnivores

The Need for Wrappers

Lots of information, but computers don’t understand much of it.

Page 5: Information Carnivores

Example 1: Seminar announcement news article

<[email protected]>
Type: cmu.andrew.assocs.UEA
Topic: Re: entreprenuership speaker
Dates: 17-Apr-95
Time: 7:00 PM
PostedBy: Colin S Osburn on 15-Apr-95 at 15:11 from CMU.EDU
Abstract: hello again
to reiterate
there will be a speaker on the law and startup businesses
this monday evening the 17th
it will be at 7pm in room 261 of GSIA in the new building, ie upstairs.
please attend if you have any interest in starting your own business or
are even curious.
Colin

IE output:
  date = monday evening the 17th
  speaker = ?
  start-time = 7pm
  end-time = ?
  location = room 261 of GSIA

Page 6: Information Carnivores

Example 2: Seminar announcement Web pages

IE output:
  date = Nov 5
  speaker = Dr. Rodger Kibble
  affil = University of Brighton
  title = Using centering...

  date = Nov 19
  speaker = Dr. Reinhard Muskens
  affil = Katholieke Univ...
  title = Underspecification...

  date = Nov 26
  speaker = Dr. Julie Berndsen
  affil = University College...
  title = A Generic Lexicon...

  ...

Page 7: Information Carnivores

Example 3: Job listings

IE

Page 8: Information Carnivores

Strategy: Wrappers

[Figure: the user talks to a Mediator; queries and results pass between the Mediator and wrapper A, wrapper B and wrapper C; each wrapper sits in front of its own resource (resource A, resource B, resource C)]

Page 9: Information Carnivores

Scaling issues

Need custom wrapper for each resource.

<HTML><BODY BGCOLOR="FFFFFF" LINK="00009C" ALINK="00009C" VLINK="00009C”TEXT= "000000"> <center> <table><tr><td><NOBR> <NOBR><img src="/ypimages/b_r_hd_a.gif”border=0 ALT="Switchboard Results" width=407height=20 align=top><A HREF="/bin/cgiqa.dll?MEM=1" TARGET ="_top"><img src="/ypimages/b_r_hd_1.gif" border=0 ALT="People" width=54 height=20align=top></A><A HREF="/bin/cgidir.dll?MEM=1”TARGET="_top"><img src= "/ypimages/b_r_hd_2.gif”border=0 ALT= "Business" width=62 height=24 align=top></A><A HREF="/" TARGET="_top"><img src=”/ypimages /b_r_hd_3.gif" border=0 ALT="Home”width=47 height=20 align=top></A></NOBR><br></td></tr></table> </center><center><table border=0width=576> <tr><td colspan=2 align =center> <center>

But hand-coding is tedious.

Especially since sites frequently change format

useful information

Page 10: Information Carnivores

Wrapper Approaches
• Perl-like languages
  – Simple and effective (if tedious)
• Proprietary languages & tools
  – Click and generalize
• Conversion to tree form
  – Use XML as intermediate representation
  – Extract children of specified node
• Machine Learning
  – Promising, but not yet fielded
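As a rough illustration of the "conversion to tree form" approach, the sketch below parses a page into a tree and extracts the children of one specified node. It assumes the page has already been converted to well-formed XML; the sample page, the `id="codes"` attribute, and the function name `extract_rows` are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, already converted to well-formed XML.
PAGE = """<html><body>
<table id="codes">
  <tr><td>Congo</td><td>242</td></tr>
  <tr><td>Egypt</td><td>20</td></tr>
</table>
</body></html>"""


def extract_rows(xml_text: str, table_id: str):
    """Return one tuple of cell texts per child row of the specified table node."""
    root = ET.fromstring(xml_text)
    table = root.find(f".//table[@id='{table_id}']")
    if table is None:
        return []
    # Iterating an element yields its direct children: rows, then cells.
    return [tuple(cell.text for cell in row) for row in table]


print(extract_rows(PAGE, "codes"))
```

Real-world HTML is rarely well-formed, so a tidy-up step would be needed before this parser could be used.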

Page 11: Information Carnivores

Kushmerick Contribution

Machine learning techniques to automatically construct wrappers from examples: example pages → wrapper procedure

Example page:

<HTML><HEAD>Some Country Codes</HEAD>
<BODY><B>Some Country Codes</B><P>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
<HR><B>End</B></BODY></HTML>

Page 12: Information Carnivores

Example

(Congo, 242) (Egypt, 20) (Belize, 501) (Spain, 34)

Page 13: Information Carnivores

LR wrappers: The basic idea

Use <B>, </B>, <I>, </I> for parsing

exploit fortuitous non-linguistic regularity

<HTML><TITLE>Some Country Codes</TITLE><B>Congo</B> <I>242</I> <BR><B>Egypt</B> <I>20</I> <BR><B>Belize</B> <I>501</I> <BR><B>Spain</B> <I>34</I> <BR></BODY></HTML>

Page 14: Information Carnivores

Left-Right wrappers

Country/Code LR wrapper:

procedure ExtractCountryCodes
  while there are more occurrences of <B>
    1. extract Country between <B> and </B>
    2. extract Code between <I> and </I>

Page 15: Information Carnivores

“Generic” LR wrapper

procedure ExtractAttributes:
  while there are more occurrences of l1
    1. extract 1st attribute between l1 and r1
    . . .
    K. extract Kth attribute between lK and rK

An LR wrapper is 2K strings l1, r1, …, lK, rK, where K = number of attributes, lk is the left delimiter of attribute k and rk is its right delimiter.

Not just HTML tags!
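The generic procedure above translates almost line for line into code. The following is a sketch under one stated assumption: after a value is extracted, scanning resumes at the end of that value, so a right delimiter may overlap the next left delimiter. The function name `lr_extract` is illustrative.

```python
from typing import List, Sequence, Tuple


def lr_extract(page: str, delims: Sequence[Tuple[str, str]]) -> List[Tuple[str, ...]]:
    """Run an LR wrapper [(l1, r1), ..., (lK, rK)] over a page.

    Delimiters are assumed to be non-empty strings.
    """
    tuples: List[Tuple[str, ...]] = []
    pos = 0
    first_left = delims[0][0]
    while page.find(first_left, pos) != -1:      # more occurrences of l1?
        row = []
        for left, right in delims:
            start = page.find(left, pos)
            if start == -1:
                return tuples                    # incomplete tuple: stop
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return tuples
            row.append(page[start:end])          # attribute lies between l_k and r_k
            pos = end                            # resume scanning after the value
        tuples.append(tuple(row))
    return tuples


PAGE = ("<HTML><TITLE>Some Country Codes</TITLE>"
        "<B>Congo</B> <I>242</I> <BR>"
        "<B>Egypt</B> <I>20</I> <BR>"
        "<B>Belize</B> <I>501</I> <BR>"
        "<B>Spain</B> <I>34</I> <BR>"
        "</BODY></HTML>")

print(lr_extract(PAGE, [("<B>", "</B>"), ("<I>", "</I>")]))
```

Because the delimiters are plain strings, the same function works for non-HTML sources, for example `lr_extract("a=1;b=2;", [("", "=") ...])`-style formats once suitable non-empty delimiters are chosen.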

Page 16: Information Carnivores

Wrapper induction algorithm

Inputs: PAC model parameters, example page supply, automatic page labeler
Output: wrapper

1. Gather enough pages to satisfy the termination condition (PAC model).

2. Label example pages.

3. Find a wrapper consistent with the examples.

Page 17: Information Carnivores

Finding an LR wrapper

labeled pages → wrapper l1, r1, …, lK, rK

Example: Find 4 strings <B>, </B>, <I>, </I> for l1, r1, l2, r2

Labeled page:

<HTML><HEAD>Some Country Codes</HEAD>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
</BODY></HTML>

Page 18: Information Carnivores

LR: Finding r1

<HTML><TITLE>Some Country Codes</TITLE><B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR><B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR></BODY></HTML>

r1 can be any prefix of the text that follows every country name, e.g. </B>

Page 19: Information Carnivores

LR: Finding l1, l2 and r2

<HTML><TITLE>Some Country Codes</TITLE><B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR><B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR></BODY></HTML>

r2 can be any prefix of the text that follows every code, e.g. </I>

l2 can be any suffix of the text that precedes every code, e.g. <I>

l1 can be any suffix of the text that precedes every country name, e.g. <B>

Page 20: Information Carnivores

Finding an LR wrapper: Algorithm

Naïve algorithm: enumerate all combinations, $O(S^{2K})$

for each candidate l1
  for each candidate r1
    ···
      for each candidate lK
        for each candidate rK
          succeed if consistent with examples

Efficient algorithm: constraints are independent, $O(KS)$

for k = 1 to K
  for each candidate rk
    succeed if consistent with examples
for k = 1 to K
  for each candidate lk
    succeed if consistent with examples

S = length of examples
K = number of attributes
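The efficient search can be sketched as follows: each delimiter is chosen on its own, by testing candidate prefixes (for rk) or suffixes (for lk) against the labeled values. This is a simplified illustration rather than Kushmerick's exact procedure; the `label` helper is a hypothetical stand-in for the page labeler, candidates are tried shortest first (an arbitrary choice), and consistency assumes the executor resumes scanning at the end of each extracted value.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]                    # [start, end) offsets of one labeled value
Labeled = Tuple[str, List[List[Span]]]    # (page, rows of K spans each)


def label(page: str, rows: List[Tuple[str, ...]]) -> Labeled:
    """Hypothetical labeler: locate each given value in left-to-right order."""
    spans, pos = [], 0
    for row in rows:
        row_spans = []
        for value in row:
            start = page.index(value, pos)
            pos = start + len(value)
            row_spans.append((start, pos))
        spans.append(row_spans)
    return page, spans


def right_ok(r: str, k: int, examples: List[Labeled]) -> bool:
    # Scanning from the start of each value, r must first occur exactly at its end.
    return all(page.find(r, row[k][0]) == row[k][1]
               for page, rows in examples for row in rows)


def left_ok(l: str, k: int, examples: List[Labeled]) -> bool:
    for page, rows in examples:
        prev_end = 0
        for row in rows:
            if k > 0:
                prev_end = row[k - 1][1]          # scan resumes after attribute k-1
            hit = page.find(l, prev_end)
            if hit == -1 or hit + len(l) != row[k][0]:
                return False
            prev_end = row[-1][1]                 # next tuple starts after this one
        if k == 0 and page.find(l, prev_end) != -1:
            return False                          # l1 must not occur in the tail
    return True


def learn_lr(examples: List[Labeled], K: int) -> Optional[List[Tuple[str, str]]]:
    page, rows = examples[0]
    first = rows[0]                               # candidates come from the first tuple
    delims = []
    for k in range(K):
        start, end = first[k]
        before = page[(first[k - 1][1] if k > 0 else 0):start]
        after = page[end:(first[k + 1][0] if k + 1 < K else len(page))]
        lefts = [before[i:] for i in range(len(before) - 1, -1, -1)]   # suffixes
        rights = [after[:i] for i in range(1, len(after) + 1)]         # prefixes
        l = next((c for c in lefts if left_ok(c, k, examples)), None)
        r = next((c for c in rights if right_ok(c, k, examples)), None)
        if l is None or r is None:
            return None                           # no LR wrapper fits the examples
        delims.append((l, r))
    return delims


PAGE = ("<HTML><TITLE>Some Country Codes</TITLE>"
        "<B>Congo</B> <I>242</I> <BR><B>Egypt</B> <I>20</I> <BR>"
        "<B>Belize</B> <I>501</I> <BR><B>Spain</B> <I>34</I> <BR></BODY></HTML>")
ROWS = [("Congo", "242"), ("Egypt", "20"), ("Belize", "501"), ("Spain", "34")]
print(learn_lr([label(PAGE, ROWS)], K=2))
```

On this page the sketch returns `[('B>', '<'), ('I>', '<')]`: the shortest delimiters consistent with the labels, which are fragments of the tags `<B>`, `</B>`, `<I>`, `</I>` used in the slides. Both choices are valid LR wrappers for the page, since any consistent prefix or suffix will do.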

Page 21: Information Carnivores

Summary of Kushmerick PhD Results

wrapper class   useful?   learnable?
HLRT            57 %      $O(KS^2)$
N-LR            13 %      $O(S^{2K})$
OCLR            53 %      $O(KS^2)$
HOCLRT          57 %      $O(KS^4)$
N-HLRT          50 %      $O(S^{2K+2})$
LR              53 %      $O(KS)$
total           70 %

useful? = “search.com” survey (AltaVista, WebCrawler, WhoWhere, CNN Headlines, Lycos, Shareware.Com, AT&T 800 Directory, ...)
learnable? = time to automatically build wrappers
K = number of attributes
S = size of examples

Page 22: Information Carnivores

“Strong” trainable IE systems
• Examples:
  – CRYSTAL (Soderland et al, 1995)
  – SRV (Freitag, 1999)
  – Rapier (Califf & Mooney, 1999)
• General approach:
  – Define a space of possible extraction rules.
  – Learning = search rule space for set of rules that individually cover many positive examples and few negative examples
  – Sometimes use POS tagging and other shallow linguistic pre-processing

Page 23: Information Carnivores

SRV (Freitag’s CMU PhD)

rule = conjunction of literals
literal = predefined relational encoding of a document

[Figure: an example document and a rule, shown with its English interpretation and its FOL interpretation]

Page 24: Information Carnivores

Learning Curves for Rapier ~ SRV

[Figure: learning curves; x-axis: more training data (job-listings domain)]

Page 25: Information Carnivores

SRV: Pseudo-pseudo-code

procedure SRV(training examples E)
  RuleSet ← {}
  while (E is not empty)
    Rule ← TRUE
    repeat
      let Best be the literal that most improves Rule
        according to an information-theoretic gain metric
      Rule ← Rule ∧ Best
    until no such Best exists
    remove examples covered by Rule from E
    RuleSet ← RuleSet + Rule
  return RuleSet
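The pseudo-code can be made concrete with a toy covering learner. This is only a sketch in the spirit of the procedure, not SRV itself: examples are plain dictionaries, literals are named predicates, and the gain is a FOIL-style measure chosen for illustration (the slides only say "information-theoretic"). All names are hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple

Example = Dict[str, object]
Literal = Tuple[str, Callable[[Example], bool]]   # (name, test)
Rule = List[Literal]                              # conjunction; [] means TRUE


def covers(rule: Rule, x: Example) -> bool:
    return all(test(x) for _, test in rule)


def gain(rule: Rule, lit: Literal, pos: List[Example], neg: List[Example]) -> float:
    """FOIL-style gain of adding lit to rule (an assumed metric)."""
    p0 = sum(covers(rule, x) for x in pos)
    n0 = sum(covers(rule, x) for x in neg)
    new = rule + [lit]
    p1 = sum(covers(new, x) for x in pos)
    n1 = sum(covers(new, x) for x in neg)
    if p1 == 0:
        return 0.0
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))


def learn_rules(pos: List[Example], neg: List[Example],
                literals: List[Literal]) -> List[Rule]:
    rule_set: List[Rule] = []
    pos = list(pos)
    while pos:                                     # while E is not empty
        rule: Rule = []                            # Rule <- TRUE
        while True:
            best = max(literals, key=lambda lit: gain(rule, lit, pos, neg))
            if gain(rule, best, pos, neg) <= 0:
                break                              # no such Best exists
            rule.append(best)                      # Rule <- Rule AND Best
        if not rule and neg:
            break                                  # nothing separates what is left
        pos = [x for x in pos if not covers(rule, x)]
        rule_set.append(rule)
    return rule_set


# Toy task: is a token the start of a speaker name?
literals: List[Literal] = [
    ("cap", lambda x: bool(x["cap"])),
    ("prev=dr", lambda x: x["prev"] == "dr"),
    ("prev=prof", lambda x: x["prev"] == "prof"),
]
pos = [{"cap": True, "prev": "dr"}, {"cap": True, "prev": "dr"},
       {"cap": True, "prev": "prof"}]
neg = [{"cap": True, "prev": "the"}, {"cap": False, "prev": "dr"},
       {"cap": False, "prev": "the"}]

rules = learn_rules(pos, neg, literals)
print([[name for name, _ in rule] for rule in rules])
```

On the toy data the first rule (`cap` and `prev=dr`) covers two positives; those are removed and a second rule (`prev=prof`) is learned for the remaining one, which is the covering behaviour the pseudo-code describes.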

Page 26: Information Carnivores

Covering algorithm: Pseudo-Example

[Figure: nine numbered panels (1 to 9) of positive (+) and negative (-) training examples, illustrating successive steps of the covering algorithm]

Page 27: Information Carnivores

Why am I telling you this?
• “Strong” trainable IE systems explore a complex rule space…
  – Complicated algorithm/implementation
  – Deep & bushy search space
  – Susceptible to overfitting (?)
• Existing algorithms are covering algorithms
  – Other ways to reweight examples (eg, Boosting)
  – Theoretically more satisfying
  – Learned rules are more accurate (?)

If we use a cleverer reweighting scheme, can we get away with simpler rules? Can we do better than the “strong” learner?!

Page 28: Information Carnivores

Boosting
• Boosting (Schapire, Freund, et al)
  – General ML technique for improving the performance of a “weak” learning algorithm: the learner is applied repeatedly, and each time the training data weights are modified to force the weak learner to focus on examples that were previously classified incorrectly
  – Given: Weak Learner L
  – Output: Boosted Learner L* using L as a “subroutine”
• Theorem: Any learning algorithm L with training error below ½ can be mechanically converted into an algorithm L* with error arbitrarily close to 0.
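AdaBoost (Freund & Schapire) is one concrete boosting algorithm of this kind. The sketch below uses a deliberately weak learner, a single threshold on a one-dimensional input, and shows the reweighting loop. The data set and function names are made up for illustration.

```python
import math
from typing import List, Tuple

Stump = Tuple[float, float, int]   # (vote weight alpha, threshold, sign)


def stump_predict(x: float, thr: float, sign: int) -> int:
    return sign if x >= thr else -sign


def best_stump(xs: List[float], ys: List[int], w: List[float]):
    """Weak learner: the threshold/sign pair with the lowest weighted error."""
    best = None
    for thr in sorted(set(xs)):
        for sign in (+1, -1):
            err = sum(wi for wi, x, y in zip(w, xs, ys)
                      if stump_predict(x, thr, sign) != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best


def adaboost(xs: List[float], ys: List[int], rounds: int) -> List[Stump]:
    n = len(xs)
    w = [1.0 / n] * n                               # start with uniform weights
    ensemble: List[Stump] = []
    for _ in range(rounds):
        err, thr, sign = best_stump(xs, ys, w)
        err = min(max(err, 1e-10), 1 - 1e-10)       # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, sign))
        # Increase the weight of misclassified examples, decrease the rest.
        w = [wi * math.exp(-alpha * y * stump_predict(x, thr, sign))
             for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble: List[Stump], x: float) -> int:
    vote = sum(alpha * stump_predict(x, thr, sign) for alpha, thr, sign in ensemble)
    return 1 if vote >= 0 else -1


# No single threshold separates these labels, but three boosted stumps do.
xs = [1, 2, 3, 4, 5, 6]
ys = [+1, +1, -1, -1, +1, +1]
ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

The best single stump misclassifies two of the six points; after three rounds the weighted vote classifies all six correctly.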

Page 29: Information Carnivores

Reweighting Example

[Figure: four plots of weight against training instance, for iterations t=1, t=2, t=3 and t=4, with weak hypotheses h1, h2, h3. Legend: marked instances = instances correctly classified by ht. Annotation: the weak learner will focus on the highlighted instances on iteration t=5.]

Page 30: Information Carnivores

BWI’s extraction patterns
• Basic building block: Boundary detector, made of a “prefix” pattern and a “suffix” pattern
• Detector d matches a boundary B if:
  – the “prefix” pattern matches the tokens to B’s left, and
  – the “suffix” pattern matches the tokens to B’s right
• Associated with every boundary detector d is a numeric confidence value Cd

Example: in “Who: Dr. Jane Smith”, the detector d = [who :][dr . Capitalized] matches the boundary B between “:” and “Dr”. Here Capitalized is a wildcard.

Page 31: Information Carnivores

Wildcards
• “Standard” wildcards
  – Anything
  – Alphabetic
  – Capitalized
  – Lowercase
  – Alphanumeric
  – Numeric
  – Punctuation
  – SingleChar (one-character token)
• Tried several simple “lexical” wildcards
  – Firstname (dictionary of names from US Census)
  – Lastname
  – NonEnglishWord (tokens not in /usr/dict/words)
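Boundary detectors with the standard wildcards are easy to prototype. The sketch below is an illustration only: the tokenizer, the angle-bracket spelling of wildcard names (used to keep them distinct from literal tokens), and the case-insensitive comparison of literals are assumptions, and the lexical wildcards are omitted because they need external dictionaries.

```python
import re
from typing import Callable, Dict, List, Tuple

Detector = Tuple[List[str], List[str]]   # (prefix pattern, suffix pattern)

WILDCARDS: Dict[str, Callable[[str], bool]] = {
    "<Anything>": lambda t: True,
    "<Alphabetic>": str.isalpha,
    "<Capitalized>": lambda t: t[:1].isupper(),
    "<Lowercase>": str.islower,
    "<Alphanumeric>": str.isalnum,
    "<Numeric>": str.isdigit,
    "<Punctuation>": lambda t: all(not c.isalnum() and not c.isspace() for c in t),
    "<SingleChar>": lambda t: len(t) == 1,
}


def tokenize(text: str) -> List[str]:
    # Words and individual punctuation marks become separate tokens (assumed scheme).
    return re.findall(r"\w+|[^\w\s]", text)


def token_matches(pattern_tok: str, tok: str) -> bool:
    test = WILDCARDS.get(pattern_tok)
    return test(tok) if test else pattern_tok == tok.lower()


def matches(detector: Detector, tokens: List[str], b: int) -> bool:
    """Boundary b lies between tokens[b-1] and tokens[b]."""
    prefix, suffix = detector
    left = tokens[max(0, b - len(prefix)):b]
    right = tokens[b:b + len(suffix)]
    return (len(left) == len(prefix) and len(right) == len(suffix)
            and all(map(token_matches, prefix, left))
            and all(map(token_matches, suffix, right)))


tokens = tokenize("Who: Dr. Jane Smith")
d: Detector = (["who", ":"], ["dr", ".", "<Capitalized>"])
print(tokens)
print([b for b in range(len(tokens) + 1) if matches(d, tokens, b)])
```

The detector fires at exactly one boundary, the one between ":" and "Dr", and would fire equally on "Who: Dr. John Doe" because the last suffix position is a wildcard.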

Page 32: Information Carnivores

Detector learning algorithm

Input: training examples
Output: boundary detector d = ⟨p, s⟩

Start with the empty detector d = [][], and grow the detector one token at a time.
Repeat this process until d can’t be improved:
  – Consider all ways to grow the prefix by one token and all ways to grow the suffix by one token
  – Pick the extension that most improves d’s accuracy on the training data.

Page 33: Information Carnivores

Boosted Wrapper Induction
• Wrapper =
  1. Start detectors dS1, dS2, …
  2. End detectors dE1, dE2, …
  3. Length histogram L: [-∞, +∞] → [0, 1]
• To invoke wrapper on a document:
  1. Apply all detectors to entire document
  2. Score every boundary B:

     $$\mathrm{StartScore}(B) = \sum_{i \,:\, d_{S_i}\ \text{matches}\ B} C_{d_{S_i}}$$

     $$\mathrm{EndScore}(B) = \sum_{i \,:\, d_{E_i}\ \text{matches}\ B} C_{d_{E_i}}$$

  3. Extract all substrings (BS, BE) that satisfy

     $$\mathrm{StartScore}(B_S) \cdot \mathrm{EndScore}(B_E) \cdot L(B_E - B_S) > \tau$$

     where $\tau$ is a user-specified confidence threshold.
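The invocation steps can be sketched as below. To keep the example short, a detector is represented only by its confidence and the set of boundaries it matches; how those matches are computed is outside this sketch. The boundary positions, confidences, histogram and threshold are illustrative values.

```python
from typing import Dict, List, Set, Tuple

Detector = Tuple[float, Set[int]]   # (confidence C_d, boundaries that d matches)


def score(detectors: List[Detector], b: int) -> float:
    """Sum the confidences of all detectors that match boundary b."""
    return sum(conf for conf, hits in detectors if b in hits)


def extract(start_detectors: List[Detector], end_detectors: List[Detector],
            length_hist: Dict[int, float], n_boundaries: int,
            tau: float) -> List[Tuple[int, int, float]]:
    """Return (start boundary, end boundary, confidence) for every accepted field."""
    fields = []
    for bs in range(n_boundaries):
        start = score(start_detectors, bs)
        if start == 0:
            continue
        for be in range(bs + 1, n_boundaries):
            conf = start * score(end_detectors, be) * length_hist.get(be - bs, 0.0)
            if conf > tau:
                fields.append((bs, be, conf))
    return fields


# Two start detectors fire at boundary 5, one end detector at boundary 8.
start_detectors = [(0.2, {5}), (0.4, {5})]
end_detectors = [(0.5, {8}), (0.3, {12})]
length_hist = {3: 0.3}               # L(3 tokens) = 0.3; other lengths unseen
fields = extract(start_detectors, end_detectors, length_hist, n_boundaries=15, tau=0.05)
print(fields)
```

Only the pair (5, 8) is accepted, with confidence roughly 0.6 × 0.5 × 0.3 = 0.09; the pair (5, 12) scores zero because a field length of 7 tokens never occurs in the histogram.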

Page 34: Information Carnivores

Extraction Example

Start detectors
  dS1 = [a b c][d e], conf 0.2
  dS2 = [p q][r s t], conf 0.4

End detectors
  dE1 = [w x][y z], conf 0.5
  dE2 = [m][n o], conf 0.3

[Figure: length histogram L over lengths 1 to 4, with bars labeled 0.2 and 0.3; three boundaries in the document marked StartScore=0.6, StartScore=0.4 and StartScore=0.2]

Page 35: Information Carnivores

Extraction Example (continued)

[Figure: same detectors and length histogram; three boundaries in the document marked EndScore=0.3, EndScore=0.5 and EndScore=0.3]

Page 36: Information Carnivores

Extraction Example (continued)

StartScore(SB) = 0.6
EndScore(SE) = 0.5
SE - SB = 3 tokens

StartScore(SB) × EndScore(SE) × L(SE - SB) = 0.6 × 0.5 × 0.3 = 0.09 > τ ?

roughly, “probability that ‘38-44K’ is a correct value”

Page 37: Information Carnivores

BWI Algorithm
• Procedure BWI
  Input: training examples
  Output: Start & end detectors, length histogram
  Parameters:
    – Number of boosting rounds T
    – Lookahead depth L
    – Wildcards

  1. S = Start boundary examples
  2. E = End boundary examples
  3. Start-detectors = AdaBoost(LearnDetector, S)
  4. End-detectors = AdaBoost(LearnDetector, E)
  5. Construct length histogram L from training data

Page 38: Information Carnivores

Example

Page 39: Information Carnivores

Experiments
• 16 IE tasks from 8 document collections
  – 8 fields from 3 “traditional” domains: Seminar announcements; Job listings; Reuters corporate acquisition articles
  – 8 fields from 5 “wrapper” domains: CS department faculty lists; Zagat’s restaurant reviews; LA Times restaurant reviews; Internet Address Finder; Stock quote server
• Performance metrics
  – Precision (fraction of extracted items that are correct)
  – Recall (fraction of items in the documents that were extracted)
  – F1 = 2/(1/Precision + 1/Recall)
• Competitors
  – SRV, Rapier, algorithm based on hidden Markov models
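The three performance metrics listed above are straightforward to compute once extracted and correct items are represented as sets. The sketch below uses made-up items; returning 0.0 when a denominator is empty is a convention chosen here, not something the slides specify.

```python
from typing import Set, Tuple


def precision_recall_f1(extracted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0   # correct / extracted
    recall = correct / len(gold) if gold else 0.0                # correct / present
    if precision == 0.0 or recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 / (1 / precision + 1 / recall)                        # harmonic mean
    return precision, recall, f1


extracted = {"7pm", "room 261 of GSIA", "upstairs", "monday"}
gold = {"7pm", "room 261 of GSIA", "Colin S Osburn"}
p, r, f1 = precision_recall_f1(extracted, gold)
print(p, r, f1)
```

Here two of the four extracted items are correct (precision 0.5) and two of the three true items were found (recall 2/3), giving F1 = 4/7.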

Page 40: Information Carnivores

Results: 16 tasks × 4 algorithms

[Figure: results chart with two regions labeled “21 cases” and “7 cases”]

Page 41: Information Carnivores

Summary & Conclusions
• BWI learns simple wrapper-like extraction patterns; each pattern has high accuracy but low coverage
  – Uses boosting to focus the weak pattern learner on difficult training examples
• Works because a few dozen or hundred (but not millions!) of patterns suffice for broad coverage.
  – Many real-world natural corpora have their own stereotypical language, nongrammatical utterances, stylistic constraints, editorial guidelines, formatting regularities, etc. that greatly simplify extraction
• BWI outperforms 3 competitors in 75% of comparisons