Web Information Extraction, 3rd Oct 2007


Information Extraction

(Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld, Perry, Subbarao Kambhampati, Bing Liu)


Introduction

The Web is perhaps the single largest data source in the world.

Much of Web (content) mining is about:
- Data/information extraction from semi-structured objects and free text
- Integration of the extracted data/information

Due to the heterogeneity and lack of structure, mining and integration are challenging tasks.

This talk gives an overview.


Introduction

Web mining aims to develop new techniques to extract useful knowledge from the Web

The Web offers unprecedented opportunities and challenges to NLP:
- Huge amount of information, all accessible
- Wide and diverse coverage
- Information of all types: structured tables, text, multimedia data, …
- Semi-structured (HTML)
- Linked
- Redundant
- Noisy: main content, advertisements, navigation panels, copyright notices, …
- Surface Web and Deep Web
  - Surface Web: pages that can be browsed using a web browser
  - Deep Web: databases accessible through parameterized query interfaces
- Services
- Dynamic
- Virtual society: interactions among people, organizations, and systems


Information Extraction (IE)

Identify specific pieces of information (data) in an unstructured or semi-structured textual document.

Transform unstructured information in a corpus of documents or web pages into a structured database.

Applied to different types of text:
- Newspaper articles
- Web pages
- Scientific articles
- Newsgroup messages
- Classified ads
- Medical notes


Information Extraction vs. NLP?

Information extraction attempts to find some of the structure and meaning in web pages that are, hopefully, template-driven.

As IE becomes more ambitious and the text becomes more free-form, IE ultimately becomes equivalent to NLP.

The Web does give one particular boost to NLP: massive corpora.


Subject: US-TN-SOFTWARE PROGRAMMER
Date: 17 Nov 1996 17:37:29 GMT
Organization: Reference.Com Posting Service
Message-ID: <[email protected]>

SOFTWARE PROGRAMMER

Position available for Software Programmer experienced in generating software for PC-Based Voice Mail systems. Experienced in C Programming. Must be familiar with communicating with and controlling voice cards; preferable Dialogic, however, experience with others such as Rhetorix and Natural Microsystems is okay. Prefer 5 years or more experience with PC Based Voice Mail, but will consider as little as 2 years. Need to find a Senior level person who can come on board and pick up code with very little training. Present Operating System is DOS. May go to OS-2 or UNIX in future.

Please reply to:
Kim Anderson
AdNET
(901) 458-2888
[email protected]


Sample Job Posting


Extracted Job Template

computer_science_job
id: [email protected]
title: SOFTWARE PROGRAMMER
salary:
company:
recruiter:
state: TN
city:
country: US
language: C
platform: PC \ DOS \ OS-2 \ UNIX
application:
area: Voice Mail
req_years_experience: 2
desired_years_experience: 5
req_degree:
desired_degree:
post_date: 17 Nov 1996


Amazon Book Description

….</td></tr></table>
<b class="sans">The Age of Spiritual Machines : When Computers Exceed Human Intelligence</b><br>
<font face=verdana,arial,helvetica size=-1>by <a href="/exec/obidos/search-handle-url/index=books&field-author= Kurzweil%2C%20Ray/002-6235079-4593641">Ray Kurzweil</a><br></font><br>
<a href="http://images.amazon.com/images/P/0140282025.01.LZZZZZZZ.jpg"><img src="http://images.amazon.com/images/P/0140282025.01.MZZZZZZZ.gif" width=90 height=140 align=left border=0></a>
<font face=verdana,arial,helvetica size=-1><span class="small"><span class="small"><b>List Price:</b> <span class=listprice>$14.95</span><br>
<b>Our Price: <font color=#990000>$11.96</font></b><br>
<b>You Save:</b> <font color=#990000><b>$2.99 </b>(20%)</font><br></span><p> <br>
…


Extracted Book Template

Title: The Age of Spiritual Machines : When Computers Exceed Human Intelligence
Author: Ray Kurzweil
List-Price: $14.95
Price: $11.96
…


Product information/ Comparison shopping, etc.

- Need to learn to extract info from online vendors
- Can exploit uniformity of layout, and (partial) knowledge of domain by querying with known products
- Early example: Jango Shopbot (Etzioni and Weld)
  - Gives convenient aggregation of online content
- Bug: originally not popular with vendors
  - Make personal agents rather than web services?
- This seems to have changed (e.g., Froogle)


Commercial information…

[Figure: screenshot of a product page with callouts "Need this price", "Title", and "A book, not a toy"]


Information Extraction

Information extraction systems:
- Find and understand the limited relevant parts of texts
  - Clear, factual information (who did what to whom when?)
- Produce a structured representation of the relevant information: relations (in the DB sense)
- Combine knowledge about language and a domain
- Automatically extract the desired information

E.g.:
- Gathering earnings, profits, board members, etc. from company reports
- Learning drug-gene product interactions from medical research literature
- “Smart Tags” (Microsoft) inside documents


Using information extraction to populate knowledge bases

http://protege.stanford.edu/


The European Commission said on Thursday it disagreed with German advice.

Only France and Britain backed Fischler 's proposal .

“What we have to be extremely careful of is how other countries are going to take Germany 's lead”, Welsh National Farmers ' Union ( NFU ) chairman John Lloyd Jones said on BBC radio .


Named Entity Extraction

The task: find and classify names in text, for example:

The purpose:
- … a lot of information is really associations between named entities.
- … for question answering, answers are usually named entities.
- … the same techniques apply to other slot-filling classifications.

The European Commission [ORG] said on Thursday it disagreed with German [MISC] advice.

Only France [LOC] and Britain [LOC] backed Fischler [PER] 's proposal .

“What we have to be extremely careful of is how other countries are going to take Germany 's lead”, Welsh National Farmers ' Union [ORG] ( NFU [ORG] ) chairman John Lloyd Jones [PER] said on BBC [ORG] radio .


CoNLL (2003) Named Entity Recognition task

Task: Predict semantic label of each word in text

Word        POS   Chunk   Entity
Foreign     NNP   I-NP    ORG
Ministry    NNP   I-NP    ORG
spokesman   NN    I-NP    O
Shen        NNP   I-NP    PER
Guofang     NNP   I-NP    PER
told        VBD   I-VP    O
Reuters     NNP   I-NP    ORG
:           :     :       :

Standard evaluation is per entity, not per token.


Precision/Recall/F1 for IE

Recall and precision are straightforward for tasks like IR and text categorization, where there is only one grain size (documents).

The measure behaves a bit oddly for IE/NER when there are boundary errors (which are common):

  First Bank of Chicago announced earnings …

- A boundary error (e.g., a system that selects only "Bank of Chicago" instead of the full name) counts as both a false positive and a false negative
- Selecting nothing would have been better
- Some other systems (e.g., the MUC scorer) give partial credit (according to complex rules)
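A minimal Python sketch may make the boundary-error effect concrete. It assumes, for illustration only, that an entity is a maximal run of tokens sharing the same non-O label and that a predicted entity is correct only when its boundaries and type match exactly. The helper names `spans` and `prf` are invented for this example; this is not the CoNLL or MUC scorer.

```python
def spans(labels):
    """Group consecutive tokens with the same non-O label into (start, end, label) spans."""
    found, start = set(), None
    for i, label in enumerate(labels + ["O"]):  # trailing "O" closes any open span
        if start is not None and label != labels[start]:
            found.add((start, i, labels[start]))
            start = None
        if start is None and label != "O":
            start = i
    return found


def prf(gold, predicted):
    """Exact-match, per-entity counts and scores."""
    g, p = spans(gold), spans(predicted)
    tp, fp, fn = len(g & p), len(p - g), len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return tp, fp, fn, precision, recall, f1


# Tokens: First Bank of Chicago announced earnings
gold = ["ORG", "ORG", "ORG", "ORG", "O", "O"]
partial = ["O", "ORG", "ORG", "ORG", "O", "O"]   # boundary error: "Bank of Chicago"
nothing = ["O"] * 6                              # system selects nothing

partial_counts = prf(gold, partial)[:3]   # (tp, fp, fn)
nothing_counts = prf(gold, nothing)[:3]
```

Under these assumptions the partial match is charged one false positive and one false negative, while selecting nothing is charged only the false negative.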


Template Types

- Slots in a template are typically filled by a substring from the document.
- Some slots may have a fixed set of pre-specified possible fillers that may not occur in the text itself.
  - Terrorist act: threatened, attempted, accomplished
  - Job type: clerical, service, custodial, etc.
  - Company type: SEC code
- Some slots may allow multiple fillers.
  - Programming language
- Some domains may allow multiple extracted templates per document.
  - Multiple apartment listings in one ad


Task: Wrapper Induction

Learning wrappers is wrapper induction.
- Sometimes, the relations are structural.
  - Web pages generated by a database
  - Tables, lists, etc.
- Can’t computers automatically learn the patterns a human wrapper-writer would use?
- Wrapper induction usually targets regular relations which can be expressed by the structure of the document:
  - the item in bold in the 3rd column of the table is the price
- Wrapper induction techniques can also learn:
  - If there is a page about a research project X and there is a link near the word ‘people’ to a page that is about a person Y, then Y is a member of the project X.
  - [e.g., Tom Mitchell’s Web->KB project]


Web Extraction

Many web pages are generated automatically from an underlying database.

Therefore, the HTML structure of pages is fairly specific and regular (semi-structured).

However, output is intended for human consumption, not machine interpretation.

An IE system for such generated pages allows the web site to be viewed as a structured database.

An extractor for a semi-structured web site is sometimes referred to as a wrapper.

Process of extracting from such pages is sometimes referred to as screen scraping.


Web Extraction using DOM Trees

Web extraction may be aided by first parsing web pages into DOM trees.

Extraction patterns can then be specified as paths from the root of the DOM tree to the node containing the text to extract.

May still need regex patterns to identify proper portion of the final CharacterData node.


Sample DOM Tree Extraction

HTML
  HEADER
  BODY
    B
      "Age of Spiritual Machines"
    FONT
      "by"
      A
        "Ray Kurzweil"

(Node types in the figure: Element, CharacterData)

Title: HTML → BODY → B → CharacterData
Author: HTML → BODY → FONT → A → CharacterData
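The two paths above can be tried out with a short Python sketch. It assumes a small, well-formed page so that the standard-library XML parser can stand in for an HTML DOM parser; real pages are rarely well-formed and would need a lenient HTML parser instead. The sample markup and the `paths` dictionary are illustrative, not taken from the slides.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the book page (an assumption of this sketch).
page = (
    "<html><head><title>Book</title></head><body>"
    "<b>Age of Spiritual Machines</b>"
    '<font>by <a href="#">Ray Kurzweil</a></font>'
    "</body></html>"
)

root = ET.fromstring(page)  # root is the <html> element

# Each extraction pattern is a path from the root to the element holding the text.
paths = {"Title": "body/b", "Author": "body/font/a"}

record = {slot: root.find(path).text for slot, path in paths.items()}
```

If the target text shares a node with other text (for example the "by" in front of the author link), a regex over the node's character data is still needed, as noted on the previous slide.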


Wrappers: Simple Extraction Patterns

- Specify an item to extract for a slot using a regular expression pattern.
  - Price pattern: “\$\d+(\.\d{2})?\b”
- May require a preceding (pre-filler) pattern to identify the proper context.
  - Amazon list price:
    - Pre-filler pattern: “<b>List Price:</b> <span class=listprice>”
    - Filler pattern: “\$\d+(\.\d{2})?\b”
- May require a succeeding (post-filler) pattern to identify the end of the filler.
  - Amazon list price:
    - Pre-filler pattern: “<b>List Price:</b> <span class=listprice>”
    - Filler pattern: “.+”
    - Post-filler pattern: “</span>”
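The three pattern styles can be sketched with Python's `re` module. The HTML snippet is copied from the Amazon example earlier in the deck; the variable names are illustrative. The sketch uses a non-greedy `.+?` for the open-ended filler, an assumption made here so that the match stops at the first `</span>`.

```python
import re

html = (
    "<b>List Price:</b> <span class=listprice>$14.95</span><br>"
    "<b>Our Price: <font color=#990000>$11.96</font></b><br>"
)

PRICE = r"\$\d+(?:\.\d{2})?\b"

# 1. Filler pattern alone: finds the first price anywhere on the page.
first_price = re.search(PRICE, html).group(0)

# 2. Pre-filler + filler: the context pins down which price is meant.
pre = re.escape("<b>List Price:</b> <span class=listprice>")
list_price = re.search(pre + "(" + PRICE + ")", html).group(1)

our_pre = re.escape("<b>Our Price: <font color=#990000>")
our_price = re.search(our_pre + "(" + PRICE + ")", html).group(1)

# 3. Pre-filler + generic filler + post-filler: the post-filler marks the end.
post = re.escape("</span>")
list_price_delimited = re.search(pre + "(.+?)" + post, html).group(1)
```

The pre-filler is what separates the list price from "Our Price", since both match the same filler pattern.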


Simple Template Extraction

- Extract slots in order, starting the search for the filler of the (n+1)th slot where the filler for the nth slot ended. This assumes slots always appear in a fixed order:
  - Title
  - Author
  - List price
  - …
- Alternatively, make patterns specific enough to identify each filler always starting from the beginning of the document.
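The in-order strategy can be sketched as follows. The page is a shortened version of the Amazon example, and the slot patterns and the `extract_in_order` helper are illustrative assumptions rather than a reference implementation. Note that the two price slots deliberately share one generic pattern; only the search position distinguishes them.

```python
import re

page = (
    '<b class="sans">The Age of Spiritual Machines</b><br>'
    'by <a href="/author">Ray Kurzweil</a><br>'
    "<b>List Price:</b> <span class=listprice>$14.95</span><br>"
    "<b>Our Price: <font color=#990000>$11.96</font></b>"
)

PRICE = r"(\$\d+(?:\.\d{2})?)"
SLOTS = [
    ("Title", re.compile(r'<b class="sans">(.+?)</b>')),
    ("Author", re.compile(r"by <a [^>]*>(.+?)</a>")),
    ("List-Price", re.compile(PRICE)),
    ("Price", re.compile(PRICE)),
]


def extract_in_order(text, slots):
    """Fill slots in a fixed order; each search resumes where the previous filler ended."""
    record, pos = {}, 0
    for name, pattern in slots:
        match = pattern.search(text, pos)
        if match is None:
            record[name] = None      # leave the slot empty and keep the position
            continue
        record[name] = match.group(1)
        pos = match.end()
    return record


record = extract_in_order(page, SLOTS)
```

If the site reordered the two prices, this sketch would silently swap them, which is the cost of the fixed-order assumption.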


Pre-Specified Filler Extraction

If a slot has a fixed set of pre-specified possible fillers, text categorization can be used to fill the slot.
- Job category
- Company type

Treat each of the possible values of the slot as a category, and classify the entire document to determine the correct filler.


Wrapper induction

Highly regular source documents
→ Relatively simple extraction patterns
→ Efficient learning algorithm



Learning for IE

Writing accurate patterns for each slot for each domain (e.g. each web site) requires laborious software engineering.

Alternative is to use machine learning:
- Build a training set of documents paired with human-produced filled extraction templates.
- Learn extraction patterns for each slot using an appropriate machine learning algorithm.


Use <B>, </B>, <I>, </I> for extraction

<HTML><TITLE>Some Country Codes</TITLE><B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR><B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR></BODY></HTML>

Wrapper induction: Delimiter-based extraction
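A minimal sketch of how such a delimiter-based (LR) wrapper could be executed, assuming the page is a plain string and every row contains all attributes in the same order. The function name `run_lr` is invented for this example.

```python
def run_lr(page, delimiters):
    """Apply an LR wrapper: delimiters is a list of (left, right) pairs, one per attribute."""
    rows, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return rows              # no further row begins: done
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return rows
            row.append(page[start:end])
            pos = end + len(right)
        rows.append(tuple(row))


page = (
    "<HTML><TITLE>Some Country Codes</TITLE>"
    "<B>Congo</B> <I>242</I><BR>"
    "<B>Egypt</B> <I>20</I><BR>"
    "<B>Belize</B> <I>501</I><BR>"
    "<B>Spain</B> <I>34</I><BR></BODY></HTML>"
)

rows = run_lr(page, [("<B>", "</B>"), ("<I>", "</I>")])
```

Each pass of the outer loop consumes one (country, code) row; the scan stops when the first left delimiter can no longer be found.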


Labeled pages → wrapper (l1, r1, …, lK, rK)

Example: find 4 strings (l1, r1, l2, r2) = (<B>, </B>, <I>, </I>)

<HTML><HEAD>Some Country Codes</HEAD><B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR><B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR></BODY></HTML>

Learning LR wrappers


LR: Finding r1

<HTML><TITLE>Some Country Codes</TITLE><B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR><B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR></BODY></HTML>

r1 can be any common prefix of the text that follows each instance of attribute 1, e.g. </B>


LR: Finding l1, l2 and r2

<HTML><TITLE>Some Country Codes</TITLE><B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR><B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR></BODY></HTML>

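r2 can be any common prefix of the text that follows each instance of attribute 2, e.g. </I>

l2 can be any common suffix of the text that precedes each instance of attribute 2, e.g. <I>

l1 can be any common suffix of the text that precedes each instance of attribute 1, e.g. <B>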
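The prefix/suffix idea can be sketched in Python. The sketch computes, for each attribute, the longest common suffix of the text before its instances and the longest common prefix of the text after them; any suffix or prefix of those strings is then a candidate delimiter. It is a simplification: it does not check the further validity constraints a real LR learner applies (for instance, that a right delimiter never occurs inside a value). All function names are illustrative.

```python
import os


def common_prefix(strings):
    return os.path.commonprefix(list(strings))


def common_suffix(strings):
    return common_prefix([s[::-1] for s in strings])[::-1]


def label_offsets(page, rows):
    """Locate each labeled value, in order, and return its (start, end) offsets."""
    spans, pos = [], 0
    for row in rows:
        row_spans = []
        for value in row:
            start = page.index(value, pos)
            pos = start + len(value)
            row_spans.append((start, pos))
        spans.append(row_spans)
    return spans


def learn_lr(page, spans):
    """Return, per attribute, the longest candidate (left, right) delimiters."""
    flat = [span for row in spans for span in row]
    k = len(spans[0])
    wrapper = []
    for attr in range(k):
        befores, afters = [], []
        for i in range(attr, len(flat), k):
            start, end = flat[i]
            prev_end = flat[i - 1][1] if i > 0 else 0
            next_start = flat[i + 1][0] if i + 1 < len(flat) else len(page)
            befores.append(page[prev_end:start])   # text preceding this instance
            afters.append(page[end:next_start])    # text following this instance
        wrapper.append((common_suffix(befores), common_prefix(afters)))
    return wrapper


page = (
    "<HTML><TITLE>Some Country Codes</TITLE>"
    "<B>Congo</B> <I>242</I><BR>"
    "<B>Egypt</B> <I>20</I><BR>"
    "<B>Belize</B> <I>501</I><BR>"
    "<B>Spain</B> <I>34</I><BR></BODY></HTML>"
)
rows = [("Congo", "242"), ("Egypt", "20"), ("Belize", "501"), ("Spain", "34")]

wrapper = learn_lr(page, label_offsets(page, rows))
(l1, r1), (l2, r2) = wrapper
```

The delimiters shown on the slide are members of these candidate sets: `<B>` is a suffix of `l1`, `</B>` a prefix of `r1`, `<I>` a suffix of `l2`, and `</I>` a prefix of `r2`.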


Distracting text in head and tail:

<HTML><TITLE>Some Country Codes</TITLE><BODY><B>Some Country Codes</B><P> <B>Congo</B> <I>242</I><BR> <B>Egypt</B> <I>20</I><BR> <B>Belize</B> <I>501</I><BR> <B>Spain</B> <I>34</I><BR> <HR><B>End</B></BODY></HTML>

A problem with LR wrappers


Head-Left-Right-Tail wrappers

One (of many) solutions: HLRT. Ignore the page’s head and tail.

head: <HTML><TITLE>Some Country Codes</TITLE><BODY><B>Some Country Codes</B><P>   (end of head: <P>)
body: <B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR><B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR>
tail: <HR><B>End</B></BODY></HTML>   (start of tail: <HR>)
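A small sketch contrasting the two wrapper classes on the page with distracting text. It approximates HLRT by cutting the page between the first head delimiter and the next tail delimiter and then running the plain left/right scan on what remains; the helper names and the choice of `<P>` and `<HR>` as head and tail delimiters are assumptions read off the example page.

```python
def scan_lr(text, delimiters):
    """Plain LR scan: repeatedly extract one value per (left, right) pair."""
    rows, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            start = text.find(left, pos)
            if start == -1:
                return rows
            start += len(left)
            end = text.find(right, start)
            if end == -1:
                return rows
            row.append(text[start:end])
            pos = end + len(right)
        rows.append(tuple(row))


def run_hlrt(page, head, tail, delimiters):
    """HLRT sketch: skip past the head delimiter, stop at the tail delimiter."""
    body_start = page.find(head)
    if body_start == -1:
        return []
    body_start += len(head)
    body_end = page.find(tail, body_start)
    if body_end == -1:
        body_end = len(page)
    return scan_lr(page[body_start:body_end], delimiters)


page = (
    "<HTML><TITLE>Some Country Codes</TITLE><BODY><B>Some Country Codes</B><P>"
    "<B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR>"
    "<B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR>"
    "<HR><B>End</B></BODY></HTML>"
)
lr = [("<B>", "</B>"), ("<I>", "</I>")]

lr_rows = scan_lr(page, lr)                    # fooled by the bold heading
hlrt_rows = run_hlrt(page, "<P>", "<HR>", lr)  # head and tail ignored
```

The plain LR scan pairs the bold page heading with the first code and loses "Congo", while the HLRT variant returns the four intended rows.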


More sophisticated wrappers

LR and HLRT wrappers are extremely simple
- Though applicable to many tabular patterns

Recent wrapper induction research has explored more expressive wrapper classes [Muslea et al, Agents-98; Hsu et al, JIS-98; Kushmerick, AAAI-1999; Cohen, AAAI-1999; Minton et al, AAAI-2000]
- Disjunctive delimiters
- Multiple attribute orderings
- Missing attributes
- Multiple-valued attributes
- Hierarchically nested data
- Wrapper verification and maintenance


Boosted wrapper induction

Wrapper induction is only ideal for rigidly-structured machine-generated HTML…

… or is it?!

Can we use simple patterns to extract from natural language documents?

… Name: Dr. Jeffrey D. Hermes …
… Who: Professor Manfred Paul …
... will be given by Dr. R. J. Pangborn …
… Ms. Scott will be speaking …
… Karen Shriver, Dept. of ...
… Maria Klawe, University of ...


BWI: The basic idea

- Learn “wrapper-like” patterns for texts
  - pattern = exact token sequence
- Learn many such “weak” patterns
- Combine with boosting to build a “strong” ensemble pattern
  - Boosting is a popular recent machine learning method where many weak learners are combined
- Demo: http://www.smi.ucd.ie/bwi
- Not all natural text is sufficiently regular for exact string matching to work well!!


Natural Language Processing-based Information Extraction

If extracting from automatically generated web pages, simple regex patterns usually work.

If extracting from more natural, unstructured, human-written text, some NLP may help.
- Part-of-speech (POS) tagging
  - Mark each word as a noun, verb, preposition, etc.
- Syntactic parsing
  - Identify phrases: NP, VP, PP
- Semantic word categories (e.g. from WordNet)
  - KILL: kill, murder, assassinate, strangle, suffocate

Extraction patterns can use POS or phrase tags.
- Crime victim:
  - Prefiller: [POS: V, Hypernym: KILL]
  - Filler: [Phrase: NP]


Stalker: A wrapper induction system (Muslea et al. Agents-99)

E1: 513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515
E2: 90 Colfax, <b>Palms</b>, Phone (800) 508-1570
E3: 523 1st St., <b>LA</b>, Phone 1-<b>800</b>-578-2293
E4: 403 La Tijera, <b>Watts</b>, Phone: (310) 798-0008

We want to extract the area code.

Start rules:
  R1: SkipTo(()
  R2: SkipTo(-<b>)

End rules:
  R3: SkipTo())
  R4: SkipTo(</b>)
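A sketch of how such an ordered rule list could be applied to the four examples, treating each `SkipTo` landmark as a literal string. Two simplifications are assumed here: wildcard landmarks are left out, and end rules are searched forward from the start position, whereas Stalker itself is usually described as applying end rules backward from the end of the text. The names `skip_to` and `extract` are invented for this example.

```python
def skip_to(text, landmark, pos=0):
    """Return the position just past the first occurrence of landmark, or -1."""
    i = text.find(landmark, pos)
    return -1 if i == -1 else i + len(landmark)


def extract(text, start_rules, end_rules):
    """Try start rules in order, then end rules in order; return the item between them."""
    for landmark in start_rules:
        start = skip_to(text, landmark)
        if start != -1:
            break
    else:
        return None                      # no start rule matched
    for landmark in end_rules:
        end = text.find(landmark, start)
        if end != -1:
            return text[start:end]
    return None


examples = [
    "513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515",
    "90 Colfax, <b>Palms</b>, Phone (800) 508-1570",
    "523 1st St., <b>LA</b>, Phone 1-<b>800</b>-578-2293",
    "403 La Tijera, <b>Watts</b>, Phone: (310) 798-0008",
]

start_rules = ["(", "-<b>"]     # R1, R2
end_rules = [")", "</b>"]       # R3, R4

area_codes = [extract(e, start_rules, end_rules) for e in examples]
```

The order of the rules matters: the first rule that matches is used, so the more specific disjunct for each format must fail cleanly on the other format.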


Learning extraction rules

Stalker uses sequential covering to learn extraction rules for each target item.
- In each iteration, it learns a perfect rule that covers as many positive items as possible without covering any negative items.
- Once a positive item is covered by a rule, the whole example is removed.
- The algorithm ends when all the positive items are covered. The result is an ordered list of all learned rules.


Rule induction through an example

Training examples:
E1: 513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515
E2: 90 Colfax, <b>Palms</b>, Phone (800) 508-1570
E3: 523 1st St., <b>LA</b>, Phone 1-<b>800</b>-578-2293
E4: 403 La Tijera, <b>Watts</b>, Phone: (310) 798-0008

We learn a start rule for the area code.
- Assume the algorithm starts with E2. It creates three initial candidate rules with the first prefix symbol and two wildcards:
  - R1: SkipTo(()
  - R2: SkipTo(Punctuation)
  - R3: SkipTo(Anything)
- R1 is perfect. It covers two positive examples but no negative example.


Rule induction (cont …)

E1: 513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515
E2: 90 Colfax, <b>Palms</b>, Phone (800) 508-1570
E3: 523 1st St., <b>LA</b>, Phone 1-<b>800</b>-578-2293
E4: 403 La Tijera, <b>Watts</b>, Phone: (310) 798-0008

- R1 covers E2 and E4, which are removed. E1 and E3 need additional rules.
- Three candidates are created:
  - R4: SkipTo(<b>)
  - R5: SkipTo(HtmlTag)
  - R6: SkipTo(Anything)
- None is good. Refinement is needed.
- Stalker chooses R4 to refine, i.e., to add additional symbols to specialize it.
- It will find R7: SkipTo(-<b>), which is perfect.
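The covering loop can be sketched in Python under strong simplifying assumptions: landmarks are literal character strings taken from the text just before the target (no wildcards such as Punctuation or HtmlTag), and refinement simply means lengthening the landmark one character at a time. This is an illustration of sequential covering, not Stalker's actual search; `learn_start_rules` and its parameters are hypothetical names.

```python
def skip_to(text, landmark):
    """Position just past the first occurrence of landmark, or -1 if absent."""
    i = text.find(landmark)
    return -1 if i == -1 else i + len(landmark)


def learn_start_rules(examples, max_len=6):
    """examples: list of (text, target_start). Returns an ordered list of landmarks."""
    remaining, rules = list(examples), []
    while remaining:
        seed_text, seed_start = remaining[0]
        chosen = None
        # Refinement: grow the landmark leftwards from the target until it is perfect.
        for n in range(1, min(max_len, seed_start) + 1):
            landmark = seed_text[seed_start - n:seed_start]
            hits = [skip_to(text, landmark) for text, _ in remaining]
            correct = sum(h == s for h, (_, s) in zip(hits, remaining))
            wrong = sum(h not in (-1, s) for h, (_, s) in zip(hits, remaining))
            if correct > 0 and wrong == 0:   # perfect: never stops at a wrong place
                chosen = landmark
                break
        if chosen is None:
            raise ValueError("no perfect literal landmark found for seed example")
        rules.append(chosen)
        # Covered examples are removed before the next iteration.
        remaining = [(t, s) for t, s in remaining if skip_to(t, chosen) != s]
    return rules


def labeled(text, area_code):
    return text, text.index(area_code)


# E2 is listed first so the run starts from the same seed as the slides.
training = [
    labeled("90 Colfax, <b>Palms</b>, Phone (800) 508-1570", "800"),
    labeled("513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515", "800"),
    labeled("523 1st St., <b>LA</b>, Phone 1-<b>800</b>-578-2293", "800"),
    labeled("403 La Tijera, <b>Watts</b>, Phone: (310) 798-0008", "310"),
]

rules = learn_start_rules(training)
```

On this data the first iteration settles on `(`, which covers E2 and E4; the second iteration rejects `>`, `b>` and `<b>` because they stop at the city name, and accepts `-<b>`.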


Limitations of Supervised Learning

- Manual labeling is labor intensive and time consuming, especially if one wants to extract data from a huge number of sites.
- Wrapper maintenance is very costly if Web sites change frequently:
  - It is necessary to detect when a wrapper stops working properly.
  - Any change may make existing extraction rules invalid.
  - Re-learning is needed, and most likely manual re-labeling as well.


Road map
- Structured data extraction
  - Wrapper induction
  - Automatic extraction
- Information integration
- Summary


The RoadRunner System (Crescenzi et al. VLDB-01)

- Given a set of positive examples (multiple sample pages). Each contains one or more data records.
- From these pages, generate a wrapper as a union-free regular expression (i.e., no disjunction).
- The approach:
  - To start, a sample page is taken as the wrapper.
  - The wrapper is then refined by solving mismatches between the wrapper and each sample page, which generalizes the wrapper.
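A deliberately small sketch of the refinement step, covering only the simplest kind of mismatch: two different text tokens at the same position are generalized to a `#PCDATA` field. Tag mismatches, which in the full approach lead to optional and repeated parts, are not handled and raise an error here. The two sample pages and the function names are illustrative.

```python
import re


def tokenize(page):
    """Split a page into tag tokens and text tokens, dropping empty pieces."""
    return [t for t in re.split(r"(<[^>]+>)", page) if t.strip()]


def generalize(wrapper, sample):
    """Resolve string mismatches between the current wrapper and a sample page."""
    if len(wrapper) != len(sample):
        raise ValueError("tag mismatch: optionals/iterators are outside this sketch")
    refined = []
    for w, s in zip(wrapper, sample):
        if w == s:
            refined.append(w)
        elif not w.startswith("<") and not s.startswith("<"):
            refined.append("#PCDATA")        # differing text becomes a data field
        else:
            raise ValueError("tag mismatch: optionals/iterators are outside this sketch")
    return refined


page1 = "<html>Books of: <b>John Smith</b></html>"
page2 = "<html>Books of: <b>Paul Jones</b></html>"

wrapper = tokenize(page1)                       # start: the first page is the wrapper
wrapper = generalize(wrapper, tokenize(page2))  # refine against the second page
```

Text that is identical on both pages ("Books of: ") stays in the wrapper as template text, while the varying author name becomes the extracted field.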


Compare with wrapper induction

- No manual labeling, but needs a set of positive pages of the same template
  - which is not necessary for a page with multiple data records
- The result is a wrapper not for data records, but for pages
  - A Web page can have many pieces of irrelevant information.

Issues of automatic extraction:
- Hard to handle disjunctions
- Hard to generate attribute names for the extracted data
- Extracted data from multiple sites need integration, manual or automatic


The DEPTA system (Zhai & Liu WWW-05)

[Figure: screenshot of a product listing page with two marked data regions (Data region 1, Data region 2), each made up of data records]


Align and extract data items (e.g., region1)

image1 | EN7410 17-inch LCD Monitor Black/Dark charcoal |              | $299.99 |                                       | Add to Cart | (Delivery / Pick-Up ) | Penny Shopping | Compare
image2 | 17-inch LCD Monitor                            |              | $249.99 |                                       | Add to Cart | (Delivery / Pick-Up ) | Penny Shopping | Compare
image3 | AL1714 17-inch LCD Monitor, Black              |              | $269.99 |                                       | Add to Cart | (Delivery / Pick-Up ) | Penny Shopping | Compare
image4 | SyncMaster 712n 17-inch LCD Monitor, Black     | Was: $369.99 | $299.99 | Save $70 After: $70 mail-in-rebate(s) | Add to Cart | (Delivery / Pick-Up ) | Penny Shopping | Compare


1. Mining Data Records (Liu et al, KDD-03; Zhai and Liu, WWW-05)

Given a single page with multiple data records (a list page), it extracts data records.

The algorithm is based on:
- two observations about data records in a Web page
- a string matching algorithm (tree matching ok too)

It considers both contiguous and non-contiguous data records.


The Approach

Given a page, three steps:
- Building the HTML tag tree
  - Erroneous tags, unbalanced tags, etc.
  - Some problems are hard to fix
- Mining data regions
  - String matching or tree matching
- Identifying data records

Rendering (or visual) information is very useful in the whole process.


Building tree based on visual cues

Tag               left  right  top  bottom
<table>           100   300    200  400
  <tr>            100   300    200  300
    <td> … </td>  100   200    200  300
    <td> … </td>  200   300    200  300
  </tr>
  <tr>            100   300    300  400
    <td> … </td>  100   200    300  400
    <td> … </td>  200   300    300  400
  </tr>
</table>

The tag tree:
table
  tr
    td
    td
  tr
    td
    td
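One way to read this slide is that nesting can be recovered from the rendered rectangles alone, which sidesteps missing or unbalanced closing tags. The sketch below assumes each opening tag comes with its bounding box, in document order, and attaches every box to the nearest preceding box that contains it. The `Node` class and `build_tree` function are illustrative names, not part of any described system.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    left: int
    right: int
    top: int
    bottom: int
    children: list = field(default_factory=list)

    def contains(self, other):
        return (self.left <= other.left and other.right <= self.right
                and self.top <= other.top and other.bottom <= self.bottom)


def build_tree(boxes):
    """boxes: (tag, left, right, top, bottom) tuples in order of the opening tags."""
    root, stack = None, []
    for tag, left, right, top, bottom in boxes:
        node = Node(tag, left, right, top, bottom)
        # Pop until the top of the stack visually contains the new box.
        while stack and not stack[-1].contains(node):
            stack.pop()
        if stack:
            stack[-1].children.append(node)
        elif root is None:
            root = node
        else:
            raise ValueError("box lies outside the root element")
        stack.append(node)
    return root


def shape(node):
    return (node.tag, [shape(child) for child in node.children])


boxes = [
    ("table", 100, 300, 200, 400),
    ("tr", 100, 300, 200, 300),
    ("td", 100, 200, 200, 300),
    ("td", 200, 300, 200, 300),
    ("tr", 100, 300, 300, 400),
    ("td", 100, 200, 300, 400),
    ("td", 200, 300, 300, 400),
]

tree = build_tree(boxes)
```

Closing tags are never consulted: the second `tr` ends up under `table` simply because neither the preceding `td` nor the first `tr` contains its rectangle.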


Mining Data Regions

[Figure: tag tree with nodes numbered 1 to 20; three groups of adjacent, similar sibling nodes are marked as Region 1, Region 2, and Region 3]


Identify Data Records

A generalized node may not be a data record.

Extra mechanisms are needed to identify true atomic objects (see the papers).

Some highlights: contiguous and non-contiguous data records.

Contiguous layout:
Name 1
Description of object 1
Name 2
Description of object 2
Name 3
Description of object 3
Name 4
Description of object 4

Non-contiguous layout:
Name 1                   | Name 2
Description of object 1  | Description of object 2
Name 3                   | Name 4
Description of object 3  | Description of object 4


2. Extract Data from Data Records

Once a list of data records is identified, we can align and extract the data items in them.

Approaches (align multiple data records):

Multiple string alignment: many ambiguities due to the pervasive use of table-related tags.

Multiple tree alignment (partial tree alignment): effective when used together with visual information.

Most multiple alignment methods work like hierarchical clustering, which is not effective and is very expensive.

Page 57

Tree Matching (tree edit distance)

Intuitively, each node can appear no more than once in a mapping, the order between sibling nodes is preserved, and the hierarchical relation between nodes is also preserved.

[Figure: two example trees, A and B, with a mapping between their nodes.]
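The mapping constraints above (each node used at most once, sibling order preserved, hierarchy preserved) can be illustrated with Simple Tree Matching, a restricted tree-matching algorithm that counts the largest number of matched node pairs. This is a sketch, not necessarily the exact procedure the slides have in mind; the tuple representation `(label, [children])` and the two small trees are assumptions made for the example.

```python
def simple_tree_matching(a, b):
    """Return the size of the largest order-preserving, top-down mapping
    between trees a and b, each given as (label, [children])."""
    label_a, kids_a = a
    label_b, kids_b = b
    if label_a != label_b:
        return 0  # roots differ, so nothing below them may be mapped
    m, n = len(kids_a), len(kids_b)
    # best[i][j]: best score using the first i children of a and first j of b.
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(
                best[i][j - 1],
                best[i - 1][j],
                best[i - 1][j - 1]
                + simple_tree_matching(kids_a[i - 1], kids_b[j - 1]),
            )
    return best[m][n] + 1  # +1 for the matched roots


# Hypothetical trees: p(a, b(c, d)) and p(a, b(d), e).
tree_a = ("p", [("a", []), ("b", [("c", []), ("d", [])])])
tree_b = ("p", [("a", []), ("b", [("d", [])]), ("e", [])])
score = simple_tree_matching(tree_a, tree_b)
```

The matched pairs in this example are p, a, b, and d, so the score is 4. The dynamic program over the children is what enforces the sibling-order constraint.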

Page 58

The Partial Tree Alignment approach

Choose a seed tree: the seed tree, denoted by Ts, is the tree with the maximum number of data items.

Tree matching:

For each unmatched tree Ti (i ≠ s):
    match Ts and Ti; each pair of matched nodes is linked (aligned).
    For each unmatched node nj in Ti:
        expand Ts by inserting nj into Ts if a position for insertion can be uniquely determined in Ts.

The expanded seed tree Ts is then used in subsequent matching.

Page 59

[Figure: Illustration of partial tree alignment. One panel shows trees Ts and Ti where insertion is possible, with the new part of Ts marked; the other panel shows trees Ts and Ti where insertion is not possible.]

Page 60

[Figure: A complete example with three trees T1, T2, and T3, where Ts = T1. Matching T2 against Ts: no node inserted. Matching T3 against Ts: c, h, and k inserted, giving a new Ts. T2 is then matched again against the new Ts.]
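The partial tree alignment procedure can be sketched for the simplest case, where every tree is a root p with a flat list of children. This is an illustration under several assumptions: the child labels of T1, T2, and T3 are reconstructed from the slide's figure (the elided "…" part of T1 is left out), nodes are matched by label with a longest common subsequence, and a run of unmatched nodes is inserted only when its matched neighbors are adjacent in Ts (or it sits at an end of Ts). The function names are hypothetical.

```python
def lcs_pairs(a, b):
    """Index pairs (i, j) of one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def expand_seed(seed, tree):
    """Insert unmatched nodes of `tree` into `seed` where the position is unique.
    Returns (new_seed, all_nodes_placed)."""
    seed_pos = {j: i for i, j in lcs_pairs(seed, tree)}  # tree index -> seed index
    new_seed, offset, all_placed, j = list(seed), 0, True, 0
    while j < len(tree):
        if j in seed_pos:
            j += 1
            continue
        start = j
        while j < len(tree) and j not in seed_pos:
            j += 1  # extend the run of consecutive unmatched nodes
        left = seed_pos[start - 1] if start > 0 else -1
        right = seed_pos[j] if j < len(tree) else len(seed)
        if right - left == 1:  # neighbors adjacent in Ts: position is unique
            at = left + 1 + offset
            new_seed[at:at] = tree[start:j]
            offset += j - start
        else:
            all_placed = False
    return new_seed, all_placed


def partial_align(seed, others):
    """Match each tree against the growing seed; retry trees with leftovers."""
    pending, progress = list(others), True
    while pending and progress:
        progress, retry = False, []
        for tree in pending:
            new_seed, placed = expand_seed(seed, tree)
            progress = progress or new_seed != seed
            seed = new_seed
            if not placed:
                retry.append(tree)
        pending = retry
    return seed


t1 = ["x", "b", "d"]            # seed Ts (assumed labels, "…" omitted)
t2 = ["b", "n", "c", "k", "g"]
t3 = ["b", "c", "d", "h", "k"]
aligned = partial_align(t1, [t2, t3])
```

With these inputs the first pass inserts nothing from T2, inserts c, h, and k from T3, and the second pass over T2 places its remaining nodes, which mirrors the sequence of steps in the example. The seed is passed in explicitly here because the elided part of T1 is not modeled.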

Page 61

Output Data Table

      …   x   b   n   c   d   h   k   g
T1    …   1   1           1
T2            1   1   1           1   1
T3            1       1   1   1   1

DEPTA does not work with nested data records.

NET (Liu & Zhai, WISE-05) extracts data from both flat and nested data records.

Page 62

Some other systems and techniques

IEPAD (Chang & Lui, WWW-01), DeLa (Wang & Lochovsky, WWW-03): These systems treat a page as a long string and find repeated substring patterns. They often produce multiple patterns (rules), and it is hard to decide which is correct.

EXALG (Arasu & Garcia-Molina, SIGMOD-03), (Lerman et al., SIGMOD-04): These require multiple pages to find patterns, which is not necessary for pages with multiple records.

(Zhao et al., WWW-04): It extracts data records in one area of a page.

Page 63

Limitations and issues

Not for a page with only a single data record.

Does not generate attribute names for the extracted data (yet!).

Extracted data from multiple sites need integration.

It is possible in each specific application domain, e.g.,
    products sold online: we need “product name”, “image”, and “price”; identifying only these three fields may not be too hard.
    job postings, publications, etc.

Page 64

Road map

Structured data extraction
    Wrapper induction
    Automatic extraction
Information integration
Summary

Page 65

Web query interface integration

Many integration tasks:
    Integrating Web query interfaces (search forms)
    Integrating extracted data
    Integrating textual information
    Integrating ontologies (taxonomy)
    …

We only introduce integration of query interfaces.
    Many web sites provide forms to query the deep web.
    Applications: meta-search and meta-query

Page 66

Global Query Interface

[Figure: a global query interface over the query interfaces of united.com, airtravel.com, delta.com, and hotwire.com.]

Page 67

Synonym Discovery (He and Chang, KDD-04)

Discover synonym attributes: Author – Writer, Subject – Category

Source schemas:
    S1: author, title, subject, ISBN
    S2: writer, title, category, format
    S3: name, title, keyword, binding

Holistic Model Discovery (over S1, S2, and S3 together): author, name, writer; subject, category

V.S.

Pairwise Attribute Correspondence (over pairs of S1, S2, and S3): S1.author and S3.name; S1.subject and S2.category

Page 68

Schema matching as correlation mining

Across many sources:

Synonym attributes are negatively correlated: synonym attributes are semantic alternatives and thus rarely co-occur in query interfaces.

Grouping attributes are positively correlated: grouping attributes semantically complement each other and thus often co-occur in query interfaces.
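The co-occurrence intuition can be made concrete with a small sketch. It uses the Jaccard coefficient as a simple co-occurrence measure, which is an assumption for illustration and not necessarily the measure used in the cited work. The three schemas are S1, S2, and S3 from the synonym discovery slide; `jaccard` and `never_cooccur` are hypothetical helper names.

```python
from itertools import combinations

schemas = {
    "S1": {"author", "title", "subject", "ISBN"},
    "S2": {"writer", "title", "category", "format"},
    "S3": {"name", "title", "keyword", "binding"},
}


def jaccard(a, b, schemas):
    """Fraction of the schemas containing a or b that contain both."""
    both = sum(1 for s in schemas.values() if a in s and b in s)
    either = sum(1 for s in schemas.values() if a in s or b in s)
    return both / either if either else 0.0


def never_cooccur(schemas):
    """Attribute pairs that never appear together: synonym candidates only."""
    attrs = sorted(set().union(*schemas.values()))
    return [
        (a, b) for a, b in combinations(attrs, 2) if jaccard(a, b, schemas) == 0.0
    ]


candidates = never_cooccur(schemas)
```

With only three schemas, many unrelated pairs (for example author and format) also never co-occur, so the output is a candidate list. The signal becomes meaningful only across many sources, as the slide states.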

Page 69

1. Positive correlation mining as potential groups
    Mining positive correlations: Last Name, First Name

2. Negative correlation mining as potential matchings
    Mining negative correlations: Author = {Last Name, First Name}

3. Matching selection as model construction
    Author (any) = {Last Name, First Name}
    Subject = Category
    Format = Binding

Page 70

A clustering approach to schema matching (Wu et al. SIGMOD-04)

1:1 mapping by clustering
    Bridging effect: “a2” and “c2” might not look similar themselves, but they might both be similar to “b3”.

1:m mappings
    Aggregate and is-a types

User interaction helps in:
    learning of matching thresholds
    resolution of uncertain mappings

Page 71

Find 1:1 Mappings via Clustering

Similarity functions: linguistic similarity, domain similarity

[Figure: the example interfaces, the initial similarity matrix, and the clusters after one merge.]

…, final clusters: {{a1,b1,c1}, {b2,c2}, {a2}, {b3}}
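A minimal sketch of finding 1:1 mappings by agglomerative clustering follows. The similarity values and the 0.5 threshold are invented for illustration (the slide's similarity matrix did not survive), single-link similarity is an assumed choice, and fields from the same interface are never merged because a 1:1 mapping allows at most one field per interface in a cluster. Field names such as a1 encode the interface in their first letter.

```python
def cluster_fields(fields, sim, threshold):
    """Greedy agglomerative clustering of interface fields.
    sim maps unordered field pairs to a similarity in [0, 1]."""

    def pair_sim(x, y):
        return sim.get((x, y), sim.get((y, x), 0.0))

    def link(c1, c2):
        # Two fields of the same interface may not share a cluster.
        if {f[0] for f in c1} & {f[0] for f in c2}:
            return -1.0
        return max(pair_sim(x, y) for x in c1 for y in c2)

    clusters = [frozenset([f]) for f in fields]
    while True:
        best = max(
            (
                (link(c1, c2), i, j)
                for i, c1 in enumerate(clusters)
                for j, c2 in enumerate(clusters)
                if i < j
            ),
            default=(-1.0, 0, 0),
        )
        if best[0] < threshold:
            break  # no remaining pair is similar enough to merge
        _, i, j = best
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return set(clusters)


fields = ["a1", "a2", "b1", "b2", "b3", "c1", "c2"]
# Hypothetical similarities; unlisted pairs default to 0.
sim = {
    ("a1", "b1"): 0.9, ("a1", "c1"): 0.85, ("b1", "c1"): 0.8,
    ("b2", "c2"): 0.7, ("a2", "b3"): 0.4, ("c2", "b3"): 0.45,
}
final_clusters = cluster_fields(fields, sim, threshold=0.5)
```

With these made-up numbers the result has the same shape as the final clusters listed on the slide. A lower threshold would let a2 merge with b3.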

Page 72

Find 1:m Complex Mappings

Aggregate type – the contents of the fields on the many side are parts of the content of the field on the one side

Commonalities – (1) field proximity, (2) parent label similarity, and (3) value characteristics

Page 73

Complex Mappings (Cont’d)

Is-a type – the content of the field on the one side is the sum/union of the contents of the fields on the many side

Commonalities – (1) field proximity, (2) parent label similarity, and (3) value characteristics

Page 74

Instance-based matching via query probing (Wang et al. VLDB-04)

Both query interfaces and returned results (called instances) are considered in matching.

It assumes that a global schema (GS) and a set of instances are given.

It uses each instance value (V) in GS to probe the underlying database and obtain the number of times V appears in the returned results. These counts are used to help matching.

Page 75

Query interface and result page

[Figure: a search interface and its result page. Data attributes shown: Title, Author, Publisher, Publish Date, ISBN, Format.]

Page 76

Road map

Structured data extraction
    Wrapper induction
    Automatic extraction
Information integration
Summary

Page 77

Summary

Gave an overview of two topics:
    Structured data extraction
    Information integration

Some technologies are ready for industrial exploitation, e.g., data extraction.

Simple integration is doable; complex integration still needs further research.