Information Extraction Research @ Yahoo! Labs Bangalore
Rajeev Rastogi
Yahoo! Labs Bangalore
The most visited site on the internet
• 600 million+ users per month
• Super popular properties
  – News, finance, sports
  – Answers, flickr, del.icio.us
  – Mail, messaging
  – Search
Unparalleled scale
• 25 terabytes of data collected each day
  – Over 4 billion clicks every day
  – Over 4 billion emails per day
  – Over 6 billion instant messages per day
• Over 20 billion web documents indexed
• Over 4 billion images searchable
No other company on the planet processes as much data as we do!
Yahoo! Labs Bangalore
• Focus is on basic and applied research
  – Search
  – Advertising
  – Cloud computing
• University relations
  – Faculty research grants
  – Summer internships
  – Sharing data/computing infrastructure
  – Conference sponsorships
  – PhD co-op program
What does search look like today?
Search results of the future: Structured abstracts
[Figure: structured abstracts for results from yelp.com, babycenter, epicurious, answers.com, webmd, New York Times, Gawker]
Rank by price
Search results of the future: Intelligent ranking
A key technology for enabling search transformation
Information extraction (IE)
Reviews
Information extraction (IE)
• Goal: Extract structured records from Web pages
[Figure labels: Name, Address, Category, Phone, Price, Map]
Multiple verticals
• Business, social networking, video, ….
[Figure labels: Price, Category, Address, Phone, Price]
One schema per vertical
[Figure labels: Name, Title, Education, Connections; Posted by, Title, Date, Rating, Views]
IE on the Web is a hard problem
• Web pages are noisy
• Pages belonging to different Web sites have different layouts
Noise
Web page types
Template-based Hand-crafted
Template-based pages
• Pages within a Web site are generated using scripts and have very similar structure
  – Can be leveraged for extraction
• ~30% of crawled Web pages
• Information rich, frequently appear in the top results of search queries
• E.g. search query: “Chinese Mirch New York”
  – 9 template-based pages in the top 10 results
Wrapper Induction
[Figure: wrapper induction pipeline]
Learn: Website pages → Sample → Sample pages → Annotate pages → Annotations → Learn wrappers → XPath rules
Extract: Website pages → Apply wrappers (XPath rules) → Records
• Enables extraction from template-based pages
Example
XPath: /html/body/div/div/div/div/div/div/span
Generalize → /html/body//div//span
Filters
• Apply filters to prune from multiple candidates that match XPath expression
XPath: /html/body//div//span
Regex Filter (Phone): \([0-9]{3}\) [0-9]{3}-[0-9]{4}
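A minimal sketch of the XPath-plus-filter idea, assuming a tiny well-formed sample page and Python's standard-library `xml.etree.ElementTree`. The page content, the helper names `candidates` and `extract_phone`, and the hyphenated phone format are illustrative assumptions, not the wrapper system described in the talk. ElementTree only accepts relative paths, so the generalized XPath is written relative to the `<html>` root.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample page (well-formed so the stdlib XML parser can read it).
PAGE = """<html><body>
  <div><div><span>Chinese Mirch</span></div></div>
  <div><span>120 Lexington Avenue</span><span>(212) 532-3663</span></div>
</body></html>"""

# Generalized XPath from the slide, made relative to the <html> root.
GENERAL_XPATH = "./body//div//span"

# Phone filter: (ddd) ddd-dddd
PHONE_RE = re.compile(r"\([0-9]{3}\) [0-9]{3}-[0-9]{4}")


def candidates(root, xpath):
    """Return the nodes matching xpath, in document order, without repeats.

    With nested <div> elements ElementTree can yield the same <span> more
    than once for a '//div//span' path, so duplicates are dropped here.
    """
    return list(dict.fromkeys(root.findall(xpath)))


def extract_phone(page_xml):
    """Apply the generalized XPath, then keep only candidates that pass the regex filter."""
    root = ET.fromstring(page_xml)
    texts = [(node.text or "").strip() for node in candidates(root, GENERAL_XPATH)]
    return [text for text in texts if PHONE_RE.fullmatch(text)]


if __name__ == "__main__":
    print(extract_phone(PAGE))
```

The generalized path matches every span on the sample page, which is why the filter is needed to prune the candidates down to the phone value.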
Limitations of wrappers
• Won’t work across Web sites due to different page layouts
• Scaling to thousands of sites can be a challenge
  – Need to learn a separate wrapper for each site
  – Annotating example pages from thousands of sites can be time-consuming & expensive
Research challenge
• Unsupervised IE: Extract attribute values from pages of a new Web site without annotating a single page from the site
• Only annotate pages from a few sites initially as training data
Conditional Random Fields (CRFs)
• Models conditional probability distribution of label sequence y = y_1,…,y_n given input sequence x = x_1,…,x_n

\[
P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_{t=1}^{|\mathbf{x}|} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, \mathbf{x}, t) \right)
\]

  – f_k: features, λ_k: weights
• Choose λ_k to maximize log-likelihood of training data
• Use Viterbi algorithm to compute label sequence y with highest probability
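The Viterbi decoding step can be sketched as follows for a linear-chain model. This is a toy illustration under stated assumptions: the label set, the three binary features, and the weights in `WEIGHTS` are made up for the example and are not learned from data or taken from the talk. Because Z(x) does not depend on y, maximizing the sum of weighted features is enough to find the most probable label sequence.

```python
import re

PHONE_RE = re.compile(r"\(\d{3}\) \d{3}-\d{4}")
LABELS = ["Name", "Phone", "Noise"]

# Hypothetical weights (lambda_k), keyed by the binary feature f_k they multiply.
WEIGHTS = {
    ("is_phone", "Phone"): 3.0,      # token looks like a phone number, label is Phone
    ("is_first", "Name"): 2.0,       # first token on the page, label is Name
    ("trans", "Name", "Phone"): 1.0,  # transition Name -> Phone
    ("trans", "Name", "Name"): -2.0,  # discourage two Names in a row
}


def score(prev, cur, x, t):
    """Sum over k of lambda_k * f_k(prev, cur, x, t); prev is None at t = 0."""
    s = 0.0
    if PHONE_RE.fullmatch(x[t]):
        s += WEIGHTS.get(("is_phone", cur), 0.0)
    if t == 0:
        s += WEIGHTS.get(("is_first", cur), 0.0)
    if prev is not None:
        s += WEIGHTS.get(("trans", prev, cur), 0.0)
    return s


def viterbi(x, labels=LABELS):
    """Return a highest-scoring label sequence for the token sequence x."""
    if not x:
        return []
    # best[y] = score of the best labeling of x[0..t] that ends in label y
    best = {y: score(None, y, x, 0) for y in labels}
    back = []  # back[t-1][y] = best previous label when position t has label y
    for t in range(1, len(x)):
        new_best, ptr = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[p] + score(p, y, x, t))
            ptr[y] = prev
            new_best[y] = best[prev] + score(prev, y, x, t)
        best = new_best
        back.append(ptr)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[y])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


if __name__ == "__main__":
    print(viterbi(["Chinese Mirch", "(212) 532-3663"]))
```

The dynamic program visits each position once and considers every pair of adjacent labels, so its cost grows linearly with the sequence length rather than exponentially.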
CRFs-based IE
Name
Category
Address
Phone
Noise
• Web pages can be viewed as labeled sequences
• Train CRF using pages from few Web sites
• Then use trained CRF to extract from remaining sites
Drawbacks of CRFs
• Require too many training examples
• Have been used previously to segment short strings with similar structure
• However, may not work too well across Web sites that
  – contain long pages with lots of noise
  – have very different structure
An alternate approach that exploits site knowledge
• Build attribute classifiers for each attribute
  – Use pages from a few initial Web sites
• For each page from a new Web site
  – Segment page into sequence of fields (using static repeating text)
  – Use attribute classifiers to assign attribute labels to fields
• Use constraints to disambiguate labels
  – Uniqueness: an attribute occurs at most once in a page
  – Proximity: attribute values appear close together in a page
  – Structural: relative positions of attributes are identical across pages of a Web site
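One way to picture how constraints disambiguate classifier output is the brute-force sketch below. It assumes each field already carries hypothetical per-label classifier scores and enforces only a uniqueness constraint and a precedence constraint; the proximity and structural constraints are not modeled, and the function name `best_labeling` and all scores are assumptions for illustration rather than the method used in the talk.

```python
from itertools import product


def best_labeling(fields, precedence=(("Name", "Category"),), noise="Noise"):
    """Pick the highest-scoring label assignment that satisfies the constraints.

    fields: list (in page order) of {label: classifier score} dicts.
    precedence: pairs (a, b) meaning attribute a must appear before b.
    Every label except `noise` may be used at most once (uniqueness).
    """
    best, best_score = None, float("-inf")
    for labels in product(*(f.keys() for f in fields)):
        # Uniqueness: an attribute occurs at most once in a page.
        if any(labels.count(l) > 1 for l in set(labels) if l != noise):
            continue
        # Precedence: a must come before b whenever both are present.
        if any(a in labels and b in labels and labels.index(a) > labels.index(b)
               for a, b in precedence):
            continue
        total = sum(f[l] for f, l in zip(fields, labels))
        if total > best_score:
            best, best_score = labels, total
    return best


# Hypothetical classifier scores for the four fields of one restaurant page.
PAGE_FIELDS = [
    {"Name": 0.6, "Category": 0.7, "Noise": 0.1},  # "21 Club"
    {"Category": 0.8, "Name": 0.5, "Noise": 0.1},  # "American"
    {"Address": 0.9, "Noise": 0.1},                # "21 W 52nd St, New York, NY 10019"
    {"Phone": 0.9, "Noise": 0.1},                  # "(212) 582 7200"
]

if __name__ == "__main__":
    print(best_labeling(PAGE_FIELDS))
```

Taken independently, the classifiers would label both of the first two fields Category; the uniqueness constraint forces one of them to take another label, and the precedence constraint rules out the Category-then-Name ordering. Enumerating all assignments is only practical for a handful of fields.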
Attribute classifiers + constraints example
Page1: Chinese Mirch | Chinese, Indian | 120 Lexington Avenue, New York, NY 10016 | (212) 532 3663
Page2: Jewel of India | Indian | 15 W 44th St, New York, NY 10016 | (212) 869 5544
Page3: 21 Club | American | 21 W 52nd St, New York, NY 10019 | (212) 582 7200
[Figure: candidate labels assigned by the attribute classifiers to the fields of Page3: Category; Category, Name; Name; Name, Noise; Address; Address; Phone; Phone]
Uniqueness constraint: Name
Precedence constraint: Name < Category
Page3 after applying constraints:
Name: 21 Club
Category: American
Address: 21 W 52nd St, New York, NY 10019
Phone: (212) 582 7200
Performance evaluation: Datasets
• 100 pages from 5 restaurant Web sites with very different structure
  – www.citysearch.com
  – www.fromers.com
  – www.nymag.com
  – www.superpages.com
  – www.yelp.com
• Extract attributes: Name, Address, Phone num, Hours of operation, Description
Methods considered
• CRFs, attribute classifiers + constraints
• Features
  – Lexicon: Words in the training Web pages
  – Regex: isAlpha, isAllCaps, isNum, is5DigitNum, isDay, …
  – Attribute-level: Num of words, Overlap with title, …
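The regex-style features named above could be computed roughly as follows. The exact definitions used in the experiments are not given in the slides, so each test below (for example, what counts as a day name) is an assumption, and `regex_features` is a hypothetical helper name.

```python
import re

# Assumed set of day names and abbreviations; the original list is not specified.
DAYS = {
    "mon", "tue", "wed", "thu", "fri", "sat", "sun",
    "monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday",
}


def regex_features(token):
    """Return a dict of binary surface features for a single token."""
    return {
        "isAlpha": token.isalpha(),
        "isAllCaps": token.isalpha() and token.isupper(),
        "isNum": token.isdigit(),
        "is5DigitNum": re.fullmatch(r"\d{5}", token) is not None,
        "isDay": token.lower().rstrip(".") in DAYS,
    }


if __name__ == "__main__":
    for tok in ["NY", "10016", "Mon.", "Lexington"]:
        print(tok, regex_features(tok))
```

Features like these depend only on the shape of a token, not on the words seen in training, which is what lets them carry over to pages from a site that was never annotated.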
Evaluation methodology
• Metrics
  – Precision, recall, F1 for attributes
• Test on one site, use pages from remaining 4 sites as training data
• Average measures over all 5 sites
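The leave-one-site-out protocol can be sketched as below. The data layout (sets of `(page, attribute, value)` triples) and the `train_and_extract` callback are assumptions made for the example; any extractor that trains on some sites and predicts on held-out pages could be plugged in.

```python
def prf1(predicted, gold):
    """Precision, recall and F1 over sets of (page, attribute, value) triples."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def leave_one_site_out(sites, train_and_extract):
    """Test on each site in turn, training on the rest, then average the metrics.

    sites: {site_name: (pages, gold_triples)}
    train_and_extract(train_sites, test_pages) -> set of predicted triples
    """
    per_site = []
    for held_out, (pages, gold) in sites.items():
        train = {name: data for name, data in sites.items() if name != held_out}
        per_site.append(prf1(train_and_extract(train, pages), gold))
    n = len(per_site)
    return tuple(sum(m[i] for m in per_site) / n for i in range(3))


if __name__ == "__main__":
    # Hypothetical two-site dataset and a dummy extractor that always predicts "x".
    sites = {
        "site_a": (["p"], {("p", "Name", "x")}),
        "site_b": (["q"], {("q", "Name", "y")}),
    }
    dummy = lambda train, pages: {(page, "Name", "x") for page in pages}
    print(leave_one_site_out(sites, dummy))
```

Averaging per-site scores gives each site equal weight regardless of how many pages it contributes.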
Experimental results
           Precision           Recall
           CRF    Constraint   CRF    Constraint
Name       .39    1            .34    1
Phone      .02    1            .2     .99
Address    .01    .81          .16    .83
Hours      .22    1            .36    1
Desc       .13    .25          0      .15
Overall    .15    .81          .21    .76
Other IE scenarios: Browse page extraction
Similar-structured records
IE big picture/taxonomy
• Things to extract from
  – Template-based, browse, hand-crafted pages, text
• Things to extract
  – Records, tables, lists, named entities
• Techniques used
  – Structure-based (HTML tags, DOM tree paths) – e.g. Wrappers
  – Content-based (attribute values/models) – e.g. dictionaries
  – Structure + Content (sequential/hierarchical relationships among attribute values) – e.g. hierarchical CRFs
• Level of automation
  – Manual, supervised, unsupervised