Incorporating Site-Level Knowledge to Extract Structured Data from Web Forums
Jiang-Ming Yang, Rui Cai, Yida Wang, Jun Zhu, Lei Zhang, and Wei-Ying Ma
Web Search & Mining Group
Microsoft Research Asia
2009-04
Web Forum Data
• An important information resource containing a large amount of human knowledge.
• Topics covered include recreation, sports, games, computers, art, society, science, home, and health.
• 20% of the pages in search results are from forums.
Understanding Forum
• Crawling
– WWW’08: iRobot: An Intelligent Crawler for Web Forums
– SIGIR’08: Exploring Traversal Strategy
– KDD’09: Incremental Crawling
• Data Extraction
– WWW’09: Automatic Data Extraction
• Quality Assessment
– SIGIR’09: Quality Assessment
Challenge
• Leverage more site-level knowledge
Sitemap Recovery
Forum Sitemap
• A sitemap is a directed graph consisting of a set of vertices and the links between them.
[Figure: example forum sitemap. Vertices: Home Page, Login Portal, List-of-Board, List-of-Thread, Post-of-Thread, Digest, Browse-by-Tag, Search Result. Legend: Vertex, Link.]
• Rui Cai, Jiangming Yang, Wei Lai, Yida Wang, and Lei Zhang. iRobot: An Intelligent Crawler for Web Forums. In Proceedings of the WWW 2008 Conference.
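A sitemap in this sense can be held in a small adjacency structure. The sketch below is only an illustration: the vertex names come from the figure labels, but the specific links in `build_example` are assumptions, and the `Sitemap` class and its methods are hypothetical names rather than anything from iRobot.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Sitemap:
    """Directed graph: vertices are groups of pages, links are directed edges."""
    links: Dict[str, Set[str]] = field(default_factory=dict)

    def add_vertex(self, name: str) -> None:
        self.links.setdefault(name, set())

    def add_link(self, src: str, dst: str) -> None:
        self.add_vertex(src)
        self.add_vertex(dst)
        self.links[src].add(dst)

    def reachable(self, start: str) -> Set[str]:
        """Return every vertex reachable from `start` (including itself)."""
        seen = {start}
        stack = [start]
        while stack:
            for nxt in self.links[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen


def build_example() -> Sitemap:
    # These edges are assumed for illustration; the original figure is not available.
    sitemap = Sitemap()
    for src, dst in [
        ("Home Page", "List-of-Board"),
        ("List-of-Board", "List-of-Thread"),
        ("List-of-Thread", "Post-of-Thread"),
        ("Home Page", "Login Portal"),
    ]:
        sitemap.add_link(src, dst)
    return sitemap
```

A crawler could use `reachable("Home Page")` to decide which vertices are worth following from the entry page.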
Page Clustering
• Forum pages are generated from a database and templates.
• Page layout is a robust way to describe a template.
– Layout can be characterized by the HTML elements in different DOM paths.
[Figure: panels (a) to (d)]
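To make the DOM-path idea concrete, the sketch below counts the elements found under each DOM path of a page and compares two pages by the overlap of their path sets. The Jaccard comparison is a stand-in chosen for brevity, not the clustering method used in the talk, and `DomPathCounter`, `dom_paths`, and `layout_similarity` are hypothetical names.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class DomPathCounter(HTMLParser):
    """Counts how many elements occur at each DOM path (e.g. html/body/table/tr)."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = Counter()

    def handle_starttag(self, tag, attrs):
        self.paths["/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching tag; tolerates unclosed children.
            while self.stack.pop() != tag:
                pass


def dom_paths(html):
    parser = DomPathCounter()
    parser.feed(html)
    parser.close()
    return parser.paths


def layout_similarity(html_a, html_b):
    """Jaccard similarity between the sets of DOM paths of two pages."""
    paths_a, paths_b = set(dom_paths(html_a)), set(dom_paths(html_b))
    if not paths_a and not paths_b:
        return 1.0
    return len(paths_a & paths_b) / len(paths_a | paths_b)
```

Two pages rendered from the same template share most of their DOM paths even when the number of records differs, which is why the path set rather than the raw HTML is compared.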
Page Clustering
• DOM Path Feature Discovery
• Clustering by Virtual Tables
Link Analysis
[Figure: example links among vertices; labels: 1. Login, 4. Thread List, 5. Thread]
• A Link = URL Pattern + Location
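The definition of a link as a URL pattern plus a location can be sketched as a small value type. Everything here is an assumption made for illustration: the location is represented as a DOM path string, and the URL pattern is obtained by replacing digit runs with a placeholder, which is much simpler than a real pattern-discovery step.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    url_pattern: str  # URL with its variable parts generalized
    location: str     # where the anchor appears in the page (here: a DOM path)


def url_pattern(url: str) -> str:
    """Generalize a URL by replacing each run of digits with a placeholder."""
    return re.sub(r"\d+", "{num}", url)


def make_link(url: str, dom_path: str) -> Link:
    return Link(url_pattern(url), dom_path)
```

Under this sketch, two anchors count as the same link only when both the generalized URL and the location agree, so a thread URL in the thread table and the same URL shape in a sidebar would be kept apart.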
[Figure: panels I, II, III; labels: Inner-page, Inter-vertex, Inter-page; Post pages; Profile pages]
Inner-Page Features
• Inclusion relation: data records usually have inclusion relations.
• Alignment relation: since data is generated from a database and rendered via templates, data records with the same label may appear repeatedly in a page.
• Time order: since post records are generated sequentially along a timeline, post times should be sorted in ascending or descending order.
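The time-order feature above lends itself to a direct check. The following is a minimal sketch that assumes the candidate post times have already been parsed into `datetime` values; `is_time_ordered` is a hypothetical helper name.

```python
from datetime import datetime
from typing import Sequence


def is_time_ordered(times: Sequence[datetime]) -> bool:
    """True if the times are sorted ascending or descending (ties allowed)."""
    pairs = list(zip(times, times[1:]))
    ascending = all(a <= b for a, b in pairs)
    descending = all(a >= b for a, b in pairs)
    return ascending or descending
```

A candidate set of time fields that fails this check is less likely to be the true post-time column of a thread page.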
Inner-vertex Features
Inter-vertex Features
[Figure: a list page and its linked post pages, annotated with List Record and List Title labels and formula numbers (4) to (10)]
Problem Setting
[Figure labels: Author, Title, Content]
Formulas of list page
• Formulas for identifying list record
• Formulas for identifying list title
Formulas of post page
• Formulas for identifying post record
• Formulas for identifying post author
[Figure: post page annotated with Post Record, Author, and Time labels and formula numbers (11) to (16)]
Formulas of post page
• Formulas for identifying post time
• Formulas for identifying post content
[Figure: post records annotated with Time labels, formula numbers (17) to (19), and Content labels, formula numbers (20) to (22)]
Joint inference
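Markov Logic Networks
• An MLN can be viewed as a template for constructing Markov Random Fields.
• With a set of formulas and constants, an MLN defines a Markov network with one node per ground atom and one feature per ground formula. The probability of a state x in such a network is given by:
$$P(X = x) = \frac{1}{Z} \exp\Big(\sum_i w_i \, n_i(x)\Big)$$
where $n_i(x)$ is the number of true groundings of formula $i$ in $x$, $w_i$ is the weight of formula $i$, and $Z$ is the normalizing constant.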
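The state probability can be checked by brute force on a toy network. The sketch below assumes the standard log-linear MLN form, in which a state's probability is proportional to the exponential of the weighted count of true formula groundings. The two ground atoms, the single implication formula, and its weight of 1.5 are hypothetical and are not taken from the talk's formulas.

```python
import itertools
import math
from typing import Callable, Dict, List, Sequence, Tuple

State = Dict[str, bool]
# Each formula is (weight w_i, function returning n_i(x), the count of true groundings).
Formula = Tuple[float, Callable[[State], int]]


def unnormalized(state: State, formulas: Sequence[Formula]) -> float:
    return math.exp(sum(w * n(state) for w, n in formulas))


def probability(state: State, atoms: Sequence[str], formulas: Sequence[Formula]) -> float:
    """P(X = x) with Z computed by enumerating all 2^len(atoms) states."""
    z = 0.0
    for values in itertools.product([False, True], repeat=len(atoms)):
        z += unnormalized(dict(zip(atoms, values)), formulas)
    return unnormalized(state, formulas) / z


# Hypothetical ground atoms for one DOM element e1.
ATOMS: List[str] = ["ContainsTime(e1)", "IsPostRecord(e1)"]


def implies_count(state: State) -> int:
    # Single grounding of: ContainsTime(e) => IsPostRecord(e)
    return int((not state["ContainsTime(e1)"]) or state["IsPostRecord(e1)"])


FORMULAS: List[Formula] = [(1.5, implies_count)]
```

With a positive weight, the one state that violates the implication is less probable than the three that satisfy it, but it is not impossible. Exhaustive enumeration only works for a handful of atoms; real inference relies on approximate methods.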
Markov Logic Networks
• Divide DOM tree elements into three categories:
– Text element
– Hyperlink element
– Inner element
• Benefits
– Reduces the number of possible groundings in inference.
– Reduces ambiguity and achieves better performance.
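The three-way split of DOM elements might be sketched as follows. The slides do not define the categories, so the rules here are assumptions: an `a` element is a hyperlink element, a childless element with non-empty text is a text element, and everything else is an inner element. `Node` and `categorize` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def categorize(node: Node) -> str:
    """Assign one of three categories; the rules are assumed, not from the source."""
    if node.tag == "a":
        return "hyperlink"
    if not node.children and node.text.strip():
        return "text"
    return "inner"
```

Typing elements this way means a formula about, say, post authors only needs to be grounded over the elements of the relevant category, which is one way the number of groundings could shrink.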
Experiments
[Figure: results for List Pages and Post Pages]
Experiments
[Figure: average F1 (y-axis, 0.8 to 1) versus number of sites in the training set (x-axis, 1 to 5 sites); series: List Record, List Title, Post Record, Post Author, Post Time, Post Content]
Experiments
[Figure: average F1 (y-axis, 0.6 to 1) versus number of pages per site in the training set (x-axis, 10 to 50); series: List Record, List Title, Post Record, Post Author, Post Time, Post Content]
Future Work
http://discussions.apple.com/
Conclusion
• A template-independent approach to extracting structured data from web forum sites.
• We can leverage the power of site-level information, such as the mutual information among pages within a vertex or across vertices of the sitemap.
• http://research.microsoft.com/people/jmyang/