Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
RDG___Web2.0_Extraction___2008.09.22.ppt
KOM - Multimedia Communications LabProf. Dr.-Ing. Ralf Steinmetz (Director)
Dept. of Electrical Engineering and Information TechnologyDept. of Computer Science (adjunct Professor)
TUD – Technische Universität Darmstadt Merckstr. 25, D-64283 Darmstadt, Germany
Tel.+49 6151 166150, Fax. +49 6151 166152 www.KOM.tu-darmstadt.de
© author(s) of these slides 2008 including research results of the research network KOM and TU Darmstadt otherwise as specified at the respective slide
httc –Hessian Telemedia Technology
Competence-Center e.V - www.httc.de
Dipl. Inform. Renato Dominguez Garcia
[email protected] Tel.+49 6151 165842
13. Oktober 2008
Extraction of Segments from Web 2.0 Pages
GenreDetection
PageSegmentation
SegmentClassification
URL Output Format
KOM – Multimedia Communications Lab 2
KOM Research Program
UbiquitousCommunication
Mobile Networking
Peer-to-PeerNetworking
IT Architectures
E-Le
arni
ng
E-B
usin
ess
& E
-Fin
ance
Com
mun
icat
ion
Serv
ices
& IP
Tel
epho
ny
Wor
kflo
ws
Net
wor
k M
echa
nism
s
Qua
lity
of S
ervi
ce, D
epen
dabi
lity
& S
ecur
ity
Application Areas Fundamentals Research Areas
E-Le
arni
ng
Knowledge MediaFiltering of irrelevant
Information
KOM – Multimedia Communications Lab 3
vs
Motivation
The influence of the internet for the US president election has grown since 2004 (+11%) [1]Primary information source about the US president election for young Americans (18 – 29 years old) is the internet (42%) [1]Each third American read blogs [2]PR professionals recognize importance of blogs [3]
→ Information extraction from blogs can help to understand the public opinion
→ Automatically detection of blogs, wikis and forums may be useful
→ Information extraction from small segments is easier than from large web pages
→ Genre Detection and Information extraction can be used in other fields: Community Mining, Improvement of search results
Source: http://a.abcnews.com/
[1] http://people-press.org/report/384/internets-broader-role-in-campaign-2008[2] DER SPIEGEL 30/2008 of 21.07.2008, Page 94[3] http://www.conversationblog.com/journal/2007/3/18/survey-pr-professionals-recognise-importance-of-blogs-but-do-not-know-how-to-integrate-them-in-their-planning.html
Topic
No of comments
Author
Date
Content
Source: http://www.techcruch.com/
KOM – Multimedia Communications Lab 4
Genre Detection6 Genres
Blogs (Start pages, post pages)WikisForums (Start pages, thread pages)Others
Based on the structure of web pages (Patterns)Machine Learning Techniques
Support Vector Machines 336 FeaturesCorpus: ~ 33000 Web pages
Evaluation1345 Instances (1000 Blogs/Wikis/Forums)87,5 % Correctly classified instances
Our Approach: Genre Detection
GenreDetection
PageSegmentation
SegmentClassification
URL
a b c d e f <-- classified as156 6 1 0 0 37 | a = Blog_Page
15 170 0 1 1 13 | b = Blog_Post1 1 171 0 1 26 | c = Wiki_Page1 0 1 184 4 10 | d = Forum_Page0 0 1 4 185 10 | e = Forum_Thread
13 12 4 5 0 311 | f = Outlier
Number of detected patternsNumber of outer patternsRatio of patterned vs unpatterned CodeLength of patternsOffset before first pattern startsDepth of patterns…
Output Format
KOM – Multimedia Communications Lab 5
Our Features: Examples
Many long patterns
patterns
Many short patterns
Blog’s start page Wiki-page Forum’s start page
All threads share a similar structure …
Few patterns
All posts share a similar structure
title
Set of links
KOM – Multimedia Communications Lab 6
Our Approach: Segmentation
Page segmentation (Four steps)Pre-processing (cleaning HTML)Segmentation based on the hierarchical structure of web PagesVisual-based segmentationFiltering based on heuristics
Segment classificationMachine Learning Techniques
Random Forest139 FeaturesCorpus: ~ 500 instances
EvaluationGenres (blog posts, comments, others)97,2 % correct classified instances
GenreDetection
PageSegmentation
SegmentClassification
URL Output Format
Source: http://www.techcruch.com/ Source: http://www.techcruch.com/