6
RDG___Web2.0_Extraction___2008.09.22.ppt KOM - Multimedia Communications Lab Prof. Dr.-Ing. Ralf Steinmetz (Director) Dept. of Electrical Engineering and Information Technology Dept. of Computer Science (adjunct Professor) TUD – Technische Universität Darmstadt Merckstr. 25, D-64283 Darmstadt, Germany Tel.+49 6151 166150, Fax. +49 6151 166152 www.KOM.tu-darmstadt.de © author(s) of these slides 2008 including research results of the research network KOM and TU Darmstadt otherwise as specified at the respective slide httc – Hessian Telemedia Technology Competence-Center e.V - www.httc.de Dipl. Inform. Renato Dominguez Garcia [email protected] Tel.+49 6151 165842 13. Oktober 2008 Extraction of Segments from Web 2.0 Pages Genre Detection Page Segmentation Segment Classification URL Output Format

Extraction of Segments from Web 2.0 Pages€¦ · KOM – Multimedia Communications Lab 3 vs Motivation The influence of the internet for the US president election has grown since

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Extraction of Segments from Web 2.0 Pages€¦ · KOM – Multimedia Communications Lab 3 vs Motivation The influence of the internet for the US president election has grown since

RDG___Web2.0_Extraction___2008.09.22.ppt

KOM - Multimedia Communications LabProf. Dr.-Ing. Ralf Steinmetz (Director)

Dept. of Electrical Engineering and Information TechnologyDept. of Computer Science (adjunct Professor)

TUD – Technische Universität Darmstadt Merckstr. 25, D-64283 Darmstadt, Germany

Tel.+49 6151 166150, Fax. +49 6151 166152 www.KOM.tu-darmstadt.de

© author(s) of these slides 2008 including research results of the research network KOM and TU Darmstadt otherwise as specified at the respective slide

httc –Hessian Telemedia Technology

Competence-Center e.V - www.httc.de

Dipl. Inform. Renato Dominguez Garcia

[email protected] Tel.+49 6151 165842

13. Oktober 2008

Extraction of Segments from Web 2.0 Pages

GenreDetection

PageSegmentation

SegmentClassification

URL Output Format

Page 2: Extraction of Segments from Web 2.0 Pages€¦ · KOM – Multimedia Communications Lab 3 vs Motivation The influence of the internet for the US president election has grown since

KOM – Multimedia Communications Lab 2

KOM Research Program

UbiquitousCommunication

Mobile Networking

Peer-to-PeerNetworking

IT Architectures

E-Le

arni

ng

E-B

usin

ess

& E

-Fin

ance

Com

mun

icat

ion

Serv

ices

& IP

Tel

epho

ny

Wor

kflo

ws

Net

wor

k M

echa

nism

s

Qua

lity

of S

ervi

ce, D

epen

dabi

lity

& S

ecur

ity

Application Areas Fundamentals Research Areas

E-Le

arni

ng

Knowledge MediaFiltering of irrelevant

Information

Page 3: Extraction of Segments from Web 2.0 Pages€¦ · KOM – Multimedia Communications Lab 3 vs Motivation The influence of the internet for the US president election has grown since

KOM – Multimedia Communications Lab 3

vs

Motivation

The influence of the internet for the US president election has grown since 2004 (+11%) [1]Primary information source about the US president election for young Americans (18 – 29 years old) is the internet (42%) [1]Each third American read blogs [2]PR professionals recognize importance of blogs [3]

→ Information extraction from blogs can help to understand the public opinion

→ Automatically detection of blogs, wikis and forums may be useful

→ Information extraction from small segments is easier than from large web pages

→ Genre Detection and Information extraction can be used in other fields: Community Mining, Improvement of search results

Source: http://a.abcnews.com/

[1] http://people-press.org/report/384/internets-broader-role-in-campaign-2008[2] DER SPIEGEL 30/2008 of 21.07.2008, Page 94[3] http://www.conversationblog.com/journal/2007/3/18/survey-pr-professionals-recognise-importance-of-blogs-but-do-not-know-how-to-integrate-them-in-their-planning.html

Topic

No of comments

Author

Date

Content

Source: http://www.techcruch.com/

Page 4: Extraction of Segments from Web 2.0 Pages€¦ · KOM – Multimedia Communications Lab 3 vs Motivation The influence of the internet for the US president election has grown since

KOM – Multimedia Communications Lab 4

Genre Detection6 Genres

Blogs (Start pages, post pages)WikisForums (Start pages, thread pages)Others

Based on the structure of web pages (Patterns)Machine Learning Techniques

Support Vector Machines 336 FeaturesCorpus: ~ 33000 Web pages

Evaluation1345 Instances (1000 Blogs/Wikis/Forums)87,5 % Correctly classified instances

Our Approach: Genre Detection

GenreDetection

PageSegmentation

SegmentClassification

URL

a b c d e f <-- classified as156 6 1 0 0 37 | a = Blog_Page

15 170 0 1 1 13 | b = Blog_Post1 1 171 0 1 26 | c = Wiki_Page1 0 1 184 4 10 | d = Forum_Page0 0 1 4 185 10 | e = Forum_Thread

13 12 4 5 0 311 | f = Outlier

Number of detected patternsNumber of outer patternsRatio of patterned vs unpatterned CodeLength of patternsOffset before first pattern startsDepth of patterns…

Output Format

Page 5: Extraction of Segments from Web 2.0 Pages€¦ · KOM – Multimedia Communications Lab 3 vs Motivation The influence of the internet for the US president election has grown since

KOM – Multimedia Communications Lab 5

Our Features: Examples

Many long patterns

patterns

Many short patterns

Blog’s start page Wiki-page Forum’s start page

All threads share a similar structure …

Few patterns

All posts share a similar structure

title

Set of links

Page 6: Extraction of Segments from Web 2.0 Pages€¦ · KOM – Multimedia Communications Lab 3 vs Motivation The influence of the internet for the US president election has grown since

KOM – Multimedia Communications Lab 6

Our Approach: Segmentation

Page segmentation (Four steps)Pre-processing (cleaning HTML)Segmentation based on the hierarchical structure of web PagesVisual-based segmentationFiltering based on heuristics

Segment classificationMachine Learning Techniques

Random Forest139 FeaturesCorpus: ~ 500 instances

EvaluationGenres (blog posts, comments, others)97,2 % correct classified instances

GenreDetection

PageSegmentation

SegmentClassification

URL Output Format

Source: http://www.techcruch.com/ Source: http://www.techcruch.com/