
Knowledge Discovery in Databases T15: web mining

P. Berka, 2012

Web mining

„Web Mining is the application of data mining techniques to discover patterns from the Web“ (Wikipedia)

Three areas:

  • Web content mining (web as a collection of documents), analogous to text mining
  • Web structure mining (web as a graph)
  • Web usage mining (web as a space where users browse through different web sites)


Web content mining

The goal is to find knowledge using web pages understood as textual documents (i.e. text mining):

  • search and meta-search (i.e. search for pages relevant to the user's query),
  • document categorization (content-based clustering) or filtering (i.e. recognizing pages relevant to the user's profile),
  • mining knowledge “hidden” in pages (information extraction or query answering).

1. search

2. meta-search


simultaneous access to several (classic) search engines:

  • exploitation of search engines possibly unknown to the user
  • uniform interface
  • further post-processing of returned information

All-in-one: list of search engines


MetaCrawler: querying all accessible search engines

SavvySearch: selecting the most promising search engines to be queried


HuskySearch: clustering of retrieved documents

AskJeeves: local database of FAQ + search engines


3. information extraction

  • named entity recognition
  • comparison shopping: support of on-line shopping (find which store offers the best price for a given product)

Netbot Jango


Web structure mining

The web is understood as a graph: nodes are documents (pages) and edges are connections (links) between pages.

HITS (Kleinberg, 1998)

hubs and authorities

$$a(p) := \sum_{q \to p} h(q)$$

$$h(p) := \sum_{p \to q} a(q)$$

A recursive algorithm computes both values, but only for the pages retrieved for a given query (system Clever).

When finding hubs and authorities, the part of the web that covers a specific topic can be reduced to a bipartite graph.
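The mutual update of a(p) and h(p) can be sketched as a short iteration. This is a minimal illustration, not the Clever implementation: the function name `hits`, the dictionary representation of the link graph, the fixed number of iterations and the Euclidean normalisation are all assumptions made for the example.

```python
from math import sqrt

def hits(links, iterations=50):
    """links maps each page to the list of pages it links to (assumed input format)."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # a(p) := sum of h(q) over all pages q that link to p
        new_auth = {p: 0.0 for p in pages}
        for q, targets in links.items():
            for p in targets:
                new_auth[p] += hub[q]
        # h(p) := sum of a(q) over all pages q that p links to
        new_hub = {p: sum(new_auth[q] for q in links.get(p, [])) for p in pages}
        # Normalise so the scores stay bounded (assumed: Euclidean norm)
        a_norm = sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return auth, hub

# Toy graph (hypothetical pages): h1 and h2 act as hubs, a1 and a2 as authorities
graph = {"h1": ["a1", "a2"], "h2": ["a1"]}
auth, hub = hits(graph)
```

In the toy graph, a1 is pointed to by both hubs and therefore ends up with the higher authority score, while h1 links to both authorities and ends up with the higher hub score.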


PageRank (Brin, Page, 1998)

web pages are „ranked“; the rank of a given page is based on the ranks of the pages pointing to this page

$$PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \dots + \frac{PR(T_n)}{C(T_n)} \right)$$

where:

  • A is the page for which the PageRank is computed
  • T_i are the pages that point to page A
  • C(T_i) is the number of outgoing links on page T_i
  • d is the damping factor

A recursive algorithm computes the value for all web pages (Google).
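The PageRank recurrence can be iterated until the values settle. The following is a minimal sketch under stated assumptions: the function name `pagerank`, the dictionary graph format, the fixed iteration count and the value d = 0.85 (a commonly quoted choice, not given on the slide) are illustrative, and pages without outgoing links get no special treatment.

```python
def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to (assumed input format)."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    # C(T): number of distinct outgoing links on page T
    out_count = {p: len(set(links.get(p, ()))) for p in pages}
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # PR(A) = (1 - d) + d * sum over pages T pointing to A of PR(T) / C(T)
        new_pr = {p: 1.0 - d for p in pages}
        for t in pages:
            for a in set(links.get(t, ())):
                new_pr[a] += d * pr[t] / out_count[t]
        pr = new_pr
    return pr

# Toy graph (hypothetical pages)
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

With this form of the formula and no dangling pages, the ranks sum to the number of pages; in the toy graph C collects links from both A and B and ranks highest.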

Web communities


Web usage mining

Web as a space in which users visit various pages

1. web server log analysis (temporal data, sequences of visited pages)

remotehost rfc931 authuser [date] "request" status bytes

Pre-processing: identifying clickstreams, i.e. sequences of page views visited by a single user during a single session

e.g. Discovery Challenge ECML/PKDD 2005

  • clickstream for type of page: dp,dp,dp,sb,sb
  • clickstream for product: 124,182,148

segmentation of sold products

bacuslab.pr.mcs.net - - [01/Jan/1997:12:57:45 -0600] "GET /~bacuslab/ HTTP/1.0" 304 0
bacuslab.pr.mcs.net - - [01/Jan/1997:12:57:49 -0600] "GET /~bacuslab/BulletA.gif HTTP/1.0" 304 0
bacuslab.pr.mcs.net - - [01/Jan/1997:12:57:50 -0600] "GET /~bacuslab/Email4.gif HTTP/1.0" 304 0
151.99.190.27 - - [01/Jan/1997:13:06:51 -0600] "GET /~bacuslab HTTP/1.0" 301 -4
151.99.190.27 - - [01/Jan/1997:13:06:52 -0600] "GET /~bacuslab/ HTTP/1.0" 200 1779
151.99.190.27 - - [01/Jan/1997:13:06:54 -0600] "GET /~bacuslab/BLI_Logo.jpg HTTP/1.0" 200 8210
151.99.190.27 - - [01/Jan/1997:13:06:54 -0600] "GET /~bacuslab/BulletA.gif HTTP/1.0" 200 1151
151.99.190.27 - - [01/Jan/1997:13:06:54 -0600] "GET /~bacuslab/Email4.gif HTTP/1.0" 200 3218
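Log entries of the form remotehost rfc931 authuser [date] "request" status bytes can be split into fields with a regular expression. This is a minimal sketch; the names `CLF_PATTERN` and `parse_clf` are illustrative, and the bytes field is kept as text because it is not always a plain number.

```python
import re
from datetime import datetime

# One named group per field of the log format
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<rfc931>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)'
)

def parse_clf(line):
    """Return a dict of fields, or None if the line does not match the format."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Timestamps look like 01/Jan/1997:12:57:45 -0600
    record["date"] = datetime.strptime(record["date"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    return record

line = 'bacuslab.pr.mcs.net - - [01/Jan/1997:12:57:45 -0600] "GET /~bacuslab/ HTTP/1.0" 304 0'
record = parse_clf(line)
```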

unix time; IP address; session ID; page request; referrer

1074589200;193.179.144.2 ;1993441e8a0a4d7a;/dp/?id=124 ;www.google.cz;
1074589201;194.213.35.234;3995b2c0599f1782;/dp/?id=182 ;
1074589202;194.138.39.56 ;2fd3213f2edaf82b;/ ;www.seznam.cz;
1074589233;193.179.144.2 ;1993441e8a0a4d7a;/dp/?id=148 ;/dp/?id=124;
1074589245;193.179.144.2 ;1993441e8a0a4d7a;/sb/ ;/dp/?id=148;
1074589248;194.138.39.56 ;2fd3213f2edaf82b;/contacts/ ; /;
1074589290;193.179.144.2 ;1993441e8a0a4d7a;/sb/ ;/sb/;
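Grouping these semicolon-separated records by session ID yields one clickstream per session. The sketch below is one possible way to do this pre-processing; the function names `clickstreams` and `page_type`, and the rule that the page type is the first path segment, are assumptions made for the example.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def clickstreams(lines):
    """Group page requests by session ID and order them by time."""
    sessions = defaultdict(list)
    for line in lines:
        fields = [f.strip() for f in line.split(";")]
        if len(fields) < 4:
            continue  # skip malformed records
        unix_time, _ip, session_id, page = fields[:4]
        sessions[session_id].append((int(unix_time), page))
    return {
        sid: [page for _, page in sorted(hits, key=lambda h: h[0])]
        for sid, hits in sessions.items()
    }

def page_type(page):
    """Assumed convention: the page type is the first path segment, e.g. dp or sb."""
    path = urlsplit(page).path.strip("/")
    return path.split("/")[0] if path else "/"

sample = [
    "1074589200;193.179.144.2 ;1993441e8a0a4d7a;/dp/?id=124 ;www.google.cz;",
    "1074589201;194.213.35.234;3995b2c0599f1782;/dp/?id=182 ;",
    "1074589202;194.138.39.56 ;2fd3213f2edaf82b;/ ;www.seznam.cz;",
    "1074589233;193.179.144.2 ;1993441e8a0a4d7a;/dp/?id=148 ;/dp/?id=124;",
    "1074589245;193.179.144.2 ;1993441e8a0a4d7a;/sb/ ;/dp/?id=148;",
    "1074589248;194.138.39.56 ;2fd3213f2edaf82b;/contacts/ ; /;",
    "1074589290;193.179.144.2 ;1993441e8a0a4d7a;/sb/ ;/sb/;",
]
streams = clickstreams(sample)
types = [page_type(p) for p in streams["1993441e8a0a4d7a"]]
```

For the seven records shown, session 1993441e8a0a4d7a produces the page sequence /dp/?id=124, /dp/?id=148, /sb/, /sb/, which maps to the page types dp, dp, sb, sb.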


associations between visited pages

People using fulltext search check the details of products less frequently
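An association between visited pages is usually judged by the support and confidence of a rule. The sketch below computes both for a rule over sessions treated as sets of page types. The function name `rule_stats`, the page-type label `search` and the four toy sessions are hypothetical and are not taken from the challenge data.

```python
def rule_stats(sessions, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent.

    Each session is treated as the set of page types it contains.
    """
    total = len(sessions)
    with_antecedent = [s for s in sessions if antecedent <= set(s)]
    with_both = [s for s in with_antecedent if consequent <= set(s)]
    support = len(with_both) / total if total else 0.0
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence

# Hypothetical toy sessions, invented purely for illustration
toy_sessions = [
    ["search", "list"],
    ["search", "dp"],
    ["list", "dp"],
    ["list", "dp", "sb"],
]
search_rule = rule_stats(toy_sessions, {"search"}, {"dp"})
list_rule = rule_stats(toy_sessions, {"list"}, {"dp"})
```

In this toy data the rule search -> dp has a lower confidence than list -> dp, which is the kind of pattern behind a statement such as the one about fulltext search users.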


predicting the next page in the clickstream: is it possible to use the observed sequence A_1 A_2 … A_{n-1} to determine the next page A_n?

  • Markov model

$$P(A_1 A_2 \dots A_n) = \prod_{i=1}^{n} P(A_i \mid A_{i-k} \dots A_{i-1})$$

  • rules, e.g. dp, sb -> sb (0.93)

similarity between sequences

segmentation of visitors
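A k-th order Markov model of this kind can be estimated by counting which page follows each context of k pages. This is a minimal sketch with illustrative names (`fit_markov`, `predict_next`); it skips the first k pages of each sequence instead of padding them, and only the first toy sequence is taken from the text, the other two are hypothetical.

```python
from collections import Counter, defaultdict

def fit_markov(sequences, k=1):
    """Count, for every context of k pages, which page came next."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(k, len(seq)):
            context = tuple(seq[i - k:i])
            counts[context][seq[i]] += 1
    return counts

def predict_next(counts, history, k=1):
    """Return the most frequent next page and its estimated probability."""
    context = tuple(history[-k:])
    following = counts.get(context)
    if not following:
        return None, 0.0  # context never observed
    page, count = following.most_common(1)[0]
    return page, count / sum(following.values())

sequences = [
    ["dp", "dp", "dp", "sb", "sb"],  # clickstream quoted in the text
    ["dp", "sb", "sb"],              # hypothetical
    ["dp", "dp", "sb"],              # hypothetical
]
model = fit_markov(sequences, k=2)
after_dp_sb = predict_next(model, ["dp", "sb"], k=2)
after_dp_dp = predict_next(model, ["dp", "dp"], k=2)
```

The pair returned for the context dp, sb has the same shape as a rule dp, sb -> sb with its confidence; the numbers here come from the toy data only.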


2. market basket analysis of e-shops

3. recommender systems: systems that recommend (what to buy, which pages to visit, etc.) using the behavior of a group of similar visitors (collaborative filtering)

Amazon


MovieLens

last.fm
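Collaborative filtering, as used by recommender systems like those listed, can be illustrated with a small user-based sketch: items are scored by how similar the users who have them are to the target user. The Jaccard similarity, the function names `jaccard` and `recommend`, and the toy baskets are assumptions for the example, not a description of how any of these services work.

```python
def jaccard(a, b):
    """Similarity of two item sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(target, baskets, top_n=3):
    """Recommend items held by similar users that the target does not have yet."""
    mine = baskets[target]
    scores = {}
    for user, items in baskets.items():
        if user == target:
            continue
        similarity = jaccard(mine, items)
        for item in items - mine:
            scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; ties broken alphabetically for a stable order
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, score in ranked[:top_n] if score > 0]

# Hypothetical toy data: user -> set of purchased items
baskets = {
    "u1": {"a", "b"},
    "u2": {"a", "b", "c"},
    "u3": {"d"},
}
suggestions = recommend("u1", baskets)
```

Here u2 shares two items with u1, so item c is suggested, while item d comes only from a user with no overlap and is dropped.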