Knowledge Discovery in Databases T15: web mining
P. Berka, 2012 1/13
Web mining
“Web Mining is the application of data mining
techniques to discover patterns from the Web”
(Wikipedia)
Three areas:
Web content mining (web as a collection of
documents) – analogy with text mining
Web structure mining (web as a graph)
Web usage mining (web as a space where
users browse through different web sites)
Web content mining
The goal is to find knowledge in web pages
understood as textual documents (i.e. text mining):
search and meta-search (i.e. searching for pages
relevant to the user's query), document categorization
(content-based clustering) or filtering (i.e.
recognizing pages relevant to the user's profile),
mining knowledge “hidden” in pages (information
extraction or query answering).
1. search
2. meta-search
simultaneous access to several (classic) search engines:
exploitation of search engines possibly unknown to
the user
uniform interface
further post-processing of returned information
All-in-one: list of search engines
MetaCrawler: querying all accessible search engines
SavvySearch: selecting the most promising search
engines to be queried
HuskySearch: clustering of retrieved documents
AskJeeves: local database of FAQ + search engines
3. information extraction
named entity recognition
comparison shopping - support of on-line shopping
(find which store offers the best price for a
given product)
Netbot Jango
Web structure mining
the web understood as a graph: nodes are documents (pages)
and edges are connections (links) between pages.
HITS (Kleinberg, 1998)
hubs and authorities
$$a(p) := \sum_{q \to p} h(q)$$
$$h(p) := \sum_{p \to q} a(q)$$
iterative algorithm (the two values are defined recursively
in terms of each other) that computes both values only for
the pages retrieved for a given query (the Clever system).
When finding hubs and authorities, the part of the web
that covers a specific topic can be reduced to a bipartite
graph.
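The mutual update of hub and authority scores can be sketched in a few lines of Python. This is a minimal illustration, not the Clever implementation: the function name `hits`, the dictionary representation of the link graph, the fixed number of iterations and the Euclidean normalization after each step are all assumptions made for the example.

```python
import math

def hits(links, iterations=50):
    """links: dict mapping a page to the list of pages it points to (assumed format)."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # a(p) := sum of h(q) over all pages q that link to p
        auth = {p: 0.0 for p in pages}
        for q, targets in links.items():
            for p in targets:
                auth[p] += hub[q]
        # h(p) := sum of a(q) over all pages q that p links to
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize both score vectors so the values stay bounded (assumed choice).
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth
```

Calling `hits({"h1": ["a1", "a2"], "h2": ["a1"]})` on this small bipartite example gives `a1` the highest authority score and `h1` the highest hub score.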
PageRank (Brin, Page, 1998)
web pages are “ranked”; the rank of a given page is based on
the ranks of the pages pointing to it
$$PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \dots + \frac{PR(T_n)}{C(T_n)} \right)$$
where:
A is the page for which the PageRank is computed
Ti are the pages that point to page A
C(Ti) is the number of links on page Ti
d is the damping factor
iterative algorithm (the rank is defined recursively) that
computes the value for all web pages (Google)
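The PageRank formula above can be evaluated by repeated substitution until the values settle. The sketch below is only an illustration under stated assumptions: the function name `pagerank`, the dictionary graph format, the default d = 0.85, the fixed iteration count and the choice to let pages without outgoing links simply pass on no rank are not taken from the slides.

```python
def pagerank(links, d=0.85, iterations=100):
    """links: dict mapping a page T to the list of pages it links to (assumed format)."""
    pages = set(links) | {a for targets in links.values() for a in targets}
    pr = {p: 1.0 for p in pages}  # assumed starting value for every page
    for _ in range(iterations):
        # PR(A) = (1 - d) + d * sum over pages T pointing to A of PR(T) / C(T)
        new = {p: 1.0 - d for p in pages}
        for t, targets in links.items():
            targets = list(targets)
            if not targets:
                continue  # page without outgoing links contributes nothing (assumption)
            share = d * pr[t] / len(targets)  # len(targets) plays the role of C(T)
            for a in targets:
                new[a] += share
        pr = new
    return pr
```

For the three-page graph `{"A": ["B", "C"], "B": ["C"], "C": ["A"]}` the page C, which is pointed to by both A and B, ends up with the highest rank.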
Web communities
Web usage mining
Web as a space in which users visit various pages
1. web server log analysis (temporal data, sequences
of visited pages)
remotehost rfc931 authuser [date] "request" status bytes
Pre-processing – identifying clickstreams, i.e. sequences
of page views visited by a single user during a single
session
e.g. Discovery Challenge ECML/PKDD 2005
clickstream for type of page: dp,dp,dp,sb,sb
clickstream for product: 124,182,148
segmentation of sold products
bacuslab.pr.mcs.net - - [01/Jan/1997:12:57:45 -0600] "GET /~bacuslab/ HTTP/1.0" 304 0
bacuslab.pr.mcs.net - - [01/Jan/1997:12:57:49 -0600] "GET /~bacuslab/BulletA.gif HTTP/1.0" 304 0
bacuslab.pr.mcs.net - - [01/Jan/1997:12:57:50 -0600] "GET /~bacuslab/Email4.gif HTTP/1.0" 304 0
151.99.190.27 - - [01/Jan/1997:13:06:51 -0600] "GET /~bacuslab HTTP/1.0" 301 -4
151.99.190.27 - - [01/Jan/1997:13:06:52 -0600] "GET /~bacuslab/ HTTP/1.0" 200 1779
151.99.190.27 - - [01/Jan/1997:13:06:54 -0600] "GET /~bacuslab/BLI_Logo.jpg HTTP/1.0" 200 8210
151.99.190.27 - - [01/Jan/1997:13:06:54 -0600] "GET /~bacuslab/BulletA.gif HTTP/1.0" 200 1151
151.99.190.27 - - [01/Jan/1997:13:06:54 -0600] "GET /~bacuslab/Email4.gif HTTP/1.0" 200 3218
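Log records in the format shown above can be split into their fields with a regular expression. The following is a minimal sketch; the pattern name `CLF`, the helper `parse_line` and the group names are assumptions chosen to mirror the field names of the format line, and all values are kept as strings.

```python
import re

# One named group per field of: remotehost rfc931 authuser [date] "request" status bytes
CLF = re.compile(
    r'(?P<remotehost>\S+) (?P<rfc931>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)'
)

def parse_line(line):
    """Return a dict of the log fields, or None if the line does not match."""
    match = CLF.match(line.strip())
    return match.groupdict() if match else None
```

For the first sample record, `parse_line` returns the host `bacuslab.pr.mcs.net`, the request `GET /~bacuslab/ HTTP/1.0` and the status `304`.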
unix time ;IP address ; session ID ; page request; referrer
1074589200;193.179.144.2 ;1993441e8a0a4d7a;/dp/?id=124 ;www.google.cz;
1074589201;194.213.35.234;3995b2c0599f1782;/dp/?id=182 ;
1074589202;194.138.39.56 ;2fd3213f2edaf82b;/ ;www.seznam.cz;
1074589233;193.179.144.2 ;1993441e8a0a4d7a;/dp/?id=148 ;/dp/?id=124;
1074589245;193.179.144.2 ;1993441e8a0a4d7a;/sb/ ;/dp/?id=148;
1074589248;194.138.39.56 ;2fd3213f2edaf82b;/contacts/ ; /;
1074589290;193.179.144.2 ;1993441e8a0a4d7a;/sb/ ;/sb/;
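The pre-processing step, turning such records into one clickstream per session, might look as follows. This is a sketch under assumptions: the helper names `clickstreams`, `page_type` and `product_id` are hypothetical, the page type is taken to be the first path segment, and the product is taken from the `id` query parameter.

```python
import re
from collections import defaultdict

def clickstreams(lines):
    """Group semicolon-separated records by session ID, ordered by unix time."""
    sessions = defaultdict(list)
    for line in lines:
        fields = [f.strip() for f in line.split(";")]
        timestamp, _ip, session_id, page = fields[:4]
        sessions[session_id].append((int(timestamp), page))
    return {
        sid: [page for _, page in sorted(hits, key=lambda h: h[0])]
        for sid, hits in sessions.items()
    }

def page_type(page):
    """First path segment, e.g. '/dp/?id=124' -> 'dp' (assumed convention)."""
    return page.strip("/").split("/")[0]

def product_id(page):
    """Value of the id query parameter, or None if the page has none."""
    match = re.search(r"[?&]id=(\d+)", page)
    return match.group(1) if match else None
```

Applied to the seven records above, session `1993441e8a0a4d7a` yields the page-type clickstream dp, dp, sb, sb and the product clickstream 124, 148.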
associations between visited pages
People using fulltext search check the details of products less
frequently
predicting the next page in the clickstream: is it possible to use the observed sequence A1 A2 … An-1 to
determine the next page An?
Markov model (of order k):
$$P(A_1 A_2 \dots A_n) = \prod_{i=1}^{n} P(A_i \mid A_{i-k} \dots A_{i-1})$$
rules: dp, sb -> sb (0.93)
similarity between sequences
segmentation of visitors
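Next-page prediction with a Markov model can be illustrated for the simplest case, order k = 1, where the next page depends only on the current one. The sketch below estimates transition probabilities from observed clickstreams by counting; the function names `fit_first_order` and `predict_next` are hypothetical, and any training sequences used with it are example data, not results from the slides.

```python
from collections import Counter, defaultdict

def fit_first_order(sequences):
    """Estimate P(next page | current page) from clickstreams by relative frequency."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return {
        current: {page: c / sum(ctr.values()) for page, c in ctr.items()}
        for current, ctr in counts.items()
    }

def predict_next(model, page):
    """Most probable next page, or None if the page was never seen with a successor."""
    dist = model.get(page)
    if not dist:
        return None
    return max(dist, key=dist.get)
```

A higher-order model would condition on the last k pages instead of one, for example by using tuples of k pages as the dictionary keys.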
2. market basket analysis of e-shops
3. recommender systems – systems that recommend
(what to buy, which pages to visit etc.) using the behavior
of a group of similar visitors – collaborative filtering
Amazon
MovieLens
last.fm
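The idea of collaborative filtering, recommending what similar visitors liked, can be sketched as a user-based scheme. This is one common variant chosen for illustration and not the method of any of the systems named above; the function names `cosine` and `recommend`, the rating dictionary format and the use of cosine similarity are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two rating dicts (item -> rating)."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def recommend(ratings, user, top_n=3):
    """Rank items the user has not rated by similarity-weighted ratings of other users."""
    scores, weights = {}, {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        if sim <= 0.0:
            continue  # ignore users with no overlap in rated items
        for item, rating in their.items():
            if item in ratings[user]:
                continue
            scores[item] = scores.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: scores[item] / weights[item] for item in scores}
    return sorted(predicted, key=predicted.get, reverse=True)[:top_n]
```

With hypothetical ratings where `bob` agrees with `ann` on two movies and has also rated two more, `recommend(ratings, "ann")` returns bob's other movies ordered by his rating.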