CSE5230 - Data Mining, 2002 Lecture 11.1 Data Mining - CSE5230 Web Mining CSE5230/DMS/2002/11


CSE5230 - Data Mining, 2002 Lecture 11.1

Data Mining - CSE5230

Web Mining

CSE5230/DMS/2002/11


Lecture Outline

How big is the web?
What is “web data”?
A taxonomy of web mining tasks
Example: targeted advertising
Example: personalization
References


How big is the web?

It is not easy to determine the size of the web
    In 1999, one estimate was that there were approximately 350 million web pages, growing at about 1 million pages per day
    In 2001, Google announced that they were indexing around 3 billion web documents
    No matter which of these is more accurate – it’s very big!
We can view the web as the world’s biggest database
    The word “database” is used loosely here, because the web has no real formal structure or database schema
        » This makes the application of data mining to the web potentially very useful, but also difficult


What is “web data”?

Web data can be classified as follows [Dun2002]:
    The actual content of web pages (text, images, multimedia)
    Intrapage structure – the HTML or XML mark-up specifying the organization of the page content
    Interpage structure – the links into and out of web pages
    Usage data describing how the users of a web site access pages – navigation patterns
    User profiles – these can include demographic data obtained from a registration process, or perhaps IP addresses. They can also include information found in cookies


A taxonomy of web mining tasks (1)

From [Dun2002], following [Zai1999]:

Web Mining
    Web Content Mining
        Web Page Content Mining
        Search Result Mining
    Web Structure Mining
    Web Usage Mining
        General Access Pattern Tracking
        Customized Usage Tracking


A taxonomy of web mining tasks (2)

Web content mining
    Examines the contents of web pages (text, graphics)
    Examines the results of web searches
        » Mining systems built on top of existing search engines
    Similar to traditional information retrieval (text categorization, text filtering, etc.)
        » Often goes further than simple keyword search – e.g. may cluster similar pages
Web structure mining
    Looks at page structure
        » e.g. text in <H1> tags may be more important
    Looks at links between pages
        » e.g. pages with many incoming links may be more useful
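The two structure-mining ideas above (heading text and incoming links) can be sketched with Python's standard library. This is only an illustration under stated assumptions: the three-page site in `PAGES`, the class `StructureParser` and the function `analyse` are hypothetical names invented here, and a raw incoming-link count is a much cruder signal than the link analysis a real search engine would use.

```python
from collections import Counter
from html.parser import HTMLParser


class StructureParser(HTMLParser):
    """Collects <h1> text and outgoing link targets from one HTML page."""

    def __init__(self):
        super().__init__()
        self.h1_text = []
        self.links = []
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names, so <H1> and <h1> both match.
        if tag == "h1":
            self._in_h1 = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1 and data.strip():
            self.h1_text.append(data.strip())


def analyse(pages):
    """Return (headings per page, incoming-link count per page)."""
    headings = {}
    in_links = Counter()
    for name, html in pages.items():
        parser = StructureParser()
        parser.feed(html)
        parser.close()
        headings[name] = parser.h1_text
        # Count each (source, target) pair once and ignore self-links.
        for target in set(parser.links):
            if target != name:
                in_links[target] += 1
    return headings, in_links


# Hypothetical three-page site, keyed by page name.
PAGES = {
    "index.html": '<H1>Rugby news</H1><a href="scores.html">Scores</a>'
                  '<a href="teams.html">Teams</a>',
    "teams.html": '<h1>Teams</h1><a href="scores.html">Scores</a>',
    "scores.html": '<h1>Latest scores</h1><a href="index.html">Home</a>',
}

headings, in_links = analyse(PAGES)
print(headings["index.html"])   # heading text that could be weighted more heavily
print(in_links.most_common())   # pages ordered by number of incoming links
```

In this toy site, `scores.html` is linked from both other pages, so it comes first in the ranking.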


A taxonomy of web mining tasks (3)

Web usage mining
    Looks at log files of web access
    General access tracking looks at history of pages visited
    Customised usage tracking may be focused on particular kinds of usage, or particular users
    Involves mining of sequential patterns
        » Can use association rule discovery, or HMMs
        » These patterns can be clustered to reveal users with similar access behaviour
    Can be used to
        » improve web site design
        » customize presentation via collaborative filtering
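A minimal sketch of the first steps of web usage mining: splitting a log into sessions and counting which page-to-page transitions recur. The log format, the 30-minute timeout and all names (`LOG`, `sessions_from_log`, `frequent_pairs`) are assumptions made for illustration. The counting step is plain frequent-pair counting with a minimum support, which is far simpler than full association rule discovery or an HMM.

```python
from collections import Counter, defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds of inactivity that end a session (assumed)

# Hypothetical, already-parsed log records: (user id, time in seconds, page).
LOG = [
    ("u1", 0, "/home"), ("u1", 40, "/scores"), ("u1", 95, "/teams"),
    ("u2", 10, "/home"), ("u2", 70, "/scores"),
    ("u3", 20, "/home"), ("u3", 50, "/teams"),
    ("u1", 5000, "/home"), ("u1", 5030, "/scores"),
]


def sessions_from_log(records, timeout=SESSION_TIMEOUT):
    """Split each user's page views into sessions separated by long gaps."""
    by_user = defaultdict(list)
    for user, t, page in sorted(records, key=lambda r: (r[0], r[1])):
        by_user[user].append((t, page))
    sessions = []
    for views in by_user.values():
        current = [views[0][1]]
        for (prev_t, _), (t, page) in zip(views, views[1:]):
            if t - prev_t > timeout:
                sessions.append(current)
                current = []
            current.append(page)
        sessions.append(current)
    return sessions


def frequent_pairs(sessions, min_support):
    """Return consecutive page pairs seen in at least min_support sessions."""
    counts = Counter()
    for pages in sessions:
        # A set, so a pair counts once per session however often it repeats.
        counts.update(set(zip(pages, pages[1:])))
    return {pair: n for pair, n in counts.items() if n >= min_support}


sessions = sessions_from_log(LOG)
patterns = frequent_pairs(sessions, min_support=2)
print(sessions)
print(patterns)
```

The surviving pairs are candidate navigation patterns. A site designer might, for instance, add a direct link for a transition that many sessions share.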


Example: targeted advertising (1)

In marketing, targeting is any technique used to direct marketing or advertising effort to the portion of the population thought to be most valuable to the business, e.g. those
    Likely to purchase
    Likely to spend a lot
The business wants to avoid spending money on sending advertising to people who will not respond to it
In the web context, this can mean displaying an ad for a web site on a different web site
    Can use web usage information to work out what kind of people use a site: target demographics
    Sell advertising to companies wanting to target that demographic


Example: targeted advertising (2)

For example, the Rugby Heaven web site (http://rugbyheaven.smh.com.au/) is today hosting advertising for:
    MLC life insurance
    Fintrack Financial Services
    Business Review Weekly (BRW)
These advertisers appear to think that this site is likely to be popular with older people who have money!
The URL for the BRW ad is:
    http://campaigns.f2.com.au/event.ng/Type=click&FlightID=10928&AdID=24947&TargetID=2389&Segments=2,13,23,31,35,77,81,88,93,94,153,855,976,993,1145,1301,1989,2320,2389,2394,2396,2477,2534,2576,2581,2689&Targets=535,2389,40,60,1834&Values=25,31,43,48,50,60,72,81,91,100,110,135,150,157,233,239,366,422,605,791,804,805,806,1203,1278,1403,1432,1476,1485,1499&RawValues=&Redirect=http:%2F%2Fwww.brw.com.au%2Fsubscription%2Fsubscribe.asp

It is clear that some sophisticated targeting is going on
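To see what the ad URL carries, it can be pulled apart with Python's `urllib.parse`. This is a sketch only: the function name `parse_ad_url` is hypothetical, and the meaning of fields such as `Segments`, `Targets` and `Values` is not documented in the lecture, so the code merely splits them into lists of integers without interpreting them.

```python
from urllib.parse import parse_qsl, urlsplit

AD_URL = (
    "http://campaigns.f2.com.au/event.ng/Type=click&FlightID=10928"
    "&AdID=24947&TargetID=2389"
    "&Segments=2,13,23,31,35,77,81,88,93,94,153,855,976,993,1145,1301,"
    "1989,2320,2389,2394,2396,2477,2534,2576,2581,2689"
    "&Targets=535,2389,40,60,1834"
    "&Values=25,31,43,48,50,60,72,81,91,100,110,135,150,157,233,239,366,"
    "422,605,791,804,805,806,1203,1278,1403,1432,1476,1485,1499"
    "&RawValues=&Redirect=http:%2F%2Fwww.brw.com.au%2Fsubscription%2Fsubscribe.asp"
)


def parse_ad_url(url):
    """Split the key=value pairs that follow 'event.ng/' in the URL path."""
    # There is no '?' in this URL, so the parameters are part of the path.
    path = urlsplit(url).path
    _, _, param_string = path.partition("event.ng/")
    # keep_blank_values keeps the empty RawValues field; values are
    # percent-decoded, which turns %2F back into '/'.
    params = dict(parse_qsl(param_string, keep_blank_values=True))
    for key in ("Segments", "Targets", "Values"):
        raw = params.get(key, "")
        params[key] = [int(v) for v in raw.split(",") if v]
    return params


params = parse_ad_url(AD_URL)
print(params["Redirect"])
print(len(params["Segments"]), "segment ids,", len(params["Values"]), "value ids")
```

The `TargetID` value also appears in both the `Segments` and `Targets` lists, which is consistent with the ad having been selected by matching the visitor against a set of segment identifiers, though the URL alone cannot confirm how the matching works.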


Example: personalization (1)

Personalization spans the areas of web content mining and web usage mining

Personalization aims to modify document contents or access patterns to better match the preferences of a particular user

Personalization can involve
    Dynamically creating and serving web pages that are unique to an individual user
    Determining which pages to retrieve or link to on a user-by-user basis


Example: personalization (2)

Unlike targeting, personalization is done on the target web page itself (rather than through a targeted advertisement placed on another site)
    Simple example: including the name of the user in the page content
Personalization techniques include
    Use of cookies
    Use of user databases
    Use of web usage patterns to identify similar users (for use in collaborative filtering)
Often requires a user to log in – this part is not data mining
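The simple example above (putting the user's name in the page) can be sketched using a cookie and a user database. Everything specific here is an assumption for illustration: the cookie name `user_id`, the in-memory `USER_DB` dictionary and the function `greeting_for` are hypothetical, and a real site would authenticate the user rather than trust a bare id in a cookie.

```python
from html import escape
from http.cookies import SimpleCookie

# Hypothetical user database keyed by the id stored in the cookie.
USER_DB = {"42": {"name": "Alice"}}


def greeting_for(cookie_header):
    """Build a greeting from the Cookie header sent by the browser."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get("user_id")
    user = USER_DB.get(morsel.value) if morsel else None
    if user is None:
        # Unknown or missing cookie: fall back to the generic page.
        return "<p>Welcome! Please log in.</p>"
    # Escape the name so that it cannot inject mark-up into the page.
    return f"<p>Welcome back, {escape(user['name'])}!</p>"


print(greeting_for("user_id=42; theme=dark"))
print(greeting_for(""))
```

As the slide notes, this lookup step involves no data mining. Mining enters when usage patterns are used to decide what else to show the identified user.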


Example: personalization (3)

A classic example of personalization is recommending to a user
    A product very similar to something they have bought before (if the web site is selling something)
    Content that is similar to something they have used before
Personalization techniques can be based on clustering, classification or even prediction
    With classification, the desires of a user are determined based on the class to which he/she is assigned. Classes may be predetermined by experts.
    With clustering, clusters of users with similar navigation or purchasing behaviour are found, and the user’s desires are determined on this basis
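The contrast between classification and clustering can be sketched on a toy data set of per-section page-view counts. All data and names (`SECTIONS`, `VISITS`, `classify`, `kmeans`) are hypothetical. The classifier stands in for classes predetermined by an expert, here simply "the section the user visits most", and the clustering is a plain k-means seeded with the first k points, chosen for brevity rather than as the method used in [Dun2002].

```python
import math

SECTIONS = ("sport", "finance", "travel")

# Hypothetical page-view counts per user, one number per entry in SECTIONS.
VISITS = {
    "ann": [5, 0, 1],
    "bob": [4, 1, 0],
    "cat": [0, 6, 1],
    "dan": [1, 5, 0],
}


def classify(counts):
    """Classification: assign the predetermined class of the top section."""
    return SECTIONS[counts.index(max(counts))]


def kmeans(points, k, iterations=20):
    """Clustering: plain k-means, seeded with the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


users = list(VISITS)
labels = kmeans([VISITS[u] for u in users], k=2)
clusters = dict(zip(users, labels))
classes = {u: classify(v) for u, v in VISITS.items()}
print(classes)
print(clusters)
```

The classifier needs the class definitions in advance, whereas k-means finds the two groups from the data alone and leaves their interpretation to the analyst.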


Example: personalization (4)

Amazon.com makes use of personalization, as we will see in an on-line example
    They make use of the user’s past behaviour
    They also use collaborative filtering – they recommend products bought by users who have similar profiles to the current user
        » Could use clustering, or information filtering techniques
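A minimal sketch of user-based collaborative filtering, the general idea described above. It is not Amazon's actual algorithm, which the lecture does not describe. The purchase data and the names `PURCHASES`, `jaccard` and `recommend` are hypothetical, and Jaccard similarity over purchase sets is just one simple choice of profile similarity.

```python
from collections import Counter

# Hypothetical purchase histories: user -> set of products bought.
PURCHASES = {
    "ann": {"book_a", "book_b", "cd_x"},
    "bob": {"book_a", "book_b", "book_c"},
    "cat": {"cd_x", "cd_y"},
    "dan": {"book_b", "book_c", "dvd_z"},
}


def jaccard(a, b):
    """Similarity of two purchase sets: shared items over all items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, purchases, top_n=2):
    """Rank unseen products by the summed similarity of their buyers."""
    own = purchases[user]
    scores = Counter()
    for other, items in purchases.items():
        if other == user:
            continue
        weight = jaccard(own, items)
        # Only products the current user has not bought can be recommended.
        for item in items - own:
            scores[item] += weight
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, score in ranked[:top_n] if score > 0]


print(recommend("ann", PURCHASES))
```

Here `book_c` ranks first for `ann` because it was bought by the two users whose purchases overlap with hers on books, and the most similar user (`bob`) contributes the largest weight.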


References

[Dun2002] Margaret H. Dunham, Data Mining: Introductory and Advanced Topics, Prentice Hall, Upper Saddle River, NJ, USA, 2002, pp. 195-220.

[Zai1999] Osmar R. Zaïane, Resource and Knowledge Discovery from the Internet and Multimedia Repositories, PhD Thesis, Simon Fraser University, Canada, March 1999.