DATA MINING - 1DL105, 1DL111 - Uppsala University were over 11.5 billion web pages in the publicly...

Kjell Orsborn 12/6/07

1UU - IT - UDBL

DATA MINING - 1DL105, 1DL111

Fall 2007

An introductory class in data mining

http://user.it.uu.se/~udbl/dut-ht2007/alt. http://www.it.uu.se/edu/course/homepage/infoutv/ht07

Kjell OrsbornUppsala Database Laboratory

Department of Information Technology, Uppsala University,Uppsala, Sweden

2UU - IT - UDBL

Introduction to Data Mining: Web Mining

(slides + supplemental articles)ref book (used for slides): Data mining / Dunham

Kjell Orsborn

Department of Information TechnologyUppsala University, Uppsala, Sweden

3UU - IT - UDBL

Web Mining Outline

Goal: Examine the use of data mining on the World Wide Web• Introduction• Web Content Mining• Web Structure Mining• Web Usage Mining

4UU - IT - UDBL

Web Mining Issues• Size

– >350 million pages (1999)– Grows at about 1 million pages a day– Google indexes 3 billion documents

• More recent figures:– According to a 2001 study, there were more than 550 billion documents

(approximately 7,500 terabytes of data) on the Web, mostly in the "invisibleweb", or deep web.

– A study, dated January 2005, queried the Google, MSN, Yahoo!, and Ask Jeevessearch engines with search terms from 75 different languages and determined thatthere were over 11.5 billion web pages in the publicly indexable Web, alsotermed the the surface web.

• Diverse types of data

5UU - IT - UDBL

Web data

• Web pages• Intra-page structures• Inter-page structures• Usage data• Supplemental data

– Profiles– Registration information– Cookies

6UU - IT - UDBL

Web Mining Taxonomy

Modified from [zai01]

7UU - IT - UDBL

Web content mining

• Extends work of basic search engines• Search engines

– IR application– Crawlers– Indexing– Profiles– Link analysis

• Text mining functions (from basic to advanced)– Keyword– Term associations– Similarity search (between query and document)– Classification and clustering– Natural language processing

8UU - IT - UDBL

Crawlers• Robot (spider) traverses the hypertext structure in the Web.

– Collect information from visited pages– Used to construct indexes for search engines

• Traditional crawler – visits entire Web (?) and replaces index• Periodic crawler – visits portions of the Web and updates subset

of index• Incremental crawler – selectively searches the Web and

incrementally modifies index• Focused crawler – visits pages related to a particular subject

9UU - IT - UDBL

Focused crawler

• Only visit links from a page if that page is determined to be relevant.• Classifier is static after learning phase.• Components:

– Classifier which assigns relevance score to each page based on crawl topic.– Distiller to identify hub pages.– Crawler visits pages to based on crawler and distiller scores.

• Classifier to related documents to topics• Classifier also determines how useful outgoing links are• Hub Pages contain links to many relevant pages. Must be visited even if not

high relevance score.

10UU - IT - UDBL

Focused crawler

11UU - IT - UDBL

Context focused crawler

• Context Graph:– Context graph created for each seed document.– Root is the seed document.– Nodes at each level show documents with links to documents at next higher level.– Updated during crawl itself.

• Approach:– Construct context graph and classifiers using seed documents as training data.– Perform crawling using classifiers and context graph created.

12UU - IT - UDBL

Context graph

13UU - IT - UDBL

Virtual web view

• Multiple Layered DataBase (MLDB) built on top of the Web.• Each layer of the database is more generalized (and smaller) and centralized

than the one beneath it.• Upper layers of MLDB are structured and can be accessed with SQL type

queries.• Translation tools convert Web documents to XML.• Extraction tools extract desired information to place in first layer of MLDB.• Higher levels contain more summarized data obtained through

generalizations of the lower levels.

14UU - IT - UDBL

Personalization

• Web access or contents tuned to better fit the desires of eachuser.

• Manual techniques identify user’s preferences based on profilesor demographics.

• Collaborative filtering identifies preferences based on ratingsfrom similar users.

• Content based filtering retrieves pages based on similaritybetween pages and user profiles.

15UU - IT - UDBL

Web structure mining• Mine structure (links, graph) of the Web• Techniques

– PageRank– CLEVER– HITS

• Create a model of the Web organization.• May be combined with content mining to more effectively retrieve

important pages.

16UU - IT - UDBL

PageRank• Used by Google• Prioritize pages returned from search by looking at Web structure.• Importance of page is calculated based on number of pages which point to it

– Backlinks.• Weighting is used to provide more importance to backlinks coming form

important pages.

• PR(p) = c (PR(1)/N1 + … + PR(n)/Nn)– PR(i): PageRank for a page i which points to target page p.– Ni: number of links coming out of page i– c: is a value between 0 and 1 used for normalization

17UU - IT - UDBL

CLEVER

• Identify authoritative and hub pages.• Authoritative Pages :

– Highly important pages.– Best source for requested information.

• Hub Pages :– Contain links to highly important pages.

18UU - IT - UDBL

• Hyperlink-Induced Topic Search• Based on a set of keywords, find set of relevant pages – R.• Identify hub and authority pages for these.

– Expand R to a base set, B, of pages linked to or from R.– Calculate weights for authorities and hubs.

• Pages with highest ranks in R are returned.

19UU - IT - UDBL

HITS algorithm

20UU - IT - UDBL

Web usage mining

• Performs mining on web usage data or web logs (clickstreams)– Examined both from a server …

• Uncover info about site where service reside• Can e.g. improve design

– ... and a client perspective• Uncovers info about user or group• Can e.g. improve prefeching and caching

• Applications of web usage mining– Personalization– Improve structure of a site’s Web pages– Aid in caching and prediction of future page references– Improve design of individual pages– Improve effectiveness of e-commerce (sales and advertising)

21UU - IT - UDBL

Web usage mining activities

• Preprocessing Web log– Cleanse– Remove extraneous information– Sessionize

Session: Sequence of pages referenced by one user at a sitting.• Pattern Discovery

– Count patterns that occur in sessions– Pattern is sequence of pages references in session.– Similar to association rules

• Transaction: session• Itemset: pattern (or subset)• Order is important

• Pattern Analysis

22UU - IT - UDBL

Association analysis in web mining

• Web Mining:– Content– Structure– Usage

• Frequent patterns of sequential page references in Websearching.

• Uses:– Caching– Clustering users– Develop user profiles– Identify important pages

23UU - IT - UDBL

Web usage mining issues

• Identification of exact user not possible.• Exact sequence of pages referenced by a user not possible due

to caching.• Session not well defined• Security, privacy, and legal issues

24UU - IT - UDBL

Web log cleansing

• Replace source IP address with unique but non-identifying ID.• Replace exact URL of pages referenced with unique but non-

identifying ID.• Delete error records and records containing not page data (such

as figures and code)

25UU - IT - UDBL

Sessionizing

• Divide Web log into sessions.• Two common techniques:

– Number of consecutive page references from a source IP addressoccurring within a predefined time interval, such as 30 min (empiricalstudies show 25,5 min).

– All consecutive page references from a source IP address where theinterclick time is less than a predefined threshold.

26UU - IT - UDBL

Data structures

• Keep track of patterns identified during Web usage miningprocess

• Common techniques:– Trie– Suffix Tree– Generalized Suffix Tree– WAP Tree

27UU - IT - UDBL

Trie vs. Suffix tree

• Trie:– Rooted tree– Edges labeled which character (page) from pattern– Path from root to leaf represents pattern.

• Suffix tree:– Single child collapsed with parent. Edge contains labels of both prior

edges.

28UU - IT - UDBL

Trie and Suffix tree

29UU - IT - UDBL

Generalized Suffix tree

• Suffix tree for multiple sessions.• Contains patterns from all sessions.• Maintains count of frequency of occurrence of a pattern in the

node.• WAP Tree:

Compressed version of generalized suffix tree

30UU - IT - UDBL

Types of patterns

• Algorithms have been developed to discover different types ofpatterns.

• Properties:– Ordered – Pages (characters) must occur in the exact order in the original

session.– Duplicates – Duplicate pages are allowed in the pattern.– Consecutive – All pages in pattern must occur consecutive in given

session.– Maximal – Not subsequence of another pattern.

31UU - IT - UDBL

Pattern types• Association rules

– None of the properties hold (no order, no duplicates, no consecutive or maximalpatterns)

• Episodes– Only ordering holds

• Sequential patterns– Ordered and maximal

• Forward sequences– Backlinks and reloads eliminated– Ordered, consecutive, and maximal

• Maximal frequent sequences– Support calculated in reference to length of sequence, i.e. no of clicks– All properties hold

32UU - IT - UDBL

Episodes

• Partially ordered set of pages• Serial episode – totally ordered with time constraint• Parallel episode – partial ordered with time constraint• General episode – partial ordered with no time constraint

33UU - IT - UDBL

DAG for Episode

DATA MINING - 1DL105, 1DL111 - Uppsala University were over 11.5 billion web pages in the publicly...

Documents

THE LIGHT OF July, 2018 MERIDIAN · Gary Moore arol Orsborn 8/30 enjamin Mikan Happy Anniversary 8/2 Don & Lori Smith 8/10 Jason & Heather ihonski 8/18 Jim & Emogene Lyon 8/19 Dave

New 2. · 2020. 7. 24. · e-mail: kncowen@tiscali.co.uk John Orsborn 18 The Hills, Reedham, NR13 3TN, 01493 700441 “OUTLOOK” ADVERTISING EDITOR: George Nicholls e-mail ... Kilner:

DATA MINING - 1DL460 · Kjell Orsborn - UDBL - IT - UU! 07/04/14 2 Introduction to Data Mining: Search Engines Technology and Algorithms for Efﬁcient Searching of Information on

Kjell Orsborn 2015-06-28 1 UU - DIS - UDBL DATABASE SYSTEMS - 10p Course No.? A second course on development of database systems Kjell Orsborn Uppsala

Orb Web Tangle Web Spider Web Wonders

Brief Introduction of PHP - Uppsala University · PHP_Xue.ppt Author: Kjell Orsborn Created Date: 4/3/2009 11:41:22 AM

Kjell Orsborn 2015-07-13 1 UU - DIS - UDBL DATABASE SYSTEMS - 10p Course No. 2AD235 Spring 2002 A second course on development of database systems Kjell

E-COMMERCE and SECURITY - 1DL018 · Kjell Orsborn 4/208 U-IT DBL 6 The Different Dimensions of E-commerce Security (E-commerce, Laudon, 3rd ed., 2007) • Integrity – The ability

Informationshanteringssystem - LIMS · Informationshanteringssystem - LIMS ... The LabSoft LIMS Microbiology . Kjell Orsborn - UDBL - IT - UU 2011-02-01 30 BIKA LIMS (open source)

DATABASTEKNIK - 1DL116 · 2004-05-25 · Kjell Orsborn 3/31/04 UU - IT - UDBL 8 Update anomaly - inconsistency SUPP_INFO(SNAME, INAME, SADDR, PRICE) •When you have redundant data

DATA MINING II - 1DL460 fileKjell Orsborn - UDBL - IT - UU 25/02/16 2 Introductory to Spatial Mining • Slides from the book by Dr. M. H. Dunham, Data Mining, Introductory and Advanced

Thing Web Sensor Web The Geospatial Web Mobile Web Web 2.0 ... · Mobile Web Web 2.0 Semantic Web Worldwide Web Mike Liebhold Senior Researcher mliebhold@iftf.org ... mashups ≠geoweb

Comparación Web 1.0 Web 2.0 Web 3.0

Web - Web programming

WEB - Klopotek · WEB Rights Sales Manager WEB WEB Contract Workflow Manager Inventory Manager WEB Suppliers Online WEB Intercompany Publishing Deals WEB Author 360° WEB Contact

MEETING OF THE Audit and Finance Subcommittee · Chair Orsborn will request future AFS agenda items from members and members may provide a report on current events. 9. For information

DATA MINING - 1DL360 · Kjell Orsborn - UDBL - IT - UU! 10/12/12 2 Data Mining Association Analysis: Basic Concepts and Algorithms (Tan, Steinbach, Kumar ch. 6)" Kjell Orsborn! Department

DATABASE DESIGN I - 1DL300 Summer 2008user.it.uu.se/~udbl/dbt-sommar08/slides/dbt05-st08-normalization.pdf · –multi-valued dependencies (4NF) –join dependencies (5NF) Kjell Orsborn

DATA MINING - 1DL105, 1Dl111 · –Risk analysis and management •Forecasting, customer retention, improved underwriting, quality control, competitive analysis –Fraud detection

Web 1.0, Web 2.0 & Web 3.0