WEB STRUCTURE MINING SUBMITTED BY: BLESSY JOHN R7A ROLL



DESCRIPTION

Web mining is the application of data mining techniques to the Web, for example in search engines. Data mining is the process of discovering useful knowledge from data sources. Web mining automatically discovers and extracts information from Web documents. Web structure mining discovers useful data from hyperlinks. The credit for this presentation goes to my friend Blessy; it is uploaded with her permission.


Page 1: web structure mining

WEB STRUCTURE MINING

SUBMITTED BY: BLESSY JOHN

R7A ROLL NO:18

Page 2: web structure mining

INTRODUCTION

Web mining is the application of data mining techniques to the Web, for example in search engines.

Data mining is the process of discovering useful knowledge from data sources.

Web mining automatically discovers and extracts information from Web documents.

Web structure mining discovers useful data from hyperlinks.

Page 3: web structure mining

WEB MINING

Extraction of useful patterns from WWW resources

The WWW is a widely distributed, global information service centre that constitutes a rich source for data mining

Employs techniques from data mining, information retrieval, etc.

Page 4: web structure mining

NEED FOR WEB MINING

Aims at finding and extracting relevant information that is hidden in web-related data.

The challenge is to bring back the semantics of hypertext documents

To turn web data into web knowledge

Page 5: web structure mining

CLASSIFICATION

WEB MINING

WEB CONTENT MINING

WEB USAGE MINING

WEB STRUCTURE MINING

Page 6: web structure mining

WEB STRUCTURE MINING

Generates a structural summary of the Web site and its Web pages

Uses graph theory to analyse the node and connection structure of a web site

Analyses the link structure of the web; its purpose is to identify the more preferable documents
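
As a minimal sketch of the graph view described above, pages can be treated as nodes and hyperlinks as directed edges. The function name `degree_summary` and the toy site below are assumptions made for illustration, not part of the original slides.

```python
from collections import defaultdict

def degree_summary(links):
    """Return {page: (in_degree, out_degree)} for a list of (source, target) links."""
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    pages = set()
    for src, dst in links:
        out_deg[src] += 1   # src has one more outgoing hyperlink
        in_deg[dst] += 1    # dst has one more incoming hyperlink
        pages.update((src, dst))
    return {p: (in_deg[p], out_deg[p]) for p in sorted(pages)}

# Hypothetical site: each pair is a hyperlink from the first page to the second.
site_links = [
    ("/", "/about"),
    ("/", "/products"),
    ("/products", "/products/a"),
    ("/products", "/products/b"),
    ("/products/a", "/"),
    ("/about", "/"),
]
summary = degree_summary(site_links)
for page, (n_in, n_out) in summary.items():
    print(page, "in:", n_in, "out:", n_out)
```

Pages with many incoming links are candidates for the "preferable documents" mentioned above; pages with no outgoing links are leaves of the site hierarchy.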

Page 7: web structure mining

WEB STRUCTURE MINING cont…..

Discovering the nature of the hierarchy of hyperlinks in the website and its structure

A hyperlink indicates the author's endorsement of the linked web page

Retrieving information about the relevance and the quality of the web page.

Page 8: web structure mining

Page Layout and Link Analysis for Web Images

Page 9: web structure mining

WEB BASICS

The web is a huge collection of documents linked together by references.

References from one document to another are based on hypertext and embedded in HTML

HTML describes how the document should be displayed in the browser window

Each web document has a web address, called a URL, that identifies it uniquely.
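
The references embedded in HTML can be pulled out with Python's standard library. The sketch below is illustrative only: the class name `LinkExtractor`, the sample page and the example URLs are assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href target of every <a> tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative references against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

# Hypothetical page content and address.
page = ('<html><body><a href="/about.html">About</a> '
        '<a href="http://example.org/x">X</a></body></html>')
links = extract_links(page, "http://example.com/index.html")
print(links)
```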

Page 10: web structure mining

WEB CRAWLERS

Collect “all” web documents by browsing the Web systematically and exhaustively

The region of the web to be crawled can be specified by using the URL structure.

Used by search engines to provide local access to the most recent versions of possibly all web pages
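
A crawler of this kind can be sketched as a breadth-first traversal that only follows URLs inside a chosen region. To keep the example self-contained, the "web" here is an in-memory dictionary; `crawl`, `toy_web` and the `.test` addresses are hypothetical, and a real crawler would fetch pages over HTTP and respect robots.txt.

```python
from collections import deque

def crawl(seed, fetch_links, url_prefix):
    """Breadth-first crawl from seed, restricted to URLs starting with url_prefix."""
    visited = []
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            # The URL structure defines the region to crawl.
            if link.startswith(url_prefix) and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

# Hypothetical web: each URL maps to the links found on that page.
toy_web = {
    "http://site.test/": ["http://site.test/a", "http://other.test/"],
    "http://site.test/a": ["http://site.test/b", "http://site.test/"],
    "http://site.test/b": [],
    "http://other.test/": ["http://site.test/c"],
}
order = crawl("http://site.test/", lambda u: toy_web.get(u, []), "http://site.test/")
print(order)
```

The page on `other.test` is never visited, so the link it contains is never discovered; this is the effect of restricting the crawl by URL prefix.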

Page 11: web structure mining

INDEXING AND KEYWORD SEARCH

There are two types of data: structured and unstructured

Structured data have keys associated with each data item that reflect its content

Keyword search is content-based access to unstructured data that does not consider the meaning of the content
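
One common way to support keyword search is an inverted index that maps each word to the documents containing it. The slides do not name a specific structure, so the sketch below is an assumption; `build_index`, `search` and the three sample documents are made up for illustration.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lower-cased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical document collection.
docs = {1: "web structure mining", 2: "web usage mining", 3: "data mining"}
index = build_index(docs)
hits = search(index, "web mining")
print(hits)
```

The match is purely on the presence of words, which is what is meant by access "without considering the meaning".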

Page 12: web structure mining

DOCUMENT REPRESENTATION

To facilitate the process of matching keywords and documents, some preprocessing steps are taken first:

1. Documents are tokenized
2. Characters are converted to upper or lower case
3. Words are reduced to canonical form
4. Stopwords are usually removed
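
The four steps above can be sketched as follows. The stopword list and the crude plural-stripping rule in `canonical` are assumptions chosen to keep the example short; a real system would typically use a proper stemmer or lemmatizer.

```python
import re

# Assumed, deliberately tiny stopword list.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in"}

def canonical(word):
    """Very crude canonical form: strips simple plural endings only."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def preprocess(text):
    tokens = re.findall(r"[A-Za-z]+", text)          # 1. tokenize
    tokens = [t.lower() for t in tokens]             # 2. convert to lower case
    tokens = [canonical(t) for t in tokens]          # 3. reduce to canonical form
    return [t for t in tokens if t not in STOPWORDS] # 4. remove stopwords

terms = preprocess("The crawlers are mining the Web pages")
print(terms)
```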

Page 13: web structure mining

ALGORITHMS

There are two main algorithms used in web structure mining:

1. HITS (Hypertext-Induced Topic Search)
2. PageRank algorithm

Page 14: web structure mining

HITS (Hypertext-Induced Topic Search)

Link analysis algorithm

Rates web pages

Developed by Jon Kleinberg

Determines two values for a page:

Authority: estimates the value of the content of the page

Hub: estimates the value of its links to other pages

Page 15: web structure mining

Hubs and Authorities

Hub pages link to authorities, i.e. to relevant pages

Authorities are targets of hub pages

Page 16: web structure mining

Continued…

Authority and hub values are defined in terms of one another in a mutual recursion

It is executed at query time, with the associated hit on performance
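
The mutual recursion can be sketched as an iteration: a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to, with normalisation after each step. The function name `hits`, the fixed iteration count and the toy link set are assumptions; in practice the algorithm is run on a subgraph of pages related to the query.

```python
import math

def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a list of (source, target) links."""
    pages = sorted({p for link in links for p in link})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum of hub scores of pages pointing to the page.
        auth = {p: 0.0 for p in pages}
        for src, dst in links:
            auth[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub update: sum of authority scores of the pages it points to.
        new_hub = {p: 0.0 for p in pages}
        for src, dst in links:
            new_hub[src] += auth[dst]
        norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {p: v / norm for p, v in new_hub.items()}
    return hub, auth

# Hypothetical link set: h1, h2, h3 act as hubs; a1, a2 as authorities.
links = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
hub, auth = hits(links)
print("best hub:", max(hub, key=hub.get))
print("best authority:", max(auth, key=auth.get))
```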

Page 17: web structure mining

PageRank

Link analysis algorithm

Assigns a numerical weight to each element of a hyperlinked set of documents

The PageRank of an element E is denoted by PR(E)

Relies on the uniquely democratic nature of the web

A link from page A to page B is a vote, by page A, for page B

Page 18: web structure mining

Continued…

Votes cast by pages that are themselves important weigh more: if A is important, its link helps to make B important

PageRank is also a probability distribution: it represents the probability that a click on a random link arrives at any particular page

A PageRank of 0.5 means a 50% chance that a person clicking on a random link will be directed to the document with the 0.5 PageRank
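
A minimal sketch of PageRank by repeated redistribution of rank along links is given below. The damping factor of 0.85, the even spreading of rank from pages without outgoing links, the function name `pagerank` and the three-page example are all assumptions not stated in the slides.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Return {page: rank} for a list of (source, target) links; ranks sum to 1."""
    pages = sorted({p for link in links for p in link})
    n = len(pages)
    out_links = {p: [] for p in pages}
    for src, dst in links:
        out_links[src].append(dst)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (random jump).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            # A page with no outgoing links spreads its rank over all pages.
            targets = out_links[p] or pages
            share = damping * rank[p] / len(targets)
            for t in targets:
                new_rank[t] += share  # each link is a weighted vote
        rank = new_rank
    return rank

# Hypothetical web: A and C both vote for B, and B votes for A.
rank = pagerank([("A", "B"), ("C", "B"), ("B", "A")])
for page, value in rank.items():
    print(page, round(value, 3))
```

Because the ranks sum to 1, each value can be read as the probability of arriving at that page, which matches the probability-distribution reading above. B, with two incoming votes, ends up ranked above A, and C, with none, is ranked lowest.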

Page 19: web structure mining

APPLICATIONS

Information retrieval in social networks.

To find out the relevancy of each Web page

Measuring completeness of the Web sites

Used in search engines to find out relevant information

Page 20: web structure mining

CONCLUSION

Search engines use web structure mining to find information.

We can create new knowledge out of the available information

Web Content mining can be added to it to enhance the performance of search engines.

Page 21: web structure mining

Thank You !

Page 22: web structure mining

Questions ?