WEB STRUCTURE MINING
SUBMITTED BY: BLESSY JOHN
R7A ROLL NO:18
INTRODUCTION
Web mining is the application of data mining techniques to Web data, for example in search engines.
Data mining is the process of discovering useful knowledge from data sources.
Web mining automatically discovers and extracts information from Web documents.
Web structure mining discovers useful data from hyperlinks.
WEB MINING
Extraction of useful patterns from WWW resources
The WWW is a widely distributed, global information service centre that constitutes a rich source for data mining
Employs techniques from data mining, information retrieval, etc.
NEED FOR WEB MINING
Aims at finding and extracting relevant information that is hidden in web-related data.
The challenge is to bring back the semantics of hypertext documents
To turn web data into web knowledge
CLASSIFICATION
Web mining is classified into:
Web content mining
Web usage mining
Web structure mining
WEB STRUCTURE MINING
Generates a structural summary of the Web site and its Web pages
Uses graph theory to analyse the node and connection structure of a web site
Analyses the link structure of the web; its purpose is to identify the more preferable documents
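The graph view described above can be sketched in a few lines: pages are nodes and hyperlinks are directed edges. This is only an illustration; the page names and the `build_graph` helper are hypothetical and not part of the original slides.

```python
from collections import defaultdict

# Hypothetical site: each pair is (source page, target page) for one hyperlink.
links = [
    ("index.html", "about.html"),
    ("index.html", "products.html"),
    ("products.html", "about.html"),
    ("about.html", "index.html"),
]

def build_graph(edges):
    """Return (out_links, in_links) adjacency maps for a list of directed edges."""
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    for src, dst in edges:
        out_links[src].add(dst)
        in_links[dst].add(src)
    return out_links, in_links

out_links, in_links = build_graph(links)
pages = sorted(set(out_links) | set(in_links))

# In-degree counts the pages linking to a page; out-degree counts its outgoing links.
summary = {p: (len(in_links[p]), len(out_links[p])) for p in pages}
for page, (indeg, outdeg) in summary.items():
    print(f"{page}: in-degree={indeg}, out-degree={outdeg}")
```

A page with a high in-degree is linked to by many other pages, which is the simplest structural hint that it may be one of the more preferable documents.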
WEB STRUCTURE MINING (continued)
Discovers the nature of the hierarchy of hyperlinks in the website and its structure
A hyperlink indicates the author's endorsement of the other web page
Retrieves information about the relevance and the quality of the web page.
Page Layout and Link Analysis for Web Images
WEB BASICS
The web is a huge collection of documents linked together by references.
References from one document to another are based on hypertext and embedded in HTML
HTML describes how the document should be displayed in a browser window
Each web document has a web address, called a URL, that identifies it uniquely.
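To make the idea of references embedded in HTML concrete, the following sketch pulls the hyperlinks out of a small HTML fragment using Python's standard `html.parser` and `urllib.parse` modules. The `LinkExtractor` class, the `extract_links` helper, and the example URLs are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the absolute URL of every <a href="..."> in an HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative references against the document's own URL.
                self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

sample_html = (
    '<p>See <a href="/docs/intro.html">intro</a> and '
    '<a href="http://example.org/">this site</a>.</p>'
)
found = extract_links(sample_html, "http://example.com/index.html")
print(found)
```

Each extracted URL identifies one target document, so the list of links from a page gives the outgoing edges of that page in the web graph.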
WEB CRAWLERS
A web crawler collects "all" web documents by browsing the Web systematically and exhaustively
The region of the web to be crawled can be specified by using the URL structure.
Used by a search engine to provide local access to the most recent versions of possibly all web pages
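A crawler of this kind can be sketched as a breadth-first traversal that only follows URLs inside a chosen region. To keep the example runnable without network access, it crawls a hypothetical in-memory web (`FAKE_WEB`); a real crawler would fetch and parse pages instead, and the `crawl` function and its parameters are illustrative assumptions.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> list of URLs that page links to.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b", "http://other.org/"],
    "http://example.com/a": ["http://example.com/b"],
    "http://example.com/b": ["http://example.com/", "http://example.com/c"],
    "http://example.com/c": [],
    "http://other.org/": ["http://example.com/a"],
}

def crawl(seed, fetch_links, url_prefix):
    """Breadth-first crawl from seed, visiting only URLs that start with url_prefix."""
    visited = []          # pages in the order they were crawled
    seen = {seed}         # URLs already queued, so no page is crawled twice
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        visited.append(url)
        for target in fetch_links(url):
            if target.startswith(url_prefix) and target not in seen:
                seen.add(target)
                queue.append(target)
    return visited

order = crawl("http://example.com/", lambda u: FAKE_WEB.get(u, []), "http://example.com/")
print(order)
```

Restricting the crawl with a URL prefix is one simple way to use the URL structure to specify the region to be crawled.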
INDEXING AND KEYWORD SEARCH
There are two types of data: structured and unstructured
Structured data have keys associated with each data item that reflect its content
Keyword search is content-based access to unstructured data that does not consider the meaning of the text
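Keyword search over unstructured text is commonly supported by an inverted index, which maps each word to the documents that contain it. The sketch below is a minimal version of that idea; the sample documents and the `build_index` and `search` helpers are hypothetical.

```python
from collections import defaultdict

# Hypothetical document collection: document id -> text.
docs = {
    1: "web mining uses data mining",
    2: "hyperlinks connect web pages",
    3: "data sources hold knowledge",
}

def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query word (boolean AND)."""
    words = query.split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

index = build_index(docs)
print(search(index, "web"))
print(search(index, "data mining"))
```

The search matches words purely by their spelling, which is what "without considering the meaning" amounts to in practice.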
DOCUMENT REPRESENTATION
To facilitate the process of matching keywords and documents, some preprocessing steps are taken first:
1. Documents are tokenized
2. Characters are converted to upper or lower case
3. Words are reduced to canonical form
4. Stopwords are usually removed
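The four preprocessing steps above can be sketched as a small pipeline. The stopword list and the suffix-stripping `stem` function are deliberately crude stand-ins chosen for illustration; a real system would typically use a full stopword list and a proper stemmer or lemmatizer.

```python
import re

# Assumed for illustration: a tiny stopword list and two suffixes to strip.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in"}
SUFFIXES = ("ing", "s")

def stem(word):
    """Strip one common suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[A-Za-z0-9]+", text)          # 1. tokenize
    tokens = [t.lower() for t in tokens]                # 2. convert to lower case
    tokens = [stem(t) for t in tokens]                  # 3. reduce to a canonical form
    return [t for t in tokens if t not in STOPWORDS]    # 4. remove stopwords

terms = preprocess("The crawler is collecting pages and links")
print(terms)
```

After preprocessing, a query such as "page" matches a document that contains "Pages", because both are reduced to the same term.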
ALGORITHMS
There are two main algorithms used in web structure mining:
1. HITS (Hypertext-Induced Topic Search)
2. PageRank algorithm
HITS (Hypertext-Induced Topic Search)
Link analysis algorithm
Rates web pages
Developed by Jon Kleinberg
Determines two values for a page:
Authority: estimates the value of the content of the page
Hub: estimates the value of its links to other pages
Hubs and Authorities
Hub pages contain links that point to authorities (relevant pages)
Authorities are targets of hub pages
HITS (continued)
Authority and hub values are defined in terms of one another in a mutual recursion
HITS is executed at query time, with the associated hit on performance
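The mutual recursion between hub and authority values can be sketched as a simple iteration: a page's authority is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The graph, the `hits` function, the fixed iteration count, and the Euclidean normalisation are assumptions chosen for this illustration.

```python
import math

def hits(graph, iterations=50):
    """Return (authority, hub) scores for a dict mapping page -> list of linked pages."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: add up the hub scores of the pages that link to each page.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in graph.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub update: add up the new authority scores of the pages each page links to.
        new_hub = {p: sum(new_auth[dst] for dst in graph.get(p, [])) for p in pages}
        # Normalise both vectors so the scores stay bounded.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return auth, hub

# Hypothetical graph: h1 and h2 act as hubs, a1 and a2 as authorities.
graph = {"h1": ["a1", "a2"], "h2": ["a1"]}
auth, hub = hits(graph)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
print(best_authority, best_hub)
```

In this small graph, `a1` is linked from both hubs and so receives the highest authority score, while `h1` links to both authorities and receives the highest hub score.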
PageRank
Link analysis algorithm
Assigns a numerical weight to each element of a hyperlinked set of documents
The PageRank of an element E is denoted by PR(E)
Relies on the uniquely democratic nature of the web
A link from page A to page B is a vote, by page A, for page B
PageRank (continued)
Here, a page A that is itself important helps to make B important
PageRank is also a probability distribution: it represents the probability that a click on a link arrives at any particular page
A PageRank of 0.5 means there is a 50% chance that a person clicking on a link will be directed to the document with the 0.5 PageRank
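The voting idea and the probability interpretation can be sketched with a short iterative computation. This is a minimal sketch under stated assumptions: the damping factor of 0.85, the fixed iteration count, the handling of pages without out-links, and the three-page graph are not taken from the slides.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Return a PageRank score per page for a dict mapping page -> list of linked pages."""
    pages = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not graph.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for src in pages:
            targets = graph.get(src, [])
            for dst in targets:
                # Each link is a vote; a page splits its rank evenly among its links.
                new_rank[dst] += damping * rank[src] / len(targets)
        rank = new_rank
    return rank

# Hypothetical three-page web: A links to B, B links to C, C links to A and B.
web = {"A": ["B"], "B": ["C"], "C": ["A", "B"]}
ranks = pagerank(web)
for page, score in ranks.items():
    print(f"PR({page}) = {score:.3f}")
```

The scores sum to 1, which is what allows them to be read as probabilities. Page B collects a full vote from A and half a vote from C, so it ends up with the highest rank.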
APPLICATIONS
Information retrieval in social networks.
To find out the relevancy of each Web page
Measuring completeness of the Web sites
Used in search engines to find out relevant information
CONCLUSION
Search engines use web structure mining to find information.
We can create new knowledge out of the available information
Web Content mining can be added to it to enhance the performance of search engines.
Thank You !
Questions ?