
International Journal of Computer Processing of Oriental Languages
Vol. 16, No. 2 (2003) 143–170
© Chinese Language Computer Society & World Scientific Publishing Co.

Using Data Mining to Construct an Intelligent Web Search System

YU-RU CHEN*, MING-CHUAN HUNG† AND DON-LIN YANG‡

Department of Information Engineering and Computer Science
Feng Chia University
100 Wenhwa Rd., Taichung, 407 Taiwan
*[email protected] †[email protected] ‡[email protected]

In this paper, we present a new ranking algorithm and an intelligent Web search system using data mining techniques to search and analyze Web documents in a more flexible and effective way. Our method takes advantage of the characteristics of Web documents to extract, find, and rank data in a more meaningful manner. We utilize hyperlink structures together with Web document content to intelligently rank the retrieved results. It can solve the ranking problems of existing algorithms for multi-frame Web documents and unrelated linked documents. In addition, we use domain specific ontologies to improve our query process and to rank retrieved Web documents with better semantic notion. Furthermore, we use association rule mining to find the patterns of maximal keyword sets, which represent the main characteristics of the retrieved documents. For subsequent queries, these keywords become recommended sets of query terms for users' specific needs. Clustering is used to group retrieved documents into distinct sets that can help users make their decisions easier and faster. Experimental results show that our Web search system is indeed effective and efficient.

Keywords: World Wide Web; Web Search; Data Mining; Web Mining; Information Retrieval; Information Extraction; Ontology.

1. Introduction

The World Wide Web serves as a huge, widely distributed, global information service center for e-commerce, news, advertisements, education, and many other information services. Searching Web documents is one of the most popular tasks performed on the Web. Traditional search techniques use keywords as input to find the information that a user wants. However, this approach often retrieves


documents of which only small portions are relevant to what the user is really looking for. Therefore, many ranking algorithms are used to evaluate the matched Web documents based on their similarity to the query and to order these documents by the magnitude of similarity.

Nowadays, ranking retrieved Web documents using the information of hyperlinks between related Web documents has become very popular. It is intuitive that the hyperlinks between hyper-documents on the Web carry useful information about the relationships between documents, and this information has been shown useful in identifying high quality documents that meet users' needs. Therefore, many famous search engines utilize this information in their ranking methods [1–2]. Unfortunately, the technical details of the ranking algorithms used by major search engines are not publicly available. However, one may observe that many ranking algorithms use certain edge-weighting strategies to model linking structures in terms of the number of links without considering the content of the linked documents [7, 9, 26]. The results are not always very successful.

In this paper, we attempt to develop a new hyper-textual ranking algorithm that looks much deeper into the content of linked documents. We not only evaluate the text and hypertext information in the retrieved documents, but also examine the contents of the reciprocally linked documents. The rationale is as follows: even though a Web document links to or is linked from many other documents (called linking documents), one cannot be sure that these linking documents have similar content. They may be irrelevant documents. Furthermore, it is possible that these linking documents have no association with the user's query at all. For example, they might be simply Web advertisements of little value to users. If a ranking algorithm superficially considers all the linking documents to be equivalent and trivially evaluates the importance of the retrieved documents, it will rank highly many Web directories, Web advertisements, and linking documents generated automatically by Web document publishing tools. Although the retrieved Web directories may link to many topic-specific Web documents, it takes too much user effort to browse these links for the desired content while skipping the automatically generated links and unwanted advertisements along the way.

Another important advantage of our ranking algorithm is its applicability to ranking multi-frame Web documents. Nowadays, authors tend to edit and organize their Web documents in a multi-frame manner. Since Web document publishing tools have become popular and widely available, this phenomenon is more prevalent than before. In a multi-frame Web document, the browser displays sub-frames, each of which contains its own content. Designers can use the structure of multi-frame Web documents to provide intuitive user interfaces that make it possible to manage and arrange content. However, it is not easy to rank


content contained in sub-frames because each sub-frame only contains a portion of the information that an author wants to show to viewers.

We propose a new ranking algorithm that utilizes hyperlink and hyper-document information to address the need for ranking criteria that reward rich and relevant content in the search for information on the World Wide Web. Our method is based on the fact that Web documents have correlations with their hyperlinks and with the information conforming to the input query. Furthermore, we use domain specific ontologies to improve our query process and to rank the retrieved Web documents with better semantic notion. Our proposed algorithm can take advantage of these ontologies to identify and rank relevant Web documents semantically, thereby preventing the problems of synonymy, polysemy, and context sensitivity in text search.

In addition to proposing a new ranking algorithm, we also use data mining techniques to construct a user-friendly Web search system. Data mining and knowledge discovery in large collections of data are known to be effective and useful. With the growth of Web data, the opportunity to utilize data mining techniques to analyze data on the Web is attractive. In order to refine our search engine system, we utilize three dominant mining algorithms: association rule mining [4], sequential pattern mining [3], and clustering [22].

We use weighted association rule mining to explore the frequent keyword sets in retrieved Web documents. These frequent keyword sets represent the main characteristics of the retrieved documents. For subsequent queries, these keywords can be used as recommended query terms that are more suitable for users. Also, a clustering technique is used to mine the retrieved documents online. We choose the well-known fuzzy C-means clustering algorithm to divide the retrieved documents into a user specified number of groups, and then assign each document to at least one of the clusters by using our proposed dynamic threshold measuring method. Finally, sequential pattern mining is used to mine Chinese phrases for updating our Chinese lexicon semi-automatically, thereby saving a great deal of human effort in maintaining the lexicon. We also use a special pattern pruning method based on the properties of Chinese phrases to extend the traditional mining algorithm for better extraction of Chinese phrases.

The rest of this paper is organized as follows. First, we provide a general overview of related work in Section 2. Section 3 discusses the architecture of our search engine system and explains each component in detail. The proposed ranking algorithm is then described in Section 4. Section 5 shows how we utilize data mining techniques to refine our system. Finally, we present the system implementation in Section 6 and conclude the paper in Section 7.


2. Related Work

Here, we introduce the background and related work on Web search engines, link analysis, ontologies, and the semantic Web.

Search engines traditionally consist of three components: the crawler and indexer, the searcher and ranker, and the interface. A crawler is a program that automatically scans various Web sites and collects Web documents from them. Crawlers follow the links on a site to find other related documents and periodically look for changes. The traversal methods used by crawlers include depth-first and breadth-first searches combined with heuristics that determine the order in which to visit linked documents. Recent studies on Web crawlers can be found in [8, 12, 31].

The searcher and ranker analyze a given query and compare the result with indexes to find relevant documents. In practical terms, a user enters a keyword into a search engine, and the search engine scans indexed Web documents matching the keyword. In order to determine the order in which the documents should be displayed to the user, the search engine usually uses a ranking algorithm. Currently, most ranking algorithms use similarity measures based on the vector-space model [17] or the weighted vector-space model [9]. These types of ranking algorithms have been well studied in the Information Retrieval (IR) community. Several researchers have proposed ranking algorithms for Web search engines based on the analysis of hyperlink structure. Carriere and Kazman [7] proposed a simple way to measure the quality of a Web document by counting the number of documents that have links connecting to it. Google's [1] PageRank algorithm is also based, in part, on the number of other pages that have links to the document. It is an objective criterion of a page's citation importance that corresponds well with people's subjective perspectives of relative importance or value. Haveliwala [19] proposed computing a set of PageRank vectors, based on a set of representative topics, to capture more accurately the notion of importance with respect to a particular topic. Kleinberg [26] developed an algorithm called HITS (Hyperlink Induced Topic Search) to find the most information-rich or authoritative documents for a query. The algorithm also finds hub documents, i.e., documents that have links to many authority documents. In addition, there are some variants of the HITS algorithm. Dean and Henzinger [13] used link-based analysis plus some edge-weighting strategies to find similar pages on the Web. Chakrabarti et al. [10] extended the HITS algorithm to compute a linking weight for every hyperlink according to the count of matched keywords around the anchor text. Bharat and Henzinger [6] eliminated non-relevant nodes from the graph by computing the similarity to the query topic and regulating the influence of a node based on its relevance.


The interface of a search engine is a software component constructed to receive queries and display results. The interface should meet the basic requirements of being simple, intuitive, and easy to use, especially for naive users. Nowadays, many researchers try to embed a clustering algorithm in this component to give the user a good overall view of the information contained in the retrieved documents. Various clustering algorithms for documents have been proposed in [11, 20, 27].

Recently, ontologies have been applied to the World Wide Web to create semantic Web sites. The semantic Web provides automated information access based on machine processable semantics of data. The explicit representation of the semantics of data, accompanied by ontologies, enables intelligent Web services [15]. Ontologies serve as metadata schemas, providing a controlled vocabulary of concepts, each with an explicitly defined meaning. They help users and machines communicate more concisely by supporting semantics, not just syntax. The basic infrastructure of the semantic Web consists of Web enabled languages that allow the use of machine understandable semantics of data, together with tools capable of processing that data [14]. A. Gómez-Pérez and O. Corcho [18] analyzed some of the most representative ontology languages, several of which are likely to serve as ontology languages in the context of the semantic Web. Some of them are based on XML syntax, such as the Ontology Exchange Language (XOL) and the Ontology Markup Language (OML), whereas the Resource Description Framework (RDF) and RDF Schema originated from W3C collaboration. Two additional languages are the Ontology Inference Layer (OIL) and DAML+OIL. Currently, ontologies are used in several application areas and have become a popular research topic [5, 16, 30].

3. System Architecture

The architecture of our system is shown in Figure 1 and consists of the following six main components:

(i) Crawler: A crawler, also known as an agent, robot, or spider, is an unattended program that works continuously and automatically; its essential role is to locate information on the Web and retrieve it for indexing.

(ii) Language Processor: The language processor component is used by all of the other components of our system to process textual information.

(iii) Interface: The interface provides a user with a way to input query terms, request mining processes, and display the query and mining results.

(iv) Query Engine: The query engine is the heart of our system. It searches the inverted file indexes, which are created by the crawler in our index database, for efficient retrieval of the documents matching the query terms provided by the user. The query engine uses the linking structure of the retrieved documents to expand the query results. Finally, our ranking algorithm is used to determine the display order based on the degree of how well the results match the user's query.

(v) Miner: The Miner provides several kinds of data mining techniques in our system. First, clustering groups the documents retrieved for a user's query by our Query Engine into a specified number of clusters and then displays each cluster by showing the top 10 representative keywords as well as the associated documents in that cluster. Second, association rule mining uses the words of the retrieved documents as items to find a set of recommended keywords for future use in retrieving more specific documents. The recommended keyword sets can be treated as descriptors of these documents. For Chinese documents, we use a lexicon to perform sentence segmentation. The third mining technique used in our Miner is sequential pattern mining of Chinese phrases, which extracts from crawled Web documents new phrases that are not contained in the existing lexicon.

(vi) Database Connector: Our system has five databases: an index database, connectivity database, sentence database, ontology database, and lexicon and stopwords database. This component uses multiple database connection threads so that the other components can read and write data simultaneously, thereby enhancing the efficiency of the entire system.

Figure 1. System architecture.



Figures 2 and 3 illustrate the main processes in our system. In the following subsections, we will introduce the Crawler, Language Processor, and Databases in detail. Then, the Query Engine and Miner will be discussed in the next two sections, while the Interface will be described in the implementation section.

Figure 2. Crawling process.

Figure 3. Querying and mining processes.

[Figure 2 shows the Crawler's Retrieving, Hypertext Parser, URL Listing, and Formatting and Indexing Modules drawing on the Language Processor (Case Translator, Word Stemmer, Stop Words Filter, Phrase Segmenter) and writing to the Index, Connectivity, Sentence, and Lexicon and Stopwords databases through the Database Connector.]

[Figure 3 shows a Web user's query and mining requests flowing through the Interface to the Query Engine (Searcher and Ranker) and the Miner (Document Clustering, Keyword Association, Keyword Extraction), both supported by the Language Processor and, through the Database Connector, by the Index, Connectivity, Sentence, Ontology, and Lexicon and Stopwords databases.]


3.1. Crawler

As shown in Figure 2, our crawler contains four modules that work with each other to gather Web documents continuously. The Retrieving Module fetches URLs from a pool of candidate URLs stored in the URL Listing Module and retrieves the corresponding resources from the Web. The Hypertext Parser Module processes the retrieved resources. Its functions include (1) determining the retrieved data type, (2) parsing the retrieved hypertext documents, and (3) extracting the hyperlinks and specified structures in the documents. The results are then passed to the Formatting and Indexing Module, while the parsed URLs are added to the URL Listing Module. The Language Processor efficiently and effectively converts the retrieved text into a uniform expression used for data mining and/or information retrieval. After the text is converted, the Formatting and Indexing Module updates the index database to index the gathered Web documents for later searches, adds the hyperlink structure information to the connectivity database, and splits all the sentences in the acquired documents for the sentence database. Finally, the URL Listing Module feeds the Retrieving Module and decides which of the parsed URLs to add to the pool of candidate URLs.
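To make the module interplay concrete, the following minimal Java sketch shows a breadth-first crawl loop. It is our illustration rather than the paper's code: the class name, the regular-expression link extraction, and the index placeholder are hypothetical stand-ins for the Retrieving, Hypertext Parser, URL Listing, and Formatting and Indexing Modules.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CrawlerSketch {
    // URL Listing Module: a pool of candidate URLs plus the set already visited.
    private final Queue<String> pool = new ArrayDeque<>();
    private final Set<String> visited = new HashSet<>();
    private static final Pattern HREF = Pattern.compile("href=\"(http[^\"]+)\"");

    public void crawl(String seed, int maxDocs) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        pool.add(seed);
        while (!pool.isEmpty() && visited.size() < maxDocs) {
            String url = pool.poll();            // Retrieving Module fetches the next URL
            if (!visited.add(url)) continue;     // skip URLs already processed
            HttpResponse<String> resp = client.send(
                HttpRequest.newBuilder(URI.create(url)).build(),
                HttpResponse.BodyHandlers.ofString());
            // Hypertext Parser Module: extract hyperlinks from the retrieved document.
            Matcher m = HREF.matcher(resp.body());
            while (m.find()) pool.add(m.group(1)); // parsed URLs return to the URL Listing Module
            index(url, resp.body());               // Formatting and Indexing Module
        }
    }

    private void index(String url, String html) {
        // Placeholder: update the index, connectivity, and sentence databases here.
        System.out.println("indexed " + url + " (" + html.length() + " chars)");
    }

    public static void main(String[] args) throws Exception {
        new CrawlerSketch().crawl("http://www.fcu.edu.tw/", 10);
    }
}
```

Because the pool is a FIFO queue, the traversal is breadth-first, matching the strategy used in our implementation (Section 6.1); a depth-first variant would simply use a stack instead.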

3.2. Language processor

Our system can process data in two languages, English and Chinese. For English, we provide three processes, the Case Translator, Word Stemmer, and Stopword Filter, to convert all of the retrieved English words into a uniform expression. The Case Translator converts every English word to lower case. The Word Stemmer reduces words to their morphological roots. The Stopword Filter removes common or insignificant words. For Chinese, we provide an additional process: the Phrase Segmenter segments Chinese phrases from a Chinese sentence using a knowledge base called a lexicon.
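A minimal sketch of the English pipeline follows. The five-word stopword list and the crude suffix-stripping stemmer are toy stand-ins for the system's 615-term stopwords database and a real morphological stemmer:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class LanguageProcessorSketch {
    // Toy stopword list standing in for the 615-term stopwords database.
    private static final Set<String> STOPWORDS = Set.of("the", "a", "of", "and", "to");

    // Word Stemmer: a crude suffix stripper standing in for real morphological stemming.
    static String stem(String w) {
        if (w.endsWith("ing") && w.length() > 5) return w.substring(0, w.length() - 3);
        if (w.endsWith("s") && w.length() > 3)   return w.substring(0, w.length() - 1);
        return w;
    }

    static List<String> process(String sentence) {
        return Arrays.stream(sentence.split("\\W+"))
                .map(String::toLowerCase)                // Case Translator
                .filter(w -> !STOPWORDS.contains(w))     // Stopword Filter
                .map(LanguageProcessorSketch::stem)      // Word Stemmer
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(process("Searching the Web documents and ranking results"));
        // prints [search, web, document, rank, result] given the crude stemmer
    }
}
```

For Chinese text, the Phrase Segmenter would run first, matching sentence substrings against the lexicon before the remaining steps apply.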

3.3. Databases

There are five databases in our system: the index database, connectivity database, sentence database, ontology database, and lexicon and stopwords database. The index database is the core of the search engine and maintains two B+-tree indexed tables: a document table and a term table. The document table consists of a set of document records, each containing two columns: a document identifier and a posting identifier. The posting identifier is a file pointer representing the file location in which the located file contains a pre-calculated TFIDF measure of the document. Details about our TFIDF normalization scheme will be introduced in the next section. This table can be used to retrieve all the terms and their weights (measured by TFIDF) associated with a given set of documents. The term table consists of a set of term records, each containing two columns: a term identifier and a posting identifier. Here the posting identifier is the file location containing a list of document identifiers in which the term appears. This table can be used to retrieve all document identifiers associated with a given set of terms. The connectivity database, like the index database, maintains two tables such that all document identifiers pointing to a given document identifier can be found quickly, and vice versa.

Our lexicon database contains 138,347 Chinese phrases and their corresponding frequencies. In addition, the stopwords database contains 615 insignificant terms for stopword filtering. The ontology database in our system consists of domain specific ontologies that are constructed and maintained by domain experts. Lastly, all the sentences of the crawled documents are stored in the sentence database and used by the Keyword Extraction module in the Miner component.

4. Searching Process and Ranking Algorithm

In this section, we will discuss our searching process and ranking algorithm in detail. Since domain specific ontologies are used in our system, they are also introduced below.

[Figure 4 flow: (1) an Internet user submits a Web query to the Query Parser; (2) the parsed keywords go to Ontology Determination; (3) possibly relevant ontologies are returned to the user for selection; (4) the relevant ontology feeds Concepts and Related-concepts Identification; (5) the search index is queried by concept keywords; (6) the relevant Web pages pass through Web Page Ranking and Filtering; (7) a sorted Web page list is returned.]

Figure 4. Searching process.


4.1. Searching Process

Our searching process is illustrated in Figure 4. After a user inputs several keywords for searching relevant Web documents, our searcher performs a lookup of the terms in the ontology database to get the ontologies containing these keywords. Then, the searcher sends these possibly related ontologies to the user for selection with the purpose of avoiding text ambiguity. Text ambiguity may occur when different domains contain the same terms. After the specific ontology is selected or specified for the search terms, our searcher uses the terms and the ontology to produce search concepts and related concepts for semantic searching. The searcher scans the search index in the index database for every key term in the search concepts to obtain all of the conceptually related documents. Then, the ranker uses these documents and ontologies for ranking and filtering in order to get a sorted document list of all the relevant documents corresponding to the user's query. To improve search recall, we employ ontologies to search by concepts instead of terms. To improve search precision, we perform filtering by using linked documents and related concepts instead of using linking structures alone.

4.2. Ontology

The ontology in our search engine acts as a conceptual backbone for semantic document access by providing a common understanding and conceptualization of a domain. Our ontology consists of three main components: terms, term relationships, and properties. The term component is the set of basic terms comprising the vocabulary of a domain. The term relationship component is the set of relationships between terms. The property component is a set of properties of the domain ontology, its terms, and their relationships, such as the word sense (noun, verb) of a term or the semantic degree of a term relationship. In this study, we use the property semantic degree between two terms to indicate the strength of their semantic correlation. The semantic degree is a numerical value between –1 and 1. When two terms are completely relevant, such as synonyms, it is equal to 1. On the other hand, a degree of –1 indicates that two terms are polysemous, i.e., the same word is used with irrelevant meanings.

Figure 5 is a simple example of the Microsoft ontology. The rectangle, rhombus, and oval shapes represent the terms, term relationships, and properties in the ontology, respectively. If a user inputs the search term "windows" and selects the Microsoft ontology as the search domain, our searcher will produce a main concept and related concepts for the term "windows" under this particular ontology. As illustrated in Figure 6, the value "cw" in each concept represents the weight of that concept. The concept weight indicates the relevance strength of the concept with respect to the query terms and is derived from the semantic degrees of the term properties. In this example, the semantic degree between "windows" and "OS" is 0.9, so the concept weight of the OS concept is 0.9. The semantic degree between "OS" and "Microsoft" is 0.8, so the concept weight of the Microsoft concept is 0.72 (i.e., 0.9 × 0.8).

Figure 5. Microsoft ontology.

[Figure 5 links the terms Microsoft, OS, Windows, Office, Word, Excel, PowerPoint, Access, FrontPage, 95, 98, 2000, Me, NT, XP, Server, and Professional through Has and Synonym relationships, annotated with semantic degrees such as SD=0.8 between Microsoft and OS and SD=0.9 between OS and Windows.]

Figure 6. Query examples with the Microsoft ontology.

• Query example with the Microsoft ontology

Query term: Windows
Main concept: (Windows, | cw=1)
Related concepts: (95, 98, 2000, NT, XP | cw=0.7), (OS, | cw=0.9), (Microsoft, | cw=0.72), (Office | cw=0.576), (Access, Excel, Word, ... | cw=0.46)

Query term: Office
Main concept: (Office | cw=1)
Related concepts: (Word, Excel, PowerPoint, Access, FrontPage | cw=0.8), (Microsoft, | cw=0.8) ...


Figure 7 is a simple example of the Windows ontology, and Figure 8 is the query example "windows" with the Windows ontology.

Figure 7. Windows ontology.

Figure 8. Query examples with the Windows ontology.

• Query example with the Windows ontology

Query term: Windows
Main concept: (Windows, , , , , | cw=1)
Related concepts: ( , , | cw=0.8), ( , | cw=0.56), ( , , , , , , , | cw=0.64), ( , , , | cw=0.512)

Plus these negatively related concepts, if the user wants to filter out the Windows concept of Microsoft: ( | cw=–1), (95, 98, 2000, NT, XP | cw=–0.7), (OS, | cw=–0.9), (Microsoft, | cw=–0.72), (Office | cw=–0.576), (access, excel, word, | cw=–0.46)

[Figure 7 connects "Windows" with its Chinese synonyms and related terms through Synonym, Has, Kind of, and IsA relationships, annotated with semantic degrees between 0.7 and 1.0.]


In Figure 8, we add the negatively related concepts, which are used to filter out documents involving Microsoft Windows concepts. The concept weights of negatively related concepts are between –1 and 0, while the concept weights of related concepts are between 0 and 1. Between the Microsoft ontology and the Windows ontology, the term "windows" is polysemous, having a semantic degree of –1. We use this value to deduce the concept weights of the negatively related concepts.

4.3. Our ranking algorithm

In this section, we will describe our proposed ranking algorithm in detail. First, we introduce a vector based representational model for Web documents. Next, we implement a multi-layered linkage expansion in our system. Finally, we explain our ranking algorithm step by step and classify the ranking results.

4.3.1. Model for Web documents

In order to represent Web documents, we use the vector space representation [17] in which each document is represented as a vector of words together with normalized term frequencies. Specifically, each document can be represented as a term vector of the form d = {d_1, d_2, d_3, ..., d_n}, where each item d_i represents the normalized term frequency of a term t in the whole collection of T terms. For each item d_i we use the standard TFIDF normalization [9], in which a term that is less frequent in the aggregate collection is given a higher weight. We choose the TFIDF normalization equation shown below:

$$d_i = \left(0.5 + 0.5\,\frac{Tf}{\mathit{MaxTf}}\right) \times IDF(t) \qquad (1)$$

In the term frequency part of Equation (1), Tf/MaxTf, we divide each term frequency Tf in document d by the maximal term frequency MaxTf. With this term frequency normalization, every term frequency is transformed to a weight with a value between 0 and 1. The second part of Equation (1), IDF(t), is defined as follows:

$$IDF(t) = \log_{N}\left(\frac{N}{N_t}\right) \qquad (2)$$

where N_t is the number of documents in which the term t appears and N is the total number of documents in the collection. Like the term frequency part of Equation (1), Equation (2) yields an inverse document frequency value between 0 and 1. Therefore, each value of d_i in this TFIDF normalization is a real number between 0 and 1.
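A small Java sketch of Equations (1) and (2) follows; the toy collection and document frequencies are invented for illustration:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TfidfSketch {
    // d_i = (0.5 + 0.5 * Tf / MaxTf) * IDF(t), with IDF(t) = log_N(N / N_t).
    static Map<String, Double> vector(List<String> doc, Map<String, Integer> docFreq, int n) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : doc) tf.merge(t, 1, Integer::sum);
        int maxTf = tf.values().stream().max(Integer::compare).orElse(1);
        Map<String, Double> v = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            double idf = Math.log((double) n / docFreq.get(e.getKey())) / Math.log(n); // log base N
            v.put(e.getKey(), (0.5 + 0.5 * e.getValue() / maxTf) * idf);
        }
        return v;
    }

    public static void main(String[] args) {
        // Toy collection of N = 4 documents; docFreq holds N_t for each term.
        Map<String, Integer> docFreq = Map.of("windows", 2, "glass", 1);
        System.out.println(vector(List.of("windows", "windows", "glass"), docFreq, 4));
        // Every component is a real number between 0 and 1, as required above.
    }
}
```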


The search concepts include main concepts, related concepts, and negatively related concepts. We use each search concept produced by the query terms and ontologies to form a meta-document vector. Each dimension in a meta-document represents a key term in a concept and has a value of 1. For context sensitive ranking, we transform each document vector from the term dimensional space into the search concept space by using the coordinate along each concept meta-document axis. The cosine function is used to make the transformation. After transformation, each dimension in a document vector is the similarity measure between the original TFIDF document vector and a search concept meta-document vector.

4.3.2. Linkage expansion

Our ranker first finds the set of document identifiers whose documents match at least one of the query terms and the terms in the main search concepts. This set of documents is called the root set. The ranker then expands the root set into a base set by including all the documents having a linkage relation with the root set. Figure 9 illustrates our two-layer expansion with the root set and the base set. The base set is a super set of the root set, i.e., root_set ⊂ base_set.


Figure 9. Root set and base set.
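The following is a minimal sketch of the expansion, assuming the connectivity database is available as in-link and out-link maps; the method and variable names are ours. Running it with layers = 2 gives the two-layer expansion of Figure 9:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class LinkageExpansionSketch {
    // Expand the root set into the base set by following links in both
    // directions, repeating the step `layers` times (multi-layer expansion).
    static Set<Integer> expand(Set<Integer> rootSet,
                               Map<Integer, List<Integer>> outLinks,
                               Map<Integer, List<Integer>> inLinks,
                               int layers) {
        Set<Integer> baseSet = new HashSet<>(rootSet);
        Set<Integer> frontier = new HashSet<>(rootSet);
        for (int i = 0; i < layers; i++) {
            Set<Integer> next = new HashSet<>();
            for (int doc : frontier) {
                next.addAll(outLinks.getOrDefault(doc, List.of())); // documents we link to
                next.addAll(inLinks.getOrDefault(doc, List.of()));  // documents linking to us
            }
            next.removeAll(baseSet);
            baseSet.addAll(next);
            frontier = next;
        }
        return baseSet; // the root set is always a subset of the base set
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> out = Map.of(1, List.of(2), 2, List.of(3));
        Map<Integer, List<Integer>> in  = Map.of(2, List.of(1), 3, List.of(2));
        System.out.println(expand(Set.of(1), out, in, 2)); // contains 1, 2 and 3
    }
}
```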

Note that in our expansion step, duplicate links such as the pair in Figure 10(f) are reduced to one link, as shown in Figure 10(a). Furthermore, two Web documents may link to or be linked by the same document, as in Figure 10(c) and Figure 10(d). Such documents may have a close relationship both in content and linkage, as multi-framed Web documents do. However, this relationship cannot be found by using just a one-layer expansion. For this reason, we implement a multi-layer expansion in our system.

Figure 10. Hyperlink relationship.


4.3.3. Ranking algorithm

Figure 11 depicts our ranking algorithm, which consists of nine major steps. In the first step, we expand the root set of query results to form a base set and collect all of the hyperlinks {l_1, l_2, l_3, ..., l_r} in the base set, where r is the number of hyperlinks and l_i = ⟨d_x, d_y⟩. Note that l_i ≠ l_j when l_j = ⟨d_y, d_x⟩ because hyperlinks are directed. A hyperlink l = ⟨d_x, d_y⟩ represents a document d_x linking to a document d_y, where d_i ∈ D and D denotes all the documents in the base set. For each d_i in D, d_i = {d_i^1, d_i^2, d_i^3, ..., d_i^n} is the document vector transformed from its original term dimensional TFIDF representation to the search concept space. d_i^j is the cosine similarity measure between the document d_i and the concept meta-document c_j (abbreviated as concept c_j) belonging to the set of search concepts C = {c_1, c_2, c_3, ..., c_n}, where n indicates the number of search concepts.

After transforming each base set document to the search concept space, we compute the hyper-weight for each concept c_j of C in a document d_i of D. The hyper-weight indicates the correlation level of the linked documents for a document. We divide the hyper-weight into two categories, in_weight and out_weight. in_weight(i, j) represents the correlation level of the linked document set D_i^in pointing to the document d_i, where the linked documents contain the concept c_j. D_i^in is a subset of D, and each document d_p in D_i^in has a hyperlink l = ⟨d_p, d_i⟩. in_weight(i, j) is computed by using the following Equation (3):

$$\mathit{in\_weight}(i, j) = \frac{\mathit{Num}(D_i^{in}(j) \neq 0)}{\mathit{InDegree}(i) + 1} \qquad (3)$$

InDegree(i) in this equation is the number of documents having a link to the document d_i (i.e., the number of elements in D_i^in). Num(D_i^in(j) ≠ 0) is the number of documents having a link to the document d_i whose cosine similarity measure with the concept meta-document c_j is not zero (i.e., non-zero correlation, d_p^j ≠ 0 and d_p ∈ D_i^in). Similarly, we compute the out_weight of each document d_i for concept c_j by using the following Equation (4).

Figure 11. Our ranking algorithm.

Step 1. Expand the query result from the root set to the base set D.
Step 2. Transform each document vector d_i in D to the search concept space.
Step 3. Compute the hyper-weights for each concept c_j in document d_i:
    in_weight(i, j) = Num(D_i^in(j) ≠ 0) / (InDegree(i) + 1),
    out_weight(i, j) = Num(D_i^out(j) ≠ 0) / (OutDegree(i) + 1)
Step 4. Compute the hyper-document-vectors hdv(i, j) for each concept c_j and d_i:
    hdv(i, j) = Σ_{d_p ∈ D_i^in} d_p^j × in_weight(i, j) + Σ_{d_q ∈ D_i^out} d_q^j × out_weight(i, j)
Step 5. Normalize the hdv(i, j) so that max hdv(i, j) = 1.
Step 6. Compute the new document vector:
    d_i^j = [(1 − θ) × d_i^j] + [θ × hdv(i, j)], ∀ d_i^j in d_i
Step 7. Go to Step 4 until the document vectors converge.
Step 8. Compute the cumulated weight DC_i(m) for each main concept c_m of d_i:
    DC_i(m) = Σ_{∀ c_r related to c_m} (d_i^r × cw(r)), and normalize them for max DC_i(m) = 1
Step 9. Compute the new d_i^m, ∀ d_i^m in d_i with c_m:
    d_i^m = d_i^m (1 + DC_i(m)) / 2 if DC_i(m) ≥ 0; 0 if DC_i(m) < 0


$$\mathit{out\_weight}(i, j) = \frac{\mathit{Num}(D_i^{out}(j) \neq 0)}{\mathit{OutDegree}(i) + 1} \qquad (4)$$

OutDegree(i) is the number of documents that have a link coming from the document d_i, and D_i^out is the set of documents linked by d_i. Next, we use Equation (5) to compute the hyper-document-vector hdv(i, j) for each concept c_j in the document d_i, representing the correlation of the document with its linked documents with respect to the search concept c_j:

$$hdv(i, j) = \sum_{d_p \in D_i^{in}} d_p^j \times \mathit{in\_weight}(i, j) + \sum_{d_q \in D_i^{out}} d_q^j \times \mathit{out\_weight}(i, j) \qquad (5)$$

After computing the hyper-document-vectors for every document with respect to each search concept, we normalize every value in the hyper-document-vectors to a numerical value between 0 and 1. Then, we compute the new document vector from the original document vector d_i and its hyper-document-vector hdv(i, j) by using Equation (6):

$$d_i^j = [(1 - \theta) \times d_i^j] + [\theta \times hdv(i, j)], \quad \forall d_i^j \text{ in } d_i \qquad (6)$$

In Equation (6) we set the linking factor θ = 0.3. This factor indicates the contribution of the linked documents with respect to the document d_i. Then, we iteratively compute the new document vectors until the document vectors converge. In linear algebra, if we continue to multiply a non-negative matrix A by another fixed non-negative matrix B (i.e., A = B × A), the values in matrix A will converge. In Equation (5) we can represent every document vector as a d × c matrix (matrix A) and every hyper-weight between two documents as a d × d matrix (matrix B). It can be shown that our algorithm will converge after a few iterations; in fact, most document vectors converge after 10 iterations. To be safe, our implementation terminates these steps after 15 iterations. Bharat and Henzinger [6] give a more detailed proof of document vector convergence.
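The sketch below runs Steps 3 to 6 iteratively for a single concept dimension j, with d[i] holding d_i^j. It simplifies the paper's scheme in one respect: the hdv values are normalized within each pass over this one dimension rather than over all concepts, so it should be read as illustrative only:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class HyperRankSketch {
    static double[] iterate(double[] d, Map<Integer, List<Integer>> in,
                            Map<Integer, List<Integer>> out, double theta, int iterations) {
        for (int it = 0; it < iterations; it++) {
            double[] hdv = new double[d.length];
            for (int i = 0; i < d.length; i++) {
                List<Integer> inDocs = in.getOrDefault(i, List.of());
                List<Integer> outDocs = out.getOrDefault(i, List.of());
                long inNonZero = inDocs.stream().filter(p -> d[p] != 0).count();
                long outNonZero = outDocs.stream().filter(q -> d[q] != 0).count();
                double inW = (double) inNonZero / (inDocs.size() + 1);    // Equation (3)
                double outW = (double) outNonZero / (outDocs.size() + 1); // Equation (4)
                double sum = 0;                                           // Equation (5)
                for (int p : inDocs) sum += d[p] * inW;
                for (int q : outDocs) sum += d[q] * outW;
                hdv[i] = sum;
            }
            double max = 1e-12;                        // Step 5: normalize so that max hdv = 1
            for (double h : hdv) max = Math.max(max, h);
            for (int i = 0; i < d.length; i++)
                d[i] = (1 - theta) * d[i] + theta * hdv[i] / max; // Equation (6)
        }
        return d;
    }

    public static void main(String[] args) {
        // Three documents linked 0 -> 1 -> 2; document 0 strongly matches the concept.
        Map<Integer, List<Integer>> out = Map.of(0, List.of(1), 1, List.of(2));
        Map<Integer, List<Integer>> in  = Map.of(1, List.of(0), 2, List.of(1));
        System.out.println(Arrays.toString(
            iterate(new double[]{0.9, 0.1, 0.0}, in, out, 0.3, 15)));
    }
}
```

After a few passes, document 2, which has no matching content of its own, acquires a small weight through its link from document 1, which is exactly the behavior the hyper-weights are designed to produce.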

After the iteration step, we compute the cumulated weight DC_i(m) of the related concepts for each main concept c_m in a document d_i by using Equation (7) and normalize it so that the maximal value is DC_i(m) = 1:

$$DC_i(m) = \sum_{\forall c_r \text{ related to } c_m} \left( d_i^r \times cw(r) \right) \qquad (7)$$

The value cw(r) in Equation (7) represents the concept weight of the related concept c_r with respect to a main concept. A non-negative cumulated weight supports the relevance of the main concept dimension d_i^m; conversely, a negative cumulated weight will set d_i^m to zero. We use Equation (8) to compute the final document vectors:


$$d_i^m = \begin{cases} d_i^m \, (1 + DC_i(m)) / 2 & \text{if } DC_i(m) \geq 0 \\ 0 & \text{if } DC_i(m) < 0 \end{cases}, \quad \forall d_i^m \text{ in } d_i \text{ with } c_m \qquad (8)$$

We use the keyword query "windows" with the Windows ontology as an example. Our searcher retrieves all documents indexed by terms in a main concept, such as the terms "windows", " ", etc. in the Windows ontology. Then, the system generates many related concepts by using the Windows ontology. The concept weight is a positive numeric value for a related concept; for negatively related concepts, the concept weight is a negative numeric value. In Step 8 of our ranking algorithm, the cumulated weight of the main concept "windows" for a document is a positive value when the context and the linkage relationships of this document belong to the regular windows domain. Since a higher correlation results in a higher cumulated weight, this weight can be used to determine the document's relevance to the main concept. If the context and the linkage relationships of this document are close to the Microsoft ontology, the cumulated weight will be a negative value, and our ranking algorithm sets the relevance of the main concept to zero. This means that although the term "windows" appears in this document, the other terms in the document carry concepts close to the Microsoft Windows operating system.

4.3.4. Ranking result analysis

We now classify the computed document ranks into three categories of sorted order. These three categories are: (1) documents that contain all main search concepts, (2) documents that contain some of the main search concepts and have linkage relationships with other concepts, and (3) documents that have linkage relationships with some of the main search concepts.

In order to classify and sort documents effectively and efficiently, we first summarize all main concept dimensions of the document vector d_i from the computed base set via Equation (9), where every d_i is in the root set:

$$\mathit{score}_{d_i} = \sum_{\forall c_m,\; d_i^m \neq 0} \left( d_i^m + 1 \right) \qquad (9)$$

If the score of d_i is less than the number of main concepts num(c_m), d_i is classified as belonging to the third category. If the score of d_i is larger than the number of main concepts num(c_m), we perform the following computation using Equation (10) for these documents:

$$\mathit{score}_{d_i} = \mathit{score}_{d_i} + \sum_{\forall c_m,\; d_i^m \neq 0} \left[ d_i^m + \mathit{num}(c_m) \right] \qquad (10)$$


If the score of d_i is larger than num(c_m)^2, we classify the document into the first category. The remainder are classified into the second category.

Then, we perform a quick sort on the document scores in decreasing order. The documents can thus easily be classified according to the three categories and ordered by the number of main concepts they match and their similarity to the main concepts. Equation (11) summarizes the classification criteria:

$$\begin{cases} \text{first category} & \text{if } \mathit{score}_{d_i} > \mathit{num}(c_m)^2 \\ \text{second category} & \text{if } \mathit{num}(c_m)^2 \geq \mathit{score}_{d_i} \geq \mathit{num}(c_m) \\ \text{third category} & \text{if } \mathit{score}_{d_i} < \mathit{num}(c_m) \end{cases} \qquad (11)$$
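Since Equations (9) through (11) were reconstructed from a degraded scan, the constants in the sketch below should be treated as approximate; it applies the scoring and the thresholds exactly as written above, with dMain[m] holding the main concept dimensions d_i^m:

```java
public class CategorySketch {
    static int category(double[] dMain) {
        int num = dMain.length;            // num(c_m): the number of main concepts
        double score = 0;
        for (double v : dMain)             // Equation (9): each matched concept adds d_i^m + 1
            if (v != 0) score += v + 1;
        if (score < num) return 3;         // third category
        for (double v : dMain)             // Equation (10): add d_i^m + num(c_m) per matched concept
            if (v != 0) score += v + num;
        return score > (double) num * num ? 1 : 2;  // Equation (11)
    }

    public static void main(String[] args) {
        System.out.println(category(new double[]{0.8, 0.6, 0.7})); // all concepts matched -> 1
        System.out.println(category(new double[]{0.3, 0.0, 0.0})); // weak partial match -> 3
    }
}
```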

5. Data Miner

The Data Miner performs three types of data mining to refine our search engine system. We use sequential pattern mining to extract new Chinese phrases automatically, and weighted association rule mining to mine frequently occurring keyword sets that indicate the main characteristics of the retrieved documents, applicable to query term recommendation when more detailed queries are required. Fuzzy C-means clustering is used to provide an overview of the retrieved documents for the user.

5.1. Chinese phrase extraction

A Chinese lexicon is needed for our system to perform word segmentation. Currently, we have 138,347 Chinese phrases in our lexicon. However, new phrases are needed from time to time, especially domain specific phrases. We add new phrases to the lexicon manually to ensure that word segmentation is done correctly without the unknown phrase problem. For this reason, we provide an extraction function for new Chinese phrases so that the lexicon maintainer can update the lexicon semi-automatically from a domain specific Web site.

In order to complete the mining process, we first perform preprocessing on the crawled documents. We take every sentence in the crawled documents as a transaction and every character as an item. Moreover, we utilize the structure of a Web document to derive weighted transactions: sentences marked by any emphasis tag, such as head, title, anchor, bold, italics, and font, are treated as weighted transactions. In this module, we use sequential pattern mining to process them. At the end of the mining step, we prune the mined frequent sequences to obtain phrases. A traditional mining approach just retains the maximal frequent sequences as the final patterns, but this is not applicable in text mining, especially for Chinese text data. For example, the Chinese phrase " " would be pruned by the phrase " ". But " " may be a meaningful phrase that we want to add to the lexicon. At the same time, we have to prune the non-meaningful character sequences " " and " " that were merely used to join the four-character sequence " ". Similarly for " ", it can be used to join the three-character sequences " " and " ". For this reason, we use the concept of net frequency [28] to prune non-meaningful character sequences and retain meaningful ones. The equation of our pruning method is as follows:

$$N(s_k) = S(s_k) - \sum_{\forall s_{k+1} \supset s_k} S(s_{k+1}) + \sum_{\forall s_{k+2} = s_{k+1} \cup s'_{k+1}} S(s_{k+2}) \qquad (12)$$

The function S(s_k) denotes the support count of the sequence s_k, whose length is k. N(s_k) is the net frequency of the sequence s_k; we prune the sequence s_k if N(s_k) is no more than the minimal support. If N(s_k) is larger than the minimal support, we take s_k as a Chinese phrase.
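A sketch of the pruning follows; ASCII strings stand in for the Chinese character sequences that were lost in this copy, and the support counting is a naive substring enumeration rather than true sequential pattern mining:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NetFrequencySketch {
    // Support counts S(s) of every character sequence in the sentences.
    static Map<String, Integer> support(List<String> sentences, int maxLen) {
        Map<String, Integer> s = new HashMap<>();
        for (String sent : sentences)
            for (int len = 1; len <= maxLen; len++)
                for (int i = 0; i + len <= sent.length(); i++)
                    s.merge(sent.substring(i, i + len), 1, Integer::sum);
        return s;
    }

    // Equation (12): subtract the (k+1)-length super-sequences of s_k and add
    // back the (k+2)-length ones so they are not subtracted twice.
    static int netFrequency(String sk, Map<String, Integer> s) {
        int n = s.getOrDefault(sk, 0);
        for (Map.Entry<String, Integer> e : s.entrySet()) {
            if (!e.getKey().contains(sk)) continue;
            int diff = e.getKey().length() - sk.length();
            if (diff == 1) n -= e.getValue();
            else if (diff == 2) n += e.getValue();
        }
        return n;
    }

    public static void main(String[] args) {
        Map<String, Integer> s = support(List.of("abcd", "abcd", "abce"), 5);
        System.out.println(netFrequency("abc", s));  // 0: "abc" never stands alone, so prune it
        System.out.println(netFrequency("abcd", s)); // 2: retained as a candidate phrase
    }
}
```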

Finally, after mining all of the Chinese phrases on a domain specific Web site, the miner looks up the lexicon database and displays to the user those Chinese phrases that do not yet appear in the lexicon.

5.2. Keyword association

As mentioned before, we use the association rule mining algorithm [4] in this module. We view each document as a transaction and treat all phrases as items, where the phrases are segmented from the document by the Language Processor. Association rule mining is used to mine all the maximal keyword sets that occur frequently in the retrieved document sets. The mined maximal keyword sets represent the primary attributes (keywords) of the retrieved documents, so users can have a succinct and clear view of these documents. Furthermore, these primary attributes can be used as recommended keywords for further keyword searching. Note that we take a Chinese phrase as a keyword to get more useful patterns.

In addition, to prevent the mined phrase sets from containing non-meaningful keywords, such as words used in everyday language, we accumulate the TFIDF measures of the keywords appearing in a document as their supports. Consequently, we can use the preprocessed document vectors in the index database as our mining transactions and accumulate the TFIDF measure as the support to perform weighted association rule mining.
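A minimal sketch of the weighted support computation follows, with each transaction represented by its preprocessed TFIDF document vector; the candidate generation and thresholding of a full weighted association rule miner are omitted, and the numbers are invented:

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

public class WeightedSupportSketch {
    // Weighted support of a keyword set: accumulate the TFIDF weights of its
    // keywords over every retrieved document that contains all of them.
    static double weightedSupport(Set<String> keywords, List<Map<String, Double>> docVectors) {
        double support = 0;
        for (Map<String, Double> doc : docVectors) {
            if (!doc.keySet().containsAll(keywords)) continue;
            for (String k : keywords) support += doc.get(k);
        }
        return support;
    }

    public static void main(String[] args) {
        List<Map<String, Double>> docs = List.of(
            Map.of("java", 0.9, "servlet", 0.7),
            Map.of("java", 0.4, "linux", 0.8),
            Map.of("java", 0.6, "servlet", 0.5));
        // Documents 1 and 3 contain both keywords: 0.9 + 0.7 + 0.6 + 0.5
        System.out.println(weightedSupport(Set.of("java", "servlet"), docs)); // about 2.7
    }
}
```

Keyword sets whose accumulated TFIDF support exceeds a minimum threshold survive, so everyday words with uniformly low weights drop out even when they are frequent.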


5.3. Document clustering

Our Data Miner utilizes the fuzzy C-means clustering algorithm to perform document clustering online. Refer to our previous work [22] for more details of the algorithm. In this section, we explain how to assign each document to a cluster and how to pick documents as the initial centroids. In our approach, all the retrieved document vectors are used as the data set for clustering. We use the cosine function to measure the level of similarity between two documents or between a document and its centroid. In our algorithm, we take a centroid as a meta-document.

First, in the initialization step of our approach, we consider the retrieved documents in decreasing order of their ranking grades and ensure that each picked document is not similar to the ones picked before. We initialize the k centroids by using incremental thresholding: the initial threshold is 0.3, and if this threshold cannot pick k centroids among the first 30 × k documents (k is the user defined number of clusters), we reduce the threshold by ten percent and repeat the process.

Second, in the iteration step, as in the fuzzy C-means clustering algorithm, we compute the membership matrix for every pair of candidate centroids and document vectors. Then, we compute the new centroid vectors by using this new membership matrix.

Finally, after computing the new membership matrix, we can assign each document to at least one cluster. In fact, a document may be relatively close to two or three clusters. Therefore, we define an equation that assigns a document to a cluster based on the average variation of the document's membership values. The assignment condition is defined as follows:

$$\mathit{Mem}(d, c) > \mathit{Avg}(d) + \mathit{AvgV}(d) \qquad (13)$$

If the membership value of document d to the cluster c satisfies this condition, we assign the document to this cluster. Avg(d) is the average membership value of document d with respect to every cluster and is equal to 1/k, since the membership values of one document over all clusters sum to 1. Mem(d, c) denotes the membership value of document d with respect to cluster c. Finally, AvgV(d) for the document d is computed as follows:

$$\mathit{AvgV}(d) = \sum_{\forall c \,\in\, \mathrm{clusters}} \left( \mathit{Avg}(d) - \mathit{Mem}(d, c) \right)^2 \qquad (14)$$
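A sketch of the assignment rule given one document's membership vector; the memberships are assumed to sum to 1, as in fuzzy C-means:

```java
import java.util.Arrays;

public class ClusterAssignSketch {
    // Assign document d to every cluster c with Mem(d, c) > Avg(d) + AvgV(d),
    // following Equations (13) and (14).
    static boolean[] assign(double[] membership) {
        int k = membership.length;
        double avg = 1.0 / k;              // Avg(d): memberships of one document sum to 1
        double avgV = 0;                   // Equation (14)
        for (double m : membership) avgV += (avg - m) * (avg - m);
        boolean[] assigned = new boolean[k];
        for (int c = 0; c < k; c++)
            assigned[c] = membership[c] > avg + avgV;  // Equation (13)
        return assigned;
    }

    public static void main(String[] args) {
        // A document close to clusters 0 and 1 and far from cluster 2.
        System.out.println(Arrays.toString(assign(new double[]{0.45, 0.40, 0.15})));
        // prints [true, true, false]
    }
}
```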

In fact, our clustering approach performs well after only one clustering phase because we can intelligently choose good initial centroids from the ranked documents. In consideration of the execution time for online clustering, we can display clustering results even faster by using one-phase clustering. Further clustering can be performed when users want to refine the clustering results of their search.

6. Implementation and Experiments

In this section, we will show the implemented system and discuss the findings of our experimental results based on this limited study.

6.1. Implementation

We implemented the system in Java and used Tomcat 4 as the Java servlet container. All of the following Web documents are from the Web sites of Feng Chia University. The crawler started from the following URLs:

http://www.fcu.edu.tw/, http://www.fcuaa.com/, http://www.cie.fcu.edu.tw/, http://www.iecs.fcu.edu.tw/, http://dinosaur.soft.iecs.fcu.edu.tw/

In this implementation, our system collected data in a breadth-first strategy for all relevant Web documents located on Feng Chia University Web sites. We ignored all non-HTML and non-ASCII documents as well as all other domains. Table 1 displays some relevant statistics that demonstrate the amount of data extracted and indexed. The system shows the result page of our search engine after a search with a given keyword; our search engine ranks and classifies the retrieved documents into three categories following the ranking and classification schemes described in Section 4.3.4. Furthermore, we utilize two mining tools, association rule mining and clustering, to analyze the retrieved documents online.

Table 1. Statistics of collected data from our Crawler.

Number of indexed documents: 9,054
Number of keyword indexes: 49,360
Number of hyperlinks: 45,537
Number of sentences in the FCU Web site (# of transactions in Chinese phrase extraction): 592,800
Number of emphasized sentences (# of weighted transactions in Chinese phrase extraction): 98,293
Average number of keywords per document: 215.88


Figure 12 and Table 2 show the document clustering results after one clustering phase for the retrieved documents. For each cluster, our system displays the main concepts and lists the most similar documents found. Table 2 reports partial clustering results. We can verify that the clustering performs very well after just one clustering phase.

Figure 12. Document clustering page.

Table 2. Partial results of document clustering.

Cluster key terms: [delphi][coreldraw][javabean][photoimpact][jbuilder][linux][mcsd][isa][java][ ] (15 documents)
- http://www.fcu.edu.tw/~training/month2.htm
- http://www.fcu.edu.tw/%7Etraining/month2.htm
- http://www.fcu.edu.tw/%7Etraining/month.htm

Cluster key terms: [interdev][annin][ ][contract][gpr][ ][wap][biztalk][ ][ ] (3 documents)
- http://www.fcu.edu.tw/~training/wanted.htm
- http://www.fcu.edu.tw/%7Etraining/wanted.htm
- http://140.134.4.2/cc/announce/other/inter89.htm

Cluster key terms: [beanbox][ttabl][monit][ ][tqueri][bde][mfc][swing][edp][iter] (7 documents)
- http://www.fcu.edu.tw/~training/s-pd2002.htm
- http://www.fcu.edu.tw/%7Etraining/s-pd2002.htm
- java-programming-language: http://www.fcu.edu.tw/~training/jpl.htm


6.2. Experimental results

Using the same search keyword, we evaluated our search system by making comparisons with different selected ontologies. We employ the basic measures for text retrieval, precision and recall, as evaluation criteria. Precision is the percentage of retrieved documents that are in fact relevant to the query. Recall is the percentage of documents that are relevant to the query and were, in fact, retrieved. Since we were not able to check whether all of the collected documents from the crawled Web documents were relevant to the query, we assume a fixed error rate for our search system when the correct ontology is used. Table 3 shows the search precision and recall for the single search keyword "windows" under different selections of ontologies. Although we only used one test case, distinguishing Microsoft Windows from a regular window, it is indeed a very typical example in our Web search, and the same conclusion can be drawn for other search terms.

Table 3. Search recall and precision.

Search for Microsoft Windows
    Ontologies                                                  Recall    Precision   Documents
    1. No ontology                                              71.8%     97.02%      168
    2. Use Microsoft ontology for search                        91.18%    95.83%      216
    3. Use Microsoft ontology for search and regular windows
       ontology for pruning irrelevant documents                90.74%    97.63%      211

Search for regular windows
    Ontologies                                                  Recall    Precision   Documents
    4. No ontology                                              5%        0.59%       168
    5. Use regular windows ontology for search                  90%       9.52%       189
    6. Use regular windows ontology for search and Microsoft
       ontology for pruning irrelevant documents                90%       66.66%      27

Number of relevant documents for Microsoft Windows: 227
Number of relevant documents for regular windows: 20

Our system retrieved 168 Web documents when searching on "windows" without using an ontology. If the users' desired result was Microsoft Windows, the search recall was 71.8 percent and the search precision was 97.02 percent. One can see that most of the crawled Web documents containing the term "windows" were actually related to Microsoft Windows. If the users' desired results were regular windows, the search recall and precision were only 5 and 0.59 percent, respectively.
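
The figures in Table 3 follow directly from these definitions, and readers can recompute them in a few lines of code. The sketch below reproduces Row 5; the counts (18 relevant documents among the 189 retrieved, out of 20 relevant documents overall) come from the discussion below.

    def precision_recall(relevant_retrieved, retrieved, relevant_total):
        """Precision: fraction of retrieved documents that are relevant.
        Recall: fraction of all relevant documents that were retrieved."""
        return relevant_retrieved / retrieved, relevant_retrieved / relevant_total

    # Row 5 of Table 3: 18 of the 189 retrieved documents concern regular
    # windows, out of 20 relevant documents in the whole collection.
    p, r = precision_recall(18, 189, 20)
    print(f"precision = {p:.2%}, recall = {r:.2%}")  # precision = 9.52%, recall = 90.00%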


If ontologies are used for the search in this particular case, we can see that the search recall increases. In Row 2 of Table 3, the search recall increased from 71.8% to 91.18% because we can retrieve more relevant Web documents even when the keyword "windows" is not present. Even more strikingly, in Row 5 of Table 3, the search recall for regular windows increases from 5% to 90%.

Furthermore, if our search system uses the information of both the related ontology and the non-related ontology, the search precision increases while the search recall remains almost the same. Comparing Row 2 and Row 3 of Table 3, we can see that the search precision increases from 95.83% to 97.63%, even though the search recall decreases slightly. This is because our system uses the knowledge of domain-specific ontologies to filter out non-relevant Web documents while retaining the relevant ones. In Row 5 and Row 6 of Table 3, the search recall for regular "windows" remains the same after our algorithm prunes irrelevant documents with the Microsoft ontology, while the search precision increases substantially from 9.52% to 66.66%. The reason for the increased precision is as follows: only one document relevant to regular "windows" is retrieved out of 168 documents in Row 4, whereas 18 out of 189 are relevant in Row 5. For Row 6 of Table 3, we effectively prune 162 irrelevant documents from the 189 documents while retaining all the relevant results. We can clearly conclude that employing ontologies greatly improves search relevancy.
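
The following sketch illustrates this two-ontology search-and-prune idea under deliberately simplified assumptions: each ontology is reduced to a flat set of concept terms, and a retrieved document is pruned when it matches the non-related ontology more strongly than the related one. Our actual algorithm weights concepts by semantic degree, which this toy version omits; the term sets and documents shown are hypothetical.

    def concept_hits(doc_terms, ontology_terms):
        """Count how many of the ontology's concept terms occur in the document."""
        return len(doc_terms & ontology_terms)

    def search_and_prune(docs, related, unrelated):
        """docs: {url: set of terms}. Retrieve by the related ontology,
        then prune documents that lean toward the unrelated ontology."""
        retrieved = {url: t for url, t in docs.items() if concept_hits(t, related) > 0}
        return {url: t for url, t in retrieved.items()
                if concept_hits(t, related) >= concept_hits(t, unrelated)}

    # Hypothetical term sets for the two senses of "windows":
    ms = {"windows", "microsoft", "nt", "office", "dos"}
    regular = {"windows", "glass", "frame", "curtain", "door"}
    docs = {
        "a.htm": {"microsoft", "windows", "nt", "server"},
        "b.htm": {"aluminum", "windows", "glass", "door"},
    }
    print(search_and_prune(docs, related=ms, unrelated=regular))  # keeps a.htm only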

7. Conclusions and Future Work

In this paper, we presented a new ranking algorithm that makes Web search more effective and flexible by adding more semantic notion. We utilize the hypertext characteristics of Web documents together with ontologies to model the ranking algorithm and provide more flexibility. Our search system allows users to select the desired search domains, so that the documents they are looking for can be located correctly based on relevancy.

In this research, we studied some key problems of existing hypertextual ranking algorithms. These problems include retrieving irrelevant documents when one simply follows the connected hyperlinks of a document, unrelated documents connected by automatically generated links, Web advertisements, and multi-frame Web pages whose information is distributed among other linked documents. To solve these problems, we developed a new ranking algorithm that looks much deeper into the content of linked documents. In addition, we used domain-specific ontologies to address the traditional problems of text search that involve synonymy, polysemy, and context sensitivity.


Furthermore, we exploited data mining techniques to refine our search engine. Three useful techniques were applied: association rule mining to explore the primary keywords of retrieved documents, fuzzy C-means clustering to provide an overview of the desired documents, and sequential pattern mining to discover and identify domain-specific phrases in crawled documents. We implemented the system and tested it with both Chinese and English Web documents from our university Web sites. Preliminary results show that our Web search system is effective and efficient.
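
As an illustration of the first technique, the sketch below performs an Apriori-style, level-wise search for frequent keyword sets over retrieved documents and keeps only the maximal ones. The support threshold and the sample documents are hypothetical, and this is a simplified stand-in for the mining step, not our exact implementation.

    from itertools import combinations

    def maximal_keyword_sets(docs, min_support=2):
        """docs: list of keyword sets, one per retrieved document.
        Returns frequent keyword sets not contained in any larger frequent set."""
        items = sorted({kw for d in docs for kw in d})
        frequent, level = [], [frozenset([i]) for i in items]
        while level:
            # keep only candidates that occur in at least min_support documents
            counted = [s for s in level
                       if sum(1 for d in docs if s <= d) >= min_support]
            frequent.extend(counted)
            # candidate (k+1)-sets are unions of frequent k-sets differing in one item
            level = list({a | b for a, b in combinations(counted, 2)
                          if len(a | b) == len(a) + 1})
        return [s for s in frequent
                if not any(s < t for t in frequent)]  # keep maximal sets only

    docs = [{"java", "linux", "jbuilder"}, {"java", "jbuilder"}, {"java", "linux"}]
    print(maximal_keyword_sets(docs))  # e.g. [{'java','jbuilder'}, {'java','linux'}]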

Currently, our ontologies are mainly constructed by domain experts. In future studies, we can use data mining techniques, such as association rule mining and clustering, to discover the relationships between terms for automatic construction and maintenance of ontologies. In fact, we already use the sequential pattern mining technique to discover domain-specific phrases so that an expert can maintain the lexicon in our search system, and many researchers are making efforts to construct ontologies semi-automatically [23–25, 29]. In addition, we can improve the flexibility of our system even further. Presently, our system provides users with options to select the desired ontology in their Web searches. In the future, we would like to design facile interfaces for users to adjust parameter settings, such as the threshold of semantic degree for main concepts and related concepts, and to assign different weighting strategies to different ontologies. The framework of our ranking algorithm already has the capability of adjusting weights on ontologies and search concepts. Additionally, since an ontology is itself a publicly shareable, reusable, and inheritable knowledge framework, ontology creators as well as general users should be able to make changes and modifications to its content and architecture [21] in a controllable way. We believe that our search system and ranking algorithm will produce better results when more complete and standardized ontologies are available.

References

[1] http://www.google.com/
[2] http://www.yahoo.com/
[3] R. Agrawal and R. Srikant, "Mining Sequential Patterns", in Proc. of the 11th Int. Conf. on Data Engineering, 1995, pp. 3–14.
[4] R. Agrawal et al., "Mining Association Rules Between Sets of Items in Very Large Databases", in Proc. of the ACM SIGMOD Int. Conf. on Management of Data, 1993, pp. 207–216.
[5] A. H. Alani et al., "Automatic Ontology-based Knowledge Extraction from Web Documents", IEEE Intelligent Systems 18(1), 2003, 14–21.
[6] K. Bharat and M. R. Henzinger, "Improved Algorithms for Topic Distillation in a Hyperlinked Environment", in Proc. of the 21st Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 1998, pp. 104–111.
[7] J. Carriere and R. Kazman, "WebQuery: Searching and Visualizing the Web Through Connectivity", in Proc. of the 6th Int. World Wide Web Conf., 1997.
[8] S. Chakrabarti et al., "Focused Crawling: A New Approach to Topic-specific Web Resource Discovery", in Proc. of the 8th Int. World Wide Web Conf., 1999, pp. 1623–1640.
[9] S. Chakrabarti, "Data Mining for Hypertext: A Tutorial Survey", ACM SIGKDD Explorations 1(2), 2000, 1–11.
[10] S. Chakrabarti et al., "Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text", in Proc. of the 7th Int. World Wide Web Conf., 1998, pp. 14–18.
[11] C. H. Chang and C. C. Hsu, "Enabling Concept-based Relevance Feedback for Information Retrieval in the WWW", IEEE Trans. on Knowledge and Data Engineering 11(4), 1999, 595–609.
[12] J. Cho and H. Garcia-Molina, "Parallel Crawlers", in Proc. of the 11th Int. World Wide Web Conf., 2002, pp. 124–135.
[13] J. Dean and M. R. Henzinger, "Finding Related Pages on the World Wide Web", Elsevier Science B.V., 1999.
[14] Y. Ding et al., "The Semantic Web: Yet Another Hip", Data and Knowledge Engineering 41(2), 2002, 205–227.
[15] D. Fensel and M. A. Musen, "The Semantic Web: A Brain for Humankind", IEEE Intelligent Systems 16(2), 2001, 24–25.
[16] D. Fensel, "Ontology-based Knowledge Management", IEEE Computer 35(11), 2002, 56–59.
[17] W. B. Frakes and R. Baeza-Yates, Information Retrieval: Data Structures and Algorithms, Prentice Hall, 1992.
[18] A. Gómez-Pérez and O. Corcho, "Ontology Languages for Semantic Web", IEEE Intelligent Systems 17(2), 2002, 54–60.
[19] T. H. Haveliwala, "Topic-sensitive Pagerank", in Proc. of the 11th Int. World Wide Web Conf., 2002, pp. 517–526.
[20] X. F. He et al., "Automatic Topic Identification using Webpage Clustering", in Proc. of the 1st IEEE Int. Conf. on Data Mining, 2001, pp. 195–202.
[21] J. Hendler, "Agents and the Semantic Web", IEEE Intelligent Systems 16(2), 2001, 30–37.
[22] M. C. Hung and D. L. Yang, "An Efficient Fuzzy C-means Clustering Algorithm", in Proc. of the 1st IEEE Int. Conf. on Data Mining, 2001, pp. 225–232.
[23] C.-C. Kao et al., "Personalized Information Classification System with Automatic Ontology Construction Capability", in Proc. of the 11th Workshop on Object-Oriented Technology and Application, 2000.
[24] J. U. Kietz et al., "A Method for Semi-automatic Ontology Acquisition from a Corporate Intranet", EKAW2000, 12th Int. Conf. on Knowledge Engineering and Knowledge Management, October 2, 2000.
[25] J. U. Kietz et al., "Extracting a Domain-specific Ontology from a Corporate Intranet", in Proc. of CoNLL-2000 and LLL-2000, Lisbon, Portugal, 2000.
[26] J. Kleinberg, "Authoritative Sources in a Hyperlinked Environment", Journal of the ACM 46(5), 1999, 604–632.
[27] K. I. Lin and R. Kondadadi, "A Similarity-based Soft Clustering Algorithm for Documents", in Proc. of the 7th Int. Conf. on Database Systems for Advanced Applications, 2001.
[28] Y. J. Lin and M. S. Yu, "Extracting Chinese Frequent Strings without a Dictionary from a Chinese Corpus and its Applications", Journal of Information Science and Engineering 17, 2001, 805–824.
[29] A. Maedche and S. Staab, "The TEXT-TO-ONTO Ontology Learning Environment", 8th Int. Conf. on Conceptual Structures: Logical, Linguistic, and Computational Issues, August 2000.
[30] R. Navigli et al., "Ontology Learning and its Application to Automated Terminology Translation", IEEE Intelligent Systems 18(1), 2003, 22–31.
[31] H. F. Yan et al., "Architectural Design and Evaluation of an Efficient Web-crawling System", in Proc. of the 15th Int. Parallel and Distributed Processing Symposium, 2001, pp. 1824–1831.