Chapter 7 Web Search Results Visualization System Based on

132

Chapter 7

Web Search Results Visualization System Based on Multi-label Classification

Web search engine like Google can nowadays be considered as a cornerstone service for any Internet user. However, too much returned results make us submerge in the sea of information. This chapter explores a method for post-processing these results and designs a prototype system integrating this function. We analyze the characteristics of results returned by search engine and apply multi-label classification method for their post-processing. We conceive the algorithm of multi-label classification for this kind of real-time application. A prototype system based on Java environment is developed. We test our system on-line and analyze its performance and its problems.

7.1 Web Search Results Visualization

In information age, Internet has become the main channel for people to obtain information. With the explosively growing of information in Internet, people are drowning in a sea of information. Google, Yahoo and other famous search engine become an indispensable tool for people to obtain information. However, the current major search engines are still far from satisfying

《THE RESEARCH ON CHINESE TEXT MULTI-LABEL CLASSIFICATION》 ZHIHUA WEI 133

customers due to many reasons. The most important phenomenon is: a user's search keywords are often not very clear expression of the true intentions because of the "synonyms" and "Polysemy" phenomenon in natural languages.

The current search engines (such as Google, MSN, Yahoo, etc.) return a sorted search results according to the keywords inputting by users. In order to find the interesting information, users should browse the title and digest one by one in the search results. If someone input a poor search keyword, the result may be far from satisfied, that is, some interesting information is sorted to the back. In this situation, users need to spend a lot of time to traverse all search results.

To solve above problem, a Web search results visualization system is designed, aiming at improving the search efficient and accurate and ameliorating users’ browsing experiences. Here, Web search results visualization refers to the further sorting on the base of results returned by search engine. Previous researches about this task are mainly based on clustering technology. Related research could be found in literatures [12,28,58,60]. The search engines based on this way are already exist, for example, Vivisimo[88] and Grokker[23]. However, there is a problem that the extracted topic of every cluster is far from satisfying the needs of users. In this case, it’s difficult for users to find interesting information from corresponding cluster. This made the process of arranging search results less significant.

In this chapter, we will take into account the diversity of content in returned search results, that is, one entry (a document) may belong to multiple categories. The system adopts a multi-label classifier to arrange the results according to categories. In this case, users could find an interesting document in various classes it belongs to.

The search engine which we expect is that users could browse interesting contents by clicking corresponding class label. Figure 7.1 shows the results returned by normally used search engine. Figure 7.2 shows the results returned by Web search results visualization system MLWC (Multi-Label Web Classifier) defined in this chapter. The labels in left box are the categories for news pre-defined beforehand. In the right, there are all entries

134 CHAPTER 7. WEB SEARCH RESULTS VISUALIZATION SYSTEM BASED ON MULTI-LABEL CLASSIFICATION

returned by search engine. We can also choose one or a few categories, allowing the system to return the news in specified categories. We could find from Figure 7.2 that the search results are showed in a more clear and coherent approach. You can save much time in locating the interesting contents and browse them more efficiently.

Figure7.1: Results of general search engine


Figure7.2: Results of TJ-MLWC

7.2 TJ-MLWC System Design

TJ-MLWC system is composed of three models: search model, classification model and visualization model, which is showed in Figure 7.3.


Figure 7.3: TJ-MLWC flowcharts

7.2.1 Search Model The search model of system is realized by encapsulating the API of Yahoo Search!. The search results of Yahoo Search! are separated into individual pages and encapsulated coherently to satisfy the requirement of system TJ-MLWC.

The function of this model is to call the API of Yahoo Search! and get its search results. It also extracts the Title and Digest of each entry (see Figure 7.4) for the using of classification model.


Figure 7.4: Search results returned by Yahoo Search!

7.2.2 Classification Model This model is composed of classifier learning and classifier testing. Off-line and on-line learning methods are combined to improve the performance of classifier. Firstly, a Bayes classifier is trained on a news corpus, which is a multi-class multi-label classifier. Then, the Bayes classifier is rectified gradually by the information feedback. That is, the parameters are modified when users give the class labels of a group of results of new query. This method could ensure the basis classification performance and gradually enhancement of classifier.

(1) Constructing Corpus

Chinese documents are considered in current system. We construct the corpus for training process on the base of Sogou Corpus. We select part of the Sogou Corpus and re-label the documents with various labels. Our corpus, TJ-Multi-label Corpus, consists of 5800 documents distributed in 9 classes such as economy, politics, military, sports, entertainment, science, society, business and education. The detailed distribution is showed in Table 4.2.


(2) Off-line Learning of Bayes Classifier

The process of off-line learning Bayes classifier is showed in Figure 7.5. Where, HashMap reserve the feature words and their statistic information. HashMap, a kind of class in Java, uses a hash table to implement the Map interface. It allows the execution time of basic operations, such as get() and put(), to remain constant even for large sets. That is, in the construction of HashMap, the query of feature words and modification of statistic information could be completed within O(1) time complexity.

Figure7.5: Algorithm for off-line learning improved Bayes classifier

(3) Classification Search Results and On-line Learning Bayes classifier

The process of on-line learning Bayes classifier is showed in Figure 7.6.

Input: Training corpus Output: Feature words and their statistic information

Step1: Transverse documents in training set;

Step2: Pre-process documents including word segmentation and removing stop words;

Step3: Calculate the frequency of each feature word in HashMap;

Calculate the conditional probability of each feature word in each class word and reserve the results in HashMap.

End.


Figure7.6: Algorithm for classification process and on-line learning improved Bayes classifier

7.2.3 Visualization Model Visualization model completes the design of user interface which is composed of two parts as follows.

A Strut 2 is selected to construct the view for users. Apache Geronimo 2.X +Jetty 6 are selected as container. This design ensures satisfying the users’ requirements, in the same time, reducing the spending of software in the arrangement process.

AJAX technology is adopted to complete the real-time updating with the changing of class choosing.

For the part of Web, index.html is used to be responsible for the interaction in search process and result.jsp could produce data for AJAX dynamically. For the part of server, json.class processes the data for classification and generates JSON object.

Input: Model files produced in learning process New documents Output: Labels of new documents Step1: Pre-process new documents and construct feature words group

including word segmentation and removing stop words; Step2: Transverse the feature words of new documents and search their

conditional probability in each class in model files which is produced in off-line

learning process. Step3: Calculate the union probability of new document in all classes.

Calculate the threshold. Step4: Assign and output labels for the new documents according to the

threshold. Step5: Modify the parameters in HashMap according to the classification

result of new documents.


7.3 Improved Multi-label Bayes Classification

Algorithm

In the consideration of real-time requirement of system, improved Naïve Bayes classification algorithm is adopted. Naïve Bayes algorithm is fast. For the real-time system, it has a wide application such as in Spam filtering system.

In 2.2.1, Naïve Bayes algorithm is introduced. In classic Naïve Bayes classifier, the label for a document is selected according to the maximum value of . In this case, each document has only one label. For the application of multi-label classification, an improved Bayes algorithm is produced. Assume n class in training set, is adopted to represent the average of posterior probability of document in each class showed in Formula 7.1.

(7.1)

is the probability of in class . When , we consider that belongs to class . In this strategy, new document belongs to all classes which satisfy . The least amount of labels of a document is 1 when the probability of the document belonging to a class is obviously higher than that of the document belonging to other classes. The maximum amount of labels of a document is n when the probabilities of the document belongs to each class are equal.

7.4 Application Examples of System

TJ-MLWC could show the search results in one class or several classes by

( | )j iP C d

thresPid

1

1 ( | )n

thres j ij

P P C dn =

= ∑

( | )j iP C d id jC ( | )j i thresP C d P≥

id jC id( | )j i thresP C d P≥


assigning the class label on the left of system interface. If users do not assign the class labels, system return all results like general search engine. System is simulated in a PC (CPU: Intel Pentium 4, Memory: 2G). The time for returning 100 search results is about 5s, which could satisfy browsing requirement.

Figure7.7: Search results of “Tongji University” when selecting “education” class


Figure7.8: Search results of “Tongji University” when selecting “education” class and “science” class

Figure 7.7 shows that search results of “Tongji University” when we select “education” class. Figure 7.8 shows that search results of “Tongji University” when we select “education” class and “science” class. Figure 7.2 shows all search results.


Figure7.9: All search results of “harm of web game”

Figure7.10: Search results of “harm of web game” when selecting “science” class


Figure 7.9 shows the search results of “harm of web game” in all classes including society, science and economy. It means that the search results of “harm of web game” exist only in three classes. Figure 7.10 shows the results of “harm of web game” when selecting “science” classes, which are mainly about the topics on “web psychology” and “web game addiction”. These contents are reasonable and acceptable.

Figure7.11: Search results of “2009 college entrance examination” when selecting “economy” class

The problems of this system lie in processing image entries and foreign language entries. That is, if a search result is image or video information, the system always assigns all class labels to it. It indicates that the system could not classify multimedia information. The same phenomenon occurs when a search result is other languages (beside Chinese). The reason lies in our training set is only Chinese corpus. There are also other problems. For example, Figure 7.11 shows the search results of “2009 college entrance examination”. There is an entry from “economy TV channel” classified into “economy” class. In fact, it does not relate to economy. The title of web site “economy TV channel” is regarded as feature of document by classifier; as a


result, it is classified into “economy” class. From the analysis of system simulation, we could conclude that

TJ-MLWC system has the function of classifying search results into several classes which makes the browsing more conveniently. There are also some problems to be solved such as multimedia classification and multi-linguistic classification. All of these are our further research interests.

7.5 Conclusion

In this chapter, we improve Naïve Bayes algorithm for the task of multi-label classification in the application of Web search results visualization. A prototype system TJ-MLWC is developed based on this algorithm. The system could call the results of search engine and further classify these results real-time by using Bayes multi-label classifier. This system could show the search results in one or more classes according to users’ requirement. The limits of this system are as follow:

This system is only trained by Chinese corpus, so it could not give the correct class label of documents in other languages. For the entries including image information, our system gives all the class labels.

For the first problem, constructing a multi-lingual classification system would be better. For the second problem, a strategy of removing noise should be integrated into the system. All of these are our further research topics.

Documents

Chapter 7 Web Search Results Visualization System Based on