[IEEE Third Latin American Web Congress (LA-WEB'2005) - Buenos Aires, Argentina (31-02 Oct. 2005)] Third Latin American Web Congress (LA-WEB'2005) - Web Mining


Web Mining

Ricardo Baeza-Yates∗

Center for Web Research
Computer Science Department
Universidad de Chile
[email protected]

&

ICREA Professor
Dept. of Technology
Universitat Pompeu Fabra
Barcelona, Spain

Summary

The Web grows and evolves faster than we would like and expect, imposing scalability and relevance problems on Web search engines. There are three main data types in the Web: content (text, multimedia), structure (links that form a graph), and usage (transactions from Web logs). We emphasize the last type of data, in particular a new subfield called query mining.

In this tutorial we present:

• the main current applications of Web mining;

• how mining Web data and usage logs allows us to improve search engines in several ways (ranking, indexes, queries, and interfaces); and

• a new subfield called query mining, which allows us to obtain information scent and new content suggestions to improve a Web site.

Server logs of search engines store traces of the queries submitted by users, which include the queries themselves along with the Web pages selected in their answers. Query mining is based on the fact that user queries in search engines and Web sites give valuable information on the interests of people. In addition, clicks after queries relate those interests to actual content. Even queries without answers point to important missing synonyms or content [1, 2].
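As a small illustration of the last point, the sketch below counts the queries in a log that returned no answers, since these are candidates for missing synonyms or missing content. The log layout (query string, number of results, clicked URLs) and the sample records are hypothetical and are not taken from the tutorial.

```python
from collections import Counter

# Hypothetical log records: (query string, number of results, clicked URLs).
LOG = [
    ("cheap flights", 120, ["a.example/flights"]),
    ("aeroplane tickets", 0, []),
    ("cheap flights", 98, ["b.example/deals"]),
    ("aeroplane tickets", 0, []),
    ("visa requirments", 0, []),
]


def unanswered_queries(log):
    """Return (query, count) pairs for queries with no results, most frequent first."""
    counts = Counter(query for query, n_results, _clicks in log if n_results == 0)
    return counts.most_common()


for query, count in unanswered_queries(LOG):
    print(f"{count:3d}  {query}")
```

In such a listing, a frequent unanswered query may indicate vocabulary that the indexed content does not use, or a topic that the site does not cover at all.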

One example that we present is a framework for clustering query traces to identify groups of queries used to search for similar information on the Web. The framework is based on a new vectorial representation of query traces, which allows us to treat them similarly to documents in traditional information retrieval systems. We also consider the problem of reducing the bias in the selections caused by the particular answer rankings computed by the search engine. We show the application of the clustering framework to two problems: relevance ranking boosting and query recommendation. Finally, we show with experiments the effectiveness of our approach [4, 3]. The same ideas can be applied to advertising campaigns in search engines and to the automatic generation of a pseudo-ontology for queries.

∗ This research was partially supported by Millennium Nucleus Grant P01-029-F, Mideplan, Chile.
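The idea of treating query traces like documents can be sketched as follows. This is only a minimal illustration under stated assumptions, not the representation defined in [3, 4]: each query is mapped to a raw term-count vector built from the text of the pages selected for it, and two queries are compared with cosine similarity. The sample click log, the whitespace tokenization, and the unweighted counts are all assumptions, and the reduction of ranking bias is not modeled.

```python
import math
from collections import Counter, defaultdict

# Hypothetical click log: (query, text of a page selected for that query).
CLICKS = [
    ("cheap flights", "low cost airline tickets and flight deals"),
    ("airline deals", "flight deals and discount airline tickets"),
    ("python tutorial", "learn python programming with examples"),
]


def query_vectors(clicks):
    """Build one term-count vector per query from the pages clicked for it."""
    vectors = defaultdict(Counter)
    for query, page_text in clicks:
        vectors[query].update(page_text.lower().split())
    return dict(vectors)


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


vectors = query_vectors(CLICKS)
print(cosine(vectors["cheap flights"], vectors["airline deals"]))
print(cosine(vectors["cheap flights"], vectors["python tutorial"]))
```

Here the two travel queries share no words with each other, yet they come out as similar because users selected pages with overlapping vocabulary. A standard clustering algorithm could then group queries using this similarity.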

As a corollary of all the examples, we show several interesting relations among different Web characteristics: structure, dynamics, “quality”, etc. Our results help to understand not only technical issues but also social ones, as the Web is the collaborative work of many people, a few publishing, and all of them querying.

References

[1] R. Baeza-Yates. Query usage mining in search engines. Web Mining: Applications and Techniques, Anthony Scime, editor. Idea Group, 2004.

[2] R. A. Baeza-Yates. Applications of web query mining. In Advances in Information Retrieval, 27th European Conference on IR Research, ECIR 2005, Santiago de Compostela, Spain, March 21-23, 2005, Proceedings, volume 3408 of Lecture Notes in Computer Science, pages 7–22. Springer, 2005.

[3] R. A. Baeza-Yates, C. A. Hurtado, and M. Mendoza. Query clustering for boosting web page ranking. In Advances in Web Intelligence, Second International Atlantic Web Intelligence Conference, AWIC 2004, Cancun, Mexico, May 16-19, 2004. Proceedings, volume 3034 of Lecture Notes in Computer Science, pages 164–175. Springer, 2004.

[4] R. A. Baeza-Yates, C. A. Hurtado, and M. Mendoza. Query recommendation using query logs in search engines. In Current Trends in Database Technology - EDBT 2004 Workshops, EDBT 2004 Workshops PhD, DataX, PIM, P2P&DB, and ClustWeb, Heraklion, Crete, Greece, March 14-18, 2004, Revised Selected Papers, volume 3268 of Lecture Notes in Computer Science, pages 588–596. Springer, 2004.

Biography

Ricardo Baeza-Yates received his bachelor's degree in CS in 1983 from the University of Chile. He later also received the M.Sc. in CS (1985), the professional title in electrical engineering (1985), and the M.Eng. in EE (1986) from the same university. He received his Ph.D. in CS from the U. of Waterloo, Canada, in 1989. In 1992 he was elected president of the Chilean Computer Science Society (SCCC), serving until 1995, and was elected again in 1997. In 1993 he received the Organization of American States award for young researchers in exact sciences. In 1997, with two Brazilian colleagues, he obtained the COMPAQ prize for the best Brazilian research article in CS. In 2003 he was inducted into the Chilean Academy of Sciences, being the first computer scientist to achieve this position.

Currently he is professor and director of the Center for Web Research at the CS department of the University of Chile, where he was the chair in the periods 1993-5 and 2003-4. He is also an ICREA Professor at the Dept. of Technology of the Universitat Pompeu Fabra in Barcelona, Spain. His research interests include information retrieval, algorithms, and information visualization. He is co-author of the book Modern Information Retrieval, published in 1999 by Addison-Wesley, as well as co-author of the 2nd edition of the Handbook of Algorithms and Data Structures, Addison-Wesley, 1991; and co-editor of Information Retrieval: Algorithms and Data Structures, Prentice-Hall, 1992, among other publications in journals published by ACM, IEEE, or SIAM. He has been visiting professor or invited speaker at several conferences and universities all around the world, as well as a referee for several journals, conferences, NSF, etc. He is a member of the ACM, EATCS, IEEE (senior), SCCC, and SIAM.