Web Mining
Ricardo Baeza-Yates∗
Center for Web Research
Computer Science Department
Universidad de Chile
[email protected]
& ICREA Professor
Dept. of Technology
Universitat Pompeu Fabra
Barcelona, Spain
Summary
The Web grows and evolves faster than we would like and
expect, posing scalability and relevance problems for Web
search engines. There are three main types of data on the
Web: content (text, multimedia), structure (links that form
a graph), and usage (transactions from Web logs). We
emphasize the last type, in particular a new subfield
called query mining.
In this tutorial we present:
• main current applications of Web mining;
• how mining Web data and usage logs makes it possible
to improve search engines in several ways (ranking,
indexes, queries, and interfaces); and
• a new subfield called query mining, which yields
information scent and new content suggestions that can
be used to improve a Web site.
Server logs of search engines store traces of the queries
submitted by users, which include the queries themselves
along with the Web pages selected from their answers. Query
mining is based on the fact that user queries in search
engines and Web sites give valuable information about the
interests of people. In addition, clicks after queries relate
those interests to actual content. Even queries without
answers point to important missing synonyms or content [1, 2].
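The three signals described above (what users ask, what they click afterwards, and which queries return no answers) can be illustrated with a minimal sketch. The log format, the sample records, and the `summarize` function are hypothetical and chosen only for illustration; real search engine logs contain more fields and need cleaning first.

```python
from collections import Counter, defaultdict

# Hypothetical log records: (query, number of results returned, clicked URL or None).
LOG = [
    ("web mining", 120, "example.org/mining"),
    ("web mining", 120, "example.org/tutorial"),
    ("query logs", 45, "example.org/logs"),
    ("websearch enginee", 0, None),
    ("web mining", 120, "example.org/mining"),
    ("link analisis", 0, None),
]


def summarize(log):
    """Return (query frequencies, clicked URLs per query, queries with no answers)."""
    freq = Counter()               # how often each query was submitted
    clicks = defaultdict(Counter)  # query -> clicked URL -> number of clicks
    unanswered = set()             # queries that returned zero results
    for query, n_results, url in log:
        freq[query] += 1
        if n_results == 0:
            unanswered.add(query)
        if url is not None:
            clicks[query][url] += 1
    return freq, clicks, unanswered


freq, clicks, unanswered = summarize(LOG)
print(freq.most_common(1))                  # most frequent query
print(clicks["web mining"].most_common(1))  # page most often selected for it
print(sorted(unanswered))                   # candidates for missing synonyms or content
```

In this toy log, the unanswered queries are misspellings, which is one way such queries can hint at missing synonyms; a query for a topic the site does not cover would hint at missing content instead.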
∗ This research was partially supported by Millennium Nucleus Grant P01-029-F, Mideplan, Chile.
One example that we present is a framework for clustering
query traces to identify groups of queries used to search
for similar information on the Web. The framework is based
on a new vectorial representation of query traces, which
allows them to be treated similarly to documents in
traditional information retrieval systems. We also consider
the problem of reducing the bias in the selections caused by
the particular answer rankings computed by the search engine.
We show the application of the clustering framework to two
problems: relevance ranking boosting and query
recommendation. Finally, we show experimentally the
effectiveness of our approach [3, 4]. The same ideas can be
applied to advertising campaigns in search engines and to the
automatic generation of a pseudo-ontology for queries.
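As a rough illustration of treating query traces like documents, the sketch below represents each query by the terms of the pages clicked for it and groups queries by cosine similarity. This is not the representation or the clustering algorithm of [3, 4], which this summary does not spell out; the term weighting, the greedy single-pass clustering, the threshold, and all data are assumptions made for the example, and the ranking-bias correction mentioned above is not modeled.

```python
import math
from collections import Counter, defaultdict

# Hypothetical data: text of each page and (query, clicked page) pairs from a log.
PAGES = {
    "a": "cheap flights airline tickets",
    "b": "airline tickets booking flights",
    "c": "python tutorial programming",
}
CLICKS = [("cheap flights", "a"), ("plane tickets", "b"),
          ("plane tickets", "a"), ("learn python", "c")]


def query_vectors(clicks, pages):
    """Represent each query as a term-frequency vector over its clicked pages."""
    vectors = defaultdict(Counter)
    for query, page_id in clicks:
        vectors[query].update(pages[page_id].split())
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def cluster(vectors, threshold=0.5):
    """Greedy single-pass clustering: join the first cluster whose seed is similar enough."""
    clusters = []  # list of (seed vector, member queries)
    for query in sorted(vectors):
        for seed, members in clusters:
            if cosine(seed, vectors[query]) >= threshold:
                members.append(query)
                break
        else:
            clusters.append((vectors[query], [query]))
    return [members for _, members in clusters]


vectors = query_vectors(CLICKS, PAGES)
groups = cluster(vectors)
print(groups)  # [['cheap flights', 'plane tickets'], ['learn python']]
```

Note that "cheap flights" and "plane tickets" share no query words, yet they fall into the same group because users clicked on pages with similar content. Groups of this kind are the sort of signal that query recommendation and ranking boosting can build on.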
As a corollary of these examples, we show several interesting
relations among different Web characteristics: structure,
dynamics, “quality”, etc. Our results help to understand not
only technical issues but also social ones, as the Web is the
collaborative work of many people, a few publishing and
all of them querying.
References
[1] R. Baeza-Yates. Query usage mining in search engines. Web Mining: Applications and Techniques, Anthony Scime, editor. Idea Group, 2004.
[2] R. A. Baeza-Yates. Applications of web query mining. In Advances in Information Retrieval, 27th European Conference on IR Research, ECIR 2005, Santiago de Compostela, Spain, March 21-23, 2005, Proceedings, volume 3408 of Lecture Notes in Computer Science, pages 7–22. Springer, 2005.
[3] R. A. Baeza-Yates, C. A. Hurtado, and M. Mendoza. Query clustering for boosting web page ranking. In Advances in Web Intelligence, Second International Atlantic Web Intelligence Conference, AWIC 2004, Cancun, Mexico, May 16-19, 2004, Proceedings, volume 3034 of Lecture Notes in Computer Science, pages 164–175. Springer, 2004.
[4] R. A. Baeza-Yates, C. A. Hurtado, and M. Mendoza. Query recommendation using query logs in search engines. In Current Trends in Database Technology - EDBT 2004 Workshops, EDBT 2004 Workshops PhD, DataX, PIM, P2P&DB, and ClustWeb, Heraklion, Crete, Greece, March 14-18, 2004, Revised Selected Papers, volume 3268 of Lecture Notes in Computer Science, pages 588–596. Springer, 2004.
Biography
Ricardo Baeza-Yates received his bachelor's degree in CS
in 1983 from the University of Chile. He later also received
the M.Sc. in CS (1985), the professional title in electrical
engineering (1985), and the M.Eng. in EE (1986) from
the same university. He received his Ph.D. in CS from the
U. of Waterloo, Canada, in 1989. He was president of the
Chilean Computer Science Society (SCCC) from 1992 to
1995 and was elected again in 1997. In 1993 he received
the Organization of American States award for young
researchers in exact sciences. In 1997, with two Brazilian
colleagues, he obtained the COMPAQ prize for the best
Brazilian research article in CS. In 2003 he was inducted
into the Chilean Academy of Sciences, becoming the first
computer scientist to achieve this position.
He is currently professor and director of the Center for
Web Research at the CS department of the University of
Chile, where he was chair in 1993-1995 and 2003-2004.
He is also an ICREA Professor at the Dept. of
Technology of the Universitat Pompeu Fabra in Barcelona,
Spain. His research interests include information retrieval,
algorithms, and information visualization. He is co-author
of the book Modern Information Retrieval, published in
1999 by Addison-Wesley; co-author of the 2nd edition of
the Handbook of Algorithms and Data Structures,
Addison-Wesley, 1991; and co-editor of Information
Retrieval: Algorithms and Data Structures, Prentice-Hall,
1992, among other publications in journals published by
ACM, IEEE, or SIAM. He has been a visiting professor or
invited speaker at conferences and universities around
the world, and a referee for several journals,
conferences, the NSF, etc. He is a member of the ACM,
EATCS, IEEE (senior), SCCC, and SIAM.