Clustering of search engine results by Google

Wouter.Mettrop@cwi.nl, CWI, Amsterdam, The Netherlands
Paul.Nieuwenhuysen@vub.ac.be, Vrije Universiteit Brussel and Universiteit Antwerpen, Belgium
Hanneke Smulders, Infomare Consultancy, The Netherlands

Presented at Internet Librarian International 2004 in London, England, October 2004




Abstract - Summary - Overview

Our experimental and quantitative investigation has shed some light on the phenomenon that the Google search engine omits WWW documents from the ranked list of search results that it provides when the documents are "very similar". Google offers the possibility to "repeat the search with the omitted results included", on the last page with search results. All this can be considered an additional service offered by the system to the users.

However, our investigation revealed that pages are also clustered, omitted, and thus hidden to some extent, even when they can be substantially different in meaning for a human reader. The system does not distinguish authentic pages from copies or, more importantly, from copies that were modified on purpose.

Furthermore, Google selects different WWW documents over time to represent the cluster of very similar documents.

A practical consequence is that a search for information may lead a user to rely on the information presented in the WWW document that represents a cluster of documents, even though that document is not necessarily the most appropriate or authentic one.


Contents - summary - structure - overview of this paper

1. Introduction: Google omits documents from search results
2. Hypothesis & problem statement
3. Experimental procedure
4. Results / findings
5. Discussion: This may be important
6. Conclusion of our investigation and recommendation


Introduction: very similar documents and Google

• In response to a search query, the Google Web search engine delivers a ranked list of entries that refer to documents on the WWW.

• Google "omits some entries" when these are "very similar" and offers the possibility to "repeat the search with the omitted results included", on the last page with search results.

• All this can be considered an additional service offered by the system to the users, because they do not have to waste time inspecting very similar entries.
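Google's actual similarity test is not public. Purely as an illustration of the general idea of clustering "very similar" documents, a toy near-duplicate check based on word shingles might look like this; the threshold value is an arbitrary assumption:

```python
# Toy illustration of near-duplicate detection via word shingles.
# This is NOT Google's algorithm (which is not public); it only
# sketches how "very similar" documents could be clustered.

def shingles(text, k=3):
    """Return the set of k-word shingles of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets (1.0 = identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def very_similar(doc1, doc2, threshold=0.9):
    """Treat two documents as 'very similar' above a chosen threshold."""
    return jaccard(shingles(doc1), shingles(doc2)) >= threshold
```

With a high threshold only near-identical texts would be clustered; which criteria and thresholds Google actually uses is exactly what remains opaque to the user.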


Introduction: other services offered by Google

• The omitting of very similar entries should not be confused with other services / tricks performed by Google, like

» offering a link to "cached" documents

» offering a link to "similar pages" associated with an entry; this leads in fact not to similar but to very different documents on different servers; these are related to the entry and may be useful according to Google

» clustering and hiding entries from the results because they are all located on the same server computer, even though they may not be similar in contents


Hypothesis and problem statement: competition for visibility

• Our hypothesis was that the Google computer system cannot determine which entry is the best one to serve as the representative of a cluster of entries (at least not in all cases), because this depends

» on the one hand, on the aims of the user

» on the other hand, on the variations in the documents

This is analogous to the problem of how to rank entries in the presentation of search results.


Screen shot of a Google Web search in a case with "omitted entries"


Screen shot of a Google Web search with "omitted entries included"


Experimental procedure: Test documents for the WWW (1)

• To understand the clustering better, we have performed experiments with very similar test documents.

• A unique test document was put on the Internet, together with specific content in several metatags. This guaranteed that our test documents ranked highly in the results of our searches.

• Differences among the documents: in the Title tag + Body text + Filename
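The procedure above can be sketched as follows. The filenames, title texts, body texts, and the unique metatag string here are hypothetical placeholders, not the ones actually used in the experiment:

```python
# Sketch of generating near-identical HTML test documents that differ
# only in title tag, body text, and filename. The unique marker string
# and variant texts are invented for illustration, not from the study.

UNIQUE_MARKER = "xqzexperimentmarker2002"  # hypothetical unique search term

TEMPLATE = """<html>
<head>
<title>{title}</title>
<meta name="keywords" content="{marker}">
<meta name="description" content="Test document {n} for the clustering experiment">
</head>
<body>
<p>{body}</p>
</body>
</html>
"""

def make_variant(n, title, body):
    """Return (filename, html) for one test-document variant."""
    filename = f"testdoc{n:02d}.html"
    html = TEMPLATE.format(title=title, marker=UNIQUE_MARKER, n=n, body=body)
    return filename, html

# Two variants that a human reads quite differently, but that a
# search engine may still cluster as "very similar":
f1, doc1 = make_variant(1, "Original report", "The price of the product is 100 euro.")
f2, doc2 = make_variant(2, "Modified copy", "The price of the product is 10 euro.")
```

Searching for the unique marker then retrieves exactly these planted documents, which is what makes the clustering behaviour observable.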


Experimental procedure: Test documents for the WWW (2)

• We keep 18 different samples of the test document on the Internet, on 8 server computers.

• Documents were submitted at the end of 2002 (9 documents) and at the end of 2003 (10 documents).

• Our test documents do NOT change over time.


Experimental procedure: Example of our test document on the WWW


Experimental procedure: Searching for the test documents

• We submitted 55 queries simultaneously every 30 minutes in the period March – June 2004.

• The total number of queries submitted is 274615.

• In 99% of the test results, Google put all retrieved test documents in one cluster. This cluster was always represented by one of the test documents.
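The reported total is consistent with the stated schedule of 55 queries per round and one round every 30 minutes, as a quick check shows; the exact start and end dates are not given, so the number of days is derived from the total rather than assumed:

```python
# Consistency check of the reported query volume: 55 queries per round,
# one round every 30 minutes (48 rounds per day). The exact start and
# end dates of the March - June 2004 period are not stated, so the
# number of days is derived from the reported total.

queries_per_round = 55
rounds_per_day = 48  # one round every 30 minutes

total_reported = 274615
rounds = total_reported / queries_per_round  # number of 30-minute rounds
days = rounds / rounds_per_day               # implied length of the test period
```

This works out to 4993 rounds, i.e. roughly 104 days of continuous querying, which fits a March – June window.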


Experimental results: Changes of representative document

• We found that, over time, Google chose different test documents to represent the cluster of similar test documents.


Experimental results: Changes of representative document

• Definition

» A representative period is a period that lasted more than 2 hours, in which there is no change in the representative document of the cluster of test documents for one of our queries.

• Findings

» The number of representative periods per query was 11 or 12.

» The length of the representative periods per query was between 1 day and 27 days.
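The definition above can be made concrete. The observation format used here, a list of (time in hours, representative document) pairs, is a hypothetical sketch of how the logged results could be processed, and the sample data are toy values far shorter than the real day-scale periods:

```python
# Sketch of extracting "representative periods" from a series of
# observations (time_in_hours, representative_doc). A period qualifies
# when the representative document stays unchanged for more than
# min_hours (2 hours, per the definition above).

def representative_periods(observations, min_hours=2):
    """Return (start, end, doc) for each unchanged run longer than min_hours."""
    periods = []
    if not observations:
        return periods
    start_t, current = observations[0]
    last_t = start_t
    for t, doc in observations[1:]:
        if doc != current:
            # Representative changed: close the previous run if long enough.
            if last_t - start_t > min_hours:
                periods.append((start_t, last_t, current))
            start_t, current = t, doc
        last_t = t
    # Close the final run.
    if last_t - start_t > min_hours:
        periods.append((start_t, last_t, current))
    return periods
```

Applied to observations taken every 30 minutes, this yields runs like those reported: 11 or 12 periods per query, each lasting between 1 and 27 days.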


Experimental results: Representatives for queries 1-5


Experimental results: Observations

• In 99% of the results of our test, an old test document (submitted in 2002) was the representative. Does Google aim at authenticity?

• The selection of the representative document of a cluster depends not only on the documents in the cluster, but also on the query that is used to retrieve the cluster.


Experimental results: Test documents retrieved

All test documents: 18

Test documents found by searching for content (not URL): 15

Test documents used as representative: 9

Test documents found by searching for URL: 15


Discussion: The importance of similar documents

• Real, authentic documents on their original server computer have to compete with "very similar" versions (that can be substantially different), which are made available by others on other servers. In reality, documents are not abstract items: they can be concrete, real laws, regulations, price lists, scientific reports, political programs… so that NOT finding the more authentic document can have real consequences.


Discussion: The importance of similar documents

• This may complicate scientometric / bibliometric studies, that is, quantitative studies of the numbers of documents retrieved.

• Can documents on their original server be pushed out of the Google search results by very similar competing documents on one or several other servers?


Conclusion and recommendation

• In general, realize that Google Web search omits very similar entries from search results.

• In particular, take this into account when it is important to find

» the oldest, authentic, master version of a document;

» the newest, most recent version of a document;

» versions of a document with comments, corrections…;

» in general: variations of documents.