Clustering of search engine results by Google

Wouter.Mettrop@cwi.nl, CWI, Amsterdam, The Netherlands
Paul.Nieuwenhuysen@vub.ac.be, Vrije Universiteit Brussel and Universiteit Antwerpen, Belgium
Hanneke Smulders, Infomare Consultancy, The Netherlands

Presented at Internet Librarian International 2004 in London, England, October 2004




Abstract - Summary - Overview

Our experimental and quantitative investigation has shed some light on the phenomenon that the Google search engine omits WWW documents from the ranked list of search results that it provides when the documents are "very similar". Google offers the possibility to "repeat the search with the omitted results included", on the last page with search results. All this can be considered an additional service offered by the system to the users.

However, our investigation revealed that pages are also clustered, omitted, and thus hidden to some extent, even when they can be substantially different in meaning for a human reader. The system does not distinguish authentic pages from copies or, more importantly, from copies that were modified on purpose.

Furthermore, Google selects different WWW documents over time to represent the cluster of very similar documents.

A practical consequence is that a search for information may lead a user to rely on the information presented in the WWW document that represents a cluster of documents, even though that document is not necessarily the most appropriate or authentic one.


Contents - summary - structure - overview of this paper

1. Introduction: Google omits documents from search results
2. Hypothesis & problem statement
3. Experimental procedure
4. Results / findings
5. Discussion: This may be important
6. Conclusion of our investigation and recommendation


Introduction: very similar documents and Google

• In response to a search query, the Google Web search engine delivers a ranked list of entries that refer to documents on the WWW.

• Google "omits some entries" when these are "very similar" and offers the possibility to "repeat the search with the omitted results included", on the last page with search results.

• All this can be considered an additional service offered by the system to the users, because they do not have to waste time inspecting very similar entries.
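Google's actual similarity test is not public. Purely as an illustration of the general idea of clustering "very similar" documents, a toy near-duplicate check based on word shingles might look like this; the threshold value is an arbitrary assumption:

```python
# Toy illustration of near-duplicate detection via word shingles.
# This is NOT Google's algorithm (which is not public); it only
# sketches how "very similar" documents could be clustered.

def shingles(text, k=3):
    """Return the set of k-word shingles of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets (1.0 = identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def very_similar(doc1, doc2, threshold=0.9):
    """Treat two documents as 'very similar' above a chosen threshold."""
    return jaccard(shingles(doc1), shingles(doc2)) >= threshold
```

With a high threshold only near-identical texts would be clustered; which criteria and thresholds Google actually uses is exactly what remains opaque to the user.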


Introduction: other services offered by Google

• The omitting of very similar entries should not be confused with other services / tricks performed by Google, like

» offering a link to "cached" documents

» offering a link to "similar pages" associated with an entry; this leads in fact not to similar but to very different documents on different servers; these are related to the entry and may be useful according to Google

» clustering and hiding entries from the results because they are all located on the same server computer, even though they may not be similar in contents


Hypothesis and problem statement: competition for visibility

• Our hypothesis was that the Google computer system cannot determine which entry is the best one to serve as the representative of a cluster of entries (at least not in all cases), because this depends

» on the one hand, on the aims of the user

» on the other hand, on the variations in the documents

This is analogous to the problem of how to rank entries in the presentation of search results.


Screen shot of a Google Web search in a case with "omitted entries"


Screen shot of a Google Web search with "omitted entries included"


Experimental procedure: Test documents for the WWW (1)

• To understand the clustering better, we have performed experiments with very similar test documents.

• A unique test document was put on the Internet, together with specific content in several metatags. This guaranteed that our test documents ranked highly in the results of our searches.

• Differences among the documents: in the Title tag + Body text + Filename
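The procedure above can be sketched as follows. The filenames, title texts, body texts, and the unique metatag string here are hypothetical placeholders, not the ones actually used in the experiment:

```python
# Sketch of generating near-identical HTML test documents that differ
# only in title tag, body text, and filename. The unique marker string
# and variant texts are invented for illustration, not from the study.

UNIQUE_MARKER = "xqzexperimentmarker2002"  # hypothetical unique search term

TEMPLATE = """<html>
<head>
<title>{title}</title>
<meta name="keywords" content="{marker}">
<meta name="description" content="Test document {n} for the clustering experiment">
</head>
<body>
<p>{body}</p>
</body>
</html>
"""

def make_variant(n, title, body):
    """Return (filename, html) for one test-document variant."""
    filename = f"testdoc{n:02d}.html"
    html = TEMPLATE.format(title=title, marker=UNIQUE_MARKER, n=n, body=body)
    return filename, html

# Two variants that a human reads quite differently, but that a
# search engine may still cluster as "very similar":
f1, doc1 = make_variant(1, "Original report", "The price of the product is 100 euro.")
f2, doc2 = make_variant(2, "Modified copy", "The price of the product is 10 euro.")
```

Searching for the unique marker then retrieves exactly these planted documents, which is what makes the clustering behaviour observable.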


Experimental procedure: Test documents for the WWW (2)

• We keep 18 different samples of the test document on the Internet, on 8 server computers.

• Documents were submitted at the end of 2002 (9 documents) and at the end of 2003 (10 documents).

• Our test documents do NOT change over time.


Experimental procedure: Example of our test document on the WWW


Experimental procedure: Searching for the test documents

• We submitted 55 queries simultaneously every 30 minutes in the period March – June 2004.

• The total number of queries submitted is 274615.

• In 99% of the test results, Google put all retrieved test documents in one cluster. This cluster was always represented by one of the test documents.
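The reported total is consistent with the stated schedule of 55 queries per round and one round every 30 minutes, as a quick check shows; the exact start and end dates are not given, so the number of days is derived from the total rather than assumed:

```python
# Consistency check of the reported query volume: 55 queries per round,
# one round every 30 minutes (48 rounds per day). The exact start and
# end dates of the March - June 2004 period are not stated, so the
# number of days is derived from the reported total.

queries_per_round = 55
rounds_per_day = 48  # one round every 30 minutes

total_reported = 274615
rounds = total_reported / queries_per_round  # number of 30-minute rounds
days = rounds / rounds_per_day               # implied length of the test period
```

This works out to 4993 rounds, i.e. roughly 104 days of continuous querying, which fits a March – June window.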


Experimental results: Changes of representative document

• We found that, over time, Google chose different test documents to represent the cluster of similar test documents.


Experimental results: Changes of representative document

• Definition

» A representative period is a period that lasted more than 2 hours, in which there is no change in the representative document of the cluster of test documents for one of our queries.

• Findings

» The number of representative periods per query was 11 or 12.

» The length of the representative periods per query was between 1 day and 27 days.
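The definition above can be made concrete. The observation format used here, a list of (time in hours, representative document) pairs, is a hypothetical sketch of how the logged results could be processed, and the sample data are toy values far shorter than the real day-scale periods:

```python
# Sketch of extracting "representative periods" from a series of
# observations (time_in_hours, representative_doc). A period qualifies
# when the representative document stays unchanged for more than
# min_hours (2 hours, per the definition above).

def representative_periods(observations, min_hours=2):
    """Return (start, end, doc) for each unchanged run longer than min_hours."""
    periods = []
    if not observations:
        return periods
    start_t, current = observations[0]
    last_t = start_t
    for t, doc in observations[1:]:
        if doc != current:
            # Representative changed: close the previous run if long enough.
            if last_t - start_t > min_hours:
                periods.append((start_t, last_t, current))
            start_t, current = t, doc
        last_t = t
    # Close the final run.
    if last_t - start_t > min_hours:
        periods.append((start_t, last_t, current))
    return periods
```

Applied to observations taken every 30 minutes, this yields runs like those reported: 11 or 12 periods per query, each lasting between 1 and 27 days.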


Experimental results: Representatives for queries 1-5


Experimental results: Observations

• In 99% of the results of our test, an old test document (submitted in 2002) was the representative. Does Google aim at authenticity?

• The selection of the representative document of a cluster depends not only on the documents in the cluster, but also on the query that is used to retrieve the cluster.


Experimental results: Test documents retrieved

All test documents: 18

Test documents found by searching for content (not URL): 15

Test documents used as representative: 9

Test documents found by searching for URL: 15


Discussion: The importance of similar documents

• Real, authentic documents on their original server computer have to compete with "very similar" versions (that can be substantially different), which are made available by others on other servers. In reality, documents are not abstract items: they can be concrete, real laws, regulations, price lists, scientific reports, political programs… so that NOT finding the more authentic document can have real consequences.


Discussion: The importance of similar documents

• This may complicate scientometric / bibliometric studies, that is, quantitative studies of the numbers of documents retrieved.

• Can documents on their original server be pushed out of the Google search results by very similar competing documents on one or several other servers?


Conclusion and recommendation

• In general, realize that Google Web search omits very similar entries from search results.

• In particular, take this into account when it is important to find

» the oldest, authentic, master version of a document;

» the newest, most recent version of a document;

» versions of a document with comments, corrections…;

» in general: variations of documents.