Relevance of Search Results

Anyone can create a web page and produce their own content, comprising observations, facts, anecdotes and all manner of written texts from poems to polemical essays. Given the few barriers to publishing on the web, it can be described as a medium of the people, by the people, for the people!

This very democratic characteristic of open publishing on the web introduces new kinds of problems. Traditionally, publishing a paper, poem, article, or book meant that a series of filtering exercises had been performed that served as a form of quality control, controversial though it may be at times.

A formally published work would carry with it the associated gravitas of its source with the imprimatur of a respected publishing house, based upon a perceived sense of the quality of the house and its previous publications. Information published in this way retained a certain level of integrity, with a reasonable assurance that the facts presented were essentially accurate and reliable.

However, the web removes these formal filtering mechanisms, for better or worse, with the result that much of the information available through the web comes from non-authoritative sources and may or may not be reliable.

Given the dubious nature of many sources of web content, the integrity of information on the web is often in doubt. The problem is exacerbated by the fact that the web is riddled with out-of-date and unmaintained pages. While a person might recognize quickly that a particular page is irrelevant or out-of-date, it is decidedly more difficult for a crawling and indexing process to make the same determination “on sight”.

A “quality” page from an authoritative source is difficult to distinguish from the rest of the clutter on the web. Indexing by itself is useful but not sufficient to rank the relative reliability and significance of information contained in web pages.

For example, the sample search engine from the previous section could be refined so that the search results are ranked according to the greatest number of occurrences of the key word(s), or perhaps by the relative frequency of usage within each document. However, this approach guarantees only that the search terms occur in the documents; it does not ascertain the relative quality of the documents.
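The occurrence-count refinement described above can be sketched in a few lines of Python. The page data and function names here are hypothetical, chosen only to illustrate the idea of ranking by relative term frequency:

```python
# A minimal sketch of ranking pages by relative keyword frequency:
# the score is the number of occurrences of the search term divided
# by the total number of words in the document.

def term_frequency(term: str, document: str) -> float:
    words = document.lower().split()
    if not words:
        return 0.0
    return words.count(term.lower()) / len(words)

def rank_by_frequency(term, pages):
    """Sort (url, text) pairs by descending relative term frequency."""
    return sorted(pages, key=lambda p: term_frequency(term, p[1]), reverse=True)

# Hypothetical sample data.
pages = [
    ("a.html", "space exploration and more space missions"),
    ("b.html", "cooking recipes with no relevant words"),
]
results = rank_by_frequency("space", pages)
```

As the surrounding text notes, this ranks `a.html` first simply because the word "space" occurs more often there; it says nothing about whether `a.html` is a reliable source.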

Furthermore, the sorting of search results is not always such a transparent process. Which sites occupy the top spots? Did they “earn” the spot, or was it purchased? A number of search engines engaged in the rather dubious practice whereby top spots were “sold” or “sponsored” without being clearly identified as, essentially, advertisements.

Bias in the ranking of search results, whether by unsophisticated indexing and data collection or through “sponsored” results, led to a certain loss of confidence in search engines and index sites.

Correcting this perception of bias by performing objective and highly accurate searches was a primary objective in the design of Google, and accounts for its quick rise to become the leading search engine on the web.

A distinguishing feature of the Google algorithms and architecture is their emphasis on the quality of the search results, as opposed to the sheer quantity of "hits" returned by earlier systems. The top ten results of a Google search are generally quite reliable, routinely identifying definitive references, so much so that the Google search engine has become a de facto standard for information extraction from the web.

“We designed our ranking function so that no particular factor can have too much influence. First, consider the simplest case -- a single word query. In order to rank a document with a single word query, Google looks at that document's hit list for that word. Google considers each hit to be one of several different types (title, anchor, URL, plain text large font, plain text small font, ...), each of which has its own type-weight.

The type-weights make up a vector indexed by type. Google counts the number of hits of each type in the hit list. Then every count is converted into a count-weight. Count-weights increase linearly with counts at first but quickly taper off so that more than a certain count will not help.

We take the dot product of the vector of count-weights with the vector of type-weights to compute an IR score for the document. Finally, the IR score is combined with PageRank to give a final rank to the document.”[1]

[1] “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Brin and Page, p. 10
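The scoring scheme Brin and Page describe in the passage above can be sketched as follows. The particular type-weight values, the taper cap, and the way the IR score is blended with PageRank (the `alpha` parameter) are all illustrative assumptions; the actual parameters used by Google were never published.

```python
# Illustrative type-weights indexed by hit type; the real values are unknown.
TYPE_WEIGHTS = {"title": 10.0, "anchor": 8.0, "url": 6.0,
                "large_font": 3.0, "small_font": 1.0}

def count_weight(count: int, cap: int = 8) -> float:
    # Increases linearly with the count at first, then goes flat:
    # hits beyond the cap do not help, as the quoted passage describes.
    return float(min(count, cap))

def ir_score(hit_counts: dict) -> float:
    # Dot product of the vector of count-weights with the vector of
    # type-weights, as in the quoted passage.
    return sum(TYPE_WEIGHTS[t] * count_weight(c) for t, c in hit_counts.items())

def final_rank(hit_counts: dict, pagerank: float, alpha: float = 0.5) -> float:
    # One simple way to combine the IR score with PageRank; the paper
    # does not specify the exact combination formula.
    return alpha * ir_score(hit_counts) + (1 - alpha) * pagerank
```

For example, a single title hit plus twenty small-font hits yields an IR score of 18.0 under these assumed weights: the title hit contributes 10.0, while the small-font hits saturate at the cap of 8.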

The ranking procedure is used to distinguish among occurrences of the search terms; some occurrences are more significant than others, and the web pages themselves contain clues that help establish the relative importance of one occurrence over another. The net result is that the additional data factored into the sorting procedure produces more reliable final rankings.

When seeking information about the exploration of space, the web pages of NASA might be more authoritative than “Big John’s Alien Abduction” page, and the two can be distinguished by examining their respective topical web linkages.

The structure of linkages on the web can be examined to ascertain the relative importance of web pages. The concept of “page rank analysis” has been developed to help rank the perceived level of authority of any given page by assessing the degree to which other pages link to that page, thereby conferring more authority to that page.

The notion of page rank begins with an initial search based upon the original inquiry. The search yields a set of topically relevant sites. The sites themselves are evaluated as to their relative significance based upon an appraisal of their degree of “connectedness”.

The idea of page ranking is based upon graph theory, where the web sites are represented as labeled nodes and the edges represent links amongst the sites. An analysis of the structure of the graph yields clues as to the relative importance of the individual sites.
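The graph-based analysis described above can be illustrated with a textbook power-iteration sketch of PageRank. This is not Google's production algorithm; the damping factor of 0.85 and the fixed iteration count are conventional choices from the literature, and the three-site graph is invented for the example:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank for a graph given as
    {node: [nodes it links to]}. A simplified textbook sketch."""
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}  # start with a uniform distribution
    for _ in range(iterations):
        new = {node: (1 - damping) / n for node in nodes}
        for node, outs in links.items():
            if not outs:
                # Dangling node: spread its rank evenly over all nodes.
                for m in nodes:
                    new[m] += damping * rank[node] / n
            else:
                # Each page shares its rank equally among its out-links.
                share = damping * rank[node] / len(outs)
                for m in outs:
                    new[m] += share
        rank = new
    return rank

# Hypothetical three-site graph: "blog" links to "NASA",
# "NASA" and "ESA" link to each other.
graph = {"NASA": ["ESA"], "ESA": ["NASA"], "blog": ["NASA"]}
ranks = pagerank(graph)
```

Because "NASA" receives links from two sites while "blog" receives none, "NASA" ends up with the higher rank, illustrating how linkage structure confers authority.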

For example, a site with many “out” links could very well be a “hub” for that topic, consisting of a collection of related links for the subject. The act of collecting the links suggests a level of relevant topical activity that is noteworthy, and could therefore raise the prominence of the site in the rankings.

A site with many “in” links could be potentially construed as an “authority” for that subject, as many other sites identified with the subject have chosen to include a link to the perceived “authority” site.
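The hubs-and-authorities idea formalized in Kleinberg's HITS algorithm, referenced at the end of this section, can be sketched as a pair of mutually reinforcing scores. The graph and iteration count below are illustrative:

```python
def hits(links, iterations=20):
    """Simplified sketch of Kleinberg's hubs-and-authorities iteration.
    links maps each page to the list of pages it links to."""
    nodes = set(links) | {m for outs in links.values() for m in outs}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking to it.
        auth = {n: sum(hub[m] for m, outs in links.items() if n in outs)
                for n in nodes}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub score: sum of the authority scores of the pages it links to.
        hub = {n: sum(auth[m] for m in links.get(n, [])) for n in nodes}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth

# Hypothetical graph: two hub pages collect links; "nasa" is linked
# to by both hubs, "esa" by only one.
graph = {"hub1": ["nasa", "esa"], "hub2": ["nasa"], "nasa": [], "esa": []}
hub, auth = hits(graph)
```

Here "nasa" emerges with the highest authority score because both hubs link to it, while "hub1" emerges as the stronger hub because it links to both authorities.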

The combination of indexing and page-rank analysis serves to improve the topical relevance of the search results, reducing the number of spurious “hits”, and prioritizing the results based upon the likelihood that a particular link is an authoritative source for the topic.

The utility of crawling, indexing, and searching the resultant data stores is also quite powerful at an organizational level. Companies and organizations have enormous volumes of digital data, much of it in free-form text.

The large amounts of data, combined with a modest amount of employee turnover, could easily result in a certain loss of institutional memory. The indexing process can organize this information and make it accessible for internal use, and also key segments of the data can be indexed and made available to clients and customers.

It is a rare eCommerce site that does not include an internal search capacity to enable users to find what they are looking for without having to wade through a hierarchy of pages.

The results are readily apparent; consider, for example, the convenience of a key word search at a typical on-line auction site. Enter a description and immediately you are provided with a lengthy list of merchandise that in some way or another meets that description.

• A good introduction to the problem of prioritizing search results can be found in “The PageRank Citation Ranking: Bringing Order to the Web”, by Page, Brin, Motwani, and Winograd, at http://dbpubs.stanford.edu:8090/pub/1999-66.

 

• The concept of analyzing the structure of the web using a “hubs” and “authorities” model is presented in “Authoritative Sources in a Hyperlinked Environment” by Kleinberg, Journal of the ACM, Vol. 46, No. 5, September 1999, pp. 604–632.