Emrullah Delibas
• The Problem of Ranking: Objectives, Challenges
• Early Assumptions & Approaches
• Link-Based Ranking Algorithms:
  • InDegree Algorithm
  • Hubs and Authorities: HITS
  • PageRank
  • SALSA
  • Hilltop
• Search Engine Spamming
• Problems with Non-textual Content
• “Cornell”: did the searcher want information about
  • the university?
  • the university’s hockey team?
  • the Lab of Ornithology run by the university?
  • Cornell College in Iowa?
  • the Nobel-Prize-winning physicist Eric Cornell?
The same ranking of search results can’t be right for everyone.
• Objectives:
  • to categorize webpages
  • to find pages related to given pages
  • to find duplicated websites
  • to calculate the ‘quality’ of a web link
  • to get the most ‘relevant’ web links for a given query
  • to model human judgments indirectly
  • …
• Challenges:
  • searching is by itself a hard problem for computers to solve in any setting
  • the scale and complexity of the Web
  • the problems of synonymy and polysemy
  • the dynamic, constantly changing nature of Web content
  • …
• Back in the 1990s, web search was based purely on the number of occurrences of a word in a document.
• Ranking was based only on the relevance of a document to the query.
• Simply retrieving the relevant documents wasn’t sufficient, as their number could run into the millions.
• Links are assumed to be endorsements; exceptions to this assumption include:
  • disagreement
  • self-citation
  • linking to a popular document
• Hyperlinks carry information about the human judgment of a site.
• The more incoming links a site has, the more highly it is judged.
• The Web is not a random network.
Bray, Tim. "Measuring the Web." Computer Networks and ISDN Systems 28.7 (1996): 993–1005.
Marchiori, Massimo. "The quest for correct information on the Web: Hyper search engines." Computer Networks and ISDN Systems 29.8 (1997): 1225–1235.
• Hyperlinks are not placed at random; they provide valuable information for:
  • link-based ranking
  • structure analysis
  • detection of communities
  • spam detection
  • …
• This approach can be seen as the basis of every link-analysis ranking algorithm.
• The link-recommendation assumption: by linking to another page, the author recommends it.
• So a page with many incoming links has been highly recommended.
• The ranking is based only on this raw authority count; authority values are not weighted.
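The InDegree idea above can be sketched in a few lines of Python. This is a minimal, hypothetical illustration (the edge list and candidate pages are invented for the example): every incoming link counts equally, with no weighting of the linking page's own authority.

```python
from collections import Counter

def indegree_rank(edges, candidates):
    """Rank candidate pages by raw in-link count (InDegree).

    edges: list of (source, target) link pairs.
    candidates: pages matching the query, to be ordered.
    """
    # Count how many links point at each target page.
    in_deg = Counter(target for _, target in edges)
    # Highest in-degree first; no weighting of the linkers.
    return sorted(candidates, key=lambda p: in_deg[p], reverse=True)
```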
Hypertext Induced Topic Selection
• The basic idea is that relevant pages (“authorities”) are linked to by many other pages (“hubs”).
• The algorithm is now a part of the Ask search engine.
Jon Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999. A preliminary version appears in the Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms, Jan. 1998.
• It was developed by modeling the way humans analyze a search task, rather than having a machine simply match a query against a collection of documents and return the matches.
• For example: “top automobile makers in the world”
• Rules:
  • A good hub points to many good authorities.
  • A good authority is pointed to by many good hubs.
  • Authorities and hubs have a mutual reinforcement relationship.
• Objective: construct a set Sq such that
  • (i) Sq is relatively small
  • (ii) Sq is rich in relevant pages
  • (iii) Sq contains most (or many) of the strongest authorities
• Solution:
  • generate a root set Qσ from a text-based search engine
  • expand the root set into Sq by adding pages that link to it or are linked from it
• Let the authority score of page i be x(i), and the hub score of page i be y(i).
• Mutual reinforcement relationship:
  • I step: x(i) ← Σ y(j), summed over all pages j that link to i
  • O step: y(i) ← Σ x(j), summed over all pages j that i links to
• The scores are computed iteratively: each iteration applies the I step and then the O step (1st iteration, 2nd iteration, …), followed by normalization, until the scores converge.
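The I/O iteration above can be sketched as a small routine. This is a minimal sketch, not Kleinberg's original implementation; the graph representation, fixed iteration count, and L2 normalization are assumptions for illustration.

```python
from collections import defaultdict

def hits(edges, iterations=20):
    """Iterate the HITS I and O steps on a directed graph.

    edges: list of (source, target) link pairs.
    Returns (authority_scores, hub_scores) as dicts.
    """
    nodes = {n for edge in edges for n in edge}
    in_links = defaultdict(list)   # target -> pages linking to it
    out_links = defaultdict(list)  # source -> pages it links to
    for s, t in edges:
        in_links[t].append(s)
        out_links[s].append(t)

    x = {n: 1.0 for n in nodes}  # authority scores
    y = {n: 1.0 for n in nodes}  # hub scores
    for _ in range(iterations):
        # I step: authority = sum of hub scores of in-linking pages
        x = {n: sum(y[s] for s in in_links[n]) for n in nodes}
        # O step: hub = sum of authority scores of linked-to pages
        y = {n: sum(x[t] for t in out_links[n]) for n in nodes}
        # Normalize so the scores do not grow without bound.
        nx = sum(v * v for v in x.values()) ** 0.5 or 1.0
        ny = sum(v * v for v in y.values()) ** 0.5 or 1.0
        x = {n: v / nx for n, v in x.items()}
        y = {n: v / ny for n, v in y.items()}
    return x, y
```

On a toy graph where two hub pages both link to page “a” and only one links to page “b”, “a” ends up with the higher authority score, illustrating the mutual reinforcement.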
1. the neighborhood graph must be built “on the fly”
2. suffers from topic drift
3. cannot detect advertisements
4. can easily be spammed
5. query-time evaluation is slow
Heart of Google
• Proposed by Sergey Brin and Lawrence Page
• Uses a recursive scheme similar to Kleinberg’s HITS algorithm
• But the PageRank algorithm produces a ranking independent of the user’s query.
Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. In Proc. 7th International World Wide Web Conference, pages 107–117, 1998.
• A page is important if it is pointed to by other important pages.
• The PageRank of a page pi is given as follows:
  • suppose that the pages M(pi) link to page pi
  • L(pj) is the number of outbound links on page pj
  • PR(pi) = Σ over pj ∈ M(pi) of PR(pj) / L(pj)
• The algorithm is robust against spam, since it is not easy for a webpage owner to add in-links to his/her page from other important pages.
• PageRank is a global measure and is query independent.
• It favors older pages, since new ones will not yet have many links.
• PageRank can easily be inflated through “link farms”; however, while indexing, the search engine actively tries to detect such manipulation.
• Rank Sinks: occur when pages in a network form infinite link cycles, trapping rank within the cycle.
• Spider Traps: occur when a group of pages has no links from within the group to outside the group.
• Dangling Links: occur when a page links to a page that has no outgoing links.
• Dead Ends: pages with no outgoing links.
• Damping Factor d: with probability 1 − d the random surfer jumps to a random page (teleportation):
  PR(pi) = (1 − d)/N + d · Σ over pj ∈ M(pi) of PR(pj)/L(pj)
  where N is the total number of pages; typically d ≈ 0.85.
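The damped PageRank formula can be sketched as a power iteration. This is a minimal illustration, not Google's implementation; redistributing a dangling page's rank uniformly over all pages is one common fix for dead ends and is an assumption here, as are the graph format and iteration count.

```python
def pagerank(links, d=0.85, iterations=50):
    """Power-iteration PageRank with damping (teleportation).

    links: dict mapping page -> list of outbound links.
    Returns a dict of PageRank scores summing to 1.
    """
    pages = set(links)
    for outs in links.values():
        pages.update(outs)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # (1 - d)/N teleportation share for every page.
        new = {p: (1 - d) / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                share = pr[p] / len(outs)  # PR(p) / L(p)
                for q in outs:
                    new[q] += d * share
            else:
                # Dangling page: spread its rank over all pages.
                for q in pages:
                    new[q] += d * pr[p] / n
        pr = new
    return pr
```

On a small three-page cycle with one extra link, the scores sum to 1 and the page with the most weighted in-links receives the highest rank.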
PAGERANK:
• computed for all stored web pages, prior to the query
• computes authorities only
• fast to compute
• no need for additional normalization
HITS:
• performed on the subset generated by each query
• computes authorities and hubs
• easy to compute, but real-time execution is hard
• normalization is needed
Criteria: HITS vs. PageRank
• Complexity: HITS O(kN²); PageRank O(n)
• Result quality: HITS less than PageRank; PageRank medium
• Relevancy: HITS more, since it is applied at query time and also considers the content of the pages; PageRank less, since it ranks the pages at indexing time
• Neighborhood: HITS is applied to the local neighborhood of pages surrounding the results of a query; PageRank is applied to the entire web
Grover, Nidhi, and Ritika Wason. "Comparative analysis of PageRank and HITS algorithms." International Journal of Engineering Research and Technology 1.8 (October 2012). ESRSA Publications.
• Keyword-Stuffing: overloading the website with relevant keywords.
• Text-Hiding: placing relevant content on the website that can only be seen by search engines.
• Doorway-Pages: pages that are highly optimized for certain keywords, whose only purpose is to redirect to the real website.
• Link-Farms: websites optimized for certain keywords that contain only a huge number of links to other websites.
• Flash: rarely processed by search engines.
• Java Applets: normally not processed.
• Videos and Images: not directly processable by search engines.
• Other Rich-Media Formats (e.g. Silverlight): typically not processed by search engines.