Crawling and Ranking
HTML (HyperText Markup Language)
• Describes the structure and content of a (Web) document
• HTML 4.01: most common version, W3C standard
• XHTML 1.0: XML-ization of HTML 4.01, minor differences
• Validation (http://validator.w3.org/) against a schema. Checks the conformity of a Web page with respect to the recommendations, for accessibility:
– to all graphical browsers (IE, Firefox, Safari, Opera, etc.)
– to text browsers (lynx, links, w3m, etc.)
– to all other user agents, including Web crawlers
The HTML language
• Text and tags
• Tags define structure
– Used for instance by a browser to lay out the document
• Header and Body
HTML structure
<!DOCTYPE html …>
<html lang="en">
<head>
<!-- Header of the document -->
</head>
<body>
<!-- Body of the document -->
</body>
</html>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Example XHTML document</title>
</head>
<body>
<p>This is a <a href="http://www.w3.org/">link to the W3C</a></p>
</body>
</html>
Header
• Appears between the tags <head> ... </head>
• Includes meta-data such as language, encoding…
• Also includes the document title
• Used (e.g., by the browser) to correctly interpret the body
Body
• Between the <body> ... </body> tags
• The body is structured into sections, paragraphs, lists, etc.
<h1>Title of the page</h1>
<h2>Title of a main section</h2>
<h3>Title of a subsection</h3>
. . .
• <p> ... </p> define paragraphs
• More block elements such as table, list…
HTTP
• Application protocol
Client request:
GET /MarkUp/ HTTP/1.1
Host: www.google.com
Server response:
HTTP/1.1 200 OK
• Two main HTTP methods: GET and POST
GET
URL: http://www.google.com/search?q=BGU
Corresponding HTTP GET request:
GET /search?q=BGU HTTP/1.1
Host: www.google.com
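For illustration, a minimal sketch of issuing this request from Python with the standard-library http.client module (the real server may answer with a redirect rather than 200):

import http.client

# Connect to the server and send the GET request for /search?q=BGU.
# http.client adds the Host header for us.
conn = http.client.HTTPConnection("www.google.com", 80)
conn.request("GET", "/search?q=BGU")
response = conn.getresponse()
print(response.status, response.reason)   # status line, e.g. "200 OK" or a redirect
body = response.read()                    # the response body (HTML)
conn.close()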
POST
• Used for submitting forms
POST /php/test.php HTTP/1.1
Host: www.bgu.ac.il
Content-Type: application/x-www-form-urlencoded
Content-Length: 100
…
Status codes
• HTTP response always starts with a status code followed by a human-readable message (e.g., 200 OK)
• First digit indicates the class of the response:
1 Information
2 Success
3 Redirection
4 Client-side error
5 Server-side error
Authentication
• HTTPS is a variant of HTTP that includes encryption, cryptographic authentication, session tracking, etc.
• It should be used instead of plain HTTP to transmit sensitive data
GET ... HTTP/1.1
Authorization: Basic dG90bzp0aXRp
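The Basic scheme simply base64-encodes "user:password"; a quick check in Python (the token above decodes to toto:titi):

import base64

# Build the value of a Basic Authorization header from user:password
credentials = "toto:titi"
token = base64.b64encode(credentials.encode()).decode()
print("Authorization: Basic " + token)    # Authorization: Basic dG90bzp0aXRp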
Cookies
• Key/value pairs that a server asks a client to store and retransmit with each HTTP request (for a given domain name)
• Can be used to keep information on users between visits
• Often what is stored is a session ID
– Connected, on the server side, to all session information
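For instance, the exchange might look as follows (the session ID value is made up for illustration):
Server response includes:
Set-Cookie: SESSIONID=abc123; Path=/
Subsequent client requests to the same domain include:
Cookie: SESSIONID=abc123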
Crawling
Basics of Crawling
• Crawlers, (Web) spiders, (Web) robots: autonomous agents that retrieve pages from the Web
• Basic crawling algorithm:
1. Start from a given URL or set of URLs
2. Retrieve and process the corresponding page
3. Discover new URLs (next slide)
4. Repeat on each found URL
Problem: The web is huge!
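A minimal sketch of this loop in Python, breadth-first and capped at a fixed number of pages precisely because the Web is huge (no politeness, robots.txt handling, or parallelism; the helper class only extracts <a href> links):

from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags (step 3: discover new URLs)."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)              # step 1: start from the seed URLs
    seen = set(seed_urls)
    while frontier and len(seen) <= max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue                         # skip unreachable or malformed URLs
        # step 2: process the page (index it, extract text, ...)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:            # step 4: repeat on each found URL
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return seen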
Discovering new URLs
• Browse the "internet graph" (following e.g. hyperlinks)
• Site maps (sitemaps.org)
The internet graph
• At least 14.06 billion nodes = pages
• At least 140 billion edges = links
• Lots of "junk"
Graph-browsing algorithms
• Depth-first
• Breadth-first
• Combinations…
• Parallel crawling
Duplicates
• Identifying duplicates or near-duplicates on the Web to prevent multiple indexing
• Trivial duplicates: same resource at the same canonicalized URL:
http://example.com:80/toto
http://example.com/titi/../toto
• Exact duplicates: identification by hashing
• Near-duplicates (timestamps, tip of the day, etc.): more complex!
Near-duplicate detection
• Edit distance
– Good measure of similarity
– Does not scale to a large collection of documents (unreasonable to compute the edit distance for every pair!)
• Shingles: two documents are similar if they mostly share the same succession of k-grams
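A small sketch of shingle-based similarity in Python, using word k-grams and the Jaccard coefficient as the similarity measure (the 0.9 threshold is an arbitrary choice for illustration):

def shingles(text, k=4):
    """Return the set of word k-grams (shingles) of a document."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard coefficient of two shingle sets: |A & B| / |A | B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumped over the lazy dog"
sim = jaccard(shingles(doc1), shingles(doc2))
print(sim)     # flag the pair as near-duplicates if sim is high, e.g. > 0.9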
Crawling ethics
• robots.txt at the root of a Web server
• User-agent: *
Allow: /searchhistory/
Disallow: /search
• Per-page exclusion (de facto standard):
<meta name="ROBOTS" content="NOINDEX,NOFOLLOW">
• Per-link exclusion (de facto standard):
<a href="toto.html" rel="nofollow">Toto</a>
• Avoid Denial of Service (DoS): wait 100 ms to 1 s between two repeated requests to the same Web server
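Python's standard library can apply a robots.txt file before fetching; a minimal sketch (the crawler name and URLs are just examples):

from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("http://www.example.com/robots.txt")
robots.read()                               # fetch and parse the robots.txt file

# A polite crawler checks every URL before requesting it
if robots.can_fetch("MyCrawler", "http://www.example.com/search"):
    print("allowed: fetch the page")
else:
    print("disallowed: skip it")

# Some sites also specify a crawl delay; honor it between requests
delay = robots.crawl_delay("MyCrawler")     # None if not specified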
Overview
• Crawl
• Retrieve relevant documents – How?
– To define relevance, to find relevant docs…
• Rank – How?
Relevance
• Input: keyword (or set of keywords), “the web”
• First question: how to define the relevance of a page with respect to a keyword?
• Second question: how to store pages such that the relevant ones for a given keyword are easily retrieved?
Relevance definition
• Boolean: based on the existence of a word in the document
– Synonyms
– Disadvantages?
• Word count
– Synonyms
– Disadvantages?
• Can we do better?
TF-IDF
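The slide only names the scheme; a minimal sketch in Python, assuming the common tf * log(N/df) weighting (terms frequent in a document but rare in the collection get high weight):

import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: tf-idf weight} dict per document."""
    n = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["crawl", "web", "page"], ["rank", "web", "page"], ["rank", "page"]]
print(tf_idf(docs))   # "page" appears everywhere, so its weight is 0 in every document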
Storing pages
• Offline pre-processing can help online search
• Offline preprocessing includes stemming, stop-word removal…
• As well as the creation of an index
Inverted Index
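Only named on the slide; the idea is a map, built offline, from each term to the list of documents (postings) that contain it. A minimal sketch with toy documents and hypothetical IDs:

from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: list of tokens}. Returns {term: sorted list of doc_ids}."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for term in tokens:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: ["crawling", "the", "web"], 2: ["ranking", "web", "pages"]}
index = build_index(docs)
print(index["web"])    # [1, 2] -> documents containing "web", found in one lookup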
More advanced text analysis
• N-grams
• HMM language models
• PCFG language models
• We will discuss all that later in the course!
Ranking
Why Ranking?
• Huge number of pages
• Huge even if we filter according to relevance
– Keep only pages that include the keywords
• A lot of the pages are not informative
– And anyway, it is impossible for users to go through 10K results
When to rank?
• Before retrieving results
– Advantage: offline!
– Disadvantage: huge set
• After retrieving results
– Advantage: smaller set
– Disadvantage: online, the user is waiting…
How to rank?
• Observation: links are very informative!
• Not just for discovering new sites, but also for estimating the importance of a site
• CNN.com has more links to it than my homepage…
• Quality and Efficiency are key factors
Authority and Hubness
• Authority: a site is very authoritative if it receives many citations. A citation from an important site has more weight than a citation from a less important site
A(v) = the authority of v
• Hubness: a good hub is a site that links to many authoritative sites
H(v) = the hubness of v
HITS
• Recursive dependency:
a(v) = Σ_(u,v) h(u)   (sum over all pages u that link to v)
h(v) = Σ_(v,u) a(u)   (sum over all pages u that v links to)
• Normalize (when?) according to the square root of the sum of squares of the authority / hubness values
• Start by setting all values to 1
– We could also add bias
• We can show that a(v) and h(v) converge
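A minimal sketch of the HITS iteration in Python on a small example graph (edges given as (u, v) pairs meaning u links to v; a fixed number of iterations stands in for a convergence test):

import math

def hits(edges, iterations=50):
    """Returns (authority, hub) dictionaries for the nodes of the edge list."""
    nodes = {n for e in edges for n in e}
    auth = {n: 1.0 for n in nodes}        # start by setting all values to 1
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # a(v) = sum of h(u) over edges (u, v)
        auth = {v: sum(hub[u] for u, w in edges if w == v) for v in nodes}
        # h(v) = sum of a(u) over edges (v, u)
        hub = {v: sum(auth[u] for w, u in edges if w == v) for v in nodes}
        # normalize by the square root of the sum of squares
        na = math.sqrt(sum(x * x for x in auth.values())) or 1.0
        nh = math.sqrt(sum(x * x for x in hub.values())) or 1.0
        auth = {n: x / na for n, x in auth.items()}
        hub = {n: x / nh for n, x in hub.items()}
    return auth, hub

edges = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "C")]
print(hits(edges))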
HITS (cont.)
• Works rather well if applied only on relevant web pages
– E.g., pages that include the input keywords
• The results are less satisfying if applied on the whole web
• On the other hand, computing the ranking online (at query time) is a problem
Google PageRank
• Works offline, i.e., computes for every web page a score that can then be used online
• Extremely efficient and high-quality
• The PageRank algorithm that we will describe here appears in [Brin & Page, 1998]
Random Surfer Model
• Consider a "random surfer"
• At each step, chooses a link on the current page and clicks on it
• A link is chosen with uniform distribution
– A simplifying assumption…
• What is the probability of being, at a random time, at a web-page W?
Recursive definition
• If PageRank reflects the probability of being in a web-page (PR(w) = P(w)) then
PR(W) = PR(W1) * (1/O(W1)) + … + PR(Wn) * (1/O(Wn))
where W1, …, Wn are the pages that link to W, and O(Wi) is the out-degree of Wi
Problems
• A random surfer may get stuck in one component of the graph
• May get stuck in loops
• "Rank Sink" problem
– Many Web pages have no inlinks/outlinks
Damping Factor
• Add some probability d for "jumping" to a random page
• Now
PR(W) = (1-d) * [PR(W1) * (1/O(W1)) + … + PR(Wn) * (1/O(Wn))] + d * (1/N)
where N is the number of pages in the index
How to compute PR?
• Simulation
• Analytical methods– Can we solve the equations?
Simulation: A random surfer algorithm
• Start from an arbitrary page
• Toss a coin to decide if you want to follow a link or to randomly choose a new page
• Then toss another coin to decide which link to follow / which page to go to
• Keep a record of the frequency of the web pages visited
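A minimal sketch of this simulation in Python, on a graph given as adjacency lists, with the damping factor d as the probability of jumping to a random page (as defined above; pages with no out-links always jump):

import random
from collections import Counter

def random_surfer(graph, d=0.15, steps=1_000_000):
    """graph: {page: list of pages it links to}. Returns estimated PageRank per page."""
    pages = list(graph)
    visits = Counter()
    current = random.choice(pages)               # start from an arbitrary page
    for _ in range(steps):
        visits[current] += 1
        # with probability d (or if the page has no out-links), jump to a random page
        if random.random() < d or not graph[current]:
            current = random.choice(pages)
        else:
            current = random.choice(graph[current])  # otherwise follow a random link
    return {p: visits[p] / steps for p in pages}

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(random_surfer(graph, steps=100_000))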
Convergence
• Not guaranteed without the damping factor!
• (Partial) intuition: if unlucky, the algorithm may get stuck forever in a connected component
• Claim: with damping, the probability of getting stuck forever is 0
• More difficult claim: with damping, convergence is guaranteed
Markov Chain Monte Carlo (MCMC)
• A class of very useful algorithms for sampling a given distribution
• We first need to know what a Markov Chain is
Markov Chain
• A finite or countably infinite state machine
• We will consider the case of finitely many states
• Transitions are associated with probabilities
• Markovian property: given the present state, future choices are independent of the past
MCMC framework
• Construct (explicitly or implicitly) a Markov Chain (MC) that describes the desired distribution
• Perform a random walk on the MC, keeping track of the proportion of state visits
– Discard samples made before "mixing"
• Return proportion as an approximation of the correct distribution
Properties of Markov Chains
• A Markov Chain defines a distribution on the different states (P(state)= probability of being in the state at a random time)
• We want conditions on when this distribution is unique, and when will a random walk approximate it
Properties
• Periodicity
– A state i has period k if any return to state i must occur in multiples of k time steps
– Aperiodic: period = 1 for all states
• Reducibility
– An MC is irreducible if there is probability 1 of (eventually) getting from every state to every state
• Theorem: A finite-state MC has a unique stationary distribution if it is aperiodic and irreducible
Back to PageRank
• The MC is on the graph with probabilities we have defined
• MCMC is the random walk algorithm
• Is the MC aperiodic? Irreducible?
• Why?
Problem with MCMC
• In general, no guarantees on convergence time
– Even for those "nice" MCs
• A lot of work on characterizing "nicer" MCs
– That will allow fast convergence
• In practice, for the web graph it converges rather slowly
– Why?
A different approach
• Reconsider the equation system
PR(W) = (1-d) * [PR(W1) * (1/O(W1)) + … + PR(Wn) * (1/O(Wn))] + d * (1/N)
• A linear equation system!
Transition Matrix
T = ( 0     0.33  0.33  0.33
      0     0     0.5   0.5
      0.25  0.25  0.25  0.25
      0     0     0     0   )
Stochastic matrix
Eigenvector!
• PR (column vector) is the right eigenvector of the stochastic transition matrix
– I.e., the adjacency matrix normalized so that the sum of every column is 1
• The Perron-Frobenius theorem ensures that such a vector exists
• Unique under the same assumptions as before
Direct solution
• Solving the equation system
– Via, e.g., Gaussian elimination
• This is time-consuming
• Observation: the matrix is sparse
• So iterative methods work better here
Power method
• Start with some arbitrary rank vector R0
• Compute Ri = A Ri-1
• If we happen to get to the eigenvector we will stay there
• Theorem: the process converges to the eigenvector!
• Convergence is in practice pretty fast (~100 iterations)
Power method (cont.)
• Every iteration is still “expensive”
• But since the matrix is sparse it becomes feasible
• Still, need a lot of tweaks and optimizations to make it work efficiently
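A minimal sketch of the iteration in Python, working directly on the sparse link structure (adjacency lists) instead of a dense matrix, with the same damping convention as before (d = probability of jumping; dangling pages spread their rank uniformly):

def pagerank_power(graph, d=0.15, iterations=100):
    """graph: {page: list of out-links}. Returns the PageRank vector as a dict."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}           # arbitrary starting vector R0
    for _ in range(iterations):
        new_rank = {p: d / n for p in pages}     # random-jump contribution d * (1/N)
        for p in pages:
            out = graph[p]
            if out:
                share = (1 - d) * rank[p] / len(out)
                for q in out:                    # spread p's rank over its out-links
                    new_rank[q] += share
            else:
                for q in pages:                  # dangling page: spread rank uniformly
                    new_rank[q] += (1 - d) * rank[p] / n
        rank = new_rank                          # Ri = A Ri-1
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank_power(graph))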
Other issues
• Accelerating computation
• Updates
• Distributed PageRank
• Mixed Model (Incorporating "static" importance)
• Personalized PageRank