Google (a common spelling for googol)
• googol: 10^100, or 1 followed by 100 zeros
• googolplex: 10^(10^100), or 10^googol, or 1 followed by a googol zeros
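As a quick check of these definitions, here is a minimal sketch using Python's arbitrary-precision integers. The variable names are illustrative only.

```python
# Python integers have arbitrary precision, so a googol can be built exactly.
googol = 10 ** 100

# "1 followed by 100 zeros" means 101 decimal digits in total.
digits = str(googol)
print(len(digits))        # 101
print(digits.count("0"))  # 100

# A googolplex (10 ** googol) has googol + 1 digits, far too many to store,
# so we only compute its digit count rather than the number itself.
googolplex_digit_count = googol + 1
```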
The name 'googol' was invented in the 1930s by a child (the nine-year-old nephew of the mathematician Edward Kasner) who was asked to think up a name for a very big number, namely 1 with a hundred zeros after it.
General Architecture of a Search Engine
• Spider (Crawler) – gathering information
• Indexer – analyzing information
• Searcher – displaying information (ranking)
What makes Ranking difficult:
• Web is not well controlled (it is not like a closed Information Retrieval system)
• Anyone can publish anything they want
• A word can be repeated many times (this is bad if frequency is one of the ranking criteria)
• Metadata may be abused
• “Cloaking”: a website returns altered web pages to a search engine accessing the site, usually to distort search engine rankings
Sub-Optimal Ranking Methods:
• manually maintain a list!
• simply return the document that is closest to the query
What information does Google use in ranking Web Pages?
• Link Structure (PageRank)
IR (Information Retrieval) Measures:
• Anchor Text
• Font (relative to the rest of the page), Capitalization, Position in Page
• Plain Hits vs. Fancy Hits (URL, title, anchor text, meta tag)
• Location information of different hits (proximity)
Link Structure (PageRank)
• Idea behind PageRanking is Citation
• Count the number of links pointing to a page,
• But place different importance levels on each link (e.g. link from yahoo vs. link from a personal web page)
How does PageRanking actually work?
Markov Chains:
Limiting probability of a page ~ Probability that a surfer will visit a page
[Figure: a small web graph with three pages A, B, and C]
PageRank Example:
[Figure: link graph. A → B (1/2), A → C (1/2), B → C (1), C → A (1)]
Equations:
P(A)=P(C)
P(B)=(1/2)*P(A)
P(C)=(1/2)*P(A)+P(B)
P(A)+P(B)+P(C)=1
Limiting probabilities:
P(A)=0.4, P(B)=0.2, P(C)=0.4
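The limiting probabilities can also be reached numerically by repeatedly applying the transition probabilities (power iteration). The sketch below is only an illustration: the `links` table is read off the equations above, and the function name `step` and the iteration count are assumptions.

```python
# Transition probabilities read off the equations above:
# links[src][dst] = probability that a surfer on src moves to dst.
links = {
    "A": {"B": 0.5, "C": 0.5},
    "B": {"C": 1.0},
    "C": {"A": 1.0},
}

def step(p, links):
    """One step of the random surfer: push each page's probability along its links."""
    nxt = {page: 0.0 for page in links}
    for src, outs in links.items():
        for dst, prob in outs.items():
            nxt[dst] += p[src] * prob
    return nxt

# Start from a uniform distribution and iterate until it settles.
p = {page: 1.0 / len(links) for page in links}
for _ in range(200):
    p = step(p, links)

print({page: round(v, 3) for page, v in p.items()})
```

The result matches the hand-solved values P(A)=0.4, P(B)=0.2, P(C)=0.4.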
Problem with this approach:
[Figure: link graph. A → B (1/2), A → C (1/2), B → B (1), C → A (1)]
Equations:
P(A)=P(C)
P(B)=(1/2)*P(A)+P(B)
P(C)=(1/2)*P(A)
P(A)+P(B)+P(C)=1
Limiting probabilities:
P(A)=0, P(B)=1, P(C)=0
This is no good!!!
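A small simulation makes the problem visible: once a surfer reaches B, it never leaves, so all probability eventually collects there. This is a sketch under the assumption that B links only to itself, as the equation P(B)=(1/2)*P(A)+P(B) implies. The names `sink_links` and `surf` are hypothetical.

```python
# Same random-surfer model, but B now links only to itself (a "sink").
sink_links = {
    "A": {"B": 0.5, "C": 0.5},
    "B": {"B": 1.0},
    "C": {"A": 1.0},
}

def surf(links, steps):
    """Iterate the surfer's probability distribution for a number of steps."""
    p = {page: 1.0 / len(links) for page in links}
    for _ in range(steps):
        nxt = {page: 0.0 for page in links}
        for src, outs in links.items():
            for dst, prob in outs.items():
                nxt[dst] += p[src] * prob
        p = nxt
    return p

trapped = surf(sink_links, 200)
print({page: round(v, 6) for page, v in trapped.items()})
```

After enough steps the distribution is numerically indistinguishable from P(A)=0, P(B)=1, P(C)=0.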
Solution: Use a Damping Factor
[Figure: the same graph with damping. Each link probability is multiplied by d, and every page also jumps to each of the three pages with probability (1-d)/3]
P(C)= [(1-d)/3]*[P(A)+P(B)+P(C)] + d*[(1/2)*P(A)]
Solution: Use a Damping Factor (continued)
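P(C)= [(1-d)/3]*[P(A)+P(B)+P(C)] + d*[(1/2)*P(A)]
Since P(A)+P(B)+P(C)=1, this simplifies to:
P(C)= [(1-d)/3] + d*[(1/2)*P(A)]
In general:
P(X)=[(1-d)/n] + d*[P(T1)/C(T1) + … + P(Tk)/C(Tk)]
where n is the total number of pages, T1 … Tk are the pages linking to X, and C(Ti) is the number of links going out of Ti.
Rationale:
User follows the links, then gets bored and randomly jumps to another page
Question:
How should we apply the damping factor?
Equally to all pages or more heavily to a subset of pages?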
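The damped update P(X) = (1-d)/n + d*[sum of P(T)/C(T) over pages T linking to X] can be iterated directly. The following is a minimal sketch, not the production algorithm. It assumes every page has at least one outgoing link, and the default d=0.85 is an assumption not stated in these slides. The function name `pagerank` and the `graph` dictionary are illustrative.

```python
def pagerank(out_links, d=0.85, iterations=100):
    """Iterate P(X) = (1-d)/n + d * sum(P(T)/C(T)) over pages T linking to X.

    out_links maps each page to the list of pages it links to.
    Assumes every page has at least one outgoing link.
    """
    pages = list(out_links)
    n = len(pages)
    p = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives the random-jump share (1-d)/n.
        nxt = {page: (1.0 - d) / n for page in pages}
        # Each page T passes d * P(T) / C(T) along each of its links.
        for src, targets in out_links.items():
            share = d * p[src] / len(targets)
            for dst in targets:
                nxt[dst] += share
        p = nxt
    return p

# The problematic graph from above: B links only to itself.
graph = {"A": ["B", "C"], "B": ["B"], "C": ["A"]}
ranks = pagerank(graph)
print({page: round(v, 4) for page, v in ranks.items()})
```

With damping, B still ranks highest, but A and C keep non-zero probability instead of collapsing to 0.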
On a typical workstation each iteration takes ~6 min.
Reference: “The PageRank Citation Ranking: Bringing Order to the Web”
Copy of paper available at:
http://citeseer.nj.nec.com/368196.html
Anchor Text
Idea:
• Links provide information about the pages they are pointing to
• Also allows the inclusion of documents that have links pointing to them but cannot be indexed by text-based search engines, e.g. images, programs, databases
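A minimal sketch of the idea: anchor words are indexed under the page the link points to, so a target with no crawlable text still becomes searchable. The file names and the `crawled_links` data below are hypothetical.

```python
from collections import defaultdict

# Hypothetical crawl output: (source page, link target, anchor text).
crawled_links = [
    ("a.html", "logo.png", "company logo"),
    ("b.html", "logo.png", "our logo image"),
    ("a.html", "report.pdf", "annual report"),
]

# Index anchor words under the *target* URL, not the page containing the link.
anchor_index = defaultdict(set)
for source, target, text in crawled_links:
    for word in text.lower().split():
        anchor_index[word].add(target)

# The image has no text of its own, yet a query for "logo" can find it.
print(sorted(anchor_index["logo"]))  # ['logo.png']
```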
General Architecture of a Search Engine
• Spider (Crawler) – gathering information
• Indexer – analyzing information
• Searcher – displaying information
Main Concerns of Google: Fast and Space Efficient
Barrels: Forward vs. Inverted Index
Forward: Partially sorted (each barrel holds a word range)
Inverted: Sorted
• Two steps (for performance reasons??)
• Is using word ranges the best solution, or should it be balanced based on popularity? (when doing searching)
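The difference between the two index forms can be sketched in a few lines. This is a toy illustration only: it ignores barrels and word ranges, uses words instead of word IDs, and the `docs` data is made up.

```python
from collections import defaultdict

# Toy document collection: docID -> text.
docs = {1: "web search engine", 2: "search ranking", 3: "web ranking"}

# Forward index: docID -> list of (word, position) hits.
forward = {
    doc_id: [(word, pos) for pos, word in enumerate(text.split())]
    for doc_id, text in docs.items()
}

# Inverted index: word -> list of (docID, position),
# obtained by re-sorting the forward entries by word.
entries = sorted(
    (word, doc_id, pos)
    for doc_id, hits in forward.items()
    for word, pos in hits
)
inverted = defaultdict(list)
for word, doc_id, pos in entries:
    inverted[word].append((doc_id, pos))

print(inverted["web"])  # [(1, 0), (3, 0)]
```

Answering a query needs the inverted form, since it goes from a word straight to the documents containing it.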
Inverted Barrels
Sort by docID or sort by ranking
Hybrid solution: use 2 sets of barrels
• One for title and anchor hits (they have more importance than plain hits)
• and one for all hits
Hit Lists
• Capitalization
• Font Size
• Position
No Color Information!!!
Question: How much effect does each of these properties have on the ordering of web pages? (i.e. what’s the trade-off in using these?)
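To illustrate why hits are kept compact, the sketch below packs the three properties listed above into a single 16-bit value. The bit layout (1 capitalization bit, 3 font-size bits, 12 position bits) is an assumption modeled on the plain-hit description in the original Google paper, not something stated on these slides. The function names are hypothetical.

```python
def pack_hit(capitalized: bool, font_size: int, position: int) -> int:
    """Pack a hit into 16 bits: 1 capitalization bit, 3 font bits, 12 position bits."""
    if not 0 <= font_size < 8:
        raise ValueError("font_size must fit in 3 bits")
    if position < 0:
        raise ValueError("position must be non-negative")
    position = min(position, 4095)  # clamp positions that do not fit in 12 bits
    return (int(capitalized) << 15) | (font_size << 12) | position

def unpack_hit(hit: int):
    """Recover (capitalized, font_size, position) from a packed hit."""
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF

hit = pack_hit(True, 3, 42)
print(unpack_hit(hit))  # (True, 3, 42)
```

Adding color information would need extra bits per hit, which is one possible reading of why it is left out.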
How often should Google’s database be updated?
Well, there were some limitations (back in 1998):
• Crawling 26 million pages takes ~ 9 days
• Indexing 24 million pages takes ~ 5 days
• Sorting them takes ~ 1 day
• Plus PageRanking
In reality Google was updated roughly every 1-4 weeks
Incremental Updating??
Smart algorithms to decide which pages should be crawled
(or Cooperation from Web Servers)
Improvements in Ranking
1) User Feedback
• User preferences (relevance)
e.g.: DirectHit (a system that measures what users click on in search results in order to refine relevancy rankings)
• Personalize PageRank by increasing the weights of users’ bookmarks
2) Use correlation information among different words?
(e.g.: "networks" and "computer networks")
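One way to read "personalize PageRank by increasing the weights of users’ bookmarks" is to replace the uniform random jump (1-d)/n with a jump that favors bookmarked pages. The sketch below works under that assumption. The function name, the weight values, and d=0.85 are all illustrative, and every page is assumed to have at least one outgoing link.

```python
def personalized_pagerank(out_links, weights, d=0.85, iterations=100):
    """PageRank where the random jump lands on each page in proportion to weights."""
    total = sum(weights.get(page, 0.0) for page in out_links)
    jump = {page: weights.get(page, 0.0) / total for page in out_links}
    p = dict(jump)
    for _ in range(iterations):
        # The bored surfer jumps to a page according to the user's weights.
        nxt = {page: (1.0 - d) * jump[page] for page in out_links}
        for src, targets in out_links.items():
            share = d * p[src] / len(targets)
            for dst in targets:
                nxt[dst] += share
        p = nxt
    return p

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
uniform = personalized_pagerank(graph, {"A": 1, "B": 1, "C": 1})
bookmarked = personalized_pagerank(graph, {"A": 1, "B": 5, "C": 1})  # B is bookmarked
print(round(uniform["B"], 4), round(bookmarked["B"], 4))
```

Raising the weight of B increases its rank relative to the uniform case while the ranks still sum to 1.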
Improvements in Ranking (Continued)
3) XML issues
HTML: <td width="20%" valign="top"><small><font face="Arial">Hamburg</font></small></td>
Code: <td width="20%" valign="top"><% = & " " & rs.fields("city") %></td>
XML: <City>Hamburg</City>