Topic: Search Engine, Section 1. Group Members: 1. Umair Daud Raja (9380) 2. Abdul Basit (9675)

Analysis Of Algorithm

What are Search Engines?

A search engine is a program that takes keywords, searches for documents related to those keywords, and presents the results in order of relevance to the query.

Examples:

• Google

• Bing

• Ask

The information we want to find may be a mix of images, videos, web pages, and other types of files.

The pages that display the results of a search are called Search Engine Result Pages (SERPs).
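As a toy illustration of the keyword lookup and relevance ordering described above, the sketch below scores each document by how many times the query keywords occur in it. The scoring rule, the `search` function, and the sample documents are assumptions made for illustration; real search engines use far richer signals.

```python
def search(documents, query):
    """Return document names ordered by how often the query keywords occur."""
    keywords = query.lower().split()
    scores = {}
    for name, text in documents.items():
        words = text.lower().split()
        # Assumed relevance measure: total number of keyword occurrences.
        score = sum(words.count(keyword) for keyword in keywords)
        if score > 0:
            scores[name] = score
    # Highest score first; ties are broken alphabetically by name.
    return sorted(scores, key=lambda name: (-scores[name], name))


# Hypothetical documents, used only to exercise the function.
docs = {
    "a.html": "search engines rank web pages",
    "b.html": "web pages link to other web pages",
    "c.html": "cooking recipes",
}
results = search(docs, "web pages")
print(results)
```

Documents that contain none of the keywords are left out of the result list entirely.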

How does a search engine work?

Which is the best search engine?

Google: 1,100,000,000 - Estimated Unique Monthly Visitors

Bing: 350,000,000 - Estimated Unique Monthly Visitors

Yahoo: 300,000,000 - Estimated Unique Monthly Visitors

Ask: 245,000,000 - Estimated Unique Monthly Visitors

AOL Search: 125,000,000 - Estimated Unique Monthly Visitors

Reference : http://www.ebizmba.com/articles/search-engines

Market share

Reference : http://searchengineland.com/googles-search-market-share-67-percent-pc-83-percent-mobile-203937

Search Engines: Algorithm

Previous: BackRub, Google's predecessor, which ranked pages by analyzing the links pointing back to them.

Algorithm: PageRank (impractical solution); it was proposed by Larry Page.

PageRank is the technique Google uses to determine the importance of a page on the web.

PageRank is one of the most important factors that Google uses. It is a numeric value that represents how important a page is on the web.

Of course, PageRank is not the only factor that decides the importance of a page, but it is one of them.

PageRank is described by a single mathematical formula that looks difficult at first but is actually simple.

Formula

We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor, which can be set between 0 and 1. We usually set d to 0.85. C(A) is defined as the number of links going out of page A. The PageRank of page A is given as follows:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

PR(A) is the PageRank of page A

PR(Ti) is the PageRank of pages Ti which link to page A

C(Ti) is the number of outbound links on page Ti

d is a damping factor which can be set between 0 and 1
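The formula can be sketched in a few lines of Python. The function name `pagerank_of`, the dictionary layout for the links, and the sample pages are assumptions chosen for illustration; the function applies the formula once to a single page.

```python
def pagerank_of(page, links, ranks, d=0.85):
    """Apply PR(A) = (1-d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) once.

    links maps each page to the list of pages it links to (assumed layout).
    ranks maps each page to its current PageRank value.
    """
    total = 0.0
    for source, targets in links.items():
        if page in targets:
            # source is one of the pages Ti; len(targets) is C(Ti).
            total += ranks[source] / len(targets)
    return (1 - d) + d * total


# Hypothetical web: T1 links only to A, T2 links to A and T1.
links = {"T1": ["A"], "T2": ["A", "T1"], "A": []}
ranks = {"T1": 1.0, "T2": 1.0, "A": 1.0}
pr_a = pagerank_of("A", links, ranks)
print(pr_a)  # 0.15 + 0.85 * (1/1 + 1/2), up to floating-point rounding
```

T1 passes its whole rank to A because it has one outbound link, while T2 passes only half because it has two.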

Damping factor

The PageRank theory holds that an imaginary surfer who is randomly clicking on links will eventually stop clicking.

The probability, at any step, that the person will continue is the damping factor d. Various studies have tested different damping factors, but the damping factor is generally set to around 0.85.
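To get a feel for what d = 0.85 means for the imaginary surfer, the sketch below assumes a simple model: before every click the surfer independently continues with probability d and otherwise stops. The function names are hypothetical.

```python
def still_clicking_probability(d, steps):
    """Probability that the surfer has made at least `steps` clicks."""
    return d ** steps


def expected_clicks(d):
    """Mean number of clicks before the surfer stops (geometric model)."""
    return d / (1 - d)


print(still_clicking_probability(0.85, 2))  # 0.85 * 0.85
print(expected_clicks(0.85))                # between 5 and 6 clicks on average
```

Under this assumed model, a smaller d makes the surfer give up sooner, so less rank flows along long chains of links.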

Explanation

Consider an imaginary web of 3 web pages, A, B, and C, with the inbound and outbound link structure shown in the figure. The calculations can be done as follows:

PR(A) = 0.15 + 0.85 PR(C) = 0.15 + (0.85*1) = 1

PR(B) = 0.15 + 0.85 (PR(A) / 2) = 0.15 + 0.85 (1/2) = 0.15 + (0.85 * 0.5) = 0.15 + 0.425 = 0.575

PR(C) = 0.15 + 0.85 ((PR(A) / 2) + PR(B)) = 0.15 + 0.85 (1/2 + 0.575) = 0.15 + 0.85 (1.075) = 0.15 + 0.914 = 1.06

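Here d = 0.85, so according to the formula 1 - d = 1 - 0.85 = 0.15. Each page starts with a PageRank of 1, and each newly computed value is used in the calculations that follow it.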
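The hand calculation can be reproduced with a short script. The link structure below (A links to B and C, B links to C, C links to A) is not stated in the text; it is inferred from the three equations, so treat it as an assumption. Every page starts at 1 and the pages are updated in the order A, B, C, reusing new values immediately.

```python
d = 0.85

# Link structure inferred from the equations (assumption):
# A -> B, C;  B -> C;  C -> A
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = {"A": 1.0, "B": 1.0, "C": 1.0}  # every page starts at 1

for page in ["A", "B", "C"]:
    # Sum PR(T)/C(T) over every page T that links to `page`.
    incoming = sum(
        ranks[source] / len(targets)
        for source, targets in links.items()
        if page in targets
    )
    ranks[page] = (1 - d) + d * incoming  # updated value is reused right away

for page, value in ranks.items():
    print(page, round(value, 3))
```

This is a single pass; repeating the loop refines the values further.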

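Complexity

O(n + m)

Here n is the number of pages (nodes) and m is the number of links (edges). One pass of the calculation visits every page once and touches every link a constant number of times (at most twice), so the complexity of a pass is O(n + m).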
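A minimal sketch of why one pass is linear in the number of pages and links: the function below updates every page once and follows every link once, counting both. The function name, the counters, and the sample web are assumptions for illustration. Unlike the hand calculation, all pages here are updated from the old values at the same time.

```python
def pagerank_sweep(links, ranks, d=0.85):
    """Update every page once; return new ranks, node visits, and edge visits."""
    new_ranks = {page: 1 - d for page in links}  # one visit per node
    node_visits = len(links)
    edge_visits = 0
    for source, targets in links.items():
        for target in targets:  # each link is followed exactly once
            new_ranks[target] += d * ranks[source] / len(targets)
            edge_visits += 1
    return new_ranks, node_visits, edge_visits


# Hypothetical 3-page web with n = 3 pages and m = 4 links.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = {"A": 1.0, "B": 1.0, "C": 1.0}
new_ranks, node_visits, edge_visits = pagerank_sweep(links, ranks)
print(node_visits, edge_visits)
```

The work done is proportional to `node_visits + edge_visits`, which is n + m.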

Techniques

We will discuss two techniques here:

1. Dynamical systems point of view

2. Linear algebra point of view

Another Example.

Example(Cont.)

We "translate" the picture into a directed graph with 4 nodes, one for each web site.

Example(cont.)

In the picture below, node 1 has 3 outgoing links.

Node 2 has 2 outgoing links.

Node 3 has 1 outgoing link.

Node 4 has 2 outgoing links.

In general, if a node has k outgoing links, it passes 1/k of its importance to each node it links to.

Example(Cont.)

From the diagram, the transition matrix A becomes:
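The construction of A can be sketched as follows. The diagram is not reproduced here, so the link targets below are an assumption: they match the stated out-degrees (3, 2, 1, 2) and the sum of 31 reported later, but the actual picture may differ. Column j of A holds 1/k in every row that node j links to, where k is the number of outgoing links of node j.

```python
from fractions import Fraction

# Assumed link structure (the figure is missing):
# 1 -> 2, 3, 4;  2 -> 3, 4;  3 -> 1;  4 -> 1, 3
links = {1: [2, 3, 4], 2: [3, 4], 3: [1], 4: [1, 3]}


def transition_matrix(links):
    """Build A so that A[i][j] = 1/k when node j links to node i."""
    nodes = sorted(links)
    index = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)
    A = [[Fraction(0)] * n for _ in range(n)]
    for source, targets in links.items():
        for target in targets:
            A[index[target]][index[source]] = Fraction(1, len(targets))
    return A


A = transition_matrix(links)
for row in A:
    print([str(entry) for entry in row])
```

Every column sums to 1, because each node hands out all of its importance.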

1. Dynamic System point of view

Suppose that initially the importance is uniformly distributed among the 4 nodes, each getting ¼. Denote by v the initial rank vector, having all entries equal to ¼. Each incoming link increases the importance of a web page, so at step 1 we update the rank of each page by adding the importance passed along its incoming links. This is the same as multiplying the matrix A by v. At step 1, the new importance vector is v1 = Av. We can iterate the process, so at step 2 the updated importance vector is v2 = A(Av) = A^2 v. Numeric computations give:
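The iteration v, Av, A^2 v, ... can be sketched in plain Python. The matrix below is an assumption, since the diagram and the original matrix are not reproduced here; it corresponds to the links 1 -> 2, 3, 4; 2 -> 3, 4; 3 -> 1; 4 -> 1, 3, which agree with the stated out-degrees. The number of steps (50) is likewise an arbitrary choice.

```python
# Assumed transition matrix (column j lists where node j sends its importance).
A = [
    [0,     0,     1, 1 / 2],
    [1 / 3, 0,     0, 0],
    [1 / 3, 1 / 2, 0, 1 / 2],
    [1 / 3, 1 / 2, 0, 0],
]


def multiply(A, v):
    """Return the matrix-vector product Av."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


v = [1 / 4] * 4  # initial rank vector: importance spread uniformly
history = [v]
for step in range(50):
    v = multiply(A, v)  # v_k = A v_(k-1)
    history.append(v)

print([round(x, 3) for x in history[1]])  # v1 = Av
print([round(x, 3) for x in history[2]])  # v2 = A^2 v
print([round(x, 3) for x in v])           # after many steps the vector settles
```

For this assumed matrix the vectors stop changing after a handful of steps, and the entries still add up to 1.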

Dynamic System point of view(Cont.)

2. Linear algebra point of view:

Let us denote by x1, x2, x3, and x4 the importance of the four pages. Analyzing the situation at each node we get the system:

Linear algebra point of view(Cont.)

Then we find the eigenvalues. The formula is det(A - λI4) = 0, and for the eigenvalue λ = 1 the eigenvector X satisfies (A - I4)X = 0.

In this scenario:

X=

Linear algebra point of view(Cont.)

After the calculation, we get the eigenvector:

eigenvector =

Adding the entries of the eigenvector gives 31. Each entry is then multiplied by 1/31 to find the PageRank.

=
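The linear algebra route can be sketched by solving (A - I4)X = 0 exactly with fractions and then dividing by the sum of the entries. The matrix is again an assumption (links 1 -> 2, 3, 4; 2 -> 3, 4; 3 -> 1; 4 -> 1, 3), and the helper name is hypothetical. The code assumes the solution space is one-dimensional, which holds for this matrix.

```python
from fractions import Fraction as F

# Assumed transition matrix for the 4-page example.
A = [
    [F(0),    F(0),    F(1), F(1, 2)],
    [F(1, 3), F(0),    F(0), F(0)],
    [F(1, 3), F(1, 2), F(0), F(1, 2)],
    [F(1, 3), F(1, 2), F(0), F(0)],
]


def eigenvector_for_eigenvalue_one(A):
    """Solve (A - I)x = 0 by Gauss-Jordan elimination (one free variable assumed)."""
    n = len(A)
    M = [[A[i][j] - (1 if i == j else 0) for j in range(n)] for i in range(n)]
    pivots = []
    row = 0
    for col in range(n):
        pivot = next((r for r in range(row, n) if M[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot here: this column is the free variable
        M[row], M[pivot] = M[pivot], M[row]
        scale = M[row][col]
        M[row] = [entry / scale for entry in M[row]]
        for r in range(n):
            if r != row and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[row])]
        pivots.append(col)
        row += 1
    free = [c for c in range(n) if c not in pivots][0]
    x = [F(0)] * n
    x[free] = F(1)  # set the free variable to 1
    for r, col in enumerate(pivots):
        x[col] = -M[r][free]  # read the other entries off the reduced rows
    return x


vector = eigenvector_for_eigenvalue_one(A)
pagerank = [entry / sum(vector) for entry in vector]
print([str(entry) for entry in pagerank])
```

For this assumed matrix the normalized entries come out as fractions over 31, which agrees with the sum of 31 mentioned above.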

Current Algorithm used by Google

The latest algorithm used by Google is "Hummingbird".

Google started using Hummingbird about 30 August 2013, and announced the change on 26 September 2013, on the eve of the company's 15th anniversary.

Pros and cons of PageRank

Pros:

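It is query independent: the rank of a page is computed ahead of time and does not depend on the search terms.

It helps bring the most relevant search results to the top.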

Cons:

The major disadvantage of PageRank is that it favors older pages, because a new page, even a very good one, will not have as many links as an old one.

Search results are based on literal keywords, not on meaning.