
A Seminar Report

On

Working of web search engine

Submitted in partial fulfillment of the requirement for the award of the degree

of Bachelor of Engineering

in Information Technology

Submitted by: Sachin Sharma, B.E. Final Year
Guide: Dr. K.R. Chowdhary, Professor, CSE Dept.

Department of Computer Science and Engineering, M.B.M. Engineering College, Faculty of Engineering,

Jai Narain Vyas University, Jodhpur (Rajasthan) – 342001

Session 2008-09



CANDIDATE’S DECLARATION

I hereby declare that the work which is being presented in the Seminar entitled

“Working of Web Search engine” in the partial fulfillment of the requirement for the

award of the degree of Bachelor of Engineering in Information Technology, submitted in

the Department of Computer Science and Engineering, M.B.M. Engineering College,

Jodhpur (Rajasthan), is an authentic record of my own work carried out during the

period from February 2009 to May 2009, under the supervision of Dr. K.R. Chowdhary,

Professor, Department of Computer Science and Engineering, M.B.M. Engineering

College, Jodhpur (Rajasthan).

The matter embodied in this project has not been submitted by me for the award of any

other degree. I also declare that the matter of the seminar is not ‘reproduced as it is’ from

any source.

Date:

Place: Jodhpur (SACHIN SHARMA)

CERTIFICATE

This is to certify that the above statement made by the candidate is correct to the best of

my knowledge.

Dr. K.R. Chowdhary

Professor

Department of Computer Science and Engineering

M.B.M. Engineering College,

Jodhpur (Rajasthan) – 342001



Contents

1. Introduction
2. Types of search engine
3. General system architecture of web search engine
   3.1. Web crawling
        3.1.1. Types of crawling
               3.1.1.1. Focused crawling
               3.1.1.2. Distributed crawling
        3.1.2. Robot exclusion protocol
        3.1.3. Resource constraints
   3.2. Web indexing
        3.2.1. Index design factors
        3.2.2. Index data structures
        3.2.3. Types of indexing
               3.2.3.1. Inverted index
               3.2.3.2. Forward index
        3.2.4. Latent Semantic Indexing (LSI)
               3.2.4.1. What is LSI
               3.2.4.2. How LSI works
               3.2.4.3. Singular Value Decomposition (SVD)
               3.2.4.4. Stemming
               3.2.4.5. The term-document matrix
        3.2.5. Challenges in parallelism
4. Meta search engine
5. Search engine optimization
   5.1. Page Rank
   5.2. The ranking algorithm simplified
   5.3. Damping factor
   5.4. Uses of Page Rank
6. Marketing of search engines
7. Summary



Abstract

Exploring the content of web pages for automatic indexing is of fundamental importance

for efficient e-commerce and other applications of the Web. It enables users, including

customers and businesses, to locate the best sources for their use. Today’s search

engines use one of two approaches to indexing web pages. They either analyze the

frequency of the words (after filtering out common or meaningless words) appearing in

the entire text of the target web page or in a part of it (typically a title, an abstract or the

first 300 words), or they use sophisticated algorithms to take into account associations

of words in the indexed web page. In both cases only words appearing in the web page

in question are used in the analysis. Often, to increase the relevance of the selected

terms to the potential searches, the indexing is refined by human processing.

To identify so-called “authority” or “expert” pages, some search engines use the

structure of the links between pages to identify pages that are often referenced by other

pages. The approach used in the Google search engine implementation assigns each

page a score that depends on the frequency with which the page is visited by web surfers.



1. Introduction

A search engine is an information retrieval system designed to help find information

stored on a computer system. Search engines help to minimize the time required to find

information and the amount of information which must be consulted, akin to other

techniques for managing information overload. The most public, visible form of a search

engine is a Web search engine which searches for information on the World Wide Web.

Engineering a web search engine is a challenging task. Search engines index

tens to hundreds of millions of web pages involving a comparable number of distinct

terms. They answer tens of millions of queries every day. Despite the importance of

large-scale search engines on the web, very little academic research has been

conducted on them. Furthermore, due to rapid advances in technology and web

proliferation, creating a web search engine today is very different from creating one three years ago.

There are differences in the ways various search engines work, but they all perform

three basic tasks:

They search the Internet or select pieces of the Internet based on important

words.

They keep an index of the words they find, and where they find them.

They allow users to look for words or combinations of words found in that index.

The most important measures for a search engine are search performance, quality of

the results, and the ability to crawl and index the web efficiently. The primary goal is to

provide high quality search results over a rapidly growing World Wide Web. Some of the

efficient and recommended search engines are Google, Yahoo and Teoma, which

share some common features and are standardized to some extent.



2. Types of search engine

Search engines provide an interface to a group of items that enables users to specify

criteria about an item of interest and have the engine find the matching items. The

criteria are referred to as a search query. In the case of text search engines, the search

query is typically expressed as a set of words that identify the desired concept that one

or more documents may contain. There are several styles of search query syntax that

vary in strictness. Whereas some text search engines require users to enter two or three words

separated by white space, other search engines may enable users to specify entire

documents, pictures, sounds, and various forms of natural language. Some search

engines apply improvements to search queries to increase the likelihood of providing a

quality set of items through a process known as query expansion.

3. General system architecture of web search engine

This section provides an overview of how the whole system of a search engine works.

The major functions of the search engine, namely crawling, indexing and searching, are also

covered in detail in the later sub-sections.

Before a search engine can tell you where a file or document is, it must be found.

To find information on the hundreds of millions of Web pages that exist, a typical search

engine employs special software robots, called spiders, to build lists of the words found

on Websites. When a spider is building its lists, the process is called Web crawling. A

Web crawler is a program, which automatically traverses the web by downloading

documents and following links from page to page. They are mainly used by web search

engines to gather data for indexing. Other possible applications include page validation,

structural analysis and visualization; update notification, mirroring and personal web

assistants/agents etc. Web crawlers are also known as spiders, robots, worms etc.

Crawlers are automated programs that follow the links found on the web pages. There

is a URL Server that sends lists of URLs to be fetched to the crawlers. The web pages


that are fetched are then sent to the store server. The store server then compresses

and stores the web pages into a repository. Every web page has an associated ID

number called a doc ID, which is assigned whenever a new URL is parsed out of a web

page. The indexer and the sorter perform the indexing function.

The indexer performs a number of functions. It reads the repository,

uncompresses the documents, and parses them. Each document is converted into a set

of word occurrences called hits. The hits record the word, position in document, an

approximation of font size, and capitalization. The indexer distributes these hits into a

set of "barrels", creating a partially sorted forward index.

The indexer performs another important function. It parses out all the links in

every web page and stores important information about them in an anchors file. This file

contains enough information to determine where each link points from and to, and the

text of the link. The URL Resolver reads the anchors file and converts relative URLs into

absolute URLs and in turn into doc IDs. It puts the anchor text into the forward index,

associated with the doc ID that the anchor points to.

It also generates a database of links, which are pairs of doc IDs. The links

database is used to compute Page Ranks for all the documents. The sorter takes the

barrels, which are sorted by doc ID and resorts them by word ID to generate the

inverted index. This is done in place so that little temporary space is needed for this

operation. The sorter also produces a list of word IDs and offsets into the inverted index.

A program called Dump Lexicon takes this list together with the lexicon produced by the

indexer and generates a new lexicon to be used by the searcher.

A lexicon lists all the terms occurring in the index along with some term-level

statistics (e.g., total number of documents in which a term occurs) that are used by the

ranking algorithms. The searcher is run by a web server and uses the lexicon built by

Dump Lexicon together with the inverted index and the Page Ranks to answer queries.


3.1. Web crawling

Web crawlers are an essential component of search engines; running a web crawler is a

challenging task. There are tricky performance and reliability issues and even more

importantly, there are social issues. Crawling is the most fragile application since it

involves interacting with hundreds of thousands of web servers and various name

servers, which are all beyond the control of the system. Web crawling speed is

governed not only by the speed of one’s own Internet connection, but also by the speed


of the sites that are to be crawled. Especially if one is crawling a site from multiple

servers, the total crawling time can be significantly reduced if many downloads are

done in parallel. Despite the numerous applications for Web crawlers, at the core they

are all fundamentally the same. Following is the process by which Web crawlers work:

Download the Web page.

Parse through the downloaded page and retrieve all the links.

For each link retrieved, repeat the process.

The Web crawler can be used for crawling through a whole site on the Inter-/Intranet.

You specify a start-URL and the Crawler follows all links found in that HTML page. This

usually leads to more links, which will be followed again, and so on. A site can be seen

as a tree-structure, the root is the start-URL; all links in that root-HTML-page are direct

sons of the root. Subsequent links are then sons of the previous sons.
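To make the download-parse-follow loop above concrete, here is a minimal, hedged sketch in Python (standard library only). The start URL is hypothetical, and a real crawler would add politeness delays, robots.txt checks and far more robust error handling.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href targets of all anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, max_pages=50):
    """Breadth-first crawl: download a page, extract its links, repeat."""
    seen, queue = {start_url}, deque([start_url])
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip unreachable pages
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)          # resolve relative URLs
            if urlparse(absolute).scheme in ("http", "https") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
        yield url, parser.links


# Example usage (hypothetical start URL):
# for page, links in crawl("http://example.com/"):
#     print(page, len(links), "links found")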


3.1.1. Types of crawling

Crawlers are basically of two types.

3.1.1.1. Focused crawling

A general purpose Web crawler gathers as many pages as it can from a particular set of

URLs, whereas a focused crawler is designed to gather documents only on a specific

topic, thus reducing the amount of network traffic and downloads. The goal of the

focused crawler is to selectively seek out pages that are relevant to a pre-defined set of

topics. The topics are specified not using keywords, but using exemplary documents.

Rather than collecting and indexing all accessible web documents to be able to

answer all possible ad-hoc queries, a focused crawler analyzes its crawl boundary to

find the links that are likely to be most relevant for the crawl, and avoids irrelevant

regions of the web. This leads to significant savings in hardware and network resources,

and helps keep the crawl more up-to-date. The focused crawler has three main

components: a classifier, which makes relevance judgments on crawled pages to

decide on link expansion; a distiller, which determines a measure of centrality of crawled

pages to determine visit priorities; and a crawler with dynamically reconfigurable priority

controls, which is governed by the classifier and distiller.

The most crucial evaluation of focused crawling is to measure the harvest ratio,

which is the rate at which relevant pages are acquired and irrelevant pages are

effectively filtered off from the crawl. This harvest ratio must be high, otherwise the

focused crawler would spend a lot of time merely eliminating irrelevant pages, and it

may be better to use an ordinary crawler instead.

3.1.1.2. Distributed crawling

Indexing the web is a challenge due to its growing and dynamic nature. As the size of

the Web is growing, it has become imperative to parallelize the crawling process in order

to finish downloading the pages in a reasonable amount of time. A single crawling

process, even if multithreading is used, will be insufficient for large-scale engines that


need to fetch large amounts of data rapidly. When a single centralized crawler is used

all the fetched data passes through a single physical link. Distributing the crawling

activity via multiple processes can help build a scalable, easily configurable system

that is fault tolerant. Splitting the load decreases hardware requirements and

at the same time increases the overall download speed and reliability. Each task is

performed in a fully distributed fashion, that is, no central coordinator exists.

3.1.2. Robot exclusion protocol

Web sites also often have restricted areas that crawlers should not crawl. To address

these concerns, many Web sites adopted the Robot protocol, which establishes

guidelines that crawlers should follow. Over time, the protocol has become the unwritten

law of the Internet for Web crawlers. The Robot protocol specifies that Web sites

wishing to restrict certain areas or pages from crawling have a file called robots.txt

placed at the root of the Web site. The ethical crawlers will then skip the disallowed

areas. Following is an example robots.txt file and an explanation of its format:

# Robots.txt for http://somehost.com/

User-agent: *

Disallow: /cgi-bin/

Disallow: /registration # Disallow robots on registration page

Disallow: /login

The first line of the sample file has a comment on it, as denoted by the use of a hash (#)

character. Crawlers reading robots.txt files should ignore any comments. The next line

of the sample file specifies the User-agent to which the Disallow rules following it apply.


User-agent is a term used for the programs that access a Web site. Each browser has a

unique User-agent value that it sends along with each request to a Web server.

However, typically Web sites want to disallow all robots (or User-agents) access to

certain areas, so they use a value of asterisk (*) for the User-agent. This specifies that

all User-agents be disallowed for the rules that follow it. The lines following the User-

agent lines are called disallow statements. The disallow statements define the Web site

paths that crawlers are not allowed to access. For example, the first disallow statement

in the sample file tells crawlers not to crawl any links that begin with “/cgi-bin/”. Thus,

the following URLs are both off limits to crawlers according to that line.

http://somehost.com/cgi-bin

http://somehost.com/cgi-bin/register
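Python's standard library ships a parser for this protocol (urllib.robotparser); the sketch below feeds it the sample robots.txt shown above, rather than fetching it over the network, and checks two URLs against the rules. It is an illustrative sketch, not the parsing logic of any particular search engine.

from urllib.robotparser import RobotFileParser

sample = """\
# Robots.txt for http://somehost.com/
User-agent: *
Disallow: /cgi-bin/
Disallow: /registration   # Disallow robots on registration page
Disallow: /login
"""

rp = RobotFileParser()
rp.parse(sample.splitlines())   # parse the rules directly instead of downloading them

# A well-behaved crawler checks every URL before fetching it.
for url in ("http://somehost.com/cgi-bin/register",
            "http://somehost.com/index.html"):
    print(url, "allowed" if rp.can_fetch("*", url) else "disallowed")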

3.1.3. Resource Constraints

Crawlers consume resources: network bandwidth to download pages, memory to

maintain private data structures in support of their algorithms, CPU to evaluate and

select URLs, and disk storage to store the text and links of fetched pages as well as

other persistent data.

3.2. Web Indexing

Search engine indexing collects, parses, and stores data to facilitate fast and accurate

information retrieval. Index design incorporates interdisciplinary concepts from

linguistics, cognitive psychology, mathematics, informatics, physics and computer

science. An alternate name for the process in the context of search engines designed to

find web pages on the Internet is Web indexing.

The purpose of storing an index is to optimize speed and performance in finding

relevant documents for a search query. Without an index, the search engine would scan

every document in the corpus, which would require considerable time and computing


power. For example, while an index of 10,000 documents can be queried within

milliseconds, a sequential scan of every word in 10,000 large documents could take

hours. The additional computer storage required to store the index, as well as the

considerable increase in the time required for an update to take place, are traded off for

the time saved during information retrieval.

3.2.1. Index design factors

Major factors in designing a search engine's architecture include:

Merge factors : How data enters the index, or how words or subject features are

added to the index during text corpus traversal, and whether multiple indexers

can work asynchronously. The indexer must first check whether it is updating old

content or adding new content. Traversal typically correlates to the data

collection policy. Search engine index merging is similar in concept to the SQL

Merge command and other merge algorithms.

Storage techniques : How to store the index data, that is, whether information

should be data compressed or filtered.

Index size : How much computer storage is required to support the index.

Lookup speed : How quickly a word can be found in the inverted index. The

speed of finding an entry in a data structure, compared with how quickly it can be

updated or removed, is a central focus of computer science.

Maintenance : How the index is maintained over time.

Fault tolerance : How important it is for the service to be reliable. Issues include

dealing with index corruption, determining whether bad data can be treated in

isolation, dealing with bad hardware, partitioning, and schemes such as hash-

based or composite partitioning, as well as replication.

3.2.2. Index data structures


Search engine architectures vary in the way indexing is performed and in methods of

index storage to meet the various design factors. Types of indices include:

Suffix tree : It is figuratively structured like a tree, supports linear time lookup.

Built by storing the suffixes of words. The suffix tree is a type of trie. Tries

support extendable hashing, which is important for search engine indexing. Used

for searching for patterns in DNA sequences and clustering. A major drawback is

that the storage of a word in the tree may require more storage than storing the

word itself. An alternate representation is a suffix array, which is considered to

require less virtual memory and supports data compression such as the BWT

algorithm.

Trie : An ordered tree data structure that is used to store an associative array

where the keys are strings. Regarded as faster than a hash table but less space-

efficient.

Inverted index : Stores a list of occurrences of each atomic search criterion,

typically in the form of a hash table or binary tree

Citation index : Stores citations or hyperlinks between documents to support

citation analysis, a subject of Bibliometrics.

N-gram index : Stores sequences of data of length n to support other types of

retrieval or text mining.

Term document matrix : Used in latent semantic analysis, stores the occurrences

of words in documents in a two-dimensional sparse matrix.

3.2.3. Types of indexing

Indexing is basically of two types.


3.2.3.1. Inverted Index:

Many search engines incorporate an inverted index when evaluating a search query to

quickly locate documents containing the words in a query and then rank these

documents by relevance. Because the inverted index stores a list of the documents

containing each word, the search engine can use direct access to find the documents

associated with each word in the query in order to retrieve the matching documents

quickly. The following is a simplified illustration of an inverted index:

Word Documents

the Doc1, Doc3, Doc4, Doc5

cow Doc2, Doc3, Doc4

says Doc5

moo Doc7

This index can only determine whether a word exists within a particular document,

since it stores no information regarding the frequency and position of the word; it is

therefore considered to be a Boolean index. Such an index determines which

documents match a query but does not rank matched documents. In some designs the

index includes additional information such as the frequency of each word in each

document or the positions of a word in each document. Position information enables the

search algorithm to identify word proximity to support searching for phrases; frequency

can be used to help in ranking the relevance of documents to the query. Such topics are

the central research focus of information retrieval. The inverted index is a sparse matrix,

since not all words are present in each document. To reduce computer storage memory

requirements, it is stored differently from a two dimensional array. The index is similar to

the term document matrices employed by latent semantic analysis. The inverted index


can be considered a form of a hash table. In some cases the index is a form of a binary

tree, which requires additional storage but may reduce the lookup time. In larger indices

the architecture is typically a distributed hash table.

The inverted index is filled via a merge or rebuild. A rebuild is similar to a merge

but first deletes the contents of the inverted index. The architecture may be designed to

support incremental indexing, where a merge identifies the document or documents to

be added or updated and then parses each document into words. For technical

accuracy, a merge conflates newly indexed documents, typically residing in virtual

memory, with the index cache residing on one or more computer hard drives.

After parsing, the indexer adds the referenced document to the document list for

the appropriate words. In a larger search engine, the process of finding each word in the

inverted index (in order to report that it occurred within a document) may be too time

consuming, and so this process is commonly split up into two parts, the development of

a forward index and a process which sorts the contents of the forward index into the

inverted index. The inverted index is so named because it is an inversion of the forward

index.
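As an illustrative sketch (with made-up toy documents), the following Python snippet builds a positional inverted index and answers a simple Boolean AND query by intersecting posting lists.

from collections import defaultdict

documents = {                       # hypothetical toy corpus
    "Doc1": "the cow says moo",
    "Doc2": "the cat and the hat",
    "Doc3": "the dish ran away with the spoon",
}

# word -> {doc_id -> [positions]}
inverted = defaultdict(dict)
for doc_id, text in documents.items():
    for position, word in enumerate(text.split()):
        inverted[word].setdefault(doc_id, []).append(position)

def search(query):
    """Return the documents that contain every word of the query (Boolean AND)."""
    postings = [set(inverted.get(word, {})) for word in query.split()]
    return set.intersection(*postings) if postings else set()

print(sorted(search("the cow")))    # ['Doc1']
print(inverted["the"]["Doc2"])      # positions of 'the' in Doc2: [0, 3]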

3.2.3.2. Forward Index:

The forward index stores a list of words for each document. The following is a simplified

form of the forward index:

Document word

Doc1 the, cow, says, moo

Doc2 the, cat, and, the, hat

Doc3 the, dish, ran, away, with, the, spoon


The rationale behind developing a forward index is that as documents are parsed, it is

better to immediately store the words per document. The delineation enables

asynchronous system processing, which partially circumvents the inverted index update

bottleneck. The forward index is sorted to transform it to an inverted index. The forward

index is essentially a list of pairs consisting of a document and a word, collated by the

document. Converting the forward index to an inverted index is only a matter of sorting

the pairs by the words. In this regard, the inverted index is a word-sorted forward index.
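A minimal sketch of that sorting step, using toy data rather than the barrels and word IDs of a real engine:

from itertools import groupby

# Forward index: document -> list of words (toy example like the table above)
forward = {
    "Doc1": ["the", "cow", "says", "moo"],
    "Doc2": ["the", "cat", "and", "the", "hat"],
}

# Flatten into (word, document) pairs and sort by word.
pairs = sorted((word, doc) for doc, words in forward.items() for word in words)

# Group the sorted pairs by word to obtain the inverted index.
inverted = {word: sorted({doc for _, doc in grp})
            for word, grp in groupby(pairs, key=lambda p: p[0])}

print(inverted["the"])   # ['Doc1', 'Doc2']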

3.2.4. Latent Semantic Indexing (LSI)

3.2.4.1. What is LSI:

Regular keyword searches approach a document collection with a kind of accountant

mentality: a document contains a given word or it doesn't, without any middle ground.

We create a result set by looking through each document in turn for certain keywords

and phrases, tossing aside any documents that don't contain them, and ordering the

rest based on some ranking system. Each document stands alone in judgement before

the search algorithm - there is no interdependence of any kind between documents,

which are evaluated solely on their contents.

Latent semantic indexing adds an important step to the document indexing

process. In addition to recording which keywords a document contains, the method

examines the document collection as a whole, to see which other documents contain

some of those same words. LSI considers documents that have many words in common

to be semantically close, and ones with few words in common to be semantically

distant. This simple method correlates surprisingly well with how a human being, looking

at content, might classify a document collection. Although the LSI algorithm doesn't

understand anything about what the words mean, the patterns it notices can make it

seem astonishingly intelligent.

When you search an LSI-indexed database, the search engine looks at similarity values

it has calculated for every content word, and returns the documents that it thinks best fit

the query. Because two documents may be semantically very close even if they do not


share a particular keyword, LSI does not require an exact match to return useful results.

Where a plain keyword search will fail if there is no exact match, LSI will often return

relevant documents that don't contain the keyword at all.

To use an earlier example, let's say we use LSI to index our collection of mathematical

articles. If the words n-dimensional, manifold and topology appear together in enough

articles, the search algorithm will notice that the three terms are semantically close. A

search for n-dimensional manifolds will therefore return a set of articles containing that

phrase (the same result we would get with a regular search), but also articles that

contain just the word topology. The search engine understands nothing about

mathematics, but examining a sufficient number of documents teaches it that the three

terms are related. It then uses that information to provide an expanded set of results

with better recall than a plain keyword search.

3.2.4.2 How LSI Works:

Natural language is full of redundancies, and not every word that appears in a

document carries semantic meaning. In fact, the most frequently used words in English

are words that don't carry content at all: functional words, conjunctions, prepositions,

auxiliary verbs and others. The first step in doing LSI is culling all those extraneous

words from a document, leaving only content words likely to have semantic meaning.

There are many ways to define a content word - here is one recipe for generating a list

of content words from a document collection:

Make a complete list of all the words that appear anywhere in the collection

Discard articles, prepositions, and conjunctions

Discard common verbs (know, see, do, be)

Discard pronouns

Discard common adjectives (big, late, high)

Discard frilly words (therefore, thus, however, albeit, etc.)


Discard any words that appear in every document

Discard any words that appear in only one document

This process condenses our documents into sets of content words that we can then use

to index our collection.
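A rough sketch of this recipe in Python, using a tiny hand-made stop list (a real stop list would be far larger and tuned to the collection); the final filter drops words that occur in every document or in only one. The documents here are invented for illustration.

import re
from collections import Counter

STOP_WORDS = {            # tiny illustrative stop list; a real one is much larger
    "the", "a", "an", "and", "or", "of", "to", "in", "on", "that", "is",
    "are", "with", "for", "have", "has", "be", "do", "know", "see",
}

def content_words(text):
    """Lower-case the text, drop punctuation and stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

docs = [
    "The cow says moo to the cat",
    "The cat sat on the mat",
    "A dish ran away with the spoon",
]
word_lists = [content_words(d) for d in docs]

# Keep only words that appear in more than one document but not in every document.
doc_freq = Counter(w for words in word_lists for w in set(words))
vocabulary = {w for w, df in doc_freq.items() if 1 < df < len(docs)}

print(word_lists)
print(vocabulary)        # {'cat'} for this toy collection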

Using our list of content words and documents, we can now generate a term-

document matrix. This is a fancy name for a very large grid, with documents listed

along the horizontal axis, and content words along the vertical axis. For each content

word in our list, we go across the appropriate row and put an 'X' in the column for any

document where that word appears. If the word does not appear, we leave that column

blank.

Doing this for every word and document in our collection gives us a mostly empty

grid with a sparse scattering of X-es. This grid displays everything that we know about

our document collection. We can list all the content words in any given document by

looking for X-es in the appropriate column, or we can find all the documents containing

a certain content word by looking across the appropriate row.

Notice that our arrangement is binary - a square in our grid either contains an X, or it

doesn't. This big grid is the visual equivalent of a generic keyword search, which looks

for exact matches between documents and keywords. If we replace blanks and X-es

with zeroes and ones, we get a numerical matrix containing the same information.

The key step in LSI is decomposing this matrix using a technique called singular value

decomposition. The mathematics of this transformation is beyond the scope of this

report.

Imagine that you are curious about what people typically order for breakfast

down at your local diner, and you want to display this information in visual form. You

decide to examine all the breakfast orders from a busy weekend day, and record how

many times the words bacon, eggs and coffee occur in each order.


You can graph the results of your survey by setting up a chart with three orthogonal

axes - one for each keyword. The choice of direction is arbitrary - perhaps a bacon axis

in the x direction, an eggs axis in the y direction, and the all-important coffee axis in the

z direction. To plot a particular breakfast order, you count the occurrence of each

keyword, and then take the appropriate number of steps along the axis for that word.

When you are finished, you get a cloud of points in three-dimensional space,

representing all of that day's breakfast orders.

If you draw a line from the origin of the graph to each of these points, you obtain

a set of vectors in 'bacon-eggs-and-coffee' space. The size and direction of each vector

tells you how many of the three key items were in any particular order, and the set of all

the vectors taken together tells you something about the kind of breakfast people favor

on a Saturday morning.

What your graph shows is called a term space. Each breakfast order forms a vector in

that space, with its direction and magnitude determined by how many times the three

keywords appear in it. Each keyword corresponds to a separate spatial direction,

perpendicular to all the others. Because our example uses three keywords, the resulting

term space has three dimensions, making it possible for us to visualize it. It is easy to

see that this space could have any number of dimensions, depending on how many

keywords we chose to use. If we were to go back through the orders and also record

occurrences of sausage, muffin, and bagel, we would end up with a six-dimensional

term space, and six-dimensional document vectors.
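To make the term-space idea concrete, the short sketch below turns a few invented breakfast orders into vectors of keyword counts and compares them with cosine similarity, one common way of measuring how close two document vectors are.

import math

KEYWORDS = ["bacon", "eggs", "coffee"]          # the three axes of our term space

def to_vector(order):
    words = order.lower().split()
    return [words.count(k) for k in KEYWORDS]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

order_a = to_vector("bacon eggs coffee coffee")   # hypothetical orders
order_b = to_vector("eggs coffee")
order_c = to_vector("bacon bacon")

print(cosine_similarity(order_a, order_b))  # high: the orders are close in term space
print(cosine_similarity(order_a, order_c))  # lower: the orders share fewer keywords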

Applying this procedure to a real document collection, where we note each use of a

content word, results in a term space with many thousands of dimensions. Each

document in our collection is a vector with as many components as there are content

words. Although we can't possibly visualize such a space, it is built in the exact same

way as the whimsical breakfast space we just described. Documents in such a space

that have many words in common will have vectors that are near to each other, while

documents with few shared words will have vectors that are far apart.


Latent semantic indexing works by projecting this large, multidimensional space down

into a smaller number of dimensions. In doing so, keywords that are semantically similar

will get squeezed together, and will no longer be completely distinct. This blurring of

boundaries is what allows LSI to go beyond straight keyword matching. To understand

how it takes place, we can use another analogy.

3.2.4.3. Singular Value Decomposition:

Imagine you keep tropical fish, and are proud of your prize aquarium - so proud that you

want to submit a picture of it to Modern Aquaria magazine, for fame and profit. To get

the best possible picture, you will want to choose a good angle from which to take the

photo. You want to make sure that as many of the fish as possible are visible in your

picture, without being hidden by other fish in the foreground. You also won't want the

fish all bunched together in a clump, but rather shot from an angle that shows them

nicely distributed in the water. Since your tank is transparent on all sides, you can take

a variety of pictures from above, below, and from all around the aquarium, and select

the best one.

In mathematical terms, you are looking for an optimal mapping of points in 3-space (the

fish) onto a plane (the film in your camera). 'Optimal' can mean many things - in this

case it means 'aesthetically pleasing'. But now imagine that your goal is to preserve the

relative distance between the fish as much as possible, so that fish on opposite sides of

the tank don't get superimposed in the photograph to look like they are right next to

each other. Here you would be doing exactly what the SVD algorithm tries to do with a

much higher-dimensional space.

Instead of mapping 3-space to 2-space, however, the SVD algorithm goes to much

greater extremes. A typical term space might have tens of thousands of dimensions,

and be projected down into fewer than 150. Nevertheless, the principle is exactly the

same. The SVD algorithm preserves as much information as possible about the relative

distances between the document vectors, while collapsing them down into a much


smaller set of dimensions. In this collapse, information is lost, and content words are

superimposed on one another.

Information loss sounds like a bad thing, but here it is a blessing. What we are losing is

noise from our original term-document matrix, revealing similarities that were latent in

the document collection. Similar things become more similar, while dissimilar things

remain distinct. This reductive mapping is what gives LSI its seemingly intelligent

behavior of being able to correlate semantically related terms. We are really exploiting a

property of natural language, namely that words with similar meaning tend to occur

together.
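A minimal numerical sketch of this reduction, assuming NumPy is available: decompose a tiny made-up term-document matrix, keep only the k largest singular values, and recombine the factors into a lower-rank approximation.

import numpy as np

# Rows are terms, columns are documents (tiny made-up binary matrix).
tdm = np.array([
    [1, 1, 0, 0],   # "star"
    [1, 0, 1, 0],   # "planet"
    [0, 1, 0, 1],   # "sun"
    [0, 0, 1, 1],   # "earth"
], dtype=float)

# Full singular value decomposition: tdm = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(tdm, full_matrices=False)

k = 2                                   # number of dimensions to keep
reduced = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.round(reduced, 3))             # low-rank approximation of the original matrix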

While a discussion of the mathematics behind singular value decomposition is beyond

the scope of this report, it's worthwhile to follow the process of creating a term-

document matrix in some detail, to get a feel for what goes on behind the scenes. Here

we will process a sample wire story to demonstrate how real-life texts get converted into

the numerical representation we use as input for our SVD algorithm.

The first step in the chain is obtaining a set of documents in electronic form. This can be

the hardest thing about LSI - there are all too many interesting collections not yet

available online. In our experimental database, we download wire stories from an online

newspaper with an AP news feed. A script downloads each day's news stories to a local

disk, where they are stored as text files.

Let's imagine we have downloaded the following sample wire story, and want to incorporate it in our

collection:

O'Neill Criticizes Europe on GrantsPITTSBURGH (AP)

Treasury Secretary Paul O'Neill expressed irritation on Wednesday that European countries have refused to go along with a U.S. proposal to boost the amount of direct grants rich nations offer poor countries. The Bush administration is pushing a plan


to increase the amount of direct grants the World Bank provides the poorest nations to 50 percent of assistance, reducing use of loans to these nations.

The first thing we do is strip all formatting from the article,

including capitalization, punctuation, and extraneous markup

(like the dateline). LSI pays no attention to word order,

formatting, or capitalization, so we can safely discard that

information. Our cleaned-up wire story looks like this:

O’Neill criticizes Europe on grants treasury secretary Paul O’Neill expressed irritation Wednesday that European countries have refused to go along with a us proposal to boost the amount of direct grants rich nations offer poor countries the bush administration is pushing a plan to increase the amount of direct grants the world bank provides the poorest nations to 50 percent of assistance reducing use of loans to these nations

The next thing we want to do is pick out the content words in our article. These are the

words we consider semantically significant - everything else is clutter. We do this by

applying a stop list of commonly used English words that don't carry semantic meaning. Using a stop

list greatly reduces the amount of noise in our collection, as well as eliminating a large number of words that

would make the computation more difficult. Creating a stop list is something of an art - it depends very

much on the nature of the data collection.

Here is our sample story again, before the stop-list words are removed:

O’Neill criticizes Europe on grants treasury secretary Paul O’Neill expressed irritation Wednesday that European countries have refused to go along with a US proposal to boost the amount of direct grants rich nations offer poor countries


the bush administration is pushing a plan to increase the amount of direct grants the world bank provides the poorest nations to 50 percent of assistance reducing use of loans to these nations

Removing these stop words leaves us with an abbreviated version of the article containing content words

only:

O’Neill criticizes Europe grants treasury secretary Paul O’Neill expressed irritation European countries refused US proposal boost direct grants rich nations poor countries bush administration pushing plan increase amount direct grants world bank poorest nations assistance loans nations

However, one more important step remains before our document is ready for indexing.

Notice how many of our content words are plural nouns (grants, nations) and

inflected verbs (pushing, refused). It doesn't seem very useful to have each inflected

form of a content word be listed separately in our master word list - with all the possible

variants, the list would soon grow unwieldy. More troubling is that LSI might not

recognize that the different variant forms were actually the same word in disguise. We

solve this problem by using a stemmer.

3.2.4.4. Stemming:

While LSI itself knows nothing about language (we saw how it deals exclusively with a

mathematical vector space), some of the preparatory work needed to get documents

ready for indexing is very language-specific. We have already seen the need for a stop

list, which will vary entirely from language to language and to a lesser extent from

document collection to document collection. Stemming is similarly language-specific,

derived from the morphology of the language. For English documents, we use an

algorithm called the Porter stemmer to remove common endings from words, leaving


behind an invariant root form. Here are some examples of words before and after

stemming:

Information -> inform
Presidency -> preside
Presiding -> preside
Happiness -> happy
Happily -> happy
Discouragement -> discourage
Battles -> battle
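Assuming the NLTK library is installed, its Porter stemmer can be applied directly, as in the sketch below; the exact stems it produces may differ slightly from the examples above, since stemmer variants normalize some endings differently.

from nltk.stem import PorterStemmer   # requires: pip install nltk

stemmer = PorterStemmer()
for word in ["information", "presidency", "presiding",
             "happiness", "happily", "discouragement", "battles"]:
    # Print each word alongside its stem; outputs depend on the stemmer variant.
    print(word, "->", stemmer.stem(word))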

And here is our sample story as it appears to the stemmer:

O’Neill criticizes Europe grants treasury secretary Paul O’Neill expressed irritation European countries refused US proposal boost direct grants rich nations poor countries bush administration pushing plan increase amount direct grants world bank poorest nations assistance loans nations

Note that at this point we have reduced the original natural-language news story to a

series of word stems. All of the information carried by punctuation, grammar, and style

is gone - all that remains is word order, and we will be doing away with even that by

transforming our text into a word list. It is striking that so much of the meaning of text

passages inheres in the number and choice of content words, and relatively little in the

way they are arranged. This is very counterintuitive, considering how important

grammar and writing style are to human perceptions of writing.

Having stripped, pruned, and stemmed our text, we are left with a flat list of words:

administrat
amount
assist
bank
boost
bush
countri (2)
direct
europ
express
grant (2)
increas
irritat
loan
nation (3)
O’Neill
Paul
plan
poor (2)
propos
push
refus
rich
secretar
treasuri
US
world

This is the information we will use to generate our term-document matrix, along with a

similar word list for every document in our collection.

3.2.4.5. The Term-Document Matrix:

As we mentioned in our discussion of LSI, the term-document matrix is a large grid

representing every document and content word in a collection. We have looked in detail

at how a document is converted from its original form into a flat list of content words. We

prepare a master word list by generating a similar set of words for every document in

our collection, and discarding any content words that either appear in every document

(such words won't let us discriminate between documents) or in only one document

(such words tell us nothing about relationships across documents). With this master

word list in hand, we are ready to build our TDM.

We generate our TDM by arranging our list of all content words along the vertical axis,

and a similar list of all documents along the horizontal axis. These need not be in any

particular order, as long as we keep track of which column and row corresponds to


which keyword and document. For clarity we will show the keywords as an alphabetized

list.

We fill in the TDM by going through every document and marking the grid square for all

the content words that appear in it. Because any one document will contain only a tiny

subset of our content word vocabulary, our matrix is very sparse (that is, it consists

almost entirely of zeroes).

Here is a fragment of the actual term-document matrix from our wire stories database:

Document     a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q

astro        0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
satellite    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
shine        0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0
star         0  0  1  0  0  0  0  0  0  1  0  0  0  0  0  0  0
planet       0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
sun          0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
earth        0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0

We can easily see if a given word appears in a given document by looking at the

intersection of the appropriate row and column. In this sample matrix, we have used

ones to represent document/keyword pairs. With such a binary scheme, all we can tell

about any given document/keyword combination is whether the keyword appears in the

document.


This approach will give acceptable results, but we can significantly improve our results

by applying a kind of linguistic favoritism called term weighting to the value we use for

each non-zero term/document pair.

Term weighting is a formalization of two common-sense insights:

1. Content words that appear several times in a document are probably more

meaningful than content words that appear just once.

2. Infrequently used words are likely to be more interesting than common words.

The first of these insights applies to individual documents, and we refer to it as local

weighting. Words that appear multiple times in a document are given a greater local

weight than words that appear once. We use a formula called logarithmic local

weighting to generate our actual value.

The second insight applies to the set of all documents in our collection, and is called

global term weighting. There are many global weighting schemes; all of them reflect

the fact that words that appear in a small handful of documents are likely to be more

significant than words that are distributed widely across our document collection. Our

own indexing system uses a scheme called inverse document frequency to calculate

global weights.

By way of illustration, here are some sample words from our collection, with the number

of documents they appear in, and their corresponding global weights.

word        count   global weight
unit        833     1.44
cost        295     2.47
project     169     3.03
tackle      40      4.47
wrestler    7       6.22

You can see that a word like wrestler, which appears in only seven documents, is

considered twice as significant as a word like project, which appears in over a hundred.


There is a third and final step to weighting, called normalization. This is a scaling step

designed to keep large documents with many keywords from overwhelming smaller

documents in our result set. It is similar to handicapping in golf - smaller documents are

given more importance, and larger documents are penalized, so that every document

has equal significance. These three values multiplied together - local weight, global

weight, and normalization factor - determine the actual numerical value that appears in

each non-zero position of our term/document matrix.
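One possible way to combine the three factors is sketched below (logarithmic local weight, inverse document frequency as the global weight, and unit-length normalization of each document); the counts are invented and the exact formulas used by any given indexing system may differ.

import math

def weight_matrix(counts):
    """counts[term][doc] -> raw term frequency; returns weighted matrix entries."""
    docs = {d for postings in counts.values() for d in postings}
    n_docs = len(docs)

    weighted = {}
    for term, postings in counts.items():
        df = len(postings)                         # documents containing the term
        global_w = math.log(n_docs / df)           # inverse document frequency
        for doc, tf in postings.items():
            local_w = 1.0 + math.log(tf)           # logarithmic local weight
            weighted[(term, doc)] = local_w * global_w

    # Normalization: scale each document column to unit length.
    lengths = {}
    for (term, doc), w in weighted.items():
        lengths[doc] = lengths.get(doc, 0.0) + w * w
    return {key: w / math.sqrt(lengths[key[1]])
            for key, w in weighted.items() if lengths[key[1]] > 0}

counts = {  # toy data: term -> {document: occurrences}
    "wrestler": {"d1": 2},
    "project":  {"d1": 1, "d2": 3},
    "unit":     {"d1": 1, "d2": 1, "d3": 4},
}
for key, value in sorted(weight_matrix(counts).items()):
    print(key, round(value, 3))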

Although this step may appear language-specific, note that we are only looking at word

frequencies within our collection. Unlike the stop list or stemmer, we don't need any

outside source of linguistic information to calculate the various weights. While weighting

isn't critical to understanding or implementing LSI, it does lead to much better results, as

it takes into account the relative importance of potential search terms.

With the weighting step done, we have done everything we need to construct a

finished term-document matrix. The final step will be to run the SVD algorithm itself.

Notice that this critical step will be purely mathematical - although we know that the

matrix and its contents are a shorthand for certain linguistic features of our collection,

the algorithm doesn't know anything about what the numbers mean. This is why we say

LSI is language-agnostic - as long as you can perform the steps needed to generate a

term-document matrix from your data collection, it can be in any language or format

whatsoever.

You may be wondering what the large matrix of numbers we have created has to do

with the term vectors and many-dimensional spaces we discussed in our earlier

explanation of how LSI works. In fact, our matrix is a convenient way to represent

vectors in a high-dimensional space. While we have been thinking of it as a lookup grid

that shows us which terms appear in which documents, we can also think of it in spatial

terms. In this interpretation, every column is a long list of coordinates that gives us the

exact position of one document in a many-dimensional term space. When we applied

term weighting to our matrix in the previous step, we nudged those coordinates around

to make the document's position more accurate.


As the name suggests, singular value decomposition breaks our matrix down into a set of smaller

components. The algorithm alters one of these components (this is where the number of dimensions gets

reduced), and then recombines them into a matrix of the same shape as our original, so we can again use it

as a lookup grid. The matrix we get back is an approximation of the term-document matrix we provided as

input, and looks much different from the original:

             a       b       c       d       e       f       g       h       i       j

star        -0.006  -0.006  -0.002  -0.002  -0.003  -0.001   0.000   0.007   0.004   0.008
planet      -0.0012
moon
sun
earth
astro
shine

Notice two interesting features in the processed data:

The matrix contains far fewer zero values. Each document has a similarity value

for most content words.

Some of the similarity values are negative. In our original TDM, this would

correspond to a document with fewer than zero occurrences of a word,

an impossibility. In the processed matrix, a negative value is indicative of a very

large semantic distance between a term and a document.

This finished matrix is what we use to actually search our collection. Given one or more

terms in a search query, we look up the values for each search term/document

combination, calculate a cumulative score for every document, and rank the documents

by that score, which is a measure of their similarity to the search query. In practice, we


will probably assign an empirically-determined threshold value to serve as a cutoff

between relevant and irrelevant documents, so that the query does not return every

document in our collection.
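Continuing the earlier SVD sketch, querying such an index might look like the following: sum the rows of the reduced matrix for the query terms to get a cumulative score per document, then return the documents above a chosen threshold. The matrix values, vocabulary and threshold here are all invented for illustration.

import numpy as np

terms = ["star", "planet", "sun", "earth"]            # hypothetical vocabulary
docs = ["a", "b", "c", "d"]

# Pretend this is the rank-reduced matrix produced by the SVD step.
reduced = np.array([
    [ 0.61,  0.02, -0.01,  0.55],
    [ 0.58,  0.01,  0.03,  0.49],
    [-0.02,  0.44,  0.51, -0.03],
    [ 0.01,  0.47,  0.48,  0.02],
])

def rank(query, threshold=0.1):
    rows = [reduced[terms.index(t)] for t in query.split() if t in terms]
    if not rows:
        return []
    scores = np.sum(rows, axis=0)                     # cumulative score per document
    order = np.argsort(scores)[::-1]                  # best-scoring documents first
    return [(docs[i], float(scores[i])) for i in order if scores[i] > threshold]

print(rank("star planet"))    # documents a and d score highest with these toy values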

3.2.5. Challenges in parallelism:

A major challenge in the design of search engines is the management of parallel

computing processes. There are many opportunities for race conditions and coherence

faults. For example, a new document is added to the corpus and the index must be

updated, but the index simultaneously needs to continue responding to search queries.

This is a collision between two competing tasks. Consider that authors are producers of

information, and a web crawler is the consumer of this information, grabbing the text

and storing it in a cache (or corpus). The forward index is the consumer of the

information produced by the corpus, and the inverted index is the consumer of

information produced by the forward index. This is commonly referred to as a producer-

consumer model. The indexer is the producer of searchable information and users are

the consumers that need to search. The challenge is magnified when working with

distributed storage and distributed processing. In an effort to scale with larger amounts

of indexed information, the search engine's architecture may involve distributed

computing, where the search engine consists of several machines operating in unison.


This increases the possibilities for incoherency and makes it more difficult to maintain a

fully-synchronized, distributed, parallel architecture.
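A toy sketch of this producer-consumer pattern using a thread-safe queue: one thread plays the crawler (producer of documents) and another plays the indexer (consumer). A real engine would spread these roles across many machines and use index snapshots so queries can be answered concurrently.

import queue
import threading

doc_queue = queue.Queue()
inverted_index = {}
index_lock = threading.Lock()

def crawler():
    """Producer: pretend to fetch documents and hand them to the indexer."""
    corpus = {"Doc1": "the cow says moo", "Doc2": "the cat and the hat"}
    for doc_id, text in corpus.items():
        doc_queue.put((doc_id, text))
    doc_queue.put(None)                      # sentinel: no more documents

def indexer():
    """Consumer: pull documents off the queue and update the inverted index."""
    while True:
        item = doc_queue.get()
        if item is None:
            break
        doc_id, text = item
        with index_lock:                     # queries could be reading concurrently
            for word in text.split():
                inverted_index.setdefault(word, set()).add(doc_id)

threads = [threading.Thread(target=crawler), threading.Thread(target=indexer)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(inverted_index["the"]))         # ['Doc1', 'Doc2']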

4. Meta-Search Engine

A meta-search engine is the kind of search engine that does not have its own database

of Web pages. It sends search terms to the databases maintained by other search

engines and gives users the results that come from all the search engines queried.

Few meta-searchers allow you to delve into the largest, most useful search engine

databases. They tend to return results from smaller and/or free search engines and

miscellaneous free directories, often small and highly commercial. The mechanisms and

algorithms that meta-search engines employ vary considerably. The simplest meta-

search engines just pass the queries to other direct search engines. The results are

then simply displayed in different newly opened browser windows as if several different

queries were posed.

Some improved meta-search engines organize the query results in one screen in

different frames, or in one frame but in a sequential order. Some more sophisticated

meta-search engines permit users to choose their favorite direct search engines in the

query input process, while using filters and other algorithms to process the returned

query results before displaying them to the users. Problems often arise in the query-

input process though. Meta-Search engines are useful if the user is looking for a unique

term or phrase, or if he or she simply wants to run a couple of keywords. Some meta-

search engines simply pass search terms along to the underlying direct search engine,

and if a search contains more than one or two words or very complex logic, most of

them will be lost. It will only make sense to the few search engines that support such

logic. Powerful meta-search engines typically work on top of direct search

engines like Google, AltaVista and Yahoo.

No two meta-search engines are alike. Some search only the most popular

search engines while others also search lesser-known engines, newsgroups, and other

databases. They also differ in how the results are presented and the quantity of engines

that are used. Some will list results according to search engine or database. Others


return results according to relevance, often concealing which search engine returned

which results. This benefits the user by eliminating duplicate hits and grouping the most

relevant ones at the top of the list.
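A hedged sketch of that merging step: ranked result lists from several hypothetical engines are interleaved by their per-engine rank and duplicates are removed by URL. Real meta-search engines use far more elaborate scoring.

from itertools import zip_longest

def merge_results(*result_lists):
    """Interleave ranked result lists, dropping duplicate URLs."""
    seen, merged = set(), []
    for tier in zip_longest(*result_lists):          # best result from each engine first
        for url in tier:
            if url and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

# Hypothetical result lists returned by three underlying search engines.
engine_a = ["http://a.example/1", "http://shared.example/x", "http://a.example/2"]
engine_b = ["http://shared.example/x", "http://b.example/1"]
engine_c = ["http://c.example/1"]

print(merge_results(engine_a, engine_b, engine_c))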

Search engines frequently have different ways they expect requests to be submitted. For

example, some search engines allow the usage of the word "AND" while others require

"+" and others require only a space to combine words. The better meta-search engines

try to synthesize requests appropriately when submitting them.

Results can vary between meta-search engines based on a large number of

variables. Still, even the most basic meta-search engine will allow more of the web to be

searched at once than any one stand-alone search engine. On the other hand, the

results are said to be less relevant, since a meta-search engine can’t know the internal

“alchemy” a search engine does on its result (a meta-search engine does not have any

direct access to the search engines’ databases). Meta-search engines are sometimes

used in vertical search portals, and to search the deep web. Some examples of meta-

search engines are Dogpile and MetaCrawler.

5. Search engine optimization

Search engine optimization (SEO) is the process of improving the volume and quality of

traffic to a web site from search engines via "natural" ("organic" or "algorithmic") search

results. Usually, the earlier a site is presented in the search results, or the higher it

"ranks," the more searchers will visit that site. SEO can also target different kinds of

search, including image search, local search, and industry-specific vertical search

engines.

As an Internet marketing strategy, SEO considers how search engines work and

what people search for. Optimizing a website primarily involves editing its content and

HTML coding to both increase its relevance to specific keywords and to remove barriers

to the indexing activities of search engines.


In Internet marketing terms, search engine optimization or SEO is the process of

making a website easy to find in search engines for its targeted and relevant keywords.

This could be achieved by optimizing internal and external factors that influence search

engine positioning. The main goal of every professionally implemented SEO campaign

is gaining top positioning for targeted keywords as well as search engine traffic growth.

That is the reason why search engine optimization may increase the number of sales

and conversions in times. Many third party organizations provides visibility to business

organization’s website on the World Wide Web, through search engine marketing

techniques and by methods of increasing page ranks of the organization’s website.

5.1. Page Rank

Page-Rank is a link analysis algorithm used by the Google Internet search engine that

assigns a numerical weighting to each element of a hyperlinked set of documents, such

as the World Wide Web, with the purpose of "measuring" its relative importance within

the set. The algorithm may be applied to any collection of entities with reciprocal

quotations and references. The numerical weight that it assigns to any given element E

is also called the Page-Rank of E and denoted by PR (E).

The name "Page-Rank" is a trademark of Google, and the Page-Rank process

has been patented (U.S. Patent 6,285,999). However, the patent is assigned to

Stanford University and not to Google. Google has exclusive license rights on the patent

from Stanford University. The university received 1.8 million shares of Google in

exchange for use of the patent; the shares were sold in 2005 for $336 million.

As an algorithm, Page-Rank is a probability distribution used to represent the

likelihood that a person randomly clicking on links will arrive at any particular page.

Page-Rank can be calculated for collections of documents of any size. It is assumed in

several research papers that the distribution is evenly divided between all documents in

the collection at the beginning of the computational process. The Page-Rank

computations require several passes, called "iterations", through the collection to adjust

approximate Page-Rank values to more closely reflect the theoretical true value.

A probability is expressed as a numeric value between 0 and 1. A 0.5 probability

is commonly expressed as a "50% chance" of something happening. Hence, a Page-

Rank of 0.5 means there is a 50% chance that a person clicking on a random link will be

directed to the document with the 0.5 Page-Rank.

5.2. The Ranking algorithm simplified

Assume a small universe of four web pages: A, B, C and D. The initial approximation of

Page-Rank would be evenly divided between these four documents. Hence, each

document would begin with an estimated Page-Rank of 0.25.

In the original form of Page-Rank, initial values were simply 1, which meant that the sum of the values over all pages was the total number of pages on the web. Later versions of Page-Rank (see the formulas below) assume a probability distribution between 0 and 1; here we simply use a probability distribution, hence the initial value of 0.25.

If pages B, C, and D each link only to A, they would each confer 0.25 Page-Rank to A. All Page-Rank PR( ) in this simplistic system would thus gather to A, because all links point to A:

PR(A) = PR(B) + PR(C) + PR(D) = 0.25 + 0.25 + 0.25 = 0.75

Again, suppose page B also has a link to page C, and page D has links to all three pages. The value of the link-votes is divided among all the outbound links on a page. Thus, page B gives a vote worth 0.125 to page A and a vote worth 0.125 to page C, and only one third of D's Page-Rank is counted for A's Page-Rank (approximately 0.083):

PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3 = 0.125 + 0.25 + 0.083 ≈ 0.458

In other words, the Page-Rank conferred by an outbound link is equal to the document's own Page-Rank score divided by the normalized number of outbound links L( ) (it is assumed that links to specific URLs only count once per document).

In the general case, the Page-Rank value for any page u can be expressed as

PR(u) = \sum_{v \in B_u} PR(v)/L(v)

i.e. the Page-Rank value for a page u depends on the Page-Rank values of each page v in the set B_u (the set of all pages linking to page u), divided by the number L(v) of links from page v.
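
A minimal sketch of one pass of this simplified (no damping) rule, assuming the four-page link structure described above (B links to A and C, C links to A, D links to A, B and C):

# One iteration of the simplified (no-damping) Page-Rank update for the
# four-page example in the text. A has no outbound links in this example.

links = {
    "A": [],
    "B": ["A", "C"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}

# Initial approximation: probability evenly divided among the four pages.
pr = {page: 0.25 for page in links}

def one_iteration(pr, links):
    """One pass of PR(u) = sum over v in B_u of PR(v) / L(v)."""
    new_pr = {page: 0.0 for page in pr}
    for v, outgoing in links.items():
        for u in outgoing:
            new_pr[u] += pr[v] / len(outgoing)
    return new_pr

print(one_iteration(pr, links))
# PR(A) = 0.25/2 + 0.25/1 + 0.25/3 ≈ 0.458, as in the text; D receives 0
# because no page links to it in this example.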

5.3. Damping factor

The Page-Rank theory holds that even an imaginary surfer who is randomly clicking on

links will eventually stop clicking. The probability, at any step, that the person will

continue is a damping factor d. Various studies have tested different damping

factors, but it is generally assumed that the damping factor will be set around 0.85. The

damping factor is subtracted from 1 (and in some variations of the algorithm, the result

is divided by the number of documents in the collection) and this term is then added to

the product of the damping factor and the sum of the incoming Page-Rank scores.

That is,

PR(A) = (1 - d) + d (PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D) + ...)

or, with N = the number of documents in the collection,

PR(A) = (1 - d)/N + d (PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D) + ...)
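
As a rough worked example, continuing the four-page illustration and assuming d = 0.85:

PR(A) = (1 - 0.85) + 0.85 × (PR(B)/2 + PR(C)/1 + PR(D)/3)
      = 0.15 + 0.85 × (0.125 + 0.25 + 0.083)
      ≈ 0.15 + 0.85 × 0.458
      ≈ 0.54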

So any page's Page-Rank is derived in large part from the Page-Ranks of other pages.

The damping factor adjusts the derived value downward. Google recalculates Page-

Rank scores each time it crawls the Web and rebuilds its index. As Google increases the

number of documents in its collection, the initial approximation of Page-Rank decreases

for all documents.

The formula uses a model of a random surfer who gets bored after several clicks

and switches to a random page. The Page-Rank value of a page reflects the chance

that the random surfer will land on that page by clicking on a link. It can be understood

as a Markov chain in which the states are pages, and the transitions, which are all equally probable, are the links between pages. If a page has no links to other pages, it

becomes a sink and therefore terminates the random surfing process. However, the

solution is quite simple. If the random surfer arrives at a sink page, it picks another URL

at random and continues surfing again.

When calculating Page-Rank, pages with no outbound links are assumed to link

out to all other pages in the collection. Their Page-Rank scores are therefore divided

evenly among all other pages. In other words, to be fair with pages that are not sinks,

these random transitions are added to all nodes in the Web, with a residual probability

of usually d = 0.85, estimated from the frequency that an average surfer uses his or her

browser's bookmark feature.

So, the equation is as follows:

PR(p_i) = (1 - d)/N + d \sum_{p_j \in M(p_i)} PR(p_j)/L(p_j)

where p_1, p_2, ..., p_N are the pages under consideration, M(p_i) is the set of pages that link to p_i, L(p_j) is the number of outbound links on page p_j, and N is the total number of pages.

The Page-Rank values are the entries of the dominant eigenvector of the modified adjacency matrix. This makes Page-Rank a particularly elegant metric: the eigenvector is

R = [PR(p_1), PR(p_2), ..., PR(p_N)]^T

where R is the solution of the equation

R = ((1 - d)/N) \mathbf{1} + d \ell R

in which \mathbf{1} is the N-dimensional column vector of ones, and the adjacency function \ell(p_i, p_j) is 0 if page p_j does not link to p_i and is normalized such that, for each j,

\sum_{i=1}^{N} \ell(p_i, p_j) = 1

i.e. the elements of each column sum up to 1.
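
As a quick numerical check of this eigenvector formulation (a sketch that assumes NumPy is available; the matrix below encodes the hypothetical four-page example from earlier, with the sink page A's column spread evenly over all pages):

# Page-Rank as the dominant eigenvector of the modified adjacency matrix,
# for the four-page example. Illustrative sketch only; assumes NumPy.
import numpy as np

d, n = 0.85, 4
# Column-normalized link matrix: entry (i, j) is nonzero only if page j
# links to page i. Pages in order A, B, C, D; A is a sink, so its column
# is spread evenly over all pages.
ell = np.array([
    [0.25, 0.5, 1.0, 1/3],   # links into A
    [0.25, 0.0, 0.0, 1/3],   # links into B
    [0.25, 0.5, 0.0, 1/3],   # links into C
    [0.25, 0.0, 0.0, 0.0],   # links into D (nothing links to D)
])
G = (1 - d) / n * np.ones((n, n)) + d * ell      # modified adjacency matrix
eigenvalues, eigenvectors = np.linalg.eig(G)
r = np.real(eigenvectors[:, np.argmax(eigenvalues.real)])
print(r / r.sum())                               # normalized Page-Rank vector

The normalized dominant eigenvector agrees with the values produced by the iterative approximation sketched further below.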

This is a variant of the eigenvector centrality measure used commonly in network

analysis. Because of the large eigen-gap of the modified adjacency matrix above, the

values of the Page-Rank eigenvector are fast to approximate (only a few iterations are

needed).
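
As an illustration of such an iterative approximation, the following is a minimal power-iteration sketch, not Google's actual implementation; it reuses the hypothetical four-page example and, for simplicity, spreads a sink page's score evenly over all pages:

# Minimal power-iteration sketch for damped Page-Rank. Illustrative only,
# not Google's implementation. Page A is a sink; its score is redistributed
# evenly over all pages each iteration.

def pagerank(links, d=0.85, iterations=20):
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}              # even initial distribution
    for _ in range(iterations):
        sink_mass = sum(pr[p] for p in pages if not links[p])
        new_pr = {}
        for p in pages:
            incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new_pr[p] = (1 - d) / n + d * (incoming + sink_mass / n)
        pr = new_pr
    return pr

links = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
print(pagerank(links))   # values form a probability distribution (sum ≈ 1)

Run on this example, the values settle within a few dozen iterations and always sum to 1, matching the probability interpretation given earlier.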

As a result of Markov theory, it can be shown that the Page-Rank of a page is the probability of being at that page after lots of clicks. This happens to equal t^(-1), i.e. 1/t, where t is the expectation of the number of clicks (or random jumps) required to get from the page back to itself.

The main disadvantage is that it favors older pages, because a new page, even a

very good one, will not have many links unless it is part of an existing site (a site being a

densely connected set of pages, such as Wikipedia). The Google Directory (itself a

derivative of the Open Directory Project) allows users to see results sorted by Page-

Rank within categories. The Google Directory is the only service offered by Google

where Page-Rank directly determines display order. In Google's other search services

(such as its primary Web search) Page-Rank is used to weigh the relevance scores of

pages shown in search results.

Several strategies have been proposed to accelerate the computation of Page-

Rank. Various strategies to manipulate Page-Rank have been employed in

concerted efforts to improve search results rankings and monetize advertising links.

These strategies have severely impacted the reliability of the Page-Rank concept, which

seeks to determine which documents are actually highly valued by the Web community.

5.4. Uses of Page-Rank

A version of Page-Rank has recently been proposed as a replacement for the

traditional Institute for Scientific Information (ISI) impact factor, and implemented

at eigenfactor.org. Instead of merely counting total citations to a journal, the

"importance" of each citation is determined in a Page-Rank fashion.

A similar new use of Page-Rank is to rank academic doctoral programs based on

their records of placing their graduates in faculty positions. In Page-Rank terms,

academic departments link to each other by hiring their faculty from each other

(and from themselves).

Page-Rank has been used to rank spaces or streets to predict how many people

(pedestrians or vehicles) come to the individual spaces or streets.

Page-Rank has also been used to automatically rank WordNet synsets according

to how strongly they possess a given semantic property, such as positivity or

negativity.

A dynamic weighting method similar to Page-Rank has been used to generate

customized reading lists based on the link structure of Wikipedia.

A Web crawler may use Page-Rank as one of a number of importance metrics it

uses to determine which URL to visit next during a crawl of the web. One of the

early working papers used in the creation of Google is “Efficient Crawling Through URL Ordering”, which discusses the use of a number of different

importance metrics to determine how deeply and how much of a site Google will

crawl. Page-Rank is presented as one of a number of these importance metrics,

though there are others listed such as the number of inbound and outbound links

for a URL, and the distance from the root directory on a site to the URL.

Page-Rank may also be used as a methodology to measure the apparent impact of a community such as the blogosphere on the overall Web itself. This approach therefore uses Page-Rank to measure the distribution of attention in reflection of the scale-free network paradigm.

6. Marketing of search engines

Search engine marketing, or SEM, is a form of Internet marketing that seeks to promote

websites by increasing their visibility in search engine result pages (SERPs) through the

use of paid placement, contextual advertising, and paid inclusion. The pay-per-click (PPC)-led Search Engine Marketing Professional Organization (SEMPO) also includes search engine optimization (SEO) within its reporting, but most sources treat SEO as a separate discipline, with the New York Times defining SEM as “the practice of buying paid search listings”.

Fig. 1. The advertisement market share of search engines

Search engines have become indispensable to interacting on the Web. In addition to

processing information requests, they are navigational tools that can direct users to

specific Web sites or aid in browsing. Search engines can also facilitate e-commerce

transactions as well as provide access to noncommercial services such as maps, online

auctions, and driving directions.

People use search engines as dictionaries, spell checkers, and thesauruses; as

discussion groups (Google Groups) and social networking forums (Yahoo! Answers);

and even as entertainment (Google-whacking, vanity searching). In this competitive

market, rivals continually strive to improve their information-retrieval capabilities and

increase their financial returns. One innovation is sponsored search, an “economics

meets search” model in which content providers pay search engines for user traffic

going from the search engine to their Web sites. Sponsored search has proven to be a

successful business.

Most Web search engines are commercial ventures supported by advertising

revenue and, as a result, some employ the practice of allowing advertisers to pay

money to have their listings ranked higher in search results. Those search engines

which do not accept money for their search engine results make money by running

search related ads alongside the regular search engine results. The search engines

make money every time someone clicks on one of these ads.

Revenue in the web search portals industry is projected to grow in 2008 by 13.4

percent, with broadband connections expected to rise by 15.1 percent. Between 2008

and 2012, industry revenue is projected to rise by 56 percent as Internet penetration still

has some way to go to reach full saturation in American households. Furthermore,

broadband services are projected to account for an ever increasing share of domestic

Internet users, rising to 118.7 million by 2012, with an increasing share accounted for by

fiber-optic and high speed cable lines.

7. Summary

With the rapid expansion of the Web, extracting knowledge from the Web is becoming increasingly important and popular. This is due to the Web's convenience and richness of information. Today search engines can cover more than 60% of the information on the World Wide Web. The future prospects of every aspect of search engines are very bright; Google, for example, is working on embedding intelligence in its search engine.

For all their problems, online search engines have come a long way. Sites like

Google are pioneering the use of sophisticated techniques to help distinguish content

from drivel, and the arms race between search engines and the marketers who want to

manipulate them has spurred innovation. But the challenge of finding relevant content

online remains. Because of the sheer number of documents available, we can find

interesting and relevant results for any search query at all. The problem is that those

results are likely to be hidden in a mass of semi-relevant and irrelevant information, with

no easy way to distinguish the good from the bad.

8. References

Brin, Sergey and Page, Lawrence. "The Anatomy of a Large-Scale Hypertextual Web Search Engine." Computer Networks and ISDN Systems, April 1998.

Baldi, Pierre. Modeling the Internet and the Web: Probabilistic Methods and Algorithms, 2003.

Chakrabarti, Soumen. Mining the Web: Analysis of Hypertext and Semi-Structured Data, 2003.

Jansen, B. J. "The Comparative Effectiveness of Sponsored and Non-sponsored Links for Web E-commerce Queries." ACM Transactions on the Web, May 2007.

Del Corso, Gianna M., Gullí, Antonio and Romani, Francesco. "Fast PageRank Computation via a Sparse Linear System (Extended Abstract)." http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.118.5422

Deeho Search Engine Optimization (SEO) solutions.

9. Bibliography

Wikipedia.org

Google Books

The SEO Books