Mythology of search engine

HIMANSHU KUMAR DAS

Department of Information Science

Rajarajeswari College of Engineering, Bangalore

Email: [email protected]

Agenda

• What?

• When?

• Why?

• How?

What: Is a search engine?

• A computer program designed to search for information on the World Wide Web.

• The search results are usually presented as a list and are commonly called hits.

• The information may consist of web pages, images, and other types of files.

• Search engines operate algorithmically or are a mixture of algorithmic and human input.

When: History!!!

• Archie (or “Archive” without ‘v’) – 1990 – Alan Emtage, Bill Heelan and J. Peter Deutsch – computer science students at McGill University in Montreal

• First search engine.

• Its FTP site hosted an index of downloadable directory listings.

• Due to limited space, only the listings were available and not the content of each site.

Why: Do we search?

• Not because we want to know what Barack Obama is going to have for lunch.

Why: Do we search?(…continued)

• Not because we want to know where Osama is.

Why: Do we search?(…continued)

• We search because we are curious. And curiosity comes from chaos, a state of extreme confusion and disorder.

• Our environment determines how curious we are. If nothing changed, we wouldn’t need curiosity.

• Information is key in everything we do.

• We need information to complete a task (looking up a phone number, referring to a map, reading directions) or to learn something new.

• Sometimes we seek information because we need it. Sometimes we seek information just because we want it.

• We scan our store of information, retrieve what we have and identify what we don’t.

Why: Do we search?(…continued)

• When we talk about information-seeking and the ease of retrieval, the Web, and in particular Web search, has been the most significant development in the history of man. That’s where we start next.

Anatomy of Search Engine

• A search engine operates in the following order:

1. Web Crawling

2. Indexing

3. Query Processing (Searching)

Anatomy of Search Engine(…continued)

• Search Engines use software called Bots or Spiders (an automated Web browser which follows every link on the site) to scour the web.

• These Bots and Spiders find new websites and web pages by following links on a web page.
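
The link-following behaviour of a spider can be sketched as a breadth-first traversal. This is only a minimal illustration, not how any real search engine's bot is built: the `TOY_WEB` dictionary is a made-up stand-in for the web (each page name mapped to the links found on it), and `crawl` is a hypothetical helper name.

```python
from collections import deque

# Hypothetical toy "web": each page maps to the links found on it.
TOY_WEB = {
    "home": ["about", "news"],
    "about": ["home", "team"],
    "news": ["home", "story"],
    "team": [],
    "story": ["news"],
    "orphan": ["home"],  # no page links here, so a spider never finds it
}


def crawl(web, start):
    """Return pages in the order a link-following spider discovers them."""
    discovered = [start]
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        # Follow every link on the current page.
        for link in web.get(page, []):
            if link not in seen:
                seen.add(link)
                discovered.append(link)
                queue.append(link)
    return discovered
```

Calling `crawl(TOY_WEB, "home")` visits every page reachable from "home". The "orphan" page is never discovered because nothing links to it, which is why pages without inbound links tend to be hard for a spider to find.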

Anatomy of Search Engine(…continued)


• Once they find a web page, they “read” the text-based content…

• …and the Search Engine stores this data into a huge library called an Index.
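
One common way to organise such an Index is an inverted index, which maps each word to the pages that contain it. The sketch below is an assumption made for illustration only: the `PAGES` texts are invented, and `tokenize` and `build_index` are hypothetical helper names, not part of any real search engine.

```python
import re
from collections import defaultdict

# Hypothetical page texts, standing in for what a spider has "read".
PAGES = {
    "page1": "Search engines crawl the web",
    "page2": "Spiders follow links on the web",
    "page3": "An index maps words to pages",
}


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Build an inverted index: word -> set of pages containing it."""
    index = defaultdict(set)
    for page, text in pages.items():
        for word in tokenize(text):
            index[word].add(page)
    return index
```

With this structure, finding every page that mentions a word is a single dictionary lookup, for example `build_index(PAGES)["web"]`, instead of a scan over every stored page.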



• When you search for something on Google…

• …the Search Engine reaches into its gigantic Index…



• …and before displaying a list of web pages it uses its Algorithm to calculate which ones best match your search query…
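
The lookup-then-rank step can be sketched as follows. Real ranking algorithms are not public, so the scoring rule here (counting how often the query words occur on each page) is a deliberately simple stand-in chosen for illustration; `DOCS`, `build_index` and `search` are hypothetical names.

```python
import re
from collections import Counter, defaultdict

# Hypothetical page texts that have already been crawled.
DOCS = {
    "page1": "search engines search the web",
    "page2": "spiders follow links on the web",
    "page3": "cooking recipes for dinner",
}


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Index: word -> Counter of {page: occurrences of the word}."""
    index = defaultdict(Counter)
    for page, text in docs.items():
        for word in tokenize(text):
            index[word][page] += 1
    return index


def search(query, index):
    """Return matching pages, best match first (toy scoring rule)."""
    scores = Counter()
    for term in tokenize(query):
        # Only the index entries for the query words are consulted.
        for page, count in index.get(term, {}).items():
            scores[page] += count
    # Highest score first; ties broken by page name for a stable order.
    return sorted(scores, key=lambda p: (-scores[p], p))
```

For the query "web search", "page1" ranks above "page2" because it contains more occurrences of the query words, and "page3" is not returned at all.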


Search Engine Algorithm

• What is it about a search engine that lets us enter a query and get the desired result?

Search Engine Algorithm

• Search Engine Algorithms are Kept Secret

• Facts About Search Engine Algorithm Secrets:

• Anyone who knew the exact search engine algorithm would not be selling the information cheaply over the web.

• Search engines change their algorithms many times each month to fight off "spam."

• If people knew the exact algorithm then they could manipulate rankings as they please until the search results became so irrelevant that the search engine became junk.

• Search engine algorithms are some of the most tightly kept secrets in the world for this reason.

• Entire businesses, businesses spawned off of those, and businesses spawned off of those in turn exist solely because of these search engine algorithms.

Search Query

• A web search query is a query that a user enters into a web search engine to satisfy his or her information needs.

• There are four broad categories that cover most web search queries.

• Informational queries – Queries that cover a broad topic (e.g., colorado or trucks) for which there may be thousands of relevant results.

• Navigational queries – Queries that seek a single website or web page of a single entity (e.g., youtube or delta air lines).

• Transactional queries – Queries that reflect the intent of the user to perform a particular action, like purchasing a car or downloading a screen saver.

• Connectivity queries – Queries that report on the connectivity of the indexed web graph (e.g., Which links point to this URL?, and How many pages are indexed from this domain name?).
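
The two connectivity questions quoted above can be answered directly from a stored link graph. The following is a minimal sketch under stated assumptions: `LINK_GRAPH` is an invented graph using placeholder `.example` URLs, and `inbound_links` and `pages_indexed_from` are hypothetical helper names.

```python
from urllib.parse import urlparse

# Hypothetical indexed web graph: URL -> URLs it links to.
LINK_GRAPH = {
    "http://a.example/": ["http://b.example/", "http://c.example/page"],
    "http://b.example/": ["http://c.example/page"],
    "http://c.example/page": ["http://a.example/"],
    "http://c.example/other": [],
}


def inbound_links(graph, url):
    """Which links point to this URL?"""
    return sorted(src for src, targets in graph.items() if url in targets)


def pages_indexed_from(graph, domain):
    """Which pages are indexed from this domain name?"""
    return sorted(u for u in graph if urlparse(u).hostname == domain)
```

Here `inbound_links(LINK_GRAPH, "http://c.example/page")` lists the two pages that link to it, and `len(pages_indexed_from(LINK_GRAPH, "c.example"))` gives the number of indexed pages for that domain.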

Architecture of the search engine

War on New Technologies: Google vs Yahoo!

• Did you hear? Google’s launching a new, upgraded version of its search engine soon.

• Google promises that the new search tool (codename “Caffeine”) will improve the speed, accuracy, size, and comprehensiveness of Google search.

• To refresh a layer of the old index, the entire web had to be analyzed, which meant a significant delay between when a page was found and when it was made available.

• With Caffeine, Google analyzes the web in small portions and updates the search index on a continuous basis, globally.

• Caffeine takes up nearly 100 million gigabytes of storage in one database and adds new information at a rate of hundreds of thousands of gigabytes per day.
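
The difference between refreshing a whole index layer and updating in small portions can be illustrated with a toy index that re-indexes one page at a time. This is not Google's implementation, whose internals are not described here; `IncrementalIndex` and its methods are hypothetical names used only to show the idea.

```python
import re
from collections import defaultdict


class IncrementalIndex:
    """Toy index that is updated one page at a time, never rebuilt."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> pages containing it
        self.page_words = {}              # page -> words last indexed for it

    def update_page(self, page, text):
        """Re-index a single page without touching any other page."""
        new_words = set(re.findall(r"[a-z0-9]+", text.lower()))
        old_words = self.page_words.get(page, set())
        for word in old_words - new_words:
            self.postings[word].discard(page)  # word no longer on the page
        for word in new_words - old_words:
            self.postings[word].add(page)      # word newly on the page
        self.page_words[page] = new_words

    def lookup(self, word):
        """Return the pages currently indexed for a word."""
        return sorted(self.postings.get(word.lower(), set()))
```

Each call to `update_page` makes the changed page searchable immediately, so a fresh page does not have to wait for the rest of the web to be re-analyzed.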

• BOSS (Build your Own Search Service) has a simple goal: faster innovation in the search landscape.

• Yahoo! Search BOSS is a web services platform that allows developers and companies to create and launch web-scale search products by utilizing the same infrastructure and technology that powers Yahoo! Search.

So what is BOSS?

• BOSS is a new, open platform that offers programmatic access to the entire Yahoo! Search index via an API.

• BOSS allows developers to take advantage of Yahoo!’s production search infrastructure and technology, combine that with their own unique assets, and create their own search experiences.

• While search APIs have been available for some time, BOSS removes many of the usage restrictions that have prevented other companies from using them to build innovative new search engines.

Examples of BOSS!!!

• Hakia, a semantic search start-up, is using BOSS to access the Yahoo! Search index and dramatically increase the speed with which it can semantically analyze the web. With BOSS providing this important infrastructure, Hakia is able to deliver a language search experience that isn’t available from any of the “big three” search providers or other semantic search engines.

• Daylife To-Go - Smart Web Publishing Solutions

• Cluuz, a next-generation search engine prototype, generates easier-to-understand search results through semantic cluster graphs, image extraction and tag clouds. The Cluuz analysis is performed in real-time on results returned from the BOSS API.

http://www.hakia.com/

http://www.daylife.com/page/yahoosearchondaylife



http://www.cluuz.com/
