Web Search Created by Ejaj Ahamed




What is the Web?

The World Wide Web began in 1989 at the CERN Particle Physics Lab in Switzerland. The Web did not gain widespread popular use until browsers like NCSA Mosaic became available in 1993, and Netscape in 1994. Efforts to make the Web searchable began soon thereafter, with search tools such as the Wanderer and JumpStation in 1993.


Web challenges

• Distributed data: Documents exist across millions of decentralized servers. Computers are interconnected without any predefined topology, and bandwidth and reliability vary widely. There is no central registry of web servers, and virtual hosting complicates matters further.

• Volatile data: Many documents change or disappear rapidly. It has been estimated that 40% of the Web changes monthly; as a result, indexes quickly grow outdated or inaccurate.

• Scale: There are billions of separate documents. The growth appears exponential, which poses scaling issues that are difficult to cope with.


Web challenges (Continued)

• Lack of structure: There is no uniform structure, HTML errors are common, and up to 30% of documents are (near-)duplicates. Most HTML pages are not valid and come in many formats. Much web data is repeated.

• Quality of data: There is no editorial control, so false information and poor-quality writing are common. There is also undesirable content, and filtering it is technically complex.

• Heterogeneous data: Multiple media types (images, video, VRML), languages, character sets, etc. Initially, the Web was dominated by English speakers; now less than half of existing web pages are in English. The number of non-English servers and users has grown dramatically.


Search engines!

• Search engines are critically important in helping users find relevant information on the World Wide Web.

• People can search the Web using different search engines that use various algorithms and techniques.

• Web searching is now also conducted by non-human searchers, including agents, softbots, and automated processes or spiders.


How does a search engine work?

• Create an index
• Receive a query – a set of search terms and commands
• Look in the index file for matches
• Gather the matching page entries and rank them by relevance
• Format the results
• Return the result page in HTML to the searcher's web browser
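
The steps above can be sketched end to end in a few lines of Python. This is only an illustrative toy, not how any production engine works: the corpus `PAGES`, the function names, and the "count matching terms" relevance score are all assumptions made for the example.

```python
import html
from collections import defaultdict

# Hypothetical toy corpus: URL -> page text (not from the slides).
PAGES = {
    "http://example.com/a": "web search engines index the web",
    "http://example.com/b": "spiders follow links to gather pages",
    "http://example.com/c": "search engines rank pages by relevance",
}

def create_index(pages):
    """Step 1: map each term to {url: number of occurrences}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term][url] = index[term].get(url, 0) + 1
    return index

def search(index, query):
    """Steps 2-4: parse the query, look up matches, rank by relevance."""
    terms = query.lower().split()
    scores = defaultdict(int)
    for term in terms:
        for url, count in index.get(term, {}).items():
            scores[url] += count  # assumed relevance: total term occurrences
    # Highest score first; ties broken by URL for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

def format_results(query, ranked):
    """Steps 5-6: format the ranked entries as an HTML result page."""
    items = "".join(
        f'<li><a href="{html.escape(url)}">{html.escape(url)}</a> (score {score})</li>'
        for url, score in ranked
    )
    return (
        f"<html><body><h1>Results for {html.escape(query)}</h1>"
        f"<ol>{items}</ol></body></html>"
    )

INDEX = create_index(PAGES)
```

Calling `format_results("web", search(INDEX, "web"))` returns the HTML string that a server would send back to the searcher's browser.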


Google Search Engine Architecture

Source: http://www-db.stanford.edu/~backrub/google.html


Indexing process

Indexer Application
– Gathers and stores text

Inverted Index File contains entries for each instance of each word:
– Location within file (for phrase matching)
– Enclosing field or meta tag
– Pointer to document info
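
As a rough illustration of such an inverted index, the sketch below stores one entry per word instance with the three pieces of information listed above, and uses the stored locations for phrase matching. The `Posting` record, the `DOCS` sample data, and the function names are assumptions for the example, and a real index file would be stored far more compactly.

```python
from collections import defaultdict, namedtuple

# One posting per word instance: pointer to the document, enclosing field,
# and location within that field.
Posting = namedtuple("Posting", ["doc_id", "field", "position"])

# Hypothetical documents: doc_id -> {field name: text}.
DOCS = {
    1: {"title": "web search", "body": "search engines index the web"},
    2: {"title": "spiders", "body": "the web search spider follows links"},
}

def build_inverted_index(docs):
    """Map each word to the list of postings where it occurs."""
    index = defaultdict(list)
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            for position, word in enumerate(text.lower().split()):
                index[word].append(Posting(doc_id, field, position))
    return index

def phrase_match(index, phrase):
    """Return sorted doc_ids where the words of `phrase` occur
    consecutively within a single field."""
    words = phrase.lower().split()
    if not words:
        return []
    hits = set()
    # Try every instance of the first word as a possible phrase start.
    for start in index.get(words[0], []):
        if all(
            Posting(start.doc_id, start.field, start.position + offset)
            in index.get(word, [])
            for offset, word in enumerate(words[1:], start=1)
        ):
            hits.add(start.doc_id)
    return sorted(hits)

INVERTED = build_inverted_index(DOCS)
```

Here `phrase_match(INVERTED, "web search")` finds both documents, while `phrase_match(INVERTED, "index web")` finds none, because "the" sits between the two words in document 1.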


Robot spider indexers

Many search engines use programs called robots to gather web pages for indexing. These programs are not limited to a predefined list of web pages; they can follow links on the pages they find, which makes them a form of intelligent agent. The process of following links is called spidering.
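
A minimal sketch of spidering, assuming a breadth-first strategy (the slides do not say which order robots use). To keep it runnable without network access, pages come from a hypothetical in-memory dictionary `FAKE_WEB`; a real robot would pass a function that fetches URLs over HTTP instead.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://site.test/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://site.test/a.html": '<a href="/">home</a> <a href="/c.html">C</a>',
    "http://site.test/b.html": "no links here",
    "http://site.test/c.html": '<a href="/a.html">back to A</a>',
}

class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def spider(seed, fetch, max_pages=100):
    """Breadth-first crawl from `seed`; `fetch(url)` returns HTML or None."""
    seen = {seed}
    queue = deque([seed])
    gathered = []
    while queue and len(gathered) < max_pages:
        url = queue.popleft()
        page = fetch(url)
        if page is None:
            continue  # unreachable page: skip it
        gathered.append(url)
        parser = LinkExtractor()
        parser.feed(page)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return gathered

CRAWLED = spider("http://site.test/", FAKE_WEB.get)
```

Starting from a single seed URL, the robot reaches all four pages, including `c.html`, which is linked only from `a.html`. The `seen` set keeps it from looping on the links that point back to earlier pages.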


Database indexers

Databases provide the content storage for many sites, which dynamically create web pages around them, including e-commerce catalog sites, online news, and even entertainment sites.

Intranets often contain large amounts of text stored in databases as well.

Databases generally have their own search functions, which may appear to take the place of a full-text search engine.


Database indexers (Continued)

Work best locally
– Most use JDBC or ODBC
– Can index via the web

Easiest with straightforward tables
– Perform a join to build listings for indexing
– Problems with legacy systems
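
The "perform a join to build listings" step might look like the sketch below. It uses Python's built-in `sqlite3` module with an in-memory database purely as a stand-in for a JDBC or ODBC connection, and the `products` and `categories` tables are hypothetical.

```python
import sqlite3

def build_listings(conn):
    """Join product and category rows into one text listing per product."""
    rows = conn.execute(
        """
        SELECT p.id, p.name, p.description, c.name
        FROM products AS p
        JOIN categories AS c ON c.id = p.category_id
        ORDER BY p.id
        """
    ).fetchall()
    # Each listing is the text an indexer would treat as one "document".
    return {
        pid: f"{name} {description} {category}"
        for pid, name, description, category in rows
    }

# Hypothetical catalog schema and data, held in memory for the example.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE products (
        id INTEGER PRIMARY KEY, name TEXT, description TEXT, category_id INTEGER
    );
    INSERT INTO categories VALUES (1, 'books'), (2, 'music');
    INSERT INTO products VALUES
        (10, 'Web Search Basics', 'intro to search engines', 1),
        (11, 'Spider Songs', 'album about crawlers', 2);
    """
)
LISTINGS = build_listings(conn)
```

Each value in `LISTINGS` can then be fed to a full-text indexer, keyed by the product's primary key so that a search hit can be mapped back to its database row.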


Effective Site Search

• Index everything and keep it fresh
• Add synonym and spell checking
• Tweak relevance until it works for you
• Customize results pages
• Provide help for search failure
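
Synonym and spell checking can be approximated with very little code. The sketch below is one possible approach, not a recommendation from the slides: the `VOCABULARY` and `SYNONYMS` tables are hypothetical, and spelling suggestions rely on the standard library's `difflib.get_close_matches`, with a similarity cutoff chosen arbitrarily.

```python
import difflib

# Hypothetical vocabulary (terms present in the index) and synonym table.
VOCABULARY = ["search", "engine", "spider", "index", "query"]
SYNONYMS = {
    "spider": ["robot", "crawler"],
    "robot": ["spider"],
    "crawler": ["spider"],
}

def expand_query(terms):
    """Add synonyms so that 'crawler' also matches pages that say 'spider'."""
    expanded = []
    for term in terms:
        for candidate in [term] + SYNONYMS.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

def suggest_spelling(term, vocabulary=VOCABULARY):
    """Return the closest indexed term, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(term, vocabulary, n=1, cutoff=0.75)
    return matches[0] if matches else None
```

When a query returns no results, a site could call `suggest_spelling` on each term and offer a "Did you mean ...?" link, which is one simple form of help for search failure.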


Conclusion

Searching the World Wide Web successfully is the basis for many of our information tasks today. Search engines find the right information among a vast number of web pages, and they do so with minimal input from users, generally one or two keywords. A lot of work has been done to make search engines more efficient, but a substantial amount of work remains in order to keep pace with the expansion of the Web.