
Today’s Agenda

Exam post-mortem

Should I drop?

Search Engines: Details & Ramifications

Future:

– 2 Presentations every Thursday

– Project #2 is coming out next week (Website)

– Lab: Switching to JavaScript

Exam post-mortem

1. Communication substrates: Ethernet, USB, Wireless, Satellite, Infrared

Exam post-mortem

2. Content & Services: email (pop, imap, etc.), webpage (http), documents (ftp), peer-to-peer, chat, IM, etc., video, audio

Exam post-mortem

3. Two or more computers: a Network. Two or more networks: an Inter-network.

A sub-network is actually part of a network.

WAN Wide Area Network implies long distance.

An inter-network can be in the same room.

Exam post-mortem

6. The Internet was a military project (DARPANET) until 1969/70.

The Internet was a research project (ARPANET) throughout the 70’s (available only at National Labs and Universities).

Bulletin boards and email became available to the general public in the 80’s.

However, 1990 is a reasonable answer.

Exam post-mortem

7. In 1946 computers didn’t really exist yet. In 1956 people had yet to even envision connecting computers in different locations.

8. Web browsing via hypertext was going on in the late 80’s. By 1995 e-commerce was already happening.

“Emerged” means to come forth from obscurity.

Exam post-mortem

19. 200 million hosts in 2002

20. Did you notice that it multiplies by 4 every year? 800,000,000 is the best answer.

500 million to 1 billion was only -1

21. False. In 2010 will there be 100 billion people on the earth?
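The arithmetic behind answers 19 and 20 can be checked with a short sketch. It assumes only the figures stated above (200 million hosts in 2002, multiplying by 4 every year); the function name project_hosts is made up for illustration, and nobody is claiming the growth rate really holds that long.

```python
def project_hosts(year, base_year=2002, base_hosts=200_000_000, factor=4):
    """Project the host count, ASSUMING it multiplies by `factor` every year."""
    return base_hosts * factor ** (year - base_year)


# One year out gives the "best answer" to question 20;
# 2010 shows what happens if the same rate is carried forward.
for year in (2002, 2003, 2010):
    print(year, f"{project_hosts(year):,}")
```

One year of growth gives 800,000,000. Carrying the same rate out to 2010 gives roughly 13 trillion hosts, far more than there are people, which is a good hint that such a growth rate cannot continue.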

Exam post-mortem

23-28. 100 cable pro connections is better

– Costs less

– More bandwidth

The problem is that you’d have to manage 100 separate connections

Road Runner probably doesn’t have that many separate lines crossing one point.

Exam post-mortem

38. Perhaps I wasn’t clear but… Domain registration is a yearly cost. Internet access is required.

Was anyone thinking: At school or work internet access is free?

1. At school it’s not! Is your room free?

2. At work it’s not! Someone pays.

3. Netzero is free, right? Good luck.

Q: Should I drop?

A: No! Here’s why:

         Exam1   Average   Grade
         92      95.6      A
         91      95.2      A
         91      94.3      A
         90      93.6      A
Median   89      93.4      A
         87      92.8      A
         82      89.7      A-
         80      88.9      B+
         78      86.7      B

Today’s Agenda

Exam post-mortem

Should I drop?

Search Engines: How exactly they work.

Future:

– 2 Presentations every Thursday

– Project #2 is coming out next week (Website)

– Lab: Switching to JavaScript

Search Engines

Background

In the early 90’s, people still wore mullets, and finding info on the WWW was not easy.

Links were important.

Hubs & Authorities emerged.

Search Engines

Typical Academic Webpage

– Welcome to UCLA’s Neurosurgery Website

– Here are some academic publications

– Here are links to other Neurosurgery Websites

Typical Personal Webpage

– Hi my name is Rupert “I don’t have a life” McNerd

– Here are pictures of Heather Locklear

– Here are links to other people who have nerdy websites with pictures of Heather Locklear

Search Engines

Through

– word of mouth

– email

– message boards

Hubs & Authorities emerged.

A hub is a website that links you to other important websites.

– There are good hubs and bad hubs

An authority is any website that has information, data, etc.

– There are good and bad authorities

Some websites are both Authorities and Hubs.
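To make the two roles concrete, here is a minimal sketch that counts links on a tiny made-up web. The site names and the links dictionary are hypothetical, and plain link counting is only a crude stand-in for the idea: it says nothing about whether a hub or an authority is a good one.

```python
# Hypothetical link data: each page maps to the pages it links to.
links = {
    "hub.example": ["a.example", "b.example", "c.example"],
    "a.example": [],
    "b.example": ["a.example"],
    "c.example": ["a.example"],
}


def out_degree(links):
    """Number of links leaving each page (a rough 'hub' signal)."""
    return {page: len(targets) for page, targets in links.items()}


def in_degree(links):
    """Number of links pointing at each page (a rough 'authority' signal)."""
    counts = {page: 0 for page in links}
    for targets in links.values():
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts


print("links out:", out_degree(links))
print("links in: ", in_degree(links))
```

Here hub.example links out the most, and a.example is the page everyone points to.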

Search Engines

The Problem: People designed “homepages” with lots of links

– Not for the benefit of others, but

– to help themselves find stuff

The lists of links were not necessarily related, organized, or kept up to date.

As a result, it’s hard for ordinary people to find or identify good hubs.

Example

– http://alumni.umbc.edu/~efreem2/lynx.html

Search Engines

The Solution: Comprehensive Directories

– The first big one was Yahoo!

Yahoo! began as a student hobby in February 1994

– David Filo and Jerry Yang, Ph.D. candidates in Electrical Engineering at Stanford University

They started their guide to keep track of their personal interests on the Internet.

Eventually, their lists became too long and unwieldy, and they broke them out into categories.

When the categories became too full, they developed subcategories ...

Search Engines

Yahoo!

– an acronym for “Yet Another Hierarchical Officious Oracle”

Even though much of the process was automated, a lot of human care went into their directory.

Yahoo distinguished itself by using a combination of custom software and human care to make a well-organized and somewhat comprehensive directory of the WWW.

They had experts who would help organize categories.

They allowed people to submit web pages, locations, and descriptions.

Search Engines

Yahoo’s directory is stored in a database

ID         Title     URL               Description                                         Keywords                               Category       Sub-category
1          ESPN      www.espn.com      A comprehensive site with scores, stories, stats…   NBA, NFL, MLB, games, scores, players  Entertainment  Sports
2          Siena CS  www.cs.siena.edu  The computer science department at Siena…           CS, siena, computer…                   Education      Colleges
…
1,234,041
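A directory like this can be sketched as a list of records plus a keyword lookup. This is only an illustration under assumptions: the class name DirectoryEntry and the function search_directory are made up, and the truncated descriptions and keyword lists from the slide are kept truncated rather than filled in.

```python
from dataclasses import dataclass


@dataclass
class DirectoryEntry:
    id: int
    title: str
    url: str
    description: str
    keywords: list
    category: str
    subcategory: str


# The two rows shown on the slide; "..." marks text the slide itself cuts off.
directory = [
    DirectoryEntry(1, "ESPN", "www.espn.com",
                   "A comprehensive site with scores, stories, stats...",
                   ["NBA", "NFL", "MLB", "games", "scores", "players"],
                   "Entertainment", "Sports"),
    DirectoryEntry(2, "Siena CS", "www.cs.siena.edu",
                   "The computer science department at Siena...",
                   ["CS", "siena", "computer"],
                   "Education", "Colleges"),
]


def search_directory(entries, term):
    """Return entries whose title contains `term` or whose keywords include it."""
    term = term.lower()
    return [
        entry for entry in entries
        if term in entry.title.lower()
        or any(term == keyword.lower() for keyword in entry.keywords)
    ]


print([entry.url for entry in search_directory(directory, "nfl")])
print([entry.url for entry in search_directory(directory, "weather")])
```

Notice that the search only ever looks at what is in the directory. A term that no entry was filed under simply returns nothing.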

Search Engines

Yahoo! Directory is created using

1. User submission

2. Staff, consultants, etc.

3. Robots/Spiders (programs that fetch pages automatically and add them to the directory)
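What a robot/spider does can be sketched without touching the real network. The sketch below crawls a made-up, in-memory “web” (the FAKE_WEB dictionary and all site names are hypothetical) by following links breadth-first; a real spider would fetch pages over HTTP and parse out the links instead.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> list of URLs that page links to.
FAKE_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example", "d.example"],
    "d.example": [],
    "island.example": [],  # nobody links here
}


def crawl(start, fetch_links, limit=100):
    """Breadth-first crawl from `start`; returns URLs in the order visited."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue and len(visited) < limit:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return visited


found = crawl("a.example", lambda url: FAKE_WEB.get(url, []))
print(found)
```

The page island.example is never found because no other page links to it, which is exactly why getting other people to link to your site matters.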

Search Engines

Initially, Yahoo! was not a search engine. It was a directory.

While it was possible to search the directory using keywords, users were not searching the entire WWW.

Problem: If you were not in the directory, your site would not be found by a Yahoo search.

Search Engines

Ways to get your site noticed by Yahoo

1. Fill out an online site submission form

2. Get lots of people to link their pages to your page and hope that a Yahoo staff member or robot finds it.

3. Add lots of meta tags that are consistent with your site’s content.
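For reference, here is roughly what a robot sees when it reads meta tags. This is a minimal sketch using Python’s standard html.parser module; the sample page and the class name MetaTagReader are made up for illustration, and it is not a claim about how Yahoo’s robots actually worked.

```python
from html.parser import HTMLParser


class MetaTagReader(HTMLParser):
    """Collects <meta name="..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content


# Hypothetical page, reusing the Siena CS example from the directory.
page = """<html><head>
<title>Siena CS</title>
<meta name="description" content="The computer science department at Siena">
<meta name="keywords" content="CS, siena, computer">
</head><body></body></html>"""

reader = MetaTagReader()
reader.feed(page)
keywords = [k.strip() for k in reader.meta.get("keywords", "").split(",") if k.strip()]
print(reader.meta["description"])
print(keywords)
```

The robot can only trust these tags as far as the page author is honest, which is why they should match the site’s real content.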

Search Engines

Problem: Great websites pop up so quickly that Yahoo can’t find them all.

1. User submission (many people don’t submit their site)

2. Staff, consultants, etc. (you’d need an army)

3. Robots/Spiders (most effective way to build a directory)

Search Engines

Another Problem: Robots/Spiders aren’t good at automatically determining

– Description

– Keywords

– Category

– even the title

Web pages are often poorly composed, and downright misleading.

Search Engines

Example:

Click here to view your local weather

(Actually this will bring you to a porn site, and the makers of this web page get 0.5 cents every time some idiot clicks this link.)

your local weather, your local weather, weather channel, current temp, local weather, your local

weather, your local weather, your local weather, your local weather, your local weather, weather channel, current temp, local weather, your local weather, your local weather, your local weather, your local weather, your local weather, weather channel, current temp, local weather, your local weather, your local weather, your local weather, your local weather, your local weather, weather channel, current temp, weather, your local weather, your local weather, your local weather, your local weather, weather channel, current temp, local weather, your local weather, your local weather, your local weather, weather channel, current temp, local local weather, your local weather, your local weather, your local weather, your local weather, weather channel, current temp, local, your local weather, your local weather, your local weather,
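One crude way a robot could notice this kind of keyword stuffing is to measure how much of a page is taken up by its single most frequent word. This is only a sketch under assumptions: the function name top_word_share and both sample texts are made up, and real search engines use far more than this one signal.

```python
from collections import Counter
import re


def top_word_share(text):
    """Return (word, share) for the most frequent word in `text`."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return None, 0.0
    word, count = Counter(words).most_common(1)[0]
    return word, count / len(words)


# Made-up samples: one stuffed like the example above, one ordinary sentence.
stuffed = "your local weather, " * 20 + "weather channel, current temp"
normal = "Today will be sunny with a light breeze and a high near seventy degrees"

print(top_word_share(stuffed))
print(top_word_share(normal))
```

The stuffed text is dominated by a few repeated words, while ordinary prose spreads its words out much more evenly.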

Search Engines

Even though these robots/spiders do a poor job of analyzing information, search engines emerge with directories completely built from information gathered automatically.

As the WWW grows,

– directories become more automated to the point where there is little human care involved

– search engines compete to try to index the entire WWW

Search Engines

Quantity becomes more important than quality and the Search Engine is born.

– (see the history of search engines)

Q: What is the difference between a search engine and a searchable directory?

A: Nothing really.

– In fact, some search engines automatically generate a categorized directory from their index database.

If there is a difference… it’s the quality and correctness of the categories.

Search Engines

Recall the database behind Yahoo’s directory

ID         Title     URL               Description                                         Keywords                               Category       Sub-category
1          ESPN      www.espn.com      A comprehensive site with scores, stories, stats…   NBA, NFL, MLB, games, scores, players  Entertainment  Sports
2          Siena CS  www.cs.siena.edu  The computer science department at Siena…           CS, siena, computer…                   Education      Colleges
…
1,234,041

Search Engines

Recall that robots/spiders do NOT do a good job of determining

– Description

– Keywords

– Category

– even the title

Q: So what is actually stored in the database of a search engine?

Search Engines

All you can store is the raw content (i.e., the words)

ID         URL               Words
1          www.espn.com      Sports (35) NFL (42) ESPN (103) Scores (27) …
2          www.cs.siena.edu  Siena (11) Computer (15) Science (22) Breimer (7) …
…
1,234,041
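Here is a minimal sketch of how one such row could be produced from a page’s raw text. It reads the numbers in parentheses as word counts, which is an assumption; the page text below is made up, and word_counts is a hypothetical helper name.

```python
from collections import Counter
import re


def word_counts(text):
    """Count how often each word appears, ignoring case and punctuation."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


# Made-up page text standing in for what a robot fetched.
pages = {
    "www.cs.siena.edu": "Siena computer science. Computer science at Siena College.",
}

index = {url: word_counts(text) for url, text in pages.items()}
print(index["www.cs.siena.edu"].most_common(3))
```

No description, category, or title is needed here. The robot just records which words are on the page.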

Search Engines

How to make a search engine

1. Send robots out to collect websites

2. Build an index: URL → list of words

3. Remove stop words

4. Invert the index: Word → list of URL’s

5. Design some formula or methodology for ranking URL’s
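The five steps above can be sketched in a few lines. Everything specific here is an assumption made for illustration: the page text is made up (it stands in for what the robots collected in step 1), the stop-word list is tiny, and ranking by raw word count is just one simple choice of formula, not the method any real search engine uses.

```python
from collections import Counter, defaultdict
import re

# Assumed, deliberately tiny stop-word list.
STOP_WORDS = {"the", "a", "an", "and", "of", "at", "is", "to", "in"}

# Step 1 stand-in: pages a robot might have collected (made-up text).
pages = {
    "www.espn.com": "Scores and stats: NFL scores, NBA scores, and the latest sports news",
    "www.cs.siena.edu": "The computer science department at Siena: computer science courses",
}


def build_forward_index(pages):
    """Steps 2 and 3: URL -> word counts, with stop words removed."""
    forward = {}
    for url, text in pages.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        forward[url] = Counter(w for w in words if w not in STOP_WORDS)
    return forward


def invert(forward):
    """Step 4: word -> {URL: count}."""
    inverted = defaultdict(dict)
    for url, counts in forward.items():
        for word, count in counts.items():
            inverted[word][url] = count
    return inverted


def search(inverted, query):
    """Step 5: rank URLs by the total count of the query words they contain."""
    scores = Counter()
    for word in re.findall(r"[a-z0-9]+", query.lower()):
        for url, count in inverted.get(word, {}).items():
            scores[url] += count
    return [url for url, _ in scores.most_common()]


inverted = invert(build_forward_index(pages))
print(search(inverted, "computer science"))
print(search(inverted, "scores"))
```

The inverted index is what makes searching fast: a query looks up each word directly instead of scanning every page. It also shows why count-based ranking is easy to game by repeating a word many times.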

Search Engines

Despite the problems, search engines dramatically changed the WWW.

People had the notion that the WWW was itself a huge database of information that could be searched.

Most prominent Search Engines

– Altavista, Lycos, Infoseek, AskJeeves, Looksmart, Hotbot, Google.

To survive ($$$), search engines became advertising venues.

Search Engines

The Big Problem: Too Much Information.

Even if most of the information retrieved by a search engine was relevant to what the user wanted, users could easily get overwhelmed and give up if the first two or three hits were not appropriate.

Dealing with too much information was never really a problem in the past.

Search Engines

Designing an effective search engine was an information management dilemma that had never been seen before.

– The information was vastly distributed

– Not uniform or consistent

– Excessively redundant

– Massively large

Next Class

Two presentations
