www.sims.monash.edu.au

IMS5016/3616 Information Access

Lecture 4

Information Seeking – the Internet


What information on the Web?


Reading

Henninger, Maureen [2003]. The hidden web: finding quality information on the net. Sydney: UNSW Press. pp. 94-119

[available electronically through unit site]


Outline

• Less-structured documents
• What is being sought on the Web
• Word searching
• Search engines
• Why full-text searching is good
• Why full-text searching is poor
• Phrases and other proximities
• Ranking algorithms


Less-structured documents

• Databases are structured
• Relational databases are very highly structured
• Documents are structured
• Is the structure reflected in Web documents?
• Nothing is unstructured


Structure in Web documents

• Look at a source document:
– http://www.health.gov.au/internet/wcms/Publishing.nsf/Content/health-avian_influenza-index.htm

• What are the structural elements? [class discussion]


Structural elements of a Web page

• OK, I cheated – this is a particularly well-structured site, but it has:

– Words
– Paragraphs
– Lists
– Images
– Metadata
– etc. etc.
– Required HTML structural elements – Head, Body and Identifier [big structure]
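One rough way to see these elements is to count the tags in a page. The sketch below uses Python's standard `html.parser` module. The sample page and the `TagCounter` class are made up for illustration; this is not the health.gov.au source.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts opening tags so the page's structure can be summarised."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1


# A made-up page, not the real source document.
SAMPLE_PAGE = """<html>
<head><title>Avian influenza</title>
<meta name="description" content="Health advice"></head>
<body><h1>Avian influenza</h1>
<p>First paragraph.</p><p>Second paragraph.</p>
<ul><li>One</li><li>Two</li></ul>
<img src="bird.png" alt="A bird"></body>
</html>"""

counter = TagCounter()
counter.feed(SAMPLE_PAGE)
structure = dict(counter.tags)
print(structure)
```

The counts show the big structure (one head, one body) alongside the smaller elements (paragraphs, list items, images, metadata).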


What is being sought on the Web

• Text
• Images [static and moving]
• Sound/music
• Entertainment
• In general, Information?


Image searching

• Relies on complex algorithms. E.g.
– Google analyzes
> the text on the page adjacent to the image,
> the image caption and
> dozens of other factors to determine the image content.
– Google also uses sophisticated algorithms to
> remove duplicates and
> ensure that the highest quality images are presented first. [Google Help pages]


Music

• Somewhat problematic
– Identification
– Retrieval
– Quality
– Rights management
– Streaming audio


Entertainment

• Generally relies on knowing where to look, or
• At least having a starting point
• A process based as much on serendipity as anything else.


Text Information

• Data with structure
– There’s not all that much data on the Web

– There may not be that much information

– Lots of opinion and commentary

• Words are the way by which much information is sought, using Search Engines


Search Engines

• Three parts to the search engine:
– A robot/knowbot/harvester/agent piece of software, used to seek out and download web pages, usually by following links

– A database of web pages and the indexes whereby it can be searched

– A front-end application – the bit the user sees.


Searcher Robot

• A small application – won’t interfere with the operation of the site

• The site is typically back-loaded to the search engine database

• Links [to a certain level?] are followed to seek more and more sites

• More than one copy of the harvester working?
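The link-following behaviour can be sketched as a breadth-first walk with a depth limit. To keep the example self-contained it uses a hypothetical in-memory link graph (`LINKS`) instead of real web requests; the function name `harvest` and the page names are assumptions made for this example.

```python
from collections import deque

# Hypothetical link graph standing in for the Web: page -> pages it links to.
LINKS = {
    "home": ["about", "news"],
    "about": ["staff"],
    "news": ["home", "archive"],
    "staff": ["contact"],
    "archive": [],
    "contact": [],
}


def harvest(start, max_depth):
    """Breadth-first walk from start, following links up to max_depth hops."""
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        page, depth = queue.popleft()
        order.append(page)  # a real robot would download and store the page here
        if depth == max_depth:
            continue  # do not follow links beyond the chosen level
        for target in LINKS.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return order


print(harvest("home", 1))
print(harvest("home", 2))
```

With a limit of 2 the "contact" page is never reached, which is one reason pages buried deep in a site may be missing from a search engine's database.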


Size of a Search Engine

• This is really big. More than 10,000 servers
• Officially, Google says that it processes more than 150 million searches a day, but the true number is probably much higher.
• According to Nielsen/NetRatings, 67.6 million people worldwide visited Google an average of 6.2 times in December [2002].

• Analysts guess that [2002] revenue was between $60 million and $300 million.


Google Query Process

• Here’s the original paper by the founders of Google about how it all works:

http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm

• This site from Google shows their model of what goes on with a query:

http://www.google.com/press/query.html


Other Search Engines

• In general these rely on hits of a word on each page of a site.

• The ranking algorithm is the tricky bit [read: “trade secret”]

• The advantages and disadvantages of full-text searching have been known for ages


Full-text searching

• Every word [except stopwords?] in the document is indexed.

• Sometimes the word order position is recorded.

• The number of times the word occurs in the document is usually recorded.
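A minimal sketch of such an index, assuming simple whitespace tokenising and no stopword removal. The documents and the name `build_index` are made up for illustration; real engines use far more elaborate structures.

```python
from collections import defaultdict

# Made-up documents for illustration.
DOCS = {
    1: "the cat sat on the mat",
    2: "the dog sat on the cat",
}


def build_index(docs):
    """Map each word to {doc_id: [positions]}; the count is the list length."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(position)
    return index


index = build_index(DOCS)
print(index["cat"])          # positions of "cat" in each document
print(len(index["the"][1]))  # how many times "the" occurs in document 1
```

Storing positions costs more space than storing counts alone, but it is what makes phrase and proximity searching possible later on.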


Advantages of full-text indexing

• Generally the source text is electronic [so it’s easy to load to computer].

• Author has control of the terms that are used to identify her work.

• The index is cheap to build, as it is created by a computer, not people.

• Users usually search naively for “keywords” rather than controlled terms, so the index supports their behaviour


Disadvantages of full-text searching

• Lack of control
• Any added metadata is merely abused
• Poor contextual information
• English is especially rich/verbose/ambiguous


Metadata

• Structure
• Control [sometimes]
• Predictability
• Human intervention generally necessary
• Expensive
• Not yet standardised


Why not [just] use metadata

• Web sites [and some other electronic documents] are volatile – why index something that will be quickly gone?

• Adding metadata is expensive [it needs humans?]

• Metadata standards are either obscure [but specific] or too general to be really useful [Dublin Core?].


Boolean logic – a problem

• Boolean logic as an approach to searching is eminently well suited to digital [binary] machines but

• As humans in a pluralist environment we are accustomed to shades of grey, and the subtle shades of meaning that gives us.


George Boole and his Logic


Stopwords

• A list of words that don’t provide discrimination of one document from another.

• “a”, “and”, “of”, “the”, “or”, “that” etc.
• The list is often derived from the occurrences in the document body.
• The first 20 account for more than 40% of words in documents.
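Deriving a candidate list from word occurrences can be sketched as a frequency count. The text below is a made-up snippet and `top_word_share` is a hypothetical helper; on a sample this small some content words will rank highly, which is why real lists are derived from a large body of documents.

```python
from collections import Counter

# A made-up snippet; a real list would be derived from the whole collection.
TEXT = (
    "the cat and the dog sat on the mat and the cat saw that "
    "the dog of the house sat on the end of the mat"
)


def top_word_share(text, n):
    """Return the n most frequent words and the fraction of all tokens they cover."""
    words = text.lower().split()
    counts = Counter(words)
    top = counts.most_common(n)
    covered = sum(count for _, count in top)
    return [word for word, _ in top], covered / len(words)


candidates, share = top_word_share(TEXT, 3)
print(candidates, round(share, 2))
```

Even here the three most frequent words cover well over a third of the tokens, led by “the”.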


A Problem?

“To be or not to be”

A test for any search engine that excludes stopwords from its indices
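The problem is easy to demonstrate. The stopword list below is illustrative only (real engines use their own lists), and `index_terms` is a hypothetical helper.

```python
# An illustrative stopword list; real engines use their own lists.
STOPWORDS = {"a", "and", "be", "not", "of", "or", "that", "the", "to"}


def index_terms(query):
    """Return the query words that survive stopword removal."""
    return [word for word in query.lower().split() if word not in STOPWORDS]


print(index_terms("To be or not to be"))     # nothing left to search on
print(index_terms("The tragedy of Hamlet"))
```

Under this list every word of the famous line is discarded, so the engine has nothing to match against.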


Phrases and other proximities

• To search for a pair of adjacent words, the index must record both which documents contain each word and the word order within each document
• Adjacent words can then be retrieved
• What about adjacent words in different sentences?
• What about phrases containing stop words?
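A minimal sketch of adjacency searching, assuming an index that records word positions. The documents and the names `build_positions` and `adjacent` are made up for illustration.

```python
from collections import defaultdict

# Made-up documents for illustration.
DOCS = {
    1: "full text searching is cheap",
    2: "searching the full body of text",
    3: "text in full",
}


def build_positions(docs):
    """Map each word to {doc_id: [positions]} so word order is recoverable."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(position)
    return index


def adjacent(index, first, second):
    """Return ids of documents where `second` directly follows `first`."""
    hits = []
    first_docs = index.get(first, {})
    second_docs = index.get(second, {})
    for doc_id in sorted(first_docs.keys() & second_docs.keys()):
        following = set(second_docs[doc_id])
        if any(pos + 1 in following for pos in first_docs[doc_id]):
            hits.append(doc_id)
    return hits


positions = build_positions(DOCS)
print(adjacent(positions, "full", "text"))
```

All three documents contain both words, but only the first has them side by side. Note that this sketch counts positions straight through the text, so it would also match a pair that straddles a sentence boundary, and it would fail on any phrase whose words had been dropped as stopwords.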


Ranking


Ranking is the secret

The basics of how the search engines work are pretty similar.

The exact algorithms used to achieve ranking are proprietary and secret.

The ranking process is what gives each engine its advantage.


Ranking by “older” search engines

• Boolean “and” before Boolean “or” [before a non-Boolean “some”?]

• Word count
• Balanced or total word count
• Comparative word frequency
• Word proximity


Boolean “and” before Boolean “or”

Most full-text Web search tools will return items that contain all the “keyword” search words ranked ahead of items containing fewer of the search terms, finally reaching a point where items contain only one of the terms.

Other database search tools are much more rigid in adhering to “and” or “or” statements from users.
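The "all terms first, then fewer" behaviour can be sketched by scoring each document on how many distinct search terms it contains. The documents and the function name `rank_by_terms_matched` are made up; actual engines combine this with other factors.

```python
# Made-up documents for illustration.
DOCS = {
    "a": "avian influenza outbreak",
    "b": "influenza vaccine",
    "c": "avian influenza vaccine advice",
    "d": "garden birds",
}


def rank_by_terms_matched(docs, query):
    """Order documents by how many distinct query terms they contain."""
    terms = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        matched = len(terms & set(text.lower().split()))
        if matched:
            scored.append((matched, doc_id))
    # Most terms matched first; ties broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


ranking = rank_by_terms_matched(DOCS, "avian influenza vaccine")
print(ranking)
```

Document "c" satisfies the implied "and" and comes first; "a" and "b" match two of the three terms and follow; "d" matches nothing and is not returned.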


Word Count

• The number of hits for either each or all of the words may be used to rank the items.


Balanced or Total word count

• The ranking may be based on
– An even distribution of the search term occurrences

– A distribution that is to be expected, depending on word count frequencies for the search terms across the entire database

– The inverse of that [because the “unexpected” is unusual, or specific]

– The total count of occurrences of all search terms in the retrieved document.
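Two of these options can be contrasted in a short sketch: the total count of search terms, and a count weighted so that terms rare across the database score higher. The slides give no formula; the logarithmic weighting used here is one common reading (usually called inverse document frequency) and is an assumption, as are the documents and function names.

```python
import math
from collections import Counter

# Made-up documents for illustration.
DOCS = {
    "a": "flu flu flu advice",
    "b": "avian flu advice",
    "c": "flu advice advice",
}


def total_count(docs, terms):
    """Score = total occurrences of all search terms in each document."""
    return {
        doc_id: sum(Counter(text.split())[t] for t in terms)
        for doc_id, text in docs.items()
    }


def rarity_weighted(docs, terms):
    """Weight each occurrence by log(N / document frequency): rare terms count more."""
    n_docs = len(docs)
    doc_freq = {t: sum(t in text.split() for text in docs.values()) for t in terms}
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(text.split())
        scores[doc_id] = sum(
            counts[t] * math.log(n_docs / doc_freq[t]) for t in terms if doc_freq[t]
        )
    return scores


query = ["avian", "flu"]
totals = total_count(DOCS, query)
weighted = rarity_weighted(DOCS, query)
print(totals)
print(weighted)
```

By total count, document "a" wins because it repeats "flu". Once rarity is taken into account, "flu" is worthless as a discriminator (it is in every document) and "b", the only document mentioning "avian", moves to the top.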


Word proximity

Searchers may enter a pair of [or more] search terms because they are looking for a phrase, so the ranking is based on word proximity, highest ranking given to immediate adjacency [i.e. as a phrase], with subsequent documents ranked by the proximity of some or all the search terms.
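A minimal sketch of proximity ranking for a two-word query, assuming the distance is measured in words and ignoring word order. The documents and the names `min_gap` and `rank_by_proximity` are made up for illustration.

```python
# Made-up documents for illustration.
DOCS = {
    "a": "information access on the web",
    "b": "access to government information",
    "c": "information about new disabled access ramps",
}


def min_gap(text, first, second):
    """Smallest distance in words between any occurrence of the two terms, or None."""
    words = text.lower().split()
    first_pos = [i for i, w in enumerate(words) if w == first]
    second_pos = [i for i, w in enumerate(words) if w == second]
    if not first_pos or not second_pos:
        return None
    return min(abs(i - j) for i in first_pos for j in second_pos)


def rank_by_proximity(docs, first, second):
    """Documents with the terms closest together come first."""
    gaps = {doc_id: min_gap(text, first, second) for doc_id, text in docs.items()}
    found = {doc_id: gap for doc_id, gap in gaps.items() if gap is not None}
    return sorted(found, key=lambda doc_id: (found[doc_id], doc_id))


print(rank_by_proximity(DOCS, "information", "access"))
```

Document "a" contains the exact phrase (gap of 1) and ranks first; "b" and "c" follow in order of increasing gap.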


Ranking by Google

• Citation analysis by another name
• A popularity poll, with high ranked Google sites having more votes [more weighting] than others.

• Is this a positive feedback system? How does a new site get a high ranking?
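The "weighted popularity poll" idea can be sketched as a repeated vote-passing calculation over a link graph. This is a simplified, PageRank-style illustration and not Google's actual algorithm: the link graph, the function name `link_rank`, the damping value of 0.85 and the iteration count are all assumptions for the example.

```python
# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "hub": ["a", "b"],
    "a": ["hub"],
    "b": ["hub"],
    "new": ["hub"],
}


def link_rank(links, damping=0.85, iterations=50):
    """Simplified PageRank-style power iteration over a small link graph."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        fresh = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            if not targets:
                continue  # pages without outgoing links are ignored in this sketch
            # A page splits its (damped) score evenly among the pages it links to.
            share = damping * rank[page] / len(targets)
            for target in targets:
                fresh[target] += share
        rank = fresh
    return rank


ranks = link_rank(LINKS)
print(sorted(ranks, key=ranks.get, reverse=True))
```

The page "new" links out but nobody links to it, so it never rises above the baseline score. That is the positive feedback question in miniature: a new site needs links from already well-ranked pages before it can climb.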


Clustering

• Attempts to group retrieved items by some common elements.

• Linked to notions of how portals might work

• Could require AI techniques


Personal/Desktop searchers

• Google has released version 1.0 of their desktop searcher. Download from http://desktop.google.com/?promo=mp-gds-v1-1 [Is this ver 1.1 already?]

• Another candidate is Copernic. Download ver 1.2 from http://www.copernic.com/en/products/desktop-search/download.html

• Each is free, quite powerful, very useful.


A question?

• What does ranking, say beyond 300 sites [or 100, or 50?] mean to users? Google, for example, has an option to limit outcomes to 100 sites. Perhaps 100 sites is the number most users regard as their maximum search depth.


Summary

• Less-structured documents
• What is being sought on the Web
• Word searching
• Search engines
• Why full-text searching is good
• Why full-text searching is poor
• Phrases and other proximities
• Ranking algorithms
