www.sims.monash.edu.au

IMS5016/3616 Information Access

Lecture 4

Information Seeking – the Internet

What information on the Web?

Reading

Henninger, Maureen [2003]. The hidden web: finding quality information on the net. Sydney: UNSW Press. pp. 94-119

[available electronically through unit site]

Outline

• Less-structured documents
• What is being sought on the Web
• Word searching
• Search engines
• Why full-text searching is good
• Why full-text searching is poor
• Phrases and other proximities
• Ranking algorithms

Less-structured documents

• Databases are structured
• Relational databases are very highly structured
• Documents are structured
• Is the structure reflected in Web documents?
• Nothing is unstructured

Structure in Web documents

• Look at a source document:
– http://www.health.gov.au/internet/wcms/Publishing.nsf/Content/health-avian_influenza-index.htm

• What are the structural elements? [class discussion]

Structural elements of a Web page

• OK, I cheated – this is a particularly well-structured site, but it has:

– Words
– Paragraphs
– Lists
– Images
– Metadata
– etc. etc.
– Required HTML structural elements – Head, Body and Identifier [big structure]
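One way to see these structural elements is to let a parser count the tags in a page. The sketch below uses Python's standard `html.parser` module on a small made-up page; `SAMPLE_PAGE` and `StructureCounter` are illustrative names, and the page is not the health.gov.au source.

```python
from collections import Counter
from html.parser import HTMLParser


class StructureCounter(HTMLParser):
    """Counts the opening tags seen in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1


# Hypothetical sample page, invented for this illustration.
SAMPLE_PAGE = """<html>
<head><title>Avian influenza</title>
<meta name="description" content="Sample page"></head>
<body><h1>Overview</h1>
<p>First paragraph.</p><p>Second paragraph.</p>
<ul><li>One</li><li>Two</li></ul>
<img src="bird.png" alt="A bird">
</body></html>"""

counter = StructureCounter()
counter.feed(SAMPLE_PAGE)
for tag, count in sorted(counter.tags.items()):
    print(tag, count)
```

The counts show the big structure (head, body) alongside paragraphs, list items, images and metadata, all of which a search engine could in principle use.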

What is being sought on the Web

• Text
• Images [static and moving]
• Sound/music
• Entertainment
• In general, Information?

Image searching

• Relies on a complex algorithm. E.g.
– Google analyzes
> the text on the page adjacent to the image,
> the image caption and
> dozens of other factors to determine the image content.
– Google also uses sophisticated algorithms to
> remove duplicates and
> ensure that the highest quality images are presented first. [Google Help pages]

• Somewhat problematic

– Identification

– Retrieval

– Quality

– Rights management

– Streaming audio

Entertainment

• Generally relies on knowing where to look, or
• At least having a starting point
• A process based as much on serendipity as anything else.

Text Information

• Data with structure

– There’s not all that much data on the Web

– There may not be that much information

– Lots of opinion and commentary

• Words are the way by which much information is sought, using Search Engines

Search Engines

• Three parts to the search engine:

– A robot/knowbot/harvester/agent piece of software, used to seek out and download web pages, usually by following links

– A database of web pages and the indexes whereby it can be searched

– A front-end application – the bit the user sees.

Searcher Robot

• A small application – won’t interfere with the operation of the site

• The site is typically back-loaded to the search engine database

• Links [to a certain level?] are followed to seek more and more sites

• More than one copy of the harvester working?
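The link-following behaviour can be sketched as a breadth-first walk with a depth limit. This is a minimal illustration, not how any particular engine's robot works: `TOY_WEB` is a hypothetical in-memory link graph standing in for the Web, and `harvest` and `max_depth` are names invented here.

```python
from collections import deque

# Hypothetical link graph standing in for the Web: page -> pages it links to.
TOY_WEB = {
    "home": ["about", "news"],
    "about": ["home", "staff"],
    "news": ["story1", "story2"],
    "staff": [],
    "story1": ["archive"],
    "story2": [],
    "archive": [],
}


def harvest(start, max_depth):
    """Breadth-first crawl from `start`, following links up to `max_depth` hops."""
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        order.append(page)  # a real robot would download and store the page here
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in TOY_WEB.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order


print(harvest("home", 1))
print(harvest("home", 3))
```

With a limit of 1 only the pages linked directly from the start page are collected; raising the limit reaches progressively deeper pages. The `seen` set stops the robot revisiting a page that is linked from several places.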

Size of a Search Engine

• This is really big. More than 10,000 servers

• Officially, Google says that it processes more than 150 million searches a day, but the true number is probably much higher.

• According to Nielsen/NetRatings, 67.6 million people worldwide visited Google an average of 6.2 times in December. [2002]

• Analysts guess that [2002] revenue was between $60 million and $300 million.

Google Query Process

• Here’s the original paper by the founders of Google about how it all works:

http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm

• This site from Google shows their model of what goes on with a query:

http://www.google.com/press/query.html

Other Search Engines

• In general these rely on hits of a word on each page of a site.

• The ranking algorithm is the tricky bit [read: “trade secret”]

• The advantages and disadvantages of full-text searching have been known for ages

Full-text searching

• Every word [except stopwords?] in the document is indexed.

• Sometimes the word order position is recorded.

• The number of times the word occurs in the document is usually recorded.
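The three points above describe what is usually called an inverted index. The following is a minimal sketch under simple assumptions (whitespace tokenising, a hypothetical two-document collection `DOCS`, an optional stopword set); `build_index` is an illustrative name, not any engine's API.

```python
from collections import defaultdict


def build_index(docs, stopwords=frozenset()):
    """Map each word to {doc_id: [positions]}; skip any word in `stopwords`."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            if word in stopwords:
                continue
            index[word].setdefault(doc_id, []).append(position)
    return index


# Hypothetical two-document collection.
DOCS = {
    "d1": "the bird flu and the bird market",
    "d2": "flu season in the city",
}

index = build_index(DOCS, stopwords={"the", "and", "in"})
print(index["bird"])  # word positions of "bird" in each document
print({doc: len(pos) for doc, pos in index["flu"].items()})  # occurrence counts
```

Storing the list of positions gives both pieces of information at once: its contents record word order, and its length is the number of occurrences.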

Advantages of full-text indexing

• Generally the source text is electronic [so it’s easy to load to computer].

• Author has control of the terms that are used to identify her work.

• The index is cheap to build, as it is created by a computer, not people.

• Users usually search naively for “keywords” rather than controlled terms, so the index supports their behaviour

Disadvantages of full-text searching

• Lack of control
• Any added metadata is merely abused
• Poor contextual information
• English is especially rich/verbose/ambiguous

Metadata

• Structure
• Control [sometimes]
• Predictability
• Human intervention generally necessary
• Expensive
• Not yet standardised

Why not [just] use metadata

• Web sites [and some other electronic documents] are volatile – why index something that will be quickly gone?

• Adding metadata is expensive [it needs humans?]

• Metadata standards are either obscure [but specific] or too general to be really useful [Dublin Core?].

Boolean logic – a problem

• Boolean logic as an approach to searching is eminently well suited to digital [binary] machines but

• As humans in a pluralist environment we are accustomed to shades of grey, and the subtle shades of meaning that gives us.

George Boole and his Logic

Stopwords

• A list of words that don’t provide discrimination of one document from another.

• “a”, “and”, “of”, “the”, “or”, “that” etc.
• The list is often derived from the occurrences in the document body.
• The first 20 account for more than 40% of words in documents.
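Deriving a stopword list from word occurrences can be sketched as a simple frequency count. The corpus below is a tiny made-up one, so the share it reports says nothing about the 40% figure above; `derive_stopwords` and `CORPUS` are hypothetical names.

```python
from collections import Counter


def derive_stopwords(docs, top_n):
    """Return the `top_n` most frequent words and the share of all tokens they cover."""
    counts = Counter(word for text in docs for word in text.lower().split())
    top = counts.most_common(top_n)
    total = sum(counts.values())
    covered = sum(count for _, count in top)
    return [word for word, _ in top], covered / total


# Hypothetical three-document corpus.
CORPUS = [
    "the history of the web",
    "the size of the index",
    "the ranking of pages",
]

words, share = derive_stopwords(CORPUS, top_n=2)
print(words, round(share, 2))
```

On a real collection the most frequent words may include subject terms as well as function words, so a frequency-derived list usually needs checking by hand.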

A Problem?

“To be or not to be”

A test for any search engine that precludes stopwords from its indices
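A short sketch shows why the quotation is a hard case. The stopword set here extends the examples on the Stopwords slide with "to", "be" and "not"; that extension is an assumption made for illustration, and `index_terms` is a hypothetical helper.

```python
# Illustrative list: the slide's examples plus "to", "be", "not" (an assumption).
STOPWORDS = {"a", "and", "of", "the", "or", "that", "to", "be", "not"}


def index_terms(query, stopwords=STOPWORDS):
    """Return the query words that survive stopword removal."""
    return [word for word in query.lower().split() if word not in stopwords]


print(index_terms("To be or not to be"))
print(index_terms("the tragedy of Hamlet"))
```

Under this list the first query has no searchable terms left at all, while the second still keeps "tragedy" and "hamlet".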

Phrases and other proximities

• To search for a pair of adjacent words, the index must record both which documents contain the two words and the position of each word within those documents
• Adjacent words can then be retrieved
• What about adjacent words in different sentences?
• What about phrases containing stop words?
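Phrase matching from recorded word positions can be sketched as follows. The tokeniser deliberately drops punctuation, which is an assumption of this sketch and is what lets a "phrase" match across a sentence boundary; `tokenize`, `positions` and `contains_phrase` are illustrative names.

```python
import re


def tokenize(text):
    """Lowercase `text` and return its alphabetic words, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())


def positions(text):
    """Map each word to the list of positions where it occurs in `text`."""
    table = {}
    for position, word in enumerate(tokenize(text)):
        table.setdefault(word, []).append(position)
    return table


def contains_phrase(text, phrase):
    """True if the words of `phrase` occur consecutively and in order in `text`."""
    table = positions(text)
    words = tokenize(phrase)
    if not words:
        return False
    for start in table.get(words[0], []):
        if all(start + offset in table.get(word, [])
               for offset, word in enumerate(words)):
            return True
    return False


print(contains_phrase("Avian influenza outbreak in Asia", "avian influenza"))
print(contains_phrase("Influenza in avian species", "avian influenza"))
print(contains_phrase("Some birds are avian. Influenza is a virus.", "avian influenza"))
```

The third call returns a match even though the two words sit in different sentences, because position alone carries no sentence information. If stopwords were left out of the position table, a phrase containing one could not be matched this way either.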

Ranking

Ranking is the secret

The basics of how the search engines work are pretty similar.

The exact algorithms used to achieve ranking are proprietary and secret.

The ranking process is what gives each engine its advantage.

Ranking by “older” search engines

• Boolean “and” before Boolean “or” [before a non-Boolean “some”?]

• Word count
• Balanced or total word count
• Comparative word frequency
• Word proximity

Boolean “and” before Boolean “or”

Most full-text Web search tools will return items that contain all the “keyword” search words ranked ahead of items containing fewer of the search terms, finally reaching a point where items contain only one of the terms.

Other database search tools are much more rigid in adhering to “and” or “or” statements from users.
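The "all terms first, then fewer" ordering can be sketched by counting how many distinct query terms each document contains. This is one simple reading of the behaviour described, not any engine's actual rule; `DOCS` is a hypothetical collection and `rank_by_terms_matched` an illustrative name.

```python
def rank_by_terms_matched(docs, query):
    """Order documents by how many distinct query terms they contain, most first."""
    terms = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        matched = len(terms & set(text.lower().split()))
        if matched:
            scored.append((matched, doc_id))
    # Sort by matches descending, then by doc_id so the order is repeatable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, matched) for matched, doc_id in scored]


# Hypothetical collection.
DOCS = {
    "d1": "bird flu outbreak",
    "d2": "flu season",
    "d3": "bird watching guide",
    "d4": "stock market news",
    "d5": "flu outbreak report",
}

print(rank_by_terms_matched(DOCS, "bird flu outbreak"))
```

The document containing all three terms comes first (the Boolean "and" case), followed by those with two terms, then one; documents with none are dropped.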

Word Count

• The number of hits for either each or all of the words may be used to rank the items.

Balanced or Total word count

• The ranking may be based on

– An even distribution of the search term occurrences

– A distribution that is to be expected, depending on word count frequencies for the search terms across the entire database

– The inverse of that [because the “unexpected” is unusual, or specific]

– The total count of occurrences of all search terms in the retrieved document.
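Two of these options can be compared in a short sketch: a plain total count, and a count weighted so that terms which are rare across the collection matter more. The weighting used here, log(N / df), is the familiar inverse document frequency; it is offered as one possible reading of "the inverse of that", not as the formula any engine uses. `DOCS`, `total_count` and `rarity_weighted_count` are hypothetical names.

```python
import math
from collections import Counter


def total_count(text, terms):
    """Total occurrences of all search terms in one document."""
    counts = Counter(text.lower().split())
    return sum(counts[term] for term in terms)


def rarity_weighted_count(text, terms, docs):
    """Occurrences weighted by log(N / df), so rare terms count for more."""
    counts = Counter(text.lower().split())
    n_docs = len(docs)
    score = 0.0
    for term in terms:
        # df = number of documents in the collection that contain the term
        df = sum(1 for other in docs.values() if term in other.lower().split())
        if df:
            score += counts[term] * math.log(n_docs / df)
    return score


# Hypothetical collection: "flu" is in every document, "avian" in only one.
DOCS = {
    "d1": "flu flu flu news",
    "d2": "avian flu news",
    "d3": "flu news today",
}
TERMS = ["avian", "flu"]

for doc_id, text in DOCS.items():
    print(doc_id, total_count(text, TERMS), rarity_weighted_count(text, TERMS, DOCS))
```

By total count d1 ranks first because it repeats "flu"; with the rarity weighting d2 ranks first because it is the only document containing the unusual term "avian".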

Word proximity

Searchers may enter a pair of [or more] search terms because they are looking for a phrase, so the ranking is based on word proximity, highest ranking given to immediate adjacency [i.e. as a phrase], with subsequent documents ranked by the proximity of some or all the search terms.
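Proximity ranking for a two-word query can be sketched by finding the smallest gap between the two terms in each document. The sketch ignores word order (so "engine search" counts as adjacent too), which is a simplification; `DOCS`, `min_gap` and `rank_by_proximity` are illustrative names.

```python
def min_gap(text, first, second):
    """Smallest distance in word positions between `first` and `second`, or None."""
    words = text.lower().split()
    first_pos = [i for i, w in enumerate(words) if w == first]
    second_pos = [i for i, w in enumerate(words) if w == second]
    if not first_pos or not second_pos:
        return None
    return min(abs(a - b) for a in first_pos for b in second_pos)


def rank_by_proximity(docs, first, second):
    """Documents containing both terms, closest pair first."""
    gaps = {doc_id: min_gap(text, first, second) for doc_id, text in docs.items()}
    found = [(gap, doc_id) for doc_id, gap in gaps.items() if gap is not None]
    return [doc_id for gap, doc_id in sorted(found)]


# Hypothetical collection.
DOCS = {
    "d1": "search engine ranking explained",
    "d2": "the engine behind web search",
    "d3": "search the web with any engine",
    "d4": "engine repair manual",
}

print(rank_by_proximity(DOCS, "search", "engine"))
```

A gap of 1 means the terms are immediately adjacent, so the document containing the phrase ranks highest, and a document missing either term is not ranked at all.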

Ranking by Google

• Citation analysis by another name
• A popularity poll, with high ranked Google sites having more votes [more weighting] than others.

• Is this a positive feedback system? How does a new site get a high ranking?
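The "weighted votes" idea can be sketched with a simplified PageRank computed by repeated passes over a link graph. This is a normalised textbook variant with a damping factor of 0.85, not Google's production ranking, which remains proprietary; `LINKS` is a hypothetical four-page web and `pagerank` an illustrative function.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration over a dict page -> outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page splits its (damped) rank equally among the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no links spreads its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical web: three pages link to "hub"; nothing links to "new".
LINKS = {
    "hub": ["a", "b"],
    "a": ["hub"],
    "b": ["hub"],
    "new": ["hub"],
}

ranks = pagerank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

The page "new" links out but receives no links, so it stays at the baseline share however many rounds are run. That is the positive feedback question in miniature: a new site gains rank only when pages that already have rank link to it.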

Clustering

• Attempts to group retrieved items by some common elements.

• Linked to notions of how portals might work

• Could require AI techniques

Personal/Desktop searchers

• Google has released version 1.0 of their desktop searcher. Download from http://desktop.google.com/?promo=mp-gds-v1-1 [Is this ver 1.1 already?]

• Another candidate is Copernic. Download ver 1.2 from http://www.copernic.com/en/products/desktop-search/download.html

• Each is free, quite powerful, very useful.

A question?

• What does ranking, say beyond 300 sites [or 100, or 50?] mean to users? Google, for example, has an option to limit outcomes to 100 sites. Perhaps 100 sites is the number most users regard as their maximum search depth.

Summary

• Less-structured documents
• What is being sought on the Web
• Word searching
• Search engines
• Why full-text searching is good
• Why full-text searching is poor
• Phrases and other proximities
• Ranking algorithms
