LIS618 lecture 5 Thomas Krichel 2003-10-26


LIS618 lecture 5

Thomas Krichel

2003-10-26


Structure

• Theory on query languages

• Web information retrieval

– Google “theory”, see the essay by Brin and Page. It used to be at http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm

– Google query language, from Calishain and Dornfest's book "Google Hacks"

– Google Images, Groups, ODP


simple queries

• single-word queries

– one word only

– Hopefully some word combinations are understood as one word, e.g. on-line

• Context queries

– phrase queries (be aware of stop words)

– proximity queries, which generalize phrase queries

• Boolean queries
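
A minimal Python sketch of the two context queries, assuming whitespace tokenization and lower-casing (simplifications that are not from the slides; the function names are made up for illustration). It shows in what sense a proximity query generalizes a phrase query: a two-word phrase is a proximity query with a maximum distance of one word.

```python
def positions(tokens, term):
    """Return every index at which `term` occurs in the token list."""
    return [i for i, token in enumerate(tokens) if token == term]


def proximity_match(text, first, second, max_gap):
    """True if `second` occurs at most `max_gap` words after `first`."""
    tokens = text.lower().split()
    return any(
        0 < j - i <= max_gap
        for i in positions(tokens, first)
        for j in positions(tokens, second)
    )


def phrase_match(text, first, second):
    """A two-word phrase query is a proximity query with a gap of one."""
    return proximity_match(text, first, second, max_gap=1)
```

With this sketch, "digital library" matches as a phrase in "the digital library opened", while "the digital music library" only matches once the allowed gap is raised to two.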


simple pattern queries

• prefix queries (e.g. "anal" for analogy)

• suffix queries (e.g. "oral" for choral)

• substring (e.g. "al" for talk)

• ranges (e.g. from "held" to "hero")

• within a distance, usually Levenshtein distance (i.e. the minimum number of insertions, deletions, and replacements) of query term
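
The pattern queries above can be sketched in a few lines of Python. The vocabulary and the function names are made up for illustration; the distance function is the textbook dynamic-programming computation of Levenshtein distance, not any particular search engine's implementation.

```python
WORDS = ["analogy", "choral", "talk", "held", "help", "hero", "hex"]  # made-up vocabulary


def prefix_query(words, prefix):
    return [w for w in words if w.startswith(prefix)]


def suffix_query(words, suffix):
    return [w for w in words if w.endswith(suffix)]


def substring_query(words, part):
    return [w for w in words if part in w]


def range_query(words, low, high):
    # Lexicographic range, both ends included.
    return [w for w in words if low <= w <= high]


def levenshtein(a, b):
    """Minimum number of insertions, deletions and replacements turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # replace char_a by char_b (or keep it)
            ))
        previous = current
    return previous[-1]


def fuzzy_query(words, term, max_distance):
    return [w for w in words if levenshtein(w, term) <= max_distance]
```

For example, `range_query(WORDS, "held", "hero")` keeps "held", "help" and "hero" but not "hex", and `fuzzy_query(WORDS, "helo", 1)` finds the same three words because each is one replacement away from the misspelled term.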


regular expressions

• come from UNIX computing

• built from strings in which certain characters are metacharacters.

• example: "pro(blem|tein)s?" matches problem, problems, protein and proteins.

• example: "New .*y" matches "New Jersey", "New York City", and "New Delhy".

• great variety of dialects, usually very powerful.

• Extremely important in digital libraries.

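The two regular expression examples can be tried with Python's `re` module, which is one of the many dialects. This is a small sketch: the helper name and the candidate strings are chosen for illustration, and the first pattern is written with the alternation inside one group, `pro(blem|tein)s?`, because `|` otherwise splits the whole expression into two alternatives.

```python
import re

PRO = re.compile(r"pro(blem|tein)s?")  # "pro", then "blem" or "tein", then an optional "s"
NEW = re.compile(r"New .*y")           # "New ", then anything, ending in "y"


def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [c for c in candidates if pattern.fullmatch(c)]


PRO_HITS = full_matches(PRO, ["problem", "problems", "protein", "proteins", "program"])
NEW_HITS = full_matches(NEW, ["New Jersey", "New York City", "New Delhy", "New Delhi"])

# Without the shared group the alternatives are "problem" and "tein" or "teins".
UNGROUPED = re.compile(r"pro(blem)|(tein)s?")
```

Here `PRO_HITS` holds the four intended words and leaves out "program", `NEW_HITS` leaves out "New Delhi" because it does not end in "y", and `UNGROUPED` no longer matches "protein" as a whole.
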

structured queries

• make use of document structures

• simplest example is when the documents are database records: we can search for terms in a certain field only.

• if the field contents are sufficiently structured, the field can be interpreted as a value rather than as the words it contains. example: dates


query protocols

• There are some standard languages

– Z39.50 queries

– CCL, "common command language", is a development of Z39.50

– CD-RDx, "compact disk read only data exchange", is supported by US government agencies such as the CIA and NASA

– SFQL, "structured full text query language", built on SQL


web information retrieval

• We can think of the web as a pile of documents called pages.

• Some "pages" are hard to index

– PDF documents

– Pictures

– Sound files

• But a majority of pages are written in HTML

– easy to index

– have a loose structure


Google uses the structure of HTML

• Google finds the title of the page, i.e. the contents of the <title> element.

• Google analyzes headings and large font sizes and gives priority weight to terms found there.

• Most importantly, Google uses the link structure of the web to find important pages.


classic IR and the web

• In classic information retrieval, every document has the same importance. They differ as to their relevance to a query.

• In classic information retrieval, a document d is relevant if the query terms appear relatively more frequently in d than in other documents.

• If a web page contains the words "Bill Clinton sucks" and a picture, it is not a relevant hit for "Bill Clinton".
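
One common way to make the classic notion of relevance concrete is TF-IDF weighting. The slides do not name a formula, so the following Python sketch is an assumed illustration: the function name, the whitespace tokenization and the three sample documents are all made up.

```python
import math
from collections import Counter


def tf_idf_scores(documents, term):
    """Score each document for one query term.

    tf  = share of the document's words that are the term
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in documents]
    containing = sum(1 for tokens in tokenized if term in tokens)
    if containing == 0:
        return [0.0] * len(documents)
    idf = math.log(len(documents) / containing)
    return [
        (Counter(tokens)[term] / len(tokens) * idf) if tokens else 0.0
        for tokens in tokenized
    ]


DOCS = [
    "library catalog library hours",    # the term is frequent here
    "the city library opens at nine",   # the term occurs once
    "weather report for today",         # the term is absent
]
SCORES = tf_idf_scores(DOCS, "library")
```

The first document scores highest because the term makes up a larger share of its words. Every document is treated as equally important, which is the point of contrast with the link-based ranking discussed next.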


Google finds important pages

• The idea is that the documents on the web have different degrees of "importance".

• Google will show the most important pages first.

• The idea is that more important pages are likely to be more relevant to any query than non-important pages.


Google's monkey

• Imagine that the web has P pages. Each page has its own address (URL).

• Imagine a monkey who sits at a terminal. He follows links at random, but on rare occasions he gets bored and types in an address of a random page out of those P.

• Will the monkey visit all pages with equal probability?


PageRank

• The Google page rank of a page is the probability that Google's monkey will visit the page.

– The monkey will come frequently to pages that have a lot of links to them.

– Once he is there, he will likely go on to one of the pages that such an important page links to.

• The structure of all the links on the entire web reveals the importance of the page.


many PageRanks

• There is an infinite number of ways to calculate the page rank, depending on

– how likely the monkey is to get bored.

– the probability that the bored monkey visits each page.

• Potentially, there is a page rank for each user of the web. Google tries to observe users and may be associating personal page ranks with them.


Notation

• Assume that a monkey gets bored with probability d. If bored, it will visit page p with probability π_p.

• For any page p, let o_p be the number of outgoing links.

• Let l(p',p) be the number of links from page p' to page p.


Page rank

• The page rank for a page p is

r_p = π_p d + (1-d) ∑_p' l(p',p) r_p' / o_p'

• In words, it is the likelihood that the monkey gets bored and goes to page p, plus the likelihood that he is not bored and gets there by following a link from another page p'. The likelihood of getting there from p' is the likelihood of being at p', times the number of links from p' to p, divided by the number of outgoing links on p'.


example

• let there be a web of four pages A B C D

• A links to B.

• B links to C.

• C links to A and D.

• D links to A.

• Let the probability of getting bored be ¼, and let there be a ¼ chance of moving to each of the four pages when bored.


page ranks

• The following system calculates the ranks

r_A = ¼ · ¼ + ¾ (r_C / 2 + r_D)

r_B = ¼ · ¼ + ¾ r_A

r_C = ¼ · ¼ + ¾ r_B

r_D = ¼ · ¼ + ¾ r_C / 2

• Since such a system becomes very large for the whole web, Google uses an iterative approximation to calculate the ranks.
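
The iterative approximation can be sketched in Python for the four-page example. This is only an illustration in the lecture's notation (d is the probability of getting bored, π_p = ¼ for every page). The function name, the fixed number of iterations and the dictionary representation of the links are assumptions, not Google's implementation.

```python
def pagerank(links, d=0.25, iterations=200):
    """Repeatedly apply r_p = pi_p * d + (1 - d) * sum over p' of l(p', p) * r_p' / o_p'."""
    pages = sorted(links)
    pi = {p: 1.0 / len(pages) for p in pages}  # bored monkey picks any page equally
    rank = dict(pi)                            # start from the uniform distribution
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            inflow = sum(
                rank[q] * links[q].count(p) / len(links[q])  # l(q, p) * r_q / o_q
                for q in pages
                if links[q]  # pages without outgoing links pass nothing on
            )
            new_rank[p] = pi[p] * d + (1 - d) * inflow
        rank = new_rank
    return rank


# The example web: A links to B, B to C, C to A and D, D to A.
LINKS = {"A": ["B"], "B": ["C"], "C": ["A", "D"], "D": ["A"]}
RANKS = pagerank(LINKS)
```

For this web the ranks settle at r_A = 371/1292 ≈ 0.287, r_B = 359/1292 ≈ 0.278, r_C = 350/1292 ≈ 0.271 and r_D = 212/1292 ≈ 0.164, which sum to one and satisfy the four equations of the system.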


interfaces

• The simple interface has command-driven features that make it more advanced than the advanced interface.

• The advanced interface is a form interface to the query language available on the simple interface.

• There are extensive language settings

– preferences for finding pages in a certain language

– preferences for the language of the interface


query language I

• default Boolean AND between terms

• case insensitive

• terms can be ORed with "OR" or "|"

• adjacent terms have to be put in double quotes

• Boolean NOT can be expressed with -

• Example: "krichel -thomas"


query language II

• * is a wildcard for any word

• +stopword requires the presence of the stop word stopword. But the list of stop words has not been published.

• In fact, it varies from query to query.

• There is a limit of 10 words, but a * does not count towards the limit


query treatment

• Google prefers pages that have the search terms

– in close proximity

– in the same order as in the query

• Repeating a query term once adds weight to it

• repeating it twice has no further effect


special syntax I

• intitle: find in title only, "intitle:google"

• intext: find in text only. This will exclude occurrences of the search term in anchor or title data. "intext:html"

• inanchor: finds pages to which another page links using the anchor text given in the query. example: inanchor:"list of my courses" finds my courses page because my homepage links to it with that text.


special syntax II

• cache: pages that are in the Google cache, useful if a query result has nothing to do with the query terms. cache:openlib.org/home/krichel will show the cached version of the page.

• If you add further terms, they will be highlighted.


site: and inurl: special syntax

• inurl: find in URL only, "inurl:help"

– can use the * as a wildcard, as in inurl:"*.openlib.org"

• site: domain of page, "site:liu.edu"

– breaks down if a path is included

– cannot be used on its own, only with other query expressions


daterange: special syntax

• limits the search to pages indexed within a range of dates. When the crawler visits a page, a changed page is reindexed, an unchanged page is not.

• dates are expressed as Julian day numbers, i.e. the number of days since noon UTC on 1 January 4713 BC in the Julian calendar. Today is 2452939.

• example: daterange:2452640-2452939
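
Since few people know Julian day numbers by heart, a small Python sketch can do the conversion. The function names are made up for illustration; the constant is the offset between Python's `date.toordinal()` count and the Julian day number of the same calendar date.

```python
from datetime import date

# Python counts days from 0001-01-01 (ordinal 1) in the proleptic Gregorian
# calendar; adding this offset gives the Julian day number of that date.
ORDINAL_TO_JDN = 1721425


def julian_day_number(day):
    """Julian day number of a calendar date."""
    return day.toordinal() + ORDINAL_TO_JDN


def daterange_query(start, end):
    """Build a daterange: expression covering start to end."""
    return f"daterange:{julian_day_number(start)}-{julian_day_number(end)}"
```

For the lecture date, `julian_day_number(date(2003, 10, 26))` gives 2452939, and `daterange_query(date(2002, 12, 31), date(2003, 10, 26))` reproduces the example range 2452640-2452939.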


mixing special syntax expressions

• The link: syntax does not mix with others.

• Other bad ideas:

– "site:openlib.org -inurl:openlib"

– "site:edu site:com"

• Things that work well

– intitle:search

– intitle:biology inurl:help


Examples

• George Bush site:nytimes.com

• "Copyright * The New York Times" "George Bush"

• intitle:"directory * * trees"

• Botany intitle:"directory of" site:edu

• "powered by blogger" OR site:blogspot.com

• "classical music" (inurl:mailman | inurl:listserv)


phonebook: special syntax

• A location seems to be required, e.g. phonebook: long island university ny

• no

– wildcards

– exclusions

– OR

• there is also

– rphonebook: for residential

– bphonebook: for businesses


stocks on google

• stocks: ticker will look up a ticker symbol ticker at http://finance.yahoo.com

• you can find ticker symbols there

• ticker symbols are useful to find financial information about publicly traded companies.


google images

• it has the following special syntaxes

– intitle: searches for images on a page with a given title, "intitle:long island university"

– inurl: searches for images in pages that have a certain URL, inurl:liu.edu

– site: restricts the search to a certain site, should be combined with a search term like "site:liu.edu koenig"


Google interfaces to 3rd party data

• Google Groups is an interface to Usenet news.

• Google Directory is an interface to the Open Directory Project.

• In both cases Google is dependent on the quality of the underlying data source.


usenet news

• Usenet is a collection of user-submitted notes on various subjects that are posted to servers on a worldwide network. Each subject collection of posted notes is known as a newsgroup.

• A newsgroup is a discussion about a particular subject consisting of notes written to a networked site and distributed through Usenet.

• Newsgroups are hierarchical. Hierarchical levels are separated by dots, for example comp.text.tex

• alt stands for anarchists, lunatics and terrorists.


usenet history

• The idea of network news was born in 1979 when two graduate students, Tom Truscott and Jim Ellis, thought of using UUCP to connect machines for the purpose of information exchange among users. They set up a small network of three machines in North Carolina.

• UUCP is "UNIX to UNIX copy", a protocol that is used to copy files between machines running some flavor of UNIX, without the need for the IP protocol. Usenet is older than the Internet.


decline of usenet

• essentially open to all (peer-to-peer system)

• used by spammers for

– posting

– gathering addresses

• steady decline of quality of contributions

• steady decline of quantity of contributions


usenet worth checking out

• independent reviews of products, often written by experts.

• Example: interpretation of Beethoven sonatas by Wilhelm Kempff.

• Sorting by date reveals that the newsgroup rec.music.classical.recordings is still active. On a good day, you will find no finer guide to records.


special syntax for usenet

• group: limits the search to postings in a certain group

• title: limits to titles of postings

• author: searches for author name or email address

• Mixing syntaxes works well


the open directory project

• "The Open Directory Project is the largest, most comprehensive human-edited directory of the Web. It is constructed and maintained by a vast, global community of volunteer editors."

• They claim that there is a historic precedent in the Oxford English Dictionary.

• Formerly known as "GnuHoo", then "NewHoo", then acquired by Netscape, and called "dmoz".


dmoz.org

• dmoz is maintained by volunteer "net-citizens". No special qualifications are required, but the volunteers are claimed to be experts.

• There are about 30,000 volunteers (they claim).

• Powers the core directory services for the Web's largest and most popular search engines and portals

– Netscape Search

– AOL Search

– Google

– Lycos

– HotBot

– DirectHit

• Headquarters run by Netscape


http://openlib.org/home/krichel

Thank you for your attention!