ICS 215: Advances in Database Management System Technology
Spring 2004

Professor Chen Li
Information and Computer Science
University of California, Irvine


ICS215 Notes 01 2

Course Web Server

• URL: http://www.ics.uci.edu/~ics215/
  – All course info will be posted online

• Instructor: Chen Li
  – ICS 424B, [email protected]

• Course general info: http://www.ics.uci.edu/~ics215/geninfo.html

Topic today: Web Search

• How did earlier search engines work?

• How does Google work?

• Readings:
  – Lawrence and Giles, Searching the World Wide Web, Science, 1998.
  – Brin and Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, WWW7/Computer Networks 30(1-7): 107-117, 1998.

Earlier Search Engines

• Hotbot, Yahoo, Alta Vista, Northern Light, Excite, Infoseek, Lycos …

• Main technique: “inverted index”
  – Conceptually: use a matrix to represent how many times a term appears in one page
  – # of columns = # of pages (huge!)
  – # of rows = # of terms (also huge!)

              Page1  Page2  Page3  Page4  …
  ‘car’         1      0      1      0
  ‘toyota’      0      2      0      1
  ‘honda’       2      1      0      0

  (E.g., page 2 mentions ‘toyota’ twice.)
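As a rough illustration of the matrix idea, the sketch below builds a tiny inverted index in Python. The page texts are hypothetical, chosen only so that the counts match the matrix on this slide, and `build_inverted_index` is an illustrative name rather than anything from the lecture. Storing only the non-zero entries per term is one common way to avoid materializing the full matrix.

```python
from collections import defaultdict

# Hypothetical page texts, chosen so the counts match the slide's matrix.
pages = {
    "page1": "car honda honda",
    "page2": "toyota toyota honda",
    "page3": "car",
    "page4": "toyota",
}

def build_inverted_index(pages):
    """Map each term to {page: number of occurrences on that page}."""
    index = defaultdict(dict)
    for page, text in pages.items():
        for term in text.lower().split():
            index[term][page] = index[term].get(page, 0) + 1
    return dict(index)

index = build_inverted_index(pages)
print(index["toyota"])  # the non-zero entries of the 'toyota' row
```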

Search by Keywords

• If the query has one keyword, just return all the pages that have the word
  – E.g., “toyota” → all pages containing “toyota”: page2, page4, …
  – There could be many, many pages!
  – Solution: return the pages with the highest frequency of the word first
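A minimal sketch of this frequency-based ranking, assuming the postings for a term are stored as a page-to-count dictionary (the `postings` layout and the `search_one` name are assumptions made for illustration):

```python
# Postings for one term, taken from the 'toyota' row of the example matrix.
postings = {"toyota": {"page2": 2, "page4": 1}}

def search_one(postings, keyword):
    """Return pages containing the keyword, most frequent first."""
    hits = postings.get(keyword, {})
    # Sort by descending count; break ties by page name for stable output.
    return sorted(hits, key=lambda page: (-hits[page], page))

ranked = search_one(postings, "toyota")
print(ranked)
```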

Multi-keyword Search

• For each keyword W, find the set of all pages mentioning W

• Intersect all the sets of pages
  – Assuming an “AND” operation of those keywords

• Example:
  – A search “toyota honda” will return all the pages that mention both “toyota” and “honda”
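The intersection step could be sketched as follows. The postings reuse the counts from the earlier example matrix. Intersecting the smallest sets first is a common optimization and an assumption here, not something the slides prescribe.

```python
# Postings from the example matrix (term -> {page: count}).
postings = {
    "car":    {"page1": 1, "page3": 1},
    "toyota": {"page2": 2, "page4": 1},
    "honda":  {"page1": 2, "page2": 1},
}

def search_all(postings, query):
    """Return the pages that mention every keyword in the query (AND)."""
    keywords = query.split()
    if not keywords:
        return set()
    page_sets = [set(postings.get(word, {})) for word in keywords]
    page_sets.sort(key=len)          # start from the smallest set
    result = page_sets[0]
    for pages in page_sets[1:]:
        result &= pages
    return result

print(search_all(postings, "toyota honda"))
```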

Observations

• The “matrix” can be huge:
  – Now the Web has 4.2 billion pages!
  – There are many “terms” on the Web. Many of them are typos.
  – It’s not easy to do the computation efficiently: given a word, find all the pages…; intersect many sets of pages…

• For these reasons, search engines never store this “matrix” so naively.

Problems

• Spamming:
  – People want their pages to be ranked at the very top of a word search (e.g., “toyota”), so they repeat the word many, many times
  – These pages may be unimportant compared to www.toyota.com, even if the latter mentions “toyota” only once (or not at all).

• Search engines can be easily “fooled”

Closer look at the problems

• Lacking the concept of “importance” of each page on each topic

• E.g.: Our ICS215 class page is not as “important” as Yahoo’s main page.

• A link from Yahoo is more important than a link from our class page

• But, how to capture the importance of a page?
  – A guess: # of hits? → Where to get that info?
  – # of inlinks to a page → Google’s main idea.

Google’s History

• Started at Stanford DB Group as a research project (Brin and Page)

• Used to be at: google.stanford.edu

• Very soon many people started liking it

• Incorporated in 1998: www.google.com

• The “largest” search engine now

• Started other businesses: froogle, gmail, …

PageRank

• Intuition:
  – The importance of each page should be decided by what other pages “say” about this page
  – One naïve implementation: count the # of pages pointing to each page (i.e., # of inlinks)

• Problem:
  – We can easily fool this technique by generating many dummy pages that point to our class page

Details of PageRank

• At the beginning, each page has weight 1

• In each iteration, each page propagates its current weight W to all its N forward neighbors. Each of them gets weight: W/N

• Meanwhile, a page accumulates the weights from its backward neighbors

• Iterate until all weights converge. Usually 6-7 iterations are good enough.

• The final weight of each page is its importance.

• NOTICE: currently Google is using many other techniques/heuristics to do search. Here we just cover some of the initial ideas.
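The propagation rule can be sketched in a few lines of Python. This is only a sketch of the basic iteration described on this slide, not Google's production algorithm. The three-page graph and the `propagate` name are hypothetical, and every link target is assumed to appear as a key in `links`.

```python
def propagate(links, iterations=7):
    """links maps each page to the list of pages it points to."""
    weights = {page: 1.0 for page in links}          # every page starts at 1
    for _ in range(iterations):
        new_weights = {page: 0.0 for page in links}
        for page, targets in links.items():
            if not targets:
                continue                              # no forward neighbors
            share = weights[page] / len(targets)      # W / N
            for target in targets:
                new_weights[target] += share          # accumulate from backward neighbors
        weights = new_weights
    return weights

# Hypothetical three-page graph, not taken from the slides.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(propagate(links))
```

The default of 7 iterations follows the slide's rule of thumb; in practice one would stop when the weights change by less than some tolerance.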

Example: MiniWeb

• (Materials used by courtesy of Jeff Ullman)

• Our “MiniWeb” has only three web sites: Netscape, Amazon, and Microsoft.

• Their weights are represented as a vector

$$
\begin{pmatrix} n \\ m \\ a \end{pmatrix}_{\text{new}}
=
\begin{pmatrix} 1/2 & 0 & 1/2 \\ 0 & 0 & 1/2 \\ 1/2 & 1 & 0 \end{pmatrix}
\begin{pmatrix} n \\ m \\ a \end{pmatrix}_{\text{old}}
$$

[Figure: link graph of the three sites Ne, Am, and MS]

For instance, in each iteration, half of the weight of Am goes to Ne, and half goes to MS.

Iterative computation

[Figure: link graph of Ne, Am, and MS]

        start   iter 1   iter 2   iter 3   iter 4    …    final
  n       1       1       5/4      9/8      5/4      …     6/5
  m       1      1/2      3/4      1/2     11/16     …     3/5
  a       1      3/2       1      11/8     17/16     …     6/5

Final result:

• Netscape and Amazon have the same importance, and twice the importance of Microsoft.

• Does it capture the intuition? Yes.
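The iteration values on this slide can be re-derived with exact fractions. The sketch below assumes the MiniWeb matrix has rows and columns ordered n (Netscape), m (Microsoft), a (Amazon); the helper name `step` is illustrative.

```python
from fractions import Fraction as F

# Column-stochastic MiniWeb matrix; rows and columns are ordered n, m, a.
M = [
    [F(1, 2), F(0), F(1, 2)],
    [F(0),    F(0), F(1, 2)],
    [F(1, 2), F(1), F(0)],
]

def step(matrix, v):
    """One iteration: new = matrix * old."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in matrix]

v = [F(1), F(1), F(1)]
history = [v]
for _ in range(4):
    v = step(M, v)
    history.append(v)

for k, (n, m, a) in enumerate(history):
    print(f"iteration {k}: n={n} m={m} a={a}")
```

Applying `step` to (6/5, 3/5, 6/5) returns the same vector, which is what "final" means here.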

Observations

• We cannot get absolute weights:
  – We can only know (and we are only interested in) the relative weights of the pages

• The matrix is stochastic (sum of each column is 1). So the iterations converge, and compute the principal eigenvector of the following matrix equation:

$$
\begin{pmatrix} n \\ m \\ a \end{pmatrix}
=
\begin{pmatrix} 1/2 & 0 & 1/2 \\ 0 & 0 & 1/2 \\ 1/2 & 1 & 0 \end{pmatrix}
\begin{pmatrix} n \\ m \\ a \end{pmatrix}
$$
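As a worked check (a sketch, assuming the vector is ordered n, m, a and that the total weight stays at its initial value of 3, which holds because every column sums to 1), the matrix equation can be solved by hand:

$$
\begin{aligned}
n &= \tfrac{1}{2}n + \tfrac{1}{2}a \;\Rightarrow\; n = a, \\
m &= \tfrac{1}{2}a, \\
a &= \tfrac{1}{2}n + m \quad \text{(already implied by the first two rows)}, \\
n + m + a &= 3 \;\Rightarrow\; a + \tfrac{1}{2}a + a = 3 \;\Rightarrow\; a = \tfrac{6}{5},\; n = \tfrac{6}{5},\; m = \tfrac{3}{5}.
\end{aligned}
$$

Only the ratio n : m : a = 2 : 1 : 2 is determined by the equation itself; the scale comes from the starting weights.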

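Problem 1 of algorithm: dead ends

[Figure: link graph of Ne, Am, and MS, with no outgoing links from MS]

• MS does not point to anybody

• Result: weights of the Web “leak out”

$$
\begin{pmatrix} n \\ m \\ a \end{pmatrix}_{\text{new}}
=
\begin{pmatrix} 1/2 & 0 & 1/2 \\ 0 & 0 & 1/2 \\ 1/2 & 0 & 0 \end{pmatrix}
\begin{pmatrix} n \\ m \\ a \end{pmatrix}_{\text{old}}
$$

        start   iter 1   iter 2   iter 3   iter 4    …    final
  n       1       1       3/4      5/8      1/2      …      0
  m       1      1/2      1/4      1/4      3/16     …      0
  a       1      1/2      1/2      3/8      5/16     …      0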
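To see the leak numerically, the sketch below tracks the total weight over the first few iterations. It assumes the dead-end matrix is ordered n, m, a, with an all-zero column for MS; the names `M_dead` and `step` are illustrative.

```python
from fractions import Fraction as F

# Dead-end matrix: the MS column is all zeros, so MS passes its weight to nobody.
M_dead = [
    [F(1, 2), F(0), F(1, 2)],
    [F(0),    F(0), F(1, 2)],
    [F(1, 2), F(0), F(0)],
]

def step(matrix, v):
    """One iteration: new = matrix * old."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in matrix]

v = [F(1), F(1), F(1)]
totals = [sum(v)]
for _ in range(4):
    v = step(M_dead, v)
    totals.append(sum(v))

print([str(t) for t in totals])  # total weight shrinks every iteration
```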

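Problem 2 of algorithm: spider traps

[Figure: link graph of Ne, Am, and MS, with MS linking only to itself]

• MS only points to itself

• Result: all weights go to MS!

$$
\begin{pmatrix} n \\ m \\ a \end{pmatrix}_{\text{new}}
=
\begin{pmatrix} 1/2 & 0 & 1/2 \\ 0 & 1 & 1/2 \\ 1/2 & 0 & 0 \end{pmatrix}
\begin{pmatrix} n \\ m \\ a \end{pmatrix}_{\text{old}}
$$

        start   iter 1   iter 2   iter 3   iter 4    …    final
  n       1       1       3/4      5/8      1/2      …      0
  m       1      3/2      7/4       2      35/16     …      3
  a       1      1/2      1/2      3/8      5/16     …      0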

Google’s solution: “tax each page”

• Like people paying taxes, each page pays some weight into a public pool, which will be distributed to all pages.

• Example: assume 20% tax rate in the “spider trap” example.

$$
\begin{pmatrix} n \\ m \\ a \end{pmatrix}
= 0.8 \times
\begin{pmatrix} 1/2 & 0 & 1/2 \\ 0 & 1 & 1/2 \\ 1/2 & 0 & 0 \end{pmatrix}
\begin{pmatrix} n \\ m \\ a \end{pmatrix}
+
\begin{pmatrix} 0.2 \\ 0.2 \\ 0.2 \end{pmatrix}
$$

$$
\begin{pmatrix} n \\ m \\ a \end{pmatrix}
=
\begin{pmatrix} 7/11 \\ 21/11 \\ 5/11 \end{pmatrix}
$$
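The taxed iteration can be sketched as below, assuming the spider-trap matrix is ordered n, m, a and that the 20% pool is shared equally, so each page receives 0.2 per iteration. The names `taxed_step` and `TAX` are illustrative, and the iteration count of 60 is an arbitrary choice.

```python
from fractions import Fraction as F

# Spider-trap matrix: MS (middle column) links only to itself.
M_trap = [
    [F(1, 2), F(0), F(1, 2)],
    [F(0),    F(1), F(1, 2)],
    [F(1, 2), F(0), F(0)],
]
TAX = F(1, 5)  # 20% tax rate from the slide

def taxed_step(matrix, v, tax=TAX):
    """new = (1 - tax) * matrix * old + tax, applied to every page."""
    return [(1 - tax) * sum(row[j] * v[j] for j in range(len(v))) + tax
            for row in matrix]

v = [F(1), F(1), F(1)]
for _ in range(60):
    v = taxed_step(M_trap, v)

print([round(float(x), 4) for x in v])  # approaches 7/11, 21/11, 5/11
```

MS still ends up with the largest weight, but Netscape and Amazon no longer drop to zero.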


The War of Search Engines

• More companies are realizing the importance of search engines

• More competitors in the market: Microsoft, Yahoo!, etc.

Next: HITS / Web communities

• Readings:
  – Jon M. Kleinberg, Authoritative Sources in a Hyperlinked Environment, Journal of the ACM 46(5): 604-632, 1999.
  – Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins, Trawling the Web for emerging cyber-communities, WWW 1999.

Hubs and Authorities

• Motivation: find web pages related to a topic
  – E.g.: “find all web sites about automobiles”

• “Authority”: a page that offers info about a topic
  – E.g.: DBLP is a page about papers
  – E.g.: google.com, aj.com, teoma.com, lycos.com

• “Hub”: a page that doesn’t provide much info, but tells us where to find pages about a topic
  – E.g.: our ICS215 page linking to pages about papers
  – E.g.: www.searchenginewatch.com is a hub of search engines

Two values of a page

• Each page has a hub value and an authority value.
  – In PageRank, each page has one value: “weight”

• Two vectors:
  – H: hub values
  – A: authority values

$$
H = \begin{pmatrix} h_1 \\ h_2 \\ \vdots \end{pmatrix},
\qquad
A = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \end{pmatrix}
$$

HITS algorithm: find hubs and authorities

• First step: find pages related to the topic (e.g., “automobile”), and construct the corresponding “focused subgraph”
  – Find pages S containing the keyword (“automobile”)
  – Find all pages these S pages point to, i.e., their forward neighbors.
  – Find all pages that point to S pages, i.e., their backward neighbors
  – Compute the subgraph of these pages

[Figure: the root set inside the focused subgraph]
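The first step could be sketched as follows. The five-page link graph, the page texts, and the `focused_subgraph` name are all hypothetical. A real system would obtain the root set from a search engine's index rather than scanning texts.

```python
def focused_subgraph(links, texts, keyword):
    """Return (root set S, nodes of the focused subgraph, edges among them)."""
    root = {page for page, text in texts.items()
            if keyword in text.lower().split()}
    nodes = set(root)
    for page, targets in links.items():
        if page in root:
            nodes.update(targets)          # forward neighbors of S
        if root.intersection(targets):
            nodes.add(page)                # backward neighbors of S
    edges = {(src, dst) for src, targets in links.items() for dst in targets
             if src in nodes and dst in nodes}
    return root, nodes, edges

# Hypothetical toy data for illustration only.
links = {"p1": ["p2"], "p2": ["p3"], "p3": [], "p4": ["p1"], "p5": ["p3"]}
texts = {"p1": "automobile reviews", "p2": "dealer list", "p3": "weather",
         "p4": "useful links", "p5": "news"}

root, nodes, edges = focused_subgraph(links, texts, "automobile")
print(root, nodes, edges)
```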

Step 2: computing H and A

• Initially: set hub and authority to 1

• In each iteration, the hub score of a page is the total authority value of its forward neighbors (after normalization)

• The authority value of each page is the total hub value of its backward neighbors (after normalization)

• Iterate until convergence

[Figure: hubs pointing to authorities]
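A minimal sketch of this iteration, under two assumptions the slide leaves open: both vectors are updated from the previous iteration's values, and normalization scales the largest score to 1. The four-page graph and the function names are hypothetical.

```python
def normalize(scores):
    """Scale scores so that the largest value is 1 (one possible normalization)."""
    top = max(scores.values(), default=0.0)
    return {page: (s / top if top else 0.0) for page, s in scores.items()}

def hits(nodes, edges, iterations=20):
    hub = {page: 1.0 for page in nodes}
    auth = {page: 1.0 for page in nodes}
    for _ in range(iterations):
        new_hub = {page: 0.0 for page in nodes}
        new_auth = {page: 0.0 for page in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]    # authority of a forward neighbor
            new_auth[dst] += hub[src]    # hub value of a backward neighbor
        hub, auth = normalize(new_hub), normalize(new_auth)
    return hub, auth

# Hypothetical graph: h1 links to a1 and a2, h2 links to a1 only.
nodes = {"h1", "h2", "a1", "a2"}
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
hub, auth = hits(nodes, edges)
print(hub, auth)
```

In this toy graph h1 ends up as the better hub and a1 as the better authority, since a1 is pointed to by both hubs.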

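Example: MiniWeb

[Figure: link graph of Ne, Am, and MS]

$$
M = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix},
\qquad
H = \begin{pmatrix} h_n \\ h_m \\ h_a \end{pmatrix},
\qquad
A = \begin{pmatrix} a_n \\ a_m \\ a_a \end{pmatrix}
$$

$$
H_{\text{new}} = M \, A_{\text{old}},
\qquad
A_{\text{new}} = M^{T} H_{\text{old}}
$$

Therefore:

$$
H_{\text{new}} = M M^{T} H_{\text{old}},
\qquad
A_{\text{new}} = M^{T} M \, A_{\text{old}}
$$

Normalization!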

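Example: MiniWeb

[Figure: link graph of Ne, Am, and MS]

$$
M = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix},
\qquad
M^{T} = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}
$$

$$
M M^{T} = \begin{pmatrix} 3 & 1 & 2 \\ 1 & 1 & 0 \\ 2 & 0 & 2 \end{pmatrix},
\qquad
M^{T} M = \begin{pmatrix} 2 & 2 & 1 \\ 2 & 2 & 1 \\ 1 & 1 & 2 \end{pmatrix}
$$

  A       start   iter 1   iter 2   iter 3    …    limit
  a_n       1       5        24      114      …    1+√3
  a_m       1       5        24      114      …    1+√3
  a_a       1       4        18       84      …     2

  H       start   iter 1   iter 2   iter 3    …    limit
  h_n       1       6        28      132      …    2+√3
  h_m       1       2         8       36      …     1
  h_a       1       4        20       96      …    1+√3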
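The products and the first few iterations on this slide can be re-checked with plain Python lists. The sketch assumes M is ordered n, m, a and that the tabulated values are the unnormalized iterates; the helper names are illustrative.

```python
M = [
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(m, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]

MT = transpose(M)
MMT = matmul(M, MT)   # drives the hub vector H
MTM = matmul(MT, M)   # drives the authority vector A

A = [1, 1, 1]
H = [1, 1, 1]
for k in range(1, 4):
    A = matvec(MTM, A)
    H = matvec(MMT, H)
    print(f"iteration {k}: A={A} H={H}")
```

After three iterations the ratios are already close to the limits: 114/84 is about 1.357 against (1+√3)/2 ≈ 1.366.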

Trawling: finding online communities

• Motivation: find groups of individuals who share a common interest, together with the Web pages most popular among them (similar to “hubs”)

• Examples:
  – Web pages of NBA fans
  – Community of Turkish student organizations in the US
  – Fans of movie star Jack Lemmon

• Applications:
  – Provide valuable and timely info for interested people
  – Represent the sociology of the web
  – Target advertising

How: analyzing web structure

• These pages often do not reference each other
  – Competitions
  – Different view points

• Main idea: “co-citations”
  – Often these pages share a large number of pages
  – Example: the following two web sites share many pages: http://kcm.co.kr/English/ and www.cyberkorean.com/church

Bipartite subgraphs

• Bipartite graphs: two sets of nodes, F and C

• Dense bipartite graph: there are “enough” edges between F and C

• Complete bipartite graph: there is an edge between each node in F and each node in C

• (i,j)-Core: a complete bipartite graph with at least i nodes in F and j nodes in C

• (i,j)-Core is a good signature for finding online communities

• Usually i and j are between 3 and 9

[Figure: bipartite graph with F (“Fans”) pointing to C (“Centers”)]
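As a small illustration of the definition, the sketch below tests whether a given set of fans and centers forms an (i,j)-core. The link data and the `is_core` name are hypothetical.

```python
def is_core(links, fans, centers, i, j):
    """True if every fan links to every center, with >= i fans and >= j centers."""
    if len(fans) < i or len(centers) < j:
        return False
    return all(c in links.get(f, set()) for f in fans for c in centers)

# Hypothetical links: three fans that all point to the same three centers.
links = {
    "f1": {"c1", "c2", "c3"},
    "f2": {"c1", "c2", "c3"},
    "f3": {"c1", "c2", "c3", "x"},
    "f4": {"c1"},
}

fans, centers = {"f1", "f2", "f3"}, {"c1", "c2", "c3"}
print(is_core(links, fans, centers, 3, 3))
```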

“Trawling”: finding cores

• Find all (i,j)-cores in the Web graph.
  – In particular: find “fans” (or “hubs”) in the graph
  – “centers” = “authorities”
  – Challenge: Web is huge. How to find cores efficiently? Experiments: 200M pages, 1 TB data

• Main idea: pruning

• Step 1: using out-degrees
  – Rule: each fan must point to at least 6 different websites
  – Pruning results: 12% of all pages (= 24M pages) are potential fans
  – Retain only links, and ignore page contents

Step 2: eliminate mirroring pages

• Many pages are mirrors (exactly the same page)

• They can produce many spurious fans

• Use a “shingling” method to identify and eliminate duplicates

• Results:
  – 60% of 24M potential-fan pages are removed
  – # of potential centers is 30 times the # of potential fans
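The slides do not say how shingling is applied, so the following is only a generic sketch of the idea: represent each page by its set of w-word windows ("shingles") and compare the sets with the Jaccard ratio. The window size `w=4` and the function names are assumptions. The trawling paper may shingle different data, such as link sequences.

```python
def shingles(text, w=4):
    """Return the set of all w-word windows in the text."""
    words = text.split()
    if len(words) < w:
        return {tuple(words)}
    return {tuple(words[k:k + w]) for k in range(len(words) - w + 1)}

def resemblance(text_a, text_b, w=4):
    """Jaccard ratio of the two shingle sets (1.0 means identical sets)."""
    sa, sb = shingles(text_a, w), shingles(text_b, w)
    return len(sa & sb) / len(sa | sb)

# Hypothetical page texts for illustration.
page = "links to the best turkish student organizations in the us"
mirror = "links to the best turkish student organizations in the us"
other = "weather forecast for irvine california this coming weekend and beyond"

print(resemblance(page, mirror), resemblance(page, other))
```

Pages whose resemblance exceeds some chosen threshold would be treated as duplicates and only one copy kept.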

Step 3: using in-degrees of pages

• Delete highly referenced pages, e.g., yahoo, altavista

• Reason: they are referenced for many reasons, and are not likely to form an emerging community

• Formally: remove all pages with more than k inlinks (k = 50, for instance)

• Results:
  – 60M pages pointing to 20M pages
  – 2M potential fans

Step 4: iterative pruning

• To find (i,j)-cores
  – Remove all potential fans whose # of out-links is < j
  – Remove all potential centers whose # of in-links is < i
  – Do it iteratively
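A sketch of the iterative pruning loop. With i fans and j centers, as in the (i,j)-core definition on the bipartite-subgraphs slide, a fan needs at least j out-links and a center at least i in-links, and the sketch uses those thresholds. The edge data and the `prune` name are hypothetical.

```python
def prune(edges, i, j):
    """edges: set of (fan, center) links. Return the links that survive pruning."""
    edges = set(edges)
    while True:
        out_deg, in_deg = {}, {}
        for fan, center in edges:
            out_deg[fan] = out_deg.get(fan, 0) + 1
            in_deg[center] = in_deg.get(center, 0) + 1
        kept = {(fan, center) for fan, center in edges
                if out_deg[fan] >= j and in_deg[center] >= i}
        if kept == edges:          # nothing more to remove
            return kept
        edges = kept               # removals may expose new candidates

# Hypothetical graph: a complete 3x3 block plus two stray links.
block = {(f, c) for f in ("f1", "f2", "f3") for c in ("c1", "c2", "c3")}
edges = block | {("f4", "c1"), ("f1", "c9")}

print(sorted(prune(edges, 3, 3)))
```

The loop matters because removing one page lowers the degrees of its neighbors, which can push them below the threshold in the next round.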

Step 5: inclusion-exclusion pruning

• Idea: in each step, we
  – Either “include” a community
  – Or we “exclude” a page from further contention

• Check a page x with out-degree j. x is a fan of an (i,j)-core if:
  – There are i-1 other fans that point to all the forward neighbors of x
  – This step can be checked easily using the index on fans and centers

• Result: for (3,3)-cores, 5M pages remained

• Final step:
  – Since the graph is much smaller, we can afford to “enumerate” the remaining cores

• Result:
  – (3,3)-cores: about 75 KB
  – High-quality communities
  – Check a few in the paper by yourself
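The fan check in Step 5 could be sketched as below, assuming the "index on fans and centers" is a pair of dictionaries: `links` (page to the set of pages it points to) and `inlinks` (page to the set of pages pointing to it). Both the data and the names `invert` and `check_fan` are hypothetical.

```python
def invert(links):
    """Build the in-link index from the out-link index."""
    inlinks = {}
    for page, targets in links.items():
        for target in targets:
            inlinks.setdefault(target, set()).add(page)
    return inlinks

def check_fan(links, inlinks, x, i, j):
    """Return (fans, centers) if x is a fan of an (i,j)-core, else None."""
    centers = links.get(x, set())
    if len(centers) != j or not centers:
        return None
    # Pages that point to every forward neighbor of x (x itself included).
    common = set.intersection(*(inlinks[c] for c in centers))
    others = common - {x}
    if len(others) >= i - 1:
        return common | {x}, set(centers)   # "include": report the core
    return None                             # "exclude" x from further contention

links = {
    "f1": {"c1", "c2", "c3"},
    "f2": {"c1", "c2", "c3"},
    "f3": {"c1", "c2", "c3"},
    "f4": {"c1", "c2", "c4"},
}
inlinks = invert(links)

print(check_fan(links, inlinks, "f1", 3, 3))
print(check_fan(links, inlinks, "f4", 3, 3))
```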