100
Computer Science in the Information Age John Hopcroft Cornell University Ithaca, New York

Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Computer Science in the Information Age

John HopcroftCornell UniversityIthaca, New York

Page 2: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Time of change

There is a fundamental revolution taking place that is changing all aspects of our lives.

Those individuals who recognize this and position themselves for the future will benefit enormously.

Page 3: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Early years ofComputer Science

Programming languagesCompilersOperating systemsNetwork protocolsAlgorithmsComputability

Page 4: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Computer Science is changing

Structure of large networksLarge data setsInformationSearch

Page 5: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Drivers of change

Computers are becoming ubiquitousSpeed sufficient for word processing, email, chat and spreadsheetsMerging of computing and communicationsData available in digital formDevices being networked

Page 6: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Computer Science departments are beginning to develop courses that cover the underlying theory

Random graphsPhase transitionsGiant componentsSpectral analysisSmall world phenomenaGrown graphs

Page 7: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

What is the theory needed to support the future?

Large amounts of dataNoisy data with outliersHigh dimensionalPossibly power law distributed

Page 8: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Internet queriesToday

Autos

Graph theory

Colleges, universities

Computer science

Page 9: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Internet queries are changingToday

AutosGraph theory

Colleges, universitiesComputer science

TomorrowWhich car should I buy?Construct an annotated bibliography on graph theoryWhere should I go to college?How did the field of CS develop?

Page 10: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

What car should I buy?

List of makesCostReliabilityFuel economyCrash safety

Pertinent articlesConsumer reportsCar and driver

Page 11: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Where should I go to college?List of factors that might influence choice

CostGeographic locationSizeType of institution

MetricsRanking of programsStudent faculty ratiosGraduates from your high school/neighborhood

Page 12: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

How did field develop?

From ISI database create set of important papers in field?Extract important authors

author of key paperauthor of several important papersthesis advisor of Ph.D.’s student(s) who is (are) important author(s)

Extract key institutionsinstitution of important author(s)Ph.D. granting institution of important authors

OutputDirected graph of key papersFlow of Ph.D.’s geographically with timeFlow of ideas

Page 13: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

poral Cluster Histograms: NIPS Results

12: chip, circuit, analog, voltage, vlsi11: kernel, margin, svm, vc, xi10: bayesian, mixture, posterior, likelihood

em9: spike, spikes, firing, neuron, neurons8: neurons, neuron, synaptic, memory,

firing7: david, michael, john, richard, chair6: policy, reinforcement, action, state,

agent5: visual, eye, cells, motion, orientation4: units, node, training, nodes, tree3: code, codes, decoding, message, hints2: image, images, object, face, video1: recurrent hidden training units error

NIPS k-means clusters (k=13)

Page 14: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Publishing

Researcher makes discovery and writes technical paperSubmits paper to journalJournal sends it out for refereeingRevised article is copy edited and appears in print about two years laterResults are available world wide through major research libraries

Page 15: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

The future of publishing

Advice to young faculty forced journals to let Google search them and ultimately to allow authors to post their article on their website.

What other changes are in store?

Page 16: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Wikipedia

905,707 articlesRecent comparison showed Wikipedia almost as accurate as Encyclopedia BritannicaMajor text source for formulas, definitions and proofs

Page 17: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Cayley's formula From Wikipedia, the free encyclopedia.

Jump to: navigation, search

In mathematics, Cayley's formula is a result in graph theory. It states that if n is a positive integer, the number of trees on n labeled vertices is

It is a particular case of Kirchhoff's theorem.

[edit]

Proof of the formula

Let Tn be the set of trees possible on the vertex set . We seek to show that | Tn | = nn − 2.

We begin by proving a lemma:

Claim: Let be positive integers such that . Let be the set of trees on the vertex set such that the degree of vi (denoted d(vi)) is di for . Then

Proof: We go by induction on n. For n = 1 and n = 2 the proposition holds trivially and is easy to verify. We move to the inductive step. Assume n > 2 and that the claim holds for degree sequences on n − 1 vertices. Since the di are all positive but their sum is less than 2n,

such that dk = 1. Assume without loss of generality that dn = 1.

For let be the set of trees on the vertex set such that:

Page 18: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Fed Ex package tracking

Page 19: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction
Page 20: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction
Page 21: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction
Page 22: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction
Page 23: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Theory to supportnew directions

Large graphsSpectral analysisHigh dimensions and dimension reductionClusteringCollaborative filteringExtracting signal from noise

Page 24: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Graph Theory of the 50’s

Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction.

Page 25: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Theory of Large Graphs

Large graphsBillion verticesExact edges present not critical

Theoretical basis for study of large graphsMaybe theory of graph generationInvariant to small changes in definitionMust be able to prove basic theorems

Page 26: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Erdös-Renyin verticeseach of n2 potential edges is present with independent probability

Nn

pn (1-p)N-n

vertex degreebinomial degree distribution

numberof

vertices

Page 27: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction
Page 28: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Generative models for graphs

Vertices and edges added at each unit of timeRule to determine where to place edges

Uniform probabilityPreferential attachment - gives rise to power law degree distributions

Page 29: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Number

of

vertices

Preferential attachment gives rise to the power law degree distribution common in many graphs

Vertex degree

Page 30: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Protein interactions

2730 proteins in data base

3602 interactions between proteins

SIZE OF COMPONENT

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 … 1851

NUMBER OF COMPONENTS

48 179 50 25 14 6 4 6 1 1 1 0 0 0 0 1 1

Science 1999 July 30; 285:751-753

Page 31: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Giant Component

1.Create n isolated vertices

2.Add Edges randomly one by one

3.Compute number of connected components

Page 32: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Giant Component1

1000

1 2

998 1

1 2 3 4 5 6 7 8 9 10 11

1 111359142889548

Page 33: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Giant Component1 2 3 4 5 6 7 8 9 10 11

1 111359142889548

1 2 3 4 5 6 7 8367 70 24 12 9 3 2 2

9 10 12 13 14 20 55 1012 2 1 2 2 1 1 1

1 2 3 4 5 6 7 8 9 11 5141 11126361339252

Page 34: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Tubes

Tendrils

TendrilsIn44 million

Out44 million

SCC56 million nodes

Disconnected components

Source: almaden.ibm.com

Page 35: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Phase transitionsG(n,p)

Emergence of cycleGiant componentConnected graph

N(p)Emergence of arithmetic sequence

CNF satisfiabilityFix number of variables, increase number of clauses

Page 36: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Access to InformationSMART Technology aardvark 0

abacus 0...antitrust 42...CEO 17...microsoft 61...windows 14wine 0wing 0winner 3winter 0...zoo 0zoology 0Zurich 0

document

Page 37: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Locating relevant documentsQuery: Where can I get information on

gates?2,060,000 hits

Bill Gates 593,000Gates county 177,000baby gates 170,000gates of heaven 169,000automatic gates 83,000fences and gates 43,000Boolean gates 19,000

Page 38: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Clustering documents

cluster documentsrefine cluster

microsoftwindowsantitrust

Booleangates

GatesCounty

automaticgates

gates

Bill Gates

Page 39: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Refinement of another type: books

mathem

atics

physics

chemistr

y

astronomy

children’s books

textbooks

reference books

general population

Page 40: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

High Dimensions

Intuition from two and three dimensions not valid for high dimension

Volume of cube is one in all dimensions

Volume of sphere goes to zero

Page 41: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

1

Unit sphere

Unit square

2 2 11 1 0.7072 2 2

⎛ ⎞ ⎛ ⎞+ = =⎜ ⎟ ⎜ ⎟⎝ ⎠ ⎝ ⎠

2 Dimensions

Page 42: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

2 2 2 21 1 1 1 12 2 2 2

⎛ ⎞ ⎛ ⎞ ⎛ ⎞ ⎛ ⎞+ + + =⎜ ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ ⎟⎝ ⎠ ⎝ ⎠ ⎝ ⎠ ⎝ ⎠

4 Dimensions

Page 43: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

2122dd ⎛ ⎞ =⎜ ⎟

⎝ ⎠

1

d Dimensions

Page 44: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Almost all area of the unit cube is outside the unit sphere

Page 45: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

High dimension is fundamentally different from 2

or 3 dimensional space

Page 46: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

High dimensional data is inherently unstable

Given n random points in d dimensional space essentially all n2 distances are equal.

( )22

1

d

i ii

x yx y=

= −− ∑

Page 47: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Gaussian distribution

Page 48: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Gaussian distribution

Probability mass concentrated between dotted lines

Page 49: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Gaussian in high dimensions

Page 50: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Two Gaussians

Page 51: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction
Page 52: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Distance between two random points from same Gaussian

Points on thin annulus of radius

Approximate by sphere of radius

Average distance between two points is(Place one pt at N. Pole other at random. Almost surely second point near the equator.)

d

d

d

Page 53: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction
Page 54: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

2d

d

d

Page 55: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Expected distance between pts from two Gaussians separated by δ

2 2dδ +

Page 56: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Can separate points from two Gaussians if

( )

( )

2

14

2

12 2

2

2 2

2 1 2

12 2

2 2

d

d d

d d

d

d

δ

δ γ

γ

δ γ

δ γ

+ > +

+ + > +

>

>

L

Page 57: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Dimension reduction

Project points onto subspace containing centers of GaussiansReduce dimension from d to k, the number of Gaussians

Page 58: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Centers retain separationAverage distance between points reduced by d

k

Page 59: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Can separate Gaussians provided

2 2 2k kδ γ+ > +

> some constant involving k and γindependent of the dimension

δ

Page 60: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Ranking is importantRestaurantsMoviesWeb pages

Multi billion dollar industry

Page 61: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Page rank equals stationary probability of random walk

Page 62: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

restart

15%

Page 63: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

restart

15%

Restart yields strongly connected graph

Page 64: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Suppose you wish to increase the page rank of vertex v

Capture restartweb farm

Capture random walksmall cycles

Page 65: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Capture restart

Buy 20,000 url’s and capture restart

Can be countered by small restart value Small restart increases web rank of page that captures random walk by small cycles.

Page 66: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

1 0.851

0.15 restart

Capture random walk

0.66

0.23 restart

0.6611.56

0.1restart

0.66

Page 67: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

X=1.56

Y=0.66

0.1restart

0.56 0.66

10.66

0.23restart

X=1+0.85*y

Y=0.85*x/2

Page 68: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

If one loop increases Pagerankfrom 1 to 1.56 why not add many self loops?

Maximum increase in Pagerank is 6.67

Page 69: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Discovery time – time to first reach a vertex by random walk from uniform start

SCannot lower discovery time of any page in S below minimum already in S

Page 70: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Why not replace Pagerank by discovery time?

No efficient algorithm for discovery time

DiscoveryTime(v)

remove edges out of v

calculate Pagerank(v) in modified graph

Page 71: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Is there a way for a spammer to raise Pagerank in a way that is not statistically detectable

Page 72: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Information is important

When a customer makes a purchase what else is he likely to buy?

CameraMemory cardBatteriesCarrying caseEtc.

Knowing what a customer is likely to buy is important information.

Page 73: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

How can we extract information from a customer’s visit to a web site?

What web pages were visited?What order?How long?

Page 74: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Collaborative filteringRecommendationsWhich pop-up ads

Detecting changes over timeChanges in a marketBuying habits

Access to information

Page 75: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

One million products

⎛ ⎞⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎝ ⎠

Probability of customer

Buying product

200 million

customers

Page 76: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

100 categories

⎛ ⎞⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎛ ⎞⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎝ ⎠⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎝ ⎠

One million

products

Probability of product given category

200 million

customers

Probability of category

Given customer

Page 77: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Extracting Information from Large Data Sources

Data streamsLarge data collectionsDetecting changes in patterns

Page 78: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Detecting trends before they become obvious

Is some category of customer changing their buying habits?

Purchases, travel destination, vacationsIs there some new trend in the stock market?How do we detect changes in a large database over time?

Page 79: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Identifying Changing Patterns in a Large Data Set

How soon can one detect a change in patterns in a large volume of information?How large must a change be in order to distinguish it from random fluctuations?

Page 80: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Conclusions

We are in an exciting time of change.Information technology is a big driver of that change.The computer science theory of the last thirty years needs to be extended to cover the next thirty years.

Page 81: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Spectral Analysis

The model

1 12 4

0 1ˆ

1 14 2

McSherry FOCS 2002

G Gmatrix

⎛ ⎞⎜ ⎟ ⎛ ⎞⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟−⎜ ⎟= → = ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟⎜ ⎟ ⎝ ⎠⎜ ⎟⎝ ⎠

Page 82: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Eigenvalue distribution

Page 83: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Spectral Analysis

Recovering the graph structure

( )

1

2

31 2 1 2

4

1 1 12 4

20

01 14 2

0

ˆ

T T Tu u u u u un n

n

TG UDU

G

λ

λ

λ

λ

λ

λ

λ

⎛ ⎞⎜ ⎟⎛ ⎞⎛ ⎞⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟= ⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟⎝ ⎠ ⎝ ⎠⎜ ⎟⎝ ⎠

⎛ ⎞⎛ ⎞ ⎜ ⎟⎛ ⎞ ⎛ ⎞⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜⎜ ⎟ ⎜ ⎟⎜ ⎟= = ⎜⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜⎝ ⎠ ⎝ ⎠⎜ ⎟ ⎜⎝ ⎠ ⎝ ⎠

=L L

O

O

M cSherry FOCS 2002

⎟⎟⎟⎟⎟⎟⎟

Page 84: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Power law distributions

If the degrees of a random matrix are power law distributed, the major eigenvectors are associate with neighborhoods of high degree vertices rather than structure of the graph.

Papadimitriou and Mihail

1 1

2 21 12 4 ˆ1 14 2

n n

d dd dM DGD M

d d

⎛ ⎞ ⎛ ⎞⎜ ⎟ ⎜ ⎟⎛ ⎞⎜ ⎟ ⎜ ⎟⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟ ⎜ ⎟⎝ ⎠⎜ ⎟ ⎜ ⎟⎝ ⎠ ⎝ ⎠

= = →O O

Page 85: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Signal to noise ratio

Multiply every element of matrix by some fixed constant.The bounds in spectral analysis are determined by maximum noise of an element not the average.Multiplying low variance elements by constant increases signal without increasing maximum noise.

Page 86: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Variance of random variable

21

10 1

px p px p

⎧⎪⎪ ⎛ ⎞⎛ ⎞⎪ ⎜ ⎟⎨ ⎜ ⎟ ⎜ ⎟⎝ ⎠⎪ ⎝ ⎠⎪⎪⎩

= = ≅−−

Page 87: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Two ways of modifying x

2

2 2 2

11

0 1

10 1

cpy cp cpy cpcp

c pz c p c pz pp

σ

σ

⎧⎪⎪ ⎛ ⎞⎛ ⎞⎪ ⎜ ⎟⎪ ⎜ ⎟ ⎜ ⎟⎨ ⎜ ⎟ ⎜ ⎟⎪ ⎜ ⎟ ⎜ ⎟⎝ ⎠⎪ ⎝ ⎠⎪⎪⎩

⎧⎪⎪ ⎛ ⎞⎪ ⎛ ⎞ ⎜ ⎟⎪ ⎜ ⎟ ⎜ ⎟⎨ ⎜ ⎟ ⎜ ⎟⎪ ⎝ ⎠ ⎜ ⎟⎪ ⎝ ⎠⎪⎪⎩

= = ≅−−

= = ≅−−

Page 88: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Increasing probability by c increases variance by c

Multiplying variable by c increases variance by

Thus we correct for increase in probability of factor of c by multiplying

variable by

2c

1c

1 1

2 2

1 1

1 1

1 1

n n

d d

d dL M

d d

⎛ ⎞ ⎛ ⎞⎛ ⎞⎜ ⎟ ⎜ ⎟⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟

= ⎜ ⎟ ⎜ ⎟⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟

⎝ ⎠⎝ ⎠ ⎝ ⎠

O O

Page 89: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Power law distributions arise in many different contexts1) data2) queries

Page 90: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

General theme emerging in clustering

Use SVD to find reduced subspaceProject data onto reduced subspaceClusterAlthough SVD minimizes the sum of squared error between points and their expected values the error is not uniformly distributed and thus there are usually some outliersReproject data onto subspace through the approximate cluster centersRecluster

Page 91: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Two situations

( )( )

Centers 0,0, ,0

1,0, ,0

L

L

( )C e n te rs 0 ,0 , ,0L

( )1 1,1, ,1d

L

Page 92: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Two situations

Centers

Centers

( )0 , 0 , , 0L

( )1 1 1, , ,d d d

L

( )0, 0 , , 0L

( )1, 0 , , 0L

( )1 1 1, , ,d d d

L

Page 93: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

For spherical Gaussian the two situations are equivalent – Probability distribution for spherical Gaussian depends only on distance from center.

For binomial distributions – the two situations are fundamentally different.

Page 94: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Binomial distribution in d dimensions

12

12

01i

px

p=⎧

= ⎨ =⎩

Page 95: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Centers differ in one dimension

Page 96: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

Centers differ along all dimensions

Page 97: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

The two situations

Page 98: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

BalanceSuggests we define balance of a unit vector by how uniformly the coordinates contribute to its length.

2

bal( )x

xx

∞=

Page 99: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

General method

Project data onto SVD subspaceClusterDraw lines through all k2 pairs of cluster centersSmooth linesProject onto each line and cluster

Page 100: Computer Science in the Information Age...Extracting signal from noise Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction

A B

C