Randomized Algorithms Graph Algorithms William Cohen




Machine Learning from Big Datasets

Outline
  Randomized methods
    SGD with the hash trick (review)
    Other randomized algorithms
      Bloom filters
      Locality sensitive hashing
  Graph Algorithms

Learning as optimization for regularized logistic regression
  Algorithm:
    Initialize arrays W, A of size R and set k=0
    For each iteration t=1,...,T
      For each example (xi,yi)
        Let V be a hash table so that V[h] = sum of xi[j] over all j with hash(j)%R == h
        pi = 1/(1+exp(-V.W)); k++
        For each hash value h: V[h]>0:
          W[h] *= (1 - λ2μ)^(k-A[h])
          W[h] = W[h] + λ(yi - pi)V[h]
          A[h] = k

Learning as optimization for regularized logistic regression
  Initialize arrays W, A of size R and set k=0
  For each iteration t=1,...,T
    For each example (xi,yi)
      k++; let V be a new hash table
      For each j: xi[j]>0: V[hash(j)%R] += xi[j]
      Let ip=0
      For each h: V[h]>0:
        W[h] *= (1 - λ2μ)^(k-A[h])
        ip += V[h]*W[h]
        A[h] = k
      pi = 1/(1+exp(-ip))
      For each h: V[h]>0:
        W[h] = W[h] + λ(yi - pi)V[h]
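The loop above can be sketched in Python as follows. This is a minimal illustration, not the course implementation: the names `hashed_features`, `train`, and `predict`, the use of `zlib.crc32` as the hash function, and the default values for the learning rate `lam` and regularization strength `mu` are all assumptions made for the example.

```python
import math
import zlib

def hashed_features(x, R):
    """Collapse a sparse example {feature_name: value} into R hashed buckets (the table V)."""
    V = {}
    for name, value in x.items():
        h = zlib.crc32(name.encode("utf-8")) % R  # assumed hash function
        V[h] = V.get(h, 0.0) + value
    return V

def train(examples, R=2**16, T=5, lam=0.1, mu=0.01):
    """SGD for regularized logistic regression with hashed features and lazy regularization."""
    W = [0.0] * R  # hashed weight vector
    A = [0] * R    # A[h] = value of k when W[h] was last regularized
    k = 0
    for _ in range(T):
        for x, y in examples:
            k += 1
            V = hashed_features(x, R)
            ip = 0.0
            for h, v in V.items():
                # Apply all the regularization steps skipped since W[h] was last touched.
                W[h] *= (1.0 - 2.0 * lam * mu) ** (k - A[h])
                A[h] = k
                ip += v * W[h]
            p = 1.0 / (1.0 + math.exp(-ip))
            for h, v in V.items():
                W[h] += lam * (y - p) * v
    return W

def predict(W, x):
    """Probability of the positive class under the hashed weights W."""
    V = hashed_features(x, len(W))
    ip = sum(v * W[h] for h, v in V.items())
    return 1.0 / (1.0 + math.exp(-ip))
```

Only the buckets that appear in the current example are touched on each step, which is why the array A is needed: it records how many regularization steps each weight still owes.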

Note: the W[h] *= (1 - λ2μ)^(k-A[h]) step is what regularizes the W[h]'s.

An example
  2^26 entries = 1 Gb @ 8 bytes/weight
  3.2M emails
  400k users
  40M tokens

Results

A variant of feature hashing
  Hash each feature multiple times with different hash functions
  Now, each w has k chances to not collide with another useful w'
  An easy way to get multiple hash functions
    Generate some random strings s1,...,sL
    Let the k-th hash function for w be the ordinary hash of the concatenation w+sk
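A small sketch of the "random strings" recipe above, under stated assumptions: `make_salts`, `hash_k`, and `hashed_buckets` are hypothetical names, and `zlib.crc32` stands in for "the ordinary hash".

```python
import random
import string
import zlib

def make_salts(L, seed=0, length=8):
    """Generate L random strings s1..sL, one per hash function."""
    rng = random.Random(seed)
    return ["".join(rng.choice(string.ascii_letters) for _ in range(length))
            for _ in range(L)]

def hash_k(w, salt, R):
    """The hash function for one salt: the ordinary hash of w concatenated with the salt."""
    return zlib.crc32((w + salt).encode("utf-8")) % R

def hashed_buckets(w, salts, R):
    """All bucket indices for feature w, one per hash function."""
    return [hash_k(w, s, R) for s in salts]
```

A feature is then stored in every bucket returned by `hashed_buckets`, so it is only fully lost if all of its buckets collide with other useful features.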

A variant of feature hashing
  Why would this work?
  Claim: with 100,000 features and 100,000,000 buckets:
    k=1: Pr(any duplication) ≈ 1
    k=2: Pr(any duplication) ≈ 0.4
    k=3: Pr(any duplication) ≈ 0.01

Hash Trick - Insights
  Save memory: don't store hash keys
  Allow collisions
    even though it distorts your data some
  Let the learner (downstream) take up the slack

Here's another famous trick that exploits these insights.

Bloom filters
  Interface to a Bloom filter
    BloomFilter(int maxSize, double p);
    void bf.add(String s); // insert s
    bool bf.contains(String s);
      // If s was added return true;
      // else with probability at least 1-p return false;
      // else with probability at most p return true;

  I.e., a noisy set where you can test membership (and that's it)
  note a hash table would do this in constant time and storage
  the hash trick does this as well

Bloom filters
  Another implementation
    Allocate M bits, bit[0],...,bit[M-1]
    Pick K hash functions hash(1,s),hash(2,s),...
      E.g.: hash(i,s) = hash(s + randomString[i])
    To add string s:
      For i=1 to K, set bit[hash(i,s)] = 1
    To check contains(s):
      For i=1 to K, test bit[hash(i,s)]
      Return true if they're all set; otherwise, return false
    We'll discuss how to set M and K soon, but for now:
      Let M = 1.5*maxSize // less than two bits per item!
      Let K = 2*log(1/p) // about right with this M
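The bit-array implementation above can be sketched as follows. This is an illustration under assumptions: the class takes the number of bits M and the number of hash functions K directly (rather than deriving them from maxSize and p), and it builds hash(i,s) by hashing s plus a per-function salt with SHA-256, which is one possible choice and not necessarily the one used in the course.

```python
import hashlib

class BloomFilter:
    """A noisy set: supports add() and contains(), with possible false positives."""

    def __init__(self, num_bits, num_hashes):
        self.m = num_bits
        self.k = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # M bits, all initially 0

    def _positions(self, s):
        # hash(i, s) = hash(s + salt_i), reduced to a bit index in [0, M)
        for i in range(self.k):
            digest = hashlib.sha256(f"{s}#salt{i}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, s):
        for pos in self._positions(s):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def contains(self, s):
        # True only if every one of the K bits is set.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(s))
```

Because bits are only ever set and never cleared, anything that was added is always reported as present; the only possible error is reporting an item that was never added.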

Bloom filters
  Analysis:
    Assume hash(i,s) is a random function
    Look at Pr(bit j is unset after n adds):
      $\Pr(\text{bit } j \text{ unset}) = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m}$
    ... and Pr(collision):
      $p = \left(1 - e^{-kn/m}\right)^{k}$
    ... fix m and n and choose k to minimize Pr(collision):
      $k = \frac{m}{n}\ln 2$

Bloom filters
  Analysis:
    Plug optimal k=m/n*ln(2) back into Pr(collision):
      $p = \left(\frac{1}{2}\right)^{\frac{m}{n}\ln 2} \approx 0.6185^{m/n}$
    Now we can fix any two of p, n, m and solve for the 3rd
    E.g., the value for m in terms of n and p:
      $m = -\frac{n \ln p}{(\ln 2)^{2}}$
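As a quick numeric check of the analysis above, the following sketch computes m and k from n and p using the standard formulas m = -n ln p / (ln 2)^2 and k = (m/n) ln 2. The function names are assumptions for this example.

```python
import math

def bloom_parameters(n, p):
    """Bits m and hash count k for n items and target false-positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k

def false_positive_rate(m, n, k):
    """Pr(collision) = (1 - e^(-kn/m))^k for m bits, n items, k hash functions."""
    return (1.0 - math.exp(-k * n / m)) ** k
```

For n = 1000 and p = 0.01 this works out to roughly 9.6 bits per item and k = 7; rounding k to an integer moves the achieved rate slightly away from the target.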

Bloom filters: demo

Bloom filters
  An example application
    Finding items in "sharded" data
      Easy if you know the sharding rule
      Harder if you don't (like Google n-grams)
  Simple idea:
    Build a BF of the contents of each shard
    To look for key, load in the BFs one by one, and search only the shards that probably contain key
    Analysis: you won't miss anything, you might look in some extra shards
    You'll hit O(1) extra shards if you set p=1/#shards

Bloom filters
  An example application
    discarding singleton features from a classifier
  Scan through data once and check each w:
    if bf1.contains(w): bf2.add(w)
    else bf1.add(w)
  Now:
    bf1.contains(w): w appears >= once
    bf2.contains(w): w appears >= 2x
  Then train, ignoring words not in bf2

Bloom filters
  An example application
    discarding rare features from a classifier
    seldom hurts much, can speed up experiments
  Scan through data once and check each w:
    if bf1.contains(w):
      if bf2.contains(w): bf3.add(w)
      else bf2.add(w)
    else bf1.add(w)
  Now:
    bf2.contains(w): w appears >= 2x
    bf3.contains(w): w appears >= 3x
  Then train, ignoring words not in bf3

Bloom filters
  More on this next week...

LSH: key ideas
  Goal:
    map feature vector x to bit vector bx
    ensure that bx preserves "similarity"

Random Projections

Random projections
  [Figure: points marked + and -, with labels u, -u, and 2γ]
  To make those points "close" we need to project to a direction orthogonal to the line between them
  Any other direction will keep the distant points distant.
  So if I pick a random r and r.x and r.x' are closer than γ then probably x and x' were close to start with.

LSH: key ideas
  Goal:
    map feature vector x to bit vector bx
    ensure that bx preserves "similarity"
  Basic idea: use random projections of x
    Repeat many times:
      Pick a random hyperplane r
      Compute the inner product of r with x
      Record if x is "close to" r (r.x>=0) as the next bit in bx
    Theory says that if x and x' have small cosine distance then bx and bx' will have small Hamming distance

LSH: key ideas
  Naive algorithm:
    Initialization:
      For i=1 to outputBits:
        For each feature f:
          Draw r[i,f] ~ Normal(0,1)
    Given an instance x
      For i=1 to outputBits:
        LSH[i] = sum(x[f]*r[i,f] for f with non-zero weight in x) > 0 ? 1 : 0
      Return the bit-vector LSH
  Problem: the array of r's is very large
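A minimal sketch of the naive algorithm above, assuming sparse instances stored as `{feature: weight}` dictionaries. The function names are hypothetical, and the full table of Gaussian draws is kept in memory, which is exactly the cost the slide points out.

```python
import random

def make_projections(features, output_bits, seed=0):
    """Draw r[i][f] ~ Normal(0,1) for every output bit i and every feature f."""
    rng = random.Random(seed)
    return [{f: rng.gauss(0.0, 1.0) for f in features}
            for _ in range(output_bits)]

def lsh_signature(x, r):
    """One bit per random hyperplane: 1 if the inner product r_i . x is positive."""
    return [1 if sum(w * r_i[f] for f, w in x.items()) > 0 else 0
            for r_i in r]

def hamming(a, b):
    """Number of positions where two bit vectors differ."""
    return sum(ai != bi for ai, bi in zip(a, b))
```

Two vectors pointing in the same direction get identical signatures, and vectors pointing in opposite directions differ in every bit, so the Hamming distance between signatures tracks the angle between the original vectors.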

LSH: pooling (van Durme)
  Better algorithm:
    Initialization:
      Create a pool:
        Pick a random seed s
        For i=1 to poolSize:
          Draw pool[i] ~ Normal(0,1)
      For i=1 to outputBits:
        Devise a random hash function hash(i,f):
          E.g.: hash(i,f) = hashcode(f) XOR randomBitString[i]
    Given an instance x
      For i=1 to outputBits:
        LSH[i] = sum( x[f] * pool[hash(i,f) % poolSize] for f in x) > 0 ? 1 : 0
      Return the bit-vector LSH
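The pooled variant above might look like this in Python. It is a sketch under assumptions: `zlib.crc32` plays the role of hashcode(f), each randomBitString[i] is a random 32-bit mask, and the function names are made up for the example.

```python
import random
import zlib

def make_pool(pool_size, seed=0):
    """A fixed pool of Normal(0,1) draws shared by all output bits."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(pool_size)]

def make_bit_masks(output_bits, seed=1):
    """One random 32-bit string per output bit, used to vary the hash function."""
    rng = random.Random(seed)
    return [rng.getrandbits(32) for _ in range(output_bits)]

def pooled_lsh_signature(x, pool, masks):
    """LSH bits where r[i][f] is looked up as pool[hash(i, f) % poolSize]."""
    bits = []
    for mask in masks:
        total = 0.0
        for f, w in x.items():
            idx = (zlib.crc32(f.encode("utf-8")) ^ mask) % len(pool)
            total += w * pool[idx]
        bits.append(1 if total > 0 else 0)
    return bits
```

Memory use is now poolSize floats plus outputBits masks, independent of the number of distinct features.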

LSH: key ideas
  Advantages:
    with pooling, this is a compact re-encoding of the data
      you don't need to store the r's, just the pool
    leads to very fast nearest neighbor method
      just look at other items with bx'=bx
      also very fast nearest-neighbor methods for Hamming distance
    similarly, leads to very fast clustering
      cluster = all things with same bx vector
  More next week.

Graph Algorithms

Graph algorithms
  PageRank implementations
    in memory
    streaming, node list in memory
    streaming, no memory
    map-reduce
  A little like Naive Bayes variants
    data in memory
    word counts in memory
    stream-and-sort
    map-reduce

Google's PageRank
  [Figure: web sites xxx, yyyy, "a b c d e f g", and "pdq pdq .." linking to one another]
  Inlinks are "good" (recommendations)
  Inlinks from a "good" site are better than inlinks from a "bad" site
  but inlinks from sites with many outlinks are not as "good"...
  "Good" and "bad" are relative.

Google's PageRank (Brin & Page, http://www-db.stanford.edu/~backrub/google.html)
  Imagine a "pagehopper" that always either
    follows a random link, or
    jumps to random page
  PageRank ranks pages by the amount of time the pagehopper spends on a page:
    or, if there were many pagehoppers, PageRank is the expected "crowd size"

PageRank in Memory
  Let u = (1/N, ..., 1/N)
    dimension = #nodes N
  Let A = adjacency matrix: [aij=1 <=> i links to j]
  Let W = [wij = aij/outdegree(i)]
    wij is probability of jump from i to j
  Let v0 = (1,1,...,1)
    or anything else you want
  Repeat until converged:
    Let vt+1 = cu + (1-c)Wvt
      c is probability of jumping "anywhere randomly"

Streaming PageRank
  Assume we can store v but not W in memory
  Repeat until converged:
    Let vt+1 = cu + (1-c)Wvt

  Store A as a row matrix: each line is
    i ji,1,...,ji,d [the neighbors of i]
  Store v' and v in memory: v' starts out as cu
  For each line "i ji,1,...,ji,d"
    For each j in ji,1,...,ji,d
      v'[j] += (1-c)v[i]/d
  Everything needed for update is right there in row.

Streaming PageRank: with some long rows
  Repeat until converged:
    Let vt+1 = cu + (1-c)Wvt

  Store A as a list of edges: each line is: "i d(i) j"
  Store v' and v in memory: v' starts out as cu
  For each line "i d j"
    v'[j] += (1-c)v[i]/d
  We need to get the degree of i and store it locally

Streaming PageRank: preprocessing
  Original encoding is edges (i,j)
  Mapper replaces i,j with i,1
  Reducer is a SumReducer
  Result is pairs (i,d(i))

  Then: join this back with edges (i,j)
  For each i,j pair:
    send j as a message to node i in the degree table
      messages always sorted after non-messages
    the reducer for the degree table sees i,d(i) first
      then j1, j2, ...
      can output the key,value pairs with key=i, value=d(i), j
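The degree preprocessing and the edge-list streaming update described above can be sketched together in a few lines of single-machine Python. This is an illustration, not a map-reduce job: the function names are assumptions, the edge list is an in-memory list standing in for a file that is scanned sequentially, and v starts at 1/N (one of the "anything else you want" choices) so that the scores sum to 1.

```python
from collections import Counter

def out_degrees(edges):
    """Preprocessing: map each edge (i, j) to (i, 1) and sum the ones per node i."""
    return Counter(i for i, _ in edges)

def join_degrees(edges, degree):
    """Join the degrees back onto the edges, giving lines of the form (i, d(i), j)."""
    return [(i, degree[i], j) for i, j in edges]

def streaming_pagerank(edge_lines, nodes, c=0.15, iterations=50):
    """Keep v and v' in memory; stream over the (i, d, j) lines once per iteration."""
    n = len(nodes)
    v = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        v_next = {node: c / n for node in nodes}  # v' starts out as c*u
        for i, d, j in edge_lines:                # one sequential pass over the edges
            v_next[j] += (1.0 - c) * v[i] / d
        v = v_next
    return v
```

Only the two score vectors need random access; the graph itself is read front to back, which is what makes the approach workable when W does not fit in memory.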

Preprocessing Control Flow: 1
  [Diagram: the edge table (I, J) goes through MAP to give pairs (i, 1), then SORT, then REDUCE (summing values) to give the degree table (I, d(i))]

Preprocessing Control Flow: 2
  [Diagram: the edge table (I, J) is copied or converted to messages (i, ~j) and merged with the degree table (I, d(i)); MAP, SORT, then REDUCE (join degree with edges) gives rows (i, d(i), j)]

Streaming PageRank: with some long rows
  Repeat until converged:
    Let vt+1 = cu + (1-c)Wvt

  Pure streaming: use a table mapping nodes to degree+pageRank
    Lines are "i: degree=d,pr=v"
  For each edge i,j
    Send to i (in degree/pagerank table): "outlink j"
  For each line "i: degree=d,pr=v":
    send to i: incrementVBy c
    for each message "outlink j":
      send to j: incrementVBy (1-c)*v/d
  For each line "i: degree=d,pr=v"
    sum up the incrementVBy messages to compute v'
    output new row: "i: degree=d,pr=v'"
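One iteration of the message-passing scheme above can be simulated on a single machine with sorts standing in for the shuffle. This is a sketch under assumptions: `pagerank_iteration` is a hypothetical name, the table is a dict from node to (degree, pr), every node that appears in an edge is assumed to have a row in the table, and records are tagged 0 or 1 so that a node's own row sorts before its outlink messages.

```python
from itertools import groupby

def pagerank_iteration(table, edges, c=0.15):
    """One pass of pure-streaming PageRank; table maps node -> (degree, pr)."""
    # Map 1: table rows (tag 0) and "outlink j" messages (tag 1), keyed by node i.
    records = [(i, 0, row) for i, row in table.items()]
    records += [(i, 1, j) for i, j in edges]
    records.sort(key=lambda rec: (rec[0], rec[1]))  # row sorts before its messages

    # Reduce 1: each node sees its row first, then its outlinks, and emits incrementVBy messages.
    increments = []
    for i, group in groupby(records, key=lambda rec: rec[0]):
        group = list(group)
        d, v = group[0][2]
        increments.append((i, c))                         # send to i: incrementVBy c
        for _, _, j in group[1:]:
            increments.append((j, (1.0 - c) * v / d))     # send to j: incrementVBy (1-c)*v/d

    # Sort + Reduce 2: sum the increments addressed to each node to get v'.
    increments.sort(key=lambda msg: msg[0])
    new_table = {}
    for i, group in groupby(increments, key=lambda msg: msg[0]):
        new_table[i] = (table[i][0], sum(delta for _, delta in group))
    return new_table
```

Calling the function repeatedly, feeding each output table back in as the next input, corresponds to going "back around for the next iteration".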

  One identity mapper with two inputs (edges, degree/pr table)
  Reducer outputs the incrementVBy messages
  Two-input mapper + reducer

Control Flow: Streaming PR
  [Diagram: the edge table (I, J) is copied or converted to messages (i, ~j) and merged with the degree/pageRank table (I, d/v); MAP, SORT, then REDUCE sends pageRank updates to outlinks, giving a (to, delta) table with rows such as (i1, c) and (j1,1, (1-c)v(i1)/d(i1))]

Control Flow: Streaming PR
  [Diagram: the (to, delta) table goes through MAP, SORT, and REDUCE (summing values) to give pairs (i, v'); a further MAP, SORT, REDUCE replaces v with v' in the degree/pageRank table (I, d/v)]

Control Flow: Streaming PR
  [Diagram: the edge table (I, J) and the updated degree/pageRank table (I, d/v) feed the MAP step (copy or convert to messages) ... and back around for next iteration.]

More on graph algorithms
  PageRank is one simple example of a graph algorithm
    but an important one
    personalized PageRank (aka "random walk with restart") is an important operation in machine learning/data analysis settings
  PageRank is typical in some ways
    Trivial when graph fits in memory
    Easy when node weights fit in memory
    More complex to do with constant memory
    A major expense is scanning through the graph many times
      same as with SGD/Logistic regression
      disk-based streaming is much more expensive than memory-based approaches
  Locality of access is very important!
    gains if you can pre-cluster the graph even approximately
    avoid sending messages across the network - keep them local

Machine Learning in Graphs - 2010

Some ideas
  Combiners are helpful
    Store outgoing incrementVBy messages and aggregate them
    This is great for high indegree pages
  Hadoop's combiners are suboptimal
    Messages get emitted before being combined
    Hadoop makes weak guarantees about combiner usage
  I'd think you want to spill the hash table to memory when it gets large

Some ideas
  Most hyperlinks are within a domain
    If we keep domains on the same machine this will mean more messages are local
    To do this, build a custom partitioner that knows about the domain of each nodeId and keeps nodes on the same domain together
    Assign node ids so that nodes in the same domain are together - partition node ids by range
    Change Hadoop's Partitioner for this
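The domain-aware partitioning idea above can be illustrated with a small Python function. This is not Hadoop's Partitioner API, only a sketch of the assignment logic: it assumes node ids are URLs, and the names `domain_of` and `partition` are made up for the example.

```python
import zlib
from urllib.parse import urlparse

def domain_of(url):
    """The host part of a URL, lower-cased, used as the partitioning key."""
    return urlparse(url).netloc.lower()

def partition(url, num_partitions):
    """Assign a node to a partition by hashing its domain, not the full URL."""
    return zlib.crc32(domain_of(url).encode("utf-8")) % num_partitions
```

Since every page on a domain hashes to the same partition, a message sent along a within-domain hyperlink stays on the machine that produced it.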

Some ideas
  Repeatedly shuffling the graph is expensive
    We should separate the messages about the graph structure (fixed over time) from messages about pageRank weights (variable)
    compute and distribute the edges once
    read them in incrementally in the reducer
      not easy to do in Hadoop!
    call this the "Schimmy" pattern

Schimmy
  Relies on the fact that keys are sorted, and sorts the graph input the same way.