25
Algorithms at Scale (Week 13) Introduction

(Week 13) Introductiongilbert/CS5234/2019/... · 2019-11-14 · HyperLogLog Robin Yu, SidhantBansal What if only recent data matters? Techniques for discarding old data in stream

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: (Week 13) Introductiongilbert/CS5234/2019/... · 2019-11-14 · HyperLogLog Robin Yu, SidhantBansal What if only recent data matters? Techniques for discarding old data in stream

AlgorithmsatScale(Week13)

Introduction

Page 2: (Week 13) Introductiongilbert/CS5234/2019/... · 2019-11-14 · HyperLogLog Robin Yu, SidhantBansal What if only recent data matters? Techniques for discarding old data in stream

Today’sPlan

Wrap-up(Seth)

StreamingAlgorithmsa. RobinYu,Sidhant Bansalb. EldonChung,Qiu Siqi

Sampling/DimensionalityReductiona. XuKai,WangYueb. GiovanniPagliarini,Corentin Dumery

Cache-efficientAlgorithmsa. FooGuo Wei,Kuan WeiHeng

Introduction (Seth)

Page 3: (Week 13) Introductiongilbert/CS5234/2019/... · 2019-11-14 · HyperLogLog Robin Yu, SidhantBansal What if only recent data matters? Techniques for discarding old data in stream

StreamingAlgorithms

WindowedStreams

HyperLogLog

RobinYu,Sidhant Bansal

What if only recent data matters?

Techniques for discarding old data in stream.

Example: average temperature over the last 4 hours.

Algorithm for counting distinct elements in a stream.

Improves on Flajolet-Martin.

State-of-the-art streaming algorithm, in use today.

Page 4: (Week 13) Introductiongilbert/CS5234/2019/... · 2019-11-14 · HyperLogLog Robin Yu, SidhantBansal What if only recent data matters? Techniques for discarding old data in stream

StreamingAlgorithms

TriangleEstimation

EldonChung,Qiu Siqi

Graph streaming algorithm.

Techniques for estimating the number of triangles in a graph.

Several different sampling approaches.

Page 5: (Week 13) Introductiongilbert/CS5234/2019/... · 2019-11-14 · HyperLogLog Robin Yu, SidhantBansal What if only recent data matters? Techniques for discarding old data in stream

SamplingAlgorithms

DimensionalityReduction

( 3 , 4 )

( 2 , 2 )

( 6 , 6 )

( 8 , 4 )

( 3 , 8 )

( 5 , 2 )

( 6 , 9 )

( 2 , 4 )

( 5 , 6 )

( 7 , 1 )

( 8 , 1 )

( 3 , 5 )

( 5 , 9 )

( 7 , 5 )

( 1 , 2 )

( 2 , 4 )

( 1 , 7 )

( 6 , 9 )

( 8 , 0 )

( 9 , 1 )

2 dim data

Page 6: (Week 13) Introductiongilbert/CS5234/2019/... · 2019-11-14 · HyperLogLog Robin Yu, SidhantBansal What if only recent data matters? Techniques for discarding old data in stream

SamplingAlgorithms

DimensionalityReduction

( 3 , 4 )

( 2 , 2 )

( 6 , 6 )

( 8 , 4 )

( 3 , 8 )

( 5 , 2 )

( 6 , 9 )

( 2 , 4 )

( 5 , 6 )

( 7 , 1 )

( 8 , 1 )

( 3 , 5 )

( 5 , 9 )

( 7 , 5 )

( 1 , 2 )

( 2 , 4 )

( 1 , 7 )

( 6 , 9 )

( 8 , 0 )

( 9 , 1 )

2 dim data

A small sample is sufficient to get a pretty good estimate of the data.

Page 7: (Week 13) Introductiongilbert/CS5234/2019/... · 2019-11-14 · HyperLogLog Robin Yu, SidhantBansal What if only recent data matters? Techniques for discarding old data in stream

SamplingAlgorithms

DimensionalityReduction

( 3 , 0 , 0 , 0 , 1 , 0 , 0 , 1 )

( 2 , 1 , 1 , 0 , 5 , 1 , 0 , 0 )

( 6 , 0 , 0 , 0 , 7 , 0 , 0 , 0 )

( 8 , 0 , 0 , 0 , 0 , 0 , 0 , 0 )

( 3 , 0 , 0 , 4 , 1 , 0 , 4 , 0 )

( 5 , 0 , 0 , 7 , 0 , 0 , 7 , 0 )

( 6 , 0 , 0 , 0 , 5 , 0 , 0 , 4 )

( 2 , 4 , 4 , 3 , 0 , 4 , 3 , 7 )

( 5 , 7 , 7 , 5 , 0 , 7 , 5 , 0 )

( 7 , 0 , 0 , 7 , 0 , 0 , 7 , 0 )

( 8 , 3 , 7 , 0 , 0 , 7 , 0 , 0 )

( 3 , 5 , 0 , 8 , 0 , 0 , 8 , 0 )

( 5 , 7 , 1 , 3 , 0 , 1 , 3 , 4 )

( 7 , 0 , 0 , 5 , 0 , 0 , 5 , 7 )

( 1 , 2 , 5 , 6 , 1 , 5 , 6 , 0 )

( 2 , 8 , 0 , 2 , 4 , 0 , 2 , 7 )

( 1 , 1 , 0 , 5 , 6 , 0 , 5 , 7 )

( 6 , 0 , 1 , 7 , 9 , 1 , 7 , 0 )

( 8 , 0 , 4 , 8 , 0 , 4 , 8 , 8 )

( 9 , 0 , 5 , 5 , 0 , 5 , 5 , 3 )

d dim data

Page 8: (Week 13) Introductiongilbert/CS5234/2019/... · 2019-11-14 · HyperLogLog Robin Yu, SidhantBansal What if only recent data matters? Techniques for discarding old data in stream

SamplingAlgorithms

DimensionalityReduction

( 3 , 0 , 0 , 0 , 1 , 0 , 0 , 1 )

( 2 , 1 , 1 , 0 , 5 , 1 , 0 , 0 )

( 6 , 0 , 0 , 0 , 7 , 0 , 0 , 0 )

( 8 , 0 , 0 , 0 , 0 , 0 , 0 , 0 )

( 3 , 0 , 0 , 4 , 1 , 0 , 4 , 0 )

( 5 , 0 , 0 , 7 , 0 , 0 , 7 , 0 )

( 6 , 0 , 0 , 0 , 5 , 0 , 0 , 4 )

( 2 , 4 , 4 , 3 , 0 , 4 , 3 , 7 )

( 5 , 7 , 7 , 5 , 0 , 7 , 5 , 0 )

( 7 , 0 , 0 , 7 , 0 , 0 , 7 , 0 )

( 8 , 3 , 7 , 0 , 0 , 7 , 0 , 0 )

( 3 , 5 , 0 , 8 , 0 , 0 , 8 , 0 )

( 5 , 7 , 1 , 3 , 0 , 1 , 3 , 4 )

( 7 , 0 , 0 , 5 , 0 , 0 , 5 , 7 )

( 1 , 2 , 5 , 6 , 1 , 5 , 6 , 0 )

( 2 , 8 , 0 , 2 , 4 , 0 , 2 , 7 )

( 1 , 1 , 0 , 5 , 6 , 0 , 5 , 7 )

( 6 , 0 , 1 , 7 , 9 , 1 , 7 , 0 )

( 8 , 0 , 4 , 8 , 0 , 4 , 8 , 8 )

( 9 , 0 , 5 , 5 , 0 , 5 , 5 , 3 )

d dim data Sample a couple

dimension??

NO. Does not work.

Page 9: (Week 13) Introductiongilbert/CS5234/2019/... · 2019-11-14 · HyperLogLog Robin Yu, SidhantBansal What if only recent data matters? Techniques for discarding old data in stream

SamplingAlgorithms

DimensionalityReduction

d dim data Different way of

sampling dimension:

Project to a random lower dimensional subspace.

(3

,0

,0

,0

,1

,0

,0

,1

)

(2

,1

,1

,0

,5

,1

,0

,0

)

(6

,0

,0

,0

,7

,0

,0

,0

)

(8

,0

,0

,0

,0

,0

,0

,0

)

(3

,0

,0

,4

,1

,0

,4

,0

)

(5

,0

,0

,7

,0

,0

,7

,0

)

(6

,0

,0

,0

,5

,0

,0

,4

)

(2

,4

,4

,3

,0

,4

,3

,7

)

(5

,7

,7

,5

,0

,7

,5

,0

)

(7

,0

,0

,7

,0

,0

,7

,0

)

(8

,3

,7

,0

,0

,7

,0

,0

)

(3

,5

,0

,8

,0

,0

,8

,0

)

(5

,7

,1

,3

,0

,1

,3

,4

)

(7

,0

,0

,5

,0

,0

,5

,7

)

(1

,2

,5

,6

,1

,5

,6

,0

)

(2

,8

,0

,2

,4

,0

,2

,7

)

(1

,1

,0

,5

,6

,0

,5

,7

)

(6

,0

,1

,7

,9

,1

,7

,0

)

(8

,0

,4

,8

,0

,4

,8

,8

)

(9

,0

,5

,5

,0

,5

,5

,3

)

Page 10: (Week 13) Introductiongilbert/CS5234/2019/... · 2019-11-14 · HyperLogLog Robin Yu, SidhantBansal What if only recent data matters? Techniques for discarding old data in stream

SamplingAlgorithms

Johnson-Lindenstrauss (JL)Transform

PrincipalComponentAnalysis

XuKai,WangYue

Project to a lower dimension.

Choose the projection at random.

Project to a lower dimension.

Maximize the variance.

Page 11: (Week 13) Introductiongilbert/CS5234/2019/... · 2019-11-14 · HyperLogLog Robin Yu, SidhantBansal What if only recent data matters? Techniques for discarding old data in stream

SamplingAlgorithms

FASTJohnson-Lindenstrauss (JL)Transform

Johnson-Lindenstrauss (JL)Transform

GiovanniPagliarini,Corentin Dumery

How do you make the JL transform even faster?

Choose a sparse projection matrix.

Precondition the vectors before projecting!

Project to a lower dimension.

Choose the projection at random.

Page 12: (Week 13) Introductiongilbert/CS5234/2019/... · 2019-11-14 · HyperLogLog Robin Yu, SidhantBansal What if only recent data matters? Techniques for discarding old data in stream

Cache-efficientAlgorithms

Write-optimizedDataStructures

FooGuo Wei,Kuan WeiHeng

p1

p1p1p2

p1p2 p1p2 p1p2 p1

B-treesearch:insert:

p2p1

buffer

Buffer treesearch:

insert:O(logB n)

O(logB n)

O(log n)

O

✓1

Blog n

Page 13: (Week 13) Introductiongilbert/CS5234/2019/... · 2019-11-14 · HyperLogLog Robin Yu, SidhantBansal What if only recent data matters? Techniques for discarding old data in stream

Cache-efficientAlgorithms

Cache-ObliviousLookahead Array

Log-StructuredMergeTree

FooGuo Wei,Kuan WeiHeng

Write-optimized: writes are faster than reads.

No distinction between disk and memory. B, M are unknown.

Primarily of theoretical interest… so far!

Write-optimized: writes are faster than reads.

Designed to separately optimize memory and disk accesses.

Popular in practice recently.

Page 14: (Week 13) Introductiongilbert/CS5234/2019/... · 2019-11-14 · HyperLogLog Robin Yu, SidhantBansal What if only recent data matters? Techniques for discarding old data in stream

Today’sPlan

Wrap-up(Seth)

StreamingAlgorithmsa. RobinYu,Sidhant Bansalb. EldonChung,Qiu Siqi

Sampling/DimensionalityReductiona. XuKai,WangYueb. GiovanniPagliarini,Corentin Dumery

Cache-efficientAlgorithmsa. FooGuo Wei,Kuan WeiHeng

Introduction (Seth)

Page 15: (Week 13) Introductiongilbert/CS5234/2019/... · 2019-11-14 · HyperLogLog Robin Yu, SidhantBansal What if only recent data matters? Techniques for discarding old data in stream

AlgorithmsatScale(Week13)

SemesterRecap

Page 16: (Week 13) Introductiongilbert/CS5234/2019/... · 2019-11-14 · HyperLogLog Robin Yu, SidhantBansal What if only recent data matters? Techniques for discarding old data in stream

CS5234 Goals

q Intuition:Insights and ideas to help you design your own algorithms for large-scale problems.

q Arsenal of tools and techniquesTools and techniques to help you analyze and understand algorithms at scale.

Page 17: (Week 13) Introductiongilbert/CS5234/2019/... · 2019-11-14 · HyperLogLog Robin Yu, SidhantBansal What if only recent data matters? Techniques for discarding old data in stream

FinalExam

Topics:Everything from the semester• Weeks 1-13.• Problem sets.• Tutorial problems.• Bad jokes.

Practical Info:November 27: • Time: 1pm• Location: ???

(NOT HERE)• Please double-check date

and time and location.

To bring: • One double-sided sheet of

paper.• Otherwise, nothing.

Page 18: (Week 13) Introductiongilbert/CS5234/2019/... · 2019-11-14 · HyperLogLog Robin Yu, SidhantBansal What if only recent data matters? Techniques for discarding old data in stream

Topics

Sampling&SublinearTime

Streaming&Sketching

CacheEfficient Parallel

Page 19: (Week 13) Introductiongilbert/CS5234/2019/... · 2019-11-14 · HyperLogLog Robin Yu, SidhantBansal What if only recent data matters? Techniques for discarding old data in stream

Key techniques

Batching

SketchingApproximationSampling

Memory Layout

Examples:• Flajolet-Martin• Count-Min Sketch• Count Sketch• Misra-Gries• Push-Pull sketch

Examples:• Polling• Reservoir Sampling• Chernoff / Hoeffding bounds• PAC learning• Data analysis (e.g., median)• Graph properties (e.g.,

approximate MST)

Examples:• Approximate graph

properties (e.g., connected components)

• Flajolet-Martin (discrete element counting)

• Clustering algorithms

Examples:• B-trees• Cache-oblivious search trees

/ van Emde Boas layout• External memory Lubys• External memory BFS

Examples:• Buffer trees• LSM• COLA

Divide-and-Conquer

Examples:• Fork-join algorithms• Prefix-Sum• Map-Reduce algorithms• k-machine algorithms

Page 20: (Week 13) Introductiongilbert/CS5234/2019/... · 2019-11-14 · HyperLogLog Robin Yu, SidhantBansal What if only recent data matters? Techniques for discarding old data in stream

SublinearTimeAlgorithms

Sublinear graph algs:Number of connected components in a graph.• Additive approximation

algorithm.Weight of MST• Multiplicative approximation

algorithm.Key Techniques• Chernoff and Hoeffding

Bounds

Simple sampling:Toy example 1: array all 0’s?• Gap-style question:

All 0’s or far from all 0’s?Toy example 2: Faction of 1’s?• Additive ± 𝜀 approximation• Hoeffding BoundIs the graph connected?• Gap-style question.• O(1) time algorithm.• Correct with probability

2/3.

Page 21: (Week 13) Introductiongilbert/CS5234/2019/... · 2019-11-14 · HyperLogLog Robin Yu, SidhantBansal What if only recent data matters? Techniques for discarding old data in stream

StreamingAlgorithms

GraphStreaming

Connectivity• Isthegraphconnected?Bipartite• Isthegraphbipartite?MST

• FindaminimumspanningtreeSpanners• FindapproximateshortestpathsMatching• Findan(approximate)maximum

matching.

Sketches

Misra-Gries:• Itemfrequency• HeavyHittersFlajolet-Martin:• Numberofdistinctelements

• Median-of-meanstechnique• Chebychev+ChernoffCount-MinSketch:• HeavyHitters.

Page 22: (Week 13) Introductiongilbert/CS5234/2019/... · 2019-11-14 · HyperLogLog Robin Yu, SidhantBansal What if only recent data matters? Techniques for discarding old data in stream

Approximation&Streaming

Today:ClusteringandStreamingk-medianclustering• Findk centerstominimizetheaveragedistancetoacenter.

LPapproximationalgorithm• Find2k centersthatgivea4-approximationoftheoptimalclustering.

Streaming• Findk centersinastreamofpoints.• Useahierarchicalschemetoreducespace.

Otherclusteringproblems

Page 23: (Week 13) Introductiongilbert/CS5234/2019/... · 2019-11-14 · HyperLogLog Robin Yu, SidhantBansal What if only recent data matters? Techniques for discarding old data in stream

Caching

GraphAlgorithms

Breadth-First-Search• SortingyourgraphMIS• Luby’s Algorithm• Cache-efficientimplementation

MST• Connectivity• MinimumSpanningTree

ExternalMemory

Externalmemorymodel• Howtopredictthe

performanceofalgorithms?

B-trees• EfficientsearchingWrite-optimizeddatastructures• BuffertreesCache-obliviousalgorithms• vanEmde Boasmemory

layout

Page 24: (Week 13) Introductiongilbert/CS5234/2019/... · 2019-11-14 · HyperLogLog Robin Yu, SidhantBansal What if only recent data matters? Techniques for discarding old data in stream

ParallelAlgorithms

Fork-Join

ModelsofParallelism• Fork-Joinmodel• WorkandSpan• GreedyschedulersAlgorithms

• Sum• MergeSort• ParallelSets• BFS• Prefix-Sum• Luby’s

Map-Reduce

Map-ReduceModel• Clustercomputing

Somesimpleexamples• Wordcount

• Join

Algorithms• Bellman-Ford• PageRank

k-Machinek-Machine Model• Cluster computing

Some simple examples• Luby’s• Bellman-Ford

Minimum Spanning Tree• Basic algorithm• Fully distributed• Lower bound

Page 25: (Week 13) Introductiongilbert/CS5234/2019/... · 2019-11-14 · HyperLogLog Robin Yu, SidhantBansal What if only recent data matters? Techniques for discarding old data in stream

CS5234:AlgorithmsatScale

Sampling&SublinearTime

Streaming&Sketching

CacheEfficient Parallel