Algorithms and data structures for big data, what's next? Paolo Ferragina, University of Pisa


Is Big Data a buzzword?

“Big Data” vs “Grid Computing”

VLDB has existed since 1992

Big data, big impact !

Big data are everywhere !

[Proc. OSDI 2006] NoSQL: HyperTable, Cassandra, Hadoop, Cosmos

From macro to micro-users

Energy is related to time and memory accesses in an intricate way, so the issue “algo + memory levels” is key for everyday users, not only for the big players

Our driving moral...

Big steps come from theory

... but do NOT forget practice ;-)

Our running example

(String-)Dictionary Problem

Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support prefix searches for a pattern P.

Exact search Hashing

(Compacted) Trie

[Figure: compacted trie built on the dictionary strings listed below]

systile syzygetic syzygial syzygy szaibelyite szczecin szomo

[Fredkin, CACM 1960]

(2; 3,5)

Performance:
• Search ≈ O(|P|) time
• Space ≈ O(N)
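The O(|P|) search bound can be illustrated with a minimal sketch. For brevity it uses a plain (uncompacted) trie with one node per character; a compacted trie would merge unary paths into single edges. The class and method names are illustrative, not taken from the talk.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_end = False  # True if a dictionary string ends at this node


class Trie:
    def __init__(self, strings=()):
        self.root = TrieNode()
        for s in strings:
            self.insert(s)

    def insert(self, s):
        node = self.root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def prefix_search(self, p):
        """Return all dictionary strings prefixed by p, in lexicographic order."""
        node = self.root
        for ch in p:                      # O(|P|) steps to reach the node of P
            node = node.children.get(ch)
            if node is None:
                return []
        out = []
        stack = [(node, p)]
        while stack:                      # visit the subtree below P
            cur, s = stack.pop()
            if cur.is_end:
                out.append(s)
            # push larger characters first, so smaller ones are popped first
            for ch in sorted(cur.children, reverse=True):
                stack.append((cur.children[ch], s + ch))
        return out


D = ["systile", "syzygetic", "syzygial", "syzygy",
     "szaibelyite", "szczecin", "szomo"]
trie = Trie(D)
print(trie.prefix_search("syzyg"))  # ['syzygetic', 'syzygial', 'syzygy']
```

Each step of the descent follows a pointer to a node that may lie anywhere in memory, and every node carries its own dictionary of children; these are the two costs behind the objections on random memory accesses and pointer space.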

Dominated the string-matching scene in the ‘80s-90s

Most known is the Suffix Tree

Software engineers objected:
• Search: random memory accesses
• Space: pointers + strings

Lexicographic search

P = systo

Timeline: theory and practice...

[Timeline: ’60 Trie, ’70-’80 Suffix Tree, ’90]

What about Software Engineers?

What did systems implement?

They used the compacted trie, of course, but with 2 additional concerns due to large data

1st issue: space concern

http://checkmate.com/All_Natural/
http://checkmate.com/All_Natural/Applied.html
http://checkmate.com/All_Natural/Aroma.html
http://checkmate.com/All_Natural/Aroma1.html
http://checkmate.com/All_Natural/Aromatic_Art.html
http://checkmate.com/All_Natural/Ayate.html
http://checkmate.com/All_Natural/Ayer_Soap.html
http://checkmate.com/All_Natural/Ayurvedic_Soap.html
http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
http://checkmate.com/All_Natural/Bath_Salts.html
http://checkmate.com/All/Essence_Oils.html
http://checkmate.com/All/Mineral_Bath_Crystals.html
http://checkmate.com/All/Mineral_Bath_Salt.html
http://checkmate.com/All/Mineral_Cream.html

http://checkmate.com/All/Natural/Washcloth.html...

0 http://checkmate.com/All_Natural/
33 Applied.html
34 roma.html
38 1.html
38 tic_Art.html
34 yate.html
35 er_Soap.html
35 urvedic_Soap.html
33 Bath_Salt_Bulk.html
42 s.html
25 Essence_Oils.html
25 Mineral_Bath_Crystals.html
38 Salt.html
33 Cream.html

3345%

0 http://checkmate.com/All/Natural/Washcloth.html...

systile syzygetic syzygial syzygy….2,zygetic 5,ial 5,y

FrontCoding
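A minimal sketch of front coding, assuming the strings arrive already sorted: each string is stored as the pair (length of the prefix shared with the previous string, remaining suffix). The function names are illustrative.

```python
def lcp(a, b):
    """Length of the longest common prefix of a and b."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


def front_encode(sorted_strings):
    """Encode each string as (shared prefix length with predecessor, suffix)."""
    out = []
    prev = ""
    for s in sorted_strings:
        k = lcp(prev, s)
        out.append((k, s[k:]))
        prev = s
    return out


def front_decode(pairs):
    """Rebuild the strings by scanning the pairs left to right."""
    out = []
    prev = ""
    for k, suffix in pairs:
        s = prev[:k] + suffix
        out.append(s)
        prev = s
    return out


words = ["systile", "syzygetic", "syzygial", "syzygy"]
coded = front_encode(words)
print(coded)  # [(0, 'systile'), (2, 'zygetic'), (5, 'ial'), (5, 'y')]
assert front_decode(coded) == words
```

Decoding is inherently sequential: a string can be rebuilt only after its predecessor, which is why front coding is usually applied per bucket rather than to the whole dictionary.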

2nd issue: disk memory

[Figure: CPU with internal memory and a disk organized in tracks; data moves between them in blocks of size B]

2 main features:
• Seek time = I/Os are costly
• Blocked access = B items per I/O

Count I/Os

Strings may be arbitrarily long

Why are strings challenging ?

….0systile 2zygetic 5ial 5y 0szaibelyite 2czecin 2omo….

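2-level indexing

[Figure: a compacted trie (CT) on a sample of the strings, here systile and szaibelyite, is kept in internal memory; the front-coded buckets are kept on disk]

One main limitation: sampling rate & lengths of sampled strings

Trade-off between speed and space (because of bucket size)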

2 advantages:
• Search ≈ typically 1 disk access
• Space ≈ Front-coding over buckets
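A rough sketch of 2-level indexing under simplifying assumptions: the "disk" is a Python list of front-coded buckets, and the in-memory level is a sorted array of bucket heads searched by binary search (a stand-in for the compacted trie on the sample). `TwoLevelIndex`, `bucket_size` and the `disk_reads` counter are hypothetical names introduced only for illustration.

```python
from bisect import bisect_right


def lcp(a, b):
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


def front_encode(strings):
    out, prev = [], ""
    for s in strings:
        k = lcp(prev, s)
        out.append((k, s[k:]))
        prev = s
    return out


def front_decode(pairs):
    out, prev = [], ""
    for k, suffix in pairs:
        prev = prev[:k] + suffix
        out.append(prev)
    return out


class TwoLevelIndex:
    def __init__(self, sorted_strings, bucket_size=4):
        self.heads = []      # internal memory: first string of each bucket
        self.buckets = []    # simulated disk: front-coded buckets
        self.disk_reads = 0  # number of buckets fetched so far
        for i in range(0, len(sorted_strings), bucket_size):
            chunk = sorted_strings[i:i + bucket_size]
            self.heads.append(chunk[0])
            self.buckets.append(front_encode(chunk))

    def _read_bucket(self, j):
        self.disk_reads += 1
        return front_decode(self.buckets[j])

    def prefix_search(self, p):
        # Last bucket whose head is <= p (or the first bucket if none is).
        j = max(bisect_right(self.heads, p) - 1, 0)
        out = []
        while j < len(self.buckets):
            head = self.heads[j]
            if head > p and not head.startswith(p):
                break  # this bucket and all later ones are past the range of p
            out.extend(s for s in self._read_bucket(j) if s.startswith(p))
            j += 1
        return out


D = ["systile", "syzygetic", "syzygial", "syzygy",
     "szaibelyite", "szczecin", "szomo"]
index = TwoLevelIndex(D, bucket_size=4)
print(index.heads)                   # ['systile', 'szaibelyite']
print(index.prefix_search("syzyg"))  # ['syzygetic', 'syzygial', 'syzygy']
print(index.disk_reads)              # 1
```

A larger `bucket_size` shrinks the in-memory level and improves front coding, but makes each bucket scan longer, which is the speed vs space trade-off named above.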

(Prefix) B-tree

[Timeline: ’60 Trie, ’70-’80 Suffix Tree, ’90 2-level indexing]

Space + Hierarchical Memory

Do we need to trade space for I/Os?

1995: String B-tree

An old idea: Patricia Trie

….systile syzygetic syzygial syzygy szaibelyite szczecin szomo….

[Figure: compacted trie built on the strings above, which are stored on disk]

[Morrison, J.ACM 1968]

A new (lexicographic) search

….systile syzygetic syzygial syzygy szaibelyite szczecin szomo….

[Figure: Patricia trie on the strings above: each edge keeps only its first character, each internal node its string depth]

Search(P):
• Phase 1: tree navigation
• Phase 2: compute LCP
• Phase 3: tree navigation

Lexicographic search: P = syzytea

Lexicographic position

Only 1 string is checked on disk

Trie space ≈ #strings, NOT their length

[Ferragina-Grossi, J.ACM 1999]

The String B-tree

[Figure: String B-tree; every node stores a Patricia trie (PT) on its string pointers. Pointers shown at the three levels:]
29 1 9 5 2 26 10 4 7 13 20 16 28 8 25 6 12 15 22 18 3 27 24 11 14 21 17 23
29 2 26 13 20 25 6 18 3 14 21 23
29 13 20 18 3 23

Search(P):
• Lexicographic position of P: O(log_B K) levels, and checking 1 string = O(p/B) I/Os, hence O((p/B) log_B K) I/Os
• + O(occ/B) I/Os

It is dynamic...
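To get a feeling for the bound, here is a worked instance with purely illustrative numbers (assumptions, not figures from the talk): K = 10^9 strings, block size B = 10^3, pattern length p = 2·10^3, and occ = 5·10^3 occurrences.

```latex
\[
\underbrace{\frac{p}{B}\,\log_B K}_{\text{lexicographic position of } P}
\;+\;
\underbrace{\frac{occ}{B}}_{\text{occurrences}}
\;=\;
\frac{2\cdot 10^{3}}{10^{3}} \cdot \log_{10^{3}} 10^{9}
\;+\;
\frac{5\cdot 10^{3}}{10^{3}}
\;=\; 2 \cdot 3 + 5 \;=\; 11 \ \text{I/Os (up to constant factors)}
\]
```

Under these assumptions the tree has only 3 levels, so the search cost is dominated by how many blocks the pattern and the output occupy.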

[Ferragina-Grossi, J.ACM 1999]

> 15 US-patents cite it !!

Knuth, vol. 3, p. 489: “elegant”

I/O-aware algorithms & data structures

[CACM 1988]

[2006]

Huge literature !!

I/Os were the main concern


[Timeline: ’60 Trie, ’70-’80 Suffix Tree, ’90 2-level indexing, 1995-1999 String B-tree]

[Figure: memory hierarchy: CPU registers, L1 and L2 caches, RAM, hard disk (HD), network]

Cache-oblivious solutions, aka parameter-free algo+ds Anywhere, anytime, anyway... I/O-optimal !!

Not just 2 memory levels


[Timeline: ’60 Trie, ’70-’80 Suffix Tree, ’90 2-level indexing, 1995-1999 String B-tree, followed by two directions:]
• Not just 2 memory levels: cache-oblivious data structures
• Space: compressed data structures

Can we “automate” and “guarantee” the process ?

A challenging question [Ken Church, AT&T 1995]

Software Engineers use “squeezing heuristics” that compress data and still support fast access to them

Aka: Compressed self-indexes

Opportunistic Data Structures with Applications

P. Ferragina, G. Manzini

• Space for text + index ≈ space for the compressed text only (≈ Hk)
• Query/decompression time theoretically (quasi-)optimal

...now, J.ACM 2005

The big (unconscious) step...

[Burrows-Wheeler, 1994]

Let us be given a text T = mississippi#

All rotations of T:
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows (first char, middle, last char):
# mississipp i
i #mississip p
i ppi#missis s
i ssippi#mis s
i ssissippi# m
m ississippi #
p i#mississi p
p pi#mississ i
s ippi#missi s
s issippi#mi s
s sippi#miss i
s sissippi#m i

Highly compressible, but…

T = mississippi# → bwt(T) = ipssm#pissii, the last column of the sorted rows

bzip2 = BWT + other simple compressors
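The transform can be sketched in a few lines. This version sorts all rotations explicitly, so it suits only small examples, and it assumes that T ends with a unique end marker ('#') that is smaller than every other character. The inverse follows the standard LF-mapping argument; the function names are illustrative.

```python
def bwt(text):
    """Burrows-Wheeler transform: last column of the sorted rotations."""
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    return "".join(row[-1] for row in rotations)


def inverse_bwt(last):
    """Rebuild the text from its BWT, reading it backward via the LF-mapping."""
    n = len(last)
    # Stable sort: order[j] is the position in `last` of the j-th char of the
    # first column, so equal characters keep their relative order.
    order = sorted(range(n), key=lambda i: last[i])
    lf = [0] * n
    for j, i in enumerate(order):
        lf[i] = j
    # Row 0 starts with the end marker, which is the final character of the text.
    out = [last[order[0]]]
    r = 0
    for _ in range(n - 1):
        out.append(last[r])   # the character preceding the current row's start
        r = lf[r]
    return "".join(reversed(out))


T = "mississippi#"
L = bwt(T)
print(L)               # ipssm#pissii
print(inverse_bwt(L))  # mississippi#
```

Equal characters cluster in the output (here "ss" and "ii"), which is what the simple compressors applied after the BWT exploit.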

From practice to theory...

FM-index = BWT is searchable

...or Suffix Array is compressible

• Space = |T| Hk + o(|T|) bits

• Search(P) = O(p + occ * polylog(|T|))
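The claim "BWT is searchable" rests on backward search. The sketch below is deliberately uncompressed and naive: ranks are computed by scanning the BWT string, and the full suffix array is kept for locating, whereas an actual FM-index uses compressed rank structures and a sampled suffix array. It assumes T ends with a unique smallest end marker and uses 0-based positions; all names are illustrative.

```python
def build_index(text):
    """Return (suffix array, BWT string, C) for a text ending with a unique '#'."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    last = "".join(text[i - 1] for i in sa)  # i == 0 wraps around to the end marker
    C, total = {}, 0                         # C[c] = number of chars in text smaller than c
    for c in sorted(set(text)):
        C[c] = total
        total += text.count(c)
    return sa, last, C


def backward_search(last, C, pattern):
    """Return the range [lo, hi) of sorted suffixes that are prefixed by pattern."""
    lo, hi = 0, len(last)
    for c in reversed(pattern):              # one step per pattern character
        if c not in C:
            return 0, 0
        lo = C[c] + last.count(c, 0, lo)     # naive rank of c in last[:lo]
        hi = C[c] + last.count(c, 0, hi)     # naive rank of c in last[:hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


def count(last, C, pattern):
    lo, hi = backward_search(last, C, pattern)
    return hi - lo


def locate(sa, last, C, pattern):
    lo, hi = backward_search(last, C, pattern)
    return sorted(sa[lo:hi])


T = "mississippi#"
sa, last, C = build_index(T)
print(last)                         # ipssm#pissii
print(count(last, C, "ssi"))        # 2
print(locate(sa, last, C, "ssi"))   # [2, 5]
```

The number of backward steps depends only on the pattern length p; in a real FM-index the polylog(|T|) factor per occurrence comes from recovering positions out of the sampled suffix array.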

Nowadays tons of papers: theory & experiments [Navarro-Makinen, ACM Comp. Surveys 2007]

[Ferragina-Manzini, IEEE FOCS ’00]

sa(T)  sorted rotation  bwt(T)
12     #mississipp      i
11     i#mississip      p
 8     ippi#missis      s
 5     issippi#mis      s
 2     ississippi#      m
 1     mississippi      #
10     pi#mississi      p
 9     ppi#mississ      i
 7     sippi#missi      s
 4     sissippi#mi      s
 6     ssippi#miss      i
 3     ssissippi#m      i

Compressed & Searchable data formats

After our FOCS 2000 paper on texts, we nowadays find compressed indexes for: trees, labeled trees and graphs, functions, integer sets, geometry, images, ...

From theory to practice…

December 2003

ACM J. on Experimental Algorithmics, 2009

> 10^3 faster than Smith-Waterman

> 10^2 faster than SOAP & Maq

What about the Web? [Ferragina-Manzini, ACM WSDM 2010]

An XML excerpt

<dblp>
  <book>
    <author> Donald E. Knuth </author>
    <title> The TeXbook </title>
    <publisher> Addison-Wesley </publisher>
    <year> 1986 </year>
  </book>
  <article>
    <author> Donald E. Knuth </author>
    <author> Ronald W. Moore </author>
    <title> An Analysis of Alpha-Beta Pruning </title>
    <pages> 293-326 </pages>
    <year> 1975 </year>
    <volume> 6 </volume>
    <journal> Artificial Intelligence </journal>
  </article>
  ...
</dblp>

IEEE FOCS 2005, WWW 2006, J. ACM 2009, US Patent 2012

A tree interpretation

• XML document exploration → tree navigation
• XML document search → labeled subpath searches

XBW transform

XBW Transform: Some performance figures

[Plot: number of searches per second, on larger and larger datasets]
• Xerces better on smaller files
• Xerces worse on larger files
• Xerces uses 10x space

Where we are nowadays

[Timeline: ’60 Trie, ’70-’80 Suffix Tree, ’90 2-level indexing, 1995-1999 String B-tree, then cache-oblivious data structures and compressed data structures]

Something is known... yet very preliminary

Lower Bounds derived from Geometry

Text search = 2d Range Search

New food for research..

[E. Gal, S. Toledo. ACM Comp. Surv., 2005]

[Ajwani et al, WEA 2009]

Solid-state disks: no mechanical parts ... very fast reads, but slow writes & wear leveling

Self-adjusting or Weighted design Time ops depend on some (un/known) distribution

Challenge: no pointers, self-adjust (perf) vs compression (space)

[Ferragina et al, ESA 2011]

40Gb, about 100$

The energy challenge

IEEE Computer, 2007

Browsing a web site

Javascript framework:

Browser   Prototype     Dojo          jQuery
Chrome    best choice   best choice   1.5%
FireFox   2.5%          4.8%          4.3%
IE        10.2%         8.5%          11%

[Two entries in the figure are annotated “The most used!”]

Yet today, it is a problem...

Apple is still working on the battery life problem: “The recent iOS software update addressed many of the battery issues that some customers experienced on their iOS 5 devices. We continue to investigate a few remaining issues.” (Nov 2011, wired.com)

“Windows 8's power hygiene: the scheduler will ignore the unused software” (Feb 2012, MSDN)

Energy-aware Algo+Ds ?

Locality pays off

Memory-level impacts

I/Os and compression are obviously important, BUT here there is a new twist

MIPS per Watt? Battery life!!

Who cares whether your application:
1. is y% slower than optimal, but it is more energy efficient?
2. takes x% more space than optimal, but it is more energy efficient?

Idea: multi-objective optimization in data-structure design

Approach in a principled way

A preliminary step

Took inspiration from BigTable (Google), ...

Design a compressed storage scheme that can trade in a principled way between space vs decompression time [vs energy efficiency]

Requirements: gzip-like compression [like Snappy by Google, or LZ4]

Goal: fix the space occupancy, then find the best compression that achieves that space and minimizes the decompression time (or vice versa)

[abrac] adabra -> [abrac] (a) (d) (abra) -> [abrac] <2,1> <0,d> <7,4>

<2,1> = copy back, <0,d> = new char, <7,4> = copy back
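The phrase notation can be checked with a tiny decoder. It assumes the convention suggested by the example: <d,l> with d > 0 copies l characters starting d positions back, and <0,c> emits the new character c. The function name and the tuple encoding are illustrative.

```python
def lz_decode(prefix, phrases):
    """Decode LZ77-style phrases appended to an already decoded prefix."""
    out = list(prefix)
    for d, x in phrases:
        if d == 0:
            out.append(x)                   # <0,c>: new character c
        else:
            start = len(out) - d            # <d,l>: copy l chars from d positions back
            for k in range(x):
                out.append(out[start + k])  # char by char, so overlapping copies work
    return "".join(out)


print(lz_decode("abrac", [(2, 1), (0, "d"), (7, 4)]))  # abracadabra
```

Decompression time depends on how many phrases there are and on how far back each copy reaches, while space depends on how the pairs are encoded; different parsings of the same text therefore trade one against the other.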

A preliminary step...

Modeled as a Constrained Shortest Path problem:
• Nodes = one per char of the text to be compressed
• Edges = single char or copy-back substrings
• 2 edge weights = decompression time (t) and compressed space (c)
• LZ-parsing = path from 1 to 12

NP-hard in general. This special case is polynomial: O(n^3)

n is huge, m might be n^2

We solved it heuristically (Lagrangian Dual) and provably (Path Swap)
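To make the graph model concrete, here is a small sketch of the inner step of a Lagrangian-style approach: for a fixed multiplier lam, the two edge weights collapse into the single weight t + lam * c, and the best parsing is a shortest path in the DAG, found by dynamic programming. This is not the algorithm of the talk, only an illustration of the model; the (time, space) costs `LIT` and `COPY` are hypothetical constants, nodes are numbered from 0, and the edge construction is naive.

```python
LIT = (1.0, 9.0)     # hypothetical (time, space in bits) of a single-char edge
COPY = (5.0, 20.0)   # hypothetical (time, space in bits) of a copy-back edge


def longest_copy(text, i):
    """Longest l such that text[i:i+l] also starts at some j < i (overlap allowed)."""
    best = 0
    for j in range(i):
        l = 0
        while i + l < len(text) and text[j + l] == text[i + l]:
            l += 1
        best = max(best, l)
    return best


def best_parse(text, lam):
    """Shortest path from node 0 to node n under the combined weight t + lam * c."""
    n = len(text)
    cost = [float("inf")] * (n + 1)
    back = [None] * (n + 1)      # back[v] = (previous node, (t, c) of the edge used)
    cost[0] = 0.0
    for i in range(n):           # nodes 0..n are already in topological order
        edges = [(i + 1, LIT)]
        edges += [(i + l, COPY) for l in range(1, longest_copy(text, i) + 1)]
        for v, (t, c) in edges:
            w = cost[i] + t + lam * c
            if w < cost[v]:
                cost[v] = w
                back[v] = (i, (t, c))
    # Walk the path backward to recover its total time, total space and phrases.
    time = space = 0.0
    phrases = []
    v = n
    while v > 0:
        u, (t, c) = back[v]
        time += t
        space += c
        phrases.append(text[u:v])
        v = u
    return time, space, phrases[::-1]


print(best_parse("abracadabra", 0.0))  # fastest parsing: 11 single chars
print(best_parse("abracadabra", 1.0))  # smaller but slower: ends with the copy "abra"
```

With lam = 0 only decompression time matters; raising lam puts more weight on space, so the chosen path moves toward smaller but slower parsings. A full solution would search for the multiplier that meets the given space budget.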

A preliminary step...

We mainly commented:
• String Matching: RAM model, char cmp and time
• 1990s, Data Bases: hierarchical memories and I/Os
• 2000s, Data Compression: space reduction in indexes and entropy space-bounds
• Graph Theory: space reduction in compressors
• Optimization: multi-objective design and joules
• 2010s, Computational Geometry: lower bounds on indexes, new upper bounds on I/Os, entropy

Nowadays…

A quote to conclude

“The distance between theory and practice is closer in theory than in practice” [Y. Matias, Google]

Big steps come from theory

... but do NOT forget practice ;-)

That’s all !
