Algorithms and data structures for big data,
what’s next?
Paolo Ferragina, University of Pisa
Is Big Data a buzzword?
“Big Data” vs “Grid Computing”
VLDB has existed since 1992
Big data, big impact !
Big data are everywhere !
[Procs OSDI 2006] NoSQL
HyperTable
Cassandra
Hadoop
Cosmos
From macro to micro-users
Energy is related to time and memory accesses in an intricate way, so the issue “algo + memory levels” is key for everyday users, not only for big players
Our driving moral...
Big steps come from theory
... but do NOT forget practice ;-)
Our running example
(String-)Dictionary Problem
Given a dictionary D of K strings, of total
length N, store them in a way that we can
efficiently support prefix searches for a
pattern P.
Exact search: Hashing
(Compacted) Trie
[Figure: compacted trie over the dictionary strings below]
systile syzygetic syzygial syzygy szaibelyite szczecin szomo
[Fredkin, CACM 1960]
Performance:
• Search ≈ O(|P|) time
• Space ≈ O(N)
Dominated the string-matching scene in the ‘80s-90s
Most known is the Suffix Tree
Software engineers objected:
• Search: random memory accesses
• Space: pointers + strings
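A minimal sketch of trie-based prefix search, assuming a plain (uncompacted) trie stored as nested Python dicts; the names `build_trie`, `prefix_search` and the `END` marker are illustrative and not from the slides. It also makes the engineers' objection visible: every character step follows a pointer to a different dict.

```python
END = ""  # hypothetical end-of-string marker; cannot clash with 1-char edge labels

def build_trie(strings):
    """Insert every string character by character into a dict-of-dicts trie."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def prefix_search(root, pattern):
    """Return all dictionary strings prefixed by pattern, in sorted order."""
    node = root
    for ch in pattern:            # O(|P|) steps, one pointer jump per character
        if ch not in node:
            return []
        node = node[ch]
    results, stack = [], [(node, pattern)]
    while stack:                  # visit the whole subtree below the pattern's node
        current, prefix = stack.pop()
        for label, child in current.items():
            if label == END:
                results.append(prefix)
            else:
                stack.append((child, prefix + label))
    return sorted(results)

words = ["systile", "syzygetic", "syzygial", "syzygy",
         "szaibelyite", "szczecin", "szomo"]
trie = build_trie(words)
print(prefix_search(trie, "syzyg"))
```

A compacted trie would merge unary paths into single edges, which reduces the number of nodes but not the random-access pattern.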
Lexicographic search
P = systo
Timeline: theory and practice...
’60: Trie
’70-’80: Suffix Tree
’90: What about Software Engineers ??
What did systems implement?
They used the compacted trie, of course, but with 2 additional concerns caused by large data
1st issue: space concern
http://checkmate.com/All_Natural/
http://checkmate.com/All_Natural/Applied.html
http://checkmate.com/All_Natural/Aroma.html
http://checkmate.com/All_Natural/Aroma1.html
http://checkmate.com/All_Natural/Aromatic_Art.html
http://checkmate.com/All_Natural/Ayate.html
http://checkmate.com/All_Natural/Ayer_Soap.html
http://checkmate.com/All_Natural/Ayurvedic_Soap.html
http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
http://checkmate.com/All_Natural/Bath_Salts.html
http://checkmate.com/All/Essence_Oils.html
http://checkmate.com/All/Mineral_Bath_Crystals.html
http://checkmate.com/All/Mineral_Bath_Salt.html
http://checkmate.com/All/Mineral_Cream.html
http://checkmate.com/All/Natural/Washcloth.html...
0 http://checkmate.com/All_Natural/
33 Applied.html
34 roma.html
38 1.html
38 tic_Art.html
34 yate.html
35 er_Soap.html
35 urvedic_Soap.html
33 Bath_Salt_Bulk.html
42 s.html
25 Essence_Oils.html
25 Mineral_Bath_Crystals.html
38 Salt.html
33 Cream.html
3345%
0 http://checkmate.com/All/Natural/Washcloth.html...
systile syzygetic syzygial syzygy….2,zygetic 5,ial 5,y
FrontCoding
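The front-coding idea above can be sketched as follows: each string of the sorted dictionary is replaced by the length of the prefix it shares with its predecessor plus the remaining suffix. The function names are illustrative assumptions, not part of the slides.

```python
def lcp_len(a, b):
    """Length of the longest common prefix of a and b."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i

def front_encode(sorted_strings):
    """Map each string to (prefix length shared with the previous string, rest)."""
    encoded, prev = [], ""
    for s in sorted_strings:
        k = lcp_len(prev, s)
        encoded.append((k, s[k:]))
        prev = s
    return encoded

def front_decode(encoded):
    """Rebuild the strings by copying k characters from the previous one."""
    decoded, prev = [], ""
    for k, suffix in encoded:
        s = prev[:k] + suffix
        decoded.append(s)
        prev = s
    return decoded

words = ["systile", "syzygetic", "syzygial", "syzygy"]
print(front_encode(words))
# [(0, 'systile'), (2, 'zygetic'), (5, 'ial'), (5, 'y')]
```

Decoding is inherently sequential, since every string depends on its predecessor; this is one reason to restart the encoding at bucket boundaries.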
2nd issue: Disk memory
[Figure: CPU, internal memory, and disk tracks accessed in blocks of size B]
2 main features:
• Seek time = I/Os are costly
• Blocked access = B items per I/O
Count I/Os
Strings may be arbitrarily long
Why are strings challenging ?
….0systile 2zygetic 5ial 5y 0szaibelyite 2czecin 2omo….
systile szaibelyite
CT on a sample
2-level indexing
Disk
Internal Memory
One main limitation: sampling rate & lengths of sampled strings
Trade-off between speed and space (because of bucket size)
2 advantages:
• Search ≈ typically 1 disk access
• Space ≈ Front-coding over buckets
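A minimal sketch of 2-level indexing under simplifying assumptions: the in-memory sample is kept as a sorted Python list searched with `bisect` (standing in for the compacted trie on the sample), and buckets are plain lists (standing in for front-coded disk blocks). The function names and the bucket size are illustrative.

```python
import bisect

def build_two_level(sorted_strings, bucket_size):
    """Split the sorted dictionary into buckets; sample the first string of each."""
    buckets = [sorted_strings[i:i + bucket_size]
               for i in range(0, len(sorted_strings), bucket_size)]
    sample = [bucket[0] for bucket in buckets]   # kept in internal memory
    return sample, buckets

def prefix_search(sample, buckets, pattern):
    """Locate the candidate bucket via the sample, then scan buckets from there."""
    # The last bucket whose first string is < pattern may still hold matches.
    start = max(bisect.bisect_left(sample, pattern) - 1, 0)
    hits = []
    for bucket in buckets[start:]:               # each bucket models one disk access
        for s in bucket:
            if s.startswith(pattern):
                hits.append(s)
            elif s > pattern:                    # past the range of matching strings
                return hits
    return hits

words = ["systile", "syzygetic", "syzygial", "syzygy",
         "szaibelyite", "szczecin", "szomo"]
sample, buckets = build_two_level(words, bucket_size=4)
print(sample)                                    # ['systile', 'szaibelyite']
print(prefix_search(sample, buckets, "syz"))
```

Larger buckets shrink the in-memory sample but lengthen the scan, which is the speed versus space trade-off noted above.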
(Prefix) B-tree
Timeline: theory and practice...
’60: Trie
’70-’80: Suffix Tree
’90: 2-level indexing (Space + Hierarchical Memory)
1995: String B-tree
Do we need to trade space for I/Os ?
An old idea: Patricia Trie
….systile syzygetic syzygial syzygy szaibelyite szczecin szomo….
[Figure: Patricia trie over the strings above]
[Morrison, J.ACM 1968]
Disk
A new (lexicographic) search
….systile syzygetic syzygial syzygy szaibelyite szczecin szomo….
[Figure: Patricia trie over the strings above; each edge keeps only its first character, each node its string depth]
Search(P):
• Phase 1: tree navigation
• Phase 2: Compute LCP
• Phase 3: tree navigation
Lexicographic search: P = syzytea
Lexicographic position
Only 1 string is checked on disk
Trie Space ≈ #strings, NOT their length
[Ferragina-Grossi, J.ACM 1999]
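The three search phases can be sketched as below. This is an illustrative in-memory version, not the authors' implementation: strings are assumed distinct and terminated with `"\0"` (so none is a prefix of another), the pattern is assumed not to contain `"\0"`, and the names `Node`, `build` and `blind_search` are hypothetical. Only one dictionary string is read in full (the "disk" check of Phase 2); the trie itself stores just branching characters and string depths.

```python
class Node:
    def __init__(self, depth, lo, hi):
        self.depth = depth      # length of the prefix shared by strings[lo:hi]
        self.lo, self.hi = lo, hi
        self.children = {}      # branching char -> child, in sorted order

def lcp_len(a, b):
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i

def build(strings, lo, hi):
    """Patricia trie over the sorted, terminated strings[lo:hi]."""
    if hi - lo == 1:
        return Node(len(strings[lo]), lo, hi)
    node = Node(lcp_len(strings[lo], strings[hi - 1]), lo, hi)
    i = lo
    while i < hi:
        c, j = strings[i][node.depth], i
        while j < hi and strings[j][node.depth] == c:
            j += 1
        node.children[c] = build(strings, i, j)
        i = j
    return node

def blind_search(root, strings, pattern):
    """Number of dictionary strings lexicographically smaller than pattern."""
    # Phase 1: navigate, comparing branching characters only.
    node = root
    while node.depth < len(pattern) and pattern[node.depth] in node.children:
        node = node.children[pattern[node.depth]]
    candidate = strings[node.lo]            # the only string fetched in full
    # Phase 2: LCP between the pattern and the candidate.
    l = lcp_len(pattern, candidate)
    # Phase 3: navigate again, down to the first node of depth >= l.
    v = root
    while v.depth < l:
        v = v.children[candidate[v.depth]]
    if l == len(pattern):                   # pattern prefixes all of strings[v.lo:v.hi]
        return v.lo
    c = pattern[l]                          # mismatching character
    if v.depth > l:                         # mismatch falls inside an edge
        return v.lo if c < candidate[l] else v.hi
    for ch, child in v.children.items():    # mismatch falls at a branching node
        if ch > c:
            return child.lo
    return v.hi

words = ["systile", "syzygetic", "syzygial", "syzygy",
         "szaibelyite", "szczecin", "szomo"]
D = sorted(w + "\0" for w in words)
root = build(D, 0, len(D))
print(blind_search(root, D, "syzytea"))     # 4: between syzygy and szaibelyite
```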
Disk
The String B-tree
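[Figure: String B-tree; each node stores a sequence of string pointers indexed by a Patricia trie (PT). Pointer sequences shown in the figure:]
29 1 9 5 2 26 10 4 7 13 20 16 28 8 25 6 12 15 22 18 3 27 24 11 14 21 17 23
29 2 26 13 20 25 6 18 3 14 21 23
29 13 20 18 3 23
Search(P):
• O((p/B) log_B K) I/Os = O(log_B K) levels, each checking 1 string in O(p/B) I/Os, to find the lexicographic position of P
• + O(occ/B) I/Os to report the occurrences
It is dynamic...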
[Ferragina-Grossi, J.ACM 1999]
> 15 US patents cite it !!
Knuth, vol. 3, p. 489: “elegant”
I/O-aware algorithms & data structures
[CACM 1988]
[2006]
Huge literature !!
I/Os were the main concern
Timeline: theory and practice...
’60: Trie
’70-’80: Suffix Tree
’90: 2-level indexing
1995: String B-tree
1999
[Figure: memory hierarchy: CPU registers, L1 and L2 cache, RAM, HD, net]
Cache-oblivious solutions, aka parameter-free algo+ds
Anywhere, anytime, anyway... I/O-optimal !!
Not just 2 memory levels
Timeline: theory and practice...
’60: Trie
’70-’80: Suffix Tree
’90: 2-level indexing
1995: String B-tree
1999:
• Not just 2 memory levels: Cache-oblivious data structures
• Space: Compressed data structures
Can we “automate” and “guarantee” the process ?
A challenging question [Ken Church, AT&T 1995]
Software Engineers use “squeezing heuristics” that compress data and still support fast access to them
Aka: Compressed self-indexes
Opportunistic Data Structures with Applications
P. Ferragina, G. Manzini
Space for text + index ≈ space for compressed text only (≈ Hk)
Query/Decompression time theoretically (quasi-)optimal
...now, J.ACM 2005
The big (unconscious) step...
[Burrows-Wheeler, 1994]
Let us be given a text T = mississippi#

All rotations of T:
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows (first column, middle, last column):
# mississipp i
i #mississip p
i ppi#missis s
i ssippi#mis s
i ssissippi# m
m ississippi #
p i#mississi p
p pi#mississ i
s ippi#missi s
s issippi#mi s
s sippi#miss i
s sissippi#m i
Highly compressible, but…
T → bwt(T) (the last column of the sorted matrix)
bzip2 = BWT + other simple compressors
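A minimal sketch of the transform on the running example, assuming the quadratic "sort all rotations" construction (fine for illustration, not for large texts) and a unique terminator `#` that is smaller than every other character. The function names are illustrative.

```python
def bwt(text):
    """Last column of the sorted matrix of all rotations of text."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)

def inverse_bwt(last, terminator="#"):
    """Rebuild the text from its BWT by following the LF-mapping backwards."""
    n = len(last)
    # Stable sort of L's positions gives the first column F; LF maps a row to
    # the row obtained by moving its last character to the front.
    order = sorted(range(n), key=lambda i: last[i])
    lf = [0] * n
    for f_pos, l_pos in enumerate(order):
        lf[l_pos] = f_pos
    chars, row = [], 0          # row 0 starts with the terminator
    for _ in range(n - 1):
        chars.append(last[row]) # character preceding the current row's first one
        row = lf[row]
    return "".join(reversed(chars)) + terminator

T = "mississippi#"
L = bwt(T)
print(L)                        # ipssm#pissii
print(inverse_bwt(L))           # mississippi#
```

The output groups equal characters into runs (`ss`, `ii`), which is what the simple compressors applied afterwards exploit.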
From practice to theory...
FM-index = BWT is searchable
...or Suffix Array is compressible
• Space = |T| Hk + o(|T|) bits
• Search(P) = O(p + occ * polylog(|T|))
Nowadays tons of papers: theory & experiments [Navarro-Makinen, ACM Comp. Surveys 2007]
[Ferragina-Manzini, IEEE FOCS ’00]
sa(T)  sorted rotation  bwt(T)
12     #mississipp      i
11     i#mississip      p
 8     ippi#missis      s
 5     issippi#mis      s
 2     ississippi#      m
 1     mississippi      #
10     pi#mississi      p
 9     ppi#mississ      i
 7     sippi#missi      s
 4     sissippi#mi      s
 6     ssippi#miss      i
 3     ssissippi#m      i
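Why the BWT is searchable can be illustrated with backward search, which counts the occurrences of P using only bwt(T). This is a simplified sketch: rank queries are answered by scanning a prefix of the string, whereas an actual FM-index answers them on a compressed representation. The function name is an assumption.

```python
from collections import Counter

def count_occurrences(last, pattern):
    """Count occurrences of pattern in T, given only last = bwt(T)."""
    # C[c] = number of characters in the text that are smaller than c.
    C, total = {}, 0
    for c, freq in sorted(Counter(last).items()):
        C[c] = total
        total += freq
    lo, hi = 0, len(last)       # rows prefixed by the current suffix of pattern
    for c in reversed(pattern): # process the pattern right to left
        if c not in C:
            return 0
        lo = C[c] + last[:lo].count(c)   # rank of c before row lo
        hi = C[c] + last[:hi].count(c)   # rank of c before row hi
        if lo >= hi:
            return 0
    return hi - lo

L = "ipssm#pissii"              # bwt of mississippi#
print(count_occurrences(L, "ssi"))   # 2
```

The final row range [lo, hi) is exactly the range of the suffix array whose entries are the occurrence positions.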
Compressed & Searchable data formats
After our paper in FOCS 2000, about texts, we nowadays find compressed indexes for:
• Trees
• Labeled trees and graphs
• Functions
• Integer Sets
• Geometry
• Images
• ...
From theory to practice…
December 2003
ACM J. on Experimental Algorithmics, 2009
> 10^3 times faster than Smith-Waterman
> 10^2 times faster than SOAP & Maq
What about the Web ?
[Ferragina-Manzini, ACM WSDM 2010]
An XML excerpt
<dblp>
  <book>
    <author> Donald E. Knuth </author>
    <title> The TeXbook </title>
    <publisher> Addison-Wesley </publisher>
    <year> 1986 </year>
  </book>
  <article>
    <author> Donald E. Knuth </author>
    <author> Ronald W. Moore </author>
    <title> An Analysis of Alpha-Beta Pruning </title>
    <pages> 293-326 </pages>
    <year> 1975 </year>
    <volume> 6 </volume>
    <journal> Artificial Intelligence </journal>
  </article>
  ...
</dblp>
IEEE FOCS 2005, WWW 2006, J. ACM 2009, US Patent 2012
A tree interpretation
XML document exploration = Tree navigation
XML document search = Labeled subpath searches
XBW transform
XBW Transform: Some performance figures
[Figure: number of searches per second, on larger and larger datasets]
Xerces better on smaller files
Xerces worse on larger files
Xerces uses 10x space
Where we are nowadays
’60: Trie
’70-’80: Suffix Tree
’90: 2-level indexing
1995: String B-tree
1999: Cache-oblivious data structures, Compressed data structures
Something is known... yet very preliminary
Lower Bounds derived from Geometry
Text search = 2d Range Search
New food for research..
[E. Gal, S. Toledo. ACM Comp. Surv., 2005]
[Ajwani et al, WEA 2009]
Solid-state disks: no mechanical parts ... very fast reads, but slow writes & wear leveling
Self-adjusting or Weighted design Time ops depend on some (un/known) distribution
Challenge: no pointers, self-adjust (perf) vs compression (space)
[Ferragina et al, ESA 2011]
40Gb, about 100$
The energy challenge
IEEE Computer, 2007
Browsing a web site, by JavaScript framework:

Browser   Prototype     Dojo          jQuery
Chrome    best choice   best choice   1.5%
FireFox   2.5%          4.8%          4.3%
IE        10.2%         8.5%          11%

The most used!
Yet today, it is a problem...
Apple is still working on the battery life problem: “The
recent iOS software update addressed many of the battery
issues that some customers experienced on their iOS 5 devices.
We continue to investigate a few remaining issues.” (Nov 2011,
wired.com)
“Windows 8's power hygiene: the scheduler will ignore the unused software” (Feb 2012, MSDN)
Energy-aware Algo+Ds ?
Locality pays off
Memory-level impacts
I/Os and compression are obviously important
BUT here there is a new twist
MIPS per Watt ? Battery life !!
Who cares whether your application:
1. is y% slower than optimal, but it is more energy efficient ?
2. takes x% more space than optimal, but it is more energy efficient ?
Idea: Multi-objective optimization in data-structure design
Approach in a principled way
A preliminary step
Took inspiration from BigTable (Google), ...
Design a compressed storage scheme that can trade in a principled way between
space vs decompression time [vs energy efficiency]
Requirements: gzip-like compression [like Snappy by Google, or LZ4]
Goal: Fix the space occupancy, find the best compression
that achieves that space and minimizes the decompression time (or vice versa)
[abrac] adabra -> [abrac] (a) (d) (abra) -> [abrac] <2,1> <0,d> <7,4>
<2,1> = copy back, <0,d> = new char, <7,4> = copy back
A preliminary step...
Modeled as a Constrained Shortest Path problem:
• Nodes = one per char of the text to be compressed (n is huge)
• Edges = single char or copy-back substrings (m might be n^2)
• 2 edge weights = decompression time (t) and compressed space (c)
NP-hard in general
This special case is POLY: O(n^3)
LZ-parsing = Path from 1 to 12
We solved heuristically (Lagrangian Dual) and provably (Path Swap)
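The graph model can be illustrated on a toy scale. The sketch below is not the Lagrangian-dual or path-swap method named above: it is a brute-force dynamic program over (node, space used) pairs, feasible only for tiny inputs. The edge costs (8 bits and 1 time unit per literal, 16 bits and 5 time units per copy) are made-up assumptions chosen to expose the trade-off, and nodes are numbered 0..n rather than 1..n+1.

```python
def parsing_edges(text, lit_cost=(8, 1), copy_cost=(16, 5)):
    """Edges (i, j, space, time): encode text[i:j] as a literal or as a copy."""
    n, edges = len(text), []
    for i in range(n):
        edges.append((i, i + 1) + lit_cost)
        for j in range(i + 1, n + 1):
            # A copy is allowed if text[i:j] also starts at an earlier position.
            if text.find(text[i:j], 0, j - 1) == -1:
                break               # no longer substring can match either
            edges.append((i, j) + copy_cost)
    return edges

def min_time_within_space(text, budget, **costs):
    """Minimum decompression time over all parsings using at most budget bits."""
    n = len(text)
    out = [[] for _ in range(n + 1)]
    for i, j, space, time in parsing_edges(text, **costs):
        out[i].append((j, space, time))
    best = [dict() for _ in range(n + 1)]   # best[i][space] = minimum time
    best[0][0] = 0
    for i in range(n):                      # edges only go forward
        for space, time in list(best[i].items()):
            for j, edge_space, edge_time in out[i]:
                s, t = space + edge_space, time + edge_time
                if s <= budget and t < best[j].get(s, float("inf")):
                    best[j][s] = t
    return min(best[n].values()) if best[n] else None

text = "abracadabra"
print(min_time_within_space(text, budget=88))   # 11: all literals
print(min_time_within_space(text, budget=72))   # 12: one copy of "abra"
```

Tightening the space budget forces the slower copy edge into the path, which is the space versus decompression-time trade-off the formulation is meant to control.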
A preliminary step...
We mainly commented:
• String Matching: RAM model, char comparisons and time
• 1990s, Data Bases: hierarchical memories and I/Os
• 2000s, Data Compression: space reduction in indexes and entropy space-bounds
• 2010s, Computational Geometry: lower bounds on indexes; new upper bounds on I/Os, entropy
Nowadays…
• Graph Theory: space reduction in compressors
• Optimization: multi-objective design and joules
A quote to conclude
“The distance between theory and practice is closer in theory than in practice” [Y. Matias, Google]
Big steps come from theory
... but do NOT forget practice ;-)
That’s all !