
Compress Me, Stupid!

Valentin Haenel

Freelance Consultant and Software Developer (@esc___)

23 July 2014 - EuroPython Berlin (EP14)

Version: 2014-EuroPython, https://github.com/esc/compress-me-stupid
This work is licensed under the Creative Commons Attribution-ShareAlike 3.0 License.

A Historical Perspective

The Memory Hierarchy – Up to end of 80’s

The Memory Hierarchy – 90’s and 2000’s

The Memory Hierarchy – 2010’s

Starving CPUs

The Status of CPU Starvation in 2014:

- Memory latency is much higher (between 100x and 500x) than processor cycle times.
- Memory bandwidth is improving at a better rate than memory latency, but it still lags processors (between 30x and 100x).
- Net effect: CPUs are often waiting for data.

It’s the memory, Stupid

Problem: It’s the memory, Stupid! [1]

Solution: Compress me, Stupid!

[1] R. Sites. It's the memory, stupid! Microprocessor Report, 10(10), 1996.

Blosc

- Designed for: in-memory compression
- Addresses: the starving CPU problem
- (In fact, it also works well in general-purpose scenarios)
- Written in: C

Faster-than-memcpy

[Figure: Compression speed (256.0 MB, 8 bytes, 19 bits) in MB/s vs. compression ratio, for 1 to 12 threads, compared against memcpy (write to memory).]

Faster-than-memcpy

[Figure: Decompression speed (256.0 MB, 8 bytes, 19 bits) in MB/s vs. compression ratio, for 1 to 12 threads, compared against memcpy (read from memory).]

Blosc is a Metacodec

- Blosc does not actually compress anything itself; it handles:
  - Cutting data into blocks
  - Application of filters
  - Management of threads
- It can use 'real' codecs under the hood (see the sketch below for how these knobs appear in the Python API).
  - Filters and codecs are applied to each block (blocking)
  - Thread-level parallelism across blocks
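These stages map onto a handful of knobs in the Python API. A minimal sketch; the parameter values are illustrative, not tuned recommendations:

>>> import blosc
>>> previous = blosc.set_nthreads(4)             # thread-level parallelism across blocks
>>> data = bytes(bytearray(range(256)) * 4096)   # 1 MiB of sample bytes
>>> packed = blosc.compress(data,
...                         typesize=8,          # element size the shuffle filter assumes
...                         clevel=5,            # compression level passed to the inner codec
...                         shuffle=True,        # apply the shuffle filter per block
...                         cname='blosclz')     # the 'real' codec used under the hood
>>> blosc.decompress(packed) == data
True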

Shuffle Filter

- Reorganization of bytes within a block
- Reorder by byte significance

Shuffle Filter Example – Setup

Imagine we have the following array as uint64 (8-byte unsigned integers):

[0, 1, 2, 3]

Reinterpret this as uint8 (on a little-endian machine):

[0, 0, 0, 0, 0, 0, 0, 0,
 1, 0, 0, 0, 0, 0, 0, 0,
 2, 0, 0, 0, 0, 0, 0, 0,
 3, 0, 0, 0, 0, 0, 0, 0]

Shuffle Filter Example – Application

What the shuffle filter does is:

[0, 1, 2, 3, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0]

Which, reinterpreted as uint64, is:

[50462976, 0, 0, 0]
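This byte transposition is easy to reproduce with NumPy. A minimal sketch, assuming a little-endian machine; the shuffle_bytes helper is illustrative only and not part of Blosc:

>>> import numpy as np
>>> def shuffle_bytes(arr):
...     """Group byte 0 of every element, then byte 1, and so on."""
...     b = arr.view(np.uint8).reshape(len(arr), arr.itemsize)
...     return b.T.copy().reshape(-1)
...
>>> a = np.array([0, 1, 2, 3], dtype=np.uint64)
>>> list(shuffle_bytes(a)[:8])            # first bytes of all elements come first
[0, 1, 2, 3, 0, 0, 0, 0]
>>> list(shuffle_bytes(a).view(np.uint64))
[50462976, 0, 0, 0]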

Shuffle Filter Benefits

- Works well for multi-byte data with small differences between elements (e.g. time series)
- Exploits similarity between elements:
  - Lumps together bytes that are alike
  - Creates longer streams of similar bytes
  - Better for compression
- The shuffle filter is implemented using SSE2 instructions

Shuffle Fail

It does not work well on all datasets. Observe:

[18446744073709551615, 0, 0, 0]

Or, as uint8:

[255, 255, 255, 255, 255, 255, 255, 255,
 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0]

Shuffle Fail in Action

When shuffled, this yields:

[1095216660735, 1095216660735, 1095216660735, 1095216660735]

Or, as uint8:

[255, 0, 0, 0, 255, 0, 0, 0,
 255, 0, 0, 0, 255, 0, 0, 0,
 255, 0, 0, 0, 255, 0, 0, 0,
 255, 0, 0, 0, 255, 0, 0, 0]
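Reusing the illustrative shuffle_bytes sketch from above reproduces this:

>>> a = np.array([18446744073709551615, 0, 0, 0], dtype=np.uint64)
>>> list(shuffle_bytes(a)[:8])            # the 255s are now spread out every 4 bytes
[255, 0, 0, 0, 255, 0, 0, 0]
>>> list(shuffle_bytes(a).view(np.uint64))
[1095216660735, 1095216660735, 1095216660735, 1095216660735]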

OK, so what else is under the hood?

- By default it uses BloscLZ (derived from FastLZ)
- Alternative codecs (those available in a given build can be listed at runtime, as shown below):
  - LZ4 / LZ4HC
  - Snappy
  - Zlib
- Support for other codecs (LZO, LZF, QuickLZ, LZMA) is possible, but needs to be implemented.
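A quick runtime check; the output below assumes a build with all optional codecs enabled:

>>> import blosc
>>> blosc.compressor_list()       # names accepted as cname=...
['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib']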

Blosc + X

So... using Blosc + X can yield higher compression ratios (thanks to the shuffle filter) and faster compression/decompression times (thanks to multithreading).

That’s pretty neat!

Python-Blosc

Python API

- It's a codec, so naturally there is a compress/decompress pair.
- It can operate on byte strings or on pointers (encoded as integers): compress vs. compress_ptr (see the sketch after this list).
- Tutorial: http://python-blosc.blosc.org/tutorial.html
- API documentation: http://python-blosc.blosc.org/
- Implemented as a C extension using the Python C-API.
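For NumPy arrays, the pointer interface avoids an intermediate copy of the input buffer. A minimal sketch; the exact compress_ptr/decompress_ptr signatures should be checked against the API documentation linked above:

>>> import numpy as np
>>> import blosc
>>> a = np.linspace(0, 100, int(1e7))
>>> address = a.__array_interface__['data'][0]      # raw pointer to the array's data
>>> packed = blosc.compress_ptr(address, a.size,
...                             typesize=a.itemsize,
...                             clevel=9,
...                             shuffle=True,
...                             cname='lz4')
>>> out = np.empty_like(a)                          # destination buffer of the same shape
>>> nbytes = blosc.decompress_ptr(packed, out.__array_interface__['data'][0])
>>> (out == a).all()
True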

Example – Setup

>>> import numpy as np
>>> import blosc
>>> import zlib

>>> bytes_array = np.linspace(0, 100, 1e7).tostring()
>>> len(bytes_array)
80000000

Example – Compress

>>> %timeit zpacked = zlib.compress(bytes_array, 9)
1 loops, best of 3: 14.7 s per loop

>>> %timeit bzpacked = blosc.compress(bytes_array,
...                                   typesize=8,
...                                   cname='zlib',
...                                   clevel=9)
1 loops, best of 3: 317 ms per loop

Example – Ratio

>>> zpacked = zlib.compress(bytes_array, 9)
>>> len(zpacked)
52945925

>>> bzpacked = blosc.compress(bytes_array,
...                           typesize=8,
...                           cname='zlib',
...                           clevel=9)
>>> len(bzpacked)
1011304

>>> len(bytes_array) / len(zpacked)
1.5109755849954458
>>> len(bytes_array) / len(bzpacked)
79.10578817052044
>>> len(zpacked) / len(bzpacked)
52.35411409427828

Example – Decompress

>>> %timeit zupacked = zlib.decompress(zpacked)
1 loops, best of 3: 388 ms per loop

>>> %timeit bupacked = blosc.decompress(bzpacked)
10 loops, best of 3: 76.2 ms per loop

Example – Demystified

- Blosc works really well for the linspace dataset
- The shuffle filter and multithreading bring the benefits

Example – Speed Demystified

- Use a single thread and deactivate the shuffle filter:

>>> blosc.set_nthreads(1)
>>> %timeit bzpacked = blosc.compress(bytes_array,
...                                   typesize=8,
...                                   cname='zlib',
...                                   clevel=9,
...                                   shuffle=False)
1 loops, best of 3: 12.9 s per loop

Example – Ratio Demystified

>>> bzpacked = blosc.compress(bytes_array,
...                           typesize=8,
...                           cname='zlib',
...                           clevel=9,
...                           shuffle=False)
>>> len(zpacked) / len(bzpacked)
0.9996947439311876

So, What about other Codecs? – Compress

- Zlib implements a comparatively slow algorithm (DEFLATE), so let's try LZ4:

>>> %timeit bzpacked = blosc.compress(bytes_array,
...                                   typesize=8,
...                                   cname='zlib',
...                                   clevel=9)
1 loops, best of 3: 329 ms per loop

>>> %timeit blpacked = blosc.compress(bytes_array,
...                                   typesize=8,
...                                   cname='lz4',
...                                   clevel=9)
10 loops, best of 3: 20.9 ms per loop

So, What about other Codecs? – Ratio

- This speed increase comes at the cost of compression ratio, though:

>>> bzpacked = blosc.compress(bytes_array,
...                           typesize=8,
...                           cname='zlib',
...                           clevel=9)
>>> blpacked = blosc.compress(bytes_array,
...                           typesize=8,
...                           cname='lz4',
...                           clevel=9)
>>> len(bzpacked) / len(blpacked)
0.172963927766

So, What about other Codecs? – Decompress

>>> %timeit bzupacked = blosc.decompress(bzpacked)
10 loops, best of 3: 74.3 ms per loop

>>> %timeit blupacked = blosc.decompress(blpacked)
10 loops, best of 3: 25.3 ms per loop

C-extension Notes

- Uses _PyBytes_Resize to resize the output string after compressing into it.
- Releases the GIL before compression and decompression.

Installation and Compilation

Installation via Package – PyPI/pip

Using pip (inside a virtualenv):

$ pip install blosc

Provided you have a C++ (not just C) compiler...

Installation via Package – binstar/conda

Using conda:

$ conda install -c https://conda.binstar.org/esc python-blosc

Experimental; NumPy 1.8 / Python 2.7 only...

Compilation / Packaging

Blosc is a metacodec and as such has various dependencies.

Compilation / Packaging – Flexibility is Everything

- Blosc uses CMake and ships with all codec sources:
  - It tries to link against an existing codec library
  - If none is found, it uses the shipped sources
- Python-Blosc comes with the Blosc sources:
  - Compile everything into the Python module
  - Or link against an installed Blosc library
- This should be beneficial for packagers

Outro

Other Projects that use Blosc

- PyTables: HDF5 library for Python
- Bloscpack: simple file format and Python implementation
- bcolz: in-memory and out-of-core compressed array-like structure

The Future

- What might be coming...
  - More codecs
  - Alternative filters
  - Auto-tune at runtime
  - Multi-shuffle
  - A Go implementation
- How can I help?
  - Run the benchmarks on your hardware and report the results: http://blosc.org/synthetic-benchmarks.html
  - Incorporate Blosc into your application

Advertisement

- EuroPython:
  - Francesc Alted - Out of Core Columnar Datasets - Friday 11:00, C01
- PyData Berlin:
  - Francesc Alted - Data Oriented Programming - Saturday 13:30, B05
  - Valentin Haenel - Fast Serialization of Numpy Arrays with Bloscpack - Sunday 11:00 am, B05

Getting In Touch

- Main website: http://blosc.org
- GitHub organization: http://github.com/Blosc
- python-blosc: http://github.com/Blosc/python-blosc
- Google group: https://groups.google.com/forum/#!forum/blosc
- This talk: https://github.com/esc/compress-me-stupid
