Big Data for Big Questions

Cliff Click, CTO, 0xdata
cliffc@0xdata.com
http://0xdata.com
http://cliffc.org/blog

● Motivation: What & Why Big Math?
● Better Mousetrap
● Demo
● Fork: Deep Dive into Math Hacking ...or... K/V Store
Source: https://github.com/0xdata/h2o
0xdata.com 3
42! What was the question again?

Oh yeah, it was:
● How do I place ads based on a clickstream?
● Detect fraud in a credit-card swipe stream?
● Detect cancer from sensor data?
● Predict equipment failure ahead of time?
● Find people (un)like me?
● ... or ... or ... or... ????
How do I figure it all out?
● Well... what are my tools?
● Domain Knowledge
– (me! The Expert)
● Math & Science! Data Science, and
● Data – lots and lots and lots of it
– Old logs, new logs, databases, historical records, click-streams, CSV files, dumps
– Often TBs, sometimes PBs of it
Data: The Main Player
● Data: I got lots of it
● But it's a messy, mixed-up lot
– Stored in HDFS, S3, DB2, or scattered about
– Incompatible formats, older & newer bits
– Missing stuff, or "known broken" fields
● And it's Big
– Too big for my laptop, or even one server
Data: Cleaning it Up
● Just the parts I want:
– SQL, Hive, HBase, grep
– Data is Big, so this is slow
● Wrong format:
– Awk, shell scripts, files, disk-to-disk
● Inspection (did I get it right yet?)
– Grep/awk, histograms, plots/prints
– Visualization tools
From Facts to Knowledge
● Data cleaned up: lots of neat rows of facts
– Lots of rows: millions and billions...
● But facts are not knowledge
– Too much to "get it" by looking
● Time for a mathematical Model!
– Here again, Big limits my tools
– Either they can't deal, or they deal very, very slowly
Modeling: math(data)
● Modeling gives a simpler view
– A way to understand
– And predict in real time
● Modeling is Math!
– Generalized Linear Modeling: oldest, most well known & used
– Random Forest
– K-Means Clustering
Big Data vs Modeling
● Model: a concise description of my data
– A more accurate model predicts better
● Generally More Data builds a better Model
– But only if the tool can handle it
– (some datasets are not helped, but it rarely hurts)
● Tools can't handle Big: so down-sample, and use a better (more complex) algorithm
Big Data vs Better Algorithm
● Don't want to choose Big vs Better
– Down-sampling loses information
● Want a way to manipulate Big Data like it's small: interactive & fast. Subtle when I need it, brute force when I don't
● Build the Better Algorithm and use Big Data
– Seeing 10x more data yields prediction increases, e.g. from 75% to 85%
Building The Better Big Data Mousetrap
● Want fast: means DRAM instead of disk
– Fall back to disk if data >>> DRAM
● Want fast: use all CPUs
– Problems are mostly data-parallel anyway
● Want ease-of-programming:
– "parallelism without effort"
– Well-understood programming model
Building The Better Big Data Mousetrap

● Want ease-of-use:
– Python, JSON, REST/HTML interfaces
– Full R semantics (via the fastr project)
● Data ingest:
– where: HDFS, S3, NFS, URL, URI, browser
– what: CSV, Hive, RData
Building The Better Big Data Mousetrap
● Want ease-of-admin:
– e.g. java -jar h2o.jar
– auto-cluster (no config at all) or Hadoop Job
● Want ease-of-upgrade: adding more servers gives
– More CPU (faster exec)
– More DRAM (larger data in DRAM)
– More network/disk bandwidth (faster ingest)
H2O: An Engine for Big Math
● Built in layers – pick your abstraction level
● Analysts, starters: REST, browser
– "clicky clicky": load data, build model, score
● Scientists: R, JSON, Python to drive the engine
– Complex math
● Math hackers: building new algos
– Full (distributed) Java Memory Model
– "codes like Java, runs distributed"
● Core Engineering: call us, we're hiring
Core Engineering: K/V Store
● Classic distributed Key/Value store
– get/put/atomic-transaction
– Full JMM semantics, exact consistency
– Full caching as-needed
– Cached keys "get" in 150 nanos
– Misses limited by network speed
● Hardware-like cache-coherency protocol
● Distributed fork/join (thanks, Doug Lea)
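As a rough single-node analogy of the get/put/atomic-transaction interface above (the class name, ConcurrentHashMap backing, and method shapes here are illustrative only, not H2O's actual API):

```java
import java.util.concurrent.ConcurrentHashMap;

// Single-node analogy of a K/V store offering get/put plus an 'atomic'
// read-modify-write transaction on one key. Names and backing store are
// illustrative; H2O's real store distributes the same contract across nodes.
public class KVSketch {
    private final ConcurrentHashMap<String, byte[]> store = new ConcurrentHashMap<>();

    public byte[] get(String key)               { return store.get(key); }
    public byte[] put(String key, byte[] value) { return store.put(key, value); }

    // 'atomic': apply an update function under the map's per-key atomicity,
    // the same contract a lock-free store offers on a single Key.
    public byte[] atomic(String key, java.util.function.UnaryOperator<byte[]> fn) {
        return store.compute(key, (k, old) -> fn.apply(old));
    }
}
```

The point of 'atomic' is that callers never take a lock: they hand in an update function and the store serializes racing updates on that one key.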
Core Engineering: D/F/J
● Distributed fork/join (JSR 166y)
● Recursive-descent for data-parallel work
● Distribution handled by the core
– Log-tree scatter/gather across the cluster
● Supports map/reduce-style directly
● But also "do this on all nodes" style
● Or random graph hacking
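The recursive-descent shape is plain JSR 166y fork/join; here is a single-JVM sketch of it (H2O's D/F/J layer applies the same split-then-roll-up pattern across cluster nodes, which this example does not do):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Recursive descent: split the index range in half until it is small,
// sum the halves in parallel, then combine in a log-depth tree.
public class SumTask extends RecursiveTask<Double> {
    private final double[] data;
    private final int lo, hi;
    public SumTask(double[] data, int lo, int hi) { this.data = data; this.lo = lo; this.hi = hi; }

    @Override protected Double compute() {
        if (hi - lo <= 1024) {             // small enough: do it serially
            double sum = 0;
            for (int i = lo; i < hi; i++) sum += data[i];
            return sum;
        }
        int mid = (lo + hi) >>> 1;         // recursive descent: split in half
        SumTask left = new SumTask(data, lo, mid);
        left.fork();                       // left half runs on another worker
        double right = new SumTask(data, mid, hi).compute();
        return left.join() + right;        // log-tree roll-up of the halves
    }

    public static double sum(double[] data) {
        return ForkJoinPool.commonPool().invoke(new SumTask(data, 0, data.length));
    }
}
```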
Math Hacking
● “Tastes like (distributed) Java”
● Big “vector math” is easy
● The obvious for-loop "just works"
(actual inner loop, auto-parallel, auto-distributed)

  for( int i=0; i<rows; i++ ) {
    double X = ary.datad(bits,i,A);
    double Y = ary.datad(bits,i,B);
    _sumX  += X;
    _sumY  += Y;
    _sumX2 += X*X;
  }
Math Hacking
● Dense-vector algorithms are easy
– Generalized Linear Modeling: 2 weeks
– K-means: 2 days
– Histogram: 2 hours
● Random Forest: not dense vectors
– Still makes good use of D/F/J
– All CPUs, all nodes still light up
– Very fast tree building
Science: dancing with the data
● Like the belle of the ball, the main algos (GLM, k-means, RF) only arrive when the data is properly dressed
● Munging data: dropping junk columns, replacing missing bits, adding features
● H2O provides a tool-kit
– Big vector calculator: "d := a+b*c"
– DRAM speeds: "msec per GByte"
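What the vector-calculator expression "d := a+b*c" computes, sketched element-wise on plain double[] arrays with a parallel stream (the class and method names are illustrative; H2O evaluates the same expression over distributed, DRAM-cached column chunks):

```java
import java.util.stream.IntStream;

// Element-wise d = a + b*c over equal-length columns, parallel across CPUs.
public class VecCalc {
    public static double[] aPlusBTimesC(double[] a, double[] b, double[] c) {
        double[] d = new double[a.length];
        IntStream.range(0, a.length).parallel()
                 .forEach(i -> d[i] = a[i] + b[i] * c[i]);
        return d;
    }
}
```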
Science: APIs
● Need to script, automate repetitive tasks
● R via fastr and the bigmemory package
– Full R semantics, 5x R speed single-threaded
– But your vectors can be very, very big...
– https://github.com/allr/fastr
● REST / URL / JSON
– Drive from e.g. Python, scripts, curl, wget
– e.g. the h2o testing harness is all Python
Demos & Quick Starts
● Full browser interface
– Tutorials
– A handful of clicks to run e.g. RF or GLM on gigabytes of data
● Auto-cluster in seconds
– On EC2 (or your laptops right now)
● Good enough for serious work
– (and we have customers using this interface!)
Demo Time!
H2O: An Engine for Big Math
● Focus on Big Math
● Easy to extend via M/R or K/V programming
● Auto-cluster
● Data-parallel exec across all CPUs
● DRAM caching across all servers
● Parallel ingest across all servers
● Open source: https://github.com/0xdata/h2o
Math Hacking: The M/R API
● Make a 'golden object'
– Will be endlessly replicated across the cluster
● Set 'input' fields:
– Auto-serialized, distributed
– Shallow-copy on nodes: e.g. arrays share state
● golden.map(key_1mb)
– map() called on a clone for each 1MB
– Set 'output' fields now
0xdata.com 31
Math Hacking: The M/R API
● gold.reduce(gold)
– Combine pairs of 'golden' objects
– Both locally and remotely (distributed)
– Log-tree roll-up
● 'output' fields will be shipped over the wire
– null-out 'input' fields
– transient marker available
0xdata.com 32
Math Hacking: Example
  CalcSumsTask cst = new CalcSumsTask();
  cst._arykey = ary._key;   // BigData Table key
  cst._colA = colA;         // integer indices to columns
  cst._colB = colB;
  cst.invoke(ary._key);     // Do It!
  // Results returned directly in 'cst' object
  ...cst._sumX...           // use results

  public static class CalcSumsTask extends MRTask {
    Key _arykey;                  // BigData Table key
    int _colA, _colB;             // Column indices to work on
    double _sumX, _sumY, _sumX2;  // Sum of X's, Y's, X^2's
0xdata.com 33
Math Hacking: Example
  public static class CalcSumsTask extends MRTask {
    Key _arykey;                  // BigData Table key
    int _colA, _colB;             // Column indices to work on
    double _sumX, _sumY, _sumX2;  // Sum of X's, Y's, X^2's

    // map called for every 1MB of data, or so
    public void map( Key key1Mb ) {
      ... boiler plate ...        // lots of unimportant details
      // Standard for-loop over the data
      for( int i=0; i<rows; i++ ) {
        double X = ary.datad(bits,i,A);
        double Y = ary.datad(bits,i,B);
        _sumX  += X;
        _sumY  += Y;
        _sumX2 += X*X;
      }
    }
0xdata.com 34
Math Hacking: Example
  public static class CalcSumsTask extends MRTask {
    Key _arykey;                  // BigData Table key
    int _colA, _colB;             // Column indices to work on
    double _sumX, _sumY, _sumX2;  // Sum of X's, Y's, X^2's

    // reduce called between pairs of golden objects;
    // always reduce the right side into 'this' object
    public void reduce( DRemoteTask rt ) {
      CalcSumsTask cst = (CalcSumsTask)rt;
      _sumX  += cst._sumX;
      _sumY  += cst._sumY;
      _sumX2 += cst._sumX2;
    }
  }
0xdata.com 35
A Fast K/V Store
● Distributed in-memory K/V Store
● Peer-to-peer, no master
● Full JMM semantics, get/put/atomic/remove
● Hardware-style cache-coherency protocol
● Fast: 150 nanos for a cache-hitting 'get'
● Fast: 50 micros for a cache-missing 'put'
● No persistence (see above for 'fast')
● No locks: use 'atomic' instead
0xdata.com 36
K/V Design Goals
● JMM semantics on all get/put
● Cache-hitting 'gets' as fast as possible
– Local hashtable lookup + a few tests
● 'puts' as lazy as possible (still JMM)
– Typically do not block for remote put
● Arbitrary transactions on single Keys
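The single-Key transaction contract can be sketched as a compare-and-swap retry loop: the caller supplies an update function and the store retries it until a CAS wins, instead of taking a lock. Here an AtomicReference stands in for one Key's value slot; in the real store the race is resolved at the Key's master node.

```java
import java.util.concurrent.atomic.AtomicReference;

// Lock-free single-key transaction: retry the update function until
// compareAndSet succeeds against an unchanged old value.
public class AtomicTxn {
    public static <T> T update(AtomicReference<T> slot, java.util.function.UnaryOperator<T> fn) {
        while (true) {
            T old = slot.get();
            T neu = fn.apply(old);   // may run more than once under contention
            if (slot.compareAndSet(old, neu)) return neu;
        }
    }
}
```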
0xdata.com 37
K/V Coherency Protocol
● Many are possible
– Picked a {fast-enough, easy} one
– Faster is possible
● Every Key has 1 master node
– And everybody knows it from the Key hash
● Master orders racing writes
– Winner of NBHM insert
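"Everybody knows it from the Key hash" means any node can compute a Key's master deterministically, with no lookup traffic. A minimal sketch (the modulo scheme and names here are illustrative; the real hash-to-node mapping is an implementation detail):

```java
// Map a Key's bytes to its master node id, deterministically and identically
// on every node in a cluster of clusterSize machines.
public class KeyHome {
    public static int masterOf(byte[] key, int clusterSize) {
        return Math.floorMod(java.util.Arrays.hashCode(key), clusterSize);
    }
}
```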
0xdata.com 38
K/V Coherency Protocol
● Master tracks replicas
– Single CAS update
● Invalidate replicas on update
– A single CAS required, plus the invalidates
– Cache miss on a replica will reload
● Interlocking get/put races solved with a finite state machine
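A toy model of the invalidate-on-update idea: the master remembers which nodes hold a replica of a Key, a write invalidates them all, and an invalidated reader misses and reloads from the master. The node ids and Set-based replica tracking are illustrative only, and this single-JVM sketch omits the messaging and the get/put race FSM entirely.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Master-side bookkeeping for invalidate-on-update cache coherency.
public class InvalidateSketch {
    private final ConcurrentHashMap<String, byte[]> master = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<String, Set<Integer>> replicas = new ConcurrentHashMap<>();

    public byte[] read(String key, int node) {
        // Record that 'node' now holds a replica, then serve the value.
        replicas.computeIfAbsent(key, k -> ConcurrentHashMap.newKeySet()).add(node);
        return master.get(key);
    }

    public void write(String key, byte[] value) {
        master.put(key, value);              // single update at the master...
        Set<Integer> stale = replicas.remove(key);
        if (stale != null) { /* ...plus an invalidate message to each node in 'stale' */ }
    }
}
```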
0xdata.com 39
K/V Coherency Protocol
0xdata.com 40
Backup Slides
0xdata.com 41
The Expert
● Domain Expert:
– What data is useful, which is trash
– What needs help to become useful
– Missing elements? Toss outliers?
– Build new features from old?
● All through this process Big Data is, well, Big, hence slow to cp / awk / grep
● And Big limits my tools