109
Copyright © 20112013, Dean Wampler, All Rights Reserved From Big Data to Big InformaBon NoSQL Search RoadShow San Francisco, June 6, 2013 [email protected] @deanwampler polyglotprogramming.com/talks Thursday, June 6, 13 Copyright © Dean Wampler, 2011-2013, All Rights Reserved. Photos can only be used with permission. Otherwise, the content is free to use. Photo: San Francisco Bay, just south of the airport in Burlingame, before sunrise.

FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

From  Big  Datato  Big  InformaBon

NoSQL  Search  RoadShow  San  Francisco,  June  6,  [email protected]@deanwampler  polyglotprogramming.com/talks

Thursday, June 6, 13

Copyright © Dean Wampler, 2011-2013, All Rights Reserved. Photos can only be used with permission. Otherwise, the content is free to use.Photo: San Francisco Bay, just south of the airport in Burlingame, before sunrise.

Page 2: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

Who  am  I?

Dean Wampler, Jason Rutherglen &

Edward Capriolo

HiveProgramming

Dean Wampler

Functional Programming

for Java Developers

Thursday, June 6, 13

My books…

Photo: San Francisco Bay, Burlingame, around sunrise.

Page 3: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

Why  this  talk?

Hadoop  hype  cycle:But  what  gives  usactual  value?

3

Thursday, June 6, 13This talk reflects my experiences working on “Big Data” projects, mostly using Hadoop, with many clients.People sometimes jump on this bandwagon to avoid “being left behind” or to boost their career. Sometimes, it doesn’t make sense for their actual needs. Big Data itself doesn’t do you any good. It’s the information we extract that’s important, so let’s see how different technologies address specific problems.

Photo: Photo: San Francisco Bay, Burlingame, around sunrise.

Page 4: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved4

What  Is  Big  Data?

Thursday, June 6, 13Photo: Photo: San Francisco Bay, Burlingame, around sunrise.

Page 5: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

Big  DataData  so  big  that  

tradiBonal  soluBons  are  too  slow,  too  small,  or  too  expensive  to  use.

5

Hat tip: Bob Korbus

Thursday, June 6, 13

It’s a buzz word, but generally associated with the problem of data sets too big to manage with traditional SQL databases. A parallel development has been the NoSQL movement that is good at handling semistructured data, scaling, etc.

Page 6: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

3  Trends

Thursday, June 6, 13Three trends to organizer our thinking…

Photo: Gull on a pier near Fort Mason, SF

Page 7: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved7

Data  Size  ⬆

Thursday, June 6, 13Data volumes are obviously growing… rapidly.Facebook now has over 600PB (Petabytes) of data in Hadoop clusters!

Page 8: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved8

Formal  Schemas  ⬇

Thursday, June 6, 13There is less emphasis on “formal” schemas and domain models, i.e., both relational models of data and OO models, because data schemas and sources change rapidly, and we need to integrate so many disparate sources of data. So, using relatively-agnostic software, e.g., collections of things where the software is more agnostic about the structure of the data and the domain, tends to be faster to develop, test, and deploy. Put another way, we find it more useful to build somewhat agnostic applications and drive their behavior through data...

Page 9: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved9

Data-­‐Driven  Programs  ⬆

Thursday, June 6, 13This is the 2nd generation “Stanley”, the most successful self-driving car ever built (by a Google-Stanford) team. Machine learning is growing in importance. Here, generic algorithms and data structures are trained to represent the “world” using data, rather than encoding a model of the world in the software itself. It’s another example of generic algorithms that produce the desired behavior by being application agnostic and data driven, rather than hard-coding a model of the world. (In practice, however, a balance is struck between completely agnostic apps and some engineering towards for the specific problem, as you might expect...)

Page 10: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved10

Probabilistic Models vs. Formal Grammars

tor.com/blogs/...

Thursday, June 6, 13An interesting manifestation of the last two points is the public argument between Noam Chomsky and Peter Norvig on the nature of language. Chomsky long ago proposed a hierarchical model of formal language grammars. Peter Norvig is a proponent of probabilistic models of language. Indeed all successful automated language processing systems are probabilistic.http://www.tor.com/blogs/2011/06/norvig-vs-chomsky-and-the-fight-for-the-future-of-ai

Page 11: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

Big  DataArchitectures

Thursday, June 6, 13What should software architectures look like for these kinds of systems?Photo: Light on the Boardwalk and Coit Tower.

Page 12: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved12

Object Model

toJSONParentB1

toJSONChildB1

toJSONChildB2 Object-

Relational Mapping

Other, Object-Oriented

Domain Logic

Database

Query

SQL

Result Set

Objects

1

2

3

4

Thursday, June 6, 13Traditionally, we’ve kept a rich, in-memory domain model requiring an ORM to convert persistent data into the model. This is resource overhead and complexity we can’t afford in big data systems. Rather, we should treat the result set as it is, a particular kind of collection, do the minimal transformation required to exploit our collections libraries and classes representing some domain concepts (e.g., Address, StockOption, etc.), then write functional code to implement business logic (or drive emergent behavior with machine learning algorithms…)

The toJSON methods are there because we often convert these object graphs back into fundamental structures, such as the maps and arrays of JSON so we can send them to the browser!

Page 13: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved13

Object Model

toJSONParentB1

toJSONChildB1

toJSONChildB2 Object-

Relational Mapping

Other, Object-Oriented

Domain Logic

Database

Query

SQL

Result Set

Objects

1

2

3

4

Relational/Functional

Domain Logic

Database

Query

SQL

Result Set

1

2

Functional Wrapper for

Relational Data

3

Functional Abstractions

Thursday, June 6, 13But the traditional systems are a poor fit for this new world: 1) they add too much overhead in computation (the ORM layer, etc.) and memory (to store the objects). Most of what we do with data is mathematical transformation, so we’re far more productive (and runtime efficient) if we embrace fundamental data structures used throughout (lists, sets, maps, trees) and build rich transformations into those libraries, transformations that are composable to implement business logic.

Page 14: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved14

Relational/Functional

Domain Logic

Database

Query

SQL

Result Set

1

2

Functional Wrapper for

Relational Data

3

Functional Abstractions

• Focus on:

• Lists

• Maps

• Sets

• Trees

• ...

Thursday, June 6, 13But the traditional systems are a poor fit for this new world: 1) they add too much overhead in computation (the ORM layer, etc.) and memory (to store the objects). Most of what we do with data is mathematical transformation, so we’re far more productive (and runtime efficient) if we embrace fundamental data structures used throughout (lists, sets, maps, trees) and build rich transformations into those libraries, transformations that are composable to implement business logic.

Page 15: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved15

Relational/Functional

Domain Logic

Database

Query

SQL

Result Set

1

2

Functional Wrapper for

Relational Data

3

Functional Abstractions

• NoSQL?

• Cassandra,HBase

• Riak, Redis

• MongoDB

• Neo4J

• ...Thursday, June 6, 13NoSQL databases fit this model well, with focused abstractions for their data model.

Page 16: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved16

toJSONParentB1

toJSONChildB1

toJSONChildB2

Web Client 1 Web Client 2 Web Client 3

FilesDatabase

Thursday, June 6, 13In a broader view, object models tend to push us towards centralized, complex systems that don’t decompose well and stifle reuse and optimal deployment scenarios. FP code makes it easier to write smaller, focused services that we compose and deploy as appropriate.

Page 17: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved17

toJSONParentB1

toJSONChildB1

toJSONChildB2

Web Client 1 Web Client 2 Web Client 3

FilesDatabase

Web Client 1 Web Client 2 Web Client 3

Process 1 Process 2 Process 3

FilesDatabase

Thursday, June 6, 13In a broader view, object models tend to push us towards centralized, complex systems that don’t decompose well and stifle reuse and optimal deployment scenarios. FP code makes it easier to write smaller, focused services that we compose and deploy as appropriate. Each “ProcessN” could be a parallel copy of another process, for horizontal, “shared-nothing” scalability, or some of these processes could be other services…Smaller, focused services scale better, especially horizontally. They also don’t encapsulate more business logic than is required, and this (informal) architecture is also suitable for scaling ML and related algorithms.

Page 18: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved18

• Smaller, more focused middleware

Web Client 1 Web Client 2 Web Client 3

Process 1 Process 2 Process 3

FilesDatabase

Thursday, June 6, 13An important “software engineering” benefit is the way it promotes smaller, more focused middleware, avoiding bloat that is hard to evolve.

Page 19: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved19

• Data Size ⬆

• Formal Schema ⬇

• Data-Driven Programs ⬆

Web Client 1 Web Client 2 Web Client 3

Process 1 Process 2 Process 3

FilesDatabase

Thursday, June 6, 13And this structure better fits the trends I outlined at the beginning of the talk.

Page 20: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved20

• MapReduce

• Distributed FS

Web Client 1 Web Client 2 Web Client 3

Process 1 Process 2 Process 3

FilesDatabase

Thursday, June 6, 13And MapReduce + a distributed file system, like Hadoop’s MapReduce and HDFS, fit this model.

Page 21: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved21

• Hadoop, other middleware

• NoSQL stores

Web Client 1 Web Client 2 Web Client 3

Process 1 Process 2 Process 3

FilesDatabase

Thursday, June 6, 13MapReduce and a “bare”, distributed file system aren’t the right model for many systems (as we’ll discuss). A similar model uses a NoSQL store and ad-hoc middleware. In this case, more logic is in the storage layer vs. more limited capabilities in a distributed file system.

Page 22: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved22

• Node.js?

• JSON database?

Web Client 1 Web Client 2 Web Client 3

Process 1 Process 2 Process 3

FilesDatabase

• JSON

Thursday, June 6, 13One interesting incarnation of this is JavaScript through the full stack, with JSON as the RPC format, stored directly (more or less) in a database like Mongo, CouchBase, and RethinkDB. Node gives you JS in the mid-tier, and JSON is obviously a browser tool.

Page 23: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

What  IsMapReduce?

Thursday, June 6, 13Let’s be clear about what MR actually is.

Photo: Near the Maritime Museum

Page 24: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

MapReduce  in  Hadoop

Let’s  look  at  a  MapReduce  algorithm:  

WordCount.

(The  Hello  World  of  big  data…)

24

Thursday, June 6, 13Let’s walk through the “Hello World” of MapReduce, the Word Count algorithm, at a conceptual level. We’ll see actual code shortly!

Page 25: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

There is a Map phase

Hadoop uses MapReduce

Input Mappers Sort,Shuffle

Reducers

map 1mapreduce 1phase 2

a 2hadoop 1is 2

Output

There is a Reduce phase

reduce 1there 2uses 1

We  need  to  convert  the  Input  

into  the  Output.

25

Thursday, June 6, 13

Four input documents, one left empty, the others with small phrases (for simplicity…). The word count output is on the right (we’ll see why there are three output “documents”). We need to get from the input on the left-hand side to the output on the right-hand side.

Page 26: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

There is a Map phase

Hadoop uses MapReduce

Input Mappers Sort,Shuffle

Reducers

map 1mapreduce 1phase 2

a 2hadoop 1is 2

Output

There is a Reduce phase

reduce 1there 2uses 1

26

Thursday, June 6, 13

Here is a schematic view of the steps in Hadoop MapReduce. Each Input file is read by a single Mapper process (default: can be many-to-many, as we’ll see later). The Mappers emit key-value pairs that will be sorted, then partitioned and “shuffled” to the reducers, where each Reducer will get all instances of a given key (for 1 or more values).Each Reducer generates the final key-value pairs and writes them to one or more files (based on the size of the output).

Page 27: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

There is a Map phase

Hadoop uses MapReduce

Input

(n, "…")

(n, "…")

(n, "")

Mappers

There is a Reduce phase (n, "…")

27

Thursday, June 6, 13

Each document gets a mapper. All data is organized into key-value pairs; each line will be a value and the offset position into the file will be the key, which we don’t care about. I’m showing each document’s contents in a box and 1 mapper task (JVM process) per document. Large documents might get split to several mapper tasks.The mappers tokenize each line, one at a time, converting all words to lower case and counting them...

Page 28: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

There is a Map phase

Hadoop uses MapReduce

Input

(n, "…")

(n, "…")

(n, "")

Mappers

There is a Reduce phase (n, "…")

(hadoop, 1)(uses, 1)(mapreduce, 1)

(there, 1) (is, 1)(a, 1) (reduce, 1)(phase, 1)

(there, 1) (is, 1)(a, 1) (map, 1)(phase, 1)

28

Thursday, June 6, 13

The mappers emit key-value pairs, where each key is one of the words, and the value is the count. In the most naive (but also most memory efficient) implementation, each mapper simply emits (word, 1) each time “word” is seen. However, this is IO inefficient!Note that the mapper for the empty doc. emits no pairs, as you would expect.

Page 29: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

There is a Map phase

Hadoop uses MapReduce

Input

(n, "…")

(n, "…")

(n, "")

Mappers Sort,Shuffle

Reducers

There is a Reduce phase (n, "…")

(hadoop, 1)

(uses, 1)(mapreduce, 1)

(is, 1), (a, 1)

(there, 1)

(there, 1), (reduce, 1)

(phase,1)

(map, 1),(phase,1)

(is, 1), (a, 1)

0-9, a-l

m-q

r-z

29

Thursday, June 6, 13

The mappers themselves don’t decide to which reducer each pair should be sent. Rather, the job setup configures what to do and the Hadoop runtime enforces it during the Sort/Shuffle phase, where the key-value pairs in each mapper are sorted by key (that is locally, not globally) and then the pairs are routed to the correct reducer, on the current machine or other machines.Note how we partitioned the reducers, by first letter of the keys. (By default, MR just hashes the keys and distributes them modulo # of reducers.)

Page 30: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

There is a Map phase

Hadoop uses MapReduce

Input

(n, "…")

(n, "…")

(n, "")

Mappers Sort,Shuffle

(a, [1,1]),(hadoop, [1]),

(is, [1,1])

(map, [1]),(mapreduce, [1]),

(phase, [1,1])

Reducers

There is a Reduce phase (n, "…")

(reduce, [1]),(there, [1,1]),

(uses, 1)

(hadoop, 1)

(uses, 1)(mapreduce, 1)

(is, 1), (a, 1)

(there, 1)

(there, 1), (reduce, 1)

(phase,1)

(map, 1),(phase,1)

(is, 1), (a, 1)

0-9, a-l

m-q

r-z

30

Thursday, June 6, 13

The reducers are passed each key (word) and a collection of all the values for that key (the individual counts emitted by the mapper tasks). The MR framework creates these collections for us.

Page 31: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

There is a Map phase

Hadoop uses MapReduce

Input

(n, "…")

(n, "…")

(n, "")

Mappers Sort,Shuffle

(a, [1,1]),(hadoop, [1]),

(is, [1,1])

(map, [1]),(mapreduce, [1]),

(phase, [1,1])

Reducers

There is a Reduce phase (n, "…")

(reduce, [1]),(there, [1,1]),

(uses, 1)

(hadoop, 1)

(uses, 1)(mapreduce, 1)

(is, 1), (a, 1)

(there, 1)

(there, 1), (reduce, 1)

(phase,1)

(map, 1),(phase,1)

(is, 1), (a, 1)

0-9, a-l

m-q

r-z

map 1mapreduce 1phase 2

a 2hadoop 1is 2

reduce 1there 2uses 1

Output

31

Thursday, June 6, 13

The final view of the WordCount process flow. The reducer just sums the counts and writes the output.The output files contain one line for each key (the word) and value (the count), assuming we’re using text output. The choice of delimiter between key and value is up to you, but tab is common.

Page 32: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

There is a Map phase

Hadoop uses MapReduce

Input

(n, "…")

(n, "…")

(n, "")

Mappers Sort,Shuffle

(a, [1,1]),(hadoop, [1]),

(is, [1,1])

(map, [1]),(mapreduce, [1]),

(phase, [1,1])

Reducers

There is a Reduce phase (n, "…")

(reduce, [1]),(there, [1,1]),

(uses, 1)

(hadoop, 1)

(uses, 1)(mapreduce, 1)

(is, 1), (a, 1)

(there, 1)

(there, 1), (reduce, 1)

(phase,1)

(map, 1),(phase,1)

(is, 1), (a, 1)

0-9, a-l

m-q

r-z

map 1mapreduce 1phase 2

a 2hadoop 1is 2

reduce 1there 2uses 1

Output

Map:

• Transform one input to 0-N outputs.

Reduce:

• Collect multiple inputs into one output.

32

Thursday, June 6, 13

To recap, a “map” transforms one input to one output, but this is generalized in MapReduce to be one to 0-N. The output key-value pairs are distributed to reducers. The “reduce” collects together multiple inputs with the same key into

Page 33: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

Arguably,  Hadoop  is  our  best,  generic*  tool  for  

scaling  Big  Data  horizontally

(at  least  today).

33

Thursday, June 6, 13

As I’ll argue it’s definitely more first-generation than “perfect”. By generic, I mean beyond the more focused goals of a database, SQL or NoSQL, in the sense that it provides a very open-ended compute model and primitive storage, on which you can build a database, which is exactly what HBase is! However, HBase doesn’t use MapReduce, just HDFS with its own compute engine. The asterisk is because MR is very general-purpose, but actually somewhat difficult, inflexible, and inefficient for many algorithms.

Page 34: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

By  design,  Hadoop  is    great  for  batch  mode  

data  crunching.

 Not  so  great  for  event-­‐stream  processing.

34

Thursday, June 6, 13

… but less so for “real-time” event handling, as we’ll discuss...

Page 35: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

By  design,  Hadoop  is    great  for  batch  mode  

data  crunching.

 Not  so  great  for  transac=ons.

35

Thursday, June 6, 13

Also, there’s no built-in concept of transactional behavior in Hadoop.

Page 36: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

MapReduce  is  very  course-­‐grained.

1-­‐Map  and  1-­‐Reduce  phase...

36

Thursday, June 6, 13

I’m going to revisit this topic as we go...

Page 37: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

For  Hadoop  in  parBcularly,  

the  Java  API  ishard  to  use.

37

Thursday, June 6, 13

There are great higher-level APIs that relieve this burden. We can’t discuss them today, but see Cascading, Scalding, Cascalog, for examples.

Page 38: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

Use  Cases

Thursday, June 6, 13Let’s look at some use cases to understand the strengths and weaknesses of Hadoop-oriented solutions vs. NoSQL-oriented solutions.

Photo: Transamerica Building

Page 39: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved39

TL;DR• Hadoop

• Very flexible compute model

• “Table” scans

• Batch mode

• NoSQL

• Focused on a particular data model

• Transactional

• Event driven

Thursday, June 6, 13

Top-level comparison. These are points of emphasis for these options. What I mean is that we tend to write Hadoop-centric or database-centric apps, depending on the use case.

Page 40: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved40

TL;DR

But  Hadoop  MapReduce  is  oaen  used  with  a  NoSQL  

store.

Thursday, June 6, 13

Just to be clear that there isn’t a clear divide between these technologies, of course.For example, it’s not uncommon to integrate MR jobs with Cassandra, HBase, MongoDB, and even relational tables.

Page 41: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

Search

Thursday, June 6, 13

Improving information search and retrieval, usually through some means of indexing. See the two search talks today by Ryan Tabora and Drew Raines.

Page 42: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved42

A  specific  solu=on  for  search.

Lucene  withSolr  or  ElasBcSearch

Thursday, June 6, 13

This is an example of a specific problem domain and focused tools to solve it. http://www.elasticsearch.org/, http://lucene.apache.org/solr/

Page 43: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved43

NetApp  project:Use  Solr  to  store  indices  to  data  in  HDFS  and  

HBase.

For  very  large  data  sets...

Thursday, June 6, 13

Publicly described project for NetApp done by Think Big Analytics. That project could now be done with Datastax integrated support for Cassandra and Solr.

Page 44: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

Low-­‐cost  Data  Warehouse

Thursday, June 6, 13

Data warehouse systems hit scalability limits and cost/TB concerns at large scale...

Page 45: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved45

Your  current  data  warehouse  can  only  

store  6-­‐months  of  data  without  a  $1M  upgrade.

Problem:

Thursday, June 6, 13

Very common scenario...

Page 46: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved46

• Pros

• Mature

• Rich SQL, analytics

• Mid-size Data

• Cons

• Expensive - $/TB

• Scalability limits

TradiBonal  DW

Thursday, June 6, 13

Data warehouses tend to be more scalable and a little less expensive than OLTP systems, which is why they are used to “warehouse” transactional data and perform analytics. However, their $/TB is ~10x the cost on Hadoop and Hadoop scales to larger data sets.

Page 47: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved47

Replace  the  data  warehouse  with  NoSQL?

SoluBon?

Thursday, June 6, 13

Could you use a NoSQL store instead??

Page 48: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved48

SQL  is  very  important  for  data  warehouse  

applicaBons.

Thursday, June 6, 13

NoSQL does give you the more cost-effective storage, but SQL is very important for most DW applications, so your “NoSQL” store would need a powerful query tool to support common DW scenarios. However, DW experts usually won’t tolerate anything that isn’t SQL.

Page 49: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved49

Replace  the  data  warehouse  with  Hadoop?

SoluBon?

Thursday, June 6, 13

Could you use Hadoop instead??

Page 50: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved50

• Traditional DW

+Mature

+Rich SQL, analytics

- Scalability

- $$/TB

• Hadoop

- Less mature

+ Improving SQL

+Scalable!

+Low $/TB

Thursday, June 6, 13

Data warehouses tend to be more scalable and a little less expensive than OLTP systems, which is why they are used to “warehouse” transactional data and perform analytics. However, their $/TB is ~10x the cost on Hadoop and Hadoop scales to larger data sets.

Page 51: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved51

Hadoop  has  become  a  popular  data  warehouse  supplement/replacement.

Thursday, June 6, 13

Many of my projects have offloaded an overburdened or expensive traditional data warehouse to Hadoop. Sometimes a wholesale replacement, but more often a supplemental strategy, at least for a transitional period of some duration.

Page 52: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

SQL on Hadoop

Thursday, June 6, 13Let’s discuss the SQL options now, as I consider them very important. Otherwise, the majority of Data Analysts and casual SQL users would not find Hadoop very useful.

Page 53: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

• Hive: SQL on top of MapReduce.

• Shark: Hive ported to Spark.

• Impala: HiveQL with new, faster back end.

• Lingual: SQL on Cascading.

 Use  SQL  when  you  can!

53

Thursday, June 6, 13

See http://hive.apache.org/ or my book for Hive, http://shark.cs.berkeley.edu/ for shark, and http://www.cloudera.com/content/cloudera/en/products/cloudera-enterprise-core/cloudera-enterprise-RTQ.html for Impala. Impala is very new. It doesn’t yet support all Hive features.

Page 54: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved54

CREATE TABLE docs (line STRING);LOAD DATA INPATH '/path/to/docs' INTO TABLE docs;

CREATE TABLE word_counts ASSELECT word, count(1) AS count FROM(SELECT explode(split(line, '\W+')) AS word FROM docs) wGROUP BY wordORDER BY word;

Works for Hive, Shark, and Impala

Word  Count  in  Hive  SQL!

Thursday, June 6, 13This is how you could implement word count in Hive. We’re using some Hive built-in functions for tokenizing words in each “line”, the one “column” in the docs table, etc., etc.

Page 55: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

• SQL dialect.

• Uses MapReduce back end.

• So annoying latency.

• First SQL on Hadoop.

• Developed by Facebook.

 Hive

55

Thursday, June 6, 13

http://hive.apache.org

Page 56: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

• HiveQL front end.

• Spark back end.

• Provides better performance.

• Developed by Berkeley AMP.

 Shark

56

Thursday, June 6, 13

See http://www.cloudera.com/content/cloudera/en/products/cloudera-enterprise-core/cloudera-enterprise-RTQ.html. However, this was just announced a few ago (at the time of this writing), so it’s not production ready quite yet...

Page 57: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

• HiveQL front end.

• C++ and Java back end.

• Provides up to 100x performance improvement!

• Developed by Cloudera.

 Impala

57

Thursday, June 6, 13

See http://www.cloudera.com/content/cloudera/en/products/cloudera-enterprise-core/cloudera-enterprise-RTQ.html. However, this was just announced a few ago (at the time of this writing), so it’s not production ready quite yet...

Page 58: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

• ANSI SQL front end.

• Cascading back end.

• Same strengths/weaknesses for runtime performance as Hive.

 Lingual

58

Thursday, June 6, 13

http://www.cascading.org/lingual/ A relatively new “API” for Cascading, still in Beta. Because Cascading runs on MapReduce (there’s also a standalone “local mode” for small jobs and development), it will have the same perf. characteristics as Hive and other MR-based tools.

Page 59: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

Event  Stream  Processing

Thursday, June 6, 13

Data warehouse systems hit scalability limits and cost/TB concerns at large scale...

Page 60: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

Recall,  Hadoop  is    great  for  batch  mode  data  

crunching.

 Not  so  great  for  event-­‐stream  processing.

60

Thursday, June 6, 13

… but less so for “real-time” event handling, as we’ll discuss...

Page 61: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

Storm!Thursday, June 6, 13Nathan Marz’s Storm system for scalable, distributed event stream processing. A complement to Hadoop, as he describes in detail in his book “Big Data” (Manning).

Photo: Top of the AON Building in Chicago after a Storm passed through.

Page 62: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

Storm  implements  reliable,  distributed  event  processing.

62

Thursday, June 6, 13

http://storm-project.net/ Created by Nathan Marz, now at Twitter, who also created Cascalog.

Page 63: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

Spout

Bolt

Bolt

Bolt

BoltSpout

63

Thursday, June 6, 13

In Storm terminology, Spouts are data sources and bolts are the event processors. There are facilities to support reliable message handling, various sources encapsulated in Sprouts and various targets of output. Distributed processing is baked in from the start.

Page 64: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved64

• Hadoop

+Cheap

+Scalable

+Commercial Support

- Batch mode

• Storm

- Less mature

+Robust

- Commercial Support

+“Real time”

Thursday, June 6, 13

If you need to respond to event streams, Hadoop simply isn’t suitable. Storm is one possible solution. It was designed for this problem, but it is less mature and commercial support is just starting to appear.

Page 65: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

Databases  to  the  Rescue?

Thursday, June 6, 13Photo: Actually, this is in Chicago, looking out my condo window.

Page 66: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

 SQL    or  NoSQL  Databases?

66

Databases are designed for fast, transactional updates.

So, consider a database for event processing.

Thursday, June 6, 13

Use a SQL database unless you need the scale and looser schema of a NoSQL database!

Page 67: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

Machine  Learning

Thursday, June 6, 13

ML includes recommendation engines (e.g., the way Netflix recommends movies to you or Amazon recommends products), classification (e.g., SPAM classifiers, character and image recognition), and clustering. Other specialized examples include text mining and other forms of natural language processing (NLP).

Page 68: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved68

• Recommendations: Netflix movies, Amazon products, ...

• Classification: SPAM filters, character recognition, ...

• Clustering: Find groups in social networks, ...

Thursday, June 6, 13

Page 69: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved69

Arbitrary  ML  algorithms  can  be  implemented  using    MapReduce.

Hadoop  has  a  general-­‐purpose  compute  model.

Thursday, June 6, 13

Having a general-purpose computation framework (as opposed to storage model) is useful for implementing arbitrary algorithms, without having to worry about distribution and scalability boilerplate, like you might for an ad-hoc middleware layer.

Page 70: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

However,  recall  that  MapReduce  is  very  course-­‐grained...

70

Thursday, June 6, 13

Page 71: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved71

A  new  toolkit  for  wriBng  models  in  SAS  ,  etc.,  then  run  them  on    

Cascading.

PaMern

Thursday, June 6, 13

See details about Pattern on http://cascading.org

Page 72: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved72

An  alternaBve  to  MapReduce  with  more    flexible  and  composable  

abstracBons.

Spark

Thursday, June 6, 13

http://spark-project.org/

Page 73: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved73

Originally  designed  for  Machine  Learning.

Spark

Thursday, June 6, 13

http://spark-project.org/

Page 74: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved74

Train  with  Hadoop,  etc.Serve  requests  with  

NoSQL  store.

However  you  train  the  models...

Thursday, June 6, 13

A common model is to use Hadoop to train models (e.g., recommendation engine) over very large data set, then store the model in an event-processing system (NoSQL, Storm) to serve requests in real time. (Problem: how do you update the model to reflect new inputs? Rerun training periodically or use an “online” algorithm - out of scope here!)

Page 75: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

Social  NetworkData

Thursday, June 6, 13

How should you represent associations in social networks, e.g., friends on Facebook, followers on Twitter, ...

Page 76: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

Google invented MapReduce,

… but MapReduce is not ideal for Page Rank and other graph algorithms.

 Google’s  Page  Rank

76

Thursday, June 6, 13

PageRank is the famous algorithm invented by Sergey Brin and Larry Page to index the web. It’s the foundation of Google’s search engine (and total world domination ;).

Page 77: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

• 1 MR job for each iteration that updates all n nodes/edges.

• Graph saved to disk after each iteration.

• ...

Why  not  MapReduce?

C

E

A

D

F

B

77

Thursday, June 6, 13

The presentation http://www.slideshare.net/shatteredNirvana/pregel-a-system-for-largescale-graph-processingitemizes all the major issues with using MR to implement graph algorithms.In a nutshell, a job with a map and reduce phase is waaay to course-grained...

Page 78: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

Use  Graph  Processing

Thursday, June 6, 13A good summary presentation: http://www.slideshare.net/shatteredNirvana/pregel-a-system-for-largescale-graph-processing

Photo: San Francisco Bay

Page 79: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

• Pregel: New graph framework for Page Rank.

• Bulk, Synchronous Parallel (BSP).

• Graphs are first-class citizens.

• Efficiently processes updates...

 Google’s  Pregel

79

Thursday, June 6, 13

Pregel is the name of the river that runs through the city of Königsberg, Prussia (now called Kaliningrad, Ukraine). 7 bridges crossed the river in the city (including to 5 to 2 islands between river branches). Leonhard Euler invented graph theory when we analyzed the question of whether or not you can cross all 7 bridges without retracing your steps (you can’t).

Page 80: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

• Apache Giraph.

• Apache Hama.

• Aurelius Titan.

 Open-­‐source  AlternaBves

All are somewhat immature.

80

Thursday, June 6, 13

http://incubator.apache.org/giraph/http://hama.apache.org/http://thinkaurelius.github.com/titan/None is very mature nor has extensive commercial support.

Page 81: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

• Neo4J.

• Mature, single machine graphical networks.

 Open-­‐source  AlternaBves

81

Thursday, June 6, 13

http://neo4j.org

Page 82: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

• Not the same kind of distributed system like Pregel, but

• More mature.

• Commercially supported.

• Maybe you don’t need distributed?

Neo4J

82

Thursday, June 6, 13

http://neo4j.org

Page 83: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

Batch  Mode  +  Event-­‐Stream  Processing?

Thursday, June 6, 13Now that we have cataloged some issues and solutions, let’s recap and look forward.Photo: Grebe and distorted post reflection.

Page 84: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

Nathan Marz’s vision for data

systems...

“Big  Data”

84

Thursday, June 6, 13

Page 85: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

• Use Hadoop for batch-mode processing of very large data sets.

• Use Storm for event handling, e.g., increment updates.

• Use NoSQL for flexible, scalable storage.

• Stir...85

Thursday, June 6, 13

http://neo4j.org

Page 86: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

So,  where  are  we??

Thursday, June 6, 13

Some thoughts about where the industry is and where it’s headed.

Page 87: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

Hadoop  meets  lots  of  needs  now,  

but...

Thursday, June 6, 13

Page 88: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

Hadoop  MapReduce

 is  the  Enterprise  Java  Beans  of  our  Bme.

Thursday, June 6, 13I worked with EJBs a decade ago. The framework was completely invasive into your business logic. There were too many configuration options in XML files. The framework “paradigm” was a poor fit for most problems (like soft real time systems and most algorithms beyond Word Count). Internally, EJB implementations were inefficient and hard to optimize, because they relied on poorly considered object boundaries that muddled more natural boundaries. (I’ve argued in other presentations and my “FP for Java Devs” book that OOP is a poor modularity tool…) The fact is, Hadoop reminds me of EJBs in almost every way. It’s a 1st generation solution that mostly works okay and people do get work done with it, but just as the Spring Framework brought an essential rethinking to Enterprise Java, I think there is an essential rethink that needs to happen in Big Data, specifically around Hadoop. The functional programming community, is well positioned to create it...

Page 89: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

I  see  Hadoop  as  

transi=onal.

Thursday, June 6, 13At least MapReduce. HDFS has longer legs, although it has its own flaws, like a single-point of failure NameNode that was only recently fixed (with replicated NNs), but is still immature, and it has upper bounds on storage. The MapR File System is superior in all these ways, but proprietary. Anyway, HDFS will last longer, but eventually be replaced by “next generation” distributed file systems, but MapReduce is already in decline and being replaced by more performant engines such as Impala, Spark, and Storm (for different purposes…).

Page 90: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

What  about  NoSQL??

Thursday, June 6, 13

Does it feel “dated” like Hadoop?

Page 91: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

The  “good”  NoSQL  systems  have  all  the    right  qualiBes,  likeminimal,  focused  abstracBons.  

91

Thursday, June 6, 13

By good, I mean the obvious leaders who are proven… They have very clean data models, e.g., key-value storage, column-oriented storage, etc. They have narrow, focused APIs without a lot of “confusion”. They scale extremely well with excellent cost curves. They do one (very important) thing well, yet typically support a wide variety of applications.

Page 92: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

Purpose-­‐built  Tools

Thursday, June 6, 13I see that a trend where completely generic tooling is giving way to more “purpose-built” tooling...

Page 93: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

ElasBcSearch  and  Solr  for  Search.  

93

Thursday, June 6, 13

Page 94: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

New  Hadoop  file  formats,  opBmize  access  (e.g.,  Parquet).  

94

Thursday, June 6, 13

In the quest for ever better performance over massive datasets, the generic file formats in Hadoop and MapReduce are hitting a performance wall (although not everyone agrees). Parquet is column oriented & contains the data schema, like Thrift, Avro, and Protobuf. It will be exploited to optimize queries over massive data sets, much faster than the older file formats. Similarly, Impala is purpose built optimized query engine (that relies on Parquet).

Page 95: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

New  compute  engines  in  Hadoop  that  

replace  MapReduce.

95

Thursday, June 6, 13

In the quest for ever better performance over massive datasets, the generic file formats in Hadoop and MapReduce are hitting a performance wall (although not everyone agrees). Parquet is column oriented & contains the data schema, like Thrift, Avro, and Protobuf. It will be exploited to optimize queries over massive data sets, much faster than the older file formats. Similarly, Impala is purpose built optimized query engine (that relies on Parquet).

Page 96: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

Machine  Learning  and  other  advanced  analyBcs  

are  hard  to  writeand  perform  poorly

on  Hadoop...

96

Thursday, June 6, 13

People do it because they feel they have no other choice...

Page 97: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

…  but  alternaBves  are  emerging,  including  Spark,  Storm,  and  

proprietary  soluBons.

97

Thursday, June 6, 13

There’s a “Let a thousand flowers bloom” process happening. Many attempts will fall by the wayside, but some will emerge as viable alternatives.

Page 98: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

A  Final  Emerging  Trend...

Thursday, June 6, 13Both are examples of technologies focused on particular problems.photo: Trump hotel and residences on the Chicago River.

Page 99: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

• Languages for Probabilistic Graphical Models??

• Bayesian Networks.

• Markov Chains.

• ...

Probabilis=c  Programming

99

Thursday, June 6, 13

http://en.wikipedia.org/wiki/Bayesian_networkhttp://en.wikipedia.org/wiki/Markov_chainPGMs are essential tools for many machine learning and artificial intelligence systems. But they require some expertise to build, both mastery of the PGM concepts and implementing them in conventional programming languages There is growing interest in designing languages that encapsulate this complexity.

Page 100: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

Final  Thoughts

Thursday, June 6, 13

Photo: Gull on a pier near Fort Mason, SF

Page 101: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

Consider  Small-­‐batch,  ArBsanal  Data...

101

Thursday, June 6, 13

If you have lots of data, it can be valuable, but you also have lots of work to do to manage and exploit it. Finding the right amount of “curated” data to use and a matching architecture still have valuable benefits, as always.Unfortunately, I see people jump on the Big Data bandwagon who think there’s gold in their few GBs of data… That said, the “medium data” market is not well served at the moment. Hadoop can be overkill, yet traditional tools can be too expensive or not handle the load...

Page 102: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

SQL  Strikes  Back!

Thursday, June 6, 13NoSQL solves meets lots of requirements better than traditional RDBMSs, but people loves them some SQL!!

Page 103: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

• It’s entrenched.

• Organizations want a SQL solution if they can have it.

Don’t  overlook  SQL

103

Thursday, June 6, 13

Page 104: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

Hadoop  owes  a  lot  of  its  popularity  to  Hive!

104

Thursday, June 6, 13

In large companies, the data analysts outnumber the developers by a large margin. Almost all of them know SQL (even if they happen to use SAS or similar tools more often…).

Page 105: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

Some  “NoSQL”  databases  have  added  query  languages  (e.g.,  Cassandra,  MongoDB).

105

Thursday, June 6, 13

Cassandra’s is relatively new and based on SQL. MongoDB has always had one, based on a Javascript DSL.

Page 106: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

“NewSQL”  databases  are  bringing  NoSQL  performance  to  the  relaBonal  model.

106

Thursday, June 6, 13

Page 107: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

• Google Spanner and F1.

• NuoDB.

• VoltDB.

Examples

107

Thursday, June 6, 13

Spanner is the successor to BigTable. It is a globally-distributed database (consistency is maintained using the Paxos algo. and hardware synchronized clocks through GPS and atomic clocks!) Each table requires a primary key. F1 is an RDBMS built on top of it.NuoDB is a cloud based RDBMS.VoltDB is an example “in-memory” database, which are ideal for lots of small transactions that leverage indexing and rarely require full table scans.

Page 108: FromBigData to*Big*Informaonnosqlroadshow.com/dl/NoSQL-sanfran-2013/FromBigDataToBigInfor… · Hadoop&hype&cycle: Butwhatgives*us actual*value? 3 Thursday, June 6, 13 This talk reflects

Copyright  ©  2011-­‐2013,  Dean  Wampler,  All  Rights  Reserved

• People love SQL...

• but NoSQL meets other requirements.

What  does  this  meanfor  NoSQL?

108

Thursday, June 6, 13