39
Big Data and NoSQL What and Why?

Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

Embed Size (px)

DESCRIPTION

Motivation: Scalability Another characteristica is the potential rapidly expanding market –Facebook (end of year data): 2004: 1 mio 2006: 12 mio 2008: 150 mio 2010: 600 mio 2012: mio How to scale fast enough? 3

Citation preview

Page 1: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

Big Data and NoSQL

What and Why?

Page 2: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

Motivation: Size• WWW has spawned a new era of applications

that need to store and query very large data sets– Facebook / 1.1 billion (sep 2012) [Yahoo finance]– Twitter / 500 mio (jul 2012) [techcrunch.com]– Netflix / 30 mio (okt 2012) [Yahoo finance]

– Google / index ~45 billion pages (nov 2012) • [http://www.worldwidewebsize.com/]

• So – how do we store and query this type of data set?

2

Page 3: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

Motivation: Scalability• Another characteristica is the potential rapidly

expanding market– Facebook (end of year data):

• 2004: 1 mio• 2006: 12 mio• 2008: 150 mio• 2010: 600 mio• 2012: 1.100 mio

• How to scale fast enough?

3

Page 4: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

Motivation: Availability• Bad user experience

– Down time– Long waits

• ... Equals lost users

• How to ensure good user experience?

4

Page 5: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

From Alvin Richard’s Goto presentation

5

Page 6: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

Requirements:• So the QA requirements are

– Performance:• Time-dimension: fast storage and retrival of data• Space-dimension: large data sets

– Availability: of data service

6

Page 7: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

The players• The old workhorse

– RDBM: Relational DataBase Management• The hyped and forgotten (?)

– OODB– XMLDB

• BaseX

• The hyped and uncertain– NoSQL

7

Page 8: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

RDBM

• Performance:– Time

• Queries do not scale linearly in the size of the data set

– 10.000 rows ~ 1 second– 1 billion rows ~ 24 hours

8[Jacobs (2009)]

Page 9: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

RDBM• Performance:

– Space• Generally, RDBM are rather monolith

One RDBM = One machine

• Limitations on disk size, RAM, etc.

– (There are ‘tricks’ to distribute data over many machines, but it is generally an ‘after-the-fact’ design, not something considered a premise...)

9

Page 10: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

RDBM• Availability:

– OK I think...• Redundancy, redundant disk arrays, the usual bag of tricks

10

Page 11: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

RDBM Performance

Starting with Time behaviour

11

Page 12: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

RDBM Space tricks• Sharding

• Exercise: Give examples from WoW …

12

[Wikipedia]

Page 13: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

Why do RDBMs become slow?• Jacobs’ paper explains the details.• It boils down to:

– The memory hierarchy• Cache, RAM, Disk, (Tape)• Each step has a big performance penalty

– The reading pattern• Sequential read faster than Random read• Tape: obvious but even for RAM this is true!

13

Page 14: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

Issue 1

14

Page 15: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

Thus

Issue 1: Large data sets cannot reside in RAM

Thus read/write time is slow

15

Page 16: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

Issue 2• Issue

– Often one domain type is fixed at ”large”• Number of customers/users ( < 7 billion )• Number of sensors

– While other domains increase indefintely• Observations over time

– each page visit, transaction, sensor reading• Data production

– tweets, facebook update, log entry

16

Page 17: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

But...

The relational database model explicitly ignores the ordering of rows in tables.

17

Page 18: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

Example• Transactions recorded by a web server

– Transaction log is inherently time ordered

– Queries are usually about timeas well

• How may last day?• Most visiting user last week?

• However results in randomvisits to the disk!

18

Page 19: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

Normalization• It becomes even worse due to the classic RDBM

virtue of normalization:• A user and a transaction table

19

Page 20: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

Jacobs’ statement

20

Page 21: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

Denormalization technique

• Denormalization: – Make aggregate

objects– Allows much faster

access– Liability: Much more

space intensive

• And!– Store data in the

sequences they are to be queried…

21

Page 22: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

NoSQL Databases

A new take on storing and quering big data

22

Page 23: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

Key features• NoSQL: ”Not Only SQL” / Not Relational

– Horizontal scaling of simple operations over many servers

– Repliation and partitioning data over many servers– Simple call level interface– Weaker concurrency model than ACID– Efficient use of RAM and dist. Indexes for storage– Ability to dynamically add new attributes to records

• Architectural Drivers:– Performance– Scalabilty

23

[Cattel, 2010]

Page 24: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

Clarifications• NoSQL focus on

– Simple operations• Key lookup, read/write of one or a few records• Opposite: Complex joins over many tables (SQL)

• (NoSQL generally denormalize and thus do not require joins)

– Horizontal scaling• Many servers with no RAM nor disk sharing• Commodity hardware

– Cheap but more prone to failures

24

Page 25: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

NoSQL DB Types

25

Page 26: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

Basic types• Four types

– Key-value stores• Memcached, Riak, Regis

– Document stores• MongoDB

– Extensible record (Fowler: Column-family)• Cassandra, Google BigTable

– Graph stores• Neo4J

26

Fowler: Aggregate-oriented

Page 27: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

Key-value• Basically the Java Map<KeyType, ValueType>

datastructure:– Map.put(”pid01-2012-12-03-11-45”, measurement);

– m = Map.get(”pid01-2012-12-03-11-45”);• Schema: Any value under any key…• Supports:

– Automatic replication– Automatic sharding– Both using hashing on the key

• Only lookup using the (primary) key27

Page 28: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

Document• Stores ”documents”

– MongoDB: JSON objects.– Stronger queries, also in

document contents– Schema: Any JSON object may be stored!– Atomic updates, otherwise no concurrency control

• Supports– Master-slave replication, automatic failover and

recovery– Automatic sharding

• Range-based, on shard key (like zip-code, CPR, etc.)

28

Page 29: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

Extensible Record/Column• First kid on the block: Google BigTable• Datamodel

– Table of rows and columns• Scalability model: splitting both on rows and columns• Row: (Range) Sharding on primary key• Column: Column groups – domain defined clustering

– No Schema on the columns, change as you go– Generally memory-based with periodic flushes to disk

29

Page 30: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

(Graph)• Neo4J

30

Page 31: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

CAP Theorem

31

Page 32: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

Reviewing ACID• Basic RDBM teaching talks on ACID• Atomicity

– Transaction: All or none succeed• Concistency

– DB is in valid state before and after transaction• Isolation

– N transactions executed concurrently = N executed in sequence

• Durability– Once a transaction has been committed, it will remain

so (even in the event of power loss, crash)

32

Page 33: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

CAP• Eric Brewer: only get two of the three:• Consistency

– Set of operations has occurred at once (Client view)

• Availability– Operations will terminate in intended reponse

• Partition tolerence– Operation will complete, even if components are

unavailable

33

Page 34: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

Horizontal Scaling: P taken• We have already taken P, so we have to relax

either

– Consistency

– Availability

• RDBM prefer consistency over availability• NoSQL prefer availability over consistency

– replacing it with eventual consistency

34

Page 35: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

Achieving ACID• Two-phase Commit

1. All partitions pre-commit, report result to master2. If success, master tells each to commit; else roll-back

• Guaranty consistency, but availability suffer• Example

– Two partitions, 99.9% availability• => 99.92 = 99.8% (+43 min down every month)

– Five partitions: • 99,5% (36 hours down time in all)

35

Page 36: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

BASE• Replace ACID with BASE

• BA: Basically Available• S: Soft state• E: Eventual consistent

• Availability achieved by partial failures not leading to system failures– In two-phase commit, what would master do if one

partition does not repond?

36

Page 37: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

Eventual Consistency• So what does this mean?

– Upon write, an immediate read may retrieve the old value – or not see the newest added item!

• Why? Gets data from a replica that is not yet updated…

– Eventual consistency: • Given sufficiently long period of time which no updates, all

replicas are consistent, and thus all reads consistently return the same data…

• System always available, but state is ‘soft’ / cached

37

Page 38: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

Discussion• Web applications

– Is it a problem that a facebook update takes some minutes to appear at my friends?

38

Page 39: Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook

Summary• RDBMs have issues with ‘big data’

– Ignores the physical laws of hardware (random read)– RDB model requires joins (random read)

• NoSQL– Class of alternative DB models and impl.– Two main classes

• Key-value stores• Document stores

• CAP– You can only get two of three– NoSQL choose A, and sacrifice C for Eventual

ConsistencyCS@AU Henrik Bærbak Christensen 39