Emergent Distributed Data Storage

DESCRIPTION

This was presented at NHN on Jan. 27, 2009. It introduces Big Data, its storage, and its analysis. In particular, it covers the MapReduce debates and hybrid systems that combine RDBMS and MapReduce. It also explains various schema-free, non-relational data stores.

Page 1: Emergent Distributed Data Storage

http://www.coordguru.com

Woohyun Kim

The creator of open source “Coord”

(http://www.coordguru.com)

2010-01-27

Emergent Distributed Data Storages for Big Data, Storage, and Analysis

Page 2: Emergent Distributed Data Storage

Contents

The Advent of Big Data
• Noah’s Ark Problem
• Key Issues with “Big Data”
• How to deal with “Big Data”

MapReduce Debates
• MapReduce is just A Major Step Backwards!!!
• RDB experts Jump the MR Shark
• DBs are hammers; MR is a screwdriver
• MR is a Step Backwards, but some Steps Forward

Hadoop Revolution
• Best Practice in Hadoop
• Hadoop is changing the Game
• Big Data goes well with Hadoop
• Case Study: Parallel Join
• Case Study: Further Study in Parallel Join
• Case Study: Improvements in Parallel Join

A Hybrid of MapReduce and RDBMS
• Integrate MapReduce into RDBMS
• In-Database MapReduce vs. File-only MapReduce

Non-Relational Data Storages
• Throw ‘Relational’ Away, and Take ‘Schema-Free’
• A Comparison of Non-Relational Data Storages
• Emergent Document-oriented Storages
• Document-oriented vs. RDBMS

Page 3: Emergent Distributed Data Storage

The Advent of Big Data

Page 4: Emergent Distributed Data Storage

Noah’s Ark Problem

• Did Noah take dinosaurs on the Ark?
• The Ark was a very large ship designed especially for its important purpose
• It was so large and complex that it took Noah 120 years to build
• How to put such a big thing on board?
  • Diet?
  • Split?
    • Differentiate
    • Put
    • Integrate
  • Scale Up?
  • Scale Out?
• The “Big Data” problem is just like that

Page 5: Emergent Distributed Data Storage

Key Issues with “Big Data”

• Lookup
  • Metadata server -> centralized or distributed -> partitioned replicas to avoid a single point of failure
• Partition
  • Data locality -> network bandwidth reduction -> putting the computation near the data
• Replication
  • Hardware failure -> data loss -> availability from redundant copies of the data
• Load-balanced Parallel Processing
  • Corrupt data or remote process failure -> speculative execution or rescheduling
• Ad-hoc Analysis
  • Some partitioned data may need to be combined with other data
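The partition, replication, and lookup issues above can be illustrated with a minimal sketch. It is not taken from the slides: the node names, the `replica_nodes` and `lookup` helpers, and the replica count of 3 are all assumptions chosen for illustration, and real systems use more careful placement policies.

```python
import hashlib


def stable_hash(key: str) -> int:
    """Hash a key to an integer that is stable across processes."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def replica_nodes(key, nodes, replicas=3):
    """Partition: pick a home node by hash, then place copies on the next nodes."""
    start = stable_hash(key) % len(nodes)
    count = min(replicas, len(nodes))
    return [nodes[(start + i) % len(nodes)] for i in range(count)]


def lookup(key, nodes, failed, replicas=3):
    """Replication: return the first replica holder that has not failed."""
    for node in replica_nodes(key, nodes, replicas):
        if node not in failed:
            return node
    return None  # every copy is unavailable


# Hypothetical five-node cluster.
NODES = ["node1", "node2", "node3", "node4", "node5"]
placement = replica_nodes("user:42", NODES)
primary = placement[0]
# The data stays available when the primary holder fails.
survivor = lookup("user:42", NODES, failed={primary})
```

Because every client can compute the placement from the key alone, this sketch needs no central metadata server for lookup, which is one way to avoid the single point of failure mentioned above.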

Page 6: Emergent Distributed Data Storage

How to deal with “Big Data”

• Struggling to STORE and ANALYZE “Big Data”

Page 7: Emergent Distributed Data Storage

Appendix: What is ETL?

ETL (Extract, Transform, and Load)
• A process in database usage, and especially in data warehousing, that involves:
  • Extracting data from outside sources (such as different data organizations/formats or non-relational database structures)
  • Transforming it to fit operational needs (which can include quality levels)
    • Selection, translation, encoding, calculation, filtering, sorting, joining, aggregation, transposing or pivoting, splitting, disaggregation
  • Loading it into the end target (database or data warehouse)

ETL Open Sources
• Talend Open Studio
• Pentaho Data Integration (Kettle)
• RapidMiner
• Jitterbit 2.0
• Apatar
• Clover.ETL
• Scriptella
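The three ETL steps described in this appendix can be sketched with the Python standard library alone. This is an illustrative sketch, not how any of the listed tools work: the sample CSV, the `spend` table, and the `extract`, `transform`, and `load` function names are made up for the example.

```python
import csv
import io
import sqlite3

# Hypothetical source data with a missing value and inconsistent casing.
RAW = """customer,amount,country
alice,10.5,kr
bob,,us
alice,4.5,kr
carol,7,US
"""


def extract(text):
    """Extract: read rows from an outside source (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: filter bad rows, normalize codes, aggregate per customer."""
    totals = {}
    for row in rows:
        if not row["amount"]:  # filtering: drop rows without an amount
            continue
        key = (row["customer"], row["country"].upper())  # translation
        totals[key] = totals.get(key, 0.0) + float(row["amount"])  # aggregation
    return sorted((c, k, t) for (c, k), t in totals.items())  # sorting


def load(records, conn):
    """Load: write the transformed records into the end target."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS spend (customer TEXT, country TEXT, total REAL)"
    )
    conn.executemany("INSERT INTO spend VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW)), conn)
result = conn.execute(
    "SELECT customer, country, total FROM spend ORDER BY customer"
).fetchall()
```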

Page 8: Emergent Distributed Data Storage

Hadoop Revolution

Page 9: Emergent Distributed Data Storage

Best Practice in Hadoop

• Software Stack in Google/Hadoop
• Cookbook for “Big Data”
• Structured Data Storage for “Big Data”

[Figure: structured data model. Labels: Row, Row key, Column key, Column Family, Timestamp, Structured Data]

Page 10: Emergent Distributed Data Storage

Appendix: What is MapReduce?

Map
• Read a set of records from an input file, and apply filtering or transformations to them
• Output a set of (key, data) pairs, which are partitioned into R disjoint buckets by key

Reduce
• Read a set of (key, list of data) pairs from the R disjoint buckets
  • Each of the R buckets from the maps’ outputs is shuffled and aggregated into its corresponding reducer, ordered by key
• Output a new set of records

[Figure: MapReduce. Labels: Map, Reduce, Group-By/Filter, Aggregate/Aggregator]
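The Map and Reduce steps described above can be sketched as a single-process word count. This is a toy model under stated assumptions, not Hadoop's API: `run_mapreduce`, `map_fn`, `reduce_fn`, and `partition` are hypothetical names, and the R buckets live in memory instead of on separate machines.

```python
import zlib
from collections import defaultdict


def map_fn(record):
    """Map: turn one input record into (key, data) pairs."""
    for word in record.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce: fold the list of data for one key into a new record."""
    return key, sum(values)


def partition(key, r):
    """Assign a key to one of R disjoint buckets."""
    return zlib.crc32(key.encode("utf-8")) % r


def run_mapreduce(records, map_fn, reduce_fn, r=3):
    # Map phase: emit pairs and partition them into R buckets by key.
    buckets = [defaultdict(list) for _ in range(r)]
    for record in records:
        for key, value in map_fn(record):
            buckets[partition(key, r)][key].append(value)
    # Reduce phase: each bucket is processed in key order.
    output = []
    for bucket in buckets:
        for key in sorted(bucket):
            output.append(reduce_fn(key, bucket[key]))
    return output


counts = dict(
    run_mapreduce(["big data", "big storage", "data analysis"], map_fn, reduce_fn)
)
```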

Page 11: Emergent Distributed Data Storage

Hadoop is changing the Game

• Hadoop, DW, and BI

Page 12: Emergent Distributed Data Storage

Big Data goes well with Hadoop

• Parallelize Relational Algebra Operations using MapReduce
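As a rough illustration of the point above, selection, projection, and group-by aggregation can each be written as a mapper with an optional reducer. The sketch below is an assumption-laden toy: `run_job` is a hypothetical in-memory stand-in for a MapReduce runtime, and the ratings records are invented sample data.

```python
from collections import defaultdict


def run_job(records, mapper, reducer=None):
    """Tiny stand-in for a MapReduce job; without a reducer it is map-only."""
    pairs = [pair for record in records for pair in mapper(record)]
    if reducer is None:
        return [value for _, value in pairs]
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return [reducer(key, groups[key]) for key in sorted(groups)]


# Hypothetical sample relation.
RATINGS = [
    {"user": "u1", "movie": "m1", "stars": 5},
    {"user": "u2", "movie": "m1", "stars": 3},
    {"user": "u1", "movie": "m2", "stars": 2},
]


def select_mapper(row):
    """Selection (WHERE stars >= 3): a map-only filter."""
    if row["stars"] >= 3:
        yield None, row


def project_mapper(row):
    """Projection (SELECT DISTINCT movie): emit the attribute as the key."""
    yield row["movie"], None


def project_reducer(key, values):
    """Duplicates collapse because each key reaches the reducer once."""
    return key


def avg_mapper(row):
    """Group-by (GROUP BY movie): the grouping attribute becomes the key."""
    yield row["movie"], row["stars"]


def avg_reducer(key, values):
    """Aggregation (AVG(stars)) over the grouped values."""
    return key, sum(values) / len(values)


selected = run_job(RATINGS, select_mapper)
movies = run_job(RATINGS, project_mapper, project_reducer)
averages = run_job(RATINGS, avg_mapper, avg_reducer)
```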

Page 13: Emergent Distributed Data Storage

Case Study: Parallel Join

• A Parallel Join Example using MapReduce
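The slide's own join figure is not preserved in this transcript, so the following is only a sketch of one common approach, a reduce-side join with source tagging. The movie and rating records, the tags "M" and "R", and all function names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical inputs: (movie_id, title) and (movie_id, user, stars).
MOVIES = [("m1", "Alien"), ("m2", "Brazil")]
RATINGS = [("m1", "u1", 5), ("m2", "u1", 2), ("m1", "u2", 3)]


def map_movie(record):
    """Tag each movie record so the reducer knows its source."""
    movie_id, title = record
    yield movie_id, ("M", title)


def map_rating(record):
    """Tag each rating record with a different source marker."""
    movie_id, user, stars = record
    yield movie_id, ("R", (user, stars))


def reduce_join(movie_id, tagged_values):
    """All records for one join key meet here; pair them up by tag."""
    titles = [value for tag, value in tagged_values if tag == "M"]
    ratings = [value for tag, value in tagged_values if tag == "R"]
    for title in titles:
        for user, stars in ratings:
            yield movie_id, title, user, stars


def reduce_side_join(movies, ratings):
    # Shuffle: group the tagged records from both inputs by join key.
    groups = defaultdict(list)
    for record in movies:
        for key, value in map_movie(record):
            groups[key].append(value)
    for record in ratings:
        for key, value in map_rating(record):
            groups[key].append(value)
    joined = []
    for key in sorted(groups):
        joined.extend(reduce_join(key, groups[key]))
    return joined


joined = reduce_side_join(MOVIES, RATINGS)
```

In a real cluster the grouping step is the shuffle, which is where the sorting, network transfer, key skew, and tagging overhead come from.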

Page 14: Emergent Distributed Data Storage

Case Study: Further Study in Parallel Join

Problems
• Need to sort
• Move the partitioned data across the network
  • Due to shuffling, the whole data set must be sent
• Skewed by popular keys
  • All records for a particular key are sent to the same reducer
• Overhead by tagging

Alternatives
• Map-side Join
  • Mapper-only job to avoid the sort and to reduce data movement across the network
• Semi-Join
  • Shrink data size through semi-join (by preprocessing)

Page 15: Emergent Distributed Data Storage

Case Study: Improvements in Parallel Join

Map-Side Join
• Replicate a relatively smaller input source to the cluster
  • Put the replicated dataset into a local hash table
• Join – a relatively larger input source with each local hash table
  • Mapper: do Mapper-side Join
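The map-side join steps above can be sketched as follows. The sample data and the `build_hash_table` and `map_side_join` names are hypothetical, and each inner list of `RATING_SPLITS` stands in for the input split of one mapper.

```python
def build_hash_table(small_input):
    """Each mapper loads the replicated smaller input into a local hash table."""
    return {movie_id: title for movie_id, title in small_input}


def map_side_join(rating_split, movie_table):
    """Join one split of the larger input by probing the local hash table."""
    for movie_id, user, stars in rating_split:
        title = movie_table.get(movie_id)
        if title is not None:  # inner join: skip ratings without a movie
            yield movie_id, title, user, stars


# Hypothetical data: a small Movies input and two splits of a large Ratings input.
MOVIES = [("m1", "Alien"), ("m2", "Brazil")]
RATING_SPLITS = [
    [("m1", "u1", 5)],
    [("m2", "u1", 2), ("m3", "u2", 4)],
]

movie_table = build_hash_table(MOVIES)
joined = [
    row for split in RATING_SPLITS for row in map_side_join(split, movie_table)
]
```

No sort or shuffle is involved, which is the point of the technique, but it only works when the smaller input fits in each mapper's memory.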

Semi-Join
• Extract – unique IDs referenced in a larger input source (A)
  • Mapper: extract Movie IDs from Ratings records
  • Reducer: accumulate all unique Movie IDs
• Filter – the other larger input source (B) with the referenced unique IDs
  • Mapper: filter the referenced Movie IDs from the full Movie dataset
• Join – a larger input source (A) with the filtered dataset
  • Mapper: do Mapper-side Join
    • Ratings records & the filtered Movie IDs dataset
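The three semi-join phases above (extract, filter, join) can be sketched in plain Python. The function names and the sample Movies and Ratings records are assumptions for illustration; in a real job each phase would be its own MapReduce pass.

```python
def extract_ids(ratings):
    """Phase 1: mappers emit Movie IDs, a reducer accumulates the unique set."""
    return {movie_id for movie_id, _user, _stars in ratings}


def filter_movies(movies, referenced_ids):
    """Phase 2: mappers keep only the movies that some rating refers to."""
    return [(movie_id, title) for movie_id, title in movies if movie_id in referenced_ids]


def join(ratings, filtered_movies):
    """Phase 3: map-side join of Ratings with the (now small) filtered Movies."""
    table = dict(filtered_movies)
    return [
        (movie_id, table[movie_id], user, stars)
        for movie_id, user, stars in ratings
        if movie_id in table
    ]


# Hypothetical data: only m1 is ever rated.
MOVIES = [("m1", "Alien"), ("m2", "Brazil"), ("m3", "Casablanca")]
RATINGS = [("m1", "u1", 5), ("m1", "u2", 3)]

ids = extract_ids(RATINGS)
filtered = filter_movies(MOVIES, ids)
joined = join(RATINGS, filtered)
```

The preprocessing pays off when the filtered dataset is much smaller than the full one, so that it can be replicated to every mapper.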

Page 16: Emergent Distributed Data Storage

MapReduce Debates

Page 17: Emergent Distributed Data Storage

MapReduce is just A Major Step Backwards!!!
DeWitt and Stonebraker, January 17, 2008

• A giant step backward in the programming paradigm for large-scale data-intensive applications
  • Schemas are good
    • Type checking at runtime, so no garbage
  • Separation of the schema from the application is good
    • The schema is stored in catalogs, so it can be queried (in SQL)
  • High-level access languages are good
    • Present what you want rather than an algorithm for how to get it
  • No schema??!
    • At least one data field by specifying the key as input
    • For Bigtable/HBase, different tuples within the same table can actually have different schemas
    • There is not even support for logical schema changes such as views

Page 18: Emergent Distributed Data Storage

MapReduce is just A Major Step Backwards!!! (cont’d)
DeWitt and Stonebraker, January 17, 2008

• A sub-optimal implementation, in that it uses brute force instead of indexing
  • Indexing
    • All modern DBMSs use hash or B-tree indexes to accelerate access to data
    • In addition, there is a query optimizer to decide whether to use an index or perform a brute-force sequential search
    • However, MapReduce has no indexes, so it processes data only in brute-force fashion
  • Automatic parallel execution
    • In the 1980s, the DBMS research community explored it in systems such as Gamma, Bubba, and Grace, and even in the commercial Teradata
  • Skew
    • The distribution of records with the same key is skewed in the map phase, so some reducers take much longer than others
  • Intermediate data pulling
    • In the reduce phase, two or more reducers attempt to read input files from the same map node simultaneously

Page 19: Emergent Distributed Data Storage

MapReduce is just A Major Step Backwards!!! (cont’d)
DeWitt and Stonebraker, January 17, 2008

• Not novel at all – it represents a specific implementation of well-known techniques developed nearly 25 years ago
  • Partitioning for join
    • Application of Hash to Data Base Machine and its Architecture, 1983
  • Joins in parallel on a shared-nothing architecture
    • Multiprocessor Hash-based Join Algorithms, 1985
    • The Case for Shared-Nothing, 1986
  • Aggregates in parallel
    • The Gamma Database Machine Project, 1990
    • Parallel Database System: The Future of High Performance Database Systems, 1992
    • Adaptive Parallel Aggregation Algorithms, 1995
• Teradata has been selling a commercial DBMS utilizing all of these techniques for more than 20 years
• PostgreSQL supported user-defined functions and user-defined aggregates in the mid 1980s

Page 20: Emergent Distributed Data Storage

MapReduce is just A Major Step Backwards!!! (cont’d)
DeWitt and Stonebraker, January 17, 2008

• Missing most of the features that are routinely included in current DBMSs
  • MapReduce provides only a sliver of the functionality found in modern DBMSs
    • Bulk loader – transform input data in files into a desired format and load it into a DBMS
    • Indexing – hash or B-tree indexes
    • Updates – change the data in the database
    • Transactions – support parallel update and recovery from failures during update
    • Integrity constraints – help keep garbage out of the database
    • Referential integrity – again, help keep garbage out of the database
    • Views – so the schema can change without having to rewrite the application program
• Incompatible with all of the tools DBMS users have come to depend on
  • MapReduce cannot use the tools available in a modern SQL DBMS, and has none of its own
    • Report writers (Crystal Reports)
      • Prepare reports for human visualization
    • Business intelligence tools (Business Objects or Cognos)
      • Enable ad-hoc querying of large data warehouses
    • Data mining tools (Oracle Data Mining or IBM DB2 Intelligent Miner)
      • Allow a user to discover structure in large data sets
    • Replication tools (Golden Gate)
      • Allow a user to replicate data from one DBMS to another
    • Database design tools (Embarcadero)
      • Assist the user in constructing a database

Page 21: Emergent Distributed Data Storage

What the !@# MapReduce?

Page 22: Emergent Distributed Data Storage

RDB experts Jump the MR Shark
Greg Jorgensen, January 17, 2008

• Arg1: MapReduce is a step backwards in database access
  • MapReduce is not a database, a data storage, or a management system
  • MapReduce is an algorithmic technique for the distributed processing of large amounts of data
• Arg2: MapReduce is a poor implementation
  • MapReduce is one way to generate indexes from a large volume of data, but it’s not a data storage and retrieval system
• Arg3: MapReduce is not novel
  • Hashing, parallel processing, data partitioning, and user-defined functions are all old hat in the RDBMS world, but so what?
  • The big innovation MapReduce enables is distributing data processing across a network of cheap and possibly unreliable computers
• Arg4: MapReduce is missing features
• Arg5: MapReduce is incompatible with the DBMS tools
  • The ability to process a huge volume of data quickly, as in web crawling and log analysis, is more important than guaranteeing 100% data integrity and completeness

Page 23: Emergent Distributed Data Storage

DBs are hammers; MR is a screwdriver
Mark C. Chu-Carroll

• RDBs don’t parallelize very well
  • How many RDBs do you know that can efficiently split a task among 1,000 cheap computers?
• RDBs don’t handle non-tabular data well
  • RDBs are notorious for doing a poor job on recursive data structures
• MapReduce isn’t intended to replace relational databases
  • It’s intended to provide a lightweight way of programming things so that they can run fast by running in parallel on a lot of machines

Page 24: Emergent Distributed Data Storage

MR is a Step Backwards, but some Steps Forward
Eugene Shekita

• Arg1: Data Models, Schemas, and Query Languages
  • Semi-structured data models and high-level parallel data flow query languages are built on top of MapReduce
    • Pig, Hive, Jaql, Cascading, Cloudbase
  • Hadoop will eventually have a real data model, schema, catalogs, and query language
  • Moreover, Pig, Jaql, and Cascading are some steps forward
    • Support semi-structured data
    • Closer to high-level parallel data flow languages than to declarative query languages
  • Greenplum and Aster Data support MapReduce, but look more limited than Pig, Jaql, and Cascading
    • The calls to MapReduce functions wrapped in SQL queries will make it difficult to work with semi-structured data and program multi-step dataflows
• Arg3: Novelty
  • Teradata was doing parallel group-by 20 years ago
  • UDAs and UDFs appeared in PostgreSQL in the mid 80s
  • And yet, MapReduce is much more flexible and fault-tolerant
    • Support semi-structured data types, customizable partitioning

Page 25: Emergent Distributed Data Storage

Page 26: Emergent Distributed Data Storage

Lessons Learned from the Debates

Who Moved My Cheese?
• Speed
  • The seek times of physical storage are not keeping pace with improvements in network speeds
• Scale
  • The difficulty of scaling the RDBMS out efficiently
  • Clustering beyond a handful of servers is notoriously hard
• Integration
  • Today’s data processing tasks increasingly have to access and combine data from many different non-relational sources, often over a network
• Volume
  • Data volumes have grown from tens of gigabytes in the 1990s to hundreds of terabytes and often petabytes in recent years

Stolen from “10 Ways To Complement the Enterprise RDBMS using Hadoop”

Page 27: Emergent Distributed Data Storage

A Hybrid of MapReduce and RDBMS

Page 28: Emergent Distributed Data Storage

Integrate MapReduce into RDBMS

• HadoopDB, Greenplum, Aster Data

(Aspect) | RDBMS | MapReduce
Data size | Gigabytes | Petabytes
Updates | Read and write (mutable) | Write once, read many times (immutable)
Latency | Low | High
Access | Interactive (point query) and batch | Batch (ad-hoc query in brute force)
Structure | Fixed schema | Semi-structured schema
Language | SQL | Procedural (Java, C++, etc.)
Integrity | High | Low
Scaling | Nonlinear | Linear

Page 29: Emergent Distributed Data Storage

In-Database MapReduce vs. File-only MapReduce

(Aspect) | In-Database MapReduce | File-Only MapReduce
Target User | Analyst, DBA, Data Miner | Computer Science Engineer
Scale & Performance | High | High
Hardware Costs | Low | Low
Analytical Insights | High | High
Failover & Recovery | High | High
Use: Ad-Hoc Queries | Easy (seamless) | Harder (custom)
Use: UI, Client Tools | BI Tool (GUI), SQL (CLI) | Developer Tool (Java)
Use: Ecosystem | High (JDBC, ODBC) | Lower (custom)
Protect: Data Integrity | High (ACID, schema) | Lower (no transaction guarantees)
Protect: Security | High (roles, privileges) | Lower (custom)
Protect: Backup & DR | High (database backup/DR) | Lower (custom)
Performance: Mixed Workloads | High (workload/QoS mgmt) | Lower (limited concurrency)
Performance: Network Bottleneck | No (optimized partitioning) | Higher (network inefficient)
Operational Cost | Low (1 DBA) | Higher (several engineers)

• In-Database MapReduce
  • Greenplum, Aster Data, HadoopDB
• File-only MapReduce
  • Pig, Hive, Cloudbase

Page 30: Emergent Distributed Data Storage

Non-Relational Data Storages

Page 31: Emergent Distributed Data Storage

Throw ‘Relational’ Away, and Take ‘Schema-Free’

The new face of data
• Scale out, not up
  • Online load balancing, cluster growth
• Flexible schema
  • Some data have sparse attributes and do not need the ‘relational’ property
  • Document/Term vector, User/Item matrix, Log-structured data
• Key-oriented queries
  • Some data are stored and retrieved mainly by primary key, without complex joins
• Trade-off of Consistency, Availability, and Partition Tolerance

Two Feasible Approaches
• Bigtable
  • How can we build a distributed DB on top of GFS?
  • B+ Tree style lookup, synchronized consistency
  • Memtable/Commit Log/Immutable SSTable/Indexes, Compaction
• Dynamo
  • How can we build a distributed hash table appropriate for the data center?
  • DHT style lookup, eventual consistency
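The DHT style lookup mentioned for Dynamo can be illustrated with a small consistent-hashing ring. This is a simplified sketch, not Dynamo's actual design: the `HashRing` class, the 32-bit ring size, and the node names are assumptions, and real systems add virtual nodes, membership changes, and conflict handling.

```python
import bisect
import hashlib


def ring_position(name, bits=32):
    """Map a node name or a key onto a ring of 2**bits positions."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)


class HashRing:
    """Minimal consistent-hashing ring with one position per node."""

    def __init__(self, nodes):
        self._ring = sorted((ring_position(node), node) for node in nodes)
        self._positions = [position for position, _ in self._ring]

    def lookup(self, key, replicas=1):
        """Return the first `replicas` nodes clockwise from the key's position."""
        start = bisect.bisect_left(self._positions, ring_position(key))
        count = min(replicas, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


# Hypothetical cluster before and after growth by one node.
ring = HashRing(["node1", "node2", "node3", "node4"])
bigger = HashRing(["node1", "node2", "node3", "node4", "node5"])
owners = ring.lookup("user:42", replicas=3)
```

When the cluster grows, a key either keeps its owner or moves to the new node, so only a fraction of the data has to be moved; this is what makes the scheme suitable for online cluster growth.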

Page 32: Emergent Distributed Data Storage

A Comparison of Non-Relational Data Storages

Name | Language | Fault-tolerance | Persistence | Client Protocol | Data model | Docs | Community
HBase | Java | Replication, partitioning | Custom on-disk | Custom API, Thrift, REST | Bigtable | A | Apache, yes
Hypertable | C++ | Replication, partitioning | Custom on-disk | Thrift, other | Bigtable | A | Zvents, Baidu, yes
Neptune | Java | Replication, partitioning | Custom on-disk | Custom API, Thrift, REST | Bigtable | A | NHN, some
Voldemort | Java | Partitioned, replicated, read-repair | Pluggable: BerkeleyDB, MySQL | Java API | Structured / blob / text | A | LinkedIn, no
Ringo | Erlang | Partitioned, replicated, immutable | Custom on-disk (append-only log) | HTTP | blob | B | Nokia, no
Scalaris | Erlang | Partitioned, replicated, Paxos | In-memory only | Erlang, Java, HTTP | blob | B | OnScale, no
Kai | Erlang | Partitioned, replicated? | On-disk Dets file | Memcached | blob | C | no
Dynomite | Erlang | Partitioned, replicated | Pluggable: couch, dets | Custom ASCII, Thrift | blob | D+ | Powerset, no
MemcacheDB | C | Replication | BerkeleyDB | Memcached | blob | B | some
ThruDB | C++ | Replication | Pluggable: BerkeleyDB, Custom, MySQL, S3 | Thrift | Document oriented | C+ | Third rail, unsure
CouchDB | Erlang | Replication, partitioning? | Custom on-disk | HTTP, JSON | Document oriented (JSON) | A | Apache, yes
Cassandra | Java | Replication, partitioning | Custom on-disk | Thrift | Bigtable meets Dynamo | F | Facebook, no
Coord | C++ | Replication?, partitioning | Pluggable: in-memory, Lucene, BerkeleyDB, MySQL | Custom API, Thrift | text / blob | A | NHN, some

Stolen from “Anti-RDBMS: A list of distributed key-value stores” by Richard Jones

[Figure: on-going classification by Woohyun Kim. Category labels: Bigtable, Dynamo, DHT, Document-oriented, Key-Value. Systems shown: HBase, Hypertable, Dynomite, Voldemort, Kai, CouchDB, ThruDB, MongoDB, SimpleDB, Scalaris, Tokyo Cabinet, Chordless, MemcacheDB, Cassandra]

Page 33: Emergent Distributed Data Storage

Emergent Document-oriented Storages

Why Document-oriented?
• All fields become optional
• All relationships become Many-to-Many
• Chatter always expands

Key Features
• Schema-Free
• Straightforward Data Model
• Full Text Indexing
• RESTful HTTP/JSON API

Page 34: Emergent Distributed Data Storage

Document-oriented vs. RDBMS

(Feature) | CouchDB | MongoDB | MySQL
Data Model | Document-Oriented (JSON) | Document-Oriented (BSON) | Relational
Data Types | ? | string, int, double, boolean, date, bytearray, object, array, others | Link
Large Objects (Files) | Yes (attachments) | Yes (GridFS) | no???
Replication | Master-master (with developer-supplied conflict resolution) | Master-slave | Master-slave
Object (row) Storage | One large repository | Collection based | Table based
Query Method | Map/reduce of JavaScript functions to lazily build an index per query | Dynamic; object-based query language | Dynamic; SQL
Secondary Indexes | Yes | Yes | Yes
Atomicity | Single document | Single document | Yes (advanced)
Interface | REST | Native drivers | Native drivers
Server-side batch data manipulation | ? | Yes, via JavaScript | Yes (SQL)
Written in | Erlang | C++ | C
Concurrency Control | MVCC | Update in Place | Update in Place

Page 35: Emergent Distributed Data Storage

Thank you.

Page 36: Emergent Distributed Data Storage

Appendix: What is Coord?

Architectural Comparison
• dust: a distributed file system based on DHT
• coord spaces: a resource-sharable store system based on SBA
• coord mapreduce: a simplified large-scale data processing framework
• warp: a scalable remote/parallel execution system
• graph: a large-scale distributed graph search system

Page 37: Emergent Distributed Data Storage

Appendix: Coord Internals

A space-based architecture built on distributed hash tables
• SBA (Space-based Architecture): processes communicate with each other only through spaces
• DHT (Distributed Hash Tables): data identified by hash functions are placed on numerically near nodes

A computing platform that projects a single address space onto distributed memories
• As if users worked in a single computing environment

[Figure: an App issues write, take, and read operations against a space spread over node 1, node 2, node 3, ..., node n, placed on an identifier range from 0 to 2^m - 1]
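The write, take, and read operations named in the figure can be sketched as a toy space whose entries are spread over nodes by hashing. This is not Coord's implementation or API: the `Space` class, its method signatures, the evenly spaced node identifiers, and the non-blocking behavior are all assumptions made for illustration.

```python
import hashlib


class Space:
    """Toy space: entries are hashed onto nodes on a ring of 0 .. 2^m - 1."""

    def __init__(self, node_count=4, m=16):
        self._m = m
        step = (2 ** m) // node_count
        # Each node has an identifier on the ring and a local in-memory store.
        self._nodes = [(i * step, {}) for i in range(node_count)]

    def _key_id(self, key):
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % (2 ** self._m)

    def _store_for(self, key):
        """Pick the first node whose identifier is not below the key's identifier."""
        key_id = self._key_id(key)
        for node_id, store in self._nodes:
            if key_id <= node_id:
                return store
        return self._nodes[0][1]  # wrap around the ring

    def write(self, key, value):
        """Put an entry into the space."""
        self._store_for(key)[key] = value

    def read(self, key):
        """Look at an entry without removing it."""
        return self._store_for(key).get(key)

    def take(self, key):
        """Remove an entry and return it, so only one process consumes it."""
        return self._store_for(key).pop(key, None)


space = Space()
space.write("job:1", {"task": "count"})
seen = space.read("job:1")
taken = space.take("job:1")
gone = space.read("job:1")
```

Callers never name a node, which mirrors the idea of working as if in a single computing environment; a fuller version would let read and take block until a matching entry appears.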