Intro
• Israel’s Big Data Meetup
• By Developers, For Developers
• We talk code, architecture, solutions
– No products
– No sales pitches
• Suggest ideas for next meetups
• Suggest yourselves as speakers for next meetups
About Me
• Liran Zelkha, from Tona Consulting
• Formerly co-founder at ScaleBase
– NewSQL solution for scaling RDBMS
• Works (a lot) with applications that need to scale, both reads and writes
Agenda
• Terminology
• Massive Reads vs. Massive Writes
• Solutions
– RDBMS
– NoSQL
– Code
TERMINOLOGY
Terminology
• OLTP – a class of systems that facilitate and manage transaction-oriented applications, typically for data entry and retrieval transaction processing. The term is somewhat ambiguous; some understand a "transaction" in the context of computer or database transactions, while others (such as the Transaction Processing Performance Council) define it in terms of business or commercial transactions. OLTP has also been used to refer to processing in which the system responds immediately to user requests. An automatic teller machine (ATM) for a bank is an example of a commercial transaction processing application.
http://en.wikipedia.org/wiki/Online_transaction_processing
Terminology
• OLAP – an approach to swiftly answer multi-dimensional analytical (MDA) queries. OLAP is part of the broader category of business intelligence, which also encompasses relational reporting and data mining. Typical applications of OLAP include business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting and similar areas, with new applications coming up, such as agriculture. The term OLAP was created as a slight modification of the traditional database term OLTP (Online Transaction Processing).
http://en.wikipedia.org/wiki/Online_analytical_processing
MASSIVE READS VS. MASSIVE WRITES
Reads vs. Writes
Reads
• Caching helps
• No need for availability
• No transactions
• No locking
Writes
• Caching sucks
• Availability is a must (?)
• Transactions are a must (?)
• Locking is a must (?)
Massive Reads Solutions
• Memory, Memory, Memory
– For Caching, Caching, Caching
• Column stores (?)
Caching
• Tons of solutions.
• To name a few:
– java.util.Map
– Hazelcast
– Infinispan
– Coherence
When Caching, Consider
• Time To Live
• Memory Leaks
• Updates (?)
• What is cached and where
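Those considerations are easy to see in a toy cache. A minimal sketch (a hypothetical class, not one of the products above) that bounds both entry lifetime and total size, so stale data expires and forgotten keys cannot leak memory:

```python
import time
from collections import OrderedDict

class TTLCache:
    """Toy cache: entries expire after `ttl` seconds, and the entry count
    is bounded so the cache cannot grow without limit (a common leak)."""

    def __init__(self, max_entries=1000, ttl=60.0):
        self.max_entries = max_entries
        self.ttl = ttl
        self._data = OrderedDict()  # key -> (expires_at, value)

    def put(self, key, value):
        if key not in self._data and len(self._data) >= self.max_entries:
            self._data.popitem(last=False)  # evict oldest to bound memory
        self._data[key] = (time.monotonic() + self.ttl, value)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:
            del self._data[key]  # honor time-to-live
            return None
        return value
```

Updates remain the hard part: this sketch only expires data, it does not invalidate it when the underlying store changes.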
Column Store Databases
• Store data in columns vs. rows
• Faster reads, slower writes
• Compression helps store more data in less disk space
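The layout and compression claims are easy to demonstrate: pivoting rows into columns puts repeated values next to each other, where even a naive run-length encoder (a sketch, not any real engine's format) shrinks them well:

```python
def to_columns(rows, names):
    """Pivot row-oriented records into one contiguous list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}

def run_length_encode(column):
    """Collapse runs of repeated values into (value, count) pairs."""
    encoded = []
    for value in column:
        if encoded and encoded[-1][0] == value:
            encoded[-1] = (value, encoded[-1][1] + 1)
        else:
            encoded.append((value, 1))
    return encoded

rows = [("alice", "NY"), ("bob", "NY"), ("carol", "NY"), ("dave", "SF")]
cols = to_columns(rows, ("name", "city"))
# A scan over "city" touches only that list instead of whole rows,
# and the clustered column compresses to just 2 runs:
print(run_length_encode(cols["city"]))  # [('NY', 3), ('SF', 1)]
```

The flip side is visible too: inserting one row means appending to every column list, which is why writes are slower.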
Massive Write Solutions
• Memory?
– Only with fast networks
• Fast disks
Memory
• Memory can fail, machines can fail
• Distributed memory
• Size of memory
• See Nati Shalom’s points at http://natishalom.typepad.com/nati_shaloms_blog/2010/03/memory-is-the-new-disk-for-the-enterprise.html
Memory Is The New Disk
Fast Disks
• When massive writes translate to massive disk writes
– Fast disks are a must
• Can offer
– HA
– Scalability
– Low latency
2 Words On Storage Technologies
• RAID
• SSD
• SAN
• NAS
Example
• NMS System
• Each device interaction is built from 5-10 request/response pairs
• Each request causes up to 3 database inserts/updates
– And multiple reads
• Support up to 5M devices
• Technology stack
– JBoss 6
– EJB3, JPA
– Oracle Database
JDBC Profiling – Standard Disks
JDBC Profiling – Fast SAN
• Spec
– HP P2000 with 16 drives configured as:
• 14 300G 10K SAS Drives in RAID 10 (Data)
• 2 300G 10K SAS Drives in RAID 1 (Redo)
• Write-back is enabled
– Disk enclosure is connected via an FC switch and an 8Gb QLogic HBA on the server side.
JDBC Profiling – Fast SAN
RDBMS
MySQL Scaling Options
• Tuning
• Hardware upscale
• Partitioning
• New MySQL distribution
• Read/Write Split
Database Tuning
• There are many ways to tune your database
• A lot of data is available online; check out this post
– http://forge.mysql.com/wiki/Top10SQLPerformanceTips
Database Tuning – Some Examples
• innodb_buffer_pool_size
– Holds the data and indexes of tables in memory
– Bigger buffer results in faster row lookups
– The bigger the better
– Default – 8M
• Query Cache
– Keeps the result of queries in memory until they are invalidated by writes
– query_cache_size
• total size of memory available to query caching
– query_cache_limit
• the maximum size (in kilobytes) a single result may be in order to be cached
– Example: query_cache_size = 128MB, query_cache_limit = 4MB
Database Tuning – Pros and Cons
• Pros
– May result in major performance improvements
– Doesn't require application changes
• Cons
– Doesn't scale. No matter how well the tuning is performed, it will reach a limit defined by machine capabilities.
SQL Tuning
• If you write lousy SQL code, you'll get lousy performance
– Java gurus are not SQL gurus
– Your ORM does not know how to write good SQL code
• What will happen when executing
– SELECT * FROM
• Tuning your SQL commands is tedious but very rewarding
SQL Tuning – Some Examples
• Here are just some examples:
– Use EXPLAIN to profile the query execution plan
– Use DISTINCT – not GROUP BY
– Don’t use an indexed column with a function
– Try not to use non-deterministic functions in the WHERE clause
SQL Tuning – Pros and Cons
• Pros
– May result in major performance improvements
• Cons
– Requires code modifications
– Doesn't scale. No matter how well the tuning is performed, it will reach a limit defined by machine capabilities.
Scaling Up Hardware
• Usually DB gets the strongest servers
• However – there is a limit to how much performance gains you can get from increasing hardware
• Some data:
http://www.mysqlperformanceblog.com/2011/01/26/modeling-innodb-scalability-on-multi-core-servers/
Scaling Up Hardware – Pros and Cons
• Pros
– May result in major performance improvements
– Transparent
– Easy
• Cons
– Scaling is limited
– Might be expensive
SSD
• Solid State Drive
– Better latency and access time than regular HDDs
– Costs more per GB (but prices are dropping)
• Vadim Tkachenko from Percona gave a great lecture on SSD at MySQL Conf 2011
– (see slides at http://en.oreilly.com/mysql2011/public/schedule/detail/17117)
– Claims you can expect up to 7x performance from SSD
SSD – Pros and Cons
• Pros
– May result in major performance improvements
– Transparent
• Cons
– Expensive
– Still limited scalability
Partitioning
• Partitioning was introduced in MySQL 5.1
• It is a way to split tables across multiple files, a technique that has proved very useful for improving database performance
• Benefits:
– Helps fit indexes in RAM
– Faster queries/inserts
– Instant delete (drop a partition)
Partitioning Performance
• See the excellent presentation by Giuseppe Maxia from 2010
– http://www.slideshare.net/datacharmer/partitions-performance-with-mysql-51-and-55
• 6-month range query, by engine:
– InnoDB: 4min 30s
– MyISAM: 25.03s
– InnoDB partitions: 13.19s
– MyISAM partitions: 4.45s
Partitioning – Pros and Cons
• Pros
– May result in major performance improvements
– Mostly transparent to the application
• Cons
– The MySQL server itself introduces limits. User concurrency, transaction chains, and isolation are still bottlenecked by the single MySQL instance that owns all partitions.
New MySQL Distributions
• There are many MySQL drop-in replacements
• They are still MySQL, but tuned differently and with different extensions
• Leading examples
– Percona Server
– MariaDB
New MySQL Distributions – Pros and Cons
• Pros
– Provide performance improvements
– Transparent
• Cons
– Still limited scalability
Other Storage Engines
• InnoDB better than MyISAM
– Oh really?
– As always, it depends
– InnoDB causes fewer corruptions, and is probably better for most high-traffic applications
• MEMORY can be great
– However, no persistence
Read/Write Splitting
• Write to MySQL master, read from 1 (or more) slaves
• Excellent read scaling
• Many issues:
– Since replication is asynchronous, reads might not be up to date
– Transactions create stickiness
– Code changes
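The issues above can be sketched as a tiny application-side router; connection names here are placeholders, and real deployments typically put this logic in a proxy or data-source layer:

```python
import itertools

class ReadWriteRouter:
    """Send writes (and everything inside a transaction) to the master;
    round-robin plain SELECTs across the replicas."""

    def __init__(self, master, replicas):
        self.master = master
        self._replicas = itertools.cycle(replicas)
        self.in_transaction = False

    def begin(self):
        self.in_transaction = True   # transactions create stickiness

    def commit(self):
        self.in_transaction = False

    def route(self, sql):
        if self.in_transaction:
            return self.master       # stick to the master mid-transaction
        if sql.lstrip().lower().startswith("select"):
            return next(self._replicas)  # may return slightly stale data
        return self.master
```

Note what the sketch cannot fix: a SELECT routed to a replica right after a write may still see old data, because replication is asynchronous.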
Read/Write Splitting – Pros and Cons
Pros Cons
Provides performance
improvements
Requires code changes
Scale out the database Good for scaling reads, not writes
Since replication is asynchronous, some reads
might get data that is not up to date.
Sharding
App
DB1
DB2
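The diagram boils down to a routing function. A hash-modulo sketch (shard names hypothetical) that always sends the same sharding key to the same database:

```python
import hashlib

SHARDS = ["DB1", "DB2"]

def shard_for(key, shards=SHARDS):
    """Hash the sharding key so rows spread evenly across shards and
    the same key always lands on the same database."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]

# Every query for a given customer id must be routed through this:
assert shard_for(42) == shard_for(42)
```

The modulo makes resharding painful: adding a third shard remaps most keys, which is why production systems prefer consistent hashing or directory-based lookups.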
MySQL + NoSQL - HandlerSocket
• Fascinating post
– http://yoshinorimatsunobu.blogspot.com/2010/10/using-mysql-as-nosql-story-for.html
• MySQL spends a huge amount of time on SQL statement parsing
• HandlerSocket uses the InnoDB API directly
• Implemented as a MySQL plugin
• Comes built in with Percona Server 5.5
HandlerSocket - Architecture
Code Sample
#!/usr/bin/perl
use strict;
use warnings;
use Net::HandlerSocket;

#1. establishing a connection
my $args = { host => 'ip_to_remote_host', port => 9998 };
my $hs = new Net::HandlerSocket($args);

#2. initializing an index so that we can use it in the main logic.
# MySQL tables will be opened here (if not opened)
my $res = $hs->open_index(0, 'test', 'user', 'PRIMARY',
    'user_name,user_email,created');
die $hs->get_error() if $res != 0;
Code Sample – Cont’
#3. main logic
# fetching rows by id
# execute_single(index id, cond, cond value, max rows, offset)
$res = $hs->execute_single(0, '=', [ '101' ], 1, 0);
die $hs->get_error() if $res->[0] != 0;
shift(@$res);
for (my $row = 0; $row < 1; ++$row) {
    my $user_name = $res->[$row + 0];
    my $user_email = $res->[$row + 1];
    my $created = $res->[$row + 2];
    print "$user_name\t$user_email\t$created\n";
}

#4. closing the connection
$hs->close();
Bashing Some NewSQL Solutions
• Xeround
– Limited database size
– Only on the cloud
• VoltDB
– Rewrite your entire app to use stored procedures
• NimbusDB
– Still in beta
• Clustrix
– Insanely expensive
– NoSQL that looks like MySQL
• Schooner
– Fast MySQL on SSD
• And no word on ScaleBase
NOSQL
NoSQL Is Here To Stay
NoSQL
• A term used to designate databases which differ from classic relational databases in some way. These data stores may not require fixed table schemas, and usually avoid join operations and typically scale horizontally. Academics and papers typically refer to these databases as structured storage, a term which would include classic relational databases as a subset.
http://en.wikipedia.org/wiki/NoSQL
NoSQL Types
• Key/Value
– A big hash table
– Examples: Voldemort, Amazon Dynamo
• Big Table
– Big table, column families
– Examples: HBase, Cassandra
• Document based
– Collections of documents
– Examples: CouchDB, MongoDB
• Graph databases
– Based on graph theory
– Examples: Neo4j
• Each solves a different problem
NO-SQL
http://browsertoolkit.com/fault-tolerance.png
MongoDB
• These slides are based on Roger Bodamer's (10gen)
• Find them here:
– http://assets.en.oreilly.com/1/event/61/Building%20Web%20Applications%20with%20MongoDB%20Presentation.ppt
• In my book – Mongo doesn't fit the massive-write story
MongoDB
• Document Oriented Database
– Data is stored in documents, not tables / relations
• MongoDB is implemented in C++ for best performance
• Platforms: 32/64-bit Windows, Linux, Mac OS X, FreeBSD, Solaris
• Language drivers for:
– Ruby / Ruby-on-Rails
– Java
– C#
– JavaScript
– C / C++
– Erlang
– Python, Perl, and many more
Design
• Want to build an app where users can check in to a location
• Leave notes or comments about that location
• Iterative Approach:
– Decide requirements
– Design documents
– Rinse, repeat :-)
Requirements
• Locations
– Need to store locations (Offices, Restaurants etc)
• Want to be able to store name, address and tags
• Maybe User Generated Content, i.e. tips / small notes ?
– Want to be able to find other locations nearby
Requirements
• Locations
– Need to store locations (Offices, Restaurants etc)
• Want to be able to store name, address and tags
• Maybe User Generated Content, i.e. tips / small notes?
– Want to be able to find other locations nearby
• Checkins
– User should be able to 'check in' to a location
– Want to be able to generate statistics
Terminology
RDBMS → Mongo
Table, View → Collection
Row(s) → JSON Document
Index → Index
Join → Embedded Document
Partition → Shard
Partition Key → Shard Key
Collections
Locations: loc1, loc2, loc3
Users: User1, User2
JSON Sample Doc
{ _id : ObjectId("4c4ba5c0672c685e5e8aabf3"),
  author : "roger",
  date : "Sat Jul 24 2010 19:47:11 GMT-0700 (PDT)",
  text : "MongoSF",
  tags : [ "San Francisco", "MongoDB" ] }
Notes:
- _id is unique, but can be anything you'd like
BSON
• JSON has a powerful, but limited set of datatypes
– Mongo extends datatypes with Date, Int types, Id, …
• MongoDB stores data in BSON
• BSON is a binary representation of JSON
– Optimized for performance and navigational abilities
– Also compression
– See bsonspec.org
location1 = {name: "10gen East Coast",
  address: "134 5th Avenue 3rd Floor",
  city: "New York",
  zip: "10011"
}
Locations v1
location1 = {name: "10gen East Coast",
  address: "134 5th Avenue 3rd Floor",
  city: "New York",
  zip: "10011"
}
db.locations.find({zip: "10011"}).limit(10)
Places v1
location1 = {name: "10gen East Coast",
  address: "17 West 18th Street 8th Floor",
  city: "New York",
  zip: "10011",
  tags: ["business", "mongodb"]
}
Places v2
location1 = {name: "10gen East Coast",
  address: "17 West 18th Street 8th Floor",
  city: "New York",
  zip: "10011",
  tags: ["business", "mongodb"]
}
db.locations.find({zip: "10011", tags: "business"})
Places v2
location1 = {name: "10gen East Coast",
  address: "17 West 18th Street 8th Floor",
  city: "New York",
  zip: "10011",
  tags: ["business", "mongodb"],
  latlong: [40.0, 72.0]
}
Places v3
location1 = {name: "10gen East Coast",
  address: "17 West 18th Street 8th Floor",
  city: "New York",
  zip: "10011",
  tags: ["business", "cool place"],
  latlong: [40.0, 72.0]
}
db.locations.ensureIndex({latlong: "2d"})
Places v3
location1 = {name: "10gen HQ",
  address: "17 West 18th Street 8th Floor",
  city: "New York",
  zip: "10011",
  tags: ["business", "cool place"],
  latlong: [40.0, 72.0]
}
db.locations.ensureIndex({latlong: "2d"})
db.locations.find({latlong: {$near: [40, 70]}})
Places v3
location1 = {name: "10gen HQ",
  address: "17 West 18th Street 8th Floor",
  city: "New York",
  zip: "10011",
  latlong: [40.0, 72.0],
  tags: ["business", "cool place"],
  tips: [
    {user: "nosh", time: "6/26/2010", tip: "stop by for
      office hours on Wednesdays from 4-6pm"},
    {.....},
  ]
}
Places v4
Creating your indexes
db.locations.ensureIndex({tags:1})
db.locations.ensureIndex({name:1})
db.locations.ensureIndex({latlong: "2d"})
Finding places:
db.locations.find({latlong:{$near:[40,70]}})
With regular expressions:
db.locations.find({name: /^typeaheadstring/})
By tag:
db.locations.find({tags: “business”})
Querying your Places
Initial data load:
db.locations.insert(place1)
Using update to Add tips:
db.locations.update({name: "10gen HQ"},
  {$push: {tips:
    {user: "nosh", time: "6/26/2010",
     tip: "stop by for office hours on Wednesdays from 4-6"}}})
Inserting and updating locations
Requirements
• Locations
– Need to store locations (Offices, Restaurants etc)
• Want to be able to store name, address and tags
• Maybe User Generated Content, i.e. tips / small notes?
– Want to be able to find other locations nearby
• Checkins
– User should be able to 'check in' to a location
– Want to be able to generate statistics
user1 = {name: "nosh",
  email: "[email protected]",
  .
  .
  .
  checkins: [{ location: "10gen HQ",
    ts: "9/20/2010 10:12:00",
    …},
    …
  ]
}
Users
Simple Stats
db.users.find({'checkins.location': "10gen HQ"})
db.checkins.find({'checkins.location': "10gen HQ"})
  .sort({ts: -1}).limit(10)
db.checkins.find({'checkins.location': "10gen HQ",
  ts: {$gt: midnight}}).count()
user1 = {name: "nosh",
  email: "[email protected]",
  .
  .
  .
  checkins: [4b97e62bf1d8c7152c9ccb74, 5a20e62bf1d8c736ab]
}
checkins[] = ObjectId references to the locations collection
Alternative
Check-in = 2 ops:
1. read location to obtain location id
2. update ($push) location id to user object
Queries – find all locations where a user checked in:
checkin_array = db.users.find({..},
  {checkins: true}).checkins
db.location.find({_id: {$in: checkin_array}})
User Check in
Unsharded Deployment
Primary
Secondary
Secondary
• Configure as a replica set for automated failover
• Async replication between nodes
• Add more secondaries to scale reads
Sharded Deployment
MongoS
Primary
Secondary
config
• Autosharding distributes data among two or more replica sets
• Mongo Config Server(s) handle distribution & balancing
• Transparent to applications
Cassandra
• Slides used from Eben Hewitt
• See original slides here:
– http://assets.en.oreilly.com/1/event/51/Scaling%20Web%20Applications%20with%20Cassandra%20Presentation.ppt
cassandra properties
• tuneably consistent
• very fast writes
• highly available
• fault tolerant
• linear, elastic scalability
• decentralized/symmetric
• ~12 client languages
– Thrift RPC API
• ~automatic provisioning of new nodes
• O(1) DHT
• big data
write op
Staged Event-Driven Architecture (SEDA)
• A general-purpose framework for high concurrency & load conditioning
• Decomposes applications into stages separated by queues
• Adopt a structured approach to event-driven concurrency
instrumentation
data replication
• configurable replication factor
• replica placement strategy
– rack unaware → Simple Strategy
– rack aware → Old Network Topology Strategy
– data center shard → Network Topology Strategy
partitioner smack-down
Random
• system will use MD5(key) to distribute data across nodes
• even distribution of keys from one CF across ranges/nodes
Order Preserving
• key distribution determined by token
• lexicographical ordering
• required for range queries – scan over rows like a cursor in an index
• can specify the token for this node to use
• 'scrabble' distribution
agenda
• context
• features
• data model
• api
structure
keyspace
– settings (e.g., partitioner)
column family
– settings (e.g., comparator, type [Std])
column
– name, value, clock
keyspace
• ~= database
• typically one per application
• some settings are configurable only per keyspace
column family
• group records of similar kind
• not same kind, because CFs are sparse tables
• ex:
– User
– Address
– Tweet
– PointOfInterest
– HotelRoom
think of cassandra as
row-oriented
• each row is uniquely identifiable by key
• rows group columns and super columns
column family
key123: { user: eben, n: 42 }
key456: { user: alison, icon: …, nickname: The Situation }
json-like notation
User {
123 : { email: [email protected],
icon: },
456 : { email: [email protected],
location: The Danger Zone}
}
example
$ cassandra -f
$ bin/cassandra-cli
cassandra> connect localhost/9160
cassandra> set Keyspace1.Standard1['eben']['age']='29'
cassandra> set Keyspace1.Standard1['eben']['email']='[email protected]'
cassandra> get Keyspace1.Standard1['eben']['age']
=> (column=6e616d65, value=39, timestamp=1282170655390000)
a column has 3 parts
1. name
– byte[]
– determines sort order
– used in queries
– indexed
2. value
– byte[]
– you don't query on column values
3. timestamp
– long (clock)
– last write wins conflict resolution
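The timestamp's job, last-write-wins reconciliation, fits in a few lines. This is a sketch of the rule only, not Cassandra's actual implementation:

```python
def reconcile(a, b):
    """Last write wins: of two versions of the same column,
    keep the one with the higher timestamp (clock).
    Columns are modeled as (name, value, timestamp) tuples."""
    return a if a[2] >= b[2] else b

older = ("age", "29", 100)
newer = ("age", "30", 200)
assert reconcile(older, newer) == newer
assert reconcile(newer, older) == newer  # order of arrival doesn't matter
```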
column comparators
• byte
• utf8
• long
• timeuuid
• lexicaluuid
• <pluggable>
– ex: lat/long
super column
super columns group columns under a common name
super column family
<<SCF>>PointOfInterest
key 10017:
  <<SC>>Central Park – desc=Fun to walk in.
  <<SC>>Empire State Bldg – phone=212.555.11212, desc=Great view from 102nd floor!
key 85255:
  <<SC>>Phoenix Zoo
PointOfInterest {
key: 85255 {
Phoenix Zoo { phone: 480-555-5555, desc: They have animals here. },
Spring Training { phone: 623-333-3333, desc: Fun for baseball fans. },
}, //end phx
key: 10019 {
Central Park { desc: Walk around. It's pretty.} ,
Empire State Building { phone: 212-777-7777,
desc: Great view from 102nd floor. }
} //end nyc
}
flexible schema
(diagram: key → super column family → super column → column)
about super column families
• sub-column names in a SCF are not indexed
– top level columns (SCF Name) are always indexed
• often used for denormalizing data from standard CFs
slice predicate
• data structure describing columns to return
– SliceRange
• start column name
• finish column name (can be empty to stop on count)
• reverse
• count (like LIMIT)
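The semantics of a slice can be sketched against a plain dict standing in for a row, with sorted keys playing the comparator's role (a toy model, not the Thrift implementation):

```python
def get_slice(row, start="", finish="", reverse=False, count=100):
    """Return up to `count` (name, value) pairs from a row whose
    columns sort by name, between start and finish column names
    (empty string = unbounded, mirroring SliceRange)."""
    names = sorted(row)                      # comparator order
    if start:
        names = [n for n in names if n >= start]
    if finish:
        names = [n for n in names if n <= finish]
    if reverse:
        names = list(reversed(names))
    return [(n, row[n]) for n in names[:count]]  # count acts like LIMIT

row = {"age": "29", "email": "e@x", "name": "eben", "zip": "10011"}
print(get_slice(row, start="email", finish="name"))
# [('email', 'e@x'), ('name', 'eben')]
```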
read api
• get() : Column
– get the Col or SC at given ColPath
COSC cosc = client.get(key, path, CL);
• get_slice() : List<ColumnOrSuperColumn>
– get Cols in one row, specified by SlicePredicate:
List<ColumnOrSuperColumn> results = client.get_slice(key, parent, predicate, CL);
• multiget_slice() : Map<key, List<CoSC>>
– get slices for a list of keys, based on SlicePredicate
Map<byte[], List<ColumnOrSuperColumn>> results = client.multiget_slice(rowKeys, parent, predicate, CL);
• get_range_slices() : List<KeySlice>
– returns multiple Cols according to a range
– range is startkey, endkey, starttoken, endtoken:
List<KeySlice> slices = client.get_range_slices(parent, predicate, keyRange, CL);
write api
client.insert(userKeyBytes, parent,
  new Column("band".getBytes(UTF8),
    "Funkadelic".getBytes(), clock), CL);
batch_mutate
– void batch_mutate(map<byte[], map<String, List<Mutation>>>, CL)
remove
– void remove(byte[], ColumnPath column_path, Clock, CL)
batch_mutate
// create param
Map<byte[], Map<String, List<Mutation>>> mutationMap =
    new HashMap<byte[], Map<String, List<Mutation>>>();

// create Cols for Muts
Column nameCol = new Column("name".getBytes(UTF8),
    "Funkadelic".getBytes("UTF-8"), new Clock(System.nanoTime()));
ColumnOrSuperColumn nameCosc = new ColumnOrSuperColumn();
nameCosc.column = nameCol;
Mutation nameMut = new Mutation();
nameMut.column_or_supercolumn = nameCosc; // also phone, etc.

Map<String, List<Mutation>> muts = new HashMap<String, List<Mutation>>();
List<Mutation> cols = new ArrayList<Mutation>();
cols.add(nameMut);
cols.add(phoneMut);
muts.put(CF, cols);

// outer map key is a row key; inner map key is the CF name
mutationMap.put(rowKey.getBytes(), muts);

// send to server
client.batch_mutate(mutationMap, CL);
what about…
SELECT WHERE
ORDER BY
JOIN ON
GROUP
rdbms: domain-based model
– what answers do I have?
cassandra: query-based model
– what questions do I have?
SELECT WHERE
cassandra is an index factory
<<cf>>USER
Key: UserID
Cols: username, email, birth date, city, state
How to support this query?
SELECT * FROM User WHERE city = 'Scottsdale'
Create a new CF called UserCity:
<<cf>>USERCITY
Key: city
Cols: IDs of the users in that city
Also uses the Valueless Column pattern
• Use an aggregate key
– state:city: { user1, user2 }
• Get rows between AZ: & AZ; for all Arizona users
• Get rows between AZ:Scottsdale & AZ:Scottsdale1 for all Scottsdale users
SELECT WHERE pt 2
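In miniature, with dicts standing in for column families, the hand-maintained UserCity index looks like this (all names hypothetical); the point is that both "CFs" are written together at insert time:

```python
users = {}       # <<cf>>USER:     key = user id, cols = profile fields
user_city = {}   # <<cf>>USERCITY: key = city, valueless cols = user ids

def add_user(user_id, username, city):
    """Denormalize at write time: every insert updates both CFs."""
    users[user_id] = {"username": username, "city": city}
    user_city.setdefault(city, set()).add(user_id)

def users_in_city(city):
    """SELECT * FROM User WHERE city = ? becomes one row lookup."""
    return user_city.get(city, set())

add_user("u1", "eben", "Scottsdale")
add_user("u2", "alison", "Scottsdale")
add_user("u3", "nosh", "New York")
assert users_in_city("Scottsdale") == {"u1", "u2"}
```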
ORDER BY
• Rows are placed according to their Partitioner:
– Random: MD5 of key
– Order-Preserving: actual key
• Rows are sorted by key, regardless of partitioner
• Columns are sorted according to CompareWith or CompareSubcolumnsWith
rdbms vs. cassandra
When To Use NoSQL
• No schema – No SQL
– Don't do this KV DB design (in an RDBMS)
• Terrible performance, impossible to maintain
• No persistency – No SQL
– Heck – use a distributed cache for that
• Low write latency
– And fast storage is not an option
• Simple queries
– You can always ETL to a DB later
• Cool factor
– Good luck with that
Real Life NoSQL Usages
• MongoDB is great for CMS
– Try MTV…
• Cassandra is great for low latency writes
• Use the right tool for the job – or you’ll get worse performance than a DB
• Expect a very high learning curve to implement
SpringData
• In your face – there are frameworks that use NoSQL
And To Finish
• A long time ago people told me Java doesn’t perform nor scale…
CODE
ORM
• Check the queries created
– For instance, in Hibernate
• setLast, setFirst – kill your performance
• Lazy fetching
• N+1
• Use batch updates
Hibernate With Batch Updates
session.doWork(new Work() {
  public void execute(Connection connection) throws SQLException {
    PreparedStatement stmt = connection.prepareStatement("insert into test values(?,?)");
    stmt.setLong(1, id);
    stmt.setLong(2, transId);
    stmt.addBatch();
    // ... repeat for each row, then:
    stmt.executeBatch();
  }
});
JDBC Code
• Always use prepared statements
• Use your Database performance extensions
– For instance, massive inserts for Column Store DBs
Transactions
• One transaction per hit. At most.
• Reads can hurt your writes
Database Tuning
• We talked about it, but:
– Indexes – good for reads, bad for writes
– Multi-column indexes – good only if the query reads using the same order of columns
– Indexes and views
– EXPLAIN is your friend
• Keep your DB small
– BI against a read replica
– Delete history
– Ensure you only save what you actually need
Final Note
• NoSQL, SQL – it doesn’t matter if your code sucks!
2 Words On Cloud
• Storage sucks
• Network sucks
• So
– Cache locally
– Write async
– Write to files, not DB
– Don’t use RDS
– Check Cloud providers that offer SSD