NoSQL: what’s all the buzz about?

http://nosql-database.org/

Next generation databases are:
• Non-relational
• Distributed
• Open-source
• Horizontally scalable

Often more characteristics apply: schema-free, easy replication support, simple API, eventually consistent / BASE (not ACID), huge amounts of data

List of NoSQL databases [122+]
• Wide Column Store / Column Families: HBase, Cassandra, Hypertable, Cloudata, Cloudera, Amazon SimpleDB
• Document Stores: CouchDB, MongoDB, Terrastore, ThruDB, OrientDB, RavenDB, Citrusleaf, SisoDB
• Key Value / Tuple Store: Azure Table Storage, MEMBASE, Riak, Redis, Chordless, GenieDB, Scalaris, Tokyo Cabinet / Tyrant, Keyspace, Berkeley DB, MemcacheDB, Faircom C-Tree, Mnesia, LightCloud, Hibari, HamsterDB, STSdb, Pincaster, RaptorDB
• Eventually Consistent Key Value Stores: Amazon Dynamo, Voldemort, Dynomite, KAI
• Graph Databases: Neo4J, Infinite Graph, Sones, InfoGrid, HyperGraphDB, Trinity, AllegroGraph, Bigdata, DEX, OpenLink Virtuoso, VertexDB, FlockDB
• Object Databases: db4o, Versant, Objectivity, Gemstone, Progress, Starcounter, Perst, Caching, ZODB, NEO, PicoLisp, Sterling
• More and more databases

So what’s wrong with relational databases?

Main principles of RDBMS

• SQL
• ACID
  • Atomic: “all or nothing”.
  • Consistent: data moves from one correct state to another correct state, with no possibility that readers could view different values that don’t make sense together.
  • Isolated: transactions executing concurrently will not become entangled with each other.
  • Durable: once a transaction has succeeded, the changes will not be lost.
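The “all or nothing” property can be illustrated with a minimal sketch using Python’s built-in sqlite3 module. The `accounts` table, the names, and the `transfer` helper are hypothetical and exist only for this illustration.

```python
import sqlite3

# In-memory database with a hypothetical accounts table.
# The CHECK constraint forbids negative balances.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; return True if the transaction committed."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        # The debit violated the CHECK constraint, so the credit
        # that ran before it was rolled back as well.
        return False


def balances(conn):
    """Return the current balances as a dict keyed by account name."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

A transfer that would overdraw the source account leaves both rows untouched, even though the credit statement ran first.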

Shortcomings of RDBMS
• Transactions under heavy load
• Complexities of vertical scaling
• Two-phase commit (2PC) protocol

Sharding

If you can’t split it, you can’t scale it (Randy Shoup, distinguished architect, eBay)

• Sharding approaches
  • Feature-based shard or functional segmentation
  • Key-based sharding
  • Lookup table
• Shared-nothing or Cassandra-like sharding
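The three sharding approaches listed above can be sketched roughly as follows. The shard names, the feature map, and the helper names are all hypothetical; a real deployment would also need a resharding strategy, which this sketch ignores.

```python
import hashlib

# Hypothetical shard names, used only for illustration.
SHARDS = ["shard-0", "shard-1", "shard-2"]

# Feature-based (functional) segmentation: each feature gets its own database.
FEATURE_SHARDS = {"users": "users-db", "orders": "orders-db"}


def feature_based_shard(feature):
    """Route by application feature rather than by row."""
    return FEATURE_SHARDS[feature]


def key_based_shard(key, shards=SHARDS):
    """Route by a stable hash of the key, modulo the number of shards."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]


class LookupTableRouter:
    """Route through an explicit key-to-shard table with a default shard."""

    def __init__(self, default_shard):
        self._table = {}
        self._default = default_shard

    def assign(self, key, shard):
        self._table[key] = shard

    def shard_for(self, key):
        return self._table.get(key, self._default)
```

Key-based routing needs no extra state but moves many keys when the shard count changes; the lookup table costs one extra lookup per request but allows individual keys to be moved.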

The real question is not “What’s wrong with relational databases?” but rather, “What problem do you have?”

Brewer’s CAP Theorem

[Figure: triangle with the vertices Consistency, Availability, and Partition Tolerance; each group of systems below sits on one edge]

• Availability + Partition Tolerance: Amazon Dynamo derivatives (Cassandra, Voldemort, Riak, CouchDB)
• Consistency + Partition Tolerance: Neo4j, Google Bigtable and its derivatives, MongoDB, Redis, Hypertable
• Consistency + Availability: relational databases (MySQL, Oracle, MSSQL)

Cassandra in 50 words or less

Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tuneably consistent, column-oriented database that bases its distribution design on Amazon’s Dynamo and its data model on Google’s Bigtable. Created at Facebook, it is now used at some of the most popular sites on the Web.

Cassandra case studies

Cassandra outlines

• BASE (Basically Available, Soft state, Eventual consistency) and not ACID (Atomicity, Consistency, Isolation, Durability)
• Distributed and decentralized
• Elastic scalability
• High availability and fault tolerance
• Tunable consistency

Use cases for Cassandra

• Large deployments
• Lots of writes, statistics and analysis
• Geographical distribution
• Evolving applications

Writes

[Figure: write path: commit log → memtable → (on reaching a threshold) flush to SSTables]

• No reads
• No seeks
• Fast
• Sequential disk access
• Atomic within a column family
• Any node
• Always writable (hinted hand-off)
• ≈ 0.2 ms

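Reads

[Figure: read path: memtable plus SSTables, each SSTable with a Bloom filter (Bf) and an index (Idx)]

• Bloom filter field to determine whether a provided key may be in the SSTable (false positives are possible, false negatives are not)
• Index field for quick read
• Any node
• Read repair
• ≈ 15 ms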
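The write and read paths can be sketched with a toy, single-node, in-memory model. This is an illustration under strong simplifying assumptions, not Cassandra’s implementation: the class names are hypothetical, nothing is persisted to disk, and compaction, hinted hand-off, and read repair are left out.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class SSTable:
    """Immutable sorted snapshot of a memtable, with a Bloom filter and an index."""

    def __init__(self, rows):
        self.index = dict(sorted(rows.items()))
        self.bloom = BloomFilter()
        for key in rows:
            self.bloom.add(key)

    def get(self, key):
        if not self.bloom.might_contain(key):
            return None  # definitely absent: skip the index lookup
        return self.index.get(key)


class ToyNode:
    def __init__(self, memtable_threshold=2):
        self.commit_log = []  # append-only, stands in for sequential disk writes
        self.memtable = {}
        self.sstables = []
        self.threshold = memtable_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # no read and no seek before writing
        self.memtable[key] = value
        if len(self.memtable) >= self.threshold:
            self.flush()

    def flush(self):
        self.sstables.append(SSTable(self.memtable))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs the log

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):  # newest SSTable first
            value = sstable.get(key)
            if value is not None:
                return value
        return None
```

The sketch shows why writes are cheap (an append plus an in-memory update) while reads may have to consult the memtable and several SSTables.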

The tenets of the column-oriented model
• Keyspace: outer container that holds column families (sort of like a relational database)
• Column Family: logical division that associates similar data (very roughly analogous to a table in the relational world)
• Column: name/value pair (plus a client-supplied timestamp of when it was last updated)
• Super Column Family: container for super columns sorted by their names
• Super Column: structure with a name and a set of dependent columns

Column Family \ Column

Column Family: a container for columns sorted by their names. Column families are referenced and sorted by row keys.

[Figure: a row key mapped to columns (column name 1 / column value 1 … column name n / column value n)]

Column: a name/value pair (it also carries a timestamp for conflict resolution on the server side).

[Figure: a column with its name and value (byte[]) and its timestamp (long)]
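Timestamp-based conflict resolution can be sketched as “the newer version wins”. The `Column` class and `resolve` helper below are hypothetical names for illustration, and keeping the first argument on a tie is an arbitrary choice made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: bytes
    value: bytes
    timestamp: int  # client-supplied, e.g. microseconds since the epoch


def resolve(a, b):
    """Return the version with the larger timestamp (ties keep `a`)."""
    return b if b.timestamp > a.timestamp else a
```

Because the timestamp comes from the client, clients with skewed clocks can make an older write win; this is a property of the model, not of the sketch.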

Super Column Family \ Super Column

Super Column Family: a container for super columns sorted by their names. Like column families, super column families are referenced and sorted by row keys.

[Figure: a row key mapped to super columns 1 … m, each holding its own columns (column name 1 / column value 1 … column name n / column value n)]

Super Column: a sorted associative array of columns.

[Figure: a super column name mapped to columns (column name 1 / column value 1 … column name n / column value n)]

Addressing Super Column Family

• Five-dimensional hash
• [Keyspace][ColumnFamily][Key][SuperColumn][SubColumn]

[Figure: a row key mapped to super columns 1 … m, each holding columns 1 … n]

Addressing Column Family

• Four-dimensional hash
• [Keyspace][ColumnFamily][Key][Column]

[Figure: a row key mapped to columns (column name 1 / column value 1 … column name n / column value n)]
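Both addressing schemes can be pictured as nested dictionaries. The keyspace, column family, row key, and column names below are made up for illustration, and the helpers are not part of any client API.

```python
# Hypothetical data: every name here is invented for the example.
store = {
    "Blog": {                               # keyspace
        "Users": {                          # column family
            "jsmith": {                     # row key
                "email": b"jsmith@example.com",   # column name -> value
                "name": b"John Smith",
            },
        },
        "Posts": {                          # super column family
            "jsmith": {                     # row key
                "post-1": {                 # super column
                    "title": b"Hello",      # sub-column name -> value
                    "body": b"First post",
                },
            },
        },
    },
}


def get_column(store, keyspace, column_family, key, column):
    """Four-dimensional lookup: [Keyspace][ColumnFamily][Key][Column]."""
    return store[keyspace][column_family][key][column]


def get_sub_column(store, keyspace, column_family, key, super_column, sub_column):
    """Five-dimensional lookup: [Keyspace][ColumnFamily][Key][SuperColumn][SubColumn]."""
    return store[keyspace][column_family][key][super_column][sub_column]
```

The sketch leaves out timestamps and the sorted ordering of columns and focuses only on how a value is addressed.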

Cassandra client options
• Thrift (12 different languages)
• Avro (data serialization system)

Java:
• Hector: http://github.com/rantav/hector (abstraction over Thrift)
• Pelops: http://github.com/s7/scale7-pelops (abstraction over Thrift)
• CQL: JDBC driver for Cassandra 0.8 and later (SQL-like language)
• Hector JPA: https://github.com/riptano/hector-jpa (ORM client)
• Cassandrelle: http://demoiselle.sf.net/component/demoiselle-cassandra/ (documentation ???)
• Kundera: http://code.google.com/p/kundera/ (buggy ???)

Python: Pycassa, Telephus
Grails: grails-cassandra
.NET: Aquiles, FluentCassandra
Ruby: Cassandra
PHP: phpcassa, SimpleCassie

Cassandra \ RDBMS query differences

• No update query
• Record-level atomicity on writes
• No duplicate keys
• Basic write properties: consistency level (ZERO, ANY, ONE, QUORUM, ALL)
• Basic read properties: consistency level (ONE, QUORUM, ALL)
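What the consistency levels mean in terms of replica acknowledgements can be sketched as below. The function names are hypothetical; the arithmetic assumes the usual definition of a quorum as a strict majority of the replicas.

```python
def required_replicas(level, replication_factor):
    """Number of replicas that must acknowledge an operation at a given level."""
    if level == "ZERO":
        return 0  # fire and forget: no acknowledgement is awaited
    if level in ("ANY", "ONE"):
        return 1  # for ANY, a stored hint also counts as the acknowledgement
    if level == "QUORUM":
        return replication_factor // 2 + 1  # strict majority
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")


def read_sees_latest_write(write_level, read_level, replication_factor):
    """True when the read and write replica sets must overlap (R + W > N)."""
    w = required_replicas(write_level, replication_factor)
    r = required_replicas(read_level, replication_factor)
    return r + w > replication_factor
```

With a replication factor of 3, QUORUM writes combined with QUORUM reads overlap in at least one replica, while ONE/ONE does not guarantee that.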

Integrating

Hadoop (http://hadoop.apache.org) is a set of open source projects that deal with large amounts of data in a distributed way.
• Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data.
• Hadoop MapReduce: a software framework for distributed processing of large data sets on compute clusters.

Other Hadoop-related projects at Apache include:
• Cassandra™: a scalable multi-master database with no single points of failure.
• Hive™: a data warehouse infrastructure that provides data summarization and ad hoc querying.
• Mahout™: a scalable machine learning and data mining library.
• Pig™: a high-level data-flow language and execution framework for parallel computation.
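The MapReduce programming model mentioned above can be illustrated with an in-process word count. This sketch does not use the Hadoop API; the function names are hypothetical and everything runs in a single Python process, whereas Hadoop distributes the same three phases across a cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in the document."""
    for word in document.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Combine all values for one key into a single result."""
    return word, sum(counts)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(word, counts)
                for word, counts in shuffle(pairs).items())
```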

The end

Questions?
