Brisk: More Powerful Hadoop Powered by Cassandra
[email protected]
Monday, July 25, 2011

The evolution of Analytics

Analytics + Realtime

The evolution of Analytics

Analytics Realtime

replication

The evolution of Analytics

ETL

Brisk re-unifies realtime and analytics

The Traditional Hadoop Stack

Master Nodes: Name Node, Secondary Name Node, Job Tracker, ZooKeeper, MetaStore, HBase Master

Slave Nodes: Data Node, Task Tracker, Region Server

Client Nodes: Pig, Hive

Brisk Architecture

Brisk Highlights

✤ Easy to deploy and operate
✤ No single points of failure
✤ Scale and change nodes with no downtime
✤ Cross-DC, multi-master clusters
✤ Allocate resources for OLAP vs OLTP
    ✤ With no ETL

Cassandra data model

✤ ColumnFamilies contain rows + columns
✤ (Not really schemaless for a while now)

row key   password  name              site
zznate    *         Nate McCall
driftx    *         Brandon Williams
jbellis   *         Jonathan Ellis    datastax.com

Sparse

zznate    password: *    name: Nate McCall
driftx    password: *    name: Brandon Williams
jbellis   password: *    name: Jonathan Ellis    site: datastax.com
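
A minimal Python sketch of the sparse layout above, assuming each row is modeled as a plain dictionary that stores only the columns it has. The `users` dictionary and the `get_column` helper are illustrative names, not part of Cassandra or Brisk.

```python
# Illustrative model only: a column family as a dict of rows,
# each row a dict holding just the columns that row actually has.
users = {
    "zznate": {"password": "*", "name": "Nate McCall"},
    "driftx": {"password": "*", "name": "Brandon Williams"},
    "jbellis": {"password": "*", "name": "Jonathan Ellis", "site": "datastax.com"},
}


def get_column(column_family, row_key, column, default=None):
    """Return a column value, or `default` when the row lacks that column."""
    return column_family.get(row_key, {}).get(column, default)


site_of_jbellis = get_column(users, "jbellis", "site")
site_of_zznate = get_column(users, "zznate", "site")  # row has no "site" column
```

A row without a `site` column stores nothing for it, so no placeholder value is needed.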

Rows as containers / materialized views

circle1   driftx  thobbs  pcmanus  jbellis  zznate
circle2   xedin  mdennis
circle3   xedin  pcmanus  ymorishita
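
The idea of a row as a container can be sketched in Python, assuming one row per circle and one column per member, with the column name carrying the data. The empty values and both helper functions are assumptions made for illustration.

```python
# Illustrative model only: one row per circle, one column per member.
# The column name carries the data; the value is left empty here.
circles = {
    "circle1": {"driftx": "", "thobbs": "", "pcmanus": "", "jbellis": "", "zznate": ""},
    "circle2": {"xedin": "", "mdennis": ""},
    "circle3": {"xedin": "", "pcmanus": "", "ymorishita": ""},
}


def members(circle):
    """All members of a circle come back from a single row read."""
    return sorted(circles[circle])


def circles_of(user):
    """Reverse lookup by scanning every row; a second, inverted
    column family would avoid this scan."""
    return sorted(name for name, row in circles.items() if user in row)
```

Reading one row answers "who is in this circle?" without a join, which is what makes the row behave like a materialized view.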

CassandraFS

✤ Data stored as ByteBuffer internally -- excellent fit for blocks
✤ Local reads mmap data directly (no RPC)
✤ Blocks are compressed with Google Snappy
✤ hadoop distcp hdfs:///mydata cfs:///mydata
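
A rough Python sketch of the general idea of storing a file as compressed blocks. It is not CassandraFS's actual schema: the in-memory `store` dictionary, the tiny block size, and both function names are hypothetical, and `zlib` stands in for Snappy only because Snappy is not in the Python standard library.

```python
import zlib

BLOCK_SIZE = 4  # tiny, for illustration only; real block sizes are far larger


def write_file(store, path, data):
    """Split `data` into fixed-size blocks and store each one compressed."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    # zlib is a stand-in here; the slide names Google Snappy.
    store[path] = [zlib.compress(block) for block in blocks]


def read_file(store, path):
    """Decompress the blocks in order and concatenate them."""
    return b"".join(zlib.decompress(block) for block in store[path])


store = {}  # hypothetical stand-in for the backing column family
write_file(store, "/mydata/example.txt", b"hello brisk")
```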

Hive support

✤ Hive MetaStore in Cassandra
✤ Unified schema view from any node, with no external systems and no SPOF
✤ Automatically maps Cassandra column families to Hive tables
✤ Supports static and dynamic column families (and supercolumns)

Hive: CFS and ColumnFamilies

-- Regular Hive table; under Brisk its data lives in CFS
CREATE TABLE users (name STRING, zip INT);
LOAD DATA LOCAL INPATH 'kv2.txt' OVERWRITE INTO TABLE users;

-- Static column family mapped to a Hive table
CREATE EXTERNAL TABLE Keyspace1.Users(name STRING, zip INT)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler';

-- Dynamic column family mapped as one Hive row per column
CREATE EXTERNAL TABLE Keyspace1.Users
(row_key STRING, column_name STRING, value STRING)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler';
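
The last statement exposes a column family as (row_key, column_name, value) rows. A minimal Python sketch of that flattening follows, assuming a column family modeled as nested dictionaries; the `flatten` function and the sample prices are illustrative, and this is not how the storage handler is implemented.

```python
def flatten(column_family):
    """Yield one (row_key, column_name, value) triple per stored column."""
    for row_key, columns in column_family.items():
        for column_name, value in columns.items():
            yield (row_key, column_name, value)


# Illustrative wide row: one column per date.
prices = {
    "GOOG": {"2011-01-01": 79.85, "2011-01-02": 75.23, "2011-01-03": 82.11},
}

triples = list(flatten(prices))
```

Each column of the wide row becomes its own table row, which is what lets ordinary SQL-style joins and aggregates run over dynamic column names.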

Pig Support

✤ With standard Cassandra:
$ export PIG_HOME=/path/to/pig
$ export PIG_INITIAL_ADDRESS=localhost
$ export PIG_RPC_PORT=9160
$ export PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner
$ contrib/pig/bin/pig_cassandra
grunt>

✤ With Brisk:
$ bin/brisk pig
grunt>

Pig: CFS and ColumnFamilies

grunt> data = LOAD 'cfs:///example.txt' using PigStorage() as (name:chararray, value:long);

grunt> data = LOAD 'cassandra://Demo1/Scores' using CassandraStorage() AS (key, columns: {T: tuple(name, value)});

grunt> data = LOAD 'cassandra://Demo1/Scores?slice_start=M&slice_end=S' using CassandraStorage() AS (key, columns: {T: tuple(name, value)});
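
The slice parameters in the last statement restrict which columns of each row are loaded. A small Python sketch of a column-name range slice follows, assuming both bounds are inclusive and names compare as plain strings; the `column_slice` helper and the sample row are hypothetical.

```python
from bisect import bisect_left, bisect_right


def column_slice(columns, start, end):
    """Return the (name, value) pairs whose names fall in [start, end]."""
    names = sorted(columns)
    lo = bisect_left(names, start)
    hi = bisect_right(names, end)
    return [(name, columns[name]) for name in names[lo:hi]]


# Hypothetical row from a Scores column family.
row = {"Alice": 10, "Mallory": 7, "Oscar": 3, "Trent": 9}
sliced = column_slice(row, "M", "S")
```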

Data model: Realtime

Portfolios
            GOOG  LNKD  P   AMZN  AAPL
Portfolio1  80    20    40  100   20

StockHist
      2011-01-01  2011-01-02  2011-01-03
GOOG  $79.85      $75.23      $82.11

LiveStocks
      last
GOOG  $95.52
AAPL  $186.10
AMZN  $112.98

Data model: Analytics

HistLoss
            worst_date  loss
Portfolio1  2011-07-23  -$34.81
Portfolio2  2011-03-11  -$11432.24
Portfolio3  2011-05-21  -$1476.93

Data model: Analytics

10dayreturns
ticker  rdate       return
GOOG    2011-07-25  $8.23
GOOG    2011-07-24  $6.14
GOOG    2011-07-23  $7.78
AAPL    2011-07-25  $15.32
AAPL    2011-07-24  $12.68

INSERT OVERWRITE TABLE 10dayreturns
SELECT a.row_key ticker, b.column_name rdate, b.value - a.value
FROM StockHist a JOIN StockHist b
ON (a.row_key = b.row_key AND date_add(a.column_name,10) = b.column_name);
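
The self-join in the query above pairs each price with the price ten days later for the same ticker. A minimal Python sketch of that logic follows, assuming prices are held in nested dictionaries keyed by `datetime.date`; the function name and the sample prices are made up for illustration.

```python
from datetime import date, timedelta


def ten_day_returns(stock_hist, days=10):
    """Pair each price with the price `days` later, per ticker.

    Returns (ticker, rdate, change) tuples, where rdate is the later date,
    mirroring b.column_name and b.value - a.value in the query.
    """
    rows = []
    for ticker, prices in stock_hist.items():
        for start, start_price in prices.items():
            end = start + timedelta(days=days)
            if end in prices:  # inner join: skip dates with no match
                rows.append((ticker, end, round(prices[end] - start_price, 2)))
    return rows


# Made-up prices, for illustration only.
stock_hist = {
    "GOOG": {
        date(2011, 7, 13): 100.00,
        date(2011, 7, 14): 101.50,
        date(2011, 7, 23): 107.78,
        date(2011, 7, 24): 107.64,
    }
}

returns = ten_day_returns(stock_hist)
```

Dates with no price exactly ten days later simply drop out, as they would in the inner join.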

      2011-01-01  2011-01-02  2011-01-03
GOOG  $79.85      $75.23      $82.11

row_key  column_name  value
GOOG     2011-01-01   $8.23
GOOG     2011-01-02   $6.14
GOOG     2011-01-03   $7.78

Data model: Analytics

portfolio_returns
portfolio   rdate       preturn
Portfolio1  2011-07-25  $118.21
Portfolio1  2011-07-24  $60.78
Portfolio1  2011-07-23  -$34.81
Portfolio2  2011-07-25  $2143.92
Portfolio3  2011-07-24  -$10.19

INSERT OVERWRITE TABLE portfolio_returns
SELECT row_key portfolio, rdate, SUM(b.return)
FROM portfolios a JOIN 10dayreturns b
ON (a.column_name = b.ticker)
GROUP BY row_key, rdate;
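
The join and GROUP BY above can be sketched in Python as a grouped sum. Like the query as written, the sketch joins on ticker only and does not weight by the number of shares held; the function name and the two-ticker portfolio are assumptions made for illustration.

```python
from collections import defaultdict


def portfolio_returns(portfolios, returns):
    """Sum per-ticker returns for each (portfolio, rdate) pair.

    Like the query, this joins on ticker only and does not weight
    by the number of shares held.
    """
    totals = defaultdict(float)
    for portfolio, holdings in portfolios.items():
        for ticker, rdate, ret in returns:
            if ticker in holdings:
                totals[(portfolio, rdate)] += ret
    return {key: round(total, 2) for key, total in totals.items()}


# Illustrative two-ticker portfolio; returns reuse rows from 10dayreturns.
portfolios = {"Portfolio1": {"GOOG": 80, "AAPL": 20}}
returns = [
    ("GOOG", "2011-07-25", 8.23),
    ("GOOG", "2011-07-24", 6.14),
    ("AAPL", "2011-07-25", 15.32),
    ("AAPL", "2011-07-24", 12.68),
]

result = portfolio_returns(portfolios, returns)
```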

Data model: Analytics

INSERT OVERWRITE TABLE HistLoss
SELECT a.portfolio, rdate, minp
FROM (
  SELECT portfolio, min(preturn) as minp
  FROM portfolio_returns
  GROUP BY portfolio
) a
JOIN portfolio_returns b
ON (a.portfolio = b.portfolio and a.minp = b.preturn);

HistLoss
            worst_date  loss
Portfolio1  2011-07-23  -$34.81
Portfolio2  2011-03-11  -$11432.24
Portfolio3  2011-05-21  -$1476.93
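
Finding each portfolio's worst day amounts to taking the minimum return per portfolio and keeping its date. A minimal Python sketch follows; the function name is hypothetical, and unlike the join in the query, ties keep only the first date seen.

```python
def worst_loss(portfolio_returns):
    """Return {portfolio: (worst_date, loss)} for the minimum return."""
    worst = {}
    for portfolio, rdate, preturn in portfolio_returns:
        if portfolio not in worst or preturn < worst[portfolio][1]:
            worst[portfolio] = (rdate, preturn)
    return worst


# The Portfolio1 rows from the portfolio_returns table.
rows = [
    ("Portfolio1", "2011-07-25", 118.21),
    ("Portfolio1", "2011-07-24", 60.78),
    ("Portfolio1", "2011-07-23", -34.81),
]

hist_loss = worst_loss(rows)
```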

Portfolio Demo dataflow

Portfolios

Historical Prices

Intermediate Results

Largest loss

Web-based Portfolios

Live Prices for today

Largest loss

OpsCenter

Where to get it

✤ http://www.datastax.com/brisk
