Kiji cassandra la june 2014 - v02 clint-kelly

DESCRIPTION

Big Data Camp LA 2014: “Don’t Reinvent the Big-Data Wheel: Building real-time, Big Data applications on Cassandra with the open-source Kiji project”, by Clint Kelly of WibiData

Don’t Reinvent the Big-Data Wheel!

Clint Kelly - @clintwkelly

WibiData

Building real-time, Big Data applications on Cassandra with the open-source Kiji project

Big Data Camp LA

14 June 2014

Agenda

The problem

How Kiji works

Kiji in production

Kiji on Cassandra

The problem.

Open source software

Data in (REST)

Inspect

Train (“Trained model”)

Model

Score (Batch)

Data out (REST)

Experiments / Deployment

3

Data in / out (REST)

Inspect and train

Score (real-time)

Kiji

How Kiji works

Kiji History

How does it work?

[Architecture diagram, built up over successive slides.]

[Diagram labels: Engineering; Data Science; Channels; Logs; DBs; Write; Read; Stream; Query; Data; User 1, User 2, User 3; Scorer]

[Kiji components shown: KijiSchema (Cassandra), KijiMR, KijiREST, KijiHive, KijiExpress, KijiScoring, Kiji Model Repository]

3

Data in / out: KijiREST, KijiMR

Inspect and train: KijiHive, KijiMR, KijiExpress

Score (real-time): KijiModelRepository, KijiScoring

Modular

Kiji in production

In production now

Fortune 500 retailer: Personalized recommendations

Opower: Energy usage and analytics reporting

Fortune 500 retailer

Serving personalized recommendations

Bulk load

[Diagram labels: Engineering, Channels, Logs, DBs, Write, KijiMR, Kiji]

Train

[Diagram labels: KijiSchema (Cassandra), Data Science, User 1, User 2, User 3, Data, KijiExpress, KijiMR]

Score

[Diagram labels: KijiSchema (Cassandra), Data Science, Engineering, Channels, Write, Read, Stream, KijiREST, User 1, User 2, User 3, Scorer, KijiScoring, Kiji Model Repository]

Kiji on Cassandra

KijiSchema on Cassandra

KijiSchema on HBase

Kiji ~ BigTable

table

row

Row key = entity ID

[Diagram: a row shown as an entity ID followed by its data]

Composite entity IDs

[Diagram: entity ID with components 0xfa and “bob”, followed by data]

Column families

[Diagram: row 0xfa “bob” with families payment, interactions, recommendations]

Columns

[Diagram: row 0xfa “bob” with columns inter:clicks, inter:search, payment:cardnum, payment:address, rec:scorer1, rec:scorer2]

Timestamped versions

[Diagram: row 0xfa “bob” with columns songs:let it be, inter:search, inter:clicks, payment:cardnum, payment:address, rec:scorer1, rec:scorer2, rec:scorer3; several columns hold multiple versions, with timestamps such as 1396560123 and 1395650231]

Complex data types

record Search {
  string search_term;
  long session_id;
  device_type device;
}
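The record above looks like Avro IDL, and a later slide stores column values as blobs. As a rough illustration of a structured value round-tripping through an opaque byte string, here is a minimal Python sketch. It is not Kiji's encoding: JSON stands in for the real serialization, and the `DeviceType` symbols are hypothetical because the slide does not list the values of `device_type`.

```python
import json
from dataclasses import dataclass, asdict
from enum import Enum


class DeviceType(Enum):
    # Hypothetical symbols; the slide does not define device_type.
    DESKTOP = "DESKTOP"
    MOBILE = "MOBILE"


@dataclass(frozen=True)
class Search:
    """Python mirror of the Search record shown on the slide."""
    search_term: str
    session_id: int
    device: DeviceType


def encode(search: Search) -> bytes:
    """Serialize a Search to bytes (JSON is a stand-in for the real encoding)."""
    fields = asdict(search)
    fields["device"] = search.device.value  # enums are not JSON-serializable
    return json.dumps(fields, sort_keys=True).encode("utf-8")


def decode(blob: bytes) -> Search:
    """Rebuild a Search from the bytes produced by encode()."""
    fields = json.loads(blob.decode("utf-8"))
    return Search(
        search_term=fields["search_term"],
        session_id=fields["session_id"],
        device=DeviceType(fields["device"]),
    )
```

The point of the sketch is only that the storage layer sees bytes while readers and writers agree on a record schema.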

Locality group

[Diagram: column families tagged Batch or Real-time, grouped into two locality groups, locality_group_batch and locality_group_real_time]

On disk. Compressed. In memory.

Row ➔ transactional consistency

Locality group ➔ Column family

CREATE TABLE loc_grp

Entity ID ➔ Primary key

CREATE TABLE loc_grp (
  city text,
  user text,
  PRIMARY KEY (city, user)
) WITH CLUSTERING ORDER BY (user ASC);

Family, Qualifier, Version ➔ Clustering Columns

CREATE TABLE loc_grp (
  city text,
  user text,
  family text,
  qualifier text,
  version bigint,
  PRIMARY KEY (city, user, family, qualifier, version)
) WITH CLUSTERING ORDER BY (user ASC, family ASC, qualifier ASC, version DESC);

Column values ➔ Blobs

CREATE TABLE loc_grp (
  city text,
  user text,
  family text,
  qualifier text,
  version bigint,
  value blob,
  PRIMARY KEY (city, user, family, qualifier, version)
) WITH CLUSTERING ORDER BY (user ASC, family ASC, qualifier ASC, version DESC);
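To make the effect of this layout concrete, here is a small in-memory Python sketch of the same ordering. It assumes `city` is the partition key and the remaining key columns cluster as declared above. The `Cell` type, the helper names, and the sample values are illustrative only and are not part of Kiji or Cassandra; the timestamps are the ones shown on the slides.

```python
from typing import List, NamedTuple, Optional


class Cell(NamedTuple):
    city: str       # partition key (first entity-ID component)
    user: str       # clustering column (second entity-ID component)
    family: str     # clustering column
    qualifier: str  # clustering column
    version: int    # clustering column, stored newest-first
    value: bytes    # opaque blob


def clustering_key(cell: Cell):
    # user ASC, family ASC, qualifier ASC, version DESC
    return (cell.user, cell.family, cell.qualifier, -cell.version)


def partition(cells: List[Cell], city: str) -> List[Cell]:
    """Return one partition's cells in the declared clustering order."""
    return sorted((c for c in cells if c.city == city), key=clustering_key)


def latest(cells: List[Cell], city: str, user: str,
           family: str, qualifier: str) -> Optional[Cell]:
    """Newest version of a column: the first matching cell in clustering order."""
    for c in partition(cells, city):
        if (c.user, c.family, c.qualifier) == (user, family, qualifier):
            return c
    return None


# Illustrative data only.
CELLS = [
    Cell("LA", "bob", "rec", "scorer1", 1395650231, b"old"),
    Cell("LA", "bob", "rec", "scorer1", 1396560123, b"new"),
    Cell("LA", "bob", "inter", "clicks", 1396560123, b"click"),
]
```

Because `version` is ordered descending, the most recent value of a column is the first cell found for that (family, qualifier) pair, so a "latest value" read does not have to scan older versions.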

Implementation notes

DataStax Java driver

Cassandra 2.0.6

Async API

New MapReduce InputFormat

Issues

Operations across locality groups

Kiji locality group ➔ C* column family

Read across locality groups ➔ multiple C* reads (async API!)

Compare-and-set across locality groups ➔ not allowed in C* Kiji

Lose transactional consistency
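The "multiple C* reads" point can be sketched as a fan-out: one read per locality group, issued concurrently, then merged into a single logical row. The Python below is a sketch under stated assumptions, not the Kiji implementation. The talk names the DataStax Java driver's async API; here `asyncio` stands in for it, and `TABLES` is a hypothetical in-memory substitute for one Cassandra table per locality group.

```python
import asyncio
from typing import Dict, Iterable

# Hypothetical stand-in: one "table" per Kiji locality group.
TABLES: Dict[str, Dict[str, Dict[str, bytes]]] = {
    "locality_group_real_time": {"bob": {"inter:clicks": b"c1"}},
    "locality_group_batch": {"bob": {"rec:scorer1": b"r1"}},
}


async def read_group(group: str, entity_id: str) -> Dict[str, bytes]:
    """Simulate one asynchronous read against a single locality group."""
    await asyncio.sleep(0)  # placeholder for the network round trip
    return dict(TABLES[group].get(entity_id, {}))


async def read_row(entity_id: str, groups: Iterable[str]) -> Dict[str, bytes]:
    """Issue one read per locality group concurrently and merge the results."""
    parts = await asyncio.gather(*(read_group(g, entity_id) for g in groups))
    row: Dict[str, bytes] = {}
    for part in parts:
        row.update(part)
    return row


def get_row(entity_id: str) -> Dict[str, bytes]:
    """Synchronous convenience wrapper around read_row()."""
    return asyncio.run(read_row(entity_id, list(TABLES)))
```

The merged row is assembled from independent reads, so it is not a single consistent snapshot, which matches the loss of transactional consistency noted above.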

Filters

HBase ➔ Rich server-side filters

Cassandra ➔ WHERE clauses

Client-side filtering
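Client-side filtering means fetching a broader result set than needed and discarding the unwanted cells in the application. The Python sketch below illustrates the idea. `fetch_rows` is a hypothetical stand-in for the rows returned by a coarse query, and the qualifier-prefix predicate is an assumed example of a condition applied after the fetch rather than in the WHERE clause.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# (family, qualifier, version, value)
Row = Tuple[str, str, int, bytes]


def fetch_rows() -> List[Row]:
    """Hypothetical result of a coarse query; real rows would come from Cassandra."""
    return [
        ("inter", "clicks", 1396560123, b"a"),
        ("inter", "search", 1396560123, b"b"),
        ("rec", "scorer1", 1395650231, b"c"),
    ]


def client_side_filter(rows: Iterable[Row],
                       predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Lazily keep only the rows that satisfy the predicate."""
    return (row for row in rows if predicate(row))


def qualifier_starts_with(prefix: str) -> Callable[[Row], bool]:
    """Build a predicate that matches on the qualifier's prefix."""
    return lambda row: row[1].startswith(prefix)


def filtered(prefix: str) -> List[Row]:
    """Fetch everything, then filter in the client."""
    return list(client_side_filter(fetch_rows(), qualifier_starts_with(prefix)))
```

The trade-off is that every fetched row crosses the network before it can be rejected, so it pays to push as much of the restriction as possible into the query itself.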

Project status

Components working with Cassandra

KijiSchema

KijiMR

KijiREST

KijiExpress

All code available with tutorial within 1-2 months

Summary

3

Data in / out: KijiREST, KijiMR

Inspect and train: KijiHive, KijiMR, KijiExpress

Score (real-time): KijiModelRepository, KijiScoring

Thanks to Cassandra community

Mailing lists

Meetups, webinars, conferences