Building a Flexible, Real-time Big Data Applications Platform
on Cassandra with Kiji
Clint Kelly, Member of Technical Staff, WibiData
Cassandra Meetup, 23 April 2014
Agenda
The problem
How Kiji works
Kiji on Cassandra
Open source software
Data in (REST)
Inspect
Train ➔ “Trained model”
Model
Apply (batch)
Data out (REST)
Experiments / Deployment
3
Data in / out (REST)
Inspect and train
Apply (real-time)
Kiji
How Kiji works
Kiji History
In production now
Fortune 500 retailer: Personalized recommendations
Opower: Energy usage and analytics reporting
How does it work?
[Architecture diagram, built up over successive slides. Labels shown:]
Engineering, Data Science
Write: logs and DBs (KijiMR), stream (KijiREST)
Read: KijiREST
Channels
KijiSchema (Cassandra): User 1, User 2, User 3
Query: KijiHive
KijiExpress, KijiMR
Scorer, Kiji Model Repository, KijiScoring, Freshness Policy
3
Data in / out: KijiREST, KijiMR
Inspect and train: KijiHive, KijiMR, KijiExpress
Apply (real-time): KijiModelRepository, KijiScoring
Modular
Kiji on Cassandra
Kiji ~ BigTable
table
row
Row key = entity ID
entity ID | data
Composite entity IDs
0xfa “bob” | data
Column families
0xfa “bob” | payment | interactions | recommendations
Columns
0xfa “bob” | payment:cardnum | payment:address | inter:clicks | inter:search | rec:scorer1 | rec:scorer2
Timestamped versions
0xfa “bob” | payment:cardnum | payment:address | inter:clicks | inter:search | songs:let it be | rec:scorer1 | rec:scorer2 | rec:scorer3
Some cells (songs:let it be, rec:scorer3) hold several versions, each with a timestamp (e.g. 1396560123, 1395650231)
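The row model on these slides (entity ID, column family, qualifier, timestamped versions) can be sketched as a nested map. This is only an illustration in plain Python, not Kiji's API: the `Row` class, its method names, and the cell values are hypothetical.

```python
from collections import defaultdict


class Row:
    """One entity's row: family -> qualifier -> {timestamp: value}."""

    def __init__(self, entity_id):
        self.entity_id = entity_id
        self.cells = defaultdict(lambda: defaultdict(dict))

    def put(self, family, qualifier, timestamp, value):
        # Writing with a new timestamp adds a version; it does not overwrite.
        self.cells[family][qualifier][timestamp] = value

    def get(self, family, qualifier, max_versions=1):
        """Return up to max_versions (timestamp, value) pairs, newest first."""
        versions = self.cells[family][qualifier]
        newest_first = sorted(versions.items(), reverse=True)
        return newest_first[:max_versions]


# Composite entity ID and timestamps taken from the slide; values are made up.
row = Row(entity_id=("0xfa", "bob"))
row.put("songs", "let it be", 1395650231, "play #1")
row.put("songs", "let it be", 1396560123, "play #2")
latest = row.get("songs", "let it be")
```

Reading with `max_versions=1` returns only the newest cell, which is the common case for real-time lookups.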
Complex data types
record Search {
  string search_term;
  long session_id;
  device_type device;
}
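A rough Python analogue of the Search record above, to show how a structured record becomes the bytes stored in a single cell. JSON stands in for Avro's binary encoding here, and the `DeviceType` symbols are hypothetical, since the slide does not list them.

```python
import json
from dataclasses import dataclass, asdict
from enum import Enum


class DeviceType(Enum):
    # Hypothetical symbols; the slide does not show the enum's members.
    DESKTOP = "DESKTOP"
    MOBILE = "MOBILE"


@dataclass
class Search:
    search_term: str
    session_id: int
    device: DeviceType


def encode(record: Search) -> bytes:
    """Serialize a record to bytes (JSON as a stand-in for Avro binary)."""
    fields = asdict(record)
    fields["device"] = record.device.value
    return json.dumps(fields, sort_keys=True).encode("utf-8")


def decode(blob: bytes) -> Search:
    """Rebuild the record from the stored bytes."""
    fields = json.loads(blob.decode("utf-8"))
    fields["device"] = DeviceType(fields["device"])
    return Search(**fields)


cell_value = encode(Search("let it be", 42, DeviceType.MOBILE))
```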
Locality group
Column families, grouped by access pattern (batch or real-time)
locality_group_batch: On disk. Compressed.
locality_group_real_time: In memory.
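One way to picture the grouping is as a small piece of configuration. The sketch below is illustrative only: the group names come from the slide, but the `LocalityGroup` type, its fields, and the assignment of families to groups are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalityGroup:
    name: str
    in_memory: bool
    compressed: bool
    families: tuple


# Which family lives in which group is assumed for illustration.
LOCALITY_GROUPS = (
    LocalityGroup("locality_group_batch", in_memory=False, compressed=True,
                  families=("payment", "interactions")),
    LocalityGroup("locality_group_real_time", in_memory=True, compressed=False,
                  families=("recommendations",)),
)


def group_for(family):
    """Return the locality group that stores the given column family."""
    for group in LOCALITY_GROUPS:
        if family in group.families:
            return group
    raise KeyError(family)
```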
Row ➔ transactional consistency
Locality group ➔ Column family
CREATE TABLE loc_grp
Entity ID ➔ Primary key
CREATE TABLE loc_grp (city text, user text,
PRIMARY KEY (city, user) )
WITH CLUSTERING ORDER BY (user ASC);
Family, Qualifier, Version ➔ Clustering Columns
CREATE TABLE loc_grp (city text, user text,
family text, qualifier text, version bigint,
PRIMARY KEY (city, user, family, qualifier, version) )
WITH CLUSTERING ORDER BY (user ASC, family ASC, qualifier ASC, version DESC);
Column values ➔ Blobs
CREATE TABLE loc_grp (city text, user text,
family text, qualifier text, version bigint, value blob,
PRIMARY KEY (city, user, family, qualifier, version) )
WITH CLUSTERING ORDER BY (user ASC, family ASC, qualifier ASC, version DESC);
0xfa
bob:pay:cardnum:t ➔ AMEX1234...
bob:pay:addr:t5 ➔ 1234 Main St, SF
bob:inter:clicks:t9 ➔ ...
bob:inter:clicks:t7 ➔ ...
bob:inter:clicks:t6 ➔ ...
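The effect of the clustering order in the CREATE TABLE statement above can be simulated in a few lines. This sketch does not talk to Cassandra; the `Cell` type and helper names are hypothetical, and it only shows that `version DESC` places the newest version of each column first within a partition.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    user: str
    family: str
    qualifier: str
    version: int
    value: bytes


def clustering_key(cell):
    # Mirrors: user ASC, family ASC, qualifier ASC, version DESC
    return (cell.user, cell.family, cell.qualifier, -cell.version)


def layout(cells):
    """Order the cells of one partition as the clustering order would."""
    return sorted(cells, key=clustering_key)


def latest(cells, user, family, qualifier):
    """The first matching cell in clustering order is the newest version."""
    for cell in layout(cells):
        if (cell.user, cell.family, cell.qualifier) == (user, family, qualifier):
            return cell
    return None


# Versions 5, 6, 7, 9 correspond to t5, t6, t7, t9 on the slide.
cells = [
    Cell("bob", "inter", "clicks", 6, b"..."),
    Cell("bob", "inter", "clicks", 9, b"..."),
    Cell("bob", "pay", "addr", 5, b"1234 Main St, SF"),
    Cell("bob", "inter", "clicks", 7, b"..."),
]
```

Because the newest version sorts first, reading the most recent value of a column can stop at the first matching cell.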
Implementation notes
DataStax Java driver
Cassandra 2.0.6
Async API
New MapReduce InputFormat
Issues
Operations across locality groups
Kiji locality group ➔ C* column family
Read across locality groups ➔ multiple C* reads (async API!)
Compare-and-set across locality groups ➔ not allowed in C* Kiji
Lose transactional consistency
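The "multiple C* reads" pattern can be sketched with Python's asyncio: one read per locality group, issued concurrently and merged on the client. The in-memory `TABLES` dictionary and both function names are stand-ins for illustration; the actual implementation uses the DataStax Java driver's async API.

```python
import asyncio

# Hypothetical in-memory stand-ins for one C* table per locality group.
TABLES = {
    "locality_group_batch": {"bob": {"payment:address": "1234 Main St, SF"}},
    "locality_group_real_time": {"bob": {"rec:scorer1": "..."}},
}


async def read_group(group, entity_id):
    """Stand-in for one asynchronous query against one table."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return TABLES[group].get(entity_id, {})


async def read_row(entity_id, groups):
    """Issue one read per locality group concurrently, then merge."""
    partials = await asyncio.gather(
        *(read_group(group, entity_id) for group in groups)
    )
    merged = {}
    for partial in partials:
        merged.update(partial)
    return merged


row = asyncio.run(read_row("bob", list(TABLES)))
```

Each read completes independently, so the merged row is not a single atomic snapshot, which is the loss of transactional consistency the slide refers to.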
Filters
HBase ➔ Rich server-side filters
Cassandra ➔ WHERE clauses
Client-side filtering
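A minimal sketch of the split between the two, assuming the server can only restrict by column family: a coarse fetch plays the role of the WHERE clause, and any richer predicate runs on the client. The function names and sample cells are hypothetical.

```python
def fetch(cells, family):
    """Stand-in for a query whose WHERE clause restricts only the family."""
    return [cell for cell in cells if cell["family"] == family]


def client_side_filter(cells, predicate):
    """Apply a predicate the server cannot evaluate, after the fetch."""
    return [cell for cell in cells if predicate(cell)]


# Sample data; "yesterday" is made up for the example.
cells = [
    {"family": "inter", "qualifier": "search", "value": "let it be"},
    {"family": "inter", "qualifier": "search", "value": "yesterday"},
    {"family": "pay", "qualifier": "addr", "value": "1234 Main St, SF"},
]
coarse = fetch(cells, "inter")
matches = client_side_filter(coarse, lambda cell: cell["value"].startswith("let"))
```

The trade-off is that every cell in `coarse` crosses the network before the predicate discards it.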
Entity IDs with unhashed components
EntityId(state, city, username)
hashed components, then unhashed components
HBase
0x235af-alice
0x235af-bob
0x235af-cathy
0x235af-dave
0x38e0a-andy
0x38e0a-jane
0x38e0a-lucy
0x38e0a-nancy
Cassandra
0x235af | alice | bob | cathy | dave
0x38e0a | andy | jane | lucy | nancy
Limited to width of C* wide row!
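The two layouts can be compared with a short sketch. It assumes that state and city are the hashed components and username the unhashed one, and it uses a truncated MD5 digest purely for illustration; the slides do not say which hash Kiji applies, and the sample users are made up.

```python
import hashlib


def hash_prefix(*components):
    """Short hex prefix derived from the hashed components (assumed MD5)."""
    digest = hashlib.md5("|".join(components).encode("utf-8")).hexdigest()
    return "0x" + digest[:5]


def hbase_row_key(state, city, username):
    # HBase: a single flat row key, hash prefix followed by the unhashed part.
    return f"{hash_prefix(state, city)}-{username}"


def cassandra_key(state, city, username):
    # Cassandra: hash prefix as partition key, username as clustering column.
    return hash_prefix(state, city), username


users = [("CA", "SF", "alice"), ("CA", "SF", "bob"), ("WA", "Seattle", "andy")]

hbase_keys = sorted(hbase_row_key(*user) for user in users)

partitions = {}
for user in users:
    partition, clustering = cassandra_key(*user)
    partitions.setdefault(partition, []).append(clustering)
```

In the HBase layout every user is its own row, while in the Cassandra layout all users sharing a hashed prefix land in one partition, so the number of users per prefix is bounded by how wide a row Cassandra can hold.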
Project status
KijiSchema (alpha) ready now.
https://github.com/kijiproject/kiji-schema/blob/cassandra/cassandra_tutorial.md
(tinyurl.com/mmubg5o)
Next quarter
Cassandra in all Kiji components
Run MapReduce jobs with KijiExpress
Expose Cassandra-specific features
3
Data in / out: KijiREST, KijiMR
Inspect and train: KijiHive, KijiMR, KijiExpress
Apply (real-time): KijiModelRepository, KijiScoring
Thanks to Cassandra community
Mailing lists
Meetups, webinars, conferences
Try it now!
www.kiji.org/getstarted
tinyurl.com/mmubg5o