Cassandra Hadoop Best Practices by Jeremy Hanna

Hadoop + Cassandra Best Practices

Thursday, June 6, 13

Some Background

• Hadoop support since early 2010

• MapReduce/Pig works with any Hadoop 1.x distribution.

• Hive is a neatly integrated piece of DSE

• Data locality just like with HDFS

• Cassandra can handle ~200 CFs

Setup

• Analytics-specific datacenter

• Configure replication (KS/DC specific)

• Isolated reads at CL.LOCAL_QUORUM

• Writes will be replicated

• Same best practices as with Hadoop alone
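
The replication and consistency bullets above can be sketched as follows. This is a minimal illustration, not the speaker's configuration: the keyspace name `events`, the datacenter names `Cassandra` and `Analytics`, and the replication factors are assumptions chosen for the example. It builds a `NetworkTopologyStrategy` keyspace definition with one replication factor per datacenter and computes how many replicas in the local datacenter a LOCAL_QUORUM read waits for.

```python
def create_keyspace_cql(keyspace, replication_by_dc):
    """Build a CREATE KEYSPACE statement with per-datacenter replication."""
    parts = ["'class': 'NetworkTopologyStrategy'"]
    for dc, rf in replication_by_dc.items():
        parts.append("'{}': {}".format(dc, rf))
    return "CREATE KEYSPACE {} WITH replication = {{{}}};".format(
        keyspace, ", ".join(parts)
    )


def local_quorum(replication_factor):
    """Replicas in the local datacenter that must answer a LOCAL_QUORUM request."""
    return replication_factor // 2 + 1


# Hypothetical layout: 3 replicas in the real-time DC, 2 in the analytics DC.
replication = {"Cassandra": 3, "Analytics": 2}
statement = create_keyspace_cql("events", replication)
print(statement)
print(local_quorum(replication["Analytics"]))
```

Because LOCAL_QUORUM counts only replicas in the coordinator's datacenter, a job that connects to nodes in the analytics datacenter reads from that datacenter alone, while writes made in either datacenter are still replicated to both.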

Vanilla Hadoop

• Co-locate task trackers and data nodes with Cassandra nodes (data locality)

• Workload isolation with separate Cassandra datacenter configured

Planning

• MapReduce over full column family

• Model data accordingly

• Add more column families

• Can use secondary index, but use caution

Execution

• Project and select early in your workflow

• Store common intermediate datasets (in CFS/HDFS)

• Bulk loader output format excels
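
"Project and select early" can be illustrated with a small, framework-free sketch. The rows, column names, and helper functions below are hypothetical and stand in for whatever a MapReduce or Pig job would read from a column family. The point is only the ordering: filter rows and drop unneeded columns before the grouping step, so later stages carry less data.

```python
from collections import defaultdict

# Hypothetical input rows; the wide "agent" column is not needed by the job.
ROWS = [
    {"user": "a", "country": "US", "bytes": 120, "agent": "x" * 50},
    {"user": "b", "country": "DE", "bytes": 300, "agent": "y" * 50},
    {"user": "a", "country": "US", "bytes": 80, "agent": "z" * 50},
]


def select_and_project(rows, country, columns):
    """Keep only matching rows and only the needed columns, before any grouping."""
    for row in rows:
        if row["country"] == country:
            yield {c: row[c] for c in columns}


def total_bytes_by_user(rows):
    """Grouping stage: sum bytes per user over the already slimmed rows."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["user"]] += row["bytes"]
    return dict(totals)


slim = select_and_project(ROWS, "US", ("user", "bytes"))
result = total_bytes_by_user(slim)
print(result)  # {'a': 200}
```

If several jobs start from the same slimmed dataset, that is the kind of intermediate result the second bullet suggests storing once in CFS/HDFS and reusing.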

Use Cases

• Typical Hadoop tasks

• Validate data

• Fix data

• Bootstrap a new column family from existing data
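
As a rough sketch of the "validate data" use case, a map step can emit only the rows that fail a set of checks, so the job's output is a report of what needs fixing. The field names (`email`, `age`) and the rules are assumptions made up for this example, and the plain generator stands in for a real Hadoop mapper.

```python
def validate(row):
    """Return a list of problems found in one row (empty if the row is valid)."""
    problems = []
    email = row.get("email")
    if not email or "@" not in email:
        problems.append("bad email")
    age = row.get("age")
    if age is not None and age < 0:
        problems.append("negative age")
    return problems


def map_validate(key, row):
    """Map step: emit (key, problems) only for rows that fail validation."""
    problems = validate(row)
    if problems:
        yield key, problems


# Hypothetical rows keyed by row key.
rows = {
    "u1": {"email": "a@example.com", "age": 30},
    "u2": {"email": "not-an-email", "age": -4},
    "u3": {"age": 12},
}

report = [pair for key, row in rows.items() for pair in map_validate(key, row)]
print(report)
```

A "fix data" job would follow the same shape, emitting corrected rows to write back instead of a list of problems.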

Thank you

• Jeremy Hanna

• @jeromatron (twitter and irc)

• jeremy@datastax.com

• Ping me if you have any questions
