Cassandra Hadoop Best Practices by Jeremy Hanna

Hadoop + Cassandra Best Practices

Thursday, June 6, 2013

Some Background

• Hadoop support since early 2010

• MapReduce/Pig works with any Hadoop 1.x distribution.

• Hive is a neatly integrated piece of DSE

• Data locality just like with HDFS

• Cassandra can handle ~200 CFs

• Analytics specific datacenter

• Configure replication (KS/DC specific)

• Isolated reads at CL.LOCAL_QUORUM

• Writes will be replicated

• Same best practices as with Hadoop alone
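The per-keyspace, per-datacenter replication mentioned above can be sketched as follows. This is only an illustration that builds a CQL statement as a string: the helper `keyspace_cql`, the keyspace name `events`, and the datacenter names `Cassandra` and `Analytics` are assumptions, not names from the talk. Real datacenter names must match what the cluster's snitch reports.

```python
def keyspace_cql(name, replicas_per_dc):
    """Build a CREATE KEYSPACE statement with one replica count per datacenter.

    Hypothetical helper: it only formats text and does not talk to a cluster.
    """
    parts = ["'class': 'NetworkTopologyStrategy'"]
    for dc, count in replicas_per_dc.items():
        # Each datacenter gets its own replica count.
        parts.append("'%s': %d" % (dc, int(count)))
    return "CREATE KEYSPACE %s WITH replication = {%s};" % (name, ", ".join(parts))


# Assumed layout: 3 replicas in the real-time datacenter, 1 in the analytics one.
statement = keyspace_cql("events", {"Cassandra": 3, "Analytics": 1})
print(statement)
```

With a layout like this, jobs that read at LOCAL_QUORUM in the analytics datacenter stay within that datacenter, while writes are still replicated to the other one.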

Vanilla Hadoop

• Co-locate task trackers and data nodes with Cassandra nodes (data locality)

• Workload isolation with separate Cassandra datacenter configured

Planning

• MapReduce over full column family

• Model data accordingly

• Add more column families

• Can use secondary index, but use caution

Execution

• Project and select early in your workflow

• Store common intermediate datasets (in CFS/HDFS)

• Bulk loader output format excels
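"Project and select early" can be illustrated with a small, framework-free sketch. It is plain Python rather than MapReduce or Pig code, and the function name `select_and_project` and the sample rows are hypothetical. The point is only that rows are filtered and trimmed to the needed columns before any later, more expensive step sees them.

```python
def select_and_project(rows, predicate, columns):
    """Yield only the rows that pass `predicate`, keeping only `columns`."""
    for row in rows:
        if predicate(row):  # select: drop unwanted rows first
            # project: keep just the columns later stages need
            yield {c: row[c] for c in columns if c in row}


# Hypothetical input rows; the large "blob" column is never needed downstream.
rows = [
    {"user": "a", "country": "US", "clicks": 3, "blob": "x" * 1000},
    {"user": "b", "country": "DE", "clicks": 7, "blob": "y" * 1000},
    {"user": "c", "country": "US", "clicks": 5, "blob": "z" * 1000},
]

slim = list(select_and_project(rows, lambda r: r["country"] == "US", ["user", "clicks"]))
print(slim)
```

Doing this in the first step of a workflow means less data is shuffled, sorted, and stored in intermediate datasets.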

Use Cases

• Typical Hadoop tasks

• Validate data

• Fix data

• Bootstrap a new column family from existing data
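As a rough sketch of the validate and fix use cases, a map-style pass over every row might look like the code below. Everything here is an assumption made for illustration: the `email` column, the normalization rule, and the function names `validate`, `fix`, and `map_rows` do not come from the talk, and a real job would read and write through the Cassandra input and output formats rather than Python lists.

```python
def validate(row):
    """Return a list of problems found in one row (empty if the row is clean)."""
    problems = []
    if not row.get("email"):
        problems.append("missing email")
    elif row["email"] != row["email"].strip().lower():
        problems.append("email not normalized")
    return problems


def fix(row):
    """Return a corrected copy of the row; unfixable rows come back unchanged."""
    fixed = dict(row)
    if fixed.get("email"):
        fixed["email"] = fixed["email"].strip().lower()
    return fixed


def map_rows(rows):
    """Map step: emit (key, problems, fixed_row) for every row with problems."""
    for key, row in rows:
        problems = validate(row)
        if problems:
            yield key, problems, fix(row)


# Hypothetical rows keyed by user id.
rows = [
    ("u1", {"email": "a@example.com"}),
    ("u2", {"email": " B@Example.com "}),
    ("u3", {}),
]
results = list(map_rows(rows))
```

Because the job covers the full column family, the same pass can report bad rows, write corrected ones back, or both.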

Thank you

• Jeremy Hanna

• @jeromatron (Twitter and IRC)

• jeremy@datastax.com

• Ping me if you have any questions
