C* for Deep Learning (Andrew Jefferson, Tracktable) | Cassandra Summit 2016

C* for Deep Learning

Tractable

• Artificial Intelligence products for businesses

• Combining Deep Learning / Neural Networks and traditional Machine Learning

• AI using Deep Learning has surpassed human intelligence: – Go

– Image Recognition

Deep Learning 101

Deep Learning A branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using a deep

graph with multiple processing layers, composed of multiple linear and non-linear transformations.

1.Requires large amounts of data to train

networks

2.Computations made feasible by use of GPUs for dramatic speedup

Semantic Image Search

Similar ImageSimilar object

Search in the meaning of the image

NOT search on the image itself

Semantic Image Search

Similar ImageSimilar object

Semantic Search 2

Same object

Same style

• Intent can vary!

• With training AI can do both of these tasks

• Semantic search is more than just classification: – Ranking within a classification – Search for things that are not

classification categories

Feature Extraction

Feature Vector

Trained Deep Net

. . .Vn-1

1 Image

1 TB Images

4000 dimension vector

300 GB of features

Feature Extraction

GPU Servers C*

Feature Extraction

CREATE TABLE features listing_id uuid, image_id uuid, feature_vector blob, PRIMARY KEY ((listing_id), image_id)

• Processing images at 5 GigaBytes per second

• Features generated at 1.6 GigaBytes per second

Search

“I want a truck like this”

• 300 GB of features

• Millions of rows in C*

Distributed, in-memory computation

• Map-reduce • Graph analysis

• SQL abstraction

DataStax Spark Connector • Understands Cassandra partitioning• Push-down queries to C* keys where possible• Join on C* tables

Spark Similarity Search

CREATE TABLE features listing_id uuid, image_id uuid, feature_vector blob, PRIMARY KEY ((listing_id), image_id)

CREATE TABLE query_results query_id uuid, listing_id uuid, score float, PRIMARY KEY ((query_id), score)

1. Cache our 300GB features in Spark

2. Score every feature vector with our search image

3. Aggregate score for images in each listing

4. (optional) Join on ‘live’ C* data

5. Write result back to C*

CREATE TABLE stock_level listing_id uuid, inventory int, etag uuid, PRIMARY KEY (listing_id)

Search

Search returns results within 5 seconds

100x Speedup from holding data in-memory in Spark

10x Speedup from co-locating Spark and C*

20x Speedup from partitioning on the group key (listing id)

5x Speedup from “Cassandra sort”

Search Results

Sophisticated Analytics• Visualisation

• Clustering

• Outlier Detection

Real time, interactive exploration of large visual data sets

Sophisticated Analytics

Gas pipe

Gas joint Water pipe

Water joint

Electricity cable

Telecoms cable

• Visualisation

• Clustering

• Outlier Detection

Real time, interactive exploration of large visual data sets

Pre-processing

Ingest / ETL

Training Annotation / Labelling

Analytics & Exploration

Classification & Feature Extraction

Pre-processing

SPANCKIngest / ETL

Training Annotation / Labelling

Analytics & Exploration

Classification & Feature Extraction

SPANCK: Spark-Python Algolia Node Cassandra Kafka

Data Sources, Files, DB extracts Batched Data

Read the dataWrite the model

Data ModelingFast Analytics

DistributedData Store

Event Processing

Read the model

Streaming Data

Web API

Alerts and Notifications

Fast Data Processing

Data ExplorationSearch Index

Many Services, 1 Engineer

FOSS70+ engineers

FOSSBerkley RAD Lab200+ contributors

Many Services, 1 Engineer

FOSS70+ engineers

FOSSBerkley RAD Lab200+ contributors

Resource Allocation Scheduling

Install & Configuration | Maintenance

AI Software

Persistence | Compute | Queue SPANCK

Service Discovery | Resilience

DCOS FrameworksEasy Install

$ dcos package install kafka -- options kafka-options.json

$ dcos package install cassandra --options cassandra-options.json

{ "nodes": { "count": 12, "cpus": 9, "mem": 30000, "disk": 400000, "heap": { "size": 8000, "new": 1500 } }, "cassandra": { "row_cache_size_in_mb": 12000, "row_cache_save_period": 600, "commitlog_segment_size_in_mb": 64, "concurrent_reads": 24, "concurrent_writes": 64, "memtable_allocation_type": "offheap_objects", "compaction_throughput_mb_per_sec": 128 }}

DCOS Frameworks

function fetchMarathonServiceBaseUrl(serviceName){ var dns_url = ['_' + serviceName, "_tcp.marathon.mesos"].join('.'); return resolveSrv(dns_url).then(function(addresses){ return "http://" + addresses[0].name + ":" + addresses[0].port; }) }

fetchCassandraNodes = (mesosHost, authToken) -> mesosDns.fetchMarathonServiceBaseUrl('cassandra').then (baseUrl) -> request( url: baseUrl + "/v1/nodes/connect", headers: headers ).then (result) -> result = JSON.parse(result) return _.pluck(result.nodes, “ip")

Service Discovery • 0 configuration files

• 0 configuration / orchestration systems

• Works ever time, in every environment

DCOS Frameworks

dcos cassandra —name=cassandra \ cleanup —key_spaces=dev,test

dcos cassandra —name=cassandra \ replace node-4

dcos kafka broker replace 3

Maintenance and Admin DCOS Command Line Tools • Replace a C* node • Backup / Restore C* to AWS S3 • Run C* repair / cleanup • Restart C* nodes • Replace a Kafka broker • Rebalance Kafka brokers • Restart Kafka brokers

DCOS FrameworksShipping Docker Apps 1 JSON file 1 CLI command

Apps include • Spark Drivers • Web Servers • Kafka Producers • Kafka Consumers • APIs

{ "id": "/my-app", "cpus": 1, "mem": 2048, "instances": 1, "env": { "NODE_ENV": "dev" }, "container": { "type": "DOCKER", "docker": { "image": "tractableio/my-app:0.1", "network": "BRIDGE" } }, "healthChecks": [ { "path": "/healthy", "maxConsecutiveFailures": 2 } ]}

*JSON Slightly simplified

DCOS Cluster in 15 minutes

1.Deploy DCOS cluster into AWS using CloudFormation template

2.Install Cassandra & Kafka

3.Deploy app docker containers into DCOS

4.Profit!

Conclusion

• Cassandra’s high write speeds allow it to ingest features from Deep Networks

• Cassandra and Spark provides a powerful compute+storage system

• Spark, Cassandra and Kafka can provide a versatile data backbone that supports a range of use cases

• Mesosphere DCOS is a low-effort, high-reward way of running distributed systems

C* for Deep Learning (Andrew Jefferson, Tracktable) | Cassandra Summit 2016

Software

MariaDB Cassandra Interoperability Cassandra Storage Engine in MariaDB

Apache Cassandra in Action - O'Reilly Mediaassets.en.oreilly.com/1/event/55/Apache Cassandra in Action... · Apache Cassandra in Action. Why Cassandra? ... Cassandra in production

Cassandra Training Session 2svn.wso2.org/repos/wso2/people/kasunw/BAM/Cassandra/Cassandra... · Configuring Cassandra Contd ... replicationStrategy, replicationFactor, cfs); cluster.addKeyspace(definition);

Cassandra CLuster Management by Japan Cassandra Community

Cassandra Summit 2014: Cassandra Compute Cloud: An elastic Cassandra Infrastructure

Solr & Cassandra: Searching Cassandra with DataStax Enterprise

Jefferson, Stoicism, and the Stoic Argument against … · Jefferson, Stoicism, and the Stoic Argument against Esuriency M. Andrew Holowchak Abstract In this paper, I trace out the

THE RECONSTRUCTION OF JEFFERSON FRANKLIN LONG · PDF fileTHE RECONSTRUCTION OF JEFFERSON FRANKLIN LONG GEORGIA’S FIRST ... Williams and Andrew Young for their enormous contributions

Cassandra Community Webinar: Apache Cassandra Internals

Team Take 2: Sindus Alattar, Sarah R Connors, Cassandra Hutchinson, Tammy James, Andrew Krespan

State of Cassandra, 2012 - NoSQL | Apache Cassandra · State of Cassandra, 2012 Jonathan Ellis Project Chair, Apache Cassandra CTO, DataStax @spyced ©2012 DataStax Some Cassandra

Cassandra Day Atlanta 2015: Python & Cassandra

Paris Cassandra Meetup - Cassandra for Developers

Helsinki Cassandra Meetup #2: From Postgres to Cassandra

From Alexander Hamilton to Andrew Jackson to the …fchistorysolreview.wikispaces.com/file/view/Jefferson to Jackson...From Alexander Hamilton to Andrew Jackson to the ... The Bank

Thomas Jefferson and Andrew Johnson Erin Kerr, Shaney Soderquist and Alex Weisner

RF Power Upgrade at Jefferson Lab Rick Nelson, Andrew Kimber Notice: Authored by Jefferson Science Associates, LLC under U.S. DOE Contract

Building a Distributed Reservation System with Cassandra (Andrew Baker & Jeffrey Carpenter, Choice Hotels) | C* Summit 2016

Cabs, Cassandra, and Hailo (at Cassandra EU)

A fast RF kicker for the MEIC electron cooler Andrew Kimber Amy Sy 31 st March 2015 Thomas Jefferson National Accelerator Facility is managed by Jefferson