Instrumenting your Instruments

INSTRUMENTING YOUR INSTRUMENTS

Premal Shah
Co-Founder @ 6sense
Hadoop Summit 2016

AGENDA

• What does 6sense do?
• How do we do it?
• What does the pipeline look like?
• Where do we do it?
• What are the challenges?
• How are we planning to solve them?

WHAT DOES 6SENSE DO?

• We find prospects that are in market to buy
• We empower marketing and sales teams

SAMPLE OUTPUT

Account Name        Buying Stage     Profile Fit
ACME Corporation    Purchase         Strong
ABC Corp            Decision         Strong
XYZ Systems         Consideration    Medium
Doe Inc             Awareness        Strong

[Figure: buying stages: Purchase, Decision, Consideration, Awareness]

HOW DO WE DO IT?

• 1st Party
─ Web
─ CRM
─ Marketing Automation

• 3rd Party
─ Web
─ Search
─ Ad Impressions

• Modelling & Scoring

• Actionable Data for the Customer

• Customer Systems

WHAT DOES THE PIPELINE LOOK LIKE?

Customer Systems → Ingest → Process → Export → Customer Systems

THE DAILY PROCESS GRAPH (DAG)

THE REAL WORLD

THE REAL WORLD * N

PIPELINE COMPONENTS

• Hadoop Ecosystem
─ YARN
─ Hive
─ Presto

• Mesos World
─ Mesos
─ Chronos
─ Marathon

WORKFLOW

[Diagram: Chronos, Queue, Marathon; Jobs: Hadoop, Hive, Presto, Python]

WHERE DO WE DO IT?

• AWS
─ Elastic
─ Easy to experiment
─ No CAPEX

• Hadoop
─ Data Nodes are run separately from Node Managers
─ Most of the data sits in S3

PROJECT RAVEN

WHAT AFFECTS PERFORMANCE

• Hive
─ Joins
─ Non-Partitioned tables
─ Filters
─ Bucketing

• Hadoop
─ File format
─ Compression
─ Data Locality

METRICS THAT MATTER

• # of Mappers

• # of Input Files

• # of Input Records

• # of Records passed on to the next stage

• Time taken in
─ Mappers
─ Copy
─ Shuffle
─ Reducers

• # of Reducers

• # of compressed vs uncompressed files

• File formats

• Etc.
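A minimal sketch of how a few of the metrics above could be pulled out of per-job counters. The deck does not say how the counters are collected; this assumes a JSON payload shaped like the MapReduce History Server job-counters response, and `WANTED`, `flatten_counters`, `filter_ratio` and `SAMPLE` are hypothetical names introduced for illustration.

```python
from typing import Any, Dict

# Hadoop MapReduce counter names mapped to short labels.
# The labels on the right are invented for this sketch.
WANTED = {
    "TOTAL_LAUNCHED_MAPS": "mappers",
    "TOTAL_LAUNCHED_REDUCES": "reducers",
    "MAP_INPUT_RECORDS": "input_records",
    "MAP_OUTPUT_RECORDS": "records_to_next_stage",
}


def flatten_counters(payload: Dict[str, Any]) -> Dict[str, int]:
    """Reduce a nested counters payload to a flat {label: value} dict."""
    metrics: Dict[str, int] = {}
    for group in payload.get("jobCounters", {}).get("counterGroup", []):
        for counter in group.get("counter", []):
            label = WANTED.get(counter.get("name"))
            if label is not None:
                metrics[label] = int(counter.get("totalCounterValue", 0))
    return metrics


def filter_ratio(metrics: Dict[str, int]) -> float:
    """Fraction of input records that were NOT passed on to the next stage."""
    records_in = metrics.get("input_records", 0)
    if records_in == 0:
        return 0.0
    return 1 - metrics.get("records_to_next_stage", 0) / records_in


# Made-up sample payload (assumed shape, illustrative values only).
SAMPLE = {
    "jobCounters": {
        "counterGroup": [
            {
                "counterGroupName": "org.apache.hadoop.mapreduce.JobCounter",
                "counter": [
                    {"name": "TOTAL_LAUNCHED_MAPS", "totalCounterValue": 40},
                    {"name": "TOTAL_LAUNCHED_REDUCES", "totalCounterValue": 4},
                ],
            },
            {
                "counterGroupName": "org.apache.hadoop.mapreduce.TaskCounter",
                "counter": [
                    {"name": "MAP_INPUT_RECORDS", "totalCounterValue": 1000000},
                    {"name": "MAP_OUTPUT_RECORDS", "totalCounterValue": 250000},
                ],
            },
        ]
    }
}

if __name__ == "__main__":
    flat = flatten_counters(SAMPLE)
    print(flat, filter_ratio(flat))
```

Flattening to a small dict keeps each run cheap to store and easy to compare from one day to the next.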

WHAT DO WE STORE?

• Job Name 1
─ Date 1
o YARN Job #1 Metrics
o YARN Job #2 Metrics
─ Date 2
o Repeat as above

• Job Name 2
─ Repeat as above
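The hierarchy above (job name, then date, then one metrics record per YARN job) can be sketched as a nested mapping. This is only an illustration of the layout; the deck does not name the actual storage backend, and `new_store`, `record` and `daily_total` are hypothetical helpers.

```python
from collections import defaultdict
from typing import Dict, List

# job name -> date (ISO string) -> list of per-YARN-job metric dicts
MetricStore = Dict[str, Dict[str, List[Dict[str, float]]]]


def new_store() -> MetricStore:
    """Create an empty store; missing job names and dates appear on first write."""
    return defaultdict(lambda: defaultdict(list))


def record(store: MetricStore, job_name: str, date: str,
           yarn_metrics: Dict[str, float]) -> None:
    """Append the metrics of one YARN job under its pipeline job and date."""
    store[job_name][date].append(yarn_metrics)


def daily_total(store: MetricStore, job_name: str, date: str,
                metric: str) -> float:
    """Sum one metric over every YARN job a pipeline job launched that day."""
    runs = store.get(job_name, {}).get(date, [])
    return sum(run.get(metric, 0) for run in runs)


if __name__ == "__main__":
    store = new_store()
    # Illustrative values only.
    record(store, "score_accounts", "2016-06-28", {"mappers": 40, "seconds": 310})
    record(store, "score_accounts", "2016-06-28", {"mappers": 12, "seconds": 95})
    print(daily_total(store, "score_accounts", "2016-06-28", "seconds"))
```

Keying by job name first makes it straightforward to line up the same job across dates.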

WHAT DO WE USE THEM FOR?

• Finding the job that
─ Is the slowest
─ Processes the most files
─ Filters out the most data
─ Uses the most memory

• Observe trends over time in the above metrics

• Get alerted on changes in the trends, both up and down
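A rough sketch of the two uses above: picking out the slowest job, and flagging a change in trend in either direction. The z-score rule and the threshold of 3 standard deviations are assumptions made for illustration, not the method described in the talk; `slowest_job` and `trend_alert` are hypothetical names.

```python
from statistics import mean, pstdev
from typing import Dict, List, Optional


def slowest_job(runtimes: Dict[str, float]) -> str:
    """Return the job name with the largest runtime."""
    return max(runtimes, key=runtimes.get)


def trend_alert(history: List[float], latest: float,
                threshold: float = 3.0) -> Optional[str]:
    """Return 'up' or 'down' when `latest` departs from the historical mean
    by more than `threshold` standard deviations, otherwise None."""
    if len(history) < 2:
        return None  # not enough data to call anything a trend
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # Perfectly flat history: any change at all is a deviation.
        if latest == mu:
            return None
        return "up" if latest > mu else "down"
    z = (latest - mu) / sigma
    if z > threshold:
        return "up"
    if z < -threshold:
        return "down"
    return None


if __name__ == "__main__":
    # Illustrative runtimes in seconds.
    print(slowest_job({"ingest": 120.0, "score": 540.0, "export": 60.0}))
    print(trend_alert([100, 102, 98, 101, 99], 130))
```

Alerting on drops as well as rises matters here: a job that suddenly runs much faster or reads far fewer records may be silently skipping input.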

RECOMMENDATIONS

• Storage Format

• Compression Type

• Partition Columns

• Bucketing

• Etc.
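One way recommendations like these could be derived from stored metrics is a small set of rules. The sketch below is hypothetical: the metric keys, the 64 MB small-file threshold and the 90% filter threshold are all assumptions chosen for illustration, not values from the talk.

```python
from typing import Dict, List

# Assumed thresholds, for illustration only.
SMALL_FILE_BYTES = 64 * 1024 * 1024  # average input file size below this is "small"
HIGH_FILTER_RATIO = 0.9              # share of input records dropped by the job


def recommend(metrics: Dict[str, float]) -> List[str]:
    """Turn one job's metrics into plain-text tuning hints."""
    hints: List[str] = []

    files = metrics.get("input_files", 0)
    if files and metrics.get("input_bytes", 0) / files < SMALL_FILE_BYTES:
        hints.append("Storage Format: many small input files, consider compacting them")

    if metrics.get("uncompressed_files", 0) > 0:
        hints.append("Compression Type: some input files are uncompressed")

    records_in = metrics.get("input_records", 0)
    if records_in:
        dropped = 1 - metrics.get("records_to_next_stage", 0) / records_in
        if dropped > HIGH_FILTER_RATIO:
            hints.append("Partition Columns: most records are filtered out, "
                         "consider partitioning on the filter column")
    return hints


if __name__ == "__main__":
    # Illustrative metrics: 5000 files averaging 1 MB, 95% of records filtered out.
    example = {
        "input_files": 5000,
        "input_bytes": 5000 * 1024 * 1024,
        "uncompressed_files": 0,
        "input_records": 1000,
        "records_to_next_stage": 50,
    }
    for hint in recommend(example):
        print(hint)
```

Bucketing is left out of the sketch because none of the metrics listed in the deck maps onto it in an obvious way.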

OPTIMIZATIONS

• Which job is causing the bottleneck?

• How many errors can we tolerate?

• Which job is the biggest offender?

• Which job fails the most?

• What did the latest release do?

SCALING

• Can we scale the number of customers?

• What does it cost to add a customer?

• What does it cost to add a job to each customer’s pipeline?

VENDOR SHOUT OUT

• ClusterK (now AWS Spot Fleet)
─ Allows us to use different instance types to load balance and reduce costs

• Sumo Logic
─ Detects variances in behavior over a custom time period

• OpsClarity
─ Collects, monitors and alerts on the following metrics
o AWS CloudWatch metrics (Queue length, S3 bucket size, etc.)
o Host metrics (CPU, Memory, Disk Space, etc.)
o Service metrics (YARN, HBase, Mesos, etc.)
o Container metrics (Docker)
o Custom metrics (anything else you want to send)

THANK YOU

• premal at 6sense.com

• https://www.linkedin.com/in/premaljshah
