INSTRUMENTING YOUR INSTRUMENTS
Premal Shah
Co-Founder @ 6sense
Hadoop Summit 2016
AGENDA
• What does 6sense do?
• How do we do it?
• What does the pipeline look like?
• Where do we do it?
• What are the challenges?
• How are we planning to solve them?
WHAT DOES 6SENSE DO?
• We find prospects that are in market to buy
• We empower marketing and sales teams
SAMPLE OUTPUT
Account Name     | Buying Stage  | Profile Fit
ACME Corporation | Purchase      | Strong
ABC Corp         | Decision      | Strong
XYZ Systems      | Consideration | Medium
Doe Inc          | Awareness     | Strong
PURCHASE
DECISION
CONSIDERATION
AWARENESS
HOW DO WE DO IT?
• 1st Party
  ─ Web
  ─ CRM
  ─ Marketing Automation
• 3rd Party
  ─ Web
  ─ Search
  ─ Ad Impressions
• Modelling & Scoring
• Actionable Data for the Customer
• Customer Systems
WHAT DOES THE PIPELINE LOOK LIKE?
Customer Systems → Ingest → Process → Export → Customer Systems
THE DAILY PROCESS GRAPH (DAG)
THE REAL WORLD
THE REAL WORLD * N
PIPELINE COMPONENTS
• Hadoop Ecosystem
  ─ YARN
  ─ Hive
  ─ Presto
• Mesos World
  ─ Mesos
  ─ Chronos
  ─ Marathon
WORKFLOW
Chronos → Queue → Marathon → Jobs
• Hadoop
• Hive
• Presto
• Python
WHERE DO WE DO IT?
• AWS
  ─ Elastic
  ─ Easy to experiment
  ─ No CAPEX
• Hadoop
  ─ Data Nodes are run separately from Node Managers
  ─ Most of the data sits in S3
PROJECT RAVEN
WHAT AFFECTS PERFORMANCE
• Hive
  ─ Joins
  ─ Non-Partitioned tables
  ─ Filters
  ─ Bucketing
• Hadoop
  ─ File format
  ─ Compression
  ─ Data Locality
METRICS THAT MATTER
• # of Mappers
• # of Input Files
• # of Input Records
• # of Records passed on to the next stage
• Time taken in
  ─ Mappers
  ─ Copy
  ─ Shuffle
  ─ Reducers
• # of Reducers
• # of compressed vs uncompressed files
• File formats
• Etc.
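As an illustration of how the metrics above might be held per YARN job, here is a minimal Python sketch. The class name `YarnJobMetrics`, its field names and the derived ratios are assumptions made for this example; they are not a Hadoop API and are not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class YarnJobMetrics:
    """Counters for one YARN job. All field names are hypothetical."""
    mappers: int
    reducers: int
    input_files: int
    input_records: int
    output_records: int      # records passed on to the next stage
    map_seconds: float
    copy_seconds: float
    shuffle_seconds: float
    reduce_seconds: float

    def total_phase_seconds(self) -> float:
        # Sum of the per-phase times; this is not wall-clock time,
        # because phases can overlap across tasks.
        return (self.map_seconds + self.copy_seconds
                + self.shuffle_seconds + self.reduce_seconds)

    def filtered_fraction(self) -> float:
        # Share of input records that were NOT passed on to the next stage.
        if self.input_records == 0:
            return 0.0
        return 1.0 - self.output_records / self.input_records

    def records_per_mapper(self) -> float:
        # Rough indicator of how much work each mapper receives.
        if self.mappers == 0:
            return 0.0
        return self.input_records / self.mappers
```

A job with a high `filtered_fraction` reads far more data than it passes on, which may point to a missing partition or an early filter that could be pushed down.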
WHAT DO WE STORE?
• Job Name 1
  ─ Date 1
    o YARN Job # 1 Metrics
    o YARN Job # 2 Metrics
  ─ Date 2
    o Repeat as above
• Job Name 2
  ─ Repeat as above
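The hierarchy above (job name, then date, then one metrics record per YARN job) can be sketched as a nested mapping. This is an in-memory illustration only: `MetricsStore`, its method names, the job name `ingest_web` and the ISO date strings are all hypothetical, and the talk does not say which storage backend is used.

```python
from collections import defaultdict


class MetricsStore:
    """job name -> run date -> list of per-YARN-job metric dicts."""

    def __init__(self):
        self._data = defaultdict(lambda: defaultdict(list))

    def record(self, job_name, run_date, yarn_metrics):
        # One pipeline job can launch several YARN jobs on the same date,
        # so each date holds a list. Copy the dict to avoid aliasing.
        self._data[job_name][run_date].append(dict(yarn_metrics))

    def runs(self, job_name, run_date):
        # Use .get so that lookups never create empty entries.
        return list(self._data.get(job_name, {}).get(run_date, []))

    def dates(self, job_name):
        # ISO-formatted date strings sort chronologically.
        return sorted(self._data.get(job_name, {}))


store = MetricsStore()
store.record("ingest_web", "2016-06-02", {"mappers": 14, "input_files": 320})
store.record("ingest_web", "2016-06-01", {"mappers": 12, "input_files": 300})
store.record("ingest_web", "2016-06-01", {"mappers": 4, "input_files": 40})
```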
WHAT DO WE USE THEM FOR?
• Finding the job that
  ─ Is the slowest
  ─ Processes the most files
  ─ Filters out the most data
  ─ Uses the most memory
• Observe trends over time in the above metrics
• Get alerted on changes in the trends, both up and down
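One simple way to approach the uses listed above is sketched below in Python: pick the job with the longest most-recent run, and flag a metric whose latest value deviates from its history in either direction. The function names, the input shapes and the three-standard-deviation threshold are assumptions for illustration, not the method described in the talk.

```python
from statistics import mean, pstdev


def slowest_job(durations_by_job):
    """durations_by_job: {job_name: [seconds per daily run, oldest first]}.

    Returns the job whose most recent run took the longest.
    """
    return max(durations_by_job, key=lambda job: durations_by_job[job][-1])


def trend_alert(history, latest, threshold=3.0):
    """Return 'up', 'down' or None for the latest value of a metric.

    Compares `latest` with the mean of `history`, measured in
    population standard deviations of `history`.
    """
    if len(history) < 2:
        return None                      # not enough data to judge
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # Flat history: any change at all counts as a deviation.
        if latest == mu:
            return None
        return "up" if latest > mu else "down"
    z = (latest - mu) / sigma
    if z > threshold:
        return "up"
    if z < -threshold:
        return "down"
    return None
```

Alerting on drops as well as rises matters here: a job that suddenly finishes much faster or reads far fewer files may have silently lost part of its input.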
RECOMMENDATIONS
• Storage Format
• Compression Type
• Partition Columns
• Bucketing
• Etc.
OPTIMIZATIONS
• Which job is causing the bottleneck?
• How many errors can we tolerate?
• Which job is the biggest offender?
• Which job fails the most?
• What did the latest release do?
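Two of the questions above can be answered directly from stored run history. The following Python sketch assumes hypothetical inputs: a list of `(job_name, succeeded)` tuples for the failure count, and lists of run durations before and after a release. Neither shape comes from the talk.

```python
from collections import Counter
from statistics import mean


def most_failing_job(runs):
    """runs: iterable of (job_name, succeeded) tuples.

    Returns (job_name, failure_count) for the job with the most
    failed runs, or None if nothing failed.
    """
    failures = Counter(name for name, ok in runs if not ok)
    if not failures:
        return None
    return failures.most_common(1)[0]


def release_impact(before, after):
    """Relative change in mean duration across a release.

    Positive means runs became slower after the release, negative
    means faster. Both lists must be non-empty and the mean of
    `before` must be non-zero.
    """
    baseline = mean(before)
    return (mean(after) - baseline) / baseline
```

For example, `release_impact([100, 100], [110, 130])` reports a 20% slowdown in mean duration.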
SCALING
• Can we scale the number of customers?
• What does it cost to add a customer?
• What does it cost to add a job to each customer’s pipeline?
VENDOR SHOUT OUT
• ClusterK (now AWS Spot Fleet)
  ─ Allows us to use different instance types to load balance and reduce costs
• Sumo Logic
  ─ Detects variances in behavior over a custom time period
• OpsClarity
  ─ Collects, monitors and alerts on the following metrics
    o AWS CloudWatch metrics (Queue length, S3 bucket size, etc.)
    o Host metrics (CPU, Memory, Disk Space, etc.)
    o Service metrics (YARN, HBase, Mesos, etc.)
    o Container metrics: Docker
    o Custom metrics: anything else you want to send
THANK YOU
• premal at 6sense.com
• https://www.linkedin.com/in/premaljshah