Hadoop at DataSift


Presentation given at Edinburgh University Student Tech-Meetup on 6th Feb, 2013.


HADOOP AT DATASIFT

ABOUT ME

Jairam Chandar
Big Data Engineer, DataSift

@jairamc
http://about.me/jairam
http://jairam.me

And I’m a Formula 1 Fan!

OUTLINE

• What is DataSift?

• Where do we use Hadoop?

• The Numbers

• The Use Cases

• The Lessons

!! SALES PITCH ALERT !!

WHAT IS DATASIFT?

THE NUMBERS

• Machines

• HBase

• 60 Machines as RegionServers

• 1 HMaster

• 3 Zookeeper nodes

THE NUMBERS

• Machines

• Hadoop

• 135 Machines divided into 2 clusters

• DataNodes/TaskTrackers

• NameNodes with high-availability failover

• 1 Jobtracker each

THE NUMBERS

• Machines

• DL380 Gen8

• 2 × Intel Xeon E5646 @ 2.40 GHz (24 cores total)

• 48 GB RAM

• 6 × 2 TB disks in JBOD (small partition on the first disk for the OS, the rest for storage)

• 1 Gigabit network links

THE NUMBERS

• Data

• Average load of 7500 interactions per second

• Peak loads of 15000 interactions per second sustained over a minute

• Peak of 21000 interactions per second during the Super Bowl

• Total current capacity ~ 1.6 PB; Total current usage ~ 800 TB

• Average interaction size of 2 KB – that's ~1 GB a minute, or ~2 TB a day with replication (RF = 3)

• And that’s not it!

THE USE CASES

• HBase

• Recordings

• Archive

• Map/Reduce

• Exports

• Historics

• Migration

THE USE CASES

• Recordings

• User defined streams

• Stored in HBase for later retrieval

• Export to multiple output formats and stores

• Row key: <recording-id><interaction-uuid>

• Recording-id is a SHA-1 hash

• Allows recordings to be distributed by their key without generating hot-spots.
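A minimal sketch of how such a row key might be built (class and method names are illustrative, not DataSift's actual code):

    import java.security.MessageDigest;

    import org.apache.hadoop.hbase.util.Bytes;

    public class RecordingKeys {

        // Builds the <recording-id><interaction-uuid> row key described above.
        // SHA-1 hashing the recording id spreads recordings uniformly across
        // the key space, so no single region becomes a write hot-spot.
        public static byte[] rowKey(String recordingId, String interactionUuid)
                throws Exception {
            byte[] prefix = MessageDigest.getInstance("SHA-1")
                    .digest(Bytes.toBytes(recordingId)); // fixed 20-byte prefix
            return Bytes.add(prefix, Bytes.toBytes(interactionUuid));
        }
    }

Because the SHA-1 prefix is fixed-width, all interactions for one recording stay contiguous and can still be fetched with a simple prefix scan.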

THE RECORDER

THE USE CASES

• Exporter

• Export data from HBase for customers

• Export files ~5–10 GB or ~3–6 million records

• MR over HBase using TableInputFormat

• But the data needs to be sorted

• TotalOrderPartitioner
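A rough sketch of this pattern with the 0.9x-era client API (the table name, paths, reducer count and the deliberately trivial mapper are my assumptions, not DataSift's Exporter):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

    public class ExportJob {

        // Emits the row key so the shuffle can sort records globally.
        static class ExportMapper extends TableMapper<Text, Text> {
            @Override
            protected void map(ImmutableBytesWritable row, Result value,
                               Context ctx) throws IOException, InterruptedException {
                ctx.write(new Text(row.get()), new Text(value.toString()));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = new Job(conf, "hbase-export");
            job.setJarByClass(ExportJob.class);

            // Scan settings matter for MR over HBase: batch rows per RPC and
            // don't let a one-off full scan evict the hot block cache.
            Scan scan = new Scan();
            scan.setCaching(500);
            scan.setCacheBlocks(false);

            TableMapReduceUtil.initTableMapperJob(
                    "recordings", scan, ExportMapper.class,
                    Text.class, Text.class, job);

            // TotalOrderPartitioner gives globally sorted output: each reducer
            // receives one contiguous key range, defined by a partition file
            // of sampled split keys (e.g. produced with InputSampler).
            job.setPartitionerClass(TotalOrderPartitioner.class);
            TotalOrderPartitioner.setPartitionFile(
                    job.getConfiguration(), new Path("/tmp/export-partitions"));
            job.setNumReduceTasks(16); // one sorted output file per reducer

            FileOutputFormat.setOutputPath(job, new Path("/exports/run-001"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }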

EXPORTER

HISTORICS

THE USE CASES

• Twitter Import

• 2 years of Tweets

• About 95,000,000,000 tweets

• Over 300 TB with added augmentation

• Import was not as simple as you would imagine

THE USE CASES

• Archive

• Not just the Firehose but the Ultrahose

• Stored in HBase as well

• HBase's BigTable-style architecture creates hot-spots with time-series data

• Leading randomizing bit (see HBaseWD)

• Pre-split regions

• Concurrent writes
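A minimal sketch of the salting idea (the bucket count and key layout are illustrative; the HBaseWD library wraps the same pattern):

    import org.apache.hadoop.hbase.util.Bytes;

    public class SaltedKeys {

        // One leading salt byte fans sequential timestamps out over
        // this many key ranges instead of one "latest" region.
        private static final int BUCKETS = 16;

        // Deterministic salt: the same interaction always lands in the
        // same bucket, so readers can recompute the prefix.
        public static byte[] saltedKey(long timestampMillis, byte[] uuid) {
            byte salt = (byte) ((Bytes.hashCode(uuid) & 0x7fffffff) % BUCKETS);
            return Bytes.add(new byte[] { salt },
                    Bytes.toBytes(timestampMillis), uuid);
        }

        // Split points on the salt byte, handed to createTable() so all
        // buckets exist up front and writes hit every region server from
        // day one (the pre-split regions point above).
        public static byte[][] splitPoints() {
            byte[][] splits = new byte[BUCKETS - 1][];
            for (int i = 1; i < BUCKETS; i++) {
                splits[i - 1] = new byte[] { (byte) i };
            }
            return splits;
        }
    }

Reads for a time range then fan out into one scan per salt prefix, run concurrently, mirroring the concurrent writes.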

THE USE CASES

• Historics

• Export archive data

• Slightly different from Exporter

• Much larger timelines (1–3 months)

• Controlled access to Hadoop cluster with efficient job scheduling

• Unfiltered Input Data

• Therefore longer processing time

• Hence more optimizations required
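Since Historics jobs scan unfiltered archive data over month-long windows, one obvious optimization is pushing as much filtering as possible to the region servers before the mappers run. A hedged sketch (the method and constants are mine):

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.Scan;

    public class HistoricsScan {

        // Bound the scan by row range and by server-side time range so
        // region servers skip data the job would only throw away.
        public static Scan monthScan(byte[] startRow, byte[] stopRow,
                                     long fromMillis, long toMillis)
                throws IOException {
            Scan scan = new Scan(startRow, stopRow);
            scan.setTimeRange(fromMillis, toMillis); // filtered region-side
            scan.setCaching(1000);                   // fewer RPC round-trips
            scan.setCacheBlocks(false);              // don't evict hot data
            return scan;
        }
    }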

HISTORICS

THE LESSONS

• Tune, tune, tune (default == BAD)

• Based on your use case, tune:

• Heap

• Block size

• Memstore size

• Keep number of column families low

• Be aware of the hot-spotting issue when writing time-series data
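As a concrete illustration of those knobs, block size, memstore flush size and a single column family can all be set through the era's client API; the values here are placeholders, not DataSift's production settings:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class TunedTable {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();

            // One column family only: every extra family means another
            // memstore per region plus extra flush and compaction load.
            HColumnDescriptor family = new HColumnDescriptor("d");
            family.setBlocksize(8 * 1024); // smaller blocks favour random reads

            HTableDescriptor table = new HTableDescriptor("archive");
            table.addFamily(family);
            // A bigger flush size means fewer, larger HFiles on a
            // write-heavy table, at the cost of more heap per region.
            table.setMemStoreFlushSize(256L * 1024 * 1024);

            new HBaseAdmin(conf).createTable(table);
        }
    }

The heap itself is a JVM setting (e.g. HBASE_HEAPSIZE in hbase-env.sh) rather than a table property.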

THE LESSONS

• Use compression (e.g. Snappy; see the sketch after this list)

• Ops need an intimate understanding of the system

• Monitor system metrics (GC, CPU, compactions, I/O) and application metrics (writes/sec etc.)

• Don't be afraid to fiddle with the HBase code

• Using a distribution is advisable
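Snappy compression from the list above is also just a column-family property (assuming the native Snappy libraries are installed on every region server):

    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.io.hfile.Compression;

    public class CompressedFamily {

        // Snappy trades a little CPU for a large on-disk and I/O saving,
        // which is why it's a common choice for high-volume tables.
        public static HColumnDescriptor snappyFamily(String name) {
            HColumnDescriptor family = new HColumnDescriptor(name);
            family.setCompressionType(Compression.Algorithm.SNAPPY);
            return family;
        }
    }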

QUESTIONS?

We are hiring: http://datasift.com/about-us/careers
