
Building Scalable Big Data Infrastructure Using Open Source Software

Sam William

sampd@stumbleupon.com

What is StumbleUpon?

Help users find content they did not expect to find

The best way to discover new and interesting things from across the Web.

How StumbleUpon works

1. Register
2. Tell us your interests
3. Start Stumbling and rating web pages

We use your interests and behavior to recommend new content for you!

StumbleUpon

The Data Challenge

1. Data collection

2. Real-time metrics

3. Batch processing / ETL

4. Data warehousing & ad-hoc analysis

5. Business intelligence & reporting

Challenges in data collection

• Different services deployed on different tech stacks

• Must add minimal latency to the production services

• Application DBs (HBase & MySQL) must be made available for analytics / batch processing

• Site – Apache / PHP
• Rec / Ad Server – Scala / Finagle
• Other internal services – Java / Scala / PHP

Data Processing and Warehousing

Raw Data → ETL → Warehouse (HDFS) → Massively Denormalized Tables

Challenges/Requirements:

• Scale to over 100 TB of data

• End product must work with easy querying tools/languages

• Reliable and scalable – powers analytics and internal reporting

Real-time analytics and metrics

• Atomic counters

• Tracking product launches

• Monitoring the health of the site

• Low latency – live metrics make sense here

• A/B tests

Open Source at SU

Data Collection at SU

Activity Streams and Logs

All messages are Protocol Buffers:
• Fast and efficient
• Multiple language bindings (Java / C++ / PHP)
• Compact
• Very well documented
• Extensible
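As a rough illustration (not SU's actual schema), here is how a hypothetical `Activity` message might be built and serialized from Scala using the generated Java protobuf bindings:

```scala
// Hypothetical generated Java binding for an "Activity" message; the real
// SU schemas are not shown in the talk.
import com.example.tracking.Protos.Activity

object ProtobufExample {
  def main(args: Array[String]): Unit = {
    // Generated builders give a compact, typed way to construct messages.
    val activity = Activity.newBuilder()
      .setUserId(42L)                    // field names are assumptions
      .setUrl("http://example.com/page")
      .setAction("stumble")
      .build()

    // Serialize for the wire (what would be published to Kafka)...
    val bytes: Array[Byte] = activity.toByteArray

    // ...and parse it back on the consumer side.
    val decoded = Activity.parseFrom(bytes)
    println(decoded.getUrl)
  }
}
```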

Apache Kafka

• Distributed pub-sub system
• Developed @ LinkedIn
• Offers message persistence
• Very high throughput – ~300K messages/sec
• Horizontally scalable
• Multiple subscribers per topic – easy to rewind

Kafka

• Near-real-time processing can be taken offline and done at the consumer level

• Semantic partitioning through topics

• Partitions for parallel consumption

• High-level consumer API using ZK

• Simple to deploy – only requires ZooKeeper
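The talk predates today's Kafka clients (the ZK-based high-level consumer API mentioned above belongs to the 0.7 era), but the idea translates directly to the modern consumer API. A minimal sketch, with broker address, group id, and topic name all assumed:

```scala
import java.time.Duration
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

object ActivityConsumer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092") // broker address assumed
    props.put("group.id", "example-group")         // group members split the partitions
    props.put("key.deserializer",
      "org.apache.kafka.common.serialization.ByteArrayDeserializer")
    props.put("value.deserializer",
      "org.apache.kafka.common.serialization.ByteArrayDeserializer")

    val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
    consumer.subscribe(List("activity").asJava)    // topic name is hypothetical

    while (true) {
      // Each partition is read by at most one consumer in the group,
      // which is what makes partitions the unit of parallel consumption.
      for (record <- consumer.poll(Duration.ofSeconds(1)).asScala)
        println(s"partition=${record.partition} offset=${record.offset}")
    }
  }
}
```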

Kafka at SU

• 4 Broker nodes with RAID10 disks

• 25 topics

• Peak of 3500 msg/s

• 350 bytes avg. message size

• 30 days of data retention

Sutro

• Scala/Finagle

• Generic Kafka message producer

• Deployed on all prod servers

• Local HTTP daemon

• Publishes to Kafka asynchronously

• Uses Snowflake to generate unique IDs
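Sutro itself is not public, but a minimal sketch of the same pattern – fire-and-forget publishing keyed by a unique ID – looks roughly like this with today's Kafka producer API (broker address, topic name, and the Snowflake stub are all assumptions):

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object SutroLikeProducer {
  // Stand-in for a Snowflake client: the real service hands out unique,
  // roughly time-ordered 64-bit IDs; this placeholder does not.
  def nextSnowflakeId(): Long = System.nanoTime()

  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092") // assumed address
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.ByteArraySerializer")

    val producer = new KafkaProducer[String, Array[Byte]](props)
    val payload = "protobuf-bytes-would-go-here".getBytes("UTF-8")

    // send() buffers and returns immediately, so the producing service
    // sees minimal added latency; the callback fires on broker ack.
    producer.send(
      new ProducerRecord("activity", nextSnowflakeId().toString, payload),
      (metadata, exception) =>
        if (exception != null) exception.printStackTrace()
        else println(s"acked at offset ${metadata.offset}"))

    producer.close() // flushes anything still buffered
  }
}
```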

Sutro → Kafka

[Diagram: each production service – Site (Apache/PHP), Ad Server (Scala/Finagle), Rec Server (Scala/Finagle), and other services – runs a local Sutro daemon that publishes to the Kafka broker cluster.]

Application Data for Analytics & Batch Processing

HBase
• HBase inter-cluster replication (from production to batch cluster)
• Near-real-time sync on the batch cluster
• Readily available in Hive for analysis

MySQL
• MySQL replication to batch DB servers
• Sqoop incremental data transfer to HDFS
• HDFS flat files mapped to Hive tables & made available for analysis

Real-time metrics

1. HBase – atomic counters

2. Asynchbase – coalesced counter increments (sketch below)

3. OpenTSDB (developed at SU)
   – A distributed time-series DB on HBase
   – Collects over 2 billion data points a day
   – Plots time-series graphs
   – Tagged data points
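A minimal sketch of coalesced counter increments with the asynchbase client named above; the table, row key, and column names are made up for the example:

```scala
import org.hbase.async.{AtomicIncrementRequest, HBaseClient}

object CounterExample {
  def main(args: Array[String]): Unit = {
    // asynchbase needs only the ZooKeeper quorum to find the cluster.
    val client = new HBaseClient("zkhost:2181") // quorum address assumed

    // bufferAtomicIncrement() coalesces many +1s to the same cell into a
    // single RPC, which is what makes high-volume counters cheap.
    val req = new AtomicIncrementRequest(
      "metrics",             // table       (hypothetical)
      "stumbles:2012-05-01", // row key     (hypothetical)
      "t",                   // column family
      "count")               // qualifier
    client.bufferAtomicIncrement(req)

    // Flush buffered RPCs and shut down cleanly.
    client.shutdown().joinUninterruptibly()
  }
}
```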

Real-time counters

Real-time metrics from OpenTSDB

Kafka Consumer framework aka Postie

• Distributed system for consuming messages
• Scala/Akka – built on top of Kafka's consumer API
• Generic consumer – understands protobuf
• Predefined sinks: HBase / HDFS (text/binary) / Redis
• Consumers configured with configuration files
• Distributed – uses ZK to coordinate
• Extensible

Postie

Akka

• Building concurrent applications made easy!

• The distributed nodes are behind Remote Actors

• Load balancing through custom Routers

• The predefined sinks and services are accessed through local actors

• Fault-tolerance through actor monitoring
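Postie itself is not open source, so the following is only a sketch of the actor pattern described above, written against the modern Akka API (the sink's HBase write is stubbed out):

```scala
import akka.actor.{Actor, ActorSystem, Props}
import akka.routing.RoundRobinPool

// One predefined sink behind an actor; the HBase write is stubbed out.
class HBaseSink extends Actor {
  def receive = {
    case bytes: Array[Byte] =>
      // decode the protobuf and write to HBase here
      println(s"${self.path.name}: writing ${bytes.length} bytes")
  }
}

object PostieLikeApp {
  def main(args: Array[String]): Unit = {
    val system = ActorSystem("postie-example")

    // Load balancing through routers: a pool of identical sink actors.
    val sinks = system.actorOf(
      RoundRobinPool(4).props(Props[HBaseSink]()), "hbase-sinks")

    // A Kafka consumer thread would forward each message here.
    sinks ! "hello".getBytes("UTF-8")

    Thread.sleep(500) // let the actor log before shutting down
    system.terminate()
  }
}
```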

Postie

Batch processing / ETL


GOAL: Create simplified data sets from complex data

• Create highly denormalized data sets for faster querying

• Power the reporting DB with daily stats

• Output structured data for specific analyses – e.g. Registration Flow analysis

Our favourite ETL tools:

• Pig
  – Optional schema
  – Works on byte arrays
  – Many simple operations can be done without UDFs
  – Developing UDFs is simple (understand Bags/Tuples)
  – Concise scripts compared to the M/R equivalents

• Scalding
  – Functional programming in Scala over Hadoop
  – Built on top of Cascading
  – Operating over tuples is like operating over collections in Scala
  – No UDFs – your entire program is written in a full-fledged general-purpose language (see the sketch below)
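A minimal Scalding sketch in that spirit: aggregate per-URL ratings into a small denormalized stats set. The paths and field names are invented for the example:

```scala
import com.twitter.scalding._

// A minimal fields-API Scalding job: aggregate per-URL ratings into a
// small denormalized stats set. Input/output paths and field names
// are invented, not SU's real schema.
class UrlStatsJob(args: Args) extends Job(args) {
  Tsv(args("input"), ('userId, 'urlId, 'rating))
    .groupBy('urlId) { group =>
      group
        .size('views)                   // number of ratings per URL
        .average('rating -> 'avgRating) // mean rating per URL
    }
    .write(Tsv(args("output")))
}
```

Such a job would typically be launched through Hadoop via com.twitter.scalding.Tool, passing --input and --output as arguments.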

Warehouse - Hive

• Uses an SQL-like query language

• All analysts and data scientists are versed in SQL

• Supports Hadoop Streaming (Python/R)

• UDFs and SerDes make it highly extensible

• Supports partitioning, with many table properties configurable at the partition level
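The deck shows no queries, but an ad-hoc query against a date-partitioned table would look something like this from Scala over the HiveServer2 JDBC driver (a newer interface than existed at the time of the talk; host, table, and partition column are hypothetical):

```scala
import java.sql.DriverManager

object HiveAdHocQuery {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver") // HiveServer2 driver
    val conn = DriverManager.getConnection(
      "jdbc:hive2://hive-server:10000/default", "analyst", "") // assumed host/user

    // Filtering on the partition column ("ds" is a common convention) lets
    // Hive prune partitions instead of scanning the whole table.
    val rs = conn.createStatement().executeQuery(
      """SELECT url_id, COUNT(*) AS stumbles
        |FROM activity
        |WHERE ds = '2012-05-01'
        |GROUP BY url_id
        |ORDER BY stumbles DESC
        |LIMIT 10""".stripMargin)

    while (rs.next())
      println(s"${rs.getString(1)}\t${rs.getLong(2)}")
    conn.close()
  }
}
```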

Hive at StumbleUpon

HBaseSerde
• Reads binary data from HBase
• Parses composite binary values into multiple columns in Hive (mainly on the key)

ProtobufSerde
• For creating Hive tables on top of binary protobuf files stored in HDFS
• The SerDe uses Java reflection to parse and project columns
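Purely to illustrate the reflection idea (this is not the actual SerDe code): the no-arg getters of a generated protobuf class can be discovered at runtime and read off as columns, without compile-time knowledge of the schema.

```scala
// Illustrative only – not the actual SerDe code. Given a parsed protobuf
// message (the object parseFrom() returns), discover its no-arg getters
// via reflection and read each one as a "column".
object ReflectiveProjection {
  def columns(msg: AnyRef): Map[String, Any] =
    msg.getClass.getMethods
      .filter(m => m.getName.startsWith("get")
                && m.getName != "getClass"
                && m.getParameterCount == 0)
      .map(m => m.getName.drop(3) -> m.invoke(msg))
      .toMap
}
```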

Data Infrastructure at SU

Data Consumption

[Diagram: data pipeline → warehouse → end users of data]

Who uses this data?
• Data Scientists / Analysts
• Offline Rec pipeline
• Ads Team

… all this work allows them to focus on querying and analysis, which is critical to the business.

Business Analytics / Data Scientists

• Feature-rich set of data to work on

• Enriched/denormalized tables reduce JOINs, simplifying and speeding up queries – shortening the path to analysis

• R: our favorite tool for analysis after Hadoop/Hive

Recommendation platform

• URL score pipeline
  – M/R and Hive on Oozie
  – Filter / classify into buckets
  – Score / loop
  – Load ES/HBase index

• Keyword generation pipeline
  – Parse URL data
  – Generate tag mappings

URL score pipeline

Advertisement Platform

• Billing Service
  – Real-time Kafka consumer
  – Calculates skips
  – Bills customers

• Audience Estimation tool
  – Data pre-crunched into multiple dimensions
  – A UI tool for advertisers to estimate target audience

• Sales team tools
  – Built with PHP, leveraging Hive or pre-crunched ETL data in HBase

More stuff in the pipeline

• Storm from Twitter
  – Scope for a lot more real-time analytics
  – Very high throughput and extensible
  – Applications in any JVM language

• BI tools
  – Our current BI tools / dashboards are minimal
  – Google Charts powered by our reporting DB (HBase primarily)

Open Source FTW!!

• Actively developed and maintained

• Community support

• Built with web-scale in mind

• Distributed systems – Easy with Akka/ZK/Finagle

• Inexpensive

• Only one major catch – you have to hire and retain good engineers!

Thank You!

Questions?
