Big Data Tutorial - qconsp.com · Introduction to Big Data and its uses Survey of Big Data...

Big Data Tutorial

QCon São Paulo 2013

Everything old is new again

Everything old is technically feasible

The ability to summon 100s or 1000s of machines with an API call is what brings parallel computing to everyone... combined with virtually limitless cloud storage, Big Data is now accessible to everyone, not just big companies.

Tweet @jedberg with feedback!

Jeremy Edberg

What is reddit?

Netflix is the world’s leading Internet television network with nearly 38 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series. For one low monthly price, Netflix members can watch as much as they want, anytime, anywhere, on nearly any Internet-connected screen.

Source: http://ir.netflix.com

What is Netflix?

Why Big Data, how is it useful and what can it do for you?

SQL and NoSQL -- What's the difference, what are the pros and cons, how do you move from one to the other?

Practical steps to keep your Big Data systems reliable.

NoSQL technologies such as HBase/HDFS, BigTable, MongoDB, S3, Redis, Cassandra, Hadoop, Pig, Hive, Flume and more.

What You Will Learn

This is your workshop

• We’ll be together for 3+ hours

• You (or your employer) paid a lot of money to be here

• Let’s make it worth your while!

Let’s make this awesome together

• Ask questions

• Let me know if you want me to move on or go into more detail

• Keep it interactive!

Schedule

Introduction to Big Data and its uses

Survey of Big Data Technology

Real-Time Data Systems

Demo: Cassandra in Action -- Building and using a data model

Building reliable Big Data systems

Wrap up, conclusions, questions

What is Big Data?

• The tools and processes of managing and utilizing large datasets.

• (with virtualized resources)

• Structured and Unstructured data

(I’ll ask this again at the end)

Simple vs. Complex

Flu outbreak

Data Wants to be Free

Data is the most important asset your business will

Privacy

• That sharing comes at a cost, and that’s privacy.

• Some people value privacy over utility, and some don’t.

• Teenagers don’t seem to value privacy at all.

So how can Big Data help me?

Security

How Big Data transformed the dairy industry

How India’s “Satyamev Jayate” uses Big Data to power their TV show.

Trend Analysis

Actionable Metrics

Other Metrics

• Pennies earned

• Pageviews

• Votes / comments / links

How Big Data can make your business more successful.

• Use big data to do real time analysis to deliver better experiences for your customers

• Sometimes information is more valuable when it is shared.

• We are floating in good answers, but the good questions are scarce.

• Keep your data clean on the way in.

• Where does big data create value in your company?

What's possible -- and what's difficult -- for companies that adopt Big Data approaches to storage and analysis.

• Data gravity. As your data gets bigger you need to move your application closer to it.

• Moving from SQL to NoSQL

Data

What does Netflix do with it all?

We store it!

• Cache (memcached)

• Cassandra

• RDS (MySQL)

RDS (Relational Database Service)

Cassandra

Overview

Data Collection Pipeline

Data Processing Pipeline

Chukwa/Honu messages / min

63 billion messages a day

Hive

select videoid, count(*) as c
from events
where dateint >= 20120611 and dateint <= 20120617
  and event = "Watched" and result = "SUCCESS"
group by videoid
order by c desc
limit 5;

A/B Testing

Online Data

• Test Cell allocation

• Test Metadata

• Start/End date

• UI Directives

Offline Data

• Test tracking

• Retention

• Fraction Viewed

• Pages Viewed

AWS Usage (Ice)

Dollar amounts have been carefully removed

Chronos

Netflix Dataoven

Data Warehouse

Over 2 Petabytes

Ursula

Aegisthus

Data Pipelines

From cloud Services

~100 Billion Events/day

From C*

Terabytes of Dimension

Hadoop Clusters – AWS EMR

1300 nodes 800 nodes Multiple 150 nodes

Over 2 Petabytes

Hadoop Clusters – AWS EMR

Metadata

Gateways

Genie: Goals

• Open up the data engineering infrastructure

• Self-service for SLA/production

• Abstraction/management of back-end resources

• Hadoop/Hive/Pig as a Service

• Eliminate “gateway” bottlenecks

Genie: Set of Services

• Job Execution

• RESTful API to run Hadoop, Hive and Pig jobs

• Abstracting out cluster details from clients

• Horizontal scalability via auto-scaling groups on the cloud

Genie: Set of Services

• Resource Configuration/Management

• Management of cluster status

• Repository of configurations (for cluster, hive, pig)

• Mapping of jobs to clusters

Data Gravity

• Coined by Dave McCrory

• First described here: http://blog.mccrory.me/2010/12/07/data-gravity-in-the-clouds/

What is Data Gravity?

Source: nationalgeographic.com

Data Gravity and you

• The bigger your dataset, the harder it is to move from anywhere to anywhere

• Also, how do you move that data without affecting your running application?

reddit’s data gravity problem

• We had a lot of data that was ever-growing

• We were so resource constrained we couldn’t move it without hurting our application

Netflix’s data gravity problem

• Needed the data in the datacenter

• We were “Roman Riding” for a long time

Source: http://horseandman.com

Questions?

Schedule

SQL vs. NoSQL

• NoSQL is generally unstructured and the data storage is schemaless

• Eventually consistent systems

• Horizontally scalable

SQL vs. NoSQL

• SQL systems have structured data and fixed schemas

• ACID compliant (I’d rather put my $$ here than in an eventually consistent system!)

• Generally have to scale up; not so good at scaling out

CAP Theorem

• Consistent

• Available

• Partition-tolerant

Key/Value vs. Document Store

• Key/Value is just like the hash table data structure you are used to

• Great for use with object oriented languages

• Redis, Cassandra, S3

Key/Value vs. Document Store

• Stores whole documents with certain properties, often in JSON or XML

• Good for large chunks of data, like things scraped from the web

• MongoDB, CouchDB
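A rough way to see the difference, using plain Python containers as stand-ins rather than any real client library. The store names, the sample records and the `find` helper are all made up for illustration:

```python
# Key/value store: an opaque value looked up by exact key, like a hash table.
kv_store = {}
kv_store["user:42"] = b"serialized-user-object"  # the store never looks inside the value
value = kv_store.get("user:42")

# Document store: values are structured documents whose fields can be queried.
doc_store = [
    {"_id": 1, "url": "http://example.com/a", "title": "First page", "lang": "en"},
    {"_id": 2, "url": "http://example.com/b", "title": "Second page", "lang": "pt"},
]


def find(documents, **criteria):
    """Return the documents whose fields match every criterion (a toy query)."""
    return [d for d in documents
            if all(d.get(k) == v for k, v in criteria.items())]


portuguese_pages = find(doc_store, lang="pt")
```

With key/value you must already know the key; with documents you can ask questions about what is inside the value.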

JSON

• JavaScript Object Notation

• Originally a subset of JavaScript, now a standard cross-platform document format

• Lots of parsers in many languages

• Very similar to XML, less verbose

{
  "firstName": "John",
  "lastName": "Smith",
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021"
  },
  "phoneNumber": [
    {"type": "home", "number": "212 555-1234"},
    {"type": "fax", "number": "646 555-4567"}
  ]
}

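To show how little work a parser leaves you, here is a minimal sketch using Python's standard `json` module on a shortened copy of the sample record. The variable names are arbitrary:

```python
import json

text = """
{
  "firstName": "John",
  "lastName": "Smith",
  "age": 25,
  "address": {"city": "New York", "state": "NY"},
  "phoneNumber": [
    {"type": "home", "number": "212 555-1234"},
    {"type": "fax", "number": "646 555-4567"}
  ]
}
"""

person = json.loads(text)  # JSON text becomes dicts, lists, strings and numbers
fax = [p["number"] for p in person["phoneNumber"] if p["type"] == "fax"]

person["age"] += 1  # modify it like any other Python object
round_trip = json.dumps(person, indent=2)  # and serialize back to JSON text
```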
The Technologies

Cassandra

Cassandra Architecture

How it works

• Replication factor

• Quorum reads / writes

• Bloom Filter for fast negative lookups

• Immutable files for fast writes

• Seed nodes

• Multi-region

• Gossip protocol
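The relationship between replication factor and quorum is plain arithmetic. The sketch below uses the usual majority rule (quorum = floor(RF / 2) + 1) and the overlap rule R + W > RF; the function names are invented for this example and are not part of any Cassandra API:

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def read_sees_latest_write(replication_factor: int,
                           write_replicas: int,
                           read_replicas: int) -> bool:
    """A read must overlap the latest write when R + W > RF."""
    return read_replicas + write_replicas > replication_factor


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be down while quorum operations still succeed."""
    return replication_factor - quorum(replication_factor)


rf = 3
q = quorum(rf)
quorum_both_ways = read_sees_latest_write(rf, q, q)
one_and_one = read_sees_latest_write(rf, 1, 1)
```

With a replication factor of 3, quorum reads plus quorum writes always overlap on at least one replica, while reading and writing a single replica each trades that guarantee for speed.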

Cassandra Benefits

• Fast writes

• Fast negative lookups

• Easy incremental scalability

• Distributed -- No SPoF
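The "fast negative lookups" come from the Bloom filter: a small bit array that can say "definitely not here" without touching disk. This is a toy version to show the idea, not Cassandra's implementation; the sizes and the salted SHA-256 hashing are assumptions chosen for readability:

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key: str):
        # Derive several bit positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # If any bit is unset, the key was definitely never added.
        return all(self.bits[pos] for pos in self._positions(key))


keys_in_file = BloomFilter()
keys_in_file.add("user:42")
```

A "no" from `might_contain` lets a read skip a data file entirely; a "yes" only means the file is worth checking.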

Things Netflix stores in Cassandra

• Track service level call

• Instrument low level HTTP client

• Calls graph (who is calling who)

• Request processing vs Perceived latency

• Payload marshalling/unmarshalling: duration, size, etc.

• Service Results: Status, Error code, Exception, etc.

Things Netflix stores in Cassandra

• Video Quality

• Network issues

• Usage History

• Playback Errors

Why Cassandra?

• Availability over consistency

• Writes over reads

• We know Java

• Open source + support

Astyanax

• Netflix Cassandra Java client

• High level abstractions for Cassandra

• https://github.com/Netflix/astyanax

Hadoop

Image from searchworks.org

http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/

HBase

• Open source clone of Google’s BigTable (a sparse, distributed multi-dimensional sorted map)

• Integrated with Hadoop for easy reads and writes.

• Distributed

Overview

Data Collection Pipeline

Data Processing Pipeline

Hive

select videoid, count(*) as c
from events
where dateint >= 20120611 and dateint <= 20120617
  and event = "Watched" and result = "SUCCESS"
group by videoid
order by c desc
limit 5;

INSERT OVERWRITE TABLE user_active SELECT user.* FROM user WHERE user.active = 1;
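If SQL is not your first language, this is roughly what the top-5 query computes, written as plain Python over a handful of invented rows. Hive does the same thing as distributed jobs over the whole events table:

```python
from collections import Counter

# Hypothetical sample rows; the real events table lives in the data warehouse.
events = [
    {"videoid": "v1", "dateint": 20120611, "event": "Watched", "result": "SUCCESS"},
    {"videoid": "v2", "dateint": 20120612, "event": "Watched", "result": "SUCCESS"},
    {"videoid": "v1", "dateint": 20120615, "event": "Watched", "result": "SUCCESS"},
    {"videoid": "v1", "dateint": 20120620, "event": "Watched", "result": "SUCCESS"},  # outside the week
    {"videoid": "v3", "dateint": 20120613, "event": "Watched", "result": "FAILURE"},  # not a success
]

counts = Counter(  # group by videoid, count(*)
    e["videoid"]
    for e in events
    if 20120611 <= e["dateint"] <= 20120617  # the where clause
    and e["event"] == "Watched"
    and e["result"] == "SUCCESS"
)

top5 = counts.most_common(5)  # order by the count, descending, limit 5
```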

Pig

A = load 'passwd' using PigStorage(':');
B = foreach A generate $0 as id;
dump B;
store B into 'id.out';

Oozie

<property>
  <name>cassandra.thrift.address</name>
  <value>${cassandraHost}</value>
</property>
<property>
  <name>cassandra.thrift.port</name>
  <value>${cassandraPort}</value>
</property>
<property>
  <name>cassandra.partitioner.class</name>
  <value>org.apache.cassandra.dht.RandomPartitioner</value>
</property>
<property>
  <name>cassandra.consistencylevel.read</name>
  <value>${cassandraReadConsistencyLevel}</value>
</property>
<property>
  <name>cassandra.consistencylevel.write</name>
  <value>${cassandraWriteConsistencyLevel}</value>
</property>
<property>
  <name>cassandra.range.batch.size</name>
  <value>${cassandraRangeBatchSize}</value>
</property>

Other “NoSQL” solutions

• Memcache

• Redis

• CouchDB

• MongoDB

• DynamoDB

• Voldemort

• Riak

• Zookeeper

• S3

• Postgres

I love memcache

I make heavy use of memcached

EVCache

Redis

• Stores the entire database in RAM

• Supports complex data structures

• Writes to disk periodically

• Fast and predictable with small datasets

Redis data structures

• strings

• hashes

• lists

• sets and sorted sets

Redis use cases

• As a drop-in replacement for memcache

• Ephemeral data that you’re ok with losing

• Performance falls off a cliff when the dataset gets bigger than RAM

CouchDB

• Document oriented database

• Stores JSON-like objects with deep queries

• Not easy to scale horizontally

• Uses JS mapreduce functions called “views” for data access
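Real CouchDB views are written in JavaScript; the sketch below only mimics the idea in Python so the shape of a view is visible. A map function emits key/value pairs per document and a reduce function combines the values for each key. The documents and function names are invented:

```python
from collections import defaultdict

docs = [
    {"_id": "a", "type": "comment", "author": "alice"},
    {"_id": "b", "type": "comment", "author": "bob"},
    {"_id": "c", "type": "link", "author": "alice"},
    {"_id": "d", "type": "comment", "author": "alice"},
]


def map_comments_by_author(doc):
    """Map step: emit (key, value) pairs for the documents of interest."""
    if doc["type"] == "comment":
        yield doc["author"], 1


def reduce_sum(values):
    """Reduce step: combine all the values emitted for one key."""
    return sum(values)


def build_view(documents, map_fn, reduce_fn):
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    # The result is ordered by key, as a view index is.
    return {key: reduce_fn(vals) for key, vals in sorted(grouped.items())}


comments_per_author = build_view(docs, map_comments_by_author, reduce_sum)
```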

CouchDB use cases

• You have a large dataset and you want easy access to attributes

• Prototyping or just starting out

MongoDB

• Document store, similar to CouchDB

• JSON-like objects that are easy to work with

• JavaScript query language

MongoDB use cases

• Similar to CouchDB

• Less scalable than CouchDB

• Biases towards speed over durability

Voldemort

• Open source clone of Amazon’s Dynamo database (not DynamoDB)

• Consistent key hashing for fast lookups and easy horizontal scaling

• Built in versioning
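Consistent hashing is what makes the horizontal scaling easy: adding a node only moves the keys that land on the new node. This is a toy ring to illustrate the technique, not Voldemort's code; the node names, the virtual-node count and the use of SHA-256 are assumptions:

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes: int = 100):
        self._ring = []  # sorted list of (hash, node) points on the ring
        self.vnodes = vnodes
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def add(self, node: str) -> None:
        # Each node owns many points so load spreads evenly.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first point at or after the key's hash.
        idx = bisect.bisect_left(self._ring, (self._hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


keys = [f"user:{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}

ring.add("node-d")  # scale out by one node
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Every key in `moved` now lives on the new node; keys never shuffle between the existing nodes, which is the property a plain `hash(key) % N` scheme lacks.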

Voldemort use cases

• Places where eventual consistency is OK

• Like an Amazon shopping cart for example!

• Or LinkedIn!

• Sometimes multiple different answers can come back; it is up to the client to figure out the right answer
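What "the client figures it out" can look like for a shopping cart, as a minimal sketch. It assumes the store hands back every conflicting version and that carts are simple item-to-quantity maps; the merge rule (union of items, highest quantity wins) is one possible policy, not the only one:

```python
def merge_carts(versions):
    """Resolve conflicting cart versions by taking the union of their items.

    A union never loses an added item, at the cost that a removed item
    can reappear after a merge.
    """
    merged = {}
    for cart in versions:
        for item, qty in cart.items():
            merged[item] = max(merged.get(item, 0), qty)
    return merged


# Two replicas accepted different writes while they could not talk to
# each other (hypothetical data).
version_a = {"book": 1, "lamp": 1}
version_b = {"book": 2, "pen": 3}

resolved = merge_carts([version_a, version_b])
```

After merging, the client writes the resolved cart back so the replicas converge.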

Riak

• Like Voldemort (based on Amazon’s Dynamo paper)

• Uses a gossip protocol like Cassandra

• Query in Erlang or JavaScript

Riak use cases

• Similar to Voldemort, where eventual consistency is ok

Zookeeper

• Specialized key/value store

• Presents like a file system

• Distributed for reliability and fast reads

• At the expense of slow writes with more nodes

Zookeeper use cases

• System configuration

Postgres

Sample Schema

link_thing
  int id
  timestamp date
  int ups
  int downs
  bool deleted
  bool spam

link_data
  int thing_id
  string name
  string value
  char kind

The thing layer

• Postgres is used like a key/value store

• Thing table has denormalized data

• Data table has arbitrary keys

• Lots of indexes tuned for our specific queries

• Thing and data tables are on the same box, but don’t have to be
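A small sketch of the thing/data pattern, with SQLite standing in for Postgres so it runs anywhere. The table and column names follow the sample schema; the helper functions, the index name and the `'s'` value for `kind` are placeholders, not reddit's actual code:

```python
import sqlite3

db = sqlite3.connect(":memory:")  # SQLite stands in for Postgres here
db.executescript("""
    CREATE TABLE link_thing (
        id INTEGER PRIMARY KEY, date TEXT, ups INTEGER, downs INTEGER,
        deleted INTEGER, spam INTEGER
    );
    CREATE TABLE link_data (
        thing_id INTEGER, name TEXT, value TEXT, kind TEXT
    );
    CREATE INDEX link_data_thing_id ON link_data (thing_id);
""")


def save_link(thing_id, ups, downs, **attrs):
    """Fixed columns go in the thing table; everything else is key/value rows."""
    db.execute("INSERT INTO link_thing VALUES (?, datetime('now'), ?, ?, 0, 0)",
               (thing_id, ups, downs))
    db.executemany("INSERT INTO link_data VALUES (?, ?, ?, 's')",
                   [(thing_id, k, str(v)) for k, v in attrs.items()])


def load_link(thing_id):
    """Reassemble one object from its thing row and its data rows."""
    ups, downs = db.execute(
        "SELECT ups, downs FROM link_thing WHERE id = ?", (thing_id,)).fetchone()
    attrs = dict(db.execute(
        "SELECT name, value FROM link_data WHERE thing_id = ?", (thing_id,)))
    return {"id": thing_id, "ups": ups, "downs": downs, **attrs}


save_link(1, ups=10, downs=2, title="Hello", url="http://example.com")
link = load_link(1)
```

Adding a new attribute to links is just another row in the data table, with no schema change.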

Moving from Postgres to Cassandra

• We were lucky -- we already used key/value

• But it wasn’t completely straightforward

• Some things are a lot easier relationally

• Like taking counts of things

Tips to moving successfully

• No normalization

• Your app will have to do a lot of what your database used to do

• De-normalize

Schedule

Hadoop -- Past its prime?

• Was pioneered by Google, then an open source clone came along

• Google has mostly moved on to more real-time systems

Google Projects

• Dremel a.k.a. BigQuery

• Percolator

• Pregel

Other real-time projects

• Storm -- Twitter

• Turbine -- Netflix

• Redshift -- Amazon

NoSQL + SQL + Hadoop

• The latest trend in Big Data

• Putting a layer of SQL on top of a distributed data store

• Finally splitting the query layer from the data layer!

Schedule

Building a Data Model

• What questions do you want to ask your data?

• Don’t try to normalize anything

• Instead of changing a value keep a record of what happened
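"Keep a record of what happened" in miniature: an append-only list of events, with the current value derived when you read. The `Event` fields and the sample data are invented for this sketch:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: int
    entity: str
    field: str
    value: float


log = []  # append-only: rows are added, never updated in place


def record(timestamp, entity, field, value):
    log.append(Event(timestamp, entity, field, value))


def latest(entity, field):
    """The current value is simply the most recent event, found at read time."""
    matching = [e for e in log if e.entity == entity and e.field == field]
    return max(matching, key=lambda e: e.timestamp).value if matching else None


record(100, "link:1", "score", 0.4)
record(200, "link:1", "score", 0.9)
record(300, "link:1", "score", 0.5)

current = latest("link:1", "score")
history = [e.value for e in log if e.entity == "link:1"]
```

Because nothing is overwritten, you can answer questions later that you had not thought of when the data was written.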

Let’s build a telemetry system!

• This is a slightly modified real-world example of something we built to support the Netflix Open Connect project

Background

• Caches all over the world

• Named like ORD1, LAX1, SJC2, etc.

• We need to collect about 20 metrics for each cache on a regular basis

The questions

• Get last 3 runs for SJC2 and show the collected data

• What caches did we see on the last run and what are their details?

The tables

collected_properties

Keys | Healthy | other | load | up

collection_cache_by_times

Keys | cache1 | cache2 | cache3 | cache4 | cache5 | ...

collections_by_cache

Keys | 1 | 2 | 3 | 4 | 5 | ...
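One way to read these three tables, sketched with nested Python dicts in place of column families. The slide shows only table and column names, so the row-key layout below (documented in the comments) is an assumption, as are the helper names and sample values:

```python
from collections import defaultdict

# Assumed layout (the slide gives only table and column names):
#   collected_properties       row key "<cache>:<run>" -> {property: value}
#   collection_cache_by_times  row key run number      -> {cache name: ""}
#   collections_by_cache       row key cache name      -> {run number: ""}
collected_properties = {}
collection_cache_by_times = defaultdict(dict)
collections_by_cache = defaultdict(dict)


def store_run(run, cache, properties):
    """Write one collection run three ways, once per question we want to ask."""
    collected_properties[f"{cache}:{run}"] = dict(properties)
    collection_cache_by_times[run][cache] = ""
    collections_by_cache[cache][run] = ""


def last_runs(cache, n=3):
    """Question 1: the last n runs for a cache, newest first, with their data."""
    runs = sorted(collections_by_cache[cache], reverse=True)[:n]
    return [(run, collected_properties[f"{cache}:{run}"]) for run in runs]


def caches_on_last_run():
    """Question 2: the caches seen on the most recent run, with details."""
    run = max(collection_cache_by_times)
    return {c: collected_properties[f"{c}:{run}"]
            for c in sorted(collection_cache_by_times[run])}


for run in (1, 2, 3, 4):
    store_run(run, "SJC2", {"healthy": True, "load": 0.1 * run, "up": True})
store_run(4, "ORD1", {"healthy": False, "load": 0.9, "up": True})
```

Each question is answered by one lookup in an index table followed by direct reads of the property rows, with no joins and no scans.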

Python Code Walkthrough

Files

129837 95014
43534 10020
345345 90069
980345 10001
1098445 59390
9084309 32901
43534 98898

Queue

Data Loader

Data Processor

Schedule

Building a Reliable Data Store

If it won’t scale, it’ll fail. -- paradrox

1 > 2 > 3 Going from two to three is hard

1 > 2 > 3 Going from one to two is harder

1 > 2 > 3 If possible, plan for 3 or more from the beginning.

Going multi-zone

Benefits of Amazon’s Zones

• Loosely connected

• Low latency between zones

• 99.95% uptime guarantee per region

Going Multi-region

Leveraging Multi-region

• 100% uptime is theoretically possible.

• You have to replicate your data

• This will cost money
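Some back-of-the-envelope arithmetic on why more regions get you closer to 100% without ever reaching it. The sketch assumes region failures are completely independent and that any one healthy region can carry all the traffic, which real systems only approximate:

```python
def combined_availability(per_region: float, regions: int) -> float:
    """Availability of N regions, assuming independent failures and that
    any single healthy region can serve all traffic."""
    return 1 - (1 - per_region) ** regions


HOURS_PER_YEAR = 24 * 365

for n in (1, 2, 3):
    availability = combined_availability(0.9995, n)
    downtime_hours = (1 - availability) * HOURS_PER_YEAR
    print(f"{n} region(s): {availability:.10f} available, "
          f"~{downtime_hours:.4f} hours/year down")
```

The downtime shrinks fast with each added region, and so does what you get for each extra dollar of replication.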

Reliability and $$

Alert Systems

alerting

CORE Event Gateway

Paging Service

Amazon SES

CORE Agent

Other Team’s Agent

CORE Agent

Automate all the things!

• Application startup

• Configuration

• Code deployment

• System deployment

Automation

• Standard base image

• Tools to manage all the systems

• Automated code deployment

Netflix has moved the granularity from the instance to the cluster

Customer Device (PC, PS3, TV…)

Web Site or Discovery API

User Data

Personalization

Streaming API

QoS Logging

OpenConnect CDN Boxes

CDN Management and Steering

Content Encoding

Consumer Electronics

AWS Cloud Services

CDN Edge Locations

Browse

Watch

The Netflix SOA

The Netflix way

• Everything is “built for three”

• Fully automated build tools to test and make packages

• Fully automated machine image bakery

• Fully automated image deployment

The Monkey Theory

• Simulate things that go wrong

• Find things that are different

The Simian Army

• Chaos -- Kills random instances

• Chaos Gorilla -- Kills zones

• Chaos Kong -- Kills regions

• Latency -- Degrades network and injects faults

• Conformity -- Looks for outliers

• Circus -- Kills and launches instances to maintain zone balance

• Doctor -- Fixes unhealthy resources

• Janitor -- Cleans up unused resources

• Howler -- Yells about bad things like Amazon limit violations

• Security -- Finds security issues and expiring certificates

Circuit Breakers

Be liberal in what you accept, strict in what you send
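A circuit breaker in its simplest form: after a few failures in a row, stop calling the broken dependency and fail fast until a cooldown has passed. This is a generic sketch of the pattern, not Netflix's implementation; the class, the thresholds and the injected fake clock are all assumptions:

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling a dependency that is currently disabled."""


class CircuitBreaker:
    """Toy circuit breaker: fail fast after repeated errors, retry after a cooldown."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency disabled, use a fallback")
            self.opened_at = None  # cooldown over: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open the circuit
            raise
        self.failures = 0  # a success closes the circuit
        return result


# Demo with a fake clock so nothing actually sleeps.
now = [0.0]
breaker = CircuitBreaker(max_failures=2, reset_after=10.0, clock=lambda: now[0])


def flaky_service():
    raise RuntimeError("backend down")


outcomes = []
for _ in range(3):
    try:
        breaker.call(flaky_service)
    except CircuitOpenError:
        outcomes.append("short-circuited")
    except RuntimeError:
        outcomes.append("failed")

now[0] = 11.0  # the cooldown has elapsed
recovered = breaker.call(lambda: "ok")  # trial call succeeds, circuit closes
```

The caller decides what the fallback is: a cached value, a default, or a degraded page.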

Incident Reviews

• What went wrong?

• How could we have detected it sooner?

• How could we have prevented it?

• How can we prevent this class of problem in the future?

• How can we improve our behavior for next time?

Ask the key questions:

Database Resiliency with Sharding

Horizontal vs. Vertical

Sharding

• reddit split writes across four master databases

• Links/Accounts/Subreddits, Comments, Votes and Misc

• Each has at least one slave in another zone

• Avoid reading from the master if possible

• Wrote their own database access layer, called the “thing” layer
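The routing decision a database access layer has to make can be sketched in a few lines. The four groups come from the slide; the host names, the type-to-shard mapping and the round-robin replica choice are made up and are not the actual "thing" layer:

```python
import itertools

# Hypothetical host names; the four groups follow the slide.
SHARDS = {
    "main":     {"master": "db-main-1",     "replicas": ["db-main-2"]},
    "comments": {"master": "db-comments-1", "replicas": ["db-comments-2"]},
    "votes":    {"master": "db-votes-1",    "replicas": ["db-votes-2"]},
    "misc":     {"master": "db-misc-1",     "replicas": ["db-misc-2"]},
}
TYPE_TO_SHARD = {
    "link": "main", "account": "main", "subreddit": "main",
    "comment": "comments", "vote": "votes",
}

_replica_cycles = {name: itertools.cycle(shard["replicas"])
                   for name, shard in SHARDS.items()}


def host_for(thing_type: str, write: bool) -> str:
    """Writes go to the shard's master; reads rotate across its replicas."""
    shard = TYPE_TO_SHARD.get(thing_type, "misc")
    if write:
        return SHARDS[shard]["master"]
    return next(_replica_cycles[shard])
```

For example, `host_for("vote", write=True)` picks the votes master, while `host_for("comment", write=False)` picks a comments replica and leaves the master alone.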

Queues are your friend

• Votes

• Comments

• Thumbnail scraper

• Precomputed queries

• Spam

  • processing

  • corrections
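Why queues help: the web request only has to enqueue the work, and a background worker does the slow write. The sketch uses Python's in-process `queue` module as a stand-in for a real message broker; the function names, the IDs and the dict playing the database are invented:

```python
import queue
import threading

vote_queue = queue.Queue()
scores = {}  # stand-in for the database
scores_lock = threading.Lock()


def handle_vote_request(link_id, direction):
    """Web tier: enqueue and return immediately instead of writing inline."""
    vote_queue.put((link_id, direction))


def vote_worker():
    """Background worker: drain the queue and apply the slow writes."""
    while True:
        item = vote_queue.get()
        if item is None:  # sentinel: shut down
            vote_queue.task_done()
            break
        link_id, direction = item
        with scores_lock:
            scores[link_id] = scores.get(link_id, 0) + direction
        vote_queue.task_done()


worker = threading.Thread(target=vote_worker, daemon=True)
worker.start()

for link_id, direction in [("t3_a", 1), ("t3_a", 1), ("t3_b", -1)]:
    handle_vote_request(link_id, direction)

vote_queue.put(None)
vote_queue.join()  # wait until everything queued has been processed
```

If the database slows down, the queue gets longer but the site keeps answering requests.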

Pain Points

Higher and more varied network latency

Workaround: Fewer network calls, ask for more data at a time.

Pain Points

EBS sometimes slowed down a bit

Workaround: Use caching and replication with read slaves to avoid relying on a single disk, or better yet, avoid the need for EBS altogether.

Pain Points

Instances go away sometimes or become so slow that you want to make them go away.

Workaround: Avoid single points of failure and make sure your servers have automated configuration.

Protip

The environment in a public cloud is inherently more variable (co-tenants, abusive or heavy users, etc.)

Make sure your code is written to handle this -- state should be kept somewhere shared and redundant, not on the instance.

Protip

Security was not the first thought when a lot of the cloud systems were designed

Make it your first thought though. A little planning goes a long way. Use security groups judiciously and keep those keys safe!

Protip

Keep track of those limits!

To prevent someone from consuming too much, all resources have per account limits. Keep track of them and get them raised ahead of when you need them. Make sure to catch the exceptions too.
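Keeping track can be as simple as comparing usage against limits on a schedule and alerting well before you hit the wall. In this sketch the resource names, the numbers and the 80% threshold are placeholders; in practice the figures would come from your cloud provider's API or console:

```python
def limits_needing_attention(usage, limits, threshold=0.8):
    """Return the resources whose usage has reached `threshold` of the limit."""
    return sorted(name for name, limit in limits.items()
                  if usage.get(name, 0) >= threshold * limit)


# Hypothetical numbers for one account.
account_limits = {"instances": 100, "ebs_volumes": 5000, "elastic_ips": 5}
current_usage = {"instances": 85, "ebs_volumes": 10, "elastic_ips": 4}

ask_for_raise = limits_needing_attention(current_usage, account_limits)
```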

Cause chaos

Best Practices

• Keep data in multiple Availability Zones

• Avoid keeping state on a single instance

• Take frequent snapshots of EBS disks

• No secret keys on the instance

• Different functions in different Security Groups

Autoscaling

Traffic Peak

What about private clouds?

• Some of the problems you don’t have: noisy neighbors, lack of physical access

• Problem you do have: You have to pay for your spare capacity instead of someone else

A taxonomy of Big Data and next-generation storage solutions

• Noisy neighbors are a problem.

• Efficiency is necessary and getting better

Schedule

What is Big Data?

• The tools and processes of managing and utilizing large datasets.

• (with virtualized resources)

• Structured and Unstructured data

(What’s missing?)

This is where the slide on what you should have learned would

I’m more interested in what you actually learned.

More Netflix details

• http://techblog.netflix.com/2010/12/four-reasons-we-choose-amazons-cloud-as.html

• http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html

• http://techblog.netflix.com/2011/03/cloud-connect-keynote-complexity-and.html

Just a quick reminder...

(Some of) Netflix is open source:

https://github.com/netflix

Including astyanax:

https://github.com/Netflix/astyanax

reddit is open source too:

https://github.com/reddit

patches are now being accepted!

Netflix is hiring

http://jobs.netflix.com/jobs.html

Please don’t forget to vote!

Voting is how we know what to present to you next time. :)

Email: jedberg@{gmail,netflix}.com

Twitter: @jedberg

Web: www.jedberg.net

Facebook: facebook.com/jedberg

Linkedin: www.linkedin.com/in/jedberg

You can contact me here for questions:
