
JDD2014: Real Big Data - Scott MacGregor


DESCRIPTION

- The evolution of Big Data, both inside Akamai and in the industry.
- The current Big Data Ecosystem with real-world examples.
- Challenges in Big Data and future directions.


Page 1: JDD2014: Real Big Data - Scott MacGregor

Real… Big… Data… and its constant evolution

Scott MacGregor

Page 2: JDD2014: Real Big Data - Scott MacGregor

Who is this guy?

Page 3: JDD2014: Real Big Data - Scott MacGregor

Akamai Big Data Infrastructure

150,000 collector nodes 5000 map/reduce nodes Billions of jobs per day

Page 4: JDD2014: Real Big Data - Scott MacGregor

What is Big Data?

Page 5: JDD2014: Real Big Data - Scott MacGregor

The V’s

Page 6: JDD2014: Real Big Data - Scott MacGregor

Data that is Big

From Hortonworks

Page 7: JDD2014: Real Big Data - Scott MacGregor

What’s it really about?

Page 8: JDD2014: Real Big Data - Scott MacGregor

From the beginning…

•  Akamai needed a billing system and scalable monitoring
•  The Open Source community wanted a search engine
•  Yahoo needed better product analytics for page views
•  Google needed more scalable computation for ad management
•  Facebook needed real-time updates to social graph
•  LinkedIn needed a real-time activity data pipeline
•  Twitter needed hashtag and topic streams
•  Amazon needed durable shopping carts
•  Netflix needed a recommendation engine

Page 9: JDD2014: Real Big Data - Scott MacGregor

Big Data timeline

[Timeline figure] Years on the axis: 1998, 2001, 2003, 2005, 2006, 2007, 2008, 2010, 2011, 2012, 2013, 2014. The placement of each item on the axis is not recoverable from the transcript.

Akamai
•  Generalized map/reduce on 1 machine
•  Decentralized job scheduling, multiple machines, file system DB
•  Wide area, real-time, in-memory system monitoring
•  Geographical redundancy
•  Real-time reporting, columnar DB
•  Distributed file system DB
•  Wide-area MapReduce, exabyte query

Industry
•  Google MapReduce, Google FS
•  Nutch
•  Yahoo spins off Hadoop
•  Amazon Dynamo
•  NoSQL
•  HBASE, Neo4J
•  Facebook Cassandra, LinkedIn Kafka
•  Twitter Storm, Facebook Presto

Page 10: JDD2014: Real Big Data - Scott MacGregor

How it works…

Page 11: JDD2014: Real Big Data - Scott MacGregor

Big Data modes

•  Batch
    – Computation over a large static data set
    – Results are complete
•  Online
    – Computation on data as it’s generated
    – Localized results, must be aggregated downstream

Page 12: JDD2014: Real Big Data - Scott MacGregor

Big Data primitives

•  Collection •  Parsing •  Partitioning •  Filtering •  Throttling •  Aggregation •  Tracking •  Validation •  Analysis

Page 13: JDD2014: Real Big Data - Scott MacGregor

Collection

•  What
    – Logs
    – Metadata
    – System stats
    – Application events
    – Application stats
    – Network data
•  How
    – Email
    – SPDY
    – HTTP POST
    – SCP
    – Scribe
    – Avro
    – Custom

Page 14: JDD2014: Real Big Data - Scott MacGregor

Parsing

•  Read lines or blocks and split into fields
•  Transform, e.g. protobuf
•  Map keys to values

S 1359487051.701 4200191766 2097152 2097918 61.252.169.21 GET 440 1423 itunesus011.download.akamai.com 200 - iPeV image/jpeg - - 44 3031 - - - - - W - us/r1000/011/Purple/53/e6/0f/mzl.slohufby.320x480-75.jpg - a440.phobos.apple.com

1359486900 1423 a440.phobos.apple.com 1 3158

1359486900 1423 200 1 30128

1359486900 1423 1 209158
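The parsing step can be sketched roughly as follows. This is a minimal illustration, not Akamai's parser. The field positions (timestamp at index 1, a byte count at index 4, a numeric ID at index 8, status at index 10, hostname as the last field), the 5-minute bucket width, and the pipeline names are all assumptions made for the example. The byte values it emits are not meant to reproduce the numbers on the slide.

```python
# Hypothetical field layout: the slide does not document what each column means.
LINE = (
    "S 1359487051.701 4200191766 2097152 2097918 61.252.169.21 GET 440 1423 "
    "itunesus011.download.akamai.com 200 - iPeV image/jpeg - - 44 3031 - - - - - W - "
    "us/r1000/011/Purple/53/e6/0f/mzl.slohufby.320x480-75.jpg - a440.phobos.apple.com"
)

BUCKET_SECONDS = 300  # assumed 5-minute time buckets


def parse(line):
    """Split a log line on whitespace and map assumed positions to keys."""
    f = line.split()
    return {
        "ts": float(f[1]),
        "bytes": int(f[4]),
        "id": f[8],
        "status": f[10],
        "host": f[-1],
    }


def to_pipelines(rec):
    """Emit one record per pipeline; each has the shape (bucket, keys..., hits, bytes)."""
    bucket = int(rec["ts"]) // BUCKET_SECONDS * BUCKET_SECONDS
    hits, nbytes = 1, rec["bytes"]
    return {
        "by_host": (bucket, rec["id"], rec["host"], hits, nbytes),
        "by_status": (bucket, rec["id"], rec["status"], hits, nbytes),
        "total": (bucket, rec["id"], hits, nbytes),
    }


if __name__ == "__main__":
    for name, row in to_pipelines(parse(LINE)).items():
        print(name, *row)
```

One input line fans out into several keyed records, which matches the "each log generates many pipelines" idea mentioned later in the deck.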

Page 15: JDD2014: Real Big Data - Scott MacGregor

Partitioning

•  Bucketing
    – Reduce to a single record per bucket
    – e.g. 5 minutes, /24, etc.
•  Hashing
    – Bucket blocks or records of data by a hash function
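A small sketch of the two partitioning styles, under stated assumptions: the function names are hypothetical, the 5-minute and /24 widths come from the slide's examples, and CRC32 stands in for whatever hash function a real system would use.

```python
import ipaddress
import zlib


def time_bucket(ts, width=300):
    """Floor a Unix timestamp to the start of its bucket (default 5 minutes)."""
    return int(ts) // width * width


def prefix_bucket(ip, prefix_len=24):
    """Map an IPv4 address to its enclosing network, e.g. a /24."""
    return str(ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False))


def hash_partition(key, num_partitions):
    """Assign a key to a partition with a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(key.encode()) % num_partitions


if __name__ == "__main__":
    print(time_bucket(1359487051.701))
    print(prefix_bucket("61.252.169.21"))
    print(hash_partition("a440.phobos.apple.com", 8))
```

Bucketing keeps related records together by value range, while hashing spreads keys evenly without preserving any ordering.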

Page 16: JDD2014: Real Big Data - Scott MacGregor

Filtering

•  Statistical Methods
    – Top-k (HierarchicalCountSketch)
    – Set membership (Bloom filters)
    – Cardinality counting (HyperLogLog)
    – Frequency estimates (CountSketch)
    – Change detection (Deltoid)
•  Sampling
    – Random
    – Reservoir
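Two of the filtering techniques named above can be sketched briefly: a Bloom filter for set membership and reservoir sampling (Algorithm R). These are textbook versions for illustration, not the implementations used in the talk. The bit-array size, number of hashes, and the use of SHA-256 are assumptions.

```python
import hashlib
import random


class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive several bit positions by salting the item with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


if __name__ == "__main__":
    seen = BloomFilter()
    seen.add("61.252.169.21")
    print("61.252.169.21" in seen)
    print(reservoir_sample(range(1000), 5, random.Random(0)))
```

Both structures use fixed memory regardless of stream length, which is why they suit the online mode described earlier.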

Page 17: JDD2014: Real Big Data - Scott MacGregor

Throttling

•  Limit on cardinality per partition
    – Requires central management
    – Drop records over max
•  Remove or trim large fields

Before trimming:

S 1359487051.701 4200191766 2097152 2097918 61.252.169.21 GET 440 1423 itunesus011.download.akamai.com 200 - iPeV image/jpeg - - 44 3031 - - - - - W - us/r1000/011/Purple/53/e6/0f/mzl.slohufby.320x480-75.jpg - a440.phobos.apple.com

After trimming:

S 1359487051.701 4200191766 2097152 2097918 61.252.169.21 GET 440 1423 ~ 200 - iPeV image/jpeg - - 44 3031 - - - - - W - ~
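Both throttling ideas can be sketched in a few lines. The length threshold, the `~` placeholder rule, and the class name are assumptions. The sketch keeps its state in one process, whereas the slide notes that a real cardinality limit requires central management. It also does not reproduce the exact trimming shown on the slide, which appears to drop trailing fields as well.

```python
from collections import defaultdict

MAX_FIELD_LEN = 24  # assumed threshold; the slide does not give one


def trim_fields(line, max_len=MAX_FIELD_LEN):
    """Replace any whitespace-separated field longer than max_len with '~'."""
    return " ".join(f if len(f) <= max_len else "~" for f in line.split())


class CardinalityThrottle:
    """Admit at most max_keys distinct keys per partition; drop the rest."""

    def __init__(self, max_keys):
        self.max_keys = max_keys
        self.seen = defaultdict(set)

    def admit(self, partition, key):
        keys = self.seen[partition]
        if key in keys:
            return True  # already-tracked keys keep flowing
        if len(keys) >= self.max_keys:
            return False  # over the limit: drop records for new keys
        keys.add(key)
        return True


if __name__ == "__main__":
    print(trim_fields("GET 440 1423 itunesus011.download.akamai.com 200"))
    throttle = CardinalityThrottle(max_keys=2)
    for host in ["a.example", "b.example", "c.example", "a.example"]:
        print(host, throttle.admit(1359486900, host))
```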

Page 18: JDD2014: Real Big Data - Scott MacGregor

Aggregation

•  Merge
    –  Merge-sort blocks in a partition
•  Reduce
    –  Combine values for like keys
        •  Sum, Min, Max, Mask, etc.
•  Shuffle
    –  Move the data to where it’s needed or closer to like data

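[Figure] Worked example

Input records:

1359486900 1423 1 209158
1359529800 1423 1 209158
1359486900 1423 1 209158

After Aggregate (values combined for like keys):

1359486900 1423 2 418316
1359529800 1423 1 209158

After Shuffle (records routed by key):

{1423, 1359486900}  2 418316
{1423, 1359529800}  1 209158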
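The reduce and shuffle steps from the figure can be sketched as below, using the three input records shown on the slide. Summation as the combine function follows the slide's numbers. The column meanings (time bucket, ID, hit count, bytes), the CRC32-based routing, and the node count are assumptions.

```python
import zlib
from collections import defaultdict

# (time bucket, id, hits, bytes): column meanings are assumed from context.
RECORDS = [
    (1359486900, 1423, 1, 209158),
    (1359529800, 1423, 1, 209158),
    (1359486900, 1423, 1, 209158),
]


def aggregate(records):
    """Reduce: sum the value columns for records that share a key."""
    totals = defaultdict(lambda: [0, 0])
    for bucket, ident, hits, nbytes in records:
        entry = totals[(ident, bucket)]
        entry[0] += hits
        entry[1] += nbytes
    return {key: tuple(val) for key, val in sorted(totals.items())}


def shuffle(aggregated, num_nodes):
    """Shuffle: route each key to one node so like keys end up together."""
    nodes = [dict() for _ in range(num_nodes)]
    for key, value in aggregated.items():
        node = zlib.crc32(repr(key).encode()) % num_nodes
        nodes[node][key] = value
    return nodes


if __name__ == "__main__":
    agg = aggregate(RECORDS)
    for key, value in agg.items():
        print(key, value)
    print(shuffle(agg, 4))
```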

Page 19: JDD2014: Real Big Data - Scott MacGregor

Tracking

•  Tracking
    – Embed GUID in each data unit sent
    – Publish GUIDs independent from data flow
    – Completeness is expected (published GUIDs) vs. actual (embedded GUID)
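The completeness check described above amounts to comparing two sets of GUIDs. A minimal sketch, assuming a data unit is a dict and that the published and received GUIDs are both available as plain collections (the function names are hypothetical):

```python
import uuid


def make_unit(payload):
    """Wrap a payload in a data unit that carries its own GUID."""
    return {"guid": str(uuid.uuid4()), "payload": payload}


def completeness(published_guids, received_guids):
    """Compare expected (published) GUIDs against actual (embedded) GUIDs."""
    published = set(published_guids)
    received = set(received_guids)
    missing = published - received
    unexpected = received - published
    ratio = len(published & received) / len(published) if published else 1.0
    return ratio, missing, unexpected


if __name__ == "__main__":
    units = [make_unit(f"block-{i}") for i in range(4)]
    published = [u["guid"] for u in units]   # sent out of band
    arrived = [u["guid"] for u in units[:3]]  # one unit lost in transit
    ratio, missing, unexpected = completeness(published, arrived)
    print(f"{ratio:.0%} complete, {len(missing)} missing")
```

Publishing the GUIDs on a separate path means a lost data unit still shows up as missing, rather than vanishing silently.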

Page 20: JDD2014: Real Big Data - Scott MacGregor

Data integrity

•  Watermark
    – Producer watermarks every n-lines with a crypto key
    – Receiver checks watermarks
•  Checksum
    – Block checksums
    – Line CRC
    – Etc.
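One way to realize these checks is sketched below. The slide does not say how the watermark is computed, so HMAC-SHA256 over each block of n lines is an assumption, as are the `#WM ` marker line and the CRC32 suffix format. The sketch also assumes no data line starts with `#WM `.

```python
import hashlib
import hmac
import zlib

MARK = "#WM "  # hypothetical marker prefix for watermark lines


def add_line_crc(line):
    """Append a CRC32 of the line as 8 hex digits."""
    return f"{line} {zlib.crc32(line.encode()):08x}"


def check_line_crc(stamped):
    line, _, crc = stamped.rpartition(" ")
    return crc == f"{zlib.crc32(line.encode()):08x}"


def watermark(lines, key, n):
    """Producer: emit a keyed MAC line after every n data lines."""
    out = []
    mac = hmac.new(key, digestmod=hashlib.sha256)
    for i, line in enumerate(lines, 1):
        out.append(line)
        mac.update(line.encode() + b"\n")
        if i % n == 0:
            out.append(MARK + mac.hexdigest())
            mac = hmac.new(key, digestmod=hashlib.sha256)
    return out


def verify(stream, key, n):
    """Receiver: recompute each block's MAC and compare with the watermark."""
    mac = hmac.new(key, digestmod=hashlib.sha256)
    count = 0
    for line in stream:
        if line.startswith(MARK):
            expected = mac.hexdigest().encode()
            if count != n or not hmac.compare_digest(line[len(MARK):].encode(), expected):
                return False
            mac = hmac.new(key, digestmod=hashlib.sha256)
            count = 0
        else:
            mac.update(line.encode() + b"\n")
            count += 1
    return count < n  # a trailing partial block is tolerated but unverified


if __name__ == "__main__":
    key = b"shared-secret"
    stream = watermark([f"line {i}" for i in range(6)], key, n=3)
    print(verify(stream, key, n=3))
    print(check_line_crc(add_line_crc("line 0")))
```

A watermark detects tampering or dropped lines by anyone without the key, while a plain CRC only catches accidental corruption.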

Page 21: JDD2014: Real Big Data - Scott MacGregor

Analysis

•  Online
    – Precomputed reports
•  Batch
    – Spark Programs
    – Map/Reduce
    – Hive: HQL
    – SQL

Page 22: JDD2014: Real Big Data - Scott MacGregor

Big Data at Akamai

•  Billing and Reporting •  System monitoring •  Media Analytics •  Security •  Log archive

Page 23: JDD2014: Real Big Data - Scott MacGregor

Billing and reporting

[Diagram] Components shown: Akamai Edge Networks and Products, Logs, Q, Parse, Pipelines, Shuffle, Split, Aggregate, Billing DB, Reporting, Internal Apps, Customers

Volume: 3 PB/day, doubles every year

Parsing
•  splits lines into fields
•  maps keys to values per pipeline
•  each log generates many pipelines
•  each pipeline represents a streaming table

Evolution
•  Logs were emailed (up to 1PB/day)
•  Now delivered via SPDY (3PB/day)

Page 24: JDD2014: Real Big Data - Scott MacGregor

System monitoring

[Diagram] Components shown: Akamai Networks and Products, Client, SQL, Parser, TLA, Agg (several), Alert, Trend

•  TLA: the top-level aggregator pulls data from aggregators, which pull data from producers at the time of the request
•  Producers rewrite data locally
•  50M jobs/day

Evolution
•  Single machine memory for table joins
•  Future: distributed memory for table joins
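The pull-based hierarchy described for system monitoring can be sketched as a tree in which nothing is pushed upward ahead of time: a query at the top fans out to aggregators, which fan out to producers when the request arrives. The class names, the row format, and the use of a Python predicate instead of SQL are all assumptions for illustration.

```python
class Producer:
    """A leaf node that holds its own rows locally until asked."""

    def __init__(self, rows):
        self.rows = list(rows)

    def query(self, predicate):
        return [row for row in self.rows if predicate(row)]


class Aggregator:
    """Pulls from its children on demand; a TLA is an aggregator of aggregators."""

    def __init__(self, children):
        self.children = list(children)

    def query(self, predicate):
        rows = []
        for child in self.children:
            rows.extend(child.query(predicate))
        return rows


if __name__ == "__main__":
    region_a = Aggregator([
        Producer([{"host": "a1", "cpu": 91}, {"host": "a2", "cpu": 12}]),
        Producer([{"host": "a3", "cpu": 55}]),
    ])
    region_b = Aggregator([Producer([{"host": "b1", "cpu": 97}])])
    tla = Aggregator([region_a, region_b])

    # Roughly "SELECT * FROM stats WHERE cpu > 90", evaluated at request time.
    for row in tla.query(lambda r: r["cpu"] > 90):
        print(row)
```

Because every level filters before returning rows, only matching data travels up the tree.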

Page 25: JDD2014: Real Big Data - Scott MacGregor

Media analytics

[Diagram] Components shown: Akamai Products, Events, Pipelines, Front end, Column Store, Index, Reporting, API / UI, Customers

•  Indexes are recreated for each update
•  Supports insert and update
•  Reads are flexible and fast

Evolution
•  Index now fingerprint to lower cost
•  HyperLogLog for uniqueness counting
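The slide mentions HyperLogLog for uniqueness counting. The following is a compact textbook version for illustration only, not the implementation used in the product. The precision `p=12`, the SHA-1 based 64-bit hash, and the class interface are assumptions.

```python
import hashlib
import math


class HyperLogLog:
    """Estimate the number of distinct items using 2**p small registers."""

    def __init__(self, p=12):
        assert p >= 7, "the alpha constant below is only valid for m >= 128"
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                 # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        """Union of two sketches: take the register-wise maximum."""
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate


if __name__ == "__main__":
    viewers = HyperLogLog()
    for i in range(10000):
        viewers.add(f"viewer-{i}")
        viewers.add(f"viewer-{i}")  # duplicates do not change the estimate
    print(round(viewers.count()))
```

Sketches merge by register-wise maximum, so per-partition counts can be combined downstream without revisiting the raw events.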

Page 26: JDD2014: Real Big Data - Scott MacGregor

Security products

[Diagram] Components shown: Akamai Edge Networks, Akamai Web Firewall, Events, External Data, Pipelines, Front end, HDFS, Map/Reduce, HBASE, Hive, Cloudera, Graphite, Operations Center, Reputation Scoring, Threat Analysis, Intelligence Reports, Risk Based Authentication, Payment Fraud

Volume: 20 TB/day

Evolution
•  Replacing HBASE with custom aggregator
•  Replacing Hive with custom SQL processor

Page 27: JDD2014: Real Big Data - Scott MacGregor

Log archive

[Diagram] Components shown: Logs, Q, Parse, Pipelines, Archive, Archive Index (10TB), Log cache (10%), Client IP Sketch, HDFS, Spark, Spark SQL, Archive Front End, Client Request

•  180 PB, 450 trillion records; doubles every year
•  Archive is 90 data centers distributed over wide area; projected 1.2 EB in 3 years
•  Client request: get index and/or CIP; cache first, then archive

Evolution
•  Was flat file for index, now HDFS/Spark

Page 28: JDD2014: Real Big Data - Scott MacGregor

The Ecosystem

Base layer: HDFS, Hadoop / Yarn

•  Script: Pig
•  SQL: Hive
•  NoSQL: HBASE
•  Stream: Kafka, Storm
•  Search: Solr
•  In-Mem: Spark
•  Integration: Flume, Avro
•  Operations: Ambari, Zookeeper, Oozie
•  Monitoring: Graphite
•  Sharing: Mesos

Page 29: JDD2014: Real Big Data - Scott MacGregor

Building a system

If you need fast access to massive amounts of data where queries are constrained to an index (read optimized):
•  Start with HDFS or Cassandra
•  Add HBASE column store
•  Add Hive for SQL-like access
•  Add Pig for scripting

[Diagram] HDFS, Hadoop / Yarn; HBASE (Get, Put); Hive (Select *); Pig ({ … })

Page 30: JDD2014: Real Big Data - Scott MacGregor

Building a system

If you need to search logs:
•  Start with HDFS
•  Add Flume for log data integration
•  Add Avro for data serialization
•  Add Solr for search

[Diagram] HDFS, Hadoop / Yarn; Solr (Search, e.g. Ip = 1.1.1.1); Flume Agent (Avro Sink); Flume Collector (Avro Source)

Page 31: JDD2014: Real Big Data - Scott MacGregor

Building a system

If you need flexible and shared access to unlimited amounts of data:
•  Start with HDFS or Cassandra
•  Add Hadoop for Map/Reduce or
•  Add Hive for SQL-like access or
•  Add Pig for scripting
•  Add Mesos for resource sharing
•  Add Ambari for cluster management and provisioning
•  Add map/reduce programs for business logic

[Diagram] HDFS, Hadoop / Yarn; Pig ({…}); Hive (Select *); Flume; Ambari; Mesos; Map/Reduce (Java { … })

Page 32: JDD2014: Real Big Data - Scott MacGregor

Building a system

If you need fast, flexible access to in-memory data:
•  Start with HDFS
•  Add Spark
•  Add Spark SQL for SQL-like access or
•  Create Spark programs for other business logic

[Diagram] HDFS, Hadoop / Yarn; Spark; SparkSQL (Select * from); Spark Progs (Java { … })

Page 33: JDD2014: Real Big Data - Scott MacGregor

Building a system

If you need real-time stream event processing:
•  Start with HDFS
•  Add Kafka for messaging and pub/sub
•  Add Storm for event processing
•  Develop Java Bolts for processing logic

[Diagram] HDFS, Hadoop / Yarn; Kafka; Storm; Bolts ({ … })

Page 34: JDD2014: Real Big Data - Scott MacGregor

Future at Akamai

•  100x
    –  Everything bigger and faster
    –  Requires new R&D across many Big Data components
•  Scaling Big Data Eco across wide-area
•  Internet Security
    –  Positive reputation scoring
    –  Automatic DDoS mitigation
•  Low latency data collection
    –  2^53 unique keys, <1 minute latency
•  Support DevOps
    –  Near real-time monitoring and control

Page 35: JDD2014: Real Big Data - Scott MacGregor

Thank You