6
CS535 Big Data 1/28/2019 Week 2-A Sangmi Lee Pallickara http://www.cs.colostate.edu/~cs535 Spring 2019 Colorado State University, Page 1 Week 2-A-0 CS535 BIG DATA PART A. BIG DATA TECHNOLOGY 2. DATA PROCESSING PARADIGMS FOR BIG DATA Sangmi Lee Pallickara Computer Science, Colorado State University http://www.cs.colostate.edu/~cs535 FAQs Slides are available on the course web Canvas Discussion Board is available: Find your teammates! PA1 Hadoop and Spark installation guides are posted If you would like to start your homework, please send me an email with your team information. I will assign the port range for your team. 1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-1 Overview of Part A Duration: Week 1 ~ Week 6 1. Introduction to Big Data (W1) 2. Data Process Paradigms for Big Data (W2) 3. Distributed Computing Models for Scalable Batch Computing Part 1. MapReduce (W2) Part 2. In-Memory Cluster Computing Model: Apache Spark (W3, W4) 4. Real-time Streaming Computing Models (W5) Apache Storm and Twitter Heron 5. Scalable Distributed File Systems: Google File Systems I and II (W6) 1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-2 2. Data Processing Paradigms For Big data Lambda Architecture 1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-3 This material is built based on Nathan Marz and James Warren, “Big Data, Principles and Best Practices of Scalable Real-Time Data System”, 2015, Manning Publications, ISBN 9781617290343 1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-4 Why we are looking at Lambda Architecture To perform large-scale analytics over voluminous data, we need a high- level architecture that provides, Robustness Fault-tolerant: Both against hardware failures and human mistakes Support for a wide range of workloads and use cases Low-latency reads and updates Batch analytics jobs Scalability Scale-out capabilities with minimal maintenance 1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-5

CS535 Big Data 1/28/2019 Week 2-A Sangmi Lee Pallickaracs535/slides/week2-A.pdfApache Storm and Twitter Heron 5. ScalableDistributedFileSystems:GoogleFileSystemsIandII (W6) 1/28/2019

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

CS535 Big Data 1/28/2019 Week 2-A Sangmi Lee Pallickara

http://www.cs.colostate.edu/~cs535 Spring 2019 Colorado State University, Page 1

Week 2-A-0

CS535 BIG DATA

PART A. BIG DATA TECHNOLOGY2. DATA PROCESSING PARADIGMS FOR BIG DATA

Sangmi Lee PallickaraComputer Science, Colorado State Universityhttp://www.cs.colostate.edu/~cs535

FAQs• Slides are available on the course web

• Canvas Discussion Board is available: Find your teammates!

• PA1• Hadoop and Spark installation guides are posted• If you would like to start your homework, please send me an email with your team information. I will

assign the port range for your team.

1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-1

Overview of Part A• Duration: Week 1 ~ Week 6

1. Introduction to Big Data (W1)2. Data Process Paradigms for Big Data (W2)3. Distributed Computing Models for Scalable Batch Computing

Part 1. MapReduce (W2)Part 2. In-Memory Cluster Computing Model: Apache Spark (W3, W4)

4. Real-time Streaming Computing Models (W5)Apache Storm and Twitter Heron

5. Scalable Distributed File Systems: Google File Systems I and II (W6)

1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-2

2. Data Processing Paradigms For Big dataLambda Architecture

1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-3

This material is built based on• Nathan Marz and James Warren, “Big Data, Principles and Best Practices of Scalable

Real-Time Data System”, 2015, Manning Publications, ISBN 9781617290343

1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-4

Why we are looking at Lambda Architecture • To perform large-scale analytics over voluminous data, we need a high-level architecture that provides,

• Robustness

• Fault-tolerant: Both against hardware failures and human mistakes

• Support for a wide range of workloads and use cases

• Low-latency reads and updates

• Batch analytics jobs

• Scalability

• Scale-out capabilities with minimal maintenance

1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-5

CS535 Big Data 1/28/2019 Week 2-A Sangmi Lee Pallickara

http://www.cs.colostate.edu/~cs535 Spring 2019 Colorado State University, Page 2

Typical problems for scaling traditional databases

• Suppose that the application should track the number of page views for any URL a customer wishes to track• The customer’s web page pings the application’s web server with its URL every time a page view is

received• Application tells you top 100 URLs by number of page views

Id (integer) User_id (integer) url (varchar(255) Pageviews(bigint)

1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-6

Direct access• Direct access from Web server to the backend DB cannot handle the large amount

of frequent write requests • Timeout errors

Web server

DB

1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-7

Scaling with a queue

• Batch many increments in a single request

• What if your data amount increases even more?• Your worker cannot keep up with the writes

• What if you add more workers?• Again, the Database will be overloaded

Queue Worker

Web server

DB

Pageview

100 at a time

1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-8

Scaling by sharding the database

• Horizontal partitioning or sharding of database• Uses multiple database servers and spreads the table across all the servers• Chooses the shard for each key by taking the hash of the key modded by the number of shards

• What if your current number of shards cannot handle your data?• Your mapping script should cope with new set of shards • Application and data should be re-organized

1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-9

Other issues• Fault-tolerance issues

• What if one of the database machines is down?• A portion of the data is unavailable

• Corruption issues• What if your worker code accidentally generated a bug and stored the wrong number for some of the

data portions

1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-10

How will Big Data techniques help?• The databases and computation systems used in Big Data applications are aware of

their distributed nature• Sharding and replications will be considered as a fundamental component in the design of Big Data

systems

• Data is dealt as immutable• Users will mutate data continuously, however

• The raw pageview information is not modified

• Applications will be designed in different ways

1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-11

CS535 Big Data 1/28/2019 Week 2-A Sangmi Lee Pallickara

http://www.cs.colostate.edu/~cs535 Spring 2019 Colorado State University, Page 3

Lambda Architecture• Big Data data processing architecture as a series of layers

1. Batch layer

2. Serving layer

3. Speed layer

Speed layer

Serving layer

Batch layer

1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-12

1. Batch layer• Batch layer

• Precomputes results using distributed processing system• The component that performs the batch view processing

• batch view= function(e.g. Sum of values)• batch view= function(e.g. Training a Predictive Model)

• After the computationàStores an immutable, constantly growing master dataset• E.g. values, model, or distribution

• Computes arbitrary functions on that dataset• Batch-processing systems

• e.g. Hadoop, Spark, TensorFlow

1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-13

Generating batch views

All Data Batch layer

Batch view

Batch view

Batch view

Batch layer is often a high-latency operation

1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-14

2. Serving layer• The batch layer emits batch view as the result of its functions

• These views should be loaded somewhere and queried

• Specialized distributed database that loads in a batch view and makes it possible to do random reads on it

• Batch update and random reads should be supported• e.g. BigQuery, ElephantDB, Dynamo, MongoDB, Cassandra

1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-15

3. Speed layer• Q: Is there any data not represented in the batch view?

• Data arrives while the precomputation is running

• With fully real-time data system

• Speed layer looks only at recent data

• Whereas the batch layer looks at all the data (except realtime data) at once

• realtime view= function(realtime view, new data)

1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-16

How long should the real time view be maintained?

• Once the data arrives at the serving layer, the corresponding results in the real-time views are no longer needed• You can discard pieces of the realtime views

1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-17

CS535 Big Data 1/28/2019 Week 2-A Sangmi Lee Pallickara

http://www.cs.colostate.edu/~cs535 Spring 2019 Colorado State University, Page 4

Relevance of Data

Data absorbed into batch view Data absorbed into real-time view

Time

Now

1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-18

Lambda architecture

ProcessStream

Increment view

Realtime increment

Speed layer

Batch layer

Serving layer

Query

View 1 View 2 View 3

Batch views

View 1 View 2 View 3

Realtime views

mergeNew data stream

ProcessStream

Increment view

Batch recompute

Composing Algorithm

1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-19

Practical Example with Lambda architecture• Web analytics application tracking the number of page views over a range of days

• The speed layer keeps its own separate view of [url, day]

• Updates its views by incrementing the count in the view whenever it receives new data

• The batch layer recomputes its views by counting the page views

• To resolve the query, you query both the batch and realtime views

• With satisfying ranges

• Sum up the results

1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-20

Extended examples and use cases• Full re-computation vs. partial re-computation

• e.g. using Bloom filters, PageRank algorithm

• Additive algorithms vs. approximation algorithms• e.g. HyperLogLog for count-distinct problem

1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-21

Exercise Suppose you are running an analytics system designed based on the Lambda Architecture. The first batch job has started without any dataset and it took 5 minutes to complete. The second batch job is scheduled 5 minutes after the first job has completed. The second batch job is performed with data that arrived in the first 10 minutes.

Question 1. After 7 minutes of running your system, what is the coverage of data stored in the Speed layer?

a. none b. 0 ~ 7 minutes c. 5 ~ 7 minutes d. 0 ~ 5 minutes

Question 2. After 7 minutes of running your system, what is the coverage of data stored in the Batch layer?

a. none b. 0 ~ 7 minutes c. 5 ~ 7 minutes d. 0 ~ 5 minutes

1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-22

Exercise—Answers Suppose you are running an analytics system designed based on the Lambda Architecture. The first batch job has started without any dataset and it took 5 minutes to complete. The second batch job is scheduled 5 minutes after the first job has completed. The second batch job is performed with data that arrived in the first 10 minutes.

Question 1. After 7 minutes of running your system, what is the coverage of data stored in the Speed layer?

a. none b. 0 ~ 7 minutes c. 5 ~ 7 minutes d. 0 ~ 5 minutes

Question 2. After 7 minutes of running your system, what is the coverage of data stored in the Batch layer?

a. none b. 0 ~ 7 minutes c. 5 ~ 7 minutes d. 0 ~ 5 minutes

1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-23

CS535 Big Data 1/28/2019 Week 2-A Sangmi Lee Pallickara

http://www.cs.colostate.edu/~cs535 Spring 2019 Colorado State University, Page 5

Composing algorithms• The batch/speed layer will split your data

• The exact algorithm on the batch layer• An approximate algorithm on the speed layer

• The batch layer repeatedly overrides the speed layer• The approximation gets corrected• Eventual accuracy

1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-24

Example of Cardinality estimation

• Cardinality estimation

• Count-distinct problem: finding number of distinct elements

• Counting exact unique counts in the batch layer

• A Hyper-LogLog as an approximation in the speed layer

• Batch layer corrects what’s computed in the speed layer

• Eventual accuracy

1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-25

Recent trends in technology (1/3)• Physical limits of how fast a single CPU can go

• Parallelize computation to scale to more data• Scale-out solution

• Elastic clouds• Infrastructure as a Service (IaaS)• Rent hardware on demand rather than owning your hardware• Increase and decrease the size of your cluster nearly instantaneously• Simplifies system administration

1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-26

Recent trends in technology (2/3)• Open source ecosystem for Big Data

• Batch computation systems• Hadoop• Spark

• Flink

• Serialization frameworks• Serializes an object into a byte array from any language

• Deserialize that byte array into an object in any language

• Thrift, Protocol Buffers, and Avro

1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-27

Recent trends in technology (3/3)• Open source ecosystem for Big Data- cont.

• Random-access NoSQL databases• Sacrifice the full expressiveness of SQL

• Specializes in certain kinds of operations

• Cassandra, Hbase, MongoDB, etc.

• Messaging/queuing systems• Sends and consumes messages between processes in a fault-tolerant manner

• Apache Kafka

• Real-time computation system• High throughput, low latency, stream-processing systems

• Apache Storm

1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-28

Mapping recent technologies

Process

StreamIncrement view

Realtime increment

Speed layer

Batch layer

Serving layer

Query

View 1 View 2 View 3

Batch views

View 1 View 2 View 3

Realtime views

mergeNew data

stream

Process

StreamIncrement view

Batch

recompute

1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-29

CS535 Big Data 1/28/2019 Week 2-A Sangmi Lee Pallickara

http://www.cs.colostate.edu/~cs535 Spring 2019 Colorado State University, Page 6

Mapping course components

ProcessStream

Increment view

Realtime increment

Speed layer

Batch layer

Serving layer

Query

View 1 View 2 View 3

Batch views

View 1 View 2 View 3

Realtime views

mergeNew data stream

ProcessStream

Increment view

Batch recomputePart 1

Part 2

Part 2

Part 2

PA1

PA2

1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-30

Kappa architecture

Serving layer

Query

Job n

Job n+1

Stream processingsystem

ProcessStream

output n

output n+1

Jay Kreps, LinkedInKappa architecture is a simplified version of lambda architectureAppend only immutable log

1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-31

Brainstorming Quiz• Lambda architecture vs. Kappa architecture

Question 1. What is the difference between Lambda architecture and Kappa architecture? Question 2. Use case of Lambda architecture?Question 3. User case of Kappa architecture?

1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-32

Questions?

1/28/2019 CS535 Big Data - Spring 2019 Week 2-A-33