Big Data: Into the Multi-cloudverse SANDEEP ARORA © 2019 Binlogic.

SANDEEP - DataOps Barcelona | Databases… · Amazon Kinesis pricing varies by region and is $0.015/hr per shard and $0.014 per million PUT payload units. Event Hubs Standard costs

About the Presenter

SANDEEP ARORA

What do I do?

• Project Engineer at Pythian

• Cloud Architect

• DevOps Engineer

• Database Administrator

• Automation

• Google Authorized Trainer

Certifications

• Google Cloud Certified Professional Cloud Architect

• Google Cloud Certified Professional Data Engineer

• AWS Certified DevOps Engineer - Professional

• AWS Certified Solutions Architect - Professional

• Microsoft Certified Solutions Expert: Data Management and Analytics

• AWS Certified SysOps Administrator - Associate

• AWS Certified Developer - Associate

• AWS Certified Solutions Architect - Associate

Agenda

What are we going to discuss?

• Cloud Vendor

• Ingest

• Store

• Process & Analyse

Data Ingestion

Types of workloads?

• BATCH PROCESSING

• STREAMING DATA

• APPLICATION DATA

Batch Workloads

AWS S3 vs Google Cloud Storage vs Azure Blob Storage

Feature Matrix: Key differentiators

| Feature | AWS S3 | Google Cloud Storage | Azure Blob Storage |
|---|---|---|---|
| Availability SLA | 99.99% | 99.95% | 99.99% |
| Hot Data | S3 Standard | Cloud Storage | Hot Blob Storage |
| Cold Data | Glacier | Coldline | Cold Blob Storage |
| Storage Limits | Unlimited | Unlimited | Unlimited |
| Hot Spotting Issues | Yes | No | Yes |
| Integration with 3rd Party | High | Low to Medium | Moderate to High |
| Replication | Needs to be configured | Multi-Regional Bucket | Geo-redundant storage |

Which object store is better?

Upload Speeds Comparisons

When comparing upload performance, a lot of factors influenced the results:

• The total capacity vs the system load

• The technology and software used for transferring data

• Internet speed, etc.

It was estimated that:

• Uploading large files to Google Cloud Storage was approximately 2.5 times faster than to AWS S3 and Blob Storage, whereas

• Uploading files in small chunks to Google Cloud Storage was 10 times faster than to AWS S3 and Blob Storage (not Premium).

“I would still root for Amazon S3 because it has established enterprise-ready infrastructure, more features and is far more integrated than Google Cloud Storage.”

Which Object Store is cheaper?

AWS S3 vs Google Cloud Storage vs Azure Blob Storage

“GCP has the least expensive pure object storage costs, plus the free transfer of data and it costs 35% less than other vendors.”

| Pricing Matrix | S3 | Cloud Storage | Blob Storage |
|---|---|---|---|
| Service Cost (Per GB Per Month) | $0.023 | $0.02 | $0.0023 (ZRS) |
| Replication Cost (Per GB Per Month) | $0.046 + $0.02 (Data Transfer) | $0.026 (Multi-Regional) + 0 (Data Transfer) | $0.0368 (GRS) |
| Cold Storage Cost (Per GB Per Month) | $0.004 | $0.004 | $0.0125 (ZRS) |
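To make the pricing matrix easier to reason about, the per-GB rates can be turned into a rough monthly estimate. The following is a minimal sketch, not a billing calculator: the rates are copied as they appear on the slide, the function name `monthly_storage_cost` and the example volumes are hypothetical, and request, retrieval, and transfer charges are ignored.

```python
# Per-GB monthly rates (USD) copied from the pricing matrix on the slide.
STORAGE_RATES = {
    "S3": {"standard": 0.023, "cold": 0.004},
    "Cloud Storage": {"standard": 0.02, "cold": 0.004},
    "Blob Storage (ZRS)": {"standard": 0.0023, "cold": 0.0125},
}


def monthly_storage_cost(provider: str, hot_gb: float, cold_gb: float = 0.0) -> float:
    """Estimate the monthly storage bill for hot and cold data volumes in GB."""
    rates = STORAGE_RATES[provider]
    return hot_gb * rates["standard"] + cold_gb * rates["cold"]


if __name__ == "__main__":
    # Hypothetical workload: 1,000 GB of hot data and 5,000 GB of cold data.
    for provider in STORAGE_RATES:
        cost = monthly_storage_cost(provider, hot_gb=1000, cold_gb=5000)
        print(f"{provider}: ${cost:.2f} per month")
```

Keeping the rates in one dictionary makes it easy to swap in current list prices, which change over time and by region.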

Streaming Workloads

Kinesis Data Streams vs Event Hubs vs Pub/Sub

Feature Matrix: Key differentiators

| Feature | Kinesis Data Streams | Event Hubs | Pub/Sub |
|---|---|---|---|
| Messaging Guarantees | At least once | At least once | At least once |
| Ordering Guarantees | Within a shard | Within a partition | None |
| Throughput | One shard can support 1 MB/s input, 2 MB/s output or 1000 records per second. | Scaled in throughput units, each supporting 1 MB/s ingress, 2 MB/s egress or 84 GB storage. Standard tier allows 20 throughput units. | Default is 100 MB/s in, 200 MB/s out, but maximum is quoted as unlimited. |
| Persistence Period | 1-7 Days | 1-7 Days | 7 Days |
| Partitioning | Yes (using Shards) | Yes | Yes |
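The throughput figures in the matrix translate directly into a capacity estimate. The sketch below assumes the per-shard and per-throughput-unit limits exactly as quoted on the slide; the function names and the example workload are hypothetical, and real sizing should also leave headroom for traffic spikes.

```python
import math

# Limits quoted in the feature matrix on the slide.
SHARD_INGRESS_MBPS = 1.0       # Kinesis: 1 MB/s input per shard
SHARD_EGRESS_MBPS = 2.0        # Kinesis: 2 MB/s output per shard
SHARD_RECORDS_PER_SEC = 1000   # Kinesis: 1000 records per second per shard
TU_INGRESS_MBPS = 1.0          # Event Hubs: 1 MB/s ingress per throughput unit
TU_EGRESS_MBPS = 2.0           # Event Hubs: 2 MB/s egress per throughput unit
STANDARD_TIER_MAX_UNITS = 20   # Event Hubs Standard tier limit


def kinesis_shards_needed(ingress_mbps: float, egress_mbps: float,
                          records_per_sec: float) -> int:
    """Smallest shard count that satisfies all three per-shard limits."""
    return max(
        1,
        math.ceil(ingress_mbps / SHARD_INGRESS_MBPS),
        math.ceil(egress_mbps / SHARD_EGRESS_MBPS),
        math.ceil(records_per_sec / SHARD_RECORDS_PER_SEC),
    )


def event_hubs_units_needed(ingress_mbps: float, egress_mbps: float) -> int:
    """Smallest throughput-unit count that satisfies ingress and egress limits."""
    return max(
        1,
        math.ceil(ingress_mbps / TU_INGRESS_MBPS),
        math.ceil(egress_mbps / TU_EGRESS_MBPS),
    )


def fits_standard_tier(units: int) -> bool:
    """True when the unit count stays within the Standard tier limit."""
    return units <= STANDARD_TIER_MAX_UNITS


if __name__ == "__main__":
    # Hypothetical workload: 5 MB/s in, 8 MB/s out, 12,000 records per second.
    print("Kinesis shards:", kinesis_shards_needed(5, 8, 12000))
    units = event_hubs_units_needed(5, 8)
    print("Event Hubs units:", units, "fits Standard:", fits_standard_tier(units))
```

The binding constraint is whichever limit is hit first, which is why the functions take the maximum over the individual requirements rather than sizing on ingress alone.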

Streaming Workloads Continued...

Kinesis Data Streams vs Event Hubs vs Pub/Sub

Feature Matrix: Key differentiators

| Feature | Kinesis Data Streams | Event Hubs | Pub/Sub |
|---|---|---|---|
| DR - Cross Regional | Across 3 zones only | Yes (Standard Tier) | Yes (Automatic) |
| Max Size for each dataset | 1 MB | 1 MB | 10 MB |
| Push Method Support | Yes | Yes | Yes |
| Pull Method Support | Yes | No | Yes |
| Scale | Regional Only | Multi-Regional (Standard) | Global |
| Latency | Milliseconds to Seconds | Milliseconds to Seconds | Milliseconds to Seconds |

Which data streaming service is better?

Cost vs Performance Comparison

● Performance can be evaluated in terms of latency, a time-based measure of a system's performance. The two important latency metrics are:

○ The amount of time it takes to acknowledge a published message.

○ The amount of time it takes to deliver a published message to a subscriber.

● Pricing is a little more complex to compare:

○ Amazon Kinesis pricing varies by region and is $0.015/hr per shard and $0.014 per million PUT payload units.

○ Event Hubs Standard costs $0.03/hr per throughput unit and $0.028 per million events.

○ Cloud Pub/Sub is priced at $60/TiB/month for the amount of data ingested after the first 10 GB.

“Latency measurement for all carefully designed systems was up to the mark.”

~

“Cost of Kinesis is significantly lower than Pub/Sub and Event Hubs.”
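Because the three services bill on different units, a small calculation helps put the quoted rates side by side. The following is a rough sketch under stated assumptions: a month is taken as 730 hours, each record is assumed to consume exactly one PUT payload unit, the first 10 GB of Pub/Sub ingest is treated as 10 GiB, and all function names and example volumes are hypothetical.

```python
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def kinesis_monthly_cost(shards: int, million_put_units: float) -> float:
    """$0.015/hr per shard plus $0.014 per million PUT payload units."""
    return shards * 0.015 * HOURS_PER_MONTH + million_put_units * 0.014


def event_hubs_monthly_cost(throughput_units: int, million_events: float) -> float:
    """$0.03/hr per throughput unit plus $0.028 per million events."""
    return throughput_units * 0.03 * HOURS_PER_MONTH + million_events * 0.028


def pubsub_monthly_cost(ingested_gib: float) -> float:
    """$60 per TiB ingested after the first 10 GB (treated here as 10 GiB)."""
    billable_gib = max(0.0, ingested_gib - 10)
    return billable_gib / 1024 * 60


if __name__ == "__main__":
    # Hypothetical month: 2 shards or throughput units, 100 million events,
    # and 1,034 GiB of data ingested.
    print(f"Kinesis:    ${kinesis_monthly_cost(2, 100):.2f}")
    print(f"Event Hubs: ${event_hubs_monthly_cost(2, 100):.2f}")
    print(f"Pub/Sub:    ${pubsub_monthly_cost(1034):.2f}")
```

The comparison is only as good as the workload assumptions: Kinesis and Event Hubs charge for provisioned capacity plus events, while Pub/Sub charges purely on data volume, so the ranking can shift with message size and traffic shape.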

Data Processing: Streaming workloads

Kinesis Data Analytics vs Azure Stream Analytics vs GCP Dataflow

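Feature Matrix: Key differentiators

| Feature | Kinesis Data Analytics | Azure Stream Analytics | GCP Dataflow |
|---|---|---|---|
| Batch Processing | No | No | Yes |
| Stream Processing | Yes | Yes | Yes |
| Serverless | Yes | Yes | Yes |
| Latency | Sub-Second | Low | Low |
| Native ML Integration | Yes | Yes | Yes |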

Data Processing: Managed Hadoop Clusters

EMR vs Dataproc vs HDInsight

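Feature Matrix: Key differentiators

| Feature | EMR | Dataproc | HDInsight |
|---|---|---|---|
| Open source ecosystem | Y | Y | Y |
| Native Integration with other Vendor Services | Y | Y | Y |
| Multiple Application Support | Y | Y | Y |
| Elasticity and Flexibility | Y | Y | Y |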

Cost and Performance Comparisons

EMR vs Dataproc vs HDInsight

● With 4 vCPUs and around 15 GB of RAM, you pay:

○ $0.240 per hour with Dataproc.

○ $0.336 per hour with EMR.

○ $0.338 per hour with HDInsight.

● We set up a trial to compare the performance and cost of a typical Spark workload. The trial used a public dataset and clusters with one master and five worker instances of:

○ AWS m3.xlarge

○ Azure’s A4m v2 general-purpose instance

○ GCP’s n1-standard-4
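The hourly rates and the trial's cluster shape can be combined into a quick cost estimate. This is a minimal sketch that assumes the quoted rate applies per instance and that all six nodes (one master plus five workers) are billed at the same rate; neither assumption is stated on the slide, and the function name `cluster_cost` and the 10-hour run are hypothetical.

```python
# Hourly rates (USD) quoted on the slide for 4 vCPUs and around 15 GB of RAM.
HOURLY_RATE_PER_NODE = {
    "Dataproc": 0.240,
    "EMR": 0.336,
    "HDInsight": 0.338,
}

# Cluster shape used in the trial: one master and five workers.
TRIAL_NODES = 1 + 5


def cluster_cost(service: str, hours: float, nodes: int = TRIAL_NODES) -> float:
    """Estimate the cost of running a cluster, assuming a uniform per-node rate."""
    return HOURLY_RATE_PER_NODE[service] * nodes * hours


if __name__ == "__main__":
    # Hypothetical 10-hour Spark run on the trial cluster shape.
    for service in HOURLY_RATE_PER_NODE:
        print(f"{service}: ${cluster_cost(service, hours=10):.2f}")
```

A per-hour comparison like this ignores billing granularity and job duration, so a faster cluster can still come out cheaper overall even at a higher hourly rate.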

Why is Dataproc a Game Changer?

EMR vs Dataproc vs HDInsight

● Jobs-first Hadoop and Spark, not a clusters-first approach.

● Cheaper: per-minute billing, custom VMs, preemptible VMs, sustained use discounts, and cheaper VM list prices.

● Faster: rapid cluster boot-up times, best-in-class object storage, best-in-class networking, and RAM-like performance characteristics of Local SSDs.

● Easier: lots of capacity, less fragmented instance type offerings, VPC-by-default, and images that closely follow Apache releases.

Data Warehouse: OLTP vs OLAP

What is the difference?

Data Warehouse: Row vs Column store

What is the difference?

Row Store vs Column Store (the slide shows the same table for both):

| Country | Product | Sales |
|---|---|---|
| US | Alpha | 3000 |
| US | Beta | 1250 |
| JP | Alpha | 700 |
| UK | Alpha | 450 |
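The difference between the two layouts is easiest to see by laying out the slide's sample table both ways. The sketch below is illustrative only: it uses plain Python lists and dictionaries to stand in for on-disk layouts, and the helper names are hypothetical.

```python
# Sample records from the slide: (Country, Product, Sales).
records = [
    ("US", "Alpha", 3000),
    ("US", "Beta", 1250),
    ("JP", "Alpha", 700),
    ("UK", "Alpha", 450),
]

# Row store: all values of one record are stored next to each other.
row_store = [value for record in records for value in record]

# Column store: all values of one column are stored next to each other.
column_store = {
    "Country": [r[0] for r in records],
    "Product": [r[1] for r in records],
    "Sales": [r[2] for r in records],
}


def total_sales_row_store(flat, width=3, sales_index=2):
    """Step through every record to pick out the Sales field."""
    return sum(flat[i] for i in range(sales_index, len(flat), width))


def total_sales_column_store(store):
    """Read only the Sales column; other columns are never touched."""
    return sum(store["Sales"])


if __name__ == "__main__":
    print("Row layout:   ", row_store)
    print("Column layout:", column_store)
    print("Total sales (row store):   ", total_sales_row_store(row_store))
    print("Total sales (column store):", total_sales_column_store(column_store))
```

Both functions return the same total, but the column layout reads only the values it needs, which is why analytical (OLAP) queries that aggregate a few columns tend to favour column stores, while row stores suit workloads that read or write whole records.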

Data Analysis

Redshift vs Snowflake vs BigQuery

Feature Matrix: Key differentiators

| Feature | Redshift | Snowflake | BigQuery |
|---|---|---|---|
| Elasticity | Hours | Minutes | Query |
| Availability | Backup | Distributed System | Distributed System |
| JSON Support | No Arrays | Native | User-Defined Functions |
| Tune where clause? | Sortkey | Partition By | Partitioning |
| Tune Joins? | Distkey | No | No |
| | metal-most “tuneable system” | hybrid system | 100% shared distributed system |

Data Warehousing

Costing: Redshift vs Snowflake vs BigQuery

• Redshift offers 3 pricing models:

    • On-Demand Pricing: no upfront commitments or costs; you simply pay an hourly rate depending on the type and number of nodes in your cluster.

    • Spectrum Pricing: you pay only for the bytes scanned while querying against Amazon S3.

    • Reserved Instance Pricing: save up to 75% over on-demand rates.

• Snowflake bills per second, with a minimum of 1 minute, so you can save money by configuring your cluster to turn off during periods of inactivity.

• BigQuery bills per query, so you pay only for what you use.
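The per-second billing rule with a one-minute minimum is simple to express in code. This is a minimal sketch of that rule as described on the slide; the function names are hypothetical, and `price_per_hour` is a placeholder value rather than an actual Snowflake rate.

```python
import math

MINIMUM_BILLED_SECONDS = 60  # one-minute minimum, as stated on the slide


def billed_seconds(runtime_seconds: float) -> int:
    """Seconds charged for one run: per-second billing with a 60-second floor."""
    if runtime_seconds <= 0:
        return 0
    return max(MINIMUM_BILLED_SECONDS, math.ceil(runtime_seconds))


def warehouse_cost(runtime_seconds: float, price_per_hour: float) -> float:
    """Cost of one run at a given hourly price (hypothetical rate)."""
    return billed_seconds(runtime_seconds) * price_per_hour / 3600


if __name__ == "__main__":
    hourly_price = 4.0  # hypothetical price, for illustration only
    for runtime in (10, 61.2, 1800):
        print(f"{runtime} s -> billed {billed_seconds(runtime)} s, "
              f"cost ${warehouse_cost(runtime, hourly_price):.4f}")
```

The floor means that many very short bursts each pay for a full minute, which is why suspending the cluster during idle periods matters more than shaving seconds off individual queries.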

Which data warehouse is the fastest?

Execution times for 99 TPC-DS Queries

“All warehouses had excellent average execution speed, suitable for ad-hoc, interactive querying. However, AWS Redshift was slower primarily due to its slower query planner.”

Which data warehouse is the cheapest?

Cost of 99 TPC-DS Queries

“Overall, for unpredictable and spiky workloads BigQuery would be much cheaper than the other warehouses but for steady and continuous workloads BigQuery is a rather expensive choice.”

Overview: Differentiating Factors

Redshift vs Snowflake vs BigQuery

Amazon Redshift

Azure Snowflake

Google BigQuery

“BigQuery is a shared-resource query service, so there is no equivalent ‘configuration’.”

~

“Redshift and Snowflake are fantastic choices for users with large, on-going data needs.”

~

“BigQuery and Snowflake are easy to use.”

~

“Snowflake has support for every kind of SQL Statement.”
