Big Data: Into the Multi-cloudverse
SANDEEP ARORA
© 2019 Binlogic.
About the Presenter
SANDEEP ARORA
What I do?
• Project Engineer at Pythian
• Cloud Architect
• DevOps Engineer
• Database Administrator
• Automation
• Google Authorized Trainer
Certifications
• Google Cloud Certified Professional Cloud Architect
• Google Cloud Certified Professional Data Engineer
• AWS Certified DevOps Engineer - Professional
• AWS Certified Solutions Architect - Professional
• Microsoft Certified Solutions Expert: Data Management and Analytics
• AWS Certified SysOps Administrator - Associate
• AWS Certified Developer - Associate
• AWS Certified Solutions Architect - Associate
Agenda
What are we going to discuss?
Cloud Vendor | Ingest | Store | Process & Analyse
Data Ingestion
Types of workloads?
• Batch processing
• Streaming data
• Application data
Batch Workloads
AWS S3 vs Google Cloud Storage vs Azure Blob Storage
Feature Matrix: Key differentiators
| Feature | AWS S3 | Google Cloud Storage | Azure Blob Storage |
|---|---|---|---|
| Availability SLA | 99.99% | 99.95% | 99.99% |
| Hot Data | S3 Standard | Cloud Storage | Hot Blob Storage |
| Cold Data | Glacier | Coldline | Cold Blob Storage |
| Storage Limits | Unlimited | Unlimited | Unlimited |
| Hot Spotting Issues | Yes | No | Yes |
| Integration with 3rd Party | High | Low to Medium | Moderate to High |
| Replication | Needs to be configured | Multi-Regional Bucket | Geo-redundant storage |
Which object store is better?
Upload Speed Comparisons
When comparing upload performance, many factors influenced the results:
• The total capacity vs the system load
• The technology and software used for transferring data
• Internet speed, etc.
It was estimated that, when uploading:
• Large files: Google Cloud Storage was approximately 2.5 times faster than AWS S3 and Blob Storage.
• Small file chunks: Google Cloud Storage was 10 times faster than AWS S3 and Blob Storage (not Premium).
“I would still root for Amazon S3 because it has established enterprise-ready infrastructure, more
features and is far more integrated than Google Cloud Storage.”
Which Object Store is cheaper?
AWS S3 vs Google Cloud Storage vs Azure Blob Storage
“GCP has the least expensive pure object storage costs, plus the free transfer of data and it costs 35%
less than other vendors.”
| Pricing Matrix | S3 | Cloud Storage | Blob Storage |
|---|---|---|---|
| Service Cost (Per GB Per Month) | $0.023 | $0.02 | $0.0023 (ZRS) |
| Replication Cost (Per GB Per Month) | $0.046 + $0.02 (Data Transfer) | $0.026 (Multi-Regional) + 0 (Data Transfer) | $0.0368 (GRS) |
| Cold Storage Cost (Per GB Per Month) | $0.004 | $0.004 | $0.0125 (ZRS) |
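To make the replication row of the pricing matrix concrete, the sketch below multiplies the quoted per-GB rates by a data volume. The 10,000 GB volume and the helper names `monthly_cost` and `cheapest` are assumptions for illustration only. The rates are the list prices quoted in the matrix, and the S3 figure folds in the $0.02 data-transfer component exactly as the matrix presents it. Real bills depend on region, date, and usage pattern.

```python
# Replicated-storage rates in USD per GB per month, copied from the pricing matrix.
# Assumption: the data-transfer component is simply added to the per-GB rate.
REPLICATED_RATE = {
    "S3": 0.046 + 0.02,
    "Cloud Storage": 0.026 + 0.0,
    "Blob Storage (GRS)": 0.0368,
}


def monthly_cost(rate_per_gb: float, gigabytes: float) -> float:
    """Return the monthly bill in USD for `gigabytes` stored at `rate_per_gb`."""
    return rate_per_gb * gigabytes


def cheapest(rates: dict) -> str:
    """Return the name of the service with the lowest per-GB rate."""
    return min(rates, key=rates.get)


if __name__ == "__main__":
    volume_gb = 10_000  # hypothetical data volume
    for name, rate in REPLICATED_RATE.items():
        print(f"{name}: ${monthly_cost(rate, volume_gb):,.2f} per month")
    print("Cheapest replicated option:", cheapest(REPLICATED_RATE))
```

Keeping the rates in one dictionary makes it easy to swap in current prices before drawing any conclusion.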
Streaming Workloads
Kinesis Data Streams vs Event Hubs vs Pub/Sub
Feature Matrix: Key differentiators
| Feature | Kinesis Data Streams | Event Hubs | Pub/Sub |
|---|---|---|---|
| Messaging Guarantees | At least once | At least once | At least once |
| Ordering Guarantees | Within a shard | Within a partition | None |
| Throughput | One shard can support 1 MB/s input, 2 MB/s output or 1000 records per second. | Scaled in throughput units, each supporting 1 MB/s ingress, 2 MB/s egress or 84 GB storage. Standard tier allows 20 throughput units. | Default is 100 MB/s in, 200 MB/s out, but maximum is quoted as unlimited. |
| Persistence Period | 1-7 Days | 1-7 Days | 7 Days |
| Partitioning | Yes (using Shards) | Yes | Yes |
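The throughput limits in the matrix translate directly into a sizing calculation. The sketch below is a rough estimate under stated assumptions, not a vendor sizing tool. It takes the per-shard and per-throughput-unit limits from the matrix and returns the smallest count that satisfies all of them. The function names and the sample workload are hypothetical.

```python
import math

# Per-unit limits taken from the feature matrix.
KINESIS_IN_MB_S = 1.0         # ingress per shard
KINESIS_OUT_MB_S = 2.0        # egress per shard
KINESIS_RECORDS_S = 1000.0    # records per second per shard
EVENT_HUBS_IN_MB_S = 1.0      # ingress per throughput unit
EVENT_HUBS_OUT_MB_S = 2.0     # egress per throughput unit
EVENT_HUBS_STANDARD_MAX = 20  # throughput units allowed on the Standard tier


def kinesis_shards(ingress_mb_s: float, egress_mb_s: float, records_s: float) -> int:
    """Smallest shard count that satisfies all three per-shard limits."""
    return max(
        1,
        math.ceil(ingress_mb_s / KINESIS_IN_MB_S),
        math.ceil(egress_mb_s / KINESIS_OUT_MB_S),
        math.ceil(records_s / KINESIS_RECORDS_S),
    )


def event_hubs_units(ingress_mb_s: float, egress_mb_s: float) -> int:
    """Smallest throughput-unit count that satisfies ingress and egress."""
    return max(
        1,
        math.ceil(ingress_mb_s / EVENT_HUBS_IN_MB_S),
        math.ceil(egress_mb_s / EVENT_HUBS_OUT_MB_S),
    )


def fits_standard_tier(units: int) -> bool:
    """True if the unit count is within the Standard tier allowance."""
    return units <= EVENT_HUBS_STANDARD_MAX


if __name__ == "__main__":
    # Hypothetical workload: 5 MB/s in, 12 MB/s out, 3500 records/s.
    print("Kinesis shards:", kinesis_shards(5, 12, 3500))
    units = event_hubs_units(5, 12)
    print("Event Hubs units:", units, "fits Standard:", fits_standard_tier(units))
```

Pub/Sub is left out because the matrix quotes a default quota rather than a per-unit limit, so there is nothing to divide by.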
Streaming Workloads Continued...
Kinesis Data Streams vs Event Hubs vs Pub/Sub
Feature Matrix: Key differentiators
| Feature | Kinesis Data Streams | Event Hubs | Pub/Sub |
|---|---|---|---|
| DR - Cross Regional | Across 3 zones only | Yes (Standard Tier) | Yes (Automatic) |
| Max Size for each dataset | 1 MB | 1 MB | 10 MB |
| Push Method Support | Yes | Yes | Yes |
| Pull Method Support | Yes | No | Yes |
| Scale | Regional Only | Multi-Regional (Standard) | Global |
| Latency | Milliseconds to Seconds | Milliseconds to Seconds | Milliseconds to Seconds |
Which data streaming service is better?
Cost vs Performance Comparison
● Performance can be evaluated in terms of latency, a time-based measure of system performance. The two important latency metrics are:
○ The time it takes to acknowledge a published message.
○ The time it takes to deliver a published message to a subscriber.
● Pricing is harder to compare:
○ Amazon Kinesis pricing varies by region; it is $0.015/hr per shard and $0.014 per million PUT payload units.
○ Event Hubs Standard costs $0.03/hr per throughput unit and $0.028 per million events.
○ Cloud Pub/Sub is priced at $60/TiB/month for data ingested after the first 10 GB.
“Latency measurement for all carefully designed systems was up to the mark.”
~
“Cost of Kinesis is significantly lower than Pub/Sub and Event Hubs.”
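Because the three services bill in different units, a like-for-like comparison needs a fixed workload. The sketch below applies the list prices quoted above to one hypothetical month. It assumes a 30-day month, one PUT payload unit per record, one billable event per message, a single shard or throughput unit, and that the free "first 10 GB" can be treated as 10 GiB. All function names are illustrative, and real pricing has further dimensions this ignores.

```python
HOURS_PER_MONTH = 720  # assumption: 30-day month


def kinesis_cost(shards: int, put_units_millions: float) -> float:
    """Monthly USD cost: $0.015/hr per shard + $0.014 per million PUT payload units."""
    return shards * 0.015 * HOURS_PER_MONTH + put_units_millions * 0.014


def event_hubs_cost(units: int, events_millions: float) -> float:
    """Monthly USD cost: $0.03/hr per throughput unit + $0.028 per million events."""
    return units * 0.03 * HOURS_PER_MONTH + events_millions * 0.028


def pubsub_cost(ingested_gib: float) -> float:
    """Monthly USD cost: $60 per TiB ingested after the first 10 GiB (assumed unit)."""
    billable_gib = max(ingested_gib - 10.0, 0.0)
    return billable_gib / 1024.0 * 60.0


if __name__ == "__main__":
    # Hypothetical workload: 1000 messages/s of 1 KiB each for 30 days.
    messages_millions = 1000 * 86_400 * 30 / 1_000_000  # 2592 million messages
    ingested_gib = messages_millions * 1_000_000 / (1024 * 1024)  # KiB -> GiB
    print(f"Kinesis:    ${kinesis_cost(1, messages_millions):.2f}")
    print(f"Event Hubs: ${event_hubs_cost(1, messages_millions):.2f}")
    print(f"Pub/Sub:    ${pubsub_cost(ingested_gib):.2f}")
```

Under these assumptions Kinesis comes out lowest, which matches the quoted conclusion, but the ranking can shift with message size and volume.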
Data Processing: Streaming workloads
Kinesis Data Analytics vs Azure Stream Analytics vs GCP Dataflow
Feature Matrix: Key differentiators
| Feature | Kinesis Data Analytics | Azure Stream Analytics | GCP Dataflow |
|---|---|---|---|
| Batch Processing | No | No | Yes |
| Stream Processing | Yes | Yes | Yes |
| Serverless | Yes | Yes | Yes |
| Latency | Sub-Second | Low | Low |
| Native ML Integration | Yes | Yes | Yes |
Data Processing: Managed Hadoop Clusters
EMR vs Dataproc vs HDInsight
Feature Matrix: Key differentiators
| Feature | EMR | Dataproc | HDInsight |
|---|---|---|---|
| Open source ecosystem | Y | Y | Y |
| Native Integration with other Vendor Services | Y | Y | Y |
| Multiple Application Support | Y | Y | Y |
| Elasticity and Flexibility | Y | Y | Y |
Cost and Performance Comparisons
EMR vs Dataproc vs HDInsight
● With 4 vCPUs and around 15 GB of RAM you pay:
○ $0.240 per hour with Dataproc.
○ $0.336 per hour with EMR.
○ $0.338 per hour with HDInsight.
● We set up a trial to compare the performance and cost of a typical Spark workload. The trial used a public dataset and clusters with one master and five worker instances of:
○ AWS m3.xlarge
○ Azure’s A4m v2 general-purpose instance
○ GCP’s n1-standard-4
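The hourly rates above can be scaled to the trial's cluster shape with simple arithmetic. The sketch below assumes that each of the six nodes (one master plus five workers) is billed at the quoted per-hour rate and that the job runs for three hours. Both are assumptions for illustration, as is the helper name `cluster_cost`. The trial's actual run times are not given here.

```python
# Hourly rates in USD for a 4 vCPU / ~15 GB instance, as quoted above.
HOURLY_RATE = {
    "Dataproc": 0.240,
    "EMR": 0.336,
    "HDInsight": 0.338,
}

NODES = 1 + 5  # one master and five workers, as in the trial


def cluster_cost(rate_per_node_hour: float, hours: float, nodes: int = NODES) -> float:
    """USD cost of running `nodes` instances for `hours` at the given rate."""
    return rate_per_node_hour * nodes * hours


def saving_vs(baseline: float, candidate: float) -> float:
    """Fractional saving of `candidate` relative to `baseline`."""
    return (baseline - candidate) / baseline


if __name__ == "__main__":
    job_hours = 3  # hypothetical job duration
    for name, rate in HOURLY_RATE.items():
        print(f"{name}: ${cluster_cost(rate, job_hours):.3f} for {job_hours} h")
    pct = saving_vs(HOURLY_RATE["EMR"], HOURLY_RATE["Dataproc"]) * 100
    print(f"Dataproc list price is {pct:.1f}% below EMR")
```

This compares list prices only. A faster cluster that finishes sooner can still be cheaper per job at a higher hourly rate.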
Why Dataproc is a Game Changer?
EMR vs Dataproc vs HDInsight
● Jobs-first Hadoop + Spark, not a clusters-first approach.
● Cheaper: per-minute billing, custom VMs, preemptible VMs, sustained use discounts, and cheaper VM list prices.
● Faster: rapid cluster boot-up times, best-in-class object storage, best-in-class networking, and RAM-like performance characteristics of Local SSDs.
● Easier: lots of capacity, less fragmented instance type offerings, VPC-by-default, and images that closely follow Apache releases.
Data Warehouse: OLTP vs OLAP
What is the difference?
Data Warehouse: Row vs Column store
What is the difference?
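Row Store
| Country | Product | Sales |
|---|---|---|
| US | Alpha | 3000 |
| US | Beta | 1250 |
| JP | Alpha | 700 |
| UK | Alpha | 450 |
Column Store
| Country | Product | Sales |
|---|---|---|
| US | Alpha | 3000 |
| US | Beta | 1250 |
| JP | Alpha | 700 |
| UK | Alpha | 450 |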
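The two layouts hold the same data and differ only in how it is grouped. A row store keeps each record together, while a column store keeps each column together. The sketch below is a toy in-memory model using the sample table. It says nothing about how any particular warehouse lays out data on disk, and the names `to_column_store`, `total_sales_row_store`, and `total_sales_column_store` are illustrative.

```python
COLUMNS = ("Country", "Product", "Sales")

# Row store: one tuple per record, all fields of a record kept together.
ROW_STORE = [
    ("US", "Alpha", 3000),
    ("US", "Beta", 1250),
    ("JP", "Alpha", 700),
    ("UK", "Alpha", 450),
]


def to_column_store(rows, names):
    """Pivot a list of row tuples into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


# Column store: one list per column, all values of a column kept together.
COLUMN_STORE = to_column_store(ROW_STORE, COLUMNS)


def total_sales_row_store(rows):
    """Sum Sales by visiting every full record, even though one field is needed."""
    return sum(row[2] for row in rows)


def total_sales_column_store(columns):
    """Sum Sales by reading only the Sales column."""
    return sum(columns["Sales"])


if __name__ == "__main__":
    print(COLUMN_STORE)
    print(total_sales_row_store(ROW_STORE), total_sales_column_store(COLUMN_STORE))
```

An aggregate over one column touches far less data in the columnar layout, which is why analytical (OLAP) systems tend to favour it, while fetching or updating a whole record is simpler in the row layout.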
Data Analysis
Redshift vs Snowflake vs BigQuery
Feature Matrix: Key differentiators
| Feature | Redshift | Snowflake | BigQuery |
|---|---|---|---|
| Elasticity | Hours | Minutes | Query |
| Availability | Backup | Distributed System | Distributed System |
| JSON Support | No Arrays | Native | User-Defined Functions |
| Tune where clause? | Sortkey | Partition By | Partitioning |
| Tune Joins? | Distkey | No | No |
| | metal-most “tuneable system” | hybrid system | 100% shared distributed system |
Data Warehousing
Costing: Redshift vs Snowflake vs BigQuery
• Redshift offers 3 pricing models:
○ On-Demand Pricing: no upfront commitments or costs; you pay an hourly rate that depends on the type and number of nodes in your cluster.
○ Spectrum Pricing: you pay only for the bytes scanned when querying data in Amazon S3.
○ Reserved Instance Pricing: save up to 75% over on-demand rates.
• Snowflake bills per second, with a minimum of 1 minute, so you can save money by configuring your cluster to turn off during periods of inactivity.
• BigQuery bills per query, so you pay only for what you use.
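The per-second rule with a one-minute minimum is easy to misread, so here is a minimal sketch of it. The hourly rate is a made-up placeholder rather than a Snowflake price, and the function names are hypothetical. Only the billing rule itself comes from the bullet above.

```python
MINIMUM_BILLED_SECONDS = 60          # one-minute minimum per run
HYPOTHETICAL_RATE_PER_HOUR = 2.00    # placeholder rate in USD, not a real price


def billable_seconds(run_seconds: int) -> int:
    """Seconds charged for one continuous run of a cluster."""
    if run_seconds <= 0:
        return 0  # a cluster that never ran is not billed
    return max(run_seconds, MINIMUM_BILLED_SECONDS)


def run_cost(run_seconds: int, rate_per_hour: float = HYPOTHETICAL_RATE_PER_HOUR) -> float:
    """USD cost of one run at the given hourly rate."""
    return billable_seconds(run_seconds) / 3600.0 * rate_per_hour


if __name__ == "__main__":
    for seconds in (10, 60, 90, 1800):
        print(f"{seconds:>5} s running -> {billable_seconds(seconds):>5} s billed, "
              f"${run_cost(seconds):.4f}")
```

The minimum means many very short runs cost more than their raw duration suggests, while idle time costs nothing once the cluster is suspended.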
Which data warehouse is the fastest?
Execution times for 99 TPC-DS Queries
“All warehouses had excellent average execution speed, suitable for ad-hoc, interactive querying.
However, AWS Redshift was slower primarily due to its slower query planner.”
Which data warehouse is the cheapest?
Cost of 99 TPC-DS Queries
“Overall, for unpredictable and spiky workloads BigQuery would be much cheaper than the other
warehouses but for steady and continuous workloads BigQuery is a rather expensive choice.”
Overview: Differentiating Factors
Redshift vs Snowflake vs BigQuery
Amazon Redshift | Azure Snowflake | Google BigQuery
“BigQuery is a shared-resource query service, so there is no equivalent ‘configuration’.”
~
“Redshift and Snowflake are fantastic choices for users with large, on-going data needs.”
~
“BigQuery and Snowflake are easy to use.”
~
“Snowflake has support for every kind of SQL Statement.”