Our blog post: http://www.flydata.com/blog/posts/behind-amazon-redshift-is-10x-faster-and-cheaper-than-hadoop-hive-slides
FlyData: Amazon Redshift BENCHMARK Series 01
Amazon Redshift is 10x faster and cheaper than Hadoop + Hive
Comparisons of speed and cost efficiency
www.flydata.com
Amazon Redshift took 155 seconds to run our
queries for 1.2TB data
Hadoop + Hive took 1491 seconds to run our
queries for 1.2TB data
Amazon Redshift was 10X faster
Amazon Redshift cost $20 to run a query every 30
minutes
Hadoop + Hive cost $210 to run a query every 30
minutes
Amazon Redshift was 10X more cost-effective
Amazon Redshift is a new data warehouse for
big data on the cloud. Before Redshift, users
had to turn to Hadoop for querying over TBs
of data.
We have run benchmarks to compare Redshift
to Hadoop (Amazon Elastic MapReduce), both
on AWS environments, specifically to show
differences for advertisement agencies.
• Data size between 100GB and ~50TB
• Frequent queries (more than once an hour)
• Short turnaround time required
Prerequisite - Data
TSV files, gzip compressed
Imp_log
• Size: 1) 300GB / 300M records, 2) 1.2TB / 1.2B records
• Columns:
– date datetime
– publisher_id integer
– ad_campaign_id integer
– bid_price real
– country varchar(30)
– attr1-4 varchar(255)
click_log
• Size: 1) 1.4GB / 1.5M records, 2) 5.6GB / 6M records
• Columns:
– date datetime
– publisher_id integer
– ad_campaign_id integer
– country varchar(30)
– attr1-4 varchar(255)
Log data sets: 1) for 1 month, 2) for 4 months
ad_campaign: 100MB / 100k records
publisher: 10MB / 10k records
advertiser: 10MB / 10k records
We use 5 tables to run a query that joins them and creates a report.
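As a rough illustration of the two log schemas listed above, the sketch below creates them in an in-memory SQLite database. This is not the setup used in the benchmark: SQLite stands in for the real engines, the table names `imp_logs` and `click_logs` are borrowed from the sample query in the Appendix, and expanding `attr1-4` into four separate columns is an assumption.

```python
import sqlite3

# Assumption: "attr1-4 varchar(255)" means four columns, attr1 .. attr4.
ATTRS = ", ".join(f"attr{i} varchar(255)" for i in range(1, 5))

# Column lists copied from the slide; table names follow the sample query.
DDL = {
    "imp_logs": "date datetime, publisher_id integer, ad_campaign_id integer, "
                f"bid_price real, country varchar(30), {ATTRS}",
    "click_logs": "date datetime, publisher_id integer, ad_campaign_id integer, "
                  f"country varchar(30), {ATTRS}",
}


def create_log_tables(conn):
    """Create both log tables on the given SQLite connection."""
    for name, cols in DDL.items():
        conn.execute(f"CREATE TABLE {name} ({cols})")


conn = sqlite3.connect(":memory:")
create_log_tables(conn)

# PRAGMA table_info returns one row per column; index 1 is the column name.
imp_columns = [row[1] for row in conn.execute("PRAGMA table_info(imp_logs)")]
print(imp_columns)
```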
1. Query Speed
• Redshift takes 155 seconds to complete our query for 1.2TB
• Hadoop takes 1491 seconds to complete our query for 1.2TB
• Redshift is about 10 times faster than Hadoop for this query
Here, we are comparing Hadoop and Redshift servers of the same cost. (Hadoop: c1.xlarge vs Redshift: dw.hs1.xlarge).
[Chart: query processing time. 300GB: Redshift 38sec, Hadoop 672sec. 1.2TB: Redshift 155sec, Hadoop 1491sec]
* The query used can be referenced in our Appendix
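The "about 10 times faster" figure can be checked with a few lines of arithmetic. The sketch below uses the processing times quoted in this deck. Pairing the 672-second figure with the fastest Hadoop run on 300GB (11m 12s) is an assumption based on the result tables.

```python
# Processing times in seconds, as reported in this deck.
# Assumption: 672s is the fastest Hadoop run on the 300GB data set.
TIMES = {
    "300GB": {"redshift": 38, "hadoop": 672},
    "1.2TB": {"redshift": 155, "hadoop": 1491},
}


def speedup(hadoop_s, redshift_s):
    """How many times faster Redshift finished than Hadoop."""
    return hadoop_s / redshift_s


ratios = {
    size: round(speedup(t["hadoop"], t["redshift"]), 1)
    for size, t in TIMES.items()
}
print(ratios)
```

For the 1.2TB data set the ratio comes out a little under 10, which matches the "about 10 times" wording.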
2. Total Cost
• Redshift costs $20 per month to run queries every 30 minutes
• Hadoop costs $210 per month to run queries every 30 minutes
• Redshift is about 10 times cheaper than Hadoop to run this job
Here, we are comparing Hadoop and Redshift servers running the same query for the same duration of time.
* The query used can be referenced in our Appendix
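The cost comparison reduces to a simple ratio. The sketch below takes the $20 and $210 figures exactly as quoted on the slide and also works out how many runs a 30-minute schedule implies per day. The variable names are illustrative only.

```python
RUN_INTERVAL_MIN = 30   # one query run every 30 minutes, as stated on the slide
REDSHIFT_COST = 20      # dollars, as quoted
HADOOP_COST = 210       # dollars, as quoted

# Number of query runs in a 24-hour day at this interval.
runs_per_day = 24 * 60 // RUN_INTERVAL_MIN

# How many times more the Hadoop setup costs for the same schedule.
cost_ratio = HADOOP_COST / REDSHIFT_COST

print(runs_per_day, cost_ratio)
```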
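Redshift Query Result
Data Size | Instance Type | Number of Instances | Trial | Processing Time | Average | Server Cost Per Day
300GB | dw.hs1.xlarge | 1 | 1 | 58s | 38s | $20.40
300GB | dw.hs1.xlarge | 1 | 2 | 43s | |
300GB | dw.hs1.xlarge | 1 | 3 | 31s | |
300GB | dw.hs1.xlarge | 1 | 4 | 30s | |
300GB | dw.hs1.xlarge | 1 | 5 | 30s | |
1.2TB | dw.hs1.xlarge | 1 | 1 | 164s | 155s | $20.40
1.2TB | dw.hs1.xlarge | 1 | 2 | 149s | |
1.2TB | dw.hs1.xlarge | 1 | 3 | 158s | |
1.2TB | dw.hs1.xlarge | 1 | 4 | 156s | |
1.2TB | dw.hs1.xlarge | 1 | 5 | 150s | |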
* The query used can be referenced in our Appendix
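The reported averages follow directly from the five trials. A minimal check, assuming the average is the arithmetic mean rounded to the nearest second:

```python
from statistics import mean

# Per-trial processing times in seconds, copied from the Redshift results.
TRIALS = {
    "300GB": [58, 43, 31, 30, 30],
    "1.2TB": [164, 149, 158, 156, 150],
}

# Assumption: the slide reports the mean rounded to whole seconds.
averages = {size: round(mean(times)) for size, times in TRIALS.items()}
print(averages)
```

The first 300GB trial is noticeably slower than the later ones, so the 38-second average is pulled up by the early runs.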
Hadoop Query Result
Data Size | Instance Type | Instance Number | Processing Time | Server Cost Per Day
300GB | c1.xlarge | 1 | 1h 23m 2s | $0.80
300GB | c1.medium | 10 | 37m 48s | $0.89
300GB | c1.xlarge | 10 | 11m 12s | $1.06
1.2TB | m1.xlarge | 1 | 6h 43m 24s | $3.22
1.2TB | c1.medium | 4 | 5h 14m 0s | $3.04
1.2TB | c1.xlarge | 10 | 37m 7s | $3.58
1.2TB | c1.xlarge | 20 | 24m 51s | $4.64
* The query used can be referenced in our Appendix
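The Hadoop times are given as hours, minutes and seconds, while the Redshift times are in seconds. A small helper, sketched here with a hypothetical name `to_seconds`, puts both on the same scale:

```python
import re

UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def to_seconds(text):
    """Convert a duration such as '1h 23m 2s' into a number of seconds."""
    parts = re.findall(r"(\d+)\s*([hms])", text)
    return sum(int(number) * UNIT_SECONDS[unit] for number, unit in parts)


# Fastest Hadoop runs for each data size, from the results above.
print(to_seconds("11m 12s"))   # 300GB, 10 x c1.xlarge
print(to_seconds("24m 51s"))   # 1.2TB, 20 x c1.xlarge
```

These two values are the 672 and 1491 seconds used in the speed comparison.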
Discussion
• Consider Redshift
– If your data is big (>TB) and you need to run your queries more than once an hour
– If you want to get quick results
• Consider Hadoop (EMR)
– If your data is too big (>PB)
– If your job queries are once a day, week or month
– If you already have invested in Hadoop technology specialists
APPENDIX – Sample Query
select
  ac.ad_campaign_id as ad_campaign_id,
  adv.advertiser_id as advertiser_id,
  cs.spending as spending,
  ims.imp_total as imp_total,
  cs.click_total as click_total,
  click_total/imp_total as CTR,
  spending/click_total as CPC,
  spending/(imp_total/1000) as CPM
from ad_campaigns ac
join advertisers adv on (ac.advertiser_id = adv.advertiser_id)
join (
  select il.ad_campaign_id, count(*) as imp_total
  from imp_logs il
  group by il.ad_campaign_id
) ims on (ims.ad_campaign_id = ac.ad_campaign_id)
join (
  select cl.ad_campaign_id, sum(cl.bid_price) as spending, count(*) as click_total
  from click_logs cl
  group by cl.ad_campaign_id
) cs on (cs.ad_campaign_id = ac.ad_campaign_id);
The query generates a basic report on ad campaign performance: impression and click counts, advertiser spending, CTR, CPC and CPM.
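To make the three derived metrics concrete, the sketch below applies the same formulas as the query to one campaign's totals. The function name and the sample numbers are made up for illustration. Python's `/` is always floating-point division, whereas a SQL engine that applies integer division to integer operands would truncate the count-based ratios.

```python
def campaign_metrics(spending, imp_total, click_total):
    """Compute CTR, CPC and CPM with the same formulas as the sample query."""
    return {
        "CTR": click_total / imp_total,        # clicks per impression
        "CPC": spending / click_total,         # cost per click
        "CPM": spending / (imp_total / 1000),  # cost per 1000 impressions
    }


# Hypothetical campaign: 10,000 impressions, 50 clicks, 50.0 spent.
metrics = campaign_metrics(spending=50.0, imp_total=10_000, click_total=50)
print(metrics)
```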
APPENDIX - Additional Comments
• Redshift is well suited to aggregate calculations such as sum, average, max and min because it is a columnar database
• Importing large amounts of data takes a lot of time
– 17 hours for 1.2TB in our case
– Continuous importing is useful
• Redshift supports only “Separated” formats like CSV, TSV
– JSON is not supported
• Redshift supports only primitive data types
– 11 types, INT, DOUBLE, BOOLEAN, VARCHAR, DATE..
(as of Feb. 17, 2013)
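Since only separated formats could be loaded at the time, JSON logs would need converting first. The sketch below shows one way to turn JSON lines into gzip-compressed TSV using only the Python standard library. The function name and the field order are hypothetical (the fields follow the click_log column list), and this is not the conversion used in the benchmark.

```python
import csv
import gzip
import io
import json

# Hypothetical field order, following the click_log column list.
FIELDS = ["date", "publisher_id", "ad_campaign_id", "country"]


def json_lines_to_tsv_gz(json_lines, fields=FIELDS):
    """Return gzip-compressed TSV bytes built from an iterable of JSON strings."""
    buf = io.BytesIO()
    with gzip.open(buf, "wt", newline="", encoding="utf-8") as gz:
        writer = csv.writer(gz, delimiter="\t", lineterminator="\n")
        for line in json_lines:
            record = json.loads(line)
            # Missing fields become empty strings so every row has the same width.
            writer.writerow([record.get(field, "") for field in fields])
    return buf.getvalue()


sample = ['{"date": "2013-02-17 00:00:00", "publisher_id": 1, '
          '"ad_campaign_id": 2, "country": "JP"}']
tsv_gz = json_lines_to_tsv_gz(sample)
print(gzip.decompress(tsv_gz).decode("utf-8"))
```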
APPENDIX – Additional Information
• All resources for our benchmark are on our GitHub repository
– https://github.com/hapyrus/redshift-benchmark
– The dataset we use is open on S3, so you can reproduce the benchmark
About Us - FlyData
• FlyData Enterprise
– Enables continuous loading to Amazon Redshift, with real-time data loading
– Automated ETL process with multiple supported data formats
– Auto scaling, data integrity and high durability
– FlyData Sync feature allows real-time replication from RDBMS to Amazon Redshift
Contact us at: [email protected]
We are an official data integration partner of Amazon Redshift
Formerly known as Hapyrus
Check us out!-> http://flydata.com
Toll Free: 1-855-427-9787