All content is the property and proprietary interest of CloudZone. The removal of any proprietary notices, including attribution information, is strictly prohibited.
Big Data Month 2016 – Up Next: sessions on 14.11, 15.11, 22.11 (two sessions), 28.11 and 30.11.
Master AWS Redshift - Agenda
13:00 – 13:20 Intro to Amazon Redshift by ironSource
13:20 – 15:00 LAB I – Using Amazon Redshift
15:00 – 15:15 Break
15:15 – 17:25 LAB II – Table Layout and Schema Design with Amazon Redshift
17:25 – 17:30 Your next steps on AWS by CloudZone
Shimon Tolts, General Manager, Data Solutions
Atom Data Pipeline - Processing 200B Events with Node.js and Docker on AWS
About ironSource: Hypergrowth
● 800M people reached each month
● 4,200 apps installed every minute with the ironSource platform
● 200B registered & analyzed data events every month
[Chart: registered & analyzed data events per month, Jun 2015 – May 2016, scale 0–200B]
Our Business Challenge
We needed a way to manage this data: Collect, Process, Store
Collection
● Multi-region layer - latency-based routing
● Low latency from client to Atom servers
● High availability - AWS regions do fail!
● Storing raw data + headers upon receiving
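As a rough illustration of the last point - a minimal Node.js sketch (bucket name, route and port are assumptions, not the actual Atom code) that stores the raw payload together with the request headers in S3 the moment an event is received:

const express = require('express');
const AWS = require('aws-sdk');

const s3 = new AWS.S3();
const app = express();
app.use(express.text({ type: '*/*' }));   // keep the payload exactly as it arrived

app.post('/track', (req, res) => {
  const key = 'raw/' + Date.now() + '-' + Math.random().toString(36).slice(2);
  // store raw body + headers before any processing, so nothing is ever lost
  s3.putObject({
    Bucket: 'atom-raw-events',             // hypothetical bucket
    Key: key,
    Body: JSON.stringify({ headers: req.headers, body: req.body })
  }).promise()
    .then(() => res.sendStatus(200))
    .catch(() => res.sendStatus(500));
});

app.listen(8080);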
Data Enrichment
● Enrich data before storing in your Data Lake and/or Warehouse
○ IP to country
○ Currency conversion
○ Decrypt data
○ User-Agent parsing - OS, browser, device...
● Any custom logic you would like - fully extensible
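A toy enrichment step, assuming the geoip-lite and useragent npm packages (the real pipeline may use different libraries, and currency conversion and decryption would be further steps of the same shape):

const geoip = require('geoip-lite');      // IP -> geo lookup
const useragent = require('useragent');   // User-Agent string parser

function enrich(event) {
  const geo = geoip.lookup(event.ip);               // null if the IP is unknown
  const agent = useragent.parse(event.userAgent);
  return Object.assign({}, event, {
    country: geo ? geo.country : null,
    os: agent.os.family,
    browser: agent.family,
    device: agent.device.family
  });
}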
Data Targets
● Near real-time data insertion - 1 minute!
● Stream data to Google Storage and/or AWS S3
● Smart insertion of data into AWS Redshift
○ Set the number of parallel COPYs
○ Configure priority on tables
● BigQuery - streaming data using batch file imports (saves 20% of the cost)
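A simplified version of the Redshift insertion step, assuming the pg client and placeholder table, bucket and IAM role names - a single COPY loads a whole batch of gzipped JSON files from S3:

const { Client } = require('pg');

async function copyBatch(prefix) {
  const client = new Client({ connectionString: process.env.REDSHIFT_URL });
  await client.connect();
  try {
    await client.query(
      "COPY events " +
      "FROM 's3://atom-staging/" + prefix + "' " +
      "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy' " +   // placeholder role
      "FORMAT AS JSON 'auto' GZIP;"
    );
  } finally {
    await client.end();
  }
}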
Micro-Services Architecture
● Everything is a service
● Decoupling
● Distributed systems
● Separate lifecycle
● Communication using REST / queues / streams
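One way to picture the queue-based communication between services, using the AWS SDK and SQS (queue URL and handler are placeholders): one service publishes events, another polls and processes them, so each keeps its own lifecycle:

const AWS = require('aws-sdk');
const sqs = new AWS.SQS({ region: 'us-east-1' });
const QueueUrl = 'https://sqs.us-east-1.amazonaws.com/123456789012/atom-enrich'; // placeholder

// producer service
function publish(event) {
  return sqs.sendMessage({ QueueUrl, MessageBody: JSON.stringify(event) }).promise();
}

// consumer service - long-polls, handles each message, then deletes it
async function poll(handle) {
  const res = await sqs.receiveMessage({ QueueUrl, WaitTimeSeconds: 20 }).promise();
  for (const msg of res.Messages || []) {
    await handle(JSON.parse(msg.Body));
    await sqs.deleteMessage({ QueueUrl, ReceiptHandle: msg.ReceiptHandle }).promise();
  }
}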
Docker
● Linux containers
● Save provisioning time
● Infrastructure as code
● Dev-Test-Production - identical container
● Ship easily
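A minimal Dockerfile along these lines (illustrative only - server.js and the Node version are assumptions), with the same image moving unchanged from dev to test to production:

# build from the official Node image
FROM node:6
WORKDIR /app
# install dependencies first so they are cached between builds
COPY package.json .
RUN npm install --production
# add the service code and define how to run it
COPY . .
CMD ["node", "server.js"]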
Cloud Infrastructure
● Pay as you go (grow)
● SaaS services
● Auto Scaling groups
● DynamoDB
● RDS (*SQL)
● Redshift data warehouse
Continuous Integration
● From commit to production
● Jenkins commit hook
● Git branching model
● AWS dynamic slaves
● Unit tests
● Docker builds
● Updating the live environment
[Architecture diagram]
STARTING POINT
● Xplenty - Hadoop service - ~40 min queries
● One big cluster - 96 xlarge nodes
● No WLM configuration
● CSV copy
● No reserved nodes
● A different ETL process implemented by every department
SOLUTION:
● Using 8xl nodes if needed
● Redshift cluster per department
● “Hot and cold” clusters - SSD: fast and furious, HDD: slow but cheap
● WLM configuration
● Reserved nodes
● JSON copy
● One pipeline to rule them all - ironBeast, currently supporting over 50B events per month and inserting data into more than 10 Redshift clusters
THINGS WE LEARNED ALONG THE WAY
● https://github.com/awslabs/amazon-redshift-utils (AdminViews)
● User permissions do not apply to new tables created in a schema
● Vacuum, vacuum, vacuum
● Avoid parallel inserts (especially on 8xl nodes) - if you copy to multiple tables, it is better to implement a COPY queue (see the sketch after this list)
● STL_LOAD_ERRORS - money on the floor
● A columnar datastore does not mean you can use as many columns as you want - it is better to split into multiple tables
● Encode your columns - ‘analyze compression’
● Instances that query Redshift should use MTU 1500 - link
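A sketch of such a COPY queue (assumed connection handling, not the ironBeast implementation) - COPY statements are chained so only one runs against the cluster at a time:

const { Client } = require('pg');

let chain = Promise.resolve();

// Append a COPY statement to the queue; each one waits for the previous to finish.
function enqueueCopy(sql) {
  chain = chain.catch(() => {}).then(async () => {
    const client = new Client({ connectionString: process.env.REDSHIFT_URL });
    await client.connect();
    try {
      await client.query(sql);   // e.g. a COPY ... FROM 's3://...' statement
    } finally {
      await client.end();
    }
  });
  return chain;
}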
10 Million Free Monthly Events
Thank you!
ironsrc.com/atom
shimont@ironsrc.com @shimontolts