Getting Your Hands Dirty with Data
Julio Faerman
1981269-3
FACOM TechWeek 2014
http://jfaerman.com.br/facom14
“In Practice, the Theory Is Different...”
16 years
2000+ employees
40 million users
http://aws.amazon.com/solutions/case-studies/netflix/
http://techblog.netflix.com/2013/12/netflix-presentation-videos-from-aws.html
Amazon Web Services for 100% of Streaming
34.2% of all downstream traffic during primetime
Amazon Simple Storage Service
• Durable, scalable and fast storage (99.999999999%)
• 2+ Trillion (10^12) objects
• 1.1+ Million RPS
• Native HTTP/S
• And more: Permissions, Static Hosting, Logging, Versioning, Archival and Expiration Lifecycle, Torrent, Tags, Redundancy, Requester Pays, Encryption, Reduced Redundancy
http://aws.amazon.com/s3/
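To get a feel for what eleven nines means at this scale, the arithmetic can be sketched as below. It assumes the durability figure is read as an annual, per-object survival probability, which is how AWS describes its design target. `expected_annual_loss` is a hypothetical helper name, and the result is a statistical expectation, not a guarantee.

```python
from fractions import Fraction


def expected_annual_loss(object_count: int, durability: str) -> Fraction:
    """Expected number of objects lost per year.

    Assumption: `durability` is the annual per-object survival probability.
    It is passed as a decimal string so the arithmetic stays exact.
    """
    loss_probability = 1 - Fraction(durability)
    return object_count * loss_probability


# Figures taken from the slide: eleven nines, 2 trillion (10^12) objects.
ELEVEN_NINES = "0.99999999999"
loss = expected_annual_loss(2 * 10**12, ELEVEN_NINES)
print(loss)  # 20
```

Under the same assumption, a store of 10 million objects would expect to lose one object roughly every 10,000 years.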
“Any dataset that is worth retaining is stored on S3. This includes data from billions of streaming events from televisions, laptops, and mobile devices every hour captured by our log data pipeline, plus dimension data from Cassandra supplied by our Aegisthus pipeline.”
http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
“87% Cost Reduction per Streaming Start.”
http://youtu.be/XBgkZxAljbs
“In terms of scale, we have a 10 petabyte data warehouse on S3.”
http://techblog.netflix.com/2014/10/using-presto-in-our-big-data-platform.html
Structured, Relational, On-Line: GB-TB-PB
Semi-structured, Map Reduce, Batch: TB-PB-EB
Once upon a time…
Today
Structured
On-Line
GB
TB
PB
EB
Semi-structured
Unstructured
Distributed Cache
In-Memory Data Grid
Map Reduce
ETL (Extract-Transform-Load)
Graph Database
Document Database
Columnar Database
Batch
Real Time
Machine Learning
Relational Database
http://nathanmarz.com/
Data Structure Server
Stream Processing
Rule Engine
NoSQL
Amazon Elastic MapReduce
• Distributed processing with Apache Hadoop
• Near linear scalability
• Resizable and disposable Clusters
• Apache Hadoop ecosystem: Hive, Pig, Impala, Spark, ...
• Instant automatic provisioning
• Simplified Administration
• 5.5M+ Clusters
http://aws.amazon.com/elasticmapreduce/
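The MapReduce model that Hadoop distributes across a cluster can be sketched in a single process. This is not the Hadoop API; the function names are hypothetical, and the three phases run sequentially here, whereas a real cluster would run map and reduce tasks in parallel on different machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values of one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["big data on s3", "data on emr"])
print(counts["data"], counts["emr"])  # 2 1
```

Because each map call sees only one line and each reduce call sees only one key, both phases can be spread over as many workers as there are lines or keys, which is where the near linear scalability comes from.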
http://aws.amazon.com/solutions/case-studies/pinterest/
50K -> 17M users in 9 months
12- employees
48M users
8 billion objects
400+ TB of data
April 2013:
• 400+ Web Engines
• 400+ API Engines
• 70x2+ MySQL DBs
• 100+ Redis Instances
• 230+ Memcache Instances
• 10 Redis Task Managers
• 500 Redis Task Processors
• 80 Sharded Solr
• 20 HBase
• 12 Kafka + Azkaban
• 8 ZooKeeper Instances
• 12 Varnish
http://www.infoq.com/presentations/scaling-pinterest
Amazon Relational Database Service
• MySQL, PostgreSQL, Oracle or SQL Server
• Highly Available (Multi-AZ)
• Read-Replicas
• Automated Backup, Patching and Scaling
http://aws.amazon.com/rds/
Amazon ElastiCache
• Memcached and Redis
• Replication
• Backup and Restore
• Managed patching, failure detection and recovery
• Elastic
• Reliable
http://aws.amazon.com/elasticache/
Amazon Redshift
• Petabyte Scale Data Warehousing
• Massively parallel On-Line Analytical Processing (OLAP)
• Resizable without downtime
• Managed provisioning and administration
• Compatible with PostgreSQL
http://aws.amazon.com/redshift/
Amazon Redshift Architecture
Leader Node
• SQL endpoint
• Stores metadata
• Coordinates query execution
Compute Nodes
• Local, columnar storage
• Execute queries in parallel
• Load, backup, restore via Amazon S3; load from Amazon DynamoDB or SSH
Two hardware platforms
• Optimized for data processing
• DW1: HDD; scale from 2TB to 1.6PB
• DW2: SSD; scale from 160GB to 256TB
[Diagram: SQL Clients/BI Tools connect via JDBC/ODBC to the Leader Node; the Leader Node and the Compute Nodes are linked by 10 GigE (HPC); node boxes are labeled 128GB RAM, 16TB disk, 16 cores; Ingestion, Backup and Restore go through Amazon S3 / DynamoDB / SSH.]
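The division of labor between leader and compute nodes can be illustrated with a small scatter-gather sketch. It is only an analogy: threads stand in for compute nodes, the round-robin row distribution is an assumption, and all function names are hypothetical rather than anything Redshift exposes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def split_into_slices(rows: Sequence[int], node_count: int) -> List[Sequence[int]]:
    """Distribute rows round-robin across the compute nodes (assumed policy)."""
    return [rows[i::node_count] for i in range(node_count)]


def compute_node_sum(node_rows: Sequence[int]) -> int:
    """Work done on one compute node: a partial aggregate over local rows."""
    return sum(node_rows)


def leader_sum(rows: Sequence[int], node_count: int = 3) -> int:
    """Leader: fan the query out to the nodes, then combine the partials."""
    slices = split_into_slices(rows, node_count)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(compute_node_sum, slices))
    return sum(partials)


sales = list(range(1, 101))
print(leader_sum(sales))  # 5050
```

The leader never touches individual rows; it only merges one partial result per node, which is why adding compute nodes can speed up aggregate queries.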
ETL from EMR/Hive to Amazon Redshift through Amazon S3
EMR -> S3 -> Redshift
Extract & Transform (EMR -> S3), Load (S3 -> Redshift)
Unstructured, Unclean -> Structured, Clean -> Columnar, Compressed
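The load step of this flow is usually a Redshift `COPY` from an S3 prefix. The sketch below only builds the SQL text; the table name, bucket, and role ARN are hypothetical placeholders, and the tab-delimited gzip format is an assumption about what the EMR job wrote. Inputs are trusted here, so this is not safe for user-supplied values.

```python
def build_copy_statement(table: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Build a Redshift COPY statement for tab-delimited, gzip-compressed files.

    Assumption: the transform step wrote its output under `s3_prefix`
    as gzip-compressed, tab-separated text.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "DELIMITER '\\t'\n"
        "GZIP;"
    )


# Hypothetical names, for illustration only.
statement = build_copy_statement(
    table="events_clean",
    s3_prefix="s3://example-bucket/emr-output/2014-10-01/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftLoad",
)
print(statement)
```

The resulting string would be sent over the same JDBC/ODBC connection that SQL clients use; Redshift then pulls the files from S3 in parallel on the compute nodes.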
Amazon Redshift at Pinterest Today
• 16 node 256TB cluster
• 2TB data per day
• 100+ regular users
• 500+ queries per day: 75% <= 35 seconds, 90% <= 2 minutes
• Operational effort <= 5 hours/week
Shazam @ Super Bowl
http://www.allthingsdistributed.com/2012/06/amazon-dynamodb-growth.html
Relational Index vs. Key-Value
B-Tree vs. Distributed Hash Table
O(log n) vs. O(1)
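The complexity contrast can be made concrete with a small sketch. A sorted list searched with binary search stands in for the B-Tree index (both are O(log n) per lookup), and an in-process `dict` stands in for the distributed hash table (average-case O(1)). Both stand-ins are simplifications, and the helper names are hypothetical.

```python
from bisect import bisect_left
from typing import Dict, List, Optional, Tuple


def build_sorted_index(records: Dict[int, str]) -> Tuple[List[int], List[str]]:
    """Sorted keys plus aligned values: a stand-in for an ordered index."""
    keys = sorted(records)
    return keys, [records[k] for k in keys]


def sorted_index_get(keys: List[int], values: List[str], key: int) -> Optional[str]:
    """O(log n): binary search over the sorted keys."""
    pos = bisect_left(keys, key)
    if pos < len(keys) and keys[pos] == key:
        return values[pos]
    return None


def hash_get(table: Dict[int, str], key: int) -> Optional[str]:
    """Average-case O(1): a single hash lookup."""
    return table.get(key)


records = {user_id: f"user-{user_id}" for user_id in range(0, 1000, 2)}
keys, values = build_sorted_index(records)
print(sorted_index_get(keys, values, 42), hash_get(records, 42))  # user-42 user-42
print(sorted_index_get(keys, values, 43), hash_get(records, 43))  # None None
```

The trade-off is that the sorted structure also supports range scans and ordered traversal, while the hash table answers only exact-key lookups.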
Amazon DynamoDB
• NoSQL Database
• Provisioned Throughput
• Seamless Scalability
• Zero Admin
• Single digit millisecond latency
http://aws.amazon.com/dynamodb/
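Provisioned throughput means capacity is declared up front in read and write units. A rough sizing sketch follows, assuming the unit sizes AWS documents (one read unit covers a strongly consistent read of up to 4 KB per second, one write unit covers a write of up to 1 KB per second, and eventually consistent reads cost half). The workload numbers and function names are hypothetical.

```python
import math

READ_UNIT_BYTES = 4 * 1024   # assumed size covered by one read capacity unit
WRITE_UNIT_BYTES = 1024      # assumed size covered by one write capacity unit


def read_capacity_units(item_bytes: int, reads_per_second: int,
                        strongly_consistent: bool = True) -> int:
    """Read units needed for a steady rate of same-sized item reads."""
    units_per_read = math.ceil(item_bytes / READ_UNIT_BYTES)
    total = units_per_read * reads_per_second
    # Eventually consistent reads are assumed to cost half a unit each.
    return total if strongly_consistent else math.ceil(total / 2)


def write_capacity_units(item_bytes: int, writes_per_second: int) -> int:
    """Write units needed for a steady rate of same-sized item writes."""
    return math.ceil(item_bytes / WRITE_UNIT_BYTES) * writes_per_second


print(read_capacity_units(6 * 1024, 100))   # 200
print(write_capacity_units(1536, 50))       # 100
```

Item size is rounded up per request, so a 6 KB item costs two read units even though it only uses half of the second one.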
~5TB in the database
1 billion requests/month
67,000 requests/minute
34 million recommendations/day
4 million products
27 million users
"We can't afford the luxury of throwing information away"
Availability Zone
Tomcat 6
MySQL
The Beginning
Stage 1
Availability Zone
Tomcat 6, EhCache
MySQL
Backup
Stage 2
Availability Zone
Tomcat 6, EhCache, New Relic
MySQL Primary
Availability Zone
MySQL Secondary
EBS RAID0 EBS RAID0
Replication
Availability Zone Availability Zone
Stage 3
Availability Zone
Tomcat 6 + EhCache
Nginx HAProxy
Availability Zone Availability Zone
MySQL 1
EBS RAID0
MySQL 2
EBS RAID0
Replication
Memcached
Elastic Load Balancer
Stage 4
Auto Scaling group
Nginx, HAProxy, Jetty, EhCache
Availability Zone
Memcached
Availability Zone Availability Zone
region region
Architecture Evolution
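Each stage above adds a cache layer (EhCache, then Memcached) in front of MySQL. The slides do not say how the caches are used, so the sketch below shows one common pattern, cache-aside, as an assumption. A plain `dict` stands in for the cache and a lookup function stands in for the database; all names are hypothetical.

```python
from typing import Callable, Dict, Optional


class CacheAside:
    """Read-through-on-miss cache in front of a slower data source."""

    def __init__(self, load_from_db: Callable[[int], Optional[str]]) -> None:
        self._cache: Dict[int, str] = {}   # stand-in for EhCache/Memcached
        self._load_from_db = load_from_db  # stand-in for a MySQL query
        self.db_reads = 0                  # counts how often the source is hit

    def get(self, key: int) -> Optional[str]:
        if key in self._cache:
            return self._cache[key]        # cache hit: no database access
        self.db_reads += 1
        value = self._load_from_db(key)
        if value is not None:
            self._cache[key] = value       # populate the cache on a miss
        return value

    def invalidate(self, key: int) -> None:
        """Drop a cached entry, e.g. after the underlying row is updated."""
        self._cache.pop(key, None)


products = {1: "book", 2: "phone"}
store = CacheAside(products.get)
store.get(1)
store.get(1)
print(store.db_reads)  # 1
```

Repeated reads of the same key reach the database only once, which is the effect that lets the application tier scale out while the database tier stays small.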
Amazon Kinesis
Amazon Data Pipeline
Coming up in the next episodes…
http://aws.amazon.com/datapipeline/
http://aws.amazon.com/kinesis/
Where to begin?
http://aws.amazon.com/training/intro_series/
http://aws.amazon.com/free/
http://aws.amazon.com/training/
https://www.youtube.com/user/AmazonWebServices
http://aws.amazon.com/podcasts/aws-podcast/
http://aws.amazon.com/blogs/aws/
http://awshub.com.br