Karan Bhatia, PhD Introducing Elastic MapReduce Big Data Solutions Practice

Page 1: Introducing Elastic MapReduce

Karan Bhatia, PhD

Introducing Elastic MapReduce

Big Data Solutions Practice

Page 2: Introducing Elastic MapReduce

Tutorials, training, and mentoring in Portuguese

Sign up now!

http://awshub.com.br

Page 3: Introducing Elastic MapReduce
Page 4: Introducing Elastic MapReduce

4 bytes x 1,000,000 households x 1 measurement/month x 10 years

480 MBytes

Page 5: Introducing Elastic MapReduce

4 bytes x 1,000,000 households x 1 measurement/min x 10 years

21 TBytes
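The two estimates can be checked with a few lines of arithmetic. This is a minimal sketch that multiplies out the factors exactly as stated on the slides, assuming 365-day years and decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes); the constant and function names are chosen here for illustration. With those factors the monthly case gives 480 MB and the per-minute case gives roughly 21 TB.

```python
# Raw data volume for the two metering scenarios on the slides.
BYTES_PER_MEASUREMENT = 4
HOUSEHOLDS = 1_000_000
YEARS = 10


def total_bytes(measurements_per_year: int) -> int:
    """Bytes stored for all households over the whole period."""
    return BYTES_PER_MEASUREMENT * HOUSEHOLDS * measurements_per_year * YEARS


monthly = total_bytes(12)                # one reading per month
per_minute = total_bytes(60 * 24 * 365)  # one reading per minute, 365-day year

print(f"monthly:    {monthly / 1e6:.0f} MB")
print(f"per minute: {per_minute / 1e12:.1f} TB")
```

Moving from monthly to per-minute readings multiplies the volume by 43,800, which is the point of the comparison.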

Page 6: Introducing Elastic MapReduce

Big Data as Business Transformation

Page 7: Introducing Elastic MapReduce

[Chart: data volume, comparing generated data with data available for analysis]

Sources:

Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011

IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares

Page 8: Introducing Elastic MapReduce

AWS Elastic MapReduce

MapReduce

HDFS

Page 9: Introducing Elastic MapReduce

Thousands of customers, 2 million+ clusters in 2012

Page 10: Introducing Elastic MapReduce

EMR Sample Use Cases

Page 11: Introducing Elastic MapReduce

Apontador and MapLink

and AWS

Supported by:

Page 12: Introducing Elastic MapReduce

• What do I know about the user?

{"BaseLogId":"RmlpbjZkWVhCM0NxckNjYjF3eFU0dGNTYnhJPQ","TrackUserId":"a18e0672-ad07-4f28-b447-fc0cba90ee17","SiteId":"apto-dv01","SessionId":"1369827720327:f52c5b","ExternalId":"1933510381","Hostname":"integra01.apontador.lan","Path":"/local/sp/sao_paulo/bares_e_casas_noturnas/QYN7825H/","Referer":null,"PageTitle":"Locais, Eventos, Endereços, Mapas - Apontador.com","IpAddress":"200.150.177.249","AgentInfo":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36","Position":"{ \"lat\": -23.5934691, \"lon\": -46.6882606, \"acc\": 36}","SearchInfo":null,"RawRequestInfo":”RawRequest”: ","CreateAt":"2013-06-24T14:39:46.7082358Z"}

• What else?

Actions, clicks, searches

HOW do we bring the best to the user?
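A record like the one shown above can be turned into a few usable fields with the standard `json` module. The sketch below works on a trimmed copy of the record: the `RawRequestInfo` field is left out because its value is not intact, and several other fields are dropped for brevity. Note that `Position` is itself a JSON document stored as a string, so it needs a second decode. The reading of `Path` as `/local/<state>/<city>/<category>/<place id>/` is an assumption made here, as are the `summarize` function and its output keys.

```python
import json

# Trimmed copy of the log record; RawRequestInfo and some other fields omitted.
raw_record = r'''{"TrackUserId":"a18e0672-ad07-4f28-b447-fc0cba90ee17",
"SessionId":"1369827720327:f52c5b",
"Path":"/local/sp/sao_paulo/bares_e_casas_noturnas/QYN7825H/",
"Position":"{ \"lat\": -23.5934691, \"lon\": -46.6882606, \"acc\": 36}",
"SearchInfo":null,
"CreateAt":"2013-06-24T14:39:46.7082358Z"}'''


def summarize(record_json: str) -> dict:
    """Pull out who the user is, where they are, and what they looked at."""
    record = json.loads(record_json)
    # Position is a JSON string nested inside the record, so decode it again.
    position = json.loads(record["Position"]) if record["Position"] else None
    # Assumed layout: /local/<state>/<city>/<category>/<place id>/
    parts = [p for p in record["Path"].split("/") if p]
    return {
        "user": record["TrackUserId"],
        "lat": position["lat"] if position else None,
        "lon": position["lon"] if position else None,
        "category": parts[3] if len(parts) > 3 else None,
        "searched": record["SearchInfo"] is not None,
    }


summary = summarize(raw_record)
print(summary)
```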

Page 13: Introducing Elastic MapReduce

• What do we receive to determine the traffic?

<Route><Category>1</Category><DateTime>0001-01-01T00:00:00</DateTime><Destination xmlns:a="http://schemas.datacontract.org/2004/07/SwissKnife.Spatial"><a:Lat>-8.150483</a:Lat><a:Lng>-35.420284</a:Lng></Destination><Origin xmlns:a="http://schemas.datacontract.org/2004/07/SwissKnife.Spatial"><a:Lat>-8.149973</a:Lat><a:Lng>-35.41825</a:Lng></Origin>

HOW do we figure out the traffic?
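The route fragment above can be read with Python's standard `xml.etree.ElementTree`. This is a small sketch under two assumptions: a closing `</Route>` tag is appended so the fragment parses (the fragment on the slide stops after `</Origin>`, and the real messages may carry more fields), and the function names are invented here. The `a:` prefix is resolved through the namespace URI declared in the fragment.

```python
import xml.etree.ElementTree as ET

# The fragment from the slide, with a closing </Route> added so it parses.
ROUTE_XML = (
    '<Route><Category>1</Category><DateTime>0001-01-01T00:00:00</DateTime>'
    '<Destination xmlns:a="http://schemas.datacontract.org/2004/07/SwissKnife.Spatial">'
    '<a:Lat>-8.150483</a:Lat><a:Lng>-35.420284</a:Lng></Destination>'
    '<Origin xmlns:a="http://schemas.datacontract.org/2004/07/SwissKnife.Spatial">'
    '<a:Lat>-8.149973</a:Lat><a:Lng>-35.41825</a:Lng></Origin>'
    '</Route>'
)

NS = {"a": "http://schemas.datacontract.org/2004/07/SwissKnife.Spatial"}


def endpoint(route, tag):
    """Return (lat, lng) for the Origin or Destination element."""
    node = route.find(tag)
    return float(node.find("a:Lat", NS).text), float(node.find("a:Lng", NS).text)


def parse_route(xml_text):
    route = ET.fromstring(xml_text)
    return {
        "category": int(route.find("Category").text),
        "origin": endpoint(route, "Origin"),
        "destination": endpoint(route, "Destination"),
    }


route = parse_route(ROUTE_XML)
print(route)
```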

Page 14: Introducing Elastic MapReduce

Bayes' theorem:

The statistical MODEL
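For reference, the standard statement of Bayes' theorem follows. The symbols are chosen here for illustration and are not taken from the slides: H stands for a hypothesis (for example, a traffic condition on a road segment) and D for the observed data (for example, the route requests received).

```latex
P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D)}
```

Here P(H) is the prior belief in the hypothesis, P(D | H) is the likelihood of the data under it, and P(H | D) is the updated (posterior) belief.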

Page 15: Introducing Elastic MapReduce

• Hive (~40 m3.large spot instances)

90%: daily utility jobs

• Streaming

10%: Solr, more complex MapReduce jobs (e.g., MCMC, fast Fourier transform)

• Infrastructure used

Elastic MapReduce: Hive (~40 m3.large spot instances)

S3: approximately 7 TB of structured data across several buckets

RDS: metadata describing how the data in S3 is organized

WHAT do we use?

Page 16: Introducing Elastic MapReduce

• Chaordic is the leading e-commerce personalization company in Brazil, counting 9 of the country's 15 largest players among its customers.

• The products Chaordic develops integrate with the largest Brazilian e-commerce sites and need infrastructure that is reliable, fast, scalable, and low-cost.

“With AWS we were able to build a single system to serve the demand of Brazil's largest e-commerce sites at a relatively low cost.”

“Building our own data center to meet our demand would be economically unviable.” - João Bosco, CTO

Page 17: Introducing Elastic MapReduce

The Challenge

• Serve tens of millions of unique users per month;

• Process Big Data;

• Respond in under 100 ms;

• Scale well at times of peak access;

• All of this at an affordable cost.

Page 18: Introducing Elastic MapReduce

On the Role of AWS and Benefits Achieved

• 4 billion requests per month;

• 300,000+ requests per minute;

• 200 million+ recommendations every day;

• Spot instances: 20% lower AWS cost.

Page 19: Introducing Elastic MapReduce

Map Reduce

Page 20: Introducing Elastic MapReduce
Page 21: Introducing Elastic MapReduce

Map Shuffle Reduce
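The three phases can be sketched in plain Python with a word count, the usual introductory example. This runs in a single process and only illustrates the data flow; it is not how Hadoop or EMR executes a job, and the function names are chosen here for illustration.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in records:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into one result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(records):
    return reduce_phase(shuffle_phase(map_phase(records)))


counts = word_count(["big data", "big clusters", "data data"])
print(counts)
```

In a real cluster the map and reduce functions run in parallel on many nodes, and the shuffle moves data between them over the network so that each key ends up on a single reducer.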

Page 22: Introducing Elastic MapReduce

AWS Elastic MapReduce

Page 23: Introducing Elastic MapReduce

Managed Hadoop analytics

Page 24: Introducing Elastic MapReduce

Input data

S3, DynamoDB, Redshift

Page 25: Introducing Elastic MapReduce

Elastic MapReduce

Code

Input data

S3, DynamoDB, Redshift

Page 26: Introducing Elastic MapReduce

Elastic MapReduce

Code

Name node

Input data

S3, DynamoDB, Redshift

Page 27: Introducing Elastic MapReduce

Elastic MapReduce

Code

Name node

Elastic cluster

S3/HDFS

Input data

S3, DynamoDB, Redshift

Page 28: Introducing Elastic MapReduce

Elastic MapReduce

Code

Name node

Elastic cluster

S3/HDFS

Queries + BI

Via JDBC, Pig, Hive

Input data

S3, DynamoDB, Redshift

Page 29: Introducing Elastic MapReduce

Elastic MapReduce

Code

Name node

Elastic cluster

S3/HDFS

Queries + BI

Via JDBC, Pig, Hive

Output

Input data

S3, DynamoDB, Redshift
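The pieces in the diagram (code, input data, a cluster, output) map onto the fields of a cluster launch request. The sketch below only builds a Python dict shaped like an EMR RunJobFlow request, as I understand that API; it does not contact AWS, and the field names should be checked against the current API reference before use. The bucket, script paths, job name, release label, and `m5.xlarge` instance type are all hypothetical placeholders, not values from the slides.

```python
def build_job_flow_request(input_uri, output_uri, mapper_uri, reducer_uri,
                           core_nodes=4):
    """Return a dict shaped like an EMR RunJobFlow request (nothing is sent)."""
    return {
        "Name": "example-streaming-job",        # hypothetical job name
        "ReleaseLabel": "emr-6.15.0",           # assumed release label
        "LogUri": "s3://example-bucket/logs/",  # hypothetical bucket
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "Market": "ON_DEMAND", "InstanceType": "m5.xlarge",
                 "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "Market": "SPOT", "InstanceType": "m5.xlarge",
                 "InstanceCount": core_nodes},
            ],
            # Shut the cluster down once the steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "streaming-step",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["hadoop-streaming",
                         "-files", f"{mapper_uri},{reducer_uri}",
                         "-mapper", mapper_uri.rsplit("/", 1)[-1],
                         "-reducer", reducer_uri.rsplit("/", 1)[-1],
                         "-input", input_uri,     # input data in S3
                         "-output", output_uri],  # output back to S3
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_job_flow_request(
    "s3://example-bucket/in/", "s3://example-bucket/out/",
    "s3://example-bucket/code/mapper.py", "s3://example-bucket/code/reducer.py",
)
print(request["Steps"][0]["HadoopJarStep"]["Args"])
```

With AWS credentials configured, a dict like this could be passed as keyword arguments to the boto3 EMR client's `run_job_flow` method; that call is deliberately left out here.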

Page 30: Introducing Elastic MapReduce

Output

Input data

S3, DynamoDB, Redshift

Page 31: Introducing Elastic MapReduce
Page 32: Introducing Elastic MapReduce
Page 33: Introducing Elastic MapReduce
Page 34: Introducing Elastic MapReduce
Page 35: Introducing Elastic MapReduce

[Chart: instance types plotted by memory (GB, 1 to 256) against EC2 Compute Units (1 to 128)]

Instance Types

Families: Standard, 2nd Gen Standard, Micro, High-Memory, High-CPU, Cluster Compute, Cluster GPU, High I/O, High-Storage, Cluster High-Mem

Type         Memory    EC2 Compute Units  Instance storage   Platform          Notes
hi1.4xlarge  60.5 GB   35                 2 x 1024 GB SSD    64-bit
cc1.4xlarge  23 GB     33.5               1690 GB            64-bit
c1.xlarge    7 GB      20                 1690 GB            64-bit
m1.small     1.7 GB    1                  160 GB             32-bit or 64-bit
m1.medium    3.75 GB   2                  410 GB             32-bit or 64-bit
m1.large     7.5 GB    4                  850 GB             64-bit            EBS Optimizable
m1.xlarge    15 GB     8                  1690 GB            64-bit            EBS Optimizable
m2.xlarge    17.1 GB   6.5                420 GB             64-bit
m2.2xlarge   34.2 GB   13                 850 GB             64-bit
m2.4xlarge   68.4 GB   26                 1690 GB            64-bit            EBS Optimizable
t1.micro     613 MB    Up to 2            EBS storage only   32-bit or 64-bit
c1.medium    1.7 GB    5                  350 GB             32-bit or 64-bit
cg1.4xlarge  22 GB     33.5               1690 GB            64-bit            2 x NVIDIA Tesla "Fermi" M2050 GPUs
cc2.8xlarge  60.5 GB   88                 3370 GB            64-bit
m3.xlarge    15 GB     13
m3.2xlarge   30 GB     26                                                      EBS Optimizable
hs1.8xlarge  117 GB    35                 24 x 2 TB          64-bit
cr1.8xlarge  244 GB    88                 2 x 120 GB SSD     64-bit
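One way to use a list like this is to filter instance types by the memory and compute a job needs. The sketch below copies the memory and EC2 Compute Unit figures for six of the types listed above; the `candidates` function and its ranking by memory per compute unit are illustrative choices made here, not a recommendation from the slides.

```python
# (memory in GB, EC2 Compute Units) for a few of the instance types listed above.
INSTANCE_SPECS = {
    "m1.large": (7.5, 4),
    "m1.xlarge": (15, 8),
    "m2.4xlarge": (68.4, 26),
    "c1.xlarge": (7, 20),
    "cc2.8xlarge": (60.5, 88),
    "cr1.8xlarge": (244, 88),
}


def candidates(min_memory_gb, min_ecu):
    """Types meeting both floors, most memory per compute unit first."""
    fits = [
        (name, mem / ecu)
        for name, (mem, ecu) in INSTANCE_SPECS.items()
        if mem >= min_memory_gb and ecu >= min_ecu
    ]
    return [name for name, _ in sorted(fits, key=lambda item: item[1], reverse=True)]


shortlist = candidates(min_memory_gb=60, min_ecu=20)
print(shortlist)
```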

Page 36: Introducing Elastic MapReduce
Page 37: Introducing Elastic MapReduce
Page 38: Introducing Elastic MapReduce
Page 39: Introducing Elastic MapReduce
Page 40: Introducing Elastic MapReduce
Page 41: Introducing Elastic MapReduce
Page 42: Introducing Elastic MapReduce

1. Elastic clusters

Page 43: Introducing Elastic MapReduce

10 hours

Page 44: Introducing Elastic MapReduce

5 hours
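One reading of the "10 hours" and "5 hours" slides is the usual elasticity argument: if a job scales linearly, doubling the cluster halves the run time while the number of node-hours, and so the bill, stays the same. The sketch below assumes perfectly linear scaling and billing proportional to node-hours; the cluster size and the price per node-hour are hypothetical numbers, not figures from the slides.

```python
def run_cost(nodes, hours, price_per_node_hour):
    """Cost of a run billed per node-hour."""
    return nodes * hours * price_per_node_hour


PRICE = 0.50     # hypothetical price per node-hour
BASE_NODES = 10  # hypothetical cluster size for the 10-hour run

slow = run_cost(BASE_NODES, 10, PRICE)     # 10 nodes for 10 hours
fast = run_cost(BASE_NODES * 2, 5, PRICE)  # 20 nodes for 5 hours

print(f"10-hour run: ${slow:.2f}, 5-hour run: ${fast:.2f}")
```

Real jobs rarely scale perfectly, so the faster run usually costs somewhat more in practice; the point is that the premium for getting results sooner can be small.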

Page 45: Introducing Elastic MapReduce

Peak capacity

Page 46: Introducing Elastic MapReduce

2. Rapid, tuned provisioning

Page 47: Introducing Elastic MapReduce

Tedious.

Page 48: Introducing Elastic MapReduce

Remove undifferentiated heavy lifting.

Page 49: Introducing Elastic MapReduce

3. Hadoop all the way down

Page 50: Introducing Elastic MapReduce

Robust ecosystem. Databases, machine learning, segmentation, clustering, analytics, metadata stores, exchange formats, and so on...

Page 51: Introducing Elastic MapReduce

4. Agility for experimentation

Page 52: Introducing Elastic MapReduce

Instance choice. Stay flexible on instance type & number.

Page 53: Introducing Elastic MapReduce

5. Cost optimizations

Page 54: Introducing Elastic MapReduce

Built for Spot. Name-your-price supercomputing.

Page 55: Introducing Elastic MapReduce

1. Elastic clusters

2. Rapid, tuned provisioning

3. Hadoop all the way down

4. Agility for experimentation

5. Cost optimizations

Page 56: Introducing Elastic MapReduce

Data, data, everywhere... Data is stored in silos.

Page 57: Introducing Elastic MapReduce

S3

DynamoDB

EMR

HBase on EMR

RDS

Redshift

On-premises

Page 58: Introducing Elastic MapReduce

Page 59: Introducing Elastic MapReduce

Page 60: Introducing Elastic MapReduce

Page 61: Introducing Elastic MapReduce

Page 62: Introducing Elastic MapReduce

AWS Data Pipeline

Announced in November, available now.

Orchestration for data-intensive workloads.

Page 63: Introducing Elastic MapReduce

AWS Data Pipeline

Data-intensive orchestration and automation

Reliable and scheduled

Easy to use, drag and drop

Execution and retry logic

Map data dependencies

Create and manage temporary compute resources
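Two of the ideas in this list, mapping data dependencies and execution with retry logic, can be illustrated with a toy scheduler. This is not the AWS Data Pipeline API or its definition format; it is a single-process sketch using Python's standard `graphlib` module (Python 3.9 or later), and the task names are invented for the example.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, depends_on, max_attempts=3):
    """Run tasks in dependency order, retrying each up to max_attempts times.

    tasks: name -> zero-argument callable
    depends_on: name -> set of task names that must finish first
    Returns the task names in the order they completed.
    """
    completed = []
    for name in TopologicalSorter(depends_on).static_order():
        for attempt in range(1, max_attempts + 1):
            try:
                tasks[name]()
                completed.append(name)
                break
            except Exception:
                if attempt == max_attempts:
                    raise  # give up after the final attempt


    return completed


attempts = {"copy_logs": 0}


def flaky_copy():
    """Fails once, then succeeds, to exercise the retry path."""
    attempts["copy_logs"] += 1
    if attempts["copy_logs"] < 2:
        raise RuntimeError("transient failure")


tasks = {
    "copy_logs": flaky_copy,
    "run_emr_job": lambda: None,
    "load_report": lambda: None,
}
depends_on = {
    "copy_logs": set(),
    "run_emr_job": {"copy_logs"},
    "load_report": {"run_emr_job"},
}

order = run_pipeline(tasks, depends_on)
print(order)
```

A managed service adds what this sketch leaves out: schedules, provisioning and teardown of the compute that runs each task, and notifications when a task fails for good.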

Page 64: Introducing Elastic MapReduce

Anatomy of a pipeline

Page 65: Introducing Elastic MapReduce

Additional checks and notifications

Page 66: Introducing Elastic MapReduce

Arbitrarily complex pipelines

Page 67: Introducing Elastic MapReduce

aws.amazon.com/datapipeline

Page 68: Introducing Elastic MapReduce

aws.amazon.com/big-data

Page 69: Introducing Elastic MapReduce

Thanks
