Pipedrive DW on AWS
November 3, 2016
Erkki Suurna
Talking points
- DW role in Pipedrive
- AWS services PDW uses
- DW infrastructure: VPC, S3, Redshift, RDS, EC2, Kinesis, ELB
- Security: KMS, IAM, S3 encryption
- Hadoop and Spark stack
- Self-developed ETL in Python (300+ tasks running daily)
PDW main goal
- Support a data-driven organisation
- SaaS KPI metrics reporting to executive level
- Product instrumentation and analysis
- Feedback to end users
- Business intelligence: data acquisition and transformation into meaningful information for business analysis
PDW data platform
- Disparate data sources are processed into one unified, trusted data level (3 TB)
- 25+ models and aggregates, 100+ tables
- Sensitive data is obfuscated
- Users are encouraged to query aggregated models instead of many canned reports
- 100k+ backups processed daily: 330 GB compressed => 3-4 TB of data
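The obfuscation step above can be sketched as a salted one-way hash. This is an illustrative approach only — the deck does not say which scheme Pipedrive actually uses — and the field names and salt here are hypothetical:

```python
import hashlib

# Hypothetical salt; the real obfuscation scheme is not described in the deck.
SALT = b"pdw-demo-salt"

def obfuscate(value: str) -> str:
    """Replace a sensitive value with a deterministic, irreversible token."""
    return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()[:16]

# Example row with an illustrative sensitive field.
row = {"deal_id": 42, "owner_email": "jane@example.com"}
row["owner_email"] = obfuscate(row["owner_email"])
```

Determinism matters here: the same input always maps to the same token, so joins and group-bys on obfuscated columns still work.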
AWS services PDW use - VPC
Virtual Private Cloud
- A logically isolated section of AWS where you define your own virtual network
- Every PDW service is in a separate subnet with separate security groups
- US East region, 5 availability zones
- Everything is closed by default
- Infrastructure is not permanent: script everything, because you will recreate it one day
AWS services PDW use - S3
Simple Storage Service
- Central point for all services
- Not a file system; more like endless object storage
- Encryption: AES-256
- Lifecycle policies and storage classes
- Distribute objects in a bucket by key prefix for better performance (100 TPS)
- Multipart upload
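The key-prefix trick above can be sketched like this: hashing the filename yields a short, evenly distributed prefix, so objects spread across S3's internal partitions instead of piling onto one. The function name and prefix length are illustrative:

```python
import hashlib

def distributed_key(filename: str, prefix_len: int = 4) -> str:
    """Prepend a short hash of the filename so keys spread evenly
    across S3 partitions (avoids hot-spotting on sequential names)."""
    prefix = hashlib.md5(filename.encode("utf-8")).hexdigest()[:prefix_len]
    return f"{prefix}/{filename}"

# Sequential backup names get scattered prefixes:
key = distributed_key("backup-2016-11-03.json.gz")
```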
AWS services PDW use - Redshift and RDS
Redshift: 3-node, 6 TB cluster
- Based on Postgres
- MPP: a great product for analytical queries
- Automatic backups
- Scales up to 300 TB in dense compute mode, up to 2 PB in dense storage mode
- No stored procedures or functions
- Loads data from S3
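Loading from S3 into Redshift is done with a COPY statement. A minimal sketch that builds one for the CSV change deltas mentioned later in the deck — the table, bucket, and IAM role names are made up for illustration:

```python
def redshift_copy(table: str, s3_path: str, iam_role: str) -> str:
    """Build a Redshift COPY statement to load gzipped CSV deltas from S3.

    All identifiers passed in are illustrative, not Pipedrive's actual ones.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "FORMAT AS CSV GZIP;"
    )

sql = redshift_copy(
    "deals_delta",
    "s3://pdw-staging/deltas/deals/",
    "arn:aws:iam::123456789012:role/RedshiftCopy",
)
```

COPY parallelizes the load across the cluster's slices, which is why S3 staging is the standard ingestion path rather than row-by-row INSERTs.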
Relational Database Service
- Multi-AZ, automatic backup and failover
- ETL and the Spark backend use Postgres as a metastore
- Snapshots of Pipedrive app backend MySQL databases
AWS services PDW use - EC2
Elastic Compute Cloud
- Script all instances
- Pre-built images (we mainly use Amazon Linux because it is managed by Amazon)
- Spot instances (instance vs vCPU)
- Auto Scaling Groups
Access Control
- KMS to encrypt and decrypt secrets
- Identity and Access Management (IAM)
- Security groups as a virtual firewall at the instance level
AWS Kinesis
Scalable buffer
- A shard is the base throughput unit: 1 MB/sec data input, 2 MB/sec data output, up to 1,000 PUT records/sec
- 24-hour retention
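The per-shard limits above translate directly into a sizing rule: take the maximum shard count required by any of the three constraints. A small sketch (the function name is illustrative):

```python
import math

def shards_needed(in_mb_per_sec: float, out_mb_per_sec: float,
                  puts_per_sec: float) -> int:
    """Size a Kinesis stream from the per-shard limits quoted on the slide:
    1 MB/s in, 2 MB/s out, 1000 PUT records/s. The binding constraint wins."""
    return max(
        math.ceil(in_mb_per_sec / 1.0),
        math.ceil(out_mb_per_sec / 2.0),
        math.ceil(puts_per_sec / 1000.0),
    )

# e.g. 5 MB/s in, 6 MB/s out, 3500 PUTs/s -> max(5, 3, 4) = 5 shards
n = shards_needed(5, 6, 3500)
```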
Hadoop and Spark stack
- HDFS, a new DW concept: the namenode is a micro instance; 2 datanodes provide 10+ TB of storage
- Spark, a lightning-fast processing cluster: ETL cluster in YARN mode, ad-hoc cluster in standalone mode
- APIs: Python, Scala, Java, SQL, GraphX, Streaming, MLlib
- Formats: source data in JSON, destination data in Parquet, change deltas in CSV for Redshift import
- Current setup: r3.8xlarge (32 cores, 244 GB memory, 2x 320 GB SSD, $2.66/hour; spot saves 80-90%)
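The spot saving quoted above is easy to sanity-check: 80-90% off the $2.66/hour on-demand price gives roughly $0.27-$0.53/hour per r3.8xlarge. A quick sketch (the helper name is illustrative):

```python
# On-demand r3.8xlarge price as quoted on the slide.
ON_DEMAND_USD_PER_HOUR = 2.66

def spot_price(saving_fraction: float) -> float:
    """Effective hourly price after the quoted spot-market saving."""
    return ON_DEMAND_USD_PER_HOUR * (1 - saving_fraction)

# 90% saving -> $0.266/hour; 80% saving -> $0.532/hour
low, high = spot_price(0.90), spot_price(0.80)
```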
Data visualisation tools
Tableau
- Desktop licence for development
- Tableau Online serves interactive reports to end users

Re:dash
- Fast and simple data visualisation

Grafana
- Technical dashboards for infra monitoring

Zeppelin
- Part of the Spark stack
- 25+ interpreters available; currently we use Markdown, Shell, Scala, Python, SQL
Next steps
Spark
- Spark Streaming POC
- Kafka POC
- Enable a real-time dashboard on top of ElasticSearch + Kibana
- POC on GPU instances (utilize GPUs in Spark)
- Alluxio: in-memory data storage
Q & A