Pipedrive DW on AWS
November 3, 2016
Erkki Suurna
Talking points
- DW role in Pipedrive
- AWS services PDW uses
- DW infrastructure: VPC, S3, Redshift, RDS, EC2, Kinesis, ELB
- Security: KMS, IAM, S3 encryption
- Hadoop and Spark stack
- Self-developed ETL in Python (300+ tasks running daily)
PDW main goal
- Support a data-driven organisation
- SaaS KPI metrics reporting to executive level
- Product instrumentation and analysis
- Feedback to end users
- Business intelligence: data acquisition and transformation into meaningful information for business analysis
PDW data platform
- Disparate data sources are processed into one unified, trusted data level (3 TB)
- 25+ models and aggregates, 100+ tables
- Sensitive data is obfuscated
- Users are encouraged to query aggregated models instead of many canned reports
- 100k+ backups processed daily: 330 GB compressed => 3-4 TB of data
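The obfuscation step above can be sketched as a salted one-way hash. This is an illustrative approach only — the deck does not say which scheme Pipedrive actually uses — and the field names and salt here are hypothetical:

```python
import hashlib

# Hypothetical salt; the real obfuscation scheme is not described in the deck.
SALT = b"pdw-demo-salt"

def obfuscate(value: str) -> str:
    """Replace a sensitive value with a deterministic, irreversible token."""
    return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()[:16]

# Example row with an illustrative sensitive field.
row = {"deal_id": 42, "owner_email": "jane@example.com"}
row["owner_email"] = obfuscate(row["owner_email"])
```

Determinism matters here: the same input always maps to the same token, so joins and group-bys on obfuscated columns still work.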
AWS services PDW use - VPC
Virtual Private Cloud
- A logically isolated section of AWS where you define your own virtual network
- Every PDW service is in a separate subnet with separate security groups
- US East region, 5 availability zones
- Everything is closed by default
- Infrastructure is not permanent: script everything, because you will recreate it one day
AWS services PDW use - S3
Simple Storage Service
- Central point for all services
- Not a file system; more like endless object storage
- Encryption: AES-256
- Lifecycle policies and storage classes
- Distribute objects in a bucket by key prefix for better performance (100 TPS)
- Multipart upload
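The key-prefix trick above can be sketched like this: hashing the filename yields a short, evenly distributed prefix, so objects spread across S3's internal partitions instead of piling onto one. The function name and prefix length are illustrative:

```python
import hashlib

def distributed_key(filename: str, prefix_len: int = 4) -> str:
    """Prepend a short hash of the filename so keys spread evenly
    across S3 partitions (avoids hot-spotting on sequential names)."""
    prefix = hashlib.md5(filename.encode("utf-8")).hexdigest()[:prefix_len]
    return f"{prefix}/{filename}"

# Sequential backup names get scattered prefixes:
key = distributed_key("backup-2016-11-03.json.gz")
```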
AWS services PDW use - Redshift and RDS
Redshift: 3-node, 6 TB cluster
- Based on Postgres
- MPP: a great product for analytical queries
- Automatic backups
- Scales up to 300 TB in dense compute mode, up to 2 PB in dense storage mode
- No stored procedures or functions
- Loads data from S3
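Loading from S3 into Redshift is done with a COPY statement. A minimal sketch that builds one for the CSV change deltas mentioned later in the deck — the table, bucket, and IAM role names are made up for illustration:

```python
def redshift_copy(table: str, s3_path: str, iam_role: str) -> str:
    """Build a Redshift COPY statement to load gzipped CSV deltas from S3.

    All identifiers passed in are illustrative, not Pipedrive's actual ones.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "FORMAT AS CSV GZIP;"
    )

sql = redshift_copy(
    "deals_delta",
    "s3://pdw-staging/deltas/deals/",
    "arn:aws:iam::123456789012:role/RedshiftCopy",
)
```

COPY parallelizes the load across the cluster's slices, which is why S3 staging is the standard ingestion path rather than row-by-row INSERTs.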
Relational Database Service
- Multi-AZ, automatic backup and failover
- ETL and the Spark backend use Postgres as a metastore
- Snapshots of Pipedrive app backend MySQL databases
AWS services PDW use - EC2
Elastic Compute Cloud
- Script all instances
- Pre-built images (we mainly use Amazon Linux because it is managed by Amazon)
- Spot instances (instance vs vCPU)
- Auto Scaling Groups
Access Control
- KMS to encrypt and decrypt secrets
- Identity and Access Management (IAM)
- Security groups as a virtual firewall at the instance level
AWS Kinesis
Scalable buffer
- A shard is the base throughput unit: 1 MB/sec data input, 2 MB/sec data output, up to 1,000 PUT records/sec
- 24-hour retention
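The per-shard limits above translate directly into a sizing rule: take the maximum shard count required by any of the three constraints. A small sketch (the function name is illustrative):

```python
import math

def shards_needed(in_mb_per_sec: float, out_mb_per_sec: float,
                  puts_per_sec: float) -> int:
    """Size a Kinesis stream from the per-shard limits quoted on the slide:
    1 MB/s in, 2 MB/s out, 1000 PUT records/s. The binding constraint wins."""
    return max(
        math.ceil(in_mb_per_sec / 1.0),
        math.ceil(out_mb_per_sec / 2.0),
        math.ceil(puts_per_sec / 1000.0),
    )

# e.g. 5 MB/s in, 6 MB/s out, 3500 PUTs/s -> max(5, 3, 4) = 5 shards
n = shards_needed(5, 6, 3500)
```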
Hadoop and Spark stack
- HDFS, a new DW concept: the namenode is a micro instance; 2 datanodes provide 10+ TB of storage
- Spark, a lightning-fast processing cluster: ETL cluster in YARN mode, ad-hoc cluster in standalone mode
- APIs: Python, Scala, Java, SQL, GraphX, Streaming, MLlib
- Formats: source data in JSON, destination data in Parquet, change deltas in CSV for Redshift import
- Current setup: r3.8xlarge (32 cores, 244 GB memory, 2x 320 GB SSD, $2.66/hour; spot saves 80-90%)
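The spot saving quoted above is easy to sanity-check: 80-90% off the $2.66/hour on-demand price gives roughly $0.27-$0.53/hour per r3.8xlarge. A quick sketch (the helper name is illustrative):

```python
# On-demand r3.8xlarge price as quoted on the slide.
ON_DEMAND_USD_PER_HOUR = 2.66

def spot_price(saving_fraction: float) -> float:
    """Effective hourly price after the quoted spot-market saving."""
    return ON_DEMAND_USD_PER_HOUR * (1 - saving_fraction)

# 90% saving -> $0.266/hour; 80% saving -> $0.532/hour
low, high = spot_price(0.90), spot_price(0.80)
```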
Data visualisation tools
Tableau
- Desktop licence for development
- Tableau Online serves interactive reports to end users

Re:dash
- Fast and simple data visualisation

Grafana
- Technical dashboards for infra monitoring

Zeppelin
- Part of the Spark stack
- 25+ interpreters available; currently we use Markdown, Shell, Scala, Python, SQL
Next steps
Spark
- Spark Streaming POC
- Kafka POC
- Enable a real-time dashboard on top of ElasticSearch + Kibana
- POC on GPU instances (utilize GPUs in Spark)
- Alluxio: in-memory data storage
Q & A