Upload
altocloud
View
485
Download
0
Embed Size (px)
Citation preview
The Life of Data @Altocloud
Maciej Dabrowski, Chief Data Scientist
Darragh Kirwan, Full Stack Engineer
1
Focus on real-time analytics
Make predictions on live visitors in real-time (in seconds) by:
Ingesting customer actions (events) and context
Building predictive models
Actions offered to customers based on real-time predictions
5
The life of data @Altocloud
Data platform
Use cases real-time analytics machine learning Ad hoc analysis
7
Engineering challenges
Product complexity Communication platform Data platform
Scale Millions of events per day Billions of events overall Typically no stable schemas
8
Real-time aspects Response in second(s) Streaming nature
Reliability 24/7 availability Services go down Servers disappear
Data
Demographic device location organisation contact details, and more
JSON
9
Events: page views form fills searches purchases custom events
ALTOCLOUD DATA PLATFORM
ALTOCLOUD PLATFORM
Altocloud Platform
10
APIs
MESSAGE QUEUES
DATA PROCESSORS
STORAGE
APIsAPIs
APIs
MESSAGE QUEUES
DATA PROCESSORS
DATA INGESTION
QUERY LAYER
STORAGE LAYER
Altocloud Data Platform
12
PLATFORM APIs
APIs
Why Spark
Fast for iterative algorithms (important for Machine Learning)
Good integration with other tools (Kafka and Cassandra)
One code base for streaming and batch processing
Easy to deploy and maintain
Growing ecosystem (SQL, MLlib, GraphX, …)
Large open-source community
13
Data source: Kafka
Pub-sub message broker
Fast: 100s MBs /s on a single broker
Scalable: partitioned data streams
Durable: messages persisted and replicated
Distributed: Strong durability and fault-tolerance
Downside: requires ZooKeeper
15
Scalable storage
Easy to setup
High availability - no master
Great performance
CQL - SQL like querying
Great support and bug-free drivers from Datastax
Key: Design your schema around queries;
16
Use case: Real-time Analytics
Analytics Fully distributed Idempotent - Spark, Cassandra and Kafka Powered by in-house solution
Highly scalable and performant Minutely roll ups on raw events, hourly on pre-aggregates 1bn+ events / day on a 3 node cluster Sub-second query time
17
PREDICTIONS
Use case: Machine Learning
18
MESSAGE QUEUE
EVENTSEVENTS
EVENTS
STREAMING
BATCH
MODEL LEARNING JOB
EVENT STORAGE JOB MODEL EVALUATION JOB
1
2
3
INSIGHTS
Apache Zeppelin Workbooks
Browser-based notebook with access to cluster compute power
Support for Scala, Python, SparkSQL, Hive and more
Data Analytics, visualisation and collaboration tool
21
Apache Zeppelin
Data visualisation out of the box Support for D3 library for complex visualisations Integration with live datasources
22