rolf-koski
Yle Data Cloud
• Helps to create a better user experience
• Content development
• Service optimization
• Recommendation
• Marketing automation
• Purpose: create broad, real-time data for strategic decisions and actions
• Reachability, demographics
Yle Data Cloud
Data => Information => Knowledge
● Daily predictions, how they match with reality, accuracy over time
● Identify control set of users that can be used as a reference point
● Combine Internet metrics with analog TV measurements
Yle Data Cloud
● What do we measure?
  ○ Page hits
  ○ Heartbeats
  ○ Social media
  ○ Articles read, time spent with media, date / time / location
    ■ AMR (Average Minute Rating)
  ○ Genre
  ○ Age groups
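As a rough illustration of how heartbeats could feed an AMR figure: AMR is conventionally the average audience per minute of a programme, that is, total viewed minutes divided by the programme length in minutes. The sketch below assumes that each heartbeat stands for a fixed slice of viewing time (30 seconds here, a hypothetical value); the function name and parameters are illustrative and are not taken from Yle's actual schema.

```python
def average_minute_rating(heartbeat_count: int,
                          programme_minutes: float,
                          heartbeat_interval_s: float = 30.0) -> float:
    """Average audience per minute, estimated from heartbeat events.

    heartbeat_interval_s is an assumption: the amount of viewing time
    that a single heartbeat is taken to represent.
    """
    if programme_minutes <= 0:
        raise ValueError("programme_minutes must be positive")
    viewed_minutes = heartbeat_count * heartbeat_interval_s / 60.0
    return viewed_minutes / programme_minutes


# 120,000 heartbeats at 30 s each over a 30-minute programme
amr = average_minute_rating(120_000, programme_minutes=30)
```

With these assumed numbers, 60,000 viewed minutes spread over 30 minutes corresponds to an average of 2,000 concurrent viewers.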
Content
User information
Behavior
Data Cloud Relational Data Hub
[Diagram: Sources (web events, panel) → Data hub (raw data in S3, structured data vault, publishing) → Usage (dashboards, recommendations, strategy, marketing automation)]
Yle Analytics Pipeline - Current situation
[Diagram: analytics pipeline. Components shown: web events (~100 million per day), Analytics Collector, Elastic Beanstalk, Kinesis Streams, Kinesis Firehose, Redshift, Lambda, S3, EMR, CloudWatch, Dashboards, Areena Recommendation]
Annotations on the diagram:
● Delay: 10-15 minutes, small batch sizes
● Small files, delay: ~1 minute
● Daily archiving: compression and larger file sizes, ~90 GB per day
Managed and/or serverless processing?
● Firehose used for data ingestion to Redshift
● Inflexible for faster analytics needs (< 2 minutes)
● Kinesis + Lambda consumers for both S3 archiving and fast-lane analysis (TBD)
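A minimal sketch of what a Kinesis-triggered Lambda consumer for archiving might look like. It relies on the standard Kinesis event shape that Lambda delivers (base64-encoded payloads under `Records[].kinesis.data`) and assumes the web events are JSON. The `upload` parameter is a hypothetical injected writer, for example a thin wrapper around an S3 upload, and not part of the pipeline described here.

```python
import base64
import gzip
import json


def decode_records(event: dict) -> list:
    """Decode the base64 payloads of a Kinesis-triggered Lambda event."""
    return [
        json.loads(base64.b64decode(record["kinesis"]["data"]))
        for record in event["Records"]
    ]


def to_gzip_ndjson(items: list) -> bytes:
    """Pack events as gzip-compressed, newline-delimited JSON."""
    lines = "\n".join(json.dumps(item, sort_keys=True) for item in items)
    return gzip.compress(lines.encode("utf-8"))


def handler(event, context, upload=None):
    """Archive one Kinesis batch; `upload` is a hypothetical injected writer."""
    items = decode_records(event)
    body = to_gzip_ndjson(items)
    if upload is not None:
        # In a real deployment this could wrap an S3 upload call.
        upload(body)
    return {"archived": len(items), "bytes": len(body)}
```

Keeping the decoding and packing steps as plain functions means the same code could serve both the archiving consumer and a fast-lane consumer, with only the final sink differing.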
DevOps in analytics pipeline
● Baseline support from a dedicated operations team
● Terraform for infrastructure management
● Most other Yle services run as Dockerized APIs in an ECS cluster
Redshift consumers
● Data mart (PostgreSQL) for web-accessed precomputed results
● Scheduled Lambdas for very light queries
● Lambda-driven task containers for long but memory-light queries
● Lambda-started EC2 instances for memory-intensive computing (recommendations, user classifications)
● Data scientists running exploratory queries
Redshift performance
● The default user group is limited to 5 concurrent queries
● Set up WLM queues for different workloads, split by usage
● Isolate data scientists into a separate WLM queue that doesn’t block scheduled activity
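One way the queue split could be expressed is as the JSON value of Redshift's `wlm_json_configuration` cluster parameter. The sketch below builds such a value in Python. The key names (`query_group`, `user_group`, `query_concurrency`, `memory_percent_to_use`) follow that parameter's format as generally documented and should be checked against the current Redshift documentation; the queue names, concurrency levels and memory shares are purely hypothetical.

```python
import json


def build_wlm_config(queues: list) -> str:
    """Serialise WLM queue definitions for wlm_json_configuration.

    A catch-all default queue is appended as the last entry.
    """
    total = sum(q["memory_percent_to_use"] for q in queues)
    if total > 100:
        raise ValueError("memory allocation exceeds 100%")
    default_queue = {"query_concurrency": 5}
    return json.dumps(queues + [default_queue])


# Hypothetical split: scheduled jobs and data scientists get separate queues.
wlm_json = build_wlm_config([
    {"query_group": ["scheduled"], "query_concurrency": 5,
     "memory_percent_to_use": 50},
    {"user_group": ["data_scientists"], "query_concurrency": 3,
     "memory_percent_to_use": 30},
])
```

Under this assumed setup, a scheduled job would route itself to its queue by issuing `SET query_group TO 'scheduled';` at the start of its session, while members of the `data_scientists` user group would land in their own queue automatically.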
Lambda for batch queries?
● Fast, serverless, stateless, cheap, reactive
● Limited by the 300 s maximum timeout
● Unreliable in high-load situations
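To illustrate how the timeout constrains batch work, here is a hedged sketch of a guard that stops starting new queries once the remaining execution time drops below a safety margin. `get_remaining_time_in_millis()` is a method of the Lambda context object in the Python runtime; the helper name, the task list and the 10-second margin are assumptions for illustration.

```python
def run_until_deadline(context, tasks, safety_margin_ms=10_000):
    """Run callables in order, stopping before the Lambda times out.

    Returns the tasks that were not started so they can be rescheduled.
    """
    remaining = list(tasks)
    while remaining:
        if context.get_remaining_time_in_millis() < safety_margin_ms:
            break
        task = remaining.pop(0)
        task()
    return remaining
```

A guard like this only helps when each individual query is short. A single query that runs longer than the timeout still fails, which is the case that pushes long-running work out of Lambda.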
Lambda-driven task containers for batch
● Lambda allows reactive and scheduled running of tasks.
● Task containers are not limited by execution timeouts.
● Logging and monitoring support for ECS containers is already there (ELK).
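The pattern above could be sketched as a small Lambda that does nothing but start an ECS task and return. The `run_task` call and its `cluster`, `taskDefinition`, `count` and `overrides` parameters belong to the boto3 ECS client; the cluster, task definition, container and script names are hypothetical placeholders, and the client is injectable so the logic can be exercised without AWS access.

```python
def start_batch_task(ecs_client, cluster, task_definition, container, command):
    """Start one ECS task and return its ARN.

    ecs_client is expected to behave like boto3.client("ecs").
    """
    response = ecs_client.run_task(
        cluster=cluster,
        taskDefinition=task_definition,
        count=1,
        overrides={
            "containerOverrides": [{"name": container, "command": command}]
        },
    )
    if response.get("failures"):
        raise RuntimeError(f"ECS could not place task: {response['failures']}")
    return response["tasks"][0]["taskArn"]


def handler(event, context, ecs_client=None):
    """Lambda entry point; all resource names below are hypothetical."""
    if ecs_client is None:
        import boto3  # available in the Lambda Python runtime
        ecs_client = boto3.client("ecs")
    return start_batch_task(
        ecs_client,
        cluster="analytics",
        task_definition="redshift-batch-query",
        container="query-runner",
        command=["python", "run_query.py", event.get("query_name", "daily")],
    )
```

Because the Lambda returns as soon as the task is started, its own timeout no longer bounds the query; the container runs for as long as the work takes.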
Why not use AWS Batch?
● AWS Batch wasn’t available when batch containers were first needed; it was announced at re:Invent in December 2016
● Not yet supported in Terraform v0.8.5
● Once support is there, switch from homebrew resources to AWS Batch
Monitoring?
● CloudWatch alarms for component red/green health
● CloudWatch dashboards for an overall graph view
● Kibana and CloudWatch Logs (with custom scripting) for log management