AWS Big Data in everyday use at Yle

Saku Vaittinen, Yle / Jukka Dahlbom, Webscale

9.2.2017

Yle Data Cloud

• Helps to create better user experience
• Content development
• Service optimization
• Recommendation
• Marketing automation

• Purpose: to create broad, real-time data for strategic decisions and actions
• Reachability, demographics

Yle Data Cloud

Data => Information => Knowledge

● Daily predictions, how they match with reality, accuracy over time

● Identify control set of users that can be used as a reference point

● Combine Internet metrics with analog TV measurements

Yle Data Cloud

● What do we measure?
○ Page hits
○ Heartbeats
○ Social media
○ Articles read, time spent with media, date / time / location
■ AMR (Average Minute Rating)
○ Genre
○ Age groups
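As an illustration of how heartbeats could feed the AMR figure, here is a minimal sketch. It assumes the common definition of Average Minute Rating (the average number of distinct viewers per minute over a programme's duration) and a hypothetical heartbeat shape of `(user_id, epoch_seconds)`; neither the function name nor the data layout comes from the slides.

```python
from collections import defaultdict


def average_minute_rating(heartbeats, start, duration_minutes):
    """Average number of distinct viewers per minute of a programme.

    heartbeats: iterable of (user_id, epoch_seconds) pairs (hypothetical shape).
    start: programme start time in epoch seconds.
    duration_minutes: programme length in whole minutes.
    """
    if duration_minutes <= 0:
        raise ValueError("duration_minutes must be positive")

    # Collect the distinct users seen during each minute of the programme.
    viewers_per_minute = defaultdict(set)
    for user_id, timestamp in heartbeats:
        minute = int((timestamp - start) // 60)
        if 0 <= minute < duration_minutes:
            viewers_per_minute[minute].add(user_id)

    # Minutes without any heartbeat count as zero viewers.
    total_viewer_minutes = sum(len(users) for users in viewers_per_minute.values())
    return total_viewer_minutes / duration_minutes
```

Counting distinct users per minute keeps several heartbeats from the same user within one minute from inflating the figure.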

Content

User information

Behavior

Data Cloud Relational Data Hub

[Diagram labels]
● Stages: Sources, Data hub, Usage
● Data hub layers: Raw data, Structured (Data vault), Publishing
● Sources: Web events, Panel
● Storage: S3
● Usage: Dashboards, Recommendations, Strategy, Marketing automation

Yle Analytics Pipeline - Current situation

[Pipeline diagram labels]
● Components: Web events (~100 million per day), Analytics Collector, Elastic Beanstalk, Kinesis Streams, Kinesis Firehose, Redshift, Lambda, S3, EMR, CloudWatch, Dashboards, Areena Recommendation
● Annotations:
○ Daily archiving
○ Delay: 10-15 minutes
○ Small batch sizes
○ Compression and larger file sizes, ~90 GB per day
○ Small-sized files, delay: ~1 minute

Managed and/or serverless processing?

● Firehose used for data ingestion to Redshift
● Inflexible for faster analytics needs (< 2 minutes)
● Kinesis + Lambda consumers for both S3 archiving and fast lane analysis (TBD).
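A Kinesis + Lambda consumer for S3 archiving could look roughly like the sketch below. It relies on the standard shape of a Kinesis-triggered Lambda event (base64 payloads under `Records[].kinesis.data`) and assumes the payloads are JSON. The key prefix, the key layout and the injected `put_object` callable are hypothetical and stand in for a real S3 client so that the sketch runs offline.

```python
import base64
import gzip
import json
from datetime import datetime, timezone


def decode_kinesis_records(event):
    """Decode the base64-encoded JSON payloads of a Kinesis-triggered event."""
    return [
        json.loads(base64.b64decode(record["kinesis"]["data"]))
        for record in event["Records"]
    ]


def build_archive_object(events, now=None, prefix="raw/web-events"):
    """Return (key, body): a date-partitioned key and gzipped JSON lines.

    The prefix and key layout are hypothetical, not Yle's actual scheme.
    """
    now = now or datetime.now(timezone.utc)
    key = f"{prefix}/{now:%Y/%m/%d}/{now:%H%M%S%f}.json.gz"
    lines = "\n".join(json.dumps(e, sort_keys=True) for e in events)
    return key, gzip.compress(lines.encode("utf-8"))


def handler(event, context, put_object=None):
    """Lambda entry point; put_object(key, body) is injected (assumption)."""
    events = decode_kinesis_records(event)
    key, body = build_archive_object(events)
    if put_object is not None:
        # In a deployment this would wrap an S3 upload of body under key.
        put_object(key, body)
    return {"archived": len(events), "key": key}
```

Each invocation writes one small compressed object, which matches the small files and short delay on this path; merging them into larger files would be a separate step.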

DevOps in analytics pipeline

● Baseline support from dedicated operations team
● Terraform for infrastructure management
● Most other Yle services run as Dockerized APIs in an ECS cluster.

Redshift consumers

● Data mart (PostgreSQL) for web-accessed precomputed results
● Scheduled Lambdas for very light queries
● Lambda-driven task containers for long but memory-light queries
● Lambda-started EC2 instances for memory-intensive computing (recommendations, user classifications)
● Data scientists running exploratory queries

Redshift performance

● Default user group is limited to 5 concurrent queries
● Set up WLM queues for different workloads, split by usage.
● Isolate data scientists into a separate WLM queue that doesn’t block scheduled activity.
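The queue split above could be expressed as a value for Redshift's `wlm_json_configuration` cluster parameter, sketched here. The property names (`user_group`, `query_concurrency`, `memory_percent_to_use`) are Redshift's; the group names, concurrency levels and memory shares are purely illustrative assumptions, not Yle's settings.

```python
import json


def build_wlm_configuration():
    """Build a WLM configuration with separate queues per workload.

    All group names and numbers below are hypothetical examples.
    """
    queues = [
        {
            # Scheduled batch jobs get their own slots and memory share.
            "user_group": ["scheduled_jobs"],
            "query_concurrency": 5,
            "memory_percent_to_use": 50,
        },
        {
            # Exploratory queries are isolated so they cannot block schedules.
            "user_group": ["data_scientists"],
            "query_concurrency": 3,
            "memory_percent_to_use": 30,
        },
        {
            # The last queue, without a user group, acts as the default queue.
            "query_concurrency": 5,
            "memory_percent_to_use": 20,
        },
    ]
    return json.dumps(queues)
```

The resulting JSON string would be set on the cluster's parameter group, for example through Terraform.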

Lambda for batch queries?

● Fast, serverless, stateless, cheap, reactive.
● Limited by the 300 s maximum timeout
● Unreliable in high-load situations.
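One way to live within the timeout is to check the remaining execution time and stop early, as in this minimal sketch. `get_remaining_time_in_millis()` is the method the Lambda context object provides; the function name, the `process` callback and the safety margin are hypothetical.

```python
def run_within_budget(items, process, context, safety_margin_ms=10_000):
    """Process items until the Lambda time budget is nearly used up.

    Returns (results, unprocessed) so the caller can hand the remainder
    to another invocation or to a longer-running task container.
    """
    results = []
    remaining = list(items)
    # Stop once less than safety_margin_ms of execution time is left.
    while remaining and context.get_remaining_time_in_millis() > safety_margin_ms:
        results.append(process(remaining.pop(0)))
    return results, remaining
```

This only helps when the work splits into small units; a single long query still needs a runtime without the timeout.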

Lambda-driven task containers for batch

● Lambda allows reactive and scheduled running of tasks.
● Task containers are not limited by execution timeouts.
● Logging and monitoring support for ECS containers is already there (ELK).
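The Lambda side of such a setup can be very small: it only builds the request and starts the container. The sketch below shapes the request after the parameters of the ECS `RunTask` API (`cluster`, `taskDefinition`, `count`, `overrides.containerOverrides`). The cluster name, task definition name, the `QUERY_NAME` variable and the assumption that the container is named like the task definition are all hypothetical, and the ECS client is injected so the sketch runs without AWS access.

```python
def build_run_task_request(query_name, cluster="analytics",
                           task_definition="batch-query"):
    """Build ECS RunTask parameters for one batch query (names hypothetical)."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "count": 1,
        "overrides": {
            "containerOverrides": [
                {
                    # Assumes the container is named like the task definition.
                    "name": task_definition,
                    "environment": [
                        {"name": "QUERY_NAME", "value": query_name},
                    ],
                }
            ]
        },
    }


def handler(event, context, ecs_client=None):
    """Lambda entry point: start a task container and return immediately."""
    request = build_run_task_request(event["query_name"])
    if ecs_client is not None:
        # With boto3 this would be a client created via boto3.client("ecs").
        ecs_client.run_task(**request)
    return request
```

Because the Lambda returns as soon as the task is started, the long-running query is bounded by the container's lifetime, not by the Lambda timeout.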

Why not use AWS Batch?

● AWS Batch wasn’t available when batch containers were first needed; it was announced at re:Invent in December 2016.
● Not yet supported in Terraform v0.8.5.
● Once support is there, switch from homebrew resources to AWS Batch.

Monitoring?

● CloudWatch alarms for component red/green health
● CloudWatch Dashboards for overall graph view
● Kibana and CloudWatch Logs (with custom scripting) for log management.


Data & Analytics
