IGNITING AUDIENCE MEASUREMENT
AT TIME WARNER CABLE
TIM CASE
Agenda
• Who is Time Warner Cable & Time Warner Cable Media
• What is Audience Measurement?
• Challenges With Legacy Architecture
• Next Generation Architecture
• Lessons Learned
Who Am I?
• 10+ years in E-commerce
• Focused on Data Warehousing for the last 5 years
• Certifications
– Cloudera Certified Administrator for Apache Hadoop (CCAH)
– Cloudera Certified Developer for Apache Hadoop (CCDH)
– Teradata Certified Professional
– IBM Certified Specialist - PureData System for Analytics
– Tableau Server Certified Professional
– MicroStrategy Certified Engineering Principal
– Certified ScrumMaster
– Certified SAFe Agilist
• College sports fan – Go Noles!
Time Warner Cable & Time Warner Cable Media
Time Warner Cable is among the largest providers of video, high-speed data and voice services in the U.S., connecting more than 15 million customers to entertainment, information and each other
• Serves customers in 29 states
• More than 50,000 employees across the U.S.
Time Warner Cable Media, the advertising arm of Time Warner Cable, provides national, regional and local marketers and agencies with innovative, strategic and cost effective advertising solutions.
What is Audience Measurement?
The Audience Measurement platform enables census reporting of subscriber viewership and allows us to answer the Five W’s
Who is watching?
– Anonymized demographics, consumer behaviors
What are they watching?
– Station, program information, advertisements
When are they watching?
– Day of week, daypart, time-shifted
Where are they watching?
– Set-top box, TWC TV apps
Why are they watching?
– Program metadata
Viewership Data
• Set-top box
– Processing more than 500 million events per day
• Largest table is Program Tuning Event Fact
– 75 TB of raw data
– 180+ billion records
• TWC TV app (iPad, iPhone, Android, Xbox, etc.)
• Video On Demand (VOD)
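To put the set-top box figures above in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes decimal units (1 TB = 10^12 bytes) and a uniform event rate across the day, neither of which is stated on the slide, and it treats "180+ billion" and "more than 500 million" as their lower bounds.

```python
# Back-of-the-envelope figures derived from the numbers on the slide.
# Assumptions (not from the slide): 1 TB = 10**12 bytes, uniform event rate.

RAW_BYTES = 75 * 10**12        # 75 TB of raw data
RECORDS = 180 * 10**9          # 180+ billion records (lower bound)
EVENTS_PER_DAY = 500 * 10**6   # more than 500 million events per day (lower bound)
SECONDS_PER_DAY = 24 * 60 * 60

# Average size of one raw record in the Program Tuning Event Fact table.
avg_bytes_per_record = RAW_BYTES / RECORDS

# Average ingest rate if events arrived evenly over 24 hours.
avg_events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY

print(f"about {avg_bytes_per_record:.0f} bytes per record")
print(f"about {avg_events_per_second:.0f} events per second")
```

Under these assumptions the averages come out to roughly 417 bytes per record and roughly 5,787 events per second. Real traffic is unlikely to be uniform, so peak rates would be higher.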
Ads Data
• TWC Media and 3rd party spots
Reference Data
• Household demographics
• Program data
• Automotive data
• Political affiliation
• 5 Heavy Analytical Users
• 200 Audience Finder Users
• 50 Tableau Consumers
Audience Measurement By the Numbers
• Around 100 Tableau Workbooks
– Authored by the business and IT
• Numerous ad hoc queries
Video Viewership Analyzer (VVA)
• Custom application that enables complex audience definition by the user community
– Date range
– Geography (DMA, Ad Zone or Zip Code)
– Platform (Classic, IPTV)
– Audience Definition
• Daypart
• Station and/or Program
• Demographics (includes line-of-business, propensities, Tribes and automotive)
• Platform usage (VOD, IPTV, high-speed data)
• Custom segmentation
• Output includes ranked list of stations and some high-level metrics
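The audience definition described above can be pictured as a set of optional filters applied to viewing records, followed by a ranking of stations. The sketch below is illustrative only and is not the VVA implementation: the class names, field names, and sample rows are all hypothetical, and only a few of the listed criteria (date range, DMA, platform, daypart) are modeled.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class ViewingRecord:
    # Hypothetical record layout, invented for this sketch.
    household_id: str   # anonymized identifier
    view_date: date
    dma: str
    platform: str       # e.g. "Classic" or "IPTV"
    daypart: str
    station: str
    minutes: int


@dataclass(frozen=True)
class AudienceDefinition:
    # A criterion left as None means "no restriction".
    start: date
    end: date
    dma: Optional[str] = None
    platform: Optional[str] = None
    daypart: Optional[str] = None

    def matches(self, r: ViewingRecord) -> bool:
        return (
            self.start <= r.view_date <= self.end
            and self.dma in (None, r.dma)
            and self.platform in (None, r.platform)
            and self.daypart in (None, r.daypart)
        )


def rank_stations(records: Iterable[ViewingRecord],
                  audience: AudienceDefinition) -> List[Tuple[str, int]]:
    """Return (station, total minutes) pairs, highest total first."""
    minutes = Counter()
    for r in records:
        if audience.matches(r):
            minutes[r.station] += r.minutes
    return minutes.most_common()


# Made-up sample rows, for illustration only.
records = [
    ViewingRecord("h1", date(2015, 6, 1), "DMA-1", "Classic", "Prime", "Station A", 60),
    ViewingRecord("h2", date(2015, 6, 1), "DMA-1", "IPTV", "Prime", "Station B", 30),
    ViewingRecord("h3", date(2015, 6, 2), "DMA-1", "Classic", "Prime", "Station A", 45),
    ViewingRecord("h4", date(2015, 6, 2), "DMA-2", "Classic", "Prime", "Station B", 90),
    ViewingRecord("h1", date(2015, 6, 3), "DMA-1", "Classic", "Daytime", "Station C", 20),
]
audience = AudienceDefinition(date(2015, 6, 1), date(2015, 6, 7),
                              dma="DMA-1", daypart="Prime")
ranked = rank_stations(records, audience)
print(ranked)
```

With the sample rows, only the three prime-time records in DMA-1 match, so Station A (105 minutes) ranks ahead of Station B (30 minutes).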
Audience Finder
Audience Finder: Reference Program
Technology
• 3rd Party application ingests raw data and performs anonymization, correlation and some enrichment/mediation
• TWC ingests files provided by 3rd Party and performs additional enrichment as well as applying business rules and stitching logic
– Executed in Netezza using SQL and shell scripts
• Two Netezza appliances
– TwinFin 36 used for ELT processing
– TwinFin 72 used for BI and customer-facing workloads
Legacy Platform Architecture
[Diagram: Source Data; Core Logic (Anonymization, Correlation, Mediation, Enrichment); TWC Media Business Logic (Stitching, Filtering, Zombie Logic)]
Challenges With Legacy Architecture
Collection
• Inconsistency around reliability and availability of source and reference data
Processing
• Slow catch-up process
• Architecture does not promote speed to market for new features
Data Storage + Delivery
• Platform instability
• Does not support concurrent users
Analysis + Presentation
• Limited exploration and interactive capabilities
Metrics to Assess
• SLAs for T-3 and T-14
• Frequency of reprocessing
• Reference data quality
• Duration of reprocessing
• Team velocity when introducing ETL changes
• Platform availability
• Query response times
• Response time SLAs during mixed workload
• User satisfaction with the interface
• Customer dependency on IT for changes
Technical Criteria
• Performance
• Supports batch and streaming
• Leverage software engineering patterns
• Open source momentum
• “-ilities”
– Scalability
– Elasticity
– Availability
– Durability
– Extensibility
• Enables DevOps to complement Agile adoption
– Automated testing
– Test-driven Development (TDD)
– Continuous Integration (CI)
• Strong foundation for Data Lake
Components: Data Integration (Spark), Hadoop (HDFS), Data Warehouse (Teradata), Visualization (Tableau), Event Persistence (Kafka)
Apache Spark is a more appropriate solution for set-top box processing logic:
• Reduces complexity, simplifies code maintenance, and improves defect resolution time and run-time.
• Can be applied in batch or near real-time with modest changes, which positions us for T-x data availability (where ‘x’ is limited only by the availability of reference data).
• Enables use of Agile development practices (test-driven development and continuous integration), thereby improving time-to-market and code quality and radically reducing QA costs and time.
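To illustrate the test-driven development point, here is a minimal sketch of processing logic written as a pure function with a unit test beside it. The slides do not define "stitching"; this sketch assumes it means merging consecutive tuning events from the same device on the same station when the gap between them is small. The `TuningEvent` fields, the `stitch` function, and the 60-second threshold are all hypothetical. The example is plain Python rather than Spark code; the idea is that logic kept free of I/O can be tested without a cluster and then called from a batch or streaming job.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class TuningEvent:
    # Hypothetical event layout, invented for this sketch.
    device_id: str
    station: str
    start: int   # epoch seconds
    end: int     # epoch seconds


def stitch(events: Iterable[TuningEvent], max_gap: int = 60) -> List[TuningEvent]:
    """Merge consecutive events for the same device and station when the
    gap between them is at most max_gap seconds (assumed definition)."""
    stitched: List[TuningEvent] = []
    for ev in sorted(events, key=lambda e: (e.device_id, e.start)):
        prev = stitched[-1] if stitched else None
        if (prev is not None
                and prev.device_id == ev.device_id
                and prev.station == ev.station
                and ev.start - prev.end <= max_gap):
            # Extend the previous event instead of adding a new one.
            stitched[-1] = TuningEvent(prev.device_id, prev.station,
                                       prev.start, max(prev.end, ev.end))
        else:
            stitched.append(ev)
    return stitched


def test_stitch_merges_adjacent_events() -> None:
    events = [
        TuningEvent("box-1", "A", 0, 100),
        TuningEvent("box-1", "A", 130, 200),   # 30 s gap: merged
        TuningEvent("box-1", "B", 200, 300),   # different station: kept
        TuningEvent("box-2", "A", 150, 400),   # different device: kept
    ]
    assert stitch(events) == [
        TuningEvent("box-1", "A", 0, 200),
        TuningEvent("box-1", "B", 200, 300),
        TuningEvent("box-2", "A", 150, 400),
    ]


test_stitch_merges_adjacent_events()
```

Because the function takes and returns plain values, the same test can run in a continuous integration build on every commit.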
Hadoop/HDFS for storing large historical data positions the organization
to leverage the evolving open source big data analytics technologies
(machine learning, SQL on Hadoop, graph processing, etc.)
Teradata will allow for large volumes of tuning event data to be secure,
easily accessible, and highly available to large numbers of users and at
reasonable cost.
Tableau enables self-service analytics, including advanced algorithms,
against the audience measurement data, and presents information to
various consumers in meaningful ways.
Kafka is a high-performance, fault-tolerant, real-time messaging platform that
will allow us to keep a history of tuning events for faster reprocessing. This
component is critical once we are performing near real-time streaming of
events.
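The value of keeping a history of tuning events can be sketched with a toy append-only log. This is not the Kafka API: `EventLog`, `minutes_by_station`, and the sample events are hypothetical stand-ins. The sketch shows only the idea that events retained in order can be replayed through the same logic after reference data is corrected.

```python
from typing import Dict, Iterable, List


class EventLog:
    """Toy append-only log; each record gets a sequential offset."""

    def __init__(self) -> None:
        self._records: List[dict] = []

    def append(self, record: dict) -> int:
        self._records.append(record)
        return len(self._records) - 1   # offset of the new record

    def read_from(self, offset: int = 0) -> List[dict]:
        # Reading never removes records, so any consumer can replay history.
        return list(self._records[offset:])


def minutes_by_station(events: Iterable[dict],
                       station_names: Dict[int, str]) -> Dict[str, int]:
    """Total viewing minutes per station name; unmapped IDs become UNKNOWN."""
    totals: Dict[str, int] = {}
    for ev in events:
        name = station_names.get(ev["station_id"], "UNKNOWN")
        totals[name] = totals.get(name, 0) + ev["minutes"]
    return totals


# Made-up events, for illustration only.
log = EventLog()
log.append({"station_id": 1, "minutes": 30})
log.append({"station_id": 2, "minutes": 15})
log.append({"station_id": 1, "minutes": 10})

# First pass: reference data is missing station 2.
first_pass = minutes_by_station(log.read_from(0), {1: "Station A"})

# Reference data is corrected later; replay the same history from offset 0.
second_pass = minutes_by_station(log.read_from(0),
                                 {1: "Station A", 2: "Station B"})

print(first_pass)
print(second_pass)
```

The first pass attributes 15 minutes to UNKNOWN. The replay with corrected reference data attributes them to Station B without going back to the original source.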
Technologies Selected
Initial Nextgen Architecture
• Replace MicroStrategy with Tableau to enable self-service
• Replace Netezza for customer-facing workloads with Teradata, improving platform stability and enabling sandboxes (e.g., Data Labs) and workload management tools that help manage to performance SLAs
• Replace 3rd Party application and Netezza ELT with Spark for Collection and Processing logic (anonymization, correlation, enrichment, filtering, stitching, and zombie logic)
[Diagram: Source Data; Core Logic (Anonymization, Correlation, Mediation, Enrichment); TWC Media Business Logic (Stitching, Filtering, Zombie Logic)]
Long-Term Architecture
• Implement an enterprise Data Lake to enable non-Media use cases
• Migrate to Spark Streaming and Kafka to enable near real-time use cases
• Evaluate dedicated infrastructure for more predictable performance
[Diagram: Event Data; Reference Data; Data Lake (Anonymization, Correlation, Mediation, Enrichment); Business Logic (Stitching, Filtering, Zombie Logic)]
Lessons Learned
• Have Executive support
• Infrastructure is critical
– Node sizes
– Network
• Leverage the open source community
– Enhancements
– Extensions (Spark Packages)
• Talent is hard to find
– Consider abstractions
Partners
We’re Hiring!
http://jobs.timewarnercable.com/