Upload
booker
View
55
Download
0
Tags:
Embed Size (px)
DESCRIPTION
How Conviva used Spark to Speed Up Video Analytics by 25x Dilip Antony Joseph (@ DilipAntony ). Conviva monitors and optimizes online video for premium content providers. We see 10s of millions of streams every day. Conviva data processing architecture. Monitoring Dashboard. - PowerPoint PPT Presentation
Citation preview
How Conviva used Spark to Speed Up Video Analytics by 25xDilip Antony Joseph (@DilipAntony)
Conviva monitors and optimizes online video for premium content providers
We see 10s of millions of streams every day
Conviva data processing architecture
Hadoop for historical data
Live data processing
Spark
Reports
Ad-hoc analysis
Video Player
Monitoring Dashboard
Group By queries dominate our workload
• 10s of metrics, 10s of group bys• Hive scans data again and again from HDFS Slow
• Conviva GeoReport took ~ 24 hours using Hive
SELECT videoName, COUNT(1)FROM summariesWHERE date='2011_12_12' AND customer='XYZ’GROUP BY videoName;
Group By queries can be easily written in Sparkval sessions = sparkContext.sequenceFile[SessionSummary, NullWritable]( pathToSessionSummaryOnHdfs, classOf[SessionSummary], classOf[NullWritable]) .flatMap { case (key, val) => val.fieldsOfInterest }
val cachedSessions = sessions.filter( whereConditionToFilterSessionsForTheDesiredDay) .cacheval mapFn : SessionSummary => (String, Long) = { s => (s.videoName, 1) }
val reduceFn : (Long, Long) => Long = { (a,b) => a+b }
val results = cachedSessions.map(mapFn).reduceByKey(reduceFn).collectAsMap
Spark is blazing fast!• Spark keeps sessions of interest in RAM • Repeated group by queries are very fast.• Spark-based GeoReport runs in 45 minutes
(compared to 24 hours with Hive)
Spark queries require more code, but are not too hard to write
• Writing queries in Scala – There is a learning curve
• Type-safety offered by Scala is a great boon– Code completion via Eclipse Scala plugin
• Complex queries are easier to write in Scala than in Hive.– Cascading IF()s in Hive
Challenges in using Spark• Learning Scala• Always on the bleeding edge – getting
dependencies right• More tools required
Spark @ Conviva today• Using Spark for about 1 year• 30% of our reports use Spark, rest use Hive
– Analytics portal with canned Spark/Hive jobs
• More projects in progress– Anomaly detection– Interactive console to debug video quality issues– Near real-time analysis and decision making using Spark
• Blog Entry: http://www.conviva.com/blog/engineering/using-spark-and-hive-to-process-bigdata-at-conviva
We are [email protected]