How Conviva used Spark to Speed Up Video Analytics by 25x Dilip Antony Joseph (@ DilipAntony )

How Conviva used Spark to Speed Up Video Analytics by 25xDilip Antony Joseph (@DilipAntony)

Conviva monitors and optimizes online video for premium content providers

We see 10s of millions of streams every day

Conviva data processing architecture

Hadoop for historical data

Live data processing

Spark

Reports

Ad-hoc analysis

Video Player

Monitoring Dashboard

Group By queries dominate our workload

• 10s of metrics, 10s of group bys• Hive scans data again and again from HDFS Slow

• Conviva GeoReport took ~ 24 hours using Hive

SELECT videoName, COUNT(1)FROM summariesWHERE date='2011_12_12' AND customer='XYZ’GROUP BY videoName;

Group By queries can be easily written in Sparkval sessions = sparkContext.sequenceFile[SessionSummary, NullWritable]( pathToSessionSummaryOnHdfs, classOf[SessionSummary], classOf[NullWritable]) .flatMap { case (key, val) => val.fieldsOfInterest }

val cachedSessions = sessions.filter( whereConditionToFilterSessionsForTheDesiredDay) .cacheval mapFn : SessionSummary => (String, Long) = { s => (s.videoName, 1) }

val reduceFn : (Long, Long) => Long = { (a,b) => a+b }

val results = cachedSessions.map(mapFn).reduceByKey(reduceFn).collectAsMap

Spark is blazing fast!• Spark keeps sessions of interest in RAM • Repeated group by queries are very fast.• Spark-based GeoReport runs in 45 minutes

(compared to 24 hours with Hive)

Spark queries require more code, but are not too hard to write

• Writing queries in Scala – There is a learning curve

• Type-safety offered by Scala is a great boon– Code completion via Eclipse Scala plugin

• Complex queries are easier to write in Scala than in Hive.– Cascading IF()s in Hive

Challenges in using Spark• Learning Scala• Always on the bleeding edge – getting

dependencies right• More tools required

Spark @ Conviva today• Using Spark for about 1 year• 30% of our reports use Spark, rest use Hive

– Analytics portal with canned Spark/Hive jobs

• More projects in progress– Anomaly detection– Interactive console to debug video quality issues– Near real-time analysis and decision making using Spark

• Blog Entry: http://www.conviva.com/blog/engineering/using-spark-and-hive-to-process-bigdata-at-conviva

http://www.conviva.com/blog/engineering/using-spark-and-hive-to-process-bigdata-at-conviva

http://www.conviva.com/blog/engineering/using-spark-and-hive-to-process-bigdata-at-conviva

We are [email protected]

Documents

How Conviva used Spark to Speed Up Video Analytics by 25x Dilip Antony Joseph (@ DilipAntony )