10
How Conviva used Spark to Speed Up Video Analytics by 25x Dilip Antony Joseph (@DilipAntony)

How Conviva used Spark to Speed Up Video Analytics by 25x Dilip Antony Joseph (@ DilipAntony )

  • Upload
    booker

  • View
    55

  • Download
    0

Embed Size (px)

DESCRIPTION

How Conviva used Spark to Speed Up Video Analytics by 25x Dilip Antony Joseph (@ DilipAntony ). Conviva monitors and optimizes online video for premium content providers. We see 10s of millions of streams every day. Conviva data processing architecture. Monitoring Dashboard. - PowerPoint PPT Presentation

Citation preview

Page 1: How Conviva used Spark to Speed Up Video Analytics by 25x Dilip Antony Joseph (@ DilipAntony )

How Conviva used Spark to Speed Up Video Analytics by 25xDilip Antony Joseph (@DilipAntony)

Page 2: How Conviva used Spark to Speed Up Video Analytics by 25x Dilip Antony Joseph (@ DilipAntony )

Conviva monitors and optimizes online video for premium content providers

We see 10s of millions of streams every day

Page 3: How Conviva used Spark to Speed Up Video Analytics by 25x Dilip Antony Joseph (@ DilipAntony )

Conviva data processing architecture

Hadoop for historical data

Live data processing

Spark

Reports

Ad-hoc analysis

Video Player

Monitoring Dashboard

Page 4: How Conviva used Spark to Speed Up Video Analytics by 25x Dilip Antony Joseph (@ DilipAntony )

Group By queries dominate our workload

• 10s of metrics, 10s of group bys• Hive scans data again and again from HDFS Slow

• Conviva GeoReport took ~ 24 hours using Hive

SELECT videoName, COUNT(1)FROM summariesWHERE date='2011_12_12' AND customer='XYZ’GROUP BY videoName;

Page 5: How Conviva used Spark to Speed Up Video Analytics by 25x Dilip Antony Joseph (@ DilipAntony )

Group By queries can be easily written in Sparkval sessions = sparkContext.sequenceFile[SessionSummary, NullWritable]( pathToSessionSummaryOnHdfs, classOf[SessionSummary], classOf[NullWritable]) .flatMap { case (key, val) => val.fieldsOfInterest }

val cachedSessions = sessions.filter( whereConditionToFilterSessionsForTheDesiredDay) .cacheval mapFn : SessionSummary => (String, Long) = { s => (s.videoName, 1) }

val reduceFn : (Long, Long) => Long = { (a,b) => a+b }

val results = cachedSessions.map(mapFn).reduceByKey(reduceFn).collectAsMap

Page 6: How Conviva used Spark to Speed Up Video Analytics by 25x Dilip Antony Joseph (@ DilipAntony )

Spark is blazing fast!• Spark keeps sessions of interest in RAM • Repeated group by queries are very fast.• Spark-based GeoReport runs in 45 minutes

(compared to 24 hours with Hive)

Page 7: How Conviva used Spark to Speed Up Video Analytics by 25x Dilip Antony Joseph (@ DilipAntony )

Spark queries require more code, but are not too hard to write

• Writing queries in Scala – There is a learning curve

• Type-safety offered by Scala is a great boon– Code completion via Eclipse Scala plugin

• Complex queries are easier to write in Scala than in Hive.– Cascading IF()s in Hive

Page 8: How Conviva used Spark to Speed Up Video Analytics by 25x Dilip Antony Joseph (@ DilipAntony )

Challenges in using Spark• Learning Scala• Always on the bleeding edge – getting

dependencies right• More tools required

Page 9: How Conviva used Spark to Speed Up Video Analytics by 25x Dilip Antony Joseph (@ DilipAntony )

Spark @ Conviva today• Using Spark for about 1 year• 30% of our reports use Spark, rest use Hive

– Analytics portal with canned Spark/Hive jobs

• More projects in progress– Anomaly detection– Interactive console to debug video quality issues– Near real-time analysis and decision making using Spark

• Blog Entry: http://www.conviva.com/blog/engineering/using-spark-and-hive-to-process-bigdata-at-conviva

Page 10: How Conviva used Spark to Speed Up Video Analytics by 25x Dilip Antony Joseph (@ DilipAntony )

We are [email protected]