Indexing & Retrieving Multimedia Resources with Elastic Search

September 21, 2016

Sandy Ingram, Lead Dev. Engineer

© Klewel - 2016

Overview

• Presentation of Klewel

• Presentation of Triskel

• Overview of Data Model/Store

• ES: Terminology

• ES: Index Creation & Mapping

• ES: Indices vs Types

• ES: Nested Objects/Parent-Child Relations

• ES: Aggregations & Filters

• ES: Relevancy Scores

• ES: Multi-Word Queries

• ES: Updating the Index

• ES: What’s next for us?

Triskel: A Seamless Webcasting Solution

Among our Clients

Triskel

WISE

Triskel: Cross-Platform

Old Capture Station

Capture Station (Heavy)

Recording Application:

• C# and C++

• Based on a multimedia capture SDK

• Records, encodes, detects slide changes, sends data via APIs

Triskel AV/Pro (Triskel « light »)

No more station:

• PC or Mac

• Converters only (AJA, Blackmagic)

• Cheaper!

Recording Application:

• C++ with Qt

• Sends data via API

Triskel: From A to Z

• Capture: Triskel AV/Pro, Triskel SC, Old Capture Station

• Processing: OCR, ASR, indexing, composite video, slide detection, recommendation

• Publishing: Admin Edition/Public Portal, multimodal search, customized widget solution

Front-end: Search Widget

https://portal.klewel.com/watch/webcast/ecovillage-2016/talk/4

Contextual Search Inside a conference

https://portal.klewel.com/watch/webcast/ecovillage-2016/talk/4

Backend: DataStore

• V0: XML file with talk title and slide list with timing and OCR text (search inside each talk)

• V1: Relational Database (MySQL Server)

• V2 (since 2013):

• Relational database as primary data store

• Document-based storage and indexing (ElasticSearch with ES Python Library)

(Indexed) Data Model

Webcast (Name, Location, Date)

• Talk1 (Title): Slide1 (OCR), Slide2 (OCR), ASR Transcript

• Talk2 (Title): Slide1, Slide2, ASR

• Talk3 (Title): Slide1, Slide2, ASR

Our Relational DB is more complex

Elastic Search Terms

• Cluster of nodes (nodes = running instances)

• Analysis: conversion of full text into terms

• Index: has a mapping, shard(s) and replica(s). Like a relational DB with schema definition

• Document: JSON object stored with ID & field(s). Like a row in a relational DB

• Type: type of doc with defined fields. Like a table in a relational DB

Creating an Index

• Create an index: create_index(index_name, INDEX_SETTINGS)

  INDEX_SETTINGS: customized analyzers, number of shards, replicas

• Define a mapping: put_mapping('conference_type', CONFERENCE_PROPS, index_name)

• Define fields:

  'conferenceDate': { 'type': 'string', 'index': 'not_analyzed', 'store': 'true' },

  'conferenceName': { 'type': 'string', 'index': 'analyzed', 'analyzer': 'analyzer_en', 'store': 'true' },
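
The settings and mapping above can be assembled as one create-index request body. This is a minimal sketch that only builds the JSON, without calling any client library. It uses the pre-5.0 mapping syntax shown on the slide (`string`, `not_analyzed`). The definition of `analyzer_en` and the shard/replica counts are assumptions, since the slides only name them.

```python
import json


def build_index_body(num_shards=1, num_replicas=1):
    """Return a create-index body: settings plus a mapping for conference_type."""
    settings = {
        "number_of_shards": num_shards,      # assumed value
        "number_of_replicas": num_replicas,  # assumed value
        "analysis": {
            "analyzer": {
                # Assumed definition: the slides name "analyzer_en" but do not define it.
                "analyzer_en": {"type": "standard", "stopwords": "_english_"}
            }
        },
    }
    conference_props = {
        # Exact-match field: indexed as a single term, not analyzed.
        "conferenceDate": {"type": "string", "index": "not_analyzed", "store": True},
        # Full-text field: analyzed with the custom English analyzer.
        "conferenceName": {
            "type": "string",
            "index": "analyzed",
            "analyzer": "analyzer_en",
            "store": True,
        },
    }
    return {
        "settings": settings,
        "mappings": {"conference_type": {"properties": conference_props}},
    }


body = build_index_body()
print(json.dumps(body, indent=2))
```

Every analyzer referenced in a field mapping has to be declared in the index settings, which is why both parts are built together here.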

Indices vs Types

• Data stored in shards

• One large index is better than multiple small indices

• Too many indices lead to an expensive merge of results across index shards

• Types are (not exactly) to an index what tables are to a database

• Lucene uses a single flat mapping for all types

• Choose types when they share the same schema

• Name fields and field types (e.g. str/int) consistently across types

• Klewel: 1 index + 4 types (conference_type, talk_type, asr_type, slide_type)

Nested Objects

https://www.elastic.co/guide/en/elasticsearch/guide/current/nested-objects.html

• Enables searching within one nested object

• Almost as fast as having them in a single document

• Drawback: can't CRUD nested docs directly

• CRUD is atomic over a single document
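
Why "searching within one nested object" matters can be shown without a cluster. The sketch below mimics, in plain Python, how an array of inner objects is flattened into multi-value fields when it is not mapped as nested, and compares that with matching each inner object separately. The talk document and its field names are hypothetical.

```python
def flatten(doc, field):
    """Pool the values of all inner objects into multi-value fields."""
    flat = {}
    for inner in doc[field]:
        for key, value in inner.items():
            flat.setdefault(f"{field}.{key}", []).append(value)
    return flat


def match_flattened(doc, field, conditions):
    """Each condition is checked against the pooled values of all inner objects."""
    flat = flatten(doc, field)
    return all(v in flat.get(f"{field}.{k}", []) for k, v in conditions.items())


def match_nested(doc, field, conditions):
    """All conditions must hold inside the same inner object."""
    return any(
        all(inner.get(k) == v for k, v in conditions.items())
        for inner in doc[field]
    )


# Hypothetical talk with two slides.
talk = {
    "title": "ES Meetup",
    "slides": [
        {"number": 1, "keyword": "kibana"},
        {"number": 2, "keyword": "lucene"},
    ],
}
conditions = {"number": 1, "keyword": "lucene"}  # no single slide satisfies both

print(match_flattened(talk, "slides", conditions))  # True (false positive)
print(match_nested(talk, "slides", conditions))     # False
```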

Parent-Child Relationships

• Like nested objects but in separate docs

• Stored in the same shard

• Updates/searches are independent

• Webcast = parent, talks = children

es_conn.index(
    json.dumps(prepared_data),
    self.esIndex,
    TALK_TYPE,
    id=prepared_data['talkUniqueId'],
    parent=prepared_data['conferenceShortUUID'],
    bulk=bulk)
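
Children end up in the same shard as their parent because the parent ID is used as the routing value when a child is indexed. The sketch below illustrates that idea only. `zlib.crc32` is a stand-in for the hash function Elasticsearch really uses, and the shard count and IDs are hypothetical.

```python
import zlib

NUM_SHARDS = 5  # hypothetical number of primary shards


def shard_for(routing_value, num_shards=NUM_SHARDS):
    """Map a routing value to a shard number (crc32 is a stand-in hash)."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards


def route(doc_id, parent_id=None):
    """A child is routed by its parent's ID, a parent by its own ID."""
    return shard_for(parent_id if parent_id is not None else doc_id)


conference_shard = route("conf-42")
talk_shards = [route(f"talk-{n}", parent_id="conf-42") for n in range(1, 4)]
print(conference_shard, talk_shards)
```

Because every talk is routed with the conference ID, all three talks land on the conference's shard, whatever their own IDs are.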

Aggregations

"aggregations": {
  "talks": {
    "filter": { "type": { "value": "talk_type" }},
    "aggregations": {
      "by_conference": { "terms": { "field": "conferenceId" }}
    }
  },
  "slides": {
    "filter": { "type": { "value": "slide_type" }},
    "aggregations": {
      "by_talk": { "terms": { "field": "talkId" }}
    }
  }
}
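
A terms aggregation nested under a filter aggregation comes back as a list of buckets, each with a `key` and a `doc_count`. The helper below is a small sketch of how such a response could be turned into per-conference and per-talk counts. The sample response is made up for illustration and is not real Klewel data.

```python
def bucket_counts(response, agg_name, sub_agg_name):
    """Return {bucket key: doc_count} for a terms agg nested under a filter agg."""
    buckets = response["aggregations"][agg_name][sub_agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


# Made-up response with the shape returned for filter + terms aggregations.
sample_response = {
    "aggregations": {
        "talks": {
            "doc_count": 5,
            "by_conference": {"buckets": [
                {"key": "conf-a", "doc_count": 3},
                {"key": "conf-b", "doc_count": 2},
            ]},
        },
        "slides": {
            "doc_count": 40,
            "by_talk": {"buckets": [
                {"key": "talk-1", "doc_count": 25},
                {"key": "talk-2", "doc_count": 15},
            ]},
        },
    }
}

talks_per_conference = bucket_counts(sample_response, "talks", "by_conference")
slides_per_talk = bucket_counts(sample_response, "slides", "by_talk")
print(talks_per_conference, slides_per_talk)
```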

Query Context / Filter Context

• How well does a document match a search query? => Query Context

• What should and should not be included? => Filter Context

• Use queries for full-text search & relevance scores

• Multi-word queries: "minimum_should_match": "75%", "-25%", "2<-25% 9<-3"
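
The three `minimum_should_match` forms read more easily once evaluated for a few query lengths. The function below is a simplified re-implementation of the documented rules, not the library's own code: percentages are rounded down, negative values say how many clauses may be missing, and `N<spec` applies `spec` only when there are more than `N` optional clauses. It assumes no spaces around `<`.

```python
def _simple(total, spec):
    """Evaluate an integer or percentage spec for `total` optional clauses."""
    if spec.endswith("%"):
        percent = int(spec[:-1])
        magnitude = abs(total * percent) // 100  # percentage rounded down
        result = total - magnitude if percent < 0 else magnitude
    else:
        value = int(spec)
        result = total + value if value < 0 else value
    return max(0, min(result, total))  # clamp to a usable range


def min_should_match(total, spec):
    """Number of optional clauses that must match, given `total` clauses."""
    spec = spec.strip()
    if "<" not in spec:
        return _simple(total, spec)
    result = total  # at or below the first bound, all clauses are required
    for part in spec.split():
        bound, sub_spec = part.split("<", 1)
        if total <= int(bound):
            return result
        result = _simple(total, sub_spec)
    return result


for n in (2, 4, 10):
    print(n, min_should_match(n, "75%"), min_should_match(n, "-25%"),
          min_should_match(n, "2<-25% 9<-3"))
```

With `"2<-25% 9<-3"`, a 2-word query requires both words, a 4-word query requires 3, and a 10-word query requires 7.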

"query": { "match": { "title": { "query": "Elastic Search", "operator": "and" } } }

Filtering Private/Public Results

Context: Privacy at Conference Level

Add filters to query before execution

publicFilter = { "term": { "is_public": 1 }}

parentisPublicFilter = { "has_parent": { "type": "conference_type" ….}

publicFilter = { "or" : [parentisPublicFilter, publicFilter]}

if only_public == False: query["query"]["filtered"]["filter"] = publicFilter

(v2 => query["bool"]["filter"] = publicFilter)
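
A sketch of the same idea in the 2.x `bool` form, building only the request body. The function name, the `include_private` flag and the inner query of `has_parent` are assumptions (the slide elides the body of `has_parent`). The "or" filter is written as a `bool` with two `should` clauses, of which at least one has to match.

```python
def build_search(text, include_private=False):
    """Build a title search, restricted to public results unless told otherwise."""
    is_public = {"term": {"is_public": 1}}
    # Assumption: a document also counts as public when its parent conference is.
    parent_is_public = {
        "has_parent": {"parent_type": "conference_type", "query": is_public}
    }
    query = {
        "bool": {
            # Query context: contributes to the relevance score.
            "must": {"match": {"title": {"query": text, "operator": "and"}}}
        }
    }
    if not include_private:
        # Filter context: include/exclude only, no effect on the score.
        query["bool"]["filter"] = {
            "bool": {"should": [parent_is_public, is_public]}
        }
    return {"query": query}


public_search = build_search("Elastic Search")
admin_search = build_search("Elastic Search", include_private=True)
print(public_search)
```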

Theory Behind Relevance Scoring

• Select relevant docs:

- Boolean model: "meetup" AND "elastic" AND ("lausanne" OR "zurich")

• Order relevant docs:

- Default TF/IDF: computed for each query term in the document

- Field-length norm: the longer the field, the lower the weight (term found in title vs description)

- Both TF and field-length norms can be disabled

- Multi-term query: one final (combined) score indicating how well a document matches
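
The three ingredients can be tried out with the classic Lucene formulas: tf = sqrt(frequency), idf = 1 + ln(numDocs / (docFreq + 1)), norm = 1 / sqrt(number of terms in the field). This is a toy sketch that simply sums the per-term weights. The full scoring function has further factors (query normalization, coordination, boosts) that are left out, and the corpus statistics below are invented.

```python
import math


def term_score(term, field_tokens, num_docs, doc_freq):
    """Weight of one query term in one field: tf * idf * field-length norm."""
    freq = field_tokens.count(term)
    if freq == 0:
        return 0.0
    tf = math.sqrt(freq)                             # more occurrences, higher weight
    idf = 1.0 + math.log(num_docs / (doc_freq + 1))  # rarer term, higher weight
    norm = 1.0 / math.sqrt(len(field_tokens))        # longer field, lower weight
    return tf * idf * norm


def query_score(terms, field_tokens, num_docs, doc_freqs):
    """Combine per-term weights into one score (simplified: plain sum)."""
    return sum(term_score(t, field_tokens, num_docs, doc_freqs[t]) for t in terms)


title = "elastic search meetup".split()
description = "notes from the elastic search meetup held in lausanne".split()
doc_freqs = {"elastic": 10, "meetup": 4}  # invented document frequencies

title_score = query_score(["elastic", "meetup"], title, 100, doc_freqs)
description_score = query_score(["elastic", "meetup"], description, 100, doc_freqs)
print(title_score, description_score)
```

The same two terms score higher in the 3-word title than in the 9-word description, which is the field-length norm at work.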

Function Scores

• Applied to the docs returned for a query

• score_mode: how to combine scores (min, max, first, multiply, sum)

• min_score: exclude docs scoring below a threshold

• Different types: field_value_factor, script_score, weight, random_score, decay
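
For field_value_factor, the function value is the modifier applied to `factor * field value`, and `missing` replaces the value when a document lacks the field. The sketch below works through that arithmetic with factor 1.2 and modifier sqrt, multiplies it into the query score, then drops hits below min_score. The numeric field `views` and the hits are hypothetical.

```python
import math

MODIFIERS = {"none": lambda x: x, "sqrt": math.sqrt, "square": lambda x: x * x}


def field_value_factor(doc, field, factor=1.0, modifier="none", missing=None):
    """Return modifier(factor * doc[field]), using `missing` if the field is absent."""
    value = doc.get(field, missing)
    if value is None:
        raise KeyError(f"document has no field {field!r} and no 'missing' value")
    return MODIFIERS[modifier](factor * value)


def rescore(hits, min_score=0.0, **fvf_params):
    """Multiply each query score by the function value, then apply min_score."""
    rescored = [
        (doc_id, score * field_value_factor(doc, **fvf_params))
        for doc_id, score, doc in hits
    ]
    kept = [hit for hit in rescored if hit[1] >= min_score]
    return sorted(kept, key=lambda hit: hit[1], reverse=True)


# Hypothetical hits: (id, query score, document source).
hits = [
    ("talk-1", 2.0, {"views": 30}),
    ("talk-2", 2.5, {}),  # no "views" field: falls back to missing=1
    ("talk-3", 0.1, {}),
]
ranked = rescore(hits, min_score=1.0, field="views", factor=1.2,
                 modifier="sqrt", missing=1)
print(ranked)
```

Here talk-1 overtakes talk-2 despite a lower query score, and talk-3 is excluded by min_score.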

"field_value_factor": { "field": "talkAuthor", "factor": 1.2, "modifier": "sqrt", "missing": 1 }

Updating the Index

• Most edits happen after the upload and before "publishing" the webcast

• New conferences and talks are added regularly

• When a conference is "published" (user-triggered action): add talk metadata to the index

• Slide OCR is async; when finished: index slides

• ASR is async; when finished: index talk transcript

(Child objects can be updated independently from the talk or the conference!)

• https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html
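
The update flow can be sketched as three event handlers writing to the index at different times. Everything below is illustrative: `InMemoryIndex` is a stand-in for the real index, the handler names and document IDs are hypothetical, and attaching slides and transcripts to the talk as their parent is an assumption.

```python
class InMemoryIndex:
    """Stand-in for a search index: documents keyed by (type, id)."""

    def __init__(self):
        self.docs = {}

    def index(self, doc_type, doc_id, source, parent=None):
        self.docs[(doc_type, doc_id)] = {"parent": parent, "source": dict(source)}

    def update(self, doc_type, doc_id, partial):
        """Merge fields into an existing document, like a partial update."""
        self.docs[(doc_type, doc_id)]["source"].update(partial)


def on_published(index, conference_id, conference, talks):
    """User-triggered: index the conference and its talk metadata."""
    index.index("conference_type", conference_id, conference)
    for talk_id, talk in talks.items():
        index.index("talk_type", talk_id, talk, parent=conference_id)


def on_ocr_finished(index, talk_id, slide_texts):
    """Async: index one slide document per OCR result."""
    for number, text in enumerate(slide_texts, start=1):
        index.index("slide_type", f"{talk_id}-slide-{number}", {"ocr": text},
                    parent=talk_id)


def on_asr_finished(index, talk_id, transcript):
    """Async: index the talk transcript."""
    index.index("asr_type", f"{talk_id}-asr", {"transcript": transcript},
                parent=talk_id)


es = InMemoryIndex()
on_published(es, "conf-1", {"conferenceName": "ES Meetup"},
             {"talk-1": {"title": "Indexing"}})
talk_before = dict(es.docs[("talk_type", "talk-1")]["source"])
on_ocr_finished(es, "talk-1", ["Overview", "Data model"])
on_asr_finished(es, "talk-1", "welcome everyone")
es.update("talk_type", "talk-1", {"title": "Indexing multimedia"})
print(sorted(es.docs))
```

Indexing slides or a transcript never rewrites the talk document, and editing the talk title leaves the children untouched.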

Next

• Business need: usage analytics with Elastic Search and Kibana

• Improving search results

- GUI: highlight relevant slides in the player (new HTML5)

- Improve the ordering of search results
