Indexing & Retrieving Multimedia Resources with Elastic Search
September 21, 2016
Sandy Ingram Lead Dev. Engineer
© Klewel - 2016
Overview
• Presentation of Klewel
• Presentation of Triskel
• Overview of Data Model/Store
• ES: Terminology
• ES: Index Creation & Mapping
• ES: Indices vs Types
• ES: Nested Objects/Parent-Child Relations
• ES: Aggregations & Filters
• ES: Relevancy Scores
• ES: Multi-Word Queries
• ES: Updating the Index
• ES: What’s next for us?
Triskel: A Seamless Webcasting Solution
Among our Clients
Triskel
WISE
Triskel
Triskel: Cross-Platform
Old Capture Station
Capture Station (Heavy)
Recording Application:
• C# and C++
• Based on a multimedia capture SDK
• Records, encodes, detects slide changes, sends data via APIs
Triskel AV/Pro (Triskel "light")
No more station:
• PC or Mac
• Converters only (AJA, Blackmagic)
• Cheaper!
Recording Application:
• C++ with Qt
• Sends data via API
Triskel: From A to Z
Solution:
• Capture: Triskel AV/Pro, Triskel SC, Old Capture Station
• Processing: OCR, ASR, indexing, composite video, slide detection, recommendation
• Publishing: admin edition / public portal, multimodal search, customized widget
Front-end: Search Widget
https://portal.klewel.com/watch/webcast/ecovillage-2016/talk/4
Contextual Search Inside a conference
Backend: DataStore
• V0: XML file with the talk title and a slide list with timing and OCR text (search inside each talk)
• V1: Relational database (MySQL Server)
• V2 (since 2013):
 - Relational database as primary data store
 - Document-based storage and indexing (ElasticSearch with ES Python Library)
(Indexed) Data Model
Webcast (Name, Location, Date)
 - Talk1 (Title): ASR Transcript, Slide1 (OCR), Slide2 (OCR)
 - Talk2 (Title): ASR Transcript, Slide1 (OCR), Slide2 (OCR)
 - Talk3 (Title): ASR Transcript, Slide1 (OCR), Slide2 (OCR)
Our Relational DB is more complex
Elastic Search Terms
• Cluster of nodes (nodes = running instances)
• Analysis: conversion of full text into terms
• Index: has a mapping, shard(s) and replica(s). Like a relational DB with a schema definition
• Document: JSON object stored with an ID & field(s). Like a row in a relational DB
• Type: type of doc with defined fields. Like a table in a relational DB
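To make the "Analysis" bullet concrete, here is a toy sketch of turning full text into terms and collecting them in an inverted index. It is not Elasticsearch's analyzer: the stopword list and the function names `analyze` and `build_inverted_index` are assumptions made for illustration only.

```python
import re

# Hypothetical stopword list for this sketch; real analyzers ship their own.
STOPWORDS = {"a", "an", "and", "of", "the", "with"}


def analyze(text):
    """Lowercase the text, split on non-alphanumerics, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}
```

Searching then becomes a lookup of analyzed query terms in the index, which is why the same analysis has to be applied at index time and at query time.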
Creating an Index
• Create an index: create_index(index_name, INDEX_SETTINGS)
  INDEX_SETTINGS: customized analyzers, number of shards, replicas
• Define a mapping: put_mapping('conference_type', CONFERENCE_PROPS, index_name)
• Define Fields
'conferenceDate': {
    'type': 'string',
    'index': 'not_analyzed',
    'store': 'true',
},
'conferenceName': {
    'type': 'string',
    'index': 'analyzed',
    'analyzer': 'analyzer_en',
    'store': 'true',
},
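A minimal sketch of what INDEX_SETTINGS and the conference mapping might look like as plain Python dictionaries, in the pre-5.x request style the slide uses (`string` fields, `not_analyzed`). The definition of `analyzer_en` is not shown in the slides, so the one below is an assumption, as are the helper names. Nothing here talks to a cluster; the dictionaries would be passed to whichever client call creates the index.

```python
def build_index_settings(shards=1, replicas=0):
    """Index settings with one custom analyzer (its definition is assumed)."""
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
            "analysis": {
                "analyzer": {
                    "analyzer_en": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "stop"],
                    }
                }
            },
        }
    }


def build_conference_mapping():
    """Mapping for conference_type with the two fields shown on the slide."""
    return {
        "conference_type": {
            "properties": {
                "conferenceDate": {"type": "string", "index": "not_analyzed", "store": True},
                "conferenceName": {"type": "string", "index": "analyzed",
                                   "analyzer": "analyzer_en", "store": True},
            }
        }
    }


def undefined_analyzers(settings, mapping):
    """Return analyzers referenced by the mapping but missing from the settings."""
    defined = set(settings["settings"]["analysis"]["analyzer"])
    used = set()
    for type_body in mapping.values():
        for field in type_body["properties"].values():
            if "analyzer" in field:
                used.add(field["analyzer"])
    return used - defined
```

Running `undefined_analyzers` before creating the index is a cheap way to catch a mapping that references an analyzer the settings never define.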
Indices vs Types
• Data stored in shards
• One large index is better than multiple small indices
• Too many indices lead to an expensive merge of results across index shards
• Types are (not exactly) to an index what tables are to a database
• Lucene uses a single flat mapping for all types
• Choose types when they share the same schema
• Name fields and their types (e.g. str/int) consistently across types
• Klewel: 1 index + 4 types (conference_type, talk_type, asr_type, slide_type)
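Because all types in an index share one flat Lucene mapping, a field name that carries different data types in two types is a likely source of trouble. The check below is a hypothetical helper (not part of any Elasticsearch client) that scans per-type field definitions for such conflicts; the sample field names in the usage are invented.

```python
def find_field_conflicts(mappings):
    """Return fields declared with more than one data type across types.

    `mappings` maps a type name to {field_name: {"type": ...}}.
    The result maps each conflicting field to {data_type: [type names]}.
    """
    seen = {}
    for type_name, properties in mappings.items():
        for field, spec in properties.items():
            seen.setdefault(field, {}).setdefault(spec["type"], []).append(type_name)
    return {field: by_type for field, by_type in seen.items() if len(by_type) > 1}
```

For example, declaring `duration` as an integer in `talk_type` and as a string in `slide_type` would be reported, while a `talkId` that is a string in both would not.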
Nested Objects
https://www.elastic.co/guide/en/elasticsearch/guide/current/nested-objects.html
• Enable searching within a single nested object
• Almost as fast as having them in a single document
• Drawback: nested docs can't be created, updated, or deleted directly
• CRUD is atomic over a single document
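The point of searching within one nested object can be illustrated without a cluster. The sketch below simulates, in plain Python, how an array of inner objects behaves when it is flattened into per-field value lists versus when each object is matched on its own. The document layout (`slides`, `number`, `ocr`) and both function names are assumptions for illustration.

```python
def matches_flattened(doc, number, word):
    """Flattened view: field values of all slides are pooled together,
    so the number and the word may come from different slides."""
    numbers = [slide["number"] for slide in doc["slides"]]
    words = {w for slide in doc["slides"] for w in slide["ocr"].lower().split()}
    return number in numbers and word in words


def matches_nested(doc, number, word):
    """Nested view: both conditions must hold within the same slide."""
    return any(
        slide["number"] == number and word in slide["ocr"].lower().split()
        for slide in doc["slides"]
    )
```

With slide 1 reading "Elastic Search" and slide 2 reading "Kibana dashboards", the flattened check reports a match for "slide 1 containing kibana" while the nested check does not.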
Parent-Child Relationships
• Like nested objects, but in separate docs
• Stored in the same shard
• Updates/searches are independent
• Webcast = Parent, Talks = Children
es_conn.index(
    json.dumps(prepared_data),
    self.esIndex,
    TALK_TYPE,
    id=prepared_data['talkUniqueId'],
    parent=prepared_data['conferenceShortUUID'],
    bulk=bulk)
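A rough sketch of the two request bodies a parent-child setup of this kind typically involves in pre-5.x Elasticsearch: a child mapping that names its parent type, and a `has_child` query that finds parents through their children. The field name `talkTitle` and the function names are hypothetical; only the type names come from the slides.

```python
def talk_mapping():
    """Child mapping: talk_type documents declare conference_type as parent."""
    return {
        "talk_type": {
            "_parent": {"type": "conference_type"},
            "properties": {
                # Hypothetical field name, used only for this example.
                "talkTitle": {"type": "string"},
            },
        }
    }


def conferences_with_matching_talk(text):
    """Query body returning conferences that have at least one matching talk."""
    return {
        "query": {
            "has_child": {
                "type": "talk_type",
                "query": {"match": {"talkTitle": text}},
            }
        }
    }
```

Since a child is routed by its parent ID, the parent value passed at index time is what keeps a talk on the same shard as its webcast.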
Aggregations
"aggregations": {
  "talks": {
    "filter": { "type": { "value": "talk_type" }},
    "aggregations": {
      "by_conference": { "terms": { "field": "conferenceId" }}
    }
  },
  "slides": {
    "filter": { "type": { "value": "slide_type" }},
    "aggregations": {
      "by_talk": { "terms": { "field": "talkId" }}
    }
  }
}
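As a plain-Python analogue of a type filter combined with a terms aggregation, the sketch below counts matching documents per conference or per talk. It only mimics the bucket counts on an in-memory list; the helper name `count_by` and the sample documents in the usage are made up.

```python
from collections import Counter


def count_by(docs, doc_type, field):
    """Keep only docs of `doc_type` (the filter), then count them per value
    of `field` (the terms buckets)."""
    return Counter(doc[field] for doc in docs if doc["_type"] == doc_type)
```

Calling `count_by(hits, "talk_type", "conferenceId")` corresponds to the `talks` / `by_conference` pair, and `count_by(hits, "slide_type", "talkId")` to `slides` / `by_talk`.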
Query Context / Filter Context
• How well does a document match a search query? => Query Context
• What should and should not be included? => Filter Context
• Use queries for full-text search & relevance scores
• Multi-word queries: "minimum_should_match": "75%", "-25%", "2<-25% 9<-3"
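The three `minimum_should_match` forms can be hard to read at a glance. The function below is a sketch of how such a specification resolves to a required number of clauses, following my reading of the documented rules (percentages round down, negative values mean "this many may be missing", `N<spec` applies `spec` only when there are more than N clauses). The name `required_clauses` and the final clamping to the range 0..clause count are choices of this sketch, not the library's code.

```python
def required_clauses(clause_count, spec):
    """Resolve a minimum_should_match spec for `clause_count` optional clauses."""
    spec = spec.strip()
    result = clause_count
    if "<" in spec:
        # Conditional form, e.g. "2<-25% 9<-3": parts are checked left to right.
        for part in spec.split():
            bound, sub_spec = part.split("<", 1)
            if clause_count <= int(bound):
                return result
            result = required_clauses(clause_count, sub_spec)
        return result
    if spec.endswith("%"):
        percent = int(spec[:-1])
        share = abs(clause_count * percent) // 100  # rounded down
        result = clause_count - share if percent < 0 else share
    else:
        value = int(spec)
        result = clause_count + value if value < 0 else value
    return max(0, min(result, clause_count))
```

With four query words, both "75%" and "-25%" require three of them; with "2<-25% 9<-3", two words must both match, five words need four, and twelve words need nine.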
"query": { "match": { "title": { "query": "Elastic Search", "operator": "and" } } }
Filtering Private/Public Results
Context: Privacy at Conference Level
Add filters to query before execution
publicFilter = { "term": { "is_public": 1 }}
parentisPublicFilter = { "has_parent": { "type": "conference_type"
….}
publicFilter = { "or" : [parentisPublicFilter, publicFilter]}
if only_public == False: query["query"]["filtered"]["filter"] = publicFilter
(v2 => query["bool"]["filter"] = publicFilter)
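One possible shape for the public/private restriction in the 2.x `bool` style, written as dictionary builders so it can be inspected without a cluster. The condition placed inside `has_parent` is elided in the slides, so the `is_public` term used for the parent here is an assumption, as are the function names and the convention that `only_public=True` applies the filter.

```python
def public_filter():
    """Match docs that are public themselves or whose conference is public."""
    own_flag = {"term": {"is_public": 1}}
    parent_flag = {
        "has_parent": {
            "parent_type": "conference_type",
            # Assumed condition: the slides do not show the parent clause body.
            "query": {"term": {"is_public": 1}},
        }
    }
    return {"bool": {"should": [parent_flag, own_flag]}}


def restrict_to_public(query, only_public):
    """Wrap the scoring query in a bool query with a non-scoring filter."""
    if not only_public:
        return query
    return {"query": {"bool": {"must": query["query"], "filter": public_filter()}}}
```

Keeping the privacy check in filter context means it never influences the relevance score, and it can be cached independently of the full-text part.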
Theory Behind Relevance Scoring
• Select relevant docs:
 - Boolean model: "meetup" AND "elastic" AND ("lausanne" OR "zurich")
• Order relevant docs:
 - Default TF/IDF: computed for each query term in the document
 - Field-length norm: the longer the field, the lower the weight (term found in title vs. description)
 - Both TF and the field-length norm can be disabled
 - Multi-term query: one final (combined) score indicating how well a document matches
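A simplified sketch of the three per-term factors as they are commonly described for Lucene's classic TF/IDF similarity: term frequency, inverse document frequency, and the field-length norm. The full practical scoring function has further factors (query normalization, coordination, boosts) that are left out, and the function names are made up for this example.

```python
import math


def tf(freq):
    """Term frequency: more occurrences in the field, higher weight."""
    return math.sqrt(freq)


def idf(num_docs, doc_freq):
    """Inverse document frequency: rarer terms across the index weigh more."""
    return 1 + math.log(num_docs / (doc_freq + 1))


def field_norm(num_terms):
    """Field-length norm: the longer the field, the lower the weight."""
    return 1 / math.sqrt(num_terms)


def term_weight(freq, num_docs, doc_freq, num_terms):
    """Weight of one query term in one field of one document."""
    return tf(freq) * idf(num_docs, doc_freq) * field_norm(num_terms)
```

The same term found once in a 4-term title gets a higher weight than once in a 100-term description, which matches the title vs. description remark.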
Function Scores
• Applied to the docs returned for a query
• score_mode: how to combine function scores (min, max, first, multiply, sum)
• min_score: exclude docs scoring below a threshold
• Different types: field_value_factor, script_score, weight, random_score, decay functions
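For the field_value_factor type, the effect on a score can be sketched in a few lines: the field value is multiplied by `factor`, passed through the `modifier`, and the result is combined with the query score (multiplied, under the default boost mode). Only a few modifiers are covered, and the function names are assumptions of this sketch.

```python
import math

# Subset of modifiers, enough for the illustration.
MODIFIERS = {
    "none": lambda x: x,
    "sqrt": math.sqrt,
    "square": lambda x: x * x,
    "log1p": lambda x: math.log10(1 + x),
}


def field_value_factor(value, factor=1.0, modifier="none", missing=None):
    """Compute modifier(factor * value); fall back to `missing` if no value."""
    if value is None:
        if missing is None:
            raise ValueError("document has no value and no 'missing' default")
        value = missing
    return MODIFIERS[modifier](factor * value)


def boosted_score(query_score, value, **params):
    """Default combination: query score multiplied by the function value."""
    return query_score * field_value_factor(value, **params)
```

With a factor of 1.2 and the sqrt modifier, a numeric field value of 30 yields a multiplier of about 6, and a document without the field falls back to the `missing` value of 1.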
"field_value_factor": { "field": "talkAuthor", "factor": 1.2, "modifier": "sqrt", "missing": 1 }
Updating the Index
• Most edits happen after the upload and before "publishing" the webcast
• New conferences and talks are added regularly
• When a conference is "published" (user-triggered action): add talk metadata to the index
• Slide OCR is async; when it finishes: index the slides
• ASR is async; when it finishes: index the talk transcript
(Child objects can be updated independently of the talk or the conference!)
• https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html
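As an illustration of the "OCR finished, index the slides" step, the sketch below builds a bulk request payload (one action line plus one source line per slide) in the pre-5.x style with types and `_parent`. It assumes slides are children of their talk, which the slides do not state, and the field names `slideId` and `ocrText` and the function name are hypothetical. The payload is only built, not sent.

```python
import json


def slide_bulk_lines(index_name, talk_id, slides):
    """Build a newline-delimited bulk payload indexing each slide as a child."""
    lines = []
    for slide in slides:
        action = {
            "index": {
                "_index": index_name,
                "_type": "slide_type",
                "_id": slide["slideId"],
                "_parent": talk_id,  # assumed parent: the talk the slide belongs to
            }
        }
        source = {"talkId": talk_id, "ocrText": slide["ocrText"]}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The bulk format expects a trailing newline after the last line.
    return "\n".join(lines) + "\n"
```

Because each slide is its own document, this step can run whenever OCR completes without re-indexing the talk or the conference.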
Next
• Business need: usage analytics with Elastic Search and Kibana
• Improving search results
 - GUI: highlight relevant slides in the player (new HTML5)
 - Improve the ordering of search results