
Multi-language Content Discovery Through Entity Driven Search: Presented by Alessandro Benedetti, Zaizi




Multi-language Content Discovery Through Entity Driven Search

Alessandro Benedetti

Search Consultant and R&D Software Engineer

Zaizi

http://uk.linkedin.com/in/alexbenedetti


Who I am

Alessandro Benedetti

Apache ManifoldCF committer

Search Consultant

R&D Software Engineer

Master in Computer Science

Information Retrieval Background

Semantic, NLP, Machine Learning Technologies Enthusiast

Beach Volleyball Player & Snowboarder


ZAIZI

Experienced at building and delivering a wide range of enterprise solutions across the whole information life cycle

Alfresco & Ephesoft certified Platinum Partner

Red Hat Enterprise Linux Ready Partner

R&D department specialising in Open Source Search Solutions

Alfresco Partner of the Year 2012 and 2013


Agenda

Context

Problem

Solution

Demo

What's upcoming


Zaizi R&D Department

Giving sense to the content

Enriching it semantically

Adding value to ECM/CMS

More structured content, easy to manage, link and search

Improving search

Across different domains, data sources, User Experience

Machine Learning applied research

Content Organization – Recommendation Systems


Enterprise Search Problems

Challenge: search within big and heterogeneous repositories

Heterogeneous data sources

Filesystems, DB, ECM/CMS, Email, …

Unstructured content in different formats

PDF, plain text, Word, …

Documents not linked to each other

Federated Search

across data sources

preserving permissions

centralized endpoint


Sensefy

Semantic Enterprise Search Engine

Federated Search

Evolved User Experience

Based on cutting-edge Open Source Frameworks


Architecture


Entity Driven Search

Moving from keywords to Entities

More understandable to humans

Process the unstructured text at indexing time

Enrich it

Build specific indexes

Use entities and concepts in searches

• Trying to foresee the concepts the user wants to express


What is an Entity in our domain?

Real world concepts

Linked Data resources

RDF (XML) structured data

• Unique identifier + properties

Stored in a Knowledge Base (Freebase, DBpedia, Custom Dataset)
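
The "unique identifier + properties" shape above can be pictured as a small record. This is only an illustrative sketch: the class name, the field names and the sample values are assumptions made for the example, not the schema used by Sensefy or by any knowledge base.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A Linked Data resource reduced to what search needs (hypothetical shape)."""
    uri: str                                         # unique identifier
    label: str                                       # human-readable name
    types: list = field(default_factory=list)        # entity types
    properties: dict = field(default_factory=dict)   # property name -> values


# Sample entity; the URI follows the usual DBpedia resource pattern.
ronaldo = Entity(
    uri="http://dbpedia.org/resource/Cristiano_Ronaldo",
    label="Cristiano Ronaldo",
    types=["Football Player"],
    properties={"nationality": ["Portuguese"]},
)
```

Keeping the URI as the key is what lets the same entity be recognised in documents written in different languages.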


Redlink

Semantic Cloud platform

Providing Software as a Service

Text analysis and Entity Linking using Knowledge Bases

Linked Data Publishing

Enterprise Data Linking

Open-Source based components


Indexing - NLP & Semantic Enrichment

Apache ManifoldCF custom processors/output connectors

From unstructured to structured

NLP Analysis

• POS Tagging

• Named Entity Recognition

• Entity Linking using Knowledge Bases

• Disambiguation

Indexing in specific Solr Collections

• Primary Index (documents)

• Entity Index

• Entity Types
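
A rough sketch of the last step, splitting enriched documents into the three collections. It assumes the NLP and entity-linking stages have already run and produced annotations; plain dictionaries stand in for the Solr collections, and all field names (`entity_ids`, `notable_type`, `occurrences`, ...) are hypothetical.

```python
from collections import defaultdict


def build_indexes(annotated_docs):
    """Split annotated documents into primary, entity and entity-type indexes.

    annotated_docs: iterable of (doc_id, text, annotations), where annotations
    is a list of (entity_uri, label, entity_type) tuples from entity linking.
    """
    primary, entities, entity_types = {}, {}, defaultdict(int)
    for doc_id, text, annotations in annotated_docs:
        # Primary index: the document plus the entities and types found in it.
        primary[doc_id] = {
            "id": doc_id,
            "text": text,
            "entity_ids": sorted({uri for uri, _, _ in annotations}),
            "entity_types": sorted({etype for _, _, etype in annotations}),
        }
        for uri, label, etype in annotations:
            # Entity index: one record per entity, counting its occurrences.
            record = entities.setdefault(
                uri, {"id": uri, "label": label, "notable_type": etype, "occurrences": 0}
            )
            record["occurrences"] += 1
            # Entity type index: one counter per type.
            entity_types[etype] += 1
    return primary, entities, dict(entity_types)


docs = [
    ("d1", "Ronaldo scored twice.",
     [("e:ronaldo", "Cristiano Ronaldo", "Football Player")]),
    ("d2", "Ronaldo and Messi met.",
     [("e:ronaldo", "Cristiano Ronaldo", "Football Player"),
      ("e:messi", "Lionel Messi", "Football Player")]),
]
primary_index, entity_index, type_index = build_indexes(docs)
```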


Search - Smart Autocomplete

Multi Phase suggestions

Closer to natural language query formulation

Named Entities

Entity Types

Document Titles


Smart Autocomplete – Named Entities

Infix suggestion (ron → Cristiano Ronaldo)

Fuzzy suggestion (cristinao → Cristiano Ronaldo)

Brief description of the suggested entity

Specific Solr index for the entities

• Schema (label, notable_type, occurrences...)

• Edge-Ngram token filtered label field

• Fuzzy queries with variable distance / classic queries to the label suggestion field
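
The two matching modes above can be imitated in plain Python to show why "ron" and "cristinao" both reach the same entity. This is a sketch of the idea only: it does not call Solr, the function names are made up for the example, and plain Levenshtein distance stands in for Solr's fuzzy matching.

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    """Prefixes of a token, as an edge n-gram filter would emit them."""
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]


def index_label(label):
    """Tokenize on whitespace, lowercase, and collect edge n-grams per token."""
    grams = set()
    for token in label.lower().split():
        grams.update(edge_ngrams(token))
    return grams


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def suggest(query, labels, max_edits=2):
    """Try the n-gram match first, then fall back to fuzzy matching per token."""
    q = query.lower()
    exact = [label for label in labels if q in index_label(label)]
    if exact:
        return exact
    return [
        label for label in labels
        if any(edit_distance(q, tok) <= max_edits for tok in label.lower().split())
    ]


labels = ["Cristiano Ronaldo", "Ronaldinho", "Lionel Messi"]
infix_hits = suggest("ron", labels)        # matches the token "ronaldo"
fuzzy_hits = suggest("cristinao", labels)  # two edits away from "cristiano"
```

Because the n-grams are built per token, a query can match the start of any word in the label, not just the start of the label.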


Smart Autocomplete – Entity Types

Infix suggestion (play → Football Player)

Fuzzy suggestion (foobtall → Football Team)

Multi-language (calcia → Calciatore [it] (Football Player [en]))

Multi-phase suggestion through properties (ital → football player nationality italian)

Specific Solr collection for the entity types

• SolrDocument is an entity type (type, occurrences, attributes, type hierarchy...)

• EdgeNgram token filtered type

• Multi-language suggestion highlight
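
A minimal sketch of the multi-language case: each entity type carries one label per language, and a prefix typed in any language returns the matched label together with the label in the display language. The data layout and the function name are assumptions for illustration.

```python
# Hypothetical entity type records with per-language labels.
ENTITY_TYPES = [
    {"type": "football_player",
     "labels": {"en": "Football Player", "it": "Calciatore"}},
    {"type": "football_team",
     "labels": {"en": "Football Team", "it": "Squadra di calcio"}},
]


def suggest_types(prefix, entity_types, display_lang="en"):
    """Return (matched label, its language, label in display_lang) tuples."""
    q = prefix.lower()
    hits = []
    for entity_type in entity_types:
        for lang, label in entity_type["labels"].items():
            # Prefix match on any token, like an edge n-gram filtered field.
            if any(token.startswith(q) for token in label.lower().split()):
                hits.append((label, lang, entity_type["labels"][display_lang]))
    return hits


italian_hits = suggest_types("calcia", ENTITY_TYPES)
english_hits = suggest_types("play", ENTITY_TYPES)
```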


Smart Autocomplete – configuration

Knowledge base for entity linking and dereferencing

• DBpedia, Freebase, Custom Dataset

Properties

• For each entity type of interest, LDPath is used to identify the property in the graph

Hierarchy

• All the sub-instances of a type automatically inherit their parent properties to ease the configuration
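
The inheritance rule can be sketched as a walk up the type hierarchy. The type names and property names below are invented for the example, and the LDPath expressions that would locate each property in the graph are left out.

```python
# Hypothetical configuration: each type's parent and its own properties.
TYPE_PARENT = {"football_player": "athlete", "athlete": "person", "person": None}
TYPE_PROPERTIES = {
    "person": ["nationality", "birth_date"],
    "athlete": ["sport"],
    "football_player": ["team", "position"],
}


def properties_for(entity_type, parents, own_properties):
    """Collect properties from the root ancestor down to the type itself."""
    chain, seen = [], set()
    current = entity_type
    while current is not None and current not in seen:  # guard against cycles
        seen.add(current)
        chain.append(current)
        current = parents.get(current)
    resolved = []
    for type_name in reversed(chain):
        for prop in own_properties.get(type_name, []):
            if prop not in resolved:
                resolved.append(prop)
    return resolved


player_properties = properties_for("football_player", TYPE_PARENT, TYPE_PROPERTIES)
```

With this rule, `nationality` only has to be configured once on the parent type to become available for every sub-type.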


Semantic Search

Search by Named Entity

• Ex. Give me all the documents related to Christian Bale

Search by Entity Type

• Ex. Give me all the documents about football players

Search by Entity Type + properties

• Ex. Give me all the documents about football players whose nationality is British

Query-time join: Entity / Entity Type collection → primary index
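
One way to express such a query-time join is Solr's join query parser, `{!join from=... to=... fromIndex=...}`. The sketch below only builds the query string; the collection name `entities` and the field names `id`, `entity_ids`, `notable_type` and `nationality` are assumptions, not the actual Sensefy schema.

```python
def join_query(from_index, from_field, to_field, inner_query):
    """Build a Solr join query: match inner_query in from_index, then map
    from_field values onto to_field in the collection being searched."""
    return f"{{!join from={from_field} to={to_field} fromIndex={from_index}}}{inner_query}"


# "Documents about football players whose nationality is British"
q = join_query(
    from_index="entities",      # hypothetical entity collection
    from_field="id",            # entity identifier in that collection
    to_field="entity_ids",      # field on primary documents listing entity ids
    inner_query='notable_type:"football player" AND nationality:british',
)
params = {"q": q, "rows": 10}   # would be sent to the primary index
```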


Semantic Facets

Dynamically calculated semantic facets based on types and entities from documents

Improve the navigation of results

Allow refined search through semantic information

Configurable custom layer on top of Solr faceting component
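
The counting behind a semantic facet can be sketched in a few lines: for the current result set, count how many documents mention each entity type (or entity). This is an in-memory illustration with hypothetical field names; in Solr the faceting component would compute these counts.

```python
from collections import Counter


def semantic_facets(results, field, limit=10):
    """Count, per facet value, the number of result documents containing it."""
    counts = Counter()
    for doc in results:
        counts.update(set(doc.get(field, [])))  # each value once per document
    return counts.most_common(limit)


results = [
    {"id": "d1", "entity_types": ["Football Player", "Football Team"]},
    {"id": "d2", "entity_types": ["Football Player"]},
    {"id": "d3", "entity_types": ["Actor"]},
]
type_facets = semantic_facets(results, "entity_types")
```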


Semantic More Like This

Search for similar documents based on Entities and Entity Types

Similarity function based on document meaning

Multi-language: based on concepts, not on text tokens

Solr More Like This on custom fields

Entity Frequency / Inverse Document Frequency

Entity Type Frequency / Inverse Document Frequency
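
To make the entity frequency / inverse document frequency idea concrete, here is a small stand-in that weights entity ids the way TF-IDF weights terms and compares documents by cosine similarity. It is not Solr's More Like This implementation: the smoothed IDF formula, the cosine measure and the sample ids are all assumptions chosen for the sketch.

```python
import math
from collections import Counter


def idf(entity, docs):
    """Smoothed inverse document frequency of an entity id (one common variant)."""
    df = sum(1 for entities in docs.values() if entity in entities)
    return math.log((len(docs) + 1) / (df + 1)) + 1.0


def vector(entities, docs):
    """Entity frequency times inverse document frequency, per entity id."""
    return {e: freq * idf(e, docs) for e, freq in Counter(entities).items()}


def cosine(a, b):
    dot = sum(weight * b.get(e, 0.0) for e, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def more_like_this(doc_id, docs, top_n=3):
    """Rank the other documents by similarity of their entity vectors."""
    source = vector(docs[doc_id], docs)
    scored = [
        (other, cosine(source, vector(entities, docs)))
        for other, entities in docs.items() if other != doc_id
    ]
    scored = [pair for pair in scored if pair[1] > 0]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Entity ids are language-independent, so an English and an Italian document
# about the same entities end up with overlapping vectors.
docs = {
    "en-1": ["e:ronaldo", "e:real_madrid", "e:ronaldo"],
    "it-1": ["e:ronaldo", "e:real_madrid"],
    "en-2": ["e:bale_actor", "e:batman"],
}
similar = more_like_this("en-1", docs)
```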


Live Demo



What's upcoming

Machine Learning components:

– Classification

– Topic annotation

– Clustering

Secured Entity Search

Image and Media searches

Advanced Geo-search

Personalized/collaborative search

Recommendations

Q&A

Advanced configurable Admin Dashboard


Any Questions?

Alessandro Benedetti

Search Consultant and R&D Software Engineer

Zaizi

Email: [email protected]

@Zaizi