Taxonomical Semantical Magical Search
OpenSource Connections
Doug Turnbull, Relevance Lead
dturnbull@o19s.com
@softwaredoug
© OpenSource Connections, 2017
Solr/ES consulting: team 100% focused on relevance
Learn to rank – semantic search – relevance – personalization – findability
Who are we?
Reflect: What problem are you trying to solve when you jump to 'semantic search'?
"We studied spontaneous word choice for objects in five application-related domains, and found the variability to be surprisingly large. In every case two people favored the same term with probability <0.20. "
"Simulations show how this fundamental property of language limits the success of various design methodologies for vocabulary-driven interaction. "
Solve with keyword stuffing?
- Content creators guarantee every "shoe" has a "shoe" keyword somewhere!
- And every wing-tip mentions dress shoes…
- ...Ad infinitum…
Solve with tagging?
- Java is a type of JVM language. Should this be tagged JVM too? What is a "query string"? Which of these tags is useful for search?
- Who tags everything? Is it consistent? What are the rules?
(taken from Stackoverflow)
Solve with synonyms?
Yes! Synonyms can be a tool that can help us. But it's easy to mess up:
shoes => dress shoes
wing tips,shoes
tennis shoes,shoes
When I search for tennis shoes, why do I get wing tips; why do I get dresses?!?
Talking teaches/reminds vocab (Searching)
shoes dress shoes brown wing tips
Searcher learning: results give clues to help the shopper refine further
Searcher trusting: more confident on terms to use
Searcher uncertain: uses broad queries to experiment
Searchers get more specific...
wing tips
Hierarchy of Ideas:
NP (item): "wing tips"
type_of:"dress shoes"
type_of:"shoe"
shoes
NP(item): "shoe"
More specific
… and try types of modifiers
wing tips
NP (item): "wing tips"
type_of:"dress shoes"
type_of:"shoe"
sapphire wing tips
NP (item): "wing tips"
type_of:"dress shoes"
type_of:"shoe"
ADJ (color) "sapphire"
type_of:"blue"
Semantic search: enable semantic exploration
Low term specificity: search term specifies a wide category to explore
Searching for "shoes"
High term specificity: search term too specific, try semantically broader/similar items
"Show 'dress shoes' for 'oxfords' "
Make Solr grok type-of relationships
"wing tip" is a type of "dress shoe" is a type of "shoe"
Search here, only show wing tips
Search here, show all things that are a type-of shoe
Beyond the actual terms used in docs
Per-entity terms: a taxonomy
Shoes
Athletic Shoes
Dress Shoes
High Heels
Oxfords
Wing Tips
Running Shoes
Tennis Shoes
Blue Sapphire
Sky blue
A search taxonomy (not the taxonomy for your site nav)
Index-time tax. expansion
Item
Color
Size
Substrings -> Entities
Expand to broad/narrow
tennis shoes => footwear\shoes\athletic\tennis_shoes
sapphire => blue\sapphire
In Solr...
Item
Color
Size
Possible to build from simple keepwords
Query- or index-time synonyms use the TF*IDF of the concept
Substrings -> Entities
Expand to broad/narrow
tennis shoes => tennis_shoes,athletic_shoes,shoes,...
sapphire => sapphire,blue
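A minimal sketch of how rules like the two above could be generated from a child-to-parent taxonomy instead of being written by hand. The `PARENT` map and both function names are hypothetical, made up for illustration; only the output format follows the slide.

```python
# Hypothetical child -> parent map mirroring the slide's shoe and color examples.
PARENT = {
    "tennis shoes": "athletic shoes",
    "athletic shoes": "shoes",
    "sapphire": "blue",
}


def ancestors(term):
    """Return [term, parent, grandparent, ...] by walking up the taxonomy."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def synonym_rule(term):
    """Build one explicit mapping: narrow term => itself plus its broader concepts."""
    targets = ",".join(t.replace(" ", "_") for t in ancestors(term))
    return f"{term} => {targets}"


for term in ("tennis shoes", "sapphire"):
    print(synonym_rule(term))
```

Emitting the narrow term first and the broadest concept last keeps each rule readable as a path from specific to general.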
In Solr, index time...
(Input Text) You will love these maroon dress shoes
(tokenization & maybe stemming) [you] [will] [love] [these] [maroon] [dress] [shoes]
compound/decompound (syn filter) [you] [will] [love] [these] [maroon] [dress_shoes]
Keepwords for entity [dress_shoes]
Semantic expansion (syn filter) [dress_shoes] [shoes]
(Input Text) You will love these maroon dress shoes
(tokenization & maybe stemming) [you] [will] [love] [these] [maroon] [dress] [shoes]
compound/decompound (syn filter) [you] [will] [love] [these] [maroon] [dress_shoes]
Keepwords for entity [maroon]
Semantic expansion (syn filter) [maroon] [brown]
"Item" copy field
"Color" copy field
Index time solution
(Input Text) brown wing tips
(Item analyzer output) [wing_tips] [dress_shoes] [shoes]
(Input Text) brown wing tips
(Color analyzer output) [brown]
Matches maroon, because at index time: maroon => brown, maroon
IDF: highest for wing_tips, lowest for shoes (eliminate TF? norms?)
q=brown wing tips&defType=edismax&sow=false&qf=item^100 color^10
(you'll want to search more than these semantic fields)
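The index-time analysis chains above (tokenize, compound, keepwords, semantic expansion) can be imitated in a few lines of Python to see what each semantic field ends up holding. This is a toy stand-in for Solr analyzers, not Solr configuration; the `COMPOUNDS`, `KEEP`, and `EXPAND` tables are hypothetical and contain only the slides' example terms.

```python
import re

# Hypothetical lookup tables holding only the terms used on the slides.
COMPOUNDS = {("dress", "shoes"): "dress_shoes", ("wing", "tips"): "wing_tips"}
KEEP = {
    "item": {"dress_shoes", "wing_tips", "shoes"},
    "color": {"maroon", "brown"},
}
EXPAND = {
    "dress_shoes": ["dress_shoes", "shoes"],
    "wing_tips": ["wing_tips", "dress_shoes", "shoes"],
    "maroon": ["maroon", "brown"],
}


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def compound(tokens):
    """Join known two-word phrases into a single token."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in COMPOUNDS:
            out.append(COMPOUNDS[pair])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def analyze(text, field):
    """Keep only this field's entities, then expand each to its broader concepts."""
    kept = [t for t in compound(tokenize(text)) if t in KEEP[field]]
    expanded = []
    for token in kept:
        expanded.extend(EXPAND.get(token, [token]))
    return expanded


print(analyze("You will love these maroon dress shoes", "item"))
print(analyze("You will love these maroon dress shoes", "color"))
```

Running the same function over the query text "brown wing tips" gives the item and color analyzer outputs listed on the "Index time solution" slide.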
Query-time tax. expansion
How do users think of your items?
Item
Color
Size
Trained/built From Query logs
Substrings -> Entities
Expand to broad/narrow
tennis shoes => item:"tennis shoes" OR item:"athletic shoes" OR item:"shoes" ...
sapphire => color:blue OR color:sapphire
sapphire tennis shoes
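A rough sketch of the query-time expansion above: find known phrases in the query and rewrite each as an OR group over its own term and broader concepts. The `TAXONOMY` table and `expand_query` are hypothetical names; real entity detection would be trained from query logs as the slide says, and plain substring matching is a simplification (it can match inside longer words).

```python
# Hypothetical query-side taxonomy: phrase -> (field, concepts to search).
TAXONOMY = {
    "tennis shoes": ("item", ["tennis shoes", "athletic shoes", "shoes"]),
    "sapphire": ("color", ["blue", "sapphire"]),
}


def expand_query(user_query):
    """Return one OR group per recognized phrase, in query order."""
    text = user_query.lower()
    found = []
    # Try longer phrases first so "tennis shoes" wins over any shorter entry.
    for phrase in sorted(TAXONOMY, key=len, reverse=True):
        pos = text.find(phrase)
        if pos >= 0:
            field, concepts = TAXONOMY[phrase]
            group = " OR ".join(f'{field}:"{c}"' for c in concepts)
            found.append((pos, group))
            # Blank out the matched span so shorter phrases cannot re-match it.
            text = text[:pos] + " " * len(phrase) + text[pos + len(phrase):]
    return [group for _, group in sorted(found)]


for group in expand_query("sapphire tennis shoes"):
    print(group)
```

How the groups are combined (AND, boosts, or separate boost queries) is left to the caller, since the slides do not prescribe it.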
Query Phrase In Solr...
(Input Text) Brown wing tips
Semantic expansion (syn filter) [wing tips] [dress shoes] [shoes]
(Input Text) Brown wing tips
Semantic expansion (syn filter) [brown] [maroon]
ItemSemanticAnalyzer
Color SemanticAnalyzer
Transform to description:("dress shoes" OR "wing tips" OR shoes OR maroon OR brown)
Problems:
- Two query analyzers for the same field are not possible in Solr
- Can't re-tokenize [dress shoes] -> "dress shoes" phrase query
Match Query Parser
https://github.com/o19s/match-query-parser
q=brown wing tips&defType=edismax&qf=description title
&bq={!match analyze_as=item_tax search_with=phrase qf=description v=$q}^100
&bq={!match analyze_as=color_tax search_with=phrase qf=description v=$q}
How to analyze query string
Phrase: retokenize multi word tokens and do phrase search
Other building blocks
Auto Phrase Token Filter / Query Auto Filtering:
- https://github.com/lucidworks/auto-phrase-tokenfilter
- https://lucidworks.com/2015/02/17/introducing-query-autofiltering/
Health-on-net Lucene Synonyms:
- https://github.com/healthonnet/hon-lucene-synonyms
Sematext Query Segmenter:
- https://github.com/sematext/query-segmenter
Shopping 24 Bmax Query Parser:
- https://github.com/shopping24/solr-bmax-queryparser
Deriving Querqy rules from taxonomies
https://github.com/renekrie/querqy
Query Time vs Index Time
Query Time:
PROS
- No need to reindex when updating managed vocab
CONS
- Relevance scoring of terms (boosts help)
- Complex / slow queries
Index Time:
PROS
- TF*IDF more accurate scoring (broad concepts score low, narrow score high)
- Faster queries
CONS
- Reindexing for synonym changes
Structure your docs for query understanding
Relevance engineer's challenge:
- Where can we begin with a taxonomy?
  - Reuse filters & facets
  - Reuse your page's navigational taxonomy?
  - Track which searches land on pages (old school click tracking)?
  - Zero results tracking?
- How do we incentivize content creators to move away from keyword stuffing and toward organizing content around a search keyword taxonomy?
- Finally: we don't care about the source data model, only what helps users find things
SHReC Algorithm
Simple doc frequency in-content to look for super-concepts / sub-concepts
term/phrase x subsumes y (x parent concept?) when:
df(x) > df(y)
df(x ∧ y) / df(y) >= α (α = 1 complete subsumption)
SHReC Algorithm Example
[Diagram: Shoes, Wing Tips]
df("shoes") > df("wing tips")
df("shoes" ∧ "wing tips") / df("wing tips") >= 0.8
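The two conditions above translate directly into a small predicate. This is a sketch of the test as the slides state it, not a reference implementation; the function name and the document-frequency numbers in the usage lines are made up for illustration, and the default `alpha` of 0.8 is taken from the example.

```python
def subsumes(df_x, df_y, df_xy, alpha=0.8):
    """Return True when x looks like a parent concept of y.

    df_x  -- number of docs containing x
    df_y  -- number of docs containing y
    df_xy -- number of docs containing both x and y
    alpha -- required share of y's docs that also contain x (1.0 = complete subsumption)
    """
    if df_y == 0:
        return False  # y never occurs, so there is nothing to subsume
    return df_x > df_y and df_xy / df_y >= alpha


# Invented counts: "shoes" in 1000 docs, "wing tips" in 50, both together in 45.
print(subsumes(df_x=1000, df_y=50, df_xy=45))   # shoes over wing tips
print(subsumes(df_x=50, df_y=1000, df_xy=45))   # wing tips over shoes
```

With these counts 45 of the 50 "wing tips" docs also mention "shoes" (0.9), so the first call passes both conditions and the second fails the df(x) > df(y) check.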
SHReC Algorithm with Solr
[Diagram: Shoes, Wing Tips]
df("shoes") > df("wing tips")
df("shoes" ∧ "wing tips") / df("wing tips") >= 0.8
Cache doc freq (q=*:*&facet.field=item&facet=true)
q=item:"wing tips" AND item:shoes, num results
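A sketch of building the two requests above as URL query strings. The helper names are hypothetical, and `rows=0` is an addition not on the slide: it asks Solr to return only counts (the facet counts and the response's `numFound`), not documents. Sending the request and reading the response are left out.

```python
from urllib.parse import parse_qs, urlencode


def term_clause(field, term):
    """Quote multi-word terms so they are searched as a phrase."""
    return f'{field}:"{term}"' if " " in term else f"{field}:{term}"


def doc_freq_params(field="item"):
    """Facet over every document to get per-term doc frequencies (cache these)."""
    return urlencode({"q": "*:*", "rows": 0, "facet": "true", "facet.field": field})


def cooccurrence_params(x, y, field="item"):
    """Count docs containing both terms; numFound of this query is df(x AND y)."""
    q = f"{term_clause(field, x)} AND {term_clause(field, y)}"
    return urlencode({"q": q, "rows": 0})


# Decode again to show the query Solr would receive.
print(parse_qs(cooccurrence_params("wing tips", "shoes"))["q"][0])
```

The facet request runs once and can be cached; only the co-occurrence query has to run per candidate pair.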
Unfortunately reality is messy
Your data probably looks like:
[Diagram: Shoes, Wing Tips]
Idea: mine another corpus?
Shoes Wing Tips
● but still, what phrases do you test?
Statistically sig. collocations
Wing Tips WingTips
Student t-test against null hypothesis that wing / tips unrelated
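A minimal sketch of the t-test for a two-word collocation, using the common textbook formulation where each bigram position is a Bernoulli trial. The slides do not give the formula, so this form is an assumption, as are the function name and the counts in the usage lines. A large positive score argues against the null hypothesis that the two words occur independently; 2.576 is a frequently used critical value.

```python
import math


def collocation_t_score(count_w1, count_w2, count_bigram, total):
    """t statistic for H0: w1 and w2 occur independently of each other.

    count_w1, count_w2 -- occurrences of each word
    count_bigram       -- occurrences of "w1 w2" as adjacent words
    total              -- number of tokens (approximately the number of bigrams)
    """
    if count_bigram <= 0:
        raise ValueError("the bigram must occur at least once")
    observed = count_bigram / total                       # sample mean
    expected = (count_w1 / total) * (count_w2 / total)    # mean under H0
    variance = observed * (1 - observed)                  # Bernoulli variance
    return (observed - expected) / math.sqrt(variance / total)


# Invented counts: "wing" 500 times, "tips" 400 times, "wing tips" 300 times.
t = collocation_t_score(500, 400, 300, 1_000_000)
print(t > 2.576)
```

Phrases that pass the test are candidates to feed into the subsumption check, which narrows down which phrases are worth testing.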
Refinements
shoe
dress shoe (12%) wing tip (23%)
tennis shoe (11%)
blue dress shoe (1%)
sapphire brooks brothers dress shoe (0.001%)
brown dress shoe (20%)
Colors scattered throughout
Sub concepts, likely child phrases
tennis shoe (11%)
Siblings refine each other
running shoe (34%)
Should these be in supercategory "athletic shoes"?
Refinement mining in Solr
docs = [{"query": "shoe", "refinement": "dress shoe"},
        {"query": "shoe", "refinement": "brown shoe"},
        {"query": "tie", "refinement": "brown tie"}]
q=query:shoe&facet=true&facet.field=refinement
Refinements:
- dress shoe (4)
- tennis shoe (2)
- ...
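The facet request above amounts to a grouped count. A small in-memory sketch of the same idea, assuming one record per (query, refinement) pair as in the slide's `docs`; `refinement_counts` is a hypothetical helper standing in for what the Solr facet returns.

```python
from collections import Counter

# One record per observed refinement: the user searched `query`, then `refinement`.
docs = [
    {"query": "shoe", "refinement": "dress shoe"},
    {"query": "shoe", "refinement": "brown shoe"},
    {"query": "tie", "refinement": "brown tie"},
]


def refinement_counts(docs, query):
    """Count how often each refinement follows the given query."""
    return Counter(d["refinement"] for d in docs if d["query"] == query)


for refinement, count in refinement_counts(docs, "shoe").most_common():
    print(f"- {refinement} ({count})")
```

With a real query log the counts would be much larger, and the top refinements become candidate child concepts of the original query.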
SHReC w/ Refinements
q=query:shoe&facet=true&facet.field=refinement
Num results for q=shoe
(Slow, but you do this rarely)
Seed the corpus exploration SHReC
SHReC w/ sig terms
What's actually happening in SHReC is significance scoring, which is baked into Solr:
scoreNodes(
  select(
    facet(collectionName,
          q="query:shoes",
          buckets="refinements",
          bucketSorts="count(*) desc",
          bucketSizeLimit="100",
          count(*)),
    refine_graph as node,
    "count(*)",
    replace(collection, null, withValue=collectionName),
    replace(field, null, withValue=refine_graph)))
Relationship of local vs global
Other ways of measuring term stat. significance
● Trey G. Solr knowledge graph (hope you saw his talk)! https://lucidworks.com/video/leveraging-lucenesolr-as-a-knowledge-graph-and-intent-engine/
● Mark Harwood Elastic Graph / Sig Terms https://www.elastic.co/elasticon/conf/2016/sf/graph-capabilities-in-the-elastic-stack
But word2vec, LDA, etc
- Focused on content, not users: they discover topics/synonyms in content, while we often need mappings from search query vernacular to content vernacular
- Traditional topic modeling is flat
- Hierarchies extracted from content don't reflect users' hierarchies & how they map to content
- Don't confuse co-occurrences with synonyms without extensive data modeling/munging to get your content here
Questions?
Further Reading:
- Relevant Search!
- Blog articles:
  - Building Entity-focused search w/ Keyphrases:
    - http://opensourceconnections.com/blog/2016/12/02/solr-elasticsearch-synonyms-better-patterns-keyphrases/
  - Synonym best practices:
    - http://opensourceconnections.com/blog/2016/12/23/elasticsearch-synonyms-patterns-taxonomies/
  - Match Query Parser:
    - http://opensourceconnections.com/blog/2017/01/23/our-solution-to-solr-multiterm-synonyms/
Discount code: relsearch
http://manning.com