Sanoma Search Use Cases
Sander Kieft
@skieft
About me
Manager Core Services at Sanoma
Responsible for all common services, including the Big Data platform
Work:
– Centralized services
– Data platform
– Search
Like:
– Work
– Water(sports)
– Whiskey
– Tinkering: Arduino, Raspberry Pi, soldering stuff
24 April 2015
Sanoma, Publishing and Learning company
2 Finnish newspapers
Over 100 magazines
5 TV channels in Finland and The Netherlands
200+ websites
100 mobile applications on various mobile platforms
Sanoma = Donald Duck
Use cases
Full text search
Use cases
Full Text Search
Photo credits: Igal Koshevoy - https://www.flickr.com/photos/igalko/6345215839/
Faceted search
Full text search
Faceted search
Guided search
Use cases
Guided search
Photo credits: http://www.flickr.com/photos/emyanmei/8223998414/
Source: ThesisDefense - Dirk Guijt
Startpagina.nl Search
Content
Source: ThesisDefense - Dirk Guijt
Startpagina.nl Search
Content
Tags
#vakantie
#arke
#vakantie arke
#arke vakantie
#arke stedentrip
#stedentrip arke
…
#arkefly
Query: vakantie arke
?
Query: vakantie arke stedentrip
Thesis Defense - Dirk Guijt
Term mismatch
“vakantie arke stedentrip” → “arke stedentrip”
Query-Flow-Graph
User ID Date / Time Query
User A 02-12-2014 10:30:15 owls
User A 02-12-2014 10:30:23 snow owls
User A 02-12-2014 10:30:46 snow owls food
User A 02-12-2014 10:31:03 owls food
User B 02-12-2014 13:21:34 lemon
User B 02-12-2014 13:22:02 lemon trees
User B 02-12-2014 13:22:12 lemon cove
User B 02-12-2014 16:53:01 owls
User B 02-12-2014 16:53:53 forest owls
Source: ThesisDefense - Dirk Guijt
Model as a Graph
snow owls
snow owls food
owls food
owls
lemon trees
lemon cove
lemon
forest owls
[Figure: the query graph above with weighted edges; weights shown: 5, 3, 1, 2, 6, 5]
Query Reformulation Types
Examples:
– owls → snow owls (S) specialization
– snow owls → owls (G) generalization
– olws → owls (C) same query / error correction
– snow owls → forest owls (P) parallel move / equivalent rephrase
Source: ThesisDefense - Dirk Guijt
Using Query-Log based Collective Intelligence to Generate Query Suggestions for Tagged Content Search (paper), 15th International Conference on Web Engineering (ICWE 2015): http://icwe2015.webengineering.org
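The four reformulation types can be illustrated with a small rule-based labeller. This is only a sketch under assumptions: the rules (term-set containment, an edit distance of at most 2 for corrections, at least one shared term for parallel moves) are made up for illustration and are not taken from the thesis or the paper.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b)  # substitution
            ))
        previous = current
    return previous[-1]


def reformulation_type(old, new, max_typo_distance=2):
    """Label the step old -> new as S, G, C, P, or None (unrelated)."""
    old_terms, new_terms = set(old.split()), set(new.split())
    if old_terms < new_terms:
        return "S"  # specialization: terms were added
    if new_terms < old_terms:
        return "G"  # generalization: terms were removed
    if edit_distance(old, new) <= max_typo_distance:
        return "C"  # same query / error correction
    if old_terms & new_terms:
        return "P"  # parallel move: some terms kept, others swapped
    return None


for old, new in [("owls", "snow owls"), ("snow owls", "owls"),
                 ("olws", "owls"), ("snow owls", "forest owls")]:
    print(old, "->", new, ":", reformulation_type(old, new))
```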
Full text search
Faceted search
Guided search
Content repository
Use cases
Monolithic vs integrated
Two approaches
Master
Content
*
*
Content
Master
Content
*
*
MyJour
Item Based Framework
….CMS
Architecture Content Platform
Content Platform Core
Search
Solr
Blob storage (S3 & MT)
Article storage: MongoDB
Analyse
CMS
CMS
Editorial
reuse-interface
ePub
Digital
Template
system
WoodWing
Content
Portal
Feeds
Noma
Viva
PDF Based Framework
….
HomeDeco
Sources Services Solutions Products
??
??
??
??
eLinea
Blendle
Google Currents
LINDA. nieuws
NU.nl search
NLP
Natural Language Processing
Understanding a language is hard, really hard
Photo credits: Celines Photographer - http://www.flickr.com/photos/celinesphotographer/2295348530/
Ambiguous
Photo credits: M Hatrey - http://www.flickr.com/photos/mhatrey/6968211400/
I made her duck.
Photo credits: Ulteriore Picure - http://www.flickr.com/photos/ulteriorepicure/200767137/
Photo credits: Atomic Seed - http://www.flickr.com/photos/atomic_seed/6824087444/
Photo credits: Ulteriore Picure - http://www.flickr.com/photos/ulteriorepicure/200767137/
Photo credits: Pintoy - http://www.flickr.com/photos/pintoys/6155690814/
Photo credits: KyanosAum - http://www.flickr.com/photos/kyanos_aum/3926971954/
Photo credits: Super Hua - http://www.flickr.com/photos/superhua/286479024/
Photo credits: Glim Eend - http://www.flickr.com/photos/glimeend/5075731300/
Wrong
Photo credits: Learnscope - http://www.flickr.com/photos/learnscope/5536614305/
Recursion
Photo credits: Wikipedia - http://en.wikipedia.org/wiki/File:Droste.jpg
Creative Tools - http://www.flickr.com/photos/creative_tools/4353860378/
Multilingual
Ambiguity
Creative
Multilingual
Wrong
...and many more
Understanding language is hard
...but we don’t need to fully understand language to make use of it.
Tagging
Quote detection
Sentiment analysis
Topic detection
Named Entity Recognition
What did we use to enhance our index?
Tagging
Spectators outside the White House received a rare treat this morning
when they witnessed First Lady Michelle Obama on the South Lawn
going for a stroll with the family’s pet rhinoceros, Chauncey. “Owning a
rhino is a lot of work, but all of the Obamas—and especially Michelle—
really love Chauncey,” said White House spokesperson Sam
Davidson of the 3,000-pound eastern black rhinoceros the family
adopted in December after Barack Obama’s reelection promise to
“finally get Sasha and Malia that rhino they’ve been wanting.”
Source: http://www.theonion.com/articles/michelle-obama-seen-outside-walking-family-rhinoce,32851/
TF/IDF
Latent semantic indexing
Tagging
TF/IDF
TF-IDF
Term Frequency-Inverse Document Frequency
Term frequency = (how often the search term occurs in the text) / (how many words are in the entire text)
5/24 = 0.21
3/12 = 0.25 (more relevant)
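The arithmetic on this slide covers the term-frequency half of TF-IDF. Below is a small sketch of the full weight using the common tf × log(N/df) form; the slides do not say which variant was used, and the toy documents and the word "rhino" are made up so that the counts match 5/24 and 3/12.

```python
import math


def term_frequency(term, tokens):
    """Occurrences of the term divided by the number of words in the text."""
    return tokens.count(term) / len(tokens)


def inverse_document_frequency(term, corpus):
    """log(N / df): terms that occur in few documents get a higher weight."""
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing) if containing else 0.0


def tf_idf(term, tokens, corpus):
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus)


# Toy texts built to reproduce the slide: 5 hits in 24 words, 3 hits in 12 words.
text_a = ["rhino"] * 5 + ["filler"] * 19
text_b = ["rhino"] * 3 + ["filler"] * 9
text_c = ["filler"] * 10  # a third text without the search term
corpus = [text_a, text_b, text_c]

print(round(term_frequency("rhino", text_a), 2))   # 5/24
print(round(term_frequency("rhino", text_b), 2))   # 3/12
print(tf_idf("filler", text_a, corpus))            # occurs everywhere, so weight 0
```

The shorter text scores higher for "rhino" even though it contains fewer hits, and a word that appears in every document ends up with weight zero.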
NumPy, for SVD
Latent semantic indexing
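A minimal sketch of latent semantic indexing with NumPy's SVD, assuming a plain term-by-document count matrix and two latent dimensions. The four toy documents and the choice of k are illustrative assumptions, not the setup used in the project.

```python
import numpy as np

# Hypothetical mini-corpus: two texts about a rhino, two about an election.
docs = [
    "rhino walk lawn",
    "rhino lawn stroll",
    "election promise president",
    "president election speech",
]

vocabulary = sorted({word for doc in docs for word in doc.split()})

# Term-by-document matrix of raw counts (rows: terms, columns: documents).
counts = np.array(
    [[doc.split().count(word) for doc in docs] for word in vocabulary],
    dtype=float,
)

# Thin SVD, then keep only the k strongest latent dimensions.
U, singular_values, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2
doc_vectors = (np.diag(singular_values[:k]) @ Vt[:k]).T  # one row per document


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


print(cosine(doc_vectors[0], doc_vectors[1]))  # same topic
print(cosine(doc_vectors[0], doc_vectors[2]))  # different topic
```

Documents 0 and 1 end up close together although "walk" and "stroll" never co-occur, because both share the latent "rhino/lawn" dimension.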
?
Many interesting things about text are longer than one word
– bigram: a sequence of two tokens
– collocation: a bigram that seems to be more than the sum of its parts
When is a bigram interesting?
#(vice president) vs. #(vice), #(president)
#(first lady) vs. #(first), #(lady)
Statistics beyond single words
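One common way to decide when a bigram is "interesting" is to compare its count with the counts of its parts, for example with pointwise mutual information (PMI). The slide does not name the measure that was used, so PMI and the toy sentence below are assumptions for illustration.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """PMI for every adjacent word pair: log2(P(w1 w2) / (P(w1) * P(w2)))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (first, second), count in bigrams.items():
        p_pair = count / n_bigrams
        p_first = unigrams[first] / n_unigrams
        p_second = unigrams[second] / n_unigrams
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return scores


tokens = ("the first lady met the vice president and "
          "the first lady thanked the vice president").split()
scores = pmi_scores(tokens)

for pair in [("first", "lady"), ("vice", "president"), ("the", "first")]:
    print(pair, round(scores[pair], 2))
```

"first lady" scores higher than "the first": both pairs occur twice, but "the" is frequent on its own, so seeing it next to "first" is less surprising.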
Quotes
Spectators outside the White House received a rare treat this morning
when they witnessed First Lady Michelle Obama on the South Lawn
going for a stroll with the family’s pet rhinoceros, Chauncey. “Owning a
rhino is a lot of work, but all of the Obamas—and especially Michelle—
really love Chauncey,” said White House spokesperson Sam Davidson
of the 3,000-pound eastern black rhinoceros the family adopted in
December after Barack Obama’s reelection promise to “finally get Sasha
and Malia that rhino they’ve been wanting.”
Source: http://www.theonion.com/articles/michelle-obama-seen-outside-walking-family-rhinoce,32851/
TF/IDF
Average per sentence
Quotes
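A sketch of how "TF/IDF, average per sentence" might be used to pick a quotable sentence: score every word, average the scores per sentence, and rank the sentences. The sentence splitter, the background corpus, and the smoothing are assumptions; the slides only name the two ingredients.

```python
import math
import re


def rank_sentences(text, background):
    """Return (average TF-IDF, sentence) pairs, highest score first.

    `background` is a list of word sets from other documents (hypothetical
    reference corpus) and is only used for the document frequencies.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    doc_tokens = re.findall(r"\w+", text.lower())
    n_docs = len(background) + 1  # background documents plus this text

    def idf(word):
        doc_freq = 1 + sum(1 for words in background if word in words)
        return math.log(n_docs / doc_freq)

    ranked = []
    for sentence in sentences:
        words = re.findall(r"\w+", sentence.lower())
        if not words:
            continue
        weights = [doc_tokens.count(w) / len(doc_tokens) * idf(w) for w in words]
        ranked.append((sum(weights) / len(weights), sentence))
    return sorted(ranked, reverse=True)


background = [{"the", "weather", "is", "nice"}, {"the", "is"}]
text = "The weather is nice. The rhino loves the lawn."
ranked = rank_sentences(text, background)
print(ranked[0][1])
```

The second sentence wins because its words are rare in the background corpus, while "the" and "is" contribute nothing.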
Classification
Photo credits: biodivlibrary- http://www.flickr.com/photos/biodivlibrary/6989150578/
Distinguish things from other things based on examples
Using supervised learning
Applications:
– Sentiment
– Spam filtering
– Topic detection
– Language detection
Classification
Classification > Sentiment
Bayesian model trained on Kieskeurig.nl review data
Calculates the probability that an article is positive or negative
Classification > Sentiment
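A minimal sketch of a Bayesian sentiment classifier, assuming a multinomial Naive Bayes model with add-one smoothing. The four training sentences are invented stand-ins; the real model was trained on Kieskeurig.nl review data, which is not reproduced here, and its exact model form is not given on the slide.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace-separated words."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocabulary.update(words)

    def probabilities(self, text):
        """Return {label: probability} for the given text."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in self.doc_counts:
            total_words = sum(self.word_counts[label].values())
            score = math.log(self.doc_counts[label] / total_docs)
            for word in words:
                # Add-one smoothing so unseen words do not zero out the product.
                score += math.log((self.word_counts[label][word] + 1)
                                  / (total_words + len(self.vocabulary)))
            log_scores[label] = score
        top = max(log_scores.values())
        unnormalized = {l: math.exp(s - top) for l, s in log_scores.items()}
        norm = sum(unnormalized.values())
        return {l: v / norm for l, v in unnormalized.items()}


model = NaiveBayes()
model.train("great camera sharp pictures", "positive")
model.train("love this phone great battery", "positive")
model.train("terrible battery broke quickly", "negative")
model.train("bad screen terrible support", "negative")

result = model.probabilities("great pictures")
print(result)
```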
Classification > Topic detection
Food and drinks
Beauty
Relationships and sex
Science
K Nearest Neighbor
Solr related search
Classification > Topic detection
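A sketch of K Nearest Neighbor topic detection: find the k most similar labelled articles and let them vote. Here an in-memory cosine similarity stands in for the Solr related-document lookup mentioned on the slide; the example articles, k = 3, and the function names are assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def predict_topic(text, labelled, k=3):
    """Majority vote over the k labelled texts most similar to `text`."""
    query = Counter(text.lower().split())
    ranked = sorted(
        labelled,
        key=lambda item: cosine(query, Counter(item[0].lower().split())),
        reverse=True,
    )
    votes = Counter(topic for _text, topic in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labelled articles, using topic names from the slide.
LABELLED = [
    ("pasta recipe with tomato sauce", "Food and drinks"),
    ("wine and cheese recipe", "Food and drinks"),
    ("lipstick and mascara tips", "Beauty"),
    ("skin care mascara routine", "Beauty"),
    ("telescope finds new planet", "Science"),
    ("physics experiment new particle", "Science"),
]

print(predict_topic("quick pasta sauce recipe", LABELLED))
print(predict_topic("new planet found by telescope", LABELLED))
```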
?
Named Entities
Spectators outside the White House received a rare treat this morning
when they witnessed First Lady Michelle Obama on the South Lawn
going for a stroll with the family’s pet rhinoceros, Chauncey. “Owning a
rhino is a lot of work, but all of the Obamas—and especially Michelle—
really love Chauncey,” said White House spokesperson Sam Davidson
of the 3,000-pound eastern black rhinoceros the family adopted in
December after Barack Obama’s reelection promise to “finally get
Sasha and Malia that rhino they’ve been wanting.”
Source: http://www.theonion.com/articles/michelle-obama-seen-outside-walking-family-rhinoce,32851/
Locations
Persons
This can get you the persons, organizations and locations based on the sentence structure
Other types of entities can be trained as well, e.g. events, ingredients
Bayesian model
Features:
– Capitalization “Donald Duck”
– Part of speech tag “Did you meet Donald Duck last week?”
– Gazetteer of common names “Donald”
Named Entities
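The three feature types listed above can be sketched as a per-token feature extractor whose output would then be fed to the classifier. The part-of-speech tags are written by hand here (in practice they would come from a tagger such as the one in NLTK), and the gazetteer contents and feature names are illustrative assumptions.

```python
# Hypothetical gazetteer of common first names.
COMMON_NAMES = {"donald", "michelle", "barack"}


def token_features(tagged, index, gazetteer=COMMON_NAMES):
    """Features for one token in a list of (word, part-of-speech tag) pairs."""
    word, pos = tagged[index]
    previous_pos = tagged[index - 1][1] if index > 0 else "<START>"
    next_word = tagged[index + 1][0] if index + 1 < len(tagged) else ""
    return {
        "is_capitalized": word[:1].isupper(),
        "next_is_capitalized": next_word[:1].isupper(),
        "pos": pos,
        "previous_pos": previous_pos,
        "in_gazetteer": word.lower() in gazetteer,
    }


# "Did you meet Donald Duck last week?" with hand-assigned Penn Treebank tags.
tagged_sentence = [
    ("Did", "VBD"), ("you", "PRP"), ("meet", "VB"), ("Donald", "NNP"),
    ("Duck", "NNP"), ("last", "JJ"), ("week", "NN"), ("?", "."),
]

print(token_features(tagged_sentence, 3))  # "Donald"
print(token_features(tagged_sentence, 0))  # "Did": capitalized, but no name
```

Capitalization alone is not enough ("Did" is capitalized too), which is why the slide combines it with the part-of-speech tag and the gazetteer.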
It was revealed in December 2006 that Michael Jackson "is and has been suffering for at least a
decade from Parkinson's Disease."[10] He also suffered from diabetes. Michael Jackson died of a heart
attack in his home the morning of 30 August 2007 at the age of …
Named Entities++
?
{'December': 3,'Parkinson’s Disease': 1,'died': 4,'heart attack': 2}
{'December': 2,'Thriller': 1,'Neverland': 2,'heart attack': 2}
{'December': 3,'Parkinson’s Disease': 1,'Writer': 4,'heart attack': 2}
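One possible reading of these three dictionaries, and it is only an assumption, is that the first holds term counts from the text around the mention and the other two are profiles of candidate entities with the same name. Under that reading, disambiguation can be sketched as picking the profile that overlaps most with the context. The weighted Jaccard measure and the candidate labels below are illustrative choices, not taken from the slides.

```python
def weighted_jaccard(a, b):
    """Sum of per-term minimum counts divided by sum of per-term maximum counts."""
    keys = set(a) | set(b)
    shared = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    total = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return shared / total if total else 0.0


# Term counts copied from the slide (apostrophes normalized).
context = {"December": 3, "Parkinson's Disease": 1, "died": 4, "heart attack": 2}
candidates = {
    "candidate_1": {"December": 2, "Thriller": 1, "Neverland": 2, "heart attack": 2},
    "candidate_2": {"December": 3, "Parkinson's Disease": 1, "Writer": 4, "heart attack": 2},
}

best = max(candidates, key=lambda name: weighted_jaccard(context, candidates[name]))
for name, profile in candidates.items():
    print(name, round(weighted_jaccard(context, profile), 3))
print("best match:", best)
```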
Storing it in the search index
Use it for faceted search
Use it for boosting
Use it as a poor man's knowledge base/graph
What do we do with this additional info?
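A sketch of what faceting and boosting on the enriched fields could look like as a Solr request. The parameters `q`, `defType=edismax`, `facet`, `facet.field` and `bq` are standard Solr query parameters; the field names `persons`, `locations` and `sentiment` and the boost factor are assumptions, since the actual schema is not shown in the slides.

```python
from urllib.parse import parse_qs, urlencode


def build_solr_query(text, boost_person=None,
                     facet_fields=("persons", "locations", "sentiment")):
    """Build the query string for a Solr /select request."""
    params = [
        ("q", text),
        ("defType", "edismax"),  # the parser that understands the bq parameter
        ("facet", "true"),
    ]
    # One facet.field per enriched field: counts per person, location, sentiment.
    params += [("facet.field", field) for field in facet_fields]
    if boost_person:
        # Boost query: documents mentioning this person rank higher.
        params.append(("bq", 'persons:"%s"^2' % boost_person))
    return urlencode(params)


query_string = build_solr_query("rhinoceros", boost_person="Michelle Obama")
print(query_string)
```

The resulting string would be appended to the select handler URL of the core that holds the enriched documents.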
Images & Video
Photo credits: Danila- http://www.flickr.com/photos/58372504@N05/6996542804/
Face detection
Not face recognition!
OpenCV
Viola-Jones
Haar cascade classifier
Currently identify:
– Portrait shots
– Faces
– Group shots
Side project:
– Content sensitive cropping
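A sketch of Viola-Jones face detection with OpenCV's bundled Haar cascade, followed by a simple rule that turns the detected boxes into the shot types listed above. The detection parameters and the thresholds (a single face covering at least 10% of the image counts as a portrait, three or more faces as a group shot) are assumptions, and `cv2.data.haarcascades` assumes the `opencv-python` package.

```python
def detect_faces(image_path):
    """Return a list of (x, y, width, height) face boxes found by a Haar cascade."""
    import cv2  # imported here so classify_shot below also works without OpenCV

    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    image = cv2.imread(image_path)
    if image is None:
        raise ValueError("could not read image: %s" % image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [tuple(int(value) for value in box) for box in boxes]


def classify_shot(faces, image_width, image_height,
                  portrait_share=0.10, group_size=3):
    """Label an image from its face boxes (thresholds are illustrative)."""
    if not faces:
        return "no faces"
    if len(faces) >= group_size:
        return "group shot"
    largest = max(w * h for (_x, _y, w, h) in faces)
    if len(faces) == 1 and largest / (image_width * image_height) >= portrait_share:
        return "portrait shot"
    return "faces"


print(classify_shot([(100, 50, 300, 300)], 800, 600))
print(classify_shot([(0, 0, 40, 40)] * 3, 800, 600))
```

In use, the boxes would come from `detect_faces("photo.jpg")` (a hypothetical file name) together with the image's width and height.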
Color detection
Image Histogram
Manual mapping
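A sketch of the two steps named above: build a histogram of the image's colors and map each bucket to a human-readable name through a hand-made table. The palette, the nearest-color rule, and the 20% threshold are assumptions for illustration; pixels are passed in as plain RGB tuples so the example does not depend on an image library.

```python
from collections import Counter

# Hand-made mapping from color names to reference RGB values (illustrative).
NAMED_COLORS = {
    "red": (255, 0, 0),
    "green": (0, 128, 0),
    "blue": (0, 0, 255),
    "yellow": (255, 255, 0),
    "white": (255, 255, 255),
    "black": (0, 0, 0),
}


def nearest_color_name(pixel):
    """Name of the reference color closest to the pixel in RGB space."""
    return min(
        NAMED_COLORS,
        key=lambda name: sum((a - b) ** 2 for a, b in zip(pixel, NAMED_COLORS[name])),
    )


def color_histogram(pixels):
    """Count how many pixels fall under each color name."""
    return Counter(nearest_color_name(pixel) for pixel in pixels)


def dominant_colors(pixels, min_share=0.2):
    """Color names covering at least `min_share` of the image, largest first."""
    histogram = color_histogram(pixels)
    total = sum(histogram.values())
    return [name for name, count in histogram.most_common()
            if count / total >= min_share]


pixels = [(250, 10, 10)] * 6 + [(10, 10, 240)] * 3 + [(240, 240, 240)]
print(color_histogram(pixels))
print(dominant_colors(pixels))
```

The resulting names can be stored as a field in the search index, so that a query like "red" can filter or facet on image color.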
Improving image search
Photo credits: Camera Wiki- http://www.flickr.com/photos/camerawiki/5610384297/
EUVision, UvA spin off
Acquired by Qualcomm end of ‘14
Image and video classification
High level architecture
Content Library
NER
Sentiment
Search index
API
Loader
Solr
MongoDB
Image recognition
Analyse pipeline
Python/Django, with Django REST framework
MongoDB
Solr
Celery
Libraries:
– nltk
– NumPy & SciPy
– OpenCV
– ssdeep
– EUVision
Technology
Full text search
Faceted search
Guided search
Content repository
Analytics
Use cases
Search for analytics
Keyword search can be combined with
advanced forms of ranking the results
Most of the fields go to an index
Facets can be used for analytics
Ranker can be replaced with custom logic
Products:
– Solr
– Elasticsearch
– MarkLogic
Use cases:
– Content Search
– Analytics / Faceted
– Percolation
ELK – Elasticsearch, Logstash & Kibana
Full text search
Faceted search
Guided search
Content repository
Analytics
Ad serving
Use cases
Search
Content
Q Σ Result ranking
Search too
Content
t
Σ Result ranking
User
Search too
Ads
Page
Σ Result ranking
User
Search too
Ads
Page
Σ Result ranking
User
History
Open Projects
OL2R
– Implement
Content Library
– Additional semantic modules
– Impact of Semantic Enrichments
Content Library
– Improve Ranking
Guided Search
– Evaluations
– User History
Image search
Analytics
Knowledge Base
– Build for Content Library and Guided Search
Multilingual search
– NL, EN and FI
Probabilistic search
– Product searches
Peter Dutton - https://www.flickr.com/photos/joeshlabotnik/14040532589/
Camera Wiki - http://www.flickr.com/photos/camerawiki/5610384297/
Danila - http://www.flickr.com/photos/58372504@N05/6996542804/
Biodivlibrary - http://www.flickr.com/photos/biodivlibrary/6989150578/
Wikipedia - http://en.wikipedia.org/wiki/File:Droste.jpg
Creative Tools - http://www.flickr.com/photos/creative_tools/4353860378/
Learnscope - http://www.flickr.com/photos/learnscope/5536614305/
Super Hua - http://www.flickr.com/photos/superhua/286479024/
Kyanos Aum - http://www.flickr.com/photos/kyanos_aum/3926971954/
Pintoy - http://www.flickr.com/photos/pintoys/6155690814/
Ulteriore Picure - http://www.flickr.com/photos/ulteriorepicure/200767137/
Atomic Seed - http://www.flickr.com/photos/atomic_seed/6824087444/
M Hatrey - http://www.flickr.com/photos/mhatrey/6968211400/
Celines Photographer - http://www.flickr.com/photos/celinesphotographer/2295348530/
Irene Mei - http://www.flickr.com/photos/emyanmei/8223998414/
Igal Koshevoy - https://www.flickr.com/photos/igalko/6345215839/
Photo credits
Thanks to all photographers!