Challenges for industrial-strength Information Retrieval on Databases
R. Cornacchia, M. Hildebrand, A.P. de Vries, F. Dorssers
KARS2017 - 21 March 2017, Venice, IT
Outline
1. Search is everywhere
2. Tailored search is expected
3. Tailored search needs modelling
4. Search modelling by information specialists
5. Search modelling needs flexible IR & DB
6. IR on DB: it works
Search is everywhere
Real world scenarios
Technical
Desktop
Coding content assistant
Product recommendation
Personalised newsfeed
Let’s pick a simple one: autocompletion
iphone 7
iphone 5c
iphone 6s
ipho|
“autocompletion is trivial”
.. not so fast!
Tailored search is expected
autocompletion
iphone 7
iphone 5c
iphone 6s
ipho|
Basic - products
○ Any matching term from the index
○ Suggest products
Tailored search is expected
autocompletion
iphone 7
iphone 5c
iphone 6 cases
ipho|
Basic - products & categories
○ Any matching term from the index
○ Suggest products & categories
Tailored search is expected
autocompletion
iphone 7
iphone 6 cases
iphone 6s
ipho|
Filtered
○ Any matching term from the index
○ “iPhone 5c” out of stock
Tailored search is expected
autocompletion
iphone 8
iphone 7
iphone 6 cases
ipho|
Filtered & ranked
○ “iPhone 5c” out of stock
○ “iPhone 8” the most requested
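A minimal sketch of the "filtered & ranked" variant, assuming simple prefix matching, an in-memory catalogue, and a hypothetical `requests` counter as the popularity signal (none of these names come from the slides):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    text: str
    in_stock: bool
    requests: int  # hypothetical popularity counter


# Illustrative data only; the counts are made up.
CATALOGUE = [
    Suggestion("iphone 5c", in_stock=False, requests=40),
    Suggestion("iphone 6 cases", in_stock=True, requests=55),
    Suggestion("iphone 6s", in_stock=True, requests=30),
    Suggestion("iphone 7", in_stock=True, requests=80),
    Suggestion("iphone 8", in_stock=True, requests=120),
    Suggestion("ipad mini", in_stock=True, requests=90),
]


def suggest(prefix, catalogue=CATALOGUE, limit=3):
    """Return up to `limit` in-stock suggestions, most requested first."""
    prefix = prefix.lower()
    # Filter: prefix match and availability.
    matches = [s for s in catalogue if s.text.startswith(prefix) and s.in_stock]
    # Rank: most requested first.
    matches.sort(key=lambda s: s.requests, reverse=True)
    return [s.text for s in matches[:limit]]


print(suggest("ipho"))
```

Each further variant (exploratory, personalised) changes the filter or the ranking key, which is why a fixed out-of-the-box API quickly falls short.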
Tailored search is expected
autocompletion
iphone cases
iphone adapters
iphone 7
ipho|
Exploratory
○ First suggest categories..
○ .. then products
Tailored search is expected
autocompletion
iphone 7 cases
iphone 7 adapters
iphone 8
ipho|
Personalised
○ I already own an “iPhone 7”
○ Suggest compatible accessories
○ Suggest upgrade
Tailored search is expected
What if my search API isn’t enough?
Tailored search needs modelling
iphone 7 cases
iphone 7 adapters
iphone 8
ipho|
<your favourite autocompletion>
○ Out-of-the-box API may fall short
○ Build custom search API
○ Who? How?
http://localhost:8983/solr/suggest?q=ipho
How do we build custom search APIs?
Search modelling by information specialists
data modelling search modelling
Spinque: Empower the information specialist
Empowering the information specialist
data modelling search modelling
Search modelling by information specialists
Search modelling
standard autocompletion custom autocompletion
Search modelling by information specialists
http://spinque/suggest?q=ipho
http://spinque/suggest_ranked?q=ipho
The IR & DB challenge
Search modelling needs flexible IR & DB
○ IR & DB both needed even for trivial tasks
○ Different technologies / focus
○ How / where to integrate task results?
○ Do they stay black boxes?
○ Can we express them in the same platform,
and when does this make sense?
http://spinque/suggest_ranked?q=ipho
Text retrieval by strategy
Search modelling needs flexible IR & DB
text retrieval.. ..is just another DB query
○ strategy-driven “collection” and “documents”
○ on-demand indexing
○ it takes just standard SQL
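To illustrate "text retrieval is just another DB query", here is a rough sketch using SQLite. The `postings` table, the toy documents, and the simplified tf/df score are assumptions made for the example, not the scoring model used in the talk:

```python
import sqlite3

# Toy collection; in a strategy, "documents" could be any text-valued data.
DOCS = {
    1: "red pen with blue ink",
    2: "blue notebook",
    3: "pen and pencil set",
}


def build_index(conn, docs):
    """Build a term-frequency table on demand (hypothetical schema)."""
    conn.execute("CREATE TABLE postings (doc_id INTEGER, term TEXT, tf INTEGER)")
    for doc_id, text in docs.items():
        counts = {}
        for term in text.lower().split():
            counts[term] = counts.get(term, 0) + 1
        conn.executemany(
            "INSERT INTO postings VALUES (?, ?, ?)",
            [(doc_id, term, tf) for term, tf in counts.items()],
        )


# Ranking in plain SQL: sum of tf / df over the matching query terms.
RANK_SQL = """
SELECT p.doc_id, SUM(p.tf * 1.0 / d.df) AS score
FROM postings p
JOIN (SELECT term, COUNT(*) AS df FROM postings GROUP BY term) d
  ON d.term = p.term
WHERE p.term IN ({placeholders})
GROUP BY p.doc_id
ORDER BY score DESC, p.doc_id
"""


def search(conn, query):
    terms = query.lower().split()
    if not terms:
        return []
    sql = RANK_SQL.format(placeholders=", ".join("?" for _ in terms))
    return conn.execute(sql, terms).fetchall()


conn = sqlite3.connect(":memory:")
build_index(conn, DOCS)
print(search(conn, "blue pen"))
```

The point is only that both the index and the ranking are ordinary relations and queries, so the database optimiser can treat them like any other part of the strategy.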
Graph DB by strategy
Search modelling needs flexible IR & DB
Visual modelling Relational Algebra Graph
subject  property      object
123      name          pen
123      availability  in stock
123      price         9.99
Graph DB by strategy
Search modelling needs flexible IR & DB
we want DB & ranking
together & seamlessly
what if this.. ..could work on this?
subject  property      object    p
123      name          pen       1.0
123      availability  in stock  0.8
123      price         9.99      1.0
Rank. Everything. Always.
Search modelling needs flexible IR & DB
rank products.. ..get ranked orders and customers
Fuhr, Rölleke, 1997, A probabilistic relational algebra for the integration of IR and DB
SQL:
SELECT g.obj, (o.p * g.p) AS p
FROM graph g, ranked_orders o
WHERE g.subj = o.id
AND g.rel = 'orderedBy';
PRA:
PROJECT [$3] (
  JOIN INDEPENDENT [$1=$1] (
    SELECT [$2='orderedBy'] (g),
    ranked_orders ) )
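The SQL form of this query can be tried on any relational engine. The sketch below runs it on SQLite with made-up rows; the sample data and the added `ORDER BY` are assumptions, while the table and column names follow the query on the slide:

```python
import sqlite3


def rank_customers():
    """Propagate order scores to customers via an independent join."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE graph (subj TEXT, rel TEXT, obj TEXT, p REAL);
        CREATE TABLE ranked_orders (id TEXT, p REAL);
        -- Illustrative triples, each with a probability p.
        INSERT INTO graph VALUES
          ('order1', 'orderedBy', 'alice', 1.0),
          ('order2', 'orderedBy', 'bob',   0.5),
          ('order1', 'contains',  'pen',   1.0);
        -- Orders as ranked by an earlier step of the strategy.
        INSERT INTO ranked_orders VALUES ('order1', 0.5), ('order2', 0.5);
    """)
    query = """
        SELECT g.obj, (o.p * g.p) AS p
        FROM graph g, ranked_orders o
        WHERE g.subj = o.id
          AND g.rel = 'orderedBy'
        ORDER BY p DESC
    """
    return conn.execute(query).fetchall()


print(rank_customers())
```

Multiplying `o.p * g.p` is what the independence assumption of the join amounts to: the score of a customer is the score of the order times the probability of the `orderedBy` edge.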
What about efficiency?
IR on DB: it works
1.1M docs, 2.3GB
4-core i7-3770s, 16GB RAM, 256GB SSD
find documents: 20ms
8M lots, 25K auctions (10GB raw data)
VM (8 CPUs) on Xeon E5-2620, 16GB RAM, 256GB SSD
find lots: 150ms
What about efficiency?
IR on DB: it works
pre-compute what can be pre-computed.. ..but do it query-driven
○ Index on demand
○ Cache result of relational expressions
○ Algebraic analysis to determine cache
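One way to picture query-driven caching is a toy evaluator that memoises results per relational expression. The tuple encoding of expressions and the `CachingEvaluator` class are hypothetical and far simpler than a real algebraic analysis:

```python
# Expressions are nested tuples (hashable, so they can serve as cache keys):
#   ("scan", table_name)
#   ("select", column_index, value, child_expression)

TABLES = {
    "graph": (
        ("123", "name", "pen"),
        ("123", "availability", "in stock"),
        ("456", "name", "pencil"),
    )
}


class CachingEvaluator:
    def __init__(self, tables):
        self.tables = tables
        self.cache = {}        # expression -> result rows
        self.evaluations = 0   # number of expressions actually computed

    def evaluate(self, expr):
        # Reuse the result if this exact expression was computed before.
        if expr in self.cache:
            return self.cache[expr]
        self.evaluations += 1
        op = expr[0]
        if op == "scan":
            rows = tuple(self.tables[expr[1]])
        elif op == "select":
            _, column, value, child = expr
            rows = tuple(r for r in self.evaluate(child) if r[column] == value)
        else:
            raise ValueError(f"unknown operator: {op}")
        self.cache[expr] = rows
        return rows


evaluator = CachingEvaluator(TABLES)
names = ("select", 1, "name", ("scan", "graph"))
availability = ("select", 1, "availability", ("scan", "graph"))
evaluator.evaluate(names)         # computes the scan and the selection
evaluator.evaluate(names)         # served entirely from the cache
evaluator.evaluate(availability)  # reuses the cached scan
print(evaluator.evaluations)
```

Nothing is computed until a query asks for it, and shared sub-expressions are computed once. A real system would also need to decide which expressions are worth keeping and when to invalidate them.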
What about efficiency?
IR on DB: it works
choose it carefully.. ..then enjoy
○ Main benefits of IR on DB
○ IR as a DB optimisation problem
○ No custom extensions, no vendor lock-in
○ Column-store, CPU-friendly DB engine
Hey, we made our join 20% faster. You are welcome.
IR on DB: when does it make sense?
IR on DB: it works
○ If you just need text retrieval on documents
○ Lucene-like engines will serve you well
○ Information needs tend to be more complex
○ Solving them at application level: common and painful
○ A one-platform approach pays off
Conclusions
1. Search is everywhere
○ In the real world..
2. Tailored search is expected
○ ..there is no search like another.
3. Tailored search needs modelling
○ Someone will put effort in it..
4. Search modelling by information specialists
○ ..who better than the right person for the job?
5. Search modelling needs flexible IR & DB
○ Who takes care of the low-level details then?
6. IR on DB: it works
○ The right tools. The right architecture.
Challenges ahead
○ Live updates
○ ACID transactions overhead
○ Scale out
○ It’s more than “just an inverted file” to be distributed
○ Even better support for information specialists
○ Strategy auto-tuning