CASE STUDY: SPAREBANK1 GRUPPEN
Sébastien Muller
Project background
•SpareBank1 Gruppen
• 19 individual bank portals and 1 front page (forside)
•Boost 25 umbrella project
• "Semantic" URLs:
https://www2.sparebank1.no/9898/3_privat?_nfpb=true&_nfls=false&_pageLabel=page_privat_innhold&pId=1233149354625&_
• New search GUI
•CMS with no easy way of telling which bank has published what
• Mass duplications
• Access to other portal-specific articles
• Webcrawlers
What is better search?
At the very least:
•Relevant hits
•Faceting
•Query completion
•Spelling check and suggestions
•Basic search analytics
Relevant hits
• Relevancy = "...the quality of results returned from a query..."
• Based on hits in fields generated from document processing
• Clean and meta-data rich index
• Pushed from CMS or extracted by crawlers
Crawling and Indexing
•Clean and meta-data rich index
•OpenPipeline
• Ignore irrelevant articles
• Extract article text contents
• Detect duplicates
• Facet data
• Populate index fields including *_qc and *_sp fields
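
The stages listed above could be sketched roughly as follows. This is an illustration in Python, not an actual OpenPipeline configuration: the function names, the layout of the input dictionaries, the `title_qc` / `title_sp` field names, the minimum-length rule for "irrelevant" articles, and the exact-match fingerprint used for duplicate detection are all assumptions made for the example.

```python
import hashlib
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def extract_text(html):
    """Extract article text and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def process(raw_docs, min_length=20):
    """Yield index-ready documents from crawled pages (hypothetical layout)."""
    seen = set()
    for raw in raw_docs:
        text = extract_text(raw["html"])
        if len(text) < min_length:
            continue  # ignore irrelevant (here: near-empty) articles
        fingerprint = hashlib.sha1(text.lower().encode("utf-8")).hexdigest()
        if fingerprint in seen:
            continue  # detect duplicates: same text already indexed
        seen.add(fingerprint)
        yield {
            "id": fingerprint,
            "content": text,
            "bank": raw["bank"],       # facet data
            "title_qc": raw["title"],  # query-completion field (assumed name)
            "title_sp": raw["title"],  # spelling field (assumed name)
        }
```

An exact fingerprint only catches articles whose extracted text is identical after normalisation; near-duplicates would need a fuzzier comparison.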
Crawling and Indexing
• Crawlers will be as smart as you make them
• Very rigid logic
• Heavily reliant on article quality
• Don’t blame the crawler
https://www2.sparebank1.no/portal/4702/3_privat?_nfpb=true&_nfls=false&_pageLabel=page_privat_innhold&pId=1233149354625&_nfls=false
https://www2.sparebank1.no/portal/9898/3_privat?_nfpb=true&_nfls=false&_pageLabel=page_privat_innhold&pId=1233149354625&_nfls=false
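
The two URLs differ only in the number that follows `/portal/`, while the `pId` value is the same. Assuming (this is inferred from the URLs, not stated on the slides) that the number identifies the bank portal and `pId` identifies the article, a crawler could group the copies of one article like this. The helper names `article_key` and `group_by_article` are hypothetical.

```python
from collections import defaultdict
from urllib.parse import parse_qs, urlsplit


def article_key(url):
    """Return (portal_id, article_id), or None if the URL has no pId."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    # Assumption: the first all-digit path segment is the bank portal number.
    portal_id = next((s for s in segments if s.isdigit()), None)
    article_ids = parse_qs(parts.query).get("pId")
    if not article_ids:
        return None
    return portal_id, article_ids[0]


def group_by_article(urls):
    """Map each article id to the list of portals that publish it."""
    portals = defaultdict(list)
    for url in urls:
        key = article_key(url)
        if key is not None:
            portal_id, article_id = key
            portals[article_id].append(portal_id)
    return dict(portals)
```

An article id that maps to more than one portal is a candidate duplicate; which copy to keep in the index remains a separate decision.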
Relevant hits
Scoring model
<bean id="qf" class="com.findwise.jellyfish.solr.querymodifier.dismax.StaticQueryFieldSetter">
<property name="queryFields">
<list value-type="java.lang.String">
<value>keyword^4</value>
<value>content1^8</value>
<value>content2^3</value>
<value>content3^2</value>
<value>stem1^1.5</value>
<value>stem2^1.2</value>
<value>stem3</value>
</list>
</property>
</bean>
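
To make the boosts concrete: Solr's DisMax query parser takes a `qf` parameter listing the fields to search, each with an optional `^boost`. The sketch below builds such a request from the same field list. Whether `StaticQueryFieldSetter` produces exactly this parameter is an assumption, and the function name `dismax_params` is made up for the example.

```python
from urllib.parse import urlencode

# Field boosts copied from the bean definition above; None means no explicit boost.
QUERY_FIELDS = [
    ("keyword", 4),
    ("content1", 8),
    ("content2", 3),
    ("content3", 2),
    ("stem1", 1.5),
    ("stem2", 1.2),
    ("stem3", None),
]


def dismax_params(user_query, fields=QUERY_FIELDS):
    """Build Solr request parameters for a DisMax query over boosted fields."""
    qf = " ".join(
        name if boost is None else f"{name}^{boost}" for name, boost in fields
    )
    return {"q": user_query, "defType": "dismax", "qf": qf}


if __name__ == "__main__":
    # Print the URL-encoded query string for a sample query.
    print(urlencode(dismax_params("boliglån")))
```

With these values a match in `content1` weighs twice as much as one in `keyword`, and the stemmed fields contribute least, so exact matches tend to outrank stemmed ones.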
System Architecture
•Solr is incredibly flexible
• Master/slave
•Security constraints
• Search services available publicly
• Search analytics available internally but limited
• Indexing
Quality Assurance
•Crawler-friendly content modifications
• Edit
• Delete
• Add
• Share
• Risk analysis, etc.