Presented by Engy Ali | The Library of Alexandria. See the conference video: http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012
Do you have a large collection of text content that you want to search? Are you facing challenges with faceting after a full-text search across metadata and content? Do you want to use Solr with personalization? Bibliotheca Alexandrina provides public access to digitized book collections exceeding 220,000 books through a web-based search and browsing facility. The facility is built entirely on Solr and supports five languages. The website provides full-text morphological search within the books’ metadata and content, with result highlighting. Personalization features such as annotation tools and tagging are also implemented using Solr. This presentation covers how Bibliotheca Alexandrina uses Solr to implement full-text indexing and searching across the entire collection, faceting, search within the content of a book with result highlighting, and the techniques used for personalization.
5/14/12 http://dar.bibalex.org
Accessing Your Library Book Collections Using Solr
By: Engy Morsy, Software Project Manager, Bibliotheca Alexandrina
BA & Solr
http://bibalex.org
http://wamcp.bibalex.org
http://ssc.bibalex.org
http://dar.bibalex.org
Introductory Video
Agenda
• Brief introduction to DAR architecture
• Indexing books’ collection
• Searching across Metadata and Content
• Faceting
• Searching Book Content
• Solr with personalization
• Future
• Q&A
About 1.5 Million books
Digital Assets Repository
Book site
• Approximately 260,000 books
• Nearly 220,000 books published online
• About 1.5 TB of content
• Average book size 6 MB
• Daily indexing rate is about 150 books
What do we want…?
• Allow simple and advanced search across metadata and content in 5 languages
Simple Search
• Faceting
• Annotations
Text Underlining
Text Highlighting
Adding Sticky Notes
• Personalization
Arranging Books in Bookshelves
Submitting Comments
Rating
Embedding
Sharing the book link in other social networks
What lies beneath!!
Book site indices
AR Index
EN Index
FR Index
IT Index
SP Index
Query
Indexing Book Collection
• Index per language
• A document in the content index corresponds to a page in a book
• Maintain a field to distinguish between metadata records and content records (e.g. SolrType)
• Use static fields for all content indexes (e.g. PageID, etc.)
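A minimal sketch of the page-per-document layout described above. Only SolrType and PageID come from the slides; the other field names (BookID, Title, Text), the ID scheme, and the helper names are assumptions made for illustration, not the repository's actual schema.

```python
from typing import Dict, List


def metadata_record(book_id: str, title: str) -> Dict[str, str]:
    """Build the single metadata record for a book (SolrType = Meta)."""
    return {
        "id": f"{book_id}-meta",  # hypothetical ID scheme
        "SolrType": "Meta",
        "BookID": book_id,
        "Title": title,
    }


def content_records(book_id: str, pages: List[str]) -> List[Dict[str, str]]:
    """Build one content record per page (SolrType = Content)."""
    return [
        {
            "id": f"{book_id}-p{number}",  # hypothetical ID scheme
            "SolrType": "Content",
            "BookID": book_id,  # assumed link back to the parent book
            "PageID": str(number),
            "Text": text,
        }
        for number, text in enumerate(pages, start=1)
    ]


# One book with two pages yields one Meta record and two Content records.
docs = [metadata_record("b1", "Mobile Technology")]
docs += content_records("b1", ["Introduction", "Notes on cloud computing"])
```

The SolrType value is what lets a single query be restricted to metadata records or to page records.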
What is the problem with this solution?
Problem for content search
Example: Advanced Search for Title: Mobile Technology AND Content: “cloud computing”
SolrType Content
SolrType Meta
Proposed solution
Title: Mobile Technology
Content: “cloud computing”
.. index
.. index
Get intersection
Result IDs
Facet result
Final result
Parent Book IDs
.. index
The problem is…
• Can’t get the faceting result directly from the content index
• Need to query the metadata index in order to get the final facet result
Processing time!!!
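The extra round trip can be sketched with in-memory data standing in for the two kinds of records: query the metadata, query the content, map page hits to parent book IDs, intersect, then go back to the metadata for facets. The field names BookID and Category, the sample records, and the helper names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def parent_book_ids(content_hits: Iterable[Dict[str, str]]) -> Set[str]:
    """Map matching page records back to the books that contain them."""
    return {hit["BookID"] for hit in content_hits}


def facet_counts(meta_records: List[Dict[str, str]], book_ids: Set[str], field: str) -> Counter:
    """Count facet values over the metadata records of the final books only."""
    return Counter(rec[field] for rec in meta_records if rec["BookID"] in book_ids)


meta_records = [
    {"BookID": "b1", "Title": "Mobile Technology", "Category": "Computing"},
    {"BookID": "b2", "Title": "Mobile Technology", "Category": "Business"},
    {"BookID": "b3", "Title": "Networks", "Category": "Computing"},
]

# Step 1: metadata query (Title: Mobile Technology) gives book IDs.
title_hits = {rec["BookID"] for rec in meta_records if rec["Title"] == "Mobile Technology"}

# Step 2: content query ("cloud computing") gives page hits, reduced to parent book IDs.
content_hits = [
    {"BookID": "b1", "PageID": "4"},
    {"BookID": "b1", "PageID": "9"},
    {"BookID": "b3", "PageID": "2"},
]

# Step 3: intersect the two ID sets.
final_ids = title_hits & parent_book_ids(content_hits)

# Step 4: a second pass over the metadata is needed just to compute the facets.
facets = facet_counts(meta_records, final_ids, "Category")
```

Step 4 is the cost the slide points at: the facet counts cannot be read off the content hits, so the metadata has to be consulted again for every search.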
Solution…!
• Metadata denormalization
– Denormalize metadata into content index
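Denormalization here means copying book-level fields onto every page record. A small sketch, assuming hypothetical field names (BookID, Title, Category, PageID, Text) and a hypothetical helper:

```python
from typing import Dict, List


def denormalize(meta: Dict[str, str], pages: List[Dict[str, str]], fields: List[str]) -> List[Dict[str, str]]:
    """Return new page records with the chosen metadata fields copied in."""
    copied = {name: meta[name] for name in fields}
    return [{**page, **copied} for page in pages]


meta = {"BookID": "b1", "Title": "Mobile Technology", "Category": "Computing"}
pages = [
    {"BookID": "b1", "PageID": "1", "Text": "Introduction"},
    {"BookID": "b1", "PageID": "2", "Text": "Notes on cloud computing"},
]

# Every page record now carries Title and Category, so facets can be
# computed from content hits without consulting the metadata records.
flat = denormalize(meta, pages, ["Title", "Category"])
```

Note that in this layout a change to one book's metadata touches every page record of that book.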
SolrType Content
SolrType Meta
Proposed solution
Title: Mobile Technology
Content: “cloud computing”
.. index
.. index
Get intersection
Result IDs
Facet result
Final result
Problem for content search
• Metadata denormalization…..
Worst choice!
• Re-indexing for changes in metadata
• Data processing is required.
New Solution
Indexing Metadata
• Index per language
• Separate content and metadata index
• Text field holds the whole book content in the metadata index
– The maxFieldLength has been set to maximum.
• e.g. 2147483647
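A sketch of a metadata record that also carries the whole book text, under assumed field names (BookID, Title, Text). The value 2147483647 is the one quoted on the slide and equals 2**31 - 1, the largest 32-bit signed integer. To my understanding maxFieldLength caps the number of tokens indexed per field, which is why the whitespace token count below is shown; treat that reading as an assumption.

```python
from typing import Dict, List

# Value quoted on the slide; equal to 2**31 - 1.
MAX_FIELD_LENGTH = 2147483647


def metadata_doc_with_content(meta: Dict[str, str], pages: List[str]) -> Dict[str, str]:
    """Return a copy of the metadata record with all page text in one field."""
    doc = dict(meta)
    doc["Text"] = "\n".join(pages)  # hypothetical field holding the full book
    return doc


doc = metadata_doc_with_content(
    {"BookID": "b1", "Title": "Mobile Technology"},
    ["Introduction", "Notes on cloud computing"],
)

# Crude whitespace tokenizer, for illustration only; a real analyzer differs.
token_count = len(doc["Text"].split())
fits = token_count <= MAX_FIELD_LENGTH
```

With title and content in the same record, the example query needs only one index and the facets come back with the same result.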
Back to the example
Example: Advanced Search for Title: Mobile Technology AND Content: “cloud computing”
Solution
Title: Mobile Technology
Content: “cloud computing”
Meta index
Facet result
Solution
Title: Mobile Technology
Content: “cloud computing”
Meta index
Content index
Get intersection
Meta index
Facet result
Separate indexes Vs. All in one
• Separate indexes
+ Indexing time
+ Index size
- Processing results (facets..)
- Scoring
• One index
- Index size
- Indexing time
+ Scoring
+ Processing time
Book content index
AR Index
EN Index
FR Index
IT Index
SP Index
Searching
• Simple and advanced search
– Cache the resulting IDs only
• Highlighting search results
– Get the full search result and highlight per page result
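Caching only the matching IDs can be sketched as below. The class, its method names, and the fake backend are hypothetical; the point is that the cache stores small ID tuples per query rather than full documents, and result pages are sliced from them.

```python
from typing import Callable, Dict, List, Tuple


class IdCache:
    """Caches only the matching document IDs per query, not the documents."""

    def __init__(self, search_ids: Callable[[str], List[str]]) -> None:
        self._search_ids = search_ids
        self._cache: Dict[str, Tuple[str, ...]] = {}
        self.backend_calls = 0  # counts how often the search backend is hit

    def ids_for(self, query: str) -> Tuple[str, ...]:
        """Return all IDs for a query, asking the backend only on a miss."""
        if query not in self._cache:
            self.backend_calls += 1
            self._cache[query] = tuple(self._search_ids(query))
        return self._cache[query]

    def page(self, query: str, start: int, rows: int) -> Tuple[str, ...]:
        """Slice one result page out of the cached IDs."""
        return self.ids_for(query)[start:start + rows]


def fake_search(query: str) -> List[str]:
    """Stand-in for the real search backend; always returns five IDs."""
    return ["b1", "b2", "b3", "b4", "b5"]


cache = IdCache(fake_search)
first_page = cache.page("cloud computing", 0, 2)
second_page = cache.page("cloud computing", 2, 2)
```

Moving to the second page reuses the cached IDs, so the backend is queried once for the whole browsing session on that query.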
Book Content Search
• Search using
– Search query
– Book ID
– List of pages’ IDs
• Highlights
• Annotations
– Saved currently in DB
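One way to express a search restricted to a single book and a list of pages is with standard Solr request parameters (q, fq, hl, hl.fl). The field names Text, BookID and PageID and the helper name are assumptions, and the sketch does no query escaping.

```python
from typing import List, Tuple
from urllib.parse import urlencode


def book_content_query(query: str, book_id: str, page_ids: List[str]) -> List[Tuple[str, str]]:
    """Build request parameters for a phrase search inside one book."""
    params = [
        ("q", f'Text:"{query}"'),      # phrase query on the page text (no escaping)
        ("fq", f"BookID:{book_id}"),   # restrict to the selected book
        ("hl", "true"),                # turn highlighting on
        ("hl.fl", "Text"),             # highlight within the page text field
    ]
    if page_ids:
        # Optionally restrict to the pages already known to match.
        params.append(("fq", "PageID:(" + " OR ".join(page_ids) + ")"))
    return params


params = book_content_query("cloud computing", "b1", ["4", "9"])
query_string = urlencode(params)  # ready to append to a select URL
```

The page ID list could come from the cached IDs of the original search, so only pages already known to match are highlighted.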
Faceting
• Fixed facet fields
– Category, sub-category, language, etc.
– Stored, indexed, exact fields
• Process facets from different indices
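Reading "process facets from different indices" as merging per-index facet counts into one list (an assumption about what the slide means), a minimal sketch looks like this. The index keys and counts are made-up sample data.

```python
from collections import Counter
from typing import Dict


def merge_facets(per_index: Dict[str, Dict[str, int]]) -> Counter:
    """Sum the counts of each facet value across all indices."""
    total: Counter = Counter()
    for counts in per_index.values():
        total.update(counts)  # Counter.update adds counts rather than replacing
    return total


# Hypothetical Category facet counts returned by three language indices.
per_index = {
    "EN": {"Science": 3, "History": 1},
    "FR": {"Science": 2},
    "AR": {"History": 4, "Literature": 6},
}
merged = merge_facets(per_index)
```

Summing works here because the facet fields are fixed and exact, so the same value string means the same category in every index.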
Personalization
• Using separate index of personalization
– Different Solr fields for different languages.
– Search across all fields.
• Saving in both Solr and DB
• Indexing tags, rating and comments using type field
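A sketch of personalization records that share one index and are told apart by a type field, with a separate text field per language. The field names (Type, UserID, BookID, Rating, Text_<lang>), the language suffixes, and both helpers are hypothetical.

```python
from typing import Dict, Union

LANGS = ("ar", "en", "fr", "it", "sp")  # assumed suffixes for the five languages
KINDS = ("tag", "rating", "comment")


def personalization_doc(user_id: str, book_id: str, kind: str,
                        value: str, lang: str = "en") -> Dict[str, Union[str, int]]:
    """Build one personalization record; the Type field tells the kinds apart."""
    if kind not in KINDS:
        raise ValueError(f"unknown kind: {kind}")
    doc: Dict[str, Union[str, int]] = {"Type": kind, "UserID": user_id, "BookID": book_id}
    if kind == "rating":
        doc["Rating"] = int(value)  # ratings are numeric, no language field
    else:
        if lang not in LANGS:
            raise ValueError(f"unknown language: {lang}")
        doc[f"Text_{lang}"] = value  # one text field per language
    return doc


def search_all_languages(term: str) -> str:
    """Build a query that looks for the term in every language field."""
    return " OR ".join(f'Text_{lang}:"{term}"' for lang in LANGS)


tag = personalization_doc("u1", "b1", "tag", "history", "en")
rating = personalization_doc("u1", "b1", "rating", "4")
query = search_all_languages("history")
```

Keeping a field per language lets each one use its own analysis, while the OR query still searches across all of them.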
Future
• Book mobile application using Solr
• Using Hadoop
• Indexing other digital media (maps, audio, video)
Contact
engy.morsy@bibalex.org
Library website: http://bibalex.org
Digital Asset Repository: http://dar.bibalex.org
Thank you…