A Collections Searching Center Using Lucene – Solr Ching-hsien Wang Smithsonian Institution Collections.si.edu [email protected]


Page 1: A Collections Searching Center Using  Lucene – Solr

A Collections Searching Center Using Lucene – Solr

Ching-hsien Wang
Smithsonian Institution
Collections.si.edu [email protected]

Page 2: A Collections Searching Center Using  Lucene – Solr

Background Information

The Smithsonian Institution is a public institution whose mission is the increase and diffusion of knowledge.

19 museums and 9 research institutes
136 million collection objects
12 major museum collection information systems (with 30 databases)
Hundreds of other databases

Page 3: A Collections Searching Center Using  Lucene – Solr

Issues we faced

Users want information now!
Google effect and the user's mentality: “If it is not online, it does not exist.”
Users want immediate access to digital documents.
Separate databases are confusing to the public.

We must act now!

Page 4: A Collections Searching Center Using  Lucene – Solr

Smithsonian’s Collection Searching Center Overview

a discovery center for information with a single searching point

faceted searching and content-sensitive navigation

positive and negative browse & select options

relevancy ranking of search results

automatic stemming for word matching

Page 5: A Collections Searching Center Using  Lucene – Solr

Smithsonian’s Cross Searching Catalog Overview (continued)

integrated searching of data from multiple types of databases

scalability for large data sets

a metadata center which interacts with other online applications

Page 6: A Collections Searching Center Using  Lucene – Solr

Project Team and Resources

Andrew Gunther – Software development and implementation

Jim Felley – Data conversion and implementation

George Bowman – Database management and security configuration

Randy Arnold – Project support

Ching-hsien Wang – Program Manager

Since August 2007, we have integrated data from 12 major databases with 2 million records.

Page 7: A Collections Searching Center Using  Lucene – Solr

Starting from Multiple databases

Page 8: A Collections Searching Center Using  Lucene – Solr

Transform into a single Search Center

Page 9: A Collections Searching Center Using  Lucene – Solr

Cross Searching Demo – simple opening screen

Page 10: A Collections Searching Center Using  Lucene – Solr

Demo – search result screen

Page 11: A Collections Searching Center Using  Lucene – Solr

Demo – search history

Page 12: A Collections Searching Center Using  Lucene – Solr

Process Flow Diagram

[Diagram. Labels: Horizon sources (Library, Archives) and Museum digital sources; Data Extract and Transformation; XML documents; Solr / Lucene Index; Output data in XML, in JSON, and in Python; Cross Searching Catalog; Online Exhibition; Virtual Museum in 2nd Life; Education Interface; Open Access Applications.]
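The process flow names XML, JSON, and Python as the output formats served to downstream applications. In Solr these correspond to the `wt` (response writer) request parameter. The following is a minimal sketch of how a client application might ask for each format. The host, port, and path in `SOLR_SELECT_URL` are assumptions for illustration, since the slides do not give the real endpoint.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the slides do not state the real host or path.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"

SUPPORTED_FORMATS = {"xml", "json", "python"}


def build_search_url(query, response_format="json", rows=10):
    """Return a Solr select URL that requests the given response writer (wt)."""
    if response_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported response format: {response_format}")
    params = {"q": query, "wt": response_format, "rows": rows}
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # One URL per output format named in the diagram.
    for fmt in ("xml", "json", "python"):
        print(build_search_url("West Point", fmt))
```

The same index can then feed several front ends, each picking the format that is easiest for it to parse.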

Page 13: A Collections Searching Center Using  Lucene – Solr

XML Data Transformation

[Diagram. Labels: Horizon databases (Archives, Art Inventory, Photo Archives, Exhibition Catalogs, Research Bibliographies, Smithsonian History, Library, Airplane Directory), each with a Trigger; Solr_Index_Pending……. DB Table; a Perl program converts records based on BIB#; XML Documents; Automated Process.]
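The automated process described here (database triggers queue changed records in a pending table, and a program converts the queued records to XML by BIB#) can be sketched roughly as follows. This is an illustration only: the slides say the real converter is a Perl program, and this sketch swaps in Python with an in-memory SQLite database. The table names, column names, and the simplified `<doc>` layout are all assumptions.

```python
import sqlite3
import xml.etree.ElementTree as ET


def make_demo_db():
    """Create a record table, a pending table, and triggers (all names hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE bib_record (bib_id INTEGER PRIMARY KEY, title TEXT, source TEXT);
        CREATE TABLE solr_index_pending (bib_id INTEGER PRIMARY KEY);
        -- Any insert or update queues the record number for re-indexing.
        CREATE TRIGGER queue_on_insert AFTER INSERT ON bib_record
        BEGIN
            INSERT OR IGNORE INTO solr_index_pending (bib_id) VALUES (NEW.bib_id);
        END;
        CREATE TRIGGER queue_on_update AFTER UPDATE ON bib_record
        BEGIN
            INSERT OR IGNORE INTO solr_index_pending (bib_id) VALUES (NEW.bib_id);
        END;
    """)
    return conn


def convert_pending(conn):
    """Convert every queued record to an XML <doc> string, then clear the queue."""
    rows = conn.execute(
        "SELECT r.bib_id, r.title, r.source FROM solr_index_pending p "
        "JOIN bib_record r ON r.bib_id = p.bib_id ORDER BY r.bib_id"
    ).fetchall()
    docs = []
    for bib_id, title, source in rows:
        doc = ET.Element("doc")
        ET.SubElement(doc, "record_ID").text = f"{source}_{bib_id}"
        ET.SubElement(doc, "title").text = title
        docs.append(ET.tostring(doc, encoding="unicode"))
    conn.execute("DELETE FROM solr_index_pending")
    conn.commit()
    return docs


if __name__ == "__main__":
    conn = make_demo_db()
    conn.execute("INSERT INTO bib_record VALUES (905285, 'Story of West Point', 'siris_sil')")
    print(convert_pending(conn))
```

Keying the pending table on the record number means a record edited several times between runs is converted only once.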

Page 14: A Collections Searching Center Using  Lucene – Solr

Define an Index Metadata Model: Free text data fields used for keyword searching & display

Record Link, Title/Object-name, Identifier, Physical Description, Gallery Label, Notes, Publisher, Object Type, Taxonomic Name

Language, Topic, Place, Date, Name, Culture, Set Name, Data Source, Credit Line, Online Media Group

Page 15: A Collections Searching Center Using  Lucene – Solr

Facet data fields used for browsing and limiting

Record ID, Object Type, Language, Topic, Place, Date, Name, Culture, Data Source, Online Media Type, Rights for Online Media File, Related Record, Usage Flag

Taxon-Kingdom, Taxon-Phylum, Taxon-Division, Taxon-Class, Taxon-Order, Taxon-Family, Taxon-Sub-Family, Scientific_name, Common name

Geo-Age-Era, Geo-Age-System, Geo-Age-Series, Geo-Age-Stage, Strat-Group, Strat-Formation, Strat-Member

Page 16: A Collections Searching Center Using  Lucene – Solr

Getting help from Solr

Task-specific handlers:
Request handler
Response handler
Update handler

The schema.xml file defines the fields to be indexed, displayed, and searchable.

The solrconfig.xml file defines cache size, faceted field type, and request handler customization.

[Diagram: Solr on top of the Lucene index]

Page 17: A Collections Searching Center Using  Lucene – Solr

Solrconfig.xml example: facet field definition

    <str name="facet.field">object_type</str>
    <str name="facet.field">language</str>
    <str name="facet.field">topic</str>
    <str name="facet.field">place</str>
    <str name="facet.field">date</str>
    <str name="facet.field">name</str>
    <str name="facet.field">culture</str>
    <str name="facet.field">online_media_type</str>
    <str name="facet.field">set_name</str>
    <str name="facet.field">data_source</str>
    <str name="facet.field">tax_kingdom</str>
    <str name="facet.field">tax_phylum</str>
    <str name="facet.field">tax_division</str>
    <str name="facet.field">tax_class</str>
    <str name="facet.field">tax_order</str>
    <str name="facet.field">tax_family</str>
    <str name="facet.field">tax_sub-family</str>
    <str name="facet.field">common_name</str>
    <str name="facet.field">scientific_name</str>
    <str name="facet.field">freetext</str>
    <str name="facet.field">text</str>
  </lst>
</requestHandler>
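The facet fields configured above can also be requested per query through Solr's standard `facet`, `facet.field`, and `fq` parameters. The sketch below shows one way a client might build such a query, including the "positive and negative" selections mentioned in the overview, and read the counts back. The function names and the chosen subset of fields are assumptions for illustration. The pairing helper relies on Solr's default JSON layout, in which each facet's counts arrive as a flat `[term, count, term, count, ...]` list.

```python
from urllib.parse import urlencode

# A subset of the facet fields listed in the solrconfig.xml example.
FACET_FIELDS = ["object_type", "topic", "place", "date"]


def build_facet_params(query, facet_fields=FACET_FIELDS, include=(), exclude=()):
    """Build a query string with facets plus positive and negative filters.

    include and exclude are sequences of (field, value) pairs; an excluded
    pair becomes a negated filter query such as -place:"Africa".
    """
    params = [("q", query), ("wt", "json"), ("facet", "true"), ("facet.mincount", "1")]
    params += [("facet.field", field) for field in facet_fields]
    params += [("fq", f'{field}:"{value}"') for field, value in include]
    params += [("fq", f'-{field}:"{value}"') for field, value in exclude]
    return urlencode(params)


def pair_facet_counts(flat_counts):
    """Turn a flat [term, count, term, count, ...] list into (term, count) pairs."""
    return list(zip(flat_counts[0::2], flat_counts[1::2]))


if __name__ == "__main__":
    print(build_facet_params("art", include=[("place", "Africa")],
                             exclude=[("object_type", "Books")]))
```

Each extra `facet.field` adds work per query, which is one reason to limit facets to the most useful ones.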

Page 18: A Collections Searching Center Using  Lucene – Solr

Data Example (abbreviated) – a Library Book

<doc boost="1">
  <descriptiveNonRepeating>
    <record_ID>siris_sil_905285</record_ID>
    <unit_code>SIL</unit_code>
    <data_source>Smithsonian Institution Libraries</data_source>
    <title_sort>STORY OF WEST POINT: 18021943 THE WEST POINT TRADITION IN AMERICAN LIFE</title_sort>
    <title label="Title">Story of West Point: 1802-1943; the West Point tradition in American life</title>
  </descriptiveNonRepeating>
  <descriptiveOptional>
    <freetext category="dataSource" label="Data Source">Smithsonian Institution Libraries</freetext>
    <freetext category="objectType" label="Type">Books</freetext>
    <freetext category="date" label="Date">1943</freetext>
  </descriptiveOptional>
  <indexedStructured>
    <object_type>Books</object_type>
    <date>1943</date>
  </indexedStructured>
</doc>

Page 19: A Collections Searching Center Using  Lucene – Solr

Data Example (abbreviated) – a Photograph

<doc boost="6.4">
  <descriptiveNonRepeating>
    <record_ID>siris_arc_104765</record_ID>
    <unit_code>EEPA</unit_code>
    <data_source>Eliot Elisofon Photographic Archives</data_source>
    <title_sort>AERIAL VIEW OF DOWNTOWN JOHANNESBURG SOUTH AFRICA SLIDE</title_sort>
    <title label="Title">Aerial view of downtown Johannesburg, South Africa, [slide]</title>
    <online_media mediaCount="1">
      <media thumbnail="http://sirismm.si.edu/eepa/eepthb/eepa_05859thb.jpg" type="Images">http://sirismm.si.edu/eepa/eep/eepa_05859.jpg</media>
    </online_media>
  </descriptiveNonRepeating>
  <descriptiveOptional>
    <freetext category="dataSource" label="Data Source">Eliot Elisofon Photographic Archives</freetext>
    <freetext category="identifier" label="Local number">EEPA EECL 15973</freetext>
    <freetext label="photographer" category="name">Elisofon, Eliot</freetext>
    <freetext category="physicalDescription" label="Physical description">slide : col</freetext>
    <freetext category="notes" label="Summary">This photograph was taken when Eliot Elisofon was on assignment for Life magazine and traveled to Africa from August 18, 1959 to December 20, 1959</freetext>
    <freetext category="objectType" label="Type">Photographs</freetext>
    <freetext category="topic" label="Topic">Mod. architecture/cityscape</freetext>
    <freetext category="place" label="Place">South Africa</freetext>
    <freetext category="date" label="Date">1959</freetext>
    <freetext category="setName" label="See more items in">Eliot Elisofon Field photographs 1942-1972</freetext>
  </descriptiveOptional>
  <indexedStructured>
    <name>Elisofon, Eliot</name>
    <object_type>Color slides</object_type>
    <object_type>Photographs</object_type>
    <object_type>Archival materials</object_type>
    <topic>Mod. architecture/cityscape</topic>
    <topic>Cultural landscapes</topic>
    <topic>Aerial photography</topic>
    <place>Africa</place>
    <place>South Africa</place>
    <date>1959</date>
    <online_media_type>Images</online_media_type>
  </indexedStructured>
</doc>

Page 20: A Collections Searching Center Using  Lucene – Solr

Data Example (abbreviated) – a sculpture

<doc boost="6.4">
  <descriptiveNonRepeating>
    <record_ID>siris_ari_7985</record_ID>
    <unit_code>ARI</unit_code>
    <data_source>Art Inventories</data_source>
    <title_sort>DREXEL MONUMENT SCULPTURE</title_sort>
    <title label="Title">The Drexel Monument, (sculpture)</title>
    <record_link>http://siris-artinventories.si.edu/ipac20/ipac.jsp?&amp;profile=all&amp;source=~!siartinventories&amp;uri=full=3100001~!7985~!0#focus</record_link>
    <online_media mediaCount="7">
      <media thumbnail="http://sirismm.si.edu/saam/scan3thb/S75004286_1bthb.jpg" type="Images">http://americanart.si.edu/images/1966/1966.47.36_1b.jpg</media>
    </online_media>
  </descriptiveNonRepeating>
  <descriptiveOptional>
    <freetext category="dataSource" label="Data Source">Art Inventories</freetext>
    <freetext category="identifier" label="Control number">IAS 75004286</freetext>
    <freetext label="sculptor" category="name">Manger, Heinrich b. 1833</freetext>
    <freetext label="founder" category="name">Chas. F. Heaton</freetext>
    <freetext category="title" label="title">Francis M. Drexel Monument, (sculpture)</freetext>
    <freetext category="physicalDescription" label="Physical description">metal: bronze Sculpture: bronze; Base: granite; Fountain basin: concrete</freetext>
    <freetext category="notes" label="Description">Index of American Sculpture, University of Delaware, 1985</freetext>
    <freetext category="objectType" label="Type">Sculptures-Fountain</freetext>
    <freetext category="name" label="Subject">Drexel, Francis M</freetext>
    <freetext category="place" label="Place">Illinois</freetext>
    <freetext category="date" label="Date">1881. Cast 1882. Dedicated 1883</freetext>
  </descriptiveOptional>
  <indexedStructured>
    <name>Manger, Heinrich</name>
    <name>Chas. F. Heaton</name>
    <object_type>Sculptures</object_type>
    <topic>Portrait male</topic>
    <name>Drexel, Francis M</name>
    <place>Illinois</place>
    <date>1880s</date>
    <online_media_type>Images</online_media_type>
  </indexedStructured>
</doc>
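The three data examples share one layout: `descriptiveNonRepeating` holds identifying fields, `descriptiveOptional` holds labeled `freetext` values for keyword search and display, and `indexedStructured` holds the normalized facet terms. A consuming application might read such a record as sketched below. The sample is cut down from the library book example, and the function name and returned dictionary shape are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the library book example from the slides.
SAMPLE_DOC = """<doc boost="1">
  <descriptiveNonRepeating>
    <record_ID>siris_sil_905285</record_ID>
    <unit_code>SIL</unit_code>
  </descriptiveNonRepeating>
  <descriptiveOptional>
    <freetext category="objectType" label="Type">Books</freetext>
    <freetext category="date" label="Date">1943</freetext>
  </descriptiveOptional>
  <indexedStructured>
    <object_type>Books</object_type>
    <date>1943</date>
  </indexedStructured>
</doc>"""


def summarize_doc(xml_text):
    """Split one <doc> into its record ID, boost, display labels, and facet terms."""
    doc = ET.fromstring(xml_text)
    display = [(ft.get("label"), ft.text)
               for ft in doc.iterfind("descriptiveOptional/freetext")]
    facets = {}
    structured = doc.find("indexedStructured")
    for element in (structured if structured is not None else []):
        # Facet fields repeat (e.g. several <topic> elements), so collect lists.
        facets.setdefault(element.tag, []).append(element.text)
    return {
        "record_id": doc.findtext("descriptiveNonRepeating/record_ID"),
        "boost": float(doc.get("boost", "1")),
        "display": display,
        "facets": facets,
    }


if __name__ == "__main__":
    print(summarize_doc(SAMPLE_DOC))
```

Keeping display text and facet terms in separate sections lets the display keep the source wording (for example "1881. Cast 1882. Dedicated 1883") while the facet uses a cleaned value ("1880s").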

Page 21: A Collections Searching Center Using  Lucene – Solr

A system is only as good as the data that is in it.

Page 22: A Collections Searching Center Using  Lucene – Solr

Data mapping for multiple databases (truncated)

Page 23: A Collections Searching Center Using  Lucene – Solr

Faceted Categories

Determine the most useful facets; more is not better.

The number of unique facets will affect system response time.

The Smithsonian has 4.6 million unique terms. Among them:
864,000 names
126,000 topics
47,000 places
139 dates (down from 40,000 before cleanup)
1,000 types (down from 2,000 before cleanup)

Page 24: A Collections Searching Center Using  Lucene – Solr

Build the facet terms

650 $a Art $z Africa, North $v Periodicals.

<topic>Art</topic>
<place>Africa, North</place>
<object_type>Periodicals</object_type>
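The 650 example splits one subject heading into separate facet terms by subfield. A rough sketch of that step follows. The mapping table covers only the three subfields shown on the slide and is an assumption about the general rule. The parser works on the display form used on the slide (`$a ... $z ...`), not on binary MARC records.

```python
import re

# Assumed mapping for MARC 650, taken from the single example on the slide:
# $a -> topic, $z -> place, $v -> object_type.
SUBFIELD_TO_FACET = {"a": "topic", "z": "place", "v": "object_type"}


def parse_subfields(field_text):
    """Split '$a Art $z Africa, North $v Periodicals.' into (code, value) pairs."""
    pairs = re.findall(r"\$(\w)\s*([^$]*)", field_text)
    # Trim whitespace and the trailing period that ends the heading.
    return [(code, value.strip().rstrip(".").strip()) for code, value in pairs]


def subject_to_facets(field_text, mapping=SUBFIELD_TO_FACET):
    """Return (facet_field, term) pairs for the subfields that have a mapping."""
    facets = []
    for code, value in parse_subfields(field_text):
        if code in mapping and value:
            facets.append((mapping[code], value))
    return facets


if __name__ == "__main__":
    print(subject_to_facets("650 $a Art $z Africa, North $v Periodicals."))
```

Stripping the final period this simply would also shorten terms that legitimately end in an abbreviation, so a production converter would need a more careful rule.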

Page 25: A Collections Searching Center Using  Lucene – Solr

Build the facet terms

655 $a Photographs $y 1840-1860.

<type> Photographs </type><date> 1840s </date><date> 1850s </date><date> 1860s </date>
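The 655 example expands the date range 1840-1860 into one facet term per decade, which is consistent with the small number of unique date terms reported after cleanup. A minimal sketch of that expansion is below. The function names are hypothetical, and the accepted input forms (a four-digit year or a year range, with an optional trailing period) are an assumption; real records contain many more date patterns.

```python
import re


def decades_for_range(start_year, end_year):
    """Return decade labels such as '1840s' covering start_year..end_year."""
    if end_year < start_year:
        raise ValueError("end_year must not be earlier than start_year")
    first = start_year - start_year % 10
    last = end_year - end_year % 10
    return [f"{decade}s" for decade in range(first, last + 10, 10)]


def date_facets(text):
    """Parse '1840-1860.' or '1943' and return the matching decade labels."""
    match = re.fullmatch(r"\s*(\d{4})(?:\s*-\s*(\d{4}))?\.?\s*", text)
    if match is None:
        return []  # Unrecognized pattern: leave it for manual cleanup.
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else start
    return decades_for_range(start, end)


if __name__ == "__main__":
    print(date_facets("1840-1860."))
```

Returning an empty list for unrecognized patterns keeps bad values out of the facet rather than creating one-off terms.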

Page 26: A Collections Searching Center Using  Lucene – Solr

Challenges

Adapting LCSH and AAT terms in a whole new way

Still seeking a good way to use See and See Also reference data

Reducing data inconsistency in our records for better-quality facet terms

Character conversion challenges with MARC-8, Unicode, and UTF-8

Page 27: A Collections Searching Center Using  Lucene – Solr

Future plans

Continue to add data from more digital library databases and museum collection databases

Working on National History museum, and American Indian museum.

Complete the implementation of the capability to interact with external applications

Plan to support “American Art and Artist” application

Add new functionality such as my-list, list-sharing, social tagging.

Support more visual displays such as Google map and time slider

Page 28: A Collections Searching Center Using  Lucene – Solr

A Collections Searching Center Using Lucene – Solr

Ching-hsien Wang
Smithsonian Institution
www.siris.si.edu [email protected]