© Copyright 2013 LucidWorks Solr Powered Libraries: A survey of the world's knowledge bases May 2, 2013 Presented by Erik Hatcher Thursday, May 2, 13

Solr Powered Libraries




Page 1: Solr Powered Libraries

© Copyright 2013 LucidWorks

Solr Powered Libraries: A survey of the world's knowledge bases

May 2, 2013

Presented by Erik Hatcher

Page 2: Solr Powered Libraries

Abstract

Using Apache Lucene and Solr search technologies, information and knowledge have become vastly more searchable, findable, and accessible. Because scholars and researchers are some of the most demanding users of search systems, the problems encountered by the implementers are complex. For example, many of the applications built on these technologies also thrive on intentionally designed-in serendipitous discovery capabilities, bringing to light previously unknown, yet related and potentially interesting, content.

Libraries and other public knowledge-sharing environments, such as Wikipedia, generally embrace "open source" and community-improving contributions as core principles, making a lovely synergy with the power, features, and community-driven ecosystem provided by Lucene and Solr.

This talk will introduce you to several Solr powered library-related systems, detail how they work, and leave you with lessons learned that can be applied to your applications.

Page 3: Solr Powered Libraries

Real Solar Powered Library!

• http://www.ktsm.com/news/texas-library-runs-sunshine

Page 4: Solr Powered Libraries

Card-carrying library geek

• Applied Research in Patacriticism (ARP)
  - Rossetti Archive: http://www.rossettiarchive.org
  - NINES: http://www.nines.org/
  - Collex: http://www.collex.org
• Blacklight
  - originated as an implementation of Solr Flare
• Presentations
  - http://code4lib.org/conference: 2007, 2009, 2010, 2011, 2013
  - Library of Congress: "Solr Powered Libraries" (2007)
    » http://www.loc.gov/today/cyberlc/feature_wdesc.php?rec=4113
  - EBTI/CBETA Conference 2008
  - Publication: “Library 2.0 Initiatives in Academic Libraries”
• Windsor Lucene Summit
• eIFL-FOSS

Page 5: Solr Powered Libraries

Rossetti Archive

Page 6: Solr Powered Libraries

NINES/Collex

Page 7: Solr Powered Libraries

Card catalog

• the original inverted index


http://commons.wikimedia.org/wiki/File:Copyright_Card_Catalog_Files.jpg
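To make the card catalog analogy concrete, here is a minimal Python sketch of an inverted index: each term points to the set of cards that contain it, much as a catalog drawer points from a heading to the books filed under it. The card ids, titles, and function names are made up for illustration; this is not how Lucene stores its index internally.

```python
from collections import defaultdict


def build_inverted_index(cards):
    """Map each lowercased term to the set of card ids that contain it."""
    index = defaultdict(set)
    for card_id, text in cards.items():
        for term in text.lower().split():
            index[term].add(card_id)
    return index


def search(index, query):
    """Return the ids of cards that contain every term in the query (AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical catalog cards: id -> title text
cards = {
    1: "searching libraries with solr",
    2: "lucene in practice",
    3: "card catalogs in libraries",
}

index = build_inverted_index(cards)
matches = search(index, "libraries in")
```

Looking up a term is a single dictionary access, which is why the approach scales: the work is done once when the cards are filed, not on every search.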

Page 8: Solr Powered Libraries

• http://openlibrary.org/
  - project of the Internet Archive
• Goal: "A (community editable) web page for every book"

Page 9: Solr Powered Libraries

dp.la - Digital Public Library of America


Lucene/ElasticSearch Powered

Page 10: Solr Powered Libraries

Wikimedia/Wikipedia/MediaWiki

• Solr powered: translation memory service, GeoData extension, etc.

• "heavily modified Lucene" powers main site search currently

Page 11: Solr Powered Libraries

HathiTrust

• "partnership of major research institutions and libraries working to ensure that the cultural record is preserved and accessible long into the future."

• 10.5M books, 12TB OCR+metadata, hundreds of languages
  - "Books are different"
  - http://code4lib.org/conference/2013/burton-west

• http://www.hathitrust.org/blogs/large-scale-search
  - http://www.hathitrust.org/blogs/large-scale-search/too-many-words
  - "org.apache.solr.common.SolrException: Impossible Exception"
  - CommonGrams
  - word segmentation: autoGeneratePhraseQueries="false"

• HathiTrust Research Center
  - The infrastructure includes an entrance portal, search and collection-building tools (using Blacklight), ... analysis algorithms that can be run against the HathiTrust public domain corpus (more than 3 million volumes). In addition to the production services, the HTRC offers a development “sandbox”. The sandbox runs against non-Google scanned content (about 260,000 volumes) and provides a test-bed for interested researchers to experiment with writing their own algorithms for use in the HTRC infrastructure.
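The CommonGrams idea listed on this slide can be sketched in a few lines of Python. This is an illustration only, not HathiTrust's or Lucene's code. It assumes the index-time behaviour of Lucene's CommonGramsFilter: every token is kept, and a bigram joined with "_" is added whenever either word of an adjacent pair appears in a list of common words. The function name and the word list are hypothetical.

```python
def common_grams(tokens, common_words, separator="_"):
    """Emit each token, plus a bigram for every adjacent pair
    in which at least one word is a common word."""
    output = []
    for i, token in enumerate(tokens):
        output.append(token)
        if i + 1 < len(tokens):
            following = tokens[i + 1]
            if token in common_words or following in common_words:
                output.append(token + separator + following)
    return output


# Hypothetical common-word list; a real deployment would derive it
# from the most frequent terms in its own corpus.
COMMON = {"the", "of", "to"}

grams = common_grams(["the", "quick", "brown", "fox"], COMMON)
# grams == ["the", "the_quick", "quick", "brown", "fox"]
```

A phrase query that contains a common word can then match the much rarer bigram term instead of reading the very long postings list of the common word itself, which is the performance problem the large-scale-search posts linked above discuss.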

Page 12: Solr Powered Libraries

Smithsonian Institution

• http://collections.si.edu
• Many disparate data sources:
  - 19 museums, 20 libraries, 14 archives, 1 National Zoo, 1 Astrophysical Observatory, research centers in Panama, Boston, New York, Maryland, and Virginia
• "Documents" of all varieties:
  - Photographs, paintings, manuscripts, letters, postage stamps, scientific specimens, rockets, airplanes, postcards, sound recordings, posters, decorative arts, ceramics, maps, sculptures, publication papers, books, trade catalogs, etc.
• User tagging, negative/exclude filtering, DIH SolrEntityProcessor
• http://bit.ly/13P41YJ
  - http://www.basistech.com/pdf/events/open-source-search-conference/oss-2011-wang-steps-toward-open-government.pdf
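Negative/exclude filtering maps naturally onto Solr filter queries: a leading "-" on an `fq` clause removes matching documents from the result set. The Python sketch below only builds the request parameters. The field names (`unit`, `object_type`) and values are hypothetical and are not taken from the Smithsonian schema.

```python
from urllib.parse import urlencode


def quote(value):
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"%s"' % escaped


def filter_params(query, include=(), exclude=()):
    """Build Solr parameters with one fq per include or exclude filter."""
    params = [("q", query)]
    for field, value in include:
        params.append(("fq", "%s:%s" % (field, quote(value))))
    for field, value in exclude:
        # A leading "-" turns the filter into an exclusion.
        params.append(("fq", "-%s:%s" % (field, quote(value))))
    return params


params = filter_params(
    "rockets",
    include=[("unit", "National Air and Space Museum")],  # hypothetical field
    exclude=[("object_type", "Postcards")],  # hypothetical field
)
query_string = urlencode(params)
```

Keeping each filter in its own `fq` lets Solr cache the filters independently, so toggling one exclusion does not invalidate the others.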



Page 15: Solr Powered Libraries


• Serials Solutions Summon
• http://www.serialssolutions.com/en/services/summon
• SaaS, single unified index, match & merge

Page 16: Solr Powered Libraries

Astrophysics Data System Labs

• Smithsonian, NASA, Harvard
• http://adslabs.org


http://code4lib.org/conference/2013/luker

Page 17: Solr Powered Libraries

• vufind.org
• Powers main HathiTrust UI (currently) and many more
  - see http://vufind.org/wiki/installation_status



Page 19: Solr Powered Libraries


• "Blacklight is an open source Ruby on Rails gem that provides a discovery interface for any Solr index. Blacklight provides a default user interface which is customizable via the standard Rails (templating) mechanisms. Blacklight accommodates heterogeneous data, allowing different information displays for different types of objects."
  - http://projectblacklight.org

• Founded at the University of Virginia (2007): search.lib.virginia.edu
  - UV-A solar radiation == blacklight

• Initial contributors: UVa, Stanford, JHU, WGBH

• University of Hull, United States Holocaust Memorial Museum, University of Wisconsin-Madison, Tufts, Australian gov't (Natural Resource Management), Penn State's ScholarSphere, Northwestern, New York Public Library, NCSU, Columbia University, Agriculture Network Information Center (USDA), alicelaw.org (American Legislative and Issue Campaign Exchange, a one-stop web-based public library of progressive state and local laws), and many more

• http://projecthydra.org/ uses Blacklight as UI component

Page 20: Solr Powered Libraries

searchworks at Stanford

Page 21: Solr Powered Libraries

Advanced search at Stanford's searchworks

Page 22: Solr Powered Libraries

searchworks: Mapping Text Boxes to Solr query pieces

• http://code4lib.org/conference/2010/dushay_keck
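A minimal sketch of how advanced-search text boxes might be mapped to Solr query pieces. The slide does not show SearchWorks' actual code, so this is an assumption-laden illustration: it builds one nested dismax query per non-empty box using Solr's `_query_` and local-params syntax. The box names and the `qf_*` / `pf_*` request parameter names are hypothetical and would have to be defined in the request handler or sent with the request.

```python
# Hypothetical mapping: form box name -> (qf parameter, pf parameter)
FIELD_PARAMS = {
    "title": ("qf_title", "pf_title"),
    "author": ("qf_author", "pf_author"),
    "subject": ("qf_subject", "pf_subject"),
}


def escape_quotes(text):
    """Escape backslashes and double quotes for use inside a quoted nested query."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_advanced_query(form, operator="AND"):
    """Turn {box name: user text} into a single Solr q string."""
    clauses = []
    for box, text in form.items():
        text = text.strip()
        if not text:
            continue  # skip boxes the user left empty
        qf, pf = FIELD_PARAMS[box]
        clauses.append(
            '_query_:"{!dismax qf=$%s pf=$%s}%s"' % (qf, pf, escape_quotes(text))
        )
    return (" %s " % operator).join(clauses)


q = build_advanced_query({"title": "moby dick", "author": "", "subject": "whales"})
```

Each box keeps its own field weights, while the boolean operator between boxes stays under the user's control.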

Page 23: Solr Powered Libraries

• https://catalyst.library.jhu.edu/

Page 24: Solr Powered Libraries

Rock and Roll!

• \m/

Page 25: Solr Powered Libraries

Community and Resources

• code4lib:
  - http://www.code4lib.org/

• HathiTrust folks
  - http://www.hathitrust.org/blogs/large-scale-search
  - http://robotlibrarian.billdueber.com/

• http://bighumanities.net/
  - The Workshop on Big Humanities will be held in conjunction with the 2013 IEEE International Conference on Big Data (IEEE BigData 2013), which will take place between 6-9 October 2013 in Silicon Valley, California, USA, and which provides a leading international forum for disseminating the latest research in the growing field of “big data”.

Page 26: Solr Powered Libraries


http://heatherbrewer.com/blog/2013/04/15/libraries-rock/
