U.S Geological Survey National Biological Information Infrastructure

U.S Geological Survey National Biological Information

Infrastructure

Technical Overview:

NBII Metadata Clearinghouse

May 2008

Mike Frame

Topics for discussion

Metadata CH Background New Metadata CH Design & Demo Underlying Architecture

text

Describe and Discover

www.NBII.gov

PORTAL

My.NBII.gov

Content Management Integrated/Federated SearchCollaboration Services

Database and Web Services

Model ServicesGeospatial Services

ITIS DIGR CatalogThesaurus Mapping Geoparsing CatalogGeo -

referencingDiscovery CatalogOperations

Dublin Core (plus)

UDDI / WSDL ??OGC/ISO FGDC/ISO

Distributed Applications , Databases , Websites , Tools and Models

Consume

Integrated View

DistributedServices

Resource and Service Catalogs

DistributedResources

Resource Catalog

Geospatial Services Catalog

Geospatial Dataset

Resource Clearinghouse

Database and Web Services

Catalog

Model ServicesCatalog

Services Overview

NBII Metadata Resourceshttp://metadata.nbii.gov

http://metadata.nbii.gov

Metadata Resources:FGDC Metadata Program

Tool reviews Training OpportunitiesResources for using the Standard

NBII Clearinghouse

7 Sections make up the FGDC Standard:

1. Identification Information

2. Data Quality Information

3. Spatial Data Information

4. Spatial Reference Information

5. Entity and Attribute Information

6. Data Distribution Information

7. Metadata Reference Information

Some basic metadata facts…about the FGDC Standard

NBII Metadata CH

Rational for Metadata CH Redesign

User Feedback Metadata creation Metadata management Metadata integration with data Open architecture framework Speed and Reliability Data quality Data visualization License Costs

NBII Metadata CH provides: Single portal to information contained in disparate data

management systems Free text, fielded, spatial, and temporal search

capabilities Allow individuals and database managers to distribute

their data while maintaining complete control and ownership

Leverage investment in existing information systems and research• NBII is part of the Mercury Consortium @ ORNL

NBII CH: New Functionalities

Rich Client Interface

Combined search results (status page)

Filterring search results (Facet)

Dynamic sorting of search results

Bookmark brief and full metadata pages

Based on open source technologies:

• Lucene• Solr

NBII CH New Functionalities Cont.. SOA based design

• Web services• RSS services for search results• Portlet support• Search Sharing support

Thesaurus Support Seamless data ordering/data extraction with various data

partners Seamless data visualization integration with external

visualization tools Improved User Statistics Collection

The Clearinghouse is operated for NBII by the Oak Ridge National Laboratory

Over 38,000 records

41 partners contributing metadata records

Ability to search in a variety of ways

Redesigned in 2008

The NBII Clearinghouse

NBII CH Demo NBII Clearinghouse interface:

http://mercdev3.ornl.gov/nbii3/

How does the NBII Clearinghouse work?




Metadata CH RSSWorld Data Center

http://wdc.nbii.gov

NBII Metadata ClearinghouseArchitecture

Metadata CH Architecture

CH Function of the NBII Metadata Program Operated by ORNL• NBII is 1 Organization in Mercury Consortium

Established relationship in 2001 Formerly based on “Blue Angel

Technologies” Currently based on Lucene/Solr Open

Source Technologies

3. Remote users query the index via a Web-based browser

6. Highly detailed data and documentation are downloaded directly from the contributing agency

1. Principal investigators create detailed metadata and data files using local applications or ORNL- OME 2. NBII Mercury collects metadata and key data

from contributing agencies’ servers distributed around the country and builds a centralized index

4. Metadata summaries are returned to the remote users, including links back to detailed information and data at the PIs’ server or data repository

5. Remote users select links to data of interest

Index

Users

Virtual Internet Database

P.I. Summary – John Smith Product A Container: 1; 10/12/2003 Container 2; 01/20/2002 Container 3; 07/05/2001 Product B Container 1; 03/05/1999….

P.I. NameProduct NumberProduct TitleSiteSubject AreaThematic AreaKeywordsetc.

Distributed Data Discovery and Access System

Custom Export

Program

Custom Export

Program

ExistingDatabase

ExistingDatabase

ExistingDatabase

ExistingDatabase

ExistingDatabase

ExistingDatabase

EncryptedXML

EncryptedXML

Index

Metadata exists in remote legacy databases using any

platform, OS or RDBMS

Metadata are extracted into XML files yielding standardized data objects

Harvested metadata are combined at the central site, transformed (if needed), and indexed

Users work with a single, simple, web-like interface to access all data simultaneously

Databases can be of different structures

and content

Export programs are easily written and automated

These files can be remotely harvested via the Internet

Frequent, automated harvesting and complete re-building of the index keeps the aggregate database up to date

No re-programming of existing systems

required

Business as usual for contributing

databases

EncryptedXML

EncryptedXML

Custom Export

Program

Custom Export

Program

Z39.50 or WS

Z39.50 or WS

A Virtual Aggregate Database

NBII CH Design Diagram

Solr Schema for defining the fields Index metadata

records

NBII CH Harvester

FGDC-BIO

Transformed Files MySQLMercury3_harvests_nbii

DB updater tool(custom Java)

Solr Indexer tool(custom java)

XML Beans to extract the contents

SOLR Search Server Extended Lucene Index

UI

Solr Searcher(custom Java Spring)

Web Service Web Service

RSS

Portlets Portlets

External Metadata

http, ftp, web crawl

Future Development Phase II (May 2008 to September 2008):

• Harvester engine to use open source tools (Remove COTS) (Phase I & II)• Portal integration through JSR-168 Portlet standard

• Search portlets, portlets for recent datasets, top most searched words etc..• Web service implementation (Phase I & II):

• Thesaurus support (semantic web integration support)• Gazetteer web service implementation • OGC Catalog Service (include Web Mapping/Coverage/Feature Servers in search)• Universal Description, Discovery, and Integration (UDDI) Directory Services

• Dynamic RSS support, including Geo-RSS support• ISO 19115 support• OpenSearch support• Documentation and Help (Phase I & II)• User Statistics Application modifications

Phase III (October 2008 to January 2009):• Save, Retrieve and Email user queries• Possible integration to OPeNDAP • Web Service Harvesting (OAI)• Internationalization• ????

Search technology using Lucene/SOLR

Lucene• Overview

• Who uses Lucene

Solr• Overview

• Who uses Solr

Lucene Overview

High-performance, full-featured text search engine library written entirely in Java

Mature Apache Open Source Java Project Index speed and integrity, search speed

• uses file based full text and inverted indexing

• is extremely fast with built-in caching

Can easily handle millions of documents Very active mailing list for support

Who uses Lucene

Wikipedia MediaWiki European Bioinformatics Institute Liferay Bigsearch.ca Monster Academic Archive On-line Complete list:

• http://en.wikipedia.org/wiki/Lucene

• http://wiki.apache.org/lucene-java/PoweredBy

SOLR Overview

Open source enterprise search server based on the Lucene Java search library

Apache project, sub-project of Lucene

Advanced Full-Text Search Capabilities

Optimized for High Volume Web Traffic

Standards Based Open Interfaces - XML and HTTP

Solr uses Lucene search library and extends it

SOLR Overview Contd..

A Real Data Schema, with Numeric Types, Date fields, Dynamic Fields

Dynamic Faceted Browsing and Filtering

Advanced, Configurable Text Analysis

Highly Configurable and User Extensible Caching

External Configuration via XML

Scalability - Efficient Replication to other Solr Search Servers

Administration Interface is available

Who uses SOLR

CNET Reviews shopper.com AOL Music netflix search.com The Digital Commonwealth mindquarry for complete list:

http://wiki.apache.org/solr/PublicServers

Mercury Instances Demo NBII Clearinghouse interface:

http://mercdev3.ornl.gov/nbii3/

ORNLDAAC interface: http://daac.ornl.gov/

LBA Mercury interface: http://mercdev3.ornl.gov/lba3/

DADDI Mercury interface: http://mercdev3.ornl.gov/daddi3/

GFIS RSS Portal interface: http://www.gfis.net/gfis/home.faces

User Statistics Report Generation Tool

Open source Harvester Re-design (Aperture)

Questions, Comments,

Mike Frame

865 576-3605

[email protected]

Thanks to:

Giri PalanisamySystems Architect and Team LeaderMercury Consortium [email protected]

Vivian HutchisonNBII Metadata Program [email protected]

Documents

U.S Geological Survey National Biological Information Infrastructure