U.S. Geological Survey National Biological Information Infrastructure. Technical Overview: NBII Metadata Clearinghouse. May 2008. Mike Frame.


Page 1: U.S Geological Survey  National Biological Information Infrastructure

U.S. Geological Survey National Biological Information Infrastructure


Technical Overview:

NBII Metadata Clearinghouse

May 2008

Mike Frame

Page 2: U.S Geological Survey  National Biological Information Infrastructure

Topics for discussion

• Metadata CH Background
• New Metadata CH Design & Demo
• Underlying Architecture

Page 3: U.S Geological Survey  National Biological Information Infrastructure


Services Overview

[Diagram: NBII services architecture. Recoverable labels: Describe and Discover; Content Management; Integrated/Federated Search; Collaboration Services; Database and Web Services; Model Services; Geospatial Services; ITIS; DIGR; Catalog; Thesaurus; Mapping; Geoparsing; Georeferencing; Discovery; Operations; Dublin Core (plus); Distributed Applications, Databases, Websites, Tools and Models; Integrated View; Resource and Service Catalogs; Resource Catalog; Geospatial Services Catalog; Geospatial Dataset; Resource Clearinghouse; Model Services Catalog.]

Page 4: U.S Geological Survey  National Biological Information Infrastructure

NBII Metadata Resources: http://metadata.nbii.gov


Page 5: U.S Geological Survey  National Biological Information Infrastructure

Metadata Resources:

• FGDC Metadata Program
• Tool reviews
• Training Opportunities
• Resources for using the Standard
• NBII Clearinghouse

Page 6: U.S Geological Survey  National Biological Information Infrastructure

7 Sections make up the FGDC Standard:

1. Identification Information

2. Data Quality Information

3. Spatial Data Organization Information

4. Spatial Reference Information

5. Entity and Attribute Information

6. Data Distribution Information

7. Metadata Reference Information

Some basic metadata facts…about the FGDC Standard
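As a small illustration of the seven-section structure, the sketch below checks which main sections appear in an FGDC record encoded as XML. It assumes the conventional CSDGM short element names (idinfo, dataqual, spdoinfo, spref, eainfo, distinfo, metainfo) and uses the formal section names, which may differ slightly from the shortened labels on the slide. The sample record is hypothetical and heavily trimmed.

```python
import xml.etree.ElementTree as ET

# Short element names used by the FGDC CSDGM XML encoding for its
# seven main sections, mapped to the formal section names.
FGDC_SECTIONS = {
    "idinfo": "Identification Information",
    "dataqual": "Data Quality Information",
    "spdoinfo": "Spatial Data Organization Information",
    "spref": "Spatial Reference Information",
    "eainfo": "Entity and Attribute Information",
    "distinfo": "Distribution Information",
    "metainfo": "Metadata Reference Information",
}


def sections_present(xml_text):
    """Return the names of the main sections found directly under <metadata>."""
    root = ET.fromstring(xml_text)
    return [name for tag, name in FGDC_SECTIONS.items()
            if root.find(tag) is not None]


# Hypothetical, trimmed record used only for illustration.
SAMPLE = """<metadata>
  <idinfo><citation><citeinfo><title>Example bird survey</title></citeinfo></citation></idinfo>
  <metainfo><metstdn>FGDC Content Standard for Digital Geospatial Metadata</metstdn></metainfo>
</metadata>"""

print(sections_present(SAMPLE))
```

A check like this only reports which sections exist; it does not validate a record against the standard.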

Page 7: U.S Geological Survey  National Biological Information Infrastructure

NBII Metadata CH

Page 8: U.S Geological Survey  National Biological Information Infrastructure

Rationale for Metadata CH Redesign

• User feedback
• Metadata creation
• Metadata management
• Metadata integration with data
• Open architecture framework
• Speed and reliability
• Data quality
• Data visualization
• License costs

Page 9: U.S Geological Survey  National Biological Information Infrastructure

NBII Metadata CH provides:

• Single portal to information contained in disparate data management systems
• Free text, fielded, spatial, and temporal search capabilities
• Allows individuals and database managers to distribute their data while maintaining complete control and ownership
• Leverages investment in existing information systems and research
• NBII is part of the Mercury Consortium @ ORNL
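To make the four search modes concrete, here is a minimal in-memory sketch of how free-text, fielded, spatial, and temporal criteria can be combined on one record. It is not how the Clearinghouse implements search; the Record fields, the sample records, and the function names are all assumptions made for illustration, and the bounding-box test ignores boxes that cross the antimeridian.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Record:
    title: str
    abstract: str
    originator: str
    bbox: tuple   # (west, south, east, north) in decimal degrees
    begin: date
    end: date


def matches(rec, text=None, originator=None, bbox=None, period=None):
    """True when the record satisfies every criterion that was supplied."""
    if text and text.lower() not in (rec.title + " " + rec.abstract).lower():
        return False                      # free-text search
    if originator and rec.originator != originator:
        return False                      # fielded search
    if bbox:                              # spatial search: boxes must overlap
        w, s, e, n = bbox
        rw, rs, re_, rn = rec.bbox
        if rw > e or re_ < w or rs > n or rn < s:
            return False
    if period:                            # temporal search: ranges must overlap
        start, stop = period
        if rec.begin > stop or rec.end < start:
            return False
    return True


def search(records, **criteria):
    return [r.title for r in records if matches(r, **criteria)]


# Hypothetical records used only for illustration.
RECORDS = [
    Record("Breeding bird survey", "Annual roadside counts of birds", "Agency A",
           (-90.0, 30.0, -80.0, 40.0), date(1990, 1, 1), date(2000, 12, 31)),
    Record("Stream fish inventory", "Fish sampled at stream sites", "Agency B",
           (-120.0, 35.0, -110.0, 45.0), date(2003, 5, 1), date(2005, 9, 30)),
]

print(search(RECORDS, text="fish"))
print(search(RECORDS, bbox=(-85.0, 32.0, -75.0, 38.0)))
```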

Page 10: U.S Geological Survey  National Biological Information Infrastructure

NBII CH: New Functionalities

Rich Client Interface

Combined search results (status page)

Filtering search results (facets)

Dynamic sorting of search results

Bookmark brief and full metadata pages

Based on open source technologies:

• Lucene
• Solr
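Faceted filtering and dynamic sorting map onto standard Solr request parameters (q, fq, facet, facet.field, sort). The sketch below only builds a query URL; the base URL and the field names (partner, keyword, title) are assumptions for illustration and are not taken from the NBII schema.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_query_url(base_url, text, facet_fields=(), filters=None, sort=None, rows=10):
    """Build a Solr select URL with optional facets, filters, and sorting."""
    params = [("q", text), ("rows", str(rows)), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        # One facet.field parameter per field to count values for.
        params += [("facet.field", f) for f in facet_fields]
    for field, value in (filters or {}).items():
        # Each fq narrows the result set without affecting relevance scores.
        params.append(("fq", f'{field}:"{value}"'))
    if sort:
        params.append(("sort", sort))     # for example "title asc"
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and field names.
url = build_query_url(
    "http://localhost:8983/solr/select",
    "salmon habitat",
    facet_fields=["partner", "keyword"],
    filters={"partner": "Agency A"},
    sort="title asc",
)
print(url)
print(parse_qs(urlsplit(url).query)["facet.field"])
```

Selecting a facet value in the interface would then correspond to adding one more fq parameter and re-running the same query.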

Page 11: U.S Geological Survey  National Biological Information Infrastructure

NBII CH New Functionalities (cont.)

SOA-based design
• Web services
• RSS services for search results
• Portlet support
• Search sharing support

Thesaurus support

Seamless data ordering/data extraction with various data partners

Seamless data visualization integration with external visualization tools

Improved user statistics collection
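An RSS service for search results can be pictured as serializing each hit into an RSS 2.0 item. The following is a sketch under assumptions: the result dictionaries, their keys, and the URLs are hypothetical, and the real service may expose different elements.

```python
import xml.etree.ElementTree as ET


def results_to_rss(query, results, search_url):
    """Serialize search results as a minimal RSS 2.0 feed."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # title, link, and description are the required channel elements.
    ET.SubElement(channel, "title").text = f"Search results for: {query}"
    ET.SubElement(channel, "link").text = search_url
    ET.SubElement(channel, "description").text = "Metadata records matching a saved search"
    for hit in results:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = hit["title"]
        ET.SubElement(item, "link").text = hit["link"]
        ET.SubElement(item, "description").text = hit["summary"]
    return ET.tostring(rss, encoding="unicode")


# Hypothetical search results.
HITS = [
    {"title": "Breeding bird survey", "link": "http://agency-a.example/birds.xml",
     "summary": "Annual roadside counts of birds"},
    {"title": "Stream fish inventory", "link": "http://agency-b.example/fish.xml",
     "summary": "Fish sampled at stream sites"},
]

feed = results_to_rss("survey", HITS, "http://clearinghouse.example/search?q=survey")
print(feed)
```

Because the feed is regenerated from the query each time it is fetched, a subscriber sees new matching records without re-running the search by hand.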

Page 12: U.S Geological Survey  National Biological Information Infrastructure

The Clearinghouse is operated for NBII by the Oak Ridge National Laboratory

Over 38,000 records

41 partners contributing metadata records

Ability to search in a variety of ways

Redesigned in 2008

The NBII Clearinghouse

Page 13: U.S Geological Survey  National Biological Information Infrastructure

NBII CH Demo NBII Clearinghouse interface:


Page 14: U.S Geological Survey  National Biological Information Infrastructure

How does the NBII Clearinghouse work?

Page 15: U.S Geological Survey  National Biological Information Infrastructure

How does the NBII Clearinghouse work?

Page 16: U.S Geological Survey  National Biological Information Infrastructure

How does the NBII Clearinghouse work?

Page 17: U.S Geological Survey  National Biological Information Infrastructure

How does the NBII Clearinghouse work?

Page 18: U.S Geological Survey  National Biological Information Infrastructure

Metadata CH RSS

World Data Center


Page 19: U.S Geological Survey  National Biological Information Infrastructure

NBII Metadata Clearinghouse Architecture

Page 20: U.S Geological Survey  National Biological Information Infrastructure

Metadata CH Architecture

• CH is a function of the NBII Metadata Program
• Operated by ORNL
  • NBII is one organization in the Mercury Consortium
• Relationship established in 2001
• Formerly based on “Blue Angel Technologies”
• Currently based on Lucene/Solr open source technologies

Page 21: U.S Geological Survey  National Biological Information Infrastructure

1. Principal investigators create detailed metadata and data files using local applications or ORNL-OME

2. NBII Mercury collects metadata and key data from contributing agencies’ servers distributed around the country and builds a centralized index

3. Remote users query the index via a Web-based browser

4. Metadata summaries are returned to the remote users, including links back to detailed information and data at the PIs’ server or data repository

5. Remote users select links to data of interest

6. Highly detailed data and documentation are downloaded directly from the contributing agency

Virtual Internet Database

Indexed fields: P.I. Name, Product Number, Product Title, Site, Subject Area, Thematic Area, Keywords, etc.

Example P.I. summary (John Smith): Product A: Container 1; 10/12/2003, Container 2; 01/20/2002, Container 3; 07/05/2001. Product B: Container 1; 03/05/1999 ….

Distributed Data Discovery and Access System
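The numbered workflow on this slide can be sketched in a few lines: metadata is collected from contributing servers into one central index (step 2), users query that index (step 3), and they get back summaries with links to the contributor's own server (step 4). The server URLs, record layout, and function names below are hypothetical stand-ins, and the "harvest" reads from an in-memory dictionary rather than the network.

```python
# Hypothetical contributing servers and the metadata records they expose.
CONTRIBUTING_SERVERS = {
    "http://agency-a.example/metadata": [
        {"id": "birds-1990", "title": "Breeding bird survey",
         "keywords": ["birds", "survey"]},
    ],
    "http://agency-b.example/metadata": [
        {"id": "fish-2003", "title": "Stream fish inventory",
         "keywords": ["fish", "streams"]},
        {"id": "fish-2005", "title": "Lake fish survey",
         "keywords": ["fish", "lakes", "survey"]},
    ],
}


def build_central_index(servers):
    """Step 2: collect metadata from every server into one index."""
    index = []
    for base_url, records in servers.items():
        for rec in records:
            index.append({
                "title": rec["title"],
                "keywords": set(rec["keywords"]),
                # The index stores a link back to the contributor, not the data.
                "link": f"{base_url}/{rec['id']}.xml",
            })
    return index


def query_index(index, keyword):
    """Steps 3-4: return summaries (title plus link) for matching records."""
    return [{"title": e["title"], "link": e["link"]}
            for e in index if keyword in e["keywords"]]


index = build_central_index(CONTRIBUTING_SERVERS)
for summary in query_index(index, "survey"):
    print(summary["title"], "->", summary["link"])
```

Steps 5 and 6 then happen outside the index: the user follows the link and downloads data directly from the contributing agency.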

Page 22: U.S Geological Survey  National Biological Information Infrastructure

[Diagram labels: Custom Export; Z39.50 or WS]

• Metadata exists in remote legacy databases using any platform, OS or RDBMS
  • Databases can be of different structures and content
  • No re-programming of existing systems
  • Business as usual for contributing
• Metadata are extracted into XML files yielding standardized data objects
  • Export programs are easily written and automated
  • These files can be remotely harvested via the Internet
• Harvested metadata are combined at the central site, transformed (if needed), and indexed
  • Frequent, automated harvesting and complete re-building of the index keep the aggregate database up to date
• Users work with a single, simple, web-like interface to access all data simultaneously

A Virtual Aggregate Database
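A custom export program of the kind this slide describes can be quite small. The sketch below reads rows from a legacy table and writes each one as a standardized XML record, leaving the source database untouched. The table, column names, and target element names are assumptions for illustration; an in-memory SQLite database stands in for whatever RDBMS a contributor actually runs.

```python
import sqlite3
import xml.etree.ElementTree as ET


def export_to_xml(conn, query, column_to_element):
    """Turn each row of a legacy query into one standardized XML record."""
    cursor = conn.execute(query)
    columns = [col[0] for col in cursor.description]
    records = []
    for row in cursor:
        record = ET.Element("record")
        for column, value in zip(columns, row):
            # Map the local column name onto the shared element name.
            ET.SubElement(record, column_to_element[column]).text = str(value)
        records.append(ET.tostring(record, encoding="unicode"))
    return records


# Hypothetical legacy database with its own local column names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datasets (ds_name TEXT, pi TEXT)")
conn.execute("INSERT INTO datasets VALUES ('Stream survey', 'J. Smith')")

MAPPING = {"ds_name": "title", "pi": "originator"}
docs = export_to_xml(conn, "SELECT ds_name, pi FROM datasets", MAPPING)
print(docs[0])
```

Only the mapping changes from one contributor to the next, which is why databases of different structure can feed the same central index.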

Page 23: U.S Geological Survey  National Biological Information Infrastructure

NBII CH Design Diagram

Diagram components:

• External Metadata (http, ftp, web crawl)
• NBII CH Harvester
• Transformed Files
• DB updater tool (custom Java)
• MySQL: Mercury3_harvests_nbii
• Solr Indexer tool (custom Java)
• XML Beans to extract the contents
• Solr Schema for defining the fields
• Index metadata
• SOLR Search Server: Extended Lucene Index
• Solr Searcher (custom Java Spring)
• Web Service
• Portlets
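The indexing stage in this design takes fields extracted from transformed metadata files and hands them to Solr. The actual indexer is a custom Java tool; as a language-neutral illustration, the Python sketch below builds the XML update message that Solr accepts (an add element containing doc and field elements). The field names and record values are hypothetical and would have to match whatever the Solr schema defines.

```python
import xml.etree.ElementTree as ET


def to_solr_add(docs):
    """Build a Solr XML update message from a list of field dictionaries."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            # A multi-valued field is sent as repeated <field> elements.
            values = value if isinstance(value, list) else [value]
            for v in values:
                ET.SubElement(doc, "field", name=name).text = str(v)
    return ET.tostring(add, encoding="unicode")


# Hypothetical extracted record; field names are assumptions.
message = to_solr_add([
    {"id": "rec-1", "title": "Breeding bird survey", "keyword": ["birds", "survey"]},
])
print(message)
```

Posting this message to the Solr update handler, followed by a commit, would make the record searchable.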

Page 24: U.S Geological Survey  National Biological Information Infrastructure

Future Development

Phase II (May 2008 to September 2008):
• Harvester engine to use open source tools (remove COTS) (Phase I & II)
• Portal integration through JSR-168 Portlet standard
  • Search portlets, portlets for recent datasets, top most searched words, etc.
• Web service implementation (Phase I & II):
  • Thesaurus support (semantic web integration support)
  • Gazetteer web service implementation
  • OGC Catalog Service (include Web Mapping/Coverage/Feature Servers in search)
  • Universal Description, Discovery, and Integration (UDDI) Directory Services
• Dynamic RSS support, including Geo-RSS support
• ISO 19115 support
• OpenSearch support
• Documentation and Help (Phase I & II)
• User Statistics Application modifications

Phase III (October 2008 to January 2009):
• Save, retrieve and email user queries
• Possible integration with OPeNDAP
• Web service harvesting (OAI)
• Internationalization
• ????
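For the planned OAI web service harvesting, a harvester would page through a provider's records with the OAI-PMH ListRecords verb, following resumption tokens until none is returned. The sketch below shows only URL construction and page parsing; the base URL, identifiers, and trimmed sample response are hypothetical, and nothing here reflects how the NBII harvester was eventually built.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request; a resumption token replaces other arguments."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_page(xml_text):
    """Return (record identifiers, next resumption token or None)."""
    root = ET.fromstring(xml_text)
    identifiers = [el.text for el in root.iter(OAI_NS + "identifier")]
    token = root.find(f"{OAI_NS}ListRecords/{OAI_NS}resumptionToken")
    next_token = token.text if token is not None and token.text else None
    return identifiers, next_token


# Hypothetical, trimmed response page used only for illustration.
SAMPLE_PAGE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:rec-1</identifier></header></record>
    <record><header><identifier>oai:example.org:rec-2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

ids, token = parse_page(SAMPLE_PAGE)
print(ids, token)
print(list_records_url("http://example.org/oai", resumption_token=token))
```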

Page 25: U.S Geological Survey  National Biological Information Infrastructure

Search technology using Lucene/SOLR

Lucene
• Overview
• Who uses Lucene

Solr
• Overview
• Who uses Solr

Page 26: U.S Geological Survey  National Biological Information Infrastructure

Lucene Overview

• High-performance, full-featured text search engine library written entirely in Java
• Mature Apache open source Java project
• Index speed and integrity, search speed
  • uses file-based full-text and inverted indexing
  • is extremely fast with built-in caching
• Can easily handle millions of documents
• Very active mailing list for support
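Inverted indexing is the idea behind Lucene's search speed: instead of scanning documents for a term, the index maps each term to the documents that contain it. The toy Python sketch below illustrates the concept only. It is not Lucene's implementation, which adds on-disk segment files, scoring, and text analysis; the sample documents are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical documents.
DOCS = {
    1: "Breeding bird survey of North America",
    2: "Stream fish inventory",
    3: "Lake fish survey",
}

index = build_inverted_index(DOCS)
print(sorted(search(index, "fish survey")))
```

A lookup touches only the posting sets for the query terms, so its cost depends on how many documents match rather than on the size of the whole collection.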

Page 27: U.S Geological Survey  National Biological Information Infrastructure

Who uses Lucene

• Wikipedia
• MediaWiki
• European Bioinformatics Institute
• Liferay
• Bigsearch.ca
• Monster
• Academic Archive On-line

Complete list:
• http://en.wikipedia.org/wiki/Lucene
• http://wiki.apache.org/lucene-java/PoweredBy

Page 28: U.S Geological Survey  National Biological Information Infrastructure

SOLR Overview

Open source enterprise search server based on the Lucene Java search library

Apache project, sub-project of Lucene

Advanced Full-Text Search Capabilities

Optimized for High Volume Web Traffic

Standards Based Open Interfaces - XML and HTTP

Solr uses Lucene search library and extends it
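Because Solr's interface is plain XML over HTTP, any client that can issue a request and parse XML can use it. The sketch below parses a response in Solr's standard XML format into Python values. The sample response and its field names are hypothetical, and fetching it over HTTP is left out so the example runs offline.

```python
import xml.etree.ElementTree as ET


def parse_solr_response(xml_text):
    """Return (total hit count, list of documents as field dictionaries)."""
    root = ET.fromstring(xml_text)
    result = root.find("result")
    docs = []
    for doc in result.findall("doc"):
        fields = {}
        for el in doc:
            if el.tag == "arr":
                # Multi-valued fields arrive as an <arr> of typed children.
                fields[el.get("name")] = [child.text for child in el]
            else:
                fields[el.get("name")] = el.text
        docs.append(fields)
    return int(result.get("numFound")), docs


# Hypothetical response body used only for illustration.
SAMPLE_RESPONSE = """<response>
  <lst name="responseHeader"><int name="status">0</int><int name="QTime">2</int></lst>
  <result name="response" numFound="2" start="0">
    <doc>
      <str name="id">rec-1</str>
      <str name="title">Breeding bird survey</str>
      <arr name="keyword"><str>birds</str><str>survey</str></arr>
    </doc>
    <doc>
      <str name="id">rec-2</str>
      <str name="title">Stream fish inventory</str>
    </doc>
  </result>
</response>"""

total, documents = parse_solr_response(SAMPLE_RESPONSE)
print(total, [d["title"] for d in documents])
```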

Page 29: U.S Geological Survey  National Biological Information Infrastructure

SOLR Overview (cont.)

A Real Data Schema, with Numeric Types, Date fields, Dynamic Fields

Dynamic Faceted Browsing and Filtering

Advanced, Configurable Text Analysis

Highly Configurable and User Extensible Caching

External Configuration via XML

Scalability - Efficient Replication to other Solr Search Servers

Administration Interface is available

Page 30: U.S Geological Survey  National Biological Information Infrastructure

Who uses SOLR

• CNET Reviews
• shopper.com
• AOL Music
• Netflix
• search.com
• The Digital Commonwealth
• Mindquarry

For complete list:


Page 31: U.S Geological Survey  National Biological Information Infrastructure

Mercury Instances Demo

NBII Clearinghouse interface:

ORNL DAAC interface: http://daac.ornl.gov/

LBA Mercury interface: http://mercdev3.ornl.gov/lba3/

DADDI Mercury interface: http://mercdev3.ornl.gov/daddi3/

GFIS RSS Portal interface: http://www.gfis.net/gfis/home.faces

Page 32: U.S Geological Survey  National Biological Information Infrastructure

User Statistics Report Generation Tool

Page 33: U.S Geological Survey  National Biological Information Infrastructure

Open source Harvester Re-design (Aperture)

Page 34: U.S Geological Survey  National Biological Information Infrastructure

Questions, Comments,

Mike Frame

865 576-3605

[email protected]

Thanks to:

Giri Palanisamy, Systems Architect and Team Leader, Mercury Consortium, [email protected]

Vivian Hutchison, NBII Metadata Program, [email protected]