OpenUp!
BioCASe Workshop
Jörg Holetschek, Gabriele Dröge
Botanic Garden & Botanical Museum Berlin-Dahlem
Dept. of Biodiversity Informatics and Laboratories
Königin-Luise-Straße 6-8, 14195 Berlin
BioCASe Workshop Berlin, May 30th/ 31st 2011
Agenda
Monday
11.00 Welcome by Walter Berendsohn, Housekeeping
11.20 – 12.00 The BioCASe Architecture: An Overview
12.00 – 13.00 The BioCASe Provider Software I: An Overview
13.00 – 14.00 Lunch break
14.00 – 15.45 The BioCASe Provider Software II: Installation (Hands-on)
16.00 – 17.00 The ABCD data standard: Intention, Structure, Elements, Use
17.00 – 18.00 Preparing the database for BioCASe/ABCD
19.00 Dinner
Tuesday
09.30 – 12.00 Setting Up Datasources with the BPS (Hands-on): DB connection, Table Setup, Mapping; Testing, Data Backups
12.00 – 13.00 Lunch break
13.00 – 14.30 Setting up Networks with BioCASe (Hands-on)
15.00 – 15.30 A Thematic BioCASe Network: The DNA Bank Network
15.30 – 17.00 Questions (and answers?)
Workshop Presentation
http://www.biocase.org/files/BioCASe_Workshop_Berlin_2011.ppt
WiFi
Network: Conference
Key: g59mn3w2
1. BioCASe Technology:
Motivation, Idea and Architecture
Primary Biodiversity Information
© Agnes Kirchhoff, J. Holstein et al.
Primary Biodiversity Data Items
- Living specimen
- Preserved specimen
- Multimedia document (drawing, photo, video, sound)
- Observation
= Primary Biodiversity Data Record
Documentation of the occurrence of one species
at a given location at a certain point in time
Biological Collection Access Service
Data sources worldwide
- Index Herbariorum: 3,293 herbaria, 400 million herbarium sheets
- 50,000-100,000 natural history collections, 1.5-2 billion specimens
- With observations added, occurrence records 3+ billion (10b?)
Over 75% of biodiversity information is stored in developed countries.
Est. 75% of all species are found in the developing world.
Source: BARTHLOTT et al. 1999
Accessibility
Stage 0: Only in the real world (paper catalogues, just stacks); only meta information available on the web
Stage 1: Online catalogue
Stage 2: Digitization of specimens
Biodiversity Data
Level 3: Networking the databases
Global Biodiversity Information Facility (GBIF)
Biological Collection Access Service (BioCASe)
Architecture of Biodiversity Networks
1. Protocols/Data Standards: BioCASe Protocol/ABCD
2. Wrapper Software: BioCASe Provider Software
3. Applications: Data Portal, Data Quality Checker, Data Mining
BioCASe Design Principles
No central database
- Data remain in the existing DB systems
- Data provider gets full credit
- Full control over published data by collection holder
Partial publication possible
- Collection holder can withhold information from publication (e.g. locality data for endangered species) or exclude records (e.g. until research results are published)
Wrapper principle
- Data remain in original collection management system
- No changes in workflow for curator/local users
2. The BioCASe Provider Software
BioCASe Provider Software (Wrapper)
Software package that "wraps" around the collection database and equips it with a BioCASe protocol compliant interface:
1. Accepts requests from the network (e.g. "Marmota marmota?")
2. Translates queries to the collection database:
   SELECT * FROM specimen WHERE ScientificName LIKE 'Marmota marmota%'
3. Transforms results into ABCD documents and sends them back
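The three wrapper steps can be illustrated with a minimal Python sketch. It assumes an in-memory SQLite table named specimen and a simplified, made-up XML layout; the function names and the output format are hypothetical and are not taken from the BioCASe Provider Software or the ABCD schema.

```python
import sqlite3
import xml.etree.ElementTree as ET


def make_demo_db():
    """Create a tiny in-memory collection database (hypothetical layout)."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE specimen (id INTEGER PRIMARY KEY, ScientificName TEXT)"
    )
    con.executemany(
        "INSERT INTO specimen (ScientificName) VALUES (?)",
        [("Marmota marmota (Linnaeus, 1758)",), ("Lepus europaeus Pallas, 1778",)],
    )
    return con


def handle_request(con, name_prefix):
    """Wrapper principle: request in, SQL against the local DB, XML out."""
    # Step 2: translate the request into a parameterised SQL query.
    rows = con.execute(
        "SELECT id, ScientificName FROM specimen WHERE ScientificName LIKE ?",
        (name_prefix + "%",),
    ).fetchall()
    # Step 3: transform the result rows into an XML document.
    # The element names are illustrative only, NOT real ABCD elements.
    root = ET.Element("Units")
    for unit_id, name in rows:
        unit = ET.SubElement(root, "Unit", id=str(unit_id))
        ET.SubElement(unit, "ScientificName").text = name
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Step 1: a request arrives from the network (here: a plain function call).
    print(handle_request(make_demo_db(), "Marmota marmota"))
```

The collection database itself is never modified; the wrapper only reads from it, which is the point of the wrapper principle.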
BioCASe Provider Software (Wrapper)
Compatible with several protocols (BioCASe, DiGIR) and data schemas (ABCD, DarwinCore, ABCD-EFG, ABCD-DNA)
Works with most SQL-compliant databases (Access, MySQL, Postgres, SQL Server, ...)
Currently ~95 production installations serving ~1,500 collections with ~33.5m records to GBIF and BioCASe
Platform independent
Support available!
BioCASe Providers Worldwide
~95 production installations serving ~1,500 collections
Requirements
1. SQL compliant database with existing Python connectivity module: MySQL, SQL Server, Postgres, Access, Foxpro, Excel
2. Webserver (preferably Apache), allowing the execution of Python scripts
3. Privileges to install additional Python packages
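Whether a Python connectivity module is present can be checked before installation. The sketch below is written for current Python 3 and only tests whether a module is importable; the module names in the mapping are assumptions chosen for illustration, not the list that the BPS library test actually checks.

```python
import importlib.util

# Candidate connectivity modules; this mapping is an assumption for
# illustration, not the list the BPS library test actually checks.
CANDIDATE_MODULES = {
    "MySQL": "MySQLdb",
    "Postgres": "psycopg2",
    "ODBC (SQL Server, Access, ...)": "pyodbc",
    "SQLite (bundled with Python)": "sqlite3",
}


def available_modules(candidates=CANDIDATE_MODULES):
    """Return {label: True/False} depending on whether the module is importable."""
    return {
        label: importlib.util.find_spec(module) is not None
        for label, module in candidates.items()
    }


if __name__ == "__main__":
    for label, found in available_modules().items():
        print(f"{label:32} {'installed' if found else 'missing'}")
```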
Steps
1. Installing Apache
2. Installing Python
3. Downloading BPS
4. Installing BPS (from repository/archive)
5. Creating the link Apache/BPS
6. Test of Installation
7. Changing directory permissions
8. Setup of additional packages (DB Connectivity Package)
1. Installing Apache
http://httpd.apache.org/download
2. Installing Python
http://www.python.org/download/
3. Downloading BPS
Archive: http://www.biocase.org/products/provider_software/
Subversion repository
Latest stable version: http://ww2.biocase.org/svn/bps2/branches/stable
Defined version: http://ww2.biocase.org/svn/bps2/tags/release_2.5.3
Linux:
svn co <url> <path>
Windows: Tortoise client
4. Installing the BPS
setup.py
No files are copied, only adapted!
5. Linking BPS with Apache
httpd.conf
6. Testing BPS, Installing Additional Packages
http://localhost/biocase → Utilities → Library Test
7. Write permissions
…/bps2/configuration
…/bps2/log
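A quick way to verify the permissions is sketched below. The function name and the idea of passing the installation root as a parameter are assumptions; the check also applies to the user running the script, so it only reflects the web server's rights when run under the web server's account.

```python
import os
import tempfile


def check_writable(bps_root, subdirs=("configuration", "log")):
    """Report for each subdirectory whether it exists and is writable."""
    report = {}
    for name in subdirs:
        path = os.path.join(bps_root, name)
        report[name] = os.path.isdir(path) and os.access(path, os.W_OK)
    return report


if __name__ == "__main__":
    # Demo on a throwaway directory tree; pass the real installation
    # folder (its location depends on your setup) to use it in earnest.
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "configuration"))
        print(check_writable(root))  # "log" is missing in this demo tree
```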
8a: mysqldb
http://sourceforge.net/projects/mysql-python/
Changing the Password
... /bps/configuration.ini
3. ABCD Standard
ABCD Data Schema
Access to Biological Collection Data:
- Data schema for all types of primary biodiversity data (living/preserved/observational, botanical/zoological/bacterial/viral, marine/terrestrial)
- XML (eXtensible Markup Language) based: can be consumed by humans and machines
- Highly complex, hierarchical, currently 1,055 data elements: almost every data item will fit in
- Extendable (plug-in slot for additional information)
- Standard (currently version 2.06)
ABCD: Structure
Namespace: http://www.tdwg.org/schemas/abcd/2.06
ABCD Metadata: Technical/Content Contact
ABCD Metadata: Description
ABCD Metadata: Coverage
ABCD Metadata: Revision/Version
ABCD Metadata: Ownership
ABCD Metadata: Intellectual Property Rights
ABCD Metadata
ABCD: Triple ID, Record Basis
ABCD: Identification (multiple)
ABCD: Gathering Event
ABCD: Multimedia
OpenUp: Thumbnails will be created
Always provide link to image file!
ABCD: Unit Associations
ABCD: Specialised Portions
Specimen Unit: Acquisition, Accession, Preparation, Duplicate Distribution, Type Status
Herbarium Unit: Loan Information
Botanical Garden Unit: Location in Garden, Hardiness, Lineage, Cultivation, Planting Date
Other Specialised Subtrees for
- Observations
- Culture Collections
- Mycological Units
- Zoological Units
- Paleontological Units
- Plant Genetic Resources
ABCD: UnitExtension
Own namespace for extension: http://www.chah.org.au/schemas/hispid/5
Other extensions: Extension for Geosciences (ABCD-EFG), DNA Bank Network (ABCD-DNA)
BioCASe Protocol
Biological Collection Access Service Protocol:
Manages data exchange between data providers (collections) and applications (data portals)
Vehicle for transporting requests (data portal to collection) and responses (ABCD documents, collection database to data portal)
XML based
BioCASe Protocol: Capabilities request
BioCASe Protocol: Inventory Request
BioCASe Protocol: Search Request
4. Preparing the database for BioCASe
4. Reasons for not publishing the live DB
1. Publishing the live DB is not desired → creating snapshots for publication
2. DBMS not accessible for the BPS → export into another DBMS
3. Performance considerations (too highly normalized) → partial, controlled denormalization
4. Repeatable elements kept in columns, not in separate rows → moving repeatable elements to separate records
Each repeatable element needs its own primary key!
Repeatable elements kept in columns:

specimen_id  ...  class             order        family
3476         ...  Conjugatophyceae  Desmidiales  Desmidiaceae
3477         ...  Conjugatophyceae  Desmidiales  Desmidiaceae
3478         ...  Conjugatophyceae  Desmidiales  Closteriaceae

Repeatable elements moved to separate records:

specimen_id  ...
3476         ...
3477         ...
3478         ...

sp_id  ht_entry  ht_rank  ht_name
3476   456765    class    Conjugatophyceae
3476   456766    order    Desmidiales
3476   456767    family   Desmidiaceae
3477   456768    class    Conjugatophyceae
3477   456769    order    Desmidiales
3477   456770    family   Desmidiaceae
3478   456771    class    Conjugatophyceae
3478   456772    order    Desmidiales
3478   456773    family   Closteriaceae
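The move from columns to separate records can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: the table name higher_taxa is made up, and the generated ht_entry keys start at 1 rather than at the values shown above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE specimen (specimen_id INTEGER PRIMARY KEY,
                       class TEXT, "order" TEXT, family TEXT);
INSERT INTO specimen VALUES
  (3476, 'Conjugatophyceae', 'Desmidiales', 'Desmidiaceae'),
  (3477, 'Conjugatophyceae', 'Desmidiales', 'Desmidiaceae'),
  (3478, 'Conjugatophyceae', 'Desmidiales', 'Closteriaceae');

-- One row per (specimen, rank); ht_entry is a generated primary key.
CREATE TABLE higher_taxa (ht_entry INTEGER PRIMARY KEY AUTOINCREMENT,
                          sp_id INTEGER REFERENCES specimen(specimen_id),
                          ht_rank TEXT, ht_name TEXT);
INSERT INTO higher_taxa (sp_id, ht_rank, ht_name)
  SELECT specimen_id, 'class', class FROM specimen WHERE class IS NOT NULL
  UNION ALL
  SELECT specimen_id, 'order', "order" FROM specimen WHERE "order" IS NOT NULL
  UNION ALL
  SELECT specimen_id, 'family', family FROM specimen WHERE family IS NOT NULL;
""")

rows = con.execute(
    "SELECT sp_id, ht_rank, ht_name FROM higher_taxa ORDER BY sp_id, ht_entry"
).fetchall()
for row in rows:
    print(row)
```

Each higher taxon now sits in its own row with its own key, so a mapping can treat it as a repeatable element.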
Example View
CREATE VIEW [dbo].[vwHigherTaxa]
AS
SELECT 'k_' + [EDIT_ATBI_RecordID] AS id, [EDIT_ATBI_RecordID] AS unit_id, [kingdom] AS name, 'kingdom' AS rank
FROM unit_data
WHERE [kingdom] IS NOT NULL
UNION
SELECT 'p_' + [EDIT_ATBI_RecordID], [EDIT_ATBI_RecordID], [phylum], 'phylum'
FROM unit_data
WHERE [phylum] IS NOT NULL
UNION
...
Commonly used repeatable elements
- Identification
- HigherTaxon
- GatheringSite/NamedArea
- Metadata/Scope/GeoecologicalTerms
- Metadata/Scope/TaxonomicTerms
- MultimediaObjects
- MeasurementsOrFacts
- ...
Controlled Denormalization
INSERT INTO [dbo].[abcd_Object]
SELECT
  dbo.CollectionObject.CollectionObjectID,
  ISNULL(dbo.CatalogSeries.SeriesName, '') + '-'
    + ISNULL(CAST(dbo.CollectionObjectCatalog.SubNumber AS nvarchar(20)), '') + '-'
    + ISNULL(CAST(dbo.CollectionObjectCatalog.CatalogNumber AS nvarchar(20)), ''),
  dbo.f_getParentID(dbo.CollectionObject.CollectionObjectID),
  dbo.f_getCollectingEventID(dbo.CollectionObject.CollectionObjectID),
  dbo.f_getFieldNumber(dbo.CollectionObject.CollectionObjectID),
  CAST(dbo.CollectionObjectCatalog.CatalogNumber AS int),
  dbo.CollectionObject.PreparationMethod,
  CASE WHEN Sex = '<No Data>' THEN NULL ELSE Sex END,
  CASE WHEN Stage = '<No Data>' THEN NULL ELSE Stage END,
  CASE WHEN dbo.CollectionObject.Text1 IS NULL THEN '' ELSE 'Barcode: ' + dbo.CollectionObject.Text1 + '; ' END
    + CASE WHEN dbo.Accession.Number IS NULL THEN '' ELSE 'Specimen Location: ' + dbo.Accession.Number END
    + CASE WHEN DerivedFrom.Remarks IS NULL THEN '' ELSE ' <br> ' + CAST(DerivedFrom.Remarks AS nvarchar(2000)) END
FROM dbo.BiologicalObjectAttributes
  RIGHT OUTER JOIN dbo.CollectionObject
    ON dbo.BiologicalObjectAttributes.BiologicalObjectAttributesID = dbo.f_getParentID(dbo.CollectionObject.CollectionObjectID)
  LEFT OUTER JOIN dbo.CollectionObjectCatalog
    LEFT OUTER JOIN dbo.CatalogSeries
      ON dbo.CollectionObjectCatalog.CatalogSeriesID = dbo.CatalogSeries.CatalogSeriesID
    ON dbo.CollectionObject.CollectionObjectID = dbo.CollectionObjectCatalog.CollectionObjectCatalogID
  LEFT JOIN dbo.Accession
    ON Accession.AccessionID = CollectionObjectCatalog.AccessionID
  LEFT JOIN dbo.CollectionObject AS DerivedFrom
    ON CollectionObject.DerivedFromID = DerivedFrom.collectionObjectID
WHERE (dbo.f_hasChildObjects(dbo.CollectionObject.CollectionObjectID) = 0) AND ...
How Do I See Something Is Wrong?
Errors in ABCD documents:
Several datasets (one for each unit)
Reason: Metadata field stored in Units table (no separate PK → several datasets need to be created)
Several units for one specimen record
Reason: Several records in DB for non-repeatable elements (→ several ABCD objects are necessary to create a valid document)
5. Setting Up a BioCASe Data Source: Database connection, Table Setup, Schema Mapping
BPS Datasource
URL for a BioCASe protocol compliant webservice: http://ww3.bgbm.org/biocase/pywrapper.cgi?dsa=AlgenEngels

<?xml version='1.0' encoding='UTF-8'?>
<request xmlns='http://www.biocase.org/schemas/protocol/1.3'>
  <header>
    <type>search</type>
  </header>
  <search>
    <requestFormat>http://www.tdwg.org/schemas/abcd/2.06</requestFormat>
    <responseFormat start='0' limit='10'>http://www.tdwg.org/schemas/abcd/2.06</responseFormat>
    <filter>
      <like path='/DataSets/DataSet/Units/Unit/Identifications/Identification/Result/TaxonIdentified/ScientificName/FullScientificNameString'>A*</like>
    </filter>
    <count>false</count>
  </search>
</request>
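A search request of this shape can also be generated programmatically. The following sketch uses Python's standard xml.etree.ElementTree; the function name build_search_request is hypothetical, and the sketch only builds the document, since how the request is transmitted to the web service is not covered on the slide.

```python
import xml.etree.ElementTree as ET

PROTOCOL_NS = "http://www.biocase.org/schemas/protocol/1.3"
ABCD_NS = "http://www.tdwg.org/schemas/abcd/2.06"
NAME_PATH = ("/DataSets/DataSet/Units/Unit/Identifications/Identification/"
             "Result/TaxonIdentified/ScientificName/FullScientificNameString")


def _q(tag):
    """Qualify a tag name with the protocol namespace."""
    return f"{{{PROTOCOL_NS}}}{tag}"


def build_search_request(pattern, start=0, limit=10):
    """Build a search request shaped like the one on the slide."""
    # Serialise the protocol namespace as the default namespace (xmlns=...).
    ET.register_namespace("", PROTOCOL_NS)
    request = ET.Element(_q("request"))
    header = ET.SubElement(request, _q("header"))
    ET.SubElement(header, _q("type")).text = "search"
    search = ET.SubElement(request, _q("search"))
    ET.SubElement(search, _q("requestFormat")).text = ABCD_NS
    response_format = ET.SubElement(
        search, _q("responseFormat"), start=str(start), limit=str(limit))
    response_format.text = ABCD_NS
    flt = ET.SubElement(search, _q("filter"))
    ET.SubElement(flt, _q("like"), path=NAME_PATH).text = pattern
    ET.SubElement(search, _q("count")).text = "false"
    return ET.tostring(request, encoding="unicode")


if __name__ == "__main__":
    print(build_search_request("A*"))
```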
BPS QueryForms
Tool for sending Scan, Search and Capabilities Requests to a datasource
Choose Datasource → "Test and Debug"
Steps for Setting Up a Datasource
1. Create a new Datasource
2. Configure Datasource:
   1. Database Connection
   2. Table Setup
   3. Create new empty Mapping
   4. Edit Mapping:
      1. Choose root table
      2. Edit mandatory ABCD elements (red)
      3. Save Configuration, test datasource (QueryForms)
      4. Add additional ABCD elements, occasional testing
3. Test/Debug Datasource
FloraExsiccataBavarica: Additional Fields

Concept                                                 Table/Column
Metadata/…
  Description/Representation/Details                    metadata.description (text)
  IconURI                                               metadata.logo_url (text)
  Version/Major                                         metadata.source_version (text)
Metadata/IPRStatements/…
  Citations/Citation/Text                               metadata.citationsText (text)
  Copyrights/Copyright/Text                             metadata.copyright (text)
  Disclaimers/Disclaimer/Text                           metadata.disclaimer (text)
  Acknowledgements/Acknowledgement/Text                 metadata.acknowledgement (text)
  TermsOfUseStatements/TermsOfUse/Text                  metadata.terms_of_use (text)
Units/Unit/Gathering/…
  Agents/GatheringAgent/Person/FullName                 unit.sammler (text)
  Altitude/MeasurementOrFactText                        unit.hoehe (text) + "m"
  Altitude/MeasurementOrFactAtomised/LowerValue         unit.hoehe (text)
  Altitude/MeasurementOrFactAtomised/UnitOfMeasurement  "m"
  Country/ISO3166Code                                   "DE"
  Country/Name                                          "Germany"
  DateTime/DateText                                     unit.datum1 (text)
  LocalityText                                          unit.fundort (text)
  NamedAreas/NamedArea/AreaClass                        "State"
  NamedAreas/NamedArea/AreaName                         "Bavaria"
How the BPS Performs Requests
1. Get an ID list of records matching the filter
2. Load all details for the matching IDs → joining of ALL tables, beginning with the root table (table with UnitID, one record per Unit)
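The two-step pattern can be sketched with Python's sqlite3 module. The schema (unit as root table, identification as a repeatable element) and the function name search are assumptions made for illustration; the SQL that the BPS actually generates is not shown in the slides.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE unit (unit_id INTEGER PRIMARY KEY, locality TEXT);  -- root table
CREATE TABLE identification (ident_id INTEGER PRIMARY KEY,
                             unit_id INTEGER REFERENCES unit(unit_id),
                             full_name TEXT);
INSERT INTO unit VALUES (1, 'Berlin'), (2, 'Munich'), (3, 'Hamburg');
INSERT INTO identification VALUES
  (10, 1, 'Abies alba'), (11, 1, 'Abies sp.'),
  (12, 2, 'Betula pendula'), (13, 3, 'Acer campestre');
""")


def search(con, name_pattern):
    """Return (matching unit IDs, joined detail rows) for a LIKE pattern."""
    # Step 1: get the ID list of units matching the filter.
    ids = [row[0] for row in con.execute(
        "SELECT DISTINCT u.unit_id FROM unit u "
        "JOIN identification i ON i.unit_id = u.unit_id "
        "WHERE i.full_name LIKE ? ORDER BY u.unit_id", (name_pattern,))]
    if not ids:
        return ids, []
    # Step 2: load all details for those IDs, joining from the root table.
    placeholders = ",".join("?" * len(ids))
    details = con.execute(
        "SELECT u.unit_id, u.locality, i.full_name FROM unit u "
        "LEFT JOIN identification i ON i.unit_id = u.unit_id "
        f"WHERE u.unit_id IN ({placeholders}) "
        "ORDER BY u.unit_id, i.ident_id", ids).fetchall()
    return ids, details


if __name__ == "__main__":
    print(search(con, "A%"))
```

Because step 2 joins every mapped table, a unit with several identifications yields several joined rows; this is why each repeatable element needs its own primary key.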
Typical Mapping Errors
- Incomplete mappings
- Missing explicit mappings for implicit knowledge (e.g. Country = "Germany" for a German collection)
- Abusing the MultimediaObject for non-multimedia documents (e.g. links to taxon pages)
- Providing "0" values for non-existent data
Datasource Loglevel
The lower the loglevel, the more information is logged: Debug < Info < Warning < Error
Datasource Configuration Settings
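The same ordering of levels exists in Python's standard logging module, which makes the effect easy to demonstrate. Whether the BPS uses this module internally is an assumption; the sketch only shows that a lower level lets more messages through.

```python
import io
import logging


def messages_logged_at(level):
    """Return which of four test messages pass a logger set to `level`."""
    stream = io.StringIO()
    logger = logging.getLogger(f"demo.{level}")
    logger.setLevel(level)
    logger.propagate = False  # keep the demo output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s"))
    logger.addHandler(handler)
    logger.debug("debug message")
    logger.info("info message")
    logger.warning("warning message")
    logger.error("error message")
    logger.removeHandler(handler)
    return stream.getvalue().split()


if __name__ == "__main__":
    for level in (logging.DEBUG, logging.INFO, logging.WARNING, logging.ERROR):
        print(logging.getLevelName(level), messages_logged_at(level))
```

A debug level is useful while setting up a mapping; for production use a higher level keeps the log files small.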
Datasources folder
... /configuration/datasources/<dsname>
querytool_prefs.xml: Just what its name says.
cmf_xxx.xml: Concept mapping; one for each supported schema.
provider_setup_file.xml: Database connection, table setup, supported schemas.
Regular backup of configuration folder is highly recommended!
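A backup can be as simple as zipping the folder with a timestamp. The sketch below uses only the Python standard library; the function names, the archive name prefix and the demo directory layout are assumptions, and the real configuration path depends on the installation.

```python
import os
import shutil
import tempfile
import time
import zipfile


def backup_folder(config_dir, backup_dir):
    """Zip `config_dir` into `backup_dir`; return the path of the archive."""
    os.makedirs(backup_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    base_name = os.path.join(backup_dir, f"bps-configuration-{stamp}")
    return shutil.make_archive(base_name, "zip", root_dir=config_dir)


def list_backup(archive):
    """Return the file names stored in a backup archive."""
    with zipfile.ZipFile(archive) as zf:
        return zf.namelist()


if __name__ == "__main__":
    # Demo on a throwaway tree that mimics the datasources folder.
    with tempfile.TemporaryDirectory() as tmp:
        ds_dir = os.path.join(tmp, "configuration", "datasources", "demo")
        os.makedirs(ds_dir)
        with open(os.path.join(ds_dir, "provider_setup_file.xml"), "w") as fh:
            fh.write("<demo/>")
        archive = backup_folder(os.path.join(tmp, "configuration"),
                                os.path.join(tmp, "backups"))
        print(archive, list_backup(archive))
```

Scheduling such a script (cron, Windows Task Scheduler) turns it into the regular backup recommended above.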
Metadata tables
If metadata differ for each or some of the records: several records in metadata table, linked to unit by foreign key
If metadata are the same for all records: possible to hold data in one record; no reference key is needed (static table)
Applications
Local QueryTool
Distributed Search: BioCASe Simple UI
BioCASe Distributed Search: http://search.biocase.org/simple-ui
Harvesting: GBIF Data Portal
GBIF Registration
GBIF Indexing History
EDIT Specimen Explorer: Interactive filters
Distributed Search vs. Harvesting
Distributed Search
+ No harvesting application/database required
+ No Delay with data updates (instantly visible)
- Dependent on Provider Availability
- Slow
- No data verification
- No maps, taxon lists, …
Harvesting
- Need for a harvester/cache database
- Delays when records get updated/added/removed
+ No heavy dependency on provider availability
+ Fast (as long as your portal is)
+ Data verification/improvements/transformation in harvesting process
+ Maps, suggestion lists, Interactive filters, …
OpenUp! Harvesting
[Diagram labels: BioCASe (three providers), OpenUp! Harvester, OAI-PMH Harvester, ABCD, ESE, EDM]
Jörg Holetschek, Gabriele Dröge
Botanic Garden & Botanical Museum Berlin-Dahlem
Dept. of Biodiversity Informatics and Laboratories
Königin-Luise-Straße 6-8, 14195 Berlin-Dahlem
+49 30 838 50150
0448 831 980
www.bgbm.org/biodivinf
www.biocase.org
search.biocase.org
search.biocase.de
http://www.biocase.org/files/BioCASe_Workshop_Berlin_2011.ppt