Globally Unique Identifiers and Life Science Identifiers
Dave [email protected]
University of Kansas
California Academy of Sciences
www.learningsite.com
Outline
1. Describe Globally Unique Identifiers
2. Show how they’re relevant
3. Describe one GUID system (LSIDs)
4. Outline some issues around using GUIDs for TDWG-related activities
5. Provide some resources
6. Open discussion
GUID Is Not An Ugly Word
It’s guid to be merry and wise,
It’s guid to be honest and true,
Robert Burns, “Here’s a Health to Them that’s Awa’”
Pteroptochos tarnii AKA Guidguid
Image From: animaldiversity.ummz.umich.edu
GUID: Globally Unique Identifier
• A short name for a complex entity
• Useful for locating information about the entity
• Each name identifies only one entity
• There is some sense of permanence
Some things which fit this description
• GenBank accession numbers: AP006480.1
• US Patent numbers: 5443036 (laser guided cat exercise)
• Digital Object Identifier: 10.121/3212
In Our Domain
SDD Document – Representing some data set.
<ClassName id="1">
  <Label>
    <Representation language="en">
      <Text>Cypselurus heterurus (Rafinesque, 1810)</Text>
    </Representation>
  </Label>
  <Link>
    <LSID>lsid.gbif.net:www.fishbase.org:1029</LSID>
  </Link>
  <Rank>sp</Rank>
</ClassName>
Napier Schema Document – Representing some taxon.
<TaxonConcept id="urn:lsid:bioguid.org:seek:121212" type="original">
  <Name type="scientific">
    <NameSimple>Canis lupus</NameSimple>
  </Name>
  …
  <Relationships>
    <Relationship type="is child of">
      <ToTaxonConcept ref="urn:lsid:bioguid.org:seek:5743" />
    </Relationship>
  </Relationships>
</TaxonConcept>
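The Napier fragment uses LSIDs twice: as the concept's own id and as the target of a relationship. As a minimal sketch of how a consumer might pull those identifiers out, the Python below parses a trimmed copy of the fragment with the standard library. The function name `concept_links` is hypothetical, and the part the slide elides with "…" is simply left out.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the slide's fragment; the elided part ("…") is omitted.
NAPIER_FRAGMENT = """
<TaxonConcept id="urn:lsid:bioguid.org:seek:121212" type="original">
  <Name type="scientific">
    <NameSimple>Canis lupus</NameSimple>
  </Name>
  <Relationships>
    <Relationship type="is child of">
      <ToTaxonConcept ref="urn:lsid:bioguid.org:seek:5743" />
    </Relationship>
  </Relationships>
</TaxonConcept>
"""


def concept_links(xml_text):
    """Return (concept LSID, simple name, [(relationship type, target LSID), ...])."""
    root = ET.fromstring(xml_text)
    name = root.findtext("Name/NameSimple")
    links = [
        (rel.get("type"), target.get("ref"))
        for rel in root.iter("Relationship")
        for target in rel.iter("ToTaxonConcept")
    ]
    return root.get("id"), name, links


if __name__ == "__main__":
    print(concept_links(NAPIER_FRAGMENT))
```

Because the references are GUIDs rather than document-local ids, the parent concept can live in a different document or on a different server.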
Features of a GUID system
• Global uniqueness scoped to the Internet
• Should be easily resolvable by a computer or human
• Should identify things down to whatever level of granularity necessary
• Should not be limited to proprietary systems
• Should serve up all sorts of data
– Database records
– Text files
– Images
• It would be nice if the identifier had associated metadata
Life Science Identifiers
• Official standard of the Object Management Group (OMG)
• Support for metadata and authentication
• Supports multiple protocols (e.g. HTTP, SOAP)
• Can serve up data in any format
• Decentralized – anyone can issue an LSID
• LSID code available in Java and Perl
• A young standard, but increasingly used
Organizations Using LSIDs
• National Center for Biotechnology Information (NCBI)
– PubMed
– GenBank
• European Bioinformatics Institute (EBI)
• US Long Term Ecological Research Network (LTER)
• BioMOBY – a biological database interoperability program (biomoby.org)
• Open Bioinformatics Foundation (open-bio.org)
• myGrid – a BioGRID project (mygrid.org.uk)
A Small Pause For More Squid Humor
LSID Format
urn:lsid:bioguid.org:seek:117866:v1
• urn – indicates that this is a URN
• lsid – indicates that it’s an LSID-type URN
• bioguid.org – the authority that issued the LSID
– Doesn’t have to be a domain name, but for now probably should be.
– bioguid.org does not necessarily have the data or metadata.
– There may not even be a machine called bioguid.org.
• seek – a namespace id internal to that authority
– The namespace is meaningless to systems outside that authority.
• 117866 – the local identifier within that authority
– Also internal to the authority
• v1 – an optional version number
– If no version, no trailing colon either.
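The format on this slide is regular enough to split mechanically. The Python below is a minimal sketch of that split, not a full validator: the `LSID` type and `parse_lsid` function are hypothetical names, and the only rules checked are the ones stated on the slide (five colon-separated parts, an optional sixth for the version, and no trailing colon).

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str                   # e.g. "bioguid.org"
    namespace: str                   # e.g. "seek", meaningful only to the authority
    object_id: str                   # e.g. "117866", also internal to the authority
    revision: Optional[str] = None   # e.g. "v1", optional

    def __str__(self):
        parts = ["urn", "lsid", self.authority, self.namespace, self.object_id]
        if self.revision:
            parts.append(self.revision)
        return ":".join(parts)


def parse_lsid(text):
    """Split an LSID string into its parts, raising ValueError if malformed."""
    parts = text.strip().split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-separated parts: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID URN: {text!r}")
    if any(part == "" for part in parts):
        # An empty part also catches a trailing colon with no version.
        raise ValueError(f"empty component in LSID: {text!r}")
    return LSID(*parts[2:])


if __name__ == "__main__":
    print(parse_lsid("urn:lsid:bioguid.org:seek:117866:v1"))
```

A client should treat the namespace and local identifier as opaque strings, since both are internal to the authority.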
Data and Metadata
• An LSID has data
– Examples
  • The gene sequence in GenBank
  • The actual LTER data set, maybe in Excel, or in a text file
– The data should never change
• An LSID also has metadata
– Example metadata
  • The format of the data
  • A display title for clients displaying the LSID
  • Dublin Core metadata
  • Anything you want
– The metadata can change
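One way to picture the split between fixed data and changeable metadata is a record type that exposes the data read-only and the metadata as an ordinary dictionary. This is only an illustrative sketch: the class name `LSIDRecord` and its fields are hypothetical and do not come from any LSID library.

```python
class LSIDRecord:
    """Pairs an LSID with fixed data bytes and editable metadata."""

    def __init__(self, lsid, data, metadata=None):
        self._lsid = lsid
        self._data = bytes(data)              # private immutable copy of the data
        self.metadata = dict(metadata or {})  # free to change over time

    @property
    def lsid(self):
        return self._lsid

    @property
    def data(self):
        # No setter is defined, so assigning to .data raises AttributeError.
        return self._data


if __name__ == "__main__":
    record = LSIDRecord(
        "urn:lsid:limnology.wisc.edu:dataset:ntlfi02",
        b"year,count\n",
        {"title": "Fish abundance"},
    )
    record.metadata["format"] = "text/csv"  # metadata may be updated
    print(record.lsid, record.metadata)
```

If the underlying data has to change, the change belongs under a new identifier or a new version of the existing one, not an edit behind the old one.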
Example LSIDs
• An LTER fish abundance data set
– urn:lsid:limnology.wisc.edu:dataset:ntlfi02
• A PubMed reference
– urn:lsid:ncbi.nlm.nih.gov.lsid.biopathways.org:pubmed:12441808
• A GenBank sequence
– urn:lsid:ncbi.nlm.nih.gov.lsid.biopathways.org:genbank_gi:30350027
How LSIDs work
[Diagram: LSID client, DNS, LSID authority, data store, metadata store]
• LSID client – maybe Launchpad, maybe Haystack, maybe BioFerret, maybe myGrid, maybe yours!
1. Find the authority for this LSID
– The client finds the DNS record and resolves it to get the address of the authority.
– Returns the LSID authority server.
2. Query the authority for available services
– Returns WSDL for this LSID.
3. Choose a service, get the goods
– HTTP, SOAP, FTP, others
– The authority’s services sit in front of a data store and a metadata store.
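The three steps can be sketched as a small pipeline. The Python below makes no network calls: `find_authority`, `list_services` and `fetch` are hypothetical stand-ins that the caller supplies, and the `_lsid._tcp.` SRV prefix reflects my understanding of the LSID resolution specification rather than anything stated on the slides.

```python
def srv_query_name(lsid):
    """Build the DNS SRV name a client would look up for an LSID's authority.

    Assumption: the LSID spec locates authorities via "_lsid._tcp.<authority>".
    """
    parts = lsid.split(":")
    if len(parts) < 5 or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {lsid!r}")
    return "_lsid._tcp." + parts[2].lower()


def resolve(lsid, find_authority, list_services, fetch):
    """Run the three resolution steps using caller-supplied functions."""
    # Step 1: find the authority for this LSID (a DNS lookup in practice).
    authority_address = find_authority(srv_query_name(lsid))
    # Step 2: ask the authority which services exist (WSDL in practice).
    services = list_services(authority_address, lsid)
    if not services:
        raise LookupError(f"no services offered for {lsid}")
    # Step 3: choose a service and get the goods; here, simply the first one.
    return fetch(services[0], lsid)


if __name__ == "__main__":
    # Stand-in implementations so the sketch runs offline.
    result = resolve(
        "urn:lsid:bioguid.org:seek:117866:v1",
        find_authority=lambda name: ("authority.example", 80),
        list_services=lambda address, lsid: ["http", "soap"],
        fetch=lambda service, lsid: f"{service} data for {lsid}",
    )
    print(result)
```

Keeping the steps separate mirrors the diagram: the machine that answers the DNS query, the authority server and the data stores need not be the same host.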
LSID Promises
• I promise to never change the data behind an LSID.
• I will make sure my LSIDs are being served, or give them to someone who can do it.
• I will give my LSIDs metadata – at least give them a title and a format
Other GUID systems
• URLs
– Files move
– The data change
– Unstructured metadata
• UUIDs
– 128-bit identifier, designed to be unique
– Example: 58f202ac-22cf-11d1-b12d-002035b29092
– No resolution
– No metadata
• Handle System / DOIs (10.12/2312)
– Non-standard protocol
– Centralized resolution
– Unstructured metadata (for Handle System)
– High costs (for DOI)
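To make the UUID entry concrete, the short Python sketch below inspects the UUID shown on the slide and generates a fresh one with the standard `uuid` module. The helper name `new_id` is hypothetical. Nothing in a UUID says where to look it up, which is the "no resolution" point.

```python
import uuid

# The example UUID from the slide.
SLIDE_UUID = uuid.UUID("58f202ac-22cf-11d1-b12d-002035b29092")


def new_id():
    """Generate a random (version 4) UUID without contacting any authority."""
    return uuid.uuid4()


if __name__ == "__main__":
    print(len(SLIDE_UUID.bytes) * 8)  # size in bits
    print(SLIDE_UUID.version)         # UUID version field
    print(SLIDE_UUID.urn)             # URN form; still carries no location
    print(new_id())
```

Compare this with an LSID, whose authority part gives a client a starting point for finding data and metadata.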
Issues For This Community
• What gets a GUID?
• For each of those things, what’s the data, what’s the metadata?
• One GUID per item?
• Centralization – who issues GUIDs?
What Gets a GUID?
• These things probably should get GUIDs
– Taxonomic concepts
– Specimens
– Publications
– People
• These things might get GUIDs
– Taxonomic names
– Journals
– Data providers
– Observations
Specimen Data? Metadata?
• If specimens get a GUID, what does it identify?
– The physical specimen?
– A collection’s database record of the specimen?
– What about multiple labels?
– Main question: what doesn’t change about a specimen?
– Other main question: how should the data be represented?
  • Darwin Core includes current institution location. Not a good idea for the data of a GUID since that may change.
One GUID Per Item?
• No GUID system inherently enforces a 1:1 mapping between GUID and data.
• Everyone should TRY to limit the number of GUIDs per item.
• Should there be any centralization to help achieve this?
Degrees of Centralization
• An index
– List your GUID authority in an index so your GUIDs are easy to find.
• A central authority
– One authority could be responsible for issuing GUIDs to the community for specific types of information – you’d have to get one from here.
  • GBIF?
  • The IC_Ns? (ICZN, ICBN…)
  • lsidauthority.org?
– This would help enforce a 1:1 mapping of GUIDs and data items
– It would also relieve data providers of the need to maintain their own authorities
– It MAY also reduce the likelihood of GUIDs becoming unresolvable
– It may also be infeasible technically, or socially.
• A respected authority
– With LSIDs, an authority can be set up to serve its own GUIDs and proxy other authorities.
– This would help enforce a 1:1 mapping for those who use the authority
– It may also be more feasible.
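The claim that a single issuer helps enforce a 1:1 mapping can be illustrated with a toy issuing authority that hands back the existing GUID when asked about an item it already knows. Everything here is hypothetical: the class name, the `example.org` authority, and above all the assumption that callers can supply a stable key for "the same item", which is the hard part in practice.

```python
class IssuingAuthority:
    """Toy central issuer: at most one LSID per item key."""

    def __init__(self, authority, namespace):
        self.authority = authority
        self.namespace = namespace
        self._by_item = {}   # item key -> LSID already issued
        self._next_id = 1

    def issue(self, item_key):
        """Return the LSID for item_key, minting one only on first request."""
        if item_key in self._by_item:
            return self._by_item[item_key]
        lsid = f"urn:lsid:{self.authority}:{self.namespace}:{self._next_id}"
        self._next_id += 1
        self._by_item[item_key] = lsid
        return lsid


if __name__ == "__main__":
    issuer = IssuingAuthority("example.org", "specimen")
    first = issuer.issue("hypothetical-catalog-number-1234")
    again = issuer.issue("hypothetical-catalog-number-1234")
    print(first, first == again)
```

With many independent authorities, nothing prevents two of them from minting different GUIDs for the same item. That is why the slide describes an index or a proxying authority as softer alternatives.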
LSID Resources
• LSID articles and code from IBM
– http://www-124.ibm.com/developerworks/oss/lsid/#whatislsid
• Current LSID specification
– http://www.omg.org/cgi-bin/doc?dtc/04-05-01
• Launchpad – an LSID resolver for Windows IE
– available from first link
• A website which resolves LSIDs
– http://lsid.biopathways.org/resolver/
• URN specification
– http://www.ietf.org/rfc/rfc2141.txt
Acknowledgements
• My work on GUIDs has been funded by the SEEK project – seek.ecoinformatics.org.
• SEEK is funded by National Science Foundation award 0225676.
• Thanks to Ben Szekely at IBM for his LSID articles, his LSID Java code, and for answering all my questions.
Questions for Discussion
• Do we need GUIDs?
• What gets a GUID?
• One GUID per item?
• Centralization?