TDWG - SDD Oct 20, 2002 Search & Browsing Beyond Keys over Standards TDWG General Presentation Sessions P. Bryan Heidorn October 20, 2002

Why Structured Data Description?

• Interactive Keys

• Search

• Authoring
  – Structure
  – Vocabulary

Interactive Keys

• Interchange of information between programs
  – E.g. DeltaAccess and LucID

Interchange

[Diagram: LucID and DeltaAccess connected through SDD]

Federation (very difficult)

[Diagram: five separate boxes, each labeled Key]

SDD Supports text

• Hybrid text and defined characteristics

• Leaf characters may be defined while the flower characteristics are not.

• <Wording><Wording>

• Reuse of Flora for Search

• Automatic Markup and Classification

• Still can help people with ID

Search

• Warning: this is not an SDD standard but an SDD-inspired implementation from the Australia discussion.

• SDD supports hybrid text and character encoded documents

• How do you search a hybrid or partly encoded document?

• Some fields based on full text, some based on DB search

Search

• Biological Information Browsing Environment – (http://www.biobrowser.org)

• Search subparts of descriptions
  – Find the blue flowers, not the Blue Mountains
  – Flowers: …. Blue
  – Location: Blue Mountains

• Search Part Hierarchy
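
The difference between full-text and subpart search can be sketched in a few lines of Python. This is only an illustration, not the BIBE implementation: the record names, field names, and texts are hypothetical, and matching is a plain substring test.

```python
# Hypothetical records; field names and text are illustrative, not real FNA data.
RECORDS = [
    {"name": "Taxon A", "Flowers": "petals blue, 5-lobed", "Location": "coastal forests"},
    {"name": "Taxon B", "Flowers": "petals white", "Location": "Blue Mountains"},
]


def search_full_text(records, term):
    """Return names of records where any field contains the term."""
    term = term.lower()
    return [
        r["name"]
        for r in records
        if any(term in text.lower() for key, text in r.items() if key != "name")
    ]


def search_field(records, field, term):
    """Return names of records where one named subpart contains the term."""
    term = term.lower()
    return [r["name"] for r in records if term in r.get(field, "").lower()]


print(search_full_text(RECORDS, "blue"))         # matches flowers and mountains alike
print(search_field(RECORDS, "Flowers", "blue"))  # matches only the flower description
```

Restricting the match to one subpart is what separates "blue flowers" from "Blue Mountains".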

Marking Structure

[Diagram labels: Taxon files; Fragment; Unlabeled fragments; Manual Labeling; Training Set; ML; Labeled Fragments; Reassembly; FNA Structured files]

Classification = Markup

• Support Vector Machine (Hong Cui)

• Any distance measure

• Clustering / Classification
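
A minimal sketch of treating markup as classification, under loud assumptions: it does not use a Support Vector Machine. It swaps in a much simpler nearest-centroid classifier with cosine similarity over word counts, as one example of "any distance measure". The labels and training fragments are hypothetical and loosely modeled on FNA wording.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words vector: lowercase tokens split on whitespace."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def train(labeled_fragments):
    """Sum the vectors of all manually labeled fragments per label."""
    centroids = {}
    for label, text in labeled_fragments:
        centroids.setdefault(label, Counter()).update(vectorize(text))
    return centroids


def classify(fragment, centroids):
    """Assign the label whose centroid is most similar to the fragment."""
    vec = vectorize(fragment)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical training set of manually labeled fragments.
TRAINING = [
    ("Description", "trees to 75m trunk crown conic leaves apex rounded"),
    ("Description", "seeds tan wing scales bracts included"),
    ("Distribution", "moist coastal forests mountain slopes 0--1500m"),
]
centroids = train(TRAINING)
print(classify("crown conic with leaves rounded", centroids))
print(classify("coastal forests and mountain slopes", centroids))
```

The predicted label becomes the element name wrapped around the fragment, which is the sense in which classification equals markup.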

Converting Documents to XML

[Diagram labels: FNA HTML files; Perl; FNA Composite files; FNA Family files; FNA Genus files; FNA Species files]

Simple FNA.dtd

<!ELEMENT FNA ANY>
<!ELEMENT NomenclaturalInfo (#PCDATA)>
<!ELEMENT Description (#PCDATA)>
<!ELEMENT Distribution (#PCDATA)>
<!ELEMENT Discussion (#PCDATA)>
<!ELEMENT Images (#PCDATA)>
<!ELEMENT ImagesMap (#PCDATA)>
<!ELEMENT Copyright (#PCDATA)>
<!ELEMENT Other (#PCDATA)>
<!ELEMENT Variability (#PCDATA)>
<!ELEMENT References0 (#PCDATA)>
<!ELEMENT References1 (#PCDATA)>

<?xml version="1.0" encoding="iso-8859-1" ?>
<!DOCTYPE FNA (View Source for full doctype...)>
<FNA>
  <Images>http://www.canis.uiuc.edu/~webvibe/fna_images/plates/I24900139.html</Images>
  <Other />
  <NomenclaturalInfo>3. Abies grandis (Douglas ex D. Don in Lambert) Lindley, Penny Cycl. 1: 30. 1833 - Grand fir, lowland white fir, sapin grandissime</NomenclaturalInfo>
  <NomenclaturalInfo>Pinus grandis Douglas ex D. Don in Lambert, Descr. Pinus [ed. 3] 2: unnumbered page between 144 and 145. 1832</NomenclaturalInfo>
  <Description>Trees to 75m; trunk to 1.55m diam.; crown conic, in age round, with age …. brown, often with reddish periderm visible in furrows bounded by hard flat opposite, light gray, sessile, apex rounded; scales ca. 2--2.5 × 2--2.5cm, densely … bracts included. Seeds 6--8 × 3--4mm, body tan; wing.</Description>
  <Distribution>Moist, coastal coniferous forests and mountain slopes; 0--1500m; B.C.; Calif., Idaho, Mont., Oreg., Wash.</Distribution>
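
As a rough sketch of how a record like this could be read back into searchable fields, the Python below parses a shortened, well-formed variant of the sample with the standard library. The abbreviated sample text, the function names, and the substring matching are assumptions for illustration, not part of the FNA tooling.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the record above (closed and abbreviated for the example).
SAMPLE = """<FNA>
  <NomenclaturalInfo>3. Abies grandis - Grand fir, lowland white fir</NomenclaturalInfo>
  <NomenclaturalInfo>Pinus grandis Douglas ex D. Don in Lambert</NomenclaturalInfo>
  <Description>Trees to 75m; trunk to 1.55m diam.; crown conic</Description>
  <Distribution>Moist, coastal coniferous forests and mountain slopes</Distribution>
</FNA>"""


def fields(xml_text):
    """Group the text of each top-level element by its tag name."""
    grouped = {}
    for child in ET.fromstring(xml_text):
        grouped.setdefault(child.tag, []).append((child.text or "").strip())
    return grouped


def find_in(xml_text, tag, term):
    """Return the texts under one tag that contain the term."""
    return [t for t in fields(xml_text).get(tag, []) if term.lower() in t.lower()]


print(sorted(fields(SAMPLE)))
print(find_in(SAMPLE, "Distribution", "coastal"))
print(find_in(SAMPLE, "Description", "coastal"))
```

Repeated elements such as NomenclaturalInfo are kept as a list, so neither the accepted name nor the synonym is lost.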

Structured Index

[Diagram labels: FNA Structured files; Swish-ex; Structured Index]

Swish-ex

• Based on Berkeley swish-e

• XML support added

• Hierarchical Key Structure Related to DTD

FNA
  Nomenclature 00
  Morphology 01
    Plant 01|000
    Bark 01|001
    Leaf 0|010
      Size 0|010|000
      Margin 0|010|001
      Apex 0|010|010
      Petiole 0|010|011
    Others.. 0|011…
  Distribution 10
  References 11
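
One way to read the codes above is as path keys: each node gets a binary index among its siblings, and the indexes along the path are joined with "|". The sketch below illustrates that idea only. It is not the Swish-ex implementation, the tree is a hypothetical cut-down version of the hierarchy, and the bit width is derived from the sibling count, so the lower-level codes are shorter than those on the slide.

```python
# Hypothetical cut-down part hierarchy; nested dicts map a name to its children.
TREE = {
    "Nomenclature": {},
    "Morphology": {
        "Plant": {},
        "Bark": {},
        "Leaf": {"Size": {}, "Margin": {}, "Apex": {}, "Petiole": {}},
    },
    "Distribution": {},
    "References": {},
}


def assign_codes(tree, prefix=""):
    """Give every node a key made of fixed-width binary sibling indexes."""
    codes = {}
    width = max(1, (len(tree) - 1).bit_length())
    for index, (name, children) in enumerate(tree.items()):
        code = format(index, "0{}b".format(width))
        full = prefix + "|" + code if prefix else code
        codes[name] = full
        codes.update(assign_codes(children, full))
    return codes


def descendants(codes, name):
    """A subtree search is a prefix match on the key."""
    root = codes[name]
    return [n for n, c in codes.items() if c.startswith(root + "|")]


codes = assign_codes(TREE)
print(codes["Morphology"], codes["Leaf"], codes["Margin"])
print(descendants(codes, "Leaf"))
```

Because a child key always starts with its parent key, a query on Leaf also reaches Size, Margin, Apex, and Petiole without walking the tree.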

XML Query Interface

Add the query string to a tree node, for example COMMON NAME = dog-face

The query propagates to the text field, and it is ready to be sent to Server/database

The results are fetched back from the database, and show up on screen

Query results are displayed in BIBE as documents; the relationships between key terms are currently mapped as shown below.

Processing Thesaurus and Definition

[Diagram labels: Taxon files; Glossary files; Thesaurus Processor; Structured Thesaurus; Taxon Files With Definitions]

Thesaurus

• Automatic Query Expansion

• Vocabulary switching

• Novice to expert

• (acorn) becomes (acorn or glans)

• (roundish) becomes (obovate or roundish)

• (Illinois) becomes (ill or illinois)
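
The three expansions above can be reproduced with a small lookup. This is a sketch under assumptions: the dictionary layout and function name are hypothetical, and only the three term pairs come from the slide.

```python
# Thesaurus entries taken from the examples above; the dict layout is an assumption.
THESAURUS = {
    "acorn": {"glans"},
    "roundish": {"obovate"},
    "illinois": {"ill"},
}


def expand(term, thesaurus=THESAURUS):
    """Replace a term by an OR group of the term and its thesaurus alternatives."""
    key = term.lower()
    alternatives = sorted({key} | thesaurus.get(key, set()))
    if len(alternatives) == 1:
        return key  # no entry: leave the term unchanged
    return "(" + " or ".join(alternatives) + ")"


for word in ("acorn", "roundish", "Illinois", "leaf"):
    print(word, "becomes", expand(word))
```

Sorting the alternatives keeps the output deterministic, so a novice term and its expert equivalent produce the same expanded group.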

Inline Definitions

Clicking on a set of documents reveals a list of files. By choosing a file, users investigate text and images from The Flora of North America.

Extracting Adjectives from unstructured document sections

Outline

• Run FNA files through the Brill tagger (part-of-speech) to:
  – tokenize
  – tag the files, i.e. mark parts of speech: nouns, verbs, adjectives, adverbs

• Modify the Brill lexicon (Wall Street Journal-based) to fit FNA files

• Run an extraction program over the tagged files

Marija Markovic, Adjective Extraction
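
The extraction step in the outline above might look roughly like the sketch below. It assumes tagger output in the common word/TAG form with Penn Treebank adjective tags (JJ, JJR, JJS). The sample sentence is hypothetical, and replacing digits with "#" is a guess at how entries such as "#-lobed" arise, not a documented step.

```python
import re

ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}  # Penn Treebank adjective tags


def extract_adjectives(tagged_text):
    """Collect the distinct adjectives from word/TAG tokens."""
    adjectives = set()
    for token in tagged_text.split():
        word, sep, tag = token.rpartition("/")
        if sep and tag in ADJECTIVE_TAGS:
            # Assumption: counts are normalized so 5-lobed and 3-lobed collapse.
            adjectives.add(re.sub(r"\d+", "#", word.lower()))
    return sorted(adjectives)


# Hypothetical tagged fragment in word/TAG form.
TAGGED = (
    "leaves/NNS alternate/JJ ,/, margins/NNS 5-lobed/JJ ;/: "
    "flowers/NNS bisexual/JJ ,/, blue-violet/JJ"
)
print(extract_adjectives(TAGGED))
```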

Why Adjectives

• Useful (with Nouns) for retrieval

• Complex plant morphology sections

• around 30 fields in FNA XML files

• Essential for other compounds
  – art, definitions, links

Examples

conspicuous, foliaceous, #-keeled, #-locular, #-ribbed, #-ridged, #-veined, #-flowered, #-loculed, #-lobed, #-dentate

#-merous, #-lobed, V-shaped, abaxial, apical, aromatic, axillary, basal, bell-shaped, bisexual, blue-violet, canary

central, chasmogamous, cleistogamous, club-shaped, cream-brown, cuneate-obovoid, cylindric, distal, ellipsoid, elliptic, equal, fetid

Authoring

• OpenKey Project

Participants

• University of Illinois
  – Grad. School of Library and Information Science
  – Herbarium

• Illinois Natural History Survey and Herbarium

• University of North Carolina
  – School of Information and Library Science

• North Carolina Botanical Garden

Participants

• University of Illinois
  – Grad. School of Library and Information Science: Lesley Deem, Xiao Hu, Jing Wu
  – Herbarium: Dave Siegler

• Illinois Natural History Survey and Herbarium: Ken Robertson, Michael Jeffords

• University of North Carolina
  – School of Information and Library Science: Jane Greenberg, Evaline Daniels
  – North Carolina Botanical Garden: Bob White

Project Objectives

• Develop a Distributed Search and Identification Support
  – Character Lists, Taxonomic descriptions

• Develop Distributed Taxonomic Descriptions
  – Text, in situ images, ID images, herbarium sheets

• Highlight Museum Holdings

• Audience: Citizen Scientists, Schools, Scientists

Addressed Issues Encountered with Collaboration

• Inability to produce a final character matrix
  – Introduction of new characteristics

• Differing definitions of “like” terms
  – Need to be defined in context
  – Can not proliferate (reuse)

Distributed Properties

• Image and / or Character States need not be harvested until access time

• No “change” or “new” log in HTTP, unlike Open Archive

• Property definitions, but not references, may change without notification

• New instances may be added without notification

Schemata

• Taxon Description

• Characteristic

• Character Value

• Character Image

• Contributor

Component of Description

[Diagram labels: Taxon; Characteristic; CharacterImages; CharacterState; Contributor*; ContributorID; ImageID; RefKey:CharacteristicID:Value; RefKey:CharacteristicID; ValueID]

*Contributor may be used for all objects

Decomposition and Distribution

• The selection of characteristics and their values can be distributed

• No need to have definitive set

• Potential for reuse in new taxonomic descriptions

• Potential for differing definitions but in well defined space

Taxonomic Description

• Project Definition (à la Australia 2002)

• Identification – Taxon

<Taxon>
  <Rank>Species</Rank>
  <Name>Echinacea pallida</Name>
  <Authority>(Nutt.) Nutt</Authority>
  <Vernacular>Pale purple coneflower</Vernacular>
</Taxon>

Taxon (Australia, 2002)

<character keyref="stipules, color" type="nominal"> (type should be in Characteristic definitions)
  <state keyref="brown" autoordered="no">
    <modifier keyref="dark reddish-"/>
  </state>
  <state keyref="yellow" autoordered="no">
    <modifier keyref="extremely rarely"/>
  </state>
</character>

(Not SDD but SDD inspired)

Preamble…..

<xsd:element name = "CharacterState">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element ref = "Character"/>
      <xsd:element ref = "Value"/> <!-- (State) -->
      <xsd:element ref = "Image" minOccurs = "0"/>
      <xsd:element ref = "Definition" minOccurs = "0"/>
      <xsd:element ref = "Synonym" minOccurs = "0"/>
      <xsd:element ref = "BroaderTerms" minOccurs = "0"/>
      <xsd:element ref = "NarrowerTerms" minOccurs = "0"/>
      <xsd:element ref = "RelatedTerms" minOccurs = "0"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>

Character State

<Characters>
  <Character>Leaf Shape</Character>
  <State>lanceolate</State>
  <Image>http://www.isrl.uiuc.edu/~openkey/demo/lanceolate.xml</Image>
  <Image>http://www.isrl.uiuc.edu/~openkey/demo/lanceolate2.xml</Image>
  <Definition>Lance-shaped; much longer than wide, with the widest point below the middle.</Definition>
  <Synonym/>
  <BroaderTerms>oval</BroaderTerms>
  <NarrowerTerms/>
  <RelatedTerms>elliptic-lanceolate</RelatedTerms>
</Characters>

Federation of Characters and States

[Diagram labels: Key (×3), States; Shared Character Lists; Key (×3), Characters]

Character Sharing

[Diagram labels: UI Database; UNC Database; Federated Database; OA Server (two); OA Harvester; Polyclave Server; The world; PV.XML; DELTAAccess; LuCID]

Character Schema Element

<xsd:element name = "Value" type = "xsd:string">
  <xsd:annotation>
    <xsd:documentation>This is the value of a "character" or "characteristic" of an item, the value of a property. For example, the legal values for /leaves/leaf_arrangement are alternate, opposite, basal, whorled and cauline.</xsd:documentation>
  </xsd:annotation>
</xsd:element>
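
Since the schema types Value as a plain string, the list of legal values lives only in the documentation note. A client could enforce it with a check along these lines. This is a sketch: the lookup table and function are hypothetical, and only the path and its five values come from the schema note.

```python
# Legal values from the documentation note; the table layout is an assumption.
LEGAL_VALUES = {
    "/leaves/leaf_arrangement": {"alternate", "opposite", "basal", "whorled", "cauline"},
}


def is_legal(path, value, legal_values=LEGAL_VALUES):
    """Check a character value against the legal values listed for its path."""
    allowed = legal_values.get(path)
    if allowed is None:
        return True  # assumption: characters without a list accept any string
    return value.strip().lower() in allowed


print(is_legal("/leaves/leaf_arrangement", "Alternate"))
print(is_legal("/leaves/leaf_arrangement", "spiral"))
```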

Image Schema

<xsd:element ref = "Identifier"/>
<xsd:element ref = "Description" minOccurs = "0"/>
<xsd:element ref = "CopyrightPermission"/>
<xsd:element ref = "ImageLocation" …>
<xsd:element ref = "CopyrightHolder" …>
<xsd:element ref = "Source" minOccurs = "0"/>
<xsd:element ref = "DerivedFrom" …>
<xsd:element ref = "DateCreated" …>
<xsd:element ref = "Species" minOccurs = "0"/>
<xsd:element ref = "ContributorID" …>
<xsd:element ref = "SpecimenID" …>

PlantCharacteristic

<?xml version="1.0"?>
<!DOCTYPE PlantCharacteristic SYSTEM "http://soldev.isrl.uiuc.edu/~webvibe/PlantCharacters/PlantImageCharacteristic.dtd">
<ImageElements>
  <Part>Leaf-Stem</Part>
  <ValueName>Alternate</ValueName>
  <Description>Leaf Arrangements - Alternate</Description>
  <ImageLocation>http://soldev.isrl.uiuc.edu/~webvibe/PlantCharacters/images/AlternateLeaf.jpg</ImageLocation>
  <CopyrightHolder>Illinois Natural History Survey</CopyrightHolder>
  <Source>Observing, Photographing, and Collecting Plants. Illinois Natural History Survey Circular 55, 1980</Source>
  <DateCreated>January 29, 2002</DateCreated>
  <Species></Species>
  <DerivedFrom>Another Image</DerivedFrom>
  <Contributor>
    <ContributorID>pbheidorn</ContributorID>
    <ContributorName>P. Bryan Heidorn</ContributorName>
    <ContributorDetails>University of Illinois, GSLIS</ContributorDetails>
  </Contributor>
</ImageElements>

Conclusion

Many aspects of traditional printed floras can be used to enhance the functionality of electronic floras.

Full-text Search and ID are related tasks.