Upload
bryan-heidorn
View
710
Download
0
Tags:
Embed Size (px)
DESCRIPTION
Biodiversity Informatics: Mining Untapped Resources
Citation preview
Biodiversity Informatics: Mining Untapped Resources
February 8, 2010Marine Biology Laboratory and Woods Hole
Oceanographic Institute Library
P. Bryan HeidornDirector
University of ArizonaSchool of Information Resources and Library Science
Biodiversity Information Diversity
1)Wrongly perceived as bioinformatics and two sets of base-pairs
2) Biodiversity = Data Complexitya) Requires new information theory and
cyberinfrastructure
b) Largely unrecognized as an interesting problem are in computer science
The problem
Information is not in accessible Computer Science, Information Science
and Technology has not addressed the problem
Dark data is the data that we know is/was there but we can’t see it.
Hubble Space Telescope composite image "ring" of dark matter in the galaxy cluster Cl 0024+17
Naive View of Science Data
f(x)=axk+o(xk)
Power Law of Science Data
f(x)=axk+o(xk)| X<.20
Dat
a V
olum
e
Science Projects and Initiatives
Does NSF’s Data Follow the Power Law?
I do not know but if $1 = X bytes…..
Awarded Amount 2007
$0
$1,000,000
$2,000,000
$3,000,000
$4,000,000
$5,000,000
$6,000,000
$7,000,000
1 586 1171 1756 2341 2926 3511 4096 4681 5266 5851 6436 7021 7606 8191 8776
20-80 Rule The small are big!
Total Grants 9347
$2,137,636,716
20% 80%
Number Grants 1869 7478
Total Dollars $1,199,088,125 $938,548,595
Range $6,892,810-$350,000
$350,000-$831
Related Ideas
John Porter: Deep verses Wide databases
Swanson: Undiscovered Public Knowledge
Science Commons: Big Verses Small science
Where to find dark data
Literature/Biodiversity Heritage LibraryMuseum SpecimensField notes(Un)Experimental data setsCitizen Observations
What is dark data good for?
Ecological Niche ModelingClimate Change niche change predictionTaxonomic Name ResolutionLiterature Search Support
Taxonomic intelligenceKey-like – character searching
Phenology and Phenology changeFood-web / trophic level
Global Biodiversity Information Facility has 100s of millions of specimen records
Animalia
Global Earth Observation System of Systems (GEOSS) and Historical Data
•Unpublished observations of flowing time in Concord by Alfred Hosmer from 1888 to 1902•Photographs of Flowers•Blue Hill Observatory meteorological dataRichard B. Primack, Abraham J. Miller-Rushing, Daniel Primack, and Sharda Mukunda (2007). Using Photographs to Show the Effects of Climate Change on Flowering Time. Arnoldia 65(1), p2-9.
Historical and Current Data need to be in a form that allow for use and reuse.
Willis CG, Ruhfel BR, Primack RB, Miller-Rushing AJ, Losos JB, et al. 2010 Favorable Climate Change Response Explains Non-Native Species' Success in Thoreau's Woods. PLoS ONE 5(1): e8878. doi:10.1371/journal.pone.0008878
Favorable Climate Change Response Explains Non-Native Species’ Success in
Thoreau’s Woods
The problem with Museum Specimens
>1 Billion Natural History SpecimensCollected over 250 years / many languagesNo publishing standardsNear infinite classes
Your high school teacher lied6 min / label * 1B labels = 100M hours Saving 1 min = 16.7 Million hours $10/hr = $167,000,0001/4790 of U.S. deregulation financial bailout
Natural History Specimens
Automatic Metadata Extraction (Darwin Core) From Museum
Specimen Labels
…<co> Curtis, </co><hdlc> North American Pl</hdlc><cnl> No.</cnl><cn> 503*</cn><gn> Polygala</gn><sp> ambigua,</sp><sa> Nutt.,</sa><val> var.</val><hb> Coral soil,</hb><lc> Cudjoe Key, South Florida.</lc><col> Legit</col><co> A. H. Curtiss.</co><dt>February</dt>…
With Qin Wei, Univ of Illinois
Sample records
Sample OCR Output
Yale University Herbarium
~r-^""" r-n-------
YU.001300
Curtisb, North American Pl
C^o.nr r^-n
ANTS,
No. 503* "^
Polygala ambigna, Nntt., var.
Coral soil, Cudjoe Key, South Florida.
Legit A. H. Curtiss.
Label Labels
bc - barcodebt - barcode textcm - common/colloquial namecn - collection numberco - collectorcd - collection datefm - family nameft - footer info
Label Labels
gn - genus name hd - header infoin - infra nameina - infra name authorlc - location pd - plant descriptionsa - scientific name authorsp - species name
Example Training Record
<?xml version="1.0" encoding="UTF-8"?><?oxygen
RNGSchema="http://www3.isrl.uiuc.edu/~TeleNature/Herbis/semanticrelax.rng" type="xml"?>
<labeldata><bt>Yale University Herbarium</bt><ns> ~r-^""" r-n------</ns><bc> YU.001300</bc><co cc="Curtiss"> Curtisb, </co><hdlc cc="North American Plants"> North
American Pl</hdlc><ns>C^o.nr r^-nANTS,</ns><cnl> No.</cnl><cn> 503*</cn><ns> "^</ns><gn> Polygala</gn><sp> ambigna,</sp><sa> Nntt.,</sa><val> var.</val><hb> Coral soil,</hb><lc> Cudjoe Key, South Florida.</lc><col> Legit</col><co> A. H. Curtiss.</co></labeldata>
Supervised Learning Framework
Gold ClassifiedLabels
Training Phase
Application Phase
MachineLearner
Trained Model
UnclassifiedLabels
Segmented Text
Silver Classified
Labels
Segmentation Machine Classifier
Unclassified Labels
HumanEditing
Herbis Experimental Data
295 marked up records74 label states5-fold cross-validation
Performances of NB and HMM
Element Identifiers
Improved Performance With Field Element Identifiers
Learning w/ pre categorization
GoldLabels
MachineLearner
Modeln
UnclassifiedLabels
ClassifiedLabels
Class 1Labels
Categor-ization
Class 2Labels
Class nLabels
MachineLearner
MachineLearner
Model2
Model1
Class 1Labels
Categor-ization
Class 2Labels
Class nLabels
MachineClassification
MachineClassification
MachineClassification
ClassifiedLabels
ClassifiedLabels
FIG. 5. Improved Performance of Specialist Model
Specialist100 Curtiss VS 100 General
Iterations0 2000
100
Specialist
Random
P. Bryan Heidorn1, Hong Zhang1, Eugene Chung2 and BGWG
1Graduate School of Library and Information Science, 2Linguistics, University of Illinois
Machine Learning in BioGeomancer’s Locality Specification
SPNHC & NSCA 2006
BioGeomancer Working Group (BGWG) http://203.202.1.217/bgwebsite/index.html
Worldwide collaboration of natural history and geospatial data experts
Maximize the quality and quantity of biodiversity data that can be mapped
Support of scientific research, planning, conservation, and management
Promotes discussion, manages geospatial data and data standards, and develops software tools in support of this mission
Participants
Example Locality Types
Record #
Specification of Location Locality Type
43 dario 7 mi wnw of; RIO VIEJO FOH; F
86 near Aleutian Islands; S of Amukta Pass NF; FH
100 INDIAN CREEK, 11 MI. W HWY 160 P; POH
109 TIESMA RD, 1.5 MI NW EDGEWATER; OFF LAKE MICHIGAN R
P; FOH; NP
160 WALTMAN, 9 MI N, 2.5 MI W OF FOO
181 0.4 mi N Collinston on LA 138 FPOH
204 Seward Peninsula; vic. Bluff, S coast F; NF; FS
JOH : offset from a junction at heading e.g. 0.5 mi. W Sandhill and Hagadorn Roads [ FEATURE [ CITY = Sandhill ]] [ FEATURE [ ROAD= Hagadorn Roads ]] OFFSET VALUE = 0.5
DIRECTION= W
UNIT = mile
JUNCITON [ FEATURE [ CITY = Sandhill ]]
[ FEATURE [ ROAD= Hagadorn Roads ]]
FRAME
Xiaoya Tang and P. Bryan Heidorn
Xiaoya Tang and P. Bryan Heidorn
Different vocabularies in queries and documentsDifferent vocabularies in queries and documents
Long leaves
…... Leaves 20–75, many-ranked, spreading and recurved, not twisted, gray-green (rarely variegated with linear cream stripes), to 1 m 1.5–3.5 cm, ……... Inflorescences: ……. spikes very laxly 6–11-flowered, erect to spreading, 2–3-pinnate, …….
User query Description of leaf Length in texts
Templates for useful information
Information Extraction From FNAInformation Extraction From FNA
Extraction
Rules
Structured information
User log analysis
Leaf_ShapeLeaf_MarginLeaf_Apex Leaf_BaseBlade_Dimension…..…..
Leaf_Shape obovateLeaf_Shape orbiculateBlade_Dimension 3—9 x 3—8 cm …………..…………..
Original documents
………..
Leaf blade obovate to nearly orbiculate, 3--9 × 3--8 cm, leathery, base obtuse to broadly cuneate, margins flat, coarsely and often
irregularly doubly serrate to nearly dentate, . ………………
Knowledge bases
…..PartBlade:Leaf bladeBladesblade
……
Pattern:: * <PartBlade> ' ' <leafShape> * ( <leafShape> ) ',' * Output:: leaf {leafShape $1}Pattern:: * <PartBlade> * ', ' ( <Range> ' ' * <LengUnit> ) * <PartBase>Output:: leaf {bladeDimension $1}
Results – System Performance
Results – System Performance
Group NT NTH TSR SSR NSST TST NDVST
SEARFA 6.75 8.078 0.860 0.210 4.779 338.8 11.16
SEARF 4.50 3.598 0.568 0.053 9.584 435.2 14.75
Sig.(ANOVA) 0.005 0.005 0.000 0.011 0.000 0.72 0.162
NT: number of tasks accomplished in total
NTH: number of tasks accomplished per hour
TSR: task success rate
SSR: search success rate
NSST: number of searches to accomplish a task
TST: time spent to accomplish a task
NDVST: number of documents viewed to accomplish a task
Education Programs
Biological Information Specialist
Concentration in Data Curation (MSLIS)
Certificate of Advanced Study in Data Curation
Information and professional education in biodiversity informatics
Biological Information SpecialistsBiological Information Specialists
At present:
Biologists at all degree levels self-trained in information technology
Information technologists at all degree levels self-trained in biology
(both with gaps in knowledge for many months, years)
Differing roles of BIS in large and small
At present:
Biologists at all degree levels self-trained in information technology
Information technologists at all degree levels self-trained in biology
(both with gaps in knowledge for many months, years)
Differing roles of BIS in large and small
Master of Science in Biological Informatics
Master of Science in Biological Informatics
Degree Program began September 2007
Part of campus-wide bioinformatics masters program
NSF/CISE/IIS, Education Research and Curriculum Development, 0534567 (Palmer, PI)
Combines Biology, Bioinformatics, Computer Science core with LIS courses
Degree Program began September 2007
Part of campus-wide bioinformatics masters program
NSF/CISE/IIS, Education Research and Curriculum Development, 0534567 (Palmer, PI)
Combines Biology, Bioinformatics, Computer Science core with LIS courses
What does a BIS need to know?What does a BIS need to know?
Biological training and interest in solving biological research problems
Information skills Evaluation and implementation of information
systems: user based assessment and continual quality improvement for the development of tools that work and are used.
Information acquisition, management, and dissemination: development of digital libraries, data archives, institutional repositories, and related tools.
Information organization and integration: ontology development, structuring information for optimal use and sharing, and standards development.
Biological training and interest in solving biological research problems
Information skills Evaluation and implementation of information
systems: user based assessment and continual quality improvement for the development of tools that work and are used.
Information acquisition, management, and dissemination: development of digital libraries, data archives, institutional repositories, and related tools.
Information organization and integration: ontology development, structuring information for optimal use and sharing, and standards development.
UIUC bioinformatics core courseworkUIUC bioinformatics core coursework
Cross-disciplinary course distribution requirement
Bioinformatics: Computing in Molecular BiologyAlgorithms in BioinformaticsPrinciples of Systematics
Computer Science: AlgorithmsDatabase Systems
Biology:Human GeneticsIntroductory BiochemistryMacromolecular Modeling
Cross-disciplinary course distribution requirement
Bioinformatics: Computing in Molecular BiologyAlgorithms in BioinformaticsPrinciples of Systematics
Computer Science: AlgorithmsDatabase Systems
Biology:Human GeneticsIntroductory BiochemistryMacromolecular Modeling
Sample of existing LIS coursesSample of existing LIS courses
Information Organization and Knowledge Representation
LIS 551 Interfaces to Information Systems
LIS 590DM Document Modeling LIS 590RO Representing and
Organizing Information Resources
LIS590ON Ontologies in Natural Science
Information Resources, Uses and users
LIS 503 Use and Users of Information
LIS 522 Information Sources in the Sciences
LIS 590TR Information Transfer and Collaboration in Science
Information Organization and Knowledge Representation
LIS 551 Interfaces to Information Systems
LIS 590DM Document Modeling LIS 590RO Representing and
Organizing Information Resources
LIS590ON Ontologies in Natural Science
Information Resources, Uses and users
LIS 503 Use and Users of Information
LIS 522 Information Sources in the Sciences
LIS 590TR Information Transfer and Collaboration in Science
Information Systems LIS 456 Information Storage
and Retrieval LIS 509 Building Digital
Libraries LIS 566 Architecture of
Network Information Systems LIS 590EP Electronic
Publishing
Disciplinary Focus LIS 530B Health Sciences
Information Services and Resources
LIS 590HI Healthcare Informatics (Healthcare Infrastructure)
LIS 590EI/BDI Ecological Informatics (Biodiversity Informatics)
Information Systems LIS 456 Information Storage
and Retrieval LIS 509 Building Digital
Libraries LIS 566 Architecture of
Network Information Systems LIS 590EP Electronic
Publishing
Disciplinary Focus LIS 530B Health Sciences
Information Services and Resources
LIS 590HI Healthcare Informatics (Healthcare Infrastructure)
LIS 590EI/BDI Ecological Informatics (Biodiversity Informatics)
MSLIS Data Curation ConcentrationMSLIS Data Curation Concentration
Data Curation Educational Program (DCEP)
IMLS – Laura Bush 21st Century Librarian Program,
RE-05-06-0036-06 (Heidorn, PI)
Students with the DC concentration will be trained to add value to data and promote sharing across labs and disciplinary specializations
Data Curation Educational Program (DCEP)
IMLS – Laura Bush 21st Century Librarian Program,
RE-05-06-0036-06 (Heidorn, PI)
Students with the DC concentration will be trained to add value to data and promote sharing across labs and disciplinary specializations
New research directionsNew research directions
IGERTInteractive Keys?Focus on integration and scale Informatics infrastructure as competitive
edge
Sample areas of development
Landinformatics GroupAtmospheric science, hydrology, nutrient balance,
carbon cycle, ecology, agronomy
BREC Focus on data integration problems
across larger range of sciences
IGERTInteractive Keys?Focus on integration and scale Informatics infrastructure as competitive
edge
Sample areas of development
Landinformatics GroupAtmospheric science, hydrology, nutrient balance,
carbon cycle, ecology, agronomy
BREC Focus on data integration problems
across larger range of sciences
Example Service
JRS Biodiversity FoundationNational Science FoundationTaxonomic Database Working Group
JRS Biodiversity Foundation
History: The J.R.S. Biodiversity Foundation was created in January 2004 when the nonprofit publishing company, BIOSIS was sold to Thomson Scientific. The proceeds from that sale were applied to fund an endowment and create a new grant-making foundation.
Mission: The Foundation defined a mission within the field of biodiversity: To enhance knowledge and promote the understanding of biological diversity for the benefit and sustainability of life on earth.
JRS Biodiversity Foundation
JRS Biodiversity Foundation
Scope: To further advance the Foundation’s mission a scope was developed as: Interdisciplinary activities primarily carried out via collaborations in developing countries and economies in transition. The Foundation Board of Trustees has expressed a particular interest in focusing its grant-making in Africa.
Strategic Interest: Within those bounds a considered course has been chosen to: Advance projects, or parts of biodiversity projects that focus on: (1) collecting data, (2) aggregating, synthesizing, publishing data, and making it more widely available to potential end users, and (3) interpreting and gaining insight from data to inform policy-makers
Grant Making: about $2M/yr Animal Tracking in South Africa Specimen Digitization in Ghana Social Value of Conservation in Peru Species Pages and BD Education in Costa Rica Niche Modeling in Brazil Travel Grants Lake Victoria Data Library Project in Tanzania, Uganda
and Kenya e-Biosphere ‘09
JRS Biodiversity Foundation
National Science Foundation
Advances in Biological InformaticsData Working GroupPlant Science Cyberinfrastructure Center
(iPlant)Cyber-enabled Discovery and InnovationHiring CommitteesDivision of Biological Infrastructure
Planning
ALISE and AMISE
Historical Analysis of 100 Years backUse Library and Museum ResourcesPrepare for Blitz
K-12 / Graduate / Hobbyist
BioDiversity BlitzBuild time capsule for Bicentenial in
Cultural Heritage Institutions **Libraries and Museums**