Toro 1
EMu on a Diet
Yale campus
Peabody Collections: Approximate Digital Timeline
• 1991 Systems Office created & staffed
• 1991 Argus collections databasing initiative started
• 1994 Gopher services launched for collections data
• 1997 Gopher mothballed, Web / HTTP services launched
• 1998 Physical move of many collections “begins”
• 2002 Physical move of many collections “ends”
• 2003 Search for Argus successor commences
• 2003 Informatics Office created & staffed
• 2004 KE EMu to succeed Argus, data migration begins
• 2005 Argus data migration ends, go-live in KE EMu
Peabody Collections: Approximate Digital Timeline
EMu migration in '05 (all disciplines went live simultaneously)
Physical move in '98-'02 (primarily neontological disciplines)
Big events
Peabody Collections: Counts & Functional Cataloguing Unit
• Anthropology 325,000 Lot
• Botany 350,000 Individual
• Entomology 400,000 Lot
• Invertebrate Paleontology 300,000 Lot
• Invertebrate Zoology 300,000 Lot
• Mineralogy 35,000 Individual
• Paleobotany 150,000 Individual
• Scientific Instruments 3,000 Individual
• Vertebrate Paleontology 125,000 Individual
• Vertebrate Zoology 185,000 Lot / Individual
About 12 million specimens (2.1 million EMu-able units)
Peabody Collections: Functional Units Databased
• Anthropology 325,000 90 %
• Botany 350,000 1 %
• Entomology 400,000 6 %
• Invertebrate Paleontology 300,000 60 %
• Invertebrate Zoology 300,000 25 %
• Mineralogy 35,000 85 %
• Paleobotany 150,000 60 %
• Scientific Instruments 3,000 100 %
• Vertebrate Paleontology 125,000 60 %
• Vertebrate Zoology 185,000 95 %
992,000 of 2.1 million ( 45 % overall )
What happens when …
… EMu gets sluggish & unresponsive ?
Why is this &^%$ thing so ridiculously slow ?!
Transient, often non-EMu, issues
Persistent underlying EMu issues
Persistent issues typically have at least some “local aspect” to them
So you should look your EMu straight in the eye, and see if there are any local solutions that suggest themselves
TRUISM #1: disk speed affects database speed, so a smaller footprint of data on disk will increase performance
1,380 MB
10,400 MB
In an absolute sense, EMu proved much faster than Argus, but one intriguing metric was the amount of disk space occupied by the catalogue immediately after migration
TRUISM #2: active databases never shrink over time
Were there any existing diet protocols?
Vignettes from attempting to streamline a system, 2005-2007
EMu on a Diet
Post-migration reassessment (2005-2006)
Preparation for Darwin Core (2006-2007)
Implications of end user habits (2007-)
The ecatalogue database was a rate limiter
an EMu data directory
EMu maintenance jobs
emulutsrebuild: rebuilds lookup list tables
emumaintenance batch: quick & dirty optimization (oflow)
emumaintenance compact: full optimization of database table components
Default EMu “cron” maintenance job schedule
[Calendar grid: Mo Tu We Th Fr Sa Su, with rows for late night, workday, and evening; legend: emulutsrebuild, emumaintenance batch, emumaintenance compact]
Revised EMu “cron” maintenance job schedule
[Calendar grid: Mo Tu We Th Fr Sa Su, with rows for late night, workday, and evening; legend: emulutsrebuild, emumaintenance batch, emumaintenance compact]
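The maintenance schedules above boil down to a handful of cron entries. A minimal sketch of one such pattern, assuming a dedicated `emu` account and a hypothetical `/home/emu/bin` install path; the exact days and times are site-specific choices, not EMu requirements:

```
# Hypothetical crontab for the emu account; paths and times are illustrative.
# Quick & dirty optimization (emumaintenance batch), late night on weekdays
0 2 * * 1-5   /home/emu/bin/emumaintenance batch
# Lookup list rebuild (emulutsrebuild) in the evening on weekdays
0 19 * * 1-5  /home/emu/bin/emulutsrebuild
# Full optimization (emumaintenance compact), once a week, late Saturday night
30 2 * * 6    /home/emu/bin/emumaintenance compact
```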
1. Post-Migration Reassessment
Anatomy of the ecatalogue database
File Name | Function
~/emu/data/ecatalogue/data | the actual data
~/emu/data/ecatalogue/rec | indexing (part)
~/emu/data/ecatalogue/seg | indexing (part)
The combined size of these was 10.4 GB: 4 GB in data and 3 GB in each of rec and seg
1,380 MB vs. 10,400 MB
Closer Assessment of Legacy Data
In 2005 we had initially adopted many of the existing formats for data elements from the USNM’s EMu client, to allow rapid development of the Peabody modules by KE prior to migration; the Legacy Data fields were among them
Closer Assessment of Legacy Data
sites – round 2: constant data; lengthy prefixes; data of temporary use in migration
catalogue – round 2 (data, rec, seg)
texload -us -d{rawdatafile} -g{grammarfile} {database table}
texload -us -dinput.dat -ginput.gram ecatalogue
Import Module in 3.2.03 and later
ecatalogue (data, rec, seg)
Crunch 2: ecatalogue (data, rec, seg)
delete nulls from AdmOriginalData
Crunch 3: ecatalogue (data, rec, seg)
delete nulls from AdmOriginalData
shorten labels on AdmOriginalData
Crunch 4: ecatalogue (data, rec, seg)
delete nulls from AdmOriginalData
shorten labels on AdmOriginalData
delete prefixes on AdmOriginalData
55 % reduction !
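The three crunch actions on AdmOriginalData can be illustrated with a short Python sketch. The label/value layout, the verbose labels, and the "YPM legacy:" prefix are hypothetical stand-ins for the actual Legacy Data formats:

```python
import re

def crunch(legacy_fields: dict) -> dict:
    """Apply the three 'crunch' actions to one record's legacy data:
    1. delete nulls, 2. shorten labels, 3. delete constant prefixes."""
    # Hypothetical mapping of verbose labels to short ones
    short_labels = {
        "Original Collector Name": "Coll",
        "Original Locality Description": "Loc",
    }
    out = {}
    for label, value in legacy_fields.items():
        if value is None or value.strip() == "":
            continue                                    # 1. drop null/empty fields
        label = short_labels.get(label, label)          # 2. shorten verbose labels
        value = re.sub(r"^YPM legacy:\s*", "", value)   # 3. strip a constant prefix
        out[label] = value
    return out
```

Run over every record before a `texload` reimport, trims like these are what produced the smaller on-disk footprint.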
2. Preparation for Darwin Core
Charles Darwin, 1809-1882
Natural History Metadata Standard
“DwC”
Affords interoperability of different database systems
Widely used in collaborative informatics initiatives
Circa 40-50 fields depending on particular version
Directly analogous to the Dublin Core standard
Populate DwC fields at 3.2.02 upgrade… so what ?
IZ Department: total characters existing data 43,941,006
IZ Department: est. new DwC characters 20,000,000
IZ Department: est. expansion factor 45 %
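The estimated expansion factor follows directly from the two character counts:

```python
existing = 43_941_006   # total characters of existing IZ data
new_dwc = 20_000_000    # estimated new Darwin Core characters
factor = new_dwc / existing
print(f"{factor:.0%}")  # prints '46%'; the slide rounds to ~45 %
```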
We’re about to gain back all of the pounds we just took off by fixing up the legacy data !
catalogue – round 2 (data, rec, seg)
[Diagram sequence: actions in ecollectionevents, actions in eparties, actions in ecatalogue; component sizes shown before and after the actions]
ExtendedData vs. SummaryData
The ExtendedData field is a full duplication of the IRN + SummaryData fields… delete the ExtendedData field and use SummaryData when in “thumbnail mode” on records
Populate DwC fields at 3.2.02 upgrade… so what ?
IZ Department: total characters revised data 43,707,277
IZ Department: total new DwC characters 22,358,461
IZ Department: actual expansion factor -0.1 %
Some pain, but no weight gain !
3. Implications of end user habits
The Validation Code SlimDing
Can the history of query behavior by users help identify some EMu soft spots ?
If so, can we slip EMu a “dynamic diet pill” into its Texpress validation code ?
texadmin
…you make certain common types of changes to records in any EMu module
…and automatic changes then propagate via emuload (syncserver) into “local copies” of many fields in numerous records in linked modules
…those linked modules can grow a lot and slow EMu significantly between maintenance runs
EMu actions in the background you don’t see
Why not harness EMu’s continuously ravenous appetite for pushing local copies of linked fields into remote modules… and put it to work slimming for us !
Need to first appreciate that different EMu queries work differently
Drag and Drop Query: first consults the link field
Straight Text Entry Query: first consults the local copy of the SummaryData from the linked record that has been inserted into the catalogue
EMu’s audit log - GIGANTIC activity trail
How often do users employ these two very different query strategies, on what fields, and are there distinctly divergent patterns ?
catalogue audit
In this one-week sample, only 7 of 52 queries for accessions from inside the catalogue module used text queries; the other 45 were drag & drops
Of those 7 text queries, every one asked for a primary id number for the accession, or the numeric piece of that number, but not for any other type of data from within those accessions
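Classifying audit entries by query strategy can be sketched in Python. The query strings and the irn-equality rule used here are invented for illustration; the real EMu audit table records queries in its own format:

```python
import re
from collections import Counter

def classify(query: str) -> str:
    """Classify a catalogue query against accessions: a drag & drop
    resolves to a search on the linked record's irn, anything else
    is treated as a straight text query."""
    if re.fullmatch(r"\s*irn\s*=\s*\d+\s*", query):
        return "drag & drop"
    return "text"

# A hypothetical handful of accession queries pulled from the audit trail
sample = ["irn = 4711", "irn = 815", "AccAccessionNo = 'YPM.2005.014'", "irn = 4711"]
counts = Counter(classify(q) for q in sample)
```

Tallying a year of such entries is what revealed the drag & drop vs. text-query pattern described below.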
IP. 304578
IP. 304578: blah blah blah…
Summary data strings
Over a full year of catalogue audit data, far less than 1 % of all the queries into accessions used anything other than the primary id of the accession record as the keyword(s).
This is where we gain our “local” advantage !
We don’t need more than the primary id of the accession record in the local SummaryData copy stored in the catalogue module.
This pattern also held true for queries launched from the catalogue against the bibliography and loans modules !
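Since users query these linked records almost exclusively by primary id, the local SummaryData copy can be cut down to just that id. A sketch of such a trimming rule, assuming summary strings shaped like the "IP. 304578: blah blah blah…" example above; the actual trim ran inside EMu's Texpress validation code:

```python
import re

def trim_summary(summary: str) -> str:
    """Keep only the leading primary id of a linked record's SummaryData,
    e.g. 'IP. 304578: blah blah blah' -> 'IP. 304578'."""
    m = re.match(r"([A-Z]+\.?\s*\d+)", summary)
    return m.group(1) if m else summary
```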
Catalogue Database
Catalogue module lost another 19 % of its bulk over a couple of months !
Internal Movements Database
Internal movements dropped from 550 MB down to 200 MB… a 65 % reduction !
Revised EMu “cron” maintenance job schedule
[Calendar grid: Mo Tu We Th Fr Sa Su, with rows for late night, workday, and evening; legend: emulutsrebuild, emumaintenance batch, emumaintenance compact; * = quick backup]
Systems Office trims EMu, Peabody users expand EMu
Rapid image processing using voice-recognition and batch upload with KE EMu
Susan Butts, Jessica Bazeley, Derek Briggs
Yale Peabody MuseumDivision of Invertebrate Paleontology
Stratigraphic Collection
• Peabody Museum building - 1920s
• ~150 years of curatorial and student collections, 2100 drawers
• In October 2004, a flood occurred due to a combination of a crushed and clogged pipe and rapid precipitation at high tide
Stratigraphic Collection
Agents of deterioration (Waller, 1994)
• water (flooding)
• extreme fluctuations in T & RH
• contamination (+ oak cabinets)
• physical forces (overcrowded conditions, improper containers)
• dissociation via blanket and abbreviated labeling practices
• loss of data from disintegrating and dirty labels
Systematic Collection
Class of 1954 Environmental Science Center (2001)
• compactorized storage
• 370 Delta Designs, Ltd. storage cabinets
• baked polyester powder coatings: non-reactive & solvent-free
• facility is continually monitored & logged for T & RH (60º F and 47.5 % RH)
Schuchert Collection of Brachiopods
2005 – NSF BRC
• grant to incorporate stratigraphic collection from basement into systematic collection (ESC) – Briggs & Butts, PI
• 945 drawers from basement (45 %)
– inventory collection
– catalog in KE EMu
– retray
– print new labels
– imaging of brachiopod specimens, while we are processing anyway
– rehouse in systematic collection
Imaging
• minimal processing time
• image 1 specimen per tray of multiple specimens (same taxa, same locality)
• 3 orientations: dorsal, ventral, hinge
• specimens require “props” to be photographed, so some amount of contact is necessary
• specimens must be matched with object record in EMu
How do you type in the specimen number while taking three identified-orientation images of the specimen (re-orienting each time), convert those images to web-ready files, and attach the cluster of images to the associated EMu object records for approximately 75,000 brachiopods in three years?
Call Larry!
[Workflow: voice-recognition imaging (Excel) → image upload]
Voice recognition data entry
[Workflow: voice-recognition imaging (Excel) → image upload → Adobe Photoshop (color correction)]
Batch processing
[Workflow: voice-recognition imaging (Excel) → image upload → Adobe Photoshop (color correction) → macro: image manipulation, associate image numbers to specimen numbers]
Macro
IN: spoken spreadsheet; folder of images
1. validates camera image numbers
2. validates suffix (IP naming protocol)
3. invokes ImageMagick (resize for web)
4. writes import CSV files for EMu import & attachment to records
OUT: two spreadsheets; folder of modified images
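The macro's validation and CSV steps can be sketched in Python. The filename pattern, the IP suffix rule, and the CSV column names are all hypothetical; the real macro ran against the spoken Excel spreadsheet and also shelled out to ImageMagick for the web-resize step, which is omitted here:

```python
import csv
import re
from io import StringIO

# Hypothetical IP naming protocol: <4-digit camera number>_<IP specimen suffix>.jpg
NAME_RE = re.compile(r"^(?P<img>\d{4})_(?P<spec>IP\.\d{6})\.jpg$")

def build_import_rows(filenames):
    """Validate camera image numbers and IP suffixes; collect rows for
    an EMu multimedia-import CSV and reject anything malformed."""
    rows, rejects = [], []
    for name in filenames:
        m = NAME_RE.match(name)
        if m:
            rows.append({"MulTitle": m["spec"], "Multimedia": name})
        else:
            rejects.append(name)
    return rows, rejects

def write_csv(rows) -> str:
    """Serialize the validated rows as a CSV string for the EMu import."""
    buf = StringIO()
    writer = csv.DictWriter(buf, fieldnames=["MulTitle", "Multimedia"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```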
Image processing
[Workflow: voice-recognition imaging (Excel) → image upload → Adobe Photoshop (color correction) → macro: image manipulation, associate image numbers to specimen numbers → validate/import into Multimedia module]
Multimedia module import
Multimedia data verification
[Workflow: voice-recognition imaging (Excel) → image upload → Adobe Photoshop (color correction) → ImageMagick (resize) → macro: associate image numbers to specimen numbers → validate/import into Multimedia module → validate/import into Catalog module]
Catalog module – import and validate
Object record with multimedia
Web query
Stats
Thank you.