Toro 1

EMu on a Diet

Page 4: Toro 1

Yale campus

Page 6: Toro 1

Peabody Collections
Approximate Digital Timeline

Page 12: Toro 1

Peabody Collections
Approximate Digital Timeline

• 1991 Systems Office created & staffed
• 1991 Argus collections databasing initiative started
• 1994 Gopher services launched for collections data
• 1997 Gopher mothballed, Web / HTTP services launched
• 1998 Physical move of many collections “begins”
• 2002 Physical move of many collections “ends”
• 2003 Search for Argus successor commences
• 2003 Informatics Office created & staffed
• 2004 KE EMu to succeed Argus, data migration begins
• 2005 Argus data migration ends, go-live in KE EMu

Page 13: Toro 1

EMu migration in '05 (all disciplines went live simultaneously)

Physical move in '98-'02 (primarily neontological disciplines)

Big events

Page 14: Toro 1

Peabody Collections
Counts & Functional Cataloguing Unit

• Anthropology: 325,000 (Lot)
• Botany: 350,000 (Individual)
• Entomology: 400,000 (Lot)
• Invertebrate Paleontology: 300,000 (Lot)
• Invertebrate Zoology: 300,000 (Lot)
• Mineralogy: 35,000 (Individual)
• Paleobotany: 150,000 (Individual)
• Scientific Instruments: 3,000 (Individual)
• Vertebrate Paleontology: 125,000 (Individual)
• Vertebrate Zoology: 185,000 (Lot / Individual)

About 12 million specimens (2.1 million EMu-able units)

Page 15: Toro 1

Peabody Collections
Functional Units Databased

• Anthropology: 325,000 (90%)
• Botany: 350,000 (1%)
• Entomology: 400,000 (6%)
• Invertebrate Paleontology: 300,000 (60%)
• Invertebrate Zoology: 300,000 (25%)
• Mineralogy: 35,000 (85%)
• Paleobotany: 150,000 (60%)
• Scientific Instruments: 3,000 (100%)
• Vertebrate Paleontology: 125,000 (60%)
• Vertebrate Zoology: 185,000 (95%)

992,000 of 2.1 million (45% overall)

Page 18: Toro 1

What happens when …

… EMu gets sluggish & unresponsive?

Page 19: Toro 1

Why is this &^%$ thing so ridiculously slow?!

Page 21: Toro 1

Transient, often non-EMu issues

Persistent underlying EMu issues

Page 22: Toro 1

Persistent issues typically have at least some “local aspect” to them

So you should look your EMu straight in the eye, and see if there are any local solutions that suggest themselves

Page 23: Toro 1

TRUISM #1: disk speed affects database speed, so a smaller footprint of data on disk will increase performance

Page 24: Toro 1

1,380 MB vs 10,400 MB

In an absolute sense, EMu proved much faster than Argus, but one intriguing metric was the amount of disk space occupied by the catalogue immediately after migration.

Page 25: Toro 1

TRUISM #2: active databases never shrink over time

Page 28: Toro 1

Were there any existing diet protocols?


Page 39: Toro 1

EMu on a Diet

Vignettes from attempting to streamline a system, 2005-2007

Post-migration reassessment (2005-2006)

Preparation for Darwin Core (2006-2007)

Implications of end user habits (2007-)

Page 42: Toro 1

The ecatalogue database was a rate limiter

an EMu data directory

Page 44: Toro 1

EMu maintenance jobs

emulutsrebuild: rebuilds lookup list tables
emumaintenance batch: quick & dirty optimization (oflow)
emumaintenance compact: full optimization of database table components

Page 47: Toro 1

Default EMu “cron” maintenance job schedule

[calendar grid: Mo–Su × late night / workday / evening; legend: emulutsrebuild, emumaintenance batch, emumaintenance compact]
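A schedule like this is just a handful of crontab entries. A hypothetical sketch follows; the job names are the EMu maintenance commands described above, but the service account, install path, and every time slot here are illustrative assumptions, not the Peabody's actual configuration:

```shell
# Hypothetical crontab for an "emu" service account.
# Times and paths are invented; only the job names come from the deck.

# Rebuild lookup list tables late at night on weekdays:
30 2 * * 1-5  /home/emu/bin/emulutsrebuild

# Quick & dirty "batch" optimization in the evening on weekdays:
0 19 * * 1-5  /home/emu/bin/emumaintenance batch

# Full "compact" optimization once over the weekend:
0 3 * * 6     /home/emu/bin/emumaintenance compact
```

Shifting the heavier jobs out of the workday rows of the calendar above is exactly the kind of change the "revised" schedule slides illustrate.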

Page 50: Toro 1

Revised EMu “cron” maintenance job schedule

[calendar grid: Mo–Su × late night / workday / evening; legend: emulutsrebuild, emumaintenance batch, emumaintenance compact]

Page 51: Toro 1

1. Post-Migration Reassessment

Anatomy of the ecatalogue database

File Name | Function
~/emu/data/ecatalogue/data | the actual data
~/emu/data/ecatalogue/rec | indexing (part)
~/emu/data/ecatalogue/seg | indexing (part)

The combined size of these was 10.4 GB: 4 GB in data and 3 GB in each of rec and seg

1,380 MB vs 10,400 MB
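Watching that footprint over time is a one-liner. A minimal sketch, assuming only the directory layout in the table above (any directory works, so it can be pointed at ecatalogue or at a whole data tree):

```shell
# report_footprint DIR: print the size in kilobytes of one EMu table
# directory, e.g. ~/emu/data/ecatalogue -- a quick way to watch the
# data / rec / seg footprint between maintenance runs.
report_footprint() {
  du -sk "$1" | cut -f1
}

# Example (path assumes the layout in the table above):
# report_footprint ~/emu/data/ecatalogue
```

Logging the number with a date once a day makes the "active databases never shrink" truism visible as a curve rather than a surprise.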

Page 53: Toro 1

Closer Assessment of Legacy Data

In 2005, we had initially adopted many of the existing formats for data elements from the USNM’s EMu client, to allow for rapid development of the Peabody modules by KE prior to migration -- Legacy Data fields were among them

Page 54: Toro 1

Closer Assessment of Legacy Data

Page 55: Toro 1

sites – round 2

constant data

lengthy prefixes

Page 56: Toro 1

sites – round 2

data of temporary use in migration

Page 57: Toro 1

catalogue – round 2

data

rec

seg

Page 60: Toro 1

texload -us -d{rawdatafile} -g{grammarfile} {database table}

texload -us -dinput.dat -ginput.gram ecatalogue

Import Module in 3.2.03 and later

Page 65: Toro 1

Crunch 4

[bar chart: ecatalogue data / rec / seg sizes shrinking after each crunch]

• delete nulls from AdmOriginalData
• shorten labels on AdmOriginalData
• delete prefixes on AdmOriginalData

55% reduction!
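On a raw text export of the catalogue, the three crunches amount to simple stream edits. A hypothetical sketch: the label and prefix strings here are invented examples, not the actual AdmOriginalData layout, and a real pass would be driven by the texload grammar:

```shell
# crunch_legacy FILE: trim an exported legacy-data text dump.
#  1. drop empty "label=" lines (the nulls),
#  2. shorten a verbose label (invented example name),
#  3. strip a constant prefix repeated in every value.
crunch_legacy() {
  sed -e '/^[A-Za-z]*=$/d' \
      -e 's/^OriginalCollectorName=/OColl=/' \
      -e 's/=PREFIX: /=/' "$1"
}
```

Each rule maps to one bullet above: nulls deleted, labels shortened, prefixes removed; the reload then rebuilds data, rec, and seg at the smaller size.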

Page 66: Toro 1

2. Preparation for Darwin Core

Charles Darwin, 1809-1882

Page 67: Toro 1

Natural History Metadata Standard

“DwC”

Affords interoperability of different database systems

Widely used in collaborative informatics initiatives

Circa 40-50 fields depending on particular version

Directly analogous to the Dublin Core standard

Page 68: Toro 1
Page 69: Toro 1
Page 70: Toro 1
Page 71: Toro 1

Populate DwC fields at 3.2.02 upgrade… so what ?

IZ Department: total characters existing data 43,941,006

Page 72: Toro 1

Populate DwC fields at 3.2.02 upgrade… so what ?

IZ Department: total characters existing data 43,941,006IZ Department: est. new DwC characters 20,000,000

Page 73: Toro 1

Populate DwC fields at 3.2.02 upgrade… so what ?

IZ Department: total characters existing data 43,941,006IZ Department: est. new DwC characters 20,000,000IZ Department: est. expansion factor 45 %
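The 45% figure is just the ratio of the two character counts; checking the arithmetic:

```shell
# Estimated DwC expansion for IZ: new characters as a fraction of the
# existing data, using the two counts from the slide above.
awk 'BEGIN { printf "%.1f%%\n", 100 * 20000000 / 43941006 }'
```

This prints 45.5%, which the slide rounds to 45%.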

Page 74: Toro 1

We’re about to gain back all of the pounds we just took off by fixing up the legacy data !

Page 80: Toro 1

catalogue – round 2

[bar charts: data / rec / seg sizes, before and after actions in ecollectionevents, eparties, and ecatalogue]

Page 84: Toro 1

ExtendedData

SummaryData

ExtendedData field is a full duplication of IRN + SummaryData fields… delete the ExtendedData field, use SummaryData when in “thumbnail mode” on records

Page 87: Toro 1

Populate DwC fields at 3.2.02 upgrade… so what?

IZ Department: total characters revised data 43,707,277
IZ Department: total new DwC characters 22,358,461
IZ Department: actual expansion factor -0.1%

Some pain, but no weight gain!

Page 90: Toro 1

3. The Validation Code SlimDing

Can history of query behavior by users help identify some EMu soft spots?

If so, can we slip EMu a “dynamic diet pill” in its Texpress validation code?

texadmin

Page 91: Toro 1

…you make certain common types of changes to records in any EMu module

…and automatic changes then propagate via emuload (syncserver) into “local copies” of many fields in numerous records in linked modules

…those linked modules can grow a lot and slow EMu significantly between maintenance runs

EMu actions in the background you don’t see

Page 95: Toro 1

Why not harness EMu’s continuously ravenous appetite for pushing local copies of linked fields into remote modules… and put it to work slimming for us!

Need to first appreciate that different EMu queries work differently

Page 97: Toro 1

Drag and Drop Query

first consults the link field

Page 98: Toro 1

Straight Text Entry Query

first consults the local copy of the SummaryData from the linked record that has been inserted into the catalogue

Page 99: Toro 1

EMu’s audit log - a GIGANTIC activity trail

How often do users employ these two very different query strategies, on what fields, and are there distinctly divergent patterns?

Page 100: Toro 1

catalogue audit

In this one week sample, only 7 of 52 queries for accessions from inside the catalogue module used text queries; the other 45 were drag & drops

Page 101: Toro 1

Of those 7 text queries, every one asked for a primary id number for the accession, or the numeric piece of that number, but not for any other type of data from within those accessions
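Tallies like the 7-of-52 above fall out of a quick pass over the audit trail. A hypothetical sketch: the one-query-per-line format with a second field of "text" or "dragdrop" is an invented simplification, since the real EMu audit records are far richer:

```shell
# count_query_styles LOG: tally text vs drag-and-drop queries in a
# pre-digested audit extract. Assumes one query per line, with the
# style in field 2 -- an invented simplification of EMu's audit log.
count_query_styles() {
  awk '$2 == "text"     { t++ }
       $2 == "dragdrop" { d++ }
       END { printf "text=%d dragdrop=%d\n", t, d }' "$1"
}
```

Breaking the "text" bucket down further by which field was searched is what reveals the primary-id-only pattern the next slides rely on.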

Page 102: Toro 1

IP. 304578

Page 103: Toro 1

IP. 304578: blah blah blah…

Summary data strings

Page 107: Toro 1

Over a full year of catalogue audit data, far less than 1% of all the queries into accessions used other than the primary id of the accession record as the keyword(s).

This is where we gain our “local” advantage!

We don’t need more than the primary id of the accession record in the local Summary Data copy stored in the catalogue module.

This pattern also held true for queries launched from the catalogue against the bibliography and loans modules!

Page 113: Toro 1

Catalogue Database

Catalogue module lost another 19% of its bulk over a couple months!

Page 115: Toro 1

Internal Movements Database

Internal movements dropped from 550 MB down to 200 MB… a 65% reduction!

Page 119: Toro 1

Revised EMu “cron” maintenance job schedule

[calendar grid: Mo–Su × late night / workday / evening; legend: emulutsrebuild, emumaintenance batch, emumaintenance compact]

* * *

Page 120: Toro 1

Quick backup

Page 121: Toro 1

Systems Office trims EMu,
Peabody users expand EMu

Page 122: Toro 1

Rapid image processing using voice-recognition and batch upload with KE EMu

Susan Butts, Jessica Bazeley, Derek Briggs

Yale Peabody Museum
Division of Invertebrate Paleontology

Page 123: Toro 1

Stratigraphic Collection

• Peabody Museum building - 1920s

• ~150 years of curatorial and student collections, 2100 drawers

• October 2004, a flood occurred due to a combination of a crushed and clogged pipe and rapid precipitation at high tide

Page 124: Toro 1

Stratigraphic Collection

Agents of deterioration (Waller, 1994)

• water (flooding)
• extreme fluctuations in T & RH
• contamination (+ oak cabinets)
• physical forces (overcrowded conditions, improper containers)
• dissociation via blanket and abbreviated labeling practices
• loss of data from disintegrating and dirty labels

Page 125: Toro 1

Systematic Collection

Class of 1954 Environmental Science Center (2001)

• compactorized storage
• 370 Delta Designs, Ltd. storage cabinets
• baked polyester powder coatings: non-reactive & solvent-free
• facility is continually monitored & logged for T & RH (60 °F and 47.5% RH)

Schuchert Collection of Brachiopods

Page 126: Toro 1

2005 – NSF BRC

• grant to incorporate stratigraphic collection from basement into systematic collection (ESC) – Briggs & Butts, PI

• 945 drawers from basement (45%)
  – Inventory collection
  – Catalog in KE EMu
  – Retray
  – Print new labels
  – Imaging of brachiopod specimens – while we are processing anyway
  – Rehouse in systematic collection

Page 127: Toro 1

Imaging

minimal processing time

• Image 1 specimen per tray of multiple specimens – same taxa, same locality
• 3 orientations – dorsal, ventral, hinge
• Specimens require “props” to be photographed – so some amount of contact is necessary
• Specimens must be matched with object record in EMu

Page 128: Toro 1

How do you type in the specimen number while taking three identified-orientation images of the specimen (re-orienting each time), convert those images to web-ready files, and attach the cluster of images to the associated EMu object records for approximately 75,000 brachiopods in three years?

Call Larry!

Page 130: Toro 1

[workflow: voice recognition / imaging (Excel) → image upload]

Page 132: Toro 1

Voice recognition data entry

Page 133: Toro 1

[workflow: voice recognition / imaging (Excel) → image upload → Adobe Photoshop (color correction)]

Page 134: Toro 1

Batch processing

Page 135: Toro 1

[workflow: voice recognition / imaging (Excel) → image upload → Adobe Photoshop (color correction) → macro: image manipulation, associate image numbers to specimen number]

Page 136: Toro 1

Macro

IN: spoken spreadsheet, folder of images

1. validates camera image numbers
2. validates suffix – IP naming protocol
3. invokes ImageMagick (resize for web)
4. writes import CSV files for EMu import & attachment to records

OUT: two spreadsheets, folder of modified images
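Steps 1, 2, and 4 of the macro can be sketched in shell. This is a hypothetical version: the `IP.<number>-<suffix>.jpg` filename pattern and the CSV columns are invented stand-ins for the real IP naming protocol, and the ImageMagick resize of step 3 is left as a comment:

```shell
# make_import_csv DIR: validate image filenames against an invented
# "IP.<catalog>-<suffix>.jpg" pattern and emit a CSV mapping specimen
# numbers to image files for EMu import. Non-matching files are
# reported on stderr instead of silently imported.
make_import_csv() {
  echo "specimen,image"
  for f in "$1"/*.jpg; do
    base=$(basename "$f")
    case "$base" in
      IP.[0-9]*-[0-9][0-9].jpg)
        # convert "$f" -resize 800x800 "web/$base"  # step 3: ImageMagick resize
        spec=${base%-*}    # strip "-NN.jpg" to get e.g. IP.304578
        echo "$spec,$base"
        ;;
      *) echo "skip: $base" >&2 ;;
    esac
  done
}
```

The resulting CSV is the shape of file the Import Module can attach to object records in batch; the validation pass is what keeps a mistyped camera number from binding an image to the wrong specimen.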

Page 137: Toro 1

Image processing

Page 138: Toro 1

[workflow: voice recognition / imaging (Excel) → image upload → Adobe Photoshop (color correction) → macro: image manipulation, associate image numbers to specimen number → validate/import Multimedia module]

Page 139: Toro 1

Multimedia module import

Page 140: Toro 1

Multimedia data verification

Page 141: Toro 1

[workflow: voice recognition / imaging (Excel) → image upload → Adobe Photoshop (color correction) → ImageMagick (resize) → macro: associate image numbers to specimen number → validate/import Multimedia module → validate/import Catalog module]

Page 142: Toro 1

Catalog module – import and validate

Page 143: Toro 1

Object record with multimedia

Page 144: Toro 1

Web query

Page 145: Toro 1

Stats

Page 146: Toro 1

Thank you.