View
5.986
Download
3
Embed Size (px)
DESCRIPTION
Moving data about library resources among systems often engenders data cleanup processes. What is the best way to clean up data? Which tools and skills for non-programmers can help? See how University of California, Riverside Libraries tackle this issue, then share tips and techniques in an open forum.
Citation preview
Data Clean-Up:
Is there a Better Way?
Margaret Hogarth
ER&L, 2/2/2010© Camille Pissaro, 1882, “Peasant Woman Digging,” ARTstor 40-06-12/42 1
To Be Covered:
• Commonalities• Issues• Deficits• Excel• Access• MarcEdit• Global Update• Data Quality Policy• Suggestions?
© The Metropolitan Museum of Art, “Book of the Knowledge of Ingenious Mechanical Devices by al-Jazari,” ARTstor
2
Commonalities
• Data sources• Tools• Data clean-up• Capture and use
issues• Needed technical
skills
bonobokids.com
3
General Approach
• Keep original data original; copy to a new worksheet
• Save frequently• Use meaningful
file/worksheet names• Folder of “final”
documents© John Vink, 2005, "CAMBODIA," ARTstor PAR287761
4
Issues
• Import issues• System limitations• Dirty data• Non-standardized
data
• Standardized but variable data
• Application-related issues• Others?
© Miguel Rio Branco, 1985, "BRAZIL," ARTstor PAR52903
5
Deficits
• Time• Staff• Budget• Skills• Development• Confidence• Big-picture view• Systems
© John Murphy, “Men Digging," ARTstor DMCC.1993.4
6
System Limitations
• Field character limits• Report field limits
7
Import => Excel, Zeros
Problem:
• Loss of trailing zero• Loss of leading zeros
Solution:• Import > Delimited >
Tab > Text format
Example:• 1944826 =>1944-8260• 14826 => 0001-4826
8
Add Hyphens to ISSNs - 1
• Sort by ISSN to make sure leading zeros are intact.
• In a new column type the formula =MID(A1,1,4)&”-”&MID(A1,5,4)
• Syntax: =MID(text,start_num,num_chars)
9
Add Hyphens to ISSNs - 2
• Or use Cell Formatting
• Select range• Format Cells
(CTRL+1) > Number tab > Custom > 0000-0000 > OK
10
Import => Excel, Numbers
Problem:• ID # garbled
Solution:• Choose Number format• Remove decimal places
11
Import => Excel, Commas
Problem:
Solution:
Comma
12
Restore Dropped Leading Zeros
• Change to TEXT format
• Sort by ISSN• Use
=CONCATENATE(“000”,A1) to add zeros
13
Excel: Remove Quotes
• Select column• Find & Select >
Replace > Find what: [space]” > Replace with: [leave blank]
14
Remove Non-Printable Characters• Use TRIM (removes
ASCII value 32 = space character, except single spaces between words)
• Use CLEAN (ASCII codes 0-31, Unicode 127, 129, 141, 143, 144, 157
• Use SUBSTITUTE for higher codes
Book of Hours, c. 1440 “Use of Tornai,” © ARTstor, Rawl.liturg.e.14_roll314.1_frame3
15
ASCII Characters
http://en.wikipedia.org/wiki/ASCII 12/22/2009
16
Access: Subscript Out of Range - 1
Try These Steps:• Check for spaces in
column headings• Use TRIM (removes
ASCII value 32 = space character)
• Use CLEAN for ASCII code 0-31
• Delete empty right columns
• Remove empty “used” cells:
1.Find end of "used cells": CTRL+SHIFT+END
2.Select all empty “used” cells > Edit > Clear > All or Edit > Delete. Save the file.
17
Access: Subscript Out of Range - 2
• Copy and paste cells into a new workbook. Save. Import into Access.
• Or, save file as CSV, import into Access. = Will see data error.
Unknown, circa 1885-1900, “Sorting Mail,” © ARTstor MC212-D-99
18
Access: Type Conversion Failure
• Make sure data types in fields match data types in columns.
• Data like ISBNs are text but can be “read” like numbers.
• Add top row with correct data/type: XXX for ISBN
19
Access: Remove Quotes
• Search for records
with “”: Criteria: LIKE "*" & Chr(34) & "*"
• Replace([SomeField],Chr(34),"") will replace a quotation mark (") with a zero-length string
© Erich Lessing, Bayeaux Tapestry, c. 1070-80, ARTstor 31-01-01/23
20
Access: ISSN Issues
• Find too-short ISSNs: Len([FieldName])<n [9 is good here]
• Find ISSNs without hyphens: SELECT table.field, table.field FROM table WHERE (((table.field) Not Like "*-*));
© Eve Arnold, 1979, “Hsishuang Panna Weeding,” ARTstor
21
Access & ARL Stats - 1
• =Sum([YTD Total]) Sum of article downloads in COUNTER Journal 1 report.
• =[Jan-09]+[Feb-09]+[Mar-09]+[Apr-09]+[May-09]+[Jun-09] Sum of Jan-Jun 2009 COUNTER J1.
• =[Jul-09]+[Aug-09]+[Sep-09]+[Oct-09]+[Nov-09]+[Dec-09] Sum of Jul-Dec 2009 COUNTER J1.
22
Access & ARL Stats - 2
• RowCount:Count(*) Number of titles in a set.
• =[YTD Total]*[Cost] Annual cost-per-use.
• Access Expressions: http://office.microsoft.com/en-us/access/HA011814491033.aspx
23
Access or Excel?
Access:• Relational• Large amount of data• Primary key• Many people working• Long text strings
Excel:• Non-relational• Mostly numeric• Calculations/Statistics
Nelson, Emma. 2010. Using Access or Excel to Manage Your Data. http://office.microsoft.com/en-us/help/HA010429181033.aspxSee also:Microsoft. 2010. Examples of Expressions. http://office.microsoft.com/en-us/access/HA011814491033.aspx
24
XML - Excel
• Excel can interpret XML
• Data > Get External Data > From XML Data Import
• Format without affecting source data
25
Later Excel: Activate Developer tab through Office logo (upper left)
MarcEdit: XML - 1
• Convert large XML files to Excel
• Specify input, output files
• Choose MARC21XML => MARC
26
MarcEdit: XML - 2• Choose display fields,
input, output files• View, format in Excel
27
MarcEdit: MARC - 1
• Convert large files to local practices
DELETE existing
999 field
910 field(s)…
Copy 035 to 001
Remove (Sc-P) Prefix from 001
ADD
910 |aDEL SCP ; jc ; 2009/7/8
Field:910
Field data: \\$aDEL SCP ; jc ; 2009/7/8
998 |an
Field: 998
Field data: \\$an
28
MarcEdit: MARC - 2
29
MarcEdit Information
• By Terry Reese
• http://oregonstate.edu/~reeset/marcedit/html/
• MARCEDIT-L listserv at https://listserv.gmu.edu/cgi-bin/wa?SUBED1=marcedit-l&A=1
• Regular updates
• Tutorials, templates, scripts
30
ILS: “Global Update”• For records within ILS system
• For universal changes
• “Check website for coverage.”
31
A Better Way? Macros
• microsoft.public.excel (General Excel group) http://groups.google.com/groups/dir?sel=33606583&hl=en
• OzGrid Forum (Excel tips and VBA macros) http://www.ozgrid.com/forum/
• http://www.lib-stats.org.uk/ (statistics listserv)
[Courtesy Tansy Matthews]
32
Data Quality
Strategies to Improve Data Quality:
1. Identify problems
2. Treat data as an asset
3. Implement quality systems
Principle Activities for Data:
• Acquire• Store• Use
33
Poor Quality Data Indicators
• Uncorrected errors• Redundant data/
processes• Lack of data for
strategizing• Frustration with data,
data supplier, IT (c) 2006, SCALA, “Shoes,” 6th century BCE, ARTstor
34
Treat Data as an Asset
• Inventory data assets• Data = dynamic;
process = asset• Align responsibilities:
acquire, store, use data.
• Establish customer-supplier relationships for data. © The Metropolitan Museum of
Art, “Tournament,” late 16th century, ARTstor
35
Apply Quality Principles
• Create and keep a customer
• Detect and correct errors
• Determine root cause of defects
• Manage the process
• Communicate results• Audit supplier
performance
36
Library Data Quality Policy - 1
Suppliers/Creators:• Understand users,
uses, & requirements• Ensure requirements
are met• Manage data creation
process
Data Processors:• Avoid duplication• Safeguard data• Make data accessible• Promote data quality
in IT
37
Library Data Quality Policy - 2
Users:• Define requirements,
work with suppliers• Provide feedback• Interpret data
correctly• Use data legitimately• Protect privacy
Logistics:• Determine master
systems• Understand system
limitations• Accessible storage• Match inputs with
needs• Identify key keepers
38
Digging for Coal, The Illustrated Bartsch, vol 85, 1486, ARTstor 8586.1486/154
Other Techniques?
39
Bibliography• Microsoft, 2009. Top Ten Ways to Clean Your Data.
http://office.microsoft.com/en-us/excel/HA102218401033.aspx, accessed 12/18/2009.
• Use error checking to convert numbers that are stored as text to numbers. http://office.microsoft.com/en-us/excel/HP012167611033.aspx, accessed 12/22/2009.
• Apply a number format to numbers that are stored as text http://office.microsoft.com/en-us/excel/HP012167611033.aspx
• Redman, Thomas C. 1995. Improve Data Quality for Competitive Advantage. Sloan Management Review, 36:2, 99-107.
• Rothschiller, Chad. 2007. Manipulating and Massaging Data in Excel. http://blogs.msdn.com/excel/archive/2007/11/12/manipulating-and-massaging-data-in-excel.aspx 12/18/2009.
• Spencer, John. March 6, 2008. Find/ /replace characters like quotes http://www.eggheadcafe.com/software/aspnet/31782118/findreplace-characters-l.aspx 12/22/2009
40