40
Data Clean- Up: Is there a Better Way? Margaret Hogarth ER&L, 2/2/2010 © Camille Pissaro, 1882, “Peasant Woman Digging,” ARTstor 40- 06-12/42 1

Data Clean-up: Is There A Better Way?

Embed Size (px)

DESCRIPTION

Moving data about library resources among systems often engenders data cleanup processes. What is the best way to clean up data? Which tools and skills for non-programmers can help? See how University of California, Riverside Libraries tackle this issue, then share tips and techniques in an open forum.

Citation preview

Page 1: Data Clean-up: Is There A Better Way?

Data Clean-Up:

Is there a Better Way?

Margaret Hogarth

ER&L, 2/2/2010© Camille Pissaro, 1882, “Peasant Woman Digging,” ARTstor 40-06-12/42 1

Page 2: Data Clean-up: Is There A Better Way?

To Be Covered:

• Commonalities• Issues• Deficits• Excel• Access• MarcEdit• Global Update• Data Quality Policy• Suggestions?

© The Metropolitan Museum of Art, “Book of the Knowledge of Ingenious Mechanical Devices by al-Jazari,” ARTstor

2

Page 3: Data Clean-up: Is There A Better Way?

Commonalities

• Data sources• Tools• Data clean-up• Capture and use

issues• Needed technical

skills

bonobokids.com

3

Page 4: Data Clean-up: Is There A Better Way?

General Approach

• Keep original data original; copy to a new worksheet

• Save frequently• Use meaningful

file/worksheet names• Folder of “final”

documents© John Vink, 2005, "CAMBODIA," ARTstor PAR287761

4

Page 5: Data Clean-up: Is There A Better Way?

Issues

• Import issues• System limitations• Dirty data• Non-standardized

data

• Standardized but variable data

• Application-related issues• Others?

© Miguel Rio Branco, 1985, "BRAZIL," ARTstor PAR52903

5

Page 6: Data Clean-up: Is There A Better Way?

Deficits

• Time• Staff• Budget• Skills• Development• Confidence• Big-picture view• Systems

© John Murphy, “Men Digging," ARTstor DMCC.1993.4

6

Page 7: Data Clean-up: Is There A Better Way?

System Limitations

• Field character limits• Report field limits

7

Page 8: Data Clean-up: Is There A Better Way?

Import => Excel, Zeros

Problem:

• Loss of trailing zero• Loss of leading zeros

Solution:• Import > Delimited >

Tab > Text format

Example:• 1944826 =>1944-8260• 14826 => 0001-4826

8

Page 9: Data Clean-up: Is There A Better Way?

Add Hyphens to ISSNs - 1

• Sort by ISSN to make sure leading zeros are intact.

• In a new column type the formula =MID(A1,1,4)&”-”&MID(A1,5,4)

• Syntax: =MID(text,start_num,num_chars)

9

Page 10: Data Clean-up: Is There A Better Way?

Add Hyphens to ISSNs - 2

• Or use Cell Formatting

• Select range• Format Cells

(CTRL+1) > Number tab > Custom > 0000-0000 > OK

10

Page 11: Data Clean-up: Is There A Better Way?

Import => Excel, Numbers

Problem:• ID # garbled

Solution:• Choose Number format• Remove decimal places

11

Page 12: Data Clean-up: Is There A Better Way?

Import => Excel, Commas

Problem:

Solution:

Comma

12

Page 13: Data Clean-up: Is There A Better Way?

Restore Dropped Leading Zeros

• Change to TEXT format

• Sort by ISSN• Use

=CONCATENATE(“000”,A1) to add zeros

13

Page 14: Data Clean-up: Is There A Better Way?

Excel: Remove Quotes

• Select column• Find & Select >

Replace > Find what: [space]” > Replace with: [leave blank]

14

Page 15: Data Clean-up: Is There A Better Way?

Remove Non-Printable Characters• Use TRIM (removes

ASCII value 32 = space character, except single spaces between words)

• Use CLEAN (ASCII codes 0-31, Unicode 127, 129, 141, 143, 144, 157

• Use SUBSTITUTE for higher codes

Book of Hours, c. 1440 “Use of Tornai,” © ARTstor, Rawl.liturg.e.14_roll314.1_frame3

15

Page 16: Data Clean-up: Is There A Better Way?

ASCII Characters

http://en.wikipedia.org/wiki/ASCII 12/22/2009

16

Page 17: Data Clean-up: Is There A Better Way?

Access: Subscript Out of Range - 1

Try These Steps:• Check for spaces in

column headings• Use TRIM (removes

ASCII value 32 = space character)

• Use CLEAN for ASCII code 0-31

• Delete empty right columns

• Remove empty “used” cells:

1.Find end of "used cells": CTRL+SHIFT+END

2.Select all empty “used” cells > Edit > Clear > All or Edit > Delete. Save the file.

17

Page 18: Data Clean-up: Is There A Better Way?

Access: Subscript Out of Range - 2

• Copy and paste cells into a new workbook. Save. Import into Access.

• Or, save file as CSV, import into Access. = Will see data error.

Unknown, circa 1885-1900, “Sorting Mail,” © ARTstor MC212-D-99

18

Page 19: Data Clean-up: Is There A Better Way?

Access: Type Conversion Failure

• Make sure data types in fields match data types in columns.

• Data like ISBNs are text but can be “read” like numbers.

• Add top row with correct data/type: XXX for ISBN

19

Page 20: Data Clean-up: Is There A Better Way?

Access: Remove Quotes

• Search for records

with “”: Criteria: LIKE "*" & Chr(34) & "*"

• Replace([SomeField],Chr(34),"") will replace a quotation mark (") with a zero-length string

© Erich Lessing, Bayeaux Tapestry, c. 1070-80, ARTstor 31-01-01/23

20

Page 21: Data Clean-up: Is There A Better Way?

Access: ISSN Issues

• Find too-short ISSNs: Len([FieldName])<n [9 is good here]

• Find ISSNs without hyphens: SELECT table.field, table.field FROM table WHERE (((table.field) Not Like "*-*));

© Eve Arnold, 1979, “Hsishuang Panna Weeding,” ARTstor

21

Page 22: Data Clean-up: Is There A Better Way?

Access & ARL Stats - 1

• =Sum([YTD Total]) Sum of article downloads in COUNTER Journal 1 report.

• =[Jan-09]+[Feb-09]+[Mar-09]+[Apr-09]+[May-09]+[Jun-09] Sum of Jan-Jun 2009 COUNTER J1.

• =[Jul-09]+[Aug-09]+[Sep-09]+[Oct-09]+[Nov-09]+[Dec-09] Sum of Jul-Dec 2009 COUNTER J1.

22

Page 23: Data Clean-up: Is There A Better Way?

Access & ARL Stats - 2

• RowCount:Count(*) Number of titles in a set.

• =[YTD Total]*[Cost] Annual cost-per-use.

• Access Expressions: http://office.microsoft.com/en-us/access/HA011814491033.aspx

23

Page 24: Data Clean-up: Is There A Better Way?

Access or Excel?

Access:• Relational• Large amount of data• Primary key• Many people working• Long text strings

Excel:• Non-relational• Mostly numeric• Calculations/Statistics

Nelson, Emma. 2010. Using Access or Excel to Manage Your Data. http://office.microsoft.com/en-us/help/HA010429181033.aspxSee also:Microsoft. 2010. Examples of Expressions. http://office.microsoft.com/en-us/access/HA011814491033.aspx

24

Page 25: Data Clean-up: Is There A Better Way?

XML - Excel

• Excel can interpret XML

• Data > Get External Data > From XML Data Import

• Format without affecting source data

25

Later Excel: Activate Developer tab through Office logo (upper left)

Page 26: Data Clean-up: Is There A Better Way?

MarcEdit: XML - 1

• Convert large XML files to Excel

• Specify input, output files

• Choose MARC21XML => MARC

26

Page 27: Data Clean-up: Is There A Better Way?

MarcEdit: XML - 2• Choose display fields,

input, output files• View, format in Excel

27

Page 28: Data Clean-up: Is There A Better Way?

MarcEdit: MARC - 1

• Convert large files to local practices

DELETE existing

999 field

910 field(s)…

Copy 035 to 001

Remove (Sc-P) Prefix from 001

ADD

 910 |aDEL SCP ; jc ; 2009/7/8

Field:910

Field data: \\$aDEL SCP ; jc ; 2009/7/8

998 |an

Field: 998

Field data: \\$an

28

Page 29: Data Clean-up: Is There A Better Way?

MarcEdit: MARC - 2

29

Page 30: Data Clean-up: Is There A Better Way?

MarcEdit Information

• By Terry Reese

• http://oregonstate.edu/~reeset/marcedit/html/

• MARCEDIT-L listserv at  https://listserv.gmu.edu/cgi-bin/wa?SUBED1=marcedit-l&A=1   

• Regular updates

• Tutorials, templates, scripts

30

Page 31: Data Clean-up: Is There A Better Way?

ILS: “Global Update”• For records within ILS system

• For universal changes

• “Check website for coverage.”

31

Page 32: Data Clean-up: Is There A Better Way?

A Better Way? Macros

• microsoft.public.excel (General Excel group) http://groups.google.com/groups/dir?sel=33606583&hl=en

• OzGrid Forum (Excel tips and VBA macros) http://www.ozgrid.com/forum/

• http://www.lib-stats.org.uk/ (statistics listserv)

[Courtesy Tansy Matthews]

32

Page 33: Data Clean-up: Is There A Better Way?

Data Quality

Strategies to Improve Data Quality:

1. Identify problems

2. Treat data as an asset

3. Implement quality systems

Principle Activities for Data:

• Acquire• Store• Use

33

Page 34: Data Clean-up: Is There A Better Way?

Poor Quality Data Indicators

• Uncorrected errors• Redundant data/

processes• Lack of data for

strategizing• Frustration with data,

data supplier, IT (c) 2006, SCALA, “Shoes,” 6th century BCE, ARTstor

34

Page 35: Data Clean-up: Is There A Better Way?

Treat Data as an Asset

• Inventory data assets• Data = dynamic;

process = asset• Align responsibilities:

acquire, store, use data.

• Establish customer-supplier relationships for data. © The Metropolitan Museum of

Art, “Tournament,” late 16th century, ARTstor

35

Page 36: Data Clean-up: Is There A Better Way?

Apply Quality Principles

• Create and keep a customer

• Detect and correct errors

• Determine root cause of defects

• Manage the process

• Communicate results• Audit supplier

performance

36

Page 37: Data Clean-up: Is There A Better Way?

Library Data Quality Policy - 1

Suppliers/Creators:• Understand users,

uses, & requirements• Ensure requirements

are met• Manage data creation

process

Data Processors:• Avoid duplication• Safeguard data• Make data accessible• Promote data quality

in IT

37

Page 38: Data Clean-up: Is There A Better Way?

Library Data Quality Policy - 2

Users:• Define requirements,

work with suppliers• Provide feedback• Interpret data

correctly• Use data legitimately• Protect privacy

Logistics:• Determine master

systems• Understand system

limitations• Accessible storage• Match inputs with

needs• Identify key keepers

38

Page 39: Data Clean-up: Is There A Better Way?

[email protected]

Digging for Coal, The Illustrated Bartsch, vol 85, 1486, ARTstor 8586.1486/154

Other Techniques?

39

Page 40: Data Clean-up: Is There A Better Way?

Bibliography• Microsoft, 2009. Top Ten Ways to Clean Your Data.

http://office.microsoft.com/en-us/excel/HA102218401033.aspx, accessed 12/18/2009.

• Use error checking to convert numbers that are stored as text to numbers. http://office.microsoft.com/en-us/excel/HP012167611033.aspx, accessed 12/22/2009.

• Apply a number format to numbers that are stored as text http://office.microsoft.com/en-us/excel/HP012167611033.aspx

• Redman, Thomas C. 1995. Improve Data Quality for Competitive Advantage. Sloan Management Review, 36:2, 99-107.

• Rothschiller, Chad. 2007. Manipulating and Massaging Data in Excel. http://blogs.msdn.com/excel/archive/2007/11/12/manipulating-and-massaging-data-in-excel.aspx 12/18/2009.

• Spencer, John. March 6, 2008. Find/ /replace characters like quotes http://www.eggheadcafe.com/software/aspnet/31782118/findreplace-characters-l.aspx 12/22/2009

40