Data Cleaning and Data Publishing Workshop 2013, 18-22 February, Nairobi, Kenya

Javier Otegui (@jotegui)

DATA TRANSFORMATION



Page 2: CLEANING-Data-Transformation-Javier

Data transformation: the process of modifying data to improve or enable its fitness for a certain purpose

- Ideally, no information is lost
- Broad term:
  - Content transformation
  - Format transformation
  - Support transformation
  - …
- Examples of use:
  - Enable sharing of the dataset
  - Ease calculations and processing

FUNDAMENTS

Page 3: CLEANING-Data-Transformation-Javier

Mandatory, optional or not needed, depending on the scope of use:

- Data owned and used locally:
  - Analysis-specific transformations
- Limited or local network (lab):
  - Analysis-specific transformations
  - Data exchange among colleagues
- Publicly shared data:
  - Interoperability
  - Standards

Best practice: transform to standards even for local work

FUNDAMENTS

Page 4: CLEANING-Data-Transformation-Javier

- Content transformations
  - Schema of data storage
  - Scale of measurement
  - Several levels of difficulty
  - Standardization of content
- Format transformations
  - File format: tab-delimited, CSV, zip, spreadsheet…
  - Nowadays fairly straightforward
  - Translation between programs is easy
  - Exchange of information
- Support transformations
  - Digitization, a key step in the general data management process
  - Prone to issues
  - Enables processing, management, analysis, publishing and sharing of data

FUNDAMENTS

Page 5: CLEANING-Data-Transformation-Javier

Modify the units of the data or the elements that compose the information

Final product: same information, standard compliant

- Standard: Darwin Core (DwC)
- Two specific aims:
  - Change elements
  - Complete missing elements
- Primary Biodiversity Data (PBD)
- Metadata

CONTENT TRANSFORMATIONS

Page 6: CLEANING-Data-Transformation-Javier

- Georeferencing of localities
  - From verbatim locality description to coordinates
  - Currently not needed: GPS technology
  - Improve legacy information
  - Tools such as geolocate, geomancer…
- Coordinate systems
  - Modify units so that they comply with the standard
  - DwC for coordinates: Decimal Degrees (DD)
  - Easy: Degree-Minute-Second to DD
  - Hard: UTM to DD
  - Special attention to precision

CONTENT TRANSFORMATIONS - GEOSPATIAL
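The "easy" case above, Degree-Minute-Second to Decimal Degrees, can be sketched as follows. This is a minimal illustration, not part of the original slides; the function name `dms_to_dd` and the convention of passing the hemisphere as a letter are assumptions.

```python
def dms_to_dd(degrees, minutes=0.0, seconds=0.0, hemisphere="N"):
    """Convert degrees-minutes-seconds to decimal degrees.

    hemisphere is one of "N", "S", "E", "W" (hypothetical convention);
    southern and western coordinates are returned as negative values.
    """
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"unknown hemisphere: {hemisphere!r}")
    if not (0 <= minutes < 60 and 0 <= seconds < 60):
        raise ValueError("minutes and seconds must be in [0, 60)")
    # One minute is 1/60 of a degree, one second is 1/3600 of a degree.
    decimal = abs(degrees) + minutes / 60 + seconds / 3600
    return -decimal if hemisphere in ("S", "W") else decimal


# 45 degrees 21 minutes north, and 1 degree 17 minutes south
print(dms_to_dd(45, 21))
print(dms_to_dd(1, 17, 0, "S"))
```

The result still carries more decimals than the source justifies, so it should be rounded according to the precision of the original reading.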

Page 7: CLEANING-Data-Transformation-Javier

CONTENT TRANSFORMATIONS - GEOSPATIAL

45º 20’: precision 1’ (~2 km) at best

45.33333: precision 0.00001 (~2 m), too high

45º 21’ = 45.35

45.33: precision 0.01 (~1.4 km)

45.3: precision 0.1 (~14 km)
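One possible way to avoid overstating precision is to derive the number of decimals from the precision of the source reading. The rule below (keep the fewest decimals whose step is not coarser than the source precision) is an assumption chosen to reproduce the 45.33 value in the example above; the function names are hypothetical.

```python
import math


def decimals_for_precision(precision_deg):
    """Fewest decimal places whose step (10**-n) is not coarser than precision_deg."""
    if precision_deg <= 0:
        raise ValueError("precision must be positive")
    return max(0, math.ceil(-math.log10(precision_deg)))


def degrees_minutes_to_dd(degrees, minutes):
    """Convert degrees and whole minutes to decimal degrees, rounded to minute precision."""
    decimal = degrees + minutes / 60
    # A reading given in whole minutes has a precision of 1/60 of a degree.
    return round(decimal, decimals_for_precision(1 / 60))


print(decimals_for_precision(1 / 60))    # reading given to the minute
print(decimals_for_precision(1 / 3600))  # reading given to the second
print(degrees_minutes_to_dd(45, 20))
```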

Page 8: CLEANING-Data-Transformation-Javier

- Georeferencing of localities
  - From verbatim locality description to coordinates
  - Current GPS technology makes it easier
  - Improve legacy information
  - Tools such as geolocate, geomancer…
- Coordinate systems
  - Modify units so that they comply with the standard
  - DwC for coordinates: Decimal Degrees (DD)
  - Easy: Degree-Minute-Second to DD
  - Hard: UTM to DD
  - Special attention to precision
- Improve missing fields
  - Use mapping tools and/or gazetteers to complete information

CONTENT TRANSFORMATIONS - GEOSPATIAL

Page 9: CLEANING-Data-Transformation-Javier

- Special character encoding
  - Special characters in taxonomic names and/or authorships
  - Interoperability issues may appear
  - Transform these characters to a simplified version or enable a different text encoding
- Higher-level taxa completion
  - Transformation to broaden the potential uses
  - Search in taxonomic databases or literature

CONTENT TRANSFORMATIONS - TAXONOMIC
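Both options for special characters can be sketched with the Python standard library. This is an illustrative sketch only: the function names are hypothetical, and the source encoding of a real file has to be known or determined beforehand.

```python
import unicodedata


def simplify_characters(text):
    """Replace accented letters with their unaccented base letters.

    Works by decomposing each character (NFKD) and dropping the combining
    marks. Letters without a decomposition, such as "ø" or "ß", are left
    unchanged and would need an explicit mapping.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def to_utf8(raw_bytes, source_encoding):
    """Re-encode bytes from a known source encoding to UTF-8 without changing the characters."""
    return raw_bytes.decode(source_encoding).encode("utf-8")


print(simplify_characters("Lacépède"))
print(to_utf8("Müller".encode("latin-1"), "latin-1").decode("utf-8"))
```

Simplifying characters loses information, so keeping the original string in a verbatim field alongside the simplified one is a safer choice where the schema allows it.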

Page 10: CLEANING-Data-Transformation-Javier

- Order of elements
  - Different places naturally use different element orders
  - Example: in the US, July 26th 2012 might become 07-26-2012
  - Slight modification; a good parser can detect and update this information to comply with standards
- Date systems
  - Standard: DwC recommends ISO 8601
  - Different formats:
    - 1984-09-14, 14th September 1984
    - 34th week of 2012, 125th day of 2012
  - A good parser is needed to understand all possibilities
  - Transform to a common system and avoid ambiguities

CONTENT TRANSFORMATIONS - TEMPORAL
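A small parser for the formats listed above might look like this. It is a sketch under stated assumptions: dates with numeric month and day are read in US month-first order, month names are in English (`%B` depends on the locale), and a week is represented by its Monday. All function names are hypothetical, and `date.fromisocalendar` requires Python 3.8 or later.

```python
import re
from datetime import date, datetime, timedelta

# Matches the ordinal suffix in "26th", "1st", "2nd", "3rd".
ORDINAL_SUFFIX = re.compile(r"(?<=\d)(st|nd|rd|th)\b")

# Layouts tried in order; "%m-%d-%Y" encodes the US month-first assumption.
TEXT_FORMATS = ("%Y-%m-%d", "%m-%d-%Y", "%d %B %Y", "%B %d %Y")


def to_iso_date(text):
    """Return the ISO 8601 form (YYYY-MM-DD) of a date written in a known layout."""
    cleaned = ORDINAL_SUFFIX.sub("", text.strip())
    for layout in TEXT_FORMATS:
        try:
            return datetime.strptime(cleaned, layout).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {text!r}")


def day_of_year_to_iso(year, day_number):
    """Return the ISO date of the Nth day of a year (1 = 1 January)."""
    return (date(year, 1, 1) + timedelta(days=day_number - 1)).isoformat()


def iso_week_to_iso(year, week):
    """Return the ISO date of the Monday of the given ISO week."""
    return date.fromisocalendar(year, week, 1).isoformat()


for value in ("1984-09-14", "14th September 1984", "July 26th 2012", "07-26-2012"):
    print(value, "->", to_iso_date(value))
print(day_of_year_to_iso(2012, 125))
print(iso_week_to_iso(2012, 34))
```

A value such as 03-04-2012 remains ambiguous whatever the parser does; the element order has to come from knowledge of the data source.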

Page 11: CLEANING-Data-Transformation-Javier

- Improvement of interoperability: controlled vocabulary
  - Example: basisOfRecord
  - Different languages, non-standard acronyms…
  - Transform terms to the standard to improve retrieval of data
- Improvement of collections: metadata becomes data
  - One man’s metadata is another man’s data
  - Information common to a collection might be omitted locally
  - Must be added when sharing

CONTENT TRANSFORMATIONS - METADATA
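Both metadata transformations can be sketched as a single pass over a record. The local variants in the lookup table and the collection-level values are hypothetical examples; only the target terms (`PreservedSpecimen`, `HumanObservation`) and the field names (`basisOfRecord`, `institutionCode`, `collectionCode`) come from Darwin Core.

```python
# Hypothetical local spellings mapped to Darwin Core basisOfRecord terms.
BASIS_OF_RECORD = {
    "preservedspecimen": "PreservedSpecimen",
    "specimen": "PreservedSpecimen",
    "espécimen preservado": "PreservedSpecimen",
    "humanobservation": "HumanObservation",
    "observation": "HumanObservation",
    "obs": "HumanObservation",
    "observación": "HumanObservation",
}

# Hypothetical values shared by every record in the collection,
# usually left out of the local file.
COLLECTION_DEFAULTS = {"institutionCode": "XYZ", "collectionCode": "Birds"}


def standardise_record(record):
    """Return a copy of the record with a standard basisOfRecord and collection-level fields."""
    result = dict(record)
    # Metadata becomes data: fill in collection-level fields that are missing or empty.
    for field, value in COLLECTION_DEFAULTS.items():
        if not result.get(field):
            result[field] = value
    # Controlled vocabulary: map known variants, leave unknown values for manual review.
    key = result.get("basisOfRecord", "").strip().lower()
    if key in BASIS_OF_RECORD:
        result["basisOfRecord"] = BASIS_OF_RECORD[key]
    return result


print(standardise_record({"scientificName": "Panthera leo", "basisOfRecord": " Obs "}))
```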

Page 12: CLEANING-Data-Transformation-Javier

Modify the storage of the data

Final product: same information, easily exchangeable format

Two key cases:

- Text to spreadsheet and spreadsheet to text
- Text or spreadsheet to database

FILE FORMAT TRANSFORMATIONS

Page 13: CLEANING-Data-Transformation-Javier

The most common type of format transformation

Import a text file to a spreadsheet, or export from a spreadsheet to a text file

- Aims
  - Importing to spreadsheet: improve data processing
  - Exporting to text file: share data and allow others to import easily
- To be effective:
  - No loss of data
  - No transformation of content

FILE FORMAT TRANSFORMATIONS – TEXT TO SPREADSHEET
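The two conditions above (no loss of data, no transformation of content) can be checked programmatically when converting between delimited text formats. The sketch below uses Python's standard `csv` module; the function names and the sample data are hypothetical.

```python
import csv
import io


def parse_delimited(text, delimiter):
    """Parse delimited text into a list of rows (lists of strings)."""
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))


def convert_delimited(text, source_delimiter=",", target_delimiter="\t"):
    """Convert between delimiters and verify that every value survives unchanged."""
    rows = parse_delimited(text, source_delimiter)
    buffer = io.StringIO(newline="")
    csv.writer(buffer, delimiter=target_delimiter, lineterminator="\n").writerows(rows)
    converted = buffer.getvalue()
    # Round-trip check: reading the output back must give exactly the same rows.
    if parse_delimited(converted, target_delimiter) != rows:
        raise ValueError("content changed during conversion")
    return converted


sample = 'scientificName,locality\n"Panthera leo","Nairobi, Kenya"\n'
print(convert_delimited(sample))
```

The locality value contains a comma, which is why it is quoted in the CSV input; in the tab-delimited output the quoting is no longer needed.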

Page 14: CLEANING-Data-Transformation-Javier

- From CSV or tab-delimited to spreadsheet
- CSV or tab-delimited, depending on the content
- Modern spreadsheets have algorithms to import data in text files
- Most of the time, we can select the separator used

FILE FORMAT TRANSFORMATIONS – TEXT TO SPREADSHEET

Page 15: CLEANING-Data-Transformation-Javier

FILE FORMAT TRANSFORMATIONS – TEXT TO SPREADSHEET

Page 16: CLEANING-Data-Transformation-Javier

- From CSV or tab-delimited to spreadsheet
- CSV or tab-delimited, depending on the content
- Modern spreadsheets have algorithms to import data in text files
- Most of the time, we can select the separator used
- Still, this step must be taken carefully:
  - More or fewer fields than there should be
  - Hidden new-line characters
  - …

After importing, check

FILE FORMAT TRANSFORMATIONS – TEXT TO SPREADSHEET
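The two issues named above can also be detected before the file is opened in a spreadsheet. This is a minimal sketch using Python's `csv` module; the function name and the sample text are hypothetical, and the first row is assumed to be a header.

```python
import csv
import io


def find_bad_rows(text, delimiter="\t"):
    """Report rows with an unexpected number of fields or with embedded new-lines.

    Returns (line number, message) pairs. The line number is the physical
    line on which the row ends, as counted by csv.reader.
    """
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        return []
    problems = []
    for row in reader:
        if len(row) != len(header):
            problems.append(
                (reader.line_num, f"expected {len(header)} fields, found {len(row)}")
            )
        elif any("\n" in value or "\r" in value for value in row):
            problems.append((reader.line_num, "new-line character inside a field"))
    return problems


sample = 'a\tb\tc\n1\t2\t3\n4\t5\n6\t"x\ny"\t7\n'
for line_number, message in find_bad_rows(sample):
    print(line_number, message)
```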

Page 17: CLEANING-Data-Transformation-Javier

After import, check

Autofilter comes in handy

“Female” value in the “individualCount” field??

FILE FORMAT TRANSFORMATIONS – TEXT TO SPREADSHEET
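The same kind of check that Autofilter makes visible can be scripted: list every value in a numeric field that is not a whole number. The sketch below assumes records are dictionaries, as produced by `csv.DictReader`; the function name and sample records are hypothetical.

```python
def non_integer_values(records, field="individualCount"):
    """Return (record number, value) pairs where the field is filled but not a whole number."""
    suspicious = []
    for number, record in enumerate(records, start=1):
        value = (record.get(field) or "").strip()
        # Empty values are allowed; anything else must consist of ASCII digits only.
        if value and not (value.isascii() and value.isdigit()):
            suspicious.append((number, value))
    return suspicious


records = [
    {"individualCount": "2", "sex": "Male"},
    {"individualCount": "Female", "sex": ""},
    {"individualCount": "", "sex": "Female"},
]
print(non_integer_values(records))
```

A text value in a count field often means that a row has shifted by one column during import, so the neighbouring fields of the reported records deserve a look as well.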