Data Cleaning and Data Publishing Workshop 2013, 18-22 February, Nairobi, Kenya
Javier Otegui (@jotegui)
DATA TRANSFORMATION
Data Transformation – the process of modifying data with the aim of improving or enabling its fitness for a certain purpose
Ideally, no information loss
Broad term:
  Content transformation
  Format transformation
  Support transformation
  …
Examples of use:
  Enable sharing of the dataset
  Ease calculations and processing
FUNDAMENTS
Mandatory, optional or not needed, depending on scope of use
Data owned and used locally:
  Analysis-specific transformations
Limited or local network (lab):
  Analysis-specific transformations
  Data exchange among colleagues
Publicly shared data:
  Interoperability
  Standards
Best practices: transform to standards even in local work
FUNDAMENTS
Content transformations
  Schema of data storage
  Scale of measurement
  Several levels of difficulty
  Standardization of content
Format transformations
  File format: tab-delimited, CSV, zip, spreadsheet…
  Nowadays it is fairly straightforward
  Translation between programs easy
  Exchange of information
Support transformations
  Digitization, key step in general data management process
  Prone to issues
  Enable processing, management, analysis, publishing and sharing of data
FUNDAMENTS
Modify the units of the data or the elements that compose the information
Final product – same information, standard compliant
Standard – Darwin Core (DwC)
Two specific aims:
  Change elements
  Complete missing elements
Primary Biodiversity Data (PBD)
Metadata
CONTENT TRANSFORMATIONS
Georeferencing of localities
  From verbatim locality description to coordinates
  Less needed for new records, thanks to GPS technology
  Improve legacy information
  Tools such as GEOLocate, geomancer…
Coordinate systems
  Modify units so that they comply with the DwC standard for coordinates: Decimal Degrees (DD)
  Easy: Degree-Minute-Second to DD
  Hard: UTM to DD
  Special attention to precision
CONTENT TRANSFORMATIONS - GEOSPATIAL
CONTENT TRANSFORMATIONS - GEOSPATIAL
45° 20' – Precision 1' (~2 km) at best
45.33333 – Precision 0.00001 (~2 m), too high
45° 21'
45.35
45.33 (0.01, ~1.4 km)
45.3 (0.1, ~14 km)
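The "easy" Degree-Minute-Second to DD case and the precision caveat can be sketched in a few lines of Python. This is a minimal illustration, not a recommended tool: the function names `dms_to_dd` and `decimals_for_precision` are hypothetical, and the rounding rule (never report a finer step than the source recorded) is one conservative assumption among several possible choices.

```python
import math

def dms_to_dd(degrees, minutes=0.0, seconds=0.0, hemisphere="N"):
    """Convert degrees, minutes and seconds to signed decimal degrees."""
    dd = abs(degrees) + minutes / 60.0 + seconds / 3600.0
    # Southern and western hemispheres are negative in decimal degrees.
    return -dd if hemisphere.upper() in ("S", "W") else dd

def decimals_for_precision(precision_deg):
    """Most decimals whose step size is no finer than the source precision.

    Assumption for this sketch: precision_deg is the smallest unit
    recorded in the source, expressed in degrees (1 arc-minute = 1/60).
    """
    return max(0, math.floor(-math.log10(precision_deg)))

dd = dms_to_dd(45, 20)                      # 45 degrees 20 minutes north
decimals = decimals_for_precision(1 / 60)   # source recorded to 1 arc-minute
print(round(dd, decimals))                  # 45.3
print(round(dd, 2))                         # 45.33
```

Keeping one decimal (45.3) understates the source precision, while two decimals (45.33) slightly overstate it; either way, the unrounded 45.33333 claims far more than a reading of 45° 20' supports.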
Georeferencing of localities
  From verbatim locality description to coordinates
  Current GPS technology makes it easier
  Improve legacy information
  Tools such as GEOLocate, geomancer…
Coordinate systems
  Modify units so that they comply with the DwC standard for coordinates: Decimal Degrees (DD)
  Easy: Degree-Minute-Second to DD
  Hard: UTM to DD
  Special attention to precision
Improve missing fields
  Use mapping tools and/or gazetteers to complete information
CONTENT TRANSFORMATIONS - GEOSPATIAL
Special character encoding
  Special characters in taxonomic names and/or authorships
  Interoperability issues may appear
  Transform these characters to a simplified version or enable a different text encoding
Higher level taxa completion
  Transformation to broaden the potential uses
  Search in taxonomic databases or literature
CONTENT TRANSFORMATIONS - TAXONOMIC
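One way to "transform these characters to a simplified version" is Unicode decomposition, sketched below with Python's standard `unicodedata` module. The function name `simplify_characters` is hypothetical. Note that the result loses information (the accents), so in this sketch it is assumed the original string is kept elsewhere; letters without a decomposition, such as "ø", pass through unchanged.

```python
import unicodedata

def simplify_characters(text):
    """Strip accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(simplify_characters("Lacépède"))  # Lacepede
print(simplify_characters("Müller"))    # Muller
```

The alternative named on the slide, enabling a different text encoding, usually means reading and writing files with an explicit encoding (for example `open(path, encoding="utf-8")`) so that the special characters survive the exchange intact.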
Order of elements
  Different places naturally use different element orders
  Example: US, July 26th 2012
  Might become 07-26-2012
  Slight modification with a good parser to detect and update this information to comply with standards
Date systems
  Standard: DwC recommends ISO 8601
  Different formats:
    1984-09-14, 14th September 1984
    34th week of 2012, 125th day of 2012
  A good parser is needed to understand all possibilities
  Transformations to use a common system and avoid ambiguities
CONTENT TRANSFORMATIONS - TEMPORAL
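A very small parser for some of the date forms above might look like the following sketch. It uses only the Python standard library; the function name `to_iso_date`, the `day_first` flag and the list of accepted formats are assumptions made for illustration. Month names are matched in English, and purely numeric dates such as 03-04-2012 stay ambiguous unless the caller states the element order.

```python
import re
from datetime import datetime

def to_iso_date(text, day_first=False):
    """Return an ISO 8601 date (YYYY-MM-DD), or None if no format matches."""
    # Drop ordinal suffixes ("26th" -> "26") and commas, collapse spaces.
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", text.strip())
    cleaned = " ".join(cleaned.replace(",", " ").split())
    # The caller must say whether numeric dates are day-first or month-first.
    numeric = "%d-%m-%Y" if day_first else "%m-%d-%Y"
    formats = [
        "%Y-%m-%d",   # 1984-09-14 (already ISO 8601)
        numeric,      # 07-26-2012 (US order by default)
        "%d %B %Y",   # 14 September 1984
        "%B %d %Y",   # July 26 2012
        "%Y-%j",      # 2012-125 (year and day of year)
    ]
    for fmt in formats:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # leave for manual review rather than guess

print(to_iso_date("July 26th, 2012"))      # 2012-07-26
print(to_iso_date("14th September 1984"))  # 1984-09-14
print(to_iso_date("2012-125"))             # 2012-05-04
```

Returning `None` for unrecognized input keeps the transformation from silently inventing dates; those records can then be checked by hand.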
Improvement of interoperability – controlled vocabulary
  Example: basisOfRecord
  Different languages, non-standard acronyms…
  Transform term to standard to improve retrieval of data
Improvement of collections – metadata becomes data
  One man’s metadata is another man’s data
  Information common to a collection might be omitted locally
  Must be added when sharing
CONTENT TRANSFORMATIONS - METADATA
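Both ideas on this slide can be sketched as simple lookups in Python. The standard terms listed are part of the Darwin Core recommended vocabulary for basisOfRecord; everything else is hypothetical: the local variants in `VARIANTS`, the function names, and the example codes "XYZ" and "Herps" are invented for illustration only.

```python
STANDARD_TERMS = {
    "PreservedSpecimen", "FossilSpecimen", "LivingSpecimen",
    "HumanObservation", "MachineObservation",
}

# Hypothetical local spellings, translations and acronyms.
VARIANTS = {
    "specimen": "PreservedSpecimen",
    "preserved specimen": "PreservedSpecimen",
    "espécimen preservado": "PreservedSpecimen",
    "ps": "PreservedSpecimen",
    "obs": "HumanObservation",
    "observación": "HumanObservation",
    "fossil": "FossilSpecimen",
}

def standardize_basis_of_record(value):
    """Map a local value to a standard term, or None if it needs review."""
    key = " ".join(value.strip().lower().split())
    for term in STANDARD_TERMS:
        if key == term.lower():
            return term
    return VARIANTS.get(key)

def add_collection_metadata(record, collection_defaults):
    """Fill in collection-wide values that were omitted in the local record."""
    completed = dict(record)
    for field, value in collection_defaults.items():
        if not completed.get(field):
            completed[field] = value
    return completed

record = {"basisOfRecord": "Espécimen preservado", "institutionCode": ""}
record["basisOfRecord"] = standardize_basis_of_record(record["basisOfRecord"])
shared = add_collection_metadata(
    record, {"institutionCode": "XYZ", "collectionCode": "Herps"}
)
print(shared)
```

Values already present in a record are left untouched, so the collection-level defaults only complete what was missing.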
Modify the storage of the data
Final product – same information, easily exchangeable format
Two key cases:
  Text to spreadsheet and spreadsheet to text
  Text or spreadsheet to database
FILE FORMAT TRANSFORMATIONS
The most common type of format transformation
Import text file to spreadsheet or export from spreadsheet to text file
Aims
  Importing to spreadsheet – improve data processing
  Exporting to text file – share data and allow others to import easily
To be effective:
  No loss of data
  No transformation of content
FILE FORMAT TRANSFORMATIONS – TEXT TO SPREADSHEET
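The two conditions for an effective format transformation (no loss of data, no transformation of content) can be checked mechanically. The sketch below converts CSV text to tab-delimited text with Python's standard `csv` module and then verifies that both versions parse to the same rows. The helper names and the sample rows are hypothetical.

```python
import csv
import io

def parse_delimited(text, delimiter):
    """Parse delimited text into a list of rows (lists of strings)."""
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))

def convert_delimited(text, from_delimiter, to_delimiter):
    """Rewrite delimited text with another separator, quoting where needed."""
    out = io.StringIO(newline="")
    writer = csv.writer(out, delimiter=to_delimiter, lineterminator="\n")
    writer.writerows(parse_delimited(text, from_delimiter))
    return out.getvalue()

csv_text = 'scientificName,locality\n"Puma concolor","Nairobi, Kenya"\n'
tsv_text = convert_delimited(csv_text, ",", "\t")

# Same rows before and after: nothing lost, nothing altered.
assert parse_delimited(tsv_text, "\t") == parse_delimited(csv_text, ",")
print(tsv_text)
```

The locality value contains a comma, which is why it had to be quoted in the CSV version but not in the tab-delimited one; this is an example of choosing the separator "depending on the content".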
From CSV or tab-delimited to spreadsheet
CSV or tab-delimited depending on the content
Modern spreadsheets have algorithms to import data in text files
Most of the time, we can select the separator used
Still, this step must be taken carefully:
  More or fewer fields than expected
  Hidden new-line characters
  …
After importing, check
FILE FORMAT TRANSFORMATIONS – TEXT TO SPREADSHEET
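A check for "more or fewer fields than expected" can also be run on the text file itself, before or after the import. The following sketch uses Python's standard `csv` module and assumes the first row is a header that defines the expected number of fields; the function name `find_irregular_rows` and the sample text are hypothetical.

```python
import csv
import io

def find_irregular_rows(text, delimiter=","):
    """Return (record_number, field_count) for rows that differ from the header."""
    rows = list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))
    if not rows:
        return []
    expected = len(rows[0])
    return [
        (number, len(row))
        for number, row in enumerate(rows, start=1)
        if len(row) != expected
    ]

sample = "catalogNumber,sex,individualCount\n1,Female,1\n2,Male\n3,Female,1,extra\n"
print(find_irregular_rows(sample))  # [(3, 2), (4, 4)]
```

An unquoted new-line character hidden inside a field typically shows up here as two consecutive rows that are both too short.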
After import, check
Autofilter comes in handy
“Female” value in “individualCount” field??
FILE FORMAT TRANSFORMATIONS – TEXT TO SPREADSHEET
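The kind of problem shown here, a "Female" value sitting in the individualCount field, is what a spreadsheet autofilter reveals by eye. The same check can be sketched in Python: list every record whose count field is filled in but is not a whole number. The function name `find_non_integer_values` and the sample records are hypothetical.

```python
import re

def find_non_integer_values(records, field="individualCount"):
    """Return (record_number, value) for non-empty values that are not whole numbers."""
    problems = []
    for number, record in enumerate(records, start=1):
        value = (record.get(field) or "").strip()
        if value and not re.fullmatch(r"[0-9]+", value):
            problems.append((number, value))
    return problems

records = [
    {"sex": "Female", "individualCount": "2"},
    {"sex": "", "individualCount": "Female"},  # columns shifted on import
    {"sex": "Male", "individualCount": ""},
]
print(find_non_integer_values(records))  # [(2, 'Female')]
```

A hit like this usually means the fields of that row were shifted during import, so the whole row deserves a look, not just the one cell.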