TOOLS & TECHNIQUES FOR WORKING WITH DATA
TL;DR
1. There are many tools & techniques to work with data.
2. Know about alternatives and try them.
3. Not every data management task is the same.
4. You might need different tools for separate parts of larger tasks.
Tools we will use working with our example data
1. csvkit
a. csvcut
b. csvstat
c. csvsql
d. csvsort
e. in2csv
2. Unix/Mac commands
a. wc
b. head
c. pbcopy
d. ‘piping’
3. DB Browser for SQLite
A brief ode to CSV:
● Possibly the most widely supported structured data format in the world.
● One of the simplest possible structured formats for data.
● Strikes a delicate balance, remaining readable by both machines & humans.
Source: http://frictionlessdata.io/guides/csv/
● csvkit is a suite of command-line tools for converting to and working with CSV, the king of tabular file formats.
● It is inspired by pdftk, gdal and the original csvcut tool.
csvkit
~ fulcrum-live
$ wc -l Fire_Inspections.csv
213210 Fire_Inspections.csv
$ csvstat --count Fire_Inspections.csv
Row count: 213209
Row count
● wc (short for word count) is a command on Unix-like systems. wc -l prints the line count, including the header line.
● csvstat --count outputs the total row count, excluding the header, which is why it is one lower.
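The off-by-one between the two counts can be reproduced on a tiny example. This is a minimal Python sketch, not how wc or csvstat work internally; the three-row sample is made up and only stands in for Fire_Inspections.csv.

```python
import csv
import io

# Hypothetical sample standing in for Fire_Inspections.csv.
sample = "Inspection Number,Battalion\n220,B01\n221,B02\n222,B03\n"

# Like `wc -l`: count every newline-terminated line, header included.
line_count = sample.count("\n")

# Like `csvstat --count`: parse as CSV and count data rows only.
reader = csv.reader(io.StringIO(sample))
next(reader)  # skip the header row
row_count = sum(1 for _ in reader)

print(line_count, row_count)
```

The gap can be larger than one when quoted fields contain line breaks, since a line counter sees those as extra lines while a CSV parser does not.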
~ fulcrum-live
$ csvstat -n Fire_Inspections.csv
  1: Inspection Number
  2: Inspection Type
  3: Inspection Type Description
  4: Address
  5: Inspection Address Zipcode
  6: Battalion
  7: Station Area
  8: Fire Prevention District
  9: Billable Inspection
 10: Inspection Start Date
 11: Inspection End Date
 12: Inspection Status
 13: Return Date
 14: Corrective Action Date
 ...
csvstat -n, --names
Display column names and indices.
~ fulcrum-live
$ head -n 101 Fire_Inspections.csv > subset.csv
$ csvstat --count subset.csv
Row count: 100
Preview a subset
● head is a program on Unix-like systems that displays the beginning of a text file. Here, head -n 101 keeps the header line plus the first 100 rows.
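The same "header plus first N rows" idea can be sketched in Python with itertools.islice. The 500-row input below is invented for illustration; only the 101-line cutoff comes from the command above.

```python
import csv
import io
from itertools import islice

# Hypothetical data: a header line plus 500 numbered rows.
source = io.StringIO("id\n" + "".join(f"{i}\n" for i in range(500)))

# Equivalent of `head -n 101`: the header line plus the first 100 rows.
subset_lines = list(islice(source, 101))

rows = list(csv.reader(subset_lines))
header, data = rows[0], rows[1:]
print(len(data))
```

Like head, this cuts by physical lines, so it is only safe when no quoted field spans more than one line.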
Preview the data
This can also be done on the command line with csvlook.
Heads up!
We’ll need to split this column into separate columns for latitude and longitude.
~ fulcrum-live
$ csvstat Fire_Inspections.csv
1. "Inspection Number"
Type of data: Number
Contains null values: False
Unique values: 213209
Smallest value: 220
Largest value: 333,298
Sum: 29,621,444,707
Mean: 138,931.493
Median: 120,958
StDev: 89,204.887
Most common values: 234,121 (1x)
                    234,119 (1x)
csvstat
● Prints descriptive statistics for all columns in a CSV file.
● Will intelligently determine the type of each column and then print analysis.
~ fulcrum-live
9. "Billable Inspection"
Type of data: Boolean
Contains null values: False
Unique values: 2
Most common values: False (202888x)
                    True (10321x)
10. "Inspection Start Date"
Type of data: Date
Contains null values: False
Unique values: 4523
Smallest value: 2004-01-01
Largest value: 2017-07-26
Most common values: 2016-07-01 (661x)
csvstat
csvstat returning Boolean and Date types.
Gotcha!
When we imported our data subset, we discovered that the date format was not valid for import into Fulcrum.
Don’t worry, csvkit has got your back!
~ fulcrum-live
14. "Neighborhood District"
Type of data: Text
Contains null values: True
Unique values: 42
Longest value: 30 characters
Most common values: Financial District (2034x)
                    Tenderloin (727x)
                    South of Market (644x)
                    Mission (622x)
                    Nob Hill (525x)
csvstat
csvstat returning a Text type and listing the most common values.
~ fulcrum-live
$ csvsql --query 'SELECT DISTINCT "Neighborhood District" FROM "Fire_Inspections";' Fire_Inspections.csv
Mission
Lone Mountain/USF
Noe Valley
Haight Ashbury
Nob Hill
Lakeshore
Tenderloin
Russian Hill
Chinatown
Mission Bay
Financial District/South Beach
South of Market
csvsql
Run SQL queries directly on your CSV!
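To make the idea concrete, here is a rough Python sketch of querying CSV data with SQL: load the rows into an in-memory SQLite table, then run the SELECT DISTINCT. It illustrates the approach and is not csvsql's actual implementation. The four-line sample and the added ORDER BY are assumptions for the example.

```python
import csv
import io
import sqlite3

# Hypothetical sample standing in for Fire_Inspections.csv.
sample = io.StringIO(
    "Inspection Number,Neighborhood District\n"
    "1,Mission\n"
    "2,Tenderloin\n"
    "3,Mission\n"
)
reader = csv.reader(sample)
header = next(reader)

# Build a table whose columns mirror the CSV header, then load the rows.
conn = sqlite3.connect(":memory:")
columns = ", ".join(f'"{name}"' for name in header)
conn.execute(f'CREATE TABLE "Fire_Inspections" ({columns})')
placeholders = ", ".join("?" for _ in header)
conn.executemany(f'INSERT INTO "Fire_Inspections" VALUES ({placeholders})', reader)

# Same query as on the slide, with ORDER BY added for a stable result.
districts = [
    row[0]
    for row in conn.execute(
        'SELECT DISTINCT "Neighborhood District" FROM "Fire_Inspections" '
        'ORDER BY "Neighborhood District"'
    )
]
print(districts)
```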
~ fulcrum-live
$ csvcut -C 2,13-24,26-32 Fire_Inspections.csv > fire-inspections-subset.csv
$ csvsort -r -c 9 fire-inspections-subset.csv | head -n 10001 | in2csv -f csv > fire-inspections-subset-sorted.csv
Putting it all together
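Reading the pipeline step by step: csvcut -C excludes the listed columns, csvsort -r -c 9 sorts descending on column 9 of the subset, and head -n 10001 keeps the header plus 10,000 rows. Because original column 2 was dropped, column 9 of the subset appears to be the original column 10, Inspection Start Date, so the result should be the 10,000 most recent inspections. The Python sketch below mirrors those three steps on invented data; the column names and values are hypothetical.

```python
import csv
import io

# Hypothetical three-row sample; the real file has many more columns.
sample = io.StringIO(
    "id,type,start_date\n"
    "1,A,2016-07-01\n"
    "2,B,2017-07-26\n"
    "3,C,2004-01-01\n"
)
rows = list(csv.reader(sample))
header, data = rows[0], rows[1:]

drop = {1}               # zero-based index to exclude, like `csvcut -C 2`
sort_col = "start_date"  # column to sort on, like `csvsort -c`
limit = 2                # rows to keep after the header, like `head -n 3`

# Step 1: drop the unwanted columns.
keep = [i for i in range(len(header)) if i not in drop]
header = [header[i] for i in keep]
data = [[row[i] for i in keep] for row in data]

# Step 2: sort descending. ISO dates sort correctly as plain text.
key = header.index(sort_col)
data.sort(key=lambda row: row[key], reverse=True)

# Step 3: keep only the first `limit` rows.
top = data[:limit]
print(header, top)
```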
DB Browser for SQLite
csvkit could not do it all (for me), so we turn to DB Browser for SQLite for our last task.
Steps in DB Browser for SQLite:
● Create new database
● Import CSV
● Create new columns for latitude & longitude
● Execute UPDATE statement:
UPDATE "fire-inspections-subset-sorted"
SET lat = replace(substr(Location, 1, instr(Location, ', ') - 1), '(', ''),
    lon = replace(substr(Location, instr(Location, ', ') + 2), ')', '');
● Export CSV containing our lat & lon columns
● Import into Fulcrum
● Profit!
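The split can be tried without the GUI using Python's built-in sqlite3 module. This is a minimal sketch under a few assumptions: the table name inspections is hypothetical, lat and lon are declared REAL, and the only row is the sample value (37.751914, -122.421305). The offset + 2 skips both the comma and the space after it, so lon carries no leading space.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table with the Location text plus empty lat/lon columns.
conn.execute("CREATE TABLE inspections (Location TEXT, lat REAL, lon REAL)")
conn.execute(
    "INSERT INTO inspections (Location) VALUES ('(37.751914, -122.421305)')"
)

# lat: text before ', ' with '(' removed.
# lon: text after ', ' with ')' removed.
conn.execute(
    """
    UPDATE inspections
    SET lat = replace(substr(Location, 1, instr(Location, ', ') - 1), '(', ''),
        lon = replace(substr(Location, instr(Location, ', ') + 2), ')', '')
    """
)

lat, lon = conn.execute("SELECT lat, lon FROM inspections").fetchone()
print(lat, lon)
```

Rows where Location is empty or lacks the ', ' separator would need separate handling; this sketch does not cover them.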
Location column, sample value: (37.751914, -122.421305)
The new lat & lon columns look good!
Some things to note:
● csvstat and csvsql can be SLOW! csvkit is written in Python, so keep that in mind. If it’s too slow for a specific task, you’re probably better off pulling the data into SQLite for querying.
● pbcopy is a great little macOS tool that sends your command-line output to the clipboard.
- Instead of copying and pasting the results of a csvsql --query 'SELECT DISTINCT…' command from the terminal, you can pipe them to pbcopy and paste them directly into your Choice field list in the Fulcrum app builder.
- For example:
csvsql --query 'SELECT DISTINCT "Neighborhood District" FROM "Fire_Inspections";' Fire_Inspections.csv | pbcopy
Thank you very much for your time
But before we go... let’s enjoy the fruits of our labor and IMPORT THIS DATA!