Presentation to ESCACC, Barcelona, 2010
Introduction
Paul Bradshaw
Data journalism
Ivy Lee
“Each weekday, my computer program goes to the Chicago Police Department's website and gathers all crimes reported in Chicago.”
Adrian Holovaty
Why?
• Great stories
• Engagement
• Targeting/relevance
“The Tribune’s biggest magnet by far has been its more than three dozen interactive databases, which collectively have drawn three times as many page views as the site’s stories.”
http://bit.ly/dj2dmz
Times film genres
Data Journalism Continuum
1. Finding data
What is data?
• Numbers
• Text
• Connections
• Live data
• Behavioural data
• Images, audio, video
Anything that a computer can work with
Passive vs active data journalism
• Start with the data and look for the stories? (MPs’ expenses)
• Or start with a lead and look for the data?
Some UK projects
• Data.gov.uk
• What Do They Know
• Openlylocal, Scraperwiki
• Disclosure logs
• RSS feeds, XML, structured data
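Disclosure logs and similar sources are often published as RSS or other XML, which a short script can read directly. The following is a minimal sketch using only Python's standard library; the feed text, titles and example.org links are invented for illustration and do not come from any real disclosure log.

```python
import xml.etree.ElementTree as ET

# Invented sample feed, standing in for a real disclosure-log RSS file.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example disclosure log</title>
    <item>
      <title>Parking fines issued, 2009</title>
      <link>http://example.org/foi/101</link>
    </item>
    <item>
      <title>Spending on consultants</title>
      <link>http://example.org/foi/102</link>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Return the title and link of every <item> in an RSS document."""
    root = ET.fromstring(rss_text)
    return [
        {"title": item.findtext("title"), "link": item.findtext("link")}
        for item in root.iter("item")
    ]


items = parse_items(SAMPLE_RSS)
for entry in items:
    print(entry["title"], "->", entry["link"])
```

Because the feed is structured, each release arrives already split into fields, with no copying and pasting from a web page.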
CAR (computer-assisted reporting)
Delicious.com/paulb/car
Advanced search by file type
• “Performance figures” filetype:pdf
• filetype:xls
• filetype:doc
• filetype:ppt
• filetype:rdf OR filetype:xml
Advanced search by domain
• “Disclosure logs” site:.gov.es
• Database site:.org.cat OR site:.org
• +Tables -chairs site:
• Health, police, military domains
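When the same search has to be repeated across many file types or domains, the query string can be assembled in code. This is a rough sketch: `build_query` is a hypothetical helper written for this example, and it only composes the `filetype:`, `site:` and minus operators into text to paste into a search box.

```python
from urllib.parse import quote_plus


def build_query(phrase, filetypes=(), site=None, exclude=()):
    """Compose a search string from a phrase plus filetype:, site: and minus operators."""
    parts = [f'"{phrase}"']
    if filetypes:
        parts.append(" OR ".join(f"filetype:{ft}" for ft in filetypes))
    parts.extend(f"-{word}" for word in exclude)
    if site:
        parts.append(f"site:{site}")
    return " ".join(parts)


query = build_query("disclosure logs", filetypes=["pdf", "xls"], site=".gov.es")
print(query)              # the text to type into the search box
print(quote_plus(query))  # the same text, escaped for use in a URL
```

Looping over a list of domains or file types with this helper gives a checklist of searches to run by hand.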
Use overseas sources
• US medicine databases
• EU subsidy databases
• Swedish people data
• International police agency correspondence
Scraping
• Scraping can automate & schedule the gathering process if there are multiple sources
• Tools: OutWit Hub plugin, Yahoo! Pipes, Scraperwiki, Google Spreadsheets formulae
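The tools named on this slide need little or no code, but the underlying idea can be sketched in a few lines of Python. The example below uses only the standard library and reads an invented HTML table held in a string; a real scraper would first download the page and would need to respect the site's terms of use.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


# Invented sample page, standing in for a downloaded web page.
SAMPLE_HTML = """
<table>
  <tr><th>Date</th><th>Offence</th></tr>
  <tr><td>2010-03-01</td><td>Burglary</td></tr>
  <tr><td>2010-03-02</td><td>Theft</td></tr>
</table>
"""

scraper = TableScraper()
scraper.feed(SAMPLE_HTML)
header, *records = scraper.rows
print(header)
print(records)
```

Run on a schedule against several pages, the same parser turns repeated manual copying into one automated step.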
Interrogating data
• Humans collect data
• Humans enter data
• Human error
Time spent now...
• Different words for the same thing
• Double spaces, punctuation
• Wrong data type
• Mistyped
• Duplicate entries
• Default entries (1/1/00)
...Saves time later
"Because we take the time to clean the data, we are able to do lobbying stories no other news organisation can do."
David Donald, Center for Public Integrity
Cleaning methods
• Group by term, then sort to see duplications
• Find & replace double spaces, etc.
• Select column/row & check data type
• Sort to find unusually large/small values, and neighbouring misspellings
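The same checks can be scripted once a dataset is too large to eyeball in a spreadsheet. This is a small sketch under stated assumptions: the rows, the field names and the helpers `normalise` and `is_number` are made up for illustration, and `1/1/00` is treated as the only default date.

```python
import re
from collections import Counter

DEFAULT_DATES = {"1/1/00"}  # placeholder value mentioned on the earlier slide


def normalise(text):
    """Collapse repeated spaces, drop trailing punctuation and lowercase."""
    text = re.sub(r"\s+", " ", text).strip()
    return text.rstrip(".,;").lower()


def is_number(value):
    """Check that a cell really holds a number (the data type check)."""
    try:
        float(value)
        return True
    except ValueError:
        return False


# Invented rows showing double spaces, inconsistent names and a default date.
rows = [
    {"name": "Acme  Ltd.", "amount": "1200", "date": "14/2/10"},
    {"name": "acme ltd", "amount": "1200", "date": "14/2/10"},
    {"name": "Bolt & Co", "amount": "n/a", "date": "1/1/00"},
]

# Group by the normalised term and count, to surface duplications.
counts = Counter(normalise(r["name"]) for r in rows)
duplicates = sorted(name for name, n in counts.items() if n > 1)

bad_amounts = [r["name"] for r in rows if not is_number(r["amount"])]
default_dates = [r["name"] for r in rows if r["date"] in DEFAULT_DATES]

print("Possible duplicates:", duplicates)
print("Non-numeric amounts:", bad_amounts)
print("Default dates:", default_dates)
```

The script only flags suspect rows; deciding whether two entries really are the same organisation is still a job for the journalist.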
Never publish a name from data without running a background check
Check.
Other tools
Freebase Gridworks: see http://vimeo.com/10081183
Visualising data
or http://chartchooser.juiceanalytics.com/
[Chart examples, labelled by what each shows: (trends, dips, correlations); (comparison, themes); (proportions, comparison)]
Mashing data
Combining data sources
• Geocoded data with map
  - Live data (e.g. Twitter API)
  - Static data (e.g. Google Docs)
  - Dynamic data (e.g. Google Form)
• 2 spreadsheets with common data
  - Tools: MySQL, Access, etc.
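Combining two spreadsheets that share a column is, in database terms, a join. The sketch below uses SQLite from Python's standard library as a stand-in for MySQL or Access; the tables, column names and figures are invented for illustration.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE schools (area_code TEXT, school TEXT, pupils INTEGER);
    CREATE TABLE areas (area_code TEXT, area_name TEXT);
    INSERT INTO schools VALUES
        ('A1', 'North Primary', 210),
        ('A2', 'South High', 900),
        ('A1', 'Hill Academy', 640);
    INSERT INTO areas VALUES ('A1', 'Northside'), ('A2', 'Southside');
""")

# Join the two tables on the shared area_code, then summarise per area.
rows = conn.execute("""
    SELECT a.area_name, COUNT(*) AS schools, SUM(s.pupils) AS pupils
    FROM schools AS s
    JOIN areas AS a ON a.area_code = s.area_code
    GROUP BY a.area_name
    ORDER BY a.area_name
""").fetchall()

for area_name, school_count, pupil_total in rows:
    print(area_name, school_count, pupil_total)
```

The join only works if the shared column is clean: a stray space or a differently formatted code in either spreadsheet silently drops rows.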
Combining data sources
• Twittermap
• Wikipedia map
• NYT Property
• Guardian vs Nature
• BBC Most Read
• BBC Olympic Village
What mashes well?
• Big events (protests, Olympics, inauguration)
• Comparisons
• Geocoded data
• Connections
Yahoo! Pipes
• Aggregates
• Maps
• Filters
• Counts
• Cleans or reformats (regex)
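Yahoo! Pipes does this through a visual editor, but the same chain of steps can be sketched in plain Python to show what each stage does. The two feeds and their headlines are invented, and the mapping step is left out because it needs location data.

```python
import re
from collections import Counter
from itertools import chain

# Two invented feeds, standing in for real RSS sources.
feed_a = [
    {"source": "council", "title": "Budget   2010: cuts announced"},
    {"source": "council", "title": "Road closures this weekend"},
]
feed_b = [
    {"source": "police", "title": "BUDGET 2010 - policing impact"},
]


def clean(title):
    """Collapse repeated whitespace with a regex and lowercase the title."""
    return re.sub(r"\s+", " ", title).strip().lower()


# Aggregate: treat both feeds as one stream.
aggregated = chain(feed_a, feed_b)
# Clean: tidy each title without altering the original items.
cleaned = ({**item, "title": clean(item["title"])} for item in aggregated)
# Filter: keep only items that mention the budget.
filtered = [item for item in cleaned if re.search(r"\bbudget\b", item["title"])]
# Count: how many matching items came from each source.
counts = Counter(item["source"] for item in filtered)

for item in filtered:
    print(item["source"], "|", item["title"])
print(dict(counts))
```

Each stage takes the output of the one before it, which is the same mental model as wiring modules together in the Pipes editor.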
Other tools
• Scraperwiki – mapping library
• Maptube – combine maps
• Google Docs – publish in different formats
• +++
Semantic web & linked data
• Computer-readable data
• Paris – France, Texas, or Hilton?
• Unique identifiers – usually URI
• RDF, RDFa, XML, etc.
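The "Paris" problem can be shown with a handful of subject, property and value statements, which is roughly how RDF models data. This is a toy sketch rather than real RDF: the example.org URIs, property names and helper functions are all invented for illustration.

```python
# Each statement is (subject URI, property, value). The URIs are made up.
TRIPLES = [
    ("http://example.org/id/paris-france", "label", "Paris"),
    ("http://example.org/id/paris-france", "type", "City"),
    ("http://example.org/id/paris-france", "country", "France"),
    ("http://example.org/id/paris-texas", "label", "Paris"),
    ("http://example.org/id/paris-texas", "type", "City"),
    ("http://example.org/id/paris-texas", "country", "United States"),
    ("http://example.org/id/paris-hilton", "label", "Paris Hilton"),
    ("http://example.org/id/paris-hilton", "type", "Person"),
]


def find_by_label(text):
    """Return every URI whose label contains the text: the ambiguous search."""
    return sorted(
        {s for s, p, o in TRIPLES if p == "label" and text.lower() in o.lower()}
    )


def describe(uri):
    """Return everything stated about one unambiguous identifier."""
    return {p: o for s, p, o in TRIPLES if s == uri}


matches = find_by_label("Paris")
print(len(matches), "things match the word 'Paris'")
for uri in matches:
    print(uri, describe(uri))
```

A search on the word returns three different things, while a lookup on the URI returns exactly one, which is the point of giving each entity a unique identifier.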
API
• Application Programming Interface
• Build on top of data
• Google Maps, Twitter, Facebook, Digg, Guardian, NYT, NPR, They Work For You, etc.
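Most web APIs follow the same pattern: send a URL with query parameters, get structured data (often JSON) back. The sketch below shows both halves without touching the network. The endpoint, parameter names, API key and response are hypothetical and do not describe any of the services listed above.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://api.example.org/search"  # hypothetical endpoint


def build_request(query, api_key, page=1):
    """Assemble the URL a client would request from the API."""
    params = {"q": query, "page": page, "api-key": api_key}
    return f"{BASE_URL}?{urlencode(params)}"


# Invented response, shaped like the JSON many news APIs return.
SAMPLE_RESPONSE = """
{"results": [
    {"title": "Expenses claims published", "section": "Politics"},
    {"title": "Crime figures by ward", "section": "Local"}
]}
"""

url = build_request("data journalism", api_key="DEMO")
titles = [item["title"] for item in json.loads(SAMPLE_RESPONSE)["results"]]

print(url)
print(titles)
```

Once the response is a list of dictionaries, it can be filtered, counted or mapped like any other dataset.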
Q&A
• Slideshare.net/onlinejournalist
• Twitter.com/paulbradshaw
Bookmarks
• Delicious.com/paulb/datajournalism
• Delicious.com/paulb/visualisation
• Delicious.com/paulb/statistics