Presentation to Duke University computer science students, February 2012, by Sarah Cohen, Knight Professor of the Practice
Reporterslab.org
Presentation for computational journalism students
February 2012
STRUCTURED DATA: And most reporters’ inability to deal with it
New York Times reporters used Word searches and annotations to analyze Wikileaks documents in 2010 and 2011.
PANDA project trying to help gather data inside newsrooms
Barriers to structured data analysis in the newsroom
• Expensive
• Too hard to collect
• It takes practice
• It takes patience
• Once collected, data has a short shelf life: its value inside the newsroom effectively ends once a story is published.
Web-scraping software: ephemeral or too expensive for a task not viewed as mission-critical.
Solutions
• User-friendly tool for scraping websites for structured data
• Packages of algorithms from fraud and other forensic fields for use with public records datasets online.
• Packages of queries and statistical tests for money, dates, geographical identifiers, names and codes, presented in standard English
• Tools for fuzzy matching of datasets, including scoring, best-match likelihood, and interactive machine learning for different datasets.
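As a rough illustration of the fuzzy-matching idea in the last bullet, here is a minimal sketch using only Python's standard library. The function names (`normalize`, `score`, `best_matches`) and the 0.8 threshold are assumptions made for this example, not part of any tool named in the slides, and a real matcher would need field-specific rules and human review of borderline scores.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, replace punctuation with spaces, and sort the tokens
    so that "Smith, John" and "John Smith" compare as equal."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(sorted(cleaned.split()))


def score(a, b):
    """Similarity between 0.0 and 1.0 for two names after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_matches(left, right, threshold=0.8):
    """For each name in `left`, report the highest-scoring name in `right`.

    Returns (name, best candidate or None, score) tuples. The candidate is
    None when the best score falls below the (assumed) threshold.
    `right` is assumed to be non-empty.
    """
    results = []
    for name in left:
        best_score, best = max((score(name, cand), cand) for cand in right)
        match = best if best_score >= threshold else None
        results.append((name, match, round(best_score, 2)))
    return results


donors = ["Smith, John A."]
contractors = ["John A. Smith", "Acme Corp"]
print(best_matches(donors, contractors))
```

Reporting the score alongside each match, rather than a bare yes or no, lets a reporter sort the results and check the uncertain middle by hand.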
TOO MUCH MATERIAL: With too little information
Too many sources with too little news
• Twitter, Facebook, LinkedIn and other social media
• RSS feeds from other news organizations and blogs
• Press releases from government agencies or beat subjects
Lack of archiving is just as troubling as the lack of structure. Reporters can’t hold the powerful accountable without information from the past.
Solutions
• Archive users’ feeds locally or in the cloud
• Mash up social media and RSS feeds into an app that reveals more insight into the sources
• Formalize each reporter’s definition of “news” through machine learning
• Send alerts for important source material. Example: a change in the time of a press conference.
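The archiving and alert ideas fit together: once yesterday's feed is stored, a changed press-conference time is just a difference between two snapshots. A minimal sketch follows, assuming each snapshot is a dictionary keyed by an item ID with `time` and `location` fields. That data shape and the name `detect_changes` are hypothetical, chosen only for illustration.

```python
def detect_changes(archived, current, watched_fields=("time", "location")):
    """Compare an archived feed snapshot with the current one.

    Both arguments map an item ID to a dict of fields (an assumed layout).
    Returns (item_id, field, old_value, new_value) for every watched field
    whose value changed. Items absent from the archive are skipped, since
    they are new rather than changed.
    """
    alerts = []
    for item_id, new in current.items():
        old = archived.get(item_id)
        if old is None:
            continue
        for field in watched_fields:
            if old.get(field) != new.get(field):
                alerts.append((item_id, field, old.get(field), new.get(field)))
    return alerts


yesterday = {"mayor-presser": {"time": "10:00", "location": "City Hall"}}
today = {
    "mayor-presser": {"time": "16:30", "location": "City Hall"},
    "budget-release": {"time": "09:00", "location": "Online"},
}
print(detect_changes(yesterday, today))
# [('mayor-presser', 'time', '10:00', '16:30')]
```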
UNUSABLE RECORDS: The buried treasure
Solutions
• Visual extractor of data from scanned forms
• Separate scanned boxes of documents into their pieces for further analysis
• Use speech recognition tools on government audio and video
• OCR video to find the speaker at a hearing
ANTIQUATED METHODS: For unstructured data
Our way
• Hand-enter individual items into spreadsheets
• Transcribe interviews, hearings and other audio and video content for searching
• Read each document
A newer way
• Use web scraping and paid crowdsourcing (MT, presumably Mechanical Turk) for data entry
• Use speech recognition for the first pass on searchable audio and video
• Use clustering, information extraction and other methods for overview of documents
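To make the clustering bullet concrete, here is a deliberately simple sketch: documents become word-count vectors, and each one joins the first existing group whose seed document is similar enough by cosine similarity. This greedy grouping is a stand-in chosen for brevity, not the method used by any project named in these slides. The function names and the 0.3 threshold are assumptions.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words vector: lowercase alphabetic tokens mapped to counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def group_documents(docs, threshold=0.3):
    """Greedy grouping: each document joins the first group whose seed
    (first member) is at least `threshold` similar, else starts a new group.
    Returns lists of document indices."""
    vectors = [vectorize(d) for d in docs]
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


docs = [
    "budget vote city council budget",
    "city council budget hearing",
    "hurricane storm damage coast",
]
print(group_documents(docs))
# [[0, 1], [2]]
```

Even a crude grouping like this gives a first overview of a document pile, so a reporter can read one item per group before deciding where to dig.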
Reporterslab.org working to tame audio and video
Associated Press project to bring order to unstructured data
Wordseer for historical text
Jigsaw
REPORTERSLAB.ORG
Creating sample data and documents for researchers based on real stories