Reporterslab.org: Presentation for computational journalism students, February 2012

Computational journalism projects


Presentation to Duke University computer science students, February 2012, by Sarah Cohen, Knight Professor of the Practice


Page 1: Computational journalism projects

Reporterslab.org

Presentation for computational journalism students

February 2012

Page 2: Computational journalism projects

STRUCTURED DATA

And most reporters’ inability to deal with it

Page 3: Computational journalism projects

New York Times reporters used Word searches and annotations to analyze WikiLeaks documents in 2010 and 2011.

Page 4: Computational journalism projects

PANDA project trying to help gather data inside newsrooms

Page 5: Computational journalism projects

Barriers to structured data analysis in the newsroom

• Expensive

• Too hard to collect

• It takes practice

• It takes patience

• Once collected, data has a short shelf life – its value inside the newsroom effectively ends once a story is published.

Page 6: Computational journalism projects

Web-scraping software: ephemeral or too expensive for a task not viewed as mission-critical.

Page 7: Computational journalism projects

Solutions

• User-friendly tool for scraping websites for structured data

• Packages of algorithms from fraud detection and other forensic fields for use with public records datasets online

• Packages of queries and statistical tests for money, dates, geographical identifiers, names and codes, presented in standard English

• Tools for fuzzy matching of datasets, including scoring, best-match likelihood and interactive machine learning for different datasets
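As a rough illustration of the fuzzy-matching idea with scoring, here is a minimal sketch using only Python's standard library. The record names, the `normalize` and `best_match` helpers and the 0.8 threshold are assumptions made for the example, not part of any Reporterslab.org tool.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(kept.split())


def best_match(name: str, candidates: list[str], threshold: float = 0.8):
    """Return (candidate, score) for the closest candidate.

    If the best score falls below the (assumed) threshold, return (None, score)
    so a reporter can review the pair by hand instead of trusting the match.
    """
    scored = [
        (cand, SequenceMatcher(None, normalize(name), normalize(cand)).ratio())
        for cand in candidates
    ]
    best, score = max(scored, key=lambda pair: pair[1])
    return (best, score) if score >= threshold else (None, score)


# Hypothetical records: a contractor name matched against a donor list.
donors = ["Acme Holdings, LLC", "Smith & Sons Construction", "Jane Q. Public"]
match, score = best_match("ACME Holdings LLC", donors)
```

Returning the score alongside the match lets a reporter sort candidate pairs and check the borderline ones first.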

Page 8: Computational journalism projects

TOO MUCH MATERIAL

With too little information

Page 9: Computational journalism projects

Too many sources with too little news

• Twitter, Facebook, LinkedIn and other social media

• RSS feeds from other news organizations and blogs

• Press releases from government agencies or beat subjects

Lack of archiving is just as troubling as the lack of structure. Reporters can’t hold the powerful accountable without information from the past.

Page 10: Computational journalism projects

Solutions

• Archiving users’ feeds locally or in the cloud

• Mash up social media and RSS feeds into an app that reveals more insight into the sources

• Formalize each reporter’s definition of “news” through machine learning

• Alerts for important source material. Example: changing time of a press conference.
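The press-conference example depends on archiving: an alert is only possible if an earlier snapshot of the source was kept. A minimal sketch, assuming two saved text snapshots of the same announcement (the snapshot text and the `changed_lines` helper are hypothetical):

```python
import difflib


def changed_lines(old_text: str, new_text: str) -> list[str]:
    """Return lines removed ('- ') or added ('+ ') between two snapshots."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    # ndiff also emits unchanged lines ('  ') and hint lines ('? '); skip those.
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Hypothetical archived and current versions of an agency announcement.
archived = "Press conference: Tuesday 10:00 a.m.\nLocation: City Hall"
current = "Press conference: Tuesday 2:30 p.m.\nLocation: City Hall"

alerts = changed_lines(archived, current)
if alerts:
    print("Source changed:")
    for line in alerts:
        print(" ", line)
```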

Page 11: Computational journalism projects

UNUSABLE RECORDS

The buried treasure

Page 12: Computational journalism projects
Page 13: Computational journalism projects

Solutions

• Visual extractor of data from scanned forms

• Separate scanned boxes of documents into their pieces for further analysis

• Use speech recognition tools on government audio and video

• OCR video to find the speaker at a hearing

Page 14: Computational journalism projects
Page 15: Computational journalism projects

ANTIQUATED METHODS

For unstructured data

Page 16: Computational journalism projects

Our way

• Hand-enter individual items into spreadsheets

• Transcribe interviews, hearings and other audio and video content for searching

• Read each document

A newer way

• Leverage web scraping and paid crowdsourcing, such as Mechanical Turk (MT), for data entry

• Use speech recognition for the first pass on searchable audio and video

• Use clustering, information extraction and other methods for overview of documents
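To make the clustering idea concrete, here is a minimal sketch that groups short documents by word overlap, as an alternative to reading each one. It uses a simple greedy pass with cosine similarity. The sample headlines, the helper names and the 0.5 threshold are assumptions for illustration; a real project would likely use a dedicated text-analysis library.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lowercase word tokens in a document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(docs: list[str], threshold: float = 0.5) -> list[list[int]]:
    """Greedy single pass: join the first cluster whose first document is similar enough."""
    vectors = [bag_of_words(d) for d in docs]
    clusters: list[list[int]] = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([i])
    return clusters


# Hypothetical documents; the result lists document indexes per group.
docs = [
    "city council approves school budget",
    "school budget approved by city council",
    "storm closes coastal highway",
]
groups = cluster(docs)
```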

Page 17: Computational journalism projects

Reporterslab.org working to tame audio and video

Page 18: Computational journalism projects

Associated Press project to bring order to unstructured data

Page 19: Computational journalism projects

Wordseer for historical text

Page 20: Computational journalism projects

Jigsaw

Page 21: Computational journalism projects
Page 22: Computational journalism projects

REPORTERSLAB.ORG

Creating sample data and documents for researchers based on real stories