Crowdsourcing with Django · EuroPython, 30th June 2009 · Simon Willison · http://simonwillison.net/ · @simonw

Crowdsourcing with Django

A talk presented at EuroPython on 30th June 2009.

Page 1: Crowdsourcing with Django

Crowdsourcing with Django
EuroPython, 30th June 2009

Simon Willison · http://simonwillison.net/ · @simonw

Page 2: Crowdsourcing with Django

“Web development on journalism deadlines”

Page 3: Crowdsourcing with Django

The back story...

Page 4: Crowdsourcing with Django

November 2000: The Freedom of Information Act

Page 6: Crowdsourcing with Django

2004: The request

Page 7: Crowdsourcing with Django

January 2005: The FOI request

Page 8: Crowdsourcing with Django

July 2006: The FOI commissioner

Page 9: Crowdsourcing with Django

May 2007: The FOI (Amendment) Bill

Page 10: Crowdsourcing with Django

February 2008: The Information Tribunal

Page 11: Crowdsourcing with Django

“Transparency will damage democracy”

Page 12: Crowdsourcing with Django

May 2008: The high court

Page 13: Crowdsourcing with Django

January 2009: The exemption law

Page 16: Crowdsourcing with Django

March 2009: The mole

Page 17: Crowdsourcing with Django

“All of the receipts of 650-odd MPs, redacted and unredacted, are for sale at a price of £300,000, so I am told. The price is going up because of the interest in the subject.”

Sir Stuart Bell, MP
Newsnight, 30th March

Page 18: Crowdsourcing with Django

8th May, 2009: The Daily Telegraph

Page 19: Crowdsourcing with Django

At the Guardian...

Page 20: Crowdsourcing with Django

April: “Expenses are due out in a couple of months, is there anything we can do?”

Page 21: Crowdsourcing with Django

June: “Expenses have been bumped forward, they’re out next week!”

Page 22: Crowdsourcing with Django

Thursday 11th June: The proof-of-concept

Page 23: Crowdsourcing with Django

Monday 15th June: The tentative go-ahead

Page 24: Crowdsourcing with Django

Tuesday 16th June: Designer + client-side engineer

Page 25: Crowdsourcing with Django

Wednesday 17th June: Operations engineer

Page 26: Crowdsourcing with Django

Thursday 18th June: Launch day!

Page 33: Crowdsourcing with Django

How we built it

Page 36: Crowdsourcing with Django

$ convert Frank_Comm.pdf pages.png
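
The command above is ImageMagick's convert, which writes one image per page when its input is a multi-page PDF. A minimal sketch of wrapping that step in Python follows. The helper names, the output directory and the explicit %03d numbering pattern are assumptions for illustration, not the script used on the day.

```python
import subprocess
from pathlib import Path


def build_convert_command(pdf_path, out_dir):
    """Build the ImageMagick argv for splitting one PDF into page images.

    The "-%03d" suffix asks convert for zero-padded page numbers
    (name-000.png, name-001.png, ...). Hypothetical naming scheme.
    """
    pdf_path = Path(pdf_path)
    pattern = Path(out_dir) / (pdf_path.stem + "-%03d.png")
    return ["convert", str(pdf_path), str(pattern)]


def convert_pdf(pdf_path, out_dir):
    """Run the conversion; requires ImageMagick to be installed."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_convert_command(pdf_path, out_dir), check=True)
```

Keeping the command construction separate from the subprocess call makes the naming scheme easy to check without ImageMagick present.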

Page 38: Crowdsourcing with Django

Models

Page 39: Crowdsourcing with Django

class Party(models.Model):
    name = models.CharField(max_length=100)

class Constituency(models.Model):
    name = models.CharField(max_length=100)

class MP(models.Model):
    name = models.CharField(max_length=100)
    party = models.ForeignKey(Party)
    constituency = models.ForeignKey(Constituency)
    guardian_url = models.CharField(max_length=255, blank=True)
    guardian_image_url = models.CharField(max_length=255, blank=True)

Page 40: Crowdsourcing with Django

class FinancialYear(models.Model):
    name = models.CharField(max_length=10)

class Document(models.Model):
    title = models.CharField(max_length=100, blank=True)
    filename = models.CharField(max_length=100)
    mp = models.ForeignKey(MP)
    financial_year = models.ForeignKey(FinancialYear)

class Page(models.Model):
    document = models.ForeignKey(Document)
    page_number = models.IntegerField()

Page 41: Crowdsourcing with Django

class User(models.Model):
    created = models.DateTimeField(auto_now_add = True)
    username = models.TextField(max_length = 100)
    password_hash = models.CharField(max_length = 128, blank=True)

class LineItemCategory(models.Model):
    order = models.IntegerField(default = 0)
    name = models.CharField(max_length = 32)

class LineItem(models.Model):
    user = models.ForeignKey(User)
    page = models.ForeignKey(Page)
    type = models.CharField(max_length = 16, choices = (
        ('claim', 'claim'),
        ('proof', 'proof'),
    ), db_index = True)
    date = models.DateField(null = True, blank = True)
    amount = models.DecimalField(max_digits=20, decimal_places=2)
    description = models.CharField(max_length = 255, blank = True)
    created = models.DateTimeField(auto_now_add = True, db_index = True)
    categories = models.ManyToManyField(LineItemCategory, blank=True)

Page 42: Crowdsourcing with Django

class Vote(models.Model):
    user = models.ForeignKey(User, related_name = 'votes')
    page = models.ForeignKey(Page, related_name = 'votes')
    obsolete = models.BooleanField(default = False)
    vote_type = models.CharField(max_length = 32, blank = True)
    ip_address = models.CharField(max_length = 32)
    created = models.DateTimeField(auto_now_add = True)

class TypeVote(Vote):
    type = models.CharField(max_length = 10, choices = (
        ('claim', 'Claim'),
        ('proof', 'Proof'),
        ('blank', 'Blank'),
        ('other', 'Other')
    ))

class InterestingVote(Vote):
    status = models.CharField(max_length = 10, choices = (
        ('no', 'Not interesting'),
        ('yes', 'Interesting'),
        ('known', 'Interesting but known'),
        ('very', 'Investigate this!'),
    ))

Page 43: Crowdsourcing with Django

Frictionless registration

Page 45: Crowdsourcing with Django

Page filters

Page 47: Crowdsourcing with Django

page_filters = (
    # Maps name of filter to dictionary of kwargs to doc.pages.filter()
    ('reviewed', { 'votes__isnull': False }),
    ('unreviewed', { 'votes__isnull': True }),
    ('with line items', { 'line_items__isnull': False }),
    ('interesting', { 'votes__interestingvote__status': 'yes' }),
    ('interesting but known', { 'votes__interestingvote__status': 'known'...
)
page_filters_lookup = dict(page_filters)

Page 48: Crowdsourcing with Django

pages = doc.pages.all()
if page_filter:
    kwargs = page_filters_lookup.get(page_filter)
    if kwargs is None:
        # Call form of raise works on both Python 2 and Python 3
        raise Http404('Invalid page filter: %s' % page_filter)
    pages = pages.filter(**kwargs).distinct()

# Build the filters
filters = []
for name, kwargs in page_filters:
    filters.append({
        'name': name,
        'count': doc.pages.filter(**kwargs).distinct().count(),
    })
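
The same pattern can be tried without a database: an ordered tuple drives the list of filters shown to the user, and a dict built from it validates the filter name taken from the request. The sketch below is an illustration only. It swaps the ORM keyword arguments for plain predicate functions over dictionaries, and InvalidFilter, filter_pages and filter_counts are hypothetical names standing in for Http404 and the view code.

```python
# Ordered (name, predicate) pairs; the order is the display order.
page_filters = (
    ("reviewed", lambda page: bool(page["votes"])),
    ("unreviewed", lambda page: not page["votes"]),
    ("with line items", lambda page: bool(page["line_items"])),
)
page_filters_lookup = dict(page_filters)


class InvalidFilter(Exception):
    """Stand-in for Http404 in this framework-free sketch."""


def filter_pages(pages, page_filter=None):
    """Return the pages matching the named filter, or all pages if none given."""
    if page_filter:
        predicate = page_filters_lookup.get(page_filter)
        if predicate is None:
            raise InvalidFilter("Invalid page filter: %s" % page_filter)
        pages = [page for page in pages if predicate(page)]
    return pages


def filter_counts(pages):
    """Build the name/count list for every filter, in display order."""
    return [
        {"name": name, "count": sum(1 for page in pages if predicate(page))}
        for name, predicate in page_filters
    ]
```

Because unknown names are rejected by the lookup, nothing from the URL ever reaches the query layer unchecked.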

Page 49: Crowdsourcing with Django

Matching names

Page 51: Crowdsourcing with Django

On the day

Page 55: Crowdsourcing with Django

def get_mp_pages():
    "Returns list of (mp-name, mp-page-url) tuples"
    soup = Soup(urllib.urlopen(INDEX_URL))
    mp_links = []
    for link in soup.findAll('a'):
        if link.get('title', '').endswith("'s allowances"):
            mp_links.append(
                (link['title'].replace("'s allowances", ''), link['href'])
            )
    return mp_links

Page 56: Crowdsourcing with Django

def get_pdfs(mp_url):
    "Returns list of (description, years, pdf-url, size) tuples"
    soup = Soup(urllib.urlopen(mp_url))
    pdfs = []
    trs = soup.findAll('tr')[1:]  # Skip the first, it's the table header
    for tr in trs:
        name_td, year_td, pdf_td = tr.findAll('td')
        name = name_td.string
        year = year_td.string
        pdf_url = pdf_td.find('a')['href']
        size = pdf_td.find('a').contents[-1].replace('(', '').replace(')', '')
        pdfs.append(
            (name, year, pdf_url, size)
        )
    return pdfs
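
The scrapers above target Python 2 and the BeautifulSoup API of the time. For readers who want to try the link-matching idea from get_mp_pages today, here is a rough standard-library sketch using html.parser. The sample markup, the class name and the function name are invented for illustration, and the real index page's structure is not reproduced here.

```python
from html.parser import HTMLParser

SUFFIX = "'s allowances"


class AllowanceLinkParser(HTMLParser):
    """Collect (mp-name, href) pairs from links titled "<name>'s allowances"."""

    def __init__(self):
        super().__init__()
        self.mp_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        title = attr_map.get("title") or ""
        href = attr_map.get("href")
        if title.endswith(SUFFIX) and href:
            # Strip only the trailing suffix to recover the MP's name.
            self.mp_links.append((title[: -len(SUFFIX)], href))


def get_mp_pages_from_html(html):
    """Return a list of (mp-name, mp-page-url) tuples found in the markup."""
    parser = AllowanceLinkParser()
    parser.feed(html)
    return parser.mp_links


# Hypothetical markup, shaped only to exercise the matching rule.
SAMPLE_HTML = """
<ul>
  <li><a href="/mps/example-one/" title="Example One's allowances">Example One</a></li>
  <li><a href="/about/" title="About this site">About</a></li>
  <li><a href="/mps/example-two/" title="Example Two's allowances">Example Two</a></li>
</ul>
"""
```

Parsing a string rather than fetching a URL keeps the sketch testable offline; fetching the page would be a separate step.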

Page 60: Crowdsourcing with Django

“Drop Everything”

Page 61: Crowdsourcing with Django

Photoshop + AppleScript vs. Java + IntelliJ

Page 62: Crowdsourcing with Django

Images on our docroot (S3 upload was taking too long)

Page 63: Crowdsourcing with Django

Blitz QA

Page 64: Crowdsourcing with Django

Launch! (on EC2)

Page 66: Crowdsourcing with Django

Crash #1: more Apache children than MySQL connections

Page 69: Crowdsourcing with Django

unreviewed_count = Page.objects.filter(
    votes__isnull = True
).distinct().count()

Page 70: Crowdsourcing with Django

SELECT COUNT(DISTINCT `expenses_page`.`id`)
FROM `expenses_page`
LEFT OUTER JOIN `expenses_vote` ON (
    `expenses_page`.`id` = `expenses_vote`.`page_id`
)
WHERE `expenses_vote`.`id` IS NULL

Page 71: Crowdsourcing with Django

unreviewed_count = cache.get('homepage:unreviewed_count')
if unreviewed_count is None:
    unreviewed_count = Page.objects.filter(
        votes__isnull = True
    ).distinct().count()
    # Same key as the get() above, cached for 60 seconds
    cache.set('homepage:unreviewed_count', unreviewed_count, 60)

Page 72: Crowdsourcing with Django

• With 70,000 pages and a LOT of votes...

• DB takes up 135% of CPU

• Cache the count in memcached...

• DB drops to 35% of CPU
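
The caching step above is the usual get-or-compute pattern: look the value up, and only on a miss run the expensive query and store the result with a short expiry. The following is a framework-free sketch of that idea. TTLCache is a toy in-memory stand-in for memcached and cached_count is a hypothetical helper; neither is part of Django or of the original code.

```python
import time


class TTLCache:
    """Tiny in-memory cache with per-key expiry (a stand-in for memcached)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Expired: drop the entry and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key, value, timeout):
        self._store[key] = (value, self._clock() + timeout)


def cached_count(cache, key, compute, timeout=60):
    """Return the cached value, running compute() only on a miss."""
    value = cache.get(key)
    if value is None:
        value = compute()
        cache.set(key, value, timeout)
    return value
```

Testing for None rather than truthiness matters here: a count of 0 is a valid cached value and should not trigger the query again.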

Page 73: Crowdsourcing with Django

unreviewed_count = Page.objects.filter(
    votes__isnull = True
).distinct().count()

reviewed_count = Page.objects.filter(
    votes__isnull = False
).distinct().count()

Page 74: Crowdsourcing with Django

unreviewed_count = Page.objects.filter( is_reviewed = False ).count()
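
The is_reviewed query above replaces a join against the votes table with a check on a single column of the page itself. The slides do not show how that flag was kept up to date, so the sketch below is an assumption: it sets the flag at the moment a vote is recorded, using plain dataclasses in place of the Django models. record_vote and the simplified Page class are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Simplified stand-in for the Page model, with the denormalised flag."""
    page_number: int
    is_reviewed: bool = False
    votes: list = field(default_factory=list)


def record_vote(page, vote):
    """Store the vote and update the flag in the same step (assumed approach)."""
    page.votes.append(vote)
    page.is_reviewed = True


def unreviewed_count(pages):
    """Count pages by flag alone, without inspecting any votes."""
    return sum(1 for page in pages if not page.is_reviewed)
```

The trade-off is one extra write per vote in exchange for a count that needs no join and no DISTINCT.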

Page 75: Crowdsourcing with Django

Migrating to InnoDB on a separate server

Page 76: Crowdsourcing with Django

ssh mps-live "mysqldump mp_expenses" |
  sed 's/ENGINE=MyISAM/ENGINE=InnoDB/g' |
  sed 's/CHARSET=latin1/CHARSET=utf8/g' |
  ssh mysql-big "mysql -u root mp_expenses"

Page 77: Crowdsourcing with Django

“next” button

Page 78: Crowdsourcing with Django

def next_global(request):
    # Next unreviewed page from the whole site
    all_unreviewed_pages = Page.objects.filter(
        is_reviewed = False
    ).order_by('?')
    if all_unreviewed_pages:
        return Redirect(
            all_unreviewed_pages[0].get_absolute_url()
        )
    else:
        return HttpResponse(
            'All pages have been reviewed!'
        )

Page 79: Crowdsourcing with Django

import random

def next_global_from_cache(request):
    page_ids = cache.get('unreviewed_page_ids')
    if page_ids:
        return Redirect(
            '/page/%s/' % random.choice(page_ids)
        )
    else:
        return next_global(request)

Page 80: Crowdsourcing with Django

from django.core.management.base import BaseCommand
from mp_expenses.expenses.models import Page
from django.core.cache import cache

class Command(BaseCommand):
    help = """
    populate unreviewed_page_ids in memcached
    """
    requires_model_validation = True
    can_import_settings = True

    def handle(self, *args, **options):
        ids = list(Page.objects.exclude(
            is_reviewed = True
        ).values_list('pk', flat=True)[:1000])
        cache.set('unreviewed_page_ids', ids)
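
Taken together, the command and next_global_from_cache move the random choice out of the database: a periodic job stores a bounded pool of unreviewed ids, and each request picks from that pool in Python, falling back to the slow query only when the pool is missing or empty. A minimal framework-free sketch of that division of labour follows; build_pool and pick_next_page_id are hypothetical names, and how often the pool was refreshed is not stated in the slides.

```python
import random

POOL_SIZE = 1000  # mirrors the [:1000] slice in the command above


def build_pool(unreviewed_ids, limit=POOL_SIZE):
    """Return at most `limit` ids, as the periodic job would store them."""
    return list(unreviewed_ids)[:limit]


def pick_next_page_id(pool, fallback, rng=random):
    """Pick a random id from the cached pool, or defer to the slow path."""
    if pool:
        return rng.choice(pool)
    return fallback()
```

One consequence of this design is that the pool can hold ids that have been reviewed since the last refresh, so an occasional redirect may land on an already-reviewed page.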

Page 81: Crowdsourcing with Django

The numbers

Page 83: Crowdsourcing with Django

Final thoughts

• High score tables help

• MP photographs really help

• Keeping up the interest is hard

• Next step: start releasing the data