NoSql at guardian.co.uk, by Matthew Wall and Simon Willison


Presentation given at NoSql EU conference describing architectures past, present & future for guardian.co.uk


Page 1: NoSql presentation

NoSql at guardian.co.uk

Matthew Wall

Simon Willison

Page 2: NoSql presentation
Page 3: NoSql presentation

!

Page 4: NoSql presentation

SQL

Page 5: NoSql presentation
Page 6: NoSql presentation
Page 7: NoSql presentation
Page 8: NoSql presentation

Not only SQL

Page 9: NoSql presentation

Guardian journalism online: 1995

Page 10: NoSql presentation

Guardian journalism online: 1999

Page 11: NoSql presentation

Guardian journalism online: 2000

Page 12: NoSql presentation

Guardian journalism online: 2010

Page 13: NoSql presentation

Read all about it!

Page 14: NoSql presentation

I bring you NEWS!!!

App server App server App server

Web server Web server Web server

CMS Data feeds

Oracle

Memcached (20Gb)

Page 15: NoSql presentation

I bring you NEWS!!!

App server App server App server

Web server Web server Web server

CMS Data feeds

Oracle

Memcached

Why RDBMS?

5 years ago, fewer alternatives

Understand operations procedures

Can easily recruit DBAs / devs

Developer/ops tools

Business critical system: a safe choice

Page 16: NoSql presentation
Page 17: NoSql presentation
Page 18: NoSql presentation
Page 19: NoSql presentation
Page 20: NoSql presentation

Related content from search engine

Page 21: NoSql presentation

Introduction of memcached

Related content from search engine

Page 22: NoSql presentation

Introduction of memcached

Big traffic spike

Related content from search engine

Page 23: NoSql presentation

Distributed memcached

Protects database from peak load

Entities explicitly decached

Queries given TTL

memcached = database supercharger
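A minimal sketch of the two caching rules on this slide, assuming a cache-aside layout. `TTLCache` is a hypothetical in-process stand-in for memcached, and the key names, the `db` dict and the 60-second TTL are illustrative assumptions, not details from the talk. Entities are cached without expiry and removed explicitly when they change; query results are cached with a TTL and simply allowed to go stale.

```python
import time


class TTLCache:
    """Hypothetical in-process stand-in for a memcached client."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value, ttl=None):
        expires_at = None if ttl is None else self._clock() + ttl
        self._data[key] = (value, expires_at)

    def delete(self, key):
        self._data.pop(key, None)


def load_article(cache, db, article_id):
    """Entity read: cached with no TTL, so it must be decached on update."""
    key = f"article:{article_id}"
    article = cache.get(key)
    if article is None:
        article = db[article_id]  # only cache misses reach the database
        cache.set(key, article)
    return article


def save_article(cache, db, article_id, article):
    """Entity write: update the database, then explicitly decache."""
    db[article_id] = article
    cache.delete(f"article:{article_id}")


def latest_headlines(cache, db, ttl=60):
    """Query read: cached with a TTL instead of being invalidated."""
    key = "query:latest_headlines"
    result = cache.get(key)
    if result is None:
        result = sorted(a["headline"] for a in db.values())
        cache.set(key, result, ttl=ttl)
    return result
```

The trade-off in this sketch is that entity reads are never stale, while query results may lag behind the database by up to the TTL.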

Page 24: NoSql presentation

Now we have a stable “broadcast” platform

We know how to scale it

SQL running effectively at core

We’ve finished, right?

Page 25: NoSql presentation

Digital journalism is changing

We can’t cover everything

We can’t compete with everyone

Need to be “part of the web” not just “on the web”

Page 26: NoSql presentation

Mutualise the news!

Page 27: NoSql presentation

Mutualised news!

Mutualisation of journalism

No longer only broadcasting content

User engagement & contribution: journalism, data, software

Data curation / linked data

Support engaged developers with data and APIs

Page 28: NoSql presentation

Mutualised news!

Be a part of the data fabric of the internet

Page 29: NoSql presentation

Mutualised news!

Platform strategy

Out: Release our data to the world via APIs

In: Rapidly build new functionality outside the core

Write: Ingest, store & present arbitrary data

Page 30: NoSql presentation

Mutualised news!

Data Out

Content API

Page 31: NoSql presentation

Mutualised news!

Content API

Delivered using Apache Solr

Document oriented search engine

Loose schema: records, fields, facets

Fields can be multi-value

Supports dynamic field generation

Can apply multiple facets in queries faster than RDBMS
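A rough sketch of what a faceted Solr query can look like, assuming Solr's standard `select` handler and its `q`, `fq`, `facet`, `facet.field`, `rows` and `wt` request parameters. The base URL and the field names `section` and `tag` are hypothetical, and this is not the Content API's actual request format.

```python
from urllib.parse import urlencode


def build_faceted_query(base_url, text, filters, facet_fields, rows=10):
    """Build a Solr select URL with filter queries and facet counts."""
    params = [
        ("q", text),        # full-text query
        ("rows", rows),     # number of documents to return
        ("wt", "json"),     # response format
        ("facet", "true"),  # switch faceting on
    ]
    # Each filter narrows the result set, much like a WHERE clause.
    for field, value in filters.items():
        params.append(("fq", f'{field}:"{value}"'))
    # Each facet field asks Solr for value counts over the matching documents.
    for field in facet_fields:
        params.append(("facet.field", field))
    return f"{base_url}/select?{urlencode(params)}"


url = build_faceted_query(
    "http://localhost:8983/solr",  # hypothetical Solr location
    "election",
    {"section": "politics"},       # hypothetical field and value
    ["tag", "section"],            # hypothetical facet fields
)
```

Because `fq` and `facet.field` can be repeated, several filters and several facet breakdowns fit into a single request.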

Page 32: NoSql presentation

Mutualised news!

Page 33: NoSql presentation

Mutualised news!

Page 34: NoSql presentation

Mutualised news!

Page 35: NoSql presentation

Mutualised news!

Is Solr a database?

Page 36: NoSql presentation

Mutualised news!

Can perform complex queries, including full text search

Can filter results with facets (WHERE clause)

ANYTHING can be a facet. Very powerful.

On our dataset most queries are of a similar cost

Scales very well horizontally

Handles millions of documents

Page 37: NoSql presentation

Mutualised news!

No transactions

Excellent for certain types of queries

Not truly general purpose

Schema design very important

Search index not really persistence

Page 38: NoSql presentation

App server

Web servers

CMS

Memcached (20Gb)

Solr

Core

Solr

Solr

Solr

Solr

Solr

Cloud, EC2

M/Q

Api

rdbms

Page 39: NoSql presentation

Mutualised news!

API

Currently powering iPad app

Site components

External applications

Editors' tools

More to follow

Page 40: NoSql presentation

Mutualised news!

Data In

Application framework

Page 41: NoSql presentation

Mutualised news!

Application framework

Simple REST/HTTP framework allows lightweight development

Applications proxied for performance

Apps generally hosted in the cloud, hot deployment into production

No RDBMS provided for storage

Can develop in news timeline

Page 42: NoSql presentation

App server

Web servers

CMS

Memcached (20Gb)

Core

M/Q

App

App

App

App

App

App

Apps

Proxy

external hosting, app engine etc.

rdbms

Page 43: NoSql presentation

NoSQL for journalism

Page 44: NoSql presentation

Some useful characteristics

• Scale down as well as up

• Support rapid production-ready prototyping: turn projects around in hours or days

• Handle massive traffic spikes

Page 45: NoSql presentation

Desktop analysis

• Leaked BNP membership list

• Load postcodes to constituencies mapping into Redis

• Generate heatmaps by looking up all 12,000 postcodes
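The lookup described above can be sketched roughly as follows. A plain dict stands in for Redis (each assignment corresponds to a Redis `SET`, each lookup to a `GET`); the key prefix, the `normalise` helper and the sample data are hypothetical, not taken from the original analysis.

```python
from collections import Counter


def normalise(postcode):
    """Strip spaces and upper-case so 'ab1 2cd' and 'AB12CD' match."""
    return postcode.replace(" ", "").upper()


def load_mapping(store, rows):
    """Load (postcode, constituency) pairs into the key-value store."""
    for postcode, constituency in rows:
        store[f"postcode:{normalise(postcode)}"] = constituency  # Redis: SET


def count_by_constituency(store, postcodes):
    """Look up every postcode and count hits per constituency."""
    counts = Counter()
    for postcode in postcodes:
        constituency = store.get(f"postcode:{normalise(postcode)}")  # Redis: GET
        if constituency is not None:  # skip postcodes missing from the mapping
            counts[constituency] += 1
    return counts


# Made-up sample data, for illustration only.
store = {}
load_mapping(store, [("AB1 2CD", "Constituency A"), ("EF3 4GH", "Constituency B")])
counts = count_by_constituency(store, ["ab1 2cd", "AB12CD", "EF3 4GH", "ZZ9 9ZZ"])
```

The per-constituency counts are the values a heatmap would then be coloured by.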

Page 46: NoSql presentation

MPs’ expenses

Page 47: NoSql presentation

MPs’ expenses

SELECT * FROM pages WHERE is_reviewed = 0 ORDER BY RAND()

Page 48: NoSql presentation

v2 used Redis

Page 49: NoSql presentation

v2 used Redis

Set difference: Labour MP pages - reviewed pages

SRANDMEMBER
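A small sketch of the idea, using Python sets as a stand-in for Redis sets. In Redis itself the difference could be computed with `SDIFF` or `SDIFFSTORE` and the random pick made with `SRANDMEMBER`; the function name and page identifiers below are hypothetical.

```python
import random


def next_page_to_review(candidate_pages, reviewed_pages, rng=random):
    """Pick a random page that has not been reviewed yet, or None."""
    remaining = candidate_pages - reviewed_pages  # Redis: SDIFF / SDIFFSTORE
    if not remaining:
        return None
    # Redis: SRANDMEMBER on the difference set. Sorting first only makes the
    # pick reproducible when a seeded rng is passed in.
    return rng.choice(sorted(remaining))


labour_mp_pages = {"page:1", "page:2", "page:3"}  # made-up identifiers
reviewed = {"page:1"}
pick = next_page_to_review(labour_mp_pages, reviewed)
```

Unlike `ORDER BY RAND()`, which typically has to order every unreviewed row before returning one, this only needs set membership and a single random pick.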

Page 50: NoSql presentation

BigTable: Zeitgeist

Page 51: NoSql presentation

Zeitgeist stores pre-calculated results in BigTable

• Data comes in from stats system, comments system and OneRiot real-time search API

• AppEngine cron tasks populate task queues

• Task queues recalculate hotness levels

• “Live” BigTable queries are simple SELECT / SORT

Page 52: NoSql presentation

Live debate poll

• Over a million votes cast in an hour

• Stretched limits of BigTable / AppEngine

• Sharded counter pattern to handle writes
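A minimal in-memory sketch of the sharded counter pattern. It assumes that on App Engine each shard would be a separate datastore entity, so concurrent writes are spread out instead of contending on a single record; the class name and shard count are illustrative, not details of the poll's implementation.

```python
import random


class ShardedCounter:
    """Spread increments over several shards; sum the shards to read."""

    def __init__(self, num_shards=20):
        # On App Engine each shard would be its own datastore entity.
        self.shards = [0] * num_shards

    def increment(self, rng=random):
        # Writing to a randomly chosen shard avoids a single hot record.
        index = rng.randrange(len(self.shards))
        self.shards[index] += 1

    def total(self):
        # Reads cost more: every shard has to be fetched and summed.
        return sum(self.shards)


votes = ShardedCounter(num_shards=4)
for _ in range(1000):
    votes.increment()
```

The pattern trades cheaper, less contended writes for more expensive reads, which suits a poll where votes vastly outnumber result lookups.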

Page 53: NoSql presentation

Spreadsheets are NoSQL too...

Page 54: NoSql presentation

Google Docs powered infographics

Page 55: NoSql presentation

The Datablog

Page 56: NoSql presentation

• Datablog was launched with no development involvement at all - it’s a blog, and a bunch of Google Docs Spreadsheets

• Retrieve data as CSV, XLS, JSON, Atom...

• “Make a copy” and run your own analysis

Page 57: NoSql presentation

Mutualised news!

Write

Arbitrary data

Page 58: NoSql presentation

Mutualised news!

Create schema free database alongside RDBMS

Index in Solr

Provide access in API

Investigating: CouchDB

Page 59: NoSql presentation

App server

Web servers

CMS Data feeds

Memcached (20Gb)

Solr

Core

Solr

Solr

Solr

Solr

Solr

Cloud, EC2

M/Q

Out

App

App

App

App

App

App

In

Proxy

external hosting, app engine etc.

CouchDB?

rdbms