DESCRIPTION
In 2009, SourceForge embarked on a quest to modernize our websites, converting a site written for a hodge-podge of relational databases in PHP to a MongoDB- and Python-powered site, with a small development team and a tight deadline. We have now completely rewritten both the consumer and producer parts of the site with better usability, more functionality, and better performance. This talk focuses on how we're using MongoDB, the pymongo driver, and Ming, an ORM-like library implemented at SourceForge, to continually improve and expand our offerings, with a special focus on how anyone can quickly become productive with Ming and pymongo without having to apologize for poor performance.
© 2011 Geeknet Inc
Rapid, Scalable Web Development with MongoDB, Ming, and Python
Rick Copeland (@rick446)
- NoSQL at SourceForge
- Rewriting Consume
- Introducing Ming
- Allura – Open-Sourcing Open Source
- Zarkov – MongoDB-based (near) real-time analytics
SF.net “BlackOps”: FossFor.us
- User Editable!
- Web 2.0! (ish)
- Not Ugly!
- FossFor.us used CouchDB (NoSQL)
  - "Just adding new fields was trivial, and was happening all the time" – Mark Ramm
- Scaling up to the level of SF.net needs research:
  - CouchDB
  - MongoDB
  - Tokyo Cabinet/Tyrant
  - Cassandra
  - ... and others
Moving to NoSQL
What we were looking for:
- Performance – how does a single node perform?
- Scalability – needs to support simple replication
- Ability to handle complex data and queries
- Ease of development
Rewriting “Consume”
- Most traffic on SF.net hits 3 types of pages:
  - Project Summary
  - File Browser
  - Download
- Pages are read-mostly, with infrequent updates from the "Develop" side of sf.net
- Original goal was 1 MongoDB document per project
- Later split out release data because some projects have lots of releases
- Periodic updates via RSS and AMQP from "Develop"
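The one-document-per-project design, with releases split out, can be sketched with plain Python dicts standing in for MongoDB documents. The collection shapes and field names here are illustrative assumptions, not SF.net's actual schema:

```python
# Illustrative document shapes for the "Consume" rewrite: one document
# per project, with releases split into their own collection so projects
# with many releases don't produce oversized project documents.
project_doc = {
    '_id': 'mongodb-tools',        # hypothetical project shortname
    'name': 'MongoDB Tools',
    'summary': 'Utilities for MongoDB',
}

# Releases reference the project by _id instead of being embedded.
release_docs = [
    {'project_id': 'mongodb-tools', 'version': '1.0',
     'files': ['tools-1.0.tar.gz']},
    {'project_id': 'mongodb-tools', 'version': '1.1',
     'files': ['tools-1.1.tar.gz']},
]

def releases_for(project_id, releases):
    """Stand-in for db.releases.find({'project_id': project_id})."""
    return [r for r in releases if r['project_id'] == project_id]

print(len(releases_for('mongodb-tools', release_docs)))  # 2
```

With this split, rendering a Project Summary page reads one small project document, and the File Browser / Download pages query only the releases they need.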
Deployment Architecture
[Diagram] A load balancer/proxy fronts four web servers, each running Apache (mod_wsgi / TG 2.0) alongside a local MongoDB slave; a master DB server hosts the MongoDB master, fed by a "Gobble" server pulling updates from Develop.
Deployment Architecture (revised)
[Diagram] The same load balancer/proxy and master DB server (MongoDB master, fed via the Gobble server from Develop), but the per-web-server MongoDB slaves are gone: the Apache (mod_wsgi / TG 2.0) web servers all talk to the single master. Scalability is good; single-node performance is good, too.
Ming – an "Object-Document Mapper?"
- Your data has a schema
  - Your database can define and enforce it
  - It can live in your application (as with MongoDB)
  - Nice to have the schema defined in one place in the code
- Sometimes you need a "migration"
  - Changing the structure/meaning of fields
  - Adding indexes, particularly unique indexes
  - Sometimes lazy, sometimes eager
- "Unit of work": queuing up all your updates can be handy
- Python dicts are nice; objects are nicer
Ming Concepts
- Inspired by SQLAlchemy
- Group of collection objects with schemas defined
- Group of classes to which you map your collections
- Use collection-level operations for performance
- Use class-level operations for abstraction
- Convenience methods for loading/saving objects and ensuring indexes are created
- Migrations
- Unit of Work – great for web applications
- MIM – "Mongo in Memory", nice for unit tests
Ming Example

from ming import schema, Field, collection
from ming.orm import (mapper, Mapper, RelationProperty,
                      ForeignIdProperty)

WikiDoc = collection('wiki_page', session,
    Field('_id', schema.ObjectId()),
    Field('title', str, index=True),
    Field('text', str))

CommentDoc = collection('comment', session,
    Field('_id', schema.ObjectId()),
    Field('page_id', schema.ObjectId(), index=True),
    Field('text', str))

class WikiPage(object): pass
class Comment(object): pass

ormsession.mapper(WikiPage, WikiDoc, properties=dict(
    comments=RelationProperty('WikiComment')))
ormsession.mapper(Comment, CommentDoc, properties=dict(
    page_id=ForeignIdProperty('WikiPage'),
    page=RelationProperty('WikiPage')))

Mapper.compile_all()
Python / MongoDB Taking Over…
Allura Architecture
[Diagram] Allura comprises a web-facing app server, a task daemon, an SMTP server, and a FUSE filesystem (repository hosting).
Allura Threaded Discussions
MessageDoc = collection('message', project_doc_session,
    Field('_id', str, if_missing=h.gen_message_id),
    Field('slug', str, if_missing=h.nonce),
    Field('full_slug', str),
    Field('parent_id', str),
    …)

- _id – use an email Message-ID compatible key
- slug – threaded path of random 4-digit hex numbers, each prefixed by its parent's slug (e.g. dead/beef/f00d is a child of dead/beef, which is a child of dead)
- full_slug – slug interspersed with ISO-formatted message datetime (20110627…dead/20110627…beef…)
- Easy queries for hierarchical data:
  - Find all descendants of a message – slug prefix search "dead/.*"
  - Sort messages by thread, then by date – full_slug sort
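The slug/full_slug scheme above can be sketched in pure Python. The helper names and the exact timestamp format are assumptions for illustration (Allura's real helpers live in its `h` module):

```python
import os
from datetime import datetime

def nonce():
    # Four random hex digits, as in the slide's dead/beef/f00d example.
    return os.urandom(2).hex()

def make_slugs(parent_slug, parent_full_slug, timestamp):
    """Build (slug, full_slug) for a message replying to parent_slug.

    Hypothetical sketch: slug chains random hex parts under the parent;
    full_slug prefixes each part with the message datetime so that a
    lexicographic sort yields thread order, then date order.
    """
    part = nonce()
    slug = '%s/%s' % (parent_slug, part) if parent_slug else part
    stamp = timestamp.strftime('%Y%m%d%H%M%S')
    full_part = '%s:%s' % (stamp, part)
    full_slug = ('%s/%s' % (parent_full_slug, full_part)
                 if parent_full_slug else full_part)
    return slug, full_slug

root = make_slugs(None, None, datetime(2011, 6, 27, 9, 0))
early = make_slugs(root[0], root[1], datetime(2011, 6, 27, 10, 0))
late = make_slugs(root[0], root[1], datetime(2011, 6, 27, 11, 0))

# Descendant query is a slug prefix match, e.g. {'slug': {'$regex': '^dead/'}}.
assert early[0].startswith(root[0] + '/')
# Sorting on full_slug puts the earlier reply first within the thread.
assert sorted([late[1], early[1]]) == [early[1], late[1]]
```

Both queries hit a simple index on the string field, which is why the flattened paths beat pointer-chasing through parent_id links.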
MonQ: Async Queueing in MongoDB

states = ('ready', 'busy', 'error', 'complete')
result_types = ('keep', 'forget')

MonQTaskDoc = collection('monq_task', main_doc_session,
    Field('_id', schema.ObjectId()),
    Field('state', schema.OneOf(*states)),
    Field('result_type', schema.OneOf(*result_types)),
    Field('time_queue', datetime),
    Field('time_start', datetime),
    Field('time_stop', datetime),
    # dotted path to function
    Field('task_name', str),
    # worker process name: "locks" the task
    Field('process', str),
    Field('context', dict(
        project_id=schema.ObjectId(),
        app_config_id=schema.ObjectId(),
        user_id=schema.ObjectId())),
    Field('args', list),
    Field('kwargs', {None: None}),
    Field('result', None, if_missing=None))
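A worker "locks" a task by atomically flipping its state and stamping its process name. A pure-Python sketch of that claim semantics (not MonQ's actual code; in pymongo this would be a single findAndModify-style call):

```python
def claim_task(tasks, worker_name):
    """Sketch of a MonQ-style claim: flip the first 'ready' task to
    'busy' and stamp it with the claiming worker's process name.

    Against MongoDB this must be one atomic operation, roughly:
      db.monq_task.find_one_and_update(
          {'state': 'ready'},
          {'$set': {'state': 'busy', 'process': worker_name}})
    Here `tasks` is a plain list of dicts standing in for the collection.
    """
    for task in tasks:
        if task['state'] == 'ready':
            task['state'] = 'busy'
            task['process'] = worker_name
            return task
    return None  # nothing to do; worker sleeps and polls again

queue = [
    {'_id': 1, 'state': 'complete', 'task_name': 'tasks.index', 'process': 'w0'},
    {'_id': 2, 'state': 'ready', 'task_name': 'tasks.notify', 'process': None},
]
claimed = claim_task(queue, 'worker-1')
assert claimed['_id'] == 2 and claimed['state'] == 'busy'
```

The atomicity of the update is what prevents two workers from claiming the same task; the `process` field doubles as the lock owner for diagnosing stuck tasks.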
Repository Cache Objects
- On commit to a repo (Hg, SVN, or Git):
  - Build commit graph in MongoDB for new commits
  - Build auxiliary structures:
    - tree structure, including all trees in a commit & last commit to modify
    - linear commit runs (useful for generating history)
    - commit difference summary (must be computed in Hg and Git)
  - Note references to other artifacts and commits
- Repo browser uses cached structure to serve pages
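Caching linear commit runs is what avoids pointer-chasing when rendering history. A minimal pure-Python sketch of the idea (not Allura's actual cache-building code), assuming each commit record carries its parent ids:

```python
def linear_runs(commits):
    """Group a commit graph into linear runs.

    `commits` maps commit id -> list of parent ids. A run extends while
    the current commit has exactly one parent and that parent has
    exactly one child; branch and merge points end a run.
    """
    # Count children so we can detect heads and branch points.
    children = {cid: 0 for cid in commits}
    for parents in commits.values():
        for p in parents:
            if p in children:
                children[p] += 1

    runs, seen = [], set()
    # Walk from each head (no children) down its linear chain of parents.
    for head in (cid for cid, n in children.items() if n == 0):
        run, cur = [], head
        while cur is not None and cur not in seen:
            run.append(cur)
            seen.add(cur)
            parents = commits[cur]
            nxt = parents[0] if len(parents) == 1 else None
            cur = nxt if (nxt in commits and children.get(nxt) == 1) else None
        runs.append(run)
    # Commits not reached above (e.g. a branch point with several
    # children) become single-commit runs in this simplified sketch.
    for cid in commits:
        if cid not in seen:
            runs.append([cid])
    return runs

# c3 -> c2 -> c1 is one linear run: one cached document answers
# "show me this file's history" without walking pointers.
history = {'c1': [], 'c2': ['c1'], 'c3': ['c2']}
assert linear_runs(history) == [['c3', 'c2', 'c1']]
```

Stored as documents, each run lets the repo browser fetch a whole stretch of history in one query instead of one query per commit.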
[Diagram] Cache objects: Commit, CommitRun, Tree, Trees, LastCommit, DiffInfo.
Repository Cache Lessons Learned
- Using MongoDB to represent graph structures (commit graph, commit trees) requires careful query planning. Pointer-chasing is no fun!
- Sometimes Ming validation and ORM overhead can be prohibitively expensive – time to drop down a layer.
- Benchmarking and profiling are your friends, as are queries like {'_id': {'$in': […]}} for returning multiple objects.
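The $in pattern replaces N point lookups with one batched query. A tiny sketch, with a plain-list stand-in for db.collection.find():

```python
def in_query(ids):
    """Build one batched query instead of len(ids) point lookups."""
    return {'_id': {'$in': list(ids)}}

def find(collection, spec):
    """Stand-in for db.collection.find() over a list of dicts."""
    allowed = set(spec['_id']['$in'])
    return [doc for doc in collection if doc['_id'] in allowed]

docs = [{'_id': i, 'n': i * i} for i in range(5)]
assert in_query([1, 3]) == {'_id': {'$in': [1, 3]}}
assert [d['n'] for d in find(docs, in_query([1, 3]))] == [1, 9]
```

One round trip amortizes network and query-planning overhead across the whole batch, which profiling tends to reveal as the dominant cost of chatty per-object loads.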
And now, for something completely different…
Business: we need more visibility into what users are doing
- Low overhead
- Near real-time
- Unified view of lots of systems
  - Python
  - PHP
  - Perl
Introducing Zarkov
- Asynchronous TCP server for event logging, built with gevent
- Turn OFF "safe" writes; turn OFF Ming validation (or do it in the client)
- Incrementally calculate aggregate stats based on the event log, using map-reduce with {'out': {'reduce': …}}
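With {'out': {'reduce': …}}, MongoDB re-reduces new map-reduce output against the values already in the output collection, which is what makes the aggregation incremental. A pure-Python sketch of that merge, assuming a re-reduce-safe count reducer:

```python
def reduce_counts(values):
    # Reduce function: sum event counts. Must be re-reduce safe, i.e.
    # reducing partial results gives the same answer as reducing all
    # raw values at once.
    return sum(values)

def merge_output(existing, new_batch):
    """Sketch of {'out': {'reduce': collection}}: keys already present
    in the output collection are re-reduced with the incoming value;
    new keys are inserted as-is."""
    for key, value in new_batch.items():
        if key in existing:
            existing[key] = reduce_counts([existing[key], value])
        else:
            existing[key] = value
    return existing

stats = {'page_view': 10}
stats = merge_output(stats, {'page_view': 4, 'download': 2})
assert stats == {'page_view': 14, 'download': 2}
```

Each Zarkov aggregation pass therefore only has to map-reduce the events logged since the last run, rather than the full event history.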
Lessons Learned at SourceForge
What We Liked
- Performance, performance, performance – easily handle 90% of SF.net traffic from 1 DB server, 4 web servers
- Dynamic schema allows fast schema evolution in development, making many migrations unnecessary
- Replication is easy, making scalability and backups easy
- Query language – you mean I can have performance without map-reduce?
- GridFS
Pitfalls
- Too-large documents
  - Store less per document
  - Return only a few fields
- Ignoring indexing
  - Watch your server log; bad queries show up there
- Too much denormalization
  - Try to use an index if all you need is a backref
  - Stale data is a tricky problem
- Using many databases when one will do
- Using too many queries
Open Source
- Ming – http://sf.net/projects/merciless/ – MIT License
- Allura – http://sf.net/p/allura/ – Apache License
- Zarkov – http://sf.net/p/zarkov/ – Apache License
Future Work
- mongos
- New Allura tools
- Migrating legacy SF.net projects to Allura
- Continue to optimize stats & analytics (Zarkov and others)
- Better APIs to access your project data