
An insider's view on how to build & maintain high traffic websites


Page 1

HOW TO BUILD & MAINTAIN HIGH TRAFFIC WEBSITES

AN INSIDER’S VIEW

25 AUG 2011

Page 2

OVERVIEW

‣ Company history / Overview

‣ Hardware level

‣ Software - Technologies

‣ Cases: “2-way matching strategy” & “Statistics”

‣ Release workflow

‣ Remarks / Q&A

Page 3

COMPANY HISTORY

Page 4

OVERVIEW

‣ 2000 Launch of Asl.to

‣ 2002 Rebranding to Redbox

‣ 2005 Fulltime focus of founders + first hires

‣ 2006 Rebrand to Netlog.com

‣ 2007 Pan-European rollout / Investment of Index Ventures

‣ 2009 Launch of the Arabic version of Netlog

‣ 2010 - 2011 Creation of parent holding Massive//Media, launch of new products/sites

Page 5

WHAT’S NETLOG?

‣ An online community where you can meet new people, create your own profile, share photos, videos, blogs, ..

Page 6

[Charts] AGE SPLIT: brackets 12-16, 17-20, 21-24, 25-28, 29-32, 33-36, 37-40 and 40+, with shares between 4% and 24%. GENDER SPLIT: Female and Male, split 51% / 49%.

Page 7

42 LANGUAGES

82 MILLION MEMBERS

30 MILLION MONTHLY UNIQUES

Page 8

MASSIVE//MEDIA

Page 9

TWOO.COM

Dating

‣ Play games

‣ Search for dates

‣ Chat & date

Page 10

EKKO.COM

Next generation share/chat platform

‣ Localized

‣ Friend focus

‣ Realtime experience

‣ Built upon public API

Page 11

GATCHA.COM

Gaming

‣ Game platform

‣ Social & multiplayer games

‣ Support for multiple devices

Page 12

KEZOO.COM

Daily deals

‣ Watch a trailer / Answer a question & win prizes

Page 13

ABOUT ME

‣ Core development / Lead developer Netlog.com

‣ Lead framework team at Massive//Media

‣ Technical lead at Twoo.com

Page 14

HARDWARE LEVEL

Page 15

SOME NUMBERS

‣ 620 Web servers

‣ 240 Database servers

‣ 150 Memcache servers

‣ 400 Servers for other purposes (cron, mail, deploy, Sphinx, load balancer, Hadoop development)

‣ + #### “servers” around the world for loading of static content => CDN networks

Page 16

SERVER SUPPORT

‣ Load balancing

‣ HAProxy, good support for failover

‣ Server management

‣ Dedicated system people in each team

‣ Configuration/setup through Puppet

‣ Monitoring: Zabbix / Prowl on mobile

‣ Standby persons for overnight support

‣ Backups through Bacula

Page 17

EXAMPLE HAPROXY

‣ <insert example haproxy config here>

Page 18

EXAMPLE PUPPET

File: services/s_sphinx/manifests

class s_sphinx::twoo::server inherits s_sphinx::server {
    File["/etc/sphinxsearch/sphinx.make.twoo.conf"] {
        source => "puppet:///s_sphinx/twoo/sphinx.make.twoo.conf"
    }
    File["/etc/sphinxsearch/sphinx.make.dev.conf"] {
        source => "puppet:///s_sphinx/twoo/sphinx.make.dev.conf"
    }
}

‣ Manifests & files

‣ Manifests are written in Puppet’s Ruby-based DSL

‣ Contain the locations where files need to be put, what services need to run, etc.

‣ Inheritance can be used

Page 19

CLOUD / AWS/ ..?

‣ Experience with cloud services:

‣ Works well up to a certain scale

‣ We have dedicated people in house to do server management

‣ Costs were much higher

‣ No overview of the “real” hardware

‣ “Fixed” kernel builds triggered errors

‣ No easy point of contact during downtime

Page 20

GENERAL REMARKS

‣ Split up hardware among products (if one goes down.. )

‣ Split up across data centers

‣ Invest in good hardware => hardware is cheap compared to programming hours

‣ Try to split up DB/web/cache machines at an early stage

‣ Avoid manual changes

‣ Buy hardware designed for its purpose (memcache => lots of RAM; DB => lots of CPU, fast disks, ...)

Page 21

SOFTWARE & TECHNOLOGIES

Page 22

SOFTWARE

‣ Languages: PHP/ Python/ C++

‣ Ubuntu

‣ Apache/ Nginx with php-fpm

‣ Evolution of version control systems

‣ CVS => SVN => GIT

‣ Main benefits of using GIT vs. SVN:

‣ Commits go fast

‣ Easy to set up branches

‣ Merging is less of a hassle

‣ Easier to use in a release workflow

‣ Takes less space on the server (only line changes are kept)

Page 23

APPLICATION SETUP

‣ New products (Twoo, Ekko, Kezoo, ..) code:

‣ <framework library>

‣ <application code> compatible with <framework> & following defined rules

‣ Netlog code:

‣ <framework library> (fixed port)

‣ <application code> compatible with framework (some parts have grown historically => a rewrite takes time)

Page 24

FRAMEWORK SETUP

‣ Inhouse written framework

‣ MVC pattern

‣ PEAR standard (all classes prefixed by “Core_“)

‣ Well documented & unit tested

‣ Framework structure:

‣ Io: DB, Memcache, Redis/File, ...

‣ Services: Search, Image, Upload, Mail, Geo, ...

‣ Util: HttpRequest, Validator, ...

‣ Bundle: Permissions, LocationSelector, DebugBar, ...

‣ Controller

‣ View

Page 25

APPLICATION CODE

‣ Defined rules

‣ Codewise (Syntax/ Documentation/ ...)

‣ But also rules programming-wise:

‣ No code freewheeling when new people start working on projects

‣ Avoids code duplication

‣ Better when maintaining code

‣ Easier to know where to start when switching between projects

‣ Less error prone

Page 26

APPLICATION CODE

‣ (Web) Application code philosophy (1)

‣ Keep it simple

‣ Code should not represent how:

‣ the database is structured

‣ a view is built up => templates are used for that

‣ Code does represent the logic:

‣ what data is put into / pulled from the data layer (through queries, ...)

‣ which data is assigned to a view (template)

‣ what operations are done on the data before assigning

Page 27

APPLICATION CODE

‣ (Web) Application code philosophy (2)

‣ In web/HTTP-centric stateless applications a “model” is never rightfully used

‣ There is no sense in replicating a storage model as long as it is query-able

‣ Code should be readable in a very “pseudo code” way in a central place => the Controller; there it is easy to anticipate logical errors while maintaining code

‣ Avoid deep inheritance

‣ Avoid dynamic queries (dynamic joins, ...)

‣ Functions need to be simple & clear; make them do 1 thing

‣ Put the (distribution to function) logic in the controller

‣ Use type hinting for function arguments & document return types

Page 28

APPLICATION CODE

‣ Application layout (a sketch in code follows below)

‣ Object: contains everything representing real objects (UserLocation, Query, ...)

‣ Structure: classes without functions, used to pass data around & for type enforcing

‣ Library: toolkit functions that don’t invoke datastore operations; methods are called statically; only project specific code => the rest should be on framework level (TextParsingLibrary, PopularityLibrary, ...)

‣ Procedure: classes that access datastores, called statically; used to organize functions in classes, f.e. UserProcedure::register(); make use of structures in params or return types

‣ Controller: contains the real controlling functions, representing the entrance paths and reflecting the overall logic (User, Settings, Profile, ...)

‣ Exception: representation of the exception classes used within the application
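
A minimal PHP sketch of how these layers could fit together; the names (RegistrationData, UserProcedure::register()'s body, UserController) are hypothetical, chosen only to mirror the layout above:

<?php
// Structure: a class without functions, used to pass data around & enforce types.
class RegistrationData
{
    public $username;
    public $email;
}

// Procedure: accesses the datastore, called statically.
class UserProcedure
{
    public static function register(RegistrationData $data)
    {
        // the INSERT INTO users ... lives here, and only here
        return 123; // the new user id
    }
}

// Controller: the entrance path, readable like pseudo code, no SQL.
class UserController
{
    public function registerAction(RegistrationData $data)
    {
        $userId = UserProcedure::register($data);
        // assign $userId to the view (template) here
    }
}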

Page 29

TECHNOLOGIES

Page 30

MYSQL

‣ The inevitable “make my db scale” problem

‣ How we handled read load initially: fragmentation through languages and clustering parts of the application (Messages, Friends, Blogs, Photos, ...)

Page 31

MYSQL

master r/w

master r/w

slave r

slave r

master r/w

slave r

slave r

master r/w

slave r

slave r2

3 4

5

1master

r/w

search r

music r

messages r

master r/w

search r

music r

messages r

fragment by languagefragment by language &

type of usage

en

fr

en

fr

...

...

...

maste

searc

music

messag

master r/w

search r

music r

messages r

master r/w

search r

music r

messages r

master r/w

search r

music r

messages r

error 1040 too many

connections

error 1040 too many connections

Page 32

MYSQL

‣ Solution: partition data horizontally, through sharding

‣ Formula: <userid> % <number of shards> == the shardid where we store the data of the user (see the sketch below)

‣ Inhouse developed

‣ Positive about the system:

‣ Scale out easily

‣ Alters/ Selects/ Inserts go fast (small datasets)

‣ A failure (could) only affect x % of users

‣ High growth potential in size

‣ Helped us to grow in crucial periods
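
A sketch of the routing this formula implies, in PHP; the shard count, the host naming scheme and the credentials are assumptions for illustration:

<?php
define('NUM_SHARDS', 64); // assumed number of shards

// <userid> % <number of shards> == shardid
function getShardId($userid)
{
    return $userid % NUM_SHARDS;
}

function getShardDb($userid)
{
    $shardid = getShardId($userid);
    // hypothetical mapping of a shardid to a database host
    $host = "shard{$shardid}.db.internal";
    return new PDO("mysql:host={$host};dbname=shard{$shardid}", 'appuser', 'secret');
}

// every query for this user goes to his/her own shard
$db = getShardDb(123456);
$stmt = $db->prepare('SELECT * FROM photos WHERE userid = ?');
$stmt->execute(array(123456));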

Page 33

MYSQL

‣ Sharding drawbacks

‣ Lookup for the shardid is a point of failure

‣ Lose overview of all the data

‣ Adds complexity to the code

‣ Maintenance costs++

‣ Harder to join data

‣ Not always perfectly balanced (users with lots of data/activity)

Page 34

MYSQL

‣ Looking towards the future

‣ Better/more mature alternatives are already available now:

‣ MySQL Cluster

‣ The project matured a lot; back when we created our sharding system there was:

‣ no user-defined partitioning available

‣ a need to take the cluster offline for alters

‣ Clustrix

‣ Commercial appliance/ runs with top-level hardware

‣ Can be accessed/used as a normal DB (master/master replication with normal MySQL, f.e.)

‣ Horizontal scaling built in

‣ Custom written optimizer (rewrites selects, ...)

‣ Very fast write speed (100K inserts/sec in benchmarks)

‣ Running in a test phase on one of our sites & good results so far

‣ MongoDB cluster (document oriented)/ full switch to NoSQL solutions

‣ Requires a lot of changes in application code conceptually

Page 35

MEMCACHE

‣ In-memory key/value store

‣ Important to know

‣ Always supply a TTL to your keys, no infinite timeouts (in the long run memcache will become slower => more calculation, ...)

‣ Cache wisely => single place in your application (no cache of a cache etc.)

‣ Don’t overuse caching => DBs are still pretty fast when they can use indexes

‣ Add caching at a late stage, only cache things where needed + try to look at the cache efficiency of certain keys (every cache adds a layer of complexity)

‣ If needed, use “cache revision numbers” when creating caches (a fuller sketch follows below)

‣ f.e. a user deletes a photo => bump the revision number:

$cacherev = $cache->incr("photos<$userid>");

// somewhere else in the code:
$cache->get("photos<$userid><$cacherev>");
$cache->get("publicphotos<$userid><$cacherev>");

‣ Use the memcached extension, not the memcache extension
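
A fuller sketch of the “cache revision number” pattern with PHP’s Memcached extension; the key names and TTLs are illustrative assumptions:

<?php
$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211); // assumed memcache host

function photosCacheKey(Memcached $cache, $userid)
{
    // fetch the current revision for this user's photo caches
    $rev = $cache->get("photosrev<$userid>");
    if ($rev === false) {
        $rev = 1;
        $cache->set("photosrev<$userid>", $rev, 86400); // TTL of one day, never infinite
    }
    return "photos<$userid><$rev>";
}

// read path: the key embeds the revision, so stale entries are simply never hit again
$photos = $cache->get(photosCacheKey($cache, 42));

// invalidation path, f.e. the user deletes a photo => bump the revision
$cache->increment('photosrev<42>');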

Page 36

SEARCH - SPHINX

‣ Full-text search server, comparable to Solr; written in C++

‣ Why Sphinx rather than Solr:

‣ our hardware is a better fit to run Sphinx

‣ we have more inhouse knowledge

‣ full support on the Sphinx side (custom requests, etc.)

‣ How does it work?

‣ Creates an inverted index of your data/words => searches/lookups are very fast

‣ f.e.

document 1: “I work at Netlog.”
document 2: “Netlog is a Belgian company”

word => keywordid: Netlog => 1, Company => 2
keywordid => documentid: 1 => documents 1, 2; 2 => document 2

‣ Currently used for:

‣ Netlog people search

‣ Two-way matching searches on Twoo

‣ Realtime results on Ekko

A PHP query sketch follows below.
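
A minimal sketch of querying Sphinx from PHP with the classic SphinxClient API; the searchd host/port and the “people” index name are assumptions:

<?php
require_once 'sphinxapi.php'; // the PHP client that ships with Sphinx

$cl = new SphinxClient();
$cl->SetServer('localhost', 9312); // assumed searchd host/port
$cl->SetMatchMode(SPH_MATCH_ALL);  // all words must match

$result = $cl->Query('Netlog', 'people'); // hypothetical index name
if ($result !== false && !empty($result['matches'])) {
    foreach ($result['matches'] as $docid => $match) {
        echo "matched document $docid\n";
    }
}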

Page 37

BACKGROUNDJOB

‣ ParallelProcessor

‣ Inhouse built

‣ Used for realtime tasks => on-the-fly fetching of friends of friends, ...

‣ Splits up an array in slices to divide them among x number of threads

‣ 1) opens connections to x number of sockets (hostname: “cloud” ~ loadbalanced)

‣ 2) sends data to the sockets

‣ 3) client uses stream_select (select() system call) to check which socket has data available

‣ 4) client combines the results (can close sockets early when the timeout is reached ~ some cases don’t need 100% of the data; see the sketch below)
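
A rough PHP sketch of that fan-out/fan-in loop; the port, the wire format (one JSON line per batch) and the timeout are assumptions for illustration:

<?php
$userids = range(1, 1000);             // e.g. ids of friends of friends to fetch
$batches = array_chunk($userids, 250); // 4 slices => 4 "threads"

$sockets = array();
foreach ($batches as $i => $batch) {
    // "cloud" is the load-balanced hostname from the slides; the port is assumed
    $s = stream_socket_client('tcp://cloud:9999', $errno, $errstr, 1);
    fwrite($s, json_encode($batch) . "\n");
    $sockets[$i] = $s;
}

$results  = array();
$deadline = microtime(true) + 2.0; // overall timeout
while ($sockets && microtime(true) < $deadline) {
    $read = $sockets; $write = $except = null;
    // select() tells us which sockets have data available
    if (stream_select($read, $write, $except, 0, 200000) > 0) {
        foreach ($read as $i => $s) { // stream_select preserves the array keys
            $results[$i] = json_decode(fgets($s), true);
            fclose($s);
            unset($sockets[$i]);
        }
    }
}
// whatever is left here timed out: close early & continue with partial results
foreach ($sockets as $s) {
    fclose($s);
}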

Page 38

BACKGROUNDJOB

‣ ParallelProcessor

[Diagram: the client splits the work into batch1, batch2, batch3, batch4, ... and sends one batch to each server; every server does its tasks on its batch & returns the results as soon as it is finished; the client processes the returned results when they are finished or when the <timeout> is reached.]

Page 39

BACKGROUNDJOB

‣ Gearman

‣ Job scheduler used for all things that can happen at a later moment / in the background (see the sketch below)

‣ 1 central job server handles all the queueing of jobs

‣ Other servers have workers running that are connected to the job server, waiting for a potential job in a loop

‣ 1) client sends a job to the job server

‣ 2) job server checks if there is a worker free to do the job

‣ If so: dispatches the job to the worker

‣ If not: adds the job to a queue & pushes it to the next worker whenever a job is done

‣ We also use this in our cron code

‣ f.e. tasks that need to run minutely / hourly etc. => all placed in separate functions; each function is queued to Gearman when a minute/hour has passed ==> all jobs run on time & no worries that some code can’t be executed because a script fails/dies in the middle
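
A minimal client/worker pair with PHP’s pecl gearman extension, as a sketch; the gearmand address and the “resize_photo” function name are assumptions:

<?php
// client.php: fire-and-forget a background job
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730); // assumed gearmand host/port
// b) continue immediately; use doNormal() instead to a) wait for the return value
$client->doBackground('resize_photo', json_encode(array('photoid' => 42)));

// worker.php: runs in a loop on the worker machines
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('resize_photo', function (GearmanJob $job) {
    $params = json_decode($job->workload(), true);
    // ... do the actual work here ...
    return 'done';
});
while ($worker->work()); // waits for a potential job in a loop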

Page 40

BACKGROUNDJOB

‣ Gearman

[Diagram: client => Gearmand => worker threads]

‣ client: sends a job to Gearmand; can a) wait for the return value or b) continue immediately

‣ Gearmand: central role in scheduling the jobs; keeps a queue in memory & sends out tasks to worker threads

‣ worker thread: registers itself to the gearmand server and continuously polls it for new tasks/jobs + executes them if needed; runs indefinitely

Page 41

IN THE PIPELINE

‣ HandlerSocket

‣ Can only be used with the MySQL InnoDB engine

‣ Low level interface (hack) for doing updates/inserts (key/value) to MySQL

‣ Very fast writes/ still able to do normal selects

‣ Waiting for a more stable version & full adoption in the MySQL package

‣ Membase

‣ A bit unstable when we tested it, but a promising product; will be the next generation memcached with focus on lists/sets in memory

‣ Redis

‣ Has a lot of potential; currently used for debugging remote pages (Pub/Sub)

‣ Merging lists/sets in memory has a lot of use cases => online friends/users etc.

‣ Waiting for Redis Cluster before we can make full use of this

Page 42

CASES

Page 43

STATISTICS

‣ Case 1: deliver realtime statistics of what is happening on the website

‣ We want a system where we can easily monitor the evolution of # unique logins/ # photo uploads/ # registrations, ... + it should be possible to segment on age/ gender/ location/ browser/ language/ ...

‣ easily generate graphs for all common statistics

‣ segment those graphs in various ways on-the-fly

‣ dive into the data and examine things more closely

Page 44

STATISTICS

‣ Attempted solution #1

‣ Create a logging table in MySQL and add a row to it each time an event takes place

‣ Issues with this approach:

‣ structure needs to be decided in advance, since altering the table is practically impossible once it gets very large

‣ generating (segmented) graphs is slow, as it will basically correspond to running SQL queries on a very big table

‣ diving into the data is slow too -- all of the raw logging data is there, but querying it gets infeasible in practice

Page 45

STATISTICS

‣ Attempted solution #2

‣ Keep aggregated stats in a MySQL table and update them on-the-fly when events happen

‣ This makes generating graphs fast, but:

‣ diving into the data is impossible since only the high-level aggregated stats are stored, not the individual logs

Page 46

STATISTICS

‣ Hadoop to the rescue

‣ Hadoop = platform for sequential data processing in batch

‣ can reliably store lots of big files on a cluster of computers

‣ is able to process the data very efficiently as well

‣ the data doesn't have to be structured and can evolve

‣ HBase = Hadoop-based DB for random and real-time queries

‣ allows very big data sets to be queried very quickly

‣ low maintenance due to auto partitioning and redundancy

‣ runs on top of Hadoop, so doesn't require extra servers

‣ it's very easy to write to HBase from Hadoop jobs

Page 47

STATISTICS

‣ Proper solution

‣ Combination of Hadoop and HBase:

‣ 1. store detailed log files on Hadoop (even for events you're not particularly interested in yet at the current moment)

‣ 2. run batch jobs that precompute all common stats (and their segmentations) and store the results in HBase

‣ With this approach, we can:

‣ quickly generate (segmented) graphs by querying HBase

‣ dive into the data by running custom batch jobs on Hadoop

‣ keep lots of data around without worrying about schemas

Page 48

STATISTICS

‣ Our Hadoop setup (1)

‣ Logs are gathered by syslog-ng (see the sketch below)

‣ easy to set up and seems to work OK so far

‣ but looking at more reliable alternatives such as Flume

‣ LZO compression is applied before uploading to Hadoop

‣ storage space is our main bottleneck

‣ makes jobs run faster too, as they tend to be IO-bound

‣ Hadoop cluster itself currently consists of 9 nodes

‣ started off with 6 recycled database servers

‣ now moving to 9 custom tailored nodes that each have:

‣ 12 hyperthreaded cores (2 virtual cores per core => 24)

‣ 24GB RAM

‣ 12 disks (8 x 3TB and 4 x 2TB)

‣ so in total we have > a quarter of a PB of raw disk space
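
As an illustration of the front of this pipeline, an application server could emit one structured line per event through standard syslog (the tag, facility and field names here are hypothetical); syslog-ng then gathers these lines for compression and upload to Hadoop:

<?php
// log a single "photo_upload" event as one JSON line
openlog('stats', LOG_PID, LOG_LOCAL0); // assumed tag & facility
syslog(LOG_INFO, json_encode(array(
    'event'    => 'photo_upload',
    'userid'   => 42,
    'age'      => 25,   // the segmentation fields from the case description
    'gender'   => 'f',
    'language' => 'en',
    'ts'       => time(),
)));
closelog();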

Page 49

STATISTICS

‣ Our Hadoop setup (2)

‣ Accessing/ Processing

‣ Thrift API through Python

‣ On the fly from HBase with Java

‣ Dumbo jobs with Python

‣ Every hour / day / month

‣ Summaries are stored in HBase for every combination of age, language, gender, date

‣ Custom queries with Hive => an SQL interface on top of Hadoop

‣ When doing deep analysis

‣ Map/Reduce jobs in Hadoop if we want to dive into the data

Page 50

STATISTICS

[Screenshots: example dashboard & example filtering]

Page 51

STATISTICS

‣ Future plans

‣ Having all this valuable data at our fingertips also enables us to go beyond traditional analytics:

‣ social network analysis

‣ natural language processing

‣ ranking models

‣ topic models

‣ collaborative filtering

‣ ...

‣ In this way, Hadoop will also be used to directly power and improve features of our products.

Page 52

RELEASE WORKFLOW

Page 53

RELEASE WORKFLOW

‣ Previous way (svn)

‣ Every developer has his own dev environment on a remote dev box

‣ Code changes get rsynced on save

‣ After testing enough on the dev environment, the code gets committed

‣ Every developer can put code live immediately on all the webs

‣ 1st on test webs and, after confirming that there are no errors, on the other webs

‣ Full deploys of code happen on a weekly basis (code/ templates/ frontend)

Positive: flexible & fast way of developing/putting things live; works well for small teams

Negative: bugs can get live much more easily & it’s hard to track down issues if there are ## deploys per day

Page 54

RELEASE WORKFLOW

‣ New deploy/coding strategy:

‣ Developer runs all the code locally & commits/pushes changes to the develop or feature branch once stable

‣ Git branches:

‣ develop: all main develop tasks (bugfixes, small refactors)

‣ feature branches: things that involve more than 1 day of work/ can be shared (feature-payments, feature-advancedsearch, ...)

‣ release-candidate: created every evening, tested overnight & in the morning, and goes live every day at noon

‣ master: the version that is live/ no development is done on this branch/ only merges from rc to master

‣ Dedicated release manager that has ownership over the deploys

Note: critical bugs can be cherry-picked onto master & rc and deployed afterwards => limit this usage (only for urgent issues)

Page 55

RELEASE WORKFLOW

‣ Benefits of the new strategy

‣ Dedicated people for code release => ownership & responsibility

‣ Low impact on site visitors; fewer errors/bugs can be put live

‣ More time to test

‣ Easier to see what has gone live (a mailing is sent to everyone with the commit messages)

‣ Can go back to a previous release if something is wrong

‣ Future plans:

‣ Increase the release-candidate <> deploy timeframe to x days; right now the code is not mature enough & daily releases are needed

Page 56

TESTING

‣ Unit testing

‣ Tests are written in phpunit classes for new code pieces & bugs (almost 70% code coverage; a sketch follows below)

‣ Database (table structure) differences between the production <> development environments are checked on a daily basis

‣ We use Jenkins to do automated testing in the background (after each push)

‣ Selenium for replaying browser actions

‣ Manual testing

‣ We have a QA engineer who tests new features & important parts of the sites daily/weekly

‣ Responsibility of developer++
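
A minimal sketch of such a phpunit class in the 2011-era style; Core_Validator and its isEmail() method are hypothetical stand-ins for framework code:

<?php
require_once 'PHPUnit/Autoload.php';

class Core_ValidatorTest extends PHPUnit_Framework_TestCase
{
    public function testValidEmailIsAccepted()
    {
        $validator = new Core_Validator();
        $this->assertTrue($validator->isEmail('jayme@example.com'));
    }

    public function testInvalidEmailIsRejected()
    {
        $validator = new Core_Validator();
        $this->assertFalse($validator->isEmail('not-an-email'));
    }
}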

Page 57

JENKINS

Page 58

JENKINS

Page 59

REMARKS/ Q&A

Page 60

QUESTIONS

‣ That’s all folks!

‣ Contact

‣ http://nl.netlog.com/jayme

‣ http://twoo.com/648281476

‣ http://be.linkedin.com/in/jaymerotsaert

‣ http://twitter.com/_jayme

‣ or email: <firstname> @ <lastname> .be
