DESCRIPTION
Would you ever play an online game if you were not able to communicate with your teammates? Isn’t it fun if you can make new friends, arrange pre-made games and celebrate your victories with people you like to play with? Riot Games’ League of Legends handles millions of online players at any given time. Each chat server is responsible for routing over 1 billion real-time events a day. In order to support the overwhelming user base and be prepared for future growth, as well as pave the road for upcoming features, the chat infrastructure had to be designed and built with the utmost care, so that it would never fail the players. In this talk I would like to present how we achieved linear scalability, improved the overall fault tolerance, created a framework for real-time code upgrades and got ready for the new features we want to ship. I will also discuss in detail why we chose to use Erlang as a foundation for the system, and why we migrated our data from MySQL to Riak.
SCALING LoL CHAT TO 70 MILLION PLAYERS
Michal Ptaszek, @michalptaszek
Riot Games
WHAT’S PLANNED
1 GAME
2 CHAT
3 TECH
4 LESSONS LEARNED
5 Q&A
WHAT IS LEAGUE OF LEGENDS?
2009 LAUNCH
TEAM ORIENTED
100+ CHAMPS
MODERN FANTASY
CHAT: WHAT IS IT?
MESSAGING SERVICE: Private player chat and group chats.
PRESENCE SERVICE: Friend lists, availability and status.
SOCIAL GRAPH SERVICE: Internal service for store, match history, leagues.
CHAT BY THE NUMBERS
67 million monthly players
27 million daily players
7.5 million concurrent players
1 billion events routed per server, per day
CHAT AT 10K FEET
STABLE, SCALABLE CHAT SERVICE
PROTOCOL / SERVER / DATA STORE
PROTOCOL: XMPP
Decentralized Architecture
Openness
Extensibility
Availability of Client Libraries
Security
Wide Adoption
SERVER: EJABBERD
‣ Open source Jabber/XMPP server
‣ Relatively nice scalability and performance with default configuration
‣ Wide adoption and active, helpful community
‣ Very good as a starting point for our own server solution
▾ We were aware that one day we would need to start customizing it
‣ Written in Erlang programming language
Which gives us...
TECHNOLOGY: ERLANG/OTP
Erlang is...
A functional language
Built with concurrency and distribution in mind
Able to scale extremely well
Capable of reloading code on the fly
A declarative style of programming
An easier way to build our distributed applications
More time to focus on coding
Less downtime
SERVER: EJABBERD - PHILOSOPHY
ARCHITECTURE: Share-nothing approach; enables massive, near-linear horizontal scalability.
LET IT CRASH: When something is massively broken - do not fix it!
FAULT TOLERANCE: Implementation of self-healing properties, which bring the system to a well-known, stable state.
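The "let it crash" and self-healing ideas can be illustrated with a small sketch. It is written in Python for readability rather than in Erlang, and it is only loosely modelled on the supervisor pattern; the `Worker` and `Supervisor` classes are hypothetical names and are not taken from ejabberd or from the talk.

```python
class Worker:
    """A worker whose state may become unusable after an unexpected input."""

    def __init__(self):
        self.handled = 0  # the well-known, stable starting state

    def handle(self, message):
        if message == "poison":
            raise RuntimeError("unexpected message")
        self.handled += 1
        return f"ok:{message}"


class Supervisor:
    """Replaces a crashed worker with a fresh one instead of repairing it."""

    def __init__(self, factory):
        self.factory = factory
        self.worker = factory()
        self.restarts = 0

    def dispatch(self, message):
        try:
            return self.worker.handle(message)
        except Exception:
            # Do not try to fix the broken worker: discard it and start over
            # from the known initial state.
            self.worker = self.factory()
            self.restarts += 1
            return None


def demo():
    sup = Supervisor(Worker)
    results = [sup.dispatch(m) for m in ["a", "poison", "b"]]
    return results, sup.restarts, sup.worker.handled
```

In this sketch the failing message is simply dropped and the worker's counter starts again from zero, which is the trade-off the slide describes: the state is lost, but the system returns to a condition that is known to work.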
SERVER: EJABBERD - ARCHITECTURE
[Architecture diagram. Labels: External Traffic (5223), LB, Ejabberd Server (x2), Internal Traffic, Riak (x2), Secondary Riak Cluster, ETL Queries]
SERVER: EJABBERD - IMPLEMENTATION
PHASE 1 - MAKE IT WORK
‣ Over time mostly rewritten
‣ Removed unwanted and unneeded parts
‣ Optimized certain flow paths
‣ Made it compatible with industry standards
‣ Wrote over 600 tests to cover it
[Diagram: Alice and Bob exchanging Invite and Accept messages]
SERVER: EJABBERD - IMPLEMENTATION
PHASE 2: MAKE IT RIGHT
‣ Removed clear bottlenecks
‣ Avoided shared, mutable state
‣ “Make it work, make it right, make it fast”
[Diagram: MUC router, MUC rooms, user sessions]
[Diagram: MUC rooms and user sessions, shown without the MUC router]
[Diagram: session table with Alice, Bob and Charlie]
Session Table: JID -> Session Handler
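The "JID -> Session Handler" mapping can be pictured as a lookup table used for routing. The sketch below is a Python illustration under stated assumptions, not ejabberd's implementation: `SessionTable`, its methods and the example JIDs are hypothetical, and a handler is modelled as a plain callable.

```python
class SessionTable:
    """Maps a JID string to the handler of that user's session."""

    def __init__(self):
        self._handlers = {}

    def register(self, jid, handler):
        self._handlers[jid] = handler

    def unregister(self, jid):
        self._handlers.pop(jid, None)

    def route(self, to_jid, stanza):
        """Deliver a stanza to the session for to_jid; False if no session."""
        handler = self._handlers.get(to_jid)
        if handler is None:
            return False
        handler(stanza)
        return True


def demo():
    table = SessionTable()
    inboxes = {"alice@example.com": [], "bob@example.com": []}
    for jid, inbox in inboxes.items():
        # Each session handler here just appends to that user's inbox.
        table.register(jid, inbox.append)
    delivered = table.route("bob@example.com", "hello from alice")
    missed = table.route("charlie@example.com", "anyone there?")
    return delivered, missed, inboxes
```

A single in-process dictionary like this is itself shared, mutable state, so it only shows the lookup; how the real table is partitioned or replicated across servers is not described in the slides.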
SERVER: EJABBERD - IMPLEMENTATION
PHASE 3 - MAKE IT FAST
‣ Patched VM and stdlibs
‣ Sacrificed the generic nature of the Erlang/OTP framework in favor of better scalability and fault tolerance
‣ Better traceability and profiling functions
‣ More visibility into the system
‣ Improved logging for code reloading and real time system upgrades
NOSQL
DATA STORE: RIAK
SCALE: Linearly scalable; no growth headaches
FAULT TOLERANCE: No SPoF; higher uptime
SCHEMA-LESS: Faster feature iterations; more shipped features
‣ Distributed, fault-tolerant, key-value store
‣ Masterless, fully peer-to-peer architecture
‣ AP in CAP theorem, with eventual consistency
‣ Low, predictable latency
‣ Extreme scalability
‣ Multi data center replication
LESSONS LEARNED: UNDERSTAND YOUR SYSTEM
‣ Over 500 real-time counters, rates, histograms collected each minute
‣ Make sure to know counter values for “correct” and “abnormal” conditions
‣ Alerts and logs for long running operations
‣ Integration with Graphite, Zabbix and Nagios
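One way to picture "knowing counter values for correct and abnormal conditions" is a set of counters that is flushed at a fixed interval and compared against expected ranges. This is a minimal Python sketch: the `Metrics` class, the counter names and the ranges are all made up for illustration, and it does not show the actual Graphite, Zabbix or Nagios integration.

```python
from collections import Counter


class Metrics:
    """Counters flushed once per interval and checked against normal ranges."""

    def __init__(self, normal_ranges):
        # normal_ranges: counter name -> (low, high), both inclusive.
        self.normal_ranges = normal_ranges
        self.counters = Counter()

    def incr(self, name, amount=1):
        self.counters[name] += amount

    def flush(self):
        """Return (snapshot, alerts) for the interval and reset the counters."""
        snapshot = dict(self.counters)
        alerts = []
        for name, (low, high) in self.normal_ranges.items():
            value = snapshot.get(name, 0)
            if not low <= value <= high:
                alerts.append(name)
        self.counters.clear()
        return snapshot, sorted(alerts)
```

Note that a counter staying at zero can be as abnormal as one that spikes, which is why each range in the sketch has a lower bound as well as an upper one.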
LESSONS LEARNED: IMPLEMENT FEATURE TOGGLES
‣ Safety valve for things that might cause problems
‣ Partial deployments allowing features to be enabled only for certain groups of people
[Diagram: among Alice, Bob and Charlie, the group reordering feature is enabled only for Bob (whitelist: Bob)]
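The whitelist idea from the slide can be sketched as follows. This is a Python illustration, and the `FeatureToggles` class and its method names are assumptions made for the example; only the feature name and the whitelist for Bob come from the slide.

```python
class FeatureToggles:
    """Per-feature switches with an optional whitelist of users."""

    def __init__(self):
        self._features = {}

    def define(self, name, enabled=False, whitelist=()):
        self._features[name] = {
            "enabled": enabled,
            "whitelist": set(whitelist),
        }

    def is_enabled(self, name, user):
        feature = self._features.get(name)
        if feature is None:
            return False  # unknown features are treated as off
        return feature["enabled"] or user in feature["whitelist"]

    def disable(self, name):
        """Safety valve: turn the feature off for everyone."""
        feature = self._features.get(name)
        if feature is not None:
            feature["enabled"] = False
            feature["whitelist"].clear()


toggles = FeatureToggles()
toggles.define("group_reordering", whitelist=["Bob"])
```

With this setup, `toggles.is_enabled("group_reordering", "Bob")` is true while the same check for Alice or Charlie is false, and calling `disable` switches the feature off for Bob as well.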
LESSONS LEARNED: SUPPORT CODE RELOADING
‣ Patching bugs on the fly
‣ Changing server configuration
‣ Collecting data for future analysis
‣ No downtime deploys
[Diagram: buggy code -> server restart -> fixed code, compared with buggy code -> fixed code]
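The contrast between restarting a server and swapping code in place can be shown with a loose analogy in Python. This is not how Erlang's hot code loading works internally; the `Server` class and both handler functions are hypothetical, and the point is only that the running object and its state survive the swap.

```python
class Server:
    """Keeps running, and keeps its state, while its handler is replaced."""

    def __init__(self, handler):
        self._handler = handler
        self.processed = 0  # state that must survive a reload

    def reload(self, new_handler):
        # Swap the code path without stopping or recreating the server.
        self._handler = new_handler

    def handle(self, request):
        self.processed += 1
        return self._handler(request)


def buggy_handler(text):
    return text.lower()  # bug: the intended behaviour is upper-casing


def fixed_handler(text):
    return text.upper()


server = Server(buggy_handler)
before = server.handle("hi")   # served by the buggy code
server.reload(fixed_handler)   # patch applied on the fly
after = server.handle("hi")    # served by the fixed code
```

The request counter keeps counting across the swap, which is the property a restart would lose.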
LESSONS LEARNED: GET YOUR LOGGING RIGHT
‣ Proper logging and tracing facilities
‣ Debug modes for selected users
‣ Tools for analysis of the collected data
[Diagram: Alice and the log files ejabberd.log, slow_db.log, muc_audit.log, roster_audit.log, trace_alice.log]
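A per-user debug mode, in the spirit of trace_alice.log, could be approximated with a log filter. The sketch below uses Python's standard `logging` module; `UserFilter`, `build_logger` and the `user` record attribute are assumptions for the example, and in-memory streams stand in for the log files.

```python
import io
import logging


class UserFilter(logging.Filter):
    """Passes only records tagged with the traced user's name."""

    def __init__(self, user):
        super().__init__()
        self.user = user

    def filter(self, record):
        return getattr(record, "user", None) == self.user


def build_logger(traced_user, main_stream, trace_stream):
    logger = logging.getLogger("chat_sketch")
    logger.handlers.clear()
    logger.setLevel(logging.DEBUG)
    logger.propagate = False

    # Main log: INFO and above, for every user.
    main = logging.StreamHandler(main_stream)
    main.setLevel(logging.INFO)
    logger.addHandler(main)

    # Trace log: everything, but only for the selected user.
    trace = logging.StreamHandler(trace_stream)
    trace.setLevel(logging.DEBUG)
    trace.addFilter(UserFilter(traced_user))
    logger.addHandler(trace)
    return logger


main_log, trace_log = io.StringIO(), io.StringIO()
log = build_logger("alice", main_log, trace_log)
log.debug("roster fetched", extra={"user": "alice"})
log.debug("roster fetched", extra={"user": "bob"})
log.info("login", extra={"user": "bob"})
```

Here the debug line for Alice ends up only in the trace stream, Bob's debug line is dropped, and Bob's login appears only in the main stream, so verbose tracing for one player does not flood the main log.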
LESSONS LEARNED: ALWAYS LOAD TEST YOUR CODE
Honu
‣ Automatic verification of the latest builds
‣ Collecting historical results for comparison
‣ Measuring the impact of new features and changes to the code
‣ Simulating various failures
LESSONS LEARNED: THINGS WILL FAIL
‣ Prepare for the worst
‣ It’s just a matter of time before a crash happens
‣ It’s not only our code that fails
‣ Unlikely events happen every second at this scale
CURRENT SITUATION
CHAT IS DOING GREAT! The quality uptime is over 99% each month and increasing, with hundreds of servers deployed all over the world.
SCALE AND PERFORMANCE: Each server offers reliable, low-latency service to the players, routing over 1B events a day with low resource utilization.
CHAT IS EVOLVING: Rolling out Riak worldwide, making LoL Chat available outside of the client, exploring possibilities around using social graph data, and more...
THANK YOU! ANY QUESTIONS?