Epic Fails in LiveOps James Gwertzman, CEO January 16, 2017

Silver lining: They used this time to catch up on content; might never have caught up if they didn’t get that extra time.

Introduction to PlayFab

Game Manager
Mission control for your whole team. All the data and tools you need to engage, retain and monetize your players.

Game Services
Back-end building blocks for your live game. Storage, compute, commerce, analytics and much, much more.

Add-On Marketplace
Pre-integrated tools and services from industry-leading partners. Reduce SDK fatigue, with (mostly) single-click access.

PlayStream
The activity stream that ties it all together. Events, triggers, real-time segmentation to automate your LiveOps.

PlayFab is a flexible LiveOps platform for games.

Full-text search for players

• Easily locate players
• Search across all player properties
• Use wildcard matches

1/18/17 PlayFab Confidential 7

Player segmentation

• Trigger actions as players enter/exit segments
• Updated in real time
• Set manually with tags
• Use segments to target stores, run bulk actions
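A minimal sketch of how enter/exit triggers could be derived, assuming segments are simple predicates over player properties. The segment names, property names, and functions here are hypothetical illustrations, not PlayFab's API.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical segment definitions: name -> predicate over player properties.
SEGMENTS: Dict[str, Callable[[dict], bool]] = {
    "payers": lambda p: p.get("total_spend", 0) > 0,
    "lapsed": lambda p: p.get("days_since_login", 0) >= 7,
}


def evaluate_segments(player: dict) -> Set[str]:
    """Return the names of all segments the player currently belongs to."""
    return {name for name, matches in SEGMENTS.items() if matches(player)}


def segment_transitions(before: Set[str], after: Set[str]) -> List[Tuple[str, str]]:
    """Compare two membership sets and list the enter/exit transitions."""
    entered = [("entered", name) for name in sorted(after - before)]
    exited = [("exited", name) for name in sorted(before - after)]
    return entered + exited


# A lapsed non-payer comes back and makes a purchase.
old = evaluate_segments({"total_spend": 0, "days_since_login": 9})
new = evaluate_segments({"total_spend": 5, "days_since_login": 0})
transitions = segment_transitions(old, new)
for kind, segment in transitions:
    print(kind, segment)
```

Each transition is the point where an action (a reward, a push notification, a store change) could be attached.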

Create and manage item catalog

• Items can have:
– Limited uses
– An expiration time
– Custom data
– Default prices in multiple currencies
– Tags to help organize
• Limited edition items have enforced scarcity
• Catalogs can be imported/exported as JSON data
• Update catalog from server dynamically at any time

Item stores

• Onecatalogcanhavemultiplestores• Storescanhavedifferentprices• Storescanbetargetedtodifferent

playersegments

Time-based leaderboards for tournaments

• Leaderboards can be reset on a fixed schedule (daily, weekly, monthly) or manually at any time
• When leaderboards reset, the list of players at the time of the reset is archived
• Use leaderboard standing at time of reset to issue prizes, determine tournament winners

{"PlayerId":"4AC350E4134A36C8","Value":620}
{"PlayerId":"3CC3A4D866D9580A","Value":620}
{"PlayerId":"D15EFFB805045CFA","Value":620}
{"PlayerId":"B8271B32A8035722","Value":620}
{"PlayerId":"B188B845940ED6D3","Value":620}
{"PlayerId":"321DBA3528144483","Value":500}
{"PlayerId":"EA141B9B63B53583","Value":500}
{"PlayerId":"DC01857A8D90B2F5","Value":500}
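One way archived standings like these could be turned into prizes, as a rough sketch. It assumes one JSON object per line and uses standard competition ranking so tied scores share a rank. The prize tiers and function names are hypothetical; three of the archived rows are reused as sample input.

```python
import json

# Three rows taken from the archived leaderboard sample, one JSON object per line.
ARCHIVE = """\
{"PlayerId":"4AC350E4134A36C8","Value":620}
{"PlayerId":"3CC3A4D866D9580A","Value":620}
{"PlayerId":"321DBA3528144483","Value":500}
"""


def rank_with_ties(rows):
    """Standard competition ranking: equal values share a rank (1, 1, 3, ...)."""
    ordered = sorted(rows, key=lambda r: r["Value"], reverse=True)
    ranked = []
    for i, row in enumerate(ordered):
        if i > 0 and row["Value"] == ordered[i - 1]["Value"]:
            rank = ranked[-1][0]  # same score as previous player: share the rank
        else:
            rank = i + 1
        ranked.append((rank, row["PlayerId"]))
    return ranked


def prize_for(rank):
    """Hypothetical prize tiers, for illustration only."""
    if rank == 1:
        return "gold_chest"
    if rank <= 3:
        return "silver_chest"
    return None


rows = [json.loads(line) for line in ARCHIVE.splitlines() if line.strip()]
ranked = rank_with_ties(rows)
for rank, player_id in ranked:
    print(rank, player_id, prize_for(rank))
```

Ties matter here: several players in the archive share the top score, so the prize rules need to say whether all of them win.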

Host session-based game servers

• Upload custom game server builds
• Configure multiplayer game modes
• Select regions where servers should be hosted
• Servers will scale automatically based on load

Server-hosted JavaScript

• Write server-based code without a dedicated game server
• Easy upload of JavaScript custom logic
• Make changes to your game behavior without requiring client updates
• Server authentication protects against client-side cheating
• Access the more powerful Server API (with features not available on the client)
• GitHub integration for easy revision control

Applications include:
• Granting player rewards
• Validating player actions
• Resolving interactions between players
• Managing asynchronous game turns

Trigger actions from real-time events

• Trigger actions in response to real-time events
• Events can come from client, server, or third-party vendors
• Rich set of actions including running CloudScript or sending push notifications

Scheduled jobs & bulk player actions

• Schedule jobs to run in the background
• Run once, or on a recurring basis
• Schedule now, or in the future
• Run tasks for each player in a segment, or for the title

Full-text event search

• Filter and search through recent event history
• Zoom in on specific time period
• Look for specific players, event types, or error conditions

Remotely manage game configuration

• Store game configuration on the server to modify behavior over time
• Coming soon: change configuration based on player segment
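A small sketch of how a client might consume server-stored configuration, assuming the server returns a flat key/value map. The keys and the `effective_config` helper are hypothetical. Remote values override built-in defaults, and keys the client does not know about are ignored.

```python
# Hypothetical built-in defaults shipped with the client.
DEFAULTS = {"daily_reward_coins": 100, "event_active": False}


def effective_config(defaults: dict, remote: dict) -> dict:
    """Overlay remote values on defaults; ignore remote keys the client does not know."""
    return {key: remote.get(key, value) for key, value in defaults.items()}


# The server turns an event on; the unknown key is dropped rather than trusted.
cfg = effective_config(DEFAULTS, {"event_active": True, "typo_key": 1})
print(cfg)
```

Falling back to defaults means the game still behaves sensibly if the configuration fetch fails or a key is missing on the server.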

More than 1,000 developers w/ 450+ live games

Daily Active Players (2016)

[Chart: daily active players by week, Jan 1 through Dec 30; vertical axis from 0 to 4,500,000]

Tips for running a live service with a small team

• Fully leverage the cloud and other SaaS services
• Continuous integration
• Frequent and automated deployment to live
• All engineers take turns being “on-call”

SaaS services we use to run PlayFab

Tools we depend on

Basic API handling architecture

CloudScript Execution

PlayStream event handling

Multiplayer game server hosting & scaling

How the cloud has changed deployments
Scenario A: Successful deployment

[Diagram: deployment timelines for dedicated hardware and for cloud; labels: Build A, Build B, Downtime]

How the cloud has changed deployments
Scenario B: Rollback needed

[Diagram: deployment timelines for dedicated hardware and for cloud; labels: Build A, Build B, Build A, Downtime]

Thinking about #fails

• Not all failure is created equal
• #fails range from praiseworthy to blameworthy
• Types of failure:
– Failures in routine operations, which can be prevented
– Failures in complex operations, which can’t be avoided, but can be managed so they don’t turn into catastrophe
– Unwanted outcomes in research, which generate knowledge
• Goals with failure should be:
– Detect early
– Analyze deeply
– Design experiments or pilots to produce them
• Everyone on the team must feel safe admitting & reporting failures

Source: Strategies for Learning from Failure. Harvard Business Review. April 2011.

Our most common sources of failure

• Operator errors (e.g., misconfiguration)
• Design errors (e.g., cascading failures)
• Unexpected situations (e.g., surprising customer actions)

Misconfiguration failure

• Failure:
– Matchmaker server was down for 13 minutes
• Cause:
– We have a primary and a “hot” backup
– If the primary fails, traffic should switch to backup
– Route 53 was misconfigured to route traffic (correctly) to primary, but check health on the backup (incorrectly)
– When the primary did finally fail, traffic didn’t switch
• Solution:
– Short-term: Fix the configuration
– Long-term: Automate health-check integrity

[Diagram: Route 53 (DNS service), Matchmaker Primary, Matchmaker Backup; arrows labeled Traffic and Health Check; the health check is marked X]
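An automated health-check integrity test could look roughly like the sketch below. It assumes the failover configuration has already been read into plain records; the `FailoverRecord` type, its fields, and the host names are hypothetical and are not Route 53's API. The check flags any record whose health check probes a different host than the one receiving traffic.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FailoverRecord:
    """Hypothetical, simplified view of one DNS failover record."""
    role: str                 # "primary" or "backup"
    endpoint: str             # host that receives traffic for this record
    health_check_target: str  # host that the attached health check probes


def find_mismatched_health_checks(records: List[FailoverRecord]) -> List[FailoverRecord]:
    """Return records whose health check does not probe their own endpoint."""
    return [r for r in records if r.endpoint != r.health_check_target]


# Reproduces the misconfiguration described above with made-up host names:
# traffic goes to the primary, but the primary's health check watches the backup.
records = [
    FailoverRecord("primary", "mm-primary.example.internal", "mm-backup.example.internal"),
    FailoverRecord("backup", "mm-backup.example.internal", "mm-backup.example.internal"),
]
problems = find_mismatched_health_checks(records)
for record in problems:
    print("health check mismatch on", record.role, "record")
```

Run on a schedule or in the deployment pipeline, a check like this would surface the mismatch before the primary ever fails.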

Design failure

• Failure: 2-minute system-wide outage
• Cause:
– A game was running a test of item consumption
– Design issue: calling “consume” loaded entire inventory
– Result: 100+ requests for 13K item inventory in 1 minute
– This blocked API servers, waiting for the database
– DynamoDB then auto-scaled, so calls unblocked, leading API servers to then peg CPU to 100% processing load
– This meant servers stopped responding to health checks
– Servers were all then auto-terminated
• Solution:
– Short-term: we rolled back, which redeployed servers
– Long-term: API throttles, paging data requests
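The two long-term fixes can be illustrated with a generic sketch: a token-bucket throttle and a paging helper. This is not PlayFab's implementation; the class, function, rates, and page size are assumptions chosen for the example, and the clock is injectable so the behavior is deterministic.

```python
import time


class TokenBucket:
    """Allow up to `capacity` calls in a burst, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def allow(self):
        """Return True and spend a token if the call is within the limit."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def paged(items, page_size):
    """Yield fixed-size slices instead of handing back the whole collection."""
    for start in range(0, len(items), page_size):
        yield items[start:start + page_size]


# A 13K-item inventory read in pages of 500 (hypothetical page size).
inventory = list(range(13000))
pages = list(paged(inventory, 500))
print(len(pages), "pages")

# Throttle with a fake clock: burst of 2, then 1 call per second.
fake_now = [0.0]
bucket = TokenBucket(rate_per_sec=1, capacity=2, clock=lambda: fake_now[0])
results = [bucket.allow(), bucket.allow(), bucket.allow()]
fake_now[0] = 1.0
results.append(bucket.allow())
print(results)
```

Paging bounds the cost of any single request, while the throttle bounds how many such requests one caller can issue, so a single title's test cannot saturate shared API servers.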

Complexity failure

• Failure: Elasticsearch went down for 2 days
• Cause:
– AWS Elasticsearch was trying to scale our cluster
– Instead of adding nodes, it replaces them
– This requires moving all data from node to node
– They don’t throttle data moves, so CPUs went to 100%
– This triggered health check fails, and node termination
– New nodes trigger index rebalancing, but they were already rebalancing because of the scaling
– At one point we were losing 4 nodes every 30 minutes
• Solution:
– Short-term #1: turn off writes to catch up; not enough
– Short-term #2: spin up new cluster, aim writes at new cluster, back-fill w/ data from Kinesis queue
– Long-term: Move off AWS ES onto our own ES cluster; customize configuration based on experience
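The second short-term fix depends on the back-fill being safe to run while live writes are already landing in the new cluster. A minimal sketch of that idea, assuming every event carries a unique id: a dict stands in for the new index and a list for the queued events, and no real Elasticsearch or Kinesis client calls are shown.

```python
def backfill(new_index: dict, queued_events: list) -> int:
    """Replay queued events into new_index, skipping ids that are already present.

    Returns the number of events added, so the replay can be repeated safely.
    """
    added = 0
    for event in queued_events:
        if event["id"] not in new_index:
            new_index[event["id"]] = event
            added += 1
    return added


# The new cluster already received one live write (e3) before the back-fill ran.
new_index = {"e3": {"id": "e3", "name": "player_logged_in"}}
queue = [
    {"id": "e1", "name": "player_logged_in"},
    {"id": "e2", "name": "item_purchased"},
    {"id": "e3", "name": "player_logged_in"},
]
added = backfill(new_index, queue)
print(added, "events back-filled,", len(new_index), "events in the new index")
```

Keying on the event id makes the replay idempotent: running it twice, or overlapping it with live writes, does not create duplicates.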

[Charts: ES writes/sec; ES delay in seconds]

Unexpected failure

• Failure:
– Sudden surge of traffic to our doc site, resembling a DDoS attack
• Cause:
– Customer was pinging our doc site repeatedly as a “health check”
– They ran a customer acquisition campaign so traffic spiked
– We saw these unusual queries, with a strange user agent string
– We assumed it was an attack, so quarantined that user agent string
– This had the effect of taking down their game!
• Solution:
– Restore their traffic; explain to developer why this is a bad idea

Other common customer fails

• Not using receipt validation
• “We’ll wait, and if it’s successful, we’ll invest in tools”
• Not running events
• Launching without real-time data

Fake receipts are a big problem

[Chart: legit receipts vs. fake receipts; vertical axis from 0 to 200,000]

Why in-game events matter (AdCap)

[Chart annotation: PlayFab added as backend platform]

Setting up a live event

• Move art/assets into Unity AssetBundle
• Move planet config to Google Sheets
• Export data to PlayFab catalogs

LiveOps depends on tools

How much can you do without writing code?

The LiveOps Tools Continuum
[Diagram labels: Writing SQL, Hacking game DB; Web tools, Modify game params]

Questions?

James [email protected] @gwertz