Microservices in Production - O'Reilly Media · Microservices in Production Introduction Although the adoption of microservice architecture brings consider‐ able freedom to developers,

Susan J. Fowler

Standard Principles and Requirements

Microservices in Production

http://www.oreilly.com/programming/newsletter

Susan J. Fowler

Microservices in ProductionStandard Principles and Requirements

Boston Farnham Sebastopol TokyoBeijing Boston Farnham Sebastopol TokyoBeijing

978-1-491-97297-7

[LSI]

Microservices in Productionby Susan J. Fowler

Copyright © 2017 O’Reilly Media. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA95472.

O’Reilly books may be purchased for educational, business, or sales promotional use.Online editions are also available for most titles (http://safaribooksonline.com). Formore information, contact our corporate/institutional sales department:800-998-9938 or [email protected].

Editors: Nan Barber and Brian FosterProduction Editor: Colleen LobnerCopyeditor: Octal Publishing, Inc.

Interior Designer: David FutatoCover Designer: Randy ComerIllustrator: Rebecca Demarest

October 2016: First Edition

Revision History for the First Edition2016-09-07: First Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Microservices inProduction, the cover image, and related trade dress are trademarks of O’ReillyMedia, Inc.

While the publisher and the author have used good faith efforts to ensure that theinformation and instructions contained in this work are accurate, the publisher andthe author disclaim all responsibility for errors or omissions, including without limi‐tation responsibility for damages resulting from the use of or reliance on this work.Use of the information and instructions contained in this work is at your own risk. Ifany code samples or other technology this work contains or describes is subject toopen source licenses or the intellectual property rights of others, it is your responsi‐bility to ensure that your use thereof complies with such licenses and/or rights.

http://safaribooksonline.com

Table of Contents

1. Microservices in Production. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1Introduction 1The Challenges of Microservice Standardization 1Availability: The Goal of Standardization 3Production-Readiness Standards 4Stability 5Reliability 6Scalability 7Fault Tolerance and Catastrophe Preparedness 8Performance 10Monitoring 11Documentation 13Implementing Production-Readiness 15

v

CHAPTER 1

Microservices in Production

IntroductionAlthough the adoption of microservice architecture brings consider‐able freedom to developers, ensuring availability requires holdingmicroservices to high architectural, operational, and organizationalstandards. This report covers the challenges of microservice stand‐ardization in production, introduces availability as the goal of stand‐ardization, presents the eight production-readiness standards, andincludes strategies for implementing production-readiness stand‐ardization across an engineering organization.

The Challenges of MicroserviceStandardizationThe architecture of a monolithic application is usually determined atthe beginning of the application’s lifecycle. For many applications,the architecture is determined at the time a company begins. As thecompany grows and the application scales, developers who areadding new features often find themselves constrained and limitedby the choices made when the application was first designed. Theyare constrained by choice of language, by the libraries they are ableto use, by the development tools they can work with, and by theneed for extensive regression testing to ensure that every new fea‐ture they add does not disturb or compromise the entirety of theapplication. Any refactoring that happens to the standalone, mono‐lithic application is still essentially constrained by initial architec‐

1

tural decisions: initial conditions exclusively determine the future ofthe application.

The adoption of microservice architecture brings a considerableamount of freedom to developers. They are no longer tied to thearchitectural decisions of the past; they can design their servicehowever they wish; and they have free reign regarding decisions oflanguage, of database, of development tools, and the like. The mes‐sage accompanying microservice architecture guidance is usuallyunderstood and heard by developers as follows: build an applicationthat does one thing—one thing only—and does that one thing extra‐ordinarily well; do whatever you need to do, and build it howeveryou want. Just make sure it gets the job done.

Even though this romantic idealization of microservice developmentis true in principle, not all microservices are created equal—norshould they be. Each microservice is part of a microservice ecosys‐tem, and complex dependency chains are necessary. When you haveone hundred, one thousand, or even ten thousand microservices,each of them are playing a small part in a very large system. Theservices must interact seamlessly with one another, and, mostimportantly, no service or set of services should compromise theintegrity of the overall system or product of which they are a part. Ifthe overall system or product is to be any good, it must be held tocertain standards, and, consequently, each of its parts must abide bythese standards as well.

It’s relatively simple to determine standards and give requirementsto a microservice team if we focus on the needs of that specific teamand the role its service is to play. We can say, “your microservicemust do x, y, and z, and to do x, y, and z well, you need to make sureyou meet this set S of requirements.” Of course, you must also giveeach team a set of requirements that is relevant to its service, and itsservice alone. This simply isn’t scalable and ignores the fact that amicroservice is but a very small piece of an absurdly large puzzle.We must define standards and requirements for our microservices,and they must be general enough to apply to every single microser‐vice yet specific enough to be quantifiable and produce measurableresults. This is where the concept of production-readiness comes in.

2 | Chapter 1: Microservices in Production

Availability: The Goal of StandardizationWithin microservice ecosystems, service-level agreements (SLAs)regarding the availability of a service are the most commonly usedmethods of measuring a service’s success: if a service is highly avail‐able (that is, has very little downtime), we can say with reasonableconfidence (and a few caveats) that the service is doing its job.

Calculating and measuring availability is easy. You need to calculateonly three measurable quantities: uptime (the length of time that themicroservice worked correctly), downtime (the length of time thatthe microservice was not working correctly), and the total time aservice was operational (the sum of uptime and downtime). Availa‐bility is then the uptime divided by the total time a service wasoperational (uptime + downtime).

As useful as it is, availability is not in and of itself a principle ofmicroservice standardization, it is the goal. It can’t be a principle ofstandardization because it gives no guidance as to how to build themicroservice; telling a developer to make her microservice moreavailable without telling her how to do so is useless. Availabilityalone comes with no concrete, applicable steps, but, as we will see inthe following sections, there are concrete, applicable steps we cantake toward reaching the goal of building an available microservice.

Calculating AvailabilityAvailability is measured in so-called “nines” notation, which corre‐sponds to the%age of time that a service is available. For example, aservice that is available 99% of the time is said to have “2-ninesavailability.”

This notation is useful because it gives us a specific amount ofdowntime that a service is allowed to have. If your service isrequired to have 4-nines availability, it is allowed 52.56 minutes ofdowntime per year, which is 4.38 minutes of downtime per month,1.01 minutes of downtime per week, and 8.66 seconds of downtimeper day.

Here are the availability and downtime calculations for 99% availa‐bility to 99.999% availability:

99% availability: (2-nines):

• 3.65 days/year (of allowed downtime)

Availability: The Goal of Standardization | 3

• 7.20 hours/month• 1.68 hours/week• 14.4 minutes/day

99.9% availability (3-nines):

• 8.76 hours/year• 43.8 minutes/month• 10.1 minutes/week• 1.44 minutes/day


• 52.56 minutes/year• 4.38 minutes/month• 1.01 minutes/week• 8.66 seconds/day


• 5.26 minutes/year• 25.9 seconds/month• 6.05 seconds/week• 864.3 milliseconds/day

Production-Readiness StandardsThe basic idea behind production-readiness is this: a production-ready application or service is one that can be trusted to serve pro‐duction traffic. When we refer to an application or microservice as“production-ready,” we confer a great deal of trust upon it: we trustit to behave reasonably; we trust it to perform reliably; and we trustit to get the job done and to do the job well. Production-readiness isthe key to microservice standardization.

However, the idea of production-readiness as just stated isn’t usefulenough to serve as the exhaustive definition we need, and withoutfurther explication the concept is completely useless. We need toknow exactly what requirements every service must meet in order to


be deemed production-ready and be trusted to serve productiontraffic in a reliable, appropriate way. The requirements must them‐selves be principles that are true for every microservice, for everyapplication, and for every distributed system. Standardizationwithout principle is meaningless.

It turns out that there is a set of eight principles that fits this criteria.Each of them is quantifiable, gives rise to a set of actionable require‐ments, and produces measurable results. They are: stability, reliabil‐ity, scalability, fault-tolerance, catastrophe-preparedness, performance,monitoring, and documentation. The driving force behind each ofthese principles is that, together, they contribute to and drive theavailability of a microservice.

Availability is, in some ways, an emergent property of a production-ready microservice. It emerges from building a scalable, reliable,fault-tolerant, performant, monitored, documented, andcatastrophe-prepared microservice. Any one of these principlesindividually is not enough to ensure availability, but together theyare. Building a microservice with these principles as the drivingarchitectural and operational requirements guarantees a highlyavailable system that can be trusted with production traffic.

StabilityWith the introduction of microservice architecture, developers aregiven the freedom to develop and deploy at a very high velocity.They can add and deploy new features each day, they can quickly fixbugs and swap-out old technologies the newest ones, and they canrewrite outdated microservices and deprecate and decommissionthe old versions. With this increased velocity comes increased insta‐bility, and we can trace the majority of outages in microservice eco‐systems to bad deployments containing buggy code. To ensureavailability, we need to carefully guard against this instability thatstems from increased developer velocity.

Stability allows us to reach availability by giving us ways to responsi‐bly handle changes to microservices. A stable microservice is onefor which development, deployment, the addition of new technolo‐gies, and the decommissioning and deprecation of services doesn’tcause instability in the larger microservice ecosystem. We can deter‐mine stability requirements for each microservice to mitigate thenegative side effects that might accompany each change.

Stability | 5

To mitigate any problems that might arise from the developmentcycle, we can put stable development procedures into place. Tocounteract any instability introduced by deployment, we can ensureour microservices are deployed carefully with proper staging, can‐ary, and production rollouts. To prevent the introduction of newtechnologies and the deprecation and decommissioning of oldmicroservices from compromising the availability of other services,we can enforce stable introduction and deprecation procedures.

Stability RequirementsFollowing are the requirements of building a stable microservice:

• A stable development cycle• A stable deployment process• Stable introduction and deprecation procedures

ReliabilityStability alone isn’t enough to ensure a microservice’s availability—the service must also be reliable. A reliable microservice is one thatcan be trusted by its clients, by its dependencies, and by the ecosys‐tem as a whole.

Although stability is related to mitigating the negative side effectsaccompanying change, and reliability is related to trust, the two areinextricably linked. Each stability requirement also carries a reliabil‐ity requirement alongside it. For example, developers should notonly seek to have stable deployment processes, they should ensurethat each deployment is reliable from the point of view of one oftheir clients or dependencies.

The trust that reliability secures can be broken into several require‐ments, the same way we determined requirements for stability ear‐lier. For example, we can make our deployment processes reliable byensuring that our integration tests are comprehensive and our stag‐ing and canary deployment phases are successful.

By building reliability into our microservices, we can protect theiravailability. We can cache data so that it will be readily available toclient services, helping them protect their SLAs by making our own


services highly available. To protect our own SLA from any prob‐lems with the availability of our depencies, we can implement defen‐sive caching.

The last reliability requirement is related to routing and discovery.Availability requires that the communication and routing betweendifferent services be reliable: health checks should be accurate,requests and responses should reach their destinations, and errorsshould be handled carefully and appropriately.

Reliability RequirementsFollowing are the requirements of building a reliable microservice:

• A reliable deployment process• Planning, mitigating, and protecting against the failures of

dependencies• Reliable routing and discovery

ScalabilityMicroservice traffic is rarely static or constant, and the hallmark of asuccessful microservice (and of a successful microservice ecosys‐tem) is a steady increase in traffic. We need to build microservices inpreparation for this growth—they need to accommodate it easily,and they need to actively scale with it. A microservice that can’t scalewith growth experiences increased latency, poor availability, and, inextreme cases, a drastic increase in incidents and outages. Scalabilityis essential for availability, making it our third production-readinessstandard.

A scalable microservice is one that can handle a large number oftasks or requests at the same time. To ensure a microservice is scala‐ble, we need to know both its qualitative growth scale (e.g., whetherit scales with page views or customer orders) and its quantitativegrowth scale (i.e., how many requests per second it can handle). Assoon as we know the growth scale, we can plan for future capacityneeds and identify resource bottlenecks and requirements.

The way a microservice handles traffic should also be scalable. Itshould be prepared for bursts of traffic, handle them carefully, and

Scalability | 7

prevent them from taking down the service entirely. Of course, thisis easier said than done, but without scalable traffic handling, devel‐opers can (and will) find themselves looking at a broken microser‐vice ecosystem.

Additional complexity is introduced by the rest of the microserviceecosystem. We need to prepare for the inevitable additional trafficand growth from a service’s clients. Likewise, any dependencies ofthe service should be alerted when increases in traffic are expected.Cross-team communication and collaboration are essential for scal‐ability: regularly communicating with clients and dependenciesabout a service’s scalability requirements, status, and any bottlenecksensures that any services relying on one another are prepared forgrowth and for potential pitfalls.

Last but not least, the way a microservice stores and handles dataneeds to be scalable. Building a scalable storage solution goes a longway toward ensuring the availability of a microservice, and is one ofthe most essential components of a truly production-ready system.

Scalability RequirementsHere are the requirements of building a scalable microservice:

• Well-defined quantitative and qualitative growth scales• Identification of resource bottlenecks and requirements• Careful, accurate capacity planning• Scalable handling of traffic• The scaling of dependencies• Scalable data storage

Fault Tolerance and CatastrophePreparednessEven the simplest of microservices is a fairly complex system. As weknow quite well, complex systems fail, they fail often, and anypotential failure scenario can and will happen at some point in themicroservice’s lifetime. Microservices don’t live in isolation, butwithin dependency chains as part of a larger, incredibly complex


microservice ecosystem. The complexity scales linearly with thenumber of microservice in the overall ecosystem, and ensuring theavailability of not only an individual microservice, but the ecosys‐tem as a whole, requires that we impose yet another production-readiness standard onto each microservice. Every microservicewithin the ecosystem must be fault-tolerant and prepared for any cat‐astrophe.

A fault-tolerant, catastophe-prepared microservice is one that canwithstand both internal and external failures. Internal failures arethose that the microservice brings on itself: for example, code bugsthat aren’t caught by proper testing lead to bad deploys, causing out‐ages that affect the entire ecosystem. External catastrophes such asdatacenter outages and or poor configuration management acrossthe ecosystem lead to outages that affect the availability of everymicroservice and the entire organization.

We can quite adequately (though not exhaustively) prepare for fail‐ure scenarios and potential catastrophes. Identifying failure and cat‐astrophe scenarios is the first requirement of building a fault-tolerant, production-ready microservice. After these scenarios havebeen identified, the hard work of strategizing and planning for whenthey will occur begins. This must happen at every level of the micro‐service ecosystem, and any shared strategies should be communica‐ted across the organization so that mitigation is standardized andpredictable.

Standardization of failure mitigation and resolution at the organiza‐tional level means that incidents and outages of individual microser‐vices, infrastructure components, or the ecosystem as a whole needto be wrapped into carefully executed, easily understandable proce‐dures. Incident response procedures need to be handled in a coordi‐nated, planned, and thoroughly communicated manner. If incidentsand outages are handled in this way, and the structure of incidentresponse is well defined, organizations can avoid lengthy downtimesand protect the availability of the microservices. If every developerknows exactly what they are supposed to do during an outage,knows how to mitigate and resolve problems quickly and appropri‐ately, and knows how to escalate if an issue is beyond their capabili‐ties or control, the time to mitigation and time to resolution dropdrastically.

Fault Tolerance and Catastrophe Preparedness | 9

Making failures and catastrophes predictable means going one stepfurther after identifying failure and catastrophe scenarios and plan‐ning for them. It means forcing the microservices, the infrastruc‐ture, and the ecosystem to fail in any and all known ways to test theavailability of the entire system. We can accomplish this throughvarious types of resiliency testing. Code testing (including unit tests,regression tests, and integration tests) is the first step in testing forresiliency. The second step is load testing—this is when we testmicroservices and infrastructure components for their ability tohandle drastic changes in traffic. The last, most intense, and mostrelevant type of resiliency testing is chaos testing, in which failurescenarios are run (both scheduled and at randomly) on productionservices to ensure that microservices and infrastructure componentsare prepared for all known failure scenarios.

Fault-Tolerance and Catastrophe-PreparednessRequirements

Here are the requirements for building a fault-tolerant microservicethat is prepared for any catastrophe:

• Potential catastrophes and failure scenarios are identified andplanned for

• Single points of failure are identified and resolved• Failure detection and remediation strategies are in place• The microservice is tested for resiliency through code testing,

load testing, and chaos testing• Traffic is managed carefully in preparation for failure• Incident and outages are handled appropriately and

productively

PerformanceIn the context of the microservice ecosystem, scalability (which wecovered in brief detail earlier), is related to how many requests amicroservice can handle. Our next production-readiness principle,performance, refers to how well the microservice handles thoserequests. A performant microservice is one that handles requests


quickly, processes tasks efficiently, and properly utilizes resources(such as hardware and other infrastructure components).

A microservice that makes a large number of expensive networkcalls, for example, is not performant. Neither is a microservice thatprocesses and handles tasks synchronously in cases for whichasynchronous (non-blocking) task processing would increase theperformance and availability of the service. Identifying and archi‐tecting away these performance problems is a strict production-readiness requirement.

Similarly, dedicating a large number of resources (like CPU) to amicroservice that doesn’t utilize them is inefficient. Inefficiencyreduces performance—if it’s not clear at the microservice level inevery case, it’s painful and costly at the ecosystem level. Underutil‐ized hardware resources affect the bottom line, and hardware is notcheap. There’s a fine line between underutilization and propercapacity planning, and so the two must be planned and understoodtogether to avoid compromising the availability of any microserviceand to keep the costs associated with underutilization reasonable.

Performance RequirementsHere are the requirements of building a performant microservice:

• Appropriate service-level agreements (SLAs) for availability• Proper task handling and processing• Efficient utilization of resources

MonitoringAnother principle necessary for guaranteeing microservice availabil‐ity is proper microservice monitoring. Good monitoring has threecomponents: (1) proper logging of all important and relevant infor‐mation, (2) useful graphical displays (dashboards) that are easilyunderstood by any developer in the company and that accuratelyreflect the health of the services, and (3) alerting on key metrics thatis effective and actionable.

Logging belongs and begins in the codebase of each microservice.Determining precisely what information needs to be logged will dif‐

Monitoring | 11

fer for each service, but the goal of logging is quite simple: whenfaced with a bug—even one from many deployments in the past—you want and need your logging to be such that you can determinefrom the logs exactly what went wrong and where things fell apart.Versioning microservices (and their endpoints) is discouraged inmicroservice architecture, so you won’t have a precise version torefer to in which to find any bugs or problems. Code is revised fre‐quently, deployments happen multiple times per week, features areadded constantly, and dependencies are constantly changing, butlogs will remain the same, preserving the information needed topinpoint any problems. Just ensure that your logs contain the infor‐mation necessary to determine possible problems.

All key metrics (such as hardware utilization, database connections,responses and average response times, and the status of API end‐points) should be graphically displayed in real time on an easilyaccessible dashboard. Dashboards are an important component ofbuilding a well-monitored, production-ready microservice; theymake it easy to determine the health of a microservice with oneglance, and enable developers to detect strange patterns and anoma‐lies that might not be extreme enough to trigger alerting thresholds.When used wisely, dashboards allow developers to determinewhether a microservice is working correctly simply by looking at thedashboard, but developers should not need to watch the dashboardin order to detect incidents and outages.

The actual detection of failures is accomplished through alerting. Allkey metrics can be alerted on, including (at the very least) CPU andRAM utilization, number of file descriptors, number of databaseconnections, the SLA of the service, requests and responses, the sta‐tus of API endpoints, errors and exceptions, the health of the serv‐ice’s dependencies, information about any database(s), and thenumber of tasks being processed (if applicable).

Normal, warning, and critical thresholds can be set for each of thesekey metrics, and any deviation from the norm (i.e., hitting the warn‐ing or critical thresholds) should trigger an alert to the developerswho are on-call for the service. The best thresholds are signal-providing: high enough to avoid noise, but low enough to catch anyand all real problems.

Alerts need to be useful and actionable. A non-actionable alert is nota useful alert, and a waste of engineering hours. Every actionable


alert should be accompanied by a runbook. For example, if an alertis triggered on a high number of exceptions of a certain type, thereneeds to be a runbook containing mitigation strategies that any on-call developer can refer to while attempting to resolve the problem.

Monitoring RequirementsFollowing are the requirements of building a properly-monitoredmicroservice:

• Proper logging and tracing throughout the stack• Well-designed dashboards that are easy to understand and

accurately reflect the health of the service• Effective, actionable alerting accompanied by runbooks• Implementing and maintaining an on-call rotation

DocumentationMicroservice architecture carries the potential for increased techni‐cal debt—it’s one of the key tradeoffs that come with adoptingmicroservices. As a rule, technical debt tends to increase with devel‐oper velocity: the more quickly a service can be iterated on,changed, and deployed, the more frequently shortcuts and patcheswill be put into place. Organizational clarity and structure aroundthe documentation and understanding of a microservice cut throughthis technical debt and shave off a lot of the confusion, lack ofawareness, and lack of architectural comprehension that tend toaccompany it.

Reducing technical debt isn’t the only reason to make good docu‐mentation a production-readiness principle—doing so would makeit somewhat of an afterthought (an important afterthought, but anafterthought nonetheless). No, just like each of the otherproduction-readiness standards, documentation and its counterpart(understanding) directly and measurably influence the availabilityof a microservice.

To see why this is true, we can think about how teams of developerswork together and share their knowledge of a microservice. You cando this yourself by getting one of your development teams togetherin a room, standing in front of a whiteboard, and asking them to

Documentation | 13

illustrate the architecture and all important details of the service. Ipromise you’ll be surprised by the result of this exercise, and you’llmost likely find that knowledge and understanding of the service isnot cohesive or coherent across the group. One developer will knowone thing about the application that nobody else does, whereas asecond developer will have such a different understanding of themicroservice that you will wonder if he is even contributing to thesame codebase. When it’s time for code changes to be reviewed,technologies to be swapped, or features to be added, the lack ofalignment of knowledge and understanding will lead to the designand/or evolution of microservices that are not production-ready,containing serious flaws that undermine the service’s ability to relia‐bly serve production traffic.

This confusion and the problems that it creates can be successfullyand rather easily avoided by requiring that every microservice fol‐low a very strictly standardized set of documentation requirements.The best documentation contains all the essential knowledge (facts)about a microservice, including an architecture diagram, anonboarding and development guide, details about the request flowand any API endpoints, and an on-call runbook for each of the serv‐ice’s alerts.

You can accomplish understanding of a microservice in severalways. The first is by performing the aforementioned exercise: stickthe development team in a conference room, and ask them to white‐board the architecture of the service. Thanks to increased developervelocity, microservices change radically at different times through‐out their lifecycle. By making these architecture reviews part of eachteam’s process and scheduling them regularly, you can guaranteethat knowledge and understanding about any changes in the micro‐service will be disseminated to the entire team.

To cover the second aspect of microservice understanding, we needto jump up by one level of abstraction and consider the production-readiness standards themselves. A great deal of microservice under‐standing is captured by determining whether a microservice isproduction-ready and where it stands with regard to theproduction-readiness standards and their individual requirements.You can achieve this in a myriad of ways, one of which is runningaudits of whether a microservice meets the requirements, and thencreating a roadmap for the service detailing how to bring it to aproduction-ready state. Checking the requirements can also be


automated across the organization. We’ll dive into other aspects ofthis in more detail in the next section on the implementation ofproduction-readiness standards in an organization that has adoptedmicroservice architecture.

Documentation RequirementsHere are the requirements of building a well-documented microser‐vice:

• Thorough, updated, and centralized documentation containingall of the relevant and essential information about the micro‐service

• Organizational understanding at the developer, team, and eco‐system levels

Implementing Production-ReadinessWe now have a set of standards that apply to every microservice inany microservice ecosystem, each with its own set of specificrequirements. Any microservice that satisfies these requirementscan be trusted to serve production traffic and preserve a high levelof availability.

Now that we have the production-readiness standards, the questionthat remains is how we can implement them in a specialized, real-world microservice ecosystem. Going from principle to practice,and applying theory to real-world applications always presents somesignificant level of difficulty. However, the power of theseproduction-readiness standards and the requirements they imposelies in their remarkable applicability and strict granularity: they areboth general enough to apply to any ecosystem yet specific enoughto provide concrete strategies for implementation.

Standardization requires buy-in from all levels of the organization,and needs to be adopted and driven both from the top-down andfrom the bottom-up. At the executive and leadership (managerialand technical) levels, these principles need to be driven and sup‐ported as architectural requirements for the engineering organiza‐tion. On the ground floor, within individual development teams,standardization needs to be embraced and implemented. Impor‐

Implementing Production-Readiness | 15

tantly, standardization needs to be seen and communicated not as ahindrance or gate to development and deployment, but as a guidefor production-ready development and deployment.

Many developers might resist standardization. After all, they mightargue, isn’t the point of adopting microservice architecture to pro‐vide greater developer velocity, freedom, and productivity? Theanswer to these sorts of objections is not to deny that the adoptionof microservice architecture brings freedom and velocity to develop‐ment teams, but to agree and point out that that is exactly whyproduction-readiness standards need to be in place. Developervelocity and productivity grind to a halt whenever an outage bringsa service down, whenever a bad deploy compromises the availabilityof a microservice’s clients and dependencies, whenever a failure thatcould have been avoided with proper resiliency testing brings theentire microservice ecosystem down. If we’ve learned anything inthe past 50 years of software development, it’s that standardizationbrings freedom and reduces entropy. As Brooks says in The MythicalMan-Month, perhaps the greatest collection of essays on the practiceof software engineering, “form is liberating.”

After the engineering organization has adopted and agreed to followproduction-readiness standards, the next step is to evaluate andelaborate on each standard’s requirements. The requirements pre‐sented here are very general and need the addition of context,organization-specific details, and implementation strategies. Whatremains to be done is to work through each production-readinessstandard and its requirements, and then figure out how eachrequirement can be implemented across the engineering organiza‐tion. For example, if the organization’s microservice ecosystem has aself-service deployment tool, implementing a stable and reliabledeployment process needs to be communicated in terms of theinternal deployment tool and how it works. Rebuilding internaltools and/or adding features to them can also come out of this exer‐cise.

The actual implementation of the requirements and determiningwhether a given microservice meets them can be done by the devel‐opers themselves, by team leads, by management, or by DevOpsengineers, or by site reliability engineers. At both Uber and the sev‐eral other companies I know that have adopted production-readiness standardization, the implementation and enforcement ofthe production-readiness standards is driven by the site reliability


engineering (SRE) organizations. SREs are, typically, responsible forthe availability of the services, and so driving these standards acrossthe microservice ecosystem fits in quite well with existing responsi‐bilities. That isn’t to say that the developers or development teamshave no responsibility for ensuring their services are productionready, rather, SREs inform, drive, and enforce production-readinesswithin the microservice ecosystem, and the responsibility of imple‐mentation falls on both the SREs embedded within developmentteams and on the developers themselves.

Building and maintaining a production-ready microservice ecosys‐tem is not an easy challenge to undertake, but the rewards are greatand the impact is reflected in the increased availability of eachmicroservice. Implementing production-readiness standards andtheir requirements provides measurable results and allows develop‐ment teams work knowing that the services they depend on aretrustworthy, stable, reliable, fault-tolerant, performant, monitored,documented, and prepared for any catastrophe.

Implementing Production-Readiness | 17

About the AuthorSusan J. Fowler is a site reliability engineer at Uber, where she runsa production-readiness initiative across all Uber microservices andembeds within business-critical teams to bring their services to aproduction-ready state. She worked on application platforms andinfrastructure at several small startups before joining Uber, andbefore that, studied particle physics at Penn, where she searched forsupersymmetry and designed hardware for the ATLAS and CMSdetectors.

Documents

Microservices in Production - O'Reilly Media · Microservices in Production Introduction Although the adoption of microservice architecture brings consider‐ able freedom to developers,