
Why Pods?

This document was written in June 2014, by Scott Hansma & Ian Varley, to explain why we scale with Pods (and Superpods) in Salesforce infrastructure. sfdc.co/pods

New! Now in presentation form and video form.

For noobs: What are Pods (& Superpods)?

Pods are identical[footnoteRef:0] collections of hardware & software that support a discrete subset of our customer base. Any customer organization (org) exists on exactly one Pod, and only moves between them via migration or split. We call Pods "instances" to our customers, and the name of the instance is a visible part of the URL they use (e.g. http://na1.salesforce.com)[footnoteRef:1]. As of mid-2014 we have ~50 Pods. Read more. [0: They're not really identical. They're all beautiful snowflakes. But that is a story for another time.] [1: It can be masked by custom domains (like org62.my.salesforce.com) but not all customers do that. In fact, hardly any do.]

Superpods are sets of several Pods, in the same location, along with shared services used by all of those Pods (like DNS, Hadoop, etc). Superpods aren't directly visible to customers. Pods can move[footnoteRef:2] between Superpods during Pod migrations. "Superpod" is a confusing term; it also refers to the design of how multiple Pods and services are laid out, and to just the shared services; more on that below: Naming Is Hard. [2: Move logically, not physically; we don't actually move the machines in a Pod migration, just the data.]

Every bit of hardware and data in a production Pod has an identical mirror-image copy in a DR (Disaster Recovery) data center, somewhere else in the world. (So technically, an instance is composed of two Pods in different data centers, but only one is ever running. We haven't done many failovers; we're getting better.)

Why do we have Pods?

To be honest, we kind of chose the Pod design accidentally; our databases were growing faster than we could scale them, and splitting them was the only answer. (For a fun look back in time, check out this deck from the NA1 split back in 2006.)

Our Pods started to proliferate, and we realized we'd hit a limit on vertical growth. The choice was to either double down on the Pod strategy, or start powering the Oracle databases with magical uranium, because shit was getting crazy[footnoteRef:3]. So we doubled down. [3: Basically, we exceeded our capacity to scale and had a painful few months. There are 2 distinct historical developments: the move from E25k to Linux 8-node RAC, and the proliferation of new Pods, recognizing that we'd hit a limit on vertical growth of a Pod. Ask an old-timer about it, fun times.]

Nowadays, the Pod strategy is our explicit choice for scaling. Why? We like it for 3 big reasons:

1. The database is the center of gravity
2. Fault domains work
3. We like predictability

Let's explore each one in turn.

One: The Database Is The Center of Gravity

Data is the center of the Salesforce universe; specifically, our relational database backbone (which causeth us both joy and sorrow). The transactional correctness of Salesforce, as a product, relies directly on having each org's master data stored in a single database, which is shared across many orgs, and has near-perfect availability and low latency from every other server in the Pod.

We share one relational database across many customers (multi-tenancy is the cornerstone of our architecture[footnoteRef:4]). But we don't share one database across all customers, because that would be too big. Relational databases aren't designed to scale horizontally; they have practical max size limits. For Salesforce's Oracle databases, we've found the practical happy size limit to be around 30TB, run by ~8 beefy RAC nodes and an attached SAN. That configuration runs about 10K small orgs, give or take a few big ones. [4: Good video on that multi-tenant magic from 2009, by former CTO Craig Weissman: Salesforce Multi-Tenant Architecture]

Fortunately, we have a clean way to shard our data: by customer. So instead of scaling up, we scale out, with multiple databases. Each customer lives on one database; when the DB gets to 60% capacity, we stop letting new orgs sign up there (and fill the rest of the space via organic growth, from the existing orgs on that Pod).
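To make the sharding rule concrete, here's a minimal sketch of org-to-Pod routing and the 60% sign-up cutoff. The class and function names (Pod, route_request, pick_pod_for_signup) and the pool of numbers not stated above are hypothetical; this is an illustration, not our actual provisioning code.

# Hypothetical sketch of per-customer sharding and the 60% sign-up cutoff.

class Pod:
    def __init__(self, name, db_capacity_tb=30.0):
        self.name = name                     # e.g. "na1" -- also the instance name in the URL
        self.db_capacity_tb = db_capacity_tb
        self.db_used_tb = 0.0

    def db_utilization(self):
        return self.db_used_tb / self.db_capacity_tb

    def open_for_signups(self):
        # Stop placing *new* orgs once the DB hits 60% of its practical limit;
        # the remaining headroom is reserved for organic growth of existing orgs.
        return self.db_utilization() < 0.60

# Every org lives on exactly one Pod, so request routing is a simple lookup.
org_to_pod = {}   # org_id -> Pod

def route_request(org_id):
    return org_to_pod[org_id]

def pick_pod_for_signup(pods):
    # New orgs can only land on Pods that are still open for sign-ups.
    candidates = [p for p in pods if p.open_for_signups()]
    return min(candidates, key=Pod.db_utilization) if candidates else None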

So why should all the other infrastructure components (like app servers, networks, file servers, etc) also be sharded into Pods, like the DB? There are two answers: one for stateless services, and one for stateful ones.

Stateless Services

For stateless services (like The Core App), the main concern is database connectivity. Every customer request does multiple DB reads & writes, so we keep a pool of database connections open at all times (more on that here). We've talked about hooking the same app servers up to multiple databases (in a project code-named 2-headed-chicken) but for a lot of stupid reasons, it's harder than it should be.

So instead, we cluster an appropriate amount of compute (about 30 app servers) around a fixed size of database (8 RAC Nodes), and it works pretty well. We also size memcached (which runs on the app servers[footnoteRef:5]) accordingly. The same goes for MQ and other services that provide compute on top of the database. [5: If memcached lived on, say, 2-4 boxes, rather than the whole app tier, it would reduce response variance a lot. We should do that.]
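To illustrate that fixed compute-to-database ratio, here's a rough sketch of the shape of one Pod's stateless tier. The 30-app-server-to-8-RAC-node ratio comes from the text above; the connection-pool size and everything else are assumptions for illustration only.

# Illustrative shape of one Pod's stateless tier, clustered around a fixed-size DB.
POD_TEMPLATE = {
    "rac_nodes": 8,                   # the fixed-size database at the center
    "app_servers": 30,                # stateless Core App servers sized to match
    "db_connections_per_app": 40,     # hypothetical always-open pool per app server
    "memcached": "co-located on the app servers",   # sized along with the app tier
}

def total_db_connections(pod=POD_TEMPLATE):
    # Why the app tier is sized against the DB: every app server keeps its pool open,
    # so the 8-node RAC cluster has to absorb app_servers * pool_size connections.
    return pod["app_servers"] * pod["db_connections_per_app"]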

Now, the core app isn't just Java; it also includes a cool million lines of PL/SQL that run on the database, and must be in sync with the app. So if your app servers talked to many databases, those databases would all need to be running the exact same version of PL/SQL, down to the e-release level. That would be a pain in the ass.[footnoteRef:6] (This is part of what's hard about 2-headed chicken.) [6: Of course if the database were completely presented as a service with forward and backward compatibility then you wouldn't have that problem; instead you'd have a worse problem: trying to replace the standard database abstraction layer (SQL) with a "better" one. Many have died trying.]

So, keeping all of the stateless processing logic in the app in orbit around the relational DB makes sense.

Stateful Services

For stateful services (in particular, for data stores like FFX and HBase), there's a more pressing reason to orbit the relational database: DR (disaster recovery). If we fail over to another data center, we must be able to reliably fail over all the data, and we need to be able to prove that it's correct. For individual data items, these other stores are their own master; but for the overall org state, the relational database is the brain. If you could fail over some parts of the data but not others, you'd get into a situation that's very difficult to reason about (and likely has app servers talking to data stores in another data center!). The Keystone project is our attempt to reason about this explicitly, but we're not there yet.

Many data stores can't take writes from two data centers at the same time. So if several Pods shared one such larger data store and a single Pod's DB failed, you would need to fail over all the Pods that used that shared store, and that makes the DR domain larger than we'd like it to be -- you'd end up having to DR a lot of Pods when you only wanted to DR one, because you can't disentangle the arbitrary network of databases. That'd be confusing and bad.

One area where we currently violate this is File Force, and the "TI-Don't-Copy" feature, which avoids copying files to sandboxes and instead points at those same files in the production org. In those cases, a single sandbox org's dataset spans both its sandbox Pod and its original source Pod, and that leads to exactly this type of grief: it makes it really hard to reason about DR.

There are also some services that are in the gray area between stateful and stateless. One example is search, which is not technically a System of Record (SOR) because it's just a transformed copy of the primary database (so you could make multiple copies, recreate it, etc[footnoteRef:7]). But, because it can't be recreated fast enough to deal with an outage, we have to treat it like SOR data. So here again, it makes sense for search to orbit the single relational database it indexes. [7: Though in reality, losing it for big orgs would be tantamount to a service outage because it would take several days to recreate it.]

Two: Fault Domains Work

Look at this screen grab from trust.salesforce.com:

You see how there are no columns where all the icons show degraded performance or a service disruption? Exactly: it doesn't happen. (Often.[footnoteRef:8]) [8: It does happen sometimes, most recently in April 2014 when a DNS provider failure brought all Pods down. (This was outside our control, but of course we still should have had a strategy that didn't depend on a single one -- and we do now.)]

When something goes wrong with Salesforce's service, it's really important to our business that any disruption is localized; we don't want all our eggs (AOV[footnoteRef:9]) in one basket (Pod). And if you have radically uncorrelated systems, you have a much greater shot at doing this. The worst shit in the world could happen to one Pod, and we wouldn't kill all the golden geese. (We'd have a lot of egg on our faces, though. (Sorry folks, just a yolk.)) [9: AOV = Annual Order Value, i.e. the money our customers (including renewals) pay us. For this and other handy acronyms, see here.]

This kind of protection isn't just about failures of software or infrastructure: it's also about service protection. Customers can do sophisticated things on our platform, like run massive reports and Apex triggers and Pig pipelines. Part of service protection is that when we do find a customer abusing the system, the impact of that degradation is limited. No customer can hose customers in other Pods, no matter how hard they try[footnoteRef:10]. [10: They most certainly can hose other customers in their own Pod (and org). Preventing that is the work of the service protection team.]

"But wait!" you say. "What about Superpods? Do we really have fault isolation, if a failure in a shared service like NTP can hose many Pods at once?" Yeah, you're right. Superpods are a compromise, so we can divorce our real estate strategy from our scaling strategy. But look at the evidence: the number of times a screw-up at the Superpod level has caused customer issues is a tiny fraction of the number of times Pod-level problems have.

In the end, nothing provides perfect fault isolation; to paraphrase Randy, "Earth is a single point of failure." You could argue it's pointless to have a smaller fault zone (Pods) because, hey, a data center can still fail, right? But the reality is about probability: servers fail more often than Pods, which fail more often than data centers, which fail more often than the entire planet. There will always be some larger fault domain, but that doesn't make the smaller ones pointless.

Three: Goldilocks & The Three Bears of Predictability

Here's a story. Goldilocks went to the bears' house. One of the bears was a total asshole. He made people test every damn thing in lab environments, and was always requiring a zillion approvals, and imposing change moratoriums and shit. He said that nothing should change, ever. Nobody liked this, but it was very predictable.

Another bear was a total stoner. He didn't test anything, changed stuff randomly in production, and then went on vacation. What a dick. That might be OK for Etsy, buddy.

But the third bear, she was pretty cool. She understood that the infrastructure would have to change over time, and you can't stop it. You need to add new services, roll in new hardware, and build cool new stuff. But you can't play fast and loose with a platform that runs critical services for hundreds of thousands of businesses.

The third bear had a strategy called the Immutable Pod Design Pattern. It goes like this: once you get a known-good Pod (or Superpod) design, you pretty much stick with it. When capacity starts to be an issue, you just stamp out another instance; you don't change what it means to be an instance. So, e.g., when you need 8 more nodes of Oracle RAC capacity, you stamp out another tried-and-true 8-node cluster (and all the supporting services around it) instead of expanding your existing cluster from 8 to 16 and praying that 16 nodes works in production.
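Here's a minimal sketch of that rule of thumb. The function, threshold, and template values are hypothetical; in reality this is a human capacity-planning decision, not a script.

# Sketch of the Immutable Pod Design Pattern: when a known-good design runs out of
# headroom, stamp out another copy of the template instead of mutating the existing one.
KNOWN_GOOD_POD_TEMPLATE = {"rac_nodes": 8, "app_servers": 30, "db_limit_tb": 30}

def handle_capacity_pressure(pod_metrics, signup_cutoff=0.60):
    """pod_metrics: e.g. {"db_utilization": 0.72, "apt_ok": True} (illustrative)."""
    if pod_metrics["db_utilization"] < signup_cutoff and pod_metrics["apt_ok"]:
        return "no action: this Pod still has headroom"
    # Don't grow the existing RAC cluster from 8 to 16 nodes and pray; keep the unit fixed.
    return ("stamp out a new Pod from KNOWN_GOOD_POD_TEMPLATE (or split this one), "
            "leaving the existing Pod's design unchanged")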

Now, of course, sometimes you do need to make a change (like, say, adding another Fileforce buddy pair, or rolling out a new kind of Search service). But, in these cases, you do it carefully: you dark launch[footnoteRef:11] it in one instance, you watch the stats like a hawk, and only when it's known to be stable in production do you roll it out everywhere. You don't treat infra changes lightly, because in a heavily coupled, complex system like ours, they can have unpredictable effects. [11: A dark launch is one where you expose new functionality in a way that allows you to verify it, without turning it on for all live customer requests. That can mean either exposing it to a small subset of customer requests, or launching it in a parallel, unobserved way.]

This is what we aim for at Salesforce. We're a multi-billion-dollar company, and we do need predictability. Pods let us avoid science experiments in our critical production services, so we can stamp out repeatable, known-good designs.[footnoteRef:12] [12: We learned this lesson anew with the autobuild infrastructure (which runs all our bazillion unit tests every time someone checks in a change to a component, like Hodor). We increased capacity, only to be greeted with raging cascading failures.]

Having multiple Pods gives us more ways to canary[footnoteRef:13] our changes. We roll out first to GS0, then the sandboxes, then NA1 in R0; then, over the following weeks, we roll out R1 and R2 to the rest of the Pods. Glaring issues at any stage give us time to react and fix the problems before the majority of our customers see them. We can do this with infrastructure changes, too: roll out HBase capacity adds to one Pod first, verify that the sky didn't fall, and then do the rest. This is a good thing. [13: As in "canary in a coal mine" -- the idea that you make a change to a small portion of infrastructure first to see if it dies a horrible painful death, because the horrible painful death of a canary is much better than that of a miner.]
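A rough sketch of that staged rollout follows. The stage names come from the paragraph above; the functions and gating logic are illustrative stand-ins for our real release tooling.

# Sketch of staged (canary) rollout across Pods.
ROLLOUT_STAGES = [
    ["GS0"],            # internal canary instance
    ["sandboxes"],      # customer sandboxes, not production orgs
    ["NA1"],            # first production Pod (release wave R0)
    ["R1 Pods"],        # next wave of production Pods
    ["R2 Pods"],        # the rest
]

def apply_change(change, pod):
    print(f"applying {change} to {pod}")    # stand-in for the real deploy tooling

def healthy_after_soak(stage):
    return True                             # stand-in for watching the stats like a hawk

def roll_out(change, stages=ROLLOUT_STAGES):
    for stage in stages:
        for pod in stage:
            apply_change(change, pod)
        if not healthy_after_soak(stage):
            raise RuntimeError(f"halting rollout: {change} looks bad in {stage}")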

Now, this approach is not about complete immutability. In particular, some services are themselves intended to be horizontally scalable, like Keystone and HBase. It's fine for horizontally scalable services to have different server counts per Pod. It would be madness to say that because NA5 needs some extra File Force or HBase space, we have to add File Force or HBase capacity everywhere.

But for now, adding a new rack of servers to any of these services follows the same careful rollout model,[footnoteRef:14] because it's not just your service you're changing; it's the chemistry of the whole Superpod (the network, memcached, the WAN pipe, etc). Even with a scalable data store like HBase, the philosophy would be to pick some expected max size (say, 300TB), and if we're pushing the limits of that, consider splitting the Pod, just like we would if the DB got too big or APT got too high. (And maybe over time, that threshold changes to 350TB or 400TB, but it doesn't suddenly jump to 30PB.) [14: Or, as Scott would say, when you add capacity, a puppy dies.]

We'll always have variability between Pods. But our goal is to track and understand that variability. Exceptions and irregularities need to justify their existence (i.e. be explained in an obvious, visible way). If the justification for variance isn't good enough, we generally want to eradicate it in the name of predictability.

Salesforce as Distributed System: Splits & Migrations

One way to think about the Pod & Superpod design is as one giant, massive distributed database. In some distributed databases (like HBase), when the load on a single node in the system gets too high, the node is often split (for HBase, this means that when a Region gets too big, it's automatically split into two smaller ones).

At the macro level for Salesforce, this same process happens, but at the Pod level: when a Pod gets too big, we split it. We haven't done many lately, but in the next year we have over a dozen splits planned. It's critical for this process to be easy and repeatable. Right now, it takes months and hundreds of people. :(

The other option we have in this process is org migration. Because of our multi-tenant architecture and the way traditional relational databases work, it's quite difficult to migrate an org from one Pod to another without significant downtime. Recent projects like Buffalo[footnoteRef:15] have made huge strides in improving this, but we're far from the holy grail of seamless, instant, zero-touch migration. Keystone aims to be a leg up in that fight. If we could do fast, seamless migration, we really shouldn't ever have to do a split again.

Comment by bobby.white: Why not migrate lots of small orgs to other pods quickly? Then that would free up space for the big ones.

Comment by ivarley: Unfortunately, org migration today is a manual process that requires human effort on the part of our ops engineers, as well as coordination directly with the client (to change their URLs, etc). As such, the effort to migrate a small org is not (today) much smaller than the effort to migrate a big org.

Comment by ivarley: _Marked as resolved_

Comment by bobby.white: I was purely thinking about "transaction time". If you migrate as one "unit of work", the smaller the unit of work, the better. Why does this need to involve humans? We need to automate this fully to make it feasible.

Comment by ivarley: _Re-opened_ Part of the reason it's manual today is that there's no way to get around some amount of downtime for the customer being migrated (because we have tables, like sharing, that are so big and active, we can't put the required triggers on them; even a no-op trigger pushes system performance past the limits). That will hopefully change when we have integrated 11g's EBR (Edition Based Redefinition) into our schema process, and can put triggers on these tables only for a subset of customers. Reach out to the Core Database Services team in the DB & Core Cloud for more on this, particularly [email protected]

[15: Buffalo is a recursive acronym for BUFFalo A Live Org. Also this is important for you to read.]

What goes in the Pod, vs the Superpod?

The default answer is that most stuff should go in the Pod, because of the 3 reasons listed above. Things that have to stay in-pod are:

- Services that provide system of record data storage
- Services that are transactionally coupled with our Systems of Record (Oracle, FFX, HBase, etc)
- Services that are built in such a way that they can't be shared across DBs (like memcached, QPID, etc)

Things that can live at the Superpod level include:

- Much of the network (routers, load balancers, etc)
- Hadoop, because it's not a SOR, and it's a batch system with no SLA
- Insights, because it's not a SOR, and needs to be horizontally scalable and agile
- The ops & M&M stacks (Gigantor, Ajna, Kerberos infrastructure, etc)
- Other logging and monitoring functions
- Various other shared services like Raiden, UMPS, LA, etc

And actually, some of this stuff lives at an even broader level than Superpod: some of it lives at the Data Center level, like iDB. You can see a lot more detail about what goes where in the Pod & Superpod link library.
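As a rule of thumb, the placement decision distills down to something like the sketch below. The helper function and trait names are hypothetical; the real decision involves many more factors than this.

# Rule-of-thumb placement for a service, distilled from the lists above (illustrative only).
def placement(service):
    """service: dict of boolean traits describing the service in question."""
    if service.get("is_system_of_record") or service.get("transactionally_coupled_to_sor"):
        return "Pod"
    if service.get("cannot_be_shared_across_dbs"):      # e.g. memcached, QPID
        return "Pod"
    if service.get("shared_by_all_pods_in_a_site"):     # e.g. Hadoop, monitoring, network gear
        return "Superpod"
    return "Data Center (or broader)"                   # e.g. iDB

# Example: search orbits the one DB it indexes, so it lands in the Pod.
print(placement({"transactionally_coupled_to_sor": True}))   # -> "Pod"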

Pods Also Suck

From this document, you might think that Pods are all sweetness and light. They're not; there are definitely downsides we should be up front about. Here are a few.

They prohibit an elastic (AWS-like) model

One of the great infrastructure trends in recent years is elasticity; AWS (including EC2, S3, etc) is the prime example. As traffic increases, you bring on more capacity transparently, and as it decreases, you shed it. At a high level, this is exactly what we offer to our customers (we're a SaaS platform), but we don't have it ourselves on the implementation side. Tying the service to specific metal makes the service more vulnerable to outages and security breaches (unlike VMs, which can move around).

Truly global shared data is hard

There are a small number of things (like the global set of users, orgs, ISV packages, etc) that need to be synchronized across all Pods at the database level. This prompted us to build a feature in ~2004 called Org Replication, which syncs database tables (ALL_USERS, ALL_ORGANIZATIONS) and makes them identical everywhere. The move from ~4 Pods to ~50 Pods caused a rewrite in 2011, and we'll need another rewrite when the number of Pods crosses 100, because the process is still inherently O(n^2) in the number of instances (each Pod's rows have to end up on every other Pod).

Inherently scalable services suffer by being partitioned

Some services, like Keystone and HBase, are built on a horizontally scalable architecture; they don't just tolerate being deployed in larger installations, they actually thrive on it. Deploying a 20-node HBase cluster has relatively high overhead (5 master nodes vs 17 data nodes), whereas a larger cluster has lower overhead (100 data nodes would still need only 5 master nodes). The degree of parallelism is generally higher in a larger cluster, and fluctuations in load are amortized. The Pod architecture forces us to deploy services like this in many small clusters rather than the preferred approach of fewer, bigger clusters.

Getting a holistic view of your services is hard

Splunk indexes are per-Pod; Graphite too. You can combine several in a single view, but you face a slowdown with each Pod you add, so you're discouraged from doing that. And looking individually at 50+ different graphs is madness. This is part of why we are so shite at looking at things in production: it's really difficult because of Pods.

We can't take advantage of our many Pods for High Availability (HA)

Pods are the unit of availability. But we have (un)scheduled downtime on Pods, so things in Pods are not HA. While we'd love to get to HA within the Pods (and have projects, like Active Data Guard and ReadOnlyDB, to help with that), until then we should be smarter about services that can't go down (e.g. Login/Identity).

Modifying global state is hard

If you want to make a change to something in black tab, or add a rule in Gater, you have to do it once per Pod; there are no master switches. As the rest of this document makes clear, this is a pain in the ass by design, for the protection of the service. But it's still a pain in the ass. This is a problem that automation can fix.

Without modern automation, it's a big snowflake party

We started down the Pods road before we understood the importance of avoiding snowflake infrastructure. But, due to the Pod design, we're stuck with a LOT of snowflakes today, and automation is an uphill battle.
Naming Is Hard

The names "Pod" and "Superpod" (as described in this doc) are primarily internal usages. There are also some other words in use, so this section is a (possibly futile) attempt to make a little sense of it.

You may have heard the words "kingdom" and "estate", which relate to future Data Center Automation ideas:

"Kingdom" is the group of machines controlled by a single R2 controller. "Estate" is a group of services within a kingdom. It's also a security boundary: all services within an estate can talk to each other freely and directly to hosts (without any ACLs, etc).

So depending on how we implement it, a Superpod may be a kingdom, or a kingdom may contain multiple Superpods[footnoteRef:16]. For more on that, see here. [16: We haven't decided yet. Likely it would be 1:1. And a Pod may be a single estate, though more likely we'd have a few estates making up a Pod, something like hbase-na1, app-na1, db-na1 estates.]

For now, the external name for Pod is "instance". There's no external name for Superpod ("instance group" has been proposed but isn't widely used). Superpods are an implementation detail, so we don't talk about them (except when we do). There's been talk of banning "pod" and using "instance" internally, too.

People use the word Superpod to mean many different things:

1. A single collection of several specific Pods.
2. A generation of Superpod design (sp1, sp2, sp3, etc). We should call this "Superpod Generation".
3. The supporting services for a collection of Pods (e.g. Insights is part of the Superpod, not the Pod).
4. The collection of instances AND the supporting services (1 + 3 above).

This means that if someone says "We're going to build[footnoteRef:17] a Superpod!", it could mean one of two things, depending on what's already in place: [17: Also, "build" means 2 things: either physically rack and cable metal, or set up software on said metal. And "pod", to some people, means a room in the data center. Good times.]

- If the Pods are in place, it means "we're going to build the supporting services."
- If the Pods aren't in place, it means "we're going to build the supporting services and the Pods."

We also have the not-very-helpful addition of the HP Superpod to our marketing, which has exactly nothing to do with any of this. (Thanks, Marc!) That's the idea that one company could have a Pod all to itself. If it were just one org, that would be somewhat impractical (since our RAC node architecture is deeply predicated on an org living on exactly one RAC node, or at most two). But in reality, most big companies have a lot of individual orgs (business units, acquisitions, etc), so it's not quite as goofy as it sounds: their set of orgs would legitimately get service protection from the foibles of other companies. We're internally referring to things like this (HP Superpods, or other dedicated Pods) as "Blue Pods" to reduce the confusion.

Comment by vdevadhar: Starting 192, we have software support for an org's traffic going to as many RAC nodes as needed. Org 62 is going to need this in the next few months, because we are hitting capacity limits on 2 RAC nodes during quarter-end processing.