10
process.st https://www.process.st/2016/02/it-process/ Here’s What Goes Catastrophically Wrong When You Don’t Follow Your IT Process On November 18, 2014, Microsoft Azure went dark. Thousands of the cloud computing service’s customers experienced downtime on their sites for over 9 hours. When they flooded Microsoft customer support to ask what the hell had gone wrong, customers learned that it wasn’t some glitch, natural disaster, or devious hacking scheme. It was pure human error .

Heres what goes catastrophically wrong when you dont follow your it process

Embed Size (px)

Citation preview

process.st https://www.process.st/2016/02/it-process/

Here’s What Goes Catastrophically Wrong When You Don’tFollow Your IT Process

On November 18, 2014, Microsoft Azure went dark. Thousands of the cloud computing service’s customersexperienced downtime on their sites for over 9 hours. When they flooded Microsoft customer support to ask what thehell had gone wrong, customers learned that it wasn’t some glitch, natural disaster, or devious hacking scheme.

It was pure human error.

Microsoft deployed an update without running through the standard operating guidelines specifically laid out for thisscenario. Instead of rigorously checking that the update was good to go, engineers shipped it on the assumption itwas bug-free. This wasn’t just this one-time incident—engineers at Azure regularly violated standard operatingprocedure because 99.9% of the time, it was a total waste of their resources.

And it’s not just Azure that does this—habitual violation of process leads to a ton of mistakes in all kinds of ITarenas.

When I was 16, I landed my first “real” job which so happened to be in IT. I had just passed my CCNA with the helpof my dad (apparently the youngest person in Australia at the time) and was hungry to get my hands on some realtechnology.

But once I started, I found there was rigid process everywhere, and configuring a live Windows 2000 server was atotally different experience to tinkering with my fathers machines at home, with serious consequences for notfollowing the process.

As Microsoft hardware designer Dan Luu writes,

How many technical postmortems start off with “someone skipped some steps because they’reinefficient”, e.g., “the programmer force pushed a bad config or bad code because they were surenothing could go wrong and skipped staging/testing?”

His answer: a lot. In fact, a lot of major IT mess-ups—the high-profile Sony hacks, cloud computing outages,downtime on major websites—are totally preventable. They happen because IT people don’t follow the processesput in place to prevent them from happening. What’s more, they don’t even care, until it becomes a catastrophe.

This behavior is actually a widely-studied sociological phenomenon called the “normalization of deviance”—and ithas huge implications for your computing security.

Why Deviance Becomes the Norm

The term “normalization of deviance” was coined by Columbia University sociology professor Diane Vaughan. Shedescribes it as: “a gradual process that leads to a situation where unacceptable practices or standards becomeacceptable, and flagrant violations of procedure become normal—despite that fact that everyone involved knowsbetter.”

Essentially, it’s when negligence becomes the norm. In IT scenarios, it’s often most noticeable when a new personjoins the team. Here’s common scenario, in Dan Luu’s words:

[New person joins the team, discovers bad IT processes]New person: WTF WTF WTF WTF WTFOld hands: Yeah we know, we’re concerned about it.New person: WTF WTF wTF wtf wtf w…[New person gets used to bad IT processes]

[New person #2 joins]New person #2: WTF WTF WTF WTFNew person #1: Yeah we know. We’re concerned about it.

“The thing that’s really insidious here,” Dan writes, “is that people will really buy into the WTF idea, and they canspread it elsewhere for the duration of their career.” Things that should be eradicated are totally normalized, whetherit’s not wearing gloves in a hospital or not double-checking that you have the right inventory inside PlayStationpackaging.

As security expert Bruce Schneier argues, normalization of deviance has huge implications in the field of IT,especially because it’s easy for problems to go unnoticed until they reach a large scale.

Here are a couple scenarios of what can happen when IT deviates from processes.

1. Hackers Steal Your Customers’ Data

For a companies like the social sharing tool Buffer, trust is a crucial. Customers trust Buffer with their passwordsand login info to essentially post things to Facebook and Twitter for them. And when Buffer was hacked in 2013,customers experienced perhaps the most flagrant violation of that trust: Buffer used their logins to post spam.

Buffer was extremely open about the security breach, sending customers regular updates about the app’sfunctionality and ultimately, a post-mortem explaining what had happened—including a list of security measuresthey put in place after the fact.

The real problem with the hack was that once hackers got into Buffer’s system, the data was right there and totallyusable since Buffer hadn’t encrypted any of the data.

This was of the most important measures Buffer implemented after the fact. As they told customers, “we have addedencryption of OAuth access tokens and we have changed all API calls to use an added security parameter.” Problemwas, it was too late.

Deviant Behavior: Not Encrypting Data

It’s the classic “out of sight, out of mind” problem, the same way that you don’t need health insurance until you’rediagnosed with a rare skin disease. If you’re planning on getting hacked, it’s not worth the effort to encrypt data. Butthen again, who plans on getting hacked, or getting a rare skin disease?

It’s basic common sense that every IT department needs to follow. They really need regularly check up on andenforce these processes.

As Auth0 says, normalization of deviance in security is totally normal, especially when companies are focused onother things. “Moreover, security gets de-prioritized on your roadmap because it’s hard to build and isn’t immediatelyurgent.”

They table security to tackle more pressing issues. By consistently doing so, they get accustomed to sweeping thisdeviant behavior under the rug to the extent that they don’t even consider it deviant anymore. It’s totally normalized,even though any outsider would recognize that it’s bad practice and needs to be fixed immediately.

What You Can Do

This de-prioritization is why a lot of companies elect to opt out of important security steps like multi-factorauthentication or encrypting data, which saved the day in the Patreon hack. Because the data was encrypted, noone’s credit card information was stolen.

An easy way to circumvent this deviant behavior is to outsource the issue to another provider. Tools like Auth0 takecare of your security needs, providing code for features like multi-factor authentication, so you don’t have to worryabout it.

But even if your company has these in place, it’s important to regularly run through checklists to make sureeverything is up to date. Try running Process Street‘s checklist on network security management every coupleweeks, to provide accountability on a company-wide scale.

2. You Literally Lose Your Data

In 2010, Zurich Insurance was fined a record 2.28 million pounds for losing personal details on 46,000policyholders, which happened because they lost an unencrypted data tape in a routine transfer.

The real problem in this case—the one that caused the Financial Services Authority real concern—was that theyweren’t adhering to guidelines that are in place. As the Telegraph reported, Zurich Insurance’s big blunder was that itdidn’t have “controls in place to prevent the lost data being used for financial crime.”

It’s sloppy behavior, but more common than you might think. Gartner estimates 28 percent of corporate data isstored only on endpoint devices. You lose the device, you lose the data.

Deviant Behavior: Relying Solely on Endpoint Devices

The company certainly knew they should have had these “controls in place,” but somewhere along the line they hadgrown accustomed to not having them. The deviant behavior—only relying on endpoint devices—was normalizedover time.

Zurich Insurance became so used to doing this the wrong way—the way that a new hire might say “WTF WTFWTF”—but everyone became accustomed to it.

The company didn’t do anything about it because it had fallen off their radar. Since the bad behavior wasnormalized, there wasn’t a pressing need to change their processes (or so they thought). The bad process onlybecame a problem once it was too late.

What You Can Do

Changing deviant behavior is hard, but sometimes it’s just a matter of constantly reminding teams about values andreinforcing your IT processes. Take a look at Process Street’s Client Data Backup Best Practices checklist—havingteams run through this every couple of weeks puts your company’s processes back on everyone’s radar.

This checklist in particular is a reminder that even the small stuff that you might think is irrelevant, like verifying thatyou have the right kind of equipment, labeling tapes properly, or setting up recurring reminders to automaticallyperform a differential backup, can make a huge difference.

Deviant behavior will actually feel deviant, which makes teams want to address it.

3. Your Server Crashes When You Need it Most

In 2011, Italian designer Margherita Maccapini Missoni, whose clothes often put you back for upwards of $4,000,developed a limited-edition line at Target. The release of the line was highly anticipated, and while it made long linesoutside brick-and-mortar stores, it caused Target’s website to come crashing down.

“It’s a little bit embarrassing for one of the nation’s largest retailers to have a Web site that can’t support a rush —it’s not like they’re any strangers to rushes,” said Ian Schafer, chief executive of the digital marketing firm DeepFocus.

They managed to handle a lot of other rushes, but not this one. Why? That testing went out the window. Theyfigured it wouldn’t be a big problem, so they neglected to perform the tests they normally would for a big event likeCyber Monday.

There are a lot of reasons servers crash, but it’s no excuse for not preparing or preventing the scenario fromhappening. It’s a big problem with customer retention. 74% of visitors leave your site after hitting a 404 error page—and they’re much less likely to return in the future.

Deviant Behavior: Neglecting Server Maintenance

The problem here is that IT department knew that they should be doing better, but they ignored it. It’s reallycommon. As CEO Steve Klein of StatusPage.io says, there’s a good reason people put off doing servermaintenance, or creating an external status page:

The idea of creating and maintaining a status page outside of your infrastructure is like a cold thatwon’t go away. You should probably see the doctor, but keep putting it off until suddenly you’re on

your way to the ER coughing up a lung. Similarly, a status page is one of those things you shouldcreate before you experience downtime, but end up building at 4am in the morning when your hostingprovider goes down.

The behavior is deviant, but everyone is so accustomed to it, that they don’t take action. You don’t care that behavioris deviant until it becomes a problem. And oftentimes, like in the case of the Target crash, they don’t prepare forthose emergencies.

What You Can Do

IT should regularly run through a checklist to make sure they’re not missing any steps of routine servermaintenance. Check out Process Street’s server maintenance checklist for a template you can customize to yourown needs. Setting regular reminders will help your team do the important stuff like setting up an external statuspage so that if your site goes down, your visitors won’t encounter that dreaded 404 error.

It’s especially important to run through this checklist if you’re anticipating any changes to your server, like a surge intraffic.

When Paper magazine released nude photos of Kim Kardashian, they knew they ran a risk of their serverspractically melting. Their response? Anything but deviance—they strictly adhered to processes, rigorously checkingthat the site would stay functional. You can test the server load capacity of your store with tools like LoadImpact.comor Blitz.io.

Bottom Line: Battle the Culture of Complacency

It’s really easy to get complacent, especially when we’re so reliant upon technology to do the work for us. It’s easy tolet things slide, but it can have huge repercussions. There’s no excuse for human error, especially when there are somany automation tools at your disposal to prevent them.

Checklists are a powerful tool to hedge against human error and the normalization of deviance. They’re a toolsurgeon Atul Gawande uses to combat the very same problem he’s witnessed in his operating room. Essentially,

deploying checklists is a reminder that our deviant behavior is actually deviant, and not normal. It prevents us fromgetting accustomed to bad behavior.

This changes the very culture behind adhering to processes. As Atul writes in the Checklist Manifesto, checklistshelp fundamentally change the mindset teams enter when they do their work. Rather than just getting complacent,and normalizing deviant behavior, they are reminded to be vigilant and disciplined.

That’s because checklists are about more than just adhering to rules. As Atul writes, “Just ticking boxes is not theultimate goal here. Embracing a culture of teamwork and discipline is.” Checklists restrain our natural instinct tonormalize deviance. In the case of IT, it’s easy to get complacent, especially when everything appears to be runningjust fine.

So if you don’t want to see your company’s name in the headlines of the next high-profile hack, checklists might bethe best tool in your belt.