Cloud Security @ Netflix
October 25, 2013
Jay Zarfoss (Cloud Security Guy @ Netflix)

Cloud Security At Netflix, October 2013



Netflix Cloud Security Architecture





This presentation
• What it covers:
– A discussion of what it means to fit security into the Netflix cloud universe
– A description of the past, present, and future Netflix cloud security architecture
• What it (mostly) skips:
– The broader Netflix culture and architecture
– For general cloud topics, see Adrian Cockcroft’s SlideShare at www.slideshare.net/adrianco
– For general culture, see www.slideshare.net/netflix


Netflix Company Profile (now via self-service*)

Instructions: Find your favorite BASH terminal and type the following:

> UPDATED_SIZE=$(curl -sL ir.netflix.com | perl -ne 's/&nbsp;/ /g; if(/\d+ million members in \d+ countries/){print "$&";}')

> echo "Netflix is the world's leading Internet subscription service for enjoying TV and movies, with more than ${UPDATED_SIZE}"

*No whining; remember that you’ll never again need to wait for me to update this slide like you had to wait for database access when you started at your last job.


Our Cloudy Culture*

No waiting, Decoupled, Agile, Ephemeral, Chaotic, Open Source, Dynamic, NoSQL, Self-Service Freedom, Decentralized, Unsynchronized, Redundant, Resilient, Rapid, *aaS

*These are not terms normally associated with security or security architectures, yet we adopt all of them for security development, with some perspective (of course).


“But how can you trust the Cloud?”
• This is simply an old question rephrased for the new generation of computing.
– How can you trust the CPU?
– How can you trust the OS?
• Security design often requires trust of the lower layer.
– Even though they’ve all let us down at some point before.
– And “trust” does not mean “blind faith.”


“But we have special requirements”
• Frankly, they’re probably not that special
– You can fail pretty much any requirement with or without using cloud methodologies
– 67% of 670 surveyed companies fail PCI compliance*
• The core AWS services (EC2, S3, ELB) are PCI DSS 2.0 compliant**
– It’s generally assumed that the more exotic features (DynamoDB) will gain compliance sooner rather than later
– So why not offload some of that compliance work?

*http://www.informationweek.com/security/management/67-of-companies-fail-credit-card-securit/229401946
**http://www.slideshare.net/CloudPassage/aws-slides-pci-20130124


The Security Conflict
• Goal: prevent us from hurting ourselves without preventing us from moving quickly and staying flexible.


Perspective, Perspective, Perspective

• No one will worry about you getting hurt playing paintball in a bomb disposal suit. But then, you’ll almost certainly lose the game.
• Bomb technicians don’t wear paintball suits, even if they are easier to work in.


Further Security Caveats

Technology alone will never prevent malicious insiders from doing damage. (Never has this sentiment been more relevant.)

Smart professionals will use safer tools when they’re available (so let’s give them those tools)!


What do good tools look like?
• Intuitive yet powerful GUIs that shield you from stumbling over the secrets
– Integrate with single sign-on to keep out your kids and track you down if (when) you screw up
• Powerful APIs to do just about everything…
– Except what there’s no legitimate use case for


Reflections on Better APIs

The cloud offers incredible APIs, so developers can call upon new hardware with a single line of code.

With great power comes great responsibility.


Packets from the Sky
Don’t worry, it’s just rain…

• Your own trust in software running on a cloud instance should ideally be predicated on some cryptographically authenticated material.
– Ironically, your cloud provider wants to do the same thing, since they don’t want you repudiating your bill…
• Not long ago, there was no way to do this other than deploying these keys yourself in your own build pipeline.
– Thus, your security was only as nimble as your build and deployment system. Maybe OK. Probably much slower than you want or need it to be.


Deploying AWS Keys, the Legacy Way
“That was in the before time, in the long long ago…” (Alright, it was 2011.)

Presumably, your machines in the cloud are running code that actually wants to do something against the cloud provider’s API, e.g., read from or write to a database. The legacy AWS paradigm is that all of these operations must be authenticated by signing (HMACing) them with access keys (Amazon’s term: “credentials”; my term: “AWS Keys”).
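Under the hood, that signing is an HMAC over a description of the request, keyed with the secret access key. A minimal JDK-only sketch, assuming HMAC-SHA256 and a made-up string-to-sign; the real AWS signature versions specify an exact canonical request format (and newer versions derive a signing key), which is deliberately left out here. `RequestSigner` is a hypothetical name:

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Simplified illustration of HMAC request signing with a secret access key.
// Assumption: HMAC-SHA256 over a hypothetical string-to-sign; NOT the AWS canonical format.
final class RequestSigner {
    private RequestSigner() {}

    // Client side: Base64-encoded HMAC-SHA256 tag over the string-to-sign.
    static String sign(String secretKey, String stringToSign) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secretKey.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            byte[] tag = mac.doFinal(stringToSign.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(tag);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("could not compute HMAC", e);
        }
    }

    // Server side: recompute with its own copy of the secret, compare in constant time.
    static boolean verify(String secretKey, String stringToSign, String signature) {
        byte[] expected = sign(secretKey, stringToSign).getBytes(StandardCharsets.UTF_8);
        byte[] actual = signature.getBytes(StandardCharsets.UTF_8);
        return MessageDigest.isEqual(expected, actual);
    }
}
```

Because the server repeats the same computation, anyone who holds the secret key can produce valid signatures, which is why it matters so much how that key reaches the machine.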

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.simpledb.AmazonSimpleDBClient;

// fortunately, AWS provides helper objects that do most of the work
BasicAWSCredentials cred = new BasicAWSCredentials("accessID", "secretKeyID");
AmazonSimpleDBClient client = new AmazonSimpleDBClient(cred);
// ugly HMAC-generating code safely tucked away in here somewhere
client.listDomains();

Sure. But how did “accessID” and “secretKeyID” get onto the machine?


1st Attempt: Stick them in a system property

• This… works… I guess… but what happens if the key gets out?*
– Rebake hundreds of AMIs
– Redeploy thousands of machines
• Requires all hands on deck and a big fiasco.

// if it makes you feel better, let’s pretend I obfuscated this
BasicAWSCredentials cred = new BasicAWSCredentials(
        System.getProperty("accessID"), System.getProperty("secretKeyID"));
AmazonSimpleDBClient client = new AmazonSimpleDBClient(cred);
client.listDomains();

*Thanks to supplemental security controls, like IP whitelisting, this may not be quite as horrible as it sounds. Still bad.


2nd Try: Load Keys at Runtime (Better?)
• Fits nicely into the cloud platform “whatever”-aaS layer.
– Security groups can enforce who can make the request.
– And it makes a pretty tidy REST call:

GET server/getAWSKey

<AWSKEY> <accessKeyID>open</accessKeyID> <secretKey>sesame</secretKey></AWSKEY>

• What happens when the subaccount associated with the key gets accidentally deleted?
– Update the key in the AWS console and then swap the key in the key servers (technically easy; it will still get your heart pumping when you do it for real, trust me!)
– You may still have to reboot a lot of machines! But why?
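For reference, the client side of that REST call can be very small. A sketch using the JDK’s built-in XML parser, assuming the response looks exactly like the `<AWSKEY>` example; `KeyServerResponse` is a hypothetical name, not part of any AWS or Netflix library:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical client-side parser for the key server's <AWSKEY> response.
final class KeyServerResponse {
    final String accessKeyId;
    final String secretKey;

    private KeyServerResponse(String accessKeyId, String secretKey) {
        this.accessKeyId = accessKeyId;
        this.secretKey = secretKey;
    }

    static KeyServerResponse parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // The response never needs a DTD; refusing one blocks external-entity tricks.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Element root = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            if (!root.getTagName().equals("AWSKEY")) {
                throw new IllegalArgumentException("unexpected root element " + root.getTagName());
            }
            return new KeyServerResponse(text(root, "accessKeyID"), text(root, "secretKey"));
        } catch (Exception e) {
            throw new IllegalArgumentException("malformed key server response", e);
        }
    }

    private static String text(Element root, String tag) {
        NodeList nodes = root.getElementsByTagName(tag);
        if (nodes.getLength() != 1) {
            throw new IllegalArgumentException("expected exactly one <" + tag + ">");
        }
        return nodes.item(0).getTextContent().trim();
    }
}
```

Nothing here is wrong by itself; the trouble lies in what the caller does with the parsed key afterwards.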


Objects, like peaches, are sticky. (Still delicious.)

RESTfulObj AWSKey = RESTService.get("server/getAWSKey");
BasicAWSCredentials cred = new BasicAWSCredentials(
        AWSKey.getAccessID(), AWSKey.getSecretKey());
AmazonSimpleDBClient client = new AmazonSimpleDBClient(cred);
client.listDomains();

The mindful object-oriented programmer will tend to keep this client object around rather than re-creating it all the time (trust me). Guess which object caches the AWS keys?
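The answer is the client. A stdlib-only model of the problem, where `KeyServer`, `SnapshotClient`, and `ProviderClient` are hypothetical stand-ins rather than AWS SDK classes: one client copies the key at construction time, the other asks for it on every use.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical key server whose key can be rotated while clients are running.
final class KeyServer {
    private final AtomicReference<String> current = new AtomicReference<>("key-v1");

    String get() { return current.get(); }

    void rotate(String newKey) { current.set(newKey); }
}

// Like a long-lived client built from BasicAWSCredentials: the key is copied once.
final class SnapshotClient {
    private final String key;

    SnapshotClient(KeyServer server) { this.key = server.get(); }

    String keyInUse() { return key; }
}

// Like a client built from a credentials provider: the key is looked up on every use.
final class ProviderClient {
    private final Supplier<String> keys;

    ProviderClient(Supplier<String> keys) { this.keys = keys; }

    String keyInUse() { return keys.get(); }
}
```

After a rotation, the snapshot client keeps presenting the old key until its process restarts, which is why swapping the key on the key servers can still mean rebooting a lot of machines.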


Promote Safer Foods.

// the provider paradigm dynamically asks for keys every time
AWSCredentialsProvider prov = new AWSCredentialsProvider() {
    public AWSCredentials getCredentials() {
        RESTfulObj AWSKey = RESTService.get("server/getAWSKey");
        return new BasicAWSCredentials(AWSKey.getAccessID(), AWSKey.getSecretKey());
    }

    // the interface also requires refresh(); nothing is cached, so nothing to do
    public void refresh() {}
};

AmazonSimpleDBClient client = new AmazonSimpleDBClient(prov);
client.listDomains();

No cached key (yay!). But… good luck chasing everyone around with a broomstick to make them write their code this way.


Systematically Enforce Refresh
Or: Revoke Privileges for Unsafe Food Altogether

• Only issue temporary keys that are good for a few hours (longer than your longest conceivable operation)
– AWS mechanism to do this: the AWS Security Token Service (AWSSecurityTokenService)

GET server/getAWSKey

<AWSKEY> <accessKeyID>open</accessKeyID> <secretKey>sesame</secretKey> <expires>1352083995</expires></AWSKEY>

• Simple, but with powerful consequences: keys accidentally written to logs, or a lost backup, stop being useful once they expire.
– Disadvantages? (I would argue materially none.)
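A client-side cache for these temporary keys only has to honor `<expires>`. A minimal sketch, assuming the response fields shown above (real STS credentials also carry a session token, omitted here as on the slide); `TempKey` and `RefreshingKeyCache` are hypothetical names, and the refresh margin is a parameter you would tune:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

// Hypothetical shape of the key server's temporary-key response; <expires> is epoch seconds.
final class TempKey {
    final String accessKeyId;
    final String secretKey;
    final Instant expires;

    TempKey(String accessKeyId, String secretKey, long expiresEpochSeconds) {
        this.accessKeyId = accessKeyId;
        this.secretKey = secretKey;
        this.expires = Instant.ofEpochSecond(expiresEpochSeconds);
    }
}

// Hands out a cached temporary key and fetches a fresh one once the current key is
// within `margin` of expiring, so no caller keeps using a key past its lifetime.
final class RefreshingKeyCache {
    private final Supplier<TempKey> fetcher; // e.g. wraps GET server/getAWSKey (hypothetical)
    private final Supplier<Instant> now;     // Instant::now in production; injectable for tests
    private final Duration margin;
    private TempKey cached;

    RefreshingKeyCache(Supplier<TempKey> fetcher, Supplier<Instant> now, Duration margin) {
        this.fetcher = fetcher;
        this.now = now;
        this.margin = margin;
    }

    synchronized TempKey get() {
        if (cached == null || !now.get().isBefore(cached.expires.minus(margin))) {
            cached = fetcher.get();
        }
        return cached;
    }
}
```

Injecting both the clock and the fetcher keeps the refresh rule easy to test without waiting hours for a real key to expire.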


Abracadabra at Runtime (Best)
http://aws.typepad.com/aws/aws-iam/

• June 11, 2012: Amazon introduces temporary AWS security credentials via the metadata service
– On-demand access keys via the Amazon API; they expire quickly
– Effectively, Amazon is hosting the key server and only giving keys to your cloud instances
– Predefined “roles” determine the permissions of the keys
– Wish we had had this when we first moved to the cloud
• Still useful to have your own key server. Why?
– For one, developers will chase you down with pitchforks if they can’t run against the cloud API at their desks. (And they’d have every right to…)
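The AWS SDKs read these role credentials from the metadata service automatically (for example, `InstanceProfileCredentialsProvider` in the AWS SDK for Java 1.x), but the mechanism itself is plain HTTP against a link-local address. A sketch using `java.net.http` (Java 11+); the role name is whatever your launch configuration assigned, `InstanceMetadataCredentials` is a hypothetical name, and the session-token handshake shown is IMDSv2, which AWS added after this talk and which some accounts now require:

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Reads temporary role credentials from the EC2 instance metadata service.
// Only works from inside an EC2 instance that has an IAM role attached.
final class InstanceMetadataCredentials {
    private static final String BASE = "http://169.254.169.254/latest";

    private InstanceMetadataCredentials() {}

    static URI credentialsUri(String roleName) {
        return URI.create(BASE + "/meta-data/iam/security-credentials/" + roleName);
    }

    // Returns the raw JSON document (AccessKeyId, SecretAccessKey, Token, Expiration, ...).
    static String fetchRaw(String roleName) throws IOException, InterruptedException {
        HttpClient http = HttpClient.newBuilder().connectTimeout(Duration.ofSeconds(2)).build();
        // IMDSv2: obtain a short-lived session token with a PUT first.
        HttpRequest tokenRequest = HttpRequest.newBuilder(URI.create(BASE + "/api/token"))
                .header("X-aws-ec2-metadata-token-ttl-seconds", "21600")
                .PUT(HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<String> token = http.send(tokenRequest, HttpResponse.BodyHandlers.ofString());
        if (token.statusCode() != 200) {
            throw new IOException("token request returned HTTP " + token.statusCode());
        }
        HttpRequest credentialsRequest = HttpRequest.newBuilder(credentialsUri(roleName))
                .header("X-aws-ec2-metadata-token", token.body())
                .GET()
                .build();
        HttpResponse<String> response = http.send(credentialsRequest, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("metadata service returned HTTP " + response.statusCode());
        }
        return response.body();
    }
}
```

Parsing the returned JSON is left to whatever JSON library the service already uses; the `Expiration` field plays the same role as `<expires>` in the home-grown key server.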


IAM Role configuration via Asgard

View into an Asgard launch configuration that assigns a role, which determines the permissions of the keys an instance will receive under the IAM paradigm.


New Ways to Hide All Your Keys
http://aws.typepad.com/aws/2013/04/variables-in-aws-access-control-policies.html

// one ACL to rule them all (policy variables require "Version": "2012-10-17")
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:PutObject"],
    "Resource": ["arn:aws:s3:::mybucket/myclientsoftware.${aws:userid}.keystore"]
  }]
}

• April 3, 2013: Amazon introduces variables in AWS access control policies.
– Provides an obvious place to store sensitive nuggets your software needs to work

Just apply the right role to your auto-scale group and you’re done!
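IAM substitutes `${aws:userid}` itself when it evaluates the policy, so there is nothing to implement; the sketch below only models the outcome, to show that each principal can reach exactly one keystore object. `PolicyVariables` is a hypothetical helper and the user IDs are made up; the exact form of `aws:userid` depends on the principal type (IAM user, role session, and so on), so check the IAM policy-variable documentation before relying on a particular object name:

```java
// Illustration only: IAM performs this substitution itself when it evaluates the policy.
final class PolicyVariables {
    static final String RESOURCE_TEMPLATE =
            "arn:aws:s3:::mybucket/myclientsoftware.${aws:userid}.keystore";

    private PolicyVariables() {}

    // Expand the ${aws:userid} variable for one calling principal.
    static String resolve(String template, String userId) {
        return template.replace("${aws:userid}", userId);
    }

    // Models the single Allow statement: exact match on the resolved object ARN
    // (the action check for s3:GetObject / s3:PutObject is not modeled).
    static boolean allows(String userId, String requestedObjectArn) {
        return resolve(RESOURCE_TEMPLATE, userId).equals(requestedObjectArn);
    }
}
```

For example, a caller whose `aws:userid` is `AIDAEXAMPLEALICE` matches only `myclientsoftware.AIDAEXAMPLEALICE.keystore`, and nobody else’s keystore.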


Secure Bootstrapping (Still) Frustrating

// at least now there’s a reasonable place to put the file
-Djavax.net.ssl.keyStore=<file smartly loaded from ACL-limited store>

• Options are better today with the new ACL rules
• But…
– What if I want to hot-swap these? Wait, you mean I have to write them to a file and restart?! Yuck!!
  • An unfortunate artifact of software designed for the datacenter, where machines stay put for a long time
– One mistake in the AWS console and my keystore file (complete with SSL private keys) is open to the world?
  • If your eyelid isn’t twitching, it should be.
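One way around the restart, sketched under the assumption that the keystore is PKCS#12 and that its bytes and password have already been fetched from the ACL-limited store: build the `SSLContext` in code instead of through `-Djavax.net.ssl.keyStore`, and keep it behind a reference that can be swapped. `RuntimeTls` and `SwappableTls` are hypothetical names:

```java
import java.io.ByteArrayInputStream;
import java.security.KeyStore;
import java.util.concurrent.atomic.AtomicReference;
import javax.net.ssl.KeyManagerFactory;
import javax.net.ssl.SSLContext;

// Builds a TLS context from in-memory PKCS#12 bytes; nothing is written to disk.
final class RuntimeTls {
    private RuntimeTls() {}

    static SSLContext fromPkcs12(byte[] keystoreBytes, char[] password) {
        try {
            KeyStore ks = KeyStore.getInstance("PKCS12");
            ks.load(new ByteArrayInputStream(keystoreBytes), password);
            KeyManagerFactory kmf = KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm());
            kmf.init(ks, password);
            SSLContext ctx = SSLContext.getInstance("TLS");
            ctx.init(kmf.getKeyManagers(), null, null); // default trust managers and SecureRandom
            return ctx;
        } catch (Exception e) {
            throw new IllegalStateException("could not build TLS context from keystore", e);
        }
    }
}

// Holds the current context; swap() replaces it without restarting the JVM.
final class SwappableTls {
    private final AtomicReference<SSLContext> current = new AtomicReference<>();

    void swap(byte[] keystoreBytes, char[] password) {
        // Build first, publish second: a bad keystore leaves the old context in place.
        current.set(RuntimeTls.fromPkcs12(keystoreBytes, password));
    }

    SSLContext get() {
        SSLContext ctx = current.get();
        if (ctx == null) throw new IllegalStateException("no keystore loaded yet");
        return ctx;
    }
}
```

Only connections created after a swap use the new keys, and only if the code creating them calls `get()` each time (for example, to obtain a fresh `SSLSocketFactory`); connections that are already open keep their old context.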


So… we still want our own tools

• (Most) developers don’t want to think about where this key lives. So let’s have the library worry about that for them.

• Some keys are more important than others
– “oh, shit” vs. “OH SHIT”

// whenever you find yourself writing code like this,
// I hope you’re asking yourself whether the keys are
// left sitting on the kitchen counter
cipherContext = factory.getCipherContext("algorithm", "keyName");
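A hedged sketch of what “let the library worry about it” can look like: callers name a key, and a resolver decides where it lives (local cache, key server, or HSM) without handing key bytes to application code. `KeyResolver`, `NamedKeyCrypto`, and `InMemoryKeys` are hypothetical and are not Cryptex’s API; AES-GCM is simply a reasonable default choice for the illustration:

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Hypothetical: decides where a named key lives (local cache, key server, HSM, ...).
interface KeyResolver {
    SecretKey resolve(String keyName);
}

// Callers pass a key name; they never see key bytes.
final class NamedKeyCrypto {
    private static final int IV_BYTES = 12;
    private static final int TAG_BITS = 128;
    private final KeyResolver keys;
    private final SecureRandom random = new SecureRandom();

    NamedKeyCrypto(KeyResolver keys) { this.keys = keys; }

    // AES-GCM with a fresh random IV; output is IV || ciphertext || tag.
    byte[] encrypt(String keyName, byte[] plaintext) {
        try {
            byte[] iv = new byte[IV_BYTES];
            random.nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, keys.resolve(keyName), new GCMParameterSpec(TAG_BITS, iv));
            byte[] body = cipher.doFinal(plaintext);
            byte[] out = Arrays.copyOf(iv, IV_BYTES + body.length);
            System.arraycopy(body, 0, out, IV_BYTES, body.length);
            return out;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("encryption failed", e);
        }
    }

    // Throws if the message was tampered with or sealed under a different key.
    byte[] decrypt(String keyName, byte[] sealed) {
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, keys.resolve(keyName),
                    new GCMParameterSpec(TAG_BITS, sealed, 0, IV_BYTES));
            return cipher.doFinal(sealed, IV_BYTES, sealed.length - IV_BYTES);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("decryption failed", e);
        }
    }
}

// Demo resolver that keeps freshly generated AES-256 keys in memory.
final class InMemoryKeys implements KeyResolver {
    private final Map<String, SecretKey> keys = new ConcurrentHashMap<>();

    static InMemoryKeys generate(String... names) {
        InMemoryKeys store = new InMemoryKeys();
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(256);
            for (String name : names) {
                store.keys.put(name, generator.generateKey());
            }
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
        return store;
    }

    @Override
    public SecretKey resolve(String keyName) {
        SecretKey key = keys.get(keyName);
        if (key == null) throw new IllegalArgumentException("unknown key: " + keyName);
        return key;
    }
}
```

Replacing `InMemoryKeys` with a resolver backed by a key service changes where the key lives without changing a single caller.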


Custom Cloud Key Management

Don’t leave your child in the middle of a busy intersection.


Netflix Key Management

• All sorts of business cases require keying material:
– Password reset tokens
– Encrypting sensitive databases
– Authenticating Netflix Ready Devices (NRDs)
– DRM keys (I’m not having the DRM debate here, so don’t try)
– Symmetric, asymmetric, HMAC keys, …
• So how do you handle those keys?
– Depends. (Paintballs or pipe bombs?)


Cryptex Service
• Without going into too much detail, Cryptex is our *aaS for key management, with associated client libraries in Java and Python.
– We worry about where the keys live
  • So you (Mr./Ms. Big Data person) don’t have to
– Flexible, dynamic, auto-scaling, fast-moving
  • Except when it’s not supposed to be
• Future/ongoing work
– Better integrating this into datacenter-y software that wants fixed, static things is a constant challenge and requires lots of new plumbing. Wanna help?!


Variations in Key Handling

• Low: Key is provided to the edge service instance
– Virtually unlimited throughput; resistant to any backend service outages
• Medium: Key stays on the single-purpose Netflix key management servers; each crypto operation is a REST call (small data is better!)
– Key never lives on a customer-facing server (one nasty bug or “oops” won’t cause exposure)
• High: Keys live in specialized hardware (HSM)
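For the medium tier, “small data is better” suggests hashing locally and sending only a fixed-size digest over the wire. A sketch, assuming a hypothetical `RemoteMac` interface that stands for one REST round trip to the key management servers; note that it computes MAC(key, SHA-256(payload)) rather than MAC(key, payload), so every verifier must follow the same convention:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical: one REST round trip to the key management servers, which hold the key.
interface RemoteMac {
    byte[] mac(String keyName, byte[] data);
}

final class DigestThenMac {
    private final RemoteMac remote;

    DigestThenMac(RemoteMac remote) { this.remote = remote; }

    // Hash the (possibly large) payload locally so only 32 bytes cross the network.
    // This yields MAC(key, SHA-256(payload)), not MAC(key, payload).
    byte[] tag(String keyName, byte[] payload) {
        return remote.mac(keyName, sha256(payload));
    }

    static byte[] sha256(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is required on every Java platform", e);
        }
    }
}
```

Whatever the payload size, each request to the key servers carries the same 32 bytes, which keeps per-operation latency predictable.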


Netflix Global Crypto Ops/Sec

• Low (< 1 ms latency)
– It’s a (really) big number. And highly variable.
• Medium (~4 ms latency)
– Tens of thousands of operations/sec at daily peak (the number is shrinking as we get smarter with our protocols, which favor low-sensitivity keys)
• High (~10 ms latency)
– Over one thousand operations/sec at daily peak


(Fairly) Common Dialogue

Big Data Developer: I’m working on a super-cool new feature, X. It will use some crypto and need some keys. Which key sensitivity do I want?

Me: Tell me the story of what happens when we lose the key somehow.


Various Key Loss Scenarios

Low: We’d rotate the key via one button-push and customers wouldn’t notice an impact; minimal damage control.

Medium: We’d rotate the key and the whole team would have to work for a week straight cleaning up the mess created.

High: I don’t want to talk about it.

Let the cloud help you along the way… Early, automated detection combined with fast reaction means more keys can be low or medium sensitivity (which is less resource-intensive).

Design your new system to be able to use LOW keys for the bulk of the heavy lifting!!
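One-button rotation works best when every token or ciphertext records which key version produced it. A minimal sketch for HMAC-signed tokens such as password reset tokens, where `VersionedTokenSigner` and the `version.tag` token format are hypothetical: rotating adds a new current version while old tokens keep verifying, and revoking an old version invalidates everything it signed.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical low-sensitivity key ring for HMAC-signed tokens of the form "<version>.<tag>".
final class VersionedTokenSigner {
    private final Map<Integer, byte[]> keys = new HashMap<>();
    private int current = 0; // 0 means no key installed yet

    // "One button push": install a new key and make it current; old versions still verify.
    synchronized void rotate(byte[] newKey) {
        current++;
        keys.put(current, newKey.clone());
    }

    // Drop an old version; every token it signed stops verifying.
    synchronized void revoke(int version) {
        if (version == current) throw new IllegalArgumentException("rotate before revoking the current key");
        keys.remove(version);
    }

    synchronized String sign(String payload) {
        if (current == 0) throw new IllegalStateException("no key installed");
        return current + "." + tag(keys.get(current), payload);
    }

    synchronized boolean verify(String payload, String token) {
        int dot = token.indexOf('.');
        if (dot <= 0) return false;
        byte[] key;
        try {
            key = keys.get(Integer.parseInt(token.substring(0, dot)));
        } catch (NumberFormatException e) {
            return false;
        }
        if (key == null) return false; // unknown or revoked version
        byte[] expected = tag(key, payload).getBytes(StandardCharsets.US_ASCII);
        byte[] actual = token.substring(dot + 1).getBytes(StandardCharsets.US_ASCII);
        return MessageDigest.isEqual(expected, actual);
    }

    private static String tag(byte[] key, String payload) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key, "HmacSHA256"));
            byte[] raw = mac.doFinal(payload.getBytes(StandardCharsets.UTF_8));
            return Base64.getUrlEncoder().withoutPadding().encodeToString(raw);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("could not compute HMAC", e);
        }
    }
}
```

Because verification only consults keys still on the ring, a leaked low-sensitivity key is contained by one `rotate` plus one `revoke`, without touching any caller.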


AWS CloudHSM
http://aws.typepad.com/aws/2013/03/aws-cloud-hsm-secure-key-storage-and-cryptographic-operations.html

• March 26, 2013: AWS announces the availability of SafeNet-manufactured CloudHSMs to the general cloud-computing public.
– Old-skool industry-standard security solution… without the need for your IT people to babysit
– All the right acronyms: FIPS 140-2, CC EAL 4+
– Amazon has no way to recover your keys (do please take care not to lose them)
– Single tenant
• This is the new home for our high-sensitivity keys.


Some Final Thoughts…


Why are we sharing?
• In a sense, Netflix benefits when other cloud users and cloud vendors follow common paths.
– Problems will invariably pop up, but when they affect industry-standard practices, everyone shares the load of getting them fixed.
• Example of a great benefit of common practice
– TLS has become the industry standard for secure transport, but has had its lumps lately (BEAST, RC4)*
– Because it affects everyone, we’re all motivated to look for solutions and share those costs

*http://blog.cryptographyengineering.com/2011/12/whats-deal-with-rc4.html


Security and flexibility don’t always have to be at odds with each other…

Security can fit in a fast-changing environment where flexibility is paramount.

The trick is to leverage the same flexibility to allow security to keep up.


Sound Interesting? We’re hiring!