The Science DMZ: A Network Design Pattern for Data-Intensive Science
Jason Zurawski – [email protected], Science Engagement Engineer, ESnet, Lawrence Berkeley National Laboratory
Southern Partnership in Advanced Networking April 8th 2015
ESnet at a Glance
• High-speed national network, optimized for DOE science missions:
  – connecting 40 labs, plants and facilities with >100 networks (national and international)
  – $32.6M in FY14, 42 FTE
  – older than the commercial Internet, growing twice as fast
• $62M ARRA grant in 2009/2010 for 100G upgrade:
  – transition to a new era of optical networking
  – world's first 100G network at continental scale
• Culture of urgency:
  – 4 awards in past 3 years
  – R&D100 Award in FY13
  – "5 out of 5" for customer satisfaction in last review
  – Dedicated staff to support the mission of science
SC Supports Research at More than 300 Institutions Across the U.S.
[Map: DOE laboratories and universities across the United States]
The Office of Science supports: 27,000 Ph.D.s, graduate students, undergraduates, engineers, and technicians; 26,000 users of open-access facilities; 300 leading academic institutions; 17 DOE laboratories
Network as Infrastructure Instrument
ESnet Vision: Scientific progress will be completely unconstrained by the physical location of instruments, people, computational resources, or data.
Overview
• Science DMZ Motivation and Introduction
• Science DMZ Architecture
• Data Transfer Nodes & Applications
• Science DMZ Security
• User Engagement
• Wrap Up
Motivation
• Science & research is everywhere – the size of the school or endowment does not matter; there is a researcher at your facility right now attempting to use the network for a research activity
• Networks are an essential part of data-intensive science
  – Connect data sources to data analysis
  – Connect collaborators to each other
  – Enable machine-consumable interfaces to data and analysis resources (e.g. portals), automation, scale
• Performance is critical
  – Exponential data growth
  – Constant human factors (timelines for analysis, remote users)
  – Data movement and analysis must keep up
• Effective use of wide area (long-haul) networks by scientists has historically been difficult (the "Wizard Gap")
Big Science Now Comes in Small Packages …
…and is happening on your campus. Guaranteed.
Understanding Data Trends
[Chart: Data Scale (10 GB up to 100 PB) vs. Collaboration Scale – small collaboration scale (e.g. light and neutron sources), medium collaboration scale (e.g. HPC codes), large collaboration scale (e.g. LHC). A few large collaborations have internal software and networking organizations.]
Data Mobility in a Given Time Interval (Theoretical)
These tables are available at: http://fasterdata.es.net/fasterdata-home/requirements-and-expectations/
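The tables behind that link reduce to simple arithmetic: the time to move a data set is its volume divided by the achievable rate. A minimal sketch of that calculation (the sample volume and rates are illustrative, not values copied from the tables):

```python
# Theoretical data mobility: how long a given volume takes at a given rate,
# assuming the network path is the only constraint (no protocol/disk limits).

def transfer_time_hours(volume_bytes: float, rate_bps: float) -> float:
    """Time to move volume_bytes over a path of rate_bps (bits per second)."""
    return (volume_bytes * 8) / rate_bps / 3600

TB = 1e12
for rate_gbps in (1, 10, 100):
    hours = transfer_time_hours(100 * TB, rate_gbps * 1e9)
    print(f"100 TB over {rate_gbps:>3} Gbps: {hours:7.1f} hours")

# 100 TB over   1 Gbps:   222.2 hours
# 100 TB over  10 Gbps:    22.2 hours
# 100 TB over 100 Gbps:     2.2 hours
```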
The Central Role of the Network
• The very structure of modern science assumes science networks exist: high performance, feature rich, global scope
• What is “The Network” anyway?
  – “The Network” is the set of devices and applications involved in the use of a remote resource
    • This is not about supercomputer interconnects
    • This is about data flow from experiment to analysis, between facilities, etc.
  – User interfaces for “The Network” – portal, data transfer tool, workflow engine
  – Therefore, servers and applications must also be considered
• What is important? Ordered list:
  1. Correctness
  2. Consistency
  3. Performance
TCP – Ubiquitous and Fragile
• Networks provide connectivity between hosts – how do hosts see the network?
  – From an application’s perspective, the interface to “the other end” is a socket
  – Communication is between applications – mostly over TCP
• TCP – the fragile workhorse
  – TCP is (for very good reasons) timid – packet loss is interpreted as congestion
  – Packet loss in conjunction with latency is a performance killer
    • We can address the first; science hasn’t fixed the second (yet)
  – Like it or not, TCP is used for the vast majority of data transfer applications (more than 95% of ESnet traffic is TCP)
A small amount of packet loss makes a huge difference in TCP performance
[Chart: measured and theoretical TCP throughput vs. path length – Local (LAN), Metro Area, Regional, Continental, International – for Measured (TCP Reno), Measured (HTCP), Theoretical (TCP Reno), and Measured (no loss).]
With loss, high performance beyond metro distances is essentially impossible
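The theoretical curve follows the Mathis model, which bounds single-stream TCP Reno throughput at roughly (MSS/RTT)·(C/√p) for loss rate p. A small sketch of how a fixed loss rate punishes longer paths (the loss rate and RTT values are illustrative, not measurements from the chart):

```python
import math

# Mathis et al. approximation for single-stream TCP Reno throughput:
#   rate ~ (MSS / RTT) * (C / sqrt(p)),  with C ~ 1.22

def mathis_mbps(mss_bytes: int, rtt_s: float, loss: float, c: float = 1.22) -> float:
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss)) / 1e6

for label, rtt_ms in [("Local (LAN)", 1), ("Metro Area", 5), ("Regional", 20),
                      ("Continental", 80), ("International", 200)]:
    rate = mathis_mbps(1460, rtt_ms / 1000, 0.0001)  # 1-in-10,000 packet loss
    print(f"{label:>13} ({rtt_ms:>3} ms RTT): {rate:8.1f} Mbps")

# The same tiny loss rate that is harmless on a LAN (~1400 Mbps) caps an
# international path (~7 Mbps) -- loss plus latency is the performance killer.
```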
Let’s Talk Performance …
“In any large system, there is always something broken.” – Jon Postel
• Modern networks are occasionally designed to be one-size-fits-most
  – e.g. if you have ever heard the phrase “converged network”, the design is to facilitate CIA (Confidentiality, Integrity, Availability)
  – This is not bad for protecting the HVAC system from hackers.
• Causes of friction/packet loss:
  – Small buffers on the network gear and hosts
  – Incorrect application choice
  – Packet disruption caused by overzealous security
  – Congestion from herds of mice
• It all starts with knowing your users, and knowing your network
Putting A Solution Together
• Effective support for TCP-based data transfer
  – Design for correct, consistent, high-performance operation
  – Design for ease of troubleshooting
• Easy adoption (for all stakeholders) is critical
  – Large laboratories and universities have extensive IT deployments
  – Small universities/facilities have overworked/understaffed IT departments
  – Drastic change is prohibitively difficult
• Cybersecurity – defensible without compromising performance
  – Borrow ideas from traditional network security
  – Traditional DMZ
    • Separate enclave at network perimeter (“Demilitarized Zone”)
    • Specific location for external-facing services
    • Clean separation from internal network
  – Do the same thing for science – Science DMZ
The Science DMZ Superfecta
[Diagram: four quadrants – Dedicated Systems for Data Transfer, Network Architecture, Performance Testing & Measurement, Engagement with Network Users]
• Data Transfer Node
  – High performance
  – Configured for data transfer
  – Proper tools
• Science DMZ
  – Dedicated location for DTN
  – Proper security
  – Easy to deploy – no need to redesign the whole network
• perfSONAR
  – Enables fault isolation
  – Verify correct operation
  – Widely deployed in ESnet and other networks, as well as sites and facilities
• Engagement
  – Partnerships
  – Education & Consulting
  – Resources & Knowledgebase
Overview
• Science DMZ Motivation and Introduction
• Science DMZ Architecture
• Data Transfer Nodes & Applications
• Science DMZ Security
• User Engagement
• Wrap Up
Science DMZ Takes Many Forms
• There are a lot of ways to combine these things – it all depends on what you need to do
  – Small installation for a project or two
  – Facility inside a larger institution
  – Institutional capability serving multiple departments/divisions
  – Science capability that consumes a majority of the infrastructure
• Some of these are straightforward, others are less obvious
• Key point of concentration: eliminate sources of packet loss / packet friction
Legacy Method: Ad Hoc DTN Deployment
• This is often what gets tried first
• Data transfer node deployed where the owner has space
  – This is often the easiest thing to do at the time
  – Straightforward to turn on, hard to achieve performance
• If lucky, perfSONAR is at the border
  – This is a good start
  – Need a second one next to the DTN
• Entire LAN path has to be sized for data flows (is yours?)
• Entire LAN path becomes part of any troubleshooting exercise
• This usually fails to provide the necessary performance.
Ad Hoc DTN Deployment
Abstract Deployment
• Simplest approach: add-on to existing network infrastructure
  – All that is required is a port on the border router
  – Small footprint, pre-production commitment
• Easy to experiment with components and technologies
  – DTN prototyping
  – perfSONAR testing
• Limited scope makes security policy exceptions easy
  – Only allow traffic from partners (use ACLs – see the sketch after this list)
  – Add-on to production infrastructure – lower risk
  – Identify applications that are running (e.g. the DTN is not a general purpose machine – it does data transfer, and data transfer only)
• Start with a single user/use case. If it works for them in a pilot, you can expand
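A hedged illustration of the "only allow traffic from partners" idea – a minimal allow-list check using Python's standard ipaddress module. The prefixes and helper name are hypothetical; a real deployment expresses this same logic as ACL entries on the border router:

```python
# Minimal sketch of a partner allow-list, i.e. the logic a border-router
# ACL encodes. The prefixes below are placeholders, not real partner sites.
from ipaddress import ip_address, ip_network

PARTNER_PREFIXES = [
    ip_network("192.0.2.0/24"),      # hypothetical partner site A (TEST-NET-1)
    ip_network("198.51.100.0/24"),   # hypothetical partner site B (TEST-NET-2)
]

def allowed(src: str) -> bool:
    """Permit traffic only if the source falls inside a partner prefix."""
    addr = ip_address(src)
    return any(addr in net for net in PARTNER_PREFIXES)

print(allowed("192.0.2.17"))   # True  -- partner traffic reaches the DTN
print(allowed("203.0.113.5"))  # False -- everything else stops at the border
```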
Local And Wide Area Data Flows
Large Facility Deployment
• High-performance networking is assumed in this environment
  – Data flows between systems, between systems and storage, wide area, etc.
  – Global filesystem (GPFS, Lustre, etc.) often ties resources together
    • Portions of this may not run over Ethernet (e.g. IB)
    • Implications for Data Transfer Nodes – these are really ‘gateways’
• “Science DMZ” may not look like a discrete entity here
  – By the time you get through interconnecting all the resources, you end up with most of the network in the Science DMZ
  – This is as it should be – the point is appropriate deployment of tools, configuration, policy control, etc.
  – Can still employ security techniques to limit access (e.g. a bastion host to control logins)
• Office networks can look like an afterthought, but they aren’t
  – Deployed with appropriate security controls
  – Office infrastructure need not be sized for science traffic
Large Facility (HPC, etc.)
Non-R1 Campus
• This paradigm is not just for the big guys – there is a lot of value for smaller institutions with a smaller number of users
• Can be constructed with existing hardware, or small additions
  – Does not need to be 100G, or even 10G. Capacity doesn’t matter – we want to eliminate friction and packet loss
– The best way to do this is to isolate the important traffic from the enterprise
• Can be scoped to either the expected data volume of the science, or the availability of external facing resources (e.g. if your pipe to GPN is small – you don’t want a single user monopolizing it)
• Factors:
  – Are you comfortable with Layer 2 networking?
  – How rich is your cable/fiber plant?
Non-R1 Campus
Fiber Rich Environment
Non-R1 Campus
Layer 2 Switching
Common Threads • Two common threads exist in all these examples
• Accommodation of TCP
  – Wide area portion of data transfers traverses a purpose-built path
  – High performance devices that don’t drop packets
• Ability to test and verify
  – When problems arise (and they always will), they can be solved if the infrastructure is built correctly
  – Small device count makes it easier to find issues
  – Multiple test and measurement hosts provide multiple views of the data path
    • perfSONAR nodes at the site and in the WAN
    • perfSONAR nodes at the remote site
Overview
• Science DMZ Motivation and Introduction
• Science DMZ Architecture
• Data Transfer Nodes & Applications
• Science DMZ Security
• User Engagement
• Wrap Up
Dedicated Systems – Data Transfer Node
• The DTN is dedicated to data transfer
• Set up specifically for high-performance data movement
  – System internals (BIOS, firmware, interrupts, etc.)
  – Network stack
  – Storage (global filesystem, Fibre Channel, local RAID, etc.)
  – High performance tools
  – No extraneous software
• Limitation of scope and function is powerful
  – No conflicts with configuration for other tasks
  – Small application set makes cybersecurity easier
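Network-stack tuning is a concrete piece of this: a DTN's TCP buffer limits must be raised so the window can cover the bandwidth-delay product of long paths. A small, Linux-specific sketch that merely reports the current limits (the appropriate tuned values are site-specific and not given on this slide):

```python
# Report the Linux kernel's TCP buffer limits (min/default/max, in bytes).
# DTN builds raise the max well above distro defaults so a single TCP
# window can fill a long, fast path. Linux-specific /proc paths.
from pathlib import Path

for name in ("tcp_rmem", "tcp_wmem"):
    lo, default, hi = (int(v) for v in
                       Path(f"/proc/sys/net/ipv4/{name}").read_text().split())
    print(f"{name}: min={lo}  default={default}  max={hi} bytes"
          f" ({hi / 2**20:.1f} MiB)")
```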
Data Transfer Tool Comparison
• In addition to the network, using the right data transfer tool is critical
• Data transfer test from Berkeley, CA to Argonne, IL (near Chicago). RTT = 53 ms, network capacity = 10 Gbps.

  Tool                 Throughput
  scp                  140 Mbps
  HPN-patched scp      1.2 Gbps
  ftp                  1.4 Gbps
  GridFTP, 4 streams   5.4 Gbps
  GridFTP, 8 streams   6.6 Gbps

Note that to get more than 1 Gbps (125 MB/s) disk to disk requires properly engineered storage (RAID, parallel filesystem, etc.)
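Why the single-stream tools lag: at 53 ms RTT, filling a 10 Gbps path requires a TCP window of roughly the bandwidth-delay product. A back-of-the-envelope check using the test parameters above (the 1 MB window figure is an illustrative assumption about an untuned tool's internal buffer):

```python
# Bandwidth-delay product: the in-flight data needed to keep a path full.
rtt_s = 0.053          # Berkeley -> Argonne round-trip time
capacity_bps = 10e9    # 10 Gbps path

bdp_bytes = capacity_bps * rtt_s / 8
print(f"BDP: {bdp_bytes / 1e6:.1f} MB")   # ~66.3 MB of in-flight data

# A tool limited to ~1 MB of buffering cannot exceed window / RTT,
# no matter how fast the link is:
window_bytes = 1e6
print(f"Max rate with 1 MB window: "
      f"{window_bytes * 8 / rtt_s / 1e6:.0f} Mbps")  # ~151 Mbps, near scp's 140
```

This is also why parallel streams help: GridFTP with 8 streams multiplies the effective window (and recovers from loss on each stream independently).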
Overview
• Science DMZ Motivation and Introduction
• Science DMZ Architecture
• Data Transfer Nodes & Applications
• Science DMZ Security
• User Engagement
• Wrap Up
Science DMZ Security
• Goal – disentangle security policy and enforcement for science flows from security for business systems
• Rationale
  – Science data traffic is simple from a security perspective
  – Narrow application set on Science DMZ
    • Data transfer, data streaming packages
    • No printers, document readers, web browsers, building control systems, financial databases, staff desktops, etc.
  – Security controls that are typically implemented to protect business resources often cause performance problems
  – Separation allows each to be optimized
Performance Is A Core Requirement
• Core information security principles
  – Confidentiality, Integrity, Availability (CIA)
  – Often, CIA and risk mitigation result in poor performance
• In data-intensive science, performance is an additional core mission requirement: CIA → PICA
  – CIA principles are important, but if performance is compromised the science mission fails
  – Not about “how much” security you have, but how the security is implemented
  – Need a way to appropriately secure systems without performance compromises
• Collaboration within the organization
  – All parties (users, operators, security, administration) need to sign off on this idea – revolutionary vs. evolutionary change
  – Make sure everyone understands the ROI potential.
Security Without Firewalls
• Data intensive science traffic interacts poorly with firewalls
• Does this mean we ignore security? NO!
  – We must protect our systems
  – We just need to find a way to do security that does not prevent us from getting the science done
• Key point – security policies and mechanisms that protect the Science DMZ should be implemented so that they do not compromise performance
• Traffic permitted by policy should not experience performance impact as a result of the application of policy
Firewall Performance Example
• Observed performance, via perfSONAR, through a firewall:
• Observed performance, via perfSONAR, bypassing firewall:
Almost 20 times slower through the firewall
Huge improvement without the firewall
“Why Does it Do That?”
• Consider a network between three buildings – A, B, and C • This is supposedly a 10Gbps network end to end (look at the links on the buildings)
• Building A houses the border router – not much goes on there except the external connectivity
• Lots of work happens in building B – so much so that the processing is done with multiple processors to spread the load in an affordable way, and aggregate the results after
• Building C is where we branch out to other buildings
• Every link between buildings is 10Gbps – this is a 10Gbps network, right???
Notional 10G Network Between Devices
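The trap in this picture is aggregation: every individual link is 10 Gbps, but when several hosts in building B answer at once, their combined demand converges on a single 10 Gbps uplink and the switch must buffer or drop the excess. A toy calculation of that oversubscription (the host counts are illustrative, not from the slide):

```python
# Toy oversubscription check: link-by-link the network is 10 Gbps, but the
# aggregate demand through one uplink can still exceed it.
uplink_gbps = 10
senders = 4                 # hosts in building B transmitting simultaneously
per_host_burst_gbps = 10    # each NIC can burst at full link rate

demand = senders * per_host_burst_gbps
print(f"Instantaneous demand: {demand} Gbps on a {uplink_gbps} Gbps uplink")
print(f"Oversubscription {demand // uplink_gbps}:1 -> "
      "shallow switch buffers overflow and TCP sees loss")
```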
Overview
• Science DMZ Motivation and Introduction
• Science DMZ Architecture
• Data Transfer Nodes & Applications
• Science DMZ Security
• User Engagement
• Wrap Up
Challenges to Network Adoption
• Causes of performance issues are complicated for users.
• Lack of communication and collaboration between the CIO’s office and researchers on campus.
• Lack of IT expertise within a science collaboration or experimental facility.
• Users’ performance expectations are low (“The network is too slow”, “I tried it and it didn’t work”).
• Cultural change is hard (“we’ve always shipped disks!”).
• Scientists want to do science, not IT support.
The Capability Gap
Bridging the Gap
• Implementing technology is ‘easy’ in the grand scheme of assisting with science
• Adoption of technology is different
  – Does your cosmologist care what SDN is?
  – Does your cosmologist want to get data from Chile each night so that they can start the next day without having to struggle with the tyranny of ineffective data movement strategies that involve airplanes and white/brown trucks?
The Golden Spike
• We don’t want scientists to have to build their own networks
• Engineers don’t have to understand what a tokamak accomplishes
• Meeting in the middle is the process of science engagement:
  – Engineering staff learning enough about the process of science to be helpful in how to adopt technology
  – Science staff having an open mind to better use what is out there
Overview
• Science DMZ Motivation and Introduction
• Science DMZ Architecture
• Data Transfer Nodes & Applications
• On the Topic of Security
• User Engagement
• Wrap Up
Why Build A Science DMZ Though?
• What we know about scientific network use:
  – Machine size decreasing, accuracy increasing
  – HPC resources more widely available – and potentially distributed from where the scientists are
  – WAN networking speeds now at 100G, MAN approaching, LAN as well
• Value Proposition:
  – If scientists can’t use the network to its fullest potential due to local policy constraints or bottlenecks, they will find a way to get their work done outside of what is available.
• Without a Science DMZ, this stuff is all hard
  – “No one will use it”. Maybe today, but what about tomorrow?
  – “We don’t have these demands currently”. Next gen technology is always a day away
The Science DMZ in 1 Slide
Consists of four key components, all required:
• “Friction free” network path
  – Highly capable network devices (wire-speed, deep queues)
  – Virtual circuit connectivity option
  – Security policy and enforcement specific to science workflows
  – Located at or near site perimeter if possible
• Dedicated, high-performance Data Transfer Nodes (DTNs)
  – Hardware, operating system, libraries all optimized for transfer
  – Includes optimized data transfer tools such as Globus Online and GridFTP
• Performance measurement/test node
  – perfSONAR
• Engagement with end users
Details at http://fasterdata.es.net/science-dmz/
Links
– ESnet fasterdata knowledge base
  • http://fasterdata.es.net/
– Science DMZ paper
  • http://www.es.net/assets/pubs_presos/sc13sciDMZ-final.pdf
– Science DMZ email list
  • Send mail to [email protected] with the subject "subscribe esnet-sciencedmz"
– Fasterdata Events (Workshop, Webinar, etc. announcements)
  • Send mail to [email protected] with the subject "subscribe esnet-fasterdata-events"
– perfSONAR
  • http://fasterdata.es.net/performance-testing/perfsonar/
  • http://www.perfsonar.net
Thanks!
Jason Zurawski – [email protected], Science Engagement Engineer, ESnet, Lawrence Berkeley National Laboratory
Southern Partnership in Advanced Networking April 8th 2015