Upload
ban
View
29
Download
0
Tags:
Embed Size (px)
DESCRIPTION
Enabling a connection-oriented internet. Outline Background CHEETAH Testbed Software Applications: GridFTP + Ensight Security Extension of work to connection-oriented internet Networking research problems. Malathi Veeraraghavan Univ. of Virginia [email protected]. - PowerPoint PPT Presentation
Citation preview
1
Enabling a connection-oriented internet
• Outline• Background• CHEETAH
• Testbed• Software• Applications: GridFTP + Ensight• Security
• Extension of work to connection-oriented internet• Networking research problems
Malathi Veeraraghavan Univ. of Virginia
Talk at the BNL, May 16, 2005
2
Team & Acknowledgment
• Team (PI/Co-PIs):– Malathi Veeraraghavan, Univ. of Virginia– Nagi Rao, Bill Wing, Tony Mezzacappa, ORNL– John Blondin, NCSU– Ibrahim Habib, CUNY
• UVA funding sources:– NSF EIN grant ANI-0335190 – NSF ITR small grant ANI-0312376– DOE FG02-04ER25640
3
UVA team members
• Postdocs: – Xuan Zheng– Haobo Wang
• Graduate students– Xiuduan Fang (GridFTP/PVFS)– Zhanxiang Huang (MPLS)– Tao Li (testing)– Anant P. Mudambi (new transport protocol: FRTP)– Xiangfei Zhu (RSVP-TE)
4
eScience application reqmts. for the network
• Our eScience partner: TSI project– High bandwidth end-to-end for
terabyte sized file transfers– End-to-end QoS assurance
• remote visualization• remote computational steering
TSI: Terascale Supernova InitiativeQoS: Quality of Service
5
“Communications” answer(no networking)
• If – number of end points is small – costs are not prohibitive
• Purchase high-speed communication links end-to-end
• Problems to solve:– Limited to end hosts– Disk limitations, bus speed limits, etc.
6
Important questions for networking
• How many end users/hosts need access to the network?
• Where are they located?• Role of networking: enable sharing
• Answers for HEP community– 300-400 physicists in the US– Located in 30 univs/labs– 2000 physicists from 150
institutions world-wide
7
Networking community’sanswers to eScience needs
• Use existing networks (Internet2/ESnet); upgrade links
• Improve TCP by only changing the software at the end hosts– High-speed TCP, Scalable TCP,
FAST– Parallel TCP (e.g. GridFTP)
• Pros:– Easy to deploy
• Cons: – End-to-end QoS assurance
• Deploy new networks (switches) while upgrading links– Connection-oriented– e.g., Cheetah, Dragon,
Canarie’s CA*net4– Complements type of service
offered by Internet2• Pros:
– End-to-end QoS• Cons:
– Cost
8
A little background on networking
• What’s a network (wide-area)?– A bunch of communication links
interconnected by network SWITCHES
• Purpose of a network switch:– enable sharing of communication link
resources
End hostGenerate/store data that needs to be moved
End host
End host End host
Directcommunicationlinks:doesn’t scaleSWITCH
9
The fun in networks research: Type of sharing
• Connectionless (CL) switch (packet switch)– No explicit requests to reserve bandwidth
prior to data transfer– All control for bandwidth sharing is
implemented at end hosts – TCP software• TCP keeps increasing the rate at which it sends
packets• Congestion detected• Rate of sending packets dropped • Slowly starts increasing rate
10
TCP at end host controls sending ratesending rate increases
sending rate decreased
Connectionless packet switch
• Bandwidth share given to one flow keeps varying depending on how many other flows join and leave within the first flow’s duration
• Socialistic sharing• Fair but hard to provide rate guarantees
11
A second type of sharing
• Connection-oriented (CO)– Request a reservation
• If accepted, can be guaranteed the assigned rate– Release reservation when done– Note: uncertainty in whether BW request will be
granted
Connection-oriented switch (packet or circuit)
Bandwidth manager
Connection-oriented switch (packet or circuit)
Bandwidth manager
BW-request BW-request
Distributed bandwidth managementScales to large networks
Complete bandwidthon a link used for oneconnection
Multiplexed: lanes on a highway
12
CHEETAH (Circuit-Switched High-speed
End-to-End Transport Architecture)
• Falls in the second category– deploying a new network– connection-oriented flavor of sharing
• Meets TSI needs– High-bandwidth connections for file
transfers– End-to-end QoS for remote visualization
• Cons– High cost
13
CHEETAHTopology & equipment
Raleigh PoP (MCNC)(Sycamore SN16000)
Atlanta PoP (SoX/SLR)(Sycamore SN16000)
Ethernetswitch
Hosts5 GbEs
Enterprise networks
NCSU
Ethernetswitch
Hosts
GbE
OC192(NLR, SLR)
G. Tech
OC192
ORNL PoP(Sycamore SN16000)
Control
Gb/s and 10Gb/sEthernetinterfacecards
Time-divisionmultiplexing optical interfacecard
bandwidth manager: dynamic distributed sharing
Hosts
Ethernetswitch
Maps GbE to equivalent SONET circuit
OC192
ToUltraScience Net
14
CHEETAH concept
• Use off-the-shelf circuit-based gateways– that support GMPLS routing and signaling protocols for
dynamic circuit setup/release– enables the creation of large-scale shared CO networks
• It is not a standalone network– Leverages the presence of connectionless IP service
(host-to-host IP connectivity; DNS)• Implement cheetah software to run on end hosts• Integrate with host applications
– applications generate requests for bandwidth as needed– SHORT-LIVED: increase sharing– Hold circuit for a few seconds/minutes and release
15
Cheetah solution leveraging the presence of the Internet
• Use second NICs at hosts for circuit connectivity leaving primary NIC for Internet access
Connectionless Internet
Connectionless Internet
End host I
End host II
Circuit-Switched Network
Circuit-Switched Network
• Attempt circuit setup• If rejected, fall back to
using TCP/IP
Should we attempt a circuit setup for ALL file transfers?
Two paths available
Or is there a crossover file size below which we use the TCP/IP network and above which we attempt a circuit setup?
16
Two metrics: delay and utilization
• For most regions of operation on 1Gbps circuits– in wide-area scenarios (50ms prop.
delay)• delay: crossover size ~10KB• utilization: crossover size is ~50MB
– in local-area scenarios (1ms prop. delay)• delay: crossover size is 1s to 10s of MBs • utilization: 1s of MBs
17
Cheetah software on end hosts
TCP NIC I
NIC II
FRTPPrimary TCP/IP path
End-to-end CHEETAH circuit
GridFTPWeb
server
Remote viz.(Ensight)
Signalingclient
End-host CHEETAH software
Routing decision
DNS lookup
Applications
DNS query(to check if far end
host is also on cheetah)Routing decision
to check whether to use the TCP/IP path or attempt a cheetah
circuit setupSignaling client
to request a circuit
Fixed-Rate Transport Protocol (FRTP)
designed for circuits
18
Transport protocol problem
• Variability in sender: – other processes (e.g. matlab) + disk access (disk head location)
• Variability in receiver: if buffer not emptied out, data loss occurs
Networkprotocols
networkcard
networkcard
Filesystem
Filetransfer
Matlab
kernel
user space
Circuit-switchednetwork
Networkprotocols
Filesystem
FiletransferMatlab
19
Effects of mismatch in nature of circuits and nature of
hosts • Choose a high circuit rate and receive
buffer can fill up if circuit rate is not matched to receive rate – impacts delay + utilization
• Choose a low circuit rate and delay will be higher than necessary
• If sending rate is not matched exactly with circuit rate– circuit lies idle; utilization impacted
20
Transport protocol for end-to-end dedicated circuits
• Requirements & solution:– No contention for bandwidth resources in network during
user data flow (bandwidth already reserved)• No congestion control
– Contention at end-hosts due to multitasking• Flow control: null or window based
– Reliable transfer: error control• Detect/recover from drops in receiver buffer
– High circuit utilization• Keep sending rate fixed to match circuit rate• Hence the name Fixed-Rate Transport Protocol (FRTP)• Receive rate selection important
– Disk-to-disk transfer• FRTP module is handed a file descriptor instead of a buffer
location in main memory
21
FRTP Implementation I
• Null flow control– data blocks can get dropped at receiver
• disk access variability and multitasking
– recover through retransmissions
• Implementation in user-space – Opens UDP and TCP sockets– UDP data channel on unidirectional dedicated circuit – TCP control channel on primary Internet path
• Modified SABUL code
22
FRTP Implementation I (cont.)
• SABUL implementation:– Uses busy-wait to maintain fixed low inter-
packet times (to achieve fixed sending rate)– Drawback: high CPU utilization
• Modified to:– send a burst of packets periodically
• set a periodic timer; when process gets a signal indicating timer expiry, send a batch of packets
– use data link layer flow control (i.e., Ethernet PAUSE) to prevent bursts
23
FRTP Implementation II
• Window-based flow control– prevents data blocks from being dropped at
receiver– due to disk access variability and
multitasking• Implementation in kernel-space • Uses Web100/Net100 code
24
Modifications• Web100/Net100
– Implement TCP– Added hooks to tune parameters at run-time
• FRTP usage of this code/modifications– Used tuning capability to set:
• initial ssthresh to the Bandwidth Delay Product– using fixed circuit rate for bandwidth
– Made code modifications to set:• additive increase (AI), multiplicative decrease (MD) factors to 0
(sending rate will not change in congestion avoidance)– Sending rate increases to the circuit rate and stays there
• Added advantage: – TCP’s self-clocking is a pretty good way to maintain fixed
sending rate
25
Disk-to-disk transfer requirement
• Sender side actions: – read() system call:
• move a block from disk to user space memory– send() system call:
• write the block to network socket– sendfile(): reduces # of copies and # of system calls
• Receive side actions:– open the file with the O_LARGEFILE flag – calibrate disk write rate limits– select file system (xfs, pvfs)– if multitasking receiver, use RT schedulers to
schedule disk write thread to match circuit rate
26
GridFTP application
• Disk considerations– Hardware solution: RAID striping
• expensive solution
– Split large file into small files and store small files on disks of different hosts in a cluster
• not user-friendly
– GridFTP striping with PVFS2 - striping across disks of different hosts of a cluster
• best solution, but both GridFTP and PVFS2 code need modifications to use on dedicated circuits
27
• Equip host with a fast CPU, a RAID controller and disk array and a 10Gbps NIC
Data blocks
Hardware solution
1 2 3 4 5 6 7 8 9 10
RAID 0 with 5 disks (Striping)
1 2 3 4 5
6 7
…
8 9 10
… … … … …
28
File splitting
File Splitter
GridFTP
GridFTP
GridFTP
GridFTP
GridFTP
Original file
File partition1
Cray
Zelda1
File partition2
File partition2
File partition2
File partition2
Zelda2
Zelda3
Zelda4
Zelda5
GridFTP
GridFTP
GridFTP
GridFTP
GridFTP
File partition1
File partition2
File partition2
File partition2
File partition2
Orbitty1
Orbitty2
Orbitty3
Orbitty4
Orbitty5
File Assembler assembled
file
Head NodeORNL NCSU
29
Implementation of file splitting
• Use GridFTP partial transfer feature– But disk space allocated on each host needs
to equal the whole file size• Without GridFTP partial transfer
– Wrote C programs to partition and assemble the file
– Use any file transfer tool to transfer the partitions (which are distinct files) in parallel
– Tools with third party transfer utility desirable
30
• PVFS2 (Parallel Virtual File System)– Three kinds of roles for nodes in PVFS2
• Compute node/client: on which applications are run• Metadata server: handles metadata operations• I/O nodes/server: stores file data for PVFS2 file
systems– Stripes a file across multiple servers like
RAID0• But
– The latest version 1.0.1 does not provide any specific utility to inspect data distribution
– The pvfs2-cp tool ignores the –s option for configuring striped size
GridFTP striped transfer over PVFS2
31
Our work on PVFS2
• Determined how PVFS2 stripes files across hosts– Using strace command that provides a trace of systems
calls • Analyzed how the file is striped
– Utility pvfs2-fs-dump gives the order of I/O servers used for file distribution (order obtained from config. file)
• Change pvfs2 code:– PVFS2 stripes files starting with a random server (done
with PINT_cached_config_get_next_io() function call in file src/common/misc/pint-cached-config.c) jitter = (rand() % num_io_servers);
– Change it into jitter = -1 to get a fixed order of data distribution
– Change the default striped size (original: 64KBytes)
32
ControlControl
globus-url-copyMode E
SPAS (Listen)
- returns list of host: port pairs
STOR <FileName>
Mode E
SPOR (Connect)
- connect to the host-port pairs
RETR <FileName>
Host A1Block 1
Block 4
Block 1
Block 4
But the current GridFTP does not work in this ideal way. The data channel connections between the sending and receiving sides are arbitrary because the processing of SPAS and SPOR commands is nondeterministic.
• Does not match with the dedicated circuit model• Code being modified
Block 2
Block 5
Block 2
Block 5
Host X2Host A2
Host X1
33
Security: control plane
• Cannot have arbitrary end hosts send bandwidth request messages to network switch– Place VPN server/firewall in front of
SN16000’s control port • provides some DDoS attack protection (ns5)
– Establish IPsec tunnels between each host and SN16000 through VPN server
• Openswan software at Linux end hosts
– Establish IPsec tunnels between SN16000s
34
Security: data plane
• GridFTP security– Between client and server
• Can use ssh, ssl, ipsec between end hosts connected on cheetah circuit
35
Back to outline
• Outline• Background• CHEETAH
• Testbed• Software• Applications• Security
Extension of work to connection-oriented internet
• Network research problems
Talk at the BNL, May 16, 2005
36
Networking community’sanswers to eScience needs
• Use existing connectionless networks with improved TCP
• Deploy new connection-oriented networks– (e.g., cheetah)
• Enable connection-oriented service in already deployed switches– MPLS– VLANs
• Spend money upgrading links
37
Extend CHEETAH concept and apps to connection-oriented internet
• On Internet2/ESnet: deployed IP routers (Cisco and Juniper) have:– MPLS capability (with RSVP-TE)– Connection-oriented service – Can map packets from one flow (five-tuple) to a
reserved MPLS tunnel
• Within LANs:– Ethernet switches have IEEE 802.1q VLAN capability– Ingress rate shaping– With external control software, can make these
switches operate in connection-oriented mode
38
Many advantages to this approach
• Already deployed (“just” enable!)• Bandwidth granularity can be low
– improves bandwidth utilization
• Allows for sharing of link bandwidth between CL and CO traffic
39
CO “internet”
• Because heterogeneous connections will be needed at least in the short-term– MPLS segments (Label Switched Paths)– VLAN segments (within enterprises)– SONET circuits (popular in commercial
world)– WDM lightpaths (research testbeds)
40
Back to outline
• Outline• Background• CHEETAH
• Testbed• Software• Applications• Security
• Extension of work to connection-oriented internet
Network research problems
Talk at the BNL, May 16, 2005
41
Networking research problems
• Bandwidth sharing modes– Low load performance – Scheduled vs. immediate-request– Multi-level problem– Partial-path reservations– Fairness
42
Fixing the bandwidth for the transfer could be a bad thing: low load problem
• Varying bandwidth list scheduling algorithm– uses knowledge of file size to make varying
bandwidth allocations for transfer– catch: requires circuit switches to be reprogrammed
multiple times within lifetime of a transfer (circuit)
Capacity CPacketSwitch
1
.
.N
2
3
Each transfer gets C/N capacity
1
23
NThe lone remaining transfer enjoys
full capacity C
Capacity CCircuitSwitch
1
.
.N
2
3
Each transfer is allocated C/N capacity
1
23
N
The lone remaining transfer continueswith capacity allocation C/N
43
Scheduled vs. immediate-request calls
Session type requests:• long holding times (2 hours)• specific rate• remote visualizations• scientists participate in sessions• best served with an advance reservation
File transfer requests:• file sizes provided not holding times• max rate specified but any rate can be allocated• scientists not involved; just computers
Small files (e.g. 1 GB on 1 Gbps takes 8 sec)• should be handled in immediate-request mode
Large files (e.g. 1 TB on 1 Gbps takes 2.2 hours)• should be handled in scheduled mode• should we allocate 10Gbps and finish in 800 sec?
• immediate-request? or scheduled?• depends on m, the number of 10Gbps circuits
44
Multi-level problem
• A new problem: not yes/no but how much?– Real-time (interactive) audio-video
applications generate data at a certain rate (constant or variable)
• implication: application requests the required bandwidth from the network, and answer is binary (accept or reject); multiple classes
– File transfers: “any” bandwidth that the network can provide could be acceptable
• implication: application requests a MAX bandwidth, but the answer can be multi-level
45
Partial-path reservations
• Peel off bandwidth (partial-path reservation)• Put back for CL traffic use when done
Enterprise EnterpriseWide-area network(Abilene, ESnet backbone)
Enterprise
Enterprise
GbE insideenterprises
GbE insideenterprises
10Gbps in WANs
Typicallythe access linkis the bottleneck
Set up MPLS tunnelsdynamically forindividual flowson bottleneck links
46
Fairness
• Call admission algorithms– Use Markov Decision Process (MDP) tools to
balance fairness and overall throughput– Long-path and short-path calls– Large files (high-BW; high holding time) and
short files (low-BW; low holding time) calls– Multi-level answer rather than binary
accept/reject– CO traffic vs. CL traffic
• Both with Fixed bandwidth and Varying bandwidth
47
Conclusions
• End-to-end dedicated connections appear to be the right answer for many eScience applications– But, many networking problems need to be
solved to achieve cost reduction through scaling
• Utilization concerns: bandwidth sharing + FRTP
• Specific concerns of TSI: TB file handling – PVFS2 and GridFTP
• Web site: http://cheetah.cs.virginia.edu