Center of Excellence in Wireless and Information Technology
CEWIT 2008
TeraPaths: Managing Flow-Based End-to-End QoS Paths
Experience and Lessons Learned
Dimitrios Katramatos, Dantong Yu, Kunal Shroff
Brookhaven National Laboratory
Thomas Robertazzi
Stony Brook University
Shawn McKee
University of Michigan
Abstract
• TeraPaths is a Department of Energy funded network research project to support efficient, predictable, and prioritized peta-scale data replication in modern high-speed networks
• The TeraPaths network management framework establishes and manages on-demand, true end-to-end, QoS-aware virtual network paths across multiple administrative network domains
• TeraPaths dedicates network resources to data flows specifically authorized to use such network paths, in a transparent and scalable manner. This ensures that only selected flows receive a pre-determined, guaranteed level of QoS in terms of bandwidth, jitter, delay, etc.
Speaker’s Biography
• Dantong Yu, Brookhaven National Laboratory
• Dantong Yu received the Ph.D. degree in Computer Science from the State University of New York at Buffalo, USA, in 2001. His research interests include high-speed network performance, network Quality of Service, cluster/grid computing, information retrieval, data mining, databases, and data warehouses. He leads the large-volume WAN data transfer between CERN, BNL, ATLAS and RHIC collaboration institutes over high-speed networks with Grid middleware.
Outline
• Background: the TeraPaths project
• Establishing flow-based end-to-end QoS paths
• Domain interoperation
• Encountered issues and proposed solutions
• Project status and future work
• Conclusions
Background
• Provide QoS guarantees at the individual data flow level, all the way to the end hosts, transparently
  – Data flows have varying priority/importance
    • Video streams
    • Critical data
    • Long duration transfers
  – Default “best effort” network behavior treats all data flows as equal
  – Capacity is not unlimited
    • Congestion causes bandwidth and latency variations
    • Performance and service disruption problems, unpredictability
• Dynamic flow-based SLAs = schedule network utilization
  – Regulate and classify (prioritize) traffic
End-to-End Setup
[Figure: end-to-end setup across Site A (hosts a1, a2), Site B (host b1), and Site C (host c1), connected through the WAN domains. Labeled elements: host routers, site border routers, a virtual border router, a site host/border router, and regional provider routers. ACLs shown: a1 b1 and a2 c1; b1 a1; c1 a2. VLAN X and VLAN Y are shown with addresses 10.100.1.x1, 10.100.1.x2, 10.100.1.y1, and 10.100.1.y2.]
Establishing End-to-End QoS Paths
• Multiple administrative domains
  – Cooperation, trust, but each maintains full control
  – Heterogeneous environment
  – Domain controller coordination through web services
• Coordination models
  – Star
    • Requires extensive information for all domains
  – Daisy chain
    • Requires common flexible protocol across all domains
  – Hybrid (end-sites first)
    • Independent protocols
    • Direct end site negotiation
Path Setup (2)
• End site subnets are configured by TeraPaths software instances (TeraPaths Domain Controllers or TDCs)
  – TDCs configure end site LANs to prioritize and regulate authorized flows via the DiffServ framework at the network device level
  – Source site polices/marks authorized flow packets
  – Destination site admits/re-polices/re-marks packets
  – End site LANs tx/rx marked packets to/from the WAN
• WAN provides MPLS tunnels or dynamic circuits
  – Initiating TDC requests MPLS tunnel or dynamic circuit with matching bandwidth and lifetime, or…
  – TDC groups flows with common src/dst into MPLS tunnel or dynamic circuit with aggregate bandwidth and lifetime
  – WAN preserves packet markings
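The policing and marking step at the source site can be illustrated with a token bucket. This is a minimal sketch, not the DiffServ configuration TeraPaths pushes to network devices: the class name, the parameters, and the choice to re-mark out-of-profile packets to best effort (rather than drop them) are assumptions made for illustration. The EF code point value 46 is the standard one.

```python
from dataclasses import dataclass, field

DSCP_EF = 46  # Expedited Forwarding code point
DSCP_BE = 0   # default (best-effort) code point


@dataclass
class TokenBucketPolicer:
    """Hypothetical single-rate policer for one authorized flow."""
    rate_bps: float                      # reserved bandwidth, bits per second
    burst_bits: float                    # bucket depth, bits
    last_time: float = 0.0               # arrival time of the previous packet
    tokens: float = field(init=False)    # bits currently available

    def __post_init__(self):
        self.tokens = self.burst_bits    # start with a full bucket

    def mark(self, now: float, packet_bits: int) -> int:
        """Return the DSCP value for a packet arriving at time `now` (seconds)."""
        # Refill for the elapsed time, capped at the bucket depth.
        elapsed = now - self.last_time
        self.tokens = min(self.burst_bits, self.tokens + elapsed * self.rate_bps)
        self.last_time = now
        if packet_bits <= self.tokens:
            self.tokens -= packet_bits
            return DSCP_EF               # in profile: keep the priority marking
        return DSCP_BE                   # out of profile: assumed re-mark to best effort
```

A flow that stays within its reserved rate keeps the EF marking; traffic beyond the reservation loses it, so the flow cannot take more than its share of the priority class.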
Path Setup (3)
• WAN domains interoperate
  – Each end site’s TDC has a single point of contact for WAN services
  – TDCs have no knowledge of WAN internals other than what is exposed by the WAN services
    • End sites have no direct control over the WAN
• Either tunnel or circuit through WAN
  – TeraPaths does not mix and match layer 2 and layer 3 technologies
• TeraPaths “proxy” servers
  – Implement interface required by TeraPaths core
  – Hide WAN service differences
  – Clients to WAN web services (currently OSCARS / DRAGON)
    • Close cooperation with ESnet and I2 development teams
  – Submit reservations for MPLS tunnels or dynamic circuits
  – Handle security requirements
  – Handle errors
Addressing L2-Specific Issues
• Limitations with VLANs
  – Tag range (tentatively selected 50 VLANs – 3550 to 3599)
    • Each site may have its own range
  – Tag conflicts
    • Rely on WAN service
    • Eliminate by synchronizing site databases
    • VLAN renaming (if/when possible)
• Scalability issues
  – Limited number of VLAN tags/circuits:
    • Flow grouping / circuit consolidation
      – Forward flows through same virtual WAN circuit
        » Create circuit with new parameters / switch current flows / cancel old circuit
        » Modify WAN reservations (if/when possible)
  – PBR overhead
    • Virtual border router
• Sensitive/3rd party network segments
  – VLAN pass-thru
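The limited tag range and the tag-conflict problem can be sketched as a small allocator. This is an illustration only: `VlanTagPool` and its methods are hypothetical names, and picking the lowest tag that is free at both sites is an assumed policy standing in for the "synchronized site databases" option.

```python
class VlanTagPool:
    """Hypothetical allocator for one site's range of circuit VLAN tags."""

    def __init__(self, first: int = 3550, last: int = 3599):
        self.free = set(range(first, last + 1))  # 50 tags with the defaults
        self.in_use = {}                          # tag -> circuit id

    def allocate(self, circuit_id: str, remote_free: set) -> int:
        """Pick the lowest tag free at both sites, avoiding a tag conflict."""
        candidates = self.free & remote_free
        if not candidates:
            # Out of tags: this is where flow grouping would be needed.
            raise RuntimeError("no common VLAN tag available")
        tag = min(candidates)
        self.free.remove(tag)
        self.in_use[tag] = circuit_id
        return tag

    def release(self, tag: int) -> None:
        """Return a tag to the pool when its circuit is torn down."""
        del self.in_use[tag]
        self.free.add(tag)
```

When `allocate` fails, no new circuit can be created between the two sites, which is the scalability limit that motivates forwarding several flows through the same circuit.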
Flow Grouping/Circuit Consolidation
• Flows between same src and dst sites can share circuit, policing maintains bandwidth guarantee
• Multiple TeraPaths reservations associate with the same circuit reservation
  – Easy when requirements are known in advance
  – Modification of reservations required otherwise
• Selection/optimization to minimize resource waste
• Trade-off based on Δbw (bandwidth difference), Δtb, Δta (time period before and after a reservation)
[Figure: two bandwidth vs. time panels showing reservations 1 to 5, with Δt and Δbw marked relative to the current time.]
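The trade-off on Δbw, Δtb, and Δta can be made concrete with a small calculation. This is a sketch under stated assumptions: the `Slot` type, the waste measure (circuit area minus reservation area, in Mbit), and the best-fit rule are hypothetical choices for illustration, not the selection method used by TeraPaths.

```python
from typing import NamedTuple, Optional, Sequence


class Slot(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    bw: float     # Mbps


def deltas(circuit: Slot, res: Slot):
    """Return (Δbw, Δtb, Δta) of a candidate circuit relative to a reservation."""
    return (circuit.bw - res.bw,          # spare bandwidth
            res.start - circuit.start,    # circuit time before the reservation
            circuit.end - res.end)        # circuit time after the reservation


def waste(circuit: Slot, res: Slot) -> float:
    """Reserved but unused capacity in Mbit (assumed cost measure)."""
    return circuit.bw * (circuit.end - circuit.start) - res.bw * (res.end - res.start)


def best_fit(circuits: Sequence[Slot], res: Slot) -> Optional[Slot]:
    """Pick the candidate that covers the reservation with the least waste."""
    fitting = [c for c in circuits if all(d >= 0 for d in deltas(c, res))]
    return min(fitting, key=lambda c: waste(c, res), default=None)
```

For example, a 500 Mbps reservation from t=100 to t=200 fits a 600 Mbps circuit spanning t=90 to t=210 with deltas (100, 10, 10), and that candidate wastes far less than a 1000 Mbps circuit spanning t=0 to t=300.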
Flow Grouping/Circuit Consolidation (2)
• Similar approach to disk buffering (read ahead / write behind)
  – Bring up ahead / teardown behind
  – Reuse existing active circuits
  – Reserve circuits with more bandwidth and longer duration depending on differences in start time, duration, bandwidth of reservations
  – Delay teardown, modify circuit duration and/or bandwidth if possible
[Figure: bandwidth vs. time diagrams of reservations 1 to 5 relative to the current time, with Δtb and Δta marked.]
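The "bring up ahead / teardown behind" idea amounts to reserving one circuit whose window and bandwidth envelop a group of reservations. The sketch below assumes reservations are `(start, end, bw)` tuples and that `lead` and `lag` are padding values chosen by the operator; the function name and both parameters are hypothetical.

```python
def envelope(reservations, lead=0.0, lag=0.0):
    """Return (start, end, bw) of one circuit able to carry all reservations.

    reservations: iterable of (start, end, bw) tuples.
    lead / lag:   assumed padding to bring the circuit up early and tear it down late.
    """
    reservations = list(reservations)
    start = min(r[0] for r in reservations) - lead
    end = max(r[1] for r in reservations) + lag

    # Sweep over start/end events to find the peak simultaneous bandwidth.
    events = []
    for s, e, bw in reservations:
        events.append((s, bw))
        events.append((e, -bw))
    # At equal times, process releases before new starts so that
    # back-to-back reservations are not counted as overlapping.
    events.sort(key=lambda ev: (ev[0], ev[1]))

    peak = current = 0.0
    for _, change in events:
        current += change
        peak = max(peak, current)
    return start, end, peak
```

Sizing the circuit by peak concurrent demand rather than by the sum of all reservations keeps the wasted bandwidth down when reservations only partly overlap.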
Limitation of Dynamic Circuits
• A recent incident in BNL’s LHCOPN subnet:
  – Cisco’s PBR implementation only uses the status of an interface to decide whether or not to forward packets
  – A network circuit breaks somewhere along the path, but the involved interfaces on both ends are still up
  – No probes and/or heartbeat exist to check the “health” of circuits
  – Fail-over to the backup link does not work since primary interfaces are up even when such a problem exists
• End site monitoring is the most effective way to detect such a problem
Active Circuit Probing
Each TeraPaths site instance periodically verifies “well being” of reservations:
  – Selects active reservations initiated by site (site responsibility)
  – Finds circuit/VLAN associated with each reservation
  – Performs a circuit check with a quick pinging of other site’s router (private IP address space)
  – Less than 100% success triggers a recheck with longer duration pings in both directions (to and from other site)
  – Low success % triggers reservation cancellation, reverting traffic to best effort network
  – Optionally, the system adapts reservation data and attempts to set up a new end-to-end path (for given time period/number of attempts)
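The probing steps above can be sketched as a small decision function. The `probe` callable, the ping counts, and the 80% cancellation threshold are all assumptions made for illustration; the slides only say that a "low success %" triggers cancellation.

```python
def check_circuit(probe, quick_count=5, long_count=50, cancel_below=0.8):
    """Decide what to do with one reservation's circuit.

    probe(direction, count) is a hypothetical function that sends `count`
    pings over the circuit ("out" = to the other site's router, "in" = from
    it) and returns the fraction answered, between 0.0 and 1.0.
    """
    # Quick check: full success means the circuit is healthy.
    if probe("out", quick_count) == 1.0:
        return "ok"
    # Anything below 100% triggers a longer recheck in both directions.
    worst = min(probe("out", long_count), probe("in", long_count))
    # Low success cancels the reservation; traffic reverts to best effort.
    return "cancel" if worst < cancel_below else "ok"
```

A caller would run this periodically for each active reservation the site initiated and, on "cancel", optionally try to set up a new end-to-end path.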
Prioritizing Traffic
[Figure: TeraPaths QoS test 1 (prioritize traffic). Bandwidth (Mbits/sec, 0 to 1200) vs. time (sec, 0 to 1000) for priority, background, and total traffic. Annotations: competing traffic causes dramatic drop in bandwidth; QoS / circuit reservation active.]
Recovering from Circuit Failure
[Figure: TeraPaths QoS test 2 (prioritize/fallback to best effort). Bandwidth (Mbits/sec, 0 to 1200) vs. time (sec, 0 to 1000) for priority, background, and total traffic. Annotations: circuit interruption; recovery to best effort.]
Competing against BE traffic
[Figure, two panels (axes unlabeled). Panel 1: remote EF against remote and local BE; series: remote EF, local BE, remote BE; vertical axis 0 to 10000, horizontal axis 0 to 500. Panel 2: remote EF against local BE; series: remote EF, local BE; vertical axis 0 to 10000, horizontal axis 0 to 600.]
Status
• BNL, UMich, BU, all with 10Gbps connections, multiple pass-thru configurations (BNL, UMich, NoX, Merit, MiLR)
• Utilization of L3 paths (MPLS tunnels, ESnet only), L2 paths (dynamic circuits, ESnet and Internet2)
• Multiple QoS reservations through same circuit (support for circuit consolidation)
• Multiple circuits per site subject to per-site VLAN availability (flow grouping/circuit consolidation)
• Active circuit probing for failures with fallback to best effort network/attempt to reconfigure e2e path (in testing phase)
• Dynamic bandwidth allocation within service classes (in testing phase)
• New command line client
Future Work
• Continue working on automatic flow grouping / circuit consolidation.
• Configurable reservation negotiation
• Grid-style AAA (GUMS/VOMS)
• Plug-ins: SRM (dCache), others
• Compatibility with Lambda Station
• Support for different hardware as needed
• ATLAS Production:
  – Replicate ATLAS Physics data from BU and UMich with the existing ATLAS DDM stack, and with end-to-end QoS circuits
  – Tier 1 (BNL) and Tier 2 data replication
• http://www.terapaths.org
Conclusions
• Demonstrated the effective prioritization and protection from interference of selected data transfers between three LHC experiment institutes – Brookhaven National Laboratory, the University of Michigan, and Boston University – through guaranteed bandwidth virtual paths, in the presence of intensive best-effort IP traffic sharing the same network resources
• A practical and economical end-to-end network resource reservation system, extending new capabilities to users/applications of end sites without requiring additional, expensive network infrastructure components