The Experiment Lifecycle and its Major Programs

1

The Experiment Lifecycle and its Major Programs

2

Experiment Lifecycle: the User Perspective

3

Creating an Experiment

• Done with `batchexp' for both batch and interactive experiments
  – "batch" is a historical name

• Can bring the experiment to three states
  – swapped: pre-run only
  – posted: queued experiment, ready to run
  – active: experiment swapped in

4

Swapping An Experiment

• Done with `swapexp’

• Can effect several transitions
  – swapped to active (swap in experiment)
  – active to swapped (swap out experiment)
  – active to active (modify experiment)
  – posted to swapped (dequeue batch experiment)

5

Pre-run (tbprerun)

• Parse NS file (parse-ns and parse.tcl)
  – Put virtual state in the database (xmlconvert)

• Do visualization layout (prerender)

• Compute static routes (staticroutes)

6

swapped to active (tbswap in)

• Mapping: find nodes for the experimenter
  – assign_wrapper
  – assign

• Allocate nodes (nalloc)
  – Set up serial console access (console_setup)

• Set up NFS exports (exports_setup)

• Set up DNS names (named_setup)

• Reboot nodes and wait for them (os_setup)
  – Load disks if necessary (os_load)

7

swapped to active (contd.)

• Start event system (eventsys_control)

• Create VLANs (snmpit)

• Set up mailing lists (genelists)

• Failure at any step results in swapout

8

active to swapped (tbswap out)

• Stop the event system (eventsys_control)

• Tear down VLANs (snmpit)

• Free nodes (nfree)
  – Honor scheduled reservations (sched_reserve)
  – Place nodes in the reloadpending experiment
  – Revoke console access (console_setup)

• Reset DNS (named_setup)

• Reset NFS exports (exports_setup)

• Reset mailing lists (genelists)

9

active to active (tbswap modify)

• Purpose: experiment modification
  – Get new virtual state (re-parse the NS file)
  – Bring the physical mapping into sync with the new state

• Leaves alone nodes whose physical mapping matches the new virtual state

10

Important Daemons

• batch_daemon
  – Picks up posted experiments
  – Attempts a swapin
  – One experiment at a time per user
  – Swaps out finished batch experiments

• reload_daemon
  – Picks up nodes from the reloadpending experiment
  – Frees them when done reloading

12

Next, in More Depth

• Parsing
• Resource allocation
  – Setup for the action: assign_wrapper
  – The real brains: assign
• Serial console management
• Link shaping
• IP routing support
• Traffic generation
• Inter-node synchronization
• Event system

13

Parsing Experiment Configurations

14

Experiment Configuration Language

• General-purpose OTcl scripting language based on NS

• Exports an API nearly identical to that of NS, albeit a subset

• Testbed-specific actions via the tb-* procedures
  – We provide a compatibility script to include when running under an NS simulation

• Define your own procedures / classes / methods
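
To make this concrete, a minimal NS file in this dialect might look like the sketch below (the OS and hardware names are site-specific examples, and tb_compat.tcl is the compatibility script mentioned above):

    set ns [new Simulator]
    source tb_compat.tcl           ;# defines the tb-* procedures as no-ops under real NS

    set n0 [$ns node]
    set n1 [$ns node]
    set link0 [$ns duplex-link $n0 $n1 100Mb 10ms DropTail]

    tb-set-node-os $n0 FBSD-STD    ;# testbed-specific: pick the disk image
    tb-set-hardware $n1 pc850      ;# testbed-specific: pick the node type

    $ns rtproto Static
    $ns run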

15

Making sense out of others’ code

• The parser is also written in OTcl

• It mirrors a subset of NS classes

• Implemented methods for the above classes capture the user-specified experiment attributes

• Convert experiment attributes to an intermediate XML format
  – The generic format makes it easy to add support for other configuration languages

• Store the configuration in the virt_* tables, such as virt_nodes, virt_lans, etc.

16

Implementation Quirks

• Capture top-level resource names for later use
  – E.g., use 'n0' to name the physical node when the user asks for: set n0 [$ns node]

• Rename resources to work around restrictions, such as those in DNS
  – E.g., node 'n(0)' becomes 'n-0'

• The parser is run on ops for security reasons
  – Mixing trusted and untrusted OTcl code on the main server (boss) is dangerous

• Read tbsetup/ns2ir/README in the source tree for details

17

Assign Wrapper (PG Version)

18

Assign Wrapper

• Perl frontend to assign
• Converts the virtual DB representation to a more neutral "top" file format (assign's input)
• Converts assign's plain-text results into the physical DB representation
• assign_wrapper is extremely testbed-aware
• Moves information from virtual tables to physical tables

19

Virtual Representation

• An experiment is really a set of tables in the database

• Includes “virt_nodes” and “virt_lans” which describe the nodes and the network topology

• Other tables include routes, program agents, traffic generators, virtual types, etc.

20

Virtual Representation Cont.

• Example:
    set n1 [$ns node]
    set n2 [$ns node]
    set link0 [$ns duplex-link $n1 $n2 100MB 10ms]
    tb-set-hardware $n2 pc600

• Is stored in database tables:
    virt_node ('n1', '10.1.1.1', 'pc850', 'FBSD-STD', ...)
    virt_node ('n2', '10.1.1.2', 'pc600', 'RHL-STD', ...)
    virt_lan ('link0', 'n1', '100MB', '5ms', ...)
    virt_lan ('link0', 'n2', '100MB', '5ms', ...)
  – Note: the 10ms link delay is split between the two virt_lan members (5ms per side)

21

What’s a top file?

• Stands for "topology" file, but that's too many syllables.

• Input file to assign specifying nodes, links, desires.

• Conversion of DB format to:

    node n1 pc850
    node n2 pc600
    link link0/n1:0,n2:0 n1 n2 100000 0 0

• Combine with current (free) physical resources to come up with a solution.

22

Assign Results

• Assign maps n1 and n2 to pc1 and pc41 based on types and bandwidth.

    Nodes
    n1 pc1
    n2 pc41
    End Nodes
    Edges
    link0/n1:0,n2:0 intraswitch pc1/eth3 pc41/eth1
    End Edges

• The above is a “simplified” version of actual results. Gory details available elsewhere.

23

Assign Wrapper Continues

• Allocate physical resources (nodes) as specified by assign

• Allocate virtual resources (vnodes) on physical nodes (local and remote)

• If some nodes are already allocated (someone else got them before you), try again

• Keep trying until the maximum number of tries is exceeded; assign might fail to find a solution on the first N tries

24

Assign Wrapper Keeps Going …

• Insert set of "vlans" into the database
  – pc1/eth3 connected to pc41/eth1

• Update “interfaces” table with IP addresses assigned by the parser

• Update "nodes" table with user-specified values from virt_nodes
  – OSids, RPMs, tarballs, etc.

• Update “linkdelays” table with end node traffic shaping configuration (from virt_lans)

25

And Going and Going

• Update “delays” table with delay node traffic shaping configuration

• Update “tunnels” table with tunnel configuration (widearea nodes)

• Update “agents” table with location of where events should be sent to control traffic shaping

• Call exit(0) and rest!

26

Resource Allocation:assign

27

assign’s job

• Maps virtual resources to local nodes and VLANs
• General combinatorial optimization approach to an NP-hard problem
• Uses simulated annealing
• Minimizes inter-switch links, number of switches, and other constraints
• Takes seconds for most experiments
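
To make "simulated annealing" concrete, here is a generic annealing loop in Tcl. It is an illustration only, not assign's actual code (assign is C++); cost and perturb are hypothetical stand-ins for its scoring function and mapping mutations:

    proc anneal {state temp cooling} {
        set best $state
        while {$temp > 0.01} {
            set candidate [perturb $state]    ;# e.g., remap one virtual node
            set delta [expr {[cost $candidate] - [cost $state]}]
            # Always accept improvements; accept regressions with
            # probability exp(-delta/temp), which shrinks as we cool
            if {$delta < 0 || rand() < exp(-$delta / $temp)} {
                set state $candidate
                if {[cost $state] < [cost $best]} { set best $state }
            }
            set temp [expr {$temp * $cooling}]    ;# geometric cooling schedule
        }
        return $best
    }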

28

What’s Hard About It?

• Satisfy constraints
  – Requested types
  – Can't go over inter-switch bandwidth
  – Domain-specific constraints
• LAN placement for virtual nodes
• Subnodes
• Maximize opportunity for future mappings
  – Minimize inter-switch bandwidth
  – Avoid scarce nodes

29

What It Can Do

• Handle multiple types of nodes on multiple switches

• Allow users to ask for classes of nodes

• Prefer/discourage use of certain nodes

• Map multiple virtual nodes to one physical node

• Handle nodes that are 'hosted' in some other node

• Partial solutions

30

What It Doesn't Do

• Map based on observed end-to-end network characteristics
  – Applicable to wide-area and wireless
  – But we have another program, wanassign, that can

• Satisfy requests for specific link types
  – But we could approximate with subnodes

• Full node resource description

31

Issues

• Complicated
  – Several authors

  – Subject of a paper evaluating many configurations

– Nature of randomized algorithm makes debugging hard

– Evolved over time to keep up with features

• Scaling
  – Particularly with virtual and simulated nodes

• Not just scale (1000’s), it’s the type of node

– Pre-passes may help

• The good: it’s coped with a lot of new demands!

32

Remote Console Access

33

Executive Summary

• Allow user access to consoles via serial line

• Console proxy enables remote access

• Authentication and encryption

• All console output logged

• Requires OS support for serial consoles

• Utah Emulab: all nodes have serial lines
  – Not required, but handy

34

Serial Consoles

• Can redirect the console in three places
  – BIOS: on most "server" motherboards
  – Boot loader: easy on BSD and Linux
  – OS: easy on BSD and Linux

• Boot loaders and OSes must be configured
  – Generally via the boot loader configuration (see the examples below)
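
For flavor, the boot loader and OS settings are standard fare, nothing Emulab-specific; typical examples:

    # FreeBSD: /boot/loader.conf redirects the loader and kernel console
    console="comconsole"

    # Linux: add to the kernel command line in the boot loader config
    console=ttyS0,115200n8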

35

The serial line proxy (capture)

• Original purpose was to log console output
  – Read/write the serial line, log data, present a tty interface
  – Use "tip" to access the pty

• Enhanced to "remote" the console
  – Present a socket interface
  – Can be accessed from anywhere on the network

• One capture process per serial line

36

Authentication (capserver)

• Only users in an experiment can access

• Use a one-time key
  – capture, running on the serial line host, generates a new key for every "session"

• Sends the key to capserver on the boss node
  – capserver records the key in the DB and returns ownership info
  – capture uses this info to protect the ACL and log files

37

Clients (console, tiptunnel)

• console is the replacement for tip
  – Runs on ops; obtains access info via the ACL file created by capture
  – File permissions restrict user access

• tiptunnel is the remote version
  – Binaries for Linux, BSD, Windows
  – Run as a helper app from the browser
  – Access info passed via a secure web connection
  – All communication via SSL

38

Emulab Link Shaping

39

Executive Summary

• Emulab allows setting and modification of bandwidth, latency, and loss rate on a per-link basis

• Interface through NS script or command line

• Implemented either by dedicated “delay” nodes or on end nodes

• Delay nodes work with any end node OS

• End node shaping for FreeBSD or Linux

40

Delay nodes

• Run FreeBSD + dummynet + bridging

• FreeBSD kernel:
  – Runs at 10000Hz to improve accuracy
  – Uses polling device drivers to reduce overhead

• Nodes are dedicated to an experiment

• One node can shape multiple links

• Transparent to end nodes

• Not transparent to switch fabric
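
To give a feel for the mechanism, a hand-built dummynet pipe looks roughly like this (interface name and numbers are invented for illustration; Emulab generates its own rules):

    # Push traffic traversing fxp0 through pipe 1
    ipfw add pipe 1 ip from any to any via fxp0
    # Shape the pipe: 50Mbit/s bandwidth, 10ms delay, 0.1% loss
    ipfw pipe 1 config bw 50Mbit/s delay 10ms plr 0.001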

41

VLANs and Delay Nodes - Diagram

42

End node shaping ("link delays")

• Handle link shaping at both ends of the link

• Requires OS support on the end nodes
  – FreeBSD: dummynet
  – Linux: "tc" with modifications

• Conserves Emulab resources at potential expense of emulation fidelity

• Works in environments where delay nodes are not practical or possible

43

Dynamic control

• Link settings can be modified at "run time"
  – "at" commands in the NS file
  – The tevc command

• Run a control agent (delay_agent) on all nodes implementing shaping

• Listens for events and interacts with the kernel to effect changes

• OS specific

44

IP routing support in Emulab

45

Executive Summary

• Emulab offers three options for IP routing in a topology: none, manual, or automatic

• Specified via the NS file

• Routes setup automatically at boot time

• There is no agent for dynamic modification of routes

46

User-specified routing

• "None"
  – No experimental network routes will be set up
  – Used for LANs and routing experiments

• "Manual"
  – Explicit specification of routes in the NS file (see the sketch below)
  – Routes become part of the experiment's DB state
  – Passed to a node at boot as part of self-configuration
  – Implies IP forwarding is enabled
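
A manual specification in the NS file might look like this minimal sketch (node names are hypothetical; the route says "reach n2 via next hop n1"):

    $ns rtproto Manual
    $n0 add-route $n2 $n1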

47

Emulab-provided routing

• "Static"
  – Emulab calculates routes at experiment creation (routecalc, staticroutes)
  – Shortest-path calculation between all pairs
  – Optimized to coalesce into network routes

• "Session"
  – Dynamic routing: runs gated/OSPF on all nodes
  – Auto-generated config file uses only active experimental interfaces
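
In the NS file each of these is a one-liner (an experiment picks a single routing protocol):

    $ns rtproto Static     ;# Emulab precomputes shortest-path routes
    $ns rtproto Session    ;# or: run gated/OSPF on every node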

48

Routing Gotchas

• Node default route uses the control net
  – Missing manual routes result in lost traffic
• Control net is visible to routing daemons
  – Makes their job easy (one hop to anyone)

• The NxN "Static" route computation and storage do not scale as N increases, such as with multiplexed virtual nodes: 1,000 nodes already imply on the order of 1,000,000 routes

49

Traffic Generation in Emulab

50

Executive Summary

• Emulab allows experiments to run and control background traffic generators

• Interface through NS script or command line tool

• Constant Bit Rate traffic only right now

• UDP or TCP only right now

51

Implementation details

• Based on TG (http://www.postel.org/tg/)
  – UDP or TCP, one-way, various distributions of interarrival time and packet length

• Modified to be an event agent
  – Start and stop; change packet rate and size

• Interface:
  – NS: standard syntax for traffic sources/sinks (see the sketch below)
  – The tevc command line tool
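
A sketch of that standard NS source/sink syntax, assuming nodes $n1 and $n2 were declared earlier in the file:

    # One-way CBR stream from n1 to n2
    set udp0 [new Agent/UDP]
    $ns attach-agent $n1 $udp0

    set null0 [new Agent/Null]         ;# the sink
    $ns attach-agent $n2 $null0
    $ns connect $udp0 $null0

    set cbr0 [new Application/Traffic/CBR]
    $cbr0 set packet_size_ 500         ;# bytes
    $cbr0 set rate_ 100Kb
    $cbr0 attach-agent $udp0

    $ns at 10.0 "$cbr0 start"
    $ns at 60.0 "$cbr0 stop"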

52

Inter-node synchronization in Emulab

53

Executive Summary

• Provides a simple inter-node barrier synchronization mechanism for experiments

• Example: wait for all nodes to finish running a test before starting the next one

• Not a centralized service (per-experiment infrastructure), scales well

• Easy to use: can be scripted

54

History

• Originally implemented a single-barrier, single-use "ready" mechanism:
  – Allowed users to know when all nodes were "up"
  – Used centralized TMCC to report/query status
  – Network/server unfriendly: constant polling

• Users wanted a more general mechanism
  – Multiple barriers, reusable barriers

• Users tended to roll their own
  – Often network unfriendly as well

55

Enter the Sync Server

• In the NS file, declare a node as the server:
  – set node1 [$ns node]
  – tb-set-sync-server $node1

• When node boots, it starts up the sync server automatically

• Nodes requiring synchronization use emulab-sync application

• Use can be scripted via the program agent (see the sketch below)

56

Example client use

• One node acts as barrier master, initializing the barrier and waiting for a number of clients:
  – /usr/testbed/bin/emulab-sync -i 4
• All other client nodes contact the barrier:
  – /usr/testbed/bin/emulab-sync

• emulab-sync blocks until the barrier count is reached
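
Scripting the barrier from the NS file might look like this sketch using program agents (node variables and start times are hypothetical):

    # Master initializes a barrier for one client; both block until released
    set master [$n1 program-agent -command "/usr/testbed/bin/emulab-sync -i 1"]
    set client [$n2 program-agent -command "/usr/testbed/bin/emulab-sync"]
    $ns at 10.0 "$master start"
    $ns at 10.0 "$client start"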

57

Implementation

• Simple TCP-based server and client program
  – UDP version in the works

• Client:
  – Gets server info from a config file written at boot
  – Connects to the server and writes a small record
  – Blocks until a reply is read

• Server:
  – Accepts connections, reads records from clients
  – Writes a reply when all clients have connected
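
The server logic fits in a few lines; here is an illustrative Tcl rendering (the real sync server is a standalone daemon, not Tcl, and the port number is invented):

    set waiters {}
    set count 4                            ;# barrier size
    proc accept {chan addr port} {
        global waiters count
        gets $chan record                  ;# read the client's small record
        lappend waiters $chan
        if {[llength $waiters] == $count} {
            foreach c $waiters { puts $c "release"; close $c }
            set ::done 1
        }
    }
    socket -server accept 16534            ;# invented port number
    vwait done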

58

Issues

• Why not use the event system for synchronization?
  – The event system is a centralized service
  – As we move to decentralization, we may reconsider

• Authentication: none
  – Local: uses the shared control net, so this is a problem; it won't be once the control net is VLANed
  – Wide-area: wide open; add an HMAC as with events, or just use the event system

59

The Emulab Event System

60

Emulab Control Plane

• Many of Emulab's features are dynamically controllable:
  – Traffic generators can be started and stopped, and their parameters altered
  – Link shaping: links can be brought up and down, and their characteristics modified

• Control is via the NS file, the web interface, or a command line tool.

61

Example: A Link

• NS: create a shaped link:
  – set link0 [$ns duplex-link $n1 $n2 50Mb 10ms DropTail]

• NS: control the link:
  – $ns at 100 "$link0 modify DELAY=20 BANDWIDTH=25"
  – $ns at 200 "$link0 down"

• Command line: control the link:
  – tevc -e tutorial/linktest +10 link0 down

62

What's really happening?

• A link “agent” runs on each (delay) node to control all of the links for that node.

• The agent listens for “events” from the server telling it what to do.

• A per-experiment scheduler doles out the events at the proper time, sending them to the agents.

• Other agents include the traffic generators, program objects, and the link tester.

63

Come on, what's really happening?!

• Use Elvin (http://elvin.dstc.edu.au/)
  – An off-the-shelf publish/subscribe system

• Agents "listen" for events by "subscribing" to those they care about.

• The per-experiment scheduler "publishes" events as they come due.

• Events flow from the scheduler through the Elvin daemon to the nodes, and ultimately to the agents that wanted them.

64

Static/Dynamic event flow

65

Issues: Time

• What happens to "event time" when an experiment is swapped?
  – Run in real time: events could be lost
  – Suspend time: dilation of experiment time
  – Restart time: replay the static event stream

• Timing for dynamic events
  – tevc … +10 link0 down; tevc … +10 link1 up
  – What is the latency between the two events?

• What latency do we need to guarantee?

66

Issues: Security

• The Elvin mechanism is too heavyweight
  – Requires encryption to protect authentication keys
  – We have no reason to encrypt our events

• Don't want to tie ourselves to Elvin
  – In principle
  – Elvin has gone closed source

• Emulab past: no authentication, no wide-area

• Emulab current: use an end-to-end HMAC
  – Key transferred via TMCC
  – Wide-area nodes are supported but cannot inject events

67

Issues: Scaling

• Open Elvin TCP connection for every agent
  – Use a per-node proxy
  – But agents still send events directly to boss
  – And there are still a lot of nodes

• Use UDP?
  – What about lost events?

• Deliver static events to nodes early?
  – Doesn't help dynamic ("now") events

• Multicast, someday (not the current usage model)
• You'd think we could just find a better pub/sub system, but we haven't.