Visualisation of Networks
3rd Year Software Engineering Project by David Gilbert

Department of Computer Science, University of Durham

2005

www.randomwire.com

No part of the material offered has previously been submitted by the author for a degree in the University of Durham or in any other university. All the work presented here is the sole work of the author and no one else. 18,000 words approximately.



Abstract

In this report we aim to explore the field of 'Information Visualisation' in relation to mapping interconnected structures (networks). We investigate the effectiveness of current methods and theories that guide the construction of visualisations. A review of the TCP/IP network protocol and possible topologies demonstrates the type and range of information available to be mapped. The prototype process model is followed to produce a design which is implemented to form a tool capable of connecting together multiple Linux tools for the purpose of collecting and visualising network data. Methods for evaluating visualisations are discussed to arrive at a set of evaluation criteria, which are then set against a number of visualisation tools. Graphic output from these tools is discussed in relation to the knowledge we can gain from it. The OSI model is compared to our findings, showing a clear relationship. Tools using external data sources are then evaluated to converge our knowledge of the domain. A static conceptual map of Durham is also created to demonstrate alternative forms of visualisation, in particular quasi-geographic layout.



Contents

1 Background .......... 6
  1.1 Problem Domain .......... 7
    1.1.1 Human-Information Interaction .......... 7
    1.1.2 Software Engineering Practices .......... 7
  1.2 Terminology Used .......... 8
  1.3 Objectives .......... 8
  1.4 Deliverables .......... 9
    1.4.1 Minimum .......... 9
    1.4.2 Intermediate .......... 9
    1.4.3 Advanced .......... 9
  1.5 Report Contents .......... 9
  1.6 Time Plan .......... 9
2 Part A - Networks .......... 11
  2.1 Introduction .......... 11
  2.2 Background .......... 11
  2.3 OSI Model & Header Composition .......... 12
  2.4 Network Topologies .......... 13
  2.5 Address Classes .......... 14
  2.6 TCP/IP Utilities .......... 14
    2.6.1 Netstat .......... 15
    2.6.2 Nmap .......... 15
    2.6.3 Ping .......... 15
    2.6.4 Traceroute .......... 16
    2.6.5 Whois .......... 16
  2.7 Mapping, Not Hacking .......... 17
  2.8 Summary .......... 17
2 Part B - Visualisation .......... 18
  2.9 Introduction .......... 18
  2.10 Background .......... 18
  2.11 Case Study 1 – The London Underground .......... 20
  2.12 Case Study 2 – Lumeta Internet Mapping Project .......... 20
  2.13 Visualisation Techniques .......... 21
    2.13.1 Graph Theory .......... 22
  2.14 Summary .......... 27
3 Design .......... 28
  3.1 Introduction .......... 28
  3.2 Architecture Design .......... 29
    3.2.1 Data Flow .......... 30
  3.3 System Interaction .......... 32
  3.4 Data Extraction .......... 32
  3.5 Database Format .......... 33
  3.6 Database Concurrency Control .......... 35
  3.7 Technical Considerations .......... 35
  3.8 Design Rationale .......... 36
  3.9 Summary .......... 37
4 Implementation .......... 38
  4.1 Introduction .......... 38
  4.2 Scripting Language Comparison .......... 39
  4.3 Visualisation Tools Features .......... 40
  4.4 Visualisation Tools Criteria .......... 41
  4.5 Script List .......... 42
  4.6 Script Interaction .......... 43
  4.7 Execution Example .......... 44
  4.8 Source Overview .......... 45
    4.8.1 Pingscan.pl .......... 45
    4.8.2 Tracenetwork.pl .......... 45
    4.8.3 Enumerate.pl .......... 46
    4.8.4 Nodeconvert.pl .......... 46
    4.8.5 Discovertrace.pl .......... 46
  4.9 Technical Issues Encountered .......... 47
    4.9.1 Traceroute Efficiency .......... 47
    4.9.2 Service Scanning .......... 47
  4.10 Testing .......... 48
  4.11 Summary .......... 49
5 Evaluation .......... 50
  5.1 Beyond Data .......... 50
    5.1.1 Convergence .......... 50
  5.2 Visualisation Criteria .......... 51
    5.2.1 2D vs. 3D .......... 52
  5.3 Class B Superstructure .......... 55
  5.4 Class C Substructure .......... 57
  5.5 Overlying User Structure .......... 59
  5.6 Summary .......... 60
6 Conclusions .......... 61
  6.1 Summary .......... 61
    6.1.1 Overview .......... 61
  6.2 Objective & Deliverable Fulfilment .......... 62
  6.3 Visual Trends .......... 62
  6.4 100% Open Source .......... 63
  6.5 Future Work .......... 63
7 References .......... 64
  7.1 Books (15) .......... 64
  7.2 Academic Publications (9) .......... 64
  7.3 Web Pages (37) .......... 65
8 Appendix A - Quasi Geography .......... 66
  8.1 Technical Considerations .......... 67
  8.2 Mapping the Durham 'Network' .......... 68
  8.3 Summary .......... 70
9 Appendix B – Overlying Interconnections .......... 71
  9.1 Flickr Graph .......... 71
  9.2 GoogleBrowser .......... 73

Figures and Tables

1 Background .......... 6
2 Part A - Networks .......... 11
  Figure 2.1: TCP Header Composition [TCP1998] .......... 13
  Figure 2.2: Common Network Topologies (US Federal Standard 1037C) [WEB4] .......... 13
2 Part B - Visualisation .......... 18
  Figure 2.3: London Underground Map [WEB14] (2004) .......... 20
  Figure 2.4: Lumeta ISP Map [WEB15] (1998) .......... 21
  Figure 2.5: Circular & Radial Graph Layout [WEB16] .......... 22
  Figure 2.6: Hierarchical (k-layered) Graph Layout [WEB16] .......... 23
  Figure 2.7: Force Directed (Organic) Graph Layout [WEB16] .......... 23
  Figure 2.8: Orthogonal Graph Layout [WEB16] .......... 24
  Figure 2.9: Tree Graph Layout [WEB16] .......... 25
  Figure 2.10: Netscan Layout [ATLS2001] .......... 25
  Figure 2.11: Walrus 3D Hyperbolic Layout [ATLS2001] .......... 26
3 Design .......... 28
  Figure 3.1: Prototype Process Model [BUDG2003] .......... 28
  Figure 3.2: High-level Architecture Diagram (CLI = Command Line Interface) .......... 29
  Figure 3.3: Lower-level Architecture Diagram (showing basic data flow decomposition) .......... 30
  Figure 3.4: Data Flow Diagram .......... 31
  Figure 3.5: Use-case Diagram .......... 32
  Table 3.1: Database Table Definition .......... 33
  Figure 3.6: Database Entity Relationship Diagram .......... 34
  Figure 3.7: Concurrency Error Example .......... 35
  Figure 3.8: Traceroute Operation .......... 35
  Figure 3.9: Example Path of Traceroute Probe .......... 36
4 Implementation .......... 38
  Figure 4.1: Aspects of Software Quality Assurance .......... 38
  Table 4.1: Scripting Language Comparison Matrix .......... 39
  Table 4.2: Visualisation Tools Feature Matrix .......... 40
  Table 4.3: Visualisation Tools Criteria Matrix .......... 41
  Table 4.4: Script Description Breakdown .......... 42
  Figure 4.2: Script Interactions .......... 43
  Figure 4.3: Comparing file changes using TkDiff, a graphical diff and merge tool (illustration only) [WEB29] .......... 48
  Figure 4.4: Test Information Flow .......... 48
5 Evaluation .......... 50
  Figure 5.1: Criteria for the evaluation of visual representations [EVAL2003] .......... 51
  Table 5.1: Visualisation criteria evaluation .......... 52
  Figure 5.2: Node entanglement – WinFDP 2D Layout (4004 nodes) .......... 53
  Figure 5.3: Insufficient computing power – WinFDP 3D Layout (4004 nodes) .......... 54
  Figure 5.4: Incorrect algorithm parameters – WilmaScope 3D Layout (465 nodes) .......... 54
  Figure 5.5: WinFDP 3D Force-Directed Visualisation of Class B Network Structure (5289 nodes) .......... 55
  Figure 5.6: WilmaScope Force-Directed Visualisation of Class B Network Structure (5289 nodes) .......... 56
  Figure 5.7: WinFDP 3D Force-Directed Visualisation of Class C Network Structure (469 nodes) .......... 57
  Figure 5.8: WilmaScope Force-Directed Visualisation of Class C Network Structure (469 nodes) .......... 58
  Figure 5.9: Sample user list from compsoc.dur.ac.uk collected with the finger service .......... 59
  Figure 5.10: SSH tunnel connections .......... 59
  Figure 5.11: Conceptual layered network related to the OSI model .......... 60
6 Conclusions .......... 61
  Figure 6.1: Objective and Deliverable fulfilment .......... 62
7 References .......... 64
8 Appendix A - Quasi Geography .......... 66
  Figure 8.1: Tokyo Metropolitan Area Rail Map (partial) .......... 67
  Figure 8.2: Traditional Map of Durham (to scale, centred around university sites) .......... 68
  Figure 8.3: 3D Sketch Map of Durham .......... 69
  Figure 8.4: Quasi Geographic Static Representation of Durham (Version 1.8) .......... 70
9 Appendix B – Overlying Interconnections .......... 71
  Figure 9.1: Flickr Graph start phase visualisation .......... 71
  Figure 9.2: Flickr Graph post expansion visualisation .......... 72
  Figure 9.3: Flickr Graph photo expansion visualisation .......... 72
  Figure 9.4: Google search results example .......... 73
  Figure 9.5: GoogleBrowser related web-links visualisation .......... 73
  Table 9.1: Visualisation criteria evaluation .......... 74



1 Background

[Image: 10x10 – interactive exploration of the words and pictures that define the time]

[Image: Marumushi Newsmap – a distilled ambient information source]



1.1 Problem Domain

Information is everywhere, permeating our daily lives, and there is no escaping it. In fact, there is so much information available that no single individual could ever hope to understand it all. Chaomei Chen explains this problem well in 'Information Visualisation and Virtual Environments':

“Information overload becomes a common problem in the exponential growth of widely accessible information in modern society and efficient information filtering and sharing facilities are needed to resolve it. Information visualisation has the potential to help people find the information they need more effectively and intuitively” [CHEN1999]

Computer Science and Software Engineering have strong roles to play in the way we shape and visualise information in the modern world. Advances in computer technology are opening up new and exciting possibilities within this field and it is our primary purpose in this report to explore these avenues within the context of better understanding networked structures.

The examples of visualisations which represent news, shown at the start of this chapter, give a flavour of ways in which people are already doing this. They go beyond traditional ways of representing news headlines and provide a relationally represented snapshot of current affairs for any given moment, past or present. We provide more detailed case studies of such work in Chapter 2 Part B, Visualisation.

This project sets out to visualise computer networks for the purpose of better understanding how they may be depicted so as to convey knowledge of their nature to the viewer. To do this we have investigated how networks are constructed and methods of visualisation, and have then brought both of these domains together to form a tool for achieving this objective. From this we aim to be able to perform a structural analysis and evaluate the degree to which our visualisations fulfil their purpose.

1.1.1 Human-Information Interaction

The real design problem here is not increased access to information, but greater efficiency in finding useful information within huge volumes of it. Increasing the rate at which people can find and use relevant information improves human intelligence and thus evolves society.

We aim to develop processes (and the software to support them) for mapping the structure of computer networks into a visual form so that the reader may gain a better understanding of its composition. In effect we intend to leverage the highly-developed human visual system to achieve rapid understanding of abstract information. This is known as 'Human-Information Interaction' (HII) which should not be confused with Human-Computer Interaction (HCI).

Because of time constraints and the focus of our work we do not plan to implement our own graphics rendering applications but instead to use a range of existing ones within a process framework (see Chapter 3, Design). The tools we create will essentially tie together a wide range of existing tools for the main purpose of collecting and composing data.

1.1.2 Software Engineering Practices

Within the context of Software Engineering, visualisation, as we intend to use it, is a means to an end. It enables us to communicate information in a way that amplifies cognition (see Chapter 2 Part B, Visualisation). It is not about producing fancy animations which only superficially impress, and we are not concerned with the graphic rendering process, although we will discuss technical concepts relating to image generation where relevant to the final output.

Even though all work is being carried out on an individual basis we have aimed to follow good Software Engineering practices to help achieve our aims (laid out later in this chapter) and by doing so ensure that the principles of rigour and formality, modularity and decomposition, abstraction, anticipation of change, scalability, compositionality and heterogeneity are maintained [PRES1997].

Throughout we have followed the prototype process model which is discussed in Chapter 3, Design.



1.2 Terminology Used

Below is a set of domain-specific terms, listed and explained, which are used within the document:

Connection An association or relationship which links two distinct entities within a data set. With reference to a computer network it is a means or channel of communication.

Framework A skeletal support used as the basis for something being constructed around, on top, or within it. It embodies an abstract design for solutions to a number of related problems.

Graphic A written or pictorial representation generated by a computer or imaging device to convey some form of meaning.

IP Space An allotted set of addresses determined by the class of address the network has been assigned. Exists as a virtual concept where each address may or may not be actually assigned to a host.

Immersive Refers to the impression that someone has of being somewhere while, in reality, they are physically in another place. [Marc Bernatchez, 2004]

Linux An open source computer operating system built around the kernel originally developed by Linus Torvalds. Linux is widely used for servers and is gaining ground in the desktop market.

Network A system of connections that cross or interconnect, e.g. rail network, phone network, social network, computer network etc. Usually represented with nodes and arcs.

Output The net result produced by a program or process from a specific input. Output may come in many forms, e.g. textual, graphical, audible etc.

Protocol A standard procedure for regulating data transmission between computers. For our purposes the predominant protocol we examine is TCP/IP.

Service Functionality derived from a particular software program. E.g., network services may refer to programs that transmit data or provide conversion of data in a network.

Spatialisation Relates to something which fills (partially or fully) a region of space. This includes any form of graphic representation.

Structure The interrelation or arrangement of parts in a complex entity held or put together in a particular way. E.g., social structure, network structure, political structure etc.

Tool An application used to perform, facilitate, or automate a task often to speed up or reduce repetitive operations.

All other technical terminology is explained in-line with its initial usage.

1.3 Objectives

This project will:

1. Investigate the type and range of information to collect from a network.

2. Review tools available in Linux for doing this (e.g. ping, traceroute, nmap, etc.).

3. Investigate current Information Visualisation techniques.

4. Produce a tool which accurately maps the structure of any given computer network.

5. Make an informed structural analysis of the Durham University network.

6. Produce a conceptual map of a network that links information with its physical geography.

These objectives directly correspond to the deliverables laid out in the next section.



1.4 Deliverables

With reference to relevant technical and theoretical documentation (books, academic publications, journals, web pages etc.) this project aims to deliver the following in order of priority:

1.4.1 Minimum

1. Literature survey on Information Visualisation techniques.
2. Collection of structural information from a small network (<20 nodes).
3. Visualisation of network structure.

1.4.2 Intermediate

1. Collection of information and visualisation of a large network (1000s of nodes – the Durham internal network).
2. Associate network traffic information alongside structure.
3. Ability for the user to define IP scopes to scan and map.

1.4.3 Advanced

1. Visualisation of information in database.
2. Comparative analysis of collected information.
3. Map network structure onto a conceptual static spatialisation.

Minimum and Intermediate deliverables are mandatory while advanced ones will be fulfilled to the extent that time has allowed.

1.5 Report Contents

• Chapter 1 – Background: Provides a context for the project and describes the need for a solution to the problem.

• Chapter 2 Part A – Networks: Investigates the technological means by which computer networks can be traversed for the purpose of determining structure and other attributable information.

• Chapter 2 Part B – Visualisation: Explores the definition of Information Visualisation, the techniques available for producing visualisations and their possible application to the project.

• Chapter 3 – Design: Describes the design of the system and how the prototype process model was used to derive it. Evaluates technical considerations and the rationale behind decisions made.

• Chapter 4 – Implementation: Presents a high level overview of resulting code and a description of the software development practices involved. Explains important implementation decisions made.

• Chapter 5 – Evaluation: Looks at ways of evaluating visualisations and what we can understand from the ones we have produced ourselves. Converges multiple sources of information to form a cohesive picture.

• Chapter 6 – Conclusion: Summarises the work carried out during the project and provides an overview of the results attained. Highlights achievements and further work which could be conducted.

• Appendix A – Quasi Geography: Looks at mapping information onto the physical world and discovering new visual metaphors. Presents a conceptual network map of Durham in the form of a subway map.

1.6 Time Plan

The Gantt chart below shows how the time has been apportioned from October 2004 to May 2005 for the project to be completed.


[Image: Gantt chart – project time plan, October 2004 to May 2005]


2 Part A - Networks"In a time when, even if nets were to guide all consciousness that had been converted to photons and electrons toward coalescing, standalone individuals have not yet been converted into data to the extent that they can form unique components of a larger complex"

– Ghost in the Shell: Stand Alone Complex, 2030

2.1 Introduction

In today's wired society we take the global infrastructure of the Internet and its related technologies for granted. We are used to its 24/7 immediacy and its ability to bridge continental communication divides - it has become a metaphysical means to an end where conventional borders and boundaries no longer apply. In many ways it has become an extension of the physical realm in which we exist. There are estimated to be over 930 million people with Internet access worldwide (over 10% of the world's population [US Central Intelligence Agency, 2004]) and this is set to rise to over a billion within the next year. With this rapid expansion we have witnessed the birth of a new form of geography, that of the 'information society' and 'cyberspace' [MAPP2001]. In effect it is a new multi-dimensional frontier which is ripe for exploration. In this chapter¹ we will look at the technological implications of doing this in the context of scanning a network to determine its structural and salient characteristics.

2.2 Background

The most common form of network protocol system is TCP/IP (Transmission Control Protocol/Internet Protocol), which is the amalgamation of two layered conceptual models for data transfer that began development during the 1960s. The protocol defines the network communication process and, most importantly, how units of data (called packets) should be constructed and what meta-data (data about data) they should contain so that a receiving device can interpret and route them appropriately (provided in the form of a header). TCP/IP comprises the following important features [TCP1998]:

• Logical Addressing - allows for physical hardware addresses to be mapped to IP addresses that segment networks into subnets.

• Routing - the means by which packets are directed between different subnets. There may be multiple paths between nodes.

• Name Resolution - maps human readable names onto IP addresses (e.g. randomwire.com -> 217.42.134.134); a short sketch of this follows the list below.

• Error Checking and Flow Control - features that ensure the reliable delivery of data across the network. Used to limit congestion and ensure absolute data integrity.

• Application Support - provides an interface for applications on the computer to access the network. This is achieved through the use of logical channels (ports) which are mapped to requesting applications.
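
To make the name-resolution feature above concrete, the following is a minimal Perl sketch (an illustration only, not one of the project's scripts) that resolves a host name to an IPv4 address using the standard Socket module; the host name is taken from the example above.

#!/usr/bin/perl
use strict;
use warnings;
use Socket;

# Resolve a host name to its IPv4 address (name resolution, as described above).
my $host   = 'randomwire.com';
my $packed = gethostbyname($host) or die "Could not resolve $host\n";
print "$host -> ", inet_ntoa($packed), "\n";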

¹ N.B., for the sake of clarity, this chapter has been split into two distinct parts (A - Networks & B - Visualisation)



These characteristics make network traversal possible and their ability to scale to networks with potentially millions of nodes has driven the growth of the Internet as we know it today. The beauty of a standardised protocol model is that it is OS independent and therefore should enable foreign systems to communicate seamlessly without any need for complex interchange formatting. There are many other network protocols (e.g. AppleTalk, NetBEUI etc.) but their use is so limited that for the purposes of this investigation they will not be considered.

Our aim is to understand how network structures are formed from a technical viewpoint and how they may be traversed for the purpose of storing the information which can be passed on for visualisation.

2.3 OSI Model & Header CompositionThe Open Systems Interconnection (OSI) model was developed to provide a conceptual model on which networks could be based. The OSI model divides communications functions into seven layers, each of which provide services for the layer directly above it and use the functions of the layer below it. They can be summarised as follows [WEB1]:

• Layer 1 - Physical Layer: Defines all electrical and physical specifications for devices so that data may be converted into a stream of electric or analogue pulses for transmission over the connection medium (e.g. cable, wireless or microwave).

• Layer 2 - Data-Link Layer: Defines the topology of network connections (e.g. star, ring or bus) and identification of machines on a single network segment. Unique physical addresses are hard-coded into the network cards at the time of manufacture (called MAC address).

• Layer 3 - Network Layer: Provides logical addressing, network routing, flow control, segmentation/desegmentation, and error control functions. This layer provides the standards for forwarding packets to networks not physically connected to the same one you are on.

• Layer 4 - Transport Layer: Provides for transparent transfer of data between end users. Implements 'connections', which require sequential data flow, that errors be detected and corrected (e.g. by retransmission), and that data transmissions be acknowledged (if desired).

• Layer 5 - Session Layer: Establishes 'sessions' between communicating applications. It provides check-pointing, adjournment, termination, and restart procedures. It is responsible for setting up and dropping TCP/IP sessions. This layer is generally not well defined.

• Layer 6 - Presentation Layer: Translates data into a standard format. MIME (Multi-purpose Internet Mail Extensions) encoding, encryption and compression of the presentation of data is done at this layer.

• Layer 7 - Application Layer: Provides a network interface for applications to communicate across a network, without prior knowledge of the physical topology, the network architecture, or the network protocol.

TCP/IP can be described in terms of the OSI model, although it is important to note that real-world implementations do not always correspond exactly to the above; in fact some omit or amalgamate some of the layers. We can map the TCP/IP model (left) onto the OSI model (right) as follows:

Application Layer -> Application Layer, Presentation Layer, Session Layer
Transport Layer -> Transport Layer
Internet Layer -> Network Layer
Network Access Layer -> Data-Link Layer, Physical Layer

For our purposes we will be most interested in the Application and Transport layers which are most relevant to being able to trace routes through a network. Any solution will most likely work at the application layer and have little or no direct interaction with any other layers.

To facilitate data packets passing through each layer of the stack, a chunk of control information called a header is added to the actual data. This is stripped away, layer by layer, as the data is received at the other end of the transmission. Most firewalls use the information contained within a packet header to determine whether or not it is allowed to pass through, which has implications for network traversal. A typical TCP header can be seen in Figure 2.1.



Source Port | Destination Port
Sequence Number
Acknowledgement Number
Offset | Reserved | Flags | Window
Checksum | Urgent Pointer
Options (Optional)

Figure 2.1: TCP Header Composition [TCP1998]

Information contained within a packet's header can be useful for tracking where it has come from or is going to. Custom headers may be constructed to create probes for tracing routes within a network. Malformed headers are commonly used to breach network security by tricking a host into thinking that a packet has come from a false destination. On Linux systems there are several tools available for watching/examining packets passing through your system, the most popular being tcpdump (text UI) [WEB2] and ethereal (graphical UI) [WEB3].
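
As a small illustration of the layout in Figure 2.1 (not part of the project's toolset), the fixed fields of a TCP header can be decoded in Perl with unpack; the sample values packed below are invented purely for demonstration.

#!/usr/bin/perl
use strict;
use warnings;

# Build a hypothetical 20-byte TCP header: source port 49467, destination
# port 22 (ssh), arbitrary sequence/acknowledgement numbers, data offset 5.
my $header = pack('n n N N n n n n',
                  49467, 22, 1000, 2000, (5 << 12), 5840, 0, 0);

# Decode the fixed fields in the order shown in Figure 2.1.
my ($src, $dst, $seq, $ack, $off_flags, $window, $checksum, $urgent)
    = unpack('n n N N n n n n', $header);

my $data_offset = ($off_flags >> 12) & 0xF;   # header length in 32-bit words
my $flags       = $off_flags & 0x3F;          # control bits (SYN, ACK, FIN etc.)

printf "src=%d dst=%d seq=%u ack=%u offset=%d flags=0x%02x window=%d\n",
       $src, $dst, $seq, $ack, $data_offset, $flags, $window;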

2.4 Network Topologies

In the context of a network, topology refers to its shape or layout. How different nodes in a network are connected to each other and how they communicate are determined by the network's topology. Topologies are either physical or logical. The most common types of network topology are illustrated in Figure 2.2 below.

• Ring Topology - Every node has exactly two branches connected to it. These nodes and branches form a ring. If one of the nodes on the ring fails then the ring is broken.

• Mesh Topology - There are at least two nodes with two or more paths between them. Messages sent on a mesh network can take any of several possible paths from source to destination.

• Star Topology - All peripheral nodes are connected to a central node, which rebroadcasts all transmissions received from any peripheral node to all peripheral nodes on the network.

• Fully Connected Topology - All nodes are connected directly to every other node. Decentralised so immune to infrastructure failures. Rarely used in practice due to impracticalities.

• Line/Linear Topology - See Bus Topology.
• Tree Topology - Nodes are arranged as a tree. This resembles an interconnection of star networks but, unlike the star network, the function of the central node may be distributed.
• Bus Topology - Uses a single line (the bus/backbone) to which all nodes are connected. If the backbone cable fails, the entire network effectively becomes unusable.


Figure 2.2: Common Network Topologies (US Federal Standard 1037C) [WEB4]


The networks we will be examining will be hybrid topologies containing star, tree and mesh-like structures of interconnected LANs (Local Area Networks) and WANs (Wide Area Networks). A hybrid topology is a combination of any two or more network topologies in such a way that the resulting network does not have one of the standard forms. It is important to note that physical topology should not be confused with logical topology which is the method used to pass information between nodes (i.e. the protocol).

Another factor which must be taken into account is the possibility of 'network overlay' where one network works on top of another (e.g. Peer-to-peer (P2P) networks which run on top of the Internet) [WEB5]. The overlay contains the distributed information in the network including connectivity information, location/name of superpeers (those which take on server-like responsibilities), closest peers, and adjacency. This can obfuscate traffic analysis as P2P networks can be difficult to distinguish from others due to their use of other transport protocols.

2.5 Address Classes

Each computer that accesses a network has to have a unique IP address, which is either manually or automatically assigned by a central server (DHCP). An IP address is a 32-bit logical address divided into four octets, each with a minimum value of 0 and a maximum value of 255 (e.g. 217.42.134.134) [WEB6]. Every address is divided into three parts: the network address, subnet address and host address. IP addressing is divided into three different network classes to split up networks according to their size:

• Class A - Intended for use by very large network providers such as ISPs. Addresses start with a number between 1 and 126. Only 126 of these networks are available, however each class A network can handle 16,777,214 IP addresses or computers.

• Class B - Used by large networks (e.g. Durham University). Addresses of this type start with a number between 128 and 191. It is possible to have 16,384 of these networks and each class B network can handle up to 65,534 IP addresses or computers.

• Class C - The most widely used class, typically by small businesses. Class C addresses start with a first number between 192 and 223. There can be up to 2,097,151 class C networks and each network can handle close to 254 computers.

Each IP address is also given a 'subnet mask'. This enables administrators to split up networks further and provide more addresses within the same IP space. This standard of using 32-bit addresses is known as IPv4, but to cater for the ever increasing number of devices connected to the Internet a new addressing system, known as IPv6, is slowly being brought into use. This uses 128-bit addresses allowing for potentially 2^128 = 340,282,366,920,938,463,463,374,607,431,768,211,456 theoretically assignable addresses! [WEB7]
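
As a minimal sketch of the class boundaries described above (an illustration only; the address used is the example from earlier in this section), an IPv4 address can be classified in Perl by inspecting its first octet:

#!/usr/bin/perl
use strict;
use warnings;

# Classify an IPv4 address by its first octet, following the class ranges above.
sub address_class {
    my ($ip) = @_;
    my ($first) = split /\./, $ip;
    return 'A' if $first >= 1   && $first <= 126;
    return 'B' if $first >= 128 && $first <= 191;
    return 'C' if $first >= 192 && $first <= 223;
    return 'other (loopback, multicast or reserved)';
}

my $address = '217.42.134.134';
print "$address is a class ", address_class($address), " address\n";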

The network we will be looking at falls into the Class B range and will have somewhere in the region of 8000 hosts on-line at any given time, depending on factors such as the time of day and week. For our purposes unfamiliarity with the network structure will be assumed, so all explorations will be carried out 'blind', without the aid of any prior knowledge.

2.6 TCP/IP Utilities

Linux (the operating system we will be using) provides a host of useful tools for enumerating information about the network it is connected to. The range of network information that we will be collecting can be divided up as follows:

• Packet statistics (I/O, distribution, source/destination, protocol, subnet)
• Host information (IP address, host name, host OS, geographic location)
• Route information (common routes, available nodes, congested nodes)

Below we have identified and explained the function of a number of programs which may be of use. We have also included example output of each program to demonstrate its capabilities. The list is not exhaustive but gives an overview of the most commonly used and available tools. Some administrators may restrict their use to superusers only as some may be used for malicious purposes.



2.6.1 Netstat

Symbolically displays the status of network connections on either TCP, UDP, RAW, or UNIX sockets to the system. There are a number of output formats, depending on the options for the information presented as stipulated by the user. This can be useful for low-level connection analysis [WEB8].

david@izumi:~$ netstat
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 localhost:41809         localhost:3128          TIME_WAIT
tcp        0      0 0-7-e9-4e-1-6d.cd:42125 nntphost.dur.ac.uk:nntp ESTABLISHED
tcp        0      0 localhost:54754         localhost:3128          ESTABLISHED
tcp        0      0 localhost:3128          localhost:54754         ESTABLISHED
tcp        0      0 0-7-e9-4e-1-6d.cd:41815 saturn.dur.ac.uk:imap   ESTABLISHED
tcp        0      0 0-7-e9-4e-1-6d.cd:49467 host217-42-134-134.:ssh ESTABLISHED

2.6.2 Nmap

Nmap (‘Network Mapper’) is a utility for network exploration or security auditing. “It was designed to rapidly scan large networks using raw IP packets in novel ways to determine what hosts are available on a network. It will also try to work out what services (application name and version) those hosts are offering, what operating systems (and OS versions) they are running, what type of packet filters/firewalls are in use, and dozens of other characteristics”. Highly configurable [WEB9].

root@izumi:/home/david# nmap -sV -O localhost

Starting nmap 3.50 ( http://www.insecure.org/nmap/ ) at 2004-11-14 16:09 GMT
Interesting ports on localhost (127.0.0.1):
(The 1644 ports scanned but not shown below are in state: closed)
PORT      STATE SERVICE     VERSION
21/tcp    open  ftp         ProFTPD 1.2.9
22/tcp    open  ssh         OpenSSH 3.8.1p1 (protocol 1.99)
25/tcp    open  smtp        Sendmail 8.12.11/8.12.11
80/tcp    open  http        Apache httpd 1.3.32 ((Unix) PHP/4.3.9)
113/tcp   open  ident       OpenBSD identd
139/tcp   open  netbios-ssn Samba smbd 3.X (workgroup: RANDOMWIRE)
587/tcp   open  smtp        Sendmail 8.12.11/8.12.11
631/tcp   open  ipp         CUPS 1.1
3128/tcp  open  http-proxy  Squid webproxy 2.5.STABLE6
3306/tcp  open  mysql       MySQL 4.0.20
5900/tcp  open  vnc         VNC (protocol 3.7)
10000/tcp open  http        Webmin httpd
Device type: general purpose
Running: Linux 2.4.X|2.5.X
OS details: Linux 2.5.25 - 2.5.70 or Gentoo 1.2 Linux 2.4.19 rc1-rc7
Uptime 12.916 days (since Mon Nov 1 18:10:47 2004)

Nmap run completed -- 1 IP address (1 host up) scanned in 15.674 seconds

2.6.3 Ping

Packet Internet Groper. “A program used to test whether a particular network destination is online, by sending an Internet control message protocol (ICMP) echo request and waiting for a response. (Also called packet internet gopher)”. Round-trip ping times can be used for estimating network congestion and latency [WEB10].

david@norwich:~$ ping google.com
PING google.com (216.239.37.99) 56(84) bytes of data.
64 bytes from 216.239.37.99: icmp_seq=1 ttl=235 time=95.9 ms
64 bytes from 216.239.37.99: icmp_seq=2 ttl=235 time=90.1 ms
64 bytes from 216.239.37.99: icmp_seq=3 ttl=235 time=90.9 ms
64 bytes from 216.239.37.99: icmp_seq=4 ttl=235 time=90.4 ms

--- google.com ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3002ms
rtt min/avg/max/mdev = 90.118/91.872/95.930/2.382 ms



2.6.4 Traceroute

“A tool that shows you the network path between two locations. It shows you the address and how long it takes to get to each hop in the path. When there is a problem with the network, traceroute can often be used to narrow down where the problem is occurring”. Useful for determining a linked-list of routes through a network [WEB11].

david@izumi:~$ traceroute compsoc.dur.ac.uk
traceroute to bylands.dur.ac.uk (129.234.200.100), 30 hops max, 38 byte packets
 1  129.234.121.254 (129.234.121.254)  0.994 ms  0.979 ms  1.893 ms
 2  192.168.251.27 (192.168.251.27)  0.671 ms  0.735 ms  0.730 ms
 3  192.168.251.12 (192.168.251.12)  1.025 ms  1.224 ms  5.878 ms
 4  bylands.dur.ac.uk (129.234.200.100)  6.070 ms  1.392 ms  0.759 ms

2.6.5 Whois

“An Internet directory service which can be used to find information about users registered on a server, or other information about the network”. The databases are maintained by regulated registrars. Useful for finding a host's geographic location, but not always accurate and does not contain information pertaining to internal networks [WEB12].

david@norwich:~$ whois randomwire.com

Whois Server Version 1.3

Domain names in the .com and .net domains can now be registered
with many different competing registrars. Go to http://www.internic.net
for detailed information.

   Domain Name: RANDOMWIRE.COM
   Registrar: TUCOWS INC.
   Whois Server: whois.opensrs.net
   Referral URL: http://domainhelp.tucows.com
   Name Server: NS12.ZONEEDIT.COM
   Name Server: NS15.ZONEEDIT.COM
   Status: REGISTRAR-LOCK
   Updated Date: 12-nov-2004
   Creation Date: 04-aug-2003
   Expiration Date: 04-aug-2006
   ...

The main aim of using tools such as the above is to combine their output to produce relevant data which can be interpreted to form an understanding of its meaning, which by means of analysis becomes knowledge.

A simple pseudo-code example of how some of these tools could be combined to map a network structure is shown below:

for each (possible host in network IP space) {
    ping <address>
    if (ping == true) {
        traceroute <address>
        update database with route information
    } else {
        if (host is in database) {
            remove reference
        }
    }
}

Here each possible host in an IP range (a set of logical IP addresses) is first 'pinged' to see if it is active/alive. If this is the case then the route to it is traced and stored in a database. If it is not active/alive and is in the database then it is removed thereby maintaining the accuracy of the data held within. This shows the power a small program such as this can have by 'gluing' together a range of different tools. It could be easily expanded to do much more.
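
The same idea can be expressed as a short Perl sketch. The update_route() and remove_host() routines below are hypothetical stand-ins for the database layer (stubbed here only so the sketch runs); Net::Ping is part of the standard Perl distribution, and traceroute is invoked as the external Linux tool described in section 2.6.4.

#!/usr/bin/perl
use strict;
use warnings;
use Net::Ping;

my @hosts  = map { "129.234.200.$_" } 1 .. 254;   # example class C range
my $pinger = Net::Ping->new('icmp');              # ICMP ping requires root

for my $address (@hosts) {
    if ($pinger->ping($address, 2)) {             # host is active/alive
        my @route = `traceroute -n $address`;     # capture the hop list
        update_route($address, \@route);          # store the route
    } else {
        remove_host($address);                    # keep the database accurate
    }
}
$pinger->close;

# Stub implementations shown only so the sketch is self-contained.
sub update_route { my ($addr, $route) = @_; print "up:   $addr\n"; }
sub remove_host  { my ($addr)         = @_; print "down: $addr\n"; }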



2.7 Mapping, Not Hacking

Whilst carrying out scans of the network we do not want our probes to be confused with malicious hacking attempts. Firewalls and Intrusion Detection Systems (IDS) will often mistake frequent requests of a similar nature for an attempt to flood the network (a DoS, or Denial of Service, attack), so it will be necessary to keep scan rates below these thresholds.

Tools like Nmap will allow you to set timing policies using a set of pre-defined modes [WEB13]:

“Paranoid mode scans very slowly in the hopes of avoiding detection by IDS systems. It serializes all scans (no parallel scanning) and generally waits at least 5 minutes between sending packets. Sneaky is similar, except it only waits 15 seconds between sending packets. Polite is meant to ease load on the network and reduce the chances of crashing machines. It serializes the probes and waits at least 0.4 seconds between them... Normal is the default Nmap behaviour, which tries to run as quickly as possible without overloading the network or missing hosts/ports. Aggressive This option can make certain scans (...) much faster. It is recommended for impatient folks with a fast net connection. Insane is only suitable for very fast networks or where you don't mind losing some information. It times out hosts in 15 minutes and won't wait more than 0.3 seconds for individual probes. It does allow for very quick network sweeps though.”

Here, using the 'normal' setting should be sufficient to balance speed against the risk of false detection. To further minimise the risk of this happening, network administrators should be informed of any plans to carry out such a survey so as to minimise confusion if scans are logged by IDSs or firewalls.
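
For illustration, a scan could be launched from a Perl wrapper with an explicit timing template; the target range below is invented, and the exact Nmap options should be checked against the installed version's documentation [WEB13].

#!/usr/bin/perl
use strict;
use warnings;

# Ping-scan an example range using Nmap's 'Normal' timing policy.
my @cmd = ('nmap', '-sP', '-T', 'Normal', '129.234.200.0/24');
system(@cmd) == 0 or warn "nmap exited with status $?\n";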

2.8 Summary

Networks can be as simple or as complex as the level at which we view them. Data can be viewed from many different abstractions, from the core kernel level showing raw data as it passes through available network interfaces, to the higher application level where basic I/O calls take place. Our concerns lie between these two extremes: although we want low level information we do not want to go so low that verboseness overtakes usefulness in what we collect.

There are many tools already readily available on the Linux platform for enumerating data pertaining to connected networks. Although most of this data in its raw form is not readily accessible for human scrutiny it could be easily manipulated by scripted programming languages for alternate output and for input into other programs.
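
As a simple example of this kind of manipulation (the output format is invented for illustration), a few lines of Perl can turn the traceroute output shown in section 2.6.4 into an edge list that a graph-layout tool could consume:

#!/usr/bin/perl
use strict;
use warnings;

# Read traceroute output on standard input and print one edge per
# consecutive pair of hops, e.g. "129.234.121.254 -> 192.168.251.27".
my @hops;
while (my $line = <STDIN>) {
    push @hops, $1 if $line =~ /^\s*\d+\s+(\S+)/;   # hop number then host/IP
}
for my $i (1 .. $#hops) {
    print "$hops[$i-1] -> $hops[$i]\n";
}

Usage, for example: traceroute -n compsoc.dur.ac.uk | perl trace2edges.pl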

For our purposes of mapping networks, other questions of ethical conduct must also be raised. Although the information we will be collecting is technically in the public domain, those to whom it is attributed, and from whom it is collected, have given no consent for its use. Here we enter the realms of 'Big Brother', who sees all and perhaps reveals interactions that were previously hidden to the naked eye. Essentially we are talking about a new form of surveillance which may constitute an invasion of privacy. Although data will not be directly attributable to any single individual, some may object to having their computer scanned, and provision should be made for people to be able to opt out if anyone complains.



2 Part B - Visualisation"If once we were able to view the Borges fable in which the cartographers of the Empire draw up a map so detailed that it ends up covering the territory exactly... Simulation is no longer that of a territory, a referential being, or a substance. It is the generation by models of a real without origin or reality: a hyperreal... The desert of the real itself."

- Jean Baudrillard, 'Simulacra and Simulation', 1981

"You've been living in a dream world Neo. This is the world as it exists today... Welcome to the desert of the real."

– Morpheus, 'The Matrix', 1999

2.9 Introduction
When talking about 'visualisation' it is all too easy to skim over exactly what we mean by this term - is it about drawing pretty pictures or about enhancing human cognitive abilities? In a society which is reaching the point of 'information overload' it has become an important skill to be able to make sense of and present the information around us so as to exploit its value [TUFT1990]. Networks have become a globalising platform on which traditional ideas of place, community and identity are being radically altered and yet we still have little knowledge of how geographies of power and political structures work within this space [MAPP2001]. Information visualisation may hold the key to some of these issues. In this chapter we will be exploring the definition of information visualisation, the techniques available for producing visualisations, their application to this project and what has already been done in this exciting multi-discipline and rapidly developing field.

2.10 Background
The amount of data being collected daily by businesses and individuals from every conceivable source is astronomical. Whether it be about economic markets or technical performance, data is pervasive – it has become an important commodity in the modern world [WITT2003]. Having copious amounts of data, however, is useless on its own; analysis must take place to extract anything meaningful from it. One solution is to build graphical representations of the data, the value of which is most easily explained by the ancient Chinese proverb that “a picture is worth a thousand words”. Although there is no single definitive definition, this forms the basis of what we can consider 'information visualisation'.

Man has been visualising the world around him as far back as we can see, from the earliest cave paintings to the maps created by the first explorers, gradually becoming more sophisticated in their approach. Today the availability of increasingly powerful and affordable computing power is revolutionising the way data is handled and the possibilities for producing detailed and interactive graphics are virtually endless. It is important to remember however that increased computing capacity does not necessarily yield better output – careful consideration must go into building any visualisation [VISU1995].


If we expand upon this basic definition of mapping data into a visual form the core objectives of information visualisation can be broken down as follows:

• Amplify cognition (perception, reasoning, or intuition; knowledge) [CARD1998]
• Discovery, decision making and exploration [WITT2003]
• Reduce “cost-of-knowledge” [CARD1998]
• Act as a link between the human mind and the modern computer [WITT2003].

In essence information visualisation provides the means for metaphorical reasoning based on the visual display. Individuals will derive their understanding of any graphic based on its context within the display and the application of their knowledge of similar concepts. It is as simple as it is complex, especially today with many new possibilities and opportunities being afforded by advances in computer technology, the main ones being outlined below [INFO2000]:

• Visual representations – increasingly elaborate (e.g. 3D)
• Interaction – dynamic queries and user control (e.g. 'fly through' simulations)
• Multiple representations – displayed either on the same or separate devices
• Animation – dynamic representations, supporting the finding and understanding of patterns and anomalies
• Display options – virtual reality (currently of limited effectiveness).

A wealth of research has been done into the effectiveness of visualisations, most notably by Edward Tufte who has developed a number of guidelines for good data representation. He stipulates that great care and deliberation should be taken in constructing any visualisation and includes a number of rules which can be summarised as follows [TUFT1983]:

• Make the most of both graphical and textual features
• Reflect a definite sense of scale
• Contain a comprehensible level of detail (data:ink ratio)
• Avoid data redundancy.

We can use these guidelines as the foundation for visualisation design but they should not be seen as solid boundaries to restrict creativity. In this respect we may ask whether developers of visualisations are scientists, engineers or artists? The answer is probably a mix of all three but currently leaning towards the artistic side [WITT2003]. As this field matures it is expected that the situation will move toward the other end of the spectrum with the introduction of recognised good practice and visualisation standards.

In addition to what has been discussed so far visualisation relates to a number of traditional statistical analysis methods, most notably in the areas of 'Data Mining' and 'Knowledge Discovery in Databases' (KDD). Both of these research areas deal with the problem of coping with information overload and have developed a number of useful methods for automating searches through large data sets.

Data mining uses a variety of methods to identify and extract useful (but usually hidden) information within a body of data. It does so in such a way that the information can be put to use in areas such as decision support, prediction, forecasting and estimation [DILL1995]. KDD can be seen as the broader process which encapsulates data mining within a number of pre-defined steps.

The information architect, scholar & theorist Richard Saul Wurman takes an interesting view of the 'business of understanding' in his seminal book 'Information Anxiety' [WURM1989]:

“The only thing we know is our own personal knowledge and lack of knowledge... And since it's the only thing we really know, the key to making things understandable is to understand what it's like not to understand”

While this may seem a little confusing at first it makes a valid point which can be applied to visualisation – before we can begin to draw anything we must first understand what it is we are trying to show. This is no easy task as will be demonstrated through the use of examples in the rest of this chapter.


2.11 Case Study 1 – The London Underground
A globally recognised example of a good visualisation which illustrates Tufte's guidelines (even though it pre-dates them) is the London Underground map, first developed by Harry Beck in 1931 and subsequently becoming the model for most tube/subway/metro maps worldwide (Figure 2.3 below).

Beck applied his knowledge of circuit diagrams to produce the basic map we know today, which abandons conventional geographical layout for a schematic representation but still retains enough reference points for the traveller to understand and interpret their approximate position above ground. This was revolutionary in its day. Its success lies in being instantly clear and comprehensible – a great improvement on its predecessor, which was extremely difficult to interpret at a glance.

The beauty of the orthogonal map layout is that it provides exactly the information needed without any excess - the name of the line (which is colour coded), the direction of travel, and the stops on the way. Distance between stops and geographical orientation are of very secondary interest to finding the correct train. The map achieves this simplicity mainly through the use of a simple colour scheme and only 45 degree angles throughout. This enables the eye to follow a particular line and spot connections with ease.

2.12 Case Study 2 – Lumeta Internet Mapping Project
The London Underground map is an example of a static 2D layout (i.e. it has been manually created, does not change and has no way of representing 3D data). There are many other forms of layout, but probably the most useful for our purpose of mapping networks will involve the use of graphs. An example of a visualisation group which uses graphs is the Lumeta Internet Mapping project.

Started in 1998, its “long-term goal is to acquire and save Internet topological data over a long period of time. This data has been used in the study of routing problems and changes, DDoS attacks, and graph theory.” Figure 2.4 below shows a tree-like structure with over 100,000 nodes which represent major ISPs around the world:


Figure 1: London Underground Map [WEB14] (2004)


Data used to construct the maps is derived from traceroute probes which determine paths through the Internet between nodes. As routes reconfigure and the Internet grows, the paths have to be updated. This is achieved through complex comparison between new and archived data.

Although at first glance this map may look like a jumbled mess, centres of clustering are obvious and points of interest can easily be picked out. This is made possible through graph colouring according to either geographic location or network capacity. This graph took approximately 20 hours to generate.

2.13 Visualisation Techniques
The technicalities of visualising infrastructure and traffic within the context of a network can essentially be reduced to a graph layout problem. Graphs are widely used to model relational structures (entities linked together in some way, e.g. railway lines between stations). They have a broad range of applications in computer science (e.g. database modelling, software development tools), economics (e.g. entity-relationship diagrams, project management), natural science (e.g. visualisation of excavations in archaeology) and in social science (e.g. social networks) so are not limited to one particular field.

Although algorithm design is beyond the scope of this project it is necessary to have an understanding of the main types of layout algorithm so informed choices may be made in choosing the most appropriate ones for our purpose. Generally a graph drawing algorithm takes as input a graph (in a pre-defined format) and computes a suitable layout based on its definition (i.e. a drawing in 2D or 3D space by assigning coordinates to the vertices and mapping each edge to a simple curve). In most cases, these curves are straight lines or polygonal chains.


Figure 2: Lumeta ISP Map [WEB15] (1998)


A good visualisation of a graph should be aesthetically pleasing and easily understandable by its target audience. There are a number of criteria which can be applied to help with this [TUFT1997]:

• Minimal number of edge crossings or overlaps
• Evenly distributed vertices and edges
• Short edges
• Few edge bends (to reveal regularity)
• A small layout area or volume (to control eye motion)
• A good angular resolution (to use pixels efficiently).

2.13.1 Graph Theory

There are a wide variety of graph drawing techniques that take different approaches to the aesthetic construction of the visualisations. Methods can be classified according to the kind of drawings they produce (e.g. circular, hierarchical, orthogonal, straight-line) as well as the algorithmic model they employ (e.g. force-directed, springs) and the class of graphs they can be applied to (e.g. planar graphs, directed acyclic graphs, trees) [DRAW1998].

The most common types of layout graph are illustrated and described below:

Circular and radial layouts produce graphs that emphasise group and tree structures within a network. They partition nodes into groups by analysing the connectivity structure of the network. The detected groups are laid out on separate circles. The circles themselves are arranged in a radial tree layout fashion. Useful for visualising interlinked clustered structures.

Pros: static focus, avoids crossings and occlusions, scale
Cons: static focus, distance, non-tree-edges


Figure 3: Circular & Radial Graph Layout [WEB16]


Hierarchical layouts aim to highlight the main direction or flow within a directed graph. Cyclic dependencies of nodes will be automatically detected and resolved. Nodes will be placed in hierarchically arranged layers. Additionally the ordering of the nodes within each layer is chosen in such a way that the number of line (or edge) crossings is small. Edge routing can be polyline, orthogonal or in a curved style.

Pros: reveals flow, avoids crossings and occlusions
Cons: edge routing, aspect ratio, scale


Figure 4: Hierarchical (k-layered) Graph Layout [WEB16]

Figure 5: Force Directed (Organic) Graph Layout [WEB16]


When laying out a force directed graph nodes are considered to be physical objects with mutually repulsive forces. The connections between nodes follow a physical analogy and are considered to be metal springs attached to the pair of nodes. These springs produce repulsive or attractive forces between their endpoints if they are too short or too long. The algorithm simulates these physical forces and rearranges the positions of the nodes in such a way that the sum of the forces emitted by the nodes and the edges reaches a (local) minimum.

The algorithm is well-suited for the visualisation of highly connected backbone regions with attached peripheral ring or star structures. These structurally different regions of a network can be easily identified by looking at a drawing produced by this algorithm.

Pros: distance, symmetry, scale, implementation
Cons: crossings, occlusion, unit-length edge assumption
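To make the spring analogy concrete, below is a minimal, self-contained Perl sketch of the idea (it is not taken from any of the tools discussed later); the constants, starting positions and fixed iteration count are arbitrary choices, and real implementations add cooling schedules and more refined force models:

#!/usr/bin/perl
# Illustrative force-directed layout: nodes repel one another, edges act as
# springs pulling their endpoints towards an ideal length, and positions are
# nudged along the net force each iteration.
use strict;
use warnings;

my %pos   = ( a => [0, 0], b => [1, 0], c => [0.5, 1] );   # node => [x, y]
my @edges = ( [ 'a', 'b' ], [ 'a', 'c' ] );                # springs
my ($k, $ideal, $step) = (0.05, 1.0, 0.1);                 # arbitrary constants

for my $iter (1 .. 200) {
    my %force = map { $_ => [0, 0] } keys %pos;
    my @n = sort keys %pos;

    # pairwise repulsion (inverse-square style)
    for my $i (0 .. $#n - 1) {
        for my $j ($i + 1 .. $#n) {
            my ($dx, $dy) = ($pos{$n[$i]}[0] - $pos{$n[$j]}[0],
                             $pos{$n[$i]}[1] - $pos{$n[$j]}[1]);
            my $d = sqrt($dx * $dx + $dy * $dy) || 0.01;
            my $f = $k / ($d * $d);
            $force{$n[$i]}[0] += $f * $dx / $d;  $force{$n[$i]}[1] += $f * $dy / $d;
            $force{$n[$j]}[0] -= $f * $dx / $d;  $force{$n[$j]}[1] -= $f * $dy / $d;
        }
    }

    # spring force along each edge towards the ideal length
    for my $e (@edges) {
        my ($u, $v) = @$e;
        my ($dx, $dy) = ($pos{$v}[0] - $pos{$u}[0], $pos{$v}[1] - $pos{$u}[1]);
        my $d = sqrt($dx * $dx + $dy * $dy) || 0.01;
        my $f = $k * ($d - $ideal);            # stretch pulls, compression pushes
        $force{$u}[0] += $f * $dx / $d;  $force{$u}[1] += $f * $dy / $d;
        $force{$v}[0] -= $f * $dx / $d;  $force{$v}[1] -= $f * $dy / $d;
    }

    # move each node a small step along its net force
    for my $node (keys %pos) {
        $pos{$node}[0] += $step * $force{$node}[0];
        $pos{$node}[1] += $step * $force{$node}[1];
    }
}

printf "%s: (%.2f, %.2f)\n", $_, @{ $pos{$_} } for sort keys %pos;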

Orthogonal layouts are used for undirected graphs and as such are composed of right angles and perpendicular lines. Their construction can be broken down into three phases:

1. Edge crossings in the drawing are calculated
2. Bends in the drawing are computed
3. Final coordinates of vertices are determined

Pros: only two slopes, crossings easy to read, bend minimisation, regular node placement
Cons: distance, direction, edge tracing, difficult implementation


Figure 6: Orthogonal Graph Layout [WEB16]


Tree layouts are mainly used for directed trees that have a unique root element. Starting with the root node the nodes are arranged either from top to bottom, left to right, or bottom to top. The edges of a graph can either be routed as straight lines or in an orthogonal bus-like fashion.

Pros: distance, crossings, regular node placement
Cons: aspect ratio, scale

As well as these prominent forms of graph layout there are also a wide variety of specialised layouts which have been designed by various groups, usually for specific purposes, but which can also be applied to other areas:


Figure 7: Tree Graph Layout [WEB16]

Figure 8: Netscan Layout [ATLS2001]


The netscan visualisation shown above (Figure 2.10) was designed to show relative volumes of postings/posters for a large section of Usenet space using treemaps. Treemaps show large hierarchical structures of data as 2D “space-filling” maps. The relative sizes of the boxes are based on the number of posts per newsgroup for a month. Variations in colour indicate the change in the number of posts for this month compared to last month.

The visualisation shown above (Figure 2.11) uses a special type of 3D projection called hyperbolic space to enable the interactive display of huge directed graphs with hundreds of thousands of nodes. It uses a “focus + context” view of data so that the viewer sees the visualisation through a fish-eye lens, magnifying nodes at the centre of the display. It was produced using the 'Walrus' tool designed by researchers at CAIDA, the Cooperative Association for Internet Data Analysis [WEB17].


Figure 9: Walrus 3D Hyperbolic Layout [ATLS2001]


2.14 Summary
There are many questions to consider when designing any visualisation -

• What do you want to portray?
• What data is available?
• How is the information structured?
• How is the information related (if at all)?
• What techniques are best suited for use?
• What tools are available to facilitate this?
• How does this show something new/different?

These sorts of questions should be continually asked before and during the design process so that the final output is focused and fulfils its purpose. The last question is possibly ambiguous as without a fuller background survey we cannot be sure of everything which has been done before, added to which we are not necessarily aiming for a unique solution. We are however looking for new ways of visualising information which may take a combinatorial approach using a variety of established methods.

Graph layout algorithms can be a substantial help in organising sets of data containing node-arc relationships, but they can also detract from the visual design if too heavily automated. We will be looking for tools which offer simplicity and flexibility as well as depth of functionality, so as to maintain control over the graphic generation.


3 Design

“A common mistake that people make when trying to design something completely foolproof was to underestimate the ingenuity of complete fools.”

– Douglas Adams (1952-2001)

3.1 Introduction
The design of this project revolves around writing a suite of tools, each with a specific function: to collect, transform or present information. It can be seen as a jigsaw in which each piece operates independently but, when brought together, the pieces form a cohesive whole, i.e. the system. We are concerned with writing code which will enable multiple existing Linux-based tools to be connected (or 'glued') together to produce meaningful output. In this sense, although we are still looking to adopt good design principles, we are more concerned with the 'ends' than the 'means'.

Throughout the process we have adopted a prototype model of development. Figure 3.1 shows how this fits into the design structure:

Unlike the commonly used 'waterfall' model, the prototype process model allows for iterative modification between each stage. For a visualisation project this is crucial because if the output is unseen (and therefore unknown) before processing takes place then there is a good chance it will need to be refined before an optimised form can be found. Prototypes are used to aid this process whereby their output is evaluated against a set of requirements (which may change). The prototype is then updated to better reflect these criteria if necessary. This model is said to have the following characteristics [BUDG2003]:

• Evolutionary – software is adapted gradually by changing requirements as they become clearer. A form of iterative refinement.

• Experimental – possible solutions are tested and evaluated against a set of assessment criteria. Unhelpful prototypes are discarded.

• Exploratory – prototypes are used to discover ways of visualising data which bring new meaning to the problem domain.

As well as looking for an optimal solution, another main idea behind this form of development is that we may encounter new ideas which were not originally anticipated. It is a continual process of learning through practical experience, allowing us to move from theoretical principles to having usable tools:

[ Principles -> Methods & Techniques -> Methodologies -> Tools ]


Figure 3.1: Prototype Process Model [BUDG2003]


For our purposes a major part of the design is about discovering and deciding exactly how we wish to capture and visualise the data available to us, and the technicalities of doing so. We have already laid the foundations with the research in the previous chapters and here we will build on this theory to develop a specification for the software modules which will be required to meet our goals. In essence we are looking to produce a design which allows us to move from raw data to some form of wisdom [ARCH2004]:

[ Data -> Information -> Knowledge -> Wisdom ]

Data
• Data on its own tells us very little
• By observing context, we can distinguish data from information

Information
• Information is derived as we organise and present data in various ways
• Organisation can change meaning (either intentionally or unintentionally)
• Presentation enhances existing meaning, mostly on a sensory level

Knowledge
• Knowledge can be distinguished from information by “the complexity of the experience used to communicate it” [WURM2001]
• Design helps the user create knowledge from information by experiencing the information in various ways
• Conversations and stories are the traditional delivery mechanisms for knowledge

Wisdom
• Wisdom is the understanding of enough patterns to use knowledge in new ways and situations
• It is personal, hard to share and reflective.

3.2 Architecture Design
In designing the structure of systems to be implemented we are concerned with the selection of architectural elements, their interactions, and the constraints on those elements and their interactions necessary to provide a framework in which to satisfy the requirements and serve as a basis for the design [PRES1997].

We should begin with a high-level overview of the system showing its core components (Figure 3.2):


Figure 3.2: High-level Architecture Diagram (CLI = Command Line Interface)


In order to abstract functionality there are two main modules here which work independently of each other – the scan and visualisation engines – as described below:

• The scan engine will take input in the form of a list of network hosts or an IP address range (e.g. 192.168.*.*) and trace the route to each individual node. It will then output this data to a database.

• The visualisation engine will extract data from the database and format it correctly for output to a visualisation tool which will produce a static/dynamic graphical representation of the data.

The main reason behind separating both sides is to allow for greater flexibility in the way the data is handled. The database will act as a 'middle-man' by representing the data in a common standardised format which is easily interchangeable between different input and output formats. In the above diagram a CLI-based interface is used to control both sides of the system, providing an easy way for the user to control and monitor the process.

The engines themselves will need to be written in a programming language that is primarily designed for text manipulation and can also be used for a wide range of tasks including system administration, quick prototyping, web development and network programming. Raw program execution performance is not an issue of great concern here as this is not a time-critical application.

In order to better understand how the scan and visualisation engines work together we can look at the way in which data flows within the system, represented on the following diagram (Figure 3.3):

Here IIs (Input Interfaces) collect data from various sources (e.g. traceroute, nmap etc.) and then pipe it into the database in the correct format (see 'Database Design'). The OIs (Output Interfaces) then extract the information from the database and re-format it into the correct format for whatever tool it is being sent to. There is no direct interaction between the two engines, which operate entirely independently of each other.

When we speak of data 'piping' what we are referring to is connecting the standard output of one command to the standard input of another command, thus creating an implicit link between them. For this to work properly the output must be in the correct format for whatever it is being sent to – this is the crux of what the implementation hopes to achieve.
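As an illustration of this 'glue' approach, the short Perl filter below (a hypothetical script, not part of the final system) reads whatever arrives on its standard input, keeps only the lines that look like IPv4 addresses and writes them to its standard output, so it could sit in the middle of a pipeline such as ./pingscan.pl 192.168.0.* | ./ipfilter.pl | ./tracenetwork.pl:

#!/usr/bin/perl
# ipfilter.pl - hypothetical example of a pipeline 'glue' script.
# Reads lines on STDIN and echoes to STDOUT only those that look like
# dotted-quad IPv4 addresses, so output from one tool can be cleaned
# up before being piped into the next.
use strict;
use warnings;

while (my $line = <STDIN>) {
    chomp $line;
    print "$line\n" if $line =~ /^\d{1,3}(\.\d{1,3}){3}$/;   # crude IPv4 check
}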

3.2.1 Data Flow

If we further abstract the level of detail shown for each interface, a Data Flow Diagram (DFD) can be produced to describe the flow of information around a network of operational components. Directed arcs reflect a state transition, circles show an operation and rectangles represent external components [BUDG2003]. It is divided by the database which is in effect the watershed between the Input & Output interfaces (Figure 3.4):


Figure 3.3: Lower-level Architecture Diagram (showing basic data flow decomposition)


As with previous diagrams each arc denotes the movement of information from one component to another. As the data moves through the model it is manipulated and transformed into a form suitable for either storage in the database or final output. The final stages of the OI are not concrete – data may either be piped directly into a visualisation tool or piped into a flat file for manual input later on. This will be determined by the technical capabilities of the visualisation tool and the discretion of the user who may prefer output in the form of an intermediate format for use in further analysis.


Figure 3.4: Data Flow Diagram


3.3 System Interaction
The interaction between the user and the system is minimal. All the user has to do is initiate either the input or output interface with appropriate command parameters (e.g. host list) and then optionally monitor its progress.

As the use-case diagram (Figure 3.5) shows, the Input Interface has three core processes – firstly each host is pinged to check whether or not it is alive. If that succeeds the route to that host is traced and finally it is scanned to check for available services. This final step is optional and may be disabled by the user at run-time.

Log files are updated on-the-fly (i.e. as events happen) so should a catastrophic failure occur (e.g. a power outage) the system should pick up where it left off without data loss. A user may also cancel the process at any time using standard termination commands.
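A minimal sketch of this kind of 'on-the-fly' logging is shown below (the log file name and record format are made up for illustration); opening the file in append mode and enabling autoflush means each record reaches the disk as soon as it is printed, so an abrupt termination loses at most the record currently being written:

#!/usr/bin/perl
# Append-mode logging with autoflush so records are not held in a buffer.
use strict;
use warnings;
use IO::Handle;

open my $log, '>>', 'trace.log' or die "Cannot open log: $!\n";
$log->autoflush(1);                        # flush every print straight to disk

print {$log} scalar(localtime), " traced 192.168.0.1 (1 hop)\n";
close $log;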

3.4 Data Extraction
All the tools we will be using for data extraction either output to 'stdout' (the standard buffered output stream, usually the terminal display) or to some form of text-based log file. It is the first job of the Input Interface to read this output in and cut the various data fields into their component variables. For each line of output this can be achieved through the use of regular expressions to match each element.

A regular expression (often abbreviated as a 'regexp') is a pattern that can match various text strings; for example, 'l[0-9]+' matches 'l' followed by one or more digits. These regular sets and expressions derive from a very formal decomposition of language structure and are fairly incomprehensible at a glance. To construct one you must know and be able to predict the exact output you are going to receive so that you can form the exact algebraic formula that will trigger a match. Operands in a regular expression can be [WEB21]:

• characters from the alphabet over which the regular expression is defined.
• variables whose values are any pattern defined by a regular expression.
• epsilon which denotes the empty string containing no characters.
• null which denotes the empty set of strings.

E.g. for a single line from the output of the traceroute tool:

3 vega.dur.ac.uk (129.234.4.198) 4.197 ms 0.858 ms 1.143 ms

...could be matched using the following regexp:

/^ ?(\d+ [\* ]{0,4}) (\S+) \((.+)\) (.+ ms|\*) (.+ ms|\*) (.+ ms|\*).*/

The leading and trailing forward-slashes mark the beginning and end of the pattern match. Curved brackets are placed around each variable you wish to extract, in this case the hop-count, hostname, IP address and round-trip times (in milliseconds). Some characters, called metacharacters, are reserved for use in regexp notation (e.g. {}[]()^$.|*+?\). A metacharacter can be matched literally by putting a backslash before it, thus negating its special meaning.

In the case that output does not match the regular expression then it is ignored. This can eliminate many of the problems associated with type checking and indexing bounds which can easily break the flow of data through the system.

The only downside of using regular expressions such as these is that you sacrifice a high level of human readability in the code. Good comments are essential here to explain their purpose.
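To show the regular expression in action, the fragment below applies it to a sample traceroute line (with the spacing produced by a typical traceroute run) and unpacks the captured fields; this is essentially what an Input Interface does for every line it reads:

#!/usr/bin/perl
# Apply the traceroute regexp from above and pull out the captured fields.
use strict;
use warnings;

my $line = ' 3  vega.dur.ac.uk (129.234.4.198)  4.197 ms  0.858 ms  1.143 ms';

if ($line =~ /^ ?(\d+ [\* ]{0,4}) (\S+) \((.+)\) (.+ ms|\*) (.+ ms|\*) (.+ ms|\*).*/) {
    my ($hop, $host, $ip, @rtt) = ($1, $2, $3, $4, $5, $6);
    print "hop=$hop host=$host ip=$ip rtt=@rtt\n";
} else {
    warn "Line did not match - ignored\n";    # unmatched output is simply skipped
}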


Figure 3.5: Use-case Diagram


Other tools such as Nmap will output directly to an XML-based format, which makes data extraction simply a matter of parsing the contents into our own database format instead of dealing with regular expressions. As XML documents are flat-file based they can be read using standard I/O handlers, but it is important not to do this while data is still being written, otherwise concurrency errors may occur, corrupting the data.
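A sketch of this kind of parsing is shown below, assuming the XML::Simple module and a report produced with nmap -oX scan.xml; the element names (host, ports, port, state, address) reflect Nmap's XML schema but should be checked against the version actually in use:

#!/usr/bin/perl
# Read a finished Nmap XML report and print address/port/state triples.
use strict;
use warnings;
use XML::Simple;

my $scan = XMLin('scan.xml', ForceArray => [ 'host', 'port' ], KeyAttr => []);

for my $host (@{ $scan->{host} }) {
    next unless $host->{ports} && $host->{ports}{port};   # skip hosts with no port data
    # <address> may occur more than once (IP and MAC), so cope with both shapes
    my $addr = ref $host->{address} eq 'ARRAY'
             ? $host->{address}[0]{addr}
             : $host->{address}{addr};
    for my $port (@{ $host->{ports}{port} }) {
        printf "%s %s/%s %s\n",
               $addr, $port->{portid}, $port->{protocol}, $port->{state}{state};
    }
}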

Once data extraction has taken place redundant files/objects should be expunged so as not to leave a trail of unwanted data residue. This is important not only for the sake of keeping the file hierarchy in order but also to prevent storage waste and the possibility of over-writing other important data.

3.5 Database Format
The information we want to store can be divided into a list of links (or edges) between network hosts (or nodes/vertices) and an associative store of related information about each host (e.g. host name, running services etc.). We will name these the 'route database' and 'host database' respectively. As the information in both is related it makes sense to store them together in a form that provides for uniform structural data exchange. Other desirable features of potential database formats are as follows:

• well-defined and extensible language
• embedded data and external data references
• visual styles and geometries
• animation and edit actions
• freely available interface modules for many programming languages

Bearing these factors in mind it seems a logical choice to define an XML (eXtensible Markup Language) based database definition to store the desired information. Other database formats such as flat-file or SQL have also been considered but are either too simplistic or over-specified for our use. XML provides a flexible way to create standard information formats which are ideal for data interchange and simple enough not to obfuscate the problem.

The information we want to collect can be specified in Table 3.1 below:

Name         Data Type        Definition

Node
Label        String           Unique node identifier (a, b, c, ... etc.)
IP Address   32 Bit Integer   Internet Protocol address (xxx.xxx.xxx.xxx)
Hostname     String           Name mapped onto IP address by DNS
Status       Boolean          Whether the node is available or not
Service      String           Network service running on the node
RTT          Integer          Round-Trip-Time of packet transmission
OS           String           Host operating system (educated guess)

Link
Latency      Integer          Average RTT for connected nodes

Table 3.1: Database Table Definition

The contents of a sample XML database using this structure containing three nodes and three edges would look like this:


<graph>
  <node name="a">
    <label>a</label>
    <ipaddress>129.234.200.100</ipaddress>
    <hostname>randomwire.com</hostname>
    <rtt probe1="178.942" probe2="177.883" probe3="179.435" />
    <os>Linux 2.6.3 - 2.6.8</os>
    <service name="ftp" port="21" />
    <service name="ssh" port="22" />
    <service name="smtp" port="25" />
    <service name="http" port="80" />
    <status state="up" />
  </node>
  <node name="b">
    <label>b</label>
  </node>
  <node name="c">
    <label>c</label>
  </node>
  <edge source="a" target="b" />
  <edge source="a" target="c" />
  <edge source="c" target="a" />
</graph>

This shows the hosts (nodes), their associated information and the routes (edges) between them.

The XML database design presented here is an extension of the GraphXML format [GXML2001], which is one of many graph description formats. By using standards such as this we are building in a certain level of future compatibility and interoperability with other external applications. It also provides a well tested base onto which enhancements can be made (whilst still maintaining the standard).
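As a small illustration of how an Input Interface might emit an entry in this format, the sketch below prints a single <node> element from a Perl hash; plain string interpolation is used purely to make the structure explicit (a real script would more sensibly use an XML writing module), and the host details are simply the sample values from above:

#!/usr/bin/perl
# Emit one <node> element of the GraphXML-derived database from a hash.
use strict;
use warnings;

my %host = (
    label    => 'a',
    ip       => '129.234.200.100',
    hostname => 'randomwire.com',
    rtt      => [ 178.942, 177.883, 179.435 ],
    services => { ftp => 21, ssh => 22, smtp => 25, http => 80 },
);

print qq{<node name="$host{label}">\n};
print qq{  <label>$host{label}</label>\n};
print qq{  <ipaddress>$host{ip}</ipaddress>\n};
print qq{  <hostname>$host{hostname}</hostname>\n};
printf qq{  <rtt probe1="%s" probe2="%s" probe3="%s" />\n}, @{ $host{rtt} };
print qq{  <service name="$_" port="$host{services}{$_}" />\n}
    for sort { $host{services}{$a} <=> $host{services}{$b} } keys %{ $host{services} };
print qq{  <status state="up" />\n};
print qq{</node>\n};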

The database outlined above can be modelled as an ER (Entity Relationship) Model as follows (Figure 3.6):

Entities coloured in green are mandatory and must be present for a node to be valid within the network structure. Uncoloured entities are optional and may be attributed if available. A node may have zero or more connected links and any number of associated services.


Figure 3.6: Database Entity Relationship Diagram


3.6 Database Concurrency Control
When a system has multiple applications which access shared data concurrently, internal consistency problems may occur. Concurrency control is the way in which these problems are prevented from happening. Concurrency control protocols allow transactions to be executed concurrently to improve performance while preventing data corruption. Unfortunately very little work has been done on implementing such protocols for XML databases and no single standard exists as yet.

This diagram (Figure 3.7) presents a typical situation where an error has occurred because data has been removed after a query was made.

Solutions to avoid these sorts of errors in XML databases are already available which employ the XML Path Language (XPath) [WEB22] to enable only certain sections of an XML document to be addressed at any given time, but they are complex to implement.
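By way of illustration, the fragment below uses the XML::LibXML module (assumed to be available) to address only one node of the database described in Section 3.5 via an XPath query, leaving the rest of the document untouched:

#!/usr/bin/perl
# Select a single <node> element from graph.xml with an XPath expression.
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->new->parse_file('graph.xml');

# Only the node whose <label> is 'a' is addressed; everything else is ignored
for my $node ($doc->findnodes('/graph/node[label="a"]')) {
    print $node->toString, "\n";
}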

Because this is a single user system we will assume that the different interfaces will not be operated concurrently and will include documentation to this effect.

3.7 Technical Considerations
There are three main challenges to building an accurate network map: asymmetry of network paths in outgoing and incoming directions, RTT (Round Trip Time) measurement noise and the relative instability of network routes and topologies. The first two are important for attributing latency information whilst the last concerns drawing connectivity maps. There are a number of ways to lessen the impact of these issues through our design.

Path Asymmetry – We have assumed that the path from the source to the destination host is the same in both directions whereas in fact it may vary (especially under adverse network conditions, e.g. heavy load). To understand this it is necessary to examine the way in which we can trace links between network hosts:


Figure 3.7: Concurrency Error Example

Figure 3.8: Traceroute Operation


Figures 3.8 & 3.9 illustrate the way traceroute works by sending UDP packets to high ports on the destination with their TTL (Time to Live) values set to 1 initially and then incremented by 1 for every subsequent packet. Routers decrement the TTL field of an IP packet by 1 before forwarding it along the route. If the TTL of an IP packet reaches 0, that particular router will send an 'ICMP Time Exceeded' packet back to the source. By examining these returned packets traceroute can determine the IP addresses of the routers along the way to the destination and thus the links between them [TCP1998].

The problem here is that the route by which packets return may differ from the one they were sent along, upsetting the RTT (Round Trip Time) calculations and causing over-estimation or under-estimation. To combat this we will assume path symmetry unless inconsistent latencies are observed along a path, at which point the trace should be discarded and re-run.

RTT Measurement Noise – RTT measurements are composed of three timed components: propagation delay, router processing delay, and queuing delay. The presence of inaccurate data in traceroute RTT measurements makes their use error-prone. The only way to avoid this is to filter out measurements in the sample data which go against the trend in the distribution. This is not a perfect solution but the best for our circumstances.
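One simple way of doing this is sketched below: the probe times for a hop are compared against their median and anything more than a chosen factor away is dropped. The threshold of twice the median is an arbitrary illustration rather than a recommended value:

#!/usr/bin/perl
# Discard RTT probes that sit well outside the trend of the sample.
use strict;
use warnings;

sub filter_rtts {
    my @rtt    = sort { $a <=> $b } @_;
    my $median = $rtt[ int(@rtt / 2) ];
    return grep { $_ <= 2 * $median } @rtt;
}

my @probes = (4.197, 0.858, 1.143);      # values from the traceroute example earlier
my @kept   = filter_rtts(@probes);
print "kept: @kept\n";                   # the 4.197 ms outlier is dropped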

Topological Instability – By their very nature networks are dynamic. Hosts will come and go at will and paths between them may change for a variety of reasons. One consequence of this is that our maps may contain paths which are never simultaneously present in the network (e.g. primary and backup links). This can create 'back-door' paths between hosts which do not actually exist and upset latency measurements. Another problem may occur when hosts reconfigure their IP address or move to a different segment of the network. This can cause different physical links to appear to be a single link. Other than completely re-scanning the network there is no known way to combat this problem [CONN2004]. This could be addressed in future research.

3.8 Design Rationale
The design as presented is primarily orientated towards being flexible enough to allow different external tools to be 'plugged' into it. Although each tool needs its own interface, these should be easily adaptable from a standard template into which the correct regular expressions and output handlers can be inserted.

In this respect we are operating a system of software reuse, both in terms of what we are developing and in terms of the tools written by others which we use. The immediate advantage of this is that it is quicker and easier – why “reinvent the wheel”? By introducing a template-based framework into which different tools are inserted we dramatically reduce complexity whilst at the same time enhancing potential functionality. Within the context of software engineering, where the demand for systems to be developed quickly and efficiently is high, reusing software components can be vital to success. By reusing highly tested components our design will be more reliable and risk will be reduced [PRES1997].


Figure 3.9: Example Path of Traceroute Probe


An auxiliary aim of this project has been to only use software available under an open source licence [WEB24]. The rationale behind this is not only philosophical but also practical from a design perspective:

One of the many advantages of using open source tools on top of an open source operating system is that, in addition to the transparency of functionality, it is also possible to make modifications to any aspect or component of the system at will. If something does not work in a desired manner then changing it is simply a matter of having the inclination to do so. This all adds to the flexibility of the end product and allows it to evolve beyond its original specification. We are also fostering a culture whereby products are not driven by profit-based incentives but by the desire to produce something which fulfils a real need [WEB24].

3.9 Summary
The main requirement of our design is to be able to collect structural information pertaining to a variably sized network and associate other information alongside it in a form suitable for visualisation. This has been realised in a design which maintains platform independence alongside the flexibility to be expanded and altered as necessary to suit our needs. The database we create will enable textual queries to be carried out on its contents, further enhancing its capabilities and fulfilling areas of both the intermediate and advanced requirements.

The use of the prototype process model (Section 3.1) and the design rationale (Section 3.8) both reinforce how the design interacts with recognised Software Engineering practices. Justifications for any trade-offs which have been applied have been given throughout. The technical considerations explained in Section 3.7 must be kept in mind throughout the implementation to ensure that the data extracted is accurate and thus usable for further analysis.


4 Implementation

“Life has become more complex in the overwhelming sea of information, and life, when organized into species relies upon genes to be its memory system so man is an individual only because of his intangible memory. Memory cannot be defined, but it defines mankind.”

– Ghost in the Shell: Innocence, 2032

4.1 Introduction
In this chapter we take a high-level overview of the resulting software, which has been implemented in accordance with the design laid out in the previous chapter. The preliminary sections look at our methods for selecting the tools and programming languages used, after which we look at how the implementation fits together in terms of its functionality and relation to the original design. We then examine technical issues, software engineering processes and the testing strategies adopted to ensure quality (Figure 4.1).

The software (in the form of executable scripts) which has been written fulfils all the basic and intermediate requirements as well as the relevant advanced ones (excluding the static spatialisation, see Appendix A). Its scalability has been proven through its ability to handle and manipulate a database containing over 5000 nodes with associated information, constituting the entire Durham University network (see Chapter 3, Design). The software as presented is intended for a fairly niche user base of individuals who need to see a graphic representation of their network. These users are assumed to have a high level of technical competence and a good understanding of network theory. The software on its own does not produce any graphical visualisations: it collects data and manipulates it into the correct format for input into a separate visualisation tool (see Chapter 5, Evaluation).


Figure 4.1: Aspects of software Quality Assurance


4.2 Scripting Language Comparison
To implement our design we were looking for a programming language which can be easily used for SOP (Script-Oriented Programming). In our context a script is a command line program, mostly used in a terminal (but this could also be extended to include a GUI, WUI (Web User Interface) etc.). The use of a SOP language is ideal for our purposes because such languages enable rapid prototyping, usually have strong text manipulation facilities and offer large libraries of modules that can be reused. There are many freely available SOP languages to choose from, ranging from the extremely popular (Perl) to the esoteric (Haskell). To differentiate between them we can compare and rate the various attributes of each:

The criteria used, and the weight attached to each, were as follows:

• Compilation and execution in one command (20 points)
• shebang2 aware (#!) (15 points)
• Program can be passed on the command line (5 points)
• Interactive interpreter (5 points)
• Debugger (5 points)
• Full interpreter in debugger (5 points)
• Verbose execution (2 points)

The first two criteria were satisfied by all nine languages considered; the remainder only by some. The resulting totals were:

Awk   Haskell   JavaScript   PHP   Perl   Python   Ruby   Tcl   sh
 40        40           40    45     57       55     57    40   47

N.B. The first criterion has been given the most weight because of its usefulness in enabling rapid prototyping without the hassle of having to manually compile and execute after each change is made in the source.

Table 4.1: Scripting Language Comparison Matrix

Whilst there are a number of strong contenders the clear leaders here are Perl and Ruby. Both are interpreted languages and work on similar principles, but Ruby is considerably newer and is designed to be a pure OO (Object-Orientated) language whilst Perl is procedurally based (although OO metaphors can be applied). Until recently, information about Ruby was not really available outside of Japan but this is now changing and there is an increasing amount of documentation and on-line support to be found. Perl is generally considered the most popular and also the fastest scripting language available today, but it has its drawbacks in that the syntax can be horrendous and difficult to maintain/read. The main advantage of Perl is that because of its maturity there is a huge wealth of documentation, third party modules and sample code available. This alone puts it ahead of Ruby for our purposes because we do not have the time to learn a new language from scratch (or translate Japanese for that matter!).

After taking all this into account we decided to implement our design in Perl.

2 “shebang” – Under Linux (and other Unix-like systems), if the first two bytes of an executable file are "#!", the kernel treats the file as a script rather than a machine code program. The word following the "!" (i.e. everything up to the first whitespace) is used as the pathname of the interpreter. For example, if the first line of an executable is:

#!/usr/bin/perl

...the script will be treated as a Perl script and passed as an argument to /usr/bin/perl to be interpreted.


4.3 Visualisation Tools Features
There are many tools available for generating visualisations. All do so in differing ways, using different graph formats (although some common standards exist). Table 4.2 provides a breakdown of six such tools which have been selected as the most promising within their fields of specialism (see Chapter 2 Part A, Networks, for explanations of the terminology used).

daVinci
  Static/Dynamic: Static   2D/3D: 2D   Interactive: Yes   Directed/Undirected: Directed
  Layout algorithms supported: Hierarchical, Circular, Radial, Energy Minimised
  Graph data source format: daVinci Term Representation
  Graph export format: GIF
  Language written in: C   Command/GUI: Both
  Focus scaling (zoom): Yes   Abstraction (subgraph hiding): Yes   Navigation/Find/Query: Yes

GraphViz
  Static/Dynamic: Static   2D/3D: 2D   Interactive: Yes   Directed/Undirected: Both
  Layout algorithms supported: Hierarchical, Circular, Radial, Springs
  Graph data source format: DOT, GXL, XML
  Graph export format: GIF, JPG, PDF, PNG, SVG, VRML
  Language written in: C   Command/GUI: Both
  Focus scaling (zoom): Yes   Abstraction (subgraph hiding): No   Navigation/Find/Query: No

InfoVis
  Static/Dynamic: Both   2D/3D: 2D   Interactive: Yes   Directed/Undirected: Both
  Layout algorithms supported: Scatter Plots, Parallel Cords., Node-link, Icicle Trees, TreeMaps, Adjacency Maps
  Graph data source format: CSV, DOT, TM3, TQD, XML
  Graph export format: PNG
  Language written in: Java   Command/GUI: Both
  Focus scaling (zoom): Yes   Abstraction (subgraph hiding): Yes   Navigation/Find/Query: Yes

yEd
  Static/Dynamic: Both   2D/3D: 2D   Interactive: Yes   Directed/Undirected: Both
  Layout algorithms supported: Hierarchical, Circular, Orthogonal, Organic, Trees
  Graph data source format: YGF, GML, XML
  Graph export format: SVG, WMF, JPG, GIF
  Language written in: Java   Command/GUI: GUI
  Focus scaling (zoom): Yes   Abstraction (subgraph hiding): Yes   Navigation/Find/Query: Yes

Prefuse
  Static/Dynamic: Both   2D/3D: 2D   Interactive: Yes   Directed/Undirected: Both
  Layout algorithms supported: Scatter Plots, Force-directed, Radial, TreeMaps, Space Distortion, Hyperbolic Tree, Degree-of-Interest Tree
  Graph data source format: User Defined API
  Graph export format: N/A
  Language written in: Java   Command/GUI: N/A
  Focus scaling (zoom): Yes   Abstraction (subgraph hiding): Yes   Navigation/Find/Query: Yes

Walrus
  Static/Dynamic: Static   2D/3D: 3D   Interactive: Yes   Directed/Undirected: Directed
  Layout algorithms supported: Hyperbolic Spanning Tree
  Graph data source format: LibSea
  Graph export format: None
  Language written in: Java   Command/GUI: GUI
  Focus scaling (zoom): Yes   Abstraction (subgraph hiding): Yes   Navigation/Find/Query: No

• All of the above claim to support potentially large data sets (>1000 nodes).
• None directly supports dynamic data updating (i.e. building the visualisation whilst data is streamed in).
• All data is considered to be structured.

Table 4.2: Visualisation Tools Feature Matrix

It was not our intention to pick a single tool over another but to disentangle each tool's functionality so as to ease the implementation of the Output Interfaces (OIs) which convert the data into an appropriate format for each. This list is not exhaustive and other tools have been chosen outside this selection where appropriate.


4.4 Visualisation Tools Criteria
To further differentiate between the visualisation tools we can evaluate them against a set of desirable criteria. A Likert scale has been used to gauge an objective opinion of the value of each criterion:

C1. Expose linked structure in the graph. This can aid in finding directed paths and clusters.
C2. Avoid visual anomalies that do not convey information about the underlying graph.
C3. Allow for the attribution of extra data to both nodes and links.
C4. Favour symmetry and balance. This enhances the aesthetic appeal.
C5. Provide a comprehensible graph input format which can be easily generated using scripting languages.
C6. Enable generated visualisations to be easily exported into a standard image file format.
C7. Good use of colour to enhance spatial perception, clarity and amplify cognition.
C8. Contains a comprehensible level of detail.

             daVinci   GraphViz   InfoVis   yEd   Prefuse   Walrus

C1                 4          4         4     4         3        4
C2                 3          4         3     3         4        4
C3                 3          3         3     5         2        3
C4                 4          4         4     4         5        5
C5                 3          4         5     4         3        3
C6                 3          5         2     4         0        1
C7                 2          4         4     5         4        5
C8                 4          4         4     3         4        4

Total (/40)       26         33        29    32        25       29

Marking scale: 1 = Awful, 2 = Poor, 3 = Average, 4 = Good, 5 = Excellent

Table 4.3: Visualisation Tools Criteria Matrix

Table 4.3 looks at a selection of tools out of the large number tested for suitability. Because of the close distribution of results it would be unwise to draw conclusions based on them alone. Experimentation is required to gain a fuller understanding of each tool's potential and limitations. When tested using real sample data a number of problems were encountered, mostly relating to out-of-date dependencies on external libraries. For the tools tested these can be broken down as follows:

• daVinci – graphs with large numbers of nodes tend to look cluttered and messy
• yEd – requires expensive proprietary components to work
• Prefuse – framework structure was too complex to integrate in our limited time-scale
• Walrus – crashes with an exception error when loading large graph files
• Tulip – out-of-date code base which refuses to compile against newer library files
• aiSee – does not behave properly on modern systems
• Otter/Skitter – relies on a very old version of Java, no longer works properly

Two tools with similar functionality were found which provided the nearest adherence to the criteria laid out above. These were WinFDP and WilmaScope, both of which create real time interactive 3D animations of dynamic graph structures. These are looked at extensively in Chapter 5, Evaluation.


4.5 Script List
Below (Table 4.4) is a breakdown of the various scripts which have been written to fulfil the design. For convenience they are divided into three categories:

• Core – The most important scripts, which carry out the majority of the data handling and manipulation. They can be chained together, each piping its output into the next for further processing depending on the user's requirements.

• Extra – Standalone scripts which may be useful under certain circumstances for manual data enumeration. Many of these scripts require root/superuser permissions to run. Use with care.

• Marginal – Scripts whose functionality has been subsumed by other core scripts but still may be useful in a standalone context. These are considered as 'unsupported' and as such will not be documented.

Name                 Description

Core
Pingscan.pl          Ping scans IP address ranges to produce a list of active hosts suitable for tracing
Tracenetwork.pl      Builds a list of links between nodes in a network. Outputs to plain text & XML
Enumerate.pl         Port scans hosts to enumerate services & updates the XML database with the information
Nodeconvert.pl       Converts a linked list of IP addresses into standard alphanumeric format (a:b)
Commonlinks.pl       Outputs a list of nodes which have the greatest number of links

Extra
Discovernetwork.pl   Passively listens to network traffic & discovers IP scopes
Discovertrace.pl     Passively listens to network traffic & executes traceroute on collected addresses
Guessos.pl           Tries to guess the operating system of the remote host using the Nmap scanner
Register.pl          Searches and enumerates all the services that have registered themselves on the network
Rendezvous.pl        Searches for machines that provide the Rendezvous service (file & device sharing) of Mac OS X
Scanshares.pl        Scans for Windows-based file shares (incomplete)

Marginal
Hostfilter.pl        Alternate version of Nodeconvert.pl (less efficient)
Ipgen.pl             Prints out a list of IP addresses suitable for input into the trace system
Pinghosts.pl         Outputs a list of active IP addresses based on dot format input (i.e. 192.168.*.*)
Scanfinger.pl        Port scans hosts to enumerate services & connected users (where available)
Xmlcombine.pl        Combines the XML output files from tracenetwork.pl

Table 4.4: Script description breakdown

The reasoning behind dividing up the scripts in this way is to clarify their functionality and to avoid a single cluttered collection of unordered fragments. There is a clear logical divide between the different components which will be expanded upon throughout this chapter.

During the development process all of the scripts and associated documentation were held within a revision control system (Subversion [WEB25]) and incremental backups were taken nightly. As well as providing security against unforeseeable hardware failures, this also provided the means to roll back changes or make development time-line comparisons where necessary.


4.6 Script Interaction
Figure 4.2 below shows the interaction of the core scripts (shaded in blue) and how data is passed and manipulated between them. Output files are coloured purple and the external tools used are coloured red. One marginal script (xmlcombine.pl) is included to demonstrate their potential use.

Although this has been presented as a flow diagram it may be read from any level as long as the inputs are available from the previous level (e.g. enumerate.pl relies on graph.xml from either tracenetwork.pl or xmlcombine.pl). As well as outputting to separate files in appropriate formats, progress information is also shown textually on the command line terminal interface (stdout) in human-readable form during execution. Log files are updated interactively, which guards against loss of data due to a catastrophic failure (e.g. power outage).

With reference to our design, tracenetwork.pl can be seen as an 'Input Interface' while nodeconvert.pl and enumerate.pl can be seen as 'Output Interfaces'. This is evident through their interaction with the XML database (graph.xml), which acts as the cornerstone of the framework we have built.
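To make the role of this database more concrete, a hypothetical fragment of graph.xml might look something like the following (the node element and its attributes are purely illustrative; only the edge element and the rttavg element added by enumerate.pl correspond directly to code shown in Section 4.8):

<graph>
  <node id="0" ip="192.168.0.1" hostname="router.randomwire.com">
    <rttavg>0.591</rttavg>
  </node>
  <node id="1" ip="192.168.0.3" hostname="firewall.randomwire.com"/>
  <edge source="0" target="1"/>
</graph>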


Figure 4.2: Script interactions


4.7 Execution Example

To give a simple example of how these scripts might be used together, the output from running some of them in sequence is shown below. Commands are shown in green while output is shown in blue:

david@izumi:~$ ./pingscan.pl 192.168.0.*
Ping Scanning: 192.168.0.*
192.168.0.1
192.168.0.3
192.168.0.7
[...]
192.168.0.11
192.168.0.13
192.168.0.254
9 active hosts found.

david@izumi:~$ ./tracenetwork.pl < hostlist.txt
----------------------------------
Host: (192.168.0.0) Excluded address: not traced.
----------------------------------
Tracing: 192.168.0.1 ...
Hop: 1
Hostname: router.randomwire.com
IP Address: 192.168.0.1
RTT1: 0.799 ms RTT2: 0.503 ms RTT3: 0.471 ms
----------------------------------
Tracing: 192.168.0.3 ...
Hop: 2
Hostname: firewall.randomwire.com
IP Address: 192.168.0.3
RTT1: 0.326 ms RTT2: 0.534 ms RTT3: 0.228 ms
Link: 192.168.0.1:192.168.0.3
----------------------------------
Host: (192.168.0.7) Ping Failed: host unavailable/non-existent.
----------------------------------
[...]

david@izumi:~$ ./enumerate.pl
Parsing XML file (please be patient!)...
**Port Scanning: 192.168.0.1...
  ftp/21: open
  ssh/22: closed
  smtp/25: closed
  finger/79: closed
  http/80: open
  pop3/110: closed
  imap/143: closed
**Port Scanning: 192.168.0.3...
  ftp/21: open
  ssh/22: open
  smtp/25: open
  finger/79: closed
  http/80: open
  pop3/110: closed
  imap/143: closed
[...]

Here the local network is first ping-scanned to determine a list of active hosts (hostlist.txt). Each active host found then has the route to it traced, which creates the XML database (graph.xml). Available services at each host in the database are then enumerated and the results written to a new database (graphcomplete.xml).


4.8 Source Overview

Here we take a look at some of the more interesting sections of code within the implementation which demonstrate various technical aspects, advantages and trade-offs of writing in the Perl programming language. They have been split up according to the script they come from:

4.8.1 Pingscan.pl

unless (open NMAP, "$nmap -sP -oG - $ipadd | grep Host: | cut -d' ' -f2 2>/dev/null |") {
    die "Problem with nmap on $ipadd: $!\n";
} else {
    ...
}

This section of code demonstrates how external programs (in this case nmap) are handled. The first line uses the open() function to pipe in the output of the quoted command. Of particular interest is how grep and cut are chained together to filter the output of nmap before it is passed back to our script. By (re-)using these existing tools we significantly reduce the complexity of our code, which in turn reduces the room for error.
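For completeness, the filtered output can then be consumed line by line in the usual way (an illustrative sketch rather than the exact loop used in pingscan.pl):

my @activehosts;
while (my $host = <NMAP>) {
    chomp $host;                 # strip the trailing newline
    push @activehosts, $host;    # each remaining line is an active IP address
}
close NMAP;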

4.8.2 Tracenetwork.pl

use Net::Ping::External qw(ping);
...

sub pinghost {
    my $host  = shift;
    my $alive = ping(host => $host); # ping the host
    return 1 if $alive;
    return 0;
}

This fragment shows a subroutine for 'pinging' a host to check whether or not it is alive. It is a good example of how the use of imported code modules (in this case Net::Ping::External) makes complex operations far easier. The functionality of this module has enabled us to encapsulate and condense over 400 lines of code into only one (indicated by the comment).

sub outputedgexml {
    my @edges = @_;

    open(OUTFILE, ">edges.xml") or die "Can't open edges.xml: $!";

    foreach (@edges) {
        (my $node1, my $node2) = ($_ =~ /(\S+):(\S+)/);
        print OUTFILE "<edge source=\"$nodelookup{$node1}\" target=\"$nodelookup{$node2}\" />\n";
    }
    close OUTFILE;
}

The subroutine shown above is used for outputting lists of edges (links between nodes/hosts) in XML format. This is achieved by simply printing raw text to a file handle. This might be considered a bit of a hack as it does not use a formal XML writer, but for our purposes it is simple and effective.


my @sortedarray = sort @inputarray;                 # sort the list
my $prev = "not equal to $sortedarray[0]";          # remove duplicate entries
my @outputarray = grep($_ ne $prev && ($prev = $_, 1), @sortedarray);

This fragment comes from a routine which first sorts an array and then removes any duplicate entries. It demonstrates Perl's strength and weakness simultaneously – the third statement compares the previous entry with the current entry in the array and discards it if it is the same. Although it condenses a large amount of functionality into an extremely small space, this is at the expense of readability. This can cause problems for future maintenance if the maintainer has to spend more time than usual on code comprehension. It is, however, a good example of concise and efficient programming.
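For comparison, a common Perl idiom achieves the same de-duplication in a form many maintainers would find easier to read (a sketch only – not the code actually used in the scripts):

my %seen;
my @outputarray = grep { !$seen{$_}++ } sort @inputarray;   # keep only the first occurrence of each entry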

4.8.3 Enumerate.pl

my $average = ($rtt1+$rtt2+$rtt3)/3;                # compute average
my $avg = new XML::Twig::Elt('rttavg', $average);   # create the element
$avg->paste('last_child', $node);                   # paste it in the document

The code presented here is part of the script which updates the XML database. It is responsible for adding an average Round Trip Time (RTT) element to each host. First the average is calculated, then an XML::Twig (a high-level Perl XML parser) element is created from it, which is finally 'pasted' into the host entry after the last child already present. Whilst earlier we presented a simpler method for creating XML documents, this is the most straightforward and efficient way of updating them without resorting to horribly complex pattern matching.

4.8.4 Nodeconvert.pl

foreach $key (sort {$a <=> $b} (keys %link_assoc)) {
    print $key, ':', $link_assoc{$key}, "\n";                            # generic format

    #print "<Edge EndID=\"$key\" StartID=\"$link_assoc{$key}\"/>\n";     # wilma
    #print "<edge source=\"$key\" target=\"$link_assoc{$key}\"/>\n";     # yEd
    #print "{ \@source=$key; \@destination=$link_assoc{$key}; },\n";     # walrus
}

Here we show a simple way of enabling output to a multitude of different graph formats (identified in the code comments). A foreach() loop iterates over a hash representing links between nodes, whose keys are first sorted numerically before being output one at a time. Print statements are used to output in the appropriate formats, which are all similar but have slightly differing syntax.
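If further formats were needed, the commented-out print statements could be replaced by a small dispatch table selected by a command-line option. The sketch below is illustrative only (the $format variable and the template subroutines do not exist in the current script):

my %writers = (
    generic => sub { "$_[0]:$_[1]\n" },
    yed     => sub { "<edge source=\"$_[0]\" target=\"$_[1]\"/>\n" },
);
foreach my $key (sort { $a <=> $b } keys %link_assoc) {
    my $line = $writers{$format}->($key, $link_assoc{$key});   # pick the requested output format
    print $line;
}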

4.8.5 Discovertrace.pl

for my $net (@{$data->{subnets}}) {
    print "subnet ", $net->{ip}, '/', $net->{mask}, "\n > ",
          join(', ', map { $data->{interfaces}[$_]->{ip} } @{$net->{interfaces}}), "\n\n";
}

This for() loop again demonstrates how Perl can be used in an extremely powerful way (here we print each discovered subnet followed by the interface addresses mapped onto it, forming a list of addresses to be traced). Whilst this also has readability issues it does not constitute bad style; it is an unavoidable facet of Perl. The 'Perl Style Guide' [WEB26] has some useful guidelines regarding formatting and code layout which we have followed closely.


4.9 Technical Issues Encountered

During implementation a number of unexpected technical issues were encountered which had to be overcome, or trade-offs applied to compensate for them. These are explained below.

4.9.1 Traceroute Efficiency

During initial testing it became clear that normal traceroute probes were taking a lot longer to run than expected. It was calculated that, on average, using the standard Linux traceroute tool we could scan 100 hosts per hour. Considering there are around 7,000 active hosts on the network at any given time, it would take approximately 70 hours (around 3 days) to carry out a complete scan! This does not even take into account time wasted by route redundancy, dead ends, high traffic congestion and inactive nodes.

It was obvious that a better solution would have to be found to speed up the process, but not to the extent that our activities could flood the network with trace packets, be misinterpreted as a DOS (Denial of Service) attack and possibly disrupt network service, thereby invalidating our results. A number of alternative tools are available which carry out the same function using different techniques. Two main contenders were identified:

• tcptraceroute - a traceroute implementation using TCP packets instead of UDP ones as with the traditional version. By sending out TCP SYN packets instead of UDP or ICMP ECHO packets, tcptraceroute is able to bypass most common firewall filters as it only 'half' connects to remote hosts (no connection acknowledgement). Because of this it is very fast but requires root/superuser privileges to run as it needs direct access to the TCP/IP control stack which is restricted [WEB27].

• lft – 'Layer Four Traceroute'. Uses TCP SYN and FIN probes to trace routes through a network, also bypassing most firewalls. Contains a "smart" engine for carrying out interactive table lookups and state inspection. LFT also distinguishes between TCP-based protocols (source and destination), which makes its statistics more realistic, and gives users the ability to trace protocol routes, not just Layer-3 (IP) hops. [WEB28]

Whilst lft contains some more advanced features, its output differs greatly from that of the original traceroute. Because we wanted to maintain compatibility with the original tool in our Input Interface, tcptraceroute was chosen as the replacement, primarily because its output is very similar. As it is an open source tool we were able to modify the source and alter the output to exactly match that of the original (see 'TCP/IP Utilities', Chapter 2 Part A, Networks). The changes made to the C code can be seen visually using TkDiff (Figure 4.3).

Replacing the standard traceroute tool with a modified version of tcptraceroute (1.5beta4_mod) and improving error handling has cut scan times from an estimated 3 days to 3 hours for the entire network (roughly a 24-fold speed increase!). Problems can still occur when a node somewhere along a route is unresponsive, but the time overhead for this is negligible and causes no catastrophic errors.

4.9.2 Service Scanning

Although permission had been obtained prior to running selective service scanning, on our first attempt at a full network scan the router nearest to the machine the scripts were running on detected the scan as DOS activity and promptly cut off all network access to the port it was on. Two emails, a visit to the Information Technology Service to explain, and five hours later, the port was re-enabled!

Significant changes were made to curb the activities of the nmap service scanner (whose operation is entirely legitimate). These included restricting it to seven core ports (ftp, ssh, smtp, finger, http, pop3 & imap) and slowing it down to a 'polite' speed setting. This means it serialises the probes and waits at least 0.4 seconds between sending them.
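For illustration, this restricted scan corresponds to an nmap invocation along the following lines (the exact options passed by enumerate.pl may differ slightly):

nmap -sT -T polite -p 21,22,25,79,80,110,143 192.168.0.1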

Even though this was clearly explained, the network administrators felt it was still too much of an unnecessary 'risk' to allow us to run these sorts of probes. Although this decision is highly debatable, it was complied with. Due to this setback full service data for the Durham network is unavailable, but small-scale tests have shown that the enumeration script works perfectly well.


4.10 Testing

To ensure all elements of the implementation work together properly, function as expected, and meet performance criteria, testing has been integral throughout the development process. Testing falls under the 'Evaluation' phase of the Prototype Process Model we have followed (see Chapter 3, Design) and can be seen as a sub-process with its own inputs & outputs (Figure 4.4).


Figure 4.3: Comparing file changes using TkDiff, a graphical diff and merge tool (illustration only) [WEB29]

Figure 4.4: Test information flow


Our objectives for carrying out this process can be summarised as follows:

• Uncover errors in the software (through 'Black Box' tests)
• Demonstrate that functionality is present according to the specification (if not, possibly revise)
• Demonstrate performance requirements have been met (through timed tests)
• Provide an indication of software reliability and quality

Because of the nature of the interactions between the different elements in our system (Figure 4.1), testing for correct I/O was simplified, but errors at a higher level could have knock-on effects lower down. These errors mainly revolved around data formatting and transformation checks, whereby if variable substitution failed then the whole structure would quickly break down. Such problems were easily discovered by employing simple test cases using sample data where the correct output was known prior to execution. Each test resulted in either success or failure, depending on what was expected in each case. Even without prior knowledge of the inner workings of each component this method clearly showed the logic paths through the system as a whole, reinforcing its design.

Whilst testing is clearly beneficial it cannot show the absolute absence of defects, so we can be almost certain that there will be minor problems with even the best tested software. The ideal trade-off in this situation is to produce a solution which is 'predictably reliable' through the use of a reliability model built from error-rate data (Figure 4.4). This means that, once constructed, we could say that a given component will perform correctly n% of the time within a degree of tolerance. Because of time constraints we have not been able to fully construct such a model and thus cannot make such a statement about the implementation.

For Perl, automated testing modules are available (Test::Simple etc.) which use integrated methods to verify that code behaves as expected. Calling the ok() function for each element you want to test prints an "ok" or "not ok" message to indicate a pass or fail. Whilst this is fairly straightforward, the benefits were not considered great enough to justify its inclusion given the relatively small scale of our project.
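Had it been adopted, a test script would have looked something like the sketch below (the make_link() helper is hypothetical and simply stands in for whichever subroutine is under test):

#!/usr/bin/perl
use Test::Simple tests => 2;

# hypothetical helper mirroring the a:b link format used by the scripts
sub make_link { my ($from, $to) = @_; return "$from:$to"; }

ok( make_link(0, 3) eq '0:3',               'link formatted as a:b' );
ok( make_link(1, 2) ne make_link(2, 1),     'link direction is preserved' );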

4.11 Summary

Whilst some may see a command line interface (CLI) as primitive compared with its GUI counterparts, CLIs have a long proven track record of being more expressive than visual interfaces, especially for complex tasks such as the one we are carrying out. The disadvantage of a CLI is that it almost always has a high mnemonic load (commands must be memorised) and usually has low transparency to the end user. Most people (especially non-technical users) find such interfaces relatively cryptic and difficult to learn.

For our CLI-based scripts high mnemonic load is not an issue, as the input range is very limited and the intended user base should be familiar with CLI operations. A GUI wrapper could fairly easily be built on top of the scripts which have already been implemented, but the advantages of doing so are nominal.

We have demonstrated how good Software Engineering practices have been adopted (Sections 4.1 & 4.10) and have identified solutions to issues which may have caused errors in the execution of our tools. The Source Overview (Section 4.8) has looked at complex and interesting aspects of our implementation with a view to it fulfilling our needs.

Our implementation has achieved what we set out in the initial requirements and the design – in essence we have built a framework which integrates a plethora of existing tools to bring together data to form information, which can hopefully be transformed into knowledge through visualisation. Whilst this is not a new idea, our research has found no other projects which aggregate network information in this way.


5 Evaluation

– del.icio.us brain map [WEB30]

5.1 Beyond Data

The preceding chapter has looked at the means by which we capture, store and create relations between data (thus transforming it into information). It also looked at the efficiency of our implementation in relation to the time taken to probe a network (Section 4.9.1). Whilst this is a crucial side to our work, we basically end up with a large XML database which on the face of it means very little. In this chapter we try to look beyond the data, using visualisation as a tool for inferring knowledge and ultimately wisdom. We present a number of visualisations and evaluate them against a set of specific criteria as well as our own objective reasoning.

Here we are mainly concerned with the ends rather than the means, that is to say we are more interested in what we can see rather than how we got there. We are evaluating the visual output of our research.

5.1.1 Convergence

Using information and tools from multiple sources is more likely to enable us to come to a greater understanding of the domain we are examining (network visualisation) than relying on a single source which may not be totally accurate. When we speak of convergence we are referring to the bringing together of all these sources to form a cohesive picture. These sources include, but are not limited to:

• Data we have collected
• Data others have collected
• Expert knowledge
• Current visualisation theory & techniques


5.2 Visualisation Criteria

In Chapter 4, Implementation, we evaluated a number of visualisation tools based on their features and functionality. As a result none turned out to be suitable for our purpose due to technical issues, so we looked to a number of other tools, primarily WinFDP and WilmaScope [WEB31]. These both use various graph layout algorithms to dynamically arrange node-arc graphs in a uniform fashion.

Any visualisation technique employed should be useful and usable, that is, the visualisation should help and support the user in the analysis process. We can use the term data usability to describe the quality of information in the context of information visualisation applications [EVAL2003]. It is associated with three principles:

1. Data Reliability – the confidence level in the data gathering process (high in our case)
2. Data Presentation – the system must avoid distorting the information
3. Decision-Making – the data represented should help users make decisions

Since we are looking to gain insight from information visualisation, it is clear that the visual representation and the technique for interacting with it must not influence the ways the user needs to use the data. It is much harder to evaluate this area of usability when dealing with abstract tasks such as "understand data" or "make a decision based on information" because such outcomes are difficult to quantify. It becomes even more difficult when you consider that interfaces for information visualisation include 2D and 3D structures which are unusual in comparison with most WIMP (Windows, Icons, Menus and Pointers) interfaces.

For our purposes we will focus on the evaluation of visual representations. To do this we aim to link interface usability knowledge, concepts and methods with the evaluation of the expressiveness and semantic content of visualisation techniques. We have set out a group of metrics for doing this, shown in Figure 5.1 and explained below.

• Limitations may affect the semantic content of the data to be displayed, such as geometric or visual constraints (e.g. size of display).

• Cognitive Complexity can be measured in terms of data density, data dimension and the relevance of the image's content (e.g. how many dimensions can be simultaneously displayed?).

• Spatial Organisation relates to the overall layout of the visualisation and the analysis of how easy it is to locate specific elements within it (e.g. do some elements occlude others?).


Figure 5.1: Criteria for the evaluation of visual representations [EVAL2003]


• Information Coding is the use of additional symbolic notation to build alternate representations to aid perception (e.g. clustering of similar elements).

• State Transition refers to the reconstruction of a partial or entire representation after user action, which can affect the perception of information (e.g. how long does it take for a change to take effect?)

We have evaluated both WinFDP and WilmaScope against these criteria (Table 5.1):

Limitations
WinFDP: Designed for Windows but will run in Linux under WINE (an open source implementation of the Windows API), albeit considerably slower. User controls are not clearly evident.
WilmaScope: The Java VM is memory and processor intensive and needs a fast computer to run effectively. Geometric and visual constraints are not clearly evident except for the display area.

Cognitive Complexity
WinFDP: Can display data in both two and three dimensions. Nodes are placed at random before the algorithm is applied. Scale is variable & can be altered interactively.
WilmaScope: Displays data in three dimensions only. Nodes are expanded out from a single source point in 3D space until the algorithm stabilises. Scale is variable & can be altered interactively.

Spatial Organisation
WinFDP: Occlusion possible, especially when textual labels are added or the orientation is manually changed. 'Flying through' the visualisation allows specific nodes to be found with relative ease.
WilmaScope: Occlusion possible, especially when textual labels are added or the orientation is manually changed. Spatial adjustment is slow and awkward, making node location difficult and cumbersome.

Information Coding
WinFDP: Node size may be altered relative to other nodes to emphasise relevance.
WilmaScope: Supports translucent clusters of nodes, which may be collapsed to elide their contents.

State Transition
WinFDP: Transitions between representations are fluid as the optimisation algorithm dynamically arranges nodes. For large data sets with multiple hierarchies this can be slow.
WilmaScope: Transitions between two consecutive representations are achieved through animation, but for large data sets this is excruciatingly slow and updates are fragmented.

Table 5.1: Visualisation criteria evaluation

Both tools are fairly evenly matched: whilst WilmaScope may have a greater depth of functionality, WinFDP is more stable and faster. Both tools are used in our evaluation.

5.2.1 2D vs. 3D

There is a lot of discussion within the visualisation research community as to whether or not adding a third dimension to visual representations actually has any major benefit. For certain domains the provision of 3D space obviously allows the visualisation to reflect reality more closely, in particular for applications with inherent 3D spatial properties, e.g. geological data. However, for any domain we again encounter the old problem of data occlusion, whereby some objects hide others. The advantage of a 3D system is that occluded data may be made visible by changing the user's viewpoint perspective, and the provision of a third axis allows additional information to be represented on screen. 3D also extends the features which may be used in the generation of visualisations, e.g. the relative position of objects in 3D space or the depth of field. On the other side, Nielsen summarises the range of difficulties 3D adds [NEIL1998]:

• The screen and the mouse are both 2D devices, so we do not get true 3D unless we strap on weird head-gear and buy expensive bats (flying mice)

• It is difficult to control a 3D space with the interaction techniques that are currently in common use since they were designed for 2D manipulation (e.g., dragging, scrolling)

• Users need to pay attention to the navigation of the 3D view in addition to the navigation of the underlying model: the extra controls for flying, zooming, etc. get in the way of the user's primary task

• Poor screen resolution makes it impossible to render remote objects in sufficient detail to be recognizable; any text that is in the background is unreadable

• The software needed for 3D is usually non-standard, crash-prone, and memory/processor intensive.


WinFDP supports both 2D and 3D while WilmaScope supports only 3D visualisation. Our tests have shown that for large data sets (over 500 nodes) 3D layouts generally produce better results, whereas 2D layouts have a tendency to become a big tangled mess (Figure 5.2). In these cases you end up with all the nodes rapidly 'twitching' about their positions, trying to move but unable to because other nodes already occupy the positions directly adjacent to them. This problem is most likely caused by the force-directed algorithm being unable to cope with the density of so many nodes in two-dimensional space.

Of course, with the wrong algorithm parameters, 3D layout can also go wrong. Figure 5.3 shows a WinFDP visualisation where the host computer was not powerful enough to keep up with the algorithm's demands and so the layout could not untangle itself. Figure 5.4 shows how incorrect 'Velocity Attenuation' and 'Fade Repulsion' settings have caused clusters to become bunched up, severely reducing the legibility of the visualisation.

In the next section (5.3) we begin to examine what knowledge we can gain from the visualisations we have created as well as further exploring how these criteria can be identified and applied.


Figure 5.2: Node entanglement – WinFDP 2D Layout (4004 nodes)



Figure 5.3: Insufficient computing power – WinFDP 3D Layout (4004 nodes)

Figure 5.4: Incorrect algorithm parameters – WilmaScope 3D Layout (465 nodes)


5.3 Class B Superstructure

Class B network addresses (in the form 129.234.*.*) are primarily assigned to peripheral hosts/clients on the Durham network, the lower subnets being reserved for servers (which number over 200). A class B network can contain up to 65,534 hosts, but from our scans we find that only around 10% of these addresses are being utilised, leaving plenty of room for future expansion.

Figure 5.5 shows a high level of data density, with the sheer number of peripheral nodes (shown as red points) and their connecting links causing the centre of each cluster to be obscured. If we were concerned purely with the interconnections between the main clusters this could be prevented by reducing the degree of relevance to only the first two levels of detail, thus ignoring all the peripheral nodes. The trade-off is that it would change the reference context, as you could no longer gauge the size of each cluster.


Figure 5.5: WinFDP 3D Force-Directed Visualisation of Class B Network Structure (5289 nodes)


From Figure 5.5 we can identify eight distinct clusters of varying sizes. Whilst this visualisation is relatively clear, the overlapping links within 3D space make it uncertain whether there is a central node from which all others stem. If we feed the same data into WilmaScope we get an arguably clearer picture in which a central node is evident (Figure 5.6). The single issue here is that only seven of the eight clusters are visible. Further investigation showed that the smallest cluster (visible in the top-left of Figure 5.5) is occluded behind the large bottom-right cluster in Figure 5.6.

Using the commonlinks.pl script to query the database we find that the IP address of the core node is 192.168.251.27 (with 688 links), indicating that it is part of the Class C substructure and thus a router. This ties in with the idea that Class B addresses sit on the periphery of each network cluster whilst Class C addresses form the core (see the next section, 5.4).

To measure cognitive complexity, metrics of both the number of nodes (for data density) and a qualitative measure of legibility, in terms of node occlusion, were used. For both tools low levels of occlusion were found, but for graphs with clusters of over approximately 500 nodes legibility is reduced. There is no upper bound on the number of hierarchies which can be displayed, but beyond a depth of three the time to compute increased significantly, making real-time interaction difficult at best.


Figure 5.6: WilmaScope Force-Directed Visualisation of Class B Network Structure (5289 nodes)


5.4 Class C Substructure

Class C network addresses (in the form 192.168.*.*) are primarily assigned to internal routers on the Durham network. We can see this as the backbone of the network through which the majority of traffic travels to reach its destination. It lies topologically directly beneath the Class B superstructure, connecting all the end nodal points to the core, and as a whole constitutes the entire span of the physical network.


Figure 5.7: WinFDP 3D Force-Directed Visualisation of Class C Network Structure (469 nodes)


Figure 5.7 produces a well-defined star topology with a small, sparsely populated inner segment leading out to a number of clusters of varying sizes. The central node here is the same as the central node identified in the previous section, again reinforcing its position as a key communication point within the network which must be traversed for information to pass between clusters.

We have confirmed with network administrators that this is indeed the main core router where all the various networks (Durham, Queens Campus Stockton & EnSuite Online) terminate by means of multiple interfaces (i.e. the router has more than one address assigned to it which are used by the different networks). The seven single nodes originating from the core are monitoring stations and critical address resolution servers which manage a large proportion of the network.

Figure 5.8 shows the same data set visualised with WilmaScope. In comparison to the Class B network we get a very similar layout which we can clearly relate to Figure 5.7. Because of the comparatively small number of nodes in this graph WilmaScope has no difficulty rendering the visualisation, allowing real-time interaction with it (rotate, zoom & translate).


Figure 5.8: WilmaScope Force-Directed Visualisation of Class C Network Structure (469 nodes)


5.5 Overlying User Structure

So far we have examined the physical structure of the network in terms of nodes and the links between them. However, our work has shown that there is a clearly identifiable third layer where virtual connections map onto physical connections. That is to say, a user may create tunnels through the physical network interconnecting nodes which have no direct physical link, thus forming a virtual connection. To map these connections our original plan was to use the finger service available on many of the university's servers to obtain lists of connected users which could be compared for each host and then interrelated (Figure 5.9).

Login  Name                 Tty      Idle   Login Time    Hostname
ads    Andrew Stribblehill  *pts/6   41d    Feb 18 00:58  (wompom.dur.ac.uk)
cim    Chris Morris         *pts/4   7:18   Feb  8 16:27  (dinopsis.dur.ac.uk)
cjw    Corwin Wright        pts/38   14d    Feb 22 10:58  (altair.dur.ac.uk)
djw    Dan Walrond          pts/25   2:54   Mar 31 10:40  (spc1-leed4-6-0)
drg    David Gilbert        pts/0           Mar 29 13:18  (host217-42-134-115)
ecb    Edwin Brady          pts/28   4:27   Mar 31 11:49  (selberg.dur.ac.uk)
ngb    Nick Boalch          pts/7    2:41   Feb  8 16:32  (eiszeit:S.2)
pmt    Paul Townend         pts/18   9      Mar 10 12:32  (cspcz59:S.0)
psn    Peter Nuttall        pts/5           Mar 29 15:38  (dsl-80-46-11-211:S)
root   root                 *pts/17  22     Feb 14 10:12  (wompom.dur.ac.uk)
sr     Samantha Raincock    pts/33   17:14  Mar 26 18:11  (vega.dur.ac.uk)
tsp    Timothy Packer       pts/39   9      Feb 21 14:09  (publication.org.uk)

Figure 5.9: Sample user list from compsoc.dur.ac.uk collected with the finger service

Unfortunately (for reasons explained in Chapter 4, Section 4.9.2) we were not able to carry out service scanning and so do not have enough data to draw an accurate virtual connection map. We can, however, postulate about the theoretical aspects of this topological layer. To do this we will look at the concept of SSH (Secure SHell) tunnelling, whereby a user can securely gain access to a remote host using different types of encryption algorithms. This example is used as it is easy to conceptualise; however, it should be noted that there are many other methods which carry out similar operations.

SSH can tunnel data from any TCP application with a predefined listening port. Commonly known as "port forwarding", SSH tunnelling makes it easy to secure applications that would otherwise send unprotected traffic across public networks. Because several applications can be multiplexed over a single SSH connection, firewall and router filters can be tightened to just one inbound port: the Secure Shell port (22).

Figure 5.10 shows how an SSH tunnel works – here an application on alice.org is accessing another service on bob.org. From the application's perspective the two are in direct communication (via the virtual connection) but in reality data is sent via the SSH server, through the tunnel which operates over the physical network, to the SSH client at the other end. The virtual connection has created another layer on top of the physical connection.
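As a concrete (hypothetical) example, a user on alice.org could forward a local port to the web server on bob.org with a single command and then point their application at localhost:8080; the traffic travels encrypted over the physical route while the application sees only the virtual connection (the port numbers here are illustrative):

david@alice:~$ ssh -L 8080:localhost:80 user@bob.org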


Figure 5.10: SSH tunnel connections


Through what we have discovered with the previous visualisations we can relate this directly to the Class B superstructure and the Class C substructure which sit below it, as shown in Figure 5.11 below.

In the conceptual model, Tier 2 relates to the establishment of physical connections (superstructure) over the physical data-links (Tier 3). Tier 1 maps connections virtually onto these physical routes. In effect we can see this as an interconnected set of nodes connected to an interconnected set of users.

Further comparisons are looked at in Appendix 2, Overlying Interconnections.

5.6 Summary

Our evaluation has looked at both the ways we evaluate graphic visualisations and what we can understand from the ones we have produced ourselves. Through the convergence of different sources of information we have gained a fuller picture of the Durham network and the elements which constitute any such structure.

Whilst we have found many visualisation tools to be underdeveloped, unstable or unsuitable, the ones we have used have proved successful in producing visualisations which have enabled us to gain a better comprehension of the information they convey. This has been especially evident for the layered OSI model [WEB1], whose relevance has repeatedly shown itself through our research.

It has become evident through looking at the overlying user layer that there are a number of commonalities between the ways we can view the types of networks which interact with physical networks, be it over land, wire or air, as they still share the same core structural elements. This raises the interesting question of whether technological networks have been designed to mimic 'real-world' networks or have naturally evolved to do so on their own; this is left for further research.


Figure 5.11: Conceptual layered network related to the OSI model


6 Conclusions

– texone :: tree visualising Durham WWW

6.1 Summary

This project has investigated various methods of visualising networks. This has resulted in a comprehensive review of network and visualisation theory (Chapter 2) which has enabled a tool to be designed (Chapter 3) and implemented (Chapter 4) that accurately maps the structure of any given network. Using this tool to create a number of visualisations, a structural analysis of the Durham University internal network has been carried out (Chapter 5) which supports the theory laid out previously.

In addition to this we have looked at a variety of new and unique ways of visualising data (see visualisation, top of page). This has led to the construction of a conceptual, quasi-geographical map based on subway plan design (Appendix A) which links information with physical geography, thus providing the viewer with an easily understandable conceptual snapshot of the system in its current state.

6.1.1 Overview

The project has been successful in its two main objectives of investigating information visualisation and connecting together a variety of tools to form a process for visualising network structure and its accoutrements. Through this we have learnt a great deal about the nature of information, networks and ways of visually expressing them. At its core we can summarise this as follows:

EVERYTHING IS DATA

ALL DATA IS POTENTIAL INFORMATION

FINDING INTERCONNECTIONS CAN AID COMPREHENSION

VISUALISATION IS THE METHOD BY WHICH WE LOOK BEYOND DATA

This is consistent with the recognised definitions initially explored in Chapter 2 Part B, Visualisation.


6.2 Objective & Deliverable Fulfilment

Figure 6.1 visualises how each chapter of this document has realised the objectives and deliverables stated in Chapter 1. Chapter 2 can be seen as investigative research, Chapters 3 & 4 as planning and development, and Chapter 5 & Appendix A as evaluative. Associated with Chapter 4, Implementation, is a repository of source code in the form of Perl scripts (see Table 4.4 in the same chapter).

Whilst we have fulfilled all our objectives and deliverables it would have been nice to carry out more information association (Intermediate 2) and comparative analysis (Advanced 2) work as in these areas we have only really scratched the surface of what is possible.

6.3 Visual Trends

A major element of this project has been researching visualisation techniques and the tools already available for producing them. In keeping with the relative immaturity of this field of study, many of these tools are still heavily under development and, as a result, we have encountered many problems utilising them. However, there is a clear vitality to the work many people are doing which holds exciting prospects for the future.

The growth of visualisation is not a 'fad' or 'craze' designed to stimulate new markets but a response to a real need and the growing problem of information overload. We simply cannot keep up with the amount of data being produced and stored. Without new ways of extracting and displaying relevant content there may even come a critical point at which the sheer volume of data causes a catastrophic breakdown of informational structures.


Figure 6.1: Objective and Deliverable fulfilment


Based on current trends, especially in the news/media sector, we can expect in the near future to see the emergence and standardisation of many new forms of information visualisation across a wide range of disciplines, beyond the traditional pie and bar charts. This will run parallel with the commoditisation of new display technology (e.g. HDTV (High Definition Television) and 3DVR (Three Dimensional Virtual Reality)) and hugely powerful graphics processing computer systems.

Another way of looking at visualisation is to see it as the functional fusion of graphic design and data mining. There is a strong need here for people from both scientific and artistic backgrounds to collaborate on bringing these two disciplines together. The role of 'Information Architect' already exists, but to a large extent it mainly deals with static signage and form layout. The evolution of this perception is already taking place, as can be seen in the rise of organisations using Flash [WEB36] and Java [WEB37] animations on their websites to convey information.

Information visualisation may still be in its incipiency but there is no doubt it has a bright future.

6.4 100% Open Source

It is important to note that the entirety of this project has been completed using only open source software (including the production of this document, presentations and all the original graphics they contain). Some of the most widely used software is listed below:

• OpenOffice Writer & Impress (Equivalent to MS Word & MS PowerPoint)
• Imendio Planner (Equivalent to MS Project)
• DIA (Equivalent to MS Visio)
• The GIMP (Equivalent to Adobe Photoshop)
• InkScape (Equivalent to Adobe Illustrator)
• Slackware Linux / Gnome / Perl + Various Linux tools

This highlights the capability of Linux and open source software to provide a viable alternative to proprietary (and expensive) software with no loss of functionality (although this is still a matter of opinion).

6.5 Future Work

There are many paths our work could take from this point, mainly boiling down to the overall focus. So far we have kept our approach fairly wide and have encompassed a large area of theory. For any further work it would make sense to focus on a more specific area of research within network visualisation. Some ideas are introduced below:

• Dynamic Updating: build visualisations which update their content in real time, providing the user with accurate information for any given moment in time so that resource decisions may be made.

• Graph Editing: allow the user to interactively change the layout both in terms of node positioning and layout method (e.g. allow transitions from force-directed to hyperbolic views).

• Traffic Management: use available data to find interesting paths within a graph for the purpose of identifying traffic patterns (e.g. bottlenecks caused by misconfigured routers etc.).

• Relationships: enable the user to explore relationships among nodes in a diagram for the purpose of discovering trend patterns (e.g. services which rely/depend on other services).

• Protocol Specific Analysis: look at the specific characteristics of a particular protocol to see how it interacts with the rest of a network (e.g. the BitTorrent peer-to-peer (P2P) protocol).

Another direction which could be taken would be to expand upon the static visualisation created in Appendix A (Figure 8.4). This could be transformed to include animated and interactive content derived from back-end data sources, e.g. hovering over specific 'stations' or 'lines' would display real-time information relating to their status. There is ample room here for further research, which could be used to produce a detailed network management 'console' from which the status of the network could be easily monitored for a variety of purposes.

The pursuit of any of these directions would provide a greater insight into how specific areas of networks may be visualised and more importantly what knowledge we can gain from doing so.


7 References

7.1 Books (15)

[ATLS2001] Dodge, M. & Kitchin, R., 'Atlas of Cyberspace', Addison-Wesley, 2001, 0201745755

[BAUD1981] Baudrillard, J., 'Simulacra and Simulation', University of Michigan Press, 1981, 0472065211

[BUDG2003] Budgen, D., 'Software Design', Addison Wesley, 2003, 0201722194

[CARD1998] Card, S.K., J.D. Mackinlay, & B. Shneiderman, 'Readings in Information Visualization: Using Vision to Think', Morgan Kaufmann, 1999, 1558605339

[CHEN1999] Chen, C., 'Information Visualisation and Virtual Environments', Springer-Verlag, 1999, 1852331364

[DRAW1998] Giuseppe, B., 'Graph Drawing', Pearson US Imports & PHIPEs, 1998, 0133016153

[INFO2000] Spence, R., 'Information Visualization' Addison Wesley, 2000, 0201596261

[MAPP2001] Dodge, M. & Kitchin R., 'Mapping Cyberspace', Routledge, 2001, 0415198844

[PRES1997] Pressman, R., 'Software Engineering: A Practitioner's Approach', McGraw-Hill Publishing Co., 1997, 0077094115

[TCP1998] Casad, J. & Willsey B., 'Teach Yourself TCP/IP', Que, 1998, 0672312484

[TUFT1983] Tufte, E.R., 'The Visual Display of Quantitative Information', Graphics Press, 1983, 096139210X

[TUFT1990] Tufte, E.R., 'Envisioning Information', Graphics Press, 1990, 0961392118

[TUFT1997] Tufte, E.R., 'Visual Explanations: Images and Quantities, Evidence and Narrative', Graphics Press, 1997, 0961392126

[VISU1995] Brown, J., Earnshaw, R., Jern, M. & Vince, J., 'Visualization', John Wiley & Sons Inc, 1995, 0471129917

[WURM2001] Wurman, R., 'Information Anxiety 2', New Riders, 2001, 0789724103

7.2 Academic Publications (9)

[ARCH2004] Kolko, J., 'Information Architecture', Savannah College of Art & Design, 2004

[CONN2004] Jin, C., Wang, Z. & Jamin, S., 'Network Maps Beyond Connectivity', University of Michigan, 2004

[DILL1995] Dilly, R., 'Data Mining An Introduction', Parallel Computer Centre, The Queen's University of Belfast. 1995

[EVAL2003] Del Sasso Freitas, C.M., 'Evaluating Usability of Information Visualisation Techniques', Instituto de Informática, Universidade Federal do Rio Grande do Sul, 2003

[GXML2001] Herman, I. & Marshall, M.S., 'Graph XML', CWI Amsterdam, 2001

[HOFT2000] Hofton, A.E., 'Graph Layout Using Subgraph Isomorphisms', PhD Thesis, Durham University, 2000

[NEGR2002] Reffell, J., Aydelott, M. & Fitzpatrick, J., 'Networks and Graphs', University of California, 2002

[NEIL1998] Nielsen, J., '2D is Better Than 3D', Alertbox, Nielsen Norman Group, 1998

[WITT2003] Wiiter, J., 'Visualisation and Dynamic Querying of Large Multivariate Data Sets', MSc Thesis, Durham University, 2003


7.3 Web Pages (37)

All last accessed March 2005.

[WEB1] http://en.wikipedia.org/wiki/Osi_model

[WEB2] http://www.tcpdump.org/

[WEB3] http://www.ethereal.com/

[WEB4] http://en.wikipedia.org/wiki/Network_topology

[WEB5] http://en.wikipedia.org/wiki/P2P_overlay

[WEB6] http://www.wildpackets.com/compendium/IP/IP-AdCla.html

[WEB7] http://bogpeople.com/networking/ipv6/ipv6.shtml

[WEB8] http://www.die.net/doc/linux/man/man8/netstat.8.html

[WEB9] http://www.insecure.org/nmap/

[WEB10] http://www.die.net/doc/linux/man/man8/ping.8.html

[WEB11] http://www.die.net/doc/linux/man/man8/traceroute.8.html

[WEB12] http://www.die.net/doc/linux/man/man1/whois.1.html

[WEB13] http://www.insecure.org/nmap/data/nmap_manpage.html

[WEB14] http://tube.tfl.gov.uk/content/tubemap/

[WEB15] http://www.lumeta.com/mapping.html

[WEB16] http://www.yworks.com/products/yfiles/doc/developers-guide/major_layouters.html

[WEB17] http://www.caida.org/

[WEB19] http://www.w3.org/Graphics/SVG/

[WEB20] http://www.inkscape.org/

[WEB21] http://www.cs.rochester.edu/u/leblanc/csc173/fa/re.html

[WEB22] http://www.w3.org/TR/xpath

[WEB23] http://opensource.org/licenses/

[WEB24] http://eu.conecta.it/paper/Advantages_open_source_soft.html

[WEB25] http://subversion.tigris.org/

[WEB26] http://www.perldoc.com/perl5.8.4/pod/perlstyle.html

[WEB27] http://michael.toren.net/code/tcptraceroute/

[WEB28] http://oppleman.com/lft/

[WEB29] http://tkdiff.sourceforge.net/

[WEB30] http://kevan.org/extispicious

[WEB31] http://www.wilmascope.org/

[WEB32] http://www.marumushi.com/apps/flickrgraph/

[WEB33] http://www.flickr.com

[WEB34] http://www.touchgraph.com/TGGoogleBrowser.html

[WEB35] http://www.randomwire.com/2005/03/29/graphic-meltdown/

[WEB36] http://www.macromedia.com/cfusion/showcase/

[WEB37] http://www.java.com/en/everywhere/


8 Appendix A - Quasi Geography

– Milan Metro Map (2005)

Mapping information onto the physical world is difficult. Fundamentally many forms of data have no natural or obvious physical representation. Discovering new visual metaphors or adapting existing ones is a key research area for doing this [ARCH2004]. As part of evaluating these possibilities within the context of this project we will look at creating an abstract prototype spatialisation of the available data.

Although possible solutions may include complex immersive three-dimensional landscapes, here we will focus on a two-dimensional static (unchanging) representation. Whilst the force-directed layouts we have looked at so far are useful for determining structure and volumetric clustering, they have no geographic context, so it can be difficult to determine at a glance where a particular group of nodes lies in physical reality. Our aim is to do this in a simple manner which conveys an abstracted high-level network architecture of Durham using a quasi-geographic (approximate) layout.

The main issues involved with geographic layout of arc-node networks, such as we are dealing with, can be summarised as follows:

• The distances between various nodes may be too large to fit on a single screen/predefined area.
• There may be too many nodes within a small area which, when overlaid, obscure the detail below.
• Exact longitudinal and latitudinal data may be absent, incomplete or simply unknown.
• Links between nodes may overlap, reducing clarity and covering other details.
• Physical terrain is rarely uniform whereas most data is the opposite, causing further incompatibilities.

Various solutions have been proposed and utilised to reduce the impact of these problems with varying degrees of success. Classic examples can be seen in tube/subway/metro map designs across the world. Most involve reducing the information to its core constituents and then simplifying the geography; hence the map is not the territory, but rather its contents convey the latter through the reader's assumed intuition. Where most fail is in either trying to condense the information too far (so it loses perspective) or not going far enough (Figure 8.1).


Having said this, these circuit-diagram style maps have been hugely successful at directing travellers through the world's underground labyrinths and have a proven track record of being more user-friendly than their conventional counterparts (see Chapter 2 Part B for a discussion of the London Underground map). It is easy to see how the deliberate introduction of distortion can be used to improve legibility (in visual systems at least). Not merely the relationships between lines, but a certain sense of the psychological (as opposed to actual) distance between central and outlying stations, comes to the fore in many designs.

While this style of map has been primarily designed for MRT (Mass Rapid Transit) systems, there is no reason why it should not be applied to any other form of interlinked representation (such as a computer network). It therefore seems strange that this form of quasi-geographical map has found no real application outside of transport systems.

8.1 Technical Considerations

Maps such as we have proposed should be resolution independent, that is, they should be reproducible at any size without loss of clarity. Conventional raster-based graphics packages (such as Adobe Photoshop) are inadequate for this purpose because they only allow graphics to be produced at a fixed resolution (without stretching or compression). To overcome this problem we will use a relatively new XML-based graphic format known as SVG (Scalable Vector Graphics):

“SVG is a language for describing two-dimensional graphics in XML. SVG allows for three types of graphic objects: vector graphic shapes (e.g. paths consisting of straight lines and curves), images and text. Graphical objects can be grouped, styled, transformed and composited into previously rendered objects. The feature set includes nested transformations, clipping paths, alpha masks, filter effects and template objects.

SVG drawings can be interactive and dynamic. Animations can be defined and triggered either declaratively (i.e. by embedding SVG animation elements in SVG content) or via scripting.”

– W3C SVG Working Group [WEB19]

The main upshot of this is that graphics produced in the SVG format can be scaled to any size with zero loss of resolution or distortion. Graphics are no longer limited by fixed pixels; scalable graphics adjust to the available screen resolution (or any other output medium). This alone makes SVG attractive to graphic designers, as it solves one of the most frustrating issues they face: creating designs that are as interoperable, yet as visually rich, as possible.

Unlike most graphic formats, an interesting aspect of SVG is that the scalable methodology is not rendered via a binary graphic file but via plain ASCII mark-up (as with HTML) that is then interpreted. Unfortunately the SVG syntax is quite complex, and the more complicated a design becomes, the more complicated the mark-up. Luckily there are many GUI tools for SVG design. We have used Inkscape [WEB20], a highly capable open source SVG editor that follows the W3C standard, to create the final graphic (Figure 8.4).
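To illustrate what this mark-up looks like in practice, a single 'line' with two 'stations' could be described with a fragment of the following kind (the coordinates and colours are illustrative only, not taken from the final map):

<svg xmlns="http://www.w3.org/2000/svg" width="200" height="100">
  <path d="M 20 80 L 100 80 L 160 40" stroke="#004de6" stroke-width="6" fill="none"/>
  <circle cx="20" cy="80" r="8" fill="#ffffff" stroke="#000000" stroke-width="2"/>
  <circle cx="160" cy="40" r="8" fill="#ffffff" stroke="#000000" stroke-width="2"/>
</svg>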

An A0 oversize print of the graphic (90.5 x 124.5 cm) showed no loss of clarity, proving the format's claims.


Figure 8.1: Tokyo Metropolitan Area Rail Map (partial)


8.2 Mapping the Durham 'Network'

The main focal point in Durham is the Cathedral. It is visible for miles from almost everywhere surrounding the city centre and can be used as a reference point for people wanting to orientate themselves within the landscape. Another main feature of the terrain is the River Wear, which encircles the peninsula on which the Cathedral is built. Most of the university lies to the south of and directly adjacent to the river, which should be included in any cartographic representation.

From traditional two-dimensional maps (Figure 8.2) the city can be broken down into four distinct areas (or zones):

• The Bailey, running the length of the peninsula
• New & Old Elvet, spanning from around the student union to Ustinov College
• The Science Site, occupying a large area east of South Road
• The Hill, containing all the newer colleges

Connecting the southern (Hill) and northern (Bailey) extremes is a series of interconnected main roads which acts as a central communications link between them.


Figure 8.2: Traditional Map of Durham (to scale, centred around university sites)


Figure 8.3 shows a three-dimensional projection of the same terrain from an oblique angle. This map focuses purely on university facilities and major landmarks. For anyone trying to find a specific site quickly it is somewhat limited by the fact that you have to look up its corresponding numerical code to make an association on the map, and areas of dense buildings slightly obscure the routes between them.

Before composing our own map, and to get a better understanding of the problem domain, we must first ask ourselves: what is a map?

• Aids you to find your way
• Provides associative information
• Gives direction and instructions
• “Something that suggests such a representation, as in clarity of representation.” [Dictionary.com]

When constructing a quasi geographic map the most important aspect is striking a balance between the clarity of the representation and the accuracy with which it mirrors the real geography beneath its abstraction. The reader must be able to make a connection between their own mental map of the geography and that which they see in the graphic. Without this, confusion is likely and the representation loses its meaning.

The following principles will be applied to facilitate this [ARCH2004]:

• Identify unnecessary clutter and eliminate it
• Simplify and refactor continuously
• Identify hierarchies (visual elements, based on Gestalt theory3) that produce an intended or unintentional value of information
• Pay attention to contrast, contour (shape) and size of elements
• Use colour sparingly

3 Gestalt theory is an “interdisciplinary general theory which provides a framework for a wide variety of psychological phenomena, processes, and applications. Human beings are viewed as open systems in active interaction with their environment. It is especially suited for the understanding of order and structure in psychological events.” [gestalttheory.net]


Figure 8.3: 3D Sketch Map of Durham


Figure 8.4 has been produced, inspired by the graphical conventions used in MRT system map designs, primarily from London (but also from Hong Kong and Paris). In its construction the following rules have been strictly applied:

• Only horizontal, vertical and proportionally diagonal lines (45 degrees) may be used (a small sketch of this constraint follows the list)
• Lines should not take right-angled turns where there is no station
• Lines may only overlap when intersected by a station
• Readers must be able to clearly differentiate between zones/lines
• Any labels should be legible at normal viewing distances and should all be placed horizontally
• Physical geography should be abstracted but still retain enough information for the reader to associate it with their own mental model of the terrain.
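
Although Figure 8.4 was drawn by hand, the first rule above (horizontal, vertical and 45-degree lines only) is easy to express computationally, which would be a prerequisite for the auto-generation suggested in the summary below. The following Python sketch is purely illustrative and forms no part of the delivered tool; it simply snaps the bearing of an edge between two stations to the nearest permitted direction:

    import math

    def snap_to_octilinear(x1, y1, x2, y2):
        """Return a new end point so that the edge runs at a multiple of 45 degrees."""
        dx, dy = x2 - x1, y2 - y1
        length = math.hypot(dx, dy)
        if length == 0:
            return x2, y2
        angle = math.atan2(dy, dx)
        # round the bearing to the nearest multiple of 45 degrees (pi/4 radians)
        snapped = round(angle / (math.pi / 4)) * (math.pi / 4)
        return x1 + length * math.cos(snapped), y1 + length * math.sin(snapped)

    # an edge at roughly 50 degrees is pulled back onto the 45-degree diagonal
    print(snap_to_octilinear(0, 0, 60, 72))

A full layout algorithm would also need to enforce the remaining rules (no right-angled turns away from stations, no overlapping lines), which is a considerably harder combinatorial problem.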

Great effort has been taken to follow the real geography of the area closely while at the same time abstracting it into a more linear form, with stations symmetrically aligned to allow easier comprehension by the eye. When comparing Figures 8.2 and 8.3 with Figure 8.4 you can see a clear isomorphic correspondence between the physical terrain and its abstract counterpart. This is partly due to the relative simplicity of the system, but for the most part the similarity is intentional.

8.3 Summary

Reactions from those who have seen the map have been extremely positive, although there have been minor arguments over the placement of some 'stations' on the Hill line, which generally seem to have been caused by a lack of knowledge of the physical placement of some buildings. While this map represents a concept, with further development it could be extended in many ways (e.g. auto-generation, interactive content, real-time updating). What has been produced fulfils the third advanced requirement for a static spatialisation.


Figure 8.4: Quasi Geographic Static Representation of Durham (Version 1.8)


9 Appendix B – Overlying Interconnections

9.1 Flickr Graph

Flickr Graph [WEB32] is an interactive Flash animation tool that explores the social relationships inside the photo sharing site Flickr [WEB33]. It uses the classic attraction-repulsion algorithm for generating graphs showing links between users. The visual output of this tool has many similarities to what we were trying to achieve when mapping the overlying user structure (using photos instead of hosts), so it will be evaluated in lieu of our own construction:

When the username of the person you wish to explore is entered, Flickr Graph searches Flickr for related users and pictures with one degree of separation from them. These users are then connected in a star topology to the original user, who is shown in the middle (Figure 9.1). Clicking on any of the peripheral users reorientates the graph to centre around them, carrying out the same search for related users and thus expanding the graph (Figure 9.2).
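
The attraction-repulsion (spring) layout that drives this behaviour can be sketched in a few lines. The following Python fragment is a simplified illustration of the general idea rather than the algorithm Flickr Graph actually implements; the constants, random initial positions and number of iterations are arbitrary choices for the example:

    import math, random

    def spring_layout(nodes, edges, iterations=200, k=50.0, step=0.05):
        """Tiny force-directed layout: every pair repels, connected nodes attract."""
        pos = {n: [random.uniform(0, 300), random.uniform(0, 300)] for n in nodes}
        for _ in range(iterations):
            force = {n: [0.0, 0.0] for n in nodes}
            # repulsion between every pair of nodes
            for a in nodes:
                for b in nodes:
                    if a == b:
                        continue
                    dx = pos[a][0] - pos[b][0]
                    dy = pos[a][1] - pos[b][1]
                    dist = math.hypot(dx, dy) or 0.01
                    rep = (k * k) / dist
                    force[a][0] += rep * dx / dist
                    force[a][1] += rep * dy / dist
            # attraction along edges (the 'spring' pulling connected nodes together)
            for a, b in edges:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 0.01
                att = (dist * dist) / k
                force[a][0] -= att * dx / dist
                force[a][1] -= att * dy / dist
                force[b][0] += att * dx / dist
                force[b][1] += att * dy / dist
            # move each node a small step in the direction of its net force
            for n in nodes:
                pos[n][0] += step * force[n][0]
                pos[n][1] += step * force[n][1]
        return pos

    # a central user connected in a star to five contacts, as in Figure 9.1
    users = ["centre", "a", "b", "c", "d", "e"]
    links = [("centre", u) for u in users[1:]]
    print(spring_layout(users, links))

Run over many iterations, the repulsion pushes the peripheral users apart while the springs keep them tethered to the centre, producing the even, circular arrangement visible in the Flickr Graph screenshots.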

The most interesting aspect becomes evident when you click on the 'view pics' button next to the central user's icon. This expands the user's window to include a selection of their photos based on their relevance to everyone else's in the graph, which is determined by each photo's metadata (keyword tags). In the example we have given, all the users are related by photos of flowers in Tokyo, Japan (Figure 9.3).


Figure 9.1: Flickr Graph start phase visualisation



Figure 9.2: Flickr Graph post expansion visualisation

Figure 9.3: Flickr Graph photo expansion visualisation


9.2 GoogleBrowser

Another tool which allows us to extrapolate and interactively visualise connection information is GoogleBrowser [WEB34]. This uses the Google search engine index (containing over 8,000,000,000 pages) to build up a graph of related pages and then displays their interconnections in a Java application, using a springs algorithm (attraction and repulsion) for layout. The associated nodes correspond to the search results listed when you enter a URL into Google and then click on the 'Similar pages' link below each result (Figure 9.4).

Figure 9.5 shows the interconnections between Durham University and other academic institutions determined in this manner. Edge colours reveal the relationships between the nodes at their endpoints: dark grey edges indicate that the nodes are closely related, while light grey edges indicate that the relationship is looser.


Figure 9.4: Google search results example

Figure 9.5: GoogleBrowser related web-links visualisation


This time, instead of hosts or pictures, we are representing web pages overlaying the physical network, the virtual connections being the hyperlinks between them. Clicking on an individual node reorientates the graph to show the top 10 URLs most similar to it. Interestingly, the connections between the nodes in Figure 9.5 show no geographic correlation, in that connected institutions cannot be related by their position within the UK. This is possibly an indicator of the decentralised way research is carried out between different universities.

Using the original set of criteria from Figure 5.1 we have evaluated both Flickr Graph and GoogleBrowser (Table 9.1) to get a better understanding of their capabilities. See [WEB35] for an evaluation of a wider range of tools (by the same author).

Limitations
  Flickr Graph: Needs a fast computer to run smoothly. The bigger the display, the more area the visualisation has in which to organise itself efficiently. Can only handle around 10 degrees of expansion before the visualisation breaks down.
  GoogleBrowser: Needs a fast computer to run smoothly. The bigger the display, the more area the visualisation has in which to organise itself efficiently. Takes around 5 minutes for the graphic to stabilise; clicking on a host will destabilise the graphic.

Cognitive Complexity
  Flickr Graph: Displays data in two dimensions only. Nodes are expanded out from a single source point and may be positioned manually by dragging around the display area. 5 levels of zoom are supported.
  GoogleBrowser: Displays data in two dimensions only. Nodes are arranged in a rough pre-visualisation position before being properly arranged. Degrees of relevance and scale may be altered.

Spatial Organisation
  Flickr Graph: Layout is circular about the selected user and all other interconnected users. Occlusion is possible when the photo view is expanded. Sometimes users will float out of the display area.
  GoogleBrowser: Occlusion occurs frequently between node labels. This can be reduced by switching from URL titles to points, but this reduces comprehension and search abilities.

Information Coding
  Flickr Graph: More users may be added to the visualisation outside the current sphere of influence. If no connection is found the previous graph fades away.
  GoogleBrowser: Background colour and graph radius may be altered interactively, thus updating the graphic. Coloured tags are used to indicate the state of computation for each node.

State Transition
  Flickr Graph: Transitions between representations are fluid as new users 'blossom' out of the existing central user and arrange themselves.
  GoogleBrowser: When enumerating new links transitions can be fragmented, as the Java VM cannot keep up with the updates even on fast machines.

Table 9.1: Visualisation criteria evaluation

While neither tool is perfect in terms of the system resources they require to run, both produce aesthetically pleasing visualisations which give real value and meaning to the information they display.
