© Martin Geddes Consulting Ltd 2015
Network Performance Management
The journey from skilled craft to hard science
Summary

Networks supply both connectivity and performance to external users’ distributed computing
applications. Internally, networks can themselves be considered large distributed supercomputers.
There is an underlying physical and technical reality to network operation, both externally
and internally. Networks’ ability to deliver performance-based value is subject to constraints
imposed by that reality: physics; maths; deployed technology; economics; and regulation.
For physics, there is already an established science that models that one aspect of reality. The new
science of network performance relates the performance aspects of all of these constraints.
What we offer is insight into this science, and knowledge transfer of the key skills. We also
offer expertise, honed by practical experience, in applying the science at various scales in
the real world. This is backed by a practical and proven toolset, and together these enable a
step-change in customer experience and cost structure.
That step change is possible because network performance science is a paradigm change.
Broadband and the Internet have the counter-intuitive and paradoxical character of quantum
physics, because packet-based statistical multiplexing is stochastic in nature. This
contrasts with the familiar “classical” world of circuits. Current broadband network design,
economics and operations are all still unconsciously tied to a circuit paradigm.
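To make the contrast concrete, below is a minimal simulation sketch (our own illustration, not part of the original argument) of a single statistically multiplexed link modelled as an M/M/1 queue; the load, service rate and sample count are arbitrary assumed values.

```python
import random

def mm1_delays(arrival_rate, service_rate, n_packets=100_000, seed=1):
    """Per-packet delay (queueing + transmission) on one statistically
    multiplexed link, modelled as an M/M/1 queue. Units are arbitrary."""
    rng = random.Random(seed)
    clock, link_free_at, delays = 0.0, 0.0, []
    for _ in range(n_packets):
        clock += rng.expovariate(arrival_rate)      # next packet arrives
        start = max(clock, link_free_at)            # wait if the link is busy
        link_free_at = start + rng.expovariate(service_rate)
        delays.append(link_free_at - clock)         # total delay experienced
    return delays

def quantile(samples, q):
    ordered = sorted(samples)
    return ordered[int(q * (len(ordered) - 1))]

if __name__ == "__main__":
    d = mm1_delays(arrival_rate=0.7, service_rate=1.0)   # 70% average load
    print(f"mean delay        : {sum(d) / len(d):.2f}")
    print(f"99.9th percentile : {quantile(d, 0.999):.2f}")
    # A dedicated circuit at the same rate would deliver ~1.0 every time;
    # under statistical sharing the tail is several times the mean.
```

The point is not the particular numbers but the shape of the result: under statistical sharing, delay is a probability distribution with a long tail, and intuitions carried over from circuits, where delay is essentially fixed, systematically mislead.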
Engaging with this paradigm change poses both an intellectual and practical challenge. Taking
radically new technology and inserting it into the existing paradigm of people and processes
does not deliver the hoped-for outcome. The science can only be absorbed through cycles of
action-oriented learning that evolve at the same speed as the people and processes. There is
no magic “knowledge pill” or network mechanism that can short-circuit this.
Failure to engage with this paradigm change represents a failure to engage with technical
reality. This poses a serious (indeed existential) threat to operators. All technological inputs of
the delivery chain are becoming commoditised: Intel/ARM CPUs, fibre, standard packet and
SDN software, etc. Furthermore, the present construction of some of these is in conflict with
the external constraints imposed by reality, notably the mathematics of statistical
multiplexing. The result is a technically and economically unsustainable business model.
The alternative is to align with the technical reality, and fully exploit the opportunities it offers.
Mastery of this will result in valuable intellectual property: how to integrate all the elements,
and embed the resulting service into profitable digital supply chains. This cannot come from
your equipment suppliers. By definition anything they offer is a non-differentiating
commodity. We call this the “ultracomputing” challenge (see Appendix A).
Understanding the problem

To get a grasp of the underlying issue, you need to zoom out to the highest level. There are
three key conceptual lenses through which to view the network performance issue:
Timescales: The network is a resource allocation system at all timescales (10^-9 to 10^9
seconds), from scheduling an individual packet in nanoseconds to capacity investments that
play out over decades. Networks perform “trades” to match supply to demand.
Skill sets: The three core skills are to measure, model and manipulate performance.
Business processes: All OSS/BSS processes fall into one of three buckets: concept-to-
market (C2M), lead-to-cash (L2C) and trouble-to-resolution (T2R). A detailed list is
offered in Appendix B.
Success comes from using the right skill sets to configure the right business processes to
perform the “right” trades at the right timescales. There is a complex mapping and inter-
dependency, which varies from operator to operator. The challenge is to acquire the
capabilities in the appropriate order. So where to begin?
Timescales: It is easier to work with longer timescales than shorter ones.
Skill sets: Until you can measure the right thing, and understand what it means,
scientific modelling and manipulation are impossible.
Business processes: In-life management is typically the greatest source of cost pain,
and you can’t deploy advanced assured services if you can’t do their in-life
management, so trouble-to-resolution is the place to start.
Thus rather than trying to bite off the whole problem of a paradigm shift, there are some clear
corner cases that are the best candidates with which to begin the capability transformation:
fault isolation and (problematic) capacity planning.
The journey

Step zero is to select a “problem” network (ideally with a “must retain” client account that is
coming up for contract renewal). We prefer to work on a B2B (or telco-to-telco) case, such as a
large corporation outsourcing their IT.
1. People first. Engage in network performance science education. Begin with one day of
fundamentals, plus one day of practical training in measurement. (This forms the start of a
curriculum shown in Appendix C.)
2. Process next. We work on skills transfer for fault isolation, demonstrating ROI from the
science-led approach that could not otherwise have been obtained.
3. Technology last. Create your own measurement system that can reproduce this at scale.
We can supply the tools. (You could build your own, but it would delay things by years.)
Once this is in place, further cycles are possible with expanded scope: capacity planning &
supply chain management for more sophisticated products (e.g. VPN, UC).
The ultracomputing prize

The set of enablers for ultracomputing is listed in Appendix D. If you have these, what can you
expect?
In the business domain, you can be like a FedEx of the digital logistics world, controlling
complete supply chains. The value is in the coordination of the trading spaces to match supply
and demand, not in capex-heavy asset ownership.
In the network domain, you can be like a Maersk of networks, where you have massive
increasing returns to scale from being able to aggregate heterogeneous demand and
multiplex that traffic.
In the science and technology domain, you can be like a GE or Rolls-Royce, with “power plants”
for information translocation and distributed computing whose “ultracomputing”
performance envelope greatly exceeds that of rivals.
Appendix A: The ultracomputing opportunity
Future on-demand enterprise service delivery represents a qualitative change in technical
difficulty compared to the past. This dictates that a whole new skillset must be learnt, which we
call ultracomputing. This takes the essence of supercomputing, but scales it up to a highly
distributed and virtualised environment.
Those who master ultracomputing skills will achieve cost and performance for cloud-based
services that far exceed those of their rivals, with much lower risk. This appendix describes the
challenge, the opportunity, and how we can help you seize the ultracomputing prize.
Key understanding: all networks are large-scale distributed parallel supercomputers
In supercomputing, you have a large number of interconnected nodes involved in
computation and communication. They simultaneously work on many inter-dependent
problems. The system must remain stable at all loads, produce outputs within bounded
timeframes, and be resilient to component or process failure. The same requirements are
being placed on telecoms networks.
However, in telecoms networks the relative costs of the component computation and
communications technologies continually vary. Furthermore, the interconnection between
these functions can no longer be assumed to be carried over dedicated circuits, as all traffic is
now over a common statistically shared transmission medium. The cost structure and
performance of the transmission can vary from one territory to the next, as well as dynamically
over time.
As a result, the optimal location of each function in the distributed architecture can also
change. Performance is specific to each network configuration, rather than being a generic
property of protocol behaviour.
Ultracomputing thus demands a new discipline: the performance engineering of complete
large-scale dynamic distributed architectures. Critically, this is distinct from the engineering of
any of the sub-components.
The challenge: find the optimal resource trade-offs at all timescales
The optimal location of any function in an ultracomputing environment depends upon both
the desired customer experience and the total cost of ownership. The customer experience
depends on the quality of experience (QoE) performance hazards; the total cost of ownership
depends on the cost of mitigating or addressing those hazards, and the level of financial
predictability that results. The ultracomputer has to enable the appropriate resource trade-offs
using a distributed resource allocation model.
This plays out differently for each part of the service ecosystem:
For network operators: where to place caches, radio controllers, or internet breakout.
For the content distributors: where to place delivery systems, when/whether to use
multicast, where to place transcoders (from centrally down to every set top box).
For cloud-service providers: where to place the application functionality – how much
local, and how much remote, given that functional splitting increases implementation
complexity.
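As a hedged illustration of what such a trade-off looks like in the simplest possible case, the sketch below (a toy example of ours, with invented names, costs and hazard figures) picks the cheapest placement for a single function whose modelled QoE hazards stay within agreed bounds. The real problem differs in scale rather than kind: many interacting functions, dynamically shifting costs, and hazards that compose along the path.

```python
from dataclasses import dataclass

@dataclass
class Placement:
    name: str
    monthly_cost: float     # cost of hosting the function at this location
    p99_delay_ms: float     # modelled 99th-percentile delay to the user base
    loss_rate: float        # modelled packet loss ratio on the path

def best_placement(candidates, max_p99_ms, max_loss):
    """Cheapest placement whose modelled QoE hazards stay within the bounds.
    Returns None if nothing is compliant - itself a useful answer, since it
    means the outcome cannot be assured from these locations at all."""
    compliant = [c for c in candidates
                 if c.p99_delay_ms <= max_p99_ms and c.loss_rate <= max_loss]
    return min(compliant, key=lambda c: c.monthly_cost, default=None)

if __name__ == "__main__":
    # Invented figures for a single transcoder placement decision.
    options = [
        Placement("central data centre", 1_000, p99_delay_ms=80, loss_rate=0.002),
        Placement("metro PoP",           2_500, p99_delay_ms=25, loss_rate=0.001),
        Placement("every set-top box",   9_000, p99_delay_ms=2,  loss_rate=0.0005),
    ]
    choice = best_placement(options, max_p99_ms=30, max_loss=0.001)
    print(choice.name if choice else "no compliant placement")   # metro PoP
```

The value of a compositional performance model is precisely that the hazard figures for each candidate can be predicted before deployment rather than discovered afterwards.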
In the ultracomputing world, the design space is now large, irregular, and involves interactions
of sub-systems from many vendors. The current virtualisation trend has magnified the question
of how best to allocate resources. Once a function can be located in many places, the total
number of combinations becomes too high to test and validate empirically before deployment:
just ten placeable functions with twenty candidate locations each would already yield 20^10
(roughly 10^13) possible configurations.
Ultracomputing therefore demands new skills to model and manage distributed systems at
scale, and the trade-offs involved at all timescales, from design to configuration to operation.
The ultracomputing skill set
In ultracomputing you need to be able to perform the following design and engineering
activities:
Reason ex-ante about complete systems, and the interaction of all their sub-
components.
Understand and model the predictable region of operation and its failure modes
under load, so as not to cause localised or widespread failure.
Understand how finite communication and computation resources are constrained by
both capacity and schedulability factors, and model the complex range of interactions
against these two constraints (a minimal sketch of the two constraints follows this list).
Know whether demand can be scheduled to get the resources it wants in the
timescales it requires.
Manage both the resources of the external user processes and the internal
communication and coordination resources, which are all multiplexed together.
Allocate resources for all the above using a coherent distributed resource management
system.
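A minimal sketch of the capacity-versus-schedulability point (our own illustration, with invented figures): a link can have ample spare average capacity and still be unable to schedule a new latency-sensitive demand within its delay budget, so both tests have to pass before the demand is admitted.

```python
def admissible(existing_flows, new_flow, link_capacity_mbps):
    """Admit new_flow only if both constraints hold on this one link:
    1. capacity       - the total average load still fits;
    2. schedulability - the worst-case burst backlog queued ahead of the new
       flow can be drained within its delay budget (a crude FIFO bound)."""
    flows = existing_flows + [new_flow]

    # Constraint 1: capacity.
    if sum(f["rate_mbps"] for f in flows) > link_capacity_mbps:
        return False

    # Constraint 2: schedulability.
    backlog_mbit = sum(f["burst_mbit"] for f in flows)
    drain_time_ms = backlog_mbit / link_capacity_mbps * 1000
    return drain_time_ms <= new_flow["delay_budget_ms"]

if __name__ == "__main__":
    existing = [{"rate_mbps": 30, "burst_mbit": 4.0, "delay_budget_ms": 200},
                {"rate_mbps": 20, "burst_mbit": 2.0, "delay_budget_ms": 200}]
    voice = {"rate_mbps": 1, "burst_mbit": 0.1, "delay_budget_ms": 20}
    # Plenty of spare average capacity (51 of 100 Mb/s), yet the 6.1 Mbit of
    # potential backlog takes 61 ms to drain, breaking the 20 ms budget.
    print(admissible(existing, voice, link_capacity_mbps=100))   # False
```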
Regrettably, the telecoms and IT industries have both yet to grasp these issues, either
conceptually or practically. Yet the application of known mathematical and performance engineering
techniques can resolve the technical problems: how to decompose the system, understand the
trade-offs being made, optimise for specific cost or user experience outcomes, and operate
these complex systems with a high degree of predictability even in overload.
How we can help you
We have made fundamental conceptual and practical breakthroughs in network
performance science. We (uniquely) know how to measure, model and manipulate systems at
the ultracomputing scale.
Measure: we can perform network “X-rays” to get high-resolution “pictures” of the
performance of network elements, and how they (de)compose in both space and time.
Model: we have a robust calculus that lets us predict the performance of systems before
they are built or re-configured. This is the essence of service orchestration in
ultracomputing (an illustrative sketch of this kind of composition appears at the end of
this appendix).
Manipulate: we have proven algorithms that can safely drive these systems into
saturation, where they are most profitable, whilst still delivering assured outcomes.
These techniques have been proven at clients such as Boeing, BT, CERN and a tier 1 mobile
operator.
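As an illustrative sketch of this kind of compositional prediction (our own toy model, not the toolset referred to above, and with invented per-hop figures): if each element's contribution to delay is treated as an independent random variable, the end-to-end delay distribution of a candidate path can be composed numerically and checked against a quantile target before anything is built or re-configured.

```python
import random

def compose_path_delay(hops, n_samples=200_000, seed=7):
    """Monte Carlo composition of per-hop delay models into an end-to-end
    delay distribution. Each hop contributes a fixed propagation/serialisation
    delay plus an exponentially distributed queueing term (deliberately crude)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        total = 0.0
        for fixed_ms, mean_queue_ms in hops:
            total += fixed_ms + rng.expovariate(1.0 / mean_queue_ms)
        samples.append(total)
    return sorted(samples)

def meets_target(sorted_samples, bound_ms, percentile):
    """Does the composed path keep `percentile` of packets under `bound_ms`?"""
    return sorted_samples[int(percentile * (len(sorted_samples) - 1))] <= bound_ms

if __name__ == "__main__":
    # Invented per-hop models: (fixed delay in ms, mean queueing delay in ms).
    access, metro, core = (5.0, 4.0), (2.0, 1.0), (10.0, 0.5)
    path = compose_path_delay([access, metro, core])
    p999 = path[int(0.999 * (len(path) - 1))]
    print(f"p99.9 end-to-end delay: {p999:.1f} ms")
    print("meets a 50 ms bound at p99.9:", meets_target(path, 50.0, 0.999))
```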
Appendix B: Performance-related business processes
Concept-to-market
Profitability metric space + cost/value metrics
M&A performance due diligence
Performance-aware service design and capture of operational performance invariants
Performance-aware failure modes effects analysis (both business and civil
contingencies)
Deployability analysis (from lab to real world, in varying environments)
Scalability rewards and risks (managing all levels of variation + effects on scalability)
Product development pipeline & product portfolio management
Service performance architecture (beyond functional correctness to include non-
functional characteristics; turns performance into a first-class entity, not an
afterthought)
Performance-centric design (ex-ante performance engineering of outcomes + resource
costs)
Quality arbitrage management: defence; offence; partner management
Supply chain performance management (horizontal along chain; vertical to suppliers;
how to structure the performance aspects of the contracts).
Lead-to-cash
Performance fraud management
SLA/QTA compliance
Per customer service trending
Performance invariant assurance (evidencing SLA compliance)
Service performance resource accounting (cost of supporting the service; opportunity
cost of running the service) – a performance equivalent of the “bill of materials”.
Service cost pricing: options pricing, time-volume pricing
Trouble-to-resolution
Root cause “fault” isolation (as many “faults” are design issues)
Litigation + liability management
Performance variation management
Regulatory conformance of performance; explicit and implicit reputation
management; evidence of delivery
Appendix C: Sample curriculum for network performance engineering
Framework for customer experience and service quality performance
management
Basic technical understanding of network performance
The relationship between customer experience and service quality
Relationship of organisational roles in delivering optimal business outcomes
Design aspects
Technical design
- How to relate cross-sectional capacities of edge and core to performance
- How to manage performance risk within and between management domains
Service design
- How to correctly quantify the costs of a service, and of service growth, on
common infrastructure
- How best to aggregate users and services in order to achieve cost reduction
Contracting
How to construct supply chains that deliver consistent end-to-end performance
How to assure service delivery from a contractual perspective
Deployment and provisioning
How to manage the end-to-end performance aspects of turn-up & acceptance testing
How to incorporate performance into your initial deployment processes
Ongoing service assurance
How to tie high-level performance outcomes to low-level network performance
metrics
How to manage service quality and performance trade-offs
How to distinguish between demand-side and supply-side performance issues
Fault isolation
How to determine the root cause of performance issues
How to predict performance hazards by trending low-level metrics for hazard arming
Appendix D: The enablers you need for ultracomputing
Thinkware: capability primitives
ΔQ. A metric and algebra that captures and measures outcomes at every level of
abstraction (from the network up to the end user).
Quantitative Translocation Agreements (QTAs). These capture the relationship of
supply and demand, at every level and timescale (a toy encoding of a QTA is sketched
after this list).
Predictable Region of Operation (PRO). The system needs to be managed within this
region, especially where it has performance “turning points”.
Performance hazard hierarchy and analysis. This describes how to think about
performance hazards and their mitigation at various levels of size and abstraction; it is
central to framing the management problem and to making rational design and operational
decisions.
ΔQ QTA aggregation. This is an input from service planning into capacity planning,
telling you how to convolve services and customers into an aggregate requirement. This
then tells you how much of the resource you need at different physical locations to
achieve the outcome for a given level of performance.
ΔQ resource models. Serial-parallel flow graphs and application performance. An
algebra and calculus for combining process behaviour and network behaviour into
outcomes and costs.
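As a toy sketch of how these primitives could hang together in software (our own illustrative encoding, not the formal ΔQ algebra; all names and numbers are invented): a QTA is expressed as quantile bounds on delay plus a loss bound, measured ΔQ samples are assessed against it, and a breach hazard warning is raised while there is still headroom, before the agreement is actually broken.

```python
from dataclasses import dataclass

@dataclass
class QTA:
    """A toy quantitative agreement: delay bounds at given quantiles, plus a loss bound."""
    delay_bounds_ms: dict    # e.g. {0.50: 20.0, 0.999: 100.0}
    max_loss: float          # maximum acceptable loss ratio

def observed_quantile(delays_ms, q):
    ordered = sorted(delays_ms)
    return ordered[int(q * (len(ordered) - 1))]

def assess(qta, delays_ms, lost, sent, margin=0.8):
    """Return 'breach', 'hazard' (within `margin` of a bound), or 'ok'."""
    checks = [(observed_quantile(delays_ms, q), bound)
              for q, bound in qta.delay_bounds_ms.items()]
    checks.append((lost / sent, qta.max_loss))
    status = "ok"
    for observed, bound in checks:
        if observed > bound:
            return "breach"
        if observed > margin * bound:
            status = "hazard"            # headroom eroding: warn before breach
    return status

if __name__ == "__main__":
    qta = QTA(delay_bounds_ms={0.50: 20.0, 0.999: 100.0}, max_loss=0.001)
    delays = [12, 14, 15, 18, 19, 21, 24, 30, 55, 85]   # illustrative ΔQ delay samples (ms)
    # 'hazard': the median (19 ms) sits inside the 80% warning margin of its 20 ms bound.
    print(assess(qta, delays, lost=4, sent=10_000))
```

This is the spirit of the "QTA-based breach hazard warnings" listed under Software below.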
Wetware
Stochasticians
Modellers
Business process design & operational management
Performance accounting function:
o Quality czar (Chief Quality Officer)
o Performance hazard modelling
Software
QTA library (for different applications and bearer combinations; distributing ΔQ
budgets and performance hazard arming)
Measurement repository & analysis
QTA-based analytics
Multipoint measurement data processing
QTA-based breach hazard warnings
Hardware
Probes & test stream data generators
QTA-aware SDN orchestration
Computation resources for measurement & modelling
Communication resources for passing data & control plane
New approaches and mechanisms for short-timescale ΔQ trading