© Martin Geddes Consulting Ltd 2015
Network Performance Management
The journey from skilled craft to hard science
Summary

Networks supply both connectivity and performance to external users’ distributed computing
applications. Internally, networks can themselves be considered large distributed supercomputers.
There is an underlying physical and technical reality to network operation, both externally
and internally. Networks’ ability to deliver performance-based value is subject to constraints
imposed by that reality: physics; maths; deployed technology; economics; and regulation.
For physics, there is already an established science that models that one aspect of reality. The new
science of network performance relates the performance aspects of all of these constraints.
What we offer is insight into this science, and knowledge transfer of the key skills. We also
offer expertise, honed by practical experience, in applying the science at various scales in
the real world. This is backed by a practical and proven toolset, and together these enable a
step-change in customer experience and cost structure.
That step change is possible because network performance science is a paradigm change.
Broadband and the Internet have the counter-intuitive and paradoxical character of quantum
physics, because packet-based statistical multiplexing is stochastic in nature. This
contrasts with the familiar “classical” world of circuits. Current broadband network design,
economics and operations are all still unconsciously tied to a circuit paradigm.
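To make the contrast concrete, below is a minimal simulation sketch (our own illustration, not part of the original argument) of a single statistically multiplexed link modelled as an M/M/1 queue; the load, service rate and sample count are arbitrary assumed values.

```python
import random

def mm1_delays(arrival_rate, service_rate, n_packets=100_000, seed=1):
    """Per-packet delay (queueing + transmission) on one statistically
    multiplexed link, modelled as an M/M/1 queue. Units are arbitrary."""
    rng = random.Random(seed)
    clock, link_free_at, delays = 0.0, 0.0, []
    for _ in range(n_packets):
        clock += rng.expovariate(arrival_rate)      # next packet arrives
        start = max(clock, link_free_at)            # wait if the link is busy
        link_free_at = start + rng.expovariate(service_rate)
        delays.append(link_free_at - clock)         # total delay experienced
    return delays

def quantile(samples, q):
    ordered = sorted(samples)
    return ordered[int(q * (len(ordered) - 1))]

if __name__ == "__main__":
    d = mm1_delays(arrival_rate=0.7, service_rate=1.0)   # 70% average load
    print(f"mean delay        : {sum(d) / len(d):.2f}")
    print(f"99.9th percentile : {quantile(d, 0.999):.2f}")
    # A dedicated circuit at the same rate would deliver ~1.0 every time;
    # under statistical sharing the tail is several times the mean.
```

The point is not the particular numbers but the shape of the result: under statistical sharing, delay is a probability distribution with a long tail, and intuitions carried over from circuits, where delay is essentially fixed, systematically mislead.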
Engaging with this paradigm change poses both an intellectual and practical challenge. Taking
radically new technology and inserting it into the existing paradigm of people and processes
does not deliver the hoped-for outcome. The science can only be absorbed through cycles of
action-oriented learning that evolve at the same speed as the people and processes. There is
no magic “knowledge pill” or network mechanism that can short-circuit this.
Failure to engage with this paradigm change represents a failure to engage with technical
reality. This poses a serious (indeed existential) threat to operators. All technological inputs of
the delivery chain are becoming commoditised: Intel/ARM CPUs, fibre, standard packet and
SDN software, etc. Furthermore, the present construction of some of these is in conflict with
the external constraints imposed by reality, notably the mathematics of statistical
multiplexing. The result is a technically and economically unsustainable business model.
The alternative is to align with the technical reality, and fully exploit the opportunities it offers.
Mastery of this will result in valuable intellectual property: how to integrate all the elements,
and embed the resulting service into profitable digital supply chains. This cannot come from
your equipment suppliers. By definition anything they offer is a non-differentiating
commodity. We call this the “ultracomputing” challenge (see Appendix A).
Understanding the problem

To get a grasp of the underlying issue, you need to zoom out to the highest level. There are
three key conceptual lenses through which to view the network performance issue:
Timescales: The network is a resource allocation system at all timescales (10^-9 to 10^9
seconds), from scheduling an individual packet in nanoseconds to capacity investments that
play out over decades. Networks perform “trades” to match supply to demand.
Skill sets: The three core skills are to measure, model and manipulate performance.
Business processes: All OSS/BSS processes fall into one of three buckets: concept-to-
market (C2M), lead-to-cash (L2C) and trouble-to-resolution (T2R). A detailed list is
offered in Appendix B.
Success comes from using the right skill sets to configure the right business processes to
perform the “right” trades at the right timescales. There is a complex mapping and inter-
dependency, which varies from operator to operator. The challenge is to acquire the
capabilities in the appropriate order. So where to begin?
Timescales: It is easier to work with longer timescales than shorter ones.
Skill sets: Until you can measure the right thing, and understand what it means,
scientific modelling and manipulation are impossible.
Business processes: In-life management is typically the greatest source of cost pain,
and you can’t deploy advanced assured services if you can’t do their in-life
management, so trouble-to-resolution is the place to start.
Thus rather than trying to bite off the whole problem of a paradigm shift, there are some clear
corner cases that are the best candidates with which to begin the capability transformation:
fault isolation and (problematic) capacity planning.
The journey

Step zero is to select a “problem” network (ideally with a “must retain” client account that is
coming up for contract renewal). We prefer to work on a B2B (or telco-to-telco) case, such as a
large corporation outsourcing their IT.
1. People first. Engage in network performance science education. Begin with one day of
fundamentals, plus one day of practical training in measurement. (This forms the start of a
curriculum shown in Appendix C.)
2. Process next. We work on skills transfer for fault isolation, demonstrating ROI from the
science-led approach that could not otherwise have been obtained.
3. Technology last. Create your own measurement system that can reproduce this at scale.
We can supply the tools. (You could build your own, but it would delay things by years.)
Once this is in place, further cycles are possible with expanded scope: capacity planning &
supply chain management for more sophisticated products (e.g. VPN, UC).
The ultracomputing prize

The set of enablers for ultracomputing is listed in Appendix D. If you have these, what can you
expect?
In the business domain, you can be like a FedEx of the digital logistics world, controlling
complete supply chains. The value is in the coordination of the trading spaces to match supply
and demand, not in capex-heavy asset ownership.
In the network domain, you can be like a Maersk of networks, where you have massive
increasing returns to scale from being able to aggregate heterogeneous demand and
multiplex that traffic.
In the science and technology domain, you can be like a GE or Rolls-Royce, with “power plants”
for information translocation and distributed computing whose “ultracomputing”
performance envelope greatly exceeds that of rivals.
Appendix A: The ultracomputing opportunity
Future on-demand enterprise service delivery represents a qualitative change in technical
difficulty compared to the past. This dictates that a whole new skillset must be learnt, which we
call ultracomputing. This takes the essence of supercomputing, but scales it up to a highly
distributed and virtualised environment.
Those who master ultracomputing skills will achieve cost and performance for cloud-based
services that far exceed those of their rivals, with much lower risk. This appendix describes the
challenge, the opportunity, and how we can help you seize the ultracomputing prize.
Key understanding: all networks are large-scale distributed parallel supercomputers
In supercomputing, you have a large number of interconnected nodes involved in
computation and communication. They simultaneously work on many inter-dependent
problems. The system must remain stable at all loads, produce outputs within bounded
timeframes, and be resilient to component or process failure. The same requirements are
being placed on telecoms networks.
However, in telecoms networks the relative costs of the component computation and
communications technologies continually vary. Furthermore, the interconnection between
these functions can no longer be assumed to be carried over dedicated circuits, as all traffic is
now over a common statistically shared transmission medium. The cost structure and
performance of the transmission can vary from one territory to the next, as well as dynamically
over time.
As a result, the optimal location of each function in the distributed architecture can also
change. Performance is specific to each network configuration, rather than being a generic
property of protocol behaviour.
Ultracomputing thus demands a new discipline: the performance engineering of complete
large-scale dynamic distributed architectures. Critically, this is distinct from the engineering of
any of the sub-components.
The challenge: find the optimal resource trade-offs at all timescales
The optimal location of any function in an ultracomputing environment depends upon both
the desired customer experience and the total cost of ownership. The customer experience
depends on the quality of experience (QoE) performance hazards; the total cost of ownership
depends on the cost of mitigating or addressing those hazards, and the level of financial
predictability that results. The ultracomputer has to enable the appropriate resource trade-offs
using a distributed resource allocation model.
This plays out differently for each part of the service ecosystem:
For network operators: where to place caches, radio controllers, or internet breakout.
For the content distributors: where to place delivery systems, when/whether to use
multicast, where to place transcoders (from centrally down to every set top box).
For cloud-service providers: where to place the application functionality – how much
local, and how much remote, given that functional splitting increases implementation
complexity.
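As a hedged illustration of what such a trade-off looks like in the simplest possible case, the sketch below (a toy example of ours, with invented names, costs and hazard figures) picks the cheapest placement for a single function whose modelled QoE hazards stay within agreed bounds. The real problem differs in scale rather than kind: many interacting functions, dynamically shifting costs, and hazards that compose along the path.

```python
from dataclasses import dataclass

@dataclass
class Placement:
    name: str
    monthly_cost: float     # cost of hosting the function at this location
    p99_delay_ms: float     # modelled 99th-percentile delay to the user base
    loss_rate: float        # modelled packet loss ratio on the path

def best_placement(candidates, max_p99_ms, max_loss):
    """Cheapest placement whose modelled QoE hazards stay within the bounds.
    Returns None if nothing is compliant - itself a useful answer, since it
    means the outcome cannot be assured from these locations at all."""
    compliant = [c for c in candidates
                 if c.p99_delay_ms <= max_p99_ms and c.loss_rate <= max_loss]
    return min(compliant, key=lambda c: c.monthly_cost, default=None)

if __name__ == "__main__":
    # Invented figures for a single transcoder placement decision.
    options = [
        Placement("central data centre", 1_000, p99_delay_ms=80, loss_rate=0.002),
        Placement("metro PoP",           2_500, p99_delay_ms=25, loss_rate=0.001),
        Placement("every set-top box",   9_000, p99_delay_ms=2,  loss_rate=0.0005),
    ]
    choice = best_placement(options, max_p99_ms=30, max_loss=0.001)
    print(choice.name if choice else "no compliant placement")   # metro PoP
```

The value of a compositional performance model is precisely that the hazard figures for each candidate can be predicted before deployment rather than discovered afterwards.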
In the ultracomputing world, the design space is now large, irregular, and involves interactions
of sub-systems from many vendors. The current virtualisation trend has magnified the question
of how best to allocate resources. Once a function can be located in many places, the total
number of combinations becomes too high to test and validate empirically before deployment:
just ten placeable functions with twenty candidate locations each would already yield 20^10
(roughly 10^13) possible configurations.
Ultracomputing therefore demands new skills to model and manage distributed systems at
scale, and the trade-offs involved at all timescales, from design to configuration to operation.
The ultracomputing skill set
In ultracomputing you need to be able to perform the following design and engineering
activities:
Reason ex-ante about complete systems, and the interaction of all their sub-
components.
Understand and model the predictable region of operation and its failure modes
under load, so as not to cause localised or widespread failure.
Understand how finite communication and computation resources are constrained by
both capacity and schedulability factors, and model the complex range of interactions
against these two constraints (a minimal sketch of the two constraints follows this list).
Know whether demand can be scheduled to get the resources it wants in the
timescales it requires.
Manage both the resources of the external user processes and the internal
communication and coordination resources, which are all multiplexed together.
Allocate resources for all the above using a coherent distributed resource management
system.
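A minimal sketch of the capacity-versus-schedulability point (our own illustration, with invented figures): a link can have ample spare average capacity and still be unable to schedule a new latency-sensitive demand within its delay budget, so both tests have to pass before the demand is admitted.

```python
def admissible(existing_flows, new_flow, link_capacity_mbps):
    """Admit new_flow only if both constraints hold on this one link:
    1. capacity       - the total average load still fits;
    2. schedulability - the worst-case burst backlog queued ahead of the new
       flow can be drained within its delay budget (a crude FIFO bound)."""
    flows = existing_flows + [new_flow]

    # Constraint 1: capacity.
    if sum(f["rate_mbps"] for f in flows) > link_capacity_mbps:
        return False

    # Constraint 2: schedulability.
    backlog_mbit = sum(f["burst_mbit"] for f in flows)
    drain_time_ms = backlog_mbit / link_capacity_mbps * 1000
    return drain_time_ms <= new_flow["delay_budget_ms"]

if __name__ == "__main__":
    existing = [{"rate_mbps": 30, "burst_mbit": 4.0, "delay_budget_ms": 200},
                {"rate_mbps": 20, "burst_mbit": 2.0, "delay_budget_ms": 200}]
    voice = {"rate_mbps": 1, "burst_mbit": 0.1, "delay_budget_ms": 20}
    # Plenty of spare average capacity (51 of 100 Mb/s), yet the 6.1 Mbit of
    # potential backlog takes 61 ms to drain, breaking the 20 ms budget.
    print(admissible(existing, voice, link_capacity_mbps=100))   # False
```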
Regrettably, the telecoms and IT industries have both yet to grasp these issues, either
conceptually or practically. Yet the application of known mathematical and performance engineering
techniques can resolve the technical problems: how to decompose the system, understand the
trade-offs being made, optimise for specific cost or user experience outcomes, and operate
these complex systems with a high degree of predictability even in overload.
How we can help you
We have made fundamental conceptual and practical breakthroughs in network
performance science. We (uniquely) know how to measure, model and manipulate systems at
the ultracomputing scale.
Measure: we can perform network “X-rays” to get high-resolution “pictures” of the
performance of network elements, and how they (de)compose in both space and time.
Model: we have a robust calculus that lets us predict the performance of systems before
they are built or re-configured. This is the essence of service orchestration in
ultracomputing (an illustrative sketch of this kind of composition appears at the end of
this appendix).
Manipulate: we have proven algorithms that can safely drive these systems into
saturation, where they are most profitable, whilst still delivering assured outcomes.
These techniques have been proven at clients such as Boeing, BT, CERN and a tier 1 mobile
operator.
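As an illustrative sketch of this kind of compositional prediction (our own toy model, not the toolset referred to above, and with invented per-hop figures): if each element's contribution to delay is treated as an independent random variable, the end-to-end delay distribution of a candidate path can be composed numerically and checked against a quantile target before anything is built or re-configured.

```python
import random

def compose_path_delay(hops, n_samples=200_000, seed=7):
    """Monte Carlo composition of per-hop delay models into an end-to-end
    delay distribution. Each hop contributes a fixed propagation/serialisation
    delay plus an exponentially distributed queueing term (deliberately crude)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        total = 0.0
        for fixed_ms, mean_queue_ms in hops:
            total += fixed_ms + rng.expovariate(1.0 / mean_queue_ms)
        samples.append(total)
    return sorted(samples)

def meets_target(sorted_samples, bound_ms, percentile):
    """Does the composed path keep `percentile` of packets under `bound_ms`?"""
    return sorted_samples[int(percentile * (len(sorted_samples) - 1))] <= bound_ms

if __name__ == "__main__":
    # Invented per-hop models: (fixed delay in ms, mean queueing delay in ms).
    access, metro, core = (5.0, 4.0), (2.0, 1.0), (10.0, 0.5)
    path = compose_path_delay([access, metro, core])
    p999 = path[int(0.999 * (len(path) - 1))]
    print(f"p99.9 end-to-end delay: {p999:.1f} ms")
    print("meets a 50 ms bound at p99.9:", meets_target(path, 50.0, 0.999))
```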
Appendix B: Performance-related business processes
Concept-to-market
Profitability metric space + cost/value metrics
M&A performance due diligence
Performance-aware service design and capture of operational performance invariants
Performance-aware failure modes effects analysis (both business and civil
contingencies)
Deployability analysis (from lab to real world, in varying environments)
Scalability rewards and risks (managing all levels of variation + effects on scalability)
Product development pipeline & product portfolio management
Service performance architecture (beyond functional correctness to include non-
functional characteristics; turns performance into a first-class entity, not an
afterthought)
Performance-centric design (ex-ante performance engineering of outcomes + resource
costs)
Quality arbitrage management: defence; offence; partner management
Supply chain performance management (horizontal along chain; vertical to suppliers;
how to structure the performance aspects of the contracts).
Lead-to-cash
Performance fraud management
SLA/QTA compliance
Per customer service trending
Performance invariant assurance (evidencing SLA compliance)
Service performance resource accounting (cost of supporting the service; opportunity
cost of running the service) – a performance equivalent of the “bill of materials”.
Service cost pricing: options pricing, time-volume pricing
Trouble-to-resolution
Root cause “fault” isolation (as many “faults” are design issues)
Litigation + liability management
Performance variation management
Regulatory conformance of performance; explicit and implicit reputation
management; evidence of delivery
Appendix C: Sample curriculum for network performance engineering
Framework for customer experience and service quality performance
management
Basic technical understanding of network performance
The relationship between customer experience and service quality
Relationship of organisational roles in delivering optimal business outcomes
Design aspects
Technical design
- How to relate cross-sectional capacities of edge and core to performance
- How to manage performance risk within and between management domains
Service design
- How to correctly quantify the costs of a service, and of service growth, on
common infrastructure
- How best to aggregate users and services in order to achieve cost reduction
Contracting
How to construct supply chains that deliver consistent end-to-end performance
How to assure service delivery from a contractual perspective
Deployment and provisioning
How to manage the end-to-end performance aspects of turn-up & acceptance testing
How to incorporate performance into your initial deployment processes
Ongoing service assurance
How to tie high-level performance outcomes to low-level network performance
metrics
How to manage service quality and performance trade-offs
How to distinguish between demand-side and supply-side performance issues
Fault isolation
How to determine the root cause of performance issues
How to predict performance hazards by trending low-level metrics for hazard arming
Appendix D: The enablers you need for ultracomputing
Thinkware: capability primitives
ΔQ. A metric and algebra that captures and measures outcomes at every level of
abstraction (from the network up to the end user).
Quantitative Translocation Agreements (QTAs). These capture the relationship of
supply and demand, at every level and timescale (a toy encoding of a QTA is sketched
after this list).
Predictable Region of Operation (PRO). The system needs to be managed within this
region, especially where it has performance “turning points”.
Performance hazard hierarchy and analysis. This describes how to think about
performance hazards and their mitigation at various levels of size and abstraction; it is
central to framing the management problem and to making rational design and operational
decisions.
ΔQ QTA aggregation. This is an input from service planning into capacity planning,
telling you how to convolve services and customers into an aggregate requirement. This
then tells you how much of the resource you need at different physical locations to
achieve the outcome for a given level of performance.
ΔQ resource models. Serial-parallel flow graphs and application performance. An
algebra and calculus for combining process behaviour and network behaviour into
outcomes and costs.
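As a toy sketch of how these primitives could hang together in software (our own illustrative encoding, not the formal ΔQ algebra; all names and numbers are invented): a QTA is expressed as quantile bounds on delay plus a loss bound, measured ΔQ samples are assessed against it, and a breach hazard warning is raised while there is still headroom, before the agreement is actually broken.

```python
from dataclasses import dataclass

@dataclass
class QTA:
    """A toy quantitative agreement: delay bounds at given quantiles, plus a loss bound."""
    delay_bounds_ms: dict    # e.g. {0.50: 20.0, 0.999: 100.0}
    max_loss: float          # maximum acceptable loss ratio

def observed_quantile(delays_ms, q):
    ordered = sorted(delays_ms)
    return ordered[int(q * (len(ordered) - 1))]

def assess(qta, delays_ms, lost, sent, margin=0.8):
    """Return 'breach', 'hazard' (within `margin` of a bound), or 'ok'."""
    checks = [(observed_quantile(delays_ms, q), bound)
              for q, bound in qta.delay_bounds_ms.items()]
    checks.append((lost / sent, qta.max_loss))
    status = "ok"
    for observed, bound in checks:
        if observed > bound:
            return "breach"
        if observed > margin * bound:
            status = "hazard"            # headroom eroding: warn before breach
    return status

if __name__ == "__main__":
    qta = QTA(delay_bounds_ms={0.50: 20.0, 0.999: 100.0}, max_loss=0.001)
    delays = [12, 14, 15, 18, 19, 21, 24, 30, 55, 85]   # illustrative ΔQ delay samples (ms)
    # 'hazard': the median (19 ms) sits inside the 80% warning margin of its 20 ms bound.
    print(assess(qta, delays, lost=4, sent=10_000))
```

This is the spirit of the "QTA-based breach hazard warnings" listed under Software below.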
Wetware
Stochasticians
Modellers
Business process design & operational management
Performance accounting function:
o Quality czar (Chief Quality Officer)
o Performance hazard modelling
Software
QTA library (for different applications and bearer combinations; distributing ΔQ
budgets and performance hazard arming)
Measurement repository & analysis
QTA-based analytics
Multipoint measurement data processing
QTA-based breach hazard warnings
Hardware
Probes & test stream data generators
QTA-aware SDN orchestration
Computation resources for measurement & modelling
Communication resources for passing data & control plane
New approaches and mechanisms for short-timescale ΔQ trading