Whatever the specific technology used in today's voice, data and video service provider networks, it is almost certain to be a packet architecture. Packet networks break information into small pieces (packets) that then share transmission resources. This sharing makes packet networks more efficient in their use of capacity, given that much network traffic is bursty in nature. However, it also creates potential collisions of resource demands. Packet networks also normally employ a form of topology discovery and adaptive routing designed to improve resiliency, and this can make it difficult to manage network capacity and to determine the exact nature of problems when they occur.
Sustaining a multimedia packet network, like any form of network management, is based on the classical FCAPS acronym: Fault, Configuration, Accounting, Performance, and Security. The order of consideration, however, doesn't match the acronym.
Capacity and performance management
Capacity and performance management are the classical preemptive network management tasks, and security management is increasingly being added to the list. The purpose of all three is to create a stable framework for service delivery that will provide customers with good experiences under normal conditions. Each of these is divided into two phases, planning and management.
In the planning phase, network operations uses historical data, estimates of growth in existing services or demand for new ones, and other factors to establish a baseline plan. This plan must include metrics for measuring current network behavior against the plan to determine whether its goals are being met, and making this measurement is the network manager's goal.
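Measuring current behavior against the baseline plan can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the metric names, the planned values, and the idea of a fractional tolerance above plan are hypothetical and do not come from any particular management system.

```python
from dataclasses import dataclass


@dataclass
class PlanMetric:
    name: str         # hypothetical metric name, e.g. a trunk utilization
    planned: float    # value the baseline plan expects
    tolerance: float  # allowed fractional overshoot (0.10 means 10 percent)


def check_against_plan(plan, measured):
    """Return the names of metrics measured above plan by more than the tolerance."""
    out_of_plan = []
    for metric in plan:
        value = measured.get(metric.name)
        if value is None:
            continue  # no measurement available for this metric; skip it
        if value > metric.planned * (1 + metric.tolerance):
            out_of_plan.append(metric.name)
    return out_of_plan


# Assumed example figures, for illustration only.
plan = [
    PlanMetric("trunk_a_utilization", planned=0.60, tolerance=0.10),
    PlanMetric("trunk_b_utilization", planned=0.50, tolerance=0.10),
]
measured = {"trunk_a_utilization": 0.72, "trunk_b_utilization": 0.52}
violations = check_against_plan(plan, measured)
```

Here trunk A is flagged because 0.72 is above its 0.66 limit, while trunk B stays within plan. A real plan would likely track delay, loss, and per-service metrics as well.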
Good capacity and performance planning and monitoring also recognize that the network may not always be operating in its optimum state, and that alternate states reflecting the balance of business goals should be defined for responding to failures or congestion. These are often called "failure modes." Failure mode planning is critical for multimedia networks that serve voice, data, and video because these three media are likely to have major differences in economic value and in tolerance of out-of-specification network operations. One of the challenges of networks with fully adaptive behavior (routing, spanning tree bridging) is that they may not support explicit failure modes, but simply adapt to conditions in a variable way.
Troubleshooting and problem isolation
Where an acute load condition is detected or where a fault is reported, the process shifts to the fault management area, popularly called "troubleshooting" or "problem isolation." The primary purpose of fault management, simply put, is to fix the problem. An equally important goal is to sustain network operation in a valid failure mode while the problem is resolved.
When a fault occurs, network operations personnel should first ensure that the network has entered a valid failure mode state, meaning that the remaining network resources have been allocated according to business priorities. Video is usually the form of traffic with the greatest economic value and performance sensitivity, so failure modes should protect video performance and perhaps block new video delivery requests to ensure that current ones are sustained. Voice can be treated similarly. Most data applications are tolerant of some of the delay and packet loss associated with the resource congestion likely in failure modes, so data can often be made the lowest priority.
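The priority scheme described above can be sketched as a strict-priority allocation with a simple admission check for new video requests. This is a minimal sketch, not a definitive implementation: the function names, the Mbps units, and the strict ordering of video, then voice, then data are assumptions made for illustration, and real networks would enforce such a policy through their own QoS and admission-control mechanisms.

```python
# Highest business priority first, following the ordering suggested in the text.
PRIORITY = ["video", "voice", "data"]


def allocate_failure_mode(capacity_mbps, demand_mbps):
    """Grant the remaining capacity to each traffic class in priority order."""
    allocation = {}
    remaining = capacity_mbps
    for traffic_class in PRIORITY:
        granted = min(demand_mbps.get(traffic_class, 0), remaining)
        allocation[traffic_class] = granted
        remaining -= granted
    return allocation


def admit_new_video(capacity_mbps, demand_mbps, request_mbps):
    """Admit a new video request only if current video and voice still fit."""
    protected = demand_mbps.get("video", 0) + demand_mbps.get("voice", 0)
    return protected + request_mbps <= capacity_mbps


# Assumed example: a failure leaves 100 Mbps for 130 Mbps of offered demand.
demand = {"video": 60, "voice": 20, "data": 50}
allocation = allocate_failure_mode(100, demand)
small_request_ok = admit_new_video(100, demand, 8)
large_request_ok = admit_new_video(100, demand, 25)
```

In this example video and voice receive their full demand, data is squeezed to the 20 Mbps left over, and a new 25 Mbps video request is refused because admitting it would threaten sessions already in progress.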
Once the network has settled on a failure mode, the network operations troubleshooting process is directed at finding the underlying problem. This can be approached through either reconstruction or status analysis.
Reconstruction means simply going back through alert messages to find the early period of the failure and then analyzing the events as they unfolded. It requires a log of network events with very accurate time stamps on alert messages to ensure they are processed in sequence, and it is often the only way to find problems that arise from "soft" faults like congestion. It normally involves moving forward to the point where the problem is clearly visible, and then backtracking to determine what caused this pivotal event set.
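The reconstruction steps above can be sketched as follows: order the alerts by time stamp, move forward to the first event where the problem is clearly visible, then backtrack through a window of earlier events. The event fields, the numeric severity scale, and the fixed lookback window are all hypothetical choices made for this illustration; an actual alert log will have its own format.

```python
from datetime import datetime, timedelta


def reconstruct(events, pivotal_severity, lookback):
    """Find the first clearly visible problem and the events leading up to it."""
    # Accurate time stamps matter here: everything depends on correct ordering.
    ordered = sorted(events, key=lambda e: e["time"])
    # Move forward to the first event at or above the chosen severity.
    pivot = next((e for e in ordered if e["severity"] >= pivotal_severity), None)
    if pivot is None:
        return None, []
    # Backtrack: collect the events in the window just before the pivot.
    window_start = pivot["time"] - lookback
    precursors = [e for e in ordered if window_start <= e["time"] < pivot["time"]]
    return pivot, precursors


# Assumed sample log, deliberately out of order.
t0 = datetime(2024, 1, 1, 12, 0, 0)
events = [
    {"time": t0 + timedelta(seconds=90), "severity": 5, "msg": "packet loss threshold exceeded"},
    {"time": t0 - timedelta(seconds=600), "severity": 1, "msg": "routine configuration saved"},
    {"time": t0 + timedelta(seconds=40), "severity": 3, "msg": "queue depth high on node 3"},
    {"time": t0, "severity": 2, "msg": "retransmissions rising on trunk A"},
]
pivot, precursors = reconstruct(events, pivotal_severity=5, lookback=timedelta(seconds=120))
```

With a two-minute lookback, the sketch returns the packet loss alert as the pivotal event and the retransmission and queue depth alerts as its precursors, while the much earlier configuration message falls outside the window.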
The problem with a reconstruction approach is that it may be time consuming. When a problem is "hard," meaning that its state persists even when the network is in failure mode, status analysis may be a better approach. Examples of this sort of problem are a trunk failure (fiber cut) or a node fault (power, equipment).
Soft problems like congestion are usually attributable either to an abnormal line/node condition (that may be causing retransmission and loss of effective throughput, for example) or to excessive load. The latter may be caused by unexpected traffic or even by an attempted security breach, virus, etc. Where the problem is caused by abnormal resource behavior, isolating and fixing the failing resource is the goal. Where it is caused by traffic overload, it may be necessary to flow-manage some sources or, in the case of security violations, cut them off.
Hard problems should be diagnosed remotely to the greatest extent possible. For line problems, this means testing from both device endpoints, and for device problems testing each interface to determine the scope of the problem. The facilities to perform these tests vary widely; use whatever is available to its fullest extent.
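One way to read the remote test results is sketched below. The classification labels and the rule that a device with every interface failing is treated as a device-wide fault are assumptions made for illustration; the pass/fail inputs stand in for whatever test facilities are actually available.

```python
def classify_device_fault(interface_ok):
    """Judge fault scope from per-interface results (name -> True if test passed)."""
    failed = sorted(name for name, ok in interface_ok.items() if not ok)
    if not failed:
        return "no fault found", failed
    if len(failed) == len(interface_ok):
        return "device-wide", failed  # every interface fails: suspect the node itself
    return "interface-level", failed  # only some fail: suspect those ports or lines


def classify_line_fault(ok_from_a, ok_from_b):
    """Combine the results of testing one line from both device endpoints."""
    if ok_from_a and ok_from_b:
        return "passes from both ends"
    if not ok_from_a and not ok_from_b:
        return "fails from both ends"
    return "fails from end A only" if not ok_from_a else "fails from end B only"


# Assumed example results with hypothetical interface names.
scope, failed = classify_device_fault({"ge-0/0": True, "ge-0/1": False, "ge-0/2": True})
line_result = classify_line_fault(ok_from_a=False, ok_from_b=True)
```

The value of recording results this way is that the scope of the fault is settled before anyone is dispatched.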
Dispatching someone to a location to fix a device, or to reconfigure or restart it manually, is a last resort, since this will be costly and will also prolong the period of abnormal network operation. Where it is necessary, be sure to fully isolate the device first, then undertake the repair, recommission and test, and only then bring the device back to operational status. In some extreme cases it may be necessary to bring devices or trunks back online during off periods to avoid adaptive reconfiguration and performance instability.
Any time a problem must be managed, it must be carefully logged. Effective network operations depends on analyzing the response to problems and tuning the procedures so that downtime and cost are minimized if the problem recurs.
About the author: Tom Nolle is president of CIMI Corporation, a strategic consulting firm specializing in telecommunications and data communications since 1982. He is a member of the IEEE, ACM and the IPsphere Forum, and the publisher of Netwatcher, a journal in advanced telecommunications strategy issues. Tom is actively involved in LAN, MAN and WAN issues for both enterprises and service providers and also provides technical consultation to equipment vendors on standards, markets and emerging technologies.