Network management: You can't live without it, but it's never one of the telecom issues that generates all the buzz. Check old attitudes about network management at the door this year and get ready to pay a lot more attention to it, as network use and requirements change radically.
In this e-Guide, check out what's driving radical change in traditional network management (think consumer rather than enterprise use), how Web 2.0, SOA, IMS and other application-heavy delivery platforms are changing the management requirements, how to troubleshoot a multimedia network to ensure capacity and performance, and the need for "systemic" network monitoring. There's a lot of change to absorb, and the time to put changes into place is now.
Network management: Three drivers for radical change by Tom Nolle
As long as there has been public networking, there has been network management, and that's the good, the bad, and the good news again. The first piece of good news is that in 2008 and beyond, that will still be the case. The bad news is that network management is going to change radically in that same period. The second piece of good news, however, is that these changes can be managed if they're addressed correctly. Just as there are three distinct pieces of "news" there are also three major drivers of change.
First driver: Consumerization.
The first major change in network management is driven by a major change in network mission. Network operators have tended to focus on "convergence" as the driver for change, but the real driver is a different "c" word: "consumerization." Ten years ago, enterprises purchased the only broadband connections, whereas today the number of consumer broadband connections is 10 to 40 times the number of enterprise connections, depending on the market area.
Consumerization creates a management problem for two reasons -- scale and literacy. Obviously, multiplying the number of broadband users by a factor of 10 or more would likely multiply management demands similarly. That increase would threaten to explode operations and administration costs that for most operators are already three to four times capex as a percent of sales. When the increase in the number of connections is due to the introduction of broadband services to users with low technology literacy -- which certainly describes the consumer -- you create an even more alarming level of cost risk.
The only solution to controlling operations costs is to automate more operations. That requirement has created the largest challenge for the network management world, because in order to automate consumer broadband operations, you must take an outside-in view of network management. Why? No network operator would ever accept a service architecture that maintained individual consumer awareness (what network practitioners would call "state") inside the network. Consumers are managed in aggregate on the network, but they must be supported as individuals when they call for service.
The consumerization shift demands that network management be linked to a service management process that maintains customer-specific information and ties each customer's services to the network resources that fulfill them. Consumer broadband services, whether Internet, VoIP or IPTV, generate more customer care events than PSTN services, so it is critical that customer care have direct links not only to billing data for inquiries and subscription data for service information, but also to network data.
Some might call this requirement an example of customer-focused service monitoring, but the issue is much more complex. Service automation cannot be performed unless there is a machine-readable template to describe the service and to control how the service experience is created from network resources. A service management linkage to network management can, in effect, stand in for craft personnel making manual changes, and provides the most reliable form of automated provisioning/commissioning of services, modifications to existing services, and processing terminations.
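As a rough illustration of what a machine-readable service template might look like, consider the Python sketch below. Every name in it (ServiceTemplate, ResourceRequirement, provisioning_plan, the resource kinds and their parameters) is hypothetical and not drawn from any TMF, Broadband Forum or vendor model; it only shows how a template could be expanded into ordered network actions that stand in for manual craft changes.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ResourceRequirement:
    """One network resource the service needs (hypothetical schema)."""
    kind: str          # e.g. "access_port", "vlan", "qos_profile"
    parameters: dict   # settings to apply to that resource


@dataclass
class ServiceTemplate:
    """Machine-readable description of a service (illustrative only)."""
    name: str
    requirements: list = field(default_factory=list)


def provisioning_plan(template, customer_id, action="activate"):
    """Expand a template into ordered, per-customer network actions.

    Terminations are emitted in reverse order so that dependent
    resources are released before the ones they rely on.
    """
    steps = [
        {"customer": customer_id, "action": action,
         "resource": req.kind, "parameters": dict(req.parameters)}
        for req in template.requirements
    ]
    return list(reversed(steps)) if action == "terminate" else steps


# Invented example service: the resource kinds and values are placeholders.
iptv = ServiceTemplate("consumer-iptv", [
    ResourceRequirement("access_port", {"rate_mbps": 20}),
    ResourceRequirement("vlan", {"id": 310}),
    ResourceRequirement("qos_profile", {"class": "video"}),
])

for step in provisioning_plan(iptv, customer_id="C-1001"):
    print(step["action"], step["resource"], step["parameters"])
```

The same template drives activation, modification and termination, which is the point of keeping the service description separate from the network commands that realize it.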
Management vendors and standards groups, including IBM and Oracle in the former group, and the Telemanagement Forum and Broadband Forum in the latter, have been working to develop this new service-to-network relationship, and most of the pieces are already in place. They will come together convincingly in 2008.
Second driver: Increased computer technology use.
A second issue for service provider network management is the increased use of computer technology to host service features, content and applications used in service delivery. The most familiar architecture to support this transition from network-based to hosted features is the IP Multimedia Subsystem (IMS) of the 3GPP, ETSI and ITU. IMS creates a service layer that controls the network through a resource and admission control layer, known as the Resource and Admission Control Subsystem (RACS) in ETSI's work and the Resource and Admission Control Functions (RACF) in the ITU's. Not only does this create a break between the customer experience and the network by creating an abstraction layer, it also creates a new set of session and application resources that have to be managed. The network management system of the future will increasingly be a computer, software, and network management system and not just an NMS.
The introduction of servers, software, content and applications into networks creates a need for a more flexible model of network components and services. Work in this area is just beginning in the ITU and the TMF. The modeling process in TMF comes under the heading of the Service Delivery Framework team and is likely to produce its first results in 2008. Since a flexible management model for a complex service-driven network is critical for any management system to be effective, management buyers will want to watch this activity and the progress of their vendors in supporting the outcome.
Third driver: The rise of Ethernet convergence.
The third key issue in network management also involves abstraction, but this time it's network technology and not services that are the target. The view that "convergence" always meant "on IP" is becoming less universally held as Ethernet technology (via PBB-TE or PBT, as it is more commonly known) makes headway as the framework for network traffic management and the basis for legacy converged services like frame relay and ATM.
Vendors are also beginning to converge metro optics and Ethernet onto a multi-layer flexible infrastructure. All of this means that network traffic management will benefit from an abstract view (establishing point-to-point connections at the traffic management level and then impressing those connections downward onto whatever infrastructure is present). That same abstract view is encouraged by the emergence of independent control planes such as GMPLS, which allow management systems to create routes independent of what kind of equipment might ultimately create the network representation of the route.
Abstraction as common thread
The common theme in all of the drivers mentioned here is the need for abstraction, and even though that might be comforting on the surface, it is potentially a problem. All three of these drivers are operating independently and at different levels of the network process. I've presented them from the top or business-process level to the bottom or network-hardware level. A common mechanism for modeling all of this could prevent wasted effort harmonizing multi-layer models later on, but this seems unlikely to emerge from the current standards processes. The management systems vendors themselves may be the drivers of "abstraction/modeling convergence," and that may be the differentiator of the greatest value in 2008 and beyond.
Network management more important due to SOA by David Jacobs
Network management is even more important these days with the emergence of SOA, Web 2.0, IMS and other application-heavy service delivery platforms. Service-oriented architecture (SOA), in particular, has been widely adopted for its ability to reduce the time, expense and risk of developing and deploying new software applications. But the benefits of SOA also bring increased requirements on the network, network management and network operations staff.
With SOA, each application is implemented using many individual software components. Each component carries out a single aspect of the application.
The payoff is that since each component carries out one action and one only, each is quick to implement and quick to test. Components can be reused across different applications when the same action -- such as accessing a particular type of data or verifying user credentials -- is required in multiple applications.
SOA enables rapid response to changing load. As the requirement for different types of transaction varies over the course of a month or even a day, SOA management software can start up additional copies of individual components on servers with unutilized capacity.
The line between application management and network management is becoming less distinct. Individual management packages that address one or the other must be replaced by packages that view the entire environment -- the application and the underlying network -- in a unified manner. Software products to address these needs are becoming available from vendors such as HP, CA and Progress Software as they acquire and integrate software from smaller vendors focused on one aspect of the problem.
Familiar measures of network performance, such as throughput, are less relevant in an SOA environment. In a traditional application environment, a user transaction typically results in a small number of data transfers between the user workstation and an application. With SOA, a single user transaction generates a very large number of interactions among the components carrying out the transaction. Each interaction includes only a small amount of data. Management tools designed to measure byte counts and packet rates become less relevant when no single data transfer is of sufficient duration to make a throughput measure meaningful.
Overall transaction rate and responsiveness are what matters. Productivity is measured by how rapidly user transactions are completed. Data rates and the time required for each interchange between components are a factor in transaction rate -- but only one factor. Management software must be able to detect problems at the application level and then be able to drill down to find the root of the problem.
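A minimal Python sketch of this application-level view follows. The record layout, component names and timings are invented for illustration, and the sketch assumes the interactions within a transaction run one after another, so their times simply add up; a real SOA deployment with parallel calls would need a more careful model.

```python
from collections import defaultdict

# Each record is one component-to-component interaction within a user
# transaction: (transaction_id, component, elapsed_ms). Values are made up.
interactions = [
    ("t1", "auth", 4.0), ("t1", "catalog", 6.5), ("t1", "pricing", 41.0),
    ("t2", "auth", 3.5), ("t2", "catalog", 7.0), ("t2", "pricing", 38.5),
]


def transaction_times(records):
    """Total elapsed time per user transaction (the application-level view)."""
    totals = defaultdict(float)
    for txn, _component, elapsed in records:
        totals[txn] += elapsed
    return dict(totals)


def slowest_component(records):
    """Drill down: the component contributing the most time overall."""
    per_component = defaultdict(float)
    for _txn, component, elapsed in records:
        per_component[component] += elapsed
    return max(per_component, key=per_component.get)


print(transaction_times(interactions))
print(slowest_component(interactions))
```

No single interaction here moves enough data for a throughput figure to mean much, yet the per-transaction totals and the drill-down still point at where the time is going.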
SOA's ability to add capacity as the load varies means that network traffic patterns can vary from minute to minute. Management software must be able to detect problems and react to them quickly. It isn't sufficient to produce a report and then wait for an operator to make necessary changes. Management software must enable operators to define specific events such as congestion and specify actions for the software to take when it detects those events.
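The idea of operator-defined events paired with automatic actions can be sketched in a few lines of Python. The Rule structure, the metric names and the thresholds below are assumptions made for illustration, not the interface of any management product, and the "actions" only return a description of what a real system would do.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    """Operator-defined event plus the action to run when it is detected."""
    name: str
    condition: Callable[[dict], bool]   # evaluated against a metrics sample
    action: Callable[[dict], str]       # returns a description of the action


def evaluate(rules, sample):
    """Run every rule whose condition matches; return the actions taken."""
    return [rule.action(sample) for rule in rules if rule.condition(sample)]


# Placeholder thresholds: real values would come from the operator.
rules = [
    Rule(
        name="link-congestion",
        condition=lambda s: s["utilization"] >= 0.90,
        action=lambda s: f"reroute traffic away from {s['link']}",
    ),
    Rule(
        name="slow-transactions",
        condition=lambda s: s["response_ms"] > 500,
        action=lambda s: f"start extra component instance near {s['link']}",
    ),
]

sample = {"link": "core-1/edge-4", "utilization": 0.93, "response_ms": 120}
print(evaluate(rules, sample))
```

Keeping conditions and actions as data lets operators add or tune rules without waiting for a report-and-respond cycle.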
In addition to adding new requirements, traditional network management tasks increase in importance. Spreading an application across multiple components resident on different systems means that a failure anywhere in the network can bring down the entire application environment. The ability to detect and respond to impending problems becomes even more vital.
Managing security and responding to threats are also increasingly important with SOA. Individual components are designed to carry out a single action and depend on the security environment surrounding the network to protect them from attack. Management software must integrate with intrusion prevention hardware and software to react immediately to threats. Actions such as changing filters on router ports or shutting down links and rerouting traffic must be automated.
The complexity brought about by the large number of individual components and rapid changes in network flows makes it vital for management software to provide clear and understandable views and reports. The vast amount of raw information must be boiled down and only the essentials presented to operators.
Finally, network operations staff must develop a basic understanding of the nomenclature, components and basic structure of SOA. Otherwise they will be unable to view network operations as a whole and manage the network to deliver excellent application performance.
Network monitoring, or the process of measuring network status, is one of the many professional tasks in the service provider business process that is undergoing changes. These changes are driven by "the usual suspects," meaning changes in technology, service makeup and service provider business practices.
One challenge for monitoring is containing its scope. The general network management term FCAPS (Fault, Configuration, Accounting, Performance, and Security management) suggests that monitoring might be involved in each of these areas, an approach that is in fact taken by some vendors. Others view monitoring as purely fault management, or as fault and performance management. The scope of monitoring's role in the FCAPS process is likely determined by the capabilities of the network management system overall.
Most network equipment vendors in the service provider space offer network monitoring tools through their network management systems (NMSs). These tools link to management information bases (MIBs) in the individual devices, and they can obtain device status by reading the MIB variables. This process creates an "atomic" view of network state that is unlikely to directly uncover network problems. However, because NMSs have overall topology knowledge, they can often interpolate this atomic status information into more useful form. Where network monitoring is available through the NMS, it is highly desirable to utilize it fully because of its more "systemic" viewpoint.
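To make the difference between "atomic" and "systemic" status concrete, the Python sketch below combines per-interface status with topology knowledge. The status values mirror what a standard MIB object such as ifOperStatus reports, but the device names, the data layout and the link_state function are hypothetical.

```python
# Per-interface operational status as it might be read from device MIBs
# (for example the standard ifOperStatus object). Sample data is invented.
if_status = {
    ("router-a", "ge-0/0"): "up",
    ("router-b", "ge-0/1"): "up",
    ("router-b", "ge-0/2"): "down",
    ("router-c", "ge-0/0"): "up",
}

# Topology knowledge held by the NMS: which interfaces face each other.
links = {
    "a-b": (("router-a", "ge-0/0"), ("router-b", "ge-0/1")),
    "b-c": (("router-b", "ge-0/2"), ("router-c", "ge-0/0")),
}


def link_state(link_ends, status):
    """Combine the status of both ends into a single link-level state."""
    states = [status.get(end, "unknown") for end in link_ends]
    if all(s == "up" for s in states):
        return "up"
    if all(s == "down" for s in states):
        return "down"
    return "suspect"  # ends disagree or data is missing: worth investigating


for name, ends in links.items():
    print(name, link_state(ends, if_status))
```

Neither device on its own reveals that the b-c link is in a questionable state; it takes the topology to turn two interface readings into a statement about the network.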
Business activity in the network management and monitoring market has included mergers and acquisitions, for example, Computer Associates' acquisition of Concord. These acquisitions are often aimed at creating effective combinations of management and monitoring tools.
A popular example of a systemic monitoring tool is Cisco's NetFlow, which provides a full set of tools for application and user monitoring, accounting management, traffic analysis, etc. These tools are essentially mission-focused, meaning that they are designed to directly support activities of professionals involved in network, service, and customer support.
While not as "systemic" an approach as NetFlow, the modern trend toward end-to-end monitoring and measurement is a step in the right direction in terms of providing a linkage between network monitoring and service or customer experience. More and more service standards are being enhanced to include end-to-end path status monitoring.
In the IP/MPLS space, RFC 3429 and, for Ethernet, IEEE 802.3ah (now IEEE 802.3 Clause 57) provide for end-to-end monitoring and fault localization. These capabilities are not yet fully supported in networks, but as they roll out they will likely change the focus of network monitoring toward something more customer-experience-based. This shift will be welcomed by the networking professional, who is increasingly forced to interpolate service experience based on network status.
Device-level monitoring (SNMP, for example) is still a valuable tool for network professionals, but is more and more likely to be associated with problem resolution or the late stages of problem isolation. Device-level monitoring tools (and even interface or board-level MIBs) are valuable in developing specific information on performance of devices, but are cumbersome as tools in fault isolation because they can't easily link device conditions to network or service behavior. In part, this is because they don't distinguish between traffic for various services and users.
Lower-level diagnosis may require a more refined tool set, and this is ironically available both as an extremely low-level and technical tool and as a high-level system tool. Application-aware monitoring is clearly the most significant trend in network monitoring simply because it focuses performance and status analysis on the relationships the network is committed to support, and leads from there to identification of resources. There is some collision between the low- and high-level approaches in a marketing sense, but in fact they are often complementary for service providers.
Low-level monitoring is normally associated with the use of smart probes that can be parameterized to detect traffic based on packet inspection. The IETF RMON specification offers a standard means of providing this, but proprietary strategies are also available from vendors such as NetScout or Network General. The purpose of remote monitoring in any form is to give a network professional a level of access similar to that offered by a local protocol analyzer. Network General's approach is in fact derived from its "Sniffer" analyzer product, now supported as a remote tool.
High-level application awareness seeks to accomplish much the same thing by obtaining data from applications at their point of network connection. This process is easier for an enterprise, where the boundary between IT and networking is vague, than it is for service providers where that boundary is absolute and where crossing it may generate customer concerns about security. However, the use of "managed services" is expanding worldwide, and application-level data is increasingly available in that context. As service providers offer higher-layer services, including hosting and software-as-a-service, the application components are inside the network and fully available for obtaining application statistics.
The best monitoring strategy is normally set by the mission of the professionals involved, but will also depend on the service and network architecture. Cisco's NetFlow is clearly targeted at IP/Internet providers, for example. Monitoring associated with capacity and performance management can focus on device health and resource congestion, largely fault and performance issues. Monitoring associated with customer care must be directed at service, application, or end-to-end behavior. The older MIB-based tools, including SNMP, are becoming less relevant as these service-based missions increase their hold on service provider agendas.
It is important to note that all fault management and much of performance management must ultimately end in some remediation of problems. This may be something that can be handled remotely in some cases (changing parameters on a device, etc.) but in many cases it will involve dispatching field personnel to complete a repair. In the former case the need for NMS integration to facilitate making network parameter changes is obvious, but in the latter case it is often necessary to manage failure-mode operation of the network, which is also an NMS function. Thus, integration of monitoring and management systems for convenient use by network operations center personnel and other craft professionals is likely to increase productivity and improve customer experience.
There is probably no topic in networking that is undergoing more radical change than the topic of service provider network management systems (NMS). Under the paradigms that prevailed before the 1980s, operations support systems (OSS) and network management systems tended to be unified. As packet networking emerged, it created a distinction between the management of the infrastructure (network management) and the management of the business of being a service provider (operations management). The demands of service providers to control operations costs have begun yet another shift -- back to a concept of unified management under a common framework.
The standard framework for network management is defined both by the ITU, as the Telecommunications Management Network or TMN, and the TeleManagement Forum, as the Multi Technology Network Management, or MTNM, initiative. In both, the concept of network management embraces the control of the system of devices (of whatever type, from whatever vendor) that behave cooperatively to offer telecommunications services. Both models place the control of individual devices ("Element Management," meaning device management) below the NMS layer.
NMSs approach the common problem of cooperative resource behavior in two very different ways. Equipment vendors and the ITU standards tend to take a bottom-up approach, viewing a service as being the product of network features. Thus, the key to creating and ensuring services is to control network behavior. This approach is also common among IP/Internet providers because these providers often support services created over a network rather than the creation of services on the network. The TMF and vendors that are active in service creation and management (DSL activation, etc.) view network management as a function exercised from above, playing a role in translating logical service descriptions into network actions.
The distinction here is far from academic. Top-down approaches tend to make service management the goal and network management the tool. This is suited to many of the emerging provider business models, but it collides with the current practices, which have typically evolved from a bottom-up approach and are now focused on the network operations center (NOC).
The primary mission of an NMS for most service providers is to support NOC activity; the support of service management systems' access to network resources is normally a second mission today but one that is likely to dominate in the future. The primary issues in NOC support are:
- Multi-vendor, multi-device access through a uniform interface. Most provider networks include many different devices from a variety of vendors, and learning the specific management interface to each combination is difficult. It is even more difficult to perform provisioning, commission new lines or nodes, or diagnose problems with different management interfaces. A common interface to all devices is mandatory for an effective NOC and is thus the first requirement for an NMS.
- Fault correlation and filtering. One of the major problems that providers face in the NOC is what are sometimes called "alert floods," which are masses of error messages created from a single failure of a key trunk or device that may affect thousands of other service elements. An NMS should provide both mechanisms to correlate such low-level, broad-impact failures with higher-level errors and the ability to suppress those messages in order to prevent operations personnel from being swamped in error messages.
- End-to-end provisioning and management. The purpose of "network management" is to manage cooperative behavior among devices, but there are relatively few products that actually provide that capability. Most equipment vendor NMS products work only with their own devices, and few products offer complete multi-vendor support. Australia's incumbent carrier, Telstra, selected Alcatel's Cross Domain Manager product for this multi-vendor, end-to-end support, even though the devices being managed were primarily from other vendors.
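The fault correlation and filtering requirement in the list above can be sketched as follows in Python. The dependency map, alarm fields and correlate function are illustrative assumptions rather than the behavior of any particular NMS, and the sketch handles only one level of dependency within a single batch of alarms.

```python
# Dependency map: which service elements ride on which trunk or device.
# Names and structure are illustrative, not taken from any NMS product.
depends_on = {
    "vpn-101": "trunk-7",
    "vpn-102": "trunk-7",
    "iptv-9": "trunk-7",
    "vpn-300": "trunk-2",
}


def correlate(alarms, dependencies):
    """Split alarms into root causes and suppressed symptoms.

    An alarm is treated as a symptom when the resource it depends on
    has also raised an alarm in the same batch.
    """
    alarmed = {a["source"] for a in alarms}
    roots, suppressed = [], []
    for alarm in alarms:
        parent = dependencies.get(alarm["source"])
        if parent in alarmed:
            suppressed.append({**alarm, "root_cause": parent})
        else:
            roots.append(alarm)
    return roots, suppressed


alarms = [
    {"source": "trunk-7", "text": "loss of signal"},
    {"source": "vpn-101", "text": "service down"},
    {"source": "vpn-102", "text": "service down"},
    {"source": "iptv-9", "text": "stream errors"},
    {"source": "vpn-300", "text": "high latency"},
]

roots, suppressed = correlate(alarms, depends_on)
print([a["source"] for a in roots])       # what the operator sees
print(len(suppressed), "alarms suppressed")
```

Suppressed alarms are kept, tagged with their presumed root cause, so that customer care can still see which services a trunk failure touched.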
The NMS market, as previously noted, is increasingly turning toward a service management perspective, meaning that the customer care process may escalate a service problem to the NOC, and thus the NMS must have some mechanism for correlating services with network conditions. This problem is more significant than it may sound, particularly for IP services and other services created through end-to-end signaling (IP/MPLS, IP VPNs, for example) rather than by "management stitching." With such services, it is often difficult to determine what resources a service is actually using, and thus to correlate network and service conditions.
IBM, Cisco, Alcatel, Telcordia, and standards bodies such as the TMF are all working on the issue of creating effective links between service processes and NMS capabilities so that NOC personnel can drill down to uncover the network cause of service problems. All of these solutions are evolving, and NMS specialists will want to review the current state of the products available to determine which fits their requirements best at a given time. When making these assessments, keep in mind that vendor-proprietary approaches nearly always evolve faster than standards, but that such approaches may be more costly and may also fail to cover all of the devices and vendors that may eventually be used in a network.
Whatever the specific technology used in today's voice, data and video service provider networks, it is almost certain to be in a packet architecture. Packet networks break information into small pieces -- packets -- which then share transmission resources. This process of sharing makes packet networks more efficient in their use of capacity, given that much of the traffic of networks is bursty in nature. However, it also creates potential collisions of resource demands. Packet networks normally employ a form of topology discovery and adaptive routing designed to improve resiliency, and this can make it difficult to manage network capacity and determine the exact nature of problems when they occur.
Sustaining a multimedia packet network, like any form of network management, is based on the classical FCAPS acronym: Fault, Configuration, Accounting, Performance, and Security, with capacity management normally handled alongside performance. The order of consideration, however, doesn't match the acronym.
Capacity and performance management
Capacity and performance management are the classical preemptive network management tasks, and security management is increasingly being added to the list. The purpose of all three is to create a stable framework for service delivery that will provide customers with good experiences under normal conditions. Each of these is divided into two phases, planning and management.
In the planning phase, network operations uses historical data, estimates for growth in existing services or demand for new ones, and other factors to establish a baseline plan. This plan must include metrics to measure current network behavior against the plan to determine whether the plan's goals are being met, and it is this measurement that is the goal of the network manager.
The key tools for planning in the capacity, performance, and security areas are threshold measurements of certain network conditions, which would typically include the load levels of trunks and devices and the traffic generated at certain interface points. The purpose of these measurements is to test the conditions against the network plan and generate threshold alerts where there are indications of an unexpected condition. These alerts can be indications of an emerging issue (trunk loads may rise to a threshold level indicating that additional resources are required) or of an acute problem (trunk or device loads are approaching the level where performance is impacted). Thus, these planning and management tasks may also generate a need for troubleshooting.
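The two kinds of threshold alert described here, one for an emerging issue and one for an acute problem, might be sketched in Python as follows. The function names and the 60% and 85% threshold values are placeholders chosen for illustration; real values would come from the operator's own network plan.

```python
def classify_load(utilization, planning_threshold=0.60, acute_threshold=0.85):
    """Compare a measured load against the plan's two thresholds.

    The default threshold values are placeholders; real ones come from
    the operator's capacity plan.
    """
    if utilization >= acute_threshold:
        return "acute"      # performance at risk: hand off to troubleshooting
    if utilization >= planning_threshold:
        return "planning"   # emerging issue: more capacity may be needed
    return "normal"


def threshold_alerts(samples, **thresholds):
    """Return (resource, level) for every sample that breaches a threshold."""
    alerts = []
    for resource, utilization in samples.items():
        level = classify_load(utilization, **thresholds)
        if level != "normal":
            alerts.append((resource, level))
    return alerts


# Invented trunk utilization samples, as fractions of capacity.
trunk_loads = {"trunk-1": 0.42, "trunk-2": 0.67, "trunk-3": 0.91}
print(threshold_alerts(trunk_loads))
```

A "planning" alert feeds the next revision of the capacity plan, while an "acute" alert is the hand-off point into troubleshooting.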
Good capacity and performance management plans and monitoring also recognize that the network may not always be operating in its optimum state, and that alternate states that reflect the balance of business goals should be defined to respond to failures or congestion. These are often called "failure modes." Failure mode planning is critical for multimedia networks that serve voice, data, and video because these three media are likely to have major differences in economic value and tolerance to out-of-specification network operations. One of the challenges of networks with fully adaptive behavior (routing, spanning tree bridging) is that they may not support explicit failure modes, but simply adapt to conditions in a variable way.
Troubleshooting and problem isolation
Where an acute load condition is detected or where a fault is reported, the process shifts to the fault management area, popularly called "troubleshooting" or "problem isolation." The primary purpose of fault management, simply put, is to fix the problem. An equally important goal is to sustain network operation in a valid failure mode while the problem is resolved.
When a fault occurs, the network operations personnel should first ensure that the network has entered a valid failure mode state, meaning that the remaining network resources have been allocated according to business priorities. Video is usually the form of traffic with the greatest economic value and performance sensitivity, and so failure modes should ensure video performance and perhaps block new video delivery requests to ensure the current ones are sustained. Voice can be treated similarly. Most data applications are tolerant to some delay and packet loss associated with the resource congestion likely in failure modes, so this can often be made the lowest priority.
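One simple way to express such a failure mode is a strict-priority allocation of whatever capacity survives the fault, sketched below in Python. The function, the traffic figures and the strict-priority policy are all assumptions made for illustration; an operator might instead guarantee each class a minimum share, and the blocking of new video requests mentioned above is not modeled.

```python
def allocate_failure_mode(capacity_mbps, demands, priority):
    """Share the surviving capacity among traffic classes by business priority.

    demands:  {traffic_class: requested_mbps}
    priority: traffic classes from most to least important
    Returns {traffic_class: granted_mbps}.
    """
    remaining = capacity_mbps
    granted = {}
    for traffic_class in priority:
        share = min(demands.get(traffic_class, 0), remaining)
        granted[traffic_class] = share
        remaining -= share
    return granted


# Illustrative numbers: a trunk failure leaves 600 Mbps for 900 Mbps of demand.
demands = {"video": 400, "voice": 100, "data": 400}
plan = allocate_failure_mode(600, demands, priority=["video", "voice", "data"])
print(plan)
```

With these figures, video and voice are carried in full and data absorbs the whole shortfall, which matches the ordering suggested in the text.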
Once the network has settled on a failure mode, the network operations troubleshooting process is directed at finding the underlying problem. This can be approached either based on reconstruction or status analysis.
Reconstruction means simply going back through alert messages to find the early period of the failure and then analyzing the events as they unfold. This approach requires a log of network events with very accurate time stamps on alert messages to ensure they are processed in sequence. This approach is often the only way to find problems that arise from "soft" faults like congestion. It normally involves moving forward to the point where the problem is clearly visible, and then backtracking to determine what caused this pivotal event set.
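The reconstruction approach can be sketched in Python as sorting the alert log by timestamp and then backtracking from the pivotal event. The log entries, the tuple layout and the function names are invented for illustration, and the sketch assumes ISO 8601 timestamps taken from synchronized clocks.

```python
from datetime import datetime

# Alert log as it might arrive: not necessarily in time order. Invented data.
log = [
    ("2008-01-15T10:02:07.120", "edge-4", "packet loss above threshold"),
    ("2008-01-15T10:01:58.400", "trunk-7", "retransmissions rising"),
    ("2008-01-15T10:02:11.900", "vpn-101", "service degraded"),
    ("2008-01-15T10:01:40.050", "trunk-7", "bit errors detected"),
]


def reconstruct(events):
    """Order alerts by timestamp so they can be replayed in sequence."""
    return sorted(events, key=lambda e: datetime.fromisoformat(e[0]))


def events_leading_to(events, pivot_text):
    """Return the alerts that precede the first alert containing pivot_text."""
    ordered = reconstruct(events)
    for index, (_ts, _source, text) in enumerate(ordered):
        if pivot_text in text:
            return ordered[:index]
    return []


for ts, source, text in events_leading_to(log, "service degraded"):
    print(ts, source, text)
```

If the device clocks disagree, the sort silently produces the wrong sequence, which is why accurate time stamps matter so much to this method.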
The problem with a reconstruction approach is that it may be time consuming. When a problem is "hard," meaning that its state persists even when the network is in failure mode, status analysis may be a better approach. Examples of this sort of problem are a trunk failure (fiber cut) or a node fault (power, equipment).
Soft problems like congestion are usually attributable either to an abnormal line/node condition (that may be causing retransmission and loss of effective throughput, for example) or to excessive load. The latter may be caused by unexpected traffic or even by an attempted security breach, virus, etc. Where the problem is caused by abnormal resource behavior, isolating and fixing the failing resource is the goal. Where it is caused by traffic overload, it may be necessary to flow-manage some sources or, in the case of security violations, cut them off.
Hard problems should be diagnosed remotely to the greatest extent possible. For line problems, this means testing from both device endpoints, and for device problems testing each interface to determine the scope of the problem. The facilities to perform these tests vary widely; use whatever is available to its fullest extent.
Dispatching someone to a location to fix a device or reconfigure or restart it manually is a last resort, since this will both be costly and prolong the period of abnormal network operation. Where it is necessary, be sure to fully isolate the device first, then undertake the repair, recommission and test, and only then bring the device back to operational status. In some extreme cases it may be necessary to bring devices or trunks back online during off periods to avoid adaptive reconfiguration and performance instability.
Any time a problem must be managed, it must be carefully logged. Effective network operations depend on analyzing the response to problems and tuning the procedures so that downtime and cost are minimized if the problem recurs in the future.
About the authors:
Tom Nolle is president of CIMI Corporation, a strategic consulting firm specializing in telecommunications and data communications since 1982. He is a member of the IEEE, ACM, Telemanagement Forum, and the IPsphere Forum, and the publisher of Netwatcher, a journal in advanced telecommunications strategy issues. Tom is actively involved in LAN, MAN and WAN issues for both enterprises and service providers and also provides technical consultation to equipment vendors on standards, markets and emerging technologies. Check out his SearchTelecom networking blog Uncommon Wisdom.
David B. Jacobs of The Jacobs Group has more than 20 years of networking industry experience. He has managed leading-edge software development projects and consulted to Fortune 500 companies, as well as software startups.