How can you tell that a network is saturated?
How can you identify the cause and fix it quickly?
IT networks are designed to transmit data as fast as possible. Modern protocols such as TCP adapt their sending rate to what the network can carry. In general, these mechanisms are sufficient to manage traffic flows properly. Sometimes, however, poor network design or an unforeseen event causes a sudden overload at a specific point.
A saturated network link is a link that is asked to transmit more frames than its physical medium can carry. For example, a router port may have to send 110 Mbps (megabits per second) over a 100BASE-TX link, which is limited to 100 Mbps.
At this point, the router starts placing the excess packets in a local buffer. The buffer is emptied once saturation ends and the link can again carry the traffic.
If congestion persists long enough to fill the buffer, the router has no choice but to discard packets (“drop”). It can optionally notify the sender through various mechanisms (TCP ECN, or the ICMP Source Quench message, which is now deprecated).
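The buffering and dropping behaviour described above can be illustrated with a minimal sketch. It assumes a single FIFO buffer with simple tail drop and one-second time steps; the function name and the buffer size are hypothetical values chosen for illustration, not figures from any real router.

```python
def simulate_tail_drop(arrival_mbps, link_mbps, buffer_mbit, seconds):
    """Simulate a FIFO output buffer in one-second steps.

    Returns (queue_mbit, dropped_mbit) after `seconds` seconds.
    All parameters are illustrative assumptions.
    """
    queue = 0.0
    dropped = 0.0
    for _ in range(seconds):
        queue += arrival_mbps              # traffic offered during this second
        queue -= min(queue, link_mbps)     # what the link manages to send
        if queue > buffer_mbit:            # buffer overflow: excess is dropped
            dropped += queue - buffer_mbit
            queue = buffer_mbit
    return queue, dropped


# 110 Mbps offered to a 100 Mbps port with a hypothetical 50 Mbit buffer
queue, dropped = simulate_tail_drop(110, 100, 50, 10)
print(f"queued: {queue} Mbit, dropped: {dropped} Mbit")
```

With these numbers the buffer absorbs the 10 Mbit of excess per second for the first five seconds, after which every additional second of overload results in drops.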
Routers can use Quality of Service (QoS) policies to determine which packets to drop first (best-effort traffic). Other packets have priority and are less affected by congestion.
If congestion occurs on a device less capable than a router (a switch, hub, firewall or Wi-Fi access point), the results can be catastrophic.
Users typically experience congestion as slowness. This perception reflects a change in the effective throughput of the network, which determines the time required to transfer a complete piece of data from one point to another. Effective throughput is not a single measurable quantity; in practice it is determined by three separate indicators:
- latency
- jitter
- loss rate
These parameters have been studied extensively. In summary, a few facts are worth remembering. First, effective throughput is inversely proportional to latency: doubling the latency halves the throughput. Second, jitter, which is the variation of latency over time, affects throughput through its effect on latency. Finally, the theoretical maximum throughput is inversely proportional to the square root of the loss rate.
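These relationships match the well-known approximation for TCP throughput published by Mathis et al., which bounds throughput by C × MSS / (RTT × √p). The sketch below assumes that model, with C = √(3/2); the function name and the sample values are illustrative, and real TCP stacks will deviate from this simplified bound.

```python
import math


def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate, c=math.sqrt(3 / 2)):
    """Approximate upper bound on TCP throughput, in bits per second.

    Assumes the Mathis et al. model: C * MSS / (RTT * sqrt(p)).
    """
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]")
    if rtt_s <= 0:
        raise ValueError("rtt_s must be positive")
    return c * mss_bytes * 8 / (rtt_s * math.sqrt(loss_rate))


base = mathis_throughput_bps(1460, 0.050, 0.0001)
double_latency = mathis_throughput_bps(1460, 0.100, 0.0001)
four_times_loss = mathis_throughput_bps(1460, 0.050, 0.0004)

print(f"baseline:        {base / 1e6:.1f} Mbit/s")
print(f"latency x2:      {double_latency / 1e6:.1f} Mbit/s")
print(f"loss rate x4:    {four_times_loss / 1e6:.1f} Mbit/s")
```

Doubling the round-trip time halves the bound, and so does multiplying the loss rate by four, which is the square-root relationship stated above.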
So, when congestion occurs, latency increases because buffers are in use, jitter increases for the same reason, and the loss rate is no longer zero. Effective throughput is directly affected.
The symptoms of congestion give us objective indicators with which to characterize it. Measuring the effective throughput between a user and a server is very difficult. Measuring latency, jitter or loss rate, on the other hand, is feasible.
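As a minimal sketch of how these three indicators can be derived from active probes (for example, a series of pings), the function below takes a list of round-trip times in milliseconds, with `None` marking a lost probe. The function name and input format are assumptions for illustration. Jitter has several definitions in practice; this sketch uses the mean absolute difference between consecutive received samples.

```python
from statistics import mean


def summarize_probes(rtts_ms):
    """Compute latency, jitter and loss rate from probe results.

    rtts_ms: list of round-trip times in ms, None for a lost probe.
    """
    if not rtts_ms:
        raise ValueError("at least one probe is required")
    received = [r for r in rtts_ms if r is not None]
    loss_rate = 1 - len(received) / len(rtts_ms)
    latency = mean(received) if received else None
    # Simplification: samples on either side of a lost probe
    # are treated as consecutive.
    diffs = [abs(b - a) for a, b in zip(received, received[1:])]
    jitter = mean(diffs) if diffs else 0.0
    return {"latency_ms": latency, "jitter_ms": jitter, "loss_rate": loss_rate}


print(summarize_probes([10.0, 12.0, None, 11.0, 15.0]))
```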
The utilization of a device's network interfaces is another objective indicator that can be measured. Most devices expose it via SNMP.
Finally, some routers and switches also report how full their buffers are.
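In practice, SNMP usually exposes octet counters (such as `ifHCInOctets` and `ifHCOutOctets` in the IF-MIB) rather than a ready-made percentage, so utilization is computed from two successive readings. The sketch below assumes the counter values and the interface speed have already been collected by some poller, which is not shown; the function name is hypothetical.

```python
def link_utilization(octets_t0, octets_t1, interval_s, if_speed_bps, counter_bits=64):
    """Return link utilization as a fraction (1.0 means 100%).

    octets_t0, octets_t1: two successive readings of an octet counter.
    interval_s: seconds elapsed between the two readings.
    if_speed_bps: nominal interface speed in bits per second.
    counter_bits: counter width (32 or 64); used to absorb one wrap.
    """
    if interval_s <= 0 or if_speed_bps <= 0:
        raise ValueError("interval_s and if_speed_bps must be positive")
    delta_octets = (octets_t1 - octets_t0) % (2 ** counter_bits)
    return delta_octets * 8 / (interval_s * if_speed_bps)


# 600,000,000 octets sent in 60 s on a 100 Mbps interface
print(f"{link_utilization(0, 600_000_000, 60, 100_000_000):.0%}")
```

The modulo handles a single counter wrap between two polls. With 32-bit counters on fast links, several wraps can occur within one interval, which is one reason to prefer the 64-bit counters where available.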
By measuring these indicators at regular intervals, it is possible to detect congestion. Moreover, with a short polling interval, we can detect micro-congestion that lasts only a few minutes.
Ideally, these data are presented on a map so that the situation can be understood at a glance.
When a link becomes congested, it is usually the result of a gradual increase in load. The increase can, however, be fast (a few minutes). Monitoring the indicators listed above is the best way to anticipate congestion. Indeed, when a link is used at 80%, latency and jitter have already deteriorated. By setting alert thresholds, potential congestion can be detected ahead of time.
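A threshold check of this kind can be sketched in a few lines. The 80% utilization threshold comes from the text above; the latency, jitter and loss thresholds are hypothetical placeholders that would need to be tuned for each network, and the function and key names are illustrative.

```python
# Only the 0.80 utilization value comes from the text;
# the other thresholds are assumed example values.
THRESHOLDS = {
    "utilization": 0.80,
    "latency_ms": 50.0,
    "jitter_ms": 10.0,
    "loss_rate": 0.01,
}


def check_thresholds(sample, thresholds=THRESHOLDS):
    """Return the sorted names of indicators at or above their threshold."""
    return sorted(
        name
        for name, limit in thresholds.items()
        if sample.get(name) is not None and sample[name] >= limit
    )


sample = {"utilization": 0.85, "latency_ms": 20.0, "jitter_ms": 12.0, "loss_rate": 0.0}
print(check_thresholds(sample))
```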
Managing congestion
The possible actions depend on the cause of the congestion. If congestion is generated by a flow that is unusual in its volume or in its timing, it is possible to stop this flow. To identify the type of flow, NetFlow / sFlow can be useful. On this subject, see our document on flow monitoring. The following example shows the output of a NetFlow / sFlow analysis module.
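Independently of any particular analysis tool, the idea of finding the flow responsible for congestion can be sketched as a "top talkers" aggregation. The record format below (dictionaries with `src`, `dst`, `dst_port` and `bytes` keys) is a simplified assumption for illustration, not the actual NetFlow or sFlow export format, and the addresses are documentation examples.

```python
from collections import Counter


def top_talkers(flows, n=3):
    """Aggregate bytes per (src, dst, dst_port) and return the n largest."""
    totals = Counter()
    for flow in flows:
        key = (flow["src"], flow["dst"], flow["dst_port"])
        totals[key] += flow["bytes"]
    return totals.most_common(n)


# Hypothetical flow records, already decoded by a collector
flows = [
    {"src": "192.0.2.10", "dst": "198.51.100.5", "dst_port": 443, "bytes": 1_200_000},
    {"src": "192.0.2.20", "dst": "198.51.100.9", "dst_port": 873, "bytes": 90_000_000},
    {"src": "192.0.2.10", "dst": "198.51.100.5", "dst_port": 443, "bytes": 800_000},
    {"src": "192.0.2.20", "dst": "198.51.100.9", "dst_port": 873, "bytes": 60_000_000},
]

for (src, dst, port), total in top_talkers(flows):
    print(f"{src} -> {dst}:{port}  {total / 1e6:.1f} MB")
```

A flow whose volume dwarfs the others, as the second one does in this sample, is a natural first candidate to investigate.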
If congestion is the result of poor network design, the problem cannot be resolved immediately. As with road infrastructure, a comprehensive study is needed. Some mapping tools can be very effective. On the map shown above, we can quickly see where the congestion is and which type of flow is causing it.
On this map, we can also detect degraded operation of the network, such as, here, poor load distribution between two redundant links. This scenario is very common and has multiple possible causes: spanning tree, HSRP or LACP behavior, incorrect trunk configuration, etc.