Sometimes network failures can be detected in a straightforward manner. We monitor physical switches, routers, optical IDC equipment, their components and transmission media. We see CPU load, memory/buffer overflows, CRC errors, errors on the wire, etc. This direct monitoring gives us an actionable course to pursue when problems occur.
Of course, networks suffer many failures that do not show up in common monitoring or packet-capture-based tools. Let’s call these green-light failures: things go wrong while every monitoring system shows green.
Google reports that, in their networks, management actions cause most green-light failures. These might include control plane or switch software upgrades, topological changes, WAN/IDC errors when bringing a data center online, draining and undraining workloads, etc. While these causes may be obvious, the mechanics of the underlying failure are not. We know the external cause. We don’t have insight into where things went wrong.
This is where intent-based monitoring comes into play. If a failure leaves no telltale signs (red lights), then the only indication of failure is the observation that something is not working as it was intended to. The obvious approach is to define the designer’s intent with some specificity, and then use an automated tool to monitor systems against that statement of intent.
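To make the idea concrete, here is a minimal sketch of checking observed behavior against a statement of intent. Everything in it is an assumption made for illustration: the `ReachabilityIntent` type, the `find_deviations` helper, the host names, and the idea that some probe supplies the `observed` measurements. A real intent language would cover far more than reachability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReachabilityIntent:
    """Hypothetical intent statement: src should (or should not) reach dst."""
    src: str
    dst: str
    expected: bool


def find_deviations(intents, observed):
    """Return the intents that the observed state does not satisfy.

    `observed` maps (src, dst) to True/False as measured by some probe
    (assumed here). A missing measurement counts as a deviation, because
    the intent cannot be confirmed.
    """
    return [i for i in intents if observed.get((i.src, i.dst)) != i.expected]


# Illustrative intent: web servers reach the database, guests do not.
intents = [
    ReachabilityIntent("web-1", "db-1", True),
    ReachabilityIntent("guest-1", "db-1", False),
]

# Illustrative measurements: every device reports healthy, yet web-1 cannot reach db-1.
observed = {("web-1", "db-1"): False, ("guest-1", "db-1"): False}

deviations = find_deviations(intents, observed)
for d in deviations:
    print(f"intent violated: {d.src} -> {d.dst}, expected reachable={d.expected}")
```

The point of the sketch is that the alert comes from the mismatch between intent and behavior, not from any device-level error counter.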
With most networks, the designer’s intent is not specified; it can only be surmised manually through inspection of configuration, system state and observed phenomena. This is the hard part. It is a continuous and difficult process of forensic investigation, carried out on a system that continues to operate (hopefully).
Network designers should be empowered to specify their complete intent for all system components. That specification should be checked through simulation before being pushed into an operational environment. Then the operational system should be monitored for any deviation of function from the designer’s intent.
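As a rough illustration of checking a change in simulation before pushing it, the sketch below applies a proposed link drain to a copy of a toy topology and reports which intents would be violated. The topology, the node names, the `(src, dst, expected)` intent tuples and the helpers `reachable` and `simulate_change` are all hypothetical. A real simulator would model routing policy and forwarding state rather than plain graph connectivity.

```python
from collections import deque


def reachable(links, src, dst):
    """Breadth-first search over an undirected collection of (a, b) links."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def simulate_change(links, drained, intents):
    """Apply a proposed drain to a copy of the topology; return violated intents.

    Links are normalized to sorted tuples so ("a", "b") and ("b", "a") match.
    """
    normalize = lambda link: tuple(sorted(link))
    proposed = {normalize(l) for l in links} - {normalize(l) for l in drained}
    return [(s, d) for (s, d, expected) in intents
            if reachable(proposed, s, d) != expected]


# Toy two-leaf, two-spine fabric (illustrative only).
links = [("leaf1", "spine1"), ("leaf1", "spine2"),
         ("leaf2", "spine1"), ("leaf2", "spine2")]
intents = [("leaf1", "leaf2", True)]

safe = simulate_change(links, [("spine1", "leaf1")], intents)
unsafe = simulate_change(links, [("leaf1", "spine1"), ("leaf1", "spine2")], intents)
print("safe drain violations:", safe)
print("unsafe drain violations:", unsafe)
```

Draining one uplink leaves the intent satisfied. Draining both uplinks isolates `leaf1`, so the change would be rejected before it reaches the operational network.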
Intent-based simulation and monitoring provide near-immediate alerting when things go wrong. They give troubleshooters a dramatically shorter time to insight. They should be a mandatory part of every operational network ecosystem.