30.09.2022 | Daniel Brunner
Network Services - Unattainably Good
Even in the age of cloud services, complexity is increasing rather than decreasing. After years of customizing software solutions, organizations looked for a new abstraction that would simplify product lifecycles and make it easier to switch manufacturers. As a result, the focus shifted to standard software components, and manufacturers developed their own as-a-code solutions. Configurations are no longer deeply embedded in the software or entered via a graphical interface; they are written as code.
This led to an additional level, the configuration level, which is independent and should be easy to understand, version and replicate.
This is where DevOps enjoyed its first major global successes: DevOps teams were the ones who established as-a-code solutions. But why is this so important? DevOps took the path of least resistance and worked in iterations. The teams started with the configuration of the application, extended this to starting and stopping virtual machines, and are now, after many iterations, in the process of completing Infrastructure-as-a-Code (IaaC). While resources such as CPU, RAM and disk storage have almost always been covered by IaaC, this has hardly ever been the case for network infrastructure.
Networks under your own responsibility
In many places, the network and basic services such as DHCP (Dynamic Host Configuration Protocol) v4/v6, DNS (Domain Name System), NTP (Network Time Protocol), IPAM (IP Address Management), routing and firewall are operated separately from one another. Entries are maintained manually in Excel lists, and changes can at best be traced through a ticket system.
The recurring clean-up work is laborious and is therefore rarely carried out. This harbors risks, frustrates service operators and creates additional work for users. A web server administrator who has to confirm the same DNS entry again and again gains nothing from it. After the umpteenth request, the entry is hardly checked at all but quickly ticked off - what is supposed to have changed? In this way, attackers who have gained access can remain undetected for months or years. One of the largest chat providers (the name is not relevant here) had an outdated DNS entry that pointed to an IP address that no longer belonged to the provider. A user exploited this by running their own instance on that IP address and collecting data. Do you have outdated entries? How quickly would you find out? It’s time to act.
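A minimal sketch of how such outdated entries could be spotted, assuming the DNS entries are available as a simple mapping from name to IP address and the organization's own address ranges are known. The function name `find_foreign_records`, the host names and the address ranges (taken from the blocks reserved for documentation) are assumptions for illustration, not part of any existing tool.

```python
import ipaddress


def find_foreign_records(records, owned_networks):
    """Return all entries whose address lies outside every owned network.

    records:        dict of DNS name -> IP address string (assumed export format)
    owned_networks: list of CIDR strings the organization is responsible for
    """
    networks = [ipaddress.ip_network(cidr) for cidr in owned_networks]
    foreign = {}
    for name, address in records.items():
        ip = ipaddress.ip_address(address)
        # An IPv4 address is never "in" an IPv6 network and vice versa,
        # so mixed lists of networks are safe here.
        if not any(ip in network for network in networks):
            foreign[name] = address
    return foreign


# Example data: hypothetical names, documentation address ranges.
records = {
    "www.example.com": "192.0.2.10",
    "chat.example.com": "203.0.113.7",
    "mail.example.com": "2001:db8::25",
}
owned = ["192.0.2.0/24", "2001:db8::/32"]

for name, address in find_foreign_records(records, owned).items():
    print(f"{name} points to {address}, which is outside our address ranges")
```

An entry reported here is not necessarily wrong, since it may point to an external partner on purpose, but it is a candidate that someone should consciously confirm.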
Network services as-a-code
The first step is to define which services should belong to which groups in the future. In turn, each group must be given the task of making its service available as-a-code or as a REST API if possible. The aim is not to incorporate every change directly into the productive environment, but rather to enable good release processes.
This is to be achieved by means of iterations. As an example for DNS:
1st iteration
- Creation of a central DNS configuration in Git or similar
- Release by the person responsible for the DNS service
- Application of the configuration at 7 p.m., so that entries made outside the central configuration are overwritten
2nd iteration
- A DNS configuration folder in which each TLD receives its own configuration file
- Approval by the person responsible for the DNS service, following the dual control principle
- Application of the configuration 15 minutes after approval, so that there is still time to correct any errors
3rd iteration
- It should be checked every hour whether the configuration from the IaaC matches the actual responses from the DNS service. An alarm should be triggered in the event of a discrepancy.
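The check from the 3rd iteration could be sketched as follows. This is an illustration under assumptions: the expected entries are given as a mapping from name to a list of IPv4 addresses, the names `resolve_ipv4` and `find_drift` are made up for this example, and the system resolver is used, which may return cached answers rather than those of the DNS service itself. The hourly scheduling and the alarm are left to whatever scheduler and monitoring system are already in place.

```python
import socket


def resolve_ipv4(name):
    """Resolve a name to the set of its IPv4 addresses via the system resolver."""
    try:
        infos = socket.getaddrinfo(
            name, None, family=socket.AF_INET, type=socket.SOCK_STREAM
        )
    except socket.gaierror:
        # Name does not resolve: report it as "no addresses".
        return set()
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for AF_INET the sockaddr is (address, port).
    return {info[4][0] for info in infos}


def find_drift(expected, resolve=resolve_ipv4):
    """Compare the configured entries with the live answers.

    expected: dict of name -> list of expected IPv4 addresses
    Returns a dict of name -> (expected set, actual set) for every mismatch.
    """
    drift = {}
    for name, addresses in expected.items():
        wanted = set(addresses)
        actual = resolve(name)
        if actual != wanted:
            drift[name] = (wanted, actual)
    return drift


if __name__ == "__main__":
    # Hypothetical expected state, as it might be read from the repository.
    expected = {"www.example.com": ["192.0.2.10"]}
    for name, (wanted, actual) in find_drift(expected).items():
        print(f"ALARM {name}: expected {sorted(wanted)}, got {sorted(actual)}")
```

Passing the resolver in as a parameter keeps the comparison testable without network access and makes it easy to swap in a query against the authoritative server later.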
Network services can only be changed iteratively. The first step is to create a rough draft of possible iterations and then plan the next iteration thoroughly. Make sure communication is clear: include the planning in your wiki, provide information about the iteration plan and ask for pilot groups. Especially in the network area, depending on the changeover, it can take days before it becomes clear that the fault lies not with the application, the user or the operating system, but with the network.
The steps again as a checklist:
- Identify services that are to be considered enterprise network services
- Assign services to a service group (e.g. core network)
- Ask the service groups to make the configuration of their services available as-a-code or via a REST API
- Iteration planning by the service groups
- Communication of these plans
- Identify and invite potential pilot participants
- Start of the iterations
On-prem network services
On-prem network services are services that are available to components in the on-prem network, even if they are provided from the cloud. Getting started is probably more difficult here, as end users are more likely to be affected and many historical entries and components are still in use. It is therefore advisable to start in the cloud and establish standards there that are later adopted on-prem.
Cloud network services
Cloud network services are services that are available to components in the cloud network, even if they are provided from on-prem. Probably the most difficult part is enabling a cloud-agnostic infrastructure. All of today’s cloud providers allow virtual machines to be created, configured and then started. The network area is different. IPv6 alone will cause problems, as not all providers offer it. But such hurdles can be overcome, and newcomers in particular are predestined to finally establish IPv6 as the standard in 2022.
However, two issues need to be considered separately in the cloud, both of which have received little or no attention on-prem. The first is cost. Incoming and outgoing network traffic is generally charged, whereas traffic within the same cloud region usually is not; the actual costs vary depending on the provider. This is particularly important if you operate network services that generate a high volume of data and are not located in the same region as their consumers (cross-region costs). Even if most of the traffic is generated by other services rather than by network services, you may need to offer help. It is important to clarify which services are actually causing traffic, which soft limits (alarm) and hard limits (blocking traffic) have been set, and why.
Here, too, it is important to think in iterations and make improvements step by step instead of putting everything off until day X. Mistakes cannot be avoided entirely. If the network suddenly comes to a standstill because of a deny-all rule on the firewall, it is better to run into this problem as early as possible. Logic errors in particular cannot be found on paper; they only become apparent during operation.
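The distinction between soft and hard limits for traffic can be pictured with a small sketch. The names `LimitState` and `evaluate_traffic`, the unit (gigabytes per billing period) and the example thresholds are assumptions for illustration; real values depend on the provider and on your own budget.

```python
from enum import Enum


class LimitState(Enum):
    OK = "ok"
    ALARM = "alarm"  # soft limit reached: notify, traffic continues
    BLOCK = "block"  # hard limit reached: traffic is blocked


def evaluate_traffic(used_gb, soft_limit_gb, hard_limit_gb):
    """Classify the traffic volume of one service against its two limits."""
    if soft_limit_gb > hard_limit_gb:
        # A soft limit above the hard limit would never trigger an alarm
        # before blocking: exactly the kind of logic error to catch early.
        raise ValueError("soft limit must not exceed hard limit")
    if used_gb >= hard_limit_gb:
        return LimitState.BLOCK
    if used_gb >= soft_limit_gb:
        return LimitState.ALARM
    return LimitState.OK


# Hypothetical usage figures per service, in GB.
usage = {"dns": 120, "ntp": 8, "backup-sync": 950}
for service, used in usage.items():
    state = evaluate_traffic(used, soft_limit_gb=100, hard_limit_gb=900)
    print(f"{service}: {used} GB -> {state.value}")
```

Documenting next to each limit why it was chosen answers the "and why" question when the alarm eventually fires.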
The second issue is cloud-native services. These are provided by the cloud provider and often lack a function that would be very practical for the business, for example extended filter options or rules that are activated on certain actions. These services are also often not optional, even if they are perceived as such. IPAM stands out in particular: no separate tool or agent can act as the master, because IP addresses can only be managed through the provider's cloud-native IPAM. The simplest and quickest solution is to work with scripts, but this creates a direct dependency on the cloud provider, because the same services will not be readily available elsewhere. There are now tools for building cloud-agnostic infrastructure, but these in turn create a dependency on the tool itself. You have to decide whether to trust the tool manufacturer or the cloud provider, and which dependencies result from that choice.
Monitoring and alerting
Less is more: this applies in particular to network services. Think carefully about what is relevant and how it can be made measurable. Monitoring in particular can suddenly create new requirements for service operators on the operating system side. How do you check DNS entries? It is often not at all clear who is responsible for the check and can give a definitive answer. Exotic installations, such as web servers that deliver HTTPS on a port other than 443, need adapted monitoring. Other server configurations, for example ones that leave ping requests unanswered, further increase the workload. A service may also be deliberately unavailable because maintenance work is in progress. At this point at the latest, monitoring will only be partially successful. In contrast to normal operation, you can instead ask what must not occur under any circumstances.
Invalid entries are one example. Another is a service that no longer accepts changes (e.g. check that the timestamp of the last change is not older than X hours). Maintaining a simple keep-alive (the system or service responds) across all basic components is more important than creating special metrics (keyword: tracing) for individual services.
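The three checks mentioned here (keep-alive, invalid entries, a timestamp that must not be too old) can be sketched in a few lines. The function names, the record format and the use of a plain TCP connection as the keep-alive are assumptions for this example; a TCP connect only shows that something is listening, not that the service answers correctly.

```python
import ipaddress
import socket
from datetime import datetime, timedelta, timezone


def is_alive(host, port, timeout=3.0):
    """Keep-alive: True if a TCP connection to host:port can be opened.

    The port is a parameter so that services on non-standard ports
    (e.g. HTTPS on something other than 443) can be covered as well.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def invalid_entries(records):
    """Return the names whose value is not a syntactically valid IP address."""
    bad = []
    for name, address in records.items():
        try:
            ipaddress.ip_address(address)
        except ValueError:
            bad.append(name)
    return bad


def is_stale(last_change, max_age, now=None):
    """True if the last accepted change is older than max_age.

    last_change and now are expected to be timezone-aware datetimes.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_change > max_age


# Hypothetical example data.
records = {"www.example.com": "192.0.2.10", "old.example.com": "192.0.2.300"}
print("invalid:", invalid_entries(records))

last_change = datetime(2022, 9, 29, 7, 0, tzinfo=timezone.utc)
check_time = datetime(2022, 9, 30, 7, 0, tzinfo=timezone.utc)
print("stale:", is_stale(last_change, timedelta(hours=12), now=check_time))
```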
In the event of an alarm, this information should be available to everyone in the internal wiki in the form of a traffic light system. This means less frustration for everyone and more time to deal with the problems. The question of how alarm messages should reach the service managers has also been raised frequently. One reason is the concern that email depends on too many other systems and that nothing would be sent in an emergency. There is no universal answer here. Complexity increases with every additional measure, such as setting up an RSS feed. Some manufacturers also offer a smartphone app that receives push notifications directly from the monitoring system. Despite all efforts, a residual risk remains that an alarm will not get through.
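How check results might be condensed into a traffic light for the wiki can be sketched as follows. The state names (`ok`, `warning`, `critical`), the function names and the rule that the worst state wins are assumptions for this example; how the result is published in the wiki depends on the wiki software and is not shown.

```python
# Severity ranking of the check states; the state names are assumptions.
SEVERITY = {"ok": 0, "warning": 1, "critical": 2}
COLOURS = {0: "green", 1: "yellow", 2: "red"}


def traffic_light(states):
    """Return the colour for one service from its check states (worst wins)."""
    # Unknown state names count as critical, and a service without any
    # check results is shown as yellow: missing knowledge is never green.
    worst = max((SEVERITY.get(state, 2) for state in states), default=1)
    return COLOURS[worst]


def wiki_overview(services):
    """Map each service name to its traffic light colour."""
    return {name: traffic_light(states) for name, states in services.items()}


# Hypothetical check results per service.
results = {
    "DNS": ["ok", "ok"],
    "NTP": ["ok", "warning"],
    "DHCP": ["critical", "ok"],
    "IPAM": [],
}
for name, colour in wiki_overview(results).items():
    print(f"{name}: {colour}")
```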
Unattainably good
Would you like to make the network services in your company unattainably good, for example in line with the NIST Cybersecurity Framework? Then contact us without obligation.