Cloud computing provides businesses with flexibility and scalability, but, as we all know, it can lead to unexpected cost surprises despite the best intentions and forecasts. Our experience managing cloud costs has shown that following a specific “order of operations” optimizes costs and maximizes savings. This post walks through our step-by-step approach.
Before diving in, it's important to remember that there is no quick fix for long-term cloud cost management; established processes and tools, reviewed regularly, are critical to an efficient cloud environment.
Additionally, while the FinOps Foundation has its three-phase capability lifecycle, our order-of-operations methodology is a bit more prescriptive, particularly when it comes to cloud compute.
The first step in our process is rightsizing cloud instances. This involves selecting the right instance type and size based on actual resource requirements. Many companies simply estimate when provisioning instances and often choose unnecessarily large sizes “just to be safe.” This can lead to significant waste and unnecessary expenses.
For example, one client we worked with was running large high-memory instances for their web servers, even though the actual memory utilization was very low (and, importantly, stable). By right-sizing to a smaller instance size, they reduced costs by 30% while still fully meeting the resource needs. While a simplistic example, most environments we encounter have similar opportunities.
We also consider “modernizing” as part of this stage and identify instances running on older or more costly instance types. Prior to rightsizing, be sure to update your instances to the latest available type, as it most likely has the best cost-to-performance ratio. Once you have assessed the performance needed, you can size appropriately.
In short, proper rightsizing (and modernizing) ensures workloads have just the right amount of compute, memory, and storage.
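As a concrete illustration, the rightsizing decision can be sketched in a few lines of Python. The size table, capacity figures, and 20% headroom target below are hypothetical assumptions for illustration, not vendor pricing or sizing data:

```python
# Minimal rightsizing sketch: given observed peak utilization, suggest the
# smallest size (from a hypothetical instance family) that still leaves headroom.

SIZES = [  # (name, vCPUs, memory_GiB) -- illustrative values only
    ("large",    2,  16),
    ("xlarge",   4,  32),
    ("2xlarge",  8,  64),
    ("4xlarge", 16, 128),
]

def rightsize(peak_vcpus_used: float, peak_mem_gib_used: float,
              headroom: float = 0.20) -> str:
    """Return the smallest size whose capacity covers peak usage plus headroom."""
    need_cpu = peak_vcpus_used * (1 + headroom)
    need_mem = peak_mem_gib_used * (1 + headroom)
    for name, vcpus, mem in SIZES:  # ordered smallest to largest
        if vcpus >= need_cpu and mem >= need_mem:
            return name
    return SIZES[-1][0]  # nothing smaller fits; keep the largest

# A web server peaking at 1.5 vCPUs and 6 GiB fits comfortably on "large",
# even though it may have been provisioned far larger "just to be safe".
print(rightsize(1.5, 6.0))
```

In a real environment the peak figures would come from your monitoring data (sustained peaks over weeks, not a single snapshot), and the table from your provider's actual instance families.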
After instances are rightsized, the next step is to take advantage of scheduling and auto-scaling capabilities. Scheduling allows instances to automatically turn on or off at specific times based on usage patterns. Auto-scaling allows instance counts to dynamically increase or decrease based on actual demand. This keeps applications responsive, while minimizing waste during times of low usage.
Many industries have very specific hours of operation and related demands on compute resources. One example is financial markets. While user access is not necessarily terminated, the majority of the systems are processing data only during the hours the stock market is open. Certain environments can terminate or scale these instances easily, but others with less modern architectures struggle to adjust. By leveraging auto-scheduling, auto-scaling, or even burstable instance types, you can intelligently reduce your spend when you have predictable load. Given the stock market’s fixed hours and its recognized bank holidays, it's easy to see where substantial savings can be found simply by scheduling.
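The market-hours pattern can be sketched as a simple schedule check. The hours and holiday dates here are illustrative, not an authoritative trading calendar:

```python
from datetime import datetime, time

# Sketch of a schedule check for market-hours workloads: run processing
# instances only 9:30-16:00, Monday-Friday, skipping listed holidays.

MARKET_OPEN, MARKET_CLOSE = time(9, 30), time(16, 0)
HOLIDAYS = {"2024-07-04", "2024-12-25"}  # example dates only

def should_run(now: datetime) -> bool:
    if now.strftime("%Y-%m-%d") in HOLIDAYS:
        return False
    if now.weekday() >= 5:  # Saturday or Sunday
        return False
    return MARKET_OPEN <= now.time() < MARKET_CLOSE

print(should_run(datetime(2024, 7, 3, 11, 0)))  # a Wednesday mid-session
print(should_run(datetime(2024, 7, 4, 11, 0)))  # a listed holiday
```

In practice a scheduler (cron, a serverless function, or your provider's native instance scheduler) would call a check like this and start or stop the instance group accordingly.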
Combining rightsizing, scheduling, and auto-scaling ensures resources align closely with workload requirements.
After establishing well-sized and scaled environments, enterprises can further reduce expenses by using reservations and savings plans. These provide discounted rates in return for a commitment to consistent usage over a 1-3 year period.
Savings can exceed 70% compared to pay-as-you-go or on-demand pricing. The key is to only reserve what is necessary after rightsizing, so as not to pay for unused capacity. As discussed in our other blog posts, you need to maintain a process to best capitalize on these opportunities, and if you have a dynamic environment (as most companies do) you will want to use a platform like OpsNow.io to mitigate risk and ensure peak efficiency.
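To see why reserving only the post-rightsizing baseline matters, here is a sketch of the arithmetic. The rates (a hypothetical $0.10/hr on-demand vs. $0.06/hr committed) and the usage series are illustrative assumptions:

```python
# "Reserve the baseline only": commit discounted capacity to the steady floor
# of usage and leave the variable peak on demand.

def blended_cost(hourly_usage, on_demand=0.10, committed=0.06):
    """Total cost when the always-on floor is reserved and the rest is on-demand."""
    baseline = min(hourly_usage)                       # instances always running
    reserved_cost = baseline * committed * len(hourly_usage)
    variable_hours = sum(u - baseline for u in hourly_usage)
    return reserved_cost + variable_hours * on_demand

usage = [10, 10, 14, 20, 20, 12, 10, 10]  # instances per hour (example slice of a day)
all_on_demand = sum(usage) * 0.10
print(round(all_on_demand, 2), round(blended_cost(usage), 2))
```

The point of the sketch: committing to the peak (20 instances) would leave paid-for capacity idle most of the day, while committing to the floor (10) captures the discount with no waste.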
Automated compute commitments are OpsNow’s specialty, and we regularly save enterprises well over 40% by reserving instances after rightsizing and scheduling, while ensuring high utilization and coverage. We do this via a tool called AutoSavings; its biggest value is that OpsNow provides improved savings without end customers needing to commit to the usual one- or three-year period. By removing this risk, we can act quickly to ensure maximum savings.
The final step is continual monitoring of cloud resource usage and costs. Cloud platforms provide detailed metrics that can be analyzed to identify waste and optimization opportunities. Utilizing alerts for anomalies, based on tag groups or other well-defined processes, is an important and often overlooked step. Environments with no structured tagging and loose or inconsistent processes for terminating unused resources are often written off as a cost of the cloud, but when managed correctly they can yield another 15% or more in savings.
It’s also essential to set usage thresholds and alerts to be notified of unusual activity. For example, one of our clients found that a runaway batch job was needlessly maxing out CPU for hours. By setting alerts, they were able to intervene quickly and address the issue, resulting in over 20% monthly savings.
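The kind of alert that catches a runaway job can be sketched as a sustained-threshold check. The threshold, window, and sample data below are illustrative; real setups would use the cloud provider's native monitoring alarms rather than hand-rolled polling:

```python
# Flag any resource whose CPU stays above a threshold for N consecutive samples,
# so brief legitimate spikes don't page anyone but a stuck job does.

def sustained_high_cpu(samples, threshold=95.0, consecutive=3):
    """Return True if `consecutive` samples in a row exceed `threshold`."""
    run = 0
    for pct in samples:
        run = run + 1 if pct > threshold else 0
        if run >= consecutive:
            return True
    return False

print(sustained_high_cpu([40, 99, 99, 99, 98]))  # runaway-batch-job pattern
print(sustained_high_cpu([40, 99, 60, 99, 98]))  # brief spikes only
```

Requiring several consecutive breaches is the same design choice most managed alarm services expose as "N datapoints out of M."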
Regular monitoring ensures environments remain properly right-sized and scaled over time, while also unlocking additional savings opportunities. Don’t confuse performance management and customer experience with the financial side of cost optimization. While they often use the same tools, the objectives are different.
After working through rightsizing, auto-scaling, and reservations, enterprises can further optimize costs by utilizing spot instances. These provide unused compute capacity at discounts of up to 70% compared to on-demand.
The tradeoff is that spot instances can be reclaimed if the cloud provider needs the capacity. Therefore, spot instances are ideal for fault-tolerant workloads such as batch processing jobs, development and testing environments, big data analysis, and any application with flexible start and end times. For those of you with Kubernetes environments, you likely already have extensive spot implementations, as this can be a natural fit; but if you are seeing too many failed requests, or you have a stateful environment, a balance of on-demand capacity can make sense.
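The spot/on-demand balance can be sketched as a simple fallback policy. Here `provision` is a hypothetical callable standing in for a real cloud API request; the retry count is an illustrative assumption:

```python
# Try spot capacity first; after repeated failures or reclamations,
# fall back to on-demand so the workload still runs.

def acquire_capacity(provision, max_spot_attempts=3):
    """Try spot up to N times, then fall back to on-demand."""
    for _ in range(max_spot_attempts):
        if provision(market="spot"):
            return "spot"
    # Spot capacity unavailable or repeatedly reclaimed -- pay on-demand.
    provision(market="on-demand")
    return "on-demand"

# Simulate a region where every spot request fails.
print(acquire_capacity(lambda market: market != "spot"))
```

Kubernetes node-autoscaling setups typically express the same idea declaratively, with a spot node group preferred and an on-demand group as overflow.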
Beyond resource allocation and purchasing choices, another way to improve cloud efficiency is through code optimization. Well-written code places less demand on infrastructure and lets instances be fully utilized. While beyond the scope of this post, this arguably should be the first part of your optimization order of operations, as it can dramatically impact performance and compute utilization.
Even small code improvements can accumulate to dramatic cost savings at scale. Just a 10% reduction in overall resource needs provides opportunity for considerable instance downsizing. For example, one company reduced each web request’s CPU time from 800 ms to 200 ms through code optimizations. This allowed servers to handle roughly 4X more traffic without any modification to the underlying VMs.
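The capacity arithmetic behind that example works out as follows. This is a toy calculation under the figures quoted above, not a benchmark:

```python
# Cutting per-request CPU time from 800 ms to 200 ms multiplies the requests
# each vCPU-second can absorb, on the same hardware.

def capacity_gain(old_ms_per_req: float, new_ms_per_req: float) -> float:
    """Throughput multiplier from a per-request CPU-time reduction."""
    return old_ms_per_req / new_ms_per_req

def max_rps(vcpus: int, ms_per_req: float) -> float:
    """Requests per second at full CPU utilization (CPU-bound approximation)."""
    return vcpus * 1000.0 / ms_per_req

print(capacity_gain(800, 200))            # throughput multiplier on the same VMs
print(max_rps(8, 200) - max_rps(8, 800))  # extra requests/sec on an 8-vCPU box
```

The approximation assumes the workload is CPU-bound; if requests are waiting on I/O instead, the gain shows up as idle capacity rather than raw throughput.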
Additionally, much of the cloud runs on NGINX. If you have NGINX or similar load balancers in your environment, tuning these tools can provide a dramatic improvement, as this is an oft-overlooked choke point between your services and customers.
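For illustration, the settings worth reviewing often include worker sizing, connection limits, and upstream keepalive. The values and the `app1`/`app2` backend hosts below are placeholder assumptions, not recommendations for any particular environment:

```nginx
worker_processes auto;          # one worker per CPU core

events {
    worker_connections 4096;    # raise the per-worker connection ceiling
}

http {
    gzip on;                    # compress responses to cut transfer time

    upstream backend {
        server app1:8080;       # placeholder backend hosts
        server app2:8080;
        keepalive 32;           # keep idle connections open to the upstreams
    }

    server {
        listen 80;
        location / {
            proxy_http_version 1.1;          # required for upstream keepalive
            proxy_set_header Connection "";  # strip "close" so reuse works
            proxy_pass http://backend;
        }
    }
}
```

Upstream keepalive in particular avoids re-establishing a TCP connection for every proxied request, which is exactly the kind of hidden per-request overhead this step targets.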
Managing cloud costs requires following a best-practice order of operations: first rightsize instances, then enable scheduling and auto-scaling, next reserve capacity, and finally implement a process to monitor costs and identify anomalies. When implemented effectively and periodically reviewed, the hassles of unforeseen cloud spend can be greatly reduced. Using tools like OpsNow that highlight savings opportunities and provide trustworthy analytics solves part of the problem, and we think an important part; but as with anything, operations needs to play an active role.