Optimizing an Update Loop to Reduce Cloud Costs
Hidden software flaws can become costly. This case study shows how fixing an update loop significantly reduced cloud expenses.
Published 2025-02-13 by Tomas Scheel
8 minute read
The Hidden Cost of Poor Software Architecture
When software architecture is poorly designed, inefficiencies can go unnoticed. At first they may seem insignificant, but over time they lead to increased costs, degraded performance, and complaints from stakeholders.
This article examines a real-world scenario where a flawed auto-update mechanism caused an excessive number of API requests. As a result, system resources were consistently taxed and operational costs increased.
To maintain confidentiality, all of the case studies we conduct are anonymized. By removing industry-specific details, we can safeguard sensitive client information and focus on the architectural problem itself. The goal is to illustrate how these issues arise, how to detect them early, and how to implement effective solutions before they spiral out of control.
Identifying the Problem: Anomalous Resource Utilization
When we took over this project, we inherited an existing system with deeply embedded architectural flaws. One of the first signs that something was wrong appeared in the system’s resource utilization metrics. Even during non-business hours, when activity should have been minimal, the system maintained a constant baseline load. Normally, workloads fluctuate throughout the day, dropping significantly when users are inactive. However, in this case, the system was consuming resources at a steady rate 24/7.
A deeper investigation using monitoring tools revealed the root cause: an excessive number of API requests being made at all times. Every entity in an intermediate state was being queried every 30 seconds. This quickly compounded into hundreds of thousands of unnecessary requests per day.
The sheer volume of requests was not immediately visible in the test environment, where only a handful of entities were active at any given time. However, once the system was deployed at scale, the impact became clear. The API calls were consuming excessive cloud resources and creating an unnecessary operational burden. Eventually, even third-party providers noticed the issue and raised concerns about the load being placed on their systems.
The Root Cause: A Runaway Update Loop
We traced the issue back to a fundamental flaw in how the system handled status updates. The original architecture relied on a continuous polling mechanism to check the status of entities in an intermediate state.
The system was designed so that every entity requiring updates was queried at a fixed interval of 30 seconds. Rather than using an event-driven or adaptive approach, the system simply looped through all entities on a separate thread, issuing requests regardless of whether the entity’s status had changed.
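The client's original code isn't reproduced in this write-up, but the anti-pattern amounted to something like the sketch below. `get_pending_entities` and `fetch_status` are placeholder callables standing in for the system's actual data access and API client, not its real functions:

```python
import threading
import time

POLL_INTERVAL_SECONDS = 30  # fixed interval, regardless of entity state


def poll_forever(get_pending_entities, fetch_status):
    """Anti-pattern: query every in-flight entity on a fixed timer, forever."""
    while True:
        for entity_id in get_pending_entities():
            # One outbound API call per entity every 30 seconds,
            # whether or not anything has actually changed.
            fetch_status(entity_id)
        time.sleep(POLL_INTERVAL_SECONDS)


def start_polling(get_pending_entities, fetch_status):
    # Runs on a background thread with no termination condition, so entities
    # keep being polled even after their status stops changing.
    thread = threading.Thread(
        target=poll_forever,
        args=(get_pending_entities, fetch_status),
        daemon=True,
    )
    thread.start()
    return thread
```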
At small scales, such as in the test environment, this approach did not present a noticeable issue. However, as the number of entities grew, the volume of outbound API requests increased linearly with it. Since each entity queried a third-party service twice per minute, the number of daily requests quickly spiraled out of control.
For example, with just 100 entities, the system would be making nearly 300,000 API requests per day.
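The arithmetic is simple linear scaling, which is exactly what made the small test environment so misleading. A quick back-of-the-envelope helper shows how the volume grows with entity count:

```python
def daily_requests(entity_count: int, interval_seconds: int = 30) -> int:
    """Requests per day when every entity is polled on a fixed interval."""
    polls_per_entity = 24 * 60 * 60 // interval_seconds  # 2,880 polls per day
    return entity_count * polls_per_entity


print(daily_requests(100))    # 288,000 requests per day
print(daily_requests(1_000))  # 2,880,000 requests per day
```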
Additionally, there was no graceful termination mechanism for entities that no longer required updates. Even if an entity remained unchanged for days or weeks, the system continued polling it indefinitely, with nothing in place to detect that polling should stop, further compounding the inefficiency.
This inefficient polling approach resulted in:
- Consistently high resource consumption, even during low-traffic periods.
- Increased cloud service costs due to excessive API requests.
- Performance strain on third-party services, leading to external complaints.
It was clear that a smarter, more adaptive solution was necessary.
Implementing the Solution: An Adaptive Auto-Update System
To resolve the inefficiencies caused by the runaway update loop, we designed a more intelligent, event-driven auto-update system. The goal was to maintain timely updates while dramatically reducing unnecessary API calls. Instead of continuously polling every entity at a fixed interval, we implemented an exponential falloff strategy using cloud-based messaging and scheduled execution.
Key Components of the Solution
- Azure Service Bus Queue for Scheduling Requests
  - Each entity requiring an update was placed into a queue rather than being polled constantly.
  - The Service Bus allowed messages to be scheduled for future delivery, ensuring controlled timing of requests.
- Azure Function to Process Update Requests
  - Instead of a continuous loop, updates were handled by a cloud function triggered by queued messages.
  - Before making an API request, the function checked whether the entity still required updates, preventing unnecessary calls.
- Exponential Falloff Mechanism (Replacing Fixed Intervals)
  - The initial polling rate was kept high because most status changes occur within the first few minutes, preserving real-time responsiveness when changes were most likely to happen.
  - If no status change was detected, the interval between checks gradually grew longer.
  - The final fallback interval ensured that unchanged entities were checked only once per day. (A sketch of the delay calculation and scheduling follows this list.)
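The following is a minimal sketch of the scheduling side, assuming the Python azure-servicebus SDK. The queue name, the `SERVICE_BUS_CONNECTION` environment variable, the falloff constants, and the message payload shape are illustrative assumptions, not the actual implementation:

```python
import json
import os
from datetime import datetime, timedelta, timezone

from azure.servicebus import ServiceBusClient, ServiceBusMessage

# Illustrative falloff bounds: start at 30 seconds, cap at one check per day.
BASE_DELAY = timedelta(seconds=30)
MAX_DELAY = timedelta(days=1)


def next_delay(checks_without_change: int) -> timedelta:
    """Exponential falloff: double the wait after every unchanged check."""
    delay = BASE_DELAY * (2 ** checks_without_change)
    return min(delay, MAX_DELAY)


def schedule_update_check(entity_id: str, checks_without_change: int) -> None:
    """Queue the next status check for an entity at a falloff-based delay."""
    run_at = datetime.now(timezone.utc) + next_delay(checks_without_change)
    message = ServiceBusMessage(
        json.dumps({
            "entity_id": entity_id,
            "checks_without_change": checks_without_change,
        })
    )
    with ServiceBusClient.from_connection_string(
        os.environ["SERVICE_BUS_CONNECTION"]
    ) as client:
        with client.get_queue_sender(queue_name="entity-update-checks") as sender:
            # Service Bus holds the message until run_at, so nothing polls
            # in the meantime.
            sender.schedule_messages(message, run_at)
```

With this in place, an entity that keeps changing is rechecked every 30 seconds, while one that goes quiet drifts toward a single daily check without any loop running in between.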
How It Works
- When an entity enters an auto-update state, it is queued with an appropriate delay based on how long it has been since its last status change.
- When the Azure Function processes the message (sketched in the code after this list):
  - It checks whether the entity still requires updates (not expired or flagged for errors).
  - If the entity's status has changed, it resets the schedule back to the highest frequency.
  - If no change is detected, it calculates a longer delay for the next scheduled check.
- If an entity fails to update multiple times in a row, it is flagged for manual review, preventing indefinite retries.
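A sketch of the processing side, assuming the Azure Functions Python programming model with a Service Bus queue trigger, is shown below. The helpers `load_entity`, `fetch_latest_status`, and `flag_for_review`, the entity attributes, and the `MAX_FAILURES` threshold are hypothetical placeholders for the client's real persistence and API layer; `schedule_update_check` refers to the earlier scheduling sketch:

```python
import json

import azure.functions as func

# Hypothetical helpers standing in for the real persistence / API layer.
from updates import load_entity, fetch_latest_status, flag_for_review
from scheduling import schedule_update_check  # from the previous sketch

MAX_FAILURES = 3  # illustrative threshold before manual review

app = func.FunctionApp()


@app.service_bus_queue_trigger(
    arg_name="msg",
    queue_name="entity-update-checks",
    connection="SERVICE_BUS_CONNECTION",
)
def process_update_check(msg: func.ServiceBusMessage) -> None:
    payload = json.loads(msg.get_body())
    entity = load_entity(payload["entity_id"])

    # Stop the cycle entirely for entities that no longer need updates.
    if entity is None or entity.expired or entity.flagged_for_errors:
        return

    try:
        changed = fetch_latest_status(entity)  # single outbound API call
    except Exception:
        # Persistence of the failure count is omitted for brevity.
        entity.failure_count += 1
        if entity.failure_count >= MAX_FAILURES:
            flag_for_review(entity)  # stop retrying, hand off to a human
            return
        changed = False
    else:
        entity.failure_count = 0

    if changed:
        # Status changed: reset to the fastest cadence again.
        schedule_update_check(entity.id, checks_without_change=0)
    else:
        # Unchanged: back off further before the next check.
        schedule_update_check(
            entity.id,
            checks_without_change=payload["checks_without_change"] + 1,
        )
```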
By implementing this strategy, we eliminated excessive API requests, reducing the number from millions per day to just thousands, all while ensuring critical updates were still delivered promptly.
Lessons Learned and Key Takeaways
This case study highlights the importance of designing scalable and efficient update mechanisms in software architecture. Poorly structured polling mechanisms can lead to unnecessary resource consumption, increased operational costs, and performance degradation.
Key Takeaways:
- Avoid Fixed Interval Polling: Instead of blindly checking for updates at set intervals, implement an event-driven or adaptive approach that considers when updates are most likely needed.
- Leverage Cloud Services for Scalability: Failing to design for efficient resource utilization can lead to unnecessarily high costs. Instead of treating cloud infrastructure like a traditional data center, architects should optimize workloads to take advantage of cloud-native features, ensuring that resources are used only when needed and scaled appropriately to demand.
- Design for Expected Production Scale: Designing for production means anticipating realistic usage levels rather than assuming test environments reflect actual demand. A system should be designed with a clear understanding of expected scale, avoiding both excessive over-provisioning and underestimating future growth.
- Monitor and Adjust Regularly: Every system benefits from performance tracking and logging, as it helps identify inefficiencies early. By continuously monitoring key metrics, potential issues can be detected and resolved before they escalate into operational disruptions.
A well-designed system balances functionality, efficiency, and scalability. By identifying and addressing architectural inefficiencies early, businesses can avoid costly setbacks and ensure their systems remain reliable and sustainable in the long run.