When Logs Went Wild: Lessons from a PubSub Billing Blowout

It was the 7th of January, my second day back from the holidays. I was drafting an architecture proposal for a different project and needed to see how much our current systems cost in order to set a reference point. I opened Google Cloud’s billing report for staging and saw an ominous spike: the project’s billing forecast was up 7x in the span of just 15 days. But how?

This is a summary of how we ended up blowing up our billing forecast in staging, and what lessons we can learn from it.

The Incident: A Billing Spike We Didn’t See Coming

After some digging, the culprit was identified: a PubSub worker that was spitting out logs at an abnormal rate. The worker was failing to pull messages from PubSub due to insufficient permissions, and a flaw in the code caused that failure to repeat in an endless loop. The resulting flood of error logs is what ballooned the billing.


The Timeline: How It Happened

  1. Trigger: An operation by the ops team misconfigured the service account used by the PubSub worker, triggering the failures and the excessive logging.

  2. Detection: An individual noticed that the billing report showed an anomalous trend.

  3. Response: We stopped the failing workers.


Root Cause: A Critical Misunderstanding of Code

The root cause was a coding flaw. Here’s some simplified pseudo-code in Golang illustrating the issue:


// The worker wraps Receive in an endless loop with no delay between retries.
for {
    err := sub.Receive(context.Background(), func(ctx context.Context, m *pubsub.Message) {
        log.Printf("Got message: %s", m.Data)
        m.Ack()
    })
    if err != nil && !errors.Is(err, context.Canceled) {
        log.Printf("Receive error: %v", err) // logged on every failed attempt
    }
}

Here, sub.Receive() actually blocks: it opens a pull stream and manages message delivery internally, and it only returns when the context is canceled or a non-retryable error occurs.
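
For reference, here is a minimal sketch of the intended usage, assuming the standard cloud.google.com/go/pubsub client and placeholder project and subscription IDs: Receive() is called exactly once and runs until the context is canceled or a fatal error occurs.

package main

import (
    "context"
    "log"

    "cloud.google.com/go/pubsub"
)

func main() {
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()

    // "my-project" and "my-subscription" are placeholders.
    client, err := pubsub.NewClient(ctx, "my-project")
    if err != nil {
        log.Fatalf("pubsub.NewClient: %v", err)
    }
    defer client.Close()

    sub := client.Subscription("my-subscription")

    // Receive blocks here; no outer for-loop is needed.
    if err := sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
        log.Printf("Got message: %s", m.Data)
        m.Ack()
    }); err != nil {
        log.Printf("Receive returned: %v", err)
    }
}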

The critical misunderstanding was believing that Receive() does not block, and that it therefore had to be called repeatedly in a loop.

Another oversight, independent of that misunderstanding, was that there was no exponential back-off or sleep between retries when an error occurred.

This meant that whenever the worker failed to pull messages from PubSub, Receive() returned an error, the worker logged it, and the loop immediately retried, over and over.
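
If a retry loop around Receive() were genuinely needed, the minimum safeguard would be back-off between attempts. Here is a rough sketch rather than our actual fix; the subscription interface, the receiveWithBackoff name, and the one-second-to-five-minute back-off window are illustrative assumptions.

package worker

import (
    "context"
    "errors"
    "log"
    "time"

    "cloud.google.com/go/pubsub"
)

// subscription abstracts *pubsub.Subscription so a fake can be injected in tests.
type subscription interface {
    Receive(ctx context.Context, f func(context.Context, *pubsub.Message)) error
}

// receiveWithBackoff retries Receive with exponential back-off so that a
// persistent error (such as missing permissions) cannot spin and flood the logs.
func receiveWithBackoff(ctx context.Context, sub subscription, handle func(context.Context, *pubsub.Message)) {
    backoff := time.Second
    for {
        err := sub.Receive(ctx, handle)
        if err == nil || errors.Is(err, context.Canceled) {
            return // normal shutdown
        }
        log.Printf("Receive failed: %v; retrying in %s", err, backoff)

        // Wait before the next attempt, but stop immediately if the context ends.
        select {
        case <-ctx.Done():
            return
        case <-time.After(backoff):
        }
        if backoff < 5*time.Minute {
            backoff *= 2 // double up to a cap, throttling both retries and error logs
        }
    }
}

Even a plain fixed sleep between retries would have contained the log volume; exponential back-off just bounds it more tightly.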


Underlying Issues: How It Could Have Been Avoided

The coding flaw alone could not have turned this into a regrettable billing blowout; several other gaps allowed it to grow unchecked.

To highlight some of those issues:

  • Insufficient Operations Procedure: Any operation that could impact workloads should include a check that those workloads are still working as expected afterward. This is especially true in staging, because every staging operation should be performed with the expectation that the same procedure will be followed in production.

  • Insufficient Alerting: For some reason, no budget alerts were set on staging. Since this incident pushed spending past the expected monthly budget, a budget alert would have enabled faster detection. Application-level alerts in staging were also lacking, although such alerts might still have been overlooked, especially during the holidays.

  • Insufficient Operational Quality Control: Failure scenarios involving system integration, such as a misconfigured service account, were not considered or tested thoroughly enough. Unit and integration tests covered application and system-level expectations, and QA testing covered specification and end-to-end expectations, but no tests covered operational expectations. A sketch of what such a test could look like follows this list.
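
As an example of an operational-expectation test, here is a rough sketch that builds on the hypothetical receiveWithBackoff and subscription interface from the back-off sketch above: it injects a fake subscription that always fails with PermissionDenied, mimicking this incident, and asserts that back-off keeps the retry count low.

package worker

import (
    "context"
    "sync/atomic"
    "testing"
    "time"

    "cloud.google.com/go/pubsub"
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
)

// deniedSub fakes the incident: every pull attempt fails with PermissionDenied.
type deniedSub struct{ calls int32 }

func (d *deniedSub) Receive(ctx context.Context, f func(context.Context, *pubsub.Message)) error {
    atomic.AddInt32(&d.calls, 1)
    return status.Error(codes.PermissionDenied, "permission denied on subscription")
}

func TestReceiveBacksOffOnPermissionDenied(t *testing.T) {
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()

    sub := &deniedSub{}
    receiveWithBackoff(ctx, sub, func(ctx context.Context, m *pubsub.Message) { m.Ack() })

    // With a one-second initial back-off, only a couple of attempts should fit
    // into the two-second window; without back-off this count would explode.
    if n := atomic.LoadInt32(&sub.calls); n > 5 {
        t.Fatalf("got %d retries in 2s; expected back-off to throttle them", n)
    }
}

A test like this only works if the worker accepts an interface rather than the concrete client type, which is exactly the kind of design decision that testing for operational expectations pushes you toward.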


What We Learned: Key Takeaways

  1. More Robust Code Review Process: Linters could not detect this flaw. We need a way to ensure our code is operationally safe.

  2. Set Billing Budgets On All Environments: Setting usage caps or alerts could have caught this earlier.

  3. Operational Excellence: Any operation that may impact applications should include a procedure to check the application’s state afterward. We should assume that all code is flawed, and that the only way to truly protect the system is through operational excellence: visibility into the system’s health from every angle, so that issues can be prevented in advance or reacted to quickly.


What’s Next: Our Action Plan

We’ve implemented the following steps to ensure this doesn’t happen again:

  1. Set Alerts for Billing: Cloud billing alerts are now configured to notify us when spending crosses certain thresholds.

  2. Code Fix: Address the code flaw described above.

  3. AI PR Review: Investigate solutions such as AI code reviews that could potentially increase code review quality.


Closing Thoughts

Code flaws will always happen; that is a fact we need to learn to deal with. Preventing them in advance is important, but it may be a fight that cannot be won. Instead, there may be more value in detecting these issues as quickly as possible through operations and quality control tests. By focusing on tests that go beyond the scope of unit and integration testing, real-world risks can be caught before they turn into incidents. Moving forward, in any new releases I handle, I would like to keep exploring how we can raise the quality of our operations.