Container Orchestrator SIGTERM Spam
I’ve now bailed twice on using Azure Kubernetes Service (AKS), for a few reasons:
- Limited understanding, despite watching some great breakdowns on YouTube like “Azure Kubernetes Services (AKS) Overview”
- Roadblocks due to private networking requirements in our environment
- Honesty with myself that I should get back to building the application instead of obsessing over the ops
So, you can imagine my excitement when I learned about Azure Container Apps (ACA), an abstraction over Kubernetes. It doesn’t go as far as App Service, which wraps your code in a container it builds itself; instead, you ship the container and set some limited networking, volume mount, and scaling parameters. It’s become my favorite way to deploy a containerized workload to Azure. I’ve literally tweeted for joy about how I thought it was the perfect level of abstraction: deep enough for me to push my knowledge of fundamental orchestration concepts like KEDA scaling, but shallow enough that I don’t have to think about pod networking and Kubernetes manifest files:
> I gotta say, as a web dev, abstraction is a hell of a gateway drug.
>
> I tried a managed kubernetes cluster (Azure) a year ago and couldn't tell the difference between a pod and my Airpods despite being pretty solid on container fundamentals. 😅
>
> — Emeka C. Anyanwu, MD (@EmekaAnyanwu) October 11, 2023
One app that I’ve deployed to ACA is a fax processing system that handles thousands of documents per day. Our stack for this project is a Flask backend, a Quasar/Vue frontend, and a Celery task queue. We have a container image for our task worker that handles a bunch of I/O-bound API calls but also processes CPU-bound tasks like OCR and vector similarity comparison.
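For a sense of what the worker runs, here’s a minimal sketch of that mixed workload on the Celery side. The task names, bodies, and broker URL here are hypothetical, not our actual code:

```python
# celery_app.py -- illustrative sketch only; names and broker URL are made up
from celery import Celery

app = Celery("fax_processor", broker="rediss://:<password>@<redis-host>:6380/1")

@app.task
def notify_downstream_api(document_id: str) -> None:
    """I/O-bound: a short HTTP call that finishes in well under a second."""
    ...

@app.task
def ocr_document(document_id: str) -> None:
    """CPU-bound: can churn for minutes turning a flat PDF into searchable text."""
    ...
```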
For the task worker, the number of replicas scales according to the length of the Redis task list. If you’re interested, here’s what the Bicep code for that looks like:
```bicep
module workerHeavy './templates/container-app.bicep' = {
  name: 'workerHeavyDeploy'
  params: {
    config: {
      ...buildBaseConfig(baseConstants, 'worker-heavy')
      image: '${containerRegistryFQDN}/${appName}-worker-main:${imageTag}'
      containerCommand: ['/app/server/bin/celery_worker.sh']
      containerEnvVars: concat(containerAppEnvVars, [
        { name: 'WORKER_BASE_NAME', value: 'heavy' }
        { name: 'WORKER_CONCURRENCY', value: '4' }
        { name: 'QUEUE_LIST', value: 'celery' }
      ])
      cpuCores: '2'
      memorySize: '8'
      terminationGracePeriodSeconds: 60 * 10
      secrets: [{ name: 'redis-password', keyVaultUrl: redisPasswordVaultUrl, identity: managedIdentity.id }]
      scale: {
        minReplicas: 1
        maxReplicas: 15
        cooldownPeriod: 60 * 15
        rules: [
          {
            name: 'celery-queue-length'
            custom: {
              type: 'redis'
              auth: [{ secretRef: 'redis-password', triggerParameter: 'password' }]
              metadata: {
                address: '${redisHost}:${redisPort}'
                listName: 'celery'
                listLength: '4'
                databaseIndex: '1'
                enableTLS: 'true'
              }
            }
          }
        ]
      }
    }
  }
}
```
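For intuition on what that scale rule does: KEDA’s redis scaler targets `listLength` pending entries per replica, so the replica count roughly tracks the queue depth divided by 4, clamped between `minReplicas` and `maxReplicas`. Here’s a rough sketch of that arithmetic, an approximation of the KEDA/HPA calculation rather than the exact algorithm:

```python
import math

# Rough approximation of how the KEDA redis scaler sizes the worker pool
def desired_replicas(queue_length: int, list_length: int = 4,
                     min_replicas: int = 1, max_replicas: int = 15) -> int:
    wanted = math.ceil(queue_length / list_length)  # target listLength tasks per replica
    return max(min_replicas, min(max_replicas, wanted))

print(desired_replicas(0))    # 1  -> never below minReplicas
print(desired_replicas(37))   # 10 -> ceil(37 / 4)
print(desired_replicas(500))  # 15 -> capped at maxReplicas
```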
This setup worked well for over a year, right up until a couple of months ago, and fixing it was a fun lesson in POSIX signals and a reminder that abstractions are only approximations.
One fateful Thursday morning, I noticed that many of our tasks, both long and short, were being interrupted by an exception I’d almost never seen: WorkerLostError.
Some investigation on the super useful ACA GitHub repo led me to someone who had noticed that SIGTERM signals were being sent to all container processes (not just PID 1) when ACA initiated a graceful shutdown. A graceful shutdown might be triggered by downscaling or by container migration to a new node.
For a container that doesn’t have long-running tasks, like a web server, this isn’t a big deal. Each task completes in fractions of a second, and an upstream load balancer has likely already rerouted any new traffic to a new replica.
But a long-running task worker? Nah, this is problematic. Our task worker could be in the middle of a 3-minute conversion of a plain PDF into a text-searchable document, running in a child process. When ACA signals the container that it needs to shut down within the grace period (up to 15 minutes), the main Celery process would normally receive that signal, stop accepting new tasks, and let the currently running child processes finish their work. Instead, ACA was sending a SIGTERM to the child processes too, which terminated immediately, mid-task. To make matters worse, the problem was worst at peak, because that’s when scaling events (really, the downscaling) were most frequent.
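The contract that breaks here is the usual division of labor in a prefork worker: the parent owns the SIGTERM and drains gracefully, while the children are shielded from it, either because the orchestrator only signals PID 1 or because they explicitly ignore it. Here’s a toy illustration of that pattern; it is not Celery’s or billiard’s actual implementation:

```python
# toy_shutdown.py -- toy illustration of parent-coordinated shutdown, not Celery's code
import os
import signal
import time
from multiprocessing import Process

def child_worker() -> None:
    # Children ignore SIGTERM; only the parent decides when they stop.
    signal.signal(signal.SIGTERM, signal.SIG_IGN)
    print(f"child {os.getpid()}: starting a long task")
    time.sleep(30)  # stand-in for a multi-minute OCR job
    print(f"child {os.getpid()}: finished cleanly")

def main() -> None:
    children = [Process(target=child_worker) for _ in range(2)]
    for p in children:
        p.start()

    def handle_sigterm(signum, frame):
        # Parent: stop taking new work and let in-flight tasks drain.
        print("parent: got SIGTERM, draining")

    signal.signal(signal.SIGTERM, handle_sigterm)
    for p in children:
        p.join()

if __name__ == "__main__":
    main()
```

If every process in the container gets the SIGTERM and the children exit immediately, which is what was happening here, the parent’s graceful drain never gets a chance and you see WorkerLostError instead.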
So, how do we solve for this?
- I reached out to Azure support and learned that our ACA environment had been upgraded in the background to a v2 that had this behavior. The mitigation they offered was to either change the signal handling of the container or switch from the consumption environment to a workload profiles environment, which more closely approximates Kubernetes behavior. I chose the latter. Fixing this quirk in ACA is still an outstanding issue.
- We had both I/O- and CPU-bound tasks running on the same workers for simplicity’s sake. I broke these into separate containers. My thinking was that the short tasks would be less likely to cause scaling events if we ran them in a separate queue with higher concurrency; the vast majority of our tasks are short and I/O-bound (see the routing sketch after this list).
- I added a scaling cooldown period to further reduce scaling volatility at peak. Basically, the scaler is now slower to react to a drop in demand.
- Finally, we’ll soon be dropping the heaviest CPU-bound task (OCRmyPDF) and relying on an Azure API service instead. This will probably save money, and I’m glad one of our team members suggested this reduction in complexity: cheaper compute and fewer container dependencies.
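For the queue split mentioned in the second item, the change on the Celery side is mostly routing. Here’s a hypothetical sketch; the task and queue names are made up, not our actual configuration:

```python
# Hypothetical routing config -- task and queue names are illustrative only
from celery import Celery

app = Celery("fax_processor")

# CPU-heavy work stays on the 'celery' queue scaled by the Bicep above;
# short I/O tasks go to a separate 'io' queue consumed by a higher-concurrency worker.
app.conf.task_routes = {
    "tasks.ocr_document": {"queue": "celery"},
    "tasks.notify_downstream_api": {"queue": "io"},
}
```

Each container then starts its worker against the matching queue (Celery’s `-Q`/`--queues` flag) with its own `WORKER_CONCURRENCY`, so a pile-up of short I/O tasks no longer drags the CPU-bound workers through a scale-up/scale-down cycle.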