Tips and Tricks When Using Kubernetes CronJobs

In my work I have had to create CronJobs that do batch data processing, and along the way I have picked up some tips and tricks that may be useful to others.

Tip 1. Have a self-recovering CronJob

Imagine you have a CronJob that runs every 10 minutes. This job aggregates data from a data source and stores it in some data store in 10-minute intervals. Even if the application architecture is simple, there are many places where it can fail, and some of those failures are not under our control.

For example, some reasons why the job may fail are:

  • Kubernetes node issues

  • Network issues

  • Control plane issues

There are several quick solutions you can implement to deal with these failures.

  1. Lower the activeDeadlineSeconds, adjust the backoffLimit, and set restartPolicy: OnFailure in the Job spec to let the Job retry. With OnFailure, the pod stays on the same node and its failed containers are restarted in place with exponential backoff.

  2. Set restartPolicy: Never and a backoffLimit so that a new pod is created each time one fails, up to the limit (the default backoffLimit is 6). This can be more desirable when the failure is tied to a specific node, since the replacement pod may be scheduled elsewhere (see the manifest sketch below).
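
A minimal sketch of what these settings can look like in a CronJob manifest is shown below; the name, image, and concrete values are placeholders and should be tuned to your workload. It shows option 2 (restartPolicy: Never); option 1 would use OnFailure instead.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: data-aggregator                # hypothetical name
spec:
  schedule: "*/10 * * * *"             # runs every 10 minutes
  jobTemplate:
    spec:
      activeDeadlineSeconds: 540       # give up before the next run starts
      backoffLimit: 3                  # retries before the Job is marked failed
      template:
        spec:
          restartPolicy: Never         # option 1 would use OnFailure instead
          containers:
            - name: aggregator
              image: example.com/aggregator:latest   # placeholder image
```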

Still, there are cases where the job will fail despite the settings above. In such cases, it may be desirable to create a self-recovering CronJob. It will make your life simpler :)

Self-recovering CronJob to the rescue

The idea behind a self-recovering CronJob is to have two CronJobs that execute on different schedules: the active CronJob that does the actual work, and a second one that checks up on it. From here on, I will refer to that second one as the Self-Recovery CronJob.

For example, look at this setup:

  • The active CronJob executes every 10 minutes

  • The Self-Recovery CronJob executes every hour

In this example, the Self-Recovery CronJob checks the last hour of the active CronJob's results and evaluates whether a recovery job needs to run. The implementation of this checking step will vary depending on the application and use case.
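
As a rough sketch (with hypothetical names, images, and arguments, since the check itself depends on your application and data store), the pair of CronJobs could be declared like this:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: aggregator                     # the active CronJob
spec:
  schedule: "*/10 * * * *"             # every 10 minutes
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: aggregator
              image: example.com/aggregator:latest           # placeholder
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: aggregator-recovery            # the Self-Recovery CronJob
spec:
  schedule: "0 * * * *"                # every hour
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: recovery
              image: example.com/aggregator-recovery:latest  # placeholder
              # Hypothetical entrypoint: look at the last hour of results in
              # the data store and re-run aggregation for any missing interval.
              args: ["--recover-window=1h"]
```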

Furthermore, if we set the Self-Recovery CronJob to evaluate the last 2 hours instead, it can even recover itself: if the previous Self-Recovery run fails, the next one can retry the previous window as well.

The production CronJob deployment I have created has been saved by this Self-Recovery CronJob countless times. For short-interval CronJobs, I can't stress enough how important having a self-recovering CronJob is.

Tip 2. Separation of concerns

If your batch process does something similar to an Extract Transform Load (ETL) process, I would recommend separating the components, at least into an Extract/Transform part and a Load part.

For example:

  • The app container will be responsible for extracting and transforming data and writing it to a file. (Extract & Transform)

  • The collector container will be responsible for reading the results and formatting them into a schema that the data store accepts. (Transform & Load)

For the collector, you can use technologies like fluentd, which lets you send the data to a variety of data stores.

The containers should co-exist in the same pod. They shouldn't communicate with each other directly; they only share the same output/input file.
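
The pod template for such a setup could look roughly like the sketch below. The images, container names, and the /data path are placeholders; the shared file lives on an emptyDir volume mounted into both containers.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etl-job                        # hypothetical name
spec:
  schedule: "*/10 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          volumes:
            - name: results            # scratch space shared by both containers
              emptyDir: {}
          containers:
            - name: app                # Extract & Transform
              image: example.com/etl-app:latest      # placeholder
              volumeMounts:
                - name: results
                  mountPath: /data
              # The app writes its results to a file under /data
            - name: collector          # Transform & Load
              image: fluent/fluentd:edge             # placeholder image/tag
              volumeMounts:
                - name: results
                  mountPath: /data
              # The collector reads the file under /data and ships it
              # to the data store
```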

This sort of architecture brings the following benefits:

  • The app will not depend on the data store

  • The data store can be replaced easily

  • No need to reinvent the wheel

When designing your next ETL batch pipeline, you might want to consider having a similar architecture if it satisfies your requirements.

Extra tip: most collectors run as daemons of sorts, which means the collector container will stay up even after the app container has completed its job. To overcome this, you can rely on the Job's activeDeadlineSeconds setting, implement a livenessProbe, or run a custom script as the container's command that terminates the process once it sees a completion flag. The flag can be some message written to the shared file, so the containers can still communicate without talking to each other directly.
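
One way to implement the completion-flag approach is to wrap the collector's entrypoint in a small shell script, sketched below for the collector container from the pod sketch above. The start-collector command, the shared file path, and the "COMPLETED" marker are all hypothetical and depend on the collector and file format you actually use.

```yaml
- name: collector
  image: example.com/collector:latest  # placeholder
  volumeMounts:
    - name: results
      mountPath: /data
  command: ["/bin/sh", "-c"]
  args:
    - |
      # Start the collector in the background (hypothetical entrypoint).
      start-collector &
      COLLECTOR_PID=$!
      # Wait until the app container writes the completion marker
      # into the shared file.
      until grep -q "^COMPLETED$" /data/output.log 2>/dev/null; do
        sleep 5
      done
      # Give the collector time to flush, then stop it and exit 0
      # so the pod can complete successfully.
      sleep 30
      kill "$COLLECTOR_PID"
      wait "$COLLECTOR_PID" || true
```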

Tip 3. Monitoring for errors and missing executions

You will likely need the jobs' logs so you can tell what went wrong when something fails.

For our CronJobs, we used a filebeat container that ships logs to Elasticsearch. When we first deployed the CronJob, we were only monitoring for errors: if error logs showed up, we would get alerted. However, there were times when we did not receive alerts even though the job was failing. It turned out that the pods could not be scheduled at all, so the job failed without ever getting the chance to send logs.

The lesson we learnt is that we have to monitor both for errors and for missing executions. This is a small tip, but it is a very important one to keep in mind when creating scheduled jobs.
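
Our alerting was built on filebeat and Elasticsearch, but as one illustration of the "missing executions" side: if you happen to run Prometheus with kube-state-metrics, a rule along these lines (a sketch, with hypothetical names and thresholds) can catch a CronJob that has stopped being scheduled.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cronjob-missing-executions     # hypothetical name
spec:
  groups:
    - name: cronjob.rules
      rules:
        - alert: CronJobNotScheduled
          # kube-state-metrics exposes the last schedule time as a unix
          # timestamp; alert if the 10-minute job has not been scheduled
          # for more than 30 minutes.
          expr: time() - kube_cronjob_status_last_schedule_time{cronjob="aggregator"} > 1800
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "CronJob {{ $labels.cronjob }} has missed its schedule"
```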

That is all for now. Thanks for reading!