Important: Please ensure you have access to production so that you are able to handle incidents!
Background
Jobs are an important functionality of PostHog apps / plugins. Among other things, they power much of our exports functionality.
In order to debug jobs not working properly, it's important to understand the following:
- Jobs can be triggered from plugin servers with any capability
- Jobs can only be processed from plugin servers with the jobs capability
In our Cloud environments, plugin server capabilities can be inferred from deployment names. To debug the jobs processing pipeline, you'll be looking at the plugins-jobs-xxxxx
pods.
It's also important to know that in our Cloud environments, jobs are not stored in our main Postgres database. Rather, we store them in a separate RDS instance that is used only for jobs.
The jobs pipeline works as follows:
- Enqueue job into
jobs
Kafka topic from any plugin server instance - Plugin server with jobs capability consumes from Kafka and persists the job in the jobs database (via Graphile Worker)
- The Graphile Worker in a plugin server instance with the jobs capability pulls the jobs from the jobs database when it's time, runs them, and deletes them from the database
Debugging
There are a few potential services that can cause our jobs processing pipeline to have issues:
- Kafka: We may be failing to add jobs to the
jobs
Kafka topic. Potential reasons: Kafka is down, jobs messages have gotten larger than the default Kafka limit of 1mb, we've shipped a bug causing jobs messages to be malformed. - Jobs database: The jobs database may be oversaturated or unreacheable.
- Plugin server: The plugin server could have stopped enqueueing / processing jobs because it is oversaturated or we've shipped a bug.
Actions
The most straightforward and safe operation to perform is to trigger a "restart" of the jobs pods. This can be done with a redeploy or using kubectl
to spin up an entire new set of pods.
Example:
kubectl rollout restart deployment/ingestion-plugins-jobs -n posthog
If this doesn't fix the issue, you should try to establish the health of both Kafka and the jobs database. Provided they look healthy, we've likely shipped a bug and should look at Git history and revert any suspicious changes.