Data Engineering ICP 1

Data pipeline debugging agent

200+ daily ETL jobs. A TOML-defined triage pipeline version-controlled alongside the jobs. A new team member can understand the triage logic by reading the topology file.

Pipeline triage time reduced from 30-45 minutes to about 2 minutes.

The triage problem at scale

200+ ETL jobs running daily. When one fails, the on-call engineer needs to determine: is this a data issue, an infrastructure issue, or a code issue? Each category has a different escalation path and a different fix.

Before the agent, triage was manual: read the error log, check the upstream data source, check the job configuration, and cross-reference recent deploys. A non-obvious failure took 30-45 minutes, even for engineers who knew the system well.

The pipeline topology

Kernex pipelines are defined in TOML. The triage agent runs three stages:

[[pipeline.stages]]
name = "classify"
prompt_file = "prompts/classify-failure.md"
# Reads error log and emits: data | infra | code | unknown

[[pipeline.stages]]
name = "investigate"
prompt_file = "prompts/investigate.md"
depends_on = ["classify"]
# Branches on classification result

[[pipeline.stages]]
name = "recommend"
prompt_file = "prompts/recommend.md"
depends_on = ["classify", "investigate"]
# Emits: root cause, recommended action, escalation path

The topology file lives in the same repository as the ETL jobs. When the triage logic changes, the change goes through code review. New team members read the TOML to understand what the agent does.

What made this possible

The classify stage needed access to the error log files. kernex-sandbox lets each stage declare its own filesystem read permissions, so the agent can read logs in /var/log/etl/ and nothing else. This was a requirement for running the agent on production infrastructure: the ops team would not approve broader filesystem access.
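To make the "read /var/log/etl/ and nothing else" policy concrete, here is a minimal sketch of a per-stage read allowlist check. It is not how kernex-sandbox is implemented (the source does not describe its mechanism or configuration syntax); the READ_ROOTS table and may_read function are hypothetical names, and the check only illustrates the path-containment idea.

```python
from pathlib import Path

# Hypothetical policy table: stage name -> directories it may read.
# Stages that are not listed get no filesystem read access at all.
READ_ROOTS = {
    "classify": [Path("/var/log/etl")],
}


def may_read(stage: str, candidate: str) -> bool:
    """Return True if `stage` is allowed to read the file at `candidate`.

    The path is resolved first, so ".." segments and symlinks cannot
    be used to step outside an allowed directory.
    """
    target = Path(candidate).resolve()
    return any(
        target.is_relative_to(root.resolve())
        for root in READ_ROOTS.get(stage, [])
    )


if __name__ == "__main__":
    print(may_read("classify", "/var/log/etl/nightly-load.log"))
    print(may_read("classify", "/var/log/etl/../../../etc/passwd"))
    print(may_read("investigate", "/var/log/etl/nightly-load.log"))
```

The default-deny lookup (an unlisted stage gets an empty list) mirrors the ops requirement: access has to be declared explicitly per stage rather than inherited.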

Result

Triage now runs automatically when a job fails. The on-call engineer receives a summary with the root cause and a recommended action. For the 80% of failures that are data issues (upstream null values, schema drift, late-arriving data), the recommendation is immediately actionable. For the remaining 20%, the investigation output cuts manual triage time from 30 minutes to under 10 minutes.

Total wall time for the automated triage: 90-120 seconds.
