The triage problem at scale
200+ ETL jobs running daily. When one fails, the on-call engineer needs to determine: is this a data issue, an infrastructure issue, or a code issue? Each category has a different escalation path and a different fix.
Manual triage means reading the error log, checking the upstream data source, checking the job configuration, and cross-referencing recent deploys: 30-45 minutes for a non-obvious failure, even for engineers who know the system well. This reference architecture describes the automated first pass Kernex pipelines are designed to provide.
The pipeline topology
Kernex pipelines are defined in TOML. The triage agent runs three stages:
[[pipeline.stages]]
name = "classify"
prompt_file = "prompts/classify-failure.md"
# Reads error log and emits: data | infra | code | unknown
[[pipeline.stages]]
name = "investigate"
prompt_file = "prompts/investigate.md"
depends_on = ["classify"]
# Branches on classification result
[[pipeline.stages]]
name = "recommend"
prompt_file = "prompts/recommend.md"
depends_on = ["classify", "investigate"]
# Emits: root cause, recommended action, escalation path
The topology file lives in the same repository as the ETL jobs. When the triage logic changes, the change goes through code review. New team members read the TOML to understand what the agent does.
The sandbox angle
The classify stage needs access to the error log files and nothing else. The design intent of kernex-sandbox is exactly this shape of constraint: the agent reads logs in /var/log/etl/ and the OS, not the prompt, is what stops it reading anything else. That is the property an ops team would require before letting an agent near production infrastructure, and it is the bar this scenario is specified against.
The target outcome
Triage runs automatically when a job fails. The on-call engineer receives a summary with root cause and recommended action. For the large share of failures that are data issues (upstream null values, schema drift, late-arriving data), the recommendation is actionable immediately; for the rest, the investigation output narrows the manual work. The wall-time design target for the automated pass is 90-120 seconds.