Data Engineering ICP 1

Data pipeline debugging agent

A reference architecture for 200+ daily ETL jobs: a TOML-defined triage pipeline version-controlled alongside the jobs, readable by any new team member.

Design target: automated first-pass triage in about 2 minutes, against 30-45 minutes of manual log archaeology.

Reference architecture: a target scenario that Kernex is designed for, not a report of a customer deployment. The metrics are design targets.

The triage problem at scale

200+ ETL jobs running daily. When one fails, the on-call engineer needs to determine: is this a data issue, an infrastructure issue, or a code issue? Each category has a different escalation path and a different fix.

Manual triage means reading the error log, checking the upstream data source, checking the job configuration, and cross-referencing recent deploys: 30-45 minutes for a non-obvious failure, even for engineers who know the system well. This reference architecture describes the automated first pass Kernex pipelines are designed to provide.

The pipeline topology

Kernex pipelines are defined in TOML. The triage agent runs three stages:

[[pipeline.stages]]
name = "classify"
prompt_file = "prompts/classify-failure.md"
# Reads error log and emits: data | infra | code | unknown

[[pipeline.stages]]
name = "investigate"
prompt_file = "prompts/investigate.md"
depends_on = ["classify"]
# Branches on classification result

[[pipeline.stages]]
name = "recommend"
prompt_file = "prompts/recommend.md"
depends_on = ["classify", "investigate"]
# Emits: root cause, recommended action, escalation path

The topology file lives in the same repository as the ETL jobs. When the triage logic changes, the change goes through code review. New team members read the TOML to understand what the agent does.

The sandbox angle

The classify stage needs access to the error log files and nothing else. kernex-sandbox is designed for exactly this kind of constraint: the agent can read logs under /var/log/etl/, and the OS, not the prompt, stops it from reading anything else. That is the property an ops team would require before letting an agent near production infrastructure, and it is the bar this scenario is specified against.
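To make the shape of that constraint concrete, the sketch below expresses the read policy as a testable predicate. It is only an illustration of the policy, not of how kernex-sandbox enforces it: the scenario relies on OS-level enforcement, and an in-process check like this one would not be a substitute. The function name `is_allowed` is hypothetical, and the only path taken from the scenario is `/var/log/etl/`.

```python
from pathlib import Path

# The single directory the classify stage is meant to read from.
ALLOWED_ROOT = Path("/var/log/etl").resolve()


def is_allowed(candidate: str) -> bool:
    """Return True if candidate resolves to a location under ALLOWED_ROOT.

    Hypothetical policy check. resolve() collapses ".." segments and
    follows symlinks that exist, so traversal tricks are evaluated
    against the final location rather than the literal string.
    """
    resolved = Path(candidate).resolve()
    # is_relative_to compares whole path components, so a sibling such as
    # /var/log/etl-backup does not pass as a prefix match.
    return resolved.is_relative_to(ALLOWED_ROOT)


if __name__ == "__main__":
    for path in (
        "/var/log/etl/orders_daily.log",   # hypothetical log file name
        "/var/log/etl/../../etc/passwd",   # traversal attempt
        "/var/log/etl-backup/old.log",     # look-alike sibling directory
    ):
        print(path, "->", "allow" if is_allowed(path) else "deny")
```

Cases like these are useful as acceptance tests for whatever mechanism actually enforces the boundary: the sandbox should deny every path this predicate denies.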

The target outcome

Triage runs automatically when a job fails. The on-call engineer receives a summary with root cause and recommended action. For the large share of failures that are data issues (upstream null values, schema drift, late-arriving data), the recommendation is actionable immediately; for the rest, the investigation output narrows the manual work. The wall-time design target for the automated pass is 90-120 seconds.
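One way to picture the summary the on-call engineer receives is as a small typed record. The sketch below is an assumption about its shape, not a Kernex schema: the four category labels and the three output fields (root cause, recommended action, escalation path) come from the pipeline description above, while the class names, the `job_name` and `elapsed_seconds` fields, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

# Upper end of the 90-120 second wall-time design target.
WALL_TIME_TARGET_SECONDS = 120


class Category(Enum):
    DATA = "data"
    INFRA = "infra"
    CODE = "code"
    UNKNOWN = "unknown"


def parse_category(raw: str) -> Category:
    """Map the classify stage's raw text to a Category.

    Anything that is not one of the four expected labels becomes UNKNOWN
    instead of raising, so a malformed answer still produces a summary.
    """
    try:
        return Category(raw.strip().lower())
    except ValueError:
        return Category.UNKNOWN


@dataclass(frozen=True)
class TriageSummary:
    job_name: str
    category: Category
    root_cause: str
    recommended_action: str
    escalation_path: str
    elapsed_seconds: float

    @property
    def within_target(self) -> bool:
        """True if the automated pass finished inside the design target."""
        return self.elapsed_seconds <= WALL_TIME_TARGET_SECONDS

    def render(self) -> str:
        """Format the summary as plain text for a pager or chat message."""
        return (
            f"[{self.category.value}] {self.job_name}\n"
            f"Root cause: {self.root_cause}\n"
            f"Recommended action: {self.recommended_action}\n"
            f"Escalation path: {self.escalation_path}\n"
            f"Triage time: {self.elapsed_seconds:.0f}s"
        )


# Illustrative values only; not taken from a real incident.
summary = TriageSummary(
    job_name="orders_daily",
    category=parse_category("data"),
    root_cause="Upstream schema drift: a column changed type",
    recommended_action="Update the schema mapping and rerun the job",
    escalation_path="Data platform on-call",
    elapsed_seconds=104.0,
)

if __name__ == "__main__":
    print(summary.render())
```

Recording the elapsed time alongside each summary also makes it possible to track the wall-time target against real runs rather than treating it as a fixed assumption.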
