Skip to main content
7 min

The Most Useful Thing Your AI Agents Can Do Is Audit Themselves

We pointed our agents at our own knowledge base and asked what they should be doing that they weren't. Then we built the fixes.

AI AgentsSelf-ImprovementAutomationKnowledge Management

The Most Useful Thing Your AI Agents Can Do Is Audit Themselves

I have a knowledge base with a few hundred notes from YouTube videos. Transcripts, topic tags, channel metadata, timestamps. Each note is structured the same way, sitting in an Obsidian vault, quietly accumulating. I also have dozens of cron scripts, a handful of active agents, skill definitions, automation pipelines, and a growing backlog of things I keep meaning to wire together.

Last week I tried something different. Instead of asking my agents to build the next feature on the list, I pointed them inward. I gave them access to the knowledge base and to the infrastructure that actually runs -- the cron jobs, the skill definitions, the automation configs -- and asked a simple question:

What do you know about that you're not doing?


The Gap Nobody Checks

Here's the thing about running AI agents for any length of time: you accumulate knowledge faster than you accumulate automation. Every article you save, every video you transcribe, every note you tag -- it all goes into some kind of knowledge store. And that knowledge store contains insights about practices, patterns, and techniques that your infrastructure doesn't actually implement.

Your notes might contain detailed breakdowns of output validation patterns. Your automation might not validate any of its own output. Your knowledge base might have three different articles about event-driven architecture. Your agents might still run on fixed cron schedules with no awareness of external triggers.

The knowledge is there. Nobody's cross-referencing it against what's actually running.

This is a structural blind spot, and it exists in every system I've seen that combines knowledge management with automation. The two layers grow independently. The knowledge layer gets smarter. The automation layer stays the same. And the delta between what you *know about* and what you *do* widens quietly over time.

How the Audit Works

The mechanics are less interesting than the idea, but they're straightforward. You need two things: a way to inventory what your agents know about (topics, techniques, patterns referenced in your knowledge base), and a way to inventory what your agents actually do (cron schedules, skill definitions, active pipelines, webhook registrations).

I had both. The knowledge base is structured Obsidian notes with topic tags. The infrastructure inventory is a combination of crontab entries, skill manifests, and Paperclip task records. Getting a machine-readable view of each took maybe twenty minutes.

The agent's job was to find the delta. Look at everything the knowledge base references -- validation patterns, learning loops, event-driven triggers, composition strategies, feedback mechanisms -- and check whether the infrastructure has a corresponding implementation. Not a vague mention in a config file. An actual running system that does the thing.

The results weren't surprising in hindsight. They were the kind of obvious-once-you-see-it gaps that you never see because you're always looking forward at the next feature instead of sideways at what's already there.

What the Gaps Looked Like

A few categories stood out.

Output goes unchecked. Agents produce work -- processed articles, triaged items, generated content -- and the system accepts it unconditionally. Nothing grades the output. Nothing asks "was this good?" The knowledge base had multiple notes about quality rubrics and validation frameworks. The infrastructure had zero.

Sessions start from scratch. Every time an agent spins up, it begins with no memory of what it learned last time. If it figured out that a particular RSS feed is consistently low-quality, or that a certain topic cluster needs deeper analysis, that insight dies with the session. The knowledge base had notes about expertise accumulation and institutional memory. The agents had neither.

Skills exist in isolation. I had individual skills that worked well on their own -- a summarizer, a triage scorer, a content classifier -- but nothing that chained them together into multi-step workflows. Each skill was an island. The knowledge base referenced pipeline composition and workflow orchestration repeatedly. The infrastructure was a collection of standalone scripts.

Loops run on vibes. Several automation loops had no clear exit condition. They'd run until they "felt done" or until a timeout killed them. No measurable completion criteria. No way to distinguish "finished" from "gave up." The knowledge base had notes about deterministic completion and measurable exit conditions. The loops had while True with a prayer.

Nothing reacts to events. Everything ran on schedules. A new item arrives in a queue? It waits for the next cron cycle. An external service sends a webhook? Nobody's listening. The knowledge base had extensive notes on event-driven architecture. The infrastructure was purely time-driven.

What We Built in the Same Session

Once you see the gaps, the fixes are surprisingly tractable. We built five things in the same session that surfaced the problems.

Self-Grading Cron Jobs

The idea: every cron job that produces output should grade its own work. Not with a language model staring at it philosophically -- with a rubric. A checklist of concrete quality signals specific to that job's output type.

The implementation is a rubric validator that runs as a post-step on cron jobs. Each job type has a YAML rubric defining what "good output" looks like for that domain. The RSS triage job checks whether items were classified with confidence above a threshold. The content processing job checks whether summaries preserved key entities from the source. The validator scores the output, logs the result, and flags runs that fall below the rubric's floor.

The point isn't perfection. It's closing the loop. Before this, a job could produce garbage for a week and nobody would know until a human happened to look. Now the system knows immediately.

Expertise That Survives Sessions

Agents now write YAML expertise files when they learn something useful during a session. Not raw conversation logs -- structured observations. "This RSS feed produces mostly duplicate content." "This topic cluster has high knowledge-base coverage, deprioritize." "This channel's transcripts are consistently low quality due to auto-generated captions."

The next session loads relevant expertise files before starting work. The agent doesn't start from zero. It starts from where the last session left off, at least for the domain knowledge that matters.

This is a simple pattern -- write structured files, read them on startup -- but the behavioral difference is significant. Agents that accumulate expertise between sessions make noticeably better decisions by the third or fourth cycle. They stop re-learning the same lessons.

Skill Composition

Individual skills got wired into multi-step pipelines. A skill orchestrator takes a workflow definition -- "run the extractor, pass output to the classifier, pass that to the summarizer, write the result" -- and executes it as a single composed operation.

The important design decision: skills remain atomic. They don't know they're part of a pipeline. The orchestrator handles the plumbing. This means any skill can participate in any pipeline without modification. The same summarizer works in the content pipeline, the research pipeline, and the triage pipeline.

Deterministic Completion

Every loop got a measurable exit condition. Not "run for a while and stop." An actual criterion: "process all items in the queue," "reach confidence threshold on classification," "complete all steps in the workflow definition."

We call the old pattern a Ralph Wiggum loop -- it runs, it does stuff, it's not clear when or why it stops, and afterward you can't tell if it finished or just wandered off. The replacement pattern is explicit: define done, measure progress toward done, exit when done, log whether you got there.

Event-Driven Dispatch

A webhook listener that can trigger agent sessions based on external events. A new item arrives in a monitored queue? Start a processing session. A dependency comes back online after an outage? Drain the pending work. A scheduled report completes? Trigger the distribution pipeline.

The dispatcher is lightweight -- it maps event types to session templates and fires them. The sessions themselves are normal agent sessions. The only difference is what starts them: an event instead of a clock.

This is the one that felt most like unlocking a capability that should have existed from the beginning. So much of what agents do is *reactive* work triggered by *scheduled* runs. The mismatch between the work's nature and its scheduling is pure waste.

The Meta-Pattern

The specific fixes matter less than the pattern that produced them. The most valuable thing your agents can do isn't build the next feature. It's audit themselves.

Point them at what you know -- your docs, your notes, your saved articles, your knowledge base -- and then point them at what you do -- your cron jobs, your configs, your running infrastructure. Ask them to find the delta. The gap between knowledge and action is where the highest-leverage improvements live, and it's the one place most people never look because they're too busy looking outward.

Your agents probably know more than they do. The knowledge is already there, sitting in your vault, your notes, your transcripts. Nobody's cross-referencing it against your actual systems. Do that first. Build outward second.

The best audit your agents can run is on themselves.


Related posts:


*Edgeless Lab builds infrastructure for autonomous AI systems.*

Related Posts