Enforcing AI Accountability from Inside the Agent
AI agents are making real decisions. They are handling patient data, conducting interviews, and moving money. The organizations deploying them log everything. But those logs sit in the operator's own infrastructure, managed by the operator's own team.
So when something goes wrong (e.g. a biased loan decision, a hallucination that costs real money, a misdiagnosis) there's no independently verifiable record of the AI's decision-making process.
From an auditor's perspective, that's self-attested evidence. It proves nothing.
If you're an organization buying high-stakes AI services (healthcare, finance, government, or beyond) you have no visibility or oversight into what a vendor's agent actually did. Every claim they make about its behavior is based on trust.
If a dispute arises, a regulator asks what an AI actually did, or an insurance claim is filed, the AI operator simply exports logs from their own database and says, in effect, "we promise these logs are not doctored." There is no independent proof those logs haven't been tampered with.
In finance, we have independent auditors. In legal, we have chain of custody. In AI, we have "just trust us." That won't remain acceptable as agents take on increasingly high-stakes use cases.
Standards describe what should be logged. Compliance checklists confirm that logging exists. But none of that proves the record hasn't been altered after the fact or selectively omitted when it matters. That requires technology, not policy.
How AgentSystems Notary works
Notary is an open-source mechanism that sits inside the AI agent's code (currently supports LangChain, Agno, and CrewAI). As the agent operates, Notary captures each LLM call and does two things simultaneously:
- Stores the full raw payload (prompts, responses, metadata, timestamps) in the operator's private infrastructure
- Writes a SHA-256 hash (a compact digital fingerprint) of that log entry to independent, write-once storage outside the operator's main logging system, so they can't quietly rewrite history later
The sensitive data never leaves. Only the fingerprint does.
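The dual-write step can be sketched in a few lines. This is a minimal illustration of the concept, not Notary's actual API: the function name `capture_llm_call` and the in-memory stores are hypothetical, and a real deployment would write the fingerprint to external, append-only storage rather than a Python list. The key detail is deterministic serialization, so the same entry always hashes to the same fingerprint:

```python
import hashlib
import json
import time

def canonical_bytes(entry: dict) -> bytes:
    # Deterministic serialization: stable key order and separators,
    # so an identical entry always produces an identical fingerprint.
    return json.dumps(entry, sort_keys=True, separators=(",", ":")).encode("utf-8")

def capture_llm_call(prompt: str, response: str,
                     private_store: list, fingerprint_store: list) -> str:
    # Full raw payload stays in the operator's private infrastructure.
    entry = {"prompt": prompt, "response": response, "timestamp": time.time()}
    private_store.append(entry)
    # Only the SHA-256 fingerprint goes to the independent store.
    digest = hashlib.sha256(canonical_bytes(entry)).hexdigest()
    fingerprint_store.append(digest)
    return digest
```

Because the fingerprint is a one-way hash, the independent store learns nothing about prompt or response content, which is why the sensitive data can stay put.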
When verification is needed, the process is straightforward:
- The AI operator exports the raw logs that require verification
- A verifier regenerates the SHA-256 fingerprints from those logs
- The verifier compares them against the independently stored fingerprints
- A match confirms integrity. A mismatch means tampering or omission.
- No trust required. The math either checks out or it doesn't.
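The verification steps above reduce to recomputing fingerprints and comparing. A minimal sketch, assuming the same canonical JSON serialization used at capture time (the function name `verify_logs` is illustrative, not Notary's API):

```python
import hashlib
import json

def verify_logs(exported_entries: list, independent_fingerprints: list) -> bool:
    # Recompute a SHA-256 fingerprint for each exported log entry
    # using the same canonical serialization as at capture time.
    recomputed = [
        hashlib.sha256(
            json.dumps(e, sort_keys=True, separators=(",", ":")).encode("utf-8")
        ).hexdigest()
        for e in exported_entries
    ]
    # An edited entry changes its hash; a deleted entry leaves an
    # unmatched fingerprint behind. Either way the comparison fails.
    return recomputed == independent_fingerprints
```

Note that comparing the full ordered list, not just set membership, is what catches omissions and reordering as well as edits.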
This doesn't stop mistakes in real time; it makes the record tamper-evident after the fact, when auditors, regulators, or customers need answers.
Why it matters for governance
Governance frameworks like AIUC-1 already call for tamper-evident audit trails (E015). The problem has never been awareness. It's been enforcement. You can write a control that says "maintain tamper-evident logs." But without technology embedded in the agent itself, that control is aspirational.
Notary makes E015 enforceable. It's an official integration in LangChain, the most widely used AI agent framework, meaning any team building on LangChain can add tamper-evident audit trails with minimal code changes.
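To illustrate what "minimal code changes" can look like, here is a framework-agnostic sketch: a decorator that fingerprints each LLM call without touching the surrounding agent logic. This is not Notary's real integration (the actual LangChain hook and the name `notarize` are hypothetical); it only shows why this pattern is cheap to adopt:

```python
import functools
import hashlib
import json

def notarize(fingerprint_sink: list):
    # Decorator sketch: wrap an existing LLM-call function so every
    # call's payload is fingerprinted as a side effect.
    def wrap(llm_call):
        @functools.wraps(llm_call)
        def inner(prompt, **kwargs):
            response = llm_call(prompt, **kwargs)
            entry = {"prompt": prompt, "response": response}
            fingerprint_sink.append(
                hashlib.sha256(
                    json.dumps(entry, sort_keys=True).encode("utf-8")
                ).hexdigest()
            )
            return response
        return inner
    return wrap
```

The calling code and the agent's control flow stay unchanged; only the function definition gains a decorator, which is the property that makes embedded audit trails practical to retrofit.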
Preppr, an AI platform for emergency management, integrated Notary with 25 lines of code and zero changes to their existing agent logic. Their CEO wrote about it here.
For organizations buying AI services, Notary gives you something you've never had: a simple way to independently verify what a vendor's agent actually did.
The question for governance leaders
The technology exists. It's open-source, free, and integrates into the tools developers already use.
The question is whether we’re comfortable leaving AI audit trails as a policy recommendation, or whether we're ready to make them a technical requirement vendors must meet. If you buy high-stakes AI, it may be time to start demanding verifiable logs.