Why AI agents should not own context, and how a federated architecture where memory is passive and Git is the spine closes the gaps no tool has solved yet.
The memory problem in AI agents is not a lack of tools. It is the absence of separation between who stores context and who executes it. Current solutions centralize memory in the agent, creating tool dependency, semantic contamination across domains and inevitable degradation. This paper proposes federated memory: an architecture where context belongs to the user, domains are isolated and memory is passive and sovereign. Code agents are interchangeable clients that read and write guided by a contract; none is the core. Git is the spine that unifies versioning and synchronization. Governance operates in two declared modes: cooperative (contract, the recommended floor) and adversarial (enforcement through operating system isolation, below the agent). Capture is automatic via the agent's hooks; classification by confidence and risk determines what promotes on its own and what requires human decision. The reference implementation is a Markdown vault versioned in Git, with Context Packs as the minimal unit of context and the filesystem as the access path, evolving to MCP and Graphiti on demand.
Every AI tool looks brilliant in the first few minutes. It understands the request, produces relevant output, and for a moment it seems you finally have a collaborator that follows your reasoning. The problem shows up in the second session. Or the tenth. The tool forgets what was decided, mixes context from distinct projects, repeats questions that were already answered, and needs to be re-educated continuously.
This is not a bug. It is the consequence of an architectural choice: memory, when it exists, belongs to the agent. It lives inside the tool, in the tool's format, accessible only by the tool. When you switch tools, you lose the memory. When the tool updates, the memory degrades. When you use two tools in parallel, each has a different version of reality.
The natural reaction is to try to concentrate everything in one place: a "super brain", a huge context file, a centralized system the agent reads in full before responding. At first this seems to solve it. Over time, the system degrades in another way: a marketing task pulls in code conventions; an old project contaminates a new one; memories from months ago enter as if they were still valid. The agent knows too much and, precisely because of that, begins to err in more sophisticated ways.
The field still has no established answer for this. Memory tools exist (Mem0, Zep, Letta, and native systems such as those in Claude Code and Cursor), but they all start from the same assumption: memory is an internal component of the agent system. The agent manages the memory. The agent decides what to store. The agent decides what to bring into context.
This paper argues that this assumption is the problem, not the solution. And that the way out is not an active component deciding on the agent's behalf: it is separating memory from the executor. Context lives in passive files, versioned and auditable, outside any tool. The agent pulls what it needs, guided by an explicit contract. Sections 03 and 05 develop this separation.
Existing solutions are technically sophisticated. They use vector embeddings, knowledge graphs, context compression, semantic retrieval. The problem is not the technology. It is the architectural assumption that governs where memory lives and who controls it.
Claude Code has CLAUDE.md and a local memory system. Cursor has project rules. Windsurf has Cascade's internal memories. Each implementation works well inside the tool that created it, and fails completely outside it. If you use Claude Code in the morning and Cursor in the afternoon for the same project, the two have different, possibly conflicting memories, with no synchronization or governance mechanism. Memory belongs to the tool, not the project, not the user. This holds even for an agent we meant to use as the base of the architecture: Hermes keeps its own memory (SQLite, MEMORY.md), which competes with the federated vault instead of serving it. Native memory, however good, pulls context back inside the tool.
Mem0, Zep and similar systems solve portability partially: memory lives in an external service that multiple tools query. But they introduce three new problems. First, dependency on external infrastructure: the memory is now in another company's cloud. Second, absence of explicit governance: what enters is decided automatically, without human approval, and hypotheses become permanent facts. Third, absence of semantic isolation: all memory lives in a single vector space, and distinct projects contaminate each other through semantic proximity.
Graphiti (Zep) and similar systems add a temporal and relational dimension. They are technically richer, but suffer from the same governance problem: who decides what enters the graph? By what criterion does a piece of conversation become a persistent node? Automation without curation produces graphs that grow but do not necessarily improve.
In all existing approaches, memory is an internal component of the agent system. The agent stores the context and the same agent executes with it. This concentration is the root of the problem, not an inevitable property of the architecture.
| Solution | Portability | Isolation | Governance | Sovereignty | Cross-agent sharing |
|---|---|---|---|---|---|
| Native memory (Claude Code, Cursor) | None | Partial | None | Tool | Not available by design |
| Mem0 / Zep | Partial | None | Automatic | External service | Not available by design |
| Graphiti | Partial | Partial | Limited | External service | Not available by design |
| Centralizing systems / Life OS | Locked to app | None | UI-driven | Application | Not available by design |
| Federated Memory (v3.0) | Total | Per domain | Risk-proportional (automatic for low risk, human for hypotheses and high risk) | User | Via shared vault (hive mind: structure ready, scale on the roadmap) |
Before describing the error, separate three layers that readers confuse all the time. Confusing them is the root of the problem:
The error of current approaches is placing memory inside the agent, fusing two functions that should be separate: storing context and executing based on it. When both live in the same component, any failure in one contaminates the other.
An agent that stores its own memory has a structural incentive to store too much: more context feels safer. An agent that executes based on memory it governs itself has no audit mechanism: memory errors become silent execution errors. And because the memory dies with the tool, context is held hostage by whoever executes it.
Mature software systems solved this decades ago with separation of concerns. Databases do not execute business logic. Application servers do not manage persistence directly. Data lives outside the process that consumes it, versioned and auditable. AI agent architecture has not yet internalized this principle.
And routing? In v1.0 of this paper, the temptation was to treat routing as a third function: an active and central component that decided what the agent reads. It is not. Read routing is the agent itself pulling what it needs, guided by the contract and by static versioned files. There is no central router, and there does not need to be. What provides auditing and correction is not an active router, it is memory living outside the agent, in versioned files that any client reads and any human inspects.
Federated memory is an architecture where context belongs to the user, domains are semantically isolated, routing is explicit and governed, and agents are temporary consumers of context, not its owners.
User sovereignty. The base belongs to the user. Agents come and go; context stays portable, human-readable and versionable. No tool should be a prerequisite for accessing one's own memory.
Domain isolation. Context from distinct domains (marketing, engineering, writing, research) should not share the same semantic space. Cross-domain contamination is prevented by structure, not by instructions to the agent.
Neutral consumption contract. A central contract file describes how any agent should consume the memory. Tool-specific adapters translate this contract but do not replace it. The contract is hierarchical: company and dev rules live in 00-global/RULES.md and apply across all projects; when a project needs to deviate from a rule, it registers an override in 70-decisions/ with the reason, responsible party, and date. An approved override takes precedence, and the deviation remains traceable.
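The precedence between global rules and project overrides can be sketched as a small lookup. This is a minimal sketch, not the reference implementation: the `Override` record, its field names, the sample rules, and the choice to let the most recent approved override win when several exist are all assumptions for illustration. Only the locations 00-global/RULES.md and 70-decisions/ come from the text.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Override:
    """A project-level deviation as it might be recorded in 70-decisions/ (hypothetical shape)."""
    rule: str
    value: str
    reason: str
    responsible: str
    decided_on: date
    approved: bool


def effective_rule(rule: str, global_rules: dict[str, str],
                   overrides: list[Override]) -> Optional[str]:
    """Return the value in force: an approved override if one exists, else the global rule."""
    approved = [o for o in overrides if o.rule == rule and o.approved]
    if approved:
        # Assumption: with several approved overrides, the most recent one applies.
        return max(approved, key=lambda o: o.decided_on).value
    return global_rules.get(rule)


# Hypothetical sample data standing in for 00-global/RULES.md and 70-decisions/.
global_rules = {"commit-style": "conventional", "test-framework": "pytest"}
overrides = [
    Override("test-framework", "unittest", "legacy suite", "alice", date(2025, 3, 1), approved=True),
    Override("commit-style", "free-form", "prototype phase", "bob", date(2025, 4, 1), approved=False),
]

print(effective_rule("test-framework", global_rules, overrides))
print(effective_rule("commit-style", global_rules, overrides))
```

An unapproved override changes nothing, which keeps the deviation traceable without letting a pending request alter behavior.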
Minimum sufficient context. The agent receives the context needed for the task at hand, not the full history. Context Packs are the delivery unit: lean, task-oriented packages with an explicit scope of what to use and what to avoid.
Human as last-instance auditor. Capture is automatic via the agent's hooks. Classification is the agent's responsibility (confidence + risk). Approval is risk-proportional: verified low-risk entries promote automatically with TTL; hypotheses and high risk require explicit human decision. The human is not a capture bottleneck but a quality filter.
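The risk-proportional rule reduces to a two-branch decision. The sketch below assumes two confidence levels and two risk levels, taken from the wording of the text; the enum names, the `Capture` record and the returned labels are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Confidence(Enum):
    VERIFIED = "verified"
    HYPOTHESIS = "hypothesis"


class Risk(Enum):
    LOW = "low"
    HIGH = "high"


@dataclass(frozen=True)
class Capture:
    """One captured item, already classified by the agent (hypothetical shape)."""
    text: str
    confidence: Confidence
    risk: Risk


def route(capture: Capture) -> str:
    """Only verified + low risk promotes on its own; everything else waits for a human."""
    if capture.confidence is Confidence.VERIFIED and capture.risk is Risk.LOW:
        return "auto-promote-with-ttl"
    return "human-review"


print(route(Capture("Project uses Python 3.12", Confidence.VERIFIED, Risk.LOW)))
print(route(Capture("Client may drop the EU market", Confidence.HYPOTHESIS, Risk.LOW)))
```

Defaulting every unlisted combination to human review keeps the automatic path narrow by construction.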
v1.0 stated these five principles and stopped. Practice revealed three gaps that needed to be closed before the architecture was robust enough to be adopted by third parties.
Context Packs were delivered, agents consumed them, and no one measured whether the package had been useful. Bad packs remained in use indefinitely because there was no rejection mechanism. The solution is structural, not cultural: each pack gets a Validation: field documenting how usage is logged, what the automatic review trigger is (three consecutive bad marks) and how the human marks a pack as obsolete. Without feedback, there is no improvement cycle. With structured feedback, bad packs have a finite half-life.
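The automatic review trigger can be sketched as a check over a pack's usage log. This assumes usage marks are stored as an ordered list of the strings "good" and "bad", and that "three consecutive" means the three most recent marks; both are assumptions, since the text does not fix the log format.

```python
def needs_review(marks: list[str], threshold: int = 3) -> bool:
    """True when the most recent `threshold` usage marks are all 'bad'."""
    if len(marks) < threshold:
        return False
    return all(mark == "bad" for mark in marks[-threshold:])


print(needs_review(["good", "bad", "bad", "bad"]))
print(needs_review(["bad", "bad", "good", "bad"]))
```

A single good mark resets the streak, so a pack is only flagged when it fails repeatedly without recovering.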
When two notes state conflicting things, which wins? In v1.0 the answer was implicit ("whoever revised most recently") and therefore non-deterministic: different agents reached different answers. v2.0 establishes the explicit rule: the most recent entry with status: approved wins. Entries with status: superseded remain in history but are ignored at runtime. Timestamp alone does not decide; status is mandatory. In the absence of status, the agent must ask the human, not guess. This structure (discrete entries with status and frontmatter, rather than a monolithic context file rewritten on every update) prevents by construction what the ACE paper calls context collapse: the degradation that occurs when full rewrites of context erase hard-won strategies. In federated memory, each decision is a discrete versioned entry; new information enters as incremental deltas via the inbox, never as a destructive rewrite. History is preserved by design.
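The conflict rule is deterministic enough to write down directly. In this minimal sketch the `Entry` record and the `NeedsHumanDecision` exception are hypothetical names, and treating "no approved entry at all" as a case for the human is an assumption; the approved/superseded semantics and the ask-when-status-is-missing rule come from the text.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


class NeedsHumanDecision(Exception):
    """Raised when the rule cannot decide and the agent must ask instead of guessing."""


@dataclass(frozen=True)
class Entry:
    claim: str
    updated: date
    status: Optional[str]  # "approved", "superseded", or None when the field is missing


def resolve(entries: list[Entry]) -> Entry:
    """Most recent entry with status 'approved' wins; superseded entries are ignored."""
    if any(e.status is None for e in entries):
        raise NeedsHumanDecision("an entry has no status; ask the human")
    approved = [e for e in entries if e.status == "approved"]
    if not approved:
        # Assumption: with nothing approved, the agent also defers to the human.
        raise NeedsHumanDecision("no approved entry to apply")
    return max(approved, key=lambda e: e.updated)


entries = [
    Entry("Use Postgres", date(2025, 1, 10), "approved"),
    Entry("Use Postgres 16", date(2025, 1, 20), "approved"),
    Entry("Use SQLite", date(2025, 2, 1), "superseded"),  # newest, but ignored
]
print(resolve(entries).claim)
```

Because the rule reads only fields stored with the entries, any agent that applies it to the same vault reaches the same answer.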
The principle of human approval works when the human is in front of the screen. The problem begins when the agent runs in automatic mode: scheduled, headless, in CI. The v3 answer is not a core that mediates calls, because that core does not exist. It is two declared modes. In cooperative mode, the default, the rule lives in the contract and in the structure: reading is open across the vault, permanent writing outside /90-inbox/ is turned into a suggestion. The agent cooperates because the contract asks it to, and capture via hooks records what matters. In adversarial mode, against an agent that decides to ignore the contract, no component on the agent's side holds back the write. Enforcement comes from below: operating system isolation (section 05).
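In cooperative mode, "permanent writing outside /90-inbox/ is turned into a suggestion" can be pictured as a path mapping that a cooperating agent applies before writing. This is a sketch of that convention only; the `suggestion--` file-naming scheme is hypothetical, and nothing here stops an agent that chooses to skip the function.

```python
from pathlib import PurePosixPath

INBOX = "90-inbox"


def cooperative_target(requested: str) -> PurePosixPath:
    """Map a vault-relative write path to the path a cooperating agent actually writes."""
    path = PurePosixPath(requested.lstrip("/"))
    if not path.parts or ".." in path.parts:
        raise ValueError("write path must name a location inside the vault")
    if path.parts[0] == INBOX:
        return path  # inbox writes go through unchanged
    # Any other destination becomes a suggestion file inside the inbox.
    flattened = "--".join(path.parts)
    return PurePosixPath(INBOX) / f"suggestion--{flattened}"


print(cooperative_target("70-decisions/db-choice.md"))
print(cooperative_target("/90-inbox/note.md"))
```

The mapping is advisory by design: it documents what the contract asks for, while real enforcement belongs to the operating system.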
v2 described Hermes as an active core that mediated access to context, enforced policy and resolved conflict. Field testing and the official Hermes documentation showed that this component does not exist as described: Hermes is a full code agent with its own memory, not a gatekeeper that mediates other agents. The characterization was wrong. This section replaces it.
Before the correction, three layers must be separated, the ones readers confuse all the time, and whose confusion is the architectural error of section 03:
The correction is simple. Federated memory is passive and sovereign: a set of versioned Markdown files. It does not route, does not enforce policy, does not mediate anything. Code agents are interchangeable clients that read and write in that vault, guided by a contract (AGENT.md) and by the folder structure. Hermes is one of them. None is the core.
The four functions v2 assigned to a core still exist, but live elsewhere:
status: approved beats superseded (section 04) lives in the files themselves. Any agent reading the vault applies the same rule, because it is in the data, not in a component.
hermes.policy.yml described in v2 does not exist in the real Hermes.
What holds the architecture together is Git. The vault is a Git repository. The agent works on a clone: it reads with pull, writes with commit and push. In a single piece, Git delivers cross-machine synchronization, versioning, authorship, diff and rollback, with an open and neutral mechanism that does not break sovereignty nor lock you to a vendor.
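The read-with-pull, write-with-commit-and-push cycle can be sketched as a fixed command sequence. This sketch assumes the clone already has a configured remote and upstream branch; `sync_write` and its injectable `run` parameter are hypothetical names, introduced so the sequence can be inspected without touching a real repository.

```python
import subprocess
from typing import Callable, Sequence


def _run_git(command: Sequence[str]) -> None:
    """Default runner: execute the command and fail loudly on a non-zero exit."""
    subprocess.run(list(command), check=True)


def sync_write(vault_dir: str, paths: Sequence[str], message: str,
               run: Callable[[Sequence[str]], object] = _run_git) -> list[list[str]]:
    """Pull, stage the given paths, commit and push, in that order."""
    git = ["git", "-C", vault_dir]
    commands = [
        git + ["pull"],                  # read the latest state first
        git + ["add", "--", *paths],     # stage only what the agent wrote
        git + ["commit", "-m", message],
        git + ["push"],                  # publish so other clones can pull it
    ]
    for command in commands:
        run(command)
    return commands


# Dry run: record the commands instead of executing them.
recorded: list = []
sync_write("/path/to/vault", ["90-inbox/note.md"], "capture: session note", run=recorded.append)
for command in recorded:
    print(" ".join(command))
```

Staging explicit paths rather than everything keeps each commit attributable to one write, which is what makes the later diff and rollback useful.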
The architecture grows in steps, under pain, not by default:
Climbing a step buys scale and search. It does not buy write discipline for free: that is always the OS or custom code, at any step.
Hence the threat model, declared openly:
/90-inbox/, or a separate user with no write access to protected folders. It never comes from the agent's side.
Passive memory is not a weakness: it is the condition of sovereignty. A vault that depends on an active core to exist is tied to that core. A vault that is just versioned files belongs to the user and works with any agent. When enforcement is needed, it comes from the operating system, below the agent, never from the agent itself.
Multimodal support is accommodated by the /assets/ structure per domain and project. The ability to process images, audio and PDFs depends on the LLM configured in the agent, not on the agent itself. Check the provider documentation for current capabilities.
Centralizing systems treat all memory content as valid until a human explicitly marks it obsolete. In practice, this means outdated content stays in circulation for months or years, because no one reviews what is not visibly broken.
Federated memory has a structural advantage over centralizing systems on this point: content lives as files in the user's operating system. Files have an mtime. Guided by the contract, the client agent, or a capture hook, inspects the mtime of each file listed under Use: when assembling a Context Pack. If any file has gone more than ninety days without an update, the pack carries a warning attached to the output.
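The mtime inspection needs nothing beyond the standard library. The sketch below assumes the files listed under Use: have already been resolved to paths; the function names and the warning wording are hypothetical, while the ninety-day threshold comes from the text.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import Optional, Sequence

SECONDS_PER_DAY = 86_400
STALE_AFTER_DAYS = 90  # threshold named in the text


def stale_files(paths: Sequence[Path], now: Optional[float] = None,
                max_age_days: int = STALE_AFTER_DAYS) -> list[Path]:
    """Return the files whose mtime is older than max_age_days."""
    now = time.time() if now is None else now
    limit = max_age_days * SECONDS_PER_DAY
    return [p for p in paths if now - p.stat().st_mtime > limit]


def pack_warning(paths: Sequence[Path], now: Optional[float] = None) -> Optional[str]:
    """Build the warning attached to a pack's output, or None if nothing is stale."""
    stale = stale_files(paths, now)
    if not stale:
        return None
    names = ", ".join(p.name for p in stale)
    return f"WARNING: not updated in over {STALE_AFTER_DAYS} days: {names}"


# Demo with two temporary files, one backdated by 100 days.
with tempfile.TemporaryDirectory() as tmp:
    fresh = Path(tmp) / "fresh.md"
    old = Path(tmp) / "old.md"
    fresh.write_text("recent note")
    old.write_text("old note")
    backdated = time.time() - 100 * SECONDS_PER_DAY
    os.utime(old, (backdated, backdated))
    demo_stale = stale_files([fresh, old])
    demo_warning = pack_warning([fresh, old])
    print(demo_warning)
```

The function only reports; whether a stale file blocks the pack is a separate policy decision.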
The choice to warn, and not block, is deliberate. Old content is not necessarily wrong. Engineering principles may have been written a year ago and still hold. But old content deserves a moment of checking before being used as the basis for a new decision. The warning serves that purpose without becoming friction.
Centralizing systems have no equivalent. Memory lives as opaque objects inside the application, with no exposed mtime and no inspection possible by an external component. Federated memory inherits this capability from the file system: it was not invented, it was recognized.
The default behavior is informative. The human can escalate by configuring a hook, or instructing the agent via the contract, to block packs with content older than a threshold (for example, six months). For packs from volatile domains (regulatory, ML, market), the threshold can be lower; for stable domains (principles, fundamentals), higher. Configuration is per pack, not global, because aging speed is content-specific.
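Per-pack escalation can be expressed as a small policy record attached to each pack. This is a sketch under assumptions: the `AgingPolicy` shape, the pack names and the specific thresholds are hypothetical examples, chosen only to show a volatile domain with a short blocking threshold next to a stable one that keeps the informative default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgingPolicy:
    """Aging rule for one pack: how old is too old, and what happens then."""
    max_age_days: int
    on_stale: str = "warn"  # informative by default; "block" is the opt-in escalation


def check_age(age_days: float, policy: AgingPolicy) -> str:
    """Return 'ok' within the threshold, otherwise the pack's configured reaction."""
    if age_days <= policy.max_age_days:
        return "ok"
    return policy.on_stale


# Hypothetical per-pack configuration.
policies = {
    "regulatory-brief": AgingPolicy(max_age_days=30, on_stale="block"),
    "engineering-principles": AgingPolicy(max_age_days=365),
}

print(check_age(45, policies["regulatory-brief"]))
print(check_age(45, policies["engineering-principles"]))
print(check_age(400, policies["engineering-principles"]))
```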
Federated memory is not superior in every scenario. An honest comparison requires acknowledging where other approaches have the advantage.
For users with a single tool, a single project and low need for portability, native memory is the right choice. Zero setup friction, perfect integration with the tool's flow, no additional infrastructure. The cost (tool dependency, lack of portability) only appears when context grows beyond what the tool can hold.
There is a step below this architecture, and it is legitimate. A single agent, few projects, little accumulated context: a global contract file with rules and preferences, plus one per project with local context, versioned in Git, work well. In Claude Code that file is CLAUDE.md; in the neutral convention adopted by several agents, it is AGENTS.md. But that is contract, not memory: it does not accumulate reviewed knowledge, does not isolate domains and has no review cycle. This architecture starts paying off when context grows, when a second agent comes in, or when memory needs to be reviewed and auditable. Before that, the simple step is the right choice, and whoever starts with AGENTS.md is already, without realizing it, at the embryo of the neutral contract this architecture formalizes.
For teams that need shared memory across multiple users, with automatic semantic retrieval and no manual curation, Mem0 and Zep are a better fit. Federated requires curation discipline that does not scale automatically to large teams without additional processes.
There is a category of projects that takes the opposite approach to federated: concentrating memory, automation, integrations and agents in a single monolithic application, often called "Personal AI", "Life OS" or "productivity super-app". They win on surface usability: ready-made UI, packaged integrations, visible community, near-zero adoption curve. The user installs, connects services and has something working in minutes.
The cost shows up later. Memory lives inside the application, in the application's format. Switching tools means starting from scratch. The architecture is, by design, centralizing: exactly the anti-pattern that motivated this paper. For those who prioritize adoption speed over sovereignty, it is the right choice; for those who prioritize portability and governance, it is a problem dressed up as a solution.
For users who work with multiple agents in parallel, multiple work domains, and who need real portability, where memory must survive a tool switch, federated is the only approach that solves the problem structurally. The differentiator is not a clever component, it is the sum of four properties that none of the others delivers together: sovereignty (memory is human-readable and belongs to the user), portability (no specific tool required to access it), auditing (versioned in Git, with authorship and history) and domain isolation. All of it with open mechanisms, without locking you to a vendor.
The DecisionNode project [1], developed independently, reached partially overlapping conclusions: decisions as a structured unit, access via MCP by multiple agents, auditable history, an implicit blackboard pattern. It specifically solves the decision sub-module with semantic search via embeddings, a concrete implementation of part of the principles proposed here. The independent convergence of distinct projects toward similar conclusions is evidence that the problem is real and the architectural direction is valid.
As multiple specialized agents coexist in the same ecosystem, the blackboard pattern evolves into what we call the hive mind: each agent keeps its private memory, but reusable knowledge is published to a shared area of the vault that any agent can consult. Federated reading, controlled writing. No agent directly alters another's memory. Improvements are proposed, not imposed.
The Agentic Context Engineering (ACE) paper, published in 2025 by researchers from Stanford, SambaNova Systems and UC Berkeley (arXiv:2510.04618), arrived independently at a formulation that validates the architecture proposed here. ACE treats context as a "playbook" that evolves over time, maintained by three roles: Generator (executes the task and produces the trajectory), Reflector (analyzes successes and errors and extracts lessons) and Curator (consolidates the lessons into incremental updates of the playbook). The parallel with the federated architecture is direct: the executing agent (Claude Code, Cursor, etc.) is the Generator; capture via hooks with confidence + risk classification is the Reflector; the review-inbox with promote-skills and the approved/superseded rule is the Curator; and the federated vault is the playbook. The difference is philosophical and is the differentiator of this architecture: in ACE, the three roles are fully automatic, executed by LLMs. In federated memory, the Curator keeps the human as the last-instance auditor. Low risk consolidates automatically; hypotheses and high risk require human decision. In one phrase: this architecture is ACE with risk-proportional governance. ACE reports gains of +10.6% on agent tasks and +8.6% on domain reasoning with no fine-tuning, evidence that the direction (context as an evolving playbook, not a static prompt) is technically sound.
Systems like Paperclip (paperclip.ing) solve an adjacent but distinct problem. They are not memory systems; they are agent-team orchestration platforms, with org charts, per-agent budgets, scheduled heartbeats and hierarchical governance. The distinction matters: federated memory is context infrastructure per agent. Orchestration is coordination infrastructure between agents. Whoever uses both layers together has the most complete solution: persistent and governed context on one side, structured coordination on the other. Paperclip even uses its own SKILLS.md for runtime context injection, which converges with the /50-skills/ folder of this architecture. They are complementary projects, not competitors.
At the other end of the spectrum, Pi (pi.dev) is minimalist by design and has no native MCP. It has its own AGENTS.md, skills and automatic compaction. Integration with the federated vault happens directly through the filesystem: running Pi inside the vault already loads the AGENTS.md automatically, with no need to configure an MCP server. Federated memory adds what Pi does not offer on its own: portability across machines, domain isolation and governance with human approval.
Adopting federated memory does not buy you every problem solved. Six limitations deserve to be acknowledged explicitly, not to justify the architecture, but to make clear where it ends and where the user needs to fill the gap.
The architecture governs what enters context and organizes what can be written, but it does not undo the action the agent already executed. If an agent, upon receiving a Context Pack, produces bad code, sends a wrong email or applies a change to an external system, federated memory has no way to revert it. Rollback remains a problem of the target system (Git for code, sent folder for email, transactions for databases). Federated memory improves the chance the agent does the right thing; it does not guarantee the action is reversible when it fails.
This limitation is partially mitigated by two things. First, pre-action logging: before executing high-risk actions, the agent records in /99-archive/pre-action-log.md the active context, the loaded pack, the executing agent and the target resource. It is not rollback, it is traceability. Second, because the vault is a Git repository, every write to the memory itself is a reversible commit. For the memory, Git covers rollback; for the external action, the target system has to cover it.
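Pre-action logging amounts to appending one line before the risky call. In this sketch the log location /99-archive/pre-action-log.md and the four recorded fields come from the text; the line format, the function names and the sample values are assumptions.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

LOG_RELATIVE_PATH = Path("99-archive") / "pre-action-log.md"


def format_entry(when: datetime, agent: str, pack: str, context: str, target: str) -> str:
    """One Markdown list item per action (hypothetical line format)."""
    return (f"- {when.isoformat()} | agent={agent} | pack={pack} "
            f"| context={context} | target={target}\n")


def log_pre_action(vault_root: Path, agent: str, pack: str, context: str,
                   target: str, when: Optional[datetime] = None) -> str:
    """Append the entry to the log before the high-risk action runs."""
    when = when or datetime.now(timezone.utc)
    entry = format_entry(when, agent, pack, context, target)
    log_path = vault_root / LOG_RELATIVE_PATH
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(entry)
    return entry


# Demo against a throwaway directory standing in for the vault.
with tempfile.TemporaryDirectory() as tmp:
    vault = Path(tmp)
    log_pre_action(vault, "claude-code", "deploy-api", "engineering/api", "prod-db",
                   when=datetime(2025, 1, 15, 9, 30, tzinfo=timezone.utc))
    log_pre_action(vault, "cursor", "newsletter", "marketing/email", "mailing-list")
    demo_log = (vault / LOG_RELATIVE_PATH).read_text(encoding="utf-8")
    print(demo_log)
```

Because the log lives inside the vault, each appended line is itself versioned once committed.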
The mtime inspection warns when Context Pack files have passed the configured age threshold. The warning is informative, not blocking by default. Old content may still be valid; freshly updated content may be wrong. The architecture does not replace content review, it makes the point where review is needed more visible. Confusing a validity warning with a correctness guarantee is an interpretation error the system does not prevent.
This limitation is mitigated by the review_date field in Context Packs and Decisions. Instead of relying only on the file's mtime, the client agent, or a hook, checks whether next_review has expired, independent of whether the file was modified. The human who reviews the content and confirms it is still valid updates review_date without altering the file. The system distinguishes unmodified content from reviewed content.
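The distinction between unmodified and reviewed content can be sketched with two metadata fields. This sketch assumes review_date records the last human confirmation and next_review records when the next one is due; that reading of the two field names, the 90-day default interval, and treating a missing next_review as due are all assumptions.

```python
from datetime import date, timedelta
from typing import Optional


def review_due(next_review: Optional[date], today: date) -> bool:
    """Due when next_review is missing or has been reached, regardless of file mtime."""
    return next_review is None or today >= next_review


def confirm_review(meta: dict, today: date, interval_days: int = 90) -> dict:
    """Return new metadata after a human confirms the content is still valid."""
    return {
        **meta,
        "review_date": today,
        "next_review": today + timedelta(days=interval_days),
    }


meta = {"title": "API conventions", "next_review": date(2025, 1, 1)}
today = date(2025, 6, 1)

print(review_due(meta["next_review"], today))
confirmed = confirm_review(meta, today)
print(confirmed["next_review"])
print(review_due(confirmed["next_review"], today))
```

Confirmation touches only the metadata, so content that was reviewed and left unchanged stops raising a flag while its body stays byte-identical.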
With automatic classification by confidence + risk, what is verified + low promotes on its own with TTL, and the inbox accumulates only hypotheses and high-risk decisions. The human dependency is restricted to that: if you never review the inbox, hypotheses and high-risk items never become memory, while the automatic flow keeps operating for the rest. Capture hooks (see the guide, section 09c) reduce the dependency on manual action, but do not eliminate review of the consequential cases. This is an assumed operating premise, not a hidden flaw: cooperative mode presupposes a human who does quality curation on what matters.
The reference implementation was exercised at small scale (one person, few domains, a handful of projects). Behavior at medium or large scale is architectural projection, not field data. But scale is not a ceiling, it is the maturity ladder described in section 05: the floor is a Markdown vault + Git; under pain of real time, continuous synchronization is added; under pain of volume, an MCP server as common access and Graphiti as an index derived from the Markdown come in; and physical isolation in multiple vaults appears when NDA or sensitive client data require it. Each step enters under concrete pain, not by anticipation.
The hive mind structure is implemented: /50-skills/published/, /proposed/, /deprecated/, INDEX.md and the publication protocol via inbox. The promote-skills.mjs and update-index.mjs scripts automate TTL promotion and index updates. What is still roadmap: integration with Graphiti for relations between skills, semantic search in the index, and a local worker that indexes automatically once the ecosystem of specialized agents grows enough for sharing to make operational sense.
This is the limitation a field test exposed and that honesty requires us to declare. In the recommended setup (Markdown vault + Git + contract), write governance is contract, not a lock. An agent that cooperates deposits suggestions in /90-inbox/ and respects protected folders because AGENT.md asks it to. An agent that decides to ignore the contract writes wherever it wants, because the folders are writable by whoever runs the agent. No component on the agent's side prevents this. Whoever needs real enforcement, against a potentially hostile agent, climbs to the adversarial mode described in section 05: control comes from the operating system, below the agent (a container with a read-only mount except for the inbox, or a separate user with no write access to protected folders). Climbing a step on the maturity ladder buys scale and search, never write discipline for free.
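One way to picture the container variant of adversarial mode is as a pair of bind mounts: the whole vault read-only, with the inbox mounted writable on top. The sketch below only builds the argument list and assumes Docker as the container runtime, which the text does not name; the image name and the /vault mount point are hypothetical.

```python
def isolation_args(vault_host_path: str, image: str,
                   inbox: str = "90-inbox", mount_point: str = "/vault") -> list[str]:
    """Build a `docker run` command: vault read-only, inbox read-write."""
    return [
        "docker", "run", "--rm",
        # The whole vault is visible to the agent but cannot be modified.
        "-v", f"{vault_host_path}:{mount_point}:ro",
        # The inbox is mounted separately so it stays writable.
        "-v", f"{vault_host_path}/{inbox}:{mount_point}/{inbox}:rw",
        image,
    ]


# Hypothetical image name; printing the command instead of running it.
print(" ".join(isolation_args("/home/me/vault", "agent-image:latest")))
```

The restriction is applied by the runtime outside the agent's process, which is the property the contract alone cannot provide.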
The AI agent field matured quickly on the capability layer: reasoning, code generation, document analysis. The memory layer fell behind, and the gap is becoming the main limiter of continuous use. It is not a lack of agent intelligence. It is a lack of context continuity across sessions, projects and tools.
Current solutions treat memory as a product feature: something each tool implements internally, in a proprietary way, with the user as a passive beneficiary. This paper's proposal inverts that relationship: memory is the user's infrastructure, not the product's. Agents are temporary consumers of a context that belongs to whoever does the work, not to whoever provides the tool.
Adopting this architecture does not depend on changes to AI agents. It depends on a shift of assumption about where memory lives. When memory belongs to the user (readable, versionable, portable, governed), switching tools stops being a loss.
This architecture refuses four paths. It refuses a hosted backend as a service: it breaks sovereignty. It refuses automatic memory without review: it is the central anti-pattern. It refuses a single universal adapter: each agent has its own conventions, and a generic adapter becomes the lowest common denominator. It refuses RAG on the critical path: embeddings help with search, but never replace Markdown as the source.
What this architecture accepts: being slower to adopt than a monolithic application; requiring curation discipline that automatic tools do not; asking the user to learn the contract before delegating; and operating, by default, in cooperative mode, where write governance is contract and not a lock. Enforcement against a hostile agent is not the contract's job: it is operating system hardening, outside the default scope and available to whoever needs it (section 08). The price of sovereignty is active participation.
One mechanism choice follows from this. A proprietary sync, such as Obsidian Sync, solves real-time synchronization but does not version and ties the base to a vendor. That is why the reference architecture uses Git: it synchronizes and versions in a single piece, with an open mechanism, and proprietary continuous sync stays as an optional layer on top, never as the spine.
[1] DecisionNode. github.com/decisionnode/DecisionNode. Reference implementation for the decision sub-module with vector embeddings and access via MCP.
[2] Hermes Agent, NousResearch. github.com/NousResearch/hermes-agent
[3] Graphiti, Zep. help.getzep.com/graphiti/graphiti/overview
[4] Model Context Protocol. modelcontextprotocol.io
[5] ACE (Agentic Context Engineering), Stanford University, SambaNova Systems, UC Berkeley. arXiv:2510.04618. Evolving-context framework with Generator, Reflector and Curator roles.
[6] For the full implementation with step-by-step, templates and vault structure: see the Implementation Guide available in the repository.