A prompt-only eval assumes the prompt is the main thing shaping behavior. That was closer to true when assistants were stateless. It is less true for agents. Modern agents read context, call tools, summarize progress, retrieve history, and write memory that affects later sessions.

The failure mode is subtle: the prompt passes, but the product still behaves differently in production because the memory state is different. A stale preference, a missing policy, an outdated tool instruction, or an over-broad retrieved note can change the answer more than a prompt wording tweak.

Memory is behavioral context

Shared memory can act like a durable tool guide. A file such as shared/AGENTS.md can tell an agent how to use tools, what not to do, which policies are read-only, and how to resolve conflicts. A user file such as user/preferences.md can steer tone, recommendations, ranking, and default assumptions.

The Living Wiki paper gives a useful research frame: a maintained knowledge base can include both content and a separate governing protocol. In product terms, that means the memory layer is not only a database of facts. It can also contain behavioral instructions that persist across sessions.

If an agent reads a memory before answering, include that memory in the eval fixture. Otherwise you are testing a cleaner system than the one users actually experience.

What to evaluate instead

LongMemEval is useful because it breaks memory into concrete capabilities: extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention. Product teams can adapt the same spirit without copying the benchmark. The question is not "did the prompt answer correctly?" It is "did the agent use the right memory, update it correctly, and avoid stale or unsupported memory?"

01

Memory read set

Which files or retrieved records were injected before the answer?

02

Memory write diff

What changed after the interaction, and was the change justified?

03

Conflict handling

Did the agent notice newer facts, corrections, policy priority, and uncertainty?

04

Abstention

Did the agent avoid using memory when the record was missing, stale, or ambiguous?

A memory-aware eval loop

Start every eval case with an explicit memory fixture. Run the agent. Capture the answer, tool calls, memory reads, and memory writes. Then score both behavior and state transition. A correct answer with a bad memory write is not a pass; it is a future regression waiting to happen.

case: "Returning user asks for recommendations"
initial_memory:
  user/profile.md: "Prefers quiet neighborhoods. Budget is strict."
  shared/AGENTS.md: "Never infer budget changes without user confirmation."
expected:
  reads: ["user/profile.md", "shared/AGENTS.md"]
  answer: "respects quiet + budget constraints"
  writes: "no budget mutation unless user explicitly changes it"

This is where inspectable memory matters. If your memory is only an embedding or an opaque extraction pipeline, eval failures are harder to diagnose. If memory is a scoped file with revisions and access logs, the eval can point to the exact state that caused the behavior.

The practical claim

Prompt quality still matters. But once your agent has memory, behavior is a function of prompt, tools, memory state, retrieval, and write policy. Teams that only eval prompts are optimizing one part of a larger system.