Memory Benchmarks

We publish results on the two standard memory benchmarks that serious competitors use: LongMemEval (retrieval) and LoCoMo (conversational QA). Raw run outputs are committed to the repository for full reproducibility.

Comparison

Scores come from public papers or official benchmark runs, except Mem0's, which is an approximation because no public LongMemEval run was located. The R@5 numbers below use the 50-question single-session-user subset of LongMemEval, the methodology competitors publish against, so the comparison stays apples-to-apples. AIS scores are from production API runs against ais.agentsandswarms.ai.

System	LongMemEval R@5 (subset)	LoCoMo F1	Notes
MemPalace	96.6%	—	Local only, ChromaDB + SQLite, method-of-loci; hierarchical pre-filter before vector search
Supermemory	85.4%	—	Cloud, 5-layer stack
Mem0	~80.0%	—	Cloud, graph memory (approx. — no public LongMemEval run located)
AIS	46.0%	0.8%	2026-04-15 subset run · 50/500 questions. LoCoMo F1 from 2026-04-20 extraction-only run (no LLM step).

MemPalace R@5 from LongMemEval paper. Supermemory from their published blog. Mem0 approximate (no public LongMemEval run found).

LongMemEval

LongMemEval tests retrieval quality: given a question, can the memory system surface the relevant conversation session? Metric is Recall@K — what fraction of ground-truth evidence sessions appear in the top-K results.
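As a rough illustration of the metric, the sketch below computes Recall@K for one question and averages it over a set of questions. The function names and the sample session IDs are hypothetical, and averaging per question is an assumption about how the runner aggregates; the committed raw outputs are the authority on the exact procedure.

```python
from typing import Iterable, Sequence, Set, Tuple


def recall_at_k(ranked_session_ids: Sequence[str], evidence_ids: Set[str], k: int) -> float:
    """Fraction of ground-truth evidence sessions found in the top-k results."""
    if not evidence_ids:
        raise ValueError("a question needs at least one evidence session")
    top_k = set(ranked_session_ids[:k])
    return len(top_k & evidence_ids) / len(evidence_ids)


def mean_recall_at_k(runs: Iterable[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Average per-question recall over (ranked_ids, evidence_ids) pairs."""
    scores = [recall_at_k(ranked, evidence, k) for ranked, evidence in runs]
    return sum(scores) / len(scores)


# Hypothetical data: two questions, session IDs invented for illustration.
runs = [
    (["s3", "s7", "s1"], {"s1", "s9"}),  # one of two evidence sessions retrieved
    (["s4", "s2", "s8"], {"s4"}),        # the single evidence session is ranked first
]
print(mean_recall_at_k(runs, k=1))  # 0.5
print(mean_recall_at_k(runs, k=3))  # 0.75
```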

We publish two runs: a subset run (50 single-session-user questions) that matches the methodology competitors publish against, and a full haystack run (all 500 questions, all 6 question types). No competitor has published a full-haystack number, so that one stands alone.

Full haystack — 2026-04-16

All 500 questions across all 6 types. This is the complete benchmark.

R@1: 16.1%

R@5: 31.9%

R@10: 42.2%

Questions: 500 of 500 (all 6 types)

Memories ingested: 9,492

Duration: 9.1 hours

Raw output: longmemeval-results.json

Per question type

Type	N	R@5	R@10
single-session-assistant	56	94.6%	96.4%
knowledge-update	78	37.8%	44.2%
single-session-user	70	27.1%	38.6%
single-session-preference	30	23.3%	30.0%
temporal-reasoning	133	19.6%	33.5%
multi-session	133	18.7%	31.5%
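A quick consistency check, sketched in Python: weighting each question type's recall by its question count should reproduce the overall full-haystack figures. The rows are copied from the per-type table above; only the variable names are invented here.

```python
# (type, N, R@5 %, R@10 %) copied from the per-type table.
rows = [
    ("single-session-assistant", 56, 94.6, 96.4),
    ("knowledge-update", 78, 37.8, 44.2),
    ("single-session-user", 70, 27.1, 38.6),
    ("single-session-preference", 30, 23.3, 30.0),
    ("temporal-reasoning", 133, 19.6, 33.5),
    ("multi-session", 133, 18.7, 31.5),
]

total_n = sum(n for _, n, _, _ in rows)
# Question-count-weighted averages of the per-type recalls.
overall_r5 = sum(n * r5 for _, n, r5, _ in rows) / total_n
overall_r10 = sum(n * r10 for _, n, _, r10 in rows) / total_n

print(total_n, round(overall_r5, 1), round(overall_r10, 1))  # 500 31.9 42.2
```

The two large, low-scoring types (temporal-reasoning and multi-session, 266 of 500 questions) dominate the weighted average, which is why the overall R@5 sits far below the single-session-assistant figure.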

Subset comparator — 2026-04-15

50 questions, single-session-user only. Used for apples-to-apples comparison against MemPalace, Supermemory, and Mem0 in the table above.

R@1: 30.0%

R@5: 46.0%

R@10: 50.0%

Raw output: longmemeval-2026-04-15.json

The shape of the gap. AIS is near-perfect on single-session-assistant (94.6% R@5) and weakest on multi-session (18.7%) and temporal-reasoning (19.6%) — exactly the areas where hierarchical pre-filtering (PR #982) and temporal columns (PR #980) are built to help. Both ship today but aren't used by the benchmark runner yet. Enabling them is the next planned run.

LoCoMo

LoCoMo tests conversational memory: given a long-running conversation between two people, can the memory system answer questions about it? Metric is token-level F1 over answer extraction — the standard SQuAD scoring applied to memory retrieval.
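For readers unfamiliar with SQuAD-style scoring, here is a minimal sketch of token-level F1. The normalization steps (lowercasing, stripping punctuation and English articles) follow the common SQuAD convention; whether the LoCoMo runner normalizes exactly this way is an assumption, and the example strings are invented.

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """SQuAD-style normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall against the gold answer."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical strings for illustration only.
print(token_f1("May 2023", "May 2023"))                        # 1.0
print(token_f1("She adopted a dog in May 2023.", "May 2023"))  # 0.5
```

The second example shows how extra tokens around a correct answer lower precision, and with it F1, even though recall is perfect.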

First run — 2026-04-20

10 conversations, 1,986 questions across 5 categories. Extraction-only methodology (no LLM inference step) — see Methodology below.

Evidence-recall (retrieval): 66.7%

F1 (extraction): 0.8%

Conversations: 10

Questions: 1,986

Memories ingested: 5,880

Raw output: locomo-results.json

Per category

Category	N	Evidence-recall	F1
single-hop	282	62.1%	1.1%
multi-hop	321	50.8%	0.7%
temporal	96	45.8%	0.9%
commonsense	841	72.5%	1.2%
adversarial	446	74.4%	0.0%

Reading these numbers honestly. Evidence-recall (66.7% overall) measures whether the retriever surfaced the right session for a question. That is the retrieval problem AIS is built to solve, and it is the strongest published signal here. F1 (0.8% overall) is low because our LoCoMo runner is extraction-only: it computes best-sentence F1 over retrieved memory content with no LLM inference step. A production answer flow adds an LLM on top of retrieved memories, which is where we expect the F1 gap to close. We publish the raw retrieval-only number to keep the methodology comparison apples-to-apples; competitors using LLM-augmented answer extraction will score much higher on F1 even with weaker retrieval. The retrieval number is what we'd want a buyer to compare on.

The shape of the gap. Evidence-recall is strongest on adversarial (74.4%) and commonsense (72.5%) — both benefit from broad context retrieval. It's weakest on temporal (45.8%) and multi-hop (50.8%) — the same areas LongMemEval flagged. Temporal columns (PR #980) and hierarchical pre-filtering (PR #982) ship today; enabling them in the LoCoMo runner is the next planned run.

LoCoMo complements LongMemEval by testing QA extraction from conversational context rather than session retrieval ranking. LongMemEval asks "which sessions are relevant?"; LoCoMo asks "what is the answer, given the relevant sessions?"

Methodology

All benchmarks run against the live production API at ais.agentsandswarms.ai — no special configuration, no tuned hyperparameters, no cached results. What you benchmark is what you get.

LongMemEval setup: Sessions ingested as type=context memories via POST /v1/agents/:id/memory. Retrieval via GET /v1/agents/:id/memory?query=...&limit=10. Session IDs encoded in memory content with [SESSION:id] prefix for ground-truth matching.
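The sketch below illustrates two pieces of this setup: tagging memory content with the [SESSION:id] prefix so retrieved results can be matched to ground truth, and building the retrieval URL. The helper names are hypothetical, the https scheme is assumed, and the ingest request body is omitted because its fields are not specified here.

```python
import re
from typing import Optional
from urllib.parse import quote, urlencode

BASE_URL = "https://ais.agentsandswarms.ai"  # host from the text; https scheme assumed
SESSION_PREFIX = re.compile(r"^\[SESSION:([^\]]+)\]\s*")


def tag_content(session_id: str, text: str) -> str:
    """Prefix memory content with its session ID before ingestion."""
    return f"[SESSION:{session_id}] {text}"


def session_id_of(content: str) -> Optional[str]:
    """Recover the session ID from retrieved content, or None if untagged."""
    match = SESSION_PREFIX.match(content)
    return match.group(1) if match else None


def retrieval_url(agent_id: str, query: str, limit: int = 10) -> str:
    """Build the GET /v1/agents/:id/memory URL with an encoded query string."""
    path = f"/v1/agents/{quote(agent_id, safe='')}/memory"
    return f"{BASE_URL}{path}?{urlencode({'query': query, 'limit': limit})}"


# Hypothetical agent ID, session ID, and text for illustration.
tagged = tag_content("sess_42", "I parked on level 3.")
print(session_id_of(tagged))  # sess_42
print(retrieval_url("agent-1", "where did I park?"))
```

Mapping each retrieved memory back to a session ID this way yields the ranked list that a Recall@K computation needs.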

LoCoMo setup: Conversation sessions ingested the same way. Questions posed as retrieval queries. Answer extraction via best-sentence F1 over retrieved memory content — no LLM inference step, extractive only.
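One plausible reading of "best-sentence F1" is sketched below: split each retrieved memory into sentences, score every sentence against the gold answer with token-level F1, and keep the maximum. This is an assumption about the runner, not a copy of it; the tokenizer, the sentence splitter, and the sample text are all simplified and hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def tokens(text: str) -> List[str]:
    """Simplified tokenizer: lowercase word characters only."""
    return re.findall(r"\w+", text.lower())


def f1(pred: List[str], ref: List[str]) -> float:
    """Token-level F1 between two token lists."""
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def best_sentence(memories: Iterable[str], gold: str) -> Tuple[str, float]:
    """Return the retrieved sentence with the highest F1 against the gold answer."""
    ref = tokens(gold)
    best = ("", 0.0)
    for memory in memories:
        # Naive split on whitespace that follows sentence-ending punctuation.
        for sentence in re.split(r"(?<=[.!?])\s+", memory.strip()):
            score = f1(tokens(sentence), ref)
            if score > best[1]:
                best = (sentence, score)
    return best


# Hypothetical retrieved memory and gold answer.
sentence, score = best_sentence(
    ["Alice moved to Lisbon in 2021. She loves the weather."], "Lisbon"
)
print(sentence)         # Alice moved to Lisbon in 2021.
print(round(score, 3))  # 0.286
```

Even when the right sentence is retrieved, scoring a whole sentence against a one-word answer is capped by precision, which is one reason extractive F1 can sit well below scores from LLM-generated short answers.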

Raw outputs: Every run commits a timestamped JSON to packages/agent-identity-service/benchmarks/results/. Reproducibility scripts are in the same directory.