Memory Benchmarks

We publish results on the two standard memory benchmarks that serious competitors use: LongMemEval (retrieval) and LoCoMo (conversational QA). Raw run outputs are committed to the repository for full reproducibility.

Comparison

Scores come from public papers, vendor posts, or official benchmark runs; the one exception is Mem0, whose figure is an approximation because no public LongMemEval run was located. Every R@5 number below is from the 50-question single-session-user subset of LongMemEval, the methodology every competitor publishes against, so the comparison stays apples-to-apples. AIS scores are from production API runs against ais.agentsandswarms.ai.

System | LongMemEval R@5 (subset) | LoCoMo F1 | Notes
MemPalace | 96.6% | - | Local only, ChromaDB + SQLite, method-of-loci; hierarchical pre-filter before vector search
Supermemory | 85.4% | - | Cloud, 5-layer stack
Mem0 | ~80.0% | - | Cloud, graph memory (approx.; no public LongMemEval run located)
AIS (us) | 46.0% | 0.8% | 2026-04-15 subset run · 50/500 questions. LoCoMo F1 from 2026-04-20 extraction-only run (no LLM step).

Sources: MemPalace R@5 is from the LongMemEval paper; Supermemory's is from their published blog; Mem0's is approximate (no public LongMemEval run found).

LongMemEval

LongMemEval tests retrieval quality: given a question, can the memory system surface the relevant conversation session? Metric is Recall@K — what fraction of ground-truth evidence sessions appear in the top-K results.
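A minimal sketch of the Recall@K definition above, for readers who want to check a number by hand. The function names and the sample data are illustrative assumptions, not the benchmark runner's code, and the official scorer may aggregate differently.

```python
from typing import Iterable, Sequence, Tuple


def recall_at_k(ranked_session_ids: Sequence[str],
                evidence_session_ids: Iterable[str],
                k: int) -> float:
    """Fraction of ground-truth evidence sessions found in the top-k results."""
    evidence = set(evidence_session_ids)
    if not evidence:
        raise ValueError("a question needs at least one evidence session")
    top_k = set(ranked_session_ids[:k])
    return len(evidence & top_k) / len(evidence)


def mean_recall_at_k(questions: Sequence[Tuple[Sequence[str], Iterable[str]]],
                     k: int) -> float:
    """Average per-question recall@k over (ranked results, evidence) pairs."""
    scores = [recall_at_k(ranked, evidence, k) for ranked, evidence in questions]
    return sum(scores) / len(scores)


# Illustrative data only; not taken from any benchmark run.
questions = [
    (["s3", "s7", "s1"], {"s1"}),        # the evidence session sits at rank 3
    (["s2", "s9", "s4"], {"s4", "s8"}),  # one of two evidence sessions retrieved
]
print(mean_recall_at_k(questions, k=1))  # 0.0
print(mean_recall_at_k(questions, k=3))  # 0.75
```

With this definition a question with two evidence sessions only scores 1.0 when both appear in the top K, which is one reason multi-session questions are harder than single-session ones.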

We publish two runs: a subset run (50 single-session-user questions) that matches the methodology competitors publish against, and a full haystack run (all 500 questions, all 6 question types). No competitor has published a full-haystack number, so that one stands alone.

Full haystack — 2026-04-16

All 500 questions across all 6 types. This is the complete benchmark.

R@1: 16.1%
R@5: 31.9%
R@10: 42.2%

Questions: 500 of 500 (all 6 types)

Memories ingested: 9,492

Duration: 9.1 hours

Raw output: longmemeval-results.json

Per question type

Type N R@5 R@10
single-session-assistant 56 94.6% 96.4%
knowledge-update 78 37.8% 44.2%
single-session-user 70 27.1% 38.6%
single-session-preference 30 23.3% 30.0%
temporal-reasoning 133 19.6% 33.5%
multi-session 133 18.7% 31.5%
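As a consistency check, the per-type rows can be combined into a question-weighted average and compared with the headline full-haystack figures. This sketch assumes the headline numbers are micro-averages over all 500 questions; the row values are copied from the table above, and the variable and function names are illustrative.

```python
# (question type, N, R@5 %, R@10 %) copied from the per-type table.
rows = [
    ("single-session-assistant", 56, 94.6, 96.4),
    ("knowledge-update", 78, 37.8, 44.2),
    ("single-session-user", 70, 27.1, 38.6),
    ("single-session-preference", 30, 23.3, 30.0),
    ("temporal-reasoning", 133, 19.6, 33.5),
    ("multi-session", 133, 18.7, 31.5),
]


def micro_average(table, column):
    """Weight each row's percentage by its question count."""
    total = sum(row[1] for row in table)
    return sum(row[1] * row[column] for row in table) / total


total_n = sum(row[1] for row in rows)
r5 = micro_average(rows, 2)
r10 = micro_average(rows, 3)
print(total_n, round(r5, 1), round(r10, 1))  # prints: 500 31.9 42.2
```

The weighted averages match the published 31.9% R@5 and 42.2% R@10, and they show how much the two 133-question types (temporal-reasoning and multi-session) pull the overall score down.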

Subset comparator — 2026-04-15

50 questions, single-session-user only. Used for the apples-to-apples comparison against MemPalace, Supermemory, and Mem0 in the table above.

R@1: 30.0%
R@5: 46.0%
R@10: 50.0%

Raw output: longmemeval-2026-04-15.json

The shape of the gap. AIS is near-perfect on single-session-assistant (94.6% R@5) and weakest on multi-session (18.7%) and temporal-reasoning (19.6%) — exactly the areas where hierarchical pre-filtering (PR #982) and temporal columns (PR #980) are built to help. Both ship today but aren't used by the benchmark runner yet. Enabling them is the next planned run.

LoCoMo

LoCoMo tests conversational memory: given a long-running conversation between two people, can the memory system answer questions about it? Metric is token-level F1 over answer extraction — the standard SQuAD scoring applied to memory retrieval.
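For reference, a sketch of SQuAD-style token-level F1. The normalization steps (lowercasing, stripping punctuation and English articles) follow the commonly used SQuAD evaluation recipe; whether the LoCoMo runner normalizes in exactly this way is an assumption, and the function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall between two strings."""
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(round(token_f1("in the red car", "a red car"), 2))  # 0.8
```

Extra tokens in the prediction lower precision, so a long predicted span scores poorly against a short gold answer even when it contains that answer.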

First run — 2026-04-20

10 conversations, 1,986 questions across 5 categories. Extraction-only methodology (no LLM inference step) — see Methodology below.

Evidence-recall (retrieval): 66.7%
F1 (extraction): 0.8%

Conversations: 10

Questions: 1,986

Memories ingested: 5,880

Raw output: locomo-results.json

Per category

Category N Evidence-recall F1
single-hop 282 62.1% 1.1%
multi-hop 321 50.8% 0.7%
temporal 96 45.8% 0.9%
commonsense 841 72.5% 1.2%
adversarial 446 74.4% 0.0%

Reading these numbers honestly. Evidence-recall (66.7% overall) measures whether the retriever surfaced the right session for a question. That is the retrieval problem AIS is built to solve, and it is the strongest published signal here. F1 (0.8% overall) is low because our LoCoMo runner is extraction-only: it computes best-sentence F1 over retrieved memory content with no LLM inference step. A production answer flow adds an LLM on top of retrieved memories, which is where the F1 gap closes. We publish the raw retrieval-only number to keep the methodology comparison apples-to-apples: competitors using LLM-augmented answer extraction will score much higher on F1 even with weaker retrieval. The retrieval number is what we'd want a buyer to compare on.

The shape of the gap. Evidence-recall is strongest on adversarial (74.4%) and commonsense (72.5%) — both benefit from broad context retrieval. It's weakest on temporal (45.8%) and multi-hop (50.8%) — the same areas LongMemEval flagged. Temporal columns (PR #980) and hierarchical pre-filtering (PR #982) ship today; enabling them in the LoCoMo runner is the next planned run.

LoCoMo complements LongMemEval by testing QA extraction from conversational context rather than session retrieval ranking. LongMemEval asks "which sessions are relevant?"; LoCoMo asks "what is the answer, given the relevant sessions?"

Methodology

All benchmarks run against the live production API at ais.agentsandswarms.ai — no special configuration, no tuned hyperparameters, no cached results. What you benchmark is what you get.

LongMemEval setup: Sessions ingested as type=context memories via POST /v1/agents/:id/memory. Retrieval via GET /v1/agents/:id/memory?query=...&limit=10. Session IDs encoded in memory content with [SESSION:id] prefix for ground-truth matching.
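The setup above can be sketched as three small helpers that build the requests and recover session IDs from retrieved content. Only the paths, the type=context value, the query and limit parameters, and the [SESSION:id] prefix come from this page. The https scheme, the JSON body shape, the "content" field name, and every function name are assumptions for illustration, and the sketch makes no network calls.

```python
import json
import re
from urllib.parse import quote, urlencode

BASE_URL = "https://ais.agentsandswarms.ai"  # host from this page; scheme assumed
SESSION_TAG = re.compile(r"^\[SESSION:([^\]]+)\]")


def ingest_request(agent_id, session_id, transcript):
    """Return (method, url, body) for storing one session as a context memory.

    The JSON body shape is hypothetical; only type=context is documented here.
    """
    url = f"{BASE_URL}/v1/agents/{quote(agent_id, safe='')}/memory"
    body = {"type": "context", "content": f"[SESSION:{session_id}] {transcript}"}
    return "POST", url, json.dumps(body)


def retrieval_url(agent_id, question, limit=10):
    """Build the GET URL that poses a benchmark question as a retrieval query."""
    params = urlencode({"query": question, "limit": limit})
    return f"{BASE_URL}/v1/agents/{quote(agent_id, safe='')}/memory?{params}"


def ranked_session_ids(memory_contents):
    """Recover de-duplicated session IDs, in rank order, from memory contents."""
    ids = []
    for content in memory_contents:
        match = SESSION_TAG.match(content)
        if match and match.group(1) not in ids:
            ids.append(match.group(1))
    return ids


print(retrieval_url("agent-1", "Where did I park?"))
print(ranked_session_ids(["[SESSION:s12] user: hi", "[SESSION:s03] user: ok",
                          "[SESSION:s12] user: again", "untagged memory"]))
```

The ranked list of session IDs is what a Recall@K scorer compares against the ground-truth evidence sessions for each question.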

LoCoMo setup: Conversation sessions ingested the same way. Questions posed as retrieval queries. Answer extraction via best-sentence F1 over retrieved memory content — no LLM inference step, extractive only.
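A sketch of what best-sentence extractive scoring can look like, under stated assumptions: sentences are split on terminal punctuation, tokens are lowercase alphanumeric runs, and the best-scoring sentence across all retrieved memories is kept. The runner's actual splitting and normalization may differ, and the example memory is invented.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase alphanumeric tokens; a simplified normalization."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(prediction, gold):
    """Token-level F1 between a candidate sentence and the gold answer."""
    pred, ref = tokens(prediction), tokens(gold)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def best_sentence_f1(memories, gold_answer):
    """Score every sentence in the retrieved memories and keep the best F1."""
    best_score, best_sentence = 0.0, ""
    for memory in memories:
        for sentence in re.split(r"(?<=[.!?])\s+", memory):
            score = token_f1(sentence, gold_answer)
            if score > best_score:
                best_score, best_sentence = score, sentence
    return best_score, best_sentence


# Invented example: the right memory is retrieved, the gold answer is one word.
memories = ["I adopted a dog last spring. His name is Max and he loves the beach."]
score, sentence = best_sentence_f1(memories, "Max")
print(round(score, 2), "|", sentence)
```

In this example the retriever has the right evidence, yet the best sentence scores only 0.2 because eight of its nine tokens are not part of the gold answer. That illustrates how extractive scoring can report a low F1 alongside a high evidence-recall.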

Raw outputs: Every run commits a timestamped JSON to packages/agent-identity-service/benchmarks/results/. Reproducibility scripts are in the same directory.