We publish results on the two standard memory benchmarks that serious competitors use: LongMemEval (retrieval) and LoCoMo (conversational QA). Raw run outputs are committed to the repository for full reproducibility.
Scores are from public papers or official benchmark runs unless marked approximate. Every R@5 number below is from the 50-question single-session-user subset of LongMemEval, the methodology every competitor publishes against, so the comparison stays apples-to-apples. AIS scores are from production API runs against ais.agentsandswarms.ai.
| System | LongMemEval R@5 (subset) | LoCoMo F1 | Notes |
|---|---|---|---|
| MemPalace | 96.6% | — | Local only, ChromaDB + SQLite, method-of-loci; hierarchical pre-filter before vector search |
| Supermemory | 85.4% | — | Cloud, 5-layer stack |
| Mem0 | ~80.0% | — | Cloud, graph memory (approx. — no public LongMemEval run located) |
| AISus | 46.0% | 0.8% | 2026-04-15 subset run · 50/500 questions. LoCoMo F1 from 2026-04-20 extraction-only run (no LLM step). |
Sources: the MemPalace R@5 is from the LongMemEval paper, the Supermemory score is from their published blog, and the Mem0 score is approximate (no public LongMemEval run found).
LongMemEval tests retrieval quality: given a question, can the memory system surface the relevant conversation session? Metric is Recall@K — what fraction of ground-truth evidence sessions appear in the top-K results.
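As a minimal sketch of the Recall@K definition above (not the benchmark runner's actual code, which may differ in detail), the metric can be computed per question and then averaged. The function names and the sample session IDs are hypothetical.

```python
def recall_at_k(ranked_session_ids, evidence_session_ids, k):
    """Fraction of ground-truth evidence sessions found in the top-k results."""
    evidence = set(evidence_session_ids)
    if not evidence:
        raise ValueError("question has no ground-truth evidence sessions")
    top_k = set(ranked_session_ids[:k])
    return len(evidence & top_k) / len(evidence)


def mean_recall_at_k(runs, k):
    """Average Recall@K over (ranked_ids, evidence_ids) pairs, one per question."""
    return sum(recall_at_k(ranked, gold, k) for ranked, gold in runs) / len(runs)


# Hypothetical question with two evidence sessions; only one is in the top 5.
ranked = ["s7", "s2", "s9", "s4", "s1", "s3"]
evidence = ["s2", "s3"]
print(recall_at_k(ranked, evidence, 5))   # 0.5
print(recall_at_k(ranked, evidence, 10))  # 1.0
```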
We publish two runs: a subset run (50 single-session-user questions) that matches the methodology competitors publish against, and a full haystack run (all 500 questions, all 6 question types). No competitor has published a full-haystack number, so that one stands alone.
All 500 questions across all 6 types. This is the complete benchmark.
R@1: 16.1%
R@5: 31.9%
R@10: 42.2%
Questions: 500 of 500 (all 6 types)
Memories ingested: 9,492
Duration: 9.1 hours
Raw output: longmemeval-results.json
| Type | N | R@5 | R@10 |
|---|---|---|---|
| single-session-assistant | 56 | 94.6% | 96.4% |
| knowledge-update | 78 | 37.8% | 44.2% |
| single-session-user | 70 | 27.1% | 38.6% |
| single-session-preference | 30 | 23.3% | 30.0% |
| temporal-reasoning | 133 | 19.6% | 33.5% |
| multi-session | 133 | 18.7% | 31.5% |
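As a consistency check, the per-type rows can be combined into the overall figures. The sketch below assumes the overall numbers are question-weighted averages of the per-type recalls, which the source does not state explicitly; under that assumption the table reproduces the reported 31.9% R@5 and 42.2% R@10.

```python
# (N, R@5 %, R@10 %) per question type, copied from the table above.
per_type = {
    "single-session-assistant": (56, 94.6, 96.4),
    "knowledge-update": (78, 37.8, 44.2),
    "single-session-user": (70, 27.1, 38.6),
    "single-session-preference": (30, 23.3, 30.0),
    "temporal-reasoning": (133, 19.6, 33.5),
    "multi-session": (133, 18.7, 31.5),
}

total_n = sum(n for n, _, _ in per_type.values())
# Weight each type's recall by its question count (assumed aggregation).
r5 = sum(n * r for n, r, _ in per_type.values()) / total_n
r10 = sum(n * r for n, _, r in per_type.values()) / total_n

print(total_n, round(r5, 1), round(r10, 1))  # 500 31.9 42.2
```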
50 questions, single-session-user only. Used for apples-to-apples comparison against MemPalace, Supermemory, Mem0 in the table above.
R@1: 30.0%
R@5: 46.0%
R@10: 50.0%
Raw output: longmemeval-2026-04-15.json
The shape of the gap. AIS is near-perfect on single-session-assistant (94.6% R@5) and weakest on multi-session (18.7%) and temporal-reasoning (19.6%) — exactly the areas where hierarchical pre-filtering (PR #982) and temporal columns (PR #980) are built to help. Both ship today but aren't used by the benchmark runner yet. Enabling them is the next planned run.
LoCoMo tests conversational memory: given a long-running conversation between two people, can the memory system answer questions about it? Metric is token-level F1 over answer extraction — the standard SQuAD scoring applied to memory retrieval.
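For readers unfamiliar with SQuAD-style scoring, here is a minimal sketch of token-level F1. The normalization steps (lowercasing, stripping punctuation and English articles) follow the common SQuAD convention; whether the LoCoMo runner normalizes in exactly this way is an assumption, and the example sentence is invented.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize(text):
    """Lowercase, drop punctuation and articles, split on whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall between two strings."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# A full sentence scored against a short gold answer: recall is perfect,
# but precision drops with every extra token.
print(round(token_f1("She adopted a dog in May 2023.", "May 2023"), 2))  # 0.5
```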
10 conversations, 1,986 questions across 5 categories. Extraction-only methodology (no LLM inference step) — see Methodology below.
Evidence-recall (retrieval): 66.7%
F1 (extraction): 0.8%
| Category | N | Evidence-recall | F1 |
|---|---|---|---|
| single-hop | 282 | 62.1% | 1.1% |
| multi-hop | 321 | 50.8% | 0.7% |
| temporal | 96 | 45.8% | 0.9% |
| commonsense | 841 | 72.5% | 1.2% |
| adversarial | 446 | 74.4% | 0.0% |
Reading these numbers honestly. Evidence-recall (66.7% overall) measures whether the retriever surfaced the right session for a question. That is the retrieval problem AIS is built to solve, and it is the strongest published signal here. F1 (0.8% overall) is low because our LoCoMo runner is extraction-only: it computes best-sentence F1 over retrieved memory content with no LLM inference step, so a full sentence scored against a short gold answer loses precision on every extra token. A production answer flow adds an LLM on top of retrieved memories, which is where the F1 gap closes. We publish the raw retrieval-only number to keep the methodology comparison apples-to-apples: competitors using LLM-augmented answer extraction will score much higher on F1 even with weaker retrieval. The retrieval number is what we'd want a buyer to compare on.
The shape of the gap. Evidence-recall is strongest on adversarial (74.4%) and commonsense (72.5%) — both benefit from broad context retrieval. It's weakest on temporal (45.8%) and multi-hop (50.8%) — the same areas LongMemEval flagged. Temporal columns (PR #980) and hierarchical pre-filtering (PR #982) ship today; enabling them in the LoCoMo runner is the next planned run.
LoCoMo complements LongMemEval by testing QA extraction from conversational context rather than session retrieval ranking. LongMemEval asks "which sessions are relevant?"; LoCoMo asks "what is the answer, given the relevant sessions?"
All benchmarks run against the live production API at ais.agentsandswarms.ai: no special configuration, no tuned hyperparameters, no cached results. What you benchmark is what you get.
LongMemEval setup: Sessions ingested as type=context memories via POST /v1/agents/:id/memory. Retrieval via GET /v1/agents/:id/memory?query=...&limit=10. Session IDs encoded in memory content with [SESSION:id] prefix for ground-truth matching.
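The setup above can be sketched without touching the network. The endpoint paths, the type=context value, the query and limit parameters, and the [SESSION:id] prefix come from the description; the https scheme, the JSON field names ("type", "content"), and all helper names are assumptions made for illustration. The response shape is not documented here, so the last helper simply takes a list of returned memory content strings.

```python
import re
from urllib.parse import urlencode

BASE_URL = "https://ais.agentsandswarms.ai"  # scheme is an assumption
SESSION_PREFIX = re.compile(r"^\[SESSION:([^\]]+)\]")


def encode_session(session_id, text):
    """Prefix memory content with its session ID for later matching."""
    return f"[SESSION:{session_id}] {text}"


def decode_session_id(content):
    """Return the session ID from a prefixed memory, or None if absent."""
    match = SESSION_PREFIX.match(content)
    return match.group(1) if match else None


def ingest_request(agent_id, session_id, text):
    """Describe the POST that ingests one session (body fields are assumed)."""
    url = f"{BASE_URL}/v1/agents/{agent_id}/memory"
    body = {"type": "context", "content": encode_session(session_id, text)}
    return "POST", url, body


def retrieval_url(agent_id, question, limit=10):
    """Build the GET URL for a retrieval query."""
    params = urlencode({"query": question, "limit": limit})
    return f"{BASE_URL}/v1/agents/{agent_id}/memory?{params}"


def ranked_session_ids(contents):
    """Map retrieved memory contents to session IDs, keeping first-seen order."""
    ranked = []
    for content in contents:
        session_id = decode_session_id(content)
        if session_id is not None and session_id not in ranked:
            ranked.append(session_id)
    return ranked


retrieved = ["[SESSION:s2] I parked on level 3.", "[SESSION:s9] Lunch at noon.",
             "[SESSION:s2] The garage closes at 10."]
print(ranked_session_ids(retrieved))  # ['s2', 's9']
```

The ranked list of session IDs is what a Recall@K computation would then compare against the ground-truth evidence sessions.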
LoCoMo setup: Conversation sessions ingested the same way. Questions posed as retrieval queries. Answer extraction via best-sentence F1 over retrieved memory content — no LLM inference step, extractive only.
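One plausible reading of "best-sentence F1" is sketched below: split each retrieved memory into sentences, score every sentence against the gold answer, and keep the highest score. This is an illustrative assumption about the runner, not its actual code; the tokenizer is a simplified word-character split, the sentence splitter is a naive regex, and the sample memories are invented.

```python
import re
from collections import Counter


def tokens(text):
    """Simplified tokenizer: lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())


def f1(pred_tokens, gold_tokens):
    """Token-level F1 between two token lists."""
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def best_sentence(memories, gold_answer):
    """Return (sentence, score) for the retrieved sentence with the highest F1."""
    gold = tokens(gold_answer)
    best = ("", 0.0)
    for memory in memories:
        # Naive split after sentence-ending punctuation.
        for sentence in re.split(r"(?<=[.!?])\s+", memory):
            score = f1(tokens(sentence), gold)
            if score > best[1]:
                best = (sentence, score)
    return best


memories = ["I moved to Lisbon last spring. The rent is cheaper there.",
            "My sister lives in Porto."]
sentence, score = best_sentence(memories, "Lisbon")
print(sentence, round(score, 2))  # I moved to Lisbon last spring. 0.29
```

Even when the right sentence is retrieved, the score stays well below 1.0 because the sentence carries tokens the one-word gold answer does not, which is consistent with low extraction-only F1 alongside much higher evidence-recall.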
Raw outputs: Every run commits a timestamped JSON to packages/agent-identity-service/benchmarks/results/. Reproducibility scripts are in the same directory.