LoCoMo Benchmark

Evaluating SoulMate's memory retrieval across 5 configurations on the LoCoMo long-conversation memory benchmark. 1,986 questions across 10 conversations.

🤗 Dataset on HuggingFace

  • RLM Overall (Best): 70.0%
  • vs Baseline: +7pts
  • Questions: 1,986
  • Configs Tested: 5

What This Measures

LoCoMo (Long-term Conversational Memory) is a benchmark from Snap Research that tests how well a memory system can answer questions about long, multi-session conversations. It evaluates four capabilities:

  • Single-hop: Direct factual recall from one conversation turn
  • Multi-hop: Reasoning across multiple conversation turns to synthesize an answer
  • Open-domain: Questions requiring general knowledge + conversation context
  • Temporal: Time-sensitive questions ("What did they say before/after X?")

We tested 5 retrieval configurations in soul.py (SoulMate's memory engine) to find the best strategy for long-term agent memory.

Results

All 5 configurations, ranked by overall score.

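| Config | Overall | Single-hop | Multi-hop | Open-domain | Temporal | Errors |
| --- | --- | --- | --- | --- | --- | --- |
| 🏆 RLM | 69.99% | 54.13% | 82.06% | 55.10% | 39.99% | 159 |
| Hybrid | 65.57% | 45.97% | 79.49% | 56.04% | 29.84% | 149 |
| Auto | 64.05% | 42.56% | 78.46% | 58.75% | 26.72% | 33 |
| Qdrant (RAG) | 63.42% | 36.45% | 78.72% | 59.38% | 26.97% | 0 |
| BM25 | 63.05% | 38.40% | 77.80% | 50.83% | 29.26% | 0 |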
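
The ranking and the headline margin can be re-derived from the overall column. The snippet below is a small sketch that copies the overall scores from the table; the helper names `rank_configs` and `margin_over` are illustrative and are not part of soul.py.

```python
# Overall scores (percent) copied from the results table.
overall = {
    "RLM": 69.99,
    "Auto": 64.05,
    "Hybrid": 65.57,
    "Qdrant (RAG)": 63.42,
    "BM25": 63.05,
}


def rank_configs(scores):
    """Return (config, score) pairs sorted from best to worst overall score."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def margin_over(scores, leader, other):
    """Lead of `leader` over `other` in percentage points, rounded to 2 places."""
    return round(scores[leader] - scores[other], 2)


ranking = rank_configs(overall)
for name, score in ranking:
    lead = margin_over(overall, "RLM", name)
    print(f"{name:<14}{score:6.2f}%  RLM lead: {lead:+.2f} pts")
```

RLM's lead over BM25 (the baseline) comes out to 6.94 points, which is the "+7pts" figure quoted at the top of the page.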

Overall Score

RLM 70.0%, Hybrid 65.6%, Qdrant 63.4%, BM25 63.1%

By Category

  • Single-hop (direct recall): RLM 54.1%, Hybrid 46.0%, BM25 38.4%, Qdrant 36.5%
  • Multi-hop (cross-turn reasoning): RLM 82.1%, Hybrid 79.5%, Qdrant 78.7%, BM25 77.8%
  • Temporal (time-sensitive): RLM 40.0%, Hybrid 29.8%, BM25 29.3%, Qdrant 27.0%
  • Open-domain (general knowledge + context): Qdrant 59.4%, Hybrid 56.0%, RLM 55.1%, BM25 50.8%

Competitor Comparison

soul.py vs other memory systems on LoCoMo. Competitor numbers from their published evaluations.

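| System | Overall | Single-hop | Multi-hop | Open-domain | Temporal |
| --- | --- | --- | --- | --- | --- |
| XMem (Gemini 3-flash) | 91.5% | 90.6% | 92.3% | 91.2% | 91.9% |
| Memobase | 75.8% | 70.9% | 46.9% | 77.2% | 85.1% |
| Zep | 75.1% | 74.1% | 66.0% | 67.7% | 79.8% |
| 🏆 soul.py (RLM) (Gemini 2.0 Flash) | 70.0% | 54.1% | 82.1% | 55.1% | 40.0% |
| Mem0g (YC 24) | 68.4% | 65.7% | 47.2% | 75.7% | 58.1% |
| Mem0 (YC 24) | 66.9% | 67.1% | 51.2% | 72.9% | 55.5% |
| LangMem | 58.1% | 62.2% | 47.9% | 71.1% | 23.4% |
| OpenAI | 52.9% | 63.8% | 42.9% | 62.3% | 21.7% |

Context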

soul.py RLM (70.0%) beats Mem0g (68.4%), Mem0 (66.9%), LangMem (58.1%), and OpenAI (52.9%) overall, and its multi-hop reasoning score (82.1%) is second only to XMem (92.3%). However, it trails XMem (91.5%), Memobase (75.8%), and Zep (75.1%) on overall scores, and it scores well below all three on single-hop and temporal questions. Note: XMem uses Gemini 3-flash while soul.py used Gemini 2.0 Flash, so the comparison is not model-matched, and model choice matters significantly.

Key Insights

Finding #1

RLM is the clear winner. Recurrent Language Model memory outperforms all other retrieval strategies by 4 to 7 percentage points overall and leads every category except open-domain. Its lead over the next-best config is largest on temporal reasoning (+10pts) and single-hop recall (+8pts).

Finding #2

Temporal reasoning is the hardest category for every soul.py config. Even the best config (RLM) only hits 40%, and the others land between 27% and 30%. This is the frontier: understanding time-ordered events in long conversations remains challenging.

Finding #3

Multi-hop is surprisingly strong across the board. All configs score roughly 78 to 82% on cross-turn reasoning. The orchestration and chunking strategy handles multi-step inference well regardless of retrieval backend.

Finding #4

Pure RAG (Qdrant) wins on open-domain (59.4%): vector similarity search naturally surfaces broadly relevant context. But it is the worst at single-hop (36.5%) and near the bottom on temporal (27.0%, just above Auto's 26.7%), where precise recall matters more.

Finding #5

Hybrid provides a modest lift, but not enough. Combining BM25 + Qdrant gains about 2.2 to 2.5pts over either alone. RLM's reinforcement layer adds another ~4.4pts on top; the learning component is what really moves the needle.

Configurations Explained

| Config | How It Works |
| --- | --- |
| BM25 | Keyword-based retrieval (TF-IDF-style term weighting). Fast, no embeddings. Baseline. |
| Qdrant (RAG) | Vector similarity search using embeddings. Standard RAG approach. |
| Hybrid | BM25 + Qdrant combined with reciprocal rank fusion. Best of both lexical + semantic. |
| RLM | Recurrent Language Model memory: compresses and maintains a running belief state across turns, enabling structured recall without explicit retrieval. Based on the RLM architecture (Dec 2025). |
| Auto | Dynamically selects between configs per query based on question type. Currently running. |
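
For readers unfamiliar with reciprocal rank fusion, the sketch below shows the general technique the Hybrid config is described as using. It is not soul.py's implementation: the function name, the document IDs, and the constant `k=60` (a commonly used default for RRF) are assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranked list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, not a soul.py setting.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 hits from a lexical and a vector retriever.
bm25_hits = ["turn_12", "turn_03", "turn_40"]
vector_hits = ["turn_03", "turn_27", "turn_12"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)
```

Because RRF uses only rank positions, it does not need BM25 scores and cosine similarities to be on the same scale, which is why it is a common choice for combining lexical and semantic retrievers.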

Methodology

Dataset: LoCoMo-10, a set of 10 long conversations sampled from the LoCoMo benchmark (Snap Research). Each conversation contains 25 to 30 sessions with 196 to 260 questions.


Scoring: Gemini 2.0 Flash as the evaluator LLM. Scores are 0 to 1 per question (partial credit). Category scores are averaged across all questions in that category.
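
The aggregation described here can be sketched as follows. This is an illustration under stated assumptions, not the actual evaluation script: the record layout (`category`, `score`) is hypothetical, a score of `None` stands for a question skipped because of an API error, and the overall score is assumed to be the mean over all scored questions (the page does not say how the overall figure is weighted).

```python
from collections import defaultdict


def aggregate(results):
    """Average per-question scores (0 to 1) by category.

    `results` is a list of dicts with keys 'category' and 'score'.
    A score of None marks a skipped question, which is excluded from
    every average rather than counted as wrong.
    """
    per_category = defaultdict(list)
    skipped = 0
    for record in results:
        if record["score"] is None:
            skipped += 1
            continue
        per_category[record["category"]].append(record["score"])

    category_means = {cat: sum(vals) / len(vals) for cat, vals in per_category.items()}
    scored = [s for vals in per_category.values() for s in vals]
    # Assumption: overall is the question-weighted mean of all scored answers.
    overall = sum(scored) / len(scored) if scored else 0.0
    return category_means, overall, skipped


# Hypothetical judge outputs for four questions.
sample = [
    {"category": "temporal", "score": 1.0},
    {"category": "temporal", "score": 0.5},
    {"category": "single-hop", "score": 0.0},
    {"category": "single-hop", "score": None},  # skipped (API error)
]
category_means, overall_score, skipped_count = aggregate(sample)
print(category_means, overall_score, skipped_count)
```

Under these assumptions, configs with skipped questions are averaged over a smaller set than error-free configs, which is worth keeping in mind when comparing rows.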


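Errors: RLM and Hybrid configs hit Gemini API rate limits (429/503), resulting in 159 and 149 skipped questions respectively out of 1,986; Auto skipped 33. These are excluded from scoring (not counted as wrong), so RLM and Hybrid are scored on roughly 1,830 questions each rather than the full set. BM25 and Qdrant ran error-free.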
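
One common way to reduce skips of this kind is to retry rate-limited calls with exponential backoff and jitter. The sketch below is a generic pattern, not something the benchmark run is known to have used: `TransientAPIError`, `call_with_backoff`, and all parameter values are hypothetical, and a real client would map its own exception types onto the status check.

```python
import random
import time


class TransientAPIError(Exception):
    """Hypothetical error carrying an HTTP status code from an API call."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


RETRYABLE_STATUSES = {429, 503}  # rate limited / service unavailable


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on 429/503 with exponential backoff plus jitter.

    Non-retryable errors, and the final failed attempt, are re-raised.
    `sleep` is injectable so the behaviour can be tested without waiting.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientAPIError as err:
            last_attempt = attempt == max_attempts - 1
            if err.status not in RETRYABLE_STATUSES or last_attempt:
                raise
            # Delay doubles each attempt; jitter spreads out concurrent retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

A caller would wrap each evaluator request, for example `call_with_backoff(lambda: judge(question))` where `judge` is whatever function issues the request, and only record a question as skipped once the retries are exhausted.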