SoulMate Memory — LoCoMo Benchmark Results

What This Measures

LoCoMo (Long-Context Memory) is a benchmark from Snap Research that tests how well a memory system can answer questions about long, multi-session conversations. It evaluates four capabilities:

Single-hop — Direct factual recall from one conversation turn
Multi-hop — Reasoning across multiple conversation turns to synthesize an answer
Open-domain — Questions requiring general knowledge + conversation context
Temporal — Time-sensitive questions ("What did they say before/after X?")

We tested 5 retrieval configurations in soul.py (SoulMate's memory engine) to find the best strategy for long-term agent memory.

Results

All 5 configurations, ranked by overall score.

Config	Overall	Single-hop	Multi-hop	Open-domain	Temporal	Errors
🏆 RLM	69.99%	54.13%	82.06%	55.10%	39.99%	159
Hybrid	65.57%	45.97%	79.49%	56.04%	29.84%	149
Auto	64.05%	42.56%	78.46%	58.75%	26.72%	33
Qdrant (RAG)	63.42%	36.45%	78.72%	59.38%	26.97%	0
BM25	63.05%	38.40%	77.80%	50.83%	29.26%	0

Overall Score

Config	Overall
RLM	70.0%
Hybrid	65.6%
Qdrant	63.4%
BM25	63.1%

By Category

Single-hop (direct recall)

Config	Score
RLM	54.1%
Hybrid	46.0%
BM25	38.4%
Qdrant	36.5%

Multi-hop (cross-turn reasoning)

Config	Score
RLM	82.1%
Hybrid	79.5%
Qdrant	78.7%
BM25	77.8%

Temporal (time-sensitive)

Config	Score
RLM	40.0%
Hybrid	29.8%
BM25	29.3%
Qdrant	27.0%

Open-domain (general knowledge + context)

Config	Score
Qdrant	59.4%
Hybrid	56.0%
RLM	55.1%
BM25	50.8%

Competitor Comparison

soul.py vs other memory systems on LoCoMo. Competitor numbers from their published evaluations.

System	Overall	Single-hop	Multi-hop	Open-domain	Temporal
XMem (Gemini 3-flash)	91.5%	90.6%	92.3%	91.2%	91.9%
Memobase	75.8%	70.9%	46.9%	77.2%	85.1%
Zep	75.1%	74.1%	66.0%	67.7%	79.8%
🏆 soul.py (RLM) (Gemini 2.0 Flash)	70.0%	54.1%	82.1%	55.1%	40.0%
Mem0g (YC 24)	68.4%	65.7%	47.2%	75.7%	58.1%
Mem0 (YC 24)	66.9%	67.1%	51.2%	72.9%	55.5%
LangMem	58.1%	62.2%	47.9%	71.1%	23.4%
OpenAI	52.9%	63.8%	42.9%	62.3%	21.7%

Context

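soul.py RLM (70.0%) beats Mem0g (68.4%), Mem0 (66.9%), LangMem (58.1%), and OpenAI (52.9%) overall, and its multi-hop score (82.1%) is second only to XMem (92.3%). However, it trails XMem (91.5%), Memobase (75.8%), and Zep (75.1%) overall, and its single-hop (54.1%) and open-domain (55.1%) scores are the lowest in the table. Note: XMem uses Gemini 3-flash while soul.py used Gemini 2.0 Flash, so the comparison is not model-matched and part of the gap may come from the underlying model.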

Key Insights

Finding #1

RLM is the clear winner. Recurrent Language Model memory outperforms all other retrieval strategies by 4–7 percentage points overall. The gap is largest on temporal reasoning (+10pts) and single-hop recall (+8pts).
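
The margins quoted above can be re-derived from the Results table. The snippet below is a small sketch for checking them, not part of soul.py; the dictionary layout and the function name are illustrative assumptions, and the scores are copied from the table.

```python
# Scores (percent) copied from the Results table.
RESULTS = {
    "RLM":    {"overall": 69.99, "single_hop": 54.13, "multi_hop": 82.06,
               "open_domain": 55.10, "temporal": 39.99},
    "Hybrid": {"overall": 65.57, "single_hop": 45.97, "multi_hop": 79.49,
               "open_domain": 56.04, "temporal": 29.84},
    "Auto":   {"overall": 64.05, "single_hop": 42.56, "multi_hop": 78.46,
               "open_domain": 58.75, "temporal": 26.72},
    "Qdrant": {"overall": 63.42, "single_hop": 36.45, "multi_hop": 78.72,
               "open_domain": 59.38, "temporal": 26.97},
    "BM25":   {"overall": 63.05, "single_hop": 38.40, "multi_hop": 77.80,
               "open_domain": 50.83, "temporal": 29.26},
}


def margin_over_best_other(config: str, metric: str) -> float:
    """Percentage-point gap between `config` and the best other config."""
    best_other = max(
        scores[metric] for name, scores in RESULTS.items() if name != config
    )
    return round(RESULTS[config][metric] - best_other, 2)


if __name__ == "__main__":
    for metric in RESULTS["RLM"]:
        print(f"{metric:12s} {margin_over_best_other('RLM', metric):+.2f}")
```

A negative margin means another configuration leads in that category, which is the case for open-domain.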

Finding #2

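Temporal reasoning is the hardest category for every soul.py configuration. Even the best config (RLM) reaches only 40%, and the others score 27–30%. XMem, Memobase, and Zep report roughly 80–92% on this category, so understanding time-ordered events in long conversations is where soul.py has the most room to improve.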

Finding #3

Multi-hop is surprisingly strong across the board. All configs score roughly 78–82% on cross-turn reasoning (77.80% for BM25 up to 82.06% for RLM). The narrow spread suggests that the orchestration and chunking strategy, rather than the retrieval backend, drives multi-step inference quality.

Finding #4

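Pure RAG (Qdrant) wins on open-domain (59.38%), plausibly because vector similarity search surfaces broadly relevant context. It has the lowest single-hop score (36.45%) and the second-lowest temporal score (26.97%, just ahead of Auto at 26.72%), the two categories where precise recall matters more.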

Finding #5

Hybrid provides modest lift but not enough. Combining BM25 + Qdrant gains about 2.2 points over Qdrant alone and about 2.5 points over BM25 alone. RLM scores another ~4.4 points above Hybrid, so the learning component is what really moves the needle.

Configurations Explained

Config	How It Works
BM25	Keyword-based retrieval (a TF-IDF-style ranking function with term-frequency saturation and document-length normalization). Fast, no embeddings. Baseline.
Qdrant (RAG)	Vector similarity search using embeddings. Standard RAG approach.
Hybrid	BM25 + Qdrant combined with reciprocal rank fusion. Best of both lexical + semantic.
RLM	Recurrent Language Model memory — compresses and maintains a running belief state across turns, enabling structured recall without explicit retrieval. Based on the RLM architecture (Dec 2025).
Auto	Dynamically selects between configs per query based on question type. Currently running.
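
The Hybrid row names reciprocal rank fusion (RRF) as the way the two rankings are combined. A minimal sketch of standard RRF follows. It is not taken from soul.py: the function name, the smoothing constant k=60 (a commonly used default), and the turn IDs are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. A larger k flattens the difference between
    top and lower ranks. k=60 is an assumed default, not a soul.py setting.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    # Hypothetical hits from a lexical and a vector retriever.
    bm25_hits = ["turn_12", "turn_07", "turn_31"]
    vector_hits = ["turn_07", "turn_44", "turn_12"]
    print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
```

Because RRF uses only ranks, it needs no calibration between BM25 scores and cosine similarities, which sit on different scales.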

Methodology

Dataset: LoCoMo-10, a set of 10 long conversations sampled from the LoCoMo benchmark (Snap Research). Each conversation contains 25–30 sessions and 196–260 questions, for 1,986 questions in total.

Scoring: Gemini 2.0 Flash as the evaluator LLM. Scores are 0–1 per question (partial credit). Category scores are averaged across all questions in that category.
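
The per-category averaging described above can be sketched as follows. This is an illustration only: the function name and the (category, score) tuple layout are assumptions, not soul.py's evaluation code, and the sample scores are made up to show partial credit.

```python
from collections import defaultdict
from statistics import mean


def category_scores(graded):
    """Average 0-1 evaluator scores per category, returned as percentages.

    `graded` is an iterable of (category, score) pairs, where score is the
    evaluator's partial-credit value in [0, 1].
    """
    by_category = defaultdict(list)
    for category, score in graded:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
        by_category[category].append(score)
    return {cat: 100.0 * mean(vals) for cat, vals in by_category.items()}


if __name__ == "__main__":
    # Made-up grades, for illustration only.
    sample = [
        ("temporal", 1.0), ("temporal", 0.5), ("temporal", 0.0),
        ("single_hop", 1.0), ("single_hop", 0.5),
    ]
    print(category_scores(sample))
```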

Errors: The RLM and Hybrid configs hit Gemini API rate-limit and availability errors (HTTP 429/503), resulting in 159 and 149 skipped questions respectively out of 1,986. Auto skipped 33. Skipped questions are excluded from scoring (not counted as wrong). BM25 and Qdrant ran error-free.
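
Because skipped questions are excluded, the reported score for a config with errors covers fewer questions than the error-free runs. The sketch below bounds what the overall score could be if the skipped questions had been answered. It assumes the reported overall score is a plain mean over answered questions, which the methodology does not state explicitly. The function name is illustrative.

```python
TOTAL_QUESTIONS = 1986  # from the methodology above


def overall_bounds(reported_pct, skipped, total=TOTAL_QUESTIONS):
    """Return (worst, best) overall percentages if skips had been scored.

    Assumes `reported_pct` is the mean over the `total - skipped` answered
    questions. Worst case scores every skipped question 0; best case
    scores every skipped question 1.
    """
    answered = total - skipped
    earned = reported_pct * answered  # in percent-points times questions
    worst = earned / total
    best = (earned + 100.0 * skipped) / total
    return worst, best


if __name__ == "__main__":
    # Reported overall scores and error counts from the Results table.
    for name, pct, skipped in [("RLM", 69.99, 159), ("Hybrid", 65.57, 149),
                               ("Auto", 64.05, 33)]:
        worst, best = overall_bounds(pct, skipped)
        print(f"{name:7s} reported {pct:.2f}%  range {worst:.2f}% to {best:.2f}%")
```

Under that assumption, RLM's overall score would fall somewhere between roughly 64.4% and 72.4% had all 1,986 questions been scored.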