Evaluating SoulMate's memory retrieval across 5 configurations on the LoCoMo long-conversation memory benchmark. 1,986 questions across 10 conversations.
LoCoMo (Long-Context Memory) is a benchmark from Snap Research that tests how well a memory system can answer questions about long, multi-session conversations. It evaluates four capabilities: single-hop recall, multi-hop reasoning, open-domain question answering, and temporal reasoning.
We tested 5 retrieval configurations in soul.py (SoulMate's memory engine) to find the best strategy for long-term agent memory.
All 5 configurations, ranked by overall score.
| Config | Overall | Single-hop | Multi-hop | Open-domain | Temporal | Errors |
|---|---|---|---|---|---|---|
| RLM | 69.99% | 54.13% | 82.06% | 55.10% | 39.99% | 159 |
| Hybrid | 65.57% | 45.97% | 79.49% | 56.04% | 29.84% | 149 |
| Auto | 64.05% | 42.56% | 78.46% | 58.75% | 26.72% | 33 |
| Qdrant (RAG) | 63.42% | 36.45% | 78.72% | 59.38% | 26.97% | 0 |
| BM25 | 63.05% | 38.40% | 77.80% | 50.83% | 29.26% | 0 |
soul.py vs other memory systems on LoCoMo. Competitor numbers from their published evaluations.
| System | Overall | Single-hop | Multi-hop | Open-domain | Temporal |
|---|---|---|---|---|---|
| XMem (Gemini 3-flash) | 91.5% | 90.6% | 92.3% | 91.2% | 91.9% |
| Memobase | 75.8% | 70.9% | 46.9% | 77.2% | 85.1% |
| Zep | 75.1% | 74.1% | 66.0% | 67.7% | 79.8% |
| soul.py (RLM) (Gemini 2.0 Flash) | 70.0% | 54.1% | 82.1% | 55.1% | 40.0% |
| Mem0g (YC 24) | 68.4% | 65.7% | 47.2% | 75.7% | 58.1% |
| Mem0 (YC 24) | 66.9% | 67.1% | 51.2% | 72.9% | 55.5% |
| LangMem | 58.1% | 62.2% | 47.9% | 71.1% | 23.4% |
| OpenAI | 52.9% | 63.8% | 42.9% | 62.3% | 21.7% |
soul.py RLM (70.0%) beats Mem0g (68.4%), Mem0 (66.9%), LangMem (58.1%), and OpenAI (52.9%), and has the second-highest multi-hop reasoning score (82.1%), behind only XMem (92.3%). However, it trails XMem (91.5%), Memobase (75.8%), and Zep (75.1%) on overall scores. Note: XMem uses Gemini 3-flash while soul.py used Gemini 2.0 Flash, and model choice matters significantly.
RLM is the clear winner. Recurrent Language Model memory outperforms all other retrieval strategies by 4.4–6.9 percentage points overall. The gap over the runner-up (Hybrid) is largest on temporal reasoning (+10.2 pts) and single-hop recall (+8.2 pts).
Temporal reasoning is the hardest category for every soul.py configuration. Even the best config (RLM) only reaches 40%, and the others score 27–30%. Understanding time-ordered events in long conversations remains the frontier.
Multi-hop is surprisingly strong across the board. All configs score 78–82% on cross-turn reasoning. The orchestration and chunking strategy handles multi-step inference well regardless of retrieval backend.
Pure RAG (Qdrant) wins on open-domain (59.38%): vector similarity search naturally surfaces broadly relevant context. But it is the worst at single-hop (36.45%) and second-worst at temporal (26.97%, just above Auto's 26.72%), where precise recall matters more.
Hybrid provides modest lift but not enough. Combining BM25 + Qdrant gains about 2.2–2.5 pts over either alone. RLM's reinforcement layer adds another ~4.4 pts on top; the learning component is what really moves the needle.
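The point gaps quoted in these findings can be re-derived from the results table. The snippet below is a small arithmetic check, not part of soul.py: the scores are copied from the table, and the helper names `lead_over_runner_up` and `gap` are made up for illustration.

```python
# Scores (%) copied from the configuration results table.
scores = {
    "RLM":    {"overall": 69.99, "single_hop": 54.13, "temporal": 39.99},
    "Hybrid": {"overall": 65.57, "single_hop": 45.97, "temporal": 29.84},
    "Auto":   {"overall": 64.05, "single_hop": 42.56, "temporal": 26.72},
    "Qdrant": {"overall": 63.42, "single_hop": 36.45, "temporal": 26.97},
    "BM25":   {"overall": 63.05, "single_hop": 38.40, "temporal": 29.26},
}


def gap(a: str, b: str, metric: str = "overall") -> float:
    """Percentage-point difference between config a and config b."""
    return round(scores[a][metric] - scores[b][metric], 2)


def lead_over_runner_up(metric: str) -> float:
    """RLM's lead over the best non-RLM config on the given metric."""
    best_other = max(v[metric] for name, v in scores.items() if name != "RLM")
    return round(scores["RLM"][metric] - best_other, 2)


print(lead_over_runner_up("overall"))     # RLM vs Hybrid, overall
print(gap("RLM", "BM25"))                 # RLM vs the lowest overall score
print(lead_over_runner_up("temporal"))    # temporal lead
print(lead_over_runner_up("single_hop"))  # single-hop lead
print(gap("Hybrid", "Qdrant"), gap("Hybrid", "BM25"))  # Hybrid's lift
```

Keeping the deltas computed rather than hand-typed makes it easier to refresh the findings when a configuration is re-run.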
| Config | How It Works |
|---|---|
| BM25 | Keyword-based retrieval using BM25 ranking (a TF-IDF variant). Fast, no embeddings. Baseline. |
| Qdrant (RAG) | Vector similarity search using embeddings. Standard RAG approach. |
| Hybrid | BM25 + Qdrant combined with reciprocal rank fusion. Best of both lexical + semantic. |
| RLM | Recurrent Language Model memory: compresses and maintains a running belief state across turns, enabling structured recall without explicit retrieval. Based on the RLM architecture (Dec 2025). |
| Auto | Dynamically selects between configs per query based on question type. Currently running. |
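For readers unfamiliar with reciprocal rank fusion, the technique named in the Hybrid row, here is a minimal sketch of the general method. It is not taken from soul.py: the function name, the memory IDs, and the constant `k=60` (a commonly used default) are assumptions for illustration, and soul.py's actual fusion parameters are not stated here.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into a single ranking.

    Each ID receives sum(1 / (k + rank)) over the lists it appears in,
    where rank is 1-based. k=60 is an assumed, commonly used default.
    """
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(fused, key=lambda d: (-fused[d], d))


# Hypothetical top-3 results from a lexical and a vector retriever.
bm25_hits = ["m3", "m1", "m7"]
vector_hits = ["m1", "m9", "m3"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)  # ['m1', 'm3', 'm9', 'm7']
```

Items that both retrievers return (`m1`, `m3`) rise to the top because their reciprocal ranks add up, which is why fusion needs only rank positions and not comparable raw scores from BM25 and the vector index.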
Dataset: LoCoMo-10, a set of 10 long conversations sampled from the LoCoMo benchmark (Snap Research). Each conversation contains 25–30 sessions with 196–260 questions.
Scoring: Gemini 2.0 Flash as the evaluator LLM. Scores are 0–1 per question (partial credit). Category scores are averaged across all questions in that category.
Errors: RLM and Hybrid configs hit Gemini API rate-limit and availability errors (HTTP 429/503), resulting in 159 and 149 skipped questions respectively out of 1,986. These are excluded from scoring (not counted as wrong). Auto recorded 33 errors. BM25 and Qdrant ran error-free.
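The scoring and error rules above can be sketched as a small aggregation routine. This is an illustration under stated assumptions, not the actual evaluation harness: the `aggregate` function, the `(category, score)` tuple shape, and the use of `None` for a skipped question are all hypothetical, and the overall score is assumed to be the mean over all answered questions.

```python
from collections import defaultdict


def aggregate(results):
    """Average per-question scores by category, excluding skipped questions.

    results: iterable of (category, score) pairs, where score is a float in
    [0, 1] from the evaluator LLM, or None if the question was skipped
    because of an API error (assumed representation).
    """
    by_category = defaultdict(list)
    errors = 0
    for category, score in results:
        if score is None:
            errors += 1          # skipped: excluded, not counted as wrong
            continue
        by_category[category].append(score)

    report = {cat: 100 * sum(s) / len(s) for cat, s in by_category.items()}
    answered = [s for scores in by_category.values() for s in scores]
    # Assumption: overall is the mean over every answered question.
    report["overall"] = 100 * sum(answered) / len(answered) if answered else 0.0
    report["errors"] = errors
    return report


# Toy run with made-up scores; one single-hop question was skipped.
report = aggregate([
    ("temporal", 1.0),
    ("temporal", 0.5),
    ("single-hop", None),
    ("single-hop", 0.0),
    ("multi-hop", 1.0),
])
print(report)
```

Because skipped questions drop out of the denominator, configurations with many errors are scored on a slightly different question set than the error-free ones, which is worth keeping in mind when comparing rows.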