Performance Benchmarks

Real-world performance metrics based on the LOCOMO dataset

10 long multi-turn conversations · 1,540 QA pairs · GPT-5.5

PowerMem vs Full-Context

LLM Score (Accuracy)

87.79%

baseline 52.9%, +65.9% relative

Retrieval P95 Latency

1.44s

baseline 17.12s, -91.6%

Token Usage (Consumption)

~0.9k

baseline ~26k, -96.5%
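
The deltas shown next to each baseline appear to be relative changes against the Full-Context baseline. As a minimal sketch under that assumption, the snippet below recomputes them from the displayed figures. The function name `relative_change` and the dictionary keys are illustrative and not part of PowerMem. Because the displayed inputs are already rounded, the recomputed values may differ from the published deltas by about 0.1 in the last digit.

```python
def relative_change(value: float, baseline: float) -> float:
    """Return the percentage change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100.0


# (PowerMem value, Full-Context baseline) pairs copied from the figures above.
# Token usage is approximate in the source (~0.9k vs ~26k).
metrics = {
    "LLM Score (%)": (87.79, 52.9),
    "Retrieval P95 latency (s)": (1.44, 17.12),
    "Token usage (k tokens)": (0.9, 26.0),
}

changes = {
    name: relative_change(value, baseline)
    for name, (value, baseline) in metrics.items()
}

for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
```

A positive result means PowerMem is higher than the baseline (desirable for LLM Score), and a negative result means it is lower (desirable for latency and token usage).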

Category	Description	LLM Score	Count
1. Single-Hop	Questions asking for specific facts directly mentioned in a single conversation.	90.78%	282
2. Temporal	Questions that can be answered through temporal reasoning over time-related cues within the conversation.	83.49%	321
3. Multi-Hop	Questions that require synthesizing information from multiple sessions.	73.96%	96
4. Open-Domain	Questions that can be answered by integrating a speaker's provided information with external knowledge, such as commonsense or world facts.	90.01%	841
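
The per-category scores can be cross-checked against the headline LLM Score by taking a count-weighted average. The sketch below assumes the overall score is that weighted mean across the four categories. The variable names are illustrative only.

```python
# (category, LLM score in %, question count) copied from the table above.
categories = [
    ("Single-Hop", 90.78, 282),
    ("Temporal", 83.49, 321),
    ("Multi-Hop", 73.96, 96),
    ("Open-Domain", 90.01, 841),
]

total_questions = sum(count for _, _, count in categories)

# Weight each category score by the number of questions it covers.
weighted_score = (
    sum(score * count for _, score, count in categories) / total_questions
)

print(total_questions)           # 1540
print(f"{weighted_score:.2f}%")  # 87.79%
```

The counts sum to the 1,540 questions in the benchmark, and the weighted mean matches the 87.79% headline figure. Open-Domain dominates the average because it accounts for 841 of the 1,540 questions.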