# PowerMem Benchmark
A comprehensive benchmarking suite for PowerMem, including a REST API server for managing memories and a load testing tool based on the LOCOMO dataset.
## Overview
The PowerMem Benchmark suite consists of two main components:
- **Benchmark Server** (`benchmark/server/`): A FastAPI-based REST API server that provides:
  - Memory storage and management
  - Semantic search capabilities
  - Token usage tracking
  - Support for multiple database backends (OceanBase, PostgreSQL)
- **Load Testing Tool** (`benchmark/lomoco/`): A comprehensive benchmarking tool that:
  - Tests memory addition and search performance
  - Evaluates response quality using multiple metrics
  - Measures latency and token consumption
  - Uses the LOCOMO dataset for realistic testing scenarios
## Quick Start
### 1. Start the Benchmark Server
```bash
# Install dependencies
pip install -e .

# Configure environment
cp benchmark/server/.env.example benchmark/server/.env
# Edit benchmark/server/.env with your settings

# Start the server
uvicorn benchmark.server.main:app --host 0.0.0.0 --port 8000 --reload
```
### 2. Run Load Tests
```bash
# Install load testing dependencies
pip install -r benchmark/lomoco/requirements.txt

# Configure environment
cd benchmark/lomoco
cp .env.example .env
# Edit .env with your API keys and server URL

# Run tests
bash run.sh results
```
## Benchmark Server
### Prerequisites
- Python 3.10 or higher
- pip or poetry for dependency management
- Access to OpenAI API (or compatible API endpoint)
- Database: OceanBase or PostgreSQL (depending on your configuration)
### Installation
1. **Install dependencies**

   From the project root:

   ```bash
   pip install -e .
   ```

   Or install specific dependencies:

   ```bash
   pip install fastapi uvicorn python-dotenv powermem
   ```

2. **Configure environment variables**

   Copy the example environment file:

   ```bash
   cp benchmark/server/.env.example benchmark/server/.env
   ```

   Edit `benchmark/server/.env` and configure:

   - `OPENAI_API_KEY`: Your OpenAI API key (required)
   - `EMBEDDER_API_KEY`: Optional, separate API key for embeddings (defaults to `OPENAI_API_KEY`)
   - Database configuration (OceanBase or PostgreSQL)
   - Other settings as needed

   See `benchmark/server/.env.example` for all available configuration options.
### Configuration
All configuration is done through environment variables. The server automatically loads a `.env` file from the `benchmark/server/` directory.
#### Required Environment Variables

- `OPENAI_API_KEY`: Your OpenAI API key
#### Optional Environment Variables

- `EMBEDDER_API_KEY`: Separate API key for embeddings (defaults to `OPENAI_API_KEY`)
- `DB_TYPE`: Database type - `oceanbase` or `postgres` (default: `oceanbase`)
- `TOKEN_COUNTING`: Enable token counting - `true` or `false` (default: `true`)
- `LLM_MODEL`: LLM model name (default: `gpt-4o`)
- `LLM_TEMPERATURE`: LLM temperature (default: `0.2`)
- `EMBEDDER_MODEL`: Embedding model name (default: `text-embedding-3-small`)
- `EMBEDDER_DIMS`: Embedding dimensions (default: `1536`)
For database-specific configuration, see `benchmark/server/.env.example`.
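If you want to sanity-check your configuration before starting the server, a small script can load the same `.env` file and print the resolved values. This is only an illustrative sketch based on the variables documented above (it does not mirror the server's internal code) and assumes `python-dotenv` is installed:

```python
import os

from dotenv import load_dotenv

# Load the server's env file explicitly (the server itself does this automatically).
load_dotenv("benchmark/server/.env")

# Required variable
if not os.getenv("OPENAI_API_KEY"):
    raise SystemExit("OPENAI_API_KEY environment variable is required")

# Optional variables, shown with the defaults documented above
resolved = {
    "DB_TYPE": os.getenv("DB_TYPE", "oceanbase"),
    "TOKEN_COUNTING": os.getenv("TOKEN_COUNTING", "true"),
    "LLM_MODEL": os.getenv("LLM_MODEL", "gpt-4o"),
    "LLM_TEMPERATURE": float(os.getenv("LLM_TEMPERATURE", "0.2")),
    "EMBEDDER_MODEL": os.getenv("EMBEDDER_MODEL", "text-embedding-3-small"),
    "EMBEDDER_DIMS": int(os.getenv("EMBEDDER_DIMS", "1536")),
}
for key, value in resolved.items():
    print(f"{key} = {value}")
```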
### Starting the Server
#### Method 1: Using uvicorn (Recommended)

From the project root:

```bash
uvicorn benchmark.server.main:app --host 0.0.0.0 --port 8000 --reload
```

The `--reload` flag enables auto-reload during development.
#### Method 2: Production Mode

For production with multiple workers:

```bash
uvicorn benchmark.server.main:app --host 0.0.0.0 --port 8000 --workers 4
```
#### Method 3: Using the Python Module

```bash
python -m uvicorn benchmark.server.main:app --host 0.0.0.0 --port 8000
```
#### Method 4: From the Server Directory

If you're in the `benchmark/server` directory:

```bash
cd benchmark/server
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```
### Accessing the API

Once the server is running, you can access:

- API Documentation (Swagger UI): http://localhost:8000/docs
- Alternative API Documentation (ReDoc): http://localhost:8000/redoc
- API Root: http://localhost:8000/
### API Endpoints
The server provides the following main endpoints:
| Method | Endpoint | Description |
|---|---|---|
| POST | `/memories` | Create new memories |
| GET | `/memories` | Get all memories (with filters) |
| GET | `/memories/{memory_id}` | Get a specific memory |
| PUT | `/memories/{memory_id}` | Update a memory |
| DELETE | `/memories/{memory_id}` | Delete a memory |
| DELETE | `/memories` | Delete all memories (with filters) |
| POST | `/search` | Search memories |
| GET | `/memories/{memory_id}/history` | Get memory history |
| POST | `/reset` | Reset all memories |
| POST | `/configure` | Update server configuration |
| GET | `/token_count` | Get token usage statistics |
| POST | `/reset_token_count` | Reset token count |
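To get a feel for the request flow, here is a minimal sketch that adds a memory, searches it, and reads the token counter. The request bodies are assumptions (field names like `messages`, `user_id`, and `query` are illustrative, not confirmed by this README); check the interactive docs at http://localhost:8000/docs for the actual schemas:

```python
import requests

BASE_URL = "http://localhost:8000"

# Create a memory (payload shape is hypothetical -- verify against /docs)
resp = requests.post(
    f"{BASE_URL}/memories",
    json={
        "messages": [{"role": "user", "content": "I prefer tea over coffee."}],
        "user_id": "demo-user",
    },
)
print(resp.status_code, resp.json())

# Search memories (again, field names are illustrative)
resp = requests.post(
    f"{BASE_URL}/search",
    json={"query": "What does the user like to drink?", "user_id": "demo-user"},
)
print(resp.json())

# Read token usage statistics
print(requests.get(f"{BASE_URL}/token_count").json())
```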
## Load Testing (LOCOMO)
The LOCOMO benchmark tool performs comprehensive evaluations of memory systems using the LOCOMO dataset. It tests both memory addition and search capabilities, measuring performance, quality, and resource consumption.
### Prerequisites for Load Testing
1. **Benchmark server must be running**

   - Start the benchmark server first (see [Starting the Server](#starting-the-server))
   - The server should be accessible at the URL specified in your configuration

2. **Install LOCOMO dependencies**

   From the project root:

   ```bash
   pip install -r benchmark/lomoco/requirements.txt
   ```

   Or from the `lomoco` directory:

   ```bash
   cd benchmark/lomoco
   pip install -r requirements.txt
   ```
### Configuring Environment Variables for Load Testing
1. **Create the environment configuration file**

   ```bash
   cd benchmark/lomoco
   cp .env.example .env
   ```

2. **Edit the `.env` file**

   Open `benchmark/lomoco/.env` and configure the following variables:

   ```bash
   # OpenAI API configuration
   MODEL="qwen3-max"                             # or "gpt-4o", "gpt-4", etc.
   OPENAI_BASE_URL="https://api.openai.com/v1"   # or your API endpoint
   OPENAI_API_KEY="your_api_key_here"            # Your OpenAI API key

   # API configuration - must match your running server
   API_BASE_URL="http://127.0.0.1:8000"          # URL of the benchmark server
   ```

   **Important Configuration Notes:**

   - `API_BASE_URL`: Must match the URL where your benchmark server is running (a quick connectivity check is sketched below)
     - Default: `http://127.0.0.1:8000` (if the server runs on localhost:8000)
     - If the server runs on a different port, update accordingly
   - `MODEL`: The LLM model to use for answer generation
   - `OPENAI_API_KEY`: API key for the LLM service
   - `OPENAI_BASE_URL`: Base URL for the LLM API (OpenAI or a compatible service)
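Before launching a long run, it can save time to confirm that `API_BASE_URL` actually points at a live server. A minimal check, assuming the `requests` and `python-dotenv` packages are available:

```python
import os

import requests
from dotenv import load_dotenv

# Read benchmark/lomoco/.env (run this from the benchmark/lomoco directory)
load_dotenv()

base_url = os.getenv("API_BASE_URL", "http://127.0.0.1:8000")
try:
    resp = requests.get(base_url, timeout=5)
    print(f"Server reachable at {base_url} (HTTP {resp.status_code})")
except requests.ConnectionError:
    raise SystemExit(f"Cannot reach {base_url} -- is the benchmark server running?")
```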
### Running Load Tests
#### Full Test Suite
1. **Ensure the benchmark server is running**

   In a separate terminal:

   ```bash
   uvicorn benchmark.server.main:app --host 0.0.0.0 --port 8000
   ```

2. **Run the complete test script**

   From the `benchmark/lomoco` directory:

   ```bash
   cd benchmark/lomoco
   bash run.sh [output_folder]
   ```

   Or from the project root:

   ```bash
   cd benchmark/lomoco && bash run.sh results
   ```

   The `output_folder` parameter is optional (defaults to `results`).

3. **What the script does**

   The `run.sh` script performs the following steps (a Python sketch of the same flow follows this list):

   1. Resets the token count on the server
   2. Records the initial token count
   3. Runs memory addition experiments (`add` method)
   4. Runs memory search experiments (`search` method)
   5. Records the final token count
   6. Generates evaluation metrics
   7. Displays the evaluation results
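For reference, the same flow can be driven from Python. The sketch below mirrors steps 1-5 above and must be run from `benchmark/lomoco`; it assumes `/token_count` returns JSON (the exact response shape is not specified here, so the payloads are written to disk verbatim). Metrics generation and display (steps 6-7) are left to the existing tooling:

```python
import json
import os
import subprocess

import requests

BASE_URL = "http://127.0.0.1:8000"   # must match API_BASE_URL in .env
OUTPUT_FOLDER = "results"
os.makedirs(OUTPUT_FOLDER, exist_ok=True)

# Steps 1-2: reset the server-side token counter and record the starting count
requests.post(f"{BASE_URL}/reset_token_count")
with open(os.path.join(OUTPUT_FOLDER, "token1.json"), "w") as f:
    json.dump(requests.get(f"{BASE_URL}/token_count").json(), f)

# Steps 3-4: run the add and search experiments
subprocess.run(["python3", "run_experiments.py", "--method", "add",
                "--output_folder", OUTPUT_FOLDER], check=True)
subprocess.run(["python3", "run_experiments.py", "--method", "search",
                "--output_folder", OUTPUT_FOLDER, "--top_k", "30"], check=True)

# Step 5: record the final token count; the difference between the two
# files is the total token consumption of the run
with open(os.path.join(OUTPUT_FOLDER, "token2.json"), "w") as f:
    json.dump(requests.get(f"{BASE_URL}/token_count").json(), f)
```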
#### Running Individual Test Methods
You can also run individual test methods manually:
**Memory Addition Test:**

```bash
cd benchmark/lomoco
python3 run_experiments.py --method add --output_folder results
```

**Memory Search Test:**

```bash
cd benchmark/lomoco
python3 run_experiments.py --method search --output_folder results --top_k 30
```
**Available Options** (a hypothetical `argparse` sketch follows this list):

- `--method`: Test method - `add` or `search` (default: `add`)
- `--chunk_size`: Chunk size for processing (default: 1000)
- `--top_k`: Number of top memories to retrieve (default: 30)
- `--filter_memories`: Enable memory filtering
- `--is_graph`: Use graph-based search
- `--num_chunks`: Number of chunks to process (default: 1)
- `--output_folder`: Output directory for results (default: `results/`)
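To make the option semantics concrete, here is a hypothetical reconstruction of the CLI surface described above. The real parser lives in `run_experiments.py` and may differ in details (for instance, whether the boolean options are flags or take values):

```python
import argparse

# Hypothetical parser mirroring the documented options; defaults are taken
# from the list above. Not the actual code from run_experiments.py.
parser = argparse.ArgumentParser(description="LOCOMO benchmark experiments")
parser.add_argument("--method", choices=["add", "search"], default="add")
parser.add_argument("--chunk_size", type=int, default=1000)
parser.add_argument("--top_k", type=int, default=30)
parser.add_argument("--filter_memories", action="store_true")
parser.add_argument("--is_graph", action="store_true")
parser.add_argument("--num_chunks", type=int, default=1)
parser.add_argument("--output_folder", default="results/")

args = parser.parse_args()
print(args)
```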
### Understanding Test Results
After running the tests, results are stored in the output folder (default: `results/`):
#### Output Files
- `results.json`: Detailed results for each question and conversation
  - Contains all question-answer pairs with retrieved memories
  - Includes timing information for each operation
- `evaluation_metrics.json`: Computed evaluation metrics
  - BLEU scores
  - F1 scores
  - LLM judge scores
- `evaluation.txt`: Human-readable evaluation summary
  - Total server execution time
  - Total requests processed
  - Average request time
  - P95 latency (95th percentile)
  - Token consumption statistics
- `token1.json` and `token2.json`: Token counts before and after the tests
  - Used to calculate total token consumption
### Evaluation Metrics
The benchmark evaluates performance using multiple metrics (a minimal computation sketch follows this list):

- **BLEU Score**: Measures similarity between the model response and the ground truth
  - Range: 0.0 to 1.0 (higher is better)
  - Based on n-gram overlap
- **F1 Score**: Harmonic mean of precision and recall
  - Range: 0.0 to 1.0 (higher is better)
  - Measures answer quality
- **LLM Score**: Binary score (0 or 1) from LLM judge evaluation
  - 1 = correct answer, 0 = incorrect answer
  - Determined by an LLM judge
- **Token Consumption**: Number of tokens used for answer generation
  - Includes both prompt and completion tokens
  - Tracked before and after the tests
- **Latency Metrics**:
  - Average request time: mean time per request
  - P95 latency: 95th percentile request time
  - Total execution time: total time for all operations
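To make these definitions concrete, here is a self-contained sketch of how BLEU, token-level F1, and P95 latency can be computed. It illustrates the formulas above rather than reproducing the benchmark's actual evaluation code; it assumes `nltk` is installed and uses simple whitespace tokenization:

```python
import math
from collections import Counter

from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu


def bleu(prediction: str, ground_truth: str) -> float:
    """BLEU from n-gram overlap, smoothed so short answers don't collapse to 0."""
    reference = [ground_truth.lower().split()]
    hypothesis = prediction.lower().split()
    return sentence_bleu(reference, hypothesis,
                         smoothing_function=SmoothingFunction().method1)


def f1(prediction: str, ground_truth: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, truth = prediction.lower().split(), ground_truth.lower().split()
    overlap = sum((Counter(pred) & Counter(truth)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(truth)
    return 2 * precision * recall / (precision + recall)


def p95(latencies: list[float]) -> float:
    """95th percentile request time (nearest-rank method)."""
    ordered = sorted(latencies)
    return ordered[max(0, math.ceil(0.95 * len(ordered)) - 1)]


print(round(f1("tea over coffee", "the user prefers tea over coffee"), 3))  # 0.667
```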
## Troubleshooting
### Server Issues
"OPENAI_API_KEY environment variable is required"
- Solution: Create a
.envfile inbenchmark/server/directory - Verify that
OPENAI_API_KEYis set in the.envfile - Check that the file is being loaded (server logs will show the path)
**Database connection errors**

- Solution:
  - Check your database configuration in `.env`
  - Ensure your database server is running
  - Verify connection credentials (host, port, user, password)
  - Test database connectivity separately
**Port already in use**

- Solution:
  - Change the port: `uvicorn benchmark.server.main:app --port 8001`
  - Or find and kill the process using the port: `lsof -ti:8000 | xargs kill -9`
**OpenAI API 404 Not Found error**

- Solution:
  - Check that your `OPENAI_BASE_URL` is correct, for example: `https://api.openai.com/v1`
**OpenAI API 429 rate limiting**

- Solution:
  - Reduce concurrency: adjust `max_workers` in `methods/add.py` if needed
  - Use more API keys (a multi-API-key proxy solution will be published in the future)
**Module not found errors**

- Solution:
  - Install dependencies: `pip install -e .`
  - Verify you're running from the project root directory
  - Check the Python path and virtual environment activation
### Load Testing Issues
"api_base_url is not set"
- Solution:
- Create
.envfile inbenchmark/lomoco/directory - Verify that
API_BASE_URLis set correctly - Ensure the URL matches your running server address
- Create
**Connection refused errors**

- Solution:
  - Verify that the benchmark server is running
  - Check that `API_BASE_URL` in `.env` matches the server URL
  - Ensure the server is accessible from the test machine
"model is not set" or "openai_api_key is not set"
- Solution:
- Check that
MODELandOPENAI_API_KEYare set inbenchmark/lomoco/.env - Verify the API key is valid
- Ensure no extra quotes or spaces in the values
- Check that
**Import errors**

- Solution:
  - Install all dependencies: `pip install -r benchmark/lomoco/requirements.txt`
  - Ensure you're running from the correct directory
  - Check your Python version (3.10+ is required)
**Slow performance**

- Solution:
  - The tests use multi-threading (32 workers by default)
  - Reduce concurrency: adjust `max_workers` in `methods/add.py` if needed
  - Consider running the tests on a machine with more resources
  - Check server performance and the database connection pool size
**Missing NLTK data or SentenceTransformer model**

- Solution:
  - If NLTK data is not found, download it:

    ```python
    import nltk

    nltk.download("punkt", quiet=True)
    nltk.download("wordnet", quiet=True)
    ```

  - If the SentenceTransformer model is not found, download it by initializing it once:

    ```python
    from sentence_transformers import SentenceTransformer

    # Initialize the SentenceTransformer model (it will be cached and reused)
    sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
    ```
## License
See the main project LICENSE file for details.