Integration Testing
Integration tests execute your agents with real backend and LLM calls, validating end-to-end functionality and actual behavior. Erdo provides three execution modes to balance test coverage, speed, and cost.
Live Execution: Unlike unit tests, which validate structure locally, integration tests make actual API calls to the erdo backend and LLM providers. Use replay mode to cache responses and minimize costs.
Overview
Integration tests are written as Python functions with the agent_test_ prefix and executed with the erdo agent-test command. They:
- Execute agents with real LLM and backend calls
- Three modes: live (production), replay (cached), manual (mocked)
- Parallel execution for fast test runs
- Rich assertion helpers for validation
- Automatic test discovery
Writing Integration Tests
Basic Test Structure
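The exact import path and invoke() signature are not shown on this page, so treat the following as a sketch: erdo.testing.invoke and my_project.agents.my_agent are assumed names, while the agent_test_ prefix and the InvokeResult fields (result, messages, steps, events) come from this guide.

```python
# Hypothetical sketch - import paths and the invoke() signature are assumptions;
# only the agent_test_ prefix and the documented InvokeResult fields are from this page.
from erdo.testing import invoke            # assumed import path
from my_project.agents import my_agent     # hypothetical agent module


def agent_test_basic_greeting():
    # invoke() runs the agent end-to-end and returns an InvokeResult
    response = invoke(my_agent, message="Summarize last week's sales")

    # Validate the final result and the captured messages
    assert response.result is not None
    assert len(response.messages) > 0
```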
InvokeResult Structure
The invoke() function returns an InvokeResult object with clean separation following the executor pattern. This structure provides organized access to results, messages, steps, and events:
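The four field names below come straight from the subsections that follow; the dataclass-style layout is only an illustration of how they relate, not the SDK's actual class definition.

```python
# Illustration only - field names match the subsections below; the real
# InvokeResult class is defined by the erdo SDK, not here.
from dataclasses import dataclass, field
from typing import Any


@dataclass
class InvokeResult:
    result: Any                                        # final types.Result of the run
    messages: list[Any] = field(default_factory=list)  # messages from all steps and sub-agents
    steps: list[Any] = field(default_factory=list)     # per-step execution information
    events: list[Any] = field(default_factory=list)    # complete raw event stream
```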
Accessing Result Data
The result field is a types.Result object with this structure:
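The attributes inside types.Result are not spelled out on this page, so the snippet below only retrieves and inspects the object; anything beyond response.result itself would be an assumption.

```python
from erdo.testing import invoke            # assumed import path
from my_project.agents import my_agent     # hypothetical agent module


def agent_test_result_is_present():
    response = invoke(my_agent, message="Classify this support ticket")

    result = response.result               # a types.Result object
    assert result is not None
    print(result)                          # inspect the structure when writing new assertions
```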
Accessing Messages
The messages field contains all messages from all steps, including intermediate steps and sub-agents:
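messages aggregates every step and sub-agent; the attributes of an individual message are SDK-defined, so this sketch sticks to iterating and counting.

```python
from erdo.testing import invoke            # assumed import path
from my_project.agents import my_agent     # hypothetical agent module


def agent_test_messages_are_captured():
    response = invoke(my_agent, message="Plan a product launch checklist")

    # Includes intermediate steps and sub-agent messages
    assert len(response.messages) > 0
    for message in response.messages:
        print(message)
```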
Accessing Steps
The steps field provides execution information:
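A minimal check on steps; individual step attributes are not documented here, so only the list itself is asserted on.

```python
from erdo.testing import invoke            # assumed import path
from my_project.agents import my_agent     # hypothetical agent module


def agent_test_at_least_one_step_ran():
    response = invoke(my_agent, message="Generate the weekly report")

    assert len(response.steps) > 0         # execution information per step
```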
Accessing Raw Events
The events field contains the complete raw event stream for debugging:
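events is mostly useful when a test misbehaves; dumping the raw stream is usually enough, and the event shape itself is SDK-defined.

```python
from erdo.testing import invoke            # assumed import path
from my_project.agents import my_agent     # hypothetical agent module


def agent_test_dump_events_for_debugging():
    response = invoke(my_agent, message="Why did the last step fail?")

    # The complete, unfiltered event stream - handy when assertions fail
    for event in response.events:
        print(event)
```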
Test Discovery
The test runner automatically discovers functions matching these criteria:
- Function name: Starts with agent_test_
- Location: Any Python file (typically in the tests/ directory)
- No imports needed: Functions are discovered by pattern matching
Execution Modes
Erdo supports three execution modes, each optimized for different testing scenarios:
Live Mode
Real API calls every time - no caching, production behavior:
- Makes real API calls to LLM providers on every execution
- Non-deterministic results (LLM responses may vary)
- Incurs API costs per test run
- Latest model behavior and responses
- ✅ Validating latest model performance
- ✅ Testing with fresh, current data
- ✅ Verifying cache integrity (compare with replay)
- ✅ Production smoke tests
- ❌ CI/CD pipelines (expensive)
- ❌ Rapid development iteration (slow)
Replay Mode (Recommended)
Intelligent caching - the first run executes, subsequent runs use cached responses.
First run:
- Executes the agent with real LLM API calls
- Generates a cache key from the bot definition and parameters
- Stores LLM responses in the database
- Returns the result (same as live mode)
Subsequent runs:
- Computes the same cache key
- Retrieves the cached LLM response from the database
- Returns the result instantly (no API call)
- Deterministic, identical response
- ✅ CI/CD pipelines (fast, free after first run)
- ✅ Development iteration (instant feedback)
- ✅ Regression testing (deterministic results)
- ✅ Most integration tests (99% use case)
- ❌ Testing latest model updates
- ❌ Validating real-time data
How Replay Caching Works
Cache Key Generation: The cache key is computed from:
- bot_id: Unique identifier for the agent
- bot_updated_at: Timestamp of the last bot definition change
- action_type: Type of action (e.g., llm.message, codeexec.execute)
- parameters: Action parameters (messages, model, temperature, etc.)
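The backend computes this key for you; the sketch below is only a conceptual illustration of how the four inputs listed above could combine into a stable key, not erdo's actual algorithm.

```python
# Conceptual illustration only - not erdo's real implementation.
import hashlib
import json


def replay_cache_key(bot_id: str, bot_updated_at: str,
                     action_type: str, parameters: dict) -> str:
    """Combine the four documented inputs into a deterministic key."""
    payload = json.dumps(
        {
            "bot_id": bot_id,
            "bot_updated_at": bot_updated_at,  # changes whenever the agent is modified
            "action_type": action_type,        # e.g. "llm.message"
            "parameters": parameters,          # messages, model, temperature, ...
        },
        sort_keys=True,
        default=str,
    )
    return hashlib.sha256(payload.encode()).hexdigest()
```

Any change to the bot definition or to the action parameters yields a different key, which is why editing an agent automatically invalidates its cached responses.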
What gets cached:
- ✅ llm.message - LLM API responses
- ✅ Other deterministic actions (varies by action type)
- ❌ Runtime metadata (invocation_id, thread_id, timestamps)
- ❌ Non-deterministic actions
Cache invalidation - the cache is refreshed when:
- Bot definition changes: bot_updated_at changes when you modify your agent
- Parameters change: Different messages, model, or configuration
- Manual refresh: Use the refresh option (see below)
Cache storage:
- Cached responses are stored in the erdo backend database
- Table: cached_action_response
- Scoped to bot and parameters
- No local file cache (accessible across machines)
Refresh Cache
Force a cache refresh to re-execute and update cached responses:
- After significant agent changes
- When testing new model behavior
- When you suspect stale cache data
- For an intentional cache reset
Manual Mode
Developer-controlled responses - fully deterministic with explicit mocks (see the sketch after the lists below):
- No API calls - all responses come from manual_mocks
- Completely deterministic and fast
- Requires an explicit mock for each action
- Fails if an action is executed but no mock is provided
- ✅ Error handling tests (simulate specific errors)
- ✅ Deterministic scenarios (controlled outputs)
- ✅ Offline development (no backend connection needed)
- ✅ Edge case testing (rare scenarios)
- ❌ Real behavior validation
- ❌ End-to-end integration tests
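Below is a sketch of the mock shape implied by this page and by the Troubleshooting notes further down: keys match action types exactly (e.g., llm.message), and each entry carries both status and output. How the mocks are handed to the runner is not documented here, so the invoke(..., manual_mocks=...) keyword is an assumption.

```python
from erdo.testing import invoke            # assumed import path
from my_project.agents import my_agent     # hypothetical agent module

# Keys must match the action type exactly; each mock needs "status" and "output".
# The output shape shown here is an assumption - match it to your action type.
manual_mocks = {
    "llm.message": {
        "status": "error",
        "output": {"error": "rate limit exceeded"},
    },
}


def agent_test_handles_llm_rate_limit():
    # Passing mocks via a keyword argument is an assumption; check your SDK.
    response = invoke(my_agent, message="Hello", manual_mocks=manual_mocks)

    # Assert the agent surfaces the simulated failure gracefully
    assert response.result is not None
```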
Mode Comparison
| Aspect | Live | Replay | Manual |
|---|---|---|---|
| API Calls | ✅ Every run | ✅ First run only | ❌ None |
| Deterministic | ❌ No | ✅ Yes (after first run) | ✅ Yes |
| Speed | 🐢 Slow | ⚡ Fast (cached) / 🐢 Slow (first) | ⚡ Very fast |
| Cost | 💰 High (every run) | 💰 Low (first run only) | 💰 Free |
| Real Behavior | ✅ Current | ✅ Snapshot | ❌ Mocked |
| Setup Required | None | None | Mock definitions |
| Best For | Production validation | CI/CD, development | Error handling, edge cases |
Recommended Strategy:
- Use replay mode for 90% of your integration tests
- Use live mode for weekly production validation
- Use manual mode for error scenarios and edge cases
Test Helpers
Erdo provides assertion helpers for common validation patterns.
Text Assertions
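The specific helper names aren't listed on this page, so the sketch below falls back to plain Python assert statements over the response text; swap in the erdo helpers your version exports.

```python
from erdo.testing import invoke            # assumed import path
from my_project.agents import my_agent     # hypothetical agent module


def agent_test_summary_text():
    response = invoke(my_agent, message="Summarize Q3 revenue performance")

    text = str(response.result)            # assumes the result stringifies usefully
    assert "revenue" in text, "summary should mention revenue"
    assert "TODO" not in text, "summary should not contain placeholder text"
```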
JSON Path Assertions
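Again, the erdo-specific helper is not named here, so this sketch uses the third-party jsonpath-ng package purely to illustrate the kind of check a JSON path assertion performs.

```python
import json

from jsonpath_ng import parse              # pip install jsonpath-ng (illustrative substitute)

from erdo.testing import invoke            # assumed import path
from my_project.agents import my_agent     # hypothetical agent module


def agent_test_json_payload_shape():
    response = invoke(my_agent, message="Return customer stats as JSON")

    payload = json.loads(str(response.result))   # assumes the result is a JSON string
    matches = parse("$.customers[0].name").find(payload)
    assert matches, "expected at least one customer with a name"
```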
Response Object
The invoke() function returns an InvokeResult object with clean separation following the executor pattern; see InvokeResult Structure above for field-by-field access examples.
Running Integration Tests
Basic Execution
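In its simplest form, point the runner at your tests directory with erdo agent-test tests/ - the same command used in the troubleshooting examples below - and add flags such as -j or --refresh as needed.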
Parallel Execution
Tests run in parallel by default for faster execution.
Cache Management
Use the refresh option (erdo agent-test tests/ --refresh) to re-execute actions and update cached responses when needed.
Test Organization
Recommended File Structure
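One possible layout, consistent with the tests/ directory, feature-based grouping, and the fixtures directory mentioned elsewhere on this page (file names are illustrative):
- tests/ - one agent_test_*.py file per feature (e.g., agent_test_onboarding.py, agent_test_reporting.py)
- tests/fixtures/ - datasets and other test data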
Naming Conventions
Grouping Tests
Best Practices
1. Default to Replay Mode
2. Test Critical User Journeys
3. Use Manual Mode for Error Scenarios
4. Clear Assertions with Messages
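A small illustration of attaching failure messages to assertions, which makes parallel runs much easier to debug; only the documented response fields are used, everything else is ordinary Python (imports as assumed earlier).

```python
from erdo.testing import invoke            # assumed import path
from my_project.agents import my_agent     # hypothetical agent module


def agent_test_report_pipeline():
    response = invoke(my_agent, message="Build the monthly report")

    assert response.result is not None, "agent produced no final result"
    assert len(response.steps) > 0, "expected at least one executed step"
    assert len(response.messages) > 0, "expected messages from the run"
```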
5. Organize Tests by Feature
Continuous Integration
GitHub Actions Example
Cost Optimization
Troubleshooting
Cache Not Hitting
Symptom: Tests run slowly even with replay mode
Common Causes:
- Agent definition changed (invalidates cache)
- Different parameters or messages
- First run (cache being created)
Solutions:
- Check if you recently modified the agent file
- Verify parameters are exactly the same
- Look for bot_updated_at changes in agent metadata
- First run is expected to be slow - subsequent runs will be fast
Manual Mock Format Errors
Symptom: Mock not found or Invalid mock format errors
Common Causes:
- Missing mock for an executed action
- Incorrect action type name
- Invalid output structure
Solutions:
- Ensure the mock key matches the action type exactly (e.g., llm.message)
- Include both status and output fields
- Match the output structure to the action type requirements
- Check the agent execution path to see which actions are called
Tests Not Discovered
Symptom: erdo agent-test finds 0 tests
Common Causes:
- Function name doesn't start with agent_test_
- File not in the specified directory
- Syntax errors in the test file
Solutions:
- Verify the function name starts with agent_test_
- Check the file path is correct
- Run python test_file.py to check for syntax errors
- Ensure the file has a .py extension
Parallel Execution Failures
Symptom: Tests pass individually but fail when run in parallel
Common Causes:
- Shared state between tests
- Resource contention (datasets, memory)
- Non-idempotent operations
Solutions:
- Ensure tests are independent (no shared global state)
- Use unique datasets per test
- Run with -j 1 to debug: erdo agent-test tests/ -j 1
- Add test isolation (separate namespaces, IDs)
Replay Mode Showing Stale Data
Symptom: Cache returns old responses after agent changes
Common Causes:
- Cache not invalidated (shouldn’t happen - automatic)
- Using old agent definition
- Multiple agent versions
Solutions:
- Force refresh: erdo agent-test tests/ --refresh
- Verify you synced the latest agent: erdo sync
- Check the bot_updated_at timestamp
- Clear and recreate the cache with the refresh flag
Dataset Not Found in Tests
Symptom: Dataset not found error in integration tests
Common Causes:
- Dataset not uploaded to erdo
- Incorrect dataset ID/name
- Dataset not accessible to the user
Solutions:
- Upload the dataset: erdo upload-dataset data.csv
- Verify the dataset name matches exactly
- Check dataset permissions
- Use a test fixtures directory for test data