Integration Testing

Integration tests execute your agents with real backend and LLM calls, validating end-to-end functionality and actual behavior. Erdo provides three execution modes to balance test coverage, speed, and cost.
Live Execution: Unlike unit tests, which validate structure locally, integration tests make actual API calls to the erdo backend and LLM providers. Use replay mode to cache responses and minimize costs.

Overview

Integration tests are written as Python functions with the agent_test_* prefix and executed with the erdo agent-test command:
erdo agent-test tests/test_my_agent.py
Key Capabilities:
  • Execute agents with real LLM and backend calls
  • Three modes: live (production), replay (cached), manual (mocked)
  • Parallel execution for fast test runs
  • Rich assertion helpers for validation
  • Automatic test discovery

Writing Integration Tests

Basic Test Structure

from erdo import invoke
from erdo.test import text_contains, json_path_equals

def agent_test_sales_analysis():
    """Test sales data analysis agent."""
    response = invoke(
        "data-analyst",                          # Agent key
        messages=[{
            "role": "user",
            "content": "Analyze Q4 sales trends"
        }],
        datasets=["sales_2024_q4"],             # Optional: datasets to include
        mode="replay"                            # Execution mode
    )

    # Assertions
    assert response.success, f"Agent failed: {response.error}"
    assert response.result['status'] == 'success', "Agent execution should succeed"

    # Extract text content from result.output.content
    content_text = ""
    if response.result.get('output'):
        for item in response.result['output'].get('content', []):
            if item.get('content_type') == 'text':
                content_text += item['content']

    assert text_contains(content_text, "revenue")

    # Verify steps executed
    assert len(response.steps) > 0, "Should have executed steps"

InvokeResult Structure

The invoke() function returns an InvokeResult object with a clean separation between the final result, all messages, executed steps, and the raw event stream:
class InvokeResult:
    success: bool                    # Whether invocation succeeded
    bot_id: Optional[str]           # Bot identifier
    invocation_id: Optional[str]    # Unique invocation ID
    result: Optional[Dict]          # types.Result object (status/parameters/output/message/error)
    messages: List[Dict[str, Any]]  # All messages from all steps (including sub-agents)
    steps: List[Dict[str, Any]]     # Information about executed steps
    events: List[Dict[str, Any]]    # Complete raw event stream for debugging
    error: Optional[str]            # Error message if failed
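
Because the result.output.content extraction loop shown on this page appears in most tests, it can be convenient to factor it into a small local helper. A minimal sketch (the extract_text name is our own convention, not part of erdo.test):
def extract_text(response) -> str:
    """Concatenate all text content items from response.result.output.content."""
    content_text = ""
    result = response.result or {}
    for item in (result.get("output") or {}).get("content", []):
        if item.get("content_type") == "text":
            content_text += item["content"]
    return content_text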

Accessing Result Data

The result field is a types.Result object with this structure:
{
    "status": "success",           # "success" or "error"
    "parameters": {...},           # Input parameters
    "output": {                    # Output content
        "content": [               # Array of content items
            {
                "content_type": "text",
                "content": "The agent's response..."
            }
        ]
    },
    "message": None,               # Optional message
    "error": None                  # Error details if status is "error"
}
Example:
response = invoke("my-agent", messages=[...], mode="replay")

# Check result status
assert response.result['status'] == 'success'

# Extract text content from result.output.content
content_text = ""
if response.result.get('output'):
    for item in response.result['output'].get('content', []):
        if item.get('content_type') == 'text':
            content_text += item['content']

print(content_text)  # "The agent's response..."

Accessing Messages

The messages field contains all messages from all steps, including intermediate steps and sub-agents:
response = invoke("agent-with-sub-agents", messages=[...], mode="replay")

# Iterate through all messages
for msg in response.messages:
    print(f"{msg['role']}: {msg['content']}")

# Output might include:
# assistant: Processing your request...
# assistant: Calling data analyzer sub-agent...
# assistant: Analysis complete. Here are the results...
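
In a test, this makes it possible to assert on intermediate behavior (for example, whether a sub-agent was actually involved) rather than only on the final result. A minimal sketch, continuing from the example above (the "sub-agent" substring depends on what your agent actually emits):
assistant_messages = [m for m in response.messages if m.get("role") == "assistant"]
assert len(assistant_messages) > 0, "Expected at least one assistant message"

# Check that some intermediate message mentions the sub-agent call
assert any("sub-agent" in (m.get("content") or "").lower() for m in assistant_messages)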

Accessing Steps

The steps field provides execution information:
response = invoke("my-agent", messages=[...], mode="replay")

print(f"Executed {len(response.steps)} steps:")
for step in response.steps:
    print(f"  ✓ {step['key']} ({step['action']}) [{step['status']}]")

# Output:
# Executed 3 steps:
#   ✓ parse_input (utils.parse_json) [completed]
#   ✓ analyze (llm.message) [completed]
#   ✓ format_output (utils.echo) [completed]
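
Step information is also useful for asserting that a specific part of the workflow ran. A minimal sketch (the "analyze" key comes from the sample output above; substitute your agent's own step keys):
step_status = {step["key"]: step["status"] for step in response.steps}

# Assert a specific, agent-defined step completed
assert step_status.get("analyze") == "completed", f"Steps seen: {list(step_status)}"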

Accessing Raw Events

The events field contains the complete raw event stream for debugging:
response = invoke("my-agent", messages=[...], mode="replay")

print(f"Total events: {len(response.events)}")

# Access specific events for debugging
for event in response.events:
    if event.get('metadata', {}).get('user_visibility') == 'visible':
        print(f"Event: {event.get('payload')}")

Test Discovery

The test runner automatically discovers functions matching these criteria:
  • Function name: Starts with agent_test_
  • Location: Any Python file (typically in tests/ directory)
  • No imports needed: Functions are discovered by pattern matching
# tests/test_data_analyst.py

def agent_test_basic_query():
    """Test basic data query."""
    pass  # ✅ Discovered

def agent_test_csv_analysis():
    """Test CSV file analysis."""
    pass  # ✅ Discovered

def helper_function():
    """Helper for tests."""
    pass  # ❌ Not discovered (no agent_test_ prefix)

def test_something():
    """Regular test."""
    pass  # ❌ Not discovered (wrong prefix)

Execution Modes

Erdo supports three execution modes, each optimized for different testing scenarios:

Live Mode

Real API calls every time - no caching, production behavior:
response = invoke(
    "my-agent",
    messages=[{"role": "user", "content": "test query"}],
    mode="live"  # or omit mode parameter (live is default)
)
Behavior:
  • Makes real API calls to LLM providers every execution
  • Non-deterministic results (LLM responses may vary)
  • Incurs API costs per test run
  • Latest model behavior and responses
When to Use:
  • ✅ Validating latest model performance
  • ✅ Testing with fresh, current data
  • ✅ Verifying cache integrity (compare with replay)
  • ✅ Production smoke tests
  • ❌ CI/CD pipelines (expensive)
  • ❌ Rapid development iteration (slow)
Example Use Case:
def agent_test_production_validation():
    """Weekly validation with latest model behavior."""
    response = invoke(
        "customer-support",
        messages=[{"role": "user", "content": "How do I reset my password?"}],
        mode="live"  # Always get fresh LLM response
    )

    assert response.success
    assert text_contains(str(response.result), "password reset link")

Replay Mode

Intelligent caching - the first run executes live; subsequent runs use cached responses:
response = invoke(
    "my-agent",
    messages=[{"role": "user", "content": "test query"}],
    mode="replay"  # 🌟 Recommended for most tests
)
Behavior:
First Execution:
  1. Executes agent with real LLM API calls
  2. Generates cache key from bot definition and parameters
  3. Stores LLM responses in database
  4. Returns result (same as live mode)
Subsequent Executions:
  1. Computes same cache key
  2. Retrieves cached LLM response from database
  3. Returns result instantly (no API call)
  4. Deterministic, identical response
When to Use:
  • ✅ CI/CD pipelines (fast, free after first run)
  • ✅ Development iteration (instant feedback)
  • ✅ Regression testing (deterministic results)
  • ✅ Most integration tests (the common case)
  • ❌ Testing latest model updates
  • ❌ Validating real-time data
Example Use Case:
def agent_test_csv_analysis():
    """Test CSV analysis - runs instantly after first execution."""
    response = invoke(
        "data-analyst",
        messages=[{"role": "user", "content": "Analyze this CSV"}],
        datasets=["sample_data.csv"],
        mode="replay"  # First run: API calls + caching, subsequent: instant
    )

    assert response.success
    assert json_path_exists(response.result, "summary.total_rows")

How Replay Caching Works

Cache Key Generation: The cache key is computed from:
  • bot_id: Unique identifier for the agent
  • bot_updated_at: Timestamp of last bot definition change
  • action_type: Type of action (e.g., llm.message, codeexec.execute)
  • parameters: Action parameters (messages, model, temperature, etc.)
# Internally, the system computes:
cache_key = SHA256(
    bot_id +
    bot_updated_at +
    action_type +
    serialize(cacheable_parameters)
)
What Gets Cached:
  • ✅ llm.message - LLM API responses
  • ✅ Other deterministic actions (varies by action type)
  • ❌ Runtime metadata (invocation_id, thread_id, timestamps)
  • ❌ Non-deterministic actions
Cache Invalidation: The cache automatically invalidates when:
  • Bot definition changes: bot_updated_at changes when you modify your agent
  • Parameters change: Different messages, model, or configuration
  • Manual refresh: Use refresh option (see below)
# Cache invalidates automatically when bot changes
# Edit agent definition → save → sync
# Next test run will execute live and create new cache
Storage:
  • Cached responses are stored in the erdo backend database
  • Table: cached_action_response
  • Scoped to bot and parameters
  • No local file cache (accessible across machines)
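
Because parameters are part of the cache key, two invocations share a cache entry only when the agent and its inputs match exactly. A minimal sketch (agent key and messages are illustrative):
# Identical agent + messages: the second call reuses the cached LLM response
first = invoke("data-analyst", messages=[{"role": "user", "content": "Summarize Q4 sales"}], mode="replay")
second = invoke("data-analyst", messages=[{"role": "user", "content": "Summarize Q4 sales"}], mode="replay")

# A different message changes the cache key, so this call executes live and caches separately
other = invoke("data-analyst", messages=[{"role": "user", "content": "Summarize Q3 sales"}], mode="replay")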

Refresh Cache

Force cache refresh to re-execute and update cached responses:
# Option 1: In code
response = invoke(
    "my-agent",
    messages=[{"role": "user", "content": "test"}],
    mode={"mode": "replay", "refresh": True}  # Re-execute and update cache
)

# Option 2: CLI flag
# erdo agent-test tests/ --refresh
When to Refresh:
  • After significant agent changes
  • Testing new model behavior
  • Suspecting stale cache data
  • Intentional cache reset

Manual Mode

Developer-controlled responses - fully deterministic with explicit mocks:
response = invoke(
    "my-agent",
    messages=[{"role": "user", "content": "test"}],
    mode="manual",
    manual_mocks={
        "llm.message": {
            "status": "success",
            "output": {"content": "Mocked LLM response here"}
        },
        "codeexec.execute": {
            "status": "success",
            "output": {"stdout": "Execution output", "stderr": ""}
        }
    }
)
Behavior:
  • No API calls - all responses come from manual_mocks
  • Completely deterministic and fast
  • Requires explicit mock for each action
  • Fails if action executed but mock not provided
When to Use:
  • ✅ Error handling tests (simulate specific errors)
  • ✅ Deterministic scenarios (controlled outputs)
  • ✅ Offline development (no backend connection needed)
  • ✅ Edge case testing (rare scenarios)
  • ❌ Real behavior validation
  • ❌ End-to-end integration tests
Mock Structure:
manual_mocks = {
    "action_type": {              # Action type (llm.message, codeexec.execute, etc.)
        "status": "success|error", # Execution status
        "output": {...},           # Action output (varies by action type)
        "error": "..."            # Optional: error message if status="error"
    }
}
Example Use Cases:
Testing Error Handling:
def agent_test_llm_error_handling():
    """Test agent behavior when LLM call fails."""
    response = invoke(
        "my-agent",
        messages=[{"role": "user", "content": "test"}],
        mode="manual",
        manual_mocks={
            "llm.message": {
                "status": "error",
                "error": "Rate limit exceeded"
            }
        }
    )

    # Verify error handler was triggered
    assert not response.success
    assert "rate limit" in response.error.lower()
Testing Specific Edge Cases:
def agent_test_empty_dataset():
    """Test agent handling of empty dataset."""
    response = invoke(
        "data-analyst",
        messages=[{"role": "user", "content": "Analyze data"}],
        mode="manual",
        manual_mocks={
            "llm.message": {
                "status": "success",
                "output": {
                    "content": "The dataset appears to be empty. No analysis possible."
                }
            }
        }
    )

    assert response.success
    assert text_contains(str(response.result), "empty")
Deterministic Multi-Step Tests:
def agent_test_multi_step_workflow():
    """Test complex workflow with controlled outputs."""
    response = invoke(
        "complex-agent",
        messages=[{"role": "user", "content": "Process this"}],
        mode="manual",
        manual_mocks={
            # First step: LLM analysis
            "llm.message": {
                "status": "success",
                "output": {"content": "Analysis: High priority item"}
            },
            # Second step: Code execution
            "codeexec.execute": {
                "status": "success",
                "output": {"stdout": "Item processed successfully"}
            },
            # Third step: Memory storage
            "memory.store": {
                "status": "success",
                "output": {"memory_id": "mem_123"}
            }
        }
    )

    assert response.success

Mode Comparison

| Aspect | Live | Replay | Manual |
| --- | --- | --- | --- |
| API Calls | ✅ Every run | ✅ First run only | ❌ None |
| Deterministic | ❌ No | ✅ Yes (after first run) | ✅ Yes |
| Speed | 🐢 Slow | ⚡ Fast (cached) / 🐢 Slow (first) | ⚡ Very fast |
| Cost | 💰 High (every run) | 💰 Low (first run only) | 💰 Free |
| Real Behavior | ✅ Current | ✅ Snapshot | ❌ Mocked |
| Setup Required | None | None | Mock definitions |
| Best For | Production validation | CI/CD, development | Error handling, edge cases |
Recommended Strategy:
  • Use replay mode for 90% of your integration tests
  • Use live mode for weekly production validation
  • Use manual mode for error scenarios and edge cases
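
One way to apply this strategy without editing individual tests is to choose the mode from an environment variable, so CI defaults to replay while a scheduled job can opt into live runs. A sketch (the ERDO_TEST_MODE variable name is our own convention, not an erdo feature):
import os

from erdo import invoke

# Scheduled "production validation" jobs can export ERDO_TEST_MODE=live
TEST_MODE = os.environ.get("ERDO_TEST_MODE", "replay")

def agent_test_mode_from_env():
    """Standard workflow test that follows the environment-selected mode."""
    response = invoke(
        "my-agent",
        messages=[{"role": "user", "content": "test query"}],
        mode=TEST_MODE,
    )
    assert response.success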

Test Helpers

Erdo provides assertion helpers for common validation patterns:

Text Assertions

from erdo.test import text_contains, text_equals, text_matches

def agent_test_text_validation():
    response = invoke("agent", messages=[...], mode="replay")
    result = str(response.result)

    # Contains substring (case-insensitive by default)
    assert text_contains(result, "analysis", case_sensitive=False)

    # Exact match
    assert text_equals(result, "Expected exact output")

    # Regex pattern match
    assert text_matches(result, r"\d+ recommendations found")
    assert text_matches(result, r"confidence: \d+\.\d+%")

JSON Path Assertions

from erdo.test import json_path_equals, json_path_exists

def agent_test_json_validation():
    response = invoke("agent", messages=[...], mode="replay")

    # Check if path exists
    assert json_path_exists(response.result, "analysis.summary")
    assert json_path_exists(response.result, "recommendations[0].title")

    # Check path value
    assert json_path_equals(response.result, "analysis.confidence", 0.95)
    assert json_path_equals(response.result, "status", "completed")

Response Object

As described in InvokeResult Structure above, invoke() returns an InvokeResult object; the example below exercises each of its fields:
def agent_test_response_structure():
    response = invoke("agent", messages=[...], mode="replay")

    # Basic properties
    assert response.success                             # Execution succeeded
    assert response.result is not None                  # Result object (types.Result)
    assert response.error is None                       # No error

    # Result structure (types.Result object)
    assert response.result['status'] == 'success'       # Status: success or error
    assert response.result.get('output') is not None    # Output with content array
    assert response.result.get('parameters') is not None # Input parameters

    # Messages from all steps (including sub-agents)
    assert len(response.messages) > 0                   # All messages captured
    for msg in response.messages:
        assert 'role' in msg                            # Each message has role
        assert 'content' in msg                         # Each message has content

    # Steps execution info
    assert len(response.steps) > 0                      # Steps were executed
    for step in response.steps:
        assert 'key' in step                            # Step key
        assert 'action' in step                         # Action type
        assert 'status' in step                         # Execution status

    # Raw events for debugging
    assert len(response.events) > 0                     # Raw event stream

    # Extract text content
    content_text = ""
    if response.result.get('output'):
        for item in response.result['output'].get('content', []):
            if item.get('content_type') == 'text':
                content_text += item['content']

    # Error handling
    if not response.success:
        print(f"Agent failed: {response.error}")

Running Integration Tests

Basic Execution

# Run all tests in file
erdo agent-test tests/test_my_agent.py

# Run all tests in directory
erdo agent-test tests/

# Run with verbose output
erdo agent-test tests/ --verbose

Parallel Execution

Tests run in parallel by default for faster execution:
# Default: Parallel execution
erdo agent-test tests/

# Control parallelism (8 concurrent tests)
erdo agent-test tests/ -j 8

# Sequential execution
erdo agent-test tests/ -j 1

Cache Management

# Refresh all cached responses
erdo agent-test tests/ --refresh

# Normal run (use cache if available)
erdo agent-test tests/

Test Organization

my_project/
├── agents/
│   ├── data_analyst.py          # Agent definitions
│   └── report_generator.py
├── tests/
│   ├── test_data_analyst.py     # Integration tests
│   ├── test_report_generator.py
│   └── fixtures/
│       ├── sample_data.csv      # Test data files
│       └── test_scenarios.json
└── .github/
    └── workflows/
        └── test.yml              # CI/CD configuration

Naming Conventions

# ✅ Good: Clear, descriptive names
def agent_test_csv_analysis_with_missing_values():
    """Test CSV analysis when data has missing values."""
    pass

def agent_test_error_recovery_after_llm_timeout():
    """Test agent recovery when LLM times out."""
    pass

# ❌ Bad: Vague names
def agent_test_1():
    pass

def agent_test_test():
    pass

Grouping Tests

# tests/test_data_analyst.py

# Group 1: Basic functionality
def agent_test_basic_query():
    """Test basic data query."""
    pass

def agent_test_csv_upload():
    """Test CSV file upload and analysis."""
    pass

# Group 2: Error handling
def agent_test_invalid_dataset():
    """Test handling of invalid dataset."""
    pass

def agent_test_llm_timeout():
    """Test recovery from LLM timeout."""
    pass

# Group 3: Advanced features
def agent_test_multi_dataset_analysis():
    """Test analysis across multiple datasets."""
    pass

def agent_test_iterative_refinement():
    """Test iterative analysis refinement."""
    pass

Best Practices

1. Default to Replay Mode

# ✅ Good: Use replay for most tests
def agent_test_standard_workflow():
    response = invoke("agent", messages=[...], mode="replay")
    assert response.success

# ⚠️ Use live sparingly
def agent_test_weekly_production_check():
    """Run weekly to validate latest model behavior."""
    response = invoke("agent", messages=[...], mode="live")
    assert response.success

2. Test Critical User Journeys

# Test the most important user paths
def agent_test_new_user_onboarding():
    """Test complete new user onboarding flow."""
    response = invoke(
        "onboarding-agent",
        messages=[{"role": "user", "content": "I'm a new user"}],
        mode="replay"
    )
    assert response.success
    assert text_contains(str(response.result), "welcome")

def agent_test_data_export():
    """Test critical data export functionality."""
    response = invoke(
        "export-agent",
        messages=[{"role": "user", "content": "Export my data"}],
        datasets=["user_data"],
        mode="replay"
    )
    assert response.success
    assert json_path_exists(response.result, "export_url")

3. Use Manual Mode for Error Scenarios

# ✅ Good: Manual mode for controlled error testing
def agent_test_handles_llm_error():
    """Test error handling when LLM fails."""
    response = invoke(
        "agent",
        messages=[{"role": "user", "content": "test"}],
        mode="manual",
        manual_mocks={
            "llm.message": {
                "status": "error",
                "error": "API timeout"
            }
        }
    )
    assert not response.success
    assert "timeout" in response.error.lower()

4. Clear Assertions with Messages

# ✅ Good: Descriptive assertion messages
def agent_test_analysis_confidence():
    response = invoke("agent", messages=[...], mode="replay")

    assert response.success, f"Agent execution failed: {response.error}"

    confidence_ok = json_path_equals(response.result, "analysis.confidence", 0.9)
    assert confidence_ok, "Expected analysis.confidence to equal 0.9"

    has_recommendations = json_path_exists(response.result, "recommendations")
    assert has_recommendations, "Analysis should include recommendations"

5. Organize Tests by Feature

# tests/test_data_analyst.py - organized by feature area

# === CSV Analysis ===
def agent_test_csv_basic_analysis():
    pass

def agent_test_csv_with_headers():
    pass

def agent_test_csv_missing_values():
    pass

# === JSON Analysis ===
def agent_test_json_simple_structure():
    pass

def agent_test_json_nested_objects():
    pass

# === Error Handling ===
def agent_test_invalid_file_format():
    pass

def agent_test_file_too_large():
    pass

Continuous Integration

GitHub Actions Example

# .github/workflows/test.yml
name: Test Agents

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v2

      - name: Install Erdo CLI
        run: |
          brew install erdoai/tap/erdo
          erdo login --token ${{ secrets.ERDO_TOKEN }}

      - name: Unit Tests (Fast)
        run: erdo test agents/

      - name: Integration Tests (Replay Mode)
        run: erdo agent-test tests/
        # First run on new branch will cache responses
        # Subsequent runs are instant and free

Cost Optimization

# Only run expensive live tests on main branch
- name: Production Validation (Live Mode)
  if: github.ref == 'refs/heads/main'
  run: erdo agent-test tests/production/ --live

Troubleshooting

Symptom: Tests run slowly even with replay mode.
Common Causes:
  • Agent definition changed (invalidates cache)
  • Different parameters or messages
  • First run (cache being created)
Solutions:
  • Check if you recently modified the agent file
  • Verify parameters are exactly the same
  • Look for bot_updated_at changes in agent metadata
  • First run is expected to be slow - subsequent runs will be fast

Symptom: Mock not found or Invalid mock format errors.
Common Causes:
  • Missing mock for executed action
  • Incorrect action type name
  • Invalid output structure
Solutions:
  • Ensure mock key matches action type exactly (e.g., llm.message)
  • Include both status and output fields
  • Match output structure to action type requirements
  • Check agent execution path to see which actions are called
# ✅ Correct format
manual_mocks={
    "llm.message": {
        "status": "success",
        "output": {"content": "Response"}
    }
}

# ❌ Incorrect - missing status
manual_mocks={
    "llm.message": {"content": "Response"}
}

Symptom: erdo agent-test finds 0 tests.
Common Causes:
  • Function name doesn’t start with agent_test_
  • File not in specified directory
  • Syntax errors in test file
Solutions:
  • Verify function name starts with agent_test_
  • Check file path is correct
  • Run python test_file.py to check for syntax errors
  • Ensure file has .py extension
# ✅ Discovered
def agent_test_my_feature():
    pass

# ❌ Not discovered
def test_my_feature():  # Wrong prefix
    pass

def my_test():  # Wrong prefix
    pass

Symptom: Tests pass individually but fail when run in parallel.
Common Causes:
  • Shared state between tests
  • Resource contention (datasets, memory)
  • Non-idempotent operations
Solutions:
  • Ensure tests are independent (no shared global state)
  • Use unique datasets per test
  • Run with -j 1 to debug: erdo agent-test tests/ -j 1
  • Add test isolation (separate namespaces, IDs)
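
For example, keeping all state local to the test function avoids the most common source of parallel flakiness. A minimal sketch:
from erdo import invoke

# ❌ Module-level state can be mutated by several tests running at once
shared_results = {}

# ✅ Keep everything the test needs inside the function
def agent_test_independent_query():
    """Self-contained test: no globals, no files shared with other tests."""
    local_results = {}
    response = invoke(
        "data-analyst",
        messages=[{"role": "user", "content": "Analyze Q4 sales trends"}],
        mode="replay",
    )
    local_results["result"] = response.result
    assert response.success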

Symptom: Cache returns old responses after agent changes.
Common Causes:
  • Cache not invalidated (shouldn’t happen - automatic)
  • Using old agent definition
  • Multiple agent versions
Solutions:
  • Force refresh: erdo agent-test tests/ --refresh
  • Verify you synced the latest agent: erdo sync
  • Check bot_updated_at timestamp
  • Clear and recreate cache with refresh flag

Symptom: Dataset not found error in integration tests.
Common Causes:
  • Dataset not uploaded to erdo
  • Incorrect dataset ID/name
  • Dataset not accessible to user
Solutions:
  • Upload dataset: erdo upload-dataset data.csv
  • Verify dataset name matches exactly
  • Check dataset permissions
  • Use test fixtures directory for test data

Next Steps