Agent Tester (Blueprint)

The Agent Tester is a testing and validation framework that exercises the Clear AI v2 agent system end to end through realistic user scenarios.

Overview

What is the Agent Tester?

A standalone tool that:

🎯 Tests the entire agent pipeline end-to-end via GraphQL
📊 Validates results against expected behavior
📈 Tracks performance, costs, and quality metrics
🐛 Detects regressions before they reach production
🤖 Generates realistic test scenarios automatically
📝 Produces actionable reports and dashboards

Why Build This?

Before deploying to production or building client applications, we need confidence that:

The system works correctly for diverse queries
Performance is acceptable under load
Costs are predictable and within budget
Edge cases are handled gracefully
Regressions are caught automatically

Status

Current Phase: Blueprint Design Complete ✅
Next Phase: Phase 1 Implementation (4-week roadmap)
Location: research/agent-tester/ (13 blueprint files)

Key Features

Comprehensive Testing

Simple Queries - Single-tool functionality
Complex Queries - Multi-tool coordination
Edge Cases - Errors, timeouts, invalid data
Performance Tests - Concurrent requests, load testing
Memory Tests - Context loading, follow-up questions

Rich Validation

Schema Validation - Structure correctness
Data Validation - Content accuracy
Performance Validation - Latency and token usage
Semantic Validation - Meaning correctness using LLM
Business Rule Validation - Domain logic compliance

Deep Metrics

Performance: End-to-end latency breakdown, agent timing
Cost: Token usage, LLM costs, total spend
Quality: Tool selection accuracy, analysis relevance
Health: Success rates, error rates, retry counts

Automated Regression Detection

Baseline comparison
Performance degradation alerts
Accuracy regression detection
Cost increase warnings
Automated CI/CD blocking

Example Test Scenario

id: simple-query-001
name: "List recent shipments"
category: simple
description: "Test basic shipment listing with time filter"

query: "Get shipments from last week"

expectedBehavior:
  toolsUsed: ["shipments"]
  minResults: 1
  responseContains: ["shipment", "week"]
  maxLatencyMs: 5000
  maxTokens: 500

validation:
  - type: tool_selection
    expected: ["shipments"]
  
  - type: data_structure
    schema: shipment_list
  
  - type: semantic
    check: "Response mentions timeframe"
  
  - type: performance
    maxLatencyMs: 5000
    maxTokens: 500
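
To make the expectedBehavior block above concrete, here is a minimal sketch of how a runner might check it against one agent response. The field names under ExpectedBehavior mirror the YAML; the AgentResult shape and the checkExpectedBehavior function are hypothetical and are not defined anywhere in the blueprints.

```typescript
// Mirrors the expectedBehavior keys from the YAML scenario above.
interface ExpectedBehavior {
  toolsUsed: string[];
  minResults: number;
  responseContains: string[];
  maxLatencyMs: number;
  maxTokens: number;
}

// Hypothetical shape of what the runner observed for one query (an assumption).
interface AgentResult {
  toolsUsed: string[];
  resultCount: number;
  responseText: string;
  latencyMs: number;
  totalTokens: number;
}

// Returns human-readable failures; an empty array means every expectation held.
function checkExpectedBehavior(expected: ExpectedBehavior, actual: AgentResult): string[] {
  const failures: string[] = [];

  const missing = expected.toolsUsed.filter((t) => actual.toolsUsed.indexOf(t) === -1);
  if (missing.length > 0) {
    failures.push("missing tools: " + missing.join(", "));
  }

  if (actual.resultCount < expected.minResults) {
    failures.push(`expected at least ${expected.minResults} results, got ${actual.resultCount}`);
  }

  // Case-insensitive substring match for each required phrase.
  const text = actual.responseText.toLowerCase();
  for (const phrase of expected.responseContains) {
    if (text.indexOf(phrase.toLowerCase()) === -1) {
      failures.push(`response does not contain "${phrase}"`);
    }
  }

  if (actual.latencyMs > expected.maxLatencyMs) {
    failures.push(`latency ${actual.latencyMs}ms exceeds ${expected.maxLatencyMs}ms`);
  }
  if (actual.totalTokens > expected.maxTokens) {
    failures.push(`token usage ${actual.totalTokens} exceeds ${expected.maxTokens}`);
  }

  return failures;
}
```

Returning a list of failures rather than a single boolean keeps the report actionable: one run can surface a wrong tool choice and a blown latency budget at the same time.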

Blueprint Documentation

Complete blueprints are available in research/agent-tester/:

Core Documents

README.md - Quick start and navigation
00-overview.md - Purpose, goals, success criteria
01-architecture.md - System design and components

Testing Framework

02-test-scenarios.md - Scenario structure and format
03-scenario-generator.md - Dynamic test generation
04-graphql-client.md - GraphQL integration
05-validation-engine.md - Result validation

Observability

06-metrics-tracking.md - Metrics collection
07-reporting.md - Reports and dashboards

Implementation

08-data-management.md - Test data handling
09-implementation-phases.md - 4-week roadmap
10-usage-examples.md - Practical examples

Future

11-training-integration.md - Model training integration

Implementation Roadmap

Phase 1: Core Framework (Week 1)

Deliverables:

Basic GraphQL client
Scenario runner
Simple validation
Console output
15 test scenarios

Success Criteria: Can run 15 scenarios successfully

Phase 2: Scenario Library (Week 2)

Deliverables:

50 comprehensive scenarios
Advanced validation (schema, semantic, performance)
Metrics collection (SQLite storage)
HTML report generation

Success Criteria: 50+ scenarios passing with metrics

Phase 3: Advanced Features (Week 3)

Deliverables:

Dynamic scenario generation (template, combinatorial, LLM-based)
Performance testing (100+ concurrent requests)
WebSocket subscription tracking
Optional dashboard UI

Success Criteria: 150+ scenarios, load testing works
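
Load testing with 100+ concurrent requests needs some way to cap how many scenarios are in flight at once. The following is a minimal sketch of a bounded-concurrency runner; runWithConcurrency is a hypothetical helper, and each task stands in for one scenario execution (the real GraphQL call is not shown).

```typescript
// Runs all tasks with at most `limit` in flight; results keep the input order.
// If any task rejects, the returned promise rejects, so callers that want a
// full report should catch errors inside each task and return them as results.
async function runWithConcurrency<T>(
  tasks: Array<() => Promise<T>>,
  limit: number,
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;

  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const index = next++; // claimed synchronously, before any await
      results[index] = await tasks[index]();
    }
  }

  const workerCount = Math.max(1, Math.min(limit, tasks.length));
  const workers: Promise<void>[] = [];
  for (let i = 0; i < workerCount; i++) {
    workers.push(worker());
  }
  await Promise.all(workers);
  return results;
}
```

A worker pool like this keeps the request rate steady, whereas firing every request through a single Promise.all would hit the server with the whole suite at once.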

Phase 4: Production Ready (Week 4)

Deliverables:

CI/CD integration (GitHub Actions)
Regression detection
Historical tracking
Complete documentation

Success Criteria: Runs in CI/CD, blocks regressions

Key Capabilities

Scenario Generation

Generate test scenarios using:

Templates - Fill variable parameters
Combinatorial - Test all tool/filter combinations
Fuzz Testing - Random/invalid inputs
LLM-Based - Generate realistic queries
Adversarial - Intentionally difficult queries
Data-Driven - Based on actual usage patterns
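
As an illustration of the first two strategies, the sketch below fills a query template and then builds the Cartesian product of tools and time filters. Everything here (GeneratedScenario, fillTemplate, combineToolsAndTimeframes, the template wording) is hypothetical and only meant to show the idea.

```typescript
interface GeneratedScenario {
  id: string;
  query: string;
  expectedTools: string[];
}

// Template generation: replace each {name} placeholder with its value.
// Unknown placeholders are left untouched so mistakes stay visible.
function fillTemplate(template: string, values: Record<string, string>): string {
  return template.replace(/\{(\w+)\}/g, (match: string, key: string) =>
    Object.prototype.hasOwnProperty.call(values, key) ? values[key] : match,
  );
}

// Combinatorial generation: one scenario per (tool, timeframe) pair.
function combineToolsAndTimeframes(tools: string[], timeframes: string[]): GeneratedScenario[] {
  const scenarios: GeneratedScenario[] = [];
  for (const tool of tools) {
    for (const timeframe of timeframes) {
      scenarios.push({
        id: `generated-${scenarios.length + 1}`,
        query: fillTemplate("Get {tool} from {timeframe}", { tool, timeframe }),
        expectedTools: [tool],
      });
    }
  }
  return scenarios;
}
```

Two tools and two timeframes yield four scenarios. The count grows multiplicatively with each added dimension, which is why combinatorial generation is usually paired with sampling or pruning.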

Validation Strategies

5 Validation Types:

Tool Selection - Correct tools chosen
Data Structure - Response format correct
Semantic - Meaning is accurate (LLM-based)
Performance - Meets latency/token budgets
Business Rules - Domain logic satisfied

Each validation produces a confidence score (0.0 - 1.0).
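
The blueprint does not say how individual scores are computed or combined, so the following is only one possible sketch. It assumes tool selection is scored by set overlap (Jaccard similarity), performance by how far the latency overshoots its budget, and the overall result by the mean score with a per-validator pass threshold. All names and scoring rules here are assumptions.

```typescript
interface ValidationResult {
  type: string;
  confidence: number; // 0.0 (clearly wrong) to 1.0 (clearly right)
}

function clamp01(x: number): number {
  return Math.min(1, Math.max(0, x));
}

// Jaccard similarity of expected vs. actual tools (assumes no duplicate names).
function validateToolSelection(expected: string[], actual: string[]): ValidationResult {
  const shared = expected.filter((t) => actual.indexOf(t) !== -1).length;
  const extra = actual.filter((t) => expected.indexOf(t) === -1).length;
  const union = expected.length + extra;
  return { type: "tool_selection", confidence: union === 0 ? 1 : shared / union };
}

// Full confidence within budget; otherwise scaled by how far the budget was exceeded.
function validatePerformance(latencyMs: number, maxLatencyMs: number): ValidationResult {
  const confidence = latencyMs <= maxLatencyMs ? 1 : clamp01(maxLatencyMs / latencyMs);
  return { type: "performance", confidence };
}

// Overall confidence is the mean; the scenario passes only if every
// individual validator reaches the threshold.
function combineValidations(
  results: ValidationResult[],
  threshold: number,
): { confidence: number; passed: boolean } {
  if (results.length === 0) return { confidence: 0, passed: false };
  let sum = 0;
  let passed = true;
  for (const r of results) {
    sum += r.confidence;
    if (r.confidence < threshold) passed = false;
  }
  return { confidence: sum / results.length, passed };
}
```

Requiring every validator to clear the threshold means a single badly wrong dimension (for example, the wrong tool) fails the scenario even when the average looks healthy.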

Metrics Tracked

Performance Metrics:

End-to-end latency
Agent breakdown (planner, executor, analyzer, summarizer)
Tool execution time
Memory query time

Cost Metrics:

Token usage (prompt + completion)
LLM API costs
Per-query costs
Total run costs

Quality Metrics:

Tool selection accuracy
Analysis relevance
Response helpfulness
Validation confidence

Health Metrics:

Success rate
Error rate
Timeout rate
Retry count
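
A rough sketch of how these metrics could be rolled up from per-scenario records follows. The RunRecord and Pricing shapes, the status values, and the choice of a nearest-rank p95 are all assumptions made for illustration; actual token prices depend on the model in use.

```typescript
type Status = "passed" | "failed" | "error" | "timeout";

// Hypothetical per-scenario record captured by the runner.
interface RunRecord {
  scenarioId: string;
  status: Status;
  latencyMs: number;
  promptTokens: number;
  completionTokens: number;
  retries: number;
}

// Hypothetical pricing, in USD per one million tokens.
interface Pricing {
  promptUsdPerMillion: number;
  completionUsdPerMillion: number;
}

interface RunSummary {
  successRate: number;
  errorRate: number;
  timeoutRate: number;
  totalRetries: number;
  totalTokens: number;
  totalCostUsd: number;
  costPerQueryUsd: number;
  p95LatencyMs: number;
}

function rate(records: RunRecord[], status: Status): number {
  if (records.length === 0) return 0;
  return records.filter((r) => r.status === status).length / records.length;
}

// Nearest-rank percentile for p in (0, 1]; returns 0 for an empty input.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil(p * sorted.length));
  return sorted[rank - 1];
}

function summarize(records: RunRecord[], pricing: Pricing): RunSummary {
  let promptTokens = 0;
  let completionTokens = 0;
  let totalRetries = 0;
  for (const r of records) {
    promptTokens += r.promptTokens;
    completionTokens += r.completionTokens;
    totalRetries += r.retries;
  }
  const totalCostUsd =
    (promptTokens * pricing.promptUsdPerMillion +
      completionTokens * pricing.completionUsdPerMillion) / 1e6;
  return {
    successRate: rate(records, "passed"),
    errorRate: rate(records, "error"),
    timeoutRate: rate(records, "timeout"),
    totalRetries,
    totalTokens: promptTokens + completionTokens,
    totalCostUsd,
    costPerQueryUsd: records.length === 0 ? 0 : totalCostUsd / records.length,
    p95LatencyMs: percentile(records.map((r) => r.latencyMs), 0.95),
  };
}
```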

Reporting

Real-Time Console:

Running test suite: comprehensive-v1
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45/100 (45%)

✓ simple-001: List shipments (1.2s, $0.01, 95%)
✓ simple-002: Search facilities (0.9s, $0.01, 98%)
✗ complex-001: Contamination analysis (5.2s, $0.05, 45%)
  └─ Tool selection incorrect

Current: 43 passed, 2 failed, 55 pending

Summary Reports:

JSON export for automation
HTML dashboard for visualization
PDF for sharing with stakeholders
Comparison with baseline

Regression Detection:

Automatic baseline comparison
Alert on performance degradation
Flag cost increases
Block CI/CD on critical failures
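
Baseline comparison could look roughly like the sketch below. It assumes a drop in success rate is treated as critical (and therefore blocks CI/CD), while latency and cost increases only raise warnings. The SuiteSnapshot and Thresholds shapes and the severity mapping are assumptions; the blueprint does not specify them.

```typescript
// Hypothetical summary of one suite run, stored as the baseline or compared against it.
interface SuiteSnapshot {
  successRate: number;      // 0.0 to 1.0
  p95LatencyMs: number;
  costPerQueryUsd: number;
}

interface Thresholds {
  maxSuccessRateDrop: number;   // absolute, e.g. 0.05 = five percentage points
  maxLatencyIncrease: number;   // relative, e.g. 0.2 = 20% slower
  maxCostIncrease: number;      // relative, e.g. 0.1 = 10% more expensive
}

interface Regression {
  metric: keyof SuiteSnapshot;
  baseline: number;
  current: number;
  severity: "warning" | "critical";
}

function relativeIncrease(baseline: number, current: number): number {
  if (baseline === 0) return current > 0 ? Infinity : 0;
  return (current - baseline) / baseline;
}

function detectRegressions(
  baseline: SuiteSnapshot,
  current: SuiteSnapshot,
  t: Thresholds,
): Regression[] {
  const found: Regression[] = [];
  if (baseline.successRate - current.successRate > t.maxSuccessRateDrop) {
    found.push({ metric: "successRate", baseline: baseline.successRate, current: current.successRate, severity: "critical" });
  }
  if (relativeIncrease(baseline.p95LatencyMs, current.p95LatencyMs) > t.maxLatencyIncrease) {
    found.push({ metric: "p95LatencyMs", baseline: baseline.p95LatencyMs, current: current.p95LatencyMs, severity: "warning" });
  }
  if (relativeIncrease(baseline.costPerQueryUsd, current.costPerQueryUsd) > t.maxCostIncrease) {
    found.push({ metric: "costPerQueryUsd", baseline: baseline.costPerQueryUsd, current: current.costPerQueryUsd, severity: "warning" });
  }
  return found;
}

// A CI job would exit non-zero when this returns true.
function shouldBlockCi(regressions: Regression[]): boolean {
  return regressions.some((r) => r.severity === "critical");
}
```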

When to Use

Use Agent Tester for:

✅ System Validation - Verify full pipeline works
✅ Performance Benchmarking - Measure latency and costs
✅ Regression Testing - Catch breaking changes
✅ Load Testing - Test concurrent request handling
✅ Pre-Deployment - Validate before production
✅ Demonstrating Capabilities - Show what the system can do

Don't Use for:

❌ Unit Testing - Use Jest unit tests instead
❌ Debugging - Use integration tests or manual testing
❌ Development - Too slow for rapid iteration
❌ Component Testing - Use integration tests

Comparison with Current Tests

Aspect	Unit Tests	Integration Tests	Agent Tester
Speed	Fast (< 1s)	Medium (1-10s)	Slower (2-5s per test)
Scope	Single function	Multiple modules	Full system
Dependencies	All mocked	Some real	All real (via GraphQL)
Purpose	Business logic	Module interaction	User experience
Count	802	160+	200+ (future)
When	Every commit	Pre-commit	Pre-deployment

Future: Training Integration

The Agent Tester will eventually support:

Training Data Collection - Gather query/response pairs
Performance Tracking - Monitor model improvement over time
A/B Testing - Compare different models/configurations
Fine-Tuning Pipeline - Prepare data for model fine-tuning
Continuous Learning - Automated model improvement

See the 11-training-integration.md blueprint for details.

Getting Started

Read the Blueprints

Start with README.md in research/agent-tester/
Read 00-overview.md for the big picture
Review 01-architecture.md for technical design
Check 09-implementation-phases.md for the roadmap

Implementation

Once blueprints are approved:

Follow the 4-phase implementation plan
Start with Phase 1 (core framework)
Iterate based on feedback
Deploy to CI/CD in Phase 4

Success Criteria

The Agent Tester is successful when:

✅ 200+ scenarios running in < 10 minutes
✅ Catches regressions before production
✅ Provides actionable failure reports
✅ Integrated into CI/CD pipeline
✅ Used daily by development team
✅ Blocks broken builds automatically
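
The first criterion implies a time budget that is worth working out. Using the 2-5s per test figure from the comparison table, and assuming idealized parallelism with no scheduling overhead (an assumption, not a measured result):

```latex
\begin{align*}
\bar{t}_{\max} &= \frac{10 \times 60\,\mathrm{s}}{200} = 3\,\mathrm{s}\ \text{per scenario if run sequentially} \\
T_{\text{seq, worst}} &= 200 \times 5\,\mathrm{s} = 1000\,\mathrm{s} \approx 16.7\,\mathrm{min} \\
c_{\min} &= \left\lceil \frac{1000\,\mathrm{s}}{600\,\mathrm{s}} \right\rceil = 2\ \text{concurrent requests}
\end{align*}
```

So a purely sequential run only meets the target if scenarios average under 3s. At the slow end of the range, at least two scenarios need to run concurrently.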

Related Documentation

Testing Overview - Testing strategy
Test Coverage - Complete coverage breakdown
Agent Testing - Agent-specific tests
Development Guide - Contributing tests
