Evals

Erdo evals score an agent’s work against a rubric, with pass/score tracking over time. They run in production and are designed to be driven by an AI coding agent (e.g. Claude Code via MCP): change how something is generated, run the suite, read the per-criterion scores, then add, rewrite, or remove cases as the generation changes. Two kinds of suite:

Text suites — judge the agent’s text answer against the rubric (e.g. the data-question-answerer).
Artifact suites (evaluate_artifact: true) — judge the rendered page the agent builds (landing pages, dashboards, apps). Erdo publishes the page, screenshots it on desktop and mobile, drives it in a real browser (submits the lead form, advances the carousel, confirms the voice widget loads, checks the conversion analytics event), and scores the screenshots + interaction with a vision model. A page that looks fine but doesn’t capture the lead scores low.

Suites are referenced by slug. Each suite has cases (name, input brief, rubric of weighted criteria), a judge model, and a pass_threshold (0–5).

Cases run through the same path a real user or API call hits (a thread, a message, and the normal agent invocation), so an eval tests the actual product flow, not a synthetic one. External egress such as email or integration writes is mocked during an eval run; everything else is real.

A case may add setup_messages: ordered turns run before input in the same thread, for multi-step flows (e.g. create a voice widget, then build the landing page wired to it). The judge scores the result of input.
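As a rough illustration of the structure described above, a suite with one multi-step case might be written as the JavaScript object below. The names slug, agent, name, input, rubric, criterion, weight, pass_threshold, evaluate_artifact, and setup_messages appear on this page; the key names judge_model and cases, the representation of setup_messages as plain strings, and the threshold value are assumptions made for this sketch, not a confirmed schema. validateSuite is a hypothetical helper.

```javascript
// Sketch of a suite with one multi-step case. Keys marked "assumed" are
// illustrative guesses, not a confirmed Erdo schema.
const suite = {
  slug: "landing-variations",        // suites are referenced by slug
  agent: "erdo.artifact-builder",
  evaluate_artifact: true,           // artifact suite: judge the rendered page
  judge_model: "<judge-model>",      // assumed key name; placeholder value
  pass_threshold: 3.5,               // illustrative value in the 0-5 range
  cases: [                           // assumed key name
    {
      name: "voice-concierge",
      // Ordered turns run before `input`, in the same thread (assumed: strings).
      setup_messages: ["Create a voice widget for ACME."],
      // For artifact suites the input is the builder's JSON input, as a string.
      input: JSON.stringify({
        artifact_kind: "landing_page",
        description: "Landing page for ACME wired to the voice widget",
      }),
      rubric: [
        { criterion: "voice widget loads and is on-brand", weight: 2 },
        { criterion: "hero + form render", weight: 1 },
      ],
    },
  ],
};

// Hypothetical helper: basic shape checks before submitting the suite anywhere.
function validateSuite(s) {
  const problems = [];
  if (!(s.pass_threshold >= 0 && s.pass_threshold <= 5)) {
    problems.push("pass_threshold must be between 0 and 5");
  }
  for (const c of s.cases) {
    if (!c.name || !c.input) problems.push("case is missing name or input");
    if (!Array.isArray(c.rubric) || c.rubric.length === 0) {
      problems.push(`case ${c.name} has no rubric`);
    }
  }
  return problems;
}
```

Calling validateSuite(suite) returns an empty list for the object above; a threshold outside 0 to 5 or a case without a rubric would each add a message.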

Evaluators

A case can specify evaluators with two distinct roles, so an eval discriminates instead of passing everything:

llm_rubric = the score. An LLM scores the output (or rendered page) against the rubric — the real signal; the case score is its weighted verdict.
script = a gate, not a scorer. A JS evaluate(ctx) returning {score, passed, reasoning}, run deterministically (no LLM tokens). ctx = {input, output, artifact, data_store}. Use it for structural prerequisites (the lead form exists, the test lead actually landed in data_store). A clean gate adds no points; a failed gate hard-fails the case (structure missing = broken).
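A gate script along these lines might look like the sketch below. The page states only that ctx is {input, output, artifact, data_store} and that the function returns {score, passed, reasoning}; the inner fields artifact.html and data_store.leads, and the 0/1 score scale, are assumptions made for illustration.

```javascript
// Hypothetical gate script. `artifact.html` and `data_store.leads` are
// assumed field names; only the top-level ctx keys come from the docs.
function evaluate(ctx) {
  const html = (ctx.artifact && ctx.artifact.html) || "";
  const leads = (ctx.data_store && ctx.data_store.leads) || [];

  // Collect every structural prerequisite that is missing.
  const failures = [];
  if (!/<form[\s>]/i.test(html)) failures.push("no lead form in the page");
  if (leads.length === 0) failures.push("test lead did not reach data_store");

  const passed = failures.length === 0;
  return {
    score: passed ? 1 : 0, // assumed scale; a clean gate adds no points anyway
    passed,
    reasoning: passed
      ? "lead form present and test lead stored"
      : failures.join("; "),
  };
}
```

The check is deterministic and uses no LLM call, which matches the gate role: it answers only whether the structure exists, and leaves quality to the llm_rubric.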

The case score comes from the llm_rubric (a failed gate forces it to 0); a script-only case scores on the gate. Any evaluator that errors fails the case; an error is never a silent pass. Set evaluators via --evaluator (CLI) or the evaluators field (MCP); omit them to use the rubric as a single llm_rubric evaluator.
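The way the two roles combine can be sketched as follows. This is an assumption-laden model, not Erdo's actual scoring code: it treats the "weighted verdict" as a weighted mean of per-criterion scores on the 0 to 5 scale, treats a case as passing when its score is at least pass_threshold, and does not model script-only cases. The function names and input shapes are hypothetical.

```javascript
// Assumed: the rubric verdict is a weighted mean of per-criterion scores (0-5).
function weightedScore(criteria) {
  const totalWeight = criteria.reduce((sum, c) => sum + c.weight, 0);
  if (totalWeight === 0) return 0;
  return criteria.reduce((sum, c) => sum + c.score * c.weight, 0) / totalWeight;
}

// Combine rubric scores with gate results for one case.
// gates: [{ passed: boolean, error?: string }] (hypothetical shape).
function caseResult({ rubricScores, gates = [], passThreshold }) {
  // Any evaluator error fails the case outright, never a silent pass.
  if (gates.some((g) => g.error)) {
    return { score: 0, passed: false, reason: "evaluator error" };
  }
  // A failed gate forces the score to 0; a clean gate adds nothing.
  if (gates.some((g) => !g.passed)) {
    return { score: 0, passed: false, reason: "gate failed" };
  }
  const score = weightedScore(rubricScores);
  // Assumed comparison: pass when the score reaches the threshold.
  return { score, passed: score >= passThreshold, reason: "rubric" };
}

const example = caseResult({
  rubricScores: [
    { score: 5, weight: 2 }, // e.g. "voice widget loads and is on-brand"
    { score: 2, weight: 1 }, // e.g. "hero + form render"
  ],
  gates: [{ passed: true }],
  passThreshold: 3.5,
});
// example.score is (5*2 + 2*1) / 3 = 4, so the case passes at threshold 3.5
```

With the same rubric scores but a gate of { passed: false }, the result would be a score of 0 regardless of how well the page was judged.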

CLI

```bash
npm i -g @erdoai/cli

erdo auth login                           # paste an API key (multi-account, like gh)
erdo org list                             # your orgs (* = active)
erdo org use acme                         # set the active org (evals are org-scoped)

erdo eval suites                          # list suites
erdo eval create landing-variations --agent erdo.artifact-builder \
  --evaluate-artifact --no-cron \
  --case '{"name":"voice","input":"{\"artifact_kind\":\"landing_page\",\"description\":\"...\"}","rubric":[{"criterion":"voice widget loads","weight":2}]}'
erdo eval suite landing-variations        # show a suite + cases
erdo eval run landing-variations --watch  # run and poll to completion (CI-friendly: non-zero on failure)
erdo eval results <run-id>                # per-case scores + lenses
erdo eval runs --suite landing-variations # recent runs

erdo eval case add landing-variations \
  --name voice-concierge \
  --input '{"artifact_kind":"landing_page","description":"Landing page for ACME with a voice concierge ..."}' \
  --rubric '[{"criterion":"voice widget loads and is on-brand","weight":2},{"criterion":"hero + form render","weight":1}]'
erdo eval case rm landing-variations voice-concierge
```

For artifact suites the case input is the builder’s JSON input — artifact_kind (landing_page | dashboard | app) plus a description.
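In the --case flag shown in the CLI section, the case input is itself a JSON string nested inside the case JSON, which is easy to get wrong by hand. A small helper like the one below could generate that value; buildCaseArg is a hypothetical name, and the layout simply mirrors the --case example on this page.

```javascript
// Allowed artifact kinds, as listed on this page.
const ARTIFACT_KINDS = ["landing_page", "dashboard", "app"];

// Hypothetical helper: builds the JSON value for `--case`, with the builder
// input encoded as a JSON string inside the case object.
function buildCaseArg({ name, artifactKind, description, rubric }) {
  if (!ARTIFACT_KINDS.includes(artifactKind)) {
    throw new Error(`unknown artifact_kind: ${artifactKind}`);
  }
  const input = JSON.stringify({ artifact_kind: artifactKind, description });
  return JSON.stringify({ name, input, rubric });
}

const caseArg = buildCaseArg({
  name: "voice",
  artifactKind: "landing_page",
  description: "Landing page for ACME",
  rubric: [{ criterion: "voice widget loads", weight: 2 }],
});
```

The resulting string can be passed in single quotes on the command line, provided the description itself contains no single quote.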

MCP tools

These tools are available on the MCP server. Identity and permissions come from your token, so you see only your organization’s suites and global suites.

| Tool | Purpose |
| --- | --- |
| `erdo_list_eval_suites` | List suites (slug, agent, settings) |
| `erdo_get_eval_suite` | A suite with all its cases |
| `erdo_create_eval_suite` | Create a suite + first cases (set `evaluate_artifact` for visual suites) |
| `erdo_add_eval_case` / `erdo_update_eval_case` / `erdo_remove_eval_case` | Maintain the case corpus |
| `erdo_run_eval_suite` | Run a suite, returns `run_id` |
| `erdo_get_eval_run` | Run status + per-case scores, per-criterion breakdown, judge reasoning, artifact links |
| `erdo_list_eval_runs` | Recent runs with aggregate pass counts/scores |

REST

Base URL: https://api.erdo.ai. Authenticate each request with the header Authorization: Bearer YOUR_API_KEY.

| Method | Path |
| --- | --- |
| GET | `/v1/evals/suites` |
| GET | `/v1/evals/suites/{suiteSlug}` |
| POST | `/v1/evals/suites` |
| POST | `/v1/evals/suites/{suiteSlug}/cases` |
| PUT | `/v1/evals/suites/{suiteSlug}/cases/{caseName}` |
| DELETE | `/v1/evals/suites/{suiteSlug}/cases/{caseName}` |
| POST | `/v1/evals/suites/{suiteSlug}/run` |
| GET | `/v1/evals/runs/{runID}` |
| GET | `/v1/evals/runs?suite_slug={slug}&limit={n}` |
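A minimal sketch of calling two of these endpoints from JavaScript follows. The paths, base URL, and Authorization header come from this page; the helper names are hypothetical, and the assumption that responses are JSON is not confirmed here. Request bodies are not documented on this page, so none is sent.

```javascript
const BASE_URL = "https://api.erdo.ai";

// Build a request description (URL + fetch options) without sending it.
function evalRequest(apiKey, method, path, query = {}) {
  const url = new URL(path, BASE_URL);
  for (const [key, value] of Object.entries(query)) {
    url.searchParams.set(key, String(value));
  }
  return {
    url: url.toString(),
    options: { method, headers: { Authorization: `Bearer ${apiKey}` } },
  };
}

// GET /v1/evals/runs?suite_slug={slug}&limit={n}
const listRuns = (apiKey, suiteSlug, limit) =>
  evalRequest(apiKey, "GET", "/v1/evals/runs", { suite_slug: suiteSlug, limit });

// POST /v1/evals/suites/{suiteSlug}/run
const startRun = (apiKey, suiteSlug) =>
  evalRequest(apiKey, "POST", `/v1/evals/suites/${encodeURIComponent(suiteSlug)}/run`);

// Send a prepared request with the global fetch (Node 18+ or a browser).
// Assumed: the API answers with a JSON body.
async function send({ url, options }) {
  const res = await fetch(url, options);
  if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
  return res.json();
}
```

Usage would be along the lines of `await send(startRun(process.env.ERDO_API_KEY, "landing-variations"))`, where ERDO_API_KEY is a hypothetical environment variable name. Keeping request construction separate from sending makes the URLs easy to inspect or test without network access.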

Artifact suites are expensive (they build + render + score real pages), so they’re excluded from the daily cron (cron_enabled: false) and run on demand.
