Evals
Erdo evals score an agent’s work against a rubric, with pass/score tracking over time. They run in production and are designed to be driven by an AI coding agent (e.g. Claude Code via MCP): change how something is generated, run the suite, read the per-criterion scores, then add, rewrite, or remove cases as the generation changes.
Two kinds of suite:
- Text suites: judge the agent’s text answer against the rubric (e.g. the data-question-answerer).
- Artifact suites (evaluate_artifact: true): judge the rendered page the agent builds (landing pages, dashboards, apps). Erdo publishes the page, screenshots it on desktop and mobile, drives it in a real browser (submits the lead form, advances the carousel, confirms the voice widget loads, checks the conversion analytics event), and scores the screenshots and the interaction with a vision model. A page that looks fine but doesn’t capture the lead scores low.

A suite has cases (each with a name, an input brief, and a rubric of weighted criteria), a judge model, and a pass_threshold (0–5).
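
As a rough illustration of how a weighted rubric and pass_threshold could fit together, here is a minimal sketch. The object layout, the field names other than evaluate_artifact and pass_threshold, the judge model name, and the weighted-mean aggregation are all assumptions made for the example, not Erdo's confirmed schema or scoring formula.

```javascript
// Hypothetical suite shape: only evaluate_artifact and pass_threshold are
// named in the docs; every other field name here is illustrative.
const suite = {
  slug: "landing-page-builder",
  evaluate_artifact: true,
  judge_model: "example-judge-model",
  pass_threshold: 3.5,
  cases: [
    {
      name: "saas-lead-capture",
      input: {
        artifact_kind: "landing_page",
        description: "Landing page for a B2B SaaS trial with a lead form",
      },
      rubric: [
        { criterion: "Lead form submits and stores the lead", weight: 3 },
        { criterion: "Layout holds up on mobile", weight: 2 },
        { criterion: "Copy matches the brief", weight: 1 },
      ],
    },
  ],
};

// Assumed aggregation: weighted mean of per-criterion scores on the 0-5 scale.
function weightedScore(rubric, scores) {
  let total = 0;
  let weightSum = 0;
  rubric.forEach((item, i) => {
    total += item.weight * scores[i];
    weightSum += item.weight;
  });
  return weightSum === 0 ? 0 : total / weightSum;
}

// Per-criterion scores as a judge might return them (made-up values).
const score = weightedScore(suite.cases[0].rubric, [5, 4, 2]);
// Assumed pass rule: score at or above the threshold passes.
const passed = score >= suite.pass_threshold;
console.log(score.toFixed(2), passed); // 4.17 true
```

Weighting the lead-capture criterion highest mirrors the point above: a page that looks fine but loses the lead should not reach the threshold.
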
Cases run through the same path a real user/API call hits — a thread, a message,
and the normal agent invocation — so an eval tests the actual product flow, not a
synthetic one. (External egress like email/integration writes is mocked during an
eval run; everything else is real.) A case may add setup_messages: ordered turns
run before input in the same thread, for multi-step flows — e.g. create a
voice widget, then build the landing page wired to it. The judge scores the
result of input.
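
A multi-step case might look like the sketch below. The setup_messages and input fields come from the description above; the case name, the message wording, and the orderedTurns helper are hypothetical and only show the ordering, not how Erdo runs the thread.

```javascript
// Hypothetical case definition for a two-step flow.
const evalCase = {
  name: "landing-page-with-voice-widget",
  setup_messages: ["Create a voice widget that answers pricing questions."],
  input: JSON.stringify({
    artifact_kind: "landing_page",
    description: "Landing page wired to the voice widget created earlier in this thread",
  }),
};

// Lists the turns in the order they would be sent in one thread.
// Only the final turn (input) is marked as judged.
function orderedTurns(c) {
  const setup = (c.setup_messages || []).map((text) => ({ text, judged: false }));
  return [...setup, { text: c.input, judged: true }];
}

const turns = orderedTurns(evalCase);
console.log(turns.map((t) => t.judged)); // [ false, true ]
```
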
Evaluators
A case can specify evaluators with two distinct roles, so an eval discriminates instead of passing everything:
- llm_rubric = the score. An LLM scores the output (or rendered page) against the rubric. This is the real signal; the case score is its weighted verdict.
- script = a gate, not a scorer. A JS evaluate(ctx) returning {score, passed, reasoning}, run deterministically (no LLM tokens), where ctx = {input, output, artifact, data_store}. Use it for structural prerequisites (the lead form exists, the test lead actually landed in data_store). A clean gate adds no points; a failed gate hard-fails the case (structure missing = broken).

When a case has both, its score is the llm_rubric (a failed gate forces it to 0); a script-only case scores on the gate. Any evaluator that errors fails the case; it is never a silent pass. Set evaluators via --evaluator (CLI) or the evaluators field (MCP); omit them to use the rubric as a single llm_rubric.
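
A script gate for the lead-capture prerequisite could be sketched as follows. The evaluate(ctx) signature and the {score, passed, reasoning} return shape come from the description above. The internals of ctx.artifact and ctx.data_store (an html string, a leads array), the test email address, and the 0/5 score values are assumptions for illustration. The caseScore helper is hypothetical and only restates the gate rule described in the text.

```javascript
// Gate sketch. Assumed shapes (not confirmed by the docs):
//   ctx.artifact.html    - rendered page markup as a string
//   ctx.data_store.leads - array of captured lead records with an email field
function evaluate(ctx) {
  const html = (ctx.artifact && ctx.artifact.html) || "";
  const leads = (ctx.data_store && ctx.data_store.leads) || [];

  // Structural prerequisite 1: the page contains a form element.
  if (!/<form[\s>]/i.test(html)) {
    return { score: 0, passed: false, reasoning: "No <form> element found on the page." };
  }

  // Structural prerequisite 2: the lead submitted during the run was stored.
  const testLead = leads.find((lead) => lead.email === "eval-test@example.com");
  if (!testLead) {
    return { score: 0, passed: false, reasoning: "Test lead did not land in data_store." };
  }

  // The score value for a clean gate is an assumption (top of the 0-5 scale).
  return { score: 5, passed: true, reasoning: "Lead form exists and the test lead was stored." };
}

// Hypothetical helper restating the rule: a failed gate forces the rubric
// score to 0, a clean gate leaves it unchanged.
function caseScore(rubricScore, gate) {
  return gate.passed ? rubricScore : 0;
}
```

Keeping the gate deterministic means the same page always produces the same structural verdict, and the rubric judge only has to assess quality.
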
CLI
input is the builder’s JSON input — artifact_kind
(landing_page | dashboard | app) plus a description.
MCP tools
Available on the MCP server; identity and permissions ride your token, so you only see your organization’s and global suites.

| Tool | Purpose |
|---|---|
| erdo_list_eval_suites | List suites (slug, agent, settings) |
| erdo_get_eval_suite | A suite with all its cases |
| erdo_create_eval_suite | Create a suite + first cases (set evaluate_artifact for visual suites) |
| erdo_add_eval_case / erdo_update_eval_case / erdo_remove_eval_case | Maintain the case corpus |
| erdo_run_eval_suite | Run a suite; returns run_id |
| erdo_get_eval_run | Run status + per-case scores, per-criterion breakdown, judge reasoning, artifact links |
| erdo_list_eval_runs | Recent runs with aggregate pass counts/scores |
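
An MCP client normally issues these calls for you, but the underlying request is a JSON-RPC tools/call message. The sketch below builds two such messages. The tool names come from the table above; the argument names (suite_slug, run_id) and the example slug are assumptions, so check each tool's input schema before relying on them.

```javascript
// Builds an MCP "tools/call" JSON-RPC request object.
let nextId = 1;
function toolCall(name, args) {
  return {
    jsonrpc: "2.0",
    id: nextId++,
    method: "tools/call",
    params: { name, arguments: args },
  };
}

// Argument names below are assumed, not taken from the tool schemas.
const runRequest = toolCall("erdo_run_eval_suite", { suite_slug: "landing-page-builder" });
const pollRequest = toolCall("erdo_get_eval_run", { run_id: "RUN_ID_FROM_PREVIOUS_RESULT" });

console.log(JSON.stringify(runRequest, null, 2));
```
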
REST
Base URL: https://api.erdo.ai. Send the header Authorization: Bearer YOUR_API_KEY with each request.

| Method | Path |
|---|---|
| GET | /v1/evals/suites |
| GET | /v1/evals/suites/{suiteSlug} |
| POST | /v1/evals/suites |
| POST | /v1/evals/suites/{suiteSlug}/cases |
| PUT | /v1/evals/suites/{suiteSlug}/cases/{caseName} |
| DELETE | /v1/evals/suites/{suiteSlug}/cases/{caseName} |
| POST | /v1/evals/suites/{suiteSlug}/run |
| GET | /v1/evals/runs/{runID} |
| GET | /v1/evals/runs?suite_slug={slug}&limit={n} |
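
The routes above could be called from Node 18+ (which ships a global fetch) along these lines. The base URL, bearer header, and paths come from this section; the helper names are hypothetical, and the run_id field in the run response is an assumption, since the REST response shapes are not documented here.

```javascript
const BASE_URL = "https://api.erdo.ai";

// Builds the URL and fetch options for one of the routes listed above.
function buildRequest(apiKey, method, path, body) {
  const headers = { Authorization: `Bearer ${apiKey}` };
  const options = { method, headers };
  if (body !== undefined) {
    headers["Content-Type"] = "application/json";
    options.body = JSON.stringify(body);
  }
  return { url: `${BASE_URL}${path}`, options };
}

// Starts a run, then fetches that run once.
// Assumes the POST response body carries a run_id field.
async function runSuiteOnce(apiKey, suiteSlug) {
  const start = buildRequest(
    apiKey,
    "POST",
    `/v1/evals/suites/${encodeURIComponent(suiteSlug)}/run`
  );
  const started = await fetch(start.url, start.options);
  if (!started.ok) throw new Error(`run failed: HTTP ${started.status}`);
  const { run_id } = await started.json();

  const get = buildRequest(apiKey, "GET", `/v1/evals/runs/${encodeURIComponent(run_id)}`);
  const res = await fetch(get.url, get.options);
  if (!res.ok) throw new Error(`get run failed: HTTP ${res.status}`);
  return res.json();
}
```

A run is unlikely to be finished on the first GET, so a real caller would repeat the second request until the run reports completion.
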
cron_enabled: false) and run on demand.
