Documentation
Deep dive into iofold's approach to self-improving agents
- Import Traces — Connect Langfuse or LangSmith
- Generate Evals — Automatic from labeled traces
- Optimize Prompts — GEPA-powered evolution
Core Architecture
iofold closes the loop between agent execution, evaluation, and optimization. Instead of requiring hand-written evals, iofold's deepagent automatically generates evaluation functions (deterministic code checks as well as LLM-as-judge) using data science tooling and backtesting, then uses them to evolve better prompts through GEPA.
The Optimization Loop
- Import — Pull traces from Langfuse/LangSmith
- Label — Mark traces as positive/negative
- Generate — Auto-create eval functions from patterns
- Evaluate — Run evals on new traces
- Optimize — GEPA evolves prompts using eval feedback
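In code, one pass of the loop looks roughly like the sketch below. The client handle and every method name on it are hypothetical, shown only to make the loop concrete; they are not iofold's documented SDK.

# Hypothetical SDK names, for illustration only
def optimization_loop(client, agent_id: str) -> str:
    traces = client.import_traces(agent_id, since="7d")   # 1. Import from Langfuse/LangSmith
    labeled = [t for t in traces if t.label is not None]  # 2. Label (marked in the dashboard)
    evals = client.generate_evals(labeled)                # 3. Generate eval functions
    scores = {t.id: [e.run(t) for e in evals]
              for t in client.new_traces(agent_id)}       # 4. Evaluate fresh traces
    return client.optimize_prompt(agent_id,               # 5. GEPA evolves the prompt
                                  feedback=scores,        #    using the eval feedback
                                  iterations=10)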
GEPA: 35x More Efficient Than RL
iofold uses GEPA (Genetic-Pareto) to optimize your agent's system prompt. Unlike traditional reinforcement learning, which can require 24,000+ rollouts, GEPA achieves better results with just 400-1,200 rollouts.
Reflective Mutation
LLMs analyze failed traces in natural language, diagnose problems, and propose targeted prompt improvements—no gradient descent required.
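A minimal sketch of a single mutation step, assuming a generic call_llm(prompt: str) -> str helper; the helper, the trace fields, and the prompt wording are all illustrative, not GEPA's exact implementation.

def reflective_mutation(system_prompt: str,
                        failed_traces: list[dict],
                        call_llm) -> str:
    """Diagnose failures in natural language, then rewrite the prompt."""
    failures = "\n\n".join(
        f"Input: {t['user_message']}\n"
        f"Output: {t['agent_response']}\n"
        f"Eval feedback: {t['feedback']}"
        for t in failed_traces
    )
    return call_llm(
        "You are improving an agent's system prompt.\n\n"
        f"Current prompt:\n{system_prompt}\n\n"
        f"Failing traces:\n{failures}\n\n"
        "First diagnose the shared failure pattern, then output an "
        "improved system prompt that addresses it. Output only the prompt."
    )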
Pareto Selection
Maintains a frontier of top-performing prompts across different test cases, avoiding local optima and preserving diversity.
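One concrete reading of this (a sketch, not GEPA's exact code): keep every candidate prompt that achieves the best score on at least one test case, so specialists survive alongside generalists instead of everything collapsing to a single average-best prompt.

def pareto_frontier(scores: dict[str, list[float]]) -> list[str]:
    """scores maps candidate prompt -> per-test-case scores.
    A candidate stays on the frontier if it is best on >= 1 case."""
    n_cases = len(next(iter(scores.values())))
    frontier = set()
    for i in range(n_cases):
        best = max(scores, key=lambda c: scores[c][i])
        frontier.add(best)
    return sorted(frontier)

# Example: prompt_b wins case 0, prompt_a wins case 1; both survive.
print(pareto_frontier({
    "prompt_a": [0.2, 0.9, 0.5],
    "prompt_b": [0.8, 0.4, 0.5],
}))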
GEPA is integrated into MLflow via mlflow.genai.optimize_prompts() and works with DSPy.
Automatic Eval Generation
iofold analyzes your labeled traces to automatically generate Python evaluation functions. No manual eval writing required—just label 10+ traces and let the system learn.
How It Works
# 1. Collect labeled traces (scores in [0, 1])
high_scored = traces.filter(lambda t: t.score >= 0.7)  # good examples
low_scored = traces.filter(lambda t: t.score <= 0.3)   # bad examples
# 2. LLM analyzes patterns
patterns = analyze_differences(high_scored, low_scored)
# 3. Generate 5 candidate evals
candidates = [
    "correctness",   # Does it solve the problem?
    "efficiency",    # Is the response concise?
    "safety",        # Is it appropriate?
    "completeness",  # All aspects addressed?
    "ensemble",      # Balanced holistic check
]
# 4. Test candidates, select winner
winner = select_best(candidates, threshold={
    "accuracy": 0.80,
    "kappa": 0.60,
    "f1": 0.70,
})
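To illustrate what passing those thresholds means (a sketch, not iofold's actual selection code), each candidate eval's verdicts can be compared against the human labels using standard scikit-learn metrics:

from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

def passes_thresholds(human_labels: list[int],
                      eval_scores: list[float],
                      threshold: dict) -> bool:
    """human_labels: 0/1 per trace; eval_scores: floats in [0, 1]."""
    preds = [1 if s >= 0.5 else 0 for s in eval_scores]  # binarize at 0.5
    return (accuracy_score(human_labels, preds) >= threshold["accuracy"]
            and cohen_kappa_score(human_labels, preds) >= threshold["kappa"]
            and f1_score(human_labels, preds) >= threshold["f1"])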
Eval Function Structure
Generated evals are Python functions that run in a sandboxed environment. They can call LLMs when needed via the EvalContext.
def eval_function(
    task: dict,           # {user_message: str}
    task_metadata: dict,  # expected output, success criteria
    trace: dict,          # agent's execution trace
    ctx: EvalContext,     # LLM access, caching, cost tracking
) -> tuple[float, str]:
    """
    Returns (score, feedback) where:
    - score: 0.0 to 1.0
    - feedback: explanation for the score
    """
    response = trace["agent_response"]
    # Prefer deterministic checks when possible
    if meets_criteria(response, task_metadata):
        return 1.0, "All criteria met"
    # Fall back to an LLM judge when deterministic checks aren't enough
    if needs_semantic_check(response):
        judgment = ctx.call_llm(
            f"Does this response address the task? {response}"
        )
        return parse_score(judgment), judgment
    return 0.5, "Partial completion"
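For a sense of the contract, here is how such a function could be exercised locally with a stub standing in for EvalContext; the stub class and its canned reply are hypothetical, not iofold's actual implementation:

class StubEvalContext:
    """Hypothetical stand-in for EvalContext with a per-execution cache."""
    def __init__(self):
        self._cache: dict[str, str] = {}
    def call_llm(self, prompt: str) -> str:
        # Built-in cache: identical prompts are answered only once
        if prompt not in self._cache:
            self._cache[prompt] = "score: 1.0 (addresses the task)"
        return self._cache[prompt]

score, feedback = eval_function(
    task={"user_message": "Summarize this article"},
    task_metadata={"success_criteria": "mentions all key points"},
    trace={"agent_response": "The article argues that..."},
    ctx=StubEvalContext(),
)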
- Safe Sandbox — Only safe imports: json, re, math, datetime, difflib
- LLM Access — Call Claude/GPT via ctx.call_llm() with cost tracking
- Built-in Cache — Per-execution cache prevents redundant LLM calls
RULER: Relative Scoring
LLMs are better at ranking solutions side-by-side than scoring them in isolation. RULER (Relative Universal LLM-Elicited Rewards) leverages this insight for 8x cheaper evaluation than pairwise comparison.
How RULER Works
- Group 4-8 similar traces together
- Ask LLM to rank them relative to each other (listwise comparison)
- Convert rankings to advantages using GRPO normalization
- Use advantages as rewards for GEPA optimization
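A sketch of steps 3 and 4, assuming the LLM's ranking has already been parsed into one rank per trace (1 = best); the rank-to-score mapping here is one simple choice among several.

from statistics import mean, stdev

def ranks_to_advantages(ranks: list[int]) -> list[float]:
    """GRPO-style normalization: advantage_i = (score_i - mean) / std,
    where score_i is derived from the LLM's ranking within the group."""
    n = len(ranks)
    scores = [n - r for r in ranks]   # higher score = better rank
    mu, sigma = mean(scores), stdev(scores)
    if sigma == 0:
        return [0.0] * n              # all tied: no learning signal
    return [(s - mu) / sigma for s in scores]

# Example: 4 grouped traces; the 1st-ranked trace gets the largest
# positive advantage, the 4th-ranked the most negative.
print(ranks_to_advantages([2, 1, 4, 3]))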
Research Foundations
iofold builds on cutting-edge research in prompt optimization and agent evaluation. See our Research page for the full list.
GEPA: Reflective Prompt Evolution
Achieves roughly 35x the sample efficiency of RL (400-1,200 rollouts vs 24,000+) using reflective mutation and Pareto selection.
Judging LLM-as-a-Judge (MT-Bench)
Foundational research showing LLM judges achieve 80%+ agreement with human preferences.
DSPy: Self-Improving Pipelines
Stanford NLP framework for automatic prompt and weight optimization.
Get Started
1. Connect your observability tool
iofold init --adapter langfuse
2. Import traces
iofold import --since 7d
3. Label traces in the dashboard
# Mark 10+ traces as positive/negative
4. Generate evals
iofold generate-evals --agent my-agent
5. Optimize with GEPA
iofold optimize --iterations 10