Documentation
Deep dive into iofold's approach to automatic agent evaluation
Contents
- Quick Start: get running in 10 minutes
- Core Concepts: code vs LLM-as-judge
- Examples: TypeScript eval snippets
Core Philosophy: Code is All You Need
Traditional LLM-based evaluation (LLM-as-judge) is slow, expensive, and non-deterministic. iofold takes a different approach: it generates code that evaluates your agent, rather than asking an LLM to judge each response.
The industry is converging on code-based approaches for agent evaluation:
- Cloudflare's Code Mode demonstrates how code generation creates more reliable agent outputs
- HuggingFace SmolAgents shows code-based agents outperform traditional LLM agents
- Armin Ronacher's analysis of why code-based tool calling is the future
- The research paper Code is All You Need argues that code-based evaluation outperforms LLM-as-judge
We leverage this insight to create fast, deterministic evals with the same benefits as code-based agents.
LLMs as Human-Level Eval Writers
Research demonstrates that LLMs are human-level prompt engineers. We extend this insight: LLMs are also human-level evaluation function writers.
iofold uses LLMs to generate evaluation code, not to run evaluations. This gives you:
- 10-100x faster execution (local TypeScript instead of API calls)
- 10-100x cheaper at scale (no per-eval inference costs)
- Deterministic results (same input = same output)
- Transparent logic (code you can read and debug)
- Versionable evals (track changes in git)
TypeScript Eval Examples
Here are some examples of generated evaluation functions:
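Every generated eval has the same shape: a function from the agent's input, output, and execution state to a score between 0 and 1. The types below are a sketch inferred from the examples that follow, not iofold's published API:

```typescript
// Assumed shapes, inferred from the examples in this section.
interface ToolCall {
  name: string;
  params: Record<string, unknown>;
}

interface AgentState {
  messages: string[];      // conversation context the agent saw
  toolCalls: ToolCall[];   // tools the agent invoked, in order
  metadata: { expectedOutput: string } & Record<string, unknown>;
}

// A generated eval maps (input, output, state) to a score in [0, 1].
type EvalFunction = (
  input: string,
  output: string,
  state: AgentState
) => number;
```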
1. Hallucination Detection
Check if agent output mentions facts not present in the context:
```typescript
function checkHallucination(
  input: string,
  output: string,
  state: AgentState
): number {
  // Extract key entities from the conversation context
  const contextEntities = extractEntities(state.messages);

  // Extract entities mentioned in the output
  const outputEntities = extractEntities([output]);

  // No entities in the output means nothing could be hallucinated
  if (outputEntities.length === 0) return 0;

  // Entities in the output that never appeared in the context
  const hallucinated = outputEntities.filter(
    entity => !contextEntities.includes(entity)
  );

  // Score: 0 = no hallucination, 1 = severe hallucination
  return hallucinated.length / outputEntities.length;
}
```
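The extractEntities helper is left undefined above. A deliberately naive stand-in might treat capitalized words and numbers as entities; a production eval would want a real entity extractor:

```typescript
// Naive, illustrative stand-in: capitalized words and numbers count as "entities".
// A real implementation would use proper entity extraction, not a regex.
function extractEntities(texts: string[]): string[] {
  const pattern = /\b(?:[A-Z][A-Za-z]+|\d+(?:\.\d+)?)\b/g;
  const entities = new Set<string>();
  for (const text of texts) {
    for (const match of text.match(pattern) ?? []) {
      entities.add(match);
    }
  }
  return [...entities];
}
```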
2. Tool Calling Accuracy
Verify agent called the right tools with correct parameters:
```typescript
function checkToolUsage(
  input: string,
  output: string,
  state: AgentState
): number {
  const expectedTools = ['search', 'calculate'];
  const calledTools = state.toolCalls.map(t => t.name);

  // Check if all expected tools were called
  const correctTools = expectedTools.every(
    tool => calledTools.includes(tool)
  );

  // Check parameter accuracy
  const correctParams = state.toolCalls.every(call =>
    validateToolParams(call.name, call.params)
  );

  return (correctTools && correctParams) ? 1.0 : 0.0;
}
```
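validateToolParams is also assumed above. A minimal, hypothetical version checks that each call supplies the parameters its tool requires:

```typescript
// Hypothetical required-parameter lists for the tools checked above.
const requiredParams: Record<string, string[]> = {
  search: ['query'],
  calculate: ['expression'],
};

function validateToolParams(
  name: string,
  params: Record<string, unknown>
): boolean {
  const required = requiredParams[name];
  if (!required) return false; // unknown tool
  // Every required parameter must be present and non-empty
  return required.every(key => params[key] !== undefined && params[key] !== '');
}
```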
3. Response Quality (Levenshtein)
Measure output similarity to expected response:
```typescript
function checkResponseQuality(
  input: string,
  output: string,
  state: AgentState
): number {
  const expected = state.metadata.expectedOutput;

  // Normalize both strings before comparing
  const a = normalize(output);
  const b = normalize(expected);

  // Calculate Levenshtein edit distance
  const distance = levenshtein(a, b);

  // Normalize the distance by the longer string's length
  const maxLen = Math.max(a.length, b.length);
  if (maxLen === 0) return 1; // both empty => identical

  const similarity = 1 - distance / maxLen;
  return Math.max(0, similarity);
}
```
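normalize and levenshtein are assumed helpers; minimal versions could look like this (the normalization rules here are an assumption):

```typescript
// Lowercase, trim, and collapse whitespace; an assumed normalization policy.
function normalize(text: string): string {
  return text.toLowerCase().trim().replace(/\s+/g, ' ');
}

// Standard dynamic-programming Levenshtein edit distance.
function levenshtein(a: string, b: string): number {
  const rows = a.length + 1;
  const cols = b.length + 1;
  const dp: number[][] = Array.from({ length: rows }, () => new Array<number>(cols).fill(0));
  for (let i = 0; i < rows; i++) dp[i][0] = i;
  for (let j = 0; j < cols; j++) dp[0][j] = j;
  for (let i = 1; i < rows; i++) {
    for (let j = 1; j < cols; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,       // deletion
        dp[i][j - 1] + 1,       // insertion
        dp[i - 1][j - 1] + cost // substitution
      );
    }
  }
  return dp[a.length][b.length];
}
```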
Meta-Prompting for ReAct Agents
iofold uses meta-prompting to automatically improve your agent's system prompt based on eval results. Drawing inspiration from OpenAI's prompting guidelines and auto-prompt-optimizer, we continuously refine prompts to maximize eval scores.
The Optimization Loop
1. Run evals on the current prompt → get scores
2. LLM analyzes failures and patterns
3. LLM generates improved prompt variations
4. Re-run evals on the new prompts → compare scores
5. Keep the best prompt, repeat
Result: better instruction following, higher tool calling accuracy, reduced hallucinations. A sketch of this loop appears below.
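A compact sketch of that loop in TypeScript, with runEvals, analyzeFailures, and proposePrompts standing in for iofold internals (all three names are hypothetical):

```typescript
// Hypothetical signatures standing in for iofold internals.
declare function runEvals(prompt: string): Promise<{ score: number; failures: string[] }>;
declare function analyzeFailures(failures: string[]): Promise<string>;                 // LLM call
declare function proposePrompts(prompt: string, analysis: string): Promise<string[]>;  // LLM call

async function optimizePrompt(initialPrompt: string, rounds: number): Promise<string> {
  // 1. Score the starting prompt
  let best = { prompt: initialPrompt, ...(await runEvals(initialPrompt)) };

  for (let i = 0; i < rounds; i++) {
    // 2. Analyze failure patterns from the current best prompt
    const analysis = await analyzeFailures(best.failures);

    // 3. Generate improved prompt variations
    const candidates = await proposePrompts(best.prompt, analysis);

    // 4. Re-run evals on each candidate and compare scores
    for (const candidate of candidates) {
      const result = await runEvals(candidate);
      if (result.score > best.score) {
        best = { prompt: candidate, ...result };
      }
    }
  }

  // 5. Keep the highest-scoring prompt
  return best.prompt;
}
```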
Research & References
Large Language Models Are Human-Level Prompt Engineers
Zhou et al., 2022 — Demonstrates LLMs can generate prompts as good as human experts
Code is All You Need: Rethinking LLM Evaluation
Recent research showing code-based evaluation outperforms LLM-as-judge
GPT-5 Prompting Guidelines
OpenAI's best practices for prompt engineering, which inspire our meta-prompting approach
Get Started in 10 Minutes
1. Install iofold

```bash
pip install iofold
```

2. Initialize with your observability tool

```bash
iofold init --with langfuse
```

3. Tag feedback in your app

```python
iofold.tag(trace_id, feedback="✓")
```

4. Generate evals from feedback

```bash
iofold generate-evals
```

5. Run continuous evaluation

```bash
iofold eval --watch
```