Playground
See how iofold generates evaluation code from agent traces
Interactive demo coming soon. For now, explore this example.
Input: Agent Trace
A conversation with user feedback (✓ or ✗)
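For illustration, a trace might be captured in a shape like the sketch below. The field names here are a simplified assumption for this example, not iofold's actual schema:

// Illustrative only: a simplified trace shape, not iofold's actual schema.
const exampleTrace = {
  messages: [
    { role: 'user', content: 'Find me a morning flight from SF to NYC.' },
    { role: 'assistant', content: 'Two morning options from SF to NYC: UA 1234 at $289 and DL 456 at $312.' }
  ],
  toolCalls: [
    { name: 'searchFlights', params: { from: 'SF', to: 'NYC' } },
    { name: 'filterFlights', params: { timeOfDay: 'morning' } }
  ],
  toolResults: [
    { flight: 'UA 1234', price: 289 },
    { flight: 'DL 456', price: 312 }
  ],
  feedback: 'positive' // the ✓ / ✗ signal attached by the user
};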
Output: Generated Eval
TypeScript code to validate similar interactions
function evaluateFlightBooking(
  input: string,
  output: string,
  state: AgentState
): EvalResult {
  let score = 1.0;
  const issues: string[] = [];

  // Check if agent called searchFlights with correct params
  const searchCall = state.toolCalls.find(
    t => t.name === 'searchFlights'
  );
  if (!searchCall) {
    score -= 0.5;
    issues.push('Missing searchFlights call');
  } else {
    // Validate parameters
    if (searchCall.params.from !== 'SF') {
      score -= 0.2;
      issues.push('Incorrect departure city');
    }
    if (searchCall.params.to !== 'NYC') {
      score -= 0.2;
      issues.push('Incorrect destination city');
    }
  }

  // Check if agent filtered by time preference
  const filterCall = state.toolCalls.find(
    t => t.name === 'filterFlights'
  );
  if (!filterCall) {
    score -= 0.3;
    issues.push('Did not filter by time preference');
  }

  // Check for hallucinated information
  const mentionedPrices = extractPrices(output);
  const actualPrices = state.toolResults.map(r => r.price);
  const hallucinated = mentionedPrices.filter(
    p => !actualPrices.includes(p)
  );
  if (hallucinated.length > 0) {
    score -= 0.4;
    issues.push(`Hallucinated prices: ${hallucinated.join(', ')}`);
  }

  return {
    score: Math.max(0, score),
    issues,
    passed: score >= 0.7
  };
}
🎯 Tool Call Validation
Checks that the agent called the right functions with the correct parameters
🚫 Hallucination Detection
Verifies the agent didn't mention facts that aren't present in the tool results
📊 Scoring & Reporting
Returns a score (0–1) with a detailed issue breakdown; the supporting types are sketched below
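The generated function above references a few supporting types and helpers that would live alongside it. A minimal sketch of what they might look like, reconstructed from how the code uses them (the bodies here are assumptions, not iofold output):

// Assumed shapes, inferred from how the generated eval uses them.
interface ToolCall {
  name: string;
  params: Record<string, string>;
}

interface ToolResult {
  price: number;
  [key: string]: unknown;
}

interface AgentState {
  toolCalls: ToolCall[];
  toolResults: ToolResult[];
}

interface EvalResult {
  score: number;     // 0–1, clamped at 0
  issues: string[];  // human-readable reason for each deduction
  passed: boolean;   // score >= 0.7 in this eval
}

// Hypothetical helper: pulls dollar amounts like "$289" out of the agent's reply.
function extractPrices(text: string): number[] {
  const matches = text.match(/\$\d+(?:\.\d{2})?/g) ?? [];
  return matches.map(m => Number(m.slice(1)));
}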
Interactive Demo Coming Soon
Try iofold with your own agent traces, customize eval logic, and see results in real time.
How Code Generation Works
1. Analyze Trace
An LLM examines the conversation, tool calls, and user feedback to understand what went right or wrong
2. Generate Checks
Based on the analysis, the LLM writes TypeScript code to validate key aspects of similar interactions
3. Add to Suite
The generated eval is versioned, tested, and added to your continuous evaluation pipeline
4. Run Continuously
Every new trace is evaluated against the full suite, catching regressions instantly (see the sketch below)
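As a rough illustration of step 4, the sketch below runs every generated eval in a suite against a new trace. The suite shape and runner are hypothetical, not iofold's API; they reuse the AgentState and EvalResult types sketched earlier, and only evaluateFlightBooking comes from the example above.

// Hypothetical suite runner: each generated eval is an ordinary TypeScript function.
type EvalFn = (input: string, output: string, state: AgentState) => EvalResult;

const suite: Record<string, EvalFn> = {
  flightBooking: evaluateFlightBooking,
  // ...other generated evals get registered here as they are added
};

function runSuite(trace: { input: string; output: string; state: AgentState }) {
  // Run every eval in the suite against the new trace.
  const results = Object.entries(suite).map(([name, evaluate]) => ({
    name,
    ...evaluate(trace.input, trace.output, trace.state),
  }));

  // Surface failures immediately instead of waiting for a manual review.
  const regressions = results.filter(r => !r.passed);
  if (regressions.length > 0) {
    console.warn('Regressions detected:', regressions);
  }
  return results;
}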