Playground

See how iofold generates evaluation code from agent traces

Interactive demo coming soon. For now, explore this example.

Input: Agent Trace

A conversation with user feedback (✓ or ✗)

User marked this interaction as helpful
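As a rough sketch, a trace behind an example like this might be captured in a shape such as the following. The field names and values here are illustrative assumptions, not iofold's actual trace schema:

// A hypothetical captured trace; field names and values are illustrative only
const exampleTrace = {
  messages: [
    { role: 'user', content: 'Find me a morning flight from SF to NYC.' },
    {
      role: 'assistant',
      content: 'The cheapest morning flight from SF to NYC is $248 on UA 1523, departing 7:05 AM.'
    }
  ],
  toolCalls: [
    { name: 'searchFlights', params: { from: 'SF', to: 'NYC' } },
    { name: 'filterFlights', params: { timeOfDay: 'morning' } }
  ],
  toolResults: [
    { flight: 'UA 1523', price: 248 },
    { flight: 'DL 310', price: 312 }
  ],
  feedback: 'positive' // the user marked this interaction as helpful
};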

Output: Generated Eval

TypeScript code to validate similar interactions

function evaluateFlightBooking(
  input: string,
  output: string,
  state: AgentState
): EvalResult {
  let score = 1.0;
  const issues: string[] = [];

  // Check if agent called searchFlights with correct params
  const searchCall = state.toolCalls.find(
    t => t.name === 'searchFlights'
  );

  if (!searchCall) {
    score -= 0.5;
    issues.push('Missing searchFlights call');
  } else {
    // Validate parameters
    if (searchCall.params.from !== 'SF') {
      score -= 0.2;
      issues.push('Incorrect departure city');
    }
    if (searchCall.params.to !== 'NYC') {
      score -= 0.2;
      issues.push('Incorrect destination city');
    }
  }

  // Check if agent filtered by time preference
  const filterCall = state.toolCalls.find(
    t => t.name === 'filterFlights'
  );

  if (!filterCall) {
    score -= 0.3;
    issues.push('Did not filter by time preference');
  }

  // Check for hallucinated information
  const mentionedPrices = extractPrices(output);
  const actualPrices = state.toolResults.map(r => r.price);

  const hallucinated = mentionedPrices.filter(
    p => !actualPrices.includes(p)
  );

  if (hallucinated.length > 0) {
    score -= 0.4;
    issues.push(`Hallucinated prices: ${hallucinated.join(', ')}`);
  }

  return {
    score: Math.max(0, score),
    issues,
    passed: score >= 0.7
  };
}
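The generated snippet assumes a few supporting types and helpers. Minimal versions might look like this; they are a sketch for readability, not iofold's actual definitions:

// Minimal supporting types and helpers assumed by the eval above; illustrative only
interface ToolCall {
  name: string;
  params: Record<string, string>;
}

interface ToolResult {
  price: number;
}

interface AgentState {
  toolCalls: ToolCall[];
  toolResults: ToolResult[];
}

interface EvalResult {
  score: number;
  issues: string[];
  passed: boolean;
}

// Naive price extraction: pulls "$248"-style dollar amounts out of the agent's reply
function extractPrices(text: string): number[] {
  return [...text.matchAll(/\$(\d+(?:\.\d{2})?)/g)].map(m => Number(m[1]));
}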

🎯 Tool Call Validation

Checks whether the agent called the right functions with the correct parameters

🚫 Hallucination Detection

Verifies the agent didn't mention facts that aren't present in the tool results

📊 Scoring & Reporting

Returns a score (0-1) with a detailed breakdown of issues

Interactive Demo Coming Soon

Try iofold with your own agent traces, customize eval logic, and see results in real time.

How Code Generation Works

1. Analyze Trace

An LLM examines the conversation, tool calls, and user feedback to understand what went right or wrong
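As a rough sketch, this step might be modeled like the following. The prompt wording and output shape are assumptions, not iofold's internals:

// Illustrative sketch of the analysis step; the prompt and output shape are assumptions
interface TraceAnalysis {
  outcome: 'positive' | 'negative';
  keyBehaviors: string[];  // e.g. "called searchFlights with from=SF, to=NYC"
  failureModes: string[];  // e.g. "quoted a price not returned by any tool"
}

async function analyzeTrace(
  trace: object,
  llm: (prompt: string) => Promise<string>
): Promise<TraceAnalysis> {
  const prompt = [
    'Given this agent trace and its user feedback, list the behaviors that made',
    'the interaction succeed or fail. Respond as JSON with the fields outcome,',
    'keyBehaviors, and failureModes.',
    JSON.stringify(trace, null, 2)
  ].join('\n');
  return JSON.parse(await llm(prompt)) as TraceAnalysis;
}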

2. Generate Checks

Based on the analysis, the LLM writes TypeScript code to validate key aspects of similar interactions
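Continuing the sketch above, the generation step could be prompted along these lines; again, this is an illustration rather than iofold's actual prompt:

// Illustrative sketch of the generation step, building on TraceAnalysis from above
async function generateEval(
  analysis: TraceAnalysis,
  llm: (prompt: string) => Promise<string>
): Promise<string> {
  const prompt = [
    'Write a TypeScript function (input, output, state) => EvalResult that checks',
    'the behaviors below on future traces, deducting from a 1.0 score for each',
    'missing behavior or detected failure mode.',
    `Key behaviors: ${analysis.keyBehaviors.join('; ')}`,
    `Failure modes: ${analysis.failureModes.join('; ')}`
  ].join('\n');
  // The returned string is TypeScript source like the generated eval shown above
  return llm(prompt);
}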

3. Add to Suite

The generated eval is versioned, tested, and added to your continuous evaluation pipeline
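A minimal sketch of what "versioned and tested" could mean, reusing the AgentState and EvalResult types from the earlier sketches; the structure here is an assumption, not iofold's storage format:

// Illustrative sketch of suite registration; field names are hypothetical
interface EvalEntry {
  id: string;       // e.g. "flight-booking-001" (hypothetical naming)
  version: number;  // bumped whenever the eval is regenerated
  source: string;   // the generated TypeScript source
  run: (input: string, output: string, state: AgentState) => EvalResult;
}

const suite: EvalEntry[] = [];

function addToSuite(
  entry: EvalEntry,
  origin: { input: string; output: string; state: AgentState; feedbackPositive: boolean }
): void {
  // Sanity check: the new eval should agree with the feedback on its originating trace
  const result = entry.run(origin.input, origin.output, origin.state);
  if (result.passed !== origin.feedbackPositive) {
    throw new Error(`Eval ${entry.id} disagrees with the feedback it was generated from`);
  }
  suite.push(entry);
}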

4. Run Continuously

Every new trace is evaluated against the full suite, catching regressions instantly
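A continuous run then amounts to applying every suite entry to each incoming trace. A minimal sketch, reusing the suite from the previous step:

// Illustrative sketch of a continuous run: apply every suite entry to a new trace
function evaluateTrace(
  input: string,
  output: string,
  state: AgentState
): { entryId: string; result: EvalResult }[] {
  return suite.map(entry => ({
    entryId: entry.id,
    result: entry.run(input, output, state)
  }));
}

// Entries that fail flag a potential regression for review
function findRegressions(results: { entryId: string; result: EvalResult }[]): string[] {
  return results.filter(r => !r.result.passed).map(r => r.entryId);
}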