Playground

See how iofold generates evaluation code from agent traces

Interactive demo coming soon. For now, explore this example.

Input: Agent Trace

A conversation with user feedback (✓ or ✗)

User marked this interaction as helpful
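As a rough sketch, a trace behind an example like this might be captured in a shape such as the following. The field names and values here are illustrative assumptions, not iofold's actual trace schema:

// A hypothetical captured trace; field names and values are illustrative only
const exampleTrace = {
  messages: [
    { role: 'user', content: 'Find me a morning flight from SF to NYC.' },
    {
      role: 'assistant',
      content: 'The cheapest morning flight from SF to NYC is $248 on UA 1523, departing 7:05 AM.'
    }
  ],
  toolCalls: [
    { name: 'searchFlights', params: { from: 'SF', to: 'NYC' } },
    { name: 'filterFlights', params: { timeOfDay: 'morning' } }
  ],
  toolResults: [
    { flight: 'UA 1523', price: 248 },
    { flight: 'DL 310', price: 312 }
  ],
  feedback: 'positive' // the user marked this interaction as helpful
};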

Output: Generated Eval

TypeScript code to validate similar interactions

function evaluateFlightBooking(
  input: string,
  output: string,
  state: AgentState
): EvalResult {
  let score = 1.0;
  const issues: string[] = [];

  // Check if agent called searchFlights with correct params
  const searchCall = state.toolCalls.find(
    t => t.name === 'searchFlights'
  );

  if (!searchCall) {
    score -= 0.5;
    issues.push('Missing searchFlights call');
  } else {
    // Validate parameters
    if (searchCall.params.from !== 'SF') {
      score -= 0.2;
      issues.push('Incorrect departure city');
    }
    if (searchCall.params.to !== 'NYC') {
      score -= 0.2;
      issues.push('Incorrect destination city');
    }
  }

  // Check if agent filtered by time preference
  const filterCall = state.toolCalls.find(
    t => t.name === 'filterFlights'
  );

  if (!filterCall) {
    score -= 0.3;
    issues.push('Did not filter by time preference');
  }

  // Check for hallucinated information
  const mentionedPrices = extractPrices(output);
  const actualPrices = state.toolResults.map(r => r.price);

  const hallucinated = mentionedPrices.filter(
    p => !actualPrices.includes(p)
  );

  if (hallucinated.length > 0) {
    score -= 0.4;
    issues.push(`Hallucinated prices: ${hallucinated.join(', ')}`);
  }

  return {
    score: Math.max(0, score),
    issues,
    passed: score >= 0.7
  };
}
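The generated snippet assumes a few supporting types and helpers. Minimal versions might look like this; they are a sketch for readability, not iofold's actual definitions:

// Minimal supporting types and helpers assumed by the eval above; illustrative only
interface ToolCall {
  name: string;
  params: Record<string, string>;
}

interface ToolResult {
  price: number;
}

interface AgentState {
  toolCalls: ToolCall[];
  toolResults: ToolResult[];
}

interface EvalResult {
  score: number;
  issues: string[];
  passed: boolean;
}

// Naive price extraction: pulls "$248"-style dollar amounts out of the agent's reply
function extractPrices(text: string): number[] {
  return [...text.matchAll(/\$(\d+(?:\.\d{2})?)/g)].map(m => Number(m[1]));
}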

🎯 Tool Call Validation

Checks whether the agent called the right functions with the correct parameters

🚫 Hallucination Detection

Verifies the agent didn't mention facts that aren't present in the tool results

📊 Scoring & Reporting

Returns a score (0-1) with a detailed breakdown of issues

Interactive Demo Coming Soon

Try iofold with your own agent traces, customize eval logic, and see results in real time.

How Code Generation Works

1. Analyze Trace

An LLM examines the conversation, tool calls, and user feedback to understand what went right or wrong
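As a rough sketch, this step might be modeled like the following. The prompt wording and output shape are assumptions, not iofold's internals:

// Illustrative sketch of the analysis step; the prompt and output shape are assumptions
interface TraceAnalysis {
  outcome: 'positive' | 'negative';
  keyBehaviors: string[];  // e.g. "called searchFlights with from=SF, to=NYC"
  failureModes: string[];  // e.g. "quoted a price not returned by any tool"
}

async function analyzeTrace(
  trace: object,
  llm: (prompt: string) => Promise<string>
): Promise<TraceAnalysis> {
  const prompt = [
    'Given this agent trace and its user feedback, list the behaviors that made',
    'the interaction succeed or fail. Respond as JSON with the fields outcome,',
    'keyBehaviors, and failureModes.',
    JSON.stringify(trace, null, 2)
  ].join('\n');
  return JSON.parse(await llm(prompt)) as TraceAnalysis;
}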

2. Generate Checks

Based on the analysis, the LLM writes TypeScript code to validate key aspects of similar interactions
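Continuing the sketch above, the generation step could be prompted along these lines; again, this is an illustration rather than iofold's actual prompt:

// Illustrative sketch of the generation step, building on TraceAnalysis from above
async function generateEval(
  analysis: TraceAnalysis,
  llm: (prompt: string) => Promise<string>
): Promise<string> {
  const prompt = [
    'Write a TypeScript function (input, output, state) => EvalResult that checks',
    'the behaviors below on future traces, deducting from a 1.0 score for each',
    'missing behavior or detected failure mode.',
    `Key behaviors: ${analysis.keyBehaviors.join('; ')}`,
    `Failure modes: ${analysis.failureModes.join('; ')}`
  ].join('\n');
  // The returned string is TypeScript source like the generated eval shown above
  return llm(prompt);
}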

3. Add to Suite

The generated eval is versioned, tested, and added to your continuous evaluation pipeline
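A minimal sketch of what "versioned and tested" could mean, reusing the AgentState and EvalResult types from the earlier sketches; the structure here is an assumption, not iofold's storage format:

// Illustrative sketch of suite registration; field names are hypothetical
interface EvalEntry {
  id: string;       // e.g. "flight-booking-001" (hypothetical naming)
  version: number;  // bumped whenever the eval is regenerated
  source: string;   // the generated TypeScript source
  run: (input: string, output: string, state: AgentState) => EvalResult;
}

const suite: EvalEntry[] = [];

function addToSuite(
  entry: EvalEntry,
  origin: { input: string; output: string; state: AgentState; feedbackPositive: boolean }
): void {
  // Sanity check: the new eval should agree with the feedback on its originating trace
  const result = entry.run(origin.input, origin.output, origin.state);
  if (result.passed !== origin.feedbackPositive) {
    throw new Error(`Eval ${entry.id} disagrees with the feedback it was generated from`);
  }
  suite.push(entry);
}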

4. Run Continuously

Every new trace is evaluated against the full suite, catching regressions instantly
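A continuous run then amounts to applying every suite entry to each incoming trace. A minimal sketch, reusing the suite from the previous step:

// Illustrative sketch of a continuous run: apply every suite entry to a new trace
function evaluateTrace(
  input: string,
  output: string,
  state: AgentState
): { entryId: string; result: EvalResult }[] {
  return suite.map(entry => ({
    entryId: entry.id,
    result: entry.run(input, output, state)
  }));
}

// Entries that fail flag a potential regression for review
function findRegressions(results: { entryId: string; result: EvalResult }[]): string[] {
  return results.filter(r => !r.result.passed).map(r => r.entryId);
}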