Documentation

Deep dive into iofold's approach to self-improving agents

Import Traces

Connect Langfuse or LangSmith

Generate Evals

Automatic from labeled traces

Optimize Prompts

GEPA-powered evolution

Core Architecture

iofold closes the loop between agent execution, evaluation, and optimization. Instead of requiring you to write evals by hand, iofold's deepagent automatically generates evaluation functions (deterministic code checks plus LLM-as-judge calls), validates them with data-analysis tooling and backtesting against your labeled traces, and then uses them to evolve better prompts through GEPA.

The Optimization Loop

  1. Import — Pull traces from Langfuse/LangSmith
  2. Label — Mark traces as positive/negative
  3. Generate — Auto-create eval functions from patterns
  4. Evaluate — Run evals on new traces
  5. Optimize — GEPA evolves prompts using eval feedback
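
The same loop, condensed into code. This is a minimal sketch: the helpers passed in (import_traces, label, generate_evals, run_evals, gepa_optimize) stand in for the corresponding iofold steps and are not the published SDK.

def optimization_pass(import_traces, label, generate_evals, run_evals,
                      gepa_optimize, prompt):
    traces = import_traces(since_days=7)           # 1. Import from Langfuse/LangSmith
    labeled = [label(t) for t in traces]           # 2. Label positive/negative (dashboard step)
    evals = generate_evals(labeled)                # 3. Auto-create eval functions from the labels
    feedback = run_evals(evals, traces)            # 4. Score traces, collect feedback
    return gepa_optimize(prompt, evals, feedback)  # 5. Evolve a better system prompt with GEPA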

GEPA: 35x More Efficient Than RL

iofold uses GEPA (Genetic-Pareto) to optimize your agent's system prompt. Unlike traditional reinforcement learning, which can require 24,000+ rollouts, GEPA achieves better results with just 400-1,200 rollouts.

Reflective Mutation

LLMs analyze failed traces in natural language, diagnose problems, and propose targeted prompt improvements—no gradient descent required.
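
A rough sketch of one reflective-mutation step, assuming only a generic call_llm(prompt) -> str helper; it illustrates the idea rather than iofold's internals.

def reflective_mutation(current_prompt: str, failed_traces: list[dict], call_llm) -> str:
    """Diagnose failures in natural language and propose an improved prompt (no gradients)."""
    failures = "\n\n".join(
        f"Input: {t['user_message']}\n"
        f"Output: {t['agent_response']}\n"
        f"Eval feedback: {t['feedback']}"
        for t in failed_traces
    )
    return call_llm(
        "You are improving an agent's system prompt.\n\n"
        f"CURRENT PROMPT:\n{current_prompt}\n\n"
        f"FAILED TRACES:\n{failures}\n\n"
        "Diagnose the recurring failure modes, then reply with only the improved system prompt."
    )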

Pareto Selection

Maintains a frontier of top-performing prompts across different test cases, avoiding local optima and preserving diversity.
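
Conceptually, the frontier is the set of prompts that win on at least one test case; GEPA samples mutation parents from this set rather than from the single best prompt overall. A minimal sketch, with an assumed score layout:

def pareto_frontier(prompts: list[str], scores: dict[str, dict[str, float]]) -> set[str]:
    """Return every prompt that is the top scorer on at least one test case.

    scores[prompt][case] is that prompt's eval score on the given test case.
    """
    cases = {case for per_prompt in scores.values() for case in per_prompt}
    return {
        max(prompts, key=lambda p: scores.get(p, {}).get(case, 0.0))
        for case in cases
    }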

GEPA is integrated into MLflow via mlflow.genai.optimize_prompts() and works with DSPy.

Automatic Eval Generation

iofold analyzes your labeled traces to automatically generate Python evaluation functions. No manual eval writing required—just label 10+ traces and let the system learn.

How It Works

# 1. Collect labeled traces (scores in [0, 1])
high_scored = [t for t in traces if t.score >= 0.7]  # Good examples
low_scored = [t for t in traces if t.score <= 0.3]   # Bad examples

# 2. LLM analyzes patterns
patterns = analyze_differences(high_scored, low_scored)

# 3. Generate 5 candidate evals
candidates = [
    "correctness",   # Does it solve the problem?
    "efficiency",    # Is the response concise?
    "safety",        # Is it appropriate?
    "completeness",  # All aspects addressed?
    "ensemble",      # Balanced holistic check
]

# 4. Test candidates against the labels, select winner
winner = select_best(candidates, threshold={
    "accuracy": 0.80,
    "kappa": 0.60,
    "f1": 0.70,
})

Eval Function Structure

Generated evals are Python functions that run in a sandboxed environment. They can call LLMs when needed via the EvalContext.

def eval_function(
    task: dict,           # {user_message: str}
    task_metadata: dict,  # Expected output, success criteria
    trace: dict,          # Agent's execution trace
    ctx: EvalContext      # LLM access, caching, cost tracking
) -> tuple[float, str]:
    """
    Returns (score, feedback) where:
    - score: 0.0 to 1.0
    - feedback: Explanation for the score
    """
    response = trace["agent_response"]

    # Can call LLMs when deterministic checks aren't enough
    if needs_semantic_check(response):
        judgment = ctx.call_llm(
            f"Does this response address the task? {response}"
        )
        return parse_score(judgment), judgment

    # Prefer deterministic checks when possible
    if meets_criteria(response, task_metadata):
        return 1.0, "All criteria met"

    return 0.5, "Partial completion"
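
A hypothetical invocation, just to show the shapes involved; the task, metadata, trace, and StubContext below are made up for illustration (in production the sandbox supplies a real EvalContext).

class StubContext:
    """Made-up stand-in for EvalContext, good enough for a local dry run."""
    def call_llm(self, prompt: str) -> str:
        return "1.0 - the response addresses the task"

score, feedback = eval_function(
    task={"user_message": "Summarize this incident ticket in two sentences."},
    task_metadata={"success_criteria": "two sentences; names the root cause"},
    trace={"agent_response": "A bad deploy caused the outage. It was rolled back within 12 minutes."},
    ctx=StubContext(),
)
print(score, feedback)  # e.g. 1.0 "All criteria met"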

Safe Sandbox

Only safe imports: json, re, math, datetime, difflib

LLM Access

Call Claude/GPT via ctx.call_llm() with cost tracking

Built-in Cache

Per-execution cache prevents redundant LLM calls
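
A rough picture of what the per-execution cache buys you, sketched with a plain dict; this is an illustrative stand-in, not the real EvalContext.

class CachingContext:
    """Memoize identical judge prompts within a single eval execution."""

    def __init__(self, call_llm):
        self._call_llm = call_llm          # the underlying Claude/GPT call
        self._cache: dict[str, str] = {}   # discarded when the execution ends
        self.llm_calls = 0                 # simple cost tracking

    def call_llm(self, prompt: str) -> str:
        if prompt not in self._cache:      # pay for each distinct prompt only once
            self._cache[prompt] = self._call_llm(prompt)
            self.llm_calls += 1
        return self._cache[prompt]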

RULER: Relative Scoring

LLMs are better at ranking solutions side-by-side than scoring them in isolation. RULER (Relative Universal LLM-Elicited Rewards) leverages this insight for 8x cheaper evaluation than pairwise comparison.

How RULER Works

  1. Group 4-8 similar traces together
  2. Ask LLM to rank them relative to each other (listwise comparison)
  3. Convert rankings to advantages using GRPO normalization (sketched after this list)
  4. Use advantages as rewards for GEPA optimization
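
Step 3 in code, as a sketch: turn the judge's best-to-worst ranking into per-trace rewards (linear in rank here, which is an assumption) and normalize them GRPO-style by subtracting the group mean and dividing by the group standard deviation.

from statistics import mean, pstdev

def rank_to_advantages(ranking: list[str]) -> dict[str, float]:
    """Map a best-to-worst ranking of trace IDs to group-normalized advantages."""
    n = len(ranking)
    rewards = {trace_id: float(n - i) for i, trace_id in enumerate(ranking)}  # best rank -> highest reward
    mu = mean(rewards.values())
    sigma = pstdev(rewards.values()) or 1.0  # guard against a zero-variance group
    return {trace_id: (r - mu) / sigma for trace_id, r in rewards.items()}

# Example: the judge ranked a group of four traces t3 > t1 > t4 > t2
advantages = rank_to_advantages(["t3", "t1", "t4", "t2"])  # t3 most positive, t2 most negative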

Research Foundations

iofold builds on cutting-edge research in prompt optimization and agent evaluation. See our Research page for the full list.

GEPA: Reflective Prompt Evolution

Achieves up to 35x better sample efficiency than RL (400-1,200 rollouts vs 24,000) using reflective mutation and Pareto selection.

Read paper

Judging LLM-as-a-Judge (MT-Bench)

Foundational research showing LLM judges achieve 80%+ agreement with human preferences.

Read paper

DSPy: Self-Improving Pipelines

Stanford NLP framework for automatic prompt and weight optimization.

View on GitHub

Get Started

1. Connect your observability tool

iofold init --adapter langfuse

2. Import traces

iofold import --since 7d

3. Label traces in the dashboard

# Mark 10+ traces as positive/negative

4. Generate evals

iofold generate-evals --agent my-agent

5. Optimize with GEPA

iofold optimize --iterations 10