Documentation
Deep dive into iofold's approach to self-improving agents
- Import Traces — Connect Langfuse or LangSmith
- Generate Evals — Automatic from labeled traces
- Optimize Prompts — GEPA-powered evolution
Core Architecture
iofold closes the loop between agent execution, evaluation, and optimization. Instead of requiring hand-written evals, iofold's deepagent automatically generates evaluation functions (deterministic code checks as well as LLM-as-judge) using data science tooling and backtesting, then uses them to evolve better prompts through GEPA.
The Optimization Loop
- Import — Pull traces from Langfuse/LangSmith
- Label — Mark traces as positive/negative
- Generate — Auto-create eval functions from patterns
- Evaluate — Run evals on new traces
- Optimize — GEPA evolves prompts using eval feedback
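In code, one pass of the loop looks roughly like the sketch below. The client handle and every method name on it are hypothetical, shown only to make the loop concrete; they are not iofold's documented SDK.

# Hypothetical SDK names, for illustration only
def optimization_loop(client, agent_id: str) -> str:
    traces = client.import_traces(agent_id, since="7d")   # 1. Import from Langfuse/LangSmith
    labeled = [t for t in traces if t.label is not None]  # 2. Label (marked in the dashboard)
    evals = client.generate_evals(labeled)                # 3. Generate eval functions
    scores = {t.id: [e.run(t) for e in evals]
              for t in client.new_traces(agent_id)}       # 4. Evaluate fresh traces
    return client.optimize_prompt(agent_id,               # 5. GEPA evolves the prompt
                                  feedback=scores,        #    using the eval feedback
                                  iterations=10)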
GEPA: 35x More Efficient Than RL
iofold uses GEPA (Genetic-Pareto) to optimize your agent's system prompt. Unlike traditional reinforcement learning, which can require 24,000+ rollouts, GEPA achieves better results with just 400-1,200 rollouts.
Reflective Mutation
LLMs analyze failed traces in natural language, diagnose problems, and propose targeted prompt improvements—no gradient descent required.
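A minimal sketch of a single mutation step, assuming a generic call_llm(prompt: str) -> str helper; the helper, the trace fields, and the prompt wording are all illustrative, not GEPA's exact implementation.

def reflective_mutation(system_prompt: str,
                        failed_traces: list[dict],
                        call_llm) -> str:
    """Diagnose failures in natural language, then rewrite the prompt."""
    failures = "\n\n".join(
        f"Input: {t['user_message']}\n"
        f"Output: {t['agent_response']}\n"
        f"Eval feedback: {t['feedback']}"
        for t in failed_traces
    )
    return call_llm(
        "You are improving an agent's system prompt.\n\n"
        f"Current prompt:\n{system_prompt}\n\n"
        f"Failing traces:\n{failures}\n\n"
        "First diagnose the shared failure pattern, then output an "
        "improved system prompt that addresses it. Output only the prompt."
    )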
Pareto Selection
Maintains a frontier of top-performing prompts across different test cases, avoiding local optima and preserving diversity.
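One concrete reading of this (a sketch, not GEPA's exact code): keep every candidate prompt that achieves the best score on at least one test case, so specialists survive alongside generalists instead of everything collapsing to a single average-best prompt.

def pareto_frontier(scores: dict[str, list[float]]) -> list[str]:
    """scores maps candidate prompt -> per-test-case scores.
    A candidate stays on the frontier if it is best on >= 1 case."""
    n_cases = len(next(iter(scores.values())))
    frontier = set()
    for i in range(n_cases):
        best = max(scores, key=lambda c: scores[c][i])
        frontier.add(best)
    return sorted(frontier)

# Example: prompt_b wins case 0, prompt_a wins case 1; both survive.
print(pareto_frontier({
    "prompt_a": [0.2, 0.9, 0.5],
    "prompt_b": [0.8, 0.4, 0.5],
}))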
GEPA is integrated into MLflow via mlflow.genai.optimize_prompts() and works with DSPy.
Automatic Eval Generation
iofold analyzes your labeled traces to automatically generate Python evaluation functions. No manual eval writing required—just label 10+ traces and let the system learn.
How It Works
# 1. Collect labeled traces (scores in [0, 1])
high_scored = traces.filter(lambda t: t.score >= 0.7)  # good examples
low_scored = traces.filter(lambda t: t.score <= 0.3)   # bad examples
# 2. LLM analyzes patterns
patterns = analyze_differences(high_scored, low_scored)
# 3. Generate 5 candidate evals
candidates = [
    "correctness",   # Does it solve the problem?
    "efficiency",    # Is the response concise?
    "safety",        # Is it appropriate?
    "completeness",  # All aspects addressed?
    "ensemble",      # Balanced holistic check
]
# 4. Test candidates, select winner
winner = select_best(candidates, threshold={
    "accuracy": 0.80,
    "kappa": 0.60,
    "f1": 0.70,
})
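To illustrate what passing those thresholds means (a sketch, not iofold's actual selection code), each candidate eval's verdicts can be compared against the human labels using standard scikit-learn metrics:

from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

def passes_thresholds(human_labels: list[int],
                      eval_scores: list[float],
                      threshold: dict) -> bool:
    """human_labels: 0/1 per trace; eval_scores: floats in [0, 1]."""
    preds = [1 if s >= 0.5 else 0 for s in eval_scores]  # binarize at 0.5
    return (accuracy_score(human_labels, preds) >= threshold["accuracy"]
            and cohen_kappa_score(human_labels, preds) >= threshold["kappa"]
            and f1_score(human_labels, preds) >= threshold["f1"])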
Eval Function Structure
Generated evals are Python functions that run in a sandboxed environment. They can call LLMs when needed via the EvalContext.
def eval_function(
    task: dict,           # {user_message: str}
    task_metadata: dict,  # expected output, success criteria
    trace: dict,          # agent's execution trace
    ctx: EvalContext,     # LLM access, caching, cost tracking
) -> tuple[float, str]:
    """
    Returns (score, feedback) where:
    - score: 0.0 to 1.0
    - feedback: explanation for the score
    """
    response = trace["agent_response"]
    # Prefer deterministic checks when possible
    if meets_criteria(response, task_metadata):
        return 1.0, "All criteria met"
    # Fall back to an LLM judge when deterministic checks aren't enough
    if needs_semantic_check(response):
        judgment = ctx.call_llm(
            f"Does this response address the task? {response}"
        )
        return parse_score(judgment), judgment
    return 0.5, "Partial completion"
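For a sense of the contract, here is how such a function could be exercised locally with a stub standing in for EvalContext; the stub class and its canned reply are hypothetical, not iofold's actual implementation:

class StubEvalContext:
    """Hypothetical stand-in for EvalContext with a per-execution cache."""
    def __init__(self):
        self._cache: dict[str, str] = {}
    def call_llm(self, prompt: str) -> str:
        # Built-in cache: identical prompts are answered only once
        if prompt not in self._cache:
            self._cache[prompt] = "score: 1.0 (addresses the task)"
        return self._cache[prompt]

score, feedback = eval_function(
    task={"user_message": "Summarize this article"},
    task_metadata={"success_criteria": "mentions all key points"},
    trace={"agent_response": "The article argues that..."},
    ctx=StubEvalContext(),
)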
- Safe Sandbox — Only safe imports: json, re, math, datetime, difflib
- LLM Access — Call Claude/GPT via ctx.call_llm() with cost tracking
- Built-in Cache — Per-execution cache prevents redundant LLM calls
RULER: Relative Scoring
LLMs are better at ranking solutions side-by-side than scoring them in isolation. RULER (Relative Universal LLM-Elicited Rewards) leverages this insight for 8x cheaper evaluation than pairwise comparison.
How RULER Works
- Group 4-8 similar traces together
- Ask LLM to rank them relative to each other (listwise comparison)
- Convert rankings to advantages using GRPO normalization
- Use advantages as rewards for GEPA optimization
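A sketch of steps 3 and 4, assuming the LLM's ranking has already been parsed into one rank per trace (1 = best); the rank-to-score mapping here is one simple choice among several.

from statistics import mean, stdev

def ranks_to_advantages(ranks: list[int]) -> list[float]:
    """GRPO-style normalization: advantage_i = (score_i - mean) / std,
    where score_i is derived from the LLM's ranking within the group."""
    n = len(ranks)
    scores = [n - r for r in ranks]   # higher score = better rank
    mu, sigma = mean(scores), stdev(scores)
    if sigma == 0:
        return [0.0] * n              # all tied: no learning signal
    return [(s - mu) / sigma for s in scores]

# Example: 4 grouped traces; the 1st-ranked trace gets the largest
# positive advantage, the 4th-ranked the most negative.
print(ranks_to_advantages([2, 1, 4, 3]))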
Research Foundations
iofold builds on cutting-edge research in prompt optimization and agent evaluation. See our Research page for the full list.
GEPA: Reflective Prompt Evolution
Achieves roughly 35x the sample efficiency of RL (400-1,200 rollouts vs 24,000+) using reflective mutation and Pareto selection.
Judging LLM-as-a-Judge (MT-Bench)
Foundational research showing LLM judges achieve 80%+ agreement with human preferences.
DSPy: Self-Improving Pipelines
Stanford NLP framework for automatic prompt and weight optimization.
Get Started
1. Connect your observability tool
iofold init --adapter langfuse
2. Import traces
iofold import --since 7d
3. Label traces in the dashboard
# Mark 10+ traces as positive/negative
4. Generate evals
iofold generate-evals --agent my-agent
5. Optimize with GEPA
iofold optimize --iterations 10