Research
Academic papers, cookbooks, and resources that inform iofold's approach to self-improving agents
Featured: GEPA - 35x More Efficient Than RL
iofold is built on GEPA (Genetic-Pareto Agent Evolution), which achieves state-of-the-art results with 400-1,200 rollouts instead of the 24,000 required by traditional RL. The key: reflective mutation, where LLMs analyze failures and propose improvements, combined with Pareto selection across training instances.
Read the paper
Genetic Evolution for Agents
Evolutionary algorithms that outperform RL for agent optimization
GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
Core algorithm powering iofold's self-improvement. Achieves 35x efficiency over traditional RL (400-1,200 rollouts vs 24,000) using reflective mutation and Pareto selection across training instances.
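For intuition, here is a minimal Python sketch of a GEPA-style loop: score every candidate prompt per training instance, sample a parent from the Pareto front (candidates that are best on at least one instance), and let an LLM rewrite it after reading its failures. The evaluate and reflect_and_mutate callables and the simple budget logic are illustrative assumptions, not the paper's implementation.

```python
import random

def pareto_front(candidates, scores):
    """Keep every candidate that is best on at least one training instance."""
    front = set()
    n_instances = len(scores[candidates[0]])
    for i in range(n_instances):
        front.add(max(candidates, key=lambda c: scores[c][i]))
    return list(front)

def evolve(seed_prompt, train_set, budget, evaluate, reflect_and_mutate):
    """evaluate(prompt, examples) -> per-example scores in [0, 1];
    reflect_and_mutate(prompt, failures) -> new prompt proposed by an LLM."""
    candidates = [seed_prompt]
    scores = {seed_prompt: evaluate(seed_prompt, train_set)}
    rollouts = len(train_set)
    while rollouts < budget:
        parent = random.choice(pareto_front(candidates, scores))
        failures = [(ex, s) for ex, s in zip(train_set, scores[parent]) if s < 1.0]
        child = reflect_and_mutate(parent, failures)  # LLM reads failure traces, proposes an edited prompt
        scores[child] = evaluate(child, train_set)
        rollouts += len(train_set)
        candidates.append(child)
    return max(candidates, key=lambda c: sum(scores[c]))
```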
Self-Evolving Agents: A Cookbook for Autonomous Agent Retraining
OpenAI cookbook covering GEPA implementation alongside Platform Optimizer and static metaprompt optimization strategies.
SAGE: Self-evolving Agents with Reflective and Memory-augmented Abilities
Academic paper on self-evolving agent architecture with reflective capabilities and memory augmentation.
Reward Modeling
Techniques for scoring agent trajectories without extensive human labeling
RULER: Relative Universal LLM-Elicited Rewards
OpenPipe's breakthrough in reward modeling. Comparative evaluation of a whole group of trajectories is roughly 8x cheaper than pairwise comparison and more reliable than absolute scoring. Core insight: LLMs rank solutions side by side more reliably than they score them in isolation.
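A hedged sketch of the comparative-scoring idea: one judge call ranks all trajectories for the same task, and ranks are mapped to rewards. The prompt wording, JSON reply format, and judge callable are assumptions, not OpenPipe's implementation.

```python
import json

def comparative_rewards(task, trajectories, judge):
    """Rank all candidate trajectories for one task in a single judge call,
    then map ranks to rewards in [0, 1]. `judge` is any text -> text LLM callable."""
    numbered = "\n\n".join(f"[{i}] {t}" for i, t in enumerate(trajectories))
    prompt = (
        f"Task:\n{task}\n\nCandidate solutions:\n{numbered}\n\n"
        "Rank the candidates from best to worst. "
        'Reply with JSON only, e.g. {"ranking": [2, 0, 1]}.'
    )
    ranking = json.loads(judge(prompt))["ranking"]
    n = len(trajectories)
    if n == 1:
        return [1.0]
    rewards = [0.0] * n
    for rank, idx in enumerate(ranking):
        rewards[idx] = (n - 1 - rank) / (n - 1)   # best -> 1.0, worst -> 0.0
    return rewards
```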
DeepSeekMath: Introducing GRPO (Group Relative Policy Optimization)
Introduces GRPO, a critic-free RL algorithm that drops PPO's separate value network, roughly halving training compute. Normalizes each reward against the group of completions sampled for the same prompt: A_i = (r_i - mean) / std. Powers DeepSeek-R1's reasoning capabilities.
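The group-normalized advantage is simple enough to show directly; a small sketch of the formula above in plain Python, with an epsilon added for numerical safety:

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: normalize each reward against the other
    completions sampled for the same prompt."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled answers to one prompt, scored 0/1 by a verifier.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))   # correct answers get positive advantage
```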
Bradley-Terry Model for Pairwise Comparisons
Statistical model for aggregating pairwise comparisons into global rankings. Used in iofold for combining multiple RULER comparisons.
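A small sketch of fitting Bradley-Terry strengths from win counts using the standard minorization-maximization (MM) update; the wins[(i, j)] data layout is an illustrative choice, not iofold's internal format.

```python
def bradley_terry(wins, items, iters=100):
    """wins[(i, j)] = number of times i beat j; returns normalized strengths."""
    strength = {i: 1.0 for i in items}
    for _ in range(iters):
        new = {}
        for i in items:
            total_wins = sum(wins.get((i, j), 0) for j in items if j != i)
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0)) / (strength[i] + strength[j])
                for j in items if j != i
            )
            new[i] = total_wins / denom if denom > 0 else strength[i]
        norm = sum(new.values())
        strength = {i: v / norm for i, v in new.items()}   # fix the scale (identifiability)
    return strength

# Example: aggregate a few judge verdicts into a global ranking.
print(bradley_terry({("a", "b"): 3, ("b", "a"): 1, ("a", "c"): 2, ("c", "b"): 1}, ["a", "b", "c"]))
```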
LLM-as-Judge
Using language models to evaluate other language models
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng et al.
Foundational paper demonstrating that GPT-4 judges achieve over 80% agreement with human preferences. Introduces MT-Bench and examines judge biases such as position bias and verbosity bias.
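The paper's position-bias mitigation (judge each pair in both orders and count only consistent verdicts) is easy to sketch; the prompt wording and single-character reply convention below are assumptions.

```python
def judge_pair(question, answer_a, answer_b, judge):
    """Return 'A', 'B', or 'tie'. `judge` is any text -> text LLM callable
    expected to reply with '1', '2', or 'tie'."""
    def ask(first, second):
        prompt = (
            f"Question: {question}\n\n"
            f"Assistant 1:\n{first}\n\nAssistant 2:\n{second}\n\n"
            "Which assistant answered better? Reply with '1', '2', or 'tie'."
        )
        return judge(prompt).strip()

    forward = ask(answer_a, answer_b)    # A shown first
    reverse = ask(answer_b, answer_a)    # order swapped to expose position bias
    if forward == "1" and reverse == "2":
        return "A"
    if forward == "2" and reverse == "1":
        return "B"
    return "tie"                         # inconsistent verdicts are treated as a tie
```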
JudgeBench: A Benchmark for Evaluating LLM-based Judges
Comprehensive benchmark to objectively evaluate LLM-based judges, showing even strong models perform only slightly better than random on challenging response pairs.
LLMs-as-Judges: A Comprehensive Survey
Comprehensive survey covering various LLM-based evaluation methodologies and frameworks.
FastChat LLM Judge
Official MT-Bench implementation with 80 multi-turn questions and 30K conversations with human preferences.
LLM-as-a-judge: Complete Guide
Practical guide covering pointwise scoring, pairwise comparison, and validation strategies.
Prompt Optimization
Automatic optimization of prompts using feedback and gradients
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
Omar Khattab et al., Stanford NLP
Stanford NLP framework that abstracts LM pipelines as text transformation graphs, enabling automatic prompt and weight optimization without manual trial-and-error.
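A minimal usage sketch in the spirit of recent DSPy releases; the model string is a placeholder, and the exact API surface may differ across versions.

```python
# Minimal DSPy sketch (API per recent DSPy releases; model string is a placeholder).
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))   # configure any supported LM

# Declare the transformation; DSPy compiles and optimizes the underlying prompt.
qa = dspy.ChainOfThought("question -> answer")
print(qa(question="What does Pareto selection buy a prompt optimizer?").answer)
```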
TextGrad: Automatic Differentiation via Text
Zou Group, Stanford
Published in Nature. Backpropagates textual feedback from LLMs to optimize prompts, lifting GPT-3.5 to near-GPT-4 performance within a few iterations.
Automatic Prompt Optimization with Gradient Descent and Beam Search
APO: A nonparametric algorithm using natural language gradients and beam search to automatically improve prompts.
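A hedged sketch of the APO-style loop: critique failing examples in natural language (the "textual gradient"), apply edits, and keep a beam of the best prompts. The score, critique, and edit callables are assumptions; the paper additionally uses bandit-style selection to cut evaluation cost.

```python
def optimize_prompt(seed, minibatches, score, critique, edit, beam=4, steps=5):
    """score(prompt, batch) -> mean score in [0, 1]; critique -> textual gradient;
    edit(prompt, gradient) -> list of rewritten prompts."""
    beam_set = [seed]
    for _, batch in zip(range(steps), minibatches):
        expanded = []
        for prompt in beam_set:
            errors = [ex for ex in batch if score(prompt, [ex]) < 1.0]
            gradient = critique(prompt, errors)             # natural-language critique of failures
            expanded += [prompt] + edit(prompt, gradient)   # several candidate rewrites per gradient
        beam_set = sorted(expanded, key=lambda p: score(p, batch), reverse=True)[:beam]
    return beam_set[0]
```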
AutoPDL: Automatic Prompt Optimization for LLM Agents
Frames prompt optimization as structured AutoML over combinatorial spaces of prompting patterns using successive halving.
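Successive halving itself is easy to sketch: evaluate all candidate configurations on a small budget, keep the best half, double the budget, and repeat. The evaluate callable and budget schedule below are illustrative assumptions.

```python
def successive_halving(configs, evaluate, initial_budget=8):
    """evaluate(config, n_examples) -> mean score on n_examples held-out tasks."""
    survivors, budget = list(configs), initial_budget
    while len(survivors) > 1:
        scored = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        survivors = scored[: max(1, len(scored) // 2)]   # keep the top half
        budget *= 2                                      # spend more on fewer candidates
    return survivors[0]
```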
DSPy Official Documentation
Official DSPy documentation with tutorials, optimizer overview, and optimization guides.
Awesome LLM Prompt Optimization
Curated list of advanced prompt optimization and tuning methods since 2022.
Automatic Eval Generation
Automatically generating evaluations and benchmarks for LLMs
A Closer Look into Automatic Evaluation Using Large Language Models
Analyzes G-Eval and LLM-based evaluation methods, demonstrating that asking LLMs to explain their ratings improves alignment with humans.
AutoCodeBench: LLMs are Automatic Code Benchmark Generators
Automated workflow using LLM-Sandbox Interaction to generate 3,920 code problems across 20 programming languages without manual annotations.
DeepEval: The LLM Evaluation Framework
Open-source framework supporting end-to-end LLM evaluation with ready-to-use metrics and synthetic dataset generation.
EleutherAI LM Eval Harness
Framework testing generative language models across 60+ standard academic benchmarks.
Code as Evals
Using executable code and programmatic methods for evaluation
CodeJudge: Evaluating Code Generation with LLMs
Framework leveraging LLMs to evaluate semantic correctness of generated code across four programming languages.
A Survey on Evaluating LLMs in Code Generation Tasks
Comprehensive review of methods and metrics for evaluating LLM code generation including correctness, efficiency, and readability.
LiveCodeBench: Holistic and Contamination Free Evaluation
Continuously updated benchmark for code-related capabilities including self-repair and test output prediction.
EvalPlus: Rigorous Evaluation of LLM-synthesized Code
NeurIPS 2023 framework evaluating both correctness and efficiency of LLM-generated code.
ICE-Score: Instructing LLMs to Evaluate Code
EACL 2024 project for LLM-based code evaluation without relying solely on test cases.
Pydantic AI Evals
Code-first evaluation framework where all components are defined in Python.
Rollout Generation & Simulation
Generating agent trajectories efficiently without expensive real-world execution
VCR: Record/Replay for HTTP Interactions
The cassette pattern for recording and replaying tool executions. iofold uses this approach to replay agent trajectories deterministically without re-executing tools.
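A hedged sketch of the cassette pattern applied to tool calls (not VCR's actual API, which targets HTTP): hash the call, record the live result once, and replay it deterministically on later runs.

```python
import hashlib, json, os

class Cassette:
    """Record tool results on first run; replay them byte-for-byte afterwards."""

    def __init__(self, path, mode="replay"):
        self.path, self.mode = path, mode
        if os.path.exists(path):
            with open(path) as f:
                self.tape = json.load(f)
        else:
            self.tape = {}

    def call(self, tool_name, args, live_fn):
        key = hashlib.sha256(json.dumps([tool_name, args], sort_keys=True).encode()).hexdigest()
        if self.mode == "replay" and key in self.tape:
            return self.tape[key]            # deterministic replay: no real tool execution
        result = live_fn(**args)             # record mode (or cache miss): call the real tool
        self.tape[key] = result
        with open(self.path, "w") as f:
            json.dump(self.tape, f)
        return result
```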
WebArena: A Realistic Web Environment for Building Autonomous Agents
Benchmark for web automation agents with realistic browser environments. Informs iofold's simulated browser environment design.
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks
Large-scale benchmark across operating systems. Demonstrates the importance of realistic environment simulation for agent evaluation.
AgentBench: Evaluating LLMs as Agents
Multi-dimensional benchmark for LLM agents across operating systems, games, web browsing, and databases.
User Behavior Simulation
Modeling realistic user interactions for multi-turn agent evaluation
Generative Agents: Interactive Simulacra of Human Behavior
Stanford/Google paper on simulating believable human behavior. Foundational for iofold's user behavior modeling in multi-turn conversations.
UGRO: User-Guided Response Optimization with LLM User Simulators
Using LLMs as annotation-free user simulators to assess dialogue responses and optimize task-oriented dialogue systems.
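A hedged sketch of an LLM user simulator driving a multi-turn evaluation loop; the goal prompt wording, the DONE termination convention, and the agent/user_llm callables are assumptions.

```python
def simulate_dialogue(goal, agent, user_llm, max_turns=8):
    """agent(history) -> reply string; user_llm(prompt) -> simulated user message."""
    history = []
    user_msg = user_llm(f"You are a user with this goal: {goal}. Write your opening message.")
    for _ in range(max_turns):
        history.append(("user", user_msg))
        history.append(("agent", agent(history)))
        user_msg = user_llm(
            f"Goal: {goal}\nConversation so far: {history}\n"
            "Reply as the user. Say DONE if your goal has been satisfied."
        )
        if "DONE" in user_msg:
            break
    return history
```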
GAIA: A Benchmark for General AI Assistants
Benchmark requiring real-world skills like web browsing and multi-step reasoning. Informs iofold's approach to evaluating general-purpose agents.
Contribute Resources
Know of a paper, tool, or resource that should be on this page? We'd love to hear from you.