Research
Academic papers, cookbooks, and resources that inform iofold's approach to self-improving agents
Featured: OpenAI Self-Evolving Agents Cookbook
The definitive guide to building autonomous agent retraining systems. Covers three optimization strategies: OpenAI Platform Optimizer, static metaprompt optimization, and GEPA (Genetic Pareto) optimization.
Read the cookbook
Self-Evolving Agents
Cookbooks and papers on building agents that improve themselves through feedback loops
Self-Evolving Agents: A Cookbook for Autonomous Agent Retraining
Comprehensive guide to building repeatable retraining loops that capture issues, learn from feedback, and promote improvements. Covers OpenAI Platform Optimizer, static metaprompt optimization, and GEPA (Genetic Pareto) optimization; a minimal version of such a loop is sketched at the end of this section.
SAGE: Self-evolving Agents with Reflective and Memory-augmented Abilities
Academic paper on a self-evolving agent architecture combining reflective capabilities with memory augmentation.
Introducing AgentKit
OpenAI's toolkit for building, deploying, and optimizing agents with automated prompt optimization and trace grading.
Build a Coding Agent with GPT-5.1
Guide to building coding agents with GPT-5.1 models, covering agentic workflow best practices.
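The retraining loop described in the cookbook above can be illustrated with a rough sketch (not the cookbook's own implementation): capture failing eval cases, ask a model to rewrite the system prompt, and promote the candidate only if it scores better. The model names, eval cases, and pass criterion below are placeholders.

```python
# Rough sketch of a capture-failures -> rewrite-prompt -> promote loop.
# Assumptions (not from the cookbook): OpenAI Python SDK, OPENAI_API_KEY set,
# placeholder model names, and a toy substring-based pass criterion.
from openai import OpenAI

client = OpenAI()

EVAL_CASES = [
    {"input": "Summarize: The cat sat on the mat.", "must_include": "cat"},
    {"input": "Summarize: Paris is the capital of France.", "must_include": "Paris"},
]

def run_agent(system_prompt: str, user_input: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder agent model
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": user_input}],
    )
    return response.choices[0].message.content

def score(system_prompt: str) -> tuple[float, list[str]]:
    # Run every eval case and capture the failing traces.
    failures = []
    for case in EVAL_CASES:
        output = run_agent(system_prompt, case["input"])
        if case["must_include"].lower() not in output.lower():
            failures.append(f"Input: {case['input']}\nOutput: {output}")
    return 1 - len(failures) / len(EVAL_CASES), failures

def improve(system_prompt: str, failures: list[str]) -> str:
    # Ask an LLM to rewrite the system prompt using the captured failures as feedback.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder optimizer model
        messages=[{"role": "user", "content":
                   "Rewrite this system prompt so the failures below no longer occur. "
                   "Return only the new prompt.\n\nPrompt:\n" + system_prompt
                   + "\n\nFailures:\n" + "\n\n".join(failures)}],
    )
    return response.choices[0].message.content

prompt = "You are a helpful assistant."
best_score, failures = score(prompt)
for _ in range(3):  # bounded number of retraining iterations
    if not failures:
        break
    candidate = improve(prompt, failures)
    candidate_score, candidate_failures = score(candidate)
    if candidate_score > best_score:  # promote only if the candidate improves
        prompt, best_score, failures = candidate, candidate_score, candidate_failures
```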
LLM-as-Judge
Using language models to evaluate other language models
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng et al.
Foundational paper demonstrating that GPT-4 judges achieve over 80% agreement with human preferences. Introduces MT-Bench and examines biases such as position bias and verbosity bias.
JudgeBench: A Benchmark for Evaluating LLM-based Judges
Comprehensive benchmark for objectively evaluating LLM-based judges, showing that even strong models perform only slightly better than random on challenging response pairs.
LLMs-as-Judges: A Comprehensive Survey
Comprehensive survey covering various LLM-based evaluation methodologies and frameworks.
FastChat LLM Judge
Official MT-Bench implementation, including 80 multi-turn questions and 30K conversations annotated with human preferences.
LLM-as-a-judge: Complete Guide
Practical guide covering pointwise scoring, pairwise comparison, and validation strategies.
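As a concrete illustration of the pairwise-comparison setup covered in the guide above, here is a minimal judge sketch that calls the OpenAI API twice with the answer order swapped, to counteract the position bias documented in the MT-Bench paper. The judge model and rubric are placeholders, not taken from any of the listed resources.

```python
# Minimal pairwise LLM-as-judge sketch. Assumptions: OpenAI Python SDK,
# OPENAI_API_KEY set; the judge model and rubric below are placeholders.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are comparing two answers to the same question.
Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
Reply with exactly one letter: "A" if A is better, "B" if B is better, "C" for a tie."""

def judge_once(question: str, answer_a: str, answer_b: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().strip('"')

def judge_pairwise(question: str, answer_1: str, answer_2: str) -> str:
    # Run the comparison in both orders to counteract position bias.
    first = judge_once(question, answer_1, answer_2)
    second = judge_once(question, answer_2, answer_1)
    if first == "A" and second == "B":
        return "answer_1"
    if first == "B" and second == "A":
        return "answer_2"
    return "tie"  # tied or inconsistent verdicts

print(judge_pairwise("What causes tides?", "Mainly the Moon's gravity.", "Strong winds."))
```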
Prompt Optimization
Automatic optimization of prompts using feedback and gradients
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
Omar Khattab et al., Stanford NLP
Stanford NLP framework that abstracts LM pipelines as text transformation graphs, enabling automatic prompt and weight optimization without manual trial-and-error.
TextGrad: Automatic Differentiation via Text
Zou Group, Stanford
Published in Nature. Backpropagates textual feedback from LLMs to optimize prompts, achieving near-GPT-4 performance with GPT-3.5 in a few iterations.
Automatic Prompt Optimization with Gradient Descent and Beam Search
APO: A nonparametric algorithm using natural language gradients and beam search to automatically improve prompts.
AutoPDL: Automatic Prompt Optimization for LLM Agents
Frames prompt optimization as structured AutoML over a combinatorial space of prompting patterns, explored with successive halving.
DSPy Official Documentation
Official DSPy documentation with tutorials, an optimizer overview, and optimization guides; a minimal usage sketch appears at the end of this section.
Awesome LLM Prompt Optimization
Curated list of advanced prompt optimization and tuning methods since 2022.
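For a concrete sense of the declarative workflow, the sketch below is based on DSPy's documented BootstrapFewShot optimizer: a pipeline is declared as a signature, and the optimizer bootstraps few-shot demonstrations against a metric. The model name, training examples, and metric are placeholder assumptions.

```python
# Minimal DSPy sketch. Assumptions: dspy installed, an OpenAI-compatible model
# available; the model name, training examples, and metric are toy placeholders.
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder model

# Declare the pipeline as a signature instead of a hand-written prompt.
qa = dspy.ChainOfThought("question -> answer")

trainset = [
    dspy.Example(question="What is 2 + 2?", answer="4").with_inputs("question"),
    dspy.Example(question="What is the capital of France?", answer="Paris").with_inputs("question"),
]

def exact_match(example, prediction, trace=None):
    # Metric that guides the optimizer.
    return example.answer.lower() in prediction.answer.lower()

# The optimizer bootstraps few-shot demonstrations that raise the metric.
optimizer = dspy.BootstrapFewShot(metric=exact_match)
optimized_qa = optimizer.compile(qa, trainset=trainset)

print(optimized_qa(question="What is 3 + 3?").answer)
```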
Automatic Eval Generation
Automatically generating evaluations and benchmarks for LLMs
A Closer Look into Automatic Evaluation Using Large Language Models
Analyzes G-Eval and related LLM-based evaluation methods, demonstrating that asking LLMs to explain their ratings improves alignment with human judgments.
AutoCodeBench: LLMs are Automatic Code Benchmark Generators
Automated workflow using LLM-Sandbox Interaction to generate 3,920 code problems across 20 programming languages without manual annotations.
DeepEval: The LLM Evaluation Framework
Open-source framework supporting end-to-end LLM evaluation with ready-to-use metrics and synthetic dataset generation.
EleutherAI LM Eval Harness
Framework for testing generative language models on 60+ standard academic benchmarks.
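The resources above share a common pattern: use a strong model to draft labelled test cases, then review them before they enter a benchmark. A minimal sketch of that pattern, assuming the OpenAI Python SDK with JSON-mode output; the model name and schema are placeholders.

```python
# Minimal synthetic eval-case generation sketch. Assumptions: OpenAI Python SDK,
# OPENAI_API_KEY set; the model name and output schema are placeholders.
import json
from openai import OpenAI

client = OpenAI()

def generate_eval_cases(topic: str, n: int = 5) -> list[dict]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder generator model
        messages=[{"role": "user", "content":
                   f"Write {n} question/answer pairs for evaluating an assistant on {topic}. "
                   'Return JSON of the form {"cases": [{"question": "...", "expected_answer": "..."}]}.'}],
        response_format={"type": "json_object"},  # ask for machine-readable output
    )
    return json.loads(response.choices[0].message.content)["cases"]

# Generated cases should still be reviewed before being used as a benchmark.
for case in generate_eval_cases("metric unit conversion"):
    print(case["question"], "->", case["expected_answer"])
```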
Code as Evals
Using executable code and programmatic methods for evaluation
CodeJudge: Evaluating Code Generation with LLMs
Framework leveraging LLMs to evaluate the semantic correctness of generated code across four programming languages.
A Survey on Evaluating LLMs in Code Generation Tasks
Comprehensive review of methods and metrics for evaluating LLM code generation including correctness, efficiency, and readability.
LiveCodeBench: Holistic and Contamination Free Evaluation
Continuously updated benchmark for code-related capabilities including self-repair and test output prediction.
EvalPlus: Rigorous Evaluation of LLM-synthesized Code
NeurIPS 2023 framework evaluating both correctness and efficiency of LLM-generated code.
ICE-Score: Instructing LLMs to Evaluate Code
EACL 2024 project for LLM-based code evaluation without relying solely on test cases.
Pydantic AI Evals
Code-first evaluation framework where all components are defined in Python.
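Most of these frameworks reduce to the same primitive: execute candidate code against tests and record pass or fail. A minimal, library-free sketch of that primitive; a production harness (as in EvalPlus or LiveCodeBench) would add proper sandboxing, resource limits, and richer reporting.

```python
# Minimal code-as-eval sketch: run candidate code against assertions in a
# subprocess. This only shows the core idea, not a hardened sandbox.
import os
import subprocess
import sys
import tempfile

CANDIDATE_CODE = """
def add(a, b):
    return a + b
"""

TESTS = """
assert add(2, 3) == 5
assert add(-1, 1) == 0
"""

def run_code_eval(candidate: str, tests: str, timeout_s: float = 5.0) -> bool:
    # Write the candidate plus its tests to a temporary script and execute it.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate + "\n" + tests)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout_s)
        return result.returncode == 0  # non-zero exit means a failed assertion or error
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

print(run_code_eval(CANDIDATE_CODE, TESTS))  # True for this candidate
```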
Contribute Resources
Know of a paper, tool, or resource that should be on this page? We'd love to hear from you.