Research
Academic papers, cookbooks, and resources that inform iofold's approach to self-improving agents
Featured: GEPA - 35x More Efficient Than RL
iofold is built on GEPA (Genetic-Pareto Agent Evolution), which achieves state-of-the-art results with 400-1,200 rollouts instead of the 24,000 required by traditional RL. The key: reflective mutation, where LLMs analyze failures and propose improvements, combined with Pareto selection across training instances.
Read the paper
Genetic Evolution for Agents
Evolutionary algorithms that outperform RL for agent optimization
GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
Core algorithm powering iofold's self-improvement. Achieves 35x efficiency over traditional RL (400-1,200 rollouts vs 24,000) using reflective mutation and Pareto selection across training instances.
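For intuition, here is a minimal Python sketch of a GEPA-style loop: score every candidate prompt per training instance, sample a parent from the Pareto front (candidates that are best on at least one instance), and let an LLM rewrite it after reading its failures. The evaluate and reflect_and_mutate callables and the simple budget logic are illustrative assumptions, not the paper's implementation.

```python
import random

def pareto_front(candidates, scores):
    """Keep every candidate that is best on at least one training instance."""
    front = set()
    n_instances = len(scores[candidates[0]])
    for i in range(n_instances):
        front.add(max(candidates, key=lambda c: scores[c][i]))
    return list(front)

def evolve(seed_prompt, train_set, budget, evaluate, reflect_and_mutate):
    """evaluate(prompt, examples) -> per-example scores in [0, 1];
    reflect_and_mutate(prompt, failures) -> new prompt proposed by an LLM."""
    candidates = [seed_prompt]
    scores = {seed_prompt: evaluate(seed_prompt, train_set)}
    rollouts = len(train_set)
    while rollouts < budget:
        parent = random.choice(pareto_front(candidates, scores))
        failures = [(ex, s) for ex, s in zip(train_set, scores[parent]) if s < 1.0]
        child = reflect_and_mutate(parent, failures)  # LLM reads failure traces, proposes an edited prompt
        scores[child] = evaluate(child, train_set)
        rollouts += len(train_set)
        candidates.append(child)
    return max(candidates, key=lambda c: sum(scores[c]))
```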
Self-Evolving Agents: A Cookbook for Autonomous Agent Retraining
OpenAI cookbook covering GEPA implementation alongside Platform Optimizer and static metaprompt optimization strategies.
SAGE: Self-evolving Agents with Reflective and Memory-augmented Abilities
Academic paper on self-evolving agent architecture with reflective capabilities and memory augmentation.
Reward Modeling
Techniques for scoring agent trajectories without extensive human labeling
RULER: Relative Universal LLM-Elicited Rewards
OpenPipe's breakthrough in reward modeling. Comparative evaluation of a whole group of trajectories is roughly 8x cheaper than pairwise comparison and more reliable than absolute scoring. Core insight: LLMs rank solutions side by side more reliably than they score them in isolation.
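A hedged sketch of the comparative-scoring idea: one judge call ranks all trajectories for the same task, and ranks are mapped to rewards. The prompt wording, JSON reply format, and judge callable are assumptions, not OpenPipe's implementation.

```python
import json

def comparative_rewards(task, trajectories, judge):
    """Rank all candidate trajectories for one task in a single judge call,
    then map ranks to rewards in [0, 1]. `judge` is any text -> text LLM callable."""
    numbered = "\n\n".join(f"[{i}] {t}" for i, t in enumerate(trajectories))
    prompt = (
        f"Task:\n{task}\n\nCandidate solutions:\n{numbered}\n\n"
        "Rank the candidates from best to worst. "
        'Reply with JSON only, e.g. {"ranking": [2, 0, 1]}.'
    )
    ranking = json.loads(judge(prompt))["ranking"]
    n = len(trajectories)
    if n == 1:
        return [1.0]
    rewards = [0.0] * n
    for rank, idx in enumerate(ranking):
        rewards[idx] = (n - 1 - rank) / (n - 1)   # best -> 1.0, worst -> 0.0
    return rewards
```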
DeepSeekMath: Introducing GRPO (Group Relative Policy Optimization)
Introduces GRPO, a critic-free RL algorithm that drops PPO's separate value network, roughly halving training compute. Normalizes each reward against the group of completions sampled for the same prompt: A_i = (r_i - mean) / std. Powers DeepSeek-R1's reasoning capabilities.
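The group-normalized advantage is simple enough to show directly; a small sketch of the formula above in plain Python, with an epsilon added for numerical safety:

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: normalize each reward against the other
    completions sampled for the same prompt."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled answers to one prompt, scored 0/1 by a verifier.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))   # correct answers get positive advantage
```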
Bradley-Terry Model for Pairwise Comparisons
Statistical model for aggregating pairwise comparisons into global rankings. Used in iofold for combining multiple RULER comparisons.
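A small sketch of fitting Bradley-Terry strengths from win counts using the standard minorization-maximization (MM) update; the wins[(i, j)] data layout is an illustrative choice, not iofold's internal format.

```python
def bradley_terry(wins, items, iters=100):
    """wins[(i, j)] = number of times i beat j; returns normalized strengths."""
    strength = {i: 1.0 for i in items}
    for _ in range(iters):
        new = {}
        for i in items:
            total_wins = sum(wins.get((i, j), 0) for j in items if j != i)
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0)) / (strength[i] + strength[j])
                for j in items if j != i
            )
            new[i] = total_wins / denom if denom > 0 else strength[i]
        norm = sum(new.values())
        strength = {i: v / norm for i, v in new.items()}   # fix the scale (identifiability)
    return strength

# Example: aggregate a few judge verdicts into a global ranking.
print(bradley_terry({("a", "b"): 3, ("b", "a"): 1, ("a", "c"): 2, ("c", "b"): 1}, ["a", "b", "c"]))
```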
LLM-as-Judge
Using language models to evaluate other language models
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng et al.
Foundational paper demonstrating that GPT-4 judges achieve over 80% agreement with human preferences. Introduces MT-Bench and examines judge biases such as position bias and verbosity bias.
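The paper's position-bias mitigation (judge each pair in both orders and count only consistent verdicts) is easy to sketch; the prompt wording and single-character reply convention below are assumptions.

```python
def judge_pair(question, answer_a, answer_b, judge):
    """Return 'A', 'B', or 'tie'. `judge` is any text -> text LLM callable
    expected to reply with '1', '2', or 'tie'."""
    def ask(first, second):
        prompt = (
            f"Question: {question}\n\n"
            f"Assistant 1:\n{first}\n\nAssistant 2:\n{second}\n\n"
            "Which assistant answered better? Reply with '1', '2', or 'tie'."
        )
        return judge(prompt).strip()

    forward = ask(answer_a, answer_b)    # A shown first
    reverse = ask(answer_b, answer_a)    # order swapped to expose position bias
    if forward == "1" and reverse == "2":
        return "A"
    if forward == "2" and reverse == "1":
        return "B"
    return "tie"                         # inconsistent verdicts are treated as a tie
```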
JudgeBench: A Benchmark for Evaluating LLM-based Judges
Comprehensive benchmark to objectively evaluate LLM-based judges, showing even strong models perform only slightly better than random on challenging response pairs.
LLMs-as-Judges: A Comprehensive Survey
Comprehensive survey covering various LLM-based evaluation methodologies and frameworks.
FastChat LLM Judge
Official MT-Bench implementation with 80 multi-turn questions and 30K conversations with human preferences.
LLM-as-a-judge: Complete Guide
Practical guide covering pointwise scoring, pairwise comparison, and validation strategies.
Prompt Optimization
Automatic optimization of prompts using feedback and gradients
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
Omar Khattab et al., Stanford NLP
Stanford NLP framework that abstracts LM pipelines as text transformation graphs, enabling automatic prompt and weight optimization without manual trial-and-error.
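A minimal usage sketch in the spirit of recent DSPy releases; the model string is a placeholder, and the exact API surface may differ across versions.

```python
# Minimal DSPy sketch (API per recent DSPy releases; model string is a placeholder).
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))   # configure any supported LM

# Declare the transformation; DSPy compiles and optimizes the underlying prompt.
qa = dspy.ChainOfThought("question -> answer")
print(qa(question="What does Pareto selection buy a prompt optimizer?").answer)
```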
TextGrad: Automatic Differentiation via Text
Zou Group, Stanford
Published in Nature. Backpropagates textual feedback from LLMs to optimize prompts, lifting GPT-3.5 to near-GPT-4 performance within a few iterations.
Automatic Prompt Optimization with Gradient Descent and Beam Search
APO: A nonparametric algorithm using natural language gradients and beam search to automatically improve prompts.
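A hedged sketch of the APO-style loop: critique failing examples in natural language (the "textual gradient"), apply edits, and keep a beam of the best prompts. The score, critique, and edit callables are assumptions; the paper additionally uses bandit-style selection to cut evaluation cost.

```python
def optimize_prompt(seed, minibatches, score, critique, edit, beam=4, steps=5):
    """score(prompt, batch) -> mean score in [0, 1]; critique -> textual gradient;
    edit(prompt, gradient) -> list of rewritten prompts."""
    beam_set = [seed]
    for _, batch in zip(range(steps), minibatches):
        expanded = []
        for prompt in beam_set:
            errors = [ex for ex in batch if score(prompt, [ex]) < 1.0]
            gradient = critique(prompt, errors)             # natural-language critique of failures
            expanded += [prompt] + edit(prompt, gradient)   # several candidate rewrites per gradient
        beam_set = sorted(expanded, key=lambda p: score(p, batch), reverse=True)[:beam]
    return beam_set[0]
```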
AutoPDL: Automatic Prompt Optimization for LLM Agents
Frames prompt optimization as structured AutoML over combinatorial spaces of prompting patterns using successive halving.
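Successive halving itself is easy to sketch: evaluate all candidate configurations on a small budget, keep the best half, double the budget, and repeat. The evaluate callable and budget schedule below are illustrative assumptions.

```python
def successive_halving(configs, evaluate, initial_budget=8):
    """evaluate(config, n_examples) -> mean score on n_examples held-out tasks."""
    survivors, budget = list(configs), initial_budget
    while len(survivors) > 1:
        scored = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        survivors = scored[: max(1, len(scored) // 2)]   # keep the top half
        budget *= 2                                      # spend more on fewer candidates
    return survivors[0]
```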
DSPy Official Documentation
Official DSPy documentation with tutorials, optimizer overview, and optimization guides.
Awesome LLM Prompt Optimization
Curated list of advanced prompt optimization and tuning methods since 2022.
Automatic Eval Generation
Automatically generating evaluations and benchmarks for LLMs
A Closer Look into Automatic Evaluation Using Large Language Models
Analyzes G-Eval and LLM-based evaluation methods, demonstrating that asking LLMs to explain their ratings improves alignment with humans.
AutoCodeBench: LLMs are Automatic Code Benchmark Generators
Automated workflow using LLM-Sandbox Interaction to generate 3,920 code problems across 20 programming languages without manual annotations.
DeepEval: The LLM Evaluation Framework
Open-source framework supporting end-to-end LLM evaluation with ready-to-use metrics and synthetic dataset generation.
EleutherAI LM Eval Harness
Framework testing generative language models across 60+ standard academic benchmarks.
Code as Evals
Using executable code and programmatic methods for evaluation
CodeJudge: Evaluating Code Generation with LLMs
Framework leveraging LLMs to evaluate semantic correctness of generated code across four programming languages.
A Survey on Evaluating LLMs in Code Generation Tasks
Comprehensive review of methods and metrics for evaluating LLM code generation including correctness, efficiency, and readability.
LiveCodeBench: Holistic and Contamination Free Evaluation
Continuously updated benchmark for code-related capabilities including self-repair and test output prediction.
EvalPlus: Rigorous Evaluation of LLM-synthesized Code
NeurIPS 2023 framework evaluating both correctness and efficiency of LLM-generated code.
ICE-Score: Instructing LLMs to Evaluate Code
EACL 2024 project for LLM-based code evaluation without relying solely on test cases.
Pydantic AI Evals
Code-first evaluation framework where all components are defined in Python.
Rollout Generation & Simulation
Generating agent trajectories efficiently without expensive real-world execution
VCR: Record/Replay for HTTP Interactions
The cassette pattern for recording and replaying tool executions. iofold uses this approach to replay agent trajectories deterministically without re-executing tools.
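A hedged sketch of the cassette pattern applied to tool calls (not VCR's actual API, which targets HTTP): hash the call, record the live result once, and replay it deterministically on later runs.

```python
import hashlib, json, os

class Cassette:
    """Record tool results on first run; replay them byte-for-byte afterwards."""

    def __init__(self, path, mode="replay"):
        self.path, self.mode = path, mode
        if os.path.exists(path):
            with open(path) as f:
                self.tape = json.load(f)
        else:
            self.tape = {}

    def call(self, tool_name, args, live_fn):
        key = hashlib.sha256(json.dumps([tool_name, args], sort_keys=True).encode()).hexdigest()
        if self.mode == "replay" and key in self.tape:
            return self.tape[key]            # deterministic replay: no real tool execution
        result = live_fn(**args)             # record mode (or cache miss): call the real tool
        self.tape[key] = result
        with open(self.path, "w") as f:
            json.dump(self.tape, f)
        return result
```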
WebArena: A Realistic Web Environment for Building Autonomous Agents
Benchmark for web automation agents with realistic browser environments. Informs iofold's simulated browser environment design.
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks
Large-scale benchmark across operating systems. Demonstrates the importance of realistic environment simulation for agent evaluation.
AgentBench: Evaluating LLMs as Agents
Multi-dimensional benchmark for LLM agents across operating systems, games, web browsing, and databases.
User Behavior Simulation
Modeling realistic user interactions for multi-turn agent evaluation
Generative Agents: Interactive Simulacra of Human Behavior
Stanford/Google paper on simulating believable human behavior. Foundational for iofold's user behavior modeling in multi-turn conversations.
UGRO: User-Guided Response Optimization with LLM User Simulators
Using LLMs as annotation-free user simulators to assess dialogue responses and optimize task-oriented dialogue systems.
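A hedged sketch of an LLM user simulator driving a multi-turn evaluation loop; the goal prompt wording, the DONE termination convention, and the agent/user_llm callables are assumptions.

```python
def simulate_dialogue(goal, agent, user_llm, max_turns=8):
    """agent(history) -> reply string; user_llm(prompt) -> simulated user message."""
    history = []
    user_msg = user_llm(f"You are a user with this goal: {goal}. Write your opening message.")
    for _ in range(max_turns):
        history.append(("user", user_msg))
        history.append(("agent", agent(history)))
        user_msg = user_llm(
            f"Goal: {goal}\nConversation so far: {history}\n"
            "Reply as the user. Say DONE if your goal has been satisfied."
        )
        if "DONE" in user_msg:
            break
    return history
```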
GAIA: A Benchmark for General AI Assistants
Benchmark requiring real-world skills like web browsing and multi-step reasoning. Informs iofold's approach to evaluating general-purpose agents.
Contribute Resources
Know of a paper, tool, or resource that should be on this page? We'd love to hear from you.