
Agents don't improve themselves. User feedback gets lost.
You ship an agent, users complain, but their feedback never makes it back into your prompts.
- × Precious user feedback sits in logs, never improving prompts
- × Manual eval scripts rot as agent behavior evolves
- × No systematic way to optimize prompts from production data
Self-improving agents through continuous optimization
iofold folds user feedback back into your agents, automatically generates TypeScript evals, and uses meta-prompting to optimize your prompts on the fly.
- ✓ Continuous improvement — fold feedback back into automatic prompt optimization
- ✓ Code-based evals — TypeScript checks, not slow LLM-as-judge
- ✓ Meta-prompting — optimize instructions on the fly from real usage
- ✓ Gate deployments — visualize results and iterate automatically
How It Works
Three simple steps to self-improving agents
Collect
Fold user feedback back into the loop through the Langfuse integration or SDK hooks. Capture real conversations and their outcomes.
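A minimal sketch of what the hook side could look like, assuming a TypeScript SDK surface (the `FeedbackCollector` class and its method names are hypothetical, not iofold's actual API):

```ts
// Hypothetical hook shape for capturing conversations and feedback;
// class and method names here are assumptions for illustration.
import { randomUUID } from "node:crypto";

interface FeedbackEvent {
  traceId: string;        // ties the rating back to a conversation
  rating: "up" | "down";  // thumbs up/down from the tagging UI
  comment?: string;       // optional free-text from the user
}

class FeedbackCollector {
  private traces = new Map<string, { input: string; output: string }>();
  private feedback: FeedbackEvent[] = [];

  // Call after each agent turn to capture the real conversation.
  // In a real SDK this data would be shipped to a backend.
  recordTrace(input: string, output: string): string {
    const traceId = randomUUID();
    this.traces.set(traceId, { input, output });
    return traceId;
  }

  // Call when the user reacts; this is the feedback that gets folded back.
  recordFeedback(event: FeedbackEvent): void {
    this.feedback.push(event);
  }
}
```

With the Langfuse integration, the same traces and ratings would be pulled from existing Langfuse data instead of being recorded by hand.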
Generate
LLMs generate TypeScript eval code rather than judging responses directly: fast, deterministic checks for hallucinations, tool accuracy, and more.
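As an illustration, a generated eval could be as simple as the functions below; the `AgentTrace` shape is an assumed input format, but each check is plain TypeScript that runs deterministically in milliseconds:

```ts
// Illustrative shape of a generated eval: plain, deterministic
// TypeScript functions, not an LLM judge. Field names are assumptions.
interface AgentTrace {
  toolCalls: { name: string; args: Record<string, unknown> }[];
  output: string;
  retrievedDocs: string[];
}

// Tool-accuracy check: the agent must call the expected tool exactly once.
export function evalToolAccuracy(trace: AgentTrace): boolean {
  const calls = trace.toolCalls.filter((c) => c.name === "search_orders");
  return calls.length === 1;
}

// Simple hallucination check: every quoted claim in the output must
// appear somewhere in the retrieved context.
export function evalGrounded(trace: AgentTrace): boolean {
  const quotes = trace.output.match(/"([^"]+)"/g) ?? [];
  return quotes.every((q) =>
    trace.retrievedDocs.some((doc) => doc.includes(q.slice(1, -1)))
  );
}
```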
Optimize
Use continuous meta-prompting to optimize your agents on the fly. Visualize results, gate deployments, and iterate automatically.
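A rough sketch of such a loop, where `run` executes the agent with a candidate prompt and `propose` is an LLM call that rewrites the prompt from failing cases (both stand-ins for illustration, not iofold's actual API):

```ts
// Hypothetical meta-prompting loop with a deployment gate.
type EvalFn = (output: string) => boolean;
type RunAgent = (prompt: string, input: string) => Promise<string>;
type Propose = (prompt: string, failures: string[]) => Promise<string>;

async function passRate(
  prompt: string, cases: string[], evals: EvalFn[], run: RunAgent
): Promise<{ rate: number; failures: string[] }> {
  const failures: string[] = [];
  for (const input of cases) {
    const output = await run(prompt, input);
    if (!evals.every((e) => e(output))) failures.push(input);
  }
  return { rate: 1 - failures.length / cases.length, failures };
}

async function optimize(
  prompt: string, cases: string[], evals: EvalFn[],
  run: RunAgent, propose: Propose, rounds = 3
): Promise<string> {
  let best = prompt;
  let { rate: bestRate, failures } = await passRate(best, cases, evals, run);
  for (let i = 0; i < rounds && failures.length > 0; i++) {
    const candidate = await propose(best, failures); // meta-prompting step
    const result = await passRate(candidate, cases, evals, run);
    if (result.rate > bestRate) {                    // keep only improvements
      best = candidate;
      bestRate = result.rate;
      failures = result.failures;
    }
  }
  if (bestRate < 0.9) {                              // deployment gate
    throw new Error(`Gated: pass rate ${bestRate.toFixed(2)} below 0.90`);
  }
  return best;
}
```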
Integrations
Built to plug into your stack
LangGraph
Native support for LangChain's agent framework
OpenAI AgentKit
Seamless integration with OpenAI's agent toolkit
Langfuse
Direct integration with observability platform
OpenAI Evals
Compatible with OpenAI's evaluation framework
TruLens
Works alongside TruLens evaluation tools
LangSmith
Integrates with LangChain's monitoring platform
Transparent evaluation for transparent AI
Open source, community-driven, and built for developers
MIT Licensed
Use it freely in your projects, no strings attached
Pluggable Adapters
For different observability tools and frameworks (sketched below)
Easy to Extend
Add custom eval functions to fit your needs (see the sketch below)
Active Community
For metrics, datasets, and eval templates
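To make the extension points concrete, here is a hypothetical sketch of a pluggable adapter interface and a custom eval registration; every name in it (`ObservabilityAdapter`, `registerEval`, and the rest) is illustrative rather than iofold's real API:

```ts
// Hypothetical extension points; names are illustrative only.
interface Trace {
  id: string;
  input: string;
  output: string;
}

// A pluggable adapter wraps one observability backend (Langfuse,
// LangSmith, ...) behind a common surface.
interface ObservabilityAdapter {
  name: string;
  fetchTraces(since: Date): Promise<Trace[]>; // pull production conversations
  pushScores(runId: string, scores: Record<string, number>): Promise<void>; // report eval results
}

// A custom eval is just a deterministic function over a trace.
type EvalFn = (trace: Trace) => boolean;

const registry = new Map<string, EvalFn>();

// Register a custom eval so it runs alongside generated ones.
function registerEval(name: string, fn: EvalFn): void {
  registry.set(name, fn);
}

// Example: flag responses that leak internal tool names.
registerEval("no-internal-tool-names", (trace) =>
  !trace.output.includes("__internal_"));
```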
See it in action
Minimal dark theme, data-focused visuals
Feedback Tagging UI
Simple thumbs up/down interface for real user feedback
Codegen Pane
LLM-generated eval functions from feedback examples
Eval Run Dashboard
Visualize metrics, track performance, and monitor trends
Screenshots and interactive demo coming soon
Get Started
Install iofold in seconds and start building self-improving agents
```bash
pip install iofold
iofold init --with langfuse
```