
AgentHub is an implementation-agnostic platform for evaluating AI agents in realistic, sandboxed simulation environments. We support computer-use, browser, conversational, and tool-use agents. We help teams test, trace, track, and improve agent behavior at scale before deployment, closing the gap between vibe checks and real-world performance.
👉 Check us out at AgentHubLabs.com
💬 If you’re building agents, training models, or can connect us to someone who is, we’d love to chat!
https://www.youtube.com/watch?v=1gPVYcphmmY
LLM-based agents often pass simple evals at small scale and in specific cases, but fail in deployment. They hallucinate tool use, misuse APIs, or follow inefficient (or incorrect) reasoning trajectories that aren't captured in standard benchmarks. There's no standard way to evaluate agents beyond prompt-based tests, especially across tool use, UI interactions, and multi-step reasoning.
AgentHub lets you simulate real-world environments and evaluate agents end-to-end using structured tasks, traces, and grading.
Whether you’re testing autonomous workflows, tool-using copilots, or browser agents, we give you a sandboxed, realistic playground to evaluate in before you ship.
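To make that concrete, here's a minimal sketch of what evaluating an agent against a structured task, a recorded trace, and a grader can look like. The names and schema (EvalTask, Trace, grade_trace, the browser.* tool names) are illustrative placeholders we made up for this example, not AgentHub's actual API.

```python
from dataclasses import dataclass, field

# Hypothetical schema for illustration only -- not AgentHub's actual API.

@dataclass
class Step:
    tool: str    # tool or UI action the agent invoked
    args: dict   # arguments passed to the tool
    output: str  # what the sandboxed environment returned

@dataclass
class Trace:
    task_id: str
    steps: list[Step] = field(default_factory=list)
    final_answer: str = ""

@dataclass
class EvalTask:
    task_id: str
    instruction: str
    allowed_tools: set[str]
    success_check: callable  # returns True if the final state is correct

def grade_trace(task: EvalTask, trace: Trace) -> dict:
    """Grade a single agent run on correctness, tool discipline, and efficiency."""
    disallowed = [s.tool for s in trace.steps if s.tool not in task.allowed_tools]
    return {
        "task_id": task.task_id,
        "success": task.success_check(trace),
        "disallowed_tool_calls": disallowed,
        "num_steps": len(trace.steps),
    }

# Example: grading a browser-agent run from its recorded trace.
task = EvalTask(
    task_id="book-flight-001",
    instruction="Find the cheapest nonstop flight from SFO to JFK next Friday.",
    allowed_tools={"browser.navigate", "browser.click", "browser.read"},
    success_check=lambda t: "nonstop" in t.final_answer.lower(),
)
trace = Trace(
    task_id="book-flight-001",
    steps=[Step("browser.navigate", {"url": "https://example-airline.com"}, "page loaded")],
    final_answer="Cheapest nonstop: $212 on example-airline.com",
)
print(grade_trace(task, trace))
```

Grading the whole trace, rather than just the final answer, is what surfaces the failure modes above: disallowed or hallucinated tool calls and needlessly long trajectories show up even when the final answer happens to be right.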
Youssef was previously a tech lead on the Foundation Model Evaluation team at Apple and studied CS at CMU.
Sandra has been an engineer at multiple startups, Google, and Meta, and studied CS and design at MIT.
We became friends a few years ago during a summer spent in Seattle, and we're obsessed with building the missing critical infrastructure that helps teams evaluate, debug, and trust their agents.