Janus: Evaluate AI Agents with Simulation Environments

Janus

Evaluate AI Agents with Simulation Environments

Spring 2025

Active

AIOps

Developer Tools

Reinforcement Learning

Monitoring

San Francisco

https://www.withjanus.com/

Evaluate AI Agents with Simulation Environments

Janus automates AI evaluations by using high-fidelity simulation environments, catching failures in reasoning, compliance, tool usage, and performance. The resulting datasets benchmark products and feed post-training loops to continuously improve performance over time.

Active Founders

Shivum Pandove

Founder

Prev. ML & CS at Carnegie Mellon. Scaled 3+ startups as a SWE/PM from 0 -> 1, conducted DL research at comp. bio labs, and always building.

Shivum Pandove

Founder

Prev. ML & CS at Carnegie Mellon. Scaled 3+ startups as a SWE/PM from 0 -> 1, conducted DL research at comp. bio labs, and always building.

Company Launches

Janus – Simulation Testing for AI Agents

See original launch post

Hey Everyone! We’re Shivum and Jet, the co-founders of Janus! 👋🏼

TLDR; Janus battle-tests your AI agents to surface hallucinations, rule violations, and tool-call/performance failures. We run thousands of AI simulations against your chat/voice agents and offer custom evals for further model improvement.

Launch Video

💸 Why this matters

A single broken AI conversation can mean:

A PR disaster (Air Canada chatbot inventing refund policies)
Users churning after one bad reply
Lawsuits or regulatory fines for poor compliance

Yet most teams still test agents manually by pasting prompts into playgrounds.

🤕 The Problem

Manual QA covers maybe 100 scenarios, while real users trigger millions. Generic testing platforms don’t understand your customers and can’t simulate nuanced back‑and‑forths at scale. This leaves companies with no actionable insights and blind spots that only appear after you ship.

💡 Our Solution

Janus automatically:

Generates thousands of hyper‑realistic user personas—from angry customers to domain experts—to cover every possible edge case
Runs full multi‑turn conversations (text or voice) against your agent, APIs, and function calls
Allows you to input natural language rules on what to test your agent against and how you’d like it to perform
Detects hallucinations, bias, tool‑call failures, and risky responses using SOTA LLM‑as‑a‑Judge + black-box UQ techniques
Pinpoints root causes and produces actionable recommendations you can plug straight into CI/CD.

All in < 10 min.

uploaded image

📜 Backstory

Shivum and Jet left incoming roles at Anduril and IBM, dropped out of Carnegie Mellon ML, and moved to SF to build Janus full-time. We felt this pain first‑hand while building consumer-facing agents ourselves: every new model or prompt tweak broke something in prod. We built Janus to give ourselves the “crash‑test dummy” we wished existed from day-1.

🚀 Our Ask

Building or piloting an AI agent? Skip manual QA and get started in 15 minutes to see how Janus makes agent eval effortless: cal.com/team/janus/quick-chat.

— Shivum & Jet (Founders of Janus)

Check us out at withjanus.com.

Email us at team@withjanus.com