AI agents that test AI products

MAIHEM creates AI agents that test AI products. We enable companies to automate their AI quality assurance, enhancing AI performance and reliability before and after deployment.

Team Size:2
Location:San Francisco
Group Partner:Jared Friedman

Active Founders

Max Ahrens

Max is the Co-Founder & CEO of MAIHEM. He did his PhD and Postdoc in Natural Language Processing at the University of Oxford. During his Postdoc, he was also the Project Leader of a +$500,000 research grant on harmful narrative detection with large language models, which he was awarded by the Alan Turing Institute and the British Ministry of Defence. Previously, he worked as a consultant with McKinsey, advising globally operating companies on digitization strategies.

Max Ahrens
Max Ahrens

Eduardo Candela

Eduardo is the Co-Founder and CTO of MAIHEM. Eduardo is deeply passionate about building state-of-the-art AI products – he previously worked as a Technical Program Manager at Tesla and a Data Scientist at the Bosch Center for AI. He holds a PhD research in AI Safety for Autonomous Vehicles at Imperial College London, an MSc in Operations Research from MIT, and BSc in Robotics from ITAM.

Eduardo Candela
Eduardo Candela

Company Launches

TL;DR – Automate quality assurance for your LLM application

MAIHEM creates AI agents that continuously test your conversational AI applications, such as chatbots. We enable you to automate your AI quality assurance – enhancing AI performance, reliability, and safety from development all the way to deployment.

Ask – Let us automate your LLM testing so you can focus on building

Want to find out how your LLM application performs before releasing it to real users? Want to avoid hours of manual and incomprehensive LLM testing?

Please book a call with us or email us at contact@maihem.ai.

Problem – Traditional quality assurance doesn’t work for LLMs

LLMs are probabilistic black boxes, as their responses are highly variable and hard to predict. Traditional software produces a few predefined results, whereas LLMs can generate thousands of different responses. This means there are also thousands of ways LLMs can fail.

Two recent and prominent examples of what can go wrong (and viral!!!) with LLM applications:

You don’t want to add your company to this list.

Solution – Our AI agents continuously test your LLM applications


  1. Simulate thousands of users to test your LLM applications before you go live.
  2. Evaluate your LLM applications with custom performance and risk metrics.
  3. Improve and fine-tune your LLM applications with hyper-realistic simulated data.

Team – Two PhDs joining forces: AI Safety 🤝 LLMs

We are @Max Ahrens (PhD in Natural Language Processing, Oxford) and @Eduardo Candela (PhD in AI Safety, Imperial College London). We met in London during our PhD studies and joined forces when we realized that we had a shared vision to make AI more reliable, safer, and perform better. We are transferring our proprietary research from safety for self-driving cars to LLM applications.