We're building the world's most capable AI evaluation models & the tools to unlock the full potential of language models for developers.
Co-founder & CEO of Atla (S23). Startup veteran @ Syrup, Trim, and Merantix. Master's in CS @ University of Pennsylvania. Half an MBA @ Harvard Business School.
Co-founder & CTO of Atla (S23). AI safety researcher @ MATS. MSc in Robotics @ ETH, Stanford, Imperial.
At Atla, we train models to act as reliable evaluators of generative AI applications. Our evaluation models are faster, more stable, and align more closely with human annotators than general-purpose LLMs.
Benchmarks ≠ user preferences. Developers currently spend huge amounts of time iterating across prompts, models & training data. The only ways to evaluate performance are manual & subjective or slow & unstable: you either trust your gut, hire expensive domain experts, or live with the self-bias and instability of GPT-4-based evals.
With Atla, teams can rapidly achieve high performance, discover failure modes, and know the accuracy of their GenAI applications. Atla gives engineers a clear optimization target for their LLM applications. 🎯
Our 7B eval model compares favorably with GPT-4 on our experiments with real customer data and beats GPT-3.5 as an evaluator on the most important benchmarks:
We have trained on five key metrics to cover the most critical use cases of LLMs:
hallucination: Measures the presence of incorrect or unrelated information.
groundedness: Measures if the answer is factually based on the provided context.
context_relevance: Assesses the retrieved context’s relevance to the query.
recall: Measures how completely the response captures the key facts.
precision: Assesses the relevance of all the information provided.
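As a minimal sketch of how these metrics might be requested through a chat-style API, the helper below assembles one evaluation payload per metric for a (query, context, response) triple. The payload shape, prompt wording, and model name (`atla`) are illustrative assumptions, not Atla's documented API.

```python
# Illustrative only: payload shape, prompt text, and "atla" model name are
# assumptions for this sketch, not Atla's documented API.

METRICS = ["hallucination", "groundedness", "context_relevance", "recall", "precision"]

def build_eval_request(metric: str, query: str, context: str, response: str) -> dict:
    """Assemble a chat-completions-style payload asking the evaluator to
    score one metric for a (query, context, response) triple."""
    if metric not in METRICS:
        raise ValueError(f"unknown metric: {metric}")
    return {
        "model": "atla",  # assumed model name
        "messages": [
            {"role": "system",
             "content": f"Score the response on {metric} from 1 to 5."},
            {"role": "user",
             "content": f"Query: {query}\nContext: {context}\nResponse: {response}"},
        ],
    }

# Example: scoring groundedness for a legal-QA style answer.
payload = build_eval_request(
    "groundedness",
    query="What is the notice period?",
    context="The agreement specifies a 30-day notice period.",
    response="The notice period is 30 days.",
)
```

Keeping one payload per metric makes it easy to fan out all five scores for a single response and aggregate them.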
Using OpenAI's SDK, switching the evaluator is a drop-in change: the request that targets GPT-4 becomes a request to Atla.
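For illustration, the swap might look like the sketch below: the request shape stays identical and only the endpoint and model name change. The base URL `https://api.atla.ai/v1` and the model name `atla` are placeholders assumed for this example, not documented values.

```python
# Hedged sketch of a drop-in evaluator swap. "https://api.atla.ai/v1" and
# "atla" are illustrative placeholders, not documented values.

gpt4_eval = {
    "base_url": "https://api.openai.com/v1",
    "model": "gpt-4",
    "messages": [
        {"role": "user", "content": "Grade this answer for hallucination."},
    ],
}

atla_eval = {
    **gpt4_eval,                            # same request shape...
    "base_url": "https://api.atla.ai/v1",   # ...different endpoint (assumed)
    "model": "atla",                        # ...and model name (assumed)
}
```

The point of the pattern is that nothing about the prompt or message structure needs rewriting, only where the request is sent.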
Here’s a sample workbook with human-annotated data from when we built a legal co-pilot. Feel free to try it on your own data!
Have an LLM in production working on generative tasks? Try out our model to evaluate how well your AI application is really working: