
We’re at the very beginning of the agent era. Demand for AI is shifting from models that simply answer questions to agents that can operate autonomously over days and weeks. Models have become remarkably capable at short tasks, but they fail at long-horizon work that requires proficiency with a diverse set of tools.
(source: https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/)
Static datasets are no longer sufficient. To improve the performance of autonomous agents, they must be trained inside environments that reflect the real world.
Today, most RL environments are toy problems: they are narrow in scope, test isolated workflows, deal with static data, and produce rollouts that take on the order of minutes to complete. Innovation in environment technology is needed to train agents that can perform reliably over extended durations without human guidance.
Polymath builds simulated worlds where AI agents learn to operate autonomously over long horizons. Our environments combine running applications, real tools, and multi-step tasks that reflect the complexity of real work. Instead of solving small, isolated problems, agents learn by completing end-to-end work inside living systems that evolve over time. We train agents on tasks that would take humans days of effort to complete. To perform these tasks effectively, agents need to exercise good judgment under ambiguous circumstances and respond to changes in the environment.
We recently launched Horizon-SWE, a benchmark that drops frontier models into a simulated software company.
It consists of a running application, real tools, and long-horizon tasks covering the entire software development lifecycle (planning, coding, testing, deployment, monitoring). The benchmark measures the ability of AI agents to perform end-to-end SWE tasks, as opposed to code generation alone. Leading models score around 25% on the benchmark.
Read more about our methodology here: https://www.polymathlabs.ai/blog/horizon-swe
Polymath is a team of researchers and engineers from UC Berkeley, Hume AI, Plaid, and Amazon. We have years of experience post-training frontier models in industry, and building large-scale data systems.
Dylan previously led post-training research and data at Hume AI. He launched the first generative AI features for model inference at AWS and received the Amazon Inventor Award. Dylan studied Computer Science at UC Berkeley, where he conducted research in reinforcement learning at Berkeley Artificial Intelligence Research (BAIR) and in ML systems at the RISE Lab. He is a recipient of the National Science Foundation (NSF) Fellowship.
Naren led the development of Plaid’s credit monitoring system from the ground up, building infrastructure that now supports millions of business-critical transactions daily. At Amazon, he was a core contributor to large-scale ML systems powering critical logistics workflows. Naren studied Electrical Engineering & Computer Science at UC Berkeley, where he conducted research in ML systems at Berkeley Artificial Intelligence Research (BAIR).
If you work at a frontier lab and are interested in acquiring environments, or know someone who is, we’d love to chat! (founders@polymathlabs.ai)