{"id":98237,"title":"Polymath: Applied Intuition for AI agents","tagline":"Simulated worlds to train \u0026 evaluate long-horizon AI agents","body":"## Problem\n\nWe’re at the very beginning of the agent era. The demand for AI is shifting from models that simply answer questions to agents that can operate autonomously over days and weeks. Models have become incredibly proficient at short tasks, but fail when asked to perform **long-horizon work** that requires proficiency with a **diverse set of tools**.\n\n![uploaded image](/media/?type=post\u0026id=98237\u0026key=user_uploads/2471616/0489f25a-b235-4d6f-ade3-86a0a5b4897c)\n\n(source: \u003chttps://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/\u003e)\n\nStatic datasets are no longer sufficient. To improve, autonomous agents must be trained inside environments that reflect the real world.\n\nToday, most RL environments are toy problems: they are narrow in scope, test isolated workflows, deal with static data, and produce rollouts that take on the order of minutes to complete. Innovation in environment technology is needed to train agents that can perform reliably over extended durations without human guidance.\n\n## Solution\n\nPolymath builds **simulated worlds where AI agents learn to operate autonomously over long horizons**. Our environments combine running applications, real tools, and multi-step tasks that reflect the complexity of real work. Instead of solving small, isolated problems, agents learn by completing end-to-end work inside living systems that evolve over time. We train agents on tasks that could take humans days of effort to complete. 
To perform these tasks effectively, agents need to exercise good judgement under ambiguous circumstances and respond to changes in the environment.\n\n## Horizon-SWE\n\nWe recently launched Horizon-SWE, a benchmark that drops frontier models into a simulated software company.\n\nIt consists of a **running application, real tools, and long-horizon tasks** covering the entire software development lifecycle (planning, coding, testing, deployment, monitoring). The benchmark measures the ability of AI agents to perform **end-to-end SWE tasks**, as opposed to code generation alone. Leading models score around 25% on the benchmark.\n\nRead more about our methodology here: \u003chttps://www.polymathlabs.ai/blog/horizon-swe\u003e\n\n![uploaded image](/media/?type=post\u0026id=98237\u0026key=user_uploads/2471616/a86c29a4-fcff-4ae9-a7fe-53bdf6234300)\n\n## Team\n\nPolymath is a team of researchers and engineers from UC Berkeley, Hume AI, Plaid, and Amazon. We have years of experience post-training frontier models in industry and building large-scale data systems.\n\nDylan previously led post-training research and data at Hume AI. He launched the first generative AI features at AWS model inference and received the Amazon Inventor Award. Dylan studied Computer Science at UC Berkeley and conducted research in reinforcement learning at Berkeley Artificial Intelligence Research (BAIR) and in ML systems at the RISE Lab. He is a recipient of the National Science Foundation (NSF) Fellowship.\n\nNaren led the development of Plaid’s credit monitoring system from the ground up, building infrastructure that now supports millions of business-critical transactions daily. At Amazon, he was a core contributor to large-scale ML systems powering critical logistics workflows. 
Naren studied Electrical Engineering \u0026 Computer Science at UC Berkeley, and conducted research in ML systems at Berkeley Artificial Intelligence Research (BAIR).\n\nIf you work at a frontier lab and are interested in acquiring environments, or know someone who is, we’d love to chat! ([founders@polymathlabs.ai](mailto:founders@polymathlabs.ai))","slug":"PYT-polymath-applied-intuition-for-ai-agents","created_at":"2026-02-26T21:15:33.940Z","updated_at":"2026-04-22T20:50:07.839Z","total_vote_count":42,"url":"https://www.ycombinator.com/launches/PYT-polymath-applied-intuition-for-ai-agents","share_image_url":"//bookface-static.ycombinator.com/assets/ycdc/yc-og-image-c440a0ad1dacfb86eeeb343717479cc54d256614449b4ef719977a0a451f8bc8.png","company":{"id":31217,"name":"Polymath","slug":"polymath","url":"https://polymathlabs.ai/","logo":"https://bookface-images.s3.amazonaws.com/small_logos/897b59e5d2b5dee40c471f153a2aee81587f2ad8.png","batch":"Winter 2026","industry":"B2B","tags":[],"search_path":"https://bookface.ycombinator.com/company/31217"}}