PerfectBit

Training data for frontier AI labs

We create a new kind of data for training AI models. Most LLMs are pre-trained on noisy web-scraped text, so they hallucinate and still fail on tasks that humans find trivial. PerfectBit creates high-quality training data that's correct by construction. We verify against physics simulators, scientific databases, and formal proof systems. Our data targets LLMs, robotics, AI for Science, and more.
Active Founders
Peter Vajda
Founder
Before 2026, I spent 11 years at Meta, most recently as Director of Media Generation. I managed Media GenAI foundation model research and development, including efficient media generation, text-to-image generation (Emu), image editing, Movie Gen, text-to-video, video editing, and character-consistent image and video generation. Previously, I led efficient deep learning for computer vision teams supporting on-device models for AR/VR. I was an Assistant Professor at Stanford University.
Seiji Yamamoto
Founder
Led teams in the Core Llama group at Meta Superintelligence Labs. Senior Staff Research Scientist across 9 years at Meta spanning LLM pre-training and post-training, inference optimization, full-duplex speech models, and computer vision models. Before tech: PhD in Physics, published in Proceedings of the National Academy of Sciences and Physical Review Letters, co-authored with a Fields Medalist. Educated at Stanford, Rice, Columbia, with a postdoc at a National Lab.
Company Launches
PerfectBit: AI Training data, correct by construction

TL;DR

PerfectBit creates high-quality training data that's correct by construction. We verify against physics simulators, scientific databases, and formal proof systems. It's what models need to cure hallucination, close common-sense gaps, and reach superintelligence. Reinforcement Learning with Verifiable Rewards (RLVR) was an early step in this direction. We go further, with data for LLMs, robotics, AI for Science, and more.

https://youtu.be/RrFKfbO6XlY

The problem: the failures standing between today's models and superintelligence

Frontier labs are bottlenecked on four data problems:

  • Hallucination. Models trained on noisy web scrapes produce confident, fluent statements that are wrong.
  • Common-sense failure. Models that solve graduate-level math still trip on problems a five-year-old finds trivial.
  • Information scarcity. Scientific facts, formal derivations, peer-reviewed findings — they're a thin minority of tokens in any open-web scrape, drowned out by orders of magnitude of recipes, marketing copy, forum chatter, and AI-generated slop. Scaling the scrape doesn't change the ratio.
  • Human annotation doesn't scale. It's slow, expensive, and capped at human ability — superintelligence won't come from labeling at human throughput.

What we do

  • PerfectBit creates high-quality training data that's correct by construction. Every sample is generated against an oracle — physics simulator, scientific database, formal proof system — and verified before it ships, in text, code, image, audio, or video. For LLMs, we project observations about the world into natural language; for verticals like robotics and AI-for-science, we ship purpose-built datasets.
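
As an illustration only, the generate-against-an-oracle, verify-before-shipping loop described above can be sketched with a toy oracle. This is not PerfectBit's pipeline, which the post does not describe. The closed-form range of a drag-free projectile stands in for a physics simulator, and every name below (`Sample`, `oracle_range`, `make_sample`, `verify`, `build_dataset`) is hypothetical.

```python
import math
import random
from dataclasses import dataclass

G = 9.81  # assumed gravitational acceleration, m/s^2


@dataclass(frozen=True)
class Sample:
    speed: float      # launch speed, m/s
    angle_deg: float  # launch angle above the horizontal, degrees
    range_m: float    # claimed horizontal range, m
    text: str         # natural-language rendering of the observation


def oracle_range(speed: float, angle_deg: float) -> float:
    """Toy oracle: closed-form range of a drag-free projectile on flat ground."""
    return speed ** 2 * math.sin(2 * math.radians(angle_deg)) / G


def make_sample(rng: random.Random) -> Sample:
    """Draw random launch conditions and label them with the oracle's answer."""
    speed = round(rng.uniform(5.0, 30.0), 1)
    angle = round(rng.uniform(10.0, 80.0), 1)
    range_m = oracle_range(speed, angle)
    text = (
        f"A ball launched at {speed} m/s and {angle} degrees "
        f"lands {range_m:.2f} m away (no drag)."
    )
    return Sample(speed, angle, range_m, text)


def verify(sample: Sample, tol: float = 1e-6) -> bool:
    """Recompute the answer from the oracle and check both the number and the text."""
    expected = oracle_range(sample.speed, sample.angle_deg)
    number_ok = math.isclose(sample.range_m, expected, rel_tol=0.0, abs_tol=tol)
    text_ok = f"{expected:.2f}" in sample.text
    return number_ok and text_ok


def build_dataset(n: int, seed: int = 0) -> list[Sample]:
    """Generate n candidates and keep only those that pass verification."""
    rng = random.Random(seed)
    candidates = [make_sample(rng) for _ in range(n)]
    return [s for s in candidates if verify(s)]


if __name__ == "__main__":
    for s in build_dataset(3):
        print(s.text)
    # A hand-corrupted sample is rejected by the verifier.
    bad = Sample(10.0, 45.0, 99.0, "A ball lands 99.00 m away.")
    print(verify(bad))
```

In this sketch the verifier reuses the same oracle that produced the label, so it mainly catches corruption between generation and shipping. A stronger design would check against an independent source, such as a numerical integrator or a second simulator.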

Why us

We've trained the models we're now building data for.

  • Peter Vajda — Former Director of Media Generation at Meta. Eleven years at Meta. Most recently led Media GenAI foundation-model R&D: Emu (text-to-image), image editing, Movie Gen, text-to-video, video editing, character-consistent generation. Earlier led efficient deep learning for computer vision powering on-device Augmented Reality and Virtual Reality (AR/VR) models. Assistant Visiting Professor at Stanford before Meta. PhD in Computer Science.
  • Seiji Yamamoto — Led teams in the Core Llama group at Meta Superintelligence Labs. Nine years at Meta as a Senior Staff Research Scientist across LLM pre- and post-training, inference optimization, full-duplex speech, and computer vision. PhD in Physics, publications in Proceedings of the National Academy of Sciences and Physical Review Letters (co-authored with a Fields Medalist). Stanford, Rice, Columbia; postdoc at a National Lab.

Between us: foundation-model training across text, image, video, and speech, shipped to billions of users — plus scientific chops to design the verifier stacks.

We’d love to talk more

We're talking to a small number of frontier AI labs about pilot engagements. We'd especially like to hear from:

  • Heads of data / data partnerships at frontier labs
  • Research leads running pre-training, mid-training, post-training, or reasoning programs and shopping for a corpus that doesn't exist yet
  • Model-training teams preparing a next-gen training run who want a correct-by-construction supplement in the mix
  • Researchers in formal methods, scientific computing, or domain-specific simulators who want to contribute a verifier stack

We're also hiring research scientists and engineers in San Francisco (in-person). If you've shipped foundation-model training, built formal or simulation tooling, or spent time developing AI training data at production scale, please reach out.

Contact

PerfectBit
Founded: 2026
Batch: Spring 2026
Team Size: 2
Status: Active
Location: San Francisco
Primary Partner: Garry Tan