Datacurve: Frontier coding data for training and evaluating LLMs

We generate expert-quality coding data at scale for fine-tuning LLMs.

Active Founders

Serena Ge

Founder

Started building software in high school, including a climbing training app built with Team Canada athletes. Studied CS at Waterloo for a year, then dropped out. Worked with the Cohere CTO on LLM reasoning and coding capabilities through synthetic data. Went through YC W24 and pivoted 3 times before landing on Datacurve. Now scaling high-quality coding data production pipelines at Datacurve to enable next-generation coding models.

Charley Lee

Founder

Hacking on things since middle school. Went to Waterloo CS, interned at Google, then dove into AI research on multi-modal RL and training browser-use agents. Went through YC W24, pivoted a few times, and landed on Datacurve – now providing the data infrastructure for frontier LLMs.

The Problem – Why getting high-quality code data is so hard

From our experience training models, we believe the biggest bottleneck to advancing vertical LLM capabilities is the lack of curated, high-quality training data.

Acquiring this high-quality data is difficult because:

Consistent, high-quality code data cannot be synthetically generated or scraped. Tasks are often too challenging or specific for even the most capable models, and even a few incorrect samples can noticeably worsen the final training results.
Hiring human annotators is tricky. Manual data labeling en masse tends towards low-skill gig work; it’s difficult to hire and retain highly competent engineers as annotators.

The Solution

We solve the data problem with our gamified annotation platform that attracts the best engineers to come and solve fun coding problems. We have already brought on top competitive programmers, as well as highly competent engineers who have worked at companies like Amazon and AMD.

In general, we get great engineers who 1) already have good careers, and 2) already enjoy doing programming challenges outside of work. Our gamified platform pays them for solving problems, which they already do for fun.

For AI dev-tool startups, we provide data to train use-case-specific models:

UI design to React component generation
Framework-specific optimized code generation
Repository-wide automatic PRs from GitHub issues
IDE-integrated coding copilots: data for code completion and debugging

For foundation model labs, our platform creates data for:

Refactoring code for readability
Improving code for performance
Code generation for difficult problems or new features
Debugging runtime errors
Code walkthrough and explanation

Our Team’s story 🧗‍♀️ → 👨‍💻 → 👴 →📊

In high school, I built a climbing training app with World Cup climbers, used by 3,700 athletes from over 17 countries. Training with my own app, I qualified for Nationals twice!

That experience of building was so rewarding that I decided to study CS at UWaterloo, where I kept bumping into Charley Lee in AI reading groups and advanced CS classes. So we started building together:

While Charley interned at Google, we made a planning tool called UncleGPT that reached 930 users who created 1.3k projects and sent 5.5k messages on our platform.

I interned at Cohere, where I trained models to improve reasoning capabilities through solving algorithms and gameplay.

We’re 19 now and building Datacurve.

Our Ask

If you need custom data for your application (e.g., code editing, design to code, etc.), we should talk!
If you can introduce us to more foundation model labs, that would be amazing. Feel free to email me at serena@datacurve.ai!

Thanks for supporting us!