Humanloop CEO Raza Habib shares 5 common mistakes teams make when building with LLMs

by Greg Kumparak · 6/4/2024

When Humanloop started out in 2020, they were working on a better way to train the state-of-the-art language models of that time. These models needed lots of manually annotated data to work best; Humanloop’s first product made it easier for anyone to do this annotation work, while drastically reducing the overall amount of manual work required.

But they sensed a shifting tide.

“About two years ago we were watching what was happening with large language models,” says Humanloop co-founder Raza Habib, “and we realized that the biggest risk to us as a business was that these large language models would get really good — the paradigm for how people build AI would change substantially, and you wouldn’t need this annotation any more.”

In what would prove to be a prescient move, they started exploring a pivot just months before ChatGPT would debut. Instead of helping people annotate their training data, Humanloop would give teams the tools to evaluate how well their LLM-based AI applications were working, and help team members — technical or not — collaborate on building them.

“Pivoting your company… it’s a scary thing to do,” says Raza. “So we gave ourselves two weeks; we’d make some mockups and go out to the people we know are building with these [large language models] and see if anyone would pay for this. If we could get ten paying customers in two weeks, that’d be a strong enough signal that it’s worth pivoting the company.”

“In the end, it took us two days,” he says. Today Humanloop counts companies like Gusto, Vanta, and Duolingo as customers — effectively serving as their collaborative LLM playground to find the best prompts, evaluate different models, and track changes over time.

This week Humanloop is launching a podcast series called High Agency (on Apple Podcasts, Spotify, and YouTube) where Raza will talk to others building at the forefront of AI to compare notes on what works and what doesn’t in this still-early field. The first episodes will feature interviews with the CTOs at Ironclad, Zapier, Sourcegraph, and Hex, all companies that have built great things with LLMs in production. But as Raza puts it, "nobody is an expert yet, and everyone is learning by doing."

With that in mind, I asked Raza to break down some of the most common mistakes he sees teams making when building on top of LLMs. Here’s what he told me:

Not having consistent, systematic evaluation in place:

Figure out what “good” looks like for your AI product’s output, then figure out how to measure against that as you build.

“If teams don’t have a good way of measuring what ‘good’ looks like,” says Raza, “they’ll spin their wheels for a long time changing things and not really knowing if they’re making any progress.”

“Everyone wants things that are fast; everyone wants things that are cheap and accurate. But you’re going to have [criteria] that are really use-case specific to you.”

If you’re building an AI chatbot for helping someone practice a new language, maybe that means checking the output to ensure it only uses words appropriate for the user’s skill level. If you’re building an AI coach, perhaps that means double-checking that each of your user’s stated goals gets mentioned and addressed.

But there’s more to it than just running the prompt a few times and making sure it all looks reasonable; the systems have to be in place to check the output regularly, as prompts change and the underlying models evolve.

“For [traditional software development], you write a piece of code, and every time you run it, it does the same thing. Same inputs, same outputs. But with an LLM? Same input, multiple outputs — every time you run it, you’ll get something slightly different.”

“One of the biggest mistakes people make is just eyeballing [one-off] examples,” he notes. “It doesn’t give them a rigorous enough sense of whether or not they’re making things better.”
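
To make that concrete, here is a rough sketch in Python of what a repeatable check might look like for the language-practice example above. It is only an illustration, not Humanloop’s product: generate_reply and the beginner vocabulary list are stand-ins for your own model call and your own use-case-specific criteria, and each test case is run several times precisely because the same input can produce different outputs.

    # Minimal evaluation sketch. Assumptions: generate_reply() wraps your real
    # model call, and BEGINNER_VOCAB encodes the use-case-specific criterion
    # ("only use words appropriate for the user's skill level").
    from statistics import mean

    BEGINNER_VOCAB = {"hola", "gracias", "adiós", "por", "favor", "sí", "no"}

    TEST_CASES = [
        "Greet the learner and ask how their day is going.",
        "Ask the learner to order a coffee politely.",
    ]

    def generate_reply(prompt: str) -> str:
        """Stand-in for your actual LLM call; replace with a real SDK request."""
        return "hola gracias por favor"

    def vocab_in_level(reply: str) -> float:
        """Fraction of words in the reply that fall inside the allowed vocabulary."""
        words = [w.strip(".,!?¿¡").lower() for w in reply.split()]
        return sum(w in BEGINNER_VOCAB for w in words) / len(words) if words else 0.0

    def run_eval(n_samples: int = 5) -> float:
        """Score every test case several times, since outputs vary run to run."""
        scores = [
            vocab_in_level(generate_reply(case))
            for case in TEST_CASES
            for _ in range(n_samples)
        ]
        return mean(scores)

    print(f"Average in-vocabulary score: {run_eval():.2%}")

The point is less the scoring function than the habit: the same checks run automatically every time a prompt or model changes, instead of someone eyeballing a one-off example.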

Not paying attention to (sometimes silent) user feedback:

“What ‘good’ looks like is very subjective!” notes Raza. “What is a good summary for this call? What is a good sales email? There isn’t just one single correct answer.”

“What your customer says is good is the ultimate answer,” says Raza. But they don’t always say those things out loud.

“You want to be capturing different sources of end-user feedback,” he notes. “That can be explicit things, like votes — those little thumbs up/thumbs down buttons. But it’s also implicit things that users do within your application that correlate well with whether or not it’s working. If you helped them generate that sales email, did they actually send it?”

“Plan ahead, when you’re designing an application, to capture the user signals that tell you if it’s working; you want to design that in from the beginning.”
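
As a rough illustration (the event names and log file here are made up, not Humanloop’s API), capturing both kinds of signal can be as simple as recording every explicit vote and implicit action against the generation that produced it:

    # Hedged sketch of feedback capture for the sales-email example.
    # Hypothetical event names and storage; the point is that explicit and
    # implicit signals are both logged against the generation they refer to.
    import json
    import time
    from pathlib import Path

    FEEDBACK_LOG = Path("feedback_events.jsonl")

    def log_feedback(generation_id: str, kind: str, value) -> None:
        """Append one feedback event, keyed to the generation it refers to."""
        event = {
            "generation_id": generation_id,
            "kind": kind,        # "vote" (explicit) or "email_sent" (implicit)
            "value": value,
            "timestamp": time.time(),
        }
        with FEEDBACK_LOG.open("a") as f:
            f.write(json.dumps(event) + "\n")

    # Explicit signal: the user clicked thumbs up on the draft.
    log_feedback("gen_123", kind="vote", value="up")

    # Implicit signal: the user actually sent the email the model drafted.
    log_feedback("gen_123", kind="email_sent", value=True)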

Not closely tracking prompt history:

“Another error is not treating prompt management with the same rigor you treat code management,” says Raza.

The prompts you’re using will change over time; it’s key to track those changes and know why they were made.

“People start off doing this and they use [shared docs], they’re copying & pasting things in Slack, and they’re losing the history of their experimentation. New people join the team and it’s hard to know what was tried before. Something will be in production for months and you’ll make a change; is this better or worse than what we had before? They don’t know!”
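
One lightweight way to get that rigor, sketched below with a made-up schema (a purpose-built tool, or even plain git, serves the same purpose), is to record every prompt revision along with who changed it, why, and which model it targets, so the history of experimentation isn’t lost in chat threads:

    # Illustrative prompt version tracking; the schema is an assumption,
    # not a prescribed format.
    import hashlib
    import json
    import time
    from pathlib import Path

    HISTORY = Path("prompt_history.jsonl")

    def save_prompt_version(name: str, template: str, model: str, author: str, reason: str) -> str:
        """Record a prompt revision with enough context to answer 'why did this change?'."""
        version_id = hashlib.sha256(template.encode()).hexdigest()[:12]
        record = {
            "name": name,
            "version_id": version_id,
            "template": template,
            "model": model,
            "author": author,
            "reason": reason,      # why the change was made
            "timestamp": time.time(),
        }
        with HISTORY.open("a") as f:
            f.write(json.dumps(record) + "\n")
        return version_id

    save_prompt_version(
        name="sales-email-draft",
        template="Write a friendly follow-up email to {contact_name} about {topic}.",
        model="gpt-4o",
        author="raza",
        reason="Shorter opening; the previous version rambled in the first paragraph.",
    )

With that history in place, “is this better or worse than what we had before?” becomes a question you can actually answer: diff the templates and rerun your evaluations against both versions.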

Not fine-tuning the model:

For most purposes, and while you’re proving your idea works, you can probably get pretty far with the popular base models. But eventually, Raza suggests, you’ll want to fine-tune them for your needs. Good fine-tuning will give you better results, lower latency, and reduced costs in the long run.

“We recommend to everyone that prompt engineering is where they should start, because it’s the easiest, fastest, and most powerful thing,” says Raza, “but you can get order-of-magnitude cost savings if you fine-tune your models later.”

“The best way to think about fine-tuning is as an optimization,” he notes. “You want to avoid optimizing prematurely, but once you’ve validated that there’s demand for your product, it should become a focus.”
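
When that moment comes, the feedback you’ve been capturing becomes training data. Below is a hedged sketch of turning positively rated production generations into a fine-tuning dataset; the chat-message JSONL layout matches what OpenAI’s fine-tuning endpoint expects, while the shape of logged_examples is just an assumption about how you store your own logs:

    # Sketch: filter logged generations down to the ones users approved of and
    # actually used, then write them out as fine-tuning examples.
    import json

    logged_examples = [
        {
            "prompt": "Write a friendly follow-up email to Dana about pricing.",
            "completion": "Hi Dana, thanks for your time yesterday...",
            "user_vote": "up",     # explicit feedback captured earlier
            "email_sent": True,    # implicit feedback captured earlier
        },
        # ... more logged generations ...
    ]

    with open("finetune_data.jsonl", "w") as f:
        for ex in logged_examples:
            if ex["user_vote"] == "up" and ex["email_sent"]:
                record = {
                    "messages": [
                        {"role": "system", "content": "You draft concise, friendly sales emails."},
                        {"role": "user", "content": ex["prompt"]},
                        {"role": "assistant", "content": ex["completion"]},
                    ]
                }
                f.write(json.dumps(record) + "\n")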

Not having domain experts write the prompts:

If you’re building LLM products for a specific vertical or industry, bring in people who really know the topic to help write the prompts and evaluate the output — don’t rely on engineers to do it alone. Large language models are, clearly, all about language. Language is nuanced, and the lexicons of different industries are deep.

“This is work that’s best done by domain experts,” says Raza. “It’s one of those things that’s obvious in retrospect, but wasn’t obvious at the start.”


If you’re building with LLMs, be sure to check out Humanloop here, and find Raza’s new podcast for AI builders, High Agency, on YouTube here.

Author

  • Greg Kumparak

    Greg oversees editorial content at Y Combinator. He was previously an editor at TechCrunch for nearly 15 years.