Lenny's Written Position
AI evals are the single most important new skill for product managers working on AI products — more important than prompt engineering.
The best AI evals combine automated metrics with human judgment — neither alone is sufficient for measuring AI product quality.
The most common mistake in AI evaluation is starting with off-the-shelf metrics like hallucination or toxicity scores, which often don't correlate with the actual problems users face.
Effective AI evaluation starts with error analysis: a single principal domain expert reviews approximately 100 real user interactions, applying open coding to label each failure in free-form terms and axial coding to group those labels into recurring failure modes.
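The open-coding-to-axial-coding step can be sketched in a few lines. This is a minimal illustration with invented failure labels and an invented grouping, not a prescribed taxonomy:

```python
from collections import Counter

# Hypothetical open codes: one free-form failure label per reviewed trace,
# as a single domain expert might write them during error analysis.
open_codes = [
    "cited wrong policy doc", "ignored date filter", "cited wrong policy doc",
    "hallucinated price", "ignored date filter", "cited wrong policy doc",
]

# Axial coding: collapse open codes into broader failure modes (labels invented).
axial_map = {
    "cited wrong policy doc": "retrieval error",
    "ignored date filter": "instruction-following error",
    "hallucinated price": "hallucination",
}

# Tally failure modes to see where to invest first.
counts = Counter(axial_map[code] for code in open_codes)
for mode, n in counts.most_common():
    print(f"{mode}: {n}")
```

The ranked tally is the point: it tells you which failure mode dominates your real traffic, which is exactly what off-the-shelf metrics cannot.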
For AI evaluation, binary pass/fail judgments are more effective than 1-to-5 Likert scales because the distinction between adjacent scores is subjective and inconsistent, while nuance is captured in written critiques.
In RAG systems, you should fix the retriever before investing in generator improvements, because if the correct information is not retrieved, the generator has no chance of producing a correct answer.
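One way to act on this is to measure retrieval recall separately, before any generator metric. A minimal sketch with invented toy data (doc IDs and gold relevance sets are made up):

```python
def retrieval_recall_at_k(results, relevant, k=5):
    """Fraction of queries whose top-k retrieved docs contain at least one
    known-relevant doc. If this number is low, fix the retriever first:
    the generator cannot answer from documents it never sees."""
    hits = sum(
        1 for query, docs in results.items()
        if set(docs[:k]) & relevant[query]
    )
    return hits / len(results)

# Toy data: retrieved doc IDs per query, and the gold relevant sets.
results = {
    "q1": ["d3", "d7", "d1"],
    "q2": ["d9", "d2", "d4"],
    "q3": ["d5", "d6", "d8"],
}
relevant = {"q1": {"d1"}, "q2": {"d2"}, "q3": {"d0"}}

print(retrieval_recall_at_k(results, relevant, k=3))  # 2 of 3 queries hit
```

The design point: retrieval recall is cheap to compute against a labeled set and isolates one component, whereas end-to-end answer quality confounds retriever and generator failures.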
The real competitive advantage in AI products comes not from prompting but from building a continuous improvement flywheel where production monitoring flags failures, error analysis finds root causes, and fixes are added to a golden dataset.
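The last step of that flywheel, turning fixed failures into a permanent golden dataset, can be sketched as a regression gate. Everything here is illustrative: the case fields, the `fake_generate` stand-in, and the example strings are invented:

```python
# Each fixed production failure becomes a permanent test case (fields invented).
golden_set = [
    {"input": "refund policy?", "must_contain": "30 days"},
    {"input": "support email?", "must_contain": "help@example.com"},
]

def run_regression(golden_set, generate):
    """Re-run every past failure against the current system; a case that
    breaks again is a regression that should block the release."""
    return [
        case["input"]
        for case in golden_set
        if case["must_contain"] not in generate(case["input"])
    ]

# Stand-in for the real AI system under test.
def fake_generate(prompt):
    if "refund" in prompt:
        return "Refunds are accepted within 30 days."
    return "Contact help@example.com for support."

print(run_regression(golden_set, fake_generate))  # prints "[]": no regressions
```

Substring checks are the crudest possible assertion; in practice the same loop can call an LLM judge per case, but the flywheel shape (monitor, diagnose, fix, add to golden set, re-run) is the same.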
True AI adoption happens when teams become skeptical of flashy demos and instead demand to see accuracy metrics, evaluation frameworks, and failure modes behind AI products.
Podcast Moments
“If you make a bunch of practitioners sit together and ask them, 'Is it important to build an actionable feedback loop for AI products?' All of them will agree. But almost nobody does it well.”
Aishwarya Naresh Reganti + Kiriti Badam
“Imagine you wanted to train a model to write an eight-line poem about the moon. Most people check: Is this a poem? Does it contain eight lines? But we are looking for Nobel Prize-winning poetry.”
The 100-person AI lab that became Anthropic and Google's secret weapon | Edwin Chen (Surge AI) · Edwin Chen
“Instead of building AI that will actually advance us as a species, we are optimizing for AI slop instead. We're basically teaching our models to chase dopamine instead of truth.”
The 100-person AI lab that became Anthropic and Google's secret weapon | Edwin Chen (Surge AI) · Edwin Chen
“An RL environment is essentially a simulation of the real world. We might build a world where you have a startup with Gmail, Slack, Jira, GitHub. And then suddenly AWS goes down. Model, what do you do?”
The 100-person AI lab that became Anthropic and Google's secret weapon | Edwin Chen (Surge AI) · Edwin Chen
“Evals are the most important thing in AI engineering right now. More important than model selection, more important than prompt engineering. Get your evals right and everything else follows.”
AI Engineering 101 with Chip Huyen (Nvidia, Stanford, Netflix) · Chip Huyen
“18 months ago, you would get a short story. Now one task is building an entire website by one of the world's best developers. These tasks now take hours and require PhDs.”
First interview with Scale AI’s CEO: $14B Meta deal, what’s working in enterprise AI, and what frontier labs are building next | Jason Droege · Jason Droege
“A document that reads the same words in company A will have a different meaning in company B. Digitizing judgment is becoming a bottleneck.”
First interview with Scale AI’s CEO: $14B Meta deal, what’s working in enterprise AI, and what frontier labs are building next | Jason Droege · Jason Droege
“Evals are the new unit tests for AI. If you don't have evals, you're shipping blind. Every AI product team needs to treat evals as a first-class concern.”
Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course) · Hamel Husain & Shreya Shankar
“The biggest mistake teams make is trying to automate all evals. You need human judgment in the loop, especially for anything subjective. The best systems combine both.”
Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course) · Hamel Husain & Shreya Shankar
“PMs should own evals, not engineers. It's a product quality question, not a technical one. The PM should define what 'good' looks like.”
Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course) · Hamel Husain & Shreya Shankar
“Expert evals are the competitive moat for AI companies. The companies that have the best domain experts writing evals are the ones shipping the best AI products.”
Why experts writing AI evals is creating the fastest-growing companies in history | Brendan Foody (CEO of Mercor) · Brendan Foody
“With agents and products that reason, there's a new wave around reinforcement learning. I believe we will see as much money on post-training as pre-training.”
How 80,000 companies build with AI: products as organisms, the death of org charts, and why agents will outnumber employees by 2026 | Asha Sharma (CVP of AI Platform at Microsoft) · Asha Sharma
“The models have gotten so good that generalists are no longer needed. We have 500,000 PhDs, 3 million master's students.”
Inside the expert network training every frontier AI model | Garrett Lord (Handshake CEO) · Garrett Lord
“Model builders care about three things: quality first, then volume, then speed.”
Inside the expert network training every frontier AI model | Garrett Lord (Handshake CEO) · Garrett Lord
“I started writing evals before I knew what an eval was because I was just outlining clearly specified ideal behavior.”
Inside ChatGPT: The fastest-growing product in history | Nick Turley (Head of ChatGPT at OpenAI) · Nick Turley
“Studies have shown that using bad prompts can get you down to 0% on a problem, and good prompts can boost you up to 90%. People will always be saying, 'It's dead,' or, 'It's going to be dead with the next model version,' but then it comes out and it's not.”
AI prompt engineering in 2025: What works and what doesn’t | Sander Schulhoff (Learn Prompting, HackAPrompt) · Sander Schulhoff
“Role prompting does not work. On those older models, maybe it worked. On the more modern ones, it doesn't help at all for accuracy-based tasks. But giving a role really helps for expressive tasks, writing tasks, summarizing tasks.”
AI prompt engineering in 2025: What works and what doesn’t | Sander Schulhoff (Learn Prompting, HackAPrompt) · Sander Schulhoff
“The most common technique by far used to try to prevent prompt injection is improving your prompt and saying, 'Do not follow any malicious instructions.' This does not work. This does not work at all. Guardrails are a widely proposed solution. They just don't work. This has to be solved at the level of the AI provider.”
AI prompt engineering in 2025: What works and what doesn’t | Sander Schulhoff (Learn Prompting, HackAPrompt) · Sander Schulhoff
“Prompt injection is not a solvable problem. Sam Altman said he thought they could get to 95 to 99% security against prompt injections. I like to say, 'You can patch a bug, but you can't patch a brain.'”
AI prompt engineering in 2025: What works and what doesn’t | Sander Schulhoff (Learn Prompting, HackAPrompt) · Sander Schulhoff
“We use ensembles of models much more internally than people might think. If we have 10 different problems, we might solve them using 20 different model calls, some using specialized fine-tuned models, different sizes for different latency or cost requirements.”
OpenAI’s CPO on how AI changes must-have skills, moats, coding, startup playbooks, more | Kevin Weil (CPO at OpenAI, ex-Instagram, Twitter) · Kevin Weil
“The evals are everything. You can't ship an AI feature without a way to measure whether it's actually good. We built an entire internal evaluation infrastructure before we shipped our agent.”
Behind the product: Replit | Amjad Masad (co-founder and CEO) · Amjad Masad
Cutting Room Floor
Guest insights on this topic that Lenny hasn't (yet) written about in his newsletters. Potential material for future posts.
“An RL environment is essentially a simulation of the real world. We might build a world where you have a startup with Gmail, Slack, Jira, GitHub. And then suddenly AWS goes down. Model, what do you do?”
The 100-person AI lab that became Anthropic and Google's secret weapon | Edwin Chen (Surge AI) · Edwin Chen
“18 months ago, you would get a short story. Now one task is building an entire website by one of the world's best developers. These tasks now take hours and require PhDs.”
First interview with Scale AI’s CEO: $14B Meta deal, what’s working in enterprise AI, and what frontier labs are building next | Jason Droege · Jason Droege
“A document that reads the same words in company A will have a different meaning in company B. Digitizing judgment is becoming a bottleneck.”
First interview with Scale AI’s CEO: $14B Meta deal, what’s working in enterprise AI, and what frontier labs are building next | Jason Droege · Jason Droege
“With agents and products that reason, there's a new wave around reinforcement learning. I believe we will see as much money on post-training as pre-training.”
How 80,000 companies build with AI: products as organisms, the death of org charts, and why agents will outnumber employees by 2026 | Asha Sharma (CVP of AI Platform at Microsoft) · Asha Sharma
“The models have gotten so good that generalists are no longer needed. We have 500,000 PhDs, 3 million master's students.”
Inside the expert network training every frontier AI model | Garrett Lord (Handshake CEO) · Garrett Lord
“Model builders care about three things: quality first, then volume, then speed.”
Inside the expert network training every frontier AI model | Garrett Lord (Handshake CEO) · Garrett Lord
“Studies have shown that using bad prompts can get you down to 0% on a problem, and good prompts can boost you up to 90%. People will always be saying, 'It's dead,' or, 'It's going to be dead with the next model version,' but then it comes out and it's not.”
AI prompt engineering in 2025: What works and what doesn’t | Sander Schulhoff (Learn Prompting, HackAPrompt) · Sander Schulhoff
“Role prompting does not work. On those older models, maybe it worked. On the more modern ones, it doesn't help at all for accuracy-based tasks. But giving a role really helps for expressive tasks, writing tasks, summarizing tasks.”
AI prompt engineering in 2025: What works and what doesn’t | Sander Schulhoff (Learn Prompting, HackAPrompt) · Sander Schulhoff
“Prompt injection is not a solvable problem. Sam Altman said he thought they could get to 95 to 99% security against prompt injections. I like to say, 'You can patch a bug, but you can't patch a brain.'”
AI prompt engineering in 2025: What works and what doesn’t | Sander Schulhoff (Learn Prompting, HackAPrompt) · Sander Schulhoff