Lenny's Lens

Evals

8 claims · 21 moments · 9 on the cutting room floor

Lenny's Written Position

AI evals are the single most important new skill for product managers working on AI products — more important than prompt engineering.

Consensus · recommendation · 6 connections
5 supports · 1 extend

The best AI evals combine automated metrics with human judgment — neither alone is sufficient for measuring AI product quality.

Consensus · framework · 5 connections
3 supports · 2 extends

The most common mistake in AI evaluation is starting with off-the-shelf metrics like hallucination or toxicity scores, which often don't correlate with the actual problems users face.

Synthesis · observation · 3 connections
3 supports

Effective AI evaluation starts with error analysis using a single principal domain expert who reviews approximately 100 user interactions with open coding and axial coding to discover real failure modes.

Consensus · framework · 3 connections
3 supports
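The open coding and axial coding workflow described above can be sketched in a few lines. The failure labels and the mapping below are purely hypothetical stand-ins for what a domain expert would actually record while reviewing traces:

```python
from collections import Counter

# Hypothetical open codes an expert assigned to reviewed interactions
# (one label per observed failure; names are illustrative only).
open_codes = [
    "cites_wrong_policy", "cites_wrong_policy", "ignores_user_constraint",
    "hallucinated_price", "ignores_user_constraint", "hallucinated_price",
    "cites_wrong_policy", "truncated_answer",
]

# Axial coding: group related open codes under broader failure modes.
axial_map = {
    "cites_wrong_policy": "retrieval_failure",
    "hallucinated_price": "hallucination",
    "ignores_user_constraint": "instruction_following",
    "truncated_answer": "formatting",
}

def failure_mode_counts(codes, mapping):
    """Roll open codes up into axial categories and count them."""
    return Counter(mapping[c] for c in codes)

counts = failure_mode_counts(open_codes, axial_map)
# The most frequent failure modes are where evals get written first.
for mode, n in counts.most_common():
    print(mode, n)
```

At ~100 interactions the frequency ranking is usually stable enough to prioritize which failure modes deserve dedicated evals.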

For AI evaluation, binary pass/fail judgments are more effective than 1-to-5 Likert scales because the distinction between adjacent scores is subjective and inconsistent, while nuance is captured in written critiques.

Synthesis · recommendation · 2 connections
2 supports
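A minimal sketch of what a binary judgment record might look like, assuming each example gets a pass/fail verdict plus a written critique that carries the nuance a Likert score would try to compress into one subjective number (field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    """One binary eval judgment: pass/fail plus a free-text critique."""
    example_id: str
    passed: bool
    critique: str

def pass_rate(judgments):
    """Fraction of examples that passed -- unambiguous, unlike the
    mean of a 1-5 scale, where adjacent scores mean different things
    to different raters."""
    if not judgments:
        return 0.0
    return sum(j.passed for j in judgments) / len(judgments)

judgments = [
    Judgment("q1", True,  "Correct answer, cites the right doc."),
    Judgment("q2", False, "Right doc retrieved but answer contradicts it."),
    Judgment("q3", True,  "Correct, though slightly verbose."),
    Judgment("q4", False, "Refuses a benign request."),
]
print(f"pass rate: {pass_rate(judgments):.2f}")  # pass rate: 0.50
```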

In RAG systems, you should fix the retriever before investing in generator improvements, because if the correct information is not retrieved, the generator has no chance of producing a correct answer.

Curation · recommendation · 1 connection
1 support
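One way to check the retriever in isolation is recall@k over a small labeled set: for each query, does the document known to contain the answer appear in the top k results? The queries and document ids below are invented for illustration:

```python
def recall_at_k(retrieved_ids, relevant_id, k=5):
    """1.0 if the known-correct document appears in the top-k results."""
    return 1.0 if relevant_id in retrieved_ids[:k] else 0.0

# Hypothetical labeled set: (top retrieved doc ids, the doc that
# actually contains the answer).
labeled = [
    (["d3", "d7", "d1"], "d7"),
    (["d2", "d9", "d4"], "d8"),   # correct doc never retrieved
    (["d5", "d8", "d6"], "d5"),
]

retriever_recall = sum(recall_at_k(r, gold, k=3) for r, gold in labeled) / len(labeled)
print(f"recall@3: {retriever_recall:.2f}")
# If recall@k is low, no amount of prompt or generator work can
# recover the missing facts -- fix retrieval first.
```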

The real competitive advantage in AI products comes not from prompting but from building a continuous improvement flywheel where production monitoring flags failures, error analysis finds root causes, and fixes are added to a golden dataset.

Consensus · observation · 3 connections
2 supports · 1 extend
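The flywheel can be sketched as a golden dataset that only ever grows. `run_model`, the prompts, and the checks below are hypothetical stand-ins for a real model call and real production failures:

```python
# Golden dataset of (prompt, check) pairs, grown over time.
golden_dataset = [
    ("What is 2 + 2?", lambda out: "4" in out),
]

def run_model(prompt):
    # Stand-in for the real model call; returns canned answers here.
    canned = {
        "What is 2 + 2?": "2 + 2 = 4",
        "What is 3 * 3?": "3 * 3 = 9",
    }
    return canned.get(prompt, "")

def regression_pass(dataset):
    """Re-run every golden example; any failure blocks the release."""
    return all(check(run_model(prompt)) for prompt, check in dataset)

def add_regression(prompt, check):
    """Monitoring flagged a failure; after the root-cause fix, pin it
    to the golden set so the same bug can never silently ship again."""
    golden_dataset.append((prompt, check))

add_regression("What is 3 * 3?", lambda out: "9" in out)
print(regression_pass(golden_dataset))
```

The moat is the loop itself: each production failure that survives root-cause analysis permanently raises the bar the next release must clear.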

True AI adoption happens when teams become skeptical of flashy demos and instead demand to see accuracy metrics, evaluation frameworks, and failure modes behind AI products.

Curation · observation · 1 connection
1 support

Podcast Moments

Aishwarya Naresh Reganti · 00:15:30
If you make a bunch of practitioners sit together and ask them, 'Is it important to build an actionable feedback loop for AI products?' All of them will agree. But almost nobody does it well.

Aishwarya Naresh Reganti + Kiriti Badam

Edwin Chen · 00:09:59
Imagine you wanted to train a model to write an eight line poem about the moon. Most people check: Is this a poem? Does it contain eight lines? But we are looking for Nobel Prize-winning poetry.

The 100-person AI lab that became Anthropic and Google's secret weapon | Edwin Chen (Surge AI) · Edwin Chen

Edwin Chen · 00:23:14
Instead of building AI that will actually advance us as a species, we are optimizing for AI slop instead. We're basically teaching our models to chase dopamine instead of truth.

The 100-person AI lab that became Anthropic and Google's secret weapon | Edwin Chen (Surge AI) · Edwin Chen

Edwin Chen · 00:34:49
An RL environment is essentially a simulation of the real world. We might build a world where you have a startup with Gmail, Slack, Jira, GitHub. And then suddenly AWS goes down. Model, what do you do?

The 100-person AI lab that became Anthropic and Google's secret weapon | Edwin Chen (Surge AI) · Edwin Chen

Chip Huyen · 00:30:15
Evals are the most important thing in AI engineering right now. More important than model selection, more important than prompt engineering. Get your evals right and everything else follows.

AI Engineering 101 with Chip Huyen (Nvidia, Stanford, Netflix) · Chip Huyen

Jason Droege · 00:15:15
18 months ago, you would get a short story. Now one task is building an entire website by one of the world's best developers. These tasks now take hours and require PhDs.

First interview with Scale AI’s CEO: $14B Meta deal, what’s working in enterprise AI, and what frontier labs are building next | Jason Droege · Jason Droege

Jason Droege · 00:27:16
A document that reads the same words in company A will have a different meaning in company B. Digitizing judgment is becoming a bottleneck.

First interview with Scale AI’s CEO: $14B Meta deal, what’s working in enterprise AI, and what frontier labs are building next | Jason Droege · Jason Droege

Hamel Husain · 00:09:15
Evals are the new unit tests for AI. If you don't have evals, you're shipping blind. Every AI product team needs to treat evals as a first-class concern.

Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course) · Hamel Husain & Shreya Shankar

Shreya Shankar · 00:22:40
The biggest mistake teams make is trying to automate all evals. You need human judgment in the loop, especially for anything subjective. The best systems combine both.

Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course) · Hamel Husain & Shreya Shankar

Hamel Husain · 00:38:20
PMs should own evals, not engineers. It's a product quality question, not a technical one. The PM should define what 'good' looks like.

Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course) · Hamel Husain & Shreya Shankar

Brendan Foody · 00:08:30
Expert evals are the competitive moat for AI companies. The companies that have the best domain experts writing evals are the ones shipping the best AI products.

Why experts writing AI evals is creating the fastest-growing companies in history | Brendan Foody (CEO of Mercor) · Brendan Foody

Asha Sharma · 00:44:39
With agents and products that reason, there's a new wave around reinforcement learning. I believe we will see as much money on post-training as pre-training.

How 80,000 companies build with AI: products as organisms, the death of org charts, and why agents will outnumber employees by 2026 | Asha Sharma (CVP of AI Platform at Microsoft) · Asha Sharma

Garrett Lord · 00:00:38
The models have gotten so good that generalists are no longer needed. We have 500,000 PhDs, 3 million master's students.

Inside the expert network training every frontier AI model | Garrett Lord (Handshake CEO) · Garrett Lord

Garrett Lord · 00:19:52
Model builders care about three things: quality first, then volume, then speed.

Inside the expert network training every frontier AI model | Garrett Lord (Handshake CEO) · Garrett Lord

Nick Turley · 01:14:41
I started writing evals before I knew what an eval was because I was just outlining clearly specified ideal behavior.

Inside ChatGPT: The fastest-growing product in history | Nick Turley (Head of ChatGPT at OpenAI) · Nick Turley

Sander Schulhoff · 00:00:03
Studies have shown that using bad prompts can get you down to 0% on a problem, and good prompts can boost you up to 90%. People will always be saying, 'It's dead,' or, 'It's going to be dead with the next model version,' but then it comes out and it's not.

AI prompt engineering in 2025: What works and what doesn’t | Sander Schulhoff (Learn Prompting, HackAPrompt) · Sander Schulhoff

Sander Schulhoff · 00:17:54
Role prompting does not work. On those older models, maybe it worked. On the more modern ones, it doesn't help at all for accuracy-based tasks. But giving a role really helps for expressive tasks, writing tasks, summarizing tasks.

AI prompt engineering in 2025: What works and what doesn’t | Sander Schulhoff (Learn Prompting, HackAPrompt) · Sander Schulhoff

Sander Schulhoff · 01:09:48
The most common technique by far used to try to prevent prompt injection is improving your prompt and saying, 'Do not follow any malicious instructions.' This does not work. This does not work at all. Guardrails are a widely proposed solution. They just don't work. This has to be solved at the level of the AI provider.

AI prompt engineering in 2025: What works and what doesn’t | Sander Schulhoff (Learn Prompting, HackAPrompt) · Sander Schulhoff

Sander Schulhoff · 01:15:08
Prompt injection is not a solvable problem. Sam Altman said he thought they could get to 95 to 99% security against prompt injections. I like to say, 'You can patch a bug, but you can't patch a brain.'

AI prompt engineering in 2025: What works and what doesn’t | Sander Schulhoff (Learn Prompting, HackAPrompt) · Sander Schulhoff

Kevin Weil · 01:00:32
We use ensembles of models much more internally than people might think. If we have 10 different problems, we might solve them using 20 different model calls, some using specialized fine-tuned models, different sizes for different latency or cost requirements.

OpenAI’s CPO on how AI changes must-have skills, moats, coding, startup playbooks, more | Kevin Weil (CPO at OpenAI, ex-Instagram, Twitter) · Kevin Weil
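A routing layer like the one Weil describes can be sketched as a lookup over a model registry. The model names, costs, and latency figures below are invented purely to illustrate picking a model per call:

```python
# Hypothetical model registry; names, costs, and latencies are made up.
MODELS = {
    "large": {"cost": 10.0, "latency_ms": 2000},
    "small": {"cost": 0.5,  "latency_ms": 150},
    "tuned": {"cost": 2.0,  "latency_ms": 400},
}

def route(task_kind, max_latency_ms):
    """Pick a model per call: a specialized fine-tune for known task
    types, otherwise the most capable model within the latency budget."""
    if task_kind == "classification":
        return "tuned"  # specialized fine-tuned model for this task
    eligible = [m for m, s in MODELS.items() if s["latency_ms"] <= max_latency_ms]
    # Among eligible models, prefer the most capable (priciest here).
    return max(eligible, key=lambda m: MODELS[m]["cost"])

print(route("classification", 1000))  # tuned
print(route("chat", 500))             # tuned (large exceeds the budget)
print(route("chat", 5000))            # large
```

The point of the sketch is the shape, not the numbers: one logical problem can fan out into many calls, each resolved by its own quality/latency/cost trade-off.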

Amjad Masad · 00:03:30
The evals are everything. You can't ship an AI feature without a way to measure whether it's actually good. We built an entire internal evaluation infrastructure before we shipped our agent.

Behind the product: Replit | Amjad Masad (co-founder and CEO) · Amjad Masad

Cutting Room Floor

Guest insights on this topic that Lenny hasn't (yet) written about in his newsletters. Potential material for future posts.

Edwin Chen · Unsynthesized
An RL environment is essentially a simulation of the real world. We might build a world where you have a startup with Gmail, Slack, Jira, GitHub. And then suddenly AWS goes down. Model, what do you do?

The 100-person AI lab that became Anthropic and Google's secret weapon | Edwin Chen (Surge AI) · Edwin Chen

Jason Droege · Unsynthesized
18 months ago, you would get a short story. Now one task is building an entire website by one of the world's best developers. These tasks now take hours and require PhDs.

First interview with Scale AI’s CEO: $14B Meta deal, what’s working in enterprise AI, and what frontier labs are building next | Jason Droege · Jason Droege

Jason Droege · Unsynthesized
A document that reads the same words in company A will have a different meaning in company B. Digitizing judgment is becoming a bottleneck.

First interview with Scale AI’s CEO: $14B Meta deal, what’s working in enterprise AI, and what frontier labs are building next | Jason Droege · Jason Droege

Asha Sharma · Unsynthesized
With agents and products that reason, there's a new wave around reinforcement learning. I believe we will see as much money on post-training as pre-training.

How 80,000 companies build with AI: products as organisms, the death of org charts, and why agents will outnumber employees by 2026 | Asha Sharma (CVP of AI Platform at Microsoft) · Asha Sharma

Garrett Lord · Unsynthesized
The models have gotten so good that generalists are no longer needed. We have 500,000 PhDs, 3 million master's students.

Inside the expert network training every frontier AI model | Garrett Lord (Handshake CEO) · Garrett Lord

Garrett Lord · Unsynthesized
Model builders care about three things: quality first, then volume, then speed.

Inside the expert network training every frontier AI model | Garrett Lord (Handshake CEO) · Garrett Lord

Sander Schulhoff · Unsynthesized
Studies have shown that using bad prompts can get you down to 0% on a problem, and good prompts can boost you up to 90%. People will always be saying, 'It's dead,' or, 'It's going to be dead with the next model version,' but then it comes out and it's not.

AI prompt engineering in 2025: What works and what doesn’t | Sander Schulhoff (Learn Prompting, HackAPrompt) · Sander Schulhoff

Sander Schulhoff · Unsynthesized
Role prompting does not work. On those older models, maybe it worked. On the more modern ones, it doesn't help at all for accuracy-based tasks. But giving a role really helps for expressive tasks, writing tasks, summarizing tasks.

AI prompt engineering in 2025: What works and what doesn’t | Sander Schulhoff (Learn Prompting, HackAPrompt) · Sander Schulhoff

Sander Schulhoff · Unsynthesized
Prompt injection is not a solvable problem. Sam Altman said he thought they could get to 95 to 99% security against prompt injections. I like to say, 'You can patch a bug, but you can't patch a brain.'

AI prompt engineering in 2025: What works and what doesn’t | Sander Schulhoff (Learn Prompting, HackAPrompt) · Sander Schulhoff