Ashley Stirrup

Published 2026-05-05

Khan Academy spent 3 years defining one metric. Because most AI teams have no idea what "good" output actually is.

Dr. Kelli Hill, Head of Data at Khan Academy, shared how her team built cognitive engagement as the measure behind every Khanmigo experiment.

Not time on site. Not thumbs up ratings. A rubric grounded in decades of classroom research, adapted for AI tutoring, validated until human experts agreed on it 85% of the time, then scaled using an LLM as a judge.

When they finally started running production A/B tests on prompt changes, model swaps, and system instructions, they knew exactly what they were optimizing for. Small changes that would have looked like noise became meaningful signals.

The teams getting the most out of AI experimentation aren't just the ones moving fastest. They're the ones who did the hard work of defining success first.

#experimentation #abtesting #productmanagement
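
For anyone wondering what "experts agreed 85% of the time" could look like in practice, here is a minimal Python sketch. It assumes agreement is measured as simple pairwise percent agreement. The post does not say which statistic Khan Academy used, and the rubric labels, the sample ratings, and the `pairwise_agreement` helper are all hypothetical.

```python
from itertools import combinations


def pairwise_agreement(ratings):
    """Share of rater pairs, across all items, that assigned the same label.

    ratings: list of per-item lists, one rubric label per rater.
    """
    agree = 0
    total = 0
    for item_labels in ratings:
        # Compare every pair of raters on the same item.
        for a, b in combinations(item_labels, 2):
            total += 1
            agree += (a == b)
    if total == 0:
        raise ValueError("need at least two raters on at least one item")
    return agree / total


# Hypothetical labels from three human raters on four tutoring transcripts.
ratings = [
    ["high", "high", "high"],
    ["low", "low", "high"],
    ["medium", "medium", "medium"],
    ["high", "high", "high"],
]

TARGET = 0.85  # the agreement level mentioned in the post
score = pairwise_agreement(ratings)
print(f"agreement = {score:.2%}, meets target: {score >= TARGET}")
```

Under this assumption, the same function could later compare an LLM judge's labels against the human labels by treating the judge as one more rater. Chance-corrected statistics such as Cohen's kappa are a common alternative when some labels are much more frequent than others.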

