Ashley Stirrup
text published 2026-04-09
Khan Academy gave their AI tutor a calculator to improve math accuracy. It worked, but it made responses painfully slow for students.

So they ran five sequential A/B tests:

1. Removing the calculator: math errors doubled.
2. Switching to GPT-5: accuracy still suffered.
3. Tightening the agent's prompts: latency dropped 3 seconds.
4. Upgrading the agent's model: another 300 ms off.
5. Time-boxing execution: more gains, accuracy stable.

Without experiments, they might have shipped the first iteration and unknowingly made tutoring worse. That's the whole case for A/B testing AI features in one example.

Dr. Kelli Hill, Senior Director of Data Insights at Khan Academy, shared this and more at Experimentation Island. She's joining us for a live webinar on April 16 to go deeper.

Register for the April 16 webinar with Kelli: https://lnkd.in/gTY5-RBh
Blog recap: https://lnkd.in/gBfW4UrC