TutorSim

A transcript-grounded harness for AI tutor benchmarking using synthetic students

Ryan Knight1, Albert Zhang1, Alexis Ross, Lucy Li2, Kyle Lo3, Julian Bernado4, Ana Ribeiro4, Rebecca Bowie1, Susanna Loeb5
1InSource Services 2University of Wisconsin–Madison 3Allen Institute for AI 4Stanford SCALE 5Stanford University
Project site under development

Abstract

TutorSim is a harness for benchmarking AI tutors that transfers expert teachers' analysis of real human-student tutoring sessions to replays between synthetic students and AI tutors.

Expert teachers reviewed real student–tutor transcripts and flagged key learning moments — points in a session that are pivotal for one of two pedagogical constructs: scaffolding vs. rigor (staying inside the student's zone of proximal development) and rapport building.

At each key moment, teachers wrote situation–action–result (SAR) caption annotations: why the moment matters, what the human tutor actually did, and how the student responded.
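One way to picture a single annotated key moment is as a small record. The sketch below is illustrative only: the class names, field names, and the example values are assumptions made for this illustration, not TutorSim's actual schema or data.

```python
from dataclasses import dataclass
from enum import Enum


class Construct(Enum):
    """The two pedagogical constructs named in the abstract."""
    SCAFFOLDING_VS_RIGOR = "scaffolding_vs_rigor"
    RAPPORT_BUILDING = "rapport_building"


@dataclass(frozen=True)
class KeyMoment:
    """Hypothetical record for one teacher-flagged key learning moment."""
    transcript_id: str    # assumed identifier for the source session
    turn_index: int       # assumed position of the moment in the transcript
    construct: Construct  # which construct the moment is pivotal for
    situation: str        # why the moment matters
    action: str           # what the human tutor actually did
    result: str           # how the student responded


# Invented example values, for illustration only.
example = KeyMoment(
    transcript_id="demo-001",
    turn_index=12,
    construct=Construct.SCAFFOLDING_VS_RIGOR,
    situation="Student is stuck after two failed attempts.",
    action="Tutor asks a narrower guiding question.",
    result="Student completes the next step unaided.",
)
```

Keeping situation, action, and result as separate fields makes it straightforward to compare the human tutor's action against an AI tutor's action for the same situation.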

To benchmark a language model (LM), we replay a transcript up to a key learning moment and then hand the conversation to a synthetic student paired with the LM tutor. A synthetic SAR annotator captions the AI's action and the simulated student's reaction; a classifier judges the result as effective, partially effective, or ineffective. The output is a profile of the LM tutor: the scenarios it faced (the situation), the pedagogical choices it made in each (the action), and the impact of those choices on the synthetic student's behavior (the result).
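The replay procedure described above can be sketched as a short loop. This is a minimal sketch under stated assumptions, not TutorSim's implementation: the function names, the turn format, the number of simulated turns, and the choice to cut the transcript just before the key-moment turn are all hypothetical, and the tutor, student, annotator, and classifier are stand-in stubs where real model calls would go.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Turn = Dict[str, str]  # assumed format: {"role": "tutor" | "student", "text": "..."}
LABELS = ("effective", "partially effective", "ineffective")


def replay_key_moment(
    transcript: List[Turn],
    moment_index: int,
    tutor: Callable[[List[Turn]], str],
    student: Callable[[List[Turn]], str],
    annotate: Callable[[List[Turn], List[Turn]], Tuple[str, str]],
    classify: Callable[[str, str], str],
    max_exchanges: int = 1,
) -> Dict[str, object]:
    """Replay the real transcript up to the key moment, then simulate."""
    # Assumption: the real turns before moment_index are kept as context.
    history = list(transcript[:moment_index])
    new_turns: List[Turn] = []
    for _ in range(max_exchanges):
        new_turns.append({"role": "tutor", "text": tutor(history + new_turns)})
        new_turns.append({"role": "student", "text": student(history + new_turns)})
    # The annotator captions the AI's action and the student's reaction.
    action, result = annotate(history, new_turns)
    label = classify(action, result)
    if label not in LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    return {"context_turns": len(history), "action": action,
            "result": result, "label": label}


def build_profile(outcomes: List[Dict[str, object]]) -> Counter:
    """Count outcome labels across replayed key moments."""
    return Counter(o["label"] for o in outcomes)


# Stand-in stubs; a real harness would call models here.
def stub_tutor(turns: List[Turn]) -> str:
    return "What would you try next?"

def stub_student(turns: List[Turn]) -> str:
    return "Maybe subtract 3 from both sides?"

def stub_annotate(history: List[Turn], new_turns: List[Turn]) -> Tuple[str, str]:
    return ("asked an open question", "student proposed a next step")

def stub_classify(action: str, result: str) -> str:
    return "effective" if "proposed" in result else "ineffective"


transcript = [
    {"role": "tutor", "text": "Let's solve x + 3 = 7."},
    {"role": "student", "text": "I don't know where to start."},
    {"role": "tutor", "text": "(human tutor's real reply)"},
    {"role": "student", "text": "(real student's reply)"},
]
outcome = replay_key_moment(transcript, 2, stub_tutor, stub_student,
                            stub_annotate, stub_classify)
profile = build_profile([outcome])
```

Passing the four components in as callables keeps the replay loop independent of any particular model, so the same key moments can be rerun against different LM tutors and their label counts compared.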

Funding

This work is supported by the Gates Foundation and the Chan Zuckerberg Initiative.