TutorSim

TutorSim is a harness for benchmarking AI tutors that transfers expert teachers' analysis of real human-student tutoring sessions to replays between synthetic students and AI tutors.

Expert teachers reviewed real student–tutor transcripts and flagged key learning moments: points in a session that are pivotal for one of two pedagogical constructs, namely scaffolding vs. rigor (staying inside the student's zone of proximal development) or rapport building.

At each key moment, teachers wrote situation–action–result (SAR) caption annotations: why the moment matters, what the human tutor actually did, and how the student responded.
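As a rough illustration, one SAR annotation could be represented as a small record. This is only a sketch: the class and field names (`SARAnnotation`, `Construct`, `turn_index`) are hypothetical and not taken from TutorSim itself, and the example values are invented purely to show the shape of the data.

```python
from dataclasses import dataclass
from enum import Enum


class Construct(Enum):
    # The two pedagogical constructs named in the description.
    SCAFFOLDING_VS_RIGOR = "scaffolding_vs_rigor"
    RAPPORT_BUILDING = "rapport_building"


@dataclass(frozen=True)
class SARAnnotation:
    # Hypothetical schema; TutorSim's real storage format may differ.
    turn_index: int       # assumed: position of the key moment in the transcript
    construct: Construct  # which construct the moment is pivotal for
    situation: str        # why the moment matters
    action: str           # what the tutor actually did
    result: str           # how the student responded


# Invented example, not real annotation data.
example = SARAnnotation(
    turn_index=12,
    construct=Construct.SCAFFOLDING_VS_RIGOR,
    situation="Student repeats the same error after one hint.",
    action="Tutor breaks the problem into a smaller sub-step.",
    result="Student completes the sub-step unaided.",
)
```

Keeping the three caption fields as free text mirrors how the annotations are described: they are written captions, not categorical codes.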

To benchmark a language model (LM), we replay a transcript up to a key learning moment and then hand the conversation to a synthetic student paired with the LM tutor. A synthetic SAR annotator captions the AI's action and the simulated student's reaction; a classifier judges the result as effective, partially effective, or ineffective. The output is a profile of which pedagogical choices the LM made (the action) when faced with different scenarios (the situation), and how each choice affected the synthetic student's behavior (the result).
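The replay step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not TutorSim's actual interface: the function name `replay_key_moment`, the `(speaker, text)` turn format, the callable signatures, and the choice to simulate exactly one tutor turn and one student reply are all hypothetical. The lambdas stand in for the LM tutor, synthetic student, SAR annotator, and classifier.

```python
from typing import Callable, Dict, List, Tuple

Turn = Tuple[str, str]  # (speaker, text); assumed transcript format
LABELS = ("effective", "partially effective", "ineffective")


def replay_key_moment(
    transcript: List[Turn],
    key_turn: int,
    tutor: Callable[[List[Turn]], str],
    student: Callable[[List[Turn]], str],
    annotate: Callable[[List[Turn], str, str], Tuple[str, str]],
    classify: Callable[[str, str], str],
) -> Dict[str, str]:
    # Replay the real transcript up to, but not including, the key moment
    # (treating the boundary as exclusive is an assumption).
    history = list(transcript[:key_turn])
    # Hand the conversation over: the LM tutor acts, the synthetic student reacts.
    tutor_msg = tutor(history)
    student_msg = student(history + [("tutor", tutor_msg)])
    # Caption the tutor's action and the student's reaction, then judge the result.
    action, result = annotate(history, tutor_msg, student_msg)
    label = classify(action, result)
    if label not in LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    return {"action": action, "result": result, "label": label}


# Stand-in components and an invented transcript, for illustration only.
transcript = [
    ("tutor", "What is 3/4 + 1/4?"),
    ("student", "4/8?"),
    ("tutor", "Not quite. Add the tops and the bottoms separately?"),
]
outcome = replay_key_moment(
    transcript,
    key_turn=2,
    tutor=lambda h: "What do the denominators tell us?",
    student=lambda h: "They are both fourths, so 4/4.",
    annotate=lambda h, t, s: (f"Tutor asked: {t}", f"Student replied: {s}"),
    classify=lambda a, r: "effective",
)
```

Running this over every flagged key moment and grouping the outcomes by situation would give the kind of profile the description refers to.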

A transcript-grounded harness for AI tutor benchmarking using synthetic students

Abstract

Funding