LLMs Have Reached or Exceeded Median Instructor Quality on Core Deliverables
Benchmark evaluations and blind studies consistently show that GPT-4o and Claude 3.5 Sonnet produce lecture outlines, essay feedback, and reading list recommendations that evaluators, including history faculty, rate as good as or better than median instructor outputs. A 2023 study in PLOS ONE found that ChatGPT-generated feedback on student essays was rated higher in specificity and actionability than instructor feedback in blind review. Frontier models now score in the 90th percentile on AP History exams and can engage in detailed historiographical analysis across virtually all mainstream historical topics.