A Physics Professor Tested Whether AI Can Reliably Grade Exams

Professor Lyman Page compared his grades to ones generated by Google’s Gemini

Lyman Page, professor of PHY 104: General Physics II

Denise Applewhite / Princeton University

By Lia Opperman ’25

Published July 2, 2026

As universities nationwide grapple with how artificial intelligence could reshape education, one Princeton physics professor is testing whether AI can reliably grade exams.

The experiment, conducted in the spring 2026 semester in PHY 104: General Physics II, taught by professor Lyman Page, compared the exam grades Page assigned with grades generated by Google’s Gemini Pro AI model. The AI-generated scores did not affect students’ grades, which were determined manually, and students were allowed to opt out of the experiment.

“Grading exams is an enormous time sink, and there’s always some subjectivity in it,” Page said. He explained that with more than 230 exams to grade by hand, AI could make grading more efficient and consistent. The experiment was still in progress when this issue went to press.

Page added that physics exams are often hard to read, as people frequently start over, scratch out problems, or put arrows over their work, so it will take some training for the AI model to understand that. Ultimately, it may not work.

“The goal of creating any problem is to figure out what someone knows, not necessarily if they have the exact right answer,” he said.

But if the technology performs well, he said, the class will probably start using it on small assignments before considering it for other projects.

Google’s terms of service say the enterprise version of Gemini Pro does not share uploaded content with outside organizations without permission. Before launching the experiment, the physics department worked with University administrators and the Office of Information Technology. Page said the system uses Princeton’s firewall protections, and no one will be able to scrape the exam from public AI platforms.

Beyond efficiency, Page explained that AI could help instructors provide more detailed feedback to students. Especially in large courses, he said, there are not enough hours in the day to annotate every exam where a student missed a vector sign or pointed a direction the wrong way.

Arav Gupta ’29 said he initially thought that it was a joke when he received Page’s Canvas announcement that Gemini Pro would be used to grade his exam. Once he learned that the exams would be graded manually first, he felt more comfortable participating.

After receiving his official grade, he ran his exam through Gemini and saw that it gave him a score roughly 10% lower than the one he received from the hand grader.

Gupta didn’t have an answer key, so it was not a direct parallel to Page’s experiment, but Gemini did compare its answers with his responses. “It wasn’t giving me a lot of partial credit the same way the hand grader would,” he said.

Andrew Addo ’29 said he was surprised by the department’s decision but understood the reasoning behind it. “I think I understand, going forward, the trend is that generative AI is going to be used in all aspects of life for efficiency,” he said. Still, he said that he wished the department had been more transparent about how the experiment would work.

Princeton is adapting its broader policies surrounding AI. This spring, the University faculty voted to use proctors for in-person exams beginning in the fall semester, citing concerns about AI and electronic devices.

When informed of the decision, Page questioned whether human proctors were the most effective response. “If the purpose is to detect cheating, why not use a camera?” he asked.

For Page, his experiment reflects what he sees as a larger shift in higher education.

AI is “part of all of our lives … . [We have to] figure out how to take advantage of it,” he said. “It’s a powerful tool. We can use it to our benefit.”

Still, he argued that while AI may become increasingly embedded in STEM education, it will not replace the human element of teaching.

“If you could watch a computer and learn everything you need … people would,” he said. “But they don’t, because it doesn’t work. … Humans are important for communicating to humans.”

Published in the July 2026 Issue

4 Responses

Anita Kestin ’76

1 Week Ago

Using AI for Grading

Regarding Professor Page’s experiment using AI for grading:

Bravo! Great idea, but it does not go far enough. Let’s have students stay at home with their families, watch canned lectures on their computers (if they feel like it), and create papers and take exams using AI. The endowment will grow and we will eliminate messy youthful romances, the need to provide meal services, the logistics of trash pickup, and the perennial headaches of town-gown relations.

(Has Princeton University lost its ever-loving mind?)

Howard Wainer *68

2 Weeks Ago

Scoring by Human Graders and AI

What about scoring reliability?

If a student’s exam was scored by the same grader a second time, how similar were the two grades? By a second, but different, human rater? Contrast this with scoring the exam multiple times by AI. The instability of human scoring was one of the driving forces behind the shift to multiple choice exams in which inter-rater reliability was essentially perfect.

Neal Carlson ’62

2 Weeks Ago

On the Issue of Grading Consistently

An excellent question — how reliable (consistent) are human graders? Would a given paper receive the same grade if it were graded by a given human near the beginning of the set versus near the end of the set? Or would intervening papers affect the grade due to cumulative perception of comparative quality? Or due to cumulative weariness of the human grader? Presumably a specific AI model would grade consistently across those cases because it operates with the same knowledge base. On the other hand, different AI models with different knowledge bases might generate different grades for the same paper — much like different human graders. All in all, this is a very interesting and potentially useful subject!

Mitchell Berger ’79

1 Week Ago
