Human or Machine? Assessing AI's Ability to Generate Game-Theory Questions

Benjamin Golub, Annie Liang, and Marciano Siniscalchi

2026

Abstract

AI models now excel at solving difficult applied mathematics problems; we ask how well they can compose such problems, focusing on undergraduate game theory. Adapting the Turing (1950) test to problem generation, we collect problems from professors and from GPT-5, standardizing presentation so that evaluation focuses on content rather than style. Sixty-seven experts classify problems as human- or LLM-generated. We find that AI output is indistinguishable to individual evaluators yet detectably different in aggregate. Individually, evaluators perform at chance (mean accuracy 50.9%). However, pooling 2,680 classifications rejects the null hypothesis that the two distributions are identical (p = 0.014). The signal resides in solutions, not problem statements: restricting to evaluators who observe solutions and report medium or high confidence, pooled accuracy rises to 53.4% (p = 0.0006), whereas without solutions we cannot reject the null. We also train a classifier to distinguish the two sources; the strongest objective feature separating them is the ratio of solution word count to problem word count: human-authored problems tend to require more reasoning per unit of setup. We discuss the implications of our findings for organizations that delegate knowledge work to AI.
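
Why can pooling detect what no individual can? With sixty-seven evaluators contributing roughly forty classifications each (2,680 in total), a small edge over chance that is statistically invisible in any one evaluator's sample becomes detectable in aggregate. The Python sketch below illustrates the mechanism with a two-sided binomial test against chance; it is a minimal illustration under an assumed true accuracy of 53%, not the paper's actual test, which compares the two response distributions.

from scipy.stats import binomtest

# Minimal illustration, not the paper's test. Assume a hypothetical true
# accuracy of 53% at telling human from LLM-generated problems. In one
# evaluator's sample of 40 classifications (2,680 / 67), the edge is
# statistically invisible; pooled over all 2,680, it is clearly detectable.
TRUE_ACC = 0.53  # hypothetical, for illustration only

single = binomtest(round(TRUE_ACC * 40), n=40, p=0.5)      # one evaluator
pooled = binomtest(round(TRUE_ACC * 2680), n=2680, p=0.5)  # all evaluators pooled

print(f"one evaluator (n = 40):    p = {single.pvalue:.2f}")   # about 0.87
print(f"pooled sample (n = 2,680): p = {pooled.pvalue:.4f}")   # about 0.002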

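The word-count feature is simple enough to state as code. The sketch below is hypothetical scaffolding around the one feature the abstract names; the function and threshold are illustrative inventions, not the paper's classifier.

def reasoning_density(problem: str, solution: str) -> float:
    """Ratio of solution word count to problem word count, the feature the
    abstract identifies as the strongest separator of the two sources."""
    return len(solution.split()) / max(len(problem.split()), 1)

# Hypothetical one-feature rule: per the abstract, human-authored problems
# require more reasoning per unit of setup, i.e., a higher ratio. The
# threshold is arbitrary and for illustration only.
THRESHOLD = 1.5

def guess_source(problem: str, solution: str) -> str:
    return "human" if reasoning_density(problem, solution) > THRESHOLD else "LLM"
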
BibTeX

@unpublished{golub2026human,
  author = {Benjamin Golub and Annie Liang and Marciano Siniscalchi},
  title = {Human or Machine? {A}ssessing {AI}'s Ability to Generate Game-Theory Questions},
  year = {2026},
  note = {Working paper}
}