Friend or Foe: Delegating to an AI whose Alignment is Unknown
2026
Abstract
We study delegation to an AI that could be aligned (maximizing the designer's payoff) or misaligned (minimizing it). The designer asks the AI to report how chosen covariates predict the outcome. Because the designer knows neither the AI's alignment nor the true relationship between covariates and outcomes, they evaluate performance in both the best- and worst-case scenarios. We characterize the efficient frontier of achievable best- and worst-case payoffs. Without any constraints on how covariates relate to outcomes, this frontier is a single line segment: any gain in best-case performance requires an equal sacrifice in the worst case, regardless of the designer's strategy. When the designer can bound covariate informativeness and select covariates accordingly, the frontier improves, and the optimal design exhibits a simple and interpretable cutoff structure.
BibTeX
@unpublished{fudenberg2026friend,
  author = {Drew Fudenberg and Annie Liang},
  title  = {Friend or Foe: Delegating to an {AI} whose Alignment is Unknown},
  year   = {2026},
  note   = {Working paper}
}