Spelda, Petr and Stritecky, Vit (2025) Unpredictability of AI Alignment Is Not Always Bad for AI Safety. [Preprint]
Text: spelda-stritecky-ai-alignment-unpredictability.pdf (Submitted Version, 1MB). Available under a Creative Commons Attribution license.
Abstract
Robustness of AI alignment is one of the safety issues facing large language models. Can we predict how many mistakes a model will make when responding to a restricted request? We show that when access to the model is limited to in-context learning, the number of mistakes can be proved inapproximable, which can make the model's alignment unpredictable. Counterintuitively, this is not entirely bad news for AI safety. Attackers might not be able to easily misuse in-context learning to break the model's alignment in a predictable manner, because the mistake bounds of the safe responses used for alignment can be proved inapproximable. This inapproximability can hide the safe responses from attackers and make the model's alignment unpredictable. If the safe responses could be kept from attackers, responsible users would still benefit from testing and repairing the model's alignment despite its possible unpredictability. We also discuss the challenges of ensuring democratic AI alignment when access to the safe responses is limited, since that limited access helps make the model's alignment unpredictable for attackers.
Item Type: Preprint
Creators: Spelda, Petr; Stritecky, Vit
Subjects: Specific Sciences > Artificial Intelligence; General Issues > Formal Learning Theory
Depositing User: Dr. Petr Spelda
Date Deposited: 04 Jul 2025 12:46
Last Modified: 04 Jul 2025 12:46
Item ID: 25882
Date: 15 April 2025
URI: https://philsci-archive.pitt.edu/id/eprint/25882