
Unpredictability of AI Alignment Is Not Always Bad for AI Safety

Spelda, Petr and Stritecky, Vit (2025) Unpredictability of AI Alignment Is Not Always Bad for AI Safety. [Preprint]

spelda-stritecky-ai-alignment-unpredictability.pdf (Submitted Version, 1MB)
Available under a Creative Commons Attribution license.

Abstract

The robustness of AI alignment is one of the safety issues of large language models. Can we predict how many mistakes a model will make when responding to a restricted request? We show that when access to the model is limited to in-context learning, the number of mistakes can be proved inapproximable, which can make the model's alignment unpredictable. Counterintuitively, this is not entirely bad news for AI safety. Attackers might not be able to easily misuse in-context learning to break the model's alignment in a predictable manner, because the mistake bounds of the safe responses used for alignment can be proved inapproximable. This inapproximability can hide the safe responses from attackers and make the model's alignment unpredictable. If the safe responses could be kept from attackers, responsible users would benefit from testing and repairing the model's alignment despite its possible unpredictability. We also discuss challenges in ensuring democratic AI alignment with limited access to safe responses, the very limitation that helps make the model's alignment unpredictable for attackers.
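
The abstract's central objects are mistake bounds from online (formal) learning theory. As a generic illustration of what a mistake bound is, not the paper's construction, the following Python sketch runs the textbook Halving algorithm on a realizable sequence and checks the mistake count against the classical log2|H| bound. The hypothesis class, target rule, and data stream are all hypothetical.

```python
import math

def halving_predict(version_space, x):
    """Predict by majority vote of the hypotheses still consistent with the data."""
    votes = sum(h(x) for h in version_space)
    return 1 if 2 * votes >= len(version_space) else 0

def run_halving(hypotheses, target, stream):
    """Online loop: predict, observe the true label, prune inconsistent hypotheses."""
    version_space = list(hypotheses)
    mistakes = 0
    for x in stream:
        y = target(x)  # label revealed only after the prediction
        if halving_predict(version_space, x) != y:
            mistakes += 1
        version_space = [h for h in version_space if h(x) == y]
    return mistakes

# Hypothetical finite class: threshold rules h_t(x) = 1 iff x >= t.
hypotheses = [lambda x, t=t: int(x >= t) for t in range(64)]
target = hypotheses[37]  # hypothetical "safe response" rule
stream = range(64)

mistakes = run_halving(hypotheses, target, stream)
bound = math.log2(len(hypotheses))
print(f"mistakes = {mistakes}, Halving bound = log2|H| = {bound:.0f}")
assert mistakes <= bound  # each mistake at least halves the version space
```

In the paper's setting, the analogous mistake bounds for safe responses exist but are proved inapproximable to an attacker limited to in-context learning, which is what keeps the model's alignment unpredictable.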



Item Type: Preprint
Creators:
Spelda, Petr (petr.spelda@fsv.cuni.cz, ORCID: 0000-0003-4199-645X)
Stritecky, Vit (ORCID: 0000-0003-1778-3657)
Subjects: Specific Sciences > Artificial Intelligence
General Issues > Formal Learning Theory
Depositing User: Dr. Petr Spelda
Date Deposited: 04 Jul 2025 12:46
Last Modified: 04 Jul 2025 12:46
Item ID: 25882
Date: 15 April 2025
URI: https://philsci-archive.pitt.edu/id/eprint/25882
