Spelda, Petr and Stritecky, Vit (2025) Unpredictability of AI Alignment Is Not Always Bad for AI Safety. [Preprint]
Text: spelda-stritecky-ai-alignment-unpredictability.pdf (Submitted Version, 1MB). Available under a Creative Commons Attribution license.
Abstract
Robustness of AI alignment is one of the safety issues facing large language models. Can we predict how many mistakes a model will make when responding to a restricted request? We show that when access to the model is limited to in-context learning, the number of mistakes can be proved inapproximable, which can make the model's alignment unpredictable. Counterintuitively, this is not entirely bad news for AI safety. Attackers might not be able to easily misuse in-context learning to break the model's alignment in a predictable manner, because the mistake bounds of the safe responses used for alignment can be proved inapproximable. This inapproximability can hide the safe responses from attackers and make the model's alignment unpredictable. If the safe responses could be kept from attackers, responsible users would still benefit from testing and repairing the model's alignment despite its possible unpredictability. We also discuss the challenges of ensuring democratic AI alignment when access to the safe responses is limited, since that limited access helps make the model's alignment unpredictable for attackers.
Item Type: Preprint
Creators: Spelda, Petr; Stritecky, Vit
Subjects: Specific Sciences > Artificial Intelligence; General Issues > Formal Learning Theory
Depositing User: Dr. Petr Spelda
Date Deposited: 04 Jul 2025 12:46
Last Modified: 04 Jul 2025 12:46
Item ID: 25882
Date: 15 April 2025
URI: https://philsci-archive.pitt.edu/id/eprint/25882