Rushing, Bruce (2024) AI Safety Collides with the Overattribution Bias. [Preprint]
Text: rushing_aisafetycollides_manuscript.pdf (379kB)
Abstract
The field of Artificial Intelligence (AI) safety evaluations aims to test AI behavior for problematic capabilities like deception. However, some scientists have cautioned against using behavior to infer general cognitive abilities because of the human tendency to overattribute cognition to everything. They recommend adopting a heuristic to avoid these errors, which states that behavior provides no evidence for cognitive capabilities unless some theoretical feature is present to justify that inference. We make that heuristic precise in terms of our credences' conditional independencies between behavior, cognitive capabilities, and the presence or absence of theoretical features. When made precise, the heuristic absurdly entails that failure at a behavioral task supports the presence of a theoretical feature. This is because the heuristic imposes inductive dependencies that conflict with our best causal models of cognition. Weakening the heuristic to allow only weak evidence between behavior and cognitive abilities leads to similar problems. Consequently, we suggest abandoning the heuristic and instead updating those causal models in light of the behavior observed when testing AIs for troublesome cognitive abilities.
Item Type: Preprint
Creators: Rushing, Bruce
Keywords: AI safety, philosophy of machine learning, inductive bias, philosophy of AI, large language models
Subjects: Specific Sciences > Cognitive Science; Specific Sciences > Psychology > Comparative Psychology and Ethology; Specific Sciences > Artificial Intelligence; General Issues > Evidence
Depositing User: Dr Bruce Rushing
Date Deposited: 26 Jun 2024 17:19
Last Modified: 26 Jun 2024 17:19
Item ID: 23623
Date: 1 March 2024
URI: https://philsci-archive.pitt.edu/id/eprint/23623