PhilSci Archive

AI Safety Collides with the Overattribution Bias

Rushing, Bruce (2024) AI Safety Collides with the Overattribution Bias. [Preprint]

Text: rushing_aisafetycollides_manuscript.pdf (379kB)

Abstract

The field of Artificial Intelligence (AI) safety evaluations aims to test AI behavior for problematic capabilities like deception. However, some scientists have cautioned against using behavior to infer general cognitive abilities, because of the human tendency to overattribute cognition to everything. To avoid these errors, they recommend adopting a heuristic: behavior provides no evidence for cognitive capabilities unless some theoretical feature is present to justify that inference. We make that heuristic precise in terms of conditional independencies in our credences between behavior, cognitive capabilities, and the presence or absence of theoretical features. When made precise, the heuristic absurdly entails that failure at a behavioral task supports the presence of a theoretical feature. This is because the heuristic imposes inductive dependencies that conflict with our best causal models of cognition. Weakening the heuristic to allow only weak evidence between behavior and cognitive abilities leads to similar problems. Consequently, we suggest abandoning the heuristic and instead updating those causal models in light of the behavior observed when testing AIs for troublesome cognitive abilities.



Item Type: Preprint
Creators: Rushing, Bruce (ORCID: 0000-0002-0864-9272)
Keywords: AI safety, philosophy of machine learning, inductive bias, philosophy of AI, large language models
Subjects: Specific Sciences > Cognitive Science
Specific Sciences > Psychology > Comparative Psychology and Ethology
Specific Sciences > Artificial Intelligence
General Issues > Evidence
Depositing User: Dr Bruce Rushing
Date Deposited: 26 Jun 2024 17:19
Last Modified: 26 Jun 2024 17:19
Item ID: 23623
Date: 1 March 2024
URI: https://philsci-archive.pitt.edu/id/eprint/23623
