PhilSci Archive

Generalization Bias in Large Language Model Summarization of Scientific Research

Peters, Uwe and Chin‐Yee, Benjamin (2025) Generalization Bias in Large Language Model Summarization of Scientific Research. [Preprint]

AD81F1DA-0C0B-11F0-8E2E-B836B7DD53B3.pdf - Accepted Version

Abstract

Artificial intelligence chatbots driven by large language models (LLMs) have the potential to increase public science literacy and support scientific research, as they can quickly summarize complex scientific information in accessible terms. However, when summarizing scientific texts, LLMs may omit details that limit the scope of research conclusions, leading to generalizations of results broader than warranted by the original study. We tested 10 prominent LLMs, including ChatGPT-4o, ChatGPT-4.5, DeepSeek, LLaMA 3.3 70B, and Claude 3.7 Sonnet, comparing 4,900 LLM-generated summaries with their original scientific texts. Even when explicitly prompted for accuracy, most LLMs produced broader generalizations of scientific results than those in the original texts, with DeepSeek, ChatGPT-4o, and LLaMA 3.3 70B overgeneralizing in 26–73% of cases. In a direct comparison of LLM-generated and human-authored science summaries, LLM summaries were nearly five times more likely to contain broad generalizations (OR = 4.85, 95% CI [3.06, 7.70], p < 0.001). Notably, newer models tended to perform worse in generalization accuracy than earlier ones. Our results indicate a strong bias in many widely used LLMs towards overgeneralizing scientific conclusions, posing a significant risk of large-scale misinterpretations of research findings. We highlight potential mitigation strategies, including lowering LLM temperature settings and benchmarking LLMs for generalization accuracy.



Item Type: Preprint
Creators: Peters, Uwe (u.peters@uu.nl); Chin‐Yee, Benjamin
Keywords: AI; large language models; ChatGPT; generalization; bias; overgeneralization; science communication
Subjects: Specific Sciences > Artificial Intelligence > AI and Ethics
Specific Sciences > Medicine > Clinical Trials
Specific Sciences > Psychology > Comparative Psychology and Ethology
Specific Sciences > Artificial Intelligence
Specific Sciences > Medicine > Health and Disease
Specific Sciences > Psychology > Judgment and Decision Making
Specific Sciences > Artificial Intelligence > Machine Learning
Specific Sciences > Medicine
Depositing User: Dr. Uwe Peters
Date Deposited: 24 Apr 2025 12:57
Last Modified: 24 Apr 2025 12:57
Item ID: 25144
Date: 23 April 2025
URI: https://philsci-archive.pitt.edu/id/eprint/25144

