News Release

AI language models struggle with basic hospital data tasks, study finds

Nine leading AI models were tested on simple administrative queries drawn from real-world emergency department records—and most failed unless paired with code-generation tools

Peer-Reviewed Publication

PLOS

A new study finds that large language models (LLMs), used with straightforward prompting, perform poorly on routine number-crunching tasks that hospital administrators depend on every day to track patients and allocate resources. The findings were published this week in the open-access journal PLOS Digital Health by Eyal Klang of the Icahn School of Medicine at Mount Sinai, New York, USA, and colleagues.

Hospitals rely on structured electronic health record (EHR) data to monitor patient counts and resources and to generate administrative reports. These tasks are currently handled by data analysts using programming languages, creating delays when staff need fast answers. AI tools known as large language models, such as GPT-4o and Llama, have been proposed to simplify that process.

In the new study, researchers evaluated nine leading LLMs on two basic administrative tasks—counting patients meeting a condition and filtering records based on multiple criteria—using data drawn from 50,000 real emergency department visits at the Mount Sinai Health System.
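The two task types can be illustrated with a short sketch. This is a hedged illustration only: the records, field names, and thresholds below are hypothetical stand-ins, not the study's actual EHR schema.

```python
# Hypothetical emergency-department records (not the study's real schema).
visits = [
    {"age": 67, "admitted": True,  "triage": "urgent"},
    {"age": 34, "admitted": False, "triage": "routine"},
    {"age": 81, "admitted": True,  "triage": "urgent"},
]

# Task type 1: count patients meeting a condition.
admitted_count = sum(1 for v in visits if v["admitted"])

# Task type 2: filter records on multiple criteria.
urgent_seniors = [v for v in visits
                  if v["triage"] == "urgent" and v["age"] >= 65]

print(admitted_count)       # 2
print(len(urgent_seniors))  # 2
```

Both operations are trivial for a database query or a few lines of code; the study asked whether LLMs could answer such questions reliably when given the table directly.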

The researchers found that straightforward prompting—asking the model a plain question like “how many patients in this table were admitted?”—produced uniformly poor results across all models. Chain-of-thought reasoning, in which the model is prompted to show step-by-step work before giving an answer, offered only modest improvements that degraded sharply as table size increased. Even GPT-4o, the top-performing model, saw accuracy drop from roughly 95% on the smallest datasets to below 60% on larger ones under chain-of-thought conditions.

A tool-based approach—where models were asked to generate code that was then executed—substantially improved accuracy for the most capable models, with GPT-4o and Qwen-2.5-72B achieving near-perfect performance. However, distilled DeepSeek models, optimized for speed and efficiency, struggled even with this approach. One model, Llama-3.1-8B, failed to produce usable output in the majority of trials and was excluded from further analysis.
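The tool-based setup can be sketched as follows. This is a simplified, hypothetical illustration of the general pattern (model emits code, the code is executed against the data), not the study's actual pipeline; `fake_llm_generate` stands in for a real model API call.

```python
# Hedged sketch of a tool-based approach: the model returns a code
# snippet instead of a direct answer, and the snippet is executed.

def fake_llm_generate(question: str) -> str:
    # A real system would call an LLM API here; we return a canned
    # snippet for illustration purposes only.
    return "result = sum(1 for v in visits if v['admitted'])"

# Hypothetical records (not the study's real schema).
visits = [
    {"admitted": True},
    {"admitted": False},
    {"admitted": True},
]

code = fake_llm_generate("How many patients in this table were admitted?")
namespace = {"visits": visits}
exec(code, namespace)        # execute the generated snippet
print(namespace["result"])   # 2
```

Because the arithmetic is done by executed code rather than by the model's text generation, the answer stays exact regardless of table size; production systems would additionally sandbox the generated code before running it.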

“Our findings indicate that without using a tool-based strategy, current LLMs are unsuitable for standalone use even on minimally complex administrative tasks in clinical settings,” says Benjamin Glicksberg. “Structured data tasks in clinical workflows will require agentic approaches that combine LLMs with code execution to ensure accuracy and consistency.”

###

In your coverage please use this URL to provide access to the freely available article in PLOS Digital Health: https://plos.io/49hqXIT

Citation: Klang E, Sorin V, Korfiatis P, Sawant AS, Freeman R, Charney AW, et al. (2026) Large language models are poor clinical administrators: An evaluation of structured queries in real-world electronic health records. PLOS Digit Health 5(5): e0001326. https://doi.org/10.1371/journal.pdig.0001326

Author Countries: United States

Funding: The author(s) received no specific funding for this work.

 


Disclaimer: AAAS and EurekAlert! are not responsible for the accuracy of news releases posted to EurekAlert! by contributing institutions or for the use of any information through the EurekAlert system.