Image: Researchers develop a retrospective evaluation process for LLM applications in medical scenarios.
Credit: Dr. Zhenchang Wang from Capital Medical University, China; Dr. Jiahong Dong from Beijing Tsinghua Changgung Hospital, China; Dr. Junbo Ge from Fudan University, China; Dr. Junmin Wei from Key Laboratory of Knowledge Mining and Service for Medical Journals, China
Image source link: https://www.sciencedirect.com/science/article/pii/S2667102625001044
A new expert consensus, made available online on 10 October 2025 and published on 1 November 2025 in Volume 5, Issue 4 of the journal Intelligent Medicine, sets out a structured framework for assessing large language models (LLMs) before they are introduced into clinical workflows. The guidance responds to the rapid uptake of artificial intelligence (AI) tools for diagnostic support, medical documentation, and patient communication, and to the corresponding need for consistent evaluation of safety, effectiveness, and fairness.
The consensus formalizes retrospective evaluation—testing fully trained models on real or simulated clinical data in specific care contexts, without further modifying the models—to verify performance, ethical compliance, and operational readiness prior to deployment.
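The retrospective setup can be made concrete with a short sketch. The following minimal Python example is illustrative only, not the consensus authors' tooling: the `run_retrospective_eval` function, the `dummy_model` stand-in, and the case format are all assumptions. It replays de-identified cases through a frozen model and records each output alongside its reference answer for later scoring.

```python
import json
from typing import Callable

def run_retrospective_eval(cases, query_model: Callable[[str], str]):
    """Replay de-identified clinical cases through a frozen model and pair
    each output with its reference answer; the model is never retrained
    or otherwise modified during the evaluation."""
    return [
        {
            "case_id": case["case_id"],
            "model_output": query_model(case["prompt"]),
            "reference": case["reference"],
        }
        for case in cases
    ]

# Hypothetical stand-in for the deployed model's inference endpoint.
def dummy_model(prompt: str) -> str:
    return "Obtain an ECG and serial troponin measurements."

cases = [
    {"case_id": "c001",
     "prompt": "58-year-old presenting with chest pain; next diagnostic step?",
     "reference": "Obtain ECG and serial troponins."},
]
print(json.dumps(run_retrospective_eval(cases, dummy_model), indent=2))
```

In a real evaluation, the collected output-reference pairs would then go to the multidisciplinary review team for the quantitative and qualitative scoring described below.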
Developed in line with World Health Organization guideline methods and registered on the Practice Guideline Registration for Transparency (PREPARE) platform (ID: PREPARE-2025CN503), the consensus draws on literature review, Delphi procedures, and multidisciplinary expert deliberation. In the final round, 35 experts achieved agreement on six recommendations.
What does the framework include?
- Evaluation workflows prioritizing scientific rigor, objectivity, comprehensiveness, and ethics (e.g., double-blind procedures, conflict-of-interest transparency).
- Integrated metrics combining quantitative measures (accuracy, recall, F1-score; BLEU/ROUGE for generation) with structured qualitative ratings (e.g., mean opinion scores for accuracy, completeness, safety, practicality, professionalism); a worked computation sketch follows this list.
- Multidisciplinary teams spanning clinicians, data and computer engineers, ethicists, legal experts, and statisticians, with standardized training and role definitions.
- Dataset design principles centered on clinical authenticity, broad representativeness across diseases, populations, and institutions, and fairness for vulnerable groups, with modular versioning and privacy/compliance safeguards.
- Feedback and versioning mechanisms to update standards as technology, regulations, or application scope evolve, including transparent dispute-resolution processes.
- Standardized reporting templates to improve transparency, reproducibility, and comparability across evaluations.
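To illustrate how the quantitative and qualitative metrics named above might be combined, here is a minimal, self-contained Python sketch. The labels, rater scores, and function names are hypothetical examples, not values or tooling from the consensus:

```python
from statistics import mean

def binary_prf(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def mean_opinion_scores(ratings):
    """Average rater scores per qualitative dimension (e.g., a 1-5 scale)."""
    return {dim: round(mean(scores), 2) for dim, scores in ratings.items()}

# Hypothetical reference labels vs. model predictions for a diagnostic task.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
precision, recall, f1 = binary_prf(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")

# Hypothetical structured ratings from three clinician reviewers.
ratings = {
    "accuracy": [4, 5, 4],
    "completeness": [3, 4, 4],
    "safety": [5, 5, 4],
    "practicality": [4, 3, 4],
    "professionalism": [5, 4, 5],
}
print(mean_opinion_scores(ratings))
```

Text-generation metrics such as BLEU and ROUGE would in practice be computed with established reference implementations rather than reimplemented by hand.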
The consensus also defines six key LLM capability domains for assessment: medical knowledge question answering; complex medical language understanding; diagnosis and treatment recommendation; medical documentation generation; multi-turn dialogue; and multimodal dialogue.
The consensus emphasizes essential safeguards for patient data protection and bias mitigation, along with the need for AI outputs to remain clinically explainable, positioning it to support safer, more reliable, and ethically governed LLM applications in healthcare systems worldwide.
***
Reference
DOI: 10.1016/j.imed.2025.09.001
About the journal
Intelligent Medicine is a peer-reviewed, open-access journal focusing on the integration of artificial intelligence, data science, and digital technology in clinical medicine and public health. It is published by the Chinese Medical Association in partnership with Elsevier. To learn more about Intelligent Medicine, please visit https://www.sciencedirect.com/journal/intelligent-medicine
Funding information
The authors received no financial support for this research.
Journal
Intelligent Medicine
Method of Research
Literature review
Subject of Research
Not applicable
Article Title
2025 Expert consensus on retrospective evaluation of large language model applications in clinical scenarios
Article Publication Date
1-Nov-2025
COI Statement
All authors declare no conflicts of interest.