Mass General Brigham researchers created BRIDGE, which identified significant gaps between AI’s performance on medical licensing exams and patient care tasks.
Researchers at Mass General Brigham developed BRIDGE, a multilingual benchmark that evaluates how well large language models (LLMs) understand clinical patient-care text, including language used in electronic health records (EHRs), across nine languages. The benchmarking tool could help clinicians evaluate and compare LLMs for use in specific contexts. Results are published in Nature Biomedical Engineering.
“Unlike many existing medical AI benchmarks, BRIDGE focuses on real-world clinical data sources that better reflect the complexity of real-world care,” said senior author Jie Yang, PhD, FACMI, FAMIA, of the Division of Pharmacoepidemiology and Pharmacoeconomics in the Mass General Brigham Department of Medicine. “BRIDGE can help clinicians select the right AI tools while guiding developers in improving model performance.”
Medical LLMs have traditionally been assessed using licensing exam questions, whose standardized language and medical knowledge may not fully reflect the complexity of real-world clinical interactions. The developers of BRIDGE created a framework for assessing LLMs using clinical text from EHRs, clinical case reports, and patient-doctor consultations. While the highest-performing LLM scored as high as 92% on standardized medical exams, it earned only 44.8% on BRIDGE, highlighting the model’s gaps in understanding the nuanced clinical language used in health care settings.
Yang and colleagues, including co-senior author Joshua Lin, MD, MPH, ScD, and co-first authors Jiageng Wu and Bowen Gu, used BRIDGE to systematically evaluate the performance of 95 LLMs from 59 clinical sources on real-world clinical tasks spanning the patient care continuum. The tasks covered 14 clinical specialties and included triage, information extraction, diagnosis, prognosis, and billing coding. The team also created a public, continuously updated leaderboard (which now includes 107 LLMs), enabling clinicians and AI developers to compare LLM performance across clinical tasks.
BRIDGE also revealed that AI performance varies across medical specialties. Because the benchmark includes clinical data in nine languages, it enables researchers to identify LLM performance gaps and support the development of more accurate and equitable AI tools for non-English-speaking patients.
Authorship: In addition to Yang and Lin, Mass General Brigham authors include Jiageng Wu, Bowen Gu, Richard Wyss, Rishi J Desai, and Sebastian Schneeweiss. Additional authors include Ren Zhou, Kevin Xie, Doug Snyder, Yixing Jiang, Valentina Carducci, Emily Alsentzer, Leo Anthony Celi, Adam Rodman, Jonathan H. Chen, and Santiago Romero-Brufau.
Disclosures: Lin has received research grants from Takeda, AbbVie, and UCB for projects unrelated to this study. Alsentzer reports consultant fees from Fourier Health. Schneeweiss is participating in investigator-initiated grants to the Brigham and Women’s Hospital from Boehringer Ingelheim, Takeda, and UCB unrelated to the topic of this study. He is an advisor to Aetion Inc., a software manufacturer, and to Temedica GmbH, a patient-oriented data generation company. His interests were declared, reviewed, and approved by the Brigham and Women’s Hospital in accordance with their institutional compliance policies. Chen reports cofounding Reaction Explorer, which develops and licenses organic chemistry education software, and receiving medical expert witness fees from Sutton Pierce, Younker Hyde MacFarlane, Sykes McAllister, and Elite Expert; consulting fees from ISHI Health; and honoraria or travel expenses for invited presentations by insitro, General Reinsurance Corporation, Cozeva, and other industry conferences, academic institutions, and health systems.
Funding: This study was partially funded by PCORI ME-2022C1-25646, Goldberg Scholarship and Brigham Research Institute, National Institute on Aging (RF1AG090405), and National Library of Medicine R01LM014667. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Paper cited: Wu, J. et al. “BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text” Nature Biomedical Engineering DOI: 10.1038/s41551-026-01719-2
Mass General Brigham is an integrated academic health care system, uniting great minds to solve the hardest problems in medicine for our communities and the world. Mass General Brigham connects a full continuum of care across a system of academic medical centers, community and specialty hospitals, a health insurance plan, physician networks, community health centers, home care, and long-term care services. Mass General Brigham is a nonprofit organization committed to patient care, research, teaching, and service to the community. In addition, Mass General Brigham is one of the nation’s leading biomedical research organizations with several Harvard Medical School teaching hospitals. For more information, please visit massgeneralbrigham.org.