The Mass General Brigham-developed system achieved high accuracy in validation studies, prompting an open-source release to enable further investigation.
A team of Mass General Brigham researchers has developed one of the first fully autonomous artificial intelligence (AI) systems capable of screening for cognitive impairment using routine clinical documentation. The system, which requires no human intervention or prompting after deployment, achieved 98% specificity in real-world validation testing. Results are published in npj Digital Medicine.
Alongside the publication, the team is releasing Pythia, an open-source tool that enables any healthcare system or research institution to deploy autonomous prompt optimization for their own AI screening applications.
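This announcement does not detail Pythia's interface, but autonomous prompt optimization generally follows a propose-evaluate-select loop. The Python sketch below illustrates that general pattern only; every function, variable, and scoring rule in it is a hypothetical stand-in, not part of the Pythia codebase.

```python
# A generic propose-evaluate-select loop for prompt optimization. This is a
# hedged illustration of the concept, NOT Pythia's actual API; every name
# here is hypothetical.
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

HINTS = [
    "Quote the sentence that supports your answer.",
    "Answer only 'concern' or 'no concern'.",
    "Weigh medication lists and imaging reports as evidence.",
]

def evaluate(prompt: str, labeled_notes: list[tuple[str, bool]]) -> float:
    # Placeholder scorer. In a real deployment this would run the local LLM
    # with `prompt` on each labeled note and measure agreement with the labels.
    return rng.random()

def mutate(prompt: str) -> str:
    # Placeholder for an LLM-generated revision of the current prompt.
    return prompt + " " + rng.choice(HINTS)

def optimize_prompt(seed_prompt: str, labeled_notes, rounds: int = 10) -> str:
    best_prompt, best_score = seed_prompt, evaluate(seed_prompt, labeled_notes)
    for _ in range(rounds):
        candidate = mutate(best_prompt)
        score = evaluate(candidate, labeled_notes)
        if score > best_score:  # keep a variant only if it scores better
            best_prompt, best_score = candidate, score
    return best_prompt

best = optimize_prompt("Does this note document cognitive concerns?", [])
```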
"We didn't build a single AI model — we built a digital clinical team," said corresponding author Hossein Estiri, PhD, director of the Clinical Augmented Intelligence (CLAI) research group and associate professor of medicine at Massachusetts General Hospital, a founding member of the Mass General Brigham healthcare system. "This AI system includes five specialized agents that critique each other and refine their reasoning, just like clinicians would in a case conference.”
Cognitive impairment remains significantly underdiagnosed in routine clinical care, and traditional screening tools and cognitive tests are resource-intensive to administer and difficult for patients to access. Yet early detection has become increasingly critical, especially with the recent approval of Alzheimer’s disease therapies that are most effective when administered early in the disease course.
“By the time many patients receive a formal diagnosis, the optimal treatment window may have closed,” said co-lead study author Lidia Moura, MD, PhD, MPH, director of Population Health and the Center for Healthcare Intelligence in the Department of Neurology at Mass General Brigham.
To better capture at-risk patients, the Mass General Brigham team developed an AI system that runs on an open-weight large language model and can be deployed locally within a hospital's information technology infrastructure. The system employs five agents, each serving a distinct function, that work collaboratively to make clinical determinations and then refine those determinations to correct errors and improve sensitivity and specificity.
These agents operate autonomously in an iterative loop, refining their detection capabilities through structured collaboration until performance targets are met or the system determines it has converged. No patient data are transmitted to external servers or cloud-based AI services.
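The announcement does not publish the agents' prompts or interfaces, so the following Python sketch shows only the general shape of such a workflow: role-specific steps applied in rounds until the determination stops changing. The five agent roles, their names, and the convergence rule here are illustrative assumptions, not the published system.

```python
# Illustrative shape of a five-agent refinement loop. Agent roles, names,
# and the convergence rule are assumptions, not the published system.
from dataclasses import dataclass, field

@dataclass
class CaseState:
    note_text: str
    label: str = "undetermined"              # working clinical determination
    rationale: list = field(default_factory=list)
    critiques: list = field(default_factory=list)

def extract_evidence(s: CaseState) -> CaseState:
    # Stand-in for an LLM call that surfaces cognition-related passages.
    s.rationale.append(f"scanned {len(s.note_text)} characters of notes")
    return s

def classify(s: CaseState) -> CaseState:
    # Stand-in for an LLM call that proposes a determination.
    s.label = "concern" if "memory" in s.note_text.lower() else "no concern"
    return s

def critique(s: CaseState) -> CaseState:
    # Stand-in for an LLM call that challenges the current determination.
    if s.label == "concern" and "narrative" not in s.note_text.lower():
        s.critiques.append("evidence may lack supporting narrative")
    return s

def refine(s: CaseState) -> CaseState:
    # Stand-in for an LLM call that revises the label given the critiques.
    if s.critiques:
        s.rationale.append("determination re-examined after critique")
    return s

def adjudicate(s: CaseState) -> CaseState:
    # Stand-in for a final agent that commits the round's determination.
    s.critiques.clear()
    return s

AGENTS = [extract_evidence, classify, critique, refine, adjudicate]

def run_case(note_text: str, max_rounds: int = 5) -> CaseState:
    state = CaseState(note_text=note_text)
    previous = None
    for _ in range(max_rounds):
        for agent in AGENTS:
            state = agent(state)
        if state.label == previous:   # converged: no change over a full round
            break
        previous = state.label
    return state

result = run_case("Patient reports worsening memory; MoCA not documented.")
print(result.label)
```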
The study analyzed more than 3,300 clinical notes from 200 anonymized patients at Mass General Brigham. By analyzing notes produced during routine healthcare visits, the system turns everyday documentation into an opportunity to screen for cognitive concerns, helping identify patients who may need a formal assessment.
“Clinical notes contain whispers of cognitive decline that busy clinicians can’t systematically surface,” said Moura. “This system listens at scale.”
When the AI system and human reviewers disagreed, an independent expert re-evaluated each case. Among these disagreement cases, the expert validated the AI's reasoning 58% of the time, meaning the system was often making sound clinical judgments that the initial human review had missed.
"We expected to find AI errors. Instead, we often found the AI was making defensible judgments based on the evidence in the notes," said Estiri.
Analysis of cases in which the AI was incorrect revealed systematic patterns: documentation limitations where cognitive concerns appeared only in problem lists without supporting narrative, and domain knowledge gaps where the system failed to recognize certain clinical indicators. The system excelled with comprehensive clinical narratives but struggled with isolated data lacking context.
Although the system achieved 91% sensitivity under balanced testing, sensitivity decreased to 62% under real-world conditions, where 33% of cases were positive, while specificity remained high at 98%. The researchers reported these calibration challenges to provide transparency and to guide future efforts to improve clinical reliability.
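Sensitivity and specificity do not depend on prevalence, but the predictive values that matter at the bedside do. As an illustration, the short calculation below derives the positive and negative predictive values implied by the reported 62% sensitivity and 98% specificity at 33% prevalence; these derived figures follow from standard formulas and are not numbers reported in the paper.

```python
# Illustrative arithmetic only: predictive values implied by the reported
# operating point (62% sensitivity, 98% specificity) at 33% prevalence.
# These derived numbers are not taken from the paper.
def predictive_values(sensitivity: float, specificity: float, prevalence: float):
    tp = sensitivity * prevalence                  # true positives
    fn = (1 - sensitivity) * prevalence            # false negatives (missed cases)
    fp = (1 - specificity) * (1 - prevalence)      # false positives
    tn = specificity * (1 - prevalence)            # true negatives
    ppv = tp / (tp + fp)   # probability a flagged patient truly has a concern
    npv = tn / (tn + fn)   # probability an unflagged patient truly does not
    return ppv, npv

ppv, npv = predictive_values(0.62, 0.98, 0.33)
print(f"PPV ≈ {ppv:.0%}, NPV ≈ {npv:.0%}")  # prints: PPV ≈ 94%, NPV ≈ 84%
```

In other words, at this operating point a positive flag is highly trustworthy, while the lower sensitivity means a negative result rules out a concern less reliably, which is consistent with the calibration challenges the researchers describe.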
"We're publishing exactly the areas in which AI struggles," said Estiri. "The field needs to stop hiding these calibration challenges if we want clinical AI to be trusted."
Authorship: In addition to Estiri and Moura, Mass General Brigham and Harvard Medical School co-authors include Jiazi Tian, Pedram Fard, Cameron Cagan, Neguine Rezaii, Rebeka Bustamante Rocha, Liqin Wang, Valdery Moura Junior, Deborah Blacker, Jennifer S. Haas, Chirag Patel, and Shawn N. Murphy.
Disclosures: The authors declare no competing interests.
Funding: This research was funded by the National Institutes of Health (NIH): the National Institute on Aging (grants RF1AG074372, R01AG074372, R01AG082693), and the National Institute of Allergy and Infectious Diseases (R01AI165535).
Paper cited: Tian et al. “An autonomous agentic workflow for clinical detection of cognitive concerns using large language models.” npj Digital Medicine, DOI: 10.1038/s41746-025-02324-4
Mass General Brigham is an integrated academic health care system, uniting great minds to solve the hardest problems in medicine for our communities and the world. Mass General Brigham connects a full continuum of care across a system of academic medical centers, community and specialty hospitals, a health insurance plan, physician networks, community health centers, home care, and long-term care services. Mass General Brigham is a nonprofit organization committed to patient care, research, teaching, and service to the community. In addition, Mass General Brigham is one of the nation’s leading biomedical research organizations with several Harvard Medical School teaching hospitals. For more information, please visit massgeneralbrigham.org.