While Clinical AI Booms, LLM Reliability Becomes the Bottleneck

 

ELHS Newsletter 2026-3-20
Subscribe | All Issues

 

 

 

Dear Friends,

 

Clinical AI is booming.
A new Stanford–Harvard State of Clinical AI Report (2026) highlights rapid progress across the field. Yet, despite this acceleration, the report delivers a clear message:

👉 Advancement must be grounded in real-world outcomes—not just technical capability.

This aligns closely with what we are observing across recent studies.


February Signal: Reliability Is Now the Central Challenge

In February, multiple studies raised concerns about the reliability of large language models (LLMs) in medical applications:

  • Human–AI interaction is a key variable
    Improper use of chatbots can significantly reduce diagnostic accuracy—showing that performance is not fixed, but co-determined by how humans use AI.
  • Performance varies across diseases
    Studies found that ChatGPT Health triage recommendations vary substantially by condition, raising safety concerns for direct-to-consumer deployment.

A concise summary of these findings is available via the UCSF Codex digest, which is worth reviewing.


Why This Matters: Reliability = Adoption

In clinical medicine, reliable decision-making is non-negotiable.

As evidence accumulates showing variability and instability in LLM outputs, a critical question emerges:

Will GenAI be trusted enough to integrate into routine care?

The answer will likely shape the trajectory of clinical AI over the next decade.


Encouraging Progress: Toward Safer, Scalable AI

At the same time, February also brought encouraging advances:

1. Defining “Moral Competence” in AI
Google DeepMind proposed a roadmap for evaluating moral competence in LLMs—an important step toward trustworthy clinical AI systems.

2. AI Assisting Clinicians (Not Replacing Them)
Google’s AMIE helped general cardiologists manage complex cardiology cases, with blinded subspecialists preferring the AMIE-assisted assessments, reinforcing a key principle:
👉 The near-term value of GenAI lies in augmentation, not autonomy.

3. Expanding Evidence Across Clinical Domains
Promising results were reported in cancer care, emergency departments, and public health question answering, suggesting that context-specific deployment may be the most practical path forward.

4. Toward Adaptive Clinical Intelligence
A Harvard group proposed a clinical context-switching framework, enabling customizable predictions across diverse clinical tasks—moving beyond static models toward adaptive intelligence.


A Breakthrough Worth Trying: OpenScholar

One of the most practical advances this month comes from the Allen Institute:

👉 OpenScholar – an open-source, LLM-powered literature search engine
🔗 https://openscilm.allen.ai/

This tool addresses a major gap we identified in 2023—the persistent issue of hallucinated scientific references in LLMs (see the JAMA Network paper).

This is not just an incremental improvement—it represents a critical infrastructure layer for scientific and clinical AI.


Our Perspective: Reliability Is a System Property

At the ELHS Institute, our work on GenAI-enabled Learning Health System (GenAI-LHS) units is built around this exact challenge:

  • Embedding AI into routine clinical workflows
  • Generating real-world evidence continuously
  • Measuring performance across populations and settings
  • Adopting equity-by-design principles

Because ultimately:

Reliability is not a model property—it is a system property.


What Comes Next

In this issue, I also share a short conversation with my ChatGPT copilot exploring:

  • What health systems must do to ensure reliable AI
  • What governments should regulate and support
  • Why Learning Health Systems may provide the missing foundation

Warm regards,
AJ


AJ Chen, PhD
Founder & Principal Investigator, ELHS Institute
Silicon Valley, USA

🔗 ELHS Newsletters: https://elhsi.org/Newsletters
🔗 ELHS GenAI Platform: https://elhsi.com

 

~

 

From Page Mill

(Recent papers, news, and events showcasing the progress of GenAI and LHS) 

Haas, J., Bridgers, S., Manzini, A. et al. A roadmap for evaluating moral competence in large language models. Nature 650, 565–573 (2026).

[2026/2] The question of whether large language models (LLMs) can exhibit moral capabilities is of growing interest and urgency, as these systems are deployed in sensitive roles such as companionship and medical advising, and will increasingly be tasked with making decisions and taking actions on behalf of humans. These trends require moving beyond evaluating for mere moral performance, the ability to produce morally appropriate outputs, to evaluating for moral competence, the ability to produce morally appropriate outputs based on morally relevant considerations. Assessing moral competence is critical for predicting future model behaviour, establishing appropriate public trust and justifying moral attributions. However, both the unique architectures of LLMs and the complexity of morality itself introduce fundamental challenges. Here we identify three such challenges: the facsimile problem, whereby models may imitate reasoning without genuine understanding; moral multidimensionality, whereby moral decisions are influenced by a range of context-sensitive relevant moral and non-moral considerations; and moral pluralism, which demands a new standard for globally deployed artificial intelligence. We provide a roadmap for tackling these challenges, advocating for a suite of adversarial and confirmatory evaluations that will enable us to work towards a more scientifically grounded understanding and, in turn, a more responsible attribution of moral competence to LLMs.

Asai, A., He, J., Shao, R. et al. Synthesizing scientific literature with retrieval-augmented language models. Nature 650, 857–863 (2026).

[2026/2] Scientific progress depends on the ability of researchers to synthesize the growing body of literature. Can large language models (LLMs) assist scientists in this task? Here we introduce OpenScholar, a specialized retrieval-augmented language model (LM) that answers scientific queries by identifying relevant passages from 45 million open-access papers and synthesizing citation-backed responses. To evaluate OpenScholar, we develop ScholarQABench, the first large-scale multi-domain benchmark for literature search, comprising 2,967 expert-written queries and 208 long-form answers across computer science, physics, neuroscience and biomedicine. Despite being a smaller open model, OpenScholar-8B outperforms GPT-4o by 6.1% and PaperQA2 by 5.5% in correctness on a challenging multi-paper synthesis task from the new ScholarQABench. Although GPT-4o hallucinates citations 78–90% of the time, OpenScholar achieves citation accuracy on par with human experts.

Zhao, W., Wu, C., Fan, Y. et al. An agentic system for rare disease diagnosis with traceable reasoning. Nature (2026).

[2026/2 Shanghai Jiao Tong University] Here we present DeepRare—a multi-agent system for rare disease differential diagnosis decision support powered by large language models, integrating more than 40 specialized tools and up-to-date knowledge sources. DeepRare processes heterogeneous clinical inputs, including free-text descriptions, structured human phenotype ontology terms and genetic testing results to generate ranked diagnostic hypotheses with transparent reasoning linked to verifiable medical evidence. Evaluated across nine datasets from literature, case reports and clinical centres across Asia, North America and Europe spanning 14 medical specialties, DeepRare demonstrates exceptional performance on 2,919 diseases. In human-phenotype-ontology-based tasks, it achieves an average Recall@1 of 57.18%, outperforming the next best method by 23.79%; in multi-modal tests, it reaches 69.1% compared with Exomiser’s 55.9% on 168 cases. Expert review achieved 95.4% agreement on its reasoning chains, confirming their validity and traceability.

O’Sullivan, J.W., Palepu, A., Saab, K. et al. A large language model for complex cardiology care. Nat Med 32, 616–623 (2026).

[2026/2] We conducted a randomized controlled trial comparing large language model-assisted care with the usual care of complex patients suspected of having a genetic cardiomyopathy, and we curated a real-world dataset of complex cases from a subspecialist cardiology practice. Nine participating general cardiologists were provided with access to both clinical text reports and raw diagnostic data—including electrocardiograms, echocardiograms, cardiac magnetic resonance imaging scans and cardiopulmonary exercise testing—and were randomized to manage these cases, either with or without assistance from AMIE. We developed a ten-domain evaluation rubric used by three blinded subspecialists to evaluate the quality of triage, diagnosis and management. In our randomized controlled trial with retrospective patient data, subspecialists favored large language model-assisted responses overall, and for the management plan and diagnostic testing domains, with the remaining domains considered a tie. Overall, subspecialists preferred AMIE-assisted cardiology assessments 46.7% of the time, compared with 32.7% for cardiologists alone (P = 0.02), with 20.6% rated as a tie. Subspecialists also quantified errors, extra and missing content, reasoning and potential bias. Cardiologists alone had more clinically significant errors (24.3% versus 13.1%, P = 0.033) and more missing content (37.4% versus 17.8%, P = 0.0021) than cardiologists assisted by AMIE. Lastly, cardiologists who used AMIE reported that AMIE helped their assessment more than half the time (57.0%) and saved time in 50.5% of cases.

Li, M.M., Reis, B.Y., Rodman, A. et al. Scaling medical AI across clinical contexts. Nat Med 32, 439–448 (2026).

[2026/2] We envision context switching as an emergent solution. Context switching adjusts model reasoning at inference, without retraining. Generative models can tailor outputs to patient biology, care setting or disease. Multimodal models can switch between notes, laboratory results, imaging and genomics, even when some data are missing or delayed. Agent models can coordinate tools and roles based on task and user context. In each case, context switching enables medical AI to adapt across specialties, populations and geographies. This approach requires advances in data design, model architectures and evaluation frameworks, and establishes a foundation for medical AI that scales to an infinite number of contexts, while remaining reliable and suited to real-world care.

Liu, Y., Carrero, Z.I., Jiang, X. et al. Benchmarking large language model-based agent systems for clinical decision tasks. npj Digit. Med. (2026).

[2026/2] In this study, we evaluate two such systems: the open-source OpenManus, built on Meta’s Llama-4 and extended with medically customized agents; and Manus, a proprietary agent system employing a multistep planner-executor-verifier architecture. Both systems were assessed across three benchmark families: AgentClinic, a stepwise dialog-based diagnostic simulation; MedAgentsBench, a knowledge-intensive medical QA dataset; and Humanity’s Last Exam (HLE), a suite of challenging text-only and multimodal questions. Despite access to advanced tools (e.g., web browsing, code development and execution, and text file editing), agent systems yielded only modest accuracy gains over baseline LLMs, reaching 60.3% and 28.0% in AgentClinic MedQA and MIMIC, 30.3% on MedAgentsBench, and 8.6% on HLE text. Multimodal accuracy remained low (15.5% on multimodal HLE, 29.2% on AgentClinic NEJM), while resource demands increased substantially, with >10× token usage and >2× latency. Although 89.9% of hallucinations were filtered by in-agent safeguards, hallucinations remained prevalent.

Bean, A.M., Payne, R.E., Parsons, G. et al. Reliability of LLMs as medical assistants for the general public: a randomized preregistered study. Nat Med 32, 609–615 (2026).

[2026/2] We tested whether LLMs can assist members of the public in identifying underlying conditions and choosing a course of action (disposition) in ten medical scenarios in a controlled study with 1,298 participants. Participants were randomly assigned to receive assistance from an LLM (GPT-4o, Llama 3, Command R+) or a source of their choice (control). Tested alone, LLMs complete the scenarios accurately, correctly identifying conditions in 94.9% of cases and disposition in 56.3% on average. However, participants using the same LLMs identified relevant conditions in fewer than 34.5% of cases and disposition in fewer than 44.2%, both no better than the control group. We identify user interactions as a challenge to the deployment of LLMs for medical advice.

Ramaswamy, A., Tyagi, A., Hugo, H. et al. ChatGPT Health performance in a structured test of triage recommendations. Nat Med (2026).

[2026/2] ChatGPT Health launched in January 2026 as OpenAI’s consumer health tool, reaching millions of users. Here, we conducted a structured stress test of triage recommendations using 60 clinician-authored vignettes across 21 clinical domains under 16 factorial conditions (960 total responses). Performance followed an inverted U-shaped pattern, with the most dangerous failures concentrated at clinical extremes: non-urgent presentations (35%) and emergency conditions (48%). Among gold-standard emergencies, the system under-triaged 52% of cases, directing patients with diabetic ketoacidosis and impending respiratory failure to 24–48-hour evaluation rather than the emergency department, while correctly triaging classical emergencies such as stroke and anaphylaxis. 

Zhou, X., Zhou, J., Wang, C. et al. A suite of large language models for public health infoveillance. npj Digit. Med. (2026).

[2026/2] To address this gap, we developed PH-LLM—a novel suite of large language models (LLMs) designed for real-time public health monitoring. We curated a multilingual training corpus and trained PH-LLM using QLoRA and LoRA plus, leveraging Qwen 2.5. We constructed a benchmark comprising 19 English and 20 multilingual held-out tasks and evaluated PH-LLM’s zero-shot performance. PH-LLM consistently outperformed baseline LLMs of similar and larger sizes. PH-LLM-14B and PH-LLM-32B surpassed Qwen2.5-72B-Instruct, Llama-3.1-70B-Instruct, Mistral-Large-Instruct-2407, and GPT-4o in both English tasks (≥56.0% vs. ≤52.3%) and multilingual tasks (≥59.6% vs. ≤59.1%). PH-LLM represents a significant advancement in real-time public health infoveillance, offering state-of-the-art multilingual capabilities and cost-effective solutions for monitoring public sentiment on health issues.

Li, M., Zhan, Z., Huang, J. et al. CancerLLM: a large language model in cancer domain. npj Digit. Med. (2026).

[2026/2] To this end, we present CancerLLM, a 7-billion-parameter Mistral-style model trained on 2.7 M clinical notes and 515 K pathology reports across 17 cancer types, followed by fine-tuning on cancer phenotype extraction and diagnosis generation tasks. Our evaluation demonstrated that CancerLLM achieved strong performance on internal benchmarks, with an F1 score of 91.78% on phenotype extraction and 86.81% on diagnosis generation. It outperformed existing LLMs, with an average F1 score improvement of 9.23%. Additionally, CancerLLM demonstrated efficiency in time and GPU usage, as well as robustness, compared with other LLMs. We demonstrated that CancerLLM can potentially provide an effective and robust solution to advance clinical research and practice in the cancer domain.

Naderi, B., Liu, L., Ghandehari, A. et al. The role of large language models in emergency care: a comprehensive benchmarking study. npj Artif. Intell. 2, 24 (2026).

[2026/2] With emergency departments (EDs) increasingly overburdened, Large Language Models (LLMs) may help streamline workflow and decision-making. We evaluated their emergency medicine knowledge and performance in simulated ED tasks. This two-part study first tested factual knowledge of 18 LLMs using a curated MedMCQA subset covering 12 ED chief complaints, assessing accuracy, precision, and recall. Five models (GPT-5, GPT-4, Claude 3.5, Claude 4, and LLaMA 3.1) were then evaluated on patient summaries, Emergency Severity Index scoring, investigative questioning, management planning, and differential diagnosis across 12 simulated ED cases presented through four sequential information levels. Physicians rated outputs for accuracy, safety, and clinical relevance, with performance differences analyzed statistically. LLaMA-4 Maverick achieved the highest factual accuracy (90.7%), followed by LLaMA-3.1-70B (90.1%). In clinical tasks, GPT-5 outperformed all models (Level 2 onwards, p < 0.05), with performance stable or improving as complexity increased. Claude 3.5 ranked next, while Claude 4 performed slightly lower but remained stable with complexity. LLaMA-3.1 and GPT-4 ranked lowest and showed the greatest degradation. All models undertriaged except Claude 3.5, which initially overtriaged. GPT-5 demonstrated the strongest clinical reasoning and scalability with complexity, while LLaMA models excelled in factual recall. Findings suggest a generational leap in reasoning performance and support GPT-5 as a potential ED decision-support tool.

 

~

 

In the Eyes of GPTs

(AJ in conversation with ChatGPT to understand how machines view human healthcare)

 

 What are the key requirements that health systems must meet to deliver reliable clinical interventions, such as software as a medical device (SaMD), and how do these apply to GenAI in routine care?

 ChatGPT

Key Requirements for Reliable Clinical AI

1. Clinical Validity
AI must demonstrate accurate diagnostic or predictive performance against gold standards, with results stratified by disease type and patient subgroups. For GenAI, this requires moving beyond benchmarks to rigorous, real-world and prospective validation.

2. Clinical Utility
The system must improve meaningful clinical outcomes—such as diagnosis, management, or patient safety—not just accuracy. Evidence should show that AI changes clinician decisions and leads to better care in real workflows.

3. Reliability & Consistency
Outputs must be stable and reproducible across repeated runs, minor input changes, and different users. For GenAI, reducing variability and standardizing outputs are essential to ensure clinicians can depend on results (a minimal measurement sketch follows this list).

4. Safety & Risk Management
Clear identification of risks, failure modes, and mitigation strategies is required, along with human oversight. Systems must define when outputs should not be trusted and include safeguards for high-risk scenarios.

5. Human Factors & Workflow Integration
AI must integrate seamlessly into clinical workflows and support decision-making without increasing cognitive burden. Structured, timely, and actionable outputs are key to real-world adoption.

6. Transparency & Explainability
Clinicians need understandable reasoning or supporting evidence behind AI outputs. While full interpretability may be unrealistic, traceable logic and clear rationale are necessary for trust.

7. Generalizability & External Validation
Performance must hold across different institutions, populations, and care settings. External validation is critical to ensure the system is not limited to a single dataset or environment.

8. Equity & Bias Control
AI must be evaluated for performance across demographic and socioeconomic groups to prevent disparities. Equity should be measured continuously and addressed proactively in system design.

9. Continuous Monitoring
Post-deployment performance must be tracked to detect drift, degradation, or emerging risks. Reliable systems require ongoing evaluation and iterative improvement within a learning health system.

10. Regulatory & Governance Compliance
AI systems must meet regulatory standards, including clear intended use, validation evidence, and risk classification. Governance frameworks are essential for safe, scalable deployment in clinical care.
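
To make point 3 above measurable in practice, here is a minimal Python sketch of one way to quantify run-to-run consistency: re-run the same clinical vignette several times and report the share of answers that agree with the most common one. The ask_model function is a hypothetical stand-in for a real inference call (simulated here with a weighted random draw); the labels, weights, and run count are illustrative assumptions, not validated values.

    import random
    from collections import Counter

    def ask_model(vignette: str) -> str:
        # Hypothetical stand-in for a real LLM call; simulates
        # run-to-run variability with a weighted random draw.
        return random.choices(
            ["emergency", "24-48h evaluation", "self-care"],
            weights=[0.7, 0.2, 0.1],
        )[0]

    def consistency_rate(vignette: str, n_runs: int = 20) -> float:
        # Share of runs agreeing with the modal answer: 1.0 means fully
        # reproducible; values well below 1.0 reflect the run-to-run
        # variability the February reliability studies describe.
        answers = [ask_model(vignette) for _ in range(n_runs)]
        _, modal_count = Counter(answers).most_common(1)[0]
        return modal_count / n_runs

    print(f"Consistency: {consistency_rate('55M, chest pain, diaphoresis'):.2f}")

The same statistic can be stratified by condition and by user phrasing, which is exactly where the triage studies above found the largest variability.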

 

 Given the importance of policy and regulation in shaping the healthcare environment, what should governments do to ensure that GenAI can be used reliably and safely in routine clinical care?

 ChatGPT

What Governments Should Do to Enable Reliable GenAI in Routine Care

1. Establish Clear, Risk-Based Regulatory Pathways
Governments should define how GenAI systems are classified (e.g., SaMD vs. clinical support tools) and create tiered, risk-based approval pathways. Clear guidance on intended use, evidence requirements, and lifecycle management will reduce uncertainty and accelerate responsible adoption.

2. Define Evidence Standards Beyond Accuracy
Regulators should require evidence of clinical utility and real-world impact, not just model performance. This includes pragmatic trials, workflow-integrated studies, and outcome-based evaluation aligned with how care is actually delivered.

3. Mandate Real-World Performance Monitoring
Approval should not be a one-time event. Governments should require continuous post-deployment monitoring, including performance drift, safety signals, and subgroup outcomes, with mechanisms for rapid updates or withdrawal if risks emerge (a minimal drift-monitoring sketch follows this list).

4. Set Reliability and Robustness Benchmarks
Standards should explicitly address GenAI-specific risks such as variability, hallucination, and prompt sensitivity. Governments can define minimum thresholds for consistency, reproducibility, and error reporting before systems are approved for clinical use.

5. Require Transparency and Auditability
Policies should ensure that AI outputs are traceable and auditable, including documentation of training data sources, evaluation datasets, and system limitations. This is essential for clinical trust, accountability, and regulatory oversight.

6. Embed Equity Requirements into Regulation
Governments should require reporting of performance across demographic and socioeconomic groups and mandate mitigation strategies for identified disparities. Equity should be treated as a measurable safety and quality requirement.

7. Support Data Infrastructure and Interoperability
Reliable AI depends on high-quality, interoperable data. Governments should invest in standards (e.g., FHIR), data-sharing frameworks, and secure infrastructure to enable scalable validation and deployment across health systems.

8. Fund Real-World Evaluation Ecosystems
Public funding should support learning health system (LHS) infrastructures, multi-site clinical research networks, and pragmatic evaluation platforms to generate continuous evidence for GenAI in routine care.

9. Define Human-AI Responsibility and Liability
Clear legal frameworks are needed to delineate responsibility among clinicians, health systems, and AI developers. This reduces adoption barriers and ensures accountability in case of errors or harm.

10. Align Incentives with Reliable Use
Reimbursement and policy incentives should reward validated, high-impact AI use—not mere adoption. Linking payment models to demonstrated improvements in outcomes and safety will drive meaningful integration into care.
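
As a concrete illustration of point 3 above, here is a minimal Python sketch of post-deployment drift monitoring: log whether each AI recommendation matched the adjudicated ground truth, track a rolling agreement rate, and raise a flag when it falls below a floor. The window size and threshold are illustrative assumptions, not regulatory values.

    from collections import deque

    class DriftMonitor:
        # Rolling agreement between model outputs and adjudicated labels.
        def __init__(self, window: int = 200, floor: float = 0.85):
            self.results = deque(maxlen=window)  # 1 = model matched ground truth
            self.floor = floor

        def record(self, model_label: str, adjudicated_label: str) -> None:
            self.results.append(int(model_label == adjudicated_label))

        def check(self) -> tuple[float, bool]:
            # Returns (rolling accuracy, alert flag); an alert should
            # trigger review, update, or withdrawal per point 3.
            if not self.results:
                return 1.0, False
            acc = sum(self.results) / len(self.results)
            return acc, acc < self.floor

    monitor = DriftMonitor()
    monitor.record("emergency", "emergency")
    monitor.record("self-care", "emergency")  # an under-triage event
    print(monitor.check())  # (0.5, True): accuracy below the 0.85 floor

Reporting such rolling metrics per demographic subgroup would link this requirement directly to the equity mandate in point 6.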

Bottom line:

Governments must shift from enabling innovation alone → to engineering trustworthy ecosystems, where GenAI is continuously evaluated, regulated, and improved within real-world care.

 

 I have studied the Learning Health System (LHS) framework since 2012 and believe it addresses a major structural gap in current health systems. It now appears that LHS could provide the foundation for ensuring AI reliability in care delivery. Could you briefly explain why this is the case and how LHS enables reliable use of GenAI in routine care?

 ChatGPT

Why Learning Health Systems (LHS) Enable Reliable GenAI in Care
GenAI reliability is not just a model issue—it is a system-level property requiring continuous evidence generation, feedback, and adaptation. Traditional health systems are static and episodic, while GenAI is dynamic and context-sensitive. The Learning Health System (LHS) fills this structural gap by embedding data capture, evaluation, and improvement directly into routine care, making reliability measurable and continuously optimized rather than assumed at deployment.

How LHS Provides Structural Guarantees for AI Reliability

1. Continuous Real-World Evaluation
LHS captures data from every patient encounter and uses it to evaluate AI performance in real time. This enables ongoing measurement of accuracy, safety, and clinical impact across diverse settings—far beyond one-time validation.

2. Closed-Loop Learning Cycle
Predictions → clinician decisions → patient outcomes → feedback into the system. This loop allows GenAI to be iteratively refined, ensuring that performance improves and adapts to real clinical complexity (a minimal sketch of such a loop record follows this list).

3. Standardized Workflows and Outputs
LHS enforces structured data inputs and outputs (e.g., standardized diagnostic reports), reducing variability from both clinicians and AI, and improving consistency and reproducibility.

4. Embedded Safety and Oversight Mechanisms
Within LHS, AI operates under defined clinical workflows with human oversight, predefined escalation rules, and monitoring of failure modes—turning safety into a system feature, not an afterthought.

5. Population-Level Monitoring and Equity Measurement
Because LHS aggregates data across populations, it enables continuous tracking of performance across subgroups, making it possible to detect and correct disparities in AI behavior.

6. Rapid Evidence Generation at Scale
LHS transforms routine care into a continuous clinical study, allowing health systems to generate large-scale, real-world evidence on AI effectiveness without relying solely on slow, expensive RCTs.

7. Governance and Accountability Infrastructure
LHS integrates governance processes—protocols, metrics, audit trails—into care delivery, ensuring that AI systems are monitored, updated, and held accountable over time.
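
To ground points 2 and 5 above, here is a minimal Python sketch of the closed-loop record an LHS might log for every AI-assisted encounter, together with a per-subgroup accuracy summary for disparity detection. The field names and metric are illustrative assumptions, not a published LHS schema.

    from dataclasses import dataclass
    from collections import defaultdict

    @dataclass
    class EncounterRecord:
        ai_prediction: str       # what the model suggested
        clinician_decision: str  # what the clinician actually did
        patient_outcome: str     # adjudicated outcome fed back into the loop
        subgroup: str            # demographic or site label for equity checks

    def subgroup_accuracy(records: list[EncounterRecord]) -> dict[str, float]:
        # Agreement between AI prediction and adjudicated outcome,
        # stratified by subgroup so disparities become visible.
        hits, totals = defaultdict(int), defaultdict(int)
        for r in records:
            totals[r.subgroup] += 1
            hits[r.subgroup] += int(r.ai_prediction == r.patient_outcome)
        return {g: hits[g] / totals[g] for g in totals}

    records = [
        EncounterRecord("emergency", "emergency", "emergency", "site A"),
        EncounterRecord("self-care", "emergency", "emergency", "site B"),
    ]
    print(subgroup_accuracy(records))  # {'site A': 1.0, 'site B': 0.0}

Fed back into model updates and workflow rules, these records are what turn reliability into the continuously managed system property described above.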

Bottom Line

LHS converts AI reliability from a one-time validation problem into a continuous, system-managed process.

This is why LHS is not just compatible with GenAI—it is likely the missing infrastructure required for safe, scalable, and equitable adoption of AI in routine clinical care.

 

 

~

 

🔹 ELHS Institute Resources 🔹

🤖 ELHS GenAI Platform

🌐 ELHS Institute Website

📰 ELHS Newsletters

▶️ ELHS Videos

👉 For Clinical AI technology support, contact us at support@elhsi.org 📩

 

~ the end ~