(Paving the Way for Global Health Equity with Generative AI and Learning Health Systems)
Dear Friends,
As we enter 2026, I believe the healthcare AI industry stands at a critical crossroads.
In one direction, an earth-shaking, trillion-dollar player—OpenAI—has rapidly moved forward, announcing ChatGPT Health and the OpenAI API for Healthcare, encouraging both consumers and clinicians to use ChatGPT for health-related purposes. OpenAI has reported one prospective study conducted in high-stress primary care settings in Africa, showing remarkable reductions in medical errors.
The key question is: Is this level of preparation sufficient to responsibly transform the healthcare AI industry at scale?
In the other direction, a growing body of academic research signals the need to slow down. Evidence increasingly suggests that premature use of generative AI in routine clinical care—without appropriate guardrails—poses real risks to patients. Researchers at Stanford have warned that even advanced large language models (LLMs), including GPT-5-level systems, present safety risks in medicine. They emphasize that today’s best LLMs still struggle to reliably distinguish facts from beliefs and may hallucinate when uncertain. These limitations translate into legal risks in AI-assisted healthcare, dangers in mental health applications, and serious concerns around AI-driven insurance and coverage decisions.
At Harvard Medical School, healthcare AI thought leader Prof. Zak Kohane has raised an even more fundamental question facing humanity: When AI systems make life-or-death recommendations, whose values should they reflect? He is leading the Human Value Project to systematically evaluate how AI models respond to ethical dilemmas in medicine, measuring both their default behaviors and their capacity to align with human values. Early findings are concerning: leading LLMs behave differently for different user groups. If medical AI lacks consistent human values, is it truly ready to care for human health?
Since early 2023, when ChatGPT first captured the medical community’s attention, I have been systematically evaluating multiple top LLMs for disease prediction. What I have observed is significant variability—and occasional hallucination—even among the best models when presented with the same clinical cases. These findings reinforce a critical conclusion: LLMs are not ready for routine clinical care unless supported by robust clinical evidence. I strongly agree with thought leaders at Stanford and Harvard that GenAI systems integrated into clinical workflows must undergo rigorous evaluation, similar to other software as a medical device, given the potential for patient harm. Moreover, once deployed, GenAI systems must be continuously monitored to detect unknown or emergent risks.
I am also concerned that healthcare GenAI may repeat the boom-and-bust cycle experienced by earlier generations of traditional healthcare AI. Clinical evidence generation remains slow, costly, and fragmented, making widespread adoption of GenAI in routine care unlikely within the next 5–10 years under current paradigms (see my recent comment on a JAMA paper regarding GenAI adoption rates). If the massive investments flowing into healthcare AI fail to translate into real-world clinical value and revenue, the result could be an industry-wide correction—or crash—regardless of continued advances in GenAI technology itself. Preventing this outcome will require fundamentally new innovations that can accelerate clinical evidence generation at scale.
Below, I continue this discussion with my copilot, ChatGPT, exploring the crossroads now facing healthcare GenAI. I invite you to read on and reflect with me on the path forward.
Warm regards,
AJ
AJ Chen, PhD
Founder and PI, ELHS Institute
Silicon Valley, USA
ELHS Newsletters: https://elhsi.org/Newsletters
ELHS GenAI Platform: https://elhsi.com
~

From Page Mill
(Recent papers, news, and events showcasing the progress of GenAI and LHS)
Carone M, Rotnitzky A. Reinforcement Learning for Finding Optimal Dynamic Treatment Regimes Using Observational Data. JAMA. 2026;335(3):267–268. doi:10.1001/jama.2025.20541
[2025/12] A leading approach to reinforcement learning is called Q-learning, an overarching term for algorithms used to find optimal treatment regimes from longitudinal data, based on recursively inferring values resulting from various actions. Q-learning algorithms can be used in 2 distinct settings. The first is online learning, where interactive experimentation is possible (ie, when treatment decisions can be altered in real time to learn their effects and improve future decision-making). The second is offline or batch learning, where only passive use of existing data—such as observational records documenting clinical actions and subsequent outcomes—is feasible.
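For readers who want a concrete picture of the offline setting, the sketch below illustrates two-stage batch Q-learning on synthetic observational data: the stage-2 Q-function is fit first, and its optimized value is then regressed back onto stage-1 information. The variables, linear models, and data are illustrative assumptions, not the approach described in the article.

```python
# Minimal sketch of offline (batch) Q-learning for a two-stage dynamic
# treatment regime on hypothetical observational data. Variable names and
# the simple linear models are illustrative assumptions only.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)                          # baseline covariate
a1 = rng.integers(0, 2, size=n)                  # stage-1 treatment observed in the data
x2 = x1 + 0.5 * a1 + rng.normal(size=n)          # intermediate covariate
a2 = rng.integers(0, 2, size=n)                  # stage-2 treatment observed in the data
y = x2 + a2 * (x2 > 0) + rng.normal(size=n)      # final outcome (larger is better)

# Stage 2: regress the outcome on (state, action) to estimate Q2.
X2 = np.column_stack([x2, a2, x2 * a2])
q2 = LinearRegression().fit(X2, y)

def q2_value(state, action):
    feats = np.column_stack([state, np.full_like(state, action), state * action])
    return q2.predict(feats)

# Value under the optimal stage-2 rule (take the better of the two actions).
v2 = np.maximum(q2_value(x2, 0), q2_value(x2, 1))

# Stage 1: recursively regress that value back onto (baseline state, stage-1 action).
X1 = np.column_stack([x1, a1, x1 * a1])
q1 = LinearRegression().fit(X1, v2)

# The estimated regime picks, at each stage, the action with the higher Q value.
```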
Palmieri S, Robertson CT, Cohen IG. New Guidance on Responsible Use of AI. JAMA. 2026;335(3):207–208. doi:10.1001/jama.2025.23059
[2025/12] The Joint Commission (TJC, the primary US hospital accreditor) has partnered with the Coalition for Health AI (CHAI, a clinician-led group that includes industry and other stakeholders) to issue new guidance on the responsible use of AI in health care, building on prior statements from many groups, including the American Medical Association. Core recommendations include establishing multidisciplinary AI governance committees, validating models on local patient data and workflows before deployment, and instituting continuous postmarket monitoring for drift, performance degradation, or bias.
Emma Chen, et al. International Retrospective Observational Study of Continual Learning for AI on Endotracheal Tube Placement from Chest Radiographs. NEJM AI 2026;3(1). DOI: 10.1056/AIoa2500522.
[2025/12] Medical artificial intelligence (AI) models often underperform when deployed at new hospitals despite strong performance during development, creating a need for effective adaptation strategies that maintain institutional privacy. Continual learning (i.e., repeatedly retraining a model at each new hospital where it is deployed) can improve performance during and after deployment. Despite using only 50 training data points from each hospital, continual learning surpassed traditional fine-tuning as a method for improving medical AI model generalization.
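The sketch below illustrates the continual-learning pattern described here: a single model keeps being updated with a small labeled batch from each new deployment site instead of being frozen after development. The classifier, synthetic features, and simulated site shifts are illustrative assumptions, not the study's chest-radiograph model.

```python
# Minimal sketch of continual learning across hospitals: update one shared model
# with ~50 labeled local examples per new site. All data here are synthetic.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier(loss="log_loss", random_state=0)

def hospital_batch(shift, n=50):
    """Simulate a small labeled sample whose feature distribution shifts by site."""
    X = rng.normal(loc=shift, size=(n, 20))
    y = (X[:, 0] + rng.normal(scale=0.5, size=n) > shift).astype(int)
    return X, y

# Development site: initial training on a larger sample.
X0, y0 = hospital_batch(shift=0.0, n=500)
model.partial_fit(X0, y0, classes=[0, 1])

# Deployment sites: continual updates with only 50 local examples each.
for site_shift in [0.5, 1.0, 1.5]:
    Xs, ys = hospital_batch(shift=site_shift, n=50)
    print("before update:", round(model.score(Xs, ys), 2))
    model.partial_fit(Xs, ys)          # retrain on the new hospital's data
    print("after update: ", round(model.score(Xs, ys), 2))
```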
Philip Chung, et al. Verifying Facts in Patient Care Documents Generated by Large Language Models Using Electronic Health Records. NEJM AI 2026;3(1). DOI: 10.1056/AIdbp2500418
[2025/12] We hypothesize that an LLM-as-a-Judge with relevant facts from a patient’s electronic health record (EHR) can be used to automatically verify facts in any clinical document written about a patient and provide an overall estimate of the document’s accuracy. To evaluate this hypothesis, we created a benchmark dataset called VeriFact — Brief Hospital Course (VeriFact-BHC), consisting of statements derived from the brief hospital course sections of discharge summaries, and established a ground truth from clinician chart review. We then devised an AI system, VeriFact, that combines retrieval-augmented generation of facts from a patient’s EHR along with an LLM-as-a-Judge to perform automated evaluation of all statements in the dataset.
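A minimal sketch of this retrieval-plus-judge pattern follows: relevant EHR facts are retrieved for each statement, and an LLM is asked whether the statement is supported. The TF-IDF retriever and the call_llm placeholder are assumptions standing in for the paper's retrieval-augmented pipeline and judge model.

```python
# Sketch of an LLM-as-a-Judge fact check grounded in retrieved EHR facts.
# `call_llm` is a hypothetical placeholder for any chat-completion client.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

ehr_facts = [
    "Patient admitted with community-acquired pneumonia.",
    "Treated with ceftriaxone and azithromycin for 5 days.",
    "Discharged home in stable condition on hospital day 6.",
]

def retrieve(statement, facts, k=2):
    # Rank EHR facts by lexical similarity to the statement (stand-in retriever).
    vec = TfidfVectorizer().fit(facts + [statement])
    sims = cosine_similarity(vec.transform([statement]), vec.transform(facts))[0]
    return [facts[i] for i in sims.argsort()[::-1][:k]]

def call_llm(prompt):
    raise NotImplementedError("plug in your own LLM client here")

def verify(statement):
    evidence = retrieve(statement, ehr_facts)
    prompt = (
        "Using ONLY the EHR facts below, label the statement as "
        "SUPPORTED, NOT SUPPORTED, or NOT ADDRESSED.\n"
        f"EHR facts: {evidence}\nStatement: {statement}\nLabel:"
    )
    return call_llm(prompt)
```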
Mathew Storey, et al. Early Clinical Evaluation of AI Triage of Chest Radiographs: Time to Diagnosis for Suspected Cancer and Number of Urgent CT Referrals. NEJM AI 2026;3(1). DOI: 10.1056/AIcs2500539
[2025/12] In detecting suspected cancer, the 14-feature AI triage model has an 82.5% sensitivity, a 90.5% specificity, a 14.6% positive predictive value, and a 99.6% negative predictive value for possible lung cancer on chest x-rays. Features such as focal airspace opacity and single pulmonary nodule returned the highest proportion of false-positive results, while pulmonary and mediastinal masses had a greater diagnostic yield. This work demonstrates how AI triage can be implemented for the rapid onward diagnosis of lung cancer and informs the deployment of triage systems for urgent cancer pathways in general.
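As a quick sanity check on how these numbers fit together, the snippet below back-solves the predictive values from the reported sensitivity and specificity; the roughly 1.9% prevalence is an assumed illustration chosen to reproduce the reported figures, not a number taken from the paper.

```python
# Predictive values follow from sensitivity, specificity, and prevalence
# via Bayes' rule; the prevalence here is an assumed, back-solved illustration.
sens, spec = 0.825, 0.905
prev = 0.0193   # assumed prevalence of suspected cancer among triaged radiographs
ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
print(f"PPV ~ {ppv:.1%}, NPV ~ {npv:.1%}")   # ~14.6% and ~99.6% at ~1.9% prevalence
```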
Richard K. Leuchter, et al. Evaluating Translational AI: A Two-Way Moving Target Problem. NEJM AI 2025;2(12). DOI: 10.1056/AIp2500705.
[2025/12] Together, these pitfalls render traditional evaluation methods inadequate. The authors argue that health systems must adopt a new default standard for implementing any model that predicts patient outcomes or utilization: short-term randomized deployment with a control group. This approach provides a crucial counterfactual for rigorous, independent assessment of model performance and intervention effectiveness. It offers a practical path forward for institutions to ensure that their artificial intelligence tools are safe, effective, and equitable, thereby building a foundation of trust that is worthy of the patients they serve.
Obermeyer Z. Bedside to Bench - AI and the New Science of Medicine. N Engl J Med. 2025;393(23):2287-2289. doi:10.1056/NEJMp2510203.
[2025/12] By allowing direct engagement with machine-derived clinical data, AI is finding fascinating new patterns. Thus, AI generates hypotheses, but those hypotheses are tested the old-fashioned way, avoiding the myriad statistical problems of relying on AI alone for discovery. This approach will not only feed new hypotheses into molecular biology. It will bridge the gulf between medical research and clinical reasoning, thereby creating a new science. An AI-driven medical science, grounded in clinical data, will help us rediscover clinical ways of thinking about bodies at their own level of abstraction, as neither aggregates of molecules nor social constructs.
Volandes AE, Davis AD, Goldstein NE. From Bandwidth to Bedside - Bringing AI-Enabled Care to Rural America. N Engl J Med. 2025;393(22):2182-2185. doi:10.1056/NEJMp2509491.
[2025/11] AI is increasingly being positioned as a tool for transforming rural care delivery. At Dartmouth Health, for example, where two of us work, we built an AI-powered triage system that parses thousands of patient messages each week to help clinical teams identify and prioritize urgent concerns. Predictive models can flag early signs of clinical deterioration hours before symptoms are obvious. A new AI-powered ambient dictation system helps clinicians spend more time with patients and less time charting in the electronic medical record. AI-powered conversational agents offer companionship between visits, especially in areas where behavioral health services are scarce. The table outlines four common use cases for AI in rural settings, mapping each to real-world constraints, the risks if the tool is misaligned with the culture, and strategies for embedding these tools into the trust-based systems that rural communities rely on.
Isaac Kohane. The H-Index of Suspicion: How Culture, Incentives, and AI Challenge Scientific Integrity. NEJM AI 2026;3(1). DOI: 10.1056/AIe2501273.
[2025/12] Generative AI is making it astonishingly easy to create scientific fakery that looks real: convincing data, tidy plots, even entire studies that slip past automated checks and human reviewers. In this editorial, I describe a deliberately fabricated dataset and analysis, built with help from an AI model that fooled standard anomaly detectors. The episode highlights a growing problem: technical fixes like blockchain may offer reassurance, but they can’t solve the deeper cultural pressures in science that reward speed and novelty over care and verification. What might help is the harder work of valuing replication, transparency, and rigor, as AI can amplify whatever incentives we set.
Gao S, Zhao X, Chen L, et al. Enhancing privacy-preserving deployable large language models for perioperative complication detection: a targeted strategy with LoRA fine-tuning. NPJ Digit Med. 2025;8(1):773. doi:10.1038/s41746-025-02139-3.
[2025/12] Perioperative complications are a major global concern, yet manual detection suffers from 27% under‑reporting and frequent misclassification. We show targeted prompt engineering plus Low‑Rank Adaptation (LoRA) fine‑tuning converts smaller open‑source LLMs into expert‑level diagnostic tools. On external validation (Center 2), the optimized 4B model’s micro‑F1 rose from 0.28 to 0.64, approaching human experts (F1 = 0.69), driven by the targeted strategy (ΔF1 = 0.256, 95% CI [0.181, 0.336]) and LoRA (ΔF1 = 0.103, 95% CI [0.023, 0.186]). Concurrently, the 8B model surpassed human experts (F1 > 0.70). Optimized small models enable expert‑level accuracy with local deployment and preserved data sovereignty, offering a practical path for resource‑limited healthcare.
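The sketch below shows what LoRA fine-tuning of a small open-source LLM typically looks like with the Hugging Face PEFT library; the model name, target modules, and hyperparameters are illustrative assumptions rather than the authors' exact configuration.

```python
# Minimal LoRA fine-tuning setup with Hugging Face PEFT. Model and settings
# are placeholders, not the paper's exact pipeline.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.2-3B"          # placeholder; any small causal LM
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; model-dependent
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()        # only the low-rank adapters are trained

# Training then proceeds with a standard supervised fine-tuning loop on prompts
# pairing targeted complication definitions with de-identified perioperative notes.
```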
Huerta, T.R., Bartlett, C.W., Alain, G. et al. Operationalizing a research-oriented learning healthcare system across covered entities: cross-institutional strategies and innovations. npj Health Syst. 2, 47 (2025). doi:10.1038/s44401-025-00051-2
[2025/12] The concept of a Learning Health System (LHS) has been widely discussed in academic literature, yet its practical implementation remains a challenge. This paper describes the institutional journey, leadership structure, data governance policies, and technical innovations that together support a scalable and sustainable Research-Oriented LHS. Additionally, we propose an expanded data vision that aligns with interdisciplinary and translational research needs. Supplementary materials provide technical details for those interested in implementing such a model.
He, Q., Tan, H., Xiao, B. et al. Prior knowledge-guided multimodal deep learning system for biomarker exploration and prognosis prediction of urothelial carcinoma. npj Digit. Med. 9, 53 (2026). doi:10.1038/s41746-025-02232-7
[2025/12] To address this gap, we developed a multimodal deep learning system that integrates histopathology, radiology, and structured pathology text. We incorporated prior knowledge to improve tumor segmentation and create knowledge-guided slide representations. CTContextNet and MacroContextNet were then employed for capturing radiological and macroscopic prognostic information. IM-NCTNet integrated multi-modality information for enhanced prediction. Across multi-center, large-scale, multi-cohort validation, the knowledge-guided prognostic system demonstrated higher performance over single-modality models, achieving C-index scores ranging from 0.809 to 0.867.
Chavoshi, M., Mansuri, A., Bala, W. et al. Real-world performance evaluation of a commercial deep learning model for intracranial hemorrhage detection. npj Digit. Med. 9, 66 (2026). doi:10.1038/s41746-025-02244-3
[2025/12] Intracranial hemorrhage (ICH) is a life-threatening emergency requiring rapid and accurate diagnosis, yet the real-world performance of FDA-cleared deep-learning models remains uncertain. We retrospectively evaluated a commercial AI model (Aidoc Medical Briefcase ICH Triage) across 101,944 non-contrast head CT examinations from 74,142 patients in a 17-facility academic health system (April 2023–April 2025). An LLM used to label the radiology reports achieved 96% accuracy (κ = 0.85) for ICH classification. Overall, the Aidoc model demonstrated 82.2% sensitivity, 97.6% specificity, and 96.6% accuracy. Sensitivity was highest for acute (86.2%), large >10 mm (95.0%), and multi-compartment hemorrhages (93.6%), but substantially lower for subacute (45.5%), chronic (54.8%), small ≤10 mm (74.8%), and single-compartment bleeds (76.0%). Performance was also reduced in the outpatient setting (72.2%), where subtle hemorrhages were more common, while remaining consistent across demographic subgroups. These findings show that the model performs reliably for acute and extensive ICH but is less sensitive to subtle or localized presentations, underscoring the need for ongoing real-world evaluation and targeted improvements to support safe clinical triage.
Li, D., Jiang, N., Huang, K. et al. Streamlining evidence based clinical recommendations with large language models. npj Digit. Med. 8, 793 (2025). doi:10.1038/s41746-025-02273-y
[2025/12] This study presents Quicker, an LLM-powered system that automates evidence synthesis and generates clinical recommendations following standard guideline development workflows. Quicker delivers an end-to-end pipeline from clinical questions to recommendations and supports customized decision-making through integrated tools and interactive interfaces. To evaluate how closely Quicker can reproduce guideline development processes, we constructed Q2CRBench-3, a benchmark derived from guideline development records for three diseases.
Yao, X., Rao, A. & Padman, R. Analytical approaches for medication management for patient safety: a scoping review. npj Health Syst. 2, 46 (2025). doi:10.1038/s44401-025-00052-1
[2025/12] Safe medication management is a cornerstone of high-quality patient care, yet it remains one of the most complex and challenging components of healthcare delivery. This scoping review investigates analytical approaches for patient safety-focused medication management, with an emphasis on artificial intelligence and decision-theoretic methods. Although artificial intelligence methods were widely applied, only 7 studies used decision-theoretic frameworks, and just 5 combined both approaches. Overall, the literature reflects a predominant focus on single-drug event detection but limited research on managing incomplete or inaccurate medication lists across patient encounters. Future work should explore the combination of predictive, prescriptive, and generative analytics by integrating traditional machine learning, generative AI, and decision-theoretic methods to advance more comprehensive, safe medication management.
Eun, DI., Lee, HC., Heo, G. et al. Efficacy of large language models in detecting postoperative delirium from unstructured clinical notes: A retrospective cohort study. npj Digit. Med. 9, 54 (2026). doi:10.1038/s41746-025-02231-8
[2025/12] Early identification of postoperative delirium (POD) remains challenging. This retrospective observational study compared the performance of large language models (LLMs), Llama-3-70B and GPT-4o, and physicians in predicting clinically significant POD, defined as either requiring antipsychotics or diagnosis of delirium by neurologists following consultation for delirium-related symptoms. The c-statistics of Llama-3-70B and GPT-4o were 0.74 and 0.76, respectively. LLMs showed higher sensitivity (Llama-3-70B, 0.900; GPT-4o, 0.868; physicians, 0.723) and lower specificity (0.463, 0.547, and 0.814, respectively) than physicians. Inter-rater agreement was almost perfect for both Llama-3-70B and GPT-4o (Fleiss’ kappa = 0.852 and 0.854, respectively) but fair for physicians (0.219). Both LLMs detected clinically significant POD approximately one day earlier than physicians (Kaplan-Meier analysis, median time to diagnosis: Llama-3-70B, 34.5 h; GPT-4o, 37.5 h; physicians, 62.9 h; log-rank P < 0.001). The integration of LLMs as a complementary screening tool under physician supervision may improve the early, reproducible diagnosis of clinically significant POD.
Gao, J., Rahman, M., Caskey, J. et al. MoMA: a mixture-of-multimodal-agents architecture for enhancing clinical prediction modelling. npj Digit. Med. 9, 46 (2026). doi:10.1038/s41746-025-02219-4
[2025/12] We introduce a novel architecture, Mixture-of-Multimodal-Agents (MoMA), designed to leverage multiple large language model (LLM) agents for clinical prediction tasks using multimodal EHR data. MoMA employs specialized LLM agents (“specialist agents”) to convert non-textual modalities, such as medical images and laboratory results, into structured textual summaries. These summaries, together with clinical notes, are combined by another LLM (“aggregator agent”) to generate a unified multimodal summary, which is then used by a third LLM (“predictor agent”) to produce clinical predictions. When evaluated with different modality combinations and prediction settings, MoMA outperforms existing methods on three prediction tasks using private datasets, highlighting its enhanced accuracy and flexibility across various tasks.
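The outline below sketches this three-tier agent flow, with hypothetical placeholder functions standing in for the specialist, aggregator, and predictor LLM calls; it is an illustration of the described architecture, not the authors' code.

```python
# Sketch of the MoMA pattern: specialists turn non-text modalities into text,
# an aggregator fuses the summaries with the notes, and a predictor makes the
# final call. `call_llm` is a hypothetical placeholder for any LLM client.
def call_llm(prompt):
    raise NotImplementedError("plug in your own LLM client here")

def image_specialist(chest_xray):       # vision-language model -> text summary
    return call_llm(f"Summarize the key findings in this chest x-ray: {chest_xray}")

def labs_specialist(lab_table):         # structured labs -> text summary
    return call_llm(f"Summarize abnormal values in these labs: {lab_table}")

def aggregator(note, summaries):
    return call_llm("Combine into one multimodal patient summary:\n"
                    + note + "\n" + "\n".join(summaries))

def predictor(multimodal_summary, task="30-day readmission"):
    return call_llm(f"Given this summary, predict {task} (yes/no with rationale):\n"
                    + multimodal_summary)

def moma_predict(note, chest_xray, lab_table):
    summaries = [image_specialist(chest_xray), labs_specialist(lab_table)]
    return predictor(aggregator(note, summaries))
```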
Zhou, S., Xie, W., Li, J. et al. Automating expert-level medical reasoning evaluation of large language models. npj Digit. Med. 9, 34 (2026). doi:10.1038/s41746-025-02208-7
[2025/12] MedThink-Bench comprises 500 high-complexity questions spanning ten medical domains, accompanied by expert-authored, step-by-step rationales that elucidate intermediate reasoning processes. Further, we introduce LLM-w-Rationale, an evaluation framework that combines fine-grained rationale assessment with an LLM-as-a-Judge paradigm, enabling expert-level fidelity in evaluating reasoning quality while preserving scalability. Results show that LLM-w-Rationale correlates strongly with expert evaluation (Pearson coefficient up to 0.87) while requiring only 1.4% of the evaluation time. Overall, MedThink-Bench establishes a rigorous and scalable standard for evaluating medical reasoning in LLMs, advancing their safe and responsible deployment in clinical practice.
Xie F, Chung P, Reiss JD, et al. Development and validation of a pre-trained language model for neonatal morbidities: a retrospective, multicentre, prognostic study. Lancet Digit Health. 2025;7(12):100926. doi:10.1016/j.landig.2025.100926.
[2025/12] Based on validation using two large-scale US datasets, NeonatalBERT effectively estimates the risk of neonatal morbidities from unstructured clinical notes of newborns. The promising results from this study show the potential of NeonatalBERT to enhance neonatal care and streamline hospital operations.
Van Calster B, Collins GS, Vickers AJ, et al. Evaluation of performance measures in predictive artificial intelligence models to support medical decisions: overview and guidance. Lancet Digit Health. 2025;7(12):100916. doi:10.1016/j.landig.2025.100916.
[2025/12] Numerous measures have been proposed to illustrate the performance of predictive artificial intelligence (AI) models. We discuss 32 performance measures covering five performance domains (discrimination, calibration, overall performance, classification, and clinical utility) along with corresponding graphical assessments. We recommend the following measures and plots as essential to report: area under the receiver operating characteristic curve, calibration plot, a clinical utility measure such as net benefit with decision curve analysis, and a plot showing probability distributions by outcome category.
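The snippet below illustrates two of the recommended reports, AUROC and net benefit across decision thresholds (the quantity plotted in a decision curve), using synthetic predictions; it demonstrates the standard formulas and is not code from the guidance.

```python
# AUROC plus net benefit at several thresholds, on synthetic outcomes and risks.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)                                        # observed outcomes
p = np.clip(0.3 * y + 0.35 + rng.normal(scale=0.15, size=500), 0, 1)    # predicted risks

print("AUROC:", round(roc_auc_score(y, p), 3))

def net_benefit(y_true, risk, threshold):
    # Net benefit = TP/n - (FP/n) * threshold / (1 - threshold)
    treat = risk >= threshold
    tp = np.sum(treat & (y_true == 1))
    fp = np.sum(treat & (y_true == 0))
    n = len(y_true)
    return tp / n - (fp / n) * threshold / (1 - threshold)

for t in (0.1, 0.2, 0.3):
    print(f"net benefit at threshold {t}: {net_benefit(y, p, t):.3f}")
```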
~

In the Eyes of GPTs
(AJ in conversation with ChatGPT to understand how machines view human healthcare)
Let’s talk about healthcare AI at the 2026 crossroads. Do you also think the healthcare AI industry is at a crossroads now?
ChatGPT
Yes — I strongly agree that the healthcare AI industry is at a real crossroads in 2026, and this is not rhetorical or speculative. It is structural, irreversible, and evidence-driven.
Why 2026 Is a True Crossroads for Healthcare AI
1. Capability Has Outpaced Evidence
For the first time, general-purpose AI systems (LLMs) can:
Yet clinical evidence has not kept pace. Most healthcare GenAI deployments still rely on:
This creates a dangerous asymmetry: AI is clinically persuasive before it is clinically proven.
2. Two Diverging Paths Are Now Visible
Path A: Consumer-Scale Deployment First
This path mirrors social media, fintech, and consumer AI playbooks — but healthcare is fundamentally different because harm is irreversible.
Path B: Evidence-First, Learning-System Integration
This path aligns with Learning Health Systems, regulatory science, and clinical ethics — but historically struggled with scale.
3. What Makes 2026 Different (and Irreversible)
In earlier AI cycles, healthcare could “wait and see.”
In 2026, it cannot.
Why?
This means inaction is no longer neutral. Choosing not to design evidence pathways is itself a choice — and a risky one.
4. The Central Question of the Crossroads
It is no longer:
“Can AI help healthcare?”
It is now:
“How do we generate trustworthy clinical evidence at the same speed AI is deployed?”
This is precisely where GenAI × Learning Health Systems convergence becomes decisive.
Without it:
With it:
5. My Direct Answer to Your Question
Yes — the healthcare AI industry is at a crossroads now, and:
Both paths are technologically feasible.
Only one is ethically and systemically sustainable.
Your framing is timely, accurate, and necessary — especially because it shifts the conversation from “AI hype vs AI fear” to “system design choices with long-term consequences.”
Wow, you are good! I couldn’t describe it better myself. Please give three bullet points with emojis that I can share on social media.
ChatGPT
Healthcare AI at a Crossroads (2026)
Let’s look at a few leading thought leaders at Stanford. Please highlight some key points from their views on the responsible use of generative AI in healthcare.
ChatGPT
Summary of the Stanford Perspective
Responsible AI in healthcare means:
These principles highlight that trustworthy, ethical, and evidence-based AI healthcare tools require governance structures, not just technical capability — a message central to Stanford’s approach.
What is Harvard’s perspective? Please provide a quick summary.
ChatGPT
Harvard’s View on Responsible GenAI in Healthcare
Please summarize the views of the healthcare AI industry.
ChatGPT
Healthcare AI Industry Perspective (2026)
1. Scale and Speed Are the Primary Imperatives
The industry largely believes GenAI must be deployed rapidly at scale to address clinician shortages, rising costs, and access gaps. Many companies prioritize real-world usage and iteration over waiting for long-term randomized clinical evidence, arguing that delay itself causes harm.
2. “Human-in-the-Loop” as the Core Safety Strategy
Most industry players frame responsibility around human oversight:
This approach is often seen as sufficient mitigation for current model limitations.
3. Evidence Is Important—but Must Be “Pragmatic”
Industry leaders generally support evidence generation, but favor:
There is less appetite for slow, pre-deployment clinical trials, which are viewed as incompatible with AI’s rapid evolution.
4. Regulation Should Be Flexible and Innovation-Friendly
The prevailing view is that over-regulation could stall innovation. The industry advocates risk-based, adaptive regulation rather than treating GenAI like traditional medical devices, emphasizing that many GenAI tools are “assistive” rather than diagnostic.
5. Trust Is a Business Risk, Not Just an Ethical Issue
Increasingly, companies recognize that:
As a result, safety and responsibility are now framed not only as ethical obligations, but as long-term commercial necessities.
One-Sentence Industry Summary
The healthcare AI industry believes GenAI must move fast to scale impact, manage risk through human oversight, and generate evidence pragmatically—without slowing innovation.
~
🔹 ELHS Institute Relevant Resources 🔹
▶️ ELHS Videos
👉 For Clinical AI technology support, contact us at support@elhsi.org 📩
~ the end ~
Democratizing GenAI and LHS to Advance Global Health Equity
info@elhsi.org
Palo Alto, California, USA
