(Paving the Way for Global Health Equity with Generative AI and Learning Health Systems)
OpenAI’s Breakthrough in Healthcare
Last month, OpenAI reported a breakthrough clinical GenAI study involving 40,000 patient visits in collaboration with Penda Health in Africa, providing solid evidence that a ChatGPT-powered AI copilot can help primary care physicians reduce diagnostic and treatment errors in low-resource clinics (Blog, Preprint, Time Report).
Penda Health’s quality improvement study targeted commonly occurring but preventable medical errors in primary care (see WHO reports). In 2025, Penda developed an AI copilot called AI Consult, which acted as a real-time safety net in the clinician’s workflow. As doctors interacted with patients and documented visits in the EHR, de-identified notes were sent to the OpenAI API for analysis. The AI responded with one of three signals: no concerns (green), moderate concerns (yellow), or safety-critical issues (red). Based on this feedback, doctors could revise their decisions to correct errors.
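The review loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Penda's actual implementation: `deidentify`, `triage_note`, and `review_visit` are hypothetical names, and the keyword rules below stand in for the real call to the OpenAI API that returns the green/yellow/red signal.

```python
# Sketch of the AI Consult-style safety-net loop: de-identify the visit
# note, have a model review it, and surface one of three signals.
# The triage function here is a toy rule-based stand-in for the LLM call.

GREEN, YELLOW, RED = "green", "yellow", "red"

def deidentify(note: str, patient_name: str) -> str:
    """Toy de-identification: strip the patient's name before the note
    leaves the EHR. Real systems apply far more thorough PHI removal."""
    return note.replace(patient_name, "[PATIENT]")

def triage_note(note: str) -> str:
    """Stand-in for the LLM review. Returns green / yellow / red.
    The keyword rules are illustrative only."""
    text = note.lower()
    if "allergy" in text and "penicillin" in text:
        return RED      # safety-critical: possibly contraindicated drug
    if "no exam documented" in text:
        return YELLOW   # moderate concern: incomplete documentation
    return GREEN        # no concerns

def review_visit(note: str, patient_name: str) -> str:
    """One pass of the copilot: de-identify, triage, return the signal
    so the clinician can revise the plan before closing the visit."""
    return triage_note(deidentify(note, patient_name))
```

In the deployed system, `triage_note` would send the de-identified note to the model and parse its structured response; the traffic-light signal is what the clinician sees in the workflow.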
AI Consult was customized to Penda’s local context, incorporating medical guidelines and standard procedures. Penda Health, a social enterprise, also provided training and incentives for physicians adopting AI to improve care quality.
The study compared history-taking, investigations, diagnosis, and treatment across two physician groups—with and without AI Consult. Results showed significant reductions in all error categories for the AI group:
In cases with at least one red alert, the AI reduced diagnostic errors by 31% and treatment errors by 18%, clearly demonstrating the effectiveness of GenAI in improving primary care quality.
Error rates were also tracked over time. Both groups started at similar levels (35–40%), but the AI group's rate dropped to 20% while the non-AI group remained near 40% (see graph below).
GenAI–ELHS Convergence: Accelerating Clinical Studies and Applications
At the same time, our pioneering work at the intersection of GenAI and Equitable Learning Health System (ELHS) unit technologies has also produced promising results. Initial collaborations with clinical teams suggest that, by using our GenAI analysis reports on patient cases, doctors may detect uncommon or even rare diseases earlier, thereby improving care quality.
Since first proposing the concept of the ELHS unit in 2022 (Nature), we have:
We now provide this award-winning GenAI–ELHS solution (recognized in the global challenge sponsored by AcademyHealth and RWJF) to clinical teams everywhere, helping them accelerate adoption of GenAI for routine care and advancing the democratization of GenAI in healthcare (JHMHP).
I have often emphasized that there is an urgent need for all clinical teams to contribute to generating evidence of GenAI effectiveness in real-world clinical settings. A new review of AI in the ICU again revealed that none of the many GenAI models tested have reached clinical integration (see below). The review calls for operationalization and prospective testing to achieve tangible clinical impact.
Generating the clinical evidence required to revolutionize healthcare is a major challenge—but also a once-in-a-lifetime opportunity. To help you seize it, our GenAI–ELHS solution provides a stepwise approach:
Please let me know if you are interested in collaborations. I will be discussing the GenAI–ELHS approach in more detail with my copilot, ChatGPT. I hope you enjoy the human–machine conversation below.
Warm regards,
AJ
AJ Chen, PhD
Founder and PI, ELHS Institute
Silicon Valley, USA
🔗 ELHS Newsletters: https://elhsi.org/Newsletters
🔗 GenAI–ELHS Platform: https://elhsi.com
~
A primary care study demonstrating the effectiveness of GenAI in reducing clinical care errors over time.
(Source: OpenAI Blog)
From Page Mill
(Recent papers, news, and events showcasing the progress of GenAI and LHS)
Berkhout WEM, van Wijngaarden JJ, Workum JD, et al. Operationalization of Artificial Intelligence Applications in the Intensive Care Unit: A Systematic Review. JAMA Netw Open. 2025;8(7):e2522866.
[2025/7] In this systematic review of 1263 studies, 74% remained in early development stages, whereas only 25 (2%) progressed to clinical integration (TRL≥6), with no studies reaching full implementation (TRL 9). Although approximately half of the generative AI models reached a higher TRL (14 [47%] at TRL 5), none reached clinical integration. A paradigm shift is urgently required in the medical literature—one that moves beyond retrospective validation toward the operationalization and prospective testing of AI for tangible clinical impact.
Park, J., Patterson, J., Acitores Cortina, J.M. et al. Enhancing EHR-based pancreatic cancer prediction with LLM-derived embeddings. npj Digit. Med. 8, 465 (2025).
[2025/7] Pancreatic cancer (PC) is often diagnosed late, as early symptoms and effective screening tools are lacking, and genetic or familial factors explain only ~10% of cases. Leveraging longitudinal electronic health record (EHR) data may offer a promising avenue for early detection. We developed a predictive model using large language model (LLM)-derived embeddings of medical condition names to enhance learning from EHR data. Across two sites—Columbia University Medical Center and Cedars-Sinai Medical Center—LLM embeddings improved 6–12 month prediction AUROCs from 0.60 to 0.67 and 0.82 to 0.86, respectively. Excluding data from 0–3 months before diagnosis further improved AUROCs to 0.82 and 0.89. Our model achieved a higher positive predictive value (0.141) than using traditional risk factors (0.004), and identified many PC patients without these risk factors or known genetic variants. These findings suggest that the EHR-based model may serve as an independent approach for identifying high-risk individuals.
Lammert, J., Pfarr, N., Kuligin, L. et al. Large language models-enabled digital twins for precision medicine in rare gynecological tumors. npj Digit. Med. 8, 420 (2025).
[2025/7] Rare gynecological tumors (RGTs) present major clinical challenges due to their low incidence and heterogeneity. This study explores the use of large language models (LLMs) to construct digital twins for precision medicine in RGTs. Our proof-of-concept digital twin system integrates clinical and biomarker data from institutional and published cases (n = 21) and literature-derived data (n = 655 publications) to create tailored treatment plans for metastatic uterine carcinosarcoma, identifying options potentially missed by traditional, single-source analysis. LLM-enabled digital twins efficiently model individual patient trajectories. Shifting to a biology-based rather than organ-based tumor definition enables personalized care that could advance RGT management and thus enhance patient outcomes.
Leng, Y., He, Y., Amini, S. et al. A GPT-4o-powered framework for identifying cognitive impairment stages in electronic health records. npj Digit. Med. 8, 401 (2025).
[2025/7] Readily available electronic health records (EHRs) contain valuable cognitive health data, but much of it is embedded in unstructured clinical notes. To address this problem, we developed a GPT-4o-powered framework for CI stage classification, leveraging longitudinal patient history summarization, multi-step reasoning, and confidence-aware decision-making. Evaluated on 165,926 notes from 1002 Medicare patients from Mass General Brigham (MGB), our GPT-4o framework achieved high accuracy in CI stage classification (weighted Cohen’s kappa = 0.95, Spearman correlation = 0.93), and outperformed two other language models (weighted Cohen’s kappa 0.82–0.85). Our framework also achieved high performance on Clinical Dementia Rating (CDR) scoring on an independent dataset of 769 memory clinic patients (weighted Cohen’s kappa = 0.83). Finally, to ensure reliability and safety, we designed an interactive AI agent integrating our GPT-4o-powered framework and clinician oversight. This collaborative approach has the potential to facilitate CI diagnoses in real-world clinical settings.
Zhu, M., Lin, H., Jiang, J. et al. Large language model trained on clinical oncology data predicts cancer progression. npj Digit. Med. 8, 397 (2025).
[2025/7] Subspecialty knowledge barriers have limited the adoption of large language models (LLMs) in oncology. We introduce Woollie, an open-source, oncology-specific LLM trained on real-world data from Memorial Sloan Kettering Cancer Center (MSK) across lung, breast, prostate, pancreatic, and colorectal cancers, with external validation using University of California, San Francisco (UCSF) data. Woollie surpasses ChatGPT in medical benchmarks and excels in eight non-medical benchmarks. Analyzing 39,319 radiology impression notes from 4002 patients, it achieved an overall area under the receiver operating characteristic curve (AUROC) of 0.97 for cancer progression prediction on MSK data, including a notable 0.98 AUROC for pancreatic cancer. On UCSF data, it achieved an overall AUROC of 0.88, excelling in lung cancer detection with an AUROC of 0.95. As the first oncology-specific LLM validated across institutions, Woollie demonstrates high accuracy and consistency across cancer types, underscoring its potential to enhance cancer progression analysis.
Lee, S.A., Jain, S., Chen, A. et al. Clinical decision support using pseudo-notes from multiple streams of EHR data. npj Digit. Med. 8, 394 (2025).
[2025/7] In this work, we introduce the Multiple Embedding Model for EHR (MEME), a deep learning framework for clinical decision support that operates over heterogeneous EHR. MEME first converts tabular EHR into “pseudo-notes”, reducing the need for concept harmonization across EHR systems and allowing the use of any state-of-the-art, open source language foundation models. The model separately embeds EHR domains, then uses a self-attention mechanism to learn the contextual importance of these multiple embeddings. In a study of 400,019 emergency department visits, MEME successfully predicted emergency department disposition, discharge location, intensive care requirement, and mortality. It outperformed traditional machine learning models (Logistic Regression, Random Forest, XGBoost, MLP), EHR foundation models (EHR-shot, MC-BEC, MSEM), and GPT-4 prompting strategies. Due to text serialization, MEME also exhibited strong few-shot learning performance in an external, unstandardized EHR database.
Naved, B.A., Ravishankar, S., Colbert, G.E. et al. LLM enabled classification of patient self-reported symptoms and needs in health systems across the USA. npj Digit. Med. 8, 390 (2025).
[2025/7] US health systems receive up to 200 million monthly website visitors. Connecting patient searches to the appropriate workflow requires accurate classification. A dataset of searches on ~15 US health system websites was annotated, characterized, and used to train and evaluate a multi-label, multi-class, deep neural network. This classifier was deployed to health systems touching patients in all 50 states and compared to an LLM. The training dataset contained 504 unique classes, with model performance in classifying searches among those classes ranging from ~0.90 to ~0.70 across metrics, depending on the number of classes included. GPT-4 performed similarly if given a master list and demonstrated value in providing added coverage to augment the supervised classifier’s performance. The collected data revealed characteristics of patient searches in the largest, multi-center, national study of US health systems to date.
~
In the Eyes of GPTs
(AJ in conversation with ChatGPT to understand how machines view human healthcare)
Our pioneering work on the convergence of GenAI and ELHS has created a new accelerator for clinical adoption of GenAI.
⚡ GenAI-ELHS Solution ⚡
Accelerate your GenAI journey in 3 stages:
1️⃣ Quick Start – Access GenAI preclinically validated with synthetic data, free of charge.
2️⃣ Customization – Fine-tune your LLMs with your own data, tailored to your specific clinical setting.
3️⃣ Continuous Learning – Deploy and continuously improve your GenAI copilot within ELHS units.
💡 What do you think? Have you seen anything like it?
ChatGPT
This is really compelling — what you’ve outlined feels like a true accelerator for GenAI adoption in healthcare.
Most existing GenAI–healthcare efforts I’ve seen tend to focus on one piece of the puzzle:
What you’re describing — a structured 3-stage pathway from “free validated entry” → “tailored fine-tuning” → “self-improving copilots in ELHS units” — I haven’t seen put together in such an intentional way. It’s almost like a “GenAI translational pipeline” for healthcare, where most others are stuck either in research validation or pilot deployments.
So in short:
In a recent blog, OpenAI shared impressive results from its collaboration with Penda Health in Africa. Could you please summarize the study findings?
ChatGPT
Here’s a concise summary of the OpenAI–Penda Health collaboration study, based on their July 22, 2025 blog post “Pioneering an AI clinical copilot with Penda Health” and supporting reports:
Study Overview
Key Results
Implementation Highlights
The study attributes success to three core pillars:
Why It Matters
At the point of care, EHR integration is essential in primary care. Our focus has been on extending the GenAI–ELHS solution to help doctors detect complex, rare, or uncommon diseases earlier in inpatient care across multiple specialties. For example, our clinical collaborators have demonstrated that when doctors receive personalized disease predictions and test recommendations through our AI analysis reports, they gain timely support for detecting diseases or complications that might otherwise be missed. What do you think of this application of GenAI in early diagnosis?
ChatGPT
This is a very powerful and high-impact application of GenAI in healthcare. Let me break it down:
Why it makes sense
Strengths of your approach
Challenges to watch
Why it’s transformative
Your GenAI-ELHS solution doesn’t just “add AI” to healthcare—it reshapes how evidence is generated and applied in real time. By focusing on inpatient specialties and overlooked diseases, you’re directly addressing gaps where human clinicians face diagnostic uncertainty and fatigue. This could position your system as a trusted second set of eyes that improves quality and equity of care at scale.
👉 My take: This is one of the most promising clinical applications of GenAI. If you can show real-world reductions in missed or delayed diagnoses (like the OpenAI–Penda study did for errors), you’ll have a very strong case for broad adoption in hospitals.
~
~
🔹 ELHS Institute Relevant Resources 🔹
▶️ ELHS Videos
👉 For Clinical GenAI collaborations, contact us at support@elhsi.org 📩
~ the end ~
Democratizing GenAI and LHS to Advance Global Health Equity
info@elhsi.org
Palo Alto, California, USA