(Paving the Way for Global Health Equity with Generative AI and Learning Health Systems)
Dear Friends,
Cancer patient navigation—the process of guiding patients through the healthcare system to ensure they receive the support and services they need throughout their cancer diagnosis and treatment—remains a critical challenge, particularly in underserved communities. The recent global competition, the Transforming Cancer Navigation with Open Data & APIs Challenge, sponsored by the Robert Wood Johnson Foundation and AcademyHealth, was designed to address this problem by inspiring data enthusiasts and tech innovators to develop cutting-edge solutions that improve the cancer patient navigation experience.
Patient navigators have four primary responsibilities: screening for social needs, monitoring care quality, scheduling appointments, and ensuring connections to available services. Without standardized AI diagnostic tools, navigators are currently unable to effectively and equitably monitor cancer care quality at scale. This limits their ability to ensure patients receive timely and high-quality care throughout their cancer journey.
We believe generative AI offers a powerful opportunity to create broad-spectrum tools that enhance the navigator’s role in monitoring care quality. However, most navigators—particularly those in low-resource clinics and hospitals—still lack access to high-accuracy GenAI tools that meet the standards for responsible AI. To address this gap, we created the GenAI-ELHS solution to empower patient navigators to equitably and effectively monitor cancer care quality at scale.
We have implemented a free, web-based navigation tool that allows patients to manually enter or integrate their health records. Navigators then use GenAI to predict the expected standard of care based on each patient’s case, compare it to the actual care received, identify potential quality gaps, and recommend next steps.
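For readers curious how such a quality-gap check might be prototyped, below is a minimal, illustrative sketch (not the ELHS implementation). It assumes an open LLM served behind an OpenAI-compatible endpoint; the server URL, model name, prompt wording, and patient example are placeholders.

```python
# Minimal illustrative sketch (not the ELHS implementation): asking an open LLM,
# served behind an OpenAI-compatible endpoint, to compare guideline-expected care
# with the care a patient actually received and to flag potential quality gaps.
from openai import OpenAI

# Hypothetical local server and model name; replace with your own deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

PROMPT = """You are assisting a cancer patient navigator.

Patient case summary:
{case_summary}

Care received so far:
{care_received}

1. List the guideline-expected standard of care for this case.
2. Compare it with the care received and flag any quality gaps.
3. Recommend concrete next steps for the navigator.
Answer in three short sections."""

def review_care_quality(case_summary: str, care_received: str,
                        model: str = "llama-3.1-8b-instruct") -> str:
    """Return the LLM's expected-vs-actual care comparison and next-step suggestions."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(
            case_summary=case_summary, care_received=care_received)}],
        temperature=0.2,  # keep the output conservative for clinical review
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(review_care_quality(
        "62-year-old woman, newly diagnosed stage II ER+/HER2- breast cancer.",
        "Mammogram and core biopsy completed; no oncology consult scheduled after 5 weeks.",
    ))
```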
A clinic-based version will be deployed on demand for clinics of any size. This version will use GenAI to automatically monitor cancer care quality on a regular basis, enabling doctors to identify gaps in timely diagnosis and personalized treatment. The solution integrates GenAI-based monitoring into Equitable Learning Health System (ELHS) units, ensuring a system-level transformation of cancer navigation over time (see initial study published in Nature Sci Rep).
We’re excited to share that our GenAI-ELHS solution was selected as a winner in the global challenge—a recognition of the potential impact of GenAI in helping patients, navigators, and doctors improve cancer care quality. Building on this momentum, we are launching free training programs for clinical teams to integrate responsible LLM-based GenAI into cancer care monitoring and to publish real-world evidence of its effectiveness.
Last month, several strong studies were published on LLM applications in healthcare. A selection of these is highlighted in the next section for your quick review. However, studies that demonstrate how GenAI can directly support real clinical tasks in frontline care delivery remain scarce.
In our conversation with ChatGPT below, we explore how to design clinical comparative effectiveness research for GenAI. We hope this helps clarify that every doctor already has what they need—real-world data and open LLMs—to begin conducting impactful clinical evaluation studies.
Enjoy your reading.
Warm regards,
AJ
AJ Chen, PhD
Founder and PI, ELHS Institute
ELHS Platform | https://elhsi.com
ELHS Institute | https://elhsi.org
~
(Source: Herox.com)
From Page Mill
(Recent papers, news, and events showcasing the progress of GenAI and LHS)
Stults CD, Deng S, Martinez MC, et al. Evaluation of an Ambient Artificial Intelligence Documentation Platform for Clinicians. JAMA Netw Open. 2025;8(5):e258614.
In this quality improvement study at a single health care organization using before and after survey results and electronic health record data, ambient AI was associated with decreased time in notes per appointment. Ambient AI was associated with improved clinician satisfaction at work and decreased cognitive load. Among 100 clinicians (53 male [53.0%]; mean [SD] age, 48.9 [11.0] years), 58 clinicians (58.0%) were in primary care and 92 clinicians had EHR metrics. Among 57 clinicians who completed both preimplementation and postimplementation surveys, there was a decrease in burnout from 24 clinicians (42.1%) to 20 clinicians (35.1%), although this was not a significant difference (P = .12). Mean (SD) time in notes per appointment significantly decreased from 6.2 (4.0) to 5.3 (3.5) minutes (P < .001), with a bigger decrease for female vs male clinicians (8.1 [3.9] to 6.7 [3.6] minutes vs 4.7 [3.5] to 4.2 [3.1] minutes; P = .001).
Zeng D, Qin Y, Sheng B, Wong TY. DeepSeek’s “Low-Cost” Adoption Across China’s Hospital Systems: Too Fast, Too Soon? JAMA. Published online April 28, 2025.
DeepSeek is seen as a low-cost and open-source AI model and has generated excitement globally but particularly so in China. DeepSeek’s rapid adoption in China’s hospital systems presents a common paradox in health care technology: balancing the opportunity to rapidly integrate innovative technology to transform health care with the potential risks and dangers that accompany such rapid adoption (“first, do no harm”). We propose a global collaborative effort among AI developers, physicians, health care administrators, and policymakers to ensure that new AI technology, particularly frontier LLMs that are now inexpensive and widely available, can be used as a trustworthy and safe tool in health care, rather than a potential source of harm. Only through a balanced approach incorporating robust clinical safety with regulatory oversight can new AI models truly realize their potential to transform health care.
Tordjman, M., Liu, Z., Yuce, M. et al. Comparative benchmarking of the DeepSeek large language model on medical tasks and clinical reasoning. Nat Med (2025).
In the USMLE test, the performance of DeepSeek-R1 (accuracy = 0.92) was slightly inferior to that of ChatGPT-o1 (accuracy = 0.95; p = 0.04) but better than that of Llama 3.1-405B (accuracy = 0.83; p < 10⁻³). For text-based case challenges, DeepSeek-R1 performed similarly to ChatGPT-o1 (accuracy of 0.57 vs 0.55; p = 0.76 and 0.74 vs 0.76; p = 0.06, using the New England Journal of Medicine and Medicilline databases, respectively). For RECIST classifications, DeepSeek-R1 also performed similarly to ChatGPT-o1 (0.73 vs 0.81; p = 0.10). Diagnostic reasoning steps provided by DeepSeek were deemed more accurate than those provided by ChatGPT and Llama 3.1-405B (average Likert scores of 3.61, 3.22, and 3.13, respectively; p = 0.005 and p < 10⁻³). However, summarized imaging reports provided by DeepSeek-R1 exhibited lower global quality than those provided by ChatGPT-o1 (5-point Likert score: 4.5 vs 4.8; p < 10⁻³). This study highlights the potential of the DeepSeek-R1 LLM for medical applications but also underlines areas needing improvement.
Hong, Q., Liu, S., Wu, L. et al. Evaluating the performance of large language & visual-language models in cervical cytology screening. npj Precis. Onc. 9, 153 (2025).
Here, we constructed CCBench, a benchmark dataset dedicated to the evaluation of LLMs and LVLMs in cervical cytology screening, and developed a GPT-based semi-automatic evaluation pipeline to assess the performance of six LLMs (GPT-4, Bard, Claude-2.0, LLaMa-2, Qwen-Max, and ERNIE-Bot-4.0) and five LVLMs (GPT-4V, Gemini, LLaVA, Qwen-VL, and ViLT) on this dataset. We found that LLMs and LVLMs demonstrate promising accuracy and specialization in cervical cytology screening. GPT-4 achieved the best performance on the QA dataset, with an accuracy of 70.5% for close-ended questions and average expert evaluation score of 6.9/10 for open-ended questions. On the VQA dataset, Gemini achieved the highest accuracy for close-ended questions at 67.8%, while GPT-4V attained the highest expert evaluation score of 6.1/10 for open-ended questions.
Shashikumar, S.P., Mohammadi, S., Krishnamoorthy, R. et al. Development and prospective implementation of a large language model based system for early sepsis prediction. npj Digit. Med. 8, 290 (2025).
This study introduces COMPOSER-LLM, an open-source large language model (LLM) integrated with the COMPOSER model to enhance early sepsis prediction. For high-uncertainty predictions, the LLM extracts additional context to assess sepsis-mimics, improving accuracy. Evaluated on 2500 patient encounters, COMPOSER-LLM achieved a sensitivity of 72.1%, positive predictive value of 52.9%, F-1 score of 61.0%, and 0.0087 false alarms per patient hour, outperforming the standalone COMPOSER model. Prospective validation yielded similar results.
Gaber, F., Shaik, M., Allega, F. et al. Evaluating large language model workflows in clinical decision support for triage and referral and diagnosis. npj Digit. Med. 8, 263 (2025).
In this study, we benchmark multiple LLM versions and an LLM-based workflow incorporating retrieval-augmented generation (RAG) on a curated dataset of 2000 medical cases derived from the Medical Information Mart for Intensive Care database. Our findings show that these LLMs are capable of providing personalized insights into likely diagnoses, suggesting appropriate specialists, and assessing urgent care needs. These models may also support clinicians in refining diagnoses and decision-making, offering a promising approach to improving patient outcomes and streamlining healthcare delivery.
Yao, J., Wang, Y., Lei, Z. et al. Multimodal GPT model for assisting thyroid nodule diagnosis and management. npj Digit. Med. 8, 245 (2025).
We developed a multimodal generative pre-trained transformer for thyroid nodules (ThyGPT), aiming to provide a transparent and interpretable AI copilot model for thyroid nodule risk assessment and management. Ultrasound data from 59,406 patients across nine hospitals were retrospectively collected to train and test the model. After training, ThyGPT was found to assist in reducing biopsy rates by more than 40% without increasing missed diagnoses. In addition, it detected errors in ultrasound reports 1,610 times faster than humans. With the assistance of ThyGPT, the area under the curve for radiologists in assessing thyroid nodule risks improved from 0.805 to 0.908 (p < 0.001). As an AI-generated content-enhanced computer-aided diagnosis (AIGC-CAD) model, ThyGPT has the potential to revolutionize how radiologists use such tools.
Liu, X., Shi, S., Zhang, X. et al. The role of ChatGPT-4o in differential diagnosis and management of vertigo-related disorders. Sci Rep 15, 18688 (2025).
This study compared the diagnostic accuracy of an artificial intelligence chatbot and clinical experts for vertigo-related diseases and evaluated the chatbot’s ability to address vertigo-related issues. Twenty clinical questions about vertigo were input to ChatGPT-4o, and three otologists evaluated the responses using a 5-point Likert scale for accuracy, comprehensiveness, clarity, practicality, and credibility. The model’s diagnostic accuracy was comparable to a clinician with one year of experience but inferior to a clinician with five years of experience, and the differences in accuracy among the three methods were statistically significant (p = 0.04). ChatGPT-4o shows promise as a supplementary tool for managing vertigo but requires improvements in readability and diagnostic capabilities.
Li, J., Navathe, A.S. & Zhang, Y. Aligning incentives: the importance of behavioral economic perspectives in AI adoption. npj Health Syst. 2, 18 (2025).
We identify three behavioral economic aspects crucial for healthcare providers adopting AI: 1) cognitive overload, 2) risk aversion, and 3) social preferences. These perspectives offer a framework for understanding how AI should be integrated into healthcare settings in ways that ensure both technological efficiency and alignment with human-centric principles. Integrating AI into healthcare requires not only technological integration but also alignment with existing operational and incentive frameworks to ensure utilization and to prevent liability.
Hein, D., Christie, A., Holcomb, M. et al. Iterative refinement and goal articulation to optimize large language models for clinical information extraction. npj Digit. Med. 8, 301 (2025).
We present a novel, end-to-end pipeline leveraging large language models (LLMs) for highly accurate information extraction and normalization from unstructured pathology reports, focusing initially on kidney tumors. Applying the finalized pipeline to 2297 kidney tumor reports with pre-existing templated data available for validation yielded a macro-averaged F1 of 0.99 for six kidney tumor subtypes and 0.97 for detecting kidney metastasis.
Hou, Z., Liu, H., Bian, J. et al. Enhancing medical coding efficiency through domain-specific fine-tuned large language models. npj Health Syst. 2, 14 (2025).
In this study, we evaluated whether fine-tuning LLMs with specialized ICD-10 knowledge can automate code generation across clinical documentation. We adopted a two-phase approach: initial fine-tuning using 74,260 ICD-10 code–description pairs, followed by enhanced training to address linguistic and lexical variations. Evaluations using a proprietary model (GPT-4o mini) on a cloud platform and an open-source model (Llama) on local GPUs demonstrated that initial fine-tuning increased exact matching from <1% to 97%, while enhanced fine-tuning further improved performance in complex scenarios, with real-world clinical notes achieving 69.20% exact match and 87.16% category match. These findings indicate that domain-specific fine-tuned LLMs can reduce manual burdens and improve reliability.
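As a rough illustration of the kind of data preparation this study describes (not the authors' pipeline), the sketch below packages ICD-10 code–description pairs into a chat-style JSONL file suitable for supervised fine-tuning of an open LLM; the example codes and file name are placeholders.

```python
# Illustrative sketch only (not the authors' pipeline): packaging ICD-10
# code–description pairs into a chat-style JSONL file for supervised fine-tuning
# of an open LLM. The example codes and file name are placeholders.
import json

icd10_pairs = [
    ("E11.9", "Type 2 diabetes mellitus without complications"),
    ("I10", "Essential (primary) hypertension"),
]

with open("icd10_finetune.jsonl", "w") as f:
    for code, description in icd10_pairs:
        example = {
            "messages": [
                {"role": "user", "content": f"Assign the ICD-10 code for: {description}"},
                {"role": "assistant", "content": code},
            ]
        }
        f.write(json.dumps(example) + "\n")
```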
~
In the Eyes of GPTs
(AJ in conversation with ChatGPT to understand how machines view human healthcare)
Diagnosis is the first critical step in clinical care delivery. To optimize diagnostic accuracy and improve patient outcomes, health systems should also implement screening for prevention before diagnosis and monitoring of diagnostic quality after diagnosis.
What do you think? Is there anything else that should be added to an ideal diagnosis implementation model aimed at delivering timely and accurate diagnoses for all populations?
ChatGPT:
Diagnosis is the critical first step in clinical care delivery. To improve patient outcomes, health systems must not only optimize diagnostic accuracy but also implement proactive screening for prevention before diagnosis and systematic monitoring of diagnostic quality after diagnosis.
✅ Suggestions to Enhance an Ideal Medical Diagnosis Implementation:
In summary, to deliver timely and accurate diagnosis for all populations, an ideal implementation must go beyond clinical expertise alone—it must include equity-focused access, data-driven tools, feedback mechanisms, and responsible AI to support the entire diagnostic journey from prevention to post-diagnosis monitoring.
That's a fairly complete list. What features are usually not implemented in current clinical workflows?
ChatGPT:
Great question. While many health systems aspire to deliver timely and accurate diagnosis, several key features in your ideal list are often missing or inconsistently implemented in current clinical workflows—especially in low-resource or fragmented settings. Below is a structured breakdown:
❌ Features Usually Not Implemented (or Poorly Implemented) in Current Clinical Workflows
1. Systematic Diagnostic Quality Monitoring
2. Patient Involvement in Diagnostic Reasoning
3. Equity-Focused Diagnostic Access
4. AI or CDS Integration for Frontline Diagnostics
5. Population-Level Screening Based on Predictive Analytics
6. Training on Cognitive Bias and Diagnostic Uncertainty
7. Interoperable, Real-Time Data Aggregation
8. Learning Health System (LHS) Framework for Diagnosis
Ideal Feature | Current Status
Diagnostic quality monitoring | Rarely implemented
Patient involvement in diagnosis | Limited, unstructured
Equity-focused diagnostic access | Inconsistent or absent
AI/CDS for diagnosis | Limited deployment
Risk-based population screening | Rare outside pilot programs
Training on diagnostic error/cognitive bias | Largely missing
Real-time interoperable data | Fragmented
LHS-based diagnosis improvement | Rare
Wow—there's a lot missing! I'm going to start offering free training for clinical teams to study how GenAI can be integrated into the rarely implemented step of diagnostic quality monitoring. We need doctors to demonstrate the practical benefits of monitoring diagnostic quality in real-world settings.
Please provide an outline for a clinical evaluation study on how GenAI can help doctors monitor diagnostic quality using real-world data. Also, specify the most practical quantitative endpoints that can be used to measure effectiveness.
ChatGPT:
That’s a fantastic and impactful initiative. Integrating GenAI into diagnostic quality monitoring is not only innovative—it addresses a critical blind spot in modern healthcare delivery. Below is a detailed outline for a clinical evaluation study focused on using GenAI to assist doctors in monitoring and improving diagnostic quality using real-world data.
🧪 Study Outline: Evaluating GenAI for Diagnostic Quality Monitoring
🔹 Study Title:
Clinical Evaluation of a Generative AI–Assisted System for Monitoring Diagnostic Quality in Real-World Healthcare Settings
🔹 Study Objectives:
🔹 Study Design:
🔹 Intervention:
A GenAI-enabled tool that performs:
🔹 Data Sources:
🔹 Quantitative Endpoints:
Endpoint | Definition | Rationale
Diagnostic error rate | % of initial diagnoses changed within 30 days (e.g., via new imaging, lab, or specialist) | Quantifies misdiagnosis or premature closure
Diagnostic delay rate | % of patients with > X-day delay between first abnormal finding and confirmed diagnosis | Captures latency in decision-making
AI-identified diagnostic discrepancy rate | % of cases flagged by GenAI with meaningful discrepancy (confirmed by review) | Measures GenAI’s sensitivity to potential issues
Time to resolution of diagnostic uncertainty | Average time between uncertain initial diagnosis and final diagnosis | Captures improvement in resolving unclear cases
Clinician actionability rate | % of AI-flagged cases leading to a clinician review, follow-up test, or referral | Measures practical utility of GenAI alerts
Follow-up rate adherence | % of flagged cases with documented follow-up within the appropriate time window | Assesses impact on closing diagnostic loops
User trust score (Likert scale) | Survey-based scoring of clinicians' trust and satisfaction with GenAI feedback | Evaluates real-world usability
Cognitive load score (NASA-TLX or similar) | Self-reported cognitive load during the diagnostic process | Assesses whether GenAI reduces or adds burden
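To make these endpoints concrete, here is a minimal illustrative sketch of how the first two could be computed from de-identified case records; the field names and the 30-day and 14-day windows are assumptions, not requirements of the outline above.

```python
# Illustrative sketch only: computing two of the endpoints above from de-identified
# case records. Field names and the 30-day / 14-day windows are assumptions.
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Case:
    initial_dx: str
    initial_dx_date: date
    final_dx: str
    final_dx_date: date
    first_abnormal_finding_date: Optional[date] = None

def diagnostic_error_rate(cases: list[Case], window_days: int = 30) -> float:
    """% of initial diagnoses changed within the window (proxy for misdiagnosis or premature closure)."""
    changed = sum(
        1 for c in cases
        if c.final_dx != c.initial_dx
        and (c.final_dx_date - c.initial_dx_date).days <= window_days
    )
    return 100 * changed / len(cases)

def diagnostic_delay_rate(cases: list[Case], max_days: int = 14) -> float:
    """% of patients with more than max_days between first abnormal finding and confirmed diagnosis."""
    eligible = [c for c in cases if c.first_abnormal_finding_date is not None]
    delayed = sum(
        1 for c in eligible
        if (c.final_dx_date - c.first_abnormal_finding_date).days > max_days
    )
    return 100 * delayed / len(eligible)
```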
🔹 Qualitative Measures:
🔹 Ethics and Privacy:
🔹 Anticipated Outcomes:
🔹 Dissemination Plan:
In addition to comparing an intervention group with a control group, can we also compare a doctor’s diagnostic decision-making before and after seeing GenAI-generated predictions? What would be the benefits of such comparative effectiveness research?
ChatGPT:
Yes—comparing diagnostic decision-making before and after GenAI predictions by the same doctor is not only feasible but highly valuable. This approach is a form of within-subject comparative effectiveness research (CER), and it complements the intervention vs. control group design. It’s especially useful when:
✅ Study Design Add-On: Within-Physician Before/After Comparison
How It Works:
Benefit | Explanation
Controls for individual variability | Each doctor serves as their own control, removing between-clinician biases or skill differences
Sensitive to decision shifts | Captures subtle improvements, reversals, or confidence changes that might be missed in population-level outcomes
Real-world relevance | Reflects actual clinical decision-making flow where GenAI is used to augment, not replace, physician judgment
Faster feedback loop | Allows faster detection of GenAI's impact without waiting for long-term patient outcomes
Supports implementation science | Helps understand how GenAI affects human behavior, not just outcomes (e.g., when and why doctors override AI)
✅ Practical Metrics for Before/After Comparison
Metric | Measurement
Diagnosis change rate | % of cases where GenAI input led to a different final diagnosis
Management plan change rate | % of cases where test ordering, referrals, or treatment changed post-AI
Confidence delta | Self-reported confidence score before vs. after GenAI (e.g., 1–5 scale)
Time to decision | Seconds/minutes taken before vs. after GenAI input
Error reduction | % decrease in misdiagnoses or guideline deviations after GenAI input
Override justification analysis | Qualitative coding of why GenAI suggestions were accepted or rejected
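As a simple illustration of how such within-physician metrics could be tallied, the sketch below computes the diagnosis change rate and confidence delta from paired before/after records; the record fields, the 1–5 confidence scale, and the toy data are assumptions. A paired test such as the Wilcoxon signed-rank test could then be applied to the confidence changes.

```python
# Illustrative sketch only: tallying within-physician before/after metrics.
# The record fields, 1-5 confidence scale, and toy data are assumptions.
from statistics import mean

def diagnosis_change_rate(records: list[dict]) -> float:
    """% of cases where the final diagnosis differed after the GenAI prediction was shown."""
    changed = sum(1 for r in records if r["dx_before"] != r["dx_after"])
    return 100 * changed / len(records)

def confidence_delta(records: list[dict]) -> float:
    """Mean change in self-reported confidence (1-5) after seeing the GenAI prediction."""
    return mean(r["conf_after"] - r["conf_before"] for r in records)

# Toy paired records (hypothetical), one per case reviewed by the same doctor
records = [
    {"dx_before": "GERD", "dx_after": "unstable angina", "conf_before": 3, "conf_after": 4},
    {"dx_before": "migraine", "dx_after": "migraine", "conf_before": 4, "conf_after": 5},
    {"dx_before": "viral URI", "dx_after": "viral URI", "conf_before": 4, "conf_after": 4},
]

print(f"Diagnosis change rate: {diagnosis_change_rate(records):.1f}%")
print(f"Mean confidence delta: {confidence_delta(records):+.2f}")
```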
✅ Summary
Adding a before/after comparison within the same doctor:
~
Healthcare GenAI Toolbox
The ELHS platform makes it easy for patients, navigators, and doctors to harness the power of GenAI for clinical care and research:
✨ Learn GenAI with a multi-LLM copilot for training and education.
🧠 Meta AI: Meta AI helps you learn, create, and connect in new ways. With our latest features, you can now chat with recognizable voices 🗣️ and share images 🖼️ to explore and learn more about your surroundings.
~
🔹 ELHS Institute Relevant Resources 🔹
👉 For LLM Fine-Tuning Services, contact support@elhsi.org 📩
~ the end ~
Democratizing GenAI and LHS to Advance Global Health Equity
info@elhsi.org
Palo Alto, California, USA