(Paving the Way for Global Health Equity with Generative AI and Learning Health Systems)
Several landmark studies published last month in Nature-family journals, JAMA, and NEJM AI have brought GenAI-based diagnosis even closer to clinical application.
At Stanford University, Dr. Jonathan Chen, an AI-trained clinician, led a team that published a study in Communications Medicine testing the diagnostic impact of GPT-4. Their results showed that physicians' diagnostic accuracy improved when they used GenAI-generated prediction support.
At Harvard Medical School, Dr. Arjun Manrai's group compared open-source Llama 3.1 models to GPT-4 in generating differential diagnoses for complex clinical cases (JAMA Health Forum). They found that the open-source models performed at a level approaching that of their proprietary counterparts, independently corroborating a conclusion we reached last year in our own (as yet unpublished) benchmarking studies. Based on that early insight into the rapid advancement of open-source LLMs, we have been actively advocating for fine-tuning Llama 3.1 models for clinical GenAI research, and the medical community is now taking notice.
In another study, published in npj Digital Medicine, my friend Dr. James Li and colleagues at West China Hospital demonstrated that diagnostic accuracy can be further improved by a multi-agent conversation system built on GPT-4, highlighting the power of agent collaboration in complex reasoning tasks.
Meanwhile, in the imaging domain, Dr. Eric Topol and colleagues published a Nature Perspective article proposing a paradigm shift: the use of multimodal Generative Medical Imaging (GenMI) tools to assist physicians in drafting reliable radiology reports. They argue this could enhance diagnostic quality, reduce workload, support medical education, improve access to specialty care, and offer real-time expertise.
There’s so much exciting progress. I firmly believe that it's only a matter of time before GenAI becomes a true diagnostic copilot for clinicians. However, solid clinical evidence from real-world settings is urgently needed.
After benchmarking ChatGPT’s unprecedented disease prediction accuracy (JAMIA), we fine-tuned open-source Llama 3.1 models to achieve over 90% accuracy across a broad range of diseases. We are now making these models publicly available for clinical teams interested in conducting studies and generating much-needed clinical evidence of GenAI’s effectiveness.
I warmly invite doctors and clinical researchers to collaborate with us. Together, we can publish high-impact studies and accelerate the safe and equitable integration of GenAI into routine care.
Explore the latest research below, and enjoy my conversation with ChatGPT about GenAI clinical study design.
Warm regards,
AJ
AJ Chen, PhD
ELHS Platform | https://elhsi.com
ELHS Institute | https://elhsi.org
~
(Recent papers, news, and events showcasing the progress of GenAI and LHS)
Goh, E., Bunning, B., Khoong, E.C. et al. Physician clinical decision modification and bias assessment in a randomized controlled trial of AI assistance. Commun Med 5, 59 (2025).
[2025/3] Here we show that physicians are willing to modify their clinical decisions based on GPT-4 assistance, leading to improved accuracy scores from 47% to 65% in the white male patient group and 63% to 80% in the Black female patient group. The accuracy improvement occurs without introducing or exacerbating demographic biases, with both groups showing similar magnitudes of improvement (18%). A post-study survey indicates that 90% of physicians expect AI tools to play a significant role in future clinical decision making.
Buckley TA, Crowe B, Abdulnour RE, Rodman A, Manrai AK. Comparison of Frontier Open-Source and Proprietary Large Language Models for Complex Diagnoses. JAMA Health Forum. 2025;6(3):e250040.
[2025/3] Open-source LLM performed on par with GPT-4 in generating a differential diagnosis on complex diagnostic challenge cases. Our findings suggest an increasingly competitive landscape in LLM clinical decision support, and that institutions may be able to deploy high-performing custom models that run locally without sacrificing data privacy or flexibility. [JAMA interview]
Chen, X., Yi, H., You, M. et al. Enhancing diagnostic capability with multi-agents conversational large language models. npj Digit. Med. 8, 159 (2025).
[2025/3] We developed a Multi-Agent Conversation (MAC) framework for disease diagnosis, inspired by clinical Multi-Disciplinary Team discussions. Using 302 rare disease cases, we evaluated GPT-3.5, GPT-4, and MAC on medical knowledge and clinical reasoning. MAC outperformed single models in both primary and follow-up consultations, achieving higher accuracy in diagnoses and suggested tests. Optimal performance was achieved with four doctor agents and a supervisor agent, using GPT-4 as the base model. MAC demonstrated high consistency across repeated runs. Further comparative analysis showed MAC also outperformed other methods including Chain of Thoughts (CoT), Self-Refine, and Self-Consistency with higher performance and more output tokens. This framework significantly enhanced LLMs’ diagnostic capabilities, effectively bridging theoretical knowledge and practical clinical application. Our findings highlight the potential of multi-agent LLMs in healthcare and suggest further research into their clinical implementation.
Kopka, M., von Kalckreuth, N. & Feufel, M.A. Accuracy of online symptom assessment applications, large language models, and laypeople for self–triage decisions. npj Digit. Med. 8, 178 (2025).
[2025/3] A total of 1549 studies were screened and 19 included. The self-triage accuracy of SAAs was moderate but highly variable (11.5–90.0%), while the accuracy of LLMs (57.8–76.0%) and laypeople (47.3–62.4%) was moderate with low variability. Based on the available evidence, the use of SAAs or LLMs should neither be universally recommended nor discouraged; rather, we suggest that their utility should be assessed based on the specific use case and user group under consideration.
Rao, V.M., Hla, M., Moor, M. et al. Multimodal generative AI for medical image interpretation. Nature 639, 888–896 (2025).
[2025/3] We advocate for a novel paradigm to deploy GenMI in a manner that empowers clinicians and their patients. Initial research suggests that GenMI could one day match human expert performance in generating reports across disciplines, such as radiology, pathology and dermatology.
AI to Assist in the Fetal Anomaly Ultrasound Scan: A Randomized Controlled Trial
T.G. Day and Others, NEJM AI 2025;2(4)
[2025/3] AI assistance in the routine fetal anomaly ultrasound scan results in significant time savings, and a reduction in sonographer cognitive load, without a reduction in diagnostic performance.
Randomized Trial of a Generative AI Chatbot for Mental Health Treatment
M.V. Heinz and Others, NEJM AI 2025;2(4)
[2025/3] This is the first RCT demonstrating the effectiveness of a fully Gen-AI therapy chatbot for treating clinical-level mental health symptoms. The results were promising for MDD, GAD, and CHR-FED symptoms. Therabot was well utilized and received high user ratings. Fine-tuned Gen-AI chatbots offer a feasible approach to delivering personalized mental health interventions at scale, although further research with larger clinical samples is needed to confirm their effectiveness and generalizability.
Zhukovsky P, Trivedi MH, Weissman M, Parsey R, Kennedy S, Pizzagalli DA. Generalizability of Treatment Outcome Prediction Across Antidepressant Treatment Trials in Depression. JAMA Netw Open. 2025;8(3):e251310. doi:10.1001/jamanetworkopen.2025.1310
[2025/3] Can neuroimaging and clinical features predict response to sertraline and escitalopram in patients with major depressive disorder across 2 large multisite studies? In this prognostic study of depression outcomes, among 363 patients in 2 trials, the best-performing models using pretreatment clinical features and functional connectivity of the dorsal anterior cingulate showed substantial cross-trial generalizability. The addition of neuroimaging features significantly improved prediction performance of antidepressant response compared with models including only clinical features.
Xie, Q., Chen, Q., Chen, A. et al. Medical foundation large language models for comprehensive text analysis and beyond. npj Digit. Med. 8, 141 (2025).
[2025/3] We present Me-LLaMA, a family of open-source medical LLMs integrating extensive domain-specific knowledge with robust instruction-following capabilities. Me-LLaMA is developed through continual pretraining and instruction tuning of LLaMA2 models using diverse biomedical and clinical data sources (e.g., biomedical literature and clinical notes). We evaluated Me-LLaMA on six text analysis tasks using 12 benchmarks (e.g., PubMedQA and MIMIC-CXR) and assessed its clinical utility in complex case diagnosis through automatic and human evaluations. Me-LLaMA outperforms existing open medical LLMs in zero-shot and supervised settings and surpasses ChatGPT and GPT-4 after task-specific instruction tuning for most text analysis tasks. Its performance is also comparable to ChatGPT and GPT-4 for diagnosing complex clinical cases. Our findings highlight the importance of combining domain-specific continual pretraining with instruction tuning to enhance performance in medical LLMs.
Yuan, K., Yoon, C.H., Gu, Q. et al. Transformers and large language models are efficient feature extractors for electronic health record studies. Commun Med 5, 83 (2025).
[2025/3] A zero-shot OpenAI GPT4 model matches the performance of traditional NLP models without the need for labelled training data (F1 = 0.71 and 0.86) and a fine-tuned GPT3.5 model achieves similar performance to the fine-tuned BERT-based model (F1 = 0.95 and 0.97). Infection sources obtained from free-text indications reveal specific infection sources 31% more often than ICD-10 codes. Modern transformer-based models have the potential to be used widely throughout medicine to extract information from structured free-text records, to facilitate better research and patient care.
Grothey, B., Odenkirchen, J., Brkic, A. et al. Comprehensive testing of large language models for extraction of structured data in pathology. Commun Med 5, 96 (2025).
[2025/3] Open-source language models demonstrate comparable performance to proprietary solutions in structuring pathology report data. This finding has significant implications for healthcare institutions seeking cost-effective, privacy-preserving data structuring solutions. The variations in model performance across different configurations provide valuable insights for practical deployment in pathology departments.
~
(AJ in conversation with ChatGPT to understand how machines view human healthcare)
Let’s talk about GenAI clinical research. Collaborating with a clinical team, we are planning to evaluate the effect of disease prediction by a fine-tuned Llama model on doctors’ diagnostic decisions in a comparative prospective study. Along the care delivery process—including clinical training for residents—what endpoints or process steps can be studied to measure the effect of GenAI?
ChatGPT
That’s an exciting and timely study design. Evaluating the effect of LLM-based disease prediction in real clinical settings, especially along the full care delivery process, can generate impactful and much-needed evidence. Here's a breakdown of key endpoints and care process steps you can study to assess the effect of GenAI on both physician performance and patient outcomes:
🧠 1. Diagnostic Decision-Making
🧑⚕️ 2. Clinical Training for Residents
📈 4. Clinical Outcomes and Quality of Care
❤️ 5. Patient-Centered Outcomes
Let’s focus on one specific endpoint in resident training: comparing a resident’s diagnostic quality before and after using LLM-generated prediction information. We can have experienced doctors review and score the residents’ diagnostic decisions. But how should we measure diagnostic quality?
ChatGPT
Great focus — measuring diagnostic quality in resident training before and after LLM assistance is a clear and measurable endpoint. Having experienced physicians review and score resident diagnoses is a solid approach. Below are the key considerations and tools you can use to measure diagnostic quality rigorously and reproducibly.
✅ Approach: Expert Panel Scoring of Resident Diagnoses
📏 Scoring the Quality of Diagnosis
Option 1: Structured Scoring Rubric (e.g., 0–5 scale)
Experts can rate based on:
| Criterion | Description | Score Range |
|---|---|---|
| Correctness | How accurate is the diagnosis? | 0 = completely incorrect, 5 = fully correct |
| Completeness | Are key differentials and red flags considered? | 0–5 |
| Prioritization | Is the most likely diagnosis listed first? | 0–5 |
| Clinical reasoning | Are justifications aligned with evidence? | 0–5 |
| Appropriateness of plan | Are next steps (tests/treatments) suitable? | 0–5 |
🔢 Final score = Sum or weighted average of all criteria (e.g., out of 25)
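To make the rubric concrete, here is a minimal Python sketch of turning one expert's criterion ratings into a composite score out of 25. The criterion names and the option to weight them are illustrative assumptions, not part of any published instrument; equal weights reproduce the simple sum described above.

```python
# Minimal sketch: composite diagnostic-quality score from a 0-5 rubric.
# Criterion names and weights are illustrative assumptions.

CRITERIA = ["correctness", "completeness", "prioritization",
            "clinical_reasoning", "appropriateness_of_plan"]

def composite_score(ratings, weights=None):
    """Return a weighted rubric score scaled to a 0-25 range."""
    weights = weights or {c: 1.0 for c in CRITERIA}
    total_weight = sum(weights[c] for c in CRITERIA)
    weighted = sum(weights[c] * ratings[c] for c in CRITERIA)
    # Scale so that all-5 ratings always map to 25, regardless of weights.
    return 25.0 * weighted / (5.0 * total_weight)

# Example: one expert's ratings for one resident diagnosis.
ratings = {"correctness": 4, "completeness": 3, "prioritization": 5,
           "clinical_reasoning": 4, "appropriateness_of_plan": 3}
print(round(composite_score(ratings), 1))  # 19.0 with equal weights
```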
Option 2: Gold Standard Comparison (Binary or Rank-based)
🧪 Example Diagnostic Quality Scoring
| Resident | Diagnosis | LLM Used? | Expert Score (/25) |
|---|---|---|---|
| R1 | Influenza | ❌ | 15 |
| R1 | Influenza (after LLM) | ✅ | 22 |
| R2 | Meningitis | ❌ | 10 |
| R2 | Meningitis (after LLM) | ✅ | 19 |
👥 Expert Review Considerations
📈 Analysis Options
We will use the binary gold standard comparison method to score the differential diagnoses. To make sure the pre- versus post-LLM comparison is adequately powered to detect a statistically significant difference, how can we calculate the required number of residents and the number of patient cases per resident?
ChatGPT
Perfect — using a binary gold standard comparison (e.g., correct diagnosis in top 1 or top 3) sets you up for paired binary outcome analysis, which is well suited to McNemar's test (for pre/post comparisons on the same subjects).
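As an illustration of the paired analysis referred to here, a minimal sketch using the McNemar test from statsmodels on hypothetical pre/post correctness counts; the 2x2 table values are invented for demonstration, not study data.

```python
# Minimal sketch: McNemar's test on paired pre-LLM vs post-LLM correctness.
# The 2x2 counts below are hypothetical.
from statsmodels.stats.contingency_tables import mcnemar

# Rows: pre-LLM correct? (yes/no); columns: post-LLM correct? (yes/no)
#               post_yes  post_no
table = [[120,       5],   # pre correct
         [ 40,      35]]   # pre incorrect
# Only the discordant cells (5 and 40) drive the test.
result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
print(f"statistic={result.statistic}, p-value={result.pvalue:.4f}")
```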
✅ Assumptions for Sample Size Calculation
To estimate the sample size, we need:
1. Effect Size (Δ)
Expected improvement in accuracy after using LLM.
Let’s assume:
2. Discordant Pairs
Only discordant pairs, cases where the diagnosis is correct at one time point (pre or post) but not the other, contribute to McNemar's test.
Assume:
3. Statistical Parameters
📊 Sample Size Calculation (Per Resident)
Option 1: Use Power Tables or Calculator for McNemar's Test
Example Estimate:
If you expect:
Then you would need:
👩⚕️👨⚕️ Residents Needed
If each resident reviews ~40 cases, and you aim to observe the LLM effect across groups:
For a more efficient design, consider:
🔁 Summary Table
| Design | Minimum Estimate |
|---|---|
| Residents | 10–15 |
| Cases per Resident | 30–50 |
| Total Pre/Post Comparisons | 300–600 |
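For reference, a minimal sketch of the standard paired-proportions (McNemar) sample-size approximation behind estimates like those in the table; the discordant-pair probabilities, alpha, and power are assumed values for illustration.

```python
# Minimal sketch: approximate number of paired cases needed for McNemar's test
# (normal-approximation formula). The discordance rates are assumptions.
from math import ceil, sqrt
from scipy.stats import norm

def mcnemar_pairs(p10, p01, alpha=0.05, power=0.80):
    """p10: P(incorrect pre, correct post); p01: P(correct pre, incorrect post)."""
    d = p10 - p01        # net improvement, driven only by discordant pairs
    p_disc = p10 + p01   # total discordance rate
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    n = (z_a * sqrt(p_disc) + z_b * sqrt(p_disc - d ** 2)) ** 2 / d ** 2
    return ceil(n)

# Example: 20% of cases flip to correct after LLM support, 5% flip to incorrect.
print(mcnemar_pairs(p10=0.20, p01=0.05))  # ~85 paired cases
```

Under these assumed rates, roughly 85 paired cases would be needed, so 10–15 residents contributing 30–50 cases each (300–600 comparisons) leaves comfortable headroom.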
🔧 Pro Tip
You can run a small pilot (e.g., 5 residents × 10 cases) to get an empirical estimate of the discordant-pair rate and effect size before finalizing the sample size.
Our comparative effectiveness research (CER) will be conducted as a prospective clinical trial within a hospital setting. The key questions are: How should resident participants be appropriately selected? And how will each resident identify eligible patients for the study?
ChatGPT
Great questions. In a prospective CER study—especially involving residents using a GenAI tool in a real hospital setting—resident and patient selection are key for minimizing bias and ensuring feasibility.
Here's a practical and methodologically sound approach to both:
👨⚕️👩⚕️ 1. Resident Selection
🎯 Goal: Ensure a representative and engaged group of residents who see a variety of clinical cases and are available for the study duration.
✅ Selection Criteria
✳️ Optional Stratification
🏥 2. Patient Selection by Each Resident
🎯 Goal: Allow residents to apply the LLM in real-time to a defined set of diagnostic cases without interfering with patient care.
✅ Recommended Methods
A. Case Eligibility Criteria
Patients seen by residents should:
Example: Patients seen during intake, admissions, or initial consults.
B. Patient Selection Methods
Choose one of these approaches depending on your workflow:
| Method | How it works | Pros | Cons |
|---|---|---|---|
| Consecutive sampling | Residents include all eligible patients seen during a shift | Minimizes selection bias | May be logistically harder |
| Random sampling | Select random eligible cases (e.g., every 3rd new patient) | Reduces selection bias, manageable | Requires planning |
| Quota sampling | Residents include a fixed number of cases per shift/day | Easy to manage workload | Some selection bias possible |
| Triggered sampling | Residents only include cases where they would naturally consult GenAI | Measures real-world adoption | Less generalizable |
🔍 Important: Residents should log both pre-LLM and post-LLM diagnostic decisions for each patient.
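To make that pre/post logging concrete, here is a minimal sketch of one capture record per eligible case; the field names and file format are illustrative assumptions, not a prescribed data dictionary.

```python
# Minimal sketch: one log record per eligible patient case, capturing the
# resident's differential before and after seeing the LLM output.
# Field names are illustrative; lists are stored as plain strings in the CSV.
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import csv

@dataclass
class CaseRecord:
    resident_id: str
    case_id: str                     # de-identified patient/case identifier
    pre_llm_differential: list       # ranked differential before LLM output
    post_llm_differential: list      # ranked differential after LLM output
    llm_top_predictions: list        # what the fine-tuned model suggested
    gold_standard_dx: str = ""       # filled in later by the expert panel
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

record = CaseRecord(
    resident_id="R07", case_id="C0123",
    pre_llm_differential=["viral pneumonia", "bronchitis"],
    post_llm_differential=["viral pneumonia", "pulmonary embolism", "bronchitis"],
    llm_top_predictions=["pulmonary embolism", "viral pneumonia"],
)

with open("case_log.csv", "a", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=asdict(record).keys())
    if f.tell() == 0:          # write the header only for a new, empty file
        writer.writeheader()
    writer.writerow(asdict(record))
```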
🧾 Data Capture Tips
📊 Final Study Design Snapshot
In designing the study, our primary goal is to generate reliable and generalizable clinical evidence. Given these objectives, a critical question arises: what is the optimal approach to including both residents and patients in the study?
ChatGPT
Excellent point—when the core goal of your GenAI diagnostic study is to generate reliable and generalizable clinical evidence, your study design must ensure both internal validity (accuracy) and external validity (generalizability). Here's how that shapes resident and patient inclusion strategies:
✅ Key Objectives for Your Design
👨⚕️👩⚕️ Best Approach for Including Residents
✅ 1. Use a Stratified Random Sampling Approach
🟢 Why? Ensures diversity of experience while preserving real-world decision-making behaviors. Prevents only “tech-savvy” or high-performers from dominating the study sample.
🏥 Best Approach for Including Patients
✅ 2. Use Consecutive Case Sampling Within Predefined Inclusion Criteria
🟢 Why? Consecutive sampling is the gold standard for avoiding selection bias in prospective clinical studies. It ensures real-world variation in case types, complexity, and demographics.
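As a sketch of the stratified-random resident selection plus consecutive patient inclusion described above, assuming a simple roster keyed by PGY year (the roster and per-stratum sample sizes are hypothetical):

```python
# Minimal sketch: stratified random sampling of residents by PGY year.
# The roster and per-stratum sample sizes are hypothetical.
import random

roster = {
    "PGY-1": ["R01", "R02", "R03", "R04", "R05", "R06"],
    "PGY-2": ["R07", "R08", "R09", "R10"],
    "PGY-3": ["R11", "R12", "R13"],
}
per_stratum = {"PGY-1": 4, "PGY-2": 3, "PGY-3": 2}

rng = random.Random(42)  # fixed seed so the selection is documented and reproducible
selected = {year: rng.sample(residents, per_stratum[year])
            for year, residents in roster.items()}
print(selected)

# Patients: each selected resident then enrolls consecutive eligible cases
# (every new diagnostic encounter meeting the inclusion criteria) until the
# case quota is reached, rather than hand-picking cases.
```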
⚖️ Balance Between Realism and Rigor
| Design Decision | Internal Validity | External Validity |
|---|---|---|
| Random resident inclusion | 🟢 | 🟢 |
| Stratified resident groups | 🟢 | 🟢 |
| Consecutive patient sampling | 🟢 | 🟢 |
| Diagnostic cases only | 🟢 | 🔴 (limits scope, but acceptable) |
| Real-time LLM usage | 🟡 (introduces variation) | 🟢 |
🔧 Practical Implementation Plan
For Residents:
For Patients:
✨ Final Recommendation
✔️ Stratified-random resident sampling
✔️ Consecutive patient sampling during clinical care
✔️ Use real-time GenAI use to simulate real-world practice
✔️ Blinded expert review to assess diagnostic quality
These strategies will maximize evidence quality, clinical realism, and generalizability to support future deployment or policy decisions.
Let us examine the resident selection strategy more closely. One common assumption is that the resident group may be too large to feasibly include every individual. For this reason, many studies opt to randomly select 10–15 residents from within a stratified group (e.g., all PGY-1 residents) to balance feasibility and representativeness. However, this assumption does not always hold true.
In our setting, it may be practically feasible to include the entire cohort of PGY-1 residents within one department. If this is the case, and considering our overarching goal of maximizing generalizability, would it be preferable to include all residents rather than conducting a random sampling? Including the full cohort ensures a more complete representation of real-world diagnostic behavior and avoids the selection bias that may result from sampling only a subset of participants. What do you think?
ChatGPT
That’s a very thoughtful and important question — and you're absolutely right to question whether random selection is always necessary, especially when it’s feasible to include an entire resident group. Let’s walk through this carefully, based on your generalizability goal and practical constraints:
✅ Core Principles
🧪 Internal Validity
🌍 External Validity (Generalizability)
🔍 Scenario Comparison: Include All vs. Randomly Select
| Scenario | Pros | Cons |
|---|---|---|
| Include All PGY-1s in Department | ✅ More data | 🔸 Slightly more complexity in coordination |
| Randomly Select 10–15 Residents | ✅ Easier to manage | 🔸 Risk of selection bias if non-participants differ systematically |
🧠 Key Considerations
1. If the entire group is available and willing to participate
✅ Include all residents
This maximizes generalizability within that training year and department, eliminates self-selection bias, and gives you more statistical power.
2. If only some residents are available or willing
✅ Use stratified random sampling
This controls for variation and ensures representativeness, as long as the selection process is transparent and documented.
📝 Special Note on Generalizability
Including all PGY-1 residents in one department:
🔬 Limitation: You may still need multi-site or multi-cohort studies to generalize beyond your hospital or specialty — but including the full resident group is a strong foundation.
✅ Final Recommendation (Given Feasibility)
Yes — include all PGY-1 residents in the department if logistically practical.
This supports your goal of generating reliable and generalizable clinical evidence, without sacrificing rigor.
~
The new Copilot Platform makes it easy for patients, navigators, and doctors to harness the power of GenAI for research and real-world applications: ✨ Learn GenAI with a multi-LLM copilot — perfect for exploration and education.
🧠 Meta AI: Meta AI helps you learn, create, and connect in new ways. With the latest features, you can now chat with recognizable voices 🗣️ and share images 🖼️ to explore and learn more about your surroundings.
~
🔹 ELHS Institute Relevant Resources 🔹
▶️ ELHS Videos
👉For LLM Fine-Tuning Services, Contact support@elhsi.org 📩
~ the end ~
Democratizing GenAI and LHS in Healthcare to Help Achieve Global Health Equity
info@elhsi.org
Palo Alto, California, USA