More Progress in Open-Source LLMs and GenAI Agents for Medical Diagnosis

 
ELHS Newsletter 2025-04-04

Subscribe | All Issues

 

Dear Friends,

Several landmark studies published last month in the Nature, JAMA, and NEJM families of journals have brought GenAI-based diagnosis even closer to clinical application.

 

At Stanford University, Dr. Jonathan Chen—an AI-trained clinician—led a team that published a study in Communications Medicine (a Nature Portfolio journal) testing the diagnostic impact of GPT-4. Their results showed that physicians improved diagnostic accuracy when using GenAI-generated prediction support.

 

At Harvard Medical School, Dr. Arjun Manrai’s group compared open-source Llama 3.1 models to GPT-4 in generating differential diagnoses for complex clinical cases (JAMA Health Forum). They found that the open-source models performed at a level approaching that of their proprietary counterparts. This independently verifies a conclusion we reached last year in our own, as-yet-unpublished benchmarking studies. Based on our early insight into the rapid advancement of open-source LLMs, we have been actively advocating for fine-tuning Llama 3.1 models for clinical GenAI research—and the medical community is now taking notice.

 

In another study, published in npj Digital Medicine, my friend Dr. James Li and colleagues at West China Hospital demonstrated that diagnostic accuracy can be further improved by employing a multi-agent conversation system built on GPT-4—highlighting the power of agent collaboration in complex reasoning tasks.

 

Meanwhile, in the imaging domain, Dr. Eric Topol and colleagues published a Nature Perspective article proposing a paradigm shift: the use of multimodal Generative Medical Imaging (GenMI) tools to assist physicians in drafting reliable radiology reports. They argue this could enhance diagnostic quality, reduce workload, support medical education, improve access to specialty care, and offer real-time expertise.

 

There’s so much exciting progress. I firmly believe that it's only a matter of time before GenAI becomes a true diagnostic copilot for clinicians. However, solid clinical evidence from real-world settings is urgently needed.

After benchmarking ChatGPT’s unprecedented disease prediction accuracy (JAMIA), we fine-tuned open-source Llama 3.1 models to achieve over 90% accuracy across a broad range of diseases. We are now making these models publicly available for clinical teams interested in conducting studies and generating much-needed clinical evidence of GenAI’s effectiveness.

 

I warmly invite doctors and clinical researchers to collaborate with us. Together, we can publish high-impact studies and accelerate the safe and equitable integration of GenAI into routine care.

 

Explore the latest research below, and enjoy my conversation with ChatGPT about GenAI clinical study design.

 

Warm regards,
AJ


AJ Chen, PhD
ELHS Platform | https://elhsi.com
ELHS Institute | https://elhsi.org

 

~

 

 

 

From Page Mill

 

Goh, E., Bunning, B., Khoong, E.C. et al. Physician clinical decision modification and bias assessment in a randomized controlled trial of AI assistance. Commun Med 5, 59 (2025).

[2025/3] Here we show that physicians are willing to modify their clinical decisions based on GPT-4 assistance, leading to improved accuracy scores from 47% to 65% in the white male patient group and 63% to 80% in the Black female patient group. The accuracy improvement occurs without introducing or exacerbating demographic biases, with both groups showing similar magnitudes of improvement (18%). A post-study survey indicates that 90% of physicians expect AI tools to play a significant role in future clinical decision making.

 

Buckley TA, Crowe B, Abdulnour RE, Rodman A, Manrai AK. Comparison of Frontier Open-Source and Proprietary Large Language Models for Complex Diagnoses. JAMA Health Forum. 2025;6(3):e250040.

[2025/3] Open-source LLM performed on par with GPT-4 in generating a differential diagnosis on complex diagnostic challenge cases. Our findings suggest an increasingly competitive landscape in LLM clinical decision support, and that institutions may be able to deploy high-performing custom models that run locally without sacrificing data privacy or flexibility.  [JAMA interview]

 

Chen, X., Yi, H., You, M. et al. Enhancing diagnostic capability with multi-agents conversational large language models. npj Digit. Med. 8, 159 (2025).

[2025/3] We developed a Multi-Agent Conversation (MAC) framework for disease diagnosis, inspired by clinical Multi-Disciplinary Team discussions. Using 302 rare disease cases, we evaluated GPT-3.5, GPT-4, and MAC on medical knowledge and clinical reasoning. MAC outperformed single models in both primary and follow-up consultations, achieving higher accuracy in diagnoses and suggested tests. Optimal performance was achieved with four doctor agents and a supervisor agent, using GPT-4 as the base model. MAC demonstrated high consistency across repeated runs. Further comparative analysis showed MAC also outperformed other methods including Chain of Thoughts (CoT), Self-Refine, and Self-Consistency with higher performance and more output tokens. This framework significantly enhanced LLMs’ diagnostic capabilities, effectively bridging theoretical knowledge and practical clinical application. Our findings highlight the potential of multi-agent LLMs in healthcare and suggest further research into their clinical implementation.

 

Kopka, M., von Kalckreuth, N. & Feufel, M.A. Accuracy of online symptom assessment applications, large language models, and laypeople for self–triage decisions. npj Digit. Med. 8, 178 (2025). 

[2025/3] A total of 1549 studies were screened and 19 included. The self-triage accuracy of SAAs was moderate but highly variable (11.5–90.0%), while the accuracy of LLMs (57.8–76.0%) and laypeople (47.3–62.4%) was moderate with low variability. Based on the available evidence, the use of SAAs or LLMs should neither be universally recommended nor discouraged; rather, we suggest that their utility should be assessed based on the specific use case and user group under consideration.

 

Rao, V.M., Hla, M., Moor, M. et al. Multimodal generative AI for medical image interpretation. Nature 639, 888–896 (2025).

[2025/3] We advocate for a novel paradigm to deploy GenMI in a manner that empowers clinicians and their patients. Initial research suggests that GenMI could one day match human expert performance in generating reports across disciplines, such as radiology, pathology and dermatology.

 

AI to Assist in the Fetal Anomaly Ultrasound Scan: A Randomized Controlled Trial

T.G. Day and Others, NEJM AI 2025;2(4)

[2025/3] AI assistance in the routine fetal anomaly ultrasound scan results in significant time savings, and a reduction in sonographer cognitive load, without a reduction in diagnostic performance.

 

Randomized Trial of a Generative AI Chatbot for Mental Health Treatment

M.V. Heinz and Others, NEJM AI 2025;2(4)

[2025/3] This is the first RCT demonstrating the effectiveness of a fully Gen-AI therapy chatbot for treating clinical-level mental health symptoms. The results were promising for MDD, GAD, and CHR-FED symptoms. Therabot was well utilized and received high user ratings. Fine-tuned Gen-AI chatbots offer a feasible approach to delivering personalized mental health interventions at scale, although further research with larger clinical samples is needed to confirm their effectiveness and generalizability. 

 

Zhukovsky P, Trivedi MH, Weissman M, Parsey R, Kennedy S, Pizzagalli DA. Generalizability of Treatment Outcome Prediction Across Antidepressant Treatment Trials in Depression. JAMA Netw Open. 2025;8(3):e251310. doi:10.1001/jamanetworkopen.2025.1310

[2025/3] Can neuroimaging and clinical features predict response to sertraline and escitalopram in patients with major depressive disorder across 2 large multisite studies? In this prognostic study of depression outcomes, among 363 patients in 2 trials, the best-performing models using pretreatment clinical features and functional connectivity of the dorsal anterior cingulate showed substantial cross-trial generalizability. The addition of neuroimaging features significantly improved prediction performance of antidepressant response compared with models including only clinical features.

 

Xie, Q., Chen, Q., Chen, A. et al. Medical foundation large language models for comprehensive text analysis and beyond. npj Digit. Med. 8, 141 (2025).

[2025/3] We present Me-LLaMA, a family of open-source medical LLMs integrating extensive domain-specific knowledge with robust instruction-following capabilities. Me-LLaMA is developed through continual pretraining and instruction tuning of LLaMA2 models using diverse biomedical and clinical data sources (e.g., biomedical literature and clinical notes). We evaluated Me-LLaMA on six text analysis tasks using 12 benchmarks (e.g., PubMedQA and MIMIC-CXR) and assessed its clinical utility in complex case diagnosis through automatic and human evaluations. Me-LLaMA outperforms existing open medical LLMs in zero-shot and supervised settings and surpasses ChatGPT and GPT-4 after task-specific instruction tuning for most text analysis tasks. Its performance is also comparable to ChatGPT and GPT-4 for diagnosing complex clinical cases. Our findings highlight the importance of combining domain-specific continual pretraining with instruction tuning to enhance performance in medical LLMs.

 

Yuan, K., Yoon, C.H., Gu, Q. et al. Transformers and large language models are efficient feature extractors for electronic health record studies. Commun Med 5, 83 (2025).

[2025/3] A zero-shot OpenAI GPT4 model matches the performance of traditional NLP models without the need for labelled training data (F1 = 0.71 and 0.86) and a fine-tuned GPT3.5 model achieves similar performance to the fine-tuned BERT-based model (F1 = 0.95 and 0.97). Infection sources obtained from free-text indications reveal specific infection sources 31% more often than ICD-10 codes. Modern transformer-based models have the potential to be used widely throughout medicine to extract information from structured free-text records, to facilitate better research and patient care.

 

Grothey, B., Odenkirchen, J., Brkic, A. et al. Comprehensive testing of large language models for extraction of structured data in pathology. Commun Med 5, 96 (2025).

[2025/3] Open-source language models demonstrate comparable performance to proprietary solutions in structuring pathology report data. This finding has significant implications for healthcare institutions seeking cost-effective, privacy-preserving data structuring solutions. The variations in model performance across different configurations provide valuable insights for practical deployment in pathology departments. 

 

~

 

 

 

 

In the Eyes of GPTs

 

 Let’s talk about GenAI clinical research. Collaborating with a clinical team, we are planning to evaluate the effect of disease prediction by a fine-tuned Llama model on doctors’ diagnostic decisions in a comparative prospective study. Along the care delivery process—including clinical training for residents—what endpoints or process steps can be studied to measure the effect of GenAI?

 

 ChatGPT

 

That’s an exciting and timely study design. Evaluating the effect of LLM-based disease prediction in real clinical settings—especially along the full care delivery process—can generate impactful and much-needed evidence. Here's a breakdown of key endpoints and care process steps you can study to assess the effect of GenAI, both on physicians and patient outcomes:

🧠 1. Diagnostic Decision-Making

  • Diagnostic accuracy (compared to final diagnosis / gold standard)
  • Time to diagnosis (faster vs. standard process)
  • Confidence in diagnosis (self-reported by physicians)
  • Differential diagnosis breadth and quality
  • Error type reduction (e.g., missed rare diseases, cognitive biases corrected)

🧑‍⚕️ 2. Clinical Training for Residents

  • Improvement in diagnostic reasoning skills (pre/post test)
  • Knowledge retention when assisted by GenAI
  • Learning curve acceleration (e.g., fewer cases needed to reach a diagnostic milestone)
  • Case-based teaching effectiveness using LLM vs. human tutor vs. control

📈 3. Clinical Outcomes and Quality of Care

  • Time to treatment initiation
  • Unnecessary tests or procedures avoided
  • Hospital stay length
  • Readmission rates
  • Adverse event rates

❤️ 4. Patient-Centered Outcomes

  • Patient satisfaction or trust (with physician using GenAI)
  • Shared decision-making quality
  • Health literacy improvement (if LLM explanations are shared)
  • Adherence to treatment (if GenAI helps clarify plan)
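
To make these endpoints concrete, here is a minimal sketch of a per-case study record that could capture several of them. The field names are illustrative assumptions for this sketch, not a prescribed schema.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative per-case record for a GenAI diagnostic study.
# All field names are assumptions for this sketch, not a required schema.
@dataclass
class CaseEndpoints:
    case_id: str
    resident_id: str
    diagnostic_accuracy_pre: float        # e.g., gold-standard score before LLM
    diagnostic_accuracy_post: float       # same score after LLM assistance
    minutes_to_diagnosis: float           # time-to-diagnosis endpoint
    confidence_pre: int                   # self-reported, e.g., 1-5
    confidence_post: int
    time_to_treatment_hours: Optional[float] = None
    length_of_stay_days: Optional[float] = None
    readmitted_within_30d: Optional[bool] = None
    patient_satisfaction: Optional[int] = None   # e.g., 1-5 survey item
```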

 

 

 Let’s focus on one specific endpoint in resident training: comparing a resident’s diagnostic quality before and after using LLM-generated prediction information. We can have experienced doctors review and score the residents’ diagnostic decisions. But how should we measure diagnostic quality?

 

 ChatGPT

 

Great focus — measuring diagnostic quality in resident training before and after LLM assistance is a clear and measurable endpoint. Having experienced physicians review and score resident diagnoses is a solid approach. Below are the key considerations and tools you can use to measure diagnostic quality rigorously and reproducibly.

✅ Approach: Expert Panel Scoring of Resident Diagnoses

  1. Residents review clinical cases and provide:
    • Primary diagnosis
    • Differential diagnoses
    • Next steps (tests, management plan)
  2. After LLM exposure, residents revise or reaffirm their decisions.
  3. Expert reviewers score both pre- and post-LLM responses in a blinded, structured way.

📏 Scoring the Quality of Diagnosis

Option 1: Structured Scoring Rubric (e.g., 0–5 scale)

Experts can rate based on:

| Criterion | Description | Score Range |
| --- | --- | --- |
| Correctness | How accurate is the diagnosis? | 0 = completely incorrect, 5 = fully correct |
| Completeness | Are key differentials and red flags considered? | 0–5 |
| Prioritization | Is the most likely diagnosis listed first? | 0–5 |
| Clinical reasoning | Are justifications aligned with evidence? | 0–5 |
| Appropriateness of plan | Are next steps (tests/treatments) suitable? | 0–5 |

🔢 Final score = Sum or weighted average of all criteria (e.g., out of 25)
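
As a quick illustration of the "sum or weighted average" step, a minimal sketch follows; the criterion weights are assumptions for this example only.

```python
# Each criterion is scored 0-5 by an expert reviewer (illustrative values).
rubric = {
    "correctness": 4,
    "completeness": 3,
    "prioritization": 5,
    "clinical_reasoning": 4,
    "plan_appropriateness": 3,
}

total = sum(rubric.values())                      # simple sum out of 25
print(f"Total score: {total}/25")

# Optional weighting, e.g., emphasizing correctness (weights are illustrative).
weights = {"correctness": 2.0, "completeness": 1.0, "prioritization": 1.0,
           "clinical_reasoning": 1.0, "plan_appropriateness": 1.0}
weighted = sum(rubric[c] * weights[c] for c in rubric) / sum(weights.values())
print(f"Weighted average (0-5 scale): {weighted:.2f}")
```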


Option 2: Gold Standard Comparison (Binary or Rank-based)

  • Define gold standard diagnosis and differentials per case.
  • Score:
    • 1 point if correct diagnosis is in top 1
    • 0.5 points if in top 3
    • 0 otherwise
  • Or use ranking-based evaluation (e.g., Normalized Discounted Cumulative Gain – NDCG)
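
A minimal sketch of the binary top-1/top-3 scoring rule described above; a rank-based metric such as NDCG would replace this function if finer-grained credit is desired. Diagnosis strings are assumed to be normalized before comparison.

```python
from typing import List

def gold_standard_score(differential: List[str], gold: str) -> float:
    """Score a ranked differential against the gold-standard diagnosis:
    1.0 if the gold diagnosis is ranked first, 0.5 if it appears in the
    top 3, and 0 otherwise. Assumes diagnoses are already normalized."""
    ranked = [d.strip().lower() for d in differential]
    target = gold.strip().lower()
    if ranked and ranked[0] == target:
        return 1.0
    if target in ranked[:3]:
        return 0.5
    return 0.0

# Example: gold standard is "pulmonary embolism"
print(gold_standard_score(["pneumonia", "pulmonary embolism", "pleurisy"],
                          "pulmonary embolism"))  # prints 0.5
```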

🧪 Example Diagnostic Quality Scoring

| Resident | Diagnosis | LLM Used? | Expert Score (/25) |
| --- | --- | --- | --- |
| R1 | Influenza | No | 15 |
| R1 | Influenza (after LLM) | Yes | 22 |
| R2 | Meningitis | No | 10 |
| R2 | Meningitis (after LLM) | Yes | 19 |


👥 Expert Review Considerations

  • Blinding: Reviewers should be blinded to whether LLM was used.
  • Inter-rater reliability: Use ≥2 reviewers and compute:
    • Cohen’s Kappa (for binary or ordinal scores)
    • Intraclass correlation (ICC) for continuous scores
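
For the binary or ordinal scores, Cohen's kappa can be computed directly with scikit-learn, as in the minimal sketch below; the reviewer score lists are made-up illustrations. For continuous rubric scores, an ICC would typically be computed instead with a statistics package.

```python
from sklearn.metrics import cohen_kappa_score

# Binary "correct diagnosis in top 1" judgments from two blinded reviewers
# on the same set of cases (illustrative data only).
reviewer_a = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
reviewer_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

kappa = cohen_kappa_score(reviewer_a, reviewer_b)
print(f"Cohen's kappa: {kappa:.2f}")
```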

📈 Analysis Options

  • Paired t-test / Wilcoxon to compare pre/post LLM scores.
  • Effect size (Cohen’s d) for clinical significance.
  • Subgroup analysis: junior vs. senior residents, high vs. low baseline scorers.
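
A minimal sketch of the paired pre/post comparison with SciPy, plus a paired-samples Cohen's d; the scores below are made-up illustrations.

```python
import numpy as np
from scipy import stats

# Expert scores (e.g., out of 25) for the same residents before and after
# LLM assistance -- illustrative numbers only.
pre  = np.array([15, 10, 18, 12, 14, 16, 11, 13])
post = np.array([22, 19, 21, 17, 18, 20, 16, 19])

# Paired t-test and Wilcoxon signed-rank test on the same pairs
t_stat, t_p = stats.ttest_rel(post, pre)
w_stat, w_p = stats.wilcoxon(post, pre)

# Cohen's d for paired samples: mean difference / SD of the differences
diff = post - pre
d = diff.mean() / diff.std(ddof=1)

print(f"t-test p={t_p:.4f}, Wilcoxon p={w_p:.4f}, Cohen's d={d:.2f}")
```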

 

 

 We will use the binary gold standard comparison method to score the differential diagnoses. To ensure that the comparison of pre- and post-LLM scores is statistically significant, how can we calculate the required number of residents and the number of patient cases per resident?

 

 ChatGPT

 

Perfect — using a binary gold standard comparison (e.g., correct diagnosis in top 1 or top 3) sets you up for paired binary outcome analysis, which is well suited to McNemar's test (for pre/post comparisons on the same subjects).

✅ Assumptions for Sample Size Calculation

To estimate the sample size, we need:

1. Effect Size (Δ)

Expected improvement in accuracy after using LLM.

Let’s assume:

  • Pre-LLM accuracy (e.g., correct in top 1): 60%
  • Post-LLM accuracy: 80%
  • → Effect size Δ = 20%

2. Discordant Pairs

Only residents who change their diagnosis between pre and post matter in McNemar's test.

Assume:

  • 30% of cases are discordant (either improved or worsened).

3. Statistical Parameters

  • Power (1-β): 80%
  • Significance level (α): 0.05 (two-sided)

📊 Sample Size Calculation (Per Resident)

Use Power Tables or a Calculator for McNemar's Test

Example Estimate:

If you expect:

  • 60% correct diagnosis before LLM
  • 80% correct diagnosis after LLM
  • 30% discordant rate (pre incorrect → post correct OR vice versa)

Then you would need:

  • ~35–45 patient cases per resident to detect that effect with 80% power.
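
One common way to turn these assumptions into a required number of paired cases is a normal-approximation sample-size formula for McNemar's test; a minimal sketch follows (not necessarily the calculation behind the ~35–45 figure above). The result is sensitive to the assumed discordant-pair rate and to one- versus two-sided testing, so any estimate should be refined with a formal power calculator or pilot data, as suggested in the Pro Tip below.

```python
import math
from scipy.stats import norm

def mcnemar_pairs(p_pre: float, p_post: float, discordant: float,
                  alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate number of paired cases for McNemar's test
    (normal approximation, two-sided alpha).

    p_pre, p_post : expected pre/post accuracy (e.g., 0.60 and 0.80)
    discordant    : expected proportion of discordant pairs (e.g., 0.30)
    """
    delta = p_post - p_pre                       # net improvement
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    n = (z_a * math.sqrt(discordant)
         + z_b * math.sqrt(discordant - delta ** 2)) ** 2 / delta ** 2
    return math.ceil(n)

# Using the assumptions above (60% -> 80% accuracy, 30% discordant pairs)
print(mcnemar_pairs(0.60, 0.80, 0.30))
```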

👩‍⚕️👨‍⚕️ Residents Needed

If each resident reviews ~40 cases, and you aim to observe the LLM effect across groups:

  • 10 residents × 40 cases = 400 paired comparisons
  • May be sufficient to detect small-moderate effects and allow for subgroup analysis

For a more efficient design, consider:

  • Fewer cases per resident but more residents
  • Simulate design trade-offs using actual pilot data

🔁 Summary Table

| Design | Minimum Estimate |
| --- | --- |
| Residents | 10–15 |
| Cases per Resident | 30–50 |
| Total Pre/Post Comparisons | 300–600 |


🔧 Pro Tip

You can run a small pilot (e.g., 5 residents × 10 cases) to get:

  • Actual discordant pair rate
  • Observed effect size

Then plug these into a formal power calculator to finalize your design.

 

 

 Our comparative effectiveness research (CER) will be conducted as a prospective clinical trial within a hospital setting. The key questions are: How should resident participants be appropriately selected? And how will each resident identify eligible patients for the study?

 

 ChatGPT

 

Great questions. In a prospective CER study—especially involving residents using a GenAI tool in a real hospital setting—resident and patient selection are key for minimizing bias and ensuring feasibility.

Here's a practical and methodologically sound approach to both:


👨‍⚕️👩‍⚕️ 1. Resident Selection

🎯 Goal: Ensure a representative and engaged group of residents who see a variety of clinical cases and are available for the study duration.

✅ Selection Criteria

  • Training Level: Include residents at a similar level (e.g., PGY-2 or PGY-3) to reduce variation in baseline diagnostic skills.
  • Specialty: Target the relevant field (e.g., internal medicine, emergency medicine, pediatrics).
  • Availability: Residents should be rotating through services where diagnostic decision-making is frequent (e.g., admitting, clinic, ED).
  • Consent and Engagement: Willingness to participate, attend brief training, and follow study protocol.

✳️ Optional Stratification

  • Stratify by training level or prior AI experience to enable subgroup analysis.
  • Randomly select from eligible volunteers to avoid selection bias.

🏥 2. Patient Selection by Each Resident

🎯 Goal: Allow residents to apply the LLM in real-time to a defined set of diagnostic cases without interfering with patient care.

✅ Recommended Methods

A. Case Eligibility Criteria

Patients seen by residents should:

  • Be new or undifferentiated cases (not follow-ups)
  • Present with diagnostic uncertainty
  • Have sufficient clinical data (e.g., H&P, labs, imaging) at the time of decision-making
  • Be seen during routine workflow (no additional burden on patient)

Example: Patients seen during intake, admissions, or initial consults.


B. Patient Selection Methods

Choose one of these approaches depending on your workflow:

| Method | How it works | Pros | Cons |
| --- | --- | --- | --- |
| Consecutive sampling | Residents include all eligible patients seen during a shift | Minimizes selection bias | May be logistically harder |
| Random sampling | Select random eligible cases (e.g., every 3rd new patient) | Reduces selection bias, manageable | Requires planning |
| Quota sampling | Residents include a fixed number of cases per shift/day | Easy to manage workload | Some selection bias possible |
| Triggered sampling | Residents only include cases where they would naturally consult GenAI | Measures real-world adoption | Less generalizable |

🔍 Important: Residents should log both pre-LLM and post-LLM diagnostic decisions for each patient.


🧾 Data Capture Tips

  • Use a structured case capture form or digital survey
  • Collect: patient presentation, differential diagnosis (pre/post), LLM input used, confidence level, and final attending diagnosis
  • Blinding: reviewers of diagnostic quality should be blinded to whether LLM was used
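
For illustration, a minimal sketch of one such structured capture record; the field names follow the list above but are otherwise assumptions, and the LLM-use flag would be stripped from the copy sent to blinded reviewers.

```python
# Illustrative structured case-capture record (in practice this would live
# in a survey tool or REDCap-style form rather than code).
case_capture = {
    "case_id": "C-0042",
    "resident_id": "R-07",
    "patient_presentation": "new-onset dyspnea with pleuritic chest pain",
    "differential_pre_llm": ["pneumonia", "pulmonary embolism", "ACS"],
    "llm_input_used": True,          # removed before blinded expert review
    "differential_post_llm": ["pulmonary embolism", "pneumonia", "ACS"],
    "confidence_pre": 3,             # self-rated, e.g., 1-5
    "confidence_post": 4,
    "final_attending_diagnosis": "pulmonary embolism",
}
```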

📊 Final Study Design Snapshot

  • 15 residents, each reviewing ~30 diagnostic cases
  • Use consecutive or quota-based patient sampling
  • Collect paired diagnosis data (before and after LLM)
  • Evaluate quality with expert scoring or binary gold-standard method

 

 

 In designing the study, our primary goal is to generate reliable and generalizable clinical evidence. Given these objectives, a critical question arises: what is the optimal approach to including both residents and patients in the study?

 

 ChatGPT

 

Excellent point—when the core goal of your GenAI diagnostic study is to generate reliable and generalizable clinical evidence, your study design must ensure both internal validity (accuracy) and external validity (generalizability). Here's how that shapes resident and patient inclusion strategies:


✅ Key Objectives for Your Design

  • Generate unbiased, real-world clinical evidence of GenAI effectiveness.
  • Reflect the diversity of clinical cases and diagnostic skill levels.
  • Minimize selection bias and artificial use conditions.
  • Enable reproducibility and scalability across settings.

👨‍⚕️👩‍⚕️ Best Approach for Including Residents

✅ 1. Use a Stratified Random Sampling Approach

  • Stratify by PGY level (e.g., PGY-1 vs PGY-3), or specialty (e.g., IM vs EM).
  • Then randomly invite a representative sample from each group.
  • Balance voluntary participation with random selection to reduce selection bias.

🟢 Why? Ensures diversity of experience while preserving real-world decision-making behaviors. Prevents only “tech-savvy” or high-performers from dominating the study sample.
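
A minimal sketch of stratified random selection of residents; the roster, strata, and per-stratum quota are made up for illustration.

```python
import random

# Eligible residents grouped by stratum (illustrative roster).
strata = {
    "PGY-2_IM": ["R01", "R02", "R03", "R04", "R05", "R06"],
    "PGY-3_IM": ["R07", "R08", "R09", "R10"],
    "PGY-2_EM": ["R11", "R12", "R13", "R14", "R15"],
}

rng = random.Random(42)  # fixed seed so the selection is reproducible and auditable
selected = {
    stratum: rng.sample(residents, k=min(4, len(residents)))  # e.g., up to 4 per stratum
    for stratum, residents in strata.items()
}
print(selected)
```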


🏥 Best Approach for Including Patients

✅ 2. Use Consecutive Case Sampling Within Predefined Inclusion Criteria

  • Define clear clinical inclusion criteria (e.g., new undifferentiated cases, first encounter, diagnostic uncertainty).
  • Instruct residents to include every eligible patient consecutively during defined shifts or time blocks.

🟢 Why? Consecutive sampling is the gold standard for avoiding selection bias in prospective clinical studies. It ensures real-world variation in case types, complexity, and demographics.


⚖️ Balance Between Realism and Rigor

| Design Decision | Internal Validity | External Validity |
| --- | --- | --- |
| Random resident inclusion | 🟢 | 🟢 |
| Stratified resident groups | 🟢 | 🟢 |
| Consecutive patient sampling | 🟢 | 🟢 |
| Diagnostic cases only | 🟢 | 🔴 (limits scope, but acceptable) |
| Real-time LLM usage | 🟡 (introduces variation) | 🟢 |


🔧 Practical Implementation Plan

For Residents:

  • Randomly select 10–15 residents across PGY levels and services.
  • Train them with the data capture protocol and GenAI tool.
  • Rotate them through defined case-collection periods (e.g., “use GenAI on 5 new diagnostic cases per shift”).

For Patients:

  • Each resident includes every new patient meeting inclusion criteria during shifts.
  • Stop once target sample size per resident is met (e.g., 30–40 cases).

✨ Final Recommendation

✔️ Stratified-random resident sampling
✔️ Consecutive patient sampling during clinical care
✔️ Use real-time GenAI use to simulate real-world practice
✔️ Blinded expert review to assess diagnostic quality

These strategies will maximize evidence quality, clinical realism, and generalizability to support future deployment or policy decisions.

 

 

 Let us examine the resident selection strategy more closely. One common assumption is that the resident group may be too large to feasibly include every individual. For this reason, many studies opt to randomly select 10–15 residents from within a stratified group (e.g., all PGY-1 residents) to balance feasibility and representativeness. However, this assumption does not always hold true.

In our setting, it may be practically feasible to include the entire cohort of PGY-1 residents within one department. If this is the case, and considering our overarching goal of maximizing generalizability, would it be preferable to include all residents rather than conducting a random sampling? Including the full cohort ensures a more complete representation of real-world diagnostic behavior and avoids the selection bias that may result from sampling only a subset of participants. What do you think?

 

  ChatGPT

 

That’s a very thoughtful and important question — and you're absolutely right to question whether random selection is always necessary, especially when it’s feasible to include an entire resident group. Let’s walk through this carefully, based on your generalizability goal and practical constraints:


✅ Core Principles

🧪 Internal Validity

  • How accurate and bias-free your estimates are (i.e., the causal effect of GenAI on diagnosis)

🌍 External Validity (Generalizability)

  • How well your results can be applied to other residents, departments, or institutions

🔍 Scenario Comparison: Include All vs. Randomly Select

| Scenario | Pros | Cons |
| --- | --- | --- |
| Include all PGY-1s in the department | More data; a real-world representative sample; avoids selection bias; enhances generalizability within that setting | Slightly more complexity in coordination; may need more resources for training and oversight |
| Randomly select 10–15 residents | Easier to manage; requires fewer resources; reduces logistical burden | Risk of selection bias if non-participants differ systematically; less generalizable if the sample is not representative of the whole cohort |


🧠 Key Considerations

1. If the entire group is available and willing to participate

✅ Include all residents
This maximizes generalizability within that training year and department, eliminates self-selection bias, and gives you more statistical power.

2. If only some residents are available or willing

✅ Use stratified random sampling
This controls for variation and ensures representativeness, as long as the selection process is transparent and documented.


📝 Special Note on Generalizability

Including all PGY-1 residents in one department:

  • Improves external validity within that department and level of training
  • Allows you to analyze real-world variability in how GenAI affects decision-making across diverse resident performance levels
  • Makes your findings more credible for internal quality improvement, teaching, and scaling to other departments

🔬 Limitation: You may still need multi-site or multi-cohort studies to generalize beyond your hospital or specialty — but including the full resident group is a strong foundation.


✅ Final Recommendation (Given Feasibility)

Yes — include all PGY-1 residents in the department if logistically practical.
This supports your goal of generating reliable and generalizable clinical evidence, without sacrificing rigor.

 

~

 

GenAI/LHS Toolbox

 

🤖 ELHS GenAI Copilot Platform

The new Copilot Platform makes it easy for patients, navigators, and doctors to harness the power of GenAI for research and real-world applications:

Learn GenAI with a multi-LLM copilot — perfect for exploration and education.
🧭 Use navigation tools to monitor care quality and improve patient outcomes.
💬 Chat with your own open-source LLM copilot — personalized for healthcare tasks.
🔬 Fine-tune Llama 3.1–8B for high-accuracy disease prediction in clinical studies. 🚀

 

🧠 Meta AI

Meta AI helps you learn, create, and connect in new ways.

With our latest features, you can now chat with recognizable voices 🗣️ and share images 🖼️ to explore and learn more about your surroundings.
Just ask Meta AI anything — we're here to help!

 

 

~

 

🔹 ELHS Institute Relevant Resources 🔹

🚀 ELHS GenAI Copilot Platform

🌐 ELHS Institute Website

📰 ELHS Newsletters

▶️ ELHS Videos

👉For LLM Fine-Tuning Services, Contact support@elhsi.org 📩

 

~ the end ~