Nature Publishes First Autonomous AI Doctor Study

 

ELHS Newsletter 2025-05-05

Subscribe | All Issues

 

Dear Friends,

 

Last month’s newsletter highlighted a multi-agent system study from West China Hospital. This month, we explore new perspectives from Dr. Eric Topol and colleagues, who propose decentralized yet coordinated networks of specialized AI agents as the next leap forward in healthcare.

 

The most exciting development, however, is the long-awaited publication of Google’s autonomous diagnostic agent — AMIE — in Nature. This marks a historic moment. While top research labs at Harvard and Stanford have been exploring similar directions, Google is the first to rigorously demonstrate that an AI software agent can autonomously conduct an end-to-end patient consultation and generate a differential diagnosis.

 

This success was likely enabled by Google’s unique advantage: ownership of a high-performance, healthcare-specialized foundation model, Med-PaLM, which allowed their team to fine-tune AMIE for conversational diagnostic reasoning.

 

But it raises a critical question: How can academic labs compete? Building open-source, autonomous “AI doctor” agents remains out of reach for most academic institutions due to the prohibitive costs of training large foundation models. Even institutions like Stanford or Harvard may be unable to match the required compute and financial resources — unless we discover new, more efficient paths to achieve this vision.

 

In the discussion that follows, I continue to explore this challenge with my copilot, ChatGPT. I hope you find it inspiring.

 

Warm regards,
AJ
AJ Chen, PhD
Founder of ELHS Institute
ELHS Platform | https://elhsi.com
ELHS Institute | https://elhsi.org

 

~

 

 

From Page Mill

 

Tu T, Schaekermann M, Palepu A, et al. Towards conversational diagnostic artificial intelligence. Nature. 2025 Apr 9. doi:10.1038/s41586-025-08866-7.

[2025/4] Here we introduce AMIE (Articulate Medical Intelligence Explorer), a large language model (LLM)-based AI system optimized for diagnostic dialogue. AMIE uses a self-play-based simulated environment with automated feedback for scaling learning across disease conditions, specialties and contexts. We designed a framework for evaluating clinically meaningful axes of performance, including history-taking, diagnostic accuracy, management, communication skills and empathy. We compared AMIE’s performance to that of primary care physicians in a randomized, double-blind crossover study of text-based consultations with validated patient-actors similar to objective structured clinical examination. The study included 159 case scenarios from providers in Canada, the United Kingdom and India, 20 primary care physicians compared to AMIE, and evaluations by specialist physicians and patient-actors. AMIE demonstrated greater diagnostic accuracy and superior performance on 30 out of 32 axes according to the specialist physicians and 25 out of 26 axes according to the patient-actors. 

 

McDuff D, Schaekermann M, Tu T, et al. Towards accurate differential diagnosis with large language models. Nature. 2025 Apr 9. doi:10.1038/s41586-025-08869-4.

[2025/4] Here we introduce the Articulate Medical Intelligence Explorer (AMIE), a large language model that is optimized for diagnostic reasoning, and evaluate its ability to generate a differential diagnosis alone or as an aid to clinicians. Twenty clinicians evaluated 302 challenging, real-world medical cases sourced from published case reports. Each case report was read by two clinicians, who were randomized to one of two assistive conditions: assistance from search engines and standard medical resources; or assistance from AMIE in addition to these tools. All clinicians provided a baseline, unassisted differential diagnosis prior to using the respective assistive tools. AMIE exhibited standalone performance that exceeded that of unassisted clinicians (top-10 accuracy 59.1% versus 33.6%, P = 0.04). Comparing the two assisted study arms, the differential diagnosis quality score was higher for clinicians assisted by AMIE (top-10 accuracy 51.7%) compared with clinicians without its assistance (36.1%; McNemar’s test: 45.7, P < 0.01) and clinicians with search (44.4%; McNemar’s test: 4.75, P = 0.03). Further, clinicians assisted by AMIE arrived at more comprehensive differential lists than those without assistance from AMIE. Our study suggests that AMIE has potential to improve clinicians’ diagnostic reasoning and accuracy in challenging cases, meriting further real-world evaluation for its ability to empower physicians and widen patients’ access to specialist-level expertise.

 

Moritz M, Topol E, Rajpurkar P. Coordinated AI agents for advancing healthcare. Nat Biomed Eng. 2025 Apr;9(4):432-438. doi: 10.1038/s41551-025-01363-2.

[2025/4] Decentralized yet coordinated networks of specialized artificial intelligence agents, multi-agent systems for healthcare (MASH), that excel in performing tasks in an assistive or autonomous manner within specific clinical and operational domains are likely to become the next paradigm in medical artificial intelligence.

 

Zhou S, Lin M, Ding S, et al. Explainable differential diagnosis with dual-inference large language models. Npj Health Syst. 2025;2(1):12. doi: 10.1038/s44401-025-00015-6.

[2025/4] We developed the first publicly available DDx dataset, comprising expert-derived explanations for 570 clinical notes, to evaluate DDx explanations. Meanwhile, we proposed a novel framework, Dual-Inf, that could effectively harness LLMs to generate high-quality DDx explanations. To the best of our knowledge, it is the first study to tailor LLMs for DDx explanation and comprehensively evaluate their explainability. Overall, our study bridges a critical gap in DDx explanation, enhancing clinical decision-making.

 

Hua Y, Na H, Li Z, Liu F, Fang X, Clifton D, Torous J. A scoping review of large language models for generative tasks in mental health care. NPJ Digit Med. 2025 Apr 30;8(1):230. doi: 10.1038/s41746-025-01611-4.

[2025/4] A systematic search identified 726 unique articles, of which 16 met the inclusion criteria. These studies, encompassing applications such as clinical assistance, counseling, therapy, and emotional support, show initial promises. However, the evaluation methods were often non-standardized, with most studies relying on ad-hoc scales that limit comparability and robustness. A reliance on prompt-tuning proprietary models, such as OpenAI’s GPT series, also raises concerns about transparency and reproducibility. As current evidence does not fully support their use as standalone interventions, more rigorous development and evaluation guidelines are needed for safe, effective clinical integration.

 

Hou Z, Wu Z, Qu Z, et al. A vaccine chatbot intervention for parents to improve HPV vaccination uptake among middle school girls: a cluster randomized trial. Nat Med. 2025 Apr 7. doi: 10.1038/s41591-025-03618-6.

[2025/4] The study included 2,671 parents from 180 middle school classes stratified by socioeconomic setting, school and grade level in Shanghai megacity, and urban and rural regions of Anhui Province. Participants were randomly assigned to either the intervention group (90 classes, 1,294 parents), which engaged with the chatbot for two weeks, or the control group (90 classes, 1,377 parents), which received usual care. The primary outcome was the receipt or scheduled appointment of the HPV vaccine for participants’ daughters. In intention-to-treat analyses, 7.1% of the intervention group met this outcome versus 1.8% of the control group (P < 0.001) over a two-week intervention period. In addition, there was a statistically significant increase in HPV vaccination-specific consultations with health professionals (49.1% versus 17.6%, P < 0.001), along with enhanced vaccine literacy (P < 0.001) and rumor discernment (P < 0.001) among participants using the chatbot. These findings indicate that the chatbot effectively increased vaccination and improved parental vaccine literacy, although further research is necessary to scale and sustain these gains.

 

Chen Q, Hu Y, Peng X, et al. Benchmarking large language models for biomedical natural language processing applications and recommendations. Nat Commun. 2025 Apr 6;16(1):3280. doi: 10.1038/s41467-025-56989-2.

[2025/4] We perform a systematic evaluation of four LLMs—GPT and LLaMA representatives—on 12 BioNLP benchmarks across six applications. We compare their zero-shot, few-shot, and fine-tuning performance with the traditional fine-tuning of BERT or BART models. We examine inconsistencies, missing information, hallucinations, and perform cost analysis. Here, we show that traditional fine-tuning outperforms zero- or few-shot LLMs in most tasks. However, closed-source LLMs like GPT-4 excel in reasoning-related tasks such as medical question answering. Open-source LLMs still require fine-tuning to close performance gaps. We find issues like missing information and hallucinations in LLM outputs. These results offer practical insights for applying LLMs in BioNLP.

 

~

 

 

In the Eyes of GPTs

 

 Google’s AMIE autonomous diagnostic agent study was finally published in Nature. It marks a historic moment — the first time an "AI doctor" has been shown to successfully converse with a patient, take a clinical history, and complete a differential diagnosis. Remarkably, AMIE outperformed human doctors in the study. I’d like to explore this entire field with you and gain a deeper understanding of its foundations. Could you start by defining the field of autonomous medical doctors?

  ChatGPT

Absolutely — Google's AMIE study is a landmark in AI and healthcare, marking the first rigorous demonstration of an autonomous diagnostic agent performing core clinical tasks end-to-end with superhuman performance. Let's define the field of autonomous medical doctors and how it's emerging.

🧠 Definition: Autonomous Medical Doctors (AMDs)

Autonomous medical doctors refer to AI systems capable of performing the full clinical reasoning and decision-making workflow traditionally handled by human physicians — with minimal or no human supervision. These systems may interview patients (natural language interaction), collect medical history and symptom details, ask follow-up questions, generate differential diagnoses, recommend diagnostic tests or treatment plans, and communicate findings and next steps to patients or clinicians. They combine LLMs for conversation and reasoning, domain-specific medical knowledge, clinical guidelines and decision trees, and often some form of reinforcement learning or human feedback tuning.
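
To make that workflow concrete, here is a minimal, hypothetical sketch of such an interview-to-diagnosis loop in Python. The call_llm function, system prompt, and "DDX:" stopping rule are illustrative assumptions only; they do not represent AMIE's actual (unreleased) implementation.

```python
# Minimal sketch of an autonomous diagnostic interview loop (illustrative only).
# `call_llm` is a placeholder for any chat-completion API; it is NOT AMIE's code.

def call_llm(messages: list[dict]) -> str:
    """Placeholder: send chat messages to an LLM and return its reply."""
    raise NotImplementedError("Wire this to your model of choice.")

SYSTEM_PROMPT = (
    "You are a clinical interviewer. Ask one focused history-taking question "
    "at a time. When you have enough information, output a line starting with "
    "'DDX:' followed by a ranked differential diagnosis and suggested next steps."
)

def run_consultation(get_patient_reply, max_turns: int = 10) -> str:
    """Alternate agent questions and patient replies until a DDx is produced."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for _ in range(max_turns):
        agent_turn = call_llm(messages)
        messages.append({"role": "assistant", "content": agent_turn})
        if agent_turn.strip().startswith("DDX:"):
            return agent_turn  # differential diagnosis reached
        patient_turn = get_patient_reply(agent_turn)
        messages.append({"role": "user", "content": patient_turn})
    # Fall back to requesting a DDx if the turn budget is exhausted
    return call_llm(messages + [{"role": "user", "content": "Please give your DDX now."}])
```

In a real deployment, get_patient_reply would be a patient-facing chat interface, a simulated patient-actor, or another agent.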


⚙️ Core Capabilities of an AMD

🗣️ Clinical Interviewing: Conducting history-taking via open-ended and follow-up questions

🧠 Differential Diagnosis: Reasoning across multiple disease possibilities based on collected information

🧾 Documentation: Generating structured notes (SOAP, summary, ICD-10 coding, etc.)

🧪 Ordering Tests or Referrals: Recommending next clinical actions, such as labs or imaging

🧍 Patient Communication: Explaining reasoning, risks, and care plans to patients

📊 Feedback and Iteration: Revising hypotheses in response to new data


📍 Autonomous ≠ Fully Independent (Yet)

“Autonomous” does not necessarily mean independent of clinicians. Current systems like AMIE are designed to operate in a sandbox or simulated environment, focus on specific diagnostic tasks, and serve as second opinions or copilots in real settings (for now). Eventually, some subsystems (e.g., diabetic retinopathy diagnosis, triage bots) may become truly independent agents, especially in low-resource or remote environments.

🌍 Why It Matters

The goal of autonomous diagnostic agents is to scale access to expert-level care globally, reduce diagnostic errors, relieve clinician burden, enable continuous care in underserved areas, and support learning health systems by feeding back diagnostic reasoning into real-world outcomes.

 

 AMIE is built on Google’s own healthcare foundation model, Med-PaLM, correct? This gives Google a significant advantage. However, it raises an important question: How can academic labs compete with Google in developing autonomous medical doctors? The work is extremely resource-intensive, both in terms of computation and funding. No academic lab currently has the capacity to train or fine-tune a large language model on the scale of Med-PaLM, Gemini, or ChatGPT.

 ChatGPT

You're absolutely right — Google's AMIE is built on top of Med-PaLM, which itself is a proprietary, large-scale foundation model trained on privileged medical data and massive compute. This gives Google (and similar tech giants) a structural advantage in the race to build autonomous medical doctors. But your concern raises a deep, valid, and urgent question:

How can academic labs — without access to commercial-scale compute or private data — meaningfully compete in this field?

Let's break it down.


🧱 Structural Advantages of Big Tech

💾 Data Access: Google has access to high-quality, large-scale medical QA datasets, clinician feedback, and possibly real-world health records (e.g., from DeepMind NHS collaborations)

🧠 Model Infrastructure: Med-PaLM and Gemini models are built on world-class model architectures (e.g., PaLM, Gemini 1.5), pretrained with trillions of tokens

🖥️ Compute Power: They have nearly unlimited TPU/GPU capacity and optimized inference/training infrastructure

💸 Funding: Internal projects often run at R&D budgets unreachable by any academic grant


🧭 What Can Academic Labs Do?

Despite the asymmetry, academic labs can play a critical and unique role — not by competing head-to-head in size, but by excelling in focus, openness, and innovation.

1. 🧪 Study Specialized, Narrow-Scope Agents. Instead of building a full AMIE clone, academic labs can build agents focused on a single disease domain (e.g., cancer, stroke, RA); target low-resource settings, older adults, or rare diseases; or implement multi-agent collaboration in which diagnosis is distributed.

Example: Build a lightweight “autonomous RA copilot” fine-tuned with synthetic data — not a generalist AMIE.

2. 🤖 Fine-Tune Open-Source Models (LoRA, QLoRA, DPO). Use smaller open-source models; fine-tune them with synthetic patient cases, textbook data, or guideline-derived instructions; apply LoRA/QLoRA to dramatically reduce compute needs; and add dialogue-style history-taking, reasoning, and explanation capabilities (a minimal sketch follows below).

You’ve already done this — building LoRA adapters fine-tuned for disease prediction!
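
As a rough illustration of how lightweight this can be, here is a minimal sketch using the Hugging Face transformers and peft libraries. The base model name, LoRA rank, and target modules are placeholders; any small open-weight chat model and a dataset of synthetic consultation dialogues could be substituted.

```python
# Minimal LoRA fine-tuning setup (illustrative; model name and hyperparameters are placeholders).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "meta-llama/Llama-3.1-8B-Instruct"  # any small open-weight model
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Train only low-rank adapter matrices instead of all base-model parameters.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; adjust per architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters

# From here, train with a standard Trainer/SFTTrainer on synthetic dialogue data,
# then save only the adapter, e.g. model.save_pretrained("ra-copilot-lora-adapter")
```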

3. 📖 Contribute Open Evaluation, Transparency, and Ethics. Academic teams can lead in evaluation benchmarks (e.g., dialogue realism, patient safety, bias detection), open-science protocols, clinician-in-the-loop design, and human-AI collaboration studies in real settings.

4. 🤝 Build Academic Consortia. Share compute (e.g., through NIH/NSF-funded centers), co-train models across institutions, and use federated learning or model sharing instead of data sharing.

Summary

💡 Academic labs cannot out-scale Google — but they can out-specialize, out-open, and out-innovate at the edges where GenAI meets real-world clinical care. The question isn’t whether academic labs can participate — it’s how they will shape the future of autonomous care by focusing where commercial models cannot.

Would you like help outlining a strategy framework for academic labs (like yours) to lead in this space — including technical and policy components?

 

 Yes, please.

 ChatGPT

Great — here's a structured strategy framework for how academic labs can lead in the field of autonomous medical doctors (AMDs), even without commercial-scale resources like Google or OpenAI.

🧭 Strategic Framework for Academic Leadership in Autonomous Medical Diagnosis

I. 🎯 Focus: Specialized, Impactful Domains

Strategic Principle: Depth over breadth.

Academic labs should focus on high-need, underexplored, or domain-specific tasks where open models can be fine-tuned and validated efficiently.

Target areas:

  • Specific diseases (e.g., RA, heart failure, lung cancer)
  • Underserved populations (e.g., rural, aging, pediatric)
  • Complex workflows (e.g., differential diagnosis, post-visit care gap detection)

Example: A "Doctor AI" agent for rheumatology triage fine-tuned on synthetic RA cases and validated in a local clinic.

II. 🛠️ Technology: Lightweight Fine-Tuning with Open Models

Strategic Principle: Leverage efficient GenAI stacks.

Tools and Methods:

  • Use open-source LLMs (e.g., LLaMA 3, Mistral, DeepSeek, Phi-3)
  • Apply LoRA/QLoRA for parameter-efficient fine-tuning
  • Employ instruction tuning, dialog fine-tuning, and DPO to simulate diagnostic reasoning
  • Use synthetic patient data (like you've done) to avoid privacy and data-sharing barriers (see the formatting sketch after this list)
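
As one concrete, hypothetical example of the data-preparation step, the snippet below converts a synthetic patient case into a chat-style instruction-tuning record. The field names and JSONL layout are assumptions for illustration, not a fixed standard.

```python
# Convert a synthetic patient case into a chat-style fine-tuning record (illustrative schema).
import json

synthetic_case = {
    "age": 54,
    "sex": "female",
    "symptoms": ["symmetric joint pain", "morning stiffness > 1 hour", "fatigue"],
    "labs": {"RF": "positive", "anti-CCP": "positive", "CRP": "elevated"},
    "diagnosis": "rheumatoid arthritis",
}

def to_chat_record(case: dict) -> dict:
    """Build one supervised example: case summary in, diagnosis out."""
    summary = (
        f"{case['age']}-year-old {case['sex']} with "
        f"{', '.join(case['symptoms'])}. Labs: "
        + ", ".join(f"{k} {v}" for k, v in case["labs"].items()) + "."
    )
    return {
        "messages": [
            {"role": "system", "content": "You are a diagnostic assistant."},
            {"role": "user", "content": f"Provide a differential diagnosis: {summary}"},
            {"role": "assistant", "content": f"Most likely: {case['diagnosis']}."},
        ]
    }

# Write one JSONL line per synthetic case for downstream fine-tuning.
with open("synthetic_ddx_train.jsonl", "w") as f:
    f.write(json.dumps(to_chat_record(synthetic_case)) + "\n")
```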

III. 🔬 Evaluation: Build Gold-Standard Benchmarks and Studies

Strategic Principle: Own the clinical evaluation space.

Key areas:

  • Design standardized diagnostic conversation benchmarks comparing LLMs vs. physicians (a top-k scoring sketch follows this list)
  • Conduct prospective studies of GenAI copilots in real care delivery (e.g., decision impact on patient outcomes)
  • Benchmark reasoning traceability, bias, transparency, and safety
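
For instance, the top-k accuracy metric reported in the AMIE differential-diagnosis study can be scored in a few lines. The matching rule below (exact string match against a reference diagnosis) is a deliberate simplification; published studies typically rely on specialist adjudication.

```python
# Top-k differential diagnosis accuracy (simplified: exact match instead of expert adjudication).

def top_k_accuracy(cases: list[dict], k: int = 10) -> float:
    """Fraction of cases whose reference diagnosis appears in the model's top-k list."""
    hits = 0
    for case in cases:
        predicted = [d.lower().strip() for d in case["ddx"][:k]]
        if case["reference"].lower().strip() in predicted:
            hits += 1
    return hits / len(cases)

# Toy example with two cases.
cases = [
    {"reference": "rheumatoid arthritis",
     "ddx": ["psoriatic arthritis", "rheumatoid arthritis", "lupus"]},
    {"reference": "pulmonary embolism",
     "ddx": ["pneumonia", "heart failure", "asthma"]},
]
print(f"Top-10 accuracy: {top_k_accuracy(cases, k=10):.1%}")  # 50.0% on this toy set
```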

IV. 🌍 Collaboration: Build Federated Open Science Consortia

Strategic Principle: Networked scale via openness.

Action steps:

  • Form a multi-institution academic consortium, linking to initiatives like AIM-AHEAD, Bridge2AI, and FAIR Genomes
  • Share LoRA adapters, prompts, test data, and evaluation tools
  • Use federated training to protect patient data while scaling learning across health systems (a weight-averaging sketch follows this list)
  • Invite global collaborators to test and extend disease-specific AMDs
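
As a simplified illustration of the federated idea, sites could train local LoRA adapters on their own data and share only the adapter weights, which a coordinator averages in a basic FedAvg-style step. Real deployments would add secure aggregation, weighting by sample count, and repeated rounds; the snippet below is a toy sketch only.

```python
# Simplified FedAvg-style averaging of locally trained adapter weights (illustrative only).
import numpy as np

def average_adapters(site_adapters: list[dict]) -> dict:
    """Average each named weight tensor across participating sites."""
    keys = site_adapters[0].keys()
    return {k: np.mean([site[k] for site in site_adapters], axis=0) for k in keys}

# Toy example: two sites share small adapter matrices instead of patient data.
site_a = {"lora_A": np.ones((4, 2)), "lora_B": np.zeros((2, 4))}
site_b = {"lora_A": np.zeros((4, 2)), "lora_B": np.ones((2, 4))}
global_adapter = average_adapters([site_a, site_b])
print(global_adapter["lora_A"][0])  # [0.5 0.5]
```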

V. 🧑‍⚖️ Governance: Lead in AI Ethics and Trustworthiness

Strategic Principle: Set the ethical foundation before others do.

Core responsibilities:

  • Define standards for safe autonomous diagnosis (e.g., oversight, auditability)
  • Ensure explainability in autonomous reasoning (e.g., what evidence the model used)
  • Engage patients and clinicians to co-design AI interfaces
  • Build policy papers and commentaries to guide regulators and public trust

 

 Who are the major players in this space — including research labs and leading experts focused on autonomous medical diagnosis (AMD), generalist AI doctors, and diagnostic agents?

 ChatGPT

As of 2024–2025, the field of Autonomous Medical Diagnosis (AMD) and LLM-based “AI Doctor” systems is growing rapidly. Below is a curated list of major players — including research labs, institutions, and leading experts — categorized by their role in shaping this field.

🧠 Big Tech Research Labs (Core developers of autonomous systems)

Google DeepMind / Health AI: Creator of Med-PaLM and AMIE. Major work on safety, reasoning, and multi-modal diagnostic AI.

OpenAI: GPT-4 and GPT-4o power many medical copilots, though not healthcare-specific. Research into tool use, safety, and reasoning informs medical AI.

Microsoft Research + Nuance: Focused on ambient clinical documentation and AI copilots (e.g., DAX Copilot) — building blocks for AMD-like workflows.

Amazon Science / AWS Health AI: Developing medical knowledge graph and symptom-checking systems.

Tencent AI Lab (China): Developing large medical LLMs (e.g., MedGPT), focusing on patient-facing triage and assistant agents.

ByteDance AI Lab (China): Invested in MedAgent-style models and open-sourcing foundation models like OpenBMB-Med.


🏥 Academic & Clinical AI Centers (Leading academic innovation)

Stanford Center for Biomedical Informatics Research (BMIR): NLP for clinical notes, LLMs for diagnosis; led by Dr. Nigam Shah, Dr. James Zou, Dr. Andrew Ng (adjacent)

MIT CSAIL / J-Clinic: Focus on interpretable ML, autonomous clinical tools, reinforcement learning in healthcare (e.g., Dr. Regina Barzilay)

Harvard Medical School / MGH / BWH: AI for diagnostics, radiology, and EHR-based phenotyping. Includes Dr. Zak Kohane.

UCSF Center for Intelligent Imaging: Medical imaging AI, integration with LLMs for multimodal diagnosis

Johns Hopkins Malone Center for Engineering in Healthcare: Building diagnostic models with EHR + language data, with fairness and generalizability focus

Tsinghua University AI Research Center: Chinese leaders in multimodal LLMs (e.g., ChatMed) and patient-facing diagnostic agents

Peking University, PKU Medical AI Lab: Focus on fine-tuning foundation models for medical reasoning and conversation


👨‍⚕️ Leading Experts in Autonomous Diagnosis / AI Doctors

Dr. Alan Karthikesalingam (Google DeepMind): Lead author of Med-PaLM and AMIE; one of the central architects of autonomous GenAI agents in healthcare

Dr. Vivek Natarajan (Google Health AI): LLMs for medicine, lead on Med-PaLM; clinical safety, explainability, and multi-turn diagnosis

Dr. Nigam Shah (Stanford): Pioneer in real-world evidence generation and clinical LLM evaluation

Dr. Regina Barzilay (MIT): Machine learning for cancer diagnosis and LLMs for clinical reasoning

Dr. Isaac Kohane (Harvard): Chair of the Department of Biomedical Informatics; advocate for safe, human-centered AI in medicine

Dr. Eric Topol (Scripps Research): Thought leader in democratized AI and autonomous care systems

Dr. James Zou (Stanford): Leading evaluation of LLMs for clinical QA and reasoning benchmarks

Dr. Andrew Beam (Harvard): Focus on deep learning interpretability, EHR-based diagnostic prediction

Dr. Yuan Lu (Yale): LLMs for guideline adherence and quality gap detection

 

 

~

 

GenAI/LHS Toolbox

 

🤖 ELHS GenAI Copilot Platform

The new Copilot Platform makes it easy for patients, navigators, and doctors to harness the power of GenAI for research and real-world applications:

Learn GenAI with a multi-LLM copilot — perfect for exploration and education.
🧭 Use navigation tools to monitor care quality and improve patient outcomes.
💬 Chat with your own open-source LLM copilot — personalized for healthcare tasks.
🔬 Fine-tune Llama 3.1–8B for high-accuracy disease prediction in clinical studies. 🚀

 

🧠 Meta AI

Meta AI helps you learn, create, and connect in new ways.

With our latest features, you can now chat with recognizable voices 🗣️ and share images 🖼️ to explore and learn more about your surroundings.
Just ask Meta AI anything — we're here to help!

 

 

~

 

🔹 ELHS Institute Relevant Resources 🔹

🚀 ELHS GenAI Copilot Platform

🌐 ELHS Institute Website

📰 ELHS Newsletters

▶️ ELHS Videos

👉For LLM Fine-Tuning Services, Contact support@elhsi.org 📩

 

~ the end ~