Integration of GenAI in Medical Education and Clinical Training

 

ELHS Newsletter 2025-07-10

Subscribe | All Issues

 

 

Dear Friends,

 

One year after publishing our review on the democratization of healthcare GenAI, I would like to survey the progress GenAI has since made in medical education and clinical training. The results will serve as training material for the practical AI course I will teach in July for the 2025 Class of Shanghai General Medicine Physician Training. The course will be delivered in the following four parts:

 

  1. GenAI Introduction and Trends
    As large language models (LLMs) improve rapidly month after month, LLM-based GenAI continues to distinguish itself from traditional AI in performance, generalizability, usability, and availability. The trend toward democratizing GenAI in healthcare remains strong, and responsible use of AI has become a top priority. In principle, every doctor and health professional can now apply GenAI to solve problems; in practice, however, a significant bottleneck remains in the clinical research and application of GenAI for most clinical teams.

  2. Progress of GenAI in Transforming Medical Education
    Harvard and Stanford medical schools are leading the way in integrating GenAI into medical education. GenAI is now taught as a fundamental skill and is changing how medical students learn medicine. Simulation of predictive care is accelerating clinical training. In particular, General Medicine is on its way to becoming the first autonomous medical practice, as demonstrated by Google's breakthrough conversational diagnostic AI agent, AMIE. The GenAI revolution is creating exciting research opportunities in medicine on a scale not seen in centuries!

  3. Accelerator: Synthetic Patient Data
    As demonstrated in our research and that of others, synthetic patient data has accelerated ML model development, GenAI benchmarking, and the fine-tuning of open-source LLMs, achieving high accuracy at low cost and without privacy concerns. It has become the key to overcoming the bottleneck that most clinical teams face (a minimal fine-tuning sketch follows this list).

  4. Three Ways to Practically Integrate GenAI into Medical Education
    To put GenAI into practice immediately, General Medicine can learn from the three approaches adopted by New York University Medical School:
    (1) Build AI-assisted patient journey simulations to help students learn how GenAI can accelerate early diagnosis;
    (2) Provide precision clinical training for interns, residents, and physicians;
    (3) Enable doctors to use GenAI predictions on patient cases to make more informed clinical decisions (see the second sketch after this list).
    If needed, students and physicians can use our award-winning GenAI-ELHS platform to implement these practical GenAI applications, discover new predictive-care knowledge, and publish SCI papers.
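
To make the synthetic-data accelerator in part 3 concrete, here is a minimal sketch of fine-tuning an open-weight LLM on synthetic patient cases with low-rank adapters (LoRA), using the Hugging Face transformers, datasets, and peft libraries. The file synthetic_patients.jsonl and its "case" and "diagnosis" fields are hypothetical, and the checkpoint name and hyperparameters are illustrative, not the exact recipe behind our platform.

    # Minimal LoRA fine-tuning sketch: adapt an open-weight causal LM to
    # synthetic patient cases. File names and settings are illustrative.
    import json

    import torch
    from datasets import Dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    MODEL = "meta-llama/Llama-3.1-8B"  # any open-weight causal LM checkpoint

    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

    # Wrap the base model with low-rank adapters so only a small fraction
    # of the weights is trained, keeping cost low.
    model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32,
                                             task_type="CAUSAL_LM"))

    # Turn each synthetic record into one prompt/completion training string.
    records = [json.loads(line) for line in open("synthetic_patients.jsonl")]
    texts = [f"Patient case: {r['case']}\nDiagnosis: {r['diagnosis']}"
             for r in records]
    dataset = Dataset.from_dict({"text": texts}).map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True, remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="llama-synthetic-ft",
                               num_train_epochs=3,
                               per_device_train_batch_size=4,
                               learning_rate=2e-4),
        train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
    trainer.train()

Because the training data are synthetic, this loop runs without touching protected patient records, which is exactly why synthetic data removes the privacy bottleneck for most clinical teams.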
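
Similarly, for approach (3) in part 4, the sketch below shows one way a GenAI prediction could be requested on a de-identified patient case, here via the OpenAI Python client. The model name, prompt, and vignette are illustrative, and any real deployment would require clinical validation and physician oversight.

    # Hypothetical decision-support sketch: ask a general-purpose LLM for a
    # ranked differential diagnosis on a de-identified vignette.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    case = ("58-year-old male, 2 weeks of exertional dyspnea, bilateral ankle "
            "edema, and orthopnea; history of hypertension and type 2 diabetes.")

    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": ("You are a clinical decision-support assistant. "
                         "Return a ranked differential diagnosis with brief "
                         "reasoning and an initial workup. The final decision "
                         "rests with the treating physician.")},
            {"role": "user", "content": case},
        ])
    print(response.choices[0].message.content)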

 

Dr. Zak Kohane, a pioneer in healthcare AI at Harvard Medical School, summed it up nicely in his NEJM perspective:

    

    “The growing gap in access to primary care is not specific to the United States. Large numbers of primary care doctors in the United Kingdom are leaving or planning to leave practice, and similar challenges are being recognized throughout Europe. In low- and middle-income countries, gaps in access to primary care can be much more profound than in high-income countries. If AI can be deployed to safely and effectively fill or compensate for even part of these gaps, health care systems and other organizations around the world should conduct trials promptly to determine how best to achieve that end. As part of this process, investigators should be careful to compare outcomes associated with the use of AI with outcomes in the health care systems we have, rather than the ones we wish we had.”

 

In the eyes of machines, GenAI has moved from the proof-of-concept stage to actual integration into medical education, clinical training, and healthcare practice. I hope you enjoy my conversation with ChatGPT below.

 

Warm regards,
AJ

 

AJ Chen, PhD
Founder and PI, ELHS Institute
https://elhsi.org
ELHS Platform: https://elhsi.com

 

 

~

 

 

 

 

From Page Mill

(Recent papers, news, and events showcasing the progress of GenAI and LHS) 

 

Steel, P.A.D., Wardi, G., Harrington, R.A. et al. Learning health system strategies in the AI era. npj Health Syst. 2, 21 (2025). doi:10.1038/s44401-025-00029-0

[2025/6] The learning health system (LHS) offers a framework to accelerate evidence generation and care improvement, yet widespread adoption remains limited. In this perspective, we explore strategies to operationalize the LHS in the era of artificial intelligence, including biomedical informatics and health information technology integration, workforce development, quality improvement, and data governance. We highlight promising institutional models and propose policy, educational, and financial reforms to support scalable, value-driven innovation in increasingly complex and resource-constrained health systems.

 

Alsentzer E, Li MM, Kobren SN, Noori A; Undiagnosed Diseases Network; Kohane IS, Zitnik M. Few shot learning for phenotype-driven diagnosis of patients with rare genetic diseases. NPJ Digit Med. 2025 Jun 20;8(1):380. doi:10.1038/s41746-025-01749-1.

[2025/6] We present SHEPHERD, a few-shot learning approach for multi-faceted rare disease diagnosis. SHEPHERD performs deep learning over a knowledge graph enriched with rare disease information and is trained on a dataset of simulated rare disease patients. We demonstrate SHEPHERD's effectiveness across diverse diagnostic tasks, performing causal gene discovery, retrieving “patients-like-me”, and characterizing novel disease presentations, using real-world cohorts from the Undiagnosed Diseases Network (N = 465), MyGene2 (N = 146), and the Deciphering Developmental Disorders study (N = 1431). SHEPHERD demonstrates the potential of knowledge-grounded deep learning to accelerate rare disease diagnosis.

 

Wang, H., Yang, R., Alwakeel, M. et al. An evaluation framework for ambient digital scribing tools in clinical applications. npj Digit. Med. 8, 358 (2025). doi:10.1038/s41746-025-01622-1

[2025/6] This study proposes a comprehensive ADS evaluation framework incorporating human evaluation, automated metrics, simulation testing, and large language models (LLMs) as evaluators. Our framework assesses transcription, diarization, and medical note generation across criteria such as fluency, completeness, and factuality. To demonstrate its effectiveness, we developed an ADS tool and applied our framework to evaluate the tool’s performance on 40 real clinical visit recordings. Our evaluation revealed strengths, such as fluency and clarity, but also highlighted weaknesses in factual accuracy and the ability to capture new medications. 

 

Zhou, S., Xu, Z., Zhang, M. et al. Large language models for disease diagnosis: a scoping review. npj Artif. Intell. 1, 9 (2025). doi:10.1038/s44387-025-00011-z

[2025/6] In this article, we perform a comprehensive review of LLM-based methods for disease diagnosis. Our review examines the existing literature across various dimensions, including disease types and associated clinical specialties, clinical data, LLM techniques, and evaluation methods. Additionally, we offer recommendations for applying and evaluating LLMs for diagnostic tasks. Furthermore, we assess the limitations of current research and discuss future directions. To our knowledge, this is the first comprehensive review for LLM-based disease diagnosis.

 

Yan, S., Yu, Z., Primiero, C. et al. A multimodal vision foundation model for clinical dermatology. Nat Med (2025). doi:10.1038/s41591-025-03747-y

[2025/6] Here we introduce PanDerm, a multimodal dermatology foundation model pretrained through self-supervised learning on over 2 million real-world skin disease images from 11 clinical institutions across 4 imaging modalities. We evaluated PanDerm on 28 diverse benchmarks, including skin cancer screening, risk stratification, differential diagnosis of common and rare skin conditions, lesion segmentation, longitudinal monitoring, and metastasis prediction and prognosis. PanDerm achieved state-of-the-art performance across all evaluated tasks, often outperforming existing models when using only 10% of labeled data. We conducted three reader studies to assess PanDerm’s potential clinical utility. PanDerm outperformed clinicians by 10.2% in early-stage melanoma detection through longitudinal analysis, improved clinicians’ skin cancer diagnostic accuracy by 11% on dermoscopy images and enhanced nondermatologist healthcare providers’ differential diagnosis by 16.5% across 128 skin conditions on clinical photographs.

 

Valan, B., Prakash, A., Ratliff, W. et al. Evaluating sepsis watch generalizability through multisite external validation of a sepsis machine learning model. npj Digit. Med. 8, 350 (2025). doi:10.1038/s41746-025-01664-5

[2025/6] The objective of this reproducibility study is to validate Duke Health’s Sepsis Watch ML model in a new community healthcare setting and to assess its performance and clinical utility for early sepsis detection in Summa Health’s emergency departments. The study analyzed the model’s ability to predict sepsis from a combination of static and dynamic patient data across 205,005 encounters from 101,584 unique patients between 2020 and 2021; 54.7% of patients (n = 112,223) were female, and the average age was 50 (IQR 38-71). The AUROC ranged from 0.906 to 0.960, and the AUPRC ranged from 0.177 to 0.252 across the four sites. Ultimately, the reproducibility of the Sepsis Watch model in a community health system setting confirmed its strong, robust performance and its portability across different geographical and demographic contexts with little variation.

 

Jiang, N., Ji, H., Guan, Z. et al. A deep learning system for detecting silent brain infarction and predicting stroke risk. Nat. Biomed. Eng (2025). doi:10.1038/s41551-025-01413-9

[2025/6] Current brain imaging to detect silent brain infarctions (SBIs) is not feasible for the general population. Here, to overcome this challenge, we developed a retinal image-based deep learning system, DeepRETStroke, to detect SBI and refine stroke risk. We used 895,640 retinal photographs to pretrain the DeepRETStroke system, which encodes a domain-specific foundation model for representing eye–brain connections. We then validated the downstream clinical tasks of DeepRETStroke using 213,762 retinal photographs from diverse datasets across China, Singapore, Malaysia, the USA, the UK and Denmark to detect SBI and predict stroke events. DeepRETStroke performed well in internal validation datasets, with areas under the curve of 0.901 for predicting incident stroke and 0.769 for predicting recurrent stroke. External validations demonstrated consistent performance across diverse datasets. Finally, in a prospective study comprising 218 participants with stroke, we assessed the performance of DeepRETStroke compared with clinical traits in guiding strategies for stroke recurrence prevention. Altogether, the retinal image-based deep learning system DeepRETStroke is superior to clinical traits in predicting stroke events, especially by incorporating the detection of SBI, without the need for brain imaging.

 

Shanwetter Levit, N., Saban, M. When investigator meets large language models: a qualitative analysis of cancer patient decision-making journeys. npj Digit. Med. 8, 336 (2025). doi:10.1038/s41746-025-01747-3

[2025/6] This study compares human-led and LLM-assisted approaches to analyzing cancer patient narratives, using 33 semi-structured interviews. We conducted three parallel analyses: investigator-led thematic analysis, ChatGPT-4o, and Gemini Advance Pro 1.5. The investigator-led approach identified psychosocial and emotional themes, while the LLMs highlighted structural, temporal, and logistical aspects. LLMs demonstrated efficiency in identifying recurring patterns but struggled with emotional nuance and contextual depth. Investigator-led analysis, while time-intensive, captured the complexities of identity disruption and emotional processing. Our findings suggest that LLMs can serve as complementary tools in qualitative research, enhancing analytical breadth when paired with human interpretation. 

 

Huang J, Wittbrodt MT, Teague CN, et al. Efficiency and Quality of Generative AI–Assisted Radiograph Reporting. JAMA Netw Open. 2025;8(6):e2513921. doi:10.1001/jamanetworkopen.2025.13921

[2025/6] Question: Is clinical use of artificial intelligence (AI)–generated draft radiograph reports associated with documentation efficiency, clinical accuracy, textual quality, and the ability to promptly detect pneumothorax requiring intervention? Findings: In this cohort study of 11,980 model-assisted radiograph interpretations in live clinical care, model use was associated with a 15.5% documentation efficiency improvement, with no change in radiologist-evaluated clinical accuracy or textual quality of reports. Of 97,651 radiographs analyzed for pneumothorax flagging, those containing clinically actionable pneumothorax were identified rapidly with high accuracy. Meaning: The findings suggest the potential for radiologist and generative AI collaboration to improve clinical care delivery.

 

 

~

 

 

 

 

 

In the Eyes of GPTs

(AJ in conversation with ChatGPT to understand how machines view human healthcare)

 

 AJ: After GenAI showed promise to transform medical education in 2023, what are the main advances so far?

 

 ChatGPT:

Main Progress in GenAI for Medical Education Since 2023

1. Passing Medical Licensing Exams

  • ChatGPT and other LLMs have demonstrated high performance on medical licensing exams (e.g., USMLE and NBME), approaching or exceeding human-level scores.

2. Integration into Medical Curricula

  • Medical schools (UCSF, Stanford, Harvard) are beginning to pilot and discuss integrating GenAI tools for admissions, formative assessment, clinical reasoning, and simulation.

3. Benchmarking Against Students

  • LLMs have outperformed medical students in clinical reasoning exams and free-text vignettes.

4. AI-powered Clinical Simulation

  • GenAI-based chatbots now simulate patient encounters for medical training (e.g., history taking, OSCE simulation), offering scalable and accessible practice opportunities.

5. Faculty and Student Engagement

  • Reports and editorials recommend training all medical students and clinicians in using GenAI tools for tasks like information retrieval, research, and clinical documentation.

6. Equity, Ethics, and Responsible Integration

  • Attention to responsible and equitable GenAI use, with emphasis on bias, data privacy, and transparent evaluation in medical education.

Summary:
Since 2023, GenAI has moved from proof-of-concept (passing exams, simulating patients) to early-stage integration in medical curricula and clinical skills training. Leading academic centers are actively publishing guidance, research, and validation studies on best practices for responsible adoption.

 

 AJ: Please provide additional research progress on the use of GenAI or LLMs in clinical training.

 

 ChatGPT:

1. Patient Simulation and OSCE Training

  • GenAI chatbots (e.g., ChatGPT, Med-PaLM) are now used to simulate patient encounters for history taking, OSCE preparation, and clinical communication skills. Findings: Chatbots can provide realistic, scalable, and interactive patient scenarios. Trainees can practice and receive feedback in a safe environment.

2. AI-Assisted Clinical Reasoning and Decision Support

  • LLMs have been benchmarked against students and clinicians in clinical case vignettes and differential diagnosis tasks. Findings: LLMs are sometimes able to match or exceed the performance of medical students and even practicing clinicians in generating differential diagnoses, suggesting their utility as a clinical training aid.

3. Documentation and Charting Skills

  • GenAI has been used to assist trainees in creating discharge summaries, clinical notes, and even simulated EHR entries. Findings: Medical students using AI-generated summaries scored higher in note completeness and clarity; AI tools also help to standardize documentation and reduce clerical burden.

4. Communication, Empathy, and Feedback

  • LLMs are being tested for training communication skills, such as delivering bad news and showing empathy. Findings: AI can help trainees practice empathy and communication by simulating difficult conversations and providing instant, structured feedback.

5. Team-Based Training and Multidisciplinary Scenarios

  • GenAI models can simulate entire care teams and multidisciplinary discussions (e.g., tumor boards, case conferences). Findings: Enables trainees to understand multiple perspectives and practice collaborative clinical decision-making.

6. Automated Assessment and Personalized Feedback

  • AI-powered assessment tools can provide personalized feedback on clinical reasoning, note quality, and even spoken responses in OSCEs or simulated consults. Findings: Automated feedback is being validated to support faculty and improve training scalability.

 

  AJ: Focusing on general medicine, what specific advances have been made in medical education, clinical training, and healthcare practice?

 

  ChatGPT:

1. Medical Education (General Medicine)

A. Passing and Benchmarking Medical Exams

  • LLMs (e.g., ChatGPT, GPT-4, Med-PaLM) have passed core general medicine exams, including the USMLE, NBME, and specialty progress tests, in some cases outperforming human test takers.

B. Curriculum Integration & Faculty Guidance

  • Top medical schools (Stanford, UCSF, Harvard) are piloting LLMs for clinical reasoning, formative assessment, and content generation.

C. Equity and Accessibility

  • GenAI is seen as democratizing access to advanced clinical reasoning and education resources, particularly for underserved and rural settings.

2. Clinical Training (General Medicine)

A. Simulated Patient Encounters & OSCE Training

  • GenAI chatbots can act as virtual standardized patients, offering students and residents unlimited, consistent practice in history-taking, counseling, and decision-making.

B. Clinical Reasoning and Case-Based Learning

  • LLMs help learners generate differentials and management plans and receive instant feedback in interactive, case-based vignettes.

C. Communication and Empathy Training

  • GenAI is being piloted for communication skills: delivering bad news, shared decision-making, and showing empathy.

3. Healthcare Practice (General Medicine)

A. Decision Support for Primary Care and General Practice

  • LLMs have demonstrated higher accuracy than family medicine residents in diagnostic reasoning and management for common primary care scenarios.

B. Documentation and EHR Integration

  • LLMs can automate or assist in generating clinical notes, discharge summaries, and patient communications, reducing clinician documentation burden and errors.

C. Team-Based and Multidisciplinary Care Simulation

  • LLMs can simulate multidisciplinary care teams for collaborative clinical decision-making in general medicine.

In summary:
General medicine is at the forefront of GenAI adoption, with rapid progress in AI-assisted learning, virtual clinical training, and real-world clinical support—supported by a growing body of published research from top academic centers.

 

~

🚀 Healthcare GenAI Toolbox

 

🤖 ELHS GenAI Copilot Platform

The ELHS platform makes it easy for patients, navigators, and doctors to harness the power of GenAI for clinical care and research:

📚 Learn GenAI with a multi-LLM copilot for training and education.
🧭 Use navigation tools to monitor care quality and improve patient outcomes.
💬 Chat with your LLM copilot for personalized healthcare tasks.
🔬 Fine-tune Llama 3.1–8B for high-accuracy prediction of diseases. 🚀

 

🧠 Meta AI

Meta AI helps you learn, create, and connect in new ways.

With our latest features, you can now chat with recognizable voices 🗣️ and share images 🖼️ to explore and learn more about your surroundings.
Just ask Meta AI anything.

 

 

~

 

🔹 ELHS Institute Relevant Resources 🔹

🤖 ELHS GenAI Copilot Platform

🌐 ELHS Institute Website

📰 ELHS Newsletters

▶️ ELHS Videos

👉 For LLM Fine-Tuning Services, contact support@elhsi.org 📩

 

~ the end ~