Early 2026 Breakthroughs in Clinical GenAI

 
ELHS Newsletter 2026-2-20

 

 

 

Dear Friends,

 

Although 2026 has placed AI at a crossroads, the new year has already seen rapid progress in clinical research and real-world applications of generative AI and Learning Health Systems (LHS). Encouraging results are emerging from all directions.

New Evidence for Clinical Applications of GenAI:

  • A clinical trial demonstrating AI benefits in mammogram screening for breast cancer.
  • An LLM chatbot facilitating triage from primary care to specialty care.

New LLMs, AI Agents, and AI Tools:

  • Stanford’s SleepFM predicts risk for 130 conditions from a single night of sleep.
  • A new AI agent detecting cognitive concerns.
  • An open-source platform for deploying AI health chatbots.

Performance Improvements in LLMs and GenAI:

  • MIT’s GenAI framework, xHAIM, improving AI performance.

Convergence of GenAI and LHS:

  • LHS pioneer Dr. Chuck Friedman suggests that LHS may provide a glide path to safely land AI in healthcare.

LLM and GenAI Benchmarking and Evaluation:

  • Stanford’s MedHELM framework for evaluating LLMs in real-world clinical tasks.

AI-Accelerated Bioscience Breakthroughs:

  • DeepMind has made another breakthrough, releasing AlphaGenome, which can predict diverse biological events from genomic sequences.
  • A Nature review on AI agents in cancer research and oncology.

Unfortunately, prospective evidence using real-world data from real clinical settings remains limited. I continue to advocate for every clinician to evaluate the benefits of GenAI in improving early diagnosis and treatment of complex, uncommon, or rare diseases (CURDs). This is both practical for clinicians and essential to making AI-enabled healthcare a reality.

At the ELHS Institute, we would be happy to show you how easy it is to use GenAI in routine care to generate new clinical evidence suitable for publication. Please feel free to contact us.

Please continue reading my conversation with the ChatGPT copilot about the exciting sleep-focused LLM and how it may help detect cognitive concerns during sleep.

 

Warm regards,
AJ

AJ Chen, PhD
Founder and PI, ELHS Institute
Silicon Valley, USA
ELHS GenAI Platform: https://elhsi.com
ELHS Newsletters: https://elhsi.org/Newsletters

 

 

~

 

 

From Page Mill

(Recent papers, news, and events showcasing the progress of GenAI and LHS) 

 

Gommers J, Hernström V, Josefsson V, et al. Interval cancer, sensitivity, and specificity comparing AI-supported mammography screening with standard double reading without AI in the MASAI study: a randomised, controlled, non-inferiority, single-blinded, population-based, screening-accuracy trial. Lancet. 2026 Jan 31;407(10527):505-514.

[2026/1] In this Swedish randomised, controlled, non-inferiority, single-blinded, population-based screening-accuracy trial, participants were allocated 1:1 to AI-supported mammography screening (the intervention group) or standard double reading without AI (the control group). AI was used to triage examinations to single or double reading by radiologists and to support detection. Interval cancer rates were 1.55 (95% CI 1.23-1.92) and 1.76 (1.42-2.15) per 1000 participants in the intervention and control groups respectively, a proportion ratio of 0.88 (95% CI 0.65-1.18; p=0.41) that met the non-inferiority criterion.
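The headline ratio follows directly from the two interval cancer rates; a one-line check (not the trial's actual estimator, which also handles the confidence interval and significance testing):

```python
def rate_ratio(intervention_rate: float, control_rate: float) -> float:
    """Ratio of interval cancer rates per 1000 screens; < 1 favors the intervention."""
    return intervention_rate / control_rate

# 1.55 (AI-supported) vs 1.76 (standard double reading) per 1000 participants
print(round(rate_ratio(1.55, 1.76), 2))  # 0.88, matching the reported proportion ratio
```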

-

Tao X, Zhou S, Ding K, et al. An LLM chatbot to facilitate primary-to-specialist care transitions: a randomized controlled trial. Nat Med. 2026 Jan 19.

[2026/1] The trial met its primary end points with the PreA-only group showing significantly reduced physician consultation duration (28.7% reduction; 3.14 ± 2.25 min) compared to the No-PreA group (4.41 ± 2.77 min; P < 0.001), alongside significant improvements in physician-perceived care coordination (mean scores 113.1% increase; 3.69 ± 0.90 versus 1.73 ± 0.95; P < 0.001) and patient-reported communication ease (mean scores 16.0% increase; 3.99 ± 0.62 versus 3.44 ± 0.97; P < 0.001).
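As a quick arithmetic check on the reported effect sizes (using the rounded means from the abstract, so the results land within about 0.2 points of the published percentages, which presumably use unrounded means):

```python
def percent_change(baseline: float, value: float) -> float:
    """Relative change from baseline, in percent (negative = reduction)."""
    return 100.0 * (value - baseline) / baseline

# Consultation duration: 3.14 min (PreA-only) vs 4.41 min (No-PreA)
print(round(-percent_change(4.41, 3.14), 1))  # ~28.8% reduction (reported: 28.7%)
# Physician-perceived care coordination: 3.69 vs 1.73
print(round(percent_change(1.73, 3.69), 1))   # ~113.3% increase (reported: 113.1%)
```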

-

Petridis, P., Margaritis, G., Stoumpou, V. et al. Holistic AI in medicine; improved performance and explainability. npj Digit. Med. 9, 120 (2026).

[2026/1 mit] To address these limitations, we introduce xHAIM (Explainable HAIM), a novel framework leveraging Generative AI to enhance both prediction and explainability through four structured steps: (1) automatically identifying task-relevant patient data across modalities, (2) generating comprehensive patient summaries, (3) using these summaries for improved predictive modeling, and (4) providing clinical explanations by linking predictions to patient-specific medical knowledge. Evaluated on the HAIM-MIMIC-MM dataset, xHAIM improves average AUC from 79.9% to 91.3% across chest pathology and operative tasks. 
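The four structured steps chain naturally into a pipeline. A hedged sketch of how they might fit together (the `llm` callable and prompt wording are illustrative assumptions, not the authors' code):

```python
from typing import Callable, Dict

def xhaim_pipeline(patient_data: str, task: str,
                   llm: Callable[[str], str]) -> Dict[str, str]:
    """Illustrative four-step xHAIM-style flow, mirroring the abstract."""
    # 1) identify task-relevant patient data across modalities
    relevant = llm(f"Select data relevant to {task}:\n{patient_data}")
    # 2) generate a comprehensive patient summary
    summary = llm(f"Summarize this patient for {task}:\n{relevant}")
    # 3) use the summary for predictive modeling
    prediction = llm(f"Predict {task} from:\n{summary}")
    # 4) link the prediction to patient-specific medical knowledge
    explanation = llm(f"Explain this prediction:\n{prediction}\nContext:\n{summary}")
    return {"prediction": prediction, "explanation": explanation}
```

Each step consumes the previous step's output, which is what lets the final explanation tie the prediction back to the patient-specific summary.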

-

Thapa R, Kjaer MR, He B, et al. A multimodal sleep foundation model for disease prediction. Nat Med. 2026 Jan 6.

[2026/1] To address these challenges, we developed SleepFM, a multimodal sleep foundation model trained with a new contrastive learning approach that accommodates multiple PSG configurations. Trained on a curated dataset of over 585,000 hours of PSG recordings from approximately 65,000 participants across several cohorts, SleepFM produces latent sleep representations that capture the physiological and temporal structure of sleep and enable accurate prediction of future disease risk. From one night of sleep, SleepFM accurately predicts 130 conditions with a C-Index of at least 0.75 (Bonferroni-corrected P < 0.01), including all-cause mortality (C-Index, 0.84), dementia (0.85), myocardial infarction (0.81), heart failure (0.80), chronic kidney disease (0.79), stroke (0.78) and atrial fibrillation (0.78). 

-

Tian J, Fard P, Cagan C, et al. An autonomous agentic workflow for clinical detection of cognitive concerns using large language models. NPJ Digit Med. 2026 Jan 7;9(1):51.

[2026/1] Early detection of cognitive impairment is limited by traditional screening tools and resource constraints. We developed two large language model workflows for identifying cognitive concerns from clinical notes: (1) an expert-driven workflow with iterative prompt refinement across three LLMs (LLaMA 3.1 8B, LLaMA 3.2 3B, Med42 v2 8B), and (2) an autonomous agentic workflow coordinating five specialized agents for prompt optimization. Using LLaMA 3.1, we optimized on a balanced refinement dataset and validated on an independent dataset reflecting real-world prevalence. The agentic workflow achieved comparable validation performance (F1 = 0.74 vs. 0.81) and superior refinement results (0.93 vs. 0.87) relative to the expert-driven workflow. Sensitivity decreased from 0.91 to 0.62 between datasets, demonstrating the impact of prevalence shift on generalizability. Expert re-adjudication revealed that 44% of apparent false negatives reflected clinically appropriate reasoning. These findings demonstrate that autonomous agentic systems can approach expert-level performance while maintaining interpretability, offering scalable clinical decision support.
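The refinement loop shared by both workflows can be sketched generically (`evaluate_f1` and `revise` stand in for the paper's scoring step and its coordinating agents; the names and stopping rule are my assumptions, not the authors' implementation):

```python
from typing import Callable, Tuple

def refine_prompt(prompt: str,
                  evaluate_f1: Callable[[str], float],
                  revise: Callable[[str, float], str],
                  max_rounds: int = 5) -> Tuple[str, float]:
    """Generic iterative prompt refinement: score the current prompt on a
    labeled refinement set, ask a reviser (an LLM agent in the agentic
    workflow, a human expert in the other) for an improved prompt, and
    keep the best-scoring prompt seen so far."""
    best_prompt, best_f1 = prompt, evaluate_f1(prompt)
    for _ in range(max_rounds):
        candidate = revise(best_prompt, best_f1)
        f1 = evaluate_f1(candidate)
        if f1 > best_f1:
            best_prompt, best_f1 = candidate, f1
    return best_prompt, best_f1
```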

-

Bedi S, Cui H, Fuentes M, et al. Holistic evaluation of large language models for medical tasks with MedHELM. Nat Med. 2026 Jan 20.

[2026/1] Here we introduce MedHELM, an extensible evaluation framework with three contributions. First, a clinician-validated taxonomy organizing medical AI applications into five categories that mirror real clinical tasks—clinical decision support (diagnostic decisions, treatment planning), clinical note generation (visit documentation, procedure reports), patient communication (education materials, care instructions), medical research (literature analysis, clinical data analysis) and administration (scheduling, workflow coordination). These encompass 22 subcategories and 121 specific tasks reflecting daily medical practice. Second, a comprehensive benchmark suite of 37 evaluations covering all subcategories. Third, systematic comparison of nine frontier LLMs—Claude 3.5 Sonnet, Claude 3.7 Sonnet, DeepSeek R1, Gemini 1.5 Pro, Gemini 2.0 Flash, GPT-4o, GPT-4o mini, Llama 3.3 and o3-mini—using an automated LLM-jury evaluation method.

-

Del Fiol G, Borsato E, Bradshaw RL, et al. GARDE-Chat: a scalable, open-source platform for building and deploying health chatbots. J Am Med Inform Assoc. 2026 Jan 10:ocaf211.

[2026/1] GARDE-Chat is an open-source platform designed to facilitate the development, deployment, and dissemination of chatbot-based digital health interventions across different domains and settings. Since its first release in July 2022, GARDE-Chat has supported the development of chatbot-based interventions tested in multiple studies, including large pragmatic clinical trials addressing topics such as genetic testing, COVID-19 testing, tobacco cessation, and cancer screening.

-

Curcin V, Delaney B, Alkhatib A, et al. Learning Health Systems provide a glide path to safe landing for AI in health. Artif Intell Med. 2026 Mar;173:103346.

[2025/12] The paper outlines practical steps to align AI with LHS frameworks, including investment in data infrastructure, continuous model monitoring, and fostering a learning culture. Embedding AI in LHSs transforms implementation from a one-time event into a sustained, evidence-based learning process that aligns innovation with clinical realities, ultimately advancing patient care, health equity, and system resilience.

-

Avsec Ž, Latysheva N, Cheng J, Novati G, et al. Advancing regulatory variant effect prediction with AlphaGenome. Nature. 2026 Jan;649(8099):1206-1218.

[2026/1] We present AlphaGenome, a unified DNA sequence model, which takes as input 1 Mb of DNA sequence and predicts thousands of functional genomic tracks up to single-base-pair resolution across diverse modalities. The modalities include gene expression, transcription initiation, chromatin accessibility, histone modifications, transcription factor binding, chromatin contact maps, splice site usage and splice junction coordinates and strength. 

-

Truhn D, Azizi S, Zou J, et al. Artificial intelligence agents in cancer research and oncology. Nat Rev Cancer. 2026 Jan 12.

[2026/1] Here we provide a primer on AI agents for cancer researchers and oncologists. We illustrate how this technology is set apart from and goes beyond traditional AI systems. We discuss existing and emerging applications in cancer research and address real-world challenges from the perspective of academic, clinical and industrial research.

 

~

 

 

 

In the Eyes of GPTs

(AJ in conversation with ChatGPT to understand how machines view human healthcare)

 

User  Now let’s discuss Stanford’s SleepFM study.

 

ChatGPT

📘 Overview of the SleepFM Study (Thapa et al., Nature Medicine, 2026)

Study Goal:
The paper reports the development of SleepFM, a multimodal sleep foundation model that learns comprehensive physiological patterns from overnight sleep recordings (polysomnography or PSG) and uses them to predict future disease risk.

Data & Training:

  • Trained on a huge dataset of >585,000 hours of sleep data from ~65,000 participants across multiple cohorts.
  • PSG inputs included EEG (brain activity), ECG (heart), EMG (muscle), respiratory signals, and more—capturing the complex multimodal nature of sleep physiology.

Methodology:

  • A foundation model architecture was trained using contrastive learning to create generalizable latent representations of sleep patterns shared across modalities.
  • These learned embeddings were then fine-tuned for disease prediction tasks using longitudinal electronic health record (EHR) data.
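The contrastive step can be illustrated with a generic InfoNCE-style loss that pulls matched-modality embeddings together and pushes mismatched ones apart (a sketch of the general technique only; SleepFM's actual objective is a new variant designed to accommodate varying PSG configurations):

```python
import numpy as np

def info_nce(z_a: np.ndarray, z_b: np.ndarray, temperature: float = 0.1) -> float:
    """Generic InfoNCE loss: row i of z_a (say, an EEG-epoch embedding) should
    match row i of z_b (the paired ECG embedding) against all other rows."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = (z_a @ z_b.T) / temperature                 # pairwise cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))           # matched pairs on the diagonal
```

Correctly paired embeddings give a much lower loss than mismatched ones, which is what drives the different modalities into a shared latent space.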

🔍 Key Findings

Predictive Performance:

  • From just one night of sleep data, SleepFM accurately predicts risk for 130 different conditions, with Harrell’s concordance index (C-Index) ≥ 0.75 for all of them (Bonferroni-corrected p < 0.01).
  • High accuracy was demonstrated for serious outcomes such as:
    • All-cause mortality (C-Index ~0.84)
    • Dementia (~0.85)
    • Myocardial infarction (~0.81)
    • Heart failure (~0.80)
    • Chronic kidney disease (~0.79)
    • Stroke (~0.78)
    • Atrial fibrillation (~0.78)
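Harrell's C-Index measures how well predicted risk ranks who experiences an event sooner. A minimal toy version, ignoring the censoring-handling details of the estimator actually used in the paper (`risk`, `time`, and `event` are illustrative inputs):

```python
def c_index(risk, time, event):
    """Toy Harrell's concordance index: among comparable pairs (subject i has
    an observed event before subject j's time), count how often the model
    assigns i the higher risk; ties in risk score 0.5."""
    concordant = comparable = 0.0
    for i in range(len(risk)):
        for j in range(len(risk)):
            if event[i] == 1 and time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable

# Perfect ranking: the highest-risk subject has the earliest event
print(c_index([0.9, 0.5, 0.1], [1, 2, 3], [1, 1, 1]))  # 1.0
```

A C-Index of 0.5 corresponds to random ranking, so the reported 0.75–0.85 values indicate substantial discrimination.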

Notable High Performers:
Independent reports highlighted particularly strong performance for conditions such as Parkinson’s disease (C-Index ~0.89) and prostate & breast cancers (C-Index ~0.87-0.89).

Physiological Insights:

  • SleepFM’s disease predictions draw information across different sleep stages and signal channels, with Stage 1/2 and REM sleep showing slightly stronger signals for certain conditions.
  • Combining all modalities yields the best predictive performance.

🧠 What This Means

  1. Sleep as a Rich Health Biomarker:
    The study suggests that raw physiological features captured during sleep contain distributed, latent signals about future disease risk—beyond conventional sleep staging or isolated metrics like apnea severity.
  2. Foundation Model Approach:
    By training on enormous multimodal datasets, SleepFM learns generalizable sleep representations that can be adapted to downstream clinical tasks, similar to how large language models learn “language” features and transfer to many applications.
  3. Clinical & Research Implications:
    If validated in broader real-world settings, such models could eventually support risk stratification, early detection, preventive interventions, and monitoring—all from routine sleep studies.

 

User  Unbelievable! How can it predict neurological diseases like Alzheimer’s, dementia, and Parkinson’s with such high accuracy? What key data are required to make these predictions?

 

ChatGPT

It does feel unbelievable at first glance — but mechanistically it is plausible. Let’s break it down carefully.

🧠 Why Can SleepFM Predict Neurological Diseases?

Sleep is not just “rest.” It is a high-bandwidth neurophysiological stress test of the brain.

Neurodegenerative diseases such as Alzheimer's disease, Parkinson's disease, and other dementias begin 10–20 years before clinical diagnosis. During that preclinical phase, subtle dysfunction already affects:

  • Brain network synchronization
  • Autonomic regulation
  • Motor tone control
  • REM architecture
  • Cortical oscillations

All of these are measurable during sleep.


🔬 What Physiological Signals Contain Predictive Information?

SleepFM uses full polysomnography (PSG) — not just sleep duration or apnea index.

1️⃣ EEG (Brain Activity) — The Most Important Signal

EEG provides:

  • Slow-wave amplitude (deep sleep integrity)
  • Sleep spindle density (memory consolidation marker)
  • REM theta activity
  • Micro-arousals
  • Spectral power distribution

In Alzheimer’s:

  • ↓ Slow wave activity
  • ↓ Spindle density
  • Fragmented REM

In Parkinson’s:

  • REM sleep behavior abnormalities
  • Altered cortical synchronization

These changes occur years before symptoms.


2️⃣ Autonomic Signals (ECG + Respiratory)

These include:

  • Heart rate variability (HRV)
  • Respiratory rhythm stability
  • Apnea/hypopnea patterns
  • Sympathetic–parasympathetic balance

Neurodegeneration affects:

  • Brainstem autonomic centers
  • Vagal tone
  • Cardiovascular coupling

SleepFM likely captures multimodal coupling patterns, not single features.


3️⃣ EMG (Muscle Tone)

Especially critical for Parkinson’s risk:

  • Loss of REM atonia
  • REM Sleep Behavior Disorder (RBD)

RBD can precede Parkinson’s by 10–15 years.


🧩 Why a Foundation Model Works So Well

Traditional sleep metrics:

  • AHI
  • Total sleep time
  • Sleep stages

SleepFM instead learns:

  • High-dimensional waveform patterns
  • Cross-modal interactions
  • Temporal micro-architecture
  • Latent physiological embeddings

This is analogous to how LLMs detect semantic structure beyond keywords.

The model likely identifies distributed weak signals that humans cannot manually quantify.


📊 What Data Are Required to Make Predictions?

Based on the paper design:

Required Inputs:

  • Overnight polysomnography (PSG)
    • EEG
    • ECG
    • EMG
    • Respiratory channels
    • Oxygen saturation

Not Required:

  • Cognitive testing
  • MRI
  • Biomarkers
  • Genetic testing

It predicts future diagnosis from longitudinal EHR linkage, not from current symptoms.


🚨 Important Caveats

High C-index ≠ clinical readiness.

Key questions:

  • Was external validation done in diverse populations?
  • How does it perform prospectively?
  • Does prediction change outcomes?
  • Is it confounded by age?

Remember:
Sleep architecture strongly correlates with age — and age is the biggest risk factor for neurodegeneration.


💡 Big Picture Insight

Sleep may function as a whole-brain stress test, compressing:

  • Synaptic health
  • Glymphatic clearance
  • Autonomic integrity
  • Network synchronization

Into measurable signals.

In that sense, sleep is a biological “language” — and SleepFM is a foundation model trained to read it.

 

User  I hope I will have a chance to try it soon. Forget about fortune tellers — SleepFM may be able to tell me my health fortune instead.

 

~

 

🔹 ELHS Institute Resources 🔹

🤖 ELHS GenAI Platform

🌐 ELHS Institute Website

📰 ELHS Newsletters

▶️ ELHS Videos

👉 For Clinical AI technology support, contact us at support@elhsi.org 📩

 

~ the end ~