(Paving the Way for Global Health Equity with Generative AI and Learning Health Systems)
Dear Friends,
Cancer patient navigation—the process of guiding patients through the healthcare system to ensure they receive the support and services they need throughout their cancer diagnosis and treatment—remains a critical challenge, particularly in underserved communities. The recent global competition, the Transforming Cancer Navigation with Open Data & APIs Challenge, sponsored by the Robert Wood Johnson Foundation and AcademyHealth, was designed to address this problem by inspiring data enthusiasts and tech innovators to develop cutting-edge solutions that improve the cancer patient navigation experience.
Patient navigators have four primary responsibilities: screening for social needs, monitoring care quality, scheduling appointments, and ensuring connections to available services. Without standardized AI diagnostic tools, navigators are currently unable to effectively and equitably monitor cancer care quality at scale. This limits their ability to ensure patients receive timely and high-quality care throughout their cancer journey.
We believe generative AI offers a powerful opportunity to create broad-spectrum tools that enhance the navigator’s role in monitoring care quality. However, most navigators—particularly those in low-resource clinics and hospitals—still lack access to high-accuracy GenAI tools that meet the standards for responsible AI. To address this gap, we created the GenAI-ELHS solution to empower patient navigators to equitably and effectively monitor cancer care quality at scale.
We have implemented a free, web-based navigation tool that allows patients to manually enter or integrate their health records. Navigators then use GenAI to predict the expected standard of care based on each patient’s case, compare it to the actual care received, identify potential quality gaps, and recommend next steps.
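For readers curious how such a quality-gap check might be prototyped, below is a minimal, illustrative sketch (not the ELHS implementation). It assumes an open LLM served behind an OpenAI-compatible endpoint; the server URL, model name, prompt wording, and patient example are placeholders.

```python
# Minimal illustrative sketch (not the ELHS implementation): asking an open LLM,
# served behind an OpenAI-compatible endpoint, to compare guideline-expected care
# with the care a patient actually received and to flag potential quality gaps.
from openai import OpenAI

# Hypothetical local server and model name; replace with your own deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

PROMPT = """You are assisting a cancer patient navigator.

Patient case summary:
{case_summary}

Care received so far:
{care_received}

1. List the guideline-expected standard of care for this case.
2. Compare it with the care received and flag any quality gaps.
3. Recommend concrete next steps for the navigator.
Answer in three short sections."""

def review_care_quality(case_summary: str, care_received: str,
                        model: str = "llama-3.1-8b-instruct") -> str:
    """Return the LLM's expected-vs-actual care comparison and next-step suggestions."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(
            case_summary=case_summary, care_received=care_received)}],
        temperature=0.2,  # keep the output conservative for clinical review
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(review_care_quality(
        "62-year-old woman, newly diagnosed stage II ER+/HER2- breast cancer.",
        "Mammogram and core biopsy completed; no oncology consult scheduled after 5 weeks.",
    ))
```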
A clinic-based version will be deployed on demand for clinics of any size. This version will use GenAI to automatically monitor cancer care quality on a regular basis, enabling doctors to identify gaps in timely diagnosis and personalized treatment. The solution integrates GenAI-based monitoring into Equitable Learning Health System (ELHS) units, ensuring a system-level transformation of cancer navigation over time (see initial study published in Nature Sci Rep).
We’re excited to share that our GenAI-ELHS solution was selected as a winner in the global challenge—a recognition of the potential impact of GenAI in helping patients, navigators, and doctors improve cancer care quality. Building on this momentum, we are launching free training programs for clinical teams to integrate responsible LLM-based GenAI into cancer care monitoring and to publish real-world evidence of its effectiveness.
Last month, several strong studies were published on LLM applications in healthcare. A selection of these is highlighted in the next section for your quick review. However, studies that demonstrate how GenAI can directly support real clinical tasks in frontline care delivery remain scarce.
In our conversation with ChatGPT below, we explore how to design clinical comparative effectiveness research for GenAI. We hope this helps clarify that every doctor already has what they need—real-world data and open LLMs—to begin conducting impactful clinical evaluation studies.
Enjoy your reading.
Warm regards,
AJ
AJ Chen, PhD
Founder and PI, ELHS Institute
ELHS Platform | https://elhsi.com
ELHS Institute | https://elhsi.org
~
(Source: Herox.com)
From Page Mill
(Recent papers, news, and events showcasing the progress of GenAI and LHS)
Stults CD, Deng S, Martinez MC, et al. Evaluation of an Ambient Artificial Intelligence Documentation Platform for Clinicians. JAMA Netw Open. 2025;8(5):e258614.
In this quality improvement study at a single health care organization using before and after survey results and electronic health record data, ambient AI was associated with decreased time in notes per appointment. Ambient AI was associated with improved clinician satisfaction at work and decreased cognitive load. Among 100 clinicians (53 male [53.0%]; mean [SD] age, 48.9 [11.0] years), 58 clinicians (58.0%) were in primary care and 92 clinicians had EHR metrics. Among 57 clinicians who completed both preimplementation and postimplementation surveys, there was a decrease in burnout from 24 clinicians (42.1%) to 20 clinicians (35.1%), although this was not a significant difference (P = .12). Mean (SD) time in notes per appointment significantly decreased from 6.2 (4.0) to 5.3 (3.5) minutes (P < .001), with a bigger decrease for female vs male clinicians (8.1 [3.9] to 6.7 [3.6] minutes vs 4.7 [3.5] to 4.2 [3.1] minutes; P = .001).
Zeng D, Qin Y, Sheng B, Wong TY. DeepSeek’s “Low-Cost” Adoption Across China’s Hospital Systems: Too Fast, Too Soon? JAMA. Published online April 28, 2025.
DeepSeek is seen as a low-cost and open-source AI model and has generated excitement globally but particularly so in China. DeepSeek’s rapid adoption in China’s hospital systems presents a common paradox in health care technology: balancing the opportunity to rapidly integrate innovative technology to transform health care with the potential risks and dangers that accompany such rapid adoption (“first, do no harm”). We propose a global collaborative effort among AI developers, physicians, health care administrators, and policymakers to ensure that new AI technology, particularly frontier LLMs that are now inexpensive and widely available, can be used as a trustworthy and safe tool in health care, rather than a potential source of harm. Only through a balanced approach incorporating robust clinical safety with regulatory oversight can new AI models truly realize their potential to transform health care.
Tordjman, M., Liu, Z., Yuce, M. et al. Comparative benchmarking of the DeepSeek large language model on medical tasks and clinical reasoning. Nat Med (2025).
In the USMLE test, the performance of DeepSeek-R1 (accuracy = 0.92) was slightly inferior to that of ChatGPT-o1 (accuracy = 0.95; p = 0.04) but better than that of Llama 3.1-405B (accuracy = 0.83; p < 10⁻³). For text-based case challenges, DeepSeek-R1 performed similarly to ChatGPT-o1 (accuracy of 0.57 vs 0.55; p = 0.76 and 0.74 vs 0.76; p = 0.06, using the New England Journal of Medicine and Medicilline databases, respectively). For RECIST classifications, DeepSeek-R1 also performed similarly to ChatGPT-o1 (0.73 vs 0.81; p = 0.10). Diagnostic reasoning steps provided by DeepSeek were deemed more accurate than those provided by ChatGPT and Llama 3.1-405B (average Likert scores of 3.61, 3.22, and 3.13, respectively; p = 0.005 and p < 10⁻³). However, summarized imaging reports provided by DeepSeek-R1 exhibited lower global quality than those provided by ChatGPT-o1 (5-point Likert score: 4.5 vs 4.8; p < 10⁻³). This study highlights the potential of the DeepSeek-R1 LLM for medical applications but also underlines areas needing improvement.
Hong, Q., Liu, S., Wu, L. et al. Evaluating the performance of large language & visual-language models in cervical cytology screening. npj Precis. Onc. 9, 153 (2025).
Here, we constructed CCBench, a benchmark dataset dedicated to the evaluation of LLMs and LVLMs in cervical cytology screening, and developed a GPT-based semi-automatic evaluation pipeline to assess the performance of six LLMs (GPT-4, Bard, Claude-2.0, LLaMa-2, Qwen-Max, and ERNIE-Bot-4.0) and five LVLMs (GPT-4V, Gemini, LLaVA, Qwen-VL, and ViLT) on this dataset. We found that LLMs and LVLMs demonstrate promising accuracy and specialization in cervical cytology screening. GPT-4 achieved the best performance on the QA dataset, with an accuracy of 70.5% for close-ended questions and average expert evaluation score of 6.9/10 for open-ended questions. On the VQA dataset, Gemini achieved the highest accuracy for close-ended questions at 67.8%, while GPT-4V attained the highest expert evaluation score of 6.1/10 for open-ended questions.
Shashikumar, S.P., Mohammadi, S., Krishnamoorthy, R. et al. Development and prospective implementation of a large language model based system for early sepsis prediction. npj Digit. Med. 8, 290 (2025).
This study introduces COMPOSER-LLM, an open-source large language model (LLM) integrated with the COMPOSER model to enhance early sepsis prediction. For high-uncertainty predictions, the LLM extracts additional context to assess sepsis-mimics, improving accuracy. Evaluated on 2500 patient encounters, COMPOSER-LLM achieved a sensitivity of 72.1%, positive predictive value of 52.9%, F-1 score of 61.0%, and 0.0087 false alarms per patient hour, outperforming the standalone COMPOSER model. Prospective validation yielded similar results.
Gaber, F., Shaik, M., Allega, F. et al. Evaluating large language model workflows in clinical decision support for triage and referral and diagnosis. npj Digit. Med. 8, 263 (2025).
In this study, we benchmark multiple LLM versions and an LLM-based workflow incorporating retrieval-augmented generation (RAG) on a curated dataset of 2000 medical cases derived from the Medical Information Mart for Intensive Care database. Our findings show that these LLMs are capable of providing personalized insights into likely diagnoses, suggesting appropriate specialists, and assessing urgent care needs. These models may also support clinicians in refining diagnoses and decision-making, offering a promising approach to improving patient outcomes and streamlining healthcare delivery.
Yao, J., Wang, Y., Lei, Z. et al. Multimodal GPT model for assisting thyroid nodule diagnosis and management. npj Digit. Med. 8, 245 (2025).
We developed a multimodal generative pre-trained transformer for thyroid nodules (ThyGPT), aiming to provide a transparent and interpretable AI copilot model for thyroid nodule risk assessment and management. Ultrasound data from 59,406 patients across nine hospitals were retrospectively collected to train and test the model. After training, ThyGPT was found to assist in reducing biopsy rates by more than 40% without increasing missed diagnoses. In addition, it detected errors in ultrasound reports 1,610 times faster than humans. With the assistance of ThyGPT, the area under the curve for radiologists in assessing thyroid nodule risks improved from 0.805 to 0.908 (p < 0.001). As an AI-generated content-enhanced computer-aided diagnosis (AIGC-CAD) model, ThyGPT has the potential to revolutionize how radiologists use such tools.
Liu, X., Shi, S., Zhang, X. et al. The role of ChatGPT-4o in differential diagnosis and management of vertigo-related disorders. Sci Rep 15, 18688 (2025).
This study compared the diagnostic accuracy of an artificial intelligence chatbot and clinical experts for vertigo-related diseases and evaluated the chatbot’s ability to address vertigo-related issues. Twenty clinical questions about vertigo were input to ChatGPT-4o, and three otologists evaluated the responses using a 5-point Likert scale for accuracy, comprehensiveness, clarity, practicality, and credibility. The model’s diagnostic accuracy was comparable to a clinician with one year of experience but inferior to a clinician with five years of experience, and the differences in accuracy among the three methods were statistically significant (p = 0.04). ChatGPT-4o shows promise as a supplementary tool for managing vertigo but requires improvements in readability and diagnostic capabilities.
Li, J., Navathe, A.S. & Zhang, Y. Aligning incentives: the importance of behavioral economic perspectives in AI adoption. npj Health Syst. 2, 18 (2025).
We identify three behavioral economic aspects crucial for healthcare providers adopting AI: 1) cognitive overload, 2) risk aversion, and 3) social preferences. These perspectives offer a framework for understanding how AI should be integrated into healthcare settings in ways that ensure both technological efficiency and alignment with human-centric principles. Integrating AI into healthcare requires not only technological integration but also alignment with existing operational and incentive frameworks to ensure utilization and to prevent liability.
Hein, D., Christie, A., Holcomb, M. et al. Iterative refinement and goal articulation to optimize large language models for clinical information extraction. npj Digit. Med. 8, 301 (2025).
We present a novel, end-to-end pipeline leveraging large language models (LLMs) for highly accurate information extraction and normalization from unstructured pathology reports, focusing initially on kidney tumors. Applying the finalized pipeline to 2297 kidney tumor reports with pre-existing templated data available for validation yielded a macro-averaged F1 of 0.99 for six kidney tumor subtypes and 0.97 for detecting kidney metastasis.
Hou, Z., Liu, H., Bian, J. et al. Enhancing medical coding efficiency through domain-specific fine-tuned large language models. npj Health Syst. 2, 14 (2025).
In this study, we evaluated whether fine-tuning LLMs with specialized ICD-10 knowledge can automate code generation across clinical documentation. We adopted a two-phase approach: initial fine-tuning using 74,260 ICD-10 code–description pairs, followed by enhanced training to address linguistic and lexical variations. Evaluations using a proprietary model (GPT-4o mini) on a cloud platform and an open-source model (Llama) on local GPUs demonstrated that initial fine-tuning increased exact matching from <1% to 97%, while enhanced fine-tuning further improved performance in complex scenarios, with real-world clinical notes achieving 69.20% exact match and 87.16% category match. These findings indicate that domain-specific fine-tuned LLMs can reduce manual burdens and improve reliability.
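As a rough illustration of the kind of data preparation this study describes (not the authors' pipeline), the sketch below packages ICD-10 code–description pairs into a chat-style JSONL file suitable for supervised fine-tuning of an open LLM; the example codes and file name are placeholders.

```python
# Illustrative sketch only (not the authors' pipeline): packaging ICD-10
# code–description pairs into a chat-style JSONL file for supervised fine-tuning
# of an open LLM. The example codes and file name are placeholders.
import json

icd10_pairs = [
    ("E11.9", "Type 2 diabetes mellitus without complications"),
    ("I10", "Essential (primary) hypertension"),
]

with open("icd10_finetune.jsonl", "w") as f:
    for code, description in icd10_pairs:
        example = {
            "messages": [
                {"role": "user", "content": f"Assign the ICD-10 code for: {description}"},
                {"role": "assistant", "content": code},
            ]
        }
        f.write(json.dumps(example) + "\n")
```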
~
In the Eyes of GPTs
(AJ in conversation with ChatGPT to understand how machines view human healthcare)
Diagnosis is the first critical step in clinical care delivery. To optimize diagnostic accuracy and improve patient outcomes, health systems should also implement screening for prevention before diagnosis and monitoring of diagnostic quality after diagnosis.
What do you think? Is there anything else that should be added to an ideal diagnosis implementation model aimed at delivering timely and accurate diagnoses for all populations?
ChatGPT:
Diagnosis is the critical first step in clinical care delivery. To improve patient outcomes, health systems must not only optimize diagnostic accuracy but also implement proactive screening for prevention before diagnosis and systematic monitoring of diagnostic quality after diagnosis.
✅ Suggestions to Enhance an Ideal Medical Diagnosis Implementation:
In summary, to deliver timely and accurate diagnosis for all populations, an ideal implementation must go beyond clinical expertise alone—it must include equity-focused access, data-driven tools, feedback mechanisms, and responsible AI to support the entire diagnostic journey from prevention to post-diagnosis monitoring.
That's a fairly complete list. What features are usually not implemented in current clinical workflows?
ChatGPT:
Great question. While many health systems aspire to deliver timely and accurate diagnosis, several key features in your ideal list are often missing or inconsistently implemented in current clinical workflows—especially in low-resource or fragmented settings. Below is a structured breakdown:
❌ Features Usually Not Implemented (or Poorly Implemented) in Current Clinical Workflows
1. Systematic Diagnostic Quality Monitoring
2. Patient Involvement in Diagnostic Reasoning
3. Equity-Focused Diagnostic Access
4. AI or CDS Integration for Frontline Diagnostics
5. Population-Level Screening Based on Predictive Analytics
6. Training on Cognitive Bias and Diagnostic Uncertainty
7. Interoperable, Real-Time Data Aggregation
8. Learning Health System (LHS) Framework for Diagnosis
Ideal Feature | Current Status
Diagnostic quality monitoring | Rarely implemented
Patient involvement in diagnosis | Limited, unstructured
Equity-focused diagnostic access | Inconsistent or absent
AI/CDS for diagnosis | Limited deployment
Risk-based population screening | Rare outside pilot programs
Training on diagnostic error/cognitive bias | Largely missing
Real-time interoperable data | Fragmented
LHS-based diagnosis improvement | Rare
Wow—there's a lot missing! I'm going to start offering free training for clinical teams to study how GenAI can be integrated into the rarely implemented step of diagnostic quality monitoring. We need doctors to demonstrate the practical benefits of monitoring diagnostic quality in real-world settings.
Please provide an outline for a clinical evaluation study on how GenAI can help doctors monitor diagnostic quality using real-world data. Also, specify the most practical quantitative endpoints that can be used to measure effectiveness.
ChatGPT:
That’s a fantastic and impactful initiative. Integrating GenAI into diagnostic quality monitoring is not only innovative—it addresses a critical blind spot in modern healthcare delivery. Below is a detailed outline for a clinical evaluation study focused on using GenAI to assist doctors in monitoring and improving diagnostic quality using real-world data.
🧪 Study Outline: Evaluating GenAI for Diagnostic Quality Monitoring
🔹 Study Title:
Clinical Evaluation of a Generative AI–Assisted System for Monitoring Diagnostic Quality in Real-World Healthcare Settings
🔹 Study Objectives:
🔹 Study Design:
🔹 Intervention:
A GenAI-enabled tool that performs:
🔹 Data Sources:
🔹 Quantitative Endpoints:
Endpoint | Definition | Rationale
Diagnostic error rate | % of initial diagnoses changed within 30 days (e.g., via new imaging, lab, or specialist) | Quantifies misdiagnosis or premature closure
Diagnostic delay rate | % of patients with > X-day delay between first abnormal finding and confirmed diagnosis | Captures latency in decision-making
AI-identified diagnostic discrepancy rate | % of cases flagged by GenAI with meaningful discrepancy (confirmed by review) | Measures GenAI’s sensitivity to potential issues
Time to resolution of diagnostic uncertainty | Average time between uncertain initial diagnosis and final diagnosis | Captures improvement in resolving unclear cases
Clinician actionability rate | % of AI-flagged cases leading to a clinician review, follow-up test, or referral | Measures practical utility of GenAI alerts
Follow-up rate adherence | % of flagged cases with documented follow-up within the appropriate time window | Assesses impact on closing diagnostic loops
User trust score (Likert scale) | Survey-based scoring of clinicians' trust and satisfaction with GenAI feedback | Evaluates real-world usability
Cognitive load score (NASA-TLX or similar) | Self-reported cognitive load during the diagnostic process | Assesses whether GenAI reduces or adds burden
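To make these endpoints concrete, here is a minimal illustrative sketch of how the first two could be computed from de-identified case records; the field names and the 30-day and 14-day windows are assumptions, not requirements of the outline above.

```python
# Illustrative sketch only: computing two of the endpoints above from de-identified
# case records. Field names and the 30-day / 14-day windows are assumptions.
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Case:
    initial_dx: str
    initial_dx_date: date
    final_dx: str
    final_dx_date: date
    first_abnormal_finding_date: Optional[date] = None

def diagnostic_error_rate(cases: list[Case], window_days: int = 30) -> float:
    """% of initial diagnoses changed within the window (proxy for misdiagnosis or premature closure)."""
    changed = sum(
        1 for c in cases
        if c.final_dx != c.initial_dx
        and (c.final_dx_date - c.initial_dx_date).days <= window_days
    )
    return 100 * changed / len(cases)

def diagnostic_delay_rate(cases: list[Case], max_days: int = 14) -> float:
    """% of patients with more than max_days between first abnormal finding and confirmed diagnosis."""
    eligible = [c for c in cases if c.first_abnormal_finding_date is not None]
    delayed = sum(
        1 for c in eligible
        if (c.final_dx_date - c.first_abnormal_finding_date).days > max_days
    )
    return 100 * delayed / len(eligible)
```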
🔹 Qualitative Measures:
🔹 Ethics and Privacy:
🔹 Anticipated Outcomes:
🔹 Dissemination Plan:
In addition to comparing an intervention group with a control group, can we also compare a doctor’s diagnostic decision-making before and after seeing GenAI-generated predictions? What would be the benefits of such comparative effectiveness research?
ChatGPT:
Yes—comparing diagnostic decision-making before and after GenAI predictions by the same doctor is not only feasible but highly valuable. This approach is a form of within-subject comparative effectiveness research (CER), and it complements the intervention vs. control group design. It’s especially useful when:
✅ Study Design Add-On: Within-Physician Before/After Comparison
How It Works:
Benefit | Explanation
Controls for individual variability | Each doctor serves as their own control, removing between-clinician biases or skill differences
Sensitive to decision shifts | Captures subtle improvements, reversals, or confidence changes that might be missed in population-level outcomes
Real-world relevance | Reflects actual clinical decision-making flow where GenAI is used to augment, not replace, physician judgment
Faster feedback loop | Allows faster detection of GenAI's impact without waiting for long-term patient outcomes
Supports implementation science | Helps understand how GenAI affects human behavior, not just outcomes (e.g., when and why doctors override AI)
✅ Practical Metrics for Before/After Comparison
Metric | Measurement
Diagnosis change rate | % of cases where GenAI input led to a different final diagnosis
Management plan change rate | % of cases where test ordering, referrals, or treatment changed post-AI
Confidence delta | Self-reported confidence score before vs. after GenAI (e.g., 1–5 scale)
Time to decision | Seconds/minutes taken before vs. after GenAI input
Error reduction | % decrease in misdiagnoses or guideline deviations after GenAI input
Override justification analysis | Qualitative coding of why GenAI suggestions were accepted or rejected
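As a simple illustration of how such within-physician metrics could be tallied, the sketch below computes the diagnosis change rate and confidence delta from paired before/after records; the record fields, the 1–5 confidence scale, and the toy data are assumptions. A paired test such as the Wilcoxon signed-rank test could then be applied to the confidence changes.

```python
# Illustrative sketch only: tallying within-physician before/after metrics.
# The record fields, 1-5 confidence scale, and toy data are assumptions.
from statistics import mean

def diagnosis_change_rate(records: list[dict]) -> float:
    """% of cases where the final diagnosis differed after the GenAI prediction was shown."""
    changed = sum(1 for r in records if r["dx_before"] != r["dx_after"])
    return 100 * changed / len(records)

def confidence_delta(records: list[dict]) -> float:
    """Mean change in self-reported confidence (1-5) after seeing the GenAI prediction."""
    return mean(r["conf_after"] - r["conf_before"] for r in records)

# Toy paired records (hypothetical), one per case reviewed by the same doctor
records = [
    {"dx_before": "GERD", "dx_after": "unstable angina", "conf_before": 3, "conf_after": 4},
    {"dx_before": "migraine", "dx_after": "migraine", "conf_before": 4, "conf_after": 5},
    {"dx_before": "viral URI", "dx_after": "viral URI", "conf_before": 4, "conf_after": 4},
]

print(f"Diagnosis change rate: {diagnosis_change_rate(records):.1f}%")
print(f"Mean confidence delta: {confidence_delta(records):+.2f}")
```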
✅ Summary
Adding a before/after comparison within the same doctor:
~
Healthcare GenAI Toolbox
The ELHS platform makes it easy for patients, navigators, and doctors to harness the power of GenAI for clinical care and research:
✨ Learn GenAI with a multi-LLM copilot for training and education.
🧠 Meta AI: Meta AI helps you learn, create, and connect in new ways. With our latest features, you can now chat with recognizable voices 🗣️ and share images 🖼️ to explore and learn more about your surroundings.
~
🔹 ELHS Institute Relevant Resources 🔹
👉 For LLM Fine-Tuning Services, contact support@elhsi.org 📩
~ the end ~
Democratizing GenAI and LHS to Advance Global Health Equity
info@elhsi.org
Palo Alto, California, USA