How to Clean Up the Mess in Clinical AI Testing?

 

ELHS Newsletter 2024-09-02

Paving the Way for Global Health Equity with GenAI, ML, Data, and LHS (Learning Health Systems)

Subscribe | All Issues

Dear Friends,

 

The highly anticipated JAMIA Focus Issue on LLMs in biomedicine and health has been published, featuring a comprehensive collection of studies on LLM applications and methods. The editors identified LLM evaluation beyond multiple-choice questions (MCQs) as a significant gap, and I agree. Although MCQ-based evaluation is popular, it does not reflect real-world scenarios, where potential answers are dynamic rather than given in advance; previous research has shown that LLM performance drops significantly when testing moves from MCQs to open-form evaluations. Professor Tian at Stanford University and I contributed a benchmarking study to this issue that attempted an open-form evaluation of ChatGPT’s diagnostic prediction capability against the live Mayo Clinic Symptom Checker service. Covering a broad range of diseases, this benchmarking showed that ChatGPT’s GPT-4 model achieved an average accuracy of 78.8%, with some specialties scoring above 90%. I’m confident that ChatGPT would outperform all previous AI chatbots if they took the same tests. ChatGPT’s high accuracy in symptom checking is an unexpected breakthrough because it brings new hope for empowering less experienced physicians in low-resource regions with Generative AI (GenAI) within learning health system units, enhancing care quality and reducing health disparities.
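To make the open-form evaluation concrete, below is a minimal sketch of how such a benchmark can be scored, assuming a hypothetical ask_model function that returns a ranked list of free-text diagnosis predictions. The case format, matching rule, and all names are illustrative, not the actual protocol of the JAMIA study.

```python
# Minimal sketch of open-form diagnostic evaluation (illustrative only).
# Unlike MCQ testing, the model sees no answer options; it must generate
# candidate diagnoses, which are then matched against a reference diagnosis.
from typing import Callable, List

def top_k_accuracy(
    cases: List[dict],                      # [{"symptoms": ..., "reference": ...}]
    ask_model: Callable[[str], List[str]],  # hypothetical: returns ranked diagnoses
    k: int = 3,
) -> float:
    hits = 0
    for case in cases:
        prompt = f"List the {k} most likely diagnoses for: {case['symptoms']}"
        predictions = ask_model(prompt)[:k]
        # Substring match stands in for the clinician adjudication
        # a real benchmark would require.
        if any(case["reference"].lower() in p.lower() for p in predictions):
            hits += 1
    return hits / len(cases)
```

Per-specialty scores, like those reported in the study, would follow from grouping the cases by specialty before averaging.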

 

Evaluating GenAI interventions in clinical settings is even more challenging. Both JAMA and NEJM AI have called for GenAI clinical research using rigorous clinical trial methods. Recently, Nature published a News Feature titled “The testing of AI in medicine is a mess. Here’s how it should be done.” It reports that hundreds of medical algorithms have been approved on the basis of limited clinical data: a review found that only 65 randomized controlled trials of AI interventions were published between 2020 and 2022, yet regulators such as the US FDA have approved hundreds of AI-powered medical devices for use in hospitals and clinics. Testing AI tools faces numerous challenges; for example, AI performance is sensitive to user interactions (fragile algorithms?), and AI results can vary greatly in unfamiliar settings (data biases?). Scientists are debating who should test these tools and how best to do it. A centralized approach, led by the Coalition for Health AI (CHAI), is creating a network of health AI assurance laboratories nationwide; a distributed approach, the Health AI Partnership, provides technical assistance so that any organization can test AI models locally. However, I’m not convinced these traditional approaches are enough to meet the challenges posed by unconventional GenAI.

 

Famous AI educator and industry leader Dr. Andrew Ng has rightly focused on the AI deployment problem, because AI won’t become widespread until it can be reliably deployed and validated. He recently reiterated that, unfortunately, a major barrier to switching between models is the difficulty of implementing evaluations, and he anticipates that the AI industry must advance the science of evaluation.
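His point is easy to illustrate: switching models is only a measurable decision if every candidate sits behind the same interface and is scored on the same fixed evaluation set. The Model protocol and the substring scoring rule below are assumptions for the sketch, not any specific framework’s API.

```python
# Sketch: one fixed eval set, many interchangeable models (illustrative).
from typing import Protocol

class Model(Protocol):
    name: str
    def generate(self, prompt: str) -> str: ...

def evaluate(model: Model, eval_set: list[tuple[str, str]]) -> float:
    """Fraction of prompts whose output contains the expected answer."""
    correct = sum(
        expected.lower() in model.generate(prompt).lower()
        for prompt, expected in eval_set
    )
    return correct / len(eval_set)

# Choosing between models then reduces to comparing scores on the same set:
# scores = {m.name: evaluate(m, eval_set) for m in candidate_models}
```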

 

In my last newsletter, I highlighted that open-source LLMs are becoming the foundation of GenAI-enabled healthcare. The latest MIT Technology Review reported, “We finally have a definition for open-source AI.” Under this definition, an open-source system must be available to all without permission and free to be used, studied, modified, and shared, but developers are not required to release all of their training data sets. This is a positive development because it supports top open-source LLMs, including Llama from Meta and Gemma from Google, in staying on course.

 

Keep reading for the latest papers in From Page Mill and a discussion of the above topics in In The Eyes of GPTs.

 

Warm regards,

AJ

AJ Chen, PhD | ELHS Institute | https://elhsi.org
 

~

 

 

Source: Nature 632, 722-724 (2024)

 

 

From Page Mill

Published papers, recent news, and significant events showing the progress of GenAI and LHS.

 

Lu Z, et al. Large language models in biomedicine and health: current research landscape and future directions. JAMIA 2024;31(9):1801–1811.

[2024/8] This JAMIA Focus Issue on LLMs in biomedicine and health has drawn enthusiasm from researchers across many disciplines. The papers in the Focus Issue explore a wide range of tasks within two main thematic topics: (a) innovative uses of LLMs in various biomedical and healthcare applications, and (b) technical method development for enhancing the capability of standard LLMs. The editors identified LLM evaluation beyond multiple-choice questions (MCQs) as a major gap: MCQ-based evaluation does not reflect real-world scenarios, where potential answers are not provided ahead of time.

 

Lenharo M. The testing of AI in medicine is a mess. Here's how it should be done. Nature. 2024 Aug;632(8026):722-724. doi: 10.1038/d41586-024-02675-0.

[2024/8 news feature] Hundreds of medical algorithms have been approved on the basis of limited clinical data. Scientists are debating who should test these tools and how best to do it. Two main solutions exist now: (1) the Coalition for Health AI has proposed creating a network of health AI assurance laboratories, which would evaluate models against an agreed set of principles in a centralized way; (2) the Health AI Partnership aims to build capabilities and provide technical assistance so that any organization can test AI models locally.

 

Matheny ME, Yang J, Smith JC, et al. Enhancing Postmarketing Surveillance of Medical Products With Large Language Models. JAMA Netw Open. 2024;7(8):e2428276. doi:10.1001/jamanetworkopen.2024.28276

[2024/8] LLMs represent a novel paradigm that may facilitate the generation of information to support medical product postmarket surveillance activities that previously have not been possible. However, additional work is required to ensure that LLMs can be used in a fair and equitable manner, minimize false-positive findings, and support the rigor of signal detection needed for regulatory activities.

 

Wu C, Lin W, Zhang X, Zhang Y, Xie W, Wang Y. PMC-LLaMA: toward building open-source language models for medicine. J Am Med Inform Assoc. 2024 Sep 1;31(9):1833-1843. doi: 10.1093/jamia/ocae045.

[2024/4] In evaluations on various public medical QA benchmarks and in manual ratings, our lightweight PMC-LLaMA, which consists of only 13B parameters, exhibits superior performance, even surpassing ChatGPT.

 

~

In The Eyes of GPTs

GenAI is becoming an integral part of our society. AJ talks with ChatGPT to understand how machines view human healthcare.

 

  In the new environment where GenAI is being democratized in healthcare, what are the barriers to testing every GenAI tool for every clinical task in real clinical settings?

 

 ChatGPT

The barriers to testing every GenAI tool for every clinical task in real clinical settings include:

  1. Resource Constraints: The time, cost, and infrastructure required to conduct thorough testing for every GenAI tool are significant, making it difficult to test each one comprehensively.

  2. Data Quality and Availability: High-quality and representative data are crucial for testing, but access to such data can be limited, leading to potential biases in results.

  3. Regulatory and Ethical Challenges: Navigating the complex regulatory landscape and ensuring ethical considerations, such as patient safety and data privacy, adds layers of difficulty.

  4. Variability in Clinical Settings: Differences in how GenAI tools are used across various settings can lead to inconsistent performance, complicating standardized testing.

  5. Rapid Evolution of GenAI Models: The fast pace of updates and improvements to GenAI models makes it challenging to keep testing protocols current and relevant.

 

  What solutions do you suggest for creating innovative and adaptable GenAI clinical testing frameworks?

 

 ChatGPT 

To create innovative and adaptable GenAI clinical testing frameworks, I suggest the following solutions:

  1. Adaptive Trial Designs: Use Bayesian and platform trial designs that allow for continuous learning and adaptation based on real-time data, making the testing process more flexible and efficient (see the toy sketch after this list).

  2. Federated Learning and Testing: Implement federated learning approaches to enable decentralized testing across multiple institutions while preserving data privacy and ensuring diverse data sources.

  3. Synthetic Data and Simulation: Leverage synthetic data and simulation environments to conduct initial testing phases, which can help reduce the dependency on real-world data while ensuring early insights into GenAI performance.

  4. Context-Aware Testing Protocols: Develop protocols that consider the specific clinical context, including patient history and evolving symptoms, to evaluate GenAI tools more accurately.

  5. Hybrid Testing Models: Combine centralized and decentralized approaches to balance the need for standardization with the flexibility required to address diverse clinical environments.

  6. Regulatory Sandboxes: Create regulatory sandboxes that allow for controlled testing of GenAI tools under the supervision of regulators, enabling innovation while ensuring compliance with safety standards.

  7. Real-World Evidence Integration: Use real-world data and post-market surveillance to continuously monitor and improve GenAI tools, ensuring they remain effective and safe in diverse clinical settings.
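As a toy illustration of the adaptive designs in item 1, a simple Beta-Binomial model can track the posterior probability that a GenAI tool’s accuracy exceeds a safety threshold as adjudicated cases accumulate, letting a trial stop early for success or futility. The prior, threshold, and stopping boundaries here are illustrative assumptions, not a validated trial design.

```python
# Toy Bayesian interim monitoring for an adaptive GenAI trial (illustrative).
from scipy.stats import beta

def prob_above_threshold(successes: int, failures: int,
                         threshold: float = 0.80) -> float:
    """P(accuracy > threshold | data) under a Beta(1+s, 1+f) posterior."""
    return 1.0 - beta.cdf(threshold, 1 + successes, 1 + failures)

# Hypothetical interim batches of adjudicated cases: 1 = correct, 0 = incorrect.
case_batches = [[1, 1, 0, 1, 1], [1, 1, 1, 0, 1], [1, 1, 1, 1, 1]]

s = f = 0
for batch in case_batches:
    s += sum(batch)
    f += len(batch) - sum(batch)
    p = prob_above_threshold(s, f)
    print(f"after {s + f} cases: P(accuracy > 0.80) = {p:.3f}")
    if p > 0.975:    # decisive evidence of adequate accuracy
        print("stop early: success")
        break
    if p < 0.025:    # decisive evidence of inadequate accuracy
        print("stop early: futility")
        break
```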

 

  It’s not just the scale of GenAI applications; healthcare has also never seen anything like the rapid update cycle of the underlying LLMs. What challenges will arise from this high frequency of GenAI model updates?

 

 ChatGPT 

The rapid updates of underlying LLMs in GenAI applications present several challenges in healthcare:

  1. Consistency in Clinical Practice: Frequent updates can lead to inconsistencies in how GenAI tools perform, potentially disrupting clinical workflows and creating uncertainty among healthcare providers.

  2. Validation and Re-Testing: Each update may require re-validation and re-testing of the GenAI tool to ensure it meets clinical standards, which can be time-consuming and resource-intensive.

  3. User Training and Adaptation: Healthcare professionals need to be continually retrained to use the updated tools effectively, which can strain training resources and impact the adoption of GenAI tools.

  4. Data Compatibility: New versions of LLMs may require adjustments in data handling or integration with existing systems, leading to potential compatibility issues.

  5. Regulatory Compliance: Ensuring that each updated model complies with regulatory standards can be challenging, especially if updates are frequent and substantial.

  6. Patient Safety: Rapid updates might introduce unforeseen errors or biases, potentially affecting patient safety if not thoroughly tested before deployment.

 

  What solutions do you suggest for keeping up with these rapid LLM changes in clinical settings?

 

 ChatGPT 

To keep up with the rapid changes in LLMs in clinical settings, the following solutions can be implemented:

  1. Continuous Integration and Deployment (CI/CD): Establish CI/CD pipelines that allow for automated testing, validation, and deployment of updated LLM models, ensuring that updates are swiftly and safely integrated into clinical systems (a minimal sketch follows this list).

  2. Modular Architecture: Design GenAI tools with a modular architecture so that individual components, like LLMs, can be updated independently without disrupting the entire system.

  3. Ongoing Training and Education: Implement regular training programs for healthcare professionals to keep them up-to-date with the latest changes in LLM tools, ensuring they can effectively utilize new features.

  4. Version Control and Rollback Mechanisms: Use robust version control systems to track changes and implement rollback mechanisms in case an update introduces issues, allowing for quick reversion to a stable version.

  5. Regulatory Collaboration: Work closely with regulatory bodies to create agile compliance frameworks that can accommodate frequent updates without compromising safety or effectiveness.

  6. Real-Time Monitoring and Feedback: Implement real-time monitoring systems to track the performance of updated LLMs in clinical settings and gather immediate feedback, enabling prompt adjustments if necessary.
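As a minimal sketch of the CI/CD gate in item 1, the script below re-scores a candidate model update on a fixed clinical evaluation suite and blocks deployment if any task falls below an absolute floor or regresses against the deployed model. The thresholds, file names, and layout are assumptions for illustration.

```python
# Illustrative CI gate for LLM updates; all names and thresholds are assumed.
import json
import sys

BASELINE_FILE = "baseline_scores.json"  # scores of the currently deployed model
MIN_ABSOLUTE = 0.80                     # floor every task must meet
MAX_REGRESSION = 0.02                   # allowed drop vs. the deployed model

def gate(new_scores: dict[str, float]) -> bool:
    with open(BASELINE_FILE) as fh:
        baseline = json.load(fh)
    ok = True
    for task, score in new_scores.items():
        floor = max(MIN_ABSOLUTE, baseline.get(task, 0.0) - MAX_REGRESSION)
        if score < floor:
            print(f"FAIL {task}: {score:.3f} < required {floor:.3f}")
            ok = False
    return ok

if __name__ == "__main__":
    # In CI, an earlier step would re-run the eval suite on the candidate
    # model and write its per-task scores to the file passed as argv[1].
    with open(sys.argv[1]) as fh:
        new_scores = json.load(fh)
    sys.exit(0 if gate(new_scores) else 1)
```

A rollback mechanism (item 4) then amounts to redeploying the version whose scores are recorded in the baseline file.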

 

 

Resources:

ELHS GenAI Copilot Platform: https://elhsi.com

Tech Support Contact: support@elhsi.org

ELHS Newsletters: https://elhsi.org/Newsletters

ELHS Institute Website: https://elhsi.org

Prof. Andrew Ng Classes: https://deeplearning.ai

 

 

~ the end ~