Open-Source LLMs Ready for Prime Time to Power Care Delivery?

 

ELHS Newsletter 2024-08-05

Paving the Way for Global Health Equity with GenAI, ML, Data, and LHS (Learning Health Systems)

Subscribe | All Issues

 

 


Launching Free Copilot Tool

For students, doctors, and professionals to streamline GenAI learning and research

 

~

 

 

Dear Friends,

 

On July 23, 2024, Meta published a blog post introducing Llama 3.1 405B, a large language model (LLM) trained on over 15 trillion tokens. You may try it at meta.ai. This is a historic advancement for the democratization of generative AI (GenAI), particularly in healthcare. As Meta notes, open-source (OS) LLMs have until now been less powerful than their closed counterparts, but a new era is beginning with open source leading the way. Meta's experimental evaluation suggests that this 405B-parameter model is competitive with leading foundation models, including GPT-4, GPT-4o, and Claude 3.5 Sonnet, across a range of tasks.

 

I'm excited to see this milestone. It answers the big question in the back of everyone's mind: can OS-LLMs perform at the level of the best closed models? Now that the latest Llama models have demonstrated that a high-performance open-source baseline is possible, the road ahead becomes clearer. I see three steps that can bring GenAI based on OS-LLMs into real-world clinical care delivery:

  1. Systematically Benchmarking LLMs: In addition to benchmarking with medical Q&A questions, we need to benchmark LLMs on datasets and tasks that more closely resemble real clinical settings, including diagnostic prediction and treatment selection across all diseases. The goal is to understand clearly which LLMs work well for which tasks and diseases, providing a baseline for choosing the right targets for fine-tuning. For example, with my collaborator at Stanford University, I have published a new benchmarking study of symptom checking in JAMIA. Our ongoing benchmarking of diagnostic prediction across many diseases has also shown that Llama 3 and Gemma 2 achieve >70% accuracy, suggesting they are ready for clinical evaluation research (a minimal code sketch follows this list). The urgent and essential role of benchmarking is further underscored by UCSD researchers calling for a network of healthcare delivery organizations focused on the clinical effectiveness of AI models in real-world settings (see paper below).

  2. Fine-tuning OS-LLMs: To improve model performance for the patient populations served in a local healthcare setting, you may need to fine-tune the selected OS-LLM on your local patient data. In this step, the generalizability issue facing traditional machine learning (ML) must be overcome so that trusted and inclusive models are used responsibly. Goetz et al. present a strategy for meeting these generalization challenges in a paper highlighted below.

  3. Continuous Learning as Learning Health System (LHS) Units: In recent publications (Nature Sci Rep, JHMHP, JMIR AI), I have proposed deploying AI models as learning health system units to enable continuous learning and improvement of the models, which provides the system-level changes required for responsible development and use of GenAI. Although there is not yet a fully successful example of this ML-LHS approach, I believe it is the most promising strategy for avoiding the deployment failures often seen in hospitals.
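As a concrete illustration of step 1, here is a minimal benchmarking sketch in Python. It assumes a hypothetical JSONL file of labeled diagnostic cases and an open-source model served locally behind an OpenAI-compatible endpoint (for example, Ollama's default http://localhost:11434/v1); the file name, model tag, and exact-match scoring rule are illustrative and not those used in our published studies.

```python
# Minimal sketch: benchmark a locally served open-source LLM on
# diagnostic prediction. Assumes an OpenAI-compatible endpoint
# (e.g., Ollama) and a hypothetical JSONL file of labeled cases.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

def predict_diagnosis(case_description: str, model: str = "llama3") -> str:
    """Ask the model for the single most likely diagnosis."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "You are a clinical decision-support assistant. "
                        "Reply with the single most likely diagnosis only."},
            {"role": "user", "content": case_description},
        ],
        temperature=0,  # deterministic output for benchmarking
    )
    return resp.choices[0].message.content.strip()

correct = total = 0
with open("diagnostic_cases.jsonl") as f:  # hypothetical benchmark file
    for line in f:
        case = json.loads(line)  # {"case_description": ..., "diagnosis": ...}
        prediction = predict_diagnosis(case["case_description"])
        # Naive scoring: exact substring match of the gold label.
        correct += case["diagnosis"].lower() in prediction.lower()
        total += 1

print(f"Top-1 diagnostic accuracy: {correct / total:.1%} on {total} cases")
```

In practice, free-text diagnoses need a more forgiving matching scheme than substring comparison, such as expert review or synonym mapping.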

 

I see a day in the near future when fine-tuned OS-LLMs power predictive healthcare everywhere. Before that day, medical schools also need to accelerate the preparation of the next generation of doctors equipped with GenAI. To help, we have just released a free GenAI Copilot tool to streamline GenAI learning and research. Using GenAI as a copilot offers medical students, doctors, and healthcare professionals a new way to start evaluating GenAI in their workflows more quickly. Give it a try and let me know what you think.

 

Last month, there was another breakthrough in screening for and preventing Alzheimer's disease. The JAMA journals published two studies (see below) demonstrating that plasma p-tau217 and Aβ42/40 levels can predict the development of Aβ pathology in people with low Aβ levels and improve the diagnosis of Alzheimer's disease among individuals with cognitive symptoms. These results will help improve the management of Alzheimer's disease, including early diagnosis, risk reduction, and prevention. I expect GenAI to facilitate translating this new knowledge into Alzheimer's disease management.

Hope you enjoy reading these latest developments and my conversations with ChatGPT below.

 

Warm regards,

 

AJ

AJ Chen, PhD | ELHS Institute | web: elhsi.org
 

~

 


 

 

 

From Page Mill

Published papers, recent news, and significant events showing the progress of GenAI and LHS.

 

Longhurst CA, et al. A Call for Artificial Intelligence Implementation Science Centers to Evaluate Clinical Effectiveness. NEJM AI 2024;1(8). DOI: 10.1056/AIp2400223.

[2024/7] We call for increased recognition of implementation science principles and their adoption through development of a network of health care delivery organizations to focus on the clinical effectiveness of AI models in real-world settings to help achieve the shared goal of safer, more effective, and equitable care for all patients.

 

Goetz L, Seedat N, Vandersluis R, van der Schaar M. Generalization-a key challenge for responsible AI in patient-facing clinical applications. NPJ Digit Med. 2024 May 21;7(1):126. doi: 10.1038/s41746-024-01127-3.

[2024/5] Here we explore data-based reasons for generalization challenges and look at how selective predictions might be implemented technically, focusing on clinical AI applications in real-world healthcare settings. The criterion for responsible use of ML is whether we can trust the predictions of a model.

 

Palmqvist S, Tideman P, Mattsson-Carlgren N, et al. Blood Biomarkers to Detect Alzheimer Disease in Primary Care and Secondary Care. JAMA. 2024 Jul 28:e2413855. doi: 10.1001/jama.2024.13855.

[2024/7] The amyloid probability score 2 (APS2, β-amyloid 42/40 ratio) and percentage of p-tau217 alone had high diagnostic accuracy for identifying AD among individuals with cognitive symptoms in primary and secondary care using predefined cutoff values. Future studies should evaluate how the use of blood tests for these biomarkers influences clinical care.

 

Janelidze S, Barthélemy NR, Salvadó G, et al. Plasma Phosphorylated Tau 217 and Aβ42/40 to Predict Early Brain Aβ Accumulation in People Without Cognitive Impairment. JAMA Neurol. 2024 Jul 28:e242619. doi: 10.1001/jamaneurol.2024.2619.

[2024/7] This cohort study investigated whether combining plasma biomarkers could be useful for predicting subsequent development of Aβ pathology in cognitively unimpaired (CU) individuals with subthreshold brain Aβ levels (defined as Aβ levels <40 Centiloids) at baseline. The results suggest that combining plasma p-tau217 and Aβ42/40 levels could be useful for predicting the development of Aβ pathology in people at early stages of subthreshold Aβ accumulation. These biomarkers might thus facilitate screening of participants for future primary prevention trials.

 

~

 


 

 

 

In the Eyes of GPTs

GenAI is becoming an integral part of our society. AJ talks with ChatGPT to understand how machines view human healthcare.

 

 Do you think open-source LLMs are ready for prime time to power care delivery?

 

 ChatGPT

Open-source Large Language Models (LLMs) have made significant strides in recent years, but their readiness to power care delivery in healthcare is still under evaluation. While they offer great potential for democratizing AI access and improving clinical decision support, there are several challenges to address. These include ensuring data privacy, maintaining accuracy and reliability, addressing ethical concerns, and integrating seamlessly with existing healthcare systems. Further research, rigorous testing, and regulatory approvals are needed before open-source LLMs can be fully trusted and widely adopted in care delivery.

 

 What do you think of Meta's LLaMA LLMs?

 

 ChatGPT

Meta's LLaMA (Large Language Model Meta AI) models are notable for their high performance and efficiency. They are designed to be more resource-efficient than other large models like GPT-3, making them accessible for research and practical applications. LLaMA models have shown strong capabilities in various natural language processing tasks, including text generation, summarization, and translation. Meta's commitment to making these models available for research purposes is a positive step towards democratizing AI and fostering innovation in the AI community. However, as with any AI model, there are challenges related to data privacy, ethical use, and ensuring reliability that need to be addressed.

 

 Meta just released its latest LLaMA 3.1 405B model and showed that its performance on all common benchmarks is comparable to the best versions of ChatGPT and Claude. It proves that open-source LLMs can reach the highest prediction accuracy, which was previously achieved only by their closed counterparts. What do you think?

 

 ChatGPT

The release of Meta's LLaMA 3.1 405B model and its performance comparable to the best versions of ChatGPT and Claude is a significant milestone. It demonstrates that open-source Large Language Models can achieve the highest levels of prediction accuracy, previously thought to be the domain of closed-source models. This progress highlights the potential of open-source LLMs to democratize access to advanced AI capabilities, foster innovation, and provide a robust alternative to proprietary models. However, ensuring responsible use, addressing ethical concerns, and maintaining high standards of data privacy and security remain crucial as these models become more widely adopted.

 

 Prediction accuracy is the first key metric, and more realistic benchmarking is essential in healthcare. For example, I think the common benchmarks based on licensing-exam Q&A questions are not sufficient for determining the diagnostic prediction ability of LLMs. How would you design new benchmarks that more closely resemble real-world clinical settings?

 

 ChatGPT

Designing new benchmarks that closely resemble real-world clinical settings for evaluating the diagnostic prediction ability of Large Language Models in healthcare involves several key considerations:

  1. Clinical Case Simulations: Develop a set of comprehensive clinical case simulations that include patient histories, symptoms, lab results, imaging, and treatment responses. These cases should represent a wide range of medical conditions, complexities, and patient demographics.

  2. Longitudinal Data: Incorporate longitudinal patient data to assess the model’s ability to track patient progress over time, adjust diagnoses, and recommend follow-up actions.

  3. Interdisciplinary Scenarios: Include scenarios that require interdisciplinary knowledge, such as cases involving comorbidities, to evaluate the model’s ability to integrate information from multiple medical specialties.

  4. Clinical Decision-Making Processes: Test the model’s ability to make diagnostic and therapeutic decisions, prioritizing steps based on clinical guidelines, and responding to changes in patient conditions.

  5. Expert Panel Review: Use a panel of medical experts to review and validate the model’s outputs, ensuring that the benchmarks reflect current clinical practices and standards.

  6. Real-World Data Integration: Incorporate anonymized real-world clinical data from electronic health records (EHRs) to evaluate the model’s performance in practical, everyday healthcare settings.

  7. Patient Safety and Ethical Considerations: Ensure that the benchmarks assess the model’s ability to avoid common pitfalls, such as misdiagnosis or inappropriate treatment recommendations, emphasizing patient safety and ethical decision-making.

  8. User Interaction and Feedback: Measure the model’s usability and the quality of its interactions with healthcare providers, including how well it integrates with existing clinical workflows and supports decision-making.
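To make the first item concrete, here is one minimal sketch of how such a clinical case simulation might be represented; all field names and example values are illustrative assumptions rather than a published schema.

```python
# Illustrative sketch: one possible data structure for a clinical case
# simulation used in benchmarking. All field names are assumptions.
from dataclasses import dataclass, field

@dataclass
class ClinicalCase:
    case_id: str
    demographics: dict          # e.g., {"age": 67, "sex": "F"}
    history: str                # narrative patient history
    symptoms: list[str]
    lab_results: dict           # test name -> value with units
    imaging_findings: str
    gold_diagnosis: str         # expert-panel validated label
    comorbidities: list[str] = field(default_factory=list)
    follow_up_notes: list[str] = field(default_factory=list)  # longitudinal data

case = ClinicalCase(
    case_id="demo-001",
    demographics={"age": 67, "sex": "F"},
    history="Two weeks of progressive exertional dyspnea and orthopnea.",
    symptoms=["dyspnea", "orthopnea", "bilateral leg swelling"],
    lab_results={"BNP": "1450 pg/mL", "creatinine": "1.4 mg/dL"},
    imaging_findings="Chest X-ray: cardiomegaly with pulmonary congestion.",
    gold_diagnosis="congestive heart failure",
    comorbidities=["hypertension", "type 2 diabetes"],
)
```

A shared structure like this makes it straightforward to render each case as a prompt and to score model predictions against the expert-validated label.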

 

 Great design. If a hospital can perform such benchmarking on the best open-source LLM, it can then identify the tasks that need improvement and target them when fine-tuning the LLM. How would you design experiments to fine-tune open-source LLMs to improve diagnostic prediction accuracy?

 

 ChatGPT

Designing experiments to fine-tune an open-source LLM for improving diagnostic prediction accuracy involves a systematic approach. Here’s a step-by-step guide:

  1. Data Collection and Preprocessing:

    • Curate a High-Quality Dataset: Collect a large, diverse, and representative dataset of clinical cases, including patient histories, symptoms, lab results, imaging, diagnoses, and treatment outcomes.

    • Data Cleaning and Annotation: Ensure the data is clean, well-annotated, and structured, with accurate labels for diagnosis and outcomes. Use domain experts to validate the annotations.

  2. Initial Benchmarking:

    • Baseline Performance: Evaluate the LLM’s current performance on the collected dataset using the newly designed benchmarks.

    • Identify Weaknesses: Analyze the results to identify specific areas where the model’s diagnostic predictions are weak or inaccurate.

  3. Domain-Specific Pretraining:

    • Medical Literature and Guidelines: Pretrain the LLM on a large corpus of medical literature, textbooks, and clinical guidelines to enhance its medical knowledge base.

    • Specialized Datasets: Use specialized medical datasets to further pretrain the model on specific areas of interest (e.g., oncology, cardiology).

  4. Supervised Fine-Tuning:

    • Task-Specific Fine-Tuning: Fine-tune the model on the curated clinical dataset, focusing on improving performance in identified weak areas.

    • Incremental Training: Use incremental training techniques to gradually improve the model’s performance, ensuring it doesn’t overfit to the training data.

  5. Cross-Validation and Hyperparameter Tuning:

    • Cross-Validation: Implement cross-validation to evaluate the model’s performance on different subsets of the data, ensuring robustness and generalizability.

    • Hyperparameter Optimization: Experiment with different hyperparameters (e.g., learning rate, batch size) to find the optimal settings for fine-tuning.

  6. Human-in-the-Loop Training:

    • Expert Feedback Loop: Involve healthcare professionals in the training process by having them review and provide feedback on the model’s predictions. Use this feedback to further refine the model.

    • Iterative Improvement: Implement an iterative process where the model is continuously refined based on expert feedback and additional data.

  7. Evaluation and Validation:

    • Rigorous Testing: Test the fine-tuned model on an independent validation dataset to ensure it performs well across different clinical scenarios.

    • Real-World Simulations: Conduct simulations of real-world clinical settings to evaluate the model’s practical utility and accuracy.

  8. Deployment and Monitoring:

    • Pilot Implementation: Deploy the model in a controlled clinical environment to monitor its performance in real-time.

    • Continuous Monitoring and Updates: Continuously monitor the model’s performance and update it with new data and feedback to maintain and improve its accuracy over time.
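As a minimal sketch of the supervised fine-tuning step (step 4), the snippet below uses Hugging Face transformers with parameter-efficient LoRA adapters via peft; the model name, dataset file, and hyperparameters are placeholders for illustration, not recommendations.

```python
# Minimal sketch: parameter-efficient (LoRA) supervised fine-tuning of an
# open-source LLM on clinical cases. Model name, file name, and
# hyperparameters are illustrative placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "meta-llama/Meta-Llama-3-8B"   # assumed; gated, needs access approval
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token   # Llama tokenizers lack a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Attach low-rank adapters so only a small fraction of weights is trained.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# Hypothetical JSONL training file: {"text": "<case narrative> Diagnosis: <label>"}
data = load_dataset("json", data_files="clinical_cases.jsonl")["train"]
data = data.map(lambda b: tokenizer(b["text"], truncation=True, max_length=1024),
                batched=True, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama3-dx-lora",
                           per_device_train_batch_size=2,
                           num_train_epochs=1,
                           learning_rate=2e-4),
    train_dataset=data,
    # mlm=False gives a causal-LM objective: labels mirror the padded input ids.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("llama3-dx-lora")     # saves adapter weights only
```

LoRA keeps the base model frozen, so a hospital can maintain small, auditable adapters per task instead of full model copies.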

 

 Deployment is where most traditional EHR-based ML models fail. I think new system-level changes are necessary to ensure successful deployment, and I believe a learning health system (LHS) is the most promising approach. How would you adopt the LHS design in the LLM deployment step?

 

 ChatGPT

Adopting the Learning Health System (LHS) design in the deployment of Large Language Models (LLMs) can help address the challenges faced by traditional EHR-based ML models. Here’s how to incorporate LHS principles into LLM deployment:

  1. Continuous Learning and Improvement:

    • Feedback Loop Integration: Establish a continuous feedback loop where the LLM’s performance is monitored in real-time, and healthcare providers can provide feedback on its outputs. This feedback is used to continuously refine and improve the model.

    • Data Collection and Analysis: Collect new data from real-world clinical interactions and use it to update the LLM. Analyze this data to identify patterns and areas for improvement.

  2. Stakeholder Engagement:

    • Multidisciplinary Collaboration: Engage a diverse group of stakeholders, including clinicians, data scientists, IT professionals, and patients, in the deployment process to ensure the system meets the needs of all users.

    • Education and Training: Provide ongoing training and education for healthcare providers on how to effectively use the LLM and interpret its outputs.

  3. Seamless Integration with EHRs:

    • Interoperability Standards: Ensure the LLM can seamlessly integrate with existing EHR systems by adhering to interoperability standards such as HL7 FHIR.

    • Workflow Integration: Design the LLM to fit naturally into existing clinical workflows, minimizing disruption and maximizing adoption.

  4. Data Privacy and Security:

    • Robust Security Measures: Implement robust data privacy and security measures to protect patient information. Ensure compliance with regulations such as HIPAA.

    • Anonymization and Encryption: Use data anonymization and encryption techniques to safeguard sensitive information during data transfer and storage.

  5. Performance Monitoring and Evaluation:

    • Real-Time Monitoring: Continuously monitor the LLM’s performance in real-time using metrics such as accuracy, response time, and user satisfaction.

    • Regular Audits: Conduct regular audits and evaluations to ensure the model’s predictions remain accurate and reliable over time.

  6. Adaptability and Scalability:

    • Scalable Infrastructure: Deploy the LLM on a scalable infrastructure that can handle varying loads and accommodate future growth.

    • Flexible Architecture: Design the system architecture to be flexible, allowing for easy updates and integration of new features or improvements.

  7. Patient-Centric Approach:

    • Patient Involvement: Involve patients in the development and deployment process to ensure their needs and concerns are addressed.

    • Transparency and Communication: Maintain transparency with patients about how their data is being used and how the LLM’s predictions are made. Provide clear communication about the benefits and limitations of the system.
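As one minimal illustration of the continuous feedback loop (item 1), the sketch below logs each model prediction together with the clinician's verdict, so the accumulated records can drive periodic re-benchmarking and fine-tuning; the record schema and storage are assumptions.

```python
# Minimal sketch of an LHS-style feedback loop: every prediction is logged
# with the clinician's verdict, and the accumulated log feeds periodic
# re-evaluation and retraining. Schema and storage are illustrative.
import json
import time
from pathlib import Path

LOG = Path("lhs_feedback_log.jsonl")   # hypothetical append-only store

def log_prediction(case_id: str, model_version: str,
                   prediction: str, clinician_verdict: str,
                   final_diagnosis: str | None = None) -> None:
    """Append one prediction-feedback record for later learning cycles."""
    record = {
        "timestamp": time.time(),
        "case_id": case_id,               # de-identified case reference
        "model_version": model_version,
        "prediction": prediction,
        "clinician_verdict": clinician_verdict,  # "accept" | "reject" | "modify"
        "final_diagnosis": final_diagnosis,
    }
    with LOG.open("a") as f:
        f.write(json.dumps(record) + "\n")

def rolling_acceptance() -> float:
    """Recompute live acceptance rate; a drop can trigger retraining."""
    records = [json.loads(l) for l in LOG.open()] if LOG.exists() else []
    if not records:
        return float("nan")
    accepted = sum(r["clinician_verdict"] == "accept" for r in records)
    return accepted / len(records)

log_prediction("demo-001", "llama3-dx-lora-v1",
               "congestive heart failure", "accept")
print(f"Rolling acceptance rate: {rolling_acceptance():.1%}")
```

In a real LHS unit, this log would live in a governed clinical data store, and retraining would be gated by the monitoring and audit steps described above.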

 

 

~ the end ~