LLM GenAI Capable of Self-Taught Evaluation

ELHS Newsletter 2024-10-02

Dear Friends,

Last month was very productive in academia, with many new GenAI studies published, including:

About 20% of primary care doctors in a UK survey reported using GenAI tools.
An LLM accurately assessed the clinical acuity of adult patients during ED visits.
A foundation model for pathology outperformed the existing best cancer diagnosis models.
A GenAI chatbot successfully delivered cancer genetic services information to patients.

AI Auto-correction Project Funding

ARPA-H has announced a new funding opportunity through the Performance and Reliability Evaluation for Continuous Modifications and Useability of Artificial Intelligence (PRECISE-AI) program. PRECISE-AI has very ambitious goals: to develop capabilities that can detect when an AI-enabled tool used in real-world clinical care settings is potentially out of alignment with its underlying training data and, importantly, auto-correct these AI-enabled tools to maintain their peak performance.

Self-Taught LLM Evaluators

To reduce the cost of model-based evaluation for LLM development, Meta has developed an approach aimed at improving evaluators without human annotations, using synthetic training data only. Without any labeled preference data, Meta’s Self-Taught Evaluator can improve a strong LLM (Llama3-70B-Instruct) from 75.4 to 88.3 on RewardBench. It outperforms commonly used LLM judges such as GPT-4 and matches the performance of the top-performing reward models trained with labeled examples.
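The newsletter does not describe the mechanics, so the following is only a rough sketch of one way preference data can be built without human labels, not Meta's implementation. It assumes that a response written for a slightly altered instruction is a worse answer to the original instruction, so the preferred side of each pair is known by construction, and that only judge outputs agreeing with that known preference are kept as training examples. The names `make_synthetic_pair`, `filter_judgments`, `respond`, `perturb`, and `judge` are hypothetical stand-ins; in practice the last three would be LLM calls.

```python
def make_synthetic_pair(instruction, respond, perturb):
    """Build a preference pair whose better side is known by construction."""
    chosen = respond(instruction)
    # Assumption: an answer to a perturbed instruction is a worse answer
    # to the original instruction.
    rejected = respond(perturb(instruction))
    return {"instruction": instruction, "chosen": chosen, "rejected": rejected}


def filter_judgments(pair, judge, samples):
    """Sample the judge several times; keep reasoning that picks response A."""
    kept = []
    for _ in range(samples):
        reasoning, verdict = judge(pair["instruction"], pair["chosen"], pair["rejected"])
        if verdict == "A":  # "A" is the response known to be better
            kept.append(reasoning)
    return kept


# Toy stand-ins for the LLM calls (hypothetical, deterministic).
def respond(text):
    return f"Answer to: {text}"


def perturb(text):
    return text + " (asked about a different drug)"


_verdicts = iter(["A", "B", "A"])


def judge(instruction, response_a, response_b):
    verdict = next(_verdicts)
    return f"Response {verdict} follows the instruction more closely.", verdict


pair = make_synthetic_pair("List common side effects of aspirin.", respond, perturb)
kept = filter_judgments(pair, judge, samples=3)
```

The retained reasoning traces would then serve as synthetic training data for the next round of evaluator fine-tuning. A real pipeline would also randomize which response is shown first to limit position bias.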

LLM Reflection Fine-tuning

Hugging Face has showcased a new open-source LLM, Reflection Llama-3.1 70B, trained with a new technique called Reflection-Tuning, which teaches an LLM to detect mistakes in its reasoning and correct its course. In Reflection-Tuning, the LLM is trained on synthetic, structured data to learn reasoning and self-correction. Although its initial benchmarking showed improvement over OpenAI GPT-4o, it remains to be seen whether it can outperform GPT-4o on all benchmarks.

Explaining LLM

It has been challenging to explain how an LLM works in response to an input. Previously, the Anthropic team reported how to interpret a single layer of their neural networks or a small network. Now, Google has released a system called Gemma Scope that can illuminate how each layer in Gemma 2 large language models responds to a given input token. Gemma Scope has a suite of sparse autoencoders (SAEs) trained on all layers of the Gemma 2 2B and 9B models. An SAE is an unsupervised method for learning a sparse decomposition of a neural network’s latent representations into seemingly interpretable features. Gemma Scope’s open SAE weights and tutorials will help make safety and interpretability research easier. You can play with an interactive demo.
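To make the idea of a sparse decomposition concrete, here is a minimal, generic sparse autoencoder sketch in plain Python. It is not Gemma Scope's architecture or code: the ReLU encoder, the L1 sparsity penalty, the hand-picked toy weights, and all function names are assumptions chosen for illustration only.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode an activation into non-negative features, then reconstruct it."""
    pre_activation = [a + b for a, b in zip(matvec(w_enc, x), b_enc)]
    features = [max(0.0, value) for value in pre_activation]  # ReLU
    recon = [a + b for a, b in zip(matvec(w_dec, features), b_dec)]
    return features, recon


def sae_loss(x, features, recon, l1_coeff=0.01):
    """Reconstruction error plus an L1 penalty that encourages few active features."""
    mse = sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)
    sparsity = sum(abs(f) for f in features)
    return mse + l1_coeff * sparsity


# Toy example: a 2-dimensional "activation" decomposed into 4 candidate features.
x = [1.0, -1.0]
w_enc = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
w_dec = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]
b_dec = [0.0, 0.0]

features, recon = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
loss = sae_loss(x, features, recon)
```

In this toy case only two of the four features are active, and the input is reconstructed exactly from them. A trained SAE learns the encoder and decoder weights from many activations so that each active feature tends to correspond to a recognizable pattern.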

Keep reading the latest papers below and enjoy my conversations with ChatGPT about self-taught AI.

Warm regards,

AJ Chen, PhD | ELHS Institute | https://elhsi.org

Source: ARPA-H

From Page Mill

Blease CR, Locher C, Gaab J, Hägglund M, Mandl KD. Generative artificial intelligence in primary care: an online survey of UK general practitioners. BMJ Health Care Inform. 2024 Sep 17;31(1):e101102. doi: 10.1136/bmjhci-2024-101102.

[2024/9] A UK survey showed that about 20% of primary care doctors are using generative AI tools in clinical practice. The top applications are writing documentation after patient appointments and assisting with differential diagnosis.

Williams CYK, Zack T, Miao BY, et al. Use of a Large Language Model to Assess Clinical Acuity of Adults in the Emergency Department. JAMA Netw Open. 2024;7(5):e248895. doi:10.1001/jamanetworkopen.2024.8895

[2024/5] This cross-sectional study of 251,401 adult ED visits investigated the potential for an LLM to classify acuity levels of patients in the ED based on the Emergency Severity Index across 10,000 patient pairs. The LLM demonstrated an accuracy of 89% and was comparable with human physician classification in a 500-pair subsample.

Wang, X., Zhao, J., Marostica, E. et al. A pathology foundation model for cancer diagnosis and prognosis prediction. Nature (2024). doi: 10.1038/s41586-024-07894-z

[2024/9] We devised the Clinical Histopathology Imaging Evaluation Foundation (CHIEF) model, a general-purpose weakly supervised machine learning framework to extract pathology imaging features for systematic cancer evaluation. We successfully validated CHIEF using 19,491 whole-slide images from 32 independent slide sets collected from 24 hospitals and cohorts internationally. Overall, CHIEF outperformed the state-of-the-art deep learning methods by up to 36.1%, showing its ability to address domain shifts observed in samples from diverse populations and processed by different slide preparation methods. CHIEF provides a generalizable foundation for efficient digital pathology evaluation for patients with cancer.

Zimolzak AJ, Wei L, Mir U, et al. Machine learning to enhance electronic detection of diagnostic errors. JAMA Netw Open. 2024;7(9):e2431982. doi:10.1001/jamanetworkopen.2024.31982

[2024/9] Based on expert input and existing frameworks, we designed rules-based e-triggers to find possible missed opportunities in diagnosis (MODs) in emergency departments (EDs). Using Veterans Affairs national EHR data covering more than 20 million unique individuals, we identified 2 high-risk cohorts: (1) patients with stroke risk factors discharged from ED after presenting with dizziness or vertigo who were subsequently hospitalized for stroke or TIA within 30 days; and (2) patients discharged from ED with abdominal pain and abnormal temperature who were subsequently hospitalized within 10 days. The best-performing machine learning algorithm achieved a positive predictive value for MODs of 92% in the stroke cohort and 93% among the patients with abdominal pain and fever.

Kaphingst KA, Kohlmann WK, Lorenz Chambers R, et al. Uptake of Cancer Genetic Services for Chatbot vs Standard-of-Care Delivery Models: The BRIDGE Randomized Clinical Trial. JAMA Netw Open. 2024;7(9):e2432143. doi:10.1001/jamanetworkopen.2024.32143

[2024/9] Despite this trial’s limitations, its findings advance the state of the science in delivery of cancer genetic services to unaffected patients meeting criteria for genetic evaluation. By comparing chatbot service delivery vs SOC using a multisite RCT design, the findings of the BRIDGE equivalence trial support the use of chatbot approaches to offer cancer genetic services. The findings show that use of chatbots to deliver pretest genetic services has strong potential to increase access to these services for unaffected patients. These trial findings therefore support the implementation of systemwide population health management strategies to deliver cancer genetic services.

Liu T, Hetherington TC, Stephens C, et al. AI-Powered Clinical Documentation and Clinicians’ Electronic Health Record Experience: A Nonrandomized Clinical Trial. JAMA Netw Open. 2024;7(9):e2432460. doi:10.1001/jamanetworkopen.2024.32460

[2024/9] Approximately half of clinicians using the AI-powered clinical documentation tool based on interest reported a positive outcome, potentially reducing burnout. However, a significant subset did not find time-saving benefits or improved EHR experience. Study limitations include potential selection bias and recall bias in both groups. Further research is needed to identify opportunities for improvement and understand the impact on different clinician subsets and health systems.

Pfohl SR, Cole-Lewis H, Sayres R, et al. A toolbox for surfacing health equity harms and biases in large language models. Nat Med. 2024 Sep 23. doi: 10.1038/s41591-024-03258-2.

[2024/9] Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity. Through our empirical study, we find that our approach surfaces biases that may be missed by narrower evaluation approaches. Our experience underscores the importance of using diverse assessment methodologies and involving raters of varying backgrounds and expertise.

In the Eyes of GPTs

(AJ in conversation with ChatGPT to understand how machines view human healthcare)

Meta recently reported a self-taught evaluator for LLMs. What does that mean?

Meta recently reported a self-taught evaluator for LLMs, which means that they have developed a system where the evaluation of a large language model (LLM) can be performed without human-annotated data. Instead of relying on human feedback or labeled data, the self-taught evaluator uses synthetic data to judge and improve the model's performance. This approach enables the LLM to assess itself and refine its capabilities autonomously, which reduces the need for expensive and time-consuming human annotations.

Since evaluation can be self-taught, is it possible for an LLM to autonomously improve its performance during the development process?

Yes, if evaluation can be self-taught, it is possible for an LLM to autonomously improve its performance during the development process. By using techniques like synthetic data generation, self-evaluation, and reinforcement learning, LLMs can identify their own mistakes and weaknesses, then adjust and retrain themselves accordingly. This creates a feedback loop where the model continuously refines its abilities without relying on human intervention. Over time, this autonomous process can lead to significant performance improvements.

Wow, amazing! How does it work? Please provide a detailed procedure.

The process of an LLM autonomously improving its performance works through a series of steps that create a feedback loop of learning and self-improvement. Here’s a detailed procedure:

1. Initial Training

The LLM is first trained on a large dataset using traditional supervised or self-supervised learning methods. During this phase, the model learns to generate responses based on the data it has been exposed to.

2. Synthetic Data Generation

Once the initial training is complete, the LLM generates its own synthetic data. This data could be based on tasks or problems that it hasn’t been trained on, or variations of the data it has already seen. This step enables the model to explore new scenarios that may not have been present in the original dataset.

3. Self-Evaluation

The model then evaluates its own performance using a self-taught evaluator. In this phase, the LLM uses metrics or learned scoring systems to assess the quality of its responses. For example, it can compare its outputs against expected patterns, logical coherence, or previously generated examples.

4. Reinforcement Learning

Based on the evaluation, the LLM engages in a form of reinforcement learning where it rewards or penalizes its outputs. This step helps the model learn what constitutes a better response. In some cases, the model might identify flaws in its reasoning and adjust accordingly.

5. Self-Correction and Fine-Tuning

The LLM fine-tunes itself based on the feedback from the evaluation phase. This involves adjusting its internal weights and parameters to improve future responses. This process may occur repeatedly, allowing the model to gradually reduce errors and refine its abilities.

6. Continuous Self-Learning

The LLM continues this cycle of synthetic data generation, self-evaluation, reinforcement learning, and fine-tuning. Over time, it can autonomously discover weaknesses in its performance and improve without human intervention.

7. Validation

Finally, the model can be validated on real-world data or test cases to ensure that its autonomous improvements translate into practical, real-world performance enhancements.
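The cycle described in steps 2 through 7 can be sketched as a small loop. This is a toy illustration under strong assumptions, not how any LLM is actually trained: the "model" is a single number, the task is guessing the square root of 2, and the self-evaluator scores a guess by checking how close its square is to 2, which needs no labeled answer. All names (`self_improve`, `propose`, `score`) are hypothetical.

```python
from typing import Callable, List


def self_improve(
    model: float,
    propose: Callable[[float], List[float]],
    score: Callable[[float], float],
    rounds: int,
) -> float:
    """Repeat: generate candidates, self-evaluate, keep the best if it improves."""
    for _ in range(rounds):
        candidates = propose(model)        # step 2: synthetic data generation
        best = max(candidates, key=score)  # steps 3-4: self-evaluation and reward
        if score(best) > score(model):     # step 5: update only when it helps
            model = best
    return model                           # step 6 is the loop itself


def propose(guess: float) -> List[float]:
    """Generate variations of the current guess."""
    return [guess + delta for delta in (-0.1, -0.01, 0.01, 0.1)]


def score(guess: float) -> float:
    """Self-check without labels: how close is guess squared to 2?"""
    return -abs(guess * guess - 2.0)


start = 1.0
improved = self_improve(start, propose, score, rounds=50)

# Step 7: validation against a reference the loop never saw.
validation_error = abs(improved - 2.0 ** 0.5)
```

The loop stops improving once no proposed variation scores higher than the current guess, which shows the main caveat of such schemes: the result can only be as good as the self-evaluator and the candidate generator allow, so the final external validation step remains essential.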

Have there been any published studies on autonomous LLM development?

Yes, there have been several published studies on autonomous LLM development. These studies explore techniques like self-learning, self-evaluation, reinforcement learning, and self-supervised learning, which allow LLMs to improve their performance with minimal human intervention. For example, Meta has reported on their self-taught evaluator, which enables LLMs to evaluate themselves using synthetic data. These advancements are part of a growing body of research aimed at creating more autonomous AI systems.

Some key research areas include:

Self-taught evaluators for autonomous LLM performance evaluation.
Reinforcement learning without human feedback, enabling LLMs to learn from their own outputs.
Self-supervised learning, where LLMs use unlabeled data to continuously improve.

Researchers from organizations like OpenAI, Meta, and Google have contributed to this field, and many papers are available on platforms like arXiv.

GenAI/LHS Toolbox

ELHS GenAI Copilot Platform: The new copilot platform makes it easy for medical students and healthcare professionals to start their healthcare GenAI journey. They can use a free multi-LLM copilot, including GPT-4o, Gemini 1.5 Pro, and Llama 3.1; try a free personal open-source LLM and chatbot; and deploy and fine-tune their own open-source LLMs and copilots.
Learning to Reason with LLMs: OpenAI has introduced OpenAI o1, a new large language model trained with reinforcement learning to perform complex reasoning. o1 thinks before it answers. OpenAI o1 exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA).
Meta Llama 3.2: Meta has released Llama 3.2, which includes small and medium-sized vision LLMs (11B and 90B) and lightweight, text-only models (1B and 3B) that fit onto edge and mobile devices, including pre-trained and instruction-tuned versions.
Qwen2.5 Chatbot demo: Qwen models are pre-trained on multilingual data covering various industries and domains, with Qwen-72B being trained on an astounding 3 trillion tokens of data. Capabilities include multimodal understanding and generation, plus API access.