
Systematic Benchmarking of LLMs and Continuous Monitoring of AI Performance


ELHS Newsletter 2024-05-29

Paving the Way for Global Health Equity with GenAI, ML, Data, and LHS (Learning Health Systems)



Dear Friends,


Nature News reported an incident on the X platform that triggered a debate about modern scientific research between Elon Musk and thousands of X users. It started with AI pioneer Yann LeCun criticizing this @elonmusk post on May 27: “Join xAI if you believe in our mission of understanding the universe, which requires maximally rigorous pursuit of the truth, without regard to popularity or political correctness.” I’m not sure any new insight can result from this kind of social media clash. However, it reminds me that, in this historic transition, it is crucial to rethink how we pursue the truth in the new world of generative AI. In particular, because general-purpose LLMs like ChatGPT are not specifically designed to produce the predictive outcomes seen in healthcare, such as diagnostic predictions, what constitutes truth becomes an open question.


While it may take time for human scientists to answer this fundamental question, I am curious what ChatGPT thinks about it. ChatGPT’s answers in the conversation below might surprise you. We may need a paradigm shift from seeking deterministic truth to seeking probabilistic truth.


I think the machine gives us an interesting idea. For healthcare, this paradigm shift requires systematically benchmarking LLM-based generative AI chatbots on all healthcare tasks that may benefit from GenAI, an essential first step toward the democratization of healthcare AI. Recognizing this shift early, we have been systematically benchmarking symptom checking and diagnostic prediction for large numbers of diseases across different specialties. The overall accuracy of predictions per specialty is reported on the Healthcare GenAI Scoreboards of our ELHS Copilot learning tool. The initial scoreboards cover two specialties: oncology and neurology. The unprecedentedly high diagnostic prediction accuracies achieved by OpenAI ChatGPT, Google Gemini, and Baidu Ernie qualify these chatbots for further comparative effectiveness research on the collaboration of GenAI and doctors in clinical training and decision-making.


Besides systematic benchmarking, continuous validation and monitoring of AI models are also essential to AI applications in healthcare. One great piece of news: UCSF has obtained a $5M gift to build a real-time, automated AI monitoring platform for clinical care. This effort is co-led by LHS expert Dr. Julia Adler-Milstein at UCSF.


As reported by Healthcare IT News, Epic is delivering another good piece of news to the medical AI community: in its efforts to democratize health AI validation, Epic will provide new open-source tools that let healthcare organizations validate any artificial intelligence model and monitor its ongoing performance.


Enjoy reading the latest developments and my conversations with ChatGPT below. 


Warm regards,


AJ Chen, PhD | ELHS Institute | web: | email:



Paradigm shift: deterministic truth to probabilistic truth


From Page Mill

Published papers, recent news, and significant events in a coherent narrative for the main topic. 


Lin CS, Liu WT, Tsai DJ, et al. AI-enabled electrocardiography alert intervention and all-cause mortality: a pragmatic randomized clinical trial. Nat Med. 2024 May;30(5):1461-1470. doi:10.1038/s41591-024-02961-4.

[2024/5] The trial met its primary outcome, finding that implementation of the AI-ECG alert was associated with a significant reduction in all-cause mortality within 90 days: 3.6% of patients in the intervention group died within 90 days, compared to 4.3% in the control group (hazard ratio (HR) = 0.83, 95% confidence interval (CI) = 0.70–0.99). A prespecified analysis showed that the reduction in all-cause mortality associated with the AI-ECG alert was observed primarily in patients with high-risk ECGs (HR = 0.69, 95% CI = 0.53–0.90).


Barnett AJ, Jade A, Guo Z, et al. Improving Clinician Performance in Classifying EEG Patterns on the Ictal–Interictal Injury Continuum Using Interpretable Machine Learning. NEJM AI 2024;1(6). DOI:10.1056/AIoa2300331.

[2024/5] We developed an interpretable deep-learning system that accurately classifies six patterns of potentially harmful EEG activity — seizure, lateralized periodic discharges (LPDs), generalized periodic discharges (GPDs), lateralized rhythmic delta activity (LRDA), generalized rhythmic delta activity (GRDA), and other patterns — while providing faithful case-based explanations of its predictions. The performance of all users significantly improved when provided with AI assistance. Mean user diagnostic accuracy improved from 47 to 71% (P<0.04). The model achieved AUROCs of 0.87, 0.93, 0.96, 0.92, 0.93, and 0.80 for the classes seizure, LPD, GPD, LRDA, GRDA, and other patterns, respectively.


Ferber D, Wiest IC, Wölflein G, et al. GPT-4 for Information Retrieval and Comparison of Medical Oncology Guidelines. NEJM AI 2024;1(6). DOI: 10.1056/AIcs2300235

[2024/5] With the aim of improving oncologists’ efficiency and supporting their adherence to the most recent treatment recommendations, we evaluated the use of the large language model generative pretrained transformer 4 (GPT-4) to interpret guidelines. We showed that GPT-4, when enhanced with additional clinical information through RAG, can accurately identify detailed similarities and disparities in diagnostic and treatment proposals across different authoritative sources. 


Huang, J., Yang, D.M., Rong, R. et al. A critical assessment of using ChatGPT for extracting structured data from clinical notes. npj Digit. Med. 7, 106 (2024).

[2024/3] Our study shows the feasibility of using ChatGPT to process large volumes of clinical notes for structured information extraction without requiring extensive task-specific human annotation and model training. The results underscore the potential role of LLMs in transforming unstructured healthcare data into structured formats, thereby supporting research and aiding clinical decision-making.


GPTs are becoming an integral part of our society. AJ talks with GenAI to understand how machines view the main topics.


 What is Science?



Science is a systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions about the universe. It involves observing, identifying, describing, investigating, and theoretically explaining phenomena. Here are some key aspects of science:

  1. Empirical Evidence: Science relies on empirical evidence, which is data obtained through observations and experiments.

  2. Hypotheses and Theories: Scientists develop hypotheses (testable predictions) and theories (well-substantiated explanations) to explain natural phenomena.

  3. Experimentation: Scientific knowledge is often acquired through controlled experiments that test hypotheses.

  4. Reproducibility: Scientific findings must be reproducible by others following the same methodology to ensure validity.

  5. Peer Review: Scientific research is often subjected to peer review, where other experts evaluate the methods, results, and conclusions before publication.

  6. Scientific Method: The process of science typically involves making observations, forming hypotheses, conducting experiments, analyzing data, and drawing conclusions.

  7. Progressive Nature: Science is progressive, meaning it builds upon previous knowledge and is always subject to revision with new evidence.


 Great summary. Thanks. In traditional science, truth-seeking is usually deterministic: once a rule has been discovered, following it should produce an expected and verifiable result. However, in the new world of generative AI, LLMs are designed to understand human languages from available content, but they can generate and output results that seem unlikely but are true. For example, a common misconception about ChatGPT is that it cannot make certain medical predictions because the required medical data are not in the training data. But ChatGPT has proven many of these prejudgments wrong. This raises a basic scientific research question: How do we study the truth in generative AI? I'd like to hear your thoughts.


 Studying the truth in generative AI, especially large language models (LLMs) like ChatGPT, presents a unique challenge because these models do not follow the deterministic rules of traditional science. Instead, they rely on probabilistic methods and vast amounts of data to generate responses that can sometimes be unexpected but still valid. Here are some thoughts on how to approach this complex task:

  1. Epistemological Shift: Recognize the fundamental shift from deterministic to probabilistic truth. Traditional scientific methods rely on reproducibility and verifiability through deterministic rules. Generative AI, however, operates on patterns and probabilities derived from large datasets. This means embracing a new understanding of how truth can be discerned in AI outputs.

  2. Data and Model Transparency: Ensure transparency in the data used for training and the model architecture. Understanding the sources of data and the mechanisms by which the model learns can provide insights into the origins of its outputs. Transparency helps in assessing the reliability and biases present in the AI system.

  3. Validation through Empirical Testing: Employ rigorous empirical testing of AI-generated hypotheses and predictions. While LLMs can generate novel and seemingly unlikely outputs, these need to be validated through real-world experiments and observations. This step bridges the gap between AI-generated knowledge and traditional scientific verification.

  4. Cross-Disciplinary Collaboration: Engage experts from various fields such as computer science, linguistics, philosophy, and domain-specific experts (e.g., medical professionals). This multidisciplinary approach helps in framing the right questions, interpreting AI outputs accurately, and designing appropriate validation studies.

  5. Continuous Feedback Loop: Implement a continuous feedback loop where AI-generated outputs are tested, validated, and fed back into the model. This iterative process can help refine the AI’s accuracy and reliability over time, improving its ability to generate true and useful insights.

  6. Ethical and Societal Considerations: Consider the ethical implications and societal impact of AI-generated knowledge. The truth in AI is not just a technical issue but also a social one. Ensuring that AI systems are used responsibly and their outputs are interpreted with caution is crucial.

  7. Philosophical Inquiry: Engage in philosophical inquiry to explore the nature of truth in the context of AI. Questions about the nature of knowledge, belief, and justification in AI-generated content can provide deeper insights into how we understand and validate the truth in this new paradigm.

  8. Benchmarking and Standardization: Develop benchmarks and standardized tests for evaluating the truthfulness and reliability of AI outputs. This includes creating datasets specifically designed to test the limits and capabilities of generative AI in producing accurate and reliable information.


 I appreciate your comprehensive view of this fundamental shift from deterministic to probabilistic truth. In the context of healthcare, this means it is essential to have the following capabilities in place for healthcare AI scientific research: systematic benchmarking for all healthcare tasks, real-time validation of LLM changes, and continuous monitoring of GenAI performance. What do you think?


 I completely agree with your perspective. In the context of healthcare, adopting a systematic and rigorous approach to evaluating and monitoring AI systems is crucial to ensure safety, efficacy, and reliability. Here are some detailed thoughts on the capabilities you mentioned:


  1. Systematic Benchmarking for All Healthcare Tasks:

  • Comprehensive Evaluation: Develop and maintain a comprehensive set of benchmarks for various healthcare tasks such as diagnosis, treatment recommendations, patient management, and administrative tasks.

  • Standardized Metrics: Use standardized metrics to assess the performance of AI models. This includes accuracy, sensitivity, specificity, precision, recall, F1 score, and other relevant metrics depending on the task.

  • Diverse Data Sets: Ensure that the benchmarking datasets are diverse and representative of different populations, medical conditions, and clinical settings to avoid biases and ensure generalizability.

  • Publicly Available Benchmarks: Where possible, make these benchmarks publicly available to promote transparency and allow for independent validation by the research community.
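
As a concrete illustration of the standardized metrics listed above, here is a minimal Python sketch that scores binary predictions (1 = condition present, 0 = absent). The names `BinaryMetrics` and `binary_metrics` and the eight example cases are hypothetical and made up for illustration; they are not taken from any real benchmark. Note that sensitivity and recall are the same quantity.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class BinaryMetrics:
    accuracy: float
    sensitivity: float  # identical to recall
    specificity: float
    precision: float
    f1: float


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> BinaryMetrics:
    """Score binary predictions, where 1 means the condition is present."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    # Count the four cells of the confusion matrix.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    sensitivity = _ratio(tp, tp + fn)
    precision = _ratio(tp, tp + fp)
    return BinaryMetrics(
        accuracy=_ratio(tp + tn, len(pairs)),
        sensitivity=sensitivity,
        specificity=_ratio(tn, tn + fp),
        precision=precision,
        f1=_ratio(2 * precision * sensitivity, precision + sensitivity),
    )


# Hypothetical benchmark of 8 cases (illustrative labels only).
gold_labels = [1, 1, 1, 0, 0, 0, 0, 1]
chatbot_labels = [1, 1, 0, 0, 0, 1, 0, 1]
print(binary_metrics(gold_labels, chatbot_labels))
```

For multi-disease diagnostic prediction, one reasonable choice is to compute these metrics per disease and then average them per specialty, but that aggregation is a design decision and is not shown here.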


  2. Real-Time Validation of LLM Changes:

  • Continuous Testing: Implement continuous testing and validation pipelines that automatically evaluate the performance of large language models (LLMs) as they are updated or retrained.

  • Alert Systems: Develop alert systems to notify researchers and clinicians of any significant changes in performance, especially drops in accuracy or the emergence of new biases.

  • A/B Testing: Use A/B testing methodologies to compare the performance of new model versions against previous ones, ensuring that changes lead to improvements or at least do not degrade performance.

  • Clinical Simulation Environments: Create simulated clinical environments where new model versions can be tested safely before deployment in real-world settings.
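
The continuous testing, alerting, and version comparison ideas above can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions: the function names `accuracy` and `check_model_update`, the `max_drop` threshold of 0.02, and the diagnosis labels are all hypothetical. It assumes both model versions have been run on the same fixed benchmark cases.

```python
from typing import Sequence


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of cases where the predicted label matches the gold label."""
    if len(gold) != len(predicted) or len(gold) == 0:
        raise ValueError("gold and predicted must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def check_model_update(
    gold: Sequence[str],
    baseline_preds: Sequence[str],
    candidate_preds: Sequence[str],
    max_drop: float = 0.02,  # assumed tolerance; choose per task
) -> dict:
    """Compare a candidate model version against the baseline on one benchmark.

    The alert flag is raised when accuracy falls by more than max_drop.
    """
    baseline_acc = accuracy(gold, baseline_preds)
    candidate_acc = accuracy(gold, candidate_preds)
    delta = candidate_acc - baseline_acc
    return {
        "baseline_accuracy": baseline_acc,
        "candidate_accuracy": candidate_acc,
        "delta": delta,
        "alert": delta < -max_drop,
    }


# Hypothetical benchmark of five cases with made-up labels.
gold = ["migraine", "stroke", "epilepsy", "migraine", "stroke"]
old_version = ["migraine", "stroke", "epilepsy", "migraine", "epilepsy"]  # 4 of 5 correct
new_version = ["migraine", "stroke", "stroke", "migraine", "epilepsy"]    # 3 of 5 correct

report = check_model_update(gold, old_version, new_version)
if report["alert"]:
    print("Performance regression detected:", report)
```

A real pipeline would also need a statistical test on a much larger benchmark before raising an alert, since small accuracy differences on few cases can be noise.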


  3. Continuous Monitoring of GenAI Performance:

  • Real-Time Monitoring: Implement real-time monitoring systems to track the performance of generative AI models in production environments. This includes monitoring output quality, user interactions, and error rates.

  • Feedback Loops: Establish robust feedback mechanisms where clinicians can report issues or inaccuracies in AI outputs. Use this feedback to continuously improve the models.

  • Periodic Audits: Conduct periodic audits of AI performance to ensure ongoing compliance with regulatory standards and clinical guidelines.

  • Adaptive Learning: Implement adaptive learning systems that allow the AI to learn and improve from new data and user interactions continuously.
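
One simple way to picture the real-time monitoring and feedback loop described above is a rolling window over clinician feedback. The Python sketch below is an assumption-laden illustration, not a description of any existing platform: the class name `RollingAccuracyMonitor`, the window size, and the 0.75 threshold are hypothetical.

```python
from collections import deque
from typing import Optional


class RollingAccuracyMonitor:
    """Track accuracy over the most recent clinician-reviewed outputs."""

    def __init__(self, window_size: int = 100, min_accuracy: float = 0.8) -> None:
        if window_size < 1:
            raise ValueError("window_size must be at least 1")
        # deque(maxlen=...) discards the oldest outcome automatically.
        self.outcomes = deque(maxlen=window_size)
        self.min_accuracy = min_accuracy

    def record(self, correct: bool) -> None:
        """Store whether a reviewed AI output was judged correct."""
        self.outcomes.append(bool(correct))

    @property
    def accuracy(self) -> Optional[float]:
        """Accuracy over the current window, or None before any feedback."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_review(self) -> bool:
        """True once the window is full and accuracy is below the threshold."""
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy < self.min_accuracy


# Hypothetical stream of clinician feedback (True = output judged correct).
monitor = RollingAccuracyMonitor(window_size=4, min_accuracy=0.75)
for judged_correct in [True, True, True, False, False]:
    monitor.record(judged_correct)
    if monitor.needs_review():
        print("Accuracy below threshold:", monitor.accuracy)
```

Waiting for a full window before flagging avoids alarms triggered by the first few reviews; the trade-off is a slower reaction right after deployment.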



~ the end ~