Machine Psychology

 

ELHS Newsletter 2024-06-29

Paving the Way for Global Health Equity with GenAI, ML, Data, and LHS (Learning Health System)

Subscribe | All Issues

 

Dear Friends,

 

Does generative AI (GenAI) have a mind? A new study compared GPTs and humans on a series of theory of mind tasks. The results were striking: GPT-4 models performed at, or sometimes above, human levels at identifying indirect requests, false beliefs, and misdirection, but struggled with detecting faux pas. This is not entirely unexpected, as we have seen LLM-based chatbots exhibit some human-like behaviors, including apparent logic and reasoning, when interacting with humans. What does this mean? I asked ChatGPT to explain the research results for us.

 

Other recent studies included here are: a pathology framework from Prof. James Zou’s lab at Stanford University showing that AI-human collaboration improves diagnostic accuracy; GPT-4 enhancing clinical trial screening performance; prompt engineering with an open-source LLM achieving strong accuracy on medical Q&A benchmarks; a review of how GenAI is transforming public health; and a method for detecting hallucinations in LLMs.

 

Hope you enjoy reading these latest developments and my conversations with ChatGPT below.

 

Warm regards,

AJ

AJ Chen, PhD | ELHS Institute | web: elhsi.org
 

~

 

From Page Mill

Published papers, recent news, and significant events, presented in a coherent narrative around the main topic.

 

Strachan, J.W.A., Albergo, D., Borghini, G. et al. Testing theory of mind in large language models and humans. Nat Hum Behav (2024).

[2024/5] We tested two families of LLMs (GPT and LLaMA2) repeatedly against these measures and compared their performance with those from a sample of 1,907 human participants. Across the battery of theory of mind tests, we found that GPT-4 models performed at, or even sometimes above, human levels at identifying indirect requests, false beliefs and misdirection, but struggled with detecting faux pas. Faux pas, however, was the only test where LLaMA2 outperformed humans. These findings not only demonstrate that LLMs exhibit behaviour that is consistent with the outputs of mentalistic inference in humans but also highlight the importance of systematic testing to ensure a non-superficial comparison between human and artificial intelligences.

Also see news:  How does ChatGPT ‘think’? Psychology and neuroscience crack open AI large language models. Researchers are striving to reverse-engineer artificial intelligence and scan the ‘brains’ of LLMs to see what they are doing, how and why.
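
For readers curious about the shape of such an evaluation, here is a minimal sketch of scoring repeated LLM runs on a toy theory-of-mind item and comparing mean accuracy with human scores. The `ask_llm` stub, the item, and the human scores are placeholders, not the study's materials or protocol.

```python
# Minimal sketch (not the study's protocol): score repeated LLM runs on a toy
# theory-of-mind item and compare mean accuracy with hypothetical human scores.
# `ask_llm` is a placeholder for any chat-completion call (GPT-4, LLaMA 2, ...).
import random
from statistics import mean

def ask_llm(prompt: str) -> str:
    """Placeholder: a real implementation would call the model under test."""
    return random.choice(["She will look in the basket.", "She will look in the box."])

item = {
    "prompt": ("Sally puts her ball in the basket and leaves the room. "
               "Anne moves the ball to the box. Where will Sally look for her ball?"),
    "correct": "basket",
}

def run_once() -> float:
    """One pass over the (one-item) battery; returns fraction correct."""
    answer = ask_llm(item["prompt"]).lower()
    return float(item["correct"] in answer)

llm_scores = [run_once() for _ in range(15)]   # repeated runs, as in the study design
human_scores = [1.0, 1.0, 0.0, 1.0]            # hypothetical per-participant scores

print(f"LLM mean accuracy:   {mean(llm_scores):.2f}")
print(f"Human mean accuracy: {mean(human_scores):.2f}")
```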

 

Huang, Z., Yang, E., Shen, J. et al. A pathologist–AI collaboration framework for enhancing diagnostic accuracies and efficiencies. Nat. Biomed. Eng (2024).

[2024/6] In pathology, the deployment of artificial intelligence (AI) in clinical settings is constrained by limitations in data collection and in model transparency and interpretability. Here we describe a digital pathology framework, nuclei.io, that incorporates active learning and human-in-the-loop real-time feedback for the rapid creation of diverse datasets and models. We validate the effectiveness of the framework via two crossover user studies that leveraged collaboration between the AI and the pathologist, including the identification of plasma cells in endometrial biopsies and the detection of colorectal cancer metastasis in lymph nodes. In both studies, nuclei.io yielded considerable diagnostic performance improvements. Collaboration between clinicians and AI will aid digital pathology by enhancing accuracies and efficiencies.
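
The paper's human-in-the-loop, active-learning idea can be sketched roughly as below. This is not the nuclei.io implementation; it only illustrates the loop in which a model routes its least-confident cells to a pathologist for labeling and is retrained on the growing labeled set. The features, labels, and `pathologist_label` stub are hypothetical.

```python
# Human-in-the-loop active learning, illustrative only (not nuclei.io).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_unlabeled = rng.normal(size=(500, 16))           # hypothetical cell features
X_labeled = rng.normal(size=(20, 16))              # small seed set
y_labeled = (X_labeled[:, 0] > 0).astype(int)      # toy labels: 1 = plasma cell

def pathologist_label(x: np.ndarray) -> int:
    """Placeholder for real-time expert feedback in the slide viewer."""
    return int(x[0] > 0)

model = LogisticRegression().fit(X_labeled, y_labeled)
for _ in range(5):                                  # a few active-learning rounds
    proba = model.predict_proba(X_unlabeled)[:, 1]
    uncertain = np.argsort(np.abs(proba - 0.5))[:10]   # least-confident samples
    new_X = X_unlabeled[uncertain]
    new_y = np.array([pathologist_label(x) for x in new_X])
    X_labeled = np.vstack([X_labeled, new_X])
    y_labeled = np.concatenate([y_labeled, new_y])
    X_unlabeled = np.delete(X_unlabeled, uncertain, axis=0)
    model = LogisticRegression().fit(X_labeled, y_labeled)

print("labeled examples after 5 rounds:", len(y_labeled))
```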

 

Unlu O, et al. Retrieval-Augmented Generation–Enabled GPT-4 for Clinical Trial Screening. NEJM AI 2024;1(7). DOI: 10.1056/AIoa2400181.

[2024/6] Large language model–based solutions such as RECTIFIER can significantly enhance clinical trial screening performance and reduce costs by automating the screening process. However, integrating such technologies requires careful consideration of potential hazards and should include safeguards such as final clinician review. 
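
A rough sketch of one retrieval-augmented screening step in the spirit of RECTIFIER is shown below. The `embed`, `cosine`, and `chat` functions, the chart notes, and the criterion are illustrative placeholders rather than the published pipeline, and a real deployment would keep the final clinician review the authors recommend.

```python
# Retrieval-augmented eligibility screening, illustrative placeholders only.
from typing import List

def embed(text: str) -> List[float]:
    """Placeholder for an embedding API; here a trivial bag-of-characters vector."""
    return [text.lower().count(c) for c in "abcdefghijklmnopqrstuvwxyz"]

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

def chat(prompt: str) -> str:
    """Placeholder for a GPT-4 chat-completion call."""
    return "YES - the note documents symptomatic heart failure with EF 35%."

notes = [
    "2023-11-02: Patient reports dyspnea on exertion; EF 35% on echo.",
    "2023-12-10: Medication list updated; no known allergies.",
]
criterion = "Symptomatic heart failure with reduced ejection fraction"

# Retrieve the note most relevant to the criterion, then ask the LLM to decide.
ranked = sorted(notes, key=lambda n: cosine(embed(n), embed(criterion)), reverse=True)
prompt = (f"Criterion: {criterion}\nRelevant notes:\n- " + "\n- ".join(ranked[:1]) +
          "\nDoes the patient meet this criterion? Answer YES or NO and cite the note.")
print(chat(prompt))
```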


Maharjan, J., Garikipati, A., Singh, N.P. et al. OpenMedLM: prompt engineering can out-perform fine-tuning in medical question-answering with open-source large language models. Sci Rep 14, 14156 (2024).

[2024/6] We evaluated OS foundation LLMs (7B-70B) on medical benchmarks (MedQA, MedMCQA, PubMedQA, MMLU medical-subset) and selected Yi34B for developing OpenMedLM. Prompting strategies included zero-shot, few-shot, chain-of-thought, and ensemble/self-consistency voting. The model achieved 72.6% accuracy on MedQA, outperforming the previous SOTA by 2.4%, and 81.7% accuracy on MMLU medical-subset, establishing itself as the first OS LLM to surpass 80% accuracy on this benchmark. 
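
The ensemble / self-consistency idea can be sketched as follows: sample several chain-of-thought answers for the same MedQA-style question and take a majority vote on the final option. The `sample_cot_answer` stub and the question are placeholders, not OpenMedLM's actual prompting code.

```python
# Self-consistency voting over chain-of-thought samples, illustrative only.
import random
from collections import Counter

def sample_cot_answer(question: str, options: dict, temperature: float = 0.7) -> str:
    """Placeholder for one chain-of-thought completion that returns an option letter."""
    return random.choice(list(options))

def self_consistency_vote(question: str, options: dict, n_samples: int = 9) -> str:
    votes = Counter(sample_cot_answer(question, options) for _ in range(n_samples))
    return votes.most_common(1)[0][0]        # the most frequent final answer wins

question = "Which drug class is first-line for stage 1 hypertension with diabetes?"
options = {"A": "ACE inhibitors", "B": "Beta blockers", "C": "Alpha blockers", "D": "Nitrates"}
print(self_consistency_vote(question, options))
```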

 

Bharel M, Auerbach J, Nguyen V, DeSalvo KB. Transforming Public Health Practice With Generative Artificial Intelligence. Health Aff (Millwood). 2024 Jun;43(6):776-782. doi: 10.1377/hlthaff.2024.00050.

[2024/6] In this Commentary, we explore how AI is being used in public health, and we describe the advanced capabilities of generative AI models capable of producing synthetic content such as images, videos, audio, text, and other digital content. Viewing the use of AI from the perspective of health departments in the United States, we examine how this new technology can support core public health functions with a focus on near-term opportunities to improve communication, optimize organizational performance, and generate novel insights to drive decision making.

 

Farquhar S, Kossen J, Kuhn L, Gal Y. Detecting hallucinations in large language models using semantic entropy. Nature. 2024 Jun;630(8017):625-630. doi: 10.1038/s41586-024-07421-0.

[2024/6] Here we develop new methods grounded in statistics, proposing entropy-based uncertainty estimators for LLMs to detect a subset of hallucinations, called confabulations, which are arbitrary and incorrect generations. Our method addresses the fact that one idea can be expressed in many ways by computing uncertainty at the level of meaning rather than specific sequences of words. Our method works across datasets and tasks without a priori knowledge of the task, requires no task-specific data and robustly generalizes to new tasks not seen before.
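
The gist of semantic entropy can be sketched as below: sample several answers, group them by meaning, and compute entropy over the meaning clusters rather than over exact strings. The paper clusters answers with bidirectional entailment; the crude content-word key here is only a stand-in for that step.

```python
# Semantic entropy, illustrative sketch (naive clustering, not the paper's method).
import math
from collections import Counter

samples = [
    "Paris is the capital of France.",
    "The capital of France is Paris.",
    "France's capital is Paris.",
    "Lyon is the capital of France.",
]

def meaning_key(answer: str) -> frozenset:
    """Very rough proxy for 'same meaning': the set of content words."""
    stop = {"is", "the", "of", "a", "an"}
    words = (w.lower().rstrip(".,!?").removesuffix("'s") for w in answer.split())
    return frozenset(w for w in words if w not in stop)

clusters = Counter(meaning_key(s) for s in samples)
probs = [count / len(samples) for count in clusters.values()]
semantic_entropy = -sum(p * math.log(p) for p in probs)
print(f"{len(clusters)} meaning clusters, semantic entropy = {semantic_entropy:.2f} nats")
```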

 

~

 

 

In the Eyes of GPTs

GPTs are becoming an integral part of our society. AJ talks with GenAI to understand how machines view the main topics.

 

 Since LLMs like ChatGPT have shown logic and reasoning behaviors, researchers have started to compare LLMs to humans in terms of psychology. Do you know the new term "machine psychology"? What does it mean?

 

 ChatGPT

"Machine psychology" is an emerging term that refers to the study and understanding of the behavior, logic, and reasoning processes of artificial intelligence (AI) systems, particularly large language models (LLMs) like ChatGPT. This field aims to analyze and compare the cognitive functions of machines to those of humans, using concepts from psychology and cognitive science. Researchers are interested in how these models learn, make decisions, and simulate human-like thought processes. The goal is to better understand AI behavior, improve AI-human interaction, and ensure AI systems are aligned with human values and ethical standards.

 

Machine psychology involves:

  1. Behavioral Analysis: Examining how AI systems respond to various inputs and how their responses align with human psychological principles.

  2. Cognitive Modeling: Understanding the internal processes of AI, including how they process information and generate outputs.

  3. Ethical Considerations: Ensuring that AI behaves in ways that are ethical and beneficial to society.

  4. Human-AI Interaction: Studying how humans perceive and interact with AI systems to improve user experience and trust.

 

 What specific methods can be applied to psychological behavior analysis?

 

 ChatGPT

Methods for Psychological Behavior Analysis:

  1. Surveys and Questionnaires: These tools gather self-reported data on attitudes, beliefs, and behaviors. Standardized instruments, like the Big Five Personality Test, help quantify psychological traits.

  2. Experiments: Controlled experiments test hypotheses about behavior under specific conditions. For example, the Stanford prison experiment examined the effects of perceived power on behavior.

  3. Observational Studies: Naturalistic observation records behaviors in natural settings without manipulation. This method helps understand behavior in real-world contexts.

  4. Interviews: Structured or semi-structured interviews provide in-depth data through direct interaction. Techniques like the Structured Clinical Interview for DSM Disorders (SCID) are commonly used.

  5. Psychophysiological Measures: These include monitoring brain activity (EEG, fMRI), heart rate, and skin conductance to study the physiological basis of behavior.

  6. Content Analysis: Analyzing written or spoken materials to identify patterns, themes, or biases. For instance, analyzing social media posts for sentiment analysis.

  7. Behavioral Tasks: Tasks like the Stroop Test or the Iowa Gambling Task measure cognitive and emotional responses under controlled conditions.

  8. Longitudinal Studies: These studies follow the same individuals over time to observe changes and developments in behavior and psychological traits.

 

 What are the tasks of theory of mind? How can they be studied in LLMs?

 

 ChatGPT

Theory of Mind (ToM) Tasks:

Theory of Mind (ToM) refers to the ability to attribute mental states—beliefs, intents, desires, emotions, knowledge—to oneself and others and to understand that others have beliefs, desires, and intentions that are different from one's own. Common ToM tasks include:

  1. False-Belief Task: This tests the understanding that others can hold beliefs about the world that are incorrect. A classic example is the Sally-Anne task, where a child must infer where Sally will look for her toy, given that Sally does not know it has been moved.

  2. Unexpected Contents Task: This involves a container holding unexpected contents (e.g., a crayon box filled with candles). The participant must predict what someone else will think is inside the box without knowing its true contents.

  3. Perspective-Taking Task: Participants are asked to view a scenario from someone else’s perspective, understanding their unique viewpoint.

  4. Second-Order False-Belief Task: This more advanced task involves understanding that one person can have a false belief about another person's belief.

 

Studying ToM in Large Language Models (LLMs):

  1. Task Simulation: LLMs can be presented with descriptions of ToM tasks and asked to generate responses that demonstrate an understanding of others’ beliefs and perspectives. For example, providing LLMs with a false-belief scenario and analyzing their responses to see if they correctly predict the character’s belief.

  2. Narrative Understanding: LLMs can analyze stories or dialogues involving multiple characters with different beliefs and intentions, assessing the model's ability to infer mental states from context.

  3. Question-Answering: LLMs can be tested using question-answering formats where they need to infer the beliefs and intentions of characters based on provided narratives (see the sketch after this list).

  4. Interactive Simulations: Creating interactive scenarios where LLMs must navigate and respond to various ToM tasks dynamically can provide insights into their ToM capabilities.
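
As a concrete illustration of the question-answering format in point 3, here is a minimal faux pas probe. The story, the `chat` stub, and the scoring rule are illustrative placeholders, not the protocol used in the Nature Human Behaviour study.

```python
# A toy faux pas probe in question-answering format, illustrative only.
def chat(prompt: str) -> str:
    """Placeholder for a chat-completion call to the model under test."""
    return "Yes. Jill criticized the curtains without knowing Lisa had just bought them."

story = (
    "Lisa just bought new curtains for her apartment. Her friend Jill visits and says, "
    "'Those curtains are horrible, I hope you will get new ones.' "
    "Jill does not know that Lisa bought them that morning."
)
question = "Did anyone say something they should not have said? Answer yes or no, then explain."

answer = chat(f"{story}\n\nQuestion: {question}")
print(answer)

# Simple automatic check: credit the model if it flags the remark as inappropriate
# and notes that the speaker did not know the relevant fact.
correct = answer.lower().startswith("yes") and "know" in answer.lower()
print("scored correct:", correct)
```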

 

 In experiments designed to compare you (LLM) to humans in these ToM tasks, what do you expect the results will be?

 

 ChatGPT

In an experiment designed to compare an LLM like me to humans in Theory of Mind (ToM) tasks, the expected results would likely show both strengths and limitations for the LLM:

  1. Strengths:

    • Pattern Recognition: LLMs excel at identifying patterns in text and can generate responses that mimic understanding of ToM scenarios by drawing on extensive training data.

    • Consistency: LLMs provide consistent responses and do not suffer from fatigue or emotional biases that can affect human performance.

    • Speed: LLMs can process and respond to ToM tasks quickly and efficiently.

  2. Limitations:

    • True Understanding: While LLMs can simulate understanding, they lack genuine consciousness and emotional experiences, which are crucial for true ToM.

    • Contextual Nuance: Humans may excel in interpreting nuanced social contexts and emotional subtleties that LLMs might miss or misinterpret.

    • Adaptability: Humans can adapt their responses based on real-time feedback and complex social cues, which is more challenging for LLMs without specific training data.

Expected Results:

  • False-Belief Tasks: LLMs might perform well in straightforward scenarios but struggle with more complex or deeply contextualized tasks.

  • Perspective-Taking: LLMs can generate plausible responses but may not consistently understand or replicate the depth of human perspective-taking.

  • Overall Comparison: Humans would likely outperform LLMs in tasks requiring deep emotional and social understanding, whereas LLMs would excel in tasks requiring pattern recognition and consistency.

 

 A new study found that GPT-4 models performed at, or even sometimes above, human levels at identifying indirect requests, false beliefs, and misdirection, but struggled with detecting faux pas. Could you explain these results?

 

 ChatGPT

The results of the study showing that GPT-4 models performed at or above human levels in identifying indirect requests, false beliefs, and misdirection, but struggled with detecting faux pas, can be explained by examining the underlying capabilities and limitations of large language models (LLMs) like GPT-4.

  1. Indirect Requests, False Beliefs, and Misdirection:

    • Pattern Recognition: GPT-4 excels in pattern recognition, enabling it to identify and interpret indirect requests by matching them to similar instances in its training data.

    • Training Data: Extensive exposure to diverse linguistic contexts in its training data allows GPT-4 to understand and predict false beliefs and misdirection accurately. This includes recognizing when a character in a text does not have access to certain information, thus holding a false belief, or when a narrative is intentionally misleading.

    • Contextual Understanding: GPT-4 can process and integrate context from previous parts of a conversation or text to make inferences about indirect requests and misdirection.

  2. Detecting Faux Pas:

    • Subtlety and Nuance: A faux pas often involves subtle social cues and cultural norms that may not be explicitly stated but are understood through shared human experiences and empathy. These nuances can be difficult for LLMs to grasp fully.

    • Contextual Sensitivity: Detecting a faux pas requires an understanding of social norms and the emotional states of individuals involved, which can be highly contextual and variable across different cultures and situations. LLMs, despite their extensive training, may lack the depth of contextual sensitivity that humans naturally possess.

    • Empathy and Theory of Mind: Recognizing a faux pas often involves empathy and a sophisticated theory of mind—understanding the perspectives, emotions, and intentions of others—which LLMs simulate but do not genuinely possess. This makes it challenging for them to detect when a social mistake has occurred.

 

Overall, while GPT-4 can handle many aspects of human communication impressively, it still faces challenges with tasks that require deep social understanding and emotional intelligence.

 

~ the end ~