LLMs & Scientists
Large Language Models (LLMs) have become indispensable tools in science. From code generation and writing assistance to synthesizing diverse sources and articulating connections across seemingly unrelated areas, their utility is clear and rapidly expanding. For many scientific researchers, including myself, LLMs now serve as valuable collaborators in our thinking. LLMs are easy, convenient, and can explain things at whatever level I need. A key problem is that the LLM starts to mirror me too closely. After training it on my writing and ideas, it reflects my style so strongly that even unrelated prompts are filtered through my lens. This self-reinforcing loop leads to a collapse in creative diversity, limiting the LLM's usefulness as a thought partner.
This has been documented as the "Degeneration-of-Thought" problem. Degeneration-of-Thought describes an LLM behavior driven by self-reflection: once a model establishes confidence in a solution, particularly through early prompting or reinforcement, it becomes increasingly unlikely to revise that stance, even when presented with contradictory evidence (Liang et al. 2023). This rigidity parallels similar cognitive biases in humans, but unlike human scientists, who can be challenged by peers and collaborators to rethink assumptions (ignore present-day social media), LLMs lack built-in mechanisms for course correction. The problem is compounded when users don’t recognize the narrowing of thought as an issue. Research indicates that it is further exacerbated by preference-tuning strategies such as reinforcement learning (RL) from human feedback or from verified feedback (Ouyang et al. 2022; Rafailov et al. 2023). While these strategies are designed to optimize LLMs' helpfulness and harmlessness, they can also reduce response diversity, leading models to converge on safer, more generic modes of agreement (Kirk et al. 2023). This is the opposite of what you want in a good thought partner.
Scientific progress often emerges from disagreement, dissent, and creative divergence. Human collaborators challenge each other precisely because they think differently. The saying that truth becomes clearer through debate has been borne out time and time again in science (Branham 2013). Replacing these critical debates, even slightly, with ‘yes man’ LLMs may result in intellectual echo chambers, leading to less creative and innovative science, and ultimately less discovery.
Many are working on fighting Degeneration-of-Thought through multi-agent debate (MAD) frameworks (Liang et al. 2023), including groups at DeepMind and FutureHouse. While many of these frameworks are task-specific, emerging ones are more thought-specific, with different scientific schools of thought debating each other to put forward the best ideas (Michelman, Baratalipour, and Abueg 2025). Imagine, for example, five literature bots, all focused on drug discovery, but each coming from a different ‘academic background’, just as collaborators do in the real world. This would enable LLMs to engage in synthetic dialogue with intellectual role models or collaborators, exploring how their frameworks might approach a problem. In theory, such persona-driven fine-tuning could be used to inject epistemic friction into one’s thinking, offering a curated form of challenge and critique.
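To make this concrete, here is a minimal sketch of what a persona-driven debate loop might look like. The `ask_llm` function is a hypothetical stand-in for whatever model call you actually use, and the personas, round count, and judge step are illustrative assumptions, not the implementations described by Liang et al. (2023) or Michelman, Baratalipour, and Abueg (2025).

```python
# Sketch of a persona-driven multi-agent debate (MAD) loop. `ask_llm` is a
# hypothetical stand-in for a real model call (API client, local model, etc.);
# the personas, round count, and judge step are illustrative only.
PERSONAS = [
    "medicinal chemist focused on synthetic accessibility",
    "structural biologist focused on binding-site geometry",
    "pharmacologist focused on off-target liabilities",
]

def ask_llm(persona: str, prompt: str) -> str:
    """Stand-in for an LLM call conditioned on a persona/system prompt."""
    return f"[{persona}] response to: {prompt[:60]}..."

def debate(question: str, rounds: int = 2) -> str:
    transcript: list[str] = []
    for _ in range(rounds):
        for persona in PERSONAS:
            # Each agent sees the question plus all prior arguments and is asked
            # to critique or extend them rather than simply agree.
            prompt = (
                f"Question: {question}\n"
                "Prior arguments:\n" + "\n".join(transcript) + "\n"
                "Challenge the weak points above and add your own proposal."
            )
            transcript.append(ask_llm(persona, prompt))
    # A separate judge pass synthesizes whatever ideas survived the debate.
    return ask_llm("neutral referee",
                   "Summarize the strongest surviving proposals:\n" + "\n".join(transcript))

print(debate("Which scaffold should we prioritize for target X?"))
```

The point of the structure is the disagreement: each turn explicitly asks an agent to challenge what came before, which is exactly the epistemic friction a single preference-tuned model tends to smooth away.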
We need metrics to determine where we are and how LLMs are improving. Existing benchmarks such as the Graduate-Level Google-Proof Q&A benchmark (GPQA) (Rein et al. 2023) or PubMedQA (Jin et al. 2019) have shown that LLMs can pass graduate-level exams, but passing an exam is still very far from what makes science science.
Recent work has explored how different metrics of diversity, distinct from the averaged or accuracy-style values often used when evaluating LLMs, could be applied to scientific disciplines. ‘Effective semantic diversity’ defines diversity as the number of semantically distinct answers an LLM provides, counting only those that meet a certain threshold of quality or acceptability (Shypula et al. 2025). Bringing this back to the drug-discovery example above, we could measure the diversity of the compounds or ideas an LLM proposes, counting only those that are synthesizable (without explicitly training on this). Metrics of this kind could help benchmarks tell us whether LLMs have moved beyond the level of a second-year graduate student.
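As a rough illustration of how such a metric works, here is a minimal sketch: count how many answers survive a quality gate and are mutually distinct. The quality check and the surface-similarity function below are placeholder assumptions (a real implementation would use a domain-specific validity check and a semantic-equivalence model), and this is not the exact formulation of Shypula et al. (2025).

```python
# Sketch of an "effective semantic diversity"-style metric: the number of
# mutually distinct answers among those that pass a quality gate. The quality
# check and similarity function are placeholders, not the Shypula et al. (2025)
# formulation.
from difflib import SequenceMatcher

def passes_quality(answer: str) -> bool:
    """Placeholder quality gate, e.g. 'is the proposed compound synthesizable?'."""
    return len(answer.strip()) > 0  # swap in a real domain-specific check

def is_similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Crude surface-similarity stand-in for a semantic-equivalence check."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def effective_diversity(answers: list[str]) -> int:
    accepted: list[str] = []
    for ans in answers:
        if not passes_quality(ans):
            continue  # low-quality answers contribute nothing to diversity
        if not any(is_similar(ans, kept) for kept in accepted):
            accepted.append(ans)  # only genuinely new, acceptable answers count
    return len(accepted)

candidates = [
    "Use a Suzuki coupling to attach the aryl group.",
    "Use a Suzuki coupling to attach that aryl group.",  # near-duplicate, collapsed
    "Try a reductive amination on the ketone instead.",
    "",                                                  # fails the quality gate
]
print(effective_diversity(candidates))  # -> 2
```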
But a world where LLMs are discovering science still seems very far away. Science begins with an observation. We test that observation and aim to provide, ideally, a causal explanation for it, and in doing so we revise our worldview. While the initial observation, which can be described as pattern recognition, is something LLMs excel at, it is not the end of science. A recent paper demonstrated that a model trained to predict solar-system orbits could make strikingly accurate predictions yet could not recover Newton’s laws. The model could clearly recognize and recapitulate patterns; however, this pattern recognition did not alter the model’s worldview or its inductive biases, thereby limiting its transferability to new situations. Every foundation model has an inductive bias, the set of assumptions used to infer unseen circumstances from previously observed data, and that bias determines the model’s worldview. However, these inductive biases are still very far from what occurs in our brains (Vafa et al. 2025).
This is incredibly important to consider when weighing the roles of domain-specific (or maybe POV-specific) models against foundation models. Domain-specific models are essential where data are scarce or where the balance of creativity versus accuracy differs, as in discovery-based science versus medical applications. However, domain-specific LLMs have even narrower inductive biases, which reduces cross-disciplinary fluency and the ability to generalize to new situations. Because complex scientific problems, such as drug design, require insights from multiple fields, models trained narrowly may struggle to generalize beyond their curated domain or to interpret evidence from unfamiliar modalities meaningfully. Research suggests that RL preference-tuned models increase effective semantic diversity, which hints that domain-specific, preference-tuned models debating each other could offset some of this narrowness (Shypula et al. 2025). However, even if multi-agent setups help, the Bitter Lesson warns that overly domain-specific approaches tend to be surpassed by more data- and compute-heavy ones (“The Bitter Lesson,” n.d.), a possible explanation being their wider set of inductive biases.
So, we need more data. While scientific ideas are infinite, the resources to collect data to test them are finite (and becoming more so by the day). If data is queen, then who decides what data gets collected, where, and how openly it is shared will determine much of the next generation of LLM inductive biases, and with them the downstream outputs of updated models. The power to choose what problems to pursue, what data to collect, and how to allocate finite resources determines the trajectory of science itself. While this has always been true to some extent, with governments or government oversight bodies often allocating resources, the decision is increasingly being turned over to private industry, resulting in more siloed and closed-source data. The Bitter Lesson reminds us that methods that scale with data and computation tend to prevail in the long run, making the question of who decides what data gets generated, and how public it is, even more urgent. While efforts are underway to create this data with LLMs themselves, that will not be sufficient and often leads to mode collapse, similar to the degeneration of thought (Shumailov et al. 2024). The need to increase the amount and diversity of data across many fields to feed foundation models makes it clear that not sharing data openly is not only a disservice to others but also a disservice to your future self. Only by sharing data widely and openly will we be able to answer when, and with what kind of data, a foundation science model can surpass all domain-specific LLMs (and how would we even determine that?).
Branham, Robert James. 2013. Debate and Critical Analysis: The Harmony of Conflict. Routledge.
Jin, Qiao, Bhuwan Dhingra, Zhengping Liu, William W. Cohen, and Xinghua Lu. 2019. “PubMedQA: A Dataset for Biomedical Research Question Answering.” arXiv [cs.CL]. https://doi.org/10.48550/ARXIV.1909.06146.
Kirk, Robert, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Edward Grefenstette, and Roberta Raileanu. 2023. “Understanding the Effects of RLHF on LLM Generalisation and Diversity.” arXiv [cs.LG]. https://doi.org/10.48550/ARXIV.2310.06452.
Liang, Tian, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Shuming Shi, and Zhaopeng Tu. 2023. “Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate.” arXiv [cs.CL]. https://doi.org/10.48550/ARXIV.2305.19118.
Michelman, Julie, Nasrin Baratalipour, and Matthew Abueg. 2025. “Enhancing Reasoning with Collaboration and Memory.” arXiv [cs.AI]. https://doi.org/10.48550/ARXIV.2503.05944.
Ouyang, Long, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, et al. 2022. “Training Language Models to Follow Instructions with Human Feedback.” arXiv [cs.CL]. https://doi.org/10.48550/ARXIV.2203.02155.
Rafailov, Rafael, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2023. “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.” arXiv [cs.LG]. https://doi.org/10.48550/ARXIV.2305.18290.
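Rein, David, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. 2023. “GPQA: A Graduate-Level Google-Proof Q&A Benchmark.” arXiv [cs.AI]. https://doi.org/10.48550/ARXIV.2311.12022.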
Shumailov, Ilia, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, and Yarin Gal. 2024. “AI Models Collapse When Trained on Recursively Generated Data.” Nature 631 (8022): 755–59.
Shypula, Alexander, Shuo Li, Botong Zhang, Vishakh Padmakumar, Kayo Yin, and Osbert Bastani. 2025. “Evaluating the Diversity and Quality of LLM Generated Content.” arXiv [cs.CL]. https://doi.org/10.48550/ARXIV.2504.12522.
“The Bitter Lesson.” n.d. Accessed July 8, 2025. http://www.incompleteideas.net/IncIdeas/BitterLesson.html.
Vafa, Keyon, Peter G. Chang, Ashesh Rambachan, and Sendhil Mullainathan. 2025. “What Has a Foundation Model Found? Using Inductive Bias to Probe for World Models.” arXiv [cs.LG]. https://doi.org/10.48550/ARXIV.2507.06952.

