AI is great for scientists. Perhaps it's not so great for science

Large language models may make science more generic

Jan 15, 2026

Here are three things that are connected.

First, my sometime co-author James Evans dropped a banger a few days ago. James, Alison Gopnik, Cosma Shalizi and I were chatting via email about a lovely piece that Walter Frick wrote for Bloomberg a few days ago, which starts from our shared ideas about AI as a social and cultural technology. James mentioned in passing, as you do, that he and other co-authors had a forthcoming research article in Nature about how AI was changing science. The takeaway argument is that using AI (which they define as involving a variety of machine learning techniques) is great for scientists’ careers, but not so great for the broader scientific enterprise. That piece is now out.

Second, by coincidence, there’s been a lot of conversation among political scientists about “academic slop” this week. Andy Hall, a political scientist at Stanford Business School, suggested that Claude Code would enable a single academic to write “thousands of empirical papers (especially survey experiments or LLM experiments) per year.” The very next day, he put his money where his mouth was, publishing an entire Claude Code replication of an earlier paper that he’d written, plus the prompts and other stuff, to Github. The result is a lot of nervous chatter about what the industrialization of social science might mean for academic publication and careers.

The third may seem at first to be the one of these things that is not like the others. It’s a piece I wrote myself a few weeks ago, which is mostly a repackaging of Cosma’s and Alison’s ideas, riffing on how sung Yugoslavian folk-tales from the 1930s do and don’t resemble the outputs of Large Language Models.* The upshot is that if you want to think of LLMs as a generative cultural technology, they are far from being the first such technologies that humans have come up with.

What I want to argue is that the third phenomenon is plausibly the glue that connects the second with the first. James and his co-authors suggest that older versions of AI are connected to collective pathologies in science. The Andy Hall Experiment is a specific micro-level instance of the way in which newer and different forms of generative AI present similar, and possibly much worse, problems. But it is the Yugoslav folktales that join micro-level opportunities with macro-level pathologies. LLMs threaten to genre-fy the practice of science.

[Update: Kevin Munger has further arguments on the social sciences that start from a similar perspective to my own]

******

James and his co-authors are interested in the natural sciences: physics, chemistry, medicine and the like, and how they work at scale. There is already a lot of worry among natural scientists about what is happening to their fields. In the news section of Nature I counted no fewer than three news pieces on the topic: “AI is saving time and money in research — but at what cost?,” “More than half of researchers now use AI for peer review — often against guidance,” and “‘I rarely get outside’: scientists ditch fieldwork in the age of AI.” The new piece is research rather than news, but it too suggests that there are reasons to be worried.

The article uses an early and customizable large language model called BERT to categorize over 40 million papers, identifying the ones that appear to have used AI/machine learning techniques (which have a wide variety of legitimate applications to data).

First, AI use seems to be really good for the careers of individual scientists. Scientists who use it are able to write a lot more papers, with less help from other human researchers. Those papers are more likely to be cited by others. Their authors are on average promoted more quickly. All these relationships are associational rather than causal, but they are both visible and important at scale.

The problem is that what is good for scientists may not be good for science as a whole. Papers that use AI are more likely to succeed, but apparently less likely to stretch boundaries. Evans and his co-authors deploy another bespoke AI model to measure how AI-aided papers shape knowledge production. They find that AI-enabled research tends to shrink scientific inquiry to a smaller set of more topical questions. Furthermore, the linkages between papers suggest that there is less vibrant horizontal exchange associated with AI. The authors conclude that:

These findings suggest that AI in science has become more concentrated around popular research topics that become “lonely crowds” with reduced interaction among papers, linking to more overlapping research and a contraction in knowledge extent and diversity across science.

So is this likely to become more of a problem as scholars use AI not just to interrogate data, but actually to carry out research, review other scientists’ papers and so on? To understand this it’s useful to step back a bit, and think about what science is supposed to be doing in the world.

The entire enterprise of scientific research is intended to produce and evaluate useful discoveries. Usefulness, of course, is subjective, and disputed, but few apart from the late Ted Kaczynski would condemn the entire enterprise wholesale. Unfortunately, while the delights and benefits of disinterested discovery are genuine, they are insufficient to keep the scientific enterprise going at scale. To do that, you need some set of social institutions that imperfectly reconciles individual self-centered incentives (I, a scientist, want not just to find out about the world, but to have a great job and career, and the admiration of my colleagues) with the production of general scientific knowledge.

The gap between individual goals and collective benefits explains much of the workings of science. Publication pressures, peer review, competitive funding are all highly imperfect means to incentivize individuals to participate in the scientific enterprise, and to increase the chances that good work rises to the top.

So what happens when we add LLMs to the equation? To be clear, as best as I understand James and his colleagues’ research, it is not aimed at detecting LLM use but figuring out when researchers explicitly use AI tools e.g. for data analysis.

What LLMs plausibly do is to exacerbate already existing contradictions between individual incentives and collectively beneficial outcomes of interesting and creative research. This is where the Andy Hall Experiment provides a useful example. It provides a best-case scenario for the use of LLMs to enhance science: obviously, it is about the social sciences rather than the natural sciences, but I am pretty sure that the basics of datasets, packages and prompts carry over pretty well to a wide variety of fields.

The experiment is noteworthy in that there are collective as well as individual benefits to this kind of work. As Seva Gunitsky says, replication of existing results is (a) important, (b) unprestigious, and (c) a massive pain in the arse. Having Claude Code do it instead usefully fills in some gaps in the existing scientific enterprise. Equally, the creation of an automated system that can churn out thousands of scientific papers sounds ominous. So too, the use of LLMs for peer review and a myriad other potential uses. But why should it be ominous? These, after all, are the kinds of industrialization and automation that have served us well in a myriad of other economic sectors. Are academics the equivalents of 19th century craftsmen, deploring the factories that are capable of turning out product at scale for putting them out of jobs?

I think academics’ worries are justified, but LLMs are probably not so much creating fundamentally new problems as exacerbating old ones. In particular, I suspect that LLMs are hastening the genre-fication of scientific research.

To understand this, it is useful to highlight three words in the Hall comments, which possibly gave rise to an instinctive shudder in some, though certainly not all, of the political scientists reading this essay: “especially,” “survey” and “experiments.” My comments on survey experiments below are not only political science insider-baseball, but highly tendentious insider-baseball. All I can say in their defense is that they come from a place of genuine pain.

Survey experiments are opinion surveys in which you randomly assign respondents to different treatments (e.g. different question wordings, or different initial scenarios that may prime respondents to think about some topic in particular ways) to see if there are meaningful differences in their responses. They have become semi-ubiquitous in political science. There are a lot of publications that use this approach, but my sense is that far fewer of them are good, in the sense that they contribute in a genuinely significant way to collective knowledge of politics. They do, however, plausibly contribute to their authors’ chances of tenure and promotion.
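For readers who have never run one, here is a minimal sketch of the mechanics of a survey experiment, using simulated respondents rather than real data. Everything in it is an assumption made for illustration: the function names, the 1-to-5-ish response scale, the sample size and the built-in treatment effect are all hypothetical, and it is not a description of how any particular paper does things.

```python
import random
import statistics


def run_survey_experiment(n_respondents=2000, true_effect=0.3, seed=0):
    """Simulate a two-arm survey experiment on hypothetical respondents.

    Each respondent has a baseline attitude; a coin flip decides whether
    they see the treatment wording, which shifts their answer by
    `true_effect`. All numbers here are invented for illustration.
    """
    rng = random.Random(seed)
    control, treated = [], []
    for _ in range(n_respondents):
        baseline = rng.gauss(3.0, 1.0)  # hypothetical attitude score
        if rng.random() < 0.5:          # random assignment to treatment
            treated.append(baseline + true_effect)
        else:
            control.append(baseline)
    return control, treated


def difference_in_means(control, treated):
    """Return the estimated treatment effect and its standard error."""
    diff = statistics.mean(treated) - statistics.mean(control)
    std_error = (
        statistics.variance(treated) / len(treated)
        + statistics.variance(control) / len(control)
    ) ** 0.5
    return diff, std_error


control, treated = run_survey_experiment()
estimate, std_error = difference_in_means(control, treated)
print(f"estimated effect: {estimate:.3f} (standard error {std_error:.3f})")
```

Because assignment is random, the difference in average responses can be read as the effect of the treatment wording on these respondents. Note what the sketch leaves out: nothing in it says whether that effect matters outside the survey.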

Indeed, they are reasonable responses to the institutional incentives of the field of political science. Academics who want to land and keep good jobs want to get their work published in respectable journals. Editors and reviewers for such journals are often more inclined to reject than encourage submissions, because they get so many of them. Scholars, especially younger scholars, are desperate to figure out sure-fire ways of navigating the obstacles of a review process that seems purpose-designed to frustrate them.

Articles that employ survey experiments have a better chance of getting through this process than many other approaches. Political science is often a heavily lagging indicator of trends in economics, and economists have become much more interested in causal identification in the last couple of decades - isolating causal relationships to figure out what is actually causing what. Political scientists have followed suit, making it much harder for articles without a good story about causation to get published. The problem is that establishing causation is difficult, expensive and murky in the real world. Survey experiments do not tell you very much about the real world unless they are carefully done (some are!), but they do make it very easy to tell a story about causation (they are, after all, built around a treatment which may cause one response or another). The result is that low quality survey experiment articles have become the political science version of kudzu - an infestation of most-mediocre-output-that-is-potentially-publishable that threatens to take over the entire ecosystem.

So what does this have to do with AI? The way in which I would adapt Hall’s comment (this may or may not bear any resemblance to his own ideas) is that survey experiment articles have become a genre, for much the same reason that pop music generates genres. There too, myriads of desperate young people are trying to succeed in producing outputs that will prove acceptable to a fickle and inscrutable public. There too, when someone miraculously succeeds, everyone else will rush in to copy them. There too, successful methods tend to turn into replicable packages: specific beats, lengths, themes and vocals in the one; forms of presentation, methodologies, kinds of data and means of arguing for significance in the other.

And if there is one thing we know about LLMs, it is that they are machines for detecting and reproducing genres. Here, then, is where the Yugoslav folk singers described in Albert Lord’s The Singer of Tales get their due. LLMs are very much like the generative cultural systems that created these folk tales with their minor variations, processing textual material so that it hits the right linguistic beats, harks to the right tropes at the right times and so on.

There is a lot of speculation that LLMs are returning us to something like oral culture. There is rather less that engages in any very intelligent way with the particulars of how oral culture works. Oral culture, like LLMs, involves lossy abstractions that also serve as generative systems. It too produces myriads of variations on common themes, adapting them to particular prompts and circumstances. It too is indifferently geared for verbatim transmission of the work on which it has been trained. When Varshney describes Lord’s formulas as “heuristic solutions to constrained optimization problems that must be solved in real-time,” he is using language that Lord might perhaps have found peculiar (though also perhaps not; Jakobson was on his dissertation committee), but that Wolfe would readily have recognized.

Many academic articles too are “variations on common themes” that are adapted to particular prompts and circumstances. Is it any wonder that sophisticated LLMs like Claude Code are capable of replicating them en masse?

As I said back then:

many aspects of human work and culture involve broadly similar combination of templates and stereotypes to those employed by the singers of tales. I suspect that this helps explain the facility of LLMs in carrying out many programming tasks, since programming too involves figuring out how to apply a common formula to a particular problem. The poiesis of the programmer is closer to the heroic poiesis of the bard than we think. … And perhaps also for much of the practice of social science? Dani Rodrik has written that a great deal of the art of the economist consists in accumulating a large mental library of mathematical models, and building an intuitive grasp of which model one ought to use when.

I didn’t at all expect that this throwaway suggestion would become relevant so quickly!

******

This, then, suggests a possible theory of what is happening. This is nowhere near a complete theory - there are plausibly lots and lots of specific micro-mechanisms jostling with each other to connect cause to effect. But I like it, perhaps only because it is my own.

Science is two things - a process of open-ended discovery and verification of those discoveries, and an institutional system for employing the energies of scientists towards that process and compensating them. The two ought to point in the same direction, and do, to some substantial degree. There is an awful lot of waste in the system, but it is impossible to eliminate some (the open-endedness is part of the point; apparently useless discoveries may cumulate into great things), and extremely difficult to eliminate others (many proposed cures are more damaging than the disease). There are always tensions, and there are disciplines and sub-disciplines (metascience; chunks of social epistemology) devoted to studying and perhaps partly remedying these tensions.

One way in which those tensions manifest is genre-fication. Interesting discovery is hard, unpredictable and often requires a lot of resources. Scientists across the hard, soft and social sciences would often prefer, as all humans would, to have a more predictable world in which they can land jobs. This gives rise, in turn, to tendencies towards genre-fication. When someone discovers a path through the kill-zone of peer-review, others will want to copy it, in the hope that they too will succeed in winning kudos and career success. This results in the creation of scientific genres - packets of techniques, methodological approaches and rhetorical claims that scientists adopt in the hope that they will prosper. And that opens up the way for technologies that are good at picking up on genre cues and replicating them.

People will reasonably disagree about the merits of specific scientific genres. As should be clear, I am skeptical about the merits of survey experiments in political science, but many of my colleagues may very reasonably disagree. And genre has value! Some coordination is necessary for science to work. But the overall effects of genre-fication are to winnow out some of the variety among scientists that produces unexpected discovery. That LLMs may create their very own new genre of social science articles that treat LLMs as a proxy for public opinion, generating outputs that become inputs? This only adds icing to the cake.

LLMs then, as they are currently employed by scientists, are likely to reduce diversity. Claude Code is plausibly still at the stage where it is good at doing replications, but not so great at assembling the package in ways that produce somewhat novel-seeming research. I suspect, as Hall does, that it is not far away from it. This will, however, rapidly accelerate the genre-fication of science.

LLMs are excellent at assembling outputs that match the requirements of particular templates - producing genre outputs. They are also very good at mixing and matching genres. In contrast, they are remarkably poor at generating usefully novel outputs or recognizing novelty in the data they are trained on. Accordingly, the more that LLMs are employed in the ways that they are currently being employed, the more concentrated science will be on studying already-popular questions in already-popular ways, and the less well suited it will be to discovering the novel and unexpected. James and his colleagues’ findings identify an existing problem that may well become much worse with the newer forms of generative AI that are rapidly reshaping science.

To be clear, this is not an inevitable consequence of the technology. To steal another analogy from pop music, Autotune has likely, on average, made pop music more bland, but it has also been used in weird and interesting ways to expand the range of things that you can do. The Nature article employs a basic LLM to make the scientific enterprise visible at scale in ways that would have been inconceivable fifteen years ago. But it is going to be hard to get to a place where the technology is better suited to serve the interests of science, rather than those interests of scientists that point away from discovery.

* I do take credit - or blame - for the excursion into the Proustian science fiction of Gene Wolfe.

Ben Recht

Jan 15

I don’t think this detracts from any of your excellent points here, but three things worth pointing out about Hall’s “paper” are that it wouldn’t actually get published anywhere, it used existing data in known repositories, and the instructions to Claude Code were longer than the paper itself.

That last part is critical. Because Hall could write a screenplay for a replication study, the study was instantly automatable. That this is mechanically true now is both remarkable and mundane.

Alex Tolley

Science proceeds in steps with new experimental discoveries that can be replicated. However, while the new discovery can be published, the replication, or more importantly, the failure to replicate, is mostly not. A good "recent" example was the "Arsenic bacteria" paper published in Science in 2010. After some 15 years of controversy, it was retracted this year (2025). Most of that failure to replicate was not published. This is a problem.

Cognitive Scientist Melanie Mitchell's presentation at NeurIPS 2025 decried that while AI papers were being easily accepted and published, replication studies that did not confirm the results were very hard to publish. Yet replication is important because there are biases that seep in, and this distorts the research and its interpretation.

Her summary of the talk. https://aiguide.substack.com/p/on-evaluating-cognitive-capabilities

We used to worry about junk research being published. Then, about the many non-peer-reviewed journals polluting science with so-so or even poor papers. Now we have GenAI increasing the noise of too many "genre" papers, saying very little that is new, first by making papers easier to write, now by generating "new" papers en masse. ArXiv doesn't want review papers, and it already has a lot of junk papers, as do similar platforms. The incentives for academic research publications are perverse, resulting in lots of junk and increasing fraud rates. Isn't this rather like the Russian propaganda model of creating lots of false information to bury the truth amongst the lies, and turning off people trying to find the truth?

Programmable Mutter
