29 Comments
User's avatar
Ben Recht's avatar

I don’t think this detracts from any of your excellent points here, but three things worth pointing out about Hall’s “paper” are that it wouldn’t actually get published anywhere, it used existing data in known repositories, and the instructions to Claude Code were longer than the paper itself.

That last part is critical. Because Hall could write a screenplay for a replication study, the study was instantly automatable. That this is mechanically true now is both remarkable and mundane.

Gerben Wierda's avatar

Your comment adds a critical note to mine. "Easy" is indeed not to be assumed.

Alex Tolley's avatar

Science proceeds in steps with new experimental discoveries that can be replicated. However, while the new discovery can be published, the replication, or more importantly, the failure to replicate, is mostly not. A good "recent" example was the "arsenic bacteria" paper published in Science in 2010. After 15 years of controversy, it was retracted this year (2025). Most of that failure to replicate was not published. This is a problem.

Cognitive scientist Melanie Mitchell's presentation at NeurIPS 2025 decried that while AI papers were being easily accepted and published, replication studies that did not confirm the results were very hard to publish. Yet replication is important, because biases seep in, and this distorts the research and its interpretation.

Her summary of the talk: https://aiguide.substack.com/p/on-evaluating-cognitive-capabilities

We used to worry about junk research being published. Then, about the many non-peer-reviewed journals polluting science with so-so or even poor papers. Now we have GenAI increasing the noise of too many "genre" papers, saying very little that is new, first by making papers easier to write, now by generating "new" papers en masse. ArXiv doesn't want review papers, and it already has a lot of junk papers, as do similar platforms. The incentives for academic research publications are perverse, resulting in lots of junk and increasing fraud rates. Isn't this rather like the Russian propaganda model of creating lots of false information to bury the truth amongst the lies, and turning off people trying to find the truth?

Julie's avatar

The problem with AI is that it holds great potential but equally great risk, and all the difference lies in whether the user has good and well-thought-out intentions or not.

Will these scientists restrain themselves from the pure pursuit of money and gaming academia? AI multiplies the ability to do the great and the very scummy things alike.

It is like giving out digital enriched uranium and hoping that people will make reactors and not bombs. Some will achieve great things leveraging it and some will commit great fraud or cluelessly harm their field. Just like before but multiplied ten times.

Use it responsibly. Anyone could impersonate any person right at this moment and gain millions of dollars overnight using it unethically. Thankfully the people who know how and the people who are willing to are not yet overlapping greatly but it is a question of time.

Gerben Wierda's avatar

One can wonder whether GenAI tools are going to influence what research we do and publish and what we do not (because GenAI can't easily assist with it). A bit like the drunk searching for their keys under the lantern, because that is where the light is.

Conor Griffin's avatar

Hi Henry

The broader question about how AI will affect scientific creativity - the production of ideas that are novel and useful - is a very important one. We touched on it very briefly in a past essay, and would love to write a more detailed unpacking of this.(1)

The idea that all scientists will start using LLMs to produce increasingly similar questions and findings, of questionable value, is a compelling one, but that's more of a reason to be skeptical of it, I think. There are also lots of counter-arguments. If AI makes scientists more productive, it could also give them time and scaffolding to explore new, harder ideas, especially if the bad incentives to publish ever more papers could be tackled - albeit a big if! If AI makes other disciplines more accessible to non-specialists, it might help scientists to operate across those disciplines better (a common source of creative ideas). Rather than replacing experiments that provide critical, but often expensive, ground-truth evidence, scientists could use AI to help design better and more novel experiments, with higher RoI. Scientists can also shape LLM outputs to some degree, by uploading their personal notes or thoughts. We can also train and optimise LLMs for many goals, from answering questions to generating novel problems, or from accuracy on a task to curiosity. Very little is set in stone.

So we need more empirical evidence about how AI is actually affecting science. In the Hao/James Evans paper, they train a model to identify "AI-augmented papers", based on titles and abstracts from AI-oriented journals. Any study of this scale will have to make difficult methodological choices. But by definition, the model will select "AI-augmented papers" that share a technical vocabulary (like "neural network") and are mathematically closer in vector space, which is how they measure "narrowness", compared to the "non-AI" papers. "AI-augmented papers" will also capture research areas where AI is suitable to use in the first instance, which may again, by definition, be a narrower space. The methodology will also miss those who use AI-derived predictions in their research but do not mention it in the abstract. On the other hand, the authors report that the downstream citations to the AI papers actually cover more ground than those to non-AI papers. This could be interpreted as AI research offering "tools" that are useful to many scientists - a widening rather than a narrowing of science.
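
As an illustration of the vector-space "narrowness" idea discussed here, a minimal sketch: represent each abstract as a bag-of-words vector and take the mean pairwise cosine similarity of a corpus as a narrowness score (higher = papers cluster more tightly). This is only a toy proxy, not the authors' actual method, which uses learned embeddings; the function names and toy abstracts are my own.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def narrowness(abstracts: list[str]) -> float:
    """Mean pairwise cosine similarity across a corpus of abstracts.

    A crude stand-in for the paper's vector-space 'narrowness':
    a corpus sharing more vocabulary sits closer together and scores higher.
    """
    vecs = [Counter(text.lower().split()) for text in abstracts]
    pairs = [(i, j) for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    return sum(cosine(vecs[i], vecs[j]) for i, j in pairs) / len(pairs)

# Toy corpora: shared technical vocabulary vs. disjoint topics.
ai_like = [
    "neural network model for protein prediction",
    "deep neural network model for image prediction",
]
mixed = [
    "neural network model for protein prediction",
    "field survey of alpine beetle habitats",
]
print(narrowness(ai_like) > narrowness(mixed))  # True: shared vocabulary clusters more tightly
```

The point of the toy is the selection-effect worry above: any classifier keyed to shared technical vocabulary will, by construction, pick out a corpus that scores as "narrower" on this kind of measure.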

In the latter spirit, a recent paper (2), independently carried out but financially supported by my colleagues here at Google DeepMind, looked at how scientists are using AlphaFold protein structure predictions. It adopts a more causal framework to demonstrate that AlphaFold has made it more possible for researchers to discover structures for proteins in less-studied domains - a widening rather than a narrowing of the field. Some might think that AlphaFold is a niche biology model, very different to an LLM. But they share some architectural similarities, and the coming era of AI scientific assistants (based on LLMs) will be able to call on specialised AI tools, like AlphaFold and many others. Indeed, one can imagine an orchestra of AI agents and tools, including an agent that could tell you how "novel" your work is (albeit novelty, in and of itself, is not always good, of course).

Of course, if scientists are only incentivised to publish papers and nobody has time to really read or review them, there are many ways that LLMs could make things worse. But I think there’s a lot of nuance and optionality in how this plays out.

P.S. We also wrote a short note on the other very interesting ‘LLMs affecting science’ paper from last week, from Yian Yin and colleagues in case of interest. (3)

Links

1. https://www.aipolicyperspectives.com/p/a-new-golden-age-of-discovery

2. https://www.innovationgrowthlab.org/wp-content/uploads/2025/11/ai_in_science_af2_igl_summary.pdf

3. https://substack.com/home/post/p-185431142

Henry Farrell's avatar

Hi Conor - super brief because I am in the midst of crazy right now, thanks to the geopolitical firestorm. Every time I try to get out of weaponized interdependence it pulls me back …

I interpret James and his colleagues less as arguing that AI is necessarily going to weaken discovery than as claiming that the ways in which it is being deployed are weakening discovery. There are other ways to use it that are more plausibly innovative, including James' other work, which looks to turn models upside down so as to discover the possibility of surprise. I think - which James may or may not agree with - that the dominant uses of LLMs right now (which, as you say and as the piece says, are not actually the topic of James' research) point in a more rather than less generic direction. LLMs as a tool of automated peer review, for example, look to me on average to be a bad idea for innovation. Equally, there are possible second-order effects, and this result is not a necessary outcome. If you want to think proactively, the problem is not LLMs as such, but the ways in which LLMs accentuate and perhaps radically accelerate already existing trends, which are part and parcel of the current political economy of scientific discovery. There may be different political economies of discovery and credit-giving in which they would have very different effects. If anyone is working on this, I would love to know!

Conor Griffin's avatar

Thanks Henry

I'd definitely agree with this:

"If you want to think proactively, the problem is not LLMs as such, but the ways in which LLMs accentuate and perhaps radically accelerate already existing trends, which are part and parcel of the current political economy of scientific discovery."

On peer review, I think the goal should be less to automate current forms of peer review, which likely wouldn't work well and would annoy 97% of scientists (a conservative guess!), and more to use LLMs for specific tasks that enhance the broader goals of peer review - e.g. checking for errors in maths papers.

There would be a lot to work out and refine, of course. Unfortunately, the fact that AI in peer review is often banned, and that the models themselves are general-purpose and we often don't know how good or bad they are, inhibits the kind of exploration and experimentation that it would be nice to see in areas like peer review (although folks are likely doing it in the shadows!).

Good luck with the geopolitical firestorm.

Alexander Kurz's avatar

What will be the impact on PhD students? Will writing easy AI papers help them to learn the trade, or rather the opposite?

Albrecht Zimmermann's avatar

You're kind of getting at this with "What LLMs plausibly do is to exacerbate already existing contradictions between individual incentives and collectively beneficial outcomes of interesting and creative research." but I think it should be stressed even more:

When you write "They find that AI-enabled research tends to shrink scientific inquiry to a smaller set of more topical questions. Furthermore, the linkages between papers suggest that there is less vibrant horizontal exchange associated with AI." you're describing a mechanism that's already happening quite well without LLMs; those models just speed it up. I spent years during my PhD trying to convince fellow researchers to acknowledge that there was related work to theirs that just used different terms for the same concepts, and to cite others outside their term-specific bubble (in part because I was outside that bubble), and got exactly nowhere with this. Same for not focusing on small improvements on the topic-du-jour but instead going for bigger swings.

In both cases, I eventually took the cynical view that this was basically a publication- and citation-maximizing strategy. So LLMs might make this worse, but the root problem is the current job and financing incentives.

And a completely different point: "But why should it be ominous? These, after all, are the kinds of industrialization and automation that have served us well in a myriad of other economic sectors. Are academics the equivalents of 19th century craftsmen, deploring the factories that are capable of turning out product at scale for putting them out of jobs? "

As Merchant's "Blood in the Machine" shows, if they were the equivalent of those craftsmen, they would be absolutely right: quality got shoddier (one-size-fits-none is still very much a thing too), untrained people could produce the products, and the increase in profits went almost entirely to capitalists and didn't benefit producers at all.

Synthetic Civilization's avatar

Genre-fication isn’t a cultural failure.

It’s what happens when discovery is filtered through systems optimized for throughput, citation, and review latency.

LLMs don’t replace scientists, they align perfectly with the existing selection function.

Andrew Moore's avatar

Recently-retired scientist here, so I'm watching younger colleagues grapple with this stuff rather than having it impose consequences on me personally.

I have two observations to make.

First: I'm still actively doing peer reviews. In my corner of science, it's pretty usual for the journal to ask the reviewers a question of the form "how much do you think this manuscript is going to influence the field?" This means that there is an existing lever for editorial boards to pull if they see a need to de-genre-ify their journals.

Second: there's a discernible pattern that follows on from the development of a "scientific genre" of the kind you describe. Once enough papers accumulate within a genre (papers that are mediocre, on average), some bright person [1] either does a formal meta-analysis or else delves deeply into what the collected information is telling us. Such meta-studies are disproportionately likely to either locate simplicity beyond the complexity, or to drive the formation of good new questions. I suspect, therefore, that lots of smallish "genre islands" can actually be a **good** thing for scientific progress; a few small genre sub-continents, not so much.

[1] With the advent of social networking for researchers, consortia of bright people have become the default means of making this happen. (This is one way of pushing up the average authors per paper in highly-cited papers :( )

Laurence Woodward's avatar

“…the third phenomena is plausibly the glue that connects the second with the first.”

Why do so few people, even professional academics, seem to understand that ‘phenomena’ is the plural of ‘phenomenon’? I hear this all the time and it drives me crazy.

I would get out more but there’s a sea of slop at my front door.

Henry Farrell's avatar

It's a typo. My posts have many because I don't have a professional editor. I regularly correct several after publishing, because it is difficult to see mistakes easily when you write, and easier for cognitive reasons to see them after. I would suggest that there are more socially effective ways of getting people to change things than by claiming that "professional academics" don't "understand" something that you do? kthxbai

Laurence Woodward's avatar

Fair enough, criticism accepted. I apologise for misdirecting my irritation at you. I really have heard so many people say this recently, and at least three of them were academics (professionally!) I guess I felt the need to get it off my chest.

Henry Farrell's avatar

No worries and apology accepted.

Henry Farrell's avatar

[I in turn should not have been so peevish]

Alex Tolley's avatar

I think your criticism is justified. Prof. Farrell could easily use a tool to check for typos, yet apparently, doesn't. Krugman occasionally makes typos, too, so even Nobel laureates can be guilty. ;-)

Laurence Woodward's avatar

Don’t be absurd. All Nobel laureates are incapable of error, as just a cursory glance at the list of Peace prize winners will confirm ;)

But it wasn’t justified: I didn’t think it was a typo, but an instance of a trend I’ve observed (one might even say a phenomenon) of people genuinely not knowing the singular/plural of this word.

Alex Tolley's avatar

And yet AI, e.g., Grammarly, detects the problem that Laurence Woodward flags. While my writer wife won't use Grammarly because she sees how often it is wrong, I use it because I make lots of typos, and it flags them for correction. As typos are increasingly common in the MSM with the apparent death of copy editors (even the BBC *gasp* has the occasional typo or grammar error), I do wonder why simple spell-check and grammar tools are apparently not used. It isn't as if journalists and reporters are using manual typewriters. SMH.

Jeremy Fox's avatar

I confess I'm skeptical of the conclusions of that new Nature paper, given the extremely expansive definition of "AI" it uses (presumably in order to have sufficiently long time series to analyze). Any definition of "AI" that includes many decades-old statistical techniques like logistic regression and principal components analysis just seems too expansive to be useful. Whatever trends that paper is picking up just seem like long-term scientific trends that haven't been much altered by the onset of LLMs or any other tool that might fall within a not-overly-expansive definition of "AI." I mean, if PCA and logistic regression count as "AI", then why doesn't (say) linear regression count as "AI" too? Why not single-sample t-tests? Is all of statistics "AI" (and if so, does that mean all of statistics has been good for individual scientists but bad for science as a whole)?

Alex Tolley's avatar

Actually, "p-hacking" with a choice of readily available statistical tests and data mining is a well-recognized problem. Some years ago, JAMA required any study to register upfront the medical study hypothesis/hypotheses to be tested, in order to counteract data mining. IIRC, some years ago, there was an analysis of statistical tests used, which determined that about 1/3 of published papers applied incorrect statistical tests on the data. I'm guessing that university labs don't have access to an in-house statistician to check on the statistics used.

Derek Neal's avatar

Very interesting. Let's also remember that one of the first public examples of GPT-2, back in 2019, was the generation of an article in the genre of science journalism about researchers discovering unicorns in South America. What impressed people so much was that the LLM could imitate this sort of writing so convincingly, even when the subject matter itself was clearly made up. To avoid the situation described in the article (of scientific research becoming stagnant due to genrefication), it would seem to me that the humans in the equation would have to do some self-reflection and consciously work against genre. Of course, then their output might not be legible to their academic peers and so on, so the incentive structure in academia would have to change.

Alex Tolley's avatar

GenAI could create any number of "Sokal hoax" papers with no effort needed. Sometimes it appears that "AI summaries" do this unintentionally, too.

Derek Neal's avatar

For sure, I've read a few! (Student essays with completely fabricated sources, topics, etc.)

Carol Chapman's avatar

OMG, so many words. I think you brought up quality vs quantity - that AI is good for generating many articles in scientific publications. In a system where authors are promoted

Carol Chapman's avatar

Promoted based on quantity that works, but AI generated and reviewed

Carol Chapman's avatar

Misses the small variations that point to original ideas rather than associations on a theme. Survey data most vulnerable to thematic quantity and loss of originality.

Winston Smith London Oceania's avatar

The purveyors/profiteers of AI are pushing it like it's ready for prime time, and quite frankly, I don't believe that it really is.