So I thought that the last post of the year would be a sedate round-up of what I’d written over the last twelve months and what I hoped to do next. Perhaps I’ll still write that. Now, instead, another post about why public debates over AI are so irritating.
My current cause for vexation is the controversy over AllDayTA.com, a new Large Language Model (LLM)-based service for professors. Many academics are unhappy that this technology exists, and have expressed their feelings about the service’s creators (who are themselves academics) in plain and forceful language. The service’s creators have expressed their own unhappiness in turn.
Like so many other disputes, this one was swiftly pulled into the all-devouring maelstrom of what I call “AI Fight Club,” a ritualized combat, waged bout after bout between two highly stylized positions.* Position One is that scaling (feeding LLMs more data and more compute) will lead AI to transform everything, replacing knowledge workers with algorithmic processes that effectively automate sentient activity, opening the gate into Artificial General Intelligence and the post-Singularity paradise. Position Two is the counter-claim that scaling doesn’t work, and LLMs are useless but still changing everything for the worse, as management replaces human workers with automated bullshit machines. It’s a no-holds-barred wrestling competition between two starkly opposed perspectives on the world that keeps on going, and going, and going.
The problem, as the AllDayTA spat illustrates, is that neither of these positions provides a particularly good guide to how the technologies are actually developing. Still, the black-hole binary exerts an irresistible gravitational force, bending and distorting the ways that we think and talk so that they better conform to the terms of dispute. Most protagonists probably don’t buy into all, or even most, of the claims of their side, but that doesn’t matter as much as you might think. The first rule of AI Fight Club is not that you don’t talk about it, but that anything you do say will be sucked into AI Fight Club, so that it is interpreted as supporting one or the other position.
Now to the specifics. You can get a sense of what AllDayTA does from what the website says,** as well as this video from Josh Gans, one of the two founders. The video also provides a clue as to why the debate started off on highly unfortunate terms. Gans describes the product as “a TA, driven by artificial intelligence, trained by professors, [that] gives the right answers 24/7 to your students.” The sell is targeted at teaching academics (although I can’t help wondering if it has been slightly adapted from some VC pitch or other): telling them that their students are already using AI, and that this will give them an opportunity to use it properly.
Whatever its origin, this sales pitch draws on the rhetoric of Position One in AI Fight Club. Who needs human TAs, when you have a sleepless AI system that will do the hard work of TAing for cheap? Unsurprisingly, that has provoked an outraged response that comes straight outta Position Two: this is replacing real human employees with an automated bullshit engine! And so the stylized fight begins.
I (very slightly) know people on both sides of this fight - what I write is intended neither to criticize anyone personally, nor, for that matter, to get them together to resolve their differences in a group singalong. I think that their disagreements reflect actually important arguments that we need to have. Still, the reflection is badly distorted by the rhetorical gravitational field. When the technology doesn’t do what either its creators or detractors suggest it does, it is really hard to think straight about what its plausible consequences are. And that is what seems to be happening here.
As best I can see from the actual detailed description of what the LLM does, it is not, actually, a human TA replacement, or plausibly set to become one. Instead, it is better understood as a course syllabus that speaks, combined with a self-summarizing reading packet. In other words, it looks to automate the parts of being a professor - or TA - that actual professors and TAs hate doing.
“Read the syllabus” is a tired and painful joke among professors for an excellent reason. Many students do not, in fact, read syllabi, prompting umpteen questions throughout the semester about policies and assignments that are explained at length, and in writing. This is not improved by university policies, which increasingly mandate the inclusion of various forms of organizational boilerplate in syllabi, so as to propitiate this review board or that micro-organelle of the internal hierarchy.*** So too for the more obvious questions about what the course readings say and mean.
So this technology, if it works reasonably well, has actual uses in the classroom. But these uses do not replace human TAs. If they work, they will supplement them. That … has not emerged very clearly in the debate … in part because of the predictable rhetorical forms used to sell the technology and in part because of the predictable responses to that rhetoric. In the world of AI Fight Club, people both sell and condemn AI as a means for replacing humans, even when it isn’t one, and even when its creators apparently don’t mean it to be.
The result is that the wrong questions get asked. Here are some questions that I think we should actually debate instead (the list is not exhaustive).
First - how likely is it that the technology will indeed work reasonably well? In my personal experience, task-specific LLMs, which look to summarize a specific body of information (usually on the basis of a general LLM with some particular tweaking), can do a reasonably solid job at summarizing information. But they still make significant mistakes - especially when they interpolate together different sources and look to draw out their relations and synthesize them. They also (I suspect on general principles) are going to be better at elucidating general patterns and obvious points than at rare ones that are under-represented in their training data. This means that they are not going to be particularly original, or good at drawing out the more subtle relationships or implications.
Second - will there be robust mechanisms that will allow course-correction when the technology fails? The YouTube description suggests that there are some. The system apparently provides links to the source or sources that it bases its summarization on, so that people can see what it is summarizing or synthesizing (here it resembles e.g. NotebookLM). It also provides the professor with access to the questions that students have asked and the answers that have been provided (I sketch what such a mechanism might look like, in toy form, just after this list of questions). That sounds promising - but getting students to actually check the sources, and not just rely on the summary, will require hard work from the professor and TAs.
Third - what will the existing university hierarchy make of it? As mentioned already, standard syllabi have already become vectors through which university-wide policies are communicated. One failure mode is that models like this will enable the further proliferation of organizational ritual. Another is that higher-level university officials are sometimes … not as familiar … with the strengths and weaknesses of new technologies as they ought to be, and may have unrealistic expectations of what this technology can do. Furthermore, just like existing Learning Management Systems such as Canvas and Blackboard, this system is going to be deployed for a very wide variety of use cases, some better fitted than others. There is an entire political economy behind how technologies like this are adopted and used, and both professors and TAs might reasonably be nervous about how this will change the university, perhaps in ways that are adverse to their interests.
Fourth - speaking of adversity, how will systems like this fare in an adversarial environment? It does seem to have been stress-tested in the classroom, but I don’t know whether it has been exposed to the true ingenuity of students with strong incentives to game the syllabus. Students and professors sometimes have different interests over how the rules of a course ought to be interpreted, and students sometimes have quite strong reasons to prefer one interpretation over another. Ingenious queries might produce unexpected results. Based on other LLMs that have been mandated to interpret rules, I anticipate some lively disputes arising.
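To make the course-correction question a little more concrete, here is a minimal sketch of how a citation-grounded course assistant might be wired up. This is emphatically not AllDayTA’s actual implementation, which is not public: the keyword-overlap retrieval, the complete() call, and the JSON review log are all stand-ins, and the names are mine. The point is just the two properties the video description mentions - answers tied back to named sources, and every question-and-answer pair landing somewhere the professor can audit.

```python
import json
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    source: str  # e.g. "syllabus, p.3" or "week-4-reading.pdf" -- whatever the professor uploaded
    text: str

class CourseAssistant:
    def __init__(self, chunks, log_path="qa_review_log.jsonl"):
        self.chunks = chunks        # the professor's own materials, pre-split into chunks
        self.log_path = log_path    # where every Q&A pair is recorded for later review

    def _retrieve(self, question, k=3):
        """Crude stand-in for semantic retrieval: rank chunks by word overlap with the question."""
        q_words = set(re.findall(r"\w+", question.lower()))
        ranked = sorted(
            self.chunks,
            key=lambda c: len(q_words & set(re.findall(r"\w+", c.text.lower()))),
            reverse=True,
        )
        return ranked[:k]

    def ask(self, question):
        """Answer only from retrieved course material, and keep a record the professor can inspect."""
        support = self._retrieve(question)
        prompt = (
            "Answer the student's question using ONLY the excerpts below. "
            "If they do not contain the answer, say so and suggest contacting the professor.\n\n"
            + "\n\n".join(f"[{c.source}]\n{c.text}" for c in support)
            + f"\n\nQuestion: {question}"
        )
        answer = complete(prompt)  # hypothetical model call -- not any real vendor's API
        record = {
            "question": question,
            "answer": answer,
            "sources": [c.source for c in support],  # the links a student (or professor) can check
        }
        with open(self.log_path, "a") as f:  # the professor's review trail
            f.write(json.dumps(record) + "\n")
        return record

def complete(prompt):
    """Placeholder for whatever model the real service actually calls."""
    return "(model output would go here)"
```

Even in this toy version, the hard problems are visible: the retrieval step decides what the model ever sees, and the review log only helps if somebody actually reads it.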
My personal guess - which reflects all the reasons why I am a carping professor rather than an edutech entrepreneur - is that some of these questions will turn out to have vexing and complicated answers. Indeed, it may well turn out that systems like AllDayTA do not save labour so much as they transform it. They automate some tasks that you want automated, but they require new kinds of skilled interpretation and management to use well.
Which is, in itself, actually a perverse and contrary reason for teaching courses using them. I am teaching an undergraduate course on Democracy and AI next semester (with an excellent human TA), where I am quite likely to use AllDayTA, because I can’t imagine a better practical way to familiarize students with the strengths and weaknesses of LLMs in a context that matters to them. Want to know what LLMs can do and what they can’t? Put them to the test! Perhaps I’ll offer bonus marks for students who torture the syllabus and readings in original ways that provide startling and unexpected results. They’ll learn if they succeed, and they will learn by trying.
And this, perhaps, generalizes. LLMs are, as Ted Chiang pithily put it, blurry JPEGs of human knowledge - that is, they are tools for generating lossy summarizations of large bodies of human knowledge. That (contra Ted’s own interpretation) can be quite useful, in the same way, for example, that statistics, another, more familiar form of lossy summarization, is useful. When you quantify data, enormous amounts of information are thrown out. Behind every great dataset, there lies a great crime. But nonetheless, lossy and manipulable representations of complex bodies of information are very handy things to have!
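To see quite how lossy a statistical summary is, here is a toy illustration (the numbers are made up purely for the example): a classroom’s worth of exam scores and a two-point dataset manufactured from its mean and standard deviation become indistinguishable once you keep only those two summary numbers.

```python
from statistics import mean, pstdev

# Ten made-up exam scores, purely illustrative.
scores = [52, 61, 65, 70, 71, 74, 78, 83, 88, 95]

m, s = mean(scores), pstdev(scores)

# A completely different "dataset", engineered to share the same summary:
# two points, one standard deviation either side of the mean.
compressed = [m - s, m + s]

print(f"original:   n={len(scores)},  mean={mean(scores):.2f},  sd={pstdev(scores):.2f}")
print(f"compressed: n={len(compressed)},  mean={mean(compressed):.2f},  sd={pstdev(compressed):.2f}")
# Both lines report the same mean and standard deviation: the summary cannot
# tell ten students from two synthetic points. That is the lossiness - and also
# the usefulness, since the whole point is to throw detail away.
```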
So the real benefits and problems are not, contrary to the rhetoric, that these tools actually substitute for human beings. AllDayTA will help automate some drudge labor - at the cost of some murky politics. Deploying it will require some training, not simply of professors and TAs but of the students too. But that training will be more broadly useful.
I fully expect that these tools of summarization will be deployed at scale by organizations in the world. People who understand their strengths and weaknesses will be better able to evaluate what you can and can’t do with them. And university classrooms, with specialized systems, are a very good place to develop such an understanding, and even to think more broadly about the organizational politics that they entail.
* Interestingly, the third position that used to generate debate - “Existential Risk” - is beginning to fade in importance. I suspect that the reason is that AI companies have incentives to let go of their risk teams, or at least to soft-pedal their concerns, if they wish to get funding.
** I should be clear that I know nothing about this controversy beyond what has emerged in public over the last couple of days. On the upside, this means that I’ve no hidden agenda. On the downside, I may misinterpret what I read and see.
*** For the past couple of years, I’ve had an ‘if you read this sentence, contact the professor for a reward’ easter egg, concealed in one particularly dense thicket of ritual verbiage. Not one student has yet contacted me. Now that I have revealed this, I will have to desist for a couple of years, for fear that some students read this newsletter.
"LLMs are, as Ted Chiang pithily put it, blurry JPEGs of human knowledge - that is, they are tools for generating lossy summarizations of large bodies of human knowledge."
A better metaphor might be that LLMs are microscopic images at, and below, the diffraction limit. Strictly speaking, one is skating on very thin ice if one takes any object in a microscopic image that is below the diffraction limit as real (as opposed to an artifact). However, as video microscopy gained purchase in the 80s and 90s, it became clear that one could use motion to discriminate real objects from artifacts. Much as analyticity allows you to assume a type of continuity for a function, observed motion that obeys (or appears to be consistent with) physical law allows you to assume some smudge in the field of view is a vesicle on a kinesin motor running along a microtubule rather than just a typical smudge-artifact on a static field of view. I'm grossly simplifying, but it is possible to see objects that are below the diffraction limit using video microscopy that would be dismissed as mere artifacts on a static image. It's blurry like an overly compressed JPEG, but there is also data that can be extracted with clever technique, even though it appears not to be there at all.
The trick with deep conv nets is that the multiple layers allow sequential correlations, rather than static correlations, to give one an opportunity to infer a causal process (like physical motion in the microscopic field). Unfortunately, the "one" who is given that opportunity is a computer program, so it doesn't really have the experience of cause-and-effect, built up through years of manipulating the environment. Thus the program will often infer causation through sequential correlations that are "unphysical" and so spit out an extrapolation that is ridiculous to a human (so-called hallucinations). If there were some way for an LLM to back out the series of correlations that led to a particular result, when asked to explain itself, we might learn something about the kind of "unphysical" correlations that cause these errors.
Not to be That Guy, but on the matter of the syllabus, I've had good luck starting the term with instruction on annotation and then applying our skills to the syllabus. Annotation and close reading are skills for any discipline. I just so happen to teach writing, so I layer in genre discussion. Part of the "read the syllabus" problem is that the syllabus is an incredibly complex rhetorical situation because of the multiple audiences (students, bureaucrats, other institutions, courts). I require students to annotate with both a comment/reaction as a reader (another layered lesson, since we focus on audience in our work together) and a question a reasonable person might have (my effort not only to clarify matters on the syllabus but also to normalize questioning/not knowing). We engage in this activity via a shared doc in any course I teach.
Results of this effort include a steep reduction in one-off questions and a 5/5 on the end-of-term course survey metric, "The syllabus clearly outlines the policies and procedures of the course."