"LLMs are, as Ted Chiang pithily put it, blurry JPEGs of human knowledge - that is, they are tools for generating lossy summarizations of large bodies of human knowledge."
A better metaphor might be that LLMs are microscopic images at, and below, the diffraction limit. Strictly speaking, one is skating on very thin ice if they take any object in a microscopic image that is below the diffraction limit as real (as opposed to an artifact). However, as video microscopy gained purchase in the 80s and 90s, it became clear that one could use motion to discriminate real objects from artifacts. Much as analyticity allows you to assume a type of continuity for a function, observed motion that obeys (or appears to be consistent with) physical law allows you to assume some smudge in the field of view is a vesicle on a kinesin motor running along a microtubule rather than just a typical smudge-artifact on a static field of view. I'm grossly simplifying, but it is possible to see objects that are below the diffraction limit using video microscopy that would be dismissed as mere artifacts on a static image. It's blurry like an overly compressed jpeg but there is also data that can be extracted by using clever technique even though it appears not to be there at all.
The trick with deep conv nets is that the multiple layers allow sequential correlations, rather than static correlations, to give one an opportunity to infer a causal process (like physical motion in the microscopic field). Unfortunately, the "one" who is given that opportunity is a computer program, so it doesn't really have the experience of cause-and-effect, built up through years of manipulating the environment. Thus the program will often infer causation through sequential correlations that are "unphysical" and thus spit out an extrapolation that is ridiculous to a human (so-called hallucinations). If there was some way for an LLM to back out the series of correlations that led to a particular result, when asked to explain itself, we might learn something about the kind of "unphysical" correlations that cause these errors.
Not to be That Guy, but on the matter of the syllabus, I've found good luck with starting the term with instruction on annotation and then applying our skills to the syllabus. Annotation and close reading are skills for any discipline. I just so happen to teach writing, so I layer in genre discussion. Part of the "read the syllabus" problem is that the syllabus is an incredibly complex rhetorical situation because of the multiple audiences (students, bureaucrats, other institutions, courts). I require students to annotate with both a comment/reaction as a reader (another layered lesson since we focus on audience in our work together) and a question a reasonable person might have (my effort to not only clarify matters on the syllabus but also to normalize questioning/not knowing). We engage in this activity via a shared doc in any course I teach.
Results of this effort include a steep reduction in one-off questions and a 5/5 on the end-of-term course survey metric, "The syllabus clearly outlines the policies and procedures of the course."
I agree with the arguments laid out, but there was one I was expecting to see, which it the role of TAs (to professors and as a TA themselves), which is to learn how to be a professor. The next generation of professors come out of learning the craft by being TAs. A lot of the painful work of being a professor gets handed over to TAs, which when the TA becomes a professor that handing off of tasks will be passed to their TAs.
You did bring up your use of a human TA and LLMTA together, which as augmentation to TAs may help take some of the laborious and painful tasks that get handed down to them.
Somehow, I missed this post when dealing with the All Day TA controversy last year. I agree with you that positioning is tricky. Actually, we were very mindful not to position it as a TA substitute because, as you say, it isn't one. But at the same time, explaining it as a TA makes it easy to communicate what it does -- this is especially true since 90% of university courses have no TA at all, so we can't possibly be replacing those.
As you also point out, there are lots of directions this can all take that make it look less like a traditional TA and more like an evolution to something else.
If you do use it in your classes, please let us know how it goes.
I hope the founders did a fair amount of stress-testing prior to making this go live. The author mentions NoteBookLM of which this is simply a more refined version of with the ability to view the chats of the students in the class (helpful). But any teacher can do something similar by setting up their own NoteBook with all their course material and sharing the link with students. The problem arises when, as I believe the author also notes, the more sources and files provided, the more likely something goes awry. If syllabi and other dense material with detailed information becomes convoluted, even in small ways (a due date is misidentified or misreported in a response), then it defeats the whole system. The mere fact that AI "hallucinations" exist (though some rightly point out that every AI output from an LLM is "hallucinated") make these kind of large experiments fraught. Glad someone else is the canary in the coal mine. I predict not necessarily disaster, but a steady stream of reported mistakes that will turn off many and serve as ammo for the AI Fight Club. As good and interesting as some of the models are getting, they are still just not ready for prime time, something many businesses are learning the hard way. And the more I read, the more I fear the "hallucination" issue is a feature not a bug and just might be baked into the system. That will be a problem.
Summarising is the use case for LLMs most people believe is reasonably doable (I did). But under the positive examples lurks a bit of nuance, for instance when the system *seems* to summarise but doesn't (e.g. either the data used to train the base model dominates, ignoring the actual thing to be summarised, or it doesn't summarise as much but 'shortens' (and there is a difference). Illustrated at: https://ea.rna.nl/2024/05/27/when-chatgpt-summarises-it-actually-does-nothing-of-the-kind/
I've settled so far on the following non-technical definition: GenAI "approximates the result of understanding without having any understanding of its own".
(In that sense, GPT o1/o3 are interesting as it approximates not so much understanding itself directly, but the 'use of the steps seen when dissecting logical understanding' (a.k.a. CoT reasoning). It becomes an — expensive — approximation of an approximation (for a special set of cases). o3 scoring well on ARC-AGI-1-PUB is impressive, but the fact that it fails on some easy cases is an example of that approximation-machine fundamental, a hard case might be approximated correctly while easier cases might fail — even when thrown insane amounts of compute energy at the problem. This doesn't happen with humans.)
Looking at it that way, it is very difficult to use it in places where reliability is important (you need all sorts of precautions and processes around it and it quickly becomes hard to use profitably), but it can do a productive job when that is less the case (e.g. when the details of an image — such as the one you use for your article here — doesn't really matter).
Attached to this 'approximation machine' is an ever growing zoo of special purpose techniques to rein in the negative effects or improve certain productive uses. The demos and examples often shine, but the reality is often less shiny. And it becomes less and less possible to get a feel for the — likely byzantine — proprietary approaches taken by the OpenAIs of this world.
Because of the huge sums spent by large players on the training of LLMs, a lot rests on the outcome of the technology. The "inflated expectations" stage is being promoted by these players and enablers. We have probably reached the peak of this phase as further scaling is failing to deliver much improvement. The "Trough of Disillusionment" awaits what may be a bust, even possibly a new "AI Winter". I do see signs that we will reach a stage where the technology has usefulness, although widespread use requires it to be far less expensive to run and pay for. On its own, LLMs are not reliable enough for serious work, but they are adequate as human-like interfaces, and like humans, their responses may not be accurate and so should not be accepted as correct. I think specialized LLMs using supporting technology will prove useful if they can demonstrate accuracy and are cheap enough to deploy. In the academic setting, I suspect the publishers will push their offerings around textbooks.
"LLMs are, as Ted Chiang pithily put it, blurry JPEGs of human knowledge - that is, they are tools for generating lossy summarizations of large bodies of human knowledge."
A better metaphor might be that LLMs are microscopic images at, and below, the diffraction limit. Strictly speaking, you are skating on very thin ice if you take any object in a microscopic image that is below the diffraction limit as real (as opposed to an artifact). However, as video microscopy gained purchase in the 80s and 90s, it became clear that one could use motion to discriminate real objects from artifacts. Much as analyticity allows you to assume a type of continuity for a function, observed motion that obeys (or appears to be consistent with) physical law allows you to assume some smudge in the field of view is a vesicle on a kinesin motor running along a microtubule rather than just a typical smudge-artifact on a static field of view. I'm grossly simplifying, but video microscopy makes it possible to see objects below the diffraction limit that would be dismissed as mere artifacts in a static image. The image is blurry like an overly compressed JPEG, but there is also data that can be extracted with clever techniques, even though it appears not to be there at all.
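A toy numpy sketch of that idea (purely illustrative, made-up numbers, not real microscopy code): a faint moving blob is invisible in any single noisy frame, but shift-and-add averaging along its trajectory pulls it well above the noise floor.

```python
# Toy illustration: a faint moving blob buried in per-frame noise becomes visible
# once frames are co-added along its trajectory (track-before-detect style).
import numpy as np

rng = np.random.default_rng(0)
size, n_frames = 64, 200
amp, sigma = 0.5, 1.5                    # blob peak 0.5, far below single-frame noise extremes
y0, x0, vy, vx = 20.0, 10.0, 0.05, 0.2   # starting position and per-frame drift (pixels)

yy, xx = np.mgrid[0:size, 0:size]
frames = np.empty((n_frames, size, size))
for t in range(n_frames):
    cy, cx = y0 + vy * t, x0 + vx * t
    blob = amp * np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * sigma ** 2))
    frames[t] = blob + rng.normal(0.0, 1.0, (size, size))   # additive noise, std = 1

print("single-frame max:", frames[0].max())   # typically ~3.5-4: just noise extremes, blob invisible

# Co-add frames after undoing the (assumed known) drift: noise averages down as 1/sqrt(N),
# while the co-moving blob adds coherently.
aligned = np.zeros((size, size))
for t in range(n_frames):
    aligned += np.roll(np.roll(frames[t], -round(vy * t), axis=0), -round(vx * t), axis=1)
aligned /= n_frames

print("aligned-stack max:", aligned.max())    # ~0.45-0.55: the blob, roughly 2x the residual noise peaks
print("residual noise scale:", 1 / np.sqrt(n_frames))
```

The velocity is assumed known here; in practice it would have to be estimated, which is exactly the "clever technique" part.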
The trick with deep conv nets is that the multiple layers allow sequential correlations, rather than static correlations, to give one an opportunity to infer a causal process (like physical motion in the microscopic field). Unfortunately, the "one" who is given that opportunity is a computer program, so it doesn't really have the experience of cause and effect built up through years of manipulating the environment. Thus the program will often infer causation from sequential correlations that are "unphysical" and spit out an extrapolation that is ridiculous to a human (so-called hallucinations). If there were some way for an LLM to back out the series of correlations that led to a particular result when asked to explain itself, we might learn something about the kind of "unphysical" correlations that cause these errors.
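A toy sketch of the static-versus-sequential distinction, with made-up signals: two series whose instantaneous correlation is near zero can show a strong lagged correlation, which is exactly the kind of temporal structure that invites a causal reading, warranted or not.

```python
# Two signals: y "responds" to x seven steps later, plus noise.
# Their instantaneous (static) correlation is ~0, but the lag-7 correlation is strong.
import numpy as np

rng = np.random.default_rng(1)
n, lag = 5000, 7
x = rng.normal(size=n)
y = np.roll(x, lag) + 0.5 * rng.normal(size=n)

def corr_at_lag(a, b, k):
    """Pearson correlation between a[t] and b[t+k]."""
    return np.corrcoef(a[: len(a) - k], b[k:])[0, 1]

print("static correlation (lag 0):", round(corr_at_lag(x, y, 0), 3))      # ~0.0
print("sequential correlation (lag 7):", round(corr_at_lag(x, y, lag), 3)) # ~0.9

# The lag-7 correlation is real structure, but it is still only correlation: without
# outside knowledge (physics, experiment, intervention), a learner can just as easily
# latch onto a spurious sequential pattern, which is the "unphysical" inference above.
```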
Not to be That Guy, but on the matter of the syllabus, I've found good luck with starting the term with instruction on annotation and then applying our skills to the syllabus. Annotation and close reading are skills for any discipline. I just so happen to teach writing, so I layer in genre discussion. Part of the "read the syllabus" problem is that the syllabus is an incredibly complex rhetorical situation because of the multiple audiences (students, bureaucrats, other institutions, courts). I require students to annotate with both a comment/reaction as a reader (another layered lesson since we focus on audience in our work together) and a question a reasonable person might have (my effort to not only clarify matters on the syllabus but also to normalize questioning/not knowing). We engage in this activity via a shared doc in any course I teach.
Results of this effort include a steep reduction in one-off questions and a 5/5 on the end-of-term course survey metric, "The syllabus clearly outlines the policies and procedures of the course."
Do I pick up my Easter egg during office hours?
TAs are almost always graduate students and their work as TAs is an important element of their graduate education.
As surgical pedagogues say, watch one, do one, teach one. You don’t really know something until you can teach it.
I agree with the arguments laid out, but there was one I was expecting to see, which is the role of TAs (both for professors and for the TAs themselves): it is how you learn to be a professor. The next generation of professors learns the craft by being TAs. A lot of the painful work of being a professor gets handed over to TAs, and when those TAs become professors, they will hand the same tasks off to their own TAs.
You did bring up your use of a human TA and an LLMTA together; as an augmentation, that may help take over some of the laborious and painful tasks that get handed down to TAs.
Dear Henry.
Somehow, I missed this post when dealing with the All Day TA controversy last year. I agree with you that positioning is tricky. Actually, we were very mindful not to position it as a TA substitute because, as you say, it isn't one. But at the same time, explaining it as a TA makes it easy to communicate what it does -- this is especially true since 90% of university courses have no TA at all, so we can't possibly be replacing those.
As you also point out, there are lots of directions this can all take that make it look less like a traditional TA and more like an evolution to something else.
If you do use it in your classes, please let us know how it goes.
Best
Joshua
I hope the founders did a fair amount of stress-testing prior to making this go live. The author mentions NotebookLM, of which this is simply a more refined version, with the ability to view the chats of the students in the class (helpful). But any teacher can do something similar by setting up their own Notebook with all their course material and sharing the link with students. The problem, as I believe the author also notes, is that the more sources and files you provide, the more likely something goes awry. If syllabi and other dense, detail-heavy material become convoluted, even in small ways (a due date misidentified or misreported in a response), it defeats the whole system. The mere fact that AI "hallucinations" exist (though some rightly point out that every AI output from an LLM is "hallucinated") makes these kinds of large experiments fraught. Glad someone else is the canary in the coal mine. I predict not necessarily disaster, but a steady stream of reported mistakes that will turn off many and serve as ammo for the AI Fight Club. As good and interesting as some of the models are getting, they are still just not ready for prime time, something many businesses are learning the hard way. And the more I read, the more I fear the "hallucination" issue is a feature, not a bug, and just might be baked into the system. That will be a problem.
Good piece. Often there is nuance.
Summarising is the LLM use case most people believe is reasonably doable (I did too). But under the positive examples lurks a bit of nuance, for instance when the system *seems* to summarise but doesn't: either the data used to train the base model dominates, ignoring the actual thing to be summarised, or it doesn't so much summarise as 'shorten' (and there is a difference). Illustrated at: https://ea.rna.nl/2024/05/27/when-chatgpt-summarises-it-actually-does-nothing-of-the-kind/
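A crude illustration of one way to catch the first failure mode (my own naive sketch, not from the linked article): flag "summary" sentences whose content words barely appear in the source, a rough sign that the model is drawing on its prior rather than the document. Real faithfulness checks (NLI- or QA-based) are far more robust than this.

```python
# Naive grounding check: which summary sentences share almost no content words with the source?
import re

def content_words(text: str) -> set[str]:
    stop = {"the", "a", "an", "of", "and", "or", "to", "in", "is", "are", "was", "were",
            "it", "that", "this", "for", "on", "as", "with", "by", "be", "at"}
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in stop}

def flag_ungrounded(source: str, summary: str, min_overlap: float = 0.5) -> list[str]:
    """Return summary sentences whose content words mostly do not appear in the source."""
    src = content_words(source)
    flagged = []
    for sent in re.split(r"(?<=[.!?])\s+", summary.strip()):
        words = content_words(sent)
        if words and len(words & src) / len(words) < min_overlap:
            flagged.append(sent)
    return flagged

source = "The committee met on 12 May and postponed the budget vote to June."
summary = ("The committee met in May and delayed the budget vote. "
           "The delay reflects longstanding tensions over austerity policy.")
print(flag_ungrounded(source, summary))
# -> ["The delay reflects longstanding tensions over austerity policy."]  (nowhere in the source)
```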
I've settled so far on the following non-technical definition: GenAI "approximates the result of understanding without having any understanding of its own".
(In that sense, GPT o1/o3 are interesting, as they approximate not so much understanding itself but the 'use of the steps seen when dissecting logical understanding' (a.k.a. CoT reasoning). They become an — expensive — approximation of an approximation (for a special set of cases). o3 scoring well on ARC-AGI-1-PUB is impressive, but the fact that it fails on some easy cases is an example of that approximation-machine fundamental: a hard case might be approximated correctly while easier cases fail, even when insane amounts of compute energy are thrown at the problem. This doesn't happen with humans.)
Looking at it that way, it is very difficult to use it in places where reliability is important (you need all sorts of precautions and processes around it, and it quickly becomes hard to use profitably), but it can do a productive job when that matters less (e.g. when the details of an image — such as the one you use for your article here — don't really matter).
Attached to this 'approximation machine' is an ever-growing zoo of special-purpose techniques to rein in the negative effects or improve certain productive uses. The demos and examples often shine, but the reality is often less shiny. And it becomes less and less possible to get a feel for the — likely byzantine — proprietary approaches taken by the OpenAIs of this world.
Much of the "AI debate" are the competing POVs of adoption overlayed on the "Gartner Hype Cycle" https://en.wikipedia.org/wiki/Gartner_hype_cycle
Because of the huge sums spent by large players on the training of LLMs, a lot rests on the outcome of the technology. The "inflated expectations" stage is being promoted by these players and their enablers. We have probably reached the peak of this phase, as further scaling is failing to deliver much improvement. The "Trough of Disillusionment" awaits, and it may be a bust, possibly even a new "AI Winter". I do see signs that we will reach a stage where the technology is useful, although widespread use requires it to be far less expensive to run and pay for. On their own, LLMs are not reliable enough for serious work, but they are adequate as human-like interfaces, and, like humans, their responses may not be accurate and so should not be accepted as correct. I think specialized LLMs using supporting technology will prove useful if they can demonstrate accuracy and are cheap enough to deploy. In the academic setting, I suspect the publishers will push their offerings around textbooks.