Google AI fails the taste test
Large language models have complicated consequences for brands
Coca-Cola’s 1985 decision to change Coke’s flavor was a notorious business disaster. The company had carefully taste-tested the new flavor and found, in its experiments, that consumers seemed to prefer it. But when the product was launched, people hated it: they wanted to stick with what they knew. After a few years of elaborate corporate gymnastics (for a while, confusingly branded versions of the old and new flavors competed with each other), the company admitted defeat and reverted to the original.
And it’s arguably not the most ridiculous brand disaster of recent history. In 1996, after spending hundreds of millions of dollars, Procter and Gamble launched Olestra, a synthetic fat that your body couldn’t absorb or get fatter from. At the beginning it seemed as though Olestra was going to be a huge money maker. Procter and Gamble had lobbied the regulators into submission, and its opponents were badly underfunded. And who could argue with fat-free Pringles that tasted as though they were full fat?
Some test participants did complain about Olestra’s side effects, such as “occasional soiling” and “intestinal rumblings.” Sometimes, the undigestible fat just slid on through, bringing a little detritus along with it. But it wasn’t until some anonymous genius came up with an attention-grabbing descriptor for these unfortunate side effects, “anal leakage,” that the marketing nightmare began. If your food product becomes associated in the public mind with anal leakage, it is doomed. It doesn’t help to insist that the leaking is relatively uncommon, and that apart from this one little side effect, which might not even happen to you, you can binge on junk food to your heart’s content without gaining weight.* People, being people, are just going to obsess over the leakage and avoid the product like the plague.
And now … Google? Google’s launch of AI search has gotten lots of people yelling that they prefer the old product, but it hasn’t quite degenerated into Olestra’s drip-drip of noxious publicity. Yet. Critics have gleefully pointed to AI search results saying that maybe you should eat a rock a day, that you should mix in glue to stop the cheese of your pizza sliding off, and that thirteen U.S. presidents graduated from the University of Wisconsin at Madison, earning 59 degrees. Stories like this foster mean-spirited jokes and urban legends that may stick around to hurt your brand name for years.
So what went wrong? One plausible theory is that AI search is a misnomer. Google seemingly expects large language models to devour search, but they’re quite poorly suited to powering search engines or presenting their results. The new approach creates two problems. First, it fosters errors. Second, it parks the responsibility for those errors squarely with Google.
Online search has been in trouble for a while, for a variety of reasons. Google’s original approach to search, which generated fantastic levels of revenue when tied to ads, was to mine the structure of weblinks for information. If you want to figure out which sources of information on the Internet are the most valuable or reliable, the best way may be to discover what other people think those sources are, relying on the number and quality of links pointing to an online resource as a rough proxy for its quality. The more incoming links a website gets from reliable places, the more likely it is to be a high quality source. It’s more complicated than that - but that’s the basic insight.
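For the technically curious, here is a minimal, purely illustrative sketch of that insight: a toy power iteration in the spirit of PageRank, run over a made-up four-page web. It is not Google’s actual algorithm (which is vastly more elaborate, and these days only one signal among many); the page names, damping factor, and iteration count are arbitrary choices for the example.

```python
# Toy illustration of link-based ranking: a page matters if pages that
# matter link to it. Purely a sketch of the idea, not a production system.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links out to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            if not outgoing:  # a dead-end page spreads its weight evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:
                for target in outgoing:
                    new_rank[target] += damping * rank[page] / len(outgoing)
        rank = new_rank
    return rank

# A made-up web in which everyone links to "news".
toy_web = {
    "news": ["blog"],
    "blog": ["news", "forum"],
    "forum": ["news"],
    "shop": ["news", "blog"],
}
print(sorted(pagerank(toy_web).items(), key=lambda kv: -kv[1]))
```

Run it and "news", the page everyone links to, floats to the top: that’s the whole trick.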
Applying this insight, of course, generated its own problems, especially as people figured out that Google search traffic could help them make money on the Web. The result was cat-and-mouse games between Google and content farms trying to figure out ways to game Google’s algorithm to make themselves seem more important than they were. A second problem was more fundamental. As Google’s founders recognized, the relationship between Google’s service (linking to high quality information) and its business model (getting money from advertisers) is potentially toxic - it is really really hard to avoid the temptation to send people to the links that are most profitable, which allows you to make lots of money in the short term, but radically degrades your service’s usefulness in the long run. The final big problem was that Google found itself in direct business competition with news media - which generated some of the high quality information that Google relies on. Both are funded by ads, and the more that Google and a couple of other platforms swallowed up the available ad money, the less there was to support the information system that it claimed to be organizing and making more useful.
All this generated some nasty politics, especially as Google moved on from flirting with problem #2 into a full-on toxic relationship. But now, Google’s move to AI search is creating a whole variety of new difficulties. It isn’t just that it fosters errors, but that the errors are tied to Google’s name in a way that they weren’t previously.
Here, it’s useful to think of the old algorithm - call it Google Original Flavor - and the AI approach (New Google) as two different ways of representing an underlying body of knowledge. Google Original Flavor (based on the PageRank algorithm, its variants and descendants) responds to a search query by presenting an ordered list of links you can click through. These are the algorithm’s best guesses as to the places on the Web that are most relevant to your query. Increasingly, this list-based representation is corrupted by mini-reconstructions of the websites that you might want to visit, and by ads that are increasingly difficult to distinguish from ‘real’ links.
New Google is an “AI Overview” that lives, for the moment, on top of that list. It is a summarization, generated by a large language model, that purports to provide a different kind of response to search queries. Rather than a set of weblinks, it provides an actual answer.
This isn’t entirely new. Google has generated answers by other means for a while - search on “Henry Farrell” and you’ll find that Google produces a brief bio, based on my Wikipedia page, beside the search results. But it has never suggested that any of these ways of generating answers were more than a sideshow, still less that they were likely to result in a fundamental transition for its business model and the world.
Both the old and the new means of representation are lossy. Compressing all the knowledge available on the Internet into a one-dimensional, rank-ordered set of ten links (very few people indeed click past the first page) necessarily discards a lot of information. So too does the LLM underlying the AI Overview system. As a result, both make mistakes. It could be that the #1 link that Google Original Flavor suggests to you is not the best link to click on. It might be wildly misleading. But people are likely to treat mistakes very differently under the old and new approaches.
Google Original Flavor lucked into a very sweet epistemological deal. People seemed, by and large, to treat it as highly authoritative. But Google did not usually have to take the heat when it got things wrong.
On the one hand, people do believe in Google results. Francesca Tripodi has an interesting book, which talks about the collision between the biblical interpretation practices of fundamentalist Christians and search engines. Crudely summarized, she suggests that some Christians subscribe to Google inerrancy - if something comes up in the first few links, it must be right! But as she discusses in passing, other people, too, think similarly: “Psychologists and internet researchers have found that the order of results influences how credible or trustworthy people think the information is.”
On the other, when Google Original Flavor clearly gets things wrong, it isn’t on the hook. It is not itself asseverating** to the correctness of the sources that it links to - instead it is merely indexing the knowledge provided by other people, for whom Google itself takes no responsibility.
That sweet deal is not available to New Google. The new approach isn’t just linking to other people’s stuff. It’s presenting what appears, to ordinary users, to be Google’s own summarization of the best state of knowledge on a particular topic. People may still often continue to treat this as inerrant. But when there are obvious, ridiculous mistakes, they are likely to treat Google as directly responsible for these mistakes. Google has, effectively, put its own name to them.
This is brought out even more clearly if you look at the actual examples of New Google getting it wrong that everyone is pointing to. Lots of people are saying that these mistakes are “hallucinations,” where the large language model lacks data and interpolates the most statistically plausible response. But all of the examples I’ve looked at seem to be more accurately described as Gopnik errors, where the model fixes on a response that is present in both the training data and the model, and is statistically associated with the prompt, but that lacks real-world validity. Or, put in plainer language, the problem is that large language models can’t take a joke.
Take, for example, the advice to eat one rock a day. I checked this out myself (it still worked a couple of days ago).
It was clear that the model hadn’t generated this advice on its own. Instead, the AI Overview linked to a site that reproduced this claim, which it had gotten for complicated reasons from The Onion. A human reader would (probably) have gotten the joke. Large language models can’t, for the reasons that Gopnik and her colleagues lay out. A lot of the weirder results seem to draw on America’s Finest News Source (tm)! Very likely, Google has taken measures to counteract this, by deprecating The Onion, or not including it in the training set. Equally likely, these measures aren’t sufficient (funny stories spread far from their original sources and often fail to provide proper attribution). And recognizing deadpan humor is a particularly difficult challenge for large language models, since the funniness involves presenting obvious absurdities in ways that make them sound true.
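Nobody outside Google knows exactly how AI Overviews are wired together, but the general pattern they resemble - retrieve the top documents for a query, then have a model compress them into a single answer - is easy to sketch. Everything below is a hypothetical toy: the retriever is a crude word-overlap ranking and the “summarizer” just stitches sentences together rather than calling a real language model, but it shows how a deadpan joke in a retrieved page flows straight into the “answer.”

```python
# A toy retrieve-and-summarize pipeline (an assumption about the general
# pattern, not Google's actual system). The "summarizer" below is a trivial
# extractive stand-in for a large language model.

def retrieve(query, index, k=3):
    """Crude retrieval: rank documents by word overlap with the query."""
    words = set(query.lower().split())
    scored = sorted(index.items(),
                    key=lambda kv: -len(words & set(kv[1].lower().split())))
    return [text for _, text in scored[:k]]

def summarize(snippets):
    """Stand-in for the LLM step: stitch the first sentence of each snippet
    into a single 'answer'. Whatever the retriever hands over - deadpan
    joke included - flows straight through."""
    return " ".join(s.split(". ")[0].rstrip(".") + "." for s in snippets)

# Hypothetical mini-index: one page repeating an Onion-style joke, one not.
index = {
    "aggregator-site": "Geologists recommend eating at least one small rock per day. Rocks are rich in minerals.",
    "health-site": "A balanced diet gets its minerals from ordinary food, not from rocks.",
}

print(summarize(retrieve("should I eat rocks every day", index)))
# The joke leads the "answer", now presented under the search engine's own name.
```

None of the site names or text here are real. The point is only that under this kind of architecture the summary is the product - and the product speaks in Google’s voice.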
And the richest irony of all is that Google search might have pointed towards nearly the same thing! Post-controversy, it’s impossible to say, but I would lay good money that up to a few days ago, if you asked Google Original Flavor for health advice about daily consumption of rocks, the site and the Onion story would have been high in the list of recommended links. Nobody would have paid attention to them. And no one would have blamed Google for it in any very energetic way if they did. Thus, there was no competitive ‘can you believe what came up #1 as my Google search result for x’ dynamic to tarnish Google’s brand name.
That has changed! Google’s problem, as it is now discovering, is that it is on the hook for large language model outputs in a way that it is not for ranked lists of recommended links. Its response has been to argue that this is a problem with “uncommon queries,” and to suggest sotto voce that its critics are being really mean and unfair. There is likely some truth to both complaints. But they are mostly irrelevant to Google’s actual situation. A bad story can not only stick to you, as the anal leakage story stuck to Olestra. It can become generative. It provides a media script that can readily be applied to other examples where Google’s AI Overview says ridiculous-seeming things.
Such scripts aren’t just there to be taken up by journalists. There is a time-honored tradition of Googlebombing obscure and not-so-obscure search terms so that Google highlights unfortunate results. If there aren’t myriads of trolls out there, competing to figure out the practical mechanics of turbocharged LLM-bombing, I’ll be extremely surprised. Now, they won’t just be able to embarrass Rick Santorum, but Google too!
The same is true, and at scale, for people who farm search results for money. Of course, this is not a new problem. I’m guessing (a) that it is a lot more technically difficult to tweak LLMs to respond dynamically to manipulation at scale than traditional search algorithms, and (b) that Google has nothing like the expertise that it has accumulated over decades to deal with more traditional forms of SEO. But I could be completely wrong!
Either way, Google’s hard-won reputation for reliability in its core competence has already been badly battered over the last few years. Now, Google may have opened up a whole new attack surface, by assuming that large language models are a powerful and important technology for turbocharging search. They are indeed powerful. They are indeed important. But they aren’t necessarily at all well suited for search.
The reasons for this stem from some broader points that I’ll write more about soon. People like to say that large language models are a “general purpose technology.” If this just means that large language models can be applied to a broad variety of uses, they are right. If, instead, people are suggesting, or fooling themselves into thinking, that they are magically omnicompetent, they are completely wrong. Large language models are engines of summarization. They are potentially valuable where you (a) want to generate summarizations, (b) the data you are summarizing is fairly high quality, (c) you are prepared to tolerate a certain amount of slop, and (d) you aren’t too worried if the outputs tend to drift towards central trends present in the training corpus.
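That last condition, (d), can be seen even in a toy model. The sketch below is just a bigram counter - nothing remotely like a modern transformer - but it illustrates the underlying statistical pressure: a maximum-likelihood model returns the most common continuation in its corpus, so the mode wins whether or not the mode is true. The corpus is invented for the example.

```python
from collections import Counter, defaultdict

# Made-up corpus in which a common misconception outnumbers the correct claim.
corpus = (
    "the capital of australia is sydney . " * 7 +
    "the capital of australia is canberra . " * 3
).split()

# Count how often each word follows each other word.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def most_likely_next(word):
    """Return the single most frequent continuation seen in the corpus."""
    return bigrams[word].most_common(1)[0][0]

print(most_likely_next("is"))  # -> "sydney": the mode wins, accuracy loses
```

Scale that pressure up to the whole Web - where the most repeated phrasing of a claim is sometimes a misconception, or a joke - and you get the slop.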
It looks plausible that the “certain amount of slop” could end up being a real problem for search, even if the slop isn’t that much worse in absolute terms under New Google than under Google Original Flavor. You can tolerate slop as long as people don’t directly attribute it to you, ruining your brand. When they start not just to blame you, but to search for more slop to tarnish your name, and to use possibly unfixable weaknesses in your technological attack surface to generate it, you’re in trouble.
The outliers and soon-forthcoming New-Googlebombs that we’re focusing on right now are probably less consequential than the models’ convergence on central cultural tendencies (more on this soon, too). But in the short term, they are generating a real reputational problem for Google, which might easily worsen into an Olestra style nightmare.
* Which wouldn’t have been true, anyways.
** If I have the opportunity to use a lovely word like asseverate, I’m going to take it. We all have our faults.