Fully automated data driven authoritarianism ain't what it's cracked up to be

Nor, for that matter, is fully automated data driven capitalism

Jul 24, 2023

Last September, Abe Newman, Jeremy Wallace and I had a piece in Foreign Affairs’ 100th anniversary issue. I can’t speak for my co-authors’ motivations, but my own reason for writing was vexation that someone on the Internet was wrong. In this case, it was Yuval Harari. His piece has been warping debate since 2018, and I have been grumpy about it for nearly as long. But also in fairness to Harari, he was usefully wrong - I’ve assigned this piece regularly to students, because it wraps up a bunch of common misperceptions in a neatly bundled package, ready to be untied and dissected.

Specifically, Harari argued that AI (by which he meant machine learning) and authoritarian rule are two flavors that will go wonderfully together. Authoritarian governments will use surveillance to scoop up vast amounts of data on what their subjects are saying and doing, and use machine learning feedback systems to figure out what they want and manipulate it, so that “the main handicap of authoritarian regimes in the 20th century—the desire to concentrate all information and power in one place—may become their decisive advantage in the 21st century.” Harari informed us in grave terms that liberal democracy and free-market economics were doomed to be outcompeted unless we took Urgent But Curiously Non-Specific Steps Right Now.

In our Foreign Affairs piece, Abe, Jeremy and I argued that this was horseshit. What we know about authoritarian systems (Jeremy has a great new book on this topic) is that it’s really hard for them to generate and use good data. When data is politically important (e.g. it shapes the career prospects of self-interested officials), there is invariably going to be a lot of stats-juking. We suggested that machine learning is very unlikely to solve this fundamentally political problem - garbage in most definitely leads to garbage out. And we claimed that machine learning will introduce its own particular problems. Many of the worries that have been expressed in US and European debates about machine learning - that it can create self-perpetuating forms of bias - are likely to be much more pervasive and much worse in authoritarian regimes, where there are far fewer mechanisms of open criticism to correct nasty feedback loops that keep on getting nastier.

That was a Foreign Affairs think-piece, which means that you wouldn’t want to have staked your life on its thesis: its boldly stated arguments are better described as plausibility claims. But now, there is actually some Real Social Science by a UCSD Ph.D. student, Eddie Yang, that makes a closely similar argument to ours. Eddie came up with his argument independently of ours, and furthermore did the hard and serious work of figuring out whether there is good evidence to back it up. Whereas we (and Harari) have sweeping arguments, Eddie has specific hypotheses that he tests against quantitative data. I think this piece is going to get a lot of attention and I think it deserves it.

What Eddie argues is the following. Authoritarian regimes face a well known set of trade-offs involving information. Ideally, they would like to be able to do two things at once.

First, they would like to make sure that their domestic enemies (which are often, in practice, frustrated minorities within the ruling class) can’t mobilize against them. This is why they so often impose extensive censorship regimes - they don’t want people mobilizing around their collective unhappiness with their rulers.

Second, they would like to know what their citizens actually think, believe, and are frustrated about. If citizens are really, quietly, angry, there is a much greater likelihood that crises will unexpectedly lead to the regime’s demise, as they did in East Germany before the Berlin Wall fell, or in Tunisia before the Arab Spring. When there is diffuse and strong unhappiness with the regime, a small incident or a minor protest might suddenly cascade into general protests, and even regime collapse. This doesn’t necessarily lead to a transition to democracy (see: Arab Spring) but that is scant comfort to the rulers who have been deposed.

The problem is that these two desires are hard to reconcile. If you, as an authoritarian ruler, suppress dissent, you will flatten out public opinion, making people less likely to say what they truly think and believe. But if you do that, you also have a very hard time figuring out what people actually think and believe, and you may find yourself in for nasty surprises if things begin to go wrong.

This tradeoff explains why authoritarian regimes do things that might seem initially surprising, such as quietly running opinion polls (to be read only by officials), or creating petition systems. They want to know what their publics think, but they don’t want their publics to know what they think. It’s quite hard to do the one without the other.

None of this is news - but what Eddie does is to ask whether machine learning changes the dilemma. He asks whether new technologies make it possible for authoritarian governments to see what their publics want, while still squashing dissent. The answer is nope, but it’s a very interesting nope, and one that undermines the Harari thesis.

So how does Eddie answer his question? He starts with a dataset of 10 million Chinese social media posts from Weibo (think Twitter with Chinese characteristics). And then he applies an actual real political sensitivity model from a Chinese social media company (which unsurprisingly goes unnamed) to score how spicy the content is. I’ve no idea how he got this model, and I imagine he has excellent reason not to tell, but it allows him to approximate the actual machine learning techniques used by actual Chinese censors. His core finding is that machine learning bias is an enormous problem for censorship regimes - “AI that is trained to automate repression and censorship can be crippled by bad data caused by citizens’ strategic behavior of preference falsification and self-censorship.” In other words, the biases in bad data give rise to systematic blindness in the algorithms, which in turn may very plausibly be destabilizing.

The fundamental problem is that the data that the machine learning algorithm has access to are a lousy representation of what Chinese citizens actually believe. Instead, the data reflect what Chinese citizens are prepared to say in public (after decades of censorship), which is … not quite the same thing. That makes it really hard to train a machine learning system to predict what Chinese citizens might say in a time of crisis, and rapidly squelch any dangerous sounding talk, which is what the CCP presumably wants to be able to do (the first priority of China’s leaders is and always has been domestic political stability). Eddie’s simulations suggest that the traditional solution - throw more data at the model! - doesn’t work so well. Exposing the model to more crappy data - even lots and lots of crappy data - leads only to negligible improvements.

And this is where the interesting twist comes in. What does lead to significant improvements is drawing on a different data source: specifically, drawing on data from an uncensored Western social media service. When you feed the model data from Chinese language users talking about similar stuff on Twitter, the accuracy of the model improves significantly. Now that the model is able to ‘see’ the kind of potentially politically dangerous language that self-censoring citizens don’t use on Chinese social media services, it is able to do a better job at identifying it, and potentially censoring it when it turns up in places where the regime can censor. Free and open Western social media (i.e. Twitter before it became the shitshow it is now) provide an unexpected service to authoritarian regimes, by providing them with a relatively unbiased source of data that they can train their algorithms of oppression on. There are limits to this - the uncensored conversation outside China is obviously different from the conversation that Chinese citizens would have if they weren’t censored. And the uncensored conversations are obviously still imperfect proxies for what people really think (a concept that gets blurrier the harder you poke at it). But still, it provides much better information than the censored alternative.
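For readers who like to see the mechanism rather than take it on trust, here is a deliberately crude Python sketch of it. To be clear, this is not Eddie's model, his data, or his simulation design: the vocabulary, the corpus sizes and the bare-bones naive Bayes classifier are all hypothetical stand-ins that I've made up for illustration. The assumption doing the work is that self-censoring users only ever express sensitive views in 'mild' terms, while uncensored users (and people in a crisis) also use 'blunt' ones.

```python
import math
import random
from collections import Counter

# Hypothetical toy vocabulary; none of these terms come from Yang's paper.
NEUTRAL = ["weather", "lunch", "movie", "train", "music", "football"]
MILD = ["policy", "officials"]   # sensitive talk that survives self-censorship
BLUNT = ["protest", "strike"]    # sensitive talk that self-censoring users avoid


def make_corpus(rng, n, sensitive_pool):
    """Build n labeled posts; every other post is politically sensitive."""
    posts = []
    for i in range(n):
        sensitive = i % 2 == 0
        words = rng.choices(NEUTRAL, k=4)
        if sensitive:
            words += rng.choices(sensitive_pool, k=2)
        posts.append((words, sensitive))
    return posts


def train(posts):
    """Count word frequencies per class (a bare-bones naive Bayes model)."""
    counts = {True: Counter(), False: Counter()}
    for words, label in posts:
        counts[label].update(words)
    return counts


def predict(counts, words):
    """Flag a post as sensitive if its add-one-smoothed log-odds are positive.

    The class prior is omitted because both classes are equally common here.
    """
    vocab_size = len(NEUTRAL + MILD + BLUNT)
    total = {label: sum(c.values()) + vocab_size for label, c in counts.items()}
    score = sum(
        math.log((counts[True][w] + 1) / total[True])
        - math.log((counts[False][w] + 1) / total[False])
        for w in words
    )
    return score > 0


def accuracy(counts, posts):
    """Share of posts whose predicted label matches the true label."""
    return sum(predict(counts, w) == label for w, label in posts) / len(posts)


rng = random.Random(0)
censored_small = make_corpus(rng, 200, MILD)      # self-censored platform
censored_large = make_corpus(rng, 5000, MILD)     # same platform, 25x the data
uncensored = make_corpus(rng, 100, MILD + BLUNT)  # small uncensored sample
crisis = make_corpus(rng, 1000, BLUNT)            # people stop holding back

acc_small = accuracy(train(censored_small), crisis)
acc_large = accuracy(train(censored_large), crisis)
acc_mixed = accuracy(train(censored_small + uncensored), crisis)

print(f"censored only, 200 posts:        {acc_small:.2f}")
print(f"censored only, 5000 posts:       {acc_large:.2f}")
print(f"200 censored + 100 uncensored:   {acc_mixed:.2f}")
```

In this toy world, the classifier trained only on censored posts should hover around coin-flip accuracy on the crisis posts however much censored data it gets, because it has never seen a blunt term labeled as sensitive. A small uncensored sample should fix that. The effect is exaggerated by construction, so treat it as a cartoon of the argument rather than evidence for it.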

This tells us two things. One is that machine learning offers new tools to authoritarians, but it also generates new problems. None of these new technologies make politics go away. There are a lot of people who treat machine learning as some kind of automated sorcery - they may disagree over whether it will miraculously cure the world’s problems, or cast mass publics under an evil and irrevocable enchantment. They don’t usually have much idea what they’re talking about. Machine learning is applied statistics, not magic.

The other is a little less obvious. When the world begins to fill up with garbage and noise (borrowing a term from Philip K. Dick, when it is overwhelmed by ‘gubbish’), probably approximately correct knowledge becomes the scarce and valuable resource. And this probably approximately correct knowledge is largely the product of human beings, working together under specific kinds of institutional configuration (think: science, or, more messily, some kinds of democracy). The social applications of machine learning in non-authoritarian societies are just as parasitic on these forms of human knowledge production as authoritarian governments are. Large language models’ sometime ability to approximate the right answer relies on their having ingested large corpuses of textual data on how humans have answered such questions, and drawing inferences from the patterns in the answers.

So how do you keep these forms of knowledge production alive in a world where every big tech company wants to parasitize them and/or subvert them to their own purposes? Eddie’s article suggests that authoritarian governments have a hidden dependency on more liberal regimes’ ability to produce more accurate information. The same is true of the kinds of knowledge capitalism that are dominant in free societies.

Decades ago, Albert Hirschman described cultural depletion theories under which market societies undermined the cultural conditions that they needed to reproduce themselves, by devouring trust relations, honesty, good will and the like. It’s at the least prima facie plausible that knowledge capitalism - in its current form - does much the same thing. Software eats the world, consuming the structures that produce more or less reliable human knowledge - and excreting gubbish instead. You don’t have to do the full Harari-we’re-all-doomed-thumbsucker-article-for-the-Atlantic to worry that the long term implications of this may not be so great. The future is not an inevitable dystopia, whether authoritarian or capitalist. Equally, there is no reason to assume that new technologies and forms of production will automatically regenerate the kinds of knowledge structures that we need to figure things out collectively in any even loosely reliable way.

Discussion about this post

Timothy Burke

Jul 24, 2023

Nice piece, Henry--and also I liked the Foreign Affairs piece.

Yang's research is going at one of the enormous weaknesses of *all* social science research that seeks to mine social media for an indication of what people really think and even more importantly what they are prone to *do* about what they think. In a comparative sense, this would be like European ancien regimes mining the print culture of the 18th Century and concluding that the texts which most directly attest to political sentiments and call for political action are leading indicators of the possibility of unrest or dissent, whereas if you buy even part of Darnton's argument in The Forbidden Bestsellers of Pre-Revolutionary France, the more important texts that worried the regime were pornography, dreamy utopian narratives, and vicious attacks on minor public figures--but sometimes *all* texts did. But the real argument is that it wasn't until the French ancien regime identified print culture *as* a problem (and *as* containing information that the regime needed to know) that those texts started mobilizing rebellious sentiments among readers. I think you could add layers to Darnton's take--18th C. print culture was a complicated product of individual authors, the chaotic improvisations of publishers, and the constantly protean social worlds where print was kept, read and talked about, and that beyond those worlds were the complicated social ties between the readers and non-readers, and how what was read was communicated and re-circulated. Contemporary social media is a product of platforms and their cyborg-making interfaces, of the temporary relations that exist inside platforms, and of the ways what is said inside platforms gets communicated onward to people who aren't there. (Am I exactly the sum of my likes and dislikes on Facebook, or is that man a creature who exists only inside Facebook's media ecology? Can you learn anything of what I might do out here in my daily material existence from it? Something, but not even close to everything.)
Garbage in, garbage out in terms of *whether* what citizens say and do in texts is what they're going to say and do in material reality is a problem older than AI, and it's where a humanistic understanding of representation becomes really important for understanding the problem. (Because it's not just a problem for governments, it's a problem for social scientists who are looking for data that will tell them what people "really" think and how what they "really" think will govern what they concretely *do*.)

But I'd also work it from the other end of the problem. (Forgive me if Jeremy's new book does this at length, as I haven't had the chance to read it yet.) Do authoritarian regimes really want to know what their people want? In ethnographic and historical terms, I'd say there's a fair amount of evidence that they don't, at least not in the upper reaches of their hierarchies. In fact, I'd say that a good deal of the time, the people who are ostensibly charged with making decisions that are supposed to be guided by accurate information don't want information that will complicate, deflect, or outright invalidate what they ideologically or predispositionally want to do, that authoritarian states are sometimes more inclined to act according to some theory of power (or some whim of the authoritarian) and then to try and force social reality to *cohere to the action* or to suppress (with various degrees of violence or coercion) any attempt to dissent from that re-alignment. Harari's piece presupposes a kind of evolutionary race between authoritarian and non-authoritarian regimes in which accurate data mining of what the people really think and are really inclined to do will make authoritarian regimes more successful and powerful because they will more accurately anticipate and neutralize popular feelings. That seems to just misunderstand actually-existing authoritarian regimes, or it sees contemporary China as defining bleeding-edge authoritarianism, which I think is at least worth debating.

This might be a point that applies more generally to large-scale modern governments period. E.g., the people at the top of governmental hierarchies often have potent reasons to not know what they could know, or to discount an accurate source of information in favor of an inaccurate one simply because the accurate information will force decision-makers to move in a direction that they're politically or ideologically indisposed to take, whether that's invading Ukraine or putting out a 'mini-budget' that included dramatic tax cuts and no budgetary audit of the consequences. I hate to invoke "deep states", but it does seem fair to say that if there's an appetite for copious amounts of deeply accurate information about the population of a country, it's lower down the pyramid of power *within* states, I think.
