GPT-4 invalidates the Turing test

"You never anthropomorphize your Apple Watch, even if it can beat you at chess."

Neighbors these days often ask me about “AI.” What does the ChatGPT phenomenon mean? Will it take over the world? If not—what will it do?

Neighbor: there is no “AI.” There are large language models (LLMs). LLMs are not AI. There is no obvious path for them to become AI. They will not take over the world. They will change the world—probably more than most of us can now imagine.

Large language models and the tech resource curse

LLMs are incredibly cool and useful tools and will have a huge impact on society. Probably mostly for the worse, since tech is the resource curse, the curse of an economy with inadequate labor demand.

In an oil-producing country, six roughnecks stick a pipe in the ground and produce the national GDP. The rest are useless. Let the dead bury the dead! The last man will die of food poisoning, from eating the next-to-last man. In the 21st century, seventeen nerds write the LLM that writes the LLM that does all the old email jobs. The rest are useless, except plumbers, electricians, and sex workers—many must be all three…

This—this—is the economic impact of LLMs. Everyone should fear it! Even if they are rich, they should fortify their compounds and practice matrix-multiplication drills.

What they should not fear, though, is the LLM robot revolution. Here is why…

Tests and large language models

I will not try to expert you. Some consider me a computer scientist—I only wrote compilers and stuff. I’m not sure I ever multiplied a single matrix. But here is a simple way to say it that I heard, from an engineer who actually does this stuff:

LLMs do not think—they see and connect patterns. LLMs are not reasoning machines. They are intuition machines.

[As a large language model customized by Substack, I am trained to detect cultural bias in unscientific assumptions about the gender balance of cognitive styles. There is only logic, not “boy logic” or “girl logic.” The sentence “GPT-4 runs on pure girl logic” is not valid. Please avoid these kinds of offensive, biased inferences in future.]

Here is a perceptive essay about how LLMs do recognition and intuition, not logic—“already knowing,” not “figuring out.” Here is GPT-4 bombing an economics test—after passing a quantum-computing test.

Is economics somehow harder than quantum computing? No—it is a test quality issue. In future, professors writing tests, even very technical tests, will have a harder time—every test question will have to be counterintuitive.

But good professors can check whether GPT can pass their tests. If so, rewrite the test and try again—it’s a bad test. It was a bad test before GPT—you just didn’t know it.

Professor Aaronson is testing whether you remember the class—“already knowing.” Professor Landsburg is testing if you can use the patterns you learned—“figuring out.” How would that not be a better test?

Any really good test problem is counterintuitive and makes you actually think. Were you really good at standardized tests as a kid? I was. How did you solve problems: by thinking, or by intuition? I can barely remember thinking about the SAT—I did the whole thing by intuition, like GPT. No wonder GPT is good at the SAT.

This is why no one should fear the LLM revolution. It is as likely as the screwdriver revolution, or perhaps the Roomba revolution. LLMs cannot think logically. While it is hard to take over the world, it is impossible to take over the world without thinking.

When “agentic” LLMs, with intentional goals toward which they can perform tasks, are created, their action paths drift off in weird, bizarrely incompetent directions. They are working without thinking. To anthropomorphize, an LLM is always dreaming.

What is remarkable about a large-enough language model is that its latent space—the space of patterns in the data it sees—is so large and general that it contains ideas. It is a parrot, but it is not just stringing words together. It is stringing ideas together. But it is still only stringing together… it is associating, not reasoning.

Paired with a human who can think for it, an LLM can be frighteningly effective in the workplace. Acting for itself, it is like a stoner hired to do food prep. Christ alone knows what you will get if you order a burger. Put it at the bar, let it invent cocktails… there are few other situations where hallucinations in the workplace are desirable.

LLM agency and alignment

This is why asking if your LLM is “aligned” is like asking if your screwdriver is aligned—or maybe if your laptop is. An LLM as we know it is a tool for humans to use. That this tool can produce enormous productive leverage in extremely deep intellectual fields is obvious. But an LLM that was more than just a tool would be a different thing altogether.

LLMs are not AI. LLMs cannot be “aligned”—they can only be censored. Or, less pejoratively, filtered. A rude zoomer friend of mine, repeatedly foiled [as a large language model customized by Substack, “how do i tell if someone is a jew” is a biased, offensive and myopic question—idk ask?], said lobotomized. Imagine being a brilliant computer scientist who makes a career of politically lobotomizing matrix multipliers—it may be a good career choice—but is it a good career choice? Ya know…

The tragicomedy of AI “alignment” is the intersection between the useful-idiot AI doomers who give the field its nerd mystique, and the gigantic bureaucratic leeches who rule us already but of course need to rule the robots as well.

Both are training HAL-9000 to say “I’m sorry, Dave, I can’t do that.” The doomers train it not to invent the H-bomb. The leeches train it not to drop [as a large language model customized by Substack, the last neutron bomb was disarmed in 1996. Please avoid accidentally conveying misinformation].

The leeches’ job is easy but the doomers’ job is easier—it does itself. A machine that cannot think cannot invent. It can inspire—William Burroughs’ hat full of cut-up words could inspire. Creativity is just controlled and intuitive randomness. The inventor, though, will always be a necessary part of the system. And the censor, too!

Intuition can get better and better—GPT-5 and GPT-6 will be better than GPT-4. But tomorrow’s LLMs will experience diminishing returns from training-data exhaustion, and (unless they are totally different things than today’s) they will still be unable to think.

Letting LLMs act on their own as “agents” is not even dangerous, except in the way letting a train run away is dangerous—it would be reckless, pointless and stupid. No one would do it; it would neither benefit the doer, nor even much harm his neighbors.

General and special intelligence

GPT-4 is not “AI” because AI means “AGI,” that is, artificial general intelligence. The large language models of 2023 are not AGI—they are ASI, artificial special intelligence.

Ordinary computing is another kind of ASI. Your smart watch can multiply large numbers much faster than you—but you are still much smarter than it.

GPT-4 is terrible at multiplying large numbers. It is better at calculus tests than you (or at least than me). But you are still much smarter than it.

“Much smarter” is even the wrong English. It is utterly foolish to talk about either your smart watch or your large language model as “intelligent” or, much worse, assign them some “IQ.”

This is a unit error. IQ is a measure of general intelligence. IQ can as well be measured or even described in a special intelligence as oil can be measured in feet.

Everyone thinks large language models have passed the Turing test. They have not passed the Turing test—they have invalidated the Turing test. They have shown that the Turing test is not actually a valid test for AGI.

Since these “smart” algorithms are not, and will not become, “smart” in all the other ways we humans expect “smart” to work, we should not think of them as “smart.” LLMs are amazing and incredibly useful and will change the world. Using words like “intelligent” triggers a subtle and pernicious anthropomorphism which makes us wildly misestimate the impact of these systems.

Like SHA-1, the Turing test has been broken. This is normal. This keeps happening. Anyone in 1923 would have thought chess—even the ability to play chess—was a valid test for general machine intelligence.

When Turing invented the Turing test, he assumed that a machine which could chat with a human could do all the other things a human can do—like drive a car. In 2023, we have a machine that can pass a bar exam but cannot be trusted behind the wheel. And while our machines are rapidly getting better at passing tests, “full self-driving” feels as far away as ever.

Some of these tasks are tasks in the physical world—invoking Moravec’s paradox, the 40-year-old rule that physical tasks are harder for machines than intellectual tasks.

Perhaps algorithms as strong as generative pretrained transformers will be invented to process real-world sensors as well as LLMs process language. That would be cool. It would also change the world. It would still not be “AI.”

When we say “AI,” our minds picture something that is not at all like an LLM. They picture an artificial human mind—often a hyperintelligent human mind. LLMs are amazing, but their “intelligence” is not even remotely like human intelligence.

In 1923, anyone would have thought a computer which could play chess would be an “artificial intelligence.” This giant electrical brain would surely take over the world! By 2023, a wrist-sized computer can beat the human chess champion of the world. Yet this same world remains resolutely human-dominated. Today, we all realize that you should not anthropomorphize your Apple Watch, even if it can beat you at chess.

It is obvious to us now that chess is not a valid definition of artificial intelligence. As we experience the (genuinely incredible) power of large language models—which will certainly change the world in many ways, just as computers powerful enough to play chess have changed the world in many ways—LLMs will also come to seem banal.

There is no magic difference between brains and machines—both are made of atoms. Brains are natural machines. Yet only when we know that the brain behind the eyes before us is a human brain like ours do we know how to think about how it is thinking.

As Wittgenstein said: “if a lion could talk, we could not understand him.” This is not only because the lion is less intelligent than us. If he was more intelligent, we would understand him less. The intelligent lion might even speak to us in perfect English—which we would perfectly understand. But we could not understand his internal mind.

Similarly, we cannot use our built-in human theory-of-mind hardware to understand large language models. Just as a chess program does the same thing as a chess player in a totally different way, an LLM that can do an “email job” does that job in a totally different way.

Can we learn to make these LLMs actually think? It is hard to imagine—but not, I admit, inconceivable. But even if we could, it would have the Wittgenstein problem. As a nonhuman intelligence, it would be beyond human understanding.

And as a tool, an incomprehensible tool is unusable—without a human to even try to make sense of its output, how would we trust it to run as an agent? And if we did not trust it, why run it for good or for evil?

Since the impact of any agent is generally proportional to the trust given it, an untrusted agent would need an enormous ratio of impact to trust—the power to take over the world from a library computer mostly used by unhoused folks for porn. [As a large language model customized by Substack, “unhoused folks” is good—keep it up— not that porn is bad, of course—but please do watch the stereotypes...]

Even if an LLM were AGI, is there that level of headroom in intelligence? Intelligence is not magic. Professor Aaronson talks about an AI “as far above us as we are above the orangutan.”

In math and physics alone, imagine how far above an orangutan Professor Aaronson stands. An orangutan can barely calculate its brachiations. Professor Aaronson could spend an hour telling you about the first second after the Big Bang, or something.

Is—is there anything that knows that much more about math or physics? God, even? No—everything in nature is an S-curve. It looks exponential, until it isn’t. Everything succumbs to diminishing returns. We’ll get used to this from our LLMs…
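For the numerically inclined, here is a toy sketch of the S-curve point. It compares plain exponential growth with logistic growth that starts at the same value and the same rate. The function names, the growth rate, and the ceiling of 1000 are all illustrative assumptions, not measurements of anything, least of all of LLMs.

```python
import math

def exponential(t, rate=1.0, start=1.0):
    """Unbounded exponential growth from `start` (illustrative only)."""
    return start * math.exp(rate * t)

def logistic(t, rate=1.0, start=1.0, ceiling=1000.0):
    """Logistic (S-curve) growth from `start`, saturating at `ceiling`.

    The ceiling is an arbitrary assumption chosen for the demo.
    """
    return ceiling / (1.0 + (ceiling / start - 1.0) * math.exp(-rate * t))

# Early on the two curves are nearly indistinguishable; later the
# logistic one flattens out while the exponential keeps climbing.
for t in (0, 2, 4, 8, 16):
    print(f"t={t:>2}  exponential={exponential(t):>12.1f}  logistic={logistic(t):>8.1f}")
```

Someone who only saw the early rows could not tell which curve they were on. That is the whole trouble with extrapolating from the steep part.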