Category Archives: Machine Learning

An AI Opinions Chart

You ever read something and suddenly a whole classification scheme lights up in your head?

A thread on X from “stringking42069” showed me a combination of opinions I hadn’t seen before. stringking42069 is a pro-string theory commentator with a macho gym bro memer gimmick. He’s openly contemptuous of many physicists who describe themselves as string theorists, arguing that only a smaller number really deserve the name.

To be clear, none of that is the new combination. Long-time readers of this blog will remember a frequent commenter with a very similar attitude, if much less tendency to use the word “bro”.

The new thing, from my perspective, is how he thinks about AI. As he explains in that thread, he sees AI as great at certain kinds of physics calculations, ones where the methods and goals are mostly known and the challenge is working out the math. He doesn’t expect it to be able to contribute real creativity or judgement, the messy decision-making that physicists use to decide what is worth building in the first place.

Others with that perspective tend to argue that this will be a boon for scientists, whom AI will free up to do creative work, multiplying their output. The difference is, stringking42069 thinks a lot of scientists are not doing creative work in the first place, including most of the people making extensive use of AI. So if anything he’s happy to see them go, and only pissed that they’re sucking up resources and attention on the way out, and discouraging students who could be joining the parts of the field that do real creative work.

It made me realize that there are two axes to thinking about AI in physics.

On the one hand, there’s where you think AI capabilities are. Is AI going to lead to “a nation of geniuses in a data center”, an AI-powered super-(cyber-)Ed Witten for everything and everyone? Is AI great at routine work and coding, but will never be able to do anything really creative or novel? Or is AI total hype, almost always a waste of time?

On the other hand, there’s another axis: misanthropy about science. For some of the people arguing about AI online, most scientists are good people trying their best to do worthwhile things. For others, most scientists are complacent and cliquish, wasting time and money on ideas that are going nowhere and forcing the real geniuses out of the field.

Put those together, and you get the table below:

| | Thinks academia is mostly fine | Misanthrope |
| --- | --- | --- |
| AI geniuses are coming | The practice of science will change. We’ll play at science like chess, and have fun trying to read and understand amazing AI insights. | Soon all scientists will be out of a job when the public notices AI can do it all better. Then the real breakthroughs will come. |
| AI can do routine work | AI frees scientists to focus on what we do best: creativity. We should think carefully about how to train junior scientists now, though. | AI is comparable to bad scientists who only do derivative work. If they leave, we real paradigm-changers could inherit the field. |
| AI is complete hype | Most scientists don’t use AI. AI is worrying because it misleads students and the public, who should listen to real scientists. | Scientists are shilling for AI companies, as you should expect for people who waste the public’s money on reputation games. |

This classification is missing a lot, of course. One important question is not just what AI can do in principle, but what it can do cost-effectively, and whether anyone is actually willing to pay for it. A point where I agree with stringking42069 is that companies get a lot of good PR out of building AI physicists right now, and that PR benefit won’t be relevant forever. I’m also leaving out the more general questions of AI’s effect on society, for example people who think AI geniuses will lead to the end of the world as we know it.

But I suspect if you look at this table, you can already start matching the scientists you see on social media. I’ve seen examples of all of these in the wild (though the bottom-left is somewhat rare, as far as I can tell). Where do you fall?

Make No Mistakes

I’m taking a Danish exam next week, and it’s a big one, a culmination of years learning the language. My classmates are stressed. Despite how much we’ve learned, it feels like we’re always making little mistakes. We write the wrong prepositions, put verbs in the wrong form, or mess up the order of words in a sentence. And while we should have time to check our work, that doesn’t help as much as it should. If we don’t notice a mistake the first time around, what guarantee is there that we notice it on the next read, or the next? Too many checks and we can even end up second-guessing ourselves, “correcting” something that was right to begin with.

It’s given me some sympathy for AI.

Earlier this month, investor Marc Andreessen posted a custom prompt he inputs when using AI, which was immediately mocked.

The silliest instruction, according to many critics, was to “Never hallucinate or make anything up.” It’s similar to a prompt that’s become a meme used to make fun of AI-using “vibe coders”, “Make no mistakes”.

Experts point out that this is just not how AI works. Large language model-powered programs like ChatGPT are inherently random, producing text largely based on its similarity to other text. “Hallucinations” or “mistakes” are an inevitable feature of the technology, and a prompt like the one Andreessen wrote isn’t a set of instructions the AI will follow without error: it’s just another part of the text the AI is trying to generate.

All that said, telling an AI to “make no mistakes” should have some effect. But it likely won’t be what you want.

The best way I’ve found to understand AI is in terms of stories. Chatbots like ChatGPT take a large language model, a mathematical formula for how words are most likely to appear in a text, and warp it, twisting it to almost always produce one particular kind of text: one half of a dialogue with a fictional AI assistant. This twisted formula determines how the AI responds to your prompts, but these days it also is used behind the scenes, in a kind of structured soliloquy called a “chain of thought”. You can think of the prompts you send to the AI as a preface to those soliloquies, and imagine the AI telling stories of a sort that would typically follow that preface.

So if you tell an AI “make no mistakes” or “do not hallucinate”, you’re making it more likely to generate the kind of story that begins, “the AI was instructed to make no mistakes”.

Let me put it this way, Mr. Amor. The 9000 series is the most reliable computer ever made. No 9000 computer has ever made a mistake or distorted information. We are all, by any practical definition of the words, foolproof and incapable of error. – HAL 9000, “2001: A Space Odyssey”

You’d expect this to affect the chain of thought. For example, the AI might occasionally pause to say “I’m supposed to make no mistakes, so I should check this. What could have gone wrong?” and then list something that plausibly could be wrong with its idea. If this happens often enough, you’ll probably catch some real problems.
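To make that concrete, here is a toy sketch in Python. It is not how any real chatbot works: `next_step_probs` is a hypothetical stand-in for a language model, and every probability in it is invented for illustration. The only point is that an instruction in the prompt shifts the odds of what comes next, without driving the chance of a mistake to zero.

```python
import random
from collections import Counter

# Made-up odds for what the chatbot's "story" does next.
# These numbers are invented for illustration, not measured from any model.
BASELINE = {"answer correctly": 0.70, "answer wrongly": 0.20, "pause to double-check": 0.10}
INSTRUCTED = {"answer correctly": 0.50, "answer wrongly": 0.15, "pause to double-check": 0.35}

def next_step_probs(prompt: str) -> dict:
    """Hypothetical stand-in for a language model: the prompt only shifts the odds."""
    return INSTRUCTED if "make no mistakes" in prompt.lower() else BASELINE

def tally(prompt: str, n: int = 10_000, seed: int = 0) -> Counter:
    """Sample n continuations and count how often each kind shows up."""
    rng = random.Random(seed)
    probs = next_step_probs(prompt)
    steps = rng.choices(list(probs), weights=list(probs.values()), k=n)
    return Counter(steps)

plain = tally("Translate this sentence into Danish.")
strict = tally("Make no mistakes. Translate this sentence into Danish.")
print("without the instruction:", plain)
print("with the instruction:   ", strict)
```

In this toy, the instructed version pauses to double-check far more often, but “answer wrongly” never drops out of the tally.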

But I’m reminded of my classmates, practicing for that Danish exam. We can go over the text again and again, asking if this thing, or that, might be wrong. We can try again and again to use our mental model of the Danish language, seeing if this time it catches a new mistake. But there are things we won’t catch. And if we do it too much, we’ll second-guess ourselves out of the good answers, too.

Ultimately, “make no mistakes” isn’t a great instruction, either for humans or for chatbots. And its use by people like Marc Andreessen has me wondering if they are used to interacting with humans in the same way, as tools that keep making mistakes no matter how many times they’re instructed not to, requiring more and more long-winded instructions and yet continuing to misbehave.

Then again, that may be a mistake on my part.

What AI Physicists Are Missing and What They Aren’t

I’ve seen a couple more thoughtful takes on use of LLMs for physics lately. This blog post by Minas Karamis is particularly nice.

He points out something that I’ve said a version of: an AI that must be supervised like a student isn’t very useful, because the main point of student projects isn’t the paper at the end: it’s training the student. If students don’t struggle through all the mistakes of a project, they won’t get the expertise to one day do greater things.

Someone might object that not all suffering is educational. In the 1700s, Leonhard Euler calculated digit after digit of transcendental numbers by hand. Nobody asks students to do that anymore, and they still seem to turn out alright. Why would using an LLM for science be worse than using a computer for numerical calculations?

In a word: different skills. Programming numerics teaches you some of the same skills as calculating the numbers by hand: being specific about what you mean, and being aware of the details and their consequences. Prompting an AI still requires those skills, to check whether the AI’s output is correct. But it’s much worse at teaching them: unlike programming or calculating, when prompting AI, the consequences of your actions aren’t predictable.

For some, though, there is another objection. Sure, using AI reliably might require those skills now. But when it gets better, surely being careful will stop mattering. Surely the AI will end up doing science on its own, and all that training will be as useful as if we trained the students to play football.

I’m skeptical, but not as strongly as some. I think we’re still living in a time when it makes sense to hire scientists, and train people to think, and invest in your retirement.

I don’t think I have any knock-down arguments for that, though. Just some suggestive ones.

One I’ve talked about before is that a lot of the most important parts of thinking aren’t written down. An AI physicist is going to have a hard time replicating the kinds of methods and approaches that people use behind the scenes, but rarely describe or spell out. It will be easier to suss this out over time, as more data accumulates of people working with LLMs and correcting them. But ultimately there isn’t going to be a lot of documentation of this kind of thing.

Another limitation is memory. A mature scientist can draw from experiences across their entire career. For an LLM, any problem it’s solved in the past is by default lost in each new session. People build structures around this, taking notes and reminding the AI when it “wakes up”, or making documents the AI can be prompted to check. But nothing in this vein so far seems to get nearly as wide-scope or powerful as human memory. A scientist’s career is still the best way we have to build durable, functional expertise.

Finally, there is a question of costs, and efficiency. Here I’m not an expert, and I get the impression the actual experts disagree. I don’t know whether we should expect scaling to hit a wall, but I wouldn’t be that surprised if it did.

There are other common reasons for skepticism that seem more dubious to me. I don’t think AI is inherently worse at creativity just because it’s trained on existing work, though some of the skills we associate with creativity aren’t very well-documented, and thus are hard to train for. I don’t think AI’s randomness or unreliability is a deal-breaker, because human intuition is also random and unreliable: we solve that with tools, and that’s something AI can in principle do as well. I don’t think humans are “more agentic” or something, except in the sense that most AIs are made by companies who need to make them behave in a customer-friendly way. But an agent is just a game-theoretic construct, a way to figure out who can win or lose in situations with defined stakes, and anything you can train or engineer to try to win can be modeled by that construct.

Coming from a place of uncertainty, my main appeal to you is to not get hung up on the bad reasons, whether they’re your own or come from the people you’re arguing with. Focus on the best arguments, and see where they take you.

About the OpenAI Amplitudes Paper, but Not as Much as You’d Like

I’ve had a bit more time to dig in to the paper I mentioned last week, where OpenAI collaborated with amplitudes researchers, using one of their internal models to find and prove a simplified version of a particle physics formula. I figured I’d say a bit about my own impressions from reading the paper and OpenAI’s press release.

This won’t be a real “deep dive”, though it will be long nonetheless. As it turns out, most of the questions I’d like answers to aren’t answered in the paper or the press release. Getting them will involve actual journalistic work, i.e. blocking off time to interview people, and I haven’t done that yet. What I can do is talk about what I know so far, and what I’m still wondering.

Context:

Scattering amplitudes are formulas used by particle physicists to make predictions. For a while, people would just calculate these when they needed them, writing down pages of mess that you could plug numbers into to get answers. However, forty years ago two physicists decided they wanted more, writing “we hope to obtain a simplified form for the answer, making our result not only an experimentalist’s, but a theorist’s delight.”

In their next paper, they managed to find that “theorist’s delight”: a simplified, intuitive-looking answer that worked for calculations involving any number of particles, summarizing many different calculations. Ten years later, a few people had started building on it, and ten years after that, the big shots started paying attention. A whole subfield, “amplitudeology”, grew from that seed, finding new forms of “theorist’s delight” in scattering amplitudes.

Each subfield has its own kind of “theory of victory”, its own concept for what kind of research is most likely to yield progress. In amplitudes, it’s these kinds of simplifications. When they work out well, they yield new, more efficient calculation techniques, yielding new messy results which can be simplified once more. To one extent or another, most of the field is chasing after those situations when simplification works out well.

That motivation shapes both the most ambitious projects of senior researchers, and the smallest student projects. Students often spend enormous amounts of time looking for a nice formula for something and figuring out how to generalize it, often on a question suggested by a senior researcher. These projects mostly serve as training, but occasionally manage to uncover something more impressive and useful, an idea others can build around.

I’m mentioning all of this, because as far as I can tell, what ChatGPT and the OpenAI internal model contributed here roughly lines up with the roles students have on amplitudes papers. In fact, it’s not that different from the role one of the authors, Alfredo Guevara, had when I helped mentor him during his Master’s.

Senior researchers noticed something unusual, suggested by prior literature. They decided to work out the implications, did some calculations, and got some messy results. It wasn’t immediately clear how to clean up the results, or generalize them. So they waited, and eventually were contacted by someone eager for a research project, who did the work to get the results into a nice, general form. Then everyone published together on a shared paper.

How impressed should you be?

I said, “as far as I can tell” above. What’s annoying is that this paper makes it hard to tell.

If you read through the paper, they mention AI briefly in the introduction, saying they used GPT-5.2 Pro to conjecture formula (39) in the paper, and an OpenAI internal model to prove it. The press release actually goes into more detail, saying that the humans found formulas (29)-(32), and GPT-5.2 Pro found a special case where it could simplify them to formulas (35)-(38), before conjecturing (39). You can get even more detail from an X thread by one of the authors, OpenAI Research Scientist Alex Lupsasca. Alex had done his PhD with another one of the authors, Andrew Strominger, and was excited to apply the tools he was developing at OpenAI to his old research field. So they looked for a problem, and tried out the one that ended up in the paper.

What is missing, from the paper, press release, and X thread, is any real detail about how the AI tools were used. We don’t have the prompts, or the output, or any real way to assess how much input came from humans and how much from the AI.

(We have more for their follow-up paper, where Lupsasca posted a transcript of the chat.)

Contra some commentators, I don’t think the authors are being intentionally vague here. They’re following business as usual. In a theoretical physics paper, you don’t list who did what, or take detailed account of how you came to the results. You clean things up, and create a nice narrative. This goes double if you’re aiming for one of the most prestigious journals, which tend to have length limits.

This business-as-usual approach is ok, if frustrating, for the average physics paper. It is, however, entirely inappropriate for a paper showcasing emerging technologies. For a paper that was going to be highlighted this prominently by OpenAI, the question of how they reached their conclusion is much more interesting than the results themselves. And while I wouldn’t ask them to go to the standards of an actual AI paper, with ablation analysis and all that jazz, they could at least have aimed for the level of detail of my final research paper, which gave samples of the AI input and output used in its genetic algorithm.

For the moment, then, I have to guess what input the AI had, and what it actually accomplished.

Let’s focus on the work done by the internal OpenAI model. The descriptions I’ve seen suggest that it started where GPT-5.2 Pro did, with formulas (29)-(32), but with a more specific prompt that guided what it was looking for. It then ran for 12 hours with no additional input, and both conjectured (39) and proved it was correct, providing essentially the proof that follows formula (39) in the paper.

Given that, how impressed should we be?

First, the model needs to decide to go to a specialized region, instead of trying to simplify the formula in full generality. I don’t know whether they prompted their internal model explicitly to do this. It’s not something I’d expect a student to do, because students don’t know what types of results are interesting enough to get published, so they wouldn’t be confident in computing only a limited version of a result without an advisor telling them it was ok. On the other hand, it is actually something I’d expect an LLM to be unusually likely to do, as a result of not managing to consistently stick to the original request! What I don’t know is whether the LLM proposed this for the right reason: that if you have the formula for one region, you can usually find it for other regions.

Second, the model needs to take formulas (29)-(32), write them in the specialized region, and simplify them to formulas (35)-(38). I’ve seen a few people saying you can do this pretty easily with Mathematica. That’s true, though not every senior researcher is comfortable doing that kind of thing, as you need to be a bit smarter than just using the Simplify[] command. Most of the people on this paper strike me as pen-and-paper types who wouldn’t necessarily know how to do that. It’s definitely the kind of thing I’d expect most students to figure out, perhaps after a couple of weeks of flailing around if it’s their first crack at it. The LLM likely would not have used Mathematica, but would have used SymPy, since these “AI scientist” setups usually can write and execute Python code. You shouldn’t think of this as the AI reasoning through the calculation itself, but it at least sounds like it was reasonably quick at coding it up.
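For a rough sense of what “smarter than Simplify[]” can mean, here is a toy SymPy sketch. The expression is made up for illustration and has nothing to do with the paper’s formulas (29)-(32), and I’m only assuming the setup ran something along these lines. The idea is that a simplifier told nothing about the region has little to work with, while one told the variables are positive can collapse the expression.

```python
import sympy as sp

# Toy expression, invented for illustration (not one of the paper's formulas).
x, y = sp.symbols("x y", real=True)
expr = sp.sqrt(x**2) + sp.Abs(y) - x - y

# In full generality nothing cancels: the absolute values have to stay.
general = sp.simplify(expr)

# Restrict to a "region" where both variables are positive by swapping in
# symbols that carry that assumption, then simplify again.
xp, yp = sp.symbols("x y", positive=True)
regional = sp.simplify(expr.subs({x: xp, y: yp}))

print("general: ", general)
print("regional:", regional)
```

The real calculation would involve choosing which region to specialize to and how to express it, which is where the judgment comes in.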

Then, the model needs to conjecture formula (39). This gets highlighted in the intro, but as many have pointed out, it’s pretty easy to do. If any non-physicists are still reading at this point, take a look:

Could you guess (39) from (35)-(38)?

After that, the paper goes over the proof that formula (39) is correct. Most of this proof isn’t terribly difficult, but the way it begins is actually unusual in an interesting way. The proof uses ideas from time-ordered perturbation theory, an old-fashioned way to do particle physics calculations. Time-ordered perturbation theory isn’t something any of the authors are known for using with regularity, but it has recently seen a resurgence in another area of amplitudes research, showing up for example in papers by Matthew Schwartz, a colleague of Strominger at Harvard.

If a student of Strominger came up with an idea drawn from time-ordered perturbation theory, that would actually be pretty impressive. It would mean that, rather than just learning from their official mentor, this student was talking to other people in the department and broadening their horizons, showing a kind of initiative that theoretical physicists value a lot.

From an LLM, though, this is not impressive in the same way. The LLM was not trained by Strominger, it did not learn specifically from Strominger’s papers. Its context suggested it was working on an amplitudes paper, and it produced an idea which would be at home in an amplitudes paper, just a different one than the one it was working on.

While not impressive, that capability may be quite useful. Academic subfields can often get very specialized and siloed. A tool that suggests ideas from elsewhere in the field could help some people broaden their horizons.

Overall, it appears that the twelve-hour run of OpenAI’s internal model reproduced roughly what an unusually bright student would be able to contribute over the course of a several-month project. Like most student projects, you could find a senior researcher who could do the project much faster, maybe even faster than the LLM. But it’s unclear whether any of the authors could have: different senior researchers have different skillsets.

A stab at implications:

If we take all this at face-value, it looks like OpenAI’s internal model was able to do a reasonably competent student project with no serious mistakes in twelve hours. If they started selling that capability, what would happen?

If it’s cheap enough, you might wonder if professors would choose to use the OpenAI model instead of hiring students. I don’t think this would happen, though: I think it misunderstands why these kinds of student projects exist in a theoretical field. Professors sometimes use students to get results they care about, but more often, the student’s interest is itself the motivation, with the professor wanting to educate someone, to empire-build, or just to take on their share of the department’s responsibilities. AI is only useful for this insofar as AI companies continue reaching out to these people to generate press releases: once this is routinely possible, the motivation goes away.

More dangerously, if it’s even cheaper, you could imagine students being tempted to use it. The whole point of a student project is to train and acculturate the student, to get them to the point where they have affection for the field and the capability to do more impressive things. You can’t skip that, but people are going to be tempted to.

And of course, there is the broader question of how much farther this technology can go. That’s the hardest to estimate here, since we don’t know the prompts used. So I don’t know if seeing this result tells us anything more about the bigger picture than we knew going in.

Remaining questions:

At the end of the day, there are a lot of things I still want to know. And if I do end up covering this professionally, they’re things I’ll ask.

  1. What was the prompt given to the internal model, and how much did it do based on that prompt?
  2. Was it really done in one shot, no retries or feedback?
  3. How much did running the internal model cost?
  4. Is this result likely to be useful? Are there things people want to calculate that this could make easier? Recursion relations it could seed? Is it useful for SCET somehow?
  5. How easy would it have been for the authors to do what the LLM did? What about other experts in the community?

Hypothesis: If AI Is Bad at Originality, It’s a Documentation Problem

Recently, a few people have asked me about this paper.

A couple weeks back, OpenAI announced a collaboration with a group of amplitudes researchers, physicists who study the types of calculations people do to make predictions at particle colliders. The amplitudes folks had identified an interesting loophole, finding a calculation that many would have expected to be zero actually gave a nonzero answer. They did the calculation for different examples involving more and more particles, and got some fairly messy answers. They suspected, as amplitudes researchers always expect, that there was a simpler formula, one that worked for any number of particles. But they couldn’t find it.

Then a former amplitudes researcher at OpenAI suggested that they use AI to find it.

“Use AI” can mean a lot of different things, and most of them don’t look much like the way the average person talks to ChatGPT. This was closer than most. They were using “reasoning models”, loops that try to predict the next few phrases in a “chain of thought” again and again and again. Using that kind of tool, they were able to find that simpler formula, and mathematically prove that it was correct.
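If “loop” sounds abstract, here is a minimal sketch of the shape of the thing. Everything in it is hypothetical: `scripted_model` just replays canned phrases so the example runs without any real language model, and the `FINAL:` marker is a convention I made up for when the loop should stop.

```python
from typing import Callable, List

def reasoning_loop(prompt: str,
                   predict_next: Callable[[List[str]], str],
                   max_steps: int = 20) -> List[str]:
    """Repeatedly extend a chain of thought until a final answer appears."""
    chain = [prompt]
    for _ in range(max_steps):
        phrase = predict_next(chain)  # the "model" sees everything written so far
        chain.append(phrase)
        if phrase.startswith("FINAL:"):  # made-up stopping convention
            break
    return chain

# Hypothetical scripted stand-in for a model, replaying canned phrases.
SCRIPT = [
    "Try small cases first.",
    "The pattern looks like a product.",
    "Check it against one more case.",
    "FINAL: the guess matches every case checked.",
]

def scripted_model(chain: List[str]) -> str:
    # chain[0] is the prompt, so len(chain) - 1 phrases have been produced so far.
    return SCRIPT[min(len(chain) - 1, len(SCRIPT) - 1)]

chain = reasoning_loop("Find a simpler formula.", scripted_model)
for line in chain:
    print(line)
```

A real reasoning model replaces the canned script with sampled text, which is where both the usefulness and the unpredictability come from.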

A few of you are hoping for an in-depth post about what they did, and its implications. This isn’t that. I’m still figuring out if I’ll be writing that for an actual news site, for money, rather than free, for you folks.

Instead, I want to talk about a specific idea I’ve seen crop up around the paper.

See, for some, the existence of a result like this isn’t all that surprising.

Mathematicians have been experimenting with reasoning models for a bit, now. Recently, a group published a systematic study, setting the AI loose on a database of minor open problems proposed by the famously amphetamine-fueled mathematician Paul Erdős. The AI managed to tackle a few of the problems, sometimes by identifying existing solutions that had not yet been linked to the problem database, but sometimes by proofs that appeared to be new.

The Erdős problems solved by the AI were not especially important. Neither was the problem solved by the amplitudes researchers, as far as I can tell at this point.

But I get the impression the amplitudes problem was a bit more interesting than the Erdős problems. The difference, so far, has mostly been attributed to human involvement. This amplitudes paper started because human amplitudes researchers found an interesting loophole, and only after that used the AI. Unlike the mathematicians, they weren’t just searching a database.

This lines up with a general point, one people tend to make much less carefully. It’s often said that, unlike humans, AI will never be truly creative. It can solve mechanical problems, do things people have done before, but it will never be good at having truly novel ideas.

To me, that line of thinking goes a bit too far. I suspect it’s right on one level, that it will be hard for any of these reasoning models to propose anything truly novel. But if so, I think it will be for a different reason.

The thing is, creativity is not as magical as we make it out to be. Our ideas, scientific or artistic, don’t just come from the gods. They recombine existing ideas, shuffling them in ways more akin to randomness than miracle. They’re then filtered through experience, deep heuristics honed over careers. Some people are good at ideas, and some are bad at them. Having ideas takes work, and there are things people do to improve their ideas. Nothing about creativity suggests it should be impossible to mechanize.

However, a machine trained on text won’t necessarily know how to do any of that.

That’s because in science, we don’t write down our inspirations. By the time a result gets into a scientific paper or textbook, it’s polished and refined into a pure argument, cutting out most of the twists and turns that were an essential part of the creative process. Mathematics is even worse: most math papers don’t even mention the motivation behind the work, let alone the path taken to the paper.

This lack of documentation makes it hard for students, making success much more a function of having the right mentors to model good practices, rather than being able to pick them up from literature everyone can access. I suspect it makes it even harder for language models. And if today’s language model-based reasoning tools are bad at that crucial, human-seeming step, of coming up with the right idea at the right time? I think that has more to do with this lack of documentation, than with the fact that they’re “statistical parrots”.

The Timeline for Replacing Theorists Is Not Technological

Quanta Magazine recently published a reflection by Natalie Wolchover on the state of fundamental particle physics. The discussion covers a lot of ground, but one particular paragraph has gotten the lion’s share of the attention. Wolchover talked to Jared Kaplan, the ex-theoretical physicist turned co-founder of Anthropic, one of the foremost AI companies today.

Kaplan was one of Nima Arkani-Hamed’s PhD students, which adds an extra little punch.

There’s a lot to contest here. Is AI technology anywhere close to generating papers as good as the top physicists, or is that relegated to the sci-fi future? Does Kaplan really believe this, or is he just hyping up his company?

I don’t have any special insight into those questions, about the technology and Kaplan’s motivations. But I think that, even if we trusted him on the claim that AI could be generating Witten- or Nima-level papers in three years, that doesn’t mean it will replace theoretical physicists. That part of the argument isn’t a claim about the technology, but about society.

So let’s take the technological claims as given, and make them a bit more specific. Since we don’t have any objective way of judging the quality of scientific papers, let’s stick to the subjective. Today, there are a lot of people who get excited when Witten posts a new paper. They enjoy reading them, they find the insights inspiring, they love the clarity of the writing and their tendency to clear up murky ideas. They also find them reliable: the papers very rarely have mistakes, and don’t leave important questions unanswered.

Let’s use that as our baseline, then. Suppose that Anthropic had an AI workflow that could reliably write papers that were just as appealing to physicists as Witten’s papers are, for the same reasons. What happens to physicists?

Witten himself is retired, which for an academic means you do pretty much the same thing you were doing before, but now paid out of things like retirement savings and pension funds, not an institute budget. Nobody is going to fire Witten: there’s no salary to fire him from. And unless he finds these developments intensely depressing and demoralizing (possible, but very much depends on how this is presented), he’s not going to stop writing papers. Witten isn’t getting replaced.

More generally, though, I don’t think this directly results in anyone getting fired, or in universities trimming positions. The people making funding decisions aren’t just sitting on a pot of money, trying to maximize research output. They’ve got money to be spent on hires, and different pools of money to be spent on equipment, and the hires get distributed based on what current researchers at the institutes think is promising. Universities want to hire people who can get grants, to help fund the university, and absent rules about AI personhood, the AIs won’t be applying for grants.

Funding cuts might be argued for based on AI, but that will happen long before AI is performing at the Witten level. We already see this happening in other industries or government agencies, where groups that already want to cut funding are getting think tanks and consultants to write estimates that justify cutting positions, without actually caring whether those estimates are performed carefully enough to justify their conclusions. That can happen now, and doesn’t depend on technological progress.

AI could also replace theoretical physicists in another sense: the physicists themselves might use AI to do most of their work. That’s more plausible, but here adoption still heavily depends on social factors. Will people feel like they are being assessed on whether they can produce these Witten-level papers, and that only those who make them get hired, or funded? Maybe. But it will propagate unevenly, from subfield to subfield. Some areas will make their own rules forbidding AI content, there will be battles and scandals and embarrassments aplenty. It won’t be a single switch, the technology alone setting the timeline.

Finally, AI could replace theoretical physicists in another way, by people outside of academia filling the field so much that theoretical physicists have nothing more that they want to do. Some non-physicists are very passionate about physics, and some of those people have a lot of money. I’ve done writing work for one such person, whose foundation is now attempting to build an AI Physicist. If these AI Physicists get to Witten-level quality, they might start writing compelling paper after compelling paper. Because of their origins, though, those papers will be specialized. Much as philanthropists mostly fund the subfields they’ve heard of, philanthropist-funded AI will mostly target topics the people running the AI have heard are important. Much like physicists themselves adopting the technology, there will be uneven progress from subfield to subfield, inch by socially-determined inch.

In a hard-to-quantify area like progress in science, that’s all you can hope for. I suspect Kaplan got a bit of a distorted picture of how progress and merit work in theoretical physics. He studied with Nima Arkani-Hamed, who is undeniably exceptionally brilliant but also undeniably exceptionally charismatic. It must feel to a student of Nima’s that academia simply hires the best people, that it does whatever it takes to accomplish the obviously best research. But the best research is not obvious.

I think some of these people imagine a more direct replacement process, not arranged by topic and tastes, but by goals. They picture AI sweeping in and doing what theoretical physics was always “meant to do”: solve quantum gravity, and proceed to shower us with teleporters and antigravity machines. I don’t think there’s any reason to expect that to happen. If you just asked a machine to come up with the most useful model of the universe for a near-term goal, then in all likelihood it wouldn’t consider theoretical high-energy physics at all. If you see your AI as a tool to navigate between utopia and dystopia, theoretical physics might matter at some point: when your AI has devoured the inner solar system, is about to spread beyond the few light-minutes within which it can signal itself in real time, and has to commit to a strategy. But as long as the inner solar system remains un-devoured, I don’t think you’ll see an obviously successful theory of fundamental physics.

AGI Is an Economic Term, Not a Computer Science Term

Since it resonated with the audience, I’ll recap my main argument against AGI here. ‘General intelligence’ is like phlogiston, or the aether. It’s an outmoded scientific concept that does not refer to anything real. Any explanatory work it did can be done better by a richer scientific frame. 1/3

Shannon Vallor (@shannonvallor.bsky.social) 2025-10-02T22:09:06.610Z

I ran into this Bluesky post, and while a lot of the argument resonated with me, I think the author is missing something important.

Shannon Vallor is a philosopher of technology at the University of Edinburgh. She spoke recently at a meeting honoring the 75th anniversary of the Turing Test. The core of her argument, recapped in the Bluesky post, is that artificial general intelligence, or AGI, represents an outdated scientific concept, like phlogiston. While some researchers in the past thought of humans as having a kind of “general” intelligence that a machine would need to replicate, scientists today break down intelligence into a range of capabilities that can be present in different ways. From that perspective, searching for artificial general intelligence doesn’t make much sense: instead, researchers should focus on the particular capabilities they’re interested in.

I have a lot of sympathy for Vallor’s argument, though perhaps from a different direction than what she had in mind. I don’t know enough about intelligence in a biological context to comment there. But from a computer science perspective, intelligence obviously is composed of distinct capabilities. Something that computes, like a human or a machine, can have different amounts of memory, different processing speeds, different input and output rates. In terms of ability to execute algorithms, it can be a Turing machine, or something less than a Turing machine. In terms of the actual algorithms it runs, they can have different scaling for large inputs, and different overhead for small inputs. In terms of learning, one can have better data, or priors that are closer to the ground truth.

These days, all of these Turing-machine-style algorithmic capabilities are in some sense obviously not what the people interested in AGI are after. We already have them in currently-existing computers, after all. Instead, people who pursue AGI, and AI researchers more generally, are interested in heuristics. Humans do certain things without reliable algorithms: instead, we do them faster, but unreliably. And while some human heuristics seem pretty general, it’s widely understood that in the heuristics world there is no free lunch. No heuristic is good for everything, and no heuristic is bad for everything.

So is “general intelligence” a mirage, like phlogiston?

If you think about it as a scientific goal, sure. But as a product, not so much.

Consider a word processor.

Obviously, from a scientific perspective, there are lots of capabilities that involve processing words. Some were things machines could do well before the advent of modern computers: consider typewriters, for instance. Others are still out of reach: after all, we do still pay people to write. (I myself am such a person!)

But at the same time, if I say that a computer program is a word processor, you have a pretty good idea of what that means. There was a time when processing words involved an enormous amount of labor, work done by a large number of specialized people (mostly women). Look at a workplace documentary from the 1960’s, and compare it to a workplace today, and you’ll see that word processor technology has radically changed what tasks people do.

AGI may not make sense as a scientific goal, but it’s perfectly coherent in these terms.

Right now, a lot of tasks are done by what one could broadly call human intelligence. Some of these tasks have already fallen to technology, others will fall one by one. But it’s not unreasonable to think of a package deal, a technology that covers enough of such tasks that human intelligence stops being economically viable. That’s not because there will be some scientific general intelligence that the technology would then have, but because a decent number of intellectual tasks do seem to come bundled together. And you don’t need to cover 100% of human capabilities to radically change workplaces, any more than you needed to cover 100% of the work of a 1960’s secretary with a word processor for modern secretarial work to have a dramatically different scope and role.

It’s worth keeping in mind what is and isn’t scientifically coherent, to be aware that you can’t just extrapolate the idea of general intelligence to any future machine. (For one, it constrains what “superintelligence” could look like.) But that doesn’t mean we should be complacent, and assume that AGI is impossible in principle. AGI, like a word processor, would be a machine that covers a set of tasks well enough that people use it instead of hiring people to do the work by hand. It’s just a broader set of tasks.

Some Dumb AI Ideas

Sometimes, when I write a post about AI, I’ve been sitting on an idea for a long time. I’ve talked to experts, I’ve tried to understand the math, I’ve honed my points and cleared away clutter.

This is not one of those times. The ideas in this post almost certainly have something deeply wrong with them. But hopefully they’re interesting food for thought.

My first dumb idea: instruction tuning was a mistake.

I’m drawing the seeds of this one from a tumblr post by nostalgebraist, someone known for making a popular bot trained on his tumblr posts in the early days before GPT became ChatGPT.

AIs like ChatGPT are based on Large Language Models, insanely complicated mathematical formulas that predict, given part of a text, what the rest of that text is likely to look like. In the early days, this was largely how they were used. Loosely described, nostalgebraist’s bot, called nostalgebraist-autoresponder, began with a list of tumblr posts and asks and determined what additional posts would best fit in.
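To make "predict what the rest of the text looks like" concrete, here is a deliberately tiny sketch. It is not how a Large Language Model or nostalgebraist's bot is built: it only counts which word most often followed each word in a handful of made-up example sentences, and the names `train_bigrams`, `complete`, and `toy_posts` are invented for this illustration.

```python
from collections import Counter, defaultdict

def train_bigrams(texts):
    """Count, for each word, which words followed it in the training texts."""
    following = defaultdict(Counter)
    for text in texts:
        words = text.lower().split()
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following

def complete(following, prompt, max_words=5):
    """Extend the prompt by repeatedly appending the most common next word."""
    words = prompt.lower().split()
    for _ in range(max_words):
        candidates = following.get(words[-1])
        if not candidates:
            break  # this word was never followed by anything in training
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)

# Made-up training texts, standing in for a pile of posts.
toy_posts = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "the cat chased the dog",
]
model = train_bigrams(toy_posts)
completion = complete(model, "the dog saw the", max_words=3)
print(completion)
```

The completion continues the prompt with whatever usually came next in the training texts, whether or not that makes sense for this particular prompt. Real models condition on far more context than one word, but the task has the same shape.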

If you think about it, though, ChatGPT doesn’t really work like that. ChatGPT has conversations: you send it messages, it sends you responses. The text it creates is a dialogue, with you supplying half the input. But most texts aren’t dialogues, and ChatGPT draws on a lot of non-dialogue texts to make its dialogue-like responses.

The reason it does this is something called instruction tuning. ChatGPT has been intentionally biased, not to give the most likely completion to a task in general, but to give completions that fit this dialogue genre. What I didn’t know until I read nostalgebraist’s post was that this genre was defined artificially: AI researchers made up fake dialogues with AI, cheesy sci-fi conversations imagining how an AI might respond to instructions from a user, and then biased the Large Language Model so that rather than giving the most likely text in general, it gives a text that is more likely to look like these cheesy sci-fi conversations. It’s why ChatGPT sounds kind of like a fictional robot: not because sci-fi writers accurately predicted what AI would sound like, but because AI was created based on sci-fi texts.

For nostalgebraist, this leads into an interesting reflection on how a sci-fi AI should behave, and on how being warped around a made-up genre without history or depth creates characters which act according to simple narratives and express surprising anxiety.

For myself, though, I can’t help but wonder if the goal of dialogue itself is the problem. Dialogue is clearly important commercially: people use ChatGPT because they can chat with it. But Large Language Models aren’t inherently chatbots: they produce plausible texts, of any sort you could imagine. People seem to want a machine that can, for example, answer scientific questions as part of a conversation. But most competent answers to scientific questions aren’t conversations, they’re papers. If people stuck with the “raw” model, producing excerpts of nonexistent papers rather than imitating a dialogue with a non-existent expert, wouldn’t you expect the answers to be more accurate, with the model no longer biased by an irrelevant goal? Is the need to make a sell-able chatbot making these AIs worse at everything else people are trying to use them for?

I’m imagining a world where, instead of a chatbot, OpenAI built an “alternate universe simulator”. You give it some context, some texts or parts of texts from a universe you made up, and it completes them in a plausible way. By imagining different universes, you can use it to answer different questions. Such a gimmick would get fewer customers, and fewer investors, it would probably do worse. But I have to wonder if the actual technology might have been more useful.

My second idea is dumber, to the point where I mostly know why it doesn’t work. But thinking about it might help clarify how things work for people unused to AI.

I saw someone point out that, unlike something like Wikipedia, AI doesn’t give you context. You shouldn’t trust Wikipedia, or a source you find on Google, blindly. If you want to, you can look through the edit history on Wikipedia, or figure out who wrote a page you found on Google and how. If ChatGPT tells you something, by default you don’t know where that knowledge came from. You can tell it to search, and then you’ll get links, but that’s because it’s using Google or the like behind the scenes anyway. You don’t know where the model is getting its ideas.

Why couldn’t we get that context, though?

Every text produced by a Large Language Model is causally dependent on its training data. Different data, different model, different text. That doesn’t mean that each text draws from one source, or just a few sources: ChatGPT isn’t copying the training data, at least not so literally.

But it does mean that, if ChatGPT says something is true, you should in principle be able to ask which data was most important in making it say that. If you leave a piece of data out of the training, and get similar answers, you can infer that the response you got doesn’t have much to do with that piece of data. But if you leave out a text in training, and now ChatGPT gives totally different responses to the same question…then there’s a pretty meaningful sense that it got the information from that source.

If this were the type of non-AI statistical model people use in physics, this would be straightforward. Researchers do this all the time: take one experiment out of the data, see how their analysis changes, and thereby figure out which experiments are most important to check. One can even sometimes calculate, given a model, where you should look.
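As a rough illustration of that leave-one-out check, here is a minimal sketch assuming the "analysis" is nothing more than an inverse-variance weighted average of a few measurements. The experiment names and numbers are invented, and real analyses are much more involved.

```python
def weighted_mean(measurements):
    """Inverse-variance weighted mean of {name: (value, uncertainty)}."""
    weights = {name: 1.0 / sigma**2 for name, (_, sigma) in measurements.items()}
    total = sum(weights[name] * value for name, (value, _) in measurements.items())
    return total / sum(weights.values())

def leave_one_out_shifts(measurements):
    """For each experiment, how far the combined result moves when it is dropped."""
    full = weighted_mean(measurements)
    shifts = {}
    for name in measurements:
        rest = {k: v for k, v in measurements.items() if k != name}
        shifts[name] = weighted_mean(rest) - full
    return shifts

# Hypothetical data: (measured value, uncertainty) for three experiments.
experiments = {
    "A": (10.0, 1.0),
    "B": (10.2, 1.0),
    "C": (14.0, 0.5),
}
shifts = leave_one_out_shifts(experiments)
most_influential = max(shifts, key=lambda name: abs(shifts[name]))
print(shifts, most_influential)
```

Dropping experiment C moves the combined value far more than dropping A or B, so C is the one to double-check. Each recomputation here is nearly free, which is what makes the approach practical for a model this small.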

Unfortunately, you can’t do this with ChatGPT. The model is just too big. You can’t calculate anything explicitly about it: the giant mathematical formulas behind it are so complicated that the most you can do is get probabilities out case by case. You can’t “unwind” them and see where the numbers come from. And you can’t just take out sources one by one, and train the model again: not when training takes months of expensive computer time.

So unlike with the previous idea, I understand even on a technical level why you can’t do this. But it helped me to be able to think about what I would like to do, if it were possible. Maybe it helps you too!

Microdosing Vibe Physics

Have you heard of “vibe physics”?

The phrase “vibe coding” came first. People have been using large language models like ChatGPT to write computer code (and not the way I did last year). They chat with the model, describing what they want to do and asking the model to code it up. You can guess the arguments around this, from people who are convinced AI is already better than a human programmer to people sure the code will be riddled with errors and vulnerabilities.

Now, there are people claiming not only to do vibe coding, but vibe physics: doing theoretical physics by chatting with an AI.

I think we can all agree that’s a lot less plausible. Some of the people who do vibe coding actually know how to code, but I haven’t seen anyone claiming to do vibe physics who actually understands physics. They’re tech entrepreneurs in the most prominent cases, random people on the internet otherwise. And while a lot of computer code is a minor tweak on something someone has already done, theoretical physics doesn’t work that way: if someone has already come up with your idea, you’re an educator, not a physicist.

Still, I think there is something to keep in mind about the idea of “vibe physics”, related to where physics comes from.

Here’s a question to start with: go back a bit before the current chat-bot boom. There were a ton of other computational and mathematical tools. Theorem-proving software could encode almost arbitrary mathematical statements in computer code and guarantee their accuracy. Statistical concepts like Bayes’ rule described how to reason from evidence to conclusions, not flawlessly but as well as anyone reliably can. We had computer simulations for a wealth of physical phenomena, and approximation schemes for many others.

With all those tools, why did we still have human physicists?

That is, go back before ChatGPT, before large language models. Why not just code up a program that starts with the evidence and checks which mathematical model fits it best?

In principle, I think you really could have done that. But you could never run that program. It would take too long.

Doing science 100% correctly and reliably is agonizingly slow, and prohibitively expensive. You cannot check every possible model, nor can you check those models against all the available data. You must simplify your problem, somehow, even if it makes your work less reliable, and sometimes incorrect.
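A back-of-the-envelope sketch shows how fast "check every possible model" gets out of hand. It assumes, purely for illustration, that a model is any non-empty combination of terms from a fixed menu and that testing one model costs one second. The term names are placeholders, not a real model space.

```python
from itertools import combinations

def candidate_models(terms):
    """Yield every non-empty combination of terms as one candidate model."""
    for size in range(1, len(terms) + 1):
        yield from combinations(terms, size)

def years_to_check_all(n_terms, seconds_per_model=1.0):
    """Time to test every candidate, at a fixed (assumed) cost per model."""
    n_models = 2**n_terms - 1  # every non-empty subset of the menu
    seconds_per_year = 365.25 * 24 * 3600
    return n_models * seconds_per_model / seconds_per_year

# Placeholder menu of three ingredients a model might include.
small_menu = ["mass term", "quartic coupling", "Yukawa coupling"]
n_small = sum(1 for _ in candidate_models(small_menu))

print(n_small)                  # candidates from a three-item menu
print(years_to_check_all(10))   # a ten-item menu is checked almost instantly
print(years_to_check_all(100))  # a hundred-item menu is hopeless
```

With three ingredients there are seven candidates. With a hundred, the same exhaustive check would take more than 10^22 years under these assumptions, and that is before comparing each candidate against all the available data.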

And for most of history, humans have provided that simplification.

A physicist isn’t going to consider every possible model. They’re going to consider models that are similar to models they studied, or similar to models others propose. They aren’t going to consider all the evidence. They’ll look at some of the evidence, the evidence other physicists are talking about and puzzled by. They won’t simulate the consequences of their hypotheses in exhaustive detail. Instead, they’ll guess, based on their own experience, a calculation that captures what they expect to be relevant.

Human physicists provided the unreliable part of physics, the heuristics. The “vibe physics”, if you will.

AI is also unreliable, also heuristic. But humans still do this better than AI.

Part of the difference is specificity. These AIs are trained on all of human language, and then perhaps fine-tuned on a general class of problems. A human expert has spent their life fine-tuning on one specific type of problem, and their intuitions, their heuristics, their lazy associations and vibes, all will be especially well-suited to problems of that type.

Another part of the difference, though, is scale.

When you talk to ChatGPT, it follows its vibes into paragraphs of text. If you turn on reasoning features, you make it check its work in the background, but it still is generating words upon words inside, evaluating those words, then generating more.

I suspect, for a physicist, the “control loop” is much tighter. Many potential ideas get ruled out a few words in. Many aren’t even expressed in words at all, just concepts. A human physicist is ultimately driven by vibes, but they check and verify those vibes, based on their experience, at a much higher frequency than any current AI system can achieve.

(I know almost nothing about neuroscience. I’m just basing this on what it can feel like, to grope through a sentence and have it assemble itself as it goes into something correct, rather than having to go back and edit it.)

As companies get access to bigger datacenters, I suspect they’ll try to make this loop tighter, to get AI to do something closer to what (I suspect, it appears) humans do. And then maybe AI will be able to do vibe physics.

Even then, though, you should not do vibe physics with the AI.

If you look at the way people describe doing vibe physics, they’re not using the AI for the vibes. They’re providing the vibes, and the AI is supposed to check things.

And that, I can confidently say, is completely ass-backwards. The AI is a vibe machine, it is great at vibes. Substituting your vibes will just make it worse. On the other hand, the AI is awful at checking things. It can find published papers sometimes, which can help you check something. But it is not set up to do the math, at least not unless the math can be phrased as a simple Python script or an IMO problem. In order to do anything like that, it has to call another type of software to verify. And you could have just used that software.

Theoretical physics is still not something everyone can do. Proposing a crackpot theory based on a few papers you found on Google and a couple of YouTube videos may make you feel less confident than proposing one based on praise from ChatGPT and a list of papers it claims have something to do with your idea, and that extra confidence makes the ChatGPT version more tempting. But it’s still proposing a crackpot theory. If you want to get involved, there’s still no substitute for actually learning how physics works.

I’ve Felt Like a Hallucinating LLM

ChatGPT and its kin work by using Large Language Models, or LLMs.

A climate model is a pile of mathematics and code, honed on data from the climate of the past. Tell it how the climate starts out, and it will give you a prediction for what happens next.

Similarly, a language model is a pile of mathematics and code, honed on data from the texts of the past. Tell it how a text starts, and it will give you a prediction for what happens next.

We have a rough idea of what a climate model can predict. The climate has to follow the laws of physics, for example. Similarly, a text should follow the laws of grammar, the order of verbs and nouns and so forth. The creators of the earliest, smallest language models figured out how to do that reasonably well.

Texts do more than just follow grammar, though. They can describe the world. And LLMs are both surprisingly good and surprisingly bad at that. They can do a lot when used right, answering test questions most humans would struggle with. But they also “hallucinate”, confidently saying things that have nothing to do with reality.

If you want to understand why large language models make both good predictions and bad, you shouldn’t just think about abstract “texts”. Instead, think about a specific type of text: a story.

Stories follow grammar, most of the time. But they also follow their own logic. The hero sets out, saves the world, and returns home again. The evil queen falls from the tower at the climax of the final battle. There are three princesses, and only the third can break the spell.

We aren’t usually taught this logic, like we’re taught physics or grammar. We learn it from experience, from reading stories and getting used to patterns. It’s the logic, not of how a story must go, but of how a story typically goes. And that question, of what typically comes next, is exactly the question LLMs are designed to answer.

It’s also a question we sometimes answer.

I was a theatre kid, and I loved improv in particular. Some of it was improv comedy, the games and skits you might have seen on “Whose Line is it Anyway?” But some of it was more…hippy stuff.

I’d meet up with a group on Saturdays. One year we made up a creation myth, half-rehearsed and half-improvised, a collection of gods and primordial beings. The next year we moved the story forward. Civilization had risen…and fallen again. We played a group of survivors gathered around a campfire, wary groups wondering what came next.

We plotted out characters ahead of time. I was the “villain”, or the closest we had to one. An enforcer of the just-fallen empire, the oppressor embodied. While the others carried clubs, staves, and farm implements, I was the only one with a real weapon: a sword.

(Plastic in reality, but the audience knew what to do.)

In the arguments and recriminations of the story, that sword set me apart, a constant threat that turned my character from contemptible to dangerous, that gave me a seat at the table even as I antagonized and stirred the pot.

But the story had another direction. The arguments pushed and pulled, and gradually the survivors realized that they would not survive if they did not put their grievances to rest, if they did not seek peace. So, one man stepped forward, and tossed his staff into the fire.

The others followed. One by one, clubs and sticks and menacing tools were cast aside. And soon, I was the only one armed.

If I had been behaving logically, if I had followed my character’s interests, I would have “won” there. I had gotten what I wanted: now there was no check on my power.

But that wasn’t what the story wanted. Improv is a game of fast decisions and fluid invention. We follow our instincts, and our instincts are shaped by experience. The stories of the past guide our choices, and must often be the only guide: we don’t have time to edit, or to second-guess.

And I felt the story, and what it wanted. It was a command that transcended will, that felt like it left no room for an individual actor making an individual decision.

I cast my sword into the fire.

The instinct that brought me to do that is the same instinct that guides authors when they say that their characters write themselves, when their story goes in an unexpected direction. It’s an instinct that can be tempered and counteracted, with time and effort, because it can easily lead to nonsense. It’s why every good book needs an editor, why improv can be as repetitive as it is magical.

And it’s been the best way I’ve found to understand LLMs.

An LLM telling a story tells a typical story, based on the data used to create it. In the same way, an LLM giving advice gives typical advice, to some extent in content but more importantly in form, advice that is confident and mentions things advice often mentions. An LLM writing a biography will write a typical biography, which may not be your biography, even if your biography was one of those used to create it, because it tries to predict how a biography should go based on all the other biographies. And all of these predictions and hallucinations are very much the kind of snap judgement that disarmed me.

These days, people are trying to build on top of LLMs and make technology that does more, that can edit and check its decisions. For the most part, they’re building these checks out of LLMs. Instead of telling one story, of someone giving advice on the internet, they tell two stories: the advisor and the editor, one giving the advice and one correcting it. They have to tell these stories many times, broken up into many parts, to approximate something other than the improv actor’s first instincts, and that’s why software that does this is substantially more expensive than more basic software that doesn’t.
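The advisor-and-editor setup can be sketched as a simple loop. This is only an illustration of the general shape, not any particular product's design: `toy_advisor` and `toy_editor` are hard-coded stand-ins for what would, in a real system, be two separate calls to a language model.

```python
def advise_with_editor(question, advisor, editor, max_rounds=3):
    """Alternate drafting and checking until the editor has no complaints.

    Returns the final draft and the number of model calls it took.
    """
    calls = 0
    draft = advisor(question, feedback=None)
    calls += 1
    for _ in range(max_rounds):
        feedback = editor(question, draft)
        calls += 1
        if feedback is None:  # the editor found nothing to fix
            break
        draft = advisor(question, feedback=feedback)
        calls += 1
    return draft, calls

# Hypothetical stand-ins for the two "stories" the model would tell.
def toy_advisor(question, feedback=None):
    if feedback is None:
        return "Definitely yes."  # the confident first instinct
    return "Probably yes, but check the original source."

def toy_editor(question, draft):
    if draft.startswith("Definitely"):
        return "Too confident; add a caveat."
    return None

answer, calls = advise_with_editor("Is this claim true?", toy_advisor, toy_editor)
print(answer, calls)
```

Even in this toy version, one corrected answer costs four calls instead of one, which gives a feel for where the extra expense comes from.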

I can’t say how far they’ll get. Models need data to work well, decisions need reliability to be good, computers need infrastructure to compute. But if you want to understand what’s at an LLM’s beating heart, think about the first instincts you have in writing or in theatre, in stories or in play. Then think about a machine that just does that.