Category Archives: Machine Learning

About the OpenAI Amplitudes Paper, but Not as Much as You’d Like

I’ve had a bit more time to dig into the paper I mentioned last week, where OpenAI collaborated with amplitudes researchers, using one of their internal models to find and prove a simplified version of a particle physics formula. I figured I’d say a bit about my own impressions from reading the paper and OpenAI’s press release.

This won’t be a real “deep dive”, though it will be long nonetheless. As it turns out, most of the questions I’d like answers to aren’t answered in the paper or the press release. Getting them will involve actual journalistic work, i.e. blocking off time to interview people, and I haven’t done that yet. What I can do is talk about what I know so far, and what I’m still wondering.

Context:

Scattering amplitudes are formulas used by particle physicists to make predictions. For a while, people would just calculate these when they needed them, writing down pages of mess that you could plug numbers into to get answers. However, forty years ago two physicists decided they wanted more, writing “we hope to obtain a simplified form for the answer, making our result not only an experimentalist’s, but a theorist’s delight.”

In their next paper, they managed to find that “theorist’s delight”: a simplified, intuitive-looking answer that worked for calculations involving any number of particles, summarizing many different calculations. Ten years later, a few people had started building on it, and ten years after that, the big shots started paying attention. A whole subfield, “amplitudeology”, grew from that seed, finding new forms of “theorist’s delight” in scattering amplitudes.

Each subfield has its own kind of “theory of victory”, its own concept for what kind of research is most likely to yield progress. In amplitudes, it’s these kinds of simplifications. When they work out well, they yield new, more efficient calculation techniques, yielding new messy results which can be simplified once more. To one extent or another, most of the field is chasing after those situations when simplification works out well.

That motivation shapes both the most ambitious projects of senior researchers, and the smallest student projects. Students often spend enormous amounts of time looking for a nice formula for something and figuring out how to generalize it, often on a question suggested by a senior researcher. These projects mostly serve as training, but occasionally manage to uncover something more impressive and useful, an idea others can build around.

I’m mentioning all of this, because as far as I can tell, what ChatGPT and the OpenAI internal model contributed here roughly lines up with the roles students have on amplitudes papers. In fact, it’s not that different from the role one of the authors, Alfredo Guevara, had when I helped mentor him during his Master’s.

Senior researchers noticed something unusual, suggested by prior literature. They decided to work out the implications, did some calculations, and got some messy results. It wasn’t immediately clear how to clean up the results, or generalize them. So they waited, and eventually were contacted by someone eager for a research project, who did the work to get the results into a nice, general form. Then everyone publishes together on a shared paper.

How impressed should you be?

I said, “as far as I can tell” above. What’s annoying is that this paper makes it hard to tell.

If you read through the paper, they mention AI briefly in the introduction, saying they used GPT-5.2 Pro to conjecture formula (39) in the paper, and an OpenAI internal model to prove it. The press release actually goes into more detail, saying that the humans found formulas (29)-(32), and GPT-5.2 Pro found a special case where it could simplify them to formulas (35)-(38), before conjecturing (39). You can get even more detail from an X thread by one of the authors, OpenAI Research Scientist Alex Lupsasca. Alex had done his PhD with another one of the authors, Andrew Strominger, and was excited to apply the tools he was developing at OpenAI to his old research field. So they looked for a problem, and tried out the one that ended up in the paper.

What is missing, from the paper, press release, and X thread, is any real detail about how the AI tools were used. We don’t have the prompts, or the output, or any real way to assess how much input came from humans and how much from the AI.

(We have more for their follow-up paper, where Lupsasca posted a transcript of the chat.)

Contra some commentators, I don’t think the authors are being intentionally vague here. They’re following business as usual. In a theoretical physics paper, you don’t list who did what, or take detailed account of how you came to the results. You clean things up, and create a nice narrative. This goes double if you’re aiming for one of the most prestigious journals, which tend to have length limits.

This business-as-usual approach is OK, if frustrating, for the average physics paper. It is, however, entirely inappropriate for a paper showcasing emerging technologies. For a paper that OpenAI was going to promote this heavily, the question of how they reached their conclusion is much more interesting than the results themselves. And while I wouldn’t ask them to go to the standards of an actual AI paper, with ablation analysis and all that jazz, they could at least have aimed for the level of detail of my final research paper, which gave samples of the AI input and output used in its genetic algorithm.

For the moment, then, I have to guess what input the AI had, and what it actually accomplished.

Let’s focus on the work done by the internal OpenAI model. The descriptions I’ve seen suggest that it started where GPT-5.2 Pro did, with formulas (29)-(32), but with a more specific prompt that guided what it was looking for. It then ran for 12 hours with no additional input, and both conjectured (39) and proved it was correct, providing essentially the proof that follows formula (39) in the paper.

Given that, how impressed should we be?

First, the model needs to decide to go to a specialized region, instead of trying to simplify the formula in full generality. I don’t know whether they prompted their internal model explicitly to do this. It’s not something I’d expect a student to do, because students don’t know what types of results are interesting enough to get published, so they wouldn’t be confident in computing only a limited version of a result without an advisor telling them it was ok. On the other hand, it is actually something I’d expect an LLM to be unusually likely to do, as a result of not managing to consistently stick to the original request! What I don’t know is whether the LLM proposed this for the right reason: that if you have the formula for one region, you can usually find it for other regions.

Second, the model needs to take formulas (29)-(32), write them in the specialized region, and simplify them to formulas (35)-(38). I’ve seen a few people saying you can do this pretty easily with Mathematica. That’s true, though not every senior researcher is comfortable doing that kind of thing, as you need to be a bit smarter than just using the Simplify[] command. Most of the people on this paper strike me as pen-and-paper types who wouldn’t necessarily know how to do that. It’s definitely the kind of thing I’d expect most students to figure out, perhaps after a couple of weeks of flailing around if it’s their first crack at it. The LLM likely would not have used Mathematica, but would have used SymPy, since these “AI scientist” setups usually can write and execute Python code. You shouldn’t think of this as the AI reasoning through the calculation itself, but it at least sounds like it was reasonably quick at coding it up.

Then, the model needs to conjecture formula (39). This gets highlighted in the intro, but as many have pointed out, it’s pretty easy to do. If any non-physicists are still reading at this point, take a look:

Could you guess (39) from (35)-(38)?

After that, the paper goes over the proof that formula (39) is correct. Most of this proof isn’t terribly difficult, but the way it begins is actually unusual in an interesting way. The proof uses ideas from time-ordered perturbation theory, an old-fashioned way to do particle physics calculations. Time-ordered perturbation theory isn’t something any of the authors are known for using with regularity, but it has recently seen a resurgence in another area of amplitudes research, showing up for example in papers by Matthew Schwartz, a colleague of Strominger at Harvard.

If a student of Strominger came up with an idea drawn from time-ordered perturbation theory, that would actually be pretty impressive. It would mean that, rather than just learning from their official mentor, this student was talking to other people in the department and broadening their horizons, showing a kind of initiative that theoretical physicists value a lot.

From an LLM, though, this is not impressive in the same way. The LLM was not trained by Strominger, it did not learn specifically from Strominger’s papers. Its context suggested it was working on an amplitudes paper, and it produced an idea which would be at home in an amplitudes paper, just a different one than the one it was working on.

While not impressive, that capability may be quite useful. Academic subfields can often get very specialized and siloed. A tool that suggests ideas from elsewhere in the field could help some people broaden their horizons.

Overall, it appears that the twelve-hour OpenAI internal model run reproduced roughly what an unusually bright student would be able to contribute over the course of a several-month project. Like most student projects, you could find a senior researcher who could do the project much faster, maybe even faster than the LLM. But it’s unclear whether any of the authors could have: different senior researchers have different skillsets.

A stab at implications:

If we take all this at face value, it looks like OpenAI’s internal model was able to do a reasonably competent student project with no serious mistakes in twelve hours. If they started selling that capability, what would happen?

If it’s cheap enough, you might wonder if professors would choose to use the OpenAI model instead of hiring students. I don’t think this would happen, though: I think it misunderstands why these kinds of student projects exist in a theoretical field. Professors sometimes use students to get results they care about, but more often, the student’s interest is itself the motivation, with the professor wanting to educate someone, to empire-build, or just to take on their share of the department’s responsibilities. AI is only useful for this insofar as AI companies continue reaching out to these people to generate press releases: once this is routinely possible, the motivation goes away.

More dangerously, if it’s even cheaper, you could imagine students being tempted to use it. The whole point of a student project is to train and acculturate the student, to get them to the point where they have affection for the field and the capability to do more impressive things. You can’t skip that, but people are going to be tempted to.

And of course, there is the broader question of how much farther this technology can go. That’s the hardest to estimate here, since we don’t know the prompts used. So I don’t know if seeing this result tells us anything more about the bigger picture than we knew going in.

Remaining questions:

At the end of the day, there are a lot of things I still want to know. And if I do end up covering this professionally, they’re things I’ll ask.

  1. What was the prompt given to the internal model, and how much did it do based on that prompt?
  2. Was it really done in one shot, no retries or feedback?
  3. How much did running the internal model cost?
  4. Is this result likely to be useful? Are there things people want to calculate that this could make easier? Recursion relations it could seed? Is it useful for SCET somehow?
  5. How easy would it have been for the authors to do what the LLM did? What about other experts in the community?

Hypothesis: If AI Is Bad at Originality, It’s a Documentation Problem

Recently, a few people have asked me about this paper.

A couple weeks back, OpenAI announced a collaboration with a group of amplitudes researchers, physicists who study the types of calculations people do to make predictions at particle colliders. The amplitudes folks had identified an interesting loophole, finding a calculation that many would have expected to be zero actually gave a nonzero answer. They did the calculation for different examples involving more and more particles, and got some fairly messy answers. They suspected, as amplitudes researchers always expect, that there was a simpler formula, one that worked for any number of particles. But they couldn’t find it.

Then a former amplitudes researcher at OpenAI suggested that they use AI to find it.

“Use AI” can mean a lot of different things, and most of them don’t look much like the way the average person talks to ChatGPT. This was closer than most. They were using “reasoning models”, loops that try to predict the next few phrases in a “chain of thought” again and again and again. Using that kind of tool, they were able to find that simpler formula, and mathematically prove that it was correct.

A few of you are hoping for an in-depth post about what they did, and its implications. This isn’t that. I’m still figuring out if I’ll be writing that for an actual news site, for money, rather than free, for you folks.

Instead, I want to talk about a specific idea I’ve seen crop up around the paper.

See, for some, the existence of a result like this isn’t all that surprising.

Mathematicians have been experimenting with reasoning models for a bit, now. Recently, a group published a systematic study, setting the AI loose on a database of minor open problems proposed by the famously amphetamine-fueled mathematician Paul Erdős. The AI managed to tackle a few of the problems, sometimes by identifying existing solutions that had not yet been linked to the problem database, but sometimes by proofs that appeared to be new.

The Erdős problems solved by the AI were not especially important. Neither was the problem solved by the amplitudes researchers, as far as I can tell at this point.

But I get the impression the amplitudes problem was a bit more interesting than the Erdős problems. The difference, so far, has mostly been attributed to human involvement. This amplitudes paper started because human amplitudes researchers found an interesting loophole, and only after that used the AI. Unlike the mathematicians, they weren’t just searching a database.

This lines up with a general point, one people tend to make much less carefully. It’s often said that, unlike humans, AI will never be truly creative. It can solve mechanical problems, do things people have done before, but it will never be good at having truly novel ideas.

To me, that line of thinking goes a bit too far. I suspect it’s right on one level, that it will be hard for any of these reasoning models to propose anything truly novel. But if so, I think it will be for a different reason.

The thing is, creativity is not as magical as we make it out to be. Our ideas, scientific or artistic, don’t just come from the gods. They recombine existing ideas, shuffling them in ways more akin to randomness than miracle. They’re then filtered through experience, deep heuristics honed over careers. Some people are good at ideas, and some are bad at them. Having ideas takes work, and there are things people do to improve their ideas. Nothing about creativity suggests it should be impossible to mechanize.

However, a machine trained on text won’t necessarily know how to do any of that.

That’s because in science, we don’t write down our inspirations. By the time a result gets into a scientific paper or textbook, it’s polished and refined into a pure argument, cutting out most of the twists and turns that were an essential part of the creative process. Mathematics is even worse: most math papers don’t even mention the motivation behind the work, let alone the path taken to the paper.

This lack of documentation makes it hard for students, making success much more a function of having the right mentors to model good practices, rather than being able to pick them up from literature everyone can access. I suspect it makes it even harder for language models. And if today’s language model-based reasoning tools are bad at that crucial, human-seeming step, of coming up with the right idea at the right time? I think that has more to do with this lack of documentation than with the fact that they’re “stochastic parrots”.

The Timeline for Replacing Theorists Is Not Technological

Quanta Magazine recently published a reflection by Natalie Wolchover on the state of fundamental particle physics. The discussion covers a lot of ground, but one particular paragraph has gotten the lion’s share of the attention. Wolchover talked to Jared Kaplan, the ex-theoretical physicist turned co-founder of Anthropic, one of the foremost AI companies today.

Kaplan was one of Nima Arkani-Hamed’s PhD students, which adds an extra little punch.

There’s a lot to contest here. Is AI technology anywhere close to generating papers as good as the top physicists, or is that relegated to the sci-fi future? Does Kaplan really believe this, or is he just hyping up his company?

I don’t have any special insight into those questions, about the technology and Kaplan’s motivations. But I think that, even if we trusted him on the claim that AI could be generating Witten- or Nima-level papers in three years, that doesn’t mean it will replace theoretical physicists. That part of the argument isn’t a claim about the technology, but about society.

So let’s take the technological claims as given, and make them a bit more specific. Since we don’t have any objective way of judging the quality of scientific papers, let’s stick to the subjective. Today, there are a lot of people who get excited when Witten posts a new paper. They enjoy reading them, they find the insights inspiring, they love the clarity of the writing and their tendency to clear up murky ideas. They also find them reliable: the papers very rarely have mistakes, and don’t leave important questions unanswered.

Let’s use that as our baseline, then. Suppose that Anthropic had an AI workflow that could reliably write papers that were just as appealing to physicists as Witten’s papers are, for the same reasons. What happens to physicists?

Witten himself is retired, which for an academic means you do pretty much the same thing you were doing before, but now paid out of things like retirement savings and pension funds, not an institute budget. Nobody is going to fire Witten, there’s no salary to fire him from. And unless he finds these developments intensely depressing and demoralizing (possible, but very much depends on how this is presented), he’s not going to stop writing papers. Witten isn’t getting replaced.

More generally, though, I don’t think this directly results in anyone getting fired, or in universities trimming positions. The people making funding decisions aren’t just sitting on a pot of money, trying to maximize research output. They’ve got money to be spent on hires, and different pools of money to be spent on equipment, and the hires get distributed based on what current researchers at the institutes think is promising. Universities want to hire people who can get grants, to help fund the university, and absent rules about AI personhood, the AIs won’t be applying for grants.

Funding cuts might be argued for based on AI, but that will happen long before AI is performing at the Witten level. We already see this happening in other industries or government agencies, where groups that already want to cut funding are getting think tanks and consultants to write estimates that justify cutting positions, without actually caring whether those estimates are performed carefully enough to justify their conclusions. That can happen now, and doesn’t depend on technological progress.

AI could also replace theoretical physicists in another sense: the physicists themselves might use AI to do most of their work. That’s more plausible, but here adoption still heavily depends on social factors. Will people feel like they are being assessed on whether they can produce these Witten-level papers, and that only those who make them get hired, or funded? Maybe. But it will propagate unevenly, from subfield to subfield. Some areas will make their own rules forbidding AI content, and there will be battles and scandals and embarrassments aplenty. It won’t be a single switch, with the technology alone setting the timeline.

Finally, AI could replace theoretical physicists in another way, by people outside of academia filling the field so much that theoretical physicists have nothing more that they want to do. Some non-physicists are very passionate about physics, and some of those people have a lot of money. I’ve done writing work for one such person, whose foundation is now attempting to build an AI Physicist. If these AI Physicists get to Witten-level quality, they might start writing compelling paper after compelling paper. Because of their origins, though, those papers will be specialized. Much as philanthropists mostly fund the subfields they’ve heard of, philanthropist-funded AI will mostly target topics the people running the AI have heard are important. Much like physicists themselves adopting the technology, there will be uneven progress from subfield to subfield, inch by socially-determined inch.

In a hard-to-quantify area like progress in science, that’s all you can hope for. I suspect Kaplan got a bit of a distorted picture of how progress and merit work in theoretical physics. He studied with Nima Arkani-Hamed, who is undeniably exceptionally brilliant but also undeniably exceptionally charismatic. It must feel to a student of Nima’s that academia simply hires the best people, that it does whatever it takes to accomplish the obviously best research. But the best research is not obvious.

I think some of these people imagine a more direct replacement process, not arranged by topic and tastes, but by goals. They picture AI sweeping in and doing what theoretical physics was always “meant to do”: solve quantum gravity, and proceed to shower us with teleporters and antigravity machines. I don’t think there’s any reason to expect that to happen. If you just asked a machine to come up with the most useful model of the universe for a near-term goal, then in all likelihood it wouldn’t consider theoretical high-energy physics at all. If you see your AI as a tool to navigate between utopia and dystopia, theoretical physics might matter at some point: when your AI has devoured the inner solar system, is about to spread beyond the few light-minutes when it can signal itself in real-time, and has to commit to a strategy. But as long as the inner solar system remains un-devoured, I don’t think you’ll see an obviously successful theory of fundamental physics.

AGI Is an Economic Term, Not a Computer Science Term

Since it resonated with the audience, I’ll recap my main argument against AGI here. ‘General intelligence’ is like phlogiston, or the aether. It’s an outmoded scientific concept that does not refer to anything real. Any explanatory work it did can be done better by a richer scientific frame. 1/3

Shannon Vallor (@shannonvallor.bsky.social) 2025-10-02T22:09:06.610Z

I ran into this Bluesky post, and while a lot of the argument resonated with me, I think the author is missing something important.

Shannon Vallor is a philosopher of technology at the University of Edinburgh. She spoke recently at a meeting honoring the 75th anniversary of the Turing Test. The core of her argument, recapped in the Bluesky post, is that artificial general intelligence, or AGI, represents an outdated scientific concept, like phlogiston. While some researchers in the past thought of humans as having a kind of “general” intelligence that a machine would need to replicate, scientists today break down intelligence into a range of capabilities that can be present in different ways. From that perspective, searching for artificial general intelligence doesn’t make much sense: instead, researchers should focus on the particular capabilities they’re interested in.

I have a lot of sympathy for Vallor’s argument, though perhaps from a different direction than what she had in mind. I don’t know enough about intelligence in a biological context to comment there. But from a computer science perspective, intelligence obviously is composed of distinct capabilities. Something that computes, like a human or a machine, can have different amounts of memory, different processing speeds, different input and output rates. In terms of ability to execute algorithms, it can be a Turing machine, or something less than a Turing machine. In terms of the actual algorithms it runs, they can have different scaling for large inputs, and different overhead for small inputs. In terms of learning, one can have better data, or priors that are closer to the ground truth.

These days, all of these Turing machine algorithm capabilities are in some sense obviously not what the people interested in AGI are after. We already have them in currently-existing computers, after all. Instead, people who pursue AGI, and AI researchers more generally, are interested in heuristics. Humans do certain things without reliable algorithms: instead, we do them faster, but unreliably. And while some human heuristics seem pretty general, it’s widely understood that in the heuristics world there is no free lunch. No heuristic is good for everything, and no heuristic is bad for everything.

So is “general intelligence” a mirage, like phlogiston?

If you think about it as a scientific goal, sure. But as a product, not so much.

Consider a word processor.

Obviously, from a scientific perspective, there are lots of capabilities that involve processing words. Some were things machines could do well before the advent of modern computers: consider typewriters, for instance. Others are still out of reach: after all, we do still pay people to write. (I myself am such a person!)

But at the same time, if I say that a computer program is a word processor, you have a pretty good idea of what that means. There was a time when processing words involved an enormous amount of labor, work done by a large number of specialized people (mostly women). Look at a workplace documentary from the 1960’s, and compare it to a workplace today, and you’ll see that word processor technology has radically changed what tasks people do.

AGI may not make sense as a scientific goal, but it’s perfectly coherent in these terms.

Right now, a lot of tasks are done by what one could broadly call human intelligence. Some of these tasks have already fallen to technology, others will fall one by one. But it’s not unreasonable to think of a package deal, a technology that covers enough of such tasks that human intelligence stops being economically viable. That’s not because there will be some scientific general intelligence that the technology would then have, but because a decent number of intellectual tasks do seem to come bundled together. And you don’t need to cover 100% of human capabilities to radically change workplaces, any more than you needed to cover 100% of the work of a 1960’s secretary with a word processor for modern secretarial work to have a dramatically different scope and role.

It’s worth keeping in mind what is and isn’t scientifically coherent, to be aware that you can’t just extrapolate the idea of general intelligence to any future machine. (For one, it constrains what “superintelligence” could look like.) But that doesn’t mean we should be complacent, and assume that AGI is impossible in principle. AGI, like a word processor, would be a machine that covers a set of tasks well enough that people use it instead of hiring people to do the work by hand. It’s just a broader set of tasks.

Some Dumb AI Ideas

Sometimes, when I write a post about AI, I’ve been sitting on an idea for a long time. I’ve talked to experts, I’ve tried to understand the math, I’ve honed my points and cleared away clutter.

This is not one of those times. The ideas in this post almost certainly have something deeply wrong with them. But hopefully they’re interesting food for thought.

My first dumb idea: instruction tuning was a mistake.

I’m drawing the seeds of this one from a tumblr post by nostalgebraist, someone known for making a popular bot trained on his tumblr posts in the early days before GPT became ChatGPT.

AIs like ChatGPT are based on Large Language Models, insanely complicated mathematical formulas that predict, given part of a text, what the rest of that text is likely to look like. In the early days, this was largely how they were used. Loosely described, nostalgebraist’s bot, called nostalgebraist-autoresponder, began with a list of tumblr posts and asks and determined what additional posts would best fit in.
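To make “predict the rest of the text” concrete, here is a deliberately tiny sketch. It is not how a Large Language Model works internally (those use neural networks trained on vastly more text); it is a word-pair counter, with a made-up three-sentence corpus and hypothetical function names, that only illustrates the idea of continuing a text with whatever most often came next in the training data.

```python
from collections import Counter, defaultdict


def train_bigrams(corpus):
    """Count, for each word, which words follow it in the corpus."""
    follows = defaultdict(Counter)
    for text in corpus:
        words = text.lower().split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows


def complete(follows, prompt, max_words=5):
    """Greedily extend the prompt with the most frequent next word."""
    words = prompt.lower().split()
    if not words:
        return ""
    for _ in range(max_words):
        candidates = follows.get(words[-1])
        if not candidates:
            break  # this word never had a successor in the training text
        # Pick the most frequent successor; break ties alphabetically.
        best = min(candidates, key=lambda w: (-candidates[w], w))
        words.append(best)
    return " ".join(words)


# Toy "training data", invented purely for illustration.
corpus = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the dog sat on the rug",
]
model = train_bigrams(corpus)
completion = complete(model, "the dog", max_words=3)
print(completion)
```

Given “the dog”, the toy model continues with words that followed “dog”, then “sat”, then “on” in its corpus. Everything it says is a recombination of what it was trained on, which is the point of the sketch.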

If you think about it, though, ChatGPT doesn’t really work like that. ChatGPT has conversations: you send it messages, it sends you responses. The text it creates is a dialogue, with you supplying half the input. But most texts aren’t dialogues, and ChatGPT draws on a lot of non-dialogue texts to make its dialogue-like responses.

The reason it does this is something called instruction tuning. ChatGPT has been intentionally biased, not to give the most likely completion to a task in general, but to give completions that fit this dialogue genre. What I didn’t know until I read nostalgebraist’s post was that this genre was defined artificially: AI researchers made up fake dialogues with AI, cheesy sci-fi conversations imagining how an AI might respond to instructions from a user, and then biased the Large Language Model so that rather than giving the most likely text in general, it gives a text that is more likely to look like these cheesy sci-fi conversations. It’s why ChatGPT sounds kind of like a fictional robot: not because sci-fi writers accurately predicted what AI would sound like, but because AI was created based on sci-fi texts.

For nostalgebraist, this leads into an interesting reflection of how a sci-fi AI should behave, how being warped around a made-up genre without history or depth creates characters which act according to simple narratives and express surprising anxiety.

For myself, though, I can’t help but wonder if the goal of dialogue itself is the problem. Dialogue is clearly important commercially: people use ChatGPT because they can chat with it. But Large Language Models aren’t inherently chatbots: they produce plausible texts, of any sort you could imagine. People seem to want a machine that can, for example, answer scientific questions as part of a conversation. But most competent answers to scientific questions aren’t conversations, they’re papers. If people stuck with the “raw” model, producing excerpts of nonexistent papers rather than imitating a dialogue with a non-existent expert, wouldn’t you expect the answers to be more accurate, with the model no longer biased by an irrelevant goal? Is the need to make a sell-able chatbot making these AIs worse at everything else people are trying to use them for?

I’m imagining a world where, instead of a chatbot, OpenAI built an “alternate universe simulator”. You give it some context, some texts or parts of texts from a universe you made up, and it completes them in a plausible way. By imagining different universes, you can use it to answer different questions. Such a gimmick would get fewer customers and fewer investors, so it would probably do worse. But I have to wonder if the actual technology might have been more useful.

My second idea is dumber, to the point where I mostly know why it doesn’t work. But thinking about it might help clarify how things work for people unused to AI.

I saw someone point out that, unlike something like Wikipedia, AI doesn’t give you context. You shouldn’t trust Wikipedia, or a source you find on Google, blindly. If you want to, you can look through the edit history on Wikipedia, or figure out who wrote a page you found on Google and how. If ChatGPT tells you something, by default you don’t know where that knowledge came from. You can tell it to search, and then you’ll get links, but that’s because it’s using Google or the like behind the scenes anyway. You don’t know where the model is getting its ideas.

Why couldn’t we get that context, though?

Every text produced by a Large Language Model is causally dependent on its training data. Different data, different model, different text. That doesn’t mean that each text draws from one source, or just a few sources: ChatGPT isn’t copying the training data, at least not so literally.

But it does mean that, if ChatGPT says something is true, you should in principle be able to ask which data was most important in making it say that. If you leave a piece of data out of the training, and get similar answers, you can infer that the response you got doesn’t have much to do with that piece of data. But if you leave out a text in training, and now ChatGPT gives totally different responses to the same question…then there’s a pretty meaningful sense that it got the information from that source.

If this were the type of non-AI statistical model people use in physics, this would be straightforward. Researchers do this all the time: take one experiment out of the data, see how their analysis changes, and thereby figure out which experiments are most important to check. One can even sometimes calculate, given a model, where you should look.
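As a rough illustration of that kind of check, here is a toy sketch in Python. The experiment names and numbers are made up, and the inverse-variance weighted average is just a stand-in for whatever analysis a real group would run. It is a sketch of the leave-one-out idea, not anyone's actual pipeline.

```python
# Toy leave-one-out check: which measurement moves the combined result most?
# Experiment names and numbers are invented for illustration.

def weighted_mean(measurements):
    """Inverse-variance weighted average of (value, uncertainty) pairs."""
    weights = [1.0 / sigma**2 for _, sigma in measurements.values()]
    values = [value for value, _ in measurements.values()]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def leave_one_out_shifts(measurements):
    """Redo the analysis with each measurement removed; record the shift."""
    full = weighted_mean(measurements)
    shifts = {}
    for name in measurements:
        rest = {k: v for k, v in measurements.items() if k != name}
        shifts[name] = weighted_mean(rest) - full
    return shifts

measurements = {
    "experiment_A": (10.0, 1.0),
    "experiment_B": (10.0, 1.0),
    "experiment_C": (13.0, 1.0),
}

shifts = leave_one_out_shifts(measurements)
most_influential = max(shifts, key=lambda name: abs(shifts[name]))
print(most_influential, shifts[most_influential])
```

Here removing the outlying measurement shifts the combined value the most, so that is the one you would want to double-check first. Each "retraining" costs almost nothing, which is what makes the approach practical for a model this small.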

Unfortunately, you can’t do this with ChatGPT. The model is just too big. You can’t calculate anything explicitly about it: the giant mathematical formulas behind it are so complicated that the most you can do is get probabilities out case by case. You can’t “unwind” them and see where the numbers come from. And you can’t just take out sources one by one and train the model again, not when training takes months of expensive computer time.

So unlike with the previous idea, I understand even on a technical level why you can’t do this. But it helped me to be able to think about what I would like to do, if it were possible. Maybe it helps you too!

Microdosing Vibe Physics

Have you heard of “vibe physics”?

The phrase “vibe coding” came first. People have been using large language models like ChatGPT to write computer code (and not the way I did last year). They chat with the model, describing what they want to do and asking the model to code it up. You can guess the arguments around this, from people who are convinced AI is already better than a human programmer to people sure the code will be riddled with errors and vulnerabilities.

Now, there are people claiming not only to do vibe coding, but vibe physics: doing theoretical physics by chatting with an AI.

I think we can all agree that’s a lot less plausible. Some of the people who do vibe coding actually know how to code, but I haven’t seen anyone claiming to do vibe physics who actually understands physics. They’re tech entrepreneurs in the most prominent cases, random people on the internet otherwise. And while a lot of computer code is a minor tweak on something someone has already done, theoretical physics doesn’t work that way: if someone has already come up with your idea, you’re an educator, not a physicist.

Still, I think there is something to keep in mind about the idea of “vibe physics”, related to where physics comes from.

Here’s a question to start with: go back a bit before the current chat-bot boom. There were a ton of other computational and mathematical tools. Theorem-proving software could encode almost arbitrary mathematical statements in computer code and guarantee their accuracy. Statistical concepts like Bayes’ rule described how to reason from evidence to conclusions, not flawlessly but as well as anyone reliably can. We had computer simulations for a wealth of physical phenomena, and approximation schemes for many others.

With all those tools, why did we still have human physicists?

That is, go back before ChatGPT, before large language models. Why not just code up a program that starts with the evidence and checks which mathematical model fits it best?

In principle, I think you really could have done that. But you could never run that program. It would take too long.

Doing science 100% correctly and reliably is agonizingly slow, and prohibitively expensive. You cannot check every possible model, nor can you check those models against all the available data. You must simplify your problem, somehow, even if it makes your work less reliable, and sometimes incorrect.

And for most of history, humans have provided that simplification.

A physicist isn’t going to consider every possible model. They’re going to consider models that are similar to models they studied, or similar to models others propose. They aren’t going to consider all the evidence. They’ll look at some of the evidence, the evidence other physicists are talking about and puzzled by. They won’t simulate the consequences of their hypotheses in exhaustive detail. Instead, they’ll guess, based on their own experience, a calculation that captures what they expect to be relevant.

Human physicists provided the unreliable part of physics, the heuristics. The “vibe physics”, if you will.

AI is also unreliable, also heuristic. But humans still do this better than AI.

Part of the difference is specificity. These AIs are trained on all of human language, and then perhaps fine-tuned on a general class of problems. A human expert has spent their life fine-tuning on one specific type of problem, and their intuitions, their heuristics, their lazy associations and vibes, all will be especially well-suited to problems of that type.

Another part of the difference, though, is scale.

When you talk to ChatGPT, it follows its vibes into paragraphs of text. If you turn on reasoning features, you make it check its work in the background, but it still is generating words upon words inside, evaluating those words, then generating more.

I suspect, for a physicist, the “control loop” is much tighter. Many potential ideas get ruled out a few words in. Many aren’t even expressed in words at all, just concepts. A human physicist is ultimately driven by vibes, but they check and verify those vibes, based on their experience, at a much higher frequency than any current AI system can achieve.

(I know almost nothing about neuroscience. I’m just basing this on what it can feel like, to grope through a sentence and have it assemble itself as it goes into something correct, rather than having to go back and edit it.)

As companies get access to bigger datacenters, I suspect they’ll try to make this loop tighter, to get AI to do something closer to what (I suspect, it appears) humans do. And then maybe AI will be able to do vibe physics.

Even then, though, you should not do vibe physics with the AI.

If you look at the way people describe doing vibe physics, they’re not using the AI for the vibes. They’re providing the vibes, and the AI is supposed to check things.

And that, I can confidently say, is completely ass-backwards. The AI is a vibe machine, it is great at vibes. Substituting your vibes will just make it worse. On the other hand, the AI is awful at checking things. It can find published papers sometimes, which can help you check something. But it is not set up to do the math, at least not unless the math can be phrased as a simple Python script or an IMO problem. In order to do anything like that, it has to call another type of software to verify. And you could have just used that software.

Theoretical physics is still not something everyone can do. Proposing a crackpot theory based on praise from ChatGPT and a list of papers it claims have something to do with your idea may make you feel more confident than proposing one based on a few papers you found on Google and a couple of YouTube videos, and that makes it more tempting. But it’s still proposing a crackpot theory. If you want to get involved, there’s still no substitute for actually learning how physics works.

I’ve Felt Like a Hallucinating LLM

ChatGPT and its kin work by using Large Language Models, or LLMs.

A climate model is a pile of mathematics and code, honed on data from the climate of the past. Tell it how the climate starts out, and it will give you a prediction for what happens next.

Similarly, a language model is a pile of mathematics and code, honed on data from the texts of the past. Tell it how a text starts, and it will give you a prediction for what happens next.
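To make "tell it how a text starts, and it predicts what happens next" a little more concrete, here is a deliberately tiny sketch. It is a word-pair counter, not a neural network, and real language models work very differently inside. The corpus and the function names are invented for illustration. The only thing it shares with an LLM is the shape of the question being asked.

```python
from collections import Counter, defaultdict

# Tiny invented corpus; real models are honed on vastly more text.
corpus = "the cat sat on the mat . the cat slept ."

def train_bigram_counts(text):
    """Count, for each word, which words follow it and how often."""
    follows = defaultdict(Counter)
    words = text.split()
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    if word not in follows:
        return None
    return follows[word].most_common(1)[0][0]

follows = train_bigram_counts(corpus)
print(predict_next(follows, "the"))
```

In this corpus "the" is followed by "cat" twice and "mat" once, so the prediction is the typical continuation rather than the only possible one.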

We have a rough idea of what a climate model can predict. The climate has to follow the laws of physics, for example. Similarly, a text should follow the laws of grammar, the order of verbs and nouns and so forth. The creators of the earliest, smallest language models figured out how to do that reasonably well.

Texts do more than just follow grammar, though. They can describe the world. And LLMs are both surprisingly good and surprisingly bad at that. They can do a lot when used right, answering test questions most humans would struggle with. But they also “hallucinate”, confidently saying things that have nothing to do with reality.

If you want to understand why large language models make both good predictions and bad, you shouldn’t just think about abstract “texts”. Instead, think about a specific type of text: a story.

Stories follow grammar, most of the time. But they also follow their own logic. The hero sets out, saves the world, and returns home again. The evil queen falls from the tower at the climax of the final battle. There are three princesses, and only the third can break the spell.

We aren’t usually taught this logic, like we’re taught physics or grammar. We learn it from experience, from reading stories and getting used to patterns. It’s the logic, not of how a story must go, but of how a story typically goes. And that question, of what typically comes next, is exactly the question LLMs are designed to answer.

It’s also a question we sometimes answer.

I was a theatre kid, and I loved improv in particular. Some of it was improv comedy, the games and skits you might have seen on “Whose Line is it Anyway?” But some of it was more…hippy stuff.

I’d meet up with a group on Saturdays. One year we made up a creation myth, half-rehearsed and half-improvised, a collection of gods and primordial beings. The next year we moved the story forward. Civilization had risen…and fallen again. We played a group of survivors gathered around a campfire, wary groups wondering what came next.

We plotted out characters ahead of time. I was the “villain”, or the closest we had to one. An enforcer of the just-fallen empire, the oppressor embodied. While the others carried clubs, staves, and farm implements, I was the only one with a real weapon: a sword.

(Plastic in reality, but the audience knew what to do.)

In the arguments and recriminations of the story, that sword set me apart, a constant threat that turned my character from contemptible to dangerous, that gave me a seat at the table even as I antagonized and stirred the pot.

But the story had another direction. The arguments pushed and pulled, and gradually the survivors realized that they would not survive if they did not put their grievances to rest, if they did not seek peace. So, one man stepped forward, and tossed his staff into the fire.

The others followed. One by one, clubs and sticks and menacing tools were cast aside. And soon, I was the only one armed.

If I was behaving logically, if I followed my character’s interests, I would have “won” there. I had gotten what I wanted, now there was no check on my power.

But that wasn’t what the story wanted. Improv is a game of fast decisions and fluid invention. We follow our instincts, and our instincts are shaped by experience. The stories of the past guide our choices, and must often be the only guide: we don’t have time to edit, or to second-guess.

And I felt the story, and what it wanted. It was a command that transcended will, that felt like it left no room for an individual actor making an individual decision.

I cast my sword into the fire.

The instinct that brought me to do that is the same instinct that guides authors when they say that their characters write themselves, when their story goes in an unexpected direction. It’s an instinct that can be tempered and counteracted, with time and effort, because it can easily lead to nonsense. It’s why every good book needs an editor, why improv can be as repetitive as it is magical.

And it’s been the best way I’ve found to understand LLMs.

An LLM telling a story tells a typical story, based on the data used to create it. In the same way, an LLM giving advice gives typical advice, to some extent in content but more importantly in form, advice that is confident and mentions things advice often mentions. An LLM writing a biography will write a typical biography, which may not be your biography, even if your biography was one of those used to create it, because it tries to predict how a biography should go based on all the other biographies. And all of these predictions and hallucinations are very much the kind of snap judgement that disarmed me.

These days, people are trying to build on top of LLMs and make technology that does more, that can edit and check its decisions. For the most part, they’re building these checks out of LLMs. Instead of telling one story, of someone giving advice on the internet, they tell two stories: the advisor and the editor, one giving the advice and one correcting it. They have to tell these stories many times, broken up into many parts, to approximate something other than the improv actor’s first instincts, and that’s why software that does this is substantially more expensive than more basic software that doesn’t.

I can’t say how far they’ll get. Models need data to work well, decisions need reliability to be good, computers need infrastructure to compute. But if you want to understand what’s at an LLM’s beating heart, think about the first instincts you have in writing or in theatre, in stories or in play. Then think about a machine that just does that.

AI Can’t Do Science…And Neither Can Other Humans

Seen on Twitter:

I don’t know the context here, so I can’t speak to what Prof. Cronin meant. But it got me thinking.

Suppose you, like Prof. Cronin, were to insist that AI “cannot in principle” do science, because AI “is not autonomous” and “does not come up with its own problems to solve”. What might you mean?

You might just be saying that AI is bad at coming up with new problems to solve. That’s probably fair, at least at the moment. People have experimented with creating simple “AI researchers” that “study” computer programs, coming up with hypotheses about the programs’ performance and testing them. But it’s a long road from that to reproducing the much higher standards human scientists have to satisfy.

You probably don’t mean that, though. If you did, you wouldn’t have said “in principle”. You mean something stronger.

More likely, you might mean that AI cannot come up with its own problems, because AI is a tool. People come up with problems, and use AI to help solve them. In this perspective, not only is AI “not autonomous”, it cannot be autonomous.

On a practical level, this is clearly false. Yes, machine learning models, the core technology in current AI, are set up to answer questions. A user asks something, and receives the model’s prediction of the answer. That’s a tool, but for the more flexible models like GPT it’s trivial to turn it into something autonomous. Just add another program: a loop that asks the model what to do, does it, tells the model the result, and asks what to do next. Like taping a knife to a Roomba, you’ve made a very simple modification to make your technology much more dangerous.
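A minimal sketch of that extra program, in Python, might look like the following. The `toy_model` function is a hypothetical stand-in that I made up so the example runs on its own. A real setup would call an actual language model there, and nothing here reflects any particular vendor's API.

```python
# Sketch of a "loop around a model". `toy_model` is a hypothetical stand-in
# for a real language model; no actual API is being called.

def toy_model(history):
    """Pretend model: keeps asking to increment until the result reaches 3."""
    last_result = history[-1] if history else 0
    if last_result >= 3:
        return "stop"
    return "increment"

def run_action(action, state):
    """Carry out the chosen action and return the new state."""
    if action == "increment":
        return state + 1
    raise ValueError(f"unknown action: {action}")

def autonomous_loop(model, max_steps=10):
    """Ask the model what to do, do it, report the result, and repeat."""
    history = []
    state = 0
    for _ in range(max_steps):
        action = model(history)      # ask the model what to do
        if action == "stop":
            break
        state = run_action(action, state)  # do it
        history.append(state)        # tell the model the result next time
    return history

print(autonomous_loop(toy_model))
```

The `max_steps` cap is the only thing standing between this loop and running forever, which is part of why the Roomba comparison fits.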

You might object, though, that this simple modification of GPT is not really autonomous. After all, a human created it. That human had some goal, some problem they wanted to solve, and the AI is just solving the problem for them.

That may be a fair description of current AI, but insisting it’s true in principle has some awkward implications. If you make a “physics AI”, just tell it to do “good physics”, and it starts coming up with hypotheses you’d never thought of, is it really fair to say it’s just solving your problem?

What if the AI, instead, was a child? Picture a physicist encouraging a child to follow in their footsteps, filling their life with physics ideas and rhapsodizing about the hard problems of the field at the dinner table. Suppose the child becomes a physicist in turn, and finds success later in life. Were they really autonomous? Were they really a scientist?

What if the child, instead, was a scientific field, and the parent was the general public? The public votes for representatives, the representatives vote to hire agencies, and the agencies promise scientists they’ll give them money if they like the problems they come up with. Who is autonomous here?

(And what happens if someone takes a hammer to that process? I’m…still not talking about this! No-politics-rule still in effect, sorry! I do have a post planned, but it will have to wait until I can deal with the fallout.)

At this point, you’d probably stop insisting. You’d drop that “in principle”, and stick with the claim I started with, that current AI can’t be a scientist.

But you have another option.

You can accept the whole chain of awkward implications, bite all the proverbial bullets. Yes, you insist, AI is not autonomous. Neither is the physicist’s child in your story, and neither are the world’s scientists paid by government grants. Each is a tool, used by the one, true autonomous scientist: you.

You are stuck in your skull, a blob of curious matter trained on decades of experience in the world and pre-trained with a couple billion years of evolution. For whatever reason, you want to know more, so you come up with problems to solve. You’re probably pretty vague about those problems. You might want to see more pretty pictures of space, or wrap your head around the nature of time. So you turn the world into your tool. You vote and pay taxes, so your government funds science. You subscribe to magazines and newspapers, so you hear about it. You press out against the world, and along with the pressure that already exists it adds up, and causes change. Biological intelligences and artificial intelligences scurry at your command. From their perspective, they are proposing their own problems, much more detailed and complex than the problems you want to solve. But from yours, they’re your limbs beyond limbs, sight beyond sight, asking the fundamental questions you want answered.

Congratulations to John Hopfield and Geoffrey Hinton!

The 2024 Physics Nobel Prize was announced this week, awarded to John Hopfield and Geoffrey Hinton for using physics to propose foundational ideas in the artificial neural networks used for machine learning.

If the picture above looks off-center, it’s because this is the first time since 2015 that the Physics Nobel has been given to two, rather than three, people. Since several past prizes bundled together disparate ideas in order to make a full group of three, it’s noteworthy that this year the committee decided that each of these people deserved 1/2 the prize amount, without trying to find one more person to water it down further.

Hopfield was trained as a physicist, working in the broad area known as “condensed matter physics”. Condensed matter physicists use physics to describe materials, from semiconductors to crystals to glass. Over the years, Hopfield started using this training less for the traditional subject matter of the field and more to study the properties of living systems. He moved from a position in the physics department of Princeton to chemistry and biology at Caltech. While at Caltech he started studying neuroscience and proposed what are now known as Hopfield networks as a model for how neurons store memory. Hopfield networks have very similar properties to a more traditional condensed matter system called a “spin glass”, and from what he knew about those systems Hopfield could make predictions for how his networks would behave. Those networks would go on to be a major inspiration for the artificial neural networks used for machine learning today.

Hinton was not trained as a physicist, and in fact has said that he didn’t pursue physics in school because the math was too hard! Instead, he got a bachelor’s degree in psychology, and a PhD in the then-nascent field of artificial intelligence. In the 1980s, shortly after Hopfield published his network, Hinton proposed a network inspired by a closely related area of physics, one that describes temperature in terms of the statistics of moving particles. His network, called a Boltzmann machine, would be modified and made more efficient over the years, eventually becoming a key part of how artificial neural networks are “trained”.

These people obviously did something impressive. Was it physics?

In 2014, the Nobel prize in physics was awarded to the people who developed blue LEDs. Some of these people were trained as physicists, some weren’t: Wikipedia describes them as engineers. At the time, I argued that this was fine, because these people were doing “something physicists are good at”, studying the properties of a physical system. Ultimately, the thing that ties together different areas of physics is training: physicists are the people who study under other physicists, and go on to collaborate with other physicists. That can evolve in unexpected directions, from more mathematical research to touching on biology and social science…but as long as the work benefits from being linked to physics departments and physics degrees, it makes sense to say it “counts as physics”.

By that logic, we can probably call Hopfield’s work physics. Hinton is more uncertain: his work was inspired by a physical system, but so are other ideas in computer science, like simulated annealing. Other ideas, like genetic algorithms, are inspired by biological systems: does that mean they count as biology?

Then there’s the question of the Nobel itself. If you want to get a Nobel in physics, it usually isn’t enough to transform the field. Your idea has to actually be tested against nature. Theoretical physics is its own discipline, with several ideas that have had an enormous influence on how people investigate new theories, ideas which have never gotten Nobels because the ideas were not intended, by themselves, to describe the real world. Hopfield networks and Boltzmann machines, similarly, do not exist as physical systems in the real world. They exist as computer simulations, and it is those computer simulations that are useful. But one can simulate many ideas in physics, and that doesn’t tend to be enough by itself to get a Nobel.

Ultimately, though, I don’t think this way of thinking about things is helpful. The Nobel isn’t capable of being “fair”, there’s no objective standard for Nobel-worthiness, and not much reason for there to be. The Nobel doesn’t determine which new research gets funded, nor does it incentivize anyone (except maybe Brian Keating). Instead, I think the best way of thinking about the Nobel these days is a bit like Disney.

When Disney was young, its movies had to stand or fall on their own merits. Now, with so many iconic movies in its history, Disney movies are received in the context of that history. Movies like Frozen or Moana aren’t just trying to be a good movie by themselves, they’re trying to be a Disney movie, with all that entails.

Similarly, when the Nobel was young, it was just another award, trying to reward things that Alfred Nobel might have thought deserved rewarding. Now, though, each Nobel prize is expected to be “Nobel-like”, an analogy between each laureate and the laureates of the past. When new people are given Nobels the committee is on some level consciously telling a story, saying that these people fit into the prize’s history.

This year, the Nobel committee clearly wanted to say something about AI. There is no Nobel prize for computer science, or even a Nobel prize for mathematics. (Hinton already has the Turing Award, the most prestigious award in computer science.) So to say something about AI, the Nobel committee gave awards in other fields. In addition to physics, this year’s chemistry award went in part to the people behind AlphaFold2, a machine learning tool to predict what shapes proteins fold into. For both prizes, the committee had a reasonable justification. AlphaFold2 genuinely is an amazing advance in the chemistry of proteins, a research tool like nothing that came before. And the work of Hopfield and Hinton did lead ideas in physics to have an enormous impact on the world, an impact that is worth recognizing. Ultimately, though, whether or not these people should have gotten the Nobel doesn’t depend on that justification. It’s an aesthetic decision, one that (unlike Disney’s baffling decision to make live-action remakes of their most famous movies) doesn’t even need to impress customers. It’s a question of whether the action is “Nobel-ish” enough, according to the tastes of the Nobel committee. The Nobel is essentially expensive fanfiction of itself.

And honestly? That’s fine. I don’t think there’s anything else they could be doing at this point.

The Mistakes Are the Intelligence

There’s a lot of hype around large language models, the foundational technology behind services like ChatGPT. Representatives of OpenAI have stated that, in a few years, these models might have “PhD-level intelligence”. On the other hand, at the time, ChatGPT couldn’t count the number of letter “r”s in the word “strawberry”. The model and the setup around it have improved, and OpenAI’s o1 apparently now gets the correct 3 “r”s…but I’m sure it makes other silly mistakes, mistakes an intelligent human would never make.

The mistakes made by large language models are important, due to the way those models are used. If people are going to use them for customer service, writing transcripts, or editing grammar, they don’t want to introduce obvious screwups. (Maybe this means they shouldn’t use the models this way!)

But the temptation is to go further, to say that these mistakes are proof that these models are, and will always be, dumb, not intelligent. And that’s not the right way to think about intelligence.

When we talk about intelligent people, when we think about measuring things like IQ, we’re looking at a collection of different traits. These traits typically go together in humans: a human who is good at one will usually be good at the others. But from the perspective of computer science, these traits are very different.

Intelligent people tend to be good at following complex instructions. They can remember more, and reason faster. They can hold a lot in their head at once, from positions of objects to vocabulary.

These are all things that computers, inherently, are very good at. When Turing wrote down his abstract description of a computer, he imagined a machine with infinite memory, able to follow any instructions with perfect fidelity. Nothing could live up to that ideal, but modern computers are much closer to it than humans. “Computer” used to be a job, with rooms full of people (often women) hired to do calculations for scientific projects. We don’t do that any more: machines have made that work superfluous.

What’s more, the kind of processing a Turing machine does is probably the only way to reliably answer questions. If you want to make sure you get the correct answer every time, then it seems that you can’t do better than to use a sufficiently powerful computer.

But while computer-the-machine replaced computer-the-job, mathematician-the-job still exists. And that’s because not all intelligence is about answering questions reliably.

Alexander Grothendieck was a famous mathematician, known for his deep insights and powerful ideas. According to legend, when giving a talk referring to prime numbers, someone in the audience asked him to name a specific prime. He named 57.

With a bit of work, any high-school student can figure out that 57, which equals 3 times 19, isn’t a prime number. A computer can easily figure out that 57 is not a prime number. Even ChatGPT knows that 57 is not a prime number.
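The reliable, rule-following way to check this is trial division. Here is a short sketch of it in Python. The function names are my own, and this is the slow schoolbook method, not what anyone would use for large numbers.

```python
def smallest_factor(n):
    """Return the smallest factor of n greater than 1 (n itself if n is prime)."""
    if n < 2:
        raise ValueError("n must be at least 2")
    d = 2
    # Any factor larger than sqrt(n) pairs with one smaller than sqrt(n),
    # so it is enough to test divisors up to that point.
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return n

def is_prime(n):
    """True if n is a prime number, by trial division."""
    return n >= 2 and smallest_factor(n) == n

factor = smallest_factor(57)
print(factor, 57 // factor)   # 3 19
print(is_prime(57))           # False
```

The loop gives the same answer every time, which is exactly the property heuristic reasoning lacks.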

But this doesn’t mean that Grothendieck was dumber than a high school student, or dumber than ChatGPT. Grothendieck was using a different kind of intelligence, the heuristic kind.

Heuristics are unreliable reasoning. They’re processes that get the right answer some of the time, but not all of the time. Because of that, though, they don’t have the same limits as reliable computer programs. Pick the right situation and the right conditions, and a heuristic can give you an answer faster than you could possibly get by following reliable rules.

Intelligent humans follow instructions well, but they also have good heuristics. They solve problems creatively, sometimes problems that are very hard for computers to address. People like Grothendieck make leaps of mathematical reasoning, guessing at the right argument before they have completely fleshed out a proof. This kind of intelligence is error-prone: rely on it, and you might claim 57 is prime. But at the moment, it’s our only intellectual advantage over machines.

Ultimately, ChatGPT is an advance in language processing, and language is a great example. Sentences don’t have definite meaning: we interpret what we read and hear in context, and sometimes our interpretation is wrong. Sometimes we hear words no-one actually said! It’s impossible, both for current technology and for the human brain, to process general text in a 100% reliable way. So large language models like GPT don’t do it reliably. They use an approximate model, a big complicated pile of rules tweaked over and over again until, enough of the time, they get the next word right in a text.

The kind of heuristic reasoning done by large language models is more effective than many people expected. Being able to predict the next word in a text unreliably also means you can write code unreliably, or count things unreliably, or do math unreliably. You can’t do any of these things as well as an appropriately-chosen human, at least not with current resources.

But in the longer run, heuristic intelligence is precisely the type of intelligence we should aspire to…or fear. Right now, we hire humans to do intellectual work because they have good heuristics. If we could build a machine with equivalent or better heuristics for those tasks, then people would hire a lot fewer humans. And if you’re worried about AI taking over the world, you’re worried about AI coming up with shortcuts to our civilization, tricks we couldn’t anticipate or plan against that destroy everything we care about. Those tricks can’t come from following rules: if they did, we could discover them just as easily. They would have to come from heuristics, sideways solutions that don’t work all the time but happen to work the one time that matters.

So yes, until the latest release, ChatGPT couldn’t tell you how many “r”s are in “strawberry”. Counting “r”s is something computers could already do, because it’s something that can be done by following reliable rules. It’s also something you can do easily, if you follow reliable rules. ChatGPT impresses people because it can do some of the things you do, that can’t be done with reliable rules. If technology like it has any chance of changing the world, those are the kinds of things it will have to be able to do.
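For completeness, this is what "following reliable rules" looks like for the strawberry question, as a small Python sketch. The function name is my own, and the explicit loop is written out only to show the rule being followed. Python's built-in `str.count` does the same job.

```python
def count_letter(word, letter):
    """Count occurrences of `letter` in `word` by checking each character."""
    total = 0
    for character in word:
        if character == letter:
            total += 1
    return total

print(count_letter("strawberry", "r"))  # 3
print("strawberry".count("r"))          # 3, using the built-in method
```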