Tag Archives: Machine Learning

About the OpenAI Amplitudes Paper, but Not as Much as You’d Like

I’ve had a bit more time to dig into the paper I mentioned last week, where OpenAI collaborated with amplitudes researchers, using one of their internal models to find and prove a simplified version of a particle physics formula. I figured I’d say a bit about my own impressions from reading the paper and OpenAI’s press release.

This won’t be a real “deep dive”, though it will be long nonetheless. As it turns out, most of the questions I’d like answers to aren’t answered in the paper or the press release. Getting them will involve actual journalistic work, i.e. blocking off time to interview people, and I haven’t done that yet. What I can do is talk about what I know so far, and what I’m still wondering.

Context:

Scattering amplitudes are formulas used by particle physicists to make predictions. For a while, people would just calculate these when they needed them, writing down pages of mess that you could plug numbers into to get answers. However, forty years ago two physicists decided they wanted more, writing “we hope to obtain a simplified form for the answer, making our result not only an experimentalist’s, but a theorist’s delight.”

In their next paper, they managed to find that “theorist’s delight”: a simplified, intuitive-looking answer that worked for calculations involving any number of particles, summarizing many different calculations. Ten years later, a few people had started building on it, and ten years after that, the big shots started paying attention. A whole subfield, “amplitudeology”, grew from that seed, finding new forms of “theorist’s delight” in scattering amplitudes.

Each subfield has its own kind of “theory of victory”, its own concept for what kind of research is most likely to yield progress. In amplitudes, it’s these kinds of simplifications. When they work out well, they yield new, more efficient calculation techniques, yielding new messy results which can be simplified once more. To one extent or another, most of the field is chasing after those situations when simplification works out well.

That motivation shapes both the most ambitious projects of senior researchers, and the smallest student projects. Students often spend enormous amounts of time looking for a nice formula for something and figuring out how to generalize it, often on a question suggested by a senior researcher. These projects mostly serve as training, but occasionally manage to uncover something more impressive and useful, an idea others can build around.

I’m mentioning all of this, because as far as I can tell, what ChatGPT and the OpenAI internal model contributed here roughly lines up with the roles students have on amplitudes papers. In fact, it’s not that different from the role one of the authors, Alfredo Guevara, had when I helped mentor him during his Master’s.

Senior researchers noticed something unusual, suggested by prior literature. They decided to work out the implications, did some calculations, and got some messy results. It wasn’t immediately clear how to clean up the results, or generalize them. So they waited, and eventually were contacted by someone eager for a research project, who did the work to get the results into a nice, general form. Then everyone published together on a shared paper.

How impressed should you be?

I said, “as far as I can tell” above. What’s annoying is that this paper makes it hard to tell.

If you read through the paper, they mention AI briefly in the introduction, saying they used GPT-5.2 Pro to conjecture formula (39) in the paper, and an OpenAI internal model to prove it. The press release actually goes into more detail, saying that the humans found formulas (29)-(32), and GPT-5.2 Pro found a special case where it could simplify them to formulas (35)-(38), before conjecturing (39). You can get even more detail from an X thread by one of the authors, OpenAI Research Scientist Alex Lupsasca. Alex had done his PhD with another one of the authors, Andrew Strominger, and was excited to apply the tools he was developing at OpenAI to his old research field. So they looked for a problem, and tried out the one that ended up in the paper.

What is missing, from the paper, press release, and X thread, is any real detail about how the AI tools were used. We don’t have the prompts, or the output, or any real way to assess how much input came from humans and how much from the AI.

(We have more for their follow-up paper, where Lupsasca posted a transcript of the chat.)

Contra some commentators, I don’t think the authors are being intentionally vague here. They’re following business as usual. In a theoretical physics paper, you don’t list who did what, or give a detailed account of how you arrived at the results. You clean things up, and create a nice narrative. This goes double if you’re aiming for one of the most prestigious journals, which tend to have length limits.

This business-as-usual approach is ok, if frustrating, for the average physics paper. It is, however, entirely inappropriate for a paper showcasing emerging technologies. For a paper that OpenAI was going to promote this heavily, the question of how they reached their conclusion is much more interesting than the results themselves. And while I wouldn’t ask them to meet the standards of an actual AI paper, with ablation analysis and all that jazz, they could at least have aimed for the level of detail of my final research paper, which gave samples of the AI input and output used in its genetic algorithm.

For the moment, then, I have to guess what input the AI had, and what it actually accomplished.

Let’s focus on the work done by the internal OpenAI model. The descriptions I’ve seen suggest that it started where GPT-5.2 Pro did, with formulas (29)-(32), but with a more specific prompt that guided what it was looking for. It then ran for 12 hours with no additional input, and both conjectured (39) and proved it was correct, providing essentially the proof that follows formula (39) in the paper.

Given that, how impressed should we be?

First, the model needs to decide to go to a specialized region, instead of trying to simplify the formula in full generality. I don’t know whether they prompted their internal model explicitly to do this. It’s not something I’d expect a student to do, because students don’t know what types of results are interesting enough to get published, so they wouldn’t be confident in computing only a limited version of a result without an advisor telling them it was ok. On the other hand, it is actually something I’d expect an LLM to be unusually likely to do, as a result of not managing to consistently stick to the original request! What I don’t know is whether the LLM proposed this for the right reason: that if you have the formula for one region, you can usually find it for other regions.

Second, the model needs to take formulas (29)-(32), write them in the specialized region, and simplify them to formulas (35)-(38). I’ve seen a few people saying you can do this pretty easily with Mathematica. That’s true, though not every senior researcher is comfortable doing that kind of thing, as you need to be a bit smarter than just using the Simplify[] command. Most of the people on this paper strike me as pen-and-paper types who wouldn’t necessarily know how to do that. It’s definitely the kind of thing I’d expect most students to figure out, perhaps after a couple of weeks of flailing around if it’s their first crack at it. The LLM likely would not have used Mathematica, but would have used SymPy, since these “AI scientist” setups usually can write and execute Python code. You shouldn’t think of this as the AI reasoning through the calculation itself, but it at least sounds like it was reasonably quick at coding it up.
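To give a flavor of what that step looks like in practice, here is a minimal SymPy sketch. The expressions below are made up for illustration (the paper’s formulas (29)-(32) aren’t reproduced here), but the workflow is the one described: simplify what you can in general, then specialize to a restricted region where things collapse further.

```python
import sympy as sp

# Hypothetical stand-ins for the paper's variables and formulas: two
# kinematic invariants s and t, and a deliberately messy expression.
s, t = sp.symbols('s t', positive=True)
messy = (s**2 + 2*s*t + t**2) / (s*t) - (s**2 + t**2) / (s*t)

# Step 1: simplify in full generality (the SymPy analogue of Simplify[]).
general = sp.simplify(messy)  # collapses to the constant 2

# Step 2: specialize to a restricted region, e.g. expand around t -> 0,
# where expressions often become much simpler than the general case.
expr = (s + t)**4 / (s**3 * (s + 2*t))
soft = sp.series(expr, t, 0, 2).removeO()  # leading behavior near t = 0

print(general, soft)
```

Being “a bit smarter than just using the Simplify[] command” mostly means knowing which limit or change of variables to take before asking the machine to simplify, which is exactly the judgment call at issue here.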

Then, the model needs to conjecture formula (39). This gets highlighted in the intro, but as many have pointed out, it’s pretty easy to do. If any non-physicists are still reading at this point, take a look:

Could you guess (39) from (35)-(38)?
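For the non-physicists: the flavor of that conjecturing step is like guessing a closed form from a few computed cases. The numbers below are entirely made up (they are not the paper’s formulas (35)-(38)); they just show the pattern-spotting-then-checking loop.

```python
from fractions import Fraction

# Purely illustrative toy: suppose the special-case results for
# n = 3, 4, 5, 6 particles had come out as these exact fractions.
cases = {3: Fraction(1, 12), 4: Fraction(1, 20),
         5: Fraction(1, 30), 6: Fraction(1, 42)}

# Spotting the pattern, conjecture a general formula a_n = 1/(n*(n+1))...
def conjecture(n):
    return Fraction(1, n * (n + 1))

# ...and check it against every computed case before believing it.
assert all(conjecture(n) == value for n, value in cases.items())
```

The hard part, for human or machine, is not the arithmetic but proving the conjectured formula holds for all n, which is what the rest of the paper’s proof is for.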

After that, the paper goes over the proof that formula (39) is correct. Most of this proof isn’t terribly difficult, but the way it begins is actually unusual in an interesting way. The proof uses ideas from time-ordered perturbation theory, an old-fashioned way to do particle physics calculations. Time-ordered perturbation theory isn’t something any of the authors are known for using with regularity, but it has recently seen a resurgence in another area of amplitudes research, showing up for example in papers by Matthew Schwartz, a colleague of Strominger at Harvard.

If a student of Strominger came up with an idea drawn from time-ordered perturbation theory, that would actually be pretty impressive. It would mean that, rather than just learning from their official mentor, this student was talking to other people in the department and broadening their horizons, showing a kind of initiative that theoretical physicists value a lot.

From an LLM, though, this is not impressive in the same way. The LLM was not trained by Strominger, it did not learn specifically from Strominger’s papers. Its context suggested it was working on an amplitudes paper, and it produced an idea which would be at home in an amplitudes paper, just a different one than the one it was working on.

While not impressive, that capability may be quite useful. Academic subfields can often get very specialized and siloed. A tool that suggests ideas from elsewhere in the field could help some people broaden their horizons.

Overall, it appears that the twelve-hour OpenAI internal model run reproduced roughly what an unusually bright student would be able to contribute over the course of a several-month project. As with most student projects, you could find a senior researcher who could do the project much faster, maybe even faster than the LLM. But it’s unclear whether any of the authors could have: different senior researchers have different skillsets.

A stab at implications:

If we take all this at face value, it looks like OpenAI’s internal model was able to do a reasonably competent student project with no serious mistakes in twelve hours. If they started selling that capability, what would happen?

If it’s cheap enough, you might wonder if professors would choose to use the OpenAI model instead of hiring students. I don’t think this would happen, though: I think it misunderstands why these kinds of student projects exist in a theoretical field. Professors sometimes use students to get results they care about, but more often, the student’s interest is itself the motivation, with the professor wanting to educate someone, to empire-build, or just to take on their share of the department’s responsibilities. AI is only useful for this insofar as AI companies continue reaching out to these people to generate press releases: once this is routinely possible, the motivation goes away.

More dangerously, if it’s even cheaper, you could imagine students being tempted to use it. The whole point of a student project is to train and acculturate the student, to get them to the point where they have affection for the field and the capability to do more impressive things. You can’t skip that, but people are going to be tempted to.

And of course, there is the broader question of how much farther this technology can go. That’s the hardest to estimate here, since we don’t know the prompts used. So I don’t know if seeing this result tells us anything more about the bigger picture than we knew going in.

Remaining questions:

At the end of the day, there are a lot of things I still want to know. And if I do end up covering this professionally, they’re things I’ll ask.

  1. What was the prompt given to the internal model, and how much did it do based on that prompt?
  2. Was it really done in one shot, no retries or feedback?
  3. How much did running the internal model cost?
  4. Is this result likely to be useful? Are there things people want to calculate that this could make easier? Recursion relations it could seed? Is it useful for SCET somehow?
  5. How easy would it have been for the authors to do what the LLM did? What about other experts in the community?

Hypothesis: If AI Is Bad at Originality, It’s a Documentation Problem

Recently, a few people have asked me about this paper.

A couple weeks back, OpenAI announced a collaboration with a group of amplitudes researchers, physicists who study the types of calculations people do to make predictions at particle colliders. The amplitudes folks had identified an interesting loophole, finding a calculation that many would have expected to be zero actually gave a nonzero answer. They did the calculation for different examples involving more and more particles, and got some fairly messy answers. They suspected, as amplitudes researchers always expect, that there was a simpler formula, one that worked for any number of particles. But they couldn’t find it.

Then a former amplitudes researcher at OpenAI suggested that they use AI to find it.

“Use AI” can mean a lot of different things, and most of them don’t look much like the way the average person talks to ChatGPT. This was closer than most. They were using “reasoning models”, loops that try to predict the next few phrases in a “chain of thought” again and again and again. Using that kind of tool, they were able to find that simpler formula, and mathematically prove that it was correct.

A few of you are hoping for an in-depth post about what they did, and its implications. This isn’t that. I’m still figuring out if I’ll be writing that for an actual news site, for money, rather than free, for you folks.

Instead, I want to talk about a specific idea I’ve seen crop up around the paper.

See, for some, the existence of a result like this isn’t all that surprising.

Mathematicians have been experimenting with reasoning models for a bit, now. Recently, a group published a systematic study, setting the AI loose on a database of minor open problems proposed by the famously amphetamine-fueled mathematician Paul Erdős. The AI managed to tackle a few of the problems, sometimes by identifying existing solutions that had not yet been linked to the problem database, but sometimes by proofs that appeared to be new.

The Erdős problems solved by the AI were not especially important. Neither was the problem solved by the amplitudes researchers, as far as I can tell at this point.

But I get the impression the amplitudes problem was a bit more interesting than the Erdős problems. The difference, so far, has mostly been attributed to human involvement. This amplitudes paper started because human amplitudes researchers found an interesting loophole, and only after that used the AI. Unlike the mathematicians, they weren’t just searching a database.

This lines up with a general point, one people tend to make much less carefully. It’s often said that, unlike humans, AI will never be truly creative. It can solve mechanical problems, do things people have done before, but it will never be good at having truly novel ideas.

To me, that line of thinking goes a bit too far. I suspect it’s right on one level, that it will be hard for any of these reasoning models to propose anything truly novel. But if so, I think it will be for a different reason.

The thing is, creativity is not as magical as we make it out to be. Our ideas, scientific or artistic, don’t just come from the gods. They recombine existing ideas, shuffling them in ways more akin to randomness than miracle. They’re then filtered through experience, deep heuristics honed over careers. Some people are good at ideas, and some are bad at them. Having ideas takes work, and there are things people do to improve their ideas. Nothing about creativity suggests it should be impossible to mechanize.

However, a machine trained on text won’t necessarily know how to do any of that.

That’s because in science, we don’t write down our inspirations. By the time a result gets into a scientific paper or textbook, it’s polished and refined into a pure argument, cutting out most of the twists and turns that were an essential part of the creative process. Mathematics is even worse: most math papers don’t even mention the motivation behind the work, let alone the path taken to the paper.

This lack of documentation makes it hard for students, making success much more a function of having the right mentors to model good practices, rather than being able to pick them up from literature everyone can access. I suspect it makes it even harder for language models. And if today’s language model-based reasoning tools are bad at that crucial, human-seeming step, of coming up with the right idea at the right time? I think that has more to do with this lack of documentation, than with the fact that they’re “statistical parrots”.

The Timeline for Replacing Theorists Is Not Technological

Quanta Magazine recently published a reflection by Natalie Wolchover on the state of fundamental particle physics. The discussion covers a lot of ground, but one particular paragraph has gotten the lion’s share of the attention. Wolchover talked to Jared Kaplan, the ex-theoretical physicist turned co-founder of Anthropic, one of the foremost AI companies today.

Kaplan was one of Nima Arkani-Hamed’s PhD students, which adds an extra little punch.

There’s a lot to contest here. Is AI technology anywhere close to generating papers as good as the top physicists, or is that relegated to the sci-fi future? Does Kaplan really believe this, or is he just hyping up his company?

I don’t have any special insight into those questions, about the technology and Kaplan’s motivations. But I think that, even if we trusted him on the claim that AI could be generating Witten- or Nima-level papers in three years, that doesn’t mean it will replace theoretical physicists. That part of the argument isn’t a claim about the technology, but about society.

So let’s take the technological claims as given, and make them a bit more specific. Since we don’t have any objective way of judging the quality of scientific papers, let’s stick to the subjective. Today, there are a lot of people who get excited when Witten posts a new paper. They enjoy reading them, they find the insights inspiring, they love the clarity of the writing and their tendency to clear up murky ideas. They also find them reliable: the papers very rarely have mistakes, and don’t leave important questions unanswered.

Let’s use that as our baseline, then. Suppose that Anthropic had an AI workflow that could reliably write papers that were just as appealing to physicists as Witten’s papers are, for the same reasons. What happens to physicists?

Witten himself is retired, which for an academic means you do pretty much the same thing you were doing before, but now paid out of things like retirement savings and pension funds, not an institute budget. Nobody is going to fire Witten, there’s no salary to fire him from. And unless he finds these developments intensely depressing and demoralizing (possible, but very much depends on how this is presented), he’s not going to stop writing papers. Witten isn’t getting replaced.

More generally, though, I don’t think this directly results in anyone getting fired, or in universities trimming positions. The people making funding decisions aren’t just sitting on a pot of money, trying to maximize research output. They’ve got money to be spent on hires, and different pools of money to be spent on equipment, and the hires get distributed based on what current researchers at the institutes think is promising. Universities want to hire people who can get grants, to help fund the university, and absent rules about AI personhood, the AIs won’t be applying for grants.

Funding cuts might be argued for based on AI, but that will happen long before AI is performing at the Witten level. We already see this happening in other industries or government agencies, where groups that already want to cut funding are getting think tanks and consultants to write estimates that justify cutting positions, without actually caring whether those estimates are performed carefully enough to justify their conclusions. That can happen now, and doesn’t depend on technological progress.

AI could also replace theoretical physicists in another sense: the physicists themselves might use AI to do most of their work. That’s more plausible, but here adoption still heavily depends on social factors. Will people feel like they are being assessed on whether they can produce these Witten-level papers, and that only those who make them get hired, or funded? Maybe. But it will propagate unevenly, from subfield to subfield. Some areas will make their own rules forbidding AI content, there will be battles and scandals and embarrassments aplenty. It won’t be a single switch, the technology alone setting the timeline.

Finally, AI could replace theoretical physicists in another way, by people outside of academia filling the field so much that theoretical physicists have nothing more that they want to do. Some non-physicists are very passionate about physics, and some of those people have a lot of money. I’ve done writing work for one such person, whose foundation is now attempting to build an AI Physicist. If these AI Physicists get to Witten-level quality, they might start writing compelling paper after compelling paper. Those papers, though, will, due to their origins, be specialized. Much as philanthropists mostly fund the subfields they’ve heard of, philanthropist-funded AI will mostly target topics the people running the AI have heard are important. Much like physicists themselves adopting the technology, there will be uneven progress from subfield to subfield, inch by socially-determined inch.

In a hard-to-quantify area like progress in science, that’s all you can hope for. I suspect Kaplan got a bit of a distorted picture of how progress and merit work in theoretical physics. He studied with Nima Arkani-Hamed, who is undeniably exceptionally brilliant but also undeniably exceptionally charismatic. It must feel to a student of Nima’s that academia simply hires the best people, that it does whatever it takes to accomplish the obviously best research. But the best research is not obvious.

I think some of these people imagine a more direct replacement process, not arranged by topic and tastes, but by goals. They picture AI sweeping in and doing what theoretical physics was always “meant to do”: solve quantum gravity, and proceed to shower us with teleporters and antigravity machines. I don’t think there’s any reason to expect that to happen. If you just asked a machine to come up with the most useful model of the universe for a near-term goal, then in all likelihood it wouldn’t consider theoretical high-energy physics at all. If you see your AI as a tool to navigate between utopia and dystopia, theoretical physics might matter at some point: when your AI has devoured the inner solar system, is about to spread beyond the few light-minutes within which it can signal itself in real time, and has to commit to a strategy. But as long as the inner solar system remains un-devoured, I don’t think you’ll see an obviously successful theory of fundamental physics.

Academia Tracks Priority, Not Provenance

A recent Correspondence piece in Nature Machine Intelligence points at an issue with using LLMs to write journal articles. LLMs are trained on enormous amounts of scholarly output, but the result is quite opaque: it is usually impossible to tell which sources influence a specific LLM-written text. That means that when a scholar uses an LLM, they may get a result that depends on another scholar’s work, without realizing it or documenting it. The ideas’ provenance gets lost, and the piece argues this is damaging, depriving scholars of credit and setting back progress.

It’s a good point. Provenance matters. If we want to prioritize funding for scholars whose ideas have the most impact, we need a way to track where ideas arise.

However, current publishing norms make essentially no effort to do this. Academic citations are not used to track provenance, and they are not typically thought of as tracking provenance. Academic citations track priority.

Priority is a central value in scholarship, with a long history. We give special respect to the first person to come up with an idea, make an observation, or do a calculation, and more specifically, the first person to formally publish it. We do this even if the person’s influence was limited, and even if the idea was rediscovered independently later on. In an academic context, being first matters.

In a paper, one is thus expected to cite the sources that have priority, that came up with an idea first. Someone who fails to do so will get citation request emails, and reviewers may request revisions to the paper to add in those missing citations.

One may also cite papers that were helpful, even if they didn’t come first. Tracking provenance in this way can be nice, a way to give direct credit to those who helped and point people to useful resources. But it isn’t mandatory in the same way. If you leave out a secondary source and your paper doesn’t use anything original to that source (like new notation), you’re much less likely to get citation request emails, or revision requests from reviewers. Provenance is just much lower priority.

In practice, academics track provenance in much less formal ways. Before citations, a paper will typically have an Acknowledgements section, where the authors thank those who made the paper possible. This includes formal thanks to funding agencies, but also informal thanks for “helpful discussions” that don’t meet the threshold of authorship.

If we cared about tracking provenance, those acknowledgements would be crucial information, an account of whose ideas directly influenced the ideas in the paper. But they’re not treated that way. No-one lists the number of times they’ve been thanked for helpful discussions on their CV, or in a grant application, no-one considers these discussions for hiring or promotion. You can’t look them up on an academic profile or easily graph them in a metascience paper. Unlike citations, unlike priority, there is essentially no attempt to measure these tracks of provenance in any organized way.

Instead, provenance is often the realm of historians or history-minded scholars, writing long after the fact. For academics, the fact that Yang and Mills published their theory first is enough, we call it Yang-Mills theory. For those studying the history, the story is murkier: it looks like Pauli came up with the idea first, and did most of the key calculations, but didn’t publish when it looked to him like the theory couldn’t describe the real world. What’s more, there is evidence suggesting that Yang knew about Pauli’s result, that he had read a letter from him on the topic, that the idea’s provenance goes back to Pauli. But Yang published, Pauli didn’t. And in the way academia has worked over the last 75 years, that claim of priority is what actually mattered.

Should we try to track provenance? Maybe. Maybe the emerging ubiquitousness of LLMs should be a wakeup call, a demand to improve our tracking of ideas, both in artificial and human neural networks. Maybe we need to demand interpretability from our research tools, to insist that we can track every conclusion back to its evidence for every method we employ, to set a civilizational technological priority on the accurate valuation of information.

What we shouldn’t do, though, is pretend that we just need to go back to what we were doing before.

Requests for an Ethnography of Cheating

What is AI doing to higher education? And what, if anything, should be done about it?

Chad Orzel at Counting Atoms had a post on this recently, tying the question to a broader point. There is a fundamental tension in universities, between actual teaching and learning on the one hand, and credentials on the other. A student who just wants the piece of paper at the end has no reason not to cheat if they can get away with it, so the easier it becomes to get away with cheating (say, by using AI), the less meaningful the credential gets. Meanwhile, professors who want students to actually learn something are reduced to trying to “trick” these goal-oriented students into accidentally doing something that makes them fall in love with a subject, while being required to police the credential side of things.

Social science, as Orzel admits and emphasizes, is hard. Any broad-strokes picture like this breaks down into details, and while Orzel talks through some of those details, he and I are of course not social scientists.

Because of that, I’m not going to propose my own “theory” here. Instead, think of this post as a request.

I want to read an ethnography of cheating. Like other ethnographies, it should involve someone spending time in the culture in question (here, cheating students), talking to the people involved, and getting a feeling for what they believe and value. Ideally, it would be augmented with an attempt at quantitative data, like surveys, that estimate how representative the picture is.

I suspect that cheating students aren’t just trying to get a credential. Part of why is that I remember teaching pre-meds. In the US, students don’t directly study medicine as a Bachelor’s degree. Instead, they study other subjects as pre-medical students (“pre-meds”), and then apply to Medical School, which grants a degree on the same level as a PhD. As part of their application, they include a standardized test called the MCAT, which checks that they have the basic level of math and science that the medical schools expect.

A pre-med in a physics class, then, has good reason to want to learn: the better they know their physics, the better they will do on the MCAT. If cheating was mostly about just trying to get a credential, pre-meds wouldn’t cheat.

I’m pretty sure they do cheat, though. I didn’t catch any cheaters back when I taught, but there were a lot of students who tried to push the rules, pre-meds and not.

Instead, I think there are a few other motivations involved. And in an ethnography of cheating, I’d love to see some attempt to estimate how prevalent they are:

  1. Temptation: Maybe students know that they shouldn’t cheat, in the same way they know they should go to the gym. They want to understand the material and learn in the same way people who exercise have physical goals. But the mind, and flesh, are weak. You have a rough week, you feel like you can’t handle the work right now. So you compensate. Some of the motivation here is still due to credentials: a student who shrugs and accepts that their breakup will result in failing a course is a student who might have to pay for an extra year of ultra-expensive US university education to get that credential. But I suspect there is a more fundamental motivation here, related to ego and easy self-deception. If you do the assignment, even if you cheat for part of it, you get to feel like you did it, while if you just turn in a blank page you have to accept the failure.
  2. Skepticism: Education isn’t worth much if it doesn’t actually work. Students may be skeptical that the things that professors are asking them to do actually help them learn what they want to learn, or that the things the professors want them to learn are actually the course’s most valuable content. A student who uses ChatGPT to write an essay might believe that they will never have to write something without ChatGPT in life, so why not use it now? Sometimes professors simply aren’t explicit about what an exercise is actually meant to teach (there have been a huge number of blog posts explaining that writing is meant to teach you to think, not to write), and sometimes professors are genuinely pretty bad at teaching, since there is little done to retain the good ones in most places. A student in this situation still has to be optimistic about some aspect of the education, at some time. But they may be disillusioned, or just interested in something very different.
  3. Internalized Expectations: Do employers actually care if you get a bad grade? Does it matter? By the time a student is in college, they’ve been spending half their waking hours in a school environment for over a decade. Maybe the need to get good grades is so thoroughly drilled in that the actual incentives don’t matter. If you think of yourself as the kind of person who doesn’t fail courses, and you start failing, what do you do?
  4. External Non-Credential Expectations: Don’t worry about the employers, worry about the parents. Some college students have the kind of parents who keep checking in on how they’re doing, who want to see evidence and progress the same way they did when they were kids. Any feedback, no matter how much it’s intended to teach, not to judge, might get twisted into a judgement. Better to avoid that judgement, right?
  5. Credentials, but for the Government, not Employers: Of course, for some students, failing really does wreck their life. If you’re on the kind of student visa that requires you to maintain your grades at a certain level, you’ve got a much stronger incentive to cheat, imposed for much less reason.

If you’re aware of a good ethnography of cheating, let me know! And if you’re a social scientist, consider studying this!

Some Dumb AI Ideas

Sometimes, when I write a post about AI, I’ve been sitting on an idea for a long time. I’ve talked to experts, I’ve tried to understand the math, I’ve honed my points and cleared away clutter.

This is not one of those times. The ideas in this post almost certainly have something deeply wrong with them. But hopefully they’re interesting food for thought.

My first dumb idea: instruction tuning was a mistake.

I’m drawing the seeds of this one from a tumblr post by nostalgebraist, someone known for making a popular bot trained on his tumblr posts in the early days before GPT became ChatGPT.

AIs like ChatGPT are based on Large Language Models, insanely complicated mathematical formulas that predict, given part of a text, what the rest of that text is likely to look like. In the early days, this was largely how they were used. Loosely described, nostalgebraist’s bot, called nostalgebraist-autoresponder, began with a list of tumblr posts and asks and determined what additional posts would best fit in.

If you think about it, though, ChatGPT doesn’t really work like that. ChatGPT has conversations: you send it messages, it sends you responses. The text it creates is a dialogue, with you supplying half the input. But most texts aren’t dialogues, and ChatGPT draws on a lot of non-dialogue texts to make its dialogue-like responses.

The reason it does this is something called instruction tuning. ChatGPT has been intentionally biased, not to give the most likely completion to a task in general, but to give completions that fit this dialogue genre. What I didn’t know until I read nostalgebraist’s post was that this genre was defined artificially: AI researchers made up fake dialogues with AI, cheesy sci-fi conversations imagining how an AI might respond to instructions from a user, and then biased the Large Language Model so that rather than giving the most likely text in general, it gives a text that is more likely to look like these cheesy sci-fi conversations. It’s why ChatGPT sounds kind of like a fictional robot: not because sci-fi writers accurately predicted what AI would sound like, but because AI was created based on sci-fi texts.

For nostalgebraist, this leads into an interesting reflection on how a sci-fi AI should behave, how being warped around a made-up genre without history or depth creates characters which act according to simple narratives and express surprising anxiety.

For myself, though, I can’t help but wonder if the goal of dialogue itself is the problem. Dialogue is clearly important commercially: people use ChatGPT because they can chat with it. But Large Language Models aren’t inherently chatbots: they produce plausible texts, of any sort you could imagine. People seem to want a machine that can, for example, answer scientific questions as part of a conversation. But most competent answers to scientific questions aren’t conversations, they’re papers. If people stuck with the “raw” model, producing excerpts of nonexistent papers rather than imitating a dialogue with a non-existent expert, wouldn’t you expect the answers to be more accurate, with the model no longer biased by an irrelevant goal? Is the need to make a sell-able chatbot making these AIs worse at everything else people are trying to use them for?

I’m imagining a world where, instead of a chatbot, OpenAI built an “alternate universe simulator”. You give it some context, some texts or parts of texts from a universe you made up, and it completes them in a plausible way. By imagining different universes, you can use it to answer different questions. Such a gimmick would get fewer customers and fewer investors; it would probably do worse commercially. But I have to wonder if the actual technology might have been more useful.

My second idea is dumber, to the point where I mostly know why it doesn’t work. But thinking about it might help clarify how things work for people unused to AI.

I saw someone point out that, unlike something like Wikipedia, AI doesn’t give you context. You shouldn’t trust Wikipedia, or a source you find on Google, blindly. If you want to, you can look through the edit history on Wikipedia, or figure out who wrote a page you found on Google and how. If ChatGPT tells you something, by default you don’t know where that knowledge came from. You can tell it to search, and then you’ll get links, but that’s because it’s using Google or the like behind the scenes anyway. You don’t know where the model is getting its ideas.

Why couldn’t we get that context, though?

Every text produced by a Large Language Model is causally dependent on its training data. Different data, different model, different text. That doesn’t mean that each text draws from one source, or just a few sources: ChatGPT isn’t copying the training data, at least not so literally.

But it does mean that, if ChatGPT says something is true, you should in principle be able to ask which data was most important in making it say that. If you leave a piece of data out of the training, and get similar answers, you can infer that the response you got doesn’t have much to do with that piece of data. But if you leave out a text in training, and now ChatGPT gives totally different responses to the same question…then there’s a pretty meaningful sense in which it got the information from that source.

If this were the type of non-AI statistical model people use in physics, this would be straightforward. Researchers do this all the time: take one experiment out of the data, see how their analysis changes, and thereby figure out which experiments are most important to check. One can even sometimes calculate, given a model, where you should look.
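The leave-one-out check described above can be sketched in a few lines of Python. This is a toy illustration, not any real physics analysis: the data and the `fit_slope` helper are invented for the example.

```python
# Toy version of the leave-one-out check: fit a slope to some data,
# then refit with each point removed to see which points the answer
# depends on most. (Illustrative sketch, not a real analysis.)

def fit_slope(points):
    """Least-squares slope through the origin for (x, y) pairs."""
    num = sum(x * y for x, y in points)
    den = sum(x * x for x, _ in points)
    return num / den

data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 30.0)]  # last point is an outlier

full = fit_slope(data)
for i in range(len(data)):
    reduced = fit_slope(data[:i] + data[i + 1:])
    # A big shift means the full result leaned heavily on point i.
    print(f"without point {i}: slope {reduced:.2f} (shift {reduced - full:+.2f})")
```

Dropping most points barely moves the slope, but dropping the outlier changes it drastically, which is exactly the signal researchers use to decide which experiments to double-check.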

Unfortunately, you can’t do this with ChatGPT. The model is just too big. You can’t calculate anything explicitly about it: the giant mathematical formulas behind it are so complicated that the most you can do is get probabilities out case by case; you can’t “unwind” them and see where the numbers come from. And you can’t just take out sources one by one and train the model again: not when training takes months of expensive computer time.

So unlike with the previous idea, I understand even on a technical level why you can’t do this. But it helped me to be able to think about what I would like to do, if it were possible. Maybe it helps you too!

Microdosing Vibe Physics

Have you heard of “vibe physics”?

The phrase “vibe coding” came first. People have been using large language models like ChatGPT to write computer code (and not the way I did last year). They chat with the model, describing what they want to do and asking the model to code it up. You can guess the arguments around this, from people who are convinced AI is already better than a human programmer to people sure the code will be riddled with errors and vulnerabilities.

Now, there are people claiming not only to do vibe coding, but vibe physics: doing theoretical physics by chatting with an AI.

I think we can all agree that’s a lot less plausible. Some of the people who do vibe coding actually know how to code, but I haven’t seen anyone claiming to do vibe physics who actually understands physics. They’re tech entrepreneurs in the most prominent cases, random people on the internet otherwise. And while a lot of computer code is a minor tweak on something someone has already done, theoretical physics doesn’t work that way: if someone has already come up with your idea, you’re an educator, not a physicist.

Still, I think there is something to keep in mind about the idea of “vibe physics”, related to where physics comes from.

Here’s a question to start with: go back a bit before the current chat-bot boom. There were a ton of other computational and mathematical tools. Theorem-proving software could encode almost arbitrary mathematical statements in computer code and guarantee their accuracy. Statistical concepts like Bayes’ rule described how to reason from evidence to conclusions, not flawlessly but as well as anyone reliably can. We had computer simulations for a wealth of physical phenomena, and approximation schemes for many others.

With all those tools, why did we still have human physicists?

That is, go back before ChatGPT, before large language models. Why not just code up a program that starts with the evidence and checks which mathematical model fits it best?

In principle, I think you really could have done that. But you could never run that program. It would take too long.

Doing science 100% correctly and reliably is agonizingly slow, and prohibitively expensive. You cannot check every possible model, nor can you check those models against all the available data. You must simplify your problem, somehow, even if it makes your work less reliable, and sometimes incorrect.

And for most of history, humans have provided that simplification.

A physicist isn’t going to consider every possible model. They’re going to consider models that are similar to models they studied, or similar to models others propose. They aren’t going to consider all the evidence. They’ll look at some of the evidence, the evidence other physicists are talking about and puzzled by. They won’t simulate the consequences of their hypotheses in exhaustive detail. Instead, they’ll guess, based on their own experience, a calculation that captures what they expect to be relevant.

Human physicists provided the unreliable part of physics, the heuristics. The “vibe physics”, if you will.

AI is also unreliable, also heuristic. But humans still do this better than AI.

Part of the difference is specificity. These AIs are trained on all of human language, and then perhaps fine-tuned on a general class of problems. A human expert has spent their life fine-tuning on one specific type of problem, and their intuitions, their heuristics, their lazy associations and vibes, all will be especially well-suited to problems of that type.

Another part of the difference, though, is scale.

When you talk to ChatGPT, it follows its vibes into paragraphs of text. If you turn on reasoning features, you make it check its work in the background, but it still is generating words upon words inside, evaluating those words, then generating more.

I suspect, for a physicist, the “control loop” is much tighter. Many potential ideas get ruled out a few words in. Many aren’t even expressed in words at all, just concepts. A human physicist is ultimately driven by vibes, but they check and verify those vibes, based on their experience, at a much higher frequency than any current AI system can achieve.

(I know almost nothing about neuroscience. I’m just basing this on what it can feel like, to grope through a sentence and have it assemble itself as it goes into something correct, rather than having to go back and edit it.)

As companies get access to bigger datacenters, I suspect they’ll try to make this loop tighter, to get AI to do something closer to what humans appear to do. And then maybe AI will be able to do vibe physics.

Even then, though, you should not do vibe physics with the AI.

If you look at the way people describe doing vibe physics, they’re not using the AI for the vibes. They’re providing the vibes, and the AI is supposed to check things.

And that, I can confidently say, is completely ass-backwards. The AI is a vibe machine, it is great at vibes. Substituting your vibes will just make it worse. On the other hand, the AI is awful at checking things. It can find published papers sometimes, which can help you check something. But it is not set up to do the math, at least not unless the math can be phrased as a simple Python script or an IMO problem. In order to do anything like that, it has to call another type of software to verify. And you could have just used that software.

Theoretical physics is still not something everyone can do. Proposing a crackpot theory based on praise from ChatGPT and a list of papers it claims have something to do with your idea may feel more convincing than proposing one based on a few papers you found on Google and a couple YouTube videos, which makes it more tempting. But it’s still proposing a crackpot theory. If you want to get involved, there’s still no substitute for actually learning how physics works.

I’ve Felt Like a Hallucinating LLM

ChatGPT and its kin work by using Large Language Models, or LLMs.

A climate model is a pile of mathematics and code, honed on data from the climate of the past. Tell it how the climate starts out, and it will give you a prediction for what happens next.

Similarly, a language model is a pile of mathematics and code, honed on data from the texts of the past. Tell it how a text starts, and it will give you a prediction for what happens next.

We have a rough idea of what a climate model can predict. The climate has to follow the laws of physics, for example. Similarly, a text should follow the laws of grammar, the order of verbs and nouns and so forth. The creators of the earliest, smallest language models figured out how to do that reasonably well.
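For a sense of what “predict what happens next” means at the smallest scale, here is a toy next-word predictor. Everything in it (the training text, the `predict_next` helper) is invented for illustration; real language models are incomparably larger, but the job description is the same.

```python
from collections import Counter, defaultdict

# A toy language model: learn, from a tiny "training text", which word
# most often follows each word. Real LLMs are vastly more sophisticated,
# but the task is the same: given the text so far, predict what comes next.

training_text = (
    "the cat sat on the mat the cat ate the fish "
    "the dog sat on the rug"
).split()

follower_counts = defaultdict(Counter)
for word, nxt in zip(training_text, training_text[1:]):
    follower_counts[word][nxt] += 1

def predict_next(word):
    """Return the most common follower of `word`, or None if unseen."""
    counts = follower_counts[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # prints "cat": the most common follower in the toy data
```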

Texts do more than just follow grammar, though. They can describe the world. And LLMs are both surprisingly good and surprisingly bad at that. They can do a lot when used right, answering test questions most humans would struggle with. But they also “hallucinate”, confidently saying things that have nothing to do with reality.

If you want to understand why large language models make both good predictions and bad, you shouldn’t just think about abstract “texts”. Instead, think about a specific type of text: a story.

Stories follow grammar, most of the time. But they also follow their own logic. The hero sets out, saves the world, and returns home again. The evil queen falls from the tower at the climax of the final battle. There are three princesses, and only the third can break the spell.

We aren’t usually taught this logic, like we’re taught physics or grammar. We learn it from experience, from reading stories and getting used to patterns. It’s the logic, not of how a story must go, but of how a story typically goes. And that question, of what typically comes next, is exactly the question LLMs are designed to answer.

It’s also a question we sometimes answer.

I was a theatre kid, and I loved improv in particular. Some of it was improv comedy, the games and skits you might have seen on “Whose Line Is It Anyway?” But some of it was more…hippy stuff.

I’d meet up with a group on Saturdays. One year we made up a creation myth, half-rehearsed and half-improvised, a collection of gods and primordial beings. The next year we moved the story forward. Civilization had risen…and fallen again. We played a group of survivors gathered around a campfire, wary groups wondering what came next.

We plotted out characters ahead of time. I was the “villain”, or the closest we had to one. An enforcer of the just-fallen empire, the oppressor embodied. While the others carried clubs, staves, and farm implements, I was the only one with a real weapon: a sword.

(Plastic in reality, but the audience knew what to do.)

In the arguments and recriminations of the story, that sword set me apart, a constant threat that turned my character from contemptible to dangerous, that gave me a seat at the table even as I antagonized and stirred the pot.

But the story had another direction. The arguments pushed and pulled, and gradually the survivors realized that they would not survive if they did not put their grievances to rest, if they did not seek peace. So, one man stepped forward, and tossed his staff into the fire.

The others followed. One by one, clubs and sticks and menacing tools were cast aside. And soon, I was the only one armed.

If I was behaving logically, if I followed my character’s interests, I would have “won” there. I had gotten what I wanted, now there was no check on my power.

But that wasn’t what the story wanted. Improv is a game of fast decisions and fluid invention. We follow our instincts, and our instincts are shaped by experience. The stories of the past guide our choices, and must often be the only guide: we don’t have time to edit, or to second-guess.

And I felt the story, and what it wanted. It was a command that transcended will, that felt like it left no room for an individual actor making an individual decision.

I cast my sword into the fire.

The instinct that brought me to do that is the same instinct that guides authors when they say that their characters write themselves, when their story goes in an unexpected direction. It’s an instinct that can be tempered and counteracted, with time and effort, because it can easily lead to nonsense. It’s why every good book needs an editor, why improv can be as repetitive as it is magical.

And it’s been the best way I’ve found to understand LLMs.

An LLM telling a story tells a typical story, based on the data used to create it. In the same way, an LLM giving advice gives typical advice, to some extent in content but more importantly in form, advice that is confident and mentions things advice often mentions. An LLM writing a biography will write a typical biography, which may not be your biography, even if your biography was one of those used to create it, because it tries to predict how a biography should go based on all the other biographies. And all of these predictions and hallucinations are very much the kind of snap judgement that disarmed me.

These days, people are trying to build on top of LLMs and make technology that does more, that can edit and check its decisions. For the most part, they’re building these checks out of LLMs. Instead of telling one story, of someone giving advice on the internet, they tell two stories: the advisor and the editor, one giving the advice and one correcting it. They have to tell these stories many times, broken up into many parts, to approximate something other than the improv actor’s first instincts, and that’s why software that does this is substantially more expensive than more basic software that doesn’t.
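The “two stories” setup can be sketched like this. Here `ask_model` is a hypothetical stand-in for a real LLM API, with canned responses so the sketch runs on its own; the point is only the structure: one call drafts, a second call edits the draft, roughly doubling the work per answer.

```python
# Sketch of "telling two stories": run a model once as an advisor and
# once as an editor of the advisor's draft. ask_model() is a canned
# stand-in for a real LLM call, used here only to show the structure.

def ask_model(prompt):
    # Hypothetical stand-in: a real system would call an LLM API here.
    if prompt.startswith("Advise:"):
        return "Drink eight glasses of water and also some bleach."
    # The "editor" story: return the draft with the nonsense stripped.
    return "Drink eight glasses of water."

draft = ask_model("Advise: staying healthy")                       # first story
final = ask_model(f"Edit, removing anything unsafe: {draft}")      # second story
print(final)
```

Two calls instead of one is the simplest case; systems that break the editing into many passes multiply the cost accordingly, which is why this kind of software costs substantially more to run.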

I can’t say how far they’ll get. Models need data to work well, decisions need reliability to be good, computers need infrastructure to compute. But if you want to understand what’s at an LLM’s beating heart, think about the first instincts you have in writing or in theatre, in stories or in play. Then think about a machine that just does that.

AI Can’t Do Science…And Neither Can Other Humans

Seen on Twitter:

I don’t know the context here, so I can’t speak to what Prof. Cronin meant. But it got me thinking.

Suppose you, like Prof. Cronin, were to insist that AI “cannot in principle” do science, because AI “is not autonomous” and “does not come up with its own problems to solve”. What might you mean?

You might just be saying that AI is bad at coming up with new problems to solve. That’s probably fair, at least at the moment. People have experimented with creating simple “AI researchers” that “study” computer programs, coming up with hypotheses about the programs’ performance and testing them. But it’s a long road from that to reproducing the much higher standards human scientists have to satisfy.

You probably don’t mean that, though. If you did, you wouldn’t have said “in principle”. You mean something stronger.

More likely, you might mean that AI cannot come up with its own problems, because AI is a tool. People come up with problems, and use AI to help solve them. In this perspective, not only is AI “not autonomous”, it cannot be autonomous.

On a practical level, this is clearly false. Yes, machine learning models, the core technology in current AI, are set up to answer questions. A user asks something, and receives the model’s prediction of the answer. That’s a tool, but for the more flexible models like GPT it’s trivial to turn it into something autonomous. Just add another program: a loop that asks the model what to do, does it, tells the model the result, and asks what to do next. Like taping a knife to a Roomba, you’ve made a very simple modification to make your technology much more dangerous.

You might object, though, that this simple modification of GPT is not really autonomous. After all, a human created it. That human had some goal, some problem they wanted to solve, and the AI is just solving the problem for them.

That may be a fair description of current AI, but insisting it’s true in principle has some awkward implications. If you make a “physics AI”, just tell it to do “good physics”, and it starts coming up with hypotheses you’d never thought of, is it really fair to say it’s just solving your problem?

What if the AI, instead, was a child? Picture a physicist encouraging a child to follow in their footsteps, filling their life with physics ideas and rhapsodizing about the hard problems of the field at the dinner table. Suppose the child becomes a physicist in turn, and finds success later in life. Were they really autonomous? Were they really a scientist?

What if the child, instead, was a scientific field, and the parent was the general public? The public votes for representatives, the representatives vote to hire agencies, and the agencies promise scientists money if the agencies like the problems the scientists come up with. Who is autonomous here?

(And what happens if someone takes a hammer to that process? I’m…still not talking about this! No-politics-rule still in effect, sorry! I do have a post planned, but it will have to wait until I can deal with the fallout.)

At this point, you’d probably stop insisting. You’d drop that “in principle”, and stick with the claim I started with, that current AI can’t be a scientist.

But you have another option.

You can accept the whole chain of awkward implications, bite all the proverbial bullets. Yes, you insist, AI is not autonomous. Neither is the physicist’s child in your story, and neither are the world’s scientists paid by government grants. Each is a tool, used by the one, true autonomous scientist: you.

You are stuck in your skull, a blob of curious matter trained on decades of experience in the world and pre-trained with a couple billion years of evolution. For whatever reason, you want to know more, so you come up with problems to solve. You’re probably pretty vague about those problems. You might want to see more pretty pictures of space, or wrap your head around the nature of time. So you turn the world into your tool. You vote and pay taxes, so your government funds science. You subscribe to magazines and newspapers, so you hear about it. You press out against the world, and along with the pressure that already exists it adds up, and causes change. Biological intelligences and artificial intelligences scurry at your command. From their perspective, they are proposing their own problems, much more detailed and complex than the problems you want to solve. But from yours, they’re your limbs beyond limbs, sight beyond sight, asking the fundamental questions you want answered.

Integration by Parts, Evolved

I posted what may be my last academic paper today, about a project I’ve been working on with Matthias Wilhelm for most of the last year. The paper is now online here. For me, the project has been a chance to broaden my horizons, learn new skills, and start to step out of my academic comfort zone. For Matthias, I hope it was grant money well spent.

I wanted to work on something related to machine learning, for the usual trendy employability reasons. Matthias was already working with machine learning, but was interested in pursuing a different question.

When is machine learning worthwhile? Machine learning methods are heuristics, unreliable methods that sometimes work well. You don’t use a heuristic if you have a reliable method that runs fast enough. But if all you have are heuristics to begin with, then machine learning can give you a better heuristic.

Matthias noticed a heuristic embedded deep in how we do particle physics, and guessed that we could do better. In particle physics, we use pictures called Feynman diagrams to predict the probabilities for different outcomes of collisions, comparing those predictions to observation to look for evidence of new physics. Each Feynman diagram corresponds to an integral, and for each calculation there are hundreds, thousands, or even millions of those integrals to do.

Luckily, physicists don’t actually have to do all those integrals. It turns out that most of them are related, by a slightly more advanced version of that calculus class mainstay, integration by parts. Using integration by parts you can solve a list of equations, finding out how to write your integrals in terms of a much smaller list.

How big a list of equations do you need, and which ones? Twenty-five years ago, Stefano Laporta proposed a “golden rule” to choose, based on his own experience, and people have been using it (more or less, with their own tweaks) since then.

Laporta’s rule is a heuristic, with no proof that it is the best option, or even that it will always work. So we probably shouldn’t have been surprised when someone came up with a better heuristic. Watching talks at a December 2023 conference, Matthias saw a presentation by Johann Usovitsch on a curious new rule. The rule was surprisingly simple, just one extra condition on top of Laporta’s. But it was enough to reduce the number of equations by a factor of twenty.

That’s great progress, but it’s also a bit frustrating. Over almost twenty-five years, no-one had guessed this one simple change?

Maybe, thought Matthias and I, we need to get better at guessing.

We started out thinking we’d try reinforcement learning, a technique where a machine is trained by playing a game again and again, changing its strategy when that strategy brings it a reward. We thought we could have the machine learn to cut away extra equations, getting rewarded if it could cut more while still getting the right answer. We didn’t end up pursuing this very far before realizing another strategy would be a better fit.

What is a rule, but a program? Laporta’s golden rule and Johann’s new rule could both be expressed as simple programs. So we decided to use a method that could guess programs.

One method stood out for sheer trendiness and audacity: FunSearch. FunSearch is a type of algorithm called a genetic algorithm, which tries to mimic evolution. It makes a population of different programs, “breeds” them with each other to create new programs, and periodically selects out the ones that perform best. That’s not the trendy or audacious part, though, people have been doing that sort of genetic programming for a long time.

The trendy, audacious part is that FunSearch generates these programs with a Large Language Model, or LLM (the type of technology behind ChatGPT). Using an LLM trained to complete code, FunSearch presents the model with two programs labeled v0 and v1 and asks it to complete v2. In general, program v2 will have some traits from v0 and v1, but also a lot of variation due to the unpredictable output of LLMs. The inventors of FunSearch harnessed that unpredictability as the variation needed for evolution, evolving programs that find better solutions to math problems.
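To make the evolutionary skeleton concrete, here is a bare-bones genetic algorithm of the kind FunSearch builds on. This is a generic sketch, not FunSearch itself: bit strings stand in for programs, and random crossover and mutation stand in for the LLM’s “complete v2” step.

```python
import random

# Bare-bones genetic loop: keep a population of candidates, breed them,
# keep the best. Candidates here are bit strings scored by how many 1s
# they contain; in FunSearch the candidates are programs and "breeding"
# is an LLM asked to write v2 given v0 and v1.

random.seed(0)  # make the sketch reproducible

def score(candidate):
    return sum(candidate)

def breed(a, b):
    cut = random.randrange(len(a))
    child = a[:cut] + b[cut:]       # crossover: splice two parents
    i = random.randrange(len(child))
    child[i] ^= 1                   # mutation: flip one random bit
    return child

pop = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]
for generation in range(100):
    pop.sort(key=score, reverse=True)
    pop = pop[:10]                  # select the fittest
    pop += [breed(random.choice(pop), random.choice(pop)) for _ in range(20)]

print(max(score(c) for c in pop))   # best candidate found
```

Everything interesting in FunSearch lives in replacing `breed` and `score`: the LLM supplies far richer variation than a bit flip, and the score measures how well a generated program solves the target problem.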

We decided to try FunSearch on our problem, modifying it a bit to fit the case. We asked it to find a shorter list of equations, giving a better score for a shorter list but a penalty if the list wasn’t able to solve the problem fully.

Some tinkering and headaches later, it worked! After a few days and thousands of program guesses, FunSearch was able to find a program that reproduced the new rule Johann had presented. A few hours more, and it even found a rule that was slightly better!

But then we started wondering: do we actually need days of GPU time to do this?

An expert on heuristics we knew had insisted, at the beginning, that we try something simpler. The approach we tried then didn’t work. But after running into some people using genetic programming at a conference last year, we decided to try again, using a Python package they used in their work. This time, it worked like a charm, taking hours rather than days to find good rules.

This was all pretty cool, a great opportunity for me to cut my teeth on Python programming and its various attendant skills. And it’s been inspiring, with Matthias drawing together more people interested in seeing just how much these kinds of heuristic methods can do. I should be clear, though, that so far I don’t think our result is useful. We did better than the state of the art on an example, but only slightly, and in a way that I’d guess doesn’t generalize. And we needed quite a bit of overhead to do it. Ultimately, while I suspect there’s something useful to find in this direction, it’s going to require more collaboration, both with people using the existing methods who know better what the bottlenecks are, and with experts in these, and other, kinds of heuristics.

So I’m curious to see what the future holds. And for the moment, happy that I got to try this out!