Tag Archives: academia

An AI Opinions Chart

You ever read something and suddenly a whole classification scheme lights up in your head?

A thread on X from “stringking42069” showed me a combination of opinions I hadn’t seen before. stringking42069 is a pro-string theory commentator with a macho gym bro memer gimmick. He’s openly contemptuous of many physicists who describe themselves as string theorists, arguing that only a smaller number really deserve the name.

To be clear, none of that is the new combination. Long-time readers of this blog will remember a frequent commenter with a very similar attitude, if much less tendency to use the word “bro”.

The new thing, from my perspective, is how he thinks about AI. As he explains in that thread, he sees AI as great at certain kinds of physics calculations, ones where the methods and goals are mostly known and the challenge is working out the math. He doesn’t expect it to be able to contribute real creativity or judgement, the messy decision-making that physicists use to decide what is worth building in the first place.

Others with that perspective tend to argue that this will be a boon for scientists, who AI will free up to do creative work, multiplying their output. The difference is, stringking42069 thinks a lot of scientists are not doing creative work in the first place, including most of the people making extensive use of AI. So if anything he’s happy to see them go, and only pissed that they’re sucking up resources and attention on the way out, and discouraging students who could be joining the parts of the field that do real creative work.

It made me realize that there are two axes to thinking about AI in physics.

On the one hand, there’s where you think AI capabilities are. Is AI going to lead to “a nation of geniuses in a data center”, an AI-powered super-(cyber-)Ed Witten for everything and everyone? Is AI great at routine work and coding, but will never be able to do anything really creative or novel? Or is AI total hype, almost always a waste of time?

On the other hand, there’s another axis: misanthropy about science. For some of the people arguing about AI online, most scientists are good people trying their best to do worthwhile things. For others, most scientists are complacent and cliquish, wasting time and money on ideas that are going nowhere and forcing the real geniuses out of the field.

Put those together, and you get the table below:

|  | Thinks academia is mostly fine | Misanthrope |
| --- | --- | --- |
| AI geniuses are coming | The practice of science will change. We’ll play at science like chess, and have fun trying to read and understand amazing AI insights. | Soon all scientists will be out of a job when the public notices AI can do it all better. Then the real breakthroughs will come. |
| AI can do routine work | AI frees scientists to focus on what we do best: creativity. We should think carefully about how to train junior scientists now, though. | AI is comparable to bad scientists who only do derivative work. If they leave, we real paradigm-changers could inherit the field. |
| AI is complete hype | Most scientists don’t use AI. AI is worrying because it misleads students and the public, who should listen to real scientists. | Scientists are shilling for AI companies, as you should expect for people who waste the public’s money on reputation games. |

This classification is missing a lot, of course. One important question is not just what AI can do in principle, but what it can do cost-effectively, and whether anyone is actually willing to pay for it. A point where I agree with stringking42069 is that companies get a lot of good PR out of building AI physicists right now, and that PR benefit won’t be relevant forever. I’m also leaving out the more general questions of AI’s effect on society, for example people who think AI geniuses will lead to the end of the world as we know it.

But I suspect if you look at this table, you can already start matching the scientists you see on social media. I’ve seen examples of all of these in the wild (though the bottom-left is somewhat rare, as far as I can tell). Where do you fall?

Should You Read What You Cite? That Depends

When arXiv announced it would ban people for hallucinated citations (that is, citations of papers that don’t exist), the discussion online got sidetracked by the question of whether academics actually read the papers they cite. Some people proudly insisted that any good scholar always reads every paper they reference, while others argued that was ridiculous.

As always, the answer is never that simple. In certain fields, it is enormously important to read the papers you cite if you want to do solid, careful, scholarly work. In others, it’s entirely irrelevant.

It mostly comes down to what citations are for. And luckily, I’ve already written a post about that.

So let’s go through the citation motivations I mention in that post.

First, some citations are about respecting priority, feeding the system by which academics get credit for having an idea first. The incentive system of academia depends on getting this more or less right, but that doesn’t mean every academic has to check things at every step of the way. Besides, if you get this wrong, you’ll find out quickly. Submit a paper to a preprint server like arXiv, and you’ll be sure to get emails telling you that some obscure Soviet researcher figured it all out first.

Other citations are about substantiating claims. These are the most important to get right. Here, you really ought to have read, if not the whole paper, at least the full justification for the claim you’re making. You can have some leeway if the methods are unfamiliar enough, for example a complicated experiment you can’t understand all the details of. Science and technology do require some trust. But you should have at least a sense of where things could go wrong, and why.

Citations to provide context are a different beast. Here, you’re trying to tell a reader where your ideas come from. You can’t show them the conversations you have with your colleagues, the things they value and get you excited about. So you have to show them papers instead. But the papers aren’t the thing you read, they’re just a convenient proxy.

Finally, citations do sometimes just exist to follow social conventions. And yeah, you don’t have to read these, just like you don’t have to say how you’re doing when someone asks you how you’re doing. They’re the academic equivalent of social white lies, and should be taken roughly as seriously, both by their supporters and detractors.

Doing Things Well Is an International Activity

In the US, funding agencies seem to be increasingly opposed to an often inevitable feature of good science: international collaboration. Scientists have been told by officials at the National Institutes of Health that they need to remove mention of foreign collaborators from progress reports, or that they need to avoid such collaborations to begin with. At NASA, officials have told scientists that rather than just avoiding funding work in China, they should actively avoid collaborating with Chinese researchers. And a recently introduced bill would make that restriction more explicit.

I have a general policy against discussing concrete political issues on this blog, so I’m not going to dig into the details of who’s doing what here, how far it’s going or how novel it is. That policy extends to the comments. If you mention specific laws, politicians, or political parties, I will delete your comment.

I do want to say something more general, though. I think people often underestimate just how important international collaboration is.

I’ve talked before about how scientific specialization spreads scientists around the world. Scientists want to work with people who work on their specific interests, and there are often only a few people that fit that description. So people move across the world, creating centers of expertise.

More than that, though, essentially any activity, done well, is done internationally. The better you want to perform, the more likely it is that the best collaborator will be someone in another country.

People don’t notice this as much as they could, because they’re used to the exceptions. Popular art is often siloed by language and cultural references. Sports are intentionally set up as competitions between regions and nations, and militaries compete as a practical necessity. But without those exceptions, international competition wins out. The best doctor, the best classical musician, and the best businessperson for a job can’t be expected to come from one country or another. Those fields, like science, are international.

When that internationalism is weak, it’s a warning sign. Without that drive to succeed on an international stage, scientists get lazy. There are countries with a history of academic cronyism, where universities were run more on interpersonal politics than scholarly merit, cozy fiefdoms where prominent academics dole out positions. To combat this, policymakers work to make their research systems more international. They explicitly ask about international collaborations and participation in international conferences in grant applications, not to discourage them, but to encourage them: to reward academics who show merit on the international stage and break up lazy patronage networks.

It worries me that it sounds like some US policymakers want to do the opposite. People are increasingly worried about bias and groupthink in the sciences, and increasingly mad that scientists could be wasting the public’s money to maintain a cushy lifestyle. International collaboration is how you hold scientists to account, how you force them to compete and show their merit. If you drop that, academia is going to get a whole lot worse.

ArXiv Will Ban You for Hallucinated References

Thomas Dietterich, Chair of the Computer Science section of the preprint server arXiv.org, recently clarified the site’s policies towards “hallucinated” citations and other signs of careless use of AI in a post on X. If your paper contains a citation to a paper that doesn’t actually exist, or shows other signs that you didn’t read it before posting, like leftover commentary (the example he gave was “here is a 200 word summary; would you like me to make any changes?”), then you can get banned from the arXiv for one year. Even after that year you’d be on a kind of “probation”, and would need to show that your next few papers had been accepted by peer-reviewed journals before posting them.

At the risk of saying the obvious, this is a good idea! arXiv isn’t peer review, it isn’t meant to judge the value of the papers it hosts. But it still needs to be a useful place for scientists to post their papers, which is why they try to keep spam and irrelevant content to a minimum. If you don’t actually endorse the content of a paper, you shouldn’t post it in the first place.

That said, the whole existence of hallucinated citations on arXiv feels a little silly. It makes sense for academic journals and preprint servers in other fields. But arXiv was the first site of its kind for a reason. Its users, physicists, mathematicians, and computer scientists, don’t need much hand-holding when it comes to computers. Papers submitted to arXiv aren’t typically written in Word: they’re written in a document-writing language called LaTeX, which lets users make decently-formatted papers without help from a journal. Physicist-written code may be terrible by any reasonable criteria…but it exists, much more universally than, for example, biologist-written code.

This extends to citations. In my old field, there is a database called INSPIRE that updates automatically from arXiv. Click on a paper, and a handy “cite” link gives you standardized citations in several formats, ready to copy and paste into your LaTeX code. Nearly every citation in my papers is copied from there. The ones that aren’t are either from other fields where I didn’t know of that style of database, or things that haven’t been published (this can be manuscripts in preparation, or personal communications).

All of this, though, feels like a lot less than what the field could be doing. In a world where almost everyone posts their papers to the same website, and almost everyone has at least a rudimentary understanding of programming…why are people still writing citations in free-form text in the first place? Why aren’t citations built into the submitted papers on arXiv, automatically linked to the papers they cite? Why don’t we have a setup where, except for a small number of “special” citations, every citation is built so that it automatically goes to a real paper, and gives a clear error message if it doesn’t? In short, why are hallucinated citations even possible?
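To make the idea concrete, here is a minimal sketch of the kind of check I have in mind. It is not anything arXiv or INSPIRE actually provides: the names `KNOWN_PAPERS`, `find_cite_keys`, and `check_citations` are hypothetical, the citation keys are made up, and the dictionary is a stand-in for a real database lookup. It also assumes citations are written as plain `\cite{...}` commands.

```python
import re

# Hypothetical stand-in for a real bibliographic database lookup.
# The keys and titles below are invented for illustration.
KNOWN_PAPERS = {
    "Doe:2020abc": "A Placeholder Paper That Exists",
    "Roe:2021xyz": "Another Placeholder Paper",
}

# Assumption: "special" citations (personal communications, work in
# preparation) are marked with this prefix and skip the lookup.
SPECIAL_PREFIX = "special:"

# Matches plain \cite{key1, key2}; variants like \citep or \cite[p. 3]{...}
# are deliberately ignored to keep the sketch short.
CITE_PATTERN = re.compile(r"\\cite\{([^}]*)\}")


def find_cite_keys(latex_source):
    """Return every key used in a \\cite{...} command, in order."""
    keys = []
    for group in CITE_PATTERN.findall(latex_source):
        keys.extend(key.strip() for key in group.split(",") if key.strip())
    return keys


def check_citations(latex_source, known_papers=KNOWN_PAPERS):
    """Return a list of clear error messages, one per unresolved key."""
    errors = []
    for key in find_cite_keys(latex_source):
        if key.startswith(SPECIAL_PREFIX):
            continue
        if key not in known_papers:
            errors.append(f"Citation '{key}' does not match any known paper.")
    return errors


if __name__ == "__main__":
    sample = r"As shown in \cite{Doe:2020abc, Made:2099up} and \cite{special:private-note}."
    for message in check_citations(sample):
        print(message)
```

The point of the design is that an unresolved key is an error at submission time rather than something a reader discovers later. The hard part, of course, is everything the sketch assumes away: who maintains the database, and who decides what counts as “special”.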

Look, I’m naive, I get that. I believe in automation, not in the modern context of LLMs and other heuristics, but in setting clear procedures and building clear rules. The world doesn’t work that way! The clear rules are always more contentious than you expect, the fuzzy human-led version always the only choice people can agree on.

But still. Citations. There has to be a better system, right?

What AI Physicists Are Missing and What They Aren’t

I’ve seen a couple more thoughtful takes on use of LLMs for physics lately. This blog post by Minas Karamis is particularly nice.

He points out something that I’ve said a version of: an AI that must be supervised like a student isn’t very useful, because the main point of student projects isn’t the paper at the end, it’s training the student. If students don’t struggle through all the mistakes of a project, they won’t get the expertise to one day do greater things.

Someone might object that not all suffering is educational. In the 1700s, Leonhard Euler calculated digit after digit of transcendental numbers by hand. Nobody asks students to do that anymore, and they still seem to turn out alright. Why would using an LLM for science be worse than using a computer for numerical calculations?

In a word: different skills. Programming numerics teaches you some of the same skills as calculating the numbers by hand: being specific about what you mean, and being aware of the details and their implications. Prompting an AI still requires those skills, to check whether the AI’s output is correct. But it’s much worse at teaching them: unlike programming or calculating, when you prompt an AI, the consequences of your actions aren’t predictable.

For some, though, there is another objection. Sure, using AI reliably might require those skills now. But when it gets better, surely being careful will stop mattering. Surely the AI will end up doing science on its own, and all that training will be as useful as if we trained the students to play football.

I’m skeptical, but not as strongly as some. I think we’re still living in a time when it makes sense to hire scientists, and train people to think, and invest in your retirement.

I don’t think I have any knock-down arguments for that, though. Just some suggestive ones.

One I’ve talked about before is that a lot of the most important parts of thinking aren’t written down. An AI physicist is going to have a hard time replicating the kinds of methods and approaches that people use behind the scenes, but rarely describe or spell out. It will be easier to suss this out over time, as more data accumulates of people working with LLMs and correcting them. But ultimately there isn’t going to be a lot of documentation of this kind of thing.

Another limitation is memory. A mature scientist can draw from experiences across their entire career. For an LLM, any problem it’s solved in the past is by default lost in each new session. People build structures around this, taking notes and reminding the AI when it “wakes up”, or making documents the AI can be prompted to check. But nothing in this vein so far seems to get nearly as wide-scope or powerful as human memory. A scientist’s career is still the best way we have to build durable, functional expertise.

Finally, there is a question of costs, and efficiency. Here I’m not an expert, and I get the impression the actual experts disagree. I don’t know whether we should expect scaling to hit a wall, but I wouldn’t be that surprised if it did.

There are other common reasons for skepticism that seem more dubious to me. I don’t think AI is inherently worse at creativity just because it’s trained on existing work, though some of the skills we associate with creativity aren’t very well-documented, and thus are hard to train for. I don’t think AI’s randomness or unreliability is a deal-breaker, because human intuition is also random and unreliable: we solve that with tools, and that’s something AI can in principle do as well. I don’t think humans are “more agentic” or something, except in the sense that most AIs are made by companies who need to make them behave in a customer-friendly way. But an agent is just a game-theoretic construct, a way to figure out who can win or lose in situations with defined stakes, and anything you can train or engineer to try to win can be modeled by that construct.

Coming from a place of uncertainty, my main appeal to you is to not get hung up on the bad reasons, whether they’re your own or come from the people you’re arguing with. Focus on the best arguments, and see where they take you.

ArXiv to Leave Cornell

Yes, I’m late to the party on this one.

A few weeks ago, arXiv.org announced that it will be leaving Cornell, the university that currently manages it, and establishing its own nonprofit.

arXiv is a crucial part of the infrastructure for physics, mathematics, computer science, and a few related fields. Researchers post papers to arXiv as what are called “preprints” before the papers are submitted to a journal. In practice, nobody ends up reading the journal versions: the arXiv is free to access, and typically reflects better what the paper’s authors want the paper to look like. So in practice, arXiv is how researchers in these fields communicate, which makes its role enormously important.

If you’re from another field, you might wonder how something like arXiv is financially sustainable. The answer is that it works better than you’d think, but not perfectly. They’ve been supported by philanthropy, in addition to Cornell, and while there have apparently been budget shortfalls and drama behind the scenes, arXiv has nonetheless stayed in continuous operation since 1991.

The move to an independent nonprofit is supposed to make it easier for arXiv to get philanthropic funding, which otherwise needed to be filtered through Cornell in ways that were sometimes opaque or didn’t give donors the control they wanted.

While it wasn’t mentioned in the announcements, I suspect another motivation is security. Universities are fixed in place, and that makes them easier to pressure. For an organization that wants to process scientific output in an unbiased way, the link to Cornell represented a vulnerability. It’s not a vulnerability that has mattered yet, and likely didn’t seem like it would ever matter. But it wouldn’t surprise me if they’re more worried now that someone might try to pressure Cornell in order to change how arXiv operates. For critical scientific infrastructure, it’s important to be as independent of those kinds of pressure as possible.

Trust Is a Tree

Scientists trust what they think they can verify.

In principle, you can work your way through the proof of every mathematical theorem. With enough money and time, you could replicate every experiment. For every expert opinion, you could dig through the literature and find how it was justified.

And while a scientist can’t actually do that for every field, they might be able to for the ones they care about most. In your specialty, you probably can check the logic behind every claim. And you know that enough people try that you can trust your colleagues’ work.

As a science journalist, most of the time, you can’t do those checks. You don’t even pretend you can. Instead, you build trust, like a tree.

You start with a grounding. A former scientist might trust their former colleagues, people they trusted, as a scientist, to do (and know) good work. A non-scientist has to start somewhere else. They might use prestige, looking up those tenured folks at Harvard or Princeton or Stanford. They might look to who other journalists trusted, scientists who’ve already been in the news. They might track journals or roles, assuming that a publication in Nature, or a position on a national grant committee, has a special meaning.

And if things stopped there, it would be a pretty elitist system. It still can be, and often is. But there is another step, which softens it.

The trust builds.

When I want to know if a paper in an unfamiliar field makes sense, if it’s worth covering, I try to ask someone I trust. Sometimes, they don’t know, and shrug. Other, more useful, times, they don’t know, but they have a suggestion: someone they trust, who can give me the answer.

And so I ask the new person, and now I trust someone more.

And suppose the new person says the new paper is good, and worth covering, good science and all that jazz.

Well, now I can trust its authors too, right?

So when the next paper comes, I now don’t just have that first someone. I have the person they recommended, and the authors of the previous paper.

The trust builds out, and up, like branches on a tree.

The Twitter of Physics

The paper I talked about last week was frustratingly short. That’s not because the authors were trying to hide anything, or because they were lazy. It’s just that these days, that’s how the game is played.

Twitter started out with a fun gimmick: all posts had to be under 140 characters. The restriction inspired some great comedy, trying to pack as much humor as possible into a bite-sized format. Then, Twitter somehow became the place for journalists to discuss the news, tech people to discuss the industry, and politicians to discuss politics. Now, the length limit fuels conflict, an endless scroll of strong opinions without space for nuance.

Physics has something like this too.

In the 1950s, it was hard for scientists to get the word out quickly about important results. The journal Physical Review had a trick: instead of normal papers, they’d accept breaking news in the form of letters to the editor, which they could publish more quickly than the average paper. In 1958, editor Samuel Goudsmit founded a new journal, Physical Review Letters (or PRL for short), that would publish those letters all in one place, enforcing a length limit to make them faster to process.

The new journal was a hit, and soon played host to a series of breakthrough results, as scientists chose it as a way to get their work out fast. That popularity created a problem, though. As PRL’s reputation grew, physicists started trying to publish there not because their results needed to get out fast, but because just by publishing in PRL, their papers would be associated with all of the famous breakthroughs the journal had covered. Goudsmit wrote editorials trying to slow this trend, but to no avail.

Now, PRL is arguably the most prestigious journal in physics, hosting over a quarter of Nobel prize-winning work. Its original motivation is no longer particularly relevant: the journal is not all that much faster than other journals in its area, if at all, and is substantially slower than the preprint server arXiv, which is where physicists actually read papers in practice.

The length limit has changed over the years, but not dramatically. It now sits at 3,750 words, typically allowing a five-or-six page article in tight two-column text.

If you see a physics paper on arXiv.org that fits the format, it’s almost certainly aimed at PRL, or one of the journals with similar policies that it inspired. It means the authors think their work is cool enough to hang out with a quarter of all Nobel-winning results, or at least would like it to be.

And that, in turn, means that anyone who wants to claim that prestige has to be concise. They have to leave out details (often, saving them for a later publication in a less-renowned journal). The results have to lean, by the journal’s nature, more to physicist-clickbait and a cleaned-up story than to anything their colleagues can actually replicate.

Is it fun? Yeah, I had some PRLs in my day. It’s a rush, shining up your work as far as it can go, trimming down complexities into six pages of essentials.

But I’m not sure it’s good for the field.

About the OpenAI Amplitudes Paper, but Not as Much as You’d Like

I’ve had a bit more time to dig into the paper I mentioned last week, where OpenAI collaborated with amplitudes researchers, using one of their internal models to find and prove a simplified version of a particle physics formula. I figured I’d say a bit about my own impressions from reading the paper and OpenAI’s press release.

This won’t be a real “deep dive”, though it will be long nonetheless. As it turns out, most of the questions I’d like answers to aren’t answered in the paper or the press release. Getting them will involve actual journalistic work, i.e. blocking off time to interview people, and I haven’t done that yet. What I can do is talk about what I know so far, and what I’m still wondering.

Context:

Scattering amplitudes are formulas used by particle physicists to make predictions. For a while, people would just calculate these when they needed them, writing down pages of mess that you could plug numbers into to get answers. However, forty years ago, two physicists decided they wanted more, writing “we hope to obtain a simplified form for the answer, making our result not only an experimentalist’s, but a theorist’s delight.”

In their next paper, they managed to find that “theorist’s delight”: a simplified, intuitive-looking answer that worked for calculations involving any number of particles, summarizing many different calculations. Ten years later, a few people had started building on it, and ten years after that, the big shots started paying attention. A whole subfield, “amplitudeology”, grew from that seed, finding new forms of “theorist’s delight” in scattering amplitudes.

Each subfield has its own kind of “theory of victory”, its own concept for what kind of research is most likely to yield progress. In amplitudes, it’s these kinds of simplifications. When they work out well, they yield new, more efficient calculation techniques, yielding new messy results which can be simplified once more. To one extent or another, most of the field is chasing after those situations when simplification works out well.

That motivation shapes both the most ambitious projects of senior researchers, and the smallest student projects. Students often spend enormous amounts of time looking for a nice formula for something and figuring out how to generalize it, often on a question suggested by a senior researcher. These projects mostly serve as training, but occasionally manage to uncover something more impressive and useful, an idea others can build around.

I’m mentioning all of this, because as far as I can tell, what ChatGPT and the OpenAI internal model contributed here roughly lines up with the roles students have on amplitudes papers. In fact, it’s not that different from the role one of the authors, Alfredo Guevara, had when I helped mentor him during his Master’s.

Senior researchers noticed something unusual, suggested by prior literature. They decided to work out the implications, did some calculations, and got some messy results. It wasn’t immediately clear how to clean up the results, or generalize them. So they waited, and eventually were contacted by someone eager for a research project, who did the work to get the results into a nice, general form. Then everyone published together on a shared paper.

How impressed should you be?

I said, “as far as I can tell” above. What’s annoying is that this paper makes it hard to tell.

If you read through the paper, they mention AI briefly in the introduction, saying they used GPT-5.2 Pro to conjecture formula (39) in the paper, and an OpenAI internal model to prove it. The press release actually goes into more detail, saying that the humans found formulas (29)-(32), and GPT-5.2 Pro found a special case where it could simplify them to formulas (35)-(38), before conjecturing (39). You can get even more detail from an X thread by one of the authors, OpenAI Research Scientist Alex Lupsasca. Alex had done his PhD with another one of the authors, Andrew Strominger, and was excited to apply the tools he was developing at OpenAI to his old research field. So they looked for a problem, and tried out the one that ended up in the paper.

What is missing, from the paper, press release, and X thread, is any real detail about how the AI tools were used. We don’t have the prompts, or the output, or any real way to assess how much input came from humans and how much from the AI.

(We have more for their follow-up paper, where Lupsasca posted a transcript of the chat.)

Contra some commentators, I don’t think the authors are being intentionally vague here. They’re following business as usual. In a theoretical physics paper, you don’t list who did what, or take detailed account of how you came to the results. You clean things up, and create a nice narrative. This goes double if you’re aiming for one of the most prestigious journals, which tend to have length limits.

This business-as-usual approach is ok, if frustrating, for the average physics paper. It is, however, entirely inappropriate for a paper showcasing emerging technologies. For a paper that was going to be highlighted this highly by OpenAI, the question of how they reached their conclusion is much more interesting than the results themselves. And while I wouldn’t ask them to go to the standards of an actual AI paper, with ablation analysis and all that jazz, they could at least have aimed for the level of detail of my final research paper, which gave samples of the AI input and output used in its genetic algorithm.

For the moment, then, I have to guess what input the AI had, and what it actually accomplished.

Let’s focus on the work done by the internal OpenAI model. The descriptions I’ve seen suggest that it started where GPT-5.2 Pro did, with formulas (29)-(32), but with a more specific prompt that guided what it was looking for. It then ran for 12 hours with no additional input, and both conjectured (39) and proved it was correct, providing essentially the proof that follows formula (39) in the paper.

Given that, how impressed should we be?

First, the model needs to decide to go to a specialized region, instead of trying to simplify the formula in full generality. I don’t know whether they prompted their internal model explicitly to do this. It’s not something I’d expect a student to do, because students don’t know what types of results are interesting enough to get published, so they wouldn’t be confident in computing only a limited version of a result without an advisor telling them it was ok. On the other hand, it is actually something I’d expect an LLM to be unusually likely to do, as a result of not managing to consistently stick to the original request! What I don’t know is whether the LLM proposed this for the right reason: that if you have the formula for one region, you can usually find it for other regions.

Second, the model needs to take formulas (29)-(32), write them in the specialized region, and simplify them to formulas (35)-(38). I’ve seen a few people saying you can do this pretty easily with Mathematica. That’s true, though not every senior researcher is comfortable doing that kind of thing, as you need to be a bit smarter than just using the Simplify[] command. Most of the people on this paper strike me as pen-and-paper types who wouldn’t necessarily know how to do that. It’s definitely the kind of thing I’d expect most students to figure out, perhaps after a couple of weeks of flailing around if it’s their first crack at it. The LLM likely would not have used Mathematica, but would have used SymPy, since these “AI scientist” setups usually can write and execute Python code. You shouldn’t think of this as the AI reasoning through the calculation itself, but it at least sounds like it was reasonably quick at coding it up.

Then, the model needs to conjecture formula (39). This gets highlighted in the intro, but as many have pointed out, it’s pretty easy to do. If any non-physicists are still reading at this point, take a look:

Could you guess (39) from (35)-(38)?

After that, the paper goes over the proof that formula (39) is correct. Most of this proof isn’t terribly difficult, but the way it begins is actually unusual in an interesting way. The proof uses ideas from time-ordered perturbation theory, an old-fashioned way to do particle physics calculations. Time-ordered perturbation theory isn’t something any of the authors are known for using with regularity, but it has recently seen a resurgence in another area of amplitudes research, showing up for example in papers by Matthew Schwartz, a colleague of Strominger at Harvard.

If a student of Strominger came up with an idea drawn from time-ordered perturbation theory, that would actually be pretty impressive. It would mean that, rather than just learning from their official mentor, this student was talking to other people in the department and broadening their horizons, showing a kind of initiative that theoretical physicists value a lot.

From an LLM, though, this is not impressive in the same way. The LLM was not trained by Strominger, and it did not learn specifically from Strominger’s papers. Its context suggested it was working on an amplitudes paper, and it produced an idea which would be at home in an amplitudes paper, just a different one than the one it was working on.

While not impressive, that capability may be quite useful. Academic subfields can often get very specialized and siloed. A tool that suggests ideas from elsewhere in the field could help some people broaden their horizons.

Overall, it appears that the twelve-hour run of OpenAI’s internal model reproduced roughly what an unusually bright student would be able to contribute over the course of a several-month project. Like most student projects, you could find a senior researcher who could do the project much faster, maybe even faster than the LLM. But it’s unclear whether any of the authors could have: different senior researchers have different skillsets.

A stab at implications:

If we take all this at face value, it looks like OpenAI’s internal model was able to do a reasonably competent student project with no serious mistakes in twelve hours. If they started selling that capability, what would happen?

If it’s cheap enough, you might wonder if professors would choose to use the OpenAI model instead of hiring students. I don’t think this would happen, though: I think it misunderstands why these kinds of student projects exist in a theoretical field. Professors sometimes use students to get results they care about, but more often, the student’s interest is itself the motivation, with the professor wanting to educate someone, to empire-build, or just to take on their share of the department’s responsibilities. AI is only useful for this insofar as AI companies continue reaching out to these people to generate press releases: once this is routinely possible, the motivation goes away.

More dangerously, if it’s even cheaper, you could imagine students being tempted to use it. The whole point of a student project is to train and acculturate the student, to get them to the point where they have affection for the field and the capability to do more impressive things. You can’t skip that, but people are going to be tempted to.

And of course, there is the broader question of how much farther this technology can go. That’s the hardest to estimate here, since we don’t know the prompts used. So I don’t know if seeing this result tells us anything more about the bigger picture than we knew going in.

Remaining questions:

At the end of the day, there are a lot of things I still want to know. And if I do end up covering this professionally, they’re things I’ll ask.

  1. What was the prompt given to the internal model, and how much did it do based on that prompt?
  2. Was it really done in one shot, no retries or feedback?
  3. How much did running the internal model cost?
  4. Is this result likely to be useful? Are there things people want to calculate that this could make easier? Recursion relations it could seed? Is it useful for SCET somehow?
  5. How easy would it have been for the authors to do what the LLM did? What about other experts in the community?

Hypothesis: If AI Is Bad at Originality, It’s a Documentation Problem

Recently, a few people have asked me about this paper.

A couple weeks back, OpenAI announced a collaboration with a group of amplitudes researchers, physicists who study the types of calculations people do to make predictions at particle colliders. The amplitudes folks had identified an interesting loophole, finding a calculation that many would have expected to be zero actually gave a nonzero answer. They did the calculation for different examples involving more and more particles, and got some fairly messy answers. They suspected, as amplitudes researchers always expect, that there was a simpler formula, one that worked for any number of particles. But they couldn’t find it.

Then a former amplitudes researcher at OpenAI suggested that they use AI to find it.

“Use AI” can mean a lot of different things, and most of them don’t look much like the way the average person talks to ChatGPT. This was closer than most. They were using “reasoning models”, loops that try to predict the next few phrases in a “chain of thought” again and again and again. Using that kind of tool, they were able to find that simpler formula, and mathematically prove that it was correct.

A few of you are hoping for an in-depth post about what they did, and its implications. This isn’t that. I’m still figuring out if I’ll be writing that for an actual news site, for money, rather than for free, for you folks.

Instead, I want to talk about a specific idea I’ve seen crop up around the paper.

See, for some, the existence of a result like this isn’t all that surprising.

Mathematicians have been experimenting with reasoning models for a bit now. Recently, a group published a systematic study, setting the AI loose on a database of minor open problems proposed by the famously amphetamine-fueled mathematician Paul Erdős. The AI managed to tackle a few of the problems, sometimes by identifying existing solutions that had not yet been linked to the problem database, but sometimes by proofs that appeared to be new.

The Erdős problems solved by the AI were not especially important. Neither was the problem solved by the amplitudes researchers, as far as I can tell at this point.

But I get the impression the amplitudes problem was a bit more interesting than the Erdős problems. The difference, so far, has mostly been attributed to human involvement. This amplitudes paper started because human amplitudes researchers found an interesting loophole, and only after that used the AI. Unlike the mathematicians, they weren’t just searching a database.

This lines up with a general point, one people tend to make much less carefully. It’s often said that, unlike humans, AI will never be truly creative. It can solve mechanical problems, do things people have done before, but it will never be good at having truly novel ideas.

To me, that line of thinking goes a bit too far. I suspect it’s right on one level, that it will be hard for any of these reasoning models to propose anything truly novel. But if so, I think it will be for a different reason.

The thing is, creativity is not as magical as we make it out to be. Our ideas, scientific or artistic, don’t just come from the gods. They recombine existing ideas, shuffling them in ways more akin to randomness than miracle. They’re then filtered through experience, deep heuristics honed over careers. Some people are good at ideas, and some are bad at them. Having ideas takes work, and there are things people do to improve their ideas. Nothing about creativity suggests it should be impossible to mechanize.

However, a machine trained on text won’t necessarily know how to do any of that.

That’s because in science, we don’t write down our inspirations. By the time a result gets into a scientific paper or textbook, it’s polished and refined into a pure argument, cutting out most of the twists and turns that were an essential part of the creative process. Mathematics is even worse: most math papers don’t even mention the motivation behind the work, let alone the path taken to the paper.

This lack of documentation makes it hard for students, making success much more a function of having the right mentors to model good practices, rather than being able to pick them up from literature everyone can access. I suspect it makes it even harder for language models. And if today’s language model-based reasoning tools are bad at that crucial, human-seeming step, of coming up with the right idea at the right time? I think that has more to do with this lack of documentation than with the fact that they’re “stochastic parrots”.