
When Your Theory Is Already Dead

Occasionally, people try to give “even-handed” accounts of crackpot physics, covering, say, people who claim to have invented anti-gravity devices. These accounts don’t go so far as to say that the crackpots are right, and will freely point out plausible doubts about the experiments. But at the end of the day, they conclude that we still don’t really know the answer, and that perhaps the next experiment will go differently. More tests are needed.

For someone used to engineering, or to sciences without much theory behind them, this might sound pretty reasonable. Sure, any one test can be critiqued. But you can’t prove a negative: you can’t rule out a future test that might finally see the effect.

That’s all well and good…if you have no idea what you’re doing. But these people, just like anyone else who grapples with physics, aren’t just proposing experiments. They’re proposing theories: models of the world.

And once you’ve got a theory, you don’t just have to care about future experiments. You have to care about past experiments too. Some theories…are already dead.

The "You're already dead" scene from the anime North Star
Warning: this is a link to TVTropes, enter only if you have lots of time on your hands

To get a little more specific, let’s talk about antigravity proposals that use scalar fields.

Scalar fields seem to have some sort of mysticism attached to them in the antigravity crackpot community, but for physicists they’re just the simplest possible type of field, the most obvious thing anyone would have proposed once they were comfortable enough with the idea of fields in the first place. We know of one, the Higgs field, which gives rise to the Higgs boson.

We also know that if there are any more, they’re pretty subtle…and as a result, pretty useless.

We know this because of a wide variety of what are called “fifth-force experiments”, tests and astronomical observations looking for an undiscovered force that, like gravity, reaches out to long distances. Many of these experiments are quite general, the sort of thing that would pick up a broad range of scalar fields. And so far, none of them have seen anything.

That “so far” doesn’t mean “wait and see”, though. Each time physicists run a fifth-force experiment, they establish a limit. They say, “a fifth force cannot be like this”. It can’t be this strong, it can’t operate on these scales, it can’t obey this model. Each experiment doesn’t just say “no fifth force yet”, it says “no fifth force of this kind, at all”.
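To make that concrete: many of these searches are reported as limits on a Yukawa-style modification of Newtonian gravity, a standard parametrization in the fifth-force literature (nothing specific to any one proposal):

```latex
V(r) = -\frac{G\, m_1 m_2}{r}\left(1 + \alpha\, e^{-r/\lambda}\right)
```

Here α sets how strong the new force is compared to gravity, and λ sets how far it reaches. Each experiment carves a region out of the (α, λ) plane that no longer fits the data. A proposed scalar field either lands in what’s left, or it was dead before anyone wrote it down.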

When you write down a theory, if you’re not careful, you might find it has already been ruled out by one of these experiments. This happens to physicists all the time. Physicists want to use scalar fields to understand the expansion of the universe, and to think about dark matter. And frequently, a model one physicist proposes will be ruled out, not by new experiments, but by someone doing the math and realizing that the model is already contradicted by a pre-existing fifth-force experiment.

So can you prove a negative? Sort of.

If you never commit to a model, if you never propose an explanation, then you can never be disproven, you can always wait for the experiment of your dreams to come true. But if you have any model, any idea, any explanation at all, then your explanation will have implications. Those implications may kill your theory in a future experiment. Or, they may have already killed it.

Requests for an Ethnography of Cheating

What is AI doing to higher education? And what, if anything, should be done about it?

Chad Orzel at Counting Atoms had a post on this recently, tying the question to a broader point. There is a fundamental tension in universities between actual teaching and learning on one side, and credentials on the other. A student who just wants the piece of paper at the end has no reason not to cheat if they can get away with it, so the easier it becomes to get away with cheating (say, by using AI), the less meaningful the credential gets. Meanwhile, professors who want students to actually learn something are reduced to trying to “trick” these goal-oriented students into accidentally doing something that makes them fall in love with a subject, all while being required to police the credential side of things.

Social science, as Orzel admits and emphasizes, is hard. Any broad-strokes picture like this breaks down into details, and while Orzel talks through some of those details he and I are of course not social scientists.

Because of that, I’m not going to propose my own “theory” here. Instead, think of this post as a request.

I want to read an ethnography of cheating. Like other ethnographies, it should involve someone spending time in the culture in question (here, cheating students), talking to the people involved, and getting a feeling for what they believe and value. Ideally, it would be augmented with an attempt at quantitative data, like surveys, that estimate how representative the picture is.

I suspect that cheating students aren’t just trying to get a credential. Part of the reason is that I remember teaching pre-meds. In the US, students don’t directly study medicine as a Bachelor’s degree. Instead, they study other subjects as pre-medical students (“pre-meds”), and then apply to medical school, which grants a degree on the same level as a PhD. As part of their application, they include their score on a standardized test called the MCAT, which checks that they have the basic level of math and science that medical schools expect.

A pre-med in a physics class, then, has good reason to want to learn: the better they know their physics, the better they will do on the MCAT. If cheating were mostly about just trying to get a credential, pre-meds wouldn’t cheat.

I’m pretty sure they do cheat, though. I didn’t catch any cheaters back when I taught, but there were a lot of students who tried to push the rules, pre-meds and not.

Instead, I think there are a few other motivations involved. And in an ethnography of cheating, I’d love to see some attempt to estimate how prevalent they are:

  1. Temptation: Maybe students know that they shouldn’t cheat, in the same way they know they should go to the gym. They want to understand the material and learn in the same way people who exercise have physical goals. But the mind, and flesh, are weak. You have a rough week, you feel like you can’t handle the work right now. So you compensate. Some of the motivation here is still due to credentials: a student who shrugs and accepts that their breakup will result in failing a course is a student who might have to pay for an extra year of ultra-expensive US university education to get that credential. But I suspect there is a more fundamental motivation here, related to ego and easy self-deception. If you do the assignment, even if you cheat for part of it, you get to feel like you did it, while if you just turn in a blank page you have to accept the failure.
  2. Skepticism: Education isn’t worth much if it doesn’t actually work. Students may be skeptical that the things that professors are asking them to do actually help them learn what they want to learn, or that the things the professors want them to learn are actually the course’s most valuable content. A student who uses ChatGPT to write an essay might believe that they will never have to write something without ChatGPT in life, so why not use it now? Sometimes professors simply aren’t explicit about what an exercise is actually meant to teach (there have been a huge number of blog posts explaining that writing is meant to teach you to think, not to write), and sometimes professors are genuinely pretty bad at teaching, since there is little done to retain the good ones in most places. A student in this situation still has to be optimistic about some aspect of the education, at some time. But they may be disillusioned, or just interested in something very different.
  3. Internalized Expectations: Do employers actually care if you get a bad grade? Does it matter? By the time a student is in college, they’ve been spending half their waking hours in a school environment for over a decade. Maybe the need to get good grades is so thoroughly drilled in that the actual incentives don’t matter. If you think of yourself as the kind of person who doesn’t fail courses, and you start failing, what do you do?
  4. External Non-Credential Expectations: Don’t worry about the employers, worry about the parents. Some college students have the kind of parents who keep checking in on how they’re doing, who want to see evidence and progress the same way they did when they were kids. Any feedback, no matter how much it’s intended to teach, not to judge, might get twisted into a judgement. Better to avoid that judgement, right?
  5. Credentials, but for the Government, not Employers: Of course, for some students, failing really does wreck their life. If you’re on the kind of student visa that requires you to maintain your grades at a certain level, you’ve got a much stronger incentive to cheat, imposed for much less reason.

If you’re aware of a good ethnography of cheating, let me know! And if you’re a social scientist, consider studying this!

To Measure Something or to Test It

Black holes have been in the news a couple times recently.

On one end, there was the observation of an extremely large black hole in the early universe, when no black holes of that kind were expected to exist. My understanding is that this is very much a “big if true” kind of claim, something that could have dramatic implications but may just be being misunderstood. At the moment, I’m not going to try to work out which one it is.

In between, you have a piece by me in Quanta Magazine a couple weeks ago, about tests of whether black holes deviate from general relativity. They don’t, by the way, according to the tests so far.

And on the other end, you have the coverage last week of a “confirmation” (or even “proof”) of the black hole area law.

The black hole area law states that the total area of the event horizons of all black holes will always increase. It’s also known as the second law of black hole thermodynamics, paralleling the second law of thermodynamics that entropy always increases. Hawking proved this as a theorem in 1971, assuming that general relativity holds true.

(That leaves out quantum effects, which indeed can make black holes shrink, as Hawking himself famously later argued.)

The black hole area law is supposed to hold even when two black holes collide and merge. While the combination may lose energy (leading to gravitational waves that carry energy to us), it will still have greater area, in the end, than the sum of the areas of the black holes that combined to make it.
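As a rough illustration (my own toy numbers, not the event in the paper), here is what that looks like for non-spinning black holes, whose horizon area is A = 16π(GM/c²)²:

```python
# A quick numerical check with made-up but plausible masses (not the collision
# in the paper): for non-spinning black holes, horizon area is A = 16*pi*(G*M/c^2)^2.
import math

G = 6.674e-11        # gravitational constant, m^3 kg^-1 s^-2
c = 2.998e8          # speed of light, m/s
M_sun = 1.989e30     # solar mass, kg

def horizon_area(mass_in_suns):
    """Schwarzschild horizon area, in square meters."""
    M = mass_in_suns * M_sun
    return 16 * math.pi * (G * M / c**2) ** 2

before = horizon_area(30) + horizon_area(30)  # two 30-solar-mass black holes
after = horizon_area(57)                      # merged remnant, ~3 suns radiated away

print(f"total area before: {before:.3e} m^2")
print(f"area after merger: {after:.3e} m^2")
print("area increased:", after > before)      # True: 57**2 > 2 * 30**2
```

Even with a few percent of the total mass radiated away as gravitational waves, the final area wins comfortably, because area grows like the square of the mass.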

Ok, so that’s the area law. What’s this paper that’s supposed to “finally prove” it?

The LIGO, Virgo, and KAGRA collaborations recently published a paper based on gravitational waves from one particularly clear collision of black holes, which they measured back in January. They compared their measurements to predictions from general relativity, checking two things: whether the measurements agreed with predictions based on the Kerr metric (how space-time around a rotating black hole is supposed to behave), and whether they obeyed the area law.

The first check isn’t so different in purpose from the work I wrote about in Quanta Magazine, just using different methods. In both studies, physicists are looking for deviations from the laws of general relativity, triggered by the highly curved environments around black holes. These deviations could show up in one way or another in any black hole collision, so while you would ideally look for them by scanning over many collisions (as the paper I reported on did), you could do a meaningful test even with just one collision. That kind of a check may not be very strenuous (if general relativity is wrong, it’s likely by a very small amount), but it’s still an opportunity, diligently sought, to be proven wrong.

The second check is the one that got the headlines. It also got first billing in the paper title, and a decent amount of verbiage in the paper itself. And if you think about it for more than five minutes, it doesn’t make a ton of sense as presented.

Suppose the black hole area law is wrong, and sometimes black holes lose area when they collide. Even if this happened sometimes, you wouldn’t expect it to happen every time. It’s not like anyone is pondering a reverse black hole area law, where black holes only shrink!

Because of that, I think it’s better to say that LIGO measured the black hole area law for this collision, while they tested whether black holes obey the Kerr metric. In one case, they’re just observing what happened in this one situation. In the other, they can try to draw implications for other collisions.

That doesn’t mean their work wasn’t impressive, but it was impressive for reasons that don’t seem to be getting emphasized. It’s impressive because, prior to this paper, they had not managed to measure the areas of colliding black holes well enough to confirm that they obeyed the area law! The previous collisions looked like they obeyed the law, but when you factor in the experimental error they couldn’t say it with confidence. The current measurement is better, and can. So the new measurement is interesting not because it confirms a fundamental law of the universe or anything like that…it’s interesting because previous measurements were so bad, that they couldn’t even confirm this kind of fundamental law!

That, incidentally, feels like a “missing mood” in pop science. Some things are impressive not because of their amazing scale or awesome implications, but because they are unexpectedly, unintuitively, really really hard to do. These measurements shouldn’t be thought of, or billed, as tests of nature’s fundamental laws. Instead they’re interesting because they highlight what we’re capable of, and what we still need to accomplish.

What You’re Actually Scared of in Impostor Syndrome

Academics tend to face a lot of impostor syndrome. Something about a job with no clear criteria for success, where you could always in principle do better and you mostly only see the cleaned-up, idealized version of others’ work, is a recipe for driving people utterly insane with fear.

The way most of us talk about that fear, it can seem like a cognitive bias, like a failure of epistemology. “Competent people think they’re less competent than they are,” the less-discussed half of the Dunning-Kruger effect.

(I’ve talked about it that way before. And, in an impostor-syndrome-inducing turn of events, I got quoted in a news piece in Nature about it.)

There’s something missing in that perspective, though. It doesn’t really get across how impostor syndrome feels. There’s something very raw about it, something that feels much more personal and urgent than an ordinary biased self-assessment.

To get at the core of it, let me ask a question: what happens to impostors?

The simple answer, the part everyone will admit to, is to say they stop getting grants, or stop getting jobs. Someone figures out they can’t do what they claim, and stops choosing them to receive limited resources. Pretty much anyone with impostor syndrome will say that they fear this: the moment that they reach too far, and the world decides they aren’t worth the money after all.

In practice, it’s not even clear that that happens. You might have people in your field who are actually thought of as impostors, on some level. People who get snarked about behind their back, people who make everyone roll their eyes when they ask a question at a conference and the question just never ends. People who are thought of as shiny storytellers without substance, who spin a tale for journalists but aren’t accomplishing anything of note. Those people…aren’t facing consequences at all, really! They keep getting the grants, they keep finding the jobs, and the ranks of people leaving for industry are instead mostly filled with those you respect.

Instead, I think what we fear when we feel impostor syndrome isn’t the obvious consequence, or even the real consequence, but something more primal. Primatologists and psychologists talk about our social brain, and the role of ostracism. They talk about baboons who piss off the alpha and get beat up and cast out of the group, how a social animal on their own risks starvation and becomes easy prey for bigger predators.

I think when we wake up in a cold sweat remembering how we had no idea what that talk was about, and were too afraid to ask, it’s a fear on that level that’s echoing around in our heads. That the grinding jags of adrenaline, the run-away-and-hide feeling of never being good enough, the desperate unsteadiness of trying to sound competent when you’re sure that you’re not and will get discovered at any moment…that’s not based on any realistic fears about what would happen if you got caught. That’s your monkey-brain, telling you a story drilled down deep by evolution.

Does that help? I’m not sure. If you manage to tell your inner monkey that it won’t get eaten by a lion if its friends stop liking it, let me know!

The Rocks in the Ground Era of Fundamental Physics

It’s no secret that the early twentieth century was a great time to make progress in fundamental physics. On one level, it was an era when huge swaths of our understanding of the world were being rewritten, with relativity and quantum mechanics just being explored. It was a time when a bright student could guide the emergence of whole new branches of scholarship, and recently discovered physical laws could influence world events on a massive scale.

Put that way, it sounds like it was a time of low-hanging fruit, the early days of a field when great strides can be made before the easy problems are all solved and only the hard ones are left. And that’s part of it, certainly: the fields that sprang from that era have gotten more complex and challenging over time, requiring more specialized knowledge to make any kind of progress. But there is also a physical reason why physicists had such an enormous impact back then.

The early twentieth century was the last time that you could dig up a rock out of the ground, do some chemistry, and end up with a discovery about the fundamental laws of physics.

When scientists like Curie and Becquerel were working with uranium, they didn’t yet understand the nature of atoms. The distinctions between elements were described in qualitative terms, but were only just beginning to be physically understood. That meant that a weird object in nature, “a weird rock”, could do quite a lot of interesting things.

And once you find a rock that does something physically unexpected, you can scale up. From the chemistry experiments of a single scientist’s lab, countries can build industrial processes to multiply the effect. Nuclear power and the bomb were such radical changes because they represented the end effect of understanding the nature of atoms, and atoms are something people could build factories to manipulate.

Scientists went on to push that understanding further. They wanted to know what the smallest pieces of matter were composed of, to learn the laws behind the most fundamental laws they knew. And with relativity and quantum mechanics, they could begin to do so systematically.

US particle physics has a nice bit of branding. They talk about three frontiers: the Energy Frontier, the Intensity Frontier, and the Cosmic Frontier.

Some things we can’t yet test in physics are gated by energy. If we haven’t discovered a particle, it may be because it’s unstable, decaying quickly into lighter particles so we can’t observe it in everyday life. If these particles interact appreciably with particles of everyday matter like protons and electrons, then we can try to make them in particle colliders. These end up creating pretty much everything up to a certain mass, due to a combination of the tendency in quantum mechanics for everything that can happen to happen, and relativity’s E=mc^2. In the mid-20th century these particle colliders were serious pieces of machinery, but still small enough to industrialize: now, there are so-called medical accelerators in many hospitals based on their designs. But current particle accelerators are a different beast, massive facilities built by international collaborations. This is the Energy Frontier.
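That “certain mass” is set, roughly, by the textbook relation between energy and mass (in practice the reach is lower, since colliding protons split their energy among their constituent quarks and gluons):

```latex
m_{\text{new}} \lesssim \frac{\sqrt{s}}{c^{2}}
```

where √s is the collider’s center-of-mass energy: a machine colliding particles at several TeV can, at best, produce new particles up to a few TeV in mass.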

Some things in physics are gated by how rare they are. Some particles interact only very faintly with other particles, so to detect them, physicists have to scan a huge chunk of matter, a giant tank of argon or a kilometer of Antarctic ice, looking for deviations from the norm. Over time, these experiments have gotten bigger, looking for more and more subtle effects. A few weird ones still fit on tabletops, but only because they have the tools to measure incredibly small variations. Most are gigantic. This is the Intensity Frontier.

Finally, the Cosmic Frontier looks for the unknown behind both kinds of gates, using the wider universe to look at events of extremely high energy or enormous size.

Pushing these frontiers has meant cleaning up our understanding of the fundamental laws of physics up to these frontiers. It means that whatever is still hiding, it either requires huge amounts of energy to produce, or is an extremely rare, subtle effect.

That means that you shouldn’t expect another nuclear bomb out of fundamental physics. Physics experiments are already working on vast scales, to the extent that a secret government project would have to be smaller than publicly known experiments, in physical size, energy use, and budget. And you shouldn’t expect another nuclear power plant, either: we’ve long passed the kinds of things you could devise a clever industrial process to take advantage of at scale.

Instead, new fundamental physics will only be directly useful once we’re the kind of civilization that operates on a much greater scale than we do today. That means larger than the solar system: there wouldn’t be much advantage, at this point, to putting a particle physics experiment on the edge of the Sun. It means the kind of civilization that tosses galaxies around.

It means that right now, you won’t see militaries or companies pushing the frontiers of fundamental physics, unlike the way they might have wanted to at the dawn of the twentieth century. By the time fundamental physics is useful in that way, all of these actors will likely be radically different: companies, governments, and in all likelihood human beings themselves. Instead, supporting fundamental physics right now is an act of philanthropy, maintaining a practice because it maintains good habits of thought and produces powerful ideas, the same reasons organizations support mathematics or poetry. That’s not nothing, and fundamental physics is still often affordable as philanthropy goes. But it’s not changing the world, not the way physicists did in the early twentieth century.

Two Types of Scientific Fraud: for a Fee and for Power

A paper about scientific fraud has been making the rounds in social media lately. The authors gather evidence of large-scale networks of fraudsters across multiple fields, from teams of editors that fast-track fraudulent research to businesses that take over journals, sell spots for articles, and then move on to a new target when the journal is de-indexed. I’m not an expert in this kind of statistical sleuthing, but the work looks impressively thorough.

Still, I think the authors overplay their results a bit. They describe themselves as revealing something many scientists underestimate. They point to what they label as misconceptions: that scientific fraud is usually perpetrated alone by individual unethical scientists, or that it is almost entirely a problem of the developing world, and present their work as disproving those misconceptions. Listen to them, and you might get the feeling that science is rife with corruption, that no result, or scientist, can be trusted.

As far as I can tell, though, those “misconceptions” they identify are true. Someone who believes that scientific fraud is perpetrated by loners is probably right, as is someone who believes it largely takes place outside of the first world.

As is often the case, the problem is words.

“Scientific Fraud” is a single term for two different things. Both involve bad actors twisting scientific activity. But in everything else — their incentives, their geography, their scale, and their consequences — they are dramatically different.

One of the types of scientific fraud is largely about power.

In references 84-89 of the paper, the authors give examples of large-scale scientific fraud in Europe and the US. All (except one, which I’ll mention later) are about the career of a single researcher. Each of these people systematically bent the truth, whether with dodgy statistics, doctored images, or inflating citation counts. Some seemed motivated to promote a particular scientific argument, cutting corners to push a particular conclusion through. Others were purer cases of self-promotion. These people often put pressure on students, postdocs, and other junior researchers in their orbits, which increases the scale of their impact. In some cases, their work rippled out to convince other researchers, prolonging bad ideas and strangling good ones. These were people with power, who leveraged that power to increase their power.

There also don’t appear to be that many of them. These people are loners in a meaningful sense, cores of fraud working on their own behalf. They don’t form networks with each other, for the most part: because they work towards their own aggrandizement, they have no reason to trust anyone else doing the same. I have yet to see evidence that the number of these people is increasing. They exist, they’re a problem, they’re important to watch out for. But they’re not a crisis, and they shouldn’t shift your default expectations of science.

The other, quite different, type of scientific fraud is fraud for a fee.

The cases this paper investigates seem to fall into this category. They are businesses, offering the raw material of academic credit (papers, co-authorship, citations, publication) for cash. They’re paper mills, of various sorts. These are, at least from an academic perspective, large organizations, with hundreds or thousands of customers and tens of suborned editors or scientists farming out their credibility. As the authors of this paper argue, fraudsters of this type are churning out more and more papers, potentially now fueled by AI, adding up to a still small, but non-negligible, proportion of scientific papers in total.

Compared to the first type of fraud, though, buying credit in this way doesn’t give very much power. As the paper describes, many of the papers churned out by paper mills don’t even go into relevant journals: for example, they mention “an article about roasting hazelnuts in a journal about HIV/AIDS care”. An article like that isn’t going to mislead the hazelnut roasting community, or the HIV/AIDS community. Indeed, that would be counter to its purpose. The paper isn’t intended to be read at all, and ideally gets ignored: it’s just supposed to inflate a number.

These numbers are most relevant in the developing world, and when push comes to shove, almost all of the buyers of these services identified by the authors of this paper come from there. In many developing countries, a combination of low trust and advice from economists leads to explicit point systems, where academics are paid or hired explicitly based on criteria like where and how often they publish or how they are cited. The more a country can trust people to vouch for each other without corruption, the less these kinds of incentives have purchase. Outside of the developing world, paper mills and the like generally seem to involve a much smaller number of people, typically as sellers, not buyers: selling first-world credibility in exchange for fees from many developing-world applicants.

(The one reference I mentioned above is an interesting example of this: a system built out of points and low trust to recruit doctors from the developing world to the US, gamed by a small number of co-authorship brokers.)

This kind of fraud doesn’t influence science directly. Its perpetrators aren’t trying to get noticed, but to keep up a cushy scam. You don’t hear their conclusions in the press, other scientists don’t see their work. Instead, they siphon off resources: cannibalizing journals, flooding editors with mass-produced crap, and filling positions and slurping up science budgets in the countries that can least afford them. As they publish more and more, they shouldn’t affect your expectations of the credibility of science: any science you hear about will be either genuine, or fraud from the other category. But they do make the science you hear about harder and harder to do.

(The authors point out one exception: what about AI? If a company trains a large language model on the current internet, will its context windows be long enough to tell that that supposedly legitimate paper about hazelnuts is in an HIV/AIDS journal? If something gets said often enough, copied again and again in papers sold by a mill, will an AI trained on all these papers be convinced? Presumably, someone is being paid good money to figure out how to filter AI-generated slop from training data: can they filter paper mill fraud as well?)

It’s a shame that we have one term, scientific fraud, to deal with these two very different things. But it’s important to keep in mind that they are different. Fraud for power and fraud for money can have very different profiles, and offer very different risks. If you don’t trust a scientific result, it’s worth understanding what might be at play.

Some Dumb AI Ideas

Sometimes, when I write a post about AI, I’ve been sitting on an idea for a long time. I’ve talked to experts, I’ve tried to understand the math, I’ve honed my points and cleared away clutter.

This is not one of those times. The ideas in this post almost certainly have something deeply wrong with them. But hopefully they’re interesting food for thought.

My first dumb idea: instruction tuning was a mistake.

I’m drawing the seeds of this one from a tumblr post by nostalgebraist, someone known for making a popular bot trained on his tumblr posts in the early days before GPT became ChatGPT.

AIs like ChatGPT are based on Large Language Models, insanely complicated mathematical formulas that predict, given part of a text, what the rest of that text is likely to look like. In the early days, this was largely how they were used. Loosely described, nostalgebraist’s bot, called nostalgebraist-autoresponder, started from a list of tumblr posts and asks and determined what additional posts would best fit in.

If you think about it, though, ChatGPT doesn’t really work like that. ChatGPT has conversations: you send it messages, it sends you responses. The text it creates is a dialogue, with you supplying half the input. But most texts aren’t dialogues, and ChatGPT draws on a lot of non-dialogue texts to make its dialogue-like responses.

The reason it does this is something called instruction tuning. ChatGPT has been intentionally biased, not to give the most likely completion to a task in general, but to give completions that fit this dialogue genre. What I didn’t know until I read nostalgebraist’s post was that this genre was defined artificially: AI researchers made up fake dialogues with AI, cheesy sci-fi conversations imagining how an AI might respond to instructions from a user, and then biased the Large Language Model so that rather than giving the most likely text in general, it gives a text that is more likely to look like these cheesy sci-fi conversations. It’s why ChatGPT sounds kind of like a fictional robot: not because sci-fi writers accurately predicted what AI would sound like, but because AI was created based on sci-fi texts.

For nostalgebraist, this leads into an interesting reflection on how a sci-fi AI should behave: how being warped around a made-up genre without history or depth creates characters which act according to simple narratives and express surprising anxiety.

For myself, though, I can’t help but wonder if the goal of dialogue itself is the problem. Dialogue is clearly important commercially: people use ChatGPT because they can chat with it. But Large Language Models aren’t inherently chatbots: they produce plausible texts, of any sort you could imagine. People seem to want a machine that can, for example, answer scientific questions as part of a conversation. But most competent answers to scientific questions aren’t conversations, they’re papers. If people stuck with the “raw” model, producing excerpts of nonexistent papers rather than imitating a dialogue with a non-existent expert, wouldn’t you expect the answers to be more accurate, with the model no longer biased by an irrelevant goal? Is the need to make a sell-able chatbot making these AIs worse at everything else people are trying to use them for?

I’m imagining a world where, instead of a chatbot, OpenAI built an “alternate universe simulator”. You give it some context, some texts or parts of texts from a universe you made up, and it completes them in a plausible way. By imagining different universes, you can use it to answer different questions. Such a gimmick would get fewer customers and fewer investors; it would probably do worse as a business. But I have to wonder if the actual technology might have been more useful.
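You can still play with something like this today. Here is a minimal sketch, assuming a small open base model as a stand-in (gpt2, chosen only because it is tiny), with no instruction tuning and no chat format: you write the opening of the document you wish existed, and the model continues it.

```python
# A minimal sketch, assuming a small open base model (gpt2, purely as a stand-in):
# no instruction tuning, no dialogue format, just "continue this text".
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Instead of asking a question, write the opening of the paper you wish existed,
# and let the model continue it.
prompt = ("Abstract: We present a complete calculation of the effect described "
          "above, and find that")
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=60, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

A model this small will produce nonsense, but the workflow is the point: you steer by choosing the universe the text lives in, not by chatting.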

My second idea is dumber, to the point where I mostly know why it doesn’t work. But thinking about it might help clarify how things work for people unused to AI.

I saw someone point out that, unlike something like Wikipedia, AI doesn’t give you context. You shouldn’t trust Wikipedia, or a source you find on Google, blindly. If you want to, you can look through the edit history on Wikipedia, or figure out who wrote a page you found on Google and how. If ChatGPT tells you something, by default you don’t know where that knowledge came from. You can tell it to search, and then you’ll get links, but that’s because it’s using Google or the like behind the scenes anyway. You don’t know where the model is getting its ideas.

Why couldn’t we get that context, though?

Every text produced by a Large Language Model is causally dependent on its training data. Different data, different model, different text. That doesn’t mean that each text draws from one source, or just a few sources: ChatGPT isn’t copying the training data, at least not so literally.

But it does mean that, if ChatGPT says something is true, you should in principle be able to ask which data was most important in making it say that. If you leave a piece of data out of the training, and get similar answers, you can infer that the response you got doesn’t have much to do with that piece of data. But if you leave out a text in training, and now ChatGPT gives totally different responses to the same question…then there’s a pretty meaningful sense that it got the information from that source.

If this were the type of non-AI statistical model people use in physics, this would be straightforward. Researchers do this all the time: take one experiment out of the data, see how their analysis changes, and thereby figure out which experiments are most important to check. One can even sometimes calculate, given a model, where you should look.
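Here is a toy version of that workflow (my own sketch, for a simple fit, nothing to do with LLMs): drop each data point in turn, refit, and see how much the answer moves.

```python
# Leave-one-out influence for a simple statistical model: fit a line, then refit
# with each point removed and see how much the slope shifts. Cheap here; for an
# LLM, each "refit" would be a full training run.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 20)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)

full_slope = np.polyfit(x, y, 1)[0]
for i in range(x.size):
    loo_slope = np.polyfit(np.delete(x, i), np.delete(y, i), 1)[0]
    print(f"without point {i:2d}: slope shifts by {loo_slope - full_slope:+.4f}")
```

For twenty points and a straight line, each refit takes a fraction of a second.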

Unfortunately, you can’t do this with ChatGPT. The model is just too big. You can’t calculate anything explicitly about it: the giant mathematical formulas behind it are so complicated that the most you can do is get probabilities out case by case; you can’t “unwind” them and see where the numbers come from. And you can’t just take out sources one by one and train the model again: not when training takes months of expensive computer time.

So unlike with the previous idea, I understand even on a technical level why you can’t do this. But it helped me to be able to think about what I would like to do, if it were possible. Maybe it helps you too!

Technology as Evidence

How much can you trust general relativity?

On the one hand, you can read through a lovely Wikipedia article full of tests, explaining just how far and how precisely scientists have pushed their knowledge of space and time. On the other hand, you can trust GPS satellites.

As many of you may know, GPS wouldn’t work if we didn’t know about general relativity. In order for the GPS in your phone to know where you are, it has to compare signals from different satellites, each giving the location and time the signal was sent. To get an accurate result, the times measured on those satellites have to be adjusted: because of the lighter gravity they experience, time moves more quickly for them than for us down on Earth.
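The size of the correction is a standard back-of-the-envelope calculation (textbook values, nothing specific to any one satellite): the satellite clock gains roughly 46 microseconds a day from sitting higher in Earth’s gravity, loses about 7 from its orbital speed, and ends up running about 38 microseconds a day fast overall.

```python
# A back-of-the-envelope sketch with standard textbook values: how much a GPS
# satellite clock drifts per day relative to a clock on the ground.
import math

GM = 3.986e14        # Earth's gravitational parameter, m^3 / s^2
c = 2.998e8          # speed of light, m / s
R_earth = 6.371e6    # Earth's radius, m
r_sat = 2.656e7      # GPS orbital radius, m
day = 86400.0        # seconds in a day

# Gravitational effect: weaker gravity at altitude makes the satellite clock run faster.
grav = GM / c**2 * (1 / R_earth - 1 / r_sat)

# Special-relativistic effect: orbital speed makes the satellite clock run slower.
v = math.sqrt(GM / r_sat)
vel = -v**2 / (2 * c**2)

print(f"gravity:  {grav * day * 1e6:+.1f} microseconds per day")          # about +46
print(f"velocity: {vel * day * 1e6:+.1f} microseconds per day")           # about -7
print(f"net:      {(grav + vel) * day * 1e6:+.1f} microseconds per day")  # about +38
```

Thirty-eight microseconds a day sounds tiny, but multiply by the speed of light and it is kilometers of accumulated position error, which is why the correction has to be built in.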

In a sense, general relativity gets tested every minute of every day, on every phone in the world. That’s pretty trustworthy! Any time that science is used in technology, it gets tested in this way. The ideas we can use are ideas that have shown they can perform, ideas which do what we expect again and again and again.

In another sense, though, GPS is a pretty bad test of general relativity. It tests one of general relativity’s simplest consequences, based on the Schwarzschild metric for how gravity behaves near a large massive object, and not to an incredibly high degree of precision. Gravity could still violate general relativity in a huge number of other ways, and GPS would still function. That’s why the other tests are valuable: if you want to be sure general relativity doesn’t break down, you need to test it under conditions that GPS doesn’t cover, and to higher precision.

Once you know to look for it, these layers of tests come up everywhere. You might see the occasional article talking about tests of quantum gravity. The tests they describe are very specific, testing a very general and basic question: does quantum mechanics make sense at all in a gravitational world? In contrast, most scientists who research quantum gravity don’t find that question very interesting: if gravity breaks quantum mechanics in a way those experiments could test, it’s hard to imagine it not leading to a huge suite of paradoxes. Instead, quantum gravity researchers tend to be interested in deeper problems with quantum gravity, distinctions between theories that don’t dramatically break with our existing ideas, but that because of that are much harder to test.

The easiest tests are important, especially when they come from technology: they tell us, on a basic level, what we can trust. But we need the hard tests too, because those are the tests that are most likely to reveal something new, and bring us to a new level of understanding.

Newsworthiness Bias

I had a chat about journalism recently, and I had a realization about just how weird science journalism, in particular, is.

Journalists aren’t supposed to be cheerleaders. Journalism and PR have very different goals (which is why I keep those sides of my work separate). A journalist is supposed to be uncompromising, to write the truth even if it paints the source in a bad light.

Norms are built around this. Serious journalistic outlets usually don’t let sources see pieces before they’re published. The source doesn’t have the final say in how they’re portrayed: the journalist reserves the right to surprise them if justified. Investigative journalists can be superstars, digging up damning secrets about the powerful.

When a journalist starts a project, the piece might turn out positive, or negative. A politician might be the best path forward, or a disingenuous grifter. A business might be a great investment opportunity, or a total scam. A popular piece of art might be a triumph, or a disappointment.

And a scientific result?

It might be a fraud, of course. Scientific fraud does exist, and is a real problem. But it’s not common, really. Pick a random scientific paper, filter by papers you might consider reporting on in the first place, and you’re very unlikely to find a fraudulent result. Science journalists occasionally report on spectacularly audacious scientific frauds, or frauds in papers that have already made the headlines. But you don’t expect fraud in the average paper you cover.

It might be scientifically misguided: flawed statistics, a gap in a proof, a misuse of concepts. Journalists aren’t usually equipped to ferret out these issues, though. Instead, this is handled in principle by peer review, and in practice by the scientific community outside of the peer review process.

Instead, for a scientific result, the most common negative judgement isn’t that it’s a lie, or a mistake. It’s that it’s boring.

And certainly, a good science journalist can judge a paper as boring. But there is a key difference between doing that, and judging a politician as crooked or a popular work of art as mediocre. You can write an article about the lying candidate for governor, or the letdown Tarantino movie. But if a scientific result is boring, and nobody else has covered it…then it isn’t newsworthy.

In science, people don’t usually publish their failures, their negative results, their ho-hum obvious conclusions. That fills the literature with only the successes, a phenomenon called publication bias. It also means, though, that scientists try to make their results sound more successful, more important and interesting, than they actually are. Some of the folks fighting the replication crisis have coined a term for this: they call it importance hacking.

The same incentives apply to journalists, especially freelancers. Starting out, it was far from clear that I could make enough to live on. I felt like I had to make every lead count, to find a newsworthy angle on every story idea I could find, because who knew when I would find another one? Over time, I learned to balance that pull better. Now that I’m making most of my income from consulting instead, the pressure has eased almost entirely: there are things I’m tempted to importance-hack for the sake of friends, but nothing that I need to importance-hack to stay in the black.

Doing journalism on the side may be good for me personally at the moment, but it’s not really a model. Much like we need career scientists, even if their work is sometimes boring, we need career journalists, even if they’re sometimes pressured to overhype.

So if we don’t want to incentivize science journalists to be science cheerleaders, what can we do instead?

In science, one way to address publication bias is with pre-registered studies. A scientist sets out what they plan to test, and a journal agrees to publish the result, no matter what it is. You could imagine something like this for science journalism. I once proposed a recurring column where every month I would cover a random paper from arXiv.org, explaining what it meant to accomplish. I get why the idea was turned down, but I still think about it.

In journalism, the arts offer the closest parallel with a different approach. There are many negative reviews of books, movies, and music, and most of them merely accuse the art of being boring, not evil. These exist because they focus on popular works that people pay attention to anyway, so that any negative coverage has someone to convince. You could imagine applying this model to science, though it could be a bit silly. I’m envisioning a journalist who writes an article every time Witten publishes, rating some papers impressive and others disappointing, the same way a music journalist might cover every Taylor Swift album.

Neither of these models are really satisfactory. You could imagine an even more adversarial model, where journalists run around accusing random scientists of wasting the government’s money, but that seems dramatically worse.

So I’m not sure. Science is weird, and hard to accurately value: if we knew how much something mattered already, it would be engineering, not science. Journalism is weird: it’s public-facing research, where the public facing is the whole point. Their combination? Even weirder.

Microdosing Vibe Physics

Have you heard of “vibe physics”?

The phrase “vibe coding” came first. People have been using large language models like ChatGPT to write computer code (and not the way I did last year). They chat with the model, describing what they want to do and asking the model to code it up. You can guess the arguments around this, from people who are convinced AI is already better than a human programmer to people sure the code will be riddled with errors and vulnerabilities.

Now, there are people claiming not only to do vibe coding, but vibe physics: doing theoretical physics by chatting with an AI.

I think we can all agree that’s a lot less plausible. Some of the people who do vibe coding actually know how to code, but I haven’t seen anyone claiming to do vibe physics who actually understands physics. They’re tech entrepreneurs in the most prominent cases, random people on the internet otherwise. And while a lot of computer code is a minor tweak on something someone has already done, theoretical physics doesn’t work that way: if someone has already come up with your idea, you’re an educator, not a physicist.

Still, I think there is something to keep in mind about the idea of “vibe physics”, related to where physics comes from.

Here’s a question to start with: go back a bit before the current chat-bot boom. There were a ton of other computational and mathematical tools. Theorem-proving software could encode almost arbitrary mathematical statements in computer code and guarantee their accuracy. Statistical concepts like Bayes’ rule described how to reason from evidence to conclusions, not flawlessly but as well as anyone reliably can. We had computer simulations for a wealth of physical phenomena, and approximation schemes for many others.

With all those tools, why did we still have human physicists?

That is, go back before ChatGPT, before large language models. Why not just code up a program that starts with the evidence and checks which mathematical model fits it best?

In principle, I think you really could have done that. But you could never run that program. It would take too long.

Doing science 100% correctly and reliably is agonizingly slow, and prohibitively expensive. You cannot check every possible model, nor can you check those models against all the available data. You must simplify your problem, somehow, even if it makes your work less reliable, and sometimes incorrect.
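To see why, picture even a cartoon version of that program (my own toy, not a serious proposal): score a small family of candidate models against some data with a crude information criterion, and keep the best one.

```python
# A cartoon of "check every model against the evidence": compare polynomial fits
# of different degrees using the Bayesian information criterion (BIC).
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(3.0 * x) + rng.normal(scale=0.05, size=x.size)

def bic(degree):
    """BIC for a polynomial fit of the given degree (lower is better)."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    n, k = x.size, degree + 1
    return n * np.log(np.mean(residuals**2)) + k * np.log(n)

scores = {degree: bic(degree) for degree in range(6)}
print({degree: round(score, 1) for degree, score in scores.items()})
print("preferred degree:", min(scores, key=scores.get))
```

Six polynomials against thirty data points is instant. “All possible models” against “all the evidence” is not a bigger version of this loop; it is a loop that never finishes, which is why something has to prune the candidates first.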

And for most of history, humans have provided that simplification.

A physicist isn’t going to consider every possible model. They’re going to consider models that are similar to models they studied, or similar to models others propose. They aren’t going to consider all the evidence. They’ll look at some of the evidence, the evidence other physicists are talking about and puzzled by. They won’t simulate the consequences of their hypotheses in exhaustive detail. Instead, they’ll guess, based on their own experience, a calculation that captures what they expect to be relevant.

Human physicists provided the unreliable part of physics, the heuristics. The “vibe physics”, if you will.

AI is also unreliable, also heuristic. But humans still do this better than AI.

Part of the difference is specificity. These AIs are trained on all of human language, and then perhaps fine-tuned on a general class of problems. A human expert has spent their life fine-tuning on one specific type of problem, and their intuitions, their heuristics, their lazy associations and vibes, all will be especially well-suited to problems of that type.

Another part of the difference, though, is scale.

When you talk to ChatGPT, it follows its vibes into paragraphs of text. If you turn on reasoning features, you make it check its work in the background, but it still is generating words upon words inside, evaluating those words, then generating more.

I suspect, for a physicist, the “control loop” is much tighter. Many potential ideas get ruled out a few words in. Many aren’t even expressed in words at all, just concepts. A human physicist is ultimately driven by vibes, but they check and verify those vibes, based on their experience, at a much higher frequency than any current AI system can achieve.

(I know almost nothing about neuroscience. I’m just basing this on what it can feel like, to grope through a sentence and have it assemble itself as it goes into something correct, rather than having to go back and edit it.)

As companies get access to bigger datacenters, I suspect they’ll try to make this loop tighter, to get AI to do something closer to what (I suspect) humans do. And then maybe AI will be able to do vibe physics.

Even then, though, you should not do vibe physics with the AI.

If you look at the way people describe doing vibe physics, they’re not using the AI for the vibes. They’re providing the vibes, and the AI is supposed to check things.

And that, I can confidently say, is completely ass-backwards. The AI is a vibe machine, it is great at vibes. Substituting your vibes will just make it worse. On the other hand, the AI is awful at checking things. It can find published papers sometimes, which can help you check something. But it is not set up to do the math, at least not unless the math can be phrased as a simple Python script or an IMO problem. In order to do anything like that, it has to call another type of software to verify. And you could have just used that software.

Theoretical physics is still not something everyone can do. Proposing a crackpot theory based on a few papers you found on Google and a couple of YouTube videos may leave you feeling less confident than proposing one based on praise from ChatGPT and a list of papers it claims have something to do with your idea, which makes the latter more tempting. But it’s still proposing a crackpot theory. If you want to get involved, there’s still no substitute for actually learning how physics works.