Monthly Archives: August 2025

Two Types of Scientific Fraud: for a Fee and for Power

A paper about scientific fraud has been making the rounds in social media lately. The authors gather evidence of large-scale networks of fraudsters across multiple fields, from teams of editors that fast-track fraudulent research to businesses that take over journals, sell spots for articles, and then move on to a new target when the journal is de-indexed. I’m not an expert in this kind of statistical sleuthing, but the work looks impressively thorough.

Still, I think the authors overplay their results a bit. They describe themselves as revealing something many scientists underestimate. They point to what they label as misconceptions: that scientific fraud is usually perpetrated alone by individual unethical scientists, or that it is almost entirely a problem of the developing world, and present their work as disproving those misconceptions. Listen to them, and you might get the feeling that science is rife with corruption, that no result, or scientist, can be trusted.

As far as I can tell, though, those “misconceptions” they identify are true. Someone who believes that scientific fraud is perpetrated by loners is probably right, as is someone who believes it largely takes place outside of the first world.

As is often the case, the problem is words.

“Scientific Fraud” is a single term for two different things. The two both involve bad actors twisting scientific activity. But in everything else — their incentives, their geography, their scale, and their consequences — they are dramatically different.

One of the types of scientific fraud is largely about power.

In references 84-89 of the paper, the authors give examples of large-scale scientific fraud in Europe and the US. All (except one, which I’ll mention later) are about the career of a single researcher. Each of these people systematically bent the truth, whether with dodgy statistics, doctored images, or inflated citation counts. Some seemed motivated to promote a particular scientific argument, cutting corners to push a particular conclusion through. Others were purer cases of self-promotion. These people often put pressure on students, postdocs, and other junior researchers in their orbits, which increases the scale of their impact. In some cases, their work rippled out to convince other researchers, prolonging bad ideas and strangling good ones. These were people with power, who leveraged that power to increase their power.

There also don’t appear to be that many of them. These people are loners in a meaningful sense, cores of fraud working on their own behalf. They don’t form networks with each other, for the most part: because they work towards their own aggrandizement, they have no reason to trust anyone else doing the same. I have yet to see evidence that the number of these people is increasing. They exist, they’re a problem, they’re important to watch out for. But they’re not a crisis, and they shouldn’t shift your default expectations of science.

The other, quite different, type of scientific fraud is fraud for a fee.

The cases this paper investigates seem to fall into this category. They are businesses, offering the raw material of academic credit (papers, co-authorship, citations, publication) for cash. They’re paper mills, of various sorts. These are, at least from an academic perspective, large organizations, with hundreds or thousands of customers and tens of suborned editors or scientists farming out their credibility. As the authors of this paper argue, fraudsters of this type are churning out more and more papers, potentially now fueled by AI, adding up to a still small, but non-negligible, proportion of scientific papers in total.

Compared to the first type of fraud, though, buying credit in this way doesn’t give very much power. As the paper describes, many of the papers churned out by paper mills don’t even go into relevant journals: for example, they mention “an article about roasting hazelnuts in a journal about HIV/AIDS care”. An article like that isn’t going to mislead the hazelnut roasting community, or the HIV/AIDS community. Indeed, that would be counter to its purpose. The paper isn’t intended to be read at all, and ideally gets ignored: it’s just supposed to inflate a number.

These numbers are most relevant in the developing world, and when push comes to shove, almost all of the buyers of these services identified by the authors of this paper come from there. In many developing countries, a combination of low trust and advice from economists leads to explicit point systems, where academics are paid or hired explicitly based on criteria like where and how often they publish or how they are cited. The more a country can trust people to vouch for each other without corruption, the less these kinds of incentives have purchase. Outside of the developing world, involvement in paper mills and the like generally seems to involve a much smaller number of people, and typically as sellers, not buyers: selling first-world credibility in exchange for fees from many developing-world applicants.

(The one reference I mentioned above is an interesting example of this: a system built out of points and low trust to recruit doctors from the developing world to the US, gamed by a small number of co-authorship brokers.)

This kind of fraud doesn’t influence science directly. Its perpetrators aren’t trying to get noticed, but to keep up a cushy scam. You don’t hear their conclusions in the press, other scientists don’t see their work. Instead, they siphon off resources: cannibalizing journals, flooding editors with mass-produced crap, and filling positions and slurping up science budgets in the countries that can least afford them. As they publish more and more, they shouldn’t affect your expectations of the credibility of science: any science you hear about will be either genuine, or fraud from the other category. But they do make the science you hear about harder and harder to do.

(The authors point out one exception: what about AI? If a company trains a large language model on the current internet, will its context windows be long enough to tell that that supposedly legitimate paper about hazelnuts is in an HIV/AIDS journal? If something gets said often enough, copied again and again in papers sold by a mill, will an AI trained on all these papers be convinced? Presumably, someone is being paid good money to figure out how to filter AI-generated slop from training data: can they filter paper mill fraud as well?)

It’s a shame that we have one term, scientific fraud, to deal with these two very different things. But it’s important to keep in mind that they are different. Fraud for power and fraud for money can have very different profiles, and offer very different risks. If you don’t trust a scientific result, it’s worth understanding what might be at play.

Some Dumb AI Ideas

Sometimes, when I write a post about AI, I’ve been sitting on an idea for a long time. I’ve talked to experts, I’ve tried to understand the math, I’ve honed my points and cleared away clutter.

This is not one of those times. The ideas in this post almost certainly have something deeply wrong with them. But hopefully they’re interesting food for thought.

My first dumb idea: instruction tuning was a mistake.

I’m drawing the seeds of this one from a tumblr post by nostalgebraist, someone known for making a popular bot trained on his tumblr posts in the early days before GPT became ChatGPT.

AIs like ChatGPT are based on Large Language Models, insanely complicated mathematical formulas that predict, given part of a text, what the rest of that text is likely to look like. In the early days, this was largely how they were used. Loosely described, nostalgebraist’s bot, called nostalgebraist-autoresponder, began with a list of tumblr posts and asks and determined what additional posts would best fit in.
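To make "predict the rest of the text" concrete, here is a deliberately tiny sketch. It is not how a Large Language Model works internally: it only counts which word follows which in a made-up corpus, then extends a prompt with the most frequent follower. The corpus and the function names `train_bigrams` and `complete` are invented for illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following


def complete(following, prompt, max_words=5):
    """Extend the prompt by repeatedly appending the most frequent next word."""
    words = prompt.split()
    for _ in range(max_words):
        candidates = following.get(words[-1])
        if not candidates:
            break  # the last word was never followed by anything in training
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)


# Made-up training text, standing in for "a list of tumblr posts".
corpus = "the cat sat on the mat . the cat ate the fish ."
model = train_bigrams(corpus)
completion = complete(model, "the cat", max_words=3)
print(completion)
```

A real model replaces the word counts with a learned formula and looks at far more than the last word, but the interface is the same: text in, plausible continuation out.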

If you think about it, though, ChatGPT doesn’t really work like that. ChatGPT has conversations: you send it messages, it sends you responses. The text it creates is a dialogue, with you supplying half the input. But most texts aren’t dialogues, and ChatGPT draws on a lot of non-dialogue texts to make its dialogue-like responses.

The reason it does this is something called instruction tuning. ChatGPT has been intentionally biased, not to give the most likely completion to a task in general, but to give completions that fit this dialogue genre. What I didn’t know until I read nostalgebraist’s post was that this genre was defined artificially: AI researchers made up fake dialogues with AI, cheesy sci-fi conversations imagining how an AI might respond to instructions from a user, and then biased the Large Language Model so that rather than giving the most likely text in general, it gives a text that is more likely to look like these cheesy sci-fi conversations. It’s why ChatGPT sounds kind of like a fictional robot: not because sci-fi writers accurately predicted what AI would sound like, but because AI was created based on sci-fi texts.

For nostalgebraist, this leads into an interesting reflection on how a sci-fi AI should behave, how being warped around a made-up genre without history or depth creates characters which act according to simple narratives and express surprising anxiety.

For myself, though, I can’t help but wonder if the goal of dialogue itself is the problem. Dialogue is clearly important commercially: people use ChatGPT because they can chat with it. But Large Language Models aren’t inherently chatbots: they produce plausible texts, of any sort you could imagine. People seem to want a machine that can, for example, answer scientific questions as part of a conversation. But most competent answers to scientific questions aren’t conversations, they’re papers. If people stuck with the “raw” model, producing excerpts of nonexistent papers rather than imitating a dialogue with a non-existent expert, wouldn’t you expect the answers to be more accurate, with the model no longer biased by an irrelevant goal? Is the need to make a sell-able chatbot making these AIs worse at everything else people are trying to use them for?

I’m imagining a world where, instead of a chatbot, OpenAI built an “alternate universe simulator”. You give it some context, some texts or parts of texts from a universe you made up, and it completes them in a plausible way. By imagining different universes, you can use it to answer different questions. Such a gimmick would get fewer customers and fewer investors, and it would probably do worse. But I have to wonder if the actual technology might have been more useful.

My second idea is dumber, to the point where I mostly know why it doesn’t work. But thinking about it might help clarify how things work for people unused to AI.

I saw someone point out that, unlike something like Wikipedia, AI doesn’t give you context. You shouldn’t trust Wikipedia, or a source you find on Google, blindly. If you want to, you can look through the edit history on Wikipedia, or figure out who wrote a page you found on Google and how. If ChatGPT tells you something, by default you don’t know where that knowledge came from. You can tell it to search, and then you’ll get links, but that’s because it’s using Google or the like behind the scenes anyway. You don’t know where the model is getting its ideas.

Why couldn’t we get that context, though?

Every text produced by a Large Language Model is causally dependent on its training data. Different data, different model, different text. That doesn’t mean that each text draws from one source, or just a few sources: ChatGPT isn’t copying the training data, at least not so literally.

But it does mean that, if ChatGPT says something is true, you should in principle be able to ask which data was most important in making it say that. If you leave a piece of data out of the training, and get similar answers, you can infer that the response you got doesn’t have much to do with that piece of data. But if you leave out a text in training, and now ChatGPT gives totally different responses to the same question…then there’s a pretty meaningful sense that it got the information from that source.

If this were the type of non-AI statistical model people use in physics, this would be straightforward. Researchers do this all the time: take one experiment out of the data, see how their analysis changes, and thereby figure out which experiments are most important to check. One can even sometimes calculate, given a model, where you should look.
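As a rough illustration of that leave-one-out procedure, here is a minimal sketch under simple assumptions: the "analysis" is just an inverse-variance weighted average, and the four experiments and their numbers are invented. It drops each experiment in turn and records how far the combined result moves.

```python
def weighted_mean(measurements):
    """Inverse-variance weighted mean of (value, uncertainty) pairs."""
    weights = [1.0 / sigma**2 for _, sigma in measurements]
    total = sum(w * value for w, (value, _) in zip(weights, measurements))
    return total / sum(weights)


def leave_one_out_shifts(experiments):
    """For each experiment, how far the combined result moves when it is dropped."""
    full = weighted_mean(list(experiments.values()))
    shifts = {}
    for name in experiments:
        rest = [m for other, m in experiments.items() if other != name]
        shifts[name] = weighted_mean(rest) - full
    return shifts


# Hypothetical data: (measured value, uncertainty) for four experiments.
experiments = {
    "A": (1.00, 0.10),
    "B": (1.02, 0.10),
    "C": (0.98, 0.10),
    "D": (1.60, 0.05),  # precise, but in tension with the others
}

shifts = leave_one_out_shifts(experiments)
most_influential = max(shifts, key=lambda name: abs(shifts[name]))
print(most_influential, shifts[most_influential])
```

Here each "retraining" is a one-line average, so repeating it four times costs nothing. The same loop over a language model's training texts would mean retraining the whole model once per text.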

Unfortunately, you can’t do this with ChatGPT. The model is just too big. You can’t calculate anything explicitly about it: the giant mathematical formulas behind it are so complicated that the most you can do is get probabilities out case by case. You can’t “unwind” them and see where the numbers come from. And you can’t just take out sources one by one, and train the model again: not when training takes months of expensive computer time.

So unlike with the previous idea, I understand even on a technical level why you can’t do this. But it helped me to be able to think about what I would like to do, if it were possible. Maybe it helps you too!

Technology as Evidence

How much can you trust general relativity?

On the one hand, you can read through a lovely Wikipedia article full of tests, explaining just how far and how precisely scientists have pushed their knowledge of space and time. On the other hand, you can trust GPS satellites.

As many of you may know, GPS wouldn’t work if we didn’t know about general relativity. In order for the GPS in your phone to know where you are, it has to compare signals from different satellites, each giving the location and time the signal was sent. To get an accurate result, the times measured on those satellites have to be adjusted: because of the lighter gravity they experience, time moves more quickly for them than for us down on Earth.
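For a sense of scale, here is a back-of-the-envelope estimate of that adjustment. It is a sketch under simplifying assumptions, not the correction GPS actually applies: a circular orbit, a non-rotating spherical Earth, and rounded textbook constants. It also includes the special-relativistic slowdown from the satellite's orbital speed, which the paragraph above leaves out and which partly offsets the gravitational speed-up.

```python
GM_EARTH = 3.986004418e14   # Earth's gravitational parameter, m^3/s^2
C = 299_792_458.0           # speed of light, m/s
R_EARTH = 6.371e6           # mean Earth radius, m (ground clock, rotation ignored)
R_GPS = 2.656e7             # approximate GPS orbital radius, m
SECONDS_PER_DAY = 86_400.0

# Gravitational effect: the satellite clock sits higher in Earth's potential,
# so it runs fast relative to a clock on the ground.
gravitational_rate = GM_EARTH / C**2 * (1.0 / R_EARTH - 1.0 / R_GPS)

# Velocity effect: the satellite moves at orbital speed, so its clock runs slow.
orbital_speed_squared = GM_EARTH / R_GPS  # v^2 for a circular orbit
velocity_rate = orbital_speed_squared / (2.0 * C**2)

gr_us_per_day = gravitational_rate * SECONDS_PER_DAY * 1e6
sr_us_per_day = velocity_rate * SECONDS_PER_DAY * 1e6
net_us_per_day = gr_us_per_day - sr_us_per_day

print(f"gravitational speed-up: {gr_us_per_day:.1f} microseconds/day")
print(f"velocity slowdown:      {sr_us_per_day:.1f} microseconds/day")
print(f"net drift:              {net_us_per_day:.1f} microseconds/day")
```

With these inputs the net drift comes out to a few tens of microseconds per day. Light travels roughly 300 meters per microsecond, so an uncorrected clock would throw positions off by kilometers within a day.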

In a sense, general relativity gets tested every minute of every day, on every phone in the world. That’s pretty trustworthy! Any time that science is used in technology, it gets tested in this way. The ideas we can use are ideas that have shown they can perform, ideas which do what we expect again and again and again.

In another sense, though, GPS is a pretty bad test of general relativity. It tests one of general relativity’s simplest consequences, based on the Schwarzschild metric for how gravity behaves near a large massive object, and not to an incredibly high degree of precision. Gravity could still violate general relativity in a huge number of other ways, and GPS would still function. That’s why the other tests are valuable: if you want to be sure general relativity doesn’t break down, you need to test it under conditions that GPS doesn’t cover, and to higher precision.

Once you know to look for it, these layers of tests come up everywhere. You might see the occasional article talking about tests of quantum gravity. The tests they describe are very specific, testing a very general and basic question: does quantum mechanics make sense at all in a gravitational world? In contrast, most scientists who research quantum gravity don’t find that question very interesting: if gravity breaks quantum mechanics in a way those experiments could test, it’s hard to imagine it not leading to a huge suite of paradoxes. Instead, quantum gravity researchers tend to be interested in deeper problems with quantum gravity, distinctions between theories that don’t dramatically break with our existing ideas, but that because of that are much harder to test.

The easiest tests are important, especially when they come from technology: they tell us, on a basic level, what we can trust. But we need the hard tests too, because those are the tests that are most likely to reveal something new, and bring us to a new level of understanding.

Newsworthiness Bias

I had a chat about journalism recently, and I had a realization about just how weird science journalism, in particular, is.

Journalists aren’t supposed to be cheerleaders. Journalism and PR have very different goals (which is why I keep those sides of my work separate). A journalist is supposed to be uncompromising, to write the truth even if it paints the source in a bad light.

Norms are built around this. Serious journalistic outlets usually don’t let sources see pieces before they’re published. The source doesn’t have the final say in how they’re portrayed: the journalist reserves the right to surprise them if justified. Investigative journalists can be superstars, digging up damning secrets about the powerful.

When a journalist starts a project, the piece might turn out positive, or negative. A politician might be the best path forward, or a disingenuous grifter. A business might be a great investment opportunity, or a total scam. A popular piece of art might be a triumph, or a disappointment.

And a scientific result?

It might be a fraud, of course. Scientific fraud does exist, and is a real problem. But it’s not common, really. Pick a random scientific paper, filter by papers you might consider reporting on in the first place, and you’re very unlikely to find a fraudulent result. Science journalists occasionally report on spectacularly audacious scientific frauds, or frauds in papers that have already made the headlines. But you don’t expect fraud in the average paper you cover.

It might be scientifically misguided: flawed statistics, a gap in a proof, a misuse of concepts. Journalists aren’t usually equipped to ferret out these issues, though. Instead, this is handled in principle by peer review, and in practice by the scientific community outside of the peer review process.

Instead, for a scientific result, the most common negative judgement isn’t that it’s a lie, or a mistake. It’s that it’s boring.

And certainly, a good science journalist can judge a paper as boring. But there is a key difference between doing that, and judging a politician as crooked or a popular work of art as mediocre. You can write an article about the lying candidate for governor, or the letdown Tarantino movie. But if a scientific result is boring, and nobody else has covered it…then it isn’t newsworthy.

In science, people don’t usually publish their failures, their negative results, their ho-hum obvious conclusions. That fills the literature with only the successes, a phenomenon called publication bias. It also means, though, that scientists try to make their results sound more successful, more important and interesting, than they actually are. Some of the folks fighting the replication crisis have coined a term for this: they call it importance hacking.
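Publication bias is easy to see in a toy simulation. The sketch below assumes a made-up setup: many studies measure the same small true effect with the same noise, and only estimates that clear a conventional significance threshold get "published". The constants are arbitrary choices for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

TRUE_EFFECT = 0.1       # the real size of the effect (arbitrary units)
STANDARD_ERROR = 0.2    # noise on each study's estimate
N_STUDIES = 10_000
THRESHOLD = 1.96 * STANDARD_ERROR  # estimate needed to look "significant"

# Each study reports the true effect plus Gaussian noise.
estimates = [random.gauss(TRUE_EFFECT, STANDARD_ERROR) for _ in range(N_STUDIES)]

# Only the studies that clear the threshold make it into the literature.
published = [estimate for estimate in estimates if estimate > THRESHOLD]

mean_all = sum(estimates) / len(estimates)
mean_published = sum(published) / len(published)

print(f"studies run:        {len(estimates)}")
print(f"studies published:  {len(published)}")
print(f"mean of all:        {mean_all:.3f}")
print(f"mean of published:  {mean_published:.3f}")
```

The average over all studies lands near the true effect, while the average over published studies has to sit above the threshold, several times larger than the truth. Nobody in this simulation lied: the filter alone does the inflating.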

The same incentives apply to journalists, especially freelancers. Starting out, it was far from clear that I could make enough to live on. I felt like I had to make every lead count, to find a newsworthy angle on every story idea I could find, because who knew when I would find another one? Over time, I learned to balance that pull better. Now that I’m making most of my income from consulting instead, the pressure has eased almost entirely: there are things I’m tempted to importance-hack for the sake of friends, but nothing that I need to importance-hack to stay in the black.

Doing journalism on the side may be good for me personally at the moment, but it’s not really a model. Much like we need career scientists, even if their work is sometimes boring, we need career journalists, even if they’re sometimes pressured to overhype.

So if we don’t want to incentivize science journalists to be science cheerleaders, what can we do instead?

In science, one way to address publication bias is with pre-registered studies. A scientist sets out what they plan to test, and a journal agrees to publish the result, no matter what it is. You could imagine something like this for science journalism. I once proposed a recurring column where every month I would cover a random paper from arXiv.org, explaining what it meant to accomplish. I get why the idea was turned down, but I still think about it.

In journalism, the arts offer the closest parallel with a different approach. There are many negative reviews of books, movies, and music, and most of them merely accuse the art of being boring, not evil. These exist because they focus on popular works that people pay attention to anyway, so that any negative coverage has someone to convince. You could imagine applying this model to science, though it could be a bit silly. I’m envisioning a journalist who writes an article every time Witten publishes, rating some papers impressive and others disappointing, the same way a music journalist might cover every Taylor Swift album.

Neither of these models is really satisfactory. You could imagine an even more adversarial model, where journalists run around accusing random scientists of wasting the government’s money, but that seems dramatically worse.

So I’m not sure. Science is weird, and hard to accurately value: if we knew how much something mattered already, it would be engineering, not science. Journalism is weird: it’s public-facing research, where the public facing is the whole point. Their combination? Even weirder.

Microdosing Vibe Physics

Have you heard of “vibe physics”?

The phrase “vibe coding” came first. People have been using large language models like ChatGPT to write computer code (and not the way I did last year). They chat with the model, describing what they want to do and asking the model to code it up. You can guess the arguments around this, from people who are convinced AI is already better than a human programmer to people sure the code will be riddled with errors and vulnerabilities.

Now, there are people claiming not only to do vibe coding, but vibe physics: doing theoretical physics by chatting with an AI.

I think we can all agree that’s a lot less plausible. Some of the people who do vibe coding actually know how to code, but I haven’t seen anyone claiming to do vibe physics who actually understands physics. They’re tech entrepreneurs in the most prominent cases, random people on the internet otherwise. And while a lot of computer code is a minor tweak on something someone has already done, theoretical physics doesn’t work that way: if someone has already come up with your idea, you’re an educator, not a physicist.

Still, I think there is something to keep in mind about the idea of “vibe physics”, related to where physics comes from.

Here’s a question to start with: go back a bit before the current chat-bot boom. There were a ton of other computational and mathematical tools. Theorem-proving software could encode almost arbitrary mathematical statements in computer code and guarantee their accuracy. Statistical concepts like Bayes’ rule described how to reason from evidence to conclusions, not flawlessly but as well as anyone reliably can. We had computer simulations for a wealth of physical phenomena, and approximation schemes for many others.

With all those tools, why did we still have human physicists?

That is, go back before ChatGPT, before large language models. Why not just code up a program that starts with the evidence and checks which mathematical model fits it best?

In principle, I think you really could have done that. But you could never run that program. It would take too long.

Doing science 100% correctly and reliably is agonizingly slow, and prohibitively expensive. You cannot check every possible model, nor can you check those models against all the available data. You must simplify your problem, somehow, even if it makes your work less reliable, and sometimes incorrect.
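A crude way to see the scale of the problem: suppose a "model" is just a choice of which candidate terms to include in a formula. This is a heavily simplified assumption, and the term names below are placeholders, but even then the number of models to check doubles with every term you allow.

```python
from itertools import combinations


def count_models(n_terms):
    """Number of distinct models when each candidate term is either in or out."""
    return 2 ** n_terms


def enumerate_models(terms):
    """Yield every subset of candidate terms, i.e. every model to be checked."""
    for size in range(len(terms) + 1):
        yield from combinations(terms, size)


# With three placeholder terms, exhaustive checking is trivial.
small_search = list(enumerate_models(["x", "x^2", "sin(x)"]))
print(len(small_search), "models from 3 candidate terms")

# The count grows far faster than any computer can keep up with.
for n_terms in (10, 50, 300):
    print(n_terms, "terms ->", count_models(n_terms), "models")
```

At 300 candidate terms the count is a 91-digit number, and that is before fitting each model to any data. Real model spaces are richer than "term in or out", which only makes the exhaustive approach more hopeless.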

And for most of history, humans have provided that simplification.

A physicist isn’t going to consider every possible model. They’re going to consider models that are similar to models they studied, or similar to models others propose. They aren’t going to consider all the evidence. They’ll look at some of the evidence, the evidence other physicists are talking about and puzzled by. They won’t simulate the consequences of their hypotheses in exhaustive detail. Instead, they’ll guess, based on their own experience, a calculation that captures what they expect to be relevant.

Human physicists provided the unreliable part of physics, the heuristics. The “vibe physics”, if you will.

AI is also unreliable, also heuristic. But humans still do this better than AI.

Part of the difference is specificity. These AIs are trained on all of human language, and then perhaps fine-tuned on a general class of problems. A human expert has spent their life fine-tuning on one specific type of problem, and their intuitions, their heuristics, their lazy associations and vibes, all will be especially well-suited to problems of that type.

Another part of the difference, though, is scale.

When you talk to ChatGPT, it follows its vibes into paragraphs of text. If you turn on reasoning features, you make it check its work in the background, but it still is generating words upon words inside, evaluating those words, then generating more.

I suspect, for a physicist, the “control loop” is much tighter. Many potential ideas get ruled out a few words in. Many aren’t even expressed in words at all, just concepts. A human physicist is ultimately driven by vibes, but they check and verify those vibes, based on their experience, at a much higher frequency than any current AI system can achieve.

(I know almost nothing about neuroscience. I’m just basing this on what it can feel like, to grope through a sentence and have it assemble itself as it goes into something correct, rather than having to go back and edit it.)

As companies get access to bigger datacenters, I suspect they’ll try to make this loop tighter, to get AI to do something closer to what (I suspect, it appears) humans do. And then maybe AI will be able to do vibe physics.

Even then, though, you should not do vibe physics with the AI.

If you look at the way people describe doing vibe physics, they’re not using the AI for the vibes. They’re providing the vibes, and the AI is supposed to check things.

And that, I can confidently say, is completely ass-backwards. The AI is a vibe machine, it is great at vibes. Substituting your vibes will just make it worse. On the other hand, the AI is awful at checking things. It can find published papers sometimes, which can help you check something. But it is not set up to do the math, at least not unless the math can be phrased as a simple Python script or an IMO problem. In order to do anything like that, it has to call another type of software to verify. And you could have just used that software.

Theoretical physics is still not something everyone can do. Proposing a crackpot theory based on a few papers you found on Google and a couple YouTube videos may make you feel less confident than proposing one based on praise from ChatGPT and a list of papers it claims have something to do with your idea, and that extra confidence makes the ChatGPT route more tempting. But it’s still proposing a crackpot theory. If you want to get involved, there’s still no substitute for actually learning how physics works.