Academia Tracks Priority, Not Provenance

A recent Correspondence piece in Nature Machine Intelligence points at an issue with using LLMs to write journal articles. LLMs are trained on enormous amounts of scholarly output, but the result is quite opaque: it is usually impossible to tell which sources influence a specific LLM-written text. That means that when a scholar uses an LLM, they may get a result that depends on another scholar’s work, without realizing it or documenting it. The ideas’ provenance gets lost, and the piece argues this is damaging, depriving scholars of credit and setting back progress.

It’s a good point. Provenance matters. If we want to prioritize funding for scholars whose ideas have the most impact, we need a way to track where ideas arise.

However, current publishing norms make essentially no effort to do this. Academic citations are not used to track provenance, and they are not typically thought of as tracking provenance. Academic citations track priority.

Priority is a central value in scholarship, with a long history. We give special respect to the first person to come up with an idea, make an observation, or do a calculation, and more specifically, the first person to formally publish it. We do this even if the person’s influence was limited, and even if the idea was rediscovered independently later on. In an academic context, being first matters.

In a paper, one is thus expected to cite the sources that have priority: those that came up with an idea first. Someone who fails to do so will get citation request emails, and reviewers may request revisions to the paper to add in those missing citations.

One may also cite papers that were helpful, even if they didn’t come first. Tracking provenance in this way can be nice, a way to give direct credit to those who helped and point people to useful resources. But it isn’t mandatory in the same way. If you leave out a secondary source and your paper doesn’t use anything original to that source (like new notation), you’re much less likely to get citation request emails, or revision requests from reviewers. Provenance is just much lower priority.

In practice, academics track provenance in much less formal ways. Just before its list of citations, a paper will typically have an Acknowledgements section, where the authors thank those who made the paper possible. This includes formal thanks to funding agencies, but also informal thanks for “helpful discussions” that don’t meet the threshold of authorship.

If we cared about tracking provenance, those acknowledgements would be crucial information, an account of whose ideas directly influenced the ideas in the paper. But they’re not treated that way. No one lists the number of times they’ve been thanked for helpful discussions on their CV or in a grant application, and no one considers these discussions for hiring or promotion. You can’t look them up on an academic profile or easily graph them in a metascience paper. Unlike citations, unlike priority, there is essentially no attempt to measure these tracks of provenance in any organized way.

Instead, provenance is often the realm of historians or history-minded scholars, writing long after the fact. For academics, the fact that Yang and Mills published their theory first is enough: we call it Yang-Mills theory. For those studying the history, the story is murkier: it looks like Pauli came up with the idea first, and did most of the key calculations, but didn’t publish when it looked to him like the theory couldn’t describe the real world. What’s more, there is evidence suggesting that Yang knew about Pauli’s result, that he had read a letter from him on the topic, and that the idea’s provenance goes back to Pauli. But Yang published, and Pauli didn’t. And in the way academia has worked over the last 75 years, that claim of priority is what actually mattered.

Should we try to track provenance? Maybe. Maybe the emerging ubiquity of LLMs should be a wakeup call, a demand to improve our tracking of ideas, both in artificial and human neural networks. Maybe we need to demand interpretability from our research tools, to insist that we can track every conclusion back to its evidence for every method we employ, to set a civilizational technological priority on the accurate valuation of information.

What we shouldn’t do, though, is pretend that we just need to go back to what we were doing before.

3 thoughts on “Academia Tracks Priority, Not Provenance”

  1. boldly91f5a7d879

    Nobody comes up with a first idea. Only a handful of people from unusual backgrounds have parachuted into an unknown subject matter and proceeded to make original contributions. In an ocean of knowledge, we occupy our (potentially multiple) local gyres along with other thinkers who have equal exposure to the same subject matter and to the work of our predecessors. The acknowledgements section is not an exercise in false modesty, but a true reflection that the vast majority of academics work cooperatively with and rely upon others. The views of scholars are highly path dependent, but those paths are never hacked through virgin forest, they connect oases of understanding and avoid sloughs of impossibility. While the young and ambitious may see their ideas as fresh and original, the mature and self-aware see their work as the product of their times which would have been elaborated by someone else had they not followed their particular path.

  2. sylvainribault

    When writing a research article, my main worry is to help readers understand the topic itself, not the history of the topic. I try to cite texts that are the most helpful in this respect, under the hypothesis that readers might want to consult them. Clarity or completeness of the cited works matters more than priority or provenance, and I favour review articles over primary sources.

    Priority and provenance only matter if we view citations as a way to allocate credit for past discoveries. It is perhaps inevitable that citations are interpreted in this way, but this should not be at the expense of their main function: providing the needed background for understanding a new research work.

    These days, the abuse of bibliometrics for managing research leads to citation inflation. This is detrimental both to the allocation of credit, and to providing useful background.

  3. Nico

    I actually think science has better reason to track provenance than priority. Have you never worked your way back through references to better understand an idea?

    Even Leibniz, while going after priority, made the case for provenance:

    “It is most useful that the true origins of memorable inventions be known, especially of those that were conceived not by accident but by an effort of meditation. The use of this is not merely that history may give everyone his due and others be spurred by the expectation of similar praise, but also that the art of discovery may be promoted and its method become known through brilliant examples.”
