Tag Archives: arXiv

Fields and Scale

I am a theoretical particle physicist, and every morning I check the arXiv.

arXiv.org is a type of website called a preprint server. It’s where we post papers before they are submitted to (and printed by) a journal. In practice, everything in our field shows up on arXiv, publicly accessible, before it appears anywhere else. There’s no peer review process on arXiv (the journals still handle that), but in our field peer review doesn’t often catch substantive errors. So we almost never read the journals: we just check arXiv.

And so every day, I check the arXiv. I go to the section on my sub-field, and I click on a link that lists all of the papers that were new that day. I skim the titles, and if I see an interesting paper I’ll read the abstract, and maybe download the full thing. Checking as I’m writing this, there were ten papers posted in my field, and another twenty “cross-lists” were posted in other fields but additionally classified in mine.

Other fields use arXiv: mathematicians and computer scientists and even economists use it in roughly the same way physicists do. For biology and medicine, though, there are different, newer sites: bioRxiv and medRxiv.

One thing you may notice is the different capitalization. When physicists write arXiv, the “X” is capitalized. In the logo, it looks like a Greek letter chi, thus saying “archive”. The biologists and medical researchers capitalize the R instead. The logo still has an X that looks like a chi, but positioned with the R it looks like the Rx of medical prescriptions.

Something I noticed, but you might not, was the lack of a handy link to see new papers. You can search medRxiv and bioRxiv, and filter by date. But there’s no link that directly takes you to the newest papers. That suggests that biologists aren’t using bioRxiv the way we use arXiv, checking the new papers every day.

I was curious if this had to do with the scale of the field. I have the impression that physics and mathematics are smaller fields than biology, and that much less physics and mathematics research goes on than medical research. Certainly, theoretical particle physics is a small field. So I might have expected arXiv to be smaller than bioRxiv and medRxiv, and I certainly would expect fewer papers in my sub-field than papers in a medium-sized subfield of biology.

On the other hand, arXiv in my field is universal. In biology, bioRxiv and medRxiv are still quite controversial. More and more people are using them, but not every journal accepts papers posted to a preprint server. Many people still don’t use these services. So I might have expected bioRxiv and medRxiv to be smaller.

Checking now, neither answer is quite right. I looked between November 1 and November 2, and asked each site how many papers were uploaded between those dates. arXiv had the most, 604 papers. bioRxiv had roughly half that many, 348. medRxiv had 97.

arXiv represents multiple fields, while bioRxiv is “just” biology. Specializing, on that day arXiv had 235 physics papers, 135 mathematics papers, and 250 computer science papers. So each individual field has fewer papers than biology in this period.
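The comparison above is simple arithmetic, and it can be checked in a few lines. This is a minimal Python sketch using only the counts quoted in this post; the variable names are illustrative, and the cross-listing explanation in the comment is a guess rather than something the sites report.

```python
# Paper counts quoted above for uploads between November 1 and November 2.
site_totals = {"arXiv": 604, "bioRxiv": 348, "medRxiv": 97}
arxiv_by_field = {"physics": 235, "mathematics": 135, "computer science": 250}

# Is each individual arXiv field smaller than bioRxiv as a whole?
fields_below_biorxiv = {
    field: count < site_totals["bioRxiv"] for field, count in arxiv_by_field.items()
}

# The per-field counts add up to more than the arXiv total. One plausible
# reason (an assumption, not something the sites state) is that a paper
# listed in more than one field gets counted once per field.
field_sum = sum(arxiv_by_field.values())
excess = field_sum - site_totals["arXiv"]

print(fields_below_biorxiv)
print(f"sum of fields = {field_sum}, arXiv total = {site_totals['arXiv']}, excess = {excess}")
```

The per-field counts need not sum to the site total, so the comparison with bioRxiv is best read field by field.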

Specializing even further, I can look at a subfield. My subfield, which is fairly small, had 20 papers between those dates. Cell biology, which I would expect to be quite a big subfield, had 33.

Overall, the numbers were weirdly comparable, with medRxiv unexpectedly small compared to both arXiv and bioRxiv. I’m not sure whether there are more biologists than physicists, but I’m pretty sure there should be more cell biologists than theoretical particle physicists. This suggests that many still aren’t using bioRxiv. It makes me wonder: will bioRxiv grow dramatically in future? Are the people running it ready if it does?

The arXiv SciComm Challenge

Fellow science communicators, think you can explain everything that goes on in your field? If so, I have a challenge for you. Pick a day, and go through all the new papers on arXiv.org in a single area. For each one, try to give a general-audience explanation of what the paper is about. To make it easier, you can ignore cross-listed papers. If your field doesn’t use arXiv, consider if you can do the challenge with another appropriate site.

I’ll start. I’m looking at papers in the “High Energy Physics – Theory” area, announced 6 Jan, 2022. I’ll warn you in advance that I haven’t read these papers, just their abstracts, so apologies if I get your paper wrong!

arXiv:2201.01303 : Holographic State Complexity from Group Cohomology

This paper says it is a contribution to a Proceedings. That means it is based on a talk given at a conference. In my field, a talk like this usually won’t be presenting new results, but instead summarizes results in a previous paper. So keep that in mind.

There is an idea in physics called holography, where two theories are secretly the same even though they describe the world with different numbers of dimensions. Usually this involves a gravitational theory in a “box”, and a theory without gravity that describes the sides of the box. The sides turn out to fully describe the inside of the box, much like a hologram looks 3D but can be printed on a flat sheet of paper. Using this idea, physicists have connected some properties of gravity to properties of the theory on the sides of the box. One of those properties is complexity: the complexity of the theory on the sides of the box says something about gravity inside the box, in particular about the size of wormholes. The trouble is, “complexity” is a bit subjective: it’s not clear how to give a good definition for it for this type of theory. In this paper, the author studies a theory with a precise mathematical definition, called a topological theory. This theory turns out to have mathematical properties that suggest a well-defined notion of complexity for it.

arXiv:2201.01393 : Nonrelativistic effective field theories with enhanced symmetries and soft behavior

We sometimes describe quantum field theory as quantum mechanics plus relativity. That’s not quite true though, because it is possible to define a quantum field theory that doesn’t obey special relativity, a non-relativistic theory. Physicists do this if they want to describe a system moving much slower than the speed of light: it gets used sometimes for nuclear physics, and sometimes for modeling colliding black holes.

In particle physics, a “soft” particle is one with almost no momentum. We can classify theories based on how they behave when a particle becomes more and more soft. In normal quantum field theories, special behavior when a particle becomes soft is often due to a symmetry of the theory, where the theory looks the same even if something changes. This paper shows that symmetry alone is not enough for non-relativistic theories: they must meet additional requirements to have special soft behavior. The authors “bootstrap” a few theories, using some general restrictions to find them without first knowing how they work (“pulling them up by their own bootstraps”), and show that the theories they find are in a certain sense unique, the only theories of that kind.

arXiv:2201.01552 : Transmutation operators and expansions for 1-loop Feynman integrands

In recent years, physicists in my sub-field have found new ways to calculate the probability that particles collide. One of these methods describes ordinary particles in a way resembling string theory, and using it physicists discovered a whole “web” of theories that were linked together by small modifications of the method. This method originally worked only for the simplest Feynman diagrams, the “tree” diagrams that correspond to classical physics, but was extended to the next-simplest diagrams, diagrams with one “loop” that start incorporating quantum effects.

This paper concerns a particular spinoff of this method, that can find relationships between certain one-loop calculations in a particularly efficient way. It lets you express calculations of particle collisions in a variety of theories in terms of collisions in a very simple theory. Unlike the original method, it doesn’t rely on any particular picture of how these collisions work, either Feynman diagrams or strings.

arXiv:2201.01624 : Moduli and Hidden Matter in Heterotic M-Theory with an Anomalous U(1) Hidden Sector

In string theory (and its more sophisticated cousin M theory), our four-dimensional world is described as a world with more dimensions, where the extra dimensions are twisted up so that they cannot be detected. The shape of the extra dimensions influences the kinds of particles we can observe in our world. That shape is described by variables called “moduli”. If those moduli are stable, then the properties of particles we observe would be fixed; otherwise they would not be. In general it is a challenge in string theory to stabilize these moduli and get a world like what we observe.

This paper discusses shapes that give rise to a “hidden sector”, a set of particles that are disconnected from the particles we know so that they are hard to observe. Such particles are often proposed as a possible explanation for dark matter. This paper calculates, for a particular kind of shape, what the masses of different particles are, as well as how different kinds of particles can decay into each other. For example, a particle that causes inflation (the accelerating expansion of the universe) can decay into effects on the moduli and dark matter. The paper also shows how some of the moduli are made stable in this picture.

arXiv:2201.01630 : Chaos in Celestial CFT

One variant of the holography idea I mentioned earlier is called “celestial” holography. In this picture, the sides of the box are an infinite distance away: a “celestial sphere” depicting the angles at which particles fly off after they collide, in the same way a star chart depicts the angles between stars. Recent work has shown that there is something like a sensible theory that describes physics on this celestial sphere, one that contains all the information about what happens inside.

This paper shows that the celestial theory has a property called quantum chaos. In physics, a theory is said to be chaotic if it depends very precisely on its initial conditions, so that even a small change will result in a large change later (the usual metaphor is a butterfly flapping its wings and causing a hurricane). This kind of behavior appears to be present in this theory.
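Sensitivity to initial conditions is easy to see numerically. The sketch below uses the logistic map, a standard textbook example of classical chaos. It is not the celestial theory or the quantum notion of chaos the paper studies, and the function name, starting value and step count are illustrative choices.

```python
def logistic_trajectory(x0, steps, r=4.0):
    """Iterate x -> r * x * (1 - x), a standard toy model of classical chaos."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

steps = 60
a = logistic_trajectory(0.2, steps)
b = logistic_trajectory(0.2 + 1e-10, steps)  # almost the same starting point

# Distance between the two trajectories at every step.
gaps = [abs(x - y) for x, y in zip(a, b)]
for n in (0, 10, 20, 30, 40, 50, 60):
    print(f"step {n:2d}: gap = {gaps[n]:.3e}")
```

The two runs start a hair apart, stay close for the first few dozen steps, and then become unrelated: that is the butterfly effect in miniature.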

arXiv:2201.01657 : Calculations of Delbrück scattering to all orders in αZ

Delbrück scattering is an effect where the nuclei of heavy elements like lead can deflect high-energy photons, as a consequence of quantum field theory. This effect is apparently tricky to calculate, and previous calculations have involved approximations. This paper finds a way to calculate the effect without those approximations, which should let it match better with experiments.

(As an aside, I’m a little confused by the claim that they’re going to all orders in αZ when it looks like they just consider one-loop diagrams…but this is probably just my ignorance, this is a corner of the field quite distant from my own.)

arXiv:2201.01674 : On Unfolded Approach To Off-Shell Supersymmetric Models

Supersymmetry is a relationship between two types of particles: fermions, which typically make up matter, and bosons, which are usually associated with forces. In realistic theories this relationship is “broken” and the two types of particles have different properties, but theoretical physicists often study models where supersymmetry is “unbroken” and the two types of particles have the same mass and charge. This paper finds a new way of describing some theories of this kind that reorganizes them in an interesting way, using an “unfolded” approach in which aspects of the particles that would normally be combined are given their own separate variables.

(This is another one I don’t know much about: this is the first time I’ve heard of the unfolded approach.)

arXiv:2201.01679 : Geometric Flow of Bubbles

String theorists have conjectured that only some types of theories can be consistently combined with a full theory of quantum gravity, while others live in a “swampland” of non-viable theories. One set of conjectures characterizes this swampland in terms of “flows” in which theories with different geometry can flow into each other. The properties of these flows are supposed to be related to which theories are or are not in the swampland.

This paper writes down equations describing these flows, and applies them to some toy model “bubble” universes.

arXiv:2201.01697 : Graviton scattering amplitudes in first quantisation

This paper is a pedagogical one, introducing graduate students to a topic rather than presenting new research.

Usually in quantum field theory we do something called “second quantization”, thinking about the world not in terms of particles but in terms of fields that fill all of space and time. However, sometimes one can instead use “first quantization”, which is much more similar to ordinary quantum mechanics. There you think of a single particle traveling along a “world-line”, and calculate the probability it interacts with other particles in particular ways. This approach has recently been used to calculate interactions of gravitons, particles related to the gravitational field in the same way photons are related to the electromagnetic field. The approach has some advantages in terms of simplifying the results, which are described in this paper.

A Field That Doesn’t Read Its Journals

Last week, the University of California system ended negotiations with Elsevier, one of the top academic journal publishers. UC had been trying to get Elsevier to switch to a new type of contract, one in which instead of paying for access to journals they pay for their faculty to publish, then make all the results openly accessible to the public. In the end they couldn’t reach an agreement and thus didn’t renew their contract, cutting Elsevier off from millions of dollars and their faculty from reading certain (mostly recent) Elsevier journal articles. There’s a nice interview here with one of the librarians who was sent to negotiate the deal.

I’m optimistic about what UC was trying to do. Their proposal sounds like it addresses some of the concerns raised here with open-access systems. Currently, journals that offer open access often charge fees directly to the scientists publishing in them, fees that have to be scrounged up from somebody’s grant at the last minute. By setting up a deal for all their faculty together, UC would have avoided that. While the deal fell through, having an organization as big as the whole University of California system advocating open access (and putting the squeeze on Elsevier’s profits) seems like it can only lead to progress.

The whole situation feels a little surreal, though, when I compare it to my own field.

At the risk of jinxing it, my field’s relationship with journals is even weirder than xkcd says.

arXiv.org is a website that hosts what are called “preprints”, which originally meant papers that haven’t been published yet. They’re online, freely accessible to anyone who wants to read them, and will be for as long as arXiv exists to host them. Essentially everything anyone publishes in my field ends up on arXiv.

Journals don’t mind, in part because many of them are open-access anyway. There’s an organization, SCOAP3, that runs what is in some sense a large-scale version of what UC was trying to set up: instead of paying for subscriptions, university libraries pay SCOAP3 and it covers the journals’ publication costs.

This means that there are two coexisting open-access systems, the journals themselves and arXiv. But in practice, arXiv is the one we actually use.

If I want to show a student a paper, I don’t send them to the library or the journal website, I tell them how to find it on arXiv. If I’m giving a talk, there usually isn’t room for a journal reference, so I’ll give the arXiv number instead. In a paper, we do give references to journals…but they’re most useful when they have arXiv links as well. I think the only times I’ve actually read an article in a journal were for articles so old that arXiv didn’t exist when they were published.

We still submit our papers to journals, though. Peer review still matters: we still want to determine whether our results are cool enough for the fancy journals or only good enough for the ordinary ones. We still put journal citations on our CVs so employers and grant agencies know not only what we’ve done, but which reviewers liked it.

But the actual copy-editing and formatting and publishing, that the journals still employ people to do? Mostly, it never gets read.

In my experience, that editing isn’t too impressive. Often, it’s about changing things to fit the journal’s preferences: its layout, its conventions, its inconvenient proprietary document formats. I haven’t seen them try to fix grammar, or improve phrasing. Maybe my papers have unusually good grammar, maybe they do more for other papers. And maybe they used to do more, when journals had a more central role. But now, they don’t change much.

Sometimes the journal version ends up on arXiv, if the authors put it there. Sometimes it doesn’t. And sometimes the result is in between. For my last paper about Calabi-Yau manifolds in Feynman diagrams, we got several helpful comments from the reviewers, but the journal also weighed in to get us to remove our more whimsical language, down to the word “bestiary”. For the final arXiv version, we updated for the reviewer comments, but kept the whimsical words. In practice, that version is the one people in our field will read.

This has some awkward effects. It means that sometimes important corrections don’t end up on arXiv, and people don’t see them. It means that technically, if someone wanted to insist on keeping an incorrect paper online, they could, even if a corrected version was published in a journal. And of course, it means that a large amount of effort is dedicated to publishing journal articles that very few people read.

I don’t know whether other fields could get away with this kind of system. Physics is small. It’s small enough that it’s not so hard to get corrections from authors when one needs to, small enough that social pressure can get wrong results corrected. It’s small enough that arXiv and SCOAP3 can exist, funded by universities and private foundations. A bigger field might not be able to do any of that.

For physicists, we should keep in mind that our system can and should still be improved. For other fields, it’s worth considering whether you can move in this direction, and what it would cost to do so. Academic publishing is in a pretty bizarre place right now, but hopefully we can get it to a better one.

A Paper About Ranking Papers

If you’ve ever heard someone list problems in academia, citation-counting is usually near the top. Hiring and tenure committees want easy numbers to judge applicants with: number of papers, number of citations, or related statistics like the h-index. Unfortunately, these metrics can be gamed, leading to a host of bad practices that get blamed for pretty much everything that goes wrong in science. In physics, it’s not even clear that these statistics tell us anything: papers in our field have been including more citations over time, and for thousand-person experimental collaborations the number of citations and papers don’t really reflect any one person’s contribution.

It’s pretty easy to find people complaining about this. It’s much rarer to find a proposed solution.

That’s why I quite enjoyed Alessandro Strumia and Riccardo Torre’s paper last week, on Biblioranking fundamental physics.

Some of their suggestions are quite straightforward. With the number of citations per paper increasing, it makes sense to divide the weight of each citation by the number of citations the citing paper contains: it means more to get cited by a paper with ten citations than by a paper with one hundred. Similarly, you could divide credit for a paper among its authors, rather than giving each author full credit.
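To make the two normalizations concrete, here is a minimal Python sketch on made-up data. It follows the description in this post, not the exact formulae in Strumia and Torre’s paper, and the paper labels and author names are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: paper -> (authors, papers it cites).
papers = {
    "A": (["alice"], []),
    "B": (["bob", "carol"], ["A"]),
    "C": (["dave"], ["A", "B"]),
    "D": (["alice", "erin"], ["A", "B", "C"]),
}

# Each citing paper hands out a total weight of 1, split evenly among its references.
paper_score = defaultdict(float)
for citing, (_, references) in papers.items():
    for cited in references:
        paper_score[cited] += 1.0 / len(references)

# Each paper's score is then split evenly among its authors.
author_score = defaultdict(float)
for paper, (authors, _) in papers.items():
    for author in authors:
        author_score[author] += paper_score[paper] / len(authors)

print({paper: round(score, 3) for paper, score in paper_score.items()})
print({author: round(score, 3) for author, score in author_score.items()})
```

In this toy network the single citation from B (one reference) is worth three times as much to A as the citation from D (three references).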

Some are more elaborate. They suggest using a variant of Google’s PageRank algorithm to rank papers and authors. Essentially, the algorithm imagines someone wandering from paper to paper and tries to figure out which papers are more central to the network. This is apparently an old idea, but by combining it with their normalization by number of citations they eke a bit more mileage from it. (I also found their treatment a bit clearer than the older papers they cite. There are a few more elaborate setups in the literature as well, but they seem to have a lot of free parameters so Strumia and Torre’s setup looks preferable on that front.)
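The “wandering reader” picture can be sketched as a short power iteration. This is plain textbook PageRank on a hypothetical four-paper network, not the specific variant Strumia and Torre use. The damping value 0.85 is the conventional default and is an assumption here.

```python
def pagerank(citations, damping=0.85, iterations=100):
    """Power-iteration PageRank on a dict: paper -> list of papers it cites."""
    nodes = sorted(set(citations) | {c for refs in citations.values() for c in refs})
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # With probability (1 - damping) the reader jumps to a random paper.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node in nodes:
            refs = citations.get(node, [])
            if refs:
                # Otherwise the reader follows one of the paper's references at random.
                share = damping * rank[node] / len(refs)
                for cited in refs:
                    new_rank[cited] += share
            else:
                # A paper with no references sends the reader to a random paper.
                share = damping * rank[node] / len(nodes)
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Hypothetical toy network: later papers cite earlier ones.
toy_citations = {"A": [], "B": ["A"], "C": ["A", "B"], "D": ["A", "B", "C"]}
ranks = pagerank(toy_citations)
for paper, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(paper, round(score, 3))
```

Because the reader splits their attention among a paper’s references, the division by number of citations is already built into the walk.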

One final problem they consider is that of self-citations, and citation cliques. In principle, you could boost your citation count by citing yourself. While that’s easy to correct for, you could also be one of a small number of authors who cite each other a lot. To keep the system from being gamed in this way, they propose a notion of a “CitationCoin” that counts (normalized) citations received minus (normalized) citations given. The idea is that, just as you can’t make anyone richer just by passing money between your friends without doing anything with it, so a small community can’t earn “CitationCoins” without getting the wider field interested.
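A toy calculation shows why a clique can’t mint coins for itself. The sketch below implements only the description above (normalized citations received minus normalized citations given), at the level of authors. The names and numbers are hypothetical, and the paper’s actual definition may differ in detail.

```python
from collections import defaultdict

# Hypothetical toy data:
# (citing author, cited author, number of references in the citing paper).
citations = [
    ("xavier", "yolanda", 2),  # a two-person clique citing each other
    ("yolanda", "xavier", 2),
    ("xavier", "yolanda", 4),
    ("yolanda", "xavier", 4),
    ("zoe", "xavier", 5),      # one citation from outside the clique
]

coins = defaultdict(float)
for citing, cited, n_references in citations:
    weight = 1.0 / n_references  # normalized value of this citation
    coins[cited] += weight       # citations received
    coins[citing] -= weight      # citations given

for author, balance in sorted(coins.items()):
    print(f"{author}: {balance:+.2f}")
```

The citations traded inside the clique cancel exactly. The only positive balance comes from the one outside citation, and the total over everyone is zero, just like passing money around.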

There are still likely problems with these ideas. Dividing credit for each paper by its number of authors seems like overkill: a thousand-person paper is not typically going to get a thousand times as many citations. I also don’t know whether there are ways to game this system: since the metrics are based in part on citations given, not just citations received, I worry there are situations where it would be to someone’s advantage to cite others less. I think they manage to avoid this by normalizing by number of citations given, and they emphasize that PageRank itself is estimating something we directly care about: how often people read a paper. Still, it would be good to see more rigorous work probing the system for weaknesses.

In addition to the proposed metrics, Strumia and Torre’s paper is full of interesting statistics about the arXiv and InSpire databases, both using more traditional metrics and their new ones. Whether or not the methods they propose work out, the paper is definitely worth a look.

An Elliptical Workout

I study scattering amplitudes, probabilities that particles scatter off each other.

In particular, I’ve studied them using polylogarithmic functions. Polylogarithmic functions can be taken apart into “logs”, which obey identities much like logarithms do. They’re convenient and nice, and for my favorite theory of N=4 super Yang-Mills they’re almost all you need.

Well, until ten particles get involved, anyway.

That’s when you start needing elliptic integrals, and elliptic polylogarithms. These integrals substitute one of the “logs” of a polylogarithm with an integration over an elliptic curve.

And with Jacob Bourjaily, Andrew McLeod, Marcus Spradlin, and Matthias Wilhelm, I’ve now computed one.

[Image: tenpointimage. Caption: “This one, to be specific”]

Our paper, The Elliptic Double-Box Integral, went up on the arXiv last night.

The last few weeks have been a frenzy of work, finishing up our calculations and writing the paper. It’s the fastest I’ve ever gotten a paper out, which has been a unique experience.

Computing this integral required new, so far unpublished tricks by Jake Bourjaily, as well as some rather powerful software and Mark Spradlin’s extensive expertise in simplifying polylogarithms. In the end, we got the integral into a “canonical” form, one other papers had proposed as the right way to represent it, with the elliptic curve in a form standardized by Weierstrass.

One of the advantages of fixing a “canonical” form is that it should make identities obvious. If two integrals are actually the same, then writing them according to the same canonical rules should make that clear. This is one of the nice things about polylogarithms, where these identities are really just identities between logs and the right form is comparatively easy to find.

Surprisingly, the form we found doesn’t do this. We can write down an integral in our “canonical” form that looks different, but really is the same as our original integral. The form other papers had suggested, while handy, can’t be the final canonical form.

What the final form should be, we don’t yet know. We have some ideas, but we’re also curious what other groups are thinking. We’re relatively new to elliptic integrals, and there are other groups with much more experience with them, some with papers coming out soon. As far as we know they’re calculating slightly different integrals, ones more relevant for the real world than for N=4 super Yang-Mills. It’s going to be interesting seeing what they come up with. So if you want to follow this topic, don’t just watch for our names on the arXiv: look for Claude Duhr and Falko Dulat, Luise Adams and Stefan Weinzierl. In the elliptic world, big things are coming.

An Amplitudes Flurry

Now that we’re finally done with flurries of snow here in Canada, in the last week arXiv has been hit with a flurry of amplitudes papers.

[Image: kitchener-construction. Caption: “We’re also seeing a flurry of construction, but that’s less welcome.”]

Andrea Guerrieri, Yu-tin Huang, Zhizhong Li, and Congkao Wen have a paper on what are known as soft theorems. Most famously studied by Weinberg, soft theorems are results about what happens when a particle in an amplitude becomes “soft”, that is, when its momentum becomes very small. Recently, these theorems have gained renewed interest, as new amplitudes techniques have allowed researchers to go beyond Weinberg’s initial results (to “sub-leading” order) in a variety of theories.

Guerrieri, Huang, Li, and Wen’s contribution to the topic looks like it clarifies things quite a bit. Previously, most of the papers I’d seen about this had been isolated examples. This paper ties the various cases together in a very clean way, and does important work in making some older observations more rigorous.

Vittorio Del Duca, Claude Duhr, Robin Marzucca, and Bram Verbeek wrote about transcendental weight in something known as the multi-Regge limit. I’ve talked about transcendental weight before: loosely, it’s counting the power of pi that shows up in formulas. The multi-Regge limit concerns amplitudes with very high energies, in which we have a much better understanding of how the amplitudes should behave. I’ve used this limit before, to calculate amplitudes in N=4 super Yang-Mills.
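As a rough illustration of how the counting works, here are the conventional weight assignments together with one classic identity. These are standard textbook examples chosen for illustration, not formulas taken from the paper.

```latex
% Conventional weight assignments (rational numbers have weight zero):
\text{weight}(\pi) = 1, \qquad \text{weight}(\log x) = 1, \qquad
\text{weight}(\zeta(n)) = n, \qquad \text{weight}(\mathrm{Li}_n(x)) = n.

% Weights add under multiplication, so every term here has weight two:
\mathrm{Li}_2(x) + \mathrm{Li}_2(1 - x) = \frac{\pi^2}{6} - \log(x)\,\log(1 - x).
```

A function of “uniform” weight is one where every term counts the same way, as in this identity.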

One slogan I love to repeat is that N=4 super Yang-Mills isn’t just a toy model, it’s the most transcendental part of QCD. I’m usually fairly vague about this, because it’s not always true: while often a calculation in N=4 super Yang-Mills will give the part of the same calculation in QCD with the highest power of pi, this isn’t always the case, and it’s hard to propose a systematic principle for when it should happen. Del Duca, Duhr, Marzucca, and Verbeek’s work is a big step in that direction. While some descriptions of the multi-Regge limit obey this property, others don’t, and in looking at the ones that don’t the authors gain a better understanding of what sorts of theories only have a “maximally transcendental part”. What they find is that even when such theories aren’t restricted to N=4 super Yang-Mills, they have shared properties, like supersymmetry and conformal symmetry. Somehow these properties are tied to the transcendentality of functions in the amplitude, in a way that’s still not fully understood.

My colleagues at Perimeter released two papers over the last week: one, by Freddy Cachazo and Alfredo Guevara, uses amplitudes techniques to look at classical gravity, while the other, by Sebastian Mizera and Guojun Zhang, looks at one of the “pieces” inside string theory amplitudes.

I worked with Freddy and Alfredo on an early version of their result, back at the PSI Winter School. While I was off lazing about in Santa Barbara, they were hard at work trying to understand how the quantum-looking “loops” one can use to make predictions for potential energy in classical gravity are secretly classical. What they ended up finding was a trick to figure out whether a given amplitude was going to have a classical part or be purely quantum. So far, the trick works for amplitudes with one loop, and a few special cases at higher loops. It’s still not clear if it works for the general case, and there’s a lot of work still to do to understand what it means, but it definitely seems like an idea with potential. (Pun mostly not intended.)

I’ve talked before about “Z theory”, the weird thing you get when you isolate the “stringy” part of string theory amplitudes. What Sebastian and Guojun have carved out isn’t quite the same piece, but it’s related. I’m still not sure of the significance of cutting string amplitudes up in this way; I’ll have to read the paper more thoroughly (or chat with the authors) to find out.

A Tale of Two Archives

When it comes to articles about theoretical physics, I have a pet peeve, one made all the more annoying by the fact that it appears even in pieces that are otherwise well written. It involves the following disclaimer:

“This article has not been peer-reviewed.”

Here’s the thing: if you’re dealing with experiments, peer review is very important. Plenty of experiments have subtle problems with their methods, enough that it’s important to have a group of experts who can check them. In experimental fields, you really shouldn’t trust things that haven’t been through a journal yet: there’s just a lot that can go wrong.

In theoretical physics, though, peer review is important for different reasons. Most papers are mathematically rigorous enough that they’re not going to be wrong per se, and most of the ways they could be wrong won’t be caught by peer review. While peer review sometimes does catch mistakes, much more often it’s about assessing the significance of a result. Peer review determines whether a result gets into a prestigious journal or a less prestigious one, which in turn matters for job and grant applications.

As such, it doesn’t really make sense for a journalist to point out that a theoretical physics paper hasn’t been peer reviewed yet. If you think it’s important enough to write an article about, then you’ve already decided it’s significant: peer review wasn’t going to tell you anything else.

We physicists post our papers to arXiv, a free-to-access paper repository, before submitting them to journals. While arXiv does have some moderation, it’s not much: pretty much anyone in the field can post whatever they want.

This leaves a lot of people confused. In that sort of system, how do we know which papers to trust?

Let’s compare to another archive: Archive of Our Own, or AO3 for short.

Unlike arXiv, AO3 hosts not physics, but fanfiction. However, like arXiv it’s quite lightly moderated and free to access. On arXiv you want papers you can trust, on AO3 you want stories you enjoy. In each case, if anyone can post, how do you find them?

The first step is filtering. AO3 and arXiv both have systems of tags and subject headings. The headings on arXiv are simpler and more heavily moderated than those on AO3, but they both serve the purpose of letting people filter for the subjects, whether scientific or fictional, that they find interesting. If you’re interested in astrophysics, try astro-ph on arXiv. If you want Harry Potter fanfiction, try the “Harry Potter – J.K. Rowling” tag on AO3.

Beyond that, it helps to pay attention to authors. When an author has written something you like, it’s worth it not only to keep up with other things they write, but to see which other authors they like and pay attention to them as well. That’s true whether the author is Juan Maldacena or your favorite source of Twilight fanfic.

Even if you follow all of this, you can’t trust every paper you find on arXiv. You also won’t enjoy everything you dig up on AO3. Either way, publication (in journals or books) won’t solve your problem: both are an additional filter, but not an infallible one. Judgement is still necessary.

This is all to say that “this article has not been peer-reviewed” can be a useful warning, but often isn’t. In theoretical physics, knowing who wrote an article and what it’s about will often tell you much more than whether or not it’s been peer-reviewed yet.

arXiv vs. snarXiv: Can You Tell the Difference?

Have you ever played arXiv vs snarXiv?

arXiv is a preprint repository: it’s where we physicists put our papers before they’re published in journals.

snarXiv is…well…sound it out.

A creation of David Simmons-Duffin, snarXiv randomly generates titles and abstracts out of trendy arXiv buzzwords. It’s designed so that the papers on it look almost plausible…until you take a closer look, anyway.
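To get a feel for how random buzzword titles can come out looking almost plausible, here is a minimal sketch of a grammar-based generator. To be clear, this is a toy: the rules, the word lists, and the names `GRAMMAR`, `expand`, and `fake_title` are all invented for illustration, and none of it is taken from snarXiv itself.

```python
import random

# Toy grammar, made up for this example. Keys are placeholders to be
# expanded; anything that is not a key is treated as literal text.
GRAMMAR = {
    "TITLE": [
        ["ADJ", "NOUN", "in", "THEORY"],
        ["NOUN", "and", "NOUN", "in", "THEORY"],
        ["Towards", "ADJ", "NOUN"],
    ],
    "ADJ": [["Holographic"], ["Non-Perturbative"], ["Entropic"]],
    "NOUN": [["Black Holes"], ["Instantons"], ["Moduli Spaces"]],
    "THEORY": [["N=4 Super Yang-Mills"], ["Type IIB String Theory"]],
}


def expand(symbol, rng):
    """Recursively replace a placeholder with one randomly chosen rule."""
    if symbol not in GRAMMAR:
        return symbol  # literal text, nothing left to expand
    production = rng.choice(GRAMMAR[symbol])
    return " ".join(expand(part, rng) for part in production)


def fake_title(seed=None):
    """Generate one fake title; pass a seed to get a repeatable result."""
    return expand("TITLE", random.Random(seed))


if __name__ == "__main__":
    for _ in range(3):
        print(fake_title())
```

Even with a grammar this small, each piece is a real buzzword and the sentence structure is fine, so the output only looks wrong once you ask whether the pieces belong together.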

Hence the game, arXiv vs snarXiv. Given just the titles of two papers, can you figure out which one is real, and which is fake?

I played arXiv vs snarXiv for a bit today, waiting for some code to run. Out of twenty questions, I only got two wrong.

Sometimes, it was fairly clear which paper was fake because snarXiv overreached. By trying to pile on too many buzzwords, it ended up with a title that repeated itself, or didn’t quite work grammatically.

Other times, I had to use some actual physics knowledge. Usually, this meant noticing when a title tied together unrelated areas in an implausible way. When a title claims to tie obscure mathematical concepts from string theory to a concrete problem in astronomy, it’s pretty clearly snarXiv talking.

The toughest questions, including the ones I got wrong, were when snarXiv went for something subtle. For short enough titles, the telltale signs of snarXiv were suppressed. There just weren’t enough buzzwords for a mistake to show up. I’m not sure there’s a way to distinguish titles like that, even for people in the relevant sub-field.

How well do you do at arXiv vs snarXiv? Any tips?

arXiv, Our Printing Press

Johannes Gutenberg, inventor of the printing press, and possibly the only photogenic thing on the Mainz campus

I’ve had a few occasions to dig into older papers recently, and I’ve noticed a trend: old papers are hard to read!

Ok, that might not be surprising. The older a paper is, the greater the chance it will use obsolete notation, or assume a context that has long passed by. Older papers have different assumptions about what matters, or what rigor requires, and their readers cared about different things. All this is to be expected: a slow, gradual approach to a modern style and understanding.

I’ve been noticing, though, that this slow, gradual approach doesn’t always hold. Specifically, it seems to speed up quite dramatically at one point: the introduction of arXiv, the website where we store all our papers.

Part of this could just be a coincidence. As it happens, the founding papers in my subfield, those that started Amplitudes with a capital “A”, appeared right around the time that arXiv first got going. It could be that all I’m noticing is the difference between Amplitudes and “pre-Amplitudes”, with people in the Amplitudes subfield sharing notation more than they did before they had a shared identity.

But I suspect that something else is going on. With arXiv, we don’t just share papers (that was done, piecemeal, before arXiv). We also share LaTeX.

LaTeX is a document formatting language, like a programming language for papers. It’s used pretty much universally in physics and math, and increasingly in other fields. As it turns out, when we post a paper to arXiv, we don’t just send a PDF: we include the raw LaTeX code as well.

Before arXiv, if you wanted to include an equation from another paper, you’d format it yourself. You’d probably do it a little differently from the other paper, in accord with your own conventions, and just to make it easier on yourself. Over time, more and more differences would crop up, making older papers harder and harder to read.

With arXiv, you can still do all that. But you can also just copy.

Since arXiv makes the LaTeX code behind a paper public, it’s easy to lift the occasional equation. Even if you’re not lifting it directly, you can see how they coded it. Even if you don’t plan on copying, the default gets flipped around: instead of trying to make your equation match the one in the previous paper and accidentally getting it wrong, you can start from an exact copy, so every difference is intentional.

This reminds me, in a small-scale way, of the effect of the printing press on anatomy books.

Before the printing press, books on anatomy tended to be full of descriptions, but not illustrations. Illustrations weren’t reliable: there was no guarantee the monk who copied them would do so correctly, so nobody bothered. This made it hard to tell when an anatomist (fine, it was always Galen) was wrong: he could just be using an odd description. It was only after the printing press that books could actually have illustrations that were reliable across copies of a book. Suddenly, it was possible to point out that a fellow anatomist had left something out: it would be missing from the illustration!

In a similar way, arXiv seems to have led to increasingly standard notation. We still aren’t totally consistent…but we do seem a lot more consistent than older papers, and I think arXiv is the reason why.

Who Plagiarizes an Acknowledgements Section?

I’ve got plagiarists on the brain.

Maybe it was running into this interesting discussion about a plagiarized application for the National Science Foundation’s prestigious Graduate Research Fellowship Program. Maybe it’s due to the talk Paul Ginsparg, founder of arXiv, gave this week about, among other things, detecting plagiarism.

Using arXiv’s repository of every paper someone in physics thought was worth posting, Ginsparg has been using statistical techniques to sift out cases of plagiarism. Probably the funniest cases involved people copying a chunk of their thesis acknowledgements section, as excerpted here. Compare:

“I cannot describe how indebted I am to my wonderful girlfriend, Amanda, whose love and encouragement will always motivate me to achieve all that I can. I could not have written this thesis without her support; in particular, my peculiar working hours and erratic behaviour towards the end could not have been easy to deal with!”

“I cannot describe how indebted I am to my wonderful wife, Renata, whose love and encouragement will always motivate me to achieve all that I can. I could not have written this thesis without her support; in particular, my peculiar working hours and erratic behaviour towards the end could not have been easy to deal with!”
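For a sense of how a copy like this can be flagged automatically, here is a minimal sketch of one standard approach: break each text into overlapping runs of words and measure how many runs the two texts share. This is an assumption on my part about what “statistical techniques” could look like, not a description of Ginsparg’s actual method; the function names `shingles` and `overlap`, and the choice of five-word runs, are hypothetical.

```python
import re


def shingles(text, n=5):
    """Return the set of overlapping n-word runs in a text."""
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap(a, b, n=5):
    """Jaccard similarity of two texts' n-word runs, from 0.0 to 1.0."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0  # a text shorter than n words has nothing to compare
    return len(sa & sb) / len(sa | sb)


# First sentence of each acknowledgement quoted above.
first = ("I cannot describe how indebted I am to my wonderful girlfriend, "
         "Amanda, whose love and encouragement will always motivate me to "
         "achieve all that I can.")
second = ("I cannot describe how indebted I am to my wonderful wife, "
          "Renata, whose love and encouragement will always motivate me to "
          "achieve all that I can.")
# An unrelated sentence, for contrast.
unrelated = ("Old papers are hard to read because they use obsolete "
             "notation and assume a context that has long passed by.")

if __name__ == "__main__":
    print("copied pair:   ", overlap(first, second))
    print("unrelated pair:", overlap(first, unrelated))
```

Swapping out a name and a relationship only disturbs the few runs that touch those words, so the copied pair still shares most of its runs, while two independently written sentences almost never share a five-word run at all.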

Why would someone do this? Copying the scientific part of a thesis makes sense, in a twisted way: science is hard! But why would someone copy the fluff at the end, the easy part that’s supposed to be a genuine take on your emotions?

The thing is, the acknowledgements section of a thesis isn’t exactly genuine. It’s very formal: a required section of the thesis, with tacit expectations about what’s appropriate to include and what isn’t. It’s also the sort of thing you only write once in your life: while published papers also have acknowledgements sections, they’re typically much shorter, and have different conventions.

If you ever were forced to write thank-you notes as a kid, you know where I’m going with this.

It’s not that you don’t feel grateful: you do! But when you feel grateful, you express it by saying “thank you” and moving on. Writing a note about it isn’t very intuitive. It’s not a way you’re used to expressing gratitude, so the whole experience feels like you’re just following a template.

Literally in some cases.

That sort of situation, where it doesn’t matter how strongly you feel something, only whether you express it in the right way, is a breeding ground for plagiarism. Aunt Mildred isn’t going to care what you write in your thank-you note, and Amanda/Renata isn’t going to be moved by your acknowledgements section. It’s so easy to decide, in that kind of situation, that it’s better to just grab whatever appropriate text you can than to teach yourself a new style of writing.

In general, plagiarism happens because there’s a disconnect between incentives and what they’re meant to be for. In a world where very few beginning graduate students actually have a solid research plan, the NSF’s fellowship application feels like a demand for creative lying, not an honest way to judge scientific potential. In countries eager for highly-cited faculty but low on preexisting experts able to judge scientific merit, tenure becomes easier to get by faking a series of papers than by doing the actual work.

If we want to get rid of plagiarism, we need to make sure our incentives match our intent. We need a system in which people succeed when they do real work and get fellowships when they honestly have talent, and in which we care about whether someone was grateful, not how they express it. If we can’t do that, then there will always be people trying to sneak through the cracks.