At Quanta This Week, With a Piece on Multiple Imputation

I’ve got another piece in Quanta Magazine this week.

While my past articles in Quanta have been about physics, this time I’m stretching my science journalism muscles in a new direction. I was chatting with a friend who works for a pharmaceutical company, and he told me about a statistical technique that sounded ridiculous. Luckily, he’s a patient person, and after annoying him and a statistician family member for a while I understood that the technique actually made sense. Since I love sharing counterintuitive facts, I thought this would be a great story to share with Quanta’s readers. I then tracked down more statisticians, and annoyed them in a more professional way, finally resulting in the Quanta piece.

The technique is called multiple imputation, and it is a way to deal with missing data. By filling in (“imputing”) missing information with good enough guesses, you can treat a dataset with missing data as if it were complete. If you do this imputation multiple times with the help of a source of randomness, you can also model how uncertain those guesses are, so your final statistical estimates are as uncertain as they ought to be. That, in a nutshell, is multiple imputation.
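To make the nutshell version a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the toy dataset, the function name `multiply_impute_mean`, the choice of a straight-line regression as the imputation model, and the use of a bootstrap to vary that model between imputations. The pooling step at the end follows the usual combining rules (within-imputation variance plus inflated between-imputation variance), but this is not meant as how any particular statistics package does it.

```python
import numpy as np

def multiply_impute_mean(x, y, n_imputations=20, seed=0):
    """Estimate mean(y) when some y are NaN, imputing from a regression on x."""
    rng = np.random.default_rng(seed)
    missing = np.isnan(y)
    x_obs, y_obs = x[~missing], y[~missing]
    n = len(y)

    estimates, variances = [], []
    for _ in range(n_imputations):
        # Refit the imputation model on a bootstrap resample of the observed
        # rows, so each imputation reflects uncertainty about the model itself.
        idx = rng.integers(0, len(y_obs), len(y_obs))
        slope, intercept = np.polyfit(x_obs[idx], y_obs[idx], 1)
        resid = y_obs[idx] - (intercept + slope * x_obs[idx])
        sigma = resid.std(ddof=2)

        # Fill each gap with the model's prediction plus random noise.
        y_filled = y.copy()
        y_filled[missing] = (intercept + slope * x[missing]
                             + rng.normal(0.0, sigma, missing.sum()))

        # Analyze the filled-in dataset as if it were complete.
        estimates.append(y_filled.mean())
        variances.append(y_filled.var(ddof=1) / n)

    estimates = np.array(estimates)
    within = np.mean(variances)          # average complete-data variance
    between = estimates.var(ddof=1)      # spread between imputations
    total_var = within + (1 + 1 / n_imputations) * between
    return estimates.mean(), np.sqrt(total_var)

# Toy data (made up): y depends on x, and about 30% of y goes missing.
rng = np.random.default_rng(42)
x = rng.normal(size=500)
y = 2.0 + 1.5 * x + rng.normal(size=500)
y_missing = y.copy()
y_missing[rng.random(500) < 0.3] = np.nan

estimate, std_err = multiply_impute_mean(x, y_missing)
print(f"pooled mean = {estimate:.3f}, standard error = {std_err:.3f}")
```

The between-imputation term is the part that makes the final estimate "as uncertain as it ought to be": if the guesses disagree a lot from one imputation to the next, the pooled standard error grows.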

In the piece, I try to cover the key points: how the technique came to be, how it spread, and why people use it. To complement that, in this post I wanted to get a little bit closer to the technical details, and say a bit about why some of the workarounds a naive physicist would come up with don’t actually work.

If you’re anything like me, multiple imputation sounds like a very weird way to deal with missing data. In order to fill in missing data, you have to use statistical techniques to find good guesses. Why can’t you just use the same techniques to analyze the data in the first place? And why do you have to use a random number generator to model your uncertainty, instead of just doing propagation of errors?

It turns out, you can sort of do both of these things. Full Information Maximum Likelihood is a method where you use all the data you have, and only the data you have, without imputing anything or throwing anything out. The catch is that you need a model, one with parameters you can try to find the most likely values for. Physicists usually do have a model like this (for example, the Standard Model), so I assumed everyone would. But for many things you want to measure in social science and medicine, you don’t have any such model, so multiple imputation ends up being more versatile in practice.
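As a rough illustration of "use all the data you have, and only the data you have", here is a sketch for one very special case where the maximum likelihood answer can be written down directly. The assumptions are mine, not from the Quanta piece: two jointly normal variables, `x` always observed and `y` sometimes missing, with the missingness depending only on `x`. In that case the likelihood splits into a piece for `x` (using every row) and a piece for `y` given `x` (using only the complete rows). The function name `fiml_mean_y` is hypothetical, and in general this kind of estimate needs a numerical optimizer rather than a closed form.

```python
import numpy as np

def fiml_mean_y(x, y):
    """Maximum likelihood estimate of mean(y) for a bivariate normal model
    where x is fully observed and y has gaps (marked as NaN)."""
    observed = ~np.isnan(y)
    # y given x: least squares on complete rows is the ML fit under normal errors.
    slope, intercept = np.polyfit(x[observed], y[observed], 1)
    # x alone: every row contributes, including rows where y is missing.
    mu_x = x.mean()
    return intercept + slope * mu_x

# Toy data (made up): y is mostly missing when x is positive.
rng = np.random.default_rng(1)
x = rng.normal(size=2000)
y = 2.0 + 1.5 * x + rng.normal(size=2000)
y_missing = y.copy()
y_missing[(x > 0) & (rng.random(2000) < 0.8)] = np.nan

complete_case_mean = np.nanmean(y_missing)   # throws out rows with gaps
fiml_mean = fiml_mean_y(x, y_missing)        # nothing imputed, nothing thrown out
print(f"complete cases only: {complete_case_mean:.3f}, model-based: {fiml_mean:.3f}")
```

Nothing is filled in here, but the estimate leans entirely on the assumed model, which is the catch described above.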

(If you want more detail on this, you need to read something written by actual statisticians. The aforementioned statistician family member has a website here that compares and contrasts multiple imputation with full information maximum likelihood.)

What about the randomness? It turns out there is yet another technique, called Fractional Imputation. While multiple imputation randomly chooses different values to impute, fractional imputation gives each value a weight based on the chance for it to come up. This gives the same result…if you can compute the weights, and store all the results. The impression I’ve gotten is that people are working on this, but it isn’t very well-developed.
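Here is a small sketch of the difference, under assumptions I have chosen to keep it simple: the missing variable `z` is a yes/no value, the chance of "yes" is estimated separately within each of two groups, and that chance is held fixed across imputations (a real multiple imputation would also vary it). The function names `fractional_estimate` and `multiple_estimate` are hypothetical. With weights, each missing row counts as a fraction of a "yes"; with random draws, the same answer only emerges on average.

```python
import numpy as np

def group_probabilities(z, group):
    """Per-row chance that z == 1, estimated from observed rows in the same group."""
    missing = np.isnan(z)
    p_row = np.empty(len(z))
    for g in np.unique(group):
        in_g = group == g
        p_row[in_g] = z[in_g & ~missing].mean()
    return p_row

def fractional_estimate(z, group):
    """Proportion of z == 1, replacing each gap by weighted copies of 0 and 1."""
    missing = np.isnan(z)
    p_row = group_probabilities(z, group)
    # A missing row becomes z=1 with weight p and z=0 with weight 1-p,
    # so it contributes exactly p to the count of ones.
    return (np.nansum(z) + p_row[missing].sum()) / len(z)

def multiple_estimate(z, group, n_imputations, rng):
    """Same quantity, but filling gaps with random draws and averaging."""
    missing = np.isnan(z)
    p_row = group_probabilities(z, group)
    estimates = []
    for _ in range(n_imputations):
        filled = z.copy()
        filled[missing] = rng.random(missing.sum()) < p_row[missing]
        estimates.append(filled.mean())
    return np.mean(estimates)

# Toy data (made up): two groups with different rates, 25% of z missing.
rng = np.random.default_rng(7)
group = rng.integers(0, 2, 400)
z = (rng.random(400) < np.where(group == 0, 0.2, 0.7)).astype(float)
z[rng.random(400) < 0.25] = np.nan

fi = fractional_estimate(z, group)
mi = multiple_estimate(z, group, n_imputations=2000, rng=rng)
print(f"fractional: {fi:.4f}, multiple (2000 draws): {mi:.4f}")
```

With only two possible values the weights are trivial to compute and store. For continuous variables, or many variables missing at once, that bookkeeping is where it gets hard.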

“Just do propagation of errors”, the thing I wanted to suggest as a physicist, is much less of an option. In many of these datasets, you don’t attribute errors to the base data points to begin with. And on the other hand, if you want to be more sophisticated, then something like propagation of errors is too naive. You have a variety of different variables, correlated with each other in different ways, giving a complicated multivariate distribution. Propagation of errors is already pretty fraught when you go beyond linear relationships (something they don’t tend to tell baby physicists); using it for this would be pushing it rather too far.
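To see how quickly propagation of errors goes wrong beyond linear relationships, here is a minimal sketch with a single variable. The example is my own choice: a normally distributed `x` pushed through `f(x) = x**2`, with hypothetical function names. The first-order formula uses only the slope of `f` at the central value, so it misses the contribution from the curvature.

```python
import numpy as np

def linear_propagation_sd(mu, sigma):
    """First-order error propagation for f(x) = x**2: |f'(mu)| * sigma."""
    return abs(2.0 * mu) * sigma

def monte_carlo_sd(mu, sigma, n_samples=200_000, seed=0):
    """Standard deviation of x**2 estimated by sampling x ~ Normal(mu, sigma)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu, sigma, n_samples)
    return (x ** 2).std(ddof=1)

mu, sigma = 0.5, 1.0
sd_linear = linear_propagation_sd(mu, sigma)
sd_sampled = monte_carlo_sd(mu, sigma)
# For a normal x, Var(x**2) = 4*mu**2*sigma**2 + 2*sigma**4 exactly.
sd_exact = np.sqrt(4 * mu**2 * sigma**2 + 2 * sigma**4)

print(f"linear propagation: {sd_linear:.3f}")
print(f"sampled:            {sd_sampled:.3f}")
print(f"exact:              {sd_exact:.3f}")
```

Here the linear formula gives 1.0 while the exact answer is the square root of 3, about 1.73, and at `mu = 0` the linear formula would claim no uncertainty at all. That is with one variable and a simple function, before any correlations enter.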

The thing I next wanted to suggest, “just carry the distribution through the calculation”, turns out to relate to something I’ve called the “one philosophical problem of my sub-field”. In the area of physics I’ve worked in, a key question is what it means to have “done” an integral. Here, one can ask what it means to do a calculation on a distribution. In both cases, the end goal is to get numbers out: physics predictions on the one hand, statistical estimates on the other. You can get those numbers by “just” doing numerics, using randomness and approximations to estimate the number you’re interested in. And in a way, that’s all you can do. Any time you “just do the integral” or “just carry around the distribution”, the thing you get in the end is some function: it could be a well-understood function like a sine or log, or it could be an exotic function someone defined for that purpose. But whatever function you get, you get numbers out of it the same way. A sine or a log, on a computer, is just an approximation scheme, a program that outputs numbers.

(But we do still care about analytic results; we don’t “just” do numerics. That’s because understanding the analytics helps us do numerics better: we can get more precise numbers faster and more stably. If you’re just carrying around some arbitrarily wiggly distribution, it’s not clear you can do that.)

So at this point, I get it. I’m still curious to see how Fractional Imputation develops, and when I do have an actual model I’d lean toward using Full Information Maximum Likelihood instead. (And there are probably some other caveats I may need to learn at some point!) But I’m comfortable with the idea that Multiple Imputation makes sense for the people using it.
