American Journal of Ophthalmology
Volume 148, Issue 6 , Pages 820-822, December 2009

Missing Data: What a Little Can Do, and What Researchers Can Do in Response

  • Thomas R. Belin

    • Inquiries to Thomas R. Belin, UCLA Department of Biostatistics, 51-267 Center for Health Sciences, Los Angeles, CA 90095

Department of Biostatistics, UCLA School of Public Health, and Department of Psychiatry and Biobehavioral Sciences, David Geffen School of Medicine at UCLA, Los Angeles, California

Accepted 20 July 2009.

If only 5% of the data values are missing, is it OK to drop the cases with incomplete data and analyze only the cases with complete data? Questions along these lines frequently are directed to statisticians by applied researchers. In considering the potential impact of missing data, an example illustrates why such an apparently simple question does not have a simple answer.

Suppose a two-arm randomized study compares two cataract surgery protocols (standard and new), with 20 patients receiving each treatment and with the primary outcome, Y, being the number of lines of improvement between baseline and 6 months after surgery on a Snellen visual acuity (VA) chart. Suppose further that baseline measurements also are obtained for a clinical characteristic, X (eg, preoperative macular thickness), that is highly predictive of outcomes in both arms of the study. A hypothetical set of data values for such a study is shown in the Table, listing a patient identification number, macular thickness (X) in micrometers, and lines of improvement on VA (Y) between baseline and 6 months. Note that the groups have identical distributions of X, so confounding by X is not a concern in comparing outcomes on Y. Note also that X and Y are highly correlated, with sample correlations of r_std = 0.81 and r_new = 0.79 based on the 20 patients in each group. If there were no missing data, one could calculate the average outcome Ȳ_new = 2.30 in the new-treatment group and Ȳ_std = 1.60 in the standard-treatment group; familiar inference procedures would yield a 95% confidence interval (CI) of 0.02 to 1.38 for the difference between group means, corresponding to a P value of .044 based on a two-sample t test.

TABLE. Hypothetical Data from a Two-Arm Randomized Study: Macular Thickness (X) from Baseline Assessment, Visual Acuity Improvement (Y) Based in Part on 6-Month Assessment

Standard Treatment                                              New Treatment
Patient  Macular Thickness (X)  Visual Acuity Improvement (Y)a  Patient  Macular Thickness (X)  Visual Acuity Improvement (Y)a
101      180                    □0                              201      180                    0
102      180                    0                               202      180                    1
103      195                    1                               203      195                    1
104      195                    1                               204      195                    2
105      210                    0                               205      210                    1
106      210                    2                               206      210                    2
107      225                    1                               207      225                    2
108      225                    2                               208      225                    3
109      240                    1                               209      240                    2
110      240                    2                               210      240                    3
111      255                    1                               211      255                    1
112      255                    2                               212      255                    3
113      270                    1                               213      270                    2
114      270                    2                               214      270                    3
115      285                    2                               215      285                    2
116      285                    3                               216      285                    3
117      300                    2                               217      300                    3
118      300                    3                               218      300                    4
119      315                    3                               219      315                    4
120      315                    3                               220      315                    □4

a Values in boxes (□) are treated as missing in some analyses.

Suppose further, however, that one patient in each arm failed to appear for the 6-month evaluation, implying 5% missing data in each arm (ie, 1/20 = 5%). In this example, one missing outcome is associated with a standard-treatment patient whose baseline macular thickness is tied for the smallest observed value, and the other is associated with a new-treatment patient whose baseline macular thickness is tied for the largest observed value. Analyzing the remaining 19 patients in each arm yields average outcomes of Ȳ*_new = 2.21 and Ȳ*_std = 1.68, with a 95% CI for the mean difference of −0.14 to 1.20, corresponding to a P value of .120 from a two-sample t test.
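The two analyses described so far can be checked directly from the Table. The following is a minimal sketch in Python using only the standard library; the helper names (`t_two_sided_p`, `t_critical`, `pooled_t_test`) are hypothetical, and the Student t tail area is obtained by simple numerical integration as a stand-in for a statistics package, not as the method used for the article.

```python
import math
from statistics import mean, variance

# Outcomes (Y) transcribed from the Table, patients 101-120 and 201-220.
Y_STD = [0, 0, 1, 1, 0, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 3, 2, 3, 3, 3]
Y_NEW = [0, 1, 1, 2, 1, 2, 2, 3, 2, 3, 1, 3, 2, 3, 2, 3, 3, 4, 4, 4]


def t_two_sided_p(t, df, steps=2000):
    """Two-sided P value for Student's t via Simpson integration of the density."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    pdf = lambda u: math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))
    a = abs(t)
    h = a / steps
    area = pdf(0.0) + pdf(a)
    for k in range(1, steps):
        area += (4 if k % 2 else 2) * pdf(k * h)
    return max(0.0, 1.0 - 2.0 * area * h / 3.0)


def t_critical(df, alpha=0.05):
    """Critical value with two-sided tail area alpha, found by bisection."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_two_sided_p(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def pooled_t_test(a, b):
    """Equal-variance two-sample t test for mean(a) - mean(b)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    s2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    se = math.sqrt(s2 * (1 / na + 1 / nb))
    diff = mean(a) - mean(b)
    half = t_critical(df) * se
    return diff, (diff - half, diff + half), t_two_sided_p(diff / se, df)


full = pooled_t_test(Y_NEW, Y_STD)
# Complete-case analysis: drop patient 101 (first standard) and 220 (last new).
cc = pooled_t_test(Y_NEW[:-1], Y_STD[1:])

for label, (diff, (lo, hi), p) in [("all 40 patients", full), ("complete cases", cc)]:
    print(f"{label}: difference {diff:.2f}, 95% CI {lo:.2f} to {hi:.2f}, P = {p:.3f}")
```

Up to rounding, the two printed lines should agree with the intervals and P values quoted in the text for the full data and for the 19 complete cases per arm.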

Although it is worth reemphasizing that the widespread use of α = 0.05 as a significance level is a scientific convention that should not be viewed as sacred, the contrast between P < .05 in the original scenario and P > .10 after losing only 5% of the data gives grounds for pause. Under these circumstances, it is reasonable to wonder whether the partially observed cases with available baseline measurements on X can be used to enhance the precision of the outcome analysis, especially because baseline values of X are so strongly correlated with the 6-month outcomes Y.

One method for incorporating X into such an analysis is known as multiple imputation [1–3]. Multiple imputation is a well-established general strategy for handling missing data that makes use of available data (including covariate values) to fill in plausible values of missing items. To avoid exaggerating the precision of the inference, one produces several (eg, 5) plausible values for each missing item and carries out a separate analysis for each so-called completed data set. A key ingredient in the overall inference is to combine between-imputation variance (ie, variation across imputed data sets in estimates of the target quantity, which reflects uncertainty because of values being missing) with within-imputation variance (ie, the average of the squared standard errors from the separate analyses, which reflects uncertainty because of having one rather than another sample of size 20). To estimate the between-imputation variance, more than one imputation is needed; filling in only a single value (ie, a single imputation) exaggerates precision by pretending that there is no uncertainty about the values of missing items.
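The combination of between-imputation and within-imputation variance described above is commonly written as Rubin's rules (reference 1). The symbols below are notation introduced here for illustration and do not appear in the original text: m is the number of imputations, Q̂_i is the estimate from the i-th completed data set (here, the difference between group means), and U_i is its squared standard error.

```latex
\begin{align*}
\bar{Q} &= \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i
  && \text{(pooled estimate)}\\
\bar{U} &= \frac{1}{m}\sum_{i=1}^{m}U_i
  && \text{(within-imputation variance)}\\
B &= \frac{1}{m-1}\sum_{i=1}^{m}\bigl(\hat{Q}_i-\bar{Q}\bigr)^{2}
  && \text{(between-imputation variance)}\\
T &= \bar{U}+\Bigl(1+\frac{1}{m}\Bigr)B
  && \text{(total variance)}\\
\nu &= (m-1)\Bigl[1+\frac{\bar{U}}{(1+1/m)\,B}\Bigr]^{2}
  && \text{(degrees of freedom)}
\end{align*}
```

Inference then treats (Q̄ − Q)/√T as approximately Student t with ν degrees of freedom. With a single imputation (m = 1), B cannot be computed, which is why more than one imputation is needed.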

To illustrate the method, we produced 5 multiple imputations for each of the missing items in the hypothetical example. Multiple imputation software is available in many statistical packages, although there are differences in programs and underlying modeling assumptions, and inferences can be sensitive to details of the imputation procedure. The implementation here used SAS software (SAS Institute Inc, Cary, North Carolina, USA), specifically the procedures PROC MI and PROC MIANALYZE, with missing values of VA improvement imputed based on a linear regression of VA improvement (Y) on macular thickness (X). (More generally, a modern statistical computing strategy known as Markov chain Monte Carlo can be used to produce imputations; this approach, which was used here, reduces to regression-based imputation when predictor variables are observed completely, as in the present setting.) The 5 imputed values for the standard-treatment case (Y_101) were −1.47, 0.81, 0.48, 0.66, and −0.50, and the 5 imputed values for the new-treatment case (Y_220) were 3.39, 3.73, 3.59, 3.31, and 2.74. Using these values, the mean difference between groups was estimated to be 0.67 with a 95% CI of 0.01 to 1.33, corresponding to an overall P value of P* = .047.
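As a check on the arithmetic, the pooled inference can be recomputed from the 5 pairs of imputed values reported above. This is a sketch of the standard combining rules in plain Python, not a reproduction of PROC MIANALYZE; the function and variable names are hypothetical, and the t tail area is again computed by numerical integration rather than by a statistics package.

```python
import math
from statistics import mean, variance

# Observed outcomes from the Table, excluding the two boxed values.
Y_STD_OBS = [0, 1, 1, 0, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 3, 2, 3, 3, 3]  # 102-120
Y_NEW_OBS = [0, 1, 1, 2, 1, 2, 2, 3, 2, 3, 1, 3, 2, 3, 2, 3, 3, 4, 4]  # 201-219
# Imputed values as reported in the text.
IMPUTED_101 = [-1.47, 0.81, 0.48, 0.66, -0.50]
IMPUTED_220 = [3.39, 3.73, 3.59, 3.31, 2.74]


def t_two_sided_p(t, df, steps=2000):
    """Two-sided P value for Student's t via Simpson integration of the density."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    pdf = lambda u: math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))
    a = abs(t)
    h = a / steps
    area = pdf(0.0) + pdf(a)
    for k in range(1, steps):
        area += (4 if k % 2 else 2) * pdf(k * h)
    return max(0.0, 1.0 - 2.0 * area * h / 3.0)


def t_critical(df, alpha=0.05):
    """Critical value with two-sided tail area alpha, found by bisection."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if t_two_sided_p(mid, df) > alpha else (lo, mid)
    return (lo + hi) / 2


def diff_and_se2(new, std):
    """Difference in means and its squared standard error (pooled variance)."""
    n1, n2 = len(new), len(std)
    s2 = ((n1 - 1) * variance(new) + (n2 - 1) * variance(std)) / (n1 + n2 - 2)
    return mean(new) - mean(std), s2 * (1 / n1 + 1 / n2)


# One completed-data analysis per imputation.
results = [diff_and_se2(Y_NEW_OBS + [y220], Y_STD_OBS + [y101])
           for y101, y220 in zip(IMPUTED_101, IMPUTED_220)]
estimates = [q for q, _ in results]
m = len(results)

q_bar = mean(estimates)                      # pooled estimate
u_bar = mean([u for _, u in results])        # within-imputation variance
b = variance(estimates)                      # between-imputation variance
total = u_bar + (1 + 1 / m) * b
nu = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2

se = math.sqrt(total)
half = t_critical(nu) * se
ci = (q_bar - half, q_bar + half)
p_star = t_two_sided_p(q_bar / se, nu)
print(f"difference {q_bar:.2f}, 95% CI {ci[0]:.2f} to {ci[1]:.2f}, P* = {p_star:.3f}")
```

In this example the between-imputation variance is small relative to the within-imputation variance, so the degrees of freedom are large and the interval is only slightly wider than the one from the full data.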

We also considered rounding the imputed values to the nearest whole number (ie, −1, 1, 0, 1, 0 for Y_101 and 3, 4, 4, 3, 3 for Y_220), which had only a slight impact on inferences (estimated mean difference, 0.66; 95% CI, 0.00 to 1.31; P* = .048). Either way, by making use of evidence in the data that X and Y are highly correlated, the procedure is able to recover information that was lost when values were omitted from the analysis.

Considering the possibility that chance variation might have played a role in producing a result that was just barely significant at the 0.05 level, we implemented the multiple imputation procedure with an equally valid but slightly different user-supplied setting governing random number generation and obtained a P value of P* = .055. Five more analogous perturbations of the procedure yielded successively P* = .060, .050, .053, .059, and .040. If greater precision were desired, the number of imputations could be increased. Although there is clearly a modest amount of sensitivity to user-supplied settings, it also seems clear that the finding of P = .120 based only on the cases with complete data understates the significance of the difference. The point of this example is not to suggest that multiple imputation will always produce significant results, but rather that it can incorporate information from partially observed cases, which can mitigate bias and improve precision without exaggerating significance levels.
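The dependence on the random number seed can be illustrated with a small regression-based imputation sketch. Several choices below are assumptions made for illustration rather than details taken from the article or from PROC MI: the regression of Y on X is fitted separately within each arm, parameters are drawn from their posterior under a noninformative prior, and the pooled P value uses a normal approximation to the reference t distribution. The function names are hypothetical, and the numbers produced will not match the SAS results.

```python
import math
import random
from statistics import NormalDist, mean, variance

# Data transcribed from the Table; None marks the two outcomes treated as missing.
X = [x for x in range(180, 316, 15) for _ in range(2)]  # 180, 180, 195, ..., 315, 315
Y_STD = [None, 0, 1, 1, 0, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 3, 2, 3, 3, 3]
Y_NEW = [0, 1, 1, 2, 1, 2, 2, 3, 2, 3, 1, 3, 2, 3, 2, 3, 3, 4, 4, None]


def impute_arm(x, y, rng):
    """Return a completed copy of y, drawing each missing value from a
    Bayesian linear regression of y on x fitted to the observed pairs."""
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(obs)
    x_bar = mean([xi for xi, _ in obs])
    y_bar = mean([yi for _, yi in obs])
    sxx = sum((xi - x_bar) ** 2 for xi, _ in obs)
    slope_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in obs) / sxx
    sse = sum((yi - y_bar - slope_hat * (xi - x_bar)) ** 2 for xi, yi in obs)
    # Parameter draws: residual variance, then level (at x_bar) and slope.
    sigma = math.sqrt(sse / rng.gammavariate((n - 2) / 2, 2.0))  # sse / chi-square(n-2)
    level = rng.gauss(y_bar, sigma / math.sqrt(n))
    slope = rng.gauss(slope_hat, sigma / math.sqrt(sxx))
    return [yi if yi is not None
            else rng.gauss(level + slope * (xi - x_bar), sigma)
            for xi, yi in zip(x, y)]


def pooled_p_value(seed, m=5):
    """Impute m times with the given seed and pool the difference in means."""
    rng = random.Random(seed)
    estimates, variances = [], []
    for _ in range(m):
        std = impute_arm(X, Y_STD, rng)
        new = impute_arm(X, Y_NEW, rng)
        s2 = (variance(new) + variance(std)) / 2       # pooled variance, equal n
        estimates.append(mean(new) - mean(std))
        variances.append(s2 * (1 / len(new) + 1 / len(std)))
    total = mean(variances) + (1 + 1 / m) * variance(estimates)
    z = mean(estimates) / math.sqrt(total)
    # Normal approximation (an assumption of this sketch): the degrees of
    # freedom are large when only a small fraction of information is missing.
    return mean(estimates), 2 * (1 - NormalDist().cdf(abs(z)))


for seed in range(1, 7):
    est, p = pooled_p_value(seed)
    print(f"seed {seed}: estimate {est:.2f}, P* = {p:.3f}")
```

Raising `m` in `pooled_p_value` reduces the seed-to-seed variation, which mirrors the remark above that more imputations give greater precision.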

In the literature on incomplete data analysis, distinctions are drawn among different types of mechanisms that may give rise to missing data [4]. The term missing completely at random refers to settings where missing values are like a random subsample of all values. When missing values are missing completely at random, complete case analysis is valid (although it may not make full use of available information). The term missing at random has a similar-sounding name but refers to the much broader set of scenarios where missing values on one variable are allowed to depend on observed values of other variables. For example, if older individuals were less likely to return for a follow-up measurement, the missing values would not be missing completely at random, but might be missing at random. The missing-at-random assumption is not guaranteed to hold, but data sets typically do not contain information to contradict the missing-at-random assumption. General-purpose multiple-imputation software typically allows values to be missing at random.
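The practical difference between the two mechanisms can be seen in a small simulation. Everything in this sketch is hypothetical: the simulated covariate and outcome, the missingness probabilities, and the seed are arbitrary choices for illustration, not data from the article.

```python
import random
from statistics import fmean

rng = random.Random(2009)  # arbitrary seed
n = 20000

# Simulated covariate x (always observed) and outcome y that depends on x.
x = [rng.uniform(180, 315) for _ in range(n)]
y = [0.02 * (xi - 180) + rng.gauss(0, 0.7) for xi in x]

# Missing completely at random: every outcome has the same 30% chance of loss.
seen_mcar = [rng.random() > 0.30 for _ in range(n)]
# Missing at random: the chance of loss depends only on the observed x.
seen_mar = [rng.random() > (0.50 if xi > 250 else 0.05) for xi in x]


def complete_case_mean(seen):
    """Mean of y among cases whose outcome was observed."""
    return fmean(yi for yi, s in zip(y, seen) if s)


def regression_mean(seen):
    """Fit y on x among complete cases, then average predictions over all x."""
    pairs = [(xi, yi) for xi, yi, s in zip(x, y, seen) if s]
    xb = fmean(a for a, _ in pairs)
    yb = fmean(b for _, b in pairs)
    slope = (sum((a - xb) * (b - yb) for a, b in pairs)
             / sum((a - xb) ** 2 for a, _ in pairs))
    return fmean(yb + slope * (xi - xb) for xi in x)


full_mean = fmean(y)
print(f"all outcomes:                  {full_mean:.3f}")
print(f"complete cases under MCAR:     {complete_case_mean(seen_mcar):.3f}")
print(f"complete cases under MAR:      {complete_case_mean(seen_mar):.3f}")
print(f"regression on x under MAR:     {regression_mean(seen_mar):.3f}")
```

Under the first mechanism the complete-case mean stays close to the mean of all outcomes; under the second it is pulled toward the cases that were more likely to be observed, while an analysis that conditions on x recovers the overall mean. That is the sense in which methods that use observed covariates can remain valid when values are missing at random.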

In some settings, missing values might not be missing at random, as when dropping out of a study is related to an underlying, unmeasured characteristic. Multiple imputation may still be useful in such a scenario, but inferences will depend on assumptions that are not connected to available data. A classic example of inference assuming that missing values were not occurring at random [5] was the decision to reinforce Allied planes in World War II based on an assumed selection effect. Certain surfaces on returning planes were seen to have more bullet holes than others. Reinforcing the surfaces with many bullet holes might have made sense if it was thought that enemy planes were aiming at those areas. Instead, the decision was made to reinforce the surfaces where returning planes were seen to have few bullet holes, on the reasoning that fewer holes were seen there not because those areas were less frequently targeted, but because planes that were hit on those surfaces were less likely to return. A good strategy for making the assumptions underlying general-purpose multiple-imputation software more plausible is to measure a wide array of characteristics and to incorporate those characteristics into imputation models [2].

In the example presented here, the incomplete cases had very different values of macular thickness (X), which motivates the idea of regression-based imputation that controls for macular thickness. Inferences would not always be sensitive to having 5% missing data, but when observed outcomes are related strongly to a covariate and cases with missing outcomes have different covariate values across the groups being compared, there may be sensitivity in inferences, as shown here.

In summary, a reasonable answer to the question, “If only 5% of the data values are missing, is it OK to drop the cases with incomplete data and analyze only the cases with complete data?” would be, “It depends.” The good news is that recent advances such as multiple imputation and associated statistical computing strategies provide statisticians and allied researchers with sophisticated techniques to address missing data.

This study was supported by several branches of the National Institutes of Health, Bethesda, Maryland, during the past 2 years, including ongoing support from Grant Nos. R01 MH078853, P30 MH082760, and P30 MH58017 from the National Institute of Mental Health; Grant No. R01 CA109650 from the National Cancer Institute; and Grant No. R01 DA16850 from the National Institute on Drug Abuse. The author discloses multiple consulting roles in the past 2 years, including participation on an external review of the Statistical Research Division of the U.S. Census Bureau, service on the data safety monitoring board of the University of Southern California Well Elderly Study, service as a dissertation reader for a PhD student at RAND Graduate School on predictors of HIV testing, and participation in a study of the Center on Child Abuse and Neglect at the University of Oklahoma Health Sciences Center. This study conformed to the Declaration of Helsinki and all applicable federal and state laws of the United States of America.

References 

  1. Rubin DB. Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons; 1987.
  2. Rubin DB. Multiple imputation after 18+ years (with discussion). J Am Stat Assoc. 1996;91:473–489.
  3. Schafer JL. Multiple imputation: a primer. Stat Methods Med Res. 1999;8:3–15.
  4. Little RJA, Rubin DB. Statistical Analysis with Missing Data, 2nd ed. New York: John Wiley & Sons; 2002.
  5. Wainer H. Eelworms, bullet holes, and Geraldine Ferraro: some problems with statistical adjustment and some solutions. J Educ Stat. 1989;14:121–140.

PII: S0002-9394(09)00539-X

doi:10.1016/j.ajo.2009.07.027
