The DEcIDE Methods Center publishes a monthly literature scan of current articles of interest to the field of comparative effectiveness research.
You can find them all here.
CER Scan [Epub ahead of print]
- Drug Saf. 2012 Jan;35(1):61-78. [Epub ahead of print]
Identifying Adverse Events of Vaccines Using a Bayesian Method of Medically Guided Information Sharing. Crooks CJ, Prieto-Merino D, Evans SJ. Division of Epidemiology and Public Health, University of Nottingham, Nottingham, UK.
Background: The detection of adverse events following immunization (AEFI) fundamentally depends on how these events are classified. Standard methods impose a choice between either grouping similar events together to gain power or splitting them into more specific definitions. We demonstrate a method of medically guided Bayesian information sharing that avoids grouping or splitting the data, and we further combine this with the standard epidemiological tools of stratification and multivariate regression. Objective: The aim of this study was to assess the ability of a Bayesian hierarchical model to identify gastrointestinal AEFI in children, and then combine this with testing for effect modification and adjustments for confounding. Study Design: Reporting odds ratios were calculated for each gastrointestinal AEFI and vaccine combination. After testing for effect modification, these were then re-estimated using multivariable logistic regression adjusting for age, sex, year and country of report. A medically guided hierarchy of AEFI terms was then derived to allow information sharing in a Bayesian model. Setting: All spontaneous reports of AEFI in children under 18 years of age in the WHO VigiBase™ (Uppsala Monitoring Centre, Uppsala, Sweden) before June 2010. Reports with missing age were included in the main analysis in a separate category and excluded in a subsequent sensitivity analysis. Exposures: The 15 most commonly prescribed childhood vaccinations, excluding influenza vaccines. Main Outcome Measures: All gastrointestinal AEFI coded by WHO Adverse Reaction Terminology. Results: A crude analysis identified 132 signals from 655 reported combinations of gastrointestinal AEFI. Adjusting for confounding by age, sex, year of report and country of report, where appropriate, reduced the number of signals identified to 88. The addition of a Bayesian hierarchical model identified four further signals and removed three. 
Effect modification by age and sex was identified for six vaccines for the outcomes of vomiting, nausea, diarrhoea and salivary gland enlargement.
Conclusion: This study demonstrated a sequence of methods for routinely analysing spontaneous report databases that was easily understandable and reproducible. The combination of classical and Bayesian methods in this study helps to focus the limited resources for hypothesis testing studies towards adverse events with the strongest support from the data.
PMID: 22136183 [PubMed – as supplied by publisher]
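The reporting odds ratio mentioned in the Study Design can be illustrated with a small sketch. It assumes the usual 2x2 layout of spontaneous reports (vaccine of interest versus all other vaccines, event of interest versus all other events) and a Wald interval on the log scale. The function name, the counts, and the "lower limit above 1" reading of a signal are illustrative assumptions, not details taken from the paper.

```python
import math

def reporting_odds_ratio(a, b, c, d, z=1.96):
    """Reporting odds ratio (ROR) with a Wald confidence interval.

    a: reports with the vaccine of interest and the event of interest
    b: reports with the vaccine of interest and any other event
    c: reports with any other vaccine and the event of interest
    d: reports with any other vaccine and any other event
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    ror = (a * d) / (b * c)
    # Standard error of log(ROR) for a 2x2 table.
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(ror) - z * se_log)
    upper = math.exp(math.log(ror) + z * se_log)
    return ror, lower, upper

# Hypothetical counts for illustration only; not taken from VigiBase.
ror, lower, upper = reporting_odds_ratio(40, 960, 200, 18800)
print(f"ROR = {ror:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

A crude ROR like this ignores age, sex, year and country of report, which is why the abstract re-estimates the ratios with multivariable logistic regression before applying the Bayesian hierarchy.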
CER Scan [published within the last 30 days]
- Am J Epidemiol. 2011 Dec 1;174(11):1213-22. Epub 2011 Oct 24.
- Am J Epidemiol. 2011 Dec 1;174(11):1223-7. Epub 2011 Oct 27.
- Am J Epidemiol. 2011 Dec 1;174(11):1228-9. Epub 2011 Oct 24. Myers et al. respond to "understanding bias amplification". Myers JA, Rassen JA, Gagne JJ, Huybrechts KF, Schneeweiss S, Rothman KJ, Glynn RJ.
- Epidemiology. 2011 Nov;22(6):815-22.
Effects of adjusting for instrumental variables on bias and precision of effect estimates.
Myers JA, Rassen JA, Gagne JJ, Huybrechts KF, Schneeweiss S, Rothman KJ, Joffe MM, Glynn RJ.
Recent theoretical studies have shown that conditioning on an instrumental variable (IV), a variable that is associated with exposure but not associated with outcome except through exposure, can increase both bias and variance of exposure effect estimates. Although these findings have obvious implications in cases of known IVs, their meaning remains unclear in the more common scenario where investigators are uncertain whether a measured covariate meets the criteria for an IV or rather a confounder. The authors present results from two simulation studies designed to provide insight into the problem of conditioning on potential IVs in routine epidemiologic practice. The simulations explored the effects of conditioning on IVs, near-IVs (predictors of exposure that are weakly associated with outcome), and confounders on the bias and variance of a binary exposure effect estimate. The results indicate that effect estimates which are conditional on a perfect IV or near-IV may have larger bias and variance than the unconditional estimate. However, in most scenarios considered, the increases in error due to conditioning were small compared with the total estimation error. In these cases, minimizing unmeasured confounding should be the priority when selecting variables for adjustment, even at the risk of conditioning on IVs.
PMID: 22025356 [PubMed – in process]
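The bias amplification described in this abstract can be reproduced in a toy simulation. The sketch below is not the authors' design: it assumes a linear model with a continuous exposure (the paper studies a binary exposure), unit coefficients, and standard normal noise, all chosen only to keep the arithmetic transparent. Here `z` is a perfect instrument, `u` an unmeasured confounder, and adjustment for `z` uses the Frisch-Waugh-Lovell identity instead of a regression library.

```python
import random

def slope(x, y):
    """Ordinary least squares slope of y on x (intercept included)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

def simulate(n=50_000, true_effect=1.0, seed=0):
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]  # instrument: affects x only
    u = [rng.gauss(0, 1) for _ in range(n)]  # unmeasured confounder
    x = [zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]
    y = [true_effect * xi + ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]

    unadjusted = slope(x, y)
    # Coefficient on x in a regression of y on x and z equals the slope
    # of y on the part of x that z does not explain (Frisch-Waugh-Lovell).
    b_xz = slope(z, x)
    x_resid = [xi - b_xz * zi for xi, zi in zip(x, z)]
    adjusted = slope(x_resid, y)
    return unadjusted, adjusted

unadjusted, adjusted = simulate()
print(f"bias without z: {unadjusted - 1.0:+.3f}")
print(f"bias with z:    {adjusted - 1.0:+.3f}")
```

Under these assumptions the large-sample bias is 1/3 without adjustment and 1/2 after conditioning on the instrument, because conditioning removes exposure variation that was free of confounding. The abstract's point is that in most of the scenarios the authors considered, this increase was small relative to total estimation error.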
Invited commentary: understanding bias amplification. Pearl J.
In choosing covariates for adjustment or inclusion in propensity score analysis, researchers must weigh the benefit of reducing confounding bias carried by those covariates against the risk of amplifying residual bias carried by unmeasured confounders. The latter is characteristic of covariates that act like instrumental variables, that is, variables that are more strongly associated with the exposure than with the outcome. In this issue of the Journal (Am J Epidemiol. 2011;174(11):1213-1222), Myers et al. compare the bias amplification of a near-instrumental variable with its bias-reducing potential and suggest that, in practice, the latter outweighs the former. The author of this commentary sheds broader light on this comparison by considering the cumulative effects of conditioning on multiple covariates and showing that bias amplification may build up at a faster rate than bias reduction. The author further derives a partial order on sets of covariates which reveals preference for conditioning on outcome-related, rather than exposure-related, confounders.
PMCID: PMC3224255 [Available on 2012/12/1] PMID: 22034488 [PubMed – in process]
Response to Invited Commentary
PMID: 22025355 [PubMed – in process]
Estimating bias from loss to follow-up in the Danish National Birth Cohort. Greene N, Greenland S, Olsen J, Nohr EA. Department of Epidemiology, School of Public Health, University of California
Loss to follow-up in cohort studies may result in biased association estimates. Of 61,895 women entering the Danish National Birth Cohort and completing the first data-collection phase, 37,178 (60%) opted to be in the 7-year follow-up. Using national registry data to obtain end point information on all members of the cohort, we estimated associations in the baseline and the 7-year follow-up participant populations for 5 exposure-outcome associations: (a) size at birth and childhood asthma, (b) assisted reproductive treatment and childhood hospitalizations, (c) prepregnancy body mass index and childhood infections, (d) alcohol drinking in early pregnancy and childhood developmental disorders, and (e) maternal smoking in pregnancy and childhood attention-deficit hyperactivity disorder (ADHD). We estimated follow-up bias in the odds or rate ratios by calculating relative ratios. For all but one of the above analyses, the bias appeared to be small, between -10% and +8%. For maternal smoking in pregnancy and childhood ADHD, we estimated a positive bias of approximately 33% (95% bootstrap limits of -30% and +152%). The presence and magnitude of bias due to loss to follow-up depended on the nature of the factors or outcomes examined, with the most pronounced contribution in this study coming from maternal smoking. Our methods and results may inform bias analyses in future pregnancy cohort studies.
PMID: 21918455 [PubMed – in process]
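The "relative ratio" used to express follow-up bias can be sketched as a ratio of two estimates of the same association. This assumes the ratio is taken as the estimate among follow-up participants divided by the estimate in the full baseline cohort, which is one natural reading of the abstract. The function name and input values are hypothetical.

```python
def follow_up_bias(ratio_followup, ratio_baseline):
    """Relative ratio and percent bias from restricting to participants.

    ratio_followup: odds or rate ratio among follow-up participants
    ratio_baseline: the same ratio in the full baseline cohort
    """
    relative_ratio = ratio_followup / ratio_baseline
    percent_bias = (relative_ratio - 1.0) * 100.0
    return relative_ratio, percent_bias

# Hypothetical odds ratios, not taken from the Danish National Birth Cohort.
relative_ratio, percent_bias = follow_up_bias(1.50, 1.25)
print(f"relative ratio = {relative_ratio:.2f}, bias = {percent_bias:+.0f}%")
```

A relative ratio of 1 means that restricting the analysis to participants left the association unchanged. The bootstrap limits quoted in the abstract would come from resampling the cohort, which this sketch does not attempt.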
DECEMBER THEME: Methods for Addressing Missing Data in CER
- Stat Med. 2011 Dec 4. doi: 10.1002/sim.4413. [Epub ahead of print]
- Stat Methods Med Res. 2011 Mar 23. [Epub ahead of print]
- Stat Med. 2011 Mar 15;30(6):627-41. doi: 10.1002/sim.4124. Epub 2010 Dec 28.
- Am J Epidemiol. 2010 Nov 1;172(9):1070-6. Epub 2010 Sep 14.
- Artif Intell Med. 2010 Oct;50(2):105-15. Epub 2010 Jul 16.
- J Clin Epidemiol. 2010 Jul;63(7):728-36. Epub 2010 Mar 25.
- Pharmacoepidemiol Drug Saf. 2010 Jun;19(6):618-26.
- Circ Cardiovasc Qual Outcomes. 2010 Jan;3(1):98-105.
- Am J Epidemiol. 2010 Mar 1;171(5):624-32. Epub 2010 Jan 27.
- J Sch Psychol. 2010 Feb;48(1):5-37.
- Int J Epidemiol. 2010 Feb;39(1):118-28. Epub 2009 Oct 25.
Diagnosing imputation models by applying target analyses to posterior replicates of completed data. He Y, Zaslavsky AM. Department of Health Care Policy, Harvard Medical School, Boston, MA, 02115, USA. email@example.com.
Multiple imputation fills in missing data with posterior predictive draws from imputation models. To assess the adequacy of imputation models, we can compare completed data with their replicates simulated under the imputation model. We apply analyses of substantive interest to both datasets and use posterior predictive checks of the differences of these estimates to quantify the evidence of model inadequacy. We can further integrate out the imputed missing data and their replicates over the completed-data analyses to reduce variance in the comparison. In many cases, the checking procedure can be easily implemented using standard imputation software by treating re-imputations under the model as posterior predictive replicates. Thus, it can be applied for non-Bayesian imputation methods. We also sketch several strategies for applying the method in the context of practical imputation analyses. We illustrate the method using two real data applications and study its property using a simulation. Copyright © 2011 John Wiley & Sons, Ltd.
PMID: 22139814 [PubMed – as supplied by publisher]
Using causal diagrams to guide analysis in missing data problems. Daniel RM, Kenward MG, Cousens SN, De Stavola BL. Faculty of Epidemiology and Population Health, London School of Hygiene and Tropical Medicine, London WC1E 7HT, UK.
Estimating causal effects from incomplete data requires additional and inherently untestable assumptions regarding the mechanism giving rise to the missing data. We show that using causal diagrams to represent these additional assumptions both complements and clarifies some of the central issues in missing data theory, such as Rubin’s classification of missingness mechanisms (as missing completely at random (MCAR), missing at random (MAR) or missing not at random (MNAR)) and the circumstances in which causal effects can be estimated without bias by analysing only the subjects with complete data. In doing so, we formally extend the back-door criterion of Pearl and others for use in incomplete data examples. These ideas are illustrated with an example drawn from an occupational cohort study of the effect of cosmic radiation on skin cancer incidence.
PMID: 21389091 [PubMed – as supplied by publisher]
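Rubin's three missingness mechanisms differ only in what the probability of missingness is allowed to depend on. The toy sketch below makes that concrete. The variable names, the probabilities, and the data-generating model (`y = x + noise`, with `x` always observed) are assumptions for illustration and are unrelated to the cosmic radiation example in the paper.

```python
import random

def simulate_rows(n, rng):
    """Generate (x, y) pairs where y depends on x; x is always observed."""
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        rows.append((x, x + rng.gauss(0, 1)))
    return rows

def apply_missingness(rows, mechanism, rng):
    """Return a copy of rows with some y values replaced by None."""
    out = []
    for x, y in rows:
        if mechanism == "MCAR":
            p = 0.35                    # same probability for everyone
        elif mechanism == "MAR":
            p = 0.6 if x > 0 else 0.1   # depends on the observed x only
        elif mechanism == "MNAR":
            p = 0.6 if y > 0 else 0.1   # depends on the unseen y itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        out.append((x, None if rng.random() < p else y))
    return out

def complete_case_mean(rows):
    observed = [y for _, y in rows if y is not None]
    return sum(observed) / len(observed)

rng = random.Random(1)
full = simulate_rows(20_000, rng)
full_mean = sum(y for _, y in full) / len(full)
means = {m: complete_case_mean(apply_missingness(full, m, rng))
         for m in ("MCAR", "MAR", "MNAR")}
print(f"full data: {full_mean:+.3f}")
for name, value in means.items():
    print(f"{name:>4}: {value:+.3f}")
```

In this setup the complete-case mean of `y` stays close to the full-data mean under MCAR and drifts downward under both MAR and MNAR. Under MAR the drift runs entirely through the observed `x`, so it can in principle be corrected using `x`; under MNAR it cannot be corrected without further assumptions.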
Estimating propensity scores with missing covariate data using general location mixture models. Mitra R, Reiter JP. School of Mathematics, University of Southampton, Southampton, SO17 1BJ, U.K. R.Mitra@soton.ac.uk
In many observational studies, analysts estimate causal effects using propensity scores, e.g. by matching, sub-classifying, or inverse probability weighting based on the scores. Estimation of propensity scores is complicated when some values of the covariates are missing. Analysts can use multiple imputation to create completed data sets from which propensity scores can be estimated. We propose a general location mixture model for imputations that assumes that the control units are a latent mixture of (i) units whose covariates are drawn from the same distributions as the treated units’ covariates and (ii) units whose covariates are drawn from different distributions. This formulation reduces the influence of control units outside the treated units’ region of the covariate space on the estimation of parameters in the imputation model, which can result in more plausible imputations. In turn, this can result in more reliable estimates of propensity scores and better balance in the true covariate distributions when matching or sub-classifying. We illustrate the benefits of the latent class modeling approach with simulations and with an observational study of the effect of breast feeding on children’s cognitive abilities. Copyright © 2010 John Wiley & Sons, Ltd.
PMID: 21337358 [PubMed – indexed for MEDLINE]
Multiple imputation for missing data via sequential regression trees. Burgette LF, Reiter JP. Department of Statistical Science, Duke University, Durham, North Carolina 27708.
Multiple imputation is particularly well suited to deal with missing data in large epidemiologic studies, because typically these studies support a wide range of analyses by many data users. Some of these analyses may involve complex modeling, including interactions and nonlinear relations. Identifying such relations and encoding them in imputation models, for example, in the conditional regressions for multiple imputation via chained equations, can be daunting tasks with large numbers of categorical and continuous variables. The authors present a nonparametric approach for implementing multiple imputation via chained equations by using sequential regression trees as the conditional models. This has the potential to capture complex relations with minimal tuning by the data imputer. Using simulations, the authors demonstrate that the method can result in more plausible imputations, and hence more reliable inferences, in complex settings than the naive application of standard sequential regression imputation techniques. They apply the approach to impute missing values in data on adverse birth outcomes with more than 100 clinical and survey variables. They evaluate the imputations using posterior predictive checks with several epidemiologic analyses of interest.
PMID: 20841346 [PubMed – indexed for MEDLINE]
Free Full Text: http://aje.oxfordjournals.org/content/172/9/1070.long
Missing data imputation using statistical and machine learning methods in a real breast cancer problem. Jerez JM, Molina I, García-Laencina PJ, Alba E, Ribelles N, Martín M, Franco L. Departamento de Lenguajes y Ciencias de la Computación, Universidad de Málaga, E.T.S.I. Informática, Campus de Teatinos s/n, 29071 Málaga, Spain. firstname.lastname@example.org
OBJECTIVES: Missing data imputation is an important task in cases where it is crucial to use all available data and not discard records with missing values. This work evaluates the performance of several statistical and machine learning imputation methods that were used to predict recurrence in patients in an extensive real breast cancer data set.
MATERIALS AND METHODS: Imputation methods based on statistical techniques, e.g., mean, hot-deck and multiple imputation, and machine learning techniques, e.g., multi-layer perceptron (MLP), self-organisation maps (SOM) and k-nearest neighbour (KNN), were applied to data collected through the "El Álamo-I" project, and the results were then compared to those obtained from the listwise deletion (LD) imputation method. The database includes demographic, therapeutic and recurrence-survival information from 3679 women with operable invasive breast cancer diagnosed in 32 different hospitals belonging to the Spanish Breast Cancer Research Group (GEICAM). The accuracies of predictions on early cancer relapse were measured using artificial neural networks (ANNs), in which different ANNs were estimated using the data sets with imputed missing values.
RESULTS: The imputation methods based on machine learning algorithms outperformed imputation statistical methods in the prediction of patient outcome. Friedman’s test revealed a significant difference (p=0.0091) in the observed area under the ROC curve (AUC) values, and the pairwise comparison test showed that the AUCs for MLP, KNN and SOM were significantly higher (p=0.0053, p=0.0048 and p=0.0071, respectively) than the AUC from the LD-based prognosis model.
CONCLUSION: The methods based on machine learning techniques were the most suited for the imputation of missing values and led to a significant enhancement of prognosis accuracy compared to imputation methods based on statistical procedures.
Copyright © 2010 Elsevier B.V. All rights reserved.
PMID: 20638252 [PubMed – indexed for MEDLINE]
Unpredictable bias when using the missing indicator method or complete case analysis for missing confounder values: an empirical example. Knol MJ, Janssen KJ, Donders AR, Egberts AC, Heerdink ER, Grobbee DE, Moons KG, Geerlings MI. Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Str. 6.131, PO Box 85500, 3508 GA Utrecht, The Netherlands. email@example.com
OBJECTIVE: Missing indicator method (MIM) and complete case analysis (CC) are frequently used to handle missing confounder data. Using empirical data, we demonstrated the degree and direction of bias in the effect estimate when using these methods compared with multiple imputation (MI).
STUDY DESIGN AND SETTING: From a cohort study, we selected an exposure (marital status), outcome (depression), and confounders (age, sex, and income). Missing values in "income" were created according to different patterns of missingness: missing values were created completely at random and depending on exposure and outcome values. Percentages of missing values ranged from 2.5% to 30%.
RESULTS: When missing values were completely random, MIM gave an overestimation of the odds ratio, whereas CC and MI gave unbiased results. MIM and CC gave under- or overestimations when missing values depended on observed values. Magnitude and direction of bias depended on how the missing values were related to exposure and outcome. Bias increased with increasing percentage of missing values.
CONCLUSION: MIM should not be used in handling missing confounder data because it gives unpredictable bias of the odds ratio even with small percentages of missing values. CC can be used when missing values are completely random, but it gives loss of statistical power.
Copyright 2010 Elsevier Inc. All rights reserved.
PMID: 20346625 [PubMed – indexed for MEDLINE]
Issues in multiple imputation of missing data for large general practice clinical databases. Marston L, Carpenter JR, Walters KR, Morris RW, Nazareth I, Petersen I. Department of Primary Care and Population Health, University College London, Rowland Hill Street, London NW3 2PF
PURPOSE: Missing data are a substantial problem in clinical databases. This paper aims to examine patterns of missing data in a primary care database, compare this to nationally representative datasets and explore the use of multiple imputation (MI) for these data.
METHODS: The patterns and extent of missing health indicators in a UK primary care database (THIN) were quantified using 488 384 patients aged 16 or over in their first year after registration with a GP from 354 General Practices. MI models were developed and the resulting data compared to that from nationally representative datasets (14 142 participants aged 16 or over from the Health Survey for England 2006 (HSE) and 4 252 men from the British Regional Heart Study (BRHS)).
RESULTS: Between 22% (smoking) and 38% (height) of health indicator data were missing in newly registered patients, 2004-2006. Distributions of height, weight and blood pressure were comparable to HSE and BRHS, but alcohol and smoking were not. After MI the percentage of smokers and non-drinkers was higher in THIN than the comparison datasets, while the percentage of ex-smokers and heavy drinkers was lower. Height, weight and blood pressure remained similar to the comparison datasets.
CONCLUSIONS: Given available data, the results are consistent with smoking and alcohol data missing not at random whereas height, weight and blood pressure missing at random. Further research is required on suitable imputation methods for smoking and alcohol in such databases.
PMID: 20306452 [PubMed – indexed for MEDLINE]
Missing data analysis using multiple imputation: getting to the heart of the matter. He Y. Department of Health Care Policy, Harvard Medical School
Missing data are a pervasive problem in health investigations. We describe some background of missing data analysis and criticize ad hoc methods that are prone to serious problems. We then focus on multiple imputation, in which missing cases are first filled in by several sets of plausible values to create multiple completed datasets, then standard complete-data procedures are applied to each completed dataset, and finally the multiple sets of results are combined to yield a single inference. We introduce the basic concepts and general methodology and provide some guidance for application. For illustration, we use a study assessing the effect of cardiovascular diseases on hospice discussion for late stage lung cancer patients.
PMCID: PMC2818781; PMID: 20123676 [PubMed – indexed for MEDLINE]
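The final step described in this abstract, combining the results from the completed datasets into a single inference, is usually done with Rubin's rules. The sketch below assumes each completed dataset has already produced a point estimate and its variance. The function name and the numbers are hypothetical and do not come from the hospice discussion study.

```python
import math
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool m completed-data results with Rubin's rules.

    Returns the pooled estimate and its total variance.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists with at least two imputations")
    pooled = mean(estimates)
    within = mean(variances)        # average completed-data variance
    between = variance(estimates)   # sample variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total

# Hypothetical log odds ratios and variances from m = 5 imputations.
estimates = [0.50, 0.60, 0.55, 0.45, 0.65]
variances = [0.04, 0.04, 0.04, 0.04, 0.04]
pooled, total = pool_rubin(estimates, variances)
print(f"pooled estimate = {pooled:.3f}, standard error = {math.sqrt(total):.3f}")
```

The between-imputation term is what separates multiple imputation from single imputation: it carries the extra uncertainty due to the missing values into the final standard error.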
Multiple imputation for missing data: fully conditional specification versus multivariate normal imputation. Lee KJ, Carlin JB. Clinical Epidemiology and Biostatistics Unit, Murdoch Childrens Research Institute, Royal Children’s Hospital, Flemington Road, Parkville, Victoria
Statistical analysis in epidemiologic studies is often hindered by missing data, and multiple imputation is increasingly being used to handle this problem. In a simulation study, the authors compared 2 methods for imputation that are widely available in standard software: fully conditional specification (FCS) or "chained equations" and multivariate normal imputation (MVNI). The authors created data sets of 1,000 observations to simulate a cohort study, and missing data were induced under 3 missing-data mechanisms. Imputations were performed using FCS (Royston’s "ice") and MVNI (Schafer’s NORM) in Stata (Stata Corporation, College Station, Texas), with transformations or prediction matching being used to manage nonnormality in the continuous variables. Inferences for a set of regression parameters were compared between these approaches and a complete-case analysis. As expected, both FCS and MVNI were generally less biased than complete-case analysis, and both produced similar results despite the presence of binary and ordinal variables that clearly did not follow a normal distribution. Ignoring skewness in a continuous covariate led to large biases and poor coverage for the corresponding regression parameter under both approaches, although inferences for other parameters were largely unaffected. These results provide reassurance that similar results can be expected from FCS and MVNI in a standard regression analysis involving variously scaled variables.
PMID: 20106935 [PubMed – indexed for MEDLINE]
Free Full Text: http://aje.oxfordjournals.org/content/171/5/624.long
An introduction to modern missing data analyses. Baraldi AN, Enders CK. Arizona State University, USA. Amanda.Baraldi@asu.edu
A great deal of recent methodological research has focused on two modern missing data analysis methods: maximum likelihood and multiple imputation. These approaches are advantageous to traditional techniques (e.g. deletion and mean imputation techniques) because they require less stringent assumptions and mitigate the pitfalls of traditional techniques. This article explains the theoretical underpinnings of missing data analyses, gives an overview of traditional missing data techniques, and provides accessible descriptions of maximum likelihood and multiple imputation. In particular, this article focuses on maximum likelihood estimation and presents two analysis examples from the Longitudinal Study of American Youth data. One of these examples includes a description of the use of auxiliary variables. Finally, the paper illustrates ways that researchers can use intentional, or planned, missing data to enhance their research designs.
PMID: 20006986 [PubMed – indexed for MEDLINE]
Modelling relative survival in the presence of incomplete data: a tutorial. Nur U, Shack LG, Rachet B, Carpenter JR, Coleman MP. Cancer Research UK Cancer Survival Group, London School of Hygiene and Tropical Medicine, London, UK. firstname.lastname@example.org
BACKGROUND: Missing data frequently create problems in the analysis of population-based data sets, such as those collected by cancer registries. Restriction of analysis to records with complete data may yield inferences that are substantially different from those that would have been obtained had no data been missing. ‘Naive’ methods for handling missing data, such as restriction of the analysis to complete records or creation of a ‘missing’ category, have drawbacks that can invalidate the conclusions from the analysis. We offer a tutorial on modern methods for handling missing data in relative survival analysis.
METHODS: We estimated relative survival for 29 563 colorectal cancer patients who were diagnosed between 1997 and 2004 and registered in the North West Cancer Intelligence Service. The method of multiple imputation (MI) was applied to account for the common example of incomplete stage at diagnosis, under the missing at random (MAR) assumption. Multivariable regression with a generalized linear model and Poisson error structure was then used to estimate the excess hazard of death of the colorectal cancer patients, over and above the background mortality, adjusting for significant predictors of mortality.
RESULTS: Incomplete information on stage, morphology and grade meant that only 55% of the data could be included in the ‘complete-case’ analysis. All cases could be included after indicator method (IM) or MI method. Handling missing data by MI produced a significantly lower estimate of the excess mortality for stage, morphology and grade, with the largest reductions occurring for late-stage and high-grade tumours, when compared with the results of complete-case analysis.
CONCLUSION: In complete-case analysis, almost 50% of the information could not be included, and with the IM, all records with missing values for stage were combined into a single ‘missing’ category. We show that MI methods greatly improved the results by exploiting all the information in the incomplete records. This method also helped to ensure efficient inferences about survival were made from the multivariate regression analyses.
PMID: 19858106 [PubMed – indexed for MEDLINE]