Multiple imputation and analysis for high‐dimensional incomplete proteomics data |
| |
Authors: | Xiaoyan Yin, Daniel Levy, Christine Willinger, Aram Adourian, Martin G. Larson |
| |
Affiliation: | 1. The Framingham Heart Study, National Heart, Lung, and Blood Institute, Framingham, MA, U.S.A.; 2. Department of Biostatistics, School of Public Health, Boston University, Boston, MA, U.S.A.; 3. Department of Cardiology, Boston University, Boston, MA, U.S.A.; 4. Population Sciences Branch, Division of Intramural Research, National Heart, Lung, and Blood Institute, Boston, MA, U.S.A.; 5. BG Medicine Inc., Waltham, MA, U.S.A.; 6. Department of Mathematics and Statistics, Boston University, Boston, MA, U.S.A. |
| |
Abstract: | Multivariable analysis of proteomics data using standard statistical models is hindered by the presence of incomplete data. We faced this issue in a nested case–control study of 135 incident cases of myocardial infarction and 135 pair‐matched controls from the Framingham Heart Study Offspring cohort. Plasma protein markers (K = 861) were measured on the case–control pairs (N = 135), and the majority of proteins had missing expression values for a subset of samples. In the setting of many more variables than observations (K ≫ N), we explored and documented the feasibility of multiple imputation approaches along with subsequent analysis of the imputed data sets. Initially, we selected proteins with complete expression data (K = 261) and randomly masked some values as the basis of simulation to tune the imputation and analysis process. We randomly shuffled proteins into several bins, performed multiple imputation within each bin, and followed up with stepwise selection using conditional logistic regression within each bin. This process was repeated hundreds of times. We determined the optimal method of multiple imputation, number of proteins per bin, and number of random shuffles using several performance statistics. We then applied this method to 544 proteins with incomplete expression data (≤40% missing values), from which we identified a panel of seven proteins that were jointly associated with myocardial infarction. © 2015 The Authors. Statistics in Medicine published by John Wiley & Sons Ltd. |
| |
Keywords: | multiple imputation; stepwise selection; high dimension; imputation quality |
|
|
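The shuffle-and-bin procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The data are synthetic, and the masking fraction (0.2), bin size (30), number of shuffles (3), and number of imputations (5) are arbitrary assumptions, since the abstract does not state the tuned values. The function names are hypothetical, and `impute_bin_once` is a deliberately simple placeholder (random draws from a normal fitted to each column's observed values) standing in for whichever multiple imputation method is chosen. The stepwise conditional logistic regression step is only marked by a comment.

```python
import numpy as np


def shuffle_into_bins(n_proteins, bin_size, rng):
    """Randomly partition protein column indices into bins of at most bin_size."""
    order = rng.permutation(n_proteins)
    return [order[i:i + bin_size] for i in range(0, n_proteins, bin_size)]


def mask_at_random(x, frac, rng):
    """Return a copy of x with roughly `frac` of its entries set to NaN."""
    masked = x.astype(float).copy()
    masked[rng.random(x.shape) < frac] = np.nan
    return masked


def impute_bin_once(block, rng):
    """Placeholder imputation (an assumption, not the paper's method):
    fill each missing value with a draw from a normal distribution
    fitted to the observed values of the same column."""
    filled = block.copy()
    for j in range(block.shape[1]):
        col = block[:, j]
        miss = np.isnan(col)
        obs = col[~miss]
        if miss.any() and obs.size > 1:
            filled[miss, j] = rng.normal(obs.mean(), obs.std(ddof=1), miss.sum())
    return filled


rng = np.random.default_rng(0)
n_pairs, n_proteins = 135, 261                     # sizes taken from the abstract
complete = rng.normal(size=(n_pairs, n_proteins))  # synthetic stand-in data
masked = mask_at_random(complete, 0.2, rng)        # 0.2 is an assumed fraction

n_shuffles, bin_size, n_imputations = 3, 30, 5     # assumed tuning values
for _ in range(n_shuffles):
    bins = shuffle_into_bins(n_proteins, bin_size, rng)
    for cols in bins:
        block = masked[:, cols]
        imputed_sets = [impute_bin_once(block, rng) for _ in range(n_imputations)]
        # Stepwise selection with conditional logistic regression would be
        # run on each imputed set here; that step is omitted from this sketch.
```

Binning keeps the number of proteins in each imputation model small relative to the 135 pairs, and repeating the shuffle lets each protein be imputed and evaluated alongside different sets of companions.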