INAUGURAL ARTICLE by a Recently Elected Academy Member: Statistical learning and selective inference
Authors: Jonathan Taylor, Robert J. Tibshirani
Affiliation: (a) Department of Statistics, Stanford University, Stanford, CA, 94305; (b) Department of Health Research & Policy and Department of Statistics, Stanford University, Stanford, CA, 94305
Abstract: We describe the problem of “selective inference.” This addresses the following challenge: Having mined a set of data to find potential associations, how do we properly assess the strength of these associations? The fact that we have “cherry-picked” (searched for the strongest associations) means that we must set a higher bar for declaring significant the associations that we see. This challenge becomes more important in the era of big data and complex statistical modeling. The cherry tree (dataset) can be very large and the tools for cherry picking (statistical learning methods) are now very sophisticated. We describe some recent developments in selective inference and illustrate their use in forward stepwise regression, the lasso, and principal components analysis.

Statistical science has changed a great deal in the past 10–20 years, and is continuing to change, in response to technological advances in science and industry. The world is awash with big and complicated data, and researchers are trying to make sense of it. Leading examples include data from “omic” assays in the biomedical sciences, financial forecasting from economic and business indicators, and the analysis of user click patterns to optimize ad placement on websites. This has led to an explosion of interest in the fields of statistics and machine learning and spawned a new field some call “data science.”

In the words of Yoav Benjamini, statistical methods have become “industrialized” in response to these changes. Whereas traditionally scientists fit a few statistical models by hand, now they use sophisticated computational tools to search through a large number of models, looking for meaningful patterns. Having done this search, the challenge is then to judge the strength of the apparent associations that have been found. For example, a correlation of 0.9 between two measurements A and B is probably noteworthy.
However, suppose that I had arrived at A and B as follows: I actually started with 1,000 measurements and I searched among all pairs of measurements for the most correlated pair; these turn out to be A and B, with correlation 0.9. With this backstory, the finding is not nearly as impressive and could well have happened by chance, even if all 1,000 measurements were uncorrelated. Now, if I just reported to you that these two measures A and B have correlation 0.9, and did not tell you which of these two routes I used to obtain them, you would not have enough information to judge the strength of the apparent relationship. This statistical problem has become known as “selective inference,” the assessment of significance and effect sizes from a dataset after mining the same data to find these associations.

As another example, suppose that we have a quantitative value y, a measurement of the survival time of a patient after receiving either a standard treatment or a new experimental treatment. I give the old drug (1) or new drug (2) at random to a set of patients and compute the standardized mean difference in the outcome, $z = (\bar{y}_2 - \bar{y}_1)/s$, where s is an estimate of the SD of the raw difference. Then I could approximate the distribution of z by a standard normal distribution, and hence if I reported to you a value of, say, z = 2.5 you would be impressed because a value that large is unlikely to occur by chance if the new treatment had the same effectiveness as the old one (the P value is about 1%). However, what if instead I tried out many new treatments and reported to you only the ones for which |z| > 2? Then a value of 2.5 is not nearly as surprising. Indeed, if the two treatments were equivalent, the conditional probability that |z| exceeds 2.5, given that it is larger than 2, is about 27%.
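
The two probabilities in the treatment example can be checked numerically. The following is a minimal sketch, assuming (as the text does) that z is standard normal when the treatments are equivalent and that the only selection is reporting results with |z| > 2. The function names `two_sided_tail` and `selective_p_value` are illustrative and do not come from the article.

```python
import math
import random


def two_sided_tail(c):
    """P(|Z| > c) for a standard normal Z, computed via the complementary error function."""
    return math.erfc(c / math.sqrt(2.0))


def selective_p_value(z_obs, threshold):
    """P(|Z| > |z_obs| given |Z| > threshold) under the null.

    Assumes |z_obs| >= threshold, i.e. the result was actually reported.
    """
    return two_sided_tail(abs(z_obs)) / two_sided_tail(threshold)


# Unconditional (naive) P value for z = 2.5, and the P value that accounts
# for the rule "report only when |z| > 2".
naive_p = two_sided_tail(2.5)
selective_p = selective_p_value(2.5, 2.0)

# Monte Carlo check: simulate many equivalent treatments, keep only the
# "reported" ones, and see how often they exceed 2.5 in absolute value.
rng = random.Random(0)
draws = [rng.gauss(0.0, 1.0) for _ in range(200_000)]
reported = [z for z in draws if abs(z) > 2.0]
mc = sum(abs(z) > 2.5 for z in reported) / len(reported)

print(f"naive P value:     {naive_p:.3f}")
print(f"selective P value: {selective_p:.3f}")
print(f"Monte Carlo check: {mc:.3f}")
```

The selective P value is simply the ratio of two normal tail areas, which is why conditioning on |z| > 2 raises it so sharply.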
Armed with knowledge of the process that led to the value z = 2.5, the correct selective inference would assign a P value of 0.27 to the finding, rather than 0.01.

If not taken into account, the effects of selection can greatly exaggerate the apparent strengths of relationships. We feel that this is one of the causes of the current crisis in reproducibility in science (e.g., ref. 1). With increased competitiveness and pressure to publish, it is natural for researchers to exaggerate their claims, intentionally or otherwise. Journals are much more likely to publish studies with low P values, and we (the readers) never hear about the great number of studies that showed no effect and were filed away (the “file-drawer effect”). This makes it difficult to assess the strength of a reported P value of, say, 0.04.

The challenge of correcting for the effects of selection is a complex one, because the selective decisions can occur at many different stages in the analysis process. However, some exciting progress has recently been made in more limited problems, such as that of adaptive regression techniques for supervised learning. Here the selections are made in a well-defined way, so that we can exactly measure their effects on subsequent inferences. We describe these new techniques here, as applied to two widely used statistical methods: classic supervised learning, via forward stepwise regression, and modern sparse learning, via the “lasso.” Later, we indicate the broader scope of their potential applications, including principal components analysis.
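
To see why adaptive regression needs this kind of correction, the following simulation sketches the first step of forward stepwise regression on pure noise. It illustrates the problem only, not the authors' correction, and it makes simplifying assumptions that are not from the article: the noise SD is known and equal to 1, each predictor is scaled to unit norm so its z statistic is standard normal under the null, and the sizes `n`, `p`, and `reps` are arbitrary.

```python
import math
import random


def naive_p_first_step(n, p, rng):
    """Naive P value for the first variable chosen by forward stepwise on pure noise.

    The response y and all p predictors are independent N(0, 1) draws, so no
    predictor is truly associated with y.
    """
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]
    best = 0.0
    for _ in range(p):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(v * v for v in x))
        # With noise SD 1 and x scaled to unit norm, z is N(0, 1) under the null.
        z = sum(a * b for a, b in zip(x, y)) / norm
        # Forward stepwise enters the predictor with the largest |z| first.
        best = max(best, abs(z))
    # Two-sided normal P value that ignores the search over p predictors.
    return math.erfc(best / math.sqrt(2.0))


rng = random.Random(1)
n, p, reps = 30, 20, 400
pvals = [naive_p_first_step(n, p, rng) for _ in range(reps)]
rate = sum(pv < 0.05 for pv in pvals) / reps

print(f"fraction of pure-noise datasets with naive P < 0.05: {rate:.2f}")
```

A valid test would reject in about 5% of these null datasets; the naive P value rejects far more often, because it is computed as if the selected predictor had been fixed in advance.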
Keywords: inference, P values, lasso