Evaluating the Ability of Tree-Based Methods and Logistic Regression for the Detection of SNP-SNP Interaction |
| |
Authors: | M. García-Magariños, I. López-de-Ullibarri, R. Cao, A. Salas |
| |
Affiliation: | Unidade de Xenética, Instituto de Medicina Legal, and Departamento de Anatomía Patolóxica e Ciencias Forenses, Facultade de Medicina, Universidade de Santiago de Compostela, Galicia, Spain; Departamento de Estadística e Investigación Operativa, Universidad de Santiago de Compostela, Galicia, Spain; Departamento de Matemáticas, Universidade da Coruña, Galicia, Spain |
| |
Abstract: | Most common human diseases are likely to have complex etiologies. Methods of analysis that allow for the phenomenon of epistasis are of growing interest in the genetic dissection of complex diseases. By allowing for epistatic interactions between potential disease loci, we may succeed in identifying genetic variants that might otherwise have remained undetected. Here we aimed to analyze the ability of logistic regression (LR) and two tree‐based supervised learning methods, classification and regression trees (CART) and random forest (RF), to detect epistasis. Multifactor‐dimensionality reduction (MDR) was also used for comparison. Our approach involves first the simulation of datasets of autosomal biallelic unphased and unlinked single nucleotide polymorphisms (SNPs), each containing a two‐loci interaction (causal SNPs) and 98 ‘noise’ SNPs. We modelled interactions under different scenarios of sample size, missing data, minor allele frequencies (MAF) and several penetrance models: three involving both (indistinguishable) marginal effects and interaction, and two simulating pure interaction effects. In total, we have simulated 99 different scenarios. Although CART, RF, and LR yield similar results in terms of detection of true association, CART and RF perform better than LR with respect to classification error. MAF, penetrance model, and sample size are greater determining factors than percentage of missing data in the ability of the different techniques to detect true association. In pure interaction models, only RF detects association. In conclusion, tree‐based methods and LR are important statistical tools for the detection of unknown interactions among true risk‐associated SNPs with marginal effects and in the presence of a significant number of noise SNPs. In pure interaction models, RF performs reasonably well in the presence of large sample sizes and low percentages of missing data. 
However, when the study design is suboptimal (unfavourable for detecting interaction in terms of, e.g., sample size and MAF), there is a high chance of detecting spurious associations. |
| |
Keywords: | Epistasis; gene-gene interaction; autosomal SNPs; random forest; CART; logistic regression; missing data; complex diseases |
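
The simulation design summarised in the abstract (two interacting causal SNPs plus 98 noise SNPs, with varying sample size, MAF, missing-data rate and penetrance model) can be sketched roughly as follows. This is a minimal illustration in Python with NumPy, not the authors' code: the function name `simulate_dataset`, the default parameter values, the Hardy-Weinberg genotype sampling, the missing-completely-at-random mechanism and the penetrance table are all assumptions made for the example.

```python
import numpy as np


def simulate_dataset(n_cases=200, n_controls=200, n_noise=98, maf=0.2,
                     missing_rate=0.05, penetrance=None, seed=0):
    """Simulate unphased, unlinked biallelic SNP genotypes for a case-control study.

    Columns 0 and 1 are the two causal SNPs; the remaining columns are noise SNPs.
    Genotypes are coded as minor-allele counts (0, 1, 2); missing calls are NaN.
    """
    rng = np.random.default_rng(seed)
    if penetrance is None:
        # Hypothetical penetrance table P(disease | genotype at SNP 0, genotype at SNP 1).
        # These values are illustrative only and are not taken from the paper.
        penetrance = np.array([[0.05, 0.05, 0.05],
                               [0.05, 0.30, 0.30],
                               [0.05, 0.30, 0.30]])
    n_snps = n_noise + 2
    batch = 1000
    cases, controls = [], []
    have_cases = have_controls = 0

    # Draw individuals from the population until enough cases and controls exist.
    while have_cases < n_cases or have_controls < n_controls:
        # Assumption: Hardy-Weinberg equilibrium and independent (unlinked) loci,
        # so each genotype is a Binomial(2, MAF) draw.
        g = rng.binomial(2, maf, size=(batch, n_snps))
        # Disease status depends only on the two causal SNPs.
        p_disease = penetrance[g[:, 0], g[:, 1]]
        affected = rng.random(batch) < p_disease
        cases.append(g[affected])
        controls.append(g[~affected])
        have_cases += int(affected.sum())
        have_controls += int((~affected).sum())

    X = np.vstack([np.vstack(cases)[:n_cases],
                   np.vstack(controls)[:n_controls]]).astype(float)
    y = np.concatenate([np.ones(n_cases, dtype=int),
                        np.zeros(n_controls, dtype=int)])

    # Assumption: genotype calls are missing completely at random.
    X[rng.random(X.shape) < missing_rate] = np.nan
    return X, y


if __name__ == "__main__":
    X, y = simulate_dataset()
    print(X.shape, int(y.sum()), float(np.isnan(X).mean()))
```

Varying `n_cases`, `n_controls`, `maf`, `missing_rate` and `penetrance` over a grid would give a family of scenarios comparable in spirit to those described above. A pure interaction model would require a penetrance table with no marginal effects at the chosen MAF, which the hypothetical table here does not satisfy.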