A machine-learning approach to map landscape connectivity in Aedes aegypti with genetic and environmental data |
| |
Authors: | Evlyn Pless, Norah P. Saarman, Jeffrey R. Powell, Adalgisa Caccone, Giuseppe Amatulli |
| |
Affiliation: | (a) Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT, 06511; (b) Department of Anthropology, University of California, Davis, CA, 95616; (c) Department of Biology, Utah State University, Logan, UT, 84321; (d) School of the Environment, Yale University, New Haven, CT, 06511; (e) Center for Research Computing, Yale University, New Haven, CT, 06511 |
| |
Abstract: | Mapping landscape connectivity is important for controlling invasive species and disease vectors. Current landscape genetics methods are often constrained by the subjectivity of creating resistance surfaces and the difficulty of working with interacting and correlated environmental variables. To overcome these constraints, we combine the advantages of a machine-learning framework and an iterative optimization process to develop a method for integrating genetic and environmental (e.g., climate, land cover, human infrastructure) data. We validate and demonstrate this method for the Aedes aegypti mosquito, an invasive species and the primary vector of dengue, yellow fever, chikungunya, and Zika. We test two contrasting metrics to approximate genetic distance and find that Cavalli-Sforza–Edwards distance (CSE) performs better than linearized FST. The correlation (R) between the model’s predicted genetic distance and the observed genetic distance is 0.83. We produce a map of genetic connectivity for Ae. aegypti’s range in North America and discuss which environmental and anthropogenic variables are most important for predicting gene flow, especially in the context of vector control.

Landscape genetics, which explicitly quantifies the effects of a heterogeneous landscape on gene flow, is an important tool for both conservation biology and the control of invasive species and disease vectors, including the “yellow fever mosquito” (Aedes aegypti) (1, 2). We demonstrate that current limitations in landscape genetics can be addressed with a machine-learning approach integrated into an iterative optimization process.

Isolation by distance (IBD) is a classical model in population genetics that assumes dispersal is limited in proportion to geographic distance, resulting in increasing genetic differentiation with increasing geographic distance between populations (3–5).
Although this pattern is commonly seen in nature, factors such as history and dispersal limitations caused by the environment (i.e., “isolation by resistance”) (6) can produce deviations from IBD. Landscape resistance (alias friction) and its inverse, connectivity, determine how organisms move through a landscape (7). Modeling landscape connectivity can be used to identify the environmental variables that affect the organisms’ gene flow and genetic structure; predict how climate and land use change will affect their gene flow and distribution in the future; and inform conservation, vector control, and other management decisions (1, 8–13). Our goals are to use environmental data (the predictors) to build a model of genetic connectivity (the observed data) that improves on IBD and to identify environmental drivers of gene flow patterns.

We implement a machine-learning approach that offers a number of advantages over classical methods in landscape genetics: The machine-learning approach is more objective, it allows the inclusion of correlated variables, and it is able to account for different shapes and magnitudes of correlations between predictor and response variables at different locations in the landscape (14–17). In comparison, a common approach in landscape genetics called resistance surface mapping involves the subjective process of creating resistance surfaces for environmental variables, in which each pixel represents a hypothesized resistance to the organism’s movement often based on expert opinion (6, 18). Effective landscape distances through the resistance surfaces can be found with least cost path or circuit theory analysis (19) and then analyzed for associations with genetic distance (20).

One option to circumvent the subjectivity of creating resistance surfaces is to model genetic connectivity directly from environmental data. Bouyer et al.
(7) took this approach and used a maximum-likelihood method to integrate genetic data and environmental data to map landscape resistance in tsetse flies. Additionally, they introduced an iterative optimization approach in which each subsequent iteration used least cost path lines through the previously predicted resistance surface, an improvement over modeling organism movement as straight lines (16, 17). While this presented a major advance, the maximum-likelihood methodology requires exclusion of correlated data, establishing the relationship between environmental variables and genetic distance before building the model, and transforming or discretizing nonlinear relationships. Additionally, this approach assumes one relationship between each environmental variable and the genetic data across the whole landscape. To build on previous advances while overcoming some of their limitations, we combine iterative optimization with a machine-learning method called random forest (RF).

RF is a nonlinear classification and regression tree analysis that can handle many inputs, including redundant or irrelevant variables, as well as continuous and categorical data types (14, 15). RF creates many internal training/testing subdatasets and aggregates the predictors, resulting in stable and consistent results that generally do not overfit the data and can be evaluated through validation processes (14). It is easier to tune and less likely to overfit noisy data than another machine-learning method we considered, gradient boosting (21). Additionally, RF has been successfully incorporated into ecological studies (22) and a small number of landscape genetics studies (16, 17, 23).
These studies considered only the environmental predictor values at the genetic collection sites (23) or along straight lines between each pair of sites (16, 17), in contrast to the least cost path analysis we implement here (7).

We demonstrate the efficacy of our method to map landscape connectivity for an important disease vector. Ae. aegypti is highly invasive and the primary vector of yellow fever, Zika, dengue, and chikungunya. Except for yellow fever, there are no reliable, widely used vaccines for these diseases, so vector control is essential. Ae. aegypti originated in Africa and is now found throughout the tropics and increasingly in temperate regions (24–26). The species is temperature constrained, preferring warm, humid areas close to humans (the females’ preferred source for bloodmeals outside their native African range) (27). In the United States, it has a patchy distribution throughout southern states, especially Texas, Florida, and California (28). Although Ae. aegypti can disperse >1 km, its usual lifetime dispersal is only around 200 m (29–32). Passive “hitchhiking” via human transportation networks is responsible for long-distance invasions and worldwide spread of Ae. aegypti and its close relative (33–35). Climate change is also expanding the range of Aedes species, which could expose nearly 1 billion additional people to diseases carried by these mosquitoes for the first time (26).

Although IBD is common in nature and a helpful null model in landscape genetics (20), geographic distance is often an inadequate sole predictor of genetic distance (as in the case of our dataset; SI Appendix, Fig. S1). Therefore, a more complex model is needed to explain and predict genetic distance and corresponding landscape connectivity. In this paper we introduce an iterative machine-learning approach to integrate environmental predictors and genetic observation data and apply it to map landscape connectivity for the Ae. aegypti mosquito in North America.
We also find and examine the most important variables for building the connectivity model and provide validation of our proposed method. |
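The introduction contrasts modeling movement along straight lines between sites with least cost paths through a resistance surface. The following is a minimal sketch of that idea, not the authors' implementation. It assumes an 8-neighbour raster grid, a step cost equal to the mean resistance of the two cells times the step length, and a made-up 5x5 surface. The names `least_cost_distance`, `surface`, `site_a`, and `site_b` are hypothetical.

```python
import heapq
import math


def least_cost_distance(resistance, start, goal):
    """Accumulated cost of the cheapest 8-neighbour path between two cells.

    Assumption: a step costs the mean resistance of the two cells times the
    step length (1 for orthogonal moves, sqrt(2) for diagonal moves).
    """
    n_rows, n_cols = len(resistance), len(resistance[0])
    best = {start: 0.0}          # cheapest known cost to reach each cell
    queue = [(0.0, start)]       # Dijkstra priority queue of (cost, cell)
    while queue:
        cost, (r, c) = heapq.heappop(queue)
        if (r, c) == goal:
            return cost
        if cost > best[(r, c)]:
            continue             # stale queue entry, a cheaper route was found
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (dr, dc) == (0, 0):
                    continue
                if not (0 <= nr < n_rows and 0 <= nc < n_cols):
                    continue
                step = math.hypot(dr, dc)
                mean_res = (resistance[r][c] + resistance[nr][nc]) / 2
                new_cost = cost + step * mean_res
                if new_cost < best.get((nr, nc), math.inf):
                    best[(nr, nc)] = new_cost
                    heapq.heappush(queue, (new_cost, (nr, nc)))
    return math.inf              # goal not reachable


# Hypothetical resistance surface: a high-resistance barrier (9) in the
# middle column with a single low-resistance gap in the bottom row.
surface = [
    [1, 1, 9, 1, 1],
    [1, 1, 9, 1, 1],
    [1, 1, 9, 1, 1],
    [1, 1, 9, 1, 1],
    [1, 1, 1, 1, 1],
]
site_a, site_b = (0, 0), (0, 4)

# Cost of walking straight along the top row, through the barrier.
straight_row_cost = sum(
    (surface[0][c] + surface[0][c + 1]) / 2 for c in range(4)
)
lcp_cost = least_cost_distance(surface, site_a, site_b)

print(f"straight-line cost: {straight_row_cost:.2f}")
print(f"least-cost path:    {lcp_cost:.2f}")
```

On this toy surface the cheapest route detours through the gap rather than crossing the barrier, so the least-cost distance is lower than the straight-line cost even though the path is longer. This illustrates why effective landscape distance can differ from geographic distance as a predictor of genetic distance.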
| |
Keywords: | landscape genetics, random forest, vector control, invasive species, gene flow |