Prediction of Dengue Fever Cases in South America
The project developed and compared traditional statistical and machine learning models for predicting dengue outbreaks in Bolivia and Nicaragua from climatic, socioeconomic, and historical dengue case data. The best model achieved a mean absolute error (MAE) of 19.05 for Bolivia and 41.82 for Nicaragua.
The report aims to evaluate and compare the effectiveness of traditional statistical models and machine learning models in predicting dengue outbreaks in Bolivia and Nicaragua. The primary goal is to use regional and meteorological data to forecast the number of dengue cases and facilitate early detection and prevention.
Key Findings
The best predictive model achieved an average mean absolute error (MAE) of 19.05 for Bolivia and 41.82 for Nicaragua.
Both traditional statistical models and machine learning models can accurately predict dengue cases, highlighting their potential for early detection and prevention in various regions.
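For readers unfamiliar with the metric, the MAE quoted above is the average absolute difference between observed and predicted case counts. The sketch below is a minimal illustration; the function name and the sample numbers are hypothetical and are not taken from the project data.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between observed and predicted case counts."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one observation is required")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical weekly case counts for one region (not project data).
observed = [12, 30, 45, 20]
forecast = [10, 35, 40, 26]
mae = mean_absolute_error(observed, forecast)
```

Because the MAE is expressed in cases per week, its size should be read against the typical case counts of the region being forecast.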
Introduction
Dengue is a viral infection transmitted by Aedes mosquitoes. It can cause mild illness or progress to severe dengue, leading to shock, internal bleeding, and death. Early detection and access to medical care can reduce fatality rates for severe dengue to less than 1%. Approximately half of the world's population is at risk, with an estimated 100 to 400 million infections annually.
The project aimed to create a tool to predict dengue cases in a given region using climate and socioeconomic data, along with historical dengue case data. The increasing incidence of dengue worldwide and the necessity for community involvement in dengue case prediction were key motivations.
Dengue fever generally occurs in tropical and subtropical climates worldwide, particularly in urban and semi-urban areas. This makes it a significant health concern in many countries in Asia and Latin America, where it can be a leading cause of serious illness and fatalities. The economic status and sanitation measures of a region also play critical roles in the spread of dengue, so prediction tools must account for economic as well as climatic factors.
Figure 1. Map of major outbreak regions
Project Idea & Implementation
The project, part of a six-credit Data Science course, was developed in partnership with HPE over six months. The objective was to create a solution to predict dengue cases using climatic and socioeconomic data and previous dengue case records. Weekly meetings with the advisor and HPE team provided valuable feedback and guidance. In addition, the results were presented at a conference.
Figure 2. Explanation of key findings
The project timeline was structured to ensure systematic progress, with tasks divided among team members and weekly updates presented to the advisor and HPE team. This collaborative approach allowed for iterative development and refinement of models based on real-world feedback and expert insights.
Research Methodology
Country Selection and Dataset
Bolivia and Nicaragua were selected based on data granularity, regional division, and the number of years of data available. Data on dengue cases and climatic variables were collected, and exploratory data analyses were conducted to assess the correlation between dengue cases and climatic variables.
Initial research focused on identifying countries with detailed, high-quality data on dengue cases and climatic conditions. Bolivia and Nicaragua provided comprehensive datasets that met our criteria, offering regional data with weekly granularity over several years. This allowed for a robust analysis of dengue trends and their correlation with climatic factors.
Data Collection and Preprocessing
Dengue case data were obtained from the PLISA database, and climatic data were collected using NASA’s POWER API. The data were preprocessed to align temporal frequencies, group by regions, remove duplicates, and transform the number of cases for model input.
Figure 3. DAVe: NASA’s POWER API
Preprocessing steps included aggregating climatic data to match the weekly format of dengue case data, ensuring consistency across datasets. Data cleaning involved handling missing values, outlier detection, and transforming variables to enhance model performance. The logarithmic transformation of dengue cases improved correlations with climatic variables, facilitating better predictive modeling.
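The aggregation and transformation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: it assumes daily climatic readings are averaged per ISO week (the report does not say which week definition was used) and that the logarithmic transformation is log(1 + cases) so that zero-case weeks remain defined. The function names and sample values are hypothetical.

```python
import math
from collections import defaultdict
from datetime import date, timedelta


def weekly_means(daily_values):
    """Average daily readings per ISO week.

    daily_values: iterable of (date, float) pairs.
    Returns {(iso_year, iso_week): mean_value}.
    """
    buckets = defaultdict(list)
    for day, value in daily_values:
        iso = day.isocalendar()
        buckets[(iso[0], iso[1])].append(value)
    return {week: sum(vals) / len(vals) for week, vals in buckets.items()}


def log_cases(cases):
    """log(1 + cases); assumed variant that stays defined for zero-case weeks."""
    return math.log1p(cases)


# Hypothetical daily temperatures (degrees C) for 14 days from Monday 2020-01-06.
start = date(2020, 1, 6)
daily_temp = [(start + timedelta(days=i), 25.0 + i) for i in range(14)]
weekly_temp = weekly_means(daily_temp)
```

Keying the result by (year, week) makes it straightforward to join the weekly climate values to weekly case counts for the same region.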
Predictive Models
Statistical Learning Models
ARIMA: Accounts for autocorrelation and handles non-stationary data.
SARIMA: Extends ARIMA with seasonal terms, making it suitable for the seasonal pattern of dengue cases.
Both ARIMA and SARIMA models were chosen for their effectiveness in time series forecasting. ARIMA handles non-stationary data by applying differencing, while SARIMA adds seasonal components, crucial for capturing the periodic nature of dengue outbreaks.
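The differencing idea can be illustrated without any forecasting library. The sketch below is only an illustration of the "integrated" step, not a full ARIMA or SARIMA fit: ordinary differencing removes a trend, and seasonal differencing at the period length removes a repeating pattern. The series and the period of 4 are hypothetical; weekly dengue data would more plausibly use a period of about 52.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    if lag < 1 or lag >= len(series):
        raise ValueError("lag must be between 1 and len(series) - 1")
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Hypothetical series: linear trend plus a pattern repeating every 4 steps.
season = [0, 5, 10, 5]
series = [2 * t + season[t % 4] for t in range(12)]

# First-order differencing, as in the d = 1 step of ARIMA.
trend_removed = difference(series, lag=1)

# Seasonal differencing, as in the D = 1 step of SARIMA with period 4.
season_removed = difference(series, lag=4)
```

In this toy series the seasonally differenced values are constant, which shows that both the repeating pattern and the changing level have been reduced to a fixed offset.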
Machine Learning Models
Linear Regression, SVR, Random Forest Regressor, Lasso, Ridge, K-Nearest Neighbors, Gradient Boosting, XGBoost, LightGBM, and CatBoost: Evaluated for their suitability in handling non-linear patterns and large datasets.
Machine learning models were selected based on their ability to capture complex relationships in the data. Ensemble methods like Random Forest and Gradient Boosting were included for their robustness, while SVR and K-Nearest Neighbors provided non-linear modeling capabilities.
Neural Networks
Simple LSTM: Utilizes gates to control data flow, capturing long-term dependencies.
Multi-LSTM: Stacks multiple LSTM layers, enhancing the model's ability to learn complex temporal patterns.
Neural networks, particularly LSTM models, were implemented to leverage their strength in handling sequential data. The Multi-LSTM model's layered approach allowed for capturing deeper temporal patterns, improving prediction accuracy.
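To make the gating mechanism concrete, the sketch below implements one time step of a single-unit LSTM cell with scalar input and state, using the standard forget, input, and output gates. It is an illustration only: the project's networks would have been built with a deep learning framework, and the weights shown are arbitrary, untrained values chosen for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single-unit LSTM cell with scalar input and state.

    params maps each gate name ("f", "i", "o", "g") to a tuple
    (input_weight, recurrent_weight, bias).
    """
    def pre(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("f"))      # forget gate: how much old cell state to keep
    i = sigmoid(pre("i"))      # input gate: how much of the candidate to write
    o = sigmoid(pre("o"))      # output gate: how much cell state to expose
    g = math.tanh(pre("g"))    # candidate cell update
    c = f * c_prev + i * g     # new cell state (long-term memory)
    h = o * math.tanh(c)       # new hidden state (output at this step)
    return h, c


# Arbitrary illustrative weights, not trained values.
params = {"f": (0.5, 0.1, 0.0), "i": (0.4, 0.2, 0.0),
          "o": (0.3, 0.1, 0.0), "g": (0.6, 0.2, 0.0)}

h, c = 0.0, 0.0
for x in [0.2, 0.5, 0.9]:      # a short, already-scaled input sequence
    h, c = lstm_step(x, h, c, params)
```

A stacked (multi-layer) LSTM feeds the hidden state h of one layer as the input x of the next layer at each time step.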
Pipeline and Grid Search
A preprocessing pipeline standardized data inputs for machine learning models. A grid search was then run over candidate parameters to select the best-performing configuration for each model.
The pipeline included time shift selection, dummy variable encoding, feature selection, and data scaling. The grid search process iterated through combinations of model parameters to identify the most effective configuration, enhancing predictive accuracy.
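The time shift and grid search steps can be sketched in a few lines. This is a simplified illustration under stated assumptions: the "model" is a hypothetical one-parameter predictor (a weight applied to the lagged case count), the grid values and the series are invented, and scoring is done in-sample for brevity, whereas a real workflow would score on held-out weeks. Dummy encoding, feature selection, and scaling are omitted.

```python
from itertools import product


def make_lagged(series, lag):
    """Pair each target y[t] with the feature y[t - lag]."""
    return [(series[t - lag], series[t]) for t in range(lag, len(series))]


def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def grid_search(series, grid):
    """Try every parameter combination and keep the one with the lowest MAE."""
    best_params, best_score = None, float("inf")
    for lag, weight in product(grid["lag"], grid["weight"]):
        pairs = make_lagged(series, lag)
        actual = [target for _, target in pairs]
        predicted = [weight * feature for feature, _ in pairs]
        score = mae(actual, predicted)
        if score < best_score:
            best_params, best_score = {"lag": lag, "weight": weight}, score
    return best_params, best_score


# Hypothetical series in which each value is 1.5 times the value two steps back.
cases = [4.0, 6.0, 6.0, 9.0, 9.0, 13.5, 13.5, 20.25]
grid = {"lag": [1, 2, 3], "weight": [0.5, 1.0, 1.5]}
best_params, best_score = grid_search(cases, grid)
```

Treating the lag as one of the searched parameters mirrors the idea of time shift selection: the search itself determines how far back the most useful history lies.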
Figure 4. Roadmap followed
Results
Bolivia
Best Model: Stacked Regressor with an MAE of 19.07.
Findings: The lag of dengue cases significantly influenced predictions, with climatic variables having a minor impact.
The Stacked Regressor model outperformed others, integrating multiple regression techniques. The results highlighted the importance of historical dengue data, while climatic variables provided marginal improvements.
Figure 4. Comparison of the observed and predicted numbers of dengue cases from the Stacked Regressor model in the regions of Beni, La Paz, Santa Cruz, and Chuquisaca.
Nicaragua
Best Model: SVR with an MAE of 19.93.
Findings: Similar to Bolivia, the lag of dengue cases was a crucial predictor, and climatic variables had limited influence.
SVR demonstrated strong predictive capabilities, with results reinforcing the significance of past dengue cases. The findings suggested a consistent pattern across both countries, emphasizing historical trends over climatic inputs.
Figure 5. Comparison of the observed and predicted numbers of dengue cases from the SVR model in the regions of Managua, Estelí, Carazo Juan and Masaya.
Conclusions
The study demonstrated the effectiveness of combining traditional statistical models and machine learning models for predicting dengue cases. While the lag of dengue cases was the most influential predictor, climatic variables contributed minimally to improving model accuracy. Future research should explore integrating more nuanced data and refining model parameters to enhance predictive accuracy.
The research underscored the potential of predictive modeling in public health, enabling early interventions and resource allocation. However, the limited impact of climatic variables suggests the need for additional factors, such as socioeconomic data, to improve model robustness.
Future Steps
Future research directions include broadening the scope of data to capture a more comprehensive picture of dengue dynamics. Integrating socioeconomic factors and exploring sophisticated neural network designs can further enhance prediction accuracy. Improved model interpretability will support actionable insights for public health officials.