Abstract
This paper investigates the predictability of violent conflict in Africa using machine learning techniques applied to satellite-derived environmental and socioeconomic data across 50km × 50km grid cells from 2019–2023. We implement and compare four modeling approaches—logistic regression, K-nearest neighbors, random forest, and neural networks—trained on climate variables, wealth indicators, and historical conflict patterns to predict binary conflict occurrence. Our Conservative ensemble model achieves 72.3% recall and 52.0% precision, demonstrating that environmental stressors and socioeconomic conditions can meaningfully predict conflict events. However, the precision-recall tradeoff highlights the challenge of developing early warning systems that balance conflict detection with operational feasibility for policymakers and humanitarian organizations.
Introduction
Conflict remains a major barrier to development, stability, and humanitarian well-being, particularly in regions affected by environmental stress and resource scarcity. Understanding and predicting the spatial and temporal dynamics of conflict events can provide critical insights for policymakers, aid organizations, and local communities to better allocate resources, anticipate humanitarian crises, and design targeted interventions.
Africa, given its complex dynamics of resource-driven conflicts, economic inequalities, and climate-induced stress, serves as an ideal context to explore how environmental and socioeconomic factors interact and influence conflict risks. Our research question: Can the occurrence of violence in a given geography be predicted using local socioeconomic and environmental data?
Relevance
Our project is part of a broader research agenda using data and predictive modelling to measure the likelihood of conflict. Notable efforts include the Violence and Impacts Early Warning System (VIEWS), led by Uppsala University and PRIO, which provides monthly predictions on worldwide fatalities from state-based violence at the grid level. Significant resources are directed towards diplomatic and peacekeeping efforts to prevent escalation and establish accurate early alert systems. Because on-the-ground data is hard to collect, automated approaches that draw on satellite imagery, maps, media, and official statistics promise to improve planning and the strategic use of resources at the subnational level. A key challenge remains: over-predicting conflict is understandable given the high cost of false negatives, but responders with limited resources cannot act on every alert.
Methodology
We divide the African continent into 50km × 50km grid cells and match environmental and socioeconomic features from satellite and auxiliary datasets to each cell. We implement four modeling approaches—logistic regression, K-nearest neighbors, random forest, and neural networks—trained on climate variables, wealth indicators, and historical conflict patterns to predict binary conflict occurrence.
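The gridding step can be illustrated with a short sketch. The text does not state which projection defines the grid, so the equirectangular projection anchored at (0°, 0°), the function name `grid_cell`, and the sample coordinates below are all assumptions made for illustration. Under this projection, cells are close to 50 km wide only near the equator.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius
CELL_SIZE_KM = 50.0       # grid resolution stated in the text


def grid_cell(lat_deg, lon_deg):
    """Map a coordinate to a (row, col) index on a 50 km grid.

    Assumption: plain equirectangular projection anchored at (0, 0),
    so east-west cell width shrinks away from the equator.
    """
    y_km = EARTH_RADIUS_KM * math.radians(lat_deg)
    x_km = EARTH_RADIUS_KM * math.radians(lon_deg)
    return math.floor(y_km / CELL_SIZE_KM), math.floor(x_km / CELL_SIZE_KM)


# Two nearby (hypothetical) event locations land in the same cell,
# so their features and labels would be aggregated together.
cell_a = grid_cell(9.05, 7.49)
cell_b = grid_cell(9.10, 7.52)
print(cell_a, cell_b)
```

An equal-area projection would keep cell sizes uniform across latitudes; the simpler version is used here only to keep the example dependency-free.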
Our Conservative ensemble combines these models (LR 30%, KNN 25%, RF 35%, NN 10%) to achieve balanced recall and precision. We address class imbalance (~92% non-conflict, ~8% conflict) using SMOTE during training.
Data Pipeline
Our pipeline integrates five data sources—four from satellite imagery and one from ground-referenced records—to build a rich feature set for each 50km grid cell. These environmental and socioeconomic signals capture the conditions that may precede or accompany conflict.
Data Collection
Between 1989 and 2023, violent events across Africa showed a sharp peak in 1994, intermittent decline through 2006, and a steady increase from 2010 onwards. Thirteen of 46 countries comprised the majority of violent events, with DR Congo, Nigeria, and Somalia accounting for more than one-third of all records.
UCDP
Uppsala Conflict Data Program: georeferenced conflict events with coordinates and dates worldwide.
ERA5 Reanalysis
Monthly atmospheric variables—temperature, precipitation, surface pressure—at ~31km resolution.
Meta RWI
Relative Wealth Index: predicts living standards using connectivity data and satellite imagery.
Hansen Forest
Global forest extent and change from 2000 onward; annual forest cover loss.
NASA VIIRS
Nighttime light emissions as a proxy for human settlements and economic activity.
Feature Set
Conflict prevalence: Conflict occurrence (binary label), number of conflict events in prior years.
Climate and environmental: Average and max annual temperature, annual precipitation, surface pressure, forest cover loss (Hansen), temperature anomaly (binary deviation from historical norms).
Socioeconomic: Relative Wealth Index (Meta RWI), nighttime lights intensity, year-over-year change in nighttime lights (economic growth proxy).
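The two derived features in this list can be sketched as follows. The exact anomaly rule is not given in the text, so the threshold of one baseline standard deviation, the function names, and the sample values are assumptions for illustration only.

```python
import statistics


def yoy_change(values):
    """Year-over-year differences for a list of annual values."""
    return [curr - prev for prev, curr in zip(values, values[1:])]


def temperature_anomaly(value, baseline, n_std=1.0):
    """Binary flag: 1 if `value` deviates from the baseline mean by more
    than `n_std` baseline standard deviations (assumed rule), else 0."""
    mean = statistics.mean(baseline)
    std = statistics.pstdev(baseline)
    return int(abs(value - mean) > n_std * std)


# Hypothetical values for a single grid cell.
lights = [10.0, 12.5, 11.0]               # annual nighttime light intensity
baseline_temps = [24.0, 25.0, 26.0, 25.0]  # historical annual mean temperatures

print(yoy_change(lights))
print(temperature_anomaly(26.5, baseline_temps))
```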
Final Dataset
We derive two additional features: year-over-year change in nighttime lights (economic proxy) and temperature anomalies (deviations from long-term norms). The final dataset has 34,015 grid-year observations, of which ~92% are non-conflict and ~8% are conflict. Data cleaning includes checking for missing values, enforcing consistent data types and units, and normalizing where necessary. We use SMOTE to address class imbalance during training.
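The core idea of SMOTE is to create synthetic minority-class rows by interpolating between a minority row and one of its nearest minority neighbours. The sketch below is a simplified, dependency-free illustration of that idea, not the library implementation used in the study; the function name `smote_like`, the value of `k`, and the toy data are assumptions.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create `n_new` synthetic rows from the minority class.

    Each synthetic row lies on the segment between a randomly chosen
    minority row and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority rows by Euclidean distance to `base`.
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: math.dist(base, row))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class: four conflict rows with two features each.
conflict_rows = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_rows = smote_like(conflict_rows, n_new=6)
print(len(new_rows))
```

Consistent with the text, oversampling of this kind belongs in the training split only, so that evaluation data keeps the original class balance.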
Why Multiple Models?
We compare linear and non-linear approaches because conflict prediction involves both interpretable drivers (e.g., wealth, climate) and complex spatial patterns. No single model dominates: logistic regression offers interpretability, KNN captures geography, random forest provides feature importance, and neural networks learn non-linear interactions. Our Conservative ensemble balances these strengths for early-warning applications where both recall and precision matter.
1. Logistic Regression (ElasticNet)
Interpretable baseline. We model conflict probability via the sigmoid of a linear combination of features:
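The equation is not reproduced in this text. A standard form consistent with the description, using weights \(w\) and bias \(b\) as assumed notation, is:

\[
P(y_i = 1 \mid x_i) = \sigma(w^\top x_i + b) = \frac{1}{1 + e^{-(w^\top x_i + b)}}
\]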
We minimize the negative log-likelihood plus ElasticNet regularization:
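The objective itself is missing from this text. One common parameterization, assumed here to match a setup where \(C\) is the inverse penalty strength and \(\alpha\) is the L1/L2 mix, with \(p_i = \sigma(w^\top x_i + b)\), is:

\[
\mathcal{L}(w, b) = -\sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right] + \frac{1}{C} \left( \alpha \lVert w \rVert_1 + \frac{1 - \alpha}{2} \lVert w \rVert_2^2 \right)
\]

Setting \(\alpha = 0\) removes the L1 term and leaves a pure L2 (ridge) penalty.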
Hyperparameters: C (inverse penalty strength), α (L1/L2 mix). Best: α=0 (L2 only), C=10.
2. K-Nearest Neighbors
Geography-based: conflict tends to cluster. We select the k nearest neighbors \(S_k\) by distance, then assign the majority class with distance-weighted voting:
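The voting rule is not reproduced in this text. A standard distance-weighted form, with \(y_j\) the label of neighbor \(j\) and \(w_j\) its weight (assumed notation), is:

\[
\hat{y}(x) = \arg\max_{c \in \{0, 1\}} \sum_{j \in S_k} w_j \, \mathbb{1}[y_j = c]
\]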
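Distance metrics: Euclidean \(D_{\text{euc}}(x_i,x_j) = \sqrt{\sum_m (x_{im}-x_{jm})^2}\) or cosine distance \(D_{\cos}(x_i,x_j) = 1 - \frac{x_i \cdot x_j}{\lVert x_i \rVert \, \lVert x_j \rVert}\) (one minus cosine similarity). Best: k=3, cosine distance, distance weights \(w_j = 1/d_j\).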
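A minimal sketch of distance-weighted KNN with the reported settings (k=3, weights \(1/d_j\)), using cosine distance, that is, one minus cosine similarity. The function names, the small `eps` guard against division by zero, and the toy data are assumptions for illustration; the sketch also assumes no feature vector is all zeros.

```python
import math
from collections import defaultdict


def cosine_distance(a, b):
    """One minus cosine similarity; undefined for all-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def knn_predict(train_x, train_y, query, k=3, eps=1e-12):
    """Predict the class with the largest sum of 1/d weights among the
    k nearest training rows. `eps` avoids dividing by a zero distance."""
    dists = sorted(
        (cosine_distance(x, query), y) for x, y in zip(train_x, train_y)
    )
    votes = defaultdict(float)
    for d, label in dists[:k]:
        votes[label] += 1.0 / (d + eps)
    return max(votes, key=votes.get)


# Toy training set: label 1 = conflict, 0 = no conflict.
train_x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
train_y = [1, 1, 0, 0, 0]
print(knn_predict(train_x, train_y, [1.0, 0.05]))
```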
3. Random Forest
Ensemble of B decision trees trained on bootstrapped samples with random feature subsets. Final prediction by majority vote:
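The formula is not reproduced in this text. Writing \(T_b(x) \in \{0, 1\}\) for the class predicted by tree \(b\) (assumed notation), a majority vote can be expressed as:

\[
\hat{y}(x) = \arg\max_{c \in \{0, 1\}} \sum_{b=1}^{B} \mathbb{1}[T_b(x) = c]
\]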
Conflict probability:
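The formula is not reproduced in this text. One common definition, assumed here, is the fraction of trees voting for conflict, where \(T_b(x) \in \{0, 1\}\) is the class predicted by tree \(b\):

\[
\hat{p}(x) = \frac{1}{B} \sum_{b=1}^{B} \mathbb{1}[T_b(x) = 1]
\]

Some libraries instead average the per-tree class probabilities, which gives similar but not identical values.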
Best: 500 trees, unlimited depth. Feature importance is measured by the mean decrease in Gini impurity.
4. Neural Network
Fully connected 64 → 32 → 1 with ReLU and sigmoid output:
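The forward pass is not reproduced in this text. For the stated architecture it can be written as follows, with weight matrices \(W_1, W_2\), output weights \(w_3\), and biases \(b_1, b_2, b_3\) as assumed notation:

\[
h_1 = \mathrm{ReLU}(W_1 x + b_1) \in \mathbb{R}^{64}, \qquad
h_2 = \mathrm{ReLU}(W_2 h_1 + b_2) \in \mathbb{R}^{32}, \qquad
\hat{p} = \sigma(w_3^\top h_2 + b_3)
\]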
Binary Cross-Entropy loss:
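The loss is not reproduced in this text. The standard form, averaged over \(N\) training examples with labels \(y_i\) and predicted probabilities \(\hat{p}_i\) (assumed notation), is:

\[
\mathcal{L}_{\text{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i) \right]
\]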
SGD with momentum 0.9, 100 epochs, batch size 128.
5. Conservative Ensemble
Weighted average of predicted probabilities, threshold 0.5 for classification:
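The formula is not reproduced in this text. With the weights given in the Methodology section (LR 30%, KNN 25%, RF 35%, NN 10%), the weighted average and decision rule can be written as:

\[
\hat{p}_{\text{ens}}(x) = 0.30\,\hat{p}_{\text{LR}}(x) + 0.25\,\hat{p}_{\text{KNN}}(x) + 0.35\,\hat{p}_{\text{RF}}(x) + 0.10\,\hat{p}_{\text{NN}}(x), \qquad
\hat{y}(x) = \mathbb{1}[\hat{p}_{\text{ens}}(x) \ge 0.5]
\]

A minimal sketch of the same rule follows. The dictionary keys, function names, and sample probabilities are hypothetical, and treating a score of exactly 0.5 as conflict is an assumption.

```python
# Weights from the Methodology section; they sum to 1.
WEIGHTS = {"lr": 0.30, "knn": 0.25, "rf": 0.35, "nn": 0.10}


def ensemble_probability(probs):
    """Weighted average of per-model conflict probabilities."""
    return sum(WEIGHTS[name] * probs[name] for name in WEIGHTS)


def ensemble_predict(probs, threshold=0.5):
    """Return 1 (conflict) if the weighted probability reaches the threshold."""
    return int(ensemble_probability(probs) >= threshold)


# Hypothetical per-model probabilities for one grid-year observation.
probs = {"lr": 0.60, "knn": 0.67, "rf": 0.40, "nn": 0.90}
print(ensemble_probability(probs), ensemble_predict(probs))
```

In this example the random forest alone would predict no conflict, but the weighted average across all four models still crosses the threshold.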
Results
Conservative ensemble: 72.3% recall, 52.0% precision, 92.3% accuracy. Environmental stressors and socioeconomic conditions meaningfully predict conflict. The precision-recall tradeoff reflects the challenge of early-warning systems: balancing detection with operational feasibility.
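The three reported metrics all follow from confusion-matrix counts. The sketch below uses illustrative counts, not the study's confusion matrix, chosen at roughly the dataset's 8% conflict rate to show how moderate precision can coexist with high accuracy when the negative class dominates.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (precision, recall, accuracy) from confusion-matrix counts."""
    precision = tp / (tp + fp)   # share of alerts that were real conflicts
    recall = tp / (tp + fn)      # share of real conflicts that were flagged
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy


# Illustrative counts: 100 conflict and 1,150 non-conflict observations.
precision, recall, accuracy = classification_metrics(tp=72, fp=66, fn=28, tn=1084)
print(round(precision, 3), round(recall, 3), round(accuracy, 3))
```

Because true negatives dominate the total, accuracy stays high even though nearly half of the alerts in this toy example are false alarms, which is why recall and precision are the more informative metrics for early warning.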