Optimized Rainfall Imputation Using ERA5-Land and Tree-Based Machine Learning: A Scalable Framework for Data-Sparse Regions
Author(s)
Salaeh, Nureehan
Pham, Quoc Bao
Wipulanusat, Warit
Weesakul, Uruya
Suksuwan, Nukul
Nam Thai, Van
Kader, Shuraik
Tariq, Aqil
Ditthakit, Pakorn
Abstract
Incomplete rainfall records are common in many regions, which makes reliable data imputation a substantial challenge for hydrologists. This study presents a novel systematic framework for optimally imputing daily rainfall by integrating ERA5-Land data with observational data and applying tree-based machine learning (ML) algorithms. Applied to Thailand’s southern basin (TSB), the framework comprises five main steps: data collection, regionalization (clustering and homogeneity analysis), feature selection, model development (hyperparameter optimization, model training, and testing), and performance comparison. Regionalization, used as a preliminary feature selection step, enhanced data homogeneity and identified three clusters for the TSB dataset, as verified by the Fligner–Killeen and Brown–Forsythe tests. ERA5-Land significantly overestimates precipitation, particularly during high-rainfall periods, but quantile transformation (QT) effectively corrects these biases, aligning ERA5-Land distributions with observations and improving accuracy, especially at lower quantiles. In the feature selection comparison, the genetic algorithm (GA) retained more features, whereas BorutaShap isolated the critical features, reducing redundancy and achieving slightly better performance, particularly with random forest (RF). Hyperparameter tuning showed that simpler models such as RF and extra trees (ET) performed well even with default settings, whereas extreme gradient boosting (XGBoost) required precise tuning to maximize performance. In the model performance evaluation, QT-corrected ERA5-Land data significantly improved imputation accuracy, with ET outperforming RF and XGBoost even under high levels of missing data. This study highlights the critical role of integrating bias-corrected datasets and advanced ML models for rainfall imputation in data-scarce regions.
The proposed framework offers a scalable and reproducible methodology that can be adapted to other areas facing similar challenges, providing the global scientific community with a practical solution for enhancing hydrological data reliability and improving water resource management strategies.
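The bias-correction step can be illustrated with a small sketch. The abstract does not specify how QT is implemented, so the code below assumes one common variant, empirical quantile mapping: each ERA5-Land value is converted to its non-exceedance probability in the ERA5-Land training sample and then replaced by the observed value at the same probability. The function names and the rainfall values are hypothetical and are not taken from the study.

```python
from bisect import bisect_right


def empirical_cdf(sorted_sample, x):
    """Fraction of values in sorted_sample that are <= x."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)


def quantile(sorted_sample, p):
    """Linearly interpolated quantile of a sorted sample, p in [0, 1]."""
    pos = p * (len(sorted_sample) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_sample) - 1)
    frac = pos - lo
    return sorted_sample[lo] * (1 - frac) + sorted_sample[hi] * frac


def quantile_map(value, reanalysis_train, observed_train):
    """Map a reanalysis value onto the observed distribution (assumed QT variant)."""
    rea = sorted(reanalysis_train)
    obs = sorted(observed_train)
    p = empirical_cdf(rea, value)  # probability of the value within the reanalysis sample
    return quantile(obs, p)        # observed rainfall at that same probability


# Hypothetical daily rainfall (mm) on overlapping days at one station.
era5_land = [0.5, 1.0, 2.0, 4.0, 8.0, 12.0, 20.0, 30.0]
gauge = [0.0, 0.0, 1.0, 2.0, 5.0, 9.0, 14.0, 22.0]

corrected = [quantile_map(v, era5_land, gauge) for v in era5_land]
for raw, fixed in zip(era5_land, corrected):
    print(f"ERA5-Land {raw:5.1f} mm -> corrected {fixed:5.2f} mm")
```

In this toy sample the reanalysis values exceed the gauge values at every rank, so the mapping lowers each value while preserving their order. In the framework described above, a corrected series of this kind would then serve as a predictor for the tree-based imputation models.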
Journal Title
Earth Systems and Environment
Note
This publication has been entered in Griffith Research Online as an advance online version.
Subject
Machine learning
Hydrology
Groundwater hydrology
Data structures and algorithms
Citation
Pinthong, S; Salaeh, N; Pham, QB; Wipulanusat, W; Weesakul, U; Suksuwan, N; Nam Thai, V; Kader, S; Tariq, A; Ditthakit, P, Optimized Rainfall Imputation Using ERA5-Land and Tree-Based Machine Learning: A Scalable Framework for Data-Sparse Regions, Earth Systems and Environment, 2025