Optimized Rainfall Imputation Using ERA5-Land and Tree-Based Machine Learning: A Scalable Framework for Data-Sparse Regions

Author(s)
Pinthong, Sirimon
Salaeh, Nureehan
Pham, Quoc Bao
Wipulanusat, Warit
Weesakul, Uruya
Suksuwan, Nukul
Nam Thai, Van
Kader, Shuraik
Tariq, Aqil
Ditthakit, Pakorn
Date
2025
Abstract

Hydrological experts face substantial challenges in obtaining reliable data imputations because incomplete rainfall records are prevalent in many regions. This study presents a novel systematic framework for optimally imputing daily rainfall by integrating ERA5-Land data with observational data using tree-based machine learning (ML) algorithms. Applied to Thailand’s southern basin (TSB), the framework comprises five main steps: data collection, regionalization (clustering and homogeneity analysis), feature selection, model development (hyperparameter optimization, model training, and testing), and performance comparison. Regionalization, used as a preliminary feature selection step, enhanced data homogeneity and identified three clusters for the TSB dataset, as verified by the Fligner–Killeen and Brown–Forsythe tests. ERA5-Land significantly overestimates precipitation, particularly during high-rainfall periods, but quantile transformation (QT) effectively corrects these biases, aligning ERA5-Land distributions with observations and improving accuracy, especially at lower quantiles. In the feature selection comparison, the genetic algorithm (GA) retained more features, whereas BorutaShap identified the critical features, reducing redundancy and achieving slightly better performance, particularly with random forest (RF). Hyperparameter tuning showed that simpler models such as RF and extra trees (ET) performed well even with default settings, whereas extreme gradient boosting (XGBoost) required precise tuning to maximize performance. QT-corrected ERA5-Land data significantly improved imputation accuracy, with ET outperforming RF and XGBoost even under high levels of missing data. This study highlights the critical role of integrating bias-corrected datasets and advanced ML models for rainfall imputation in data-scarce regions. The proposed framework offers a scalable and reproducible methodology that can be adapted to other areas facing similar challenges, providing the global scientific community with a practical solution for enhancing hydrological data reliability and improving water resource management strategies.
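
The abstract does not say how the quantile transformation (QT) step was implemented. As an illustration only, the sketch below shows empirical quantile mapping, one common way to align a reanalysis distribution with gauge observations. The function names (`quantile`, `fit_quantile_mapping`) and the rainfall values are hypothetical and are not taken from the study.

```python
from bisect import bisect_right


def quantile(sorted_sample, p):
    """Linearly interpolated quantile of an ascending sample, p in [0, 1]."""
    pos = p * (len(sorted_sample) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_sample) - 1)
    weight = pos - lower
    return sorted_sample[lower] * (1.0 - weight) + sorted_sample[upper] * weight


def fit_quantile_mapping(reanalysis, observed):
    """Return a function that maps reanalysis values onto the observed distribution.

    Assumption: both samples cover the same calibration period, so their
    empirical distributions are comparable.
    """
    if not reanalysis or not observed:
        raise ValueError("both samples must be non-empty")
    rea_sorted = sorted(reanalysis)
    obs_sorted = sorted(observed)
    denom = max(len(rea_sorted) - 1, 1)

    def correct(value):
        # Empirical non-exceedance probability of `value` in the reanalysis sample.
        rank = bisect_right(rea_sorted, value)
        p = min(max(rank - 1, 0) / denom, 1.0)
        # Read off the observed value at the same probability.
        return quantile(obs_sorted, p)

    return correct


# Made-up daily rainfall totals in mm (illustrative, not data from the study).
era5_train = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 20.0]
gauge_train = [0.0, 0.0, 1.0, 2.0, 3.0, 5.0, 6.0, 8.0, 10.0, 15.0]

correct = fit_quantile_mapping(era5_train, gauge_train)
corrected = [correct(x) for x in era5_train]
```

In this sketch, reanalysis values above the largest calibration value are capped at the largest observed value, so extreme events would need a separate extrapolation rule. The corrected series could then serve as an input feature to a tree-based regressor.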

Journal Title

Earth Systems and Environment

Note

This publication has been entered in Griffith Research Online as an advance online version.

Subject

Machine learning

Hydrology

Groundwater hydrology

Data structures and algorithms

Citation

Pinthong, S; Salaeh, N; Pham, QB; Wipulanusat, W; Weesakul, U; Suksuwan, N; Nam Thai, V; Kader, S; Tariq, A; Ditthakit, P, Optimized Rainfall Imputation Using ERA5-Land and Tree-Based Machine Learning: A Scalable Framework for Data-Sparse Regions, Earth Systems and Environment, 2025
