An exploration of challenges associated with machine learning for time series forecasting of COVID-19 community spread using wastewater-based epidemiological data

File version

Accepted Manuscript (AM)

Author(s)
Vaughan, Liam
Zhang, Muyang
Gu, Haoran
Rose, Joan
Naughton, Colleen
Medema, Gertjan
Allan, Vajra
Roiko, Anne
Blackall, Linda
Zamyadi, Arash
Date
2022
Abstract

Wastewater-based epidemiology (WBE) has gained increasing attention as a complementary tool to conventional surveillance methods, with potential for significant resource and labour savings when used for public health monitoring. Using WBE datasets to train machine learning algorithms and develop predictive models may also facilitate early warnings for the spread of outbreaks. The challenges associated with implementing Random Forest (RF) for time series forecasting of COVID-19 were evaluated by running RF on WBE datasets across 108 sites in five regions: Scotland, Catalonia, Ohio, the Netherlands, and Switzerland. WBE uses measurements of SARS-CoV-2 RNA fragment concentration in samples taken at the inlets of wastewater treatment plants, providing insight into the prevalence of infection in upstream wastewater catchment populations. RF's forecasting performance at each site was quantitatively evaluated by determining mean absolute percentage error (MAPE) values, which were used to highlight challenges affecting future implementations of RF for WBE forecasting efforts. Performance was generally poor using WBE datasets from Catalonia, Scotland, and Ohio, with 'reasonable' or better forecasts constituting 0 %, 5 %, and 0 % of these regions' forecasts, respectively. RF's performance was much stronger with WBE data from the Netherlands and Switzerland, which provided 55 % and 45 % 'reasonable' or better forecasts, respectively. Sampling frequency and training set size were identified as key factors contributing to accuracy, while inclusion of unnecessary variables (e.g., flow data) was identified as a contributing factor to poor performance. The effect of catchment population on forecast accuracy was more ambiguous. This study determined that the factors governing RF's forecast performance are complicated and interrelated, which presents challenges for further work in this space.
A sufficiently accurate further iteration of the tool discussed within this study would provide significant but varying value for public health departments monitoring future or ongoing outbreaks, assisting the implementation of timely health response measures.
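The MAPE metric named in the abstract can be illustrated with a minimal sketch. It uses the standard definition of mean absolute percentage error and is not the authors' code. The function names `mape` and `share_at_or_below`, the choice to skip zero-valued observations, the `threshold` parameter, and all numbers are assumptions made for illustration; the abstract does not state the cut-off used for a 'reasonable' forecast.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, returned in percent.

    Assumption: observations equal to zero are skipped, because the
    percentage error is undefined for them.
    """
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    pairs = [(a, f) for a, f in zip(actual, forecast) if a != 0]
    if not pairs:
        raise ValueError("no non-zero observations to score")
    return 100.0 * sum(abs((a - f) / a) for a, f in pairs) / len(pairs)


def share_at_or_below(site_mapes, threshold):
    """Fraction of sites whose MAPE is at or below a chosen threshold.

    `threshold` is a hypothetical cut-off supplied by the caller.
    """
    if not site_mapes:
        raise ValueError("site_mapes must not be empty")
    return sum(1 for m in site_mapes if m <= threshold) / len(site_mapes)


if __name__ == "__main__":
    # Made-up RNA concentrations for a single site (illustrative only).
    observed = [100.0, 200.0, 400.0]
    predicted = [110.0, 180.0, 400.0]
    print(round(mape(observed, predicted), 2))

    # Made-up per-site MAPE values for one region, with a made-up cut-off.
    print(share_at_or_below([10.0, 30.0, 60.0, 80.0], threshold=50.0))
```

Because MAPE divides by the observed value, sites with very low RNA concentrations can produce large percentage errors even when the absolute error is small, which is one reason a per-site summary such as the share of forecasts under a cut-off can be easier to compare across regions.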

Journal Title

Science of The Total Environment

Volume

858

Issue

Part 1

Rights Statement

© 2022 Elsevier. Licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Licence (http://creativecommons.org/licenses/by-nc-nd/4.0/) which permits unrestricted, non-commercial use, distribution and reproduction in any medium, providing that the work is properly cited.

Subject

Environmental engineering

Epidemiology

Data structures and algorithms

COVID-19

Machine learning

Time series forecasting

Wastewater-based epidemiology

Citation

Vaughan, L; Zhang, M; Gu, H; Rose, J; Naughton, C; Medema, G; Allan, V; Roiko, A; Blackall, L; Zamyadi, A, An exploration of challenges associated with machine learning for time series forecasting of COVID-19 community spread using wastewater-based epidemiological data, Science of The Total Environment, 2022, 858 (Part 1), pp. 159748
