No 246: New phosphorus model for estimating annual water-weighted concentration of total phosphorus from diffuse sources in ID15 catchments

Larsen, S.E., Kjeldgaard, A., Windolf, J., Tornbjerg, H. & Kronvang, B. 2022. Ny fosformodel til estimering af årlig vandføringsvægtet koncentration af total fosfor fra diffuse kilder i ID15-oplande. Aarhus Universitet, DCE – Nationalt Center for Miljø og Energi, 80 s. - Teknisk rapport nr. 246. http://dce2.au.dk/pub/TR246.pdf

Summary

This report describes statistical analyses, carried out in several steps, to develop a new model for simulating flow-weighted total phosphorus concentrations in streams draining catchments with an average area of approx. 20 km². The final product is a model that predicts the annual concentration of total phosphorus (TP), developed in the machine learning software ‘DataRobot’.

The model was developed on logarithm-transformed data. The data set used for model development comprised 207 catchments with 2389 observations of the flow-weighted annual TP concentration, which served as input for the DataRobot cross-validation. A completely independent data set comprised 142 catchments with 1261 observations of the flow-weighted annual TP concentration.
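As an illustration of the response variable, a flow-weighted concentration can be sketched as the sum of concentration times flow divided by the sum of flow, followed by a log transform. This is a minimal sketch only: the sample values, the units and the choice of the natural logarithm are assumptions, and the summary does not describe how the report itself derives the annual values from monitoring data.

```python
import math


def flow_weighted_concentration(concentrations, flows):
    """Flow-weighted mean concentration: sum(C_i * Q_i) / sum(Q_i)."""
    if len(concentrations) != len(flows):
        raise ValueError("concentrations and flows must have the same length")
    total_flow = sum(flows)
    if total_flow <= 0:
        raise ValueError("total flow must be positive")
    load = sum(c * q for c, q in zip(concentrations, flows))
    return load / total_flow


# Hypothetical samples from one catchment-year (not data from the report).
tp_mg_per_l = [0.08, 0.12, 0.20, 0.10]   # TP concentration, mg P/l
flow_l_per_s = [100.0, 300.0, 500.0, 100.0]  # stream flow at sampling time

c_fw = flow_weighted_concentration(tp_mg_per_l, flow_l_per_s)
c_mean = sum(tp_mg_per_l) / len(tp_mg_per_l)

# Assumption: natural logarithm; the summary only says "logarithm-transformed".
log_c_fw = math.log(c_fw)

print(f"flow-weighted: {c_fw:.3f} mg P/l, arithmetic mean: {c_mean:.3f} mg P/l")
```

In this made-up example the flow-weighted value exceeds the arithmetic mean because the highest concentration coincides with the highest flow.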

The developed machine learning model is of the type ‘eXtreme Gradient Boosted Trees Regressor with Early Stopping’. The model includes a total of 13 explanatory variables. The most significant are the extent of drainage in the catchment, the built-up area in the catchment, the degree of cultivation in the catchment, the extent of soil erosion in the catchment, and the annual deviation in precipitation from a long-term average.
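The general idea behind gradient boosted trees with early stopping can be sketched in plain Python. The toy below is not XGBoost and not DataRobot's implementation: it uses a single explanatory variable, one-split trees, squared-error loss and synthetic data, and all names and parameter values (`learning_rate`, `max_trees`, `patience`) are hypothetical. Each new tree is fitted to the residuals of the current ensemble, and training stops once the validation error has not improved for `patience` rounds.

```python
import random


def fit_stump(xs, residuals):
    """Least-squares regression stump (one split) on a single feature."""
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])
    total = sum(residuals)
    # Fallback: no valid split, predict the mean residual everywhere.
    best_gain, best_stump = -1.0, (float("inf"), total / n, total / n)
    left_sum = 0.0
    for k in range(1, n):
        left_sum += residuals[order[k - 1]]
        lo, hi = xs[order[k - 1]], xs[order[k]]
        if lo == hi:
            continue  # cannot split between identical feature values
        right_sum = total - left_sum
        # Maximising this quantity is equivalent to minimising squared error.
        gain = left_sum ** 2 / k + right_sum ** 2 / (n - k)
        if gain > best_gain:
            best_gain = gain
            best_stump = ((lo + hi) / 2, left_sum / k, right_sum / (n - k))
    return best_stump


def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def boost(x_tr, y_tr, x_val, y_val, learning_rate=0.1, max_trees=500, patience=20):
    """Boost stumps on residuals; stop when validation MSE stalls."""
    base = sum(y_tr) / len(y_tr)
    pred_tr = [base] * len(y_tr)
    pred_val = [base] * len(y_val)
    stumps = []
    best_mse, best_n = mse(y_val, pred_val), 0
    for _ in range(max_trees):
        residuals = [y - p for y, p in zip(y_tr, pred_tr)]
        stump = fit_stump(x_tr, residuals)
        stumps.append(stump)
        pred_tr = [p + learning_rate * stump_predict(stump, x)
                   for p, x in zip(pred_tr, x_tr)]
        pred_val = [p + learning_rate * stump_predict(stump, x)
                    for p, x in zip(pred_val, x_val)]
        val_mse = mse(y_val, pred_val)
        if val_mse < best_mse:
            best_mse, best_n = val_mse, len(stumps)
        elif len(stumps) - best_n >= patience:
            break  # early stopping: no improvement for `patience` rounds
    # Keep only the trees up to the best validation round.
    return base, stumps[:best_n], best_mse


def make_data(n):
    """Synthetic step-function data with noise (purely illustrative)."""
    xs = [random.random() for _ in range(n)]
    ys = [(2.0 if x > 0.5 else 0.0) + random.gauss(0.0, 0.1) for x in xs]
    return xs, ys


random.seed(0)
x_tr, y_tr = make_data(200)
x_val, y_val = make_data(100)
base, stumps, val_mse = boost(x_tr, y_tr, x_val, y_val)
baseline_mse = mse(y_val, [base] * len(y_val))
print(f"trees kept: {len(stumps)}, validation MSE: {val_mse:.4f} "
      f"(mean-only baseline: {baseline_mse:.4f})")
```

A production model would use many explanatory variables, deeper trees and regularisation; the sketch only shows why a separate validation partition is needed to decide when to stop adding trees.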

The three data partitions used to develop the machine learning model in DataRobot had the following degrees of explanation (R²): the training data set comprised 64% of the data (R² = 0.69), the validation data set 16% of the data (R² = 0.71), and the holdout data set 20% of the data (R² = 0.67). In addition, the model was validated against the independent data set, for which the degree of explanation was good both before (R² = 0.62) and after (R² = 0.41) retransformation. This is a much higher degree of explanation than that of the TP model formerly used in NOVANA.
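Why R² can differ before and after retransformation can be illustrated with a small sketch. The observed and predicted values below are invented, the natural logarithm is an assumption, and the back-transform is a plain exponential; the summary does not state whether the report applies a bias correction when retransforming.

```python
import math


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical annual TP concentrations in mg P/l (not data from the report).
obs = [0.05, 0.08, 0.10, 0.15, 0.60]
pred = [0.06, 0.07, 0.11, 0.13, 0.35]

# R² on the log scale, where the model is fitted.
log_obs = [math.log(v) for v in obs]
log_pred = [math.log(v) for v in pred]
r2_log = r_squared(log_obs, log_pred)

# R² on the original scale, after retransformation.
r2_raw = r_squared(obs, pred)

print(f"R2 on log scale: {r2_log:.2f}, after retransformation: {r2_raw:.2f}")
```

On the log scale a miss on the single high value counts as a moderate relative error, whereas on the original scale its large absolute error dominates the sum of squares, which lowers R².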

Switching from the current bias-corrected TP model to the newly developed machine learning TP model presented in this report gave an average decrease in the TP supply to coastal waters in Denmark of approx. 3% in the period 1990-2019. In some years, the switch to the new TP model will change the input to second-order coastal sections, with maximum decreases ranging from 0.9% to 6.9%.

The uncertainty of the new machine learning TP model was calculated from the validation of the model against the completely independent validation data set on type catchments (N = 1261) as well as the calibration data set (N = 2389). Based on the independent validation data, the Root Mean Square Error (RMSE) is estimated to be very small (< 0.2) for the vast majority of the georegions, which indicates a good model. The mean absolute error (MAE) is also relatively small in most georegions (0.003-0.055 mg P/l).
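The two error measures can be sketched per georegion as follows. The region labels and values are hypothetical, and the sketch works on the original concentration scale; the summary does not state on which scale the report computed RMSE.

```python
import math
from collections import defaultdict


def rmse(errors):
    """Root mean square error of a list of prediction errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def mae(errors):
    """Mean absolute error of a list of prediction errors."""
    return sum(abs(e) for e in errors) / len(errors)


# Hypothetical records: (georegion, observed, predicted), in mg P/l.
records = [
    ("A", 0.10, 0.11),
    ("A", 0.12, 0.10),
    ("B", 0.08, 0.09),
    ("B", 0.20, 0.16),
]

# Group prediction errors by georegion.
errors_by_region = defaultdict(list)
for region, observed, predicted in records:
    errors_by_region[region].append(predicted - observed)

metrics = {
    region: {"rmse": rmse(errs), "mae": mae(errs)}
    for region, errs in errors_by_region.items()
}

for region, m in sorted(metrics.items()):
    print(f"georegion {region}: RMSE = {m['rmse']:.4f}, MAE = {m['mae']:.4f}")
```

RMSE is never smaller than MAE for the same errors, and the gap widens when a few large errors dominate, so reporting both gives a rough impression of how unevenly the errors are distributed.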