<- readr::read_csv("https://sdsc.osn.xsede.org/bio230014-bucket01/challenges/forecasts/raw/project_id=usgsrc4cast/T20240813050531_usgsrc4cast-2024-08-12-climatology.csv.gz", show_col_types = FALSE) df
1 tl;dr: How to submit a forecast
We provide an overview of the steps for submitting with the details below:
- Explore the data (e.g., targets) and build your forecast model.
- Register and describe your model at https://forms.gle/kg2Vkpho9BoMXSy57. You are not required to register if your forecast submission uses the word “example” in your model_id”. Any forecasts with “example” in the model_id will not be used in forecast evaluation analyses. Use
usgsrc4cast
as the challenge you are registering for. - Generate a forecast!
- Write the forecast output to a file that follows our standardized format (described below).
- Submit your forecast using an R or python function (provided below).
- Watch your forecast be evaluated as new data are collected.
2 Generating a forecast
2.1 All forecasting approaches are welcome
We encourage you to use any modeling approach to make a prediction about the future conditions at any of the USGS sites.
2.2 Must include uncertainty
Forecasts require you to make an assessment of the confidence in your prediction of the future. You can represent your confidence (i.e., uncertainty in the forecast) using a distribution or numerically using an ensemble (or sample) of predictions.
2.3 Any model drivers/covariates/features are welcome
You can use any data as model inputs (including all of the forecast target data available to date). All target data are available in with a 1 day delay (latency) from time of collection. You may want to use the updated target data to re-train a model or for use in data assimilation.
As a genuine forecasting challenge, you will need forecasted drivers if your model uses drivers as inputs. If you are interested in using forecasted meteorology, we are downloading and processing NOAA Global Ensemble Forecasting System (GEFS) weather forecasts for each USGS site. The NOAA GEFS forecasts extend 35-days ahead. More information about accessing the weather forecasts can be found here
2.4 Forecasts extend out to 30 days in the future
Forecasts can be submitted for 1 to 30 days ahead. See the variable tables for the horizon that is associated with each variable.
2.5 Forecasts can be submitted everyday
Since forecasts can be submitted everyday, automation is important. We provide an example GitHub repository for both R and Python that can be used to automate your forecast with GitHub Actions. It also includes the use of a custom Docker Container for R (eco4cast/rocker-neon4cast:latest) or Python (eco4cast/usgsrc4cast-python:latest) that has many of the packages and functions needed to generate and submit forecasts.
3 You can forecast at any of the USGS sites
If are you are getting started, we recommend a set of focal sites. You are also welcome to submit forecasts to all or a subset of USGS sites . More information about USGS sites can be found in the site metadata and on USGS’s website
4 Forecast file format
The file is a csv format with the following columns:
project_id
: useusgsrc4cast
model_id
: the short name of the model defined as the model_id in your registration. The model_id should have no spaces.model_id
should reflect a method to forecast one or a set of target variables and must be unique to theusgsrc4cast
challenge.datetime
: forecast timestamp. Format%Y-%m-%d %H:%M:%S
with UTC as the time zone. Forecasts submitted with a%Y-%m-%d
format will be converted to a full datetime assuming UTC mid-night.reference_datetime
: The start of the forecast; this should be 0 times steps in the future. There should only be one value ofreference_datetime
in the file. Format is%Y-%m-%d %H:%M:%S
with UTC as the time zone. Forecasts submitted with a%Y-%m-%d
format will be converted to a full datetime assuming UTC mid-night.duration
: the time-step of the forecast. Use the value ofP1D
for a daily forecast,P1W
for a weekly forecast, andPT30M
for 30 minute forecast. This value should match the duration of the target variable that you are forecasting. Formatted as ISO 8601 durationsite_id
: code for USGS site.family
name of the probability distribution that is described by the parameter values in the parameter column (see list below for accepted distribution). An ensemble forecast as a family ofensemble
. See note below about familyparameter
the parameters for the distribution (see note below about the parameter column) or the number of the ensemble members. For example, the parameters for a normal distribution are calledmu
andsigma
.variable
: standardized variable name. It must match the variable name in the target file (e.g.,chla
).prediction
: forecasted value for the parameter in the parameter column
5 Representing uncertainity
Uncertainty is represented through the family - parameter columns in the file that you submit.
5.0.1 Parameteric forecast
For a parametric forecast with a normal distribution, the family
column would have the word normal
to designate a normal distribution and the parameter column must have values of mu
and sigma
for each forecasted variable, site_id, depth, and time combination.
Parameteric forecasts for binary variables should use bernoulli
as the family and prob
as the parameter.
The following names and parameterization of the distribution are currently supported (family: parameters):
lognormal
:mu
,sigma
normal
:mu
,sigma
bernoulli
:prob
beta
:shape1
,shape2
uniform
:min
,max
gamma
:shape
,rate
logistic
:location
,scale
exponential
:rate
poisson
:lambda
If you are submitting a forecast that is not in the supported list, we recommend using the ensemble format and sampling from your distribution to generate a set of ensemble members that represents your forecast distribution.
5.0.2 Ensemble (or sample) forecast
Ensemble (or sample) forecasts use the family
value of ensemble
and the parameter
values are the ensemble index.
When forecasts using the ensemble family are scored, we assume that the ensemble is a set of equally likely realizations that are sampled from a distribution that is comparable to that of the observations (i.e., no broad adjustments are required to make the ensemble more consistent with observations). This is referred to as a “perfect ensemble” by Bröcker and Smith (2007). Ensemble (or sample) forecasts are transformed to a probability distribution function (e.g., dressed) using the default methods in the scoringRules
R package (empirical version of the quantile decomposition for the crps
).
5.1 Example forecasts
Here is an example of a forecast that uses a normal distribution:
|>
df head() |>
::kable() knitr
project_id | model_id | datetime | reference_datetime | duration | site_id | family | parameter | variable | prediction |
---|---|---|---|---|---|---|---|---|---|
usgsrc4cast | climatology | 2024-08-13 | 2024-08-12 | P1D | USGS-01427510 | normal | mu | chla | 1.493576 |
usgsrc4cast | climatology | 2024-08-13 | 2024-08-12 | P1D | USGS-01427510 | normal | sigma | chla | 0.874529 |
usgsrc4cast | climatology | 2024-08-14 | 2024-08-12 | P1D | USGS-01427510 | normal | mu | chla | 1.657496 |
usgsrc4cast | climatology | 2024-08-14 | 2024-08-12 | P1D | USGS-01427510 | normal | sigma | chla | 0.874529 |
usgsrc4cast | climatology | 2024-08-15 | 2024-08-12 | P1D | USGS-01427510 | normal | mu | chla | 1.722410 |
usgsrc4cast | climatology | 2024-08-15 | 2024-08-12 | P1D | USGS-01427510 | normal | sigma | chla | 0.874529 |
For an ensemble (or sample) forecast, the family
column uses the word ensemble
to designate that it is a ensemble forecast and the parameter column is the ensemble member number (1
, 2
, 3
…)
<- readr::read_csv("https://sdsc.osn.xsede.org/bio230014-bucket01/challenges/forecasts/raw/project_id=usgsrc4cast/T20240813050531_usgsrc4cast-2024-08-12-persistenceRW.csv.gz", show_col_types = FALSE) df
|>
df ::arrange(variable, site_id, datetime, parameter) |>
dplyrhead() |>
::kable() knitr
project_id | model_id | datetime | reference_datetime | duration | site_id | family | parameter | variable | prediction |
---|---|---|---|---|---|---|---|---|---|
usgsrc4cast | persistenceRW | 2024-08-13 | 2024-08-12 | P1D | USGS-01427510 | ensemble | 1 | chla | 1.684588 |
usgsrc4cast | persistenceRW | 2024-08-13 | 2024-08-12 | P1D | USGS-01427510 | ensemble | 2 | chla | 2.198210 |
usgsrc4cast | persistenceRW | 2024-08-13 | 2024-08-12 | P1D | USGS-01427510 | ensemble | 3 | chla | 1.680515 |
usgsrc4cast | persistenceRW | 2024-08-13 | 2024-08-12 | P1D | USGS-01427510 | ensemble | 4 | chla | 4.687647 |
usgsrc4cast | persistenceRW | 2024-08-13 | 2024-08-12 | P1D | USGS-01427510 | ensemble | 5 | chla | 2.029173 |
usgsrc4cast | persistenceRW | 2024-08-13 | 2024-08-12 | P1D | USGS-01427510 | ensemble | 6 | chla | 1.524530 |
6 Submission process
6.1 File name
Save your forecast as a csv file with the following naming convention:
project_id-year-month-day-model_id.csv
. Compressed csv files with the csv.gz extension are also accepted.
The project_id
is this forecast challenge, usgsrc4cast
.
The year, month, and day are the year, month, and day the reference_datetime (horizon = 0). For example, if a forecast starts today and tomorrow is the first forecasted day, horizon = 0 would be today, and used in the file name. model_id
is the id for the model name that you specified in the model metadata Google Form (model_id has no spaces in it).
6.2 Uploading forecast
Individual forecast files can be uploaded any time.
Teams will submit their forecast csv files through an R function. The csv file can only contain one unique model_id
and one unique project_id
.
The submit function is available using the following code in R
source("https://raw.githubusercontent.com/eco4cast/usgsrc4cast-ci/main/R/eco4cast-helpers/submit.R")
source("https://raw.githubusercontent.com/eco4cast/usgsrc4cast-ci/main/R/eco4cast-helpers/forecast_output_validator.R")
submit(forecast_file = "project_id-year-month-day-model_id.csv", project_id = "usgsrc4cast")
or Python
import requests
def download_and_exec_script(url):
= requests.get(url)
response
response.raise_for_status() exec(response.text, globals())
"https://raw.githubusercontent.com/eco4cast/usgsrc4cast-ci/main/python/submit.py")
download_and_exec_script("https://raw.githubusercontent.com/eco4cast/usgsrc4cast-ci/main/python/forecast_output_validator.py")
download_and_exec_script(
= "project_id-year-month-day-model_id.csv", project_id = "usgsrc4cast") submit(forecast_file
7 Post-submission
7.1 Processing
After submission, our servers will process uploaded files by converting them to a parquet format on our public s3 storage. A pub_datetime
column will be added that denotes when a forecast was submitted. Summaries are generated of the forecasts provide descriptive statistics of the forecast.
7.2 Evaluation
All forecasts are scored daily using new data until the full horizon of the forecast has been scored. Forecasts are scored using the crps
function in the scoringRules
R package. More information about the scoring metric can be found at here
7.3 Comparison
Forecast performance can be compared to the performance of baseline models. We are automatically submitting the following baseline models:
climatology
: the normal distribution (mean and standard deviation) of that day-of-year in the historical observationspersistenceRW
: a random walk model that assumes no change in the mean behavior. The random walk is initialized using the most resent observation.
Our forecast performance page includes evaluations of all submitted models.
7.4 Catalog
Information and code for accessing the forecasts and scores can be found on our forecast catalog page.
8 Questions?
Thanks for reading this document!
If you still have questions about how to submit your forecast to the EFI-USGS River Chlorophyll Forecasting Challenge, we encourage you to email Dr. Jacob Zwart (jzwart{at}usgs.gov).