<- readr::read_csv("https://sdsc.osn.xsede.org/bio230014-bucket01/challenges/forecasts/raw/T20231102190926_aquatics-2023-10-19-climatology.csv.gz", show_col_types = FALSE) df
1 tl;dr: How to submit a forecast
We provide an overview of the steps for submitting with the details below:
- Explore the data (e.g., targets) and build your forecast model.
- Register and describe your model at https://forms.gle/kg2Vkpho9BoMXSy57. You are not required to register if your forecast submission uses the word “example” in your model_id”. Any forecasts with “example” in the model_id will not be used in forecast evaluation analyses. You can use neon4cast as the challenge for which you are registering.
- Generate a forecast!
- Write the forecast output to a file that follows our standardized format (described below).
- Submit your forecast using an R function (provided below).
- Watch your forecast be evaluated as new data are collected.
2 Generating a forecast
2.1 All forecasting approaches are welcome
We encourage you to use any modeling approach to predict the future conditions at any of the NEON sites and variables.
2.2 Must include uncertainty
Forecasts require you to assess the confidence in your prediction of the future. You can represent your confidence (i.e., uncertainty in the forecast) using a distribution or numerically using an ensemble (or sample) of predictions.
2.3 Any model drivers/covariates/features are welcome
You can use any data as model inputs (including all target data available to date). All sensor-based target data are available with a 1 to 7-day delay (latency) from the time of collection. You may want to use the updated target data to re-train a model or for use in data assimilation.
As a genuine forecasting challenge, you will need forecasted drivers if your model uses drivers as inputs. If you want to use forecasted meteorology, we download and process NOAA Global Ensemble Forecasting System (GEFS) weather forecasts for each NEON site. The NOAA GEFS forecasts extend 35-days ahead. More information about accessing the weather forecasts can be found here
2.4 Forecasts can be for a range of horizons
Forecasts can be submitted for 1 day to 1 year ahead, depending on the variable. See the variable tables for the horizon associated with each variable.
2.5 Forecasts can be submitted every day
Since forecasts can be submitted every day, automation is essential. We provide an example GitHub repository that can be used to automate your forecast with GitHub Actions. It also includes using a custom Docker Container eco4cast/rocker-neon4cast:latest that contains many of the packages and functions needed to generate and submit forecasts.
We only evaluate forecasts for any weekly variables (e.g., beetles and ticks) that were submitted on the Sunday of each week. So, we recommend only submitting forecasts of the weekly variables on Sundays.
3 You can forecast at any of the NEON sites
If you are just getting started, we recommend a set of focal sites for each of the five “themes.” You are also welcome to submit forecasts to all or a subset of NEON sites. More information about NEON sites can be found in the site metadata and on NEON’s website
4 Forecast file format
The file is a CSV format with the following columns:
project_id
: useneon4cast
model_id
: the short name of the model defined as the model_id in your registration. The model_id should have no spaces.model_id
should reflect a method to forecast one or a set of target variables and must be unique to the neon4cast challenge.datetime
: forecast timestamp. Format%Y-%m-%d %H:%M:%S
with UTC as the time zone. Forecasts submitted with a%Y-%m-%d
format will be converted to a full datetime assuming UTC mid-night.reference_datetime
: The start of the forecast; this should be 0 times steps in the future. There should only be one value ofreference_datetime
in the file. Format is%Y-%m-%d %H:%M:%S
with UTC as the time zone. Forecasts submitted with a%Y-%m-%d
format will be converted to a full datetime assuming UTC mid-night.duration
: the time-step of the forecast. Use the value ofP1D
for a daily forecast,P1W
for a weekly forecast, andPT30M
for 30 minute forecast. This value should match the duration of the target variable you are forecasting. Formatted as ISO 8601 durationsite_id
: code for NEON site.family
name of the probability distribution that is described by the parameter values in the parameter column (see list below for accepted distribution). An ensemble forecast as a family ofensemble
. See note below about family.parameter
the parameters for the distribution (see note below about the parameter column) or the number of the ensemble members. For example, the parameters for a normal distribution are calledmu
andsigma
.variable
: standardized variable name. It must match the variable name in the target file.prediction
: forecasted value for the parameter in the parameter column
5 Representing uncertainty
Uncertainty is represented through the family-parameter columns in the file that you submit.
5.0.1 Parameteric forecast
For a parametric forecast with a normal distribution, the family
column would have the word normal
to designate a normal distribution, and the parameter column must have values of mu
and sigma
for each forecasted variable, site_id, depth, and time combination.
Parameteric forecasts for binary variables should use bernoulli
as the family and prob
as the parameter.
The following names and parameterization of the distribution are currently supported (family: parameters):
lognormal
:mu
,sigma
normal
:mu
,sigma
bernoulli
:prob
beta
:shape1
,shape2
uniform
:min
,max
gamma
:shape
,rate
logistic
:location
,scale
exponential
:rate
poisson
:lambda
If you are submitting a forecast that is not in the supported list, we recommend using the ensemble format and sampling from your distribution to generate a set of ensemble members that represents your forecast distribution.
5.0.2 Ensemble (or sample) forecast
Ensemble (or sample) forecasts use the family
value of ensemble
and the parameter
values are the ensemble index.
When forecasts using the ensemble family are scored, we assume that the ensemble is a set of equally likely realizations sampled from a distribution comparable to that of the observations (i.e., no broad adjustments are required to make the ensemble more consistent with observations). This is referred to as a “perfect ensemble” by Bröcker and Smith (2007). Ensemble (or sample) forecasts are transformed to a probability distribution function (e.g., dressed) using the default methods in the scoringRules
R package (empirical version of the quantile decomposition for the crps
).
5.1 Example forecasts
Here is an example of a forecast that uses a normal distribution:
|>
df head() |>
::kable() knitr
model_id | datetime | reference_datetime | site_id | family | parameter | variable | prediction |
---|---|---|---|---|---|---|---|
climatology | 2023-10-20 | 2023-10-19 | ARIK | normal | mu | oxygen | 4.542862 |
climatology | 2023-10-20 | 2023-10-19 | ARIK | normal | sigma | oxygen | 1.448393 |
climatology | 2023-10-20 | 2023-10-19 | ARIK | normal | mu | temperature | 8.070854 |
climatology | 2023-10-20 | 2023-10-19 | ARIK | normal | sigma | temperature | 1.330059 |
climatology | 2023-10-21 | 2023-10-19 | ARIK | normal | mu | oxygen | 4.194895 |
climatology | 2023-10-21 | 2023-10-19 | ARIK | normal | sigma | oxygen | 1.448393 |
For an ensemble (or sample) forecast, the family
column uses the word ensemble
to designate that it is an ensemble forecast, and the parameter column is the ensemble member number (1
, 2
, 3
…).
<- readr::read_csv("https://sdsc.osn.xsede.org/bio230014-bucket01/challenges/forecasts/raw/T20231102190926_aquatics-2023-10-19-persistenceRW.csv.gz", show_col_types = FALSE) df
|>
df ::arrange(variable, site_id, datetime, parameter) |>
dplyrhead() |>
::kable() knitr
model_id | datetime | reference_datetime | site_id | family | parameter | variable | prediction |
---|---|---|---|---|---|---|---|
persistenceRW | 2023-10-20 | 2023-10-19 | BARC | ensemble | 1 | chla | 3.795652 |
persistenceRW | 2023-10-20 | 2023-10-19 | BARC | ensemble | 2 | chla | 3.963322 |
persistenceRW | 2023-10-20 | 2023-10-19 | BARC | ensemble | 3 | chla | 2.053637 |
persistenceRW | 2023-10-20 | 2023-10-19 | BARC | ensemble | 4 | chla | 3.294723 |
persistenceRW | 2023-10-20 | 2023-10-19 | BARC | ensemble | 5 | chla | 1.847344 |
persistenceRW | 2023-10-20 | 2023-10-19 | BARC | ensemble | 6 | chla | 7.829229 |
6 Submission process
6.1 File name
Save your forecast as a CSV file with the following naming convention:
theme_name-year-month-day-model_id.csv
. Compressed csv files with the csv.gz extension are also accepted.
The theme_name
options are: terrestrial_daily, terrestrial_30min, aquatics, beetles, ticks, or phenology.
The year, month, and day are the year, month, and day the reference_datetime (horizon = 0). For example, if a forecast starts today and tomorrow is the first forecasted day, horizon = 0 would be today and used in the file name. model_id
is the id for the model name that you specified in the model metadata Google Form (model_id has no spaces in it).
6.2 Uploading forecast
Individual forecast files can be uploaded at any time.
Teams will submit their forecast csv files through an R function. The csv file can only contain one unique model_id
and one unique project_id
.
The function is available using the following code
::install_github("eco4cast/neon4cast") remotes
The submit function is
library(neon4cast)
::submit(forecast_file = "theme_name-year-month-day-model_id.csv") neon4cast
7 Post-submission
7.1 Processing
After submission, our servers will process uploaded files by converting them to a parquet format on our public s3 storage. A pub_datetime
column will be added that denotes when a forecast was submitted. Summaries are generated of the forecasts provide descriptive statistics of the forecast.
7.2 Evaluation
All forecasts are scored daily using new data until the full horizon of the forecast has been scored. Forecasts are scored using the crps
function in the scoringRules
R package. More information about the scoring metric can be found at here
7.3 Comparison
Forecast performance can be compared to the performance of baseline models. We are automatically submitting the following baseline models:
climatology
: the normal distribution (mean and standard deviation) of that day-of-year in the historical observationspersistenceRW
: a random walk model that assumes no change in the mean behavior. The random walk is initialized using the most recent observation.mean
: the historical mean of the data is submitted for the beetle theme.
Our forecast performance page includes evaluations of all submitted models.
7.4 Catalog
Our forecast catalog page provides information and code for accessing the forecasts and scores.
8 Questions?
Thanks for reading this document!
If you still have questions about how to submit your forecast to the NEON Ecological Forecasting Challenge, we encourage you to email Dr. Quinn Thomas (rqthomas{at}vt.edu).