NEON Ecological Forecast Challenge

1 Overview: How to submit a forecast

We provide an overview of the steps for submitting with the details below:

Explore the data (e.g., targets) and build your forecast model.
Register and describe your model at https://forms.gle/kg2Vkpho9BoMXSy57. You are not required to register if your forecast submission uses the word “example” in your model_id”. Any forecasts with “example” in the model_id will not be used in forecast evaluation analyses. You can use neon4cast as the challenge for which you are registering.
Generate a forecast!
Write the forecast output to a file that follows our standardized format (described below).
Submit your forecast using an R function (provided below).
Watch your forecast be evaluated as new data are collected.

2 Generating a forecast

2.1 All forecasting approaches are welcome

We encourage you to use any modeling approach to predict the future conditions at any of the NEON sites and variables.

2.2 Must include uncertainty

Forecasts require you to assess the confidence in your prediction of the future. You can represent your confidence (i.e., uncertainty in the forecast) using a distribution or numerically using an ensemble (or sample) of predictions.

2.3 Any model drivers/covariates/features are welcome

You can use any data as model inputs (including all target data available to date). All sensor-based target data are available with a 1 to 7-day delay (latency) from the time of collection. You may want to use the updated target data to re-train a model or for use in data assimilation.

As a genuine forecasting challenge, you will need forecasted drivers if your model uses drivers as inputs. If you want to use forecasted meteorology, we download and process NOAA Global Ensemble Forecasting System (GEFS) weather forecasts for each NEON site. The NOAA GEFS forecasts extend 35-days ahead. More information about accessing the weather forecasts can be found here

2.4 Forecasts can be for a range of horizons

Forecasts can be submitted for 1 day to 1 year ahead, depending on the variable. See the variable tables for the horizon associated with each variable.

2.5 Forecasts can be submitted every day

Since forecasts can be submitted every day, automation is essential. We provide an example GitHub repository that can be used to automate your forecast with GitHub Actions. It also includes using a custom Docker Container eco4cast/rocker-neon4cast:latest that contains many of the packages and functions needed to generate and submit forecasts.

We only evaluate forecasts for any weekly variables (e.g., beetles and ticks) that were submitted on the Sunday of each week. Therefore, we recommend only submitting forecasts of the weekly variables on Sundays.

3 You can forecast at any of the NEON sites

If you are just getting started, we recommend a set of focal sites for each of the five “themes.” You are also welcome to submit forecasts to all or a subset of NEON sites. More information about NEON sites can be found in the site metadata and on NEON’s website

4 Forecast file format

The file is a CSV format with the following columns:

project_id: use neon4cast
model_id: the short name of the model defined as the model_id in your registration. The model_id should have no spaces. model_id should reflect a method to forecast one or a set of target variables and must be unique to the neon4cast challenge.
datetime: forecast timestamp. Format %Y-%m-%d %H:%M:%S with UTC as the time zone. Forecasts submitted with a %Y-%m-%d format will be converted to a full datetime assuming UTC mid-night.
reference_datetime: The start of the forecast; this should be 0 times steps in the future. There should only be one value of reference_datetime in the file. Format is %Y-%m-%d %H:%M:%S with UTC as the time zone. Forecasts submitted with a %Y-%m-%d format will be converted to a full datetime assuming UTC mid-night.
duration: the time-step of the forecast. Use the value of P1D for a daily forecast and P1W for a weekly forecast. This value should match the duration of the target variable you are forecasting. Formatted as ISO 8601 duration
site_id: code for NEON site.
family name of the probability distribution that is described by the parameter values in the parameter column (see list below for accepted distribution). An ensemble forecast as a family of ensemble. See note below about family.
parameter the parameters for the distribution (see note below about the parameter column) or the number of the ensemble members. For example, the parameters for a normal distribution are called mu and sigma.
variable: standardized variable name. It must match the variable name in the target file.
prediction: forecasted value for the parameter in the parameter column

5 Representing uncertainty

Uncertainty is represented through the family-parameter columns in the file that you submit.

5.0.1 Parameteric forecast

For a parametric forecast with a normal distribution, the family column would have the word normal to designate a normal distribution, and the parameter column must have values of mu and sigma for each forecasted variable, site_id, depth, and time combination.

Parameteric forecasts for binary variables should use bernoulli as the family and prob as the parameter.

The following names and parameterization of the distribution are currently supported (family: parameters):

lognormal: mu, sigma
normal: mu,sigma
bernoulli: prob
beta: shape1, shape2
uniform: min, max
gamma: shape, rate
logistic: location, scale
exponential: rate
poisson: lambda

If you are submitting a forecast that is not in the supported list, we recommend using the ensemble format and sampling from your distribution to generate a set of ensemble members that represents your forecast distribution.

5.0.2 Ensemble (or sample) forecast

Ensemble (or sample) forecasts use the family value of ensemble and the parameter values are the ensemble index.

When forecasts using the ensemble family are scored, we assume that the ensemble is a set of equally likely realizations sampled from a distribution comparable to that of the observations (i.e., no broad adjustments are required to make the ensemble more consistent with observations). This is referred to as a “perfect ensemble” by Bröcker and Smith (2007). Ensemble (or sample) forecasts are transformed to a probability distribution function (e.g., dressed) using the default methods in the scoringRules R package (empirical version of the quantile decomposition for the crps).

5.1 Example forecasts

Here is an example of a forecast that uses a normal distribution:

df <- readr::read_csv("https://sdsc.osn.xsede.org/bio230014-bucket01/challenges/forecasts/raw/T20231102190926_aquatics-2023-10-19-climatology.csv.gz", show_col_types = FALSE)

df |> 
  head() |> 
  knitr::kable()

model_id	datetime	reference_datetime	site_id	family	parameter	variable	prediction
climatology	2023-10-20	2023-10-19	ARIK	normal	mu	oxygen	4.542862
climatology	2023-10-20	2023-10-19	ARIK	normal	sigma	oxygen	1.448393
climatology	2023-10-20	2023-10-19	ARIK	normal	mu	temperature	8.070854
climatology	2023-10-20	2023-10-19	ARIK	normal	sigma	temperature	1.330059
climatology	2023-10-21	2023-10-19	ARIK	normal	mu	oxygen	4.194895
climatology	2023-10-21	2023-10-19	ARIK	normal	sigma	oxygen	1.448393

For an ensemble (or sample) forecast, the family column uses the word ensemble to designate that it is an ensemble forecast, and the parameter column is the ensemble member number (1, 2, 3 …).

df <- readr::read_csv("https://sdsc.osn.xsede.org/bio230014-bucket01/challenges/forecasts/raw/T20231102190926_aquatics-2023-10-19-persistenceRW.csv.gz", show_col_types = FALSE)

df |> 
  dplyr::arrange(variable, site_id, datetime, parameter) |> 
  head() |> 
  knitr::kable()

model_id	datetime	reference_datetime	site_id	family	parameter	variable	prediction
persistenceRW	2023-10-20	2023-10-19	BARC	ensemble	1	chla	3.795652
persistenceRW	2023-10-20	2023-10-19	BARC	ensemble	2	chla	3.963322
persistenceRW	2023-10-20	2023-10-19	BARC	ensemble	3	chla	2.053637
persistenceRW	2023-10-20	2023-10-19	BARC	ensemble	4	chla	3.294723
persistenceRW	2023-10-20	2023-10-19	BARC	ensemble	5	chla	1.847344
persistenceRW	2023-10-20	2023-10-19	BARC	ensemble	6	chla	7.829229

6 Submission process

6.1 File name

Save your forecast as a CSV file with the following naming convention:

theme_name-year-month-day-model_id.csv. Compressed csv files with the csv.gz extension are also accepted.

The theme_name options are: terrestrial_daily, terrestrial_30min, aquatics, beetles, ticks, or phenology.

The year, month, and day are the year, month, and day the reference_datetime (horizon = 0). For example, if a forecast starts today and tomorrow is the first forecasted day, horizon = 0 would be today and used in the file name. model_id is the id for the model name that you specified in the model metadata Google Form (model_id has no spaces in it).

6.2 Uploading forecast

Individual forecast files can be uploaded at any time.

Teams will submit their forecast csv files through an R function. The csv file can only contain one unique model_id and one unique project_id.

The function is available using the following code

remotes::install_github("eco4cast/neon4cast")

The submit function is

library(neon4cast)
neon4cast::submit(forecast_file = "theme_name-year-month-day-model_id.csv")

7 Post-submission

7.1 Processing

After submission, our servers will process uploaded files by converting them to a parquet format on our public s3 storage. A pub_datetime column will be added that denotes when a forecast was submitted. Summaries are generated of the forecasts provide descriptive statistics of the forecast.

7.2 Evaluation

All forecasts are scored daily using new data until the full horizon of the forecast has been scored. Forecasts are scored using the crps function in the scoringRules R package. More information about the scoring metric can be found at here

7.3 Comparison

Forecast performance can be compared to the performance of baseline models. We are automatically submitting the following baseline models:

climatology: the normal distribution (mean and standard deviation) of that day-of-year in the historical observations
persistenceRW: a random walk model that assumes no change in the mean behavior. The random walk is initialized using the most recent observation.
mean: the historical mean of the data is submitted for the beetle theme.

Our forecast performance page includes evaluations of all submitted models.

7.4 Catalog

Our forecast catalog page provides information and code for accessing the forecasts and scores.

8 Questions?

Thanks for reading this document!

If you still have questions about how to submit your forecast to the NEON Ecological Forecasting Challenge, we encourage you to email Dr. Quinn Thomas (rqthomas{at}vt.edu).