12 Example Forecast Workflow
Here is an example of a complete workflow for generating a forecast submission to the Challenge. The example is for the aquatics theme and it forecasts water temperature and oxygen. The water temperature forecast uses a linear regression between air temperature and water temperature to predict water temperature in the future. It then uses the prediction of water temperature to predict oxygen water oxygen concentration by assuming that the oxygen is 100% saturated given the predicted temperature.
To generate the forecast we need to:
1) Build a relationship between past air temperature and past water temperature.
2) Apply the relationships from #1 to forecasted air temperature.
3) Calculate oxygen concentration from the forecasted water temperature.
Therefore we need to:
1) Download the historical water temperature data for the NEON sites (called “targets”).
2) Download historical air temperature data for the NEON sites (we the stacked NOAA GEFS weather).
3) Download NOAA weather forecast for the NEON sites.
4) Create linear regression model based on historical data for each NEON site.
5) Apply linear regression to using weather forecasts for each NEON.
6) Write forecast output file.
7) Correct metadata describing forecast.
8) Submit forecast to Challenge.
Each of these steps are below
12.1 Step 0: Set up R environment and directories
We will be downloading NOAA forecasts from the Challenge s3 bucket and submitting to the s3 bucket. Therefore the AWS information is needed.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.5 ✔ purrr 0.3.4
## ✔ tibble 3.1.6 ✔ dplyr 1.0.8
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(neon4cast)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(rMR)
## Loading required package: biglm
## Loading required package: DBI
#library(EFIstandards)
Sys.setenv("AWS_DEFAULT_REGION" = "data",
"AWS_S3_ENDPOINT" = "ecoforecast.org")
dir.create("drivers", showWarnings = FALSE)
Define the date that the forecast starts. For demonstration purposes, we are setting the date to 2022-02-15
. In a real-time application, use forecast_date <- Sys.Date()
<- lubridate::as_date("2022-02-15") forecast_date
12.2 Step 0: Define team name and team members
<- "air2waterSat"
team_name
<- list(list(individualName = list(givenName = "Quinn",
team_list surName = "Thomas"),
organizationName = "Virginia Tech",
electronicMailAddress = "rqthomas@vt.edu"))
12.3 Step 1: Download latest target data and site description data
These targets are updated when new data is available from NEON.
<- readr::read_csv("https://data.ecoforecast.org/targets/aquatics/aquatics-targets.csv.gz", guess_max = 1e6) target
## Rows: 9763 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): siteID
## dbl (8): oxygen, temperature, chla, oxygen_sd, temperature_sd, chla_sd, dep...
## date (1): time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
A table is available with NEON site descriptions. The calculation of oxygen saturation requires the elevation of each site, which is included in the site description table.
<- readr::read_csv("https://raw.githubusercontent.com/eco4cast/neon4cast-aquatics/master/Aquatic_NEON_Field_Site_Metadata_20210928.csv") site_data
## Rows: 5 Columns: 46
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (22): field_domain_id, field_site_id, field_site_name, field_site_type, ...
## dbl (10): field_latitude, field_longitude, field_utm_northing, field_utm_eas...
## lgl (14): field_minimum_elevation_m, field_maximum_elevation_m, field_domina...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
12.4 Step 2: Get drivers
12.4.1 Step 2.1: Download Paste NOAA forecast stacked together
To build the relations between air and water temperature, we need historical air temperature data to associate with historical water temperature data. Here we use a product that the Challenge organizers created that combines day 1 NOAA weather forecasts (i.e., when the forecasts are most accurate) together to generate an estimate of past weather. He we download this “stack” NOAA product for the set of NEON sites in the targets file.
<- unique(target$siteID)
sites for(i in 1:length(sites)){
::get_stacked_noaa_s3(".",site = sites[i], averaged = FALSE)
neon4cast }
12.4.2 Step 2.2: Download NOAA future forecast
We need NOAA Weather forecasts of the future. Fortunately, the Challenge organizers are downloading and subsetting the weather forecasts for each NEON site. Here we download the weather forecast (date = forecast_date
) that started at mid-night UTC (cycle=00
) for the set of sites in the target file.
<- unique(target$siteID)
sites for(i in 1:length(sites)){
::get_noaa_forecast_s3(".",model = "NOAAGEFS_1hr", site = sites[i], date = forecast_date, cycle = "00")
neon4cast }
12.4.3 Step 2.3 Create data frames of drivers
The NOAA stacked and forecast files are netcdf files. Here we convert to data frames to used in our analyses.
<- neon4cast::stack_noaa(dir = "drivers", model = "NOAAGEFS_1hr_stacked")
noaa_past <- neon4cast::stack_noaa(dir = "drivers", model = "NOAAGEFS_1hr", forecast_date = forecast_date) noaa_future
12.4.4 Step 2.4 Aggregate (to day) and convert units of drivers
Since we are forecasting daily mean water temperature and oxygen, we need to aggregate the 1 hr weather data to the daily time-scale. We also need to convert from Kelvin to Celsius.
<- noaa_past %>%
noaa_past_mean mutate(date = as_date(time)) %>%
group_by(date) %>%
summarize(air_temperature = mean(air_temperature, na.rm = TRUE), .groups = "drop") %>%
rename(time = date) %>%
mutate(air_temperature = air_temperature - 273.15)
For the future weather has 31 ensemble members (i.e., different trajectories of weather), that we want to use to generate uncertainty in our water temperature and oxygen forecasts. The aggregation below maintains the separate ensemble members (group_by(date, ensemble)
).
<- noaa_future %>%
noaa_future_mean mutate(date = as_date(time)) %>%
group_by(date, ensemble) %>%
summarize(air_temperature = mean(air_temperature, na.rm = TRUE), .groups = "drop") %>%
rename(time = date) %>%
mutate(ensemble = as.numeric(stringr::str_sub(ensemble, start = 4, end = 6)),
air_temperature = air_temperature - 273.15)
12.4.5 Step 2.5: Merge in past NOAA data into the targets file, matching by date.
Before building our linear model we need merge in the historical air temperature to match with the historical water temperature
<- left_join(target, noaa_past_mean, by = "time") target
12.5 Step 3.0: Generate forecasts for each site
We generate a forecast for each site.
<- NULL
forecast
for(i in 1:length(sites)){
# Get site information for elevation
<- site_data %>% filter(field_site_id == sites[i])
site_info
#Fit linear model based on past data: water temperature = m * air temperature + b
<- lm(target$temperature~target$air_temperature)
fit
#use linear regression to forecast water temperature for each ensemble member
<- fit$coefficients[1] + fit$coefficients[2] * noaa_future_mean$air_temperature
forecasted_temperature
#use forecasted temperature to predict oxygen by assuming that oxygen is saturated.
<- rMR::Eq.Ox.conc(forecasted_temperature, elevation.m = ,site_info$field_mean_elevation_m,
forecasted_oxygen bar.press = NULL,
bar.units = NULL,
out.DO.meas = "mg/L",
salinity = 0,
salinity.units = "pp.thou")
#Build site level dataframe. Note we are not forecasting chla
<- tibble(time = noaa_future_mean$time,
site_forecast siteID = sites[i],
ensemble = noaa_future_mean$ensemble,
forecast = 1,
temperature = forecasted_temperature,
oxygen = forecasted_oxygen,
chla = NA)
#Bind with other sites
<- bind_rows(forecast, site_forecast)
forecast }
Use ggplot to visualize the forecast for each variable and site. The spread in forecast is due to uncertainty in the weather forecast, where each line is associated with a different NOAA GEFS weather forecast ensemble member.
%>%
forecast select(-chla) %>%
pivot_longer(cols = c("temperature","oxygen"), names_to = "variable", values_to = "values") %>%
ggplot(aes(x = time, y = values, group = ensemble)) +
geom_line() +
facet_grid(variable~siteID, scale ="free")
Forecast output file name in standards requires for Challenge. “csv.gz” means that it will be compressed
<- paste0("aquatics","-",min(forecast$time),"-",team_name,".csv.gz") forecast_file
Write csv to disk
write_csv(forecast, forecast_file)
Confirm that output file meets standard for Challenge
::forecast_output_validator(forecast_file) neon4cast
12.6 Step 4: Generate metadata
First we need to create a “list” in R that describes the forecast model. We only have one source of uncertainty (driver). It is propagated (i.e., represented in the forecasted output) using 31 ensemble members.
= list(
model_metadata forecast = list(
model_description = list(
forecast_model_id = "air2waterSat", #What goes here
name = "Air temperature to water temperature linear regression plus assume saturated oxygen",
type = "empirical",
repository = "https://github.com/rqthomas/neon4cast-example"
),initial_conditions = list(
status = "absent"
),drivers = list(
status = "propagates",
complexity = 1, #Just temperature
propagation = list(
type = "ensemble",
size = 31)
),parameters = list(
status = "absent"
),random_effects = list(
status = "absent"
),process_error = list(
status = "absent"
),obs_error = list(
status = "absent"
)
) )
Using the team_list
and model_metadata
above, we can create the metadata file. The function neon4cast::generate_metadata
adds additional information to complete the metadata (i.e., geographic information about the site, etc.)
<- neon4cast::generate_metadata(forecast_file, team_list, model_metadata) metadata_file
12.7 Step 5: Submit forecast!
Now we can submit the forecast output and the metadata file to the Challenge using the neon4cast::submit()
function
::submit(forecast_file = forecast_file,
neon4castmetadata = metadata_file,
ask = FALSE)
You can check on the status of your submission using
::check_submission(forecast_file) neon4cast
On following day after submission, you can see the forecast on the dashboard at shiny.ecoforecast.org
12.8 Example on github
The example code above can be found on GitHub as a template repository at [https://github.com/eco4cast/neon4cast-example.git]. See the Readme for more information about using the template