^CONTRIBUTING\.md$
^Dockerfile$
^LICENSE\.md$
^Meta$
^README\.Rmd$
^\./inst/extdata/data$
^\.Rproj\.user$
^\.dvc$
^\.gitlab$
^\.gitlab-ci\.yml$
^\.gitlab/issue_templates/.gitkeep$
^\.lintr$
^\.pre-commit-config\.yaml$
^\.venv$
^doc$
^examples$
^renv$
^renv\.lock$
^ubair\.Rproj$
^vignettes/figure/
source("renv/activate.R")
.Rproj.user
.Rhistory
.Rdata
.httr-oauth
.DS_Store
.quarto
.venv/
inst/doc
/doc/
/Meta/
Contributing to ubair
All contributions to ubair are welcome! This document outlines the process and guidelines for contributing to the project.
Coding Standards
Aim for self-explanatory and readable code to minimize the need for additional documentation.
Reporting Bugs and Feature Requests
Use GitLab issues to report bugs or propose features. While we do not provide a specific template, please include sufficient details to help us understand and address the issue.
Workflow and Version Control
Pull requests (PRs) are welcome. You may either:
• Fork the repository and submit a PR from your fork.
• Work directly on a branch and submit a PR.
Testing and Validation
New features must allow for reproducible results. Validate your code against existing data and workflows to ensure compatibility and consistency.
Licensing
Ensure that any new dependencies added are compatible with the GPL-3.0-or-later used in this project.
General Guidelines
Discuss significant changes (e.g., new features) in a GitLab issue before starting your work. When adding new files or modifying the folder structure, ensure compatibility with the existing structure.
Thank you for contributing to ubair! If you have any questions, feel free to contact the maintainers listed in the README file.
Package: ubair
Title: Impact of External Conditions on Air Quality
Version: 1.1.0
Authors@R: c(
person("Imke", "Voss", , "imke.voss@uba.de", role = c("aut", "cre", "cph")),
person("Raphael", "Franke", , "raphael.franke@uba.de", role = c("aut", "cre"))
)
Description: Statistical investigation of the impact of external conditions on air quality.
License: GPL (>= 3) + file LICENSE
Depends:
R (>= 4.4.0),
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.3.2
Suggests:
testthat (>= 3.0.0),
deepnet,
fastshap,
treeshap,
shapviz,
knitr,
rmarkdown
Config/testthat/edition: 3
Imports:
rlang,
data.table,
dplyr,
ggplot2,
forecast,
lubridate,
tidyr,
yaml,
ranger,
lightgbm
LazyData: true
LazyDataCompression: xz
VignetteBuilder: knitr
FROM rocker/tidyverse:4.4.1
# Set environment variables
ENV R_LIBS_USER=/usr/local/lib/R/site-library
ENV CRAN=https://cran.rstudio.com
# Install system dependencies
RUN apt-get update && apt-get install -y \
python3-pip \
libtool \
automake \
autoconf \
gcc \
g++ \
make \
qpdf \
&& pip install dvc[s3]
# Set the working directory to match GitLab CI
WORKDIR /builds/use-case-luft/ubair
# Copy the project files including renv.lock
COPY . .
# Install R packages via renv and devtools
RUN R -e "install.packages('renv', repos='$CRAN')" && \
R -e "renv::restore()" && \
R -e "install.packages('devtools', repos='$CRAN')"
DL-DE->BY-2.0
Data licence Germany – attribution – version 2.0
This licence refers to the sample_data_DESN025 provided in this publication.
Provider of the data: Sächsisches Landesamt für Umwelt, Landwirtschaft und Geologie (LfULG)
Alterations in the data: Codes for incorrect values have been removed.
(1) Any use will be permitted provided it fulfils the requirements of this "Data licence Germany – attribution – Version 2.0".
The data and meta-data provided may, for commercial and non-commercial use, in particular
- be copied, printed, presented, altered, processed and transmitted to third parties;
- be merged with own data and with the data of others and be combined to form new and independent datasets;
- be integrated in internal and external business processes, products and applications in public and non-public electronic networks.
(2) The user must ensure that the source note contains the following information:
- the name of the provider,
- the annotation "Data licence Germany – attribution – Version 2.0" or "dl-de/by-2-0" referring to the licence text available at www.govdata.de/dl-de/by-2-0, and
- a reference to the dataset (URI).
This applies only if the entity keeping the data provides the pieces of information 1-3 for the source note.
(3) Changes, editing, new designs or other amendments must be marked as such in the source note.
URL: http://www.govdata.de/dl-de/by-2-0
# Generated by roxygen2: do not edit by hand
export(calc_performance_metrics)
export(calc_summary_statistics)
export(clean_data)
export(copy_default_params)
export(detrend)
export(estimate_effect_size)
export(get_meteo_available)
export(load_params)
export(load_uba_data_from_dir)
export(plot_counterfactual)
export(plot_station_measurements)
export(prepare_data_for_modelling)
export(rescale_predictions)
export(retrend_predictions)
export(run_counterfactual)
export(run_dynamic_regression)
export(run_fnn)
export(run_lightgbm)
export(run_rf)
export(scale_data)
export(split_data_counterfactual)
import(lightgbm)
import(ranger)
import(splines)
importFrom(data.table,":=")
importFrom(data.table,.SD)
importFrom(data.table,melt)
importFrom(dplyr,"%>%")
importFrom(dplyr,across)
importFrom(dplyr,mutate)
importFrom(dplyr,select)
importFrom(dplyr,where)
importFrom(forecast,BoxCox.lambda)
importFrom(forecast,auto.arima)
importFrom(forecast,forecast)
importFrom(ggplot2,aes)
importFrom(ggplot2,facet_wrap)
importFrom(ggplot2,geom_line)
importFrom(ggplot2,geom_ribbon)
importFrom(ggplot2,geom_smooth)
importFrom(ggplot2,geom_vline)
importFrom(ggplot2,ggplot)
importFrom(ggplot2,labs)
importFrom(ggplot2,theme_bw)
importFrom(lubridate,ymd_h)
importFrom(lubridate,ymd_hm)
importFrom(stats,lm)
importFrom(stats,predict)
importFrom(stats,spline)
importFrom(tidyr,gather)
importFrom(yaml,read_yaml)
## ubair 1.1.0
**Initial CRAN Submission**\
**Public Release**
- First public release of `ubair` on CRAN.
- Previous internal versions (pre-CRAN) were used privately or shared within a specific group.
- The package is intended for use with air quality station measurement data. For Germany, these can be obtained via the [Air Quality Data API (UBA)](https://www.umweltbundesamt.de/daten/luft/luftdaten/doc) or on request to immission@uba.de.
- Includes the following features:
- **Counterfactual analysis**: Apply counterfactual analysis to assess external effects on air quality based on meteorological data, with support for four models: LightGBM, Random Forest, Feedforward Neural Network, and Dynamic Regression.
- **Data detrending and retrending**: Enable users to remove and reintroduce trends in data as needed.
- **Visualization and evaluation**: Provide tools for visualizing and evaluating the results of counterfactual analyses.
- **Data preprocessing**: Preprocess data to prepare it for counterfactual analysis.
#' Clean and Optionally Aggregate Environmental Data
#'
#' Cleans a data table of environmental measurements by filtering for a specific
#' station, removing duplicates, and optionally aggregating the data on a daily
#' basis using the mean.
#'
#' @param env_data A data table in long format.
#' Must include columns:
#' \describe{
#' \item{Station}{Station identifier for the data.}
#' \item{Komponente}{Measured environmental component e.g. temperature, NO2.}
#' \item{Wert}{Measured value.}
#' \item{date}{Timestamp as Date-Time object (`YYYY-MM-DD HH:MM:SS` format).}
#' \item{Komponente_txt}{Textual description of the component.}
#' }
#' @param station Character. Name of the station to filter by.
#' @param aggregate_daily Logical. If `TRUE`, aggregates data to daily mean values. Default is `FALSE`.
#' @return A `data.table`:
#' \itemize{
#' \item If `aggregate_daily = TRUE`: Contains columns for station, component, day, year,
#' and the daily mean value of the measurements.
#' \item If `aggregate_daily = FALSE`: Contains cleaned data with duplicates removed.
#' }
#' @details Duplicate rows (by `date`, `Komponente`, and `Station`) are removed. A warning is issued
#' if duplicates are found.
#' @examples
#' # Example data
#' env_data <- data.table::data.table(
#' Station = c("DENW094", "DENW094", "DENW006", "DENW094"),
#' Komponente = c("NO2", "O3", "NO2", "NO2"),
#' Wert = c(45, 30, 50, 40),
#' date = as.POSIXct(c(
#' "2023-01-01 08:00:00", "2023-01-01 09:00:00",
#' "2023-01-01 08:00:00", "2023-01-02 08:00:00"
#' )),
#' Komponente_txt = c(
#' "Nitrogen Dioxide", "Ozone", "Nitrogen Dioxide", "Nitrogen Dioxide"
#' )
#' )
#'
#' # Clean data for station DENW094 without aggregation
#' cleaned_data <- clean_data(env_data, station = "DENW094", aggregate_daily = FALSE)
#' print(cleaned_data)
#' @export
clean_data <- function(env_data, station, aggregate_daily = FALSE) {
env_data <- .add_year_column(env_data)
env_data <- dplyr::filter(env_data, Station == station)
env_data_unique <- unique(env_data, by = c("date", "Komponente", "Station"))
if (nrow(env_data_unique) < nrow(env_data)) {
warning(sprintf(
"%d duplicate row(s) were removed.",
nrow(env_data) - nrow(env_data_unique)
))
}
if (aggregate_daily) {
env_data_unique <- .aggregate_data(env_data_unique)
}
env_data_unique
}
#' Get Available Meteorological Components
#'
#' Identifies unique meteorological components from the provided environmental data,
#' filtering only those that match the predefined UBA naming conventions. These components
#' include "GLO", "LDR", "RFE", "TMP", "WIG", "WIR", "WIND_U", and "WIND_V".
#' @param env_data Data table containing environmental data.
#' Must contain column "Komponente"
#' @return A vector of available meteorological components.
#' @examples
#' # Example environmental data
#' env_data <- data.table::data.table(
#' Komponente = c("TMP", "NO2", "GLO", "WIR"),
#' Wert = c(25, 40, 300, 50),
#' date = as.POSIXct(c(
#' "2023-01-01 08:00:00", "2023-01-01 09:00:00",
#' "2023-01-01 10:00:00", "2023-01-01 11:00:00"
#' ))
#' )
#' # Get available meteorological components
#' meteo_components <- get_meteo_available(env_data)
#' print(meteo_components)
#' @export
get_meteo_available <- function(env_data) {
meteo_available <- unique(env_data$Komponente) %>%
.[. %in% c("GLO", "LDR", "RFE", "TMP", "WIG", "WIR", "WIND_U", "WIND_V")]
meteo_available
}
#' @return A data.table with an added year column.
#' @noRd
.add_year_column <- function(env_data) {
env_data_copy <- data.table::copy(env_data)
env_data_copy[, year := lubridate::year(date)]
}
#' Adds a `day` column to the data, representing the date truncated to day-level
#' precision. This column is used for later aggregations.
#' @noRd
.add_day_column <- function(env_data) {
env_data %>% dplyr::mutate(day = lubridate::floor_date(date, unit = "day"))
}
#' @return A data.table aggregated to daily mean values and a new column day.
#' @noRd
.aggregate_data <- function(env_data) {
env_data <- .add_day_column(env_data)
env_data[, list(Wert = mean(Wert, na.rm = TRUE)),
by = list(Station, Komponente, Komponente_txt, day, year)
]
}
#' Load Parameters from YAML File
#'
#' Reads a YAML file containing model parameters, including station settings,
#' variables, and configurations for various models. If no file path is
#' provided, the function defaults to loading `params.yaml` from the package's
#' `extdata` directory.
#'
#' @param filepath Character. Path to the YAML file. If `NULL`, the function
#' will attempt to load the default `params.yaml` provided in the package.
#' @return A list containing the parameters loaded from the YAML file.
#' @details
#' The YAML file should define parameters in a structured format, such as:
#'
#' ```yaml
#' target: 'NO2'
#'
#' lightgbm:
#' nrounds: 200
#' eta: 0.03
#' num_leaves: 32
#'
#' dynamic_regression:
#' ntrain: 8760
#'
#' random_forest:
#' num.trees: 300
#' max.depth: 10
#'
#' meteo_variables:
#' - GLO
#' - TMP
#' ```
#' @examples
#' \dontrun{
#' params <- load_params("path/to/custom_params.yaml")
#' }
#' @export
#' @importFrom yaml read_yaml
load_params <- function(filepath = NULL) {
if (is.null(filepath)) {
filepath <- system.file("extdata", "params.yaml", package = "ubair")
}
if (file.exists(filepath)) {
params <- yaml::read_yaml(filepath)
return(params)
} else {
stop("YAML file not found at the specified path.")
}
}
#' Copy Default Parameters File
#'
#' Copies the default `params.yaml` file, included with the package, to a
#' specified destination directory. This is useful for initializing parameter
#' files for custom edits.
#'
#' @param dest_dir Character. The path to the directory where the `params.yaml`
#' file will be copied.
#' @return Nothing is returned. A message is displayed upon successful copying.
#' @details
#' The `params.yaml` file contains default model parameters for various
#' configurations such as LightGBM, dynamic regression, and others. See the
#' \code{\link[ubair:load_params]{load_params()}} documentation for an example of the file's structure.
#'
#' @examples
#' \dontrun{
#' copy_default_params("path/to/destination")
#' }
#' @export
copy_default_params <- function(dest_dir) {
if (!dir.exists(dest_dir)) {
stop("Destination directory does not exist.")
}
file.copy(system.file("extdata", "params.yaml", package = "ubair"),
file.path(dest_dir, "params.yaml"),
overwrite = TRUE
)
message("Default params.yaml copied to ", normalizePath(dest_dir))
}
#' Load UBA Data from Directory
#'
#' This function loads data from CSV files in the specified directory. It supports two formats:
#'
#' 1. "inv": Files must contain the following columns:
#' - `Station`, `Komponente`, `Datum`, `Uhrzeit`, `Wert`.
#' 2. "24Spalten": Files must contain:
#' - `Station`, `Komponente`, `Datum`, and columns `Wert01`, ..., `Wert24`.
#'
#' File names should include "inv" or "24Spalten" to indicate their format. The function scans
#' recursively for `.csv` files in subdirectories and combines the data into a single `data.table`
#' in long format.
#' Files that are not in the expected format will be ignored.
#'
#' @param data_dir Character. Path to the directory containing `.csv` files.
#' @return A `data.table` containing the loaded data in long format. Returns an error if no valid
#' files are found or the resulting dataset is empty.
#' @export
#' @importFrom lubridate ymd_hm ymd_h
#' @importFrom tidyr gather
#' @importFrom dplyr mutate select %>%
load_uba_data_from_dir <- function(data_dir) {
if (!dir.exists(data_dir)) {
stop(paste("Directory does not exist:", data_dir))
}
all_files <- list.files(data_dir,
pattern = "\\.csv$",
recursive = TRUE,
full.names = TRUE
)
list_data_parts <- lapply(unique(dirname(all_files)), function(dir) {
.load_data(dir)
})
combined_data <- data.table::rbindlist(list_data_parts, fill = TRUE)
if (nrow(combined_data) == 0) {
stop(paste(
"The resulting data is empty after loading all files from",
data_dir
))
}
return(combined_data)
}
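# Usage sketch (hypothetical path): recursively load all UBA csv exports in
# "inv" or "24Spalten" format below a directory into one long-format table.
# env_data <- load_uba_data_from_dir("data/uba")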
#' Load data for a Specific Directory
#'
#' @param data_dir Character. Path to the directory containing `.csv` files.
#' @return A `data.table` with the loaded data for the directory. Returns an
#' empty `data.table` if no files match the expected format.
#' @noRd
.load_data <- function(data_dir) {
data_files <- list.files(data_dir, pattern = "\\.csv$")
list_data <- lapply(data_files, function(file) {
.load_data_file(data_dir, file)
})
data.table::rbindlist(list_data, fill = TRUE)
}
#' Load data from a specific file
#'
#' Loads data from a file in "inv" or "24Spalten" format. Unsupported formats
#' return an empty `data.table`.
#'
#' @param data_dir Character. Base directory containing the file.
#' @param file Character. Name of the file to load.
#' @return `data.table` containing loaded data or empty data.table for
#' unsupported formats.
#' @noRd
.load_data_file <- function(data_dir, file) {
file_path <- file.path(data_dir, file)
if (grepl("inv", file)) {
.load_inv_file(file_path, file)
} else if (grepl("24Spalten", file)) {
.load_spalten_file(file_path, file)
} else {
data.table::data.table()
}
}
#' Load data from an 'inv' file
#'
#' @param file_path Full path to the 'inv' file.
#' @param file The filename being loaded.
#' @return A data.table with the loaded 'inv' data.
#' @noRd
.load_inv_file <- function(file_path, file) {
data.table::fread(file_path,
quote = "'",
na.strings = c(
"-999", "-888", "-777", "-666", "555", "-555", "-333",
"-111", "555.00000000"
)
) %>%
mutate(
date = ymd_hm(paste(Datum, Uhrzeit)),
Komponente_txt = Komponente,
Komponente = substring(sub("\\_.*", "", file), 7)
)
}
#' Load data from a '24Spalten' file
#'
#' @param file_path Full path to the 'Spalten' file.
#' @param file The filename being loaded.
#' @return `data.table` containing the processed data from the '24Spalten' file.
#' @noRd
.load_spalten_file <- function(file_path, file) {
data.table::fread(file_path,
quote = "'",
na.strings = c(
"-999", "-888", "-777", "-666", "555", "-555", "-333",
"-111", "555.00000000"
)
) %>%
tidyr::gather("time", "Wert", Wert01:Wert24) %>%
mutate(
Uhrzeit = substring(time, 5),
date = ymd_h(paste(Datum, Uhrzeit)),
Komponente_txt = Komponente,
Komponente = substring(sub("\\_.*", "", file), 8)
) %>%
dplyr::select(-c(Nachweisgrenze, Lieferung))
}
#' Removes trend from data
#'
#' Takes a list of train and application data as prepared by
#' [ubair::split_data_counterfactual()]
#' and removes a polynomial, exponential or cubic spline trend function.
#' The trend is estimated from the train data only. Use this as part of
#' preprocessing before training a model based on decision trees, i.e. random
#' forest and lightgbm. For the other methods it may be helpful, but they are
#' generally able to deal with trends themselves. Therefore we recommend trying
#' out different versions and guiding decisions using the model evaluation
#' metrics from [ubair::calc_performance_metrics()].
#'
#' Apply [ubair::retrend_predictions()] to predictions to return to the
#' original data units.
#'
#' @param split_data List of two named dataframes called train and apply
#' @param mode String which defines the type of trend that is present. Options
#' are "linear", "quadratic", "exponential", "spline", "none".
#' "none" returns the original data.
#' @param num_splines Defines the number of cubic splines if `mode="spline"`.
#' Choose num_splines=1 for a cubic polynomial trend. If `mode!="spline"`, this
#' parameter is ignored.
#' @param log_transform If `TRUE`, apply a log-transformation before detrending
#' to ensure positivity of all predictions in the rest of the pipeline.
#' An exponential transformation is applied during retrending to return to the
#' original solution space. Use only in combination with the `log_transform`
#' parameter in [ubair::retrend_predictions()].
#' @return List of three elements: the detrended train and apply dataframes and
#' the fitted trend model.
#' @examples
#' \dontrun{
#' split_data <- split_data_counterfactual(
#' dt_prepared, training_start,
#' training_end, application_start, application_end
#' )
#' detrended_list <- detrend(split_data, mode = "linear")
#' detrended_train <- detrended_list$train
#' detrended_apply <- detrended_list$apply
#' trend <- detrended_list$model
#' }
#' @export
#' @importFrom stats lm predict spline
#' @import splines
detrend <- function(split_data,
mode = "linear",
num_splines = 5,
log_transform = FALSE) {
dt_train_new <- data.table::copy(split_data$train)
dt_apply_new <- data.table::copy(split_data$apply)
stopifnot("log_transform needs to be boolean, i.e. either TRUE or FALSE" = class(log_transform) == "logical")
if (log_transform) {
dt_train_new$value <- log(dt_train_new$value)
dt_apply_new$value <- log(dt_apply_new$value)
}
if (mode == "linear") {
trend <- lm(value ~ date_unix, data = dt_train_new)
} else if (mode == "quadratic") {
trend <- lm(value ~ date_unix + I(date_unix^2), data = dt_train_new)
} else if (mode == "exponential") {
trend <- lm(value ~ log(date_unix), data = dt_train_new)
} else if (mode == "spline") {
stopifnot("Set num_splines larger or equal to 1" = num_splines >= 1)
knots <- seq(
from = min(dt_train_new$date_unix),
to = max(dt_train_new$date_unix),
length.out = num_splines + 1
)
knots <- knots[2:(length(knots) - 1)]
if (length(knots) <= 2) { # If 2 or fewer knots, just fit a cubic polynomial
trend <- lm(value ~ date_unix + I(date_unix^2) + I(date_unix^3),
data = dt_train_new
)
} else { # Else use splines
trend <- lm(value ~ bs(date_unix, knots = knots), data = dt_train_new)
}
} else if (mode == "none") {
trend <- lm(value ~ 1 - 1, data = dt_train_new)
} else {
stop("mode needs to be any of the following strings: 'linear',
'quadratic', 'exponential', 'spline', 'none'")
}
trend_train_values <- predict(trend, newdata = dt_train_new)
dt_train_new$value <- dt_train_new$value - trend_train_values
trend_apply_values <- predict(trend, newdata = dt_apply_new)
dt_apply_new$value <- dt_apply_new$value - trend_apply_values
return(list(train = dt_train_new, apply = dt_apply_new, model = trend))
}
#' Restores the trend in the predictions
#'
#' Takes a dataframe of predictions as returned by any of
#' the 'run_model' functions and restores a trend which was previously
#' removed via [ubair::detrend()]. This is necessary for the predictions
#' and the true values to have the same units. The function is basically
#' the inverse function to [ubair::detrend()] and should only be used in
#' combination with it.
#'
#' @param dt_predictions Dataframe of predictions with columns `value`,
#' `prediction`, `prediction_lower`, `prediction_upper`
#' @param trend lm object generated by [ubair::detrend()]
#' @param log_transform If `TRUE`, values are transformed back to the original
#' solution space after having been log-transformed during detrending. Use only
#' in combination with the `log_transform` parameter of [ubair::detrend()].
#' @return Retrended dataframe with same structure as `dt_predictions`
#' which is returned by any of the run_model() functions.
#' @examples
#' \dontrun{
#' detrended_list <- detrend(split_data,
#' mode = detrending_function,
#' log_transform = log_transform
#' )
#' trend <- detrended_list$model
#' detrended_train <- detrended_list$train
#' detrended_apply <- detrended_list$apply
#' detrended_train <- detrended_train %>% select(value, dplyr::any_of(variables))
#' result <- run_lightgbm(
#' train = detrended_train,
#' test = detrended_apply,
#' model_params = params$lightgbm,
#' alpha = 0.9,
#' calc_shaps = FALSE
#' )
#' retrended_predictions <- retrend_predictions(result$dt_predictions, trend)
#' }
#' @export
retrend_predictions <- function(dt_predictions, trend, log_transform = FALSE) {
stopifnot("log_transform needs to be boolean, i.e. TRUE or FALSE" = class(log_transform) == "logical")
stopifnot("trend object needs to be a linear model of class 'lm'" = class(trend) == "lm")
stopifnot(
"Not all of 'value', 'prediction', 'prediction_lower', 'prediction_upper' are columns in dt_predictions" =
all(c("value", "prediction", "prediction_lower", "prediction_upper")
%in% colnames(dt_predictions))
)
trend_value <- predict(trend, newdata = dt_predictions)
if (log_transform) {
dt_predictions[,
c("value", "prediction", "prediction_lower", "prediction_upper") := lapply(.SD, \(x) exp(x + trend_value)),
.SDcols = c("value", "prediction", "prediction_lower", "prediction_upper")
]
} else {
dt_predictions[,
c("value", "prediction", "prediction_lower", "prediction_upper") := lapply(.SD, \(x) x + trend_value),
.SDcols = c("value", "prediction", "prediction_lower", "prediction_upper")
]
}
dt_predictions
}
#' Standardize Training and Application Data
#'
#' This function standardizes numeric columns of the `train_data` and applies
#' the same scaling (mean and standard deviation) to the corresponding columns
#' in `apply_data`. It returns the standardized data along with the scaling
#' parameters (means and standard deviations). This is particularly important
#' for neural network approaches as they tend to be numerically unstable and
#' deteriorate otherwise.
#'
#' @param train_data A data frame containing the training dataset to be
#' standardized. It must contain numeric columns.
#' @param apply_data A data frame containing the dataset to which the scaling
#' from `train_data` will be applied.
#'
#' @return A list containing the following elements:
#' \item{train}{The standardized training data.}
#' \item{apply}{The `apply_data` scaled using the means and standard deviations
#' from the `train_data`.}
#' \item{means}{The means of the numeric columns in `train_data`.}
#' \item{sds}{The standard deviations of the numeric columns in `train_data`.}
#' @export
#' @importFrom dplyr mutate across where
#' @examples
#' \dontrun{
#' scale_result <- scale_data(
#' train_data = detrended_list$train,
#' apply_data = detrended_list$apply
#' )
#' scaled_train <- scale_result$train
#' scaled_apply <- scale_result$apply
#' }
scale_data <- function(train_data,
apply_data) {
means <- attr(
scale(train_data %>% select(where(is.numeric))),
"scaled:center"
)
sds <- attr(
scale(train_data %>% select(where(is.numeric))),
"scaled:scale"
)
train_data <- train_data %>%
dplyr::mutate_if(is.numeric, ~ as.numeric(scale(.)))
apply_data <- apply_data %>%
mutate(across(
names(sds),
~ as.numeric(scale(.,
center = means[dplyr::cur_column()],
scale = sds[dplyr::cur_column()]
))
))
list(
train = train_data,
apply = apply_data,
means = means,
sds = sds
)
}
#' Rescale predictions to original scale.
#'
#' This function rescales the predicted values (`prediction`, `prediction_lower`,
#' `prediction_upper`). The scaling is reversed using the means and
#' standard deviations that were saved from the training data. It is the inverse
#' function to [ubair::scale_data()] and should be used only in combination.
#'
#' @param scale_result A list object returned by [ubair::scale_data()],
#' containing the means and standard deviations used for scaling.
#' @param dt_predictions A data frame containing the predictions,
#' including columns `prediction`, `prediction_lower`, `prediction_upper`.
#'
#' @return A data frame with the predictions and numeric columns rescaled back
#' to their original scale.
#'
#' @export
#' @examples
#' \dontrun{
#' scale_res <- scale_data(train_data = train, apply_data = apply)
#' res <- run_fnn(train = scale_res$train, test = scale_res$apply, params)
#' dt_predictions <- res$dt_predictions
#' rescaled_predictions <- rescale_predictions(scale_res, dt_predictions)
#' }
rescale_predictions <- function(scale_result, dt_predictions) {
means <- scale_result$means
sds <- scale_result$sds
rescaled_predictions <- dt_predictions %>%
mutate(
prediction = prediction * sds["value"] + means["value"],
prediction_lower = prediction_lower * sds["value"] + means["value"],
prediction_upper = prediction_upper * sds["value"] + means["value"]
)
return(rescaled_predictions)
}
#' Calculates performance metrics of a business-as-usual model
#'
#' Model agnostic function to calculate a number of common performance
#' metrics on the reference time window.
#' Uses the true data `value` and the predictions `prediction` for this calculation.
#' The coverage is calculated from the columns `value`, `prediction_lower` and
#' `prediction_upper`.
#' Removes dates in the effect and buffer range, as the model is not expected
#' to perform correctly during these times. This incorrectness is precisely
#' what we use to estimate the effect.
#' @param predictions data.table or data.frame with the following columns
#' \describe{
#' \item{date}{Date of the observation. Needs to be comparable to
#' date_effect_start element.}
#' \item{value}{True observed value of the station}
#' \item{prediction}{Predicted model output for the same time and station
#' as value}
#' \item{prediction_lower}{Lower end of the prediction interval}
#' \item{prediction_upper}{Upper end of the prediction interval}
#' }
#'
#' @param date_effect_start A date. Start date of the
#' effect that is to be evaluated. The data from this point onwards is disregarded
#' for calculating model performance
#' @param buffer Integer. An additional buffer window before date_effect_start to account
#' for uncertainty in the effect start point. Disregards additional buffer data
#' points for model evaluation
#' @return Named vector with performance metrics of the model
#' @export
calc_performance_metrics <- function(predictions, date_effect_start = NULL, buffer = 0) {
df <- data.table::copy(predictions)
stopifnot("Buffer needs to be larger or equal to 0" = buffer >= 0)
if (!is.null(date_effect_start)) {
stopifnot(
"date_effect_start needs to be NULL or a date object" =
inherits(date_effect_start, "Date") | lubridate::is.POSIXt(date_effect_start)
)
df <- df[date < (date_effect_start - as.difftime(buffer, units = "hours"))]
}
metrics <- c(
"RMSE" = sqrt(mean((df$value - df$prediction)**2)),
"MSE" = mean((df$value - df$prediction)**2),
"MAE" = mean(abs(df$value - df$prediction)),
"MAPE" = mean(abs(df$value - df$prediction) / (df$value + 1)),
"Bias" = mean(df$prediction - df$value),
"R2" = 1 - sum((df$value - df$prediction)**2) / sum((df$value - mean(df$value))**2),
"Coverage lower" = mean(df$value >= df$prediction_lower),
"Coverage upper" = mean(df$value <= df$prediction_upper),
"Coverage" = mean(df$value >= df$prediction_lower) +
mean(df$value <= df$prediction_upper) - 1,
"Correlation" = stats::cor(df$value, df$prediction),
"MFB" = 2 * mean((df$prediction - df$value) / (df$prediction + df$value)),
"FGE" = 2 * mean(abs(df$prediction - df$value) / (df$prediction + df$value))
)
metrics
}
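# Usage sketch (dates taken from the README example): evaluate the
# business-as-usual model on the reference window only, excluding a 14-day
# buffer before the effect start.
# metrics <- calc_performance_metrics(predictions,
#   date_effect_start = lubridate::ymd_hm("20200323 00:00"),
#   buffer = 24 * 14
# )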
#' Calculates summary statistics for predictions and true values
#'
#' Helps with analyzing predictions by comparing them with the true values on
#' a number of relevant summary statistics.
#' @param predictions Data.table or data.frame with the following columns
#' \describe{
#' \item{date}{Date of the observation. Needs to be comparable to
#' date_effect_start element.}
#' \item{value}{True observed value of the station}
#' \item{prediction}{Predicted model output for the same time and station
#' as value}
#' }
#' @param date_effect_start A date. Start date of the
#' effect that is to be evaluated. The data from this point onwards is disregarded
#' for calculating model performance
#' @param buffer Integer. An additional buffer window before date_effect_start to account
#' for uncertainty in the effect start point. Disregards additional buffer data
#' points for model evaluation
#' @return data.frame of summary statistics with columns true and prediction
#' @export
calc_summary_statistics <- function(predictions, date_effect_start = NULL, buffer = 0) {
df <- data.table::copy(predictions)
stopifnot("Buffer needs to be larger or equal to 0" = buffer >= 0)
if (!is.null(date_effect_start)) {
stopifnot(
"date_effect_start needs to be NULL or a date object" =
inherits(date_effect_start, "Date") | lubridate::is.POSIXt(date_effect_start)
)
df <- df[date < (date_effect_start - as.difftime(buffer, units = "hours"))]
}
data.frame(
true = c(
min(df$value),
max(df$value),
stats::var(df$value),
mean(df$value),
stats::quantile(df$value, probs = 0.05),
stats::quantile(df$value, probs = 0.25),
stats::median(df$value),
stats::quantile(df$value, probs = 0.75),
stats::quantile(df$value, probs = 0.95)
),
prediction = c(
min(df$prediction),
max(df$prediction),
stats::var(df$prediction),
mean(df$prediction),
stats::quantile(df$prediction, probs = 0.05),
stats::quantile(df$prediction, probs = 0.25),
stats::median(df$prediction),
stats::quantile(df$prediction, probs = 0.75),
stats::quantile(df$prediction, probs = 0.95)
),
row.names = c(
"min", "max", "var", "mean", "5-percentile",
"25-percentile", "median/50-percentile",
"75-percentile", "95-percentile"
)
)
}
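# Usage sketch (same illustrative dates as above): compare distributional
# statistics of observations and predictions on the reference window.
# calc_summary_statistics(predictions,
#   date_effect_start = lubridate::ymd_hm("20200323 00:00"),
#   buffer = 24 * 14
# )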
#' Estimates size of the external effect
#'
#' Calculates an estimate for the absolute and relative effect size of the
#' external effect. The absolute effect is the difference between the model
#' bias in the reference time and the effect time windows. The relative effect
#' is the absolute effect divided by the mean true value in the reference
#' window.
#'
#' Note: Since the bias of the model is an average over predictions and true
#' values, it is important that the effect window is specified correctly.
#' Imagine a scenario like a fire which strongly affects the outcome for one
#' hour and is gone the next hour. If we use a two week effect window, the
#' estimated effect will be 14*24=336 times smaller compared to using a 1-hour
#' effect window. Generally, we advise against studying very short effects (single
#' hour or single day). The variability of results will be too large to learn
#' anything meaningful.
#'
#' @param df Data.table or data.frame with the following columns
#' \describe{
#' \item{date}{Date of the observation. Needs to be comparable to
#' date_effect_start element.}
#' \item{value}{True observed value of the station}
#' \item{prediction}{Predicted model output for the same time and station
#' as value}
#' }
#' @param date_effect_start A date. Start date of the
#' effect that is to be evaluated. The data from this point onward is disregarded
#' for calculating model performance.
#' @param buffer Integer. An additional buffer window before date_effect_start to account
#' for uncertainty in the effect start point. Disregards additional buffer data
#' points for model evaluation
#' @param verbose Prints an explanation of the results if TRUE
#' @return A list with two numbers: Absolute and relative estimated effect size.
#' @export
estimate_effect_size <- function(df, date_effect_start, buffer = 0, verbose = FALSE) {
stopifnot(
"date_effect_start needs to be NULL or a date object" =
inherits(date_effect_start, "Date") | lubridate::is.POSIXt(date_effect_start)
)
stopifnot("Buffer needs to be larger or equal to 0" = buffer >= 0)
reference <- df[date < (date_effect_start - as.difftime(buffer, units = "hours"))]
effect <- df[date >= date_effect_start]
bias_ref <- mean(reference$prediction - reference$value)
bias_effect <- mean(effect$prediction - effect$value)
effectsize <- bias_ref - bias_effect
rel_effectsize <- effectsize / mean(effect$value)
rel_effectsize <- round(rel_effectsize, 4)
if (verbose) {
cat(sprintf("The external effect changed the target value on average by %.3f compared to the reference time window. This is a %1.2f%% relative change.", effectsize, 100 * rel_effectsize))
}
list(absolute_effect = effectsize, relative_effect = rel_effectsize)
}
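# Usage sketch (illustrative dates): estimate the absolute and relative effect
# size for everything from date_effect_start onwards.
# estimate_effect_size(predictions,
#   date_effect_start = lubridate::ymd_hm("20200323 00:00"),
#   buffer = 24 * 14, verbose = TRUE
# )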
#' Prepare Data for Training a model
#'
#' Prepares environmental data by filtering for relevant components,
#' converting the data to a wide format, and adding temporal features. Should be
#' called before
#' \code{\link[ubair:split_data_counterfactual]{split_data_counterfactual()}}
#'
#' @param env_data A data table in long format.
#' Must include the following columns:
#' \describe{
#' \item{Station}{Station identifier for the data.}
#' \item{Komponente}{The environmental component being measured
#' (e.g., temperature, NO2).}
#' \item{Wert}{The measured value of the component.}
#' \item{date}{Timestamp as `POSIXct` object in `YYYY-MM-DD HH:MM:SS` format.}
#' \item{Komponente_txt}{A textual description of the component.}
#' }
#' @param params A list of modelling parameters loaded from `params.yaml`.
#' Must include:
#' \describe{
#' \item{meteo_variables}{A vector of meteorological variable names.}
#' \item{target}{The name of the target variable.}
#' }
#' @return A `data.table` in wide format, with columns:
#' `date`, one column per component, and temporal features
#' like `date_unix`, `day_julian`, `weekday`, and `hour`.
#' @examples
#' env_data <- data.table::data.table(
#' Station = c("StationA", "StationA", "StationA"),
#' Komponente = c("NO2", "TMP", "NO2"),
#' Wert = c(50, 20, 40),
#' date = as.POSIXct(c("2023-01-01 10:00:00", "2023-01-01 11:00:00", "2023-01-02 12:00:00"))
#' )
#' params <- list(meteo_variables = c("TMP"), target = "NO2")
#' prepared_data <- prepare_data_for_modelling(env_data, params)
#' print(prepared_data)
#'
#' @export
prepare_data_for_modelling <- function(env_data, params) {
components <- c(params$meteo_variables, params$target)
dt_filtered <- .extract_components(env_data, components)
dt_wide <- .cast_to_wide(dt_filtered)
if (!params$target %in% names(dt_wide)) {
warning(sprintf("Target '%s' is not present in the data. Make sure it exists
and you have set the correct target name", params$target))
stop("Exiting function due to missing target data.")
}
dt_prepared <- dt_wide %>%
dplyr::rename(value = params$target) %>%
.add_date_variables(replace = TRUE) %>%
dplyr::filter(!is.na(value))
dt_prepared
}
#' Turn date feature into temporal features date_unix, day_julian, weekday and
#' hour
#'
#' @param df Data.table with column date formatted as date-time object
#' @param replace Boolean which determines whether to replace existing temporal variables
#' @return A data.table with all relevant temporal features for modelling
#' @noRd
.add_date_variables <- function(df, replace) {
names <- names(df)
if (replace) {
df$date_unix <- as.numeric(df$date)
df$day_julian <- lubridate::yday(df$date)
df$weekday <- .wday_monday(df$date, as.factor = TRUE)
df$hour <- lubridate::hour(df$date)
} else {
if (!"date_unix" %in% names) {
df$date_unix <- as.numeric(df$date)
}
if (!"day_julian" %in% names) {
df$day_julian <- lubridate::yday(df$date)
}
if (!"weekday" %in% names) {
df$weekday <- .wday_monday(df$date, as.factor = TRUE)
}
if (!"hour" %in% names) {
df$hour <- lubridate::hour(df$date)
}
}
return(df)
}
#' Reformat lubridate weekdays into weekdays with monday as day 1
#'
#' @param x Vector of date-time objects
#' @param as.factor Boolean that determines whether to return a factor or a numeric vector
#' @noRd
.wday_monday <- function(x, as.factor = FALSE) {
x <- lubridate::wday(x)
x <- x - 1
x <- ifelse(x == 0, 7, x)
if (as.factor) {
x <- factor(x, levels = 1:7, ordered = TRUE)
}
return(x)
}
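# Example of the resulting mapping (assuming lubridate's default week start,
# Sunday = 1): Monday (2) -> 1, ..., Saturday (7) -> 6, Sunday (1) -> 7.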
#' Split Data into Training and Application Datasets
#'
#' Splits prepared data into training and application datasets based on
#' specified date ranges for a business-as-usual scenario. Data before
#' `application_start` and after `application_end` is used as training data,
#' while data within the date range is used for application.
#'
#' @param dt_prepared The prepared data table.
#' @param application_start The start date (date object) for the application
#' period of the business-as-usual simulation. This coincides with the start of
#' the reference window.
#' Can be created e.g. by lubridate::ymd("20191201").
#' @param application_end The end date (date object) for the application period
#' of the business-as-usual simulation. This coincides with the end of
#' the effect window.
#' Can be created e.g. by lubridate::ymd("20191201").
#' @return A list with two elements:
#' \describe{
#' \item{train}{Data outside the application period.}
#' \item{apply}{Data within the application period.}
#' }
#' @examples
#' dt_prepared <- data.table::data.table(
#' date = as.Date(c("2023-01-01", "2023-01-05", "2023-01-10")),
#' value = c(50, 60, 70)
#' )
#' result <- split_data_counterfactual(
#' dt_prepared,
#' application_start = as.Date("2023-01-03"),
#' application_end = as.Date("2023-01-08")
#' )
#' print(result$train)
#' print(result$apply)
#' @export
split_data_counterfactual <- function(dt_prepared,
application_start,
application_end) {
stopifnot(
inherits(application_start, "Date"),
inherits(application_end, "Date")
)
stopifnot(application_start <= application_end)
dt_train <- dt_prepared[date < application_start | date > application_end]
dt_apply <- dt_prepared[date >= application_start & date <= application_end]
list(train = dt_train, apply = dt_apply)
}
#' Extract Components for Modelling
#' Stop with error message if any selected meteo variable/component is not
#' contained in the data.
#'
#' @param env_data Daily aggregated data table.
#' @param components Vector of component names to extract.
#' @return A data.table filtered by the specified components.
#' @noRd
.extract_components <- function(env_data, components) {
if (!all(components %in% env_data$Komponente)) {
missing_components <- components[!components %in% env_data$Komponente]
stop(paste(
"Data does not contain all selected variables:", missing_components,
"\n Check data and meteo_variables/params.yaml."
))
}
env_data[Komponente %in% components, list(Komponente, Wert, date)]
}
#' @param dt_filtered Filtered data.table.
#' @return A wide-format data.table.
#' @noRd
#' @examples
#' dt_filtered <- data.table::data.table(
#' date = as.POSIXct(c("2023-01-01", "2023-01-01", "2023-01-02")),
#' Komponente = c("NO2", "TMP", "NO2"),
#' Wert = c(50, 20, 40)
#' )
#' wide_data <- .cast_to_wide(dt_filtered)
#' print(wide_data)
.cast_to_wide <- function(dt_filtered) {
data.table::dcast(dt_filtered,
formula = date ~ Komponente,
value.var = "Wert"
)
}
#' Environmental Data for Modelling from station DESN025 in Leipzig-Mitte.
#'
#' A dataset containing environmental measurements collected at the station in
#' Leipzig-Mitte with observations of different environmental components over
#' time. This data is used for environmental modelling tasks, including
#' meteorological variables and other targets.
#'
#' @format ## sample_data_DESN025
#' A data table with the following columns:
#' \describe{
#' \item{Station}{Station identifier where the data was collected.}
#' \item{Komponente}{The environmental component being measured
#' (e.g., temperature, NO2).}
#' \item{Wert}{The measured value of the component.}
#' \item{date}{The timestamp for the observation, formatted as a Date-Time
#' object in the format
#' \code{"YYYY-MM-DD HH:MM:SS"} (e.g., "2010-01-01 07:00:00").}
#' \item{Komponente_txt}{A textual description or label for the component.}
#' }
#'
#' The dataset is structured in a long format and is prepared for further
#' transformation into a wide format for modelling.
#'
#' @source Umweltbundesamt
#' @examples
#' \dontrun{
#' params <- load_params("path/to/params.yaml")
#' dt_prepared <- prepare_data_for_modelling(sample_data_DESN025, params)
#' }
"sample_data_DESN025"
# required to suppress devtools::check() notes that occur from data.table syntax
utils::globalVariables(c(
".", "Datum", "Komponente", "Komponente_txt",
"Lieferung", "Nachweisgrenze", "Station", "Uhrzeit",
"Wert", "Wert01", "Wert24", "day", "part", "prediction_upper",
"prediction_lower", "prediction",
"se", "time", "value", "variable", "WIR", "WIG", "year", "Werte_aggregiert"
))
#' Descriptive plot of daily time series data
#'
#' This function produces descriptive time-series plots with smoothing
#' for the meteorological and potential target variables that were measured at a station.
#'
#' @param env_data A data table of measurements of one air quality measurement station.
#' The data should contain the following columns:
#' \describe{
#' \item{Station}{Station identifier where the data was collected.}
#' \item{Komponente}{The environmental component being measured
#' (e.g., temperature, NO2).}
#' \item{Wert}{The measured value of the component.}
#' \item{date}{The timestamp for the observation,
#' formatted as a Date-Time object in the format
#' \code{"YYYY-MM-DD HH:MM:SS"} (e.g., "2010-01-01 07:00:00").}
#' \item{Komponente_txt}{A textual description or label for the component.}
#' }
#' @param variables list of variables to plot. Must be in `env_data$Komponente`.
#' Meteorological variables can be obtained from params.yaml.
#' @param years Optional. A numeric vector, list, or a range specifying the
#' years to restrict the plotted data.
#' You can provide:
#' - A single year: `years = 2020`
#' - A numeric vector of years: `years = c(2019, 2020, 2021)`
#' - A range of years: `years = 2019:2021`
#' If not provided, data for all available years will be used.
#' @param smoothing_factor A number that defines the magnitude of smoothing.
#' Default is 1. Smaller numbers correspond to less smoothing, larger numbers to more.
#' @export
#' @importFrom ggplot2 ggplot aes geom_line facet_wrap geom_smooth
plot_station_measurements <- function(env_data, variables, years = NULL, smoothing_factor = 1) {
stopifnot("No data in the env_data data.table" = nrow(env_data) > 0)
stopifnot(
"More than one station in env_data. Use clean_data to specify which one to use" =
length(unique(env_data$Station)) == 1
)
if (is.null(years)) {
years <- unique(env_data$year)
}
if (!"day" %in% colnames(env_data)) {
env_data <- .aggregate_data(env_data)
}
env_data[, Werte_aggregiert := data.table::frollmean(Wert,
12 * 7 * smoothing_factor,
na.rm = TRUE,
align = "center"
),
by = Komponente_txt
]
p <- ggplot(env_data[Komponente %in% variables & year %in% years], aes(day, Wert)) +
geom_line() +
facet_wrap(~Komponente_txt, scales = "free_y", ncol = 1) +
geom_line(aes(day, Werte_aggregiert), color = "blue", size = 1.5)
p
}
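# Usage sketch (using the bundled sample data and default parameters):
# params <- load_params()
# plot_station_measurements(sample_data_DESN025, params$meteo_variables)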
#' Prepare Plot Data and Plot Counterfactuals
#'
#' Smooths the predictions using a rolling mean, prepares the data for plotting,
#' and generates the counterfactual plot for the application window. Data before
#' the red box belong to the reference window, the red box marks the buffer, and
#' values after the black dotted line belong to the effect window.
#'
#' The optional grey ribbon is a prediction interval for the hourly values. The
#' interpretation for a 90% prediction interval (to be defined in `alpha` parameter
#' of [ubair::run_counterfactual()]) is that 90% of the true hourly values
#' (not the rolled means) lie within the grey band. This might be helpful for
#' getting an idea of the variance of the data and predictions.
#'
#' @param predictions The data.table containing the predictions (hourly)
#' @param params Parameters for plotting, including the target variable.
#' @param window_size The window size for the rolling mean (default is 14 days).
#' @param date_effect_start A date. Start date of the
#' effect that is to be evaluated. The data from this point onwards is disregarded
#' for calculating model performance
#' @param buffer Integer. An additional, optional buffer window before
#' `date_effect_start` to account for uncertainty in the effect start point.
#' Disregards additional buffer data points for model evaluation.
#' Use `buffer=0` for no buffer.
#' @param plot_pred_interval Boolean. If `TRUE`, shows a grey band of the prediction
#' interval.
#' @return A ggplot object with the counterfactual plot. Can be adjusted further,
#' e.g. set limits for the y-axis for better visualisation.
#' @export
#' @importFrom ggplot2 ggplot aes geom_ribbon geom_line geom_vline theme_bw labs
#' @importFrom data.table melt
#' @importFrom data.table .SD
#' @importFrom data.table :=
plot_counterfactual <- function(predictions, params, window_size = 14,
date_effect_start = NULL, buffer = 0,
plot_pred_interval = TRUE) {
stopifnot("No data in predictions" = nrow(predictions) > 0)
stopifnot(
"Not all of 'value', 'prediction', 'prediction_lower', 'prediction_upper' are present in predictions data.table" =
c("value", "prediction", "prediction_lower", "prediction_upper") %in% colnames(predictions)
)
# Smooth the data using a rolling mean
dt_plot <- data.table::copy(predictions)
dt_plot <- dt_plot[,
c("value", "prediction", "prediction_lower", "prediction_upper") :=
lapply(.SD, \(x) data.table::frollmean(x, 24 * window_size, align = "center")),
.SDcols = c("value", "prediction", "prediction_lower", "prediction_upper")
]
dt_plot <- dt_plot[!is.na(value)]
# Melt the data for plotting
dt_plot_melted <- melt(dt_plot,
id.vars = c("date", "prediction_lower", "prediction_upper"),
measure.vars = c("value", "prediction"),
variable.name = "variable",
value.name = "value"
)
# Prepare the ggplot object
p <- ggplot(dt_plot_melted, aes(x = date))
if (plot_pred_interval) {
p <- p +
geom_ribbon(aes(ymin = prediction_lower, ymax = prediction_upper),
fill = "grey70", alpha = 0.8
)
}
p <- p + geom_line(aes(y = value, color = variable)) +
theme_bw() +
ggplot2::scale_x_datetime(
date_minor_breaks = "1 month",
date_breaks = "2 month"
) +
labs(y = paste(params$target, paste("concentration", window_size, "d rolling mean")))
# Add vertical lines for external effect if provided
if (!is.null(date_effect_start) && length(date_effect_start) == 1) {
p <- p + geom_vline(
xintercept = date_effect_start, linetype = 4,
colour = "black"
)
}
if (buffer > 0) {
p <- p + ggplot2::annotate("rect",
xmin = date_effect_start - as.difftime(buffer, units = "hours"),
xmax = date_effect_start,
ymin = -Inf,
ymax = Inf,
alpha = 0.1,
fill = "red"
)
}
p
}
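# Usage sketch (mirroring the README example): 14-day rolling mean, 14-day
# buffer before the effect start and the optional prediction interval band.
# plot_counterfactual(predictions, params,
#   window_size = 14,
#   date_effect_start = lubridate::ymd_hm("20200323 00:00"),
#   buffer = 24 * 14, plot_pred_interval = TRUE
# )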
---
output:
github_document:
df_print: kable
editor_options:
chunk_output_type: console
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r setup, include=FALSE}
knitr::opts_chunk$set(
message = FALSE,
warning = FALSE,
eval = FALSE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# ubair <img src="inst/sticker/stickers-ubair-1.png" align="right" width="20%"/>
**ubair** is an R package for the statistical investigation of the impact of external conditions on air quality: it analyzes and visualizes the impact of external factors, such as traffic restrictions, hazards, and political measures, on air quality. It aims to provide experts with a transparent comparison of modeling approaches and to support data-driven evaluations for policy advisory purposes.
## Installation
Install via CRAN, or, if you have access to
[https://gitlab.ai-env.de/use-case-luft/ubair](https://gitlab.ai-env.de/use-case-luft/ubair),
use one of the following options:
#### Using an archive file
Recommended if you do not have git installed.
- Download zip/tar.gz from GitLab
- Start a new R-Project or open an existing one
- in RStudio:
  - go to the 'Packages' tab (next to Help/Plots/Files)
  - click 'Install' (upper left corner)
  - Install from: choose "Package Archive File"
  - browse to the zip file
  - click 'Install'
- alternatively, type in the console:
```{r install_package_local}
install.packages("<path-to-zip>/ubair-master.zip", repos = NULL, type = "source")
```
#### Using remote package
Git needs to be installed.
```{r install_package_remote}
install.packages("remotes")
# requires a configured ssh key
remotes::install_git("git@gitlab.ai-env.de:use-case-luft/ubair.git")
# alternative via password
remotes::install_git("https://gitlab.ai-env.de/use-case-luft/ubair.git")
```
## Sample Usage of package
For a more detailed explanation of the package, you can access the vignettes:
- View the user_sample source code directly in the [vignettes/](vignettes/) folder.
- Open the vignette via `vignette("user_sample_1", package = "ubair")`
  if the package was installed with vignettes.
``` {r load data, eval=TRUE}
library(ubair)
params <- load_params()
env_data <- sample_data_DESN025
```
```{r plot-meteo-data, eval=TRUE, fig.height=10, fig.width=10}
# Plot meteo data
plot_station_measurements(env_data, params$meteo_variables)
```
- split data into training, reference and effect time intervals
<img src="man/figures/time_split_overview.png" width="100%"/>
``` {r counterfactual-scenario, eval=TRUE, fig.height=5, fig.width=10}
application_start <- lubridate::ymd("20191201") # This coincides with the start of the reference window
date_effect_start <- lubridate::ymd_hm("20200323 00:00") # This splits the forecast into reference and effect
application_end <- lubridate::ymd("20200504") # This coincides with the end of the effect window
buffer <- 24 * 14 # 14 days buffer
dt_prepared <- prepare_data_for_modelling(env_data, params)
dt_prepared <- dt_prepared[complete.cases(dt_prepared)]
split_data <- split_data_counterfactual(
dt_prepared, application_start,
application_end
)
res <- run_counterfactual(split_data,
params,
detrending_function = "linear",
model_type = "lightgbm",
alpha = 0.9,
log_transform = TRUE,
calc_shaps = TRUE
)
predictions <- res$prediction
plot_counterfactual(predictions, params,
window_size = 14,
date_effect_start,
buffer = buffer,
plot_pred_interval = TRUE
)
```
```{r evaluation metrics, eval=TRUE}
round(calc_performance_metrics(predictions, date_effect_start, buffer = buffer), 2)
round(calc_summary_statistics(predictions, date_effect_start, buffer = buffer), 2)
estimate_effect_size(predictions, date_effect_start, buffer = buffer, verbose = TRUE)
```
### SHAP feature importances
```{r feature_importance, eval=TRUE, fig.height=4, fig.width=8}
shapviz::sv_importance(res$importance, kind = "bee")
xvars <- c("TMP", "WIG", "GLO", "WIR")
shapviz::sv_dependence(res$importance, v = xvars)
```
## Development
### Prerequisites
1. **R**: Make sure you have R installed (recommended version 4.4.1). You can download it from [CRAN](https://cran.r-project.org/).
2. **RStudio** (optional but recommended): Download from [RStudio](https://www.rstudio.com/).
### Setting Up the Environment
Install the development version of ubair:
```{r}
install.packages("renv")
renv::restore()
devtools::build()
devtools::load_all()
```
### Development
#### Install pre-commit hook (required to ensure tidyverse code formatting)
```
pip install pre-commit
```
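Installing the Python tool alone does not register the hook in your clone; activate it afterwards with the standard pre-commit command:
```
pre-commit install
```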
#### Add new requirements
If you add new dependencies to the *ubair* package, make sure to update the renv.lock file:
``` r
renv::snapshot()
```
#### Style and documentation
Before you commit your changes, update the documentation, ensure the style complies with the tidyverse style guide, and check that all tests run without errors:
```{r}
# update documentation and check package integrity
devtools::check()
# apply tidyverse style (also applied as precommit hook)
usethis::use_tidy_style()
# you can check for existing lintr warnings by
devtools::lint()
# run tests
devtools::test()
# build README.md if any changes have been made to README.Rmd
devtools::build_readme()
```
#### Pre-commit hook
In `.pre-commit-config.yaml`, pre-commit rules are defined and applied before each commit.
This includes:
- run styler to format code in tidyverse style
- run roxygen to update doc
- check if readme is up to date
- run lintr to finally check code style format
If pre-commit fails, check the automatically applied changes, stage them, and retry the commit.
#### Test Coverage
Install `covr` to run this:
```{r test_coverage, message=FALSE, warning=FALSE}
cov <- covr::package_coverage(type = "all")
cov_list <- covr::coverage_to_list(cov)
data.table::data.table(
part = c("Total", names(cov_list$filecoverage)),
coverage = c(cov_list$totalcoverage, as.vector(cov_list$filecoverage))
)
```
```{r test_coverage_report}
covr::report(cov)
```
## Contacts
**Jore Noa Averbeck**
[JoreNoa.Averbeck@uba.de](mailto:JoreNoa.Averbeck@uba.de)
**Raphael Franke**
[Raphael.Franke@uba.de](mailto:Raphael.Franke@uba.de)
**Imke Voß**
[imke.voss@uba.de](mailto:imke.voss@uba.de)
