<!-- README.md is generated from README.Rmd. Please edit that file -->
# ubair <img src="inst/sticker/stickers-ubair-1.png" align="right" width="20%"/>
**ubair** is an R package for Statistical Investigation of the Impact of External Conditions on Air Quality: it uses the statistical software R to analyze and visualize the impact of external factors, such as traffic restrictions, hazards, and political measures, on air quality. It aims to provide experts with a transparent comparison of modeling approaches and to support data-driven evaluations for policy advisory purposes.
## Installation
Install via CRAN, or if you have access to <https://gitlab.ai-env.de/use-case-luft/ubair>, you can use one of the following options:
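A minimal sketch of the CRAN route (assuming the package is published on CRAN under the same name):

``` r
install.packages("ubair")
```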
#### Using an archive file
Recommended if you do not have git installed.
- Download the zip/tar.gz from GitLab
- Start a new R project or open an existing one
- In RStudio:
  - Go to the 'Packages' tab (next to Help/Plots/Files)
  - Click on 'Install' (upper left corner)
  - Install from: choose "Package Archive File"
  - Browse to the zip file
  - Click 'Install'
- Alternatively, type in the console:
``` r
install.packages("<path-to-zip>/ubair-master.zip", repos = NULL, type = "source")
```
#### Using remote package
Git needs to be installed.
``` r
install.packages("remotes")
# requires a configured ssh-key
remotes::install_git("git@gitlab.ai-env.de:use-case-luft/ubair.git")
# alternative via password
remotes::install_git("https://gitlab.ai-env.de/use-case-luft/ubair.git")
```
## Sample Usage of package
For a more detailed explanation of the package, you can access the vignettes:
- View user_sample source code directly in the [vignettes/](vignettes/) folder.
- Open the vignette with `vignette("user_sample_1", package = "ubair")` if the package was installed with vignettes (see the example below)
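For example (assuming the package was installed with its vignettes; `browseVignettes()` is the standard base R way to list them):

``` r
# list all vignettes shipped with the installed package
browseVignettes("ubair")
# open the user sample vignette directly
vignette("user_sample_1", package = "ubair")
```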
``` r
library(ubair)
params <- load_params()
env_data <- sample_data_DESN025
```
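To adapt the model parameters, you can copy the packaged `params.yaml`, edit it, and load your own version instead of the defaults (a sketch using `copy_default_params()` and `load_params()`; the destination path is only an example):

``` r
# copy the default params.yaml to a directory of your choice and edit it there
copy_default_params("path/to/your/config")
# load the edited parameter file instead of the packaged defaults
params <- load_params("path/to/your/config/params.yaml")
```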
``` r
# Plot meteo data
plot_station_measurements(env_data, params$meteo_variables)
```
<img src="man/figures/README-plot-meteo-data-1.png" width="100%"/>
Split the data into training, reference and effect time intervals:

<img src="man/figures/time_split_overview.png" width="100%"/>
``` r
application_start <- lubridate::ymd("20191201") # This coincides with the start of the reference window
date_effect_start <- lubridate::ymd_hm("20200323 00:00") # This splits the forecast into reference and effect
application_end <- lubridate::ymd("20200504") # This coincides with the end of the effect window
buffer <- 24 * 14 # 14 days buffer
dt_prepared <- prepare_data_for_modelling(env_data, params)
dt_prepared <- dt_prepared[complete.cases(dt_prepared)]
split_data <- split_data_counterfactual(
  dt_prepared, application_start,
  application_end
)
res <- run_counterfactual(split_data,
  params,
  detrending_function = "linear",
  model_type = "lightgbm",
  alpha = 0.9,
  log_transform = TRUE,
  calc_shaps = TRUE
)
```
```
#> [LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.028078 seconds.
#> You can set `force_col_wise=true` to remove the overhead.
#> [LightGBM] [Info] Total Bins 1557
#> [LightGBM] [Info] Number of data points in the train set: 104486, number of used features: 9
#> [LightGBM] [Info] Start training from score -0.000000
```
``` r
predictions <- res$prediction
plot_counterfactual(predictions, params,
  window_size = 14,
  date_effect_start,
  buffer = buffer,
  plot_pred_interval = TRUE
)
```
<img src="man/figures/README-counterfactual-scenario-1.png" width="100%"/>
``` r
round(calc_performance_metrics(predictions, date_effect_start, buffer = buffer), 2)
```
```
#>           RMSE            MSE            MAE           MAPE           Bias
#>           7.38          54.48           5.38           0.18          -2.73
#>             R2 Coverage lower Coverage upper       Coverage    Correlation
#>           0.74           0.97           0.95           0.92           0.89
#>            MFB            FGE
#>          -0.05           0.19
```
``` r
round(calc_summary_statistics(predictions, date_effect_start, buffer = buffer), 2)
```
::: kable-table
| | true | prediction |
|:---------------------|-------:|-----------:|
| min | 3.36 | 5.58 |
| max | 111.90 | 59.71 |
| var | 212.96 | 128.16 |
| mean | 30.80 | 28.07 |
| 5-percentile | 9.29 | 10.73 |
| 25-percentile | 19.85 | 19.40 |
| median/50-percentile | 29.60 | 27.09 |
| 75-percentile | 40.54 | 36.27 |
| 95-percentile | 56.80 | 47.69 |
:::
``` r
estimate_effect_size(predictions, date_effect_start, buffer = buffer, verbose = TRUE)
```
```
#> The external effect changed the target value on average by -6.294 compared to the reference time window. This is a -26.37% relative change.
#> $absolute_effect
#> [1] -6.294028
#>
#> $relative_effect
#> [1] -0.2637
```
### SHAP feature importances
``` r
shapviz::sv_importance(res$importance, kind = "bee")
```
<img src="man/figures/README-feature_importance-1.png" width="100%"/>
``` r
xvars <- c("TMP", "WIG", "GLO", "WIR")
shapviz::sv_dependence(res$importance, v = xvars)
```
<img src="man/figures/README-feature_importance-2.png" width="100%"/>
## Development
### Prerequisites
1. **R**: Make sure you have R installed (recommended version 4.4.1). You can download it from [CRAN](https://cran.r-project.org/).
2. **RStudio** (optional but recommended): Download from [RStudio](https://www.rstudio.com/).
### Setting Up the Environment
Install the development version of ubair:
``` r
install.packages("renv")
renv::restore()
devtools::build()
devtools::load_all()
```
### Development
#### Install pre-commit hook (required to ensure tidyverse code formatting)
```
pip install pre-commit
```
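Note: with standard pre-commit usage, the hooks defined for this repository also need to be registered once in your local clone by running `pre-commit install` in the repository root.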
#### Add new requirements
If you add new dependencies to the *ubair* package, make sure to update the renv.lock file:
``` r
renv::snapshot()
```
#### Style and documentation
Before you commit your changes, update the documentation, ensure the style complies with the tidyverse style guide, and check that all tests run without errors:
``` r
# update documentation and check package integrity
devtools::check()
# apply tidyverse style (also applied as precommit hook)
usethis::use_tidy_style()
# you can check for existing lintr warnings by
devtools::lint()
# run tests
devtools::test()
# build README.md if any changes have been made to README.Rmd
devtools::build_readme()
```
#### Pre-commit hook
The pre-commit rules are defined in .pre-commit-hook.yaml and applied before each commit. They are split into the following steps:

- run styler to format the code in tidyverse style
- run roxygen to update the documentation
- check whether the README is up to date
- run lintr as a final check of the code style

If the pre-commit hook fails, review the automatically applied changes, stage them and retry the commit.
#### Test Coverage
Install covr to run this.
``` r
cov <- covr::package_coverage(type = "all")
cov_list <- covr::coverage_to_list(cov)
data.table::data.table(
part = c("Total", names(cov_list$filecoverage)),
coverage = c(cov_list$totalcoverage, as.vector(cov_list$filecoverage))
)
```
``` r
covr::report(cov)
```
## Contacts
**Jore Noa Averbeck** [JoreNoa.Averbeck\@uba.de](mailto:JoreNoa.Averbeck@uba.de){.email}
**Raphael Franke** [Raphael.Franke\@uba.de](mailto:Raphael.Franke@uba.de){.email}
**Imke Voß** [imke.voss\@uba.de](mailto:imke.voss@uba.de){.email}
File added: default parameter configuration (see `load_params()` below):
target: 'NO2'
lightgbm:
  nrounds: 200
  eta: 0.03
  num_leaves: 32
dynamic_regression:
  ntrain: 8760 # 24*365 = 1 year of training data
random_forest:
  num.trees: 300
  mtry: NULL
  min.node.size: NULL
  max.depth: 10
fnn:
  activationfun: tanh
  output: linear
  learningrate: 0.05
  learningrate_scale: 1
  batchsize: 32
  momentum: 0.9
  visible_dropout: 0.0
  hidden_dropout: 0.0
  hidden:
    - 50
    - 50
  numepochs: 200
meteo_variables:
  - GLO
  - TMP
  - RFE
  - WIG
  - WIR
  - LDR
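These are the defaults that `load_params()` reads from the package's `extdata` directory when no custom file path is supplied.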
Image files added: inst/sticker/smoke.png, inst/sticker/stickers-ubair-1.png, ki-lab-logo.png
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/model_evaluation.R
\name{calc_performance_metrics}
\alias{calc_performance_metrics}
\title{Calculates performance metrics of a business-as-usual model}
\usage{
calc_performance_metrics(predictions, date_effect_start = NULL, buffer = 0)
}
\arguments{
\item{predictions}{data.table or data.frame with the following columns
\describe{
\item{date}{Date of the observation. Needs to be comparable to
date_effect_start element.}
\item{value}{True observed value of the station}
\item{prediction}{Predicted model output for the same time and station
as value}
\item{prediction_lower}{Lower end of the prediction interval}
\item{prediction_upper}{Upper end of the prediction interval}
}}
\item{date_effect_start}{A date. Start date of the
effect that is to be evaluated. The data from this point onwards is disregarded
for calculating model performance}
\item{buffer}{Integer. An additional buffer window before date_effect_start to account
for uncertainty in the effect start point. Disregards additional buffer data
points for model evaluation}
}
\value{
Named vector with performance metrics of the model
}
\description{
Model agnostic function to calculate a number of common performance
metrics on the reference time window.
Uses the true data \code{value} and the predictions \code{prediction} for this calculation.
The coverage is calculated from the columns \code{value}, \code{prediction_lower} and
\code{prediction_upper}.
Removes dates in the effect and buffer range as the model is not expected to
be performing correctly for these times. The incorrectness is precisely
what we are using for estimating the effect.
}
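% Illustrative example (not roxygen-generated); `predictions` is assumed to be the
% prediction table returned by run_counterfactual() as shown in the README.
\examples{
\dontrun{
metrics <- calc_performance_metrics(predictions,
  date_effect_start = lubridate::ymd_hm("20200323 00:00"),
  buffer = 24 * 14
)
round(metrics, 2)
}
}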
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/model_evaluation.R
\name{calc_summary_statistics}
\alias{calc_summary_statistics}
\title{Calculates summary statistics for predictions and true values}
\usage{
calc_summary_statistics(predictions, date_effect_start = NULL, buffer = 0)
}
\arguments{
\item{predictions}{Data.table or data.frame with the following columns
\describe{
\item{date}{Date of the observation. Needs to be comparable to
date_effect_start element.}
\item{value}{True observed value of the station}
\item{prediction}{Predicted model output for the same time and station
as value}
}}
\item{date_effect_start}{A date. Start date of the
effect that is to be evaluated. The data from this point onwards is disregarded
for calculating model performance}
\item{buffer}{Integer. An additional buffer window before date_effect_start to account
for uncertainty in the effect start point. Disregards additional buffer data
points for model evaluation}
}
\value{
data.frame of summary statistics with columns true and prediction
}
\description{
Helps with analyzing predictions by comparing them with the true values on
a number of relevant summary statistics.
}
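% Illustrative example (not roxygen-generated); `predictions` is assumed to be the
% prediction table returned by run_counterfactual() as shown in the README.
\examples{
\dontrun{
round(calc_summary_statistics(predictions,
  date_effect_start = lubridate::ymd_hm("20200323 00:00"),
  buffer = 24 * 14
), 2)
}
}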
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/data_cleaning.R
\name{clean_data}
\alias{clean_data}
\title{Clean and Optionally Aggregate Environmental Data}
\usage{
clean_data(env_data, station, aggregate_daily = FALSE)
}
\arguments{
\item{env_data}{A data table in long format.
Must include columns:
\describe{
\item{Station}{Station identifier for the data.}
\item{Komponente}{Measured environmental component e.g. temperature, NO2.}
\item{Wert}{Measured value.}
\item{date}{Timestamp as Date-Time object (\verb{YYYY-MM-DD HH:MM:SS} format).}
\item{Komponente_txt}{Textual description of the component.}
}}
\item{station}{Character. Name of the station to filter by.}
\item{aggregate_daily}{Logical. If \code{TRUE}, aggregates data to daily mean values. Default is \code{FALSE}.}
}
\value{
A \code{data.table}:
\itemize{
\item If \code{aggregate_daily = TRUE}: Contains columns for station, component, day, year,
and the daily mean value of the measurements.
\item If \code{aggregate_daily = FALSE}: Contains cleaned data with duplicates removed.
}
}
\description{
Cleans a data table of environmental measurements by filtering for a specific
station, removing duplicates, and optionally aggregating the data on a daily
basis using the mean.
}
\details{
Duplicate rows (by \code{date}, \code{Komponente}, and \code{Station}) are removed. A warning is issued
if duplicates are found.
}
\examples{
# Example data
env_data <- data.table::data.table(
Station = c("DENW094", "DENW094", "DENW006", "DENW094"),
Komponente = c("NO2", "O3", "NO2", "NO2"),
Wert = c(45, 30, 50, 40),
date = as.POSIXct(c(
"2023-01-01 08:00:00", "2023-01-01 09:00:00",
"2023-01-01 08:00:00", "2023-01-02 08:00:00"
)),
Komponente_txt = c(
"Nitrogen Dioxide", "Ozone", "Nitrogen Dioxide", "Nitrogen Dioxide"
)
)
# Clean data for station DENW094 without aggregation
cleaned_data <- clean_data(env_data, station = "DENW094", aggregate_daily = FALSE)
print(cleaned_data)
}
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/data_loading.R
\name{copy_default_params}
\alias{copy_default_params}
\title{Copy Default Parameters File}
\usage{
copy_default_params(dest_dir)
}
\arguments{
\item{dest_dir}{Character. The path to the directory where the \code{params.yaml}
file will be copied.}
}
\value{
Nothing is returned. A message is displayed upon successful copying.
}
\description{
Copies the default \code{params.yaml} file, included with the package, to a
specified destination directory. This is useful for initializing parameter
files for custom edits.
}
\details{
The \code{params.yaml} file contains default model parameters for various
configurations such as LightGBM, dynamic regression, and others. See the
\code{\link[ubair:load_params]{load_params()}} documentation for an example of the file's structure.
}
\examples{
\dontrun{
copy_default_params("path/to/destination")
}
}
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/data_preprocessing.R
\name{detrend}
\alias{detrend}
\title{Removes trend from data}
\usage{
detrend(split_data, mode = "linear", num_splines = 5, log_transform = FALSE)
}
\arguments{
\item{split_data}{List of two named dataframes called train and apply}
\item{mode}{String which defines type of trend is present. Options are
"linear", "quadratic", "exponential", "spline", "none".
"none" returns original data}
\item{num_splines}{Defines the number of cubic splines if \code{mode="spline"}.
Choose num_splines=1 for cubic polynomial trend. If \code{mode!="spline"}, this
parameter is ignored}
\item{log_transform}{If \code{TRUE}, use a log-transformation before detrending
to ensure positivity of all predictions in the rest of the pipeline.
An exp transformation is necessary during retrending to return to the solution
space. Use only in combination with the \code{log_transform} parameter in
\code{\link[=retrend_predictions]{retrend_predictions()}}}
}
\value{
A list of three elements: the detrended train and apply data frames, and the
trend function
}
\description{
Takes a list of train and application data as prepared by
\code{\link[=split_data_counterfactual]{split_data_counterfactual()}}
and removes a polynomial, exponential or cubic spline trend function.
The trend is obtained only from the train data. Use as part of preprocessing before
training a model based on decision trees, i.e. random forest and lightgbm.
For the other methods it may be helpful, but they are generally able to
deal with trends themselves. Therefore, we recommend trying out different
versions and guiding decisions using the model evaluation metrics from
\code{\link[=calc_performance_metrics]{calc_performance_metrics()}}.
}
\details{
Apply \code{\link[=retrend_predictions]{retrend_predictions()}} to predictions to return to the
original data units.
}
\examples{
\dontrun{
split_data <- split_data_counterfactual(
dt_prepared, training_start,
training_end, application_start, application_end
)
detrended_list <- detrend(split_data, mode = "linear")
detrended_train <- detrended_list$train
detrended_apply <- detrended_list$apply
trend <- detrended_list$model
}
}
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/model_evaluation.R
\name{estimate_effect_size}
\alias{estimate_effect_size}
\title{Estimates size of the external effect}
\usage{
estimate_effect_size(df, date_effect_start, buffer = 0, verbose = FALSE)
}
\arguments{
\item{df}{Data.table or data.frame with the following columns
\describe{
\item{date}{Date of the observation. Needs to be comparable to
date_effect_start element.}
\item{value}{True observed value of the station}
\item{prediction}{Predicted model output for the same time and station
as value}
}}
\item{date_effect_start}{A date. Start date of the
effect that is to be evaluated. The data from this point onward is disregarded
for calculating model performance.}
\item{buffer}{Integer. An additional buffer window before date_effect_start to account
for uncertainty in the effect start point. Disregards additional buffer data
points for model evaluation}
\item{verbose}{Prints an explanation of the results if TRUE}
}
\value{
A list with two numbers: Absolute and relative estimated effect size.
}
\description{
Calculates an estimate for the absolute and relative effect size of the
external effect. The absolute effect is the difference between the model
bias in the reference time and the effect time windows. The relative effect
is the absolute effect divided by the mean true value in the reference
window.
}
\details{
Note: Since the bias of the model is an average over predictions and true
values, it is important that the effect window is specified correctly.
Imagine a scenario like a fire which strongly affects the outcome for one
hour and is gone the next hour. If we use a two week effect window, the
estimated effect will be 14*24=336 times smaller compared to using a 1-hour
effect window. Generally, we advise against studying very short effects (single
hour or single day). The variability of results will be too large to learn
anything meaningful.
}
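% Illustrative example (not roxygen-generated); `predictions` is assumed to be the
% prediction table returned by run_counterfactual() as shown in the README.
\examples{
\dontrun{
estimate_effect_size(predictions,
  date_effect_start = lubridate::ymd_hm("20200323 00:00"),
  buffer = 24 * 14,
  verbose = TRUE
)
}
}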
Image files added: man/figures/README-counterfactual-scenario-1.png, man/figures/README-feature_importance-1.png, man/figures/README-feature_importance-2.png, man/figures/README-plot-meteo-data-1.png, man/figures/time_split_overview.png
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/data_cleaning.R
\name{get_meteo_available}
\alias{get_meteo_available}
\title{Get Available Meteorological Components}
\usage{
get_meteo_available(env_data)
}
\arguments{
\item{env_data}{Data table containing environmental data.
Must contain column "Komponente"}
}
\value{
A vector of available meteorological components.
}
\description{
Identifies unique meteorological components from the provided environmental data,
filtering only those that match the predefined UBA naming conventions. These components
include "GLO", "LDR", "RFE", "TMP", "WIG", "WIR", "WIND_U", and "WIND_V".
}
\examples{
# Example environmental data
env_data <- data.table::data.table(
Komponente = c("TMP", "NO2", "GLO", "WIR"),
Wert = c(25, 40, 300, 50),
date = as.POSIXct(c(
"2023-01-01 08:00:00", "2023-01-01 09:00:00",
"2023-01-01 10:00:00", "2023-01-01 11:00:00"
))
)
# Get available meteorological components
meteo_components <- get_meteo_available(env_data)
print(meteo_components)
}
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/data_loading.R
\name{load_params}
\alias{load_params}
\title{Load Parameters from YAML File}
\usage{
load_params(filepath = NULL)
}
\arguments{
\item{filepath}{Character. Path to the YAML file. If \code{NULL}, the function
will attempt to load the default \code{params.yaml} provided in the package.}
}
\value{
A list containing the parameters loaded from the YAML file.
}
\description{
Reads a YAML file containing model parameters, including station settings,
variables, and configurations for various models. If no file path is
provided, the function defaults to loading \code{params.yaml} from the package's
\code{extdata} directory.
}
\details{
The YAML file should define parameters in a structured format, such as:
\if{html}{\out{<div class="sourceCode yaml">}}\preformatted{target: 'NO2'
lightgbm:
nrounds: 200
eta: 0.03
num_leaves: 32
dynamic_regression:
ntrain: 8760
random_forest:
num.trees: 300
max.depth: 10
meteo_variables:
- GLO
- TMP
}\if{html}{\out{</div>}}
}
\examples{
\dontrun{
params <- load_params("path/to/custom_params.yaml")
}
}
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/data_loading.R
\name{load_uba_data_from_dir}
\alias{load_uba_data_from_dir}
\title{Load UBA Data from Directory}
\usage{
load_uba_data_from_dir(data_dir)
}
\arguments{
\item{data_dir}{Character. Path to the directory containing \code{.csv} files.}
}
\value{
A \code{data.table} containing the loaded data in long format. Returns an error if no valid
files are found or the resulting dataset is empty.
}
\description{
This function loads data from CSV files in the specified directory. It supports two formats:
}
\details{
\enumerate{
\item "inv": Files must contain the following columns:
\itemize{
\item \code{Station}, \code{Komponente}, \code{Datum}, \code{Uhrzeit}, \code{Wert}.
}
\item "24Spalten": Files must contain:
\itemize{
\item \code{Station}, \code{Komponente}, \code{Datum}, and columns \code{Wert01}, ..., \code{Wert24}.
}
}
File names should include "inv" or "24Spalten" to indicate their format. The function scans
recursively for \code{.csv} files in subdirectories and combines the data into a single \code{data.table}
in long format.
Files that are not in the exected format will be ignored.
}
