Compare revisions

Raphael Franke · 5d21dfb7
--- a/.Rbuildignore
+++ b/.Rbuildignore
+^CONTRIBUTING\.md$
+^Dockerfile$
+^LICENSE\.md$
+^Meta$
+^README\.Rmd$
+^\./inst/extdata/data$
+^\.Rproj\.user$
+^\.dvc$
+^\.gitlab$
+^\.gitlab-ci\.yml$
+^\.gitlab/issue_templates/.gitkeep$
+^\.lintr$
+^\.pre-commit-config\.yaml$
+^\.venv$
+^doc$
+^examples$
+^renv$
+^renv\.lock$
+^ubair\.Rproj$
+^vignettes/figure/
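
The patterns in this hunk can be sanity-checked before building. The sketch below is only an approximation: it assumes that `R CMD build` treats each `.Rbuildignore` line as a Perl-style regular expression matched case-insensitively against paths relative to the package root, and it uses Python's `re` module as a stand-in for R's regex engine. The test paths (for example `R/model.R`) are hypothetical. Under these assumptions, a pattern that begins with `^\./` would not match a plain relative path such as `inst/extdata/data`.

```python
import re

# A subset of the patterns added to .Rbuildignore in the hunk above.
PATTERNS = [
    r"^Dockerfile$",
    r"^\./inst/extdata/data$",
    r"^\.gitlab-ci\.yml$",
    r"^renv\.lock$",
    r"^vignettes/figure/",
]


def is_ignored(path: str) -> bool:
    """Return True if any pattern matches the package-relative path."""
    return any(re.search(p, path, flags=re.IGNORECASE) for p in PATTERNS)


# Hypothetical package-relative paths used only for illustration.
for path in [
    "Dockerfile",
    "renv.lock",
    "vignettes/figure/plot.png",
    "inst/extdata/data",
    "R/model.R",
]:
    status = "ignored" if is_ignored(path) else "kept"
    print(f"{path}: {status}")
```

If the intent is to exclude `inst/extdata/data` from the build, this check suggests the leading `\./` may be worth a second look.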
--- a/.Rprofile
+++ b/.Rprofile
+source("renv/activate.R")
--- a/.gitignore
+++ b/.gitignore
+.Rproj.user
+.Rhistory
+.Rdata
+.httr-oauth
+.DS_Store
+.quarto
+.venv/
+inst/doc
+/doc/
+/Meta/
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
+Contributing to ubair
+
+All contributions to ubair are welcome! This document outlines the process and guidelines for contributing to the project.
+
+Coding Standards
+Aim for self-explanatory and readable code to minimize the need for additional documentation.
+
+Reporting Bugs and Feature Requests
+Use GitLab issues to report bugs or propose features. While we do not provide a specific template, please include sufficient details to help us understand and address the issue.
+
+Workflow and Version Control
+Pull requests (PRs) are welcome. You may either:
+•	Fork the repository and submit a PR from your fork.
+•	Work directly on a branch and submit a PR. 
+
+Testing and Validation
+New features must allow for reproducible results. Validate your code against existing data and workflows to ensure compatibility and consistency.
+
+Licensing
+Ensure that any new dependencies are compatible with the GPL-3.0-or-later license used in this project.
+
+General Guidelines
+Discuss significant changes (e.g., new features) in a GitLab issue before starting your work. When adding new files or modifying the folder structure, ensure compatibility with the existing structure. 
+
+Thank you for contributing to ubair! If you have any questions, feel free to contact the maintainers listed in the README file.
--- a/DESCRIPTION
+++ b/DESCRIPTION
+Package: ubair
+Title: Effects of External Conditions on Air Quality
+Version: 1.1.0
+Authors@R: c(
+    person("Imke", "Voss", , "imke.voss@uba.de", role = c("aut", "cre", "cph")),
+    person("Raphael", "Franke", , "raphael.franke@uba.de", role = c("aut", "cre"))
+    )
+Description: Statistical analysis of the effects of external conditions on air quality.
+License: GPL (>= 3) + file LICENSE
+Depends: 
+  R (>= 4.4.0),
+Encoding: UTF-8
+Roxygen: list(markdown = TRUE)
+RoxygenNote: 7.3.2
+Suggests: 
+    testthat (>= 3.0.0),
+    deepnet,
+    fastshap, 
+    treeshap,
+    shapviz,
+    knitr,
+    rmarkdown
+Config/testthat/edition: 3
+Imports: 
+    rlang,
+    data.table,
+    dplyr,
+    ggplot2,
+    forecast,
+    lubridate,
+    tidyr,
+    yaml,
+    ranger,
+    lightgbm
+LazyData: true
+LazyDataCompression: xz
+VignetteBuilder: knitr
--- a/Dockerfile
+++ b/Dockerfile
+FROM rocker/tidyverse:4.4.1
+
+# Set environment variables
+ENV R_LIBS_USER=/usr/local/lib/R/site-library
+ENV CRAN=https://cran.rstudio.com
+
+# Install system dependencies
+RUN apt-get update && apt-get install -y \
+    python3-pip \
+    libtool \
+    automake \
+    autoconf \
+    gcc \
+    g++ \
+    make \
+    qpdf \
+    && pip install 'dvc[s3]'
+
+# Set the working directory to match GitLab CI
+WORKDIR /builds/use-case-luft/ubair
+
+# Copy the project files including renv.lock
+COPY . .
+
+# Install R packages via renv and devtools
+RUN R -e "install.packages('renv', repos='$CRAN')" && \
+    R -e "renv::restore()" && \
+    R -e "install.packages('devtools', repos='$CRAN')"
--- a/LICENSE
+++ b/LICENSE
+DL-DE->BY-2.0
+Data licence Germany – attribution – version 2.0
+
+This licence refers to the sample_data_DESN025 provided in this publication.
+Provider of the data: Sächsisches Landesamt für Umwelt, Landwirtschaft und Geologie (LfULG)
+Alterations in the data: Codes for incorrect values have been removed.
+
+
+(1) Any use will be permitted provided it fulfils the requirements of this "Data licence Germany – attribution – Version 2.0".
+
+The data and meta-data provided may, for commercial and non-commercial use, in particular
+
+    - be copied, printed, presented, altered, processed and transmitted to third parties;
+    - be merged with own data and with the data of others and be combined to form new and independent datasets;
+    - be integrated in internal and external business processes, products and applications in public and non-public electronic networks.
+
+(2) The user must ensure that the source note contains the following information:
+
+    - the name of the provider,
+    - the annotation "Data licence Germany – attribution – Version 2.0" or "dl-de/by-2-0" referring to the licence text available at www.govdata.de/dl-de/by-2-0, and
+    - a reference to the dataset (URI).
+
+This applies only if the entity keeping the data provides the pieces of information 1-3 for the source note.
+
+(3) Changes, editing, new designs or other amendments must be marked as such in the source note.
+
+
+ 
+URL: http://www.govdata.de/dl-de/by-2-0
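
Clauses (2) and (3) above list what a source note has to contain. As a rough illustration only, and not legal guidance, a note with those items could be assembled as follows. The function name `build_source_note`, the layout of the note, and the dataset URI are assumptions made for this sketch; the URI in particular is a placeholder, not the real location of the data.

```python
def build_source_note(provider: str, dataset_uri: str, changes: str = "") -> str:
    """Compose a source note with the three items listed in clause (2)."""
    licence = "Data licence Germany – attribution – Version 2.0"
    licence_url = "www.govdata.de/dl-de/by-2-0"
    note = f"{provider}, {licence} ({licence_url}), {dataset_uri}"
    # Clause (3): alterations must be marked as such in the source note.
    if changes:
        note += f"; changes: {changes}"
    return note


note = build_source_note(
    provider="Sächsisches Landesamt für Umwelt, Landwirtschaft und Geologie (LfULG)",
    dataset_uri="https://example.org/sample_data_DESN025",  # placeholder URI
    changes="Codes for incorrect values have been removed.",
)
print(note)
```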
--- a/LICENSE.md
+++ b/LICENSE.md
+GNU General Public License v3.0 or later
+========================================
+
+_Version 3, 29 June 2007_  
+_Copyright © 2007 Free Software Foundation, Inc. &lt;<http://fsf.org/>&gt;_
+
+Everyone is permitted to copy and distribute verbatim copies of this license
+document, but changing it is not allowed.
+
+## Preamble
+
+The GNU General Public License is a free, copyleft license for software and other
+kinds of works.
+
+The licenses for most software and other practical works are designed to take away
+your freedom to share and change the works. By contrast, the GNU General Public
+License is intended to guarantee your freedom to share and change all versions of a
+program--to make sure it remains free software for all its users. We, the Free
+Software Foundation, use the GNU General Public License for most of our software; it
+applies also to any other work released this way by its authors. You can apply it to
+your programs, too.
+
+When we speak of free software, we are referring to freedom, not price. Our General
+Public Licenses are designed to make sure that you have the freedom to distribute
+copies of free software (and charge for them if you wish), that you receive source
+code or can get it if you want it, that you can change the software or use pieces of
+it in new free programs, and that you know you can do these things.
+
+To protect your rights, we need to prevent others from denying you these rights or
+asking you to surrender the rights. Therefore, you have certain responsibilities if
+you distribute copies of the software, or if you modify it: responsibilities to
+respect the freedom of others.
+
+For example, if you distribute copies of such a program, whether gratis or for a fee,
+you must pass on to the recipients the same freedoms that you received. You must make
+sure that they, too, receive or can get the source code. And you must show them these
+terms so they know their rights.
+
+Developers that use the GNU GPL protect your rights with two steps: **(1)** assert
+copyright on the software, and **(2)** offer you this License giving you legal permission
+to copy, distribute and/or modify it.
+
+For the developers' and authors' protection, the GPL clearly explains that there is
+no warranty for this free software. For both users' and authors' sake, the GPL
+requires that modified versions be marked as changed, so that their problems will not
+be attributed erroneously to authors of previous versions.
+
+Some devices are designed to deny users access to install or run modified versions of
+the software inside them, although the manufacturer can do so. This is fundamentally
+incompatible with the aim of protecting users' freedom to change the software. The
+systematic pattern of such abuse occurs in the area of products for individuals to
+use, which is precisely where it is most unacceptable. Therefore, we have designed
+this version of the GPL to prohibit the practice for those products. If such problems
+arise substantially in other domains, we stand ready to extend this provision to
+those domains in future versions of the GPL, as needed to protect the freedom of
+users.
+
+Finally, every program is threatened constantly by software patents. States should
+not allow patents to restrict development and use of software on general-purpose
+computers, but in those that do, we wish to avoid the special danger that patents
+applied to a free program could make it effectively proprietary. To prevent this, the
+GPL assures that patents cannot be used to render the program non-free.
+
+The precise terms and conditions for copying, distribution and modification follow.
+
+## TERMS AND CONDITIONS
+
+### 0. Definitions
+
+“This License” refers to version 3 of the GNU General Public License.
+
+“Copyright” also means copyright-like laws that apply to other kinds of
+works, such as semiconductor masks.
+
+“The Program” refers to any copyrightable work licensed under this
+License. Each licensee is addressed as “you”. “Licensees” and
+“recipients” may be individuals or organizations.
+
+To “modify” a work means to copy from or adapt all or part of the work in
+a fashion requiring copyright permission, other than the making of an exact copy. The
+resulting work is called a “modified version” of the earlier work or a
+work “based on” the earlier work.
+
+A “covered work” means either the unmodified Program or a work based on
+the Program.
+
+To “propagate” a work means to do anything with it that, without
+permission, would make you directly or secondarily liable for infringement under
+applicable copyright law, except executing it on a computer or modifying a private
+copy. Propagation includes copying, distribution (with or without modification),
+making available to the public, and in some countries other activities as well.
+
+To “convey” a work means any kind of propagation that enables other
+parties to make or receive copies. Mere interaction with a user through a computer
+network, with no transfer of a copy, is not conveying.
+
+An interactive user interface displays “Appropriate Legal Notices” to the
+extent that it includes a convenient and prominently visible feature that **(1)**
+displays an appropriate copyright notice, and **(2)** tells the user that there is no
+warranty for the work (except to the extent that warranties are provided), that
+licensees may convey the work under this License, and how to view a copy of this
+License. If the interface presents a list of user commands or options, such as a
+menu, a prominent item in the list meets this criterion.
+
+### 1. Source Code
+
+The “source code” for a work means the preferred form of the work for
+making modifications to it. “Object code” means any non-source form of a
+work.
+
+A “Standard Interface” means an interface that either is an official
+standard defined by a recognized standards body, or, in the case of interfaces
+specified for a particular programming language, one that is widely used among
+developers working in that language.
+
+The “System Libraries” of an executable work include anything, other than
+the work as a whole, that **(a)** is included in the normal form of packaging a Major
+Component, but which is not part of that Major Component, and **(b)** serves only to
+enable use of the work with that Major Component, or to implement a Standard
+Interface for which an implementation is available to the public in source code form.
+A “Major Component”, in this context, means a major essential component
+(kernel, window system, and so on) of the specific operating system (if any) on which
+the executable work runs, or a compiler used to produce the work, or an object code
+interpreter used to run it.
+
+The “Corresponding Source” for a work in object code form means all the
+source code needed to generate, install, and (for an executable work) run the object
+code and to modify the work, including scripts to control those activities. However,
+it does not include the work's System Libraries, or general-purpose tools or
+generally available free programs which are used unmodified in performing those
+activities but which are not part of the work. For example, Corresponding Source
+includes interface definition files associated with source files for the work, and
+the source code for shared libraries and dynamically linked subprograms that the work
+is specifically designed to require, such as by intimate data communication or
+control flow between those subprograms and other parts of the work.
+
+The Corresponding Source need not include anything that users can regenerate
+automatically from other parts of the Corresponding Source.
+
+The Corresponding Source for a work in source code form is that same work.
+
+### 2. Basic Permissions
+
+All rights granted under this License are granted for the term of copyright on the
+Program, and are irrevocable provided the stated conditions are met. This License
+explicitly affirms your unlimited permission to run the unmodified Program. The
+output from running a covered work is covered by this License only if the output,
+given its content, constitutes a covered work. This License acknowledges your rights
+of fair use or other equivalent, as provided by copyright law.
+
+You may make, run and propagate covered works that you do not convey, without
+conditions so long as your license otherwise remains in force. You may convey covered
+works to others for the sole purpose of having them make modifications exclusively
+for you, or provide you with facilities for running those works, provided that you
+comply with the terms of this License in conveying all material for which you do not
+control copyright. Those thus making or running the covered works for you must do so
+exclusively on your behalf, under your direction and control, on terms that prohibit
+them from making any copies of your copyrighted material outside their relationship
+with you.
+
+Conveying under any other circumstances is permitted solely under the conditions
+stated below. Sublicensing is not allowed; section 10 makes it unnecessary.
+
+### 3. Protecting Users' Legal Rights From Anti-Circumvention Law
+
+No covered work shall be deemed part of an effective technological measure under any
+applicable law fulfilling obligations under article 11 of the WIPO copyright treaty
+adopted on 20 December 1996, or similar laws prohibiting or restricting circumvention
+of such measures.
+
+When you convey a covered work, you waive any legal power to forbid circumvention of
+technological measures to the extent such circumvention is effected by exercising
+rights under this License with respect to the covered work, and you disclaim any
+intention to limit operation or modification of the work as a means of enforcing,
+against the work's users, your or third parties' legal rights to forbid circumvention
+of technological measures.
+
+### 4. Conveying Verbatim Copies
+
+You may convey verbatim copies of the Program's source code as you receive it, in any
+medium, provided that you conspicuously and appropriately publish on each copy an
+appropriate copyright notice; keep intact all notices stating that this License and
+any non-permissive terms added in accord with section 7 apply to the code; keep
+intact all notices of the absence of any warranty; and give all recipients a copy of
+this License along with the Program.
+
+You may charge any price or no price for each copy that you convey, and you may offer
+support or warranty protection for a fee.
+
+### 5. Conveying Modified Source Versions
+
+You may convey a work based on the Program, or the modifications to produce it from
+the Program, in the form of source code under the terms of section 4, provided that
+you also meet all of these conditions:
+
+* **a)** The work must carry prominent notices stating that you modified it, and giving a
+relevant date.
+* **b)** The work must carry prominent notices stating that it is released under this
+License and any conditions added under section 7. This requirement modifies the
+requirement in section 4 to “keep intact all notices”.
+* **c)** You must license the entire work, as a whole, under this License to anyone who
+comes into possession of a copy. This License will therefore apply, along with any
+applicable section 7 additional terms, to the whole of the work, and all its parts,
+regardless of how they are packaged. This License gives no permission to license the
+work in any other way, but it does not invalidate such permission if you have
+separately received it.
+* **d)** If the work has interactive user interfaces, each must display Appropriate Legal
+Notices; however, if the Program has interactive interfaces that do not display
+Appropriate Legal Notices, your work need not make them do so.
+
+A compilation of a covered work with other separate and independent works, which are
+not by their nature extensions of the covered work, and which are not combined with
+it such as to form a larger program, in or on a volume of a storage or distribution
+medium, is called an “aggregate” if the compilation and its resulting
+copyright are not used to limit the access or legal rights of the compilation's users
+beyond what the individual works permit. Inclusion of a covered work in an aggregate
+does not cause this License to apply to the other parts of the aggregate.
+
+### 6. Conveying Non-Source Forms
+
+You may convey a covered work in object code form under the terms of sections 4 and
+5, provided that you also convey the machine-readable Corresponding Source under the
+terms of this License, in one of these ways:
+
+* **a)** Convey the object code in, or embodied in, a physical product (including a
+physical distribution medium), accompanied by the Corresponding Source fixed on a
+durable physical medium customarily used for software interchange.
+* **b)** Convey the object code in, or embodied in, a physical product (including a
+physical distribution medium), accompanied by a written offer, valid for at least
+three years and valid for as long as you offer spare parts or customer support for
+that product model, to give anyone who possesses the object code either **(1)** a copy of
+the Corresponding Source for all the software in the product that is covered by this
+License, on a durable physical medium customarily used for software interchange, for
+a price no more than your reasonable cost of physically performing this conveying of
+source, or **(2)** access to copy the Corresponding Source from a network server at no
+charge.
+* **c)** Convey individual copies of the object code with a copy of the written offer to
+provide the Corresponding Source. This alternative is allowed only occasionally and
+noncommercially, and only if you received the object code with such an offer, in
+accord with subsection 6b.
+* **d)** Convey the object code by offering access from a designated place (gratis or for
+a charge), and offer equivalent access to the Corresponding Source in the same way
+through the same place at no further charge. You need not require recipients to copy
+the Corresponding Source along with the object code. If the place to copy the object
+code is a network server, the Corresponding Source may be on a different server
+(operated by you or a third party) that supports equivalent copying facilities,
+provided you maintain clear directions next to the object code saying where to find
+the Corresponding Source. Regardless of what server hosts the Corresponding Source,
+you remain obligated to ensure that it is available for as long as needed to satisfy
+these requirements.
+* **e)** Convey the object code using peer-to-peer transmission, provided you inform
+other peers where the object code and Corresponding Source of the work are being
+offered to the general public at no charge under subsection 6d.
+
+A separable portion of the object code, whose source code is excluded from the
+Corresponding Source as a System Library, need not be included in conveying the
+object code work.
+
+A “User Product” is either **(1)** a “consumer product”, which
+means any tangible personal property which is normally used for personal, family, or
+household purposes, or **(2)** anything designed or sold for incorporation into a
+dwelling. In determining whether a product is a consumer product, doubtful cases
+shall be resolved in favor of coverage. For a particular product received by a
+particular user, “normally used” refers to a typical or common use of
+that class of product, regardless of the status of the particular user or of the way
+in which the particular user actually uses, or expects or is expected to use, the
+product. A product is a consumer product regardless of whether the product has
+substantial commercial, industrial or non-consumer uses, unless such uses represent
+the only significant mode of use of the product.
+
+“Installation Information” for a User Product means any methods,
+procedures, authorization keys, or other information required to install and execute
+modified versions of a covered work in that User Product from a modified version of
+its Corresponding Source. The information must suffice to ensure that the continued
+functioning of the modified object code is in no case prevented or interfered with
+solely because modification has been made.
+
+If you convey an object code work under this section in, or with, or specifically for
+use in, a User Product, and the conveying occurs as part of a transaction in which
+the right of possession and use of the User Product is transferred to the recipient
+in perpetuity or for a fixed term (regardless of how the transaction is
+characterized), the Corresponding Source conveyed under this section must be
+accompanied by the Installation Information. But this requirement does not apply if
+neither you nor any third party retains the ability to install modified object code
+on the User Product (for example, the work has been installed in ROM).
+
+The requirement to provide Installation Information does not include a requirement to
+continue to provide support service, warranty, or updates for a work that has been
+modified or installed by the recipient, or for the User Product in which it has been
+modified or installed. Access to a network may be denied when the modification itself
+materially and adversely affects the operation of the network or violates the rules
+and protocols for communication across the network.
+
+Corresponding Source conveyed, and Installation Information provided, in accord with
+this section must be in a format that is publicly documented (and with an
+implementation available to the public in source code form), and must require no
+special password or key for unpacking, reading or copying.
+
+### 7. Additional Terms
+
+“Additional permissions” are terms that supplement the terms of this
+License by making exceptions from one or more of its conditions. Additional
+permissions that are applicable to the entire Program shall be treated as though they
+were included in this License, to the extent that they are valid under applicable
+law. If additional permissions apply only to part of the Program, that part may be
+used separately under those permissions, but the entire Program remains governed by
+this License without regard to the additional permissions.
+
+When you convey a copy of a covered work, you may at your option remove any
+additional permissions from that copy, or from any part of it. (Additional
+permissions may be written to require their own removal in certain cases when you
+modify the work.) You may place additional permissions on material, added by you to a
+covered work, for which you have or can give appropriate copyright permission.
+
+Notwithstanding any other provision of this License, for material you add to a
+covered work, you may (if authorized by the copyright holders of that material)
+supplement the terms of this License with terms:
+
+* **a)** Disclaiming warranty or limiting liability differently from the terms of
+sections 15 and 16 of this License; or
+* **b)** Requiring preservation of specified reasonable legal notices or author
+attributions in that material or in the Appropriate Legal Notices displayed by works
+containing it; or
+* **c)** Prohibiting misrepresentation of the origin of that material, or requiring that
+modified versions of such material be marked in reasonable ways as different from the
+original version; or
+* **d)** Limiting the use for publicity purposes of names of licensors or authors of the
+material; or
+* **e)** Declining to grant rights under trademark law for use of some trade names,
+trademarks, or service marks; or
+* **f)** Requiring indemnification of licensors and authors of that material by anyone
+who conveys the material (or modified versions of it) with contractual assumptions of
+liability to the recipient, for any liability that these contractual assumptions
+directly impose on those licensors and authors.
+
+All other non-permissive additional terms are considered “further
+restrictions” within the meaning of section 10. If the Program as you received
+it, or any part of it, contains a notice stating that it is governed by this License
+along with a term that is a further restriction, you may remove that term. If a
+license document contains a further restriction but permits relicensing or conveying
+under this License, you may add to a covered work material governed by the terms of
+that license document, provided that the further restriction does not survive such
+relicensing or conveying.
+
+If you add terms to a covered work in accord with this section, you must place, in
+the relevant source files, a statement of the additional terms that apply to those
+files, or a notice indicating where to find the applicable terms.
+
+Additional terms, permissive or non-permissive, may be stated in the form of a
+separately written license, or stated as exceptions; the above requirements apply
+either way.
+
+### 8. Termination
+
+You may not propagate or modify a covered work except as expressly provided under
+this License. Any attempt otherwise to propagate or modify it is void, and will
+automatically terminate your rights under this License (including any patent licenses
+granted under the third paragraph of section 11).
+
+However, if you cease all violation of this License, then your license from a
+particular copyright holder is reinstated **(a)** provisionally, unless and until the
+copyright holder explicitly and finally terminates your license, and **(b)** permanently,
+if the copyright holder fails to notify you of the violation by some reasonable means
+prior to 60 days after the cessation.
+
+Moreover, your license from a particular copyright holder is reinstated permanently
+if the copyright holder notifies you of the violation by some reasonable means, this
+is the first time you have received notice of violation of this License (for any
+work) from that copyright holder, and you cure the violation prior to 30 days after
+your receipt of the notice.
+
+Termination of your rights under this section does not terminate the licenses of
+parties who have received copies or rights from you under this License. If your
+rights have been terminated and not permanently reinstated, you do not qualify to
+receive new licenses for the same material under section 10.
+
+### 9. Acceptance Not Required for Having Copies
+
+You are not required to accept this License in order to receive or run a copy of the
+Program. Ancillary propagation of a covered work occurring solely as a consequence of
+using peer-to-peer transmission to receive a copy likewise does not require
+acceptance. However, nothing other than this License grants you permission to
+propagate or modify any covered work. These actions infringe copyright if you do not
+accept this License. Therefore, by modifying or propagating a covered work, you
+indicate your acceptance of this License to do so.
+
+### 10. Automatic Licensing of Downstream Recipients
+
+Each time you convey a covered work, the recipient automatically receives a license
+from the original licensors, to run, modify and propagate that work, subject to this
+License. You are not responsible for enforcing compliance by third parties with this
+License.
+
+An “entity transaction” is a transaction transferring control of an
+organization, or substantially all assets of one, or subdividing an organization, or
+merging organizations. If propagation of a covered work results from an entity
+transaction, each party to that transaction who receives a copy of the work also
+receives whatever licenses to the work the party's predecessor in interest had or
+could give under the previous paragraph, plus a right to possession of the
+Corresponding Source of the work from the predecessor in interest, if the predecessor
+has it or can get it with reasonable efforts.
+
+You may not impose any further restrictions on the exercise of the rights granted or
+affirmed under this License. For example, you may not impose a license fee, royalty,
+or other charge for exercise of rights granted under this License, and you may not
+initiate litigation (including a cross-claim or counterclaim in a lawsuit) alleging
+that any patent claim is infringed by making, using, selling, offering for sale, or
+importing the Program or any portion of it.
+
+### 11. Patents
+
+A “contributor” is a copyright holder who authorizes use under this
+License of the Program or a work on which the Program is based. The work thus
+licensed is called the contributor's “contributor version”.
+
+A contributor's “essential patent claims” are all patent claims owned or
+controlled by the contributor, whether already acquired or hereafter acquired, that
+would be infringed by some manner, permitted by this License, of making, using, or
+selling its contributor version, but do not include claims that would be infringed
+only as a consequence of further modification of the contributor version. For
+purposes of this definition, “control” includes the right to grant patent
+sublicenses in a manner consistent with the requirements of this License.
+
+Each contributor grants you a non-exclusive, worldwide, royalty-free patent license
+under the contributor's essential patent claims, to make, use, sell, offer for sale,
+import and otherwise run, modify and propagate the contents of its contributor
+version.
+
+In the following three paragraphs, a “patent license” is any express
+agreement or commitment, however denominated, not to enforce a patent (such as an
+express permission to practice a patent or covenant not to sue for patent
+infringement). To “grant” such a patent license to a party means to make
+such an agreement or commitment not to enforce a patent against the party.
+
+If you convey a covered work, knowingly relying on a patent license, and the
+Corresponding Source of the work is not available for anyone to copy, free of charge
+and under the terms of this License, through a publicly available network server or
+other readily accessible means, then you must either **(1)** cause the Corresponding
+Source to be so available, or **(2)** arrange to deprive yourself of the benefit of the
+patent license for this particular work, or **(3)** arrange, in a manner consistent with
+the requirements of this License, to extend the patent license to downstream
+recipients. “Knowingly relying” means you have actual knowledge that, but
+for the patent license, your conveying the covered work in a country, or your
+recipient's use of the covered work in a country, would infringe one or more
+identifiable patents in that country that you have reason to believe are valid.
+
+If, pursuant to or in connection with a single transaction or arrangement, you
+convey, or propagate by procuring conveyance of, a covered work, and grant a patent
+license to some of the parties receiving the covered work authorizing them to use,
+propagate, modify or convey a specific copy of the covered work, then the patent
+license you grant is automatically extended to all recipients of the covered work and
+works based on it.
+
+A patent license is “discriminatory” if it does not include within the
+scope of its coverage, prohibits the exercise of, or is conditioned on the
+non-exercise of one or more of the rights that are specifically granted under this
+License. You may not convey a covered work if you are a party to an arrangement with
+a third party that is in the business of distributing software, under which you make
+payment to the third party based on the extent of your activity of conveying the
+work, and under which the third party grants, to any of the parties who would receive
+the covered work from you, a discriminatory patent license **(a)** in connection with
+copies of the covered work conveyed by you (or copies made from those copies), or **(b)**
+primarily for and in connection with specific products or compilations that contain
+the covered work, unless you entered into that arrangement, or that patent license
+was granted, prior to 28 March 2007.
+
+Nothing in this License shall be construed as excluding or limiting any implied
+license or other defenses to infringement that may otherwise be available to you
+under applicable patent law.
+
+### 12. No Surrender of Others' Freedom
+
+If conditions are imposed on you (whether by court order, agreement or otherwise)
+that contradict the conditions of this License, they do not excuse you from the
+conditions of this License. If you cannot convey a covered work so as to satisfy
+simultaneously your obligations under this License and any other pertinent
+obligations, then as a consequence you may not convey it at all. For example, if you
+agree to terms that obligate you to collect a royalty for further conveying from
+those to whom you convey the Program, the only way you could satisfy both those terms
+and this License would be to refrain entirely from conveying the Program.
+
+### 13. Use with the GNU Affero General Public License
+
+Notwithstanding any other provision of this License, you have permission to link or
+combine any covered work with a work licensed under version 3 of the GNU Affero
+General Public License into a single combined work, and to convey the resulting work.
+The terms of this License will continue to apply to the part which is the covered
+work, but the special requirements of the GNU Affero General Public License, section
+13, concerning interaction through a network will apply to the combination as such.
+
+### 14. Revised Versions of this License
+
+The Free Software Foundation may publish revised and/or new versions of the GNU
+General Public License from time to time. Such new versions will be similar in spirit
+to the present version, but may differ in detail to address new problems or concerns.
+
+Each version is given a distinguishing version number. If the Program specifies that
+a certain numbered version of the GNU General Public License “or any later
+version” applies to it, you have the option of following the terms and
+conditions either of that numbered version or of any later version published by the
+Free Software Foundation. If the Program does not specify a version number of the GNU
+General Public License, you may choose any version ever published by the Free
+Software Foundation.
+
+If the Program specifies that a proxy can decide which future versions of the GNU
+General Public License can be used, that proxy's public statement of acceptance of a
+version permanently authorizes you to choose that version for the Program.
+
+Later license versions may give you additional or different permissions. However, no
+additional obligations are imposed on any author or copyright holder as a result of
+your choosing to follow a later version.
+
+### 15. Disclaimer of Warranty
+
+THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW.
+EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
+PROVIDE THE PROGRAM “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER
+EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE
+QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE
+DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
+
+### 16. Limitation of Liability
+
+IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY
+COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS THE PROGRAM AS
+PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL,
+INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE
+PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE
+OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE
+WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGES.
+
+### 17. Interpretation of Sections 15 and 16
+
+If the disclaimer of warranty and limitation of liability provided above cannot be
+given local legal effect according to their terms, reviewing courts shall apply local
+law that most closely approximates an absolute waiver of all civil liability in
+connection with the Program, unless a warranty or assumption of liability accompanies
+a copy of the Program in return for a fee.
+
+_END OF TERMS AND CONDITIONS_
+
+## How to Apply These Terms to Your New Programs
+
+If you develop a new program, and you want it to be of the greatest possible use to
+the public, the best way to achieve this is to make it free software which everyone
+can redistribute and change under these terms.
+
+To do so, attach the following notices to the program. It is safest to attach them
+to the start of each source file to most effectively state the exclusion of warranty;
+and each file should have at least the “copyright” line and a pointer to
+where the full notice is found.
+
+    <one line to give the program's name and a brief idea of what it does.>
+    Copyright (C) <year>  <name of author>
+
+    This program is free software: you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation, either version 3 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License
+    along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+Also add information on how to contact you by electronic and paper mail.
+
+If the program does terminal interaction, make it output a short notice like this
+when it starts in an interactive mode:
+
+    <program>  Copyright (C) <year>  <name of author>
+    This program comes with ABSOLUTELY NO WARRANTY; for details type 'show w'.
+    This is free software, and you are welcome to redistribute it
+    under certain conditions; type 'show c' for details.
+
+The hypothetical commands `show w` and `show c` should show the appropriate parts of
+the General Public License. Of course, your program's commands might be different;
+for a GUI interface, you would use an “about box”.
+
+You should also get your employer (if you work as a programmer) or school, if any, to
+sign a “copyright disclaimer” for the program, if necessary. For more
+information on this, and how to apply and follow the GNU GPL, see
+&lt;<http://www.gnu.org/licenses/>&gt;.
+
+The GNU General Public License does not permit incorporating your program into
+proprietary programs. If your program is a subroutine library, you may consider it
+more useful to permit linking proprietary applications with the library. If this is
+what you want to do, use the GNU Lesser General Public License instead of this
+License. But first, please read
+&lt;<http://www.gnu.org/philosophy/why-not-lgpl.html>&gt;.
--- a/NAMESPACE
+++ b/NAMESPACE
+# Generated by roxygen2: do not edit by hand
+
+export(calc_performance_metrics)
+export(calc_summary_statistics)
+export(clean_data)
+export(copy_default_params)
+export(detrend)
+export(estimate_effect_size)
+export(get_meteo_available)
+export(load_params)
+export(load_uba_data_from_dir)
+export(plot_counterfactual)
+export(plot_station_measurements)
+export(prepare_data_for_modelling)
+export(rescale_predictions)
+export(retrend_predictions)
+export(run_counterfactual)
+export(run_dynamic_regression)
+export(run_fnn)
+export(run_lightgbm)
+export(run_rf)
+export(scale_data)
+export(split_data_counterfactual)
+import(lightgbm)
+import(ranger)
+import(splines)
+importFrom(data.table,":=")
+importFrom(data.table,.SD)
+importFrom(data.table,melt)
+importFrom(dplyr,"%>%")
+importFrom(dplyr,across)
+importFrom(dplyr,mutate)
+importFrom(dplyr,select)
+importFrom(dplyr,where)
+importFrom(forecast,BoxCox.lambda)
+importFrom(forecast,auto.arima)
+importFrom(forecast,forecast)
+importFrom(ggplot2,aes)
+importFrom(ggplot2,facet_wrap)
+importFrom(ggplot2,geom_line)
+importFrom(ggplot2,geom_ribbon)
+importFrom(ggplot2,geom_smooth)
+importFrom(ggplot2,geom_vline)
+importFrom(ggplot2,ggplot)
+importFrom(ggplot2,labs)
+importFrom(ggplot2,theme_bw)
+importFrom(lubridate,ymd_h)
+importFrom(lubridate,ymd_hm)
+importFrom(stats,lm)
+importFrom(stats,predict)
+importFrom(stats,spline)
+importFrom(tidyr,gather)
+importFrom(yaml,read_yaml)
--- a/NEWS.md
+++ b/NEWS.md
+## ubair 1.1.0
+
+**Initial CRAN Submission**\
+**Public Release**
+
+-   First public release of `ubair` on CRAN.
+-   Previous internal versions (pre-CRAN) were used privately or shared within a specific group.
+-   The package is intended for use with air quality station measurement data. For Germany, these can be obtained via the [Air Quality Data API (UBA)](https://www.umweltbundesamt.de/daten/luft/luftdaten/doc) or on request from immission@uba.de.
+-   Includes the following features:
+    -   **Counterfactual analysis**: Apply counterfactual analysis to assess external effects on air quality based on meteorological data, with support for four models: LightGBM, Random Forest, Feedforward Neural Network, and Dynamic Regression.
+    -   **Data detrending and retrending**: Enable users to remove and reintroduce trends in data as needed.
+    -   **Visualization and evaluation**: Provide tools for visualizing and evaluating the results of counterfactual analyses.
+    -   **Data preprocessing**: Preprocess data to prepare it for counterfactual analysis.
--- a/R/counterfactual_model.R
+++ b/R/counterfactual_model.R
+#' Full counterfactual simulation run
+#'
+#' Chains detrending, training of a selected model, prediction and retrending together
+#' for ease of use. See documentation of individual functions for details.
+#' @param split_data List of two named dataframes called train and apply
+#' @param params A list of parameters that define the following:
+#' \describe{
+#'   \item{meteo_variables}{A character vector specifying the names of the
+#'   meteorological variables used as inputs.}
+#'   \item{model}{A list of hyperparameters for training the chosen model. Name of this list
+#'   and its parameters depend on the chosen models. See [ubair::run_dynamic_regression()],
+#'   [ubair::run_lightgbm()], [ubair::run_rf()] and [ubair::run_fnn()] functions for details}
+#'   }
+#' @param detrending_function String which defines type of trend to remove.
+#' Options are "linear", "quadratic", "exponential", "spline", "none". See [ubair::detrend()]
+#' and [ubair::retrend_predictions()] for details.
+#' @param model_type String to decide which model to use. Current options are random
+#' forest "rf", gradient boosted decision trees "lightgbm", "dynamic_regression" and feedforward neural network "fnn".
+#' @param alpha Confidence level of the prediction interval between 0 and 1.
+#' @param log_transform If TRUE, uses log transformation during detrending and
+#' retrending. For details see [ubair::detrend()] documentation
+#' @param calc_shaps Boolean value. If TRUE, calculate SHAP values for the
+#' method used and format them so they can be visualised with \code{\link[shapviz:sv_importance]{shapviz::sv_importance()}} and
+#' \code{\link[shapviz:sv_dependence]{shapviz::sv_dependence()}}.
+#' The SHAP values are generated for a subset (or all, depending on the size of the dataset) of the
+#' test data.
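+#' @return List with the retrended predictions as a data frame (`prediction`), the trained
+#' model (`model`) and SHAP values (`importance`, `NULL` unless `calc_shaps = TRUE`)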
+#' @examples
+#' \dontrun{
+#' split_data <- split_data_counterfactual(
+#'   dt_prepared, training_start,
+#'   training_end, application_start, application_end
+#' )
+#' res <- run_counterfactual(split_data, params, detrending_function = "linear")
+#' prediction <- res$prediction
+#' random_forest_model <- res$model
+#' }
+#' @export
+run_counterfactual <- function(split_data,
+                               params,
+                               detrending_function = "none",
+                               model_type = "rf",
+                               alpha = 0.9,
+                               log_transform = FALSE,
+                               calc_shaps = FALSE) {
+  variables <- c("day_julian", "weekday", "hour", params$meteo_variables)
+  detrended_list <- detrend(
+    split_data,
+    mode = detrending_function,
+    log_transform = log_transform
+  )
+  trend <- detrended_list$model
+  detrended_train <- detrended_list$train
+  detrended_apply <- detrended_list$apply
+
+  if (model_type == "rf") {
+    detrended_train <- detrended_train %>% select(value, !!variables)
+    res <- run_rf(
+      train = detrended_train,
+      test = detrended_apply,
+      model_params = params$random_forest,
+      alpha = alpha,
+      calc_shaps = calc_shaps
+    )
+  } else if (model_type == "lightgbm") {
+    detrended_train <- detrended_train %>%
+      select(value, dplyr::any_of(variables))
+    res <- run_lightgbm(
+      train = detrended_train,
+      test = detrended_apply,
+      model_params = params$lightgbm,
+      alpha = alpha,
+      calc_shaps = calc_shaps
+    )
+  } else if (model_type == "dynamic_regression") {
+    res <- run_dynamic_regression(
+      train = detrended_train,
+      test = detrended_apply,
+      params = params,
+      alpha = alpha,
+      calc_shaps = calc_shaps
+    )
+  } else if (model_type == "fnn") {
+    res <- suppressMessages(run_fnn(
+      train = detrended_train,
+      test = detrended_apply,
+      params = params,
+      calc_shaps = calc_shaps
+    ))
+  } else {
+    stop("Wrong model_type. Select one of 'rf', 'lightgbm', 'dynamic_regression', 'fnn'.")
+  }
+  retrended_predictions <- retrend_predictions(res$dt_predictions,
+    trend,
+    log_transform = log_transform
+  )
+  list(prediction = retrended_predictions, model = res$model, importance = res$importance)
+}
+
+#' Run the dynamic regression model
+#'
+#' This function trains a dynamic regression model with Fourier-transformed temporal features
+#' and meteorological variables as external regressors on the
+#' specified training dataset and makes predictions on the test dataset in a
+#' counterfactual scenario. This is referred to as a dynamic regression model in
+#' [Forecasting: Principles and Practice, Chapter 10 - Dynamic regression models](https://otexts.com/fpp3/dynamic.html)
+#'
+#' Note: Call this function directly only if you use your own data pipeline.
+#' Otherwise use [ubair::run_counterfactual()], which calls this function for you.
+#' @param train Dataframe of train data as returned by the [ubair::split_data_counterfactual()]
+#' function.
+#' @param test Dataframe of test data as returned by the [ubair::split_data_counterfactual()]
+#' function.
+#' @param params list of hyperparameters to use in dynamic_regression call. Only uses ntrain to specify
+#' the number of data points to use for training. Default is 8760 which results in
+#' 1 year of hourly data
+#' @param alpha Confidence level of the prediction interval between 0 and 1.
+#' @param calc_shaps Boolean value. If TRUE, calculate SHAP values for the
+#' method used and format them so they can be visualised with \code{\link[shapviz:sv_importance]{shapviz::sv_importance()}} and
+#' \code{\link[shapviz:sv_dependence]{shapviz::sv_dependence()}}.
+#' The SHAP values are generated for a subset (or all, depending on the size of the dataset) of the
+#' test data.
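+#' @return List with data frame of predictions (`dt_predictions`), the fitted model
+#' (`model`) and SHAP values (`importance`, `NULL` unless `calc_shaps = TRUE`)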
+#' @importFrom forecast BoxCox.lambda auto.arima forecast
+#' @export
+run_dynamic_regression <- function(train,
+                                   test,
+                                   params,
+                                   alpha,
+                                   calc_shaps) {
+  train <- train %>% dplyr::filter(date < test$date[1])
+  # 24 * 365 = 8760 (1 year of training data)
+  ntrain <- ifelse(is.null(params$dynamic_regression$ntrain), 8760, params$dynamic_regression$ntrain)
+  train <- utils::tail(train, ntrain)
+  message(
+    "Using data for dynamic regression training from ", min(train$date),
+    " to ", max(train$date),
+    ". Too long training series can lead to worse performance. ",
+    "Adjust this via the dynamic_regression$ntrain hyperparameter."
+  )
+  train_transformed <- .transform_input(train, params)
+  test_transformed <- .transform_input(test, params)
+  scale_result <- scale_data(
+    train_data = train_transformed,
+    apply_data = test_transformed
+  )
+  xreg <- scale_result$train %>%
+    dplyr::select(-value) %>%
+    as.matrix()
+  xreg_pred <- scale_result$apply %>%
+    dplyr::select(-value) %>%
+    as.matrix()
+  y <- scale_result$train$value %>% stats::ts()
+  model <- forecast::auto.arima(y,
+    d = 0, xreg = xreg,
+    seasonal = FALSE,
+    trace = FALSE,
+    allowdrift = TRUE,
+    allowmean = TRUE,
+    lambda = NULL,
+    biasadj = TRUE
+  )
+  pred <- forecast::forecast(
+    object = model,
+    xreg = xreg_pred,
+    level = alpha,
+    lambda = NULL,
+    biasadj = TRUE
+  )
+  test$prediction <- pred$mean %>% as.numeric()
+  test$prediction_lower <- pred$lower[, 1] %>% as.numeric()
+  test$prediction_upper <- pred$upper[, 1] %>% as.numeric()
+  dt_predictions <- rescale_predictions(
+    scale_result = scale_result,
+    dt_predictions = test
+  )
+  if (calc_shaps) {
+    rlang::check_installed(c("fastshap", "shapviz"),
+      reason = "calculate shap values for dynamic regression"
+    )
+    shap_vals <- fastshap::explain(
+      model,
+      X = xreg,
+      nsim = 100,
+      newdata = xreg_pred,
+      pred_wrapper = function(object, newdata) {
+        fc_prediction <- forecast::forecast(
+          object = object,
+          xreg = newdata,
+          lambda = NULL,
+          biasadj = TRUE
+        )
+        return(fc_prediction$mean %>% as.numeric())
+      },
+      shap_only = FALSE
+    )
+    shp <- shapviz::shapviz(shap_vals)
+  } else {
+    shp <- NULL
+  }
+  list(dt_predictions = dt_predictions, model = model, importance = shp)
+}
+
+#' Run random forest model with ranger
+#'
+#' This function trains a random forest model (ranger) on the
+#' specified training dataset and makes predictions on the test dataset in a
+#' counterfactual scenario. The model uses meteorological variables and temporal features.
+#'
+#' Note: Call this function directly only if you use your own data pipeline.
+#' Otherwise use [ubair::run_counterfactual()], which calls this function for you.
+#' @param train Dataframe of train data as returned by the [ubair::split_data_counterfactual()]
+#' function.
+#' @param test Dataframe of test data as returned by the [ubair::split_data_counterfactual()]
+#' function.
+#' @param model_params list of hyperparameters to use in ranger call. See \code{\link[ranger:ranger]{ranger::ranger()}} for options.
+#' @param alpha Confidence level of the prediction interval between 0 and 1.
+#' @param calc_shaps Boolean value. If TRUE, calculate SHAP values for the
+#' method used and format them so they can be visualised with \code{\link[shapviz:sv_importance]{shapviz::sv_importance()}} and
+#' \code{\link[shapviz:sv_dependence]{shapviz::sv_dependence()}}.
+#' The SHAP values are generated for a subset (or all, depending on the size of the dataset) of the
+#' test data.
+#' @return List with data frame of predictions and model
+#' @import ranger
+#' @export
+run_rf <- function(train, test, model_params, alpha, calc_shaps) {
+  function_parameters <- list(
+    y = train$value,
+    x = train %>% select(-value),
+    importance = "none",
+    splitrule = "variance",
+    keep.inbag = TRUE,
+    quantreg = TRUE
+  )
+  function_parameters <- append(function_parameters, model_params)
+  model <- do.call(ranger::ranger, args = function_parameters)
+  quantiles <- data.table::as.data.table(predict(model,
+    test,
+    type = "quantiles",
+    quantiles = c((1 - alpha) / 2, (1 + alpha) / 2)
+  )$predictions)
+  if (calc_shaps) {
+    rlang::check_installed(c("treeshap", "shapviz"),
+      reason = "calculate shap values for random forest"
+    )
+    shap_subset <- test[sample(nrow(test), min(nrow(test), 1000)), ] %>% select(-value)
+    unified <- treeshap::ranger.unify(model, data.matrix(train %>% select(-value)))
+    treeshap_vals <- treeshap::treeshap(unified, data.matrix(shap_subset), verbose = FALSE)
+    shp <- shapviz::shapviz(treeshap_vals, X_pred = data.matrix(shap_subset), X = shap_subset)
+  } else {
+    shp <- NULL
+  }
+  test[, "prediction" := predict(model, test)$predictions]
+  test[, c("prediction_lower", "prediction_upper") := quantiles]
+  list(dt_predictions = test, model = model, importance = shp)
+}
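Reviewer note: `run_rf()` above and `run_lightgbm()` below both turn the confidence level `alpha` into a symmetric pair of quantile levels, `(1 - alpha) / 2` and `(1 + alpha) / 2`. The following is a minimal base-R sketch of that mapping. The helper name `interval_quantiles()` and the simulated values are assumptions for illustration only and are not part of the package.

```r
# Hypothetical helper (not part of ubair): map a confidence level to the
# lower and upper quantile levels of a symmetric central interval.
interval_quantiles <- function(alpha) {
  stopifnot(alpha > 0, alpha < 1)
  c(lower = (1 - alpha) / 2, upper = (1 + alpha) / 2)
}

q <- interval_quantiles(0.9)
# The two levels are symmetric around 0.5 and enclose a share `alpha`.
print(q)

# Illustration on simulated values: empirical bounds of the central interval
# and the share of values that falls between them.
set.seed(1)
x <- rnorm(1000)
bounds <- stats::quantile(x, probs = q, names = FALSE)
coverage <- mean(x >= bounds[1] & x <= bounds[2])
print(coverage)
```

With `alpha = 0.9` the models are therefore asked for the 5% and 95% quantiles, so roughly 90% of comparable observations should fall inside the reported band if the quantile estimates are well calibrated.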
+
+#' Run gradient boosting model with lightgbm
+#'
+#' This function trains a gradient boosting model (lightgbm) on the
+#' specified training dataset and makes predictions on the test dataset in a
+#' counterfactual scenario. The model uses meteorological variables and temporal features.
+#'
+#' Note: Call this function directly only if you use your own data pipeline.
+#' Otherwise use [ubair::run_counterfactual()], which calls this function for you.
+#' @param train Dataframe of train data as returned by the [ubair::split_data_counterfactual()]
+#' function.
+#' @param test Dataframe of test data as returned by the [ubair::split_data_counterfactual()]
+#' function.
+#' @param model_params list of hyperparameters to use in lgb.train call.
+#' See the `params` argument of \code{\link[lightgbm:lgb.train]{lightgbm::lgb.train()}} for details.
+#' @param alpha Confidence level of the prediction interval between 0 and 1.
+#' @param calc_shaps Boolean value. If TRUE, calculate SHAP values for the
+#' method used and format them so they can be visualised with \code{\link[shapviz:sv_importance]{shapviz::sv_importance()}} and
+#' \code{\link[shapviz:sv_dependence]{shapviz::sv_dependence()}}.
+#' The SHAP values are generated for a subset (or all, depending on the size of the dataset) of the
+#' test data.
+#' @return List with data frame of predictions and model
+#' @import lightgbm
+#' @export
+run_lightgbm <- function(train, test, model_params, alpha, calc_shaps) {
+  dtrain <- lgb.Dataset(
+    data = data.matrix(train %>% select(-value)),
+    label = train$value
+  )
+  model_mean <- lgb.train(
+    params = model_params[names(model_params) != "nrounds"],
+    data = dtrain,
+    nrounds = model_params$nrounds
+  )
+  params_lower <- append(
+    model_params[names(model_params) != "nrounds"],
+    list(
+      "objective" = "quantile",
+      "alpha" = (1 - alpha) / 2,
+      "verbosity" = 0
+    )
+  )
+  model_lower <- lgb.train(
+    params = params_lower,
+    data = dtrain,
+    nrounds = model_params$nrounds
+  )
+  params_upper <- append(
+    model_params[names(model_params) != "nrounds"],
+    list(
+      "objective" = "quantile",
+      "alpha" = (1 + alpha) / 2,
+      "verbosity" = 0
+    )
+  )
+  model_upper <- lgb.train(
+    params = params_upper,
+    data = dtrain,
+    nrounds = model_params$nrounds
+  )
+  dapply <- test %>% select(!!colnames(train))
+  dapply <- data.matrix(dapply %>% select(-value))
+  dt_predictions <- test
+  dt_predictions$prediction <- predict(model_mean, dapply)
+  dt_predictions$prediction_lower <- predict(model_lower, dapply)
+  dt_predictions$prediction_upper <- predict(model_upper, dapply)
+  if (calc_shaps) {
+    rlang::check_installed(c("shapviz"),
+      reason = "calculate shap values for lightgbm"
+    )
+    shp <- shapviz::shapviz(model_mean, X_pred = dapply, X = dapply)
+  } else {
+    shp <- NULL
+  }
+  list(dt_predictions = dt_predictions, model = model_mean, importance = shp)
+}
+
+#' Train a Feedforward Neural Network (FNN) in a Counterfactual Scenario.
+#'
+#' Trains a feedforward neural network (FNN) model on the
+#' specified training dataset and makes predictions on the test dataset in a
+#' counterfactual scenario. The model uses meteorological variables and
+#' sin/cosine-transformed features. Scales the data before training and rescales
+#' predictions, as the model does not converge with unscaled data.
+#'
+#' This function provides flexibility for users with their own data pipelines
+#' or workflows. For a simplified pipeline, consider using
+#' \code{\link[ubair:run_counterfactual]{run_counterfactual()}}.
+#'
+#' Experiment with hyperparameters such as \code{learning_rate},
+#' \code{batchsize}, \code{hidden_layers}, and \code{num_epochs} to improve
+#' performance.
+#'
+#' Warning: Using many or large hidden layers in combination with a high number
+#' of epochs can lead to long training times.
+#'
+#' @param train A data frame or tibble containing the training dataset,
+#' including the target variable (`value`)
+#' and meteorological variables specified in `params$meteo_variables`.
+#' @param test A data frame or tibble containing the test dataset on which
+#' predictions will be made,
+#' using the same meteorological variables as in the training dataset.
+#' @param params A list of parameters that define the following:
+#' \describe{
+#'   \item{meteo_variables}{A character vector specifying the names of the
+#'   meteorological variables used as inputs.}
+#'   \item{fnn}{A list of hyperparameters for training the feedforward neural
+#'   network, including:
+#'     \itemize{
+#'       \item \code{activation_fun}: The activation function for the hidden
+#'       layers (e.g., "sigmoid", "tanh").
+#'       \item \code{momentum}: The momentum factor for training.
+#'       \item \code{learningrate_scale}: Factor for adjusting learning rate.
+#'       \item \code{output_fun}: The activation function for the output layer
+#'       \item \code{batchsize}: The size of the batches during training.
+#'       \item \code{hidden_dropout}: Dropout rate for the hidden layers to
+#'       prevent overfitting.
+#'       \item \code{visible_dropout}: Dropout rate for the input layer.
+#'       \item \code{hidden_layers}: A vector specifying the number of neurons
+#'       in each hidden layer.
+#'       \item \code{num_epochs}: Number of epochs (iterations) for training.
+#'       \item \code{learning_rate}: Initial learning rate.
+#'     }
+#'   }
+#' }
+#' @param calc_shaps Boolean value. If TRUE, calculate SHAP values for the
+#' method used and format them so they can be visualised with
+#' \code{\link[shapviz:sv_importance]{shapviz::sv_importance()}} and
+#' \code{\link[shapviz:sv_dependence]{shapviz::sv_dependence()}}.
+#' The SHAP values are generated for a subset (or all, depending on the size of the dataset) of the
+#' test data.
+#' @return A list with three elements:
+#' \describe{
+#'   \item{\code{dt_predictions}}{A data frame containing the test data along
+#' with the predicted values:
+#'     \describe{
+#'       \item{\code{prediction}}{The predicted values from the FNN model.}
+#'       \item{\code{prediction_lower}}{The same predicted values, as no
+#'       quantile model is available yet for FNN.}
+#'       \item{\code{prediction_upper}}{The same predicted values, as no
+#'       quantile model is available yet for FNN.}
+#'     }
+#'   }
+#'   \item{\code{model}}{The trained FNN model object from the
+#'   \code{deepnet::nn.train()} function.}
+#'   \item{\code{importance}}{SHAP importance values (if
+#'   \code{calc_shaps = TRUE}). Otherwise, `NULL`.}
+#' }
+#' @export
+run_fnn <- function(train, test, params, calc_shaps) {
+  rlang::check_installed(c("deepnet"), reason = "to run a fnn")
+
+  train_transformed <- .transform_input(train, params)
+  test_transformed <- .transform_input(test, params)
+
+  scale_result <- scale_data(
+    train_data = train_transformed,
+    apply_data = test_transformed
+  )
+
+  train_matrix <- scale_result$train %>%
+    dplyr::select(-value) %>%
+    as.matrix()
+  test_matrix <- scale_result$apply %>%
+    dplyr::select(-value) %>%
+    as.matrix()
+  target_vector <- scale_result$train$value
+
+  fnn_function_parameter <- append(
+    list(x = train_matrix, y = target_vector),
+    params$fnn
+  )
+  model <- do.call(deepnet::nn.train, args = fnn_function_parameter)
+
+  prediction <- deepnet::nn.predict(model, test_matrix)
+  # no quantile model yet for feedforward neural network
+  test <- test %>%
+    mutate(
+      prediction = prediction,
+      prediction_lower = prediction,
+      prediction_upper = prediction
+    )
+
+  test <- rescale_predictions(
+    scale_result = scale_result,
+    dt_predictions = test
+  )
+  if (calc_shaps) {
+    rlang::check_installed(c("fastshap", "shapviz"),
+      reason = "calculate shap values for fnn"
+    )
+    shap_vals <- fastshap::explain(
+      model,
+      X = train_matrix,
+      nsim = 50,
+      newdata = test_matrix,
+      pred_wrapper = function(object, newdata) {
+        deepnet::nn.predict(
+          nn = object,
+          x = newdata
+        ) %>% as.numeric()
+      },
+      shap_only = FALSE
+    )
+    shp <- shapviz::shapviz(shap_vals)
+  } else {
+    shp <- NULL
+  }
+  list(dt_predictions = test, model = model, importance = shp)
+}
+
+#' Make Fourier features out of hour, weekday and day of the year
+#'
+#' @return Numeric matrix of sin and cos transformed temporal features of dimension
+#' n x 6.
+#' @noRd
+.make_fourier_features <- function(df) {
+  df$weekday <- df$weekday %>% as.numeric()
+  fourier_list <- list(
+    sin((24 * (df$day_julian) + df$hour) / (24 * 365) * 2 * pi),
+    cos((24 * (df$day_julian) + df$hour) / (24 * 365) * 2 * pi),
+    sin((24 * (df$weekday - 1) + df$hour) / (24 * 7) * 2 * pi),
+    cos((24 * (df$weekday - 1) + df$hour) / (24 * 7) * 2 * pi),
+    sin(df$hour / 24 * 2 * pi),
+    cos(df$hour / 24 * 2 * pi)
+  )
+  res_matrix <- matrix(unlist(fourier_list), ncol = 6, byrow = FALSE)
+  colnames(res_matrix) <- c(
+    "sin_year", "cos_year", "sin_week", "cos_week",
+    "sin_hour", "cos_hour"
+  )
+  res_matrix
+}
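Reviewer note: the helper above encodes each cyclic time variable as a sine/cosine pair, so that the end of a cycle sits next to its start (hour 23 is as close to hour 0 as hour 1 is). The following base-R sketch shows this for the hour-of-day pair only. The function name `encode_hour()` is an assumption for illustration and does not exist in the package.

```r
# Hypothetical helper (not part of ubair): sine/cosine encoding of the hour
# of day, mirroring the sin_hour / cos_hour columns built above.
encode_hour <- function(hour) {
  angle <- hour / 24 * 2 * pi
  cbind(sin_hour = sin(angle), cos_hour = cos(angle))
}

enc <- encode_hour(c(0, 1, 12, 23))

# Euclidean distances between encoded hours: 23 and 0 are as close as 0 and 1,
# while 0 and 12 lie on opposite sides of the unit circle.
dist_23_0 <- sqrt(sum((enc[4, ] - enc[1, ])^2))
dist_0_1 <- sqrt(sum((enc[2, ] - enc[1, ])^2))
dist_0_12 <- sqrt(sum((enc[3, ] - enc[1, ])^2))
print(c(dist_23_0 = dist_23_0, dist_0_1 = dist_0_1, dist_0_12 = dist_0_12))
```

A raw `hour` column would instead place 23 and 0 at opposite ends of its range, which is why the dynamic regression and FNN models receive the transformed features.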
+
+#' Transform Data to cyclic features for fnn/dynamic regression
+#'
+#' @return Numeric matrix combining the selected meteorological variables
+#' (transformed wind vectors as U and V components) and the Fourier features.
+#' @noRd
+.transform_input <- function(input_data, params) {
+  input_meteo <- input_data %>% select(!!params$meteo_variables)
+  if ("WIG" %in% params$meteo_variables && "WIR" %in% params$meteo_variables) {
+    input_meteo <- .make_wind_vectors(input_meteo) %>% select(-WIG, -WIR)
+  }
+  fourier_features <- .make_fourier_features(input_data)
+  transformed <- cbind(input_meteo, fourier_features, input_data[, "value"])
+  transformed
+}
+
+#' Creates wind vectors from direction and speed
+#'
+#' Takes a dataframe with columns WIG (wind speed) and
+#' WIR (wind direction in degrees 0 to 360) and creates wind vectors with U and
+#' V component
+#' @return Data frame of the same format as the input with additional columns
+#' "WIND_U" and "WIND_V".
+#' @noRd
+.make_wind_vectors <- function(dt_prepared) {
+  dt_prepared <- dt_prepared %>% mutate(
+    "WIND_U" = sin(pi * WIR / 180) * WIG,
+    "WIND_V" = cos(pi * WIR / 180) * WIG
+  )
+  dt_prepared
+}
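As a rough illustration of the wind transformation, the following base-R sketch applies the same two formulas to a toy data frame. The input values are made up for the example; only the column names `WIG`, `WIR`, `WIND_U` and `WIND_V` come from the code above.

```r
# Toy input (hypothetical values): wind speed WIG and direction WIR in degrees
wind <- data.frame(WIG = c(2, 2, 3), WIR = c(0, 90, 180))

# Same formulas as in .make_wind_vectors(), written in base R
wind$WIND_U <- sin(pi * wind$WIR / 180) * wind$WIG
wind$WIND_V <- cos(pi * wind$WIR / 180) * wind$WIG

# The length of the (U, V) vector recovers the original wind speed
speed_back <- sqrt(wind$WIND_U^2 + wind$WIND_V^2)
```

Unlike the raw direction in degrees, the two components are continuous across the 360/0 degree boundary, which is presumably why they replace `WIG` and `WIR` in `.transform_input()`.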
--- a/R/data_cleaning.R
+++ b/R/data_cleaning.R
+#' Clean and Optionally Aggregate Environmental Data
+#'
+#' Cleans a data table of environmental measurements by filtering for a specific
+#' station, removing duplicates, and optionally aggregating the data on a daily
+#' basis using the mean.
+#'
+#' @param env_data A data table in long format.
+#' Must include columns:
+#' \describe{
+#'   \item{Station}{Station identifier for the data.}
+#'   \item{Komponente}{Measured environmental component e.g. temperature, NO2.}
+#'   \item{Wert}{Measured value.}
+#'   \item{date}{Timestamp as Date-Time object (`YYYY-MM-DD HH:MM:SS` format).}
+#'   \item{Komponente_txt}{Textual description of the component.}
+#' }
+#' @param station Character. Name of the station to filter by.
+#' @param aggregate_daily Logical. If `TRUE`, aggregates data to daily mean values. Default is `FALSE`.
+#' @return A `data.table`:
+#' \itemize{
+#'   \item If `aggregate_daily = TRUE`: Contains columns for station, component, day, year,
+#'         and the daily mean value of the measurements.
+#'   \item If `aggregate_daily = FALSE`: Contains cleaned data with duplicates removed.
+#' }
+#' @details Duplicate rows (by `date`, `Komponente`, and `Station`) are removed. A warning is issued
+#' if duplicates are found.
+#' @examples
+#' # Example data
+#' env_data <- data.table::data.table(
+#'   Station = c("DENW094", "DENW094", "DENW006", "DENW094"),
+#'   Komponente = c("NO2", "O3", "NO2", "NO2"),
+#'   Wert = c(45, 30, 50, 40),
+#'   date = as.POSIXct(c(
+#'     "2023-01-01 08:00:00", "2023-01-01 09:00:00",
+#'     "2023-01-01 08:00:00", "2023-01-02 08:00:00"
+#'   )),
+#'   Komponente_txt = c(
+#'     "Nitrogen Dioxide", "Ozone", "Nitrogen Dioxide", "Nitrogen Dioxide"
+#'   )
+#' )
+#'
+#' # Clean data for station DENW094 without aggregation
+#' cleaned_data <- clean_data(env_data, station = "DENW094", aggregate_daily = FALSE)
+#' print(cleaned_data)
+#' @export
+clean_data <- function(env_data, station, aggregate_daily = FALSE) {
+  env_data <- .add_year_column(env_data)
+  env_data <- dplyr::filter(env_data, Station == station)
+  env_data_unique <- unique(env_data, by = c("date", "Komponente", "Station"))
+
+  if (nrow(env_data_unique) < nrow(env_data)) {
+    warning(sprintf(
+      "%d duplicate row(s) were removed.",
+      nrow(env_data) - nrow(env_data_unique)
+    ))
+  }
+  if (aggregate_daily) {
+    env_data_unique <- .aggregate_data(env_data_unique)
+  }
+
+  env_data_unique
+}
+
+#' Get Available Meteorological Components
+#'
+#' Identifies unique meteorological components from the provided environmental data,
+#' filtering only those that match the predefined UBA naming conventions. These components
+#' include "GLO", "LDR", "RFE", "TMP", "WIG", "WIR", "WIND_U", and "WIND_V".
+#' @param env_data Data table containing environmental data.
+#' Must contain column "Komponente"
+#' @return A vector of available meteorological components.
+#' @examples
+#' # Example environmental data
+#' env_data <- data.table::data.table(
+#'   Komponente = c("TMP", "NO2", "GLO", "WIR"),
+#'   Wert = c(25, 40, 300, 50),
+#'   date = as.POSIXct(c(
+#'     "2023-01-01 08:00:00", "2023-01-01 09:00:00",
+#'     "2023-01-01 10:00:00", "2023-01-01 11:00:00"
+#'   ))
+#' )
+#' # Get available meteorological components
+#' meteo_components <- get_meteo_available(env_data)
+#' print(meteo_components)
+#' @export
+get_meteo_available <- function(env_data) {
+  meteo_available <- unique(env_data$Komponente) %>%
+    .[. %in% c("GLO", "LDR", "RFE", "TMP", "WIG", "WIR", "WIND_U", "WIND_V")]
+  meteo_available
+}
+
+#' @return A data.table with an added year column.
+#' @noRd
+.add_year_column <- function(env_data) {
+  env_data_copy <- data.table::copy(env_data)
+  env_data_copy[, year := lubridate::year(date)]
+}
+
+#' Adds a `day` column to the data, representing the date truncated to day-level
+#' precision. This column is used for later aggregations.
+#' @noRd
+.add_day_column <- function(env_data) {
+  env_data %>% dplyr::mutate(day = lubridate::floor_date(date, unit = "day"))
+}
+
+#' @return A data.table aggregated to daily mean values and a new column day.
+#' @noRd
+.aggregate_data <- function(env_data) {
+  env_data <- .add_day_column(env_data)
+  env_data[, list(Wert = mean(Wert, na.rm = TRUE)),
+    by = list(Station, Komponente, Komponente_txt, day, year)
+  ]
+}
--- a/R/data_loading.R
+++ b/R/data_loading.R
+#' Load Parameters from YAML File
+#'
+#' Reads a YAML file containing model parameters, including station settings,
+#' variables, and configurations for various models. If no file path is
+#' provided, the function defaults to loading `params.yaml` from the package's
+#' `extdata` directory.
+#'
+#' @param filepath Character. Path to the YAML file. If `NULL`, the function
+#' will attempt to load the default `params.yaml` provided in the package.
+#' @return A list containing the parameters loaded from the YAML file.
+#' @details
+#' The YAML file should define parameters in a structured format, such as:
+#'
+#' ```yaml
+#' target: 'NO2'
+#'
+#' lightgbm:
+#'   nrounds: 200
+#'   eta: 0.03
+#'   num_leaves: 32
+#'
+#' dynamic_regression:
+#'   ntrain: 8760
+#'
+#' random_forest:
+#'   num.trees: 300
+#'   max.depth: 10
+#'
+#' meteo_variables:
+#'   - GLO
+#'   - TMP
+#' ```
+#' @examples
+#' \dontrun{
+#' params <- load_params("path/to/custom_params.yaml")
+#' }
+#' @export
+#' @importFrom yaml read_yaml
+load_params <- function(filepath = NULL) {
+  if (is.null(filepath)) {
+    filepath <- system.file("extdata", "params.yaml", package = "ubair")
+  }
+  if (file.exists(filepath)) {
+    params <- yaml::read_yaml(filepath)
+    return(params)
+  } else {
+    stop("YAML file not found at the specified path.")
+  }
+}
+
+#' Copy Default Parameters File
+#'
+#' Copies the default `params.yaml` file, included with the package, to a
+#' specified destination directory. This is useful for initializing parameter
+#' files for custom edits.
+#'
+#' @param dest_dir Character. The path to the directory where the `params.yaml`
+#' file will be copied.
+#' @return Nothing is returned. A message is displayed upon successful copying.
+#' @details
+#' The `params.yaml` file contains default model parameters for various
+#' configurations such as LightGBM, dynamic regression, and others. See the
+#' \code{\link[ubair:load_params]{load_params()}} documentation for an example of the file's structure.
+#'
+#' @examples
+#' \dontrun{
+#' copy_default_params("path/to/destination")
+#' }
+#' @export
+copy_default_params <- function(dest_dir) {
+  if (!dir.exists(dest_dir)) {
+    stop("Destination directory does not exist.")
+  }
+  file.copy(system.file("extdata", "params.yaml", package = "ubair"),
+    file.path(dest_dir, "params.yaml"),
+    overwrite = TRUE
+  )
+  message("Default params.yaml copied to ", normalizePath(dest_dir))
+}
+
+
+#' Load UBA Data from Directory
+#'
+#' This function loads data from CSV files in the specified directory. It supports two formats:
+#'
+#' 1. "inv": Files must contain the following columns:
+#'    - `Station`, `Komponente`, `Datum`, `Uhrzeit`, `Wert`.
+#' 2. "24Spalten": Files must contain:
+#'    - `Station`, `Komponente`, `Datum`, and columns `Wert01`, ..., `Wert24`.
+#'
+#' File names should include "inv" or "24Spalten" to indicate their format. The function scans
+#' recursively for `.csv` files in subdirectories and combines the data into a single `data.table`
+#' in long format.
+#' Files that are not in the expected format will be ignored.
+#'
+#' @param data_dir Character. Path to the directory containing `.csv` files.
+#' @return A `data.table` containing the loaded data in long format. Throws an error if no valid
+#' files are found or the resulting dataset is empty.
+#' @export
+#' @importFrom lubridate ymd_hm ymd_h
+#' @importFrom tidyr gather
+#' @importFrom dplyr mutate select  %>%
+load_uba_data_from_dir <- function(data_dir) {
+  if (!dir.exists(data_dir)) {
+    stop(paste("Directory does not exist:", data_dir))
+  }
+  all_files <- list.files(data_dir,
+    pattern = "\\.csv$",
+    recursive = TRUE,
+    full.names = TRUE
+  )
+
+  list_data_parts <- lapply(unique(dirname(all_files)), function(dir) {
+    .load_data(dir)
+  })
+
+  combined_data <- data.table::rbindlist(list_data_parts, fill = TRUE)
+  if (nrow(combined_data) == 0) {
+    stop(paste(
+      "The resulting data is empty after loading all files from",
+      data_dir
+    ))
+  }
+
+  return(combined_data)
+}
+
+#' Load data for a Specific Directory
+#'
+#' @param data_dir Character. Path to the directory containing `.csv` files.
+#' @return A `data.table` with the loaded data for the directory. Returns an
+#' empty `data.table` if no files match the expected format.
+#' @noRd
+.load_data <- function(data_dir) {
+  data_files <- list.files(data_dir, pattern = "\\.csv$")
+  list_data <- lapply(data_files, function(file) {
+    .load_data_file(data_dir, file)
+  })
+  data.table::rbindlist(list_data, fill = TRUE)
+}
+
+#' Load data from a specific file
+#'
+#' Loads data from a file in "inv" or "24Spalten" format. Unsupported formats
+#' return an empty `data.table`.
+#'
+#' @param data_dir Character. Base directory containing the file.
+#' @param file Character. Name of the file to load.
+#' @return `data.table` containing loaded data or empty data.table for
+#' unsupported formats.
+#' @noRd
+.load_data_file <- function(data_dir, file) {
+  file_path <- file.path(data_dir, file)
+  if (grepl("inv", file)) {
+    .load_inv_file(file_path, file)
+  } else if (grepl("24Spalten", file)) {
+    .load_spalten_file(file_path, file)
+  } else {
+    data.table::data.table()
+  }
+}
+
+#' Load data from an 'inv' file
+#'
+#' @param file_path Full path to the 'inv' file.
+#' @param file The filename being loaded.
+#' @return A data.table with the loaded 'inv' data.
+#' @noRd
+.load_inv_file <- function(file_path, file) {
+  data.table::fread(file_path,
+    quote = "'",
+    na.strings = c(
+      "-999", "-888", "-777", "-666", "555", "-555", "-333",
+      "-111", "555.00000000"
+    )
+  ) %>%
+    mutate(
+      date = ymd_hm(paste(Datum, Uhrzeit)),
+      Komponente_txt = Komponente,
+      Komponente = substring(sub("\\_.*", "", file), 7)
+    )
+}
+
+#' Load data from a '24Spalten' file
+#'
+#' @param file_path Full path to the 'Spalten' file.
+#' @param file The filename being loaded.
+#' @return `data.table` containing the processed data from the '24Spalten' file.
+#' @noRd
+.load_spalten_file <- function(file_path, file) {
+  data.table::fread(file_path,
+    quote = "'",
+    na.strings = c(
+      "-999", "-888", "-777", "-666", "555", "-555", "-333",
+      "-111", "555.00000000"
+    )
+  ) %>%
+    tidyr::gather("time", "Wert", Wert01:Wert24) %>%
+    mutate(
+      Uhrzeit = substring(time, 5),
+      date = ymd_h(paste(Datum, Uhrzeit)),
+      Komponente_txt = Komponente,
+      Komponente = substring(sub("\\_.*", "", file), 8)
+    ) %>%
+    dplyr::select(-c(Nachweisgrenze, Lieferung))
+}
--- a/R/data_preprocessing.R
+++ b/R/data_preprocessing.R
+#' Removes trend from data
+#'
+#' Takes a list of train and application data as prepared by
+#' [ubair::split_data_counterfactual()]
+#' and removes a polynomial, exponential or cubic spline trend function.
+#' Trend is obtained only from train data. Use as part of preprocessing before
+#' training a model based on decision trees, i.e. random forest and lightgbm.
+#' For the other methods it may be helpful but they are generally able to
+#' deal with trends themselves. Therefore we recommend to try out different
+#' versions and guide decisions using the model evaluation metrics from
+#' [ubair::calc_performance_metrics()].
+#'
+#' Apply [ubair::retrend_predictions()] to predictions to return to the
+#' original data units.
+#'
+#' @param split_data List of two named dataframes called train and apply
+#' @param mode String which defines the type of trend present. Options are
+#' "linear", "quadratic", "exponential", "spline", "none".
+#' "none" returns original data
+#' @param num_splines Defines the number of cubic splines if `mode="spline"`.
+#' Choose num_splines=1 for cubic polynomial trend. If `mode!="spline"`, this
+#' parameter is ignored
+#' @param log_transform If `TRUE`, use a log-transformation before detrending
+#' to ensure positivity of all predictions in the rest of the pipeline.
+#' An exp transformation is necessary during retrending to return to the solution
+#' space. Use only in combination with `log_transform` parameter in
+#' [ubair::retrend_predictions()]
+#' @return List of 3 elements: the detrended dataframes `train` and `apply`, and
+#' the fitted trend function `model`
+#' @examples
+#' \dontrun{
+#' split_data <- split_data_counterfactual(
+#'   dt_prepared, training_start,
+#'   training_end, application_start, application_end
+#' )
+#' detrended_list <- detrend(split_data, mode = "linear")
+#' detrended_train <- detrended_list$train
+#' detrended_apply <- detrended_list$apply
+#' trend <- detrended_list$model
+#' }
+#' @export
+#' @importFrom stats lm predict spline
+#' @import splines
+detrend <- function(split_data,
+                    mode = "linear",
+                    num_splines = 5,
+                    log_transform = FALSE) {
+  dt_train_new <- data.table::copy(split_data$train)
+  dt_apply_new <- data.table::copy(split_data$apply)
+  stopifnot("log_transform needs to be boolean, i.e. either TRUE or FALSE" = class(log_transform) == "logical")
+  if (log_transform) {
+    dt_train_new$value <- log(dt_train_new$value)
+    dt_apply_new$value <- log(dt_apply_new$value)
+  }
+  if (mode == "linear") {
+    trend <- lm(value ~ date_unix, data = dt_train_new)
+  } else if (mode == "quadratic") {
+    trend <- lm(value ~ date_unix + I(date_unix^2), data = dt_train_new)
+  } else if (mode == "exponential") {
+    trend <- lm(value ~ log(date_unix), data = dt_train_new)
+  } else if (mode == "spline") {
+    stopifnot("Set num_splines larger or equal to 1" = num_splines >= 1)
+    knots <- seq(
+      from = min(dt_train_new$date_unix),
+      to = max(dt_train_new$date_unix),
+      length.out = num_splines + 1
+    )
+    knots <- knots[2:(length(knots) - 1)]
+    if (length(knots) <= 2) { # If 2 or fewer knots, just fit a cubic polynomial
+      trend <- lm(value ~ date_unix + I(date_unix^2) + I(date_unix^3),
+        data = dt_train_new
+      )
+    } else { # Else use splines
+      trend <- lm(value ~ bs(date_unix, knots = knots), data = dt_train_new)
+    }
+  } else if (mode == "none") {
+    trend <- lm(value ~ 1 - 1, data = dt_train_new)
+  } else {
+    stop("mode needs to be any of the following strings: 'linear',
+         'quadratic', 'exponential', 'spline', 'none'")
+  }
+  trend_train_values <- predict(trend, newdata = dt_train_new)
+  dt_train_new$value <- dt_train_new$value - trend_train_values
+  trend_apply_values <- predict(trend, newdata = dt_apply_new)
+  dt_apply_new$value <- dt_apply_new$value - trend_apply_values
+  return(list(train = dt_train_new, apply = dt_apply_new, model = trend))
+}
+
+#' Restores the trend in the prediction
+#'
+#' Takes a dataframe of predictions as returned by any of
+#' the 'run_model' functions and restores a trend which was previously
+#' removed via [ubair::detrend()]. This is necessary for the predictions
+#' and the true values to have the same units. The function is basically
+#' the inverse function to [ubair::detrend()] and should only be used in
+#' combination with it.
+#'
+#' @param dt_predictions Dataframe of predictions with columns `value`,
+#' `prediction`, `prediction_lower`, `prediction_upper`
+#' @param trend lm object generated by [ubair::detrend()]
+#' @param log_transform Returns values to solution space, if they have been
+#' log transformed during detrending. Use only in combination with `log_transform`
+#' parameter in detrend function.
+#' @return Retrended dataframe with same structure as `dt_predictions`
+#' which is returned by any of the run_model() functions.
+#' @examples
+#' \dontrun{
+#' detrended_list <- detrend(split_data,
+#'   mode = detrending_function,
+#'   log_transform = log_transform
+#' )
+#' trend <- detrended_list$model
+#' detrended_train <- detrended_list$train
+#' detrended_apply <- detrended_list$apply
+#' detrended_train <- detrended_train %>% select(value, dplyr::any_of(variables))
+#' result <- run_lightgbm(
+#'   train = detrended_train,
+#'   test = detrended_apply,
+#'   model_params = params$lightgbm,
+#'   alpha = 0.9,
+#'   calc_shaps = FALSE
+#' )
+#' retrended_predictions <- retrend_predictions(result$dt_predictions, trend)
+#' }
+#' @export
+retrend_predictions <- function(dt_predictions, trend, log_transform = FALSE) {
+  stopifnot("log_transform needs to be boolean, i.e. TRUE or FALSE" = class(log_transform) == "logical")
+  stopifnot("trend object needs to be a linear model of class 'lm'" = class(trend) == "lm")
+  stopifnot(
+    "Not all of 'value', 'prediction', 'prediction_lower', 'prediction_upper' are columns in dt_predictions" =
+      all(c("value", "prediction", "prediction_lower", "prediction_upper")
+      %in% colnames(dt_predictions))
+  )
+  trend_value <- predict(trend, newdata = dt_predictions)
+  if (log_transform) {
+    dt_predictions[,
+      c("value", "prediction", "prediction_lower", "prediction_upper") := lapply(.SD, \(x) exp(x + trend_value)),
+      .SDcols = c("value", "prediction", "prediction_lower", "prediction_upper")
+    ]
+  } else {
+    dt_predictions[,
+      c("value", "prediction", "prediction_lower", "prediction_upper") := lapply(.SD, \(x) x + trend_value),
+      .SDcols = c("value", "prediction", "prediction_lower", "prediction_upper")
+    ]
+  }
+
+  dt_predictions
+}
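The interplay of `detrend()` and `retrend_predictions()` can be sketched in base R for the linear case. This is a simplified illustration on synthetic data, not the package code: the data frames `train_df` and `apply_df` are hypothetical, and no model is fitted between the two steps.

```r
set.seed(1)

# Synthetic series with a linear trend in date_unix (hypothetical data)
train_df <- data.frame(date_unix = 1:100)
train_df$value <- 5 + 0.1 * train_df$date_unix + rnorm(100, sd = 0.5)
apply_df <- data.frame(date_unix = 101:120)
apply_df$value <- 5 + 0.1 * apply_df$date_unix + rnorm(20, sd = 0.5)

# Fit the trend on the training window only, as detrend() does for "linear"
trend <- lm(value ~ date_unix, data = train_df)

# Detrend: subtract the fitted trend in both windows
train_detrended <- train_df$value - predict(trend, newdata = train_df)
apply_detrended <- apply_df$value - predict(trend, newdata = apply_df)

# Retrend: adding the trend back recovers the original values
apply_retrended <- apply_detrended + predict(trend, newdata = apply_df)
```

Because the trend is estimated from the training window alone, the application window is extrapolated, so a poorly chosen trend form can dominate the retrended predictions.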
+
+
+#' Standardize Training and Application Data
+#'
+#' This function standardizes numeric columns of the `train_data` and applies
+#' the same scaling (mean and standard deviation) to the corresponding columns
+#' in `apply_data`. It returns the standardized data along with the scaling
+#' parameters (means and standard deviations). This is particularly important
+#' for neural network approaches as they tend to be numerically unstable and
+#' deteriorate otherwise.
+#'
+#' @param train_data A data frame containing the training dataset to be
+#' standardized. It must contain numeric columns.
+#' @param apply_data A data frame containing the dataset to which the scaling
+#' from `train_data` will be applied.
+#'
+#' @return A list containing the following elements:
+#' \item{train}{The standardized training data.}
+#' \item{apply}{The `apply_data` scaled using the means and standard deviations
+#' from the `train_data`.}
+#' \item{means}{The means of the numeric columns in `train_data`.}
+#' \item{sds}{The standard deviations of the numeric columns in `train_data`.}
+#' @export
+#' @importFrom dplyr mutate across where
+#' @examples
+#' \dontrun{
+#' scale_result <- scale_data(
+#'   train_data = detrended_list$train,
+#'   apply_data = detrended_list$apply
+#' )
+#' scaled_train <- scale_result$train
+#' scaled_apply <- scale_result$apply
+#' }
+scale_data <- function(train_data,
+                       apply_data) {
+  means <- attr(
+    scale(train_data %>% select(where(is.numeric))),
+    "scaled:center"
+  )
+  sds <- attr(
+    scale(train_data %>% select(where(is.numeric))),
+    "scaled:scale"
+  )
+
+  train_data <- train_data %>%
+    dplyr::mutate_if(is.numeric, ~ as.numeric(scale(.)))
+
+  apply_data <- apply_data %>%
+    mutate(across(
+      names(sds),
+      ~ as.numeric(scale(.,
+        center = means[dplyr::cur_column()],
+        scale = sds[dplyr::cur_column()]
+      ))
+    ))
+
+  list(
+    train = train_data,
+    apply = apply_data,
+    means = means,
+    sds = sds
+  )
+}
+
+
+#' Rescale predictions to original scale.
+#'
+#' This function rescales the predicted values (`prediction`, `prediction_lower`,
+#' `prediction_upper`). The scaling is reversed using the means and
+#' standard deviations that were saved from the training data. It is the inverse
+#' function to [ubair::scale_data()] and should be used only in combination with it.
+#'
+#' @param scale_result A list object returned by [ubair::scale_data()],
+#' containing the means and standard deviations used for scaling.
+#' @param dt_predictions A data frame containing the predictions,
+#' including columns `prediction`, `prediction_lower`, `prediction_upper`.
+#'
+#' @return A data frame with the predictions and numeric columns rescaled back
+#' to their original scale.
+#'
+#' @export
+#' @examples
+#' \dontrun{
+#' scale_res <- scale_data(train_data = train, apply_data = apply)
+#' res <- run_fnn(train = scale_res$train, test = scale_res$apply, params)
+#' dt_predictions <- res$dt_predictions
+#' rescaled_predictions <- rescale_predictions(scale_res, dt_predictions)
+#' }
+rescale_predictions <- function(scale_result, dt_predictions) {
+  means <- scale_result$means
+  sds <- scale_result$sds
+  rescaled_predictions <- dt_predictions %>%
+    mutate(
+      prediction = prediction * sds["value"] + means["value"],
+      prediction_lower = prediction_lower * sds["value"] + means["value"],
+      prediction_upper = prediction_upper * sds["value"] + means["value"]
+    )
+  return(rescaled_predictions)
+}
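The scaling round trip can be checked with a few lines of base R. This sketch uses made-up numbers and plain vectors instead of data frames; it only mirrors the arithmetic of `scale_data()` and `rescale_predictions()` for a single `value` column.

```r
# Hypothetical values for the training and application windows
train_value <- c(10, 20, 30, 40)
apply_value <- c(25, 50)

# Scaling statistics come from the training data only
m <- mean(train_value)
s <- sd(train_value)

# Apply the training statistics to the application data
apply_scaled <- (apply_value - m) / s

# Inverse transform, same form as in rescale_predictions()
apply_rescaled <- apply_scaled * s + m
```

Using the training statistics for both windows keeps information from the application period out of the preprocessing step.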
--- a/R/model_evaluation.R
+++ b/R/model_evaluation.R
+#' Calculates performance metrics of a business-as-usual model
+#'
+#' Model agnostic function to calculate a number of common performance
+#' metrics on the reference time window.
+#' Uses the true data `value` and the predictions `prediction` for this calculation.
+#' The coverage is calculated from the columns `value`, `prediction_lower` and
+#' `prediction_upper`.
+#' Removes dates in the effect and buffer range as the model is not expected to
+#' be performing correctly for these times. The incorrectness is precisely
+#' what we are using for estimating the effect.
+#' @param predictions data.table or data.frame with the following columns
+#'  \describe{
+#'    \item{date}{Date of the observation. Needs to be comparable to
+#'    date_effect_start element.}
+#'    \item{value}{True observed value of the station}
+#'    \item{prediction}{Predicted model output for the same time and station
+#'    as value}
+#'    \item{prediction_lower}{Lower end of the prediction interval}
+#'    \item{prediction_upper}{Upper end of the prediction interval}
+#'  }
+#'
+#' @param date_effect_start A date. Start date of the
+#' effect that is to be evaluated. The data from this point onwards is disregarded
+#' for calculating model performance
+#' @param buffer Integer. An additional buffer window before date_effect_start to account
+#' for uncertainty in the effect start point. Disregards additional buffer data
+#' points for model evaluation
+#' @return Named vector with performance metrics of the model
+#' @export
+calc_performance_metrics <- function(predictions, date_effect_start = NULL, buffer = 0) {
+  df <- data.table::copy(predictions)
+  stopifnot("Buffer needs to be larger or equal to 0" = buffer >= 0)
+  if (!is.null(date_effect_start)) {
+    stopifnot(
+      "date_effect_start needs to be NULL or a date object" =
+        inherits(date_effect_start, "Date") | lubridate::is.POSIXt(date_effect_start)
+    )
+    df <- df[date < (date_effect_start - as.difftime(buffer, units = "hours"))]
+  }
+  metrics <- c(
+    "RMSE" = sqrt(mean((df$value - df$prediction)**2)),
+    "MSE" = mean((df$value - df$prediction)**2),
+    "MAE" = mean(abs(df$value - df$prediction)),
+    "MAPE" = mean(abs(df$value - df$prediction) / (df$value + 1)),
+    "Bias" = mean(df$prediction - df$value),
+    "R2" = 1 - sum((df$value - df$prediction)**2) / sum((df$value - mean(df$value))**2),
+    "Coverage lower" = mean(df$value >= df$prediction_lower),
+    "Coverage upper" = mean(df$value <= df$prediction_upper),
+    "Coverage" = mean(df$value >= df$prediction_lower) +
+      mean(df$value <= df$prediction_upper) - 1,
+    "Correlation" = stats::cor(df$value, df$prediction),
+    "MFB" = 2 * mean((df$prediction - df$value) / (df$prediction + df$value)),
+    "FGE" = 2 * mean(abs(df$prediction - df$value) / (df$prediction + df$value))
+  )
+  metrics
+}
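A small worked example may help when reading the coverage metrics. The sketch below uses a hypothetical four-row data frame and recomputes a few of the metrics by hand. It also checks that the combined `Coverage` (lower plus upper minus one) equals the share of values inside the interval, which holds as long as `prediction_lower <= prediction_upper` in every row.

```r
# Hypothetical predictions with one miss below and one miss above the interval
df <- data.frame(
  value = c(10, 12, 14, 16),
  prediction = c(11, 12, 13, 18),
  prediction_lower = c(9, 11, 14.5, 15),
  prediction_upper = c(12, 13, 15, 15.5)
)

rmse <- sqrt(mean((df$value - df$prediction)^2))
bias <- mean(df$prediction - df$value)

cov_lower <- mean(df$value >= df$prediction_lower)
cov_upper <- mean(df$value <= df$prediction_upper)
coverage <- cov_lower + cov_upper - 1

# Direct count of values that fall inside the prediction interval
inside <- mean(df$value >= df$prediction_lower & df$value <= df$prediction_upper)
```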
+
+
+#' Calculates summary statistics for predictions and true values
+#'
+#' Helps with analyzing predictions by comparing them with the true values on
+#' a number of relevant summary statistics.
+#' @param predictions Data.table or data.frame with the following columns
+#'  \describe{
+#'    \item{date}{Date of the observation. Needs to be comparable to
+#'    date_effect_start element.}
+#'    \item{value}{True observed value of the station}
+#'    \item{prediction}{Predicted model output for the same time and station
+#'    as value}
+#'  }
+#' @param date_effect_start A date. Start date of the
+#' effect that is to be evaluated. The data from this point onwards is disregarded
+#' for calculating model performance
+#' @param buffer Integer. An additional buffer window before date_effect_start to account
+#' for uncertainty in the effect start point. Disregards additional buffer data
+#' points for model evaluation
+#' @return data.frame of summary statistics with columns true and prediction
+#' @export
+calc_summary_statistics <- function(predictions, date_effect_start = NULL, buffer = 0) {
+  df <- data.table::copy(predictions)
+  stopifnot("Buffer needs to be larger or equal to 0" = buffer >= 0)
+  if (!is.null(date_effect_start)) {
+    stopifnot(
+      "date_effect_start needs to be NULL or a date object" =
+        inherits(date_effect_start, "Date") | lubridate::is.POSIXt(date_effect_start)
+    )
+    df <- df[date < (date_effect_start - as.difftime(buffer, units = "hours"))]
+  }
+  data.frame(
+    true = c(
+      min(df$value),
+      max(df$value),
+      stats::var(df$value),
+      mean(df$value),
+      stats::quantile(df$value, probs = 0.05),
+      stats::quantile(df$value, probs = 0.25),
+      stats::median(df$value),
+      stats::quantile(df$value, probs = 0.75),
+      stats::quantile(df$value, probs = 0.95)
+    ),
+    prediction = c(
+      min(df$prediction),
+      max(df$prediction),
+      stats::var(df$prediction),
+      mean(df$prediction),
+      stats::quantile(df$prediction, probs = 0.05),
+      stats::quantile(df$prediction, probs = 0.25),
+      stats::median(df$prediction),
+      stats::quantile(df$prediction, probs = 0.75),
+      stats::quantile(df$prediction, probs = 0.95)
+    ),
+    row.names = c(
+      "min", "max", "var", "mean", "5-percentile",
+      "25-percentile", "median/50-percentile",
+      "75-percentile", "95-percentile"
+    )
+  )
+}
+
+#' Estimates size of the external effect
+#'
+#' Calculates an estimate for the absolute and relative effect size of the
+#' external effect. The absolute effect is the difference between the model
+#' bias in the reference time and the effect time windows. The relative effect
+#' is the absolute effect divided by the mean true value in the effect
+#' window.
+#'
+#' Note: Since the bias of the model is an average over predictions and true
+#' values, it is important, that the effect window is specified correctly.
+#' Imagine a scenario like a fire which strongly affects the outcome for one
+#' hour and is gone the next hour. If we use a two week effect window, the
+#' estimated effect will be 14*24=336 times smaller compared to using a 1-hour
+#' effect window. Generally, we advise against studying very short effects (single
+#' hour or single day). The variability of results will be too large to learn
+#' anything meaningful.
+#'
+#' @param df Data.table or data.frame with the following columns
+#'  \describe{
+#'    \item{date}{Date of the observation. Needs to be comparable to
+#'    date_effect_start element.}
+#'    \item{value}{True observed value of the station}
+#'    \item{prediction}{Predicted model output for the same time and station
+#'    as value}
+#'  }
+#' @param date_effect_start A date. Start date of the
+#' effect that is to be evaluated. The data from this point onward is disregarded
+#' for calculating model performance.
+#' @param buffer Integer. An additional buffer window before date_effect_start to account
+#' for uncertainty in the effect start point. Disregards additional buffer data
+#' points for model evaluation
+#' @param verbose Prints an explanation of the results if TRUE
+#' @return A list with two numbers: Absolute and relative estimated effect size.
+#' @export
+estimate_effect_size <- function(df, date_effect_start, buffer = 0, verbose = FALSE) {
+  stopifnot(
+    "date_effect_start needs to be a date object" =
+      inherits(date_effect_start, "Date") | lubridate::is.POSIXt(date_effect_start)
+  )
+  stopifnot("Buffer needs to be larger or equal to 0" = buffer >= 0)
+  reference <- df[date < (date_effect_start - as.difftime(buffer, units = "hours"))]
+  effect <- df[date >= date_effect_start]
+  bias_ref <- mean(reference$prediction - reference$value)
+  bias_effect <- mean(effect$prediction - effect$value)
+  effectsize <- bias_ref - bias_effect
+  rel_effectsize <- effectsize / mean(effect$value)
+  rel_effectsize <- round(rel_effectsize, 4)
+  if (verbose) {
+    cat(sprintf("The external effect changed the target value on average by %.3f compared to the reference time window. This is a %1.2f%% relative change.", effectsize, 100 * rel_effectsize))
+  }
+  list(absolute_effect = effectsize, relative_effect = rel_effectsize)
+}
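The effect size arithmetic can be traced on a toy example. The sketch below is a base-R approximation with hypothetical daily data and no buffer; it follows the function body, including the division by the mean observed value in the effect window.

```r
# Hypothetical data: the model is unbiased before the effect starts,
# and observed values drop by 5 afterwards.
df <- data.frame(
  date = as.Date("2023-01-01") + 0:5,
  value = c(20, 22, 21, 15, 17, 16),
  prediction = c(20, 22, 21, 20, 22, 21)
)
date_effect_start <- as.Date("2023-01-04")

reference <- df[df$date < date_effect_start, ]
effect <- df[df$date >= date_effect_start, ]

bias_ref <- mean(reference$prediction - reference$value) # 0
bias_effect <- mean(effect$prediction - effect$value)    # 5
absolute_effect <- bias_ref - bias_effect                # -5
relative_effect <- absolute_effect / mean(effect$value)  # -5 / 16 = -0.3125
```

A negative absolute effect here means that the observed values are lower than the business-as-usual prediction during the effect window.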
--- a/R/modelling.R
+++ b/R/modelling.R
+#' Prepare Data for Training a model
+#'
+#' Prepares environmental data by filtering for relevant components,
+#' converting the data to a wide format, and adding temporal features. Should be
+#' called before
+#' \code{\link[ubair:split_data_counterfactual]{split_data_counterfactual()}}
+#'
+#' @param env_data A data table in long format.
+#' Must include the following columns:
+#' \describe{
+#'   \item{Station}{Station identifier for the data.}
+#'   \item{Komponente}{The environmental component being measured
+#'        (e.g., temperature, NO2).}
+#'   \item{Wert}{The measured value of the component.}
+#'   \item{date}{Timestamp as `POSIXct` object in `YYYY-MM-DD HH:MM:SS` format.}
+#'   \item{Komponente_txt}{A textual description of the component.}
+#' }
+#' @param params A list of modelling parameters loaded from `params.yaml`.
+#' Must include:
+#' \describe{
+#'   \item{meteo_variables}{A vector of meteorological variable names.}
+#'   \item{target}{The name of the target variable.}
+#' }
+#' @return A `data.table` in wide format, with columns:
+#' `date`, one column per component, and temporal features
+#' like `date_unix`, `day_julian`, `weekday`, and `hour`.
+#' @examples
+#' env_data <- data.table::data.table(
+#'   Station = c("StationA", "StationA", "StationA"),
+#'   Komponente = c("NO2", "TMP", "NO2"),
+#'   Wert = c(50, 20, 40),
+#'   date = as.POSIXct(c("2023-01-01 10:00:00", "2023-01-01 11:00:00", "2023-01-02 12:00:00"))
+#' )
+#' params <- list(meteo_variables = c("TMP"), target = "NO2")
+#' prepared_data <- prepare_data_for_modelling(env_data, params)
+#' print(prepared_data)
+#'
+#' @export
+prepare_data_for_modelling <- function(env_data, params) {
+  components <- c(params$meteo_variables, params$target)
+  dt_filtered <- .extract_components(env_data, components)
+  dt_wide <- .cast_to_wide(dt_filtered)
+
+  if (!params$target %in% names(dt_wide)) {
+    warning(sprintf(
+      paste(
+        "Target '%s' is not present in the data.",
+        "Make sure it exists and you have set the correct target name."
+      ),
+      params$target
+    ))
+    stop("Exiting function due to missing target data.")
+  }
+  dt_prepared <- dt_wide %>%
+    dplyr::rename(value = params$target) %>%
+    .add_date_variables(replace = TRUE) %>%
+    dplyr::filter(!is.na(value))
+  dt_prepared
+}
+
+#' Turn date feature into temporal features date_unix, day_julian, weekday and
+#' hour
+#'
+#' @param df Data.table with column date formatted as date-time object
+#' @param replace Boolean which determines whether to replace existing temporal variables
+#' @return A data.table with all relevant temporal features for modelling
+#' @noRd
+.add_date_variables <- function(df, replace) {
+  names <- names(df)
+  if (replace) {
+    df$date_unix <- as.numeric(df$date)
+    df$day_julian <- lubridate::yday(df$date)
+    df$weekday <- .wday_monday(df$date, as.factor = TRUE)
+    df$hour <- lubridate::hour(df$date)
+  } else {
+    if (!"date_unix" %in% names) {
+      df$date_unix <- as.numeric(df$date)
+    }
+    if (!"day_julian" %in% names) {
+      df$day_julian <- lubridate::yday(df$date)
+    }
+    if (!"weekday" %in% names) {
+      df$weekday <- .wday_monday(df$date, as.factor = TRUE)
+    }
+    if (!"hour" %in% names) {
+      df$hour <- lubridate::hour(df$date)
+    }
+  }
+  return(df)
+}
+
+#' Reformat lubridate weekdays into weekdays with Monday as day 1
+#'
+#' @param x Vector of date-time objects
+#' @param as.factor Boolean that determines whether to return a factor or a numeric vector
+#' @noRd
+.wday_monday <- function(x, as.factor = FALSE) {
+  x <- lubridate::wday(x)
+  x <- x - 1
+  x <- ifelse(x == 0, 7, x)
+  if (as.factor) {
+    x <- factor(x, levels = 1:7, ordered = TRUE)
+  }
+  return(x)
+}
+
+
+#' Split Data into Training and Application Datasets
+#'
+#' Splits prepared data into training and application datasets based on
+#' specified date ranges for a business-as-usual scenario. Data before
+#' `application_start` and after `application_end` is used as training data,
+#' while data within the date range is used for application.
+#'
+#' @param dt_prepared The prepared data table.
+#' @param application_start The start date (a `Date` object) of the application
+#' period of the business-as-usual simulation. This coincides with the start of
+#' the reference window.
+#' Can be created with e.g. `lubridate::ymd("20191201")`.
+#' @param application_end The end date (a `Date` object) of the application period
+#' of the business-as-usual simulation. This coincides with the end of
+#' the effect window.
+#' Can be created with e.g. `lubridate::ymd("20200504")`.
+#' @return A list with two elements:
+#' \describe{
+#'   \item{train}{Data outside the application period.}
+#'   \item{apply}{Data within the application period.}
+#' }
+#' @examples
+#' dt_prepared <- data.table::data.table(
+#'   date = as.Date(c("2023-01-01", "2023-01-05", "2023-01-10")),
+#'   value = c(50, 60, 70)
+#' )
+#' result <- split_data_counterfactual(
+#'   dt_prepared,
+#'   application_start = as.Date("2023-01-03"),
+#'   application_end = as.Date("2023-01-08")
+#' )
+#' print(result$train)
+#' print(result$apply)
+#' @export
+split_data_counterfactual <- function(dt_prepared,
+                                      application_start,
+                                      application_end) {
+  stopifnot(
+    inherits(application_start, "Date"),
+    inherits(application_end, "Date")
+  )
+  stopifnot(application_start <= application_end)
+  dt_train <- dt_prepared[date < application_start | date > application_end]
+  dt_apply <- dt_prepared[date >= application_start & date <= application_end]
+  list(train = dt_train, apply = dt_apply)
+}
+
+
+#' Extract Components for Modelling
+#' Stop with error message if any selected meteo variable/component is not
+#' contained in the data.
+#'
+#' @param env_data Daily aggregated data table.
+#' @param components Vector of component names to extract.
+#' @return A data.table filtered by the specified components.
+#' @noRd
+.extract_components <- function(env_data, components) {
+  if (!all(components %in% env_data$Komponente)) {
+    missing_components <- components[!components %in% env_data$Komponente]
+    stop(paste(
+      "Data does not contain all selected variables:",
+      paste(missing_components, collapse = ", "),
+      "\n Check data and meteo_variables/params.yaml."
+    ))
+  }
+  env_data[Komponente %in% components, list(Komponente, Wert, date)]
+}
+
+#' Cast Filtered Data to Wide Format
+#'
+#' @param dt_filtered Filtered data.table.
+#' @return A wide-format data.table.
+#' @noRd
+#' @examples
+#' dt_filtered <- data.table::data.table(
+#'   date = as.POSIXct(c("2023-01-01", "2023-01-01", "2023-01-02")),
+#'   Komponente = c("NO2", "TMP", "NO2"),
+#'   Wert = c(50, 20, 40)
+#' )
+#' wide_data <- .cast_to_wide(dt_filtered)
+#' print(wide_data)
+.cast_to_wide <- function(dt_filtered) {
+  data.table::dcast(dt_filtered,
+    formula = date ~ Komponente,
+    value.var = "Wert"
+  )
+}
--- a/R/sample_data_DESN025.R
+++ b/R/sample_data_DESN025.R
+#' Environmental Data for Modelling from station DESN025 in Leipzig-Mitte.
+#'
+#' A dataset containing environmental measurements collected at the station in
+#' Leipzig-Mitte, with observations of different environmental components over
+#' time. This data is used for environmental modelling tasks and includes
+#' meteorological variables and other targets.
+#'
+#' @format ## sample_data_DESN025
+#' A data table with the following columns:
+#' \describe{
+#'   \item{Station}{Station identifier where the data was collected.}
+#'   \item{Komponente}{The environmental component being measured
+#'        (e.g., temperature, NO2).}
+#'   \item{Wert}{The measured value of the component.}
+#'   \item{date}{The timestamp for the observation, formatted as a Date-Time
+#'   object in the format
+#'          \code{"YYYY-MM-DD HH:MM:SS"} (e.g., "2010-01-01 07:00:00").}
+#'   \item{Komponente_txt}{A textual description or label for the component.}
+#' }
+#'
+#' The dataset is structured in a long format and is prepared for further
+#' transformation into a wide format for modelling.
+#'
+#' @source Umweltbundesamt
+#' @examples
+#' \dontrun{
+#' params <- load_params("path/to/params.yaml")
+#' dt_prepared <- prepare_data_for_modelling(sample_data_DESN025, params)
+#' }
+"sample_data_DESN025"
--- a/R/utils.R
+++ b/R/utils.R
+# required to suppress devtools::check() notes that occur from data.table syntax
+utils::globalVariables(c(
+  ".", "Datum", "Komponente", "Komponente_txt",
+  "Lieferung", "Nachweisgrenze", "Station", "Uhrzeit",
+  "Wert", "Wert01", "Wert24", "day", "part", "prediction_upper",
+  "prediction_lower", "prediction",
+  "se", "time", "value", "variable", "WIR", "WIG", "year", "Werte_aggregiert"
+))
--- a/R/visualisation.R
+++ b/R/visualisation.R
+#' Descriptive plot of daily time series data
+#'
+#' This function produces descriptive time-series plots with smoothing
+#' for the meteorological and potential target variables that were measured at a station.
+#'
+#' @param env_data A data table of measurements of one air quality measurement station.
+#' The data should contain the following columns:
+#' \describe{
+#'   \item{Station}{Station identifier where the data was collected.}
+#'   \item{Komponente}{The environmental component being measured
+#'        (e.g., temperature, NO2).}
+#'   \item{Wert}{The measured value of the component.}
+#'   \item{date}{The timestamp for the observation,
+#'          formatted as a Date-Time object in the format
+#'          \code{"YYYY-MM-DD HH:MM:SS"} (e.g., "2010-01-01 07:00:00").}
+#'   \item{Komponente_txt}{A textual description or label for the component.}
+#' }
+#' @param variables list of variables to plot. Must be in `env_data$Komponente`.
+#' Meteorological variables can be obtained from params.yaml.
+#' @param years Optional. A numeric vector, list, or a range specifying the
+#' years to restrict the plotted data.
+#'   You can provide:
+#'   - A single year: `years = 2020`
+#'   - A numeric vector of years: `years = c(2019, 2020, 2021)`
+#'   - A range of years: `years = 2019:2021`
+#'   If not provided, data for all available years will be used.
+#' @param smoothing_factor A number that defines the magnitude of smoothing.
+#' Default is 1. Smaller numbers correspond to less smoothing, larger numbers to more.
+#' @export
+#' @importFrom ggplot2 ggplot aes geom_line facet_wrap geom_smooth
+plot_station_measurements <- function(env_data, variables, years = NULL, smoothing_factor = 1) {
+  stopifnot("No data in the env_data data.table" = nrow(env_data) > 0)
+  stopifnot(
+    "More than one station in env_data. Use clean_data to specify which one to use" =
+      length(unique(env_data$Station)) == 1
+  )
+  if (!"day" %in% colnames(env_data)) {
+    env_data <- .aggregate_data(env_data)
+  }
+  # Derive the default years only after aggregation, so that the `year`
+  # column is available.
+  if (is.null(years)) {
+    years <- unique(env_data$year)
+  }
+  env_data[, Werte_aggregiert := data.table::frollmean(Wert,
+    12 * 7 * smoothing_factor,
+    na.rm = TRUE,
+    align = "center"
+  ),
+  by = Komponente_txt
+  ]
+  p <- ggplot(env_data[Komponente %in% variables & year %in% years], aes(day, Wert)) +
+    geom_line() +
+    facet_wrap(~Komponente_txt, scales = "free_y", ncol = 1) +
+    geom_line(aes(day, Werte_aggregiert), color = "blue", size = 1.5)
+  p
+}
+
+#' Prepare Plot Data and Plot Counterfactuals
+#'
+#' Smooths the predictions using a rolling mean, prepares the data for plotting,
+#' and generates the counterfactual plot for the application window. Data before
+#' the red box belong to the reference window, the red box marks the buffer, and
+#' values after the black dotted line belong to the effect window.
+#'
+#' The optional grey ribbon is a prediction interval for the hourly values. The
+#' interpretation for a 90% prediction interval (to be defined in `alpha` parameter
+#' of [ubair::run_counterfactual()]) is that 90% of the true hourly values
+#' (not the rolled means) lie within the grey band. This might be helpful for
+#' getting an idea of the variance of the data and predictions.
+#'
+#' @param predictions The data.table containing the predictions (hourly)
+#' @param params Parameters for plotting, including the target variable.
+#' @param window_size The window size for the rolling mean (default is 14 days).
+#' @param date_effect_start A date. Start date of the
+#' effect that is to be evaluated. The data from this point onwards is disregarded
+#' for calculating model performance.
+#' @param buffer Integer. An additional, optional buffer window (in hours) before
+#' `date_effect_start` to account for uncertainty in the effect start point.
+#' Disregards additional buffer data points for model evaluation.
+#' Use `buffer=0` for no buffer.
+#' @param plot_pred_interval Boolean. If `TRUE`, shows a grey band of the prediction
+#' interval.
+#' @return A ggplot object with the counterfactual plot. Can be adjusted further,
+#' e.g. set limits for the y-axis for better visualisation.
+#' @export
+#' @importFrom ggplot2 ggplot aes geom_ribbon geom_line geom_vline theme_bw labs
+#' @importFrom data.table melt
+#' @importFrom data.table .SD
+#' @importFrom data.table :=
+plot_counterfactual <- function(predictions, params, window_size = 14,
+                                date_effect_start = NULL, buffer = 0,
+                                plot_pred_interval = TRUE) {
+  stopifnot("No data in predictions" = nrow(predictions) > 0)
+  stopifnot(
+    "Not all of 'value', 'prediction', 'prediction_lower', 'prediction_upper' are present in predictions data.table" =
+      c("value", "prediction", "prediction_lower", "prediction_upper") %in% colnames(predictions)
+  )
+  # Smooth the data using a rolling mean
+  dt_plot <- data.table::copy(predictions)
+  dt_plot <- dt_plot[,
+    c("value", "prediction", "prediction_lower", "prediction_upper") :=
+      lapply(.SD, \(x) data.table::frollmean(x, 24 * window_size, align = "center")),
+    .SDcols = c("value", "prediction", "prediction_lower", "prediction_upper")
+  ]
+
+  dt_plot <- dt_plot[!is.na(value)]
+
+  # Melt the data for plotting
+  dt_plot_melted <- melt(dt_plot,
+    id.vars = c("date", "prediction_lower", "prediction_upper"),
+    measure.vars = c("value", "prediction"),
+    variable.name = "variable",
+    value.name = "value"
+  )
+
+  # Prepare the ggplot object
+  p <- ggplot(dt_plot_melted, aes(x = date))
+  if (plot_pred_interval) {
+    p <- p +
+      geom_ribbon(aes(ymin = prediction_lower, ymax = prediction_upper),
+        fill = "grey70", alpha = 0.8
+      )
+  }
+  p <- p + geom_line(aes(y = value, color = variable)) +
+    theme_bw() +
+    ggplot2::scale_x_datetime(
+      date_minor_breaks = "1 month",
+      date_breaks = "2 month"
+    ) +
+    labs(y = paste(params$target, paste("concentration", window_size, "d rolling mean")))
+
+  # Add vertical lines for external effect if provided
+  if (!is.null(date_effect_start) && length(date_effect_start) == 1) {
+    p <- p + geom_vline(
+      xintercept = date_effect_start, linetype = 4,
+      colour = "black"
+    )
+  }
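+  # The buffer box is anchored at date_effect_start, so skip it if none is given
+  if (buffer > 0 && !is.null(date_effect_start)) {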
+    p <- p + ggplot2::annotate("rect",
+      xmin = date_effect_start - as.difftime(buffer, units = "hours"),
+      xmax = date_effect_start,
+      ymin = -Inf,
+      ymax = Inf,
+      alpha = 0.1,
+      fill = "red"
+    )
+  }
+  p
+}
--- a/README.Rmd
+++ b/README.Rmd
+---
+output: 
+  github_document:
+     df_print: kable
+editor_options: 
+  chunk_output_type: console
+---
+
+<!-- README.md is generated from README.Rmd. Please edit that file -->
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(
+  message = FALSE,
+  warning = FALSE,
+  eval = FALSE,
+  comment = "#>",
+  fig.path = "man/figures/README-",
+  out.width = "100%"
+)
+```
+
+# ubair <img src="inst/sticker/stickers-ubair-1.png" align="right" width="20%"/>
+
+**ubair** is an R package for the statistical investigation of the impact of external conditions on air quality. It analyzes and visualizes the impact of external factors, such as traffic restrictions, hazards, and political measures, on air quality. It aims to provide experts with a transparent comparison of modeling approaches and to support data-driven evaluations for policy advisory purposes.
+
+## Installation
+Install from CRAN or, if you have access to
+[https://gitlab.ai-env.de/use-case-luft/ubair](https://gitlab.ai-env.de/use-case-luft/ubair),
+use one of the following options:
+
+#### Using an archive file
+Recommended if you do not have git installed.
+
+- Download the zip/tar.gz archive from GitLab.
+- Start a new R project or open an existing one.
+- In RStudio:
+  - Go to the 'Packages' tab (next to Help/Plots/Files).
+  - Click on 'Install' (upper left corner).
+  - Under 'Install from', choose "Package Archive File".
+  - Browse to the zip file.
+  - Click 'Install'.
+- Alternatively, type in the console:
+```{r install_package_local}
+install.packages("<path-to-zip>/ubair-master.zip", repos = NULL, type = "source")
+```
+
+#### Using remote package
+Git needs to be installed.
+```{r install_package_remote}
+install.packages("remotes")
+# requires a configured SSH key
+remotes::install_git("git@gitlab.ai-env.de:use-case-luft/ubair.git")
+# alternative via password
+remotes::install_git("https://gitlab.ai-env.de/use-case-luft/ubair.git")
+```
+
+## Sample Usage of package
+For a more detailed explanation of the package, you can access the vignettes:
+
+- View the user_sample source code directly in the [vignettes/](vignettes/) folder.
+- Open the vignette with `vignette("user_sample_1", package = "ubair")`,
+if the package was installed with vignettes.
+
+``` {r load data, eval=TRUE}
+library(ubair)
+params <- load_params()
+env_data <- sample_data_DESN025
+```
+
+```{r plot-meteo-data, eval=TRUE, fig.height=10, fig.width=10}
+# Plot meteo data
+plot_station_measurements(env_data, params$meteo_variables)
+```
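+
+The blue line in the plot above is a centered rolling mean computed with `data.table::frollmean()`. A minimal sketch of that kind of smoothing on toy data follows; the vector `x` and the window of 3 are illustrative assumptions, not the values used by the package:
+
+```r
+library(data.table)
+
+# Toy series; values and window size are chosen for illustration only.
+x <- c(1, 2, 3, 4, 5)
+
+# Centered rolling mean: each value is averaged with its neighbours.
+# Positions without a full window are returned as NA.
+smoothed <- frollmean(x, n = 3, align = "center")
+print(smoothed)
+```
+
+A larger window averages over more neighbouring points and therefore yields a smoother line, which is what a larger `smoothing_factor` does in `plot_station_measurements()`.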
+
+Split the data into training, reference and effect time intervals:
+
+<img src="man/figures/time_split_overview.png" width="100%"/>
+
+``` {r counterfactual-scenario, eval=TRUE, fig.height=5, fig.width=10}
+application_start <- lubridate::ymd("20191201") # This coincides with the start of the reference window
+date_effect_start <- lubridate::ymd_hm("20200323 00:00") # This splits the forecast into reference and effect
+application_end <- lubridate::ymd("20200504") # This coincides with the end of the effect window
+
+buffer <- 24 * 14 # 14 days buffer
+
+dt_prepared <- prepare_data_for_modelling(env_data, params)
+dt_prepared <- dt_prepared[complete.cases(dt_prepared)]
+split_data <- split_data_counterfactual(
+  dt_prepared, application_start,
+  application_end
+)
+res <- run_counterfactual(split_data,
+  params,
+  detrending_function = "linear",
+  model_type = "lightgbm",
+  alpha = 0.9,
+  log_transform = TRUE,
+  calc_shaps = TRUE
+)
+predictions <- res$prediction
+
+plot_counterfactual(predictions, params,
+  window_size = 14,
+  date_effect_start,
+  buffer = buffer,
+  plot_pred_interval = TRUE
+)
+```
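+
+The split used above can be illustrated with plain `data.table` filtering. The following sketch mirrors the date logic of `split_data_counterfactual()` on a toy table; the dates and values are made up for illustration:
+
+```r
+library(data.table)
+
+# Toy data: six consecutive days (illustrative values only).
+dt <- data.table(
+  date = as.Date("2019-11-29") + 0:5,
+  value = c(50, 52, 48, 47, 51, 53)
+)
+application_start <- as.Date("2019-12-01")
+application_end <- as.Date("2019-12-02")
+
+# Training data: everything outside the application period.
+dt_train <- dt[date < application_start | date > application_end]
+# Application data: everything inside the period (bounds inclusive).
+dt_apply <- dt[date >= application_start & date <= application_end]
+
+print(nrow(dt_train))
+print(nrow(dt_apply))
+```
+
+Because both bounds are inclusive, observations dated exactly `application_start` or `application_end` belong to the application set and are never used for training.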
+
+```{r evaluation metrics, eval=TRUE}
+round(calc_performance_metrics(predictions, date_effect_start, buffer = buffer), 2)
+round(calc_summary_statistics(predictions, date_effect_start, buffer = buffer), 2)
+estimate_effect_size(predictions, date_effect_start, buffer = buffer, verbose = TRUE)
+```
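+
+As a rough intuition for the numbers reported by `estimate_effect_size()`, the sketch below compares observed values with counterfactual predictions in the effect window. The formula (mean difference, divided by the mean prediction for the relative change) is an assumption made for illustration and may differ from the package's implementation; the toy data and the name `effect_window` are hypothetical:
+
+```r
+library(data.table)
+
+# Hypothetical hourly table with observed values and model predictions.
+predictions <- data.table(
+  date = as.POSIXct("2020-03-22 22:00:00", tz = "UTC") + 3600 * 0:5,
+  value = c(40, 40, 30, 30, 30, 30),
+  prediction = c(40, 40, 40, 40, 40, 40)
+)
+date_effect_start <- as.POSIXct("2020-03-23 00:00:00", tz = "UTC")
+
+# Assumed definition: average of observed minus predicted in the effect window.
+effect_window <- predictions[date >= date_effect_start]
+absolute_effect <- mean(effect_window$value - effect_window$prediction)
+relative_effect <- absolute_effect / mean(effect_window$prediction)
+
+print(list(absolute_effect = absolute_effect, relative_effect = relative_effect))
+```
+
+A negative absolute effect means that the observed values lie below the business-as-usual prediction after the effect start.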
+
+
+### SHAP feature importances
+```{r feature_importance, eval=TRUE, fig.height=4, fig.width=8}
+shapviz::sv_importance(res$importance, kind = "bee")
+xvars <- c("TMP", "WIG", "GLO", "WIR")
+shapviz::sv_dependence(res$importance, v = xvars)
+```
+
+## Development
+
+### Prerequisites
+
+1. **R**: Make sure you have R installed (recommended version 4.4.1). You can download it from [CRAN](https://cran.r-project.org/).
+2. **RStudio** (optional but recommended): Download from [RStudio](https://www.rstudio.com/).
+
+### Setting Up the Environment
+Install the development version of ubair:
+
+```{r}
+install.packages("renv")
+renv::restore()
+devtools::build()
+devtools::load_all()
+```
+### Development Workflow
+#### Install pre-commit hook (required to ensure tidyverse code formatting)
+```
+pip install pre-commit
+```
+#### Add new requirements
+If you add new dependencies to *ubair* package, make sure to update the renv.lock file:
+
+``` r
+renv::snapshot()
+```
+#### Style and documentation
+Before you commit your changes, update the documentation, ensure the style complies with the tidyverse style guide, and make sure all tests run without error.
+```{r}
+# update documentation and check package integrity
+devtools::check()
+# apply tidyverse style (also applied as precommit hook)
+usethis::use_tidy_style()
+# you can check for existing lintr warnings by
+devtools::lint()
+# run tests
+devtools::test()
+# build README.md if any changes have been made to README.Rmd
+devtools::build_readme()
+```
+
+#### Pre-commit hook
+The pre-commit rules are defined in `.pre-commit-config.yaml` and applied before each commit.
+This includes:
+- run styler to format code in tidyverse style
+- run roxygen to update doc
+- check if readme is up to date
+- run lintr to finally check code style format
+
+If pre-commit fails, check the automatically applied changes, stage them, and retry the commit.
+
+#### Test Coverage
+Install covr to run this.
+```{r test_coverage, message=FALSE, warning=FALSE}
+cov <- covr::package_coverage(type = "all")
+cov_list <- covr::coverage_to_list(cov)
+data.table::data.table(
+  part = c("Total", names(cov_list$filecoverage)),
+  coverage = c(cov_list$totalcoverage, as.vector(cov_list$filecoverage))
+)
+```
+
+```{r test_coverage_report}
+covr::report(cov)
+```
+
+## Contacts
+**Jore Noa Averbeck** 
+[JoreNoa.Averbeck@uba.de](mailto:JoreNoa.Averbeck@uba.de)
+
+**Raphael Franke** 
+[Raphael.Franke@uba.de](mailto:Raphael.Franke@uba.de)
+
+**Imke Voß** 
+[imke.voss@uba.de](mailto:imke.voss@uba.de)