---
title: "AquaSensR inputs and checks"
vignette: >
  %\VignetteIndexEntry{inputs}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
knitr:
  opts_chunk:
    collapse: true
    comment: '#>'
---

```{r}
#| label: setup
#| include: false
#| message: false
#| warning: false
library(AquaSensR)
library(dplyr)
```

AquaSensR requires two input files to use the functions in the package:

1. **Continuous monitoring data**: time series of sensor observations at a site, one column per parameter.
2. **Data Quality Objectives (DQO)**: parameter-specific thresholds used by the four QC checks (gross range, spike, rate of change, and flatline).

The DQO file is an Excel workbook (`.xlsx`). The continuous monitoring data file can be an Excel workbook (`.xlsx`), a CSV file (`.csv`), or a comma-delimited text file (`.txt`). This vignette describes how to import and check each input dataset. It is critical that the input datasets follow the exact specified format. Example files with the correct format are included with the package and are used throughout.

## Load the package

Load the package in an R session after installation: 

```{r}
#| eval: false
library(AquaSensR)
```

## File paths

First, specify the location of the two files by saving their paths to R variables. In practice you will supply paths to your own files, for example:

```{r}
#| eval: false
contpth <- "path/to/your/ContinuousData.xlsx"
dqopth <- "path/to/your/DQO.xlsx"
```

The examples below use the files included with the package:

```{r}
contpth <- system.file("extdata/ExampleCont1.xlsx", package = "AquaSensR")
dqopth <- system.file("extdata/ExampleDQO.xlsx", package = "AquaSensR")
```
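
Before importing, it can help to confirm that each path points to an existing file. The sketch below uses only base R; `check_path()` is a hypothetical helper written for this illustration and is not part of AquaSensR. It relies on the fact that `system.file()` returns an empty string when the requested file is not found.

```{r}
#| eval: false
# hypothetical helper (not an AquaSensR function): stop if a path is empty or missing
check_path <- function(pth) {
  if (!nzchar(pth) || !file.exists(pth)) {
    stop("File not found: '", pth, "'", call. = FALSE)
  }
  invisible(pth)
}

# a path that exists: a temporary file created only for this illustration
tmp <- tempfile(fileext = ".csv")
writeLines("a,b", tmp)
check_path(tmp)

# system.file() returns "" when the file does not exist in the package
missing_pth <- system.file("extdata/NoSuchFile.xlsx", package = "stats")
res <- try(check_path(missing_pth), silent = TRUE)
inherits(res, "try-error")
```

The same helper could be applied to `contpth` and `dqopth` before calling the import functions.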

## Continuous monitoring data

Use `readASRcont()` to import continuous monitoring data. The function reads the file, automatically runs a series of checks via `checkASRcont()`, and then formats the result for downstream use. The `tz` argument sets the time zone for the output `DateTime` column (see `OlsonNames()` for valid values). The default is Eastern Standard Time with no daylight saving adjustment (`Etc/GMT+5`), so you only need to set `tz` if your data use a different time zone. For example, if your data are in local time and the time zone observes daylight saving time (DST), consider a time zone such as `America/New_York` that adjusts for DST automatically.
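
The difference between a fixed-offset time zone and a DST-aware one can be seen with base R alone. The sketch below is independent of AquaSensR and uses two arbitrary instants chosen for illustration: it formats the same UTC instants under `Etc/GMT+5` and `America/New_York`.

```{r}
#| eval: false
# one instant in summer and one in winter, both defined in UTC
summer <- as.POSIXct("2024-07-01 12:00:00", tz = "UTC")
winter <- as.POSIXct("2024-01-01 12:00:00", tz = "UTC")

# clock time of each instant under the two time zones
hours <- data.frame(
  instant = c("summer", "winter"),
  fixed_offset = c(
    format(summer, "%H:%M", tz = "Etc/GMT+5"),
    format(winter, "%H:%M", tz = "Etc/GMT+5")
  ),
  dst_aware = c(
    format(summer, "%H:%M", tz = "America/New_York"),
    format(winter, "%H:%M", tz = "America/New_York")
  )
)
hours
```

The two zones agree in winter and differ by one hour in summer, which is why the choice of `tz` matters for data recorded in local time.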

AquaSensR accepts two input formats for the date and time information. The examples below demonstrate both.

**Format 1** — separate `Date` and `Time` columns (`ExampleCont1.xlsx`):

```{r}
contdat <- readASRcont(contpth)
```

**Format 2** — combined `DateTime` column (`ExampleCont2.xlsx`):

```{r}
contpth2 <- system.file("extdata/ExampleCont2.xlsx", package = "AquaSensR")
contdat2 <- readASRcont(contpth2)
```

Both calls return identically structured output (see [Output format] below).

### Format requirements

The continuous monitoring data file must follow one of two accepted schemas. Additional unrecognised columns will trigger an error.

**Format 1: separate Date and Time columns**

| Column | Description |
|--------|-------------|
| `Date` | Observation date, parseable by `lubridate::parse_date_time()` in year-first (e.g., `2024-06-01`), month-first (e.g., `06/01/2024`), or day-first (e.g., `01/06/2024`) formats |
| `Time` | Observation time in 24-hour (e.g., `16:30:33`), 12-hour AM/PM (e.g., `4:30:33 PM`), or Excel-native format (e.g., `1899-12-31 16:30:33`) |
| At least one parameter column | Column name must match a `Parameter` entry in `paramsASR` (e.g., `Water_Temp_C`) |

**Format 2: combined DateTime column**

| Column | Description |
|--------|-------------|
| `DateTime` | Combined date and time with the date in year-first (e.g., `2024-06-01 16:30:33`), month-first (e.g., `06/01/2024 16:30:33`), or day-first format, combined with 24-hour or 12-hour AM/PM time (e.g., `2024-06-01 4:30:33 PM`) |
| At least one parameter column | Column name must match a `Parameter` entry in `paramsASR` (e.g., `Water_Temp_C`) |
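
As a rough illustration of the two layouts, the sketch below builds a small table in each format with base R and writes one of them to a temporary CSV file. The observation values are made up for this example. Only `Water_Temp_C` is taken from the table above.

```{r}
#| eval: false
# Format 1: separate Date and Time columns (values are invented)
fmt1 <- data.frame(
  Date = c("2024-06-01", "2024-06-01"),
  Time = c("16:30:33", "16:45:33"),
  Water_Temp_C = c(18.2, 18.3)
)

# Format 2: the same observations with a combined DateTime column
fmt2 <- data.frame(
  DateTime = paste(fmt1$Date, fmt1$Time),
  Water_Temp_C = fmt1$Water_Temp_C
)

# write Format 1 to a temporary CSV file, without row names
tmp <- tempfile(fileext = ".csv")
write.csv(fmt1, tmp, row.names = FALSE)

# read the header back to confirm the column names
names(read.csv(tmp))
```

Note `row.names = FALSE`: without it, `write.csv()` adds an unnamed column of row names, which would count as an unrecognised column.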

AquaSensR currently accepts the following parameters. Each parameter name includes the units, so make sure the name you use matches the units of your data.

```{r}
#| echo: false
paramsASR |>
  select(
    Description = `Label`,
    `Required column name` = `Parameter`,
    Units = `uom`
  ) |>
  knitr::kable()
```

The list above can also be viewed in R with the `paramsASR` dataset, which is included in the package and used for the checks. 

```{r}
paramsASR
```

### Checks performed

The `readASRcont()` function imports the data and runs a series of checks using the `checkASRcont()` function. Most checks stop with an informative error if they fail. The exception is missing values in parameter columns, which produce a warning because gaps are common in continuous data. The checks evaluate the following:

1. **Column names**: all columns are either `Date`, `Time`, `DateTime`, or a recognised parameter from `paramsASR`.
2. **Required columns present**: either `Date` and `Time` (Format 1) or `DateTime` (Format 2).
3. **At least one parameter column**: at least one column matches an entry in `paramsASR$Parameter`.
4. **Date format** *(Format 1 only)*: all values in `Date` are parseable by `lubridate::parse_date_time()` in year-first, month-first, or day-first formats.
5. **Time format** *(Format 1 only)*: all values in `Time` are parseable by `lubridate::parse_date_time()` in 24-hour, 12-hour AM/PM, or Excel-native formats.
6. **DateTime format** *(Format 2 only)*: all values in `DateTime` are parseable by `lubridate::parse_date_time()` with year-first, month-first, or day-first date order combined with 24-hour or 12-hour AM/PM time.
7. **Missing values**: `NA` values in parameter columns produce a warning listing the affected columns and row numbers. Missing values in the `DateTime`, `Date`, or `Time` columns produce an error.
8. **Numeric parameter columns**: all parameter columns contain numeric values.
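
The logic behind checks 7 and 8 can be sketched in base R as follows. This is an illustration of the idea only and is not the code used by `checkASRcont()`. The data are made up, and `Other_Param` is a placeholder column name, not a real AquaSensR parameter.

```{r}
#| eval: false
# made-up data: one NA in a parameter column, one parameter column stored as text
dat <- data.frame(
  DateTime = c("2024-06-01 16:30:33", "2024-06-01 16:45:33", "2024-06-01 17:00:33"),
  Water_Temp_C = c(18.2, NA, 18.4),
  Other_Param = c("8.1", "8.0", "n/a") # placeholder name for illustration
)

time_cols <- intersect(names(dat), c("Date", "Time", "DateTime"))
param_cols <- setdiff(names(dat), time_cols)

# missing values in date/time columns would be an error
time_has_na <- any(is.na(dat[time_cols]))

# missing values in parameter columns: collect row numbers per column
na_rows <- lapply(dat[param_cols], function(x) which(is.na(x)))
na_rows <- na_rows[lengths(na_rows) > 0]
if (length(na_rows) > 0) {
  warning("Missing values in: ", paste(names(na_rows), collapse = ", "))
}

# parameter columns that are not numeric
non_numeric <- param_cols[!vapply(dat[param_cols], is.numeric, logical(1))]
non_numeric
```

Here the text entry `"n/a"` is what makes `Other_Param` non-numeric. It is not treated as a missing value because it is a character string rather than `NA`.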

### Example: triggering an error

Adding an unrecognised column causes `checkASRcont()` to stop immediately. The following example demonstrates this using the Format 1 file.

```{r}
#| error: true
nms <- names(readxl::read_excel(contpth, n_max = 0))
col_types <- ifelse(nms %in% c("Date", "Time", "DateTime"), "text", "guess")
contdat_raw <- suppressWarnings(
  readxl::read_excel(
    contpth,
    col_types = col_types,
    na = c("NA", "na", ""),
    guess_max = Inf
  )
)

contdat_raw$BadColumn <- 1

checkASRcont(contdat_raw)
```
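
When importing many files in a script, it may be preferable to capture an error message and continue rather than stop. The sketch below shows one way to do this with base R's `tryCatch()`. The function `check_stub()` is a hypothetical stand-in that stops on an unrecognised column. It is not `checkASRcont()`, and its error message is invented for this example.

```{r}
#| eval: false
# hypothetical stand-in for a check function that stops on failure
check_stub <- function(dat) {
  extra <- setdiff(names(dat), c("DateTime", "Water_Temp_C"))
  if (length(extra) > 0) {
    stop("Unrecognised column(s): ", paste(extra, collapse = ", "), call. = FALSE)
  }
  invisible(dat)
}

# made-up data with an extra column
bad <- data.frame(
  DateTime = "2024-06-01 16:30:33",
  Water_Temp_C = 18.2,
  BadColumn = 1
)

# return the error message if the check fails, NA otherwise
msg <- tryCatch(
  {
    check_stub(bad)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
msg
```

The same pattern should work with any function that signals failure through `stop()`.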

### Output format

After passing all checks, `readASRcont()` returns a data frame with the same structure regardless of input format:

- `DateTime`: time-zone-aware `POSIXct` column
- One numeric column per parameter present in the input file

```{r}
head(contdat)
```

```{r}
head(contdat2)
```
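
The structure described above can be verified with a few base R calls. The sketch below uses a small made-up data frame, `out`, built to mimic the described output, so it runs without the package. The same calls could be applied to `contdat` or `contdat2`.

```{r}
#| eval: false
# made-up data frame mimicking the described output structure
out <- data.frame(
  DateTime = as.POSIXct(
    c("2024-06-01 16:30:33", "2024-06-01 16:45:33"),
    tz = "Etc/GMT+5"
  ),
  Water_Temp_C = c(18.2, 18.3)
)

# DateTime should be POSIXct and carry a time zone attribute
is_posix <- inherits(out$DateTime, "POSIXct")
tz_used <- attr(out$DateTime, "tzone")

# every other column should be numeric
num_ok <- vapply(out[setdiff(names(out), "DateTime")], is.numeric, logical(1))

is_posix
tz_used
num_ok
```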

## Data quality objectives

The data quality objectives file contains the thresholds used by the quality control checks applied to each parameter (see the [quality control vignette](qcoverview.html) for details). Use `readASRdqo()` to import the data quality objectives. The function reads the workbook, runs checks via `checkASRdqo()`, and returns a formatted data frame.

```{r}
dqodat <- readASRdqo(dqopth)
```

### Format requirements

The workbook must contain exactly the following columns (all required; thresholds you do not want to apply should be left blank / `NA`):

| Column | Description |
|--------|-------------|
| `Parameter` | Parameter name matching `paramsASR$Parameter` |
| `Flag` | Flag level for the thresholds in the row, either "Fail" or "Suspect" |
| `GrMin` | Gross range, lower threshold |
| `GrMax` | Gross range, upper threshold |
| `Spike` | Spike, absolute step size for a flag |
| `FlatN` | Flatline, run length at which a flag is triggered |
| `FlatDelta` | Flatline, the run range (max minus min) must be strictly less than this value to continue the run; a change equal to or greater than `FlatDelta` resets the run |
| `RoCStDv` | Rate of change, multiplier applied to the rolling SD (flag if `\|diff\| > SD × RoCStDv`) |
| `RoCHours` | Rate of change, look-back window length in hours |
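
To make the table more concrete, the sketch below builds a small DQO table in base R and applies the gross range and spike thresholds from its "Fail" row to a made-up temperature series. All threshold and observation values are invented. The comparison rules are assumptions for illustration: a gross range flag for values outside `GrMin` to `GrMax`, and a spike flag when the absolute step from the previous observation is strictly greater than `Spike`. The package's own checks may differ in detail, so see the quality control vignette for the actual behaviour.

```{r}
#| eval: false
# made-up DQO table; unused thresholds are left as NA
dqo <- data.frame(
  Parameter = c("Water_Temp_C", "Water_Temp_C"),
  Flag = c("Fail", "Suspect"),
  GrMin = c(-1, 0),
  GrMax = c(35, 30),
  Spike = c(5, 3),
  FlatN = NA_real_,
  FlatDelta = NA_real_,
  RoCStDv = NA_real_,
  RoCHours = NA_real_
)

# made-up observations
temp <- c(18.2, 18.3, 25.9, 18.4, 36.0)

# thresholds from the "Fail" row
fail <- dqo[dqo$Flag == "Fail", ]

# gross range (assumed rule): outside [GrMin, GrMax]
gross_fail <- temp < fail$GrMin | temp > fail$GrMax

# spike (assumed rule): absolute step from the previous value exceeds Spike
step <- c(NA, abs(diff(temp)))
spike_fail <- !is.na(step) & step > fail$Spike

data.frame(temp, gross_fail, spike_fail)
```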

### Checks performed

The `readASRdqo()` function imports the data quality objectives and runs a series of checks using the `checkASRdqo()` function. The checks evaluate the following and stop with an informative error if any check fails:

1. **Column names**: Should include only `Parameter`, `Flag`, `GrMin`, `GrMax`, `Spike`, `FlatN`, `FlatDelta`, `RoCStDv`, and `RoCHours`
2. **All columns present**: All columns from the previous check should be present
3. **At least one parameter is present**: At least one parameter in the `Parameter` column matches the `Parameter` column in `paramsASR`
4. **Parameter format**: All parameters listed in the `Parameter` column should match those in the `Parameter` column in `paramsASR`
5. **Flag column**: The `Flag` column should contain only "Fail" or "Suspect" entries
6. **Numeric columns**: All columns except `Parameter` and `Flag` should be numeric values
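
Several of these checks reduce to simple set and type comparisons. The base R sketch below illustrates the idea for checks 2, 5, and 6 on a deliberately flawed, made-up table. It is not the code used by `checkASRdqo()`.

```{r}
#| eval: false
# made-up DQO table with three problems: missing columns,
# a misspelled flag, and a threshold column stored as text
dqo <- data.frame(
  Parameter = c("Water_Temp_C", "Water_Temp_C"),
  Flag = c("Fail", "suspect"),
  GrMin = c("-1", "0"),
  GrMax = c(35, 30)
)

required <- c(
  "Parameter", "Flag", "GrMin", "GrMax", "Spike",
  "FlatN", "FlatDelta", "RoCStDv", "RoCHours"
)

# check 2: required columns that are absent
missing_cols <- setdiff(required, names(dqo))

# check 5: rows where Flag is not exactly "Fail" or "Suspect"
bad_flag_rows <- which(!dqo$Flag %in% c("Fail", "Suspect"))

# check 6: threshold columns that are not numeric
threshold_cols <- setdiff(names(dqo), c("Parameter", "Flag"))
non_numeric <- threshold_cols[
  !vapply(dqo[threshold_cols], is.numeric, logical(1))
]

missing_cols
bad_flag_rows
non_numeric
```

In this sketch the flag comparison is case-sensitive, so `"suspect"` is reported as invalid.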

### Example: triggering an error

Supplying an unrecognised parameter name fails the parameter format check:

```{r}
#| error: true
# import the data for the example
dqodat_raw <- suppressWarnings(
  readxl::read_excel(dqopth, na = c("NA", "na", ""), guess_max = Inf)
)

# introduce a typo in the Parameter column
dqodat_raw$Parameter[1] <- "WaterTemp"

checkASRdqo(dqodat_raw)
```

### Output format

After passing all checks, `readASRdqo()` returns a data frame with the columns listed in the format requirements table above, with all threshold columns coerced to numeric.

```{r}
head(dqodat)
```

Once the continuous data and data quality objectives files are successfully imported, the remaining functions in AquaSensR can be used.