4 Data Preprocessing

This section of the demo script preprocesses the spectral and lab data used to build models and make predictions. The following subsections explain the preprocessing steps and their respective functions in more detail.

  • Data preprocessing must be performed for both the reference set, which is used to create the models, and the prediction set, the dataset you are making predictions on.
  • The prediction set and reference set will be the same dataset if you are using a single spectral library, but they can differ if you are using models from one set to make predictions on another.
  • Data preprocessing is executed by the functions within preprocess_functions.R

Get Spectral Library

A spectral library is simply a dataset containing spectral data for various samples. For the purpose of training models, it is also necessary to have corresponding lab data for the soil properties you are interested in predicting. getSpecLib() is a wrapper function that takes a folder of OPUS files containing spectral data, and a ‘csv’ of lab data, and outputs a merged file with all the spectral and lab data for each sample.

getSpecLib()

  • getSpecLib( SPECPATH, LABPATH, SAVENAME )
    • SPECPATH: string- The path to the folder of opus files, from within the ‘Soil-Predictions-Example’ folder. Default set to “/Data_Raw/SPECTRA”
    • LABPATH: string- The path to the ‘csv’ of lab data. This file must include a sample_id column first, followed by the column(s) of lab data for the soil properties of interest. sample_id will be used to merge the spectra to its corresponding lab data
    • SAVENAME: string- The name you would like to save the spectral file after processing. The default is set to “none” which will not save a file.

Output Dataset

Extract Spectra

For Bruker instruments, an OPUS file containing spectral data is output for each sample that is scanned. To compile these separate files into one dataset, we use a couple of functions from the simplerspec package by Philipp Baumann, as well as the stringr and foreach packages.

opus_to_dataset()

  • opus_to_dataset( SPECPATH, NWAVE, SAVENAME )
    • SPECPATH: string- The path to the folder of opus files, from within the ‘Soil-Predictions-Example’ folder. Default set to “/Data_Raw/SPECTRA”
    • NWAVE: integer- The number of wavelengths to extract. This will be set on the FTIR before running. The default is set to 3017, which we use at WHRC
    • SAVENAME: string- The name you would like to save the spectral file after processing. The default is set to “none” which will not save a file.

Load appropriate packages for opus_to_dataset()

Get the paths of all OPUS files…

A single path will look something like this: /Soil-Predictions-Example/Data_Raw/ref-SPECTRA/WHRC03405_S_001_030.0

Extract the spectra and gather them into a tibble data frame…

Truncate the dataset to the number of wavelengths specified, to ensure the spectra from different samples align…

Process spectra into a dataframe and assign a sample_id based on the file names (see note 1)

Reformat the dataframe to have all the spectra as a matrix column…

Optionally save the spectra as an R dataset and csv file if SAVENAME is passed…

Process Spectra

Subset Spectral Range

To yield the best predictions, we exclude regions of the spectral range that may be problematic. The following function narrows the spectra by removing wavenumbers below 628 and between 2268 and 2389, a CO2-sensitive region.
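As a rough illustration of this kind of subsetting, the sketch below drops the two regions from a toy spectral matrix. It assumes the columns of the matrix are named by wavenumber and that the boundaries of the CO2 region are excluded inclusively; the helper name subset_spectral_range_sketch() is hypothetical and is not the function from preprocess_functions.R.

```r
# Toy spectral matrix: 3 samples, columns named by wavenumber (assumed layout).
set.seed(1)
wavenumbers <- seq(4000, 600, by = -2)
spc <- matrix(runif(3 * length(wavenumbers)), nrow = 3,
              dimnames = list(NULL, as.character(wavenumbers)))

# Hypothetical helper: keep wavenumbers >= 628 and outside 2268-2389.
subset_spectral_range_sketch <- function(spc) {
  wn <- as.numeric(colnames(spc))
  keep <- wn >= 628 & !(wn >= 2268 & wn <= 2389)
  spc[, keep, drop = FALSE]
}

spc_sub <- subset_spectral_range_sketch(spc)
dim(spc_sub)  # same number of rows, fewer columns
```

Selecting by column name rather than by position keeps the sketch independent of how many wavenumbers the instrument recorded.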

Baseline Transformation

We can perform a baseline transformation to normalize the spectra, by subtracting the minimum values for each row/sample.
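A minimal sketch of this row-wise offset, using a small made-up matrix. The helper name base_offset_sketch() is hypothetical; the function in the demo script may differ in detail.

```r
# Two toy spectra (rows) measured at three wavenumbers (columns).
spc <- matrix(c(0.5, 0.7, 0.9,
                1.2, 1.0, 1.6), nrow = 2, byrow = TRUE)

# Hypothetical helper: subtract each row's minimum from that row.
base_offset_sketch <- function(spc) {
  sweep(spc, 1, apply(spc, 1, min), "-")
}

spc_bl <- base_offset_sketch(spc)
apply(spc_bl, 1, min)  # every spectrum now has a minimum of 0
```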

Other Transformations

  • Standard Normal Variate
  • Derivatives
  • Continuum removal
  • Multiplicative scatter correction
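Of the transformations listed above, standard normal variate (SNV) is the simplest to illustrate: each spectrum is centered on its own mean and divided by its own standard deviation. The sketch below uses base R only, and snv_sketch() is a hypothetical name rather than a function from the demo script.

```r
# Toy spectral matrix: 4 samples (rows) by 6 wavenumbers (columns).
set.seed(2)
spc <- matrix(runif(4 * 6, min = 0.1, max = 2), nrow = 4)

# Hypothetical helper: row-wise centering and scaling (SNV).
snv_sketch <- function(spc) {
  centered <- sweep(spc, 1, rowMeans(spc), "-")
  sweep(centered, 1, apply(spc, 1, sd), "/")
}

spc_snv <- snv_sketch(spc)
round(rowMeans(spc_snv), 10)  # each row now has mean 0
```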

(Calibration Transfer)

{Optional} Recommended when the spectral library of samples to be predicted was scanned on a different instrument than the samples used to build the model. For example, you would want to perform a calibration transfer on the prediction set if you were using the KSSL library to make predictions on samples scanned at Woods Hole Research Center.

Merge with Lab Data

If there is lab data associated with your soil samples, this can be merged with the spectral data and later used to assess the performance of your models. The example lab dataset below provides information about where the soil sample was taken with the Site_ID and Horizon, as well as the lab measurements for various soil properties including Organic Carbon, Sand, Silt and Clay.

The merge() command joins the lab dataset to the spectral dataset. The all.y=TRUE parameter indicates that the final dataset will contain all the rows of spectra. This means that if some samples do not have lab data, they will be assigned a value of NA but the spectra will remain in the set.
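The behavior of all.y=TRUE can be seen on a toy example. The data below are made up for illustration: three samples have spectra but only two have lab data, and the column name OC is an assumption.

```r
# Made-up lab data for two samples.
lab <- data.frame(sample_id = c("S1", "S2"),
                  OC = c(1.8, 0.6),
                  stringsAsFactors = FALSE)

# Made-up spectra for three samples, stored as a matrix column 'spc'.
set.seed(1)
spectra <- data.frame(sample_id = c("S1", "S2", "S3"),
                      stringsAsFactors = FALSE)
spectra$spc <- matrix(runif(3 * 4), nrow = 3)

# all.y = TRUE keeps every row of the second argument (the spectra).
speclib <- merge(lab, spectra, by = "sample_id", all.y = TRUE)
speclib[, c("sample_id", "OC")]  # S3 is kept, with OC set to NA
```

Note that the spectra must be the second argument for all.y=TRUE to preserve them; if the argument order were reversed, all.x=TRUE would be needed instead.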

Optionally save the spectra as an R dataset and csv file if SAVENAME is passed within getSpecLib()…

The final dataframe contains a unique ID, lab data, and a matrix of spectral data called ‘spc’. It is suggested to save this file as RData so it may be reloaded as needed.

Refine Spectral Library

You may want to refine the samples you use to build your model or make predictions on by…

  • Subsetting the set to 15000 samples if it is large
  • Eliminating samples with NA, negative, or outlier lab data
  • Splitting the set or property subsets into calibration and validation groups

refineSpecLib()

  • refineSpecLib( SPECLIB, PROP, OUTLIER, LARGE, CALVAL, SAVENAME )
    • SPECLIB: dataframe- A dataframe object containing a sample_id column, a matrix of spectral data as column spc, and lab data columns if you intend to use the PROP parameter.
    • PROP: string- The column name of the soil property of interest. If this is passed, the set will be refined using this lab data. Default is set to NA.
    • OUTLIER: vector- A vector containing the outlier removal methods you would like applied to the data. Default is c(“stdev”) which detects lab data outliers. It can also be set to c(“fratio”) to detect spectral outliers, both c(“stdev”, “fratio”), or neither c(“none”)
    • LARGE: boolean- Set to TRUE if you have a large dataset you would like to subset. Default is FALSE.
    • CALVAL: boolean- Set to TRUE if you would like to create calibration and validation sets. It will return the dataset with a column, calib, that is assigned 1 for samples in the calibration set and 0 for those in the validation set. 80/20 split. Default is set to FALSE.
    • SAVENAME: string- The name you would like to save the spectral file after processing. The default is set to “none” which will not save a file.

Large Sets

If your reference set exceeds 15000 samples, you may choose to subset it. We have found that 15000 is optimal for the speed and performance of the models when the reference set is very large. This subset can be selected using conditional Latin hypercube sampling with the clhs package.

sub_large_set()

  • sub_large_set( SPECLIB, SUBCOUNT)
    • SPECLIB: dataframe- Dataframe including spectral data as a matrix ‘spc’
    • SUBCOUNT: integer- Number of samples to subset. Default is set to 15000

Faulty Lab Data

To yield the best models, we can exclude rows with faulty lab data (NA, negative, and outlier values). This may vary by soil property, so the process should be repeated for each property.

noNA()

  • noNA( dataset, column )
    • dataset: dataframe to eliminate NAs from
    • column: column to check for NA values

Gets rid of NA values…

noNeg()

  • noNeg( dataset, column )
    • dataset: dataframe to eliminate negative values from
    • column: column to check for negative values

Gets rid of Negative values…
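The two filters can be sketched in a few lines of base R. The helper names drop_na_sketch() and drop_negative_sketch() are hypothetical stand-ins that mirror the ( dataset, column ) signature described above; they are not the bodies of noNA() and noNeg().

```r
# Made-up lab data with one NA and one negative value.
lab <- data.frame(sample_id = c("S1", "S2", "S3", "S4"),
                  OC = c(1.2, NA, -0.3, 0.8),
                  stringsAsFactors = FALSE)

# Hypothetical helper: drop rows where 'column' is NA.
drop_na_sketch <- function(dataset, column) {
  dataset[!is.na(dataset[[column]]), , drop = FALSE]
}

# Hypothetical helper: drop rows where 'column' is negative.
# NA rows are left alone here so that each filter does one job.
drop_negative_sketch <- function(dataset, column) {
  values <- dataset[[column]]
  dataset[is.na(values) | values >= 0, , drop = FALSE]
}

clean <- drop_negative_sketch(drop_na_sketch(lab, "OC"), "OC")
clean$sample_id  # S1 and S4 remain
```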

Outliers

We can identify both outliers in the lab data and outliers in the spectral data, to optimize our models. The following functions are called based on the variable OUTLIER passed in refineSpecLib(). They are stored in outlier_functions.R

Lab Data Outliers

This outlier detection approach creates a PLS model, regresses the predictions against the lab data, and identifies the 1% of samples that were farthest from the line of best fit. These samples will be printed in the console and highlighted on a plot like the one below. In this case, the sample size is about 600, so 6 outliers were identified.
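The flagging step can be sketched as follows. To keep the example self-contained, the PLS predictions are replaced by simulated values (the lab values plus noise), and distance from the line of best fit is approximated by the absolute residual of a linear fit. Both are simplifying assumptions, and flag_lab_outliers_sketch() is a hypothetical name.

```r
set.seed(1)
n <- 600
observed  <- runif(n, 0, 10)                 # made-up lab values
predicted <- observed + rnorm(n, sd = 0.5)   # stand-in for PLS predictions

# Hypothetical helper: flag the fraction of samples farthest from the fit.
flag_lab_outliers_sketch <- function(observed, predicted, frac = 0.01) {
  fit <- lm(predicted ~ observed)            # regress predictions on lab data
  distance <- abs(residuals(fit))            # vertical distance from the line
  n_out <- round(frac * length(observed))
  order(distance, decreasing = TRUE)[seq_len(n_out)]
}

outlier_rows <- flag_lab_outliers_sketch(observed, predicted)
length(outlier_rows)  # 1% of 600 samples
```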

Spectral Outliers

There may also be outliers in the spectral data, which we can detect and remove. Eliminating them from the reference set may build a stronger model, so this can be performed in the preprocessing stage. For the prediction set, you may also want to see where the prediction samples lie within the reference set. Is the reference set representative of the prediction set samples? If they fall too far outside of the reference set space, you might decide it is not appropriate to make predictions for these samples.

To detect spectral outliers, we can look at the spectra in principal component space and identify which samples are farthest from the center of the data. We use a statistic called the fratio, to measure this variation. Using the scores and loadings from principal component analysis, we get predictions of the spectra, which can be compared to the actual values of the spectra. Looking at the probability distribution of the residuals with the fratio, we can flag samples that vary significantly from their predicted values.
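The idea can be sketched in base R as below. This is one possible formulation, not necessarily the one used in fratio_outliers(): the number of components (npc), the use of centering without scaling, and the degrees of freedom passed to pf() are all assumptions, and the data are simulated with one deliberately corrupted spectrum.

```r
set.seed(42)
n <- 100; p <- 20
# Simulated spectra with two underlying components plus small noise.
spc <- matrix(rnorm(n * 2), nrow = n) %*% matrix(rnorm(2 * p), nrow = 2) +
  matrix(rnorm(n * p, sd = 0.05), nrow = n)
spc[7, ] <- spc[7, ] + rnorm(p, sd = 2)   # sample 7 is a deliberate outlier

# Hypothetical helper: flag samples whose reconstruction residual is large.
fratio_sketch <- function(spc, p_threshold = 0.99, npc = 2) {
  pca <- prcomp(spc, center = TRUE, scale. = FALSE)
  # Spectra predicted from the first npc scores and loadings.
  reconstructed <- pca$x[, 1:npc, drop = FALSE] %*%
    t(pca$rotation[, 1:npc, drop = FALSE])
  centered <- sweep(spc, 2, pca$center, "-")
  res <- sqrt(rowSums((reconstructed - centered)^2))
  n <- length(res)
  # Ratio of each sample's squared residual to the mean of all the others.
  fratio <- sapply(seq_len(n), function(i) {
    (n - 1) * res[i]^2 / sum(res[-i]^2)
  })
  which(pf(fratio, 1, n) > p_threshold)
}

flagged <- fratio_sketch(spc)
7 %in% flagged  # the corrupted sample should be among those flagged
```

With a threshold of 0.99, a few ordinary samples may occasionally be flagged as well, which is why plotting the result before removing anything is useful.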

The 3D plot below shows an example of spectral outliers being detected in the principal component space of PC1, PC2 and PC3:

fratio_outliers()

  • fratio_outliers( SPECLIB, P, SHOW, PLOT )
    • SPECLIB: dataframe- A dataframe to eliminate spectral (fratio) outliers from. Must include spectral matrix as ‘spc’ column.
    • P: double- A fraction signifying the threshold to be used for detecting outliers. Default is set to 0.99
    • SHOW: boolean- Whether or not to show results. Default is TRUE.
    • PLOT: boolean- Whether or not to plot results. Default is TRUE.

Cal/Val Groups

You may choose to subset a portion of the reference set as a calibration group, which will be used to build the models, leaving the remaining samples as the validation set to test the model. Kennard-Stone is a method for performing this type of separation while ensuring each group is representative of the set. The following function returns the spectral library with an additional column, ‘calib’, signifying whether or not the sample is in the calibration set. This is used in the demo script to define the reference set and prediction set:
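For intuition, the Kennard-Stone selection can be sketched in base R: start from the two most distant samples, then repeatedly add the sample whose nearest already-selected neighbor is farthest away. This is a simplified illustration on made-up data using Euclidean distance on the raw spectra; kennard_stone_sketch() is a hypothetical name, and tested implementations exist in packages such as prospectr (kenStone()).

```r
set.seed(3)
speclib <- data.frame(sample_id = paste0("S", 1:20), stringsAsFactors = FALSE)
speclib$spc <- matrix(rnorm(20 * 5), nrow = 20)   # made-up spectra

# Hypothetical helper: return the row indices of k selected samples.
kennard_stone_sketch <- function(x, k) {
  d <- as.matrix(dist(x))
  # Seed the selection with the two most distant samples.
  selected <- as.integer(which(d == max(d), arr.ind = TRUE)[1, ])
  while (length(selected) < k) {
    remaining <- setdiff(seq_len(nrow(x)), selected)
    # Distance from each remaining sample to its nearest selected sample.
    nearest <- apply(d[remaining, selected, drop = FALSE], 1, min)
    selected <- c(selected, remaining[which.max(nearest)])
  }
  selected
}

# 80/20 split: 1 marks calibration samples, 0 marks validation samples.
calib_rows <- kennard_stone_sketch(speclib$spc, k = round(0.8 * nrow(speclib)))
speclib$calib <- as.integer(seq_len(nrow(speclib)) %in% calib_rows)
table(speclib$calib)
```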


  1. Numeric sample_ids may cause issues while merging, so a string ID is advised.