Welcome back!

Previously, we looked at some basic visualization of spectroscopic data. In this post, we begin preprocessing the data for further analysis. Namely, we present a simple, yet effective baseline correction algorithm. Now, you might be wondering what baseline correction is. For an answer, we have to look at the various constituents of spectra.

Spectra are generally composed of three types of signal: the analyte-specific signal, which we care about; some high-frequency (white) noise; and lastly, a low-frequency signal generally referred to as the baseline. Naturally, we want to suppress the latter two, as their presence is detrimental to the quality of the spectra. There are two corresponding quantities commonly used to quantify this detrimental effect: noise is linked to the signal-to-noise ratio (SNR), while the baseline is linked to the signal-to-background ratio (SBR). As the names suggest, a higher signal and a lower baseline (or noise) result in a higher SBR (or SNR). Unfortunately, there is little you can do about your signal once it is acquired. Thus, what's left is to numerically reduce the baseline (or noise).
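To make these two quantities concrete, here is a minimal sketch with made-up numbers (the peak height, baseline level, and noise level below are purely illustrative, not taken from our dataset; exact conventions for both ratios vary slightly across the literature):

In [ ]:
# Illustrative numbers for a single emission line:
peakHeight = 1000.0    # net analyte signal (counts)
baselineLevel = 200.0  # low-frequency background under the line (counts)
noiseStd = 10.0        # standard deviation of the white noise (counts)

snr = peakHeight / noiseStd       # signal-to-noise ratio -> 100
sbr = peakHeight / baselineLevel  # signal-to-background ratio -> 5

print(f"SNR = {snr:.0f}, SBR = {sbr:.0f}")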

Now, this might seem like a trivial task. Unfortunately, it is not. The issue is that we cannot really tell what the baseline under the emission lines is, as it is obscured by the emission lines themselves. Hence, we have to estimate the baseline under the emission lines from points (signal) adjacent to the emission lines. Consequently, there is a relatively rich literature addressing the issue, e.g., Dyar et al., Hu et al., and Liu & Yu.

While numerous algorithms have been developed over the years, in this post we will stick to one of the simplest ones. Here, by simple we mean not only low algorithmic complexity, but also the possibility of automation. The simplest approach would be connecting two points adjacent to the emission line, as sketched below. However, selecting these points must be done manually.
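For illustration, here is a minimal sketch of that two-point approach (the function name and the anchor indices are ours, chosen purely for the example):

In [ ]:
import numpy as np

def twoPointBaseline(spectrum, leftIdx, rightIdx):
    # Connect two manually selected points adjacent to an emission line
    # and take the resulting straight line as the baseline estimate
    segment = np.arange(leftIdx, rightIdx + 1)
    baseline = np.interp(segment,
                         [leftIdx, rightIdx],
                         [spectrum[leftIdx], spectrum[rightIdx]])
    # Subtract the linear baseline from the spectrum under the line
    return spectrum[segment] - baseline

Picking leftIdx and rightIdx for every emission line is exactly the manual step that makes this approach impractical for large datasets.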

Thus, we will demonstrate baseline correction using an algorithm originally published by Yaroshchyk & Eberhardt, which we found to perform very well with minimal optimization. Their approach consists of two steps: first, the moving minima of the spectrum are determined by a moving minimum filter with a selected window width. Next, the resulting "moving minima spectrum" is smoothed by a moving Gaussian filter. The window width of the Gaussian filter can differ from that of the moving minimum filter. Consequently, there are two hyperparameters to optimize. Nevertheless, by choosing both window widths to be wider than a typical emission line, the algorithm generally performs well. Let's see baseline correction in action.

We begin with loading in the necessary libraries.

In [2]:
import os

import h5py
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

Let's read in a limited subset (10 spectra for each specimen) of our benchmark dataset. You might recognize the code below from our post on loading in our contest dataset. You might also notice that this is not the best-performing version we presented there. However, for low spectra counts, we noticed little difference in execution speed, and this version is less memory-hungry.

In [3]:
os.chdir("C:/Users/kepes/Projects/blog/dataImport/data")

spectraCount = 10

trainFile = h5py.File("contest_TRAIN.h5",'r')

wavelengths = trainFile["Wavelengths"]
wavelengths = wavelengths[list(wavelengths.keys())[0]][()]

# Take the first `spectraCount` spectra of each specimen
# (the training set stores 500 consecutive spectra per specimen)
indices = [j for i in range(0, 50000, 500) for j in range(i, i + spectraCount)]
trainClass = trainFile["Class"]["1"][()][indices]

# Assemble the training matrix sample by sample (spectra in rows)
trainData = None
for sample in trainFile["Spectra"]:
    tempData = trainFile["Spectra"][sample][()]
    tempData = tempData[:, 0:spectraCount]
    if trainData is None:
        trainData = tempData.transpose()
    else:
        trainData = np.append(trainData, tempData.transpose(), axis = 0)

trainFile.close()
del tempData, sample

Next, we define the baseline subtraction function. At this point, we aim to apply it to each individual spectrum, i.e., we will apply it to each row of the dataset. Thus, it will take a single spectrum as input, along with the hyperparameters (the window widths of the two moving functions).

In [4]:
def baselineYaroschyk(inputSpectrum, movMinWindow = 100, smoothingWindow = 200):
    spectrum = pd.Series(inputSpectrum)

    # Step 1: moving minimum with a window of movMinWindow points
    movMin = spectrum.rolling(movMinWindow, center = True, min_periods = 1).min()
    # Step 2: smooth the moving minima with a Gaussian-weighted moving mean;
    # min_periods = 1 avoids NaNs at the edges of the spectrum
    baseline = movMin.rolling(window = smoothingWindow, win_type = 'gaussian',
                              center = True, min_periods = 1).mean(std = smoothingWindow/2)

    # Stack the corrected spectrum, the estimated baseline,
    # and the raw moving minima as the rows of a (3, n) array
    return np.vstack([spectrum - baseline, baseline, movMin])

For validation purposes, apart from the baseline-corrected spectrum, we also return the estimated baseline and the moving minima spectrum (the raw output of the moving minimum filter), stacked as the rows of a single array.

To see the outcome of the baseline correction, we first plot a whole spectrum and use it to select a limited spectral range for the upcoming visualization.

In [14]:
plt.figure(figsize=[15, 8])

plt.plot(wavelengths[:], trainData[0,:])
plt.xlabel('Wavelength (nm)', fontsize=20)
plt.ylabel('Intensity (a.u.)', fontsize=20)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.show()
[Figure: the full spectrum of the first training sample; intensity (a.u.) vs. wavelength (nm)]

It seems the part around 600 nm possesses a high baseline. Hence, in the following, we will plot only that portion of the spectrum. With this settled, we apply the baseline correction to each spectrum.

In [16]:
# Apply the correction to each row; ppData has shape (spectra, 3, wavelengths)
ppData = np.apply_along_axis(baselineYaroschyk, axis=1, arr=trainData, movMinWindow=100, smoothingWindow=200)

And plot the results.

In [23]:
selWlRange = range(20000, 22000)

plt.figure(figsize=[15, 8])

# Rows of ppData[i]: 0 = corrected spectrum, 1 = estimated baseline, 2 = moving minima
plt.plot(wavelengths[selWlRange], trainData[0, selWlRange])
plt.plot(wavelengths[selWlRange], ppData[0, 0, selWlRange])
plt.plot(wavelengths[selWlRange], ppData[0, 1, selWlRange])
plt.plot(wavelengths[selWlRange], ppData[0, 2, selWlRange])

plt.xlabel('Wavelength (nm)', fontsize=20)
plt.ylabel('Intensity (a.u.)', fontsize=20)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)

plt.legend(['Initial spectrum', 'Baseline-corrected spectrum', 'Estimated baseline', 'Moving minima'], prop={'size': 20})
plt.show()
[Figure: the selected spectral range showing the initial spectrum, the baseline-corrected spectrum, the estimated baseline, and the moving minima]

And that is all. We can see the initial spectrum, the estimated baseline, and the baseline-corrected spectrum. All this without extensive fine-tuning of the baseline correction function's hyperparameters. Nevertheless, feel free to play around with those hyperparameters and have a look at how they influence the results; a quick sketch of such a sweep follows below.
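As a starting point, here is a minimal sketch of such a sweep, reusing baselineYaroschyk as defined above (the window-width combinations below are arbitrary picks for illustration, not tuned recommendations):

In [ ]:
plt.figure(figsize=[15, 8])
plt.plot(wavelengths[selWlRange], trainData[0, selWlRange], label='Initial spectrum')

# Try a few window-width combinations and compare the estimated baselines
for movMin, smoothing in [(50, 100), (100, 200), (200, 400)]:
    corrected, baseline, movMinima = baselineYaroschyk(trainData[0, :],
                                                       movMinWindow=movMin,
                                                       smoothingWindow=smoothing)
    plt.plot(wavelengths[selWlRange], baseline[selWlRange],
             label=f'Baseline ({movMin}, {smoothing})')

plt.xlabel('Wavelength (nm)', fontsize=20)
plt.ylabel('Intensity (a.u.)', fontsize=20)
plt.legend(prop={'size': 16})
plt.show()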

Till next time...


References

[1] Kepes et al.; Influence of baseline subtraction on laser-induced breakdown spectroscopic data

[2] Yaroshchyk & Eberhardt; Automatic correction of continuum background in Laser-induced Breakdown Spectroscopy using a model-free algorithm