Welcome back in our Hands-on section for processing of spectroscopic data in Python. At this point, we suppose you have already gone through the last post, considering data formats and importing. You should have imported the benchmark dataset and loaded the following variables:

  • trainData
  • trainClass
  • wavelengths
  • testData

Today, we will use all mentioned variables except testData. Firstly, just simple plotting and visualization, followed by the demonstration of the PCA algorithm.

Let's start with the importing of useful libraries. matplotlib and numpy are just essential Python libraries used almost always. However, the seaborn library adjust the appearance of figures created by Matplotlib (similar to beloved ggplot-style plots used in R). pandas offer great tools for handling large datasets and operations on them.

In [0]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
sns.set()

The most basic visualization task is to plot a spectrum. We need to create a Pandas DataFrame object to combine wavelength values and corresponding intensities. Here, we would like to plot the first spectrum of the dataset, which is indexed as 0 (beware of this indexing if you are coming from MATLAB or R).

In [0]:
df=pd.DataFrame({'wave': wavelengths, 'intensity': trainData[0,:] })

fig = plt.figure()
f1 = fig.add_subplot(1, 1, 1)
f1.plot('wave', 'intensity', data=df)
f1.set_xlabel('Wavelength (nm)')
f1.set_ylabel('Intensity (-)')
fig.set_size_inches(12, 6)

While this simple matplotlib figure is often all you need, we will demonstrate more advanced, interactive plotting.

Let's continue with exploring the benchmark dataset (check out more details about its complexity). It is often beneficial to average spectra in large datasets to obtain some general information. We will create a mean spectrum for each class. For this, the numpy library offers the simplest solution. In total, there are 12 classes.

In [0]:
trainData_mean = []
for i in range(12):
    trainData_mean.append(np.mean(trainData[trainClass == i], axis=0))

Now, we could use matplotlib again, but for interactive handling, the plotly library is a better option. Here, we define a function for plotting with two arguments. A cycle is used to add spectra for all classes.

In [0]:
import plotly.graph_objects as go
def plot_interactive(y_val , x_val):
    fig = go.Figure()
    for i in range(12):
        fig.add_trace(go.Scatter(x=x_val, y=y_val[:,i],
                    mode='lines',
                    name= str (i+1) + '. class'))

    fig.update_layout(
        title = "Mean spectrum for each class",
        xaxis_title = "Wavelength (nm)",
        yaxis_title = "Relative intensity (-)"
    )
    fig.show()

Finally, we may apply the function to our data. Our data matrix is transposed to be prepared for plotting. Explore the broad interactivity of the plotly figures (zoom in, click on the legend to select specific spectra to plot and much more)!

In [0]:
plot_interactive(np.transpose(trainData_mean), wavelengths)

As was promised, we will continue with simple PCA implementation using the scikit-learn library.

In [0]:
from sklearn.decomposition import PCA

pca = PCA(n_components=20)
pca_scores = pca.fit_transform(trainData)


fig = plt.figure()
fig.set_size_inches(16, 9)
f2 = fig.add_subplot(1, 1, 1)
plt.scatter(pca_scores[:, 0], pca_scores[:, 1],
            c=trainClass, edgecolor='none', alpha=1,
            cmap='Paired' )
plt.colorbar(ticks=np.linspace(1,12,12), label='Class')
plt.clim(0.5, 12.5)
f2.set_xlabel('Principal component 1')
f2.set_ylabel('Principal component 2')
Out[0]:
Text(0, 0.5, 'Principal component 2')

This output is a big mess! Of course it is, as we are using the benchmark dataset, which was designed to be highly non-trivial. To be more pedagogical and to obtain those nice demonstrative score-plots, we will select only a small subset of the dataset (e.g. only class 4 and 10) and compute principal components again.

In [0]:
trainData_class4 = trainData[np.nonzero(trainClass == 4)]
trainData_class10 = trainData[np.nonzero(trainClass == 10)]
trainData_subset = np.append(trainData_class4, trainData_class10, axis = 0)
trainClass_subset = np.append(trainClass[np.nonzero(trainClass == 4)], trainClass[np.nonzero(trainClass == 10)], axis = 0)
In [0]:
pca_scores = pca.fit_transform(trainData_subset)

fig = plt.figure()
fig.set_size_inches(16, 9)
f2 = fig.add_subplot(1, 1, 1)
plt.scatter(pca_scores[:, 0], pca_scores[:, 1],
            c=trainClass_subset, edgecolor='none', alpha=1,
            cmap='Paired' )
plt.colorbar(ticks=np.linspace(1,12,12), label='Class')
plt.clim(0.5, 12.5)
f2.set_xlabel('Principal component 1')
f2.set_ylabel('Principal component 2')
Out[0]:
Text(0, 0.5, 'Principal component 2')

Now it is much easier to imagine some simple binary classifier to separate spectra. In our future posts, we will build some classifiers.

Having a PCA score plot, it is important to visualize the contribution of original variables to specific components. This is called loadings (in fact, loadings are eigenvectors multiplied by the square root of corresponding eigenvalues).

In [0]:
pca = PCA(n_components=5)
pca_train_scores = pca.fit_transform(trainData_subset)

loadings = np.sqrt(pca.explained_variance_) * pca.components_

#loadings = np.array([list(np.sqrt(pca.explained_variance_[i]) * pca.components_[i]) for i in range(10)])
In [0]:
loadings_df = pd.DataFrame(loadings.T)
loadings_df.insert(0,'Wavelength', wavelengths)
loadings_df.plot(x = "Wavelength", figsize=(16,9))

#ax = ts.plot()
#ax.set_xlim(pd.Timestamp('2015-02-15'), pd.Timestamp('2015-07-01'))
#ax.set_ylim(-5, 5)
Out[0]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ccbd91da0>

References

[1] Kepes et. al., Benchmark paper

[2] Vrabel et. al., Contest paper