Welcome back.

In our previous post, we introduced a spectroscopic dataset aimed at benchmarking classification models. In this post, we will load that dataset: first, we will go through the loading script provided in the repository alongside the dataset, and then we will go a step further and make that script faster. This marks the beginning of a series of posts dealing with the processing of spectroscopic data in Python.

Let's dive in and import the necessary libraries.

In [1]:
import os
import h5py
import numpy as np
import time
import matplotlib.pyplot as plt
import random

from sklearn.model_selection import train_test_split

Next, we change the working directory to the folder containing the dataset. Adjust this path to your own setup.

In [2]:
# os.chdir("C:/path/to/your/data/")
os.chdir('D:/Projects/EMSLIBS_contest/Data/')

As mentioned in the previous post, the dataset contains 500 spectra for each specimen. However, for experimenting and testing your code, you might want to use only a subset of the full dataset. Thus, here we choose the number of spectra to load for each specimen.

In [3]:
spectraCount = 100

Next, we open a connection to the .hdf5 file containing the training data. It is probably worth mentioning that "hdf" stands for "hierarchical data format" and the 5 marks its version number. We opted for this format as it is platform independent and libraries for handling it are readily available in most programming languages. Moreover, .hdf files can include brief metadata in the form of user-defined attributes. You can find out more about the format here.
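As a quick aside, h5py also makes it easy to peek at a file's structure and at any user-defined attributes before committing to a full load. Below is a minimal sketch; the training file used here may carry few or no attributes, in which case the printed dictionaries are simply empty.

# Sketch: list the top-level groups and their attributes (if any)
with h5py.File("contest_TRAIN.h5", 'r') as f:
    for name, group in f.items():
        print(name, dict(group.attrs))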

In [4]:
trainFile = h5py.File("contest_TRAIN.h5",'r')
In [5]:
print([trainFile.keys()])
[trainFile[key] for key in trainFile.keys()]
# for key in trainFile.keys(): print(trainFile[key])
[<KeysViewHDF5 ['Class', 'Spectra', 'Wavelengths']>]
Out[5]:
[<HDF5 group "/Class" (1 members)>,
 <HDF5 group "/Spectra" (100 members)>,
 <HDF5 group "/Wavelengths" (1 members)>]
In [6]:
wavelengths = trainFile["Wavelengths"]
wavelengths = wavelengths[list(wavelengths.keys())[0]][()]

wlMin = min(wavelengths)
wlMax = max(wavelengths)
wlCount = len(wavelengths)
res = (wlMax - wlMin)/wlCount

print("There are %d wavelength values ranging from %02d nm to %d nm with a resolution of %0.2f nm." \
      % (wlCount, wlMin, wlMax, res))
There are 40002 wavelength values ranging from 200 nm to 1000 nm with a resolution of 0.02 nm.

Now, let's explore the classification labels contained in the 'Class' group of the file.

In [7]:
print([trainFile["Class"]["1"]])
[<HDF5 dataset "1": shape (50000,), type "<f8">]

Pretty self-explanatory.

Next, let's assign the labels to a variable. Here, we truncate the whole vector according to the number of spectra we want to load in for each specimen.

In [8]:
trainClass = trainFile["Class"]["1"][()]
for i in range(0,50000,500):
    if i == 0:
        tempLabel = trainClass[0:spectraCount]
    else:
        tempLabel = np.append(tempLabel, trainClass[i:(i+spectraCount)])
        
trainClass = tempLabel

del tempLabel, i

Note that indexing with [()] replaces the .value attribute, which has been deprecated in recent versions of h5py.

While the loop above does the job (and is the approach used in the repository script), there is a more concise way of achieving the same:

In [9]:
# Keep the first spectraCount indices out of every block of 500 spectra
indices = [j for i in range(0, 50000, 500) for j in range(i, i + spectraCount)]
trainClass = trainFile["Class"]["1"][()][indices]

Now that we have the wavelengths and the class labels, let's load in the spectra themselves. We will time this process, as we will present a faster version below.

In [10]:
start = time.time()

if 'trainData' in locals(): del trainData
for sample in list(trainFile["Spectra"].keys()):
    tempData = trainFile["Spectra"][sample][()]
    tempData = tempData[:,0:spectraCount]
    if "trainData" not in locals():
        trainData = tempData.transpose()
    else:
        trainData = np.append(trainData, tempData.transpose(), axis = 0)
        
print('There are %d spectra in the training dataset and there are %d corresponding labels' \
      % (trainData.shape[0], len(trainClass)))
print("Code executed in %f s" % (time.time() - start))

trainFile.close()
del tempData, sample
There are 10000 spectra in the training dataset and there are 10000 corresponding labels
Code executed in 182.012827 s

Since the dataset was conceived as the subject of a classification contest, we did not specify the training-validation split. Rather, it was left to the contestants' judgement. It turned out to be a significant part of the data preparation, but more on this in a later post.

Let's plot a random spectrum (row) of the loaded dataset for visual validation.

In [11]:
plt.figure(figsize=(4, 2), dpi= 200, facecolor='w', edgecolor='k')
plt.plot(wavelengths, trainData[random.randint(0, trainData.shape[0] - 1), :])
Out[11]:
[<matplotlib.lines.Line2D at 0x2c58003cc08>]

Looks like a spectrum! Let's split the dataset into training and validation sets.

In [12]:
trainData, validData, trainClass, validClass = train_test_split(trainData, trainClass, test_size=0.33, random_state=42)

print('There are %d spectra in the training dataset and there are %d corresponding labels. \
Similarly, there are %d spectra in the validation dataset and there are %d corresponding labels.' \
      % (trainData.shape[0], len(trainClass), validData.shape[0], len(validClass) ))
There are 6700 spectra in the training dataset and there are 6700 corresponding labels. Similarly, there are 3300 spectra in the validation dataset and there are 3300 corresponding labels.
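As an aside, train_test_split() also accepts a stratify argument that keeps the class proportions similar in both subsets, which can be handy for imbalanced datasets. The stratified variant of the call above would look like this:

# Stratified variant of the split above (keeps class proportions in both subsets)
trainData, validData, trainClass, validClass = train_test_split(
    trainData, trainClass, test_size=0.33, random_state=42, stratify=trainClass)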

Now, let's adjust the script above to load the data faster.

In [13]:
start = time.time()

trainFile = h5py.File("contest_TRAIN.h5",'r')

trainData = []
for sample in list(trainFile["Spectra"].keys()):
    tempData = trainFile["Spectra"][sample][()][:,0:spectraCount]
    trainData.append(tempData.transpose())
print("Data loaded in in %f s" % (time.time() - start))    

mid = time.time()
trainData = np.reshape(trainData, newshape=(len(trainData)*trainData[0].shape[0], trainData[0].shape[1]))
print("Data reshaped in %f s" % (time.time() - mid))            
    
trainFile.close()
del tempData, sample

print('There are %d spectra in the training dataset.' \
      % (trainData.shape[0]))
print("Code executed in %f s" % (time.time() - start))
Data loaded in 69.682979 s
Data reshaped in 3.196915 s
There are 10000 spectra in the training dataset.
Code executed in 72.890904 s

Less than half of the time it took above. Not bad. This difference increases with the number of spectra imported. However, a word of caution: during limited internal testing, this approach proved to be very memory-hungry, often causing crashes even on PCs with 16 GB of RAM (while the whole dataset is less than 8 GB).
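One way to tame the memory usage, at the cost of a little bookkeeping, is to pre-allocate the output array once and fill it block by block, instead of keeping 100 separate blocks plus their reshaped copy around at the same time. The following is a rough sketch of this idea, assuming every specimen provides at least spectraCount spectra over the same wlCount wavelengths:

# Sketch: pre-allocate the full array and copy each specimen's block into place
trainFile = h5py.File("contest_TRAIN.h5", 'r')
sampleKeys = list(trainFile["Spectra"].keys())
trainData = np.empty((len(sampleKeys) * spectraCount, wlCount))
for row, sample in enumerate(sampleKeys):
    block = trainFile["Spectra"][sample][:, 0:spectraCount]  # read only the needed columns
    trainData[row * spectraCount:(row + 1) * spectraCount, :] = block.transpose()
trainFile.close()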

Now, we could use the train_test_split() function to split the dataset again. However, there is a catch: train_test_split() splits the dataset randomly. Let's consider what we know about the testing data: it contains data from specimens that are not included in the training dataset. Thus, it would be better to validate in a similar manner too.

In other words, let's exclude the data of a few specimens from the training data and use it for validation. For this, we create a copy of the labels and, considering the number of spectra available for each specimen, we randomly select one specimen from each class and keep all of its spectra for validation. By setting the random seed we make sure that our results are reproducible.

In [14]:
start = time.time()

trainFile = h5py.File("contest_TRAIN.h5",'r')

# Keep the first spectraCount indices out of every block of 500 spectra
indices = [j for i in range(0, 50000, 500) for j in range(i, i + spectraCount)]
trainClass = trainFile["Class"]["1"][()][indices]

random.seed(1123)

# Reduce the labels to one entry per specimen
trainClassRed = trainClass[:]
trainClassRed = trainClassRed[range(0,trainClassRed.shape[0], spectraCount),]

# Randomly pick one specimen from each class for validation (specimen keys are 1-indexed, hence the +1)
validId = []
for tempLabel in np.unique(trainClassRed):
    validId.append(random.choice(np.where(trainClassRed == tempLabel)[0].tolist()) +1)

# Route each specimen's spectra to either the validation or the training set
validData = []
trainData = []
for sample in list(trainFile["Spectra"].keys()):
    tempData = trainFile["Spectra"][sample][()][:,0:spectraCount]
    if int(sample) in validId:
        validData.append(tempData.transpose())
    else:
        trainData.append(tempData.transpose())
        
trainData = np.reshape(trainData, newshape=(len(trainData)*trainData[0].shape[0], trainData[0].shape[1]))
validData = np.reshape(validData, newshape=(len(validData)*validData[0].shape[0], validData[0].shape[1]))   

# Translate the validation specimen IDs into per-spectrum index ranges
tempVar = [range(((tempLabel-1)*spectraCount),((tempLabel)*spectraCount)) for tempLabel in validId]

validId = [np.ravel([i for i in tempLabel]) for tempLabel in tempVar]
validId = np.reshape(validId, (len(validId)*validId[0].shape[0], 1))

validClass = trainClass[validId]
trainClass = np.delete(trainClass, validId)

trainFile.close()
del tempData, sample, tempLabel

print("Code executed in %f s" % (time.time() - start))

print('There are %d spectra in the training dataset and there are %d corresponding labels. \
Similarly, there are %d spectra in the validation dataset and there are %d corresponding labels.' \
      % (trainData.shape[0], len(trainClass), validData.shape[0], len(validClass) ))
Code executed in 73.673175 s
There are 8800 spectra in the training dataset and there are 8800 corresponding labels. Similarly, there are 1200 spectra in the validation dataset and there are 1200 corresponding labels.
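As a side note, scikit-learn's GroupShuffleSplit expresses this kind of specimen-wise split quite compactly, provided we construct a per-spectrum specimen ID to act as the group label. A minimal sketch, starting from the full 10000-spectrum array and labels loaded in specimen order, could look like this (note that, unlike the code above, it draws the held-out specimens at random rather than exactly one per class):

from sklearn.model_selection import GroupShuffleSplit

# One specimen ID per spectrum (spectraCount consecutive spectra per specimen)
groups = np.repeat(np.arange(trainData.shape[0] // spectraCount), spectraCount)

# Hold out 12 whole specimens for validation
gss = GroupShuffleSplit(n_splits=1, test_size=12, random_state=1123)
trainIdx, validIdx = next(gss.split(trainData, trainClass, groups))
trainData, validData = trainData[trainIdx], trainData[validIdx]
trainClass, validClass = trainClass[trainIdx], trainClass[validIdx]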

Lastly, let's read in the testing data.

In [15]:
testFile = h5py.File("contest_TEST.h5",'r')

testData = []
for sample in list(testFile["UNKNOWN"].keys()):
    tempData = testFile["UNKNOWN"][sample][()]
    testData.append(tempData.transpose())

testData = np.reshape(testData, newshape=(len(testData)*testData[0].shape[0], testData[0].shape[1]))

testFile.close()
del tempData, sample

That's it. Now we can start working with our data. In our next post, we will perform some basic exploratory analysis of the data.

Till next time.