Welcome back.
In our previous post, we introduced a spectroscopic dataset aimed at benchmarking classification models. In this post, we will load in that dataset. Thus, we will go through the script provided in the repository with the dataset. Then, we will go a step further by improving that script by making it faster. This will mark the beginning of a series of posts dealing with the processing of spectroscopic data in python
Let's dive in and import the necessary libraries.
import os
import h5py
import numpy as np
import time
import matplotlib.pyplot as plt
import random
from sklearn.model_selection import train_test_split
Next, we change the working directory to the folder containing the datset. Adjust this part accordingly.
# os.chdir("C:/path/to/your/data/")
os.chdir('D:/Projects/EMSLIBS_contest/Data/')
As mentioned in the previous post, the dataset contains 500 spectra for each specimen. However, for experimentation and testing your code, one might choose to use a subset of the full dataset. Thus, here we chose the desired number of spectra to load in from each specimen.
spectraCount = 100
Next, we open a connection to the .hdf5
file containing the training data. It's probably worth mentioning that "hdf" stands for "hierarchical data format" and the 5 marks its version number. We opted for this format as is platform independent and libraries are readily available for its handling in most programming languages. Moreover, the .hdf files include brief metadata in the form of user-defined attribute names. You can find out more about the format here.
trainFile = h5py.File("contest_TRAIN.h5",'r')
print([trainFile.keys()])
[trainFile[key] for key in trainFile.keys()]
# for key in trainFile.keys(): print(trainFile[key])
wavelengths = trainFile["Wavelengths"]
wavelengths = wavelengths[list(wavelengths.keys())[0]][()]
wlMin = min(wavelengths)
wlMax = max(wavelengths)
wlCount = len(wavelengths)
res = (wlMax - wlMin)/wlCount
print("There are %d wavelength values ranging from %02d nm to %d nm with a resolution of %0.2f nm." \
% (wlCount, wlMin, wlMax, res))
Now, let's explore the classification labels contained in the 'Class' group of the file.
print([trainFile["Class"]["1"]])
Pretty self-explanatory.
Next, let's assign the labels to a variable. Here, we truncate the whole vector according to the number of spectra we want to load in for each specimen.
trainClass = trainFile["Class"]["1"][()]
for i in range(0,50000,500):
if i == 0:
tempLabel = trainClass[0:spectraCount]
else:
tempLabel = np.append(tempLabel, trainClass[i:(i+spectraCount)])
trainClass = tempLabel
del tempLabel, i
Note, that [()]
is replaced the .value
method in the newest numpy
version.
While the script above does the job (and is used in the script in the repository), there is a more concise way of doing the same:
indices = []
[indices.extend(range(i,(i+spectraCount))) for i in range(0,50000,500)]
trainClass = trainFile["Class"]["1"][()][indices]
Now that we have the independent variables and the labels, let's load in the data itself. We will time this process as we will provide an improvement below.
start = time.time()
if 'trainData' in locals(): del trainData
for sample in list(trainFile["Spectra"].keys()):
tempData = trainFile["Spectra"][sample][()]
tempData = tempData[:,0:spectraCount]
if "trainData" not in locals():
trainData = tempData.transpose()
else:
trainData = np.append(trainData, tempData.transpose(), axis = 0)
print('There are %d spectra in the training dataset and there are %d corresponding labels' \
% (trainData.shape[0], len(trainClass)))
print("Code executed in %f s" % (time.time() - start))
trainFile.close()
del tempData, sample
Since the dataset was concieved as the subject of a classification contest, we did not specify the training-validation split. Rather, it was left to the contestants judgement. It turned out to be a significant part of the data preparation. But more on this in a later post.
Let's plot a random spectrum (row) of the loaded dataset for visual validation.
Here, we plot a random spectrum for visual exploration.
plt.figure(figsize=(4, 2), dpi= 200, facecolor='w', edgecolor='k')
plt.plot(wavelengths, trainData[random.randint(0,trainData.shape[0]),:])
Looks like a spectrum! Let's split the dataset into training and validation sets.
trainData, validData, trainClass, validClass = train_test_split(trainData, trainClass, test_size=0.33, random_state=42)
print('There are %d spectra in the training dataset and there are %d corresponding labels. \
Similarly, there are %d spectra in the validation dataset and there are %d corresponding labels.' \
% (trainData.shape[0], len(trainClass), validData.shape[0], len(validClass) ))
Now, let's adjust the script above for loading in the data faster.
start = time.time()
trainFile = h5py.File("contest_TRAIN.h5",'r')
trainData = []
for sample in list(trainFile["Spectra"].keys()):
tempData = trainFile["Spectra"][sample][()][:,0:spectraCount]
trainData.append(tempData.transpose())
print("Data loaded in in %f s" % (time.time() - start))
mid = time.time()
trainData = np.reshape(trainData, newshape=(len(trainData)*trainData[0].shape[0], trainData[0].shape[1]))
print("Data reshaped in %f s" % (time.time() - mid))
trainFile.close()
del tempData, sample
print('There are %d spectra in the training dataset.' \
% (trainData.shape[0]))
print("Code executed in %f s" % (time.time() - start))
Less than 1/2 of the time it took above. Not bad. This difference increases with the number of spectra imported. However, a word of caution: during limited internal testing, this approach was found to be very memory hungry, often causing crashes even on PCs with 16 GB of RAM (while the whole dataset is less than 8 GB).
Now, we could use the train_test_split()
function to split the dataset again. However, there is a catch. train_test_split()
randomly splits the dataset into two. Now, let's consider what we know about the testing data. It contains data from specimens which are not included in the training dataset. Thus, it would be better to validate in a similar manner too.
In other words, let's exclude data from a few specimens from the training data and use it for validation. For this, we create a copy of the labels, and considering the number of spectra available for each specimen, we randomly select every spectrum from one specimen from each class for validation. By setting the random seed we make sure that we can reproduce our results.
start = time.time()
trainFile = h5py.File("contest_TRAIN.h5",'r')
indices = []
[indices.extend(range(i,(i+spectraCount))) for i in range(0,50000,500)]
trainClass = trainFile["Class"]["1"][()][indices]
random.seed(1123)
trainClassRed = trainClass[:]
trainClassRed = trainClassRed[range(0,trainClassRed.shape[0], spectraCount),]
validId = []
for tempLabel in np.unique(trainClassRed):
validId.append(random.choice(np.where(trainClassRed == tempLabel)[0].tolist()) +1)
validData = []
trainData = []
for sample in list(trainFile["Spectra"].keys()):
tempData = trainFile["Spectra"][sample][()][:,0:spectraCount]
if int(sample) in validId:
validData.append(tempData.transpose())
else:
trainData.append(tempData.transpose())
trainData = np.reshape(trainData, newshape=(len(trainData)*trainData[0].shape[0], trainData[0].shape[1]))
validData = np.reshape(validData, newshape=(len(validData)*validData[0].shape[0], validData[0].shape[1]))
tempVar = [range(((tempLabel-1)*spectraCount),((tempLabel)*spectraCount)) for tempLabel in validId]
validId = [np.ravel([i for i in tempLabel]) for tempLabel in tempVar]
validId = np.reshape(validId, (len(validId)*validId[0].shape[0], 1))
validClass = trainClass[validId]
trainClass = np.delete(trainClass, validId)
trainFile.close()
del tempData, sample, tempLabel
print("Code executed in %f s" % (time.time() - start))
print('There are %d spectra in the training dataset and there are %d corresponding labels. \
Similarly, there are %d spectra in the validation dataset and there are %d corresponding labels.' \
% (trainData.shape[0], len(trainClass), validData.shape[0], len(validClass) ))
Lastly, let's read in the testing data.
testFile = h5py.File("contest_TEST.h5",'r')
testData = []
for sample in list(testFile["UNKNOWN"].keys()):
tempData = testFile["UNKNOWN"][sample][()]
testData.append(tempData.transpose())
testData = np.reshape(testData, newshape=(len(testData)*testData[0].shape[0], testData[0].shape[1]))
testFile.close()
del tempData, sample
That's it. Now we can start working with our data. In our next post, we will perform some basic exploratory analysis of the data.
Till next time.