In our previous post, we described the common characteristics of spectroscopic data (sparsity, redundancy, and high dimensionality). In this post, we describe a dataset that you can use to experiment with spectroscopic data and to get familiar with the challenges posed by its properties. Below, we explain the motivation behind the dataset and then describe its properties.
One of the most common applications of spectroscopic analysis is telling materials apart qualitatively. Naturally, this is a standard classification problem. However, the properties of spectroscopic data can easily lead to overfitted models. While overfitting can be identified with various cross-validation schemes, these are often omitted. For example, it is not uncommon to find classification accuracies reported for a model that was trained and tested on data from the same homogeneous physical specimens. Imagine training on a set of images of a car and then testing on the same images under different lighting conditions: it would not pose a significant challenge. Moreover, the spectroscopic data analysed in most papers is generally not made publicly available. Hence, it is impossible to tell which classification approach (including data preparation and classification model) actually works.
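To make the leakage problem concrete, here is a minimal sketch of a specimen-wise split, in which every spectrum from a given physical specimen lands entirely in either the training or the testing set. The function name `split_by_specimen`, the toy specimen IDs, and the 30% test fraction are illustrative assumptions, not part of any published tooling.

```python
import random
from collections import defaultdict


def split_by_specimen(specimen_ids, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) so that no specimen contributes to both sets.

    specimen_ids holds one entry per spectrum, naming the physical
    specimen that spectrum was measured on.
    """
    # Group spectrum indices by the specimen they came from.
    by_specimen = defaultdict(list)
    for idx, specimen in enumerate(specimen_ids):
        by_specimen[specimen].append(idx)

    # Shuffle whole specimens, not individual spectra.
    specimens = sorted(by_specimen)
    random.Random(seed).shuffle(specimens)
    n_test = max(1, round(test_fraction * len(specimens)))

    test_idx = [i for s in specimens[:n_test] for i in by_specimen[s]]
    train_idx = [i for s in specimens[n_test:] for i in by_specimen[s]]
    return sorted(train_idx), sorted(test_idx)


# Toy example (hypothetical data): 6 specimens with 4 spectra each.
specimen_ids = [f"specimen_{k}" for k in range(6) for _ in range(4)]
train_idx, test_idx = split_by_specimen(specimen_ids)

train_specimens = {specimen_ids[i] for i in train_idx}
test_specimens = {specimen_ids[i] for i in test_idx}
print("spectra in train/test:", len(train_idx), len(test_idx))
print("specimens shared between sets:", train_specimens & test_specimens)
```

A plain random split over individual spectra would almost certainly put spectra from the same specimen on both sides, which is exactly the situation described above.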
Consequently, we created a dataset to address this issue. The data provided for training comes from different physical specimens than the testing data, while both represent the same classes. To achieve this, we used certified standard powders of a variety of geological specimens (46, to be exact). These 46 powders were classified by a geologist into 12 geological classes. Each class is characterized by a unique chemical composition with a certain level of variability. For example, the powders from one class all contain iron (e.g., in the range of 15–30 wt.%), while the other specimens contain no iron at all. To further increase the difficulty of the classification task, we mixed the powders, yielding 138 specimens in total. Subsequently, we divided the specimens into training and testing sets and carried out the spectroscopic measurements using laser-induced breakdown spectroscopy. For further details about the mixing and measurement procedures, please have a look at the open-access publication titled "Benchmark classification dataset for laser-induced breakdown spectroscopy". For each specimen in the training dataset, 500 emission spectra (observations/samples) are provided. Each spectrum consists of 35,000 variables. The complete training dataset is about 7 GB, while the test data is about 3 GB. Both datasets are available in the HDF5 file format. The number of spectra obtained from the testing specimens varies, so that the number of spectra cannot be used as an additional input.
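To get a feel for the scale, the following back-of-the-envelope sketch estimates how much memory the spectra of a single training specimen occupy once loaded. The 500 spectra and 35,000 variables come from the figures quoted above; the element widths (4-byte and 8-byte floats) are assumptions, since the on-disk data type and any compression are not stated here.

```python
def spectra_size_bytes(n_spectra, n_variables, bytes_per_value):
    """Uncompressed in-memory size of a dense spectra matrix, in bytes."""
    return n_spectra * n_variables * bytes_per_value


SPECTRA_PER_SPECIMEN = 500       # from the dataset description
VARIABLES_PER_SPECTRUM = 35_000  # from the dataset description

# Assumed element widths; the actual storage type may differ.
for dtype_name, width in (("float32", 4), ("float64", 8)):
    size = spectra_size_bytes(SPECTRA_PER_SPECIMEN, VARIABLES_PER_SPECTRUM, width)
    print(f"{dtype_name}: {size / 1e6:.0f} MB per training specimen")
```

Under these assumptions, a single specimen already takes tens of megabytes, so processing the data specimen by specimen may be more practical than loading everything at once on a modest machine.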
The dataset was initially published as part of a classification contest, with the labels of the test dataset withheld from participants. The results of this contest will be the subject of another post. To provide a fully self-contained dataset, the published data descriptor (linked above) and the related data repository now contain the labels of the test dataset. Code for reading in the data is available for R, Python, and MATLAB, making your first encounter with spectroscopic data painless. Go ahead, give the classification challenge a try.