Welcome to our new blog!
In this modern era of data science and machine learning, there are plenty of excellent resources for image processing, natural language processing, and other exciting topics. However, we have found a complete lack of content related to spectroscopic data. So we have decided to create a platform that gathers useful information about the processing of such data and keeps you up to date on new ideas in spectroscopy. Also, be prepared for some educational content (Python/R code) and the occasional interview with leading spectroscopists.
What are spectroscopic data?
To begin, it is essential to ask what we mean by the term "spectroscopic data". We give a detailed answer below, but there is one more important question to ask: why do we want to consider spectroscopic data separately? Because they are different and very specific. There are many powerful methods and strategies for training machine learning models, but most of them were designed for a specific task (image recognition, ...). Unfortunately, they often cannot be applied directly to spectroscopic data. Methods and strategies have to be redesigned around the constraints imposed by the properties of the data and the task.
Fig. 1. LIBS spectrum of an aluminium alloy, standardized sample BAM-311. The figure demonstrates the significant properties of spectroscopic data.
Finally, an answer:
Despite considerable differences between individual spectroscopic methods, spectroscopic data (the results of a spectroscopic measurement) share common features. The data can originate from a wide range of analytical methods (atomic emission spectroscopy (AES), absorption spectroscopy, molecular spectroscopies, and many more); they only need to possess the following properties [1]:
- Sparsity (existence of spectral lines) In spectra, we usually observe peak-like structures (spectral lines) of known shapes and positions. This structure is material-specific and carries valuable information. However, lines cover only a small part of the wavelength range, and they are surrounded by “unimportant” information (noise, continuum, ...). A trained spectroscopist can pick out the important features, but doing so becomes a challenge in automated spectra processing. A simple computational model cannot easily tell what is important from what is noise, so special techniques are required.
- High-dimensionality The dimension of spectroscopic data depends on the resolution of the spectrometer used for the measurement. It is not unusual to have tens of thousands of variables in a single spectrum.
- Line-redundancy (multiple lines per element) A single spectrum usually contains several spectral lines that correspond to the same element. If our goal is to decide whether a specific element is present in the measured sample, we do not have to identify all of its lines; one or two (for confirmation) are enough. Spectral data are therefore highly redundant for specific tasks (classification, detection of elements, ...).
- Value-redundancy (multiple intensity values per single line) Spectral lines have a characteristic shape that originates from line-broadening mechanisms, caused by interactions in the plasma and by the measurement process. From the data point of view, many variables are correlated and correspond to a single peak. These many variables can be represented by only a few (e.g., the central position of the line, its total intensity, and the width of a Voigt line profile). The number of variables needed is determined by the specific task that we aim to solve.
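To make the properties above more tangible, here is a minimal Python sketch on a synthetic spectrum. Everything in it is an assumption made for illustration: the helper names `gaussian_line` and `summarize_line`, the line positions, areas and widths, the noise level, and the use of a Gaussian profile in place of a Voigt profile to keep the example NumPy-only. It is not the method from [1]. It shows that only a small fraction of 20,000 channels carries line signal (sparsity, high-dimensionality) and that the channels covering one line can be condensed into three numbers (value-redundancy).

```python
import numpy as np


def gaussian_line(wavelength, center, area, fwhm):
    """Gaussian line profile with the given integrated area and FWHM."""
    sigma = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    norm = area / (sigma * np.sqrt(2.0 * np.pi))
    return norm * np.exp(-0.5 * ((wavelength - center) / sigma) ** 2)


def summarize_line(wavelength, intensity):
    """Reduce one baseline-free line window to (center, area, fwhm).

    Uses intensity-weighted moments and assumes a uniform wavelength grid.
    The variance-to-FWHM conversion is exact only for a Gaussian profile.
    """
    step = wavelength[1] - wavelength[0]
    area = intensity.sum() * step
    center = (wavelength * intensity).sum() * step / area
    variance = ((wavelength - center) ** 2 * intensity).sum() * step / area
    fwhm = 2.0 * np.sqrt(2.0 * np.log(2.0) * max(variance, 0.0))
    return center, area, fwhm


# Synthetic spectrum: 20,000 channels, three lines, white noise.
# All numbers below are made up for illustration.
rng = np.random.default_rng(0)
wavelength = np.linspace(200.0, 1000.0, 20_000)
noise_sigma = 0.5
true_lines = [(300.0, 50.0, 0.4), (500.0, 120.0, 0.5), (700.0, 200.0, 0.5)]

spectrum = rng.normal(0.0, noise_sigma, wavelength.size)
for line_center, line_area, line_fwhm in true_lines:
    spectrum += gaussian_line(wavelength, line_center, line_area, line_fwhm)

# Sparsity: fraction of channels clearly above the (known) noise level.
line_fraction = np.mean(spectrum > 5.0 * noise_sigma)

# Value-redundancy: all channels around one line become three numbers.
window = np.abs(wavelength - 700.0) < 1.5
est_center, est_area, est_fwhm = summarize_line(
    wavelength[window], spectrum[window]
)

print(f"channels in spectrum: {wavelength.size}")
print(f"fraction of channels above threshold: {line_fraction:.4f}")
print(f"channels in line window: {int(window.sum())} -> 3 parameters")
print(f"center={est_center:.3f}, area={est_area:.2f}, fwhm={est_fwhm:.3f}")
```

The moment-based summary only works here because the synthetic window has no continuum and contains a single isolated line. Real spectra would typically need baseline removal and a proper profile fit first.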
Other examples of spectroscopic data that possess similar properties:
Fig. 2. Raman spectrum of perovskite. Taken from the database [2].
Fig. 3. FTIR spectrum of horse cartilage. Taken from the data descriptor [3].
References
[1] Vrabel et al., RBM for dimension reduction of large spectroscopic data
[3] Sarin et al., Dataset on equine cartilage near infrared spectra, composition, and functional properties