The main purpose of the data mining subproject is to explore hidden and non-trivial relationships between quantitative spectroscopy data (LIBS, FTIR, etc.) and qualitative properties of biomedical, biochemical, or geophysical specimens.
Research questions our group tries to address include:
1) What is the most suitable method for feature selection (selecting the spectral regions most useful for data representation) and feature extraction (generating new representative attributes based on linear or nonlinear transformations of the spectroscopy data)?
We are trying to determine to what extent feature selection techniques developed in other domains (e.g., analysis of hyperspectral remote sensing data or microarray data) may be suitable for the spectroscopy domain. We are also striving to determine the optimal number of principal components for representing the data in linear feature extraction, and to evaluate the usefulness and applicability of non-linear manifold-based techniques.
2) What is the best predictive model for associating measured spectroscopy data with a discrete specimen class (e.g., “cancer” vs. “no cancer”, type of protein) or with a continuous characterization of the specimen (e.g., percentage of each compound in a mixture)?
Here we explore the capabilities of various state-of-the-art classification models (e.g., neural networks, support vector machines, adaptive local hyperplanes) and regression techniques, and evaluate their predictive performance using cross-validation, i.e., on data unseen during the training phase.
3) Can we distinguish between different categories of specimens or data based solely on spectroscopy data, without using prior knowledge such as class labels?
We use unsupervised clustering techniques (e.g., DBSCAN, OPTICS, k-means) to find groups of self-similar data, and we evaluate how useful such models are for better understanding the physical and chemical properties of the data.
4) Can we develop real-time techniques that identify unusual spectroscopy data for further consideration and examination?
We have developed incremental outlier detection algorithms capable of detecting irregularities in data streams.
5) Can we reduce the effects of imperfect measurement equipment, drift in the intensity of spectral lines, heteroscedasticity, etc.?
We investigate the applicability of advanced statistical techniques, including Kalman filters, particle filters, and cepstrum analysis.
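As an illustration of the filtering idea in the last question, the following is a minimal sketch of a scalar Kalman filter applied to the intensity of a single spectral line measured over repeated shots. It is not the group's implementation. The random-walk state model, the function name `kalman_smooth`, the noise variances, and the synthetic drifting signal are all assumptions chosen for the example.

```python
import numpy as np


def kalman_smooth(measurements, process_var=1e-4, measurement_var=1e-2):
    """Scalar Kalman filter with a random-walk state model (an assumption).

    The state is the true, slowly drifting line intensity. The function
    returns the filtered estimate after each measurement.
    """
    z = np.asarray(measurements, dtype=float)
    estimates = np.empty_like(z)
    x = z[0]              # initial state estimate: the first measurement
    p = measurement_var   # initial variance of that estimate
    for i, zi in enumerate(z):
        # Predict: a random walk keeps the state, but uncertainty grows.
        p = p + process_var
        # Update: blend prediction and measurement using the Kalman gain.
        k = p / (p + measurement_var)
        x = x + k * (zi - x)
        p = (1.0 - k) * p
        estimates[i] = x
    return estimates


# Synthetic data (hypothetical): linear intensity drift plus shot noise.
rng = np.random.default_rng(0)
n_shots = 500
true_intensity = 1.0 + 0.001 * np.arange(n_shots)
noisy = true_intensity + rng.normal(0.0, 0.1, n_shots)

filtered = kalman_smooth(noisy, process_var=1e-4, measurement_var=1e-2)

raw_rmse = np.sqrt(np.mean((noisy - true_intensity) ** 2))
filt_rmse = np.sqrt(np.mean((filtered - true_intensity) ** 2))
print(f"raw RMSE: {raw_rmse:.4f}, filtered RMSE: {filt_rmse:.4f}")
```

The ratio of `process_var` to `measurement_var` controls how quickly the estimate follows drift versus how strongly it suppresses noise. In practice both would have to be estimated from calibration measurements rather than set by hand as here.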
Classification accuracy on LIBS protein data vs. number of principal components used, for different machine learning algorithms.
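A curve of this kind can be produced by sweeping the number of principal components and cross-validating each classifier at every setting. The sketch below assumes scikit-learn and uses synthetic data in place of LIBS spectra. The two classifiers, the component counts, and the five-fold split are illustrative choices, not the configuration behind the figure.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for spectra: 200 "specimens" x 300 "spectral channels".
X, y = make_classification(
    n_samples=200, n_features=300, n_informative=10,
    n_redundant=20, n_classes=2, random_state=0,
)

# Factories so that every run gets a fresh, unfitted classifier.
classifiers = {
    "SVM (RBF)": lambda: SVC(kernel="rbf", C=1.0, gamma="scale"),
    "k-NN (k=5)": lambda: KNeighborsClassifier(n_neighbors=5),
}
component_counts = [2, 5, 10, 20, 40]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, make_clf in classifiers.items():
    for n_components in component_counts:
        # Scaling and PCA live inside the pipeline, so they are refit on the
        # training folds only and the held-out fold stays unseen.
        model = make_pipeline(
            StandardScaler(), PCA(n_components=n_components), make_clf()
        )
        scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        results[(name, n_components)] = scores.mean()
        print(f"{name:12s} n_components={n_components:3d} "
              f"accuracy={scores.mean():.3f}")
```

Fitting PCA inside the cross-validation loop, rather than once on the full data set, avoids leaking information from the test folds into the feature extraction step.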
For More Information & List of Research Publications