The problem of predicting continuous scalar outcomes from functional predictors has attracted considerable interest in recent years in many fields, especially in the food industry.
Near-infrared spectroscopy (NIRS) is a technique for measuring and analyzing reflection spectra over a range of wavelengths.
Although NIR spectroscopy has given excellent results in various fields, such as environmental science and the petrochemical industry, its use in virology remains relatively new. Nevertheless, the method has been used with great success for the identification of HIV-1 and the influenza virus. Its advantage is that it does not require reagents or test kits, which take considerable time to run. For example, the PCR (polymerase chain reaction) or RT-PCR (reverse transcription-polymerase chain reaction) test in most cases takes more than 2 h to give results.
Usually, NIR spectrometry is combined with multivariate statistical models, such as principal component regression or partial least squares regression. To increase the accuracy of this procedure, we use recent developments in data science. Precisely, we combine NIR spectrometry with big-data modeling techniques. The statistical modeling of big data is an emerging topic in applied statistics that has received considerable attention during the last decade. Work on this subject is motivated by advances in measuring instruments and computing tools. Moreover, these advances allow researchers to collect large data sets recorded over time.
One of the main advantages of this approach is that the statistical data can be treated as curves. Our main goal in this project is to develop new software code based on some recent statistical models adapted to NIR spectrometry data viewed as curves. The proposed models include the functional versions of principal component regression (PCR) and partial least squares (PLS) regression, among others. It is worth noting that the originality of the nonparametric analysis of functional data is that it links the probability structure to the topological structure in order to extract the most pertinent information from the data. As an alternative to the preceding methods, we propose a new smoothing method that combines nonparametric functional regression with the kernel nearest-neighbor scheme. This new smoothing method retains the robustness of the weighting functions.
Functional data analysis (FDA) arose mainly to solve problems involving time-like curves. In chemometrics, it is usual to measure specific parameters in terms of a set of spectrometric curves that are observed at a finite set of points (functional data). In the past decades, spectroscopy has steadily gained importance as a rapid and non-destructive analytical technique in medicine, chemistry, and the pharmaceutical, environmental, agricultural, and food sciences.
Near-infrared (NIR) spectrometry provides benchmark examples coming from chemometrics. It is a quick analytical technique that involves subjecting a sample to infrared radiation to measure certain parameters of interest in terms of the absorbance spectrum; see, among others [
There are many applications of FDA in spectrometry. For example, these NIR spectra have been used in [
More precisely, this paper aims to use the functional Near-Infrared Reflectance spectroscopy approach to predict some chemical components with some modern statistical models based on the kernel and k-NN procedures. In this article, three NIR spectroscopy datasets are used as examples: Cookie dough, sugar, and tecator data. Specifically, we propose three models for this kind of data: Functional Nonparametric Regression, Functional Robust Regression, and Functional Relative Error Regression, with both kernel and k-NN approaches.
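As a rough illustration of what the kernel and k-NN variants of such estimators might look like in practice, the following Python sketch assumes the usual Nadaraya-Watson-type weighted average for the classical model and the ratio form commonly used for relative error regression. The function names, the quadratic kernel on [0, 1], and the choice of the k-NN bandwidth as the distance to the (k+1)-th nearest curve are assumptions made here for illustration, not the authors' implementation. The robust model, which requires solving an estimating equation, is omitted.

```python
def quadratic_kernel(u):
    """Assumed asymmetric quadratic kernel, supported on [0, 1]."""
    return 1.5 * (1.0 - u * u) if 0.0 <= u <= 1.0 else 0.0


def knn_bandwidth(distances, k):
    """Assumed k-NN rule: bandwidth = distance to the (k+1)-th nearest curve,
    so that the k nearest curves receive non-zero kernel weight (barring ties)."""
    ordered = sorted(distances)
    if not 0 < k < len(ordered):
        raise ValueError("k must satisfy 0 < k < number of curves")
    return ordered[k]


def kernel_weights(distances, h):
    """Kernel weight of each training curve, given its distance to the new curve."""
    if h <= 0:
        raise ValueError("bandwidth must be positive")
    return [quadratic_kernel(d / h) for d in distances]


def classical_estimate(distances, responses, h):
    """Weighted average of the responses (Nadaraya-Watson-type form)."""
    w = kernel_weights(distances, h)
    total = sum(w)
    if total == 0.0:
        raise ValueError("no training curve falls within the bandwidth")
    return sum(wi * yi for wi, yi in zip(w, responses)) / total


def relative_error_estimate(distances, responses, h):
    """Ratio form sum(w / y) / sum(w / y^2); assumes strictly positive responses."""
    if any(y <= 0 for y in responses):
        raise ValueError("responses must be strictly positive")
    w = kernel_weights(distances, h)
    denominator = sum(wi / (yi * yi) for wi, yi in zip(w, responses))
    if denominator == 0.0:
        raise ValueError("no training curve falls within the bandwidth")
    return sum(wi / yi for wi, yi in zip(w, responses)) / denominator


# Toy example: distances from a new curve to four training curves (made-up values).
toy_distances = [1.0, 0.0, 1.0, 2.0]
toy_responses = [1.0, 2.0, 3.0, 4.0]
h_knn = knn_bandwidth(toy_distances, k=3)
classical_pred = classical_estimate(toy_distances, toy_responses, h_knn)
relative_pred = relative_error_estimate(toy_distances, toy_responses, h_knn)
```

In this toy example the k-NN bandwidth adapts to the new curve: with k = 3 it equals 2.0, so the farthest training curve receives zero weight. A kernel approach with a global bandwidth would instead use the same h for every new curve.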
The paper is organized as follows. Section 2 describes the prediction problems and the data used. We discuss our results in Section 3. The conclusion is presented in Section 4.
Near-infrared spectrometry provides benchmark examples coming from chemometrics. This is a non-destructive technology able to measure numerous chemical compounds in a wide variety of products (food industry, petroleum industry, wood industry, etc.); see among others [
All these curves involve some continuum in their structure, even if they are observed at discrete points. The terminology of functional data refers to this continuous feature.
Throughout these three examples, which will be our connecting thread, one can observe that the grid of measurements (i.e., wavelengths) for the spectrometric curves is quite dense.
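Because the curves are observed on a dense grid, a distance between two spectra can be approximated by numerical integration over the wavelengths. The sketch below assumes an L2-type distance computed with the trapezoidal rule; the choice of this particular metric, the function name, and the toy grid are illustrative assumptions, since other semimetrics could equally be used.

```python
import math


def l2_distance(curve_a, curve_b, grid):
    """Approximate L2 distance between two curves sampled on the same grid.

    The integral of the squared difference is approximated by the
    trapezoidal rule, so the grid does not need to be equispaced.
    """
    if not (len(curve_a) == len(curve_b) == len(grid)):
        raise ValueError("curves and grid must have the same length")
    squared_diff = [(a - b) ** 2 for a, b in zip(curve_a, curve_b)]
    integral = 0.0
    for j in range(1, len(grid)):
        width = grid[j] - grid[j - 1]
        integral += 0.5 * (squared_diff[j - 1] + squared_diff[j]) * width
    return math.sqrt(integral)


# Toy example with a made-up three-point grid of wavelengths.
toy_grid = [0.0, 1.0, 2.0]
flat_zero = [0.0, 0.0, 0.0]
flat_one = [1.0, 1.0, 1.0]
toy_distance = l2_distance(flat_zero, flat_one, toy_grid)
```

The denser the grid, the closer this discrete approximation is to the distance between the underlying continuous curves.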
In chemometrics, function-like absorbance or emission spectra, mainly for food samples, are often used to determine the content of certain ingredients. Using spectra is typically much cheaper than alternative chemical analyses.
This paper aims to present various ways of modeling nonlinear relationships in datasets containing functional data and to discuss methodological aspects. We focus on the particular case in which a scalar response is regressed on an explanatory functional variable. To fix ideas, let us present the mathematical formulation of our prediction problem. Assume that we aim to predict the content of certain ingredients: the sucrose content for the cookie dough, the ash content (a quality parameter given in percent) for the sugar, and the fat content for the piece of meat. Denote these contents by
where
The nonparametric estimation of the functional regression was initially studied by [
It follows that
So, for all fixed curves
with
This regression model is obtained by solving the following optimization problem
This last regression is a nonparametric alternative to the least squares regression model. It has recently been considered in functional statistics by [
The expression of this regression is explicitly given by
The performance of all the models mentioned above depends closely on the parameters involved in the estimation. We opted for the asymmetric quadratic kernel defined as
For basic materials on the latter notion, we refer the readers to [
Using the kernel CV method, we obtain
and
where
Using the k-nearest neighbors (k-NN) procedure, we obtain
and
where
| Data set | Classic CV | Classic k-NN | Robust CV | Robust k-NN | Relative CV | Relative k-NN |
|---|---|---|---|---|---|---|
| Cookie dough | 2.9108 | 2.0372 | 2.8136 | 2.0609 | 3.0574 | 2.1502 |
| Sugar | 2.1599 | 1.7417 | 2.1149 | 1.7183 | 2.1499 | 1.7698 |
| Tecator | 4.0304 | 2.2646 | 3.8297 | 2.2857 | 7.2204 | 3.2498 |

| Data set | Classic CV | Classic k-NN | Robust CV | Robust k-NN | Relative CV | Relative k-NN |
|---|---|---|---|---|---|---|
| Cookie dough | 0.0641 | 0.0406 | 0.0629 | 0.0428 | 0.0487 | 0.0325 |
| Sugar | 0.0349 | 0.0252 | 0.0325 | 0.0249 | 0.0315 | 0.0236 |
| Tecator | 1.5249 | 0.2758 | 1.0383 | 0.2167 | 0.3473 | 0.1357 |
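
For reference, a root mean squared error (RMSE) of the kind used to compare prediction methods could be computed as in the following minimal sketch. The function name and the toy values are illustrative only and are not taken from the study; how the observations were split into learning and testing samples is not assumed here.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted responses."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(y - p) ** 2 for y, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy check with made-up numbers: errors of 3 and 4 give sqrt((9 + 16) / 2).
toy_rmse = rmse([0.0, 0.0], [3.0, 4.0])
```

A smaller value indicates predictions that are closer, on average, to the observed contents.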
The values of RMSE are relatively stable and smaller for the three k-NN functional models, namely
The principal NIR data parameters were evaluated using a sample of 72, 268, and 215 observations for the cookie dough, sugar, and tecator data, respectively. The results are summarized in
The comparison of both prediction plots in
A review of the FDA methodologies most used in chemometrics has been presented in this work, along with different applications, most of which are in spectroscopy, where the absorbance spectrum is a functional variable whose observations are functions of wavelength. The work has been divided into two main parts that can be read independently. The first part (Section 2) presents a set of chemometrics applications, in most of which the aim is to predict a variable of interest from the NIR spectrum. The second part (Section 3) summarizes the results of our functional models based on the proposed methods defined in
In this work, an alternative approach to dealing with spectrometric data has been suggested. This approach considers a spectrum as a function of the wavelength or wave-number rather than as a set of separate points. We combine recent developments in chemistry and modern statistics. From chemistry, we use NIR spectroscopy, which is an inexpensive, rapid, and accurate method that also reduces the need for conventional wet chemistry procedures. From modern statistics, we use functional models that exploit all the information in the spectroscopic analysis by viewing spectral data as curves. Specifically, we propose three models for this kind of data: Functional Nonparametric Regression, Functional Robust Regression, and Functional Relative Error Regression, each with both the kernel and the k-NN approaches, and we compare them. On the real examples studied (cookie dough, sugar, and tecator data), we show that our method using the k-NN procedure is more efficient (it gives better results in the sense of MSE) than the one using cross-validation. To conclude, models of intermediate dimensionality in the high-dimensional setting are undoubtedly a promising route for deriving new, useful statistical methods for the food industry.