FT-NIR spectral interpretation
Figure 1a and S1 depict individual raw and preprocessed spectra acquired from domestic and imported kimchi samples. In Fig. 1a, individual raw spectra exhibit a similar pattern characterized by eight broad absorption bands, encompassing numerous adjacent and overlapping bands throughout the wavenumber range. The differences in absorbance intensity among these spectra were likely associated with multiplicative responses to variations in pathlength, as these differences became significantly smaller after preprocessing with pathlength correction methods such as multiplicative signal correction (MSC) and standard normal variate (SNV)19 (Fig. 1a and S1a,h). The individual spectral patterns of domestic and imported sample groups remained consistent regardless of the type of pathlength correction method used, but variations were observed based on the smoothing method employed; Norris derivative (ND) proved more effective than Savitzky–Golay filtering (SG) in reducing the noise level of spectra, thereby enhancing their appearance (Figure S1). This suggests that different preprocessing methods may have different effects on the model performance for the discrimination of kimchi samples according to their origin.
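The two pathlength correction methods named above can be sketched as follows. This is a minimal illustration on synthetic spectra, not the implementation used in the study: the function names, the use of the mean spectrum as the MSC reference, and the synthetic band shapes are all assumptions made for the example.

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: center and scale each spectrum (row) by its own mean and SD."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def msc(spectra, reference=None):
    """Multiplicative signal correction.

    Each spectrum x is modeled as x ~ a + b * reference and corrected to (x - a) / b.
    The mean spectrum serves as the reference when none is given (an assumption).
    """
    spectra = np.asarray(spectra, dtype=float)
    ref = spectra.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(spectra)
    for i, x in enumerate(spectra):
        slope, intercept = np.polyfit(ref, x, deg=1)  # least-squares line x = intercept + slope * ref
        corrected[i] = (x - intercept) / slope
    return corrected

# Synthetic spectra: one shared band pattern with random additive and multiplicative effects.
rng = np.random.default_rng(0)
wavenumbers = np.linspace(4000, 10000, 200)
base = np.exp(-((wavenumbers - 5200) / 400) ** 2) + 0.5 * np.exp(-((wavenumbers - 6900) / 600) ** 2)
scales = rng.uniform(0.7, 1.3, size=(10, 1))    # multiplicative (pathlength-like) effect
offsets = rng.uniform(-0.1, 0.1, size=(10, 1))  # additive baseline shift
raw = offsets + scales * base

msc_corrected = msc(raw)
snv_corrected = snv(raw)

# Spread between spectra at each wavenumber, before and after correction.
spread_raw = (raw.max(axis=0) - raw.min(axis=0)).max()
spread_msc = (msc_corrected.max(axis=0) - msc_corrected.min(axis=0)).max()
print(f"max spread raw: {spread_raw:.3f}, after MSC: {spread_msc:.2e}")
```

Because the synthetic spectra differ only by an offset and a scale factor, both corrections collapse them onto a single curve; real spectra would retain their chemical differences after correction.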
Figure 1b illustrates representative average patterns of spectra preprocessed with one of the pathlength correction methods (MSC) for domestic and imported sample groups. As depicted in Fig. 1b, the two average spectra nearly overlap, making it challenging to differentiate them with the naked eye. This suggests that, on average, the chemical information obtained from these two sample groups is notably similar, despite the diversity and complexity of intrinsic/extrinsic factors influencing the quality of kimchi products. It underscores the need for a practical chemometric tool to extract information useful for distinguishing between them.
Figure 1c exhibits representative average patterns of spectra preprocessed with a combined method involving one of the pathlength correction methods + one of the derivative methods + one of the smoothing methods (MSC + D2 + ND) for domestic and imported sample groups. As depicted in Fig. 1c, the D2 treatment revealed some sharp peaks. Among the noteworthy regions, ten peaks marked with numbers exhibited significant differences (p < 0.05) in intensity between the two sample groups. The chemical compositions associated with these differences may contain valuable information for classifying the origin of kimchi. Although the complexity of constituents in kimchi products limits the interpretation of structural information because of the extensive overlapping of bands, some insights are summarized in Table 1.
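To illustrate how a smoothed second derivative (D2) sharpens broad bands, the sketch below applies a simplified Norris-style gap-segment derivative: a moving average over a segment, followed by a second difference across a gap. The segment and gap sizes, the function names, and the edge handling are assumptions for illustration and are not the settings used in the study.

```python
import numpy as np

def moving_average(x, segment):
    """Mean over a centered window of `segment` points (odd); edges are padded by reflection."""
    pad = segment // 2
    padded = np.pad(np.asarray(x, dtype=float), pad, mode="reflect")
    kernel = np.ones(segment) / segment
    return np.convolve(padded, kernel, mode="valid")

def gap_second_derivative(x, step, segment=5, gap=5):
    """Segment-smooth the signal, then take a second difference across `gap` points.

    Returns an array of the same length; the first and last `gap` points are NaN.
    """
    smoothed = moving_average(x, segment)
    d2 = np.full_like(smoothed, np.nan)
    d2[gap:-gap] = (smoothed[2 * gap:] - 2 * smoothed[gap:-gap] + smoothed[:-2 * gap]) / (gap * step) ** 2
    return d2

# Check on a quadratic, whose second derivative is exactly 2 everywhere.
t = np.linspace(0.0, 1.0, 101)
d2_quadratic = gap_second_derivative(t ** 2, step=t[1] - t[0])

# A broad absorption band becomes a sharp negative peak at the band center.
axis = np.linspace(-10.0, 10.0, 401)
band = np.exp(-(axis / 2.0) ** 2)
d2_band = gap_second_derivative(band, step=axis[1] - axis[0])
center_index = int(np.nanargmin(d2_band))
print("band center located at x =", axis[center_index])
```

Points within `gap + segment // 2` of either end are affected by the padding and should be discarded before modeling.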
In general, a band around peak No.1 (7278 cm−1) and bands around peaks No. 7 (4401 cm−1), No. 8 (4327 cm−1), No. 9 (4262 cm−1), and No. 10 (4046 cm−1) correspond to combinations of C-H22. A band near peak No. 2 (6954 cm−1) corresponds to the first overtone of O-H and N-H22. Two bands around peak No. 3 (5789 cm−1 for domestic; 5785 cm−1 for imported) and peak No. 4 (5677 cm−1) are caused by the first overtone of C-H20. Another two bands around peak No. 5 (4852 cm−1 for domestic; 4844 cm−1 for imported) and peak No. 6 (4736 cm−1) correspond to combinations of O-H and N-H20. These FT-NIR absorption band regions are closely linked to carbohydrates (7278 cm−1, 6954 cm−1, 5789 and 5785 cm−1, 5677 cm−1, 4852 and 4844 cm−1, 4736 cm−1, 4401 cm−1, 4327 cm−1, and 4262 cm−1)19,17,20,21,22,23,24, proteins (5789 and 5785 cm−1, 4852 and 4844 cm−1, 4736 cm−1, 4401 cm−1, 4327 cm−1, 4262 cm−1, and 4046 cm−1)19,17,21, and lipids (5789 and 5785 cm−1, 5677 cm−1, 4736 cm−1, 4401 cm−1, 4327 cm−1, 4262 cm−1, and 4046 cm−1)21,25,26.
According to information from the food nutrition database by the Korean Ministry of Food and Drug Safety, kimchi contains approximately 90% moisture, 2% protein, less than 1% lipid, 6–7% carbohydrates, and 2–3% ash (Table 2). Carbohydrates are the predominant nutrients in kimchi. These compounds include dietary fibers, which are related to the integrity of the plant cell wall27, and sugars that contribute to the sweet taste of kimchi28. These are mainly derived from plant-based ingredients, and sucrose is often added to enhance the taste. Meanwhile, proteins are primarily derived from animal-based ingredients such as fish sauces and dried-fish extracts29,30. Some of these components in kimchi have been found to vary depending on the geographical origin. Differences in some metabolites, including amino acids, sugars, and proteins, were successfully used to classify the geographical origin of kimchi in previous studies21,31. However, some of these components undergo changes in content during fermentation; for example, sugars such as glucose and fructose are consumed by lactic acid bacteria during fermentation28. Hence, the chemical composition of kimchi can vary depending on both the ingredients used and the conditions of processing and fermentation. It is essential to overcome the complexity and diversity of factors that determine the quality of kimchi to identify the indigenous characteristics of kimchi based on its origin. Therefore, it is critical to develop a classification model with good performance that can distinguish between the characteristics of kimchi according to its geographical origin.
Classification of geographical origin of domestic and imported kimchi samples by PCA
Figure 2 and S2 present PCA score plots constructed from the top two PCs using preprocessed FT-NIR spectral data of domestic and imported samples. These two-dimensional PCA score plots explained 18.9% (SNV + D2 + SG)–75.2% (MSC) of the total variance. This indicates that the FT-NIR analytical tool has high dimensionality, enhancing its discrimination ability against similar samples32. As shown in Fig. 2a and b, when pathlength correction methods were applied, domestic and imported kimchi samples widely overlapped on the plots, and the distribution patterns on the PCA plots were very similar regardless of the type of pathlength correction method used. These results indicate a similarity in spectral transformation between MSC and SNV, also observed in Figure S1, projected onto the PCA plots.
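The score plots and explained-variance percentages discussed above follow from a standard PCA of the mean-centered spectral matrix. A minimal sketch using singular value decomposition is given below; the random matrix stands in for preprocessed spectra, and the function name and matrix size are assumptions for illustration only.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Return PC scores and the fraction of total variance explained by each retained PC."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                        # mean-center each variable (wavenumber)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = S ** 2 / np.sum(S ** 2)            # variance ratio per component
    return scores, explained[:n_components]

# Placeholder data: 40 samples x 50 spectral variables (not the study's data).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(40, 50))
scores, explained = pca_scores(X_demo, n_components=2)
print(f"PC1 + PC2 explain {100 * explained.sum():.1f}% of the total variance")
```

Plotting `scores[:, 0]` against `scores[:, 1]`, colored by sample group, gives a two-dimensional score plot of the kind shown in Fig. 2.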
The variation of data points for each sample group on these plots (Fig. 2a,b) decreased as the degree of preprocessing increased; for example, the spread of data points decreased in the order MSC > MSC + D1 > MSC + D2 (Fig. 2a,c,e). Meanwhile, when a combined method including either D1 (Fig. 2c,d and Figures S2a,c,d,e) or D2 + ND (Fig. 2f and S2g) was applied, the two sample groups could be differentiated to some extent by PC 2. These results highlight the importance of selecting appropriate preprocessing methods to improve the classification of the geographical origin of kimchi. However, despite applying various combined preprocessing methods to the raw spectral data, PCA could not completely separate the two sample groups based on their geographical origin. These results indicate that the differences in FT-NIR data between domestic and imported kimchi samples were not sufficiently clear for a full distinction using PCA. A similar unclear classification pattern was also observed in PCA using electronic nose data for determining the geographical origin of kimchi in a study by Lee et al.31. The authors could not obtain a completely differentiated pattern between domestic and imported kimchi sample groups using proteomic data. Conversely, a clear separation between domestic and imported kimchi sample groups was achieved by PCA using 1H NMR data. Thus, our results suggest that an advanced chemometric method should be attempted to more explicitly differentiate domestic and imported kimchi samples based on their geographical origin.
Classification of geographical origin of domestic and imported kimchi samples by chemometric techniques
Comparison of preprocessing method
Various chemometric techniques, including KNN, CART, SVM, NB, RF, and PLS-DA, were employed to construct a classification model for determining the geographical origin of kimchi. Tables S1 and 3 summarize the cross-validation results with training sets and performance evaluations with test sets in terms of accuracy, recall, specificity, precision, and F1 score. Figure 3 presents confusion matrices obtained during model testing. These results reveal that the impact of applying preprocessing methods to the raw spectral data on the classification outcome depended on the type of chemometric algorithm used. This phenomenon aligns with observations in previous studies on the discrimination based on FT-NIR analysis of cocoa beans16, sea cucumbers18, and raw milk33. As indicated in Tables S1 and 3, when employing KNN and SVM algorithms, no misclassification occurred even without data preprocessing in both training and testing models, supported by the perfect values of all performance metrics. Specifically, when using KNN, minor misclassifications were observed in model training with spectral data preprocessed with MSC + D2 and SNV + D2, but all preprocessing methods yielded flawless results without classification errors in model testing. Similarly, when using SVM, insignificant classification errors were observed in model training with several datasets preprocessed with MSC + D2, MSC + D2 + SG, SNV + D2, and SNV + D2 + SG. However, all data preprocessing methods, with two exceptions (MSC + D2 and SNV + D2), produced immaculate classification results in model testing.
Conversely, for other algorithms such as CART, NB, RF, and PLS-DA, applying preprocessing methods to the raw spectral data generally improved the classification results in both training and testing models, with a few exceptions where the results slightly deteriorated (Tables S1 and 3). Among these four algorithms, RF and PLS-DA achieved complete classification results using some preprocessed datasets. When using RF, all performance metrics had a high value (0.97) in both training and testing models without data preprocessing. However, these values slightly decreased to 0.88–0.95 (accuracy), 0.90–0.96 (recall), 0.87–0.96 (specificity), 0.87–0.96 (precision), and 0.89–0.95 (F1 score) in both training and testing models when applying four data preprocessing methods (MSC + D2, MSC + D2 + SG, SNV + D2, and SNV + D2 + SG). This result shows that the four methods were not useful in enhancing the performance of the RF model for classifying the geographical origin of kimchi. However, one method (MSC + D1 + ND) in model training and three methods (MSC + D1 + ND, SNV, and SNV + D1 + ND) in model testing allowed for flawless model performance. Based on these results, MSC + D1 + ND was found to be the most effective of the preprocessing methods applied in this study for reinforcing the performance of RF, and the RF algorithm could establish a robust model with perfect values of all performance metrics (1.00) using the dataset preprocessed with MSC + D1 + ND. Meanwhile, when using PLS-DA, values of all performance metrics ranged from 0.87 to 0.93 in both training and testing models without data preprocessing. All data preprocessing methods except for six (MSC, MSC + D2, MSC + D2 + SG, SNV, SNV + D2, and SNV + D2 + SG) in model training and all methods except for two (MSC + D2 and SNV + D2) in model testing led to perfect model performance.
These results imply that applying D1 following one of the scattering correction methods (MSC and SNV) was crucial for improving the performance of PLS-DA models regardless of whether a smoothing technique was applied. Based on these results, six methods (MSC + D1, MSC + D1 + ND, MSC + D1 + SG, SNV + D1, SNV + D1 + ND, and SNV + D1 + SG) were found to be the most suitable for enhancing the performance of PLS-DA models among the preprocessing methods applied in our study.
However, CART and NB algorithms could not achieve perfect classification performance even with preprocessed spectral datasets in both training and testing models. In terms of performance metrics, the best preprocessing methods for improving the performance of models based on the CART and NB algorithms were found to be MSC + D2 + ND and SNV + D2 + SG, respectively. Although most data preprocessing methods in addition to the best methods were beneficial in enhancing the classification performance when using CART and NB algorithms, the recall values of CART and NB models decreased from 0.90–0.93 to 0.66–0.88 and from 0.77–0.81 to 0.00–0.73, respectively, in some cases using datasets preprocessed with one of four methods: MSC + D2, MSC + D2 + SG, SNV + D2, and SNV + D2 + SG for CART, and MSC, MSC + D2 + ND, MSC + D2 + SG, and SNV for NB. This result indicates that applying these preprocessing methods significantly reduced the predictive ability of the CART and NB models for domestic samples.
From these findings, it is evident that data preprocessing plays a less critical role in the performance of classification models based on KNN and SVM algorithms for identifying the geographical origin of kimchi. However, data preprocessing proves generally helpful in enhancing model performance when utilizing other algorithms, particularly PLS-DA (Table 3). These results underscore the importance of selecting an optimal data preprocessing method tailored to each classification algorithm to achieve improved model performance.
In this context, the functions of MSC or SNV were essential for building all types of classification models used in our study. In particular, the performance of some models, such as the CART and NB models, was greatly improved with these methods. These results demonstrate that differences in sample pathlength and scattering effects seriously hindered the recognition of important features for the classification, and that eliminating this hindrance considerably enhanced the prediction ability of the classification models used. In contrast, applying D1 or D2 following one of the scattering correction methods (MSC and SNV) did not always guarantee improved model performance. Unlike MSC and SNV, the impact of D1 and D2 on model performance varied depending on the classification algorithm. For instance, MSC + D2 and SNV + D2 yielded lower values of performance metrics than did MSC and SNV when using the CART, SVM, RF, and PLS-DA algorithms, whereas they yielded higher values when using the NB algorithm. These results suggest that useful information related to the feature of interest might have been masked by the noise amplified by D2, which degraded model performance when using the CART, SVM, RF, and PLS-DA algorithms. On the other hand, despite the noise amplification, modeling with NB appeared to improve because D2 resolved overlapping peaks and emphasized detailed structures. Hence, preprocessing with either MSC + D2 or SNV + D2 is not recommended for the classification of the geographical origin of kimchi based on the chemometric algorithms used in this study. None of the classification algorithms, using datasets preprocessed with these methods, demonstrated a perfect classification result in model training (Table 3). This aligns with the PCA results, where datasets preprocessed with these methods exhibited patterns of kimchi samples that were challenging to classify into two categories according to their origin (Fig. 2e and S2f).
These results suggest that when employing D2 for data preprocessing of raw spectra, it is essential to use a smoothing method simultaneously for better classification results. Of the two smoothing methods used in this study, ND is recommended because it consistently achieved better results than SG for the classification of the geographical origin of kimchi based on the algorithms used in this study. Moreover, the PCA results were superior when ND was applied for data preprocessing compared to when SG was used. Specifically, the PCA using datasets preprocessed with either MSC + D2 + SG or SNV + D2 + SG failed to identify a distinguishable cluster of kimchi samples according to their geographical origin (Figure S2b,h).
Comparison of chemometric algorithms
As indicated in Table 3 and Fig. 3, except for CART and NB, all supervised chemometric algorithms used in this study successfully established classification models with flawless performance. Fifteen KNN models, thirteen SVM models, three RF models, and twelve PLS-DA models showed no classification errors in model testing. However, in CART models, the best performance with minor misclassifications was observed when using datasets preprocessed with MSC + D1, MSC + D1 + ND, MSC + D2 + ND, SNV + D1, and SNV + D1 + ND (0.98 of accuracy, 1.00 of recall, 0.97 of specificity, 0.97 of precision, and 0.98 of F1 score). In these cases, only one imported kimchi sample was incorrectly discriminated as domestic. In contrast, in NB models, at least four misclassifications were observed.
Among the four successful algorithms, KNN emerged as the optimal choice for determining the geographical origin of kimchi. It consistently and accurately classified all 30 domestic and 30 imported kimchi samples based on their geographical origin, irrespective of the data preprocessing method employed, as evident in the model testing results (Fig. 3). Furthermore, the classification of the test set was accomplished within a short execution time of 11 s (Table S2). This execution time was comparable to those obtained from SVM models, except for one based on the raw dataset, but significantly shorter than those from RF and PLS-DA models. These results highlight that KNN outperformed other algorithms used in this study in recognizing and extracting distinct features between domestic and imported kimchi samples, even without the aid of data preprocessing.
KNN has a successful track record in determining the geographical origin of foods. Similar to our study, KNN models exhibited superior performances compared to other models, such as SVM and RF, for identifying the geographical origin of white tea based on NIR in a study by Zhang et al.15. KNN models based on destructive analytical techniques, such as gas chromatography-mass spectrometry, demonstrated perfect results for discriminating the geographical origin of tea34,35,36,37 and liquors38,39. The simplicity of KNN’s mathematical approach, lack of assumptions regarding underlying data, and robustness against outliers contribute to its effectiveness in solving classification problems40,41,42. These advantages appear to have contributed to the excellent performance in the classification of kimchi according to geographical origin in our study.
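The simplicity of the KNN approach noted above can be seen in a few lines of code: a test sample is assigned the majority label among its nearest training samples. The sketch below is illustrative only; the value of k, the Euclidean distance, the function name, and the toy two-dimensional data are assumptions and do not reflect the settings or spectra used in this study.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Classify each test sample by majority vote among its k nearest training samples."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    predictions = []
    for x in X_test:
        distances = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance to every training sample
        nearest = np.argsort(distances)[:k]              # indices of the k closest samples
        votes = Counter(y_train[i] for i in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions

# Toy data with two well-separated groups (hypothetical feature values).
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = ["domestic", "domestic", "domestic", "imported", "imported", "imported"]
X_test = [[0.2, 0.1], [5.5, 5.2]]

predicted = knn_predict(X_train, y_train, X_test, k=3)
print(predicted)
```

No model parameters are fitted; the training data themselves act as the model, which is why no distributional assumptions are required.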
While KNN demonstrated the best performance, SVM and PLS-DA algorithms also showed high potential for discriminating kimchi samples based on their geographical origin (Table 3 and Fig. 3). In SVM models, only one imported kimchi sample was incorrectly classified as domestic, and this occurred only when spectral data were preprocessed with MSC + D2 and SNV + D2 in model testing (Fig. 3). All metric values obtained in model testing were also 1.00, except for when using datasets preprocessed with MSC + D2 and SNV + D2 (0.98 of accuracy, 1.00 of recall, 0.97 of specificity, 0.97 of precision, and 0.98 of F1 score). Similarly, in PLS-DA models, except for one using the raw dataset, only one imported kimchi sample was incorrectly discriminated as domestic, and again, this happened only when spectral data were preprocessed with MSC + D2 and SNV + D2 in model testing. All metric values obtained in model testing for PLS-DA were also 1.00, except for PLS-DA models constructed using raw, MSC + D2, and SNV + D2 datasets. Between these two algorithms, SVM performed better than PLS-DA because, similar to KNN, SVM effectively captured the differences between domestic and imported kimchi samples even without data preprocessing and was not significantly affected by data preprocessing. In contrast, when using PLS-DA, data preprocessing was essential to prevent misclassifications, and the execution time of PLS-DA was longer than that of SVM, except for one case based on the raw dataset (Table S2).
Numerous successful instances of determining geographical origin using SVM are evident in previously reported studies. For example, the SVM model exhibited impeccable performance in discriminating roast green tea based on geographical origin using FT-NIR data in a study by Chen et al.14, surpassing other methods such as LDA, KNN, and back propagation artificial neural networks. Another study by Gaiad et al.33 demonstrated that SVM, based on trace element profiles, outperformed other models including RF, KNN, LDA, and PLS-DA in identifying the geographical origin of lemon juices. SVM is known for its robustness against outliers and high generalization performance, preventing overfitting43. However, it requires data preprocessing, such as normalization, when the data scale varies, and it is not inherently suited to multi-class classification problems. Nonetheless, these limitations do not pose issues for binary classification with consistent data scales, as demonstrated in discriminating the geographical origin of kimchi in this study. Therefore, alongside KNN, SVM emerges as a promising chemometric tool for determining the geographical origin of kimchi.
Conversely, a poor classification result emerged when the NB model was established using the dataset preprocessed with MSC + D2 + SG (Fig. 3). In this instance, while all 30 imported kimchi samples were correctly identified, all 30 domestic kimchi samples were misclassified as imported. In evaluating the classification models for the geographical origin of kimchi, both precision and recall are critical, as it is essential to simultaneously avoid misclassification of both domestic and imported kimchi samples. Therefore, the F1 score, the harmonic mean of precision and recall, may be more useful than accuracy for assessing model performance in this classification. Moreover, precision may be deemed more important than recall in the classification of the geographical origin of kimchi. If a model misclassifies a few domestic kimchi products as imported, it could be rectified through additional testing, but if a model misclassifies imported kimchi products as domestic, it might be easily overlooked. In this context, it can be asserted that the NB model built from the dataset preprocessed with MSC + D2 + SG exhibited the poorest performance in discriminating kimchi samples according to their geographical origin, as both precision and F1 score were indeterminate owing to the misclassification of all domestic samples.
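The relationship between the confusion matrix counts and the five performance metrics can be made explicit with a short calculation. The sketch below assumes that domestic kimchi is the positive class, which is inferred from the text rather than stated, and the function name is hypothetical. It evaluates two cases described in this section: one imported sample misclassified as domestic, and all 30 domestic samples misclassified as imported.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute metrics from confusion-matrix counts; None marks an indeterminate (0/0) value.

    Assumption: 'positive' means domestic, so fp counts imported samples labeled domestic.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    recall = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    if precision is None or recall is None or precision + recall == 0:
        f1 = None
    else:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
    return {
        "accuracy": ratio(tp + tn, tp + fn + fp + tn),
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "f1": f1,
    }

# Case 1: all 30 domestic correct, 1 of 30 imported samples labeled as domestic.
one_error = classification_metrics(tp=30, fn=0, fp=1, tn=29)

# Case 2: all 30 domestic samples labeled as imported, all 30 imported correct.
all_domestic_missed = classification_metrics(tp=0, fn=30, fp=0, tn=30)

for name, result in [("one error", one_error), ("all domestic missed", all_domestic_missed)]:
    print(name, {k: (None if v is None else round(v, 2)) for k, v in result.items()})
```

In the second case no sample is predicted as domestic, so precision is 0/0 and the F1 score cannot be computed, which is why both are reported as indeterminate.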
As a result, it was determined that a successful classification model could be established using the KNN, SVM, and PLS-DA algorithms when the FT-NIR spectra were preprocessed with any of the methods utilized in this study except MSC + D2 and SNV + D2. However, the collected samples may not fully represent the broader market variability. Thus, to ensure the generalizability of models for practical applications, the sample size should be progressively increased, and the feasibility of the proposed approach should be continuously re-evaluated as the sample size grows.