The present invention relates to a method for predicting an unknown feedstuff raw material and/or feedstuff by means of near infrared spectroscopy and similarity analysis using a cleaned-up database with spectra of known feedstuff raw material and/or feedstuff.
Animal diets typically contain a variety of different feedstuffs and/or feedstuff raw materials. It is therefore necessary to know the identity and type of a feedstuff and/or feedstuff raw material as precisely and as quickly as possible. This is particularly relevant when different feedstuffs and/or feedstuff raw materials are to be mixed to yield a diet with a specific composition for a specific species.
The methods of qualitative analysis of feedstuffs and feedstuff raw materials in principle allow a precise identification of feedstuffs and/or feedstuff raw materials of unknown type, i.e. unknown identity, origin etc. However, these methods require cost- and maintenance-intensive lab equipment. Further disadvantages of these methods are the considerable time they require and the high demands they place on the expertise and experience of the operating staff. In principle, near infrared spectroscopy would be a suitable means for the identification and determination of feedstuffs and/or feedstuff raw materials. According to EP 3361248 A1 the use of near infrared spectroscopy also allows the prediction of the processing influence on the nutritional value of feedstuff raw materials and/or feedstuffs. This document discloses a method for assessing the processing influences on the nutritional value of feedstuff raw materials and/or feedstuffs. This method comprises the steps of i) subjecting a sample of a feedstuff raw material and/or feedstuff to near infrared spectroscopy, ii) matching the absorption intensities at the respective wavelengths or wavenumbers in the near infrared spectrum with the corresponding parameters and their values obtained from chemical analysis of the same sample and generating a calibration graph and/or calibration equation, iii) subjecting another sample of a feedstuff raw material and/or feedstuff to near infrared spectroscopy, and iv) obtaining the values of specific parameters for this sample from the calibration graph and/or calibration equation.
When used as a routine method, however, near infrared spectroscopy requires the knowledge of the identity and type of feedstuffs and/or feedstuff raw materials. However, human mistakes in the selection of feedstuffs and/or feedstuff raw materials can already lead to an incorrect classification of a feedstuff and/or feedstuff raw material regarding its identity and present form. Based on incorrect classification, the wrong calibration method would be chosen for the near infrared analysis of the ingredients and their specific amounts in the feedstuff and/or feedstuff raw material. Thus, the data obtained from the incorrectly calibrated NIR spectrometer would be erroneous. Consequently, these data would be misleading for any further operating steps, in which the respective feedstuff raw material and/or feedstuff is involved.
An option to overcome this problem is recording a near infrared spectrum of a sample of the unknown feedstuff and/or feedstuff raw material and performing a similarity search for the recorded spectrum. This approach is described in the published international application WO 2016/141198 A1 and in the article “Algorithms, Strategies and Application Progress of Spectral Searching Methods” (Chu X.-L., Li J.-Y., Chen P., Xu Y.-P., Chinese Journal of Analytical Chemistry, 2014, 42(9), 1379-1386). In detail, a similarity search comprises the step of analyzing the similarity of the recorded spectrum of an unknown feedstuff raw material and/or feedstuff with the near infrared spectra of a population of known feedstuff raw materials and/or feedstuffs. The basis for the similarity analysis is the transformation of the relevant information of a spectrum, i.e. absorption intensities at their wavelengths or wavenumbers, to the corresponding vector, both for the spectrum of the unknown feedstuff raw material and/or feedstuff and for each spectrum of the population of spectra of known feedstuff raw materials and/or feedstuffs. In the next step, the thus obtained vector of the spectrum of an unknown feedstuff raw material and/or feedstuff, which is hereinafter also referred to as query vector, and the vectors of a population of spectra of known feedstuff raw materials and/or feedstuffs, hereinafter also referred to as database vectors, are subjected to a similarity analysis. The multitude of database vectors is hereinafter also referred to as set of database vectors. The similarity analysis comprises the calculation of the similarity measure and/or the distance measure between the query vector of the recorded spectrum of the unknown feedstuff raw material and/or feedstuff and each database vector of the population of spectra of known feedstuff raw materials and/or feedstuffs.
A similarity analysis involving a similarity measure is in principle a search for the nearest neighbor to the query system, here the query vector. In this case, a high similarity value for a database vector indicates a high similarity of a database vector to the query vector. Therefore, the similarity values for all database vectors are ranked in descending order, with the highest values at the top. By comparison, when the similarity analysis involves a distance measure, a low similarity value for a database vector indicates a high similarity of a database vector to the query vector. Here, the similarity values for all database vectors are ranked in ascending order, with the lowest values at the top. In any case, the top-ranked database vector has the highest similarity with the query vector, irrespective of whether the similarity analysis involves a similarity measure or a distance measure. In principle, a general similarity search is always based on the assumption that the database vectors at the top of the ranking are most likely the vectors to be relevant for the query vector. However, the methods of the prior art cannot solve problems arising when there are false positives at the top of the ranking of the database vectors. In the worst case, even the top-ranked vector could be a false positive. Reasons for false positive entries in the ranking of database vectors can be the erroneous assignment of a database vector or of a corresponding NIR spectrum to a non-matching class of feedstuff raw materials and/or feedstuffs, the heterogeneity or messiness of the class of feedstuff raw materials and/or feedstuffs, whose NIR spectra were recorded, or the similarity of some classes of feedstuff raw materials and/or feedstuffs to one another. Any of these cases complicates a precise and reliable assignment of a database vector to the query vector.
An alternative method is disclosed in US 2011/0153226 A1. This document discloses a method for spectral searching an unknown mixture, comprising: obtaining one or more candidate mixture combinations by comparing the spectrum of the unknown mixture with the spectrum of each of a first plurality of library compounds; generating a model for each of the candidate mixture combinations based, at least in part, on a modeling metric; computing a residual spectrum corresponding to each of the candidate mixture combinations by removing the spectrum of each of the compounds of the candidate mixture combination from the spectrum of the unknown mixture; identifying one or more potential compounds by comparing each residual spectrum with the spectrum of each of a second plurality of library compounds; adding the potential compounds to the candidate mixture combinations to generate an updated list of the candidate mixture combinations; and repeating the generating of the model, computing of the residual spectrum, identifying of the potential compounds, and adding of the potential compounds until a first termination condition is satisfied.
EP 0807809 A2 discloses another method for matching an unknown product with one of a library of known products comprising the steps: 1) measuring a near infrared absorbance spectrum for each of said known products, 2) generating known product vectors extending into hyperspace representing the absorbance spectra determined for each of said known products, 3) dividing said known product vectors into clusters of vectors extending into hyperspace wherein the vectors inside each cluster are closer to each other in hyperspace than the vectors outside of such cluster, 4) dividing at least some of said clusters of vectors into sub-clusters of vectors extending into hyperspace, 5) repeating said step 4) on at least some of said sub-clusters until all of said sub-clusters have fewer than a predetermined number of vectors, 6) surrounding each of said clusters and sub-clusters with an envelope defined in the corresponding hyperspace, 7) measuring the absorption spectrum of said unknown product, 8) determining in which of said envelopes surrounding said clusters divided in step 3) a vector, representing said unknown product and extending into the hyperspace of said clusters, falls, 9) if the vector representing said unknown product falls into an envelope surrounding a cluster which is divided into sub-clusters, then determining in which envelope surrounding a sub-cluster a vector, representing said unknown product and extending into the hyperspace of such sub-cluster, falls, 10) repeating the step 9) on further divided sub-clusters until a vector representing said unknown product is determined to fall into an envelope surrounding a sub-cluster which is not further divided, and 11) then determining which known product represented by a vector within said last-named envelope said unknown product matches.
CN 109459409 A discloses a near infrared anomalous spectral recognition method. In this method the sample space is generally linearized by the Hilbert space filling curve. Next, a hyperparameter must be selected. In the study of outlier identification, the value of said hyperparameter has to be determined according to experience. This, however, requires experienced and trained staff. Specifically, this document discloses abnormal spectrum identification involving a principal component spatial distance metric. However, any method involving a principal component analysis places high demands on computational power and time. Therefore, it is not suitable for large data volumes, as is the case for populations of spectra.
The article “Evaluation of Local Approaches to Obtain Accurate Near-Infrared (NIR) Equations for Prediction of Ingredient Composition of Compound Feeds” (Fernández-Ahumada E. et al., Applied Spectroscopy, vol. 67, no. 8, 2013, pages 924-929) relates to a method for improving the accuracy of intact feed calibrations for the near-infrared (NIR) prediction of the ingredient composition. This article discloses that prior to calibration development, an outlier elimination routine served for the detection of samples with atypical spectra identified by extreme Hotelling's T2 and Q residual values. Approximately 10% of the overall database were considered spectral outliers and removed, leaving 20320 samples. Specifically, this document teaches the CARNAC method (Comparison Analysis Using Restructured Near-Infrared and Constituent Data) using PLS (Partial Least Squares) factors as input variables. This approach, however, is not suitable for small data volumes, because in this case it is difficult to divide the data into a training set and a test set.
Accordingly, there is a need for a method, which allows for a less complicated and at the same time very precise prediction of unknown feedstuffs and/or feedstuff raw materials.
It was found that this problem is solved in that outliers are removed from each set of database vectors prior to the use of the set of database vectors in the similarity analysis with the query vector of an unknown feedstuff raw material and/or feedstuff. An outlier can be the result of a human and/or an instrumental error. A human error is, for example, the erroneous assignment of a vector of a near infrared spectrum of a specific class of feedstuff raw materials and/or feedstuffs to a set of (database) vectors of a different class of feedstuff raw materials and/or feedstuffs. An example of an instrumental error is a measurement of a sample of a feedstuff raw material and/or feedstuff with an infrared spectrometer that is not calibrated correctly or not calibrated at all. Typically, an outlier is an (observed) value, i.e. database vector, that is unusual and not plausible within the context of the other values, i.e. the set of database vectors. The removal of outliers therefore leads to a homogenization of a set of database vectors. Consequently, the likelihood of a wrong assignment is significantly reduced. This increases the precision in the prediction of a feedstuff raw material and/or feedstuff.
Object of the present invention is therefore a computer-implemented method for predicting a feedstuff and/or feedstuff raw material comprising the steps of
a) providing a near infrared spectrum of a sample of an unknown feedstuff raw material and/or feedstuff,
b) transforming absorption intensities of wavelengths or wavenumbers in the spectrum of step a) to give a query vector,
c) providing a set of database vectors of a population of spectra of known feedstuff raw materials and/or feedstuffs, wherein an outlier is removed from the set of database vectors, wherein the step c) further comprises one or more of the options c1) to c4),
d) calculating a similarity measure and/or a distance measure between the query vector of step b) and each database vector of step c) to give a similarity value for each database vector with the query vector,
e) ranking the similarity values obtained in step d) in descending order, when a similarity measure is calculated in step d) or in ascending order, when a distance measure is calculated in step d), wherein in any case the top-ranked database vector has the highest similarity with the query vector, and
f) assigning the feedstuff raw material and/or feedstuff of the database vector with the highest similarity in step e) to the sample of step a).
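Steps d) to f) can be illustrated with a minimal Python sketch. It assumes that the set of database vectors has already been cleaned in step c), that the Cosine coefficient is used as the similarity measure, and that each database vector carries a class label. The function names `cosine` and `predict`, the class labels and the toy intensities are hypothetical and serve illustration only.

```python
import math

def cosine(a, b):
    """Cosine coefficient of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Assumption: a zero-length vector is treated as having similarity 0.
    return dot / norm if norm else 0.0

def predict(query, database):
    """Rank (label, vector) pairs by similarity to the query vector.

    Step d): one similarity value per database vector.
    Step e): descending order, because a similarity measure is used.
    """
    scored = [(label, cosine(query, vector)) for label, vector in database]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored

# Hypothetical toy data: four absorption intensities per spectrum.
database = [
    ("soybean meal", [0.10, 0.40, 0.80, 0.30]),
    ("corn",         [0.70, 0.20, 0.10, 0.60]),
    ("wheat",        [0.60, 0.30, 0.20, 0.50]),
]
query = [0.12, 0.38, 0.79, 0.33]

ranking = predict(query, database)
best_label = ranking[0][0]  # step f): class of the top-ranked database vector
print(best_label)
```

With a distance measure instead of a similarity measure, the same sketch would sort in ascending order, so that the top-ranked entry is again the most similar one.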
In the context of the present invention the term unknown feedstuff raw material and/or feedstuff refers to any kind of feedstuff and/or feedstuff raw material whose identity, composition, origin and/or form, i.e. whether it is ground or unground, is not known. By comparison, in the context of the present invention the term known feedstuff raw material and/or feedstuff refers to any kind of feedstuff and/or feedstuff raw material whose identity, composition, origin and/or form, i.e. whether it is ground or unground, is known. Accordingly, a population of spectra of known feedstuff raw materials and/or feedstuffs is a number or multitude of spectra, which are known to belong to a specific feedstuff and/or feedstuff raw material of known identity, composition, origin and/or form.
According to the present invention a near infrared spectrum of a sample of an unknown feedstuff raw material and/or feedstuff is provided in step a). In the context of the present invention this means that the place where the spectrum to be provided is recorded and the place where the computer-implemented method according to the present invention is performed, can be different or identical. For example, it is possible that a near infrared spectrum of a sample of an unknown feedstuff raw material and/or feedstuff is recorded at one place, and sent in any way to a remote place, where the computer-implemented method according to the present invention is performed. Alternatively, both recording of the spectrum and the prediction of the feedstuff raw material and/or feedstuff based on said spectrum can be performed at the same place.
In an embodiment the step a) of the computer-implemented method comprises the recording of a near infrared spectrum of a sample of an unknown feedstuff raw material and/or feedstuff.
Option c1) for removing outliers involves a pairwise comparison of database vectors. Specifically, this option involves the identification of the pair of database vectors, which, in terms of similarity, are the most distant neighbors or synonymously the most dissimilar to each other in a set of database vectors. Next, the thus identified pair of database vectors is removed from the set of database vectors. This option is illustrated in
Option c2) for removing outliers involves the identification of the database vector, which, in terms of similarity, is the most distant neighbor on average or synonymously the most dissimilar on average to all other database vectors in a set of database vectors. Next, the thus identified database vector is removed from the set of database vectors. This option is illustrated in
Option c3) for removing outliers involves the identification of the database vector, which, in terms of similarity, is the most distant neighbor or synonymously the most dissimilar to all other database vectors in a set of database vectors. Compared to the preceding option, the thus identified database vector is, in absolute terms, the most dissimilar database vector in a set of database vectors, while the preceding option identifies the most dissimilar database vector relative to the dissimilarity of the other database vectors. This option is illustrated in
Option c4) for removing outliers involves the identification of the database vector, which, in terms of similarity, is the most distant neighbor or synonymously the most dissimilar to the centroid of a set of database vectors. Next, the thus identified database vector is removed from the set of database vectors. In mathematics and physics the term centroid, when used in context with a plane figure, denotes the arithmetic mean position of all points in the figure; it is therefore also referred to as the geometric center of said figure. Hence, it is also the point at which a cutout of the shape could be perfectly balanced on the tip of a pin. When the figure is extended to an object in a multidimensional space, the term centroid denotes the mean position of all points of said object in all coordinate directions. In the context of the present invention, the term centroid of a set of database vectors therefore denotes the arithmetic mean position of all points of the database vectors in all coordinate directions. This option is illustrated in
To ensure the greatest possible precision in prediction, it is preferred to apply one or more of the options c1) to c4) to each set of database vectors. This has the benefit that the entirety of all sets of database vectors is homogenized and not only a single set of database vectors.
The four options can be used either alone or in combination. When two or more options are used, it is possible to subject a set of database vectors sequentially or in parallel to two or more options. In the first case, a set of database vectors is subjected to a first option, and the thus obtained set of database vectors free of the removed database vector is subjected to a second or further option. Alternatively, it is also possible to subject a set of database vectors to two or more options in parallel and to compare the results of the thus obtained sets of database vectors, which are free of the removed database vectors, and to continue with the set of database vectors, which is considered the most suitable for the method according to the present invention, for example because the most outliers were removed from said set of database vectors. It is preferred to use all four options in parallel, to compare the results of the four options, and to remove a database vector only when at least 2 options, in particular at least 3 options, indicate it as the most dissimilar.
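The preferred parallel use of the options can be sketched as a simple vote. The sketch below assumes that each option has already been run separately and has reported the indices of the database vectors it considers most dissimilar; the function name `vectors_to_remove`, the parameter `min_votes` and the example indices are hypothetical.

```python
from collections import Counter

def vectors_to_remove(flagged_by_option, min_votes=2):
    """Return indices flagged as most dissimilar by at least `min_votes` options.

    flagged_by_option maps an option name (e.g. "c1") to the indices of the
    database vectors that this option indicated as most dissimilar.
    """
    votes = Counter()
    for indices in flagged_by_option.values():
        votes.update(set(indices))  # each option votes at most once per vector
    return sorted(index for index, count in votes.items() if count >= min_votes)

# Hypothetical outcome: c1 flags a pair, the other options one vector each.
flagged = {"c1": [3, 7], "c2": [7], "c3": [5], "c4": [7]}

agreed_by_two = vectors_to_remove(flagged, min_votes=2)
agreed_by_three = vectors_to_remove(flagged, min_votes=3)
agreed_by_four = vectors_to_remove(flagged, min_votes=4)
print(agreed_by_two, agreed_by_three, agreed_by_four)
```

In this toy outcome only vector 7 is indicated by at least two (here three) options, so it would be the only vector removed; vectors 3 and 5 stay in the set because only one option each indicated them.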
In an embodiment of the computer-implemented method according to the present invention the step c1) comprises the steps of
c1a) calculating a similarity measure and/or a distance measure of each database vector in a set of database vectors to the other database vectors in said set of database vectors to give similarity values of pairs of database vectors,
c1b) ranking the similarity values obtained in step c1a) in descending order, when a similarity measure is calculated in step c1a), or in ascending order, when a distance measure is calculated in step c1a), wherein in any case the bottom-ranked similarity value relates to the two database vectors being the most dissimilar to each other, and
c1c) removing at least the two database vectors with the lowest ranking in step c1b) from the set of database vectors.
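Steps c1a) to c1c) can be sketched as follows, assuming the Euclidean distance as the distance measure. The function names `euclidean` and `remove_most_dissimilar_pair` and the toy vectors are hypothetical.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance of two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def remove_most_dissimilar_pair(vectors):
    """Remove the two vectors that are the most distant from each other."""
    # c1a) one distance value per pair of database vectors
    pairs = [
        (euclidean(vectors[i], vectors[j]), i, j)
        for i, j in combinations(range(len(vectors)), 2)
    ]
    # c1b) ascending order for a distance measure: most dissimilar pair is last
    pairs.sort()
    _, i, j = pairs[-1]
    # c1c) remove both members of the bottom-ranked pair
    cleaned = [v for k, v in enumerate(vectors) if k not in (i, j)]
    return cleaned, (i, j)

# Hypothetical toy set: three similar vectors and one clearly deviating vector.
vectors = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.0], [5.0, 5.0]]
cleaned, removed = remove_most_dissimilar_pair(vectors)
print(removed, cleaned)
```

In this toy set the removed pair consists of the deviating vector and the ordinary vector lying farthest from it, which shows that this option on its own can also discard a plausible vector.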
In another embodiment of the computer-implemented method according to the present invention the step c2) comprises the steps of
c2a) calculating a similarity measure and/or a distance measure of each database vector in a set of database vectors to the other database vectors in said set of database vectors to give similarity values of a database vector to the other database vectors,
c2b) forming the sum of the similarity values obtained for each database vector in step c2a), and calculating the average similarity value for each database vector,
c2c) ranking the average similarity values obtained in step c2b) in descending order, when a similarity measure is calculated in step c2a), or in ascending order, when a distance measure is calculated in step c2a), wherein in any case the bottom-ranked average similarity value relates to the database vector being the most dissimilar on average to all other database vectors, and
c2d) removing at least the database vector with the lowest ranking in step c2c) from the set of database vectors.
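Steps c2a) to c2d) can be sketched as follows, assuming the Cosine coefficient as the similarity measure and a set with at least two database vectors. The function names `cosine` and `remove_most_dissimilar_on_average` and the toy vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine coefficient of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Assumption: a zero-length vector is treated as having similarity 0.
    return dot / norm if norm else 0.0

def remove_most_dissimilar_on_average(vectors):
    """Remove the vector with the lowest average similarity to all others."""
    n = len(vectors)  # assumed to be at least 2
    averages = []
    for i in range(n):
        # c2a) similarity of vector i to every other vector; c2b) sum and average
        total = sum(cosine(vectors[i], vectors[j]) for j in range(n) if j != i)
        averages.append(total / (n - 1))
    # c2c) with a similarity measure, the lowest average is ranked at the bottom
    worst = min(range(n), key=averages.__getitem__)
    # c2d) remove the bottom-ranked vector
    cleaned = [v for k, v in enumerate(vectors) if k != worst]
    return cleaned, worst

# Hypothetical toy set: the last vector points in a clearly different direction.
vectors = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.15], [0.1, 1.0]]
cleaned, worst = remove_most_dissimilar_on_average(vectors)
print(worst, len(cleaned))
```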
In another embodiment of the computer-implemented method according to the present invention the step c3) comprises the steps of
c3a) calculating a similarity measure and/or a distance measure of each database vector in a set of database vectors to the other database vectors in said set of database vectors to give similarity values of a database vector to the other database vectors,
c3b) ranking the similarity values obtained in step c3a) in descending order, when the similarity measure is calculated in step c3a), or in ascending order, when a distance measure is calculated in step c3a), wherein in any case the bottom-ranked similarity value relates to the database vector being the most dissimilar to all other database vectors,
c3c) counting the occurrence of a feedstuff raw material and/or feedstuff among the database vectors in the ranking of step c3b),
c3d) weighting each feedstuff raw material and/or feedstuff according to its ranking in step c3b) and according to the frequency of its occurrence in step c3c) to give weighted rank positions of the feedstuff raw materials and/or feedstuffs, and
c3e) forming the sum of the weighted rank positions of a feedstuff raw material and/or feedstuff of step c3d), and
c3f) removing at least the database vector of the feedstuff raw material and/or feedstuff with the lowest weighted rank position in step c3e) from the set of database vectors.
In yet another embodiment of the computer-implemented method according to the present invention the step c4) comprises the steps of
c4a) determining the centroid of all database vectors in a set of database vectors,
c4b) calculating a similarity measure and/or a distance measure of each database vector to the centroid of step c4a) to give a similarity value for each database vector to the centroid,
c4c) ranking the similarity values obtained in step c4b) in descending order, when a similarity measure is calculated in step c4b), or in ascending order, when a distance measure is calculated in step c4b), wherein in any case the bottom-ranked similarity value relates to the database vector being the most dissimilar to the centroid, and
c4d) removing at least the database vector with the lowest ranking in step c4c) from the set of database vectors.
The centroid can be calculated according to any suitable procedure known in the art. For example, the centroid can be calculated by taking all database vectors in a set of database vectors, summing up the values at each of the positions 1 to n over all vectors, and dividing the sum at each position by the number of vectors.
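The centroid calculation described above and steps c4a) to c4d) can be sketched as follows, assuming the Euclidean distance as the distance measure. The function names `centroid`, `euclidean` and `remove_farthest_from_centroid` and the toy vectors are hypothetical.

```python
import math

def centroid(vectors):
    """Position-wise arithmetic mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    # zip(*vectors) groups the values found at the same position in each vector
    return [sum(position) / count for position in zip(*vectors)]

def euclidean(a, b):
    """Euclidean distance of two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def remove_farthest_from_centroid(vectors):
    """Remove the vector that is the most distant from the centroid."""
    center = centroid(vectors)                            # c4a)
    distances = [euclidean(v, center) for v in vectors]   # c4b)
    # c4c) with a distance measure, the largest distance is ranked at the bottom
    worst = max(range(len(vectors)), key=distances.__getitem__)
    # c4d) remove the bottom-ranked vector
    cleaned = [v for k, v in enumerate(vectors) if k != worst]
    return cleaned, worst

# Hypothetical toy set with one deviating vector.
vectors = [[1.0, 2.0], [3.0, 4.0], [2.0, 3.0], [10.0, 11.0]]
center = centroid(vectors)
cleaned, worst = remove_farthest_from_centroid(vectors)
print(center, worst)
```

Only one distance per database vector is needed here, whereas the pairwise options compare every vector with every other vector.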
In the context of the present invention the term cleaning up a set of database vectors or cleaning up a dataset is used equivalently to the expression removing a spectral outlier from a set of database vectors or dataset or removing a database vector meeting any of the requirements in step c1), c2), c3) and/or c4).
The method according to the present invention is not limited regarding a specific distance or similarity measure for analyzing the similarity between the query vector of step b) and the database vectors of step c) and for analyzing the similarity within a set of database vectors in steps c1a), c2a), c3a), and/or c4b). Therefore, any distance or similarity measure which is suited to determine the similarity of the vectors of step b) with the vectors of step c) can be used in the method according to the present invention. In principle, a similarity analysis is based on a nearest neighbor search. It was found that the Cosine coefficient, which allows the similarity between two vectors to be calculated extremely rapidly and with high precision, is a particularly suitable similarity measure for a nearest neighbor search in the method according to the present invention. The Cosine coefficient of two vectors A and B is represented by the following formula
$$\cos(A,B) = \frac{\sum_{j=1}^{n} x_{jA}\, x_{jB}}{\sqrt{\sum_{j=1}^{n} x_{jA}^{2}}\;\sqrt{\sum_{j=1}^{n} x_{jB}^{2}}}$$
where $x_{jA}$ and $x_{jB}$ are components of the vectors A and B, respectively, and n is the number of dimensions, here the number of absorption intensities at specific wavelengths or wavenumbers. The values for the similarity range from −1, meaning exactly opposite to each other, to 1, meaning exactly the same, with 0 indicating orthogonality (decorrelation), and in-between values indicating intermediate similarity or dissimilarity.
Alternatively, the similarity between the vectors can also be calculated by means of a distance measure. For example, the Euclidean distance, which allows the calculation of the similarity between two vectors extremely rapidly and precisely, is particularly suitable in the method according to the present invention. The Euclidean distance of two vectors A and B is represented by the following formula
$$d(A,B) = \sqrt{\sum_{j=1}^{n} \left(x_{jA} - x_{jB}\right)^{2}}$$
where $x_{jA}$ and $x_{jB}$ are components of the vectors A and B, respectively, and n is the number of dimensions, here the number of absorption intensities at specific wavelengths or wavenumbers.
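The two measures can be written in a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, and returning 0 for a zero-length vector in the Cosine coefficient is an assumption that the text does not specify.

```python
import math

def cosine(a, b):
    """Cosine coefficient: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Assumption: a zero-length vector is treated as having similarity 0.
    return dot / norm if norm else 0.0

def euclidean(a, b):
    """Euclidean distance: square root of the summed squared component differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])       # perpendicular vectors
opposite = cosine([1.0, 0.0], [-1.0, 0.0])        # antiparallel vectors
distance = euclidean([0.0, 0.0], [3.0, 4.0])      # 3-4-5 triangle
print(same_direction, orthogonal, opposite, distance)
```

The first pair of vectors differs only in length, so the Cosine coefficient treats them as equal in direction while their Euclidean distance is non-zero. This is why the ranking direction (descending or ascending) has to follow the type of measure used.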
Advantageously, any database vectors with a similarity value of 0 are directly removed from the set of database vectors, which allows for a more efficient removal of outliers and thus an even more precise prediction of feedstuff raw materials and/or feedstuffs.
In a preferred embodiment of the computer-implemented method according to the present invention a database vector with a similarity value of 0 is removed from the set of database vectors in step c1b), c2c), c3b), and/or c4c).
It is preferred that in step c) of the computer-implemented method according to the present invention the similarity measure is the Cosine coefficient and the distance measure is the Euclidean distance.
In its broadest meaning a vector is a geometric object that has magnitude (or length) and direction. In a Cartesian coordinate system, a vector can be represented by identifying the coordinates of its initial and terminal point. Therefore, a vector is suited to represent an absorption intensity at a specific wavelength or wavenumber in a two-dimensional near infrared spectrum. In addition, a vector is not limited to the description of a two-dimensional system. Rather, a vector can describe multi-dimensional spaces, such as a near infrared spectrum with a multitude of absorption intensities at a multitude of different wavelengths or wavenumbers. In this case, each dimension of the said vector corresponds to a single absorption intensity at a specific wavelength or wavenumber.
In one embodiment of the computer-implemented method according to the present invention the vector in steps b) and c) is a multi-dimensional vector, with each dimension corresponding to an absorption intensity of a specific wavelength or wavenumber.
Like the query vector of the spectrum of an unknown feedstuff raw material and/or feedstuff, the set of database vectors provided in step c) of the computer-implemented method according to the present invention is obtained by transforming each spectrum of a population of spectra of known feedstuff raw materials and/or feedstuffs into the corresponding vector. If the set of database vectors is not already present, the step c) also comprises the transformation of each spectrum of a population of spectra of known feedstuff raw materials and/or feedstuffs into the corresponding vector to give the set of database vectors. In that case the step c) of the computer-implemented method according to the present invention comprises the steps of transforming each spectrum of a population of spectra of known feedstuff raw materials and/or feedstuffs to the corresponding vector to give a set of database vectors and providing the thus obtained set of database vectors of a population of spectra of known feedstuff raw materials and/or feedstuffs.
It is therefore also possible to remove an outlier already from the infrared spectra of known feedstuff raw materials and/or feedstuffs which are to be transformed into the set of database vectors. In this case, the step c1), c2), c3), and/or c4), preferably the steps c1a) to c1c), c2a) to c2d), c3a) to c3f) and/or c4a) to c4d) are carried out with the infrared spectra of a population of known feedstuff raw materials and/or feedstuffs. Next, the thus cleaned infrared spectra are transformed into vectors to give the set of database vectors.
In this alternative embodiment of the computer-implemented method according to the present invention the step c) therefore comprises the steps of
c1′) removing a pair of infrared spectra being the most dissimilar to each other in a population of infrared spectra from said population of infrared spectra,
c2′) removing an infrared spectrum being the most dissimilar on average to the other infrared spectra in a population of infrared spectra from said population of infrared spectra,
c3′) removing an infrared spectrum being the most dissimilar to all other infrared spectra in a population of infrared spectra from said population of infrared spectra, and/or
c4′) removing an infrared spectrum being the most dissimilar to the centroid of a population of infrared spectra from said population of infrared spectra.
To ensure the greatest possible precision in prediction, it is preferred to apply this option to each set of database vectors. This has the benefit that the entirety of all sets of database vectors is homogenized and not only a single set of database vectors.
In a preferred embodiment of the computer-implemented method according to the present invention the step c1′)—prior to removing a pair of infrared spectra being the most dissimilar to each other—further comprises the steps of
c1a′) calculating a similarity measure and/or a distance measure of each infrared spectrum to the other infrared spectra in a population of infrared spectra to give similarity values of pairs of infrared spectra,
c1b′) ranking the similarity values obtained in step c1a′) in descending order, when a similarity measure is calculated in step c1a′), or in ascending order, when a distance measure is calculated in step c1a′), wherein in any case the bottom-ranked similarity value relates to the two infrared spectra being the most dissimilar to each other, and
c1c′) indicating at least the two infrared spectra with the lowest ranking in step c1b′) as the most dissimilar to each other in the population of infrared spectra.
In another preferred embodiment of the computer-implemented method according to the present invention the step c2′)—prior to removing an infrared spectrum being the most dissimilar on average to the other infrared spectra—further comprises the steps of
c2a′) calculating a similarity measure and/or a distance measure of each infrared spectrum to the other infrared spectra in a population of infrared spectra to give similarity values of an infrared spectrum to the other infrared spectra,
c2b′) forming the sum of the similarity values obtained for each infrared spectrum in step c2a′), and calculating the average similarity value for each infrared spectrum,
c2c′) ranking the average similarity values obtained in step c2b′) in descending order, when a similarity measure is calculated in step c2a′), or in ascending order, when a distance measure is calculated in step c2a′), wherein in any case the bottom-ranked average similarity value relates to the infrared spectrum being the most dissimilar to all other infrared spectra, and
c2d′) indicating at least the infrared spectrum with the lowest ranking in step c2c′) as the most dissimilar to all other infrared spectra in the population of infrared spectra.
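A minimal sketch of steps c2a′) to c2d′), again assuming spectra given as equally long lists of absorption intensities and the Euclidean distance as the distance measure; the function name is hypothetical.

```python
import math


def most_dissimilar_on_average(spectra):
    """Return the index of the spectrum with the largest mean distance to the others."""
    n = len(spectra)
    averages = []
    for i in range(n):
        # c2a') distances of spectrum i to all other spectra (assumed: Euclidean)
        total = sum(math.dist(spectra[i], spectra[j]) for j in range(n) if j != i)
        # c2b') sum of the values and average per spectrum
        averages.append((total / (n - 1), i))
    # c2c') ascending order for a distance measure; bottom rank = most dissimilar
    averages.sort()
    # c2d') indicate the bottom-ranked spectrum
    return averages[-1][1]
```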
In another preferred embodiment of the computer-implemented method according to the present invention the step c3′)—prior to removing an infrared spectrum being the most dissimilar to all other infrared spectra—further comprises the steps of
c3a′) calculating a similarity measure and/or a distance measure of each infrared spectrum to the other infrared spectra in a population of infrared spectra to give similarity values of an infrared spectrum to the other infrared spectra,
c3b′) ranking the similarity values obtained in step c3a′) in descending order, when the similarity measure is calculated in step c3a′), or in ascending order, when a distance measure is calculated in step c3a′), wherein in any case the bottom-ranked similarity value relates to the infrared spectrum being the most dissimilar to all other infrared spectra,
c3c′) counting the occurrence of a feedstuff raw material and/or feedstuff in the ranking of step c3b′),
c3d′) weighting each feedstuff raw material and/or feedstuff according to its ranking in step c3b′) and according to the frequency of its occurrence in step c3c′) to give weighted rank positions of the feedstuff raw materials and/or feedstuffs,
c3e′) forming the sum of the weighted rank positions of a feedstuff raw material and/or feedstuff, and
c3f′) indicating at least the infrared spectrum of the feedstuff raw material and/or feedstuff as the most dissimilar to all other infrared spectra in the population of infrared spectra.
In yet another preferred embodiment of the computer-implemented method according to the present invention the step c4′), prior to removing an infrared spectrum being the most dissimilar to the centroid, further comprises the steps of
c4a′) determining the centroid of all infrared spectra in a population of infrared spectra,
c4b′) calculating a similarity measure and/or a distance measure of each infrared spectrum to the centroid of step c4a′) to give a similarity value for each infrared spectrum to the centroid,
c4c′) ranking the similarity values obtained in step c4b′) in descending order, when a similarity measure is calculated in step c4b′), or in ascending order, when a distance measure is calculated in step c4b′), wherein in any case the bottom-ranked similarity value relates to the infrared spectrum being the most dissimilar to the centroid, and
c4d′) indicating at least the infrared spectrum with the lowest ranking in step c4c′) as the most dissimilar to the centroid from the population of infrared spectra.
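Steps c4a′) to c4d′) can be illustrated with the following sketch. It assumes that the centroid is the element-wise mean of all spectra in the population and that the Euclidean distance is used; both choices and the function names are assumptions made for illustration.

```python
import math


def centroid(spectra):
    """c4a') element-wise mean of all spectra (assumed definition of the centroid)."""
    n = len(spectra)
    return [sum(column) / n for column in zip(*spectra)]


def most_dissimilar_to_centroid(spectra):
    """Return the index of the spectrum that is farthest from the centroid."""
    c = centroid(spectra)
    # c4b') distance of each spectrum to the centroid
    # c4c') ascending order for a distance measure; bottom rank = most dissimilar
    ranked = sorted((math.dist(s, c), i) for i, s in enumerate(spectra))
    # c4d') indicate the bottom-ranked spectrum
    return ranked[-1][1]
```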
The identification of outliers is facilitated by taking a derivative, preferably the first derivative, of the spectra. Before taking the first derivative, the spectrum in question is typically subjected to a standardizing procedure, such as standard normal variate (SNV), detrending, multiplicative scatter correction (MSC) or extended multiplicative signal correction (EMSC). Detrending (baseline correction) is performed by subtracting a linear or polynomial fit of the baseline from the original spectrum to remove the tilted baseline variation usually found in NIR reflectance spectra of powdered samples. Standard normal variate is another frequently used pretreatment method due to its simple algorithm and its effectiveness in scatter correction. SNV is often used on spectra where baseline and pathlength changes cause differences between otherwise identical spectra. Multiplicative scatter correction is achieved by regressing a measured spectrum against a reference spectrum and then correcting the measured spectrum using the slope and intercept of this linear fit. This pretreatment method has proven to be effective in minimizing baseline offsets and multiplicative effects. The outcome of MSC is, in many cases, very similar to that of SNV. Nevertheless, many spectroscopists prefer SNV over MSC since SNV corrects each spectrum individually and does not need the entire data set. The extended multiplicative signal correction preprocessing method allows a separation of physical light-scattering effects from chemical absorbance effects in spectra from, for example, powders or turbid solutions. This model-based method is particularly useful in minimizing wavelength-dependent light scattering variation. After pretreatment the corrected spectra become insensitive to light scattering variations and respond linearly to the analyte concentration. The mathematical description of EMSC is given below.
A measured spectrum can be approximated by the sum of a baseline offset, the ideal chemical absorbance according to Beer's law, and wavelength-dependent variations, and written as
$$x_i \approx a_i + b_i\,x_{i,\mathrm{chem}} + d_i\,\lambda + e_i\,\lambda^2$$
where a: baseline offset; b: pathlength; d and e: wavelength-dependent variation. The corrected spectrum is
$$x_{i,\mathrm{corrected}} = \frac{x_i - d_i\,\lambda - e_i\,\lambda^2}{b_i}$$
Through EMSC parameter estimation, an EMSC-corrected spectrum can be obtained, with only chemical absorbance part left after removal of baseline offset and wavelength-dependent variations.
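One conceivable way to carry out the EMSC parameter estimation is an ordinary least-squares fit of the measured spectrum against the four terms of the model. The sketch below is an assumption-laden illustration, not the estimation procedure of the invention: it uses a reference spectrum (for example a mean spectrum) as a stand-in for the chemical absorbance term, expects the wavelength axis to be rescaled to roughly the range from −1 to 1 for numerical stability, and follows the corrected-spectrum formula as printed above, so the estimated offset a is returned but not subtracted. All function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[k]] for k, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def emsc_correct(x, reference, wavelengths):
    """Estimate a, b, d, e by least squares and return the corrected spectrum.

    Assumption: `reference` stands in for the chemical absorbance term and
    `wavelengths` is rescaled (e.g. to [-1, 1]) before calling.
    """
    # Regressors of the model: 1, reference, lambda, lambda^2
    columns = [
        [1.0] * len(x),
        list(reference),
        list(wavelengths),
        [w * w for w in wavelengths],
    ]
    # Normal equations (A^T A) p = A^T x
    ata = [[sum(p * q for p, q in zip(ci, cj)) for cj in columns] for ci in columns]
    atx = [sum(p * q for p, q in zip(ci, x)) for ci in columns]
    a, b, d, e = solve_linear_system(ata, atx)
    # Corrected spectrum as in the formula above: (x - d*lambda - e*lambda^2) / b
    corrected = [(xi - d * w - e * w * w) / b for xi, w in zip(x, wavelengths)]
    return corrected, (a, b, d, e)
```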
Cases may arise in which the position of a signal peak in a spectrum, in step b) and/or in step c) of the method according to the present invention, cannot be located because the maxima and minima of the individual peaks cannot be clearly identified. Individual peaks are easier to locate when their minima and maxima are easier to identify. Taking the first derivative of a spectrum facilitates the identification of the peaks because the first derivative crosses zero at the positions of the peak maxima and peak minima. Taking the second derivative gives a minimum at exactly the position where the original spectrum had a peak maximum, and vice versa. Taking the first or second derivative of a spectrum also facilitates the identification of an outlier in the population of spectra of known feedstuff raw materials and/or feedstuffs.
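The pretreatment and the derivative can be sketched as follows. This is a simplified illustration that assumes an equidistant wavelength grid, uses the sample standard deviation for SNV and approximates the first derivative by finite differences; in practice a smoothing derivative filter may be preferable. The function names are hypothetical.

```python
import statistics


def snv(spectrum):
    """Standard normal variate: centre and scale a single spectrum."""
    mean = statistics.fmean(spectrum)
    sd = statistics.stdev(spectrum)  # sample standard deviation (assumed)
    return [(v - mean) / sd for v in spectrum]


def first_derivative(spectrum, step=1.0):
    """Finite-difference first derivative on an equidistant grid.

    Central differences inside the spectrum, one-sided differences at the ends.
    """
    n = len(spectrum)
    d = [0.0] * n
    d[0] = (spectrum[1] - spectrum[0]) / step
    d[-1] = (spectrum[-1] - spectrum[-2]) / step
    for k in range(1, n - 1):
        d[k] = (spectrum[k + 1] - spectrum[k - 1]) / (2 * step)
    return d


def peak_maxima(spectrum, step=1.0):
    """Approximate peak positions: the derivative changes from positive to non-positive."""
    d = first_derivative(spectrum, step)
    return [k for k in range(1, len(d)) if d[k - 1] > 0 and d[k] <= 0]
```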
In another embodiment of the method according to the present invention a derivative is formed of the spectrum of the unknown feedstuff raw material and/or feedstuff of step a) and/or of the spectra of known feedstuff raw materials and/or feedstuffs prior to their transformation into vectors for step c).
Preferably, the first derivative is formed of the spectrum of a feedstuff and/or feedstuff raw material of unknown type of step a) and/or of the spectra of known feedstuff raw materials and/or feedstuffs prior to their transformation into vectors for step c).
According to step b) of the present invention absorption intensities of wavelengths or wavenumbers in a spectrum are transformed to give a query vector. In principle, one could select the strongest and therefore most meaningful absorption intensities in a spectrum and transform only said absorption intensities to give a vector. However, this would require a thorough analysis of each individual spectrum of a sample substance, which is not only time-consuming but also requires a very good understanding of near infrared spectra. Hence, this approach would not be suitable for a routine analysis. Further, this approach has the disadvantage that meaningful but relatively weak absorption intensities in a spectrum may be ignored so that information would be lost. This could ultimately lead to a wrong assignment of the unknown feedstuff raw material and/or feedstuff. It is therefore favorable to consider as much information as possible in the spectrum without a preceding in-depth analysis of the spectrum. It is therefore preferred to transform the absorption intensities of equidistant wavelengths or wavenumbers in a spectrum, i.e. in step b) and/or c), to give a vector of said spectrum. In order to allow the best possible similarity analysis between the query vector and the database vectors, it is preferred to transform the absorption intensities of equidistant wavelengths or wavenumbers in the spectra of both step b) and step c) of the method according to the present invention to give a vector of said spectrum or spectra. Preferably, the distances of the absorption intensities being transformed to vectors in step b) are identical with the distances of the absorption intensities transformed to vectors in step c). This allows for a higher precision in the prediction of the computer-implemented method according to the present invention, even without having any specific knowledge of the sample substance and its spectra at all.
In one embodiment of the computer-implemented method according to the present invention the absorption intensities of equidistant wavelengths or wavenumbers in a spectrum are transformed in step b) and/or c) to give a vector of said spectrum.
In a further embodiment of the computer-implemented method according to the present invention the distances of the absorption intensities being transformed to vectors in step b) are identical with the distances of the absorption intensities transformed to vectors in step c).
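The transformation of a spectrum into a vector on an equidistant grid can be sketched as follows. The sketch assumes that a recorded spectrum is given as ascending wavelengths with matching absorption intensities and that intensities between recorded points may be obtained by linear interpolation; the interpolation and the function names are assumptions for illustration. Using the same grid for the query spectrum and for the database spectra yields vectors with identical distances between the absorption intensities.

```python
def equidistant_grid(start, stop, step):
    """Equidistant wavelengths from start to stop (inclusive) with the given step."""
    count = int(round((stop - start) / step)) + 1
    return [start + k * step for k in range(count)]


def to_vector(wavelengths, intensities, grid):
    """Linearly interpolate a recorded spectrum onto the grid.

    Assumes ascending wavelengths and a grid inside the recorded range.
    """
    vector = []
    j = 0
    for g in grid:
        # advance to the recorded interval [wavelengths[j], wavelengths[j + 1]] containing g
        while j < len(wavelengths) - 2 and wavelengths[j + 1] < g:
            j += 1
        x0, x1 = wavelengths[j], wavelengths[j + 1]
        y0, y1 = intensities[j], intensities[j + 1]
        vector.append(y0 + (y1 - y0) * (g - x0) / (x1 - x0))
    return vector
```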
Preferably, the absorption intensities of wavelengths or wavenumbers in a spectrum, which are transformed to give a vector of said spectrum, have small distances between each other. This has the advantage that most if not all relevant absorption intensities, i.e. information, of a spectrum are transformed to a vector of said spectrum. This is believed to allow a very precise transformation of all relevant information of a spectrum into vectors, even without having knowledge of the feedstuff and/or feedstuff raw material whose spectrum was recorded, in particular of its identity, composition, origin and/or form. Preferably, the distance between the wavelengths in step b) of the method according to the present invention is from 0.1+/−10% to 10+/−10% nm, from 0.1+/−10% to 5+/−10% nm, or from 0.1+/−10% to 2+/−10% nm. Accordingly, the distance between the wavenumbers in step b) of the method according to the present invention is from 10^8+/−10% to 10^6+/−10% cm−1, from 10^8+/−10% to 5*10^6+/−10% cm−1, or from 10^8+/−10% to 2*10^6+/−10% cm−1. In the context of the present invention, the term +/−10% is used with respect to explicitly mentioned values to indicate that deviations from said explicitly mentioned values are still within the scope of the present invention, provided that they essentially lead to the effects of the present invention. The distances between the wavelengths or wavenumbers in step c) of the method according to the present invention are preferably the same as those in step b), in order to provide for the best possible comparison between the recorded spectrum of an unknown feedstuff and/or feedstuff raw material and the spectra of known feedstuffs and/or feedstuff raw materials.
In an embodiment of the computer-implemented method according to the present invention, the distance between the wavelengths or wavenumbers in step b) and/or step c) is from 0.1 nm+/−10% to 10 nm+/−10% or from 10^8 cm−1+/−10% to 10^6 cm−1+/−10%.
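Reading the wavenumber bounds as powers of ten (10^8 and 10^6 cm−1), they appear to be the reciprocals of the wavelength bounds of 0.1 nm and 10 nm. Under that reading, which is an interpretation and not stated explicitly in the text, the conversion is:

$$\tilde{\nu}\,[\mathrm{cm^{-1}}] = \frac{10^{7}}{\lambda\,[\mathrm{nm}]}, \qquad \lambda = 0.1\ \mathrm{nm} \;\Rightarrow\; \tilde{\nu} = 10^{8}\ \mathrm{cm^{-1}}, \qquad \lambda = 10\ \mathrm{nm} \;\Rightarrow\; \tilde{\nu} = 10^{6}\ \mathrm{cm^{-1}}$$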
In principle, the computer-implemented method according to the present invention is not subject to any limitation regarding the number of absorption intensities to be transformed to give a vector.
Rather, the amount of relevant information in a spectrum of a feedstuff raw material and/or feedstuff strongly depends on the individual feedstuff raw material and/or feedstuff, and in particular, on its composition and components. The more complex a feedstuff and/or feedstuff raw material is, i.e. the more components a feedstuff and/or feedstuff raw material contains, the more information is required from a near infrared spectrum for predicting an unknown feedstuff and/or feedstuff raw material. Again, it would not be practical to perform an in-depth analysis in order to find out which absorption intensities must necessarily be transformed to give a vector. A suitable option for the number of absorption intensities to be transformed to give a vector is to correlate them with the distance between the corresponding wavelengths or wavenumbers, e.g. from 0.1 nm+/−10% to 10 nm+/−10% or 0.1+/−10% to 2+/−10% nm, and the recording range of the spectrum, e.g. from 1,100 to 2,500 nm. Preferably, the number of absorption intensities to be transformed into a vector is at least 100, in particular, said number ranges from 150+/−10% to 15,000+/−10% or from 700+/−10% to 15,000+/−10%.
In another embodiment of the method according to the present invention the number of absorption intensities in each spectrum being transformed to a vector is 100+/−10% or more.
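The relation between recording range, wavelength distance and vector length can be checked with a short calculation. The sketch assumes that both ends of the recording range are included; the function name is hypothetical.

```python
def number_of_intensities(start_nm, stop_nm, step_nm):
    """Number of equidistant points from start to stop, both ends included."""
    return int(round((stop_nm - start_nm) / step_nm)) + 1


# Recording range 1,100 to 2,500 nm, as in the example range given above
print(number_of_intensities(1100, 2500, 10))   # 141
print(number_of_intensities(1100, 2500, 2))    # 701
print(number_of_intensities(1100, 2500, 0.1))  # 14001
```

These values are of the same order as the ranges of 150+/−10% to 15,000+/−10% and 700+/−10% to 15,000+/−10% mentioned above.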
In some cases, it is preferred that a similarity analysis is performed first, followed by counting the occurrences of a feedstuff raw material and/or feedstuff in the ranking of similarity values. Next, the thus determined number of similarity values of the feedstuff raw material and/or feedstuff is weighted according to rank position to give weighted rank positions, and the sum of the weighted rank positions is formed to give scores of the feedstuff raw materials and/or feedstuffs. The highest score indicates the feedstuff raw material and/or feedstuff with the greatest similarity to the sample substance.
In one embodiment the step e) of the method according to the present invention comprises the steps of
e1) counting the number of occurrences of each of the feedstuff raw materials and/or feedstuffs among the top-ranked database vectors in the ranking of step e), wherein said number of occurrences is indicated by the variable N,
e2) weighting the first N similarity values of each of the feedstuff raw materials and/or feedstuffs according to their position in the ranking of step e1) to give weighted rank positions of each of the feedstuff raw materials and/or feedstuffs, and
e3) forming the sum of the weighted rank positions of step e2) for each of the feedstuff raw materials and/or feedstuffs to give scores of each of the feedstuff raw materials and/or feedstuffs, wherein the highest score indicates the highest similarity.
In this case the feedstuff raw material and/or feedstuff of the database vector with the highest similarity in step e3) is assigned to the sample of step a).
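Steps e1) to e3) can be illustrated by the following sketch. The text does not specify the weighting scheme or how many top-ranked database vectors are considered, so the sketch assumes a linearly decreasing weight (the best rank receives the weight `top_k`, the next one `top_k - 1`, and so on) and a hypothetical parameter `top_k`. The Euclidean distance and the function name are likewise assumptions.

```python
import math
from collections import defaultdict


def predict_material(query, database, top_k=5):
    """Score materials among the top_k nearest database vectors.

    database: list of (material_label, vector) tuples.
    Returns the best label, the scores and the occurrence counts N.
    """
    # Ranking of the database vectors by distance to the query vector
    ranked = sorted(database, key=lambda item: math.dist(query, item[1]))[:top_k]
    counts = defaultdict(int)
    scores = defaultdict(float)
    for position, (label, _) in enumerate(ranked):
        counts[label] += 1                 # e1) number of occurrences N
        scores[label] += top_k - position  # e2) rank weight (assumed linear), e3) sum
    best = max(scores, key=scores.get)     # the highest score indicates the highest similarity
    return best, dict(scores), dict(counts)
```

With such a scheme a material that occurs several times among the top ranks can outscore a material that occupies only the first rank, which is the intended effect of combining counting and weighting.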
The population of spectra of known feedstuff raw materials and/or feedstuffs used in the method according to the present invention is not limited to specific feedstuff raw materials and/or feedstuffs. Rather, it preferably comprises spectra of all feedstuff raw materials and/or feedstuffs used in animal nutrition, preferably in the nutrition of poultry, pigs and/or animals kept in aquacultures, such as fish and/or crustacea. The spectra of the feedstuff raw materials and/or feedstuffs may significantly differ depending on their form or appearance, e.g. when they are present in ground or unground form. Therefore, the spectra population preferably also comprises spectra, which were recorded from the aforementioned feedstuff raw materials and/or feedstuffs in ground and/or unground form.
In another embodiment of the method according to the present invention the population of spectra of known feedstuffs and/or feedstuff raw materials in step c) of said method comprises spectra of all feedstuffs and/or feedstuff raw materials in ground and/or unground form used in animal nutrition.
In principle, the method according to the present invention is not limited in any way regarding the number and types of feedstuffs and/or feedstuff raw materials, whose spectra, recorded in ground and/or unground form, make up the spectra population. Notwithstanding, it is preferred that the population of spectra of known feedstuffs and/or feedstuff raw materials in step c) of said method comprises spectra of all feedstuffs and/or feedstuff raw materials in ground and/or unground form used in animal nutrition, preferably in the nutrition of poultry, pigs and/or animals kept in aquacultures, such as fish and/or crustacea. The feedstuff raw material and/or feedstuff can be of animal and/or vegetable origin. Particularly preferred feedstuffs and/or feedstuff raw materials are unprocessed and/or processed feedstuff raw materials and/or feedstuffs. Processed feedstuff raw materials and/or feedstuffs are those which were subjected to any type of heat or pressure treatment in order to remove or detoxify anti-nutritive factors. Preferred feedstuffs and/or feedstuff raw materials are oilseeds, in particular soy extraction meal and expeller, full-fat soybeans, rapeseed meal and expeller, cotton extraction meal, peanut extraction meal, sunflower extraction meal, coconut extraction meal, and/or palm kernel extraction meal; legumes, in particular toasted guar flour; brewery and distillation by-products, in particular dried distiller's grain with solubles (DDGS); by-products of grain-processing and feedstuff production, in particular corn gluten, maize seed meal and/or bakery by-products; animal by-products, in particular fish meal, meat meal, poultry meal, blood meal, and/or bone meal; and also any type of grains. In particular, the feedstuff raw material is soy, soybeans or a soybean product.
Depending on factors such as climate, soil and genetics of the plants, feedstuff raw materials and/or feedstuffs from different global growing areas may differ in their ingredients and the contents of said ingredients. In order to allow reliable and reproducible predictions of a feedstuff and/or feedstuff raw material, it is therefore preferred that the feedstuff raw material and/or feedstuff, whose spectra are part of the spectra population, is from all of its global growing areas.
In a further embodiment of the method according to the present invention the feedstuff raw material and/or feedstuff, whose spectra are part of the spectra population, is from all of its global growing areas.
The number of spectra of feedstuff raw materials and/or feedstuffs of the spectra population used in the method according to the present invention should be representative to allow for a reliable and reproducible prediction of the feedstuff raw material and/or feedstuff in question. Therefore, the population of spectra comprises at least 50 spectra of samples of each feedstuff and/or feedstuff raw material, i.e. each feedstuff and/or feedstuff raw material in ground and/or unground form used in animal nutrition, preferably in the nutrition of poultry, pigs and/or animals kept in aquacultures, such as fish and/or crustacea, from each of its global growing areas. The method according to the present invention is not subject to any limitations regarding the number of spectra of samples of any feedstuff and/or feedstuff raw material from any of its global growing areas. Hence, the number of spectra of samples of any feedstuff and/or feedstuff raw material from any of its global growing areas may range from 50 to 10,000, from 50 to 5,000, from 50 to 2,500, from 50 to 2,000, from 50 to 1,500, from 50 to 1,000, from 100 to 1,000, from 50 to 500, from 100 to 500, from 50 to 250, from 100 to 250, or from 50 to 100.
In yet another embodiment of the method according to the present invention the population of spectra of known feedstuffs and/or feedstuff raw materials of step c) comprises at least 50 spectra of samples of each feedstuff and/or feedstuff raw material from each of its global growing areas.
When the population of spectra of feedstuffs and/or feedstuff raw material of known type considers each global growing area of a feedstuff and/or feedstuff raw material and the number of spectra from each global growing area is representative, the method according to the present invention allows not only a reliable and reproducible prediction of the feedstuff raw material and/or feedstuff in question but also a prediction of the origin of the feedstuff raw material and/or feedstuff in question.
Therefore, said population of spectra of known feedstuffs and/or feedstuff raw material of step c) preferably comprises at least 50 spectra of samples of each feedstuff and/or feedstuff raw material from each of its global growing areas. The number of spectra of samples of a feedstuff and/or feedstuff raw material from each global growing area is not subject to any limitations. Hence, the number of spectra of samples of any feedstuff and/or feedstuff raw material from each global growing area may range from 50 to 10,000, from 50 to 5,000, from 50 to 2,500, from 50 to 2,000, from 50 to 1,500, from 50 to 1,000, from 100 to 1,000, from 50 to 500, from 100 to 500, from 50 to 250, from 100 to 250, or from 50 to 100.
It is preferred to provide the set of database vectors of a population of spectra of known feedstuff raw materials and/or feedstuffs directly in step c) of the computer-implemented method according to the present invention. It is also possible that first only a population of spectra of known feedstuff raw materials and/or feedstuffs is provided, which is transformed in the next step to the set of database vectors for the similarity analysis in step d). In this case, the step c) of the computer-implemented method according to the present invention also comprises the step of transforming the absorption intensities of wavelengths or wavenumbers in each spectrum of a population of spectra of known feedstuff raw materials and/or feedstuffs into vectors. The thus obtained multitude of vectors of the population of spectra of known feedstuff raw materials and/or feedstuffs is then the set of database vectors, as mentioned above. In any case, it is preferred to store the population of spectra of known feedstuff raw materials and/or feedstuffs or the set of database vectors of said population of spectra of known feedstuff raw materials and/or feedstuffs on a processing unit, such as a computer or a cloud. The processing unit on which the population of spectra or the database vectors are stored can be identical with or different from the processing unit, which carries out the computer-implemented method according to the present invention. In the second case, the first processing unit, which carries out the computer-implemented method according to the present invention, and the second processing unit, on which the population of spectra or the database vectors are stored, form a network. For example, it is also possible that the population of spectra or the database vectors are stored on a cloud. In that case, the first processing unit, e.g. a computer, which carries out the computer-implemented method according to the present invention, and the second processing unit, e.g. a cloud, on which the population of spectra or the database vectors are stored, form a network.
Another object of the present invention is therefore a system for predicting a feedstuff raw material and/or feedstuff, comprising a processing unit adapted to carry out the computer-implemented method according to the present invention.
The principle of cleaning a set of database vectors or a population of spectra of outliers also makes it possible to provide an improved database comprising sets of database vectors of a population of spectra of known feedstuff raw materials and/or feedstuffs or a population of spectra of known feedstuff raw materials and/or feedstuffs, which is freed from outliers and is therefore suitable for use in the method and/or in the system according to the present invention.
Said database comprises i) a set of database vectors of a population of spectra of known feedstuff raw materials and/or feedstuffs and/or ii) a population of spectra of known feedstuff raw materials and/or feedstuffs, wherein the set of database vectors and/or the population of spectra is free from outliers, and is stored on a computer, a server, a cloud or any type of computer-readable medium such as a disc, a CD-ROM, a flash drive, or a USB stick.
Said database is obtainable or obtained by the same principles by which an outlier is removed from a set of database vectors or from a population of infrared spectra.
Therefore, the database is obtainable or obtained by the step c) and one or more of steps c1) to c4) of the computer-implemented method according to the present invention, preferably by the step c) and one or more steps selected from the group consisting of steps c1a) to c1c), steps c2a) to c2d), steps c3a) to c3f) and steps c4a) to c4d).
In case the population of spectra of known feedstuff raw materials is stored on a computer, it is preferred that said population is stored on a second computer, i.e. a computer which does not itself carry out the computer-implemented method according to the present invention. As a consequence, the workload is distributed between the computers, which may lead to a quicker performance of the computer-implemented method according to the present invention. This also allows for communication between the user and the provider of a database with the database vectors for use in the method according to the present invention, for example, an update of the database vectors.
In an embodiment of the system according to the present invention the processing unit adapted to carry out the computer-implemented method according to the present invention forms a network with at least one other processing unit, on which the database vectors are stored.
In another embodiment the system according to the present invention further comprises a database comprising i) a set of database vectors of a population of spectra of known feedstuff raw materials and/or feedstuffs and/or ii) a population of spectra of known feedstuff raw materials and/or feedstuffs, wherein the set of database vectors and/or the population of spectra is free from outliers.
In the example, four different types of filters, indicated as Filter 1 to Filter 4, were compared for their suitability for cleaning up a dataset for predicting a material class, i.e. a feedstuff raw material and/or feedstuff. The results for these filters were compared with the case where no use was made of a filter, indicated as Filter 0. Hence, the example with Filter 0 was a comparison example not according to the invention and the examples with Filters 1 to 4 were according to the invention. In detail, the four different types of filters were:
The filters were used for cleaning up two datasets of NIR spectra for predicting a material class, specifically a feedstuff raw material and/or feedstuff. The two data sets contained spectra measured on two different infrared spectrometers, a NIRS™ DS2500 Feed Analyzer from Foss and an MPA FT-NIR Analyzer or TANGO FT-NIR Analyzer from Bruker. The four different filters were applied onto the datasets, i.e. the criteria were calculated, and the thus identified most distant spectra or the spectra with lowest score in a majority-voting and weighting procedure were removed from the dataset. Next, the criteria for the filter were recalculated and applied again until 20% of the spectra were removed from the datasets.
After application of the filters, a nearest neighbor search was carried out to predict the material code for a set of query spectra. In detail, 20 runs with 200 random query spectra were carried out, where it was counted how many times a material code was predicted correctly out of the 200 queries.
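The procedure of the example, i.e. repeated filtering until 20% of the spectra are removed followed by a nearest neighbor search, can be sketched as follows. This is an illustrative reconstruction under assumptions and not the code used for the example: the filter shown is a centroid-based criterion, the Euclidean distance is used, and all function names are hypothetical.

```python
import math


def farthest_from_centroid(vectors):
    """Example filter criterion: index of the vector farthest from the centroid."""
    n = len(vectors)
    c = [sum(column) / n for column in zip(*vectors)]
    return max(range(n), key=lambda i: math.dist(vectors[i], c))


def clean(vectors, find_outlier=farthest_from_centroid, fraction=0.2):
    """Remove one outlier at a time until the given fraction has been removed."""
    kept = list(vectors)
    target = len(kept) - int(len(kept) * fraction)
    while len(kept) > target:
        # the criterion is recalculated on the remaining vectors in every round
        del kept[find_outlier(kept)]
    return kept


def nearest_neighbor_hits(queries, database):
    """Count how many query spectra receive the correct material code.

    queries and database: lists of (material_code, vector) tuples.
    """
    hits = 0
    for true_code, q in queries:
        predicted, _ = min(database, key=lambda item: math.dist(q, item[1]))
        hits += predicted == true_code
    return hits
```

In the setting of the example, `nearest_neighbor_hits` would be called once per run with 200 randomly drawn query spectra, and the hit counts of the 20 runs would be compared between the filters.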
The results for the data set with spectra measured on a NIRS™ DS2500 Feed Analyzer from Foss are summarized in table 1 and the results for the data set with spectra measured on an MPA FT-NIR Analyzer or TANGO FT-NIR Analyzer from Bruker are summarized in table 2.
Each of the filters 1 to 4 in the cleaning up of the dataset generally led to an improvement in the prediction of a material code, compared to filter 0. The results for the filter 1, filter 2 and filter 4 are almost identical. With the dataset of spectra measured on a Foss NIR analyzer, the use of the filter 3 led to a greater improvement in the prediction results than the other filters. Specifically, the use of the filter 3 gave better results than the rest of the filters in 8 of 20 runs.
Each of the filters 1 to 4 in the cleaning up of the dataset generally led to an improvement in the prediction of a material code, compared to filter 0. The results for the filter 1, filter 2 and filter 4 are almost identical. Again, with the dataset of spectra measured on a Bruker NIR analyzer, the use of the filter 3 led to a greater improvement in the prediction results than the other filters. Here, the improvement with the filter 3 was even stronger than with the dataset of spectra measured on a Foss NIR analyzer. Specifically, the use of the filter 3 gave better results than the rest of the filters in 17 of 20 runs. This is also expressed by the significantly higher average value for the filter 3 than in the other cases.
In summary, all options for removing spectral outliers from the database of vectors according to the present invention led to an improvement in the prediction of a material code, compared to the case where no spectral outliers were removed. Generally, the use of filter 3, a combination of majority-voting and weighting for removal of spectral outliers from the dataset, gave the best improvement in the prediction results for a material code. Further, the results are not dependent on the specific NIR device on which the NIR spectra of the datasets were measured.
Number | Date | Country | Kind |
---|---|---|---|
19181932.5 | Jun 2019 | EP | regional |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2020/067432 | 6/23/2020 | WO |