The present invention relates to methods for preparing synthetic multicomponent samples of powders, liquids and liquids with suspended components that can mimic dynamic processes, particularly multicomponent mixtures, crystallizations, distillations, chemical reactions and fermentations, to be used in the development of multivariate calibrations and multivariate supervision systems of such processes. In particular, the present invention relates to a method for developing multivariate calibrations and/or supervision systems and more particularly to a computer program and a system for running such a method.
Monitoring, supervision and control of industrial processes are often difficult and time consuming, since these systems have complex multicomponent matrices that require laboratory chemical and physical analysis by reference methods. Furthermore, these determinations often have a low acquisition frequency compared to process dynamics, are offline techniques that consume considerable time and resources, and do not allow process monitoring and control.
Recently, the use of spectroscopic data has allowed quality control information (chemical and physical) to be readily available (on-line) at the process (in-situ) instead of being determined in the laboratory (offline). These on-line, multi-parametric techniques have proved capable of determining process trajectories and performing multi-parameter analysis (chemical and physical) in almost real-time. This can be a great advantage over reference analytical laboratory methods, since spectroscopic sensors can enable the implementation of process control strategies.
Spectroscopic sensors perform multidimensional measurements that through the use of multivariate data analysis, also called chemometrics, can be related with specific chemical or physical parameters, or even dynamic process profiles. These relations are established by multivariate calibrations that can be used either to monitor or supervise processes and can be a part of advanced process control.
Although spectroscopic sensors present an advantage over reference analytical laboratory methods, they require the development of robust calibrations that can establish the relation between the spectral data and a specific parameter, which is called a multivariate calibration (linear or nonlinear). Moreover, spectral data can be used without any specific parameter to establish multivariate dynamic process profiles.
Multivariate calibrations are usually developed either by using in-line process spectral data (X) together with in-process samples that are withdrawn from the process and then analyzed by offline analytical methods (y), or by using samples prepared by orthogonal design of experiments. Regardless of the method used, the dataset from which the calibrations are built should comprise data from samples that allow the calibrations to meet the following requisites as well as possible:
Spectral Selectivity—The calibration should be selective for each calibrated variable, predicting it with accuracy in any stage of a process and even predict outside its range.
Variability—The calibration data should include as much variability as possible to ensure model validity over a wide range of conditions. Matrix effects should also be included. Matrix effects can be considered as the combined effect of all components (known and unknown) within the sample on its spectra, e.g., considering different media solutions for a calibration or different solvents in mixture.
Uniformity—The calibration samples should have a distribution as even as possible across the expected calibration range. This will increase the quality of the calibration by avoiding extreme samples (sometimes seen by the model as outlier samples).
Non-correlativity—The existence of correlations between parameters can give rise to lack of selectivity.
Even though the use of process samples is the most common option for the development of multivariate calibrations for spectroscopic sensors, it has several drawbacks. Samples from these processes are comparably complex matrices of known and sometimes unknown compounds (multicomponent). Process dynamics and sampling frequency often do not match, for instance a high number of samples is taken when there is no process variation and a low number when the process changes rapidly. A high number of samples is required to cover a significant process variability (several batches or operation modes are needed). A high number of offline analytical measurements is needed. High correlation coefficients between some of the parameters are present in the calibration samples (such as reaction profile composition). Finally, processes can take several days (for instance fermentations).
To overcome at least some of these drawbacks and to meet the requisites of a good calibration as well as possible, orthogonal designs of experiments imported directly from the statistics literature have been proposed to produce non-process-derived samples. These designs guarantee that all the experiments are independent and thus meet the above-mentioned requisites, but they commonly yield comparably poor calibration results for spectroscopic sensors. Also, these designs usually have only 3 to 5 levels of variation, which is often not enough for the calibration of these sensors. Furthermore, these orthogonal methods can only be used for known and measurable variations, which is not the case for process samples, where a considerable amount of unknown and non-measurable variation is present.
The article “Analysis of pharmaceuticals by NIR spectroscopy without a reference method” by Marcelo Blanco and Anna Peguero, Trends in Analytical Chemistry, Volume 29, Issue 10, November 2010, Pages 1127-1136, refers to constructing calibration sets for quantifying the active pharmaceutical ingredient (API) and excipients in pharmaceutical tablets. The operating procedure involves using an appropriate experimental design to prepare a set of laboratory samples and recording a series of NIR spectra during the pharmaceutical-production process. Process spectra are calculated by difference between the NIR spectra for production tablets and those for laboratory samples containing API and excipients at their nominal concentrations.
The article “Rapid calibration of near-infrared spectroscopic measurements of mammalian cell cultivations” by M. R. Riley et al., Biotechnology Progress, vol. 15, no. 6, pages 1133-1141, relates to an approach for generating NIR spectroscopic calibrations wherein a small number of experimentally collected spectra serve as inputs to a computational procedure that yields a large number of simulated spectra, each containing both analyte-specific and analyte-independent information.
The article “An Introduction to Multivariate Calibration and Analysis” by K. R. Beebe et al., Anal. Chem., 1987, 59 (17), pp 1007A-1017A, refers to a general introduction to multivariate calibration and analysis using multiple linear regression, principal component regression as well as partial least square.
The article “Transfer of multivariate calibration models: a review” by R. N. Feudale et al., Chemometrics and Intelligent Laboratory Systems, Volume 64, Issue 2, 28 Nov. 2002, Pages 181-192, refers to an overview of the different methods used for calibration transfer and a critical assessment of their validity and applicability.
The article “The evolution of chemometrics” by P. K. Hopke, Analytica Chimica Acta, Volume 500, Issues 1-2, 19 Dec. 2003, Pages 365-377, presents the development of chemometrics as a subfield of chemistry and particularly of analytical chemistry.
The article “Second- and third-order multivariate calibration: data, algorithms and applications” by G. M. Escandar et al. refers to a review of second- and third-order multivariate calibration, based on the growing literature in the field, the variety of data being produced by modern instruments, and the proliferation of algorithms capable of dealing with higher-order data.
Therefore, there is a need for a method for preparing synthetic multicomponent samples that can mimic dynamic processes to be used in the development of comparably accurate and robust multivariate calibrations and of multivariate supervision systems of such processes. Also there is a need for a method which allows making these calibrations and supervision systems comparably accurate and robust.
The method, computer program and system according to the invention are described in more detail herein below by way of exemplary embodiments and with reference to the attached drawings, in which:
An object of the invention is to provide a method for preparing at least one synthetic multicomponent biotechnological and/or chemical process sample and to provide a synthetic sample layout generation and sample handling system for preparing a set of synthetic samples mimicking a dynamic process or a specific process step or a variation thereof. Another object of the invention is to provide a computer program product comprising one or more computer readable media having computer executable instructions for performing the steps of the aforementioned method.
The object of the invention is solved with the features of the independent claims. Dependent claims refer to preferred embodiments.
As described in more detail below, the present invention makes use, among others, of Multivariate Curve Resolution-Alternating Least Squares (MCR-ALS), which is a chemometric method or algorithm that is commonly used for the resolution of multicomponent responses in unknown unresolved mixtures as, e.g. described in [3]. The MCR-ALS method or algorithm aims for the description of multicomponent systems by using a bilinear model that relates the system composition with multivariate responses, such as spectroscopic measurements, electrochemical signals, etc. (see, e.g., [3]).
In order to better understand and explain the method according to the invention, a brief description of the MCR-ALS algorithm is given below. A more detailed description of such algorithm can be found in [1] or in [3].
MCR-ALS, a bilinear modelling method, is generally used to decompose an experimental data matrix D into the product of two matrices: a matrix C related with the “true” pure response profiles associated with the variation contribution of each component of the original data matrix, which is usually related with the changes in chemical composition or the amount of solution in the mixture, and a matrix ST (the transpose of a matrix S) related with the observed data variation such as, for instance, instrumental measures or spectroscopic changes. The method solves Equation 1 below iteratively by the Alternating Least Squares (ALS) algorithm, calculating the pure response profiles C and the observed variations ST that optimally fit the experimental data matrix D, while minimizing the matrix of residuals E that cannot be described by the model or algorithm.
$$D = C\,S^{T} + E \qquad \text{(Equation 1)}$$
The optimization procedure requires the user to establish a number of components to be used in the model and an initial estimate of C or ST, which can be achieved by using chemometrics methods such as Principal Component Analysis, Evolving Factor Analysis (EFA) or SIMPLISMA, or other suitable methods.
Since the MCR-ALS method is an ambiguous method, incorporating iterative procedures for constraints implementation can be a good option to reduce its ambiguity, i.e., to decrease the number of feasible solutions that fit equally well the experimental data. Commonly used constraints for matrices C and ST can include non-negativity, unimodality, closure, trilinearity, selectivity and/or other constraints that are known. The inclusion of such constraints helps finding meaningful physical or chemical solutions for the problems to be solved.
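By way of non-limiting illustration, the alternating least squares iteration with the non-negativity and closure constraints mentioned above can be sketched as follows. The function name `mcr_als`, its parameters and the stopping rule are assumptions made for this sketch only; it is a minimal outline of the general technique and not the implementation of [1] or [3].

```python
import numpy as np

def mcr_als(D, C0, n_iter=200, tol=1e-9):
    """Minimal MCR-ALS sketch (hypothetical helper, not a reference implementation).

    D  : (samples x variables) data matrix
    C0 : (samples x components) initial estimate of C
    Returns C (rows sum to 1, non-negative) and St (non-negative).
    """
    D = np.asarray(D, dtype=float)
    C = np.clip(np.asarray(C0, dtype=float), 0.0, None)
    previous_ssq = np.inf
    for _ in range(n_iter):
        # Least-squares estimate of S^T for the current C, then non-negativity.
        St = np.clip(np.linalg.lstsq(C, D, rcond=None)[0], 0.0, None)
        # Least-squares estimate of C for the current S^T, then non-negativity.
        C = np.clip(np.linalg.lstsq(St.T, D.T, rcond=None)[0].T, 0.0, None)
        # Closure: the fractions of the components in each sample sum to 1.
        C = C / np.maximum(C.sum(axis=1, keepdims=True), 1e-12)
        # Stop when the sum of squared residuals no longer changes.
        ssq = float(np.sum((D - C @ St) ** 2))
        if abs(previous_ssq - ssq) < tol:
            break
        previous_ssq = ssq
    return C, St
```

The constraints are applied by simple clipping and row normalization after each unconstrained least-squares step, which is the simplest possible choice; constrained solvers such as non-negative least squares could be substituted.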
The MCR-ALS algorithm optimization performance is evaluated by the percentage of lack of fit (Equation 2), defined as the difference between the original data D and the product of matrices C and ST, the percentage of variance explained by the model (Equation 3) and the standard deviation of the residuals (Equation 4). The following equations show how to calculate the performance of the MCR-ALS algorithm or model:
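Equations 2 to 4 are not reproduced in this text. As a hedged reconstruction, the forms commonly given in the MCR-ALS literature, which use the same symbols as the definitions that follow, can be written as shown here; the sums run over all rows i and columns j of matrix D.

```latex
\text{lack of fit}\;(\%) = 100\,\sqrt{\frac{\sum_{i,j} e_{ij}^{2}}{\sum_{i,j} d_{ij}^{2}}}
\qquad \text{(Equation 2)}

R^{2}\;(\%) = 100\,\frac{\sum_{i,j} d_{ij}^{2} - \sum_{i,j} e_{ij}^{2}}{\sum_{i,j} d_{ij}^{2}}
\qquad \text{(Equation 3)}

\sigma = \sqrt{\frac{\sum_{i,j} e_{ij}^{2}}{n_{rows}\, n_{columns}}}
\qquad \text{(Equation 4)}
```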
where eij is the residual obtained from the difference between the element at position ij of matrix D and the estimate from the MCR-ALS model or algorithm, dij is the value of the element at position ij of matrix D, and nrows and ncolumns designate the number of rows and columns of matrix D.
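As a non-limiting sketch, the three performance figures can be computed from D, C and ST as shown below. The formulas used are those commonly given in the MCR-ALS literature and are an assumption here, since the equations themselves are not reproduced in this text; the function name `fit_metrics` is likewise hypothetical.

```python
import numpy as np

def fit_metrics(D, C, St):
    """Return (lack of fit in %, explained variance in %, std. dev. of residuals).

    Assumed formulas (commonly used for MCR-ALS, not taken from this text):
      lack of fit = 100 * sqrt(sum(e_ij^2) / sum(d_ij^2))
      explained   = 100 * (sum(d_ij^2) - sum(e_ij^2)) / sum(d_ij^2)
      sigma       = sqrt(sum(e_ij^2) / (n_rows * n_columns))
    """
    D = np.asarray(D, dtype=float)
    E = D - np.asarray(C, dtype=float) @ np.asarray(St, dtype=float)  # residuals e_ij
    sum_e2 = float(np.sum(E ** 2))
    sum_d2 = float(np.sum(D ** 2))
    lack_of_fit = 100.0 * np.sqrt(sum_e2 / sum_d2)
    explained = 100.0 * (sum_d2 - sum_e2) / sum_d2
    sigma = np.sqrt(sum_e2 / D.size)  # D.size == n_rows * n_columns
    return lack_of_fit, explained, sigma
```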
In particular, the invention relates to a method for preparing at least one synthetic multicomponent biotechnological and/or chemical process sample for developing multivariate calibrations for monitoring systems and/or multivariate supervisory systems of increased robustness, wherein a set of synthetic samples mimicking a dynamic process or a specific process step or a variation thereof is prepared, the method comprising the steps of: preparing historical process data of the dynamic process or the specific process step or the variation thereof to be mimicked and for which the set of synthetic samples are to be prepared; creating a data matrix D of the historical process data; determining a plurality of main solutions of the data matrix D; determining which main solutions of the data matrix D are necessary to mimic the dynamic process or the specific process step or the variation thereof within a predetermined variance; determining the analyte composition of the necessary main solutions and the respective relative amount of the necessary main solutions with respect to the data matrix D; and creating at least one sample by assembling the necessary main solutions according to the determined analyte composition.
Preferably, creating the at least one sample further comprises the step of mixing the assembled necessary main solutions according to their determined respective relative amount.
Preferably, preparing historical process data comprises arranging data to a specific format, for instance sorting by time or according to process evolution.
Preferably, creating at least one sample comprises producing, assembling, mixing and/or preparing the at least one sample.
The historical process data of the dynamic process or specific process step or a variation thereof can be obtained, e.g., by measuring parameters in the dynamic process or specific process step, by averaging several runs of dynamic processes or specific process steps, by gathering such data in the literature or the like.
For example, the obtained historical process data can comprise analytically determined composition values of the data matrix of the dynamic process or the specific process step or a variation thereof. Such data allow the method to be performed efficiently and, in some cases, are sufficient on their own to perform it.
The obtained historical process data of the dynamic process or the specific process step or a variation thereof can also comprise physical and/or process information.
For example, the physical and/or process information can be pH values, temperature values or the like. Such data also allow the method to be performed efficiently and, in some cases, are sufficient on their own to perform it.
The obtained historical process data of the dynamic process or the specific process step or a variation thereof can also be organized according to a process time. Organizing the historical process data in such manner allows—in the case of a batch or fed-batch process step or dynamic changes in a continuous process—for obtaining time profiles for the specific process step.
The obtained historical process data of the dynamic process or the specific process step or a variation thereof can also comprise data of at least one further specific process step corresponding to an experimental run of the specific process step under similar or different process conditions. In this way, a data matrix can be formed which can be used as matrix D in the MCR-ALS algorithm described above.
Using a factor decomposition algorithm allows for determining how many main solutions are necessary to mimic the dynamic process or the specific process step or the variation thereof within a predetermined variance.
The predetermined variance can be the optimum accumulated captured variance and can be set according to good modelling practices known by the skilled person, e.g. a chemometrics specialist. For example, it can be set that the predetermined variance, i.e. the accumulated captured variance, is between 95% and 99%. In other words, it can be determined how many main solutions of the data matrix D are necessary to mimic the dynamic process or the specific process step or the variation thereof within a predetermined variance, in this example between 95% and 99%.
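By way of illustration only, the number of main solutions needed to reach a predetermined accumulated captured variance can be estimated from the singular values of D, as sketched below. The function name `count_main_solutions` and the default target of 95% are assumptions for this sketch; the data matrix is decomposed without mean centering, which is also an assumption.

```python
import numpy as np

def count_main_solutions(D, target=0.95):
    """Smallest number of factors whose accumulated captured variance >= target.

    Hypothetical helper: uses a plain SVD of D (no centering or scaling).
    Returns the number of factors and the accumulated variance curve.
    """
    singular_values = np.linalg.svd(np.asarray(D, dtype=float), compute_uv=False)
    captured = np.cumsum(singular_values ** 2) / np.sum(singular_values ** 2)
    # Index of the first entry that reaches the target, converted to a count.
    n = int(np.searchsorted(captured, target)) + 1
    return min(n, captured.size), captured
```

If the returned number exceeds what the experimental setup can handle, this is the situation in which the data matrix may be split into smaller matrices.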
Preferably, an evolving factor analysis algorithm is used for the decomposition of the data matrix, i.e. the determination of how many main solutions are necessary to mimic the dynamic process or the specific process step or the variation thereof within a predetermined variance. Such determination allows for efficiently choosing the number of relevant main solutions and determines where they appear or disappear during the dynamic process, the specific process step or a variation thereof, in the case of naturally ordered datasets such as most batch, semi-batch and dynamic changes in continuous processes.
Preferably, the optimum accumulated variance is determined while decomposing the data matrix using a factor decomposition method, such as PCA, evolving factor analysis or another suitable method, and it is evaluated whether the optimum accumulated captured variance is within a predefined range. If the optimum accumulated captured variance is within the predefined range, the optimum number of multivariate main solutions to be used in the method according to the invention can be obtained. If the factor decomposition method used is not able to capture such accumulated variance, the data matrix D can be split into smaller data matrices as described below, and these new datasets can be used in the further steps of the method according to the invention. The predefined range of the optimum accumulated variance is preferably more than 94% and more preferably from 95% to 99%. Determining the optimum accumulated captured variance in said range allows for particularly efficiently obtaining the optimum number of multivariate main solutions.
Preferably, an appearance time of the number of main solutions and/or a disappearance time of the number of main solutions is determined within the decomposition of the data matrix using evolving decomposition methods. Such appearance or disappearance time allows for efficiently performing the method.
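As a non-limiting sketch of how appearance and disappearance of main solutions can be located in a naturally ordered dataset, the following code computes forward and backward evolving singular values. The function name `evolving_factor_analysis` is hypothetical and the sketch is a simplified outline of the general technique, not the procedure of [2].

```python
import numpy as np

def evolving_factor_analysis(D, n_factors):
    """Forward and backward evolving singular values of D (hypothetical helper).

    forward[i, k]  : k-th singular value of rows 0..i (a rise marks an appearance)
    backward[i, k] : k-th singular value of rows i..end (a rise, read from the
                     end towards the start, marks a disappearance)
    """
    D = np.asarray(D, dtype=float)
    n_rows = D.shape[0]
    forward = np.zeros((n_rows, n_factors))
    backward = np.zeros((n_rows, n_factors))
    for i in range(n_rows):
        sv_forward = np.linalg.svd(D[: i + 1], compute_uv=False)
        sv_backward = np.linalg.svd(D[i:], compute_uv=False)
        k = min(n_factors, sv_forward.size)
        forward[i, :k] = sv_forward[:k]
        k = min(n_factors, sv_backward.size)
        backward[i, :k] = sv_backward[:k]
    return forward, backward
```

A factor is usually considered present once its singular value rises above the noise level, which has to be chosen for the dataset at hand.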
The initial estimation of analyte composition of the necessary main solutions and the initial estimation of the respective relative amount of the necessary main solutions with respect to the data matrix D can be determined by evolving factor analysis and/or determination of the purest sample/variable. An example for an evolving factor analysis can be found in [2]; an example for the determination of the purest sample/variable can be found in [4].
In particular, the plurality of main solutions of the data matrix D can be determined using an appropriate chemometric method. The chemometric method can be a factor decomposition method, in particular principal component analysis (PCA), singular value decomposition (SVD) or evolving factor analysis, and/or parallel factor analysis (PARAFAC) or multivariate curve resolution-alternating least squares (MCR-ALS).
According to an embodiment of the invention, the method further comprises the step of creating a matrix S comprising the determined analyte composition of the necessary main solutions; the step of creating a matrix C comprising the relative amount of the necessary main solutions with respect to the data matrix D; and the step of determining whether the matrix product C×ST corresponds to the data matrix D within a predetermined tolerance.
The predetermined tolerance can be a value according to good modelling practices known by the skilled person. For example, such a tolerance can be 95%. In other words, it is determined whether the matrix product C×ST corresponds to the data matrix D to or above 95%, i.e. the deviation of the corresponding matrix entries can be 5% or less.
If the matrix product C×ST does not correspond to the data matrix D within the predetermined tolerance, the following steps are preferably repeated: determining which main solutions of the data matrix D are necessary to mimic the dynamic process or the specific process step or the variation thereof within the predetermined variance; determining the analyte composition of the necessary main solutions and the relative amount of each necessary main solution; creating the matrix S; creating the matrix C; determining whether the matrix product C×ST corresponds to the data matrix D within the predetermined tolerance. For example, if it is determined that the predetermined tolerance, e.g. 95%, is not reached, the aforementioned steps according to said embodiment can be repeated until the predetermined tolerance is met. With said tolerance the iteration performance can be evaluated.
The relative amount of the necessary main solutions for creating the matrix C and/or the determined analyte composition of the necessary main solutions for creating the matrix S may be selected based on logical constraints and/or physical constraints and/or chemical constraints. For example, a constraint can be that an analyte concentration cannot be negative, i.e. the non-negativity constraint. Another example is that, for determining the relative amount of each main solution, the non-negativity constraint should be met and the sum of all fractions should be equal to 1, i.e. the closure constraint.
The method may further comprise the steps of: determining the relative amount of the necessary main solutions comprised in the matrix C for at least two instants of time; performing a regression between the relative amount of the necessary main solutions comprised in the matrix C and a respective time variable of the historical process data using the determined relative amount of the necessary main solutions for the at least two instants of time; estimating the relative amount of the necessary main solutions for at least one other instant of time between the at least two instants of time based on the regression; and creating an augmented time dependent matrix Caug comprising C and the estimated necessary main solutions for the at least one other instant of time.
Preferably, the method comprises the steps of assembling the necessary main solutions comprised in Caug and mixing the necessary main solutions comprised in Caug according to the determined respective relative amount.
Preferably, estimating the relative amount of the necessary main solution comprises computing an unknown value by establishing regression between at least two adjacent values.
Preferably, when the data matrix is deconvoluted, e.g. by an MCR-ALS algorithm with imposed constraints, the relative amount of each main solution for the determined number of main solutions at a specific time can be obtained by regression between the relative amount of each main solution predicted, e.g. by the multivariate curve resolution-alternating least squares algorithm, and the time variable from the original process data. The time variable can be relative or absolute. The regression can be a linear, polynomial or other regression.
Thereby, mixtures with the relative amount of the multicomponent of each main solution for the determined number of main solutions at the specific time preferably are obtained. The mixtures can be rows of C. The specific time in this context can be any given time between the beginning and the end of the dynamic process, the specific process step or a variation thereof, comprised in D or part of such process comprised in Di. (An example of such procedure is given in Example 1 and 2 below.) Such obtaining of the mixtures allows for obtaining more samples than the MCR-ALS algorithm by itself, as such C can be augmented into Caug where the amount of each main solution for a given time obtained from the MCR-ALS algorithm and from the regression procedure are combined. This allows for increasing the number of samples to be run in the experimental implementation procedure.
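For illustration, the augmentation of C into Caug can be sketched with a piecewise linear regression between adjacent time points, as below. The function name `augment_c` and the choice of linear interpolation are assumptions for this sketch; a polynomial or other regression could be used instead.

```python
import numpy as np

def augment_c(t, C, t_new):
    """Build a time-ordered Caug from C and estimates at the new times t_new.

    Hypothetical helper: each column of C is interpolated linearly between
    adjacent known time points, then the rows are re-closed to sum to 1.
    """
    t = np.asarray(t, dtype=float)          # must be in increasing order
    C = np.asarray(C, dtype=float)
    t_new = np.asarray(t_new, dtype=float)
    C_new = np.column_stack(
        [np.interp(t_new, t, C[:, k]) for k in range(C.shape[1])]
    )
    # Guard step: keep the closure constraint on the estimated rows.
    C_new = C_new / C_new.sum(axis=1, keepdims=True)
    t_aug = np.concatenate([t, t_new])
    order = np.argsort(t_aug, kind="stable")
    return t_aug[order], np.vstack([C, C_new])[order]
```

Each row of the returned matrix gives the fraction of every main solution at the corresponding time, whether that time was part of the original data or was added by the regression.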
Preferably, each of the number of main solutions can be added for a known process time after assembling each main solution for the determined number of main solutions according to the MCR-ALS results loadings matrix. In particular, after all main solutions are assembled with the characteristics determined by the MCR-ALS algorithm, the dynamic process or the specific process step or a variation thereof or part of it can be mimicked by the addition of the amount of each main solution as determined by Caug for a known process time. The Caug matrix can be a p×n matrix where p is the number of samples to be created, here accounted as the sum of MCR-ALS algorithm samples (equal to the rows of matrix D) and the number of samples generated by the regression of C against time, and n is the number of main solutions that, e.g., was established by PCA or Evolving Factor Analysis. Thus, each row of Caug can indicate the relative amount of each main solution (presented in its columns) that has to be mixed to recreate the original samples in matrix D or the augmented samples obtained from the regression of C against time. By mixing the amount of main solution for each row, one can mimic the complete process and recreate all the samples necessary to build a library for multivariate calibration and supervision systems development. For that, a multivariable analytical device such as, e.g., comprised in the synthetic sample preparing system described below, mostly a spectroscopic sensor or any other multichannel instrument, may have to take measurements for each of the recreated samples. The obtained instrumental measurements (x) and the sample composition (y) or time (t) can be used to establish multivariate calibrations using, for example, Projection to Latent Structures regression (PLS) or for establishing supervision systems, for example, Multivariate Statistical Process Control (MSPC).
Although these are comparably important methods used for multivariate calibrations and supervision systems, the present invention is not limited to these specific methods. Moreover, the present invention is also not limited to linear multivariate calibration methods. In fact, further chemometric methods can benefit from the creation of such samples in this manner, such as other linear and nonlinear regression methods, and classification methods, such as PCA, Projection to Latent Structures Discriminant Analysis (PLS-DA), Soft Independent Modelling of Class Analogy (SIMCA) or others. Based on the method according to the present invention, discriminant methods can be developed. For example, process phases may be established based on prior knowledge and using created samples to develop discriminant and/or classification methods. New samples (e.g. from new process runs) could then be classified using these methods.
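Returning to the mixing step described above, the conversion of the rows of Caug into amounts of each main solution to be dispensed can be sketched as follows. The function name `mixing_volumes` and the use of a fixed total volume per sample in millilitres are assumptions made for illustration only.

```python
import numpy as np

def mixing_volumes(C_aug, total_volume_ml):
    """Volume of each main solution (columns) for each sample (rows).

    Hypothetical helper: assumes the fractions in C_aug are volume fractions
    and that every sample is prepared to the same total volume.
    """
    C_aug = np.asarray(C_aug, dtype=float)
    if np.any(C_aug < 0):
        raise ValueError("fractions must be non-negative")
    if not np.allclose(C_aug.sum(axis=1), 1.0):
        raise ValueError("each row of C_aug must sum to 1 (closure)")
    return C_aug * total_volume_ml
```

The resulting table could be passed to an automated mixing device or used as a manual pipetting plan; the spectrum measured on each mixed sample then forms one row of the calibration library.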
The data matrix D may be partitioned into a plurality of data matrices Di, each matrix Di being individually processed in each step following the creating of the data matrix D, preferably if it is determined that only a limited number of main solutions can be used to mimic the dynamic process or the specific process step or the variation thereof and that the limited number of main solutions captures a variance below the predetermined variance. For example, if the number of main solutions is limited to 2 or 3 by a binary or ternary mixture device with automated control, the data matrix D is partitioned into a plurality of data matrices Di and each matrix Di is individually processed in each step following the creating of the data matrix D.
Preferably, individually processed in this context comprises that each of the data matrices Di are treated separately from the other sub-matrices Di.
As another example, if the number of main solutions is set manually according to an experimental limitation, e.g. to 3 or 4, the data matrix D is partitioned into a plurality of data matrices Di and each matrix Di is individually processed in each step following the creating of the data matrix D.
Another example is that it is determined that the model used for decomposing the data matrix D is not capable of capturing the predetermined variance, i.e. that the main solutions used for describing the data matrix D, which are configured to mimic the dynamic process or the specific process step or the variation thereof, capture a variance below the predetermined variance. In this example, the matrix D can be partitioned into a plurality of data matrices Di and each matrix Di is individually processed in each step following the creating of the data matrix D.
Therefore, the data matrix can be split into a plurality of smaller data matrices, each of which is individually processed in each step following the decomposition of the data matrix using a factor decomposition method, such as PCA, evolving factor analysis or another suitable method. As explained above, in some complex situations the number of main solutions may not be practical to implement in an experimental setup, either automatically, such as with a binary or ternary mixture device with automated control where the number of main solutions is limited to two or three, respectively, or manually, where the number of main solutions cannot be too high. In these cases the number of main solutions can be set manually according to the experimental limitation, which may make it necessary to divide the complete process or process step into smaller parts as needed, i.e., to divide the data matrix D into smaller matrices Di. The data matrix can likewise be split into the plurality of smaller data matrices if the optimum accumulated variance is determined not to be within the predefined range.
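One possible way of carrying out such a split is sketched below, purely as an illustration: the time-ordered matrix D is halved recursively until each block Di is captured by the allowed number of main solutions. The function names `captured_variance` and `partition`, the halving strategy and the 95% default are all assumptions; the text does not prescribe how the split points are chosen.

```python
import numpy as np

def captured_variance(D, n_solutions):
    """Fraction of the variance of D captured by its first n_solutions factors."""
    s = np.linalg.svd(np.asarray(D, dtype=float), compute_uv=False)
    return float(np.sum(s[:n_solutions] ** 2) / np.sum(s ** 2))

def partition(D, max_solutions, target=0.95):
    """Recursively halve D along the time axis (rows) into blocks Di.

    Hypothetical helper: a block is accepted when max_solutions factors
    capture at least `target` of its variance, or when it cannot be
    split any further.
    """
    D = np.asarray(D, dtype=float)
    if D.shape[0] <= max_solutions or captured_variance(D, max_solutions) >= target:
        return [D]
    mid = D.shape[0] // 2
    return (partition(D[:mid], max_solutions, target)
            + partition(D[mid:], max_solutions, target))
```

Each returned block can then be processed on its own in the subsequent steps, for example with a binary mixture device when `max_solutions` is 2.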
The historical data may comprise at least one of the following: analytically determined composition values of the dynamic process or the specific process step or the variation thereof, physical information, process information, data of at least one further process variation corresponding to additional experimental and/or simulated runs of the process under similar or different process conditions and/or wherein the historical data is organized in the matrix D according to the process time.
Preferably, organized in this context comprises the sorting of the historical data.
The step of determining the necessary main solutions and/or determining the analyte composition of the necessary main solutions and/or determining the relative amount of the necessary main solutions with respect to the data matrix D may be performed using an appropriate chemometric method, in particular at least one of the following chemometric methods: a factor decomposition method, in particular principal component analysis (PCA), singular value decomposition (SVD) or evolving factor analysis, and/or parallel factor analysis (PARAFAC) or multivariate curve resolution-alternating least squares (MCR-ALS).
The method may further comprise the step of preprocessing the matrix D, preferably filtering, more preferably using at least one of: a Savitzky-Golay filter, a Kernel smoother, a smoothing spline, a moving average, or a weighted moving average. Preferably, preprocessing further comprises the application of a chemometric preprocessing algorithm.
In particular, the data matrix or matrix D can be preprocessed by using any suitable algorithm, e.g., for smoothing, mean centering, auto scaling, or the like, all of them known and used in the field of chemometrics. In some cases, smoothing the data from one point to the next might be recommended; for example, the Savitzky-Golay filter algorithm can be used as a smoothing pre-processing method. This can be important when dealing with a dynamic process or a specific process step or a variation thereof, since the process does not pass abruptly from one stage to the other. In some cases, the use of such preprocessing allows minimizing outlier points that are not aligned with the general trend of the dataset. This step can be of importance in order to have good results in the following steps. Although the data can be pre-processed, sometimes this is not necessary, especially if the data only comprises analytical composition in the same units and the dataset contains no outliers.
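As an illustration of the preprocessing options named above, the following minimal sketch applies a centered moving average along the time axis followed by auto scaling. The function names, the window width and the small data matrix are hypothetical and chosen only for the example; NumPy is assumed to be available, and an odd window width is assumed so that the number of rows is preserved.

```python
import numpy as np

def moving_average(D, window=3):
    """Smooth each column of D along the rows (time) with a centered moving average.
    Assumes an odd window so the output has as many rows as D."""
    kernel = np.ones(window) / window
    pad = window // 2
    # Repeat the first and last rows so the edges can be averaged as well.
    padded = np.pad(D, ((pad, pad), (0, 0)), mode="edge")
    return np.column_stack(
        [np.convolve(padded[:, j], kernel, mode="valid") for j in range(D.shape[1])]
    )

def autoscale(D):
    """Mean-center each column and scale it to unit standard deviation."""
    mean = D.mean(axis=0)
    std = D.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # constant columns are only mean-centered
    return (D - mean) / std

# Hypothetical data matrix: rows are time points, columns are two analytes.
D = np.array([[1.0, 10.0], [2.0, 12.0], [9.0, 11.0],
              [4.0, 14.0], [5.0, 13.0], [6.0, 15.0]])
D_smooth = moving_average(D, window=3)
D_scaled = autoscale(D_smooth)
```

The moving average stands in here for any of the listed smoothers; a Savitzky-Golay filter or a smoothing spline could be substituted at the same place in the pipeline.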
The method may further comprise the step of preparing at least one second sample being configured to break co-linearity between parameters or analytes comprised in the data matrix D using a co-linearity breaking method.
The co-linearity breaking method preferably comprises at least one of the following: creating of an orthogonal design of experiments, random spiking of the specific compounds in the at least one sample, programmed spiking, random mixing of samples or any combination thereof.
Preferably, random spiking comprises adding a random amount of a known chemical compound in order to break correlations. The amount of the compound to be introduced can be determined computationally.
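The effect of random spiking on a correlation can be sketched as follows. The two analyte profiles, the spike range and the seed are hypothetical values used purely for illustration; the sketch only shows that adding random amounts of one compound lowers its correlation with another.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical process samples in which analyte b follows analyte a exactly.
analyte_a = [float(i) for i in range(1, 11)]
analyte_b = [2.0 * a for a in analyte_a]

# Spike each sample with a random amount of compound b (assumed range 0 to 10).
spikes = [rng.uniform(0.0, 10.0) for _ in analyte_b]
analyte_b_spiked = [b + s for b, s in zip(analyte_b, spikes)]

r_before = pearson(analyte_a, analyte_b)
r_after = pearson(analyte_a, analyte_b_spiked)
```

In practice the spike range would be chosen so that the spiked concentrations remain within the range expected in the process.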
According to another embodiment, the method may further comprise the step of creating a third sample by grouping the sample with the at least one second sample.
The invention also relates to a synthetic sample generation system and sample handling system for preparing a set of synthetic samples mimicking a dynamic process or a specific process step or a variation thereof. The system comprises: a computational unit; and a mixing unit; wherein the computational unit comprises: means for preparing historical process data of the dynamic process or the specific process step or the variation thereof to be mimicked and for which the set of synthetic samples are to be prepared; means for creating a data matrix D of the historical process data; means for determining a plurality of main solutions of the data matrix D; means for determining which main solutions of the data matrix D are necessary to mimic the dynamic process or the specific process step or the variation thereof within a predetermined variance; means for determining the analyte composition of the necessary main solutions and the respective relative amount of the necessary main solutions with respect to the data matrix D; and wherein the mixing unit is configured to mix any of the necessary main solutions and is configured to create the at least one sample by assembling the necessary main solutions according to the determined analyte composition.
Preferably, the mixing unit is configured to create the at least one sample by assembling the necessary main solutions according to the determined analyte composition and by mixing the assembled necessary main solutions according to their determined respective relative amount.
The computational unit may further comprise means for creating a matrix S comprising the determined analyte composition of the necessary main solutions; means for creating a matrix C comprising the relative amount of the necessary main solutions with respect to the data matrix D; and means for determining whether the matrix product C×ST corresponds to the data matrix D within a predetermined tolerance.
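The tolerance check on the matrix product can be sketched as below. The relative Frobenius-norm residual and the 5% tolerance are assumptions made for this example and are not necessarily the performance measures used elsewhere in this description; the matrices are hypothetical.

```python
import numpy as np

def reconstruction_ok(D, C, S, tol=0.05):
    """Return (ok, rel_error), where rel_error = ||D - C S^T|| / ||D||.
    S holds one main solution per column, so S.T has one main solution per row."""
    residual = D - C @ S.T
    rel_error = float(np.linalg.norm(residual) / np.linalg.norm(D))
    return rel_error <= tol, rel_error

# Hypothetical example: three samples, two main solutions, three analytes.
C = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
S = np.array([[1.0, 3.0], [2.0, 2.0], [3.0, 1.0]])
D = C @ S.T

ok, err = reconstruction_ok(D, C, S)
ok_shifted, err_shifted = reconstruction_ok(D + 1.0, C, S)
```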
The system may further comprise a measurement device having at least one sensor. Thereby, parameters which are useful for many dynamic processes can efficiently be measured.
The at least one sensor of the measurement device may be configured to measure at least one parameter in-situ in the dynamic process, the specific process step or the variation thereof and/or the at least one sensor may comprise at least one of the following: a pH probe, a temperature probe, a spectroscopic sensor, or a multichannel instrument.
The system may further comprise at least one control device comprising at least one temperature control structure and/or at least one pH control structure. With such a control device the conditions can be defined and adjusted. Thereby, the at least one control device preferably is adapted to recreate the conditions of the dynamic process or the specific process step or the variation thereof.
The at least one control device is preferably configured to recreate respective conditions of the dynamic process or the specific process step or the variation thereof.
The at least one control device may also be configured to break the co-linearity between parameters or analytes comprised in the data matrix D using a co-linearity breaking method.
Preferably, the at least one control device is configured to break the co-linearity using at least one of the following co-linearity breaking methods: creating of an orthogonal design of experiments, random spiking of the specific compounds in the at least one sample, programmed spiking, random mixing of samples or any combination thereof.
The system may further comprise an analytical device configured to analyze at least one of the main solutions and/or the at least one sample.
Preferably, the means for determining the respective relative amount of the necessary main solutions with respect to the data matrix D are configured to determine the respective relative amount of the necessary main solutions for at least two instants of time. Preferably, the system further comprises means for performing a regression between the relative amount of the necessary main solutions comprised in the matrix C and a respective time variable of the historical process data using the determined relative amount of the necessary main solutions for the at least two instants of time; means for estimating the relative amount of the necessary main solutions for at least one other instant of time between the at least two instants of time based on the regression; and means for creating an augmented time dependent matrix Caug comprising C and the estimated necessary relative amount of main solutions for the at least one other instant of time. Preferably, the mixing unit is configured to assemble the necessary main solutions comprised in S and to mix the necessary main solutions comprised in Caug according to the determined respective relative amount.
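A minimal sketch of building the augmented matrix Caug is given below, assuming a polynomial regression of each column of C against time (first order by default). The function name, the regression type, and the clipping and renormalisation of the estimated rows are assumptions made for illustration.

```python
import numpy as np

def augment_C(t, C, t_new, degree=1):
    """Fit one polynomial per main solution (column of C) against time t,
    estimate the relative amounts at t_new and merge them with C in time order."""
    t = np.asarray(t, dtype=float)
    t_new = np.asarray(t_new, dtype=float)
    coeffs = np.polyfit(t, C, degree)               # shape (degree + 1, n_solutions)
    C_new = np.vander(t_new, degree + 1) @ coeffs   # evaluate the fits at t_new
    C_new = np.clip(C_new, 0.0, None)               # assumed: no negative amounts
    C_new /= C_new.sum(axis=1, keepdims=True)       # assumed: fractions sum to 1
    t_aug = np.concatenate([t, t_new])
    order = np.argsort(t_aug)
    return t_aug[order], np.vstack([C, C_new])[order]

# Hypothetical relative amounts of two main solutions at three instants of time.
t = [0.0, 10.0, 20.0]
C = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
t_aug, C_aug = augment_C(t, C, t_new=[5.0, 15.0])
```

For profiles that are not linear in time, a higher polynomial degree or a different regression model could be used in the same place.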
The invention also relates to a computer program product comprising one or more computer readable media having computer executable instructions for performing the steps of at least one of the aforementioned methods. Such a computer program product allows for efficiently performing the method according to the invention and achieving the respective benefits.
Among others, decomposing the data matrix using a factor decomposition method, such as PCA or Evolving Factor Analysis or other suitable method, allows for determining how many main solutions are needed to completely mimic the dynamic process or the specific process step or a variation thereof. Thereby, the factor decomposition methods can be used according to good modelling practices known by a person skilled in chemometrics.
Estimation of initial multicomponent analyte composition and/or relative amount for each main solution can, for example, be achieved by evolving factor analysis as, e.g., described in [2] or by determination of the purest sample/variable as, e.g., described in [4]. Particularly, such estimation can be performed as an initial estimation after determining the number of main solutions necessary to mimic the dynamic process or process step or variation thereof and, where applicable, their appearance and disappearance time.
With the initial estimation and number of main solutions necessary to mimic the dynamic process or process step or variation thereof the data matrix D or its smaller parts Di as described below are deconvoluted by the MCR-ALS algorithm with imposed constraints to obtain and optimize the multicomponent analyte composition of each of the main solutions to be used and its relative amount at a specific time to mimic the dynamic process or process step or a variation thereof described by matrix D or Di if only part of the process is to be mimicked. Thus, deconvoluting the data matrix by a MCR-ALS algorithm with imposed constraints allows for mimicking the dynamic process or process step or a variation thereof described by the data matrix. In particular, the MCR-ALS constrained algorithm can resolve equation 1 mentioned above in an iterative way. Thereby, e.g., it can be implemented in the method according to the invention as follows:
Firstly, establish the number of main solutions determined by a factor decomposition method, such as PCA or evolving factor analysis or other suitable methods. Secondly, estimate the initial multicomponent analyte composition (first iteration) of each main solution and/or the relative amount of each main solution. Thirdly, estimate the analyte concentration in main solution (ST) based on logical constraints and physical/chemical constraints for main solutions (for example an analyte concentration cannot be negative—non-negative constraint). Fourthly, estimate the relative amount of each main solution (C) based on logical constraints and physical/chemical constraints (for instance non-negative constraint and the sum of all fractions should be equal to 1—closure constraint). Fifthly, compare the results obtained by the MCR-ALS algorithm (C×ST) with the original data matrix (D or Di) using equations 2 through 4 mentioned above to evaluate the iteration performance. Sixthly, if the iteration performance is not as high as required then the initial estimation is changed and steps 3 to 5 described above are repeated. This will be repeated until the performance of the MCR-ALS algorithm is adequate to mimic the dynamic process or process step or a variation thereof presented in matrix D or Di. Seventhly, when finished the analyte concentration for main solutions (ST) that can mimic the dynamic process, process step or variation thereof or that can mimic part of it, and the amount of each main solution (C) at a given time of the dynamic process, process step or variation thereof are obtained for each of the experimental data points in D or Di.
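The iteration described above can be sketched in simplified form as follows. This is not a full MCR-ALS implementation: non-negativity is imposed by clipping after an unconstrained least-squares step, closure by row normalisation, the fit is judged by a relative residual rather than by equations 2 through 4, and the loop stops when the residual no longer changes. The function name, the data and the initial estimate are hypothetical.

```python
import numpy as np

def mcr_als_sketch(D, C0, max_iter=200, tol=1e-8, floor=1e-12):
    """Alternate between estimating S^T and C so that C @ S^T approximates D."""
    C = np.asarray(C0, dtype=float).copy()
    prev_err = np.inf
    for _ in range(max_iter):
        # Estimate S^T from the current C, then apply the non-negativity constraint.
        St = np.linalg.lstsq(C, D, rcond=None)[0]
        St = np.clip(St, 0.0, None)
        # Estimate C from the current S^T, then apply non-negativity and closure.
        C = np.linalg.lstsq(St.T, D.T, rcond=None)[0].T
        C = np.clip(C, floor, None)            # small floor avoids all-zero rows
        C /= C.sum(axis=1, keepdims=True)      # each row of C sums to 1
        # Compare the product C x S^T with the original data matrix.
        err = float(np.linalg.norm(D - C @ St) / np.linalg.norm(D))
        if abs(prev_err - err) < tol:
            break
        prev_err = err
    return C, St, err

# Hypothetical data: five samples that are mixtures of two main solutions.
f = np.linspace(1.0, 0.0, 5)
C_true = np.column_stack([f, 1.0 - f])
St_true = np.array([[1.0, 0.2, 3.0], [0.1, 2.0, 0.5]])
D = C_true @ St_true

# Hypothetical initial estimate of the relative amounts.
g = np.array([0.9, 0.7, 0.5, 0.3, 0.1])
C0 = np.column_stack([g, 1.0 - g])

C, St, err = mcr_als_sketch(D, C0)
```

A production implementation would typically use a proper non-negative least-squares solver for each step instead of clipping.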
By obtaining a relative amount of each main solution (C) for the determined number of main solutions at a specific time each of the data points, i.e. the rows of the data matrix (D), can be related with a time variable. The specific time can be represented by a relative or absolute time variable from original process data. In particular, it can be any time between the beginning and the end of the dynamic process or process step or a variation thereof comprised in the data matrix. Like this, it is possible to represent the amount of each main solution (C) against the absolute or relative time variable (t) associated with each row of the data matrix D. This allows for increasing the number of simulated data points that can mimic the process or process step.
The MCR-ALS loading matrix also referred to as the ST matrix can particularly be an n×m matrix where n is the number of main solutions that, e.g., was established by a factor decomposition method, such as PCA, evolving factor analysis or other suitable method and m is the number of properties such as analytical composition, pH, temperature, etc. that has been measured in the dynamic process or the specific process step or a variation thereof and are included in the data matrix D, e.g. m is equal to the number of columns in matrix D, for the MCR-ALS algorithm. Thus, each row of ST is one main solution and each column of such row is a concentration or other property that follow the same order as the data matrix or D matrix. To obtain the main solutions of the MCR-ALS one may have to assemble the solutions so they can match the concentrations and properties established by the MCR-ALS algorithm, i.e., if main solution 1 (ST1—first row of ST) results were concentration x for compound a, y for compound b and z for compound c, then one may have to determine the amount of each compound (x, y and z) so that the concentration of the solution is equal to the ST1 vector. In a separate apparatus, ST2 (second main solution and second row of the ST matrix) may have to be assembled according to the results obtained in the MCR-ALS algorithm and the same for the remaining main solutions. Such an apparatus, e.g., can be comprised in the synthetic sample preparing system described below.
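The amount of each compound needed to assemble one main solution follows directly from the corresponding row of ST and the volume to be prepared. The sketch below assumes concentrations expressed in g/L and uses hypothetical compound names and values.

```python
def compound_amounts(st_row, volume_l):
    """Mass (g) of each compound needed for volume_l litres of one main solution.
    st_row maps a compound name to its target concentration in g/L (assumed units)."""
    return {name: conc * volume_l for name, conc in st_row.items()}

# Hypothetical first row of ST: concentrations x, y and z for compounds a, b and c.
st1 = {"a": 2.0, "b": 0.5, "c": 1.25}
amounts = compound_amounts(st1, volume_l=0.5)
```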
Unlike the common use of the MCR-ALS method or algorithm mentioned above, the method according to the invention uses MCR-ALS in a reverse way, i.e., it starts with the known profile such as, for instance, the composition of specific components along time, and finds the necessary main solutions and their relative amounts in the mixture that allow dynamic processes or specific process steps or a variation thereof to be mimicked, assuming that dynamic processes or specific process steps or parts of such processes or a variation thereof can be simulated as a mixture of synthetic main solutions.
The method according to the invention can be carried out in a discrete, fed-batch or continuous manner with lower or higher level of automation. Although not limiting for the present invention, an apparatus for such experimental execution should comprise at least a mixing device of some kind and should allow the in-situ measurement of one or preferably more sensors such as, pH probes, temperature probes, spectroscopic sensors, and other analytical instruments that can be used to build multivariate calibrations and supervision systems. In a preferred embodiment this apparatus should also have at least temperature and pH control systems. And in a more preferred embodiment the apparatus should have the proper control systems that allow the recreation of the dynamic process conditions, or the specific process step conditions or a variation thereof that might influence the multivariable analytical devices used to build multivariate calibrations and supervision systems.
Thus, the present invention describes a new method that is comprised of creating a set of synthetic samples that mimic a dynamic process, a specific process step or a variation thereof that can be used to develop multivariate calibrations and supervision systems. In order to enhance calibration models accuracy and robustness the introduction of randomly or programmed spiked and orthogonal experimental designed samples is also described. The new method works by using prior knowledge of dynamic processes, a specific process step or variation thereof expressed by their available data such as, e.g., process parameters such as temperature, pH, feeds, etc. and quality control data such as analyte concentration, reagent and product concentration along time, etc. and using such data to build mixtures profiles that can be used to build synthetic samples that mimic the process dynamics. Like this, a novel methodology to overcome the calibration developments drawbacks listed above can be provided comprised by preparing synthetic multicomponent samples that can mimic dynamic processes, a specific process step or a variation thereof and using orthogonal design of experiments or random or programmed spikes or random or programmed mixing of samples to enhance calibration models performance.
The method described herein takes into account samples that mimic a process dynamic, i.e., have the same behaviour as process samples. Such a method can tackle problems related to process dynamics versus sampling frequency, allows the creation of a high number of samples such that several batches or operation modes can be mimicked, reduces the high amount of offline analytics needed (highly reduced since only main solutions have to be determined) and reduces time required to develop calibrations in complex systems (for instance fermentations). However, it does not solve the problem of highly correlated data present in the process.
Therefore, samples preferably are prepared according to an orthogonal design of experiments, and data from spectroscopic or other analytical instruments is acquired that can be related to the chemical composition or physical properties of the samples determined by a reference analytical method. As such, within this step of the method a new set of samples is created that can break co-linearity between parameters or analytes present in the data matrix D. This can be achieved by preparing samples according to the orthogonal design of experiments, which may include full or fractional factorial designs or d-optimal designs, by random spiking of specific compounds in samples or by programmed spiking. The introduction of such samples allows the creation of correlation-breaking samples to enhance model robustness, accuracy and selectivity that are not attained by process mimicking. These samples may have to be prepared individually in the case of orthogonal designed samples or can be built from existing samples in the case of random or programmed samples. After sample preparation a multivariable analytical device can be used to acquire data (X) that can be related to the chemical composition or physical properties of the samples determined by a reference analytical method (y).
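A two-level full factorial design, one of the orthogonal designs mentioned above, can be generated as in the following sketch. The coded levels -1 and +1 and the choice of three factors are hypothetical; each factor would correspond to one analyte whose concentration is set to a low or high level.

```python
from itertools import product

def full_factorial(levels_per_factor):
    """All combinations of the given levels, one run (sample) per row."""
    return [list(run) for run in product(*levels_per_factor)]

def column_dot(design, i, j):
    """Dot product of two design columns; zero means the columns are orthogonal."""
    return sum(run[i] * run[j] for run in design)

# Three hypothetical analytes, each at a coded low (-1) and high (+1) level.
levels = [(-1, 1), (-1, 1), (-1, 1)]
design = full_factorial(levels)
```

Because every pair of columns is orthogonal, the analyte levels in these samples are uncorrelated by construction.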
As mentioned herebefore, random and programmed spiking only account for known and measurable variations. In some cases, however, there might be some unknown or undesired correlations that have to be taken into consideration. Therefore, in order to break these correlations random or programmed mixing of synthetic or process samples is recommendable. A detailed description can be found in example 4.
Thereby, synthetic multicomponent samples that mimic the dynamic process or the specific process step or a variation thereof and the prepared samples preferably are grouped together. In particular, the two data subsets, i.e. the synthetic multicomponent samples that mimic the dynamic process or the specific process step or a variation thereof and the co-linearity breaking samples, can be grouped together to produce accurate and robust multivariate calibrations for each component following good modelling practices known by persons skilled in chemometrics. For the development of a supervision system based on MSPC control charts and graphics, only the synthetic multicomponent samples that mimic the dynamic process or the specific process step or a variation thereof are used.
In summary, the method according to the invention is comprised of a series of steps that are described in detail above and below. According to the invention, it is assumed that dynamic processes or specific process step or a variation thereof can be mimicked by the use of scale independent multicomponent mixtures, hereafter and herebefore called main solutions, and by adding different amounts of such solutions in a mixture, hereafter and herebefore called relative amount of solution.
The described method allows to significantly reduce time and resources to develop multivariate calibrations and supervision systems based on spectroscopic sensors by allowing scale down from process scale to lab scale, reducing the amount of reference analytics and decreasing the amount of time for creating the dataset to be used in such multivariate calibration and supervision systems. Furthermore, the multicomponent systems developed by the presented method enable the estimation of chemical (concentration, relative amounts, acidity, etc.), physical (e.g. state of cells, density, etc.) properties and multivariate process profiles based on spectroscopic sensors.
With the described method, process dynamics can be followed, e.g. reaction profiles or mixing profiles, where a profile is considered to be a trend that evolves with time, and samples can be produced that follow such a trend. As such, the invention allows evolving (batch) multivariate statistical control charts (often referred to as MSPC) to be built using the spectra registered on the produced samples and by mixing two or more of these samples to produce sub-samples that mimic a process trend that evolves with time.
With the present invention, samples and sub-samples can be created that can mimic the spectral information of a reaction without having to execute a real reaction.
Using the described method and/or system, mathematical or computational simulation of spectra to develop the calibration can be avoided. In fact, real complex samples are created that mimic process trends (for instance, reaction profiles) from which spectra are acquired. As the method allows creating samples that mimic process evolution as a function of time, the user can build many more samples than with the methods disclosed in the prior art and can benefit from the reduced analytical burden.
Furthermore, as the method allows for chemical and physical variation to be mimicked, there is no need to simulate the spectra of the process by computational/mathematical simulations. Also the described method allows to measure spectral non-linearity imposed by processes and also takes into account other interferences (for instance, one could use a different media in a fermentation run to introduce such variation in the process without having to execute a fermentation run). Finally, spectra acquired from real samples are used and not simulated by an algorithm or computational method. Thus, the method allows the usage of the matrix effects in the presented model.
Preferably, matrix effects in this context are changes in the spectra that cannot be related with changes in the process state (i.e. its composition at a given time). Preferably, matrix effects can be one of the following:
One of the advantages of the invention is that matrix effects can be taken into account by using samples from the process and random mixing can be performed so that these unwanted effects can be removed from the model, or at least taken into account, i.e. increasing model robustness.
Moreover, the method allows determining how many main solutions are required to mimic a process (for instance a fermentation run), what the chemical composition of each main solution is and how to mix such solutions in order to obtain the synthetic samples that mimic the process or process step.
In summary, the method and the system according to the invention allow processes and their variation to be mimicked by assembling the main solutions by weighing each of the chemical compounds or measuring a volume of a dissolved compound or mixture and diluting them in a solvent (for instance media) to obtain the concentrations determined by the method according to the invention and then mixing the required volumes of each main solution to obtain synthetic samples that mimic process samples.
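The mixing step can be sketched as follows: one row of C gives the fraction of each main solution, from which the volumes to be mixed and the resulting composition of the synthetic sample follow. The sketch assumes that volumes are additive; the function names and all values are hypothetical.

```python
def mixing_volumes(fractions, total_volume_ml):
    """Volume of each main solution needed for one synthetic sample."""
    total = sum(fractions)
    return [total_volume_ml * f / total for f in fractions]

def mixture_composition(fractions, main_solutions):
    """Analyte concentrations of the mixture, assuming additive volumes.
    main_solutions holds one row of ST (one main solution) per entry."""
    total = sum(fractions)
    n_analytes = len(main_solutions[0])
    return [
        sum(f / total * sol[j] for f, sol in zip(fractions, main_solutions))
        for j in range(n_analytes)
    ]

# Hypothetical row of C (25% of main solution 1, 75% of main solution 2)
# and hypothetical ST with three analytes per main solution.
c_row = [0.25, 0.75]
st = [[4.0, 0.0, 2.0], [0.0, 8.0, 2.0]]

volumes = mixing_volumes(c_row, total_volume_ml=10.0)
composition = mixture_composition(c_row, st)
```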
Also, since this method allows the process to be mimicked as a mixture of solutions instead of a real process step, it is possible to make the sampling frequency adequate for process dynamics. This can be an advantage over developed calibrations and supervision systems based only on process samples since this allows the development of evenly distributed and more robust calibrations.
Additionally, because this method allows the recreation of samples/profiles based on historical data at lab scale and since historical data is largely available in the industry this allows the development of very specific quality by design calibrations and supervision systems that can include the entire design space of samples from conception, through process development and industrial production.
Synthetic samples produced in this way can have the same high correlation coefficients between some of the parameters as real in-process samples, which do not guarantee spectral selectivity and non-correlativity requisites for calibration development. As such, the use of this method can be combined with orthogonal experimental designed or randomly/programmed spiked samples as proposed in the present invention.
The use of such a mixed model allows combining advantages of both methods and allows the creation of accurate and robust multivariate calibrations and supervision systems based on spectroscopic sensors and chemometrics.
These and other aspects of the invention will be apparent from and elucidated with reference to the embodiments and examples described hereinafter.
Summarizing the above, the invention relates to a method to tackle the drawbacks that are usually found in the development of multivariate calibrations and monitoring systems based on spectroscopic measurements (near infrared, mid infrared, Raman, 2D-Fluorescence and other spectroscopies) outlined above.
To overcome these drawbacks it is proposed to create two different types of multicomponent samples (synthetic multicomponent samples and co-linearity breaking samples). Using both groups of samples for multivariate calibrations assures better accuracy and robustness and is an innovative approach in this scientific area because one does not need to collect real process samples.
Thereby, the main achievements are:
As described above and below, the creation of synthetic multicomponent samples uses several known chemometric methods/algorithms, integrated into an innovative methodology to mimic dynamic processes or a specific process step or a variation thereof. Also, a particular take on a specific chemometric method allows the creation of these samples in an innovative way.
An aspect which might be of particular relevance in certain scenarios is the creation of supervisory systems based on the methodology proposed. E.g., Example 6 below illustrates one possible supervision system created in accordance with the present invention. This point specifically can open a very important field of scaling up (potentially speeding up industrialization of biopharmaceutical processes).
Also the creation of co-linearity breaking samples is explained. The use of both types of multicomponent samples to develop multivariate calibrations might be of particular relevance. Example 5 below illustrates one possible accurate and robust multivariate calibration created in accordance with the present invention.
In the following, embodiments of the invention are described by means of specific examples. The examples are provided for illustrative purposes, and are not intended to limit the scope of the invention as claimed herein. Any variations in the exemplified articles are intended to fall within the scope of the present invention.
Example 1 relates to synthetic multicomponent fermentation samples. It is intended to show a step-by-step implementation of part of the present invention in a multicomponent (several analytes) complex system, without imposing any limitations on its use in other systems not covered by this example.
A good example for the application of the present invention is its use to mimic fed batch fermentations, as for example, Chinese Hamster Ovary (CHO) mammalian cell cultivation. Such fed batch fermentations contain multicomponent complex matrix analytes that vary depending on cell activity and imposed process conditions such as feeds, feed rates, pH and temperature profiles, etc.
These systems are rather complex in nature, and the need for monitoring them has long been an integral part of industrial processes since it allows increased automation and the introduction of advanced control schemes. In recent years, process analytical technology (PAT) has emerged as a driver for overcoming the drawbacks of process monitoring, generally related to the failure to capture process dynamics because of insufficient sampling frequency and to the high number of offline analytical measurements needed at the process that are not available at the right time (the time between sampling and analysis is too long for process control).
Applying PAT tools to real-time multivariate data collected by an on-line sensor (such as Near Infrared, Mid Infrared, 2D Fluorescence, Dielectric Spectroscopies, or any other kind of on-line sensors) provides an efficient means of identifying/reducing variation, managing process risks, relating process information to critical quality attributes (CQAs) and determining process improvement opportunities such as detecting contaminations, increasing yield, reducing impurity variability and implementing advanced process control for such processes. Despite their considerable advantages, PAT tools require a development stage, also called calibration, in which in-process samples have to be measured by the on-line sensor and a sample can be withdrawn to be analyzed by standard analytical methods, although in some cases this analysis is not required. Usually, in-process samples have to be measured in process conditions, but the following example shows a way to create multicomponent synthetic samples that mimic the process dynamics without the need to collect data in a real process.
In this example, the analytes profiles from a typical fed-batch fermentation were measured by offline standard analytical methods and recorded in digital format to make them available for future reference of batch trends. For exemplification of the present invention one culture parameter (viable cell density) and seven analytes profiles (glutamine, glutamate, glucose, lactate, ammonia, product and asparagine) were chosen to recreate such a fermentation using synthetic multicomponent mixtures.
(Step 1) The first step is the evaluation/organization of such historical process data. In the present example this dataset contains analytical data for viable cell density (VCD), glutamine (GTF), glutamate (GLU), glucose (GLF), lactate (LAF), ammonia (NHF), product (PRO) and asparagine (ASN), as well as the fermentation time (t) at which the sample was obtained from the fermentation batch. For this example no physical parameters were considered as the fermentation was run under constant pH and temperature. Table 1 shows the normalized data matrix considered for this example already ordered according to the requirements presented for the present invention.
(Step 2) In a second step, the data matrix D is created from the data available in Table 1. Columns 2 to 9 (VCD to ASN) are used, yielding a D matrix with 14 rows (samples) and 8 columns (multicomponent analyte composition), which is then pre-processed using an auto scaling data pre-treatment. The resulting data matrix is then analyzed by Principal Component Analysis to estimate the number of main solutions needed to mimic the process dynamics presented in Table 1. In this case, no other pre-processing technique of the dataset was needed, but in some cases the use of other pre-processing algorithms might be recommended.
The pre-processed data matrix D from step two was decomposed using Principal Component Analysis. This allows obtaining the optimum number of multivariate main solutions that have to be used in the present invention to mimic the dynamic process or process step or a variation thereof. For selecting the optimum number of components the usual good modelling practices for PCA were used, i.e., the number of components should guarantee capturing more than 95% of the data variation, and only principal components with an eigenvalue higher than 1 should be chosen. As such, for the present example three principal components were chosen, as can be seen in Table 2. This number of components is equal to the initial estimate of the number of main solutions necessary to describe the fermentation batch profile, with 97.4% of the data variation captured by the model.
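The two selection criteria can be evaluated as in the following sketch, which auto scales a data matrix, computes the PCA eigenvalues through a singular value decomposition and reports both the number of components needed to exceed 95% cumulative variance and the number of eigenvalues above 1. The data matrix is a small hypothetical one, not the data of Table 1, and the function names are illustrative.

```python
import numpy as np

def pca_eigenvalues(D):
    """Eigenvalues of the correlation matrix of D and their cumulative variance share."""
    Z = (D - D.mean(axis=0)) / D.std(axis=0, ddof=1)   # auto scaling
    eigvals = np.linalg.svd(Z, compute_uv=False) ** 2 / (Z.shape[0] - 1)
    return eigvals, np.cumsum(eigvals) / eigvals.sum()

def component_counts(D, min_variance=0.95):
    """Return (k_var, k_eig): components needed to exceed min_variance,
    and the number of components with an eigenvalue above 1."""
    eigvals, cumulative = pca_eigenvalues(D)
    k_var = int(np.argmax(cumulative > min_variance)) + 1
    k_eig = int(np.sum(eigvals > 1.0))
    return k_var, k_eig

# Hypothetical data: columns 1-2 follow one trend, columns 3-4 an unrelated one.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, -1.0, -1.0, 1.0])
D = np.column_stack([x, 2.0 * x, y, 3.0 * y])

eigvals, cumulative = pca_eigenvalues(D)
k_var, k_eig = component_counts(D)
```

When the two criteria disagree, the choice between them remains a modelling decision for the user.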
As a result of the PCA, three main solutions can be used to prepare the synthetic multicomponent samples to mimic the process dynamics. Although this number is not too high there might be limitations in the implementation of such a procedure. As such, the present example will assume that it was only possible to use binary mixtures of main solutions to mimic the entire fermentation or parts of it. As it can be seen from Table 2, using only two components to describe the complete fermentation run leads to a large amount of variation that is not explained (up to 20%) by the PCA, resulting in deviations from the original data (batch profile for each analyte). As such, in this case (explained variation<95%) it is advisable to split the dynamic process (fermentation) into smaller fractions of such process. For illustrative purposes, the present example will use only two main solutions to describe parts of a complete dynamic process. In the end, to simulate the complete batch run of the present examples it is necessary to run each of its smaller parts and group the data sets together at the end.
To obtain the smaller parts of the fermentation process that can be mimicked using only two main solutions, matrix D was reanalyzed using the PCA algorithm, but this time starting with a smaller number of samples (three to start with). Alternatively, Evolving Factor Analysis can be used to identify changes in the dynamic process in a more automatic way. The three samples were analyzed with the number of PCs fixed to two, and if the explained variance was higher than 95%, another sample was added to the data set. This procedure was stopped when the captured variance fell below 95%. At that point, the n−1 iteration is considered the best interval for the first subset of the original D matrix. In order to maintain continuity of the fermentation profile, the second subset of data starts at the same point at which the preceding subset ends. The procedure is repeated to find the second, third and other necessary intervals until the complete fermentation run is covered. The results obtained with this procedure for the present example contain four data subsets, which are presented in Table 3.
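The interval search described above can be sketched as a simple greedy loop. The sketch assumes that each candidate subset is autoscaled on its own before the PCA, which the text does not state, and it runs on hypothetical random data. The names `explained_by` and `split_into_intervals` are illustrative.

```python
import numpy as np

def explained_by(X, n_pcs=2):
    """Fraction of variance captured by the first n_pcs PCs of autoscaled X."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    std = Xc.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # assumption: constant columns are left unscaled
    var = np.linalg.svd(Xc / std, compute_uv=False) ** 2
    total = var.sum()
    return 1.0 if total == 0 else var[:n_pcs].sum() / total

def split_into_intervals(D, n_pcs=2, threshold=0.95, start=3):
    """Greedy split of the rows of D into intervals that share their end points."""
    D = np.asarray(D, dtype=float)
    n = len(D)
    intervals, first = [], 0
    while first < n - 1:
        # Three centered samples have rank two, so the starting window always passes.
        last = min(first + start, n)  # exclusive end of the accepted window
        # Accept one more sample as long as two PCs still capture enough variance.
        while last < n and explained_by(D[first:last + 1], n_pcs) > threshold:
            last += 1
        intervals.append((first, last - 1))  # inclusive row indices
        first = last - 1                     # next subset starts where this one ends
    return intervals

# Hypothetical 14 x 8 data set standing in for matrix D.
D = np.random.default_rng(1).random((14, 8))
intervals = split_into_intervals(D)
```

Each returned pair holds the first and last row index of one subset, and consecutive subsets share one sample, which keeps the profile continuous.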
(Step 3) After establishing the number of main solutions for each part of the fermentation run, the MCR-ALS algorithm included in this invention requires an initial estimate of the multicomponent analyte composition and of the relative amount of each main solution. The determination of the purest sample, based on [4], was used for this initial estimate. This estimate assumes that the initial content of one main solution is as close as possible to 1 (or 100%) while that of the other main solution is as close as possible to 0 (or 0%).
(Step 4) With the number of main solutions fixed by the PCA procedure and the initial estimate set, the MCR-ALS algorithm can be used to extract the composition of the main solutions and the mixture profile necessary to mimic the fermentation batch run. In order to obtain meaningful solutions, three logical constraints were used:
(Step 5) The results from the MCR-ALS algorithm for each of the data subsets were obtained and are presented in Tables 4 and 5. The relative amounts of each main solution are described by the scores of the MCR-ALS algorithm (Table 4), whereas the multicomponent analyte composition of each main solution is described by the loadings of the MCR-ALS algorithm (Table 5).
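How scores and loadings combine into estimated analyte profiles can be shown with a short sketch. It assumes the usual bilinear form, in which the estimated data equal the product of the relative amounts and the main solution compositions. All numbers are hypothetical and are not taken from Tables 4 and 5, and `lack_of_fit` is an illustrative helper.

```python
import numpy as np

# Hypothetical relative amounts: rows are samples, columns are main solutions A and B.
C = np.array([[1.0, 0.0],
              [0.6, 0.4],
              [0.2, 0.8],
              [0.0, 1.0]])

# Hypothetical compositions: rows are main solutions, columns are analytes.
S = np.array([[10.0, 0.5, 2.0],
              [2.0, 4.5, 6.0]])

# Estimated analyte concentrations of every sample.
D_hat = C @ S

def lack_of_fit(D, D_est):
    """Percent lack of fit between measured data D and its estimate D_est."""
    D = np.asarray(D, dtype=float)
    return 100.0 * np.sqrt(((D - D_est) ** 2).sum() / (D ** 2).sum())
```

A low lack of fit against the measured matrix D would indicate that the chosen number of main solutions reproduces the batch profiles well.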
(Step 6) The MCR-ALS algorithm estimations were evaluated by plotting the batch profiles of each analyte contained in the original data (matrix D) and MCR-ALS estimations obtained by multiplying the relative amount of solutions (Table 4) by the concentration of analytes in the main solutions (Table 5), as described in Equation 1 above. As it can be seen in
The resolved concentration profiles for each simulated mixture are compared with the original experimental dataset (
Example 2 relates to increasing the number of Synthetic Multicomponent Fermentation Samples for better capturing of process dynamics.
One of the drawbacks of off-line measurements in dynamic processes is the small number of samples available for measurement compared to the variation of a specific analyte. This is of great importance when dealing with online sensors, such as Near and Mid Infrared spectroscopies, among other sensors. Thus, the present example uses part of the solution provided in Example 1 to increase the number of samples that can be used to mimic process steps. This is of great use if a well-defined profile is necessary in some parts of the process, especially to increase the number of samples in under-sampled process phases. Good examples of such under-sampling can be seen in the GTF profile in
Using the obtained main solutions relative amount on data subset 1 and the corresponding fermentation times (in time units) from the original data matrix D, it is possible to represent each main solution relative amount as a “relative amount profile”, i.e., relative amount represented against time (
By using the method described in this example, the number of samples can be increased according to the user's needs. Furthermore, this is not limited to a binary system and can be used to increase the number of samples in more complex systems, where the number of solutions is higher.
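A minimal sketch of how a relative amount profile can yield additional samples is given below. It assumes plain linear interpolation between the original sampling times, which is only one possible way of modelling the profile, and it uses hypothetical times and relative amounts for a binary mixture of solutions A1 and B1.

```python
import numpy as np

# Hypothetical sampling times (time units) and relative amounts of solution A1.
t = np.array([0.0, 24.0, 48.0, 72.0])
a1 = np.array([1.00, 0.70, 0.30, 0.00])

# Ten evenly spaced time points instead of the four original ones.
t_new = np.linspace(t[0], t[-1], 10)
a1_new = np.interp(t_new, t, a1)

# Assumption: in a binary mixture the two relative amounts sum to one.
b1_new = 1.0 - a1_new
```

For systems with more than two main solutions, the same interpolation could be applied to each relative amount profile separately.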
The increase of samples was also applied to the other data subsets presented in Example 1 of the present invention. The MCR-ALS data and the increased data points obtained in this step of the invention were used to create the synthetic samples that mimic the historical fermentation batch. An example of the relative amounts of each solution for the first data subset (solutions A1 and B1) is presented in Table 7.
This process was extended to the other data subsets presented in Example 1, and the method was tested experimentally by assembling the solutions (A1, A2, A3, A4, B1, B2, B3 and B4) and recreating the process dynamics by mixing the determined amount of each main solution for each specific time.
The experimental implementation was evaluated by plotting the batch profiles of each analyte contained in the original data (matrix D) and the profiles obtained by creating the samples described above. As it can be seen in
Example 3 relates to creating data samples to make calibration models more accurate and robust.
The examples presented so far only represent the process dynamics and, as such, have the same behaviour as process samples. The fermentation profiles naturally retain a metabolism-induced concentration correlation between cellular substrates and metabolic products. In fact, analyzing the correlation of the analytes presented in the off-line data (Table 1) shows that these correlations are evident. Since the data generated by Example 1 and Example 2 completely mimic the fermentation profile, it is also expected that the correlations are retained by those synthetic multicomponent samples. As can be seen in Table 8, some analytes and culture parameters present high correlation (0.8<|R|<1), medium correlation (0.6<|R|<0.8) or slight correlation (0.5<|R|<0.6), while others present no correlation at all (|R|<0.5). Analyte and culture parameter profiles that are highly correlated may lead to calibrations that lack selectivity, thus creating indirect calibrations that might underperform if new batches present different profiles. Thus, the present invention comprises a step of creating a new set of samples that can break co-linearity between parameters or analytes present in the data matrix D, enabling the creation of samples to be included in the development of robust and accurate calibrations of spectroscopic sensors.
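The correlation classes above can be applied programmatically, as in the following sketch. The data are hypothetical and only the column labels PRO, GLU and VCD are borrowed from the text. The treatment of values lying exactly on a class boundary is an assumption, since the stated ranges leave it open.

```python
import numpy as np

def classify(r):
    """Map a correlation coefficient to the classes used in the text."""
    a = abs(r)
    if a > 0.8:
        return "high"
    if a > 0.6:
        return "medium"
    if a > 0.5:
        return "slight"
    return "none"

def correlated_pairs(D, names, level="high"):
    """List the column pairs of D whose correlation falls in the given class."""
    R = np.corrcoef(np.asarray(D, dtype=float), rowvar=False)
    return [(names[i], names[j], float(R[i, j]))
            for i in range(len(names))
            for j in range(i + 1, len(names))
            if classify(R[i, j]) == level]

# Hypothetical data: the first two columns rise together, the third does not.
names = ["PRO", "GLU", "VCD"]
D = np.array([[1.0, 3.0, 2.0],
              [2.0, 5.0, 1.0],
              [3.0, 7.0, 2.0],
              [4.0, 9.0, 1.0],
              [5.0, 11.0, 2.0]])
pairs = correlated_pairs(D, names, level="high")
```

The pairs returned for the high and medium classes would be the natural candidates for correlation-breaking samples.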
The preparation of such samples can be done in several ways, including, but not limited to, orthogonal design of experiments, random spiking of specific compounds in samples, or programmed spiking. The present example illustrates a way of implementing programmed spiking to break the co-linearity of fermentation profiles. Programmed spiking starts by searching Table 8 for pairs of analytes that present a high correlation, for instance the correlation between product (PRO) and glutamate (GLU). As can be seen in
A way to break such correlations is to take samples that follow the linear correlation and spike them with one of the analytes while keeping the other at a constant value. This procedure breaks the correlation between the pair of analytes. The introduction of such spikes creates new samples that start to break such correlations (shown in
In order to achieve an optimal dataset for calibration development, this analysis and sample preparation procedure can be repeated for other pairs of variables identified in Table 8 as having high and medium correlation.
The samples created with this procedure can be added to the dataset of samples created in Example 1 and Example 2, creating an extended dataset characterized by lower correlations. As can be seen in Table 9, the inclusion of such correlation-breaking samples in the dataset created in Example 1 and Example 2 can enhance model robustness, accuracy and selectivity in a way that is not attained by process mimicking alone. This is confirmed by the reduction of data correlation between each pair of analytes (presented in Table 9).
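The effect of adding spiked samples to a correlated dataset can be illustrated numerically. The sketch below uses hypothetical PRO and GLU values and a hypothetical spiking program in which samples with low PRO receive the largest GLU spike; none of the numbers come from Tables 8 or 9.

```python
import numpy as np

# Hypothetical process-like samples in which GLU follows PRO linearly.
pro = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
glu = 2.0 * pro + 1.0
r_before = np.corrcoef(pro, glu)[0, 1]

# Programmed spiking: PRO is kept unchanged while GLU is raised,
# with larger spikes for the samples that have low PRO (assumed program).
spikes = np.array([10.0, 8.0, 6.0, 4.0, 2.0])
pro_spiked = pro.copy()
glu_spiked = glu + spikes

# Extended dataset: original samples plus correlation-breaking samples.
pro_all = np.concatenate([pro, pro_spiked])
glu_all = np.concatenate([glu, glu_spiked])
r_after = np.corrcoef(pro_all, glu_all)[0, 1]
```

In this toy case the correlation drops from the high class into the range that the text treats as uncorrelated; with real samples the reduction depends on the chosen spiking program.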
After sample preparation, the spectroscopic sensor or analytical instrument can be used to acquire data (X) that can be related to the chemical composition or physical properties of the samples determined by a reference analytical method (y).
Example 4 relates to creating data samples to make calibration models more accurate and robust by minimizing unknown or undesired correlations that cannot be measured by analytical methods. Examples of these unknown or undesired correlations are mixtures of solvents or solutions with different spectra that are not directly correlated with the analyte to be calibrated, dynamic processes that change spectra over time without changing the concentration of the analyte being calibrated, and any other changes that affect the spectra without changing the analyte amount.
When developing calibrations according to Examples 1 and 2, the samples generated contain highly correlated metabolite concentrations as a result of metabolic relations. To overcome this constraint, the correlation-breaking samples from Example 3 were used. As mentioned before, these samples do not break the matrix correlation contained in real fermentations or other dynamic processes. Therefore, an additional dataset was planned based on mixing random amounts of two or more process samples (meaning samples from process experiments, i.e., fermentation runs) chosen in an arbitrary way.
For this example, it is assumed that a matrix effect is present and that samples from the beginning of the process have different spectra from those of the remaining process stages. This is the case in a fermentation run, wherein the fermentation feeds do not have the same composition as the fermentation broth and the analytes cannot explain all the variation in the spectra. In these cases, the use of an additional dataset containing process samples randomly mixed from different process stages will contribute to reducing the so-called “matrix effects” not accounted for in the analyte profiles, i.e., if samples taken after feeding present a different spectrum due to a different matrix being introduced into the fermentation broth, making random or programmed mixtures will help to break the co-linearity due to this effect.
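A random mixing plan of this kind can be sketched as follows. The sketch assumes that analyte concentrations combine linearly with the mixing fractions and draws the fractions from a uniform Dirichlet distribution; both choices, the sample values and the function name `random_mixtures` are illustrative assumptions.

```python
import numpy as np

def random_mixtures(samples, n_mixtures, n_parents=2, seed=0):
    """Mix randomly chosen process samples in random proportions."""
    rng = np.random.default_rng(seed)
    samples = np.asarray(samples, dtype=float)
    recipes, mixtures = [], []
    for _ in range(n_mixtures):
        # Pick distinct parent samples and random fractions that sum to one.
        parents = rng.choice(len(samples), size=n_parents, replace=False)
        fractions = rng.dirichlet(np.ones(n_parents))
        recipes.append((parents, fractions))
        # Assumption: concentrations mix linearly with the fractions.
        mixtures.append(fractions @ samples[parents])
    return recipes, np.array(mixtures)

# Hypothetical analyte concentrations of four process samples (rows).
process_samples = np.array([[0.5, 20.0, 1.0],
                            [2.0, 15.0, 2.5],
                            [6.0, 8.0, 4.0],
                            [9.0, 2.0, 5.5]])
recipes, mixtures = random_mixtures(process_samples, n_mixtures=5)
```

Each recipe records which process samples to combine and in which proportion, so the same plan can be used both to prepare the mixtures and to estimate their expected composition.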
Example 5 relates to creating NIRS and MIRS calibrations using samples generated according to the presented method.
The samples generated in Examples 1 through 4 were used to build multivariate calibrations that relate the NIR and MIR spectra of the samples, or parts of those spectra, to a chemical or physical property of a specific sample. The relationship between spectra and chemical and physical properties requires the use of multivariate data analysis methods, such as Projection to Latent Structures (PLS) regression. The optimization of such methods can involve advanced spectra pre-processing, specific wavelength selection, sample selection methods, etc., which are known to persons skilled in both chemometrics and spectroscopy. The details of developing such calibrations are not included in this example since they are out of the scope of the present invention. As such, the present example serves as proof that, using the samples generated with the method in accordance with the invention, it is possible to create calibrations that can be used in real fermentation runs.
For that, the spectra (NIR, MIR, or others not presented in this example) of all samples generated in Examples 1, 2, 3 and 4 were acquired, and the analytical properties of these samples were either obtained by laboratory analytical methods or estimated from the main solutions. The calibrations for the analytes presented in Examples 1, 2, 3 and 4 were developed using good modelling practices, which are known to people skilled in this area. The model performance for the determination of viable cell density using NIRS and of glucose using MIRS is presented here as an example of possible applications of the present invention (
The models presented above were challenged by using them to predict viable cell density and glucose in real fermentation runs. The results presented in
Regarding the results obtained for the VCD model, two other things have to be considered. First, the model built with synthetic multicomponent samples does not have information about morphology/cell state along the simulated fermentation, i.e., cells introduced in the engineered samples are not in the same state as they would be in a real fermentation. This clearly influences model performance, especially at the end of the fermentation run, where the number of cells is high but the number of viable cells starts to decrease.
Example 6 relates to multivariate process trajectories development using synthetic multicomponent samples.
Spectroscopic techniques, such as those presented in Example 5, can be very valuable tools to monitor fermentation runs. Also, the use of these spectroscopic techniques is not limited to the development of calibrations for monitoring or controlling single process events. In fact, monitoring process trajectories is an interesting area where chemometric tools can play an important role. The great advantage of these tools is the ability to capture much more than the individual analyte profiles and, as such, they can be very useful for process supervision and control, even without quantification methods.
The presented method according to the invention also allows the creation of multivariate process trajectories based on the samples created in Examples 1 and 2, since these samples present the same characteristics as real fermentation samples. As such, using spectroscopic sensor information it is possible to build multivariate batch trajectories that can be used to benchmark current fermentation runs against prior batches and to identify deviations.
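The idea of a multivariate batch trajectory can be sketched with a PCA model and a projection step. The sketch assumes mean-centering as the only spectral pre-processing and uses random numbers in place of real spectra; `fit_pca` and `project` are illustrative names.

```python
import numpy as np

def fit_pca(X, n_pcs=2):
    """Fit a mean-centered PCA model and return the mean and the loadings."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_pcs].T  # loadings: variables x n_pcs

def project(X_new, mean, loadings):
    """Scores of new spectra in the space of an existing PCA model."""
    return (np.asarray(X_new, dtype=float) - mean) @ loadings

# Hypothetical spectra: 20 synthetic samples and 6 time points of a new batch,
# each with 50 spectral variables.
rng = np.random.default_rng(2)
synthetic_spectra = rng.random((20, 50))
batch_spectra = rng.random((6, 50))

mean, loadings = fit_pca(synthetic_spectra)
reference_scores = project(synthetic_spectra, mean, loadings)
batch_scores = project(batch_spectra, mean, loadings)
```

Plotting the batch scores in time order against the reference scores would give the trajectory used to benchmark a running batch.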
To exemplify the use of this technique, the samples produced in Examples 1 and 2 were used to build a NIRS PCA model according to good modelling practices. After model optimization, the spectra from a fermentation run were acquired inline. These spectra were projected onto the PCA model to monitor the batch performance. As it can be seen in
The analyte composition, as well as optional other solution characteristics such as physical properties, of the necessary main solutions is contained in a matrix S, and the relative amount of each main solution to be used for preparing/creating the synthetic process samples is contained in a matrix C.
Using the information in matrix S, i.e., which main solutions are necessary to create the at least one sample mimicking the dynamic process and which analyte composition each of these main solutions has, the necessary main solutions are assembled/mixed according to the determined analyte composition.
Then, using the information in matrix C, i.e., the relative amount determined for each assembled main solution, the assembled main solutions are mixed according to their respective relative amounts. In this way, the at least one sample mimicking the dynamic process is created.
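As a small worked illustration of the mixing step, the relative amounts of one row of matrix C can be converted into volumes for a given sample size. The sketch assumes that the relative amounts are volume fractions, which the text does not specify, and the solution names and numbers are hypothetical.

```python
def mixing_volumes(relative_amounts, total_volume_ml):
    """Convert relative amounts of main solutions into volumes to be mixed."""
    total = sum(relative_amounts.values())
    if total <= 0:
        raise ValueError("relative amounts must sum to a positive value")
    # Normalizing by the total tolerates rows that do not sum exactly to one.
    return {name: total_volume_ml * amount / total
            for name, amount in relative_amounts.items()}

# Hypothetical row of matrix C for one synthetic sample of 20 mL.
volumes = mixing_volumes({"A1": 0.75, "B1": 0.25}, total_volume_ml=20.0)
```

If the relative amounts were defined by mass instead, the same calculation would apply with masses in place of volumes.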
It is understood by the skilled person that according to another embodiment the analyte composition as well as optional other solution characteristics, as e.g. physical properties etc., of the necessary main solutions do not have to be necessarily contained in a Matrix S but can also be contained in any other representation, e.g. an ordinary table. According to another embodiment the relative amount of each main solution which is used for preparing/creating the synthetic process samples does also not necessarily have to be contained in a matrix C but can also be contained in any other representation, e.g. an ordinary table.
While the invention has been illustrated and described in detail in the drawings and foregoing description, such illustration and description are to be considered illustrative or exemplary and not restrictive. It will be understood that changes and modifications may be made by those of ordinary skill within the scope and spirit of the following claims. In particular, the present invention covers further embodiments with any combination of features from different embodiments described above and below.
The invention also covers all further features shown in the Figs. individually although they may not have been described in the afore or following description. Also, single alternatives of the embodiments described in the figures and the description and single alternatives of features thereof can be disclaimed from the subject matter of the invention.
Furthermore, in the claims the word “comprising” does not exclude other elements or steps, and the indefinite article “a” or “an” does not exclude a plurality. A single unit or step may fulfil the functions of several features recited in the claims. The terms “essentially”, “about”, “approximately” and the like in connection with an attribute or a value particularly also define exactly the attribute or exactly the value, respectively. The term “about” in the context of a given numerical value or range refers to a value or range that is, e.g., within 20%, within 10%, within 5%, or within 2% of the given value or range. A computer program may be stored/distributed on a suitable medium, such as an optical storage medium or a solid-state medium supplied together with or as part of other hardware, but may also be distributed in other forms, such as via the Internet or other wired or wireless telecommunication systems. In particular, e.g., a computer program can be a computer program product stored on a computer readable medium which computer program product can have computer executable program code adapted to be executed to implement a specific method such as the method according to the invention. Any reference signs in the claims should not be construed as limiting the scope.
In the context of the invention, the terms and abbreviations listed below can relate to the following definitions:
Robust calibrations: Meaning that the calibrations can still predict accurately even if the sample has some interferences (physical or chemical).
[2] Gampp, H, M Maeder, C J Meyer, and A D Zuberbühler. “Calculation of equilibrium constants from multiwavelength spectroscopic data-IV: Model-free least squares refinement by use of evolving factor analysis.” Talanta 33 (1986): 943-951.
[3] Jaumot, Joaquim, Raimundo Gargallo, Anna de Juan, and Romà Tauler. “A graphical user-friendly interface for MCR-ALS: a new tool for multivariate curve resolution in MATLAB.” Chemometrics and Intelligent Laboratory Systems 76 (2005): 101-110.
Number | Date | Country | Kind |
---|---|---|---|
13199699.3 | Dec 2013 | EP | regional |
This application is a continuation of and claims priority to PCT patent application no. PCT/EP2014/079152, filed Dec. 23, 2014, which claims priority to European patent application no. 13199699.3, filed Dec. 27, 2013, both of which applications are hereby incorporated by reference in their entirety.
Number | Date | Country | |
---|---|---|---|
Parent | PCT/EP2014/079152 | Dec 2014 | US |
Child | 15192708 | US |