The present invention relates to a method and system for testing machine learning models.
An embedded system is a combination of a processor, a memory and other hardware that is designed for a specific function or which operates within a larger system. Examples of embedded systems include, without limitation, microcontrollers, ready-made computer boards, and application-specific integrated circuits. Embedded systems may be found within many Internet of Things (IoT) devices. Some embedded systems may utilize a machine learning model to analyze and process data collected from various sensors provided in the embedded system. Using machine learning models in this way allows for more efficient and effective processing of the large volume of data collected by the IoT device.
A drawback, however, is that these machine learning models may lose accuracy over time, due to new input behavior (such as the evolution of input data), degradation or loss of accuracy of input sensors, or upgrading of the input sensors. Consequently, machine learning models deployed on IoT embedded systems require updates, sometimes as frequently as on an hourly basis.
According to a first aspect there is provided a method performed by an electronic device for testing machine learning models, the electronic device comprising a program for executing a first machine learning model and a second machine learning model, the method comprising: the electronic device receiving a machine learning model update data package; partially or fully updating the first machine learning model to generate the second machine learning model using the machine learning model update data package; executing the program, whereby the program executes both the first machine learning model and the second machine learning model using a common set of input data; and collecting outputs from the first machine learning model and the second machine learning model for analysis.
According to a second aspect there is provided an electronic device with a processing element and a data storage element, the storage element containing code that, when executed by the processing element, causes the electronic device to perform a method for testing machine learning models, the method comprising: the electronic device receiving a machine learning model update data package; partially or fully updating a first machine learning model to generate a second machine learning model using the machine learning model update data package; executing a program on the electronic device, whereby the program executes both the first machine learning model and the second machine learning model using a common set of input data; and collecting outputs from the first machine learning model and the second machine learning model for analysis.
According to a third aspect there is provided a non-transitory computer-readable storage medium containing code that, when executed by an electronic device, causes the electronic device to perform a method for testing machine learning models, the method comprising: the electronic device receiving a machine learning model update data package; partially or fully updating a first machine learning model to generate a second machine learning model using the machine learning model update data package; executing a program on the electronic device, whereby the program executes both the first machine learning model and the second machine learning model using a common set of input data; and collecting outputs from the first machine learning model and the second machine learning model for analysis.
Before discussing the embodiments with reference to the accompanying figures, the following description of embodiments is provided.
A first embodiment provides a method performed by an electronic device for testing machine learning models, the electronic device comprising a program for executing a first machine learning model and a second machine learning model, the method comprising: the electronic device receiving a machine learning model update data package; partially or fully updating the first machine learning model to generate the second machine learning model using the machine learning model update data package; executing the program, whereby the program executes both the first machine learning model and the second machine learning model using a common set of input data; and collecting outputs from the first machine learning model and the second machine learning model for analysis.
The machine learning model update data package may be a delta update of the program on the electronic device. The delta update may include a difference between the first machine learning model and the second machine learning model to allow the second machine learning model to be generated from the update package. In this way, in some embodiments, the amount of data that needs to be transferred to the electronic device can be reduced.
The first machine learning model and the second machine learning model may be different versions of a common machine learning model. Accordingly, generating the second machine learning model from the update data package may comprise generating a later version of the common machine learning model from an earlier version of the common machine learning model using the update data package.
The machine learning model update may be received via a wireless connection.
The program may execute the first machine learning model and the second machine learning model in parallel. In other implementations, the program may execute the first machine learning model and the second machine learning model sequentially.
In some embodiments, the electronic device may be configured to send the results of executing the first machine learning model and the second machine learning model to a related infrastructure for analysis.
In other embodiments, analysis of the results of executing the first machine learning model and the second machine learning model may be performed on the electronic device.
The electronic device may be configured to perform unsupervised learning. In such embodiments, executing the first machine learning model may generate first output values and executing the second machine learning model may generate second output values. The method may comprise running the program on a plurality of sets of input data to generate a plurality of sets of first and second output values; analyzing the first and second output values to identify a property of each of the first and second output values; and selecting one of the first machine learning model and the second machine learning model based on the identified properties. The method may further comprise analyzing the first and second output values to determine whether the property has deviated beyond a threshold amount from a desired performance of the machine learning models. The property may be an intra-class entropy and/or an extra-class entropy.
The method may comprise selecting one of the first machine learning model and second machine learning model based on the model having output values that have at least one of a smaller intra-class entropy value and a larger extra-class entropy value.
In other embodiments, the electronic device is configured to perform supervised learning. Executing the first machine learning model may generate first output values and executing the second machine learning model may generate second output values. The method may comprise running the program on a plurality of sets of input data to generate a plurality of sets of first and second output values; calculating a first entropy associated with the first output values and a second entropy associated with the second output values; and selecting one of the first machine learning model and the second machine learning model based on the calculated first entropy and second entropy.
The program may be configured to send a request to receive an updated model if the first entropy and second entropy do not meet a predetermined criterion. The predetermined criterion may be based on a deviation of the value of the first entropy and/or the second entropy from a threshold value associated with the machine learning models.
The method may comprise selecting one of the first machine learning model and second machine learning model based on the values of the first entropy and second entropy. The method may comprise selecting the first machine learning model in a case that the first entropy is lower than the second entropy and selecting the second machine learning model in a case that the second entropy is lower than the first entropy.
The method may further comprise checking the model update package for a signature to prevent installation of malicious code on the electronic device.
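By way of non-limiting illustration, the following Python sketch shows one minimal way such a signature check could be realized using a shared symmetric key; the key name and packaging format are assumptions for the example, and a production deployment would more likely use an asymmetric signature scheme.

```python
import hashlib
import hmac

def verify_update_package(package_bytes: bytes, signature: bytes, shared_key: bytes) -> bool:
    """Recompute an HMAC-SHA256 tag over the received update package and
    compare it against the attached signature in constant time. The update
    is installed only if the check passes."""
    expected = hmac.new(shared_key, package_bytes, hashlib.sha256).digest()
    return hmac.compare_digest(expected, signature)
```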
A second embodiment provides an electronic device comprising a processing element and a data storage element, the storage element storing code that, when executed by the processing element, causes the electronic device to perform a method for testing machine learning models, the method comprising: the electronic device receiving a machine learning model update data package; partially or fully updating a first machine learning model to generate a second machine learning model using the machine learning model update data package; executing a program on the electronic device, whereby the program executes both the first machine learning model and the second machine learning model using a common set of input data; and collecting outputs from the first machine learning model and the second machine learning model for analysis.
The electronic device may further comprise a wireless connection element.
The data storage element may further store code that, when executed by the processing element, provides a secure transfer function that checks a signature included with data transfers to prevent the installation of malicious code.
The electronic device may be an embedded system.
A further embodiment provides a non-transitory computer-readable storage medium containing code that, when executed by an electronic device, causes the electronic device to perform a method for testing machine learning models, the method comprising: the electronic device receiving a machine learning model update data package; partially or fully updating a first machine learning model to generate a second machine learning model using the machine learning model update data package; executing a program on the electronic device, whereby the program executes both the first machine learning model and the second machine learning model using a common set of input data; and collecting outputs from the first machine learning model and the second machine learning model for analysis.
A further embodiment provides a method performed by an infrastructure element for sending a machine learning model update package to an electronic device, the method comprising: identifying a machine learning model currently installed on the electronic device; creating a delta update corresponding to a difference between a machine learning model to be installed on the electronic device and the identified machine learning model currently installed on the electronic device; and sending a machine learning model update package to the electronic device including the delta update.
The machine learning model update package may further include statistics relating to performance of the machine learning model to be installed. The statistics may include a measure of entropy of the outputs of the machine learning model to be installed. The measure of entropy may be one of an intra-class entropy and an extra-class entropy.
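By way of non-limiting illustration, the sketch below shows one way the infrastructure element might construct such a delta and package it with statistics. It assumes both model versions share the same architecture and that the weights are held as named NumPy arrays; a production system might instead apply a binary diff (e.g. a bsdiff-style tool) to the serialized model.

```python
import numpy as np

def make_model_delta(old_weights: dict, new_weights: dict) -> dict:
    """Per-tensor difference between the model currently installed on the
    device and the model to be installed; unchanged tensors are omitted to
    keep the update package small. Same architecture is assumed."""
    delta = {}
    for name, new in new_weights.items():
        old = old_weights[name]
        if not np.array_equal(old, new):
            delta[name] = new - old
    return delta

def make_update_package(old_weights: dict, new_weights: dict, statistics: dict) -> dict:
    """Bundle the delta with performance statistics (e.g. an entropy
    threshold) for transmission to the electronic device."""
    return {"model_delta": make_model_delta(old_weights, new_weights),
            "statistics": statistics}
```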
Particular embodiments will now be described with reference to the Figures.
The software stack of the board 11 includes an application 26 that contains A/B testing code 31, a machine learning model A 32, and a machine learning model B 33.
A/B testing as performed by the application 26 is a method of testing two different machine learning models, in this case machine learning model A 32 and machine learning model B 33, using the same input data. By collecting statistics on the performance of the two machine learning models, their performance can be evaluated against each other. This allows the better-performing machine learning model to be selected for use in further inference processing.
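A minimal sketch of such an A/B harness, treating the two models as callables that map an input set to output values (an assumption for the example), might look as follows:

```python
def run_ab_test(model_a, model_b, input_sets):
    """Feed each set of input data to both models and collect the two
    outputs side by side for later statistical comparison."""
    results = []
    for x in input_sets:
        results.append({"output_a": model_a(x), "output_b": model_b(x)})
    return results
```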
If, however, the machine learning model is not a first version machine learning model, the process is instead directed from step S52 to step S57.
Once the model delta is generated, in step S59 the A/B testing code 31 is updated to perform A/B testing with the two machine learning models Mi-1 and Mi. Within the A/B testing, machine learning model Mi-1 forms model A, and machine learning model Mi forms model B.
In step S510, both the updated code for A/B testing created in step S59 and the model delta generated in step S58 are uploaded to the board 11 as an update package. The use of a binary delta in the update package is useful when updating ML models in IoT devices in the field because, in some applications, network bandwidth is limited and costly. In other applications, the power available to the IoT device may be limited, for example in the case of a coin-cell-operated board, so saving power by making less use of the wireless communication module 14 may be desirable.
Upon receiving an update package over the wireless connection, the board 11 updates the application 26 so that the A/B testing code 31 portion is updated, and the machine learning models are updated in accordance with the configuration in the update package.
The A/B test metrics are then sent to the related infrastructure 42 to allow determination of which of the two machine learning models performed better on the input data. The sending of test metrics rather than the node activation data from the two models may be useful in reducing the amount of data that needs to be sent to the related infrastructure and thereby reducing power requirements at the board 11.
The operation of the A/B testing application will now be described in further detail.
Once the update package has been downloaded to the board 11, the board 11 generates machine learning model Mi by applying the model delta to the existing model Mi-1 already on the board. When the A/B testing application is run with input data, the board 11 can then be used to determine the more suitable model.
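Continuing the per-tensor delta assumption used in the earlier infrastructure-side sketch, the board-side reconstruction might look like this:

```python
def apply_model_delta(current_weights: dict, model_delta: dict) -> dict:
    """Generate the weights of model Mi by adding the received per-tensor
    differences to the weights of model Mi-1 already on the board."""
    updated = dict(current_weights)
    for name, diff in model_delta.items():
        updated[name] = updated[name] + diff
    return updated
```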
The board 11 is usable both in the case where the machine learning models are configured to perform supervised learning and where they are configured to perform unsupervised learning. Each of these examples will now be described.
At the stage of developing the machine learning model, in step S71, a set of sample data statistics, e.g. a vector or a buffer of entropies, is attached to the model as part of the update package. In the case of a clustering model, these entropies indicate a degree of variation within a cluster (intra-class entropy) and a degree of separation between the clusters (extra-class entropy). More generally, the sample data statistics attached to the model are statistics that indicate the typical dispersion of the activation values for nodes as the model learns from the incoming input data. In particular, the sample data statistics are a measure of the entropy of the set of node values. Each node in the output layer is said to identify a class of outputs (e.g. letters in a character recognition model). The entropy values may be intra-class entropy values and/or extra-class entropy values. The intra-class entropy values indicate how much the node values for a class or cluster vary (for example, how much the values vary when detecting the letter 'c'). The extra-class entropy values indicate how different the activation values are between nodes, for example indicating a typical difference between the node values when detecting the letter 'c' and when detecting the letter 'z'.
The sample data statistics are received by the board 11 along with the A/B testing code 31 and the updated machine learning model 33. In step S72, the board 11 runs the A/B test. The board 11 runs both machine learning model A 32 and machine learning model B 33 for a plurality of sets of input data, each set of input data being processed by both machine learning models. In step S73, the node activation values from running the machine learning models on the plurality of sets of input data are stored in the data storage 13.
In step S74, the stored node activation values are processed. For each input data set, two sets of node activation values are stored, corresponding to model A 32 and model B 33 respectively. For each set of input data, the entropy between the node activation values is calculated to give an extra-class entropy. The extra-class entropy may be averaged across the plurality of sets of output data.
An entropy in the activation values for a particular node, across several sets of input data classified in the same cluster, is calculated to generate an intra-class entropy value.
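The description leaves the exact entropy estimator open. One plausible realization, sketched below in Python, uses a histogram-based Shannon entropy over node activations; the array layout (inputs by output nodes), the binning, and the use of the winning node per cluster are assumptions for the example, not a prescribed implementation.

```python
import numpy as np

def hist_entropy(values, bins=16, eps=1e-12):
    """Shannon entropy (bits) of a histogram of the given values:
    concentrated values give low entropy, dispersed values high entropy."""
    hist, _ = np.histogram(values, bins=bins)
    p = hist / (hist.sum() + eps)
    return float(-(p * np.log2(p + eps)).sum())

def extra_class_entropy(activations):
    """Dispersion among the node activation values for each input, averaged
    over all inputs; larger values suggest better-separated clusters.
    `activations` is assumed to be shaped (num_inputs, num_nodes)."""
    return float(np.mean([hist_entropy(row) for row in activations]))

def intra_class_entropy(activations, cluster_ids):
    """Variation of the winning node's activation within each cluster,
    averaged over clusters; smaller values suggest more consistent clusters."""
    entropies = []
    for c in np.unique(cluster_ids):
        vals = activations[cluster_ids == c].max(axis=1)
        entropies.append(hist_entropy(vals))
    return float(np.mean(entropies))
```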
The intra-class and extra-class entropies are compared with threshold values to determine whether learning models A and B are deviating from the desired performance. This may happen if the intra-class entropy values start to become large (a cluster within the clustering model is spreading out) or if the extra-class entropy values start to become small (the clusters are moving closer to each other and the model is making less clear predictions).
In step S75, the performance of the models over the input data sets is evaluated. This evaluation may include evaluating the intra-class entropy of the node activation values for each model and the extra-class entropy of the node activation values for each model. A smaller intra-class entropy value and a larger extra-class entropy value are desirable. A small intra-class entropy indicates that the node activation values are quite consistent when a label is to be assigned; for example, the signal may be consistently high when a letter 'c' is detected in an image. A large extra-class entropy indicates that there is a significant difference between the node activation values in a case where a label is to be assigned. This is desirable because it means that there is a clear signal, such as the node activation value for the letter 'c' being much higher than the node activation value for the letter 'z' in a case when a 'c' is detected.
In step S75, the system will select the machine learning model having output values that have at least one of a smaller intra-class entropy value and a larger extra-class entropy value.
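How the two criteria are combined when they disagree is left open by the description; one simple possibility, assumed purely for this sketch, is to score each model by the difference between its extra-class and intra-class entropies:

```python
def select_model(stats_a, stats_b):
    """stats_*: dicts holding the 'intra' and 'extra' entropies measured
    for each model. A larger extra-class entropy and a smaller intra-class
    entropy both raise a model's score."""
    score_a = stats_a["extra"] - stats_a["intra"]
    score_b = stats_b["extra"] - stats_b["intra"]
    return "A" if score_a >= score_b else "B"
```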
In a case in which neither model A 32 nor model B 33 produces output values that resemble the sample statistics provided with the machine learning model when it was installed, across a predetermined number of sets of input data, the system will determine that neither model is working well and will request an updated model from the related infrastructure 42.
In other embodiments, the board 11 may use supervised learning to compare the two machine learning models.
A supervised machine learning model is developed on the related infrastructure 42 using a set of training data. Each item of training data includes input values, which are input to the neural network, and a label that indicates the correct result. For example, for a machine learning model for recognizing a person in an image, the training data should include the image data and a label that indicates whether or not the image includes a person. The person recognition model may be trained using a set of training images. For each data item (e.g. image) in the set of training data, the trained model will generate a set of node activation values. In the case of a trained neural network, the training data should, in many cases, generate a high node activation corresponding to the label associated with the training data. In the example of person recognition, a node should give a high activation value if a person is included in the image of the training data.
In step S81, during development of the machine learning model, an entropy is calculated for activation values in the output layer of the trained model obtained based on the training data set. In a case where the node activation values (probabilities) are relatively close to each other, uncertainty in the model output is high, so entropy is large. In contrast, in a case where the node activation values (probabilities) are relatively far from each other and there is a strong activation value from one output node, uncertainty in the model output is low, so entropy is low. The average of all these entropies is calculated in step S82, providing a threshold measure of entropy. This threshold measure is then attached to the machine learning model and included in the update package sent to the board 11 in step S83. Upon receipt, the application is updated as previously described.
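A sketch of steps S81-S82, treating the trained model as a callable that returns a vector of output probabilities (an assumption for the example), might be:

```python
import numpy as np

def prediction_entropy(probs, eps=1e-12):
    """Shannon entropy (bits) of one output probability vector; low when a
    single node dominates, high when the node values are close together."""
    p = np.asarray(probs, dtype=float)
    return float(-(p * np.log2(p + eps)).sum())

def entropy_threshold(model, training_inputs):
    """Average output entropy of the trained model over the training set
    (steps S81-S82); this value is attached to the update package as a
    reference measure for the board."""
    return float(np.mean([prediction_entropy(model(x)) for x in training_inputs]))
```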
In step S84, the updated application is run on the board 11 for multiple sets of input data. Each set of input data is processed by both machine learning model A 32 and machine learning model B 33. In step S85, entropy values are calculated by the application for the output activation values of each machine learning model. In step S86, the model having the lower entropy in the node activation values is selected for use in subsequent inference.
If neither of the models has sufficiently low entropy values for a reasonable number of input data items, the application 26 may send a request to the related infrastructure 42 to update the model. An entropy value may be determined to be insufficiently low if it deviates by more than a predetermined amount from the threshold measure of entropy that was sent with the machine learning model. In other implementations, the maximum entropy for a certain number of classes could be calculated theoretically (e.g. for two classes the maximum entropy is 1 bit). In such implementations, the need for the expected entropy to accompany every model is removed. However, it is anticipated that providing a threshold measure of entropy with the model will be a desirable option, as some data are naturally harder to train on.
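A sketch of steps S84-S86, including the update request, is given below; the deviation margin and the treatment of the models as callables returning output probabilities are assumptions for the example.

```python
import numpy as np

def mean_output_entropy(model, input_sets, eps=1e-12):
    """Average Shannon entropy (bits) of the model's output vectors over
    the given sets of input data."""
    entropies = []
    for x in input_sets:
        p = np.asarray(model(x), dtype=float)
        entropies.append(-(p * np.log2(p + eps)).sum())
    return float(np.mean(entropies))

def ab_entropy_test(model_a, model_b, input_sets, threshold, margin=0.5):
    """Steps S84-S86: run both models on the same inputs, compare average
    output entropies, and select the lower-entropy model. If both deviate
    above the shipped threshold by more than `margin` (an assumed policy),
    signal that an updated model should be requested instead."""
    ent_a = mean_output_entropy(model_a, input_sets)
    ent_b = mean_output_entropy(model_b, input_sets)
    if min(ent_a, ent_b) > threshold + margin:
        return {"selected": None, "request_update": True}
    return {"selected": "A" if ent_a <= ent_b else "B", "request_update": False}
```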
In other embodiments where the machine learning models are configured for supervised learning, the models may be evaluated against a predetermined set of input data with a known set of labels. In such a supervised test, the board 11 is provided with a predetermined set of input data and the results of the two models are compared with known correct outcomes. The predetermined set of input data may be input to the board by providing input data to sensors of the board 11. For example, images could be presented to a camera of the board 11 or sounds presented to a microphone of the board 11.
Evaluation of the results of the supervised test may be performed on the board 11 or on the related infrastructure 42. In a case where the results are evaluated on the board, once the supervised test has been performed, the board may send a request to the related infrastructure 42 to receive labels associated with the supervised test. The related infrastructure 42 may send the labels for the supervised test to the board 11 in response to the request. The board may then compare the predictions of the two machine learning models against the received set of labels and evaluate which machine learning model provided more accurate predictions.
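A minimal sketch of this on-board evaluation, assuming the predictions and the received labels are simple sequences of class identifiers:

```python
def compare_against_labels(preds_a, preds_b, labels):
    """Score each model's predictions against the labels received from the
    related infrastructure and report which model was more accurate."""
    acc_a = sum(p == y for p, y in zip(preds_a, labels)) / len(labels)
    acc_b = sum(p == y for p, y in zip(preds_b, labels)) / len(labels)
    return {"accuracy_a": acc_a, "accuracy_b": acc_b,
            "better_model": "A" if acc_a >= acc_b else "B"}
```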
In other implementations, after each input data item is processed by machine learning model A 32 and machine learning model B 33, the application may cause the board 11 to send a query to obtain the corresponding label. The related infrastructure sends the label in response to the request from the board 11. The application may suspend processing on the board 11 until the label is received, then resume processing and increment the scores of the A/B test. At the end of the predetermined test period, the board 11 will provide a measurement comparing machine learning model A 32 and machine learning model B 33 to the related infrastructure 42.
In other implementations, the board 11 may perform a supervised test and send the predictions from each of machine learning model A 32 and machine learning model B 33 to the related infrastructure. The infrastructure can then evaluate the predictions of the two machine learning models against the correct set of labels for the test. This implementation may be useful in cases where there is going to be a delay between the board 11 performing the supervised test and the labels against which the machine learning models are going to be evaluated being available. The related infrastructure 42 will typically have more storage and processing power than the board 11 and may be a useful place to store the predictions during the delay. Once the related infrastructure 42 has evaluated the two machine learning models, a result of the evaluation may be sent back to the board 11. This may allow the application 26 to select one of the machine learning models to use for further inference.
Embodiments described above allow A/B testing on a board 11 to select between two machine learning models. It could be considered that such testing between machine learning models may be carried out in simulation away from the board 11. However, it has been found that in some cases the simulations may not adequately reflect how a given machine learning model will perform when implemented on the board 11. This may be due to hardware limitations on the board or other factors. Accordingly, it has been found that providing an application for performing A/B testing is useful because it allows machine learning models, particularly small machine learning models of the type that can be installed on an embedded system, to be tested natively.
In some implementations, a plurality of boards 11 may be connected to a common infrastructure 42. The A/B testing may be performed on one or more boards to test between a model version i and a model version i-1, as previously described. In a case that the infrastructure determines that the model version i performs better than model version i-1, the plurality of boards may be migrated to the model version i. The infrastructure may determine that the model version i performs better either by evaluating the results of the two models directly or by receiving metrics from the board 11 performing A/B testing. The migration of the boards from model version i-1 to model version i may be performed incrementally in batches. For example, if one hundred boards are connected to the infrastructure 42, the boards may be transitioned to the new model in batches of 10-20 percent. This allows the performance of the new model to be evaluated on a larger number of boards before completely transitioning to the new model.
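As an illustrative sketch of such an incremental migration, where the `deploy` and `evaluate` callbacks are assumptions standing in for the infrastructure's update and metric-collection mechanisms:

```python
import math

def staged_rollout(boards, deploy, evaluate, batch_fraction=0.15):
    """Migrate the boards to the new model in batches of roughly 10-20%,
    pausing after each batch to confirm the new model still performs
    acceptably on the boards migrated so far before continuing."""
    batch_size = max(1, math.ceil(len(boards) * batch_fraction))
    migrated = []
    for i in range(0, len(boards), batch_size):
        batch = boards[i:i + batch_size]
        for board in batch:
            deploy(board)              # push the new model to this board
        migrated.extend(batch)
        if not evaluate(migrated):     # halt the rollout if metrics regress
            return False
    return True
```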
As mentioned in the preceding paragraph, the A/B testing may be performed on several of a plurality of boards 11. This allows a determination of whether or not the new model performs consistently across the boards 11. A machine learning model could be selected that is robust to variations in the performance of the boards, perhaps due to variations in the performance of the sensors on the boards or due to differences in the location of installation of the boards.