Data is regularly centrally collected from a wide variety of different client devices for analytical and other purposes. For example, on the basis of analysis of such centrally collected data, it can be assessed how the client devices and the applications running on them can be improved, and impending or actual failures or malfunctions can be detected so that appropriate reactive and proactive actions can be performed. Client devices in this respect can include individual user devices, such as desktop, laptop, and notebook computers, as well as smartphones, tablet computing devices, and other types of computing devices. Client devices can also include peripheral devices such as printing and other imaging devices, as well as Internet-of-Things (IoT) devices that may be computationally lightweight and low-cost devices that primarily report data.
As noted in the background, data is often centrally collected from client devices for analytical and other purposes. While such central data collection is beneficial, users may be hesitant to share their data in this way due to privacy concerns. Moreover, governmental regulations can dictate how the data can be collected. Therefore, before data is reported from a client device, the data may be subjected to a data anonymization protocol that effectively removes identifying information from the data and prevents such identifying information of the devices and/or their users from being discerned from the reported data. Adjusting centralized collection of data in this manner can mitigate users' privacy concerns and satisfy privacy-related governmental regulations.
One type of data anonymization protocol is an adaptive differential privacy protocol. An example of an adaptive differential privacy protocol is described in the pending PCT patent application “Determination of data parameters based on distribution of data for privacy protection,” filed on Oct. 5, 2020, and assigned patent application number PCT/US2020/053830. In accordance with an adaptive differential privacy protocol, instead of reporting the raw data that it collects, a client device may report the number of values that fall in each of a number of data collection bins. The server device centrally collecting the reported data from the client devices therefore receives an obfuscated version of the data collected at each client device. Furthermore, the reported data may constitute a probabilistically accurate response, such that the server device cannot ascertain with 100% confidence whether any given reported value is accurate or random.
In an adaptive differential privacy protocol, the data collection bins are variably sized, instead of uniformly sized, in accordance with the distribution of the data being centrally collected. That is, an adaptive differential privacy protocol adapts individual bin sizes according to the underlying distribution of the data being collected. For example, for data values within the range of 1 to 100, instead of having equally sized bins corresponding to values 1-5, 6-10, 11-15, . . . , 95-100, the bins are variably sized based on the underlying distribution of the data being collected. Data values that are collected at higher frequency are assigned to smaller sized bins (e.g., bins corresponding to fewer values), and data values that are collected at lower frequency are assigned to larger sized bins (e.g., bins corresponding to more values).
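As an illustrative, non-limiting sketch of one way such variably sized bins can be realized (equal-frequency binning at quantiles of an approximately known distribution is an assumption here, not necessarily the protocol's exact mechanism), bin edges can be placed so that frequent value ranges receive narrow bins and rare ranges receive wide ones:

```python
import numpy as np

def variable_bin_edges(approx_samples, num_bins):
    """Compute variably sized bin edges via equal-frequency (quantile) binning:
    densely populated value ranges get narrower bins, sparse ranges wider ones."""
    quantiles = np.linspace(0.0, 1.0, num_bins + 1)
    return np.quantile(approx_samples, quantiles)

# Example: right-skewed data yields narrow bins near zero, wide bins in the tail.
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=10.0, size=10_000)
edges = variable_bin_edges(skewed, num_bins=20)
print(np.round(edges, 1))  # bin widths grow toward the sparse tail
```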
An adaptive differential privacy protocol improves the quality of the data being centrally collected, while still preserving privacy. Because the data is reported over data collection bins that reflect the underlying distribution of the data being collected, analyses performed on the data can result in better and more accurate conclusions being drawn. However, the distribution of the data being collected has to be approximately known before an adaptive differential privacy protocol can be applied to the data. This can lead to a chicken-and-egg problem, in which the distribution of the data may not be known until after the data has been collected, but collecting the data in a privacy-preserving manner according to an adaptive differential privacy protocol necessitates that the distribution be approximately known, or otherwise the quality of the centrally collected data will suffer.
Techniques described herein classify the distribution type of a data set from a minimum number of samples of the data set, using a machine learning model. For example, the distribution type may be predicted from a small number of initial data samples. Subsequent collection of additional data samples can then be adjusted according to the predicted distribution type of the data set. A data anonymization protocol, such as an adaptive differential privacy protocol, may be adjusted, for instance, based on the predicted distribution of the data set. Depending on whether the distribution type is predicted locally at individual client devices or centrally at a server device, either no samples or just the small number of initial samples are transmitted to the server device without first being subjected to data anonymization.
The described techniques have been shown to be able to classify data set distribution type using as few as ten initial samples while still maintaining nearly 100% classification accuracy (as a specified accuracy). Even limiting the initial number of samples to as few as five has been shown to still result in greater than 90% accuracy (as a specified accuracy) in classifying the distribution type of the data set. The result is that the central collection of subsequent samples of the data set can be adjusted according to the classified distribution type to improve such collection. As noted, for instance, a data anonymization protocol such as an adaptive differential privacy protocol that governs central data collection can be adjusted in such a way that the quality of the data is improved in terms of the accuracy of subsequent analyses that can be performed on the data. More generally, therefore, the described techniques can classify data set distribution type using no more than ten initial samples while still maintaining a specified accuracy of 90% or higher. For example, such a minimum number of samples is sufficient for a machine learning model to classify distribution type with a specified accuracy.
The server device 104 centrally collects a data set 108 from the client devices 102 over the network 106. The data set 108 is made up of client-specific subsets 110 that respectively correspond to the client devices 102. Each client device 102 thus locally collects its client-specific subset 110 of the data set 108 and reports it over the network 106 to the server device 104. By receiving the reported client-specific subset 110 from each client device 102, the server device 104 therefore centrally collects the data set 108 in its entirety from the client devices 102. Each client device 102 may report its client-specific subset 110 as data samples (i.e., data values) of the subset 110 are generated or locally collected, or may periodically report groups of samples in batch form.
For example, a client device 102 may report the samples of its client-specific subset 110 whenever a specified number of samples has been locally collected. Therefore, during times in which more samples are locally collected, the client device 102 reports at higher frequency (i.e., more often) to the server device 104 as compared to during times in which fewer samples are locally collected. As another example, a client device 102 may report the samples of its client-specific subset 110 at the end of every period of time, which may be measurable in seconds, minutes, hours, days, and so on, regardless of the number of samples that were locally collected in that period of time. Therefore, during times in which more samples are locally collected, the client device 102 reports more samples to the server device 104 as compared to during times in which fewer samples are locally collected.
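A minimal sketch of these two reporting policies follows; the class and the `send` callable standing in for actual transmission to the server device are hypothetical names for illustration:

```python
import time

class BatchReporter:
    """Buffers locally collected samples and flushes either when a batch-size
    threshold is reached (count-triggered) or when a reporting period has
    elapsed (time-triggered). An illustrative sketch, not a full client."""

    def __init__(self, send, batch_size=None, period_seconds=None):
        self.send = send                    # callable that reports a list of samples
        self.batch_size = batch_size
        self.period_seconds = period_seconds
        self.buffer = []
        self.last_flush = time.monotonic()

    def collect(self, sample):
        self.buffer.append(sample)
        if self.batch_size is not None and len(self.buffer) >= self.batch_size:
            self.flush()                    # more samples -> more frequent reports
        elif (self.period_seconds is not None
              and time.monotonic() - self.last_flush >= self.period_seconds):
            self.flush()                    # fixed cadence, variable batch size

    def flush(self):
        if self.buffer:
            self.send(self.buffer)
            self.buffer = []
        self.last_flush = time.monotonic()
```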
The data set 108 made up of the client-specific subsets 110 of data samples 202 has a distribution 204 that includes and is defined by a distribution type 206 and distribution parameters 208. The distribution 204 may also be referred to as a statistical or probability distribution, and may be considered as a mathematical function specifying the values (i.e., the data samples 202) of the data set 108. The distribution type 206 is the type of the distribution 204 of the data set 108. For example, the distribution type may be normal, exponential, lognormal, uniform, Beta, binomial, negative binomial, Poisson, and so on.
The distribution parameters 208 are the parameters governing the distribution 204, in that the parameters 208 instantiate the distribution 204 as a particular distribution of the distribution type 206. That is, whereas the distribution type 206 generally specifies the distribution 204, the distribution type 206 together with the distribution parameters 208 completely specifies the distribution 204. The distribution parameters 208 are specific to the distribution type 206. For example, a normal distribution may have distribution parameters 208 of mean and standard deviation, whereas an exponential distribution may have a single distribution parameter 208 of rate or scale.
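The distinction can be illustrated with SciPy's frozen distributions, where choosing the family corresponds to the distribution type and the arguments correspond to the distribution parameters (the numeric values here are arbitrary examples):

```python
from scipy import stats

# The distribution type picks the family; the parameters instantiate it.
normal = stats.norm(loc=2.0, scale=5.0)     # type: normal; parameters: mean=2, std=5
exponential = stats.expon(scale=1.0 / 0.5)  # type: exponential; parameter: rate=0.5

print(normal.mean(), normal.std())          # 2.0 5.0
print(exponential.mean())                   # 1/rate = 2.0
```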
In general, in the architecture 200, the distribution type 206 of the distribution 204 of the data set 108 is classified (210) from the initial samples 202A of the client-specific subsets 110 of the data set 108. The distribution parameters 208 of the distribution 204 of the data set 108 that has the classified distribution type 206 are similarly calculated (212) from the initial samples 202A of the client-specific subsets 110. A data anonymization protocol 214, such as an adaptive differential privacy protocol, is adjusted (216) based on the classified distribution type 206 and the calculated distribution parameters 208 of the distribution 204 of the data set 108. Because the distribution type 206 and the distribution parameters 208 define the distribution 204, the data anonymization protocol 214 can thus be considered as being adjusted based on the distribution 204 of the data set 108.
The data anonymization protocol 214 governs (218) central collection 220 of the additional samples 202B of the client-specific subsets 110 (and in some cases, the initial samples 202A as well) from the respective client devices 102 by the server device 104.
The client device 102 then reports (310) the data samples 202 to the server device 104 by specifically reporting the number of samples 202 assigned to each data collection bin 304. In the case in which the data samples 202 are reported as they are locally collected, the bin 304 to which a data sample 202 has been assigned is reported. The server device 104 therefore does not receive the actual data samples 202—i.e., their actual values—but rather just the count of the samples 202 in each variably sized bin 304. In one implementation, the server device 104 may receive just a probabilistic response of the count of the samples 202 in each variably sized bin 304, such that the server device 104 cannot ascertain with 100% confidence whether any given sample 202 or any given count is accurate.
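One way such a probabilistic response can be realized is k-ary randomized response over the bin assignments. The sketch below is a simplified assumption of such a mechanism; the function name is hypothetical, and the actual protocol's randomization may differ:

```python
import numpy as np

def report_bins(samples, edges, epsilon, rng):
    """Assign each local sample to a variably sized bin and perturb the
    assignment via k-ary randomized response, so no reported assignment is
    certainly true; only noisy per-bin counts leave the client device."""
    k = len(edges) - 1
    true_bins = np.clip(np.searchsorted(edges, samples, side="right") - 1, 0, k - 1)
    p_truth = np.exp(epsilon) / (np.exp(epsilon) + k - 1)  # keep true bin w.p. p_truth
    reported = []
    for b in true_bins:
        if rng.random() < p_truth:
            reported.append(int(b))
        else:  # otherwise report one of the other k-1 bins uniformly at random
            others = [j for j in range(k) if j != b]
            reported.append(int(rng.choice(others)))
    return np.bincount(reported, minlength=k)  # server sees counts, never raw values

rng = np.random.default_rng(0)
samples = rng.exponential(10.0, size=1_000)
edges = np.quantile(samples, np.linspace(0, 1, 11))  # ten variably sized bins
print(report_bins(samples, edges, epsilon=1.0, rng=rng))
```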
Each client device 102 locally collects (402) initial samples 202A of its respective client-specific subset 110 of the data set 108, and transmits (404) the initial samples 202A to the server device 104, which responsively receives (406) the samples 202A. The server device 104 applies (408) a trained machine learning model 410 to the initial samples 202A received from the client devices 102 as a whole to predict the distribution type 206 of the data set 108. The machine learning model 410 is trained to classify the distribution type of an input data set from a minimum number of samples of the data set, including as few as five or ten such samples. Example training of such a machine learning model 410 is described in detail later in the detailed description.
The server device 104 further calculates (411) the distribution parameters 208 for the distribution 204 of the data set 108, from the initial samples 202A received from all the client devices 102. Which distribution parameters 208 are calculated depends on the distribution type 206 of the distribution 204. Therefore, the distribution parameters 208 are calculated after the distribution type 206 has been classified, or predicted, via usage of the trained machine learning model 410. The server device 104 transmits (412) both the classified distribution type 206 and the calculated distribution parameters 208 that define the distribution 204 of the data set 108 to each client device 102, which responsively receives (414) them.
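A minimal sketch of such type-dependent parameter calculation from the pooled initial samples; the supported types and the particular estimators used here are illustrative assumptions:

```python
import numpy as np

def calculate_parameters(distribution_type, initial_samples):
    """Calculate the parameters governing a distribution of the classified
    type; which parameters are computed depends on that type."""
    x = np.asarray(initial_samples, dtype=float)
    if distribution_type == "normal":
        return {"mean": x.mean(), "std": x.std(ddof=1)}
    if distribution_type == "exponential":
        return {"rate": 1.0 / x.mean()}
    if distribution_type == "uniform":
        return {"low": x.min(), "high": x.max()}
    raise ValueError(f"no estimator for type {distribution_type!r}")
```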
The client devices 102 in turn each adjust (416) the data anonymization protocol 214 governing central collection of the client-specific subsets 110 of the data set 108 at the server device 104 based on the received distribution type 206 and distribution parameters 208, as has been described. Each client device 102 collects (418) the additional samples 202B of its respective client-specific subset 110 of the data set 108, and reports (420) the samples 202B to the server device 104 in accordance with the adjusted data anonymization protocol 214, as has also been described. The server device 104 therefore centrally collects (422) the additional samples 202B from the client devices 102 as have been anonymized via the data anonymization protocol 214 that was adjusted according to the distribution 204 of the data set 108.
Each client device 102 locally collects (502) initial samples 202A of its respective client-specific subset 110 of the data set 108. Unlike in the method 400, each client device 102 itself applies (504) the trained machine learning model 410 to the initial samples 202A of its respective client-specific subset 110, to predict a distribution type 206′ of the data set 108.
The client devices 102 transmit (506) their respective predicted distribution types 206′ to the server device 104, which responsively receives (508) the distribution types 206′. The server device 104 chooses (510) one of the predicted distribution types 206′ to serve as the selected distribution type 206 that will govern the central collection of the data set 108 from all the client devices 102. For instance, the server device 104 may choose the selected distribution type 206 predicted by the highest number of client devices 102. For example, if there are fifty client devices 102, and forty each predict a distribution type 206′ of normal and the remaining ten each predict a distribution type 206′ of exponential, then the server device 104 may select the distribution type 206 as normal. In this way, the distribution type 206 is still classified on the basis of the initial samples 202A of all the client-specific subsets 110, since the distribution type 206 is selected from the distribution types 206′ classified based on the initial samples 202A of respective client-specific subsets 110.
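A sketch of this majority-vote selection (the helper name is hypothetical), reproducing the fifty-device example above:

```python
from collections import Counter

def select_distribution_type(predicted_types):
    """Choose the distribution type predicted by the most client devices."""
    return Counter(predicted_types).most_common(1)[0][0]

votes = ["normal"] * 40 + ["exponential"] * 10
print(select_distribution_type(votes))  # normal
```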
The server device 104 transmits (512) the selected distribution type 206 to the client devices 102, which each responsively receives (514) the selected distribution type 206. Each client device 102 calculates (516) the distribution parameters 208′ of the distribution 204 of the data set 108, where which parameters 208′ are calculated depends on the selected distribution type 206 of the distribution 204. Each client device 102 separately calculates the distribution parameters 208′ from the initial samples 202A of its respective client-specific subset 110. Therefore, the distribution parameters 208′ differ from the distribution parameters 208 of the distribution 204 of the data set 108 in that the parameters 208′ are calculated by each client device 102 just from the initial samples 202A of its respective client-specific subset 110, whereas the parameters 208 are effectively calculated from the initial samples 202A of all the subsets 110.
The client devices 102 each transmit (518) the calculated distribution parameters 208′ to the server device 104, which receives (520) the distribution parameters 208′ from all the client devices 102. The server device 104 in turn calculates (522) selected distribution parameters 208 from the calculated distribution parameters 208′ received from the client devices 102. For instance, the server device 104 may average respective types of the distribution parameters 208′ to calculate the selected distribution parameters 208. As an example, the server device 104 may average the mean value of the distribution 204 calculated by and received from each client device 102 to calculate the selected mean value, and may similarly average the standard deviation calculated by and received from each client device 102 to calculate the selected standard deviation. In this way, the distribution parameters 208 are still calculated on the basis of the initial samples 202A of all the client-specific subsets 110, since the distribution parameters 208 are calculated from the distribution parameters 208′ calculated from the initial samples 202A of respective client-specific subsets 110.
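A sketch of this per-parameter averaging across the client devices' locally calculated parameters (the helper name and example values are hypothetical):

```python
def select_distribution_parameters(per_client_parameters):
    """Average each parameter type across the clients' locally calculated
    parameters, e.g., the per-client means and standard deviations."""
    keys = per_client_parameters[0].keys()
    n = len(per_client_parameters)
    return {k: sum(p[k] for p in per_client_parameters) / n for k in keys}

clients = [{"mean": 1.9, "std": 5.2}, {"mean": 2.1, "std": 4.8}]
print(select_distribution_parameters(clients))  # {'mean': 2.0, 'std': 5.0}
```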
The server device 104 transmits (524) the selected distribution parameters 208 to the client devices 102, which each responsively receives (526) the distribution parameters 208. The client devices 102 in turn each adjust (528) the data anonymization protocol 214 governing central collection of the client-specific subsets 110 of the data set 108 at the server device 104, based on the distribution type 206 and the distribution parameters 208 that have been received, as has been described. Each client device 102 collects (530) the additional samples 202B of its respective client-specific subset 110 of the data set 108, and reports (532) the samples 202B to the server device 104 in accordance with the adjusted data anonymization protocol 214, as has also been described. The server device 104 therefore centrally collects (534) the additional samples 202B from the client devices 102 as have been anonymized via the data anonymization protocol 214 that was adjusted according to the distribution 204 of the data set 108.
The left part of the method 550 is again performed by the client devices 102, and the right part by the server device 104. The method 550, like the methods 400 and 500, may be implemented as program code stored on a non-transitory computer-readable data storage medium and executed by a processor of a respective device. Note that whereas the methods 400 and 500 are described in relation to all the client devices 102 acting in conjunction with the server device 104, the method 550 is described in relation to a single client device 102, which determines the distribution type 206 and the distribution parameters 208 independently of the server device 104 and of the other client devices 102.
A client device 102 locally collects (502) initial samples 202A of its respective client-specific subset 110 of the data set 108, as before. The client device 102 applies (504) the machine learning model 410 to its initial samples to classify the distribution type 206 of the data set 108. Unlike in the method 500, the client device 102 does not transmit its predicted distribution type to the server device 104 and does not receive a selected distribution type in return; rather, the distribution type 206 as classified by the client device 102 itself is used.
The client device 102 also calculates (516) the distribution parameters 208 of the distribution 204 of its respective client-specific subset 110 of the data set 108. Unlike in the method 500, the client device 102 does not transmit the calculated distribution parameters 208 to the server device 104 and does not receive selected distribution parameters in return; rather, the distribution parameters 208 are calculated just from the initial samples 202A of the client device 102's own client-specific subset 110.
The client device 102 thus adjusts (528) the data anonymization protocol 214 governing central collection of its respective client-specific subset 110 of the data set 108 at the server device 104 based on the distribution type 206 and the distribution parameters 208 as determined by the client device 102 itself, as opposed to as determined by the server device 104 in the method 500.
For example, there may be 50,000 training data sets 602. For each of a number of specified distribution types 604, there may be different specified combinations of distribution parameters 606. For example, there may be normal distributions that have means and standard deviations of 0 and 1, 2 and 5, and −5 and 1, respectively. The distribution type 604 of each training data set 602 may be randomly selected from the specified distribution types 604, and then the distribution parameters 606 selected from the specified combinations of distribution parameters 606 for the selected distribution type 604. The specified number 608 of samples may be identical for each training data set 602, or may be randomly selected from a specified range, such as between five and ten such samples.
The process 600 includes then generating (612) a distribution 610 for each training data set 602 having the specified distribution type 604 and the specified distribution parameters 606 of that training data set 602. From the distribution 610 for each training data set 602, the process 600 includes randomly selecting (616) the specified number 608 of samples 614 from that distribution 610. That is, randomly selected samples 614 for each training data set 602 are generated, where each randomly selected sample 614 is within the generated distribution 610 for that training data set 602. The number of such randomly selected samples 614 for each training data set 602 is the number 608 of samples specified for that training data set 602.
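A sketch of such synthetic training data generation follows, with an assumed (and much smaller) specification of distribution types and parameter combinations mirroring the examples above:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical specification: distribution families with candidate
# parameter combinations, e.g., normal distributions with (mean, std)
# of (0, 1), (2, 5), and (-5, 1).
SPECIFIED = {
    "normal": [{"loc": 0, "scale": 1}, {"loc": 2, "scale": 5}, {"loc": -5, "scale": 1}],
    "exponential": [{"scale": 1.0}, {"scale": 3.0}],
    "uniform": [{"low": 0.0, "high": 1.0}, {"low": -10.0, "high": 10.0}],
}

def make_training_sets(num_sets=50_000, min_samples=5, max_samples=10):
    sample_sets, dist_types = [], []
    for _ in range(num_sets):
        dist_type = rng.choice(list(SPECIFIED))              # random type
        params = SPECIFIED[dist_type][rng.integers(len(SPECIFIED[dist_type]))]
        n = int(rng.integers(min_samples, max_samples + 1))  # 5..10 samples
        if dist_type == "normal":
            x = rng.normal(params["loc"], params["scale"], size=n)
        elif dist_type == "exponential":
            x = rng.exponential(params["scale"], size=n)
        else:
            x = rng.uniform(params["low"], params["high"], size=n)
        sample_sets.append(x)
        dist_types.append(dist_type)
    return sample_sets, dist_types
```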
In another implementation, the training data sets 602 may include historical data previously collected from a number of client devices. In this case, the distribution 610 of the historical data of each training data set 602 is identified, where the distribution 610 of each set 602 is thus of a particular distribution type 604. Randomly selected samples 614 are then chosen from the historical data of each training data set 602.
The process 600 includes also labeling (618) the distribution type 604 of each training data set 602 to generate distribution type labels 620 respectively corresponding to the training data sets 602. Training data sets 602 having the same distribution type 604 but that have different distribution parameters 606 are nevertheless assigned the same labels. For example, two training data sets 602 that have the same normal distribution type 604 may have different distribution parameters 606. The first set 602 may have a mean of X1 and a standard deviation of Y1 as its distribution parameters 606, and the second set 602 may have a mean of X2 and a standard deviation of Y2 as its distribution parameters 606. Both training data sets 602 are assigned the same label, however, since they both have the same normal distribution type 604.
There may be a group of specified labels, including specific distribution labels that each correspond to a specific distribution type, and one non-specific distribution label that corresponds to distribution types other than the specific distribution type of any specific distribution label. For example, there may be specific distribution labels for more common distribution types 604, including normal, exponential, lognormal, uniform, Beta, binomial, negative binomial, and/or Poisson, etc. There may also be a non-specific distribution label that generally corresponds to all other, less common distribution types 604, such as Zeta, Gamma, and/or arcsine, etc. Therefore, each training data set 602 has a distribution type label 620 that is either a specific distribution label or the non-specific distribution label, depending on the actual distribution type 604 of the training data set 602 in question.
The process 600 includes then training (622) the machine learning model 410 from the randomly selected samples 614 of the training data sets 602 and the distribution type labels 620 assigned to the training data sets 602. Once trained, the machine learning model 410 can be used to classify the distribution type of an input data set from a minimum number of samples, as has been described. For instance, the trained machine learning model 410 can be applied (624) to initial samples 202A of one or multiple client-specific subsets 110 of the data set 108 to classify the distribution type 206 or 206′ of the data set 108, depending on whether the initial samples 202A are from all the subsets 110 or from just an individual subset 110.
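A sketch of such training from the numeric sample values follows, assuming a summary-statistics featurization to handle the variable sample count and a scikit-learn random-forest classifier; the label mapping also illustrates the specific versus non-specific labels described above. All names and modeling choices here are assumptions for illustration, not the method itself:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

SPECIFIC_TYPES = {"normal", "exponential", "lognormal", "uniform"}

def label_for(dist_type):
    """Specific label per common type; one shared label for all other types."""
    return dist_type if dist_type in SPECIFIC_TYPES else "other"

def featurize(samples):
    """Map a variable-length sample set (as few as 5-10 values) to a fixed
    feature vector of summary statistics; a hypothetical featurization."""
    x = np.sort(np.asarray(samples, dtype=float))
    return [x.mean(), x.std(), np.median(x), x.min(), x.max(),
            x[-1] - x[0], ((x - x.mean()) ** 3).mean()]

def train(sample_sets, dist_types):
    X = np.array([featurize(s) for s in sample_sets])
    y = np.array([label_for(t) for t in dist_types])
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X, y)
    return model

# Applying the trained model to a client's initial samples:
#   model.predict([featurize(initial_samples)])
```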
However, in the process 700, the machine learning model 410 is not trained directly from the numeric values of the randomly selected samples 614 of the training data sets 602. Rather, an image plot 702 plotting the randomly selected samples 614 is generated for each training data set 602, and the machine learning model 410 is trained from the image plots 702.
The process 700 again includes labeling (618) the distribution type 604 of each training data set 602 to generate distribution type labels 620 respectively corresponding to the training data sets 602. The process 700, similar to the process 600, then includes training (622) the machine learning model 410 using the distribution type labels 620, but from the image plots 702 of the randomly selected samples 614 of the training data sets 602 rather than from the numeric values of the samples 614 themselves.
Once trained, the machine learning model 410 can be used to classify the distribution type of an input data set from a minimum number of samples, similar to as has been described, but on the basis of an image plot of the samples of the input data set as opposed to on the basis of the actual numeric values of the samples. For instance, an image plot 706 may be generated (708) from the initial samples 202A of one or multiple client-specific subsets 110 of the data set 108, in the same manner in which the image plots 702 of the randomly selected samples 614 of the training data sets 602 were generated. The trained machine learning model 410 can then be applied (624) to the image plot 706 of the initial samples 202A to classify the distribution type 206 or 206′ of the data set 108.
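A sketch of generating such an image plot as a small grayscale array that an image classifier (e.g., a convolutional neural network) could consume; the histogram-bar rendering scheme here is an assumption for illustration:

```python
import numpy as np

def image_plot(samples, height=32, width=32):
    """Render a sample set as a small grayscale image of its histogram, so an
    image classifier can be trained on plots rather than on numeric values.
    The same procedure applies to training sets and to initial samples."""
    x = np.asarray(samples, dtype=float)
    hist, _ = np.histogram(x, bins=width)
    img = np.zeros((height, width), dtype=np.uint8)
    if hist.max() > 0:
        heights = (hist / hist.max() * (height - 1)).astype(int)
        for col, h in enumerate(heights):
            img[height - 1 - h:, col] = 255  # draw a bar from the bottom up
    return img
```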
For instance, the method 800 can include specifying the distribution type 604, the distribution parameters 606, and the number 608 of randomly selected samples of the training data sets 602 (804). The method 800 can include generating a distribution 610 of each training data set 602 having the distribution type 604 and the distribution parameters 606 of that data set 602 (806), and randomly selecting the specified number 608 of samples 614 from the generated distribution 610 of the data set (808). In one implementation, the method 800 can include, for each training data set 602, generating an image plotting the randomly selected samples 614 of that data set 602 (810).
The method 800 includes labeling the training data sets 602 with labels 620 (812). The label 620 of each training data set 602 corresponds to the distribution type 604 of that data set 602. The method 800 includes training a machine learning model 410 from the training data sets 602 and their labels 620 (814), where the machine learning model 410 classifies a distribution type of an input data set from a minimum number of initial samples of the input data set. The machine learning model 410 may be trained from the actual numeric values of the randomly selected samples 614 of the training data sets 602, or from the generated images plotting these samples 614.
The method 800 can include then applying the trained machine learning model to the initial samples of the input data set—either to their actual numeric values or to an image plotting the values—to predict the distribution type of the input data set (816). The method 800 can include calculating distribution parameters of a distribution of the predicted distribution type from the initial samples of the input data set (818), and adjusting centralized collection of additional samples of the input data set based on the predicted distribution type and the calculated distribution parameters (820). For instance, a data anonymization protocol 214, such as an adaptive differential privacy protocol having variably sized bins, governing the centralized collection of the additional samples may be adjusted. The method 800 can include then centrally collecting the additional samples as so adjusted (822).
The processing includes receiving a selected distribution type 206 from the server device 104 that chooses the selected distribution type 206 from the predicted distribution types 206′ received from the client device 102 and the other client devices 102 (910). The processing includes calculating distribution parameters 208′ of a distribution 204 of the selected distribution type 206 from the initial samples 202A of the client-specific subset 110 of the data set 108 (912). The processing includes transmitting the calculated distribution parameters 208′ to the server device 104 that also receives calculated distribution parameters 208′ of the distribution 204 of the selected distribution type 206 from the other client devices 102 as calculated from the initial samples 202A of the respective other client-specific subsets 110 of the data set 108 (914).
The processing includes receiving selected distribution parameters 208 from the server device 104 that determines the selected distribution parameters 208 from the calculated distribution parameters 208′ received from the client device 102 and the other client devices 102 (916). The processing includes adjusting a data anonymization protocol 214 governing centralized data collection by the server device 104 from the client device 102, based on the selected distribution type 206 and the distribution parameters 208 (918). The processing includes collecting additional samples 202B of the client-specific subset 110 of the data set 108 (920), and reporting the additional samples 202B of the client-specific subset 110 of the data set 108 in accordance with the data anonymization protocol 214 as has been adjusted (922). No collected samples 202 are thus transmitted from the client device 102 to the server device 104 without undergoing data anonymization in accordance with the data anonymization protocol 214.
The program code 1010 is executed to calculate distribution parameters 208 of a distribution 204 of the predicted distribution type 206 from the initial samples 202A received from the client devices 102 (1016). The program code 1010 is executed to transmit to each client device 102 the predicted distribution type 206 and the calculated distribution parameters 208 (1018), so that each client device 102 can adjust a data anonymization protocol 214 governing centralized data collection by the server device 104 from the client device 102. The program code 1010 is executed to centrally collect from the client devices 102 additional samples 202B of the respective client-specific subsets 110 of the data set 108 as reported by the client devices 102 in accordance with the adjusted data anonymization protocol 214 (1020). No additional samples 202B are thus centrally collected by the server device 104 without undergoing data anonymization in accordance with the data anonymization protocol 214 at the client devices 102.
Techniques have been described for classifying the distribution type of a data set from a minimal number of initial samples of the data set. The classified distribution type, along with distribution parameters calculated from the initial samples, can be used to adjust collection of samples of the data set. For example, a data anonymization protocol may be adjusted according to the classified distribution type and the calculated distribution parameters. As such, how the data is collected is adjusted, in a manner that provides for more accurate analyses of the resultantly collected anonymized data.