Embodiments of the present application relate to big data analytics technologies, and in particular, to an information determining method and apparatus.
Big data analytics typically refers to analysis of massive data. Big data may be summarized as four Vs: a large data volume, a high velocity, a great variety, and veracity. Compared with analysis of a small volume of data, big data analytics may provide a more accurate analysis result. Applying big data analytics may bring tremendous changes and value to society, the economy, and production.
Data convergence technology is an information processing technology that uses a computer to automatically analyze and combine, according to a rule, several pieces of observation information obtained in a time sequence, to complete required decision-making and assessment tasks. Cross-field data convergence therefore enables big data analytics to deliver greater value: converging data from two fields can yield more than the sum of what each field provides alone (an effect of 1+1>2).
It is assumed that instance data of a same user in different fields needs to be analyzed to estimate to-be-predicted attribute information of the user. The instance data herein may include a plurality of pieces of attribute information. For example, attribute information included in instance data of a user A in a mobile operator may include a name, a mobile number, consumption information, and the like, while attribute information included in instance data of the user A in a bank may include the name, the mobile number, a service type, an amount related to the service type, and the like. To-be-predicted attribute information, such as the gender or the age, of the user A may be estimated using the known attribute information. In a current big data analytics method, data convergence for the two fields is first implemented according to an identifier of the user A in the mobile operator and an identifier of the user A in the bank, where the identifiers may be public attribute information, such as the name, of the user A in the mobile operator and the bank. Implementing the data convergence only requires connecting or combining the data in plaintext. The converged data is then analyzed to estimate the to-be-predicted attribute information of the user.
The foregoing data analytics process based on the data convergence may be referred to as an information determining process. In the information determining process in the current method, implementing the data convergence only requires connecting or combining the data in plaintext. Consequently, confidentiality between data in different fields cannot be ensured.
Embodiments of the present application provide an information determining method and apparatus, to more accurately determine to-be-predicted information by converging data in a plurality of fields while ensuring confidentiality between data in different fields.
According to a first aspect, an embodiment of the present application provides an information determining method. The method is based on N fields, N is an integer greater than or equal to 2, each of the fields includes instance data of a plurality of users, each piece of instance data includes a plurality of pieces of attribute information, at least one piece of public attribute information exists in instance data of a same user in the N fields, the instance data of the same user in the N fields constitutes one sample, a feature vector of the sample is generated based on some or all of known attribute information included in the sample, and a feature vector of each sample includes a same quantity of pieces of known attribute information. The method may include estimating an association relationship between a feature vector and to-be-predicted attribute information of a to-be-labeled sample, where the to-be-labeled sample may be a sample including at least one piece of to-be-predicted attribute information. The method may also include decomposing the association relationship into N sub-association relationships that are in a one-to-one correspondence to the N fields, and decomposing the feature vector of each sample into feature subvectors that are in a one-to-one correspondence to the N fields. The method may also include obtaining a first value obtained by substituting a feature subvector of each labeled sample in each field into a corresponding sub-association relationship. 
The method may further include: calculating, based on the public attribute information, a sum of first values obtained in the N fields for the same user to obtain estimated attribute information, where the estimated attribute information is attribute information that is in a labeled sample, that corresponds to the to-be-predicted attribute information, and that is estimated based on the association relationship and a feature vector of the labeled sample, and the labeled sample is a sample in which all attribute information included is known attribute information. The method may also include determining the association relationship based on estimated attribute information of all labeled samples and known attribute information corresponding to the estimated attribute information; and determining the to-be-predicted attribute information of the to-be-labeled sample based on the determined association relationship and the feature vector of the to-be-labeled sample.
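The decomposition and summation steps above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the association relationship is linear (a weighted sum of the feature vector), so that it splits naturally into one sub-association relationship per field. The field names, user keys, feature values, and weights are all hypothetical and do not come from the application.

```python
from typing import Dict, List

# Hypothetical per-field data: public attribute (user key) -> feature subvector.
operator_features: Dict[str, List[float]] = {"alice": [1.0, 2.0], "bob": [0.5, 1.5]}
bank_features: Dict[str, List[float]] = {"alice": [3.0], "bob": [4.0]}

# Assumed linear association relationship f(x) = w . x, split per field.
operator_weights = [0.2, 0.1]
bank_weights = [0.5]


def first_values(features: Dict[str, List[float]], weights: List[float]) -> Dict[str, float]:
    """Evaluate one field's sub-association relationship locally, inside that field."""
    return {key: sum(w * x for w, x in zip(weights, vec)) for key, vec in features.items()}


def estimate(per_field_values: List[Dict[str, float]]) -> Dict[str, float]:
    """Sum the first values of the same user across fields, matched by the public key."""
    shared = set.intersection(*(set(values) for values in per_field_values))
    return {key: sum(values[key] for values in per_field_values) for key in shared}


estimated = estimate([
    first_values(operator_features, operator_weights),
    first_values(bank_features, bank_weights),
])
```

Only the per-field first values leave each field in this sketch; the feature subvectors themselves are never exchanged.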
In the method, the sum of the first values obtained in the N fields for the same user may be calculated based on the public attribute information to obtain estimated attribute information. That is, a calculation result may be obtained from each field without a need to know attribute information of each field. The calculation result of the same user is further calculated using the public attribute information, and finally the to-be-predicted attribute information is determined. In this way, confidentiality between data in the different fields is ensured.
Further, the calculating, based on the public attribute information, a sum of first values obtained in the N fields for the same user to obtain estimated attribute information includes: calculating, based on encrypted public attribute information, the sum of the first values obtained in the N fields for the same user to obtain the estimated attribute information, where the public attribute information is encrypted using a same encryption algorithm in the N fields.
All the fields may use the same encryption algorithm. Therefore, the encrypted public attribute information of each field may be the same. In the method, data in all of the N fields may not need to be converged, as long as matching of the data in the N fields may be implemented based on the encrypted public attribute information, so that the confidentiality between the data can be improved.
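As a rough illustration of matching on encrypted public attribute information, the sketch below uses a keyed hash (HMAC-SHA256 from the Python standard library) as a stand-in for the "same encryption algorithm". The application does not name an algorithm, so this choice, the shared key, and the sample values are assumptions. What matters for matching is that the same public attribute yields the same token in every field.

```python
import hashlib
import hmac

SHARED_KEY = b"example-shared-key"  # hypothetical key agreed on by all fields


def protect(public_attribute: str) -> str:
    """Map a public attribute to a token; identical input gives an identical token in every field."""
    return hmac.new(SHARED_KEY, public_attribute.encode("utf-8"), hashlib.sha256).hexdigest()


# Each field publishes only tokens and first values, never the plaintext attribute.
operator_first_values = {protect("User A"): 0.4, protect("User B"): 0.25}
bank_first_values = {protect("User A"): 1.5, protect("User C"): 0.7}


def sum_matched(*fields):
    """Sum first values for tokens that appear in every field."""
    common = set(fields[0]).intersection(*fields[1:])
    return {token: sum(field[token] for field in fields) for token in common}


estimated = sum_matched(operator_first_values, bank_first_values)
```

Users present in only one field (User B and User C here) are not matched and therefore produce no estimate.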
In an optional implementation, the determining the association relationship based on estimated attribute information of all labeled samples and known attribute information corresponding to the estimated attribute information may include: for each labeled sample, calculating a first difference between estimated attribute information and known attribute information corresponding to the estimated attribute information; and determining the association relationship by making a sum of the first differences corresponding to all the labeled samples minimum.
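One way to read "making a sum of the first differences minimum" is as a least-squares criterion. The sketch below assumes the first difference is a squared difference, which the text does not specify, and simply picks the candidate association relationship whose estimates give the smallest sum. The candidate names and numbers are hypothetical.

```python
def sum_of_first_differences(estimated, known):
    """Sum of squared differences over labeled samples (squaring is an assumption)."""
    return sum((estimated[key] - known[key]) ** 2 for key in known)


# Known attribute information of two labeled samples.
known = {"alice": 2.0, "bob": 2.0}

# Estimated attribute information produced by two hypothetical candidate relationships.
candidates = {
    "weights_a": {"alice": 1.9, "bob": 2.25},
    "weights_b": {"alice": 1.0, "bob": 3.0},
}

best = min(candidates, key=lambda name: sum_of_first_differences(candidates[name], known))
```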
In another optional implementation, the method may further include: obtaining similarity weights between to-be-labeled samples in each field, where the similarity weight may be used for measuring a similarity between the instance data; obtaining a second value obtained by substituting a feature subvector of each to-be-labeled sample in each field into a corresponding sub-association relationship; and calculating second differences between the second values of the to-be-labeled samples in each field, and calculating a sum of products of all second differences in each field and corresponding similarity weights. In this case, the determining the association relationship based on estimated attribute information of all labeled samples and known attribute information corresponding to the estimated attribute information may include: for each labeled sample, calculating a first difference between estimated attribute information and known attribute information corresponding to the estimated attribute information; and determining the association relationship based on a sum of the first differences corresponding to all the labeled samples and the sum of the products of all the second differences in each field and the corresponding similarity weights.
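The similarity-weighted term can be sketched as follows. The sketch assumes squared second differences and a simple additive combination with a trade-off factor; neither detail is stated in the text, and the sample identifiers, second values, and weights are hypothetical.

```python
def similarity_term(second_values, similarity_weights):
    """Within one field: sum over sample pairs of weight * squared second difference."""
    return sum(
        weight * (second_values[i] - second_values[j]) ** 2
        for (i, j), weight in similarity_weights.items()
    )


def objective(first_difference_sum, per_field_terms, trade_off=1.0):
    """Combine the labeled-sample term with the per-field similarity terms (assumed additive)."""
    return first_difference_sum + trade_off * sum(per_field_terms)


# Second values of to-be-labeled samples and pairwise similarity weights, per field.
operator_second = {"u1": 0.4, "u2": 0.5, "u3": 1.4}
operator_similarity = {("u1", "u2"): 0.9, ("u2", "u3"): 0.1}
bank_second = {"u1": 1.0, "u2": 1.0, "u3": 2.0}
bank_similarity = {("u1", "u2"): 0.8, ("u2", "u3"): 0.2}

operator_term = similarity_term(operator_second, operator_similarity)
bank_term = similarity_term(bank_second, bank_similarity)
total = objective(0.0725, [operator_term, bank_term])
```

Under these assumptions, the term penalizes an association relationship that assigns very different values to samples a field considers similar, and each field can compute its own term without exposing its instance data.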
The association relationship between the feature vector and the to-be-predicted attribute information of the to-be-labeled sample may be relatively accurately determined using the foregoing two optional implementations.
Further, after the determining the association relationship based on estimated attribute information of all labeled samples and known attribute information corresponding to the estimated attribute information, the method may further include: correcting the association relationship, and using the corrected association relationship as an estimated new association relationship; and stopping when a quantity of corrections exceeds a preset value; or stopping when all association relationships converge. The correction process may be a learning process, and the association relationship may be made more accurate through constant learning.
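The correction loop with its two stopping conditions might look like the sketch below. Gradient descent on a squared-difference criterion is used here as an assumed correction rule; the text does not prescribe one. The learning rate, tolerance, and training data are hypothetical.

```python
def fit(samples, labels, learning_rate=0.05, max_corrections=1000, tolerance=1e-9):
    """Repeatedly correct the weights of an assumed linear association relationship."""
    weights = [0.0] * len(samples[0])
    correction = 0
    for correction in range(1, max_corrections + 1):
        # Gradient of the sum of squared first differences over the labeled samples.
        gradient = [0.0] * len(weights)
        for x, y in zip(samples, labels):
            error = sum(w * xi for w, xi in zip(weights, x)) - y
            for d, xi in enumerate(x):
                gradient[d] += 2 * error * xi
        updated = [w - learning_rate * g for w, g in zip(weights, gradient)]
        change = max(abs(u - w) for u, w in zip(updated, weights))
        weights = updated  # the corrected relationship becomes the new estimate
        if change < tolerance:  # stop when the relationship has converged
            break
    return weights, correction  # otherwise stop at the preset number of corrections


samples = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
labels = [2.0, 3.0, 5.0]
weights, corrections_used = fit(samples, labels)
```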
According to a second aspect, an embodiment of the present application provides an information determining method. The method may be based on N fields, where N is an integer greater than or equal to 2, each of the fields includes instance data of a plurality of users, each piece of instance data includes a plurality of pieces of attribute information, at least one piece of public attribute information exists in instance data of a same user in the N fields, the instance data of the same user in the N fields constitutes one sample, a feature vector of the sample is generated based on some or all of known attribute information included in the sample, and a feature vector of each sample includes a same quantity of pieces of known attribute information. The method includes: estimating a probability distribution function of to-be-predicted attribute information according to a feature vector of a to-be-labeled sample, where the to-be-labeled sample is a sample including at least one piece of to-be-predicted attribute information. The method may also include decomposing the probability distribution function into N subfunctions that are in a one-to-one correspondence to the N fields, and decomposing the feature vector of each sample into feature subvectors that are in a one-to-one correspondence to the N fields. The method may also include obtaining a first value obtained by substituting a feature subvector of each labeled sample in each field into a corresponding subfunction. The method may further include: calculating, based on the public attribute information, a sum of first values obtained in the N fields for the same user to obtain a probability that attribute information that is in a labeled sample and that corresponds to the to-be-predicted attribute information is particular attribute information, where the labeled sample is a sample in which all attribute information included is known attribute information.
The method may also include determining the probability distribution function according to the probability that the attribute information that is in the labeled sample and that corresponds to the to-be-predicted attribute information is the particular attribute information and whether the attribute information is actually the particular attribute information. The method may also include determining the to-be-predicted attribute information of the to-be-labeled sample based on the determined probability distribution function and the feature vector of the to-be-labeled sample.
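For the probability-based variant, a minimal sketch is given below. It assumes that each field's subfunction returns one score per candidate value of the attribute, and that the summed scores are normalized with a softmax to form probabilities. The normalization step is an assumption added for illustration; the text only states that the sum of the first values yields the probability. The gender labels follow the earlier example, and the scores are hypothetical.

```python
import math

# Hypothetical first values for one user: one score per candidate attribute value, per field.
operator_scores = {"male": 0.2, "female": 0.9}
bank_scores = {"male": 0.1, "female": 0.4}


def combine(*fields):
    """Sum per-field scores for each candidate value, then normalize (softmax is assumed)."""
    summed = {label: sum(field[label] for field in fields) for label in fields[0]}
    total = sum(math.exp(score) for score in summed.values())
    return {label: math.exp(score) / total for label, score in summed.items()}


probabilities = combine(operator_scores, bank_scores)
predicted = max(probabilities, key=probabilities.get)
```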
In the process, the sum of the first values obtained in the N fields for the same user may be calculated based on the public attribute information to obtain the probability that the attribute information that is in the labeled sample and that corresponds to the to-be-predicted attribute information is the particular attribute information. That is, a calculation result may be obtained from each field without a need to know attribute information of each field. The calculation result of the same user may be further calculated using the public attribute information, and finally the to-be-predicted attribute information is determined. In this way, confidentiality between data in the different fields may be ensured.
Further, the calculating, based on the public attribute information, a sum of first values obtained in the N fields for the same user to obtain a probability that attribute information that is in a labeled sample and that corresponds to the to-be-predicted attribute information is particular attribute information may include: calculating, based on encrypted public attribute information, the sum of the first values obtained in the N fields for the same user to obtain the probability that the attribute information that is in the labeled sample and that corresponds to the to-be-predicted attribute information is the particular attribute information, where the public attribute information may be encrypted using a same encryption algorithm in the N fields.
All the fields may use the same encryption algorithm. Therefore, the encrypted public attribute information of each field may be the same. In the method, data in all of the N fields may not need to be converged, as long as matching of the data in the N fields is implemented based on the encrypted public attribute information, so that the confidentiality between the data can be improved.
In an optional implementation, the determining the probability distribution function according to the probability that the attribute information that is in the labeled sample and that corresponds to the to-be-predicted attribute information is the particular attribute information and whether the attribute information is actually the particular attribute information may include: when the attribute information that is in the labeled sample and that corresponds to the to-be-predicted attribute information corresponds to m pieces of particular attribute information, where m is a positive integer greater than or equal to 2, for each piece of the particular attribute information of each labeled sample, when the attribute information corresponding to the to-be-predicted attribute information is actually the particular attribute information, calculating a first difference between the probability and 1; otherwise, calculating a first difference between the probability and 0; and determining the probability distribution function by making a sum of all first differences minimum.
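The first differences against 1 and 0 described above can be sketched as follows, again assuming squared differences, which the text does not specify. The labeled samples and probabilities are hypothetical.

```python
def sum_of_first_differences(labeled_samples):
    """For every candidate value of every labeled sample, compare the probability with 1 or 0."""
    total = 0.0
    for probabilities, actual in labeled_samples:
        for label, probability in probabilities.items():
            # Target is 1 when the attribute is actually this value, otherwise 0.
            target = 1.0 if label == actual else 0.0
            total += (probability - target) ** 2  # squaring is an assumption
    return total


labeled_samples = [
    ({"male": 0.25, "female": 0.75}, "female"),
    ({"male": 0.5, "female": 0.5}, "male"),
]
loss = sum_of_first_differences(labeled_samples)
```

A probability distribution function that assigns probabilities closer to 1 for the actual values gives a smaller sum, which is the quantity to be minimized.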
In another optional implementation, the method may further include: obtaining similarity weights between to-be-labeled samples in each field, where the similarity weight is used for measuring a similarity between the instance data; obtaining a second value obtained by substituting a feature subvector of each to-be-labeled sample in each field into a corresponding subfunction; and calculating second differences between the second values of the to-be-labeled samples in each field, and calculating a sum of products of all second differences in each field and corresponding similarity weights. In this case, the determining the probability distribution function according to the probability that the attribute information that is in the labeled sample and that corresponds to the to-be-predicted attribute information is the particular attribute information and whether the attribute information is actually the particular attribute information includes: for each piece of the particular attribute information of each labeled sample, when the attribute information corresponding to the to-be-predicted attribute information is actually the particular attribute information, calculating a first difference between the probability and 1; otherwise, calculating a first difference between the probability and 0; and determining the probability distribution function based on a sum of the first differences corresponding to all the labeled samples and the sum of the products of all the second differences in each field and the corresponding similarity weights.
The probability distribution function of the to-be-predicted attribute information may be relatively accurately determined using the foregoing two optional implementations.
Further, after the determining the probability distribution function according to the probability that the attribute information that is in the labeled sample and that corresponds to the to-be-predicted attribute information is the particular attribute information and whether the attribute information is actually the particular attribute information, the method may further include: correcting the probability distribution function, and using the corrected probability distribution function as an estimated new probability distribution function; and stopping when a quantity of corrections exceeds a preset value; or stopping when all probability distribution functions converge. The correction process may be a learning process, and the probability distribution function may be made more accurate through constant learning.
The following describes an information determining apparatus according to an embodiment of the present application. The apparatus portion corresponds to the foregoing method, and technical effects of corresponding content are the same. Details are not described herein again.
According to a third aspect, an embodiment of the present application may provide an information determining apparatus. The apparatus may be based on N fields, wherein N is an integer greater than or equal to 2, each of the fields includes instance data of a plurality of users, each piece of instance data includes a plurality of pieces of attribute information, at least one piece of public attribute information exists in instance data of a same user in the N fields, the instance data of the same user in the N fields constitutes one sample, a feature vector of the sample is generated based on some or all of known attribute information included in the sample, a feature vector of each sample includes a same quantity of pieces of known attribute information. The apparatus may include: an estimation module, configured to estimate an association relationship between a feature vector and to-be-predicted attribute information of a to-be-labeled sample, where the to-be-labeled sample is a sample including at least one piece of to-be-predicted attribute information. The apparatus may also include a decomposition module, configured to: decompose the association relationship into N sub-association relationships that are in a one-to-one correspondence to the N fields, and decompose the feature vector of each sample into feature subvectors that are in a one-to-one correspondence to the N fields. The apparatus may also include an obtaining module, configured to obtain a first value obtained by substituting a feature subvector of each labeled sample in each field into a corresponding sub-association relationship. 
The apparatus may further include: a calculation module, configured to calculate, based on the public attribute information, a sum of first values obtained in the N fields for the same user to obtain estimated attribute information, where the estimated attribute information is attribute information that is in a labeled sample, that corresponds to the to-be-predicted attribute information, and that is estimated based on the association relationship and a feature vector of the labeled sample, and the labeled sample is a sample in which all attribute information included is known attribute information. The apparatus may also include a determining module, configured to determine the association relationship based on estimated attribute information of all labeled samples and known attribute information corresponding to the estimated attribute information. The determining module may be further configured to determine the to-be-predicted attribute information of the to-be-labeled sample based on the determined association relationship and the feature vector of the to-be-labeled sample.
Further, the calculation module is specifically configured to calculate, based on encrypted public attribute information, the sum of the first values obtained in the N fields for the same user to obtain the estimated attribute information, where the public attribute information may be encrypted using a same encryption algorithm in the N fields.
Optionally, the determining module may be configured to: for each labeled sample, calculate a first difference between estimated attribute information and known attribute information corresponding to the estimated attribute information, and determine the association relationship by making a sum of the first differences corresponding to all the labeled samples minimum.
Optionally, the obtaining module may be further configured to: obtain similarity weights between to-be-labeled samples in each field, where the similarity weight is used for measuring a similarity between the instance data, and obtain a second value obtained by substituting a feature subvector of each to-be-labeled sample in each field into a corresponding sub-association relationship. The calculation module is further configured to: calculate second differences between the second values of the to-be-labeled samples in each field, and calculate a sum of products of all second differences in each field and corresponding similarity weights. The determining module may be configured to: for each labeled sample, calculate a first difference between estimated attribute information and known attribute information corresponding to the estimated attribute information, and determine the association relationship based on a sum of the first differences corresponding to all the labeled samples and the sum of the products of all the second differences in each field and the corresponding similarity weights.
Still further, the apparatus may further include: a correction module, configured to: correct the association relationship, and use the corrected association relationship as an estimated new association relationship, and stop when a quantity of corrections exceeds a preset value, or stop when all association relationships converge.
According to a fourth aspect, an embodiment of the present application may provide an information determining apparatus. The apparatus may be based on N fields, where N is an integer greater than or equal to 2, each of the fields includes instance data of a plurality of users, each piece of instance data includes a plurality of pieces of attribute information, at least one piece of public attribute information exists in instance data of a same user in the N fields, the instance data of the same user in the N fields constitutes one sample, a feature vector of the sample is generated based on some or all of known attribute information included in the sample, and a feature vector of each sample includes a same quantity of pieces of known attribute information. The apparatus may include: an estimation module, configured to estimate a probability distribution function of to-be-predicted attribute information according to a feature vector of a to-be-labeled sample, where the to-be-labeled sample is a sample including at least one piece of to-be-predicted attribute information. The apparatus may also include a decomposition module, configured to: decompose the probability distribution function into N subfunctions that are in a one-to-one correspondence to the N fields, and decompose the feature vector of each sample into feature subvectors that are in a one-to-one correspondence to the N fields. The apparatus may also include an obtaining module, configured to obtain a first value obtained by substituting a feature subvector of each labeled sample in each field into a corresponding subfunction.
The apparatus may further include: a calculation module, configured to calculate, based on the public attribute information, a sum of first values obtained in the N fields for the same user to obtain a probability that attribute information that is in a labeled sample and that corresponds to the to-be-predicted attribute information is particular attribute information, where the labeled sample is a sample in which all attribute information included is known attribute information. The apparatus may also include a determining module, configured to determine the probability distribution function according to the probability that the attribute information that is in the labeled sample and that corresponds to the to-be-predicted attribute information is the particular attribute information and whether the attribute information is actually the particular attribute information. The determining module may be further configured to determine the to-be-predicted attribute information of the to-be-labeled sample based on the determined probability distribution function and the feature vector of the to-be-labeled sample.
Further, the calculation module may be configured to calculate, based on encrypted public attribute information, the sum of the first values obtained in the N fields for the same user to obtain the probability that the attribute information that is in the labeled sample and that corresponds to the to-be-predicted attribute information is the particular attribute information, where the public attribute information is encrypted using a same encryption algorithm in the N fields.
Optionally, the determining module may be configured to: when the attribute information that is in the labeled sample and that corresponds to the to-be-predicted attribute information corresponds to m pieces of particular attribute information, where m is a positive integer greater than or equal to 2, for each piece of the particular attribute information of each labeled sample, when the attribute information corresponding to the to-be-predicted attribute information is actually the particular attribute information, calculate a first difference between the probability and 1; otherwise, calculate a first difference between the probability and 0; and determine the probability distribution function by making a sum of all first differences minimum.
Optionally, the obtaining module may be further configured to: obtain similarity weights between to-be-labeled samples in each field, where the similarity weight is used for measuring a similarity between the instance data; and obtain a second value obtained by substituting a feature subvector of each to-be-labeled sample in each field into a corresponding subfunction. The calculation module may be further configured to: calculate second differences between the second values of the to-be-labeled samples in each field, and calculate a sum of products of all second differences in each field and corresponding similarity weights. The determining module may be configured to: for each piece of the particular attribute information of each labeled sample, when the attribute information corresponding to the to-be-predicted attribute information is actually the particular attribute information, calculate a first difference between the probability and 1; otherwise, calculate a first difference between the probability and 0; and determine the probability distribution function based on a sum of the first differences corresponding to all the labeled samples and the sum of the products of all the second differences in each field and the corresponding similarity weights.
Still further, the apparatus may further include: a correction module, configured to: correct the probability distribution function, and use the corrected probability distribution function as an estimated new probability distribution function; and stop when a quantity of corrections exceeds a preset value; or stop when all probability distribution functions converge.
According to a fifth aspect, an embodiment of the present application may provide an information determining apparatus. The apparatus is based on N fields, N is an integer greater than or equal to 2, each of the fields includes instance data of a plurality of users, each piece of instance data includes a plurality of pieces of attribute information, at least one piece of public attribute information exists in instance data of a same user in the N fields, the instance data of the same user in the N fields constitutes one sample, a feature vector of the sample is generated based on some or all of known attribute information included in the sample, and a feature vector of each sample includes a same quantity of pieces of known attribute information. The information determining apparatus may include: a processor, and a memory configured to store an executable instruction of the processor. The processor executes the executable instruction stored in the memory, so that the information determining apparatus performs the method according to the first aspect and the subdivisions thereof. For example, the information determining apparatus may perform the following method steps: estimating an association relationship between a feature vector and to-be-predicted attribute information of a to-be-labeled sample, where the to-be-labeled sample is a sample including at least one piece of to-be-predicted attribute information; decomposing the association relationship into N sub-association relationships that are in a one-to-one correspondence to the N fields, and decomposing the feature vector of each sample into feature subvectors that are in a one-to-one correspondence to the N fields; and obtaining a first value obtained by substituting a feature subvector of each labeled sample in each field into a corresponding sub-association relationship.
The information determining apparatus may further perform the following method steps: calculating, based on the public attribute information, a sum of first values obtained in the N fields for the same user to obtain estimated attribute information, where the estimated attribute information is attribute information that is in a labeled sample, that corresponds to the to-be-predicted attribute information, and that is estimated based on the association relationship and a feature vector of the labeled sample, and the labeled sample is a sample in which all attribute information included is known attribute information; determining the association relationship based on estimated attribute information of all labeled samples and known attribute information corresponding to the estimated attribute information; and determining the to-be-predicted attribute information of the to-be-labeled sample based on the determined association relationship and the feature vector of the to-be-labeled sample.
According to a sixth aspect, an embodiment of the present application may provide an information determining apparatus. The apparatus is based on N fields, N is an integer greater than or equal to 2, each of the fields includes instance data of a plurality of users, each piece of instance data includes a plurality of pieces of attribute information, at least one piece of public attribute information exists in instance data of a same user in the N fields, the instance data of the same user in the N fields constitutes one sample, a feature vector of the sample is generated based on some or all of known attribute information included in the sample, and a feature vector of each sample includes a same quantity of pieces of known attribute information. The information determining apparatus may include: a processor, and a memory configured to store an executable instruction of the processor. The processor executes the executable instruction stored in the memory, so that the information determining apparatus performs the method according to the second aspect and the subdivisions thereof. For example, the information determining apparatus may perform the following method steps: estimating a probability distribution function of to-be-predicted attribute information according to a feature vector of a to-be-labeled sample, where the to-be-labeled sample is a sample including at least one piece of to-be-predicted attribute information; decomposing the probability distribution function into N subfunctions that are in a one-to-one correspondence to the N fields, and decomposing the feature vector of each sample into feature subvectors that are in a one-to-one correspondence to the N fields; and obtaining a first value obtained by substituting a feature subvector of each labeled sample in each field into a corresponding subfunction. 
The information determining apparatus may further perform the following method steps: calculating, based on the public attribute information, a sum of first values obtained in the N fields for the same user to obtain a probability that attribute information that is in a labeled sample and that corresponds to the to-be-predicted attribute information is particular attribute information, where the labeled sample is a sample in which all attribute information included is known attribute information; determining the probability distribution function according to the probability that the attribute information that is in the labeled sample and that corresponds to the to-be-predicted attribute information is the particular attribute information and whether the attribute information is actually the particular attribute information; and determining the to-be-predicted attribute information of the to-be-labeled sample based on the determined probability distribution function and the feature vector of the to-be-labeled sample.
Embodiments of the present application may provide the information determining method and apparatus. The method may include: estimating the association relationship between the feature vector and the to-be-predicted attribute information of the to-be-labeled sample. The method may also include decomposing the association relationship into the N sub-association relationships that are in a one-to-one correspondence to the N fields, and decomposing the feature vector of each sample into the feature subvectors that are in a one-to-one correspondence to the N fields. The method may also include obtaining the first value obtained by substituting the feature subvector of each labeled sample in each field into the corresponding sub-association relationship. The method may also include calculating, based on the public attribute information, the sum of first values obtained in the N fields for the same user to obtain the estimated attribute information, where the estimated attribute information is the attribute information that is in the labeled sample, that corresponds to the to-be-predicted attribute information, and that is estimated based on the association relationship and the feature vector of the labeled sample. The method may also include determining the association relationship based on the estimated attribute information of all the labeled samples and the known attribute information corresponding to the estimated attribute information. In the process, the sum of the first values obtained in the N fields for the same user may be calculated based on the public attribute information to obtain the estimated attribute information. That is, a calculation result may be obtained from each field without a need to know attribute information of each field. The calculation result of the same user may be further calculated using the public attribute information, and finally the to-be-predicted attribute information is determined. 
In this way, confidentiality between data in the different fields is ensured.
To describe the technical solutions in the embodiments of the present application more clearly, the following briefly describes the accompanying drawings required for describing the embodiments.
To make the objectives, technical solutions, and advantages of the embodiments of the present application clearer, the following describes the technical solutions in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application.
To resolve a problem in a current system that a data analytics process based on data convergence cannot ensure confidentiality between data in different fields, the present application provides an information determining method and apparatus.
For example, the method may be assumed to relate to two fields: mobile operators and banks. In such a scenario, instance data of a user A in a mobile operator may be: {Zhang San, 139***0000, a mobile phone fee of 100 RMB for November, including a 50 RMB call charge and a 50 RMB traffic fee}, and instance data of the user A in a bank may be: {Zhang San, 133***0000, a service type: a financing product 1, the financing product 1 relating to an amount of 80 thousand RMB, male, age}. All instance data of the user A may constitute a to-be-labeled sample, and the related age may be to-be-predicted attribute information.
Additionally, in the foregoing example, instance data of a user B in a mobile operator may be: {Li Si, 139***0001, a mobile phone fee of 78 RMB for November, including a 30 RMB call charge and a 48 RMB traffic fee}, and instance data of the user B in a bank may be: {Li Si, 139***0000, a service type: a financing product 2, the financing product 2 relating to an amount of 50 thousand RMB, female, 40}. All instance data of the user B may constitute a labeled sample.
Finally, the instance data in the foregoing example may include instance data of a user M in a mobile operator, which may be defined as: {Wang Wu, 139***0010, a mobile phone fee of 50 RMB for November, including a 30 RMB call charge and a 20 RMB traffic fee}, and may further include instance data of the user M in a bank, which may be defined as: {Wang Wu, 139***0010, a service type: a deposit, relating to an amount of 2000 RMB, female, 50}. All instance data of the user M may constitute a labeled sample.
In the depicted scenario, a feature vector may be {a name, a mobile number, consumption information, a service type, an amount related to the service type}, and to-be-predicted attribute information of a to-be-labeled sample may be determined based on an internal data relationship of a labeled sample and known attribute information of the to-be-labeled sample.
As shown in
S101: Estimate an association relationship between a feature vector and to-be-predicted attribute information of a to-be-labeled sample.
First, it may be determined that a larger value of consumption information indicates a younger age, that is, consumption information may be inversely proportional to an age. Next, when a service type tends to be a financing product, then it may be determined that most ages may range from 30 to 45. For example, when the age is older than 40, a larger amount related to the service type may indicate a younger age. In another example, when the age is younger than 40, a larger amount related to the service type may indicate an older age. That is, an amount related to the service type may have a quadratic function relationship with an age.
Therefore, it may be estimated that the association relationship is $F(X_i) = -a x_{1i} + b x_{21i} + c x_{22i} + d x_{23i} - e (x_{3i} - 40)^2 + f$, where $F$ indicates an association relationship. The feature vector is $X_i = (x_{1i}, x_{21i}, x_{22i}, x_{23i}, x_{3i})$, where $x_{1i}$ indicates consumption information of a user i in a mobile operator, $x_{21i}$ indicates that a service type of the user i in a bank is the financing product 1, $x_{22i}$ indicates that the service type of the user i in the bank is the financing product 2, $x_{23i}$ indicates that the service type is deposit, and $x_{3i}$ indicates the amount related to the service type. Factors a, b, c, d, e, and f may all be positive integers. In some embodiments, the association relationship may account for more service types. Those of skill in the art will appreciate that the formulas described herein may be amended to account for additional service types. The foregoing formulas use three service types as an example only. In one example, an age of the user i that purchases the financing product 1 may be estimated, according to a labeled sample, to be younger than an age of a user that purchases the financing product 2, and the age of the user i that purchases the financing product 2 may be younger than a user that selects the deposit. In such a scenario, $b > c > d$ may be set.
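As an illustration only, the association relationship above can be written as a small function. The coefficient values and the scaled feature values below are hypothetical and are not taken from the embodiment; they are chosen so that the arithmetic is easy to follow.

```python
def association(x1, x21, x22, x23, x3, a, b, c, d, e, f):
    """F(X_i) = -a*x1 + b*x21 + c*x22 + d*x23 - e*(x3 - 40)**2 + f."""
    return -a * x1 + b * x21 + c * x22 + d * x23 - e * (x3 - 40) ** 2 + f


# Hypothetical positive-integer factors with b > c > d (assumption).
coeffs = dict(a=1, b=6, c=4, d=2, e=1, f=50)

# Hypothetical, already-scaled features of one user: consumption 10,
# service type "financing product 2", amount 42.
estimate = association(x1=10, x21=0, x22=1, x23=0, x3=42, **coeffs)
```

The three service-type inputs act as indicator variables, so exactly one of `x21`, `x22`, and `x23` is 1 for a given user.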
S102: Decompose the association relationship into N sub-association relationships that are in a one-to-one correspondence to the N fields, and decompose the feature vector of each sample into feature subvectors that are in a one-to-one correspondence to the N fields.
S103: Obtain a first value obtained by substituting a feature subvector of each labeled sample in each field into a corresponding sub-association relationship.
With reference to step S102 and step S103, the feature vector of the sample may include some or all known attribute information included in the sample, and therefore the known attribute information that each field contributes to the feature vector of the sample may be determined. The known attribute information contributed by each field may be referred to as a feature subvector of the sample. Correspondingly, the portion of the association relationship into which the known attribute information of a given field needs to be substituted may be referred to as a sub-association relationship. Then, in the foregoing example, $F$ may be decomposed into two sub-association relationships: $F_1(X_{1i}) = -a x_{1i}$ and $F_2(X_{2i}) = b x_{21i} + c x_{22i} + d x_{23i} - e (x_{3i} - 40)^2 + f$, and a corresponding feature vector may also be decomposed into two feature subvectors: $X_{1i} = x_{1i}$ and $X_{2i} = (x_{21i}, x_{22i}, x_{23i}, x_{3i})$. In such a scenario, the feature vector of the labeled sample may be $X_j$, and the feature subvectors may be $X_{1j} = x_{1j}$ and $X_{2j} = (x_{21j}, x_{22j}, x_{23j}, x_{3j})$, where two first values, $F_1(X_{1j})$ and $F_2(X_{2j})$, are obtained.
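The decomposition can be sketched as two functions, one per field, each of which sees only its own feature subvector. The function names, coefficient values, and feature values are hypothetical and serve only to show that the two first values add up to the value of the undecomposed relationship.

```python
def f1(x1, a):
    """Sub-association relationship evaluated in the mobile-operator field."""
    return -a * x1


def f2(x21, x22, x23, x3, b, c, d, e, f):
    """Sub-association relationship evaluated in the bank field."""
    return b * x21 + c * x22 + d * x23 - e * (x3 - 40) ** 2 + f


# Hypothetical feature subvectors of one labeled sample (assumption).
operator_subvector = (10,)
bank_subvector = (0, 1, 0, 42)

# Each field computes its first value locally.
first_value_operator = f1(*operator_subvector, a=1)
first_value_bank = f2(*bank_subvector, b=6, c=4, d=2, e=1, f=50)

# Only the two numbers, not the underlying attributes, need to be combined.
total = first_value_operator + first_value_bank
```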
S104: Calculate, based on the public attribute information, a sum of first values obtained in the N fields for the same user to obtain estimated attribute information, where the estimated attribute information is attribute information that is in a labeled sample, that corresponds to the to-be-predicted attribute information, and that is estimated based on the association relationship and a feature vector of the labeled sample.
Further, the sum of the first values obtained in the N fields for the same user may be calculated based on encrypted public attribute information to obtain the estimated attribute information F(X), where the public attribute information may be encrypted using a same encryption algorithm in the N fields. Because the same encryption algorithm is used in the N fields, a same piece of public attribute information yields the same result after the encryption, so the first values of the same user can be matched across the fields without exchanging the public attribute information in plaintext. For example, the estimated attribute information may be an age of the user B, or an age of the user M.
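A minimal sketch of this matching step follows. The embodiment does not name an encryption algorithm; the sketch substitutes a keyed hash (HMAC-SHA-256 from the Python standard library) as a stand-in because it is deterministic, which is the property the text relies on. The shared key, the helper names, and the first values are all hypothetical.

```python
import hashlib
import hmac

SHARED_KEY = b"hypothetical-shared-key"  # assumption: agreed on by all fields


def protect(public_attribute: str) -> str:
    """Deterministic token for a public attribute (stand-in for encryption)."""
    digest = hmac.new(SHARED_KEY, public_attribute.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def sum_first_values(*field_tables):
    """Sum per-field first values for every token present in all fields."""
    common = set(field_tables[0])
    for table in field_tables[1:]:
        common &= set(table)
    return {token: sum(table[token] for table in field_tables) for token in common}


# Each field publishes only token -> first value (hypothetical numbers).
operator_values = {protect("Li Si"): -10, protect("Wang Wu"): -8}
bank_values = {protect("Li Si"): 50, protect("Wang Wu"): 58}

estimated = sum_first_values(operator_values, bank_values)
```

A real deployment would need to weigh whether a keyed hash or an actual cipher is appropriate; the sketch only shows how equal tokens allow the per-user sum to be formed.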
S105: Determine the association relationship based on estimated attribute information of all labeled samples and known attribute information corresponding to the estimated attribute information.
S106: Determine the to-be-predicted attribute information of the to-be-labeled sample based on the determined association relationship and the feature vector of the to-be-labeled sample.
In an optional implementation, step S105 may include: for each labeled sample, calculating a first difference between estimated attribute information and known attribute information corresponding to the estimated attribute information; and determining the association relationship by making a sum of the first differences corresponding to all the labeled samples minimum.
For example, $\min \sum_{j \in L} (F(X_j) - y_j)$, where $y_j$ indicates the known attribute information corresponding to the estimated attribute information, $F(X_j) - y_j$ is the first difference, and $L$ indicates a set of all the labeled samples. Furthermore, the association relationship $F$ may be determined by minimizing $\sum_{j \in L} (F(X_j) - y_j)$.
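The minimization can be illustrated with a deliberately small search over one factor. The sketch below assumes that each first difference is taken in absolute value, so that the sum has a minimum; this is an assumption of the sketch, not a statement of the embodiment. It also fits only the constant f over a hypothetical candidate range, with all other factors held fixed.

```python
def sum_abs_first_differences(estimates, known):
    """Sum of |F(X_j) - y_j| over all labeled samples."""
    return sum(abs(est - y) for est, y in zip(estimates, known))


def fit_offset(partial_estimates, known, candidates):
    """Pick the constant f that minimizes the summed first differences."""
    return min(
        candidates,
        key=lambda f: sum_abs_first_differences(
            [p + f for p in partial_estimates], known
        ),
    )


# Hypothetical values of F(X_j) without the constant f, for two labeled samples.
partial = [-10, 0]
known_ages = [40, 50]

best_f = fit_offset(partial, known_ages, range(0, 101))
```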
In another optional implementation,
S201: Obtain similarity weights between to-be-labeled samples in each field, where the similarity weight is used for measuring a similarity between the instance data.
The similarity weights between the to-be-labeled samples may be determined using a cosine similarity algorithm. For example, for a field, feature subvectors corresponding to two to-be-labeled samples may be determined, and then a cosine value of an angle between the two feature subvectors may be calculated to estimate a similarity weight between the two to-be-labeled samples.
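The cosine similarity between two feature subvectors can be computed as sketched below. The function name and the example subvectors are hypothetical, and returning 0.0 for a zero-length vector is a choice made for the sketch.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature subvectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: a zero vector is treated as dissimilar
    return dot / (norm_u * norm_v)


# Hypothetical bank-field subvectors of two to-be-labeled samples.
weight = cosine_similarity((0, 1, 0, 42), (0, 1, 0, 40))
```

Each field can compute these weights locally, because only feature subvectors from that field are involved.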
S202: Obtain a second value obtained by substituting a feature subvector of each to-be-labeled sample in each field into a corresponding sub-association relationship.
It is assumed that the feature vector of the to-be-labeled sample is $X_q$, and the feature subvectors are $X_{1q} = x_{1q}$ and $X_{2q} = (x_{21q}, x_{22q}, x_{23q}, x_{3q})$, where two second values, $F_1(X_{1q})$ and $F_2(X_{2q})$, are obtained.
S203: Calculate second differences between the second values of the to-be-labeled samples in each field, and calculate a sum of products of all second differences in each field and corresponding similarity weights.
S204: For each labeled sample, calculate a first difference between estimated attribute information and known attribute information corresponding to the estimated attribute information.
S205: Determine the association relationship based on a sum of the first differences corresponding to all the labeled samples and the sum of the products of all the second differences in each field and the corresponding similarity weights.
Specifically, descriptions are provided with reference to steps S203 to S205:
$\min \sum_{j \in L} M \left(F(X_j) - y_j\right)^2 + \sum_{q} \dots$
where $R$ indicates a set of all the to-be-labeled samples, and $M$ is as large as possible. $w_{q_1,q_2}$ indicates a similarity weight between to-be-labeled samples $q_1$ and $q_2$ in a field corresponding to $F_1$, and $\omega_{q_1,q_2}$ indicates a similarity weight between the to-be-labeled samples $q_1$ and $q_2$ in a field corresponding to $F_2$. Both F1(X1q
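One way to read steps S203 to S205 is as a fitting term plus a smoothness term. The sketch below assumes that the second differences are squared and summed over unordered pairs of to-be-labeled samples, which the text does not state explicitly; the function name, data layout, and numbers are likewise hypothetical.

```python
from itertools import combinations


def objective(first_diffs, second_values_by_field, weights_by_field, M):
    """M * sum of squared first differences + weighted squared second differences."""
    fit_term = M * sum(d ** 2 for d in first_diffs)
    smooth_term = 0.0
    for values, weights in zip(second_values_by_field, weights_by_field):
        # values: sample id -> second value in this field
        # weights: (sample id, sample id) -> similarity weight in this field
        for q1, q2 in combinations(sorted(values), 2):
            smooth_term += weights[(q1, q2)] * (values[q1] - values[q2]) ** 2
    return fit_term + smooth_term


value = objective(
    first_diffs=[1, -2],
    second_values_by_field=[{"q1": -10, "q2": -12}, {"q1": 50, "q2": 53}],
    weights_by_field=[{("q1", "q2"): 0.5}, {("q1", "q2"): 1.0}],
    M=10,
)
```

A large `M` makes the fit to the labeled samples dominate, while the weighted term pulls similar to-be-labeled samples toward similar second values within each field.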
Further, after the determining the association relationship based on estimated attribute information of all labeled samples and known attribute information corresponding to the estimated attribute information, the method may further include: correcting the association relationship, and using the corrected association relationship as an estimated new association relationship; and stopping when a quantity of corrections exceeds a preset value; or stopping when all association relationships converge.
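The two stopping conditions can be sketched as a simple loop. The `correct` callback, the tolerance, and the toy update rule are hypothetical; the embodiment does not specify how a correction is computed or how convergence is measured.

```python
def refine(params, correct, max_corrections=100, tol=1e-6):
    """Apply corrections until the parameters converge or the budget is used up."""
    for count in range(1, max_corrections + 1):
        new_params = correct(params)
        change = max(abs(n - p) for n, p in zip(new_params, params))
        params = new_params
        if change < tol:
            return params, count  # converged
    return params, max_corrections  # preset quantity of corrections reached


def toy_correct(params):
    """Toy correction: move each parameter halfway toward 50."""
    return [p + 0.5 * (50 - p) for p in params]


converged_params, used = refine([0.0], toy_correct)
capped_params, capped_used = refine([0.0], toy_correct, max_corrections=3)
```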
This embodiment of the present application provides the information determining method, including: estimating the association relationship between the feature vector and the to-be-predicted attribute information of the to-be-labeled sample; decomposing the association relationship into the N sub-association relationships that are in a one-to-one correspondence to the N fields, and decomposing the feature vector of each sample into the feature subvectors that are in a one-to-one correspondence to the N fields; obtaining the first value obtained by substituting the feature subvector of each labeled sample in each field into the corresponding sub-association relationship; calculating, based on the public attribute information, the sum of the first values obtained in the N fields for the same user to obtain the estimated attribute information, where the estimated attribute information is the attribute information that is in the labeled sample, that corresponds to the to-be-predicted attribute information, and that is estimated based on the association relationship and the feature vector of the labeled sample; and determining the association relationship based on the estimated attribute information of all the labeled samples and the known attribute information corresponding to the estimated attribute information. In the process, the sum of the first values obtained in the N fields for the same user may be calculated based on the public attribute information to obtain the estimated attribute information. That is, a calculation result may be obtained from each field without a need to know attribute information of each field. The calculation result of the same user is further calculated using the public attribute information, and then the to-be-predicted attribute information may be determined. In this way, confidentiality between data in the different fields may be ensured.
S301: Estimate a probability distribution function of to-be-predicted attribute information according to a feature vector of a to-be-labeled sample.
For example, the method may be assumed to relate to two fields: mobile operators and banks. In such a scenario, instance data of a user A in a mobile operator may be: {Zhang San, 139***0000, a mobile phone fee of 100 RMB for November, including a 50 RMB call charge and a 50 RMB traffic fee}, and instance data of the user A in a bank may be: {Zhang San, 133***0000, a service type: a financing product 1, the financing product 1 relating to an amount of 80 thousand RMB, male}. All instance data of the user A may form a to-be-labeled sample, and the related gender may be to-be-predicted attribute information.
Additionally, in the foregoing example, instance data of a user B in a mobile operator may be: {Li Si, 139***0001, a mobile phone fee of 78 RMB for November, including a 30 RMB call charge and a 48 RMB traffic fee}, and instance data of the user B in a bank may be: {Li Si, 139***0000, a service type: a financing product 2, the financing product 2 relating to an amount of 50 thousand RMB, female}. All instance data of the user B may form a labeled sample.
Finally, the instance data in the foregoing example may include instance data of a user M in a mobile operator, which may be defined as: {Wang Wu, 139***0010, a mobile phone fee of 50 RMB for November, including a 30 RMB call charge and a 20 RMB traffic fee}, and may further include instance data of the user M in a bank, which may be defined as: {Wang Wu, 139***0010, a service type: a deposit, relating to an amount of 2000 RMB, female}. All instance data of the user M may form a labeled sample.
In the depicted scenario, a feature vector may be {a name, a mobile number, consumption information, a service type, an amount related to the service type}, and to-be-predicted attribute information of a to-be-labeled sample may be determined based on an internal data relationship of a labeled sample and known attribute information of the to-be-labeled sample.
It may be assumed that a probability distribution function of the gender may be determined as a discrete function according to the feature vector, and a function value is 0 or 1, where 0 may indicate that the gender is male, and 1 may indicate that the gender is female.
S302: Decompose the probability distribution function into N subfunctions that are in a one-to-one correspondence to the N fields, and decompose the feature vector of each sample into feature subvectors that are in a one-to-one correspondence to the N fields.
S303: Obtain a first value obtained by substituting a feature subvector of each labeled sample in each field into a corresponding subfunction.
S304: Calculate, based on the public attribute information, a sum of first values obtained in the N fields for the same user to obtain a probability that attribute information that is in a labeled sample and that corresponds to the to-be-predicted attribute information is particular attribute information.
Further, the sum of the first values obtained in the N fields for the same user may be calculated based on encrypted public attribute information to obtain the probability that the attribute information that is in the labeled sample and that corresponds to the to-be-predicted attribute information is the particular attribute information, where the public attribute information may be encrypted using a same encryption algorithm in the N fields. Confidentiality between data can be improved using this type of encryption manner.
S305: Determine the probability distribution function according to the probability that the attribute information that is in the labeled sample and that corresponds to the to-be-predicted attribute information is the particular attribute information and whether the attribute information is actually the particular attribute information.
S306: Determine the to-be-predicted attribute information of the to-be-labeled sample based on the determined probability distribution function and the feature vector of the to-be-labeled sample.
With reference to this embodiment of the present application, the particular attribute information may include: male and female.
In an exemplary embodiment, the determining the probability distribution function according to the probability that the attribute information that is in the labeled sample and that corresponds to the to-be-predicted attribute information is the particular attribute information and whether the attribute information is actually the particular attribute information may include: when the attribute information that is in the labeled sample and that corresponds to the to-be-predicted attribute information corresponds to m pieces of particular attribute information, where m is a positive integer greater than or equal to 2, for each piece of the particular attribute information of each labeled sample, when the attribute information corresponding to the to-be-predicted attribute information is actually the particular attribute information, calculating a first difference between the probability and 1; otherwise, calculating a first difference between the probability and 0; and determining the probability distribution function by making a sum of all first differences minimum.
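The sum of first differences for m pieces of particular attribute information can be sketched as follows. Taking each difference in absolute value is an assumption of the sketch, and the function name and probability values are hypothetical.

```python
def sum_first_differences(probabilities, actual, categories):
    """Sum |p - 1| for the actual category and |p - 0| for every other one.

    probabilities: one dict per labeled sample, category -> estimated probability
    actual: the known category of each labeled sample
    """
    total = 0.0
    for probs, truth in zip(probabilities, actual):
        for category in categories:
            target = 1.0 if category == truth else 0.0
            total += abs(probs[category] - target)
    return total


categories = ("male", "female")  # m = 2 in the gender example
probabilities = [
    {"male": 0.2, "female": 0.8},  # hypothetical estimate for user B
    {"male": 0.1, "female": 0.9},  # hypothetical estimate for user M
]
actual = ["female", "female"]

loss = sum_first_differences(probabilities, actual, categories)
```

The probability distribution function would then be chosen so that this sum is as small as possible over all labeled samples.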
In another embodiment, the method may further include: obtaining similarity weights between to-be-labeled samples in each field, where the similarity weight is used for measuring a similarity between the instance data; obtaining a second value obtained by substituting a feature subvector of each to-be-labeled sample in each field into a corresponding subfunction; and calculating second differences between the values of the to-be-labeled samples in each field, and calculating a sum of products of all second differences in each field and corresponding similarity weights; and the determining the probability distribution function according to the probability that the attribute information that is in the labeled sample and that corresponds to the to-be-predicted attribute information is the particular attribute information and whether the attribute information is actually the particular attribute information includes: for each piece of the particular attribute information of each labeled sample, when the attribute information corresponding to the to-be-predicted attribute information is actually the particular attribute information, calculating a first difference between the probability and 1; otherwise, calculating a first difference between the probability and 0.
Optionally, the probability distribution function may be determined based on a sum of the first differences corresponding to all the labeled samples and the sum of the products of all the second differences in each field and the corresponding similarity weights.
Optionally, the probability distribution function may be determined based on a difference between the probability and a preset value, a sum of the first differences corresponding to all the labeled samples, and the sum of the products of all the second differences in each field and the corresponding similarity weights. Preset values of all users may constitute a prior matrix.
Further, after the determining the probability distribution function according to the probability that the attribute information that is in the labeled sample and that corresponds to the to-be-predicted attribute information is the particular attribute information and whether the attribute information is actually the particular attribute information, the method may further include: correcting the probability distribution function, and using the corrected probability distribution function as an estimated new probability distribution function; and stopping when a quantity of corrections exceeds a preset value; or stopping when all probability distribution functions converge.
The embodiments of the present application may provide for an information determining method, comprising: estimating the probability distribution function of the to-be-predicted attribute information according to the feature vector of the to-be-labeled sample; decomposing the probability distribution function into the N subfunctions that are in a one-to-one correspondence to the N fields, and decomposing the feature vector of each sample into the feature subvectors that are in a one-to-one correspondence to the N fields; obtaining the first value obtained by substituting the feature subvector of each labeled sample in each field into the corresponding subfunction; calculating, based on the public attribute information, the sum of first values obtained in the N fields for the same user to obtain the probability that the attribute information that is in the labeled sample and that corresponds to the to-be-predicted attribute information is the particular attribute information; and determining the probability distribution function according to the probability that the attribute information that is in the labeled sample and that corresponds to the to-be-predicted attribute information is the particular attribute information and whether the attribute information is actually the particular attribute information. In the process, the sum of the first values obtained in the N fields for the same user is calculated based on the public attribute information to obtain the probability that the attribute information that is in the labeled sample and that corresponds to the to-be-predicted attribute information is the particular attribute information. That is, a calculation result is obtained from each field without a need to know attribute information of each field. The calculation result of the same user is further calculated using the public attribute information, and finally the to-be-predicted attribute information is determined. 
In this way, confidentiality between data in the different fields is ensured.
Further, the calculation module 44 may be configured to calculate, based on encrypted public attribute information, the sum of the first values obtained in the N fields for the same user to obtain the estimated attribute information, where the public attribute information is encrypted using a same encryption algorithm in the N fields.
Still further, the determining module 45 may be configured to: for each labeled sample, calculate a first difference between estimated attribute information and known attribute information corresponding to the estimated attribute information; and determine the association relationship by making a sum of the first differences corresponding to all the labeled samples minimum.
Optionally, the obtaining module 43 may be further configured to: obtain similarity weights between to-be-labeled samples in each field, where the similarity weight is used for measuring a similarity between the instance data; and obtain a second value obtained by substituting a feature subvector of each to-be-labeled sample in each field into a corresponding sub-association relationship. The calculation module 44 may be further configured to: calculate second differences between the second values of the to-be-labeled samples in each field, and calculate a sum of products of all second differences in each field and corresponding similarity weights. The determining module 45 may be configured to: for each labeled sample, calculate a first difference between estimated attribute information and known attribute information corresponding to the estimated attribute information; and determine the association relationship based on a sum of the first differences corresponding to all the labeled samples and the sum of the products of all the second differences in each field and the corresponding similarity weights.
In some embodiments, the apparatus may further include: a correction module 46, which may be configured to: correct the association relationship, and use the corrected association relationship as an estimated new association relationship; and stop when a quantity of corrections exceeds a preset value; or stop when all association relationships converge.
The information determining apparatus provided in this embodiment may be used to perform the method steps in the embodiments shown in
Further, the calculation module 54 may be configured to calculate, based on encrypted public attribute information, the sum of the first values obtained in the N fields for the same user to obtain the probability that the attribute information that is in the labeled sample and that corresponds to the to-be-predicted attribute information is the particular attribute information, where the public attribute information is encrypted using a same encryption algorithm in the N fields.
Optionally, the determining module 55 may be configured to: when the attribute information that is in the labeled sample and that corresponds to the to-be-predicted attribute information corresponds to m pieces of particular attribute information, where m is a positive integer greater than or equal to 2, for each piece of the particular attribute information of each labeled sample, when the attribute information corresponding to the to-be-predicted attribute information is actually the particular attribute information, calculate a first difference between the probability and 1; otherwise, calculate a first difference between the probability and 0; and determine the probability distribution function by making a sum of all first differences minimum.
Optionally, the obtaining module 53 may be further configured to: obtain similarity weights between to-be-labeled samples in each field, where the similarity weight is used for measuring a similarity between the instance data; and obtain a second value obtained by substituting a feature subvector of each to-be-labeled sample in each field into a corresponding subfunction. The calculation module 54 may be further configured to: calculate second differences between the values of the to-be-labeled samples in each field, and calculate a sum of products of all second differences in each field and corresponding similarity weights. The determining module 55 may be configured to: for each piece of the particular attribute information of each labeled sample, when the attribute information corresponding to the to-be-predicted attribute information is actually the particular attribute information, calculate a first difference between the probability and 1; otherwise, calculate a first difference between the probability and 0; and determine the probability distribution function based on a sum of the first differences corresponding to all the labeled samples and the sum of the products of all the second differences in each field and the corresponding similarity weights.
In some embodiments, the apparatus may further include: a correction module 56, which may be configured to: correct the probability distribution function, and use the corrected probability distribution function as an estimated new probability distribution function; and stop when a quantity of corrections exceeds a preset value; or stop when all probability distribution functions converge.
The information determining apparatus provided in this embodiment may be used to perform the method steps in the embodiment shown in
The information determining apparatus provided in this embodiment may be used to perform the method steps in the embodiments shown in
The information determining apparatus provided in this embodiment may be used to perform the method steps in the embodiment shown in
An embodiment of the present application may further provide a computer program product, including a computer readable storage medium. The storage medium may be configured to store computer executable instructions, and the computer executable instructions may include instructions for performing the foregoing method steps. Persons of ordinary skill in the art may understand that all or some of the steps of the method embodiments may be implemented by a program instructing relevant hardware. The program may be stored in a non-transitory computer-readable storage medium. When the program runs, the steps of the method embodiments may be performed. The foregoing storage medium may include: any medium that can store program code, such as a read only memory (ROM), a random-access memory (RAM), a magnetic disk, or an optical disc.
Finally, it should be noted that the foregoing embodiments are merely intended for describing the technical solutions of the present application rather than limiting the present application. Although the present application is described in detail with reference to the foregoing embodiments, persons of ordinary skill in the art should understand that they may still make modifications to the technical solutions described in the foregoing embodiments or make equivalent replacements to some technical features thereof, without departing from the scope of the technical solutions of the embodiments of the present application.
Number | Date | Country | Kind |
---|---|---|---|
201510959360.9 | Dec 2015 | CN | national |
This application is a continuation of International Application No. PCT/CN2016/097816, filed on Sep. 1, 2016, which claims priority to Chinese Patent Application No. 201510959360.9, filed on Dec. 21, 2015. The disclosures of the aforementioned applications are hereby incorporated by reference in their entireties.
Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/CN2016/097816 | Sep 2016 | US |
Child | 16013433 | | US |