A machine learning system can use one or more algorithms, statistical models, or both to produce, from a training set of data, a mathematical model that can predict an outcome of a future occurrence of an event. The outcome of the future occurrence of the event can be referred to as a label. A set of data can be received. The set of data can be organized as records. The records can have a set of fields. One field can correspond to an occurrence of the event. A set of records can be determined in which members of the set of records have a value for this field that is other than a null value. This value can represent the outcome of a past occurrence of the event. This set of records can be designated as a preliminary training set of data. Records other than this set of records can be designated as a scoring set of data. It can be possible that one or more fields, other than the field that corresponds to the occurrence of the event, are associated with data that are entered into the set of data after the outcome of a corresponding occurrence of the event is known. Such data can be associated with hindsight bias. A training set of data that includes data associated with hindsight bias can be referred to as having label leakage. Instances of inclusion of data associated with hindsight bias in the training set of data can reduce an accuracy of the mathematical model to predict the outcome of the future occurrence of the event.
The accompanying drawings, which are included to provide a further understanding of the disclosed subject matter, are incorporated in and constitute a part of this specification. The drawings also illustrate implementations of the disclosed subject matter and together with the detailed description serve to explain the principles of implementation of the disclosed subject matter. No attempt is made to show structural details in more detail than may be necessary for a fundamental understanding of the disclosed subject matter and the various ways in which it can be practiced.
As used herein, a statement that a component can be “configured to” perform an operation can be understood to mean that the component requires no structural alterations, but merely needs to be placed into an operational state (e.g., be provided with electrical power, have an underlying operating system running, etc.) in order to perform the operation.
The disclosed technologies can reduce instances of inclusion of data associated with hindsight bias in a training set of data for a machine learning system. A first set of data can be received. The first set of data can be organized as records. The records can have a first set of fields. An analysis of data in a first field of the first set of fields can be performed with respect to data in a second field of the first set of fields. The second field can correspond to an occurrence of an event. A result of the analysis can be determined. The result can be that the data in the first field is associated with hindsight bias. In response to the result, a second set of data can be produced. The second set of data can be organized as the records. The records can have a second set of fields. The second set of fields can include the first set of fields except the first field. In response to a production of the second set of data, one or more features associated with the second set of data can be generated. In response to a generation of the one or more features, a third set of data can be produced. The third set of data can be organized as the records. The records can have a third set of fields. The third set of fields can include the second set of fields and one or more additional fields. The one or more additional fields can correspond to the one or more features. Using the third set of data, the training set of data can be produced. Using the training set of data, the machine learning system can be caused to be trained to predict the outcome of a future occurrence of the event.
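For illustration only, the progression from the first set of data to the third set of data can be sketched as follows. The sketch assumes that records are represented as Python dictionaries; the function names `drop_field` and `add_features`, the field names, and the example feature are hypothetical and are not part of the disclosure.

```python
def drop_field(records, field):
    """Produce the second set of data: the same records without the flagged field."""
    return [{k: v for k, v in r.items() if k != field} for r in records]


def add_features(records, feature_fns):
    """Produce the third set of data: each record gains one field per feature.

    feature_fns maps a new (hypothetical) field name to a function of a record.
    """
    enriched = []
    for r in records:
        new_record = dict(r)
        for name, fn in feature_fns.items():
            new_record[name] = fn(r)
        enriched.append(new_record)
    return enriched


# Hypothetical first set of data; "leaky" stands in for a field flagged as
# being associated with hindsight bias.
first = [
    {"lead_no": "002", "visits": 4, "leaky": "2018-01-05", "outcome": 1},
    {"lead_no": "004", "visits": 1, "leaky": None, "outcome": 0},
]
second = drop_field(first, "leaky")
third = add_features(second, {"frequent_visitor": lambda r: r["visits"] >= 3})
```

The input records are not modified in place, so the first set of data remains available for comparison with the second and third sets of data.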
With reference to
With reference to
At an optional operation 206, a preliminary training set of data can be designated. The preliminary training set of data can include the first set of records. For example, the preliminary training set of records can include the records associated with Lead Nos. 002, 004, 005, 007, 008, and 010.
At an optional operation 208, a scoring set of data can be designated. The scoring set of data can include the records other than the first set of records. For example, the scoring set of records can include the records associated with Lead Nos. 001, 003, 006, and 009.
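For illustration only, the designation of the preliminary training set of data and the scoring set of data can be sketched as follows. The sketch assumes that records are Python dictionaries and that a null value is represented by `None`; the function name `split_by_label`, the field name `outcome`, and the example values are hypothetical.

```python
def split_by_label(records, label_field):
    """Partition records into (preliminary training set, scoring set).

    A record belongs to the preliminary training set when its label field
    holds a value other than a null value; otherwise it belongs to the
    scoring set.
    """
    training = [r for r in records if r.get(label_field) is not None]
    scoring = [r for r in records if r.get(label_field) is None]
    return training, scoring


# Hypothetical records; only the Lead Nos. echo the examples in the text.
records = [
    {"lead_no": "001", "outcome": None},
    {"lead_no": "002", "outcome": "converted"},
    {"lead_no": "003", "outcome": None},
    {"lead_no": "004", "outcome": "not converted"},
]
training, scoring = split_by_label(records, "outcome")
```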
At an operation 210, an analysis of data in a first field, of the first set of fields, can be performed with respect to data in the second field.
At an operation 212, a result of the analysis can be determined. The result can be that the data in the first field is associated with hindsight bias.
With reference to
At an operation 404, a determination can be made, for the second set of records, that a value of the second field of one record of the second set of records is the same as a value of the second field of each other record of the second set of records.
For example, the second set of records can include the records associated with Lead Nos. 002, 007, and 008 in which the first field is Customer No. Alternatively, for example, the second set of records can include the records associated with Lead Nos. 002, 007, and 008 in which the first field is Date of last purchase.
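For illustration only, the determination at operation 404 can be sketched as follows. The sketch assumes that the second set of records consists of the records whose first field holds a value other than a null value, which the examples suggest but which is an assumption here; the function names, field names, and values are hypothetical.

```python
def non_null_records(records, field):
    """Assumed construction of the second set: records whose field is non-null."""
    return [r for r in records if r.get(field) is not None]


def all_share_label(records, label_field):
    """True when every record in a non-empty set has the same label value."""
    labels = {r[label_field] for r in records}
    return len(labels) == 1


# Hypothetical preliminary training set; a customer number exists only for
# records whose outcome is already known to be a conversion.
training = [
    {"lead_no": "002", "customer_no": "C-17", "outcome": 1},
    {"lead_no": "004", "customer_no": None, "outcome": 0},
    {"lead_no": "007", "customer_no": "C-21", "outcome": 1},
    {"lead_no": "008", "customer_no": "C-30", "outcome": 1},
]
second_set = non_null_records(training, "customer_no")
flagged = all_share_label(second_set, "outcome")
```

In this sketch, a field that is populated only for records having a single outcome is treated as a candidate for data associated with hindsight bias.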
With reference to
At an operation 504, a first count can be determined. The first count can be of the members of the third set of records.
At an operation 506, a subset of the third set of records can be determined. A value of the first field of each member of the subset of the third set of records can be other than a null value.
At an operation 508, a second count can be determined. The second count can be of members of the subset of the third set of records.
At an operation 510, a determination can be made that an absolute value of a difference between the first count and the second count is less than or equal to a threshold.
For example, if the threshold is one, then the third set of records can include the records associated with Lead Nos. 002, 007, and 008 in which the first field is Holiday card sent.
In general, a value of the threshold should not be too large so that the disclosed technologies can remove data associated with hindsight bias and not remove data having a predictive quality with respect to an outcome of a future occurrence of the event.
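For illustration only, operations 504 through 510 can be sketched as follows. The sketch takes the third set of records as an input, because how that set is determined is not shown above; the function name `nearly_all_non_null`, the field name, and the values are hypothetical.

```python
def nearly_all_non_null(records, field, threshold):
    """Compare the size of a set with the size of its non-null subset."""
    first_count = len(records)                                    # operation 504
    subset = [r for r in records if r.get(field) is not None]     # operation 506
    second_count = len(subset)                                    # operation 508
    return abs(first_count - second_count) <= threshold           # operation 510


# Hypothetical third set of records: two of three have a non-null value.
third_set = [
    {"lead_no": "002", "holiday_card_sent": None},
    {"lead_no": "007", "holiday_card_sent": "yes"},
    {"lead_no": "008", "holiday_card_sent": "yes"},
]
```

With a threshold of one, the check tolerates a single record that lacks a value; with a threshold of zero, it reduces to requiring that every record has a value.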
With reference to
At an operation 604, a determination can be made that a value of the first field of each member of the fourth set of records is a null value.
For example, the fourth set of records can include the records associated with Lead Nos. 004, 005, and 010 in which the first field is Holiday card sent.
With reference to
At an operation 704, a first count can be determined. The first count can be of the members of the fifth set of records.
At an operation 706, a subset of the fifth set of records can be determined. A value of the first field of each member of the subset of the fifth set of records can be a null value.
At an operation 708, a second count can be determined. The second count can be of members of the subset of the fifth set of records.
At an operation 710, a determination can be made that an absolute value of a difference between the first count and the second count is less than or equal to a threshold.
For example, if the threshold is one, then the fifth set of records can include the records associated with Lead Nos. 004, 005, and 010 in which the first field is Date subscription stopped.
In general, a value of the threshold should not be too large so that the disclosed technologies can remove data associated with hindsight bias and not remove data having a predictive quality with respect to an outcome of a future occurrence of the event.
With reference to
At an operation 804, a seventh set of records can be determined. The seventh set of records can be the records other than the sixth set of records.
At an operation 806, a determination can be made, for the seventh set of records, that a value of the second field of one record of the seventh set of records is the same as a value of the second field of each other record of the seventh set of records.
For example, the seventh set of records can include the records associated with Lead Nos. 002, 007, and 008 in which the first field is Value of customer.
With reference to
At an operation 904, a ninth set of records can be determined. The ninth set of records can be the records other than the eighth set of records.
At an operation 906, a first count can be determined. The first count can be of members of the ninth set of records.
At an operation 908, for the ninth set of records, a superset of the ninth set of records can be determined. A value of the second field of one record of the superset of the ninth set of records can be the same as a value of the second field of each other record of the superset of the ninth set of records.
At an operation 910, a second count can be determined. The second count can be of members of the superset of the ninth set of records.
At an operation 912, a determination can be made that an absolute value of a difference between the second count and the first count is less than or equal to a threshold.
For example, if the threshold is one, then the ninth set of records can include the records associated with Lead Nos. 002, 007, and 008 in which the first field is Value of last purchase. (For example, an entity associated with Lead No. 002 may have received a promotional offer such that a value of a last purchase by this entity was zero.)
In general, a value of the threshold should not be too large so that the disclosed technologies can remove data associated with hindsight bias and not remove data having a predictive quality with respect to an outcome of a future occurrence of the event.
With reference to
At an operation 1004, a determination can be made, for the tenth set of records, that a value of the first field of one record of the tenth set of records is the same as a value of the first field of each other record of the tenth set of records.
For example, the tenth set of records can include the records associated with Lead Nos. 004, 005, and 010 in which the first field is Number of items in last purchase.
With reference to
At an operation 1104, a first count can be determined. The first count can be of the members of the eleventh set of records.
At an operation 1106, for the eleventh set of records, a subset of the eleventh set of records can be determined. A value of the first field of one record of the subset of the eleventh set of records can be the same as a value of the first field of each other record of the subset of the eleventh set of records.
At an operation 1108, a second count can be determined. The second count can be of members of the subset of the eleventh set of records.
At an operation 1110, a determination can be made that an absolute value of a difference between the first count and the second count is less than or equal to a threshold.
For example, if the threshold is one, then the eleventh set of records can include the records associated with Lead Nos. 002, 007, and 008 in which the first field is Value of last item returned.
In general, a value of the threshold should not be too large so that the disclosed technologies can remove data associated with hindsight bias and not remove data having a predictive quality with respect to an outcome of a future occurrence of the event.
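For illustration only, operations 1104 through 1110 can be sketched as follows. The sketch takes the eleventh set of records as an input and assumes that the subset of interest is the largest group of records sharing one value of the first field; that choice, the function name `nearly_constant`, the field name, and the values are assumptions.

```python
from collections import Counter


def nearly_constant(records, field, threshold):
    """Check whether almost all records in a set share one value of a field."""
    first_count = len(records)                            # operation 1104
    if first_count == 0:
        return False
    value_counts = Counter(r.get(field) for r in records)
    # Assumed choice of subset: the most common value (operation 1106).
    second_count = value_counts.most_common(1)[0][1]      # operation 1108
    return abs(first_count - second_count) <= threshold   # operation 1110


# Hypothetical eleventh set of records: two of three share the value 0.
eleventh_set = [
    {"lead_no": "002", "value_of_last_item_returned": 0},
    {"lead_no": "007", "value_of_last_item_returned": 0},
    {"lead_no": "008", "value_of_last_item_returned": 25},
]
```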
With reference to
At an operation 1204, a determination can be made, for the scoring set of data, that all of the members of the scoring set of data have the value of the first field that is the null value.
For example, the twelfth set of records can include the records associated with Lead Nos. 007 and 008 in which the first field is Last date relative of lead contacted.
With reference to
At an operation 1304, a first quotient can be determined. The first quotient can be of a count of the members of the thirteenth set of records divided by a count of members of the preliminary training set of data.
At an operation 1306, a fourteenth set of records can be determined for the scoring set of data. Members of the fourteenth set of records can have the value of the first field that is other than the null value.
At an operation 1308, a second quotient can be determined. The second quotient can be of a count of the members of the fourteenth set of records divided by a count of the members of the scoring set of data.
At an operation 1310, a determination can be made that the first quotient is less than or equal to a threshold.
At an operation 1312, a determination can be made that the second quotient is less than or equal to the threshold.
For example, if the threshold is 0.25 and the first field is Birthday of lead, then the thirteenth set of records can include the record associated with Lead No. 002, the first quotient can be 0.1667, the fourteenth set of records can include the record associated with Lead No. 006, and the second quotient can be 0.25.
In general, a value of the threshold should not be too large so that the disclosed technologies can remove data associated with hindsight bias and not remove data having a predictive quality with respect to an outcome of a future occurrence of the event.
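For illustration only, operations 1304 through 1312 can be sketched as follows. The sketch assumes that the thirteenth set of records, like the fourteenth, consists of records whose first field holds a value other than a null value; the function names `fill_rate` and `rarely_filled`, the field name, and the birthday values are hypothetical. The Lead Nos. follow the examples above.

```python
def fill_rate(records, field):
    """Quotient of the count of non-null records over the count of all records."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) is not None)
    return filled / len(records)


def rarely_filled(training, scoring, field, threshold):
    """True when both quotients are less than or equal to the threshold."""
    first_quotient = fill_rate(training, field)    # operation 1304
    second_quotient = fill_rate(scoring, field)    # operation 1308
    return first_quotient <= threshold and second_quotient <= threshold


def make(lead_nos, filled):
    # Helper that builds hypothetical records for the example.
    return [{"lead_no": n, "birthday": "01-01" if n in filled else None}
            for n in lead_nos]


training = make(["002", "004", "005", "007", "008", "010"], {"002"})
scoring = make(["001", "003", "006", "009"], {"006"})
```

With these records, the first quotient is 1/6 (approximately 0.1667) and the second quotient is 1/4 (0.25), so a threshold of 0.25 is satisfied for both.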
With reference to
At an operation 1404, a first quotient can be determined. The first quotient can be of a count of the members of the fifteenth set of records divided by a count of members of the preliminary training set of data.
At an operation 1406, a sixteenth set of records can be determined for the scoring set of data. Members of the sixteenth set of records can have the value of the first field that is other than the null value.
At an operation 1408, a second quotient can be determined. The second quotient can be of a count of the members of the sixteenth set of records divided by a count of the members of the scoring set of data.
At an operation 1410, a determination can be made that an absolute value of a difference between the first quotient and the second quotient is greater than or equal to a threshold.
For example, if the threshold is 0.25 and the first field is Last date friend of lead contacted, then the fifteenth set of records can include the records associated with Lead Nos. 004, 007, and 008, the first quotient can be 0.5, the sixteenth set of records can include the record associated with Lead No. 003, and the second quotient can be 0.25.
In general, a value of the threshold should not be too small so that the disclosed technologies can remove data associated with hindsight bias and not remove data having a predictive quality with respect to an outcome of a future occurrence of the event.
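For illustration only, operations 1404 through 1410 can be sketched as follows. The sketch assumes that the fifteenth set of records, like the sixteenth, consists of records whose first field holds a value other than a null value; the function names, the field name `friend_contacted`, and the date values are hypothetical. The Lead Nos. follow the examples above.

```python
def fill_rate(records, field):
    """Quotient of the count of non-null records over the count of all records."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) is not None)
    return filled / len(records)


def fill_rates_diverge(training, scoring, field, threshold):
    """True when the two quotients differ by at least the threshold."""
    first_quotient = fill_rate(training, field)    # operation 1404
    second_quotient = fill_rate(scoring, field)    # operation 1408
    return abs(first_quotient - second_quotient) >= threshold   # operation 1410


def make(lead_nos, filled):
    # Helper that builds hypothetical records for the example.
    return [{"lead_no": n, "friend_contacted": "2018-03-01" if n in filled else None}
            for n in lead_nos]


training = make(["002", "004", "005", "007", "008", "010"], {"004", "007", "008"})
scoring = make(["001", "003", "006", "009"], {"003"})
```

Here the first quotient is 0.5 and the second quotient is 0.25, so the difference of 0.25 meets a threshold of 0.25. A field that is filled markedly more often in the preliminary training set than in the scoring set is the pattern this check is intended to surface.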
Returning to
With reference to
At an operation 218, a third set of data can be produced in response to a generation of the one or more features. The third set of data can be organized as the records. The records can have a third set of fields. The third set of fields can include the second set of fields and one or more additional fields. The one or more additional fields can correspond to the one or more features.
Returning to
Returning to
Training the machine learning system can be a continual process.
For example, returning to
Returning to
With reference to
At an optional operation 230, for the set of iterations, a set of differences can be determined. A difference, of the set of differences, can be, for the iteration, an absolute value of the quotient subtracted from the average of the quotients. For example, for the January iteration, the difference can be 0.03; for the February iteration, the difference can be 0.02; for the March iteration, the difference can be 0.22; for the April iteration, the difference can be 0.05; for the May iteration, the difference can be 0.04; and for the June iteration, the difference can be 0.11.
At an optional operation 232, from the set of differences, a set of unusual actual outcomes can be determined. The difference associated with each member of the set of unusual actual outcomes can be greater than or equal to a threshold. For example, if the threshold is 0.15, then the set of unusual actual outcomes can include the actual outcomes for the March iteration.
At an optional operation 234, the records associated with the set of unusual actual outcomes can be excluded from a future training set of data.
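For illustration only, operations 230 through 234 can be sketched as follows. The differences are the example values given above; the function names and the `iteration` key used to tie a record to an iteration are hypothetical, and the per-iteration quotients themselves are not reproduced here.

```python
from statistics import mean


def differences_from_average(quotients):
    """Operation 230: absolute difference of each quotient from the average."""
    average = mean(quotients.values())
    return {name: abs(average - q) for name, q in quotients.items()}


def unusual_iterations(differences, threshold):
    """Operation 232: iterations whose difference meets or exceeds the threshold."""
    return [name for name, d in differences.items() if d >= threshold]


def exclude_iterations(records, iterations):
    """Operation 234: drop records tied to the unusual iterations."""
    excluded = set(iterations)
    return [r for r in records if r.get("iteration") not in excluded]


# Differences taken from the example in the text.
differences = {"January": 0.03, "February": 0.02, "March": 0.22,
               "April": 0.05, "May": 0.04, "June": 0.11}
unusual = unusual_iterations(differences, 0.15)

# Hypothetical records for a future training set of data.
records = [{"lead_no": "011", "iteration": "March"},
           {"lead_no": "012", "iteration": "April"}]
future_training = exclude_iterations(records, unusual)
```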
Advantageously, the disclosed technologies can automate operations associated with training a machine learning system that conventionally have not been automated. Specifically, although conventional technologies include a variety of automated techniques associated with feature engineering, feature selection, and mathematical models, conventionally a data scientist must manually select from among this variety of automated techniques. In contrast, the disclosed technologies provide for automatic selection of feature engineering techniques, feature selection techniques, and mathematical models. Thus, the disclosed technologies integrate automation of operations associated with training a machine learning system.
Advantageously, the disclosed technologies use fewer memory cells than conventional approaches to producing the training set of data.
In light of the technologies described above, one of skill in the art understands that reducing instances of inclusion of data associated with hindsight bias in a training set of data for a machine learning system can include any combination of some or all of the foregoing configurations.
The computing device 2000 can include a bus 2002 that interconnects major components of the computing device 2000. Such components can include a central processor 2004, a memory 2006 (such as Random Access Memory (RAM), Read-Only Memory (ROM), flash RAM, or the like), a sensor 2008 (which can include one or more sensors), a display 2010 (such as a display screen), an input interface 2012 (which can include one or more input devices such as a keyboard, mouse, keypad, touch pad, turn-wheel, and the like), a fixed storage 2014 (such as a hard drive, flash storage, and the like), a removable media component 2016 (operable to control and receive a solid-state memory device, an optical disk, a flash drive, and the like), a network interface 2018 (operable to communicate with one or more remote devices via a suitable network connection), and a speaker 2020 (to output an audible communication). In some embodiments the input interface 2012 and the display 2010 can be combined, such as in the form of a touch screen.
The bus 2002 can allow data communication between the central processor 2004 and one or more memory components 2014, 2016, which can include RAM, ROM, or other memory. Applications resident with the computing device 2000 generally can be stored on and accessed via a computer readable storage medium.
The fixed storage 2014 can be integral with the computing device 2000 or can be separate and accessed through other interfaces. The network interface 2018 can provide a direct connection to the premises management system and/or a remote server via a wired or wireless connection. The network interface 2018 can provide such connection using any suitable technique and protocol, including digital cellular telephone, WiFi™, Thread®, Bluetooth®, near field communications (NFC), and the like. For example, the network interface 2018 can allow the computing device 2000 to communicate with other components of the premises management system or other computers via one or more local, wide-area, or other communication networks.
The foregoing description, for purpose of explanation, has been described with reference to specific configurations. However, the illustrative descriptions above are not intended to be exhaustive or to limit configurations of the disclosed technologies to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The configurations were chosen and described in order to explain the principles of configurations of the disclosed technologies and their practical applications, to thereby enable others skilled in the art to utilize those configurations as well as various configurations with various modifications as may be suited to the particular use contemplated.
This application claims, under 35 U.S.C. § 119(e), the benefit of U.S. Provisional Application No. 62/764,666, filed Aug. 15, 2018, the disclosure of which is incorporated herein in its entirety by reference.
Number | Date | Country
---|---|---
62764666 | Aug 2018 | US