The embodiments are directed to machine learning, and more particularly to a machine learning system for identifying object similarity.
Conventionally, similarity between two objects is determined using unsupervised learning techniques. These conventional techniques identify features of the objects, transform the features into a high-dimensional feature space, and use a clustering or K-nearest neighbor algorithm to identify the similarity of the objects.
In one or more implementations, not all of the depicted components in each figure may be required, and one or more implementations may include additional components not shown in a figure. Variations in the arrangement and type of the components may be made without departing from the scope of the subject disclosure. Additional components, different components, or fewer components may be utilized within the scope of the subject disclosure.
A similarity framework can be used to identify relationships between objects and to evaluate the strength of those relationships. The output of the similarity framework may be a similarity matrix. The similarity matrix is a symmetric n×n matrix whose rows and columns represent objects. An element of the matrix is a similarity score between the two objects identified by the element's row and column. The similarity score identifies the strength of the relationship between the two objects.
The similarity framework, such as the one described in the embodiments below, may be used to identify similarity between different types of objects. When the objects are images and the similarity framework identifies similar images, one image may replace another to be used in, e.g., image recognition systems. When the objects are articles, similar articles may be identified to determine current trends. When the objects are documents, similar documents may indicate plagiarism. When the objects are transactions, similar or dissimilar transactions may indicate fraud. The similarity framework may also be used in various natural language processing tasks, including text summarization, translation, etc. The similarity framework may be used to identify similar securities and substitute a security of interest with another security having similar characteristics. This has applications in trading and liquidity when, for example, a bond cannot be sourced from the market, or in portfolio construction, where one or more securities may be replaced with other securities that are largely similar but have more desirable properties or characteristics.
The similarity framework may include a supervised machine learning algorithm, such as a Gradient Boosting Machines (GBM) algorithm. The GBM algorithm may train an ensemble of decision trees using a training dataset that includes features of different objects. Once the similarity framework is trained, the similarity framework receives objects. The objects are propagated through each tree in the ensemble of decision trees until the objects reach the leaf nodes. For each object, the GBM algorithm may record the leaf node reached in every tree in the ensemble. Thereafter, the similarity between two objects may be defined as the percentage of trees in the ensemble in which the two objects fall into the same leaf node. For example, the similarity framework may assign a similarity score of one when two objects share the same leaf node and a score of zero otherwise. In another example, instead of assigning a score that is zero or one, the score between the two objects in the same tree may vary from zero to one based on the height of the deepest node in the tree that the objects share and the height of the tree. This means that if the two objects share a leaf node, the score may be one; if the objects split at the root, the score may be zero; and if the objects split elsewhere in the tree, the score may be a number between zero and one given by dc/d, where dc is the depth of the deepest node that the objects have in common and d is the depth of the entire tree.
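For illustration only, the depth-based per-tree score described above may be sketched in Python as follows. The dictionary-based tree representation and the helper names are assumptions made for this example and are not the claimed implementation.

```python
# Minimal sketch of the per-tree similarity score described above.
# A tree node is assumed to be a dict with keys "id", "feature", "threshold",
# "left", and "right" for internal nodes, and just "id" for leaf nodes.

def leaf_path(tree, features):
    """Return the list of node ids visited from the root down to a leaf."""
    path = [tree["id"]]
    while "feature" in tree:
        branch = "left" if features[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
        path.append(tree["id"])
    return path

def tree_similarity(tree, x1, x2, tree_depth):
    """Per-tree score: one if the objects share a leaf node, zero if they split
    at the root, otherwise dc / d where dc is the depth of the deepest node the
    objects have in common and d is the depth of the tree."""
    p1, p2 = leaf_path(tree, x1), leaf_path(tree, x2)
    if p1 == p2:
        return 1.0                      # same leaf node
    shared = 0
    for a, b in zip(p1, p2):
        if a != b:
            break
        shared += 1
    dc = shared - 1                     # root has depth 0
    return dc / tree_depth              # 0 when the objects split at the root
```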
In some embodiments, the similarity framework may assign different weights to the scores from different trees. The weights may be assigned based on the importance of the tree in the ensemble of trees compared to other trees in the ensemble of trees. The weight associated with each tree may be based on a reduction in the training error contributed by that tree to the ensemble of trees.
The output of the similarity framework may be a similarity matrix. The similarity matrix may include object similarity scores for pairs of objects determined from the ensemble of trees. Each object similarity score may be a combination of similarity scores generated by each tree in the ensemble of trees.
Memory 120 may be used to store software executed by computing device 100 and/or one or more data structures used during operation of computing device 100. Memory 120 may include one or more types of machine readable media. Some common forms of machine readable media may include floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
Processor 110 and/or memory 120 may be arranged in any suitable physical arrangement. In some embodiments, processor 110 and/or memory 120 may be implemented on a same board, in a same package (e.g., system-in-package), on a same chip (e.g., system-on-chip), and/or the like. In some embodiments, processor 110 and/or memory 120 may include distributed, virtualized, and/or containerized computing resources. Consistent with such embodiments, processor 110 and/or memory 120 may be located in one or more data centers and/or cloud computing facilities. Although illustrated as a single processor 110 and a single memory 120, the embodiments may be executed on multiple processors and stored in multiple memories.
In some examples, memory 120 may include a non-transitory, tangible, machine readable media that includes executable code that when run by one or more processors (e.g., processor 110) may cause the one or more processors to perform the methods described in further detail herein. In some embodiments, memory 120 may store a similarity framework 130. Similarity framework 130 may be trained using machine learning to identify similarity between objects. Similarity between objects may reflect a same or similar characteristic, or a set of same or similar characteristics, that satisfies an objective. Similarity may be quantified by an object similarity score. Similarity framework 130 may receive objects 140 as input. Using the objects 140, similarity framework 130 may generate a similarity matrix 150 that includes similarity scores for the objects. A similarity score between a pair of objects in the similarity matrix 150 may identify the similarity between the pair of objects.
In some embodiments, the trees are trained using a training loss. For example, if there are K trees, the set of training losses (TL) at each step may be defined as follows:
$TL = (TL_0, TL_1, \ldots, TL_{K-1})$   Equation (1)
In some embodiments, the training loss may be a monotonically decreasing set of numbers that reflects that the training loss decreases with every step or tree added to the ensemble of trees 202. The training loss at each step may be a result of the performance of all the trees that preceded that step.
Similarity framework 130 may be trained to capture an importance of each tree in the ensemble of trees 202. The importance of a tree in the ensemble of trees 202 may be captured using an importance vector. To compute the importance vector, an absolute difference in the training loss is computed as follows:
$s_0 = |TL_1 - TL_0|$   Equation (2)

$s_i = |TL_i - TL_{i-1}| \quad \forall i \in \{1, 2, \ldots, K-1\}$   Equation (3)
Using the absolute difference in the training loss, the final importance weight for a tree may then be determined as follows:
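Equation (4) is not reproduced in this excerpt. For illustration only, the sketch below assumes that the importance weight of each tree is its absolute training-loss difference normalized so that the weights sum to one, which is consistent with the bounded distance described below; the function name and example values are assumptions.

```python
import numpy as np

def importance_weights(training_losses):
    """training_losses: the sequence (TL_0, TL_1, ..., TL_{K-1}) of training
    losses recorded as each tree is added.  Returns one weight per tree."""
    tl = np.asarray(training_losses, dtype=float)
    s = np.empty_like(tl)
    s[0] = abs(tl[1] - tl[0])               # Equation (2)
    s[1:] = np.abs(tl[1:] - tl[:-1])        # Equation (3)
    return s / s.sum()                      # assumed normalization (Equation (4))

# Example usage with a monotonically decreasing training loss.
print(importance_weights([1.00, 0.60, 0.45, 0.40, 0.38]))
```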
Once the trees in ensemble of trees 202 are identified and trained and the corresponding weights are determined, similarity framework 130 enters an inference stage. In the inference stage, similarity scores for different objects may be determined. For example, for a given ensemble of trees 202 (also referred to as an ensemble ƒ), similarity framework 130 may determine similarity between two objects X1 and X2 as follows. First, similarity framework 130 may propagate the two objects X1 and X2 down all the trees within ensemble ƒ by comparing features of the objects X1 and X2 to properties of the tree nodes until objects X1 and X2 reach the leaf nodes. Next, the leaf node position reached by object X1 and by object X2 in each tree is recorded. Let $Z_1 = (Z_{11}, Z_{12}, \ldots, Z_{1K})$ be the leaf node positions for object X1 and $Z_2 = (Z_{21}, Z_{22}, \ldots, Z_{2K})$ be the leaf node positions for object X2. Then, the similarity S between objects X1 and X2 across the ensemble may be determined as follows:
$S(X_1, X_2) = \sum_{i=0}^{K-1} I(Z_{1i} = Z_{2i}) \, w_i$   Equation (5)
where I is the indicator function. The distance score D between objects X1 and X2 may then be defined as:
$D(X_1, X_2) = 1 - S(X_1, X_2)$   Equation (6)
By construction, D is a number that may range from 0 to 1. Similarity framework 130 repeats this process to determine distances for multiple objectives using ensemble of trees 202, which results in multiple distances, or tree scores, given by $D_{OBJ1}(X_1, X_2)$, $D_{OBJ2}(X_1, X_2)$, and $D_{OBJ3}(X_1, X_2)$. Similarity framework 130 may combine these distances into a single distance, e.g., a weighted Euclidean distance, which is the overall object similarity score 206.
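The combining equation is not reproduced in this excerpt. For illustration only, one possible combination is a weighted Euclidean distance over the per-objective distances, sketched below; the objective weights and function names are assumptions.

```python
import numpy as np

def overall_similarity_score(per_objective_distances, objective_weights):
    """per_objective_distances: e.g., (D_OBJ1, D_OBJ2, D_OBJ3) for one object pair.
    Returns a single combined distance used as the object similarity score."""
    d = np.asarray(per_objective_distances, dtype=float)
    w = np.asarray(objective_weights, dtype=float)
    return float(np.sqrt(np.sum(w * d ** 2)))   # weighted Euclidean distance

# Example usage for one pair of objects scored against three objectives.
print(overall_similarity_score([0.2, 0.5, 0.1], [0.5, 0.3, 0.2]))
```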
Similarity framework 130 may determine similarity between both structured and unstructured objects. When objects X1 and X2 are structured objects, e.g., objects with features that may be found in a particular field of an object or otherwise quantified, similarity framework 130 may determine similarity score 206 as discussed above. When objects X1 and X2 are unstructured objects, e.g., objects with features that are qualitative, such as features included in objects that are text, images, etc., similarity framework 130 may first encode the features of unstructured objects X1 and X2 into encodings using encoder 204. Ensemble of trees 202 may be trained on the encodings and use the encodings to determine similarity score 206.
In some embodiments, similarity framework 130 may determine similarity scores for multiple objects.
In some embodiments, the GBM algorithm may estimate an unknown functional dependence between explanatory data x and a response variable y with an estimate $\hat{f}(x)$, such that some specified loss function $\Psi(y, f)$ is minimized, as follows:
The function estimation problem may be re-written in terms of expectations, where an equivalent formulation would be to minimize the expected loss function $E_y[\Psi(y, f(x))]$ over the response variable, conditioned on the observed explanatory data x:
The response variable y may come from different distributions, which leads to the specification of different loss functions Ψ. In particular, if the response variable is binary, i.e., y ∈ {0, 1}, the binomial loss function may be considered. If the response variable is continuous, i.e., y ∈ ℝ, the L2 squared loss function or the robust regression Huber loss function may be used. For other response distributions, specific loss functions may be designed. To make the problem of function estimation tractable, the function search space may be restricted to a parametric family of functions f(x, θ). This changes the function optimization problem into a parameter estimation problem:
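The parameter estimation equation itself is not reproduced in this excerpt. For illustration only, the loss function choices described above may be sketched in numpy as follows; the function names and the delta parameter of the Huber loss are assumptions.

```python
import numpy as np

def l2_loss(y, f):
    """Squared (L2) loss for a continuous response."""
    return 0.5 * (y - f) ** 2

def huber_loss(y, f, delta=1.0):
    """Robust Huber loss: quadratic near zero, linear in the tails."""
    r = y - f
    return np.where(np.abs(r) <= delta,
                    0.5 * r ** 2,
                    delta * (np.abs(r) - 0.5 * delta))

def binomial_loss(y, f):
    """Binomial (logistic) loss for a binary response y in {0, 1}, where f is
    the raw score (log-odds)."""
    return np.log(1.0 + np.exp(-(2.0 * y - 1.0) * f))
```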
Similarity framework 130 may use iterative numerical procedures to perform parameter estimation. In some embodiments, given M iteration steps, where M is an integer, the parameter estimates may be written in an incremental form as follows:
$\hat{\theta} = \sum_{i=1}^{M} \hat{\theta}_i$   Equation (13)
In some embodiments, the steepest gradient descent may be used to estimate the parameters. In the steepest gradient descent, given N data points $\{(x_i, y_i)\}_{i=1}^{N}$, the empirical loss function J(θ) is decreased over the observed data, as follows:
$J(\theta) = \sum_{i=1}^{N} \Psi(y_i, f(x_i, \theta))$   Equation (14)
The steepest descent optimization procedure may be based on consecutive improvements along the direction of the gradient of the loss function ∇J(θ). Because the parameter estimates $\hat{\theta}$ are presented in an incremental way, the estimate notation is distinguished as follows: the subscript index in $\hat{\theta}_t$ denotes the t-th incremental step of the estimate $\hat{\theta}$, while the superscript in $\hat{\theta}^t$ corresponds to the collapsed estimate of the whole ensemble, i.e., the sum of all the estimate increments from step 1 to step t. The steepest descent optimization procedure may be organized as follows.
$\hat{\theta}^{t-1} = \sum_{i=1}^{t-1} \hat{\theta}_i$   Equation (15)
$\hat{\theta}_t \leftarrow -\nabla J(\theta)$   Equation (17)
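For illustration only, the incremental steepest-descent estimation of Equations (13) through (17) may be sketched as follows, assuming a linear model f(x, θ) = xθ, the L2 loss, and a small fixed learning rate in place of an exact step size; these assumptions are made only for the example.

```python
import numpy as np

def gradient(theta, X, y):
    """Gradient of J(theta) = sum_i 0.5 * (y_i - X_i @ theta) ** 2 (Equation (14))."""
    return -X.T @ (y - X @ theta)

def steepest_descent(X, y, steps=500, learning_rate=0.001):
    increments = []                              # the increments theta_hat_t
    theta = np.zeros(X.shape[1])                 # collapsed estimate theta_hat^t
    for _ in range(steps):
        increment = -learning_rate * gradient(theta, X, y)   # Equation (17)
        increments.append(increment)
        theta = theta + increment                # running sum of increments, cf. Equation (15)
    return theta, increments

# Example usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
theta_hat, _ = steepest_descent(X, y)
```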
In some embodiments, similarity framework 130 may perform optimization that occurs in a function space. In this case, the function estimate $\hat{f}$ is parameterized in the additive functional form:
$\hat{f}(x) = \hat{f}^M(x) = \sum_{i=0}^{M} \hat{f}_i(x)$   Equation (18)
where M is the number of iterations, $\hat{f}_0$ is the initial guess, and $\{\hat{f}_i\}_{i=1}^{M}$ are the function increments, also referred to as "boosts".
In some embodiments, the parameterized "base-learner" functions h(x, θ) may be distinguished from the overall ensemble function estimate $\hat{f}(x)$. Different families of base-learner functions, such as decision trees or spline functions, may be selected.
In a “greedy stagewise” approach for incrementing the function with the base-learners, the optimal step-size ρ may be specified at each iteration. For the function estimate at the t-th iteration, the optimization rule may be defined as follows:
$\hat{f}_t \leftarrow \hat{f}_{t-1} + \rho_t h(x, \theta_t)$   Equation (19)
In some embodiments, similarity framework 130 may arbitrarily specify both the loss function Ψ(y, f) and the base-learner functions h(x, θ), on demand. In some embodiments, a new function $h(x, \theta_t)$ may be chosen to be the most parallel to the negative gradient $\{g_t(x_i)\}_{i=1}^{N}$ along the observed data:
In this way, instead of looking for a general solution for the boost increment in the function space, the new function increment may be chosen to be the one most correlated with $-g_t(x)$. This simplifies the optimization task to a least-squares minimization task:
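The least-squares equation itself is not reproduced in this excerpt. For illustration only, one boosting round under the L2 loss may be sketched as follows: the base learner (here a single-split stump) is fit by least squares to the negative gradient, i.e., the residuals, and added in a greedy stagewise step. The stump representation, helper names, and the fixed shrinkage used in place of the optimal step size ρ are assumptions.

```python
import numpy as np

def fit_stump(x, residual):
    """Least-squares fit of a single-split stump to the residuals."""
    best = None
    for threshold in np.unique(x):
        left, right = residual[x <= threshold], residual[x > threshold]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return lambda z: np.where(z <= threshold, left_value, right_value)

def boosting_round(x, y, f_prev, learning_rate=0.1):
    """One greedy stagewise update f_t <- f_{t-1} + rho * h (cf. Equation (19))."""
    residual = y - f_prev(x)              # negative gradient of the L2 loss
    h = fit_stump(x, residual)
    return lambda z: f_prev(z) + learning_rate * h(z)

# Example usage: boost a constant initial guess f_0 for several rounds.
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=300)
y = np.sin(x) + 0.1 * rng.normal(size=300)
f = lambda z: np.full_like(z, y.mean())   # initial guess f_0
for _ in range(50):
    f = boosting_round(x, y, f)
```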
In some embodiments, the GBM algorithm may be trained using Python or another programming language known in the art. The loss function Ψ(y, ƒ) may be the L2 loss. The GBM algorithm may train trees on residual vectors or sign vectors.
In some embodiments, the base learner function h(x, θ) that may be used is a decision tree stump and may restrict the total number of leaf nodes to a configurable number, e.g., sixteen leaf nodes.
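For illustration only, one possible off-the-shelf configuration consistent with the description above (L2 loss and a cap of sixteen leaf nodes per tree) is sketched below using scikit-learn; the library, parameter values, and synthetic data are assumptions, and the embodiments are not limited to this library.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic object features and a target for the sketch.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 8))
y_train = X_train[:, 0] - 2.0 * X_train[:, 3] + 0.1 * rng.normal(size=500)

gbm = GradientBoostingRegressor(
    loss="squared_error",   # the L2 loss
    n_estimators=100,       # number of trees in the ensemble
    max_leaf_nodes=16,      # restrict each tree to sixteen leaf nodes
    learning_rate=0.1,
)
gbm.fit(X_train, y_train)

# apply() reports, for each object, the leaf node reached in every tree,
# which is the information used by the similarity computation.
leaf_positions = gbm.apply(X_train[:2])
```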
To determine an optimal hyperparameter, multiple trees may be generated for each hyperparameter and scored. For example, for each hyperparameter, the features may be divided into a training dataset and a validation dataset. The trees, including properties and values of the properties at each node, may be generated with the GBM algorithm using the hyperparameter and the features in the training dataset. The trees may be validated with the features in the validation dataset, which verifies that objects in the dataset meet a particular objective. The trees may also be scored. After the trees based on the hyperparameter are generated, the hyperparameter may be scored by averaging the scores from the trees. An optimal hyperparameter may be determined using an "argmin" function, or another function, based on the scores associated with different hyperparameters. The "argmin" function, for example, identifies the hyperparameter associated with the lowest hyperparameter score from among the scored hyperparameters. The lowest hyperparameter score corresponds to the minimal loss discussed above.
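For illustration only, the hyperparameter selection described above may be sketched as follows, with the number of trees as the candidate hyperparameter, mean squared error on the validation dataset as the per-tree score, and scikit-learn as the underlying library; these choices and the function names are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def score_hyperparameter(n_trees, X_train, y_train, X_valid, y_valid):
    """Train trees with the candidate hyperparameter, score each tree on the
    validation dataset, and return the average tree score (lower is better)."""
    gbm = GradientBoostingRegressor(n_estimators=n_trees, max_leaf_nodes=16)
    gbm.fit(X_train, y_train)
    # staged_predict yields predictions after each tree is added, so each
    # stage can be scored individually and the scores averaged.
    tree_scores = [np.mean((y_valid - pred) ** 2)
                   for pred in gbm.staged_predict(X_valid)]
    return float(np.mean(tree_scores))

def select_hyperparameter(candidates, X_train, y_train, X_valid, y_valid):
    scores = [score_hyperparameter(c, X_train, y_train, X_valid, y_valid)
              for c in candidates]
    return candidates[int(np.argmin(scores))]   # the "argmin" selection
```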
The trees associated with different hyperparameters may include, but are not limited to, anywhere from five to three hundred trees and may have a tree depth anywhere from five to sixteen nodes. During training, the features and the properties of the features that are associated with each node are also determined.
At process 502, features are determined. For example, similarity framework 130 is trained on input data, which may be a training dataset of features. The features may be specific to objects of a particular type and may be extracted from an object. Features may be static, dynamic, or engineered. Static features may be features that do not change over a period of time. Dynamic features may be features that change over a period of time. Engineered features may be created using static and dynamic features. In some embodiments, when objects include unstructured data, static, dynamic, and engineered features may be encoded into structured features using encoder 204.
At process 504, an ensemble of trees is generated. For example, similarity framework 130 may generate an ensemble of trees 202 using the features and the GBM algorithm. The trees in the ensemble of trees 202 may be constructed to minimize a variance of the features. Specifically, the similarity framework 130 constructs and reconstructs trees using a base function that receives features as input and generates labels, such that the function loss during the reconstruction is minimized. Each tree in the ensemble of trees 202 may include one or more properties at each node of each tree with the exception of the leaf nodes.
At process 506, tree importance for each tree in the ensemble of trees is determined. For example, similarity framework 130 may determine an importance of each tree in the ensemble of trees 202 by determining the accuracy of the ensemble of trees 202 before and after each tree is added to ensemble of trees 202. The tree importance may correspond to how important the tree is to determining similarity between objects 140. The measure of the importance may be a weight having a value between zero and one.
Once method 500 completes, the similarity framework 130 has generated ensemble of trees 202 and determined the measure of importance of each tree in ensemble of trees 202. At this point, similarity framework 130 may enter an inference stage where the similarity framework 130 determines similarity between objects 140.
At process 602, objects are received. For example, similarity framework 130 receives objects 140. The objects 140 may be the same type of objects that were used to train similarity framework 130 to generate the ensemble of trees 202.
At process 604, objects are propagated through trees in the ensemble of trees. For example, similarity framework 130 propagates objects 140 received in process 602 through each tree in ensemble of trees 202 until objects 140 reach the leaf nodes of the trees. Typically, each object 140 may be propagated through each tree in the ensemble of trees 202. As the objects are propagated, the similarity framework 130 compares the features of the object 140 to properties of the nodes of the tree in the object's path to the leaf node.
At process 606, a similarity score for pairs of objects is determined. For example, similarity framework 130 may determine a similarity score 206 for every two objects. First, similarity framework 130 determines a similarity score for the pair of objects in each tree in ensemble of trees 202. In one instance, the similarity score for a pair of objects in the same tree may be one if the objects share the same leaf node and zero otherwise. In another instance, the similarity score may be a measure of a distance between the leaf node(s) of the tree that store the pair of objects. The similarity score for the pair of objects in each tree may be determined based on the tree distance and the tree height. For example, the similarity score may be a measure of the distance from the root node to the last node that the two objects share, divided by the depth of the tree. In some embodiments, the similarity score is further adjusted based on the tree importance. The object similarity score 206 may then be determined by combining the similarity scores for the pair of objects from each tree in the ensemble of trees 202. Process 606 repeats until similarity framework 130 determines similarity scores 206 for all pairs of objects in objects 140.
At process 608, a similarity matrix is generated. For example, the similarity score 206 for all pairs of objects determined in process 606 is stored in the similarity matrix 150.
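For illustration only, processes 604 through 608 may be sketched as follows, given the leaf node reached by every object in every tree and a weight per tree; the array names and example values are assumptions.

```python
import numpy as np

def similarity_matrix(leaf_positions, tree_weights):
    """leaf_positions: (n_objects, n_trees) array of leaf indices reached by
    each object in each tree.  tree_weights: (n_trees,) importance weights.
    Returns an (n_objects, n_objects) symmetric matrix of similarity scores."""
    n_objects = leaf_positions.shape[0]
    matrix = np.zeros((n_objects, n_objects))
    for i in range(n_objects):
        for j in range(n_objects):
            same_leaf = leaf_positions[i] == leaf_positions[j]   # per-tree indicator
            matrix[i, j] = np.sum(same_leaf * tree_weights)      # cf. Equation (5)
    return matrix

# Example usage with three objects, four trees, and equal tree weights.
leaf_positions = np.array([[3, 5, 2, 7],
                           [3, 5, 4, 7],
                           [1, 2, 4, 6]])
weights = np.full(4, 0.25)
print(similarity_matrix(leaf_positions, weights))
```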
Objects 140 may be transactions. When objects 140 are transactions, similarity framework 130 may be trained on training data that includes transaction features for a predefined objective. Once trained, similarity framework 130 may identify, e.g., for a fraud objective, similar and different transactions based on the transaction features. The transactions, or a cluster of transactions, that similarity framework 130 determines to be different may be considered outliers. An outlier transaction is a transaction that has different features from other transactions, or a transaction that is not represented in the training dataset. Outlier transactions may be indicative of fraud. In another example, outlier transactions may be indicative of data errors elsewhere in a transaction processing system. For example, suppose similarity framework 130 is trained on a training dataset that includes previous transactions that passed through a transaction system and that are known to be genuine or to include valid data. During an inference stage, similarity framework 130 may identify outlier transactions, i.e., transactions that are different from the transactions that previously passed through the transaction system and were included in the training data. Different transactions may be transactions that have a similarity score below a similarity threshold for one or more entries in similarity matrix 150. These transactions may be indicative of fraudulent transactions or transactions that include erroneous data.
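For illustration only, the outlier screening described above may be sketched as follows: each incoming transaction's similarity scores against the known-genuine training transactions are compared to a threshold, and transactions whose best match falls below the threshold are flagged. The threshold value and function names are assumptions.

```python
import numpy as np

def flag_outliers(similarity_to_training, threshold=0.2):
    """similarity_to_training: (n_new, n_training) matrix of similarity scores
    between incoming transactions and training transactions.  Returns a boolean
    mask marking transactions whose best match is below the threshold."""
    best_match = similarity_to_training.max(axis=1)
    return best_match < threshold

# Example usage: the second transaction resembles nothing seen in training.
scores = np.array([[0.90, 0.70, 0.80],
                   [0.10, 0.05, 0.15]])
print(flag_outliers(scores))   # [False  True]
```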
In some embodiments, similarity framework 130 may quantify uncertainty in the data. For example, similarity framework 130 may be trained on a training dataset that includes objects. Once trained, similarity framework 130 may receive objects and determine how similar an object in objects 140 is to the data in the training dataset. An object that is not similar may be considered an outlier or be out of training distribution. In some instances, similarity framework 130 may also include a classifier. The classifier may indicate how similar object 140 may be to data in the training dataset.
Some examples of computing devices, such as computing device 100 may include non-transitory, tangible, machine readable media that include executable code that when run by one or more processors (e.g., processor 110) may cause the one or more processors to perform the processes of methods 500 and 600. Some common forms of machine readable media that may include the processes of methods 500 and 600 are, for example, floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
This description and the accompanying drawings that illustrate inventive aspects, embodiments, implementations, or applications should not be taken as limiting. Various mechanical, compositional, structural, electrical, and operational changes may be made without departing from the spirit and scope of this description and the claims. In some instances, well-known circuits, structures, or techniques have not been shown or described in detail in order not to obscure the embodiments of this disclosure. Like numbers in two or more figures represent the same or similar elements.
In this description, specific details are set forth describing some embodiments consistent with the present disclosure. Numerous specific details are set forth in order to provide a thorough understanding of the embodiments. It will be apparent, however, to one skilled in the art that some embodiments may be practiced without some or all of these specific details. The specific embodiments disclosed herein are meant to be illustrative but not limiting. One skilled in the art may realize other elements that, although not specifically described here, are within the scope and the spirit of this disclosure. In addition, to avoid unnecessary repetition, one or more features shown and described in association with one embodiment may be incorporated into other embodiments unless specifically described otherwise or if the one or more features would make an embodiment non-functional.
Although illustrative embodiments have been shown and described, a wide range of modification, change and substitution is contemplated in the foregoing disclosure and in some instances, some features of the embodiments may be employed without a corresponding use of other features. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. Thus, the scope of the invention should be limited only by the following claims, and it is appropriate that the claims be construed broadly and in a manner consistent with the scope of the embodiments disclosed herein.
This application claims priority under 35 U.S.C. 119 to U.S. Provisional Application No. 63/256,129, filed Oct. 15, 2021, which is hereby incorporated by reference herein in its entirety.