The following relates generally to data augmentation. Data augmentation, or data completion, refers to the process of predicting unseen data based on existing data. Data augmentation is a field of data analysis, which is the process of inspecting, cleaning, transforming, and modeling data. In some cases, a dataset has missing cell values in certain rows. These rows may not be selected for visualizations, analysis, or training a machine learning model due to incomplete data. Datasets with missing cell values lead to models with poor predictive performance, bias, and a lack of generalizability. Additionally, models trained on datasets with missing cell values generate incorrect predictions and visualizations (e.g., in a dashboard application).
Data visualization models often generate visualizations based on a dataset using data points with no missing cell values. In the event that a dataset has data points with missing cell values, data visualization models often exclude those data points. However, these models produce inaccurate analyses, visualizations, and predictions because they exclude cell values that are incomplete (i.e., missing) but important. Therefore, there is a need in the art for an improved data augmentation system that can efficiently manage data completion for datasets.
The present disclosure describes systems and methods for data augmentation. Embodiments of the disclosure include a data augmentation apparatus configured to compute a probability of an additional edge based on a dataset. Some embodiments of the present disclosure provide for augmenting datasets using a graph model that includes clusters that are both homophilous (clustered nodes likely to be connected) and heterophilous (clustered nodes unlikely to be connected). The data augmentation apparatus, via the graph model, can predict or fill missing values in the dataset. The graph model is generated based on nonnegative matrix factorization that represents both a homophilous cluster and a heterophilous cluster. In some examples, the graph model is trained to output link probabilities which are interpretable in terms of the clusters (e.g., communities) it detects.
A method, apparatus, and non-transitory computer readable medium for data augmentation are described. One or more embodiments of the method, apparatus, and non-transitory computer readable medium include receiving a dataset that includes a plurality of nodes and a plurality of edges, wherein each of the plurality of edges connects two of the plurality of nodes; computing a first nonnegative matrix representing a homophilous cluster affinity; computing a second nonnegative matrix representing a heterophilous cluster affinity; computing a probability of an additional edge based on the dataset using a machine learning model that represents a homophilous cluster and a heterophilous cluster based on the first nonnegative matrix and the second nonnegative matrix; and generating an augmented dataset including the plurality of nodes, the plurality of edges, and the additional edge.
A method, apparatus, and non-transitory computer readable medium for data augmentation are described. One or more embodiments of the method, apparatus, and non-transitory computer readable medium include receiving a dataset that includes a plurality of nodes and a plurality of edges, wherein each of the plurality of edges connects two of the plurality of nodes; computing a first nonnegative matrix representing a homophilous cluster affinity; computing a second nonnegative matrix representing a heterophilous cluster affinity; computing a predicted probability of an edge of the plurality of edges based on the first nonnegative matrix and the second nonnegative matrix using a machine learning model that represents a homophilous cluster and a heterophilous cluster; and updating parameters of the machine learning model based on the predicted probability of the edge.
An apparatus and method for data augmentation are described. One or more embodiments of the apparatus and method include a processor; a memory including instructions executable by the processor; a machine learning model configured to compute a probability of an additional edge for a dataset that includes a plurality of nodes and a plurality of edges based on a first nonnegative matrix representing a homophilous cluster affinity and a second nonnegative matrix representing a heterophilous cluster affinity, wherein the machine learning model represents a homophilous cluster and a heterophilous cluster of the plurality of nodes; and a data augmentation component configured to generate an augmented dataset including the plurality of nodes, the plurality of edges, and the additional edge.
The present disclosure describes systems and methods for data augmentation. Embodiments of the disclosure include a data augmentation apparatus configured to compute a probability of a new edge based on a dataset. The data augmentation apparatus includes a generative graph model configured to represent heterophily and overlapping clusters based on nonnegative matrix factorization. The graph model outputs probabilities of additional links or edges which are interpretable in terms of the clusters (e.g., communities) it detects. According to some embodiments, a training component performs initialization of the nonnegative factors using arbitrary real factors generated by logistic principal components analysis (LPCA).
In some cases, a machine learning model is trained based on a dataset that has missing data points (i.e., values). The rows that have missing values are often excluded from training the model. For example, a visualization application (e.g., a dashboard) generates visualizations (e.g., a scatterplot) from a subset of attributes in a user dataset of interest using exclusively the rows or data points with no missing values. However, the rows with complete values may be biased and non-representative of the data points in the user dataset. Accordingly, without interpretable data augmentation, the trained machine learning model produces visualizations with poor predictive performance, bias, and incorrect analyses and conclusions.
Embodiments of the present disclosure include a data augmentation apparatus configured to train a graph model and use the trained graph model to augment a dataset with missing values to obtain an augmented dataset. The data augmentation apparatus uses the affinity of a node towards a community (i.e., a homophilous cluster or a heterophilous cluster) and the effect of the node's participation in the community to capture community affinities of the nodes. The affinities then increase or decrease the probability of a link. For example, two nodes participating in a same homophilous community increases the probability of a link between the two nodes, while two nodes participating in a same heterophilous community decreases the probability of a link between the two nodes. In some cases, a training component of the data augmentation apparatus is configured to minimize a loss function based on link predictions over the pairs of nodes in a graph. In some cases, the terms "communities" and "clusters" may be used interchangeably.
Some embodiments of the present disclosure include a generative graph model that is able to capture homophily and heterophily in the data. Heterophilous structure, or heterophily, refers to a graph structure where links are present between dissimilar nodes, such as interactions between men and women. Homophilous structure, or homophily, refers to linking between similar nodes, i.e., clustered nodes that are likely to be connected. The model produces nonnegative node representations, which allow link predictions to be interpreted in terms of node clusters, and outputs edge probabilities. By integrating homophily and heterophily, the graph model measures community affinities of the nodes and uses these affinities to increase or decrease the probability of a link between nodes. The generative graph model is explainable (interpretable) while being naturally expressive enough to capture both heterophily and homophily. The data augmentation apparatus can capture community overlap and heterophily in the data and provides high-level intuitions about the different communities. This way, the graph model is generalizable to model real-world data, where heterophily is commonly present.
In some embodiments, the data augmentation apparatus receives a dataset that includes nodes and edges that form connections between nodes. A machine learning model that represents homophilous and heterophilous clusters computes the probability of an additional edge based on the received dataset. A training component of the data augmentation apparatus updates the parameters of the machine learning model based on the predicted probability of the edge. A data augmentation component then generates an augmented dataset including the nodes, edges, and the additional edge, thus filling connections between attributes and missing data points or cell values in the dataset.
In some embodiments of the present disclosure, the data augmentation apparatus computes nonnegative matrices representing homophilous and heterophilous cluster affinities. The training component is configured to fit a logistic principal components analysis (LPCA) model, yielding unconstrained factors, and then to process the unconstrained factors into nonnegative initializations. Further, the model computes the products of the nonnegative matrices with their corresponding transpose matrices and takes the difference of the products to obtain a symmetric matrix. The model is trained to minimize the cross-entropy loss of the link predictions over the pairs of nodes. Regularization of the factors is performed subject to the nonnegativity of the node representations. In some examples, regularization methods are applied to the nonnegative representations, followed by normalization and ranking (i.e., ranking the nonnegative representations based on the normalization). Thus, the machine learning model computes edge probabilities that indicate the probability of a link between nodes.
Embodiments of the present disclosure include a graph model configured to represent heterophily and overlapping communities based on a dataset. The graph model is based on nonnegative matrix factorization and outputs link probabilities which are interpretable in terms of the communities the graph model detects. The graph model performs initialization of the nonnegative factors using arbitrary real factors generated by logistic principal components analysis (LPCA). The data augmentation apparatus can represent a natural class of graphs, with a small number of communities, which exhibits heterophily and overlapping communities. Example experiments and evaluation demonstrate that the apparatus, methods, and algorithms described in the present disclosure are competitive on real-world graphs in areas such as network representation, interpretable link prediction, and detecting communities that align with ground truth.
Embodiments of the disclosure may be used in the context of data augmentation. For example, a data augmentation system includes a graph-based generative model that receives a dataset (e.g., a tabular dataset or any user-uploaded data) including nodes and edges and generates an augmented dataset that can be used to predict missing cell values. In some examples, the graph model computes community affinities of the nodes and uses these affinities to increase or decrease the probability of a link between nodes. The data augmentation system can capture community overlap and heterophily in the dataset and provides information about the different communities. An example application, according to some embodiments, is provided with reference to
In
Some examples of the apparatus and method further include a cluster affinity component configured to compute a cluster affinity matrix based on the first nonnegative matrix and the second nonnegative matrix, wherein the machine learning model includes the cluster affinity matrix. Some examples of the apparatus and method further include a training component configured to update parameters of the machine learning model based on the probability of the additional edge. In some embodiments, the training component comprises a logistic principal components analysis (LPCA) model.
As an example shown in
In the above example, the original dataset has missing cell values, i.e., an organization attribute corresponding to user A and user D. Data augmentation apparatus 110 can predict the missing cell values to obtain an augmented dataset. Users that have the same IP address have an increased chance of belonging to the same organization. The augmented dataset is returned to user 100 via cloud 115 and user device 105. User 100 may then utilize the augmented dataset for improved visualizations, analysis, model training, etc.
A user interface may enable user 100 to interact with a device. In some embodiments, the user interface may include an audio device, such as an external speaker system, an external display device such as a display screen, or an input device (e.g., a remote control device interfaced with the user interface directly or through an I/O controller module). In some cases, the user interface may be a graphical user interface (GUI).
User device 105 may be a personal computer, laptop computer, mainframe computer, palmtop computer, personal assistant, mobile device, or any other suitable processing apparatus. In some examples, user device 105 includes software that incorporates a data augmentation application. The data augmentation application may either include or communicate with data augmentation apparatus 110. In some examples, the data augmentation application on user device 105 may include functions of data augmentation apparatus 110.
Data augmentation apparatus 110 includes a computer implemented network comprising a machine learning model that further includes a cluster affinity component and a data augmentation component. Data augmentation apparatus 110 may also include a processor unit, a memory unit, an I/O module, and a training component. Additionally, data augmentation apparatus 110 can communicate with database 120 via cloud 115. In some cases, the architecture of the data augmentation network is also referred to as a network or a network model. Further detail regarding the architecture of data augmentation apparatus 110 is provided with reference to
In some cases, data augmentation apparatus 110 is implemented on a server. A server provides one or more functions to users linked by way of one or more of the various networks. In some cases, the server includes a single microprocessor board, which includes a microprocessor responsible for controlling all aspects of the server. In some cases, a server uses one or more microprocessors and protocols to exchange data with other devices or users on one or more of the networks via hypertext transfer protocol (HTTP) and simple mail transfer protocol (SMTP), although other protocols such as file transfer protocol (FTP) and simple network management protocol (SNMP) may also be used. In some cases, a server is configured to send and receive hypertext markup language (HTML) formatted files (e.g., for displaying web pages). In various embodiments, a server comprises a general-purpose computing device, a personal computer, a laptop computer, a mainframe computer, a supercomputer, or any other suitable processing apparatus.
Cloud 115 is a computer network configured to provide on-demand availability of computer system resources, such as data storage and computing power. In some examples, cloud 115 provides resources without active management by the user. The term cloud is sometimes used to describe data centers available to many users over the Internet. Some large cloud networks have functions distributed over multiple locations from central servers. A server is designated an edge server if it has a direct or close connection to a user. In some cases, cloud 115 is limited to a single organization. In other examples, cloud 115 is available to many organizations. In one example, cloud 115 includes a multi-layer communications network comprising multiple edge routers and core routers. In another example, cloud 115 is based on a local collection of switches in a single physical location.
Database 120 is an organized collection of data. For example, database 120 stores data in a specified format known as a schema. Database 120 may be structured as a single database, a distributed database, multiple distributed databases, or an emergency backup database. In some cases, a database controller may manage data storage and processing in database 120. In some cases, a user interacts with database controller. In other cases, database controller may operate automatically without user interaction.
Processor unit 200 is an intelligent hardware device (e.g., a general-purpose processing component, a digital signal processor (DSP), a central processing unit (CPU), a graphics processing unit (GPU), a microcontroller, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a programmable logic device, a discrete gate or transistor logic component, a discrete hardware component, or any combination thereof). In some cases, processor unit 200 is configured to operate a memory array using a memory controller. In other cases, a memory controller is integrated into processor unit 200. In some cases, processor unit 200 is configured to execute computer-readable instructions stored in a memory to perform various functions. In some embodiments, processor unit 200 includes special-purpose components for modem processing, baseband processing, digital signal processing, or transmission processing.
Examples of memory unit 205 include random access memory (RAM), read-only memory (ROM), or a hard disk. Examples of memory unit 205 include solid state memory and a hard disk drive. In some examples, memory unit 205 is used to store computer-readable, computer-executable software including instructions that, when executed, cause a processor to perform various functions described herein. In some cases, memory unit 205 contains, among other things, a basic input/output system (BIOS) which controls basic hardware or software operation such as the interaction with peripheral components or devices. In some cases, a memory controller operates memory cells. For example, the memory controller can include a row decoder, column decoder, or both. In some cases, memory cells within memory unit 205 store information in the form of a logical state.
I/O module 210 includes an I/O controller. The I/O controller may manage input and output signals for a device. I/O controller may also manage peripherals not integrated into a device. In some cases, an I/O controller may represent a physical connection or port to an external peripheral. In some cases, an I/O controller may utilize an operating system such as iOS®, ANDROID®, MS-DOS®, MS-WINDOWS®, OS/2®, UNIX®, LINUX®, or another known operating system. In other cases, an I/O controller may represent or interact with a modem, a keyboard, a mouse, a touchscreen, or a similar device. In some cases, an I/O controller may be implemented as part of a processor. In some cases, a user may interact with a device via I/O controller or via hardware components controlled by an I/O controller.
In some examples, I/O module 210 includes a user interface. A user interface may enable a user to interact with a device. In some embodiments, the user interface may include an audio device, such as an external speaker system, an external display device such as a display screen, or an input device (e.g., a remote control device interfaced with the user interface directly or through an I/O controller module). In some cases, a user interface may be a graphical user interface (GUI). In some examples, a communication interface operates at the boundary between communicating entities and the channel and may also record and process communications. A communication interface may be provided to couple a processing system to a transceiver (e.g., a transmitter and/or a receiver). In some examples, the transceiver is configured to transmit (or send) and receive signals for a communications device via an antenna.
According to an embodiment, training component 215 receives a dataset that includes a set of nodes and a set of edges, where each of the set of edges connects two of the set of nodes. In some examples, training component 215 updates parameters of machine learning model 220 based on the predicted probability of the edge.
According to some embodiments, training component 215 selects a regularization term. In some examples, training component 215 applies the regularization term to the first nonnegative matrix to obtain a regularized first nonnegative matrix. Furthermore, training component 215 applies the regularization term to the second nonnegative matrix to obtain a regularized second nonnegative matrix, where the parameters of machine learning model 220 are updated based on the regularized first nonnegative matrix and the regularized second nonnegative matrix.
According to some embodiments, training component 215 computes an L2 norm of a set of columns of the first nonnegative matrix. Training component 215 ranks the set of columns based on the L2 norm, where the predicted probability of the edge is computed based on the ranking. In some examples, training component 215 computes an L2 norm of a set of columns of the second nonnegative matrix. Training component 215 ranks the set of columns of the second nonnegative matrix based on the L2 norm, where the predicted probability of the edge is computed based on the ranking.
According to some embodiments, training component 215 is configured to update parameters of machine learning model 220 based on the probability of the additional edge. In some embodiments, training component 215 includes a logistic principal components analysis (LPCA) model. In some examples, training component 215 is part of an apparatus other than data augmentation apparatus 235.
According to some embodiments of the present disclosure, data augmentation apparatus 235 includes a computer implemented artificial neural network (ANN). An ANN is a hardware or a software component that includes a number of connected nodes (i.e., artificial neurons), which loosely correspond to the neurons in a human brain. Each connection, or edge, transmits a signal from one node to another (like the physical synapses in a brain). When a node receives a signal, it processes the signal and then transmits the processed signal to other connected nodes. In some cases, the signals between nodes comprise real numbers, and the output of each node is computed by a function of the sum of its inputs. Each node and edge is associated with one or more node weights that determine how the signal is processed and transmitted.
Accordingly, during the training process, the parameters and weights of machine learning model 220 are adjusted to increase the accuracy of the result (i.e., by attempting to minimize a loss function which corresponds to the difference between the current result and the target result at a time of training). The weight of an edge increases or decreases the strength of the signal transmitted between nodes. In some cases, nodes have a threshold below which a signal is not transmitted at all. In some examples, the nodes are aggregated into layers. Different layers perform different transformations on their inputs. The initial layer is known as the input layer and the last layer is known as the output layer. In some cases, signals traverse certain layers multiple times.
According to an embodiment, machine learning model 220 receives a dataset that includes a set of nodes and a set of edges, where each of the set of edges connects two of the set of nodes. In some examples, machine learning model 220 computes a first nonnegative matrix representing a homophilous cluster affinity. Machine learning model 220 computes a second nonnegative matrix representing a heterophilous cluster affinity. Furthermore, machine learning model 220, which represents a homophilous cluster and a heterophilous cluster based on the first nonnegative matrix and the second nonnegative matrix, computes a probability of an additional edge based on the dataset.
According to some embodiments, machine learning model 220 adds the additional edge to the dataset to obtain the augmented dataset. In some examples, machine learning model 220 computes a first product of the first nonnegative matrix and a transpose of the first nonnegative matrix. Machine learning model 220 computes a second product of the second nonnegative matrix and a transpose of the second nonnegative matrix. Further, machine learning model 220 computes a difference between the first product and the second product to obtain a symmetric difference matrix. Machine learning model 220 applies a nonnegative nonlinear function to the symmetric difference matrix, where the probability of the additional edge is based on the nonnegative nonlinear function.
According to some embodiments, machine learning model 220 computes a first nonnegative matrix representing a homophilous cluster affinity. In some examples, machine learning model 220 computes a second nonnegative matrix representing a heterophilous cluster affinity. In some examples, machine learning model 220, which represents a homophilous cluster and a heterophilous cluster, computes a predicted probability of an edge of the set of edges based on the first nonnegative matrix and the second nonnegative matrix.
In some examples, machine learning model 220 computes a first product of the first nonnegative matrix and a transpose of the first nonnegative matrix. Machine learning model 220 computes a second product of the second nonnegative matrix and a transpose of the second nonnegative matrix. Machine learning model 220 computes a difference between the first product and the second product to obtain a symmetric difference matrix. Machine learning model 220 applies a nonnegative nonlinear function to the symmetric difference matrix, where the predicted probability of the edge is based on the nonnegative nonlinear function.
According to some embodiments, machine learning model 220 is configured to compute a probability of an additional edge for a dataset that includes a plurality of nodes and a plurality of edges based on a first nonnegative matrix representing a homophilous cluster affinity and a second nonnegative matrix representing a heterophilous cluster affinity, wherein machine learning model 220 represents a homophilous cluster and a heterophilous cluster of the plurality of nodes.
In one embodiment, machine learning model 220 includes cluster affinity component 225 and data augmentation component 230. Machine learning model 220 is an example of, or includes embodiments of, the corresponding element described with reference to
According to some embodiments, cluster affinity component 225 identifies a number of clusters, where a sum of a dimension of the first nonnegative matrix and a dimension of the second nonnegative matrix is equal to the number of clusters. In some examples, cluster affinity component 225 computes a first factor matrix and a second factor matrix, where the first factor matrix or the second factor matrix includes a negative value, and where the first nonnegative matrix and the second nonnegative matrix are computed based on the first factor matrix and the second factor matrix. Cluster affinity component 225 computes a cluster affinity matrix based on the first nonnegative matrix and the second nonnegative matrix, where the machine learning model 220 includes the cluster affinity matrix. Cluster affinity component 225 is an example of, or includes embodiments of, the corresponding element described with reference to
According to an embodiment, data augmentation component 230 generates an augmented dataset including the set of nodes, the set of edges, and the additional edge. Data augmentation component 230 is an example of, or includes embodiments of, the corresponding element described with reference to
The described methods may be implemented or performed by devices that include a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof. A general-purpose processor may be a microprocessor, a conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices (e.g., a combination of a DSP and a microprocessor, multiple microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration). Thus, the functions described herein may be implemented in hardware or software and may be executed by a processor, firmware, or any combination thereof. If implemented in software executed by a processor, the functions may be stored in the form of instructions or code on a computer-readable medium.
Computer-readable media includes both non-transitory computer storage media and communication media including any medium that facilitates transfer of code or data. A non-transitory storage medium may be any available medium that can be accessed by a computer. For example, non-transitory computer-readable media can comprise random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), compact disk (CD) or other optical disk storage, magnetic disk storage, or any other non-transitory medium for carrying or storing data or code.
Also, connecting components may be properly termed computer-readable media. For example, if code or data is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technology such as infrared, radio, or microwave signals, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technology are included in the definition of medium. Combinations of media are also included within the scope of computer-readable media.
According to an embodiment of the present disclosure, the machine learning model 300 is configured to extract nodes and edges from a received dataset to compute non-negative matrices representing cluster affinity. Further, the machine learning model 300 computes a probability of an additional edge based on the dataset representing homophilous and heterophilous clusters which is then input to data augmentation component 310.
As an example shown in
According to an embodiment, machine learning model 300 includes a factorization-based graph model that is sufficiently expressive to capture heterophily in the dataset. Machine learning model 300 produces nonnegative node representations which generate link predictions to be interpreted in terms of node clusters. Additionally, machine learning model 300 outputs edge probabilities and optimizes real-world graphs with gradient descent on a cross-entropy loss. In some cases, expressiveness of a machine learning model refers to the ability of the model to reconstruct a graph using a number of clusters that is linear in the maximum degree and the ability of the model to capture heterophily and homophily in the graph.
In
Some examples of the method, apparatus, and non-transitory computer readable medium further include providing a content item to a user based on the augmented dataset, wherein the user and the content item are represented by the plurality of nodes. Some examples of the method, apparatus, and non-transitory computer readable medium further include adding the additional edge to the dataset to obtain the augmented dataset.
Some examples of the method, apparatus, and non-transitory computer readable medium further include computing a first product of the first nonnegative matrix and a transpose of the first nonnegative matrix. Some examples further include computing a second product of the second nonnegative matrix and a transpose of the second nonnegative matrix. Some examples further include computing a difference between the first product and the second product to obtain a symmetric difference matrix. Some examples further include applying a nonnegative nonlinear function to the symmetric difference matrix, wherein the probability of the additional edge is based on the nonnegative nonlinear function.
Some examples of the method, apparatus, and non-transitory computer readable medium further include identifying a number of clusters, wherein a sum of a dimension of the first nonnegative matrix and a dimension of the second nonnegative matrix is equal to the number of clusters.
Some examples of the method, apparatus, and non-transitory computer readable medium further include computing a first factor matrix and a second factor matrix, wherein the first factor matrix or the second factor matrix includes a negative value, and wherein the first nonnegative matrix and the second nonnegative matrix are computed based on the first factor matrix and the second factor matrix.
Some examples of the method, apparatus, and non-transitory computer readable medium further include computing a cluster affinity matrix based on the first nonnegative matrix and the second nonnegative matrix, wherein the machine learning model includes the cluster affinity matrix.
At operation 405, the user provides a spreadsheet (e.g., a dataset) that includes missing entries to the system. In some cases, the operations of this step refer to, or may be performed by, a user as described with reference to
At operation 410, the system constructs relationships based on the spreadsheet (e.g., dataset). In some cases, the operations of this step refer to, or may be performed by, a data augmentation apparatus as described with reference to
At operation 415, the system generates values for the missing entries based on the constructed relationships. In some cases, the operations of this step refer to, or may be performed by, a data augmentation apparatus as described with reference to
At operation 420, the system displays the data, for example as an augmented spreadsheet. In some cases, the operations of this step refer to, or may be performed by, a data augmentation apparatus as described with reference to
As shown in
As an example shown in
probability of second edge 520 between a pair of nodes. For example, second edge 520 connects first user 505 and first webpage 510. First webpage 510 is an example of, or includes embodiments of, the corresponding element described with reference to
First user 505, second user 506, and third user 507 are in the center of
Machine learning model 220 can predict which employer employs third user 507. Based on known links, machine learning model 220 predicts whether unknown links exist and assigns probabilities to the existence of unknown links. A higher probability indicates that the corresponding link is more likely to exist. In this example, third user 507 is more similar to second user 506 than to first user 505 in terms of the known links for the content items and webpages (second user 506 and third user 507 both like third content item 502 and third webpage 512). Thus, third user 507 is found to be distinct from first user 505. Accordingly, machine learning model 220 predicts a high probability for second employer 530, which employs second user 506, to be the employer of third user 507. Machine learning model 220 predicts a low probability for first employer 525 to be the employer of third user 507. As a result, machine learning model 220 can generate and fill in the missing data. In some cases, missing data can make visualizations impractical. When making a plot of a few attributes, a system may know those attributes for only a small fraction of users, so the plots are sparsely populated with points. According to an embodiment, machine learning model 220 can identify missing links between nodes or data points and generate new edges connecting the nodes.
For example, machine learning model 220 may fit first cluster 625, which is dense with known links. Along with which nodes are in the cluster, machine learning model 220 learns a number indicating how being inside the cluster changes the predicted odds of a link existing. For example, machine learning model 220 doubles the predicted odds of the links inside first cluster 625. So for the two unknown links within the cluster, since the model starts with baseline 1-to-1 odds of a link vs. no link, machine learning model 220 now predicts 2-to-1 odds of these links existing.
Machine learning model 220 also learns second cluster 630, which is also dense with known edges. Machine learning model 220 triples the predicted odds of a link inside second cluster 630. These effects compound as follows. For the unknown link in the middle, at the intersection of first cluster 625 and second cluster 630, machine learning model 220 predicts 6-to-1 odds of that link existing, that is, of a user visiting that website. That is a 6 in 7 probability. For context behind graph 600 and weighted graph 620, second cluster 630 involves, for example, Canadian users, Canadian websites, and Canadian products. Links between these nodes are more likely to exist (i.e., the probability of link existence is high). Similarly, the nodes in first cluster 625 represent, for example, sports fans, sports websites, and sports products.
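To make the compounding arithmetic explicit, a worked version of this example (using the illustrative doubling and tripling factors above, and the odds-to-probability conversion described later in the disclosure) is:

```latex
% Odds compounding for the link at the intersection of the two clusters;
% the per-cluster factors 2 and 3 are the illustrative values from the example.
\[
  o = 2 \times 3 = 6, \qquad p = \frac{o}{1 + o} = \frac{6}{7} \approx 0.857.
\]
```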
Certain communities may lower the odds of a link. For example, third cluster 635 on the bottom right seems relatively sparse. Its odds factor is less than one, so it reduces the predicted odds of the links existing. For example, third cluster 635 represents vegetarian users and products with meat; vegetarian users are unlikely to buy meat-based products, so links between such users and products are unlikely to exist. In some cases, the thicker the dashed line, the greater the predicted probability of that link existing. Machine learning model 220 predicts higher probabilities where the known links are denser.
According to an embodiment, cluster affinity component 305 shown in
Work on clustering and community detection with graph generative models suggests that the addition of a nonlinear linking function, e.g., softmax and logistic nonlinearities, can increase the expressiveness of matrix factorization-based graph models. In some examples, overlapping communities can be determined, such as in link clustering, to better match ground-truth communities.
According to an embodiment, the data augmentation apparatus 235 can represent a natural family of community overlap threshold (hereinafter COT) graphs, which exhibits homophily and heterophily, with small k and interpretably, where the value of k is based on the number of communities. A community overlap threshold graph refers to an unweighted, undirected graph whose edges are determined by an overlapping clustering and a thresholding integer t∈ℕ as follows: for each vertex i, there are two latent binary vectors b_i∈{0,1}^{k_B} and c_i∈{0,1}^{k_C}, and an edge connects vertices i and j if and only if b_i·b_j − c_i·c_j ≥ t.
At operation 705, the system receives a dataset that includes a set of nodes and a set of edges, where each of the set of edges connects two of the set of nodes. In some cases, the operations of this step refer to, or may be performed by, a machine learning model as described with reference to
Machine learning model 220 as shown in
Let m_B and m_C be the vectors containing the maximums of each column of B and C. By setting

V = (B × diag(m_B^{−1}); C × diag(m_C^{−1}))

W = diag((+m_B^2; −m_C^2))   (1)

the constraint on V is satisfied; further, VWV^T = BB^T − CC^T, so Ã := σ(BB^T − CC^T) = σ(VWV^T), where BB^T is the product of a first nonnegative matrix and a transpose of the first nonnegative matrix, CC^T is the product of a second nonnegative matrix and a transpose of the second nonnegative matrix, and σ is a logistic function. Equation 1 is an example of, or includes embodiments of, corresponding elements described with reference to
Here, if v_i∈[0,1]^k is the i-th row of matrix V, then v_i is the soft (normalized) assignment of node i to the k communities. The diagonal entries of W represent the strength of the homophily (if positive) or heterophily (if negative) of the communities. For each entry, Ã_{i,j} = σ(v_i W v_j^T). In some cases, the two forms can be used interchangeably. In some examples, σ is used to denote a nonnegative nonlinear function.
One theorem relates to compact representation of COT graphs. Suppose A is the adjacency matrix of a COT graph on n nodes with latent binary vectors b_i∈{0,1}^{k_B} and c_i∈{0,1}^{k_C} and thresholding integer t. Then there exist nonnegative matrices B and C with k_B + k_C + 1 total columns such that σ(BB^T − CC^T) approximates A arbitrarily closely.
The data augmentation apparatus can process the latent vectors into community factors B, C, and the thresholding integer can be handled with an extra community. W represents a cluster affinity matrix, and the diagonal entries of W represent the strength of the homophily (if positive) or heterophily (if negative) of the communities.
According to an embodiment, nodes in COT graphs share an edge if they co-participate in a number of homophilous communities and do not co-participate in a number of heterophilous communities. For example, in a graph, an edge occurs between two users if the two users are from the same city (e.g., a homophilous community) and have different genders (e.g., a heterophilous community).
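A minimal sketch of this construction in code, assuming the thresholded edge rule b_i·b_j − c_i·c_j ≥ t reconstructed above (the function name and the example memberships are illustrative):

```python
import numpy as np

def cot_adjacency(B_bin, C_bin, t):
    """COT graph sketch: edge (i, j) iff homophilous co-participation minus
    heterophilous co-participation reaches the threshold t (assumed rule)."""
    score = B_bin @ B_bin.T - C_bin @ C_bin.T   # co-participation counts
    A = (score >= t).astype(int)
    np.fill_diagonal(A, 0)                      # no self-loops
    return A

# Three users in the same city (homophilous community); users 0 and 2
# share a gender (heterophilous community), so only the mixed pairs link.
B_bin = np.array([[1], [1], [1]])
C_bin = np.array([[1], [0], [1]])
print(cot_adjacency(B_bin, C_bin, t=1))         # edges (0,1) and (1,2) only
```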
At operation 710, the system computes a first nonnegative matrix representing a homophilous cluster affinity. In some cases, the operations of this step refer to, or may be performed by, a machine learning model as described with reference to
At operation 715, the system computes a second nonnegative matrix representing a heterophilous cluster affinity. In some cases, the operations of this step refer to, or may be performed by, a machine learning model as described with reference to
According to an embodiment of the present disclosure, machine learning model 220 (see
At operation 720, the system computes a probability of an additional edge based on the dataset using a machine learning model that represents a homophilous cluster and a heterophilous cluster based on the first nonnegative matrix and the second nonnegative matrix. In some cases, the operations of this step refer to, or may be performed by, a machine learning model as described with reference to
According to an embodiment, machine learning model 220 extracts a set of nodes and edges from a dataset. For example, consider the set of undirected, unweighted graphs on n nodes, i.e., the set of graphs with symmetric adjacency matrices in {0,1}^{n×n}. An edge-independent generative model is used for such graphs. Given nonnegative parameter matrices B∈ℝ_+^{n×k_B} and C∈ℝ_+^{n×k_C}, the model outputs the matrix of edge probabilities
Ã := σ(BB^T − CC^T)   (2)
where BB^T is the product of a first nonnegative matrix and a transpose of the first nonnegative matrix, CC^T is the product of a second nonnegative matrix and a transpose of the second nonnegative matrix, and σ is a logistic function. Here k_B and k_C are the number of homophilous and heterophilous clusters, respectively. For example, if b_i∈ℝ_+^{k_B} and c_i∈ℝ_+^{k_C} denote the i-th rows of B and C, then the probability of an edge between nodes i and j is σ(b_i·b_j − c_i·c_j).
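A minimal NumPy sketch of this edge-probability computation (the matrix names mirror the notation above; the random factors are placeholders, not fitted values):

```python
import numpy as np

def edge_probabilities(B, C):
    """Edge probabilities A_tilde = sigmoid(BB^T - CC^T) for nonnegative
    homophilous factors B (n x k_B) and heterophilous factors C (n x k_C)."""
    logits = B @ B.T - C @ C.T               # symmetric difference of Gram matrices
    return 1.0 / (1.0 + np.exp(-logits))     # entrywise logistic function

rng = np.random.default_rng(0)
n, k_B, k_C = 6, 2, 1
B = rng.random((n, k_B))                     # nonnegative homophilous affinities
C = rng.random((n, k_C))                     # nonnegative heterophilous affinities
A_tilde = edge_probabilities(B, C)
print(A_tilde[0, 1])                         # probability of an edge between nodes 0 and 1
```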
According to an embodiment, machine learning model 220 is configured for explainable dataset completion based on heterophilous and homophilous structures in a graph. A heterophilous structure includes links between dissimilar nodes. A homophilous structure includes links between similar nodes.
At operation 725, the system generates an augmented dataset including the set of nodes, the set of edges, and the additional edge. In some cases, the operations of this step refer to, or may be performed by, a data augmentation component as described with reference to
At operation 805, the system computes a first product of the first nonnegative matrix and a transpose of the first nonnegative matrix. In some cases, the operations of this step refer to, or may be performed by, a machine learning model as described with reference to
At operation 810, the system computes a second product of the second nonnegative matrix and a transpose of the second nonnegative matrix. In some cases, the operations of this step refer to, or may be performed by, a machine learning model as described with reference to
At operation 815, the system computes a difference between the first product and the second product to obtain a symmetric difference matrix. In some cases, the operations of this step refer to, or may be performed by, a machine learning model as described with reference to
According to an embodiment, the machine learning model processes factors X and Y into nonnegative factors B∈ℝ_+^{n×k_B} and C∈ℝ_+^{n×k_C}. Here, BB^T − CC^T (i.e., the difference between the first product and the second product) is a symmetric matrix.
In some examples, let L = (XY^T + YX^T)/2; that is, L is a symmetrization of XY^T. In some cases, if σ(XY^T) closely approximates the symmetric matrix A as desired, then σ(L) also closely approximates A. Algorithm 1400 (with reference to
In some examples, Lemma 1 states the following. Let ϕ: ℝ→ℝ denote the ReLU function, i.e., ϕ(z) = max{z, 0}, applied entrywise to vectors. For any vector v, vv^T = 2ϕ(v)ϕ(v)^T + 2ϕ(−v)ϕ(−v)^T − |v||v|^T. The proof is described as follows: take any v∈ℝ^k. Then

vv^T = (ϕ(v) − ϕ(−v))(ϕ(v) − ϕ(−v))^T
     = ϕ(v)ϕ(v)^T + ϕ(−v)ϕ(−v)^T − ϕ(v)ϕ(−v)^T − ϕ(−v)ϕ(v)^T
     = 2ϕ(v)ϕ(v)^T + 2ϕ(−v)ϕ(−v)^T − |v||v|^T.
The first step follows from v=ϕ(v)−ϕ(−v), and the last step follows from |v|=ϕ(v)+ϕ(−v). Algorithm 1400 as described in
For any vector z, the product Lz = (X(Y^T z) + Y(X^T z))/2 can be evaluated without forming L, which demonstrates O(nk) time matrix-vector multiplication. Thus, the eigenvectors can be computed efficiently using an iterative method.
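Lemma 1 can be checked numerically; a quick sketch (the test vector is arbitrary):

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)         # the function phi from Lemma 1

v = np.array([1.5, -2.0, 0.5])              # arbitrary test vector
lhs = np.outer(v, v)
rhs = (2 * np.outer(relu(v), relu(v))
       + 2 * np.outer(relu(-v), relu(-v))
       - np.outer(np.abs(v), np.abs(v)))
assert np.allclose(lhs, rhs)                # vv^T identity from Lemma 1 holds
```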
In some examples, Theorem 1 relates to nonnegative factorization of rank-k matrices and states the following. Given a symmetric rank-k matrix L∈ℝ^{n×n}, there exist nonnegative matrices B∈ℝ_+^{n×k_B} and C∈ℝ_+^{n×k_C}, with k_B + k_C = 3k, such that BB^T − CC^T = L.
Theorem 1 and algorithm 1400 show that unconstrained factors X, Y for the LPCA model can be processed into symmetric and nonnegative factors B, C for machine learning model 220 (see
At operation 820, the system applies a nonnegative nonlinear function to the symmetric difference matrix, where the probability of the additional edge is based on the nonnegative nonlinear function. In some cases, the operations of this step refer to, or may be performed by, a machine learning model as described with reference to
According to an embodiment, the machine learning model combines the homophilous and heterophilous cluster assignments. The form uses a matrix V∈[0,1]^{n×k} and a diagonal matrix W∈ℝ^{k×k}, where k = k_B + k_C is the total number of clusters. For example, let m_B and m_C be the vectors containing the maximums of each column of B and C. By setting

V = (B × diag(m_B^{−1}); C × diag(m_C^{−1}))

W = diag((+m_B^2; −m_C^2))   (4)

the constraint on V is satisfied. Furthermore, VWV^T = BB^T − CC^T, hence

Ã := σ(BB^T − CC^T) = σ(VWV^T)   (5)
The edge probabilities output by the machine learning model may be interpreted as follows. There are bijections between probability p∈[0,1], odds o = p/(1−p)∈[0,+∞), and logit ℓ = log(o)∈(−∞,+∞). The logit of the link probability between nodes i and j is v_i W v_j^T, which is a summation of terms v_{ic} v_{jc} W_{cc} over communities c∈[k]. If the nodes fully participate in community c, that is, v_{ic} = v_{jc} = 1, then the edge logit is changed by W_{cc} starting from a baseline of 0, or equivalently, the odds of an edge is multiplied by exp(W_{cc}) starting from baseline odds of 1; if either of the nodes participates only partially in community c, then the change in logit and odds is prorated accordingly. Homophily and heterophily have a clear interpretation in the model: homophilous communities, which are expressed in B, are those with W_{cc} > 0, where two nodes both participating in the community increases the odds of a link, whereas communities with W_{cc} < 0, which are expressed in C, are heterophilous communities, in which co-participation decreases the odds of a link.
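A sketch of this reparameterization (Eqs. 4 and 5), assuming no column of B or C is all zeros so that the column maxima are nonzero:

```python
import numpy as np

def to_community_form(B, C):
    """Convert nonnegative factors B, C into V in [0,1]^{n x k} and diagonal W
    with VWV^T = BB^T - CC^T (Eqs. 4-5)."""
    m_B = B.max(axis=0)                               # column maxima of B
    m_C = C.max(axis=0)                               # column maxima of C
    V = np.hstack([B / m_B, C / m_C])                 # soft community memberships
    W = np.diag(np.concatenate([m_B**2, -m_C**2]))    # +: homophilous, -: heterophilous
    return V, W

rng = np.random.default_rng(1)
B, C = rng.random((5, 2)), rng.random((5, 1))
V, W = to_community_form(B, C)
assert np.allclose(V @ W @ V.T, B @ B.T - C @ C.T)    # Eq. 5 holds
```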
In
Some examples of the method, apparatus, and non-transitory computer readable medium further include adding the additional edge to the dataset to obtain an augmented dataset.
Some examples of the method, apparatus, and non-transitory computer readable medium further include computing a first product of the first nonnegative matrix and a transpose of the first nonnegative matrix. Some examples further include computing a second product of the second nonnegative matrix and a transpose of the second nonnegative matrix. Some examples further include computing a difference between the first product and the second product to obtain a symmetric difference matrix. Some examples further include applying a nonnegative nonlinear function to the symmetric difference matrix, wherein the predicted probability of the edge is based on the nonnegative nonlinear function.
Some examples of the method, apparatus, and non-transitory computer readable medium further include identifying a number of clusters, wherein a sum of a dimension of the first nonnegative matrix and a dimension of the second nonnegative matrix is equal to the number of clusters.
Some examples of the method, apparatus, and non-transitory computer readable medium further include computing a first factor matrix and a second factor matrix, wherein the first factor matrix or the second factor matrix includes a negative value, and wherein the first nonnegative matrix and the second nonnegative matrix are computed based on the first factor matrix and the second factor matrix.
Some examples of the method, apparatus, and non-transitory computer readable medium further include computing a cluster affinity matrix based on the first nonnegative matrix and the second nonnegative matrix, wherein the machine learning model includes the cluster affinity matrix.
Some examples of the method, apparatus, and non-transitory computer readable medium further include selecting a regularization term. Some examples further include applying the regularization term to the first nonnegative matrix to obtain a regularized first nonnegative matrix. Some examples further include applying the regularization term to the second nonnegative matrix to obtain a regularized second nonnegative matrix, wherein the parameters of the machine learning model are updated based on the regularized first nonnegative matrix and the regularized second nonnegative matrix.
Some examples of the method, apparatus, and non-transitory computer readable medium further include computing an L2 norm of a plurality of columns of the first nonnegative matrix. Some examples further include ranking the plurality of columns based on the L2 norm, wherein the predicted probability of the edge is computed based on the ranking.
Some examples of the method, apparatus, and non-transitory computer readable medium further include computing an L2 norm of a plurality of columns of the second nonnegative matrix. Some examples further include ranking the plurality of columns of the second nonnegative matrix based on the L2 norm, wherein the predicted probability of the edge is computed based on the ranking.
At operation 905, the system receives a dataset that includes a set of nodes and a set of edges, where each of the set of edges connects two of the set of nodes. In some cases, the operations of this step refer to, or may be performed by, a training component as described with reference to
At operation 910, the system computes a first nonnegative matrix representing a homophilous cluster affinity. In some cases, the operations of this step refer to, or may be performed by, a machine learning model as described with reference to
At operation 915, the system computes a second nonnegative matrix representing a heterophilous cluster affinity. In some cases, the operations of this step refer to, or may be performed by, a machine learning model as described with reference to
According to an embodiment of the present disclosure, machine learning model 220 (shown in
At operation 920, the system computes a predicted probability of an edge of the set of edges based on the first nonnegative matrix and the second nonnegative matrix using a machine learning model that represents a homophilous cluster and a heterophilous cluster. In some cases, the operations of this step refer to, or may be performed by, a machine learning model as described with reference to
At operation 925, the system updates parameters of the machine learning model based on the predicted probability of the edge. In some cases, the operations of this step refer to, or may be performed by, a training component as described with reference to
Given an input graph A∈{0,1}^{n×n}, B and C are obtained such that the machine learning model produces Ã = σ(BB^T − CC^T)∈(0,1)^{n×n}, where Ã approximately matches A. The machine learning model is trained to minimize the sum of binary cross-entropies of the link predictions over the pairs of nodes:
R = −Σ(A log(Ã) + (1−A) log(1−Ã))   (6)
where Σ denotes the scalar summation of entries in the matrix. The training component is configured to fit the parameters using gradient descent over the loss (Eq. 6), as well as L2 regularization of the factors B and C, subject to the nonnegativity of B and C.
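A minimal PyTorch sketch of this training objective (a sketch under stated assumptions: the hyperparameter values are illustrative, and projected gradient descent with Adam stands in for the disclosure's L-BFGS optimizer):

```python
import torch

def fit_factors(A, k_B, k_C, steps=500, lr=0.05, reg=1e-3):
    """Fit nonnegative factors B, C by minimizing the summed binary
    cross-entropy of link predictions (Eq. 6) plus L2 regularization."""
    n = A.shape[0]
    B = torch.rand(n, k_B, requires_grad=True)
    C = torch.rand(n, k_C, requires_grad=True)
    opt = torch.optim.Adam([B, C], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        logits = B @ B.T - C @ C.T                    # edge logits
        loss = torch.nn.functional.binary_cross_entropy_with_logits(
            logits, A, reduction="sum")               # Eq. 6
        loss = loss + reg * (B.square().sum() + C.square().sum())
        loss.backward()
        opt.step()
        with torch.no_grad():                         # project onto nonnegativity
            B.clamp_(min=0)
            C.clamp_(min=0)
    return B.detach(), C.detach()

A = torch.tensor([[0., 1., 1.], [1., 0., 0.], [1., 0., 0.]])
B, C = fit_factors(A, k_B=2, k_C=1)
```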
According to an embodiment of the present disclosure, principled initialization of factors B and C is performed. Alternatively, random initialization of factors B and C may be used. The training component selects a total number of clusters k, and then automatically sets a split of homophilous/heterophilous clusters k_B/k_C such that k_B + k_C = k. The training component, via algorithm 1400 shown in
In some examples, the number of these communities is fixed (e.g., 3 communities in
According to an embodiment, the trained machine learning model can predict unknown data, e.g., the probabilities of unknown links which correspond to unknown attributes of users. The prediction is explainable. For example, user i will buy product j because that user and product participate in clusters a and b, and these communities change the odds of a link by factors x and y.
At operation 1005, the system selects a regularization term. In some cases, the operations of this step refer to, or may be performed by, a training component as described with reference to
At operation 1010, the system applies the regularization term to the first nonnegative matrix to obtain a regularized first nonnegative matrix. In some cases, the operations of this step refer to, or may be performed by, a training component as described with reference to
At operation 1015, the system applies the regularization term to the second nonnegative matrix to obtain a regularized second nonnegative matrix. In some cases, the operations of this step refer to, or may be performed by, a training component as described with reference to
According to an embodiment, algorithm 1400 (shown in
For any vector z, the product Lz = (X(Y^T z) + Y(X^T z))/2 can be evaluated without forming L, which demonstrates O(nk) time matrix-vector multiplication. Thus, the eigenvectors can be computed efficiently using an iterative method. Algorithm 1400 generates nonnegative matrices using the regularization weight and eigenvalues/eigenvectors. For example, the regularization weight and eigenvalues/eigenvectors can be obtained based on logit factors X, Y∈ℝ^{n×k}. Further, algorithm 1400 executes Q_+ × diag(√(+λ_+)) to assign a value to a regularized first nonnegative matrix B*. Furthermore, the algorithm executes Q_− × diag(√(−λ_−)) to assign a value to a regularized second nonnegative matrix C*. Here, λ_+, Q_+ are the positive eigenvalues and corresponding eigenvectors, and λ_−, Q_− are the negative eigenvalues and corresponding eigenvectors.
At operation 1020, the system updates the parameters of the machine learning model based on the regularized first nonnegative matrix and the regularized second nonnegative matrix. In some cases, the operations of this step refer to, or may be performed by, a training component as described with reference to
According to an embodiment, the nonnegative matrices B, C for the machine learning model are updated based on the regularized first and second nonnegative matrices B*, C*. In some cases, unconstrained factors X, Y for the LPCA model can be processed into symmetric and nonnegative factors B, C for the machine learning model without any approximation error, at the cost of increasing the factorization rank. Therefore, for initialization, of the 3k communities generated, the top k that are most impactful on the edge logits are kept, as ranked by the L2 norms of the columns of B and C. Now B∈ℝ_+^{n×k_B} and C∈ℝ_+^{n×k_C} with k_B + k_C = k.
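A sketch of this truncation step (keeping the k most impactful of the 3k generated columns, ranked by column L2 norm; the helper name is illustrative):

```python
import numpy as np

def keep_top_k(B, C, k):
    """Rank all columns of B and C together by L2 norm and keep the top k,
    preserving whether each kept column is homophilous (B) or heterophilous (C)."""
    norms = np.concatenate([np.linalg.norm(B, axis=0),
                            np.linalg.norm(C, axis=0)])
    keep = np.argsort(-norms)[:k]                  # indices of the top-k columns
    from_B = keep[keep < B.shape[1]]
    from_C = keep[keep >= B.shape[1]] - B.shape[1]
    return B[:, from_B], C[:, from_C]
```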
For example, Lemma 2 relates to exact LPCA embeddings for bounded-degree graphs. Let A∈{0,1}^{n×n} be the adjacency matrix of a graph G with maximum degree c. Then there exist matrices X, Y∈ℝ^{n×(2c+1)} such that (XY^T)_{ij} > 0 if A_{ij} = 1 and (XY^T)_{ij} < 0 if A_{ij} = 0.
For example, Theorem 2 relates to exact reconstruction via nonnegative logits for bounded-degree graphs. Let A∈{0,1}^{n×n} be the adjacency matrix of a graph G with maximum degree c, and let k = 12c + 6. For any ϵ > 0, there exist V∈[0,1]^{n×k} and diagonal W∈ℝ^{k×k} such that ‖σ(VWV^T) − A‖_F < ϵ.
Theorem 2 results from combining Theorem 1 and Lemma 2 and refers to the capacity of machine learning model 220. Lemma 2 and Theorem 2 are obtained from a constructive proof based on polynomial interpolation. Algorithm 1300 is based on cross-entropy gradient descent, employs regularization, and uses fewer communities than the above upper bound on k.
In some cases, graph models compress information about a graph (e.g., high-dimensional objects) using dot product models, which associate each node with a real-valued embedding vector. The predicted probability of a link between two nodes increases with the similarity of their embedding vectors. The models can be seen as factorizing an adjacency matrix of the graph in terms of low-rank matrices.
As an example shown in
As shown in
As an example shown in
According to an embodiment, the features generated by the machine learning model, i.e., the factors returned by factorization are visualized. The model factors capture the relevant latent structure in an interpretable way and can represent the homophilous and heterophilous structures.
At line 1 of the algorithm, Q∈ℝ^{n×k} and λ∈ℝ^k are set by truncated eigendecomposition such that Q × diag(λ) × Q^T ≈ L.
At line 2, the algorithm executes Q_+ × diag(√(+λ_+)) to assign a value to B*, where λ_+, Q_+ are the positive eigenvalues and corresponding eigenvectors. At line 3, Q_− × diag(√(−λ_−)) is executed to assign a value to C*, where λ_−, Q_− are the negative eigenvalues and corresponding eigenvectors. At line 4, the algorithm executes (√2 ϕ(B*); √2 ϕ(−B*); |C*|) to assign a value to B, where ϕ and |⋅| are the entrywise ReLU and absolute value. At line 5, (√2 ϕ(C*); √2 ϕ(−C*); |B*|) is executed to assign a value to C. At line 6, the values of B and C are returned.
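A sketch of algorithm 1400 as reconstructed above (a dense eigendecomposition stands in for the iterative method, and the function and variable names are illustrative):

```python
import numpy as np

def nonneg_init(X, Y):
    """Algorithm 1400 sketch: turn unconstrained LPCA logit factors X, Y into
    nonnegative factors B, C with BB^T - CC^T equal to (XY^T + YX^T)/2."""
    L = (X @ Y.T + Y @ X.T) / 2                   # symmetrization of XY^T
    lam, Q = np.linalg.eigh(L)                    # line 1: eigendecomposition
    pos, neg = lam > 0, lam < 0
    B_star = Q[:, pos] * np.sqrt(lam[pos])        # line 2: Q+ diag(sqrt(+lambda+))
    C_star = Q[:, neg] * np.sqrt(-lam[neg])       # line 3: Q- diag(sqrt(-lambda-))
    relu = lambda Z: np.maximum(Z, 0.0)
    B = np.hstack([np.sqrt(2) * relu(B_star),     # lines 4-5: Lemma 1 splitting
                   np.sqrt(2) * relu(-B_star), np.abs(C_star)])
    C = np.hstack([np.sqrt(2) * relu(C_star),
                   np.sqrt(2) * relu(-C_star), np.abs(B_star)])
    return B, C                                   # line 6

rng = np.random.default_rng(2)
X, Y = rng.normal(size=(6, 2)), rng.normal(size=(6, 2))
B, C = nonneg_init(X, Y)
assert np.allclose(B @ B.T - C @ C.T, (X @ Y.T + Y @ X.T) / 2)
```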
The performance of the apparatus, systems, and methods of the present disclosure has been evaluated, and results indicate that embodiments of the present disclosure obtain increased performance over existing technology. Example experiments demonstrate that data augmentation apparatus 235 of the present disclosure outperforms conventional systems.
Embodiments of the present disclosure include a data augmentation apparatus that can capture arbitrary homophilous and heterophilous structures. The present disclosure provides methods, systems, and apparatus to generate an interpretable graph generative model based on nonnegative matrix factorization that is expressive at representing both homophily and heterophily, while maintaining simplicity and interpretability.
According to an embodiment, the data augmentation apparatus uses PyTorch for automatic differentiation and minimizes loss using the SciPy implementation of an optimization algorithm. In some examples, a limited memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm is used with default hyperparameters and a maximum of 200 iterations of optimization.
The description and drawings described herein represent example configurations and do not represent all the implementations within the scope of the claims. For example, the operations and steps may be rearranged, combined or otherwise modified. Also, structures and devices may be represented in the form of block diagrams to represent the relationship between components and avoid obscuring the described concepts. Similar components or features may have the same name but may have different reference numbers corresponding to different figures.
Some modifications to the disclosure may be readily apparent to those skilled in the art, and the principles defined herein may be applied to other variations without departing from the scope of the disclosure. Thus, the disclosure is not limited to the examples and designs described herein, but is to be accorded the broadest scope consistent with the principles and novel features disclosed herein.
In this disclosure and the following claims, the word “or” indicates an inclusive list such that, for example, the list of X, Y, or Z means X or Y or Z or XY or XZ or YZ or XYZ. Also the phrase “based on” is not used to represent a closed set of conditions. For example, a step that is described as “based on condition A” may be based on both condition A and condition B. In other words, the phrase “based on” shall be construed to mean “based at least in part on.” Also, the words “a” or “an” indicate “at least one.”