The following relates generally to item recommendation, and more specifically to video recommendation using machine learning.
Item recommendation refers to the task of collecting data relating to user interactions, modeling user behavior, and using the model to predict items that users are likely to interact with. For example, the user may click on a sequence of items in an online store, and a website server can predict a next item that the user is likely to view or purchase.
Video recommendation is a subtask within the field of item recommendation where a video item is suggested to users to view. A video recommendation system generates a video recommendation based on a user profile when a user logs on to a video-sharing platform. In some examples, the user profile includes past interactions, preferred genres, or general browsing history on the internet. Additionally or alternatively, after a user watches a video, a set of related videos are presented on the side such that the user can move on to the next video of interest in a single click.
In some cases, neural networks such as transformer-based networks are used to generate recommendations. However, conventional recommendation systems encounter sparse interactions between users and videos due to the size of data and are unable to process different types of information for efficient recommendation (e.g., different modalities such as textual, visual, acoustic information). Therefore, there is a need in the art for an improved recommendation network that can be trained to model multi-modal information and recommend highly relevant videos.
The present disclosure describes systems and methods for video recommendation. Embodiments of the present disclosure include an item recommendation apparatus configured to generate a knowledge graph based on a user and a set of content items represented as nodes in the knowledge graph. In some cases, a knowledge graph includes a spatial encoding matrix representing a number of hops between nodes of the knowledge graph, and an edge encoding matrix representing edge types between the nodes. In some embodiments, a multi-modal graph encoder of the item recommendation apparatus generates a first feature embedding representing a user and a second feature embedding representing a content item based on the knowledge graph. The second feature embedding is generated using a first modality (e.g., textual information) for a query vector of an attention mechanism and a second modality (e.g., visual information) for a key vector and a value vector of the attention mechanism. In some examples, the multi-modal graph encoder can be trained using a contrastive learning loss and a ranking loss.
A method, apparatus, and non-transitory computer readable medium for item recommendation are described. One or more embodiments of the method, apparatus, and non-transitory computer readable medium include receiving input indicating a relationship between a user and a first content item; generating a knowledge graph based on the input, wherein the knowledge graph comprises relationship information between a node representing the user and a plurality of nodes corresponding to a plurality of content items including the first content item; generating a first feature embedding representing the user and a second feature embedding representing a second content item of the plurality of content items using a multi-modal graph encoder based on the knowledge graph, wherein the second feature embedding is generated using a first modality for a query vector of an attention mechanism and a second modality for a key vector and a value vector of the attention mechanism; comparing the first feature embedding to the second feature embedding to obtain a similarity score between the user and the second content item; and recommending the second content item for the user based on the similarity score.
A method, apparatus, and non-transitory computer readable medium for training a neural network are described. One or more embodiments of the method, apparatus, and non-transitory computer readable medium include receiving training data including relationships between a plurality of users and a plurality of content items; generating a knowledge graph based on the training data, wherein the knowledge graph represents the relationships between the plurality of users and the plurality of content items; generating a first feature embedding representing a user and a second feature embedding representing a content item of the plurality of content items using a multi-modal graph encoder based on the knowledge graph, wherein the second feature embedding is generated using a first modality as a query vector of an attention mechanism and a second modality as a key vector of the attention mechanism; computing a loss function based on the first feature embedding and the second feature embedding; and updating parameters of the multi-modal graph encoder based on the loss function.
An apparatus and method for item recommendation are described. One or more embodiments of the apparatus and method include a knowledge graph component configured to generate a knowledge graph representing relationships between a plurality of users and a plurality of content items; a multi-modal graph encoder configured to generate a first feature embedding representing a user and a second feature embedding representing a content item of the plurality of content items based on the knowledge graph, wherein the second feature embedding is generated using a first modality for a query vector of an attention mechanism and a second modality for a key vector and a value vector of the attention mechanism; and a recommendation component configured to compare the first feature embedding to the second feature embedding to obtain similarity scores between the users and the content items and to identify recommended content items to the users based on the similarity scores.
Conventional recommendation networks are content-based or collaborative-filtering-based systems. In some examples, content-based networks generate vector representations in Euclidean space from input information and measure similarities between items based on those vector representations. Alternatively, collaborative-filtering-based systems treat each user-item interaction as an independent instance and encode side information.
However, conventional recommendation systems are not scalable to different types or sizes of input data, and these systems may face cold start issues. In some examples, content-based systems are not able to provide recommendations based on sparse data (i.e., the interactions between users and content items are sparse due to the large size of the data). Similarly, collaborative-filtering-based systems have difficulty recommending relevant videos to a new user (i.e., the cold start issue). As a result, the performance of existing recommendation systems may not meet user expectations because the quality of personalized recommendations decreases.
Embodiments of the present disclosure include a multi-modal graph encoder using a knowledge graph to model relationships among a set of nodes (i.e., users and content items). Some embodiments generate a knowledge graph including relationship information between a node representing a user and nodes corresponding to a set of content items. A knowledge graph captures node-edge relationships (i.e., an entity-relation structure) connecting items with their corresponding attributes in a non-Euclidean space. In some examples, the knowledge graph includes both homogeneous information and heterogeneous information. For example, entities such as users and content items (represented as nodes in a knowledge graph) can be different types of objects.
By using a symmetric bi-modal attention network, embodiments of the present disclosure generate a first feature embedding representing the user and a second feature embedding representing a content item of the content items based on the knowledge graph. The second feature embedding is generated using a first modality for a query vector of an attention mechanism and a second modality for a key vector and a value vector of the attention mechanism. That is, a multi-modal graph encoder can handle input information of different types (i.e., modalities such as visual, textual, or acoustic information), where each modality has its own multi-head attention module. In some examples, the query and the (key, value) pair are constructed using input from different modalities. For example, the embedding from a first modality (i.e., modality 1) is used as the query input for the second modality's (i.e., modality 2) multi-head attention unit, while the embedding from the second modality (modality 2) is used as the query input for the first modality's (modality 1) multi-head attention unit. Therefore, the multi-modal graph encoder is able to handle parallel sequential inputs, such as video/transcript, video/sound, etc. In some examples, the multi-modal graph encoder is trained using a multi-task loss function. The multi-task loss includes a Bayesian personalized ranking loss and a metric loss function.
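The swapped-query construction described above can be sketched as follows. This is a minimal illustration assuming single-head scaled dot-product attention with the learned projection matrices omitted; mean-pooling and concatenation are one possible fusion choice for producing a fixed-size item embedding, not the only one.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention (single head for brevity)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def symmetric_bimodal_attention(m1, m2):
    """Modality 1 provides the query for modality 2's attention unit,
    and vice versa; the pooled outputs are concatenated."""
    out_2 = attention(m1, m2, m2).mean(axis=0)  # modality-2 unit, queried by modality 1
    out_1 = attention(m2, m1, m1).mean(axis=0)  # modality-1 unit, queried by modality 2
    return np.concatenate([out_2, out_1])       # fused item embedding

text = rng.standard_normal((4, 8))    # e.g., transcript token embeddings
visual = rng.standard_normal((6, 8))  # e.g., video frame embeddings
```

Note that the two modalities may contain different numbers of elements (here 4 text tokens and 6 frames); pooling each attention output before concatenation keeps the embedding size fixed.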
Embodiments of the present disclosure may be used in the context of content recommendation applications. For example, an item recommendation network based on the present disclosure may take different types of information as input and efficiently identify content items to be recommended to users to increase user interaction. An example application of the inventive concept in the video recommendation context is provided with reference to
In the example of
User device 105 may be a personal computer, laptop computer, mainframe computer, palmtop computer, personal assistant, mobile device, or any other suitable processing apparatus. In some examples, user device 105 includes software that incorporates a content recommendation application. In some examples, the content recommendation application on user device 105 may include functions of item recommendation apparatus 110.
A user interface may enable user 100 to interact with user device 105. In some embodiments, the user interface may include an audio device, such as an external speaker system, an external display device such as a display screen, or an input device (e.g., remote control device interfaced with the user interface directly or through an I/O controller module). In some cases, a user interface may be a graphical user interface (GUI). In some examples, a user interface may be represented in code which is sent to user device 105 and rendered locally by a browser.
Item recommendation apparatus 110 collects user profile information of user 100 and browsing history. In some examples, the browsing history includes at least one video viewed by the user previously (i.e., a cover page of the video shows visual information such as the title, date, and duration of the video). Each of the at least one video has an associated transcript (i.e., textual information including a short summary of the video). The content of the at least one video includes an audio feed (i.e., acoustic information). Item recommendation apparatus 110 receives input indicating a relationship between user 100 and a first content item. The multi-modal information is represented by a media play icon and a document icon (i.e., visual and textual information). For example, browsing history may correspond to a list of searchable content items stored within database 120. A data structure such as an array, a matrix, a tuple, a list, a tree, or a combination thereof may be used to represent the list of content items. The item recommendation apparatus 110 generates a knowledge graph based on the input, where the knowledge graph indicates relationship information between a user and a set of content items including the first content item.
Item recommendation apparatus 110 generates a first feature embedding representing user 100 and a second feature embedding representing a second content item (e.g., a video) based on the knowledge graph. The second feature embedding is generated using a first modality for a query vector of an attention mechanism and a second modality for a key vector and a value vector of the attention mechanism. Item recommendation apparatus 110 compares the first feature embedding to the second feature embedding to obtain a similarity score between user 100 and the second content item.
Item recommendation apparatus 110 recommends the second content item for user 100 based on the similarity score and returns the second content item (denoted as a favorite video icon) to user 100. Alternatively or additionally, item recommendation apparatus 110 displays a content item on a user interface similar to a video currently being viewed on a streaming platform. The process of using item recommendation apparatus 110 is further described with reference to
Item recommendation apparatus 110 includes a computer implemented network comprising a knowledge graph component and a multi-modal graph encoder. Item recommendation apparatus 110 may also include a processor unit, a memory unit, an I/O module, a training component, and a recommendation component. The training component is used to train a machine learning model (or an item recommendation network). Additionally, item recommendation apparatus 110 can communicate with database 120 via cloud 115. In some cases, the architecture of the item recommendation network is also referred to as a network or a network model for brevity. Further detail regarding the architecture of item recommendation apparatus 110 is provided with reference to
In some cases, item recommendation apparatus 110 is implemented on a server. A server provides one or more functions to users linked by way of one or more of the various networks. In some cases, the server includes a single microprocessor board, which includes a microprocessor responsible for controlling all aspects of the server. In some cases, a server uses a microprocessor and protocols to exchange data with other devices/users on one or more of the networks via hypertext transfer protocol (HTTP) and simple mail transfer protocol (SMTP), although other protocols such as file transfer protocol (FTP) and simple network management protocol (SNMP) may also be used. In some cases, a server is configured to send and receive hypertext markup language (HTML) formatted files (e.g., for displaying web pages). In various embodiments, a server comprises a general-purpose computing device, a personal computer, a laptop computer, a mainframe computer, a supercomputer, or any other suitable processing apparatus.
A cloud 115 is a computer network configured to provide on-demand availability of computer system resources, such as data storage and computing power. In some examples, the cloud 115 provides resources without active management by the user. The term cloud is sometimes used to describe data centers available to many users over the Internet. Some large cloud networks have functions distributed over multiple locations from central servers. A server is designated an edge server if it has a direct or close connection to a user. In some cases, a cloud 115 is limited to a single organization. In other examples, the cloud 115 is available to many organizations. In one example, a cloud 115 includes a multi-layer communications network comprising multiple edge routers and core routers. In another example, a cloud 115 is based on a local collection of switches in a single physical location.
A database 120 is an organized collection of data. For example, database 120 stores content items of different modalities (e.g., video files, text files, audio files) in a specified format known as a schema. In some cases, a content item includes multiple types of information, e.g., a video can have audio, visual information, and transcript (i.e., text). A database 120 may be structured as a single database, a distributed database, multiple distributed databases, or an emergency backup database. In some cases, a database controller may manage data storage and processing in a database 120. In some cases, a user interacts with a database controller. In other cases, a database controller may operate automatically without user interaction.
At operation 205, a user interacts with a set of content items. In some cases, the operations of this step refer to, or may be performed by, user as described with reference to
At operation 210, the system compares the user with additional items based on the interaction. The additional items may include different types of modalities (e.g., textual, visual, and acoustic information). In some cases, the operations of this step refer to, or may be performed by, item recommendation apparatus as described with reference to
At operation 215, the system selects a content item based on the comparison. In some cases, the operations of this step refer to, or may be performed by, item recommendation apparatus as described with reference to
At operation 220, the system recommends the selected content item to the user. In some cases, the operations of this step refer to, or may be performed by, item recommendation apparatus as described with reference to
The item recommendation network can have different types of information as input. In some cases, the network takes features and multi-modal information as input for generating recommendations (e.g., recommendations for users, videos). For example, features may include video information such as upload time, application name, etc., and user information such as username, view history, etc. According to an embodiment, multi-modal information 300 includes visual information 305, textual information 310, and/or acoustic information 315. Visual information 305 is an example of, or includes aspects of, the corresponding element described with reference to
One or more embodiments of the present disclosure include a knowledge graph database constructed based on a high-speed and highly scalable database (e.g., a Neo4j database). In some cases, a recommendation algorithm is implemented to make recommendations based on graph neural networks (GNNs).
According to some embodiments, online viewer data 400 and offline video data 405 are input to graph data platform 410 (e.g., Neo4j) for data integration. Neo4j is a graph database management system which includes a transactional database with native graph storage and processing. Powered by a native graph database, Neo4j stores and manages data in its more natural, connected state, maintaining data relationships, context for analytics, and a modifiable data model. Output from graph data platform 410 is then input to data analysis library 415 (e.g., Python® Pandas) for data pre-processing. Pandas is a software library written for the Python® programming language for data manipulation and analysis. Machine learning library 420 is used to train item recommendation apparatus 425. In some examples, machine learning library 420 includes PyTorch. PyTorch is a machine learning library based on the Torch library used for applications such as computer vision and natural language processing. A web demo using micro web framework 430 (e.g., Flask, a micro web framework written in Python®) illustrates the increased performance of item recommendation apparatus 425.
In
Some examples of the apparatus and method further include an image encoder configured to generate a visual embedding for the content items, wherein the query vector is generated based on the visual embedding.
Some examples of the apparatus and method further include a text encoder configured to generate a textual embedding based on the content items, wherein the key vector is generated based on the textual embedding.
Some examples of the apparatus and method further include a training component configured to compute a loss function based on the first feature embedding and the second feature embedding and to update parameters of the multi-modal graph encoder based on the loss function.
In some examples, the multi-modal graph encoder comprises a symmetric bimodal attention network. In some examples, the symmetric bimodal attention network comprises a first multi-head attention module corresponding to the first modality and a second multi-head attention module corresponding to the second modality. Some examples of the apparatus and method further include a search component configured to search for a plurality of candidate content items for recommendation to the user.
A processor unit 505 is an intelligent hardware device, (e.g., a general-purpose processing component, a digital signal processor (DSP), a central processing unit (CPU), a graphics processing unit (GPU), a microcontroller, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a programmable logic device, a discrete gate or transistor logic component, a discrete hardware component, or any combination thereof). In some cases, processor unit 505 is configured to operate a memory array using a memory controller. In other cases, a memory controller is integrated into the processor. In some cases, processor unit 505 is configured to execute computer-readable instructions stored in a memory to perform various functions. In some embodiments, processor unit 505 includes special-purpose components for modem processing, baseband processing, digital signal processing, or transmission processing.
Examples of a memory unit 510 include random access memory (RAM), read-only memory (ROM), or a hard disk. Examples of memory unit 510 include solid state memory and a hard disk drive. In some examples, memory unit 510 is used to store computer-readable, computer-executable software including instructions that, when executed, cause a processor to perform various functions described herein. In some cases, memory unit 510 contains, among other things, a basic input/output system (BIOS) which controls basic hardware or software operation such as the interaction with peripheral components or devices. In some cases, a memory controller operates memory cells. For example, the memory controller can include a row decoder, column decoder, or both. In some cases, memory cells within memory unit 510 store information in the form of a logical state.
I/O module 515 (e.g., an input/output interface) may include an I/O controller. An I/O controller may manage input and output signals for a device. I/O controller may also manage peripherals not integrated into a device. In some cases, an I/O controller may represent a physical connection or port to an external peripheral. In some cases, an I/O controller may utilize an operating system such as iOS®, ANDROID®, MS-DOS®, MS-WINDOWS®, OS/2®, UNIX®, LINUX®, or another known operating system. In other cases, an I/O controller may represent or interact with a modem, a keyboard, a mouse, a touchscreen, or a similar device. In some cases, an I/O controller may be implemented as part of a processor. In some cases, a user may interact with a device via I/O controller or via hardware components controlled by an I/O controller.
In some examples, I/O module 515 includes a user interface. A user interface may enable a user to interact with a device. In some embodiments, the user interface may include an audio device, such as an external speaker system, an external display device such as a display screen, or an input device (e.g., remote control device interfaced with the user interface directly or through an I/O controller module). In some cases, a user interface may be a graphical user interface (GUI). In some examples, a communication interface operates at the boundary between communicating entities and the channel and may also record and process communications. In some examples, a communication interface couples a processing system to a transceiver (e.g., a transmitter and/or a receiver). In some examples, the transceiver is configured to transmit (or send) and receive signals for a communications device via an antenna.
According to some embodiments of the present disclosure, item recommendation apparatus 500 includes a computer implemented artificial neural network (ANN) for identifying high-level events and their respective vector representations occurring in a video. An ANN is a hardware or a software component that includes a number of connected nodes (i.e., artificial neurons), which loosely correspond to the neurons in a human brain. Each connection, or edge, transmits a signal from one node to another (like the physical synapses in a brain). When a node receives a signal, it processes the signal and then transmits the processed signal to other connected nodes. In some cases, the signals between nodes comprise real numbers, and the output of each node is computed by a function of the sum of its inputs. Each node and edge is associated with one or more node weights that determine how the signal is processed and transmitted.
According to some embodiments, item recommendation apparatus 500 includes a convolutional neural network (CNN) for item recommendation. A CNN is a class of neural networks that is commonly used in computer vision or image classification systems. In some cases, a CNN may enable processing of digital images with minimal pre-processing. A CNN may be characterized by the use of convolutional (or cross-correlational) hidden layers. These layers apply a convolution operation to the input before signaling the result to the next layer. Each convolutional node may process data for a limited field of input (i.e., the receptive field). During a forward pass of the CNN, filters at each layer may be convolved across the input volume, computing the dot product between the filter and the input. During the training process, the filters may be modified so that they activate when they detect a particular feature within the input.
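The convolution operation described above can be sketched as follows. This is an illustrative single-channel, stride-1 example with no padding (technically cross-correlation, as is conventional in CNN implementations); the edge-detecting filter and input are toy values, not part of any embodiment.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid cross-correlation: slide the filter over the input and take
    the dot product at each position (no padding, stride 1)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

edge_filter = np.array([[1., -1.]])      # activates on a horizontal change
image = np.array([[0., 0., 1., 1.],
                  [0., 0., 1., 1.]])
response = conv2d(image, edge_filter)    # strongest where the edge sits
```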
A graph convolutional network (GCN) is a type of neural network that defines a convolutional operation on graphs and uses their structural information. For example, a GCN may be used for node classification (e.g., documents) in a graph (e.g., a citation network), where labels are available for a subset of nodes, using a semi-supervised learning approach. A feature description for every node is summarized in a matrix, and a form of pooling operation produces a node-level output. In some cases, GCNs use dependency trees which enrich representation vectors for key terms in an input phrase/sentence.
According to some embodiments, training component 520 receives training data including relationships between a set of users and a set of content items. In some examples, training component 520 computes a loss function based on the first feature embedding and the second feature embedding. Training component 520 updates parameters of the multi-modal graph encoder 540 based on the loss function. In some examples, training component 520 identifies a first content item and a second content item. Training component 520 determines that a user prefers the first content item over the second content item using similarity scores for the first content item and the second content item. Training component 520 computes a ranking loss based on the determination, where the loss function includes the ranking loss.
In some examples, training component 520 identifies a positive sample pair including a user and a first content item that is preferred by the user. Next, training component 520 identifies a negative sample pair including the user and a second content item that is not preferred by the user. Training component 520 then computes a contrastive learning loss based on the positive sample pair and the negative sample pair, where the loss function includes the contrastive learning loss.
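The two loss terms above can be sketched as follows. This is a minimal illustration over single embedding vectors: the BPR term follows the standard -log sigmoid(score difference) form, while the contrastive term is written as a triplet-style margin loss, which is one common choice of metric loss; the margin and weighting factor are illustrative assumptions, not values from the disclosure.

```python
import numpy as np

def bpr_loss(user, pos_item, neg_item):
    """Bayesian personalized ranking: push the preferred item's score
    above the non-preferred item's score."""
    diff = user @ pos_item - user @ neg_item
    return -np.log(1.0 / (1.0 + np.exp(-diff)))  # -log sigmoid(diff)

def contrastive_loss(user, pos_item, neg_item, margin=1.0):
    """Triplet-style metric loss: pull the positive pair together and
    push the negative pair at least `margin` farther away."""
    d_pos = np.linalg.norm(user - pos_item)
    d_neg = np.linalg.norm(user - neg_item)
    return max(0.0, d_pos - d_neg + margin)

def multi_task_loss(user, pos_item, neg_item, alpha=0.5):
    """Weighted sum of the ranking and contrastive terms (alpha is illustrative)."""
    return bpr_loss(user, pos_item, neg_item) + alpha * contrastive_loss(user, pos_item, neg_item)
```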
According to some embodiments, training component 520 is configured to compute a loss function based on the first feature embedding and the second feature embedding and to update parameters of the multi-modal graph encoder 540 based on the loss function.
According to some embodiments, recommendation component 525 compares the first feature embedding to the second feature embedding to obtain a similarity score between the user and the second content item. In some examples, recommendation component 525 recommends the second content item for the user based on the similarity score. In some examples, recommendation component 525 computes a cosine similarity, where the similarity score is based on the cosine similarity.
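The scoring step above can be sketched as follows: cosine similarity between the user embedding and each candidate item embedding, followed by a top-k selection. The vectors and helper names below are illustrative.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embeddings, in [-1, 1]."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def recommend(user_emb, item_embs, top_k=1):
    """Score every candidate item against the user and return the
    indices of the highest-scoring items."""
    scores = [cosine_similarity(user_emb, item) for item in item_embs]
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:top_k]

user = np.array([1.0, 0.0])
items = [np.array([0.9, 0.1]),   # nearly the same direction as the user
         np.array([0.0, 1.0]),   # orthogonal
         np.array([-1.0, 0.0])]  # opposite direction
```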
According to some embodiments, recommendation component 525 is configured to compare the first feature embedding to the second feature embedding to obtain similarity scores between the users and the content items and to identify recommended content items to the users based on the similarity scores. Recommendation component 525 is an example of, or includes aspects of, the corresponding element described with reference to
According to some embodiments, search component 527 is configured to search for a set of candidate content items for recommendation to the user. According to some embodiments, machine learning model 530 receives input indicating a relationship between a user and a first content item. In some cases, machine learning model 530 may be referred to as an item recommendation network or the network model.
According to some embodiments, knowledge graph component 535 generates a knowledge graph based on the input, where the knowledge graph includes relationship information between a node representing the user and a set of nodes corresponding to a set of content items including the first content item. In some examples, knowledge graph component 535 generates a spatial encoding matrix representing a number of hops between nodes of the knowledge graph, where the knowledge graph includes the spatial encoding matrix. In some examples, knowledge graph component 535 generates an edge encoding matrix representing edge types between nodes of the knowledge graph, where the knowledge graph includes the edge encoding matrix. In some examples, the edge types represent types of interactions between users and content items.
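The spatial encoding matrix above can be sketched as a hop-count (shortest-path) matrix computed by breadth-first search, alongside an edge encoding matrix of integer edge-type labels. The toy graph and edge-type labels below are illustrative.

```python
import numpy as np
from collections import deque

def spatial_encoding(adj):
    """Hop-count matrix via BFS from each node; unreachable pairs are -1."""
    n = len(adj)
    hops = -np.ones((n, n), dtype=int)
    for s in range(n):
        hops[s, s] = 0
        q = deque([s])
        while q:
            u = q.popleft()
            for v in range(n):
                if adj[u][v] and hops[s, v] < 0:
                    hops[s, v] = hops[s, u] + 1
                    q.append(v)
    return hops

# toy graph: user(0) -- video(1) -- topic(2)
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
hops = spatial_encoding(adj)

# edge encoding: 0 = no edge, 1 = "watched", 2 = "has_topic" (illustrative labels)
edge_types = np.array([[0, 1, 0],
                       [1, 0, 2],
                       [0, 2, 0]])
```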
According to some embodiments, knowledge graph component 535 generates a knowledge graph based on the training data, where the knowledge graph represents the relationships between the set of users and the set of content items.
According to some embodiments, knowledge graph component 535 is configured to generate a knowledge graph representing relationships between a set of users and a set of content items. Knowledge graph component 535 is an example of, or includes aspects of, the corresponding element described with reference to
According to some embodiments, multi-modal graph encoder 540 generates a first feature embedding representing the user and a second feature embedding representing a second content item of the set of content items based on the knowledge graph, where the second feature embedding is generated using a first modality for a query vector of an attention mechanism and a second modality for a key vector and a value vector of the attention mechanism. In some examples, an image encoder (see
In some examples, multi-modal graph encoder 540 combines the query vector of the first modality and the key vector of the second modality to obtain a combined vector. Multi-modal graph encoder 540 weights the combined vector based on the knowledge graph to obtain a weighted vector. In some examples, multi-modal graph encoder 540 combines the weighted vector with the value vector of the second modality, where the second feature embedding is based on the combination of the weighted vector and the value vector. In some examples, multi-modal graph encoder 540 generates a first symmetric feature embedding using the first modality as the query vector and the second modality as the key vector. Multi-modal graph encoder 540 generates a second symmetric feature embedding using the second modality as a symmetric query vector and the first modality as a symmetric key vector, where the second feature embedding is based on the first symmetric feature embedding and the second symmetric feature embedding.
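By way of a non-limiting illustration, the combine-weight-combine sequence above can be sketched with scaled dot-product attention, where the query comes from one modality and the key and value come from another. The embedding values and the zero graph bias below are hypothetical placeholders:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(q_mod1, k_mod2, v_mod2, graph_bias):
    """Query from one modality attends over keys/values of another modality.

    graph_bias is an (n, n) matrix derived from the knowledge graph
    (e.g., from the spatial and edge encodings) that is added to the
    attention logits before normalization.
    """
    d = q_mod1.shape[-1]
    scores = q_mod1 @ k_mod2.T / np.sqrt(d)   # combine query and key
    weights = softmax(scores + graph_bias)    # weight based on the knowledge graph
    return weights @ v_mod2                   # combine with the value vector

rng = np.random.default_rng(0)
n, d = 4, 8                                   # 4 nodes, 8-dim embeddings (toy sizes)
textual = rng.normal(size=(n, d))             # first modality (query)
visual = rng.normal(size=(n, d))              # second modality (key, value)
bias = np.zeros((n, n))                       # placeholder graph bias
out = cross_modal_attention(textual, visual, visual, bias)
```

Swapping `textual` and `visual` in the call yields the symmetric pass in which the second modality supplies the query.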
According to some embodiments, multi-modal graph encoder 540 generates a first feature embedding representing a user and a second feature embedding representing a content item of the set of content items based on the knowledge graph, where the second feature embedding is generated using a first modality as a query vector of an attention mechanism and a second modality as a key vector of the attention mechanism.
In some examples, the multi-modal graph encoder 540 includes a symmetric bimodal attention network. In some examples, the symmetric bimodal attention network includes a first multi-head attention module corresponding to the first modality and a second multi-head attention module corresponding to the second modality. Multi-modal graph encoder 540 is an example of, or includes aspects of, the corresponding element described with reference to
The described methods may be implemented or performed by devices that include a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof. A general-purpose processor may be a microprocessor, a conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices (e.g., a combination of a DSP and a microprocessor, multiple microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration). Thus, the functions described herein may be implemented in hardware or software and may be executed by a processor, firmware, or any combination thereof. If implemented in software executed by a processor, the functions may be stored in the form of instructions or code on a computer-readable medium.
Computer-readable media includes both non-transitory computer storage media and communication media including any medium that facilitates transfer of code or data. A non-transitory storage medium may be any available medium that can be accessed by a computer. For example, non-transitory computer-readable media can comprise random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), compact disk (CD) or other optical disk storage, magnetic disk storage, or any other non-transitory medium for carrying or storing data or code.
Also, connecting components may be properly termed computer-readable media. For example, if code or data is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technology such as infrared, radio, or microwave signals, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technology are included in the definition of medium. Combinations of media are also included within the scope of computer-readable media.
As illustrated in
Multi-modal graph encoder 605 generates a first feature embedding and a second feature embedding, which are input to recommendation component 610. Recommendation component 610 compares the first feature embedding to the second feature embedding to obtain a similarity score between a user and a content item. Recommendation component 610 recommends a content item from the set of content items for a user based on the similarity score. Recommendation component 610 is an example of, or includes aspects of, the corresponding element described with reference to
According to some embodiments of the present disclosure, the item recommendation apparatus 500 (
Multi-modal graph encoder 700 includes a symmetric bimodal attention (SBA) network. In some cases, the SBA network may also be referred to as a co-attention network. According to an embodiment, multi-modal graph encoder 700 can simultaneously process two or more modalities, and each modality has its own multi-head attention module. In some cases, the query and the (key, value) pair do not use the same input from a single modality. That is, the query and the (key, value) pair may depend on different inputs from different modalities. For example, the embedding in a first modality (i.e., modality 1) is used as the query vector 725 input for the second modality (i.e., modality 2) multi-head attention unit, while the embedding in the second modality (modality 2) is used as the query vector 725 input for the first modality (modality 1) multi-head attention unit. Multi-modal graph encoder 700 is configured for parallel sequential inputs, such as video/transcript or video/sound pairs. The node-edge relationships in the knowledge graph form complex relations between entities in a non-Euclidean space. According to an embodiment, the second feature embedding 710 is generated using the first modality for query vector 725 of an attention mechanism and the second modality for key vector 730 and value vector 735 of the attention mechanism. Additionally, the first feature embedding 705 is generated using the second modality for query vector 725 of an attention mechanism and the first modality for key vector 730 and value vector 735 of the attention mechanism.
According to an embodiment of the present disclosure, multi-modal graph encoder 700 incorporates additional spatial information. In some cases, the additional spatial information may be referred to as spatial encoding and edge encoding. Spatial encoding matrix 715 represents or includes spatial encoding information. Edge encoding matrix 720 represents or includes edge encoding information. For example, spatial encoding matrix 715 considers the hop-information between the nodes in the knowledge graph structure. Additionally, edge encoding matrix 720 corresponds to the heterogeneity of link connections, for example, different types of relations. In some examples, the relations may include “follows”, “views”, “creates”, etc.
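By way of a non-limiting illustration, the spatial and edge encodings can be mapped to scalar biases that are added to the attention logits, with one learnable scalar per hop distance and per relation type. The matrices and randomly initialized biases below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
n, max_hops, num_edge_types = 4, 5, 3

# Toy encodings for a 4-node graph (hypothetical values):
# spatial[i][j] = hop count; edge_type[i][j] = relation id, or -1 if no direct edge.
spatial = np.array([[0, 1, 1, 2], [1, 0, 2, 3], [1, 2, 0, 1], [2, 3, 1, 0]])
edge_type = np.array([[-1, 0, 1, -1], [0, -1, -1, -1], [1, -1, -1, 2], [-1, -1, 2, -1]])

# One scalar per hop distance and per edge type; randomly initialized here,
# but learned as model parameters during training in an embodiment.
hop_bias = rng.normal(size=max_hops + 1)
edge_bias = rng.normal(size=num_edge_types)

bias = hop_bias[spatial]                      # spatial encoding contribution
has_edge = edge_type >= 0
bias = bias + np.where(has_edge, edge_bias[np.clip(edge_type, 0, None)], 0.0)
# 'bias' can now be added to the attention logits before the softmax.
```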
In
Some examples of the method, apparatus, and non-transitory computer readable medium further include generating a spatial encoding matrix representing a number of hops between nodes of the knowledge graph, wherein the knowledge graph includes the spatial encoding matrix.
Some examples of the method, apparatus, and non-transitory computer readable medium further include generating an edge encoding matrix representing edge types between nodes of the knowledge graph, wherein the knowledge graph includes the edge encoding matrix. In some examples, the edge types represent types of interactions between users and content items.
Some examples of the method, apparatus, and non-transitory computer readable medium further include generating a visual embedding for the second content item, wherein the query vector is generated based on the visual embedding.
Some examples of the method, apparatus, and non-transitory computer readable medium further include generating a textual embedding based on the second content item, wherein the key vector is generated based on the textual embedding.
Some examples of the method, apparatus, and non-transitory computer readable medium further include combining the query vector of the first modality and the key vector of the second modality to obtain a combined vector. Some examples further include weighting the combined vector based on the knowledge graph to obtain a weighted vector.
Some examples of the method, apparatus, and non-transitory computer readable medium further include combining the weighted vector with the value vector of the second modality, wherein the second feature embedding is based on the combination of the weighted vector and the value vector.
Some examples of the method, apparatus, and non-transitory computer readable medium further include generating a first symmetric feature embedding using the first modality as the query vector and the second modality as the key vector. Some examples further include generating a second symmetric feature embedding using the second modality as a symmetric query vector and the first modality as a symmetric key vector, wherein the second feature embedding is based on the first symmetric feature embedding and the second symmetric feature embedding.
Some examples of the method, apparatus, and non-transitory computer readable medium further include computing a cosine similarity, wherein the similarity score is based on the cosine similarity.
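By way of a non-limiting illustration, the cosine similarity between a user feature embedding and an item feature embedding can be computed as follows (the embedding values are hypothetical):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

user_embedding = [1.0, 0.0, 1.0]
item_embedding = [1.0, 1.0, 0.0]
score = cosine_similarity(user_embedding, item_embedding)  # 0.5
```

Content items may then be ranked by this score, with the highest-scoring items recommended to the user.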
At operation 805, the system receives input indicating a relationship between a user and a first content item. In some cases, the operations of this step refer to, or may be performed by, machine learning model as described with reference to
At operation 810, the system generates a knowledge graph based on the input, where the knowledge graph includes relationship information between a node representing the user and a set of nodes corresponding to a set of content items including the first content item. In some cases, the operations of this step refer to, or may be performed by, knowledge graph component as described with reference to
A knowledge graph captures node-edge relationships (i.e., an entity-relation structure) connecting items with their corresponding attributes in a non-Euclidean space. In some examples, the knowledge graph includes both homogeneous information and heterogeneous information. For example, entities (represented as nodes in knowledge graphs) can be different types of objects. In some cases, a knowledge graph includes a spatial encoding matrix representing a number of hops between nodes of the knowledge graph, and an edge encoding matrix representing edge types between nodes of the knowledge graph.
At operation 815, the system generates a first feature embedding representing the user and a second feature embedding representing a second content item of the set of content items using a multi-modal graph encoder based on the knowledge graph, where the second feature embedding is generated using a first modality for a query vector of an attention mechanism and a second modality for a key vector and a value vector of the attention mechanism. In some cases, the operations of this step refer to, or may be performed by, multi-modal graph encoder as described with reference to
According to an embodiment, the multi-modal graph encoder can handle input information of different types (i.e., modalities such as visual, textual, or acoustic information), where each modality has its own multi-head attention module. In some examples, the query and the (key, value) pair are constructed using input from different modalities. For example, the embedding from a first modality (i.e., modality 1) is used as the query input for a second modality (i.e., modality 2) multi-head attention unit, while the embedding in the second modality (modality 2) is used as the query input for the first modality (modality 1) multi-head attention unit. Therefore, the multi-modal graph encoder is able to handle parallel sequential inputs, such as video/transcript or video/sound pairs. Multi-modal features thereby increase content understanding.
At operation 820, the system compares the first feature embedding to the second feature embedding to obtain a similarity score between the user and the second content item. In some cases, the operations of this step refer to, or may be performed by, recommendation component as described with reference to
At operation 825, the system recommends the second content item for the user based on the similarity score. In some cases, the operations of this step refer to, or may be performed by, recommendation component as described with reference to
According to an embodiment, an item recommendation network includes graph constructions, nodes, and relations. In some cases, knowledge graph 923 includes multiple types of entities and multiple types of relations between the entities. In some examples, knowledge graph 923 includes five types of entities and five types of relations. A node may represent a video, viewer, streamer, etc. In some examples, knowledge graph 923 includes 331,790 nodes. Additionally, relations may link two nodes by defining a relationship between the nodes. In some examples, the recommendation network may include 2,253,641 relations spanning different types of relations such as views, follows, creates, etc. For example, the relation between a viewer and a streamer can be defined using a “follows” relation, since the viewer follows the streamer. Similarly, the relationship between a viewer and a video may be defined using a “views” relation, since the viewer views the video.
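By way of a non-limiting illustration in the spirit of the typed entities and relations described above, a miniature heterogeneous graph can be represented with typed nodes and relation triples (the node names below are illustrative, not from the disclosure):

```python
# Hypothetical miniature knowledge graph: typed nodes plus relation triples.
nodes = {
    "viewer_1": "viewer",
    "streamer_1": "streamer",
    "video_1": "video",
}
relations = [
    ("viewer_1", "follows", "streamer_1"),   # viewer follows streamer
    ("viewer_1", "views", "video_1"),        # viewer views video
    ("streamer_1", "creates", "video_1"),    # streamer creates video
]

def neighbors(node, relation):
    """All nodes linked from `node` by the given relation type."""
    return [t for (s, r, t) in relations if s == node and r == relation]

followed = neighbors("viewer_1", "follows")  # ["streamer_1"]
```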
In some examples, the item recommendation network takes visual information 900 and textual information 905 (e.g., a video and text) as input. A video feature embedding network (VFE) and a universal sentence embedding network (USE) may be used to obtain node embeddings corresponding to node feature modalities. That is, an image encoder is configured to generate visual encoding 910 while a text encoder is configured to generate textual encoding 915. Furthermore, a multi-modal graph encoder (MMGE) is used to model the encodings/embeddings by incorporating information from knowledge graph 923 (spatial encodings and edge encodings). Visual encoding 910 and textual encoding 915 are input to multi-modal graph encoding 920. Visual information 900 is an example of, or includes aspects of, the corresponding element described with reference to
Training the item recommendation network will be described in greater detail in
At operation 1005, the system generates a visual embedding for the second content item, where the query vector is generated based on the visual embedding. In some cases, the operations of this step refer to, or may be performed by, image encoder as described with reference to
At operation 1010, the system generates a textual embedding based on the second content item, where the key vector is generated based on the textual embedding. In some cases, the operations of this step refer to, or may be performed by, text encoder as described with reference to
In some examples, the second content item includes a transcript or document describing the video. The system generates word embeddings corresponding to the transcript using a text encoder as described in
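By way of a non-limiting illustration, one simple way to pool word embeddings into a single document embedding is mean pooling; the toy vocabulary and random embeddings below are hypothetical stand-ins for a pretrained text encoder and are not the disclosed USE network:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical vocabulary of word embeddings (a stand-in for a pretrained
# sentence/word encoder; real embeddings would be learned, not random).
vocab = {w: rng.normal(size=8) for w in ["cooking", "tutorial", "pasta"]}

def textual_embedding(transcript_words):
    """Mean-pool word embeddings into a single transcript embedding."""
    vecs = [vocab[w] for w in transcript_words if w in vocab]
    return np.mean(vecs, axis=0)

doc = textual_embedding(["cooking", "tutorial"])  # shape (8,)
```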
At operation 1015, the system combines the query vector of the first modality and the key vector of the second modality to obtain a combined vector. In some cases, the operations of this step refer to, or may be performed by, multi-modal graph encoder as described with reference to
At operation 1020, the system weights the combined vector based on the knowledge graph to obtain a weighted vector. In some cases, the operations of this step refer to, or may be performed by, multi-modal graph encoder as described with reference to
At operation 1025, the system combines the weighted vector with the value vector of the second modality, where the second feature embedding is based on the combination of the weighted vector and the value vector. In some examples, the multi-modal graph encoder combines the weighted vector with the value vector of the second modality to obtain the second feature embedding via matrix multiplication. The second feature embedding represents a content item of a set of content items. In some cases, the operations of this step refer to, or may be performed by, multi-modal graph encoder as described with reference to
At operation 1105, the system generates a first symmetric feature embedding using the first modality as the query vector and the second modality as the key vector. In some cases, the operations of this step refer to, or may be performed by, multi-modal graph encoder as described with reference to
At operation 1110, the system generates a second symmetric feature embedding using the second modality as a symmetric query vector and the first modality as a symmetric key vector. In some cases, the operations of this step refer to, or may be performed by, multi-modal graph encoder as described with reference to
At operation 1115, the system generates the second feature embedding based on the first symmetric feature embedding and the second symmetric feature embedding. In some cases, the operations of this step refer to, or may be performed by, multi-modal graph encoder as described with reference to
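The disclosure above does not fix a particular fusion of the two symmetric feature embeddings; by way of a non-limiting sketch, one plausible choice is concatenation followed by a projection (the random projection below is a hypothetical stand-in for a learned layer):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8
# Outputs of the two symmetric attention passes (toy values):
emb_mod1_query = rng.normal(size=d)   # first modality as query, second as key
emb_mod2_query = rng.normal(size=d)   # second modality as symmetric query

# Concatenate the symmetric embeddings, then project back to d dimensions.
fused = np.concatenate([emb_mod1_query, emb_mod2_query])
projection = rng.normal(size=(2 * d, d)) / np.sqrt(2 * d)  # stand-in for a learned layer
second_feature_embedding = fused @ projection
```

Other fusion choices, such as element-wise averaging, would also be consistent with the operations above.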
In
Some examples of the method, apparatus, and non-transitory computer readable medium further include identifying a first content item and a second content item. Some examples further include determining that a user prefers the first content item over the second content item using similarity scores for the first content item and the second content item. Some examples further include computing a ranking loss based on the determination, wherein the loss function includes the ranking loss.
Some examples of the method, apparatus, and non-transitory computer readable medium further include identifying a positive sample pair comprising a user and a first content item that is preferred by the user. Some examples further include identifying a negative sample pair comprising the user and a second content item that is not preferred by the user. Some examples further include computing a contrastive learning loss based on the positive sample pair and the negative sample pair, wherein the loss function includes the contrastive learning loss.
During the training process, the parameters and weights of the machine learning model are adjusted to increase the accuracy of the result (i.e., by minimizing a loss function which corresponds in some way to the difference between the current result and the target result). The weight of an edge increases or decreases the strength of the signal transmitted between nodes. In some cases, nodes have a threshold below which a signal is not transmitted at all. In some examples, the nodes are aggregated into layers. Different layers perform different transformations on their inputs. The initial layer is known as the input layer and the last layer is known as the output layer. In some cases, signals traverse certain layers multiple times.
At operation 1205, the system receives training data including relationships between a set of users and a set of content items. In some examples, the training data includes different types of relations such as views, follows, creates, etc. The relation between a viewer and a streamer can be defined using a “follows” relation, since the viewer follows the streamer. Similarly, the relationship between a viewer and a video may be defined using a “views” relation, since the viewer views the video. In some cases, the operations of this step refer to, or may be performed by, training component as described with reference to
At operation 1210, the system generates a knowledge graph based on the training data, where the knowledge graph represents the relationships between the set of users and the set of content items. According to an embodiment, the knowledge graph includes a spatial encoding matrix representing a number of hops between nodes of the knowledge graph, and an edge encoding matrix representing edge types between the nodes. In some cases, the operations of this step refer to, or may be performed by, knowledge graph component as described with reference to
At operation 1215, the system generates a first feature embedding representing a user and a second feature embedding representing a content item of the set of content items using a multi-modal graph encoder based on the knowledge graph, where the second feature embedding is generated using a first modality as a query vector of an attention mechanism and a second modality as a key vector of the attention mechanism. In some cases, the operations of this step refer to, or may be performed by, multi-modal graph encoder as described with reference to
At operation 1220, the system computes a loss function based on the first feature embedding and the second feature embedding. In some cases, the operations of this step refer to, or may be performed by, training component as described with reference to
The term loss function refers to a function that impacts how a machine learning model is trained in a supervised learning model. Specifically, during each training iteration, the output of the model is compared to the known annotation information in the training data. The loss function provides a value for how close the predicted annotation data is to the actual annotation data. After computing the loss function, the parameters of the model are updated accordingly and a new set of predictions is made during the next iteration.
According to an embodiment of the present disclosure, the recommendation network includes an optimization objective as follows: ℒ_rank = −ln σ(uᵀi − uᵀi′) + λ‖Θ‖₂². A metric learning loss may include a neighboring contrastive (NC) loss or triplet loss, i.e., ℒ_metric. Therefore, the total loss function is formulated as ℒ_total = ℒ_rank + ℒ_metric.
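By way of a non-limiting illustration, the ranking term of the objective above can be computed for toy embeddings as follows, where u is a user embedding, i is a preferred item, i′ is a sampled negative item, and λ weights the L2 regularization (the embedding values and λ are hypothetical):

```python
import math

def bpr_rank_loss(u, i_pos, i_neg, params_sq_norm, lam=0.01):
    """Ranking loss: -ln sigma(u·i - u·i') + lam * ||Theta||_2^2 (toy version)."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    margin = dot(u, i_pos) - dot(u, i_neg)      # preferred item should score higher
    sigma = 1.0 / (1.0 + math.exp(-margin))     # logistic sigmoid
    return -math.log(sigma) + lam * params_sq_norm

u = [0.5, 0.5]
i_pos = [1.0, 0.0]   # item the user interacted with
i_neg = [0.0, 0.2]   # sampled negative item
loss = bpr_rank_loss(u, i_pos, i_neg, params_sq_norm=1.0)
```

In an embodiment, a metric learning term would be added to this value to form the total loss.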
At operation 1225, the system updates parameters of the multi-modal graph encoder based on the loss function. In some cases, the operations of this step refer to, or may be performed by, training component as described with reference to
At operation 1305, the system identifies a first content item and a second content item. In some cases, the operations of this step refer to, or may be performed by, training component as described with reference to
At operation 1310, the system determines that a user prefers the first content item over the second content item using similarity scores for the first content item and the second content item. In some cases, the operations of this step refer to, or may be performed by, training component as described with reference to
At operation 1315, the system computes a ranking loss based on the determination, where the loss function includes the ranking loss. In some cases, the operations of this step refer to, or may be performed by, training component as described with reference to
In some examples, a ranking loss is used to predict relative distances between inputs (also known as metric learning). A ranking loss function depends on a similarity score between data points. The similarity score can be binary (similar or dissimilar). The training component (see
The multi-modal graph encoder as described in
At operation 1405, the system identifies a positive sample pair including a user and a first content item that is preferred by the user. In some cases, the operations of this step refer to, or may be performed by, training component as described with reference to
In some examples, the multi-modal graph encoder is trained using a contrastive learning loss, which pushes apart dissimilar pairs (referred to as negative pairs) while pulling together similar pairs (referred to as positive pairs). In some examples, a first content item that is preferred by the user is identified as a positive sample. The first content item and the user form a positive pair. Additionally, a second content item that is not preferred by the user is identified as a negative sample. The second content item and the user form a negative pair.
At operation 1410, the system identifies a negative sample pair including the user and a second content item that is not preferred by the user. In some cases, the operations of this step refer to, or may be performed by, training component as described with reference to
At operation 1415, the system computes a contrastive learning loss based on the positive sample pair and the negative sample pair, where the loss function includes the contrastive learning loss. In some cases, the operations of this step refer to, or may be performed by, training component as described with reference to
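By way of a non-limiting illustration, an InfoNCE-style contrastive loss over one positive pair and a set of negative pairs can be computed as follows (the similarity scores and temperature are hypothetical):

```python
import math

def contrastive_loss(sim_pos, sim_negs, temperature=0.1):
    """InfoNCE-style loss: pull the positive pair together, push negatives apart.

    sim_pos: similarity score of the (user, preferred item) pair.
    sim_negs: similarity scores of (user, non-preferred item) pairs.
    """
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    m = max(logits)                                  # for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_denom)                  # -log softmax of the positive

low = contrastive_loss(sim_pos=0.9, sim_negs=[0.1])   # positive well separated
high = contrastive_loss(sim_pos=0.2, sim_negs=[0.8])  # positive poorly separated
```

A well-separated positive pair yields a lower loss than a poorly separated one, which drives the encoder to score preferred items above non-preferred ones.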
Performance of the apparatus, systems, and methods of the present disclosure has been evaluated, and results indicate that embodiments of the present disclosure obtain increased performance over existing technology. Example experiments demonstrate that the item recommendation apparatus outperforms conventional systems.
The description and drawings described herein represent example configurations and do not represent all the implementations within the scope of the claims. For example, the operations and steps may be rearranged, combined or otherwise modified. Also, structures and devices may be represented in the form of block diagrams to represent the relationship between components and avoid obscuring the described concepts. Similar components or features may have the same name but may have different reference numbers corresponding to different figures.
Some modifications to the disclosure may be readily apparent to those skilled in the art, and the principles defined herein may be applied to other variations without departing from the scope of the disclosure. Thus, the disclosure is not limited to the examples and designs described herein, but is to be accorded the broadest scope consistent with the principles and novel features disclosed herein.
In this disclosure and the following claims, the word “or” indicates an inclusive list such that, for example, the list of X, Y, or Z means X or Y or Z or XY or XZ or YZ or XYZ. Also the phrase “based on” is not used to represent a closed set of conditions. For example, a step that is described as “based on condition A” may be based on both condition A and condition B. In other words, the phrase “based on” shall be construed to mean “based at least in part on.” Also, the words “a” or “an” indicate “at least one.”