Search systems and recommender systems are both online services that recommend content to a computer user (or, more simply, a “user”) in response to a query. Search systems respond to a query with a focused set of results that are viewed as “answers” to the query. In contrast, recommender systems are not necessarily tasked with responding with “answers,” i.e., content that specifically relates to the query. Instead, recommender systems respond to queries with recommended content, i.e., content calculated to lead a requesting user to discover new content. Roughly, search systems provide a focused scope on a specific topic while recommender systems provide a broadened scope. For both types of systems, however, it is quite common for the requesting user to submit a text-based query and, in response, expect non-text content items.
There are online hosting services whose primary focus is to maintain non-textual content items for their users/subscribers. These content items are maintained as a corpus of content items that often becomes quite large. Indeed, at least one existing hosting service maintains a corpus that includes over a billion content items that have been posted to the hosting service by its users/subscribers. However, determining which of those billions of content items should be presented or recommended to a user is often difficult. Further, traditional recommendation systems are typically configured to determine and recommend content items to a user/subscriber to encourage immediate interaction, by the user/subscriber, with the recommended content items.
The foregoing aspects and many of the attendant advantages of the disclosed subject matter will become more readily appreciated as they are better understood by reference to the following description when taken in conjunction with the following drawings, wherein:
Disclosed are systems and methods that determine recommended non-text content items (e.g., images) based on one or more selected or provided content items, referred to herein as session content items. As discussed further below, the disclosed implementations may generate a content item caption for each session content item and/or generate a session caption that is descriptive of the group of session content items. The caption(s) may then be processed by a Large Language Model (“LLM”), which generates an LLM output that includes a narrative description of the session content items. The narrative description may then be used as a text-based request to a query service that identifies and returns one or more recommended content items. Alternatively, the LLM output may be a list of content item identifiers, selected by the LLM from a provided set of content item identifiers that may also have corresponding captions, identifying recommended content items that are responsive to the session content items. The recommended content items may then be provided for presentation to a user, utilized to generate a category, vertical, etc.
As discussed further below, in some implementations, the query service, in response to a text-based request, may process the text-based request into a set of word pieces from terms of the received request. In some implementations, at least one term of the received request results in at least two word pieces. Embedding vectors that project source content (in this case word pieces) into a content item embedding space are generated for each word piece of the set of word pieces for the received request, and the embedding vectors are combined into a representative embedding vector for the request. A set of content items of a corpus of content items are identified according to the representative embedding vector as projected into the content item embedding space. At least some of the content items from the set of content items are returned as content items in response to the request from the subscriber.
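By way of illustration and not limitation, the word-piece processing described above may be sketched as follows, where the word-piece vocabulary, the toy 4-element embedding vectors, and the element-wise averaging used to form the representative embedding vector are illustrative assumptions rather than the only suitable choices:

```python
import math

# Hypothetical toy table mapping word pieces to 4-element embedding
# vectors; a production embedding vector generator might produce,
# e.g., 128-element vectors instead.
WORD_PIECE_EMBEDDINGS = {
    "side": [0.9, 0.1, 0.0, 0.2],
    "##board": [0.8, 0.2, 0.1, 0.1],
    "lamp": [0.1, 0.9, 0.3, 0.0],
}

def representative_embedding(word_pieces):
    """Combine the word-piece embedding vectors into one representative
    vector for the request, here by element-wise averaging."""
    vectors = [WORD_PIECE_EMBEDDINGS[wp] for wp in word_pieces]
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine_similarity(vec_a, vec_b):
    """Similarity of two embedding vectors in the embedding space."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b)

def identify_content_items(request_vector, corpus, top_k=1):
    """Return identifiers of the corpus content items whose embedding
    vectors lie closest to the representative request vector."""
    ranked = sorted(corpus, key=lambda item_id: cosine_similarity(
        request_vector, corpus[item_id]), reverse=True)
    return ranked[:top_k]

# "sideboard" is assumed to split into the two word pieces below,
# satisfying the case of one term yielding at least two word pieces.
request_vector = representative_embedding(["side", "##board"])
corpus = {
    "item-1": [0.85, 0.15, 0.05, 0.15],  # sideboard-like image
    "item-2": [0.05, 0.95, 0.25, 0.05],  # lamp-like image
}
print(identify_content_items(request_vector, corpus))  # ['item-1']
```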
Further, while the present disclosure describes certain implementations for utilizing and/or determining embedding vectors for content items and/or subscribers of an online service and recommending content items to subscribers of the online service, exemplary embodiments of the present disclosure contemplate utilizing embedding vectors for subscribers of the online service that are generated using other methods, systems, and implementations and recommending content items to subscribers of the online service that employ other methods, systems, and implementations. For example, exemplary embodiments of the present disclosure may recommend content items and/or employ embedding vectors that are generated as described in U.S. patent application Ser. No. 16/273,939 and/or U.S. patent application Ser. No. 18/166,415, which are both hereby incorporated by reference in their entireties. Accordingly, the embedding vectors utilized by implementations of the present disclosure may encode visual, semantic, and other features of the content items they represent.
In further implementations of the disclosed subject matter, recommended content items may be determined based on a long-term objective. For example, a long-term objective, such as a cumulative engagement, may be determined and/or defined, and a reverse inference learning technique may be employed to train and/or optimize a recommendation system to determine recommended content items so as to encourage the long-term objective. According to certain aspects, one or more mappings that correlate content items to the long-term objective may be generated, and the mappings may be employed to optimize and/or configure the recommendation system to determine recommended content items in view of the long-term objective. In this regard, one or more interim mappings may be used to map content items to the long-term objective. For example, a first mapping may map parameters associated with an aggregation of sessions to the long-term objective, a second mapping may map features within an individual session to the parameters associated with the aggregation of sessions, and a third mapping may map individual content items to the features within the individual sessions. In certain implementations, the mappings may be based on attributions that are determined in connection with the content items, which may reflect which content items drive subscribers to the long-term objective. For example, alignment scores and/or attention scores may be generated for the content items that represent a relevance and/or influence that each content item may have in driving or motivating subscribers towards the long-term objective.
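As one illustrative sketch of the attribution described above, alignment scores may be softmax-normalized into attention weights that apportion the long-term objective (e.g., cumulative engagement) across content items; the item identifiers, score values, and the softmax normalization itself are assumptions for illustration:

```python
import math

def attention_scores(alignment_scores):
    """Softmax-normalize raw alignment scores into attention weights
    that sum to 1, reflecting each content item's relative influence
    in driving subscribers toward the long-term objective."""
    exps = [math.exp(s) for s in alignment_scores]
    total = sum(exps)
    return [e / total for e in exps]

def attribute_to_objective(item_ids, alignment_scores, objective_value):
    """Map individual content items to a share of the long-term
    objective via their attention weights, producing one interim
    mapping from content items to the objective."""
    weights = attention_scores(alignment_scores)
    return {item_id: w * objective_value
            for item_id, w in zip(item_ids, weights)}

# Hypothetical alignment scores for three content items in a session,
# attributing a cumulative-engagement value of 10.0 across them.
attribution = attribute_to_objective(
    ["item-a", "item-b", "item-c"], [2.0, 1.0, 0.5], objective_value=10.0)
```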
Further, according to other aspects of the present disclosure, the state of the subscribers within the online service ecosystem may be determined and considered in determining the recommended content items and/or the one or more mappings from the individual content items to the long-term objective. For example, probabilities associated with subscriber state transitions may be determined and the transition probabilities for the various subscriber states may be utilized in determining the recommended content items. The exemplary recommendation system provided according to exemplary embodiments of the present disclosure may further consider the subscriber state, along with the various transition probabilities, in determining the one or more recommended content items based on the long-term objective to return and/or present to the subscriber. Thus, unlike traditional recommendation systems, which are typically configured to determine content recommendations to optimize immediate interactions with the recommended content items, the content items recommended by the exemplary recommendation system are determined based on the long-term objective to encourage the long-term behavior associated with the objective, and not necessarily provoke an immediate interaction with the recommended content items.
By way of definition, and as those skilled in the art will appreciate, an “embedding vector” is an array of values that reflect aspects and features of source/input content. For example, an embedding vector of an image will include an array of values describing aspects and features of that image. An executable model or process, referred to as an embedding vector generator, generates an embedding vector for input content. Indeed, the embedding vector generator applies the same learned features to identify and extract information from each instance of input content, and this processing leads to the generation of an embedding vector for an instance of input content. As those skilled in the art will appreciate, embedding vectors generated by the same embedding vector generator based on the expected input content are comparable, such that a greater similarity between two embedding vectors indicates a greater similarity between the source items—at least as determined by the embedding vector generator. By way of illustration and not limitation, an embedding vector may comprise 128 elements, each element represented by a 32-bit or 64-bit floating point value, each value representative of some aspect (or multiple aspects) of the input content. In other implementations, the embedding vector may have additional or fewer elements, and each element may be represented by larger or smaller floating-point values, integer values, and/or binary values.
As those skilled in the art will appreciate, embedding vectors are comparable across the same element within the embedding vectors. For example, a first element of a first embedding vector can be compared to a first element of a second embedding vector generated by the same embedding vector generator on distinct input items. This type of comparison is typically viewed as a determination of similarity for that particular element between the two embedding vectors. On the other hand, the first element of a first embedding vector cannot typically be compared to the second element of a second embedding vector because the embedding vector generator generates the values of the different elements based on distinct and usually unique aspects and features of input items.
Regarding embedding vector generators, typically an embedding vector generator accepts input content (e.g., an image, video, or multi-item content), processes the input content through various levels of convolution, and produces an array of values that specifically reflect the input data, i.e., an embedding vector. Due to the nature of a trained embedding vector generator (i.e., the convolutions that include transformations, aggregations, subtractions, extrapolations, normalizations, etc.), the contents or values of the resulting embedding vectors are often meaningless to human examination. However, collectively, the elements of an embedding vector can be used to project or map the corresponding input content into an embedding space as defined by the embedding vectors.
As indicated above, two embedding vectors (generated from the same content type by the same embedding vector generator) may be compared for similarity as projected within the corresponding embedding space. The closer that two embedding vectors are located within the embedding space, the more similar the input content from which the embedding vectors were generated.
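By way of illustration and not limitation, such a similarity determination may be sketched as follows, where the 4-element vectors are hypothetical and Euclidean distance is only one of several suitable measures (cosine similarity being another):

```python
import math

def embedding_distance(vec_a, vec_b):
    """Euclidean distance between two embedding vectors as projected
    into the embedding space; a smaller distance implies the source
    content items are more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))

# Hypothetical 4-element embedding vectors generated from three images
# by the same embedding vector generator.
image_1 = [0.9, 0.1, 0.0, 0.2]
image_2 = [0.8, 0.2, 0.1, 0.1]  # visually similar to image_1
image_3 = [0.0, 0.9, 0.8, 0.7]  # dissimilar to image_1

print(embedding_distance(image_1, image_2))  # small: similar content
print(embedding_distance(image_1, image_3))  # large: dissimilar content
```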
Network 108 is a computer network, also commonly referred to as a data network. As those skilled in the art will appreciate, network 108 is fundamentally a telecommunication network over which computers, computing devices, such as computing devices 102, 104, and 106, and other network-enabled devices and/or services can electronically communicate, including exchanging information and data among the computers, devices, and services. In computer networks, networked computing devices are viewed as nodes of the network. Thus, in the exemplary networked environment 100, computing devices 102, 104, and 106, as well as hosting service 130, are nodes of network 108.
When communicating with other devices and/or services over network 108, connections between devices and/or services are made using either cable media (e.g., physical connections that may include electrical and/or optical communication lines), wireless media (e.g., wireless connections such as 802.11x, Bluetooth, and/or infrared connections), or some combination of both. While a well-known computer network is the Internet, the disclosed subject matter is not limited to the Internet. Indeed, elements of the disclosed subject matter may be suitably and satisfactorily implemented on wide area networks, local area networks, enterprise networks, and the like.
As illustrated in exemplary network environment 100 of
As indicated above, hosting service 130 is an online service that, among other things, maintains corpus of content items 134. The content items of this corpus are typically obtained from one or more subscribers and/or other providers (e.g., businesses) through a posting service of the hosting service (also called a hosting system), a recommender service that provides recommended content (content items) to a subscriber, and/or a search service that responds to a request with related/relevant content items. Indeed, hosting service 130 is a network-accessible service that typically provides application programming interfaces (APIs), processes, and functions to its users/subscribers, including those described herein.
According to aspects of the disclosed subject matter, computer users, such as computer users 101, 103, and 105, may be subscribers of the various services of hosting service 130, i.e., making use of one or more features/functions/services of hosting service 130. Indeed, according to aspects of the disclosed subject matter, a subscriber is a computer user that takes advantage of services available for an online service, such as hosting service 130. In exemplary networked environment 100 of
In accordance with aspects of the disclosed subject matter, a subscriber requesting content from hosting service 130, such as computer user 101, submits request 120 to hosting service 130. Request 120 may be a text-based request, such as a text-based search query, a selection of multiple content items from corpus of content items 134 that are submitted as the request, one or more content items uploaded or provided by the user to hosting service 130 as request 120, etc. Request 120 may be an explicit request, such as a text-based search request or a specific search request in which one or more content items are selected or provided by a user. In other examples, the request may be implicit. For example, as a user browses content items of hosting service 130, hosting service 130 may maintain identifiers of the browsed content items and utilize those content items as the basis for a request. As another example, if a user selects to view a close-up of a content item from corpus of content items 134, that content item may be utilized as a request to determine other content items that are similar to the viewed content item. According to other aspects, request 120 may be included as part of and/or in connection with a request to access a homepage and/or a home feed, an indication that recommended content items are to be pushed to a subscriber, and the like. Still further, the disclosed implementations may be utilized to determine content items without an explicit or implicit request from a user. For example, the disclosed implementations may be used to determine content items that are like one or more other content items (e.g., have a similar style, fashion, etc.). Accordingly, it will be appreciated that the disclosed implementations are operable with any type of text-based request or content item-based request, regardless of whether it is a request from a user (explicit or implicit) or otherwise.
In response to a request 120 for content, hosting service 130 draws from corpus of content items 134, identifying one or more content items that satisfy the request. As will be set forth in greater detail below, and according to aspects of the disclosed subject matter, if request 120 is a text-based request, a set of word pieces is generated for the terms of request 120. If request 120 includes one or more content items, those content item(s) may be processed, as discussed further herein, to generate a caption for the content item(s) (either individually or collectively) and that caption(s) may then be processed into a text-based request from which word pieces are generated for the request. Embedding vectors for the word pieces are determined and combined to form a representative embedding vector for the request. Using the representative embedding vector, content items from the corpus are identified.
Alternatively, or in addition thereto, rather than determining word pieces for content items of request 120, the content item(s) of request 120 and at least some of the content items from corpus of content items 134, referred to herein as a reduced corpus, may be processed to determine captions of those content items, and those captions further processed, for example, by a Large Language Model (“LLM”), to determine content items from the reduced corpus that correspond to the content item(s) of the request. After identifying the content items, hosting service 130 returns the one or more content items to the requesting subscriber as response 122 to request 120 and/or handles them in accordance with the intent of request 120 (e.g., creates a taste preference guide).
According to another implementation of the present disclosure, hosting service 130 may identify content items from corpus of content items 134 based on a long-term objective. As is described in greater detail herein, in determining recommended content items in view of the long-term objective, implementations of the present disclosure seek to determine an attribution in connection with the content items, e.g., a measure of which content items drive the long-term objective. The attribution may be used to generate a mapping between one or more of the content items of corpus of content items 134 and the long-term objective, and the recommended content items may be determined based at least in part on the mapping. In exemplary implementations, the mapping may be generated utilizing one or more interim mappings that correlate the content items to the long-term objective. For example, alignment scores and/or attention scores may be determined for the content items that represent a relevance and/or influence that each content item may have in driving the long-term objective over a defined time period.
As shown in
As discussed herein, one or more services, whether internal to the hosting service or external and accessed by the hosting service, may process one or more content items to determine captions for each of the one or more content items and/or determine a caption for a plurality of content items. For example, an image encoder and language model, such as BLIP-2, FLAMINGO80B, VQAv2, etc., may be used to generate captions for each of a plurality of content items and/or a group of content items (or the captions for each content item combined to form a single caption for a plurality of content items). A caption, as used herein, is a short descriptive or explanatory text, usually one or two sentences long, that describes or explains a content item or a plurality of content items.
Likewise, as discussed further herein, a caption for each of a plurality of content items, or a caption for a group of content items, may be processed by an LLM to determine descriptors and/or a text request for the content item or plurality of content items of the request. Alternatively, in some implementations, an LLM input may be generated that includes captions for one or more content items of a request, captions for one or more content items of a reduced corpus, and instructions that the LLM determine one or more content items as recommended content items based on the captions of the one or more content items of the request.
As shown in
Each stage of recommendation system 200 may be configured to successively filter and rank content items obtained from a corpus of content items, so as to reduce and narrow down the number of content items from the corpus of content items in determining one or more content items to return in response to a request for content items. In the exemplary implementation shown in
As shown in
Content items 210-B may then be provided to content ranking stage 204, which may employ one or more machine learning models to further refine and/or rank the content items in identifying recommended content to provide to a subscriber. According to exemplary implementations of the present disclosure, content ranking stage 204 may process content items 210-B based on the long-term objective, the request for content items, information associated with the subscriber associated with the request for content items, and the like to rank content items 210-B. According to certain aspects of the present disclosure, a relevancy score may also be generated for content items 210-B based on the long-term objective, the request for content items, information associated with the subscriber associated with the request for content items, and the like. Accordingly, the highest ranked content items from content items 210-B and/or content items 210-B having a ranking above a threshold ranking may be identified as content items 210-C, which may be a subset of content items 210-B, and may be returned and provided by content ranking stage 204 to content blending stage 206.
Content items 210-C may then be provided to content blending stage 206, which may employ one or more machine learning models to further refine and/or rank the content items in identifying recommended content to provide to a subscriber. According to exemplary implementations of the present disclosure, content blending stage 206 may apply certain parameters and/or policies to process content items 210-C based on the long-term objective, the request for content items, information associated with the subscriber associated with the request for content items, features of content items 210-C, and the like to determine an order and/or priority for content items 210-C. According to certain aspects of the present disclosure, content blending stage 206 may be trained to predict optimal parameters and/or policies that are optimized for a defined reward. Accordingly, the highest prioritized and/or ordered content items from content items 210-C may be identified as content items 210-D, which may be a subset of content items 210-C, and may be provided by content blending stage 206 to content serving stage 208.
In turn, content items 210-D may then be provided to content serving stage 208. Content serving stage 208 may employ one or more machine learning models, probabilistic models, rule-based models, and the like to make a further determination as to which content items may be provided to the subscriber in response to the request for content items. As shown in
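The successive filtering performed by the four stages may be sketched as follows; the field names, relevancy scores, the simple tag match standing in for embedding-based retrieval, and the round-robin category blending are all illustrative assumptions:

```python
from collections import defaultdict
from itertools import zip_longest

# Hypothetical corpus entries; the field names and values are
# illustrative only.
CORPUS = [
    {"id": "c1", "tags": {"sideboard"}, "category": "furniture", "relevancy": 0.9},
    {"id": "c2", "tags": {"sideboard"}, "category": "furniture", "relevancy": 0.7},
    {"id": "c3", "tags": {"lamp"},      "category": "lighting",  "relevancy": 0.8},
    {"id": "c4", "tags": {"sideboard"}, "category": "lighting",  "relevancy": 0.6},
    {"id": "c5", "tags": {"rug"},       "category": "textiles",  "relevancy": 0.5},
]

def retrieval_stage(corpus, request_terms):
    """Stage 1 (content retrieval): coarse filter of the corpus, here a
    simple tag match standing in for embedding-based retrieval."""
    return [item for item in corpus if item["tags"] & request_terms]

def ranking_stage(items, top_k):
    """Stage 2 (content ranking): order by relevancy score, keep top k."""
    return sorted(items, key=lambda i: i["relevancy"], reverse=True)[:top_k]

def blending_stage(items):
    """Stage 3 (content blending): round-robin across categories so one
    category does not dominate the ordering."""
    by_category = defaultdict(list)
    for item in items:
        by_category[item["category"]].append(item)
    blended = []
    for group in zip_longest(*by_category.values()):
        blended.extend(i for i in group if i is not None)
    return blended

def serving_stage(items, page_size):
    """Stage 4 (content serving): final cut returned to the subscriber."""
    return items[:page_size]

recommended = serving_stage(
    blending_stage(ranking_stage(retrieval_stage(CORPUS, {"sideboard"}),
                                 top_k=3)),
    page_size=2)
print([item["id"] for item in recommended])
```

Each stage receives a subset of the previous stage's output, mirroring content items 210-A through 210-E narrowing toward the final recommendation.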
According to certain exemplary implementations, recommendation system 200 may be optimized and/or configured to determine content item recommendations (e.g., content items 210-E) in view of the long-term objective, as described herein in connection with at least
In the illustrated example, a user, during a session and through interaction with user device 301, selects or views a plurality of session content items 303-1, 303-2, through 303-X, as in 311. The selection of content items during the session constitutes session content items 303. Any number of content items may be selected during a session and included as session content items 303. In this example, the user is selecting different content items that are images of sideboards. As the content items are selected, the sequence in which each content item is selected may also be maintained or determined. As discussed above, session content items 303 may be selected from corpus of content items 334 that is accessible by user device 301 through hosting service 330. In other examples, some or all of the content items of the session content items may be selected from or provided by user device 301. For example, during the session, the user may take an image of a sideboard, and that image may be provided to hosting service 330 as a content item of the sequence of content items included in the session content items 303.
During or after the session, some or all of session content items 303 are sent, via network 308, from user device 301 to hosting service 330. For example, after the user has viewed five content items, those content items, or content item identifiers corresponding to those content items, may be sent to hosting service 330. In other implementations, content item identifiers may be sent or streamed to hosting service 330 as the content items are viewed or selected by the user as part of the session.
Hosting service 330, upon receiving identification of content items viewed by the user, may process the content items to generate captions descriptive of each content item, as in 312. For example, hosting service 330 may include and/or access an image encoder and language model, such as BLIP-2, FLAMINGO80B, VQAv2, etc., and/or internally maintained services, referred to herein generally as a “caption service,” and provide each content item to the caption service and receive a caption descriptive of the content item. Each caption may be associated with a content item identifier of the corresponding content item. For example, hosting service 330 may maintain a content item identifier for each content item, which may be unique for each content item. In some examples, captions may be pre-determined for session content items 303 and maintained in a caption data store accessible to hosting service 330. In such an example, hosting service 330 may obtain the caption for each content item of session content items 303 from the caption data store rather than having to re-process each content item to determine a caption. Likewise, if some of the content items do not have a corresponding caption in the caption data store, those content items may be processed with a caption service to determine a caption for the content item and the caption, with the corresponding content item identifier, may be added to the caption data store.
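The caption data store lookup described above may be sketched as follows, where `caption_service` is a hypothetical stand-in for a captioning model such as BLIP-2 and the stored caption text is illustrative:

```python
# Hypothetical caption data store keyed by content item identifier.
caption_store = {"img-001": "a wooden sideboard with brass handles"}

def caption_service(content_item_id):
    """Placeholder for a call to an image encoder and language model
    (e.g., BLIP-2) that returns a short descriptive caption."""
    return f"caption generated for {content_item_id}"

def get_caption(content_item_id):
    """Return the stored caption when available; otherwise generate one
    with the caption service and add it, keyed by the content item
    identifier, to the caption data store."""
    if content_item_id not in caption_store:
        caption_store[content_item_id] = caption_service(content_item_id)
    return caption_store[content_item_id]

print(get_caption("img-001"))  # served from the caption data store
print(get_caption("img-002"))  # generated by the service, then cached
```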
In addition to determining a caption for each content item of session content items 303, hosting service 330 may also determine, based at least in part on the session content items, a reduced corpus that includes less than all of the content items of corpus of content items 334, as in 313. For example, corpus of content items 334 may be reduced to the reduced corpus by excluding content items of session content items 303 viewed by the user. In still further implementations, the corpus may be further reduced based on existing relationships between content items of session content items 303 and content items of the corpus, to exclude content items that are in different categories or verticals than those of session content items 303, etc. In other examples, the corpus may not be reduced.
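By way of illustration and not limitation, the corpus reduction described above may be sketched as follows; the field names and the optional category/vertical filter are illustrative assumptions:

```python
def reduce_corpus(corpus, session_item_ids, session_categories=None):
    """Build a reduced corpus by excluding content items already viewed
    as session content items and, optionally, content items outside the
    categories/verticals of the session content items."""
    reduced = []
    for item in corpus:
        if item["id"] in session_item_ids:
            continue  # exclude items the user already viewed
        if session_categories and item["category"] not in session_categories:
            continue  # exclude items in unrelated categories/verticals
        reduced.append(item)
    return reduced

# Hypothetical corpus and session; identifiers are illustrative.
corpus = [
    {"id": "a", "category": "furniture"},
    {"id": "b", "category": "furniture"},
    {"id": "c", "category": "travel"},
]
reduced = reduce_corpus(corpus, session_item_ids={"a"},
                        session_categories={"furniture"})
print([item["id"] for item in reduced])  # ['b']
```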
The hosting service may then generate or obtain a caption for each content item of the reduced corpus, as in 314. For example, the content items of the reduced corpus may be processed by the same or similar caption service used to process the session content items. In other examples, captions may be pre-determined and stored in a caption data store for each content item of the reduced corpus. In such an example, rather than re-process each content item of the corpus, the hosting service may obtain the caption from the caption data store. In such an example, as new content items are added to the corpus, the content item may be processed with a caption service to determine a caption for the content item, and the caption, with the corresponding content item identifier, may be added to the caption data store.
The system may also include computing resource(s) 321. Computing resource(s) 321 may be remote from user device 301. Likewise, computing resource(s) 321 may be configured to communicate over network 308 with user device 301.
As illustrated, computing resource(s) 321 may be implemented as one or more servers 321(1), 321(2), . . . , 321(N) and may, in some instances, form a portion of a network-accessible computing platform implemented as a computing infrastructure of processors, storage, software, data access, and so forth that is maintained and accessible by components/devices of the system via network 308, such as an intranet (e.g., local area network), the Internet, etc. Computing resources 321 may process content items, captions, etc., to generate recommended content items, as discussed herein.
Computing resource(s) 321 does not require end-user knowledge of the physical location and configuration of the system that delivers the services. Common expressions associated with these remote computing resource(s) 321 include “on-demand computing,” “software as a service (SaaS),” “platform computing,” “network-accessible platform,” “cloud services,” “data centers,” and so forth. Each of servers 321(1)-(N) includes processor 318 and memory 319, which may store or otherwise have access to hosting service 330, as described herein.
Network 308, and each of the other networks discussed herein, may utilize wired technologies (e.g., wires, USB, fiber optic cable, etc.), wireless technologies (e.g., radio frequency, infrared, NFC, cellular, satellite, Bluetooth, etc.), or other connection technologies. Network 308 is representative of any type of communication network, including data and/or voice network, and may be implemented using wired infrastructure (e.g., cable, CAT6, fiberoptic cable, etc.), a wireless infrastructure (e.g., RF, cellular, microwave, satellite, Bluetooth, etc.), and/or other connection technologies.
Turning now to
Based on the LLM input, the LLM will process each caption of the sequence of session content items and compare those captions with captions of each content item of the reduced corpus of content items to determine content items from the reduced corpus that are most closely related to the session content items. The LLM may also determine, based on the sequence of the session content items, the captions of the session content items, and the captions of the content items selected from the reduced corpus of content items, a sequence in which the selected content items are to be presented.
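Assembly of such an LLM input may be sketched as follows; the prompt wording, content item identifiers, and captions are illustrative assumptions, and a production system would pass the resulting string to the LLM:

```python
def build_llm_input(session_captions, corpus_captions, num_items=3):
    """Assemble an LLM input that pairs the sequenced captions of the
    session content items with captioned candidate identifiers from the
    reduced corpus, instructing the LLM to select and order recommended
    content items."""
    session_lines = "\n".join(
        f"{pos}. {cap}" for pos, cap in enumerate(session_captions, 1))
    candidate_lines = "\n".join(
        f"- {item_id}: {cap}" for item_id, cap in corpus_captions.items())
    return (
        "The user viewed these content items, in order:\n"
        f"{session_lines}\n\n"
        "Candidate content items from the reduced corpus:\n"
        f"{candidate_lines}\n\n"
        f"Select the {num_items} candidates most closely related to the "
        "session content items and return their identifiers in the "
        "sequence they should be presented.")

# Hypothetical captions for two session content items and two candidates.
prompt = build_llm_input(
    ["a walnut sideboard with tapered legs",
     "a mid-century sideboard in teak"],
    {"item-7": "an oak sideboard with sliding doors",
     "item-9": "a floor lamp with a linen shade"})
print(prompt)
```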
Recommended content items 333-1, 333-2, through 333-Y, determined by the hosting service, and the sequence in which those items are to be presented, are then sent, via network 308, to user device 301, as in 316. User device 301, upon receiving the recommended content items and the sequence of presentation of those recommended content items, presents recommended content items 333 in the specified sequence, as in 317. In some implementations, a merchant(s) that offers an item(s) represented in at least one of recommended content items 333 for sale may also be determined and indicated as part of the presentation of recommended content items 333.
In the illustrated example, a user, during a session and through interaction with user device 401, selects or views a plurality of content items 403-1, 403-2, through 403-X, as in 411. The selection of content items during the session constitutes session content items 403. Any number of content items may be selected during a session and included as session content items 403. In this example, the user is selecting different content items that are images of sideboards. As discussed above, the session content items may be selected from a corpus of content items 434 that is accessible by user device 401 through hosting service 430. In other examples, some or all of the content items of session content items 403 may be selected from or provided by user device 401. For example, during the session, the user may take an image of a sideboard, and that image may be provided to hosting service 430 as a content item included in session content items 403.
During or after the session, some or all of session content items 403 are sent, via network 408, from user device 401 to hosting service 430. For example, after the user has viewed five content items, those content items, or identifiers corresponding to those content items, may be sent to the hosting service 430. In other implementations, content item identifiers may be sent or streamed to the hosting service as they are viewed or selected by the user as part of the session.
Hosting service 430, upon receiving identification of content items viewed by the user, in some implementations, may determine a session context for the session, as in 412. For example, if the session content items are included in a named group or list of content items, the name of the group may be determined to be the context. In other examples, metadata (e.g., annotations, keywords, etc.) associated with the content items may be processed to determine a relationship between the content items and used as the session context. For example, annotations or keywords associated with the session content items may include words such as furniture, home decor, bedroom, etc. In such an example, one or more of the keywords/annotations most often associated with the session content items may be determined and used as the session context. In other examples, if the content items are viewed from a particular section or vertical of content items, such as a vertical for “home decor” that is maintained and presented to the user by the hosting service, the vertical may be determined and used as the session context. In still other examples, the session context may not be determined or may be omitted.
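By way of a non-limiting illustration, the keyword-frequency approach to determining a session context described above may be sketched as follows. The function name and the assumption that each content item carries a list of annotation keywords are illustrative and not mandated by the disclosure:

```python
from collections import Counter

def determine_session_context(session_items):
    """Select the metadata keyword most often associated with the
    session content items as the session context (illustrative sketch)."""
    counts = Counter()
    for item in session_items:
        # each item is assumed to carry a list of annotation keywords
        counts.update(item.get("keywords", []))
    if not counts:
        return None  # the session context may be omitted entirely
    keyword, _ = counts.most_common(1)[0]
    return keyword

session_items = [
    {"keywords": ["furniture", "home decor"]},
    {"keywords": ["home decor", "bedroom"]},
    {"keywords": ["home decor", "furniture"]},
]
context = determine_session_context(session_items)  # "home decor"
```

In a deployed system, the keyword counts might instead be weighted by, for example, how recently each content item was viewed.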
In addition to optionally determining a session context for the session, hosting service 430 may also process session content items 403 to generate captions descriptive of each content item, as in 413. For example, hosting service 430 may include and/or access one or more internal and/or external caption services, provide the session content items to the caption service(s), and receive a caption descriptive of the session. In some implementations, the caption service may process all of the content items collectively and generate a single session caption descriptive of the session content items. In other examples, each content item of the session content items may be processed by the caption service(s) and a content item caption determined for each content item. Those content item captions may then be combined to generate a session caption for the session. In instances when multiple caption services are used, each caption service may generate a caption for the session content items, referred to herein as a service caption, and those service captions may be combined to generate a session caption for the session.
Using the session context and the session caption, a text-based description may be generated that is descriptive of the session content items, as in 414. As discussed further below, in some implementations, an LLM input may be defined that includes instructions that the LLM consider the session context and the session caption to generate a session text-based description that is descriptive of the session content items, when considering the session context. Based on the LLM input, the LLM will process the session caption, considering the session context, and generate a text-based description of the session.
The text-based description may then be used as a text input to a query system of hosting service 430 (discussed further below) to determine recommended content items to return to user device 401 for presentation, as in 415.
The system may also include computing resource(s) 421. Computing resource(s) 421 may be remote from user device 401. Likewise, computing resource(s) 421 may be configured to communicate over network 408 with user device 401.
As illustrated, computing resource(s) 421 may be implemented as one or more servers 421(1), 421(2), . . . , 421(N) and may, in some instances, form a portion of a network-accessible computing platform implemented as a computing infrastructure of processors, storage, software, data access, and so forth that is maintained and accessible by components/devices of the system via network 408, such as an intranet (e.g., local area network), the Internet, etc. Computing resources 421 may process content items, captions, etc., to generate recommended content items, as discussed herein.
Computing resource(s) 421 do not require end-user knowledge of the physical location and configuration of the system that delivers the services. Common expressions associated with these remote computing resource(s) 421 include “on-demand computing,” “software as a service (SaaS),” “platform computing,” “network-accessible platform,” “cloud services,” “data centers,” and so forth. Each of servers 421(1)-(N) includes processor 418 and memory 419, which may store or otherwise have access to hosting service 430, as described herein.
Network 408, and each of the other networks discussed herein, may utilize wired technologies (e.g., wires, USB, fiber optic cable, etc.), wireless technologies (e.g., radio frequency, infrared, NFC, cellular, satellite, Bluetooth, etc.), or other connection technologies. Network 408 is representative of any type of communication network, including data and/or voice network, and may be implemented using wired infrastructure (e.g., cable, CAT6, fiberoptic cable, etc.), a wireless infrastructure (e.g., RF, cellular, microwave, satellite, Bluetooth, etc.), and/or other connection technologies.
Turning now to
As noted above, regardless of the implementation used, the content items included in the session content items discussed with respect to
While the example discussed with respect to
Still further, in some implementations, as discussed further below, user preferences, user location, or content item locations (i.e., the location of a physical item represented by a content item) may also be determined and considered as part of the disclosed implementations when determining recommended content items. For example, referring back to
As another example, the disclosed implementations may also consider known user preferences, styles, etc., that have been previously determined and/or provided by the user when determining recommended content items.
The system components discussed with respect to
As discussed above, and elsewhere herein, session content items 501 and a sequence in which the session content items were viewed or selected by a user are received by the hosting service and processed by one or more caption services 506 and corpus reduction component 502. For example, caption service(s) 506 may process each content item of the session content items to generate a text content item caption for each content item 507-B. In implementations in which multiple caption services are utilized, the service captions generated by the caption services for a content item may be combined to generate the content item caption for the content item.
Likewise, corpus reduction component 502 may utilize session content items 501 and/or other user information to generate a reduced corpus. For example, corpus reduction component 502 may process the corpus to remove any duplicates, to remove any content items that the user has previously viewed (or viewed within a defined period of time), and to remove items that are not relevant to the session, for example, based on metadata associated with the content items and/or the session content items, etc.
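The corpus reduction steps described above may be sketched as a simple filter. The field names and the keyword-overlap relevance test are illustrative assumptions; an actual implementation might use embeddings or other metadata signals:

```python
def reduce_corpus(corpus, previously_viewed_ids, session_keywords):
    """Reduce the corpus to candidate items: drop duplicates, items the
    user has already viewed, and items unrelated to the session."""
    seen_ids = set()
    reduced = []
    for item in corpus:
        if item["id"] in seen_ids:  # duplicate entry
            continue
        seen_ids.add(item["id"])
        if item["id"] in previously_viewed_ids:  # previously viewed
            continue
        # relevance check based on metadata keyword overlap with the session
        if session_keywords and not set(item.get("keywords", [])) & session_keywords:
            continue
        reduced.append(item)
    return reduced

corpus = [
    {"id": "a", "keywords": ["sideboard", "furniture"]},
    {"id": "a", "keywords": ["sideboard", "furniture"]},  # duplicate
    {"id": "b", "keywords": ["sports car"]},              # not relevant
    {"id": "c", "keywords": ["furniture"]},               # already viewed
]
reduced = reduce_corpus(corpus, {"c"}, {"furniture", "home decor"})
```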
Content items of the reduced corpus may also be provided to caption service(s) 506 and, like session content items 501, a caption may be generated for each content item of reduced corpus 507-A. For example, caption service(s) 506 may process each content item of the reduced corpus of content items to generate a content item caption for each content item. In implementations in which multiple caption services are utilized, the service caption generated by each caption service for a content item of the reduced content item corpus may be combined to generate the content item caption for that content item.
The hosting service may then generate LLM input 507 based on the content item caption of each content item of session content items 507-B, the content item caption of each content item of reduced corpus 507-A, user data 507-C, and content item sequence 507-D. For example, the hosting service may generate LLM input 507 that includes or references the content item caption for each session content item 507-B, that includes or references the content item caption for each content item of the reduced corpus 507-A, and that includes instructions that LLM 508 is to consider the content item caption of each session content item 507-B in the sequence provided and to select one or more content items as recommended content items based on the caption of each content item from reduced content item corpus 507-A. The instructions may further provide a minimum and maximum number of content items that are to be returned as recommended content items, instructions to indicate a sequence in which the recommended content items are to be presented, an LLM output structure that is to be provided by the LLM, etc. Still further, LLM input 507 may also provide additional context or parameters to guide the LLM in selection of recommended content items. For example, additional context or parameters may be specified based on user data 507-C, such as indicating preferred styles, colors, shapes, etc., known about the user that are to be considered in conjunction with the caption of each session content item in determining recommended content items.
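A minimal sketch of assembling such an LLM input as a text prompt follows. The function name, prompt wording, and JSON output request are assumptions for illustration; the disclosure does not prescribe a specific prompt format:

```python
def build_selection_input(session_captions, corpus_captions, user_data,
                          min_items=3, max_items=10):
    """Assemble a text LLM input asking the model to select and sequence
    recommended items from the reduced corpus (illustrative sketch)."""
    lines = [
        "You are selecting recommended content items for a user.",
        f"Select between {min_items} and {max_items} identifiers from the",
        "candidate list below, ordered in the sequence they should be",
        "presented, and return them as a JSON list of identifiers.",
        f"User preferences to consider: {user_data}",
        "Session content items, in the sequence viewed:",
    ]
    lines += [f"  {item_id}: {caption}" for item_id, caption in session_captions]
    lines.append("Candidate content items (reduced corpus):")
    lines += [f"  {item_id}: {caption}" for item_id, caption in corpus_captions]
    return "\n".join(lines)

llm_input = build_selection_input(
    session_captions=[("s1", "walnut mid-century sideboard")],
    corpus_captions=[("r1", "teak credenza"), ("r2", "oak sideboard")],
    user_data={"preferred_style": "mid-century"},
)
```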
LLM 508, upon receiving LLM input 507 generated by the hosting service, processes the content item captions of the session content items, the content item captions of the content items of the reduced content item corpus, the sequence, instructions, etc., and determines one or more recommended content items from the reduced content item corpus, along with a sequence in which those content items are to be presented 510.
Example process 600 begins upon receipt of session content items, a sequence in which those session content items were viewed or selected by a user, and user data about a user, as in 602. As discussed above, a user may select or view one or more content items during a session or interaction between a user device and the hosting service. Content items viewed during the session are provided or identified to the hosting service as session content items. In some examples, the user device or an application executing on the user device may send indications of content items to the hosting service as those content items are viewed or selected by the user. Likewise, if the user interacts with one or more of the viewed content items, any such interaction may also be provided to the hosting service.
The session content items may then each be processed, for example by one or more caption services, to generate a content item caption descriptive of the session content item, as in 604. The content item caption, once generated, may be associated with a content item identifier for the content item. For example and referring briefly to
Returning to
The example process 600 may also utilize the session content items and/or contextual metadata determined for the session content items to determine a reduced corpus of content items, as in 608. For example, and returning again to
The reduced corpus of content items may then be processed to generate a content item caption for each content item of the reduced corpus, as in 610. For example, caption service 706, which may be the same or different caption service that generated captions for the session content items, may process each content item of reduced corpus of content items 778 to generate a list of reduced corpus content item captions 707-B. Like the session content item captions, the caption generated for each content item of reduced corpus 778 may be associated with the content item identifier and included in reduced corpus content item captions 707-B. Likewise, contextual metadata service 713 may also determine, for each content item of the reduced corpus of content items, contextual metadata, as in step 611.
Returning to
Example process 600 may then provide the LLM input to an LLM, such as GPT-4, BERT, Galactica, LaMDA, Llama, or an LLM defined and trained by the hosting service, as in 614. The LLM, upon receipt of the LLM input, processes the list of session content item captions and the list of reduced corpus content item captions, in accordance with the instructions, and outputs a sequenced list of recommended content item identifiers that are received by the example process, as in 616 and as illustrated as recommended content item identifiers 709 (
The example process 600 may then obtain the recommended content items from the corpus, or the reduced corpus, that are identified by the recommended content item identifiers that are returned by the LLM, as in 618. Finally, the obtained recommended content items may be sent, in accordance with the determined sequence, for presentation, as in 620. Returning again to
In some implementations, example process 600 may also determine a merchant(s) that offers an item(s) or object represented in at least one of the recommended content items for sale. In such an implementation, the merchant may also be identified in the presentation so that the object represented in the one or more content items may be purchased through the merchant.
The system components discussed with respect to
As discussed above, and elsewhere herein, session content items 801 viewed or selected by a user, or otherwise provided to the system, are received by the hosting service and processed by one or more caption service(s) 806. For example, caption service(s) 806 may process each content item of the session content items to generate a caption for each content item, and those content item captions may be combined to generate a single session caption for session content items 801. Alternatively, caption service(s) 806 may process all session content items 801 together and generate a session caption descriptive of the session content items. Likewise, as discussed further below, in examples in which multiple caption services 806 are used, each caption service may generate a service caption for the session content items, as determined by that caption service, and each of the service captions may then be combined to generate the session caption for session content items 801.
Likewise, session context 802 may be received and/or determined for the session. The session context may be provided as part of the session content items, may be determined based on the content items, may be determined based on user browser history, user preferences, metadata about or relating to the session content items, etc.
The hosting service may then generate LLM input 807 based on the caption of each session content item 801, session context 802, and the desired output to be received from LLM 808. For example, the hosting service may generate LLM input 807 that includes or references the session caption for session content items 801, that includes session context 802, and that includes instructions that LLM 808 is to consider the session caption and the session context and output a session description representative of session content items 801 collectively. The instructions may specify a specific structure for the LLM output, a request that a summary of the session content items be provided, that the LLM pick from a set of summary descriptors as a summary for the session content items, etc. Still further, LLM input 807 may also provide additional context, parameters, and/or other instructions to guide the LLM in generation of the LLM output and session description. For example, additional context or parameters may be specified based on user data, such as indicating preferred styles, colors, shapes, etc., known about the user that are to be considered in conjunction with the session caption in determining recommended content items.
LLM 808, upon receiving the LLM input generated by the hosting service, processes the session caption, the session context, etc., in accordance with the instructions of the LLM input, and generates an LLM output that includes the session description and, optionally, a session summary.
The session description may then be provided as a text-based request to content item recommender 812 to determine one or more content items from a corpus of content items to select as recommended content items. As discussed further below, the content item recommender processes the text-based request and returns one or more recommended content items. Example process 800 may then combine the recommended content items, the session summary, and optionally other information as session output 810.
Example process 900 begins upon receipt of, or by determining session content items, as in 904. As discussed above, a user may select or view one or more content items during a session or interaction between a user device and the hosting service. Content items viewed during the session are provided or identified to the hosting service as session content items. In some examples, the user device or an application executing on the user device may send indications of content items to the hosting service as those content items are viewed or selected by the user. Likewise, if the user interacts with one or more of the viewed content items, any such interaction may also be provided to the hosting service. In other examples, the session content items may be selected by the hosting service or another entity for use in creating a feed, vertical, category, etc.
In addition to determining or receiving the content items, a session context may be received or determined, as in 902. For example, the session context may be a feed, vertical, category, etc., from or for which the session content items were selected. Alternatively, the content items may be initially processed (e.g., image processing, querying annotations, etc.) to determine the session context and/or the contextual metadata corresponding to the content items may be processed to determine a session context.
The session content items may then be processed to generate a session caption descriptive of the session content items, as in 1000. The session caption process 1000 is discussed further below with respect to
Utilizing the session context and the session caption, example process 900 generates an LLM input, as in 908.
For example and referring briefly to
LLM input 1211 may also include a prompt 1203, which may include one or more of instructions 1204 that the LLM is to follow in executing the LLM input, session caption 1205 determined from the session content items, contextual metadata 1208 determined for the session content items, response structure 1206 which may indicate how the LLM output is to be structured, and/or rules 1207 that are to be followed by the LLM in processing the LLM input. Continuing with the bathroom ideas example, instructions 1204 may include, for example:
In this example, session captions 1205 included in the LLM input may include: “mediterranean, country, coastal, mediterranean: spanish, mediterranean: italian, mid-century modern, moroccan, bathroom design, bathroom interior, bathroom remodel, bathroom inspiration,” all of which may have been determined by a caption service, as discussed herein.
In some implementations, LLM input 1211 may also include additional instructions 1204 as to how the LLM output is to be structured, etc. Continuing with the above example, LLM input 1211 may include additional instructions 1204 specifying the structure of the LLM output:
Rules 1207 for LLM input 1211 may include, for example:
As illustrated in the above example LLM input, any of a variety of captions, instructions, and/or rules may be included in the LLM input to help construct and guide the LLM in creating the LLM output.
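The structured LLM input described above may be sketched as a set of named components that are serialized into a single prompt. The field names and example values are illustrative stand-ins for instructions 1204, session caption 1205, response structure 1206, rules 1207, and contextual metadata 1208; they are not mandated by the disclosure:

```python
# Illustrative components of an LLM input: instructions, session caption,
# contextual metadata, response structure, and rules (names are assumptions).
llm_input = {
    "instructions": (
        "Consider the session caption and the session context and generate "
        "a narrative description of the session content items."
    ),
    "session_caption": (
        "mediterranean, coastal, mid-century modern, bathroom design, "
        "bathroom remodel, bathroom inspiration"
    ),
    "contextual_metadata": {"vertical": "home decor"},
    "response_structure": {"title": "string", "description": "string"},
    "rules": [
        "Do not reference individual content item identifiers.",
        "Keep the description under 100 words.",
    ],
}

# Serialize the components into one text prompt for the LLM.
prompt = "\n\n".join(f"{key.upper()}:\n{value}" for key, value in llm_input.items())
```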
Returning to
Returning again to
Finally, example process 900 may generate and present a session output, as in 916. The session output may include both information from the LLM output, such as title 1302 (
In some implementations, example process 900 may also determine a merchant(s) that offers an item(s) or object represented in at least one of the recommended content items for sale. In such an implementation, the merchant may also be identified in the presentation so that the object represented in the one or more content items may be purchased through the merchant.
Example process 1000 begins with selection of one or more caption services that are to process the content items and output captions descriptive of those content items, as in 1002. In some implementations, example process 1000 may only select one caption service. In other examples, multiple caption services may be selected. The one or more caption services may be, for example, BLIP-2, FLAMINGO80B, VQAv2, etc. and/or an internally maintained caption service. In some implementations, the caption service(s) may be selected based on the user, the content items selected, the quantity of content items selected, whether a caption is to be created for each content item, whether a caption is to be created as representative of all the content items, etc.
In some implementations, possible result captions that may be provided as outputs by the caption service may also be defined, as in 1003. The content items are then processed to generate a session caption representative of the session content items, as in 1004.
If the selected caption service only generates a caption for each content item, the caption service may process each content item and generate a respective content item caption for each content item. Those content item captions may then be combined as a service caption for the session, as determined for the session content items. In other examples, a selected caption service may process all of the content items of the session content items and generate a service caption that is representative of the content items of the session content items. If more than one caption service is selected for use with the example process 1000, the service caption output by each selected caption service may then be combined to generate the session caption that is representative of the session content items processed by the example process 1000. Combining of individual content item captions to generate a service caption for the session content items and/or combining of service captions output by a plurality of caption services may be done by, for example, adding the terms of each caption together. In other examples, combining of captions may include only selecting terms that appear in two or more of the captions being combined, or only terms appearing in a majority of the captions combined, etc.
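The three caption-combining strategies described above (keeping all terms, keeping terms that appear in at least two captions, or keeping terms that appear in a majority of captions) may be sketched as follows; the comma-separated caption format is an illustrative assumption:

```python
def combine_captions(captions, mode="union"):
    """Combine several comma-separated captions into one session caption.
    Modes: keep every term ("union"), terms appearing in at least two
    captions ("overlap"), or terms in a majority of captions ("majority")."""
    counts = {}
    ordered_terms = []
    for caption in captions:
        # count each term at most once per caption, preserving first-seen order
        for term in dict.fromkeys(t.strip() for t in caption.split(",")):
            if term not in counts:
                counts[term] = 0
                ordered_terms.append(term)
            counts[term] += 1
    if mode == "union":
        keep = ordered_terms
    elif mode == "overlap":
        keep = [t for t in ordered_terms if counts[t] >= 2]
    else:  # "majority"
        keep = [t for t in ordered_terms if counts[t] > len(captions) / 2]
    return ", ".join(keep)

service_captions = [
    "sideboard, walnut, mid-century",
    "sideboard, teak",
    "sideboard, walnut",
]
combined = combine_captions(service_captions, mode="overlap")  # "sideboard, walnut"
```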
For example,
Returning to
According to exemplary implementations of the present disclosure, content recommendations may also be determined based on a long-term objective. For example, a long-term objective, such as a cumulative engagement associated with a subscriber of the online service over a certain period of time, may be defined and/or determined, and form the basis for determining recommended content items. In an exemplary implementation, cumulative engagement may be defined as a function of a depth of session and a number of sessions (e.g., a product of the depth of session and the number of sessions, etc.) over a defined time period. Accordingly, cumulative engagement may be represented, for the given time period, as:
Cumulative Engagement = Depth of Session × Number of Sessions
In turn, the depth of a session may be defined as a function of a subscriber's time spent accessing the online service (e.g., a number of engaged sessions, a total amount of time spent accessing the online service, and the like) and the actions performed by the subscriber (e.g., the type of actions performed by the subscriber, the number of actions performed by the subscriber, etc.), and the number of sessions may be a function of a frequency with which the subscriber accesses the online service over a defined time period. Accordingly, the depth of session and the number of sessions may be represented as:
Depth of Session = f(time spent, actions performed)
Number of Sessions = f(frequency of use)
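One plausible instantiation of these formulas follows. The specific weighting inside depth_of_session and the use of the average depth across sessions are assumptions chosen for illustration; the disclosure leaves the functions f unspecified:

```python
def depth_of_session(minutes_spent, actions_performed):
    # one plausible instantiation of f(time spent, actions performed);
    # the weighting of actions is an assumption, not specified above
    return minutes_spent + 2 * actions_performed

def cumulative_engagement(sessions):
    """Cumulative engagement for the period: average session depth
    multiplied by the number of sessions in the period."""
    if not sessions:
        return 0.0
    depths = [depth_of_session(s["minutes"], s["actions"]) for s in sessions]
    return (sum(depths) / len(depths)) * len(sessions)

sessions = [
    {"minutes": 10, "actions": 5},   # depth 20
    {"minutes": 20, "actions": 0},   # depth 20
]
engagement = cumulative_engagement(sessions)  # 40.0
```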
According to other implementations of the present disclosure, other long-term objectives may be determined and/or defined, such as objectives based on shopping/purchase metrics, objectives based on other engagement and/or interaction metrics, objectives based on advertisement engagement and/or interaction metrics, objectives based on query and/or search metrics, and the like.
According to exemplary embodiments of the present disclosure, datasets may be generated to learn how subscribers interact with content items across interests, content item formats and/or presentation types (e.g., homepage, search results, shopping, etc.), multiple sessions, and the like. The dataset may also provide insights into subscriber transitions between subscriber states. The dataset may then be used to map content items to the long-term objective. The mappings may be determined by determining alignment scores and/or attention scores that attribute interactions of subscribers to previous interactions with content items. The mapping of the content items to the long-term objective may facilitate optimization and/or generation of a recommendation service configured to determine content items to recommend and serve to subscribers to encourage the long-term objective. Advantageously, unlike traditional recommendation systems, which are typically configured to recommend content items with the objective of encouraging immediate interaction with the recommended content items, exemplary embodiments of the present disclosure can provide a recommendation system configured to identify and recommend content items that promote the long-term objective. Additionally, aspects of the present disclosure may also consider the subscriber's state in mapping content items to the long-term objective. For example, the subscriber's actions and behavior in connection with accessing and/or utilizing the online service may be modeled as a state of the subscriber, and the subscriber's state, along with the subscriber's history (e.g., a new user, a casual user, a power user, etc.), may be utilized to determine probabilities associated with transitions in the state of the user in determining the mappings of the content items to the long-term objective.
According to aspects of the present disclosure, this may be modeled as a Markov decision process employing Bellman equations to determine values for subscribers in view of multiple objectives. Determination of subscriber states, as well as determination of the probabilities associated with state transitions, is described in further detail herein in connection with at least
As illustrated in
In exemplary implementations of the present disclosure, the features, metrics, and/or parameters associated with the aggregation of the subscriber sessions 1530 may include, for example, a frequency at which the subscriber initiates sessions with the online service within a defined time period (e.g., 1 day, 3 days, 5 days, 1 week, 2 weeks, 1 month, etc.), a depth of session associated with the sessions (e.g., a number of engaged sessions, a total amount of time spent accessing the online service, a number of content items viewed, and the like), and the like. Similarly, the features, metrics, and/or parameters associated with individual subscriber sessions 1520 may include, for example, a session depth (e.g., a number of engaged sessions, a total amount of time spent accessing the online service, a number of content items viewed, and the like), the actions performed by the subscriber (e.g., the type of actions performed by the subscriber, the number of actions performed by the subscriber, etc.), a session length (e.g., amount of time spent on the session), an entropy and/or diversity associated with the session (e.g., number of interests and/or topics explored, number of different content item types explored, number of different content item formats explored, such as home page, search, shopping, etc.), and the like. Although
In an exemplary implementation, the mappings may be determined by determining alignment scores that attribute interactions of subscribers to previous interactions with content items. For example, the alignment scores may be determined using query, key, and value vectors to determine alignment scores for candidate content items. A subscriber interaction may represent the query; the sequence of content items with which the subscriber engaged prior to the subscriber interaction may represent the keys; and the content items that are retrieved from a corpus of content items for each content item in the sequence, as the candidate content items from which the recommended content items are determined, may represent the values. According to aspects of the present disclosure, the subscriber interaction may correspond to the long-term objective. For example, the subscriber interaction may include the desired subscriber behavior within a desired time period after the sequence of content items preceding the subscriber interaction. Using the subscriber interaction and the sequence of subscriber engagements preceding the subscriber interaction, an alignment score may be determined for each subscriber engagement that preceded the subscriber interaction. The alignment scores may represent a relevance and/or influence of the preceding subscriber engagements in connection with the subscriber interaction. According to an aspect of the present disclosure, as the embedding vectors representing the content items encode features such as visual features, semantic features, and the like, the relevance and/or influence quantified by the alignment scores also includes a measure of semantic relevance. The alignment scores may then be utilized to determine a weighted sum of the candidate content items, which may be used to determine one or more recommended content items.
Alternatively and/or in addition, attention scores may be used in place of or in addition to alignment scores. Similar to the determination of alignment scores, a representation of a subscriber interaction (e.g., representation of the content item with which the subscriber interacted that is of interest, etc.) may be modeled as a query vector and the content items in an input sequence of content items (e.g., representations of a sequence of content items with which the subscriber interacted that may be of interest, etc.) may be modeled as the key vector. Accordingly, the dot product of the query and key vectors may provide attention scores for each content item in the input sequence of content items, which may represent a relevance of each content item in the sequence of content items to the subscriber interaction of interest.
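The dot-product attention computation described above, together with the weighted sum of candidate values, may be sketched as follows. The two-dimensional embeddings are toy values for illustration; real content item embeddings would be learned and much higher-dimensional:

```python
import math

def attention_scores(query, keys):
    """Dot-product attention: score each key (an embedding of a content
    item in the input sequence) against the query (an embedding of the
    subscriber interaction), then softmax-normalize the scores."""
    raw = [sum(q * k for q, k in zip(query, key)) for key in keys]
    max_raw = max(raw)  # subtract the max for numerical stability
    exps = [math.exp(r - max_raw) for r in raw]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, values):
    """Weighted sum of candidate (value) embeddings using the scores."""
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

query = [1.0, 0.0]                     # the subscriber interaction
keys = [[1.0, 0.0], [0.0, 1.0]]        # the preceding content items
weights = attention_scores(query, keys)
blended = weighted_sum(weights, [[2.0, 0.0], [0.0, 2.0]])
```

Here the first key is more similar to the query, so it receives the larger attention weight.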
Accordingly, the mapping of the content items to the long-term objective may facilitate optimization and/or generation of a recommendation service to configure the recommendation service to determine content items to recommend and serve to subscribers to encourage the long-term objective.
Alternatively and/or in addition, according to certain aspects of the present disclosure, each mapping may be determined using one or more trained models that are configured to predict a respective target variable based on the respective inputs. For example, a first model may be trained in connection with the first mapping 1502 to predict the long-term objective 1540 based on inputs corresponding to features, metrics, and/or parameters associated with the aggregation of subscriber sessions 1530. Similarly, a second model may be trained in connection with the second mapping 1504 to predict the features, metrics, and/or parameters associated with the aggregation of subscriber sessions 1530 based on inputs corresponding to features, metrics, and/or parameters associated with individual subscriber sessions 1520, and a third model may be trained in connection with the third mapping 1506 to predict the features, metrics, and/or parameters associated with individual subscriber sessions 1520 based on inputs corresponding to content items 1514 in a corpus of content items from which recommended content items may be determined. Accordingly, the various mappings may be utilized to optimize and/or configure a recommendation system and/or service to recommend content items to prioritize the long-term objective. For example, a reverse inference learning technique may be employed to train, fine-tune, and/or optimize a recommendation system and/or service to configure the recommendation system and/or service to recommend content items to subscribers to achieve the long-term objective. Although exemplary embodiments of the present disclosure are described as utilizing three interim mappings, any number of interim mappings may be used.
According to implementations of the present disclosure, attributions may be determined within and across subscriber sessions so as to identify content items that may have been responsible for, driven, or otherwise influenced the subscriber's interaction and/or engagement with a subsequent content item. As shown in
In exemplary implementations of the present disclosure, attributions 1570 may be determined based on alignment scores and/or attention scores. The alignment scores and/or attention scores for each content item may represent a relevance and/or influence of the content item to the subsequent subscriber engagement. According to one aspect of the present disclosure, the alignment score may be determined based on a similarity measure of each preceding content item with the subscriber interaction in question. Alternatively and/or in addition, the alignment score may be determined based on a textual caption that may be generated for each preceding content item. Determination of alignment scores is described in further detail herein in connection with at least
According to certain aspects of the present disclosure, the subscriber's state, along with the subscriber's current context, may be utilized to determine probabilities associated with transitions in the state of the subscriber when determining the mappings of the content items to the long-term objective.
As shown in
In the illustrated implementation, as a subscriber is interacting and/or engaging with the online service, the subscriber may transition from state 1610 to state 1620 (e.g., transition 1612), from state 1610 to state 1630 (e.g., transition 1614), from state 1620 to state 1610 (e.g., transition 1622), from state 1620 to state 1630 (e.g., transition 1624), from state 1630 to state 1620 (e.g., transition 1634), and/or from state 1630 to state 1610 (e.g., transition 1632). According to exemplary embodiments of the present disclosure, a probability associated with each transition 1612, 1614, 1622, 1624, 1632, and 1634 may be determined. The probabilities associated with each transition 1612, 1614, 1622, 1624, 1632, and 1634 may be determined, for example, based on the behavior of subscribers, history of subscribers, profile information of subscribers, experience level of the subscriber (e.g., a new subscriber, a casual subscriber, a power subscriber, etc.), and the like. For example, it may be determined that an experienced subscriber is more likely to transition from initial state 1610 to state 1620 before transitioning to state 1630, that a new subscriber may make multiple transitions between initial state 1610 and state 1620 before transitioning to state 1630, and that a power subscriber may transition directly from state 1610 to state 1630.
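One simple way to determine such transition probabilities from subscriber behavior is to count observed transitions and row-normalize. The sketch below is illustrative; the state sequences are invented assumptions, with states 1610, 1620, and 1630 indexed as 0, 1, and 2.

```python
import numpy as np

# Observed state sequences for a few hypothetical subscribers.
sequences = [
    [0, 1, 0, 1, 2],   # a new subscriber bouncing between 1610 and 1620
    [0, 1, 2, 1, 2],   # an experienced subscriber
    [0, 2, 1, 2],      # a power subscriber transitioning directly to 1630
]

# Count each observed transition i -> j.
counts = np.zeros((3, 3))
for seq in sequences:
    for i, j in zip(seq, seq[1:]):
        counts[i, j] += 1

# Row-normalize the counts into a stochastic matrix P, where P[i, j] is
# the probability of transitioning from state i to state j; each row of
# a stochastic matrix sums to 1.
P = counts / counts.sum(axis=1, keepdims=True)
```

Per-subscriber-segment matrices (new, casual, power) could be estimated the same way by partitioning the sequences before counting.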
In exemplary implementations, this may be modeled as a Markov decision process employing Bellman equations to determine values for subscribers in view of multiple objectives. Accordingly, the probabilities of transitioning between states may be represented as a stochastic matrix P, in which entry Pi,j is the probability of transitioning from state i to state j, and the sum of the probabilities from a given state i to all other states is 1, which may be represented as Σj Pi,j = 1. Accordingly, the determined probabilities may be applied in determining the mappings between content items and the long-term objective. Further, although
As shown in
Additionally, recommendation system 1700 may also be further configured to determine recommended content items in view of a long-term objective. As described herein, in connection with determining recommended content items based on the long-term objective, a mapping between content items may be generated and utilized in configuring and/or optimizing recommendation system 1700 to recommend content items based on the long-term objective. In an exemplary implementation, and as shown in
As illustrated, alignment scores may be determined by alignment score determination engine 1800 for candidate content items provided by content retrieval stage 1702 based on subscriber history information 1802. For example, subscriber interactions, along with sequences of subscriber engagements that preceded each subscriber interaction, may be identified in subscriber history information 1802. According to aspects of the present disclosure, the subscriber interaction and the sequence of subscriber engagements may be represented by embedding vectors that encode features of the content items with which the subscriber has interacted and/or engaged. The embedding vectors may encode features, such as visual information, textual information, audio information, semantic information, contextual information, and the like. Accordingly, for each identified subscriber interaction, alignment scores and/or attention scores may be determined for the content items forming the sequence of subscriber engagements preceding the subscriber interaction. The alignment scores and/or attention scores for each subscriber engagement that preceded the subscriber interaction may represent a relevance and/or influence of the preceding subscriber engagement in connection with the subscriber interaction. The alignment scores and/or attention scores may then be utilized to determine weights for each of the candidate content items (e.g., as a weighted sum of the candidate content items, etc.), which may be used to determine one or more recommended content items. For example, the alignment scores, attention scores, and/or the weights may be provided to recommendation system 1700 as an input, may be used to modify the utility function associated with one or more stages of recommendation system 1700, be used as rewards in applying a reinforcement learning technique to further train and/or fine-tune recommendation system 1700, and the like.
In exemplary implementations of the present disclosure, alignment score determination engine 1800 may determine alignment scores using query, key, and value vectors, a large language model, and the like. Determination of alignment scores is described in further detail herein in connection with at least
In the implementation employing query, key, and value vectors, each subscriber interaction 1804 may represent a query, a sequence of subscriber engagements 1806 preceding each subscriber interaction 1804 may represent the key, and candidate content items 1810 may represent the values. According to aspects of the present disclosure, the subscriber interactions may correspond to the long-term objective. For example, the subscriber interaction may include the desired subscriber behavior within a desired time period after the sequences of subscriber engagements preceding the subscriber interactions.
In determining alignment scores 1820, for a particular subscriber interaction 1804 (e.g., an interaction with a content item, a like of a content item, a sharing of a content item, a saving of a content item, etc.), a sequence of subscriber engagements 1806 preceding the particular subscriber interaction 1804 may be identified. The sequence of subscriber engagements 1806 may include a sequence of content items with which the subscriber engaged prior to the particular subscriber interaction 1804, and a corresponding candidate content item may be retrieved for each content item in the sequence of subscriber engagements 1806. Accordingly, the particular subscriber interaction 1804 may be a query, the sequence of subscriber engagements 1806 may be the key, and the retrieved candidate content items may be the value. According to aspects of the present disclosure, the particular subscriber interaction may correspond to the long-term objective. For example, the subscriber interaction may include the desired subscriber behavior within a desired time period after the sequence of subscriber engagements preceding the subscriber interaction. Given the query, key, and the value, a similarity measure may be determined between the particular subscriber interaction 1804 and each content item in the sequence of subscriber engagements 1806. For example, a cosine similarity may be determined between an embedding vector representative of the particular subscriber interaction 1804 and an embedding vector representative of each content item included in the sequence of subscriber engagements 1806. Other similarity measures may alternatively be utilized such as, by way of illustration and not limitation, a dot product, the Normalized Hamming Distance measure, a Euclidean distance measure, and the like.
The similarity measure between the particular subscriber interaction 1804 and each content item in the sequence of subscriber engagements 1806 may represent the relevance and/or influence that each content item had on the particular subscriber interaction 1804. Further, as the embedding vectors preferably encode non-visual features, such as textual features, semantic features, etc. (in addition to visual features) of the content items, the similarity measure is representative of a comprehensive similarity between the particular subscriber interaction 1804 and each content item in the sequence of subscriber engagements 1806.
The similarity measures between the particular subscriber interaction 1804 and each content item in the sequence of subscriber engagements 1806 may be returned as alignment scores 1820 for the content items included in the sequence of subscriber engagements 1806. Accordingly, alignment scores 1820 may represent the relevance and/or influence of each content item in the sequence of subscriber engagements 1806 on the particular subscriber interaction 1804. Alternatively and/or in addition, the similarity measures/alignment scores may be processed (e.g., by a softmax function, etc.) to determine a distribution for the candidate content items 1810, as well as a weighted sum of the candidate content items 1810, which can be provided back to the recommendation system. For example, the weighted sum may be used to modify the utility function associated with one or more stages of the recommendation system, be used as rewards in applying a reinforcement learning technique to further train and/or fine-tune the recommendation system, and the like.
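The alignment-score computation described above, from cosine similarities through the softmax distribution and weighted sum, can be sketched as follows. The embeddings here are random placeholders; in practice they would come from trained embedding vector generators.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(1)
dim = 16
# Embedding of the particular subscriber interaction 1804 (the "query").
interaction = rng.normal(size=dim)
# Embeddings of the sequence of subscriber engagements 1806 (the "keys").
engagements = rng.normal(size=(4, dim))
# Embeddings of the corresponding candidate content items 1810 (the "values").
candidates = rng.normal(size=(4, dim))

# Alignment score: similarity of each preceding engagement to the interaction.
alignment = np.array([cosine(interaction, e) for e in engagements])

# Softmax turns the alignment scores into a distribution over the
# candidate content items; the weighted sum summarizes the candidates
# in proportion to their relevance/influence.
weights = softmax(alignment)
weighted_sum = weights @ candidates
```

The resulting weights or weighted sum could then be fed back to the recommendation system, for example as an input feature or as a reward signal.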
As illustrated, candidate content items 1810 and the sequences of subscriber engagements 1806 may be processed by one or more caption services 1822. For example, the caption service(s) 1822 may process each content item of the sequences of subscriber engagements 1806 to generate a content item caption for each content item. In implementations in which multiple caption services are utilized, the service caption generated by each caption service for a content item may be combined to generate the content item caption for the content item. Similarly, candidate content items 1810 may also be processed by caption service(s) 1822 and, like content items of the sequences of subscriber engagements 1806, a caption may be generated for each content item of candidate content items 1810. For example, caption service(s) 1822 may process each content item of candidate content items 1810 to generate a content item caption for each content item. In implementations in which multiple caption services are utilized, the service caption generated by each caption service for a content item of candidate content items 1810 may be combined to generate the content item caption for that content item.
An LLM input based on the content item caption of each content item of sequences of subscriber engagements 1806 and the content item caption of each content item of candidate content items 1810 may be generated to be provided as an input to LLM 1824. Optionally, additional information, such as additional subscriber history information (e.g., demographic information, likes, dislikes, recent activity, etc.), the long-term objective, and the like may also be used to generate the LLM input. For example, the LLM input may be generated that includes or references the content item caption for each content item of the sequences of subscriber engagements 1806, that includes or references the content item caption for each content item of candidate content items 1810, and that includes instructions that the LLM is to consider the content item caption of each content item of sequences of subscriber engagements 1806, the long-term objective, and the like, and to select one or more content items as recommended and/or ranked content item(s) 1830, determine alignment scores 1820, and the like, based on the caption of each content item from candidate content items 1810. The instructions may further provide a minimum and maximum number of content items that are to be returned as recommended content items, instructions to indicate a sequence in which the recommended content items are to be presented, an LLM output structure that is to be provided by the LLM, etc. Still further, the LLM input may also provide additional context or parameters to guide the LLM in selection of recommended content items. For example, additional context or parameters may be specified based on subscriber history information 1802, such as indicating preferred styles, colors, shapes, etc., information known about the subscriber that are to be considered in conjunction with the caption of each content item in determining recommended content items, and the like.
LLM 1824, upon receiving the generated LLM input, processes the content item captions of sequences of subscriber engagements 1806, the content item captions of candidate content items 1810, subscriber history information 1802, instructions, etc., and determines one or more recommended and/or ranked content item(s) 1830 from candidate content items 1810, along with a sequence in which those content items are to be presented.
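Assembly of such an LLM input from the generated captions can be sketched as simple prompt construction. The prompt wording, function name, and caption strings below are hypothetical; the disclosure does not prescribe a particular prompt format.

```python
def build_llm_input(engagement_captions, candidate_captions,
                    objective, max_items=3):
    """Assemble a text-based LLM input from content item captions.

    The structure (engagement captions, numbered candidate captions,
    and selection instructions referencing the long-term objective)
    mirrors the description above; the exact wording is an assumption.
    """
    lines = ["The subscriber recently engaged with items described as:"]
    lines += [f"- {c}" for c in engagement_captions]
    lines.append("Candidate content items:")
    lines += [f"{i}: {c}" for i, c in enumerate(candidate_captions)]
    lines.append(
        f"Considering the long-term objective ({objective}), select up to "
        f"{max_items} candidates by number, ordered for presentation."
    )
    return "\n".join(lines)

prompt = build_llm_input(
    ["a watercolor of a garden", "a photo of a terracotta pot"],
    ["raised-bed planting guide", "abstract cityscape", "herb garden ideas"],
    objective="cumulative engagement",
)
```

Additional subscriber history information, minimum/maximum result counts, and a required output structure could be appended to the prompt in the same manner.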
As shown in
Cumulative Engagement = Depth of Session × Number of Sessions
In this formulation, the depth of a session may be defined as a function of a subscriber's time spent accessing the online service (e.g., a number of engaged sessions, a total amount of time spent accessing the online service, and the like) and the actions performed by the subscriber (e.g., the type of actions performed by the subscriber, the number of actions performed by the subscriber, etc.), and the number of sessions may be a function of a frequency with which the subscriber accesses the online service. Accordingly, the depth of session and the number of sessions may be represented as:
Depth of Session = f(time spent, actions performed)
Number of Sessions = f(frequency of use)
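The cumulative engagement metric above can be computed directly once concrete functions are chosen for depth and frequency. The sketch below uses simple illustrative functions; the specific weighting of time spent versus actions performed is an assumption, not something the formulation above fixes.

```python
def depth_of_session(time_spent_minutes, actions_performed):
    """Depth of Session = f(time spent, actions performed).
    The 2x weighting on actions is an illustrative assumption."""
    return time_spent_minutes + 2.0 * actions_performed

def number_of_sessions(sessions_per_week, weeks):
    """Number of Sessions = f(frequency of use)."""
    return sessions_per_week * weeks

def cumulative_engagement(depth, sessions):
    """Cumulative Engagement = Depth of Session x Number of Sessions."""
    return depth * sessions

depth = depth_of_session(time_spent_minutes=12, actions_performed=5)
sessions = number_of_sessions(sessions_per_week=4, weeks=2)
engagement = cumulative_engagement(depth, sessions)
```

Any of the alternative long-term objectives (shopping metrics, advertisement interaction metrics, etc.) could be substituted by swapping in different component functions.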
According to other implementations of the present disclosure, other long-term objectives may be determined and/or defined, such as objectives based on shopping/purchase metrics, objectives based on other engagement and/or interaction metrics, objectives based on advertisement engagement and/or interaction metrics, objectives based on query and/or search metrics, and the like.
After the long-term objective is determined, according to exemplary embodiments of the present disclosure, content items that are to be served to subscribers of the online service may be mapped to the determined long-term objective. The mapping of the content items to the long-term objective may facilitate optimization and/or generation of a recommendation service to configure the recommendation service to determine content items to recommend to subscribers to encourage the long-term objective and may be performed as one or more mappings. In exemplary embodiments, each of the one or more mappings may be determined using one or more trained models that are configured to predict a respective target variable based on respective inputs.
As illustrated in
The first mapping may map parameters associated with an aggregation of subscriber sessions to the long-term objective. In exemplary implementations of the present disclosure, the parameters associated with an aggregation of subscriber sessions may include, for example, features, metrics, and/or parameters such as a frequency at which the subscriber initiates sessions with the online service within a defined time period (e.g., 1 day, 3 days, 5 days, 1 week, 2 weeks, 1 month, etc.), a depth of session associated with the sessions (e.g., a number of engaged sessions, a total amount of time spent accessing the online service, a number of content items viewed, and the like), and the like. Accordingly, a first model may be trained to predict the long-term objective based on an input of one or more parameters across an aggregation of subscriber sessions, thereby mapping the long-term objective to a plurality of corresponding parameters associated with the aggregation of subscriber sessions. Optionally, the mappings and generation of the first model may be based at least in part on probabilities associated with a state and current context of the subscriber, as described further herein in connection with at least
In addition to generation of a first mapping, a second mapping that maps features associated with individual subscriber sessions to parameters of the aggregation of subscriber sessions may be generated. Similar to the parameters associated with the aggregation of subscriber sessions, in exemplary implementations of the present disclosure, the features associated with the individual subscriber sessions may include, for example, features, metrics, and/or parameters such as a session depth (e.g., a number of engaged sessions, a total amount of time spent accessing the online service, a number of content items viewed, and the like), the actions performed by the subscriber (e.g., the type of actions performed by the subscriber, the number of actions performed by the subscriber, etc.), a session length (e.g., amount of time spent on the session), an entropy and/or diversity associated with the session (e.g., number of interests and/or topics explored, number of different content item types explored, number of different content item formats explored, such as home page, search, shopping, etc.), and the like. Accordingly, a second model may be trained to predict parameters across an aggregation of subscriber sessions based on inputs of one or more features associated with individual subscriber sessions, thereby mapping the features associated with individual subscriber sessions to parameters of the aggregation of subscriber sessions. Optionally, the mappings and generation of the second model may be based at least in part on probabilities associated with a state and current context of the subscriber.
Further, a third mapping that maps content items to the features of the individual subscriber sessions may be generated. In exemplary implementations, a third model may be trained to predict the features of the individual subscriber sessions based on an input of one or more content items. Optionally, the mappings and generation of the third model may be based at least in part on probabilities associated with a state and current context of the subscriber.
Although exemplary embodiments of the present disclosure are described as utilizing three interim mappings (e.g., the parameters associated with an aggregation of subscriber sessions to the long-term objective and the features associated with individual subscriber sessions to the parameters associated with an aggregation of subscriber sessions) and/or metrics, any number of interim mappings and/or metrics may be used. The mappings may then be utilized to optimize, fine-tune, and/or otherwise train a recommendation system and/or service to configure the recommendation system and/or service to recommend content items to subscribers to achieve the long-term objective, as in step 1906. For example, the mapping may be used to modify a utility function associated with one or more stages of the recommendation system, be used as rewards in applying a reinforcement learning technique to further train and/or fine-tune the recommendation system, and the like.
In step 1908, a request for content items may be received. The request may be an explicit request, such as a text-based search request or a specific search request in which one or more content items are selected or provided by a user. In other examples, the request may be implicit. For example, as a user browses content items of the hosting service, the hosting service may maintain identifiers of the browsed content items and utilize those content items as the basis for a request. As another example, if a user selects to view a close-up of a content item from the corpus, that content item may be utilized as a request to determine other content items that are similar to the viewed content item. According to other aspects, the request may be included as part of and/or in connection with a request to access a homepage and/or a home feed, an indication that recommended content items are to be pushed to a subscriber, and the like. Still further, the disclosed implementations may be utilized to determine content items without an explicit or implicit request from a user.
In steps 1910 and 1912, recommended content items may be determined using the optimized recommendation system to encourage the long-term objective, and the recommended content items may be returned (e.g., provided to the subscriber, presented on a client device, etc.).
As shown in
In determining the alignment scores, for a particular subscriber interaction (e.g., an interaction with a content item, a like of a content item, a sharing of a content item, a saving of a content item, etc.), a sequence of subscriber engagements preceding the particular subscriber interaction may be identified. The sequence of subscriber engagements may include a sequence of content items with which the subscriber engaged prior to the particular subscriber interaction, and a corresponding candidate content item may be retrieved for each content item in the sequence of subscriber engagements. Accordingly, the particular subscriber interaction may be a query, the sequence of subscriber engagements may be the key, and the retrieved candidate content items may be the value. Given the query, key, and the value, a similarity measure may be determined between the particular subscriber interaction and each content item in the sequence of subscriber engagements. For example, a cosine similarity may be determined between an embedding vector representative of the particular subscriber interaction and an embedding vector representative of each content item included in the sequence of subscriber engagements. Other similarity measures may alternatively be utilized such as, by way of illustration and not limitation, a dot product, the Normalized Hamming Distance measure, a Euclidean distance measure, and the like. The similarity measure between the particular subscriber interaction and each content item in the sequence of subscriber engagements may represent the relevance and/or influence that each content item had on the particular subscriber interaction. Further, as the embedding vectors preferably encode non-visual features, such as textual features, semantic features, etc. (in addition to visual features) of the content items, the similarity measure is representative of a comprehensive similarity between the particular subscriber interaction and each content item in the sequence of subscriber engagements. The similarity measures between the particular subscriber interaction and each content item in the sequence of subscriber engagements may be returned as alignment scores for the content items included in the sequence of subscriber engagements. Accordingly, the alignment scores may represent the relevance and/or influence of each content item in the sequence of subscriber engagements on the particular subscriber interaction.
Additionally, as shown in
Alternatively and/or in addition, attention scores may be used in place of or in addition to alignment scores. Similar to the determination of alignment scores, a representation of a subscriber interaction (e.g., representation of the content item with which the subscriber interacted that is of interest, etc.) may be modeled as a query vector and the content items in an input sequence of content items (e.g., representations of a sequence of content items with which the subscriber interacted that may be of interest, etc.) may be modeled as the key vector. Accordingly, the dot product of the query and key vectors may provide attention scores for each content item in the input sequence of content items, which may represent a relevance of each content item in the sequence of content items to the subscriber interaction of interest.
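The dot-product attention scoring just described can be sketched in a few lines. The embeddings are random placeholders; the scaling by the square root of the dimension follows the common scaled dot-product formulation and is an added assumption, not something the description above requires.

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 8
# Query vector: representation of the subscriber interaction of interest.
query = rng.normal(size=dim)
# Key vectors: representations of the input sequence of content items
# with which the subscriber interacted.
keys = rng.normal(size=(5, dim))

# The dot product of the query with each key yields an attention score
# per content item in the input sequence, representing that item's
# relevance to the subscriber interaction of interest.
attention_scores = keys @ query / np.sqrt(dim)

# The highest-scoring content item is the one deemed most relevant.
most_relevant = int(np.argmax(attention_scores))
```

As with alignment scores, these attention scores could then be normalized (e.g., with a softmax) and used as weights over candidate content items.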
As shown in
As illustrated, in step 2056, the candidate content items and the sequences of subscriber engagements may be processed by one or more caption services. For example, the caption service may process each content item of the sequences of subscriber engagements to generate a content item caption for each content item. In implementations in which multiple caption services are utilized, the service caption generated by each caption service for a content item may be combined to generate the content item caption for the content item. Similarly, the candidate content items may also be processed by the caption service and, like content items of the sequences of subscriber engagements, a caption may be generated for each content item of the candidate content items. For example, the caption service may process each content item of the candidate content items to generate a content item caption for each content item. In implementations in which multiple caption services are utilized, the service caption generated by each caption service for a candidate content item may be combined to generate the content item caption for that content item.
In step 2058, an LLM input based on the content item caption of each content item of the sequences of subscriber engagements and the content item caption of each content item of the candidate content items may be generated to be provided as an input to an LLM. Optionally, additional information, such as additional subscriber history information (e.g., demographic information, likes, dislikes, recent activity, etc.), the long-term objective, and the like may also be used to generate the LLM input. For example, the LLM input may be generated that includes or references the content item caption for each content item of the sequences of subscriber engagements, that includes or references the content item caption for each content item of the candidate content items, and that includes instructions that the LLM is to consider the content item caption of each content item of the sequences of subscriber engagements, the long-term objective, and the like, and to select one or more content items as recommended and/or ranked content items, determine alignment scores, and the like, based on the caption of each content item from the candidate content items. The instructions may further provide a minimum and maximum number of content items that are to be returned as recommended content items, instructions to indicate a sequence in which the recommended content items are to be presented, an LLM output structure that is to be provided by the LLM, etc. Still further, the LLM input may also provide additional context or parameters to guide the LLM in selection of recommended content items. For example, additional context or parameters may be specified based on subscriber history information, such as indicating preferred styles, colors, shapes, etc., information known about the subscriber that are to be considered in conjunction with the caption of each content item in determining recommended content items, and the like.
In step 2060, the LLM input is processed using an LLM to determine one or more recommended and/or ranked content items from the candidate content items, along with a sequence in which those content items are to be presented.
In implementations in which a text request is provided as the request or content items of the request are processed to generate a text request, as suggested above, embedding vector generators can be used to generate embedding vectors from the text request and project the embedding vectors into a suitable content embedding space. Generally speaking, an embedding vector generator trained to generate embedding vectors for text-based input generates embedding vectors that project into a text-based embedding space. Similarly, an embedding vector generator trained to generate embedding vectors for image-based input generates embedding vectors that project into an image-based embedding space. To further illustrate,
According to aspects of the disclosed subject matter, rather than training embedding vector generators to generate embedding vectors that project into an embedding space according to the input type (e.g., text-based embedding vectors that project into a text-based embedding space and image-based embedding vectors that project into an image-based embedding space), one or more embedding vector generators can be trained to generate embedding vectors for text-based queries that project the text-based queries directly into the image-based embedding space. Indeed, according to aspects of the disclosed subject matter, an embedding vector generator may be trained (either as a single instance or as part of an on-going training) using query/user interaction logs to generate embedding vectors for text-based queries that project into a non-text content item embedding space.
Regarding the projection of text-based content (e.g., text-based queries 2202-1508), it should be appreciated that some text-based content will be projected, via an associated embedding vector, to the same location as an image, as is the illustrated case with text-based query 2202 “Dog” and image 2216. In other instances, text-based content may be projected, via an associated embedding vector, to a location that is near an image projected into the embedding space that, at least to a person, appears to be the same subject matter. For example, text-based query 2204 “Walking a dog” is projected near to, but not to the same location as the projection of image 2214. This possibility reflects the “freedom” of the trained embedding vector generator to differentiate on information that may or may not be apparent to a person, a common “feature” of machine learning.
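Retrieval against such a shared embedding space reduces to a nearest-neighbor lookup. The sketch below is a toy stand-in: the image names, dictionary layout, and the hard-wired behavior of the text embedder (mapping "Dog" onto image 2216's location, mimicking the training outcome described above) are all assumptions; a real embedding vector generator would be learned from query/user interaction logs.

```python
import numpy as np

rng = np.random.default_rng(3)
dim = 32
# Hypothetical image-based embedding space populated by non-text
# content items (names reference the illustrative images above).
image_embeddings = {
    "image_2214_walking_a_dog": rng.normal(size=dim),
    "image_2216_dog": rng.normal(size=dim),
    "image_other_sunset": rng.normal(size=dim),
}

def embed_text_query(query):
    """Stand-in for an embedding vector generator that projects
    text-based queries directly into the image embedding space."""
    if query == "Dog":
        # Trained to land on the same location as image 2216.
        return image_embeddings["image_2216_dog"].copy()
    return rng.normal(size=dim)

def nearest_image(query_vec):
    """Return the content item whose embedding is most similar (cosine)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(image_embeddings,
               key=lambda name: cos(query_vec, image_embeddings[name]))

result = nearest_image(embed_text_query("Dog"))
```

In production the lookup would use an approximate nearest-neighbor index over the corpus rather than an exhaustive scan.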
To further illustrate the process of responding to a text-based request with a response containing one or more non-text content items, reference is now made to
In accordance with aspects of the disclosed subject matter, content items of the corpus of content items, such as corpus of content items 134, are non-text content items. By way of illustration and not limitation, non-text content items may comprise images, video content, audio content, data files, and the like. Additionally and/or alternatively, a content item may be an aggregation of several content types (e.g., images, videos, data, etc.) and textual content (though not an aggregation of only text content). Additionally, while content items are non-text content items, these content items may be associated with related textual content. Typically, though not exclusively, related textual content associated with a content item may be referred to as metadata. This textual metadata may come from any number of text-based sources such as, by way of illustration and not limitation, source file names, source URL (uniform resource locator) data, user-supplied comments, titles, annotations, and the like.
According to aspects of the disclosed subject matter and, in maintaining the corpus of content items, such as the corpus of content items 134 illustrated in
As will be readily appreciated by those skilled in the art, a content item graph, such as content item graph 2400, includes nodes and edges, where each node corresponds to a content item of the corpus of content items, and an edge represents a relationship between two nodes corresponding to two distinct content items of the content graph. By way of illustration, nodes in content item graph 2400 are represented as circles, including nodes A-L, and relationships are presented as lines between nodes, such as relationships 2401, 2403, 2405, 2407, 2409. There may be multiple bases for relationships between content items which include, by way of illustration and not limitation, co-occurrence within a collection of content items, commonality of ownership of content items, user engagement of content items, similarity between content items, and the like.
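A minimal in-memory version of such a content item graph, with edges annotated by the basis for each relationship, might look as follows. The node labels match the illustrative nodes A-L above; the edge bases are drawn from the list in the preceding paragraph, and the dictionary layout is an implementation assumption.

```python
# Content item graph: nodes are content items; each edge records the
# two related items and the basis for the relationship.
graph = {"nodes": set(), "edges": []}

def add_edge(graph, a, b, basis):
    """Add an undirected relationship between content items a and b."""
    graph["nodes"].update([a, b])
    graph["edges"].append((a, b, basis))

add_edge(graph, "A", "B", "co-occurrence in a collection")
add_edge(graph, "A", "C", "common ownership")
add_edge(graph, "B", "D", "user engagement")

def neighbors(graph, node):
    """Content items related to `node`, with the basis of each relationship."""
    out = []
    for a, b, basis in graph["edges"]:
        if a == node:
            out.append((b, basis))
        elif b == node:
            out.append((a, basis))
    return out

related = neighbors(graph, "A")
```

At corpus scale (billions of items), the same structure would live in a graph database or adjacency store, but traversal from a node to its related content items follows the same pattern.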
In regard to process 2300, at block 2304 the hosting service receives a text-based request for content items, such as a text-based request generated as discussed above. According to aspects of the disclosed subject matter, the text-based request comprises one or more text-based terms that, collectively, provide information to a hosting service, such as hosting service 130 of
At block 2306, an optional step may be taken to conduct a semantic analysis of the received request. According to aspects of the disclosed subject matter and by way of definition, this optional semantic analysis processes the terms of the request, including identifying syntactic structures of terms, phrases, clauses, and/or sentences of the request to derive one or more meanings or intents of the subscriber's request. As should be appreciated, one or more semantic meanings or intents of the request may be used to identify a specific set of content items for terms of the search request that may have multiple meanings, interpretations or intents.
At block 2308, the received request is processed to generate a set of terms of the request. Typically, though not exclusively, the request is processed by a lexical analysis that parses the request according to white space to identify its various terms. In addition to this parsing, spell correction, expansion of abbreviations, and the like may occur in order to generate the set of terms for the received request.
At block 2310, a morphological analysis is conducted to generate a set of word pieces from the set of text-based terms of the request. According to at least some implementations of the disclosed subject matter, at least one term of the text-based request includes at least two word pieces. According to various implementations of the disclosed subject matter, the word pieces are generated according to and comprise the various parts of a word including, but not limited to: e.g., a prefix, a suffix, a prefix of a suffix, a stem, and/or a root (or roots) of a word/term, as well as sub-strings of the same. Indeed, every part of a term is captured among the word pieces for that term. Additionally, and according to further aspects of the disclosed subject matter, word pieces that are not the leading characters of a term are identified. To illustrate, for the word/term “concatenation,” the word pieces generated would be “conca,” “##tena,” and “##tion,” with the characters “##” included to designate that the following word piece was not found at the beginning of the term. According to alternative aspects of the disclosed subject matter, each word piece within the set of word pieces is a morpheme of at least one of the terms of the set of text-based terms of the request.
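By way of illustration and not limitation, the word-piece generation described above may be sketched in Python as a greedy, longest-match splitter in which non-leading pieces carry the “##” marker. The vocabulary of known pieces used here is purely hypothetical; a practical implementation would derive its vocabulary from a large text corpus.

```python
def word_pieces(term, vocab):
    """Split a term into word pieces by greedy longest-match against a
    vocabulary of known pieces.  Pieces that do not begin the term carry
    the "##" marker described above.  The vocabulary is hypothetical; a
    real system learns it from a large corpus of text."""
    pieces, start = [], 0
    while start < len(term):
        end, match = len(term), None
        while end > start:
            candidate = term[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [term]  # fall back to the whole term if no split is found
        pieces.append(match)
        start = end
    return pieces

vocab = {"conca", "##tena", "##tion", "run", "##ning"}
print(word_pieces("concatenation", vocab))  # ['conca', '##tena', '##tion']
```

Note that at this surface level a doubled consonant stays with the suffix piece (e.g., “running” splits as “run”/“##ning”), whereas a morphological analysis would recover the underlying suffix “ing.”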
Regarding the word parts, the text term “running” may be broken down into two word pieces: “run” being the root, and “##ing” being a suffix indicating ongoing action. A lexical or etymological analysis may be conducted to identify the various word parts of each term, where each word part is viewed as a “word piece.”
Regarding morphemes and by way of definition, a morpheme (or word piece) is the smallest meaningful unit in a language and is a part of a word/term. A morpheme is not identical to a word: a word includes one or more morphemes, and a morpheme may also be a complete word. By way of illustration and not limitation, “cat” is a morpheme that is also a word. On the other hand, “concatenation” is a word comprising multiple morphemes: “con,” “catenate,” and “tion,” where “catenate” is a completed form of “catena,” the completion occurring as part of generating the word pieces. The identifiers indicating that a word piece does not comprise the leading characters of the term may, or may not, be included, as determined according to implementation requirements.
According to various implementations of the disclosed subject matter, the morphological analysis may be conducted by an executable library or service, and/or a third-party service, that examines a given word and provides the morphemes for that given word. In various alternative implementations, a word/morpheme list cache may be utilized to quickly and efficiently return one or more morphemes of a given input word.
In yet a further implementation of the disclosed subject matter, various technologies, such as Byte Pair Encoding (BPE), may be used to generate word pieces for the text-based terms of the text-based request. Generally speaking, these various technologies, including BPE, operate on a set of statistical rules derived from a very large corpus of text. As those skilled in the art will appreciate, BPE is often used as a form of data compression in which the most common consecutive characters of input data are replaced with a value that does not occur within that data. Of course, in the present instance, the BPE process does not replace the consecutive characters in the term itself, but simply identifies the consecutive characters as a word piece.
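As a non-limiting sketch, the core of the BPE process, repeatedly merging the most frequent adjacent symbol pair, may be expressed as follows. The toy corpus and the number of merges are illustrative only; production systems learn merges over a very large corpus.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merge rules from a toy corpus.  Each word starts as a
    sequence of single characters; at every step the most frequent
    adjacent symbol pair is merged into one symbol."""
    corpus = [list(word) for word in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        for symbols in corpus:
            i = 0
            while i < len(symbols) - 1:
                if (symbols[i], symbols[i + 1]) == best:
                    symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
                else:
                    i += 1
    return merges, corpus
```

For example, over the toy corpus ["low", "lower", "lowest"], two merges suffice to collapse the shared prefix into the single symbol "low", which then serves as a word piece common to all three terms.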
At block 2312, embedding vectors for each of the word pieces of the set of word pieces are obtained. According to aspects of the disclosed subject matter, the embedding vectors are content item embedding vectors, meaning that the embedding vectors project the corresponding word piece into the content item embedding space of the content items in the corpus of content items.
According to various implementations of the disclosed subject matter, a content item embedding vector of a given word piece may be generated in a just-in-time manner by a suitably trained embedding vector generator. According to additional and/or alternative implementations, previously generated and cached content item embedding vectors may be retrieved from a cache of the hosting service configured to hold word piece-embedding vector pairs.
At block 2314, weightings for the various word pieces of the set of word pieces are optionally determined. Weightings may be optionally applied to emphasize important word pieces of a request. These weightings may be determined, by way of illustration and not limitation, according to the importance of the word pieces themselves, the determined potential topic of the requesting subscriber (as optionally determined in block 2306), multiple instances of a word piece among the terms of the request, and the like.
At block 2316, the embedding vectors of the word pieces are combined to form a representative embedding vector for the request. According to various implementations of the disclosed subject matter, the various embedding vectors may be averaged together to form the representative embedding vector. Optionally, the weightings determined in block 2314 may be applied in averaging the various embedding vectors to favor those word pieces of the set of word pieces that are viewed as being more important to the request.
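The combining of word-piece embedding vectors at blocks 2314-2316 may be sketched as an optionally weighted average, as follows. The vectors and weights shown are illustrative; the actual dimensionality of the embedding space is an implementation choice.

```python
def representative_vector(vectors, weights=None):
    """Form a representative embedding vector as a weighted average of
    word-piece embedding vectors.  Vectors are plain lists of floats of
    equal dimension; uniform weights are used when none are supplied."""
    if weights is None:
        weights = [1.0] * len(vectors)
    total = sum(weights)
    dims = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) / total
            for d in range(dims)]

# A word piece weighted 3x pulls the representative vector toward it.
print(representative_vector([[1.0, 0.0], [0.0, 1.0]], [3.0, 1.0]))  # [0.75, 0.25]
```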
According to implementations of the disclosed subject matter, the text-based request and the representative embedding vectors may be stored in a cache, so that subsequent instances of receiving the same text-based request may be optimized through simple retrieval of the corresponding representative embedding vector. Of course, if there is no entry for a particular request, or if the implementation does not include a text request-embedding vector cache, the representative embedding vector for a text-based request may be generated in a just-in-time manner.
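A text request-embedding vector cache of the kind described, with just-in-time generation on a cache miss, may be sketched as follows. The generator is a hypothetical callable supplied by the implementation; any embedding vector generator with this call shape would serve.

```python
class RequestEmbeddingCache:
    """Cache mapping a text-based request to its representative embedding
    vector.  On a miss, the vector is generated just-in-time by the
    supplied (hypothetical) generator and retained for reuse."""

    def __init__(self, generator):
        self._generator = generator
        self._cache = {}

    def get(self, request):
        if request not in self._cache:
            self._cache[request] = self._generator(request)
        return self._cache[request]
```

Subsequent instances of the same text-based request are then satisfied by simple retrieval, invoking the generator only once per unique request.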
With the representative embedding vector for the request determined from embedding vectors of the word pieces, at block 2318 a set of content items is determined from the corpus of content items. A description of determining a set of content items from the corpus of content items is set forth in more detail in regard to routine 2500 of
Beginning at block 2502, the representative embedding vector for the word pieces is projected into the content item embedding space. At block 2504, with the content items of the corpus of content items projected into the content item embedding space, a set of k content items, also commonly referred to as the nearest neighbors to the projected representative embedding vector, is identified. More particularly, the k content items whose projections into the content item embedding space are closest, according to the distance measurement, to the projection of the representative embedding vector are selected. In various implementations of the disclosed subject matter, the distance measurement of embedding vectors is a cosine similarity measurement. Of course, other similarity measures may alternatively be utilized such as, by way of illustration and not limitation, a Normalized Hamming Distance measure, a Euclidean distance measure, and the like. In various implementations of the disclosed subject matter, the value of k may correspond to any particular number as may be viewed as a good representation of close content items to the representative embedding vector. In various non-limiting implementations, the value of k may be twenty. Of course, in alternative implementations, the value of k may be higher or lower than twenty.
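The k-nearest-neighbor selection by cosine similarity may be sketched as follows; the content item identifiers and embedding vectors are illustrative only, and a production system would use an approximate nearest-neighbor index rather than an exhaustive sort.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def k_nearest(query, items, k=20):
    """Select the k content items whose embedding vectors are most similar
    to the projected representative embedding vector.  `items` maps
    content item identifiers to embedding vectors."""
    ranked = sorted(items,
                    key=lambda cid: cosine_similarity(query, items[cid]),
                    reverse=True)
    return ranked[:k]
```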
At block 2506, a closest content item of the corpus of content items to the projected representative embedding vector (often included among the k nearest neighbors) is identified. This closest content item may be used as an “origin” of a random-walk to identify a set of n related content items within the content item graph in which the content items of the corpus of content items are represented.
As described in greater detail in co-pending and commonly assigned U.S. patent application Ser. No. 16/101,184, filed Aug. 10, 2018, which is incorporated herein by reference, and according to aspects of the disclosed subject matter, a random-walk selection relies upon the frequency and strength of edges between nodes in a content item graph, where each edge corresponds to a relationship between two content items. As mentioned above, a “relationship” between two content items in a content item graph represents a relationship between the two content items, such as, by way of illustration and not limitation, co-occurrence within a collection, common ownership, frequency of access, and the like.
At block 2508 and according to aspects of the disclosed subject matter, a random-walk selection is used to determine a set of n related content items. This random-walk selection utilizes random selection of edge/relationship traversal between nodes (i.e., content items) in a content item graph, such as content item graph 2400, originating at the closest content item to the projected representative embedding vector. By way of illustration and not limitation, and with returned reference to
According to further aspects of the disclosed subject matter, in a random-walk, a random traversal is performed, starting with an origin, e.g., node A, in a manner that limits the distance/extent of accessed content items reached in a random traversal of the content items of the content item graph 2400 by resetting back to the original content item after several traversals. Strength of relationships (defined by the edges) between nodes is often, though not exclusively, considered during random selection to traverse to a next node. Indeed, a random-walk selection of “related nodes” relies upon frequency and strength of the various edges to ultimately identify the second set of n content items of the content item graph 2400. These “visited” nodes become candidate content items of the n content items that are related to the origin content item. At the end of several iterations of random-walking the content item graph 2400 from the origin (e.g., node A), a number of those nodes (corresponding to content items) that have been most visited become the n content items of the set of related content items. In this manner, content items close to the original content item that have stronger relationships in the content item graph are more likely included in this set of n content items. While the value of n may correspond to any particular number as may be viewed as a good representation of close content items, in various non-limiting implementations, the value of n may be twenty-five. Of course, in alternative implementations, the value for n may be higher or lower than twenty-five.
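One possible sketch of such a random-walk selection, with resets back to the origin and edge weights biasing which neighbor is traversed, is the following. The reset probability, step count, and graph shape are illustrative choices, not prescribed by the disclosure.

```python
import random
from collections import Counter

def random_walk_related(graph, origin, n=25, steps=1000, reset_prob=0.3, seed=0):
    """Random-walk selection of up to n related nodes.  The walk resets
    to the origin with probability reset_prob, limiting how far it
    strays, and edge weights make stronger relationships more likely to
    be traversed.  `graph` maps a node to (neighbor, edge_weight) pairs."""
    rng = random.Random(seed)
    visits = Counter()
    node = origin
    for _ in range(steps):
        if rng.random() < reset_prob or not graph.get(node):
            node = origin  # reset back to the origin content item
            continue
        neighbors, weights = zip(*graph[node])
        node = rng.choices(neighbors, weights=weights)[0]
        visits[node] += 1
    # The most-visited nodes become the set of related content items.
    return [visited for visited, _ in visits.most_common(n)]
```

Because visit counts accumulate in proportion to edge strength and proximity to the origin, closely and strongly related content items dominate the returned set.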
At block 2510, the set of k content items and the set of n content items (which may share common content items) are combined into a related content item list for the representative embedding vector. According to various aspects of the disclosed subject matter, the combining process may include removing duplicate instances of the same content item in the related content item list.
At block 2512, the related content item list is returned. Thereafter, routine 2500 terminates.
While routine 2500 describes the use of a combination of two techniques for identifying content, i.e., k nearest neighbors (often referred to as kNN) and random-walk, it should be appreciated that in any given implementation, either or both techniques may be used when obtaining content for a user's request from a representative embedding vector generated from word pieces of the text-based request. Accordingly, the discussion of using both techniques in routine 2500 should be viewed as illustrative and not limiting upon the disclosed subject matter.
With returned reference to routine 2300, after obtaining the related content item list, at block 2320 a set of x content items from the related content item list are selected as content items to be returned as a response to the request. At block 2322, the selected x content items are returned. Thereafter, routine 2300 terminates.
As indicated above, a trained embedding vector generator is used to generate embedding vectors into a content item embedding space for word pieces.
At block 2704, the request/content item logs are aggregated according to unique requests. In this aggregation, there may be (and will likely be) multiple content items associated with a unique, text-based request. Each of these content items represents a positive relationship to the text-based request.
At block 2706, an iteration loop is begun to iterate through and process the unique requests of the request/content item logs, to generate training data for training a machine learning model to generate embedding vectors for text-based requests into the content item embedding space. Thus, at block 2708 and with regard to a currently iterated request (with corresponding content items), a set of word pieces for the text-based request is generated. As suggested above, these word pieces may correspond to parts of the words, or, in the alternative, correspond to morphemes. At block 2710, embedding vectors are generated for each of the word pieces. According to aspects of the disclosed subject matter, the embedding vectors generated from the word pieces are embedding vectors into a text-based/word-pieces embedding space, not the content item embedding space.
At block 2712, a representative embedding vector (into the text-based/word-pieces embedding space) is generated for the request from the embedding vectors of the word pieces. Typically, though not exclusively, the word-piece embedding vectors are averaged together to form the representative embedding vector. Word pieces that are viewed as more important, e.g., root portions of word pieces, suffixes that indicate activity, etc., may be given greater weight when forming the resulting representative embedding vector.
With the representative embedding vector generated for the request, at block 2714, the content items associated with the currently iterated text-based request are projected (logically) into the multi-dimensional content item embedding space. At block 2716, the projected content items are clustered to identify a type of “neighborhood” in which a content item positively represents the text-based request. At block 2718, a centroid for the cluster is identified, along with dimensional information of the cluster.
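The centroid identification of block 2718 may be sketched as a per-dimension mean over the projected content items, with a simple radius (the greatest member-to-centroid distance) standing in for the cluster's dimensional data; an actual implementation may retain richer dimensional information.

```python
import math

def cluster_centroid(vectors):
    """Identify the centroid of content items projected into the content
    item embedding space as the per-dimension mean, along with a simple
    radius serving as stand-in dimensional data for the cluster."""
    dims = len(vectors[0])
    centroid = [sum(v[d] for v in vectors) / len(vectors)
                for d in range(dims)]
    radius = max(math.sqrt(sum((v[d] - centroid[d]) ** 2 for d in range(dims)))
                 for v in vectors)
    return centroid, radius
```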
At block 2720, the text-based request, the representative embedding vector, a centroid embedding vector of the cluster's centroid, and the cluster's dimensional data are stored as a positive training data element for training the machine learning model. Since negative training elements are also needed, at block 2722, an embedding vector in the content item space that points outside of the cluster is used to replace the centroid embedding vector and saved as a negative training element.
Regarding blocks 2716-2720, while these blocks describe the identification of a centroid of a cluster, and using the representative embedding vector, the centroid, and some measure of the cluster's dimensions as a positive training data element, in alternative implementations, each content item projected into the content item embedding space within the generated cluster is paired with the representative embedding vector and the cluster's dimensional data, and the pairing is stored as a positive training data element for training the machine learning model. In still further alternative implementations, a simple, predefined distance measure from the centroid may be used, rather than cluster dimensions.
At block 2724, if there are additional unique requests to process in the iteration, the routine 2700 returns to block 2706 to process the next unique, text-based request from the request/content item logs. Alternatively, if there are no more requests to process in the iteration, routine 2700 terminates, having generated both positive and negative training data/tuples.
As those skilled in the art will appreciate, there are often numerous ways to generate training data to train a machine learning model. In this regard,
Beginning at block 2752, a set of request/content item logs that are maintained by the hosting service are accessed. As indicated above, these request/content item logs include request/content item pairs corresponding to a text-based request by a subscriber and one or more content items with which the requesting subscriber interacted, where the one or more content items are viewed as being indicative of a positive interaction on the part of the subscriber resulting from the request. At block 2754, the request/content item logs are aggregated according to unique requests among all the requests, and further combined with the content items of each instance of a request. Of course, in this aggregation, there may be (and will likely be) multiple content items associated with a unique, text-based request. As mentioned, each of these content items represents a positive relationship to the text-based request.
At block 2756, an iteration loop is begun to iterate through and process the unique requests of the request/content item logs, to generate training data for training a machine learning model to generate embedding vectors for text-based requests into the content item embedding space. Thus, at block 2758 and with regard to a currently iterated text-based request (with corresponding content items), a set of word pieces for the text-based request is generated. As suggested above, these word pieces may correspond to parts of the words (terms of the text-based request) or, in alternative implementations, correspond to morphemes of the text terms of the text-based request.
At block 2760, the currently processed request, the content items that are associated with the currently processed request, and the word pieces are stored as a positive training element. As an alternative to generating a single training element that is associated with multiple content items, multiple positive training elements may be generated from the request and word pieces, each of the multiple positive training elements being associated with one of the content items of the multiple content items associated with the currently processed request along with the request and set of word pieces.
At block 2762, the currently processed request, a set of randomly selected content items, and the word pieces are stored as a negative training element. Touching on the alternative mentioned in regard to block 2760, multiple negative training elements may be generated, with each negative training element being associated with a single, randomly-selected content item.
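The generation of positive and negative training elements at blocks 2760-2762 may be sketched as follows. The field names are illustrative only, and a production implementation would typically exclude the request's own content items when sampling random negatives.

```python
import random

def training_elements(request, word_pieces, content_items, corpus_ids, rng=None):
    """Build one positive and one negative training element for a
    text-based request.  The positive element pairs the request and its
    word pieces with the interacted-with content items; the negative
    element substitutes randomly selected content items from the corpus."""
    rng = rng or random.Random()
    positive = {"request": request, "word_pieces": word_pieces,
                "content_items": content_items, "label": 1}
    negative = {"request": request, "word_pieces": word_pieces,
                "content_items": rng.sample(corpus_ids, k=len(content_items)),
                "label": 0}
    return positive, negative
```

Per the alternative mentioned above, the same routine could instead emit one element per content item, simply by looping over `content_items` and the sampled negatives.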
At block 2764, if there are additional unique requests to process in the iteration, the routine 2750 returns to block 2756 to process the next unique, text-based request from the request/content item logs. Alternatively, if there are no more requests to process in the iteration, routine 2750 terminates, having generated both positive and negative training data/tuples.
Returning to routine 2600, after generating positive and negative training tuples from the request/content item logs, at block 2604, a machine learning model, such as a deep neural network and/or a convolutional neural network, is trained as an embedding vector generator to generate embedding vectors into a content item embedding space for text-based requests according to the word pieces of the requests. This training of the embedding vector generator is made according to the positive and negative training tuples, i.e., the training data, as may have been generated in routine 2700. A generalized routine for training a machine learning model is set forth below in regard to routine 2800 of
After training an embedding vector generator that generates embedding vectors into a content item embedding space for text-based requests, optional steps may be taken. More particularly, at block 2606, an iteration loop may be carried out to iterate through the unique text-based requests of the request/content item logs in order to pre-generate and cache the results. Thus, at block 2608 and with regard to a currently iterated text-based request, word pieces for the request are generated. At block 2610, embedding vectors (into a text-based embedding space) are generated for the word pieces. At block 2612, the embedding vectors of the word pieces are aggregated to form a representative embedding vector (into the text-based embedding space) for the request. At block 2614, a request embedding vector is generated that projects the representative embedding vector of the request into the content item embedding space. At block 2616, the request and the request embedding vector are stored in the text request-embedding vector cache.
At block 2618, if there are any additional unique requests to process, the iteration returns to block 2606 for further processing. Alternatively, if there are no more unique requests to process and cache, the routine 2600 terminates.
Turning now to
Beginning at block 2802, the training data (comprising both positive and negative training tuples) is accessed. At block 2804, training and validation sets are generated from the training data. These training and validation sets comprise training tuples randomly selected from the training data, while retaining whether each training tuple is a positive or negative training tuple.
As those skilled in the art will appreciate, the purpose of both training and validation sets is to carry out training phases of a machine learning model (in this instance, an embedding vector generator) by a first phase of repeatedly training the machine learning model with the training set until an accuracy threshold is met, and a second phase of validating the training of the machine learning model with the validation set to validate the accuracy of the training phase. Multiple iterations of training and validation may, and frequently do, occur. Typically, though not exclusively, the training and validation sets include about the same number of training tuples. Additionally, as those skilled in the art will appreciate, a sufficient number of training tuples should be contained within each set to ensure proper training and validation, since using too few may result in a high level of accuracy among the training and validation sets, but a low level of overall accuracy in practice.
With the training and validation sets established, at block 2806, an iteration loop is begun to iterate through the training tuples of the training set. At block 2808, a content item embedding vector is generated by a machine learning model for the word pieces of the currently iterated tuple. At block 2810, the accuracy of that embedding vector is determined based on the centroid embedding vector of the currently iterated tuple and the distance measure. For example, if the content item embedding vector generated for the currently iterated tuple is within the distance measure of the centroid embedding vector of the tuple, the tracking would view this as an accurate embedding vector generation. On the other hand, if the generated embedding vector is outside of the distance measure of the centroid embedding vector of the tuple, the tracking would view this as an inaccurate embedding vector generation.
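The accuracy determination of block 2810 may be sketched as a simple distance test against the tuple's centroid embedding vector. Euclidean distance is used here for illustration; other measures, such as cosine distance, may be substituted per implementation requirements.

```python
import math

def is_accurate(generated, centroid, distance_measure):
    """A generated embedding vector counts as accurate when it lies
    within the distance measure of the tuple's centroid embedding
    vector; otherwise it counts as inaccurate."""
    distance = math.sqrt(sum((g - c) ** 2 for g, c in zip(generated, centroid)))
    return distance <= distance_measure
```

The per-tuple results of this test are what the routine tracks when deciding, at decision block 2814, whether the predetermined accuracy threshold has been met.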
After determining and tracking the accuracy of the machine learning model on the currently iterated tuple, at block 2812 if there are additional tuples in the training set to be processed, the routine 2800 returns to block 2806 to select and process the next tuple, as set forth above. Alternatively, if there are no additional tuples in the training set to be processed, the routine 2800 proceeds to decision block 2814.
At decision block 2814, a determination is made as to whether a predetermined accuracy threshold is met by the current training state of the machine learning model in processing the tuples of the training set. This determination is made according to the tracking information through processing the tuples of the training data. If the in-training machine learning model has not at least achieved this predetermined accuracy threshold, the routine 2800 proceeds to block 2816.
At block 2816, the processing parameters that affect the various processing layers of the in-training machine learning model, including but not limited to the convolutions, aggregations, formulations, and/or hyperparameters of the various layers, are updated, and the routine 2800 returns to block 2806, thereby resetting the iteration process on the training data in order to iteratively continue the training of the in-training machine learning model.
With reference again to decision block 2814, if the predetermined accuracy threshold has been met by the in-training machine learning model, routine 2800 proceeds to block 2820. At block 2820, an iteration loop is begun to process the tuples of the validation set, much like the processing of the tuples of the training set.
At block 2822, an embedding vector (that projects into the content item embedding space) is generated by the machine learning model for the currently iterated tuple of the validation set. At block 2824, the accuracy of the in-training machine learning model is determined and tracked. More particularly, if the embedding vector generated for the currently iterated tuple (of the validation set) is within the distance measure of the embedding vector of the tuple, the tracking would view this as an accurate embedding vector generation. On the other hand, if the embedding vector generated for the currently iterated tuple is outside of the distance measure of the centroid embedding vector of the tuple, the tracking would view this as an inaccurate embedding vector generation.
At block 2826, if there are additional tuples in the validation set to be processed, the routine 2800 returns to block 2820 to select and process the next tuple of the validation set, as described above. Alternatively, if there are no additional tuples to be processed, the routine 2800 proceeds to decision block 2828.
At decision block 2828, a determination is made as to whether a predetermined accuracy threshold, which may or may not be the same predetermined accuracy threshold as used in decision block 2814, is met by the machine learning model in processing the tuples of the validation set. This determination is made according to the tracking information aggregated in processing the tuples of the validation set. If the in-training machine learning model has not at least achieved this predetermined accuracy threshold, then routine 2800 proceeds to block 2816.
As set forth above, at block 2816, the processing parameters of the in-training machine learning model, including but not limited to the convolutions, aggregations, formulations, and/or hyperparameters, are updated and the routine 2800 returns to block 2806, resetting the iteration process in order to restart the iterations with the training tuples of the training set.
In the alternative, at decision block 2828, if the accuracy threshold has been met (or exceeded), it is considered that the machine learning model has been accurately trained and the routine 2800 proceeds to block 2830. At block 2830, an executable embedding vector generator is generated from the now-trained machine learning model.
As those skilled in the art will appreciate, the in-training version of the machine learning model will include elements that allow its various levels, processing variables and/or hyperparameters to be updated. In contrast, an executable embedding vector generator is generated such that those features that allow the in-training machine learning model to be updated and “trained” are removed without modifying the trained functionality of the now-trained machine learning model. Thereafter, the routine 2800 terminates.
In accordance with additional aspects and implementations of the disclosed subject matter, a computer-executed method is set forth for providing content items to a subscriber of an online hosting service. A corpus of content items is maintained by the hosting service. In maintaining this corpus of content items, each content item is associated with an embedding vector that projects the associated content item into a content item embedding space. A text-based request for content from the corpus of content items is received from a subscriber of the hosting service, and the text-based request includes one or more text-based terms. A set of word pieces is generated from the one or more text-based terms. In some implementations, the set of word pieces includes at least two word pieces generated from at least one text-based term. An embedding vector is obtained for each word piece of the set of word pieces. Regarding the embedding vectors, each embedding vector for each word piece projects a corresponding word piece into the content item embedding space. With the embedding vectors obtained, the embedding vectors of the word pieces of the set of word pieces are combined to form a representative embedding vector for the set of word pieces. A set of content items of the corpus of content items is then determined according to or based on a projection of the representative embedding vector for the set of word pieces into the content item embedding space. At least one content item is selected from the set of content items of the corpus of content items and returned in response to the text-based request.
In accordance with additional aspects and implementations of the disclosed subject matter, computer-executable instructions, embodied on computer-readable media, are presented that implement a method of a hosting service that responds to a text-based request with one or more content items. A corpus of content items is maintained by the hosting service. In maintaining this corpus of content items, each content item is associated with an embedding vector that projects the associated content item into a content item embedding space. A text-based request for content from the corpus of content items is received, and the text-based request includes one or more text-based terms. A set of word pieces is generated from the one or more text-based terms. In some but not all implementations, the set of word pieces includes at least two word pieces generated from at least one text-based term. An embedding vector is obtained for each word piece of the set of word pieces. Regarding the embedding vectors, each embedding vector for each word piece projects a corresponding word piece into the content item embedding space. With the embedding vectors obtained, the embedding vectors of the word pieces of the set of word pieces are combined to form a representative embedding vector for the set of word pieces. A set of content items of the corpus of content items is then determined according to or based on a projection of the representative embedding vector for the set of word pieces into the content item embedding space. At least one content item is selected from the set of content items of the corpus of content items and returned in response to the text-based request.
According to additional aspects of the disclosed subject matter, a computer system that provides one or more content items in response to a request is presented. In execution, the computer system is configured to, at least, maintain an embedding vector associated with each content item of a corpus of content items, each embedding vector suitable to project the associated content item into a content item embedding space. A text-based request for content items of the corpus of content items is received. The request comprises one or more text-based terms, and a set of word pieces is generated from the one or more text-based terms. As discussed herein, the set of word pieces includes at least two word pieces generated from at least one text-based term of the received request. An embedding vector is obtained for each word piece of the set of word pieces, such that each embedding vector for each word piece projects a corresponding word piece into the content item embedding space. The embedding vectors of the word pieces of the set of word pieces are combined to form a representative embedding vector for the set of word pieces. A set of content items of the corpus of content items is then determined based on and/or according to a projection of the representative embedding vector for the set of word pieces into the content item embedding space. At least one content item from the set of content items of the corpus of content items is selected and returned in response to the request.
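By way of a non-limiting illustrative sketch, the flow described above (generating word pieces from a request's terms, combining the word pieces' embedding vectors into a representative embedding vector, and selecting the nearest content items in the shared embedding space) might be implemented as follows. The averaging combination, the cosine-similarity measure, and all names are assumptions of this sketch, not limitations of the disclosed subject matter:

```python
import numpy as np

def recommend(request_pieces, piece_embeddings, corpus_embeddings, corpus_ids, k=3):
    """Combine the word pieces' embedding vectors into a representative
    embedding vector and return the k nearest content items in the shared
    content item embedding space (cosine similarity)."""
    # One embedding vector per word piece, from a hypothetical lookup table.
    vectors = np.stack([piece_embeddings[piece] for piece in request_pieces])
    # Averaging is one possible combination; the disclosure leaves this open.
    representative = vectors.mean(axis=0)
    # Cosine similarity between the representative vector and each corpus item.
    sims = corpus_embeddings @ representative
    sims = sims / (np.linalg.norm(corpus_embeddings, axis=1)
                   * np.linalg.norm(representative))
    nearest = np.argsort(sims)[::-1][:k]
    return [corpus_ids[i] for i in nearest]
```

In a deployed system, the exhaustive similarity scan above would typically be replaced by an approximate nearest-neighbor index over the corpus embeddings.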
Regarding routines 600, 900, 1000, 1900, 2000, 2050, 2300, 2500, 2600, 2700, 2750 and 2800 described above, as well as other routines and/or processes described or suggested herein, while these routines/processes are expressed in regard to discrete steps, these steps should be viewed as being logical in nature and may or may not correspond to any specific, actual and/or discrete execution steps of a given implementation. Also, the order in which these steps are presented in the various routines and processes, unless otherwise indicated, should not be construed as the only or best order in which the steps may be carried out. Moreover, in some instances, some of these steps may be combined and/or omitted.
Optimizations of routines may be carried out by those skilled in the art without modification of the logical process of these routines and processes. Those skilled in the art will recognize that the logical presentation of steps is sufficiently instructive to carry out aspects of the claimed subject matter irrespective of any specific development or coding language in which the logical instructions/steps are encoded. Additionally, while some of these routines and processes may be expressed in the context of recursive routines, those skilled in the art will appreciate that such recursive routines may be readily implemented as non-recursive calls without actual modification of the functionality or result of the logical processing. Accordingly, the particular use of programming and/or implementation techniques and tools to implement a specific functionality should not be construed as limiting upon the disclosed subject matter.
Of course, while these routines and/or processes include various novel features of the disclosed subject matter, other steps (not listed) may also be included and carried out in the execution of the subject matter set forth in these routines, some of which have been suggested above. Those skilled in the art will appreciate that the logical steps of these routines may be combined or may comprise multiple steps. Steps of the above-described routines may be carried out in parallel or in series. Often, but not exclusively, the functionality of the various routines is embodied in software (e.g., applications, system services, libraries, and the like) that is executed on one or more processors of computing devices, such as the computer system described in
As suggested above, these routines and/or processes are typically embodied within executable code segments and/or modules comprising routines, functions, looping structures, selectors and switches such as if-then and if-then-else statements, assignments, arithmetic computations, and the like that, in execution, configure a computing device to operate in accordance with the routines/processes. However, the exact implementation in executable statements of each of the routines is based on various implementation configurations and decisions, including programming languages, compilers, target processors, operating environments, and the linking or binding operation. Those skilled in the art will readily appreciate that the logical steps identified in these routines may be implemented in any number of ways and, thus, the logical descriptions set forth above are sufficiently enabling to achieve similar results.
While many novel aspects of the disclosed subject matter are expressed in executable instructions embodied within applications (also referred to as computer programs), apps (small, generally single or narrow purposed applications), and/or methods, these aspects may also be embodied as computer-executable instructions stored by computer-readable media, also referred to as computer-readable storage media, which (for purposes of this disclosure) are articles of manufacture. As those skilled in the art will recognize, computer-readable media can host, store and/or reproduce computer-executable instructions and data for later retrieval and/or execution. When the computer-executable instructions that are hosted or stored on the computer-readable storage devices are executed by a processor of a computing device, the execution thereof causes, configures and/or adapts the executing computing device to carry out various steps, methods and/or functionality, including those steps, methods, and routines described above in regard to the various illustrated routines and/or processes. Examples of computer-readable media include but are not limited to: optical storage media such as Blu-ray discs, digital video discs (DVDs), compact discs (CDs), optical disc cartridges, and the like; magnetic storage media including hard disk drives, floppy disks, magnetic tape, and the like; memory storage devices such as random-access memory (RAM), read-only memory (ROM), memory cards, thumb drives, and the like; cloud storage (i.e., an online storage service); and the like. While computer-readable media may reproduce and/or cause to deliver the computer-executable instructions and data to a computing device for execution by one or more processors via various transmission means and mediums, including carrier waves and/or propagated signals, for purposes of this disclosure computer-readable media expressly excludes carrier waves and/or propagated signals.
Regarding computer-readable media,
Turning to
As will be appreciated by those skilled in the art, the memory 3004 typically (but not always) comprises both volatile memory 3006 and non-volatile memory 3008. Volatile memory 3006 retains or stores information so long as the memory is supplied with power. In contrast, non-volatile memory 3008 can store (or persist) information even when a power supply is not available. In general, RAM and CPU cache memory are examples of volatile memory 3006 whereas ROM, solid-state memory devices, memory storage devices, and/or memory cards are examples of non-volatile memory 3008.
As will be further appreciated by those skilled in the art, the CPU 3002 executes instructions retrieved from the memory 3004 from computer-readable media, such as computer-readable medium 2908 of
Further still, the illustrated computer system 3000 typically also includes a network communication interface 3012 for interconnecting this computing system with other devices, computers and/or services over a computer network, such as network 108 of
The illustrated computer system 3000 also frequently, though not exclusively, includes a graphics processing unit (GPU) 3014. As those skilled in the art will appreciate, a GPU is a specialized processing circuit designed to rapidly manipulate and alter memory. Initially designed to accelerate the creation of images in a frame buffer for output to a display, due to their ability to manipulate and process large quantities of memory, GPUs are advantageously applied to training machine learning models and/or neural networks that manipulate large amounts of data, including LLMs and/or the generation of embedding vectors of text terms of an n-gram. One or more GPUs, such as GPU 3014, are often viewed as essential processing components of a computing system when conducting machine learning techniques. Also, and according to various implementations, while GPUs are often included in computing systems and available for processing or implementing machine learning models, multiple GPUs are also often deployed as online GPU services or farms, including machine learning processing farms.
The illustrated computer system 3000 may also include an LLM 3030, a caption service 3031, and/or a caption data store 3036. As discussed herein, the caption service 3031 may process content items and generate content item captions for each content item and/or generate a session caption for a session of content items. Captions, such as content item captions and/or session captions, may be stored in and/or accessed from the caption data store 3036. The LLM 3030 may process content item captions and/or session captions that are included in, or referenced by, an LLM input and generate narrative descriptions of the sessions and/or indicate content item identifiers. Those narrative descriptions may be provided as a text-based request that is used to determine recommended content items, as discussed herein.
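As a hedged illustration of the caption-to-request flow described above, the following sketch assembles an LLM input from session captions and uses the resulting narrative description as the text-based request submitted to the query service. The prompt wording and the `llm` callable are hypothetical assumptions of the sketch:

```python
def build_llm_input(session_captions):
    """Assemble an LLM input asking for a narrative description of the
    session content items, given their captions (the prompt wording here
    is an illustrative assumption, not a fixed prompt of the disclosure)."""
    caption_lines = "\n".join(f"- {caption}" for caption in session_captions)
    return (
        "The following captions describe content items from one user session:\n"
        f"{caption_lines}\n"
        "Write a short narrative description of the session's overall theme."
    )

def narrative_to_request(llm, session_captions):
    # The LLM output, a narrative description of the session, is used
    # directly as the text-based request submitted to the query service.
    return llm(build_llm_input(session_captions))
```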
Also included in the illustrated computer system 3000 is a response module 3020. As operationally described above in regard to routine 2300 of
In responding to a text-based request from a subscriber, the response module 3020 of the hosting service operating on the computer system 3000 utilizes term generator 3024 that conducts a lexical analysis of a received request and generates a set of text-based terms. The response module 3020 further utilizes a word pieces generator 3026 to generate a set of word pieces for the text-based terms of the request.
In identifying content items for the request, the response module 3020 utilizes a trained, executable embedding vector generator 3022 that generates or obtains a request embedding vector for the set of word pieces generated, as described in routine 2300 above, from the text-based terms of a text-based request.
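The word-piece generation performed by the word pieces generator 3026 may be sketched as a greedy longest-match split, under which at least one text-based term yields two or more word pieces. The vocabulary and the "##" continuation-piece convention below are illustrative assumptions, not requirements of the disclosed subject matter:

```python
def to_word_pieces(terms, vocab):
    """Greedy longest-match word-piece generation: each term is split into
    the longest vocabulary entries available, so a single term may yield
    two or more word pieces (vocab contents here are illustrative)."""
    pieces = []
    for term in terms:
        start = 0
        while start < len(term):
            end = len(term)
            # Shrink the window until a known piece is found.
            while end > start:
                piece = term[start:end] if start == 0 else "##" + term[start:end]
                if piece in vocab:
                    pieces.append(piece)
                    break
                end -= 1
            else:
                pieces.append(term[start])  # fall back to a single character
                start += 1
                continue
            start = end
    return pieces
```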
In addition to the above, the illustrated computer system 3000 also includes a training tuple generator 3028 that generates training tuples from request/content item logs 3040 (also referred to as request/user interaction logs) of the hosting service implemented on the computer system 3000.
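The training tuple generation may be sketched as follows, under the assumption (the log schema is not fixed by the disclosure in this form) that each log entry records the request text, the content items returned, and the content items the user interacted with; an interacted-with item serves as a positive example and a returned-but-ignored item as a sampled negative example:

```python
import random

def generate_training_tuples(logs, rng=None):
    """Build (request, positive, negative) training tuples from the
    request/content item logs; the entry keys below are assumptions."""
    rng = rng or random.Random(0)  # deterministic default for this sketch
    tuples = []
    for entry in logs:
        # Items returned but never engaged with are negative candidates.
        negatives = [i for i in entry["returned"] if i not in entry["engaged"]]
        for positive in entry["engaged"]:
            if negatives:
                tuples.append((entry["request"], positive, rng.choice(negatives)))
    return tuples
```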
Regarding the various components of the exemplary computer system 3000, those skilled in the art will appreciate that many of these components may be implemented as executable software modules stored in the memory of the computing device, as hardware modules and/or components (including SoCs, i.e., systems on a chip), or a combination of the two. Indeed, components may be implemented according to various executable implementations including, but not limited to, executable software modules that carry out one or more logical elements of the processes described in this document, or as hardware and/or firmware components that include executable logic to carry out the one or more logical elements of the processes described in this document. Examples of these executable hardware components include, by way of illustration and not limitation, ROM (read-only memory) devices, programmable logic array (PLA) devices, PROM (programmable read-only memory) devices, EPROM (erasable PROM) devices, and the like, each of which may be encoded with instructions and/or logic which, in execution, carry out the functions described herein.
For purposes of clarity and by way of definition, the term “exemplary,” as used in this document, should be interpreted as serving as an illustration or example of something, and it should not be interpreted as an ideal or leading illustration of that thing. Stylistically, when a word or term is followed by “(s),” the meaning should be interpreted as indicating the singular or the plural form of the word or term, depending on whether there is a single instance or multiple instances of the term/item. For example, the term “subscriber(s)” should be interpreted as one or more subscribers. Moreover, the use of the combination “and/or” with multiple items should be viewed as meaning either or both items.
While various novel aspects of the disclosed subject matter have been described, it should be appreciated that these aspects are exemplary and should not be construed as limiting. Variations and alterations to the various aspects may be made without departing from the scope of the disclosed subject matter.
This application is a continuation-in-part application of and claims benefit to U.S. patent application Ser. No. 18/499,984, filed on Nov. 1, 2023 and entitled “Identifying Image Based Content Items Using a Large Language Model,” which is hereby incorporated by reference herein in its entirety.
| | Number | Date | Country |
|---|---|---|---|
| Parent | 18499984 | Nov 2023 | US |
| Child | 18678748 | | US |