The following relates to media generation. Media is often presented in a context. For example, an image is often viewed on a website. Media that is stylized to fit the context that the media will be presented in can help deliver a more impactful, meaningful, and personalized user experience than contextually independent media.
However, manually editing media in order to blend well with a context can be difficult and time-consuming. There is therefore a need in the art for contextual media generation systems and methods.
Embodiments of the present disclosure provide a media generation system that uses a reinforcement learning model (for example, a machine learning-based reinforcement learning model) to generate a variant of a media object based on a context. By using reinforcement learning to generate the variant based on the context, the media generation system can thereby edit a media object to fit the context.
A method, apparatus, non-transitory computer readable medium, and system for media generation are described. One or more aspects of the method, apparatus, non-transitory computer readable medium, and system include obtaining a media object and context data describing a context of the media object, wherein the media object comprises one or more modification parameters; generating a modified media object by adjusting the one or more modification parameters using a reinforcement learning model based on the context data; and providing the modified media object within the context.
A method, apparatus, non-transitory computer readable medium, and system for media generation are described. One or more aspects of the method, apparatus, non-transitory computer readable medium, and system include obtaining a media object and context data describing a context of the media object, wherein the media object comprises one or more modification parameters; generating a modified media object by adjusting the one or more modification parameters using a reinforcement learning model based on the context data; computing a reward value on the modified media object; and updating parameters of the reinforcement learning model based on the reward value.
An apparatus, system, and method for media generation are described. One or more aspects of the apparatus, system, and method include a processor; a memory including instructions executable by the processor to perform the steps of: obtaining a media object and context data describing a context of the media object, wherein the media object comprises one or more modification parameters; selecting, by a reinforcement learning model, an action for modifying the one or more modification parameters based on the context data; and generating, by a media editing application, a modified media object by adjusting the one or more modification parameters based on the action.
Embodiments of the present disclosure relate to media generation. Media is often presented in a context (e.g., circumstances under which media is provided to a user). For example, an image is often viewed on a website. Media that is stylized to fit the context it will be presented in can help deliver a more impactful, meaningful, and personalized user experience than contextually independent media.
Examples of a context could include a website's design template, a target user, a role of the target user, a task at hand or a step in the task, a location of the user, a time and date, a user device being used, etc. Manually styling the media to blend well with the context can be difficult and time-consuming, and creating and delivering contextual media that resonates with users at scale is a very challenging task.
A conventional approach to media contextualization can include traditional methods like Bayesian optimization and A/B testing. However, these traditional methods are feasible only in situations in which a specific design template is chosen from a limited set of design templates for a given context. In a real-world scenario, possible media variations are not predefined and are virtually infinite. For example, an image can be stylized in terms of vibrance, saturation, exposure, contrast, etc., while taking into account website-specific features like font size, font color, theme, background styles, etc., or other context such as a user role, a location of the user, a time and date, etc.
Embodiments of the present disclosure generate a contextualized media object by optimizing features of a media object based on context associated with the media object. In particular, in some cases, embodiments of the present disclosure use a reinforcement learning model to efficiently search across a vast space of media variations to select an optimized variation based on the context. In some cases, the reinforcement learning model leverages knowledge gained from past optimizations to accelerate the search for the optimized media variation for the context. In other words, in some cases, a media generation system uses reinforcement learning to learn how to generate a modified media object based on an input media object and a context, where the modified media object is optimized for the context. Embodiments of the present disclosure thereby provide media generation technology with increased efficiency compared to conventional media generation systems.
Additionally, in some cases, the reinforcement learning model uses a reward function that is trained based on human feedback to mimic preferences of human users when evaluating modified media objects generated by the reinforcement learning model. Accordingly, in some cases, the media generation system automatically modifies and optimizes a media object based on the context that the media is associated with in terms of user perception of the modified media object within the context. The modified media object is optimized without constraining the modified media object to a pre-defined variant template of the media object. Such a data-driven and adaptive framework can help content creators save time and effort.
An example of the present disclosure is used in an image generation setting. For example, a user wants to edit an image (e.g., a media object) so that the image will look aesthetically better when included on a website (e.g., a context). For example, in some cases, the user is interested in editing the image to better suit the background color, text font, and text color (e.g., context data) of the website.
The user provides the image and the context data to the media generation system. The media generation system uses a reinforcement learning model to take iterative actions to generate iterative images based on iterative states, where the state corresponds to the image at a given time, a previous action, and the context data. In some cases, the action is also taken based on iterative rewards that evaluate a success of an action in terms of how appropriate an iteratively generated image is for the context, as would be perceived by a human observer. The media generation system thereby generates a new image as a final iteration that is optimized for the context.
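For illustration, a simplified sketch of this iterative loop is shown below in Python. The agent, editor, and reward-model objects, their method names, and the fixed number of steps are assumptions made for the example and do not correspond to any particular implementation described herein.

```python
# Hedged sketch of the iterative contextualization loop: the agent observes a
# state (current image, previous action, context), proposes an edit, the
# editor applies it, and a reward model scores the result for the context.
def contextualize(image, context_vector, agent, editor, reward_model, num_steps=10):
    prev_action = None
    for _ in range(num_steps):
        state = (image, prev_action, context_vector)         # state at time t
        action = agent.select_action(state)                  # e.g., brightness/contrast/hue adjustments
        image = editor.apply(image, action)                  # media editing application step
        reward = reward_model.score(image, context_vector)   # context-appropriateness signal
        agent.observe(state, action, reward)                 # feedback used for learning
        prev_action = action
    return image                                             # final, context-optimized image
```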
Further example applications of the media generation system in an image generation setting are provided with reference to
A system and an apparatus for media generation is described with reference to
Some examples of the system and the apparatus further include a contextual media interface configured to display the modified media object within the context. Some examples of the system and the apparatus further include a reward network configured to compute a reward value for the reinforcement learning model based on the modified media object. Some examples of the system and the apparatus further include a training component configured to update parameters of the reward network based on instructor feedback.
Referring to
According to some aspects, user device 110 is a personal computer, laptop computer, mainframe computer, palmtop computer, personal assistant, mobile device, or any other suitable processing apparatus. In some examples, user device 110 includes software, such as a graphical user interface, that displays and facilitates the communication of information, such as a media object, a modified media object, a context, and context data between user 105 and media generation apparatus 115.
According to some aspects, a user interface enables user 105 to interact with user device 110. In some embodiments, the user interface may include an audio device, such as an external speaker system, an external display device such as a display screen, or an input device (e.g., a remote-control device interfaced with the user interface directly or through an IO controller module). In some cases, the user interface may be a graphical user interface (GUI).
According to some aspects, media generation apparatus 115 includes a computer implemented network. In some embodiments, the computer implemented network includes a machine learning model (for example, a reinforcement learning model described with reference to
In some cases, media generation apparatus 115 is implemented on a server. A server provides one or more functions to users linked by way of one or more of various networks, such as cloud 120. In some cases, the server includes a single microprocessor board that includes a microprocessor responsible for controlling all aspects of the server. In some cases, the server uses a microprocessor and a protocol, such as hypertext transfer protocol (HTTP), simple mail transfer protocol (SMTP), file transfer protocol (FTP), simple network management protocol (SNMP), or the like, to exchange data with other devices or users on one or more of the networks. In some cases, the server is configured to send and receive hypertext markup language (HTML) formatted files (e.g., for displaying web pages). In various embodiments, the server comprises a general-purpose computing device, a personal computer, a laptop computer, a mainframe computer, a supercomputer, or any other suitable processing apparatus.
Media generation apparatus 115 is an example of, or includes aspects of, the corresponding element described with reference to
Cloud 120 is a computer network configured to provide on-demand availability of computer system resources, such as data storage and computing power. In some examples, cloud 120 provides resources without active management by a user. The term “cloud” is sometimes used to describe data centers available to many users over the Internet. Some large cloud networks have functions distributed over multiple locations from central servers. A server is designated an edge server if it has a direct or close connection to a user. In some cases, cloud 120 is limited to a single organization. In other examples, cloud 120 is available to many organizations. In one example, cloud 120 includes a multi-layer communications network comprising multiple edge routers and core routers. In another example, cloud 120 is based on a local collection of switches in a single physical location. According to some aspects, cloud 120 provides communications between user device 110, media generation apparatus 115, and database 125.
Database 125 is an organized collection of data. In an example, database 125 stores data in a specified format known as a schema. According to some aspects, database 125 is structured as a single database, a distributed database, multiple distributed databases, or an emergency backup database. In some cases, a database controller manages data storage and processing in database 125. In some cases, a user interacts with the database controller. In other cases, the database controller operates automatically without interaction from the user. According to some aspects, database 125 is external to media generation apparatus 115 and communicates with media generation apparatus 115 via cloud 120. According to some aspects, database 125 is included in media generation apparatus 115. Database 125 is an example of, or includes aspects of, the corresponding element described with reference to
According to some aspects, processor unit 205 includes one or more processors. A processor is an intelligent hardware device, such as a general-purpose processing component, a digital signal processor (DSP), a central processing unit (CPU), a graphics processing unit (GPU), a microcontroller, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a programmable logic device, a discrete gate or transistor logic component, a discrete hardware component, or any combination thereof. In some cases, processor unit 205 is configured to operate a memory array using a memory controller. In other cases, the memory controller is integrated into processor unit 205. In some cases, processor unit 205 is configured to execute computer-readable instructions stored in memory unit 210 to perform various functions. In some embodiments, processor unit 205 includes special-purpose components for modem processing, baseband processing, digital signal processing, or transmission processing.
According to some aspects, memory unit 210 includes one or more memory devices. Examples of a memory device include random access memory (RAM), read-only memory (ROM), or a hard disk. Examples of memory devices include solid state memory and a hard disk drive. In some examples, memory is used to store computer-readable, computer-executable software including instructions that, when executed, cause a processor of processor unit 205 to perform various functions described herein. In some cases, memory unit 210 includes a basic input/output system (BIOS) that controls basic hardware or software operations, such as an interaction with peripheral components or devices. In some cases, memory unit 210 includes a memory controller that operates memory cells of memory unit 210. For example, the memory controller may include a row decoder, column decoder, or both. In some cases, memory cells within memory unit 210 store information in the form of a logical state.
According to some aspects, reinforcement learning model 215 obtains a media object and context data describing a context of the media object, where the media object includes one or more modification parameters. In some examples, reinforcement learning model 215 generates a modified media object by adjusting the one or more modification parameters based on the context data. In some aspects, the context includes a graphical user interface. In some aspects, the context data includes at least one of a background color, a font style, or a font color of the graphical user interface. In some aspects, the one or more modification parameters includes a contrast, a hue, and a brightness of the media object.
In some examples, reinforcement learning model 215 receives feedback based on the modified media object. In some examples, reinforcement learning model 215 computes a reward value based on the feedback. In some examples, reinforcement learning model 215 updates reinforcement learning model 215 based on the reward value. In some examples, reinforcement learning model 215 generates a subsequent modified media object using the updated reinforcement learning model 215.
In some examples, reinforcement learning model 215 generates state information for the media object based on features of the media object, where the modified media object is generated based on the state information. In some aspects, the state information includes a previous action of the reinforcement learning model 215. In some examples, reinforcement learning model 215 selects an action from an action set corresponding to potential values of the one or more modification parameters using the reinforcement learning model 215, where the modified media object is generated by applying the action to the media object using media editing application 220.
According to some aspects, reinforcement learning model 215 obtains a media object and context data describing a context of the media object, where the media object includes one or more modification parameters. In some examples, reinforcement learning model 215 generates a modified media object by adjusting the one or more modification parameters using a reinforcement learning model 215 based on the context data. In some examples, reinforcement learning model 215 computes a reward value on the modified media object. In some examples, reinforcement learning model 215 updates parameters of reinforcement learning model 215 based on the reward value.
In some examples, reinforcement learning model 215 computes a static reward based on features of the media object, where the reward value includes the static reward. In some examples, reinforcement learning model 215 identifies an acceptable range for the features of the media object.
According to some aspects, reinforcement learning model 215 obtains a media object and context data describing a context of the media object, where the media object includes one or more modification parameters. In some examples, reinforcement learning model 215 selects an action for modifying the one or more modification parameters based on the context data.
According to some aspects, reinforcement learning model 215 comprises one or more artificial neural networks (ANNs). An ANN is a hardware or a software component that includes a number of connected nodes (i.e., artificial neurons) that loosely correspond to the neurons in a human brain. Each connection, or edge, transmits a signal from one node to another (like the physical synapses in a brain). When a node receives a signal, it processes the signal and then transmits the processed signal to other connected nodes. In some cases, the signals between nodes comprise real numbers, and the output of each node is computed by a function of the sum of its inputs. In some examples, nodes may determine their output using other mathematical algorithms (e.g., selecting the max from the inputs as the output) or any other suitable algorithm for activating the node. Each node and edge is associated with one or more node weights that determine how the signal is processed and transmitted.
In ANNs, a hidden (or intermediate) layer includes hidden nodes and is located between an input layer and an output layer. Hidden layers perform nonlinear transformations of inputs entered into the network. Each hidden layer is trained to produce a defined output that contributes to a joint output of the output layer of the neural network. Hidden representations are machine-readable data representations of an input that are learned from a neural network's hidden layers and are produced by the output layer. As the neural network is trained and its understanding of the input improves, the hidden representation is progressively differentiated from earlier iterations.
During a training process of an ANN, the node weights are adjusted to improve the accuracy of the result (i.e., by minimizing a loss which corresponds in some way to the difference between the current result and the target result). The weight of an edge increases or decreases the strength of the signal transmitted between nodes. In some cases, nodes have a threshold below which a signal is not transmitted at all. In some examples, the nodes are aggregated into layers. Different layers perform different transformations on their inputs. The initial layer is known as the input layer and the last layer is known as the output layer. In some cases, signals traverse certain layers multiple times.
According to some aspects, reinforcement learning model 215 comprises an agent, an environment, or a combination thereof. In some cases, reinforcement learning model 215 is implemented as hardware, as firmware, as software stored in memory unit 210 and executed by processor unit 205, or as a combination thereof. Reinforcement learning model 215 is an example of, or includes aspects of, the corresponding element described with reference to
According to some aspects, media editing application 220 generates a modified media object by adjusting the one or more modification parameters based on the action. According to some aspects, media editing application 220 is included in reinforcement learning model 215. According to some aspects, media editing application 220 is implemented in an environment of reinforcement learning model 215.
In some cases, media editing application 220 is implemented as hardware, as firmware, as software stored in memory unit 210 and executed by processor unit 205, or as a combination thereof. Media editing application 220 is an example of, or includes aspects of, the corresponding element described with reference to
According to some aspects, contextual media interface 225 provides the modified media object within the context. According to some aspects, contextual media interface 225 receives instructor feedback based on the modified media object. In some examples, contextual media interface 225 includes in a dataset a first trajectory corresponding to the modified media object, a second trajectory corresponding to the additional modified media object, and the instructor feedback.
According to some aspects, contextual media interface 225 is configured to display the modified media object within the context. In some cases, contextual media interface 225 is implemented as a graphical user interface. According to some aspects, contextual media interface 225 is provided on a user device (such as described with reference to
According to some aspects, reward network 230 computes a dynamic reward based on state information for the media object, where the reward value includes the dynamic reward. According to some aspects, reward network 230 is included in reinforcement learning model 215. According to some aspects, reward network 230 is implemented in an environment of reinforcement learning model 215.
According to some aspects, reward network 230 comprises one or more ANNs. In some cases, reward network 230 comprises one or more fully connected (FC) layers. According to some aspects, reward network 230 is implemented as hardware, as firmware, as software stored in memory unit 210 and executed by processor unit 205, or as a combination thereof. Reward network 230 is an example of, or includes aspects of, the corresponding element described with reference to
According to some aspects, training component 235 computes a dynamic reward loss based on the instructor feedback. In some examples, training component 235 updates parameters of the reward network 230 based on the dynamic reward loss.
According to some aspects, training component 235 is implemented as hardware, as firmware, as software stored in memory unit 210 and executed by processor unit 205, or as a combination thereof. According to some aspects, training component 235 is omitted from media generation apparatus 200 and is implemented in a separate apparatus. According to some aspects, training component 235 included in the separate apparatus communicates with media generation apparatus 200 to perform the functions described herein. According to some aspects, training component 235 is implemented in the separate apparatus as hardware, as firmware, as software stored in memory of the separate apparatus and executed by a processor of the separate apparatus, or as a combination thereof. Training component 235 is an example of, or includes aspects of, the corresponding element described with reference to
Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning. Reinforcement learning differs from supervised learning in that labelled training data is not needed, and errors need not be explicitly corrected. Instead, reinforcement learning can balance exploration of unknown options and exploitation of existing knowledge. Specifically, reinforcement learning relates to how software agents make decisions in order to maximize a reward. In some cases, a strategy for decision making within a reinforcement learning model may be referred to as a policy.
Referring to
In some cases, agent 305 comprises a policy and a policy learning algorithm. In some cases, the policy is a mapping from state 315 to a probability distribution of actions to be taken. In some cases, within agent 305, the policy is implemented by a function approximator with tunable parameters. In some cases, the function approximator is implemented as one or more ANNs. In some cases, agent 305 uses the policy learning algorithm to continuously update parameters of the policy based on state 315, action 320, and reward 325. In some cases, agent 305 uses the policy learning algorithm to find an optimal policy that maximizes an expected cumulative long-term reward received during a task.
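As an illustration of a policy implemented by a function approximator with tunable parameters, the following sketch defines a small neural network that maps a state vector to a continuous action vector. The layer sizes and the use of PyTorch are assumptions for the example only, not the architecture of agent 305.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Toy policy approximator: maps a state vector to an action vector."""

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
            nn.Tanh(),  # bound continuous actions to [-1, 1]
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # The tunable parameters of these layers are what the policy learning
        # algorithm updates based on observed states, actions, and rewards.
        return self.net(state)
```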
Agent 305 is an example of, or includes aspects of, the corresponding element described with reference to
First state 315 is an example of, or includes aspects of, the corresponding element described with reference to
Agent 405 is an example of, or includes aspects of, the corresponding element described with reference to
In some cases, at time t, state 445 comprises a media object ft, a previous action at−1 at a previous time t−1, and context vector c. In some cases, context vector c is a vector representation of context data. In some cases, the context data includes a background color, a font style, a font color, or a combination thereof. In some cases, the context is a graphical user interface (such as a website, a software application display, etc.).
In some cases, agent 405 instructs media editing application 415 via action 450 to modify the media object ft to obtain modified media object 420 by adjusting one or more modification parameters 425. In the example shown by
In some cases, reward function 430 computes a reward based on action 450 and state 445 using static reward function 435, dynamic reward function 440, or a combination thereof. In some cases, a reward 455 based on static reward function 435 positively reinforces agent 405 when modified media object 420 retains the content of media object ft and negatively reinforces agent 405 when modified media object 420 does not retain the content of media object ft, as described with reference to
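A simplified sketch of how an action might adjust modification parameters of an image, and how a state might be assembled from the image, the previous action, and the context vector, is given below. The parameter semantics, feature encodings, and array shapes are assumptions for illustration only; hue adjustment is omitted for brevity.

```python
import numpy as np

def apply_action(image: np.ndarray, action: dict) -> np.ndarray:
    """Apply illustrative modification parameters to an RGB image in [0, 255]."""
    img = image.astype(np.float32)
    img = img + action.get("brightness", 0.0)                  # brightness: additive shift
    mean = img.mean()
    img = mean + action.get("contrast", 1.0) * (img - mean)    # contrast: scale around the mean
    return np.clip(img, 0, 255).astype(np.uint8)

def build_state(image: np.ndarray, prev_action: np.ndarray,
                context_vector: np.ndarray) -> np.ndarray:
    """Assemble a state from image features, previous action, and context vector c."""
    image_features = np.array([image.mean() / 255.0, image.std() / 255.0])
    return np.concatenate([image_features, prev_action, context_vector])
```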
Media editing application 415 is an example of, or includes aspects of, the corresponding element described with reference to
Reward function 430 is an example of, or includes aspects of, the corresponding element described with reference to
First state 445 is an example of, or includes aspects of, the corresponding element described with reference to
Referring to
Q-learning is a model-free reinforcement learning algorithm that learns a function for determining an action that an agent should take given a current state of its environment. Q-learning does not require a model of the environment, and can handle problems with stochastic transitions and rewards. The “Q” refers to a quality function that returns the reward based on the action taken, and is therefore used to provide the reinforcement. In some examples, at each time step during training of a Q-learning network, an agent selects an action, observes a reward, enters a new state, and updates the quality function Q according to a weighted average of the old value and the new observation.
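For reference, the weighted-average update described above commonly takes the standard tabular Q-learning form from the reinforcement learning literature (a general formula, not specific to this disclosure), where α is a learning rate and γ is a discount factor:

Q(s_t, a_t) \leftarrow (1 - \alpha)\,Q(s_t, a_t) + \alpha\left[r_t + \gamma \max_{a'} Q(s_{t+1}, a')\right]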
In some cases, an off-policy agent comprises an update policy (e.g., a policy that directs how an agent imagines it will act when calculating a value of a state-action pair) that is different from a behavior policy (e.g., a policy that directs how an agent will act in any given state). For example, in some cases, an off-policy agent can comprise a greedy update policy (e.g., a policy that directs an agent to consistently perform an action that would yield a highest long-term expected reward) and an epsilon-greedy behavior policy (e.g., a policy that does not direct an agent to consistently perform an action that would yield a highest expected reward, but instead sometimes directs a semi-random action). In contrast, in an on-policy agent, both the update policy and the behavior policy are the same.
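A minimal sketch of an epsilon-greedy behavior policy of this kind, using a hypothetical mapping of actions to estimated Q-values, is:

```python
import random

def epsilon_greedy(q_values: dict, epsilon: float = 0.1):
    """With probability epsilon, take a semi-random action; otherwise take the
    greedy action that maximizes the estimated Q-value."""
    if random.random() < epsilon:
        return random.choice(list(q_values.keys()))
    return max(q_values, key=q_values.get)
```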
In some cases, a DDPG agent is an actor-critic reinforcement learning agent that searches for an optimal policy that maximizes an expected cumulative long-term reward. As used herein, an “actor network” can refer to an ANN that returns, for a given state, an action, where the action often maximizes a predicted discount cumulative long-term reward. As used herein, a “critic network” can refer to an ANN that returns, for a given state and action, a predicted discounted value of the cumulative long-term reward. As used herein, “discount” refers to a degree of sensitivity to future rewards.
In some cases, agent 500 is implemented as a DDPG agent because the environment described with reference to
A DDPG agent is related to a deep Q network (DQN). Certain Q-learning systems use a deep convolutional neural network (CNN), with layers of tiled convolutional filters to mimic the effects of receptive fields. A CNN is a class of neural network that is commonly used in computer vision or image classification systems. In some cases, a CNN may enable processing of digital images with minimal pre-processing. A CNN may be characterized by the use of convolutional (or cross-correlational) hidden layers. These layers apply a convolution operation to the input before signaling the result to the next layer. Each convolutional node may process data for a limited field of input (i.e., the receptive field). During a forward pass of the CNN, filters at each layer may be convolved across the input volume, computing the dot product between the filter and the input. During the training process, the filters may be modified so that they activate when they detect a particular feature within the input.
In some cases, reinforcement learning can be unstable or divergent when a nonlinear function approximator such as a neural network is used to represent Q. This instability is based on correlations present in the sequence of observations. For example, small updates to Q may significantly change a policy, a data distribution, and correlations between Q and target values. Thus, DQN techniques may utilize experience replay, a biologically inspired mechanism that uses a random sample of prior actions instead of the most recent action to proceed. This reduces correlations in the observation sequence and smooths changes in the data distribution.
Iterative updates adjust Q towards target values that are only periodically updated, further reducing correlations with the target. In some examples, DQN models also utilize a target network to fix parameters of the target function. In some examples, a reward clipping technique is used to replace all positive rewards with one value and all negative rewards with another value. In some examples, a frame-skipping technique is used to calculate a Q value only at periodic intervals to reduce computational cost.
In some cases, a DDPG agent is based on a DQN agent, but is better suited to handle a continuous environment (such as the environment described with reference to
In some cases, each of actor network 505, critic network 510, target actor network 515, and target critic network 520 comprises one or more ANNs. In some cases, the quality function Q is implemented as critic network 510.
In some cases, agent 500 uses a replay buffer to sample experiences to update parameters of actor network 505, critic network 510, target actor network 515, and target critic network 520. For example, during each state-action trajectory roll-out in agent 500, each experience tuple (e.g., comprising a state, an action, a reward, and a next state) is stored in the replay buffer (e.g., a finite-sized cache). Then, in some cases, agent 500 samples random mini-batches of experience from the replay buffer when actor network 505 and critic network 510 are updated. In some cases, agent 500 is therefore relatively more efficient in sampling than an on-policy agent.
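A generic sketch of such a replay buffer is shown below; the capacity and tuple layout are assumptions for illustration and not the exact implementation used by agent 500.

```python
import random
from collections import deque

class ReplayBuffer:
    """Finite-sized cache of experience tuples (state, action, reward, next_state)."""

    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size: int):
        # Random mini-batches reduce correlations between consecutive
        # observations when the actor and critic networks are updated.
        return random.sample(list(self.buffer), batch_size)
```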
A method for media generation is described with reference to
In some aspects, the context comprises a graphical user interface. In some aspects, the context data includes at least one of a background color, a font style, or a font color of the graphical user interface. In some aspects, the one or more modification parameters includes a contrast, a hue, and a brightness of the media object.
Some examples of the method further include receiving feedback based on the modified media object. Some examples further include computing a reward value based on the feedback. Some examples further include updating the reinforcement learning model based on the reward value. Some examples of the method further include generating a subsequent modified media object using the updated reinforcement learning model.
Some examples of the method further include generating state information for the media object based on features of the media object, wherein the modified media object is generated based on the state information. In some aspects, the state information includes a previous action of the reinforcement learning model. Some examples of the method further include selecting an action from an action set corresponding to potential values of the one or more modification parameters using the reinforcement learning model, wherein the modified media object is generated by applying the action to the media object using a media editing application.
Referring to
At operation 605, the system provides a media object and a context for the media object. In some cases, the operations of this step refer to, or may be performed by, a user as described with reference to
In some cases, the user provides the context as context data (for example, in the case of the website, by providing HTML code corresponding to the context, or by identifying the context data in the user interface), as context data within a set of data (for example, full code for the website) or as a representation of the context (for example, as an image or other representation of the website). In some cases, the image generation system obtains the context data based on the set of data (for example, by identifying and extracting the context data from the set of data) or the representation of the context (for example, by analyzing the representation of the context and generating the context data based on the analysis).
At operation 610, the system generates a modified media object by optimizing a characteristic of the media object for the context. In some cases, the operations of this step refer to, or may be performed by, a media generation apparatus as described with reference to
At operation 615, the system displays the modified media object in the context. In some cases, the operations of this step refer to, or may be performed by, a media generation apparatus as described with reference to
Referring to
As used herein, a “media object” refers to any media content, such as an image, a video, text, audio, etc. As used herein, “context” refers to a surrounding of the media object within which the media object is observed or perceived. In some cases, a context includes a graphical user interface. In some cases, a context comprises context data. As used herein, “context data” refers to a numerical representation of the context. In some cases, context data comprises a context vector. In some cases, a “modified media object” refers to a media object comprising one or more modification parameter values that differ from the values of the corresponding modification parameters of an original media object. As used herein, a “modification parameter” refers to a representation of a feature of a media object.
At operation 705, the system obtains a media object and context data describing a context of the media object, where the media object includes one or more modification parameters. In some cases, the operations of this step refer to, or may be performed by, a reinforcement learning model as described with reference to
For example, in some cases, the reinforcement learning model obtains the media object from a user via a user device (such as the user and user device described with reference to
In some cases, the reinforcement learning model directly obtains the context data. In some cases, the media generation apparatus extracts the context data from other data and provides the context data to the reinforcement learning model.
At operation 710, the system generates a modified media object by adjusting the one or more modification parameters using a reinforcement learning model based on the context data. In some cases, the operations of this step refer to, or may be performed by, a reinforcement learning model as described with reference to
For example, in some cases, at a time t, an agent of the reinforcement learning model (such as the agent described with reference to
In some cases, at the time t, the reinforcement learning model generates state information (e.g., state st) for a media object ft based on features of the media object ft. For example, in some cases, the state st is a vector comprising the media object ft, a previous action at−1, and the context data c. In some cases, an initial action is generated as described with reference to
At operation 715, the system provides the modified media object within the context. In some cases, the operations of this step refer to, or may be performed by, a contextual media interface as described with reference to
Referring to
At operation 905, the system receives feedback based on the modified media object. In some cases, the operations of this step refer to, or may be performed by, a reinforcement learning model as described with reference to
At operation 910, the system computes a reward value based on the feedback. In some cases, the operations of this step refer to, or may be performed by, a reinforcement learning model as described with reference to
In some cases, the feedback comprises instructor feedback as described with reference to
At operation 915, the system updates the reinforcement learning model based on the reward value. In some cases, the operations of this step refer to, or may be performed by, a reinforcement learning model as described with reference to
In some cases, the agent updates parameters of a critic network (such as the critic network described with reference to
In some cases, the agent updates parameters of an actor network (such as the actor network described with reference to
In equation (1), r is the reward and γ is a discount factor. Referring to equation (1), next-state Q-values are calculated using a target critic network (such as the target critic network described with reference to
In some cases, the agent updates the target actor network according to θ′=τθ+(1−τ)θ′ and the target critic network according to w′=τw+(1−τ)w′, where τ<<1.
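The soft target updates above, and the use of a target actor and target critic to obtain next-state Q-values, can be sketched as follows. The network interfaces and the value of τ are illustrative assumptions consistent with standard DDPG practice, not the exact update of equation (1).

```python
import torch

@torch.no_grad()
def soft_update(target_net: torch.nn.Module, net: torch.nn.Module, tau: float = 0.005):
    """Polyak update: theta' <- tau * theta + (1 - tau) * theta', with tau << 1."""
    for target_param, param in zip(target_net.parameters(), net.parameters()):
        target_param.mul_(1.0 - tau).add_(tau * param)

@torch.no_grad()
def bootstrap_target(reward, next_state, target_actor, target_critic, gamma: float = 0.99):
    """Next-state Q-value from the target critic evaluated at the target actor's
    action, discounted by gamma and added to the reward."""
    next_action = target_actor(next_state)
    return reward + gamma * target_critic(next_state, next_action)
```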
According to some aspects, the reinforcement learning model generates a subsequent modified media object based on the updated parameters. For example, based on a next state (such as the next state described with reference to
Reward function 1000 is an example of, or includes aspects of, the corresponding element described with reference to
Referring to
In some cases, acceptable features for a modified media object implemented as an image may relate to pixel colors of the modified media object. For example, an image may be generated such that a combination of a brightness, a hue, and a contrast of the image (e.g., features of the image corresponding to modification parameters) causes the image to appear “overexposed” (e.g., |255−Avg(image pixel colors)|<30) or “underexposed” (e.g., |255−Avg(image pixel colors)|>180), and thus to have features that are outside the acceptable range.
In the example of
Accordingly, by determining the static reward r0, the reward function reinforces the reinforcement learning model to generate a modified media object that retains content of the media object.
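A minimal sketch of such a static reward, using the exposure-based acceptable range discussed above, is shown below; the ±1 reward values and the exact thresholds are illustrative assumptions.

```python
import numpy as np

def static_reward(image: np.ndarray, low: float = 30.0, high: float = 180.0) -> float:
    """Positively reinforce the agent when the image's average-exposure feature
    stays inside the acceptable range [low, high]; negatively reinforce otherwise."""
    deviation = abs(255.0 - float(image.mean()))
    return 1.0 if low <= deviation <= high else -1.0
```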
Referring to
A method for media generation is described with reference to
Some examples of the method further include computing a static reward based on features of the media object, wherein the reward value includes the static reward. Some examples of the method further include identifying an acceptable range for the features of the media object. Some examples further include determining whether the features of the media object are within the acceptable range, wherein the static reward is based on the determination.
Some examples of the method further include computing a dynamic reward based on state information for the media object using a reward network, wherein the reward value includes the dynamic reward. Some examples of the method further include receiving instructor feedback based on the modified media object. Some examples further include computing a dynamic reward loss based on the instructor feedback. Some examples further include updating parameters of the reward network based on the dynamic reward loss.
Some examples of the method further include generating an additional modified media object, wherein the instructor feedback is based on the additional modified media object. Some examples of the method further include including in a dataset a first trajectory corresponding to the modified media object, a second trajectory corresponding to the additional modified media object, and the instructor feedback.
Referring to
At operation 1305, the system obtains a media object and context data describing a context of the media object, where the media object includes one or more modification parameters. In some cases, the operations of this step refer to, or may be performed by, a reinforcement learning model as described with reference to
At operation 1310, the system generates a modified media object by adjusting the one or more modification parameters using a reinforcement learning model based on the context data. In some cases, the operations of this step refer to, or may be performed by, a reinforcement learning model as described with reference to
At operation 1315, the system computes a reward value on the modified media object. In some cases, the operations of this step refer to, or may be performed by, a reinforcement learning model as described with reference to
At operation 1320, the system updates parameters of the reinforcement learning model based on the reward value. In some cases, the operations of this step refer to, or may be performed by, a reinforcement learning model as described with reference to
According to some aspects, a dynamic reward function described with reference to
At operation 1405, the system receives instructor feedback based on the modified media object. In some cases, the operations of this step refer to, or may be performed by, a contextual media interface as described with reference to
At operation 1410, the system computes a dynamic reward loss based on the instructor feedback. In some cases, the operations of this step refer to, or may be performed by, a training component as described with reference to
At operation 1415, the system updates parameters of the reward network based on the dynamic reward loss. In some cases, the operations of this step refer to, or may be performed by, a training component as described with reference to
Modified media object representation 1505 is an example 1500 of, or includes aspects of, the corresponding element described with reference to
Referring to
In some cases, the contextual media interface receives an input from instructor 1515 indicating a preference y∈{(1,0), (0,1)} (e.g., instructor feedback) for either modified media object representation 1505 or additional modified media object representation 1510. In some cases, the contextual media interface adds first trajectory σ1, second trajectory σ2, and the preference y to a dataset, thereby obtaining a reward network training dataset that includes not only changes made to a media object for a context but also a human response to those changes. In some cases, the contextual media interface stores the dataset in database 1520 (such as the database described with reference to
In some cases, reward network 1525 receives modified media object representation 1505 and additional modified media object representation 1510 and determines a reward network preference for either modified media object representation 1505 or additional modified media object representation 1510. In some cases, training component 1530 trains reward network 1525 based on a comparison between the reward network preference and the instructor preference y.
For example, in some cases, reward network 1525 computes a preference predictor Pϕ(σ2>σ1) modeled using the dynamic reward function rϕ:
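A standard form of such a preference predictor, the Bradley–Terry model commonly used in preference-based reinforcement learning, is given here as a hedged reconstruction consistent with the description above (not necessarily the exact equation of the original disclosure):

P_\phi(\sigma^2 \succ \sigma^1) = \frac{\exp\left(\sum_t r_\phi\left(s_t^2, a_t^2\right)\right)}{\exp\left(\sum_t r_\phi\left(s_t^1, a_t^1\right)\right) + \exp\left(\sum_t r_\phi\left(s_t^2, a_t^2\right)\right)}

where (s_t^i, a_t^i) are the state-action pairs along trajectory σi.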
In some cases, σ2>σ1 denotes a preference for a modified media object corresponding to second trajectory σ2 over a modified media object corresponding to first trajectory σ1.
In some cases, training component 1530 computes a dynamic reward loss using dynamic reward loss function 1535 based on the preference predictor and instructor preference y:
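A hedged reconstruction of this loss, in the standard cross-entropy form used for learning reward functions from pairwise preferences, is:

\mathcal{L}(\phi) = -\sum_{(\sigma^1, \sigma^2, y)} \left[ y(1)\,\log P_\phi\left(\sigma^1 \succ \sigma^2\right) + y(2)\,\log P_\phi\left(\sigma^2 \succ \sigma^1\right) \right]

where the sum runs over the trajectory pairs and preferences collected in the dataset, and y(1) and y(2) denote the two components of the instructor preference y.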
In some cases, training component 1530 updates parameters of reward network 1525 based on the dynamic reward loss. Accordingly, by training the reward network based on a dataset obtained using instructor feedback, the media generation apparatus implements a reinforcement learning model that is positively rewarded, via the dynamic reward rϕ, by generating a modified media object that would appear to a human observer to be appropriate for an input context.
In an example, algorithm 1600 begins with the reinforcement learning model initializing a frequency of obtaining instructor feedback via a contextual media interface as described with reference to
Referring to . In some cases, with respect to lines 17-20, the reward network is trained. In some cases, in a parameter update event, the agent performs an action at and observes reward rt. In some cases, with respect to lines 28-30, after a mini-batch of transitions (e.g., st, at, st+1, rt) are sampled, parameters of the actor network, the critic network, the target actor network, and the target critic network are updated.
The description and drawings described herein represent example configurations and do not represent all the implementations within the scope of the claims. For example, the operations and steps may be rearranged, combined or otherwise modified. Also, structures and devices may be represented in the form of block diagrams to represent the relationship between components and avoid obscuring the described concepts. Similar components or features may have the same name but may have different reference numbers corresponding to different figures.
Some modifications to the disclosure may be readily apparent to those skilled in the art, and the principles defined herein may be applied to other variations without departing from the scope of the disclosure. Thus, the disclosure is not limited to the examples and designs described herein, but is to be accorded the broadest scope consistent with the principles and novel features disclosed herein.
The described methods may be implemented or performed by devices that include a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof. A general-purpose processor may be a microprocessor, a conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices (e.g., a combination of a DSP and a microprocessor, multiple microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration). Thus, the functions described herein may be implemented in hardware or software and may be executed by a processor, firmware, or any combination thereof. If implemented in software executed by a processor, the functions may be stored in the form of instructions or code on a computer-readable medium.
Computer-readable media includes both non-transitory computer storage media and communication media including any medium that facilitates transfer of code or data. A non-transitory storage medium may be any available medium that can be accessed by a computer. For example, non-transitory computer-readable media can comprise random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), compact disk (CD) or other optical disk storage, magnetic disk storage, or any other non-transitory medium for carrying or storing data or code.
Also, connecting components may be properly termed computer-readable media. For example, if code or data is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technology such as infrared, radio, or microwave signals, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technology are included in the definition of medium. Combinations of media are also included within the scope of computer-readable media.
In this disclosure and the following claims, the word “or” indicates an inclusive list such that, for example, the list of X, Y, or Z means X or Y or Z or XY or XZ or YZ or XYZ. Also the phrase “based on” is not used to represent a closed set of conditions. For example, a step that is described as “based on condition A” may be based on both condition A and condition B. In other words, the phrase “based on” shall be construed to mean “based at least in part on.” Also, the words “a” or “an” indicate “at least one.”