The embodiments described herein relate to methods and apparatus for automated management of livestock and/or the production of bioproducts by the livestock, using machine learning, to produce customized products depending on an identified end use.
In the past decade, technological advancements in intensive livestock management techniques have expanded greatly. Farmers can economically collect data on many aspects of animal health and bioproduct (e.g., milk) parameters during various phases of life of an animal (e.g., during phases of a lactation cycle). Using this data, veterinarians can attempt to adjust the feed blends that the animals consume to alter animal health or the bioproduct. In some instances, input of other professionals (e.g., veterinarians) can be used to inform decisions (e.g., selection of feed). Conventionally, this gathering of data and decision making is done manually and is time consuming, uneconomical, and error prone. Accordingly, there exists a need to automate the management of livestock that produce bioproducts meeting the needs of multiple customers, providing farmers the flexibility to cost-effectively provide tailored bioproducts for their clients.
In some embodiments, a method includes receiving an indication of a target quality of a property associated with a bioproduct obtained from a managed livestock. The target quality can be associated with an identified end-use. The method further includes receiving an indication of a current health status of the managed livestock and generating a set of input vectors based on the target quality of the property. The method further includes providing the set of input vectors to a machine learning model to generate an output indicating a feed selection to be used to feed the managed livestock. The feed selection is such that, upon consumption, it increases a likelihood of meeting the target quality of the property. The method further includes administering a feed blend to the managed livestock, the feed blend including the feed selection.
In some embodiments, an apparatus includes a memory and a processor operatively coupled to the memory. The processor can be configured to receive an indication of a target quality of a property associated with a bioproduct obtained from a managed livestock, the target quality being associated with an identified end-use. The processor can be further configured to receive an indication of a health status of the managed livestock. The processor can be further configured to generate a set of input vectors based on the target quality of the property. The processor can be further configured to provide the set of input vectors to a machine learning model to generate an output indicating a feed selection to be used to feed the managed livestock. The feed selection can, upon consumption, increase a likelihood of meeting the target quality of the property.
In some embodiments, an apparatus includes a memory and a processor. The processor is configured to train a machine learning model to receive a target quality of a property associated with a bioproduct of a first managed livestock, receive inputs associated with a health status of the first managed livestock, and determine a temporal abstraction based on the target property and the inputs to be used to identify a feed selection. The feed selection is configured to increase a likelihood of achieving the target quality of the property associated with the bioproduct of the first managed livestock, the target quality being associated with an identified end-use. The processor is further configured to receive a target value of the property associated with the bioproduct produced by a second managed livestock. The processor is further configured to receive, at a first time, a first indication of the property associated with the bioproduct produced by the second managed livestock, and generate a set of feature vectors based on the target value of the property and the first indication of the property. The processor is further configured to provide the set of feature vectors to the machine learning model to generate, based on the temporal abstraction and the first indication of the property, a first output including a first feed selection. The first feed selection is configured to, upon consumption by the second managed livestock, increase a likelihood of achieving the target value of the property associated with the bioproduct of the second managed livestock based on the first indication of the property. The processor is further configured to receive, at a second time after the first time, a second indication of the property, and compare the second indication of the property with at least one of the first indication of the property or the target value of the property, to calculate a difference metric. The machine learning model is configured to adaptively update, based on the difference metric, the temporal abstraction to generate a second output including a second feed selection configured to, upon consumption by the second managed livestock, increase a likelihood of achieving the target value of the property associated with the bioproduct of the second managed livestock based on the second indication of the property.
Disclosed embodiments include a non-transitory processor-readable medium storing code representing instructions to be executed by a processor, the instructions including code to cause the processor to receive a target value of a property associated with a bioproduct obtained from a managed livestock. The target value can be associated with an identified end-use of the bioproduct. The instructions can further include code to cause the processor to receive, at a first time, a first indication of the property associated with the bioproduct obtained from the managed livestock, and generate a set of input vectors based on the target value of the property and the first indication of the property associated with the bioproduct. The instructions can further include code to cause the processor to provide the set of input vectors to a machine learning model associated with a set of hyperparameters to generate a first output indicating a first feed selection to be used to feed the managed livestock. The first feed selection can be configured to, upon consumption, increase a likelihood of achieving the target value associated with the bioproduct based on the first indication of the property associated with the bioproduct. The instructions can further include code to cause the processor to receive at a second time after the first time, a reward signal associated with a second indication of the property associated with the bioproduct. The instructions can further include code to cause the processor to automatically adjust at least one hyperparameter from the set of hyperparameters, in response to the reward signal. The machine learning model can be configured to generate a second output indicating a second feed selection to be used to feed the managed livestock, the second feed selection configured to, upon consumption, increase a likelihood of achieving the target value associated with the bioproduct based on the second indication of the property associated with the bioproduct.
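By way of a non-limiting illustration only, the hyperparameter adjustment summarized above can be pictured with the following minimal Python sketch; the class name FeedRecommender, the exploration-rate hyperparameter, and the reward computation are hypothetical assumptions introduced for illustration and are not part of the disclosed embodiments:

# Hypothetical sketch of the hyperparameter-adjustment loop summarized above.
# All names and numeric values are illustrative assumptions.

class FeedRecommender:
    """Toy stand-in for a machine learning model with a tunable hyperparameter."""

    def __init__(self, epsilon: float = 0.2):
        self.epsilon = epsilon  # exploration-rate hyperparameter

    def recommend(self, input_vector: list) -> str:
        # A real model would map the input vector to a feed selection;
        # here the gap between measurement and target picks a placeholder label.
        return "feed_blend_A" if sum(input_vector) < 0 else "feed_blend_B"

    def adjust_hyperparameters(self, reward_signal: float) -> None:
        # Explore more when the reward signal is poor, less when it is good.
        if reward_signal < 0:
            self.epsilon = min(1.0, self.epsilon * 1.1)
        else:
            self.epsilon = max(0.01, self.epsilon * 0.9)

def reward_signal(measured: float, target: float) -> float:
    """Higher (less negative) reward as the measured property nears the target."""
    return -abs(measured - target)

model = FeedRecommender()
first_selection = model.recommend([3.4 - 3.8])    # first indication vs. target value
model.adjust_hyperparameters(reward_signal(3.6, 3.8))
second_selection = model.recommend([3.6 - 3.8])   # second indication vs. target value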
The livestock management (LM) system 100 is configured to receive information from a set of compute devices 101-104 and, based on the information, implement an automatic livestock management process including evaluating procedural alternatives, making choices from the alternatives, and/or implementing rules. The choices, decisions, or rules can be associated with any suitable action or resource related to intensively managed livestock (e.g., animal selection for generating bioproduct, feed selection, medicine selection, bioproduct analysis, resource allocation, and/or the like). The livestock management system 100 can receive data related to health and/or bioproduct produced by a cohort of animals. In some instances, the LM system 100 can receive data related to a quality of bioproduct of interest to a specified customer. In some instances, the LM system can receive data related to costs of maintenance of a cohort of livestock and/or an efficiency associated with a yield of bioproduct from a cohort of livestock. Based on the received data, the LM system 100 can evaluate past and/or new protocols of livestock management including animal selection for generating bioproduct, feed selection, medicine selection, bioproduct analysis, resource allocation, and/or the like, according to an embodiment. The LM system 100 includes compute devices 101, 102, 103, and 104, connected to a livestock management device 105 (also referred to as “the device”) through a communications network 106, as illustrated in
In some embodiments, the communication network 106 (also referred to as “the network”) can be any suitable communications network for transferring data, operating over public and/or private networks. For example, the network 106 can include a private network, a Virtual Private Network (VPN), a Multiprotocol Label Switching (MPLS) circuit, the Internet, an intranet, a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a worldwide interoperability for microwave access network (WiMAX®), an optical fiber (or fiber optic)-based network, a Bluetooth® network, a virtual network, and/or any combination thereof. In some instances, the communication network 106 can be a wireless network such as, for example, a Wi-Fi or wireless local area network (“WLAN”), a wireless wide area network (“WWAN”), and/or a cellular network. In other instances, the communication network 106 can be a wired network such as, for example, an Ethernet network, a digital subscriber line (“DSL”) network, a broadband network, and/or a fiber-optic network. In some instances, the network can use Application Programming Interfaces (APIs) and/or data interchange formats (e.g., Representational State Transfer (REST), JavaScript Object Notation (JSON), Extensible Markup Language (XML), Simple Object Access Protocol (SOAP), and/or Java Message Service (JMS)). The communications sent via the network 106 can be encrypted or unencrypted. In some instances, the communication network 106 can include multiple networks or subnetworks operatively coupled to one another by, for example, network bridges, routers, switches, gateways and/or the like (not shown).
The compute devices 101, 102, 103, and 104, in the LM system 100 can each be any suitable hardware-based computing device and/or a multimedia device, such as, for example, a desktop compute device, a smartphone, a tablet, a wearable device, a laptop and/or the like.
The processor 211 can be, for example, a hardware-based integrated circuit (IC) or any other suitable processing device configured to run and/or execute a set of instructions or code. For example, the processor 211 can be a general-purpose processor, a central processing unit (CPU), an accelerated processing unit (APU), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a programmable logic array (PLA), a complex programmable logic device (CPLD), a programmable logic controller (PLC) and/or the like. The processor 211 can be operatively coupled to the memory 212 through a system bus (for example, address bus, data bus and/or control bus).
The processor 211 can be configured to collect, record, log, document, and/or journal data associated with health and/or yield of bioproduct of a cohort of managed livestock. In some instances, the compute device 201 can be associated with a farmer, a veterinarian, animal handling personnel, and/or the like who collect/log data associated with a health of animals or data associated with a bioproduct produced by the animals. In some instances, the compute device 201 can be associated with a customer interested in the purchase of bioproducts produced by a managed livestock. In some instances, the compute device 201 can be associated with an entity providing analytical services to analyze the contents of samples. For example, the compute device can be associated with an analytical service provider configured to analyze the contents of milk produced by a cohort of managed livestock.
The processor 211 can include a data collector 214. The processor 211 can optionally include a history manager 231 and an application 241. In some embodiments, the data collector 214, the history manager 231, and/or the application 241 can include a process, program, utility, or a part of a computer's operating system, in the form of code that can be stored in the memory 212 and executed by the processor 211.
In some embodiments, each of the data collector 214, the history manager 231, and/or the application 241 can be software stored in the memory 212 and executed by processor 211. For example, each of the above-mentioned portions of the processor 211 can be code to cause the processor 211 to execute the data collector 214, the history manager 231, and/or the software application 241. The code can be stored in the memory 212 and/or a hardware-based device such as, for example, an ASIC, an FPGA, a CPLD, a PLA, a PLC and/or the like. In other embodiments, each of the data collector 214, the history manager 231, and/or the application 241 can be hardware configured to perform the specific respective functions.
The data collector 214 can be configured to run as a background process and collect or log data related to cohorts of animals in a managed livestock. In some instances, the data can be logged by personnel via the application 241 in the compute device 201. In some instances, the data can be automatically logged by sensors associated with the compute device 201 (not shown in
The data collector 214 can monitor, collect and/or store data or information related to health status data, feed data, data related to yield, quantity, and/or quality of bioproducts produced, medical supplements data, data associated with targeted end-use, customer requirements of quantity, properties associated with a bioproduct, and/or target qualities of properties associated with bioproducts, and/or the like.
In some instances, the data collector 214 can store the information collected in any suitable form such as, for example, in the form of text-based narrative of events, tabulated sequence of events, data from sensors, and/or the like. In some instances, the data collector 214 can also analyze the data collected and store the results of the analysis in any suitable form such as, for example, in the form of event logs, or look-up tables, etc. The data collected by the data collector 214 and/or the results of analyses can be stored for any suitable period of time in the memory 212. In some instances, the data collector 214 can be further configured to send the collected and/or analyzed data, via the communicator 213, to a device that may be part of an LM system to which the compute device 201 is connected (e.g., the LM device 105 of the system 100 illustrated in
In some embodiments, the history manager 231 of the processor 211 can be configured to maintain logs or schedules associated with a history of handling or management of animals in a cohort of livestock, the quantity/quality of feed and/or medicinal supplement provided, the quantity/quality of bioproducts produced, the costs associated with the maintenance of the cohort of animals, and/or the like. The history manager 231 can also be configured to maintain a log of information related to the sequence of events (e.g., interventions provided to animals) and/or a concurrent set of data logged indicating health and/or production of the animals.
The memory 212 of the compute device 201 can be, for example, a random-access memory (RAM), a memory buffer, a hard drive, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), and/or the like. The memory 212 can be configured to store any data collected by the data collector 214, or data processed by the history manager 231, and/or the application 241. In some instances, the memory 212 can store, for example, one or more software programs and/or code that can include instructions to cause the processor 211 to perform one or more processes, functions, and/or the like (e.g., the data collector 214, the history manager 231 and/or the application 241). In some embodiments, the memory 212 can include extendable storage units that can be added and used incrementally. In some implementations, the memory 212 can be a portable memory (for example, a flash drive, a portable hard disk, and/or the like) that can be operatively coupled to the processor 211. In some instances, the memory can be remotely operatively coupled with the compute device. For example, a remote database device can serve as a memory and be operatively coupled to the compute device.
The communicator 213 can be a hardware device operatively coupled to the processor 211 and memory 212 and/or software stored in the memory 212 executed by the processor 211. The communicator 213 can be, for example, a network interface card (NIC), a Wi-Fi™ module, a Bluetooth® module and/or any other suitable wired and/or wireless communication device. Furthermore, the communicator 213 can include a switch, a router, a hub and/or any other network device. The communicator 213 can be configured to connect the compute device 201 to a communication network (such as the communication network 106 shown in
In some instances, the communicator 213 can facilitate receiving and/or transmitting data or files through a communication network (e.g., the communication network 106 in the LM system 100 of
Returning to
Similar to the communicator 213 within compute device 201 of
The memory 352 of the LM device 305 can be a random-access memory (RAM), a memory buffer, a hard drive, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), and/or the like. The device memory 352 can store, for example, one or more software modules and/or code that can include instructions to cause the device processor 351 to perform one or more processes, functions, and/or the like. In some implementations, the device memory 352 can be a portable memory (e.g., a flash drive, a portable hard disk, and/or the like) that can be operatively coupled to the device processor 351. In some instances, the device memory can be remotely operatively coupled with the device. For example, the device memory can be a remote database device operatively coupled to the device and its components and/or modules.
The processor 351 can be a hardware-based integrated circuit (IC) or any other suitable processing device configured to run and/or execute a set of instructions or code. For example, the processor 351 can be a general-purpose processor, a central processing unit (CPU), an accelerated processing unit (APU), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a programmable logic array (PLA), a complex programmable logic device (CPLD), a programmable logic controller (PLC) and/or the like. The processor 351 is operatively coupled to the memory 352 through a system bus (e.g., address bus, data bus and/or control bus). The processor 351 is operatively coupled with the communicator 353 through a suitable connection or device as described in further detail herein.
The processor 351 can be configured to include and/or execute several components, units and/or instructions that may be configured to perform several functions, as described in further detail herein. The components can be hardware-based components (e.g., an integrated circuit (IC) or any other suitable processing device configured to run and/or execute a set of instructions or code) or software-based components (executed by the processor 351), or a combination of the two. As illustrated in
The data aggregator 355 in the processor 351 can be configured to receive communications between the device 305 and compute devices connected to the device 305 through suitable communication networks (e.g., compute devices 101-104 connected to the device 105 via the communication network 106 in the system 100 in
The data aggregator 355 is further configured to receive data associated with history managers in the compute devices (e.g., history manager 231 on compute device 201 in
In some instances, the data aggregator 355 can be further configured to receive, analyze, and/or store communications from compute devices regarding any suitable information related to end use. The information received from a compute device can include, for example, a quantity/quality associated with bioproduct content (e.g., milk, eggs, honey, fiber, etc.) desired for a particular end use, one or more threshold values of one or more properties associated with quality of the bioproduct content, and/or the like. The data aggregator 355, in some instances, can also be configured to receive analytical reports based on analysis of bioproduct samples from a specified cohort of animals.
The data aggregator 355, in some instances, can also be configured to receive information from animal health experts such as veterinarians including reports on the current health status of specified animals in a managed livestock. In some instances, the information can include a recommendation of dietary feed and/or medicine supplements to be provided to the animals based on the analysis of the current health status of the animals. In some instances, the information can include a recommendation of feed and/or medicinal blend to be provided to animals based on a target property of a bioproduct to be achieved from the animals, and/or based on the current health status of the animals.
The processor 351 includes an agent manager 356 that can be configured to generate and/or manage one or more agents configured to interact in an environment and/or implement machine learning. An agent can refer to an autonomous entity that performs actions in an environment. An environment can be defined as a state/action space that an agent can perceive, act in, and receive a reward signal regarding the quality of its action in a cyclical manner (illustrated in
In an example implementation, agent-world interactions can include the following steps. An agent observes an input state. An action is determined by a decision-making function or policy (which can be implemented by an ML model such as the ML model 357). The action is performed. The agent receives a scalar reward or reinforcement from the environment in response to the action being performed. Information about the reward given for that state/action pair is recorded. The agent can be configured to learn based on the recorded history of state/action pairs and the associated rewards. Each state/action pair can be associated with a value using a value function under a specific policy. Value functions can be state-action pair functions that estimate how good a particular action can be at a given state, or what the return for that action is expected to be. In some implementations, the value of a state (s) under a policy (p) can be designated Vp(s). A value of taking an action (a) when at state (s) under the policy (p) can be designated Qp(s,a). The goal of the LM device 305 can then be to estimate these value functions for a particular policy. The estimated value functions can then be used to determine sequences of actions that can be chosen in an effective and/or accurate manner such that each action is chosen to provide an outcome that maximizes and/or increases the total reward possible after being at a given state.
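A minimal sketch of the observe/act/reward cycle and the running estimation of a state/action value described above is given below; the state labels, actions, reward values, and epsilon-greedy policy are illustrative assumptions only:

# Illustrative sketch of one agent-world interaction; all names are assumptions.
from collections import defaultdict
import random

q_values = defaultdict(float)     # running estimates of Qp(s, a)
visit_counts = defaultdict(int)

def policy(state, actions, epsilon=0.1):
    # Epsilon-greedy decision-making function over the estimated values.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_values[(state, a)])

def record(state, action, reward):
    # Incremental average of the reward observed for the state/action pair.
    key = (state, action)
    visit_counts[key] += 1
    q_values[key] += (reward - q_values[key]) / visit_counts[key]

# Observe a state, choose an action, receive a scalar reward, record it.
state = "low_fat_content"
actions = ["feed_blend_A", "feed_blend_B"]
action = policy(state, actions)
reward = 1.0 if action == "feed_blend_A" else 0.0   # stand-in reward signal
record(state, action, reward)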
As an example, the agent manager 356 can define a virtualized environment that includes the virtualized management of a specified cohort of virtualized animals of a managed livestock (e.g., goats). The virtualized environment can be developed using data aggregated by the data aggregator 355.
The managed livestock can be raised to produce a specified bioproduct (e.g., milk). The agent manager 356 can define agents that perform actions that simulate events in the real world that may impact the management of the cohort of animals of the managed livestock. For example, the agent manager 356 can define actions that can simulate providing a specified feed blend to a cohort of animals, providing a blend of medicinal supplements to the cohort of animals, measuring a health status associated with the cohort of animals, obtaining a production of a specified quantity and/or quality of a bioproduct (e.g., a volume of milk, a measured value of a protein content in milk, and/or the like), etc.
In some implementations, each agent can be associated with a state from a set of states that the agent can assume. For example, the agent can be monitoring specific values associated with a bioproduct, which can be obtained from laboratory results from testing samples of the bioproduct. The specific values can be associated with by-products obtained from samples of the bioproduct. Samples of bioproduct can be associated with concurrent or otherwise temporally related feed and/or medicinal treatments used. Samples can be obtained and analyzed intermittently to aid in the monitoring. For example, the agent can use and/or track one or more values associated with by-products that can be in proximity to a set of target values. The by-product values and/or the target values can be used to define a reward function of the agent. In one example, the agent can be configured to optimize feed recommendations for a cheese-producing customer who desires a target value of an optimal, desired, and/or sufficient value of average fat percentage in the milk that they purchase. In some implementations, the reward signal is computed based on analysis of a sample and a computation of a percent difference or a percent error of a value associated with the sample (for example, a content of a by-product such as fat content in milk, or the like) compared to a selected target value. The reward signal can be configured to have a higher value when the percent difference or percent error is smaller than a first threshold, that is, when the content of the by-product (e.g., fat) in the bioproduct (e.g., milk) is at or near the target value. The reward signal can be configured to have a lower value when the percent difference or percent error is greater than a second threshold, that is, when the content of the by-product (e.g., fat) in the bioproduct (e.g., milk) is farther away from the target value. Over time, the LM system can learn to select feeds and medicinal treatments that maximize and/or increase the livestock cohort's ability to produce bioproduct with the ideal and/or improved value of the by-product (e.g., fat content or fat percentage). In another example, the agent can be configured to optimize and/or improve feed recommendations for producing bioproduct, which can be milk, catered for a butter-producing customer that has optimal or desired protein and fat percentage needs based on which target values of protein and fat content can be set. The reward signal can be computed using a root-mean-square error of the protein and fat percentages in samples collected intermittently, compared against the respective target values of each. The reward signal can have a higher value when the milk is at or near the target values, for example, above a specified first threshold value for protein and above a specified second threshold value for fat. Conversely, the reward signal can have a lower value when the protein and fat percentages are farther away from the respective target values. Over time, the LM system can learn to select feeds and medicinal treatments that maximize and/or increase the livestock cohort's ability to produce bioproduct with the desired protein and fat percentages. Each agent can be configured to perform an action from a set of actions. The agent manager 356 can be configured to mediate an agent to perform an action, the result of which transitions the agent from a first state to a second state.
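The percent-error and threshold logic in the fat-content example above can be expressed, for instance, as the following short sketch; the threshold values and reward magnitudes are arbitrary assumptions chosen only to show the shape of such a reward function:

def reward_from_sample(measured_fat_pct, target_fat_pct,
                       near_threshold_pct=2.0, far_threshold_pct=10.0):
    """Reward derived from the percent error between a sample and its target.

    Threshold and reward values are illustrative assumptions only.
    """
    percent_error = abs(measured_fat_pct - target_fat_pct) / target_fat_pct * 100.0
    if percent_error <= near_threshold_pct:    # at or near the target value
        return 1.0
    if percent_error >= far_threshold_pct:     # farther away from the target value
        return -1.0
    return 0.0                                 # intermediate: neutral reward

# Example: a milk sample measuring 3.6% fat against a 4.0% fat target.
print(reward_from_sample(3.6, 4.0))   # percent error = 10.0, so the reward is -1.0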
In some instances, a transition of an agent from a first state to a second state can be associated with a reward. For example, an action of providing a dietary and/or medicinal supplement can result in a reward in the form of an increase in a protein content associated with milk produced by a cohort of animals of livestock. The actions of an agent can be directed towards achieving specified goals. An example goal can be maximizing rewards in an environment. For example, a goal can be defined to achieve a specified increase in a protein content associated with milk produced by a cohort of goats within a specified duration of time. The actions of agents can be defined based on observations of states of the environment obtained through data aggregated by the data aggregator 355 from compute devices or sources related to the environment (e.g., from sensors). In some instances, the actions of the agents can inform actions to be performed via actors (e.g., human or machine actors or actuators). In some instances, the agent manager 356 can generate and/or maintain several agents. The agents can be included in groups defined by specified goals. In some instances, the agent manager 356 can be configured to maintain a hierarchy of agents that includes agents defined to perform specified tasks and sub-agents under control of the agents.
In some instances, the agent manager 356 can mediate and/or control agents to be configured to learn from past actions to modify future behavior. In some implementations, the agent manager 356 can mediate and/or control agents to learn by implementing principles of reinforcement learning. For example, the agents can be directed to perform actions, receive indications of rewards, and associate the rewards with the performed actions. Such agents can then modify and/or retain specific actions based on the rewards that are associated with each action, to achieve a specified goal by a process directed to increasing the number of rewards received. In some instances, such agents can operate in what is initially an unknown environment and can become more knowledgeable and/or competent in acting in that environment with time and experience. In some implementations, agents can be configured to learn and/or use knowledge to modify actions to achieve specified goals.
In some embodiments, the agent manager 356 can configure the agents to learn to update or modify actions based on implementation of one or more machine learning models. In some embodiments, the agent manager 356 can configure the agents to learn to update or modify actions based on principles of reinforcement learning. In some such embodiments, the agents can be configured to update and/or modify actions based on a reinforcement learning algorithm implemented by the ML model 357, described in further detail herein.
In some implementations, the agent manager 356 can generate, based on data obtained from the data aggregator 355, a set of input vectors that can be provided to the ML model 357 to generate an output that determines an action of an agent. In some implementations, the agent manager 356 can generate input vectors based on inputs obtained by the data aggregator 355 including data received from compute devices and/or other sources associated with a managed livestock (e.g., sensors). In some implementations, the agent manager 356 can generate the input vectors based on a target quality of a property associated with the bioproduct. For example, the data aggregator 355 can receive data from a first compute device associated with a customer of a bioproduct (e.g., milk) from a farmer managing the livestock (e.g., goats), the customer being a manufacturer of cheese and cheese products. In some instances, the customer can provide a target quality (e.g., a targeted high level) of a property (e.g., protein content) of the bioproduct (e.g., milk), including an indication of a desired target quality of a property associated with a bioproduct produced by a cohort of animals of a managed livestock. For example, the indication can include a threshold volume of milk and/or a threshold level of protein content to be suitable for the end use of manufacturing cheese and cheese products. In some implementations, the agent manager 356 can receive the inputs obtained by the data aggregator 355, including the indication of the target quality of the property of the bioproduct and the current health status of the animals, and generate input vectors to be provided to the ML model 357 to generate an output.
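One way to picture the input-vector generation described above is the following sketch; the particular features (target protein content, current protein content, average weight, and days in lactation) are hypothetical assumptions rather than a prescribed encoding:

from dataclasses import dataclass

@dataclass
class HealthStatus:
    # Illustrative health-status fields; an embodiment may track many more.
    current_protein_pct: float
    average_weight_kg: float
    days_in_lactation: int

def build_input_vector(target_protein_pct, status):
    """Flatten the target quality and current health status into one input vector."""
    return [
        target_protein_pct,
        status.current_protein_pct,
        target_protein_pct - status.current_protein_pct,   # gap to close
        status.average_weight_kg,
        float(status.days_in_lactation),
    ]

# Example: a cheese customer targeting 3.8% protein in milk from the cohort.
input_vector = build_input_vector(3.8, HealthStatus(3.2, 62.5, 140))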
The ML model 357, according to some embodiments, can employ an ML algorithm to optimize a selection of schedules, feeds and/or medicines that can be used to produce bioproducts customized for different end-uses. In some instances, the ML model 357 can implement a reinforcement learning algorithm to determine actions that can be undertaken by agents in a virtualized environment to arrive at predictions of indications of a selection of feed blends, feed schedules, and/or medicines to increase a probability or likelihood of achieving a specified goal. The goal can be a specific target quality of a property of a bioproduct, for example a target quality of a bioproduct desired by a specific customer for a specific end use.
The ML model 357 can be configured such that it receives input vectors and generates an output based on the input vectors, the output including an indication of a feed blend, medicine, supplements, schedule, and/or a feed selection that can increase the likelihood of meeting the target quality of the property. In some instances, the ML model 357 can be configured to generate an output indicating a feed schedule or feed blend that puts the animals producing the bioproduct on a trajectory to achieve the desired target quality within a specific time period. In some implementations, the ML model 357 can be configured to generate an output indicating a schedule to be adopted, to meet a target quantity/quality of property of bioproduct by a specific time point. In some implementations, the ML model 357 can be configured to account for a duration that the animals have to be on a particular feed schedule in order to achieve the desired type of bioproduct quality and/or quantity. The ML model 357 can be implemented using any suitable model (e.g., a statistical model, a mathematical model, a neural network model, and/or the like). The ML model 357 can be configured to receive inputs and based on the inputs generate outputs.
In some implementations, the ML model 357 can receive inputs related to a current health status of a cohort of identified animals of a managed livestock (e.g., current health status of a selected group of goats) and agents can perform actions proposed by the agent manager 356 based on one or more outputs of a machine learning (ML) model such as the ML model 357. In some implementations, the ML model 357 can be configured to model and/or implement the environment, agents, and interactions between the agents and the environment. The ML model 357 can be configured to implement agents, their actions, and/or state transitions associated with the agents and actions. In some implementations, the ML model 357 can be configured to receive inputs based on information related to health and/or yield of animals in the managed livestock and use the inputs to implement rewards in response to agent actions. For example, the inputs can include an indication of a change in a quality of a property of bioproduct (e.g., an increase in protein content), or a change in health status of an animal (e.g., a decrease in weight of an animal). The ML model 357 can implement any suitable form of learning such as supervised learning, unsupervised learning and/or reinforcement learning. The ML model 357 can be implemented using any suitable modeling tools including statistical models, mathematical models, decision trees, random forests, neural networks, etc. In some embodiments, the ML model 357 can implement one or more learning algorithms. Some example learning algorithms that can be implemented by the ML model can include Markov Decision Processes (MDPs), Temporal Difference (TD) Learning, Advantage Actor-Critic (A2C), Asynchronous Advantage Actor-Critic (A3C), Deep Q Networks (DQNs), Deep Deterministic Policy Gradient (DDPG), Evolution Strategies (ES) and/or the like. The learning scheme implemented can be based on the specific application of the task. In some instances, the ML model 357 can implement Meta-Learning, Automated Machine Learning and/or Self-Learning systems based on the suitability to the task.
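As one concrete instance of the learning algorithms listed above, a tabular Q-learning (temporal-difference) update could be sketched as follows; the learning rate, discount factor, state labels, and actions are illustrative assumptions:

def q_learning_update(q, state, action, reward, next_state, next_actions,
                      alpha=0.1, gamma=0.95):
    """One tabular TD update: Q(s,a) += alpha * (r + gamma * max Q(s',a') - Q(s,a))."""
    best_next = max(q.get((next_state, a), 0.0) for a in next_actions)
    current = q.get((state, action), 0.0)
    q[(state, action)] = current + alpha * (reward + gamma * best_next - current)
    return q

# Example transition: feeding blend A moved the cohort's state toward the target.
q_table = {}
q_learning_update(q_table, "below_target_fat", "feed_blend_A",
                  reward=1.0, next_state="near_target_fat",
                  next_actions=["feed_blend_A", "feed_blend_B"])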
The ML model 357 can incorporate the occurrence of rewards and the associated inputs, outputs, agents, actions, states, and/or state transitions in the scheme of learning. The ML model 357 can be configured to implement learning rules or learning algorithms such that, upon receiving inputs indicating a desired goal or trajectory that is similar or related to a goal or trajectory that was achieved or attempted to be achieved in the past, the ML model 357 can use the history of events including inputs, outputs, agents, actions, state transitions, and/or rewards to devise an efficient strategy based on past knowledge to arrive at the solution more effectively.
While an ML model 357 is shown as included in the LM device 305, in some embodiments, the ML model can be omitted and the LM device 305 can implement a model-free reinforcement learning algorithm to implement agents and their actions.
In some implementations, the ML model 357 and/or the agent manager 356 can implement hierarchical learning (e.g., hierarchical reinforcement learning) using multiple agents undertaking multi-agent tasks to achieve a specified goal. For example, a task can be decomposed into sub-tasks and assigned to agents and/or sub-agents to be performed in a partially or completely independent and/or coordinated manner. In some implementations, the agents can be part of a hierarchy of agents and coordination skills among agents can be learned using joint actions at higher level(s) of the hierarchy.
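A minimal sketch of such a task decomposition is shown below; the two-level hierarchy and the task names are assumptions chosen only to illustrate a parent task invoking sub-tasks as if they were primitive actions:

# Illustrative two-level task hierarchy; all task names are assumptions.
class SubTask:
    def __init__(self, name, action):
        self.name = name
        self.action = action            # primitive action performed by a sub-agent

    def run(self):
        return self.action

class ParentTask:
    def __init__(self, name, subtasks):
        self.name = name
        self.subtasks = subtasks

    def run(self):
        # The parent task invokes its sub-tasks as if they were primitive actions.
        return [sub.run() for sub in self.subtasks]

meet_fat_target = ParentTask(
    "meet_fat_target",
    [SubTask("adjust_feed", "administer_feed_blend_A"),
     SubTask("verify_progress", "sample_and_analyze_milk")],
)
planned_actions = meet_fat_target.run()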
In some implementations, the ML model 357 and/or the agent manager 356 can implement temporal abstractions in learning and developing strategies to accomplish a task towards a specified goal. In some implementations, a temporal abstraction can be an encapsulation of a repeatable sequence of actions (e.g., primitive actions) for use within a domain of implementation, but learned from one set of tasks and applied to a different set of tasks. Temporal abstractions can be implemented using any suitable strategy including an options framework, bottleneck option learning, hierarchies of abstract machines and/or MaxQ methods.
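Within the options framework mentioned above, a temporal abstraction can be represented, roughly, as an initiation set, an internal policy, and a termination condition, as in the following sketch; the states, option, and toy world dynamics are hypothetical assumptions for illustration:

from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    """Rough options-framework representation of a temporal abstraction."""
    initiation_set: Set[str]            # states in which the option may start
    policy: Callable[[str], str]        # maps a state to a primitive action
    terminates: Callable[[str], bool]   # termination condition

def run_option(option, state, step):
    """Execute the option's internal policy until its termination condition is met."""
    while not option.terminates(state):
        state = step(state, option.policy(state))
    return state

# Illustrative option: keep administering a protein-rich blend until near the target.
raise_protein = Option(
    initiation_set={"below_target_protein"},
    policy=lambda s: "administer_protein_rich_blend",
    terminates=lambda s: s == "near_target_protein",
)

def toy_step(state, action):
    # Hypothetical world dynamics: the blend moves the cohort toward the target.
    return "near_target_protein" if action == "administer_protein_rich_blend" else state

final_state = run_option(raise_protein, "below_target_protein", toy_step)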
The processor 351 further includes a predictor 358 configured to receive outputs from the ML model 357 and, based on the outputs, make predictions that can be tested in the real world. For example, the predictor 358 can receive outputs of the ML model 357 and generate a prediction of achieving a specified target quality of a property of a bioproduct within a specified duration of time following the implementation of a feed schedule and/or a feed selection based on the outputs of the ML model 357. In some implementations, the predictor 358 can receive outputs of the ML model 357 and generate a prediction of a projected amount of time needed to administer the feed selection to the managed livestock for the managed livestock to meet a specified target quality of the property. In some implementations, the predictor 358 can receive outputs of the ML model 357 and generate a prediction of a projected amount of time that the managed livestock have to be fed using a recommended feed selection and/or feed schedule in a sustained manner or according to an indicated schedule for the managed livestock to meet a specified target quality of the property.
In some implementations, the predictor 358 can provide several predictions that can be used to choose a strategy to be implemented in the real world. In some implementations, the predictor 358 can be configured to recommend a feeding schedule or an animal care schedule while accounting for a duration of time that the animals will need to be under that schedule to achieve the desired goal. The scheduling needs can include a number of animals needed to produce a desired volume or quantity of bioproduct with the target quality of bioproduct for a customer's contract. In some instances, the output of the predictor 358 can be used to provide the farmer with an estimate of needs and costs to fulfill a customer's request. In some instances, the output of the predictor 358 can be used to determine profitability and quote estimation.
In use, the LM device 305 can receive inputs from one or more compute devices and/or remote sources using a data aggregator 355. The inputs can include information regarding health, handling, and/or feeding schedule of animals producing a bioproduct such as milk, information associated with a current yield (quantity/quality) of the bioproduct, indications of desired quantities/qualities in a bioproduct, etc. The LM device 305 can implement virtualized agents acting within a virtualized world or environment, using an agent manager 356 and/or an ML model 357. In some implementations, the environment can be defined in the form of a Markov decision process. For example, the environment can be modeled to include a set of environment and/or agent states (S), a set of actions (A) of the agent, and a probability of transition at a discrete time point (t) from a first state (S1) to a second state (S2), the transition being associated with an action (a).
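The Markov-decision-process formulation described above can be written compactly as a table of transition probabilities; the states, actions, and probabilities below are illustrative assumptions only:

# Illustrative MDP: transition_probs[(state, action)] is a distribution over next states.
transition_probs = {
    ("S1_below_target", "feed_blend_A"): {"S2_near_target": 0.7, "S1_below_target": 0.3},
    ("S1_below_target", "feed_blend_B"): {"S2_near_target": 0.2, "S1_below_target": 0.8},
    ("S2_near_target", "feed_blend_A"): {"S2_near_target": 0.9, "S1_below_target": 0.1},
    ("S2_near_target", "feed_blend_B"): {"S2_near_target": 0.5, "S1_below_target": 0.5},
}

def transition_probability(s1, a, s2):
    """P(s2 | s1, a) at a discrete time step t."""
    return transition_probs[(s1, a)].get(s2, 0.0)

# Example: probability of reaching the near-target state after administering blend A.
p = transition_probability("S1_below_target", "feed_blend_A", "S2_near_target")  # 0.7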
In some implementations, the agents and/or the world can be developed based on one or more inputs or modified by one or more user inputs. The LM device 305 can provide aggregated information to the ML model 357. In some embodiments, the agent(s) can be part of the ML model 357. In some embodiments, the ML model 357 can implement the environment in which the agent(s) are configured to act. Similarly stated, in some embodiments, the agent(s) can be the ML model 357 for the system, and in some embodiments, the ML model can be connected to the external environment and execute the recommendations in the environment. In some instances, the LM device 305 can receive an indication of a change in yield following an initiation of a feed schedule. The indication may include a positive change in the yield in the direction of a desired trajectory. In some instances, the LM device 305 can receive an indication of a recommendation of feed blend from a veterinarian. The recommendation can be closely aligned with a prior prediction or recommendation generated by the LM device 305. The LM device 305 can then provide the input associated with the positive change in the yield, and/or the indication of a recommendation from a veterinarian which is aligned with a recommendation of the LM device 305, in the form of a reward such that the ML model 357 can learn the positive association of a previously recommended strategy (e.g., feed blend, feed schedule, etc.) with external validation. Over time and/or over a course of implementation of the virtualized environment/agents, the LM device 305 can generate an output based on the information received. The output of the ML model 357 can be used by a predictor 358 to generate a prediction of an outcome or an event or a recommendation of an event to achieve a desired goal. For example, the output of the predictor 358 based on the output of the ML model 357 can include a recommendation of a feed blend or a feed schedule that a cohort of animals can be provided with for a specified period to achieve a higher likelihood of meeting a desired quality and/or quantity of the bioproduct obtained from the cohort of animals.
While the device 305 is described to have one each of a data aggregator, an agent manager, an ML model, and a predictor, in some embodiments, a device similar to the device 305 can be configured with several instances of the above mentioned units, components, and/or modules. For example, in some embodiments, the device may include several data aggregators associated with one or more compute devices or groups of compute devices. The device may include several agent managers generating and operating multiple agents as described in further detail herein. In some embodiments, the device may include several ML models and/or several predictors assigned to perform specified computations and/or predictions such as, for example, to predict a feed blend to most efficiently achieve a target property of a bioproduct, or to predict an estimated cost associated with a specified protocol of animal handling, to predict a quality (e.g., values associated with properties) of a bioproduct given a specified feed schedule and a given duration, etc. In some embodiments, one or more of the components including a data aggregator, an agent manager, an ML model, and a predictor can be omitted and/or combined with another component to perform related functions.
The LM device 405 can receive inputs from the compute device 401 indicating a target quality being a desired level of dry extract and fat content (e.g., higher than a first threshold value of dry extract and lower than a second threshold value of fat content) in milk to produce cheese. The LM device 405 can receive input from the compute device 402 indicating a target quality being a desired level of protein and dry extract content (e.g., higher than a first threshold value of protein content and higher than a second threshold value of dry extract content) in milk for yogurts. The LM device 405 can receive input from the compute device 403 indicating a target quality being a desired level of fat content (e.g., higher than a first threshold value of fat content) in milk for milk products. The LM device 405 can receive any number of inputs. For example, the LM device 405 can receive additional inputs (not shown in
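The per-customer target qualities described above could be represented, for example, as a simple specification per end use; the property names and threshold values below are hypothetical assumptions used only for illustration:

# Illustrative end-use target specifications; all threshold values are assumptions.
target_specs = {
    "cheese": {"dry_extract_pct": (">=", 12.5), "fat_pct": ("<=", 3.5)},
    "yogurt": {"protein_pct": (">=", 3.6), "dry_extract_pct": (">=", 12.0)},
    "milk_products": {"fat_pct": (">=", 4.0)},
}

def meets_spec(sample, spec):
    """Check whether a measured sample satisfies every threshold in a target spec."""
    ops = {">=": lambda value, threshold: value >= threshold,
           "<=": lambda value, threshold: value <= threshold}
    return all(ops[op](sample[prop], threshold) for prop, (op, threshold) in spec.items())

# Example: a milk sample evaluated against the cheese end-use specification.
sample = {"dry_extract_pct": 13.1, "fat_pct": 3.2, "protein_pct": 3.4}
print(meets_spec(sample, target_specs["cheese"]))   # True for these illustrative values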
The LM device 405 can send to and/or receive inputs from the compute device 404 associated with an animal health specialist (e.g., a veterinarian). In some implementations, the LM device 405 can send feeding data or other animal handling data (e.g., data received from the compute device 406 associated with the farmer) to the compute device 404. In some implementations, the LM device 405 can send an indication of a target quality of a property of bioproduct that is of interest (e.g., data received from the compute devices 401-403 associated with end-use customers). In some implementations, the LM device 405 can receive from the compute device 404 associated with an animal health specialist an indication of a recommendation of feed schedule and/or feed blend to be provided to the animals. In some implementations, the LM device 405 can receive information and/or a recommendation related to medicinal and/or dietary supplements to be provided to the animals to increase a likelihood of achieving a target quality. In some implementations, the LM device 405 can be configured to learn, over time, a pattern of information or recommendations and events associated with the information or recommendations provided by the compute device 404 associated with the animal health specialist, such that the LM device 405 can provide inputs in place of and/or in addition to the information or recommendations from the animal health specialist.
The LM system 400 can send to and/or receive inputs from the compute device 406 associated with a farmer. The LM system 400 can receive from the compute device 406 an indication of a health status and current schedule of maintenance of animals. The LM device 405 can provide, based on computations carried out and/or based on inputs received from other compute devices (e.g., devices 401, 402, 403, 404, 406) and/or sources (not shown), a recommendation of feed, feed blend, and/or dietary/medicinal supplement to be provided to the animals to achieve a specific target goal. In some instances, a medicine and/or a dietary supplement can be included in a feed blend or be a part of a feed schedule. In some instances, the LM system 400 can recommend aspects of animal health other than feeding. For example, the LM system 400 can recommend a schedule of animal handling including a schedule for exercise, a schedule for sleep, a schedule for light cycle, a schedule for temperature, a schedule for any other suitable activity or state, and/or the like. In some implementations, the LM device 405 can send a feeding schedule and/or other animal handling schedule (e.g., data received from the compute device 406 associated with the farmer) to the compute device 406. In some implementations, the LM device 405 can send an indication of an estimated quality of a property of bioproduct that may be obtained at a specified period of time if the animals were maintained in a particular regimen of feed schedule and/or dietary/medicinal supplement schedule. In some implementations, the LM device 405 can send an indication of an estimated cost associated with achieving a target quality of a property of bioproduct that may be obtained at a specified period of time if the animals were maintained in a particular regimen of feed schedule and/or dietary/medicinal supplement schedule.
As described previously, an LM system can be configured to receive information related to animal handling and/or feed schedules of animals in managed livestock that produce bioproducts, receive inputs related to a target quality of bioproduct, and generate outputs including recommendation of animal handling and/or feed schedules that can be adopted to increase a likelihood of achieving the target quality of bioproduct. In some implementations, the interactions between the components of the LM system including compute devices and LM device, or between virtualized agents and environments can be configured to be automatically carried out.
The LM system 500 includes a virtualized agent and a virtualized environment or world that the agent can act in using actions that impact a state of the world. The world can be associated with a set of states and the agent can be associated with a set of potential actions that can impact the state of the world. The world and/or a change in state of the world in turn can impact the agent in the form of an observation of a reward that can be implemented by the LM system 500. The LM system 500 can be configured such that the interactions between the world and the agent via actions and/or observations of rewards within the LM system 500 can be triggered and/or executed automatically. For example, an LM device within the LM system 500 that executes the interactions between the world and the agent can be configured to automatically receive inputs from sources or compute devices, and based on the inputs automatically trigger agent actions, state transitions in the world, and/or implementations of reward.
At 671, the method 600 includes receiving an indication of a target quality of a property associated with a bioproduct obtained from a managed livestock, the target quality being associated with an identified end-use. In some instances, the bioproduct can be milk and the managed livestock can be intensively managed cohorts of goats. An example of such a system is shown in the illustration in
At 672, the method 600 includes receiving an indication of a current health status of the managed livestock. The indication of current health can be received from animal handling personnel or alternatively from animal health specialists with access to information related to a current health status of the cohort of animals. In some instances, the indication of current health can be received from one or more sensors associated with animal handling. The health status can include details regarding well-being, age, weight, growth, production of bioproduct, quantity/quality of yield of bioproduct, etc.
At 673, the method includes generating a set of input vectors based on the target quality of the property. For example, the target quality can include a threshold level of fat content of milk. At 674, the method includes providing the set of input vectors to a machine learning model to generate an output indicating a feed selection to be used to feed the managed livestock, the feed selection configured to, upon consumption, increase a likelihood of meeting the target quality of the property. The LM device can generate input vectors based on the target quality to be provided to an ML model to be implemented as a target in a virtualized world including virtualized agents capable of virtualized actions. The ML model can implement the world and agents acting in discrete time steps to induce discrete state changes that may result in specific rewards associated with specific actions of agents. In some implementations, the ML model can draw rewards from prior learning or experience (e.g., learning based on data obtained from past virtualizations, from inputs received from compute devices associated with animal handling personnel and/or animal health specialists, etc.). The LM device can implement the world and the agents such that the agents act to maximize a cumulative reward. The scheme of cumulative rewards can be organized such that the LM device is configured to pursue conditions or states of the virtualized world that increase the likelihood of the world arriving at a state that includes a production of the target quality of bioproduct. The LM device can generate outputs and/or predictions indicating a feed selection that is recommended for feeding the cohort of animals to increase a likelihood of meeting the target quality of the property.
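The cumulative reward that the virtualized agents act to maximize can be made concrete with a short sketch of a discounted return over discrete time steps; the discount factor and reward sequence below are illustrative assumptions:

def discounted_return(rewards, gamma=0.95):
    """Cumulative discounted reward G = sum over t of gamma**t * r_t for one episode."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: rewards observed over four discrete time steps of a simulated feed schedule.
print(discounted_return([0.0, 0.2, 0.5, 1.0]))   # approximately 1.499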
At 675, the method 600 includes administering a feed blend to the managed livestock, the feed blend including the feed selection. The LM device can provide a feeding schedule of a specific feed blend including the feed selection that can be adopted to increase the likelihood of achieving the target quality.
In some instances, an LM system can be used to guide the assignment of animals in a managed livestock to groups defined by the intended end-use of the bioproduct that will be produced. In some implementations, the assignment of animals to groups can be based on target goals or target qualities of properties associated with the bioproduct. In some instances, an output of an LM system can indicate how many animals are to be assigned to each group to meet a set of customer or end-use demands. An example outcome is illustrated in the plot 880 in
In some embodiments, the disclosed LM systems and/or methods can include implementation of cognitive learning in the learning of agent-world interactions. In some implementations, an LM system can be implemented based on a hierarchical cognitive architecture as described herein, and/or using a hierarchical learning algorithm and/or method by an LM device (e.g., LM device 105 and/or 305) or a compute device (e.g., compute device 101-103, and/or 201), as described herein. A hierarchical reinforcement learning algorithm and/or method can be configured to decompose or break up a reinforcement learning problem or task into a hierarchy of sub-problems or sub-tasks. For example, higher-level parent tasks in the hierarchy can invoke lower-level child tasks as if they were primitive actions. Some or all of the sub-problems or sub-tasks can in turn be reinforcement learning problems. In some instances, an LM system, as described herein, can include an agent having one or more of many capabilities and/or processes including, for example: Temporal Abstraction, Repertoire Learning, Emotion Based Reasoning, Goal Learning, Attention Learning, Action Affordances, Model Auto-Tuning, Adaptive Lookahead, Imagination with Synthetic State Generation, Multi-Objective Learning, Working Memory System, and/or the like. In some embodiments, one or more of the above listed capabilities and/or processes can be implemented as follows.
(i) Repertoire Learning—Options learning can create and/or define non-hierarchical behavior sequences. By implementing repertoire learning, hierarchical sequences of options can be built that can allow and/or include increasingly complicated agent behaviors.
(ii) Emotion Based Reasoning—Emotions in biological organisms can play a significant role in strategy selection and in the reduction of state-spaces, improving the quality of decisions. Emotions can be implemented to impact agent decisions. Such an implementation can be configured to contribute to strategy selection by an agent and/or a reduction of state-spaces such that decisions made by the agent can be of improved quality.
(iii) Goal Learning—Goal learning can be a part of the hierarchical learning algorithm. Goal learning can be configured to support the decision-making process by selecting sub-goals for the agent. Such a scheme can be used by sub-models to select actions and features that may be relevant to their respective function.
(iv) Attention Learning—Attention learning can be included as a part of the implementation of hierarchical learning and can be responsible for selecting the features that are important to the agent performing its task.
(v) Action Affordances—Similar to Attention learning, affordances can provide the agent with a selection of actions that the agent can perform within a context. A model implementing action affordances can reduce the agent's error in action execution.
(vi) RL Model Auto-Tuning—This feature can be used to support the agent in operating in diverse contexts by adapting to changing contexts via auto-tuning.
(vii) Adaptive Lookahead—Using a self-attention mechanism that uses prior experience to control current actions/behavior, the adaptive lookahead can automate the agent's search through a state space depending on the agent's emotive state and/or knowledge of the environment. Adaptive lookahead can reduce the agent's computational needs by targeting search to higher-value and better-understood state spaces.
(viii) Imagination with Synthetic State Generation—Synthetic state generation can facilitate agent learning through the creation of candidate options that can be reused within an environment without the agent having to experience the trajectory first-hand. Additionally, synthetic or imagined trajectories including synthetic states can allow the agent to improve its attentional skills by testing different strategies for using masks, such as attention masks.
(ix) Multi-Objective Learning—Many real-world problems can possess multiple and possibly conflicting reward signals that can vary from task to task. In some implementations, the agent can use a self-directed model to select different reward signals to be used within a specific context and sub-goal.
(x) Working Memory System—The Working Memory System (WMS) can be configured to maintain active memory sequences and candidate behaviors for execution by the agent. Controlled by the executive model (described in further detail herein), the WMS facilitates adaptive behavior by supporting planning, behavior composition, and reward assignment.
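As referenced above, the following minimal sketch illustrates the general idea of a hierarchical decomposition in which a parent task invokes lower-level child tasks (options) as if they were primitive actions. The option names, termination conditions, and toy environment are hypothetical and are provided for illustration only.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Option:
        name: str
        policy: Callable[[int], str]       # child-task policy: state -> primitive action
        terminate: Callable[[int], bool]   # termination condition for the child task

    def run_option(option: Option, state: int, env_step, max_steps: int = 10) -> int:
        """Execute a child task until it terminates, returning the resulting state."""
        for _ in range(max_steps):
            if option.terminate(state):
                break
            state, _ = env_step(state, option.policy(state))
        return state

    def parent_policy(state: int, options, env_step) -> int:
        """Parent task: select a child task (trivially, by state parity) and invoke it
        as if it were a primitive action."""
        chosen = options[state % len(options)]
        return run_option(chosen, state, env_step)

    def toy_env_step(state: int, action: str):
        """Hypothetical environment: 'advance' increases the state, anything else holds it."""
        return (state + 1 if action == "advance" else state), 0.0

    options = [
        Option("raise_energy_density", lambda s: "advance", lambda s: s >= 5),
        Option("hold_current_ration",  lambda s: "hold",    lambda s: True),
    ]
    final_state = parent_policy(2, options, toy_env_step)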
In some embodiments, the one or more capabilities and/or processes listed herein can be used to build ML systems in an LM device (e.g., the ML model 357 or agent manager 356 or predictor 358 in the LM device 305) that can operate with 98% less training data, compared to other conventional systems using ML models, while realizing superior long-term performance.
In some embodiments, the systems and/or methods described herein can be implemented using quantum computing technology. In some embodiments, systems and/or methods can be used to implement, among other strategies, Temporal Abstraction, Hierarchical Learning, Synthetic State and Trajectory Generation (Imagination), and Adaptive Lookahead.
Temporal Abstraction is a concept in machine learning related to learning a generalization of sequential decision making. An LM system implementing a Temporal Abstraction System (TAS) can use any suitable strategy including an options framework, bottleneck option learning, hierarchies of abstract machines, and/or MaxQ methods. In some implementations, using the options framework, an LM system can provide a general-purpose solution to learning temporal abstractions and support an agent's ability to build reusable skills. The TAS can improve an agent's ability to successfully act in states that the agent has not previously experienced. As an example, an agent can receive a specific combination of inputs indicating a sequence of states and can make a prediction of a trajectory of states and/or actions that may be different from its previous experience but is effectively chosen based on implementing the TAS. For example, an agent operating in an LM system simulating a world involving the management of livestock can receive, at a first time, inputs related to a health status of a cohort of animals on a predefined feed. The agent can be configured to interact with the world such that the LM system can predict progress in health status and/or a yield of bioproduct, even if the prediction is different from the agent's past experience, based on implementing the TAS. The prediction can include a recommendation of feed selection or feed schedule to increase a likelihood of achieving a predicted result (e.g., health status/yield). Another example includes agents operating in financial trading models that can use the TAS to implement superior trading system logic.
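A minimal sketch of the classic options framework that a TAS can build on is shown below, in which each temporal abstraction carries an initiation set, an intra-option policy, and a termination condition. The feed-related states and option names are invented for illustration and do not represent the disclosed model.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class TemporalOption:
        name: str
        initiation: Callable[[str], bool]     # I: states in which the option may start
        policy: Callable[[str], str]          # intra-option action selection
        termination: Callable[[str], float]   # probability of terminating in a state

    def applicable(options, state):
        """Options whose initiation set covers a state, even one never visited before."""
        return [o for o in options if o.initiation(state)]

    options = [
        TemporalOption(
            name="boost_fat_content",
            initiation=lambda s: s.startswith("low_fat"),
            policy=lambda s: "high_energy_feed",
            termination=lambda s: 1.0 if s.startswith("target_fat") else 0.1,
        ),
        TemporalOption(
            name="maintain_yield",
            initiation=lambda s: True,
            policy=lambda s: "base_feed",
            termination=lambda s: 0.5,
        ),
    ]

    # A previously unseen state can still trigger a reusable skill via its initiation set.
    print([o.name for o in applicable(options, "low_fat_new_cohort")])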
The TAS can support generalization of agent behavior. The TAS can also support automatic model tuning, where an internal reinforcement learning model can be used to automatically adjust agent hyperparameters that affect learning, future reward discounting, and environment behaviors/interactions. For example, in some embodiments of an LM system, a set of parameters can be defined as hyperparameters. Some parameters involved in reinforcement learning include parameters used in the Q-value update, including a learning rate α, a discount factor γ that weights future rewards, a threshold value ε that balances exploration against exploitation, the actions available to agents to choose from based on exploratory/greedy behavior, a measure of risk associated with an unpredictable action or behavior that an agent can perform, a set of consequences and/or a time period of impact that a model can implement based on actions of an agent, and/or the like. One or more of these parameters can be implemented as hyperparameters that can be defined to be associated with a set of dependencies with respect to a model and/or an agent, such that a specified change in a hyperparameter can impact other parameters or hyperparameters and/or the performance of the model and/or the agent in a specified manner. In some instances, a specified change in a hyperparameter can, for example, shift an agent from a practiced behavior to an exploratory behavior. An agent and/or a model can learn a set of dependencies associated with hyperparameters such that a hyperparameter can be automatically tuned or modified in predefined degrees to alter agent behavior and/or model behavior. Temporal abstraction can be executed by the hyperparameter model to occur over a sequence of time intervals during which the agent interacts with the environment.
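For reference, the hyperparameters α and γ named above appear in the standard tabular Q-value update, reproduced below for illustration only; the LM system may use a different or modified update rule.

    Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

Here ε controls the probability of selecting a random (exploratory) action rather than the greedy action that maximizes Q(s_t, a).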
As an example, an LM system can be configured to generate a first feed selection or feed schedule selection based on one set of inputs and/or an indication of a first state received at a first time. The LM system can receive a reward signal at a second time after the first time, and the reward signal can be associated with a second set of inputs and/or an indication of a second state. The LM system can generate a second feed selection or feed schedule selection in response to receiving the reward signal. In some implementations, the LM system can be configured to, based on the reward signal, automatically adjust one or more hyperparameters and then generate the second feed selection or feed schedule selection using the adjusted hyperparameter(s), such that the adjustment leads to an improvement in the outcome (e.g., yield) associated with the second feed selection compared to the outcome associated with the first feed selection.
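One possible, purely illustrative way to realize the automatic adjustment described above is sketched below: a small tuning layer nudges the exploration threshold ε and learning rate α whenever a new reward signal arrives, shifting the agent between practiced and exploratory behavior. The thresholds and update rules here are assumptions, not the disclosed tuning logic.

    class HyperparameterTuner:
        def __init__(self, epsilon=0.2, alpha=0.1):
            self.epsilon = epsilon      # exploration threshold
            self.alpha = alpha          # learning rate
            self.prev_reward = None

        def update(self, reward):
            """If outcomes degrade, explore more and learn faster; if they improve, settle."""
            if self.prev_reward is not None:
                if reward < self.prev_reward:
                    self.epsilon = min(0.5, self.epsilon * 1.5)
                    self.alpha = min(0.5, self.alpha * 1.2)
                else:
                    self.epsilon = max(0.01, self.epsilon * 0.8)
                    self.alpha = max(0.01, self.alpha * 0.95)
            self.prev_reward = reward
            return self.epsilon, self.alpha

    tuner = HyperparameterTuner()
    for r in [0.2, 0.1, 0.4, 0.5]:      # reward signals received at successive times
        eps, alpha = tuner.update(r)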
In such an auto-tuning LM system, developers no longer have to iterate to find model configurations with good convergence. The model can support contextually adaptive hyperparameter values depending on how much the agent knows about the current context and the environment's changing reward signal. Working in concert, these mechanisms allow the agent to learn reusable, context-sensitive strategies, supporting adaptive behavior over time while enabling the agent to balance explorative/exploitative behaviors.
As described previously, embodiments of an LM system described herein can implement temporal abstraction in the virtualization of a world and/or agents to implement temporally extended courses of action, for example, to determine a recommended protocol of animal handling to meet demands on production of bioproducts based on end-use. Disclosed herein is a method to recursively build and optimize temporal abstractions, also referred to as options, and hierarchical Q-Learning states to facilitate learning and action planning of reinforcement-learning-based machine learning agents.
In some implementations, an LM system can build and define an entire library or dictionary of options that can be used and/or reused partially and/or fully at any suitable time in any suitable manner. Learning temporal abstractions, for example skills and hierarchical states that can be applied to learning, can enable an LM system to learn to respond to new stimuli in a sophisticated manner that can be comparable or competitive with human learning abilities. The disclosed method provides a general approach to automatically construct options and hierarchical states efficiently while controlling a rate of progress and/or growth of a model through the selection of salient features. When applied to reinforcement learning agents, the disclosed method efficiently and generally solves problems related to implementing actions over temporally extended courses and improves learning rate and the ability to interact in complex state/action spaces.
At 971, the method 900 includes training a machine learning model to receive a target quality of a property associated with a bioproduct of a first managed livestock, receive inputs associated with a health status of the first managed livestock, and determine a temporal abstraction based on the target property and the inputs to be used to identify a feed selection. The feed selection can be configured to increase a likelihood of achieving the target quality of the property associated with the bioproduct of the first managed livestock, the target quality being associated with an identified end-use. The temporal abstraction can include options, skills, hierarchical states, and/or hierarchical actions as described herein. The temporal abstractions allow the agent to execute reusable behaviors in environments that the agent has not previously experienced, improving the agent's ability to interact in real-world environments and learn about these environments.
At 972, the method 900 includes receiving a target value of the property associated with the bioproduct produced by a second managed livestock different from the first managed livestock.
At 973, the method 900 includes receiving, at a first time, a first indication of the property associated with the bioproduct produced by the second managed livestock. At 974, the method includes generating a set of feature vectors based on the target value of the property and the first indication of the property.
At 975, the method 900 includes providing the set of feature vectors to the machine learning model to generate, based on the temporal abstraction and the first indication of the property, a first output including a first feed selection configured to, upon consumption by the second managed livestock, increase a likelihood of achieving the target value of the property associated with the bioproduct of the second managed livestock based on the first indication of the property.
At 976, the method includes receiving, at a second time after the first time, a second indication of the property. At 977, the method 900 includes comparing the second indication of the property with at least one of the first indication of the property or the target value of the property to calculate a difference metric. The machine learning model can be configured to adaptively update, based on the difference metric, the temporal abstraction to generate a second output including a second feed selection. The second feed selection can be configured to, upon consumption by the second managed livestock, increase a likelihood of achieving the target value of the property associated with the bioproduct of the second managed livestock based on the second indication of the property.
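A hedged sketch of the comparison at 977 is shown below: a simple difference metric is computed from the second indication, the first indication, and the target value, and is used to decide whether the temporal abstraction should be adaptively updated. The metric, the tolerance, and the update_temporal_abstraction hook are hypothetical; the disclosure does not prescribe a specific formula.

    def difference_metric(second_indication, first_indication, target_value):
        """Relative shortfall of the latest measurement versus the target, plus the trend."""
        shortfall = (target_value - second_indication) / max(abs(target_value), 1e-9)
        trend = second_indication - first_indication
        return shortfall, trend

    def maybe_update_abstraction(model, shortfall, trend, tolerance=0.02):
        """Trigger an adaptive update when the target is still missed and progress stalls."""
        if shortfall > tolerance and trend <= 0:
            model.update_temporal_abstraction(shortfall)   # assumed model hook, for illustration
            return True
        return False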
Temporal abstraction can be implemented by generating and/or using options that include sequences of states and/or sequences of actions. The implementation of options can be focused on generating and adopting reusable action sequences that can be applied within known and unknown contexts of the world implemented by an LM system.
An example option 1085 is illustrated in the accompanying figure.
In some instances, LM systems described herein can implement hierarchical states in reinforcement learning that can play a role in improving agent learning rate and/or in the development of long-term action plans. In some instances, with an increase in complexity of a task (e.g., increase in number of alternative solutions, increase in dimensionality of variables to be considered, etc.), the trajectory to the solution can become intractable because the complexity of agent actions grows exponentially with the number of states in the system. In some implementations, the LM system can implement hierarchical states, which decrease the size of the state space associated with the LM system. This implementation of hierarchical states and the resulting decrease in state space can lead to an exponential decrease in agent learning time. Automatic learning of hierarchical states in conventional systems, however, can present challenges by restricting the size of models that can be used.
In some embodiments, an LM system can be configured to learn options and to generate and use hierarchical states effectively using a recursive execution of a process associated with a Bellman optimization method as described herein. The recursive process can be configured to converge on optimal or desired values over a period of time. The method can allow the agent to select improved and/or optimal or desired policies (e.g., actions resulting in state transitions) in known and unknown environments and to update their quality values over time. In some instances, the method can treat options and hierarchical states as functionally dependent at creation and can allow for the merging of existing options and hierarchical states to build new state and action compositions. Over time, as the agent explores the state space and performs numerous trajectories through the state/action space, the algorithm can generate new hierarchical states and compositions of hierarchical states.
To build a hierarchical state, the LM system can first identify two consecutive state/action transitions through the environment. The LM system can perform a sequence of verification steps, including verifying that (1) the identified state/action transitions have non-zero Qπ(s,a) values (also referred to herein as Q values), which can be values associated with a state/action pair under a predefined policy π, as defined previously, (2) the identified state/action sequence is non-cyclical, and (3) the transition sequence does not include a transition cycle from S0 to Sn.
Following the above steps, if the transitions are positively verified, the LM system can continue to the next step; if not, the LM system can return to identifying two new consecutive state/action transitions. If positively verified, the LM system can create and/or define a new hierarchical state S′, for example state S′0 as shown in the accompanying figure.
The LM system can extract state primitives and action primitives from standard and hierarchical state transitions. Based on the extracted information, the LM system can create and/or define a new hierarchical action from the state S0 in the sequence to the new hierarchical state S′ (e.g., action A′0) and add that hierarchical action to the set of hierarchical actions associated with state S0. The LM system can also create and/or define a new hierarchical action from S′ (e.g., action A′1 from state S′0) to an intermediary state (e.g., S2) or to a last state Sn in the sequence (e.g., S5).
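The verification and composition steps described above can be summarized by the following illustrative sketch, in which two consecutive state/action transitions are checked for non-zero Q values, non-cyclical structure, and the absence of an S0-to-Sn cycle before a hierarchical state S′ and hierarchical action A′ are composed. The data structures and naming are assumptions chosen for clarity.

    def verify_transition_pair(q_values, t1, t2):
        """t1, t2 are (state, action, next_state) tuples; q_values maps (state, action) -> Q."""
        (s0, a0, s1a), (s1b, a1, s2) = t1, t2
        if s1a != s1b:
            return False                                   # not consecutive transitions
        if q_values.get((s0, a0), 0.0) == 0.0 or q_values.get((s1b, a1), 0.0) == 0.0:
            return False                                   # check (1): non-zero Q values
        states = [s0, s1a, s2]
        if len(set(states)) != len(states):
            return False                                   # check (2): non-cyclical sequence
        if s2 == s0:
            return False                                   # check (3): no S0 -> Sn cycle
        return True

    def build_hierarchical_state(q_values, t1, t2):
        """If verified, define a new hierarchical state S' and hierarchical action A'."""
        if not verify_transition_pair(q_values, t1, t2):
            return None
        s0, a0, _ = t1
        _, a1, s_last = t2
        s_prime = ("H", s0, s_last)                        # composed hierarchical state
        a_prime = ("H_ACTION", a0, a1)                     # composed hierarchical action
        return s_prime, a_prime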
In some instances, an LM system can be configured to implement and/or learn to implement state deletion. In some instances, an LM system can consider combining multiple options to create and/or define a repertoire behavior, or a subset of an option action sequence that can include states previously generated by a temporal abstraction algorithm (also referred to herein as hierarchical states). The LM system can be configured to learn to merge two options to form a single option that builds hierarchical states from the two options. In some instances, the LM system can merge two options by selecting a set of hierarchical states and merging the action primitives to construct a new hierarchical state.
To generate an option, the LM system can initiate an induction cycle, in some implementations, to create and/or define a state name S′x (e.g., x = 1, 2, . . . , n) from action sequences extracted from hierarchical state algorithms. The LM system can identify an action A′x associated with the state S′x. The LM system can check that action A′x is not in a preexisting dictionary of options and that a sum of action Q values associated with the action sequence including A′x is above a threshold value of interest. If the verification steps are indicated to be true (i.e., A′x is not in the dictionary of options and the sum of action Q values associated with the action sequence including A′x is above a threshold value), the LM system can continue; if not, the LM system exits the induction cycle. If true, the LM system can create and/or define an option with the S0 state from the hierarchical state induction sequence as the initial initiation state or start state.
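The induction cycle described above can be sketched as follows, with the options dictionary, Q-value table, and interest threshold represented by simple Python structures chosen for illustration only; they are not the disclosed data structures.

    def induce_option(options_dict, q_values, action_sequence, start_state, threshold=1.0):
        """Add a candidate option if it is new and its summed Q values exceed the threshold."""
        key = tuple(action_sequence)
        if key in options_dict:
            return None                                     # already known: exit induction
        total_q = sum(q_values.get(sa, 0.0) for sa in action_sequence)
        if total_q <= threshold:
            return None                                     # not interesting enough: exit
        option = {"start_state": start_state, "actions": list(action_sequence), "value": total_q}
        options_dict[key] = option
        return option

    # Example with hypothetical (state, action) pairs extracted from a hierarchical state.
    options = {}
    q = {("S0", "A0"): 0.8, ("S1", "A1"): 0.6}
    induce_option(options, q, [("S0", "A0"), ("S1", "A1")], start_state="S0")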
A method to construct hierarchical states can be implemented using reinforcement learning. The method can be associated with agents and can use pairwise state/action transitions to recursively optimize and/or improve action values using the Bellman Optimality Principle. In some implementations, the method can use a Q-value threshold to determine if a new hierarchical state is to be added to the reinforcement model's options dictionary. In some implementations, the method can include generating hierarchical states in a recursive manner from other hierarchical states.
A method to construct options/skills can be implemented using reinforcement learning. The method can be associated with agents and can use pairwise state/action transitions to recursively optimize action values using the Bellman Optimality Principle. In some implementations, the method can use a state interest value to determine if a new option/skill is to be added to the model (e.g., a reinforcement model). In some implementations, the method can include generating hierarchical states associated with options/skills in a recursive manner from other hierarchical states.
In some implementations, the LM system can additionally support automatic merging of previously generated hierarchical states with new action trajectories or action sequences in a manner that can be consistent with an existing sequence of states/actions. This functionality can simplify the process of building and maintaining hierarchical states, regardless of the complexity of the environment, using a general and fully automatic algorithm. The disclosed LM systems and/or methods can thus reuse existing Q-Learning model insertion, update, and deletion mechanisms to manage hierarchical states. By using the model update mechanisms of Q-Learning, the selection of hierarchical states can aid convergence to optimal and/or improved values over time according to the Bellman optimality principle. In some such implementations, the LM system thus combines sample-efficient methods for the generation and merging of hierarchical states with mathematically mature methods to ensure that the quality of actions and options executed over time converges to optimal and/or improved values.
In some embodiments, the disclosed LM systems and/or methods can include implementation of cognitive or hierarchical learning in the learning of agent-world interactions. A Hierarchical Learning System (HLS) can include a learning algorithm that utilizes a recursively optimized collection of models (e.g., reinforcement learning models) to support different aspects of agent learning.
In some implementations, a model (e.g., the ML model 357 described previously) in an LM system can include multiple models that, in some instances, can be configured in a hierarchical organization. The LM system 1200 can include an agent/system architecture as shown in the accompanying figure.
The integrated or hierarchical learning model is also illustrated in the accompanying figure.
Using the hierarchical architecture of the cognitive model, the LM system can be configured to operate effectively even in new environments by automatically surveying the environment and automatically tuning hyperparameters based on results of agent interactions. The Executive Model of the Working Memory System (WMS) can provide memory and behavior replay management of the agent. Specifically, the WMS can orchestrate the internal/external generation of experience and replays to adaptively learn temporal abstractions and selection of potential behaviors for future execution. The cognitive model can thus provide a general purpose LM system for all state and action spaces that the agent possesses.
In some implementations, an LM system can operate by using a model to simulate an external world and an internal model to simulate an internal world or representation (e.g., an internal representation of an animal or a cohort of animals, etc.). The internal model can be associated with internal states that can be perceived, organized using a system of memory, and impacted via internal actions. The internal model can be configured to impact a world state value and in turn impact agent action/behavior.
In some embodiments, the LM systems described herein can implement the Working Memory System (WMS) such that the WMS functions similarly to a biological model and includes multiple subsystems that manage long-term behavior selection, planning, and skill learning. In some implementations, an LM system can be configured such that the agent can not only interact in the world but also conceive states, state transitions or trajectories, or actions that are not experienced by the agent in the world. Such states, trajectories, and actions conceived by agents can also be referred to as synthetic states, synthetic trajectories, and synthetic actions imagined by agents. As part of the WMS, a processor of an LM device of the LM system can implement a Synthetic State & Trajectory Generation System (SSTGS) configured to manage generation of states and transition behavior, supporting the agent's capability to conceive states/actions that are not experienced in the world (also referred to as the agent's capability to imagine).
Managed by the Executive Model, the agent can create and/or define synthetic trajectories to generate temporal abstractions that can be reused in the live environment. Derived from past actual experience, synthetic states and their transitions enable the agent to learn new sub-goals, attention, and affordances in an offline manner, for example when an environment has not been actually experienced by the agent. These behaviors, sub-goals, attentional selections, and affordances can serve as templates for future use and can improve agent performance.
To create and/or define a synthetic state (SS) (e.g., synthetic states 0, 2, and 3), a new set of features can be selected using the original state features as the source (e.g., state features associated with states S1, S2, S5, S6, and S7). Actions can be generated from a subset of the original state's actions. The executive model can then estimate transition Q-values based on the average Q-values of the original state. Thus, synthetic state generation is achieved through the re-evaluation of an instant state's attended state features and its action space. The Executive Model (EM) selects new features to attend to and creates and/or defines a new synthetic state with actions and reward values based on the source action value. The primary function of the system configured to generate synthetic states can be to build targeted temporal abstraction candidates for the agent to use in the future, accelerating agent learning of the environment through more effective use of the agent's current experience.
In addition to the creation and/or definition of synthetic states, a WMS can create and/or define synthetic trajectories based on the current model of the world. Through this process, the agent generates new temporal abstractions with estimated reward values. These skills are then tested in the real world and retained or discarded depending on the quality of the behavior. The creation and/or definition of targeted synthetic trajectories can conserve processing and memory use because this happens in an offline, low-priority process while the agent is executing an option in the world. Options allow the agent to execute preprogrammed behaviors, freeing the agent to allocate processing resources to planning and behavior generation through synthetic experience simulations.
In some embodiments, similar to the generation of synthetic states/state transition trajectories, a subset of the action space of the parent state can be selected. An LM system can estimate action Q-values and adjust the estimated values using an executive model, allowing the executive model to update the value function of various simulated synthetic trajectories. Synthetic experience (including synthetic states/state transitions) can be implemented as a temporal abstraction that is held in a volatile memory representation and trimmed from the agent's model over time. The trimming can be omitted when the agent encounters a portion of a synthetic trajectory or a portion of a synthetic state in a temporal abstraction in a non-synthetic context or in a simulation of a world. When the agent experiences a synthetic experience in a real simulation of the world, that synthetic experience can be made permanent and its value can be updated to match the actual return value in the real simulation or model.
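The synthetic-state mechanism described above can be sketched as follows: a candidate synthetic state is built by re-selecting attended features from a source state, sampling a subset of the source's actions, and estimating Q-values from the source's average Q-value; the candidate remains volatile until it is actually experienced. All names and data structures are assumptions introduced for illustration.

    import random

    def make_synthetic_state(source_features, source_actions, source_q, n_features=2, n_actions=2):
        """Build a candidate synthetic state from a source (parent) state."""
        features = random.sample(sorted(source_features), min(n_features, len(source_features)))
        actions = random.sample(sorted(source_actions), min(n_actions, len(source_actions)))
        avg_q = sum(source_q.values()) / max(len(source_q), 1)
        return {
            "features": features,
            "q_estimates": {a: avg_q for a in actions},   # estimated from source averages
            "volatile": True,                             # trimmed unless later experienced
        }

    def confirm_experienced(synthetic_state, action, actual_return):
        """When the agent actually encounters the state, persist it and use the real return."""
        synthetic_state["volatile"] = False
        synthetic_state["q_estimates"][action] = actual_return
        return synthetic_state

    # Hypothetical source state with attended features, actions, and Q-values.
    candidate = make_synthetic_state({"fat", "yield", "protein"},
                                     {"feed_a", "feed_b", "feed_c"},
                                     {"feed_a": 0.4, "feed_b": 0.6})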
In some embodiments, an LM system can be configured to implement a feature referred to as Adaptive Lookahead, which can be implemented as a part of the WMS. The Adaptive Lookahead System (ALS) can be an Executive Model (EM) controlled function that performs contextually relevant lookaheads from current or expected future states to guide behavior selection. Similar to Monte Carlo methods, the ALS can provide an agent the ability to optimize and/or improve its use of lookahead. This system balances internal simulation time and live behavior to reduce the agent's computational needs while providing improved action selection through experience search. Managed by the EM, the agent is configured to learn how to optimize this process, minimizing its computational load while improving reward gains over time.
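A purely illustrative sketch of an adaptive lookahead budget is shown below, in which the depth of Monte-Carlo-style rollouts is scaled by how well the agent is assumed to know the current context, so that search effort is concentrated on better-understood state spaces. The scaling rule and rollout procedure are assumptions, not the disclosed ALS.

    import random

    def lookahead_depth(familiarity, base_depth=2, max_depth=10):
        """Better-understood contexts receive a deeper lookahead budget."""
        return min(max_depth, base_depth + int(familiarity * (max_depth - base_depth)))

    def rollout_value(state, step_fn, depth, gamma=0.9):
        """Single random rollout of bounded depth; returns a discounted return estimate."""
        total, discount = 0.0, 1.0
        for _ in range(depth):
            state, reward = step_fn(state)
            total += discount * reward
            discount *= gamma
        return total

    def estimate(state, step_fn, familiarity, n_rollouts=16):
        depth = lookahead_depth(familiarity)
        return sum(rollout_value(state, step_fn, depth) for _ in range(n_rollouts)) / n_rollouts

    # Toy usage with a hypothetical stochastic step function.
    value = estimate(0, lambda s: (s + 1, random.random()), familiarity=0.7)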
While various embodiments have been described above, it should be understood that they have been presented by way of example only, and not limitation. Where methods and/or schematics described above indicate certain events and/or flow patterns occurring in certain order, the ordering of certain events and/or flow patterns may be modified. While the embodiments have been particularly shown and described, it will be understood that various changes in form and details may be made.
Although various embodiments have been described as having particular features and/or combinations of components, other embodiments are possible having a combination of any features and/or components from any of the embodiments discussed above.
Some embodiments described herein relate to a computer storage product with a non-transitory computer-readable medium (also can be referred to as a non-transitory processor-readable medium) having instructions or computer code thereon for performing various computer-implemented operations. The computer-readable medium (or processor-readable medium) is non-transitory in the sense that it does not include transitory propagating signals per se (e.g., a propagating electromagnetic wave carrying information on a transmission medium such as space or a cable). The media and computer code (also can be referred to as code) may be those designed and constructed for the specific purpose or purposes. Examples of non-transitory computer-readable media include, but are not limited to, magnetic storage media such as hard disks, floppy disks, and magnetic tape; optical storage media such as Compact Disc/Digital Video Discs (CD/DVDs), Compact Disc-Read Only Memories (CD-ROMs), and holographic devices; magneto-optical storage media such as optical disks; carrier wave signal processing modules; and hardware devices that are specially configured to store and execute program code, such as Application-Specific Integrated Circuits (ASICs), Programmable Logic Devices (PLDs), Read-Only Memory (ROM) and Random-Access Memory (RAM) devices. Other embodiments described herein relate to a computer program product, which can include, for example, the instructions and/or computer code discussed herein.
In this disclosure, references to items in the singular should be understood to include items in the plural, and vice versa, unless explicitly stated otherwise or clear from the context. Grammatical conjunctions are intended to express any and all disjunctive and conjunctive combinations of conjoined clauses, sentences, words, and the like, unless otherwise stated or clear from the context. Thus, the term “or” should generally be understood to mean “and/or” and so forth. The use of any and all examples, or exemplary language (“e.g.,” “such as,” “including,” or the like) provided herein, is intended merely to better illuminate the embodiments and does not pose a limitation on the scope of the embodiments or the claims.
Some embodiments and/or methods described herein can be performed by software (executed on hardware), hardware, or a combination thereof. Hardware modules may include, for example, a general-purpose processor, a field programmable gate array (FPGA), and/or an application specific integrated circuit (ASIC). Software modules (executed on hardware) can be expressed in a variety of software languages (e.g., computer code), including C, C++, Java™, Ruby, Visual Basic™, and/or other object-oriented, procedural, or other programming language and development tools. Examples of computer code include, but are not limited to, micro-code or micro-instructions, machine instructions, such as produced by a compiler, code used to produce a web service, and files containing higher-level instructions that are executed by a computer using an interpreter. For example, embodiments may be implemented using imperative programming languages (e.g., C, Fortran, etc.), functional programming languages (Haskell, Erlang, etc.), logical programming languages (e.g., Prolog), object-oriented programming languages (e.g., Java, C++, etc.) or other suitable programming languages and/or development tools. Additional examples of computer code include, but are not limited to, control signals, encrypted code, and compressed code.