The present invention is directed to the control of order picking systems in a warehouse environment, and in particular to the use of algorithms to aid in controlling the order picking systems.
The control of an order picking system with a variety of workers or agents (e.g., humanoid pickers, robotic pickers, item carrying vehicles, conveyors, and other components of the order picking system) in a warehouse is a complex task. Conventional algorithms are used to pursue various objectives amid ever-increasing order fulfillment complexity, characterized by the scale of SKU variety, order composition ranging from a single SKU to multiple SKUs, and order demand that varies widely in magnitude and over time, coupled with the very demanding constraint of delivery deadlines. This complexity has been further compounded by recent labor shortages on one hand and an unforeseen dependence on e-commerce to support day-to-day activities on the other. Conventional algorithms require considerable effort to design, test, optimize, program, and implement, and are usually very specific to customer requirements. Such conventional algorithms do not adjust well to changing warehouse/order fulfillment operations or conditions. Furthermore, the optimality of the algorithms can vary across agents (e.g., the operational strategies for individual agents or workers may differ from the operational strategies for a facility as a whole).
Embodiments of the present invention provide methods and a system for a highly flexible solution to dynamically respond to changing warehouse operations and order conditions for both individual agents or workers and for changing facility objectives.
An order fulfillment control system for a warehouse in accordance with an embodiment of the present invention includes a controller, a memory module or data storage unit, and a training module. The controller controls mobile autonomous devices and/or fixed autonomous devices, and issues picking orders to pickers. The controller adaptively controls fulfillment activities in the warehouse via the use of tiered algorithms, such as hierarchically tiered algorithms, and records operational data corresponding to the fulfillment activities in the warehouse. The memory module holds the operational data. The training module retrains the algorithms using reinforcement learning techniques. The training module performs the reinforcement learning on the operational data to retrain and update the algorithms. Operational data may be used for offline reinforcement training, but online reinforcement training may also take place using facility simulation. The training module also retrains a macro algorithm according to a first set of priorities for optimal operation of the warehouse, and retrains a plurality of micro algorithms according to corresponding second sets of priorities for optimal operation of a particular location and/or activity within the warehouse. The controller adaptively controls the fulfillment activities using the updated algorithms. Such a controller and training module may, for example, comprise one or more computers or servers, such as operating in a network, comprising hardware and software, including one or more programs, such as cooperatively interoperating programs.
A method for controlling order fulfillment in a warehouse in accordance with an embodiment of the present invention includes controlling mobile autonomous devices and/or fixed autonomous devices, and issuing picking orders to pickers. The controlling includes adaptively controlling fulfillment activities in the warehouse via the use of hierarchically tiered algorithms. The method includes recording operational data corresponding to the fulfillment activities in the warehouse. The operational data is held in a memory module. The algorithms are retrained using reinforcement learning techniques. The retraining performs the reinforcement learning on the operational data to retrain and update the algorithms. Operational data may be used for offline reinforcement training, but online reinforcement training may also take place using facility simulation. The retraining includes retraining a macro algorithm according to a first set of priorities for optimal operation of the warehouse and retraining a plurality of micro algorithms according to corresponding second sets of priorities for optimal operation of a particular location and/or activity within the warehouse. The method also includes adaptively controlling the fulfillment activities using the updated algorithms.
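The following is a minimal sketch, not the claimed implementation, of how a training module might perform an offline reinforcement learning pass over recorded operational data. It assumes transitions have already been encoded as (state, action, reward, next_state) tuples with discretized states and actions; the names (OP_LOG, retrain) and the tabular Q-learning update are illustrative assumptions.

```python
# Sketch only: offline retraining from logged operational data, assuming
# discretized states/actions. Hyperparameters and log contents are hypothetical.
from collections import defaultdict

GAMMA, ALPHA = 0.95, 0.1           # assumed discount factor and learning rate
q_table = defaultdict(float)        # value estimates keyed by (state, action)

OP_LOG = [                          # hypothetical recorded operational data
    ("aisle_3_idle", "dispatch_amr", 1.0, "aisle_3_busy"),
    ("aisle_3_busy", "wait",        -0.2, "aisle_3_idle"),
]

def retrain(log, actions=("dispatch_amr", "wait")):
    """One offline Q-learning pass over recorded transitions."""
    for state, action, reward, next_state in log:
        best_next = max(q_table[(next_state, a)] for a in actions)
        td_target = reward + GAMMA * best_next
        q_table[(state, action)] += ALPHA * (td_target - q_table[(state, action)])

retrain(OP_LOG)                     # the updated q_table guides the next control cycle
```

The same update could be driven by simulated transitions for online training against a facility simulation.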
In an aspect of the present invention, the order fulfillment control system includes a warehouse simulator comprising one or more programs operating on one or more computers/servers, such as being executed on one or more processors, which performs an episodic warehouse simulation. The warehouse simulator produces simulated operational data based on simulated operations. A digital twin may be used and configured in any of the embodiments to perform at least one warehouse simulation, where the digital twin produces simulated operational data based on simulated operations.
In another aspect of the present invention, the order fulfillment control system also includes a generative adversarial networks (GANs) module that synthesizes additional data from the operational data. The additional data is synthesized data mimicking the operational data.
In a further aspect of the present invention, the operational data is at least one of: operational data recorded during performance of operational tasks within the warehouse; simulation data configured to simulate warehouse operations; and synthetic data configured to mimic the operational data.
In yet another aspect of the present invention, the controller is configured to adaptively control the fulfillment activities in the warehouse using both the macro algorithm and at least one of the micro algorithms, wherein the controller is operable to use the macro algorithm to select a particular warehouse priority and then select at least one micro algorithm to execute a particular order fulfillment operation within the warehouse.
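A brief sketch, under assumed names and thresholds, of the two-stage selection described in this aspect: a macro algorithm picks a facility-level priority, and a micro algorithm is then dispatched to execute the corresponding local operation. The priority labels, threshold, and function names are illustrative, not part of the described system.

```python
# Illustrative only: macro algorithm selects a warehouse priority, then a micro
# algorithm executes a particular order fulfillment operation.

def macro_select_priority(facility_state):
    # e.g., favor outbound picking when the order backlog exceeds a threshold
    return "outbound_picking" if facility_state["order_backlog"] > 100 else "replenishment"

MICRO_ALGORITHMS = {
    "outbound_picking": lambda area_state: {"action": "assign_picker", "area": area_state["area"]},
    "replenishment":    lambda area_state: {"action": "restock",       "area": area_state["area"]},
}

facility_state = {"order_backlog": 150}
area_state = {"area": "zone_A"}

priority = macro_select_priority(facility_state)
command = MICRO_ALGORITHMS[priority](area_state)   # micro algorithm issues the local command
```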
In a further aspect of the present invention, the training module trains the macro algorithm separately from the micro algorithms.
These and other objects, advantages, purposes and features of the present invention will become apparent upon review of the following specification in conjunction with the drawings.
The present invention will now be described with reference to the accompanying figures, wherein numbered elements in the following written description correspond to like-numbered elements in the figures. Methods and systems of the present invention may provide for a highly flexible solution to dynamically respond to changing warehouse operations and order conditions for both individual agents or workers and for changing facility objectives. An exemplary warehouse management system includes an adaptable order fulfillment controller which includes machine learning functionality for training both a macro-agent (also referred to as an orchestrator) and a plurality of micro-agents. The training or algorithm tuning for the macro-agent includes a different set of priorities as compared to the training/algorithm tuning for the micro-agents. The macro-agent or orchestrator is trained to find optimal operational strategies for the warehouse facility while the micro-agents are separately trained according to their unique local tasks and operational requirements. The agent training (macro- or micro-) utilizes reinforcement learning that is performed on recorded operational data, operational data developed from warehouse simulators that produce simulation data, and synthesized data that is produced by generative adversarial networks (GANs) which synthesize additional data from the operational data. These three data sets are utilized during agent training to optimize their respective algorithms, with the macro-agent algorithm trained/tuned separately from the micro-agent algorithm(s).
Exemplary embodiments of the present invention provide for an AI-based procedure for the control of a macro-agent (orchestrator) and micro-agents in a warehouse environment based on algorithm tuning and training such that the macro-agent is trained to find optimal operational strategies while the micro-agents are separately trained according to local tasks and operational requirements. Such controls and training modules of the exemplary embodiments can be implemented with a variety of hardware and software making up one or more computer systems or servers, such as operating in a network, including one or more programs, such as cooperatively interoperating programs. For example, an exemplary embodiment can include hardware, such as one or more processors configured to read and execute software programs. Such programs (and any associated data) can be stored in and/or retrieved from one or more storage devices. The hardware can also include power supplies, network devices, communications devices, and input/output devices, such as devices for communicating with local and remote resources and/or other computer systems. Such embodiments can include one or more computer systems, and are optionally communicatively coupled to one or more additional computer systems that are local or remotely accessed. Certain computer components of the exemplary embodiments can be implemented with local resources and systems, remote or “cloud” based systems, or a combination of local and remote resources and systems. The software executed by the computer systems of the exemplary embodiments can include or access one or more algorithms for guiding or controlling the execution of computer implemented processes, e.g., within exemplary warehouse order fulfillment systems. As discussed herein, such algorithms define the order and coordination of process steps carried out by the exemplary embodiments. As also discussed herein, improvements and/or refinements to the algorithms will improve the operation of the process steps executed by the exemplary embodiments according to the updated algorithms.
As illustrated in
Even relatively simple workflows (algorithms) within a warehouse or facility (hereinafter “facility”) 200 require good orchestration between the different tasks performed by workers/agents and areas of the warehouse/facility to reach maximum efficiency. One of the biggest problems is that workers/agents and associated supervisors have only “local” visibility to the tasks to be performed (in their area) and not a view of the entire facility in real time. This can create problems for the optimum operation of the facility.
Modern order fulfillment (e-commerce, etc.) can rapidly change the operational conditions from one day to the next for the facility 200. E-commerce also comes with heightened consumer expectations and narrowing delivery time windows. Thus, there is a need to rapidly adjust system control to adapt to those changes. Training exposes the AI agent to variable conditions, so it is prepared to handle changing conditions in real time. Examples of such changing objectives could be different order profiles (small or large orders), labor skills/performance, volume variability, product variety, delivery time constraints, etc. Conventional cost controls developed for managing volume predictions and allocating space, equipment, and labor resources have been shown to lack robustness when applied to the new challenges of fulfilling ever-increasing product volume and product variety demands. Flexibility (in order fulfillment and warehouse/facility management) is being embraced to mitigate the rising operational costs of complex fulfillment and to sustain profits.
In a typical fulfillment facility 200, there are two exemplary types of tasks: inbound and outbound (see
The typical fulfillment facility will need to manage the following basic order fulfillment operations (see
For example, in an exemplary retail facility, operations need to handle high peak ecommerce demands and high store fulfillment demands simultaneously. Conventionally, the management of both ecommerce demands and brick-and-mortar store fulfillment demands is handled within a single-focus fixed automation system. Typically, each demand is situated in a separate facility, or if the ecommerce demands are low enough, in an “omni channel” facility that serves the ecommerce channel and the brick-and-mortar channel. However, these traditional fixed automation solutions are not capable of easily adjusting to significant demand spikes or volume shifts in either channel within the omni channel facility operation.
Furthermore, even relatively simple workflows (algorithms) within a facility 200 require good orchestration between the different tasks and areas to reach maximum efficiency. One of the biggest problems is that operators and supervisors in the areas have only “local” visibility to the tasks that are performed (in their area) and do not have a view of the entire facility 200 in real-time. In other words, they are unaware of the state of operations in other areas of the facility 200. This can create problems for the optimum operation of the facility 200. This disconnect between individual local operations within the facility 200 and the operation of the entire facility 200 can include:
To help in coordinating operations in the facility 200, conventional control systems, unable to constantly monitor the facility 200, make use of “waves” to coordinate each of the areas and ensure availability of storage space and resources in the downstream areas. An individual wave is a small plan of tasks and resources. Each area (e.g., 252, 254, 256, 258) of the facility 200 will finish a wave before starting the next wave. This serves as a checkpoint in a broader plan (for the facility 200 as a whole). The problem with waves is that they do not use resources efficiently because there is ramp-up time at the beginning and cool-down time at the end of each wave. Full efficiency is only achieved in the middle of the wave. To help with wave transitions, staging buffers (conveyor or floor space) between the different areas are used to store inventory or order boxes coming from one area to another.
If the system controller 301 could monitor the different areas and resources in near real-time and release work incrementally as downstream resources free up, the use of waves could be eliminated. This is the main principle of “waveless” systems. However, such control systems 301 require more intelligent algorithms which have been tuned to the operational conditions of the warehouse/facility 200 (order profile, storage capacity, labor and machine performance, etc.). These algorithms will work well if the operational conditions do not change much. However, once those operational conditions begin to change, the algorithms will need to be adjusted/tuned to the new conditions. Each area of the facility 200 needs to run efficiently, but also needs to be aware of the downstream areas so as to not overflow or starve them. The amount of data to monitor and the number of parameters to tune could be too much for a human operator to tune accurately in a complex environment.
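A minimal sketch of the “waveless” release principle just described: work is released only when monitored downstream capacity is available, rather than in fixed waves. The area names, capacities, and function names are assumptions for illustration, not part of the described controller.

```python
# Sketch only: incremental ("waveless") work release based on monitored
# downstream capacity. Thresholds and area names are hypothetical.

DOWNSTREAM_CAPACITY = {"packing": 20, "shipping": 35}   # free slots, updated in near real-time
pending_orders = ["order_%d" % i for i in range(100)]

def release_work(pending, capacity, area="packing"):
    """Release as many orders as the downstream area can currently absorb."""
    n = min(len(pending), capacity[area])
    released, remaining = pending[:n], pending[n:]
    capacity[area] -= n
    return released, remaining

released, pending_orders = release_work(pending_orders, DOWNSTREAM_CAPACITY)
# 'released' would be handed to pickers/AMRs; the loop repeats as capacity frees up.
```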
Modern order fulfillment (e-commerce, etc.) can rapidly change the operational conditions from one day to the next, or even from one hour to the next, in addition to seasonal changes, and there needs to be a way to rapidly tune the control system 301 to adapt to those changes. Examples of the changes include different order profiles (small or large, single- or multi-unit orders), labor skills/performance, delivery time constraints, etc. This is where artificial intelligence (AI) and machine learning techniques can be used to react and adapt the controlling algorithms quickly to changes and to keep the facility 200 running at a peak or optimal performance.
Flexible fulfillment includes the following aspects: operational flexibility and operational scalability. Operational flexibility refers to the ability of a system to change or adapt based on new conditions within an operation. Operational scalability is defined by the ease and speed with which the system can scale. Demands for e-commerce can vary weekly, monthly, and for periodic (annual) peaks. Flexibility in order fulfillment allows a facility 200 to operate adaptively depending on current needs, such as adapting for peak ecommerce periods; heavy brick-and-mortar replenishment; non-peak ecommerce; weekly, monthly, and promotional peaks; and direct-to-consumer activities. An exemplary facility 200 implementing flexible order fulfillment needs to balance fixed automation resources and mobile automation resources (see
An exemplary warehouse/facility 200 includes a combination of both fixed and mobile automation, as well as intelligent software (and its architecture) that binds fixed automation resources and mobile automation resources into a flexible fulfillment solution for the facility 200. As discussed herein, a key component of flexible fulfillment is finding and maintaining a dynamic balance between the fixed and mobile automation assets. A warehouse management system (WMS) supports the basic functions of receiving, put-away, storing, counting, picking, packing, and shipping goods. Extended WMS capabilities are value-added capabilities that supplement core functions, such as labor management, slotting, yard management, and dock scheduling. A Warehouse Control Solution (WCS) is a real-time, integrated control solution that manages the flow of items, cartons, and pallets as they travel on many types of automated equipment, such as conveyors, sorters, ASRS, pick-to-light, carousels, print-and-apply, merges, and de-casing lines. A Warehouse Execution Solution (WES) is a newer breed of solution compared to a WMS or WCS. It is a focused version of a WMS with controls functionality. A WES is encroaching on WMS territory for tasks related to wave management, light task management, inventory management (single channel), picking, and shipping.
An exemplary controller 301 of a flexible order fulfillment management system 300, using reinforcement learning, is configured to control different types of agents (the macro-agent or orchestrator 302 and a plurality of micro-agents 304) in the warehouse facility 200 and to optimize various objectives of the warehouse facility 200. The controller 301 adapts to varying operating conditions and makes ongoing control, communication, and coordination decisions amongst the various systems and resources engaged in supporting the order fulfillment objective (see
In step 126, the controller 301 runs the operations in the warehouse facility 200 by communicating commands to the downstream execution systems and operator HMIs (see
As illustrated in
As discussed herein, the simulated warehouse includes two types of workers or agents configured to perform distinct tasks and with particular capabilities (e.g., AMRs configured as pickers or carriers). AMRs are sequentially assigned orders. For each order, an AMR has to collect specific items in given quantities. Once all items are collected, the AMR has to move to a specific location to deliver and complete the order. Upon completion, the AMR is assigned a new order (as long as there are still outstanding, unassigned orders remaining).
The exemplary pickers are configured to move across the same locations as the AMRs and are needed to load items onto the AMRs. For a picker to load an item onto an AMR, both workers have to be located at the location of that particular item. As also discussed herein, the picker may be either a robotic picker or a humanoid picker.
The warehouse simulator is also compatible with real customer data to create simulations of real-world warehouse systems.
As discussed herein, reinforcement learning (RL) is a type of artificial intelligence aiming at learning effective behavior in an interactive, sequential environment based on feedback and guided trial-and-error (such guided or intelligent trial-and-error is to be distinguished from mere blind or random trial-and-error). In contrast to other types of machine learning, RL usually has no access to any previously generated datasets and iteratively learns from collected experience in the environment (e.g., the operational data, the simulation data, and the synthesized data).
At each point in time, a learning agent is provided a description of the current state of the environment. The agent takes an action within this environment, and after this interaction, observes a new state of the environment. The agent receives a positive reward to promote desired behaviors, or a negative reward to deter undesired behaviors. This selection of an action, and an evaluation of the result is repeated for a plurality of possible decisions for a particular decision point.
The learning paradigm of RL has been found to be very effective in interactive control tasks. The agent is defined as the decision-making system, which maps the environment state to a set of actions for each agent (robotic and humanoid pickers and autonomous mobile robots (AMRs)). The agent would be informed about the location of various items, other agents, and possibly the orders of the other agents. Based on such information, the agent selects an action and subsequently receives the newly reached state of the environment as well as a positive or negative numerical reward feedback. Agents are given a positive reward for good actions (such as completing an order or picking a single item) and a negative reward for bad actions (e.g., waiting too long). Such agents receive rewards according to the cumulative effect of their actions over time, as opposed to the reward for a single good or bad action.
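The interaction loop just described can be sketched as follows. The toy environment, action set, and reward values are assumptions used only to illustrate the state-action-reward cycle and the accumulation of (discounted) reward over time; a learned policy would replace the random action choice.

```python
# Sketch of the RL interaction loop: observe state, act, receive a positive or
# negative reward, and accumulate discounted reward over time. Toy environment only.
import random

def toy_warehouse_env(state, action):
    """Hypothetical environment: +1 for picking an item, -0.1 for waiting."""
    if action == "pick":
        return "item_picked", 1.0
    return state, -0.1

GAMMA = 0.99
state, total_return, discount = "at_shelf", 0.0, 1.0
for t in range(10):
    action = random.choice(["pick", "wait"])       # a learned policy would go here
    state, reward = toy_warehouse_env(state, action)
    total_return += discount * reward              # cumulative effect of actions over time
    discount *= GAMMA
```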
In the exemplary warehouse system 200, there are often dozens of workers (AMRs and pickers) moving through the environment to collect items and deliver orders. In this case, solutions are needed which enable effective learning of coordination strategies across all workers. Collectively learning such a strategy quickly becomes infeasible as the overall joint-action space grows exponentially with the number of workers. In order to learn strategies for sufficiently large problems, it is necessary to decompose the decision space for independent workers. However, such distribution of learning requires these independent entities to learn to coordinate their strategies for effective cooperation. Multi-agent reinforcement learning (MARL) is designed to tackle these challenges and enable multiple independent agents to learn cooperating strategies effectively. In exemplary MARL approaches, each worker is represented by an individually learning agent.
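The exponential growth of the joint action space can be made concrete with a small worked example; the per-worker action count below is an assumption for illustration. Factoring the decision space into independent learners keeps the number of per-decision choices linear in the number of workers rather than exponential.

```python
# Worked example: joint action space grows as |A|**N with N workers, whereas a
# factored, per-agent decomposition scales as N * |A|. Numbers are illustrative.
actions_per_worker = 6     # e.g., move in four directions, pick, wait
for n_workers in (5, 20, 50):
    joint = actions_per_worker ** n_workers
    factored = n_workers * actions_per_worker
    print(n_workers, "workers:", joint, "joint actions vs", factored, "factored")
```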
Thus, an objective of the exemplary training system is to train a reinforcement learning algorithm to determine the strategy for allocating AMR and picker movements to optimize the order throughput for a specific time frame. Allocating AMR and picker movements means that the algorithm is configured to decide where each AMR and picker should go next, at every point at which a decision can be made for their next location. Optimizing order throughput is defined as minimizing the time to complete all orders.
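The allocation decision can be sketched as follows. The greedy nearest-item rule stands in for the learned policy and is purely illustrative; the grid coordinates, worker names, and distance metric are assumptions.

```python
# Sketch only: at each decision point, choose the next location for every AMR and
# picker. A trained policy would replace this greedy nearest-item assignment.

def next_locations(workers, outstanding_items, distance):
    """Assign each idle worker the closest outstanding item location."""
    assignment = {}
    for worker, loc in workers.items():
        if outstanding_items:
            target = min(outstanding_items, key=lambda item: distance(loc, item))
            assignment[worker] = target
    return assignment

manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
workers = {"amr_1": (0, 0), "picker_1": (4, 2)}
items = [(1, 1), (5, 2)]
print(next_locations(workers, items, manhattan))
```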
The exemplary controller 301 of a fulfillment facility 200 makes use of a multi-agent reinforcement learning (MARL) system to address problems in the fulfillment facility 200 which are primarily characterized by robotics, distributed control, resource management, collaborative decision support, etc. The complexity of many tasks arising in such a setting makes them challenging to solve with software control learned from what has happened in the past (historical data). Application of MARL to the order fulfillment problem leverages the key idea that agents must discover a solution on their own by learning. Autonomous mobile robots (AMRs) 304 and fixed automation (automated trailer loading/unloading, multi-shuttle storage, conveyor, sorting, person-to-goods and goods-to-person, etc.), along with the requisite software (warehouse management system, warehouse control system, warehouse execution system), serve as sub-systems (micro-agents 304) within the exemplary order fulfillment facility 200. These sub-systems or micro-agents are optimized, orchestrated (by the macro-agent or orchestrator 302), and integrated into a control system 300.
As discussed herein, and with reference to
The multi-agent perspective of the exemplary order fulfillment facility 200 includes a number of agents (also considered “micro-agents” 304) representing decision points and attendant decision spaces specific to each software subsystem, collection of robots, fixed automation systems, resource/decision management systems, etc. Deep reinforcement learning via multi-agent training allows for a collaborative learning approach among such agents to take advantage of cooperative strategies that improve warehousing/facility key performance indicators (KPIs) (throughput, cycle time, labor utilization, etc.). These agents learn by interacting with the dynamic environment—in this case a fulfillment center/facility 200—whose present state is determined by previously taken actions and exogenous factors (see
Within restricted domains of warehouse/facility operation, micro-agents 304 may cooperate to jointly maximize rewards for a specific subsystem in an extension of MDP called a Markov game. While micro-agents 304 focus on decentralized actions to optimize specific subsystems, at a higher level of hierarchical control, a macro-agent, or “orchestrator” 302, can centrally direct cooperative orchestration across functional areas to drive optimal operating points of the global system (see
In conventional warehousing facilities, warehousing functional areas are often optimized independently and operate in parallel under the assumption that optimal behavior among individual functions collectively constitutes optimal behavior at the macro level. However, when functional areas are coupled due to the sharing of some limited resource pool, such as labor or inventory, they are better viewed as interacting subunits of a comprehensive system whose individual activities must be coordinated to achieve the best global result. In contrast to conventional approaches, the exemplary warehousing facility 200 utilizes a control system 300 that makes use of a hierarchical decomposition of the main warehousing functions as shown in
A convenient way of framing this problem in a computationally tractable way is to view each functional task, or “micro-action,” as independently learnable within the paradigm of reinforcement learning (RL), while “macro-actions,” performed by a macro-agent or orchestrator 302 that decides which functional task to pursue next, fall under a different hierarchy of learning which can be trained using similar or different methods to those used to train the micro-agents 304. This is in contrast to a flat, non-hierarchical framework, where agents participate in cooperative multi-agent RL (MARL) with a much larger action space. A hierarchical decomposition approach can be successful while requiring much lower computational resources than the flattened, non-hierarchical framework.
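A minimal sketch of the hierarchical decomposition just described: a macro policy (orchestrator) selects which functional task to pursue next, and a per-task micro policy then selects primitive actions within that task's much smaller action space. The task names, the backlog-based macro rule, and the random micro policies are assumptions standing in for trained policies.

```python
# Sketch only: hierarchical control with a macro policy choosing the next
# functional task and micro policies acting within each task.
import random

FUNCTIONAL_TASKS = ["receiving", "putaway", "picking", "packing"]

def macro_policy(facility_state):
    # a trained orchestrator would map facility state to a functional task here
    return max(FUNCTIONAL_TASKS, key=lambda t: facility_state["backlog"].get(t, 0))

MICRO_POLICIES = {
    task: (lambda state, _task=task: random.choice(["move", "handle", "wait"]))
    for task in FUNCTIONAL_TASKS
}

facility_state = {"backlog": {"picking": 40, "packing": 12}}
task = macro_policy(facility_state)            # macro-action: which task next
action = MICRO_POLICIES[task]({"area": task})  # micro-action inside that task
```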
Within each of the functional tasks of the decomposed problem, the state and action spaces should be small enough to be tractable as a separate MARL problem, for which several approaches are possible. For example, distributed RL approaches, such as IMPALA and SEED, and shared experience actor-critic (SEAC) can solve large-scale problems with many interacting agents. For improved scalability, several of the functional tasks may be combined as another hierarchical multi-agent problem, as in “cooperative HRL,” which further decomposes the problem's action space into options and primitives, such as going to locations, picking, and putting.
Another example of expanding the scope of decision making to encompass multiple functional domains in the warehouse/facility 200 is found in the relationship between order release and inventory allocation (see
When these separate logical functions are treated as micro-agents 304 communicatively interacting within a dynamic environment (the warehouse facility 200), new optimization opportunities arise. For example, inventory allocation may determine that a currently released order could be fulfilled by full case inventory, but that it may be beneficial to delay allocation until sufficient demand is accrued to fully consume the full case. If this holding action is communicated to order release, it could prompt expedited release of more work requiring the SKU of the held case. While it may be difficult to develop manual heuristics to determine how long to wait, and how fully consumed a case should be before it should be allocated based on the current state of work in the pipeline and current storage space utilization, these factors can easily be accommodated within an RL framework. These separate logical functions can be trained in concert to develop collaborative strategies that would not be easily possible without the use of learning-based algorithms.
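One way such a hold-versus-allocate decision could be expressed inside an RL framework is through a shaped reward; the sketch below is illustrative only, with assumed weights and field names rather than tuned values from the described system.

```python
# Illustrative reward sketch for the inventory allocation decision: reward
# holding a full case when accrued demand consumes most of it, and penalize
# storage time and tight space utilization. Weights are hypothetical.

def allocation_reward(case_units, demand_units, hold_time_hr, space_utilization):
    consumption = min(demand_units / case_units, 1.0)   # fraction of the case consumed
    reward = 2.0 * consumption                           # encourage fully consuming the case
    reward -= 0.05 * hold_time_hr                        # discourage holding too long
    reward -= 1.0 * max(space_utilization - 0.9, 0.0)    # penalize tight storage space
    return reward

print(allocation_reward(case_units=24, demand_units=20, hold_time_hr=6, space_utilization=0.85))
```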
The multi-agent reinforcement learning/hierarchical reinforcement learning (MARL/HRL) approach described above promises new heights of flexibility in fulfillment compared to current/conventional methods. Such an approach will:
Some obstacles to the commercial application of RL in real-world scenarios are the related problems of sample efficiency, exploration within reasonable timeframes, and performance guarantees. Model-free RL approaches, while extremely powerful at homing in on optimal strategies with well-considered exploration strategies and sufficient data of state space exploration, typically require a voluminous amount of such data for success. It is usually not feasible, especially in a commercial setting, to experiment in real-time using fulfillment center/facility assets due to the adverse effects of extensive exploratory actions taken over long periods of time. To solve this problem, the exemplary control system 300 combines the data generation strengths of simulation with real-world corrective data (i.e., production data collected during actual operations in the warehouse facility 200) for robust learning. As discussed herein, additional synthetic data is also produced by GANs trained on the production data.
The use of high-fidelity facility emulation or simulation tools—sometimes referred to as “digital twin” technology—is now commonplace when designing large-scale custom-tailored WES solutions. As these tools come to more closely mimic the environmental setup and workflows of the fulfillment centers/facilities they emulate, they become powerful tools for simulated experimentation of the sort required for a successful implementation of RL. When married with an RL training framework, simulation can be used for on-policy exploration and learning. On-policy RL methods, such as actor-critic variants, often offer superior training stability compared to off-policy methods such as DQN and other value-based methods. On the other hand, on-policy learning is sample inefficient since only data derived from the current policy can be used for policy updates. Once a policy update is performed, all previously recorded data must be discarded, and new data collected for the updated policy. Depending on the computational efficiency of the simulation being used, which in turn depends on the level of fidelity desired, it may be prohibitively computationally expensive to discard data, as is required for on-policy learning. In this case, it may be necessary to incorporate elements of off-policy training by using data stored in a replay buffer generated by old policies with an importance sampling correction applied during policy updates.
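The importance sampling correction mentioned above can be sketched as follows. This is a simplified stand-in for an actual actor-critic implementation: transitions generated by an older policy are reweighted by the ratio of new-policy to old-policy action probabilities before being reused for a policy update. The stored probabilities, clipping value, and buffer contents are assumptions.

```python
# Sketch only: reweighting replay-buffer transitions from old policies with an
# importance sampling correction for reuse in a policy update.

def reweighted_policy_gradient(replay_buffer, pi_new, clip=10.0):
    """Return per-transition update weights with importance sampling correction."""
    weights = []
    for state, action, advantage, pi_old_prob in replay_buffer:
        ratio = pi_new(state, action) / max(pi_old_prob, 1e-8)
        ratio = min(ratio, clip)                 # clip to limit variance of the correction
        weights.append(ratio * advantage)        # weight used when accumulating gradients
    return weights

pi_new = lambda s, a: 0.4                        # hypothetical current policy probability
buffer = [("s0", "pick", 1.2, 0.25), ("s1", "wait", -0.3, 0.5)]
print(reweighted_policy_gradient(buffer, pi_new))
```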
Another obstacle to the application of reinforcement learning methodologies in real-world problems is the simulation-to-reality gap problem, where subtle differences between the simulated training environment and the real-world environment lead to suboptimal performance. To help solve this problem, the exemplary control system utilizes batch RL methods to incorporate data collected from real-world operations to help close the simulation-to-reality gap. In addition to real-world operation data, as discussed herein, the exemplary control system 300 may also utilize generative adversarial network (GAN) techniques to synthesize data to provide augmentation to simulated and real data.
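A minimal sketch of GAN-based synthesis of additional operational data, assuming each operational record has been encoded upstream as a fixed-length numeric feature vector. The network sizes, training settings, and the random stand-in for real data are illustrative assumptions, not the described system's configuration.

```python
# Sketch only: a small GAN that learns to synthesize feature vectors mimicking
# encoded operational records. Real data is replaced by a random stand-in here.
import torch
import torch.nn as nn

FEATURES, NOISE = 16, 8
G = nn.Sequential(nn.Linear(NOISE, 64), nn.ReLU(), nn.Linear(64, FEATURES))
D = nn.Sequential(nn.Linear(FEATURES, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

real_batch = torch.randn(32, FEATURES)            # stand-in for encoded operational data

for step in range(100):
    # discriminator update: distinguish real records from generated records
    fake = G(torch.randn(32, NOISE)).detach()
    loss_d = bce(D(real_batch), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # generator update: produce records the discriminator accepts as real
    fake = G(torch.randn(32, NOISE))
    loss_g = bce(D(fake), torch.ones(32, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

synthetic_records = G(torch.randn(256, NOISE)).detach()   # augmentation data for training
```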
Accordingly, the exemplary control system 300 provides an adaptive, hierarchically sensitive control of micro-actions and macro-actions. The micro-actions are performed by micro-agents 304, which are guided by algorithms that are independently learnable within the exemplary reinforcement learning methods. The macro-actions are performed by the orchestrator 302, which decides which functional task to perform next, and whose training falls under a different hierarchy of algorithm training. The training of the macro-agent 302 may be performed using similar or different methods as those used to train the algorithms controlling the micro-agents 304.
Changes and modifications in the specifically described embodiments can be carried out without departing from the principles of the present invention which is intended to be limited only by the scope of the appended claims, as interpreted according to the principles of patent law including the doctrine of equivalents.
The present application claims priority of U.S. provisional application Ser. No. 63/393,056, filed Aug. 4, 2022, which is hereby incorporated by reference herein in its entirety.