Computing devices such as personal computers, laptop computers, tablet computers, cellular phones, and countless types of Internet-capable devices are increasingly prevalent in numerous aspects of modern life. The prevalence of Internet-capable devices makes it possible to connect people all over the world, and the demand for data connectivity via the Internet, cellular data networks, and other such networks is growing rapidly. In areas where conventional Internet connectivity-enabling infrastructure exists (e.g., urban or other relatively densely populated areas), people can make phone calls in an emergency, access weather forecasts (e.g., to plan crop planting or logistics for major events), and access educational resources. However, there are many areas of the world where data connectivity is unavailable, unreliable, and/or extremely costly due to difficulties in building and maintaining conventional infrastructure in those areas. In these areas, the growing demand for data connectivity is not being met, and such areas can benefit greatly from data connectivity delivered by fleets of aerial vehicles.
There are various types of aerial vehicles that may provide these types of connectivity services, floating or flying at different altitudes and in different regions of the atmosphere. In particular, in the stratosphere, where winds are stratified, wind currents are strong and quickly changing, and each layer of wind may vary in speed and direction, unmanned high altitude aerial vehicles may navigate from location to location by riding these stratospheric wind currents. However, many factors in addition to predicted weather forecasts are important in generating navigation policies for planning and/or controlling movement of such aerial vehicles. Conventional methods for generating flight controllers are unable to use large numbers of variable factors and objectives under uncertainty, such as imperfect weather forecasts, to optimize flight objectives. Typical flight controllers also have difficulty adapting to localized differences; for example, a single controller typically cannot be tuned to optimize a vehicle's flight over two or more regions (e.g., over a country, an island, ships, or bodies of water, or over a region near the equator versus another farther North, South, or closer to a pole) with different flight regulations, no-fly restrictions, and other environmental or regulatory differences.
Conventional flight navigation techniques have largely been unique to a particular type of vehicle and determined based on a combination of heuristics (i.e., rules) and conventional models. The particular set of heuristics and conventional models for determining flight paths and actions often will be tailored to a specific vehicle system with a particular set of characteristics, such as size, weight, materials, whether the vehicle is floating or self-propelled, or other characteristics. Thus, conventional techniques are unable to generate controllers for many different types of vehicles without substantial custom human design, and generally require a significant amount of human engineering and tailoring to achieve each defined objective, much less to achieve multiple objectives at the same time, such as reaching a target destination while optimizing for power consumption, avoiding unauthorized airspace, and reducing time and/or distance to the target destination.
Thus, improved systems and methods for generating controllers for lighter-than-air vehicles are desirable.
The present disclosure provides for a system and method for generating controllers for lighter-than-air vehicles, particularly controllers directed to achieving particular objectives. A computer-implemented method for generating an objective-directed controller for an aerial vehicle may include defining an action space for an objective-directed controller; providing a set of feature vectors and the action space as input to a simulation module, the set of feature vectors associated with a desired objective; training, by a learning module, the objective-directed controller according to a reward function correlated with the desired objective; evaluating the trained objective-directed controller; and storing the trained objective-directed controller. In some examples, the action space is defined with continuous directions. In some examples, the action space is defined with discretized directions. In some examples, the action space is defined with a set of actions comprising up, down and stay. In some examples, the action space is defined with a set of actions comprising lateral propulsion in a given direction at a given speed. In some examples, the action space is defined with power levels on and off. In some examples, the action space is defined with a plurality of power levels.
In some examples, the reward function is configured to generate a cumulative incursion score. In some examples, the reward function is configured to assess a large first-entry penalty. In some examples, the reward function is configured to reward progress made on an objective. In some examples, the reward function is configured to generate a fixed penalty incursion score. In some examples, the desired objective comprises navigation toward a target heading using an altitude control system. In some examples, the desired objective comprises navigation toward a target heading using an altitude control system and lateral propulsion system. In some examples, the desired objective comprises station seeking. In some examples, the desired objective comprises map following. In some examples, the desired objective comprises restricted zone avoidance. In some examples, the desired objective comprises storm avoidance. In some examples, the set of feature vectors includes a standard feature vector used for a plurality of objectives. In some examples, the set of feature vectors includes an objective-directed feature vector.
A distributed computing system may include a storage system configured to store objectives, feature vectors, and trained objective-directed controllers; and one or more processors configured to: define an action space for an objective-directed controller, provide a set of feature vectors and the action space as input to a simulation module, the set of feature vectors associated with a desired objective, train, by a learning module, the objective-directed controller according to a reward function correlated with the desired objective, evaluate the trained objective-directed controller, and store the trained objective-directed controller.
The figures depict various example embodiments of the present disclosure for purposes of illustration only. One of ordinary skill in the art will readily recognize from the following discussion that other example embodiments based on alternative structures and methods may be implemented without departing from the principles of this disclosure, and that such embodiments are encompassed within the scope of this disclosure.
The Figures and the following description describe certain embodiments by way of illustration only. One of ordinary skill in the art will readily recognize from the following description that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles described herein. Reference will now be made in detail to several embodiments, examples of which are illustrated in the accompanying figures.
The above and other needs are met by the disclosed methods, a non-transitory computer-readable storage medium storing executable code, and systems for navigating aerial vehicles in operation, as well as for generating flight policies for such aerial vehicle navigation using deep reinforcement learning.
Aspects of the present technology are advantageous for high altitude systems (i.e., systems capable of operating in the stratosphere, approximately at or above 7 kilometers above the earth's surface in some regions and at or above 20 kilometers above the earth's surface in other regions, or beyond in the exosphere or cosmic space), such as High Altitude Platforms (HAPs), High Altitude Long Endurance (HALE) aircraft, unmanned aerial vehicles (UAVs), including lighter-than-air vehicles (e.g., floating stratospheric balloons), propelled lighter-than-air vehicles (e.g., propelled floating stratospheric balloons), fixed-wing vehicles (e.g., drones, rigid kites), various types of satellites, and other high altitude aerial vehicles. In some examples, high altitude systems are configured to fly above an altitude reserved for commercial airline flights. One way to provide enhanced network access is through a network of aerial vehicles carrying Internet, cellular data, or other network capabilities. To maintain a network, each aerial vehicle in a fleet or network of aerial vehicles may travel to a particular location. In some embodiments, lighter-than-air aerial vehicles (i.e., propelled or not) may rely on rapidly changing and extreme (i.e., strong, high speed, and volatile) wind conditions to assist in navigation efforts to different locations. Other environmental and non-environmental factors may impact an aerial vehicle's flight plan or policy. In view of this, large scale simulations may be performed to evaluate operational characteristics or capabilities (e.g., power system availability, ambient temperatures, software and hardware versions implemented or accessible onboard, integrity of various components) and the life cycle of individual aerial vehicles or fleets. Such simulations may be used to manage the life cycle of an aerial vehicle to manage risk of failures and to optimize availability for service delivery.
This disclosure is directed to a deep reinforcement learning system (hereinafter “learning system”) for generating optimal flight policies to control aerial vehicles according to a desired goal (i.e., objective), along with methods for training learned flight policies that are improved and optimized for one or more objectives, and for deploying said learned flight policies in an aerial vehicle navigation system. As described in more detail below, the learning system comprises a simulation module (including one or more “Workers” or simulators), one or more replay buffers, a learning module comprising a deep reinforcement learning architecture designed to train flight policies, and one or more servers or repositories that store learned policies from the learning module.
In a training loop, the simulation module simulates an aerial vehicle's flight through a region of the atmosphere (e.g., stratosphere) according to a given policy (e.g., encoded into a neural network, for determining an action by the aerial vehicle in a given environment and aerial vehicle state). The simulation module generates a frame, represented by one or more feature vectors, for each time step, and feeds the frames of each simulation to one or more replay buffers. The replay buffers store the frames (e.g., in sequential order in a circular buffer or at random), and the learning module requests a set of frames from one or more replay buffers as inputs. An input may comprise a random sample of frames from a circular buffer, a prioritized sample of frames according to optimization criteria, or a set of frames with other characteristics.
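By way of non-limiting illustration only, the following is a minimal sketch of a circular replay buffer that stores simulation frames and returns a random sample for input to a learning module, consistent with the description above; the class, field, and parameter names (e.g., ReplayBuffer, Frame, capacity) are hypothetical and chosen for illustration, not part of any particular implementation described herein.

```python
import random
from collections import namedtuple

# A frame pairs a feature vector describing the vehicle/environment state at one
# time step with the action taken, the reward received, and the resulting state.
Frame = namedtuple("Frame", ["features", "action", "reward", "next_features", "done"])

class ReplayBuffer:
    """Circular buffer that stores frames and returns random samples of frames."""

    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self.frames = []
        self.cursor = 0  # next write position in the circular buffer

    def add(self, frame):
        if len(self.frames) < self.capacity:
            self.frames.append(frame)
        else:
            self.frames[self.cursor] = frame  # overwrite the oldest stored frame
        self.cursor = (self.cursor + 1) % self.capacity

    def sample(self, batch_size=32):
        # Random sampling disrupts the original time ordering of the frames.
        return random.sample(self.frames, min(batch_size, len(self.frames)))
```

Where a prioritized sample is desired, the call to random.sample could be replaced with a weighted draw according to the optimization criteria mentioned above.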
The training loop continues with the learning module processing the frames according to a deep reinforcement learning architecture to determine, in a given situation (i.e., a given vehicle state in a given environment), which action provides a larger or largest estimated reward. Actions to be taken by an aerial vehicle may include ascending, descending, maintaining altitude, and propelling itself in a direction, among others, and may be manifested as discrete actions such as up, down, or stay (i.e., maintain altitude). In some examples, the learning module may run one or more neural networks that output a value or other representation of an action, or a command associated with an action, and a magnitude associated with said action or command. The deep reinforcement learning architecture may be configured to run one or more variations of reinforcement learning, including value-based methods, distributional methods, and policy-based methods. Some examples of reinforcement learning techniques include, without limitation, Q-learning, double Q-learning, distributional Q-learning, categorical Q-learning, quantile regression Q-learning; policy gradient, actor-critic, soft actor-critic, and trust region policy optimization, among others. The reinforcement learning algorithm in the learning module is characterized by a reward function corresponding to the objective of the flight policy training. The learning module generates learned flight policies (e.g., encoded in neural networks), and scores them according to the reward function. The learned policies may be stored in a policy server, from which the simulation module can pull learned policies to run further simulations.
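For illustration only, the sketch below shows a one-step Q-learning update over feature vectors using a simple linear approximation of the per-action value (estimated reward); the learning module described herein may instead use deep neural networks and any of the other listed reinforcement learning variants, so this is a simplified, hypothetical example rather than the architecture itself.

```python
import numpy as np

ACTIONS = ("up", "down", "stay")  # example discrete action space

def q_values(weights, features):
    # One weight vector per action; Q(s, a) is linear in the state features.
    return weights @ features  # shape: (num_actions,)

def q_learning_update(weights, features, action, reward, next_features, done,
                      alpha=0.01, gamma=0.99):
    """One-step Q-learning update from a single sampled frame (s, a, r, s')."""
    a = ACTIONS.index(action)
    target = reward if done else reward + gamma * np.max(q_values(weights, next_features))
    td_error = target - q_values(weights, features)[a]
    weights = weights.copy()
    weights[a] += alpha * td_error * features  # move Q(s, a) toward the bootstrapped target
    return weights

# Example usage with a 4-dimensional feature vector (values are illustrative).
w = np.zeros((len(ACTIONS), 4))
w = q_learning_update(w, np.array([0.1, 0.5, -0.2, 1.0]), "up", 1.0,
                      np.array([0.2, 0.4, -0.1, 1.0]), False)
```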
A reward function is defined according to a desired objective. Example objectives include: flying within a predetermined radius of a target location; following a mapped trajectory; flying in a given direction; arriving at a location at a desired date and time; maximizing (or otherwise optimizing) the amount of time an aerial vehicle provides connection services to a given area; conserving energy or minimizing energy consumption during a time period of flight or in achieving any of the aforementioned objectives; and achieving any of the aforementioned objectives in coordination with other aerial vehicles (i.e., in the context of a fleet of aerial vehicles). In some examples, the systems described herein may be tuned with a reward function that optimizes for multiple objectives (e.g., any combination of the example objectives above), and may further account for other factors, such as minimizing wear on a vehicle, avoiding inclement weather, etc. A learned flight policy may be deemed high performing if it scores well according to the reward function, and threshold scores may be defined to determine whether a learned flight policy is high performing, low performing, operation-ready, or otherwise should be kept (i.e., stored for further use in simulations or operation) or discarded.
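As a hypothetical example of a reward function defined for one such objective (flying within a predetermined radius of a target location), the sketch below rewards each time step spent inside the radius and applies a small, capped penalty that grows with distance outside it; the radius and the penalty slope are illustrative assumptions.

```python
import math

def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    # Haversine distance between two latitude/longitude points, in kilometers.
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def station_seeking_reward(vehicle_lat, vehicle_lon, target_lat, target_lon,
                           station_radius_km=50.0):
    """Reward 1.0 for a time step spent inside the station radius; otherwise a
    penalty that grows gently with distance from the station, capped at -1.0."""
    d = great_circle_km(vehicle_lat, vehicle_lon, target_lat, target_lon)
    if d <= station_radius_km:
        return 1.0
    return max(-1.0, -(d - station_radius_km) / 1000.0)
```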
Multiple learning systems may be run in parallel as a meta-learning system, with varied parameters in each learning system. Parameters that may be varied may include: the reward function (e.g., objective being optimized, slope, additional factors, etc.), reward tuning or modification (e.g., period of time or times of day during which full reward is valid or awarded, penalties for various characteristics of simulation or resulting reward), the number of frames that a learning module requests per input, characteristics of said frames for input (e.g., random, prioritized, or other), the depth of the reinforcement learning architecture (e.g., number of layers, hidden or otherwise) in the learning module, number of objectives, types of objectives, number of available actions, types of available actions, length of time into the future that is being predicted, and more.
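The following sketch, offered only as an illustration, enumerates candidate configurations over a small, assumed parameter grid of the kinds of variations listed above; the parameter names and values are hypothetical.

```python
from itertools import product

# Illustrative parameter grid; each combination could configure one learning
# system within a meta-learning setup.
PARAMETER_GRID = {
    "frames_per_input": [16, 32, 64],
    "reward_slope": [0.5, 1.0, 2.0],        # how strictly the reward is scored
    "network_depth": [3, 5],                # number of layers in the learning module
    "sampling": ["random", "prioritized"],  # how frames are drawn from the buffer
}

def enumerate_learning_system_configs(grid=PARAMETER_GRID):
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(enumerate_learning_system_configs())
print(len(configs), "candidate learning-system configurations")
```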
A learning system also may store high performing learned flight policies (i.e., store neural networks encoded with high performing flight policies) in a policy repository for use by an operational navigation system to control movement of an aerial vehicle according to one or more desired objectives. These high performing learned flight policies may be used to determine actions for an aerial vehicle in a given situation. In an embodiment described herein, an aerial vehicle system may generate an input vector characterizing a state of the aerial vehicle, which may be provided to a learned flight policy that processes the input vector to output an action optimized for an objective of the learned flight policy. Such input vectors may be generated onboard an aerial vehicle in some examples, and in other examples, may be generated offboard (e.g., in a datacenter, which may or may not be integrated with a ground station or other cloud infrastructure, or the like). The action may be converted to a set of commands configured to cause the aerial vehicle to perform the action. In some examples, the input vector may include not only an aerial vehicle's physical and operational state (e.g., battery levels, location, pose, speed, weight, dimensions, software version, hardware version, among others), but also one or more of the following environmental inputs: sensor inputs (e.g., measuring temperature, pressure, humidity, precipitation, etc.), weather forecasts, map information, air traffic information, and date and time.
Using the systems and methods described herein, one can customize flight policies across vehicle types and also for particular environments (e.g., Peru versus Ecuador versus Kenya for the same vehicle, where useful) to improve performance with minimal human intervention. For example, a system can customize flight policies for particular countries, each with its own set of regulations and no-fly restrictions, as well as weather and wind forecasts specific to its region. Thus, these systems and methods for aerial vehicle navigation can generate controllers for many types of aerial vehicles with minimal custom human design. Also, for any given vehicle and environment, one can process more data, and more types of data, to generate improved controllers.
Using the learning systems described herein, an objective-directed controller may be trained. In some examples, one or more objective-directed controllers may be used in combination (e.g., turned on and off) to improve navigation of an LTA vehicle. An action space may be defined for each objective-directed controller with continuous (or nearly continuous) or discretized (i.e., a predetermined number of) directions in two or three (or more) dimensions. In some examples, directions for an action space may be based on an absolute directional framework (e.g., North, South, East, West and combinations thereof), relative to a target heading or location (e.g., a discretized set of directions relative to a target heading or a continuous set of directions relative to a target heading), or dynamically determined based on other factors. For example, in a dynamic directional action space, an objective-directed controller may cause a vehicle with lateral propulsion to turn and/or propel towards, away, or tangential to a target location (i.e., station), depending on an objective (e.g., station seeking versus map following) and relevant current and expected wind patterns. The action space also may be defined with respect to actions a given vehicle can take (e.g., up, down, stay, apply lateral propulsion in a given direction, for example, at a given speed and/or power level). The action space may further be defined with respect to power levels (e.g., on, off, continuum of power levels) being applied to an altitude control system and/or a propulsion system.
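By way of illustration, the sketch below represents two of the action space variants described above: a discretized altitude action set (up, down, stay) and a continuous action space combining an altitude command, a power level, and a lateral propulsion direction; the class names and exact value ranges are illustrative assumptions consistent with the example intervals given below.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

@dataclass(frozen=True)
class DiscreteActionSpace:
    """A predetermined set of actions, e.g., altitude commands."""
    actions: Sequence[str] = ("up", "down", "stay")

@dataclass(frozen=True)
class ContinuousActionSpace:
    """Continuous (or nearly continuous) controls, expressed as value ranges."""
    altitude_command: Tuple[float, float] = (-1.0, 1.0)        # -1 = down, 0 = stay, +1 = up
    power_level: Tuple[float, float] = (0.0, 1.0)              # 0 = off, 1 = full power on
    heading_radians: Tuple[float, float] = (-3.14159, 3.14159) # direction of lateral propulsion

ALTITUDE_ONLY = DiscreteActionSpace()
ALTITUDE_PLUS_PROPULSION = ContinuousActionSpace()
```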
In an example, actions defined in the action space may be on a spectrum or in intervals between two values (e.g., −1 to +1, wherein −1 may be down and +1 may be up with stay being 0, or wherein the two extremes represent other opposing actions), power levels may be on a spectrum or in intervals between two values (e.g., 0 to 1, wherein 0 is off and 1 is full power on), and directions may be on a spectrum or in intervals between two values (e.g., −π to π). A reward score may be generated when an objective-directed controller is trained on a set of values representing each of these aspects of the action space given a set of feature vectors, a state of the vehicle, and a reward function corresponding to the objective (e.g., argmax or other optimization function). In some examples, a reward or cost function may include a cumulative incursion score, wherein a penalty is applied for each incursion (i.e., each instance or amount of time or distance spent in an undesirable zone (e.g., a no-fly zone, storm, or the like) or outside of a desirable zone (e.g., a proximity of a station or mapped path)), a reward is applied otherwise (i.e., for time or distance spent outside of an undesirable zone or within a desirable zone), and a value is straightforwardly computed based on said penalties and rewards. In other examples, the reward or cost function may include a large first-entry penalty as well as a cumulative incursion score, wherein a larger penalty may be assessed for entering an undesirable zone (e.g., if the vehicle was not in a restricted zone or storm in an immediately preceding frame) or exiting a desirable zone (e.g., if the vehicle was within a threshold proximity of a station or mapped path in an immediately preceding frame), and a smaller penalty for remaining in the undesirable zone (e.g., the vehicle was already in a restricted zone or storm in the immediately preceding frame) or outside of the desirable zone (e.g., the vehicle was already outside of a threshold proximity of a station or mapped path). In other examples, a reward or cost function may give rewards for making progress on an objective (e.g., following a mapped path). In still other examples, a reward or cost function may return a fixed penalty incursion score, wherein there is a fixed penalty for intersecting an undesirable zone in a current frame (i.e., without considering earlier or later frames) and a fixed reward for intersecting a desirable zone.
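For illustration only, the following sketch expresses the cumulative, first-entry, and fixed-penalty incursion scoring variants described above as simple functions over per-frame zone flags; the particular penalty and reward magnitudes are illustrative assumptions.

```python
def cumulative_incursion_score(in_undesirable_zone, penalty=-1.0, reward=0.1):
    """Sum a penalty for every frame spent in an undesirable zone and a small
    reward for every frame spent outside of it."""
    return sum(penalty if inside else reward for inside in in_undesirable_zone)

def first_entry_incursion_score(in_undesirable_zone, first_entry_penalty=-10.0,
                                stay_penalty=-1.0, reward=0.1):
    """Assess a large penalty when the vehicle first enters an undesirable zone
    (i.e., it was outside the zone in the immediately preceding frame), a smaller
    penalty for each frame it remains inside, and a reward otherwise."""
    score, previously_inside = 0.0, False
    for inside in in_undesirable_zone:
        if inside and not previously_inside:
            score += first_entry_penalty
        elif inside:
            score += stay_penalty
        else:
            score += reward
        previously_inside = inside
    return score

def fixed_penalty_incursion_score(in_undesirable_zone_now, penalty=-1.0, reward=0.1):
    """Fixed penalty or reward based only on the current frame."""
    return penalty if in_undesirable_zone_now else reward

# Example usage over a short sequence of frames (False = outside, True = inside).
print(first_entry_incursion_score([False, True, True, False]))
```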
Objective-directed controllers may each be directed to a desired objective, such as navigation using an altitude control system (e.g., toward a target location or heading), navigation using an altitude control system and lateral propulsion system (e.g., toward a target location or heading), station seeking (i.e., remaining within a proximity of a target location (aka a station) for a threshold amount of time, whether a minimum desired amount of time or as long as possible), map following (i.e., progressing along, or closely following, a mapped trajectory or flight plan), restricted zone avoidance (e.g., of no-fly areas), storm avoidance, and the like. A set of feature vectors associated with a desired objective may be provided as input to a learning system to train an objective-directed controller. In some examples, a set of standard or generic feature vectors may be provided for all controllers, including without limitation, position-related, pressure-related (e.g., minimum, maximum, current), power-related (e.g., altitude control system power, available solar power), solar-related (e.g., solar elevation, solar angle, next sunrise), and the like. A set of objective-directed feature vectors may be provided in addition to the standard or generic set. For example, storm avoidance feature vectors may include one or more of storm-related predictions, a storm proximity threshold (e.g., dynamic depending on the nature and severity of the storm, or a predetermined radius), and a time until incurring a risk of breaching a storm proximity threshold or a risk in a given look-ahead time period for each of a range of pressures (i.e., altitudes) (e.g., expressed as an average trajectory risk, first time to non-zero risk, or other risk factor). In another example, map following feature vectors may include one or more of a speed (e.g., ground speed), a direction (e.g., vehicle heading), and progress on a mapped path (e.g., map lookups compared with vehicle trajectory). In an example where map following is aided by lateral propulsion, additional feature vectors may include an effectiveness factor for application of lateral propulsion (i.e., at different levels of power in different directions).
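The sketch below illustrates one hypothetical way to assemble a standard (generic) feature vector and append objective-directed features for a storm avoidance controller; the specific field names and their composition are assumptions for illustration only.

```python
import numpy as np

def standard_features(state):
    """Generic features shared across objectives (position, pressure, power, solar)."""
    return np.array([
        state["latitude"], state["longitude"],
        state["pressure"], state["min_pressure"], state["max_pressure"],
        state["battery_charge"], state["available_solar_power"],
        state["solar_elevation"], state["hours_to_next_sunrise"],
    ], dtype=np.float32)

def storm_avoidance_features(state):
    """Objective-directed features appended for a storm avoidance controller."""
    return np.array([
        state["storm_proximity_km"],
        state["storm_proximity_threshold_km"],
        state["hours_to_nonzero_storm_risk"],
    ], dtype=np.float32)

def build_feature_vector(state, objective="storm_avoidance"):
    extra = (storm_avoidance_features(state) if objective == "storm_avoidance"
             else np.empty(0, dtype=np.float32))
    return np.concatenate([standard_features(state), extra])
```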
In an example, a combination of objective-directed controllers may be concurrently trained and used for more comprehensive control of a vehicle. Such a combination may include one directional system controller (e.g., an altitude control system, or an altitude control system plus lateral propulsion system), one objective-specific controller (e.g., station seeking, map following), and all overriding controllers (e.g., restricted zone avoidance, storm avoidance). Other combinations may include fewer or more types of controllers. Such combinations may be trained together to generate combination-specific objective-directed controllers.
A learning system, as described herein, may be configured to output an objective-directed controller trained according to a reward function corresponding to a desired objective, the objective-directed controller configured to select an action that maximizes a probability of achieving the desired objective (i.e., highest reward score).
Example Systems
Connection 104a may structurally, electrically, and communicatively, connect balloon 101a and/or ACS 103a to various components comprising payload 108a. In some examples, connection 104a may provide two-way communication and electrical connections, and even two-way power connections. Connection 104a may include a joint 105a, configured to allow the portion above joint 105a to pivot about one or more axes (e.g., allowing either balloon 101a or payload 108a to tilt and turn). Actuation module 106a may provide a means to actively turn payload 108a for various purposes, such as improved aerodynamics, facing or tilting solar panel(s) 109a advantageously, directing payload 108a and propulsion units (e.g., propellers 107 in
Payload 108a may include solar panel(s) 109a, avionics chassis 110a, broadband communications unit(s) 111a, and terminal(s) 112a. Solar panel(s) 109a may be configured to capture solar energy to be provided to a battery or other energy storage unit, for example, housed within avionics chassis 110a. Avionics chassis 110a also may house a flight computer (e.g., computing device 301, as described herein) and a transponder, along with other control and communications infrastructure (e.g., a controller comprising another computing device and/or logic circuit configured to control aerial vehicle 120a). Communications unit(s) 111a may include hardware to provide wireless network access (e.g., LTE, fixed wireless broadband via 5G, Internet of Things (IoT) network, free space optical network, or other broadband networks). Terminal(s) 112a may comprise one or more parabolic reflectors (e.g., dishes) coupled to an antenna and a gimbal or pivot mechanism (e.g., including an actuator comprising a motor). Terminal(s) 112a may be configured to receive or transmit radio waves to beam data long distances (e.g., using the millimeter wave spectrum or higher frequency radio signals). In some examples, terminal(s) 112a may have very high bandwidth capabilities. Terminal(s) 112a also may be configured to have a large range of pivot motion for precise pointing performance. Terminal(s) 112a also may be made of lightweight materials.
In other examples, payload 108a may include fewer or more components, including propellers 107 as shown in
Ground station 114 may include one or more server computing devices 115a-n, which in turn may comprise one or more computing devices (e.g., computing device 301 in
As shown in
Computing device 301 also may include a memory 302. Memory 302 may comprise a storage system configured to store a database 314 and an application 316. Application 316 may include instructions which, when executed by a processor 304, cause computing device 301 to perform various steps and/or functions, as described herein. Application 316 further includes instructions for generating a user interface 318 (e.g., graphical user interface (GUI)). Database 314 may store various algorithms and/or data, including neural networks (e.g., encoding flight policies and controllers, as described herein) and data regarding wind patterns, weather forecasts, past and present locations of aerial vehicles (e.g., aerial vehicles 120a-b, 201a-b, 211a-c), sensor data, map information, air traffic information, feature vectors, vehicle states, telemetry, among other types of data. Memory 302 may include any non-transitory computer-readable storage medium for storing data and/or software that is executable by processor 304, and/or any other medium which may be used to store information that may be accessed by processor 304 to control the operation of computing device 301.
Computing device 301 may further include a display 306, a network interface 308, an input device 310, and/or an output module 312. Display 306 may be any display device by means of which computing device 301 may output and/or display data. Network interface 308 may be configured to connect to a network using any of the wired and wireless short range communication protocols described above, as well as a cellular data network, a satellite network, free space optical network and/or the Internet. Input device 310 may be a mouse, keyboard, touch screen, voice interface, and/or any other hand-held controller, device, or interface by means of which a user may interact with computing device 301. Output module 312 may be a bus, port, and/or other interface by means of which computing device 301 may connect to and/or output data to other devices and/or peripherals.
In some examples, computing device 301 may be located remote from an aerial vehicle (e.g., aerial vehicles 120a-b, 201a-b, 211a-c) and may communicate with and/or control the operations of an aerial vehicle, or its control infrastructure as may be housed in avionics chassis 110a-b, via a network. In one embodiment, computing device 301 is a data center or other control facility (e.g., configured to run a distributed computing system as described herein), and may communicate with a controller and/or flight computer housed in avionics chassis 110a-b via a network. As described herein, system 300, and particularly computing device 301, may be used for planning a flight path or course for an aerial vehicle based on wind and weather forecasts to move said aerial vehicle along a desired heading or within a desired radius of a target location. Various configurations of system 300 are envisioned, and various steps and/or functions of the processes described below may be shared among the various devices of system 300, or may be assigned to specific devices.
Simulation module 502 may be configured to feed frames of simulations to replay buffers 504, which serve to randomize and store said frames independent of the particular simulations from which they came. A plurality of simulators in simulation module 502 (e.g., comprising several, or ten or more, simulators) may work to feed simulation frames to a plurality of replay buffers 504 (comprising one or more replay buffers).
Learning module 506 may be configured to pull sets of frames (e.g., comprising 32 or 64 frames, or any number of frames ranging from a dozen to multiples of fives, tens, or dozens of frames) from replay buffers 504 to train learned flight policies. Learning module 506 may comprise a deep reinforcement learning architecture configured to run one or more variations of reinforcement learning, such as Q-learning, double Q-learning, distributional Q-learning, or other policy learning methods. The reinforcement learning algorithm in learning module 506 may be configured to maximize a sum of rewards generated by a reward function corresponding to the objective of the flight policy training. The reward function may be correlated with an objective (i.e., a control objective related to navigation or operation of an aerial vehicle, such as a high altitude aerial vehicle). Examples of objectives may include: following a map (e.g., following the gradient or map of a heuristic function built to indicate how to efficiently cross an ocean); spending the most (or otherwise optimal) amount of time within a radius of a target location (e.g., a latitude-longitude, a city, an island, a group or chain of islands, a group of buoyed structures such as offshore wind farms); following a mapped trajectory; flying in a given direction; arriving at a location at a desired date and time; maximizing (or otherwise optimizing) the amount of time an aerial vehicle provides connection services to a given area; following, or remaining within a given radius of, a terrestrial or nautical vehicle (e.g., a cruise ship, an all-terrain vehicle, etc.); and any of the above in coordination with other aerial vehicles (i.e., in the context of a fleet of aerial vehicles). Thresholds may be predetermined to categorize learned flight policies into various levels of performance (e.g., high, medium, or low performing) based on how a learned flight policy scores according to the reward function (i.e., how high a reward is produced by said learned flight policy). In some examples, the reward score may comprise a value. In other examples, where certain types of reinforcement learning are implemented (e.g., distributional Q-learning), a reward score may comprise a probability distribution or a distribution of rewards. In an example, learning module 506 may be configured to provide medium and high performing learned flight policies to policy server 508 for storage and further use by simulation module 502 to run simulations, and also to provide high performing learned flight policies to operation-ready policies server 510 for use in operational navigation systems, according to methods described below. In another example, learning module 506 may be configured to provide medium and high performing learned flight policies to policy server 508, and to provide a separate category of highest performing learned flight policies to operation-ready policies server 510 for use in operational navigation systems. Learning module 506 may discard (i.e., delete) low-performing learned flight policies.
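By way of illustration only, the sketch below categorizes a learned flight policy by its reward score and routes it to a policy server or an operation-ready policies server accordingly; the threshold values and the list-based stand-ins for the servers are illustrative assumptions.

```python
def categorize_policy(reward_score, low_threshold=0.2, high_threshold=0.8):
    """Bucket a learned flight policy by its reward score."""
    if reward_score < low_threshold:
        return "low"
    if reward_score < high_threshold:
        return "medium"
    return "high"

def route_policy(policy, reward_score, policy_server, operation_ready_server):
    """Store medium/high performers for further simulation, promote high
    performers to the operation-ready server, and discard low performers."""
    category = categorize_policy(reward_score)
    if category == "low":
        return "discarded"
    policy_server.append((policy, reward_score))
    if category == "high":
        operation_ready_server.append((policy, reward_score))
    return category

# Example usage with simple lists standing in for the servers.
policy_server, operation_ready_server = [], []
print(route_policy("policy_net_v1", 0.9, policy_server, operation_ready_server))
```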
Turning to
In some examples, system 500 also may include an operation-ready policies server 510 separate from policy server 508, in which learned flight policies that meet or exceed a certain threshold score defined for an appropriate category of flight policies (e.g., high performing, highest performing, operation-ready, and other appropriate categories) may be stored. The operation-ready flight policies in operation-ready policies server 510 may be provided to aerial vehicles (e.g., aerial vehicles 120a-b, 201a-b, 211a-c) for navigation of said aerial vehicles (i.e., to determine actions to be performed by said aerial vehicles to achieve an objective), according to methods described in more detail below.
In an example, a subset of 30-60 learning systems 602a-n may be run in parallel with the objective of training optimal flight policies to navigate an aerial vehicle (e.g., aerial vehicles 120a-b, 201a-b, 211a-c) from a starting location to a target location. In this example, the objective amongst this subset of learning systems 602a-n may be the same, but each of learning systems 602a-n may be run with one or more of the following parameter variations: distances between said starting and target locations, the geographical locations of the starting and target locations (i.e., differing hemispheres or regions of the world, thereby invoking very different environmental conditions), starting times, number of frames per input (e.g., 16, 32, 64 or more), the reward function slope (i.e., how steep the slope is, which translates into how strictly the reward may be scored), and number of simulators. In this example, said subset of 30-60 learning systems 602a-n may further be grouped into subset groups, each subset group running variations on a single parameter or a few related parameters (e.g., a subset group running variations on the reward function, another subset group running variations on geographical locations, another subset group running variations on input vectors, etc.). In another example, another subset of learning systems 602a-n may be trained on the same objective with the same parameter variations using a different type of reinforcement learning (e.g., Q-learning in one subset providing value scores, distributional Q-learning in another subset providing probability distribution scores). Evaluation server 512 may evaluate the performance of the learned flight policies being generated by learning systems 602a-n (e.g., retrieving them from policy servers 508a-n or directly from learning modules 506a-n). Evaluation server 512 may determine which learned flight policies should be stored in operation-ready policies server 510, and provide feedback on which stacks are doing well or poorly. For example, where certain ones of said subset of 30-60 learning systems 602a-n, characterized by certain sets of parameters, perform poorly (i.e., produce poor performing learned flight policies), the poor performing learning systems (i.e., sets of parameters) may be discontinued. As certain of learning systems 602a-n are discontinued, new ones with new sets of parameters may be added to particular subsets or subset groups.
In some examples, meta-learning system 600 may be implemented in a distributed computing environment wherein a plurality of copies of a stack of learning systems may be maintained along with a plurality of policy servers, each policy server being dedicated to a learning stack.
Example Methods
The simulation module may generate a plurality of frames at step 704, each frame representing a time step of a simulation and comprising a feature vector representing a set of features of the aerial vehicle state and environment in said time step, wherein actions taken in each simulation in said time step correspond to a feature vector of a frame, and the resulting time step captured in the frame is rewarded for the behavior (i.e., the actions taken). At step 706, the plurality of frames may be stored in a replay buffer, the replay buffer configured to provide a random or scrambled set of frames (e.g., to disrupt the plurality of frames from their original time sequence order) for input into a learning module. As described herein, the replay buffer may comprise one or more buffers.
The method may continue, at step 708, with requesting, by a learning module, a set of frames from the replay buffer. The learning module may comprise a deep reinforcement learning architecture, as described above. The learning module may process the set of frames using a reinforcement learning algorithm at step 710, in order to then generate a learned flight policy that is scored according to a reward function defined for that learning module at step 712. The learned flight policy, encoded in a neural network, may be stored in a policy server at step 714.
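For illustration only, the following compact sketch ties steps 702 through 714 together as a single training loop; the simulate, learn, and score callables are hypothetical placeholders for the simulation module, the reinforcement learning update, and the reward function scoring described above.

```python
def training_loop(simulate, replay_buffer, learn, score, policy_server,
                  policy, iterations=100, frames_per_input=32):
    """One possible driver for the simulate -> buffer -> learn -> score -> store loop."""
    for _ in range(iterations):
        for frame in simulate(policy):                   # steps 702-704: run a simulation
            replay_buffer.add(frame)                     # step 706: store frames
        batch = replay_buffer.sample(frames_per_input)   # step 708: request a set of frames
        policy = learn(policy, batch)                    # step 710: reinforcement learning update
        reward_score = score(policy)                     # step 712: score via the reward function
        policy_server.append((policy, reward_score))     # step 714: store the learned policy
    return policy
```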
Turning to
A threshold may be predetermined to define whether a learned flight policy is high performing or low performing, and said threshold may comprise one or more thresholds used to define more granular performance categories (e.g., high performing threshold, very high performing threshold, low performing threshold, medium performing threshold, simulation-ready threshold, operation-ready threshold, discard threshold, anomalous threshold, and others, any of which may be a value-based threshold wherein the score comprises a value or a distribution-based threshold wherein the score comprises a probability distribution).
In an alternative embodiment, in some examples, the learning module itself may determine whether a threshold is met (i.e., the score for the learned flight policy meets or exceeds the threshold). The learned flight policy, which may be encoded in a neural network, that meets or exceeds said threshold may be stored in the policy server configured to store neural networks, at step 714. Once the learned flight policy is stored in the policy server, it may be provided to a simulation module, or pulled by a simulation module from the policy server, to run further simulations. If the threshold is not met (i.e., the score for the learned flight policy falls below the threshold), the learned flight policy may be discarded. Where the score comprises a value, the value score will be evaluated against a predetermined threshold value. Where the score comprises a probability distribution or distribution of rewards, the score may be evaluated against a predetermined distribution threshold.
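As a hypothetical illustration of evaluating a score against a threshold where the score may comprise either a value or a distribution of rewards, the sketch below reduces a distributional score to its expected value before comparison; this reduction is an assumption for illustration, and other reductions (e.g., quantiles) could be used instead.

```python
import numpy as np

def meets_threshold(score, threshold):
    """Accept either a scalar reward score or a distribution of rewards.

    For a distributional score (e.g., from distributional Q-learning), this sketch
    compares the expected value of the distribution against the threshold.
    """
    score = np.asarray(score, dtype=float)
    value = float(score) if score.ndim == 0 else float(score.mean())
    return value >= threshold

print(meets_threshold(0.85, 0.8))                  # scalar value score
print(meets_threshold([0.2, 0.7, 0.9, 1.0], 0.8))  # distribution of rewards
```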
In some examples, the operation-ready threshold for storing in the operation-ready policies server may be different from other thresholds, e.g., for storing in a normal policies server (e.g., if the learned flight policy score meets or exceeds a different and higher threshold). In still other examples, methods 700 and 750 may include evaluating a learned flight policy score against still other thresholds to gate the grouping and treating of learned flight policies in other ways.
Method 700 may be performed by flight policy training system 500 and method 750 may be performed by a flight policy training system 550, either or both of which may be implemented in a distributed computing system such as distributed computing system 400. In some examples, method 700 may be performed as a training loop in each of a plurality of training systems, such as in a meta-learning system 600. Method 750 also may be performed as part of a training loop in a meta-learning system 600.
In still other embodiments, methods 700 and 750 may be implemented using real world flight data or other sources of flight data, rather than simulations. Therefore, the training methods described herein may be performed using simulated data, historical data, fresh data collected during operation, or a mixture of these, to generate input for a learning module.
An action may be selected by the trained neural network based on the input vector at step 806. The action may be converted into a set of commands, at step 808, the set of commands configured to cause the aerial vehicle to perform the action. For example, an action “descend” or “down” may be converted to a set of commands that includes spinning an ACS fan motor of the aerial vehicle at a predetermined number of watts. A control system, as described herein, may then cause the aerial vehicle to perform the action using the set of commands at step 810. In some examples, a determination may be made whether the aerial vehicle operation should continue at step 812. If yes, method 800 may return to step 804 to generate further input vectors to select further actions. If no, method 800 may end.
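By way of illustration only, the sketch below converts a selected discrete action into a set of commands, following the example above in which a "descend" or "down" action maps to spinning an ACS fan motor at a predetermined number of watts; the command names, subsystems, and wattage values are illustrative assumptions and not the command set of any particular vehicle.

```python
def action_to_commands(action, acs_fan_watts=150.0, propulsion_watts=200.0):
    """Map a discrete action to a list of (subsystem, command, value) tuples."""
    if action == "down":
        # Spinning the ACS fan at a predetermined wattage causes the vehicle to descend.
        return [("acs_fan_motor", "set_power_watts", acs_fan_watts)]
    if action == "up":
        # Illustrative only; the mechanism for ascending depends on the vehicle's ACS.
        return [("acs_fan_motor", "set_power_watts", 0.0), ("acs_valve", "open", None)]
    if action == "stay":
        return [("acs_fan_motor", "set_power_watts", 0.0)]
    if action.startswith("propel_"):
        heading = action[len("propel_"):]  # e.g., "propel_northeast"
        return [("propulsion_motor", "set_power_watts", propulsion_watts),
                ("propulsion_heading", "set", heading)]
    raise ValueError(f"unknown action: {action}")

print(action_to_commands("down"))
```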
It would be recognized by a person of ordinary skill in the art that some or all of the steps of methods 700 and 800, as described above, may be performed in a different order or sequence, repeated, and/or omitted without departing from the scope of the present disclosure.
While specific examples have been provided above, it is understood that the present invention can be applied with a wide variety of inputs, thresholds, ranges, and other factors, depending on the application. For example, the time frames and ranges provided above are illustrative, but one of ordinary skill in the art would understand that these time frames and ranges may be varied or even be dynamic and variable, depending on the implementation.
As those skilled in the art will understand, a number of variations may be made in the disclosed embodiments, all without departing from the scope of the invention, which is defined solely by the appended claims. It should be noted that although the features and elements are described in particular combinations, each feature or element can be used alone without other features and elements or in various combinations with or without other features and elements. The methods or flow charts provided may be implemented in a computer program, software, or firmware tangibly embodied in a computer-readable storage medium for execution by a general-purpose computer or processor.
Examples of computer-readable storage mediums include a read only memory (ROM), random-access memory (RAM), a register, cache memory, semiconductor memory devices, magnetic media such as internal hard disks and removable disks, magneto-optical media, and optical media such as CD-ROM disks.
Suitable processors include, by way of example, a general-purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) circuits, any other type of integrated circuit (IC), a state machine, or any combination thereof.
This application is a continuation-in-part of U.S. patent application Ser. No. 16/667,424, filed Oct. 29, 2019, and U.S. patent application Ser. No. 16/667,441, filed Oct. 29, 2019, both of which are hereby incorporated by reference in their entirety.
Related U.S. Application Data:
Parent application Ser. No. 16/667,424, filed Oct. 2019 (US); child application Ser. No. 17/187,270 (US).
Parent application Ser. No. 16/667,441, filed Oct. 2019 (US); child application Ser. No. 16/667,424 (US).