The present disclosure relates to the field of computer technology, particularly to indoor autonomous navigation, and more particularly to a method for navigating a robot in a task environment.
Currently, the autonomous navigation system used in most mobile robots relies on a fine-grained map of a task environment pre-established by scanning. During navigation, in response to receiving a coordinate, a robot searches for a global path through a search algorithm, and then optimizes the global path based on local observations to obtain a final planned path. However, when placed in a new environment, an existing mobile robot cannot execute autonomous navigation immediately, since a coordinate of the destination may not be known or a fine-grained map may not be available.
Embodiments of the present disclosure provide a method and apparatus for navigating a robot in a task environment, and a non-transitory medium.
In a first aspect, some embodiments of the present disclosure provide a method for navigating a robot in a task environment. The method includes: receiving, by a pre-trained sequential prediction model, a navigation graph of the task environment, instructions in natural language and an initial location of the robot in the navigation graph, where the navigation graph comprises nodes indicating locations in the task environment, coordinates of the nodes, and edges indicating connectivity between the locations; and predicting sequentially, by the pre-trained sequential prediction model, a sequence of single-step behaviors executable by the robot to navigate the robot from the initial location to a destination.
In a second aspect, some embodiments of the present disclosure provide an electronic device. The electronic device comprises at least one processor, and a memory storing instructions executable to cause the at least one processor to perform the method for navigating a robot in a task environment according to any one of the embodiments of the first aspect.
In a third aspect, some embodiments of the present disclosure provide a non-transitory computer readable storage medium storing a computer program executable to cause a processor to perform the method for navigating a robot in a task environment according to any one of the embodiments of the first aspect.
By reading the detailed description of non-limiting embodiments with reference to the following accompanying drawings, other features, objectives and advantages of the present disclosure will become more apparent:
Embodiments of the present disclosure will be described below in detail with reference to the accompanying drawings. It should be appreciated that the specific embodiments described herein are merely used for explaining the relevant disclosure, rather than limiting the disclosure. In addition, it should be noted that, for ease of description, only the parts related to the relevant disclosure are shown in the accompanying drawings.
It should also be noted that some embodiments in the present disclosure and some features in the disclosure may be combined with each other on a non-conflict basis. Features of the present disclosure will be described below in detail with reference to the accompanying drawings and in combination with embodiments.
As shown in
Step S101:
A robot may be placed in an environment and need to be navigated across the environment. The environment in which a robot needs to be navigated is referred to as a task environment. The task environment may be a GPS-denied environment, an indoor space, etc. A task environment is shown in
When a robot follows a corridor, the robot only needs to know that it is moving along a space with the correct semantic meaning, but not necessarily with certain geometry specifications such as path width and curvature. Thus, the navigation of the robot across a task environment may be realized by representing the task environment as a topological map. The nodes in the topological map may refer to semantically meaningful locations such as rooms and corridors, while the edges may indicate connectivity. The topological map may be used as the navigation graph which is used for navigating the robot across the task environment. A navigation graph corresponding to the task environment of
In some embodiments, the navigation graph is encoded with undirected edges together with node coordinates. The undirected edges may be illustrated in the navigation graph as non-directional or bi-directional edges. As an example, the undirected edges are shown as bi-directional edges in
The presentation of genuine geometric information in the navigational map enables one to interpret environmental knowledge adaptively based on actual navigation progress online. This may yield not only more compact, but more directed routing representations by filtering out information unessential at particular navigation progress.
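The graph encoding described above (nodes, node coordinates, and undirected edges) can be illustrated with a minimal sketch. The class name `NavigationGraph`, its fields, and the node names and coordinates below are hypothetical and chosen only for illustration; the disclosure does not prescribe a particular data structure.

```python
from dataclasses import dataclass, field


@dataclass
class NavigationGraph:
    """Topological map: named nodes, 2-D node coordinates, undirected edges."""
    coords: dict = field(default_factory=dict)     # node name -> (x, y)
    neighbors: dict = field(default_factory=dict)  # node name -> set of adjacent nodes

    def add_node(self, name, x, y):
        self.coords[name] = (float(x), float(y))
        self.neighbors.setdefault(name, set())

    def add_edge(self, a, b):
        # An undirected edge is stored as connectivity in both directions.
        if a not in self.coords or b not in self.coords:
            raise KeyError("both endpoints must be added as nodes first")
        self.neighbors[a].add(b)
        self.neighbors[b].add(a)


# Hypothetical four-node environment: two offices joined by a corridor.
graph = NavigationGraph()
graph.add_node("office_0", 0.0, 1.0)
graph.add_node("corridor_0", 0.0, 0.0)
graph.add_node("corridor_1", 2.0, 0.0)
graph.add_node("office_3", 2.0, 1.0)
graph.add_edge("office_0", "corridor_0")
graph.add_edge("corridor_0", "corridor_1")
graph.add_edge("corridor_1", "office_3")
```

Note that no behavior labels are stored on the edges; only connectivity and coordinates are kept, so that behaviors can be derived later from the actual navigation progress.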
In a scenario, just as shown in
In some embodiments, the navigation graph (including the nodes, the coordinates of the nodes, and the edges between the nodes), the instructions in natural language, and the initial location or initial node of the robot are input into a pre-trained sequential prediction model. Based on these inputs, the sequential prediction model generates a sequence of single-step behaviors that are executable by the robot to navigate the robot from the initial location (e.g., the Office 3) to a destination (e.g., the Office 0).
Step S102:
The sequential prediction model may be a neural network model pre-trained with annotated sample navigation graphs and corresponding sample instructions in natural language.
In some embodiments, the sequential prediction model sequentially predicts a sequence of single-step behaviors executable by the robot.
During the training of the model, given training samples $\{(\langle G_i, l_i, s_i \rangle, u_i)\}_{i=0}^{N_{\text{train}}}$, the goal is to infer behavior sequences that reflect the instructions in view of new navigation queries by solving
Since the target is a high-level behavior plan, the goal states described by the instructions may only specify target locations but not desired heading directions. Thus, without loss of navigation capability, an embodiment of the present disclosure keeps a simple behavior set $B := \{b_e, b_r, b_l, b_f\}$, where $b_e$ denotes “exit”, $b_r$ denotes “turn right and move”, $b_l$ denotes “turn left and move”, and $b_f$ denotes “move forward”. The proposed solution differs from assigning a separate behavior to the same movement in each different situation, such as “Go straight at a T intersection” and “Go straight down the corridor.” The compact behavior set helps the learning focus on matching instructions with navigation movements instead of specific environments, effectively improving generality.
Given an action sequence $u_{0:T-1} := (u_0, \ldots, u_{T-1})$, the robot may take action $u_t$ at time $t$ and relocate from node $n_t$ to $n_{t+1}$. The expected entire navigation starts from $n_0 = s$ and terminates at goal state $n_T = g$. In some embodiments, each action $u_t$ can be classified as $u_t = b \in B$ by comparing the robot heading before and after a movement, assuming that the robot always heads in the direction of movement. Suppose at time $t$, the robot is at node $n_t$. Then, by calculating the cross product of the headings $\phi_t = x(n_t) - x(n_{t-1})$ and $\phi_{t+1} = x(n_{t+1}) - x(n_t)$, we can classify the action $u_t$ as
where $n_{t+1}$ is always different from $n_{t-1}$. In particular, at $t = 0$, the robot is assumed to be at a room node $s$ and has only one valid behavior, “exit”. As such, a valid transition from $n_t$ to $n_{t+1}$ may be denoted as a tuple $\langle n_t, u_t, n_{t+1} \rangle$, where $u_t$ is inferred according to the above Equation (2). A special behavior $b_s$ (i.e., “STOP”) may also be encoded, which the robot may take at any time $t$ to indicate navigation termination.
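The heading comparison can be sketched as follows. This is only an illustration under stated assumptions: coordinates are 2-D in a right-handed frame (x to the right, y up), so a positive cross product is read as a left turn; the tolerance `eps` and the function name `classify_behavior` are hypothetical, and the exact conditions of Equation (2) are not reproduced here.

```python
def classify_behavior(x_prev, x_cur, x_next, eps=1e-6):
    """Classify the move x_prev -> x_cur -> x_next from the two headings.

    Assumptions (not taken from the disclosure): right-handed 2-D frame,
    positive cross product means a left turn, and `eps` is a tolerance
    below which the two headings are treated as parallel.
    """
    if x_prev is None:
        # At t = 0 there is no previous node; the only valid behavior is "exit".
        return "exit"
    # Heading before the move: current minus previous coordinate.
    h_in = (x_cur[0] - x_prev[0], x_cur[1] - x_prev[1])
    # Heading after the move: next minus current coordinate.
    h_out = (x_next[0] - x_cur[0], x_next[1] - x_cur[1])
    # z-component of the 2-D cross product of the two headings.
    cross = h_in[0] * h_out[1] - h_in[1] * h_out[0]
    if cross > eps:
        return "turn_left"
    if cross < -eps:
        return "turn_right"
    return "forward"
```

For example, a robot heading east from (0, 0) to (1, 0) that next moves north to (1, 1) is classified as `"turn_left"`, while continuing to (2, 0) is classified as `"forward"`.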
The sequential prediction model proposed in
In some embodiments, for each single step during the prediction, an adaptive context is generated by adapting the navigation graph to a current prediction process corresponding to the single step, and a single-step behavior is predicted for the current single step based on at least the generated adaptive context and the instructions in natural language. Adapting the navigation graph to the actual prediction process corresponding to the current single step follows the observation that, when following navigation instructions, humans usually search for related information within a local horizon instead of paying equal attention across the whole map at all times. By adapting the navigation graph in this way and predicting the single-step behavior for the current single step based on the adapted context, the challenge of the flexible correspondence between instruction semantics and navigation plans is solved with limited scalability to new and large maps.
In some embodiments, the knowledge base adaptation may be realized by so-called d-step action propagation. Other methods for realizing knowledge base adaptation may also be adopted.
As an example,
The connectivity information of graph G:=<E, N, X> may be written as a set of tuples {<n, b, n′>i}, each representing a valid navigation behavior moving from node n to n′ with type b. As described above, the valid behavior type b for directed edge <n, n′> depends on the possible previous locations nprev of the robot before reaching n. Thus, a transition <n, b, n′> can alternatively be written in previous-current-next format <nprev, n, n′>, from which b can be inferred according to the above Eq. (2). To adapt the knowledge base G, we search for valid behaviors that can be taken in the next d steps. In other words, we simulate the robot movements continuing from the immediate history <nt−1, nt> and record any valid node transitions and their behavior types. We refer to this process as d-step action propagation hereafter. We implement this process as a bounded breadth-first search over directed edges in G, taking <nt−1, nt> as the initial element. Each time we pop a directed edge <nprev, n> from the queue, we collect all neighbors n′ of n that are not nprev. For each n′, we add <n, n′> to the queue and compose a tuple <nprev, n, n′>. The tuple is subsequently converted to graph format <n, b, n′>, where the behavior type b is inferred from the coordinates x(nprev), x(n), and x(n′) according to the above Eq. (2). All valid transitions where the distance between n′ and the current node nt is within d may be collected. This yields the adaptive context Ĝt at time step t. See Algorithm 1 for a summary of the d-step action propagation algorithm.
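The bounded breadth-first search described above can be sketched as follows. This is a minimal illustration rather than the algorithm as claimed: the function names, the adjacency-dictionary input, and the `visited` set (added here so that cyclic graphs do not re-expand the same directed edge) are assumptions; the distance to the current node is measured in hops along the search; and the left/right sign convention of `infer_behavior` is likewise assumed.

```python
from collections import deque


def infer_behavior(x_prev, x_cur, x_next, eps=1e-6):
    # Assumed convention: right-handed 2-D frame, positive cross product = left.
    if x_prev is None:
        return "exit"
    cross = ((x_cur[0] - x_prev[0]) * (x_next[1] - x_cur[1])
             - (x_cur[1] - x_prev[1]) * (x_next[0] - x_cur[0]))
    if cross > eps:
        return "turn_left"
    if cross < -eps:
        return "turn_right"
    return "forward"


def d_step_action_propagation(neighbors, coords, n_prev, n_cur, d):
    """Collect transitions <n, b, n'> reachable within d hops of n_cur."""
    context = set()
    visited = {(n_prev, n_cur)}           # directed edges already queued
    queue = deque([(n_prev, n_cur, 0)])   # (previous node, node, hops from n_cur)
    while queue:
        prev, node, hops = queue.popleft()
        if hops >= d:
            continue                      # bound the search at d steps
        for nxt in neighbors[node]:
            if nxt == prev:
                continue                  # moving back to the previous node is excluded
            x_prev = coords[prev] if prev is not None else None
            b = infer_behavior(x_prev, coords[node], coords[nxt])
            context.add((node, b, nxt))
            if (node, nxt) not in visited:
                visited.add((node, nxt))
                queue.append((node, nxt, hops + 1))
    return context


# Hypothetical five-node map used only to exercise the sketch.
neighbors = {"A": ["B"], "B": ["A", "C", "D"], "C": ["B", "E"], "D": ["B"], "E": ["C"]}
coords = {"A": (0, 0), "B": (1, 0), "C": (2, 0), "D": (1, 1), "E": (2, 1)}
context_d1 = d_step_action_propagation(neighbors, coords, "A", "B", 1)
context_d2 = d_step_action_propagation(neighbors, coords, "A", "B", 2)
```

With `d = 1` only the transitions leaving the current node are collected; increasing `d` widens the local horizon that the model attends to.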
Context and Instruction Embedding: both the navigation context Ĝt (or G in the static context case) and the instructions I are encoded. Each transition tuple <n, b, n′> in Ĝt is encoded into a vector of length 2|N|+|B|, where |N| and |B| refer to the number of nodes in graph G and the number of valid behaviors, respectively. The context Ĝt is finally encoded into a matrix of size L
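One encoding consistent with the stated vector length 2|N|+|B| is a concatenation of three one-hot segments (source node, behavior, target node). The sketch below assumes this layout and segment order; neither is confirmed by the text, and the helper names and the sorted row order are hypothetical.

```python
def encode_transition(transition, nodes, behaviors):
    """Encode <n, b, n'> as a vector of length 2|N| + |B|.

    Assumed layout: [one-hot(n) | one-hot(b) | one-hot(n')].
    """
    n, b, n_next = transition
    vec = [0.0] * (2 * len(nodes) + len(behaviors))
    vec[nodes.index(n)] = 1.0
    vec[len(nodes) + behaviors.index(b)] = 1.0
    vec[len(nodes) + len(behaviors) + nodes.index(n_next)] = 1.0
    return vec


def encode_context(context, nodes, behaviors):
    # One row per transition; rows are sorted only to make the output deterministic.
    return [encode_transition(t, nodes, behaviors) for t in sorted(context)]


nodes = ["A", "B", "C"]
behaviors = ["exit", "turn_right", "turn_left", "forward"]
vec = encode_transition(("A", "forward", "B"), nodes, behaviors)
matrix = encode_context({("A", "forward", "B"), ("B", "turn_left", "C")}, nodes, behaviors)
```

Here each row has length 2·3 + 4 = 10, and the number of rows equals the number of transitions in the context.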
Feature Extraction: the feature extraction is performed on both context and instruction embedding. In some embodiments, a multilayer bidirectional Gated Recurrent Units (GRUs) is used to generate context features
Context-Instruction Attention: the correspondence between the navigation context and the instructions is then sought via an attention mechanism. In some embodiments, a one-way attention is used, in which only the context features attend to the instruction features. Notably, under the setting of adaptive context, the attention mechanism resembles not only the way people search for paths on a map, but also the fact that people pay primary attention to the environment in proximity when deciding the next movement. This is particularly true when the instructions are based on local environments rather than global landmarks.
For each row
where $W \in \mathbb{R}^{2H \times 2H}$ refers to trainable parameters. The attention vector Rti for each transition feature
Aggregating all Rti, an attention matrix Rt of size let L
Progress-Aware Context: This section combines navigation context
where $w_C \in \mathbb{R}^{4H \times H}$ refers to the trainable parameters that reduce the feature dimension to $H$. Then, we attend the hidden state $h_t$ to $C_t$ to capture context features related to the current navigation progress. The attention weight $\alpha_t$ is computed as follows:
where $W_1, W_2 \in \mathbb{R}^{H \times H}$ and $v \in \mathbb{R}^{H}$ are trainable parameters. The progress-aware context $S_t \in \mathbb{R}^{H}$ is then computed as St=ΣiL
Behavior Prediction: finally, the progress-aware context $S_t$ and the hidden state $h_t$ are combined to generate the policy at time $t$. The raw action probability feature $\hat{b}_t$ is computed by concatenating $S_t$ with $h_t$ and feeding the result into a fully connected layer:
$$\hat{b}_t = W_3 [S_t; h_t] \tag{8}$$
where $W_3 \in \mathbb{R}^{(|B|+1) \times 2H}$ refers to trainable parameters. The result is a preference vector $\hat{b}_t$ over the navigation behaviors $b \in B$ as well as a special STOP action $b_s$ indicating task termination.
To generate the action ut, a masked softmax function is applied:
$$O_t = \hat{b}_t + \operatorname{mask}(G, n_{0:t})$$
$$u_t = \arg\max(\operatorname{softmax}(O_t)) \tag{9}$$
In some embodiments, the input to the mask function includes the entire navigation graph $G$ and the navigation trajectory $n_{0:t}$ up to the current step $t$. The function generates a zero vector of the same size as $\hat{b}_t$, in which the entries of invalid behaviors are replaced with $-\infty$. To decide whether a certain behavior $b$ is valid, we check whether there exists a neighbor node $n'$ of $n_t$ satisfying:
$$n' \neq n_{t-1} \ \text{ and } \ b = b(n_{t-1}, n_t, n') \ \text{ by Eq. (2)} \tag{10}$$
In some embodiments, when nt=nt−1 (or ut−1=bs), a STOP action is enforced at time t since the navigation is already terminated. Notably, the valid action space at each step t is determined not only by the location nt, but also by the history location nt−1. This setting lifts the requirement for binding behavior semantics with locations, enabling both compact knowledge representation and flexible inference of behavior semantics.
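The masking step of Eq. (9) can be sketched in plain Python as follows. The behavior ordering in `BEHAVIORS`, the function name `masked_policy`, and the example scores are assumptions for illustration; the set of valid behaviors is taken as an input here rather than derived from the graph by Eq. (10).

```python
import math

# Assumed ordering of the |B| + 1 outputs; the last entry is the STOP action.
BEHAVIORS = ["exit", "turn_right", "turn_left", "forward", "stop"]


def masked_policy(scores, valid):
    """Add a 0 / -inf mask to the raw scores, then apply softmax and argmax."""
    if not valid:
        raise ValueError("at least one behavior must be valid")
    # mask(G, n_{0:t}): 0 for valid behaviors, -inf for invalid ones.
    masked = [s + (0.0 if b in valid else -math.inf)
              for b, s in zip(BEHAVIORS, scores)]
    peak = max(masked)                           # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return BEHAVIORS[best], probs


# "exit" has the highest raw score but is masked out as invalid.
action, probs = masked_policy([5.0, 1.0, 2.0, 0.5, 0.0],
                              {"turn_left", "forward", "stop"})
```

Because invalid behaviors receive zero probability after the softmax, the selected action is always one that the graph permits at the current node and heading.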
With further reference to
As shown in
In some embodiments, the prediction unit further includes an adaptive context generation subunit and a prediction subunit. The adaptive context generation subunit is configured to: generate, for each single step during the prediction, an adaptive context by adapting the navigation graph to a current prediction process corresponding to the single step. The prediction subunit is configured to predict a single-step behavior for the single step based on at least the generated adaptive context and the instructions in natural language.
In some embodiments, the adaptive context generation subunit is further configured to: search for, in the navigation graph, valid node transitions between a current node corresponding to the single step and neighbor nodes, except for a previous node, of the current node; predict a behavior of a valid node transition based on coordinates of the current node, a previous node of the current node, and a neighbor node except for the previous node of the current node; take the neighbor node as a new current node, and repeat the steps of searching and predicting, until a distance between a node taken as the new current node and the current node of the current single step is within a preset value; and convert all of the found valid node transitions and the predicted behaviors thereof to graph format to generate the adaptive context for each single step.
In some embodiments, the behaviors of the valid node transitions are predicted from a behavior set composed of: exit, turn right and move, turn left and move, and move forward.
In some embodiments, the adaptive context generation subunit is further configured to: determine heading of the robot at the current node by subtracting a coordinate of the previous node from a coordinate of the current node; determine heading of the robot at the neighbor node by subtracting the coordinate of the current node from a coordinate of a neighbor node; calculate a cross product of the heading of the robot at the current node and the heading of the robot at the neighbor node; and predict the single-step behavior of the valid node transition based on the calculated cross product.
In some embodiments, the prediction subunit is further configured to: predict the single-step behavior for the single step based on the generated adaptive context, the instructions in natural language, and a current hidden state updated by a gated recurrent unit (GRU), wherein the GRU takes a previous single-step behavior of a previous single step as input and updates to obtain the current hidden state.
In some embodiments, the apparatus for predicting a sequence of single-step behaviors further includes a navigation graph creating unit, configured to: create a topological map of the task environment, with locations in the task environment as nodes of the topological map, and the connectivity between the locations as edges of the topological map; and determine the created topological map as the navigation graph of the task environment.
The apparatus 600 corresponds to the steps in the foregoing method embodiments. Therefore, the operations, features, and technical effects that can be achieved in the above method for predicting a sequence of single-step behaviors are also applicable to the apparatus 600 and the units contained therein, and detailed description thereof will be omitted.
According to an embodiment of the present disclosure, an electronic device and a readable storage medium are provided.
As shown in
As shown in
The memory 702 is a non-transitory computer readable storage medium provided in an embodiment of the present disclosure. The memory stores instructions executable by at least one processor, so that the at least one processor performs the method for predicting a sequence of single-step behaviors provided by embodiments of the present disclosure. The non-transitory computer readable storage medium of some embodiments of the present disclosure stores computer instructions for causing a computer to perform the method for predicting a sequence of single-step behaviors provided in embodiments of the present disclosure.
The memory 702, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs and modules, such as program instructions/modules corresponding to the method for predicting a sequence of single-step behaviors in embodiments of the present disclosure (for example, the receiving unit 601 and the prediction unit 602 as shown in
The memory 702 may include a storage program area and a storage data area, where the storage program area may store an operating system and at least one function required application program; and the storage data area may store data created by the use of the electronic device for predicting a sequence of single-step behaviors. In addition, the memory 702 may include a high-speed random access memory, and may also include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or other non-transitory solid-state storage devices. In some embodiments, the memory 702 may optionally include memories remotely provided with respect to the processor 701, and these remote memories may be connected to the electronic device for predicting a sequence of single-step behaviors through a network. Examples of the above network include but are not limited to the Internet, intranet, local area network, mobile communication network, and combinations thereof.
The electronic device for performing the method for predicting a sequence of single-step behaviors may further include: an input apparatus 703 and an output apparatus 704. The processor 701, the memory 702, the input apparatus 703, and the output apparatus 704 may be connected through a bus 705 or in other manners. In
The input apparatus 703 may receive input digital or character information, and generate key signal inputs related to user settings and function control of the electronic device of the method for predicting a sequence of single-step behaviors, such as touch screen, keypad, mouse, trackpad, touchpad, pointing stick, one or more mouse buttons, trackball, joystick and other input apparatuses. The output apparatus 704 may include a display device, an auxiliary lighting apparatus (for example, LED), a tactile feedback apparatus (for example, a vibration motor), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light emitting diode (LED) display, and a plasma display. In some embodiments, the display device may be a touch screen.
Various embodiments of the systems and technologies described herein may be implemented in digital electronic circuit systems, integrated circuit systems, dedicated application specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that may be executed and/or interpreted on a programmable system that includes at least one programmable processor. The programmable processor may be a dedicated or general purpose programmable processor, and may receive data and instructions from a storage system, at least one input apparatus, and at least one output apparatus, and transmit the data and instructions to the storage system, the at least one input apparatus, and the at least one output apparatus.
These computing programs, also referred to as programs, software, software applications, or code, include machine instructions of a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms “machine readable medium” and “computer readable medium” refer to any computer program product, device, and/or apparatus (for example, magnetic disk, optical disk, memory, programmable logic device (PLD)) used to provide machine instructions and/or data to the programmable processor, including a machine readable medium that receives machine instructions as machine readable signals. The term “machine readable signal” refers to any signal used to provide machine instructions and/or data to the programmable processor.
To provide interaction with a user, the systems and technologies described herein may be implemented on a computer having: a display apparatus for displaying information to the user, such as a cathode ray tube (CRT) or a liquid crystal display (LCD) monitor; and a keyboard and a pointing apparatus, such as a mouse or a trackball, with which the user may provide input to the computer. Other types of apparatuses may also be used to provide interaction with the user. For example, the feedback provided to the user may be any form of sensory feedback, such as visual feedback, auditory feedback, or tactile feedback; and input from the user may be received in any form, including acoustic input, voice input, or tactile input.
The systems and technologies described herein may be implemented in a computing system that includes backend components, e.g., as a data server, or in a computing system that includes middleware components, e.g., an application server, or in a computing system including front-end components, e.g., a user computer having a graphical user interface or a web browser through which a user may interact with embodiments of the systems and technologies described herein, or in a computing system including any combination of such backend components, middleware components, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include: Local Area Network (LAN), Wide Area Network (WAN), the Internet, and blockchain networks.
The computer system may include a client and a server. The client and the server are generally far from each other and usually interact through a communication network. The relationship between the client and the server is generated by computer programs that run on the corresponding computers and have a client-server relationship with each other.
The above description only provides an explanation of the preferred embodiments of the present disclosure and the technical principles used herein. It should be appreciated by those skilled in the art that the inventive scope of the present disclosure is not limited to the technical solutions formed by the particular combinations of the above-described technical features. The inventive scope should also cover other technical solutions formed by any combinations of the above-described technical features or equivalent features thereof without departing from the concept of the disclosure, for example, technical schemes formed by the above-described features being interchanged with, but not limited to, technical features with similar functions disclosed in embodiments of the present disclosure.
Number | Date | Country | |
---|---|---|---|
20220197288 A1 | Jun 2022 | US |