The present disclosure relates to the field of robot technologies, and more particularly, to a method, an apparatus, and a storage medium for robot control.
With the development of science and technology, more and more robots are used in people's daily lives, such as common sweeping robots and mopping robots, as well as more advanced robot housekeepers. Applications of these robots in the household make people's lives more comfortable and convenient. Machine vision, as an aid, can help robots obtain information about scenes within their fields of view.
At present, control of operations of a robot is mainly based on an application on a mobile phone or buttons on a machine body of the robot.
However, the usage of the application on the mobile phone relies heavily on the mobile phone and a network used to transmit control instructions. Thus, the usage of the application is inconvenient and not fast enough. In addition, the buttons on the machine body of the robot provide only limited functions. Therefore, conventional robot control approaches based on the application on the mobile phone or the buttons on the machine body of the robot are relatively complicated and cumbersome, resulting in poor user experience.
An object of the present disclosure is to provide a robot control method, a robot control apparatus, and a storage medium to overcome the above-mentioned defects in the related art. The object is realized by the following embodiments.
In one embodiment of the present disclosure, a robot control method is provided. The method includes: obtaining a first scene image of a robot, and detecting whether the first scene image includes a foot; obtaining multiple frames of second scene images of the robot consecutively in response to detecting the foot in a predetermined quantity of consecutive frames of first scene images; and recognizing a foot posture based on the multiple frames of second scene images, and controlling the robot based on a control manner corresponding to the recognized foot posture.
In some embodiments of the present disclosure, the detecting whether the first scene image includes the foot includes: inputting the first scene image into a trained first neural network model, the first neural network model being configured to detect whether the first scene image includes the foot, and output a detection result.
In some embodiments of the present disclosure, the method further includes a training process of the first neural network model, the training process including: obtaining an image, captured by the camera, containing the foot as a positive sample, and obtaining an image, captured by the camera, without the foot as a negative sample; and obtaining the first neural network model by training a pre-established classification model using the positive sample and the negative sample.
In some embodiments of the present disclosure, the recognizing the foot posture based on the multiple frames of second scene images includes: inputting the multiple frames of second scene images that are obtained consecutively into a trained second neural network model, and recognizing the foot posture by the second neural network model based on the multiple frames of second scene images.
In some embodiments of the present disclosure, the recognizing the foot posture by the second neural network model based on the multiple frames of second scene images includes: obtaining multiple frames of feature maps by performing, using a feature extraction module in the second neural network model, a feature extraction sequentially on the multiple frames of second scene images; obtaining, by a temporal shift module in the second neural network model, the multiple frames of feature maps from the feature extraction module, and obtaining multiple frames of shifted feature maps by performing, using the temporal shift module in the second neural network model, a temporal shift on each of the multiple frames of feature maps; and obtaining, by a recognition module in the second neural network model, the multiple frames of shifted feature maps from the temporal shift module, obtaining, by the recognition module in the second neural network model, the multiple frames of feature maps from the feature extraction module, and recognizing, by the recognition module in the second neural network model, the foot posture based on the multiple frames of shifted feature maps and the multiple frames of feature maps.
In some embodiments of the present disclosure, the obtaining the multiple frames of shifted feature maps by performing, using the temporal shift module in the second neural network model, the temporal shift on each of the multiple frames of feature maps includes: for each of frames of feature maps ranging from a first frame of feature map to a penultimate frame of feature map in the multiple frames of feature maps, shifting features of part of channels in the feature map to corresponding channels of a successively subsequent frame of feature map to obtain the multiple frames of shifted feature maps.
In some embodiments of the present disclosure, the recognizing, by the recognition module in the second neural network model, the foot posture based on the multiple frames of shifted feature maps and the multiple frames of feature maps includes: performing, by a convolutional layer in the recognition module, a convolution operation on each of the multiple frames of shifted feature maps; obtaining, by a merging layer in the recognition module, each of multiple frames of convolved feature maps from the convolutional layer, and obtaining multiple frames of merged feature maps by merging, by the merging layer in the recognition module, each of the multiple frames of convolved feature maps with a corresponding one of the multiple frames of feature maps; and obtaining, by a fully connected layer in the recognition module, the multiple frames of merged feature maps from the merging layer, and obtaining, by the fully connected layer in the recognition module, a foot posture recognition result based on the multiple frames of merged feature maps.
In some embodiments of the present disclosure, the obtaining the multiple frames of feature maps by performing, using the feature extraction module in the second neural network model, the feature extraction sequentially on the multiple frames of second scene images includes obtaining the multiple frames of feature maps by performing, using a convolutional module with an attention enhancement mechanism in the feature extraction module, the feature extraction sequentially on the multiple frames of second scene images.
In some embodiments of the present disclosure, the convolutional module with the attention enhancement mechanism includes an attention module arranged between at least one pair of adjacent convolutional layers. The attention module includes a channel attention module, a first fusion layer, a spatial attention module, and a second fusion layer. The method further includes: obtaining, by the channel attention module, a channel weight based on a feature map outputted by a previous convolutional layer; obtaining a first fusion feature map by fusing, by the first fusion layer, the channel weight to the feature map outputted by the previous convolutional layer; obtaining, by the spatial attention module, a spatial position weight based on the first fusion feature map outputted by the first fusion layer; and obtaining a second fusion feature map by fusing, by the second fusion layer, the spatial position weight to the first fusion feature map, and inputting, by the second fusion layer, the second fusion feature map to a next convolutional layer.
In some embodiments of the present disclosure, the method further includes a training process of the second neural network model, the training process including: obtaining multiple video segments each containing the foot that are captured by the camera, and labeling a predetermined foot posture contained in each of the multiple video segments; and obtaining the second neural network model by training a pre-established action recognition model using multiple labeled video segments.
In some embodiments of the present disclosure, subsequent to the obtaining the second neural network model, the method further includes: performing an integer quantization on at least one model parameter of the second neural network model.
In some embodiments of the present disclosure, the controlling the robot based on the control manner corresponding to the recognized foot posture includes: controlling, based on the recognized foot posture being a first predetermined posture, the robot to initiate a cleaning mode to start cleaning; controlling, based on the recognized foot posture being a second predetermined posture, the robot to stop cleaning; controlling, based on the recognized foot posture being a third predetermined posture, the robot to access a target-tracking mode; and controlling, based on the recognized foot posture being a fourth predetermined posture, the robot to perform cleaning in a predetermined range around a position of the foot.
In one embodiment of the present disclosure, a robot control apparatus is provided. The apparatus includes a memory, a processor, and a computer program stored on the memory and executable on the processor. The processor is configured to implement, when executing the computer program, the steps of the method in the above-mentioned embodiments.
In one embodiment of the present disclosure, a robot is provided. The robot includes: the robot control apparatus in the above-mentioned embodiments; and a camera configured to capture a scene image of the robot.
In one embodiment of the present disclosure, a computer-readable storage medium is provided. The computer-readable storage medium has a computer program stored thereon. The program, when executed by a processor, implements the steps of the method in the above-mentioned embodiments.
Based on the robot control method, the robot control apparatus, and the storage medium in the above-mentioned embodiments, embodiments of the present disclosure provide the following advantageous effects or advantages.
In a scene image captured by the camera mounted at the robot, when the foot is captured within a field of view of the robot, a specific foot posture is recognized using a video captured by the camera. Therefore, intelligent control of the robot is realized based on the recognized foot posture. Compared with a conventional control method based on an application on a mobile phone or pressing buttons on a machine body of the robot, the method provided in this disclosure is more convenient, efficient, and intelligent, which can greatly improve user experience. In addition, since the intelligent control can be realized using an existing camera of the robot, intelligent control of a product can be realized without additional costs, which further improves the user experience.
The accompanying drawings are merely used for the purpose of illustrating the embodiments, and should not be construed as a limitation on the present disclosure. Moreover, the same components are indicated by the same reference signs throughout the accompanying drawings.
Embodiments of the present disclosure will be further described with reference to the accompanying drawings and in conjunction with the embodiments.
Embodiments of the present disclosure will be described clearly and completely below in combination with accompanying drawings of the embodiments of the present disclosure. Apparently, the embodiments described below are only a part of the embodiments of the present disclosure, rather than all embodiments of the present disclosure.
It should be noted that all directional indications (such as up, down, left, right, front, rear, etc.) in the embodiments of the present disclosure are only used to explain relative positions between various components, movements of various components, or the like under a predetermined posture (as illustrated in the figures). When the predetermined posture changes, the directional indications also change accordingly.
In addition, in the present disclosure, descriptions associated with “first”, “second”, or the like are only used for descriptive purposes, and cannot be understood as indicating or implying relative importance or implicitly indicating the number of indicated features. Therefore, the features associated with “first” and “second” may explicitly or implicitly include at least one of the features. In the description of the present disclosure, “plurality” means at least two, unless otherwise specifically defined.
In the present disclosure, unless otherwise clearly stipulated and limited, terms such as “connect”, “fix”, or the like should be understood in a broad sense. For example, “fix” may mean a fixed connection or a detachable connection or connection as one piece; mechanical connection or electrical connection; direct connection or indirect connection through an intermediate; or internal communication of two components or the interaction relationship between two components, unless otherwise clearly limited. Specific meanings of the above-mentioned terms in the present disclosure can be understood according to specific circumstances.
In addition, the technical solutions of the various embodiments of the present disclosure may be combined with each other. When a combination of the technical solutions is contradictory or unattainable, the combination of the technical solutions neither exists nor falls within the protection scope of the appended claims of the present disclosure.
To solve a current problem of inconvenience caused by controlling a robot using an application on a mobile phone or buttons on a machine body of the robot, an improved robot control method is provided according to the present disclosure. In particular, a first scene image captured by a camera of a robot is obtained. Whether the first scene image includes a foot is detected. Multiple frames of second scene images captured by the camera are obtained consecutively in response to detecting the foot in a predetermined quantity of consecutive frames of first scene images. Therefore, a foot posture is recognized based on the multiple frames of second scene images. The robot is controlled based on a control manner corresponding to the recognized foot posture.
The following effects can be achieved based on the above description.
In a scene image captured by the camera mounted at the robot, when the foot is captured within a field of view of the robot, a specific foot posture is recognized using a video captured by the camera. Therefore, intelligent control of the robot is realized based on the recognized foot posture. Compared with a conventional control method based on an application on a mobile phone or pressing buttons on a machine body of the robot, the method provided in this disclosure is more convenient, efficient, and intelligent, which can greatly improve user experience. In addition, since the intelligent control can be realized using an existing camera of the robot, intelligent control of a product can be realized without additional costs, which further improves the user experience.
As illustrated in a schematic structural view of a robot in the accompanying drawings, the robot includes a robot body 10, a robot control apparatus 20, and a camera 30.
In particular, the robot body 10 is configured to move and is electrically connected to the robot control apparatus 20. The camera 30 is arranged at a side of the robot body 10 facing away from the ground. That is, the camera 30 is arranged directly in front of the robot. During a movement of the robot, the camera 30 may be controlled to take pictures of scenes in front of the robot to realize obstacle avoidance and path planning.
In one embodiment, the robot control apparatus 20 may be independent of the robot body 10 or, of course, may be integrated within the robot body 10. The present disclosure is not limited in this regard.
It should be noted that the robot body 10 is provided with structures such as a movement module, a control module, and various sensors, to realize environmental map construction and path planning for the robot.
Based on the above functional description of the respective structures, a control principle of the robot control apparatus 20 is as follows. The first scene image captured by the camera of the robot is obtained. Whether the first scene image includes the foot is detected. The multiple frames of second scene images captured by the camera are obtained consecutively in response to detecting the foot in the predetermined quantity of consecutive frames of first scene images. Therefore, the foot posture is recognized based on the multiple frames of second scene images. The robot is controlled based on the control manner corresponding to the recognized foot posture.
A robot control method according to an embodiment of the present disclosure will be described below with reference to the accompanying drawings.
At block 201, a first scene image captured by a camera of a robot is obtained, and whether the first scene image includes a foot is detected.
Whether the robot is in a movement state or a stopped state, scene images of the scene around the robot are captured in real time by the camera mounted at the robot. When a person enters the field of view of the camera, only a foot of the person can be captured by the camera due to the limited height of the robot.
In one embodiment, for a process of detecting whether the first scene image includes the foot, the first scene image may be inputted into a trained first neural network model, the first neural network model being configured to detect whether the first scene image includes the foot, and output a detection result. Detecting whether the foot appears within the field of view using the neural network is simple and accurate.
Since the first neural network model only needs to determine whether a human foot appears in the image, a simple binary classification model may be adopted as the first neural network model.
In one embodiment, due to limited computing power of the robot itself, the binary classification model may be constructed using a deep convolutional network plus a small quantity of fully connected layers.
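For illustration, the following is a minimal sketch of such a binary classification model, assuming PyTorch; the layer sizes and input resolution are illustrative assumptions and are not specified by the present disclosure.

```python
import torch.nn as nn

class FootDetector(nn.Module):
    """Small binary classifier: deep convolutional backbone plus a few fully connected layers."""

    def __init__(self):
        super().__init__()
        # Lightweight convolutional backbone suited to the limited computing power of a robot.
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        # A small quantity of fully connected layers producing a single logit:
        # foot present vs. no foot.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64, 32), nn.ReLU(inplace=True),
            nn.Linear(32, 1),
        )

    def forward(self, x):
        # x: (batch, 3, H, W) first scene images; returns one raw logit per image.
        return self.classifier(self.features(x))
```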
In this embodiment, the first neural network model needs to be pre-trained before being applied. A training process of the first neural network model includes: obtaining an image, captured by the camera, containing the foot as a positive sample, and obtaining an image, captured by the camera, without the foot as a negative sample; and obtaining the first neural network model by training a pre-established classification model using the positive sample and the negative sample.
During capturing of positive and negative samples, different persons wearing different shoes may make some predetermined postures with their feet within the field of view of the camera of the robot, and the camera captures these postures as a video. Images containing the feet are taken from the video as positive samples, while images containing no feet are taken from the video as negative samples.
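A minimal training sketch follows, assuming PyTorch and image tensors paired with binary labels (1 for a positive sample containing the foot, 0 for a negative sample); the hyperparameters are illustrative only.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def train_foot_detector(model, images, labels, epochs=10, lr=1e-3):
    # images: (N, 3, H, W) float tensor; labels: (N,) tensor of 0/1 sample labels.
    loader = DataLoader(TensorDataset(images, labels.float()), batch_size=32, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.BCEWithLogitsLoss()  # binary cross-entropy on the single output logit
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = criterion(model(x).squeeze(1), y)
            loss.backward()
            optimizer.step()
    return model
```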
At block 202, multiple frames of second scene images captured by the camera are obtained consecutively in response to detecting the foot in a predetermined quantity of consecutive frames of first scene images.
By requiring that the foot posture is recognized only subsequent to a detection of the foot in multiple consecutive frames of first scene images, false triggering caused by a false detection or by a user's unintentional entrance into the field of view of the camera can be avoided.
Further, since it takes a few seconds for a person to complete some specific foot postures, multiple consecutive frames of images need to be obtained for recognition of some specific foot postures.
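The consecutive-frame condition may be implemented as a simple counter, as in the following sketch; the threshold of five frames is an assumption, not a value fixed by the present disclosure.

```python
class ConsecutiveFootGate:
    """Triggers posture recognition only after the foot is detected in N consecutive frames."""

    def __init__(self, required_frames=5):
        self.required_frames = required_frames
        self.count = 0

    def update(self, foot_detected: bool) -> bool:
        # Reset the counter whenever a frame without the foot is seen.
        self.count = self.count + 1 if foot_detected else 0
        return self.count >= self.required_frames
```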
It is conceivable for those skilled in the art that the above-mentioned “first scene image” and “second scene image” both belong to the scene images captured by the camera, except that the first scene image is captured earlier than the second scene image. That is, a specific foot posture recognition is performed only when the human foot is detected by the robot within the field of view of the camera.
At block 203, a foot posture is recognized based on the multiple frames of second scene images, and the robot is controlled based on a control manner corresponding to the recognized foot posture.
In one embodiment, for a recognition process of the foot posture, the multiple frames of second scene images that are obtained consecutively may be inputted into a trained second neural network model, and the foot posture is recognized by the second neural network model based on the multiple frames of second scene images.
For a specific process of recognizing the foot posture by the second neural network model, reference can be made to relevant description in the following embodiments, which will not be described in detail herein.
The second neural network model belongs to an action recognition model. The second neural network model needs to be pre-trained before being applied. The training process of the second neural network model includes: obtaining multiple video segments each containing the foot that are captured by the camera, and labeling a predetermined foot posture contained in each of the multiple video segments; and obtaining the second neural network model by training a pre-established action recognition model using multiple labeled video segments.
In one embodiment, during collection of training data, different persons wearing different shoes may make some predetermined postures with their feet within the field of view of the camera of the robot, and the camera captures these postures as videos. Each video is labeled with a foot posture label.
Types of valid foot postures that can be recognized by the second neural network model may be set as desired. Assuming that the robot has four control functions, data sets containing four foot postures are collected to train the action recognition model during training of the second neural network model. Foot postures other than the four foot postures are recognized as invalid foot postures by the second neural network model.
In one embodiment, when four valid foot postures can be recognized by the second neural network model, for a process of controlling the robot based on the control manner corresponding to the recognized foot posture, the robot is controlled to initiate a cleaning mode to start cleaning, when the recognized foot posture is a first predetermined posture. When the recognized foot posture is a second predetermined posture, the robot is controlled to stop cleaning. When the recognized foot posture is a third predetermined posture, the robot is controlled to access a target-tracking mode, to enable the robot to track the user's foot and reach a specific position. When the recognized foot posture is a fourth predetermined posture, the robot is controlled to perform intensive cleaning in a predetermined range around a position of the foot.
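For illustration, the mapping from a recognized posture to a control manner may be organized as a dispatch table, as sketched below; the robot-API method names are hypothetical placeholders and are not defined by the present disclosure.

```python
def handle_posture(robot, posture_id):
    # Map each valid posture serial number to a control action.
    # The method names on `robot` are hypothetical placeholders.
    actions = {
        1: robot.start_cleaning,         # first predetermined posture: initiate cleaning mode
        2: robot.stop_cleaning,          # second predetermined posture: stop cleaning
        3: robot.enter_target_tracking,  # third predetermined posture: track the user's foot
        4: robot.clean_around_foot,      # fourth predetermined posture: intensive local cleaning
    }
    action = actions.get(posture_id)
    if action is not None:
        action()
    # Any other posture is treated as an invalid posture and ignored.
```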
It is conceivable for those skilled in the art that the above four control functions are only exemplary illustrations. The control functions in the present disclosure include, but are not limited to, the four functions of starting cleaning, stopping cleaning, target tracking, and intensive cleaning.
It should be noted that since model parameters contained in each layer of the trained second neural network model are all of floating-point types but the robot has limited computing power, a problem of an extremely low operation efficiency occurs if the robot is equipped with a floating-point format model. To eliminate an effect on the operation efficiency of the robot, in the present disclosure, an integer quantization may be further performed on at least one model parameter of the second neural network model, subsequent to obtaining the second neural network model. Therefore, deploying the second neural network model, obtained after performing the integer quantization, to the robot can occupy less resources, realizing an efficient and stable recognition.
In one embodiment, a calculation formula for the integer quantization is as follows:

Q = R/S + Z,

where Q represents a quantized integer parameter, R represents a parameter of floating-point type in the model, S = (Rmax − Rmin)/(Qmax − Qmin) represents a quantization scale, and Z = Qmax − Rmax/S represents a zero point of the quantization.
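A minimal sketch of this quantization is given below, assuming NumPy and an unsigned 8-bit target range; the bit width is an assumption, since the present disclosure does not fix one.

```python
import numpy as np

def quantize(weights, q_min=0, q_max=255):
    # weights: floating-point model parameters R.
    r_min, r_max = float(weights.min()), float(weights.max())
    scale = (r_max - r_min) / (q_max - q_min)   # S = (Rmax - Rmin) / (Qmax - Qmin)
    zero_point = q_max - r_max / scale          # Z = Qmax - Rmax / S
    q = np.clip(np.round(weights / scale + zero_point), q_min, q_max)  # Q = R / S + Z
    return q.astype(np.uint8), scale, zero_point

def dequantize(q, scale, zero_point):
    # Approximate recovery of the floating-point parameters: R ~ (Q - Z) * S.
    return (q.astype(np.float32) - zero_point) * scale
```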
Based on the above-mentioned block 201 to block 203, in an embodiment, to minimize resource occupation of the robot and reduce the energy consumption burden, the complex second neural network model remains dormant, while only the simple first neural network model runs to detect whether the first scene image includes the foot. When the foot is detected in the multiple consecutive frames of first scene images, the second neural network model is woken up to recognize the foot posture based on the multiple frames of second scene images that are further captured.
Further, when no valid foot posture is recognized by the second neural network model for a sustained period of time, the second neural network model is controlled to re-enter the dormant state, and the first neural network model is activated.
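The dormancy scheme described above may be organized as the following control-loop sketch, reusing the consecutive-frame gate from the earlier sketch; the frame count, timeout, and helper names such as capture_frame() are hypothetical assumptions used only for illustration.

```python
import time

def control_loop(detect_foot, recognize_posture, capture_frame, gate, timeout_s=10.0):
    # detect_foot: first neural network model; recognize_posture: second neural network model.
    while True:
        # Dormant phase: only the lightweight first model runs on each first scene image.
        if not gate.update(detect_foot(capture_frame())):
            continue
        # Wake phase: the second model recognizes postures from consecutive second scene images.
        last_valid = time.time()
        while time.time() - last_valid < timeout_s:
            posture_id = recognize_posture([capture_frame() for _ in range(8)])
            if posture_id is not None:   # a valid foot posture was recognized
                yield posture_id
                last_valid = time.time()
        # No valid posture for a sustained period: the second model re-enters the dormant state.
```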
Therefore, a control flow as illustrated in the accompanying drawings can be obtained.
At block 301, multiple frames of feature maps are obtained by performing, using a feature extraction module in the second neural network model, a feature extraction sequentially on the multiple frames of second scene images.
In one embodiment, the multiple frames of feature maps are obtained by performing, using a convolutional module with an attention enhancement mechanism in the feature extraction module, the feature extraction sequentially on the multiple frames of second scene images, in such a manner that the model can focus more on effective information.
The convolutional module includes multiple convolutional layers. The convolutional module with the attention enhancement mechanism includes an attention module arranged between at least one pair of adjacent convolutional layers. The attention module may be configured to calculate a channel weight and a spatial position weight based on input features, and then weight the input features without changing shapes of the input features.
In one embodiment, as illustrated in the accompanying drawings, the attention module includes a channel attention module, a first fusion layer, a spatial attention module, and a second fusion layer. The channel attention module obtains a channel weight based on a feature map outputted by a previous convolutional layer. The first fusion layer fuses the channel weight to the feature map outputted by the previous convolutional layer to obtain a first fusion feature map. The spatial attention module obtains a spatial position weight based on the first fusion feature map. The second fusion layer fuses the spatial position weight to the first fusion feature map to obtain a second fusion feature map, and inputs the second fusion feature map to a next convolutional layer.
Both the first fusion layer and the second fusion layer perform a product operation. That is, each fusion layer performs a product operation on its two pieces of input information.
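A minimal sketch of such an attention module, assuming PyTorch, is given below; the pooling choices, reduction ratio, and kernel size are illustrative assumptions rather than values specified by the present disclosure.

```python
import torch
import torch.nn as nn

class AttentionModule(nn.Module):
    """Channel attention, first fusion, spatial attention, second fusion; shape is unchanged."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        # Channel attention module: one weight per channel from globally pooled features.
        self.channel_attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1), nn.Sigmoid(),
        )
        # Spatial attention module: one weight per spatial position from channel-pooled features.
        self.spatial_attention = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid(),
        )

    def forward(self, x):
        # First fusion layer: product of the channel weight and the input feature map.
        x = x * self.channel_attention(x)
        # Spatial position weight computed from the first fusion feature map.
        pooled = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        # Second fusion layer: product of the spatial weight and the first fusion feature map.
        return x * self.spatial_attention(pooled)
```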
At block 302, the multiple frames of feature maps are obtained by a temporal shift module in the second neural network model from the feature extraction module, and multiple frames of shifted feature maps are obtained by performing, using the temporal shift module in the second neural network model, a temporal shift on each of the multiple frames of feature maps.
Performing the temporal shift on each of the multiple frames of feature maps can effectively fuse temporal information between the multiple frames of feature maps without increasing computational effort, in such a manner that the shifted feature maps can obtain a temporal modeling ability.
In one embodiment, for each of frames of feature maps ranging from a first frame of feature map to a penultimate frame of feature map in the multiple frames of feature maps, features of part of channels in the feature map are shifted to corresponding channels of a successively subsequent frame of feature map to obtain the multiple frames of shifted feature maps.
To ensure that the multiple frames of shifted feature maps are consistent in size, a feature of a removed channel in the first frame of feature map may be set as 0.
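A minimal sketch of this temporal shift is given below; the fraction of shifted channels (one eighth) is an assumption, as the present disclosure only states that part of the channels are shifted.

```python
import torch

def temporal_shift(feature_maps, shift_fraction=0.125):
    # feature_maps: (T, C, H, W), one feature map per frame in temporal order.
    t, c, h, w = feature_maps.shape
    n_shift = max(1, int(c * shift_fraction))   # quantity of channels shifted to the next frame
    shifted = feature_maps.clone()
    # Each frame, from the first to the penultimate, passes part of its channels to the next frame.
    shifted[1:, :n_shift] = feature_maps[:-1, :n_shift]
    # The channels removed from the first frame are set to 0 to keep all frames the same size.
    shifted[0, :n_shift] = 0
    return shifted
```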
Based on the temporal shift principle as illustrated in the accompanying drawings, each frame of shifted feature map, except the first frame, contains features of the corresponding frame and of its previous frame, in such a manner that temporal information between adjacent frames is fused.
At block 303, the multiple frames of shifted feature maps are obtained by a recognition module in the second neural network model from the temporal shift module, the multiple frames of feature maps are obtained by the recognition module in the second neural network model from the feature extraction module, and the foot posture is recognized by the recognition module in the second neural network model based on the multiple frames of shifted feature maps and the multiple frames of feature maps.
In one embodiment, a recognition process based on a structure of the recognition module as illustrated in the accompanying drawings includes the following operations.
1. A convolution operation is performed by a convolutional layer in the recognition module on each of the multiple frames of shifted feature maps, to fuse a shifted previous temporal feature with a current temporal feature. It should be noted that to ensure subsequent merging between each of the multiple frames of shifted feature maps and the feature map outputted by the feature extraction module, the quantity of channels and the size of the feature map outputted subsequent to convolution should be guaranteed to remain unchanged during a convolutional operation.
2. Each of multiple frames of convolved feature maps is obtained by a merging layer in the recognition module from the convolutional layer. Multiple frames of merged feature maps are obtained by merging, by the merging layer in the recognition module, each of the multiple frames of convolved feature maps with a corresponding one of the multiple frames of feature maps outputted by the feature extraction module. Since temporal modeling information across frames is contained in the convolved feature map and spatial modeling information is retained by the feature extraction module, merging the convolved feature map with the feature map outputted by the feature extraction module allows the merged feature map to have both temporal modeling information and spatial modeling information. Specifically, a merge operation performed on the merging layer is a feature addition operation.
3. The multiple frames of merged feature maps are obtained by a fully connected layer in the recognition module from the merging layer, and a foot posture recognition result is obtained by the fully connected layer in the recognition module based on the multiple frames of merged feature maps.
The foot posture recognition result outputted by the fully connected layer may be a serial number of a valid foot posture or a serial number of an invalid foot posture.
The valid foot posture refers to a foot posture that can be recognized by the recognition module. The invalid foot posture refers to a foot posture that cannot be recognized by the recognition module.
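For illustration, a minimal sketch of the recognition module is given below, assuming PyTorch; the channel count, frame aggregation by averaging, and class layout (four valid postures plus one invalid class) are assumptions used only for illustration.

```python
import torch
import torch.nn as nn

class RecognitionModule(nn.Module):
    """Convolution over shifted maps, additive merge with unshifted maps, fully connected output."""

    def __init__(self, channels=64, num_valid_postures=4):
        super().__init__()
        # Same in/out channels with padding=1 keep the channel quantity and size unchanged.
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        # One extra class serves as the serial number of an invalid posture.
        self.fc = nn.Linear(channels, num_valid_postures + 1)

    def forward(self, shifted_maps, feature_maps):
        # shifted_maps, feature_maps: (T, C, H, W), one map per frame of second scene image.
        fused = self.conv(shifted_maps)        # fuse the shifted previous-frame temporal features
        merged = fused + feature_maps          # merging layer: feature addition
        pooled = self.pool(merged).flatten(1)  # (T, C)
        return self.fc(pooled.mean(dim=0))     # aggregate over frames, then classify
```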
Therefore, a recognition process as illustrated in the accompanying drawings can be obtained.
Further, when no valid foot posture is recognized by the second neural network model for a sustained period of time, the second neural network model is controlled to re-enter the dormant state, and the first neural network model is activated to continue a foot detection.
According to the embodiments of the present disclosure, a robot control apparatus corresponding to the robot control method according to the above-mentioned embodiments is further provided to execute the robot control method described above.
The memory 703 referred to in the present disclosure may be any electronic, magnetic, optical, or other physical storage device, and may contain or store information such as executable instructions or data. Specifically, the memory 703 may be a Random Access Memory (RAM), a flash memory, a storage drive (e.g., a hard disk drive), any type of storage disk (e.g., an optical disc, a Digital Versatile Disc (DVD), etc.), or a similar storage medium, or a combination thereof. A communication connection between one system network element and at least one other network element is realized by at least one communication interface 701 (which may be a wired or wireless communication interface), using networks such as the Internet, a Wide Area Network (WAN), a Local Area Network (LAN), and a Metropolitan Area Network (MAN).
The bus 704 may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, etc. The bus may be classified into an address bus, a data bus, a control bus, etc. The memory 703 is configured to store a program. The processor 702 is configured to execute the program subsequent to a reception of an execution instruction.
The processor 702 may be an integrated circuit chip having signal processing capabilities. In an implementation, actions of the above method may be accomplished by an integrated logic circuit in hardware in the processor 702 or by instructions in the form of software. The above processor 702 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc.; or may be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic devices, a discrete gate or transistor logic device, or a discrete hardware component. The processor 702 may implement or perform the methods, actions, and logic block diagrams disclosed in any of the embodiments of the present disclosure. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor, etc. The actions of the method disclosed in combination with any of the embodiments of the present disclosure may be directly embodied as being performed by a hardware decoding processor or by a combination of a hardware module and a software module in a decoding processor.
The robot control apparatus according to the embodiments of the present disclosure is of the same concept as the robot control method according to the embodiments of the present disclosure and therefore provides the same advantageous effect as the robot control method adopted, performed, or implemented by the robot control apparatus.
According to the embodiment of the present disclosure, a computer-readable storage medium corresponding to the robot control method according to any of the above-mentioned embodiments is further provided. As illustrated in the accompanying drawings, the computer-readable storage medium has a computer program stored thereon. The computer program, when executed by a processor, implements the steps of the robot control method in the above-mentioned embodiments.
It should be noted that examples of the computer-readable storage medium may also include, but are not limited to, a Phase-Change Memory (PRAM), a Static RAM (SRAM), a Dynamic RAM (DRAM), other types of RAMs, a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a flash memory or other optical or magnetic storage media. Details thereof will be omitted here.
The computer-readable storage medium according to the embodiments of the present disclosure is of the same concept as the robot control method according to the embodiments of the present disclosure and therefore provides the same advantageous effect as the robot control method adopted, performed, or implemented by an application stored on the computer-readable storage medium.
The present disclosure is intended to encompass any variations, uses, or adaptations of the present disclosure, which follow the general principles of the present disclosure and include common knowledge or conventional means in the art that are not disclosed here. The description and embodiments are to be considered exemplary only, and the real scope and the essence of the present disclosure are defined by the claims as attached.
The present disclosure is a national phase application of International Application No. PCT/CN2021/136031, filed on Dec. 7, 2021, which claims priority to Chinese patent application No. 202111248999.8, filed on Oct. 26, 2021, the entire contents of which are hereby incorporated by reference.