This application relates to the field of artificial intelligence technologies, and in particular, to a data processing method and apparatus, and a device and a medium.
Computer Vision (CV) technology is a science that studies how to make a machine "see"; more specifically, it uses a camera and a computer in place of human eyes to perform machine vision tasks such as recognition and measurement, and further performs graphics processing so that the computer turns the captured graphics into an image more suitable for human eyes to observe or into an image transmitted to an instrument for detection. As a scientific discipline, CV studies related theories and technologies and attempts to establish an AI system that can acquire information from images or multidimensional data.
Pose estimation can detect positions of various key points in an image or a video, which has wide application value in fields such as movie animation, assisted driving, virtual reality, and action recognition.
In current pose estimation algorithms, key point detection can be performed on the image or the video, and a final object pose can be constructed based on detected key points and object constraint relationships.
Examples of this application provide a data processing method and apparatus, and a device and a medium, which can improve the accuracy of estimating an object pose.
The examples of this application provide a data processing method, performed by a computer, including:
acquiring an object pose detection result corresponding to an object in an image frame, and a part pose detection result corresponding to a first object part of the object in the image frame, where at least one object part of the object is missing from the object pose detection result, and the first object part is one or more parts of the object; and
performing interpolation processing on the at least one object part missing from the object pose detection result according to the part pose detection result and a standard pose associated with the object to obtain a global pose corresponding to the object, where the global pose is used for controlling a computer to realize a service function corresponding to the global pose.
The examples of this application further provide a data processing apparatus, including:
The examples of this application further provide a computer, including a memory and a processor, the memory is connected to the processor, the memory is configured to store a computer program, and the processor is configured to invoke the computer program, so that the computer performs the method in the examples of this application.
The examples of this application further provide a non-transitory computer readable storage medium. The non-transitory computer readable storage medium stores a computer program. The computer program is adapted to be loaded and executed by a processor, so that the computer having the processor performs the method in the examples of this application.
The examples of this application further provide a computer program product or a computer program. The computer program product or the computer program includes computer instructions. The computer instructions are stored in a non-transitory computer readable storage medium. A processor of a computer reads the computer instructions from the non-transitory computer readable storage medium, and the processor executes the computer instructions, so that the computer performs the method.
The following clearly and completely describes the technical solutions in examples of this application with reference to the drawings in the examples of this application. Apparently, the described examples are merely part rather than all examples of this application. All other examples obtained by those of ordinary skill in the art based on the examples of this application without creative efforts fall within the scope of protection of this application.
This application relates to pose estimation under computer vision technology. The pose estimation is an important task in computer vision, and is also an essential step for a computer to understand an action and a behavior of an object. The pose estimation may be transformed into a problem about predicting object key points. For example, position coordinates of various object key points in an image may be predicted, and an object skeleton in the image may be predicted according to positional relationships among the various object key points. The pose estimation involved in this application may include object pose estimation for an object, part pose estimation for a specific part of the object, and the like. The object may include, but is not limited to, a human body, an animal, a plant, and the like. The specific part of the object may be a palm, a face, an animal limb, a plant root, and the like. This application does not limit a type of the object.
When an image or a video is shot in a mobile terminal scenario, the picture of the image or the video may contain only some parts of the object. Then, in the process of performing pose estimation on those parts, because some parts of the object are missing, the extracted part information is insufficient, so that the final object pose result is not a complete pose of the object, which affects the integrity of the object pose.
In the examples of this application, an object pose detection result for an object and a part pose detection result for a first object part of the object can be obtained by respectively performing object pose estimation and specific part pose estimation on the object in an image frame; pose estimation can then be performed on the object in the image frame based on the object pose detection result, the part pose detection result, and a standard pose, and part key points missing from the object in the image frame can be compensated, which can ensure the integrity and the rationality of the finally obtained global pose of the object and thereby improve the accuracy of estimating the global pose.
Refer to
The server 10d may be an independent physical server, or may be a server cluster or a distributed system composed of a plurality of physical servers, or may be a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a content delivery network (CDN), big data, and an artificial intelligence platform.
Each of the user terminal 10a, the user terminal 10b, the user terminal 10c, and the like may include: electronic devices having an object pose estimation function, such as a smart phone, a tablet, a laptop, a palmtop computer, a mobile Internet device (MID), a wearable device (for example, a smart watch and a smart bracelet), a smart voice interaction device, a smart home appliance (for example, a smart television), and an on-board device. As shown in
The user terminal (for example, the user terminal 10a) in the user terminal cluster as shown in
An object pose estimation process involved in the examples of this application may be performed by a computer. The computer may be a user terminal in the user terminal cluster shown in
Refer to
For example, when the object in the video data 20a is a human body, the object key points corresponding to the object may be considered as joint points in a human body structure. A key point quantity and a key point class of the object may be pre-defined, for example, the human body structure may include a plurality of object key points of parts including the limbs, the head, the waist, and the chest. When the image frame T1 contains a complete object, the image frame T1 may contain all object key points of the object. When the image frame T1 only contains some structures of the object, the image frame T1 may contain only some of the object key points of the object. After the object key points contained in the image frame T1 are detected, the detected object key points may be connected according to the key point class and key point positions of the object, and a result after connection may be marked in the image frame T1, that is, an object pose detection result 20c. The object detection model 20b may be a pre-trained network model and has an object detection function for a video/image. When the object is a human body, the object detection model 20b may also be referred to as a human body pose estimation model.
A human body pose 20j of an object in the image frame T1 may be obtained through the object pose detection result 20c. Because some object key points of the human body pose 20j are missing (that is, some human joint points are missing), the user terminal 10a may acquire a standard pose 20k corresponding to the object, and key point compensation may be performed on the human body pose 20j based on the standard pose 20k to obtain a human body pose 20m corresponding to the object in the image frame T1. The standard pose 20k may also be considered as a default pose of the object, or referred to as a reference pose. The standard pose 20k may be pre-constructed based on all object key points of the object, for example, the pose (for example, the global pose) when the human body is standing normally may be determined as the standard pose 20k.
The image frame T1 may also be inputted into the part detection model 20d. A specific part (for example, a first object part) of the object in the image frame T1 is detected through the part detection model 20d to obtain a part pose detection result 20e corresponding to the image frame T1. When it is detected that there is no first object part of the object in the image frame T1, the part pose detection result of the image frame T1 may be determined as null. When it is detected that there is a first object part of the object in the image frame T1, detection may continue to obtain key points of the first object part and the positions of those key points, the detected key points of the first object part may be connected according to the key point class and key point positions of the first object part, and a result after connection is marked in the image frame T1, that is, a part pose detection result 20e. A key point quantity and a key point class corresponding to the first object part may also be pre-defined. When the object is a human body, the part detection model 20d may be a palm pose estimation model (the first object part is a palm here), for example, the palm may include palm center key points and finger key points. The part detection model 20d may be a pre-trained network model and has an object part detection function for a video/image. For the convenience of description, the key points of the first object part are referred to as part key points below.
As shown in
Further, the user terminal 10a may perform interpolation processing on some missing object parts in combination with the object pose detection result 20c and the part pose detection result 20e, and obtain rational object key points through the interpolation processing. For example, when the part pose detection result 20e includes palm key points, the interpolation processing may be performed on parts of the object, such as the wrist and the elbow, that are missing from the image frame T1 in combination with the object pose detection result 20c and the part pose detection result 20e, so as to complete the human body pose 20m of the object and obtain a human body pose 20n (which may also be referred to as a global pose). Similarly, after obtaining the global pose corresponding to the object in the image frame T1, object pose estimation may be performed on a subsequent image frame in the video data 20a in the same manner to obtain a global pose corresponding to the object in each image frame, and a behavior of the object in the video data 20a may be obtained based on the global poses corresponding to the N image frames. It is to be understood that the video data 20a may also be a video shot in real time. The user terminal 10a may perform object pose estimation on the image frames in the video data shot in real time to acquire the behavior of the object in real time.
In summary, for an image frame that only contains part of the object, the global pose of the object in the image frame may be estimated through the object detection result outputted by the object detection model 20b, the part detection result outputted by the part detection model 20d, and the standard pose 20k, which can ensure the integrity and rationality of the finally obtained global pose of the object, thereby improving the accuracy of estimating the global pose.
Refer to
Step S101: Acquire an object pose detection result corresponding to an object in an image frame, and a part pose detection result corresponding to a first object part of the object. At least one object part of the object is missing from the object pose detection result, and the first object part is one or more parts of the object.
Specifically, a computer may acquire video data (for example, the video data 20a in an example corresponding to
For the convenience of description, the examples of this application describe an object pose estimation process for the video data or the image data by taking a human body as an example of the object. If object pose estimation is performed on image data in the mobile terminal scenario, the image data is taken as the image frame. If object pose estimation is performed on video data in the mobile terminal scenario, framing processing may be performed on the video data to obtain N image frames corresponding to the video data, N being a positive integer. Then, an image frame sequence containing the N image frames may be formed according to the time sequence of the N image frames in the video data, and object pose estimation may be performed on the N image frames in the image frame sequence in sequence. For example, after completion of the object pose estimation of the first image frame in the image frame sequence, object pose estimation may continue with the second image frame in the image frame sequence until the object pose estimation of the whole video data is completed.
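For illustration only, a minimal sketch of this framing step is given below, assuming the video data is read with OpenCV; the function name, file path, and the choice to keep every frame are assumptions made for the sketch rather than requirements of this application.

```python
import cv2  # assumed dependency used only for this illustrative sketch

def split_video_into_frames(video_path):
    """Split video data into an ordered image frame sequence (illustrative sketch)."""
    capture = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = capture.read()
        if not ok:            # no more frames in the video data
            break
        frames.append(frame)  # frames are kept in their time order in the video data
    capture.release()
    return frames             # the N image frames, processed one by one afterwards

# usage (hypothetical path): frames = split_video_into_frames("mobile_capture.mp4")
```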
The computer may acquire an object detection model and a part detection model, and input the image frame into the object detection model. An object pose detection result corresponding to the image frame may be outputted through the object detection model. Meanwhile, the image frame may also be inputted into the part detection model. A part pose detection result corresponding to the image frame may be outputted through the part detection model. The object detection model may be configured to detect key points of the object in the image frame (for example, human body key points, which may also be referred to as object key points). At this moment, the object detection model may also be referred to as a human body pose estimation model. The object detection model may include, but is not limited to: DensePose (a real-time human body pose recognition system, configured to realize real-time pose recognition of a dense population), OpenPose (a framework for real-time estimation of body, facial, and hand morphology of a plurality of persons), Realtime Multi-Person Pose Estimation (a real-time multi-person pose estimation model), DeepPose (a deep neural network-based pose estimation method), and mobilenetv2 (a lightweight deep neural network). The type of the object detection model is not limited by this application. The part detection model may be configured to detect key points of the first object part of the object (for example, palm key points). At this moment, the part detection model may also be referred to as a palm pose estimation model. The part detection model may be a detection-based method or a regression-based method. The detection-based method may predict part key points of the first object part by generating a heat map. The regression-based method may directly regress position coordinates of the part key points. The network structure of the part detection model and the network structure of the object detection model may be the same or may be different. When the network structure of the part detection model and the network structure of the object detection model are the same, network parameters of the two may also be different (obtained by training different data). The type of the part detection model is not limited by this application.
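As a small sketch of the detection-based method mentioned above, the coordinates of part key points can be read out of predicted heat maps by taking the location of the strongest response in each key point channel; the (K, H, W) heat map layout assumed below is illustrative and is not tied to any specific model listed here.

```python
import numpy as np

def decode_heatmaps(heatmaps):
    """Decode (K, H, W) heat maps into K (x, y, confidence) key points (illustrative)."""
    keypoints = []
    for channel in heatmaps:                       # one channel per key point class
        flat_index = np.argmax(channel)            # strongest response in this channel
        y, x = np.unravel_index(flat_index, channel.shape)
        confidence = float(channel[y, x])          # peak value reused as a confidence level
        keypoints.append((float(x), float(y), confidence))
    return keypoints
```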
In some examples, the object detection model and the part detection model may be detection models pre-trained by using sample data. For example, the object detection model may be trained by using sample data carrying human body key point label information (for example, a three-dimensional human body data set), and the part detection model may be trained by using sample data carrying palm key point label information (for example, a palm data set). Or, the object detection model may be an object detection service invoked from an artificial intelligence cloud service through an application programming interface (API), and the part detection model may be a part detection service invoked from the artificial intelligence cloud service through the API, which is not specifically limited here.
The artificial intelligence cloud service is also generally referred to as AI as a Service (AIaaS). This is a mainstream service manner for an artificial intelligence platform at present. Specifically, an AIaaS platform splits several common types of AI services and provides independent or packaged services at a cloud. This service manner is similar to opening an AI theme mall: all developers may access and use one or more artificial intelligence services provided by the platform in an API manner, and some experienced developers may also deploy, operate, and maintain their own exclusive cloud artificial intelligence services by using the AI framework and AI infrastructure provided by the platform.
In some examples, the object detection model used in the examples of this application may be a human body three-dimensional pose estimation model with a confidence level. For example, object key points of an object in an image frame may be predicted through the object detection model. Each predicted object key point may correspond to one first confidence level. The first confidence level may be used for characterizing the accuracy of the corresponding predicted object key point. The predicted object key points and the corresponding first confidence levels may be referred to as the object pose detection result corresponding to the image frame. The part detection model may be a palm three-dimensional pose estimation model carrying a confidence level. For example, the part detection model may predict a position area of the first object part in the image frame, and predict part key points of the first object part in the position area. The part detection model may predict one or more possible position areas where the first object part is located. One position area may correspond to one second confidence level. The second confidence level may be used for characterizing the accuracy of the corresponding predicted position area. The predicted part key points and the second confidence levels corresponding to the position areas may be referred to as the part pose detection result corresponding to the image frame.
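The snippet below is one possible, illustrative way to organize these two detection results and filter them by their confidence levels; the dictionary keys and threshold values are assumptions for the sketch, not values fixed by this application.

```python
def filter_detections(object_keypoints, part_position_areas,
                      first_confidence_threshold=0.5,
                      second_confidence_threshold=0.5):
    """Keep object key points whose first confidence level passes the threshold and
    part position areas whose second confidence level passes the threshold
    (illustrative sketch; key names and thresholds are assumptions)."""
    kept_keypoints = [kp for kp in object_keypoints
                      if kp["first_confidence"] >= first_confidence_threshold]
    kept_areas = [area for area in part_position_areas
                  if area["second_confidence"] >= second_confidence_threshold]
    return kept_keypoints, kept_areas
```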
Step S102: Perform interpolation processing on at least one object part missing from the object pose detection result according to the part pose detection result and a standard pose associated with the object to obtain a global pose corresponding to the object. The global pose is used for controlling a computer to realize a service function corresponding to the global pose.
Specifically, the computer may acquire the standard pose corresponding to the object (for example, the standard pose 20k in an example corresponding to
Refer to
The image frame may not contain a complete object. For example, when some parts of the object (for example, the lower limbs of the human body) are not in the image frame, some object key points are missing from the object pose detection result corresponding to the image frame, and key point compensation may be performed on the object pose detection result through the standard pose to complete the missing object key points and obtain a first candidate object pose corresponding to the object. When the part pose detection result includes part key points of the first object part, the first candidate object pose may be adjusted in combination with the part key points in the part pose detection result and the object key points in the object pose detection result to obtain the global pose of the object in the image frame. After obtaining the global pose corresponding to the current image frame, object pose estimation may continue with the next image frame in the video data to obtain the global pose of the object in each image frame of the video data.
In some examples, the computer may determine behavior actions of the object according to the global pose of the object in the video data. The object may be managed or cared for through these behavior actions, or human-machine interaction may be performed through the behavior actions of the object. In short, the global pose of the object in the video data may be applied to a human-machine interaction scenario (for example, virtual reality and human-machine animation), a content review scenario, an automatic driving scenario, a virtual live streaming scenario, and a game or movie character action design scenario. In the human-machine interaction scenario, an image (or a video) of a user (an object) may be collected. After the global pose in the image or the video is obtained, the control of a machine may be realized based on the global pose, for example, a specific instruction is executed based on a specific human body action (determined by the global pose). In a game character action design scenario, a human body action is acquired through the global pose corresponding to the object to replace an expensive motion capture device, which can reduce the cost and difficulty of game character action design.
The virtual live streaming scenario may mean that the live stream in a live streaming room does not directly play a video of an anchor user (the object); instead, a video of a virtual object with the same behavior actions as the anchor user is played in the live streaming room. For example, the behavior actions of the anchor user may be determined based on the global pose of the anchor user, and then a virtual object may be driven by the behavior actions of the anchor user, that is, a virtual object with the same behavior actions as the anchor user is constructed, and live streaming is performed by using the virtual object, which can not only prevent the anchor user from appearing in public view, but also achieve the same live streaming effect as a real anchor user. For example, the computer may construct a virtual object associated with the object according to the global pose of the object in the video data, and play the virtual object with the global pose in a multimedia application (for example, a live streaming room, a video website, or a short video application), that is, the video related to the virtual object may be played in the multimedia application, and the pose of the virtual object is synchronized with the pose of the object in the video data. The global pose corresponding to the object in the video data is reflected on the virtual object played in the multimedia application. Every time the pose of the object changes, the virtual object in the multimedia application is driven to transform into the same pose (which can be considered as reconstructing a virtual object with a new pose, where the new pose is the pose of the object after the change), so that the poses of the object and the virtual object are kept consistent all the time.
Refer to
After starting live streaming, video data of the anchor user 40c may be collected through a user terminal 40a (for example, a smart phone). At this moment, the anchor user 40c may be used as the object, and the user terminal 40a may be fixed by using a holder 40b. After the user terminal 40a collects the video data of the anchor user 40c, an image frame 40g may be acquired from the video data. The image frame 40g is inputted into each of the object detection model and the part detection model. Joint points (that is, object key points) of the parts of the anchor user 40c contained in the image frame 40g may be predicted by the object detection model. These predicted joint points may be used as an object pose detection result of the image frame 40g. Palm key points (here, the first object part is a palm by default, and the palm key points may also be referred to as part key points) of the anchor user 40c contained in the image frame 40g may be predicted through the part detection model. These predicted palm key points may be used as a part pose detection result of the image frame 40g. Here, the object pose detection result and the part pose detection result may be marked in the image frame 40g (shown as an image 40h). An area 40i and an area 40j in the image 40h represent the part pose detection result.
As shown in
The virtual object in the live streaming room may be driven through the overall human body pose 40m, so that the virtual object 40m in the live streaming room has the same overall human body pose 40k as the anchor user 40c. For a user entering the live streaming room to watch a live streaming video, a display page of the live streaming room where the virtual object is located may be displayed in a user terminal 40d used by the user. The display page of the live streaming room may include an area 40e and an area 40f. The area 40e may be used for playing a video of the virtual object (having the same pose as the anchor user 40c), and the area 40f may be used for posting a bullet comment and the like. In the virtual live streaming scenario, the user entering the live streaming room to watch the live streaming video can only see the video of the virtual object and hear the voice data of the anchor user 40c, but cannot see the video data of the anchor user 40c. Thus, personal information of the anchor user 40c can be protected, and the same live streaming effect as that of the anchor user 40c can be achieved through the virtual object.
In some examples, the global pose of the object in the video data may be applied to a content review scenario. When the global pose is the same as the pose in a content review system, a review result of the object in the content review system may be determined as a review approval result, and an access permission for the content review system may be set for the object. After the global pose is approved in the content review system, the object has the permission to access the content review system.
Refer to
The server 50d may acquire the to-be-verified image 50c transmitted by the user terminal 50a, and acquire a pose 50e set in the content review system by the user A in advance. The pose 50e may be used as verification information of the user A in the content review system. The server 50d may perform pose estimation on the to-be-verified image 50c by using the object detection model, the part detection model, and the standard pose to obtain the global pose of the user A in the to-be-verified image 50c. Similarity comparison is performed on the global pose corresponding to the to-be-verified image 50c and the pose 50e. When the similarity between the global pose of the to-be-verified image 50c and the pose 50e is greater than or equal to a similarity threshold value (for example, the similarity threshold value may be set as 90%), it may be determined that the global pose of the to-be-verified image 50c is the same as the pose 50e, and the user A is approved in the content review system. When the similarity between the global pose of the to-be-verified image 50c and the pose 50e is less than the similarity threshold value, it may be determined that the global pose of the to-be-verified image 50c is different from the pose 50e, the user A is not approved in the content review system, and action error prompt information is returned to the user terminal 50a. The action error prompt information is used for prompting the user A to redo actions for identity review.
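As an illustrative sketch of this similarity comparison, the snippet below measures the similarity between two poses as one minus a normalized mean key point distance and compares it with the 90% threshold from the example; the metric itself is an assumption, since this application does not fix a particular similarity measure.

```python
import numpy as np

def pose_similarity(pose_a, pose_b):
    """Similarity in [0, 1] between two poses given as (K, 2) key point arrays;
    the normalized mean key point distance used here is an illustrative assumption."""
    a = np.asarray(pose_a, dtype=float)
    b = np.asarray(pose_b, dtype=float)
    scale = np.linalg.norm(a.max(axis=0) - a.min(axis=0)) + 1e-8  # bounding-box diagonal of pose_a
    mean_distance = np.linalg.norm(a - b, axis=1).mean() / scale
    return max(0.0, 1.0 - mean_distance)

def review_approved(global_pose, registered_pose, similarity_threshold=0.9):
    """Approve the content review when the similarity reaches the threshold (90% in the example)."""
    return pose_similarity(global_pose, registered_pose) >= similarity_threshold
```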
In the examples of this application, an object pose detection result for an object and a part pose detection result for a first object part of the object can be obtained by respectively performing object pose estimation and specific part pose estimation on the object in an image frame; pose estimation can then be performed on the object in the image frame based on the object pose detection result, the part pose detection result, and a standard pose, and part key points missing from the object in the image frame can be compensated, which can ensure the integrity and the rationality of the finally obtained global pose of the object and thereby improve the accuracy of estimating the global pose.
Refer to
Step S201: Input an image frame into an object detection model, acquire an object pose feature corresponding to an object in the image frame by the object detection model, and recognize a first classification result corresponding to the object pose feature. The first classification result is used for characterizing an object part class corresponding to key points of the object.
Specifically, after acquiring the video data shot in a mobile terminal scenario, a computer may select an image frame from the video data, and input the image frame into a trained object detection model. An object pose feature corresponding to the object in the image frame may be acquired by the object detection model. The first classification result corresponding to the object pose feature may be outputted through a classifier of the object detection model. The first classification result may be used for characterizing an object part class corresponding to the key points of the object (for example, a human body joint). The object pose feature may be an object description feature for the object extracted by the object detection model, or may be a fusion feature between the object description feature corresponding to the object and the part description features. When the object pose feature is the object description feature corresponding to the object in the image frame, it indicates that part perception-based blocking learning is not introduced in the process of performing feature extraction on the image frame by the object detection model. When the object pose feature is the fusion feature between the object description feature corresponding to the object in the image frame and the part description features, it indicates that part perception-based blocking learning is introduced in the process of performing feature extraction on the image frame by the object detection model. By introducing part perception-based blocking learning, the object pose feature may include local pose features (part description features) of various parts of the object contained in the image frame, and may include the object description feature of the object contained in the image frame, which can enhance the fine granularity of the object pose feature, thereby improving the accuracy of the object pose detection result.
In some examples, if the part perception-based blocking learning is introduced in a process of performing feature extraction on the image frame by using the object detection model, then the computer may be configured to: input the image frame into the object detection model, acquire the object description feature corresponding to the object in the image frame in the object detection model, and output a second classification result corresponding to the object description feature according to the classifier in the object detection model; acquire an object convolutional feature for the image frame outputted by a convolutional layer in the object detection model, and perform a product operation on the second classification result and the object convolutional feature to obtain a second activation map corresponding to the image frame; perform blocking processing on the image frame according to the second activation map to obtain M object part area images, and acquire part description features respectively corresponding to the M object part area images according to the object detection model, M is a positive integer; and combine the object description feature and the part description features corresponding to the M object part area images into an object pose feature.
The object description feature may be considered as a feature representation that is extracted from the image frame and is used for characterizing the object. The second classification result may also be used for characterizing an object part class corresponding to key points of the object contained in the image frame. The convolutional layer may refer to the last convolutional layer in the object detection model. The object convolutional feature may represent the convolutional feature, for the image frame, outputted by the last convolutional layer of the object detection model. The second activation map may be a class activation mapping (CAM) corresponding to the image frame. The CAM is a tool for visualizing an image feature. Weighting is performed on the object convolutional feature outputted by the last convolutional layer in the object detection model and the second classification result (the second classification result may be considered as a weight corresponding to the object convolutional feature), and the second activation map may be obtained. The second activation map may be considered as a result after visualizing the object convolutional feature outputted by the convolutional layer, which may be used for characterizing an image pixel area concerned by the object detection model.
The computer may take the CAM (the second activation map) of each object key point in the image frame as prior information of an area position, and perform blocking processing on the image frame, that is, clip the image frame according to the second activation map to obtain an object part area image containing a single part. Then, feature extraction may be performed on each object part area image by the object detection model to obtain a part description feature corresponding to each object part area image. The foregoing object description feature and the part description feature corresponding to each object part area image may be combined into an object pose feature for the object. The part description feature may be considered as a feature representation that is extracted from the object part area image and is used for characterizing an object part.
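For reference, a minimal sketch of the weighting and blocking steps described above is given below; it assumes a (C, H, W) object convolutional feature, a (C,) classification weight vector, and a CAM already resized to the image resolution, all of which are assumptions made for the sketch.

```python
import numpy as np

def class_activation_map(conv_feature, class_weights):
    """Weight a (C, H, W) convolutional feature with a (C,) classification weight
    vector to obtain an H x W activation map, i.e., a standard CAM computation."""
    cam = np.tensordot(class_weights, conv_feature, axes=(0, 0))  # sum_c w_c * F_c
    cam -= cam.min()
    cam /= cam.max() + 1e-8          # normalize to [0, 1] for thresholding below
    return cam

def crop_object_part_area(image, cam, threshold=0.6):
    """Clip the image frame around the high-activation area of the CAM; the CAM is
    assumed to be resized to the image resolution beforehand, and the rectangular
    crop and the threshold value are illustrative assumptions."""
    ys, xs = np.where(cam >= threshold)
    if ys.size == 0:
        return image                 # nothing activated strongly; keep the whole frame
    return image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
```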
Step S202: Generate a first activation map according to the first classification result and an object convolutional feature of the image frame outputted by the object detection model.
Specifically, after obtaining the first classification result, the computer may perform multiplication on the first classification result and the object convolutional feature of the image frame to generate the first activation map. Both the first activation map and the second activation map are CAMs for the image frame. However, the first activation map takes the first classification result as a weight of the object convolutional feature outputted by the convolutional layer (here, the first classification result combines the object description feature and the part description feature by default), and the second activation map takes the second classification result as a weight of the object convolutional feature outputted by the convolutional layer. The second classification result is only related to the object description feature.
Step S203: Acquire a pixel average value corresponding to the first activation map, determine a positioning result of the key points of the object in the image frame according to the pixel average value, and determine an object pose detection result corresponding to the image frame according to the object part class and the positioning result.
Specifically, the computer may take the pixel average value of the first activation map and determine the pixel average value as a positioning result of the key points of the object in the image frame, and may determine an object skeleton of the object in the image frame according to the object part class and the positioning result. The object skeleton may be used as an object pose detection result corresponding to the object in the image frame.
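One plausible reading of using the pixel average value of the first activation map as a positioning result is an activation-weighted mean of pixel coordinates (a soft-argmax-style estimate); the sketch below follows that reading, which is an interpretation offered for illustration rather than something fixed by the text above.

```python
import numpy as np

def activation_to_position(activation_map):
    """Estimate a key point position from an H x W activation map as the
    activation-weighted mean of pixel coordinates (interpretation for illustration)."""
    weights = np.clip(activation_map, 0.0, None)
    weights = weights / (weights.sum() + 1e-8)
    ys, xs = np.indices(activation_map.shape)
    return float((xs * weights).sum()), float((ys * weights).sum())  # (x, y)
```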
Refer to
Blocking processing is performed on the image frame 60a based on the second activation map to obtain M object part area images 60f. The M object part area images 60f are inputted into the feature extraction component 60b in the object detection model in sequence. Part description features 60g respectively corresponding to the M object part area images 60f may be obtained through the feature extraction component 60b. Feature combination is performed on the M part description features 60g and the object description feature 60c of the image frame 60a to obtain an object pose feature. A first classification result 60d may be obtained by recognizing an object pose feature. A first activation map 60e may be obtained by performing weighting on the first classification result 60d and the object convolutional feature outputted by the last convolutional layer in the feature extraction component 60b. A pixel average value of the first activation map 60e may be taken as a positioning result of the object in the image frame 60a, and the object pose detection result corresponding to the object in the image frame 60a may be obtained on this basis.
A manner of acquiring the object pose detection result described in an example corresponding to
Refer to
It is to be understood that a spatial coordinate system is constructed by using image frames, and the position coordinates of the human body three-dimensional key points may refer to the spatial coordinates within the spatial coordinate system.
Step S204: Input the image frame into the part detection model, and detect, in the part detection model, a first object part of the object in the image frame.
Specifically, the computer may also input the image frame into the part detection model, and detect, in the part detection model, whether the image frame contains the first object part of the object. The part detection model may be configured to detect key points of the first object part, so the first object part in the image frame needs to be detected. In a case that the first object part of the object is not detected in the image frame, then the part pose detection result corresponding to the image frame may be directly determined as a null value, and a subsequent step of detecting the key points of the first object part does not need to be performed.
Step S205: In a case that the first object part is detected in the image frame, acquire an area image containing the first object part from the image frame, acquire part key point positions corresponding to the first object part according to the area image, and determine a part pose detection result corresponding to the image frame based on the part key point positions.
Specifically, in a case that the first object part is detected in the image frame, a position area of the first object part in the image frame may be determined, and the image frame is clipped based on the position area of the first object part in the image frame to obtain an area image containing the first object part. Feature extraction may be performed on the area image in the part detection model to acquire a part contour feature corresponding to the first object part in the area image, and the part key point positions corresponding to the first object part may be predicted according to the part contour feature. Key points of the first object part may be connected based on the part key point positions to obtain the part pose detection result corresponding to the image frame.
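A minimal sketch of this clipping step is shown below: the area image is cut out of the image frame according to the detected position area, the part detection model is run on the area image, and the resulting part key points are mapped back to image-frame coordinates; the region dictionary keys and the part_detector callable are assumptions for the sketch.

```python
def detect_part_keypoints(image, part_detector, region):
    """Clip the area image from a {'x', 'y', 'w', 'h'} position area, run the part
    detection model on it, and shift the key points back into image-frame
    coordinates (illustrative sketch; names and signature are assumptions)."""
    x, y, w, h = region["x"], region["y"], region["w"], region["h"]
    area_image = image[y:y + h, x:x + w]
    local_keypoints = part_detector(area_image)  # assumed to return [(px, py, confidence), ...]
    return [(px + x, py + y, confidence) for (px, py, confidence) in local_keypoints]
```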
Refer to
The palm three-dimensional pose estimation model may acquire a plurality of possible areas, and predict a second confidence level that each possible area contains the palm. The area with the second confidence level greater than a second confidence threshold value (which may be the same as or different from the foregoing first confidence threshold value, which is not limited here) is determined as the area containing the palm; for example, the second confidence levels corresponding to both the area 80c and the area 80d are greater than the second confidence threshold value. A right palm pose 80e may be obtained by connecting the palm key points detected in the area 80c, and a left palm pose 80f may be obtained by connecting the palm key points detected in the area 80d. The left palm pose 80f and the right palm pose 80e may be referred to as a part pose detection result corresponding to the image frame 80a.
Step S206: Acquire a standard pose associated with the object, and determine a first key point quantity corresponding to the standard pose, and a second key point quantity corresponding to the object pose detection result.
Specifically, the computer may acquire a standard pose corresponding to the object, and count the first key point quantity of the object key points contained in the standard pose and the second key point quantity of the object key points contained in the object pose detection result. The first key point quantity is known when the standard pose is constructed, and the second key point quantity is the quantity of object key points predicted by the object detection model.
Step S207: In a case that the first key point quantity is greater than the second key point quantity, perform interpolation processing on the object pose detection result according to the standard pose to obtain a first candidate object pose.
Specifically, in a case that the first key point quantity is greater than the second key point quantity, it indicates that there are missing object key points in the object pose detection result, and key point compensation (interpolation processing) may be performed on the object pose detection result through the standard pose to complete missing object key points to obtain the first candidate object pose corresponding to the object. As shown in
For example, assuming that the object is a human body, in a case that key points of the parts such as knees, ankles, feet, and elbows are missing from the object pose detection result predicted by the object detection model, interpolation processing may be performed on the object pose detection result through the standard pose, for example, adding missing object key points, to obtain a more rational first candidate object pose. The integrity and rationality of the object pose can be improved by performing interpolation on the object pose detection result through the standard pose.
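As an illustrative sketch of this interpolation, the snippet below keeps every object key point predicted by the object detection model and fills each missing key point class from the standard pose; aligning or scaling the standard pose to the detected key points is deliberately omitted here and would be needed in practice.

```python
def complete_with_standard_pose(detected_keypoints, standard_pose):
    """Build the first candidate object pose: keep predicted key points and fill
    missing key point classes from the standard pose (minimal sketch; the standard
    pose is assumed to be given as {key point class: (x, y)} and already aligned)."""
    first_candidate_pose = {}
    for name, standard_xy in standard_pose.items():
        # prefer the prediction when it exists, otherwise interpolate from the standard pose
        first_candidate_pose[name] = detected_keypoints.get(name, standard_xy)
    return first_candidate_pose
```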
Step S208: Perform interpolation processing on the object part associated with the first object part in the first candidate object pose according to the part pose detection result to obtain a global pose corresponding to the object.
Specifically, in an actual application scenario, a pose change of the object depends on a few parts of the object to a great extent, that is, some specific parts of the object (for example, an arm part in a human body structure, where the arm part may include key points of parts such as a palm, a wrist, and an elbow) play an important role in the final result. Therefore, in the examples of this application, interpolation processing may be performed on the object part associated with the first object part in the first candidate object pose based on the part pose detection result to obtain the global pose corresponding to the object. In some examples, in a case that the part pose detection result is a null value (that is, the image frame does not contain the first object part), the first candidate object pose may be directly determined as the global pose corresponding to the object.
For example, assuming that the object is a human body, the first object part is a palm. When the image frame contains an elbow part, key points for the elbow part may be predicted by the object detection model. When the image frame does not contain the elbow part, key points for the elbow part cannot be predicted by the object detection model. At this moment, elbow key points and wrist key points of the object may be determined based on a part pose detection result. The elbow key points and wrist key points are added to the first candidate object pose, and the global pose corresponding to the object may be obtained.
In some examples, the object includes a second object part and a third object part. The second object part and the third object part are symmetrical. For example, the second object part is a right arm of the object, and the third object part is a left arm of the object. The second object part is a right leg of the object, and the third object part is a left leg of the object.
In a case that the part pose detection result includes all part key points of the first object part (in a case that the first object part is a palm, it is assumed here that the part pose detection result includes left and right palm key points), if the object pose detection result contains a pose of the second object part and does not contain a pose of the third object part, that is, the image frame contains the second object part but does not contain the third object part, then a first part direction corresponding to the third object part may be determined according to the key point positions of the first object part contained in the part pose detection result. Because the second object part and the third object part are symmetrical parts of the object, the length of the second object part is the same as the length of the third object part. Therefore, a first part length of the second object part in the first candidate object pose may be acquired, and key point positions of the third object part may be determined according to the first part length and the first part direction. The key point positions of the third object part are added to the first candidate object pose to obtain the global pose corresponding to the object in the image frame.
In a case that the object pose detection result contains neither a pose of the second object part nor a pose of the third object part, that is, the image frame contains neither the second object part nor the third object part, then a second part direction corresponding to the second object part and a third part direction corresponding to the third object part may be determined according to the key point positions of the first object part contained in the part pose detection result. A second part length corresponding to the second object part and a third part length corresponding to the third object part may be acquired from the (i−1)th image frame. In other words, the length of the second object part in a previous image frame may be taken as the length of the second object part in the image frame, and the length of the third object part in the previous image frame may be taken as the length of the third object part in the image frame. Then, key point positions of the second object part may be determined according to the second part length and the second part direction, key point positions of the third object part may be determined according to the third part length and the third part direction, and the key point positions of the second object part and the key point positions of the third object part may be added to the first candidate object pose to obtain the global pose corresponding to the object in the image frame. In a case that the (i−1)th image frame also does not contain the second object part and the third object part, backtracking may continue to acquire the lengths of the second object part and the third object part in the (i−2)th image frame to determine the key point positions of the second object part and the third object part in the image frame. In a case that neither the second object part nor the third object part is detected in any image frame previous to the image frame, an approximate length may be set for each of the second object part and the third object part according to the first candidate object pose to determine the key point positions of the second object part and the third object part in the image frame.
For example, assuming that the object is a human body, the first object part is a palm, and the second object part and the third object part are respectively a left arm and a right arm. On a premise that a left palm and a right palm are detected in the image frame, a direction of a left forearm may be calculated through key points of the left palm, a direction of a right forearm may be calculated through key points of the right palm, the left forearm belongs to part of the left arm, and the right forearm belongs to part of the right arm.
In a case that neither the left arm nor the right arm is detected in the image frame, the lengths of the left and right forearms (the second part length and the third part length) in an image frame previous to the image frame (for example, the (i−1)th image frame) may be taken as the lengths of the left and right forearms in the image frame. In a case that neither the left arm nor the right arm is detected in the image frame or in a previous image frame, reference lengths of the left and right forearms in the image frame may be assigned with reference to shoulder lengths in the image frame. In a case that any arm (for example, the left arm) of the left and right arms is detected in the image frame, the length of the left forearm (the first part length) may be directly assigned to the right forearm. For example, suppose a right wrist point A and a right palm point B are known, and a right elbow point C is missing; the direction of the right forearm may be represented as the direction from the right palm point B to the right wrist point A, and may be marked as a vector BA; and the length of the right forearm (assigned as described above) may be represented as the length from the right wrist point A to the right elbow point C, and may be marked as L. The position coordinates of the right elbow point C may be calculated through the above information, which may be expressed as: C = A + BA_normal * L, where C represents the position coordinates of the right elbow point C, A represents the position coordinates of the right wrist point A, and BA_normal represents the unit vector of the vector BA.
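The expression C = A + BA_normal * L can be written directly as a short function; the sketch below is only an illustration of that calculation, with the forearm length L supplied from the symmetric arm or a previous image frame as described above.

```python
import numpy as np

def interpolate_elbow(wrist_a, palm_b, forearm_length):
    """Compute the missing elbow point C = A + BA_normal * L, where A is the wrist
    point, B is the palm point, BA_normal is the unit vector from the palm to the
    wrist, and L is the forearm length."""
    a = np.asarray(wrist_a, dtype=float)
    b = np.asarray(palm_b, dtype=float)
    ba = a - b                                    # vector BA: palm -> wrist direction
    ba_normal = ba / (np.linalg.norm(ba) + 1e-8)  # unit vector of BA
    return a + ba_normal * forearm_length         # position coordinates of elbow point C
```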
It is to be understood that, in a case that the left and right arms are detected in the image frame, then an elbow point predicted by the object detection model may be adjusted and updated based on the detected palm key points, which can improve the accuracy of the elbow point, and then improve the rationality of the global pose.
In some examples, there may be some irrational key points in the pose obtained by performing interpolation processing on the first candidate object pose based on the part pose detection result. Therefore, the irrational object key points may be corrected in combination with the standard pose to obtain a final global pose of the object. Specifically, assuming that the third object part is not detected in the image frame, the computer may determine the first candidate object pose to which the key point positions of the third object part have been added as a second candidate object pose. Then, a pose offset between the standard pose and the second candidate object pose may be acquired. In a case that the pose offset is greater than an offset threshold value (which can be understood as the maximum angle by which an object may deviate in a normal case), key point correction is performed on the second candidate object pose based on the standard pose to obtain the global pose corresponding to the object in the image frame. The pose offset may be understood as a relative angle between the second candidate object pose and the standard pose. For example, when the object is a human body, the pose offset may be an included angle between a shoulder of the second candidate object pose and a shoulder of the standard pose, and the like.
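As one illustrative example of the pose offset, the sketch below measures the included angle between the shoulder line of the second candidate object pose and that of the standard pose; the key point names are assumptions, and the exact correction rule applied when the offset exceeds the threshold is not fixed by this application.

```python
import numpy as np

def shoulder_offset_degrees(candidate_pose, standard_pose):
    """Included angle (in degrees) between the shoulder lines of two poses given as
    {key point class: (x, y)} dictionaries (illustrative; key names are assumptions)."""
    def shoulder_vector(pose):
        return (np.asarray(pose["right_shoulder"], dtype=float)
                - np.asarray(pose["left_shoulder"], dtype=float))
    u, v = shoulder_vector(candidate_pose), shoulder_vector(standard_pose)
    cos_angle = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

# If the offset exceeds the offset threshold value, the affected key points would be
# pulled back toward the standard pose; the specific correction rule is left open here.
```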
Refer to
The video data shot in a mobile terminal scenario usually does not contain the whole object, and the pose predicted by the object detection model for the object is therefore incomplete. The rationality of the global pose can be improved by performing processing such as key point interpolation and key point correction. Object key point positions associated with the first object part may be calculated through the part pose detection result, which can improve the accuracy of the global pose.
Refer to
It is to be understood that, in a specific implementation of this application, video collection of a user may be involved. When the above examples of this application are applied to a specific product or technology, user permission or consent needs to be acquired, and the collection, use, and processing of relevant data need to comply with relevant laws, regulations, and standards of relevant countries and regions.
In the examples of this application, an object pose detection result for an object and a part pose detection result for a first object part of the object can be obtained by respectively performing object pose estimation and specific part pose estimation on the object in an image frame; pose estimation can then be performed on the object in the image frame based on the object pose detection result, the part pose detection result, and the standard pose, part key points missing from the object in the image frame can be compensated, and the object key points that do not conform to the standard pose can be corrected, which can ensure the integrity and the rationality of the finally obtained global pose of the object and thereby improve the accuracy of estimating the global pose.
Refer to
The pose detection module 11 is configured to acquire an object pose detection result corresponding to an object in an image frame and a part pose detection result corresponding to a first object part of the object in the image frame. At least one object part of the object is missing from the object pose detection result, and the first object part is one or more parts of the object.
The pose estimation module 12 is configured to perform interpolation processing on the at least one object part missing from the object pose detection result according to the part pose detection result and a standard pose associated with the object to obtain a global pose corresponding to the object. The global pose is used for controlling a computer to realize a service function corresponding to the global pose.
For implementations of specific functions of the pose detection module 11 and the pose estimation module 12, refer to the descriptions for step S101 and step S102 in the example corresponding to
In the examples of this application, an object pose detection result for the object and a part pose detection result for the first object part of the object can be obtained by respectively performing global object pose estimation and specific part pose estimation on the object in the image frame; pose estimation can then be performed on the object in the image frame based on the object pose detection result, the part pose detection result, and a standard pose, and part key points missing from the object in the image frame can be compensated, which can ensure the integrity and the rationality of the finally obtained global pose of the object and thereby improve the accuracy of estimating the global pose.
Refer to
The pose detection module 21 is configured to acquire an object pose detection result corresponding to an object in an image frame and a part pose detection result corresponding to a first object part of the object in the image frame. At least one object part of the object is missing from the object pose detection result, and the first object part is one or more parts of the object.
The pose estimation module 22 is configured to perform interpolation processing on the at least one object part missing from the object pose detection result according to the part pose detection result and a standard pose associated with the object to obtain a global pose corresponding to the object.
The virtual object construction module 23 is configured to construct a virtual object associated with the object, and control the pose of the virtual object according to the global pose.
For implementations of specific functions of the pose detection module 21, the pose estimation module 22, and the virtual object construction module 23, refer to the descriptions of the foregoing relevant steps, and details are not described herein again.
In one or more examples, the pose detection module 21 includes: an object detection unit 211 and a part detection unit 212.
The object detection unit 211 is configured to input the image frame into an object detection model, and acquire the object pose detection result through the object detection model.
The part detection unit 212 is configured to input the image frame into a part detection model, and acquire the part pose detection result through the part detection model.
For implementations of specific functions of the object detection unit 211 and the part detection unit 212, refer to the descriptions of step S101 in the foregoing method example, and details are not described herein again.
In one or more examples, the object detection unit 211 may include: a part classification subunit 2111, a part map generation subunit 2112, a positioning result determination subunit 2113, and a detection result determination subunit 2114.
The part classification subunit 2111 is configured to input the image frame into the object detection model, acquire an object pose feature corresponding to the object in the image frame through the object detection model, and recognize a first classification result corresponding to the object pose feature. The first classification result is used for characterizing an object part class corresponding to key points of the object.
The part map generation subunit 2112 is configured to generate a first activation map according to the first classification result and an object convolutional feature of the image frame outputted by the object detection model.
The positioning result determination subunit 2113 is configured to acquire a pixel average value corresponding to the first activation map, and determine a positioning result of the key points of the object in the image frame according to the pixel average value.
The detection result determination subunit 2114 is configured to determine the object pose detection result corresponding to the image frame according to the object part class and the positioning result.
For implementations of specific functions of the part classification subunit 2111, the part map generation subunit 2112, the positioning result determination subunit 2113, and the detection result determination subunit 2114, refer to the descriptions of step S201 to step S203 in the foregoing method example, and details are not described herein again.
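One plausible way to realize the activation-map and pixel-average steps described above is sketched below in Python with NumPy. The CAM-style channel weighting, the use of the map's mean pixel value as a threshold, and the centroid-based positioning are assumptions made for illustration, not the exact computation of the object detection model.

```python
import numpy as np

def class_activation_map(class_weights, conv_feature):
    """Weight the C feature channels by the classifier weights of one part class
    and sum them into a single H x W activation map (a CAM-style construction)."""
    # conv_feature: (C, H, W); class_weights: (C,)
    cam = np.tensordot(class_weights, conv_feature, axes=(0, 0))
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()
    return cam

def locate_key_point(cam):
    """Use the map's mean pixel value as a threshold and return the weighted
    centroid of the above-threshold region as the key-point position (one
    plausible reading of positioning 'according to the pixel average value')."""
    mask = cam >= cam.mean()
    ys, xs = np.nonzero(mask)
    weights = cam[mask]
    x = float((xs * weights).sum() / weights.sum())
    y = float((ys * weights).sum() / weights.sum())
    return x, y

# Toy example: a 64-channel 16x16 convolutional feature and random class weights.
rng = np.random.default_rng(0)
conv_feature = rng.random((64, 16, 16))
class_weights = rng.random(64)
cam = class_activation_map(class_weights, conv_feature)
print(locate_key_point(cam))
```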
In one or more examples, the part classification subunit 2111 includes: a global classification subunit 21111, a global map acquisition subunit 21112, a blocking processing subunit 21113, and a feature combination subunit 21114.
The global classification subunit 21111 is configured to acquire, in the object detection model, an object description feature corresponding to the object in the image frame, and output a second classification result corresponding to the object description feature according to a classifier in the object detection model.
The global map acquisition subunit 21112 is configured to acquire an object convolutional feature for the image frame outputted by a convolutional layer in the object detection model, and perform a product operation on the second classification result and the object convolutional feature to obtain a second activation map corresponding to the image frame.
The blocking processing subunit 21113 is configured to perform blocking processing on the image frame according to the second activation map to obtain M object part area images, and acquire part description features respectively corresponding to the M object part area images according to the object detection model. M is a positive integer.
The feature combination subunit 21114 is configured to combine the object description feature and the part description features corresponding to the M object part area images into an object pose feature.
For implementations of specific functions of the global classification subunit 21111, the global map acquisition subunit 21112, the blocking processing subunit 21113, and the feature combination subunit 21114, refer to the descriptions of step S201 in the foregoing method example, and details are not described herein again.
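A minimal sketch of the blocking and feature-combination steps follows, assuming equally sized horizontal bands ranked by the second activation map and a placeholder channel-mean descriptor standing in for the model's learned features.

```python
import numpy as np

def block_into_part_areas(image, activation_map, m=3):
    """Split the frame into M part area images by ranking equally sized horizontal
    bands with the activation map (a simplified stand-in for the blocking step)."""
    h = image.shape[0]
    band_h = h // m
    bands = [(i * band_h, (i + 1) * band_h) for i in range(m)]
    # Order the bands by their mean activation, most salient first.
    bands.sort(key=lambda b: activation_map[b[0]:b[1]].mean(), reverse=True)
    return [image[top:bottom] for top, bottom in bands]

def describe(image_like):
    """Placeholder descriptor: channel-wise mean of the pixels (stands in for the
    description features produced by the object detection model)."""
    return image_like.reshape(-1, image_like.shape[-1]).mean(axis=0)

def build_object_pose_feature(image, activation_map, m=3):
    object_feature = describe(image)                                        # object description feature
    part_features = [describe(a) for a in block_into_part_areas(image, activation_map, m)]
    return np.concatenate([object_feature] + part_features)                 # combined object pose feature

rng = np.random.default_rng(1)
frame = rng.random((96, 64, 3))
cam = rng.random((96, 64))
print(build_object_pose_feature(frame, cam).shape)   # (3 + 3*3,) = (12,)
```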
In one or more examples, the part detection unit 212 may include: an object part detection unit 2121, a part pose estimation subunit 2122, and a null value determination subunit 2123.
The object part detection unit 2121 is configured to input the image frame into the part detection model, and detect, in the part detection model, a first object part of the object in the image frame.
The part pose estimation subunit 2122 is configured to: in a case that the first object part is detected in the image frame, acquire an area image containing the first object part from the image frame, acquire part key point positions corresponding to the first object part according to the area image, and determine a part pose detection result corresponding to the image frame based on the part key point positions.
The null value determination subunit 2123 is configured to: in a case that the first object part is not detected in the image frame, determine that the part pose detection result corresponding to the image frame is a null value.
For implementations of specific functions of the object part detection unit 2121, the part pose estimation subunit 2122, and the null value determination subunit 2123, refer to the descriptions of step S204 to step S205 in the foregoing method example, and details are not described herein again.
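The detect-or-null behavior of these subunits can be sketched as follows; the brightness-based placeholder detector and the dummy key-point predictor are assumptions standing in for the learned part detection model.

```python
import numpy as np

def detect_first_object_part(frame):
    """Placeholder part detector: returns a bounding box (x0, y0, x1, y1) when the
    part is found, or None otherwise (here the part is 'found' whenever the frame
    contains any bright pixels; a real model would be learned)."""
    ys, xs = np.nonzero(frame.max(axis=-1) > 0.9)
    if xs.size == 0:
        return None
    return xs.min(), ys.min(), xs.max() + 1, ys.max() + 1

def part_pose_detection_result(frame, predict_key_points):
    box = detect_first_object_part(frame)
    if box is None:
        return None                                  # null value: the part is not in the frame
    x0, y0, x1, y1 = box
    area_image = frame[y0:y1, x0:x1]                 # area image containing the first object part
    local = predict_key_points(area_image)           # key points in area-image coordinates
    return [(x + x0, y + y0) for x, y in local]      # map back to frame coordinates

# Toy usage with a dummy key-point predictor returning two corners of the area image.
rng = np.random.default_rng(2)
frame = rng.random((32, 32, 3))
print(part_pose_detection_result(frame, lambda a: [(0, 0), (a.shape[1] - 1, a.shape[0] - 1)]))
```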
In one or more examples, the part pose estimation subunit 2122 may also include: an image clipping subunit 21221, a part key point determination subunit 21222, and a part key point connection subunit 21223.
The image clipping subunit 21221 is configured to: in a case that the first object part is detected in the image frame, clip the image frame to obtain an area image containing the first object part.
The part key point determination subunit 21222 is configured to acquire a part contour feature corresponding to the area image, and predict part key point positions corresponding to the first object part according to the part contour feature.
The part key point connection subunit 21223 is configured to connect key points of the first object part based on the part key point positions to obtain the part pose detection result corresponding to the image frame.
For implementations of specific functions of the image clipping subunit 21221, the part key point determination subunit 21222, and the part key point connection subunit 21223, refer to the descriptions of step S205 in the foregoing method example, and details are not described herein again.
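A small sketch of the connection step, assuming a hand-like first object part with a hypothetical bone table; the actual connection relationships used by the part detection model may differ.

```python
# A hypothetical connection table for a hand-like part: each pair indexes two key
# points that share a bone. The table is illustrative only.
HAND_EDGES = [(0, 1), (1, 2), (2, 3), (3, 4),       # thumb chain
              (0, 5), (5, 6), (6, 7), (7, 8)]       # index-finger chain

def connect_part_key_points(key_point_positions, edges=HAND_EDGES):
    """Turn predicted key-point positions into a part pose detection result:
    the positions plus the line segments that join them."""
    segments = [(key_point_positions[a], key_point_positions[b])
                for a, b in edges
                if a < len(key_point_positions) and b < len(key_point_positions)]
    return {"key_points": key_point_positions, "segments": segments}

# Toy usage with nine made-up positions.
positions = [(float(i), float(i % 3)) for i in range(9)]
print(len(connect_part_key_points(positions)["segments"]))   # 8
```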
In one or more examples, the pose estimation module 22 includes: a key point quantity determination unit 221, a first interpolation processing unit 222, and a second interpolation processing unit 223.
The key point quantity determination unit 221 is configured to acquire a standard pose associated with the object, and determine a first key point quantity corresponding to the standard pose, and a second key point quantity corresponding to the object pose detection result.
The first interpolation processing unit 222 is configured to: in a case that the first key point quantity is greater than the second key point quantity, perform interpolation processing on the at least one object part missing from the object pose detection result according to the standard pose to obtain a first candidate object pose.
The second interpolation processing unit 223 is configured to perform interpolation processing on the object part associated with the first object part in the first candidate object pose according to the part pose detection result to obtain the global pose corresponding to the object.
For implementations of specific functions of the key point quantity determination unit 221, the first interpolation processing unit 222, and the second interpolation processing unit 223, refer to the descriptions of step S206 to step S208 in the foregoing method example, and details are not described herein again.
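The two-stage interpolation performed by these units can be sketched as follows; the dictionary representation of poses and the part_to_object_keys mapping are assumed structures used only for illustration.

```python
def interpolate_global_pose(object_pose, part_pose, standard_pose, part_to_object_keys):
    """Two-stage interpolation: (1) if the standard pose has more key points than
    were detected, fill the missing ones from the standard pose to form a first
    candidate object pose; (2) re-estimate the object parts associated with the
    first object part from the part pose detection result."""
    first_quantity = len(standard_pose)          # first key point quantity
    second_quantity = len(object_pose)           # second key point quantity

    candidate = dict(object_pose)
    if first_quantity > second_quantity:
        for name, position in standard_pose.items():
            candidate.setdefault(name, position)          # first candidate object pose

    if part_pose:
        for part_key, object_key in part_to_object_keys.items():
            if part_key in part_pose:
                candidate[object_key] = part_pose[part_key]   # refine from the part result
    return candidate

standard = {"head": (0, 2), "left_wrist": (-1, 1), "right_wrist": (1, 1)}
detected = {"head": (0.1, 1.9)}
hand = {"wrist": (0.9, 1.1)}
print(interpolate_global_pose(detected, hand, standard, {"wrist": "right_wrist"}))
```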
In one or more examples, the second interpolation processing unit 223 may include: a first direction determination subunit 2231, a first position determination subunit 2232, and a first key point addition subunit 2233.
The first direction determination subunit 2231 is configured to: in a case that the object pose detection result contains a pose of a second object part and the object pose detection result does not contain a pose of a third object part, determine a first part direction corresponding to the third object part according to the key point positions of the first object part contained in the part pose detection result. The second object part and the third object part are symmetrical parts of the object, and the second object part and the third object part are associated with the first object part.
The first position determination subunit 2232 is configured to acquire a first part length of the second object part in the first candidate object pose, and determine key point positions of the third object part according to the first part length and the first part direction.
The first key point addition subunit 2233 is configured to add the key point positions of the third object part to the first candidate object pose to obtain the global pose corresponding to the object in the image frame.
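A hedged sketch of this symmetric-part interpolation follows, assuming each key-point list runs from a part's root joint to its tip and that the missing third object part ends where the detected first object part begins; neither assumption is stated by this application.

```python
import numpy as np

def interpolate_symmetric_part(second_part, first_part_key_points):
    """Estimate the key points of a missing third object part from (a) the first
    part direction implied by the first object part's key points and (b) the first
    part length measured on the symmetric, visible second object part."""
    second_part = np.asarray(second_part, dtype=float)
    first_kp = np.asarray(first_part_key_points, dtype=float)

    direction = first_kp[-1] - first_kp[0]
    direction /= np.linalg.norm(direction)                      # first part direction
    length = np.linalg.norm(second_part[-1] - second_part[0])   # first part length

    tip = first_kp[0]                       # assume the missing part ends where the detected part starts
    return np.stack([tip - direction * length, tip])            # third-part key points (root -> tip)

# Example: a visible left forearm and a detected right hand fix the missing right forearm.
left_forearm = [(0.0, 1.2), (0.0, 0.9)]          # elbow -> wrist, length 0.3
right_hand = [(0.5, 0.9), (0.6, 0.8)]            # wrist -> fingertip
print(interpolate_symmetric_part(left_forearm, right_hand))
```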
In some examples, the first key point addition subunit 2233 is specifically configured to:
In some examples, the image frame is an ith image frame in video data, and i is a positive integer. The second interpolation processing unit 223 may further include: a second direction determination subunit 2234, a second position determination subunit 2235, and a second key point addition subunit 2236.
The second direction determination subunit 2234 is configured to: in a case that the object pose detection result contains neither a pose of the second object part nor a pose of the third object part, determine a second part direction corresponding to the second object part and a third part direction corresponding to the third object part according to the key point positions of the first object part contained in the part pose detection result. The second object part and the third object part are symmetrical parts of the object, and the second object part and the third object part are associated with the first object part.
The second position determination subunit 2235 is configured to acquire, in a jth image frame, a second part length corresponding to the second object part and a third part length corresponding to the third object part, and determine key point positions of the second object part according to the second part length and the second part direction, where j is a positive integer and j is less than i.
The second key point addition subunit 2236 is configured to determine key point positions of the third object part according to the third part length and the third part direction, and add the key point positions of the second object part and the key point positions of the third object part to the first candidate object pose to obtain the global pose corresponding to the object in the image frame.
For implementations of specific functions of the first direction determination subunit 2231, the first position determination subunit 2232, the first key point addition subunit 2233, the second direction determination subunit 2234, the second position determination subunit 2235, and the second key point addition subunit 2236, refer to the descriptions of step S208 in the foregoing method example, and details are not described herein again.
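When both symmetric parts are missing from the current frame, the part lengths can be borrowed from an earlier frame, as in the following sketch; the left/right split of the first object part, the root-to-tip ordering of key points, and the anchoring of each missing part at the detected part's root are assumptions for illustration.

```python
import numpy as np

def part_length(key_points):
    pts = np.asarray(key_points, dtype=float)
    return float(np.linalg.norm(pts[-1] - pts[0]))

def unit_direction(key_points):
    pts = np.asarray(key_points, dtype=float)
    d = pts[-1] - pts[0]
    return d / np.linalg.norm(d)

def interpolate_both_symmetric_parts(first_part_left, first_part_right,
                                     previous_second_part, previous_third_part):
    """Neither symmetric part is detected in the current (ith) frame: take the
    second/third part directions from the first object part detected on each side,
    and the part lengths from an earlier (jth) frame in which both were visible."""
    results = {}
    for name, first_kp, previous in (
        ("second_part", first_part_left, previous_second_part),
        ("third_part", first_part_right, previous_third_part),
    ):
        tip = np.asarray(first_kp, dtype=float)[0]    # missing part ends at the detected part's root
        direction = unit_direction(first_kp)
        length = part_length(previous)                # length borrowed from the jth frame
        results[name] = np.stack([tip - direction * length, tip])
    return results

# Example: left/right hands detected in frame i, forearm lengths taken from frame j.
left_hand = [(-0.5, 0.9), (-0.6, 0.8)]
right_hand = [(0.5, 0.9), (0.6, 0.8)]
prev_left_forearm = [(-0.4, 1.2), (-0.5, 0.9)]
prev_right_forearm = [(0.4, 1.2), (0.5, 0.9)]
print(interpolate_both_symmetric_parts(left_hand, right_hand,
                                        prev_left_forearm, prev_right_forearm))
```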
In the examples of this application, an object pose detection result for an object and a part pose detection result for a first object part of the object can be obtained by respectively performing object pose estimation and specific part pose estimation on the object in an image frame. Pose estimation can then be performed on the object in the image frame based on the object pose detection result, the part pose detection result, and a standard pose, so that part key points missing from the object in the image frame can be compensated and object key points that do not conform to the standard pose can be corrected. This ensures the integrity and rationality of the finally obtained global pose of the object, thereby improving the accuracy of estimating the global pose.
Refer to the corresponding drawing, which shows a schematic structural diagram of a computer according to an example of this application. The computer 1000 may include a processor, a memory, a user interface 1003, and a network interface 1004.
The network interface 1004 in the computer 1000 may provide a network communication function, and optionally, the user interface 1003 may further include a display and a keyboard. In the computer 1000 shown in the corresponding drawing, the memory may be configured to store a computer program, and the processor may be configured to invoke the computer program to perform the data processing method described in the foregoing examples.
It is to be understood that the computer 1000 described in the examples of this application may perform the descriptions of the data processing method in any of the foregoing examples, and details are not described herein again.
In addition, an example of this application further provides a non-transitory computer readable storage medium. The non-transitory computer readable storage medium stores a computer program, and the computer program includes computer instructions. When a processor loads and executes the computer program, the computer having the processor can perform the descriptions of the data processing method in any of the foregoing examples, and details are not described herein again.
In addition, an example of this application further provides a computer program product or a computer program. The computer program product or the computer program may include computer instructions, and the computer instructions may be stored in a non-transitory computer readable storage medium. A processor of a computer reads the computer instructions from the non-transitory computer readable storage medium and executes the computer instructions, so that the computer performs the descriptions of the data processing method in any of the foregoing examples, and details are not described herein again.
To simplify the descriptions, the foregoing method examples are described as a series of action combinations. However, those skilled in the art should understand that this application is not limited to the described action sequence, because some steps may be performed in other sequences or simultaneously according to this application. In addition, those skilled in the art should also understand that the examples described in the specification are all preferred examples, and the actions and modules involved are not necessarily mandatory to this application.
The steps in the method examples of this application may be reordered, combined, or deleted according to actual needs.
The modules in the apparatus examples of this application may be combined, divided, or deleted according to actual needs.
The term module (and other similar terms such as unit, subunit, submodule, etc.) in the present disclosure may refer to a software module, a hardware module, or a combination thereof. Modules implemented by software are stored in a memory or a non-transitory computer-readable medium. The software modules, which include computer instructions or computer code stored in the memory or medium, can run on a processor or circuitry (e.g., ASIC, PLA, DSP, FPGA, or other integrated circuit) capable of executing computer instructions or computer code. A hardware module may be implemented using one or more processors or circuitry. A processor or circuitry can be used to implement one or more hardware modules. Each module can be part of an overall module that includes the functionalities of the module. Modules can be combined, integrated, separated, and/or duplicated to support various applications. Also, a function performed at a particular module can be performed at one or more other modules and/or by one or more other devices instead of or in addition to the function performed at the particular module. Further, modules can be implemented across multiple devices and/or other components local or remote to one another. Additionally, modules can be moved from one device and added to another device, and/or can be included in both devices and stored in a memory or a non-transitory computer readable medium.
Those skilled in the art can understand that all or part of the processes in the method examples described above may be implemented by a computer program instructing relevant hardware, and the computer program may be stored in a non-transitory computer readable storage medium. When the program is executed, the process of each method example as described above may be included. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
What is disclosed above is merely exemplary of this application, and is certainly not intended to limit the scope of the claims of this application. Therefore, equivalent variations made in accordance with the claims of this application still fall within the scope of this application.
This application is a continuation of PCT Application PCT/CN2023/073976, filed Jan. 31, 2023, which claims priority to Chinese Patent Application No. 2022103327630, filed with the China National Intellectual Property Administration on Mar. 31, 2022 and entitled “DATA PROCESSING METHOD AND APPARATUS, AND DEVICE AND MEDIUM,” both of which are incorporated herein by reference in their entireties.