The present disclosure claims priority to Chinese Patent Application No. 202311831083.4, filed Dec. 27, 2023, which is hereby incorporated by reference herein as if set forth in its entirety.
The present disclosure relates to image processing technology, and particularly to an object tracking method, and a terminal device and a computer-readable storage medium using the same.
Object tracking refers to tracking an object of interest in a sequence of image frames. In the process of object tracking, not only image detection technology but also image matching technology is involved. Specifically, all objects are detected from the images first, and then a target object is matched from all the detected objects.
When performing multi-object tracking, object overlapping often occurs. For example, when multiple human objects are close to each other or have similar appearances, their feature information is easily confused and one human object may be mistaken for another, causing tracking errors where the objects overlap. Therefore, how to improve the object matching accuracy is key to improving the reliability of multi-object tracking results.
To describe the technical schemes in the embodiments of the present disclosure or in the prior art more clearly, the following briefly introduces the drawings required for describing the embodiments or the prior art. It should be understood that, the drawings in the following description merely show some embodiments. For those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
In the following descriptions, for purposes of explanation instead of limitation, specific details such as particular system architecture and technique are set forth in order to provide a thorough understanding of embodiments of the present disclosure. However, it will be apparent to those skilled in the art that the present disclosure may be implemented in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present disclosure with unnecessary detail.
It is to be understood that, when used in the description and the appended claims of the present disclosure, the terms “including” and “comprising” indicate the presence of stated features, integers, steps, operations, elements and/or components, but do not preclude the presence or addition of one or a plurality of other features, integers, steps, operations, elements, components and/or combinations thereof.
It is also to be understood that the term “and/or” used in the description and the appended claims of the present disclosure refers to any combination of one or more of the associated listed items and all possible combinations, and includes such combinations.
As used in the description and the appended claims, the term “if” may be interpreted as “when” or “once” or “in response to determining” or “in response to detecting” according to the context. Similarly, the phrase “if determined” or “if [the described condition or event] is detected” may be interpreted as “once determining” or “in response to determining” or “on detection of [the described condition or event]” or “in response to detecting [the described condition or event]”.
In addition, in the specification and the claims of the present disclosure, the terms “first”, “second”, “third”, and the like in the descriptions are only used for distinguishing, and cannot be understood as indicating or implying relative importance.
References such as “one embodiment” and “some embodiments” in the specification of the present disclosure mean that the particular features, structures or characteristics described in combination with the embodiment(s) are included in one or more embodiments of the present disclosure. Therefore, the sentences “in one embodiment,” “in some embodiments,” “in other embodiments,” “in still other embodiments,” and the like in different places of this specification do not necessarily all refer to the same embodiment, but mean “one or more but not all embodiments” unless specifically emphasized otherwise.
In order to solve the foregoing problems, research has been conducted and it has been found that, in most scenarios, the installation angle and height of a camera are fixed and the tracked human bodies are mostly pedestrians, that is, people of similar height and head size. On this premise, since the head can generally reflect the height of a human body, the pixel position and size of a head area in a two-dimensional image may be used to effectively represent the distance of the human body relative to the camera capturing the human body.
Although the detection frame for the human body becomes smaller and its position becomes lower as the distance between the human body and the camera 10 increases, the completeness of the detection frame does not have a proportional relationship with that distance because of the changeable posture of the human body. In contrast, the size of the head area of the human body is almost unaffected by the posture, so using the head area of the human body for object tracking is a better choice.
Based on this, the embodiments of the present disclosure provide an object tracking method. In this embodiment, when performing object matching, the feature information of the human body as a whole and the detection frame of the head area are comprehensively considered. When human bodies of similar appearance are close to each other, since the size of the head area is less affected by changes in the posture of the human bodies, different human objects can be distinguished more accurately by using the detection frame of the head area, and the accuracy of object matching can be effectively improved by combining it with the feature information of the human body as a whole, thereby improving the reliability of the results of multi-object tracking.
S201: obtaining first feature information of a target human body in a first image and a first detection frame of a head area of the target human body.
S202: obtaining second feature information of each human object in a second image and a second detection frame of a head area of the human object by performing a first image detection on the second image.
In which, the first image and the second image are two frames in a sequence of images captured through a camera (of, for example, the apparatus for tracking objects of
In the method, since the first image is the frame collected first and the second image is the frame collected later, the first image detection has already been performed on the first image by the time the second image is processed. In order to improve the processing efficiency, the object tracking result of the first image, that is, the first feature information of the target human body in the first image and the first detection frame of the head area of the target human body, may be stored, so that the stored tracking result corresponding to the first image can be obtained directly when processing the second image. In this manner, when processing the next image frame, there is no need to repeat the image detection of the previous image frame, which is conducive to improving the processing efficiency.
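As a minimal illustrative sketch (the class, field, and function names below are hypothetical and not part of the disclosure), the tracking result of each processed frame may be cached in memory so that the previous frame does not have to be re-detected when the next frame arrives:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical container for the per-frame result of one tracked human body.
@dataclass
class TrackResult:
    feature: List[float]  # feature information of the human body as a whole
    head_frame: Tuple[float, float, float, float]  # detection frame of the head area (x1, y1, x2, y2)

# Cache keyed by frame index; each entry stores the results of all objects in that frame.
track_cache: Dict[int, List[TrackResult]] = {}

def store_result(frame_index: int, results: List[TrackResult]) -> None:
    """Store the tracking result of the current frame for reuse with the next frame."""
    track_cache[frame_index] = results

def get_previous_result(frame_index: int) -> List[TrackResult]:
    """Fetch the stored tracking result of the previous frame instead of re-detecting it."""
    return track_cache.get(frame_index - 1, [])
```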
It should be noted that the method for performing the first image detection on each image frame may be the same. For example, the method for performing the first image detection on the first image may be the same as that for performing the first image detection on the second image in step S202. In order to simplify the description, the second image is taken as an example to introduce the first image detection method. As for the process of performing the first image detection on the first image to obtain the first feature information and the first detection frame of the head area, please refer to the following description of performing the first image detection on the second image, which will not be repeated herein.
As an example, step S202 may include: obtaining the human body detection frame and the corresponding second feature information of each human object in the second image by inputting the second image into a human body detection model; and obtaining the second detection frame of the head area in the second image by inputting the second image into a detection model for the head area.
In the foregoing manner, it is equivalent to separating the detection task of the human body from that of the head area. However, since the two tasks are performed separately, they cannot share the feature information of the images (i.e., the first feature information and the second feature information), and the two models need to perform feature extraction on the second image respectively, which increases the complexity of the models and the amount of data processing.
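A minimal sketch of this two-model variant is given below; the model objects and their call signatures are hypothetical placeholders rather than an actual library API. Each detector runs separately on the same image and therefore repeats its own feature extraction:

```python
def detect_with_two_models(second_image, human_body_model, head_area_model):
    """Hypothetical two-model detection: each model extracts features from the image on its own."""
    # Human body detection model: human body detection frames and the second feature information.
    body_frames, second_features = human_body_model(second_image)
    # Separate detection model for the head area: the second detection frames of the head areas.
    head_frames = head_area_model(second_image)
    return body_frames, second_features, head_frames
```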
As another example, step S202 may include the following steps:
In this embodiment, the second image detection may be a human body detection. Performing the second image detection on the second image is equivalent to performing a human body detection task.
In steps I-II, it is equivalent to detecting the head area using the result (i.e., the first key point) of the human body detection task. In this manner, it is equivalent to merging the human body detection task and the head area detection task together for execution, which can effectively reduce the complexity of the models and the amount of data processing, and is conducive to improving the processing efficiency.
In some implementations, step I may include:
In this embodiment, the multi-task model is for extracting the feature information of the human image (e.g., the human object) in the to-be-processed image (e.g., the second image) and detecting the key points of the human body in the to-be-processed image. Through the multi-task model, it is equivalent to merging the human body detection task and the head area detection task together for execution, which can effectively reduce the complexity of the models and the amount of data processing, and is conducive to improving the processing efficiency.
In which, the multi-task model may be a neural network or another model capable of implementing the image detection algorithm, which will not be specifically limited herein.
In some embodiments, the human body detection module 31 may include a backbone network 311, an intermediate network 312, and a detection network 313. In which, the backbone network 311 is configured to extract features from the input image; the intermediate network 312 is configured to aggregate and refine the features extracted by the backbone network 311, for example, enhancing the feature expression capability and receptive field of the model; the detection network 313 is configured to perform the human body detection based on the extracted features to output the human body detection frame.
Based on the multi-task model in
In which, the process of detecting the human object in the second image using the human body detection module may include: inputting the second image into the above-mentioned multi-task model; first extracting, through the backbone network 311, the features from the second image; aggregating and refining, through the intermediate network 312, the features extracted by the backbone network 311 to obtain the processed feature information so as to input the processed feature information into the detection network 313; and detecting, through the detection network 313, the human object in the second image according to the processed feature information to obtain the human body detection frame for each human object.
As can be seen from the foregoing example, the key point detection module 32 and the feature extraction module 33 both perform processing based on the output result of the human body detection module 31, which is equivalent to the above-mentioned human body detection task, key point detection task, and human appearance feature extraction task sharing one set of feature information (i.e., the feature information output by the intermediate network 312), thereby realizing feature sharing. In this manner, the backbone network 311 and the intermediate network 312 only need to perform the feature extraction process once, which is conducive to improving the processing efficiency.
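A minimal sketch of such a multi-task model, assuming a PyTorch-style implementation; the layer configurations, channel sizes, and default parameters are illustrative assumptions and do not represent the specific network structure of the disclosure:

```python
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    """Illustrative multi-task model: a shared backbone and intermediate network feed a
    human body detection head, a key point head, and an appearance feature head."""

    def __init__(self, num_keypoints: int = 17, feature_dim: int = 128):
        super().__init__()
        # Backbone network: extracts features from the input image.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Intermediate network: aggregates and refines the backbone features.
        self.intermediate = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
        # Detection network: predicts the human body detection frame (4 box values + 1 score per cell).
        self.detect_head = nn.Conv2d(64, 5, 1)
        # Key point detection module: predicts key point heatmaps (e.g., eyes and shoulders).
        self.keypoint_head = nn.Conv2d(64, num_keypoints, 1)
        # Feature extraction module: outputs appearance feature information.
        self.feature_head = nn.Conv2d(64, feature_dim, 1)

    def forward(self, image: torch.Tensor):
        # The backbone and intermediate features are computed once and shared by all three heads.
        shared = self.intermediate(self.backbone(image))
        return self.detect_head(shared), self.keypoint_head(shared), self.feature_head(shared)

# Example: a single 3-channel 256x256 image.
boxes, keypoints, features = MultiTaskModel()(torch.randn(1, 3, 256, 256))
```

In the sketch, the detection, key point, and feature heads all consume the same shared tensor, mirroring the feature sharing described above.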
In this embodiment, before the multi-task model is applied, it is first trained to obtain the trained multi-task model. Using the trained multi-task model to perform the second image detection can not only improve the detection accuracy but also improve the detection efficiency.
In some embodiments, the process of training the multi-task model may include:
In an example of model training, a sample image may be input into the multi-task model so that the multi-task model outputs predicted feature information and predicted key points of each human object in the sample image; a first loss value between the predicted key points and the real key points corresponding to the sample image is calculated, and a second loss value between the predicted feature information and the real appearance features corresponding to the sample image is calculated; a total loss is calculated based on the first loss value and the second loss value; if the total loss is larger than or equal to a predetermined threshold, the model parameters of the multi-task model are updated according to the total loss; and the updated multi-task model continues to be trained based on the sample image until the total loss is less than the predetermined threshold, so as to obtain the trained multi-task model.
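A minimal sketch of this training procedure, reusing the `MultiTaskModel` sketch above and assuming a PyTorch-style setup; the loss functions, random sample tensors, and threshold value are illustrative assumptions rather than the training configuration of the disclosure:

```python
import torch
import torch.nn as nn

model = MultiTaskModel()  # the multi-task model sketched above

# Random placeholders standing in for one labeled training sample.
sample_image = torch.randn(1, 3, 256, 256)
real_keypoints = torch.randn(1, 17, 64, 64)   # real key points of the sample image
real_features = torch.randn(1, 128, 64, 64)   # real appearance features of the sample image

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
keypoint_loss_fn = nn.MSELoss()  # first loss: predicted key points vs. real key points
feature_loss_fn = nn.MSELoss()   # second loss: predicted features vs. real appearance features
threshold = 0.01                 # predetermined threshold (illustrative value)

total_loss = torch.tensor(float("inf"))
while total_loss.item() >= threshold:
    _, pred_keypoints, pred_features = model(sample_image)
    first_loss = keypoint_loss_fn(pred_keypoints, real_keypoints)
    second_loss = feature_loss_fn(pred_features, real_features)
    total_loss = first_loss + second_loss  # total loss from the first and second loss values

    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()  # update the model parameters according to the total loss
```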
It should be noted that the foregoing is only an example of model training. In a practical application, other training methods, such as controlling the number of iterations, may be used, which will not be specifically limited herein. In addition, any model that can implement the multiple tasks may be applied, and the specific structure of the multi-task model will not be specifically limited herein.
As an example, the first key point may include facial key points. Accordingly, step II may include: determining vertex positions of the head area according to the facial key points; and determining the second detection frame of the head area according to the vertex positions of the head area.
In this manner, it is equivalent to taking the facial area as the head area. However, in the application scenario of multiple human objects, the human face is often blocked. If merely the facial area is taken as the head area, the second detection frame obtained may be very small or may even fail to be detected, which will affect the subsequent object tracking.
As another example, the first key points may include eye key points and shoulder key points. Accordingly, step II may include:
In this manner, it is equivalent to comprehensively considering the facial area and the shoulder area when determining the head area, which increases the range of the head area. In the application scenario of multiple human objects, since the probability of the facial area and the shoulder area being blocked at the same time is small, it can effectively increase the probability of the head area being detected, while ensuring the reliability of the detection of the head area.
As an example, the first key points of the n-th human object may be expressed by an equation of:
It should be noted that the foregoing is only an example of calculating the vertices of the second detection frame. Since a pair of opposite corners can determine a rectangular frame, the vertex at the upper left corner and that at the lower right corner may be used to determine the second detection frame. Alternatively, in other examples, the vertex at the lower left corner and that at the upper right corner may be used to determine the second detection frame, or all four vertices may be used to determine the second detection frame, which will not be specifically limited herein.
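Purely as an illustration (the actual vertex equation of the disclosure is the one referenced above and is not reproduced here), the following hypothetical helper derives a head-area frame from eye and shoulder key points by spanning from slightly above the eyes down to the shoulder line; the key point names and the 0.5 extension ratio are assumptions made for the example:

```python
from typing import Dict, Tuple

def head_frame_from_keypoints(kps: Dict[str, Tuple[float, float]]) -> Tuple[float, float, float, float]:
    """Hypothetical head-area frame (x1, y1, x2, y2) from eye and shoulder key points.

    kps maps key point names to (x, y) pixel coordinates; y grows downward in image coordinates.
    """
    eye_y = min(kps["left_eye"][1], kps["right_eye"][1])
    shoulder_y = max(kps["left_shoulder"][1], kps["right_shoulder"][1])

    top = eye_y - 0.5 * (shoulder_y - eye_y)  # extend above the eyes to cover the top of the head
    bottom = shoulder_y                       # take the shoulder line as the lower bound

    xs = [kps[name][0] for name in ("left_eye", "right_eye", "left_shoulder", "right_shoulder")]
    return (min(xs), top, max(xs), bottom)    # upper-left and lower-right vertices

# Example usage with made-up coordinates:
frame = head_frame_from_keypoints({
    "left_eye": (110.0, 80.0), "right_eye": (130.0, 82.0),
    "left_shoulder": (95.0, 140.0), "right_shoulder": (150.0, 138.0),
})
```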
S203: recognizing the target human body from the human object in the second image according to a first similarity between the first feature information and the second feature information and a second similarity between the first detection frame and the second detection frame.
In some embodiments, the target human body may be recognized from the human object of the second image only according to the first similarity, or it may be recognized from the human object of the second image only according to the second similarity.
Since the first similarity represents the similarity between the appearance features of the human body in two consecutive image frames, and the second similarity represents the similarity between the corresponding detection frames in the two consecutive image frames, if the object tracking is performed only according to the first similarity, it cannot be performed accurately when there are multiple human bodies with similar appearances (e.g., wearing the same clothes); conversely, if the object tracking is performed only according to the second similarity, it cannot be performed accurately when the human bodies overlap severely.
In some embodiments, step S203 may include:
1) calculating the first similarity between the first feature information and the second feature information.
As an example, the cosine similarity, Mahalanobis distance, or Euclidean distance between the first feature information and the second feature information may be calculated and used as the first similarity.
2) calculating the second similarity between the first detection frame and the second detection frame.
As an example, an intersection-over-union ratio between the first detection frame and the second detection frame may be calculated and used as the second similarity. In which, the intersection-over-union ratio refers to the ratio of the area of the intersection of the two detection frames to the area of their union.
3) calculating, according to the first similarity and the second similarity, a third similarity between each human object and the target human body in the second image.
As an example, the first similarity and the second similarity may be added to obtain the third similarity.
As an example, a weighted sum of the first similarity and the second similarity may be calculated to obtain the third similarity. In which, the weight of the first similarity and that of the second similarity may be set according to actual needs or obtained through training.
4) recognizing, according to the third similarity, the target human body from the human object in the second image.
As an example, the human object corresponding to the maximum among the third similarities of different human objects may be recognized as the target human body.
In this embodiment, it is equivalent to comprehensively considering, between two consecutive image frames, the similarity of the appearances of the human bodies and that of the corresponding detection frames. In the cases that there are multiple human bodies with similar appearances and/or overlapping positions, the object tracking can still be performed accurately.
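A minimal sketch of the matching in steps 1)-4) above, assuming the feature information is a plain vector, each detection frame is given by its upper-left and lower-right vertices, and the third similarity is a weighted sum with illustrative equal weights:

```python
import math
from typing import List, Sequence, Tuple

Frame = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """First similarity: cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def iou(a: Frame, b: Frame) -> float:
    """Second similarity: intersection-over-union ratio of two detection frames."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def match_target(first_feature: Sequence[float], first_frame: Frame,
                 second_features: List[Sequence[float]], second_frames: List[Frame],
                 w_feature: float = 0.5, w_frame: float = 0.5) -> int:
    """Return the index of the human object in the second image whose weighted third
    similarity with the target human body is the largest."""
    third = [w_feature * cosine_similarity(first_feature, f) + w_frame * iou(first_frame, b)
             for f, b in zip(second_features, second_frames)]
    return max(range(len(third)), key=third.__getitem__)
```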
S501: obtaining first feature information of a target human body in a first image and a first detection frame of a head area of the target human body.
S502: obtaining second feature information of each human object in a second image and a second detection frame of a head area of the human object by performing a first image detection on the second image.
Steps S501-S502 are the same as the above-mentioned steps S201-S202. For details, please refer to the descriptions for steps S201-S202, which will not be repeated herein.
S503: obtaining third feature information of the target human body in a third image and a third detection frame of a head area of the target human body in response to there being the third image, where the third image is a frame antecedent to the first image.
The method of obtaining the third feature information of the target human body in the third image and the third detection frame of the head area of the target human body is the same as that of obtaining the first feature information of the target human body in the first image and the first detection frame of the head area of the target human body. For details, please refer to the descriptions for step S201, which will not be repeated herein.
S504: obtaining fourth feature information by calculating an average value of the first feature information and the third feature information.
In this embodiment, the feature information may be a vector. Accordingly, as an example, the average value may be calculated by: calculating an average value between a first element in the first feature information and a second element in the third feature information; and determining the fourth feature information based on the average value. In which, the first element corresponds to the second element. For example, if the first element is the first element in the first feature information, the second element is the first element in the third feature information; similarly, if the first element is the last element in the first feature information, the second element is the last element in the third feature information.
It can be understood that since the image detection method of each image frame is the same, the length of the feature information of the human object in each image frame is also the same.
S505: obtaining a fourth detection frame by calculating a middle position of the first detection frame and the third detection frame.
As an example, a coordinate average value between the coordinates of a first vertex in the first detection frame and the coordinates of a second vertex in the third detection frame may be calculated, and the fourth detection frame may be determined according to the coordinate average value. In which, the second vertex matches the first vertex. For example, if the first vertex is the vertex at the upper left corner of the first detection frame, the second vertex is the vertex at the upper left corner of the third detection frame; similarly, if the first vertex is the vertex at the lower right corner of the first detection frame, the second vertex is the vertex at the lower right corner of the third detection frame.
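A minimal sketch of steps S504-S505, assuming the feature information is a plain vector of equal length in every frame and each detection frame is represented by its upper-left and lower-right vertices; the function names are illustrative only:

```python
from typing import List, Tuple

Frame = Tuple[float, float, float, float]  # (x1, y1, x2, y2): upper-left and lower-right vertices

def average_features(first: List[float], third: List[float]) -> List[float]:
    """S504: element-wise average of the first and third feature information."""
    return [(a + b) / 2.0 for a, b in zip(first, third)]

def middle_frame(first_frame: Frame, third_frame: Frame) -> Frame:
    """S505: fourth detection frame from the coordinate averages of matching vertices."""
    return ((first_frame[0] + third_frame[0]) / 2.0,
            (first_frame[1] + third_frame[1]) / 2.0,
            (first_frame[2] + third_frame[2]) / 2.0,
            (first_frame[3] + third_frame[3]) / 2.0)

# Example usage:
fourth_feature = average_features([0.2, 0.4, 0.6], [0.4, 0.2, 0.8])
fourth_frame = middle_frame((10.0, 20.0, 50.0, 80.0), (14.0, 24.0, 54.0, 84.0))
```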
S506: recognizing, according to a third similarity between the fourth feature information and the second feature information, and a fourth similarity between the fourth detection frame and the second detection frame, the target human body from the human object in the second image.
It should be noted that this embodiment only shows the case where there are two previously collected image frames before the second image. In a practical application, if there are more than two previously collected image frames before the second image, the method may be executed by repeating steps S504-S506, which will not be described herein since the principle is the same.
In this embodiment, the tracking results of the image frames previously collected are comprehensively considered to perform the object tracking on the current image frame. In this manner, it can effectively reduce the influence of the error of the tracking result of a previous frame image on the subsequent object tracking results, which is conducive to improving the reliability of object tracking.
It should be understood that the sequence of the serial numbers of the steps in the above-mentioned embodiments does not imply the execution order. The execution order of each process should be determined by its function and internal logic, and should not be taken as any limitation on the implementation process of the embodiments.
As shown in
As an example, the detection unit 62 may be further configured to:
As an example, the detection unit 62 may be further configured to:
As an example, the multi-task model may include a human body detection module, a key point detection module, and a feature extraction module. Accordingly, the detection unit 62 may be further configured to:
As an example, the first key point may include eye key points and shoulder key points. Accordingly, the detection unit 62 may be further configured to:
As an example, the tracking unit 63 may be further configured to:
As an example, the tracking unit 63 may be further configured to:
It should be noted that the information interaction, execution process, and other contents between the above-mentioned apparatuses/units are based on the same concept as the method embodiments of the present disclosure. For their specific functions and technical effects, reference may be made to the method embodiment part, which will not be described herein.
In addition, the object tracking apparatus 6 shown in
Those skilled in the art may clearly understand that, for the convenience and simplicity of description, the division of the above-mentioned functional units and modules is merely an example for illustration. In actual applications, the above-mentioned functions may be allocated to and performed by different functional units according to requirements, that is, the internal structure of the device may be divided into different functional units or modules to complete all or part of the above-mentioned functions. The functional units and modules in the embodiments may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The above-mentioned integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific name of each functional unit and module is merely for the convenience of distinguishing them from each other and is not intended to limit the scope of protection of the present disclosure. For the specific operation process of the units and modules in the above-mentioned system, reference may be made to the corresponding processes in the above-mentioned method embodiments, which will not be described herein.
The terminal device 7 may be a computing device such as a desktop computer, a notebook computer, a tablet computer, and a cloud server. The terminal device 7 may include, but is not limited to, the processor 70 and the storage 71. It can be understood by those skilled in the art that
The processor 70 may be a central processing unit (CPU), another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA), another programmable logic device, a discrete gate, a transistor logic device, or a discrete hardware component. The general purpose processor may be a microprocessor, or the processor may also be any conventional processor.
The storage 71 may be an internal storage unit of the terminal device 7, for example, a hard drive or a memory of the terminal device 7. In other embodiments, the storage 71 may also be an external storage device of the terminal device 7, for example, a plug-in hard drive, a smart media card (SMC), a secure digital (SD) card, or a flash card that is equipped on the terminal device 7. Furthermore, the storage 71 may also include both the internal storage units and the external storage devices of the terminal device 7. The storage 71 may be configured to store operating systems, applications, boot loaders, data, and other programs such as codes of computer programs. The storage 71 may also be configured to temporarily store data that has been output or will be output.
The present disclosure further provides a computer-readable storage medium. The computer-readable storage medium is stored with a computer program. When the computer program is executed by a processor, the steps in each of the above-mentioned method embodiments can be implemented.
The present disclosure further provides a computer program product. When the computer program product is executed on the terminal device, the terminal device can implement the steps in the above-mentioned method embodiments.
When the integrated unit is implemented in the form of a software functional unit and is sold or used as an independent product, the integrated unit may be stored in a non-transitory computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above-mentioned embodiments of the present disclosure may be implemented by instructing relevant hardware through a computer program. The computer program may be stored in a non-transitory computer-readable storage medium, and can implement the steps of each of the above-mentioned method embodiments when executed by a processor. In which, the computer program includes computer program codes which may be in the form of source codes, object codes, executable files, certain intermediate forms, and the like. The computer-readable medium may include any entity or device capable of carrying the computer program codes to the apparatus/terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), electric carrier signals, telecommunication signals, and software distribution media, for example, a USB flash drive, a portable hard disk, a magnetic disk, an optical disk, or the like. In some jurisdictions, according to the legislation and patent practice, a computer readable medium cannot be electric carrier signals and telecommunication signals.
In the above-mentioned embodiments, the description of each embodiment has its focuses, and the parts which are not described or mentioned in one embodiment may refer to the related descriptions in other embodiments.
Those ordinary skilled in the art may clearly understand that, the exemplificative modules and steps described in the embodiments disclosed herein may be implemented through electronic hardware or a combination of computer software and electronic hardware. Whether these functions are implemented through hardware or software depends on the specific application and design constraints of the technical schemes. Those ordinary skilled in the art may implement the described functions in different manners for each particular application, while such implementation should not be considered as beyond the scope of the present disclosure.
In the embodiments provided by the present disclosure, it should be understood that the disclosed apparatus (device), terminal device and method may be implemented in other manners. For example, the above-mentioned apparatus/terminal device embodiment is merely exemplary. For example, the division of modules or units is merely a logical functional division, and other division manner may be used in actual implementations, that is, multiple units or components may be combined or be integrated into another system, or some of the features may be ignored or not performed. In addition, the shown or discussed mutual coupling may be direct coupling or communication connection, and may also be indirect coupling or communication connection through some interfaces, devices or units, and may also be electrical, mechanical or other forms.
The modules described as separate components may or may not be physically separated. The components represented as modules may or may not be physical modules, that is, may be located in one place or be distributed to multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the objectives of this embodiment.
The above-mentioned embodiments are merely intended for describing but not for limiting the technical schemes of the present disclosure. Although the present disclosure is described in detail with reference to the above-mentioned embodiments, it should be understood by those skilled in the art that, the technical schemes in each of the above-mentioned embodiments may still be modified, or some of the technical features may be equivalently replaced, while these modifications or replacements do not make the essence of the corresponding technical schemes depart from the spirit and scope of the technical schemes of each of the embodiments of the present disclosure, and should be included within the scope of the present disclosure.