This application is based on and claims priority under 35 U.S.C. § 119 to Korean Patent Application No. 10-2023-0184806, filed on Dec. 18, 2023, in the Korean Intellectual Property Office, the disclosure of which is herein incorporated by reference in its entirety.
The disclosure relates to deep learning-based motion recognition, and more particularly, to a method and a system for efficiently recognizing continuous motions such as hand signals of a traffic officer.
Continuous motion recognition technology recognizes motions by collecting continuous motion data from various sensors and interpreting the collected data, and its core function is to appropriately extract feature information for motion recognition.
As shown in
To solve this problem, a SlowFast method that uses features of an image has appeared, which is illustrated in the accompanying drawings.
However, since the fast pathway uses only image data, it is highly likely to fail to recognize motions that are not well expressed in the image, and because two separate pathways are constructed, a large amount of computation is required, resulting in long processing times and high power consumption.
The disclosure has been developed to solve the above-described problems, and an object of the disclosure is to provide a method and a system that recognize motions by using spatial features and temporal features of image data together with features of key point data, as a solution for enabling motion recognition to be performed on a small, low-power edge device having relatively low computing power and for enhancing motion recognition performance.
To achieve the above-described object, a motion recognition method according to an embodiment of the disclosure may include: a first reshaping step of reshaping time-series image data obtained by shooting a target object to a type of image data of a spatial domain; a first extraction step of extracting spatial features from the reshaped image data; a second reshaping step of reshaping the image data from which the spatial features are extracted to a type of time-series image data; a step of integrating the time-series image data and time-series key point data of the target object; a second extraction step of extracting temporal features from the integrated time-series data; and a step of recognizing motions of the target object based on the extracted temporal features.
The time-series image data may be time-series image data of a bounding box through which the target object is detected.
The first reshaping step may include reshaping the time-series image data to the type of image data of the spatial domain according to the following equation:

If(B×Seq, C, W, H) = Reshape(I(B, Seq, C, W, H))

where If(B×Seq, C, W, H) is the image data of the spatial domain, I(B, Seq, C, W, H) is the time-series image data, B is a batch size, Seq is a sequence length, C is a channel, W is a width, and H is a height.
The second reshaping step may include reshaping the image data from which the spatial features are extracted to the type of time-series image data according to the following equation:

Iseq(B, Seq, dim0) = Reshape(X(B×Seq, dim0))

where Iseq(B, Seq, dim0) is the time-series image data, X(B×Seq, dim0) is the image data from which the spatial features are extracted, and dim0 is a dimension of the image data from which the spatial features are extracted.
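By way of illustration only (not part of the original disclosure), the two reshaping steps may be sketched at the tensor-shape level as follows, assuming a PyTorch implementation; all sizes, including dim0 and the placeholder spatial features X, are hypothetical.

```python
import torch

# Illustrative sizes only: batch B, sequence length Seq, channel C, width W, height H,
# and an assumed spatial feature dimension dim0.
B, Seq, C, W, H = 2, 16, 3, 112, 112
dim0 = 512

I = torch.randn(B, Seq, C, W, H)            # time-series image data I(B, Seq, C, W, H)

# First reshaping: fold the sequence axis into the batch axis -> If(B×Seq, C, W, H)
I_f = I.reshape(B * Seq, C, W, H)

# (a spatial feature extractor would map I_f to X(B×Seq, dim0) at this point)
X = torch.randn(B * Seq, dim0)

# Second reshaping: restore the sequence axis -> Iseq(B, Seq, dim0)
I_seq = X.reshape(B, Seq, dim0)

print(I_f.shape, I_seq.shape)               # torch.Size([32, 3, 112, 112]) torch.Size([2, 16, 512])
```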
The second extraction step may include extracting the temporal features from the integrated time-series data by using a transformer encoder.
According to an embodiment, the motion recognition method may further include a step of adding an index and position information of each key point to the time-series key point data.
The step of adding may include: generating an index of each key point through input embedding; and generating position information of each key point through positional encoding.
The step of integrating may include integrating the time-series image data and the time-series key point data by concatenating them.
The target object may be a traffic officer, and the motions may be hand signals.
According to another embodiment of the disclosure, a motion recognition system may include: a first extraction unit configured to reshape time-series image data obtained by shooting a target object to a type of image data of a spatial domain, and to extract spatial features from the reshaped image data; a second reshaping unit configured to reshape the image data from which the spatial features are extracted to a type of time-series image data; an integration unit configured to integrate the time-series image data and time-series key point data of the target object; a second extraction unit configured to extract temporal features from the integrated time-series data; and a recognition unit configured to recognize motions of the target object based on the extracted temporal features.
According to still another embodiment of the disclosure, a motion recognition method may include: a first extraction step of extracting spatial features from time-series image data obtained by shooting a target object; a step of integrating the time-series image data from which the spatial features are extracted, and time-series key point data of the target object; a second extraction step of extracting temporal features from the integrated time-series data; and a step of recognizing motions of the target object based on the extracted temporal features.
According to yet another embodiment of the disclosure, a motion recognition system may include: a first extraction unit configured to extract spatial features from time-series image data obtained by shooting a target object; an integration unit configured to integrate the time-series image data from which the spatial features are extracted, and time-series key point data of the target object; a second extraction unit configured to extract temporal features from the integrated time-series data; and a recognition unit configured to recognize motions of the target object based on the extracted temporal features.
As described above, according to embodiments of the disclosure, motions may be recognized by using spatial features and temporal features of image data together with features of key point data, so that motions can be recognized more stably even when a plurality of objects are present at the same time and overlaps or occlusions frequently occur.
According to embodiments of the disclosure, spatial features and temporal features of image data are embedded in sequence, so that a large amount of computation is not required and motion recognition can be performed on a small, low-power edge device having relatively low computing power.
Other aspects, advantages, and salient features of the invention will become apparent to those skilled in the art from the following detailed description, which, taken in conjunction with the annexed drawings, discloses exemplary embodiments of the invention.
Before undertaking the DETAILED DESCRIPTION OF THE INVENTION below, it may be advantageous to set forth definitions of certain words and phrases used throughout this patent document: the terms “include” and “comprise,” as well as derivatives thereof, mean inclusion without limitation; the term “or” is inclusive, meaning and/or; the phrases “associated with” and “associated therewith,” as well as derivatives thereof, may mean to include, be included within, interconnect with, contain, be contained within, connect to or with, couple to or with, be communicable with, cooperate with, interleave, juxtapose, be proximate to, be bound to or with, have, have a property of, or the like. Definitions for certain words and phrases are provided throughout this patent document; those of ordinary skill in the art should understand that in many, if not most instances, such definitions apply to prior, as well as future uses of such defined words and phrases.
For a more complete understanding of the present disclosure and its advantages, reference is now made to the following description taken in conjunction with the accompanying drawings, in which like reference numerals represent like parts:
Hereinafter, the disclosure will be described in more detail with reference to the accompanying drawings.
Embodiments of the disclosure provide a deep learning-based motion recognition method and system using multiple feature information.
The disclosure relates to a technology for recognizing motions that are frequently occluded by other objects, such as hand signals of a traffic officer, by using spatial features and temporal features of image data together with features of key point data, and for recognizing motions with few computations by embedding the spatial features and the temporal features of the image data in sequence.
The image feature extraction unit 110 may receive time-series image data that is obtained by cutting only a bounding box through which a target object is detected from time-series data (image sequences) obtained by shooting the target object, and may extract spatial features.
As shown in the drawings, the image feature extraction unit 110 reshapes the received time-series image data to a type of image data of a spatial domain according to the following equation:

If(B×Seq, C, W, H) = Reshape(I(B, Seq, C, W, H))

where If(B×Seq, C, W, H) is the image data of the spatial domain, I(B, Seq, C, W, H) is the time-series image data, B is a batch size, Seq is a sequence length, C is a channel, W is a width, and H is a height.
This process reshapes image data of a temporal domain into image data of a spatial domain; that is, it transforms 5D data (B, Seq, C, W, H) into 4D data (B×Seq, C, W, H).
Next, the image feature extraction unit 110 extracts features in the spatial domain, that is, spatial features, from the reshaped image data (spatial domain feature extraction). The spatial domain feature extraction may be performed by using a lightweight deep learning network such as ResNet-18, EfficientNet-B0, MobileNet, or MobileNetV2.
Referring back to the drawings, the image data reshaping unit 120 reshapes the image data from which the spatial features are extracted to a type of time-series image data according to the following equation:

Iseq(B, Seq, dim0) = Reshape(X(B×Seq, dim0))

where Iseq(B, Seq, dim0) is the time-series image data, X(B×Seq, dim0) is the image data of the spatial domain from which the features are extracted, and dim0 is a dimension of the image data of the spatial domain from which the features are extracted.
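As a non-limiting sketch, units 110 and 120 may be combined as follows, assuming PyTorch and torchvision; the choice of MobileNetV2, the module name SpatialFeatureExtractor, and the value of dim0 (1280 for MobileNetV2) are illustrative assumptions, not a statement of the disclosed implementation. A different lightweight backbone from the list above would simply yield a different dim0.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

class SpatialFeatureExtractor(nn.Module):
    """Hypothetical sketch of units 110 and 120 (illustrative only)."""
    def __init__(self, dim0=1280):
        super().__init__()
        backbone = mobilenet_v2(weights=None)          # one of the lightweight networks named above
        self.features = backbone.features              # convolutional feature layers
        self.pool = nn.AdaptiveAvgPool2d(1)            # global pooling to a dim0-dimensional vector
        self.dim0 = dim0

    def forward(self, I):                              # I(B, Seq, C, W, H): time-series image data
        B, Seq, C, W, H = I.shape
        I_f = I.reshape(B * Seq, C, W, H)              # first reshaping (temporal -> spatial domain)
        X = self.pool(self.features(I_f)).flatten(1)   # spatial features X(B×Seq, dim0)
        return X.reshape(B, Seq, self.dim0)            # second reshaping -> Iseq(B, Seq, dim0)
```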
Meanwhile, the key point encoding unit 130 may receive time-series key point data that is extracted from the time-series image data which is obtained by cutting only the bounding box through which the target object is detected from the time-series image data obtained by shooting the target object, and may encode the time-series key point data.
As shown in the drawings, the key point encoding unit 130 generates an index of each key point through input embedding, generates position information of each key point through positional encoding, and adds the index and the position information to the time-series key point data.
The key point data should be processed by a transformer encoder which will be described below. However, since the transformer encoder does not process data in sequence, the index and the position information are added to the key point data.
Encoded time-series key point data may be expressed by Ikey(B,Seq,dim1). Herein, dim1 indicates a dimension of encoded time-series key point data.
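For illustration only, one possible sketch of the key point encoding unit 130, assuming PyTorch, is shown below; the number of key points, the value of dim1, the fixed sinusoidal positional encoding, and the way the coordinate projection and index embedding are combined are all assumptions rather than the disclosed implementation.

```python
import math
import torch
import torch.nn as nn

class KeyPointEncoder(nn.Module):
    """Hypothetical sketch of the key point encoding unit 130 (illustrative only)."""
    def __init__(self, num_keypoints=17, dim1=64):
        super().__init__()
        self.coord_proj = nn.Linear(2, dim1)                   # project (x, y) coordinates
        self.index_embed = nn.Embedding(num_keypoints, dim1)   # input embedding: key point index
        # Fixed sinusoidal positional encoding over the key point positions
        pe = torch.zeros(num_keypoints, dim1)
        pos = torch.arange(num_keypoints, dtype=torch.float32).unsqueeze(1)
        div = torch.exp(torch.arange(0, dim1, 2, dtype=torch.float32) * (-math.log(10000.0) / dim1))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, K):                     # K(B, Seq, num_keypoints, 2): key point coordinates
        B, Seq, N, _ = K.shape
        idx = torch.arange(N, device=K.device)
        x = self.coord_proj(K) + self.index_embed(idx) + self.pe[:N]
        return x.mean(dim=2)                  # Ikey(B, Seq, dim1)
```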
Referring back to the drawings, the data integration unit 140 integrates the time-series image data reshaped by the image data reshaping unit 120 and the time-series key point data encoded by the key point encoding unit 130 by concatenating them according to the following equation:

Iuf(B, Seq, dim2) = Concat(Iseq(B, Seq, dim0), Ikey(B, Seq, dim1))

where Iuf(B, Seq, dim2) is the integrated time-series data, Iseq(B, Seq, dim0) is the time-series image data from which the spatial features are extracted, Ikey(B, Seq, dim1) is the encoded time-series key point data, B is a batch size, Seq is a sequence length, and dim2 is a dimension of the integrated time-series data.
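A minimal sketch of this concatenation, assuming PyTorch and illustrative dimensions, is shown below; in this illustration dim2 = dim0 + dim1.

```python
import torch

B, Seq, dim0, dim1 = 2, 16, 1280, 64        # illustrative sizes only
I_seq = torch.randn(B, Seq, dim0)           # time-series image data with spatial features
I_key = torch.randn(B, Seq, dim1)           # encoded time-series key point data

I_uf = torch.cat([I_seq, I_key], dim=-1)    # Iuf(B, Seq, dim2), here dim2 = dim0 + dim1
print(I_uf.shape)                           # torch.Size([2, 16, 1344])
```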
The integrated feature extraction unit 150 extracts temporal features from the time-series data integrated by the data integration unit 140. Temporal domain feature extraction may be performed by using a lightweight transformer encoder, an example of which is illustrated in the accompanying drawings.
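By way of example only, such a lightweight transformer encoder may be instantiated as follows, assuming PyTorch; the number of layers, attention heads, and feed-forward size are illustrative assumptions.

```python
import torch.nn as nn

dim2 = 1344                                  # dimension of the integrated time-series data (illustrative)
encoder_layer = nn.TransformerEncoderLayer(
    d_model=dim2, nhead=8, dim_feedforward=2 * dim2,
    batch_first=True)                        # inputs shaped (B, Seq, dim2)
temporal_encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
# temporal_features = temporal_encoder(I_uf) # -> (B, Seq, dim2)
```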
Referring back to the drawings, the motion recognition unit 160 recognizes motions of the target object based on the temporal features extracted by the integrated feature extraction unit 150.
To recognize motions, the image feature extraction unit 110 reshapes time-series image data, which is obtained by cutting only the bounding box through which a target object is detected from time-series image data obtained by shooting the target object, to a type of image data of a spatial domain (S210).
The image feature extraction unit 110 extracts spatial features from the image data reshaped in step S210 (S220), and the image data reshaping unit 120 reshapes the image data from which the spatial features are extracted in step S220 to a type of time-series image data which is image data of a temporal domain (S230).
Meanwhile, the key point encoding unit 130 encodes time-series key point data, which is extracted from the time-series image data obtained by cutting only the bounding box through which the target object is detected from the time-series image data obtained by shooting the target object, and adds an index and position information of each key point (S240).
The data integration unit 140 integrates the time-series image data from which the spatial features are extracted and which is reshaped in step S230, and the time-series key point data which is encoded in step S240 (S250).
Thereafter, the integrated feature extraction unit 150 extracts temporal features from the time-series data integrated in step S250, and the motion recognition unit 160 recognizes motions of the target object based on the extracted temporal features (S260).
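Tying steps S210 to S260 together, a hypothetical end-to-end module, assuming PyTorch and reusing the hypothetical modules sketched above, might look as follows; the classification head and the number of motion (hand signal) classes are assumptions, not part of the disclosure.

```python
import torch
import torch.nn as nn

class MotionRecognizer(nn.Module):
    """Hypothetical end-to-end sketch of steps S210 to S260 (illustrative only)."""
    def __init__(self, spatial, keypoint, temporal, dim2=1344, num_classes=10):
        super().__init__()
        self.spatial = spatial        # units 110/120: reshaping and spatial feature extraction (S210-S230)
        self.keypoint = keypoint      # unit 130: key point encoding (S240)
        self.temporal = temporal      # unit 150: temporal feature extraction
        self.head = nn.Linear(dim2, num_classes)   # unit 160: motion recognition

    def forward(self, I, K):          # I(B, Seq, C, W, H) images, K(B, Seq, N, 2) key points
        I_seq = self.spatial(I)                     # (B, Seq, dim0)
        I_key = self.keypoint(K)                    # (B, Seq, dim1)
        I_uf = torch.cat([I_seq, I_key], dim=-1)    # S250: integration -> (B, Seq, dim2)
        T = self.temporal(I_uf)                     # S260: temporal features (B, Seq, dim2)
        return self.head(T.mean(dim=1))             # S260: motion class scores
```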
Up to now, a deep learning-based motion recognition method and system using multiple feature information have been described in detail with reference to preferred embodiments.
In the above-described embodiments, motions that are frequently occluded by other objects, such as hand signals of a traffic officer, can be recognized by using spatial features and temporal features of image data together with features of key point data, and motions can be recognized with few computations by embedding the spatial features and the temporal features of the image data in sequence.
The technical concept of the disclosure may be applied to a computer-readable recording medium which records a computer program for performing the functions of the apparatus and the method according to the present embodiments. In addition, the technical idea according to various embodiments of the disclosure may be implemented in the form of a computer readable code recorded on the computer-readable recording medium. The computer-readable recording medium may be any data storage device that can be read by a computer and can store data. For example, the computer-readable recording medium may be a read only memory (ROM), a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical disk, a hard disk drive, or the like. A computer readable code or program that is stored in the computer readable recording medium may be transmitted via a network connected between computers.
In addition, while preferred embodiments of the present disclosure have been illustrated and described, the present disclosure is not limited to the above-described specific embodiments. Various changes can be made by a person skilled in the art without departing from the scope of the present disclosure claimed in the claims, and such changed embodiments should not be understood as being separate from the technical idea or prospect of the present disclosure.