The present disclosure relates generally to wireless communication systems and, more specifically, the present disclosure relates to a system and method for human gesture recognition and activity detection.
Gesture recognition and activity detection systems are typically vision-based and use a camera and motion sensor to track user movements. However, these systems raise privacy concerns for users. As such, there is an increasing interest in using electromagnetic (EM) signals, such as millimeter wave (mmWave) radar, ultra-wide band (UWB) radar, and wireless fidelity (Wi-Fi), for human gesture recognition and activity detection. These modalities address privacy concerns regarding video capture systems while also providing accurate detection and recognition capability, particularly when coupled with machine learning (ML) algorithms.
However, there is a lack of real EM signature data, e.g., Doppler and micro-Doppler signatures from mmWave or UWB radar, limiting the accuracy and precision potential of ML algorithms for a variety of detection and recognition tasks. Accordingly, there is a need for systems and methods for improved human gesture recognition and activity detection based on EM modalities that overcome these challenges.
The present disclosure relates generally to wireless communication systems and, more specifically, the present disclosure relates to a system and method for human gesture recognition and activity detection.
In one embodiment, a computer-implemented method is provided. The computer-implemented method includes receiving motion capture data of a target from a camera, generating a first set of motion trajectories from the motion capture data, generating a first set of augmented motion trajectories using a set of data augmentation functions on the first set of motion trajectories, generating a radar cross-section of the target using the motion capture data to perform at least one of gesture recognition or activity detection, generating one or more synthetic electromagnetic (EM) signatures of one or more activities of the target using the first set of augmented motion trajectories and the radar cross-section, and training a machine learning model configured for EM signature-based gesture recognition or activity detection with a domain adaptation process using the one or more synthetic EM signatures.
In another embodiment, a gesture recognition and activity detection system is provided. The gesture recognition and activity detection system includes at least one camera configured for motion capture and a controller coupled to the at least one camera. The controller is configured to receive motion capture data of a target from the at least one camera, generate a first set of motion trajectories from the motion capture data, generate a first set of augmented motion trajectories using a set of data augmentation functions on the first set of motion trajectories, generate a radar cross-section of the target using the motion capture data to perform at least one of gesture recognition or activity detection, generate one or more synthetic electromagnetic (EM) signatures of one or more activities of the target using the first set of augmented motion trajectories and the radar cross-section, and train a machine learning model configured for EM signature-based gesture recognition or activity detection with a domain adaptation process using the one or more synthetic EM signatures.
In yet another embodiment, a non-transitory computer-readable medium is provided. The non-transitory computer-readable medium includes program code, that when executed by at least one processor of an electronic device, causes the electronic device to receive motion capture data of a target from a camera, generate a first set of motion trajectories from the motion capture data, generate a first set of augmented motion trajectories using a set of data augmentation functions on the first set of motion trajectories, generate a radar cross-section of the target using the motion capture data to perform at least one of gesture recognition or activity detection, generate one or more synthetic electromagnetic (EM) signatures of one or more activities of the target using the first set of augmented motion trajectories and the radar cross-section, and train a machine learning model configured for EM signature-based gesture recognition or activity detection with a domain adaptation process using the one or more synthetic EM signatures.
Other technical features may be readily apparent to one skilled in the art from the following figures, descriptions, and claims.
Before undertaking the DETAILED DESCRIPTION below, it may be advantageous to set forth definitions of certain words and phrases used throughout this patent document. The term “couple” and its derivatives refer to any direct or indirect communication between two or more elements, whether or not those elements are in physical contact with one another. The terms “transmit,” “receive,” and “communicate,” as well as derivatives thereof, encompass both direct and indirect communication. The terms “include” and “comprise,” as well as derivatives thereof, mean inclusion without limitation. The term “or” is inclusive, meaning and/or. The phrase “associated with,” as well as derivatives thereof, means to include, be included within, interconnect with, contain, be contained within, connect to or with, couple to or with, be communicable with, cooperate with, interleave, juxtapose, be proximate to, be bound to or with, have, have a property of, have a relationship to or with, or the like. The term “controller” means any device, system, or part thereof that controls at least one operation. Such a controller may be implemented in hardware or a combination of hardware and software and/or firmware. The functionality associated with any particular controller may be centralized or distributed, whether locally or remotely. The phrase “at least one of,” when used with a list of items, means that different combinations of one or more of the listed items may be used, and only one item in the list may be needed. For example, “at least one of: A, B, and C” includes any of the following combinations: A, B, C, A and B, A and C, B and C, and A and B and C.
Moreover, various functions described below can be implemented or supported by one or more computer programs, each of which is formed from computer readable program code and embodied in a computer readable medium. The terms “application” and “program” refer to one or more computer programs, software components, sets of instructions, procedures, functions, objects, classes, instances, related data, or a portion thereof adapted for implementation in a suitable computer readable program code. The phrase “computer readable program code” includes any type of computer code, including source code, object code, and executable code. The phrase “computer readable medium” includes any type of medium capable of being accessed by a computer, such as read only memory (ROM), random access memory (RAM), a hard disk drive, a compact disc (CD), a digital video disc (DVD), or any other type of memory. A “non-transitory” computer readable medium excludes wired, wireless, optical, or other communication links that transport transitory electrical or other signals. A non-transitory computer readable medium includes media where data can be permanently stored and media where data can be stored and later overwritten, such as a rewritable optical disc or an erasable memory device.
Definitions for other certain words and phrases are provided throughout this patent document. Those of ordinary skill in the art should understand that in many if not most instances, such definitions apply to prior as well as future uses of such defined words and phrases.
For a more complete understanding of the present disclosure and its advantages, reference is now made to the following description taken in conjunction with the accompanying drawings, in which like reference numerals represent like parts:
The communication system 100 includes a network 102 that facilitates communication between various components in the communication system 100. For example, the network 102 can communicate IP packets, frame relay frames, Asynchronous Transfer Mode (ATM) cells, or other information between network addresses. The network 102 includes one or more local area networks (LANs), metropolitan area networks (MANs), wide area networks (WANs), all or a portion of a global network such as the Internet, or any other communication system or systems at one or more locations.
In this example, the network 102 facilitates communications between a server 104 and various client devices 106-114. The client devices 106-114 may be, for example, a smartphone, a tablet computer, a laptop, a personal computer, a wearable device, a head mounted display, AR/VR glasses, a television, an audio playback system or the like. The server 104 can represent one or more servers. Each server 104 includes any suitable computing or processing device that can provide computing services for one or more client devices, such as the client devices 106-114. Each server 104 could, for example, include one or more processing devices, one or more memories storing instructions and data, and one or more network interfaces facilitating communication over the network 102.
Each of the client devices 106-114 represents any suitable computing or processing device that interacts with at least one server (such as the server 104) or other computing device(s) over the network 102. The client devices 106-114 include a desktop computer 106, a mobile telephone or mobile device 108 (such as a smartphone), a PDA 110, a laptop computer 112, and AR/VR glasses 114. However, any other or additional client devices could be used in the communication system 100. Smartphones represent a class of mobile devices 108 that are handheld devices with mobile operating systems and integrated mobile broadband cellular network connections for voice, short message service (SMS), and Internet data communications. In certain embodiments, any of the client devices 106-114 can emit and collect radar signals via a radar transceiver. In certain embodiments, the client devices 106-114 are able to sense the presence of an object located close to the client device and determine whether the location of the detected object is within a first area 120 or a second area 122 closer to the client device than a remainder of the first area 120 that is external to the second area 122. In certain embodiments, the boundary of the second area 122 is at a predefined proximity (e.g., 5 centimeters away) that is closer to the client device than the boundary of the first area 120, and the first area 120 can be within a different predefined range (e.g., 30 meters away) from the client device where the user is likely to perform a gesture.
In this example, some client devices 108 and 110-114 communicate indirectly with the network 102. For example, the mobile device 108 and PDA 110 communicate via one or more base stations 116, such as cellular base stations or eNodeBs (eNBs) or gNodeBs (gNBs). Also, the laptop computer 112 and the AR/VR glasses 114 communicate via one or more wireless access points 118, such as IEEE 802.11 wireless access points. Note that these are for illustration only and that each of the client devices 106-114 could communicate directly with the network 102 or indirectly with the network 102 via any suitable intermediate device(s) or network(s). In certain embodiments, any of the client devices 106-114 transmit information securely and efficiently to another device, such as, for example, the server 104.
Although
As shown in
The transceiver(s) 210 can include an antenna array 205 including numerous antennas. The antennas of the antenna array can include a radiating element composed of a conductive material or a conductive pattern formed in or on a substrate. The transceiver(s) 210 transmit and receive a signal or power to or from the electronic device 200. The transceiver(s) 210 receives an incoming signal transmitted from an access point (such as a base station, WiFi router, or BLUETOOTH device) or other device of the network 102 (such as a WiFi, BLUETOOTH, cellular, 5G, 6G, LTE, LTE-A, WiMAX, or any other type of wireless network). The transceiver(s) 210 down-converts the incoming RF signal to generate an intermediate frequency or baseband signal. The intermediate frequency or baseband signal is sent to the RX processing circuitry 225 that generates a processed baseband signal by filtering, decoding, and/or digitizing the baseband or intermediate frequency signal. The RX processing circuitry 225 transmits the processed baseband signal to the speaker 230 (such as for voice data) or to the processor 240 for further processing (such as for web browsing data).
The TX processing circuitry 215 receives analog or digital voice data from the microphone 220 or other outgoing baseband data from the processor 240. The outgoing baseband data can include web data, e-mail, or interactive video game data. The TX processing circuitry 215 encodes, multiplexes, and/or digitizes the outgoing baseband data to generate a processed baseband or intermediate frequency signal. The transceiver(s) 210 receives the outgoing processed baseband or intermediate frequency signal from the TX processing circuitry 215 and up-converts the baseband or intermediate frequency signal to a signal that is transmitted.
The processor 240 can include one or more processors or other processing devices. The processor 240 can execute instructions that are stored in the memory 260, such as the OS 261 in order to control the overall operation of the electronic device 200. For example, the processor 240 could control the reception of downlink (DL) channel signals and the transmission of uplink (UL) channel signals by the transceiver(s) 210, the RX processing circuitry 225, and the TX processing circuitry 215 in accordance with well-known principles. The processor 240 can include any suitable number(s) and type(s) of processors or other devices in any suitable arrangement. For example, in certain embodiments, the processor 240 includes at least one microprocessor or microcontroller. Example types of processor 240 include microprocessors, microcontrollers, digital signal processors, field programmable gate arrays, application specific integrated circuits, and discrete circuitry. In certain embodiments, the processor 240 can include a neural network.
The processor 240 is also capable of executing other processes and programs resident in the memory 260, such as operations that receive and store data. As described in greater detail below, the processor 240 may execute processes to support or perform data augmentation of motion trajectories and synthesis of EM signatures to improve performance of ML-based human gesture recognition and activity detection systems for the implementation of methods described herein. The processor 240 can move data into or out of the memory 260 as required by an executing process. In certain embodiments, the processor 240 is configured to execute the one or more applications 262 based on the OS 261 or in response to signals received from external source(s) or an operator. Example applications 262 can include a multimedia player (such as a music player or a video player), a phone calling application, a virtual personal assistant, and the like.
The processor 240 is also coupled to the I/O interface 245 that provides the electronic device 200 with the ability to connect to other devices, such as client devices 106-114. The I/O interface 245 is the communication path between these accessories and the processor 240.
The processor 240 is also coupled to the input 250 and the display 255. The operator of the electronic device 200 can use the input 250 to enter data or inputs into the electronic device 200. The input 250 can be a keyboard, touchscreen, mouse, track ball, voice input, or other device capable of acting as a user interface to allow a user to interact with the electronic device 200. For example, the input 250 can include voice recognition processing, thereby allowing a user to input a voice command. In another example, the input 250 can include a touch panel, a (digital) pen sensor, a key, or an ultrasonic input device. The touch panel can recognize, for example, a touch input in at least one scheme, such as a capacitive scheme, a pressure sensitive scheme, an infrared scheme, or an ultrasonic scheme. The input 250 can be associated with the sensor(s) 265, a camera, and the like, which provide additional inputs to the processor 240. The input 250 can also include a control circuit. In the capacitive scheme, the input 250 can recognize touch or proximity.
The display 255 can be a liquid crystal display (LCD), light-emitting diode (LED) display, organic LED (OLED), active-matrix OLED (AMOLED), or other display capable of rendering text and/or graphics, such as from websites, videos, games, images, and the like. The display 255 can be a singular display screen or multiple display screens capable of creating a stereoscopic display. In certain embodiments, the display 255 is a heads-up display (HUD).
The memory 260 is coupled to the processor 240. Part of the memory 260 could include a RAM, and another part of the memory 260 could include a Flash memory or other ROM. The memory 260 can include persistent storage (not shown) that represents any structure(s) capable of storing and facilitating retrieval of information (such as data, program code, and/or other suitable information). The memory 260 can contain one or more components or devices supporting longer-term storage of data, such as a read only memory, hard drive, Flash memory, or optical disc.
The electronic device 200 further includes one or more sensors 265 that can meter a physical quantity or detect an activation state of the electronic device 200 and convert metered or detected information into an electrical signal. For example, the sensor 265 can include one or more buttons for touch input, a camera, a gesture sensor, optical sensors, one or more inertial measurement units (IMUs), such as a gyroscope or gyro sensor, and an accelerometer. The sensor 265 can also include an air pressure sensor, a magnetic sensor or magnetometer, a grip sensor, a proximity sensor, an ambient light sensor, a bio-physical sensor, a temperature/humidity sensor, an illumination sensor, an Ultraviolet (UV) sensor, an Electromyography (EMG) sensor, an Electroencephalogram (EEG) sensor, an Electrocardiogram (ECG) sensor, an IR sensor, an ultrasound sensor, an iris sensor, a fingerprint sensor, a color sensor (such as a Red Green Blue (RGB) sensor), and the like. The sensor 265 can further include control circuits for controlling any of the sensors included therein. Any of these sensor(s) 265 may be located within the electronic device 200 or within a secondary device operably connected to the electronic device 200.
The electronic device 200 as used herein can include a transceiver that can both transmit and receive radar signals. For example, the transceiver(s) 210 includes a radar transceiver 270, as described more particularly below. In this embodiment, one or more transceivers in the transceiver(s) 210 is a radar transceiver 270 that is configured to transmit and receive signals for detecting and ranging purposes. For example, the radar transceiver 270 may be any type of transceiver including, but not limited to, a WiFi transceiver, for example, an 802.11ay transceiver. The radar transceiver 270 can operate both radar and communication signals concurrently. The radar transceiver 270 includes one or more antenna arrays, or antenna pairs, that each includes a transmitter (or transmitter antenna) and a receiver (or receiver antenna). The radar transceiver 270 can transmit signals at various frequencies. For example, the radar transceiver 270 can transmit signals at frequencies including, but not limited to, 6 GHz, 7 GHz, 8 GHz, 28 GHz, 39 GHz, 60 GHz, and 77 GHz. In some embodiments, the signals transmitted by the radar transceiver 270 can include, but are not limited to, millimeter wave (mmWave) signals. The radar transceiver 270 can receive the signals, which were originally transmitted from the radar transceiver 270, after the signals have bounced or reflected off of target objects in the surrounding environment of the electronic device 200. In some embodiments, the radar transceiver 270 can be associated with the input 250 to provide additional inputs to the processor 240.
In certain embodiments, the radar transceiver 270 is a monostatic radar. A monostatic radar includes a transmitter of a radar signal and a receiver, which receives a delayed echo of the radar signal, that are positioned at the same or similar location. For example, the transmitter and the receiver can use the same antenna or be nearly co-located while using separate, but adjacent, antennas. Monostatic radars are assumed to be coherent, such that the transmitter and receiver are synchronized via a common time reference.
In certain embodiments, the radar transceiver 270 can include a transmitter and a receiver. In the radar transceiver 270, the transmitter can transmit millimeter wave (mmWave) signals. In the radar transceiver 270, the receiver can receive the mmWave signals originally transmitted from the transmitter after the mmWave signals have bounced or reflected off of target objects in the surrounding environment of the electronic device 200. The processor 240 can analyze the time difference between when the mmWave signals are transmitted and received to measure the distance of the target objects from the electronic device 200. Based on the time differences, the processor 240 can generate an image of the object by mapping the various distances.
Although
As shown in
The memory 330 and a persistent storage 335 are examples of storage devices 315, which represent any structure(s) capable of storing and facilitating retrieval of information (such as data, program code, and/or other suitable information on a temporary or permanent basis). The memory 330 may represent a random-access memory or any other suitable volatile or non-volatile storage device(s). The persistent storage 335 may contain one or more components or devices supporting longer-term storage of data, such as a read-only memory, hard drive, flash memory, or optical disc.
The communications unit 320 supports communications with other systems or devices. For example, the communications unit 320 could include a network interface card or a wireless transceiver facilitating communications over the network 130. The communications unit 320 may support communications through any suitable physical or wireless communication link(s).
The I/O unit 325 allows for input and output of data. For example, the I/O unit 325 may provide a connection for user input through a keyboard, mouse, keypad, touchscreen, or other suitable input device. The I/O unit 325 may also send output to a display, printer, or other suitable output device.
As described in more detail below, the electronic device 300 can be used to perform data augmentation of motion trajectories and synthesis of EM signatures to improve performance of ML-based human gesture recognition and activity detection systems for the implementation of methods described herein, especially in situations where real-time data is not necessary or, on the other hand, where the calculations are more efficiently or effectively done by the electronic device 300. The electronic device 300 could also maintain or determine any data or calculations that can be done offline and then transmitted to another component in the communication system 100.
A common type of radar is the “monostatic” radar, characterized by the fact that the transmitter of the radar signal and the receiver for its delayed echo are, for all practical purposes, in the same location.
In the example of
In a monostatic radar's most basic form, a radar pulse is generated as a realization of a desired "radar waveform", modulated onto a radio carrier frequency, and transmitted through a power amplifier and antenna (shown as a parabolic antenna), either omni-directionally or focused into a particular direction. Assuming a "target" at a distance R from the radar location and within the field-of-view of the transmitted signal, the target will be illuminated by RF power density pt (in units of W/m²) for the duration of the transmission. To the first order, pt can be described as:
where:
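For reference, a standard first-order expression for this power density is:
pt = PT GT / (4π R²) [W/m²],
where PT denotes the total transmit power [W] and GT the transmit antenna gain (these symbol names are assumed here for illustration and may differ from those used elsewhere in this disclosure).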
The transmit power density impinging onto the target surface will lead to reflections depending on the material composition, surface shape, and dielectric behavior at the frequency of the radar signal. Note that off-direction scattered signals are typically too weak to be received back at the radar receiver, so only direct reflections will contribute to a detectable receive signal. In essence, the illuminated area(s) of the target with normal vectors pointing back at the receiver will act as transmit antenna apertures with directivities (gains) in accordance with their effective aperture area(s). The reflected-back power is:
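A standard form of this quantity, consistent with the definition of Prefl below and treating the RCS as the target's equivalent reflecting area, is:
Prefl = pt · RCS = PT GT · RCS / (4π R²) [W].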
where:
Prefl . . . effective (isotropic) target-reflected power [W],
Note that the radar cross section, RCS, is an equivalent area that scales proportionally to the actual reflecting area squared, inversely proportionally with the wavelength squared, and is reduced by various shape factors and the reflectivity of the material. For a flat, fully reflecting mirror of area At, large compared with λ², RCS = 4πAt²/λ². Due to the material and shape dependency, it is generally not possible to deduce the actual physical area of a target from the reflected power, even if the target distance is known.
The target-reflected power at the receiver location results from the reflected-power density at the reverse distance R, collected over the receiver antenna aperture area:
where:
where:
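Combining the power density, the RCS, and a receiver antenna aperture of AR = GR λ²/(4π), where GR is the receive antenna gain and λ the wavelength, yields the standard monostatic radar equation, stated here for reference with illustrative symbol names:
PRX = PT GT GR λ² · RCS / ((4π)³ R⁴) [W].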
In case the radar signal is a short pulse of duration (width) TP, the delay τ between the transmission and reception of the corresponding echo will be equal to τ = 2R/c, where c is the speed of (light) propagation in the medium (air). In case there are several targets at slightly different distances, the individual echoes can be distinguished as such only if the delays differ by at least one pulse width, and hence the range resolution of the radar will be ΔR = cΔτ/2 = cTP/2. Further considering that a rectangular pulse of duration TP exhibits a power spectral density P(f) ∝ (sin(πfTP)/(πfTP))² with the first null at its bandwidth B = 1/TP, the range resolution of a radar is fundamentally connected with the bandwidth of the radar waveform via:
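For reference, substituting TP = 1/B into ΔR = cTP/2 gives the standard relation between range resolution and waveform bandwidth:
ΔR = c/(2B).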
Although
The gesture recognition and activity detection system 600 may be an electronic device or an electronic device system. For example, the gesture recognition and activity detection system 600 may include any electronic device having a processor, such as optical media players (e.g., a digital versatile disc (DVD) player, a Blu-ray player, an ultra-high-definition (UHD) player), a smart appliance, a set-top box, a television, a personal computer, a mobile device, a game console device, a content server, a smart device, a streaming device, or a combination thereof. Additionally, the gesture recognition and activity detection system 600 may be a portable electronic device or electronic system. For example, the gesture recognition and activity detection system 600 may be a mobile device, such as a cellular phone, a smart phone, a wearable smart device (such as a ring, a watch, a pair of glasses, a bracelet, an ankle bracelet, a belt, a necklace, an earring, a headband, a helmet, or a device embedded in clothing), a portable personal computer (PC) (such as a laptop, a notebook, a subnotebook, a netbook, or an ultra-mobile PC (UMPC)), a tablet PC (tablet), a phablet, a personal digital assistant (PDA), a digital camera, a portable game console, an MP3 player, a portable/personal multimedia player (PMP), a handheld e-book, a global positioning system (GPS) navigation device, or any other mobile or stationary device configured to perform wireless or network communication. In one example, a wearable device is a device that is designed to be mountable directly on the body of the user, such as a pair of glasses or a bracelet. In another example, a wearable device is any device that is mounted on the body of the user using an attaching device, such as a smart phone or a tablet attached to the arm of a user using an armband or hung around the neck of the user using a lanyard.
In operation 502, the gesture recognition and activity detection system 600 collects motion capture data used for creating motion capture trajectories 550. The motion capture trajectories 550 are time series of 3D positions of a subset of objects or marked points on objects in the scene, including but not limited to points on a human body performing some activity or points on static or dynamic rigid objects in the scene. In other words, the motion capture trajectories 550 record and describe the motion or path of these points in 3D space as a function of time. In some embodiments, some or all of the marked points being tracked may be grouped to form at least one marker group that is approximated as at least one planar target, and each marker includes a position in the at least one planar target. A local body-coordinate frame may also be attributed to each of such elements. Typical examples of such elements are rigid bodies and skeleton bone segments. The motion capture trajectories 550 can then also describe the orientation and position of the local body frames (with respect to a global frame) attached to the elements as a function of time, in addition to the positions of the marked points. In the simplest case, the motion capture trajectories 550, T(t), may be represented as:
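A representation consistent with the definitions that follow (the exact notation is an assumption) is:
T(t) = {X1(t), X2(t), . . . , XN(t)}.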
Where N is the total number of marked points (or markers being tracked); Xi(t) is the 3D position vector with elements (x,y,z) describing the position of the ith marker as a function of time.
In some cases, some of the markers are grouped to form unit elements. Usually, to facilitate tracking by motion capture systems, a group may be stipulated to include a minimum number of markers (e.g., 3). However, for the purpose of this disclosure, a group or unit element may contain one or more markers. Then, T(t) may be defined as:
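A form consistent with the definitions that follow is shown here for illustration (the notation, and the use of Xj(t) for the position of the local frame of the jth element, are assumptions):
T(t) = {(Xj(t), Pj(t)) : j = 1, . . . , S}.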
Where S is the total number of elements and each element includes a set of markers; Pj(t) is the orientation or attitude of the jth element as a function of time.
Usually, a motion capture track file output from a motion capture system contains useful metadata in addition to the trajectories.
The motion capture trajectories 550 may be obtained from an optical motion capture system, a Kinect depth camera, computer animation of human performances, or AI agents emulating human activities in an environment such as 3D games or virtual reality environments. In optical or video motion capture systems, such as the gesture recognition and activity detection system 600 depicted in
The synthesized electromagnetic (EM) scatter signals emulate the EM signatures of human activities (for example, Doppler and micro-Doppler signatures) as if a real EM signal transmission, reflection, and sensing had occurred. The synthesized and real EM sensing modality could be active or passive. In active EM sensing, one or more EM transmitter antennas are used to illuminate the environment for the purpose of sensing human activities. A portion of the transmitted EM signal is scattered after bouncing from various parts of the human body and sensed using one or more receiver antennas. The scattered EM signal sensed by the receiver antenna encodes a multitude of information about the human-in-the-environment, such as the distance of the human (or other targets) from the EM receiver (derived from the delay between the transmitted and the reflected signal), the bulk motion or overall body movement, radial velocity (derived from the change in frequency of the received signal compared to the transmitted signal, known as the Doppler effect), and distinctive micro-motion patterns, such as the distinctive patterns of hand movements during walking (from the micro-Doppler signature). In some cases, the delay is not directly obtained but estimated from the phase of the received signal. Typical examples of active EM sensing employed for indoor human activity monitoring include millimeter-wave (mm-wave or mmWave) frequency-modulated continuous wave (FMCW) radar or pulsed radar such as Ultra-Wideband (UWB) radar. Passive sensing leverages the radio-frequency (RF) signals in an environment; for example, passive Wi-Fi radar (PWR) data may be collected in a Wi-Fi environment to compute the Doppler and micro-Doppler signature of human activity in the environment. Moreover, the myriad EM signal transmitters and receivers may have monostatic (transmitter and receiver are collocated) or bistatic (transmitter and receiver are separated) configurations. The exact radio technology and physical attributes used for activity sensing are mostly unimportant in the context of this disclosure as long as a suitable mathematical model for the particular EM modality of interest is available to accurately simulate the dynamic channel properties arising due to a set of moving point emitters/reflectors.
The motion capture trajectories 550 obtained from physical systems, such as optical or video motion capture systems or Kinect sensors, may have noise. Additionally, there might be some gaps (i.e., missing data) in the tracking data for one or more markers for some time periods due to self-occlusions, inter-object occlusions, and other types of occlusions that may happen during the capture. Even trajectories obtained from computer animations may have some irregularities. In operation 504, a trajectory processing step interpolates missing data, if any, and, optionally, applies an appropriate filter to remove high-frequency noise from the trajectory time series. In an embodiment of this disclosure, an exponential moving average filter with a short span is used for processing the trajectories. Some trajectories may also have spike-type noise that, if not removed, may lead to the simulation of sudden unnatural movements. Therefore, the trajectory processing block may also include outlier detection and rejection filtering.
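As an illustrative sketch only, such a trajectory processing step might look like the following, assuming each marker's trajectory is stored as a (frames × 3) NumPy array with NaNs marking missing samples; the window sizes, thresholds, and span are example values rather than parameters prescribed by this disclosure:

```python
import numpy as np
import pandas as pd

def preprocess_trajectory(xyz, span=5, spike_factor=4.0):
    """Gap filling, spike rejection, and exponential-moving-average smoothing
    for one marker trajectory of shape (frames, 3). Illustrative only."""
    df = pd.DataFrame(xyz, columns=["x", "y", "z"])

    # Outlier (spike) rejection: samples far from a rolling median are
    # treated as missing and re-interpolated below.
    med = df.rolling(window=9, center=True, min_periods=1).median()
    dev = (df - med).abs()
    mad = dev.rolling(window=9, center=True, min_periods=1).median()
    df = df.mask(dev > spike_factor * (mad + 1e-9))

    # Interpolate gaps caused by occlusions or rejected spikes
    # (a spline interpolation could be substituted here).
    df = df.interpolate(method="linear", limit_direction="both")

    # Exponential moving average filter with a short span to remove
    # high-frequency noise.
    df = df.ewm(span=span).mean()
    return df.to_numpy()
```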
Depending upon the type of target application or human activity, a trajectory record or file may contain a single activity, multiple activities, or multiple repetitions of the same activity. Additionally, there may be short periods of no activity or irrelevant activity before, after, or during the zero or more inter-activity periods in the trajectory records. Following the trajectory processing step, the beginnings and ends of the activity of interest are identified and demarcated in operation 506. When the entire file or record contains a single instance of an activity from an audio or video capture in operation 508 taken simultaneously with the motion capture of operation 502, the start and end points may be implicit. That is, no start and end points are explicitly specified. The identification of the start and end points for operation 510 may be performed in several ways. In the simplest case, these points are manually identified and annotated. Alternatively, simple rules may be created to automatically determine the start and the end of the activity by analyzing either the motion tracks or the audio/video recordings, if available. As shown in
In operation 512, a data augmentation (DA) process is applied to the motion trajectories 550. For example, the DA process is one whose input is the motion trajectories 550 corresponding to an activity and the output is a plurality of sets of motion trajectories 552 with some variations representing variations of the activity. Data augmentation is used to artificially generate a large set of data, e.g., a set of data augmented motion trajectories 554, corresponding to physically plausible variations of the real data from a small set of real data, e.g., the motion capture trajectories 550, collected via measurements or experiments that are usually difficult to do or are time-consuming. The variations are usually random and the type of variations to be performed may be specified separately. Details on the type of data augmentation and how to implement a data augmentation pipeline to obtain an almost unlimited number of artificially generated variations of trajectories from a small set of real trajectories are explained in the later sections of this disclosure.
In operation 514, a time-varying radar cross-section (RCS) 556 is computed for every target in all sets of data augmented motion trajectories 554. The RCS of the target governs the amount of EM energy reflected back by the target. As a result, the RCS of the target governs the quality of target detection. For example, if the RCS of a target is large, it may be easily detected, whereas if the RCS of the target is small, the target may be hard to detect. The RCS of a target generally depends on the frequency of the incident radiation, the polarization of the transmitting and receiving antennas, the orientation of the target relative to the antenna, also called the "aspect angle", and the material and shape of the target. However, when simulating the radar signatures of human activities, the important factor that determines the RCS as a function of time (represented as σ(t)) is the aspect angle of the target as a function of time (represented by θ(t) and ϕ(t)). Once the various point trajectories are available, whether obtained directly from the motion capture system or synthetically generated via data augmentation, the RCS of the point emitters can be computed. The RCS computation of targets is generally application dependent. Details of how the RCS of the points may be computed are provided in later sections of this disclosure. The computed RCS, σ(t), for the set of points may be stored in a file or the memory of the computer.
The frame rate of the motion trajectories 550 and the frame rate of the radar signal (real or synthesized) could be different. For example, a frame rate of 120 Hz is typically used in optical motion capture systems, whereas the frame rate of a mmWave FMCW radar depends on the number of chirps per frame, the number of samples per chirp, the chirp interval, and the sample rate. In some hardware, the radar's frame rate or frame interval (inverse of the frame rate) can also be specified directly in addition to the other parameters. Therefore, in operation 516, the set of data augmented motion trajectories 554 may be resampled to match the desired frame rate of the target radar. In an embodiment of this disclosure, the frame rate of the set of data augmented motion trajectories 554 is converted to match the target radar frame rate before generating a set of EM scatter signals 558 in operation 518. Alternatively, the frame-rate conversion could be part of generating the set of EM scatter signals 558. Although the terms "EM scatter" and "EM backscatter" signals have slightly different connotations, in that the former indicates scattering along any direction and is generally used in the context of bistatic transmitter-receiver configurations while the latter denotes the scattered energy received back at the receiver in a monostatic configuration, they are used interchangeably in this disclosure.
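As a sketch of such a frame-rate conversion, assuming the trajectories are stored as a (frames × markers × 3) NumPy array and using illustrative frame-rate values:

```python
import numpy as np

def resample_trajectories(traj, fs_mocap=120.0, fs_radar=400.0):
    """Resample marker trajectories from the motion capture frame rate to a
    desired radar frame rate by per-coordinate linear interpolation."""
    t_src = np.arange(traj.shape[0]) / fs_mocap
    t_dst = np.arange(0.0, t_src[-1], 1.0 / fs_radar)
    out = np.empty((len(t_dst), traj.shape[1], 3))
    for n in range(traj.shape[1]):          # each marker
        for d in range(3):                  # each coordinate (x, y, z)
            out[:, n, d] = np.interp(t_dst, t_src, traj[:, n, d])
    return out
```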
Once an action of interest or a variation thereof, such as variations obtained using data augmentation, is available as a set of point trajectories, e.g., points and their 3D positions over a period of time, along with each point's RCS as a function of time, the backscattered EM signal can be simulated using an appropriate analytical model treating the set of point trajectories as point emitters of EM waves. That is, the entire moving object (or objects) is modeled as a set of point targets. Then, the resulting EM signal at a receiver antenna may be obtained as the phasor sum of the EM signals of each of the target point emitters. For example, in an embodiment of this disclosure, the baseband intermediate frequency (IF) signal for a mmWave monostatic FMCW radar with sawtooth linear frequency modulation is modeled as:
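One commonly used form of such a beat-signal model, consistent with the terms defined below, is shown here for illustration (any receiver-antenna-dependent phase term arising from the array geometry is omitted as a simplifying assumption):
uk,m(t, t′) ≈ Σi=1..N ai(t) · exp{ j [ 2π( fb,i + fd,i ) t′ + 2π fd,i m Tc + ϕo,i ] }.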
Where,
uk,m(t, t′) is the beat signal or IF signal at the kth receiver antenna corresponding to the mth sweep of the linear modulation in the transmitted signal, obtained at the output of the mixer and following low-pass filtering and some approximation.
t′=t−mTc, where Tc is the sweep duration, is the relative time from the beginning of the mth chirp.
N is the total number of point targets used to represent the entire moving object (or objects).
ai(t) represents the reflected signal strength or amplitude, including propagation loss, transmit power (Pt), transmitter and receiver antenna gains (Gt and Gr), and target RCS (σi(t)). This amplitude may be expressed as ai(t) = √( (Pt Gt Gr λ² σi(t)) / ((4π)³ ri⁴(t)) ). Note that other losses, such as system losses, are included in the gain terms.
fc is the carrier frequency, which is also equal to the minimum sweep frequency fmin of the radar. The bandwidth is B = fmax − fmin, where fmax is the maximum frequency, and S is the sweep slope, given as S = B/Tc.
ri(t) is the distance of the ith point target from the radar as a function of time.
vi is the radial velocity of the ith point target. It is assumed that vi is sufficiently small so that the range ri remains constant during a sweep.
c is the velocity of light (or EM wave).
The term ϕo=4πfc ri/c or ϕo=4πri/λ is the initial phase of the IF signal. The phase, ϕo, changes linearly for small changes in ri.
The term fb=2Sri (t)/c is the instantaneous frequency of the IF signal (also called the beat frequency) related to the ith point target.
The term fd=2fc vi/c is the Doppler frequency related to the ith point target.
The expression in Eq. 3 was obtained by representing the transmitted signal as a sinusoidal signal with a linearly changing frequency:
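A typical representation of such a transmitted chirp, consistent with the parameters defined above, is:
uTX(t′) = cos( 2π( fc t′ + (S/2) t′² ) ), for 0 ≤ t′ < Tc.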
Then, the backscattered signal for the mth sweep at a receiver antenna k can be represented as the summation of the delayed and attenuated version of the transmitted signals reflected from all the point targets as shown in Eq. 5:
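A form consistent with this description and with the delay τi defined below is:
uRX,k(t′) = Σi=1..N ai · uTX(t′ − τi).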
Where, τi=2(ri+vit)/c is the round-trip delay associated with the ith target moving with radial velocity vi at a distance ri from the radar.
The signal uTX(t)·uRX,k(t), obtained by mixing the transmitted and the received signals, is low-pass filtered to obtain the beat signal. Following some simplification and removal of insignificant terms under certain assumptions, such as sufficiently short sweep durations and ignoring terms having c² in the denominator, the expression in Eq. 3 is obtained.
To account for more realistic scenarios, noise terms may be added to Eq. 3 that account for random system noise and the contribution of antenna leakage bias (the direct leakage of power from the transmitting antenna to the receiving antenna) in monostatic configurations. While the random system noise may be modeled as Gaussian noise, the antenna leakage bias is usually a constant value. Therefore, Eq. 4 may be used to simulate more realistic scenarios:
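A form consistent with Eq. 3 and the noise terms defined below is:
ũk,m(t, t′) = uk,m(t, t′) + N(t) + b.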
Where,
N(t) accounts for random system noise.
b accounts for constant biases such as antenna leakage bias in monostatic configurations.
In another embodiment of this disclosure, as required by the target application employing UWB radar, a pulsed radar-based analytical model may be used to synthesize the EM scatter signals from a set of point trajectories.
In an embodiment of this disclosure, the targets 604, e.g., human body parts such as the hand (or palm), the forearm, the upper arm, and the torso, are modeled as planar rigid bodies (RB) as shown in
Similarly,
Specifically,
The frames in
Generating a TVD and a TAD, such as the synthetic TVD 710, the synthetic TVD 810, the synthetic TAD 712, and the synthetic TAD 812, requires radar frames. The radar frame in this example includes Nc chirps, such as 32 chirps, with Ns samples per chirp, such as 64 samples per chirp. The simulated radar signal, obtained at each of the virtual receiver antennas, includes Nc×Ns×NF samples, where NF is the number of radar frames. For each radar frame, a Range-Doppler Map (RDM) is computed by first taking Fast Fourier transforms (FFTs) along each row of Ns samples (called the range FFT), discarding the Ns/2 symmetric values, and then computing FFTs along each column (called the Doppler FFT), resulting in an Nc×Ns/2 RDM for each frame. The values in the zero-Doppler bin are nulled to remove the static contribution from the direct leakage (which may be added to the simulated signal in Eq. (4) to match realistic scenarios). A range profile is obtained from the RDM by summing the power across all Doppler bins (i.e., summing along each column of the RDM). Peaks in the range profile are detected by comparing the range profile against a detection threshold. The first detected peak in the range profile corresponds to the location of the hand (assuming the hand is the closest object to the radar during the gesture action). The column in the RDM corresponding to the detected peak is picked to construct a column of the TVD. Repeating the above process for the NF radar frames results in an Nc×NF-sized TVD. The simulated radar signal at any one of the receiver antennas may be used to generate the TVD. Alternatively, to mitigate the effect of noise, the radar signals from the three receiver antennas may be averaged. A column of the TAD is obtained by picking the columns corresponding to the first peak used for the TVD from the range FFTs obtained from two neighboring receiver antennas, constructing a covariance matrix, and applying the MUSIC algorithm to obtain the angular spectrum.
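The sketch below illustrates the TVD construction described above, assuming the simulated IF samples of one receiver antenna are arranged as an (NF × Nc × Ns) array; the detection threshold and the choice of which symmetric half of the range FFT to keep are illustrative assumptions rather than the exact processing of this disclosure:

```python
import numpy as np

def tvd_from_frames(frames, threshold=1e-3):
    """Build a time-velocity diagram (TVD) from simulated radar frames of
    shape (NF, Nc, Ns). Returns an (Nc, NF) array."""
    nf, nc, ns = frames.shape
    tvd = np.zeros((nc, nf))
    for f in range(nf):
        # Range FFT along each chirp (row); keep one symmetric half.
        rfft = np.fft.fft(frames[f], axis=1)[:, : ns // 2]
        # Doppler FFT along each range bin (column) -> range-Doppler map.
        rdm = np.fft.fftshift(np.fft.fft(rfft, axis=0), axes=0)
        # Null the zero-Doppler bin to remove static leakage contributions.
        rdm[nc // 2, :] = 0
        # Range profile: total power across Doppler bins per range bin.
        range_profile = np.sum(np.abs(rdm) ** 2, axis=0)
        # First peak above the detection threshold ~ closest moving target
        # (assumed to be the hand during a gesture).
        detected = np.where(range_profile > threshold)[0]
        if detected.size:
            tvd[:, f] = np.abs(rdm[:, detected[0]])
    return tvd
```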
The strength of the radar signature at the receiver as shown in Eq. (4) is:
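This strength is consistent with the amplitude term ai(t) defined earlier:
ai(t) = √( (Pt Gt Gr λ² σi(t)) / ((4π)³ ri⁴(t)) ).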
where, σi(t) is the RCS (radar cross-section) of the ith target 604.
As indicated earlier, an RCS 556 of the target 604 is one of the crucial factors determining whether it can be detected by a radar system since its RCS 556 determines the amount of energy reflected by the target 604 towards the radar. Furthermore, for tasks like human gesture and activity detection and classification using EM signatures, the time-varying RCS from different targets 604, e.g., different parts of the body-in-motion, results in distinctive features that are essential for machine learning (ML) based automated gesture and activity classification. Therefore, modeling the RCS is an important consideration for these systems. At the same time, for task-specific applications such as human activity and gesture recognition in relatively predictable environments, it may be sufficient to model the different parts of the human body as planar rigid bodies or primitive shapes, such as ellipsoids, and to only account for the relative RCS between the different parts as a function of time. In other words, factors such as the polarization of the transmitted and received radiation with respect to the orientation of the target 604 and the material properties of the target 604 may be ignored.
Note that the definition of the term "target" includes, but is not limited to, the 3D points corresponding to actual markers estimated by a motion capture system (whether using physical markers and real cameras, computer animation, 3D games, or AI-based approaches such as 2D-video-to-3D pose estimation), virtual points derived from the actual markers, such as centroids of rigid bodies or the origins of bone segments in skeleton tracking, or shapes used to model parts or the whole of the human body in primitive-based modeling.
In an embodiment of this disclosure targeting the task of hand gesture recognition, parts of the human hand and body are modeled as planar rigid bodies as shown in
A procedure to compute a time-varying RCS 556 for targets 604 modeled as a set of planar rigid bodies, e.g., the palm rigid body 610, is as follows and is illustrated using
where W is the absolute area/size of the rigid body, e.g., the palm rigid body 610.
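One plausible aspect-angle model for this computation, stated here as an assumption rather than the exact expression used in this disclosure, scales the area W of the planar rigid body by the projection of its unit surface normal n̂(t) onto the centroid unit vector û(t) pointing from the rigid body toward the radar:
σRB(t) ≈ W · | n̂(t) · û(t) |.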
In the above procedure for calculating the RCS, the centroid unit vector 510 is a unit vector along the line joining the centroid of the rigid body and the radar. For some task-specific applications where the total movements of the target 604 are expected to be bounded within a small angular region in front of the radar, such as hand gesture recognition to control a TV or other such appliances, the centroid unit vector 510 could simply be the unit direction vector along the depth axis, assuming the virtual radar is positioned at the origin of the coordinate system. This approximation further simplifies the computation of the time-varying RCS 556.
The RCS 556 of each point target 604 is then obtained by dividing the RCS 556 of the rigid body, e.g., the palm rigid body 610, by the number of markers 602 used to represent the palm rigid body 610. Similarly, when primitive shapes such as ellipsoids are used to model body parts, the RCS 556 of the points may be obtained by dividing the RCS 556 of the primitive shape by the number of points used to represent the primitive shape while computing the EM signature using an analytical model like Eq. 4.
While the above method for RCS computation of body parts modeled as rigid bodies is simple and fast, which makes it suitable for lower-power embedded AI systems at the edge, it does not consider the effects of self-occlusion of parts of the body during the course of an activity. This limitation is not inhibitory for synthesizing Doppler and micro-Doppler signatures for human gesture recognition and simple activity detection, such as fall detection or running and walking detection. However, for more complex activities, such as exercise classification, the RCS computation may need to consider the effects of self-occlusions.
After getting the initial motion capture trajectories 550 and radar position information in operation 1002, a secondary modality such as video, audio, or even manual inputs may be used to mark the start and end of gesture/activity actions at operation 1004. Such information is especially useful when motion capture data for multiple instances of gesture/activity are recorded in the same file. The system processes this information and converts the start and end points to the motion capture frame (or time) domain to extract and process the segments separately in operation 1006. The system may also get the virtual radar position from the settings and configuration file if specified. Alternatively, the radar position may be indicated using markers 602 in the scene during motion capture. Following smoothing and filling missing data in the trajectories using appropriate interpolation techniques, such as spline interpolation, in operation 1008, a coordinate transformation is applied to trajectories to set the position of the radar as the origin of the coordinate system in operation 1010. If the radar location is fixed, this operation could be as simple as subtracting the radar position from the position of all the markers 602 over all time.
In operation 1012, as shown in the flowchart in
As indicated in
A function of data augmentation (DA) is to synthesize a larger set of new data from a smaller set of actual or measured data by applying modifications or transformations to the original data. The larger the size of the set of synthesized data compared to the size of the original data, the greater the benefit of DA. Ideally, the larger set of augmented data should contain different variations in the possible actions that are not present in the original data set, either because of difficulty or time constraints during collection or because the size of the original dataset is small.
In an embodiment of this disclosure, a variety of data augmentation techniques are used to transform the activity trajectories, such as spatial DA transformations, temporal DA transformations, and background noise DA transformations.
Another example of one such spatial DA transformation may include a transformation to spatially vary the arm and body marker trajectories along the X-axis relative to the radar simulating the variation of the radar location (left or right) with respect to the user. This transformation can be achieved by adding a randomly generated constant displacement value to the x-coordinates of all the trajectories for all time samples (motion capture frames) in the gesture segment. Here too, the displacement can be either positive or negative corresponding to the left or right directions.
Yet another example of one such spatial DA transformation may include a transformation to vary the arm and body marker trajectories along the Z-axis relative to the radar to simulate the variation of the depth of the radar from the user. Similar to a. and b., the transformation can be realized by adding a randomly generated constant displacement value—either positive (away from the radar) or negative (towards the radar)—to the z-coordinates of all the trajectories for all time samples in the gesture segment. A trajectory displacement graph 700A is shown in
A further example of a spatial DA transformation may include a transformation to vary only the arm marker trajectories along the Z-axis relative to the centroid of the body marker trajectories to vary the spatial separation between the arm and the body. However, the randomly generated constant displacement term is only added to trajectories of the palm rigid body 610, the forearm rigid body 614, and the upper arm rigid body 614 trajectories but not to the body marker trajectories.
Additionally, a spatial DA transformation may include a transformation to vary the angle of incidence or the aspect angle of the target 604 markers 602, e.g., the arm and body markers, relative to the radar to simulate the variation of the angular position of the radar with respect to the user. One way to realize this transformation is to vary the position of the virtual radar along an arc around a point defined by the centroid of the body markers 602. The angle of incidence and self-occlusions are automatically handled during the RCS computation. Alternatively, a 3D rotation transformation matrix may be multiplied with the trajectories, which are 3D position vectors as a function of time, of all markers 602 for all instances of time, e.g., for each of the motion capture frames the trajectories are rotated about the Z-axis. This assumes that the virtual radar is located at the origin of the global coordinate system.
A DA transformation may include a transformation to vary the curvature of the arm marker trajectories along the Y-axis while keeping the body marker trajectories unchanged to simulate the plausible variations of the arc of a gesture action along the vertical axis. A trajectory transformation graph 700B is shown in
Another DA transformation may include a transformation to vary the curvature of the arm marker trajectories along the Z-axis while keeping the body marker trajectories unchanged to simulate the plausible variations of the arc of a gesture action along the depth axis. A trajectory transformation graph 700B is shown in
A DA transformation may also include a transformation to stretch or compress the spatial extent of the gestures along the X-axis to simulate the variations of the gesture extents. The transformation is applied to the arm markers 602 only. This transformation may be realized by scaling the x-coordinates of the arm trajectories. A trajectory transformation graph 700C is shown in
An example of a temporal DA transformation may include a transformation to linearly vary the speed (faster or slower) of the gesture or action to simulate the variance in speed of gesture actions by different users or even by the same user. This transformation can be achieved by up-sampling and down-sampling the spatial coordinates of the trajectories by different factors to affect a speed change. A trajectory transformation graph 700D is shown in
Another example of a temporal DA transformation may include a transformation to non-linearly vary the speed of the gesture or action.
Yet another example of a temporal DA transformation may include a transformation to shift the temporal occurrence (delay or advance the start) of the gesture or action. This transformation can be achieved by advancing or delaying the start point of the gesture in the segment. Such an operation requires "padding" data either at the beginning or towards the end of the original gesture segment. The padding can be achieved by repeating the nearest (first or last) trajectory positions. A trajectory transformation graph 700E is shown in
A background noise DA transformation, different from the type of random noise introduced in Eq. 4, may include a transformation, e.g., addition, of extra trajectories to represent other objects or unwanted actions surrounding the target 604 action. Additionally, useful temporal DA transformations may include transformations to generate non-gesture actions to simulate random actions by users in front of the radar system that should not be classified as a specific gesture in the defined gesture vocabulary. Such data can also be used in ML training to improve the performance of the classifier.
Note that the DA transformations affect the time-varying RCS 556, radial velocity, and range of the different targets 604 involved in the action, thereby producing slightly different signatures within the same class of action.
In an embodiment of this disclosure, each of the above transformations is implemented by a separate DA function. All DA functions have the same function call and return signature. In other words, all DA functions accept the same set of parameters, which include at least a trajectory and the start and end locations of the activity (additional, application-dependent parameters may also be included), and return the same set of parameters. The exact means of passing the arguments to the DA functions and obtaining their returns is programming-language and implementation specific. For example, while in some programming languages such as C/C++ it may be more conducive to pass the arguments by reference and perform in-place computations that store the results in the same memory location, in other programming languages such as Python it may be preferable to pass the arguments by value and return the same type of arguments so that they can be readily passed to the chained function in the sequence. Unifying the interface of the DA functions allows the DA pipeline to create a large set of transformations using a combination of a smaller set of composable (or chainable) functions.
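The Python-style sketch below illustrates the unified call/return signature and chaining described above; the function names, the keyword-parameter mechanism, and the no-op bodies are placeholders rather than the specific transformations of any particular implementation:

```python
def da_rotate(traj, start, end, **params):
    """Every DA function takes at least a trajectory and the start/end
    locations of the activity (plus optional application-dependent
    parameters) and returns the same tuple."""
    # ...apply the rotation transformation to traj here...
    return traj, start, end

def da_change_speed(traj, start, end, **params):
    # ...apply the speed-change transformation to traj here...
    return traj, start, end

def apply_chain(da_functions, traj, start, end, **params):
    """Apply a sequence of composable DA functions; the uniform signature is
    what allows arbitrary combinations to be chained."""
    for fn in da_functions:
        traj, start, end = fn(traj, start, end, **params)
    return traj, start, end

# Example: one combined transformation built from a small pool of functions.
# traj, start, end = apply_chain([da_rotate, da_change_speed], traj, start, end)
```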
Further, in an embodiment of this disclosure, the parameters controlling the various transformations in each DA function are varied randomly at runtime. Therefore, applying the same DA function or the same combination (sequence) of DA functions subsequently produces slightly different transformations of the same type to the original trajectory. Thus, using the above strategy of composable or chainable DA functions and generation of the random control parameter values within each DA function enables an almost infinite number of possible transformations and corresponding EM signatures to be synthesized.
Although the control parameters within each DA function are randomized during generation of the data augmentation transformation pool, they are bounded, e.g., having minimum and maximum limits, based on physical constraints informed by empirical data, the target 604 radar system operating range and Doppler resolution limits, and other application settings. For example, if the maximum range of the target 604 radar system is five meters (based on the radar system parameters), then random variations that spatially shift any of the targets 604 beyond five meters from the radar are wasteful. One way to circumvent this problem is to set appropriate bounds on the control parameters and clip the generated target 604 distance values based on maximum plausible limits. In another example, in which the transformation is applied in the temporal domain to change the speed of the gesture, the maximum and minimum factor-of-change of the gesture speed may be informed by empirical data, if available, or based on reasonable expectations of the time required to perform the action.
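A sketch of how a bounded, randomized control parameter and clipping might look inside one DA function (the five-meter limit, the choice of the Y-axis as the range axis, and the shift bounds are illustrative assumptions):

```python
import numpy as np

def da_shift_range(traj, start, end, rng=None, min_range_m=0.3, max_range_m=5.0):
    """Randomly translate the target 604 along the range axis, drawing the
    control parameter within fixed bounds and clipping the result so the
    target stays inside the radar's plausible operating range."""
    rng = rng if rng is not None else np.random.default_rng()
    shift = rng.uniform(-1.0, 1.0)   # bounded random control parameter, in meters
    out = traj.copy()
    out[:, 1] = np.clip(out[:, 1] + shift, min_range_m, max_range_m)
    return out, start, end
```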
The transformed trajectories are fed into an RCS model, which is determined in operation 1016 based on whether rigid bodies are to be used. For example, if an RCS model based on rigid bodies is employed such as described above in
Once the number-of-elements threshold of operation 1024 is met, the EM scatter signals are synthesized at a receiver using an analytical model, such as a phasor sum of the EM signals from each element orientation, in operation 1026. In operation 1028, if the total number of applied DA function transformations d is less than the cardinality of the DA functions D, the method 1000 returns to operation 1014 and again transforms the set of motion capture trajectories for processing in the RCS computation loop beginning with operation 1016. If the total number of applied DA function transformations d is equal to or greater than the cardinality of the DA functions D, e.g., all DA functions of the transformation pool have been applied, the method 1000 proceeds to determine whether all gestures in the motion capture trajectories have been read in operation 1030. If not, the method 1000 returns to operation 1006 to process any unread gestures. The method 1000 ends when all gestures in the motion capture trajectories have been read.
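For concreteness, one simple analytical form of such a phasor sum for a single frame is sketched below; the 1/R^2 amplitude scaling and the omission of antenna and propagation factors are simplifying assumptions of this sketch rather than a description of operation 1026 itself:

```python
import numpy as np

def phasor_sum_signal(ranges_m, rcs, wavelength_m):
    """Synthesize one received scatter sample as the phasor sum of the returns
    from all body elements.

    ranges_m: per-element radial distance to the radar (meters).
    rcs:      per-element RCS for the element's current orientation.
    """
    amplitude = np.sqrt(rcs) / np.maximum(ranges_m, 1e-6) ** 2
    phase = -4.0 * np.pi * ranges_m / wavelength_m   # two-way propagation phase
    return np.sum(amplitude * np.exp(1j * phase))
```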
Note that one of the DA functions 1304, for example, the function identified by DAF_0 in
Similarly,
In an embodiment of this disclosure, real EM signature data, for example mmWave frequency-modulated continuous wave radar Doppler signatures, may be captured simultaneously with the motion capture of the gesture and activity actions. A method 1600 including a sequence of processing steps is shown in
The set of EM scatter signals 558 may also be used to generate features as part of a ML algorithm as shown in operation 1610. The real signals, or features generated from the real signals such as a spectrogram, a TVD, or a TAD, may be compared with the corresponding signals/features obtained from the set of EM scatter signals 558. For such a comparison, the position of the real radar may be obtained by adding markers on the real radar during capture or provided using the settings and configuration file (
A gesture recognition and activity detection system 1700 for synchronously capturing the motion trajectories of activities and gestures along with real radar Doppler and micro-Doppler signatures is shown in
Similar to the system 200 of
The accuracy of supervised deep-learning models depends crucially on the quality and quantity of available labeled training data. In the case of RF-based gesture recognition and activity detection problems, the real data collected by radar is scarce due to the cost of data acquisition. Domain adaptation is a technique to improve the performance of a model on a target domain containing insufficient annotated data by using the knowledge learned by the model from another, related domain (the source domain) with adequate labeled data. In an embodiment of this disclosure targeting mmWave gesture recognition, the source domain data is synthesized data that can be easily generated using the data augmentation and simulation pipeline, and the target domain data is the real data collected by radar.
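The specific adaptation algorithm is left open here; as one generic illustration only, a model pretrained on the abundant synthetic (source domain) signatures could be fine-tuned on the scarce labeled real (target domain) radar signatures, sketched below with a PyTorch-style classifier (names and hyperparameters are hypothetical):

```python
import torch
from torch import nn, optim

def adapt_to_real_domain(model: nn.Module, real_loader, epochs=10, lr=1e-4):
    """Fine-tune a classifier pretrained on synthetic EM signatures using the
    small labeled set of real radar signatures; the small learning rate is an
    illustrative choice for preserving source-domain knowledge."""
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for signatures, labels in real_loader:
            optimizer.zero_grad()
            loss = criterion(model(signatures), labels)
            loss.backward()
            optimizer.step()
    return model
```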
The above flowcharts illustrate example methods that can be implemented in accordance with the principles of the present disclosure and various changes could be made to the methods illustrated in the flowcharts herein. For example, while shown as a series of steps, various steps in each figure could overlap, occur in parallel, occur in a different order, or occur multiple times. In another example, steps may be omitted or replaced by other steps.
Although the present disclosure has been described with exemplary embodiments, various changes and modifications may be suggested to one skilled in the art. It is intended that the present disclosure encompass such changes and modifications as fall within the scope of the appended claims. None of the description in this application should be read as implying that any particular element, step, or function is an essential element that must be included in the claims scope. The scope of patented subject matter is defined by the claims.
The present application claims priority to U.S. Provisional Patent Application No. 63/532,629, filed on Aug. 14, 2023. The contents of the above-identified patent documents are incorporated herein by reference.