This disclosure relates generally to extended reality environments and, more specifically, to object detection and tracking in extended reality devices.
Extended Reality (XR) frameworks such as Virtual Reality (VR), Augmented Reality (AR), and Mixed Reality (MR) frameworks may detect objects, such as hands of a user, in a field of view (FOV) of a camera of the respective reality system. The XR frameworks may track the objects as they move throughout their respective environments. For example, in a gaming application, an XR framework may attempt to track a player's hand as the hand is moved throughout the environment.
According to one aspect, a method includes receiving image data from a camera. The method also includes detecting, based on the image data, an object in a placement area of a hybrid environment. The hybrid environment includes a real environment and a virtual environment. The method further includes, in response to the detection, determining a value of at least one parameter for the object. The method also includes generating profile data based on the at least one parameter value. The profile data registers the object with a user. Further, the method includes tracking the movement of the object within the hybrid environment based on the profile data.
According to another aspect, a method includes capturing at least one image of an object in a real environment of a hybrid environment. The method also includes generating a plurality of data points for the object based on the at least one image. The method further includes generating a multi-dimensional model of the object based on the plurality of data points. The method also includes generating a plurality of action points based on the multi-dimensional model of the object. Further, the method includes tracking a movement of the object in a virtual environment of the hybrid environment based on the plurality of action points.
According to another aspect, an apparatus comprises a non-transitory, machine-readable storage medium storing instructions, and at least one processor coupled to the non-transitory, machine-readable storage medium. The at least one processor is configured to receive image data from a camera. The at least one processor is also configured to detect, based on the image data, an object in a placement area of a hybrid environment. The hybrid environment includes a real environment and a virtual environment. Further, the at least one processor is configured to, in response to the detection, determine a value of at least one parameter for the object. The at least one processor is also configured to generate profile data based on the at least one parameter value. The profile data registers the object with a user. The at least one processor is further configured to track movement of the object within the hybrid environment based on the profile data.
According to another aspect, an apparatus comprises a non-transitory, machine-readable storage medium storing instructions, and at least one processor coupled to the non-transitory, machine-readable storage medium. The at least one processor is configured to capture at least one image of an object in a real environment of a hybrid environment. The at least one processor is also configured to generate a plurality of data points for the object based on the at least one image. Further, the at least one processor is configured to generate a multi-dimensional model of the object based on the plurality of data points. The at least one processor is also configured to generate a plurality of action points based on the multi-dimensional model of the object. The at least one processor is further configured to track a movement of the object in a virtual environment of the hybrid environment based on the plurality of action points.
According to another aspect, a non-transitory, machine-readable storage medium stores instructions that, when executed by at least one processor, cause the at least one processor to perform operations that include receiving image data from a camera. The operations also include detecting, based on the image data, an object in a placement area of a hybrid environment, wherein the hybrid environment comprises a real environment and a virtual environment. Further, the operations include, in response to the detection, determining a value of at least one parameter for the object. The operations also include generating profile data based on the at least one parameter value, the profile data registering the object with a user. The operations further include tracking movement of the object within the hybrid environment based on the profile data.
According to another aspect, a non-transitory, machine-readable storage medium stores instructions that, when executed by at least one processor, cause the at least one processor to perform operations that include capturing at least one image of an object in a real environment of a hybrid environment. The operations also include generating a plurality of data points for the object based on the at least one image. The operations further include generating a multi-dimensional model of the object based on the plurality of data points. The operations also include generating a plurality of action points based on the multi-dimensional model of the object. Further, the operations include tracking a movement of the object in a virtual environment of the hybrid environment based on the plurality of action points.
According to another aspect, an object detection and tracking device includes a means for receiving image data from a camera. The object detection and tracking device also includes a means for detecting, based on the image data, an object in a placement area of a hybrid environment. The hybrid environment includes a real environment and a virtual environment. The object detection and tracking device also includes a means for, in response to the detection, determining a value of at least one parameter for the object. The object detection and tracking device further includes a means for generating profile data based on the at least one parameter value, the profile data registering the object with a user. The object detection and tracking device also includes a means for tracking movement of the object within the hybrid environment based on the profile data.
According to another aspect, an object detection and tracking device includes a means for capturing at least one image of an object in a real environment of a hybrid environment. The object detection and tracking device also includes a means for generating a plurality of data points for the object based on the at least one image. The object detection and tracking device also includes a means for generating a multi-dimensional model of the object based on the plurality of data points. The object detection and tracking device also includes a means for generating a plurality of action points based on the multi-dimensional model of the object. The object detection and tracking device also includes a means for tracking a movement of the object in a virtual environment of the hybrid environment based on the plurality of action points.
While the features, methods, devices, and systems described herein may be embodied in various forms, some exemplary and non-limiting embodiments are shown in the drawings, and are described below. Some of the components described in this disclosure are optional, and some implementations may include additional, different, or fewer components from those expressly described in this disclosure.
Various systems, such as gaming, computer vision, extended reality (XR), augmented reality (AR), virtual reality (VR), medical, and robotics-based applications, rely on receiving input from a user by one or more techniques. These techniques can include the tracking of motion of one or more body parts of a user (e.g., a hand, a fist, fingers of a hand, etc.). For example, imaging devices, such as digital cameras, smartphones, tablet computers, laptop computers, automobiles, or Internet-of-things (IoT) devices (e.g., security cameras, etc.), may capture a user's image, and may reconstruct a 3D image based on one or more body parts of the user. For instance, the imaging devices may capture an image of a user's hand, such as a gamer's hand, and may reconstruct a 3D image of the user's hand for use within an XR-, VR-, or AR-based game, e.g., as part of an avatar.
Existing image processing techniques may allow a user to provide input through motion of one or more objects or body parts of the user. However, these conventional techniques may suffer from several shortcomings, such as difficulty determining whether a detected object or body part belongs to a user, and whether the user intended to provide input using the object or body part. Moreover, these conventional systems may also fail to properly detect a user's gestures using the objects, such as gestures performed with a user's hand or an object the user is holding. For instance, if a user has a deformity in one or both hands (such as an irregular shape of a hand, more than five fingers on a hand, fewer than five fingers on a hand, etc.), or if the user is holding an object with fewer than five fingers while intending to provide input, or if the user's hand is wholly, or in part, covered by some material (e.g., a mitten), the XR, VR, or AR system may not successfully detect the user's hand, and may be unable to recognize one or more gestures that the user may intend for the XR, VR, or AR system to detect.
In some implementations, an object detection and tracking device may include one or more optical elements, such as a camera, one or more motion sensors, a thermal camera, and a sensitive microphone for detecting sounds based on movements, and may detect one or more objects or body parts of the user within a virtual environment to identify (e.g., determine) input gestures made by the user. In some implementations, the object detection and tracking device may detect an object in a field-of-view (FOV) of a camera, and determine that the object corresponds to a particular user. For instance, the object detection and tracking device may determine that an object corresponds to a user and is being used to provide input gestures. The object detection and tracking device may, additionally or alternatively, determine that the object does not correspond to the user and thus will not be used to provide input gestures.
In one implementation, the object detection and tracking device may include one or more processors that execute instructions stored in a memory of the object detection and tracking device. The one or more processors may include, for example, a camera processor, a central processing unit (CPU), a graphical processing unit (GPU), a digital signal processor (DSP), or a neural processing unit (NPU). The object detection and tracking device may execute the instructions to detect an object, such as a hand of a user, based on one or more parameter values. The parameter values may correspond to, for example, an angle of insertion of the object into a predetermined window within a FOV of the user within the virtual environment. The object detection and tracking device may also execute the instructions to detect that the object corresponds to the user, and may track the object to recognize input gestures from the user. For example, the object detection and tracking device may detect an object within the predetermined window, determine that the object corresponds to the user, and may track the object over the FOV of the user. Based on tracking movement of the object, the object detection and tracking device may determine one or more input gestures from the user.
In another implementation, the object detection and tracking device may include one or more processors that execute instructions stored in a memory of the object detection and tracking device to detect an object of the user, such as a hand, based on a unique profile of the user. For instance, the unique profile of the user may include data characterizing one or more of a shape of a user's hand, palm lines on the user's hand, palm-contours, sizes of the user's fingernails, shapes of the user's fingernails, the object's color, a multi-point outline of the user's hand, and one or more identification marks on the object. The object detection and tracking device may execute the instructions to track the object based on the profile of the user. For instance, the object detection and tracking device may execute the instructions to detect one or more input gestures from the user based on the profile of the user.
In some implementations, the object detection and tracking device may include one or more processors that execute one or more trained machine learning processes to detect an object of a user, such as a user's hand, for tracking and receiving one or more gesture inputs. The trained machine learning processes may include, for example, a trained neural network process (e.g., a trained convolutional neural network (CNN) process), a trained deep learning process, a trained decision tree process, a trained support vector machine process, or any other suitable trained machine learning process. For instance, during initialization, the object detection and tracking device may prompt the user to select an object detected by a camera or sensor of the object detection and tracking device as an object to be utilized for detecting gesture inputs for the user. The object detection and tracking device may apply the trained machine learning process to image data characterizing the selected object to generate a plurality of data points for the object, and a multi-dimensional model of the selected object. Further, the object detection and tracking device may apply the trained machine learning process to the multi-dimensional model of the object to estimate action points. For instance, the action points may be anticipated points in a 3-D space of a virtual environment that may move when making certain gestures. In some instances, the object detection and tracking device may implement a training mode for the machine learning process during which the machine learning process may iteratively alter the action points in the 3-D space for respective gestures. For example, the object detection and tracking device may determine a gesture based on generated action points, and may request and receive a validation from the user to confirm whether the determined gesture is correct.
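By way of a non-limiting illustration, the following Python sketch outlines one possible form of this pipeline. The callables landmark_model, mesh_builder, and action_point_model are hypothetical placeholders for the trained machine learning processes described above; their names and interfaces are assumptions for illustration and are not part of any particular framework or of this disclosure.

```python
import numpy as np

def build_object_profile(image: np.ndarray, landmark_model, mesh_builder, action_point_model):
    """Hypothetical pipeline: image data -> data points -> multi-dimensional
    model -> action points.

    The three callables stand in for trained machine learning processes
    (e.g., CNNs); their interfaces are illustrative only.
    """
    # 1. Generate a plurality of data points (e.g., landmarks) for the object.
    data_points = landmark_model(image)                  # e.g., array of shape (N, 3)

    # 2. Generate a multi-dimensional (e.g., 3-D) model from the data points.
    multi_dim_model = mesh_builder(data_points)

    # 3. Estimate action points: locations expected to move during gestures
    #    (e.g., joints of a hand).
    action_points = action_point_model(multi_dim_model)  # e.g., array of shape (M, 3)

    return data_points, multi_dim_model, action_points
```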
In some implementations, and based on the execution of instructions stored in non-volatile memory, the one or more processors may apply a machine learning process to a multi-dimensional model of the object to generate a look-up table. The look-up table may include a list of gestures and a sequence of tracking points in a 3-D space that the object may be expected to span during a gesture. The tracking points may include, for example, x, y, z coordinates for each of the tracking points in the 3-D space.
When the training process completes, the one or more processors may store values and sequences of the tracking points and the corresponding gestures as look-up tables (e.g., each look-up table corresponding to a unique object/hand of the user) in a memory device of the object detection and tracking device. The look-up table corresponding to the object may enable the one or more processors to detect and identify a gesture(s) made by the object as movement of the object is tracked (e.g., during a gesture input).
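As one possible in-memory representation of such a look-up table (a sketch only; the gesture labels and coordinate values are arbitrary and purely illustrative), each entry may map a gesture to the ordered sequence of x, y, z tracking points the object is expected to span:

```python
from typing import Dict, List, Tuple

# One look-up table per registered object/hand: each gesture label maps to the
# ordered sequence of (x, y, z) tracking points the object is expected to span.
LookupTable = Dict[str, List[Tuple[float, float, float]]]

example_table: LookupTable = {
    "swipe_left": [(0.42, 0.10, 0.55), (0.30, 0.11, 0.55), (0.18, 0.12, 0.54)],
    "figure_eight": [(0.25, 0.20, 0.50), (0.30, 0.28, 0.50), (0.25, 0.36, 0.50),
                     (0.20, 0.28, 0.50), (0.25, 0.20, 0.50)],
}
```

Such a table may then be persisted in a memory device (e.g., as one of look-up tables 132C) and consulted whenever a tracked sequence of points needs to be resolved to a gesture.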
Among other advantages, the embodiments described herein may provide more accurate object (e.g., hand) detection for XR, VR, or AR systems. For instance, the embodiments may more accurately identify which object corresponds to a user, and which object in a FOV of the user is a foreign or unintended object to be ignored. Further, the embodiments described herein may provide a user greater flexibility to utilize an irregularly shaped object for providing input. The embodiments may be especially helpful to physically challenged users, such as users with irregularities in one or both hands. The embodiments may further allow a user to engage in multi-tasking (e.g., holding one or more objects in their hands, engaging one or more fingers with other gadgets such as a fitness tracker, smart watch, etc.) while making gestures that the object detection and tracking device can successfully detect and track. Further, the embodiments may be employed across a variety of applications, such as gaming, computer vision, AR, VR, medical, biometric, and robotics applications, among others. Persons of ordinary skill in the art having the benefit of these disclosures would recognize these and other benefits as well.
As illustrated in the example of
Object detection and tracking device 100 may further include a central processing unit (CPU) 116, an encoder/decoder 117, a graphics processing unit (GPU) 118, a local memory 120 of GPU 118, a user interface 122, a memory controller 124 that provides access to system memory 130 and to instruction memory 132, and a display interface 126 that outputs signals that cause graphical data to be displayed on a display 128.
In some examples, one of image sensors 112 may be allocated for each of lenses 113. Further, in some examples, one or more of image sensors 112 may be allocated to a corresponding one of lenses 113 of a respective, and different, lens type (e.g., a wide lens, ultra-wide lens, telephoto lens, and/or periscope lens, etc.). For instance, lenses 113 may include a wide lens, and a corresponding one of image sensors 112 having a first size (e.g., 108 MP) may be allocated to the wide lens. In another instance, lenses 113 may include an ultra-wide lens, and a corresponding one of image sensors 112 having a second, and different, size (e.g., 16 MP) may be allocated to the ultra-wide lens. In another instance, lenses 113 may include a telephoto lens, and a corresponding one of image sensors 112 having a third size (e.g., 12 MP) may be allocated to the telephoto lens.
In an illustrative example, a single object detection and tracking device 100 may include two or more cameras (e.g., two or more of camera 115). Further, in some examples, a single image sensor, e.g., image sensor 112A, may be allocated to multiple ones of lenses 113. Additionally, or alternatively, each of image sensors 112 may be allocated to a different one of lenses 113, e.g., to provide multiple cameras to object detection and tracking device 100.
In some examples, not illustrated in
Each of the image sensors 112, including image sensor 112A, may represent an image sensor that includes processing circuitry, an array of pixel sensors (e.g., pixels) for capturing representations of light, memory, an adjustable lens (such as lens 113), and an actuator to adjust the lens. By way of example, image sensor 112A may be associated with, and may capture images through, a corresponding one of lenses 113, such as lens 113A. In other examples, additional, or alternate, ones of image sensors 112 may be associated with, and capture images through, corresponding additional ones of lenses 113.
Image sensors 112 may also include a subset of two or more different image sensors operating in conjunction with one another that detect motion of an object/hand. For example, image sensors 112 may include two different “color” pixel sensors operating in conjunction with one another. The different color pixel sensors may support different binning types and/or binning levels, and although operating in conjunction with one another, the different color pixel sensors may each operate with respect to a particular range of zoom levels.
Additionally, in some instances, object detection and tracking device 100 may receive user input via user interface 122, and in response to the received user input, CPU 116 and/or camera processor 114 may activate respective ones of lenses 113, or combinations of lenses 113. For example, the received user input may correspond to an affirmation that the object/hand in view of the lens 113A is the object/hand of the user which should be tracked for input gestures. In some instances, the user input via user interface 122 may be an affirmation that the object detection and tracking device 100 has identified the correct gesture during the machine learning process (as described above).
Although the various components of object detection and tracking device 100 are illustrated as separate components, in some examples, the components may be combined to form a system on chip (SoC). As an example, instruction memory 132, CPU 116, GPU 118, and display interface 126 may be implemented on a common integrated circuit (IC) chip. In some examples, one or more of instruction memory 132, CPU 116, GPU 118, and display interface 126 may be implemented in separate IC chips. Various other permutations and combinations are possible, and the techniques of this disclosure should not be considered limited to the example of
System memory 130 may store program modules and/or instructions and/or data that are accessible by camera processor 114, CPU 116, and GPU 118. For example, system memory 130 may store user applications (e.g., instructions for the camera application) and resulting images from camera processor 114. System memory 130 may additionally store information for use by and/or generated by other components of object detection and tracking device 100. For example, system memory 130 may act as a device memory for camera processor 114. System memory 130 may include one or more volatile or non-volatile memories or storage devices, such as, for example, random access memory (RAM), static RAM (SRAM), dynamic RAM (DRAM), read-only memory (ROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, a magnetic data medium, a cloud-based storage medium, or an optical storage medium.
Camera processor 114 may store data to, and read data from, system memory 130. For example, camera processor 114 may store a working set of instructions to system memory 130, such as instructions loaded from instruction memory 132. Camera processor 114 may also use system memory 130 to store dynamic data created during the operation of object detection and tracking device 100.
Similarly, GPU 118 may store data to, and read data from, local memory 120. For example, GPU 118 may store a working set of instructions to local memory 120, such as instructions loaded from instruction memory 132. GPU 118 may also use local memory 120 to store dynamic data created during the operation of object detection and tracking device 100. Examples of local memory 120 include one or more volatile or non-volatile memories or storage devices, such as RAM, SRAM, DRAM, EPROM, EEPROM, flash memory, a magnetic data medium, a cloud-based storage medium, or an optical storage medium.
Instruction memory 132 may store instructions that may be accessed (e.g., read) and executed by one or more of camera processor 114, CPU 116, and GPU 118. For example, instruction memory 132 may store instructions that, when executed by one or more of camera processor 114, CPU 116, and GPU 118, cause one or more of camera processor 114, CPU 116, and GPU 118 to perform one or more of the operations described herein. For instance, instruction memory 132 can include a detection unit 132A that can include instructions that, when executed by one or more of camera processor 114, CPU 116, and GPU 118, cause the one or more of camera processor 114, CPU 116, and GPU 118 to detect an object/hand of the user as described in different embodiments. Instruction memory 132 can also include tracking unit 132B that can include instructions that, when executed by one or more of camera processor 114, CPU 116, and GPU 118, cause the one or more of camera processor 114, CPU 116, and GPU 118 to track the movement of the object/hand as described in different embodiments. In an optional implementation, the tracking unit 132B may include look-up tables 132C that may store, for a specific object/hand, one or more sequences of tracking points and the gesture(s) corresponding to those sequences. As described herein, the look-up tables 132C may allow identification of a gesture for an object/hand based on the tracking points spanned by the object/hand during the gesture. Further, the tracking unit 132B may include instructions that, when executed by one or more of camera processor 114, CPU 116, and GPU 118, cause the one or more of camera processor 114, CPU 116, and GPU 118 to execute a machine learning process as described herein.
Instruction memory 132 may also store instructions that, when executed by one or more of camera processor 114, CPU 116, and GPU 118, cause one or more of camera processor 114, CPU 116, and GPU 118 to perform image processing operations, generate a plurality of data points for an object/hand, generate a multi-dimensional model of the object/hand, and generate a plurality of action points based on the multi-dimensional model of the object/hand as described herein.
The various components of object detection and tracking device 100, as illustrated in
Camera processor 114 may be configured to receive image frames (e.g., pixel data, image data) from image sensors 112, and process the image frames to generate image and/or video content. For example, image sensor 112A may be configured to capture individual frames, frame bursts, frame sequences for generating video content, photo stills captured while recording video, image previews, or motion photos from before and/or after capture of an input gesture by the user. CPU 116, GPU 118, camera processor 114, or some other circuitry may be configured to process the image and/or video content captured by image sensor 112A into images or video for display on display 128. In an illustrative example, CPU 116 may cause image sensor 112A to capture image frames, and may receive pixel data from image sensor 112A. In the context of this disclosure, image frames may generally refer to frames of data for a still image or frames of video data or combinations thereof, such as with motion photos. Camera processor 114 may receive, from image sensors 112, pixel data of the image frames in any suitable format. For instance, the pixel data may be formatted according to a color format such as RGB, YCbCr, or YUV.
In some examples, camera processor 114 may include an image signal processor (ISP). For instance, camera processor 114 may include an ISP that receives signals from image sensors 112, converts the received signals to image pixels, and provides the pixel values to camera processor 114. Additionally, camera processor 114 may be configured to perform various operations on image data captured by image sensors 112, including auto gain, auto white balance, color correction, or any other image processing operations.
Memory controller 124 may be communicatively coupled to system memory 130 and to instruction memory 132. Memory controller 124 may facilitate the transfer of data going into and out of system memory 130 and/or instruction memory 132. For example, memory controller 124 may receive memory read and write commands, such as from camera processor 114, CPU 116, or GPU 118, and service such commands to provide memory services to system memory 130 and/or instruction memory 132. Although memory controller 124 is illustrated in the example of
Camera processor 114 may also be configured, by executed instructions, to analyze image pixel data and store resulting images (e.g., pixel values for each of the image pixels) to system memory 130 via memory controller 124. GPU 118 or some other processing unit, including camera processor 114 itself, may perform operations to detect an object in a placement area, register the object/hand, generate a plurality of data points for the object/hand based on an image of the object/hand, generate a multi-dimensional model of the object/hand based on the plurality of data points, generate a plurality of action points based on the multi-dimensional model, and track the movement of the object/hand in a virtual environment based on the plurality of action points.
In addition, object detection and tracking device 100 may include a video encoder and/or video decoder 117, either of which may be integrated as part of a combined video encoder/decoder (CODEC). Encoder/decoder 117 may include a video coder that encodes video captured by one or more camera(s) 115 or a decoder that decodes compressed or encoded video data. In some instances, CPU 116 may be configured to encode and/or decode video data using encoder/decoder 117.
CPU 116 may comprise a general-purpose or a special-purpose processor that controls operation of object detection and tracking device 100. A user may provide input to object detection and tracking device 100 to cause CPU 116 to execute one or more software applications. The software applications executed by CPU 116 may include, for example, a camera application, a graphics editing application, a media player application, a video game application, a graphical user interface application, or another program. For example, and upon execution by CPU 116, a camera application may allow control of various settings of camera 115, e.g., via input provided to object detection and tracking device 100 via user interface 122. Examples of user interface 122 include, but are not limited to, a pressure-sensitive touchscreen unit, a keyboard, a mouse, or an audio input device, such as a microphone. For example, user interface 122 may receive input from the user to validate an object to be tracked for input gestures, or to validate the gestures recognized during the machine learning process (as described above, and with reference to
By way of example, the executed camera application may cause CPU 116 to generate content that is displayed on display 128 and/or in a virtual environment of the hybrid environment of an AR, VR or XR system. For instance, display 128 or a projection in the virtual environment may display information, such as an image insertion guide for indicating to the user a direction and/or an angle of insertion of the object/hand for detection by the object detection and tracking device 100. An executed hand detection and tracking application stored in the system memory 130 and/or the instruction memory 132 (not shown in
As described herein, one or more of CPU 116 and GPU 118 may perform operations that apply a trained machine learning process to generate or update look-up tables 132C.
In some examples, the one or more of CPU 116 and GPU 118 cause the output data to be displayed on display 128. In some examples, the object detection and tracking device 100 transmits, via transceiver 119, the output data to a computing device, such as a server or a user's handheld device (e.g., cellphone). For example, the object detection and tracking device 100 may transmit a message to another computing device, such as a verified user's handheld device, or a projection device simulating a virtual environment based on the output data.
In some examples, the object detection and tracking device 100 may determine whether an angle of insertion of the hand 310 is within a predetermined range, and may generate profile data identifying the hand 310 as the hand of a user based on the determination. For example, the predetermined range may be a range of values of angles based on the horizon of vision of the user 202. When the object detection and tracking device 100 determines that a detected angle of insertion of hand 310 is within the predetermined range of values, the object detection and tracking device 100 may register the hand 310 as an object to be tracked for the user. Similarly, the object detection and tracking device 100 may determine that a direction of insertion into the placement area 308 is an appropriate direction (e.g., bottom to up), and the object detection and tracking device 100 may register the hand 310 as the object to be tracked for the user.
As another example, the object detection and tracking device 100 may determine the angle of insertion of the hand 310 is not within the predetermined range of values, and may not associate the hand 310 with the user. Similarly, the object detection and tracking device 100 may determine that a direction of insertion into the placement area 308 is not the appropriate direction (e.g., top to bottom), and may not associate the hand 310 with the user. As such, the object detection and tracking device 100 may not register the hand 310 as an object to be tracked. In one implementation, the object detection and tracking device 100 may request the user 202 to re-insert the hand 310 at a suggested angle and/or direction. For example, the object detection and tracking device 100 may provide visual cues (e.g., an insertion guidance image that identifies one or more insertion angles for inserting the hand/object 310 into the placement area 308) through a projection in or near the placement area 308 that indicates to the user 202 an angle of insertion and/or a direction of insertion through which the user 202 may insert the hand 310 to successfully register the hand 310 as a hand of the user 202 with the XR system.
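A minimal sketch of this registration check follows. The angle range, direction label, and function name are assumptions chosen for illustration; they are not values or interfaces prescribed by this disclosure.

```python
def should_register(insertion_angle_deg: float,
                    insertion_direction: str,
                    angle_range: tuple = (60.0, 120.0),
                    allowed_direction: str = "bottom_to_top") -> bool:
    """Return True when an object's insertion into the placement area falls
    within the predetermined angle range (measured relative to the user's
    horizon of vision) and matches the expected direction of insertion."""
    low, high = angle_range
    return (low <= insertion_angle_deg <= high) and (insertion_direction == allowed_direction)

# Example: a hand inserted at 95 degrees from bottom to top would be registered,
# while a hand entering top to bottom would instead trigger a re-insertion prompt.
print(should_register(95.0, "bottom_to_top"))  # True
print(should_register(95.0, "top_to_bottom"))  # False
```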
For instance,
For instance,
Specifically,
To begin the initialization process, the object detection and tracking device 100 may generate a first request for a placement of the object in a hybrid environment that includes a real environment, and a virtual environment. For example, the object detection and tracking device 100 may indicate to a user the first request in a virtual environment of the XR system by generating and displaying an image that highlights a placement area (e.g., the placement area 308 as described above with reference to
Based on the plurality of image data points, the object detection and tracking device 100 may determine a plurality of action points, and may store the plurality of action points in a memory. The plurality of action points may be significant points that are mobile enough in an object, such as a hand, to create gestures. For example, the plurality of action points may correspond to joints or other points where the object or hand can bend (e.g., the action points may be points in a 3-D model of the hand that are expected to be active/moving in a certain manner for a specific gesture). The object detection and tracking device 100 may apply a trained machine learning process, such as a trained neural network, to the plurality of data points to determine the plurality of action points. For instance, a neural network may be trained using supervised learning based on elements of image data characterizing hands and corresponding action points. The object detection and tracking device 100 may execute the trained neural network to ingest the plurality of data points and output the plurality of action points.
Based on the plurality of action points and the captured image, the object detection and tracking device 100 may determine a gesture of the user. For instance, the object detection and tracking device 100 may capture an image of an object in the FOV 604, and may adapt the plurality of action points to the image of the object to generate tracking points for the object. For example, the tracking points may be points in a 3-D space that a hand 902 is expected to span when making a gesture within the FOV 604. In some instances, the object detection and tracking device 100 may determine the plurality of action points for multiple captured images, and may determine the gesture based on tracking points generated for each of the captured images. For instance, the object detection and tracking device 100 may capture a second image of the object in the FOV 604, and may adapt the plurality of action points to an object in the second image to generate additional tracking points. The object detection and tracking device 100 may determine the gesture based on the tracking points and the additional tracking points (e.g., the gesture may be a movement of the object from one location in the virtual environment to another). The object detection and tracking device 100 may utilize such a sequence of tracking points for identifying gestures and receiving user input corresponding to the gestures. In some implementations, the object detection and tracking device 100 may save the sequence of tracking points spanned during a gesture in a look-up table, e.g., the look-up tables 132C as described above with reference to
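One way to perform such a look-up, sketched below under the assumption that the observed sequence and each stored sequence have been resampled to a common length, is a nearest-match search over the gestures stored in a look-up table such as look-up tables 132C:

```python
import numpy as np

def identify_gesture(observed: np.ndarray, lookup_table: dict,
                     max_mean_distance: float = 0.05):
    """Return the stored gesture whose tracking-point sequence is closest (by
    mean Euclidean distance) to the observed sequence, or None when no stored
    gesture is close enough.

    observed: array of shape (T, 3), the sequence of 3-D points spanned by the
    object/hand during the candidate gesture.
    """
    best_gesture, best_score = None, float("inf")
    for gesture, sequence in lookup_table.items():
        stored = np.asarray(sequence, dtype=float)
        if stored.shape != observed.shape:
            continue  # assumes sequences were resampled to a common length
        score = float(np.linalg.norm(observed - stored, axis=1).mean())
        if score < best_score:
            best_gesture, best_score = gesture, score
    return best_gesture if best_score <= max_mean_distance else None
```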
In some examples, the object detection and tracking device 100 trains the machine learning process based on images and corresponding action points for objects within the images. For example, the machine learning process may be trained using supervised learning, where a corresponding machine learning model ingests elements of an image and corresponding action points, and generates elements of output data. Further, and during training, the object detection and tracking device 100 may determine one or more losses based on the generated output data. The object detection and tracking device 100 may determine that training is complete when the one or more losses are within a threshold (e.g., less than a threshold value). For instance, the object detection and tracking device 100 may continue to train the machine learning process until the one or more losses are each below respective thresholds.
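A simplified training-loop sketch of this convergence test appears below. It assumes PyTorch-style tensors for illustration only; the model, data loader, loss functions, and threshold values are all assumptions rather than elements of this disclosure.

```python
def train_until_converged(model, data_loader, loss_fns, thresholds,
                          optimizer, max_epochs: int = 100):
    """Train until every tracked loss falls below its corresponding threshold
    (one reading of 'losses within a threshold').

    Assumes PyTorch-style tensors (with .backward() and .item()).
    loss_fns:   dict name -> callable(outputs, targets) returning a scalar loss.
    thresholds: dict name -> float threshold for that loss.
    """
    epoch_losses = {name: float("inf") for name in loss_fns}
    for _ in range(max_epochs):
        epoch_losses = {name: 0.0 for name in loss_fns}
        for inputs, targets in data_loader:
            optimizer.zero_grad()
            outputs = model(inputs)
            losses = {name: fn(outputs, targets) for name, fn in loss_fns.items()}
            total = sum(losses.values())
            total.backward()
            optimizer.step()
            for name, value in losses.items():
                epoch_losses[name] += value.item() / len(data_loader)
        if all(epoch_losses[name] < thresholds[name] for name in loss_fns):
            break  # converged: every loss is below its threshold
    return epoch_losses
```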
In some examples, the object detection and tracking device 100 may generate third requests for the user to perform one or more gestures to train and/or optimize the machine learning process for better detection and tracking of the hand 902. The object detection and tracking device 100 may generate the third requests for the user and display the third requests within the virtual environment of the system. For example, the object detection and tracking device 100 may request the user to make a gesture using the hand 902 (e.g., a gesture of rotating the hand 902, or making a figure of one or more shapes, such as alphabets, numbers (e.g., a figure of 8), or symbols). The object detection and tracking device 100 may then detect the performance of a gesture from the user in response to displaying a third request. The object detection and tracking device 100 may identify one or more points of the hand 902 that engaged in motion during the requested gesture, and adjust the plurality of action points based on the identified one or more points of the hand 902 for the requested gesture. The object detection and tracking device 100 may also identify a sequence of points in 3-D space that are spanned by the hand 902 as described herein, and use such a sequence of points to update the sequence of tracking points stored in the look-up table 132C.
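This refinement of action points based on which points actually moved during a requested gesture might be sketched as follows; the array shapes and the choice of total displacement as the mobility measure are assumptions made purely for illustration.

```python
import numpy as np

def refine_action_points(frames: np.ndarray, num_action_points: int = 21) -> np.ndarray:
    """Return the indices of the candidate points that were most active during
    a requested gesture; these become the adjusted action points.

    frames: array of shape (T, N, 3), i.e., N candidate points tracked over T frames.
    """
    # Total displacement of each candidate point across the requested gesture.
    displacement = np.linalg.norm(np.diff(frames, axis=0), axis=2).sum(axis=0)  # shape (N,)
    # Keep the most mobile points, since these are the points that form the gesture.
    most_mobile = np.argsort(displacement)[::-1][:num_action_points]
    return np.sort(most_mobile)
```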
In some examples, object detection and tracking device 100 may obtain, from system memory 130, hand attributes data, which may include a plurality of attributes such as palm-lines, palm-contours, shape, size of fingernails, shape of fingernails, color, multi-point hand outline geometry, and one or more identification marks. The object detection and tracking device 100 may also obtain object data (which may be angle and direction of insertion data) and user horizon of vision data (which may represent at least one view in a range of angles based on the horizon of vision of user). Object detection and tracking device 100 may generate characteristics of a hand of the user based on the hand attributes data and the user horizon of vision data, and provide the generated characteristics to train, for example, a convolutional neural network (CNN) process, a deep learning process, a decision tree process, a support vector machine process, or any other suitable machine learning process.
In one implementation, object detection and tracking device 100 may apply one or more data processing techniques to the hand attributes data and the user horizon of vision data to determine one or more parameters corresponding to a hand of a user. For example, object detection and tracking device 100 may utilize the user horizon of vision data to identify a core region of the hand of the user in the placement area. The object detection and tracking device 100 may then finely characterize the size, shape, position and orientation, etc. of the hand in the placement area based on the hand attributes data. For instance, and as described herein, the hand attributes data may characterize one or more of palm-lines, palm-contours, shape, size of fingernails, shape of fingernails, color, multi-point hand outline geometry, and one or more identification marks of a hand. Based on the hand attributes data, object detection and tracking device 100 may determine hand characteristics data for the hand, which can include one or more of a shape of the hand of the user, a position of the hand of the user, and an orientation of the hand of the user in the placement area. The object detection and tracking device 100 may output the hand characteristics data (for instance, the characteristics including the size, shape, position and orientation of the hand), and may apply one or more data processing techniques and/or correlation techniques to the hand characteristics data and angle and direction of insertion data to generate the object data. For example, object detection and tracking device 100 may correlate the angle and direction of insertion data with the position and orientation of the hand (e.g., included in the hand characteristics data) to generate the object data. In one example, the object data may characterize a plurality of values representing a position, an orientation, and a direction of insertion of an object into a placement area. The object data may further characterize whether the object is a hand of a user, which may be subsequently tracked as described herein.
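One possible, purely illustrative shape for the resulting object data, and for the correlation of hand characteristics with the angle and direction of insertion data, is sketched below; the field names, angle range, and direction label are assumptions rather than requirements of this disclosure.

```python
from dataclasses import dataclass

@dataclass
class ObjectData:
    """Illustrative container for the object data described above."""
    position: tuple           # e.g., (x, y, z) of the object in the placement area
    orientation: tuple        # e.g., (roll, pitch, yaw) in degrees
    insertion_angle_deg: float
    insertion_direction: str  # e.g., "bottom_to_top"
    is_user_hand: bool        # whether the object should be tracked as the user's hand

def build_object_data(hand_characteristics: dict, insertion: dict) -> ObjectData:
    """Correlate hand characteristics with angle/direction-of-insertion data.

    The correlation rule used here (angle within a fixed range and a matching
    direction) is an assumption chosen purely for illustration.
    """
    angle_ok = 60.0 <= insertion["angle_deg"] <= 120.0
    direction_ok = insertion["direction"] == "bottom_to_top"
    return ObjectData(
        position=hand_characteristics["position"],
        orientation=hand_characteristics["orientation"],
        insertion_angle_deg=insertion["angle_deg"],
        insertion_direction=insertion["direction"],
        is_user_hand=angle_ok and direction_ok,
    )
```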
Further, object detection and tracking device 100 may apply one or more trained machine learning processes, such as a trained CNN process, to the object data and hand attributes data to generate detection output data. The detection output data may identify an object (e.g., hand) inserted into a placement area of a hybrid environment (as described herein with reference to
By way of example, object detection and tracking device 100 may train a CNN process against feature values generated or obtained from a training data set that includes historical object data and hand attributes data, and object detection and tracking device 100 may compute one or more losses to determine whether the CNN process has converged. In some instances, object detection and tracking device 100 may determine one or more of a triplet loss, a regression loss, and a classification loss (e.g., cross-entropy loss), among others, based on one or more of the detection output data and the object data. For example, object detection and tracking device 100 may execute a sigmoid function that operates on the detection output data. Further, object detection and tracking device 100 may provide output generated by the executed sigmoid function as feedback to the training processes, e.g., to encourage more zeros and ones from the generated output.
Object detection and tracking device 100 may also compute a classification loss based on the detection output data and the object data. Further, object detection and tracking device 100 may provide the classification loss and the triplet loss as feedback to the training processes. Object detection and tracking device 100 may further determine whether one or more of the computed losses satisfy a corresponding threshold to determine whether the training processes have converged and the trained CNN process is available for deployment. For example, object detection and tracking device 100 may compare each computed loss to its corresponding threshold to determine whether each computed loss satisfies its corresponding threshold. In some examples, when each of the computed losses satisfies its corresponding threshold, object detection and tracking device 100 determines the convergence of the training processes, and the training processes are complete. Further, object detection and tracking device 100 generates training loss data characterizing the computed losses, and stores the training loss data within system memory 130.
In some examples, object detection and tracking device 100 may perform additional operations to determine whether the CNN process is sufficiently trained. For example, object detection and tracking device 100 may input elements of feature values generated or obtained from a validation data set that includes validation object data and hand attributes data to the CNN process to generate additional detection output data. Based on the detection output data, object detection and tracking device 100 computes one or more losses that characterize errors in detection of an object (e.g., hand) (as described above with reference to
Although, as described, object detection and tracking device 100 trains a CNN process, one or more of any suitable processing devices associated with object detection and tracking device 100 may train the CNN process as described herein. For example, one or more servers, such as one or more cloud-based servers, may train the CNN process. In some examples, one or more processors (e.g., CPUs, GPUs) of a distributed or cloud-based computing cluster may train the CNN process. In some implementations, the CNN process may be trained by another processing device associated with object detection and tracking device 100, and the other processing device may store the configuration parameters, hyperparameters, and/or weights associated with the trained CNN process in a data repository over a network (e.g., the Internet). Further, object detection and tracking device 100 may obtain, over the network, the stored configuration parameters, hyperparameters, and/or weights, and may store them within instruction memory 132 (e.g., within detection unit 132A). Object detection and tracking device 100 may then establish the CNN process based on the configuration parameters, hyperparameters, and/or weights stored within instruction memory 132.
In one example, the object detection and tracking device 100 implements the machine learning techniques for object detection and tracking movement of the object, such as a hand of the user, in a placement area of a hybrid environment. For instance, as described above, object detection and tracking device 100 may generate hand characteristics data, which may be further processed in combination with the angle and direction of insertion data to generate the object data that may indicate whether the detected hand corresponds to a hand of a user that should be tracked. Object detection and tracking device 100 may utilize the object data in combination with the hand attributes data and/or the angle and direction of insertion data to generate the detection output data that identifies the hand of the user in the placement area. Object detection and tracking device 100 may track the hand of the user in the placement area of the hybrid environment based on the detection output data.
In some examples, object detection and tracking device 100 may utilize data points and multi-dimensional model data to generate action points. For example, the data points may represent points that trace an image of a hand/object, and the multi-dimensional model data may represent a model defining a shape of an object. The object detection and tracking device 100 may apply one or more data processing techniques to generate action points that, in some examples, may characterize anticipated points of an object/hand in a 3-D space, where the points may move when certain gestures are made by the object/hand (as described above with reference to
In one implementation, object detection and tracking device 100 may apply one or more trained machine learning processes, such as a trained CNN process, to action points, object orientation data, gesture data, multi-dimensional model data and/or data points to generate tracking points data. For example, gesture data may characterize a gesture performed by a user in response to a request for performing the gesture, e.g., a rotation, a figure of one or more alphabets, numbers, or symbols (as described above with reference to
Further, object detection and tracking device 100 may perform operations to generate look-up tables. For instance, in some examples, object detection and tracking device 100 may apply one or more data processing and/or correlation techniques to multi-dimensional model data, object orientation data and tracking points data to generate look-up tables 132C. Look-up tables 132C may include data characterizing one or more gestures and a sequence of tracking points corresponding to each gesture (e.g., as described herein with reference to
In some examples, object detection and tracking device 100 may perform operations to train a CNN process to generate look-up tables 132C using multi-dimensional model data, object orientation data and tracking points data. For instance, object detection and tracking device 100 may generate features based on historical look-up tables, multi-dimensional model data, object orientation data, and tracking points data, and may input the generated features to the CNN process for training (e.g., supervised learning). The CNN may be trained to generate output data characterizing the historical look-up tables based on the features generated from the multi-dimensional model data, object orientation data, and tracking points data.
Object detection and tracking device 100 may compute one or more losses to determine whether the CNN has converged. For example, object detection and tracking device 100 may determine one or more of a triplet loss, a regression loss, and a classification loss (e.g., cross-entropy loss), among others, based on one or more of the tracking points data and look-up tables 132C. For example, object detection and tracking device 100 may execute a sigmoid function that operates on the tracking points data. The sigmoid function can serve as an amplifier to enhance the response generated from the CNN process. Further, object detection and tracking device 100 may provide output generated by the executed sigmoid function as feedback to the CNN process, so as to encourage more zeros and/or ones in the generated output.
Object detection and tracking device 100 may compute a classification loss based on the tracking points data and look-up tables 132C. Further, object detection and tracking device 100 may provide the classification loss and the triplet loss as feedback to the CNN process. Object detection and tracking device 100 may further determine whether one or more of the computed losses satisfy a corresponding threshold to determine whether the CNN process has converged. For example, object detection and tracking device 100 may compare each computed loss to its corresponding threshold to determine whether each computed loss satisfies its corresponding threshold. In some examples, when each of the computed losses satisfies its corresponding threshold, object detection and tracking device 100 determines the CNN process has converged, and training is complete. Further, object detection and tracking device 100 generates training loss data characterizing the computed losses, and stores the training loss data within system memory 130.
In some examples, object detection and tracking device 100 may provide additional data points characterizing a validation data set to the initially trained CNN to determine whether the initially trained CNN is sufficiently trained. For example, object detection and tracking device 100 may apply the initially trained CNN to the data points characterizing the validation data set to generate action points (e.g., an improved set of action points). For instance, object detection and tracking device 100 may apply the initially trained CNN as described herein to the action points to generate additional tracking points data. Based on the tracking points data, object detection and tracking device 100 may compute one or more losses that characterize errors in detection of a hand/object (as described herein with reference to
Although, as described, object detection and tracking device 100 trains the CNN process, one or more of any suitable processing devices associated with object detection and tracking device 100 may train the CNN process as described herein. For example, one or more servers, such as one or more cloud-based servers, may train the CNN process. In some examples, one or more processors (e.g., CPUs, GPUs) of a distributed or cloud-based computing cluster may train the CNN process. In some implementations, the CNN process may be trained by another processing device associated with object detection and tracking device 100, and the other processing device may store the configuration parameters, hyperparameters, and/or weights associated with the trained CNN process in a data repository over a network (e.g., the Internet). Further, object detection and tracking device 100 may obtain, over the network, the stored configuration parameters, hyperparameters, and/or weights, and may store them within instruction memory 132 (e.g., within detection unit 132A and/or tracking unit 132B). Object detection and tracking device 100 may then establish the CNN process based on the configuration parameters, hyperparameters, and/or weights stored within instruction memory 132.
In one example, object detection and tracking device 100 implements the machine learning techniques for detection and tracking the movement of an object/hand of the user in the placement area. As described above, object detection and tracking device 100 may generate action points and object orientation data for an object (e.g., hand), where the action points may be processed in combination with gesture data, multi-dimensional model data and/or data points, to generate tracking points data. The tracking points data may represent a sequence of points in a 3-D space spanned by the hand/object during the gesture by the object/hand. The object detection and tracking device 100 may generate look-up tables 132C based on the tracking points data as described herein. As such, the object detection and tracking device 100 can operate to detect and track an object (e.g., hand) of a user in a placement area of a hybrid environment.
At block 1002, object detection and tracking device 100 detects an object in a placement area of a hybrid environment. For example, the CPU 116 of the object detection and tracking device 100 may execute instructions stored in the detection unit 132A to detect the placement of an object in a placement area of the hybrid environment. For example, the placement area may be the placement area 308 (as described above with reference to
At step 1004, in response to the detection, the object detection and tracking device 100 receives at least one parameter corresponding to the hand of the user. For example, the at least one parameter may be a direction of insertion, an angle of insertion, palm-lines, palm-contours, shape, size of fingernails, shape of fingernails, color, multi-point hand outline geometry, or one or more identification marks (as described above with reference to
At step 1006, the object detection and tracking device 100 registers the object based on the at least one parameter. For example, the object detection and tracking device 100 may generate profile data that registers the object with a user, and may store the profile data within system memory 130 and/or the instruction memory 132. The profile data may indicate that the object corresponding to the parameter(s) is an object of the user that will be used for providing gesture inputs.
At step 1008, the object detection and tracking device 100 tracks the movement of the object based on the registration. For example, the camera 115 may monitor the movements of the object (registered at step 1006) corresponding to the user. The camera 115 may provide one or more images and/or video sequences to the CPU 116, which may execute instructions stored in the detection unit 132A and/or the tracking unit 132B to track the movement of the registered object, as described herein.
At block 1102, the object detection and tracking device 100 captures at least one image of the object in a real environment of a hybrid environment. For example, the camera 115 of the object detection and tracking device 100 may capture at least one image of the object in a real environment of the hybrid environment of the XR system (as described above with reference to
Further, at step 1104, the object detection and tracking device 100 generates a plurality of data points for the object based on the at least one image. For example, the CPU 116 may execute instructions stored in the detection unit 132A and/or the tracking unit 132B to generate the plurality of data points for the object based on the at least one image (as described above with reference to
At block 1106, the object detection and tracking device 100 generates a multi-dimensional model of the object based on the plurality of data points. For instance, the CPU 116 may execute one or more instructions stored in the detection unit 132A and/or the tracking unit 132B to generate a 3D model of the object (as described above with reference to
Proceeding to step 1108, the object detection and tracking device 100 may generate a plurality of action points based on the multi-dimensional model of the object. For example, CPU 116 may execute one or more instructions stored in the detection unit 132A and/or the tracking unit 132B to generate the plurality of action points (as described above with reference to
At step 1112, the object detection and tracking device 100 may track a movement of the object in a virtual environment of the hybrid environment based on the plurality of action points. For example, the CPU 116 may execute one or more instructions stored in the tracking unit 132B to track the movement of the object in a virtual environment of the hybrid environment (as described above with reference to
Implementation examples are further described in the following numbered clauses:
Although the methods described above are with reference to the illustrated flowcharts, many other ways of performing the acts associated with the methods may be used. For example, the order of some operations may be changed, and some embodiments may omit one or more of the operations described and/or include additional operations.
Further, although the exemplary embodiments described herein are, at times, described with respect to an object detection and tracking device, the machine learning processes, as well as the training of those machine learning processes, may be implemented by one or more suitable devices. For example, in some examples, an object detection and tracking device may capture an image or video sequence and may transmit the image to a distributed or cloud computing system. The distributed or cloud computing system may apply the trained machine learning processes described herein to track the movement of the object.
Additionally, the methods and systems described herein may be at least partially embodied in the form of computer-implemented processes and apparatus for practicing those processes. The disclosed methods may also be at least partially embodied in the form of tangible, non-transitory machine-readable storage media encoded with computer program code. For example, the methods may be embodied in hardware, in executable instructions executed by a processor (e.g., software), or a combination of the two. The media may include, for example, RAMs, ROMs, CD-ROMs, DVD-ROMs, BD-ROMs, hard disk drives, flash memories, or any other non-transitory machine-readable storage medium. When the computer program code is loaded into and executed by a computer, the computer becomes an apparatus for practicing the method. The methods may also be at least partially embodied in the form of a computer into which computer program code is loaded or executed, such that the computer becomes a special-purpose computer for practicing the methods. When implemented on a general-purpose processor, the computer program code segments configure the processor to create specific logic circuits. The methods may alternatively be at least partially embodied in application-specific integrated circuits for performing the methods.