This disclosure relates to systems and methods to reconstruct facial expressions and to track facial movements from head or neck-mounted wearable devices.
Humans use facial expressions as a natural mode of communication. The ability to continuously record and understand facial movements can improve interactions between humans and computers in a variety of applications.
Conventional facial reconstruction methods require a camera to be positioned in front of a user's face with a specified position and angle relative to the user's face. To achieve reliable facial reconstruction, the camera needs an entire view of the face without occlusions. Conventional facial reconstruction methods do not perform well if the user is in motion, the camera is not appropriately set up, the camera is not in front of the user, or the user's face is partially occluded or not fully visible to the camera due to the camera's position or angle relative to the user's face.
As an alternative to frontal camera systems, wearable devices for facial expression reconstruction have been developed using sensing techniques, such as acoustic interference, pressure sensors, electrical impedance tomography, and electromyography. These wearable devices use instrumentation that is mounted directly on a user's face. These conventional devices often cover the user's face and only recognize discrete facial expressions. Examples of these conventional wearable devices include face masks with built-in ultrasonic transducers or electrodes secured to a human face with electromyography or capacitive sensing abilities. These wearable devices are attached directly to the user's face or body and may block the field of vision and interfere with normal daily activities, such as eating or socializing.
Another alternative to frontal camera systems is smart eyewear including smart glasses, augmented reality glasses, and virtual reality headsets. However, these smart eyewear devices cannot track high-quality facial movements continuously. For example, virtual reality devices cannot depict 3D avatars in virtual worlds with facial expressions of the user.
The present technology allows reconstruction of facial expressions and tracking of facial movements using non-obtrusive, wearable devices that capture optical or acoustical images of facial contours, chin profiles, or skin deformation of a user's face. The wearable devices include head-mounted technology that continuously reconstruct full facial expressions by capturing the positions and shapes of the mouth, eyes, and eyebrows. Miniature cameras capture contours of the sides of the face, which are used to train a deep-learning model to predict facial expressions. An alternate embodiment of this technology includes a neck-mounted technology to continuously reconstruct facial expressions. Infrared cameras capture chin and face shapes underneath the neck, which are used to train a deep-learning model to predict facial expressions.
Additional embodiments include various camera types or acoustic imaging systems for the wearable devices. The acoustic imaging systems may comprise microphones and speakers. For example, the wearable devices may use the microphones and speakers to transmit and receive acoustical signals to determine skin deformation of a user. Full facial movements and expressions can be reconstructed from subtle skin deformations. The wearable devices may comprise ear mounted devices, eye mounted devices, or neck mounted devices.
The systems of this technology include wearable devices configured with miniature cameras or acoustical imaging systems in communication with computing devices to transmit images or acoustical signals from the cameras or acoustical imaging systems to a remote server, data acquisition system, or data processing device for facial expression reconstruction or to track facial movement. The wearable devices include headphone, earbud, necklace, neckband, glasses, virtual reality (“VR”) headsets, augmented reality (“AR”) headsets, and other form factors. Each of the form factors include miniature cameras or acoustical imaging systems and micro computing devices, such as Raspberry Pi™.
The exemplary headphone device includes two cameras and may include functionality with Bluetooth or Wi-Fi connectivity and speakers embedded within the earpieces of the headphone. The cameras are attached to the earpieces of the headphone device and are connected to a computing device to transmit acquired images to a remote server. The headphone device is configured to acquire images of the contours of a user's face by directing the cameras at adjustable angles and positions.
The exemplary earbud devices are constructed for wear as a left-side earbud and a right-side earbud. Both the left and right earbud devices include a camera to acquire images of the contours of a user's face. Each camera is connected to a computing device to transmit acquired images to a remote server.
The exemplary necklace device includes an infrared (“IR”) camera with an IR LED and an IR bandpass filter. The necklace device is configured to acquire images of a user's chin profile by directing the camera at the profile of a user's chin. The IR LED projects IR light onto a user's chin to enhance the quality of the image captured by the camera. The IR bandpass filter filters visible light such that the camera captures infrared light reflected by a user's skin. The camera is connected to a computing device to transmit acquired images to a remote server.
The exemplary neckband device includes two cameras fashioned to be positioned on the left and right sides of a user's neck. The neckband device includes IR LEDs and IR bandpass filters configured in proximity to each camera, similar to the necklace device. Each camera is connected to a computing device to transmit acquired images to a remote server.
Using one of the wearable devices, facial expressions can be reconstructed from images acquired from the cameras within the devices. A data training set is created using frontal view images of a user in a machine learning algorithm. Multiple frontal view images of a user are acquired with a variety of facial expressions. The frontal view images are transmitted to a data processing system to create the data training set.
The wearable device captures one or more facial digital image(s) and transmits the digital images to data acquisition computing devices connected to each of the cameras of the wearable device. The data acquisition computing devices subsequently transmit the images to a remote server or data processing system. The data processing system reconstructs facial expressions using the images. The data processing system pre-processes the images by reducing the image size, removing noise from the image, extracting skin color from the background of the image, and binarizing each image. In some embodiments, a wearable imaging system comprises one or more imaging sensor(s), wherein the one or more imaging sensor(s) are positioned to capture one or more image(s) with incomplete side contours of a face (for example, from ear(s) and/or from neck), a processor, and a non-transitory machine-readable storage medium comprising machine-readable instructions executed by the processor to extract a plurality of features (for example, landmarks or parameters) from the one or more image(s), compare the extracted features and/or changes of the extracted features with features from a ground truth, and output one or more recognition or prediction result(s), wherein the results comprise a word or a phrase spoken by a user, an emoji of a facial expression of a user, and/or a real-time avatar of a facial expression of a user.
The data processing system applies a deep-learning model to the pre-processed images for facial expression reconstruction. The reconstructed facial expressions may be used in applications such as silent speech and emoji recognition.
In an alternate embodiment, acoustic sensing technology can be used to reconstruct facial expressions using an array of microphones and speakers and deep-learning models. Acoustic sensing technology is small, lightweight and, therefore, suitable to mount to a variety of wearable devices. Facial movements may be tracked by detecting skin deformations observed from different positions on a user's head. When a user performs facial movements, the skin on the face deforms with unique patterns. Using acoustic sensing technology, skin deformations may be captured without capturing images of the entire face. The acoustic technology comprises a wearable device such as headphones, earbuds, necklaces, neckbands, and any other suitable form factors such that the microphones and speakers can be mounted or attached to the wearable device. The positions of the microphones and speakers may be adjustable to optimize the recording of reflected signals.
The acoustic wearable device actively sends signals from the speakers towards a user's face. The signals are reflected by the user's face and captured by the microphones on the acoustic wearable device. The signals are reflected differently back towards the microphones based on different facial expressions or movements. The acoustic wearable device sends the transmitted and reflected signals to a computing device for further processing. An echo profile is calculated based on the transmitted and reflected acoustic signals. The echo profile is calculated by calculating a cross-correlation between the received acoustic signals and the transmitted acoustic signals, which can display the deformations of the skin in temporal and spatial domains. The echo profile is input into a deep learning module to reconstruct the facial expressions of the user.
While acoustic sensing technology may be used to determine facial expressions by detecting skin deformations, acoustic sensing technology may also be used to directly determine other outputs associated with skin deformation without the need to first determine facial expressions. For example, the acoustic sensing technology may track blinking patterns and eyeball movements of a user. Blinking patterns and eyeball movements can be applied to the diagnosis and treatment processes of eye diseases. The acoustic sensing technology may detect movements of various parts of the face associated with speech, whether silent or voiced. While speaking, different words and/or phrases lead to subtle, yet distinct, skin deformations. The acoustic sensing technology can capture the skin deformation to recognize silent speech or to vocalize sound. The acoustic sensing technology may also be used to recognize and track physical activities. For example, while eating, a user opens the mouth, chews, and swallows. During each of those movements, the user's skin deforms in a certain way such that opening the mouth is distinguishable from chewing and from swallowing. The detected skin deformations may be used to determine the type and quantity of food consumed by a user. Similar to eating, a type and quantity of a consumed drink may be determined by skin deformation. The acoustic sensing technology may also be used to determine an emotional status of a user. An emotional status may be related to skin deformations. By detecting skin deformations, an emotional status can be determined and reported to corresponding applications.
These and other aspects, objects, features, and advantages of the disclosed technology will become apparent to those having ordinary skill in the art upon consideration of the following detailed description of illustrated examples.
Turning now to the drawings, in which like numerals indicate like (but not necessarily identical) elements throughout the figures, examples of the technology are described in detail.
In example embodiments, network 140 includes one or more wired or wireless telecommunications system(s) by which network devices may exchange data. For example, the network 140 may include one or more of a local area network (LAN), a wide area network (WAN), an intranet, an Internet, a storage area network (SAN), a personal area network (PAN), a metropolitan area network (MAN), a wireless local area network (WLAN), a virtual private network (VPN), a cellular or other mobile communication network, a BLUETOOTH® wireless technology connection, a near field communication (NFC) connection, any combination thereof, and any other appropriate architecture or system that facilitates the communication of signals, data, and/or messages.
Facial expression reconstruction system 100 comprises cameras 110 that may be any suitable sensors for capturing images. For example, cameras 110 may include depth cameras, red-green-blue (“RGB”) cameras, infrared (“IR”) sensors, acoustic cameras or sensors, speakers, microphones, thermal imaging sensors, charge-coupled devices (“CCDs”), complementary metal oxide semiconductor (“CMOS”) devices, active-pixel sensors, radar imaging sensors, fiber optics, and any other suitable image device technology.
The image frame resolution of the camera 110 may be defined by the number of pixels in a frame. The image resolution of the camera 110 may comprise any suitable resolution, including any of the following resolutions, without limitation: 32×24 pixels, 32×48 pixels, 48×64 pixels, 160×120 pixels, 249×250 pixels, 250×250 pixels, 320×240 pixels, 420×352 pixels, 480×320 pixels, 640×480 pixels, 720×480 pixels, 1280×720 pixels, 1440×1080 pixels, 1920×1080 pixels, 2048×1080 pixels, 3840×2160 pixels, 4096×2160 pixels, 7680×4320 pixels, or 15360×8640 pixels. The resolution of the camera 110 may comprise a resolution within a range defined by any two of the preceding pixel resolutions, for example, within a range from 32×24 pixels to 250×250 pixels (for example, 249×250 pixels). In some embodiments, at least one dimension (the height and/or the width) of the image resolution of the camera 110 can be any of the following, including but not limited to 8 pixels, 16 pixels, 24 pixels, 32 pixels, 48 pixels, 72 pixels, 96 pixels, 108 pixels, 128 pixels, 256 pixels, 360 pixels, 480 pixels, 720 pixels, 1080 pixels, 1280 pixels, 1536 pixels, or 2048 pixels. The camera 110 may have a pixel size smaller than 1 micron, 2 microns, 3 microns, 5 microns, 10 microns, 20 microns, and the like. The camera 110 may have a footprint (for example, a dimension in a plane parallel to a lens) on the order of 10 mm×10 mm, 8 mm×8 mm, 5 mm×5 mm, 4 mm×4 mm, 2 mm×2 mm, 1 mm×1 mm, 0.8 mm×0.8 mm, or smaller.
Each camera 110 is in communication with a computing device 120. Each camera 110 is configured to transmit images or data to a computing device 120. Each camera 110 may be communicatively coupled to a computing device 120. In an alternate example, each camera 110 may communicate wirelessly with a computing device 120, such as via near field communication (“NFC”) or other wireless communication technology, such as Bluetooth, Wi-Fi, infrared, or any other suitable technology.
Computing devices 120 comprise a central processing unit 121, a graphics processing unit 122, a memory 123, and a communication application 124. In an example, computing devices 120 may be small, single-board computing devices, such as a Raspberry Pi™ device. Computing devices 120 function to receive images or data from cameras 110 and to transmit the images via network 140 to a data processing system 130.
Computing device 120 comprises a central processing unit 121 configured to execute code or instructions to perform the operations and functionality described herein, manage request flow and address mappings, and perform calculations and generate commands. Central processing unit 121 may be configured to monitor and control the operation of the components in the computing device 120. Central processing unit 121 may be a general purpose processor, a processor core, a multiprocessor, a reconfigurable processor, a microcontroller, a digital signal processor (“DSP”), an application specific integrated circuit (“ASIC”), a graphics processing unit (“GPU”), a field programmable gate array (“FPGA”), a programmable logic device (“PLD”), a controller, a state machine, gated logic, discrete hardware components, any other processing unit, or any combination or multiplicity thereof. Central processing unit 121 may be a single processing unit, multiple processing units, a single processing core, multiple processing cores, special purpose processing cores, co-processors, or any combination thereof. Central processing unit 121 along with other components of the computing device 120 may be a virtualized computing machine executing within one or more other computing machine(s).
Computing device 120 comprises a graphics processing unit 122 that serves to accelerate rendering of graphics and images in two- and three-dimensional spaces. Graphics processing unit 122 can process multiple images, or data, simultaneously for use in machine learning and high-performance computing.
Computing device 120 comprises a memory 123. Memory 123 may include non-volatile memories, such as read-only memory (“ROM”), programmable read-only memory (“PROM”), erasable programmable read-only memory (“EPROM”), flash memory, or any other device capable of storing program instructions or data with or without applied power. Memory 123 may also include volatile memories, such as random access memory (“RAM”), static random access memory (“SRAM”), dynamic random access memory (“DRAM”), and synchronous dynamic random access memory (“SDRAM”). Other types of RAM also may be used to implement memory 123. Memory 123 may be implemented using a single memory module or multiple memory modules. While memory 123 is depicted as being part of the computing device 120, memory 123 may be separate from the computing device 120 without departing from the scope of the subject technology.
Computing device 120 comprises a communication application 124. Communication application 124 interacts with web servers or other computing devices or systems connected via network 140, including data processing system 130.
Facial expression reconstruction system 100 comprises a data processing system 130. Data processing system 130 serves to receive images or data from cameras 110 via computing devices 120 and network 140. Data processing system 130 comprises a central processing unit 131, a modeling application 132, a data storage unit 133, and a communication application 134.
Data processing system 130 comprises central processing unit 131 configured to execute code or instructions to perform the operations and functionality described herein, manage request flow and address mappings, and perform calculations and generate commands. Central processing unit 131 may be configured to monitor and control the operation of the components in the data processing system 130. Central processing unit 131 may be a general purpose processor, a processor core, a multiprocessor, a reconfigurable processor, a microcontroller, a digital signal processor (“DSP”), an application specific integrated circuit (“ASIC”), a graphics processing unit (“GPU”), a field programmable gate array (“FPGA”), a programmable logic device (“PLD”), a controller, a state machine, gated logic, discrete hardware components, any other processing unit, or any combination or multiplicity thereof. Central processing unit 131 may be a single processing unit, multiple processing units, a single processing core, multiple processing cores, special purpose processing cores, co-processors, or any combination thereof. Central processing unit 131 along with other components of the data processing system 130 may be a virtualized computing machine executing within one or more other computing machine(s).
Data processing system 130 comprises a modeling application 132. The modeling application 132 employs a variety of tools, applications, and devices for machine learning applications. The modeling application 132 may receive a continuous or periodic feed of images or data from one or more of the computing device(s) 120, the central processing unit 131, or the data storage unit 133. Collecting the data allows the modeling application 132 to leverage a rich dataset to use in the development of a training set of data or ground truth for further use in facial expression reconstructions. The modeling application 132 may use one or more machine learning algorithm(s) to develop facial expression reconstructions, such as a convolution neural network (“CNN”), Naïve Bayes Classifier, K Means Clustering, Support Vector Machine, Apriori, linear regression, logistic regression, decision trees, random forest, or any other suitable machine learning algorithm.
Data processing system 130 comprises a data storage unit 133. Data storage unit 133 may be accessible by the modeling application 132 and the communication application 134. The example data storage unit 133 can include one or more tangible computer-readable storage device(s). The data storage unit 133 can be within the data processing system 130 or can be logically coupled to the data processing system 130. For example, the data storage unit 133 can include on-board flash memory and/or one or more removable memory device(s) or removable flash memory. In certain embodiments, the data storage unit 133 may reside in a cloud-based computing system.
Data processing system 130 comprises a communication application 134. Communication application 134 interacts with web servers or other computing devices or systems connected via network 140, including the computing devices 120 and user computing device 150.
User computing device 150 is a computing device configured to receive and communicate results of facial expression reconstruction or facial movement. The results of the facial expression reconstruction may be displayed as a graphical representation of a facial expression, such as 630 of
Headphone device 210 is configured to acquire images of a left and a right contour of a user's face by directing cameras 110-1 and 110-2 at adjustable angles and positions. Each of the cameras 110 is independently adjustable by a first angle 214, a slide 216, and a second angle 218. First angle 214 adjusts a tilt of the position of the camera 110 relative to a plane parallel to an exterior surface of a user's ear. The first angle 214 may be adjusted so that the camera 110/earpiece 212 assembly is tilted closer to a user's ear, or the first angle 214 may be adjusted so that the camera 110 via the earpiece 212 is tilted farther away or offset from a user's ear. In an example, first angle 214 may be 0° indicating that the camera 110/earpiece 212 assembly is in a vertical position relative to a plane parallel to the exterior surface of a user's ear. First angle 214 may be adjusted by −10°, −20°, −30°, −40°, or any suitable angle measure relative to the plane parallel to the exterior surface of a user's ear, such that each camera 110 may be aligned with the left and right contours of a user's face.
Slide 216 adjusts a position of the camera 110 relative to the earpiece 212 in a direction that is perpendicular to a plane parallel to the exterior surface of a user's ear, in other words, the position of the camera 110 may change along the slide 216 while the position of the earpiece 212 is fixed. The position of the camera 110 via the slide 216 may be adjusted such that the earpiece 212 is in close contact with an exterior surface of a user's ear. The position of the camera 110 via the slide 216 may be adjusted such that the camera 110 is extended a distance away from the plane parallel to the exterior surface of a user's ear. In an example, the extended distance may be 1 cm, 2 cm, 3 cm, or any suitable distance away from the plane parallel to the exterior surface of a user's ear. The slide 216 positions the cameras 110 in a manner similar to positioning the earpieces 212 of the headphone device 210. In some embodiments, the position of an imaging sensor or a camera (for example, the optical center of the lens in the camera) is less than 5 cm, less than 4 cm, less than 3 cm, less than 2 cm, less than 1 cm, or less than 0.5 cm away from the surface plane of a user's ear. In some embodiments, the position of an imaging sensor or a camera projected to the closest skin surface is within the region of a user's ear or less than 5 cm, less than 4 cm, less than 3 cm, less than 2 cm, or less than 1 cm away from the nearest contour edge of a user's ear. In some embodiments, the system comprises at least one, at least two, or at least three imaging sensors (for example, cameras) located in different positions as described above. In some embodiments, an imaging sensor, such as a camera, is positioned below a chin of a user and less than 25 cm, less than 20 cm, less than 15 cm, less than 10 cm, or less than 5 cm away from the chin of a user, or 2-30 cm below and away from the chin of the user.
Second angle 218 adjusts a rotational position of the camera 110 along a horizontal axis through earpieces 212-1 and 212-2. In an example, second angle 218 adjusts an angular position of the camera 110 relative to the horizontal axis while the position of the earpiece 212 remains unchanged. In an alternate example, second angle 218 adjusts an angular position of the camera 110/earpiece 212 assembly. Relative to a left or right contour of a user's face, second angle 218 may be 0° indicating that the camera 110 is in a horizontal position. A second angle 218 of 10° indicates that the camera 110 is directed 10° upwards. A second angle 218 of −10° indicates that the camera 110 is directed 10° downwards. Any suitable measure of second angle 218 may be used to align the cameras 110 with the contour of a user's face.
Each of the cameras 110 is independently adjustable by the first angle 214 and the second angle 218, by any suitable mechanism, for example, by mounting the cameras 110 to the headphone device 210 via rotational positioning devices allowing incremental changes of the direction of the cameras 110.
The camera 110 position for each earbud device 220 may be controlled by twisting and/or rotating the earbud device 220 in the user's ear. The earbud device 220 may be rotated such that the camera 110 is angled closer to the contour of a user's face. In an alternate example, the camera 110 may be attached to the earbud device 220 such that the camera 110 may be positioned independently of the earbud device 220. The camera 110 may be attached to the earbud device 220 with a ball and socket joint or any other suitable attachment method such that the position of the camera 110 may be adjusted independently of the earbud device 220.
Also, in
The computing devices 120, 130, and 150 and any other network computing devices or other computing machines associated with the technology presented herein may be any type of computing machine, such as, but not limited to, those discussed in more detail with respect to
Furthermore, any functions, applications, or components associated with any of these computing machines, such as those described herein or others (for example, scripts, web content, software, firmware, hardware, or modules) associated with the technology presented herein may be any of the components discussed in more detail with respect to
The network connections illustrated are examples and other means of establishing a communications link between the computers and devices can be used. The computing machines discussed herein may communicate with one another, as well as with other computing machines or communication systems over one or more network(s). Each network may include various types of data or communications networks, including any of the network technology discussed with respect to
Additionally, those having ordinary skill in the art having the benefit of the present disclosure will appreciate that the devices illustrated in the figures may have any of several other suitable computer system configurations.
The methods illustrated in
The deep-learning model process 400 comprises an image processing phase 410 and a regression phase 450. In the image processing phase 410, image processing is divided into two parallel paths. Path A is directed to processing an image of a right facial contour, and path B is directed to processing an image of a left facial contour. The right facial contour images and left facial contour images are processed independently of each other and combined in block 454 of the deep-learning model process 400.
In block 412, the data processing system 130 receives an input of an image of a right facial contour and an input of an image of a left facial contour. Example right facial contour images are depicted in row 640 of
From block 414-n, the process proceeds to block 416. Block 416 is a pooling layer. The pooling layer may be a Max Pooling that returns a maximum value from the image. The pooling layer may be an Average Pooling that returns an average of all of the values from the image. The pooling layer of block 416 outputs a vector representation of each input image (in other words, a right image vector for the right facial contour and a left image vector for the left facial contour).
The right image vector and the left image vector are received as inputs to the regression phase 450. Within the regression phase 450, the right image vector and the left image vector are inputs into two fully connected layers 452 with a rectified linear unit (“ReLU”) between the fully connected layers 452. The fully connected layer 452 learns facial landmarks of the right image vector and the left image vector based on facial landmarks in a training data set, or ground truth set, of facial expressions. The fully connected layer 452 compares features of the right image vector to right side facial landmarks in the training data set of facial expressions to match the features of the right image vector to a right side facial expression in the training data set. Similarly, the fully connected layer 452 compares features of the left image vector to left side facial landmarks in the training data set of facial expressions to match the features of the left image vector to a left side facial expression in the training data set. The fully connected layer 452 outputs landmarks of both the right and left side of a user's face. The output landmarks are inputs to a matching module 454. The matching module 454 concatenates the landmarks from the right and left sides by aligning key landmarks that are present in both the right and left sides using translation and scaling. The matching module 454 outputs the final facial expression reconstruction, or predicted facial expression, as a set of facial landmarks as the reconstruction result in block 456. Examples of predicted facial expressions as a set of facial landmarks are depicted in row 630 of
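As a non-limiting illustration, the following Python (PyTorch) sketch shows one possible form of the two-path image processing phase and the regression phase described above. The layer sizes, the feature dimension, and the assumption of 21 landmarks per side (42 in total) are illustrative choices for this sketch and not necessarily those of deep-learning model process 400.

```python
import torch
import torch.nn as nn

class ContourEncoder(nn.Module):
    """Encodes one binarized facial-contour image (path A or path B) into a vector."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # pooling layer producing the image vector
        )
        self.fc = nn.Linear(32, out_dim)

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))  # right or left image vector

class LandmarkRegressor(nn.Module):
    """Regression phase sketch: two fully connected layers with a ReLU in between,
    outputting (x, y) landmark coordinates for one side of the face."""
    def __init__(self, in_dim=128, num_landmarks=21):
        super().__init__()
        self.num_landmarks = num_landmarks
        self.head = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, num_landmarks * 2),
        )

    def forward(self, v):
        return self.head(v).view(-1, self.num_landmarks, 2)

# One encoder and regressor per side; a matching step would then concatenate the
# right- and left-side landmarks after translation and scaling alignment.
right_encoder, left_encoder = ContourEncoder(), ContourEncoder()
right_regressor, left_regressor = LandmarkRegressor(), LandmarkRegressor()
```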
After the sequential placement of frames 530-1 through 530-n, the process 500 proceeds to block 540 where the frames are inputs into a Bidirectional Long Short-Term Memory (“BLSTM”) model for classification. Block 540 depicts a two-layer BLSTM model. However, any suitable number of layers may be used. The blocks 541-1 through 541-n depicted within the BLSTM model 540 are recurrently connected memory blocks. Each of the blocks 541 has feedback connections such that sequences of data can be processed as opposed to single images or data points. As depicted in block 540, blocks 541-1 through 541-n comprise bidirectional connections within each of the blocks 541 in each layer of block 540. Processing sequences of data allows for speech recognition, emoji recognition, real-time avatar facial expressions, or other suitable applications. The output from block 540 is a vector representation of the input frames.
In block 550, the vector representation is received by a fully connected layer comprising a SoftMax function. The SoftMax function transforms the vector representation into a probability distribution to predict a facial event. The output from block 550 is an encoding of a facial event class. The facial event classification may be a facial expression. In the example of silent speech recognition, the facial event classification may be the pronunciation of a specific word or phrase, such as “hello” or “how are you.” In the example of emoji recognition, the facial event classification may be an emoji indicating a smiling face, a frowning face, a crying face, or any other suitable emoji associated with a facial expression. In the example of real-time avatar facial expressions, the facial event classification may be a three-dimensional visualization of a facial expression.
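A minimal sketch of such a classifier is shown below; the per-frame feature size (assumed here to be 42 landmarks with two coordinates each), the hidden size, and the number of facial event classes are illustrative assumptions rather than parameters of the described embodiment.

```python
import torch
import torch.nn as nn

class FacialEventClassifier(nn.Module):
    """Two-layer BLSTM over a sequence of landmark frames, followed by a fully
    connected layer with a SoftMax over facial event classes."""
    def __init__(self, frame_dim=84, hidden=64, num_classes=10):
        super().__init__()
        self.blstm = nn.LSTM(frame_dim, hidden, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, frames):                 # frames: (batch, time, frame_dim)
        out, _ = self.blstm(frames)
        logits = self.fc(out[:, -1, :])        # summary taken from the last time step
        return torch.softmax(logits, dim=-1)   # probability distribution over events
```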
Row 640 depicts right facial contour images 640-1 through 640-n captured by a head-mounted, wearable device, such as headphone device 210 or earbud device 220 described herein with reference to
The right facial contour images 640, the left facial contour images 650, and the landmark training data images 620 are used in the deep-learning model process 400 to construct the predicted facial expressions in row 630. The predicted facial expressions 630 are depicted as landmark images in images 630-1 through 630-n. The predicted facial expressions 630 are illustrated in
In block 920, the frontal view digital images are transmitted to data processing system 130. The frontal view digital images may be transmitted to the data processing system 130 by a user's computing device, such as user computing device 150, or any other suitable device to transmit data.
In block 930, the data processing system 130 receives the frontal view digital images.
In block 940, the data processing system 130 extracts facial landmark positions. To extract the facial landmark positions, the data processing system 130 uses a computer vision library as a ground truth acquisition method. In an example, the computer vision library is a Dlib library. The computer vision library is used to extract key feature points or landmarks from the frontal view digital images. The computer vision library may extract 42, 68, or any suitable number of key feature points.
In block 950, the data processing system 130 aligns the extracted landmark positions using an affine transformation. Aligning the extracted landmark positions accounts for variations in a user's head position in the frontal view digital images. When a user acquires the frontal view digital images as described in block 910, the same facial expressions may vary if the user slightly changes his face orientation. To align the extracted landmark positions, a set of landmarks are selected whose relative positions change very little when making facial expressions. For example, the selected landmarks may be one or more of a right canthus, left canthus, or apex of the nose. The selected landmarks are used to calculate an affine matrix for each frontal view digital image. The landmark positions are aligned to the same range to reduce the influence of head position change when the frontal view digital images are acquired.
In block 960, the data processing system 130 creates the data training set based on the aligned facial landmark positions. In an example, after the extracted landmark positions are aligned, the data processing system 130 selects the most informative feature points from the extracted landmark positions. In an example, the Dlib library may extract 68 facial landmarks from each of the frontal view digital images. When making facial expressions, changes mainly occur in the areas around the mouth, eyes, and eyebrows. The less informative feature points may be removed, leaving a smaller set of facial landmarks, such as 42 facial landmarks. Any suitable number of facial landmarks may be used. The data training set is the set of the most informative landmark positions for each facial expression from each of the frontal view digital images. Example data training set images are depicted in row 620 of
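One possible implementation of the landmark extraction of block 940 and the affine alignment of block 950, using the Dlib library named above, is sketched below in Python; the predictor model file and the indices chosen for the right canthus, left canthus, and apex of the nose are assumptions made for illustration.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# Path to the standard 68-point shape predictor model (assumed available locally).
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_landmarks(frontal_image):
    """Extract 68 (x, y) facial landmarks from a frontal view digital image."""
    gray = cv2.cvtColor(frontal_image, cv2.COLOR_BGR2GRAY)
    face = detector(gray)[0]
    shape = predictor(gray, face)
    return np.array([[p.x, p.y] for p in shape.parts()], dtype=np.float32)

def align_landmarks(landmarks, reference):
    """Align landmarks to a reference expression with an affine transform estimated
    from stable points (canthi and nose apex; the indices below are assumptions)."""
    stable = [36, 45, 30]  # right canthus, left canthus, apex of the nose
    matrix, _ = cv2.estimateAffinePartial2D(landmarks[stable], reference[stable])
    ones = np.ones((len(landmarks), 1), dtype=np.float32)
    return np.hstack([landmarks, ones]) @ matrix.T
```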
From block 960, the method 810 returns to block 820 of
Referring back to
In an alternate example, the wearable device may be a neck-mounted device such as necklace device 1210 or neckband device 1220, described hereinafter with reference to
In block 830, the wearable device transmits the one or more facial digital image(s) to one or more data acquisition computing device(s) 120. As depicted and described in reference to
In block 840, the one or more data acquisition computing device(s) 120 transmit the one or more facial digital image(s) to the data processing system 130.
In block 850, the data processing system 130 receives the one or more facial digital image(s).
In block 860, the data processing system 130 reconstructs facial expressions using the one or more facial digital image(s). Block 860 is described in greater detail herein with reference to method 860 of
In block 1010, the data processing system 130 receives the one or more facial digital image(s).
In block 1020, the data processing system 130 creates one or more pair(s) of synchronized facial digital images from the one or more facial digital image(s). In the example where the one or more facial digital image(s) are acquired from a head-mounted wearable device, the wearable device captures right facial contour images and left facial contour images. To accurately reconstruct a facial expression of a user, the data processing system 130 synchronizes the images from the left camera 110-1 and the right camera 110-2 such that each pair of right and left facial contour images represents a particular facial expression.
In block 1030, the data processing system 130 pre-processes each pair of synchronized facial digital images. Block 1030 is described in greater detail herein with reference to method 1030 of
In block 1120, the data processing system 130 extracts skin color from the background of each converted pair of facial digital images. In an example, the skin color is extracted using Otsu's thresholding method. Otsu's thresholding method determines whether pixels in an image fall into a foreground or a background. In the current example, the foreground represents a facial contour of each facial digital image, and the background represents an area of the image outside of the facial contour.
In block 1130, the data processing system 130 binarizes each facial digital image after the extraction of the skin color from the background. Image binarization is the process of taking the image in YCrCb color space and converting it to a black and white image. The binarization of the image allows for an object to be extracted from an image, which in this example is a facial contour.
In block 1140, the data processing system 130 filters the binarized digital images to remove noise from the images. Filtering the binarized digital images produces a smoother image to assist in more accurate facial expression reconstructions.
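A condensed Python sketch of this pre-processing (size reduction, skin extraction with Otsu's thresholding, binarization, and noise filtering) is shown below; the target image size, the use of the Cr channel for thresholding, and the median filter kernel are illustrative assumptions rather than requirements of the described method.

```python
import cv2

def preprocess_contour_image(image, size=(250, 250)):
    """Pre-process one facial digital image: resize, extract skin from the background
    with Otsu's thresholding, binarize, and filter noise (sketch only)."""
    image = cv2.resize(image, size)                       # reduce the image size
    ycrcb = cv2.cvtColor(image, cv2.COLOR_BGR2YCrCb)      # YCrCb color space
    cr = ycrcb[:, :, 1]                                   # chroma channel carrying skin tone
    _, binary = cv2.threshold(cr, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # foreground = contour
    return cv2.medianBlur(binary, 5)                      # smooth the binarized image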
From block 1140, the method 1030 returns to block 1040 of
From block 1040, the method 860 returns to block 870 of
From block 870 of
Necklace device 1210 comprises a chain 1218 or other suitable device for securing the necklace device 1210 about the neck of a user. In an alternate example, necklace device 1210 may attach to a user's clothing instead of being secured about the neck of a user. In the current example, necklace device 1210 may be attached to a user's clothing underneath the user's chin. In an alternate example, multiple necklace devices 1210 may be attached to a user's clothing to capture camera 110 images from multiple viewpoints. In the current example, a necklace device 1210 may be attached on a user's clothing close to each shoulder of the user. The necklace device 1210 may comprise a clip, a pin, a clasp, or any other suitable device to attach necklace device 1210 to a user's clothing.
The camera 110 is in communication with a computing device 120, as previously described with respect to
In
The methods illustrated in
In block 1410, the data processing system 130 receives IR images from either the necklace device 1210 or the neckband device 1220. Example necklace 1210 IR images are depicted at 1410-1. Example neckband 1220 IR images are depicted at 1410-2 and 1410-3 as the neckband acquires both a right and left side IR image of a user's chin profile. Other example IR images are depicted in row 1540A of
In block 1420, the data processing system 130 pre-processes the IR images. Pre-processing the IR images is described in greater detail in reference to method 1720 of
In block 1425, the data processing system 130 duplicates the pre-processed IR images of the necklace device 1210 into three channels to improve the expressiveness of the model and its ability to extract features. As the neckband device 1220 already provides two pre-processed images, those images are not duplicated into additional channels.
The pre-processed IR images are input into an image processing phase of the deep-learning model process 1400 depicted at block 1430. Block 1430 comprises convolution layers, normalization layers, and an averaging pooling layer. The processing of block 1430 is described in greater detail herein with reference to blocks 414 and 416 of
The vector representations of each of the pre-processed IR images are input into a regression phase 1440 of the deep-learning model process 1400. The architecture of the regression phase 1440 is similar to the architecture of regression phase 450, previously described herein with reference to
In block 1450, the data processing system 130 combines the blend shapes with three-dimensional angles of rotation of the user's head. In an example, the three-dimensional angles of rotation are represented by Euler's angles of roll, yaw, and pitch. In block 1460, the final facial expression reconstruction is output as a three-dimensional image. Example three-dimensional facial expression reconstructions are depicted in row 1530A of
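For illustration only, combining predicted blend shape weights with Euler angles of head rotation might be expressed as in the following Python sketch; the axis convention, the mesh layout, and the function names are assumptions of the sketch rather than the specific operations of blocks 1450 and 1460.

```python
import numpy as np

def euler_to_rotation(roll, yaw, pitch):
    """Rotation matrix from Euler angles in radians (axis order is an assumption)."""
    rx = np.array([[1, 0, 0],
                   [0, np.cos(pitch), -np.sin(pitch)],
                   [0, np.sin(pitch),  np.cos(pitch)]])
    ry = np.array([[ np.cos(yaw), 0, np.sin(yaw)],
                   [0, 1, 0],
                   [-np.sin(yaw), 0, np.cos(yaw)]])
    rz = np.array([[np.cos(roll), -np.sin(roll), 0],
                   [np.sin(roll),  np.cos(roll), 0],
                   [0, 0, 1]])
    return rz @ ry @ rx

def reconstruct_head(neutral_mesh, blendshape_deltas, weights, roll, yaw, pitch):
    """Blend predicted blend shape weights into a 3D mesh (assumed N x 3) and rotate
    it by the predicted head pose."""
    mesh = neutral_mesh + np.tensordot(weights, blendshape_deltas, axes=1)
    return mesh @ euler_to_rotation(roll, yaw, pitch).T
```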
Row 1510A illustrates frontal view camera images of a user. To construct a training data set or ground truth, frontal camera images of a user are acquired with the user making various facial expressions, as depicted in images 1510A-1 through 1510A-n. The method to create the training data set from the frontal view camera images is described herein in greater detail with reference to method 810′ of
Row 1540A depicts IR images 1540A-1 through 1540A-n captured by a neck-mounted wearable device, such as necklace device 1210 or neckband device 1220 described herein in greater detail in reference to
In this example, the images captured by the cameras of the neck-mounted devices can be processed for facial reconstruction similarly to the methods discussed previously with reference to
Blocks 910, 920, and 930 of
In block 1640, the data processing system 130 extracts a set of facial geometric features from the one or more frontal view digital image(s). In an example, the one or more frontal view digital image(s) are three-dimensional digital images captured from a camera that provides depth data in real time along with visual information. Example frontal view digital images are depicted in rows 1510A and 1510B, respectively, of
In block 1650, the data processing system 130 compares the extracted features to pre-defined shape parameters. In an example, the AR application comprises pre-defined blend shapes as templates for complex facial animations. In the example, the AR application comprises blend shapes with features for left and right eyes, mouth and jaw movement, eyebrows, cheeks, nose, tongue, and any other suitable facial features.
In block 1660, the data processing system 130 creates a data training set based on the comparison of the extracted features to the pre-defined shape parameters. The data training set comprises a blend shape with an Euler angle of head rotation as depicted in block 1520A of
From block 1660, the method 810′ returns to block 820 of
Block 1010 of
In block 1720, the data processing system 130 pre-processes the one or more facial digital image(s). Example facial digital images are depicted in rows 1540A and 1540B of
In block 1810, the data processing system 130 converts each of the one or more digital facial image(s) into gray-scale digital facial images. The one or more digital facial image(s) are converted to gray-scale to remove any potential color variance. As the IR bandpass filter 1214 only allows monochrome light into the camera 110, any color present in the one or more digital facial image(s) does not represent details related to the facial expression of the user.
In block 1820, the data processing system 130 separates the facial image from the background image in each of the gray-scale digital facial images. Using the IR technology previously discussed in reference to
In block 1830, the data processing system 130 applies data augmentation to each of the separated facial images. As a user wears a necklace device 1210 or a neckband device 1220 and performs activities, the position and angles of the cameras 110 of the necklace device 1210 and neckband device 1220 may not be constant. To mitigate this issue, a probability of 60% is set for applying three types of image transformations that camera shifting can cause: translation, rotation, and scaling of the images. In an alternate example, any suitable probability may be used. Three Gaussian models are deployed to generate the parameters for the translation (μ=0, σ²=30), rotation (μ=0, σ²=10), and scaling (μ=1, σ²=0.2) applied to the synthesized training data. The data augmentation is performed on all the images in the training dataset during each training epoch before feeding the images into deep-learning model process 1400. Data augmentation improves the deep-learning model's ability to confront camera shifting and avoid over-fitting during model training.
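A Python sketch of such an augmentation step, using the Gaussian parameters given above, is shown below; the assumption that the translation parameter is expressed in pixels, and the use of OpenCV, are illustrative choices of the sketch.

```python
import cv2
import numpy as np

def augment_ir_image(image, prob=0.6):
    """Randomly translate, rotate, and scale a training image to mimic camera
    shifting; applied with 60% probability during each training epoch (sketch)."""
    if np.random.rand() > prob:
        return image
    h, w = image.shape[:2]
    tx, ty = np.random.normal(0, np.sqrt(30), size=2)   # translation: mu=0, sigma^2=30
    angle = np.random.normal(0, np.sqrt(10))            # rotation: mu=0, sigma^2=10
    scale = np.random.normal(1, np.sqrt(0.2))           # scaling: mu=1, sigma^2=0.2
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    matrix[:, 2] += (tx, ty)
    return cv2.warpAffine(image, matrix, (w, h))
```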
From block 1830, the method 1720 returns to block 1730 of
In block 1730, the data processing system 130 applies the deep-learning model to each of the pre-processed one or more facial digital image(s) to generate facial expression reconstruction. The deep-learning model for facial expression reconstruction was described in the deep-learning model process 1400 herein with reference to
From block 1730, the method 860′ returns to block 870 of
Cameras 110 were described herein with reference to
The embodiments herein describe wearable devices in the form of head-mounted devices, such as headphone device 210 and earbud devices 220, and neck-mounted devices, such as necklace device 1210 and neckband device 1220. Any suitable wearable device may be used such that the one or more camera(s) may be directed towards a user's face including, but not limited to, glasses, smart glasses, a visor, a hat, a helmet, headgear, or a virtual reality (“VR”) headset.
The embodiments herein describe head-mounted and neck-mounted wearable devices comprising cameras 110 with adjustable positions and angles. Each camera 110 may be positioned such that at least one of a buccal region, a zygomatic region, and/or a temporal region of a user's face is included in the field of view of the camera 110.
In an alternate embodiment, the wearable devices previously described herein may be configured for use in a hand-face touching detection system to recognize and predict a time and position that a user's hand touches the user's face. An important step in reducing the risk of infection is avoiding touching the face because a virus, such as COVID-19, may enter the mucous membranes of the eyes, nose, and/or mouth. Touching different areas of the face carries different health related risks. For example, contacting a mucous membrane may introduce a higher risk of transmitting a virus than touching non-mucous areas such as the chin and cheek. In addition, the frequency of touching the face may be an indicator regarding the stress level of a person. Understanding how people touch their face may alleviate multiple health challenges. Accurately recognizing where the hand touches the face is an important step towards alleviating health risks introduced by hand-face touching behaviors. In order to implement behavior intervention technologies, the hand-face touching detection system predicts the behavior in advance rather than simply detecting the touching behavior.
A data training set is created similar to the data training sets described herein with reference to
The frontal view images of the user are sent to a server, such as data processing system 130, to create the data training set, as previously described herein with reference to
In an example, necklace device 1210 may be positioned on a user to acquire images of the user's facial area, as previously described herein, to also include a user's hand if positioned or in motion near the user's face. Any suitable wearable device may be used to acquire the images of the user's face. To predict the time and location of a hand-face touch, camera images are monitored over a period of time. Camera images are sent to a server, such as data processing system 130, for processing. The data processing system receives the hand/facial images and reconstructs the position of the user's hand relative to the user's face using the data training set and a deep-learning model, such as the models previously described herein with reference to
In an alternate embodiment, acoustic sensing technology can be used to reconstruct facial expressions using an array of microphones and speakers and deep-learning models. The acoustic technology comprises a wearable device such as headphones, earbuds, necklaces, neckbands, and any other suitable form factors such that the microphones and speakers can be mounted or attached to the wearable device. In an example, the microphones are Micro-Electro-Mechanical System (“MEMS”) microphones that are placed on a printed circuit board (“PCB”). In an example, each microphone/speaker assembly may comprise four microphones and one speaker. Any suitable number of microphones and speakers may be included. The positions of the microphones and speakers may be adjustable to optimize the recording of reflected signals.
The acoustic wearable device actively sends signals from the speakers towards a user's face. In an example, the speakers transmit inaudible acoustic signals within a frequency range of 16 kHz to 24 kHz towards the user's face. Any suitable frequency range may be used. The signals are reflected by the user's face and captured by the microphones on the acoustic wearable device. The signals are reflected differently back towards the microphones based on different facial expressions or movements. A Channel Impulse Response (“CIR”) is calculated based on the acoustic signals received at each microphone.
Each microphone is in communication with a computing device, such as computing device 120, such that the CIR images/data can be transmitted for processing to a server, such as data processing system 130.
The data processing system creates a data training set using frontal view images of a user in a machine learning algorithm, previously described herein with reference to
The wearable devices and systems described herein may be used in applications such as silent speech recognition, emoji input, and real-time avatar facial expressions. Silent speech recognition is a method to recognize speech when vocalization is inappropriate, background noise is excessive, or vocalizing speech is challenging due to a disability. The data training set for silent speech recognition comprises a set of frontal view facial images directed to the utterance of a word or phrase. To recognize silent speech, the wearable device, such as necklace device 1210, captures a series of facial movements from underneath the chin of a user while the user silently utters words or commands and transfers the series of digital facial images to the data processing system 130 for facial expression reconstruction. The results of the facial expression reconstruction, previously described herein with reference to block 456 or block 1460, are used as inputs to the classifier process 500, previously described herein with reference to
Microphone 1912 is a small form factor microphone. In an example, microphone 1912 is a Micro-Electro-Mechanical System (“MEMS”) microphone. Microphone 1912 may be a microphone chip, a silicon microphone, or any other suitable microphone. In an example, microphone 1912 is configured to receive reflected signals transmitted by speaker 1911. As depicted in
Bluetooth module 1913 may be a Bluetooth low energy (“BLE”) module. Bluetooth module 1913 may be a System-on-Chip (“SoC”) module. In an example, Bluetooth module 1913 may comprise a microcontroller unit (“MCU”) with memory. In an example, the memory may be an on-board Secure Digital (“SD”) card. The SD card may function to save acoustic data on the MCU. Bluetooth module 1913 may be configured to control a battery (not depicted) to provide power to the acoustic sensing system 1910. In an example, the battery may be a lithium polymer (“LiPo”) battery. Bluetooth module 1913 may be configured to power and control speaker 1911 transmissions, microphone 1912 signal receptions, and data transmissions external to a wearable device, for example transmissions to computing device 1920. As depicted in
In an example, a wearable device system may comprise one or more speaker(s) 1911 and one or more microphone(s) 1912, with a pair of a speaker 1911 and a microphone 1912 affixed on a PCB 1914. Any suitable number of speakers 1911 and microphones 1912 may be included on a particular wearable device system. The positions and/or orientations of the speakers 1911 and microphones 1912 may be adjustable to optimize the recording of reflected signals.
Returning to
As described herein,
In block 2510, a wearable system transmits at least one acoustic signal. The wearable system may be a system as described herein with reference to glasses device 2210, headset device 2310, headphone device 2410. In an example, the wearable system may include other embodiments such as ear buds, ear pods, ITE headphones, over-the-ear headphones, OTE headphones, glasses, smart glasses, a visor, a hat, a helmet, headgear, a virtual reality headset, another head-borne device, a necklace, a neckband, a garment-attachable system, or any other type of device or system to which at least one acoustic sensor system 1910 may be affixed. In an example, the wearable system is positioned on a wearer such that the distance from the wearable system to the wearer's face is less than 10 cm, 11 cm, 12 cm, 13 cm, 14 cm, 15 cm, or any suitable distance. The at least one acoustic signal is transmitted by a speaker 1911 of at least one acoustic sensor system 1910. Acoustic sensor system 1910 may be configured to transmit signals in a frequency range of 16-24 kHz. Any suitable frequency range may be used. Acoustic sensor system 1910 may transmit signals that are FMCW transmissions, CIR transmissions, or any other suitable type of acoustic signal. In an example, the transmitted signals have an associated sample length. For example, the sample length may be set to 600 corresponding to a sample length of 0.012 seconds. In this example, approximately 83.3 frames may be collected per second. Any suitable sample length may be used. In an example, acoustic sensor system 1910 transmits inaudible acoustic signals within the frequency range of 16 kHz to 24 kHz towards a wearer's face. In an example, acoustic sensor system 1910 may be positioned on the wearable system to transmit acoustic signals towards a first facial feature on a first side of a sagittal plane of a wearer and to transmit acoustic signals towards a second facial feature on a second side of the sagittal plane of a wearer. In an example, acoustic sensor system 1910 may be positioned on the wearable system to transmit acoustic signals towards an underside contour of a chin of a wearer.
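For illustration, one way to generate such an FMCW frame is sketched below in Python; the 50 kHz sampling rate is inferred from 600 samples per 0.012-second frame, and that rate, together with the use of scipy, is an assumption of the sketch rather than a limitation of the wearable system.

```python
import numpy as np
from scipy.signal import chirp

def make_fmcw_frame(fs=50_000, n_samples=600, f0=16_000, f1=24_000):
    """One FMCW chirp frame sweeping 16-24 kHz over 600 samples (0.012 s)."""
    t = np.arange(n_samples) / fs
    return chirp(t, f0=f0, t1=t[-1], f1=f1, method="linear")

# Roughly 83.3 such frames may be transmitted per second (1 / 0.012 s).
frame = make_fmcw_frame()
```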
In block 2515, the wearable system receives at least one reflected acoustic signal. Subsequent to transmitting the at least one acoustic signal toward a wearer's face as described herein with reference to block 2510, the signal is reflected by the wearer's face. The at least one reflected acoustic signal is received by microphone 1912 of the at least one acoustic sensor system 1910. The at least one transmitted signal is reflected differently to microphone 1912 based on skin deformations associated with different facial movements.
In block 2520, the wearable system transmits the at least one transmitted acoustic signal and the at least one reflected acoustic signal to data processing computing device 1920. In an example, the at least one transmitted acoustic signal and the at least one reflected acoustic signal are transmitted in frames. The at least one transmitted acoustic signal and the at least one reflected acoustic signal are transmitted by the Bluetooth module 1913 of the at least one acoustic sensor system 1910.
In block 2525, data processing computing device 1920 receives the at least one transmitted acoustic signal and the at least one reflected acoustic signal.
In block 2530, data processing computing device 1920 filters the received acoustic signals. In an example, the at least one transmitted acoustic signal and the at least one reflected acoustic signal are filtered to remove noise outside of a target frequency range. In an example, the target frequency range is 15.5 kHz to 20.5 kHz. Any suitable target frequency range may be used. In an example, the at least one transmitted acoustic signal and the at least one reflected acoustic signal are filtered using a Butterworth band-pass filter. Any suitable filter may be used.
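A minimal filtering sketch is shown below, assuming the 50 kHz sample rate noted above and a zero-phase Butterworth band-pass implemented with SciPy; the filter order is an illustrative choice.

```python
import numpy as np
from scipy.signal import butter, filtfilt

SAMPLE_RATE = 50_000  # assumed device sample rate, as above

def bandpass(signal: np.ndarray, low_hz: float = 15_500, high_hz: float = 20_500,
             fs: int = SAMPLE_RATE, order: int = 5) -> np.ndarray:
    """Zero-phase Butterworth band-pass to suppress out-of-band noise."""
    nyq = fs / 2
    b, a = butter(order, [low_hz / nyq, high_hz / nyq], btype="band")
    return filtfilt(b, a, signal)
```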
In block 2535, data processing computing device 1920 constructs a profile using the filtered acoustic signals. In an example, the profile is an echo profile. The echo profile is determined by calculating a cross-correlation between the filtered at least one transmitted acoustic signal and the filtered at least one reflected acoustic signal. The echo profile depicts deformations of the skin of the wearer's face in temporal and spatial domains. In an example, a differential echo profile may be calculated. A differential echo profile is calculated by subtracting the echo profiles of two adjacent echo frames. The differential echo profile removes static objects that may be present in the echo profile. Further, because the wearable system is positioned less than approximately 10-15 cm from a wearer's face, echo profile components corresponding to distances outside that range may also be removed. For example, echo profile components at distances greater than ±16 cm, ±17 cm, ±18 cm, or any other suitable distance may be removed. In an example, the transmitted acoustic signal is an FMCW signal from which the echo profile is constructed. In alternate examples, the echo profile may be constructed from a CIR signal encoded with a global system for mobile communications (“GSM”) training sequence, a Zadoff-Chu (“ZC”) sequence, or a Barker sequence, or by using angle of arrival (“AoA”), Doppler effect, or phase change detection techniques.
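The following sketch illustrates one way the echo profile and differential echo profile might be computed for the FMCW case, assuming the 50 kHz sample rate noted above and a ±50-lag window (roughly ±17 cm of one-way distance at a sound speed of approximately 343 m/s); the window size and helper names are illustrative.

```python
import numpy as np
from scipy.signal import correlate

MAX_RANGE_SAMPLES = 50  # at 50 kHz, each lag is ~3.4 mm of one-way distance,
                        # so 50 lags span roughly +/-17 cm (assumption)

def echo_profile(tx_frame: np.ndarray, rx_frame: np.ndarray) -> np.ndarray:
    """Cross-correlate one transmitted frame with its received echo."""
    cc = correlate(rx_frame, tx_frame, mode="full")
    center = len(tx_frame) - 1                     # zero-lag index
    return cc[center - MAX_RANGE_SAMPLES: center + MAX_RANGE_SAMPLES + 1]

def differential_echo_profile(frames_tx: np.ndarray,
                              frames_rx: np.ndarray) -> np.ndarray:
    """Stack per-frame echo profiles and subtract adjacent frames to
    suppress static echoes."""
    profiles = np.stack([echo_profile(t, r) for t, r in zip(frames_tx, frames_rx)])
    return np.diff(profiles, axis=0)               # frame-to-frame difference
```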
In block 2540, data processing computing device 1920 applies a deep learning model to the constructed profile. The deep learning model for facial expression reconstruction was previously described herein.
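The specific network architecture is not fixed at this point in the disclosure. Purely as an illustration, the sketch below shows a small convolutional network that maps a window of differential echo profiles to scores over a set of candidate deformations; all layer sizes and names are assumptions.

```python
import torch
import torch.nn as nn

class EchoToDeformation(nn.Module):
    """Illustrative CNN mapping a window of differential echo profiles to
    scores over candidate skin deformations (architecture assumed)."""

    def __init__(self, n_deformations: int, range_bins: int = 101, window: int = 32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.head = nn.Linear(32 * 4 * 4, n_deformations)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, window, range_bins) differential echo profile window
        return self.head(self.features(x).flatten(1))

# Example: scores = EchoToDeformation(n_deformations=20)(torch.randn(1, 1, 32, 101))
```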
In block 2545, data processing computing device 1920 assigns a deformation to the constructed profile based on the results of the deep learning model. In an example, the assigned deformation has a predetermined degree of correspondence to a selected one of a plurality of deformations in the deep learning model.
In block 2550, the data processing computing device 1920 communicates a facial output based on the assigned deformation. In an example, the facial output comprises one of a facial movement, an avatar movement associated with the facial movement, a speech recognition, text associated with the speech recognition, a physical activity, an emotional status, a two-dimensional visualization of a facial expression, a three-dimensional visualization of a facial expression, an emoji associated with a facial expression, or an avatar image associated with a facial expression. In an example, the facial output is communicated to user computing device 150. In an example, the facial output is communicated in real time. In an example, the facial output is continuously updated.
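As a non-limiting sketch of blocks 2545 and 2550 taken together, the function below selects the deformation with the highest score, applies a correspondence threshold, and returns an associated facial output. The labels, emoji mapping, and threshold value are hypothetical.

```python
import torch

# Hypothetical mapping from deformation labels to example outputs; the labels
# and emoji choices are illustrative, not taken from the disclosure.
DEFORMATION_TO_OUTPUT = {"smile": "🙂", "jaw_drop": "😮", "neutral": "😐"}

def facial_output(scores: torch.Tensor, labels: list[str],
                  min_correspondence: float = 0.6) -> str | None:
    """Assign the best-matching deformation if it meets the threshold,
    then return the associated facial output (here, an emoji)."""
    probs = torch.softmax(scores, dim=-1)
    conf, idx = probs.max(dim=-1)
    if conf.item() < min_correspondence:
        return None                 # no deformation meets the threshold
    return DEFORMATION_TO_OUTPUT.get(labels[idx.item()])
```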
While acoustic sensing technology may be used to determine facial expressions by detecting skin deformations, acoustic sensing technology may also be used to directly determine other outputs associated with skin deformation without the need to first determine facial expressions. For example, the acoustic sensing technology may track blinking patterns and eyeball movements of a user. Blinking patterns and eyeball movements can be applied to the diagnosis and treatment of eye diseases. The acoustic sensing technology may detect movements of various parts of the face associated with speech, whether silent or voiced. While speaking, different words and/or phrases lead to subtle, yet distinct, skin deformations. The acoustic sensing technology can capture these skin deformations to recognize silent speech or to vocalize sound. By tracking subtle skin deformations, the acoustic sensing technology can be used to synthesize voice. The acoustic sensing technology may also be used to recognize and track physical activities. For example, while eating, a user opens the mouth, chews, and swallows. During each of those movements, the user's skin deforms in a certain way such that opening the mouth is distinguishable from chewing and from swallowing. The detected skin deformations may be used to determine the type and quantity of food consumed by a user. Similar to eating, a type and quantity of a consumed drink may be determined by skin deformation. The acoustic sensing technology may also be used to determine an emotional status of a user. An emotional status may be related to skin deformations. By detecting skin deformations, an emotional status can be determined and reported to corresponding applications. In an example, the acoustic sensor system based facial expression reconstruction system 1900 described herein may be configured to provide any of the foregoing outputs.
The computing machine 4000 may be implemented as a conventional computer system, an embedded controller, a laptop, a server, a mobile device, a smartphone, a set-top box, a kiosk, a router or other network node, a vehicular information system, one or more processor(s) associated with a television, a customized machine, any other hardware platform, or any combination or multiplicity thereof. The computing machine 4000 may be a distributed system configured to function using multiple computing machines interconnected via a data network or bus system.
The processor 4010 may be configured to execute code or instructions to perform the operations and functionality described herein, manage request flow and address mappings, and to perform calculations and generate commands. The processor 4010 may be configured to monitor and control the operation of the components in the computing machine 4000. The processor 4010 may be a general purpose processor, a processor core, a multiprocessor, a reconfigurable processor, a microcontroller, a digital signal processor (“DSP”), an application specific integrated circuit (“ASIC”), a graphics processing unit (“GPU”), a field programmable gate array (“FPGA”), a programmable logic device (“PLD”), a controller, a state machine, gated logic, discrete hardware components, any other processing unit, or any combination or multiplicity thereof. The processor 4010 may be a single processing unit, multiple processing units, a single processing core, multiple processing cores, special purpose processing cores, co-processors, or any combination thereof. The processor 4010 along with other components of the computing machine 4000 may be a virtualized computing machine executing within one or more other computing machine(s).
The system memory 4030 may include non-volatile memories such as read-only memory (“ROM”), programmable read-only memory (“PROM”), erasable programmable read-only memory (“EPROM”), flash memory, or any other device capable of storing program instructions or data with or without applied power. The system memory 4030 may also include volatile memories such as random access memory (“RAM”), static random access memory (“SRAM”), dynamic random access memory (“DRAM”), and synchronous dynamic random access memory (“SDRAM”). Other types of RAM also may be used to implement the system memory 4030. The system memory 4030 may be implemented using a single memory module or multiple memory modules. While the system memory 4030 is depicted as being part of the computing machine 4000, one skilled in the art will recognize that the system memory 4030 may be separate from the computing machine 4000 without departing from the scope of the subject technology. It should also be appreciated that the system memory 4030 may include, or operate in conjunction with, a non-volatile storage device such as the storage media 4040.
The storage media 4040 may include a hard disk, a floppy disk, a compact disc read only memory (“CD-ROM”), a digital versatile disc (“DVD”), a Blu-ray disc, a magnetic tape, a flash memory, other non-volatile memory device, a solid state drive (“SSD”), any magnetic storage device, any optical storage device, any electrical storage device, any semiconductor storage device, any physical-based storage device, any other data storage device, or any combination or multiplicity thereof. The storage media 4040 may store one or more operating system(s), application programs and program modules such as module 4050, data, or any other information. The storage media 4040 may be part of, or connected to, the computing machine 4000. The storage media 4040 may also be part of one or more other computing machine(s) that are in communication with the computing machine 4000 such as servers, database servers, cloud storage, network attached storage, and so forth.
The module 4050 may comprise one or more hardware or software element(s) configured to facilitate the computing machine 4000 with performing the various methods and processing functions presented herein. The module 4050 may include one or more sequence(s) of instructions stored as software or firmware in association with the system memory 4030, the storage media 4040, or both. The storage media 4040 may therefore represent machine or computer readable media on which instructions or code may be stored for execution by the processor 4010. Machine or computer readable media may generally refer to any medium or media used to provide instructions to the processor 4010. Such machine or computer readable media associated with the module 4050 may comprise a computer software product. It should be appreciated that a computer software product comprising the module 4050 may also be associated with one or more process(es) or method(s) for delivering the module 4050 to the computing machine 4000 via the network 4080, any signal-bearing medium, or any other communication or delivery technology. The module 4050 may also comprise hardware circuits or information for configuring hardware circuits such as microcode or configuration information for an FPGA or other PLD.
The input/output (“I/O”) interface 4060 may be configured to couple to one or more external device(s), to receive data from the one or more external device(s), and to send data to the one or more external device(s). Such external devices along with the various internal devices may also be known as peripheral devices. The I/O interface 4060 may include both electrical and physical connections for operably coupling the various peripheral devices to the computing machine 4000 or the processor 4010. The I/O interface 4060 may be configured to communicate data, addresses, and control signals between the peripheral devices, the computing machine 4000, or the processor 4010. The I/O interface 4060 may be configured to implement any standard interface, such as small computer system interface (“SCSI”), serial-attached SCSI (“SAS”), fiber channel, peripheral component interconnect (“PCI”), PCI express (“PCIe”), serial bus, parallel bus, advanced technology attached (“ATA”), serial ATA (“SATA”), universal serial bus (“USB”), Thunderbolt, FireWire, various video buses, and the like. The I/O interface 4060 may be configured to implement only one interface or bus technology. Alternatively, the I/O interface 4060 may be configured to implement multiple interfaces or bus technologies. The I/O interface 4060 may be configured as part of, all of, or to operate in conjunction with, the system bus 4020. The I/O interface 4060 may include one or more buffer(s) for buffering transmissions between one or more external device(s), internal device(s), the computing machine 4000, or the processor 4010.
The I/O interface 4060 may couple the computing machine 4000 to various input devices including mice, touch-screens, scanners, electronic digitizers, sensors, receivers, touchpads, trackballs, cameras, microphones, keyboards, any other pointing devices, or any combinations thereof. The I/O interface 4060 may couple the computing machine 4000 to various output devices including video displays, speakers, printers, projectors, tactile feedback devices, automation control, robotic components, actuators, motors, fans, solenoids, valves, pumps, transmitters, signal emitters, lights, and so forth.
The computing machine 4000 may operate in a networked environment using logical connections through the network interface 4070 to one or more other system(s) or computing machines across the network 4080. The network 4080 may include WANs, LANs, intranets, the Internet, wireless access networks, wired networks, mobile networks, telephone networks, optical networks, or combinations thereof. The network 4080 may be packet switched, circuit switched, of any topology, and may use any communication protocol. Communication links within the network 4080 may involve various digital or analog communication media such as fiber optic cables, free-space optics, waveguides, electrical conductors, wireless links, antennas, radio-frequency communications, and so forth.
The processor 4010 may be connected to the other elements of the computing machine 4000 or the various peripherals discussed herein through the system bus 4020. It should be appreciated that the system bus 4020 may be within the processor 4010, outside the processor 4010, or both. Any of the processor 4010, the other elements of the computing machine 4000, or the various peripherals discussed herein may be integrated into a single device such as a system on chip (“SOC”), system on package (“SOP”), or ASIC device.
Examples may comprise a computer program that embodies the functions described and illustrated herein, wherein the computer program is implemented in a computer system that comprises instructions stored in a machine-readable medium and a processor that executes the instructions. However, it should be apparent that there could be many different ways of implementing examples in computer programming, and the examples should not be construed as limited to any one set of computer program instructions. Further, a skilled programmer would be able to write such a computer program to implement an example of the disclosed examples based on the appended flow charts and associated description in the application text. Therefore, disclosure of a particular set of program code instructions is not considered necessary for an adequate understanding of how to make and use examples. Further, those skilled in the art will appreciate that one or more aspect(s) of examples described herein may be performed by hardware, software, or a combination thereof, as may be embodied in one or more computing system(s). Moreover, any reference to an act being performed by a computer should not be construed as being performed by a single computer as more than one computer may perform the act.
The examples described herein can be used with computer hardware and software that perform the methods and processing functions described herein. The systems, methods, and procedures described herein can be embodied in a programmable computer, computer-executable software, or digital circuitry. The software can be stored on computer-readable media. Computer-readable media can include a floppy disk, RAM, ROM, hard disk, removable media, flash memory, memory stick, optical media, magneto-optical media, CD-ROM, etc. Digital circuitry can include integrated circuits, gate arrays, building block logic, field programmable gate arrays (“FPGA”), etc.
The systems, methods, and acts described in the examples presented previously are illustrative, and, alternatively, certain acts can be performed in a different order, in parallel with one another, omitted entirely, and/or combined between different examples, and/or certain additional acts can be performed, without departing from the scope and spirit of various examples. Accordingly, such alternative examples are included in the scope of the following claims, which are to be accorded the broadest interpretation so as to encompass such alternate examples.
Although specific examples have been described above in detail, the description is merely for purposes of illustration. It should be appreciated, therefore, that many aspects described above are not intended as essential elements unless explicitly stated otherwise. Modifications of, and equivalent components or acts corresponding to, the disclosed aspects of the examples, in addition to those described above, can be made by a person of ordinary skill in the art, having the benefit of the present disclosure, without departing from the spirit and scope of examples defined in the following claims, the scope of which is to be accorded the broadest interpretation so as to encompass such modifications and equivalent structures.
Various embodiments are described herein. It should be noted that the specific embodiments are not intended as an exhaustive description or as a limitation to the broader aspects discussed herein. One aspect described in conjunction with a particular embodiment is not necessarily limited to that embodiment and can be practiced with any other embodiment(s). Reference throughout this specification to “one embodiment,” “an embodiment,” “an example embodiment,” or other similar language means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention described herein. Thus, appearances of the phrases “in one embodiment,” “in an embodiment,” “an example embodiment,” or other similar language in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiment(s), as would be apparent to a person having ordinary skill in the art and the benefit of this disclosure. Furthermore, while some embodiments described herein include some, but not other, features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention. For example, in the appended claims, any of the claimed embodiments can be used in any combination.
The following examples are presented to illustrate the present disclosure. The examples are not intended to be limiting in any manner.
Example 1 is a wearable system, comprising: at least one acoustic sensor system configured to transmit at least one acoustic signal, to receive at least one reflected acoustic signal, and to output the at least one transmitted acoustic signal and the at least one received reflected acoustic signal; a processor that receives the at least one transmitted acoustic signal and the at least one received reflected acoustic signal from each of the at least one acoustic sensor system; and a physical memory, the physical memory comprising instructions that when executed by the processor cause the processor to: calculate a profile associated with the at least one transmitted acoustic signal and the at least one received reflected acoustic signal; assign a deformation to the profile, the assigned deformation having a predetermined degree of correspondence to a selected one of a plurality of deformations in a model when compared to the profile; and communicate a facial output based on the assigned deformation.
Example 2 includes the subject matter of Example 1, wherein the at least one acoustic sensor system comprises at least one speaker and at least one microphone.
Example 3 includes the subject matter of Examples 1 and 2, wherein the at least one acoustic sensor system comprises a Bluetooth low energy (“BLE”) module to output the at least one transmitted acoustic signal and the at least one received reflected acoustic signal.
Example 4 includes the subject matter of any of Examples 1-3, wherein the at least one transmitted acoustic signal is a frequency modulation continuous wave (“FMCW”) signal.
Example 5 includes the subject matter of any of Examples 1-4, wherein the profile is a differential echo profile.
Example 6 includes the subject matter of any of Examples 1-5, wherein the deformation is a skin deformation.
Example 7 includes the subject matter of any of Examples 1-6, wherein the communicated facial output comprises one of a facial movement, an avatar movement associated with the facial movement, a speech recognition, text associated with the speech recognition, a physical activity, an emotional status, a two-dimensional visualization of a facial expression, a three-dimensional visualization of a facial expression, an emoji associated with a facial expression, or an avatar image associated with a facial expression.
Example 8 includes the subject matter of any of Examples 1-7, the at least one acoustic sensor system positioned on the wearable system to transmit acoustic signals towards a first facial feature on a first side of a sagittal plane of a wearer and to transmit acoustic signals towards a second facial feature on a second side of the sagittal plane of a wearer.
Example 9 includes the subject matter of any of Examples 1-8, the at least one acoustic sensor system positioned on the wearable system to transmit acoustic signals towards an underside contour of a chin of a wearer.
Example 10 includes the subject matter of any of Examples 1-9, the wearable system comprising ear buds, ear pods, in-the-ear (ITE) headphones, over-the-ear headphones, or outside-the-ear (OTE) headphones to which the at least one acoustic sensor system is attached.
Example 11 includes the subject matter of any of Examples 1-10, the wearable system comprising glasses, smart glasses, a visor, a hat, a helmet, headgear, a virtual reality headset, or another head-borne device to which the at least one acoustic sensor system is attached.
Example 12 includes the subject matter of any of Examples 1-11, the wearable system comprising a necklace, a neckband, or a garment-attachable system to which the at least one acoustic sensor system is attached.
Example 13 includes the subject matter of any of Examples 1-12, further comprising a computing device that receives and displays the communicated facial output.
Example 14 includes the subject matter of any of Examples 1-13, wherein the model is trained using machine learning.
Example 15 includes the subject matter of any of Examples 1-14, the training comprising: receiving one or more frontal view facial image(s) of a subject, each of the frontal view facial images corresponding to a deformation of a plurality of deformations of the subject; receiving one or more transmitted acoustic signal(s) and one or more corresponding reflected acoustic signal(s) associated with the subject, each of the one or more transmitted acoustic signal(s) and the one or more corresponding reflected acoustic signal(s) from the at least one acoustic sensor system also corresponding to a deformation of the plurality of deformations of the subject; and correlating, for each of the deformations, the one or more transmitted acoustic signal(s) and the one or more corresponding reflected acoustic signal(s) from the at least one acoustic sensor system corresponding to a particular deformation to the one or more frontal view facial image(s) corresponding to the particular deformation.
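Purely as an illustration of the training recited in Example 15, the sketch below assumes that the frontal view facial images have already been reduced to per-frame deformation targets by a separate pipeline, and that the acoustic signals have already been converted to differential echo profile windows; the function and parameter names are assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_correlation_model(model: nn.Module,
                            echo_windows: torch.Tensor,        # (N, 1, window, range_bins)
                            deformation_targets: torch.Tensor, # (N,) integer labels
                            epochs: int = 10, lr: float = 1e-3) -> nn.Module:
    """Correlate acoustic echo profile windows with ground-truth deformations
    derived from frontal view images (targets assumed precomputed)."""
    loader = DataLoader(TensorDataset(echo_windows, deformation_targets),
                        batch_size=32, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)  # match acoustic profile to ground truth
            loss.backward()
            opt.step()
    return model
```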
This application claims the benefit of U.S. Provisional Patent Application No. 63/343,023, filed May 17, 2022 and entitled “Wearable Facial Movement Tracking Devices” and is a continuation-in-part of PCT Patent Application No. PCT/US2021/032511, filed May 14, 2021 and entitled “Wearable Devices For Facial Expression Recognition,” which claims the benefit of U.S. Provisional Patent Application No. 63/025,979, filed May 15, 2020 and entitled “C-Face: Continuously Reconstructing Facial Expressions By Deep Learning Contours Of The Face With Ear-Mounted Miniature Cameras.” The entire contents of the above-identified priority applications are hereby fully incorporated herein by reference.
Number | Date | Country
--- | --- | ---
63/343,023 | May 2022 | US
63/025,979 | May 2020 | US

Relationship | Number | Date | Country
--- | --- | --- | ---
Parent | PCT/US2021/032511 | May 2021 | US
Child | 17/986,102 | | US