This disclosure relates generally to videoconferencing and relates particularly to rectifying images livestreamed during a videoconference.
During a real-time videoconference, people at a videoconferencing endpoint interact with people at one or more other videoconferencing endpoints over a network. When the data transmission error rate of the network is too high, or the data transmission rate of the network is too low, the quality of transmitted images can suffer. Attempts to compensate for such network shortcomings, especially as they pertain to transmission of images containing faces, have not been wholly successful. Thus, there is room for improvement in the art.
An example of this disclosure is a method of rectifying images in a videoconference. The method includes: receiving a first image frame; determining locations of first feature landmarks corresponding to the first image frame; determining a first region corresponding to the first image frame, based on the locations of the first feature landmarks; partitioning the first region into a first plurality of polygons based on the locations of the first feature landmarks; receiving a second image frame; determining locations of second feature landmarks corresponding to the second image frame; determining a second region corresponding to the second image frame, based on the locations of the second feature landmarks; partitioning the second region into a second plurality of polygons based on the locations of the second feature landmarks; translating image data of one or more polygons of the first plurality of polygons to one or more polygons of the second plurality of polygons; and forming a composite image frame by replacing image data of at least one polygon in the second plurality of polygons with translated image data from the one or more polygons of the first plurality of polygons. In another example of this disclosure, the method also includes: receiving the first image frame at a neural processing unit; receiving the composite image frame at the neural processing unit; and forming a rectified image frame using the neural processing unit, based on the first image frame and the composite image frame.
Another example of this disclosure is a videoconferencing system with a processor that is operable to: receive a first image frame; determine locations of first feature landmarks corresponding to the first image frame; determine a first region corresponding to the first image frame, based on the locations of the first feature landmarks; partition the first region into a first plurality of polygons based on the locations of the first feature landmarks; receive a second image frame; determine locations of second feature landmarks corresponding to the second image frame; determine a second region corresponding to the second image frame, based on the locations of the second feature landmarks; partition the second region into a second plurality of polygons based on the locations of the second feature landmarks; translate image data of one or more polygons of the first plurality of polygons to one or more polygons of the second plurality of polygons; and form a composite image frame by replacing image data of at least one polygon in the second plurality of polygons with translated image data from the one or more polygons of the first plurality of polygons. In one or more examples of this disclosure, the videoconferencing system is operable to: receive the first image frame at a neural processing unit; receive the composite image frame at the neural processing unit; and form a rectified image frame using the neural processing unit, based on the first image frame and the composite image frame.
For illustration, there are shown in the drawings certain examples described in the present disclosure. In the drawings, like numerals indicate like elements throughout. The full scope of the inventions disclosed herein is not limited to the precise arrangements, dimensions, and instruments shown.
In the drawings and the description of the drawings herein, certain terminology is used for convenience only and is not to be taken as limiting the examples of the present disclosure. In the drawings and the description below, like numerals indicate like elements throughout.
By way of introduction,
In some examples of this disclosure, a rectified image frame, or some portion thereof, formed during the videoconference can subsequently be used as a reference frame. In some examples of this disclosure, a recently received high-quality image frame can replace an earlier reference frame. For example, before a videoconference, Ms. Polly might provide a high-quality image of herself, such as from a photographic identification card. If transmission quality of the videoconference is initially poor, the image from the photo ID can be used as a reference frame. Later in the videoconference, transmission quality might improve, and one or more high-quality image frames of Ms. Polly could be received. It may be advantageous to use a more recently received high-quality image frame as the reference frame should the quality of the transmission again decline.
During a videoconference, camera 318 captures video and provides the captured video to the video module 310. In at least one example of this disclosure, camera 318 is an electronic pan-tilt-zoom (EPTZ) camera. In some examples, camera 318 is a smart camera. Additionally, one or more microphones (e.g., 322, 324) capture audio and provide the captured audio to the audio module 306 for processing. The captured audio and concurrently captured video can form a data stream. (See preceding paragraph.) Microphone 322 can be used to detect audio data indicating a presence of one or more persons (e.g., participants 332) at the endpoint 301. The system 300 can use the audio captured with microphone 322 as conference audio.
In some examples, the microphones 322, 324 can reside within a microphone array (e.g., 326) that includes both vertically and horizontally arranged microphones for determining locations of audio sources, e.g., participants 332 who are speaking.
After capturing audio and video, the system 300 encodes the captured audio and video in accordance with an encoding standard, such as MPEG-1, MPEG-2, MPEG-4, H.261, H.263, and H.264 and their descendants. Then, the network module 316 outputs the encoded audio and video to the remote endpoints 302 via the network 304 using an appropriate protocol. Similarly, the network module 316 receives conference audio and video through the network 304 from the remote endpoints 302 and transmits the received audio and video to their respective codecs 308/312 for processing. Endpoint 301 also includes a loudspeaker 328 which outputs conference audio, and a display 330 which outputs conference video.
Using camera 318, the system 300 can capture a view of a room at the endpoint 301, which would typically include all (videoconference) participants 332 at the endpoint 301, as well as some of their surroundings. According to some examples, the system 300 uses camera 318 to capture video of one or more participants 332, including one or more current talkers, in a tight or zoom view. In at least one example, camera 318 is associated with a sound source locator (e.g., 334) of an audio-based locator (e.g., 336).
In one or more examples, the system 300 may use the audio-based locator 336 and a video-based locator 340 to determine locations of participants 332 and frame views of the environment and participants 332. The control module 314 may use audio and/or video information from these locators 336, 340 to crop one or more captured views, such that one or more subsections of a captured view will be displayed on a display 330 and/or transmitted to a remote endpoint 302.
In some examples, to determine how to configure a view, the control module 314 uses audio information obtained from the audio-based locator 336 and/or video information obtained from the video-based locator 340. For example, the control module 314 may use audio information processed by the audio-based locator 336 from one or more microphones (e.g., 322, 324). In some examples, the audio-based locator 336 includes a speech detector 338 which can be used to detect speech in audio captured by microphones 322, 324 to determine a location of a current participant 332. In some examples, the control module 314 uses video information captured using camera 318 and processed by the video-based locator 340 to determine the locations of participants 332 and to determine the framing for captured views.
The memory 412 can be any standard memory such as SDRAM. The memory 412 stores modules 418 in the form of software and/or firmware for controlling the system 300. In addition to audio codec 308 and video codec 312, and other modules discussed previously, the modules 418 can include operating systems, a graphical user interface that enables users to control the system 300, and algorithms for processing audio/video signals and controlling the camera(s) 404.
The network interface 410 enables communications between the endpoint 301 and remote endpoints 302. In one or more examples, the network interface 410 provides data communication with local devices such as a keyboard, mouse, printer, overhead projector, display, external loudspeakers, additional cameras, and microphone pods.
The camera(s) 404 and the microphone(s) 406 capture video and audio, respectively, in the videoconference environment, and produce video and audio signals transmitted through bus 416 to the processor 408. In at least one example of this disclosure, the processor 408 processes the video and audio using algorithms of modules 418. For example, the system 300 processes the audio captured by the microphone(s) 406 as well as the video captured by the camera(s) 404 to determine the location of participants 332 as well as to control and select from the views of the camera(s) 404. Processed audio and video can be sent to remote devices coupled to network interface 410 and devices coupled to general interface 414.
The frame buffer 422 buffers frames of both incoming (from the video camera 404) and outgoing (to the video display 330) video streams, enabling the system 300 to process the streams before routing them onward. If the video display (330) accepts the buffered video stream format, the processor 408 routes the outgoing video stream from the frame buffer 422 to the display 330 (e.g., via input-output interface 414).
Although the processor 408 may be able to process the buffered video frames, in some examples of this disclosure, faster and more power-efficient operations can be achieved using a neural processing unit (e.g., neural processor) 420 and/or a graphics processing unit 424 to perform such processing. In some examples of this disclosure, graphics processing unit 424 employs circuitry optimized for vector functions, whereas neural processing unit 420 employs circuitry optimized for matrix operations. These vector operations and matrix operations are closely related, differing mainly in their precision requirements and their data flow patterns. The processor 408 sets the operating parameters of the neural processing unit 420 or graphics processing unit 424 and provides the selected processing unit access to the frame buffer 422 to carry out image frame processing operations.
As noted, aspects of this disclosure pertain to detecting landmarks in image frames depicting a human face. Facial landmark detection can be achieved in various ways. See e.g., Chinese Utility Patent Application No. 2019-10706647.9, entitled “Detecting Spoofing Talker in a Videoconference,” filed Aug. 1, 2019, which is entirely incorporated by reference herein. In at least one example of this disclosure, there are sixty-eight landmarks (501) on a human face (500). The number of landmarks (501) detected can vary depending on various factors such as the quality of the facial data captured by the cameras (e.g., 404), the angle of the face (500) relative to each camera (e.g., 404), and lighting conditions at the endpoint (e.g., 301). Of the sixty-eight facial landmarks available for analysis, in at least one example of this disclosure, nine facial landmarks (503, 505, 506, 508, 510, 511, 512, 513, 515) are used.
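By way of non-limiting illustration, the sketch below shows one way the sixty-eight facial landmarks (501) could be located, using the pretrained 68-point shape predictor distributed with the open-source dlib library together with OpenCV for image loading. The choice of library and the model file name are assumptions of this sketch, not requirements of this disclosure.

```python
# Illustrative sketch only; the disclosure does not prescribe a particular
# landmark detector. Assumes dlib's pretrained 68-point model is available
# at the (placeholder) path below.
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # placeholder path

def detect_landmarks(image_path):
    """Return (x, y) coordinates of the 68 facial landmarks for the first detected face."""
    frame = cv2.imread(image_path)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)          # upsample once to help detect smaller faces
    if not faces:
        return []
    shape = predictor(gray, faces[0])  # landmark model applied to the first detected face
    return [(shape.part(i).x, shape.part(i).y) for i in range(68)]
```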
At step 612, the system 300 receives another image frame 614 (e.g., 102) depicting Ms. Polly. Image frame 614 could be received from various sources, such as from a remote endpoint 302. In one or more examples of this disclosure, image frame 614 can be received within a data stream from a remote endpoint 302 during a videoconference. In at least one example of this disclosure, step 602 and step 606 are performed before such a videoconference. Ms. Polly's picture in image frame 614 (102) is blurry and of poor quality, so the system 300 will rectify the image frame 614. (As discussed, degradation of image frames can be caused by packet loss or other network/equipment issues.) At step 615, the system 300 determines a (facial) region 616 within the received image frame 614 based on the locations of landmarks (e.g., 503, 505, 506, 508, 510, 511, 512, 513, 515) within the new image frame 614, and partitions the region 616 into a plurality of polygons 618. In some examples of this disclosure, the polygons 618 are triangles.
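One possible realization of step 615, in which the facial region 616 is partitioned into triangles based on the landmark locations, is sketched below using OpenCV's Subdiv2D Delaunay triangulation. This is illustrative only; the function and variable names are placeholders rather than elements of this disclosure.

```python
# Illustrative sketch only: Delaunay triangulation of the landmark points,
# one way to partition a facial region (e.g., 616) into triangles (e.g., 618).
import cv2
import numpy as np

def partition_region(landmarks, frame_shape):
    """Split the region spanned by the landmarks into triangles."""
    h, w = frame_shape[:2]
    subdiv = cv2.Subdiv2D((0, 0, w, h))
    for (x, y) in landmarks:
        subdiv.insert((float(x), float(y)))
    triangles = []
    for x1, y1, x2, y2, x3, y3 in subdiv.getTriangleList():
        tri = np.array([[x1, y1], [x2, y2], [x3, y3]], dtype=np.float32)
        # Subdiv2D can emit triangles touching its outer bounding vertices;
        # keep only triangles that lie entirely inside the frame.
        if (tri >= 0).all() and (tri[:, 0] < w).all() and (tri[:, 1] < h).all():
            triangles.append(tri)
    return triangles
```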
At step 620, the system 300 maps 621 pixel information from polygons 610 of region 608 of the reference image frame 604 to corresponding polygons 618 determined from region 616. Mapping 621 of pixel information from one polygon (e.g., 610′) to another polygon (e.g., 618′) can be achieved by the system 300 using various techniques. See e.g., J. Burkardt, “Mapping Triangles,” Information Technology Department, Virginia Tech, Dec. 23, 2010, which is fully incorporated by reference herein.
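A minimal sketch of such a mapping (621) for a single pair of corresponding triangles is shown below, using an affine transform computed from the three vertex correspondences. It assumes OpenCV and NumPy, with placeholder names, and is not the reference implementation.

```python
# Illustrative sketch only: warp the pixels of one triangle (e.g., 610') onto
# its corresponding triangle (e.g., 618') using the affine map defined by the
# three vertex pairs.
import cv2
import numpy as np

def warp_triangle(src_img, dst_img, src_tri, dst_tri):
    """Copy the image data of src_tri in src_img onto dst_tri in dst_img (in place)."""
    r1 = cv2.boundingRect(np.float32([src_tri]))
    r2 = cv2.boundingRect(np.float32([dst_tri]))
    src_offset = np.float32([[p[0] - r1[0], p[1] - r1[1]] for p in src_tri])
    dst_offset = np.float32([[p[0] - r2[0], p[1] - r2[1]] for p in dst_tri])

    patch = src_img[r1[1]:r1[1] + r1[3], r1[0]:r1[0] + r1[2]]
    matrix = cv2.getAffineTransform(src_offset, dst_offset)   # three point pairs define the map
    warped = cv2.warpAffine(patch, matrix, (r2[2], r2[3]),
                            flags=cv2.INTER_LINEAR, borderMode=cv2.BORDER_REFLECT_101)

    # Blend only the pixels that fall inside the destination triangle.
    mask = np.zeros((r2[3], r2[2], 3), dtype=np.uint8)
    cv2.fillConvexPoly(mask, np.int32(dst_offset), (1, 1, 1))
    roi = dst_img[r2[1]:r2[1] + r2[3], r2[0]:r2[0] + r2[2]]
    roi[:] = roi * (1 - mask) + warped * mask
```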
Reference image frame 604 and received image frame 614 are both based on pictures of Ms. Polly; thus, there is a (scalable) relationship between the relative positions of landmarks (501) in the reference image frame 604 and the relative positions of landmarks (501) in the received image frame 614 (e.g., landmarks 503, 505, 506, 508 of Ms. Polly's eyes are always closer to landmarks of her nose 510-512 than to the landmarks of her mouth 513, 515). The scalable relationship between the relative positions of the landmarks in the two image frames means that when the regions are subdivided in like manner using corresponding landmarks, there will necessarily be a relationship between image data at a given point in a polygon (e.g., 610″) from the reference image frame (e.g., 604) and its corresponding polygon (e.g., 618″) in a second image frame (e.g., 614). The system 300 replaces image data in some or all of the polygons in region 616 of the received image frame 614 with translated image data from region 608 of the reference image frame 604, forming a “revised” (facial) region 622.
At step 624, the system 300 forms a composite image frame 628, in which (at least some data of) the revised facial region 622 replaces 626 (at least some data of) the original facial region 616 in the received image frame 614. In some examples of this disclosure, the composite image frame 628 can be rendered—such as by being displayed using a display device (e.g., 330)—and the method 600 ended thereafter. The level of detail and resolution of the reference image frame 604 affects detail and resolution of the composite image frame 628. It is therefore recommended that the reference image frame 604 be of sufficient quality and definition to make the subject's facial features clearly discernable and reproducible. In some examples of this disclosure, the method 600 does not end by rendering the composite image frame 628. Instead, the method 600 proceeds to step 630 and step 632, in which the received image frame 614 and the composite image frame 628 are passed to one or more neural networks (e.g., 420). At step 634 of the method 600, the system 300 uses the one or more neural networks to perform convolutional-type operations 635 on the received image frame 614 and the composite image frame 628 to produce a rectified image frame 636 (e.g., 108).
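The following sketch suggests how steps 630 through 634 might look when the one or more neural networks are exercised through a deep-learning framework such as PyTorch. The model interface shown is a hypothetical stand-in, not the specific network described with reference to the figures.

```python
# Illustrative sketch only: pass the received frame (e.g., 614) and the
# composite frame (e.g., 628) through a two-input network to obtain a
# rectified frame (e.g., 636). `model` is a hypothetical stand-in.
import torch

def rectify(model, received_frame, composite_frame):
    """Return a rectified frame computed from the received and composite frames."""
    # HWC uint8 images -> NCHW float tensors in [0, 1]
    to_tensor = lambda img: torch.from_numpy(img).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    with torch.no_grad():
        out = model(to_tensor(received_frame), to_tensor(composite_frame))
    # NCHW float output in [0, 1] -> HWC uint8 image
    return (out.squeeze(0).permute(1, 2, 0).clamp(0, 1) * 255).byte().numpy()
```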
Neural networks 706, 710 may be convolutional neural networks that support deep learning, such as U-Net. As shown in
A second result layer 806 is derived from the first result layer 804 using different coefficient sets corresponding to a pixel volume of 3-by-3-by-64. The number of feature planes in the second result layer 806 is kept at 64. A subset of result layer 806 contains facial image data and forms a feature map 810.
As with the first operation on input layer 802, the set of coefficients for the first operation on input layer 812 corresponds to an input pixel volume of 3-by-3-by-N (3 pixels wide, 3 pixels high, and N planes deep), where N corresponds to the number of planes in input layer 812 (e.g., three), and each plane in a result layer gets its own set of coefficients. Thus, result layer 814 has 64 feature planes. As with input layer 802, the input pixel values of input layer 812 are scaled by the coefficient values, and the scaled values are summed and “rectified.” The rectified linear unit operation (ReLU) passes any positive sums through but replaces any negative sums with zero.
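As a brief numerical illustration of the rectified linear unit operation described above (assuming NumPy for the arithmetic; not part of this disclosure):

```python
# Illustrative only: ReLU passes positive sums through and replaces negative sums with zero.
import numpy as np

sums = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
rectified = np.maximum(sums, 0.0)
print(rectified)  # [0.  0.  0.  1.5 3. ]
```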
Result layer 816 is derived from the first result layer 814 using different coefficient sets corresponding to a pixel volume of 3-by-3-by-64. The number of feature planes in result layer 816 is kept at 64. A subset of result layer 816 contains facial image data and forms a second (facial) feature map 820. An eltwise operation is performed on layer 806 and layer 816 to produce layer 822.
The U-Net architecture then applies a “2-by-2 max pooling operation with stride 2 for down-sampling” to layer 822 and layer 816, meaning that each 2-by-2 block of pixels in layer 822 and layer 816 is replaced by a single pixel in a corresponding plane of the down-sampled result layer. The number of planes remains at 64, but the width and height of the planes are reduced by half.
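One hedged way to express the operations described for layers 802 through 824 (the per-branch 3-by-3 convolutions with 64 feature planes and ReLU, the element-wise combination, and the 2-by-2 max pooling with stride 2) is the PyTorch sketch below. Module and variable names are placeholders, and the exact branch wiring is an assumption of this sketch.

```python
# Illustrative sketch only: two parallel convolution branches (e.g., layers
# 802-806 and 812-816), an element-wise sum (e.g., layer 822), and 2x2 max
# pooling with stride 2 for down-sampling.
import torch
import torch.nn as nn

class DualBranchEncoderBlock(nn.Module):
    def __init__(self, in_planes=3, planes=64):
        super().__init__()
        self.branch_a = nn.Sequential(          # e.g., input layer 802 -> layers 804, 806
            nn.Conv2d(in_planes, planes, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(planes, planes, 3, padding=1), nn.ReLU(inplace=True))
        self.branch_b = nn.Sequential(          # e.g., input layer 812 -> layers 814, 816
            nn.Conv2d(in_planes, planes, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(planes, planes, 3, padding=1), nn.ReLU(inplace=True))
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, frame_a, frame_b):
        a = self.branch_a(frame_a)
        b = self.branch_b(frame_b)
        combined = a + b                        # eltwise operation, e.g., layer 822
        # Down-sampling halves width and height; the plane count stays at 64.
        return self.pool(combined), self.pool(b)
```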
Result layer 826 is derived from down-sampled result layer 824, using additional 3-by-3-by-64 coefficient sets to double the number of planes from 64 to 128. Result layer 828 is derived from result layer 826 using coefficient sets of 3-by-3-by-128. In some examples of this disclosure, result layer 828 contains a feature map 830.
Result layer 834 is derived from down-sampled result layer 832, using additional 3-by-3-by-64 coefficient sets to double the number of planes from 64 to 128. Result layer 836 is derived from result layer 834 using coefficient sets of 3-by-3-by-128. In some examples of this disclosure, result layer 836 contains a feature map 840. An eltwise operation is performed on result layer 828 and result layer 836 to produce result layer 842.
A 2-by-2 max pooling operation is applied to result layer 842 and result layer 836, producing down-sampled layer 844 and down-sampled layer 852 having width and height dimensions a quarter of the original image frame dimensions. The depth of down-sampled layer 844 and down-sampled layer 852 is 128.
Result layer 854 is derived from down-sampled result layer 852, using additional 3-by-3-by-128 coefficient sets to double the number of feature planes from 128 to 256. Result layer 856 is derived from result layer 854 using coefficient sets of 3-by-3-by-256. In at least one example of this disclosure, result layer 856 contains a feature map 858.
Result layer 846 is derived from down-sampled result layer 844, using additional 3-by-3-by-128 coefficient sets to double the number of feature planes from 128 to 256. Result layer 848 is derived from result layer 846 using coefficient sets of 3-by-3-by-256. In at least one example of this disclosure, result layer 848 contains a feature map 850. An eltwise operation is performed on result layer 856 and result layer 848 to produce result layer 860.
A 2-by-2 max pooling operation is applied to result layer 848 and result layer 860 to produce down-sampled result layer 862 and down-sampled result layer 870, each having width and height dimensions one eighth of the original image frame dimensions. The depth of down-sampled result layer 862 and down-sampled result layer 870 is 256.
In similar fashion, result layer 872 is derived from down-sampled result layer 870 and result layer 874 is derived from result layer 872, while result layer 864 is derived from down-sampled result layer 862 and result layer 866 is derived from result layer 864. In at least one example of this disclosure, result layer 874 comprises a feature map 876 and result layer 866 comprises a feature map 868. An eltwise operation is performed on result layer 874 and result layer 866 to produce result layer 878 having a depth of 512. Result layer 878 is down-sampled to produce down-sampled result layer 880. Result layer 882 is derived from down-sampled result layer 880, and result layer 884 is derived from result layer 882. The pixel values in the planes of layer 862, layer 870, and layer 880 are repeated to double the width and height of each plane. Thereafter, the U-Net begins up-sampling. A set of 2-by-2-by-N convolutional coefficients is used to condense the number of up-sampled planes by half. Result layer 878 is concatenated with the condensed layer 884, forming a first concatenated layer 886 having 1024 planes.
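A hedged PyTorch sketch of the up-sampling and concatenation just described (pixel values repeated to double width and height, a 2-by-2 convolution condensing the plane count by half, and concatenation with the 512-plane skip layer to yield 1024 planes) appears below. The class and variable names, and the use of "same" padding, are assumptions of the sketch rather than elements of this disclosure.

```python
# Illustrative sketch only: nearest-neighbor up-sampling, a 2x2 "condensing"
# convolution that halves the plane count, and concatenation with a skip
# layer (e.g., condensed layer 884 concatenated with layer 878 to form 886).
import torch
import torch.nn as nn

class UpConcat(nn.Module):
    def __init__(self, in_planes=1024, out_planes=512):
        super().__init__()
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")  # repeat pixel values
        self.condense = nn.Conv2d(in_planes, out_planes, kernel_size=2, padding="same")

    def forward(self, bottleneck, skip):
        up = self.condense(self.upsample(bottleneck))  # e.g., condensed layer 884 (512 planes)
        return torch.cat([skip, up], dim=1)            # e.g., concatenated layer 886 (1024 planes)
```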
Result layer 888 is derived from the first concatenated layer 886, using 512 coefficient sets of size 3-by-3-by-1024. Result layer 890 is derived from result layer 888 using 512 coefficient sets of size 3-by-3-by-512. A second concatenated layer 892 is derived by up-sampling result layer 890, condensing the result, and concatenating the result with result layer 860. 256 coefficient sets of size 3-by-3-by-512 are used to derive result layer 894 from the second concatenated layer 892, and 256 coefficient sets of size 3-by-3-by-256 are used to derive result layer 896 from result layer 894.
A third concatenated result layer 898 is derived from result layer 896 using up-sampling, condensing, and concatenation with result layer 842. Result layer 803 and result layer 805 are derived in turn, and result layer 805 is up-sampled, condensed, and concatenated with result layer 822 to form fourth concatenated layer 807. Result layer 811 is derived from fourth concatenated layer 807 and can be output as a data block (e.g., 708).
The number of planes in each result layer in
The system bus 1010 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. A basic input/output system (BIOS) stored in ROM 1040 or the like, may provide the basic routine that helps to transfer information between elements within the device 1000, such as during start-up. The device 1000 further includes storage devices 1080 such as a hard disk drive, a magnetic disk drive, an optical disk drive, tape drive or the like. The storage device 1080 can include software modules 1062, 1064, 1066 for controlling the processor 1020. Other hardware or software modules are contemplated. The storage device 1080 is connected to the system bus 1010 by a drive interface. The drives and the associated computer readable storage media provide nonvolatile storage of computer readable instructions, data structures, program modules and other data for the device 1000. In at least one example, a hardware module that performs a function includes the software component stored in a non-transitory computer-readable medium coupled to the hardware components—such as the processor 1020, bus 1010, output device 1070, and so forth—necessary to carry out the function.
For clarity of explanation, the device of
Examples of this disclosure include the following examples:
1. A method (200, 600) of rectifying images in a videoconference, comprising: receiving (202) a first image frame (604); determining locations of first feature landmarks (501) corresponding to the first image frame (604); determining (606) a first region (608) corresponding to the first image frame (604), based on the locations of the first feature landmarks (501); partitioning the first region (608) into a first plurality of polygons (610) based on the locations of the first feature landmarks (501); receiving a second image frame (102, 614); determining locations of second feature landmarks (501) corresponding to the second image frame (102, 614); determining a second region (616) corresponding to the second image frame (102, 614), based on the locations of the second feature landmarks (501); partitioning the second region (616) into a second plurality of polygons (618) based on the locations of the second feature landmarks (501); translating image data of one or more polygons (610) of the first plurality of polygons (610) to one or more polygons of the second plurality of polygons (618); and forming a composite image frame by replacing image data of at least one polygon in the second plurality of polygons (618) with translated image data from the one or more polygons (610) of the first plurality of polygons (610).
2. The method (200, 600) of example 1, further comprising: receiving (202) the first image frame (604) at a neural processing unit; receiving the composite image frame at the neural processing unit; and forming a rectified image frame using the neural processing unit, based on the first image frame (604) and the composite image frame.
3. The method (200, 600) of example 1, wherein: partitioning the first region (608) into the first plurality of polygons (610) based on the locations of the first feature landmarks (501) comprises partitioning the first region (608) into a first quantity of polygons; partitioning the second region (616) into the second plurality of polygons (618) based on the locations of the second feature landmarks (501) comprises partitioning the second region (616) into a second quantity of polygons; and the second quantity of polygons is equal to the first quantity of polygons.
4. The method (200, 600) of example 1, wherein: determining locations of first feature landmarks (501) corresponding to the first image frame (604) comprises determining locations of first facial feature landmarks (501) corresponding to the first image frame (604); determining (606) the first region (608) corresponding to the first image frame (604), based on the locations of the first feature landmarks (501) comprises determining (606) a first facial region corresponding to the first image frame (604), based on the locations of the first facial feature landmarks (501); partitioning the first region (608) into the first plurality of polygons (610) based on the locations of the first feature landmarks (501) comprises partitioning the first facial region into the first plurality of polygons (610) based on the locations of the first facial feature landmarks (501); determining locations of second feature landmarks (501) corresponding to the second image frame (102, 614) comprises determining locations of second facial feature landmarks (501) corresponding to the second image frame (102, 614); determining the second region (616) corresponding to the second image frame (102, 614), based on the locations of the second feature landmarks (501) comprises determining a second facial region corresponding to the second image frame (102, 614), based on the locations of the second facial feature landmarks (501); partitioning the second region (616) into a second plurality of polygons (618) based on the locations of the second feature landmarks (501) comprises partitioning the second facial region into the second plurality of polygons (618) based on the locations of the second facial feature landmarks (501); and translating image data of one or more polygons (610) of the first plurality of polygons (610) to one or more polygons of the second plurality of polygons (618) comprises mapping (621) facial image data of one or more polygons (610) of the first plurality of polygons (610) to one or more polygons of the second plurality of polygons (618).
5. The method (200, 600) of example 4, wherein: receiving (202) the first image frame (604) comprises retrieving a first image depicting a person, the first image being of a first quality; and receiving the second image frame (102, 614) comprises receiving the second image frame (102, 614) within a data stream initiated at a remote videoconferencing system (100), the second image frame depicting the person and being of a second quality, wherein the second quality is inferior to the first quality.
6. The method (200, 600) of example 5, further comprising: receiving (202) the first image frame (604) at a neural processing unit; receiving the composite image frame at the neural processing unit; forming a rectified image frame using the neural processing unit, based on the first image frame (604) and the composite image frame; and rendering the rectified image frame, wherein rendering the rectified image frame comprises displaying an image depicting the person.
7. The method (200, 600) of example 6, wherein rendering the rectified image frame further comprises displaying the image depicting the person within a predetermined period of receiving the second image frame (102, 614) at a videoconferencing system (100). In at least one example, the predetermined period is 80 milliseconds.
8. The method (200, 600) of example 6, further comprising: receiving a third image frame, the third image frame corresponding to the rectified image frame; determining locations of third feature landmarks (501) corresponding to the third image frame; determining (606) a third region corresponding to the third image frame, based on the locations of the third feature landmarks (501); partitioning the third region into a third plurality of polygons based on the locations of the third feature landmarks (501); receiving a fourth image frame; determining locations of fourth feature landmarks (501) corresponding to the fourth image frame; determining a fourth region corresponding to the fourth image frame, based on the locations of the fourth feature landmarks (501); partitioning the fourth region into a fourth plurality of polygons based on the locations of the fourth feature landmarks (501); translating image data of one or more polygons of the third plurality of polygons to one or more polygons of the fourth plurality of polygons; and forming a composite image frame by replacing image data of at least one polygon in the fourth plurality of polygons with translated image data from the one or more polygons of the third plurality of polygons.
9. The method (200, 600) of example 6, wherein: receiving (202) the first image frame (604) at the neural processing unit comprises receiving (202) the first image frame (604) at a processing unit comprising a U-net architecture; and receiving the composite image frame at the neural processing unit comprises receiving the composite image frame at the processing unit having the U-net architecture.
10. The method (200, 600) of example 9, wherein: receiving (202) the first image frame (604) at the neural processing unit further comprises receiving (202) the first image frame (604) at a processing unit comprising a VDSR architecture; and receiving the composite image frame at the neural processing unit further comprises receiving the composite image frame at the processing unit having the VDSR architecture.
11. The method (200, 600) of example 1, wherein: determining locations of first feature landmarks (501) corresponding to the first image frame (604) comprises discerning first facial feature landmarks (501); and determining locations of second feature landmarks (501) corresponding to the second image frame (102, 614) comprises discerning second facial feature landmarks (501).
12. A videoconferencing system (100) with video image rectification, the videoconferencing system (100) comprising a processor (408, 1020), wherein the processor (408, 1020) is operable to: receive a first image frame (604); determine locations of first feature landmarks (501) corresponding to the first image frame (604); determine a first region (608) corresponding to the first image frame (604), based on the locations of the first feature landmarks (501); partition the first region (608) into a first plurality of polygons (610) based on the locations of the first feature landmarks (501); receive a second image frame (102, 614); determine locations of second feature landmarks (501) corresponding to the second image frame (102, 614); determine a second region (616) corresponding to the second image frame (102, 614), based on the locations of the second feature landmarks (501); partition the second region (616) into a second plurality of polygons (618) based on the locations of the second feature landmarks (501); translate image data of one or more polygons (610) of the first plurality of polygons (610) to one or more polygons of the second plurality of polygons (618); and form a composite image frame by replacing image data of at least one polygon in the second plurality of polygons (618) with translated image data from the one or more polygons (610) of the first plurality of polygons (610).
13. The videoconferencing system (100) of example 12, further comprising a neural processor, wherein the neural processor is further operable to: receive the first image frame (604) at a neural processing unit; receive the composite image frame at the neural processing unit; and form a rectified image frame using the neural processing unit, based on the first image frame (604) and the composite image frame.
14. The videoconferencing system (100) of example 12, wherein the processor (408, 1020) is further operable to: partition the first region (608) into the first plurality of polygons (610) based on the locations of the first feature landmarks (501) by partitioning the first region (608) into a first quantity of polygons; and partition the second region (616) into the second plurality of polygons (618) based on the locations of the second feature landmarks (501) by partitioning the second region (616) into a second quantity of polygons equal to the first quantity of polygons.
15. The videoconferencing system (100) of example 12, wherein the processor (408, 1020) is further operable to: determine locations of first feature landmarks (501) corresponding to the first image frame (604) by determining locations of first facial feature landmarks (501) corresponding to the first image frame (604); determine (606) the first region (608) corresponding to the first image frame (604), based on the locations of the first feature landmarks (501) by determining (606) a first facial region corresponding to the first image frame (604), based on the locations of the first facial feature landmarks (501); partition the first region (608) into the first plurality of polygons (610) based on the locations of the first feature landmarks (501) by partitioning the first facial region into the first plurality of polygons (610) based on the locations of the first facial feature landmarks (501); determine locations of second feature landmarks (501) corresponding to the second image frame (102, 614) by determining locations of second facial feature landmarks (501) corresponding to the second image frame (102, 614); determine the second region (616) corresponding to the second image frame (102, 614), based on the locations of the second feature landmarks (501) by determining a second facial region corresponding to the second image frame (102, 614), based on the locations of the second facial feature landmarks (501); partition the second region (616) into a second plurality of polygons (618) based on the locations of the second feature landmarks (501) by partitioning the second facial region into the second plurality of polygons (618) based on the locations of the second facial feature landmarks (501); and translate image data of one or more polygons (610) of the first plurality of polygons (610) to one or more polygons of the second plurality of polygons (618) by mapping (621) facial image data of one or more polygons (610) of the first plurality of polygons (610) to one or more polygons of the second plurality of polygons (618).
16. The videoconferencing system (100) of example 15, wherein the processor (408, 1020) is further operable to: receive the first image frame (604) by retrieving a first image depicting a person, the first image being of a first quality; and receive the second image frame (102, 614) by receiving the second image frame (102, 614) within a data stream initiated at a remote videoconferencing system (100), the second image frame depicting the person and being of a second quality, wherein the second quality is inferior to the first quality.
17. The videoconferencing system (100) of example 16, further comprising a neural processing unit, wherein the neural processing unit is operable to: receive the first image frame (604); receive the composite image frame; form a rectified image frame based on the first image frame (604) and the composite image frame; and provide the rectified image frame to the processor (408, 1020), wherein the processor (408, 1020) is further operable to cause a display device to display an image depicting the person based on the rectified image frame.
18. The videoconferencing system (100) of example 17, wherein the processor (408, 1020) is further operable to: cause the display device to display the image depicting the person based on the rectified image frame within a predetermined period of receiving the second image frame (102, 614). In at least one example, the predetermined period is sixty milliseconds.
19. The videoconferencing system (100) of example 17, wherein the processor (408, 1020) is further operable to: receive a third image frame, the third image frame corresponding to the rectified image frame; determine locations of third feature landmarks (501) corresponding to the third image frame; determine a third region corresponding to the third image frame, based on the locations of the third feature landmarks (501); partition the third region into a third plurality of polygons based on the locations of the third feature landmarks (501); receive a fourth image frame; determine locations of fourth feature landmarks (501) corresponding to the fourth image frame; determine a fourth region corresponding to the fourth image frame, based on the locations of the fourth feature landmarks (501); partition the fourth region into a fourth plurality of polygons based on the locations of the fourth feature landmarks (501); translate image data of one or more polygons of the third plurality of polygons to one or more polygons of the fourth plurality of polygons; and form a second composite image frame by replacing image data of at least one polygon in the fourth plurality of polygons with translated image data from the one or more polygons of the third plurality of polygons.
20. The videoconferencing system (100) of example 12, wherein the processor (408, 1020) is further operable to: determine locations of first feature landmarks (501) corresponding to the first image frame (604) by discerning first facial feature landmarks (501); and determine locations of second feature landmarks (501) corresponding to the second image frame (102, 614) by discerning second facial feature landmarks (501).
21. The videoconferencing system of example 19, wherein the neural processor (420, 701) is further operable to: receive the third image frame; receive the second composite image frame; form a second rectified image frame based on the third image frame and the second composite image frame; and provide the second rectified image frame to the processor (408, 1020), wherein the processor (408, 1020) is further operable to cause a display device to display an image depicting the person based on the second rectified image frame.
The various examples within this disclosure are provided by way of illustration and should not be construed to limit the scope of the disclosure. Various modifications and changes can be made to the principles and examples described herein without departing from the scope of the disclosure and without departing from the claims which follow.
PCT Filing Document: PCT/CN2020/100372; Filing Date: 7/6/2020; Country: WO.