The present invention is related to a method of video processing for live streaming, and more particularly, to a video processing method that is arranged to perform partial highlighting with the aid of hand gesture detection and an associated system on chip (SoC).
Live streaming is widely used in modern society, and has seen a particular rise in popularity during the Covid-19 pandemic when face-to-face meetings were replaced with remote video conferences. When one party in a remote video conference includes multiple participants that can be seen in an image (e.g. an image displayed on a screen), it may be difficult for the other party's participants to distinguish a speaker. Specifically, assume that a current remote video conference is taking place between a first party and a second party, wherein the first party has multiple participants in a physical conference room, and the audio and video information of the physical conference room is captured by a microphone and camera and transmitted to participants in the remote second party through a network. Due to the relative positioning of the multiple participants in the first party and limitations with regard to the size of the image, the participants of the second party may not be able to correctly identify a current speaker within the image, such that the participants of the second party may be confused as to who the current speaker is, thereby affecting efficiency of the conference.
It is therefore one of the objectives of the present invention to provide a person tracking technology that can be applied to a remote video conference, wherein a current speaker in an image (e.g. an image displayed on a screen) can be highlighted, to address the above-mentioned issues.
According to an embodiment of the present invention, an SoC that is arranged to perform partial highlighting with the aid of hand gesture detection is provided. The SoC comprises a person recognition circuit, a hand gesture detection circuit, a sound detection circuit, and a processing circuit. The person recognition circuit is arranged to obtain an image data from an image capturing device, and perform person recognition upon the image data to generate a recognition result. The hand gesture detection circuit is arranged to obtain the image data from the image capturing device, and perform hand gesture detection upon a hand gesture image data in the image data to generate a hand gesture detection result. The sound detection circuit is arranged to receive multiple sound signals from multiple microphones, and determine a voice characteristic value of a main sound. The processing circuit is coupled to the person recognition circuit, the hand gesture detection circuit, and the sound detection circuit, and is arranged to determine a specific region in the image data according to the recognition result, the hand gesture detection result, and the voice characteristic value of the main sound, and process the image data to highlight the specific region.
According to an embodiment of the present invention, a video processing method that is arranged to perform partial highlighting with the aid of hand gesture detection is provided. The video processing method comprises: obtaining an image data from an image capturing device, and performing person recognition upon the image data to generate a recognition result; obtaining the image data from the image capturing device, and performing hand gesture detection upon hand gesture image data in the image data, to generate a hand gesture detection result; receiving multiple sound signals from multiple microphones, and determining a voice characteristic value of a main sound; determining a specific region in the image data according to the recognition result, the hand gesture detection result, and the voice characteristic value of the main sound; and processing the image data to highlight the specific region.
One of the benefits of the present invention is that by detecting the current speaker and highlighting the speaker in the image data, the video processing method and the SoC of the present invention can enable participants in the remote conference room to clearly identify the speaker, which can effectively improve the conference efficiency. In addition, the video processing method and the SoC of the present invention can ensure the accuracy of related operations by hand gesture detection.
These and other objectives of the present invention will no doubt become obvious to those of ordinary skill in the art after reading the following detailed description of the preferred embodiment that is illustrated in the various figures and drawings.
When one party in a remote video conference includes multiple participants in an image (e.g. an image displayed on a screen), the other party's participants may sometimes have difficulty distinguishing a current speaker from among the participants in the image. For example, if the participants in the second conference room are not familiar with the respective voices of the participants in the first conference room, or if the speaker in the first conference room is not facing the camera, the participants in the second conference room may sometimes find it difficult to identify the speaker, which can result in communication difficulties.
A method for highlighting the speaker is designed in a system on chip (SoC) in the electronic device 110, so that the participants in the second conference room can clearly identify the speaker in the first conference room, to address the above-mentioned issues.
It should be noted that the image capturing device 202 and the microphones 204_1-204_N are disposed in the electronic device 110; however, in some embodiments, the image capturing device 202 and the microphones 204_1-204_N are externally connected to the electronic device 110.
The person recognition circuit 210 is arranged to perform person recognition upon the image data received by the image capturing device 202, to first determine whether there is a person/people in the received image data, and determine a characteristic value of each person and a position/region of each person in the image (e.g. the image displayed on the screen). Specifically, the person recognition circuit 210 may utilize a deep learning method or a neural network method to process at least one frame in the image data. For example, multiple different convolution kernels (e.g. convolution filters) are utilized to perform multiple convolution operations upon the at least one frame (e.g. an image frame) to recognize whether there is a person in the at least one frame. In addition, for a detected person, a characteristic value of the detected person (or a characteristic value of a region in which the detected person is located) is determined by the above-mentioned deep learning method or neural network method, wherein the characteristic value can be a multi-dimensional vector (e.g. a vector with dimension “512”). It should be noted that the above-mentioned circuit design related to person recognition is well known to those with ordinary knowledge in the art. One of the key points of this embodiment is the application of people recognized by the person recognition circuit 210 and their characteristic values. Other details of the person recognition circuit 210 are not repeated here.
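As an illustration of applying multiple convolution kernels to a frame, the following is a minimal software sketch and not the circuit design of the person recognition circuit 210. The function names, the toy 3x3 frame, and the two kernels are hypothetical; as is common in neural network layers, the kernel is applied without flipping (i.e. cross-correlation).

```python
def convolve2d_valid(frame, kernel):
    """Slide `kernel` over `frame` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(frame) - kh + 1
    out_w = len(frame[0]) - kw + 1
    return [
        [
            sum(frame[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def apply_kernels(frame, kernels):
    """Apply several different kernels to the same frame, one feature map per kernel."""
    return [convolve2d_valid(frame, kernel) for kernel in kernels]


# Hypothetical toy data: a 3x3 grayscale frame and two 2x2 kernels.
frame = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernels = [
    [[1, 0], [0, -1]],   # responds to diagonal intensity changes
    [[1, 1], [1, 1]],    # sums each 2x2 neighbourhood
]
feature_maps = apply_kernels(frame, kernels)
```

In a real detector many such feature maps would be stacked across layers before a person/no-person decision and a characteristic vector are produced; that part is outside the scope of this sketch.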
The hand gesture detection circuit 215 is arranged to perform hand gesture detection upon a hand gesture image data in the image data received by the image capturing device 202 to generate at least one hand gesture detection result. More particularly, the hand gesture detection circuit 215 may include multiple sub-circuits for a two-stage operation, which are expressed as follows:
The voice activity detection circuit 220 is arranged to receive sound signals from the microphones 204_1-204_N, and determine whether there is a voice component in the sound signals. Specifically, the voice activity detection circuit 220 can perform the following operations: performing noise reduction upon the received sound signals; converting the sound signals to the frequency domain and then processing a block to obtain characteristic values; and comparing the obtained characteristic values with a reference value to determine whether the sound signals are voice signals. It should be noted that, since circuit designs related to the voice activity detection are well known to those with ordinary knowledge in the art, and one of the key points of this embodiment is to perform subsequent operations according to the determination result generated by the voice activity detection circuit 220, details of the voice activity detection circuit 220 are omitted here for brevity. In addition, in another embodiment, the voice activity detection circuit 220 can receive sound signals from only a part of the microphones 204_1-204_N, without receiving sound signals of all microphones 204_1-204_N.
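The frequency-domain comparison described above can be sketched in software as follows. This is only an illustrative assumption of one possible characteristic value (energy in a speech band, computed with a naive discrete Fourier transform); the band limits, the reference value, and all function names are hypothetical, and the noise reduction operation is omitted.

```python
import cmath
import math


def band_energy(block, sample_rate, low_hz=300.0, high_hz=3400.0):
    """Energy of one block of samples inside [low_hz, high_hz], via a naive DFT."""
    n = len(block)
    energy = 0.0
    for k in range(1, n // 2):
        freq = k * sample_rate / n
        if low_hz <= freq <= high_hz:
            coeff = sum(block[t] * cmath.exp(-2j * math.pi * k * t / n)
                        for t in range(n))
            energy += abs(coeff) ** 2
    return energy / n


def has_voice_component(block, sample_rate, reference):
    """Compare the characteristic value of the block with a reference value."""
    return band_energy(block, sample_rate) > reference


# Hypothetical test signals: 64 samples at 8 kHz.
SAMPLE_RATE = 8000
in_band_tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(64)]
low_hum = [math.sin(2 * math.pi * 125 * t / SAMPLE_RATE) for t in range(64)]
silence = [0.0] * 64
```

With a hypothetical reference value of 1.0, `has_voice_component(in_band_tone, SAMPLE_RATE, 1.0)` is true, while the low-frequency hum and the silent block are rejected. A single band-energy feature cannot by itself separate speech from other in-band sounds, so it stands in here only for whatever characteristic values a practical design would use.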
Regarding operations of the sound direction detection circuit 230, the microphones 204_1 to 204_N can be placed at several known locations of the electronic device 110, so that the sound direction detection circuit 230 can determine an azimuth of a main sound in the first conference room (i.e. direction and angle of a main speaker relative to the electronic device 110) according to a time difference of sound signals from the microphones 204_1-204_N. In this embodiment, the sound direction detection circuit 230 can only determine one direction at a time; that is, if there are multiple people in the first conference room talking at the same time (or making other sounds), the sound direction detection circuit 230 will determine which direction the main sound comes from according to some characteristics (e.g. signal strength) of the received multiple sound signals. It should be noted that, since the circuit designs related to the sound direction detection circuit 230 are well known to those with ordinary knowledge in the art, and one of the key points of this embodiment is to perform subsequent operations according to a determination result generated by the sound direction detection circuit 230, details of the sound direction detection circuit 230 are omitted here for brevity.
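For illustration, the relationship between a time difference and an azimuth can be sketched for the simplest case of two microphones and a far-field source. This is an assumption-laden sketch, not the design of the sound direction detection circuit 230: the function names, the cross-correlation delay estimate, the sign convention, and the 0.1 m spacing are all hypothetical.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # approximate speed of sound in air at room temperature


def estimate_delay_samples(first, second, max_lag):
    """Lag (in samples) at which `second` best matches `first`.

    A positive result means the sound reached the second microphone later.
    """
    def correlation(lag):
        return sum(first[t] * second[t + lag]
                   for t in range(len(first))
                   if 0 <= t + lag < len(second))

    return max(range(-max_lag, max_lag + 1), key=correlation)


def azimuth_from_delay(delay_s, mic_spacing_m, speed=SPEED_OF_SOUND_M_S):
    """Far-field angle in degrees from the broadside of a two-microphone pair."""
    ratio = speed * delay_s / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # guard against measurement noise
    return math.degrees(math.asin(ratio))


# Hypothetical example: an impulse arrives 3 samples later at the second microphone.
first_mic = [0.0] * 16
second_mic = [0.0] * 16
first_mic[5] = 1.0
second_mic[8] = 1.0
lag = estimate_delay_samples(first_mic, second_mic, max_lag=6)
angle = azimuth_from_delay(lag / 48000.0, mic_spacing_m=0.1)
```

With more than two microphones at known locations, several such pairwise delays could be combined to resolve the direction unambiguously.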
In Step 300, the flow starts, the electronic device 110 is powered on, and the connection with the electronic device 120 of the second conference room is completed.
In Step 302, the voice activity detection circuit 220 receives the sound signals from the microphones 204_1-204_N and determines whether there is a voice component in the sound signals. If yes, Step 304 is entered; if no, the flow returns to Step 302 to keep detecting whether there is a voice component in the sound signals.
In Step 304, the processing circuit 240 enables the person recognition circuit 210 after the voice activity detection circuit 220 detects that the sound signals have the voice component, so that the person recognition circuit 210 starts to perform person recognition upon the received image data to determine whether there is a person in the received image data, and determines the characteristic value of each person and the position/region of each person in the image (e.g. the image displayed on the screen, such as the image frame). Take
In Step 305, the processing circuit 240 enables the hand gesture detection circuit 215 so that the hand gesture detection circuit 215 starts to perform hand gesture detection, to ensure correctness of related operations through the hand gesture detection.
In Step 306, the processing circuit 240 enables the sound direction detection circuit 230, and the sound direction detection circuit 230 starts to determine the direction and the angle of the main sound relative to the electronic device 110 according to the time difference of the sound signals from the microphones 204_1-204_N. It should be noted that Step 304, Step 305, and Step 306 may be executed simultaneously, i.e. the execution of this embodiment is not limited to the sequence shown in
In Step 308, according to the region (e.g. the regions 410-450 shown in
In Step 310, after determining the current speaker in the image (e.g. the image frame), the processing circuit 240 processes the image data from the image capturing device 202, to highlight the main speaker in the image data.
In Step 311, in addition to determining the region of the main speaker in the image (e.g. the image frame, such as the image data) and processing the image data to highlight the region, the processing circuit 240 further enables a gesture lock for the region, to indicate that the processing circuit 240 keeps highlighting the region. Specifically,
It should be noted that enhancing the visual effect of the person in the region 440 does not necessarily need to visually enhance the entire region 440, and only a part of the region 440 being visually enhanced can also achieve the same effect. Take
In Step 312, the processing circuit 240 keeps tracking the highlighted person, and keeps processing the image data from the image capturing device 202 to highlight the person in the image data.
Specifically, the person recognition circuit 210 can keep determining the characteristic value and the region of each person in the image (e.g. the image frame), and the processing circuit 240 can keep highlighting the person in the current and subsequent image (e.g. the image frame) according to the characteristic value of the highlighted person. Take the region 440 in
It should be noted that, since the current speaker may move (e.g. move from one position to another position in the first conference room), and may not keep speaking, Step 312 can prevent the visual effect for enhancing the speaker from being repeatedly turned on and off in the image (which would degrade the viewing experience of the participants in the second conference room), but the present invention is not limited thereto. For example, after a period of time, the SoC 200 can perform the relevant determination operations again, and more particularly, determine the relative position and the characteristic value (or the characteristic value of the region where each person is located) of each of the detected people.
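The tracking in Step 312 can be pictured as matching the stored characteristic value of the highlighted person against the characteristic values found in each new frame. The sketch below assumes cosine similarity as the comparison and a minimum similarity of 0.8; neither is stated in this description, and the function names and the short example vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two characteristic vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_tracked_region(target_feature, detections, min_similarity=0.8):
    """Return the region whose characteristic value best matches the target.

    `detections` is a list of (region, feature) pairs for the current frame.
    Returns None when no detection is similar enough, e.g. the person left the frame.
    """
    best_region, best_score = None, min_similarity
    for region, feature in detections:
        score = cosine_similarity(target_feature, feature)
        if score >= best_score:
            best_region, best_score = region, score
    return best_region


# Hypothetical 3-dimensional features (a real vector might have dimension 512).
highlighted = [1.0, 0.0, 0.0]
current_frame = [("left", [0.0, 1.0, 0.0]), ("right", [0.9, 0.1, 0.0])]
region_to_highlight = find_tracked_region(highlighted, current_frame)
```

Because the match depends on the characteristic value rather than on the previous position, the highlighted person can be followed after moving to another part of the image.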
In Step 314, according to the region of each person in the image (e.g. the image frame) determined by the person recognition circuit 210, the direction and the angle of the main speaker relative to the electronic device 110 detected by the sound direction detection circuit 230, and the detection result indicating that someone is speaking (i.e. the received sound signal has the voice component) generated by the voice activity detection circuit 220, the processing circuit 240 can correctly determine whether the speaker changes with the aid of the hand gesture detection performed by the hand gesture detection circuit 215. If the determination is negative (e.g. none of the other people are speaking and raising their hands with the predetermined hand gesture), the method returns to Step 312 to keep tracking the current speaker. If the determination is positive (e.g. another person is speaking and raising their hand with the predetermined hand gesture), the method returns to Step 308 to determine a new speaker. Specifically, the sound direction detection circuit 230 can only detect the direction of the sound and cannot know whether the sound in the determined direction is a human sound. As a result, under a condition that the voice activity detection circuit 220 detects that the current sound signal has the voice component, if the direction and the angle of the main speaker relative to the electronic device 110 detected by the sound direction detection circuit 230 change to the position of another person, the processing circuit 240 can determine that the speaker has changed. It should be noted that, in order to prevent the processing circuit 240 from constantly changing the highlighted person in the image data, Step 314 may be performed after a relatively long period of detection.
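The determination in Step 314 combines three inputs: whether a voice component is present, which person's region the main sound direction points to, and whether that person is making the predetermined hand gesture. A minimal sketch of this combination is given below. It assumes each region has already been mapped to a range of azimuth angles; the mapping, the region labels, and the function names are hypothetical.

```python
def region_at_azimuth(azimuth_deg, region_angles):
    """Return the region whose angular range contains the azimuth, or None."""
    for region, (low_deg, high_deg) in region_angles.items():
        if low_deg <= azimuth_deg <= high_deg:
            return region
    return None


def next_highlighted_region(voice_detected, azimuth_deg, region_angles,
                            current_region, gesture_regions):
    """Decide which region should be highlighted after one round of Step 314.

    The highlight moves only when someone is speaking, the main sound comes
    from another person's region, and that person shows the predetermined
    hand gesture. Otherwise the current region is kept.
    """
    if not voice_detected:
        return current_region
    candidate = region_at_azimuth(azimuth_deg, region_angles)
    if candidate is None or candidate == current_region:
        return current_region
    if candidate not in gesture_regions:
        return current_region
    return candidate


# Hypothetical angular ranges (degrees) for three detected people.
region_angles = {"region_a": (-40, -15), "region_b": (-10, 10), "region_c": (15, 40)}
```

For example, `next_highlighted_region(True, 25.0, region_angles, "region_b", {"region_c"})` returns `"region_c"`, whereas the same call with an empty set of gesture regions keeps `"region_b"`, so informal talk from another person does not move the highlight.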
Some implementation details for the gesture lock can be further described as follows. According to some embodiments, the hand gesture detection result can indicate that the predetermined hand gesture is detected. In addition to determining a specific region (e.g. the region 440 in
According to some embodiments, it is assumed that the person in the region 440 shown in
According to some embodiments, the hand gesture detection circuit 215 is not limited to performing the hand gesture detection with a single predetermined hand gesture, and more particularly, the predetermined hand gesture can be replaced by a predetermined hand gesture set, wherein the predetermined hand gesture set may include multiple predetermined hand gestures (e.g. the predetermined hand gestures shown in
According to some embodiments, a shape, type, direction, and/or finger count of the predetermined hand gestures in the predetermined hand gesture set may vary.
In another embodiment, in order to determine whether the speaker changes, the processing circuit 240 further includes a voiceprint recognition mechanism to support the detection result of the sound direction detection circuit 230. Specifically, since each person's voice has unique characteristics, the voiceprint recognition mechanism in the processing circuit 240 can continuously capture a part of sound clips to determine whether the voice characteristic values of these sound clips belong to the same person. For example, if the speaker is determined to be changed according to the person recognition circuit 210, the voice activity detection circuit 220, and the sound direction detection circuit 230, but the voiceprint recognition mechanism determines that the voice characteristic values of the sound clips belong to the same person, the processing circuit 240 can suspend determining whether the speaker has changed, and then make another determination after a period of time.
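The way a voiceprint check can overrule a direction-based determination may be sketched as follows. The sketch assumes each sound clip has already been reduced to a numeric voice characteristic vector and that Euclidean distance with a fixed threshold is an adequate comparison; these choices, the threshold value, and the function names are hypothetical.

```python
import math


def voiceprint_distance(a, b):
    """Euclidean distance between two voice characteristic vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def clips_share_one_voice(clip_features, max_distance=0.5):
    """True if every clip stays close to the first clip's voiceprint."""
    if len(clip_features) < 2:
        return True
    reference = clip_features[0]
    return all(voiceprint_distance(reference, feature) <= max_distance
               for feature in clip_features[1:])


def accept_speaker_change(direction_says_changed, clip_features, max_distance=0.5):
    """Accept a speaker change only when the voiceprint does not contradict it.

    If the direction-based result says the speaker changed but the recent clips
    still belong to one voice, the change is held back for a later re-check.
    """
    if not direction_says_changed:
        return False
    return not clips_share_one_voice(clip_features, max_distance)
```

For instance, `accept_speaker_change(True, [[0.0, 0.0], [0.1, 0.0]])` returns `False` because both clips are close to one voiceprint, while `accept_speaker_change(True, [[0.0, 0.0], [1.0, 1.0]])` returns `True`.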
In the above embodiments, the sound direction detection circuit 230 is regarded as the sound detection circuit, but the present invention is not limited thereto. In other embodiments, the sound detection circuit can be equipped with the voiceprint recognition mechanism (e.g. a voiceprint recognition sub-circuit), and more particularly, the sound detection circuit can include the sound direction detection circuit 230 and the voiceprint recognition sub-circuit, and utilize the sound direction detection circuit 230 with the aid of the voiceprint recognition mechanism (e.g. the voiceprint recognition sub-circuit) to determine the speaker and the highlighted person. For example, the sound detection circuit of the present invention can receive and obtain multiple sound signals from multiple microphones to determine a voice characteristic value of a main sound, and the voice characteristic value can be a voiceprint (e.g. the voiceprint of sound clips detected by the voiceprint recognition sub-circuit) or an azimuth of the main sound.
In summary, the video processing method of the present invention can effectively improve video conference efficiency by detecting a current speaker and highlighting the speaker in the image data, thereby enabling participants in a remote conference room to clearly identify the speaker. In addition, the video processing method and the SoC of the present invention can ensure correctness of related operations using hand gesture detection, and more particularly, use gesture lock to keep highlighting the specific region, thereby avoiding any false highlighted region switching caused by an action of another person when the predetermined hand gesture is not used (e.g. some other person is talking informally).
Those skilled in the art will readily observe that numerous modifications and alterations of the device and method may be made while retaining the teachings of the invention. Accordingly, the above disclosure should be construed as limited only by the metes and bounds of the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
111127891 | Jul 2022 | TW | national |