Implementations of the claimed invention generally may relate to the field of telecommunications and, more particularly, to conferencing systems.
Conferencing technology enables two or more people at geographically remote locations to have audio communication with each other. With the growth of multimedia and Internet applications, the use of conferencing may become even more popular in the future, not only in business but also in everyday life. However, as conferencing finds more usage, the required conference size is also likely to increase. Today it is not uncommon for a conference call to have ten or more users.
One conventional method of implementing a conferencing system is to sum the audio streams of all users and send the result to all users. However, as the number of users in a conference call increases, it may become impractical to sum all of the streams, since the result will typically overflow and the accumulation of noise in the sum may also cause quality problems.
In another method, a conferencing system automatically detects a few of the loudest audio streams in a conference, identified as the active talkers, and then arithmetically adds these streams to create a sum. This method may have limitations, including but not limited to degrading conferencing quality under certain situations. In particular, because only a small subset of conferees may be allowed to talk, this method may not be capable of capturing all the audio information and accurately reflecting the actual dynamics of a real life conference. Some users may be cut off inappropriately, since not everyone's voice can be captured when this method is used, or each user may try to speak ever louder in order to become an active talker.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate one or more implementations consistent with the principles of the invention and, together with the description, explain such implementations. The drawings are not necessarily to scale, the emphasis instead being placed upon illustrating the principles of the invention. In the drawings,
The following detailed description refers to the accompanying drawings. The same reference numbers may be used in different drawings to identify the same or similar elements. In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular structures, architectures, interfaces, techniques, etc. in order to provide a thorough understanding of the various aspects of the claimed invention. However, it will be apparent to those skilled in the art having the benefit of the present disclosure that the various aspects of the invention claimed may be practiced in other examples that depart from these specific details. In certain instances, descriptions of well known devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
Output gain element 106 controls the level of the conference audio to each user. This gain may be set to be inversely proportional to the energy of the input audio of the user. In other words, the louder a user talks, the lower the volume of the conference audio back to that user. This helps to limit the tendency to talk louder when the environment is noisier. When the user is not talking, the output gain may be set to unity. There is a selectable lower threshold to the output gain. The output gain setting may be implemented by a simple table lookup.
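The output gain table lookup described above may be sketched as follows. This is a minimal illustration only; the table entries, the floor value, and the names OUTPUT_GAIN_TABLE, MIN_OUTPUT_GAIN, and output_gain are assumptions made for the example and are not taken from the description.

```python
# Hypothetical lookup table: (upper bound of input energy, output gain).
# All numeric values are illustrative assumptions.
OUTPUT_GAIN_TABLE = [
    (0.01, 1.00),
    (0.05, 0.80),
    (0.20, 0.60),
    (1.00, 0.40),
]
MIN_OUTPUT_GAIN = 0.5  # selectable lower threshold (assumed value)


def output_gain(energy, talking, min_gain=MIN_OUTPUT_GAIN):
    """Return the gain applied to the conference audio sent back to one user."""
    if not talking:
        return 1.0  # unity gain when the user is not talking
    for upper_bound, gain in OUTPUT_GAIN_TABLE:
        if energy <= upper_bound:
            # Louder input selects a lower gain, but never below the floor.
            return max(gain, min_gain)
    # Energy above the last table entry: fall back to the floor.
    return min_gain
```

In this sketch a louder user receives quieter conference audio, and the floor keeps the conference audible even to a user who is shouting.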
Intelligent gain control may be used to lower all of the users' inputs via the input gain element by a proportional amount, so that no overflow occurs when the inputs are summed. For example, if users A and B are speaking, the inputs of A and B would be added. The inputs of other users are added as well, in case they interject something, but at a lower gain. Two classes of users may thus be established: active talkers (A and B) and less active users. Active talkers get most of the share, and the gain for each user is set according to that user's input level. In particular, the output gain level is inversely proportional to the level of the user's voice. If the user talks loudly, the user's level is adjusted down more.
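One possible way to lower all inputs by a proportional amount before summing is sketched below for 16-bit samples. The function names proportional_gains and mix, the use of per-user peak levels, and the 16-bit sample format are assumptions made for illustration.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def proportional_gains(peaks, limit=INT16_MAX):
    """Scale every input by the same factor so the summed peaks fit in `limit`."""
    total = sum(peaks)
    scale = 1.0 if total <= limit else limit / total
    return [scale] * len(peaks)


def mix(frames, gains):
    """Sum equal-length frames of samples, one frame and one gain per user."""
    mixed = []
    for samples in zip(*frames):
        total = sum(g * s for g, s in zip(gains, samples))
        # Clamp as a last safeguard against rounding at the limits.
        mixed.append(max(INT16_MIN, min(INT16_MAX, int(round(total)))))
    return mixed
```

For example, two users peaking at 30000 and 20000 would overflow if summed directly; with the common scale factor applied, the mixed peak stays within the 16-bit range.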
Voice activity detector 202 tracks the input audio and updates the adaptive threshold based on the average background noise level. Adaptive threshold 204 gates off unwanted background noise. This helps to eliminate noise input from users in noisy environments (for example, users on a cell phone connection). The initial consideration for determining the level of gain applied to each user is the signal level of the user's audio. If the level is very high, the gain is lower, and vice versa. The overall energy level of the active users whose audio passed their individual thresholds is used to determine the input gain levels in an inversely proportional manner. In other words, the higher the overall energy, the lower the input gains.
In particular, voice activity detector 202 detects voice and minimizes the amount of noise added into the conference. Audio data is received by voice activity detector 202 from an audio channel, and a signal containing the audio data is then output by voice activity detector 202. The energy of the audio signal has a waveform. The portion of the waveform which exceeds a noise floor is considered to be speech energy, whereas the portions of the waveform not exceeding the noise floor are considered to be only noise energy. In a typical implementation, voice activity detector 202 cuts off the input if there is no voice activity. For example, if there is no speech but only background noise, such as from a cell phone caller, voice activity detector 202 minimizes the likelihood of static background noise ruining the conference. If, however, there is some voice activity and the user does not have a history as an active talker, the user is still allowed in, although at a lower gain. The history of each conference user, in particular which users have been talking more actively, is also taken into consideration and may be used to determine the level of gain that is applied.
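The adaptive thresholding described above may be sketched as a simple energy-based detector. The class below is an assumed design for illustration only: the names VoiceActivityDetector, margin, and alpha, the default values, and the choice of exponential smoothing for the noise floor are not taken from the description.

```python
class VoiceActivityDetector:
    """Energy-based detector with a slowly adapting noise floor (assumed design)."""

    def __init__(self, initial_floor=1e-4, margin=3.0, alpha=0.05):
        self.noise_floor = initial_floor  # running estimate of background noise energy
        self.margin = margin              # adaptive threshold = margin * noise_floor
        self.alpha = alpha                # smoothing factor for the noise floor update

    @staticmethod
    def frame_energy(samples):
        """Mean squared amplitude of one frame; zero for an empty frame."""
        if not samples:
            return 0.0
        return sum(s * s for s in samples) / len(samples)

    def is_active(self, samples):
        """Return True if the frame energy exceeds the adaptive threshold."""
        energy = self.frame_energy(samples)
        active = energy > self.margin * self.noise_floor
        if not active:
            # Only frames judged to be noise move the noise floor estimate.
            self.noise_floor += self.alpha * (energy - self.noise_floor)
        return active
```

Updating the floor only on inactive frames keeps speech from raising the threshold, while steady background noise gradually pulls the threshold toward its own level.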
Voice activity detector 202 initially determines whether there is any active audio.
If voice activity detector 202 detects no active audio, then the output of energy estimator 206 is zero.
If voice activity detector 202 detects active audio, energy estimator 206 determines how much energy is in the signal. Energy estimator 206 also determines the length of time a user talks. That information is likewise used to determine the gain. The output from energy estimator 206 is applied to both the input and output gain determinations for that particular channel.
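A per-channel energy estimator along these lines might look like the following sketch. The class name EnergyEstimator, the exponential smoothing, and the use of a frame count as the measure of talk time are assumptions made for the example.

```python
class EnergyEstimator:
    """Tracks smoothed speech energy and talk time for one channel (assumed design)."""

    def __init__(self, smoothing=0.2):
        self.smoothing = smoothing
        self.energy = 0.0       # smoothed energy over active frames
        self.active_frames = 0  # talk time, counted in frames

    def update(self, samples, active):
        """Return the energy estimate for this frame, or zero if there is no voice."""
        if not active or not samples:
            # No active audio: the output is zero; the history is left unchanged.
            return 0.0
        frame = sum(s * s for s in samples) / len(samples)
        self.energy += self.smoothing * (frame - self.energy)
        self.active_frames += 1
        return self.energy
```

The accumulated active_frames value gives a simple history of how long each user has talked, which a later stage could use when ranking talkers.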
The output of energy estimator 206 is applied to output and input gain table lookups 210 and 212, which determine the output and input gain levels. The table lookups provide levels which are generally inversely proportional to the energy level input provided by energy estimator 206. For example, if the energy detected by energy estimator 206 is high, both the input and output gains are adjusted down. If the energy detected by energy estimator 206 is low, both the input and output gains are adjusted up.
In particular, the output of energy estimator 206 is applied to output gain table lookup 210, which provides an output gain signal to the user. A problem with conference calls arises when users converse at higher volume because they believe others cannot hear them. For example, if the audio feedback to the user is high, the user may speak even louder to be heard. The feedback in such a situation may be lowered so that it is not as noisy. This will increase the likelihood that the high volume user returns to conversing at normal volume. The output gain signal is related to how loudly the user talks. If a user talks very loudly, the feedback is lowered.
The output of energy estimator 206 is also applied to input gain table lookup 212. The input gains are set individually for each user according to a hierarchy. The main active talkers, namely the two to four inputs with the most active audio historically, as tracked by the voice activity detectors and energy estimators, are accorded proportionally higher gains. The remaining users are accorded lower gains, but nonetheless provide input to the conference. The input gain setting can also be implemented by table lookup 212.
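The talker hierarchy may be illustrated by ranking users on their accumulated talk time. In the sketch below, the function names main_talkers and hierarchy_gain, the gain values, and the use of active frame counts as the history measure are hypothetical choices made for the example.

```python
def main_talkers(active_frame_counts, max_main=3):
    """Return the users with the most accumulated talk time (at most max_main)."""
    ranked = sorted(active_frame_counts, key=active_frame_counts.get, reverse=True)
    # Users who have never talked are not promoted to main talkers.
    return {user for user in ranked[:max_main] if active_frame_counts[user] > 0}


def hierarchy_gain(user, main, main_gain=1.0, other_gain=0.4):
    """Main talkers get the higher gain; everyone else is still mixed in, lower."""
    return main_gain if user in main else other_gain
```

With talk-time counts of 120, 95, 3, and 0 frames for users A through D and max_main set to 2, users A and B become the main talkers, while C and D remain in the mix at the lower gain.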
The output of energy estimator 206 is also applied to summer 208, which sums the energy estimates of the users and generates a second signal for input gain table lookup 212. Both the output of energy estimator 206 and the output of summer 208 are applied to input gain table lookup 212 and used to determine the level of input gain. For example, if the input gains are too high, the signals may be clipped. Accordingly, if the sum of all the inputs is high, the gains may be lowered.
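An input gain determination that uses both the individual energy and the summed energy may be sketched as follows. The table entries, the limit TOTAL_ENERGY_LIMIT, and the function names are assumptions made for illustration; in particular, scaling by the ratio of the limit to the total is only one possible way to combine the two inputs.

```python
# Hypothetical table: (upper bound of the user's own energy, input gain).
INPUT_GAIN_TABLE = [(0.05, 1.0), (0.20, 0.8), (1.00, 0.6)]
TOTAL_ENERGY_LIMIT = 0.5  # assumed level above which the sum risks clipping


def lookup(table, energy):
    """Return the gain of the first entry whose bound covers `energy`."""
    for upper_bound, gain in table:
        if energy <= upper_bound:
            return gain
    return table[-1][1]  # above the table range: use the lowest gain


def input_gain(user_energy, total_energy):
    """Combine the per-user lookup with a reduction driven by the summed energy."""
    gain = lookup(INPUT_GAIN_TABLE, user_energy)
    if total_energy > TOTAL_ENERGY_LIMIT:
        # The higher the overall energy, the lower every input gain.
        gain *= TOTAL_ENERGY_LIMIT / total_energy
    return gain
```

Here the summed energy plays the role of the second signal: it leaves the gains untouched while the conference is quiet and lowers them all together as the total rises.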
This allows all relevant users to be heard in a more complex conferencing environment and, at the same time, removes low background noise with much greater accuracy. It allows a conferencing system to more truly capture the meeting dynamic but simultaneously maintain the best possible overall audio volume for the total conference. All parties in a small, medium or large conference call may be heard, while maintaining a proper overall signal level.
It is initially determined whether there is any active audio (act 302).
If there is no active audio, then the estimated energy is set to zero (act 304).
If active audio is detected, the amount of energy is estimated (act 306). This information is used to determine the gain.
Output gain is determined based upon the amount of detected energy (act 308). In one implementation, the table lookups provide levels which are generally inversely proportional to the energy level input. For example, if the energy detected is high, both the input and output gains are adjusted down. If the energy detected is low, both the input and output gains are adjusted up.
The detected energy for all the users is then determined (act 310).
Input gain is determined based upon the amount of detected energy from a single user and the amount of energy detected from all of the users (act 312). The input gains are also set individually for each user according to a hierarchy. The main active talkers, namely the two to four inputs with the most active audio historically, as tracked by the voice activity detectors and energy estimators, are accorded proportionally higher gains. The remaining users are accorded lower gains, but nonetheless provide input to the conference. The input gain setting can also be implemented by a table lookup.
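Acts 302 through 312 may be tied together for a single audio frame as in the following sketch. It is a simplified illustration under stated assumptions: a fixed threshold stands in for the adaptive threshold, the formula in inverse_gain stands in for the table lookups, the talker hierarchy is omitted, and all names and numeric values are hypothetical.

```python
def frame_energy(samples):
    """Mean squared amplitude of one frame; zero for an empty frame."""
    return sum(s * s for s in samples) / len(samples) if samples else 0.0


def inverse_gain(energy, floor):
    """Stand-in for a table lookup: gain falls as energy rises, never below floor."""
    return max(floor, 1.0 / (1.0 + energy))


def conference_gains(frames, threshold=0.001, limit=0.5):
    """Return (input_gains, output_gains) keyed by user for one frame per user."""
    energies = {}
    for user, samples in frames.items():
        energy = frame_energy(samples)
        # Acts 302, 304, 306: no active audio gives zero, otherwise the estimate.
        energies[user] = energy if energy > threshold else 0.0
    # Act 308: output gain from each user's own energy (unity when silent).
    output_gains = {u: inverse_gain(e, 0.5) for u, e in energies.items()}
    # Act 310: energy of all users combined.
    total = sum(energies.values())
    # Act 312: input gain from the user's own energy and the total energy.
    scale = 1.0 if total <= limit else limit / total
    input_gains = {u: inverse_gain(e, 0.25) * scale for u, e in energies.items()}
    return input_gains, output_gains
```

In this sketch a silent user keeps a unity output gain, and the louder of two talkers receives both the lower input gain and the lower output gain.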
The foregoing description of one or more implementations provides illustration and description, but is not intended to be exhaustive or to limit the scope of the invention to the precise form disclosed. Modifications and variations are possible in light of the above teachings or may be acquired from practice of various implementations of the invention.
Moreover, the acts in
No element, act, or instruction used in the description of the present application should be construed as critical or essential to the invention unless explicitly described as such. For example, the form of audio communication is not critical. In one embodiment, the audio channel may be an Integrated Services Digital Network (ISDN) link. In other embodiments, the audio channel may be a standard computer local area network (LAN), or a telephone connection. Also, as used herein, the article “a” is intended to include one or more items. Variations and modifications may be made to the above-described implementation(s) of the claimed invention without departing substantially from the spirit and principles of the invention. All such modifications and variations are intended to be included herein within the scope of this disclosure and protected by the following claims.