Efficient voice activity detector to detect fixed power signals

Information

  • Patent Application
  • 20080071531
  • Publication Number
    20080071531
  • Date Filed
    September 19, 2006
  • Date Published
    March 20, 2008
Abstract
The present invention is directed to a voice activity detector that uses the periodicity of amplitude peaks and valleys to identify signals of substantially fixed power or having periodicity.
Description

BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 depicts a voice communications architecture according to a first embodiment of the present invention;



FIG. 2 depicts the response of a noise floor power waveform to speech variations in the power of a received signal;



FIGS. 3A and 3B depict a periodic signal waveform and the response of a noise floor power waveform to the substantially constant power of the signal;



FIGS. 4A and 4B depict periodic signal waveforms to illustrate concepts of the present invention;



FIG. 5 is a set of data structures according to an embodiment of the present invention; and



FIG. 6 is a flow chart according to an embodiment of the present invention.





DETAILED DESCRIPTION

An architecture 100 according to a first embodiment is depicted in FIG. 1. The architecture 100 includes a voice communication device 104 and enterprise network 108 interconnected by a Wide Area Network or WAN 112. The enterprise network 108 includes a gateway 116 servicing a server 120, Local Area Network 124, and communication device 128.


The gateway 116 can be any suitable device for controlling ingress to and egress from the corresponding LAN. The gateway is positioned logically between the other components in the corresponding enterprise network 108 and the network 112 to process communications passing between the server 120 and internal communication device 128 on the one hand and the network 112 on the other. The gateway 116 typically includes an electronic repeater functionality that intercepts and steers electrical signals from the network 112 to the corresponding LAN 124 and vice versa and provides code and protocol conversion. When processing voice communications, the gateway 116 further performs a number of VoIP functions, particularly silence suppression and jitter buffer processing. The gateway 116 therefore includes a Voice Activity Detector 132 to perform VAD and SAD and a comfort noise generator (not shown) to generate comfort noise during periods of silence. Comfort noise is synthetic background noise that keeps the listener from mistaking the absolute silence produced by silence suppression for a disconnected communication channel. Examples of suitable gateways include modified versions of Avaya Inc.'s G700, G650, G350, Crossfire, and MCC/SCC media gateways and Acme Packet's Net-Net 4000 Session Border Controller.


The server 120 processes call control signaling, such as incoming Voice Over IP or VoIP and telephone call set up and tear down messages. The term “server”, as used herein, should be understood to include an ACD, a Private Branch Exchange or PBX (or Private Automatic Exchange or PAX), an enterprise switch, an enterprise server, or other type of telecommunications system switch or server, as well as other types of processor-based communication control devices such as media servers, computers, adjuncts, etc. Illustratively, the server of FIG. 1 can be Avaya Inc.'s Definity™ Private-Branch Exchange (PBX)-based ACD system or MultiVantage PBX running modified Advocate™ software, CRM Central 2000 Server™, Communication Manager, S8300™ media server, SIP Enabled Services™, and/or Avaya Interaction Center™.


The external and internal communication devices 104 and 128 are preferably packet-switched stations or communication devices, such as IP hardphones (e.g., Avaya Inc.'s 4600 Series IP Phones™), IP softphones (e.g., Avaya Inc.'s IP Softphone™), Personal Digital Assistants or PDAs, Personal Computers or PCs, laptops, packet-based H.320 video phones and conferencing units, packet-based voice messaging and response units, peer-to-peer based communication devices, and packet-based traditional computer telephony adjuncts. Examples of suitable devices are the 4610™, 4621SW™, and 9620™ IP telephones of Avaya, Inc.


The voice activity detector 132, as can be seen from FIG. 1, can be located in a number of components depending on the architecture.


The detector 132 exploits the periodicity of a fixed signal by detecting peaks and troughs (i.e. turning points). In addition to time-based periodicity, the detector 132 uses amplitude-based periodicity. It relies on the detection of regular patterns within the signal. The detector 132 can be efficient, as it does not require significant signal processing resources to detect a fixed power signal.


A buffer 136 of n audio samples is stored. The number of samples is typically the same as the number of audio samples contained in a packet (or frame) to be transmitted to the destination communication device. The value of n is frequently 80, as this represents 10 milliseconds of voice sampled at 8 kHz. The detector 132 iterates over this buffer 136, one sample at a time, and records selected characteristics of the sampled portion of the signal. In particular, the high and low points of the signal (e.g., peaks and troughs) are recorded. This information, when combined with the previous history of the recorded signal features, provides a condensed history of the signal's pattern.
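The relationship between frame size and duration can be checked with a short sketch. The function names `frame_duration_ms` and `split_into_frames` are illustrative assumptions and do not appear in the disclosure; dropping a trailing partial frame is likewise an assumption made for simplicity.

```python
SAMPLE_RATE_HZ = 8000   # sampling rate mentioned in the text
FRAME_SAMPLES = 80      # n, the number of samples per frame


def frame_duration_ms(n_samples, sample_rate_hz):
    """Duration in milliseconds covered by n_samples at the given rate."""
    return 1000.0 * n_samples / sample_rate_hz


def split_into_frames(samples, n=FRAME_SAMPLES):
    """Yield consecutive full frames of n samples.

    Assumption: a trailing partial frame is dropped.
    """
    for start in range(0, len(samples) - n + 1, n):
        yield samples[start:start + n]


if __name__ == "__main__":
    print(frame_duration_ms(FRAME_SAMPLES, SAMPLE_RATE_HZ))
    print([len(f) for f in split_into_frames(list(range(200)))])
```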


Following this, a post-processing step searches the gathered information for a pattern (or template). This is typically done by searching for repetitions. For example, with a dual frequency signal, the detector 132 searches for a signal pattern having two distinct peaks and two distinct troughs and, for a single frequency signal, for a signal pattern having only one peak and only one trough. When the values do not fit the selected pattern, the sampled signal is deemed to be a more random signal and is rejected by the algorithm. Account can be taken of the noise floor waveform and any possible interference by establishing a range within which two values are considered to be similar. This allows the algorithm to execute in the presence of background noise.
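As a rough illustration of the similarity range, the following sketch groups peak and trough amplitudes into levels and counts how many distinct levels remain. It is only one possible reading of the paragraph above: the names `distinct_levels` and `classify_pattern`, the greedy grouping rule, and the `tolerance` values are assumptions, and the peak and trough amplitudes are assumed to have been separated already.

```python
def distinct_levels(values, tolerance):
    """Count amplitude levels, treating values within `tolerance` of a
    group's lowest member as the same level (greedy grouping, an assumption)."""
    levels = []
    for v in sorted(values):
        if not levels or v - levels[-1] > tolerance:
            levels.append(v)  # v starts a new level
    return len(levels)


def classify_pattern(peaks, troughs, tolerance):
    """Label a segment by its number of distinct peak and trough levels."""
    if not peaks or not troughs:
        return "no pattern"
    p = distinct_levels(peaks, tolerance)
    t = distinct_levels(troughs, tolerance)
    if p == 1 and t == 1:
        return "single frequency"
    if p == 2 and t == 2:
        return "dual frequency"
    return "no pattern"


if __name__ == "__main__":
    print(classify_pattern([11000, 10700], [-10500, -11500], 1500))
    print(classify_pattern([9000, 4000, 9100, 3900],
                           [-9000, -4000, -8900, -4100], 500))
    print(classify_pattern([1000, 5000, 9000], [-1000, -5000, -9000], 500))
```

A wider tolerance makes the check more forgiving of background noise, at the cost of accepting less regular signals.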


An example of the recorded data structures generated during processing of the samples in the buffer 136 is shown in FIG. 5. As can be seen from FIG. 5, each audio sample has a corresponding sample identifier 500, which for simplicity's sake is shown as being consecutively numbered. Each sample is analyzed for whether it is, relative to the prior sample, trending upward (positive) or downward (negative) in amplitude. When the trend 504 changes between adjacent samples, a turning point (that is, a peak or valley) is identified. With reference to FIG. 5, turning points are identified in one of or between samples 2 and 3 (a peak), 7 and 8 (a valley), 12 and 13 (a peak), and 17 and 18 (a valley). Each instance of a turning point is marked by a suitable indicator 508 (e.g., “Y” meaning that a turning point exists and “N” meaning that a turning point does not exist). The temporal distance to the prior turning point 512 is tracked by counting the number of samples to the prior instance of a turning point; a sample count serves as a measure of time because each sample is associated with a fixed time period (e.g., 10 milliseconds). For example, the temporal distance associated with the turning point at sample 3 is 0 (because there is no sample data prior to sample 1 and therefore no prior turning point), at sample 8 is 5 (or 50 milliseconds), at sample 13 is 5 (or 50 milliseconds), and at sample 18 is 5 (or 50 milliseconds). Finally, the amplitude 516 of each turning point is recorded. For example, the amplitude of the turning point at sample 3 is +11,000 units, at sample 8 is −10,500 units, at sample 13 is +10,700 units, and at sample 18 is −11,500 units. As will be appreciated, the amplitude falls within a signed 16-bit range (i.e., +32,767 to −32,768). As will be further appreciated, to save memory space the data structures may be abbreviated to include only those samples associated with a turning point (e.g., to include only samples 3, 8, 13, and 18).
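The abbreviated form of the FIG. 5 data, holding only the turning-point rows, might be produced along the following lines. This is a minimal sketch rather than the disclosed implementation: the `TurningPoint` record and `find_turning_points` function are hypothetical names, flat runs (equal adjacent samples) are assumed to leave the trend unchanged, and the recorded amplitude is assumed to be that of the extremum just before the sample at which the trend change is observed. The input values are invented so that the output reproduces the figures quoted in the text.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class TurningPoint:
    sample_id: int   # 1-based identifier of the sample where the trend changed
    distance: int    # samples since the prior turning point (0 for the first)
    amplitude: int   # amplitude of the extremum (assumed: the preceding sample)


def find_turning_points(samples: Sequence[int]) -> List[TurningPoint]:
    points: List[TurningPoint] = []
    prev_trend = 0   # +1 rising, -1 falling, 0 not yet known
    last_id = None   # sample_id of the most recent turning point
    for i in range(1, len(samples)):
        diff = samples[i] - samples[i - 1]
        trend = (diff > 0) - (diff < 0)
        if trend == 0:
            continue  # assumption: a flat step keeps the current trend
        if prev_trend != 0 and trend != prev_trend:
            sample_id = i + 1
            distance = 0 if last_id is None else sample_id - last_id
            points.append(TurningPoint(sample_id, distance, samples[i - 1]))
            last_id = sample_id
        prev_trend = trend
    return points


if __name__ == "__main__":
    # Invented 20-sample segment with turning points at samples 3, 8, 13, 18.
    segment = [9000, 11000, 7000, 3000, -1000, -5000, -10500, -6000, -2000,
               2000, 6000, 10700, 6000, 2000, -2000, -6000, -11500, -7000,
               -3000, 1000]
    for tp in find_turning_points(segment):
        print(tp)
```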


The resulting recorded data is then examined for the occurrence of a fixed pattern within the signal itself based on the periodicity of turning points and the amplitude of those points. The fixed pattern within the signal may be identified by comparing the data to one or more templates typical of different types of progress tones, such as intercept tones, ringback tones, busy tones, dial tones, reorder tones, and the like, to determine whether the analyzed sampled signal segment is a fixed signal. As noted, the pattern searched for in a dual frequency signal has first and second sets of distinct peaks and first and second sets of distinct troughs arranged in alternating fashion. The pattern searched for in a single frequency signal has a set of peaks and a set of troughs arranged in alternating fashion. Most progress tones are single frequency signals. The pattern is defined using not only the temporal periodicity of the turning points but also the signal amplitude at the turning points. A probability may be used to determine how well the segment fits the pattern. Segments with probabilities below a specified threshold are not deemed to be fixed signals, while segments with probabilities at or above the specified threshold are deemed to be fixed signals. As can be seen from the data structures in FIG. 5, the sampled signal segment would be deemed to be a fixed signal.


As will be appreciated, any suitable pattern matching algorithm may be used to post process. Such algorithms generally check for the presence of the constituents of a given pattern.


An example of a relatively simple algorithm is to construct first and second arrays describing a sampled audio signal segment. The first array comprises the number of instances of selected temporal distances between turning points. For example, the array would contain a number of instances for each of the selected temporal distances of 1, 2, 3, 4, . . . . The second array comprises the number of instances of a number of selected amplitude ranges at turning points. For example, the array would contain a number of instances for each of the amplitude ranges A-B, B-C, C-D, . . . , where A, B, C, D, . . . are amplitude values. The resulting instances in each array column could then be compared to specified templates for temporal and amplitude periodicity to determine if the signal segment is likely a fixed signal segment. The templates may be, for example, a maximum permissible distribution of the instances among differing array columns. If the instances are too widely distributed, the comparison would indicate that the signal segment is variable while a tighter distribution indicates that the signal segment is fixed. The template match probabilities from the comparisons to the first and second arrays can then be weighted to arrive at a combined probability that the signal segment is characteristic of a fixed or variable signal.
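The two-array approach can be sketched as a pair of histograms whose concentration is scored and then combined. Everything specific here is an assumption made for illustration: the function names, the amplitude bin width of 2,000 units, the "share of instances in the most populated columns" measure standing in for a template, the weights, and the 0.8 threshold. The amplitude weight is set higher than the temporal weight only to mirror the weighting preference stated elsewhere in the description.

```python
from collections import Counter


def histogram(values, bin_width=1):
    """Count instances per column; floor division assigns values to bins."""
    return Counter(v // bin_width for v in values)


def concentration(hist, max_bins):
    """Share of instances falling in the `max_bins` most populated columns."""
    total = sum(hist.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in hist.most_common(max_bins))
    return top / total


def fixed_signal_score(distances, amplitudes, amp_bin_width=2000,
                       w_time=0.4, w_amp=0.6):
    """Weighted combination of temporal and amplitude concentration.

    Assumed template for a single frequency tone: one spacing column,
    and two amplitude columns (one for peaks, one for troughs).
    """
    time_score = concentration(histogram(distances), max_bins=1)
    amp_score = concentration(histogram(amplitudes, amp_bin_width), max_bins=2)
    return w_time * time_score + w_amp * amp_score


def is_fixed_signal(distances, amplitudes, threshold=0.8):
    return fixed_signal_score(distances, amplitudes) >= threshold


if __name__ == "__main__":
    # Values quoted for the turning points of FIG. 5 (first distance omitted).
    print(is_fixed_signal([5, 5, 5], [11000, -10500, 10700, -11500]))
    # Invented irregular values for contrast.
    print(is_fixed_signal([1, 4, 2, 7, 3], [500, -9000, 3000, -200, 12000]))
```

One limitation of fixed bins is that two nearly equal amplitudes can straddle a bin boundary and be counted in different columns; overlapping or adaptive ranges would avoid that at the cost of extra bookkeeping.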


This analytical approach is further shown in FIGS. 4A and 4B. FIGS. 4A and 4B show fixed or constant signals, such as a tone, and, for comparison's sake, the allowable range based on the noise floor waveform. Various sample points are further shown in each signal segment. The dashed lines in FIG. 4B show the periodic signal pattern. As can be seen from FIGS. 4A and 4B, the sample points would display behavior similar to that of FIG. 5. As can be seen by the dashed lines, the pattern of the signal of FIG. 4B is repeated in the next signal segment, though the amplitudes of the turning points might have shifted slightly. The algorithm of the present invention can be written in a way that is capable of detecting patterns in the presence of minor waveform imperfections. In other words, the pattern does not have to match exactly. This can be particularly important as signals can become distorted by background noise. The imperfections are taken into account, at least in part, because substantial similarity or dissimilarity in signal amplitude between the template and the analyzed sampled signal segment is normally weighted more heavily than substantial similarity or dissimilarity in temporal spacing between turning points.


The operation of the detector 132 will now be described with reference to FIG. 6.


In step 600, a frame comprising n audio signal samples is received. The samples in the frame are generated when the received analog audio signal is converted to digital form. The following steps are performed sample-by-sample and frame-by-frame. As noted, a packet will commonly contain one frame of 80 samples.


In step 604, a next sample is selected for analysis.


In step 608, the trend indicated by the selected sample is determined. As noted, the trend is typically determined by comparing the amplitude of the selected sample with the amplitude of the prior sample. If the amplitude is increasing, the trend is positive, and, if the amplitude is decreasing, the trend is negative.


In decision diamond 612, it is determined whether the sample includes a turning point. When a trend changes from positive in the prior sample to negative in the selected sample or from negative in the prior sample to positive in the selected sample, the selected sample is deemed to include a turning point.


When the selected sample includes a turning point, the temporal distance to the prior turning point is determined in step 616. This is done by counting the number of samples between the selected sample and the most recent (prior) sample containing a turning point.


In step 620, the sample identifier, a turning point indicator, a temporal distance from the turning point in the selected sample to the prior turning point, and an amplitude of the current turning point are saved.


When the selected sample does not include a turning point or after step 620, it is determined, in decision diamond 624, whether there is a next sample. If so, the detector returns to step 604. If not, the detector, in decision diamond 628, determines whether the recorded data defines a pattern. When the recorded data likely defines a pattern, the detector, in step 632, concludes that the audio samples in the selected packet are not silence and overrides any contrary determination made by another technique, such as by using the noise floor waveform. When the recorded data likely does not define a pattern, the detector, in step 636, concludes that the audio samples in the selected packet are not a fixed signal. Therefore, no change is made to the result determined by another technique.
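The control flow of FIG. 6 might be arranged as below, with state carried from one frame to the next. This is a hedged sketch, not the disclosed implementation: the `FrameDetector` class, its attribute names, and the `looks_periodic` callback are hypothetical, the pattern test is deliberately left to the caller, and `other_says_active` stands for the verdict of whatever other technique (for example, a noise floor comparison) is in use. Step numbers in the comments refer to the description above.

```python
class FrameDetector:
    """Frame-by-frame turning-point recorder with an override decision."""

    def __init__(self, looks_periodic):
        self.looks_periodic = looks_periodic  # caller-supplied pattern test
        self.prev_sample = None
        self.prev_trend = 0      # +1 rising, -1 falling, 0 unknown
        self.since_last = None   # samples since the prior turning point

    def process_frame(self, frame, other_says_active):
        records = []  # (index in frame, temporal distance, amplitude)
        for idx, sample in enumerate(frame):               # steps 604 and 624
            if self.prev_sample is not None:
                if self.since_last is not None:
                    self.since_last += 1
                diff = sample - self.prev_sample
                trend = (diff > 0) - (diff < 0)            # step 608
                if trend != 0:
                    if self.prev_trend != 0 and trend != self.prev_trend:  # 612
                        distance = 0 if self.since_last is None else self.since_last
                        records.append((idx, distance, self.prev_sample))  # 616, 620
                        self.since_last = 0
                    self.prev_trend = trend
            self.prev_sample = sample
        if self.looks_periodic(records):                   # decision 628
            return True                                    # step 632: not silence
        return other_says_active                           # step 636: unchanged


def equal_spacing(records):
    """Placeholder pattern test (an assumption): at least three turning
    points, all separated by the same number of samples."""
    return len(records) >= 3 and len({d for _, d, _ in records[1:]}) == 1


if __name__ == "__main__":
    tone = [9000, 11000, 7000, 3000, -1000, -5000, -10500, -6000, -2000,
            2000, 6000, 10700, 6000, 2000, -2000, -6000, -11500, -7000,
            -3000, 1000]
    print(FrameDetector(equal_spacing).process_frame(tone, other_says_active=False))
    print(FrameDetector(equal_spacing).process_frame(list(range(20)), False))
```

A frame for which `process_frame` returns a true value would then be packetized and transmitted, while any other frame would be treated as silence.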


Depending on the contents of the frame, it is either discarded as silence or packetized and transmitted to the destination endpoint as an active signal.


A number of variations and modifications of the invention can be used. It would be possible to provide for some features of the invention without providing others.


For example in one alternative embodiment, the present invention is used for non-VoIP applications, such as speech coding and automatic speech recognition.


In yet another embodiment, dedicated hardware implementations including, but not limited to, Application Specific Integrated Circuits or ASICs, programmable logic arrays, and other hardware devices can likewise be constructed to implement the methods described herein. Furthermore, alternative software implementations including, but not limited to, distributed processing or component/object distributed processing, parallel processing, or virtual machine processing can also be constructed to implement the methods described herein.


It should also be stated that the software implementations of the present invention are optionally stored on a tangible storage medium, such as a magnetic medium like a disk or tape, a magneto-optical or optical medium like a disk, or a solid state medium like a memory card or other package that houses one or more read-only (non-volatile) memories. A digital file attachment to e-mail or other self-contained information archive or set of archives is considered a distribution medium equivalent to a tangible storage medium. Accordingly, the invention is considered to include a tangible storage medium or distribution medium and prior art-recognized equivalents and successor media, in which the software implementations of the present invention are stored.


Although the present invention describes components and functions implemented in the embodiments with reference to particular standards and protocols, the invention is not limited to such standards and protocols. Other similar standards and protocols not mentioned herein are in existence and are considered to be included in the present invention. Moreover, the standards and protocols mentioned herein and other similar standards and protocols not mentioned herein are periodically superseded by faster or more effective equivalents having essentially the same functions. Such replacement standards and protocols having the same functions are considered equivalents included in the present invention.


The present invention, in various embodiments, includes components, methods, processes, systems and/or apparatus substantially as depicted and described herein, including various embodiments, subcombinations, and subsets thereof. Those of skill in the art will understand how to make and use the present invention after understanding the present disclosure. The present invention, in various embodiments, includes providing devices and processes in the absence of items not depicted and/or described herein or in various embodiments hereof, including in the absence of such items as may have been used in previous devices or processes, e.g., for improving performance, achieving ease and/or reducing cost of implementation.


The foregoing discussion of the invention has been presented for purposes of illustration and description. The foregoing is not intended to limit the invention to the form or forms disclosed herein. In the foregoing Detailed Description for example, various features of the invention are grouped together in one or more embodiments for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the following claims are hereby incorporated into this Detailed Description, with each claim standing on its own as a separate preferred embodiment of the invention.


Moreover, though the description of the invention has included description of one or more embodiments and certain variations and modifications, other variations and modifications are within the scope of the invention, e.g., as may be within the skill and knowledge of those in the art, after understanding the present disclosure. It is intended to obtain rights which include alternative embodiments to the extent permitted, including alternate, interchangeable and/or equivalent structures, functions, ranges or steps to those claimed, whether or not such alternate, interchangeable and/or equivalent structures, functions, ranges or steps are disclosed herein, and without intending to publicly dedicate any patentable subject matter.

Claims
  • 1. A method, comprising: (a) receiving a plurality of audio samples, the audio samples defining a sampled signal segment; (b) identifying turning points in a signal amplitude waveform defined by the audio samples; (c) determining whether the identified turning points are representative of a signal of a substantially fixed power level; and (d) when the identified turning points are representative of a signal of a substantially fixed power level, deeming the sampled signal segment to comprise an active signal.
  • 2. The method of claim 1, wherein the sampled signal segment is received as part of a live voice call between first and second parties, wherein the turning points correspond to peaks and valleys in the signal amplitude waveform, and wherein, when the identified turning points are representative of a signal of a substantially fixed power level, the sampled signal segment is deemed to include a periodic pattern.
  • 3. The method of claim 2, wherein silence suppression is in effect and wherein, when the sampled signal segment comprises an active signal, transmitting the plurality of audio samples to a destination node and wherein, when the sampled signal segment does not comprise an active signal and when the segment does not comprise voice energy of the first and/or second parties, not transmitting the plurality of audio samples to the destination node.
  • 4. The method of claim 1, wherein the method is used for determining jitter buffer adjustment points and further comprising: (e) identifying temporal distances between adjacent, identified turning points in the signal amplitude waveform; (f) determining whether the temporal distances between adjacent, identified turning points are representative of a signal of a substantially fixed power level; and (g) when the temporal distances are representative of a signal of a substantially fixed power level and when the identified turning points are representative of a signal of a substantially fixed power level, deeming the sampled signal segment to comprise an active signal.
  • 5. The method of claim 4, wherein, in determining whether the sampled signal segment comprises an active signal, the results of step (c) are weighted more heavily than the results of step (f).
  • 6. The method of claim 1, wherein the turning points are not zero crossings and wherein, when the identified turning points are representative of a signal of a substantially fixed power level, the sampled signal segment is deemed to include a progress tone.
  • 7. A computer readable medium comprising processor executable instructions to perform the steps of claim 1.
  • 8. A method comprising: (a) during a voice conversation, receiving an analog audio signal; (b) converting the analog audio signal into a digital representation thereof, the digital representation comprising a plurality of speech frames, each speech frame comprising a plurality of audio samples, each audio sample comprising a signal amplitude and having a fixed temporal duration; (c) identifying signal amplitude turning points in the audio samples; (d) determining whether the identified turning points are representative of a periodic signal; and (e) when the identified turning points are representative of a periodic signal, transmitting the selected speech frame to a destination endpoint.
  • 9. The method of claim 8, wherein, when the identified turning points are representative of a periodic signal, not allowing the jitter buffer to adjust and wherein, when the identified turning points are not representative of a periodic signal, wherein, when the selected frame does not comprise voiced speech, not transmitting the selected speech frame to the destination endpoint and the jitter buffer is not allowed to adjust.
  • 10. The method of claim 8, wherein the periodic signal has a substantially fixed power level and further comprising: (f) identifying temporal distances between adjacent, identified turning points; and (g) determining whether the temporal distances between adjacent, identified turning points are representative of a periodic signal; and wherein, in step (d), when the temporal distances are representative of a periodic signal and, when the identified turning points are representative of a signal of a periodic signal, the selected frame is deemed to include a progress tone.
  • 11. The method of claim 8, wherein the turning points are not zero crossings and wherein, when the identified turning points are representative of a periodic signal, the sampled signal segment is deemed to include a progress tone.
  • 12. A computer readable medium comprising processor executable instructions to perform the steps of claim 8.
  • 13. A device, comprising: a voice activity detector operable to: (a) receive a plurality of audio samples, the audio samples defining a sampled signal segment; (b) identify turning points in a signal amplitude waveform defined by the audio samples; (c) determine whether the identified turning points are representative of a signal of a substantially fixed power level; and (d) when the identified turning points are representative of a signal of a substantially fixed power level, deem the sampled signal segment to comprise an active signal.
  • 14. The device of claim 13, wherein the sampled signal segment is received as part of a live voice call between first and second parties, wherein the turning points correspond to peaks and valleys in the signal amplitude waveform, and wherein, when the identified turning points are representative of a signal of a substantially fixed power level, the jitter buffer is not allowed to adjust.
  • 15. The device of claim 14, wherein silence suppression is in effect and wherein, when the sampled signal segment comprises an active signal, transmitting the plurality of audio samples to a destination node but not allowing the jitter buffer to adjust and wherein, when the sampled signal segment does not comprise an active signal and when the segment does not comprise voice energy of the first and/or second parties, not transmitting the plurality of audio samples to the destination node but allowing the jitter buffer to adjust.
  • 16. The device of claim 13, wherein the detector is further operable to: (e) identify temporal distances between adjacent, identified turning points in the signal amplitude waveform; (f) determine whether the temporal distances between adjacent, identified turning points are representative of a signal of a substantially fixed power level; and (g) when the temporal distances are representative of a signal of a substantially fixed power level and when the identified turning points are representative of a signal of a substantially fixed power level, deem the sampled signal segment to comprise an active signal.
  • 17. The device of claim 16, wherein, in determining whether the sampled signal segment comprises an active signal, the results of step (c) are weighted more heavily than the results of step (f).
  • 18. The device of claim 13, wherein the turning points are not zero crossings and wherein, when the identified turning points are representative of a signal of a substantially fixed power level, the sampled signal segment is deemed to include a progress tone.
  • 19. The device of claim 13, wherein the device is a gateway.
  • 20. The device of claim 13, wherein the device is a packet-switched voice communication device.