The present invention relates to an audio signal processing method and an audio signal processing device which are capable of encoding or decoding an audio signal.
Generally, linear predictive coding (LPC) is performed on an audio signal having strong speech characteristics. The linear predictive coefficients generated by LPC are transmitted to a decoder, and the decoder reconstructs the audio signal through linear predictive synthesis using those coefficients.
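For background only, the following is a minimal sketch of how such coefficients might be derived from one frame using the autocorrelation method and the Levinson-Durbin recursion; the frame length, order, and guard value are illustrative assumptions, not part of the disclosure.

```python
import numpy as np

def lpc_coefficients(frame, order=10):
    """Derive LPC coefficients via autocorrelation + Levinson-Durbin."""
    n = len(frame)
    # Autocorrelation of the frame for lags 0..order.
    r = np.array([np.dot(frame[: n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0] + 1e-12  # guard against an all-zero frame
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err  # analysis filter A(z) coefficients, residual energy
```

On the decoding side, passing the excitation through the synthesis filter 1/A(z) built from the transmitted coefficients reconstructs the signal, which corresponds to the linear predictive synthesis mentioned above.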
Generally, an audio signal comprises components of various frequencies. For example, the human audible frequency range spans 20 Hz to 20 kHz, whereas human speech mostly occupies 200 Hz to 3 kHz. An input audio signal may include not only the band of human speech but also high frequency components above 7 kHz, which the human voice rarely reaches. As such, if a coding scheme suitable for narrowband (about 4 kHz or below) is used for wideband (about 8 kHz or below) or super wideband (about 16 kHz or below), speech quality may deteriorate.
An object of the present invention can be achieved by providing an audio signal processing method and device for applying coding modes in such a manner that the coding modes are switched for respective frames according to network conditions (and audio signal characteristics).
Another object of the present invention is to provide an audio signal processing method and an audio signal processing device for switching coding schemes according to bandwidths for respective frames, by switching coding modes for respective frames, so that coding schemes appropriate to the respective bandwidths are applied.
Another object of the present invention is to provide an audio signal processing method and an audio signal processing device for, in addition to switching coding schemes according to bandwidths for respective frames, applying various bitrates for respective frames.
Another object of the present invention is to provide an audio signal processing method and an audio signal processing device for generating respective-type silence frames based on bandwidths and transmitting the same when a current frame corresponds to a speech inactivity section.
Another object of the present invention is to provide an audio signal processing method and an audio signal processing device for generating a unified silence frame and transmitting the same regardless of bandwidths when a current frame corresponds to a speech inactivity section.
Another object of the present invention is to provide an audio signal processing method and an audio signal processing device for smoothing a current frame based on the bandwidth of a previous frame when the bandwidth of the current frame differs from that of the previous frame.
The present invention provides the following effects and advantages.
Firstly, by switching coding modes for respective frames according to feedback information from a network, coding schemes may be adaptively switched according to conditions of the network (and a receiver's terminal), so that encoding suitable for a communication environment may be performed and a transmitting side may transmit at relatively low bitrates.
Secondly, by switching coding modes for respective frames taking account of audio signal characteristics in addition to network information, bandwidths or bit rates may be adaptively changed to the extent that network conditions allow.
Thirdly, since, in a speech activity section, switching is performed by selecting among bandwidths at or below the allowable bitrates based on network information, an audio signal of good quality may be provided to a receiving side.
Fourthly, when bandwidths having the same or different bitrates are switched in a speech activity section, discontinuity due to bandwidth change may be prevented by performing smoothing based on bandwidths of previous frames at a transmitting side.
Fifthly, in a speech inactivity section, a type of a silence frame for a current frame is determined depending on the bandwidth(s) of previous frame(s), so that distortions due to bandwidth switching may be prevented.
Sixthly, in a speech inactivity section, by applying a unified silence frame regardless of previous or current frames, the power required for control, resources, and the number of modes at the time of transmission may be reduced, and distortions due to bandwidth switching may be prevented.
Seventhly, if a bandwidth is changed in a transition from a speech activity section to a speech inactivity section, by performing smoothing on a bandwidth of a current frame based on previous frames at a receiving end, discontinuity due to bandwidth change may be prevented.
In order to achieve such objectives, an audio signal processing method according to the present invention includes receiving an audio signal, receiving network information indicative of a coding mode and determining the coding mode corresponding to a current frame, encoding the current frame of the audio signal according to the coding mode, and transmitting the encoded current frame. The coding mode is determined based on a combination of bandwidths and bitrates, and the bandwidths comprise at least two of narrowband, wideband, and super wideband.
According to the present invention, the bitrates may include two or more predetermined support bitrates for each of the bandwidths.
According to the present invention, the super wideband is a band that covers the wideband and the narrowband, and the wideband is a band that covers the narrowband.
According to the present invention, the method may further include determining whether or not the current frame is a speech activity section by analyzing the audio signal, in which the determining and the encoding may be performed if the current frame is the speech activity section.
According to another aspect of the present invention, provided herein is an audio signal processing method comprising receiving an audio signal, receiving network information indicative of a maximum allowable coding mode, determining a coding mode corresponding to a current frame based on the network information and the audio signal, encoding the current frame of the audio signal according to the coding mode, and transmitting the encoded current frame. The coding mode is determined based on a combination of bandwidths and bitrates, and the bandwidths comprise at least two of narrowband, wideband, and super wideband.
According to the present invention, the determining a coding mode may include determining one or more candidate coding modes based on the network information, and determining one of the candidate coding modes as the coding mode based on characteristics of the audio signal.
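Purely as an illustration of this two-stage determination, the sketch below first bounds the candidate coding modes by the network information and then picks one from the audio signal's characteristics. The mode table, the bitrates, and the reduction of signal characteristics to a preferred bandwidth are hypothetical assumptions, not values from the disclosure.

```python
# Hypothetical mode table: index -> (bandwidth, bitrate in kbps), ordered so
# that higher indices demand more network capacity.
MODES = [("NB", 6.6), ("NB", 8.85), ("WB", 8.85), ("WB", 12.65),
         ("SWB", 12.65), ("SWB", 24.0)]

def determine_coding_mode(max_allowed_index, preferred_bw):
    # Step 1: network information bounds the candidate coding modes.
    candidates = MODES[: max_allowed_index + 1]
    # Step 2: audio signal characteristics (reduced here to a preferred
    # bandwidth) select one of the candidates.
    suitable = [m for m in candidates if m[0] == preferred_bw]
    return (suitable or candidates)[-1]  # fall back to the best allowed mode
```

For example, determine_coding_mode(3, "WB") yields ("WB", 12.65): the super wideband modes are excluded by the network limit, and wideband content selects the highest remaining wideband mode.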
According to another aspect of the present invention, provided herein is an audio signal processing device comprising a mode determination unit for receiving network information indicative of a coding mode and determining the coding mode corresponding to a current frame, and an audio encoding unit for receiving an audio signal, for encoding the current frame of the audio signal according to the coding mode, and for transmitting the encoded current frame. The coding mode is determined based on a combination of bandwidths and bitrates, and the bandwidths comprise at least two of narrowband, wideband, and super wideband.
According to another aspect of the present invention, provided herein is an audio signal processing device comprising a mode determination unit for receiving an audio signal, for receiving network information indicative of a maximum allowable coding mode, and for determining a coding mode corresponding to a current frame based on the network information and the audio signal, and an audio encoding unit for encoding the current frame of the audio signal according to the coding mode, and for transmitting the encoded current frame. The coding mode is determined based on a combination of bandwidths and bitrates, and the bandwidths comprise at least two of narrowband, wideband, and super wideband.
According to another aspect of the present invention, provided herein is an audio signal processing method comprising receiving an audio signal, determining whether a current frame is a speech activity section or a speech inactivity section by analyzing the audio signal, if the current frame is the speech inactivity section, determining one of a plurality of types including a first type and a second type as a type of a silence frame for the current frame based on bandwidths of one or more previous frames, and for the current frame, generating and transmitting the silence frame of the determined type. The first type includes a linear predictive conversion coefficient of a first order, the second type includes a linear predictive conversion coefficient of a second order, and the first order is smaller than the second order.
According to the present invention, the plurality of types may further include a third type, the third type includes a linear predictive conversion coefficient of a third order, and the third order is greater than the second order.
According to the present invention, the linear predictive conversion coefficient of the first order may be encoded with first bits, the linear predictive conversion coefficient of the second order may be encoded with second bits, and the first bits may be smaller than the second bits.
According to the present invention, the total bits of each of the first, second, and third types may be the same.
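The ordering constraints above (increasing linear predictive orders and bits across types, with equal totals) might be captured as follows; the concrete orders and bit counts are placeholders chosen only to satisfy those constraints, not values from the disclosure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SilenceFrameType:
    name: str
    lpc_order: int   # order of the linear predictive conversion coefficient
    lpc_bits: int    # bits spent encoding that coefficient
    other_bits: int  # remaining fields, padding every type to a common total

# Hypothetical values: order and coefficient bits grow with bandwidth while
# each type still packs to the same 35-bit total.
SID_TYPES = [
    SilenceFrameType("NB_SID", lpc_order=10, lpc_bits=26, other_bits=9),
    SilenceFrameType("WB_SID", lpc_order=12, lpc_bits=28, other_bits=7),
    SilenceFrameType("SWB_SID", lpc_order=16, lpc_bits=30, other_bits=5),
]

# The constraints stated in the text hold for these placeholders:
assert all(a.lpc_order < b.lpc_order and a.lpc_bits < b.lpc_bits
           for a, b in zip(SID_TYPES, SID_TYPES[1:]))
assert len({t.lpc_bits + t.other_bits for t in SID_TYPES}) == 1
```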
According to another aspect of the present invention, provided herein is an audio signal processing device comprising an activity section determination unit for receiving an audio signal, and determining whether a current frame is a speech activity section or a speech inactivity section by analyzing the audio signal, a type determination unit, if the current frame is the speech inactivity section, for determining one of a plurality of types including a first type and a second type as a type of a silence frame for the current frame based on bandwidths of one or more previous frames, and a respective-types-of silence frame generating unit, for the current frame, for generating and transmitting the silence frame of the determined type. The first type includes a linear predictive conversion coefficient of a first order, the second type includes a linear predictive conversion coefficient of a second order, and the first order is smaller than the second order.
According to another aspect of the present invention, provided herein is an audio signal processing method comprising receiving an audio signal, determining whether a current frame is a speech activity section or a speech inactivity section by analyzing the audio signal, if a previous frame is a speech inactivity section and the current frame is the speech activity section, and if a bandwidth of the current frame is different from a bandwidth of a silence frame of the previous frame, determining a type corresponding to the bandwidth of the current frame from among a plurality of types, and generating and transmitting a silence frame of the determined type. The plurality of types comprises first and second types, the bandwidths comprise narrowband and wideband, and the first type corresponds to the narrowband, and the second type corresponds to the wideband.
According to another aspect of the present invention, provided herein is an audio signal processing device comprising an activity section determination unit for receiving an audio signal and determining whether a current frame is a speech activity section or a speech inactivity section by analyzing the audio signal, a control unit, if a previous frame is a speech inactivity section and the current frame is the speech activity section, and if a bandwidth of the current frame is different from a bandwidth of a silence frame of the previous frame, for determining a type corresponding to the bandwidth of the current frame from among a plurality of types, and a respective-types-of silence frame generating unit for generating and transmitting a silence frame of the determined type. The plurality of types comprises first and second types, the bandwidths comprise narrowband and wideband, and the first type corresponds to the narrowband, and the second type corresponds to the wideband.
According to another aspect of the present invention, provided herein is an audio signal processing method comprising receiving an audio signal, determining whether a current frame is a speech activity section or a speech inactivity section, and if the current frame is the speech inactivity section, generating and transmitting a unified silence frame for the current frame, regardless of bandwidths of previous frames. The unified silence frame comprises a linear predictive conversion coefficient and an average of frame energy.
According to the present invention, the linear predictive conversion coefficient may be allocated 28 bits and the average of frame energy may be allocated 7 bits.
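With the stated allocation, a unified silence frame occupies 35 bits (28 for the linear predictive conversion coefficient plus 7 for the average of frame energy). A minimal packing sketch, treating both fields as already-quantized indices (an assumption made for illustration):

```python
def pack_unified_sid(lpc_index, energy_index):
    """Pack a 35-bit unified silence frame: 28-bit LPC field, 7-bit energy."""
    assert 0 <= lpc_index < (1 << 28) and 0 <= energy_index < (1 << 7)
    return (lpc_index << 7) | energy_index

def unpack_unified_sid(word):
    return word >> 7, word & 0x7F  # (lpc_index, energy_index)
```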
According to another aspect of the present invention, provided herein is an audio signal processing device comprising an activity section determination unit for receiving an audio signal and for determining whether a current frame is a speech activity section or a speech inactivity section by analyzing the audio signal, and a unified silence frame generating unit, if the current frame is the speech inactivity section, for generating and transmitting a unified silence frame for the current frame, regardless of bandwidths of previous frames. The unified silence frame comprises a linear predictive conversion coefficient and an average of frame energy.
Reference will now be made in detail to the preferred embodiments of the present invention, examples of which are illustrated in the accompanying drawings. It should be understood that the terms used in the specification and appended claims should not be construed as limited to general and dictionary meanings but be construed based on the meanings and concepts according to the spirit of the present invention on the basis of the principle that the inventor is permitted to define appropriate terms for best explanation. The preferred embodiments described in the specification and shown in the drawings are illustrative only and are not intended to represent all aspects of the invention, such that various equivalents and modifications can be made without departing from the spirit of the invention.
As used herein, the following terms may be construed as follows, and other terms may be construed in a similar manner: coding may be construed as encoding or decoding depending on context, and information may be construed as a term covering values, parameters, coefficients, elements, and the like depending on context. However, the present invention is not limited thereto.
Here, an audio signal in a broad sense, as distinguished from a video signal, refers to a signal that can be recognized by the auditory sense when reproduced; an audio signal in a narrow sense, as distinguished from a speech signal, refers to a signal having little or no speech characteristics. Herein, an audio signal is to be construed in the broad sense, and is understood as an audio signal in the narrow sense only when distinguished from a speech signal.
In addition, coding may refer to encoding only or may refer to both encoding and decoding.
The mode determination unit 110 receives network information from the network control unit 150, determines a coding mode based on the received information, and transmits the determined coding mode to the audio encoding unit 130 (and the silence frame generating unit 140). Here, the network information may indicate a coding mode or a maximum allowable coding mode, each of which will be described below.
On the other hand, the activity section determination unit 120 determines whether a current frame is a speech activity section or a speech inactivity section by analyzing an input audio signal, and transmits an activity flag (hereinafter referred to as a "VAD flag") to the audio encoding unit 130, the silence frame generating unit 140, the network control unit 150, and the like. Here, the analysis corresponds to a voice activity detection (VAD) procedure. The activity flag indicates whether the current frame is a speech activity section or a speech inactivity section.
The speech inactivity section corresponds to, for example, a silence section or a section with background noise. Since it is inefficient to use a coding scheme of the activity section in the inactivity section, the activity section determination unit 120 transmits the activity flag to the audio encoding unit 130 and the silence frame generating unit 140 so that, in a speech activity section (VAD flag=1), an audio signal is encoded by the audio encoding unit 130 according to the respective coding schemes, and in a speech inactivity section (VAD flag=0), a silence frame with low bits is generated by the silence frame generating unit 140. However, exceptionally, even in the case of VAD flag=0, an audio signal may be encoded by the audio encoding unit 130, as will be described below.
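The per-frame routing induced by the activity flag might look like the sketch below; the encoder and silence-frame-generator interfaces are hypothetical stand-ins for the audio encoding unit 130 and the silence frame generating unit 140.

```python
def process_frame(frame, vad_flag, coding_mode, audio_encoder, sid_generator):
    """Route one frame by activity: full encoding vs. a low-bit silence frame."""
    if vad_flag == 1:
        # Speech activity: encode at the coding mode chosen by unit 110.
        return audio_encoder.encode(frame, coding_mode)
    # Speech inactivity: a silence frame spends far fewer bits.
    return sid_generator.generate(frame)
```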
The audio encoding unit 130 causes at least one of a narrowband encoding unit (NB encoding unit) 131, a wideband encoding unit (WB encoding unit) 132, and a super wideband encoding unit (SWB encoding unit) 133 to encode an input audio signal to generate an audio frame, based on the coding mode determined by the mode determination unit 110.
In this regard, the narrowband, the wideband, and the super wideband cover progressively wider and higher frequency ranges, in that order. The super wideband (SWB) covers the wideband (WB) and the narrowband (NB), and the wideband (WB) covers the narrowband (NB).
The NB encoding unit 131 is a device for encoding an input audio signal according to a coding scheme corresponding to a narrowband signal (hereinafter referred to as the NB coding scheme), the WB encoding unit 132 is a device for encoding an input audio signal according to a coding scheme corresponding to a wideband signal (hereinafter referred to as the WB coding scheme), and the SWB encoding unit 133 is a device for encoding an input audio signal according to a coding scheme corresponding to a super wideband signal (hereinafter referred to as the SWB coding scheme). Although the case in which different coding schemes are used for the respective bands (that is, the respective encoding units) has been described above, a coding scheme of an embedded structure covering lower bands may be used, or a hybrid of the two structures may also be used.
The network control unit 150 receives channel condition information from a network such as a mobile communication network (including a base transceiver station (BTS), a base station controller (BSC), a mobile switching center (MSC), a PSTN, an IP network, etc.). Here, network information is extracted from the channel condition information and is transferred to the mode determination unit 110. As described above, the network information may directly indicate a coding mode or may indicate a maximum allowable coding mode. Further, the network control unit 150 transmits an audio frame or a silence frame to the network.
Two examples of the mode determination unit 110 will now be described.
A support bitrate corresponding to two or more bandwidths may be present.
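To make this concrete, a hypothetical support-bitrate table is sketched below in which 12.65 kbps is shared by the wideband and the super wideband; the specific rates are illustrative, not taken from the disclosure.

```python
# Hypothetical support bitrates per bandwidth (kbps); each bandwidth has
# two or more predetermined support bitrates.
SUPPORT_BITRATES = {
    "NB": [6.6, 8.85],
    "WB": [8.85, 12.65],
    "SWB": [12.65, 24.0],
}

def bandwidths_supporting(kbps):
    """Bandwidths that share a given support bitrate (may be two or more)."""
    return [bw for bw, rates in SUPPORT_BITRATES.items() if kbps in rates]

# bandwidths_supporting(12.65) -> ['WB', 'SWB']: one bitrate, two bandwidths.
```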
The last factor in determining a coding mode is whether the current frame is a silence frame, which will be specifically described below together with the silence frame generating unit.
Thus far, the coding modes have been described.
On the other hand, the coding modes described above may be switched between different bandwidths for respective frames, as described below.
<Switching Between Different Bandwidths>
A. In the case of NB/WB
B. In the case of SWB: split-band coding (layers split by band)
For each of these cases, a bit allocation method depending on the source is applied. If no enhancement layer is present, bit allocation is performed within the core layer; if an enhancement layer is present, bit allocation is performed across the core layer and the enhancement layer.
As described above, in a case in which an enhancement layer is present, the bits or bitrates of the core layer may be variably switched for each frame (in cases b.1), b.2), and b.3) above). It is obvious that even in this case the coding modes are generated based on network information (and characteristics of an audio signal or coding modes of previous frames).
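A minimal sketch of such an allocation, under the assumption of a fixed core share and a hypothetical minimum core rate:

```python
MIN_CORE_BITS = 132  # hypothetical minimum core-layer bits per frame

def allocate_bits(total_bits, has_enhancement, core_ratio=0.75):
    """Split a frame's bit budget between core and enhancement layers."""
    if not has_enhancement:
        return total_bits, 0  # all bits stay within the core
    core = max(int(total_bits * core_ratio), MIN_CORE_BITS)
    return core, total_bits - core  # (core layer bits, enhancement bits)
```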
First, the concept of a core layer and enhancement layers will be described.
Here, the core layer may be a codec used in existing communication networks or a newly designed codec. The enhancement layer is a structure for complementing components, such as music, other than the speech component, and is not limited to a specific coding scheme. Further, although a bit stream structure without the enhancement layer may be possible, at least a minimum rate for the bit stream of the core should be defined. For this purpose, a block for determining the degrees of tonality and activity of a signal component is required. The core layer may correspond to AMR-WB Inter-OPerability (IOP). The above-described structure may be extended to narrowband (NB), wideband (WB), super wideband (SWB), and even full band (FB). In a codec structure of a band split, interchange of bandwidths may be possible.
Hereinafter, the silence frame generating unit will be described.
The type determination unit 142A receives the bandwidth(s) of one or more previous frames and, based on the received bandwidth(s), determines one type as the type of a silence frame for a current frame, from among a plurality of types including a first type and a second type (and a third type). Here, the bandwidth(s) of the previous frame(s) may be information received from the mode determination unit 110 described above.
In a frame after several pause frames have ended, i.e., the 8th frame after the inactivity section has begun (the 43rd frame in the drawing), a silence frame is not generated; in this case, the transmission type may be 'SID_FIRST'. In the 3rd frame from this point (the 0th frame (current frame (n)) in the drawing), a silence frame is generated; in this case, the transmission type is 'SID_UPDATE'. After that, the transmission type is 'SID_UPDATE' and a silence frame is generated for every 8th frame.
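This transmission schedule can be sketched as below, indexing frames from the start of the inactivity section (an indexing assumption made for illustration; the frame counts follow the description above).

```python
def transmission_type(n):
    """Transmission type for the n-th frame (0-based) of an inactivity run.

    Frames 0..6 form the hangover, frame 7 sends SID_FIRST without a
    silence payload, frame 10 sends the first SID_UPDATE, and every 8th
    frame after that sends another update.
    """
    if n < 7:
        return "SPEECH_HANGOVER"
    if n == 7:
        return "SID_FIRST"
    if n >= 10 and (n - 10) % 8 == 0:
        return "SID_UPDATE"
    return "NO_DATA"
```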
In generating a silence frame for the current frame (n), the type determination unit 142A determines the type based on the bandwidth(s) of the previous frame(s), as described above.
Meanwhile, the first to third orders and the first to third bits have the relation shown below:
The first order (O1) ≤ the second order (O2) ≤ the third order (O3)
The first bits (N1) ≤ the second bits (N2) ≤ the third bits (N3)
This is because it is preferred that the wider the bandwidth, the higher the order of the linear predictive coefficient, and the higher the order of the linear predictive coefficient, the more bits are required.
The first type silence frame (NB SID) may further include a reference vector which is a reference value of the linear predictive coefficient, and the second and third type silence frames (WB SID, SWB SID) may further include a dithering flag. Further, each of the silence frames may further include frame energy. Here, the dithering flag, which is information indicating periodic characteristics of the background noise, may have a value of 0 or 1. For example, using the linear predictive coefficients, if the sum of spectral distances is small, the dithering flag may be set to 0; if the sum is large, the dithering flag may be set to 1. A small distance indicates that the spectrum envelope information among previous frames is relatively similar.
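A sketch of that decision, with the spectral representation, distance measure, and threshold as assumptions:

```python
import numpy as np

def dithering_flag(envelopes, threshold=0.05):
    """Dithering flag from the sum of spectral distances between frames.

    envelopes: one spectral-envelope vector (e.g. LSFs derived from the
    linear predictive coefficients) per recent frame. A small summed
    distance means stationary background noise (flag 0); a large one
    means fluctuating noise (flag 1).
    """
    dist = sum(float(np.linalg.norm(a - b))
               for a, b in zip(envelopes, envelopes[1:]))
    return 0 if dist < threshold * max(len(envelopes) - 1, 1) else 1
```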
Although the bits of the individual elements of the respective types differ, the total bits of each type may be the same.
The respective-types-of silence frame generating unit 144A generates one of the first to third type silence frames (NB SID, WB SID, SWB SID) for a current frame of an audio signal, according to the type determined by the type determination unit 142A. Here, an audio frame output from the audio encoding unit 130 described above may be used.
A control unit 146C uses the bandwidth information and audio frame information (spectrum envelope and residual information) of previous frames, and determines the type of a silence frame for a current frame with reference to the activity flag (VAD flag). The respective-types-of silence frame generating unit 144C generates the silence frame for the current frame using the audio frame information of the n previous frames, based on the bandwidth information determined by the control unit 146C. At this time, an audio frame with a different bandwidth among the n previous frames is converted into the bandwidth of the current frame, to thereby generate a silence frame of the determined type.
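A sketch of that generation step, with the envelope representation and the bandwidth converter as assumed helpers:

```python
import numpy as np

def build_silence_frame(prev_frames, current_bw, convert, n=8):
    """Form a silence frame's envelope from the last n audio frames.

    prev_frames: list of (bandwidth, envelope_vector) tuples.
    convert: assumed helper mapping an envelope between bandwidths, standing
    in for the conversion performed before averaging.
    """
    recent = prev_frames[-n:]
    envelopes = [env if bw == current_bw else convert(env, bw, current_bw)
                 for bw, env in recent]
    return np.mean(envelopes, axis=0)  # averaged envelope for the SID
```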
An example of the syntax of a unified silence frame is illustrated in the accompanying drawings.
By generating a unified silence frame regardless of the bandwidths of previous frames, the power required for control, resources, and the number of modes at the time of transmission may be reduced, and distortions occurring due to bandwidth switching in a speech inactivity section may be prevented.
The control unit 146C determines a type of a silence frame for a current frame based on bandwidths of previous and current frames and an activity flag (VAD flag).
Firstly, a decoder 200-1 of a first type includes all of an NB decoding unit 131A, a WB decoding unit 132A, an SWB decoding unit 133A, a converting unit 140A, and an unpacking unit 150. Here, the NB decoding unit decodes an NB signal according to the NB coding scheme described above, the WB decoding unit decodes a WB signal according to the WB coding scheme, and the SWB decoding unit decodes an SWB signal according to the SWB coding scheme. If all of the decoding units are included, as in the first type, decoding may be performed regardless of the bandwidth of the bit stream. The converting unit 140A performs conversion of the bandwidth of an output signal and smoothing at the time of switching bandwidths. In the conversion of the bandwidth of an output signal, the bandwidth of the output signal is changed according to a user's selection or a hardware limitation on the output bandwidth. For example, an SWB output signal decoded from an SWB bit stream may be output as a WB or NB signal according to a user's selection or a hardware limitation on the output bandwidth. In the smoothing at the time of switching bandwidths, after an NB frame is output, if the current frame has an output bandwidth other than NB, conversion of the bandwidth of the current frame is performed. For example, if, after an NB frame is output, the current frame is an SWB signal decoded from an SWB bit stream, bandwidth conversion into WB is performed so as to perform smoothing. A WB signal decoded from a WB bit stream, after an NB frame is output, is converted into an intermediate bandwidth between NB and WB so as to perform smoothing. That is, in order to minimize the difference between the bandwidths of a previous frame and a current frame, conversion into an intermediate bandwidth between the previous frame and the current frame is performed.
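The rule just described (NB to SWB steps through WB; adjacent bands meet at an intermediate edge) might be expressed as in this sketch, with the band-edge figures taken from the approximate limits stated earlier and the low-pass step purely illustrative:

```python
import numpy as np

ORDER = ["NB", "WB", "SWB"]
BAND_EDGE_HZ = {"NB": 4000, "WB": 8000, "SWB": 16000}

def smoothing_bandwidth(prev_bw, curr_bw):
    """Rendering bandwidth for the current frame after a bandwidth switch."""
    if prev_bw == curr_bw:
        return BAND_EDGE_HZ[curr_bw]
    if abs(ORDER.index(curr_bw) - ORDER.index(prev_bw)) == 2:
        return BAND_EDGE_HZ["WB"]  # NB <-> SWB: step through the middle band
    # Adjacent bands (e.g. NB -> WB): an intermediate band edge (6 kHz here).
    return (BAND_EDGE_HZ[prev_bw] + BAND_EDGE_HZ[curr_bw]) // 2

def band_limit(signal, cutoff_hz, fs):
    """Crude FFT-domain low-pass to the chosen edge (illustration only)."""
    spec = np.fft.rfft(signal)
    keep = int(len(spec) * cutoff_hz / (fs / 2))
    spec[keep:] = 0
    return np.fft.irfft(spec, n=len(signal))
```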
A decoder 200-2 of a second type includes an NB decoding unit 131B and a WB decoding unit 132B only, and is not able to decode an SWB bit stream. However, its converting unit 140B may produce SWB output according to a user's selection or a hardware limitation on the output bandwidth. The converting unit 140B performs, similarly to the converting unit 140A of the first type decoder 200-1, conversion of the bandwidth of an output signal and smoothing at the time of bandwidth switching.
A decoder 200-3 of a third type includes an NB decoding unit 131C only, and is able to decode only an NB bit stream. Since there is only one decodable bandwidth (NB), its converting unit 140C is used only for bandwidth conversion. Accordingly, a decoded NB output signal may be bandwidth-converted into WB or SWB through the converting unit 140C.
Other aspects of the various types of decoders are described below.
When two or more types of BW bit streams are received from a transmitting side, the received bit streams are decoded according to respective routines with reference to the decodable BW types and the output bandwidth at a receiving side, and the output signal is converted into a BW supported by the receiving side. For example, suppose a transmitting side is capable of encoding NB/WB/SWB, a receiving side is capable of decoding NB/WB, and the signal output bandwidth may be up to SWB; the cases are listed below, with a brief sketch thereafter.
<<Example of Decreasing Bandwidth>>
A receiving side supports up to SWB: decoded as transmitted.
A receiving side supports up to WB: for a transmitted SWB frame, the decoded SWB signal is converted into WB. The receiving side includes a module capable of decoding SWB.
A receiving side supports NB only: for a transmitted WB/SWB frame, the decoded WB/SWB signal is converted into NB. The receiving side includes a module capable of decoding WB/SWB.
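As referenced above, this selection might be expressed as follows; the capability set and the decode/convert helpers are assumptions standing in for the per-band decoding units and the converting unit.

```python
def decode_frame(bitstream_bw, decodable, output_bw, decode, convert):
    """Decode one frame and fit it to the receiving side's output bandwidth.

    decodable: bandwidths this terminal can decode, e.g. {"NB", "WB"}.
    decode/convert: assumed helpers for the per-band decoders and the
    bandwidth conversion of the converting unit.
    """
    if bitstream_bw not in decodable:
        raise ValueError("bit stream bandwidth not decodable at this terminal")
    signal = decode(bitstream_bw)
    if bitstream_bw == output_bw:
        return signal  # decoded as transmitted
    return convert(signal, bitstream_bw, output_bw)  # e.g. SWB -> WB or NB
```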
The audio signal processing device according to the present invention may be incorporated in various products. Such products may be mainly divided into a standalone group and a portable group. The standalone group may include a TV, a monitor, a set top box, etc., and the portable group may include a portable multimedia player (PMP), a mobile phone, a navigation device, etc.
A user authenticating unit 520, which receives user information and performs user authentication, may include at least one of a fingerprint recognizing unit, an iris recognizing unit, a face recognizing unit, and a voice recognizing unit, which receive fingerprint, iris, facial contour, and voice information, respectively, convert the received information into user information, and perform user authentication by determining whether the converted user information matches previously registered user data.
An input unit 530, which is an input device for inputting various kinds of instructions from a user, may include at least one of a keypad unit 530A, a touchpad unit 530B, a remote controller unit 530C, and a microphone unit 530D; however, the present invention is not limited thereto. Here, the microphone unit 530D is an input device for receiving a voice or audio signal. The keypad unit 530A, the touchpad unit 530B, and the remote controller unit 530C may receive instructions to initiate a call or to activate the microphone unit 530D. A control unit 550 may, upon receiving an instruction to initiate a call through the keypad unit 530A and the like, cause the mobile communication unit 510E to request a call to a mobile communication network.
A signal coding unit 540 performs encoding or decoding of an audio signal and/or video signal received through the microphone unit 530D or the wired/wireless communication unit 510, and outputs an audio signal in the time domain. The signal coding unit 540 includes an audio signal processing apparatus 545, which corresponds to the above-described embodiments of the present invention (i.e., the encoder 100 and/or decoder 200 according to the embodiments). As such, the audio signal processing apparatus 545 and the signal coding unit including the same may be implemented by one or more processors.
The control unit 550 receives input signals from the input devices, and controls all processes of the signal coding unit 540 and the output unit 560. The output unit 560, which outputs an output signal generated by the signal coding unit 540, may include a speaker unit 560A and a display unit 560B. When the output signal is an audio signal, it is output through the speaker unit 560A, and when the output signal is a video signal, it is output through the display unit 560B.
The signal coding unit 760 performs encoding or decoding of an audio signal and/or a video signal received through the mobile communication unit 710, the data communication unit 720, or the microphone unit 740, and outputs an audio signal in the time domain through the mobile communication unit 710, the data communication unit 720, or the speaker 770. The signal coding unit 760 includes an audio signal processing apparatus 765, which corresponds to the embodiments of the present invention (i.e., the encoder 100 and/or the decoder 200 according to the embodiments). As such, the audio signal processing apparatus 765 and the signal coding unit 760 including the same may be implemented by one or more processors.
The audio signal processing method according to the present invention may be implemented as a program executed by a computer so as to be stored in a computer readable storage medium. Further, multimedia data having the data structure according to the present invention may be stored in a computer readable storage medium. The computer readable storage medium may include all kinds of storage devices storing data readable by a computer system. Examples of the computer readable storage medium include a ROM, a RAM, a CD-ROM, a magnetic tape, a floppy disk, and an optical data storage device, as well as a carrier wave (transmission over the Internet, for example). In addition, the bit stream generated by the encoding method may be stored in a computer readable storage medium or transmitted through wired/wireless communication networks.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention cover the modifications and variations of this invention provided they come within the scope of the appended claims and their equivalents.
The present invention is applicable to encoding and decoding of an audio signal.
Filing Document | Filing Date | Country | Kind | 371(c) Date
PCT/KR2011/004843 | 7/1/2011 | WO | 00 | 6/17/2013
Number | Date | Country
61360506 | Jul 2010 | US
61383737 | Sep 2010 | US
61490080 | May 2011 | US