Many speech and audio processing applications (e.g., speech analysis, speech synthesis, speech compression, speech transformation, speech coding, speech recognition, audio analysis, audio synthesis, audio compression, audio transformation, audio coding, etc.) involve approximating portions of speech and audio signals using parametric models and encoding at least some of the parameters of these models. For example, many speech and audio processing applications involve approximating portions of a signal using a sinusoidal model, whereby a windowed portion of the signal may be approximated using a finite sum of sinusoids, and encoding at least some of the parameters of the sinusoidal model. The parameters of a sinusoidal model may include an amplitude, frequency, and phase for each sinusoid in the sum of sinusoids.
Some aspects of the technology described herein relate to a method for encoding an audio signal represented by a plurality of frames including a first frame. The method comprises using at least one computer hardware processor to perform: obtaining an initial discrete spectral representation of the first frame; obtaining a primary discrete spectral representation of the initial discrete spectral representation at least in part by estimating a phase envelope of the initial discrete spectral representation and evaluating the estimated phase envelope at a discrete set of frequencies; calculating a residual discrete spectral representation of the initial discrete spectral representation based on the initial discrete spectral representation and the primary discrete spectral representation; and encoding the residual discrete spectral representation using a plurality of codewords.
Some aspects of the technology described herein relate to a system for encoding an audio signal represented by a plurality of frames including a first frame. The system comprises at least one non-transitory memory storing a plurality of codewords; and at least one computer hardware processor configured to perform: obtaining an initial discrete spectral representation of the first frame; obtaining a primary discrete spectral representation of the initial discrete spectral representation at least in part by estimating a phase envelope of the initial discrete spectral representation and evaluating the estimated phase envelope at a discrete set of frequencies; calculating a residual discrete spectral representation of the initial discrete spectral representation based on the initial discrete spectral representation and the primary discrete spectral representation; and encoding the residual discrete spectral representation using a plurality of codewords.
Some aspects of the technology described herein relate to at least one non-transitory computer-readable storage medium storing processor executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform a method for encoding an audio signal represented by a plurality of frames including a first frame. The method comprises: obtaining an initial discrete spectral representation of the first frame; obtaining a primary discrete spectral representation of the initial discrete spectral representation at least in part by estimating a phase envelope of the initial discrete spectral representation and evaluating the estimated phase envelope at a discrete set of frequencies; calculating a residual discrete spectral representation of the initial discrete spectral representation based on the initial discrete spectral representation and the primary discrete spectral representation; and encoding the residual discrete spectral representation using a plurality of codewords.
The foregoing is a non-limiting summary of the invention, which is defined by the attached claims.
Various aspects and embodiments of the application will be described with reference to the following figures. The figures are not necessarily drawn to scale. Items appearing in multiple figures are indicated by the same or a similar reference number in all the figures in which they appear.
The inventors have appreciated that conventional techniques for encoding parameters of a sinusoidal model may be improved upon. As described above, parameters of a sinusoidal model include amplitudes, frequencies, and phases of the sinusoids in the model. However, conventional encoding techniques do not provide for an efficient means of encoding phases of the sinusoids in the sinusoidal model. Existing approaches for encoding sinusoidal model phases require a high bit budget and have high computational complexity such that they are not suitable for implementation using fixed-point arithmetic. Accordingly, some embodiments provide for efficient techniques for encoding sinusoidal model phases and, optionally, other sinusoidal model parameters. The encoding techniques described herein allow for encoding the sinusoidal model parameters using fewer bits than conventional encoding techniques and may be implemented in a computationally efficient manner using floating point and/or fixed point arithmetic.
Some embodiments of the technology described herein address one or more drawbacks of conventional techniques for encoding sinusoidal model parameters. Some embodiments provide for encoding of one or more audio frames representing an audio signal, which may be a speech signal, a music signal, and/or any other suitable type of audio signal. An audio frame representing the audio signal may be encoded by obtaining an initial discrete spectral representation (DSR) of the audio frame and encoding the initial DSR in two stages: first, obtaining a coarse approximation of the initial DSR, including its phase envelope; and second, representing the information in the initial DSR that is not captured by the coarse approximation by a linear combination of codewords.
In some embodiments, the initial discrete spectral representation of a frame may comprise an amplitude and a phase for each frequency in a discrete set of frequencies. The initial discrete spectral representation may be obtained by fitting a sinusoidal model to the audio frame and/or in any other suitable way. As such, in some embodiments, encoding the initial discrete spectral representation may comprise encoding parameters of a sinusoidal model including the phase parameters of the sinusoidal model.
In some embodiments, encoding the initial discrete spectral representation may comprise: (1) obtaining a primary discrete spectral representation of the initial DSR at least in part by estimating a phase envelope of the initial DSR and evaluating the estimated phase envelope at a discrete set of frequencies; (2) calculating a residual discrete spectral representation of the initial DSR based on the difference between the initial and primary discrete spectral representations; and (3) encoding the residual discrete spectral representation using a linear combination of codewords.
In some embodiments, estimating the phase envelope of the initial DSR may comprise estimating parameters of a continuous-in-frequency representation of the phase envelope. The continuous-in-frequency representation of the phase envelope may be a Mel-frequency cepstral representation such that estimating parameters of the representation may comprise estimating a plurality of Mel-frequency cepstral coefficients, for example, Mel-frequency regularized cepstral coefficients.
In some embodiments, encoding the residual discrete spectral representation using a linear combination of codewords may comprise iteratively selecting the codewords in the linear combination from one or more codebooks. The iterative selection may be performed by using a perceptual measure and/or any other suitable type of measure. The codebook(s) from which the codewords are selected may comprise stochastic codewords. For example, in some embodiments, the codebook(s) may comprise a plurality of sub-frame sub-band codewords, as described in more detail below.
It should be appreciated that the embodiments described herein may be implemented in any of numerous ways. Examples of specific implementations are provided below for illustrative purposes only. It should be appreciated that these embodiments and the features/capabilities provided may be used individually, all together, or in any combination of two or more, as aspects of the technology described herein are not limited in this respect.
Each of computing devices 104 and 110 may be a portable computing device (e.g., a laptop, a smart phone, a PDA, a tablet device, etc.), a fixed computing device (e.g., a desktop, a server, a rack-mounted computing device) and/or any other suitable computing device that may be configured to encode one or more frames representing an audio signal (e.g., a speech signal) in accordance with embodiments described herein. Network 108 may be a local area network, a wide area network, a corporate Intranet, the Internet, and/or any other suitable type of network. Each of connections 110a and 110b may be a wired connection, a wireless connection, or a combination thereof.
It should be appreciated that aspects of the technology described herein are not limited to operating in the illustrative environment 100 shown in
Process 200 begins at act 202, where an audio signal is obtained. The audio signal may be obtained from any suitable source. For example, the audio signal may be stored and, at act 202, accessed by a computing device performing process 200. As another example, the audio signal may be received from an application program or an operating system (e.g., from an application program or an operating system requesting that the audio signal be encoded). The audio signal may be in any suitable format, as aspects of the technology described herein are not limited in this respect.
Next, process 200 proceeds to act 204, where the audio signal received at act 202 is processed to obtain one or more audio frames representing the audio signal. Each of the obtained audio frames may represent (e.g., may comprise) a portion of the audio signal. In some instances, the audio frames may be overlapping such that two or more frames may represent a portion of the audio signal. In some instances, the audio frames may not overlap such that each frame in the plurality of frames may represent a respective portion of the audio signal. The audio frames may be generated in any suitable way and, for example, may be generated using time-shifted versions of a suitable windowing function, sometimes termed an apodization or tapering function. Examples of a windowing function that may be used include, but are not limited to a rectangular window, a triangular window, a Parzen window, a Welch window, a Hann window, a Hamming window, a Blackman window, and a raised cosine window.
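As an illustration of act 204, the following minimal sketch splits a signal into frames using time-shifted copies of a Hann window. The function name frame_signal, the frame length, and the hop size are assumptions chosen for illustration only; a hop smaller than the frame length yields overlapping frames, and a hop equal to the frame length yields non-overlapping frames.

```python
import numpy as np


def frame_signal(signal, frame_len, hop):
    """Split a 1-D signal into windowed frames.

    Each frame is frame_len samples long and starts hop samples after the
    previous one. A Hann window is applied to every frame.
    """
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    return np.stack(
        [signal[i * hop: i * hop + frame_len] * window for i in range(n_frames)]
    )


# Hypothetical test signal: 400 samples of a sinusoid.
x = np.sin(2 * np.pi * 0.01 * np.arange(400))
overlapping = frame_signal(x, frame_len=80, hop=40)  # 50% overlap
disjoint = frame_signal(x, frame_len=80, hop=80)     # no overlap
```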
Next, process 200 proceeds to act 206, where one of the audio frames is selected for encoding. The audio frame may be selected in any suitable way, as aspects of the technology described herein are not limited in this respect.
Next, process 200 proceeds to act 208, where the audio frame selected at act 206 may be encoded. In some embodiments, the audio frame may be processed to obtain an initial discrete spectral representation (DSR) of the audio frame, which representation comprises an amplitude and a phase for each frequency in a discrete set of frequencies. As such, the initial spectral representation may also be termed a “full line spectral representation.” The initial DSR may be encoded in two stages: (1) obtaining a coarse approximation to the initial DSR (also called “primary discrete spectral representation” herein); and (2) obtaining a representation of the residual information in the initial DSR, which is not captured by the coarse approximation, using a linear combination of codewords. As such, the encoding of the initial DSR may include an encoding of the coarse approximation to the initial DSR and information identifying the codewords representing the residual information not captured by the coarse approximation and the respective weights or gains of the codewords in the linear combination.
In some embodiments, obtaining the coarse representation of the initial DSR may comprise estimating a phase envelope of the initial DSR and evaluating the estimated phase envelope at a discrete set of frequencies. In some embodiments, estimating the phase envelope of the initial DSR includes estimating a continuous-in-frequency representation of the phase envelope and sampling the continuous-in-frequency representation at the discrete set of frequencies. In some embodiments, the continuous-in-frequency representation may comprise a Mel-regularized cepstral coefficient representation of the phase envelope.
In some embodiments, obtaining a representation of the residual information in the initial DSR, not captured by the coarse representation, may comprise encoding the difference between the initial DSR and the coarse representation by using a linear combination of stochastic codewords. The codewords in the linear combination may be selected iteratively from one or more codebooks. For example, codewords in the linear combination may be selected iteratively using a perceptual measure. In some embodiments, the codewords may be selected from one or more codebooks of sub-frame sub-band stochastic codewords. The above-described aspects of encoding an audio frame, at act 208 of process 200, are described further below with reference to
After encoding the selected audio frame at act 208, process 200 proceeds to decision block 210, where it is determined whether another audio frame is to be encoded. This may be done in any suitable way. For example, when each of the audio frames obtained at act 204 has been encoded, it may be determined that another audio frame is not to be encoded. On the other hand, when one or more of the audio frames obtained at act 204 has not been encoded, it may be determined that another audio frame is to be encoded.
When it is determined, at decision block 210, that another audio frame is to be encoded, process 200 returns via the YES branch to act 206, and acts 206 and 208 are repeated such that another audio frame is encoded. On the other hand, when it is determined, at decision block 210, that another audio frame is not to be encoded, process 200 proceeds to act 212, where the parameters representing the encoded frames are output. The parameters may be output to one or more application programs, an operating system, stored for subsequent access, transmitted to one or more other computing devices, and/or output in any other suitable manner. After the parameters representing the encoded audio frames are output, process 200 completes.
It should be appreciated that process 200 is illustrative and that there are variations of process 200. For example, in the embodiment illustrated in
Process 300 begins at act 302, where an audio frame to be encoded is obtained. The audio frame may be obtained in any suitable way. For example, the audio frame may be received from an application program or an operating system. As another example, the audio frame may be obtained by processing an audio signal to obtain a set of audio frames and the audio frame may be selected from the set of audio frames. As yet another example, the audio frame may be stored and may be accessed, at act 302, by the computing device performing process 300. The audio frame may be in any suitable format, as aspects of the technology described herein are not limited in this respect.
Next, process 300 proceeds to act 304, where an initial discrete spectral representation (DSR) of the audio frame is obtained. As described above, the initial discrete spectral representation may comprise an amplitude value and a phase value for each frequency in a discrete set of frequencies. In some embodiments, the initial discrete spectral representation may be obtained by fitting a sinusoidal model to the audio frame to represent the signal in the audio frame as a finite sum of sinusoids characterized by their respective amplitudes, frequencies, and phases. The resultant initial discrete spectral representation may comprise a frequency, an amplitude, and a phase for each sinusoid in a set of sinusoids. As a specific non-limiting example, an audio frame sw(n) obtained by windowing an audio signal, may be approximated using the following sum of L+1 sinusoids:
where k is an integer ranging from 0 to L, Ak is the amplitude of the kth sinusoid, θk is the frequency of the kth sinusoid, φk is the phase of the kth sinusoid, and w(n) is a windowing function examples of which have been described above. The corresponding initial discrete spectral representation then comprises the sets {Ak}, {θk}, and {φk}, which are the amplitudes, frequencies, and phases of the sum of sinusoids shown above in Equation (1). In embodiments in which the initial DSR is obtained by fitting a sinusoidal model to the audio frame obtained at act 302, the initial DSR may be termed a “full sinusoidal representation.”
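The sum of sinusoids described above can be sketched in a few lines. Because the equation itself is not reproduced in this text, the cosine form w(n)·Σ Ak·cos(θk·n + φk), with frequencies in radians per sample, is an assumption, as is the function name synthesize_frame.

```python
import numpy as np


def synthesize_frame(amps, freqs, phases, window):
    """Windowed sum of sinusoids: w(n) * sum_k A_k * cos(theta_k * n + phi_k).

    freqs are assumed to be in radians per sample; the cosine form is an
    assumption made for this sketch.
    """
    n = np.arange(len(window))
    amps, freqs, phases = (np.asarray(v, dtype=float) for v in (amps, freqs, phases))
    # One row per sinusoid, one column per sample index.
    partials = amps[:, None] * np.cos(freqs[:, None] * n[None, :] + phases[:, None])
    return window * partials.sum(axis=0)


# Hypothetical parameters for a two-sinusoid frame.
window = np.hanning(64)
frame = synthesize_frame([1.0, 0.5], [0.2, 0.4], [0.0, np.pi / 2], window)
```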
Next, process 300 proceeds to acts 306a, 306b, 306c, and 306d, where a primary discrete spectral representation of the audio frame is obtained. The primary discrete spectral representation may be a coarse approximation to the initial discrete spectral representation and any information in the initial DSR that is not captured by the primary discrete spectral representation may be encoded as described below with reference to acts 308 and 310. In the embodiment illustrated in
As illustrated in
The continuous-in-frequency representation of the amplitude envelope may be a linear predictive coefficient (LPC) model, a line spectral frequency (LSF) model, a Mel-frequency regularized cepstral coefficient (MRCC) model, any suitable parametric model, or any other suitable type of model. It should be appreciated that the amplitude envelope parameters may be obtained in any other suitable way, as aspects of the technology described herein are not limited in this respect. For example, in some embodiments, amplitude envelope parameters may have been previously obtained for the audio frame using any suitable technique and, at act 306a, the previously obtained values may be received and/or accessed.
Next, process 300 proceeds to act 306b, where phase envelope parameters representing a phase envelope of the initial discrete spectral representation are obtained. In some embodiments, obtaining the phase envelope parameters may comprise estimating the phase envelope of the initial DSR and obtaining a set of phase envelope parameters representing the estimated phase envelope. In some embodiments, obtaining the phase envelope parameters may be performed based, at least in part, on the amplitude envelope of the initial DSR estimated at act 306a.
In some embodiments, before the phase envelope of the initial DSR is estimated, the signal in the audio frame may be phase aligned. Performing the phase alignment may comprise applying a time-domain shift to the signal in the audio frame. Applying a time-domain shift may reduce entropy of the phase of the resultant signal and result in improved estimates of the phase envelope. The time-domain shift to apply to the signal in the audio frame may be determined in any suitable way. For example, the time-domain shift may be determined based on a location of an extremum (e.g., largest amplitude) of the signal. As another example, the time-domain shift may be determined so that variability of the spectral lines in a line spectrum fit to the signal is minimized. As a specific non-limiting example, in embodiments where the initial DSR is obtained by fitting a sinusoidal model such that the audio frame is approximated by a sum of sinusoids as shown in Equation (1) above, the sum of sinusoids may be shifted in the time domain by an amount τ to yield the following time-shifted representation:
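The effect of a time-domain shift on the sinusoidal phases can be sketched as follows. The sketch assumes the cosine form of the sinusoidal model and the sign convention that shifting by τ samples replaces n with n + τ, so that each phase becomes φk + θk·τ; both the convention and the helper names (shift_phases, shift_from_extremum) are assumptions, not taken from the text.

```python
import numpy as np


def shift_phases(freqs, phases, tau):
    """Phases after a time shift of tau samples.

    Uses cos(theta * (n + tau) + phi) = cos(theta * n + (phi + theta * tau)).
    """
    return np.asarray(phases, dtype=float) + np.asarray(freqs, dtype=float) * tau


def shift_from_extremum(frame):
    """Shift (in samples) that moves the largest-magnitude sample to the frame centre."""
    frame = np.asarray(frame, dtype=float)
    return int(np.argmax(np.abs(frame))) - len(frame) // 2
```

Under this convention, a frame whose largest-magnitude sample lies left of centre yields a negative shift.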
In some embodiments, estimating the phase envelope of the initial DSR may comprise estimating a continuous-in-frequency representation of the phase envelope of the initial DSR. The continuous-in-frequency representation of the phase envelope may allow for calculation of a phase value for any frequency in a continuous range of frequencies. The continuous-in-frequency representation of the initial DSR's phase envelope may be a parametric representation and, for example, may be a Mel-frequency regularized cepstral coefficient (MRCC) representation (e.g., a weighted MRCC representation) as described in more detail below. However, the continuous-in-frequency representation of the phase envelope of the initial DSR may be any other suitable type of continuous-in-frequency representation, as aspects of the technology described herein are not limited in this respect.
In embodiments where the initial DSR includes phase, amplitude, and frequency parameters (e.g., when the initial DSR is obtained by fitting a sinusoidal model to the audio frame), estimating the continuous-in-frequency representation may comprise estimating parameters of the continuous-in-frequency representation based, at least in part, on the phase, amplitude, and/or frequency parameters characterizing the initial DSR. For instance, in embodiments where the continuous-in-frequency representation of the phase envelope comprises a set of Mel-frequency regularized cepstral coefficients, estimating the continuous-in-frequency representation may comprise estimating the set of Mel-frequency regularized cepstral coefficients based on the phase, amplitude, and/or frequency parameters characterizing the initial discrete spectral representation obtained at act 304. As a specific non-limiting example, the continuous-in-frequency representation may comprise an MRCC representation including a vector d of phase cepstral coefficients, which may be estimated by solving the following quadratic minimization problem:
where {φi} correspond to the unwrapped phases in the initial discrete spectral representation of the audio frame (e.g., the phases of the line spectrum components obtained by fitting a sinusoidal model to the audio frame), where $\{\tilde{f}_i\}$ and {Ai} are Mel-frequencies and amplitudes in the initial discrete spectral representation of the audio frame (e.g., the Mel-frequencies and amplitudes of the line spectrum components obtained by fitting a sinusoidal model to the audio frame), where the continuous phase spectrum $\Phi(\tilde{f})$ is approximated in the cepstral domain as a sum of K sinusoids combined with a linear in-frequency term according to:
$$\Phi(\tilde{f}) \approx \alpha + \beta\cdot\tilde{f} - 2\sum_{k=1}^{K} d_k\cdot\sin(2\pi k\cdot\tilde{f}),$$
and where α is a constant phase offset equal to either 0 or π depending on the polarity of the time-domain waveform, β is a time offset of the waveform and d={dk} is the vector of the phase cepstral coefficients. It should be appreciated, however, that the continuous-in-frequency representation of the phase envelope of the initial DSR may be estimated in any other suitable way, as aspects of the disclosure provided herein are not limited in this respect.
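One way to fit and evaluate the phase model above is sketched below. Since the minimization problem itself is not reproduced in this text, the objective used here (an amplitude-weighted squared phase error with a small ridge penalty on β and d, with α supplied by the caller) is an assumption, as are the function names and the normalization of the Mel-frequencies to the range 0 to 0.5.

```python
import numpy as np


def phase_design_matrix(mel_freqs, num_coeffs):
    """Columns: f (for beta), then -2*sin(2*pi*k*f) for k = 1..K (for d_k)."""
    f = np.asarray(mel_freqs, dtype=float)
    k = np.arange(1, num_coeffs + 1)
    sines = -2.0 * np.sin(2.0 * np.pi * np.outer(f, k))
    return np.column_stack([f, sines])


def fit_phase_envelope(mel_freqs, phases, amps, num_coeffs, alpha=0.0, ridge=1e-6):
    """Weighted, ridge-regularized least-squares fit of beta and d (assumed objective)."""
    B = phase_design_matrix(mel_freqs, num_coeffs)
    w = np.asarray(amps, dtype=float)            # per-line weights (assumed: amplitudes)
    target = np.asarray(phases, dtype=float) - alpha
    BtW = B.T * w                                # B^T W with W = diag(w)
    lhs = BtW @ B + ridge * np.eye(B.shape[1])
    params = np.linalg.solve(lhs, BtW @ target)
    return params[0], params[1:]                 # beta, d


def eval_phase_envelope(mel_freqs, alpha, beta, d):
    """Sample the continuous phase envelope at the given frequencies."""
    B = phase_design_matrix(mel_freqs, len(d))
    return alpha + B @ np.concatenate([[beta], np.asarray(d, dtype=float)])
```

Calling eval_phase_envelope on any set of frequencies is what makes the representation continuous in frequency: the fitted parameters are not tied to the frequencies used for the fit.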
Next, process 300 proceeds to act 306c, where the amplitude envelope parameters obtained at act 306a and/or the phase envelope parameters obtained at act 306b may be quantized. In some embodiments, only the phase envelope parameters may be quantized. In some embodiments, only the amplitude envelope parameters may be quantized. In some embodiments, both the phase envelope parameters and the amplitude envelope parameters may be quantized. Any suitable quantization technique may be used, as aspects of the technology described herein are not limited in this respect.
Next, process 300 proceeds to act 306d, where the primary discrete spectral representation is obtained based on the phase envelope parameters and the amplitude envelope parameters obtained at act 306c. In some embodiments, the primary discrete spectral representation may comprise phase values obtained by evaluating (which may be thought of as sampling) the phase envelope, represented by the phase envelope parameters, at a set of discrete frequencies. Additionally, the primary discrete spectral representation may comprise amplitude values obtained by evaluating the amplitude envelope, represented by the amplitude envelope parameters, at a set of discrete frequencies. The phase and amplitude envelopes may be sampled at the same discrete set of frequencies. Accordingly, in some embodiments, the primary discrete spectral representation may comprise phase and amplitude values for each frequency in a discrete set of frequencies.
After the primary discrete spectral representation is obtained at acts 306a-306d, process 300 proceeds to act 308, where a residual discrete spectral representation is calculated based on the initial DSR obtained at act 304 and the primary DSR obtained at acts 306a-306d. In some embodiments, the residual DSR may be obtained by subtracting the primary DSR from the initial DSR. However, the residual DSR may be obtained in any other suitable way (e.g., weighted subtraction, frequency-dependent weighted subtraction, etc.), as aspects of the technology described herein are not limited in this respect.
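A minimal sketch of the plain-subtraction case follows. Representing each DSR as a vector of complex line-spectrum values (amplitude times a unit phasor) is an assumption made for illustration, as are the function names.

```python
import numpy as np


def to_complex(amps, phases):
    """Complex line spectrum A * exp(j * phi) from amplitudes and phases."""
    return np.asarray(amps, dtype=float) * np.exp(1j * np.asarray(phases, dtype=float))


def residual_dsr(initial, primary):
    """Residual DSR as the element-wise difference initial - primary."""
    return np.asarray(initial) - np.asarray(primary)


# Hypothetical two-line example: the primary DSR misses half of the second amplitude.
initial = to_complex([1.0, 2.0], [0.0, np.pi / 2])
primary = to_complex([1.0, 1.0], [0.0, np.pi / 2])
residual = residual_dsr(initial, primary)
```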
Next, process 300 proceeds to act 310, where the residual discrete spectral representation obtained at act 308 is encoded using a linear combination of codewords. The codewords in the linear combination may be selected from one or more codebooks of codewords. This may be done using any suitable selection technique. In some embodiments, the codewords in the linear combination may be selected from the codebook(s) iteratively (e.g., one at a time) using one or more selection criteria. For example, the codewords in the linear combination may be selected from the codebook(s) iteratively based, at least in part, on a perceptual weighting measure. In other embodiments, codewords in the linear combination may be selected from the codebook(s) jointly rather than iteratively, using any suitable selection criteria.
In some embodiments, the codewords in the linear combination may be selected from a codebook of sub-frame sub-band stochastic codewords. The codebook may have one or more stochastic codewords for each combination of sub-frames and sub-bands. For example, the codebook may include one or more stochastic codewords for each combination of a sub-frame of M sub-frames and a sub-band of N sub-bands. Such a codebook may include one or more stochastic codewords for each combination (i, j; 1≦i≦M; 1≦j≦N) where the index i represents the ith sub-frame and the index j represents the jth sub-band.
A particular sub-frame sub-band stochastic codeword (e.g., a codeword corresponding to the ith sub-frame and jth sub-band) may be generated by: (1) generating a stochastic time-domain signal (e.g., using Gaussian noise); (2) setting portions of the stochastic time-domain signal not corresponding to a sub-frame (e.g., portions of the stochastic time-domain signal outside of the ith sub-frame) to 0 to obtain a sub-frame codeword; (3) converting the sub-frame codeword to the frequency domain (e.g., via a discrete Fourier transform) to obtain a frequency-domain sub-frame codeword; and (4) setting values of the frequency domain sub-frame codeword to zero outside of a sub-band (e.g., the jth sub-band) to obtain the particular sub-frame sub-band stochastic codeword. However, a sub-frame sub-band codeword may be generated in any other suitable way, as aspects of the technology described herein are not limited in this respect.
As a specific non-limiting example, when the audio frame received at act 302 is 5 ms long, the codebook may comprise one or more stochastic codewords for each of four 1.25 ms sub-frames of the 5 ms frame and each of multiple sub-bands. One such codeword may be generated by: (1) generating a stochastic (e.g., Gaussian) time-domain signal that is 5 ms long; (2) setting the values of the stochastic time-domain signal outside of the 0-1.25 ms portion to 0 so as to obtain a sub-frame codeword; (3) transforming the sub-frame codeword to the frequency domain to obtain a frequency-domain sub-frame codeword; and (4) setting values of the frequency domain sub-frame codeword to zero outside of a sub-band (e.g., 500-1000 Hz or any other suitable sub-band) to obtain the codeword. Another such codeword may be generated by: (1) generating a stochastic (e.g., Gaussian) time-domain signal that is 5 ms long; (2) setting the values of the stochastic time-domain signal outside of the 1.25-2.5 ms portion to 0 so as to obtain a sub-frame codeword for the second sub-frame; (3) transforming the sub-frame codeword to the frequency domain to obtain a frequency-domain sub-frame codeword; and (4) setting values of the frequency domain sub-frame codeword to zero outside of a sub-band (e.g., 500-1000 Hz or any other suitable sub-band) to obtain the codeword.
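The four generation steps above can be sketched as follows. The sampling rate of 16 kHz (so that a 5 ms frame is 80 samples), the use of a real-input DFT on a uniform bin grid, and the function name make_codeword are assumptions for illustration; a codebook used in the line spectral domain may instead be evaluated at the sinusoid frequencies.

```python
import numpy as np


def make_codeword(frame_len, sample_rate, subframe, num_subframes, band_hz, rng):
    """Generate one sub-frame sub-band stochastic codeword (frequency domain)."""
    # (1) Stochastic (Gaussian) time-domain signal spanning the whole frame.
    noise = rng.standard_normal(frame_len)
    # (2) Zero everything outside the chosen sub-frame.
    sub_len = frame_len // num_subframes
    mask = np.zeros(frame_len)
    mask[subframe * sub_len: (subframe + 1) * sub_len] = 1.0
    sub_frame_codeword = noise * mask
    # (3) Transform to the frequency domain.
    spectrum = np.fft.rfft(sub_frame_codeword)
    # (4) Zero everything outside the chosen sub-band.
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    lo, hi = band_hz
    spectrum[(freqs < lo) | (freqs >= hi)] = 0.0
    return spectrum


# First 1.25 ms sub-frame of a 5 ms frame, 500-1000 Hz sub-band.
rng = np.random.default_rng(0)
codeword = make_codeword(frame_len=80, sample_rate=16000, subframe=0,
                         num_subframes=4, band_hz=(500.0, 1000.0), rng=rng)
```

A full codebook would repeat this for every (sub-frame, sub-band) pair, possibly with several independent noise draws per pair.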
A specific non-limiting example of a technique for iteratively selecting a linear combination of K codewords {xk} from a codebook in the line spectral domain is described next. Let $S_0=\mathrm{diag}(A_0\times e^{j\varphi_0})$ be a diagonal matrix whose main diagonal is the primary discrete spectral representation obtained at acts 306a-306d, where A0 is a vector of sinusoidal amplitudes (e.g., obtained, at act 306d, by evaluating the amplitude envelope of the initial DSR at a discrete set of frequencies), φ0 is a set of sinusoidal phases (e.g., obtained, at act 306d, by evaluating the phase envelope of the initial DSR at the discrete set of frequencies), and × denotes component-wise multiplication. Let S be the initial discrete spectral representation obtained at act 304; then S may be approximated (the approximation being denoted as Ŝ) using S0, which represents the primary discrete spectral representation, and K codewords {xk} according to:
$$S \approx \hat{S} = S_0\left(\sum_{k=1}^{K}\alpha_k x_k + 1\right),$$
where the set {αk} is a set of weights. The overall phase approximation of the initial discrete spectral representation S is then given by $\hat{\varphi}=\mathrm{angle}(\hat{S})$.
Given a codebook (e.g., a codebook in which each codeword represents a certain sub-frame and a certain sub-band), the codebook may be iteratively searched K times to identify the K codewords {xk} and the corresponding weights {αk} to use for approximating S. During each iteration, a codeword and corresponding gain may be selected based on a perceptual measure. For example, a codeword and corresponding gain that provide the least distortion in a perceptually weighted spectral domain may be selected, as described below.
Let the partial approximation Ŝr of S formed by using r codewords be defined according to:
The partial approximation Ŝr may be defined recursively by:
$$\hat{S}_0 = S_0,$$
$$\hat{S}_r = \hat{S}_{r-1} + S_0\alpha_r x_r.$$
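As a consistency check, unrolling the recursion above for r steps gives a closed form that, for r = K, coincides with the approximation Ŝ of S given earlier. The following follows directly from the base case and the recursive step; the symbol 1 denotes a vector of ones, as in the approximation of S.

```latex
\hat{S}_r
  = \hat{S}_{r-1} + S_0\,\alpha_r x_r
  = S_0 + S_0\sum_{k=1}^{r}\alpha_k x_k
  = S_0\left(\sum_{k=1}^{r}\alpha_k x_k + 1\right),
\qquad
\hat{S}_K = \hat{S}.
```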
Let $\tilde{s}_r=\hat{S}_r-\hat{S}_{r-1}$ denote the partial line spectrum residual, W be a diagonal matrix representing a perceptual weighting filter, and $x_i$ be the ith codeword, then the optimal gains are given by:
and the codeword indices and corresponding weights are selected according to
Thus, at each iteration, the index of the codeword selected is given by ir* and the corresponding weight of that codeword is given by gi*,x.
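The iterative search can be sketched as a greedy, matching-pursuit-style loop. Because the gain and selection expressions are not reproduced in this text, the sketch makes its own assumptions: the residual at each iteration is taken as the difference between S and the current partial approximation, the gain is the real-valued least-squares gain in the perceptually weighted domain, and the codeword giving the largest weighted error reduction is selected. The function name greedy_select and the array layout are likewise hypothetical.

```python
import numpy as np


def greedy_select(S, S0, codebook, perceptual_weights, num_codewords):
    """Greedily pick codewords and real gains so that S0 * (1 + sum_k a_k x_k) approximates S.

    S, S0: complex vectors (diagonals of the line-spectrum matrices).
    codebook: complex array of shape (num_entries, num_lines).
    perceptual_weights: real vector, the diagonal of W.
    """
    S = np.asarray(S, dtype=complex)
    S0 = np.asarray(S0, dtype=complex)
    W = np.asarray(perceptual_weights, dtype=float)
    approx = S0.copy()                                 # zeroth partial approximation
    atoms = W * (S0 * codebook)                        # rows are W S0 x_i
    energies = np.sum(np.abs(atoms) ** 2, axis=1)
    energies = np.where(energies > 0, energies, np.inf)  # ignore all-zero atoms
    indices, gains = [], []
    for _ in range(num_codewords):
        weighted_residual = W * (S - approx)
        corr = np.real(atoms.conj() @ weighted_residual)
        best = int(np.argmax(corr ** 2 / energies))    # largest weighted error reduction
        gain = corr[best] / energies[best]             # least-squares real gain
        approx = approx + S0 * gain * codebook[best]   # recursive update of the approximation
        indices.append(best)
        gains.append(float(gain))
    return indices, gains, approx
```

The decoder needs only the selected indices and gains, together with the primary DSR, to rebuild the approximation and take its angle as the phase estimate.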
After the residual DSR is encoded at act 310, process 300 proceeds to act 312, where parameters representing the estimated primary DSR and the encoded residual DSR are output. The parameters representing the estimated primary DSR may include the amplitudes and phases obtained at act 306d. In embodiments where the signal in the audio frame was phase aligned by a time-domain shift τ, the parameters representing the estimated primary DSR may include the time-domain shift τ. The parameters representing the encoded residual DSR may include the indices of the codewords selected to represent the residual DSR and the corresponding weights.
The parameters representing the estimated primary DSR and the encoded residual DSR may be output in any suitable way. For example, the parameters may be provided to an application program, an operating system, transmitted to a remote computing device, stored, output in a combination of any of these ways or in any other suitable way. In some embodiments, the parameters representing the estimated primary DSR and the encoded residual DSR may be quantized prior to being output. The parameters may be quantized using a split VQ scheme or any other suitable quantization technique, as aspects of the technology described herein are not limited in this respect. After the parameters representing the estimated primary DSR and the encoded residual DSR are output, process 300 completes.
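For illustration, a split VQ scheme partitions a parameter vector into sub-vectors and quantizes each sub-vector with its own codebook by nearest-neighbour search. The sketch below assumes the codebooks have already been trained; the function names and the toy codebooks are hypothetical.

```python
import numpy as np


def split_vq_encode(vector, codebooks):
    """Return one codebook index per sub-vector.

    codebooks: list of arrays, each of shape (num_entries, sub_dim); the
    sub_dim values must sum to len(vector).
    """
    vector = np.asarray(vector, dtype=float)
    if sum(cb.shape[1] for cb in codebooks) != len(vector):
        raise ValueError("codebook dimensions do not match the vector length")
    indices, start = [], 0
    for cb in codebooks:
        sub = vector[start:start + cb.shape[1]]
        # Nearest codebook entry in squared Euclidean distance.
        indices.append(int(np.argmin(np.sum((cb - sub) ** 2, axis=1))))
        start += cb.shape[1]
    return indices


def split_vq_decode(indices, codebooks):
    """Rebuild the quantized vector from the per-sub-vector indices."""
    return np.concatenate([cb[i] for i, cb in zip(indices, codebooks)])


# Toy codebooks: a 2-dimensional split followed by a 1-dimensional split.
codebooks = [np.array([[0.0, 0.0], [1.0, 1.0]]), np.array([[0.0], [2.0], [4.0]])]
indices = split_vq_encode([0.9, 1.2, 2.6], codebooks)
quantized = split_vq_decode(indices, codebooks)
```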
It should be appreciated that process 300 is illustrative and that there are variations of process 300. For example, process 300 may be adapted for use in the context of speech synthesis. In this variation, process 300 may be modified to not perform act 302, but to begin at act 304 in which an initial discrete spectral representation for a frame to be synthesized is received. For example, at act 304 in the modified process, a set of amplitudes and phases for each of a discrete set of frequencies may be received.
Aspects of the technology described herein are further illustrated in the block diagrams shown in
An illustrative implementation of a computer system 500 that may be used in connection with any of the embodiments of the disclosure provided herein is shown in
The terms “program” or “software” are used herein in a generic sense to refer to any type of computer code or set of processor-executable instructions that can be employed to program a computer or other processor to implement various aspects of embodiments as discussed above. Additionally, it should be appreciated that according to one aspect, one or more computer programs that when executed perform methods of the disclosure provided herein need not reside on a single computer or processor, but may be distributed in a modular fashion among different computers or processors to implement various aspects of the disclosure provided herein.
Processor-executable instructions may be in many forms, such as program modules, executed by one or more computers or other devices. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments.
Also, data structures may be stored in one or more non-transitory computer-readable storage media in any suitable form. For simplicity of illustration, data structures may be shown to have fields that are related through location in the data structure. Such relationships may likewise be achieved by assigning storage for the fields with locations in a non-transitory computer-readable medium that convey relationship between the fields. However, any suitable mechanism may be used to establish relationships among information in fields of a data structure, including through the use of pointers, tags or other mechanisms that establish relationships among data elements.
Also, various inventive concepts may be embodied as one or more processes, of which examples have been provided. The acts performed as part of each process may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments.
All definitions, as defined and used herein, should be understood to control over dictionary definitions, and/or ordinary meanings of the defined terms.
As used herein in the specification and in the claims, the phrase “at least one,” in reference to a list of one or more elements, should be understood to mean at least one element selected from any one or more of the elements in the list of elements, but not necessarily including at least one of each and every element specifically listed within the list of elements and not excluding any combinations of elements in the list of elements. This definition also allows that elements may optionally be present other than the elements specifically identified within the list of elements to which the phrase “at least one” refers, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, “at least one of A and B” (or, equivalently, “at least one of A or B,” or, equivalently “at least one of A and/or B”) can refer, in one embodiment, to at least one, optionally including more than one, A, with no B present (and optionally including elements other than B); in another embodiment, to at least one, optionally including more than one, B, with no A present (and optionally including elements other than A); in yet another embodiment, to at least one, optionally including more than one, A, and at least one, optionally including more than one, B (and optionally including other elements); etc.
The phrase “and/or,” as used herein in the specification and in the claims, should be understood to mean “either or both” of the elements so conjoined, i.e., elements that are conjunctively present in some cases and disjunctively present in other cases. Multiple elements listed with “and/or” should be construed in the same fashion, i.e., “one or more” of the elements so conjoined. Other elements may optionally be present other than the elements specifically identified by the “and/or” clause, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, a reference to “A and/or B”, when used in conjunction with open-ended language such as “comprising” can refer, in one embodiment, to A only (optionally including elements other than B); in another embodiment, to B only (optionally including elements other than A); in yet another embodiment, to both A and B (optionally including other elements); etc.
Use of ordinal terms such as “first,” “second,” “third,” etc., in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another or the temporal order in which acts of a method are performed. Such terms are used merely as labels to distinguish one claim element having a certain name from another element having a same name (but for use of the ordinal term).
The phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting. The use of “including,” “comprising,” “having,” “containing”, “involving”, and variations thereof, is meant to encompass the items listed thereafter and additional items.
Having described several embodiments of the techniques described herein in detail, various modifications and improvements will readily occur to those skilled in the art. Such modifications and improvements are intended to be within the spirit and scope of the disclosure. Accordingly, the foregoing description is by way of example only, and is not intended as limiting. The techniques are limited only as defined by the following claims and the equivalents thereto.
Number | Date | Country |
---|---|---|
20160300580 A1 | Oct 2016 | US |