 
                 Patent Application
 Patent Application
                     20120239387
 20120239387
                    This invention relates to the field of voice transformation or voice morphing with encoded information. In particular, the invention relates to voice transformation for preventing fraudulent use of modified speech.
Voice transformation enables speech samples from one person to be modified so that they sound as if they were spoken by someone else. There are two types of transformations:
There are many uses for voice transformation. The following are some examples:
However, in the wrong hands voice transformation tools can also be used improperly. Examples of improper use include the following
At present, it is usually possible to distinguish between a natural and transformed voice and it is not possible to mimic fully a different speaker. However, as research progresses it is expected that within a few years the quality of voice transformation system might be high enough to be indistinguishable from natural voice and indistinguishable from a copied speaker.
According to a first aspect of the present invention there is provided a method for voice transformation, comprising: transforming a source speech using transformation parameters; encoding information on the transformation parameters in an output speech using steganography; wherein the source speech can be reconstructed using the output speech and the information on the transformation parameters.
According to a second aspect of the present invention there is provided a method for reconstructing a voice transformation, comprising: receiving an output speech of a voice transformation system wherein the output speech is transformed speech which has encoded information on the transformation parameters using steganography; extracting the information on the transformation parameters; and carrying out an inverse transformation of the output speech to obtain an approximation of an original source speech.
According to a third aspect of the present invention there is provided a system for voice transformation comprising: a processor; a voice transformation component for transforming a source speech using transformation parameters; and a steganography component for encoding information on the transformation parameters in an output speech using steganography; wherein the source speech can be reconstructed using the output speech and the information on the transformation parameters.
According to a fourth aspect of the present invention there is provided a system for reconstructing a voice transformation, comprising: a processor; a speech receiver for receiving an input speech, wherein the input speech is transformed speech which has encoded information on the transformation parameters using steganography; a steganography decoder component for decoding the information on the transformation parameters from the input speech; and a voice reconstruction component for carrying out an inverse transformation of the input speech to obtain an approximation of an original source speech.
According to a fifth aspect of the present invention there is provided a computer program product for voice transformation, the computer program product comprising: a computer readable storage medium having computer readable program code embodied therewith, the computer readable program code comprising: computer readable program code configured to: transform a source speech using transformation parameters; and encode information on the transformation parameters in an output speech using steganography; wherein the source speech can be reconstructed using the output speech and the information on the transformation parameters.
The subject matter regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of the specification. The invention, both as to organization and method of operation, together with objects, features, and advantages thereof, may best be understood by reference to the following detailed description when read with the accompanying drawings in which:
    
    
    
    
    
    
    
    
It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numbers may be repeated among the figures to indicate corresponding or analogous features.
In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will be understood by those skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, and components have not been described in detail so as not to obscure the present invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.
Method, system and computer program product are described in which steganography or watermarking data is added to transformed speech so it can be identified and transformed back to the original voice. Adding steganographic data to the speech has only small impact on quality so the output of the system will still be usable for most ordinary applications.
Transformation parameters are encoded into the transformed speech by means of steganography so that the original speech can be reconstructed. The transformation parameters can be retrieved from the transformed speech and used to reconstruct the original speech by applying the inverse transform.
In one embodiment, the transformation parameters may be added using steganography after the voice transformation has taken place.
In another embodiment, a voice transformation system may encode the transformation parameters by encoding the transformation parameters in the modulation of the parameters of the transformed speech.
In some cases the transformation can not be inverted. In such cases, the encoded transformation parameters are those that when applied to the modified speech should bring it as close as possible to the original speech. Instead of encoding the transformation parameters themselves, the inverse parameters may be encoded.
If someone uses this to commit a fraudulent or criminal act (for example, calling a bank while impersonating a different person) then the watermarking in the recorded speech can be detected and used to invert the transformed speech back to the original speech (or a close approximation to it). This can be used later to trace or detect the user.
Anyone who would like to avoid the possibility that someone might be calling them while using a voice transformation system may add a system that detects if the watermarking is present and issues an alert if it exists in the incoming speech.
Referring to 
Voice transformation systems apply different transforms on the input speech depending on different tunable parameters. Examples of tunable parameters include: pitch modification parameters, spectral transformation matrices, Gaussian mixtures (GMM) coefficients, speed up/slow down ratios, noise level modification parameters, etc. The parameters may be selected from a list of preset configurations, tuned manually, or trained automatically by comparing speech samples originating from the two voices.
The transformation parameters used in the voice transformation are determined 104 and information on the transformation parameters is generated 105. The information on the transformation parameters may be one of the following: the transformation parameters themselves, inverse transformation parameters, encoded or encrypted transformation parameters or inverse transformation parameters, or an approximation of the transformation parameters or inverse transformation parameters.
This information on the transformation parameters may include an index into a remote database where the parameters themselves are stored. The index may allow the retrieval of the parameters from the database. For example, the transformation parameters may be placed on a web site and the URL of those parameters (e.g. http://www . . . ) may be encoded into the speech.
The information on the transformation parameters may include quantized transformation parameters from the voice transformation system (or the inverse transformation parameters) which are encoded in a binary form and, possibly, also compressed and encrypted. The binary data may then be encoded into the output speech using a stenography method.
The transformed speech has a steganography method applied 106 to encode the information on the transformation parameters into the transformed speech. This is done by combining the information on the transformation parameters as a steganography signal (as hidden data or a watermark) with the transformed speech to generate output speech 107. Steganography methods applied to audio data may range from simple algorithms that insert information in the form of signal noise, to complex algorithms exploiting sophisticated signal processing techniques to hide the information. Some examples of audio steganography include LSB (least significant bit) coding, parity coding, phase coding, spread spectrum and echo hiding.
Some steganographic algorithms work by manipulating different speech parameters. Those algorithms can operate directly inside the voice transformation system and this is described in the second embodiment of the described method with reference to 
Referring to 
Transformation parameters are generated 204 which are applied to the model parameters to modify 205 the model parameters of the source speech.
Information on the transformation parameters may be generated 206 as in the method of 
The information on the transformation parameters is applied in a steganography method by encoding 207 within the modified model parameters. The encoded modified model parameters are then applied 208 in the final speech synthesis and an output speech 209 is generated.
In the second embodiment, the encoded transformation coefficients are combined with the transformed speech parameters. For example, the coefficients can be encoded as small variations on the modified pitch curve of the final voice.
For example, the transformation data may be encoded in the pitch curve by the voice transformation system. Voice transformation systems usually control the pitch curve of the output signal. The pitch is usually adjusted for each short frame (5-20 msec). The integer pitch in Hertz pn can be taken for frame n and the last bit replaced with a bit from the data dn:
  
    
  
The output speech signal is then synthesized with the new pitch p′n instead of pn. The effect is practically inaudible to a human ear but enables 1 bit/frame to be encoded. To extract the data from the output speech a pitch detector is applied on the audio in order to compute the pitch curve and then the last bit of the pitch value from each frame is extracted.
Referring to 
A transformed speech is received 301 and the presence of a watermark or other steganographic data is detected 302. An alert may be issued 303 on detection of steganographic data to alert a receiver to the fact that the received speech is transformed speech and not in the original voice.
The steganographic data is decoded 304 and information on the transformation parameters is extracted 305. If the information on the transformation parameters is an index to the transformation parameters stored elsewhere, the transformation parameters are retrieved. The information on the transformation parameters is applied to inversely transform 306 the received speech to obtain 307 as close to the original speech as possible.
Some or all of the information on the transformation parameters encoded by the steganography may also be encrypted by various ciphers known in the literature. This way only those who have access to the decipher key (e.g. law enforcement agencies) can decipher the information on the transformation parameters and transform the speech back to the original voice.
Instead of encoding the transformation parameters the system may encode the inverse parameters. If the transformation is not invertible (e.g. the sample rate is reduced) then the system can encode the parameters that will bring the transformed voice back as close as possible to the original voice.
The voice transformation parameter set is usually computed by an optimization process that finds the best parameters that when applied to the set of source speech samples will make them sound as close as possible to a set of a target sample. Some of those parameters have simple inversion. For example, if to get from the source to the destination the pitch has been increased by Δp, then to reverse the process the pitch should be lowered by Δp. However, since the synthesis process is not linear and since some parameters are dynamically selected based on the source signal then it is not always easy to invert the process.
One embodiment used in the described method, trains a new set of inverse voice transformation parameters that best transform the synthesized speech into the source speech and encodes those parameters within the transformed speech.
Referring to 
The inverse parameters may be trained by inputting the transformed speech 406 and the source speech 401 to train 409 inverse parameters 410. The trained inverse parameters may be used to reconstruct the transformed speech to as close as possible to the source speech.
Referring to 
A transformation parameter compiling component 520 may be provided which compiles the transformation parameters 511 into information 521 to be encoded. The transformation parameter compiling component 520 may include a quantizing component 522 for quantizing the parameters, a binary stream component 523 for converting the quantized parameters into a binary stream, a compression component 524 for compressing the information, and an encryption component 525 for encrypting the information. The transformation parameter compiling component 520 may also include an inverse parameter training component 526 for providing inverse transformation parameters from the input speech and the transformed speech. The transformation parameter compiling component 520 may include an index component 527 for indexing remotely stored transformation parameters in the information 521 to be encoded.
A steganography component 530 is provided for encoding the information 521 on the transformation parameters into the transformed speech 512 to produce encoded transformed speech 531. A speech output component 540 may be provided for outputting the transformed speech with encoded transformation parameter information.
Referring to 
The voice transformation system 600 may include a speech receiver 601 for receiving source speech 602 to be processed. A speech modelling component 603 is provided which generates model parameters 604 of the source speech 602. A transformation parameter component 605 generates transformation parameters 606 to be used. A parameter modification component 607 may be provided for applying the transformation parameters 606 to the model parameters 604 to obtain modified model parameters 608.
A transformation parameter compiling component 620 may be provided which compiles the transformation parameters 606 into information 621 to be encoded. The compiling component 620 may include one or more of the components described in relation to the compiling component 520 of 
A steganography component 630 is provided for encoding the information 621 into the modified model parameters 608 to generate encoded modified model parameters 631.
A speech synthesis component 640 may be provided for synthesizing the source speech with the encoded modified model parameters 631 to generate encoded transformed speech 641. A speech output component 650 is provided for outputting a speech output in the form of the transformed speech with encoded transformation parameter information.
Referring to 
A steganography decoder component 710 may be provided to extract the encoded information on the transformation parameters. The decoder component 710 may include a deciphering component 711 for deciphering the encoded information if it is encrypted. A parameter reconstruction component 720 may be provided to reconstruct the transformation parameters or inverse transformation parameters from the encoded information. The parameter reconstruction component 720 may retrieve indexed transformation parameters from a remote location.
A voice reconstruction component 730 may be provided to reconstruct the source speech or as close to the original source speech as possible. An output component 740 may be provided to output the reconstructed speech.
Referring to 
The memory elements may include system memory 802 in the form of read only memory (ROM) 804 and random access memory (RAM) 805. A basic input/output system (BIOS) 806 may be stored in ROM 804. System software 807 may be stored in RAM 805 including operating system software 808. Software applications 810 may also be stored in RAM 805.
The system 800 may also include a primary storage means 811 such as a magnetic hard disk drive and secondary storage means 812 such as a magnetic disc drive and an optical disc drive. The drives and their associated computer-readable media provide non-volatile storage of computer-executable instructions, data structures, program modules and other data for the system 800. Software applications may be stored on the primary and secondary storage means 811, 812 as well as the system memory 802.
The computing system 800 may operate in a networked environment using logical connections to one or more remote computers via a network adapter 816.
Input/output devices 813 can be coupled to the system either directly or through intervening I/O controllers. A user may enter commands and information into the system 800 through input devices such as a keyboard, pointing device, or other input devices (for example, microphone, joy stick, game pad, satellite dish, scanner, or the like). Output devices may include speakers, printers, etc. A display device 814 is also connected to system bus 803 via an interface, such as video adapter 815.
A voice transformation system with the above components may be provided as a service to a customer over a network. The detection of a transformed voice and the conversion back to the original voice may also be provided as a service to a customer over a network.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Aspects of the present invention are described above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.