This application generally relates to analyzing audio signals for voice authentication. In particular, this application relates to identifying synthetic speech or other types of fraudulent replay attacks in audio signals.
Rapid development in generative artificial intelligence (AI) has resulted in several large models that are able to synthesize high-quality realistic text, images, video and audio commonly known as “deepfakes.” While there may be many benefits from such synthetic data, there is also a great risk to mass media and telecommunications. As an example, text-to-speech (TTS) synthesis can generate a synthetic voice that is designed to imitate the voice of a real person. Credible voice imitation can be achieved with as little as a few seconds of recorded speech from a target speaker. There are a number of threats emerging from voice-cloning technology, including threats to institutions that use voice to identify and authenticate a person. Voice recognition is frequently used in call centers, voice assistants and other IoT applications to verify the authenticity of a user through their voice. It is, therefore, beneficial to ensure that these systems are robust to attacks using synthetically-generated speech.
Disclosed herein are systems and methods that may address the above-described shortcomings and may also provide any number of additional or alternative benefits and advantages. Voice synthesis, voice conversion and modification may also be used to evade detection by voice-based fraud-detection tools. Embodiments discussed herein provide for a centralized analysis service that can define and/or enforce a watermarking standard for synthetic speech. Text-to-speech (TTS) services can include watermarks complying with the watermarking standard in synthetic speech (e.g., generated speech, modified speech, converted speech) such that the centralized analysis service can identify synthetic speech from a plurality of different TTS services, providing high confidence as to whether audio includes synthetic speech. Furthermore, the watermarks can include metadata indicating an origin of synthetic speech (e.g., which TTS service), when the synthetic speech was generated, who generated the synthetic speech, and/or whether the synthetic speech is generated using audio authorized for use in generating synthetic speech. In an example, the analysis service detects a watermark and extracts metadata of the watermark to determine which TTS service generated the synthetic speech, which user of the TTS service generated the synthetic speech, and/or whether the synthetic speech was authorized by a person whose voice was used to generate the synthetic speech. The analysis service can transmit an alert to the TTS service of an attempted use of the synthetic speech to allow the TTS service to sanction improper use of the TTS service.
Aspects of the present disclosure are directed to a computer-implemented method including obtaining, by a computer, an audio signal including synthetic speech, extracting, by the computer, metadata from a watermark of the audio signal by applying a set of keys associated with a plurality of TTS services to the audio signal, the metadata indicating an origin of the synthetic speech in the audio signal, and generating, by the computer, based on the metadata as extracted from the watermark, a notification indicating that the audio signal includes the synthetic speech.
In some implementations, the method includes generating a score for each key of the set of keys to determine that the audio signal includes the watermark, wherein the watermark was generated using the key of the set of keys. In some implementations, the method includes transmitting the key to a TTS service to generate the watermark. In some implementations, the metadata includes one or more of a service identifier of a TTS service, a model identifier of a TTS model, a user identifier of a user of the TTS service, or a timestamp indicating when the synthetic speech was generated. In some implementations, the method includes transmitting an alert to a TTS service based on the origin of the synthetic speech in the audio signal. In some implementations, the notification includes a portion of the metadata as extracted from the watermark.
In some implementations, the method includes receiving, by the computer, from the origin of the synthetic speech, the audio signal including the watermark, determining, by the computer, that a robustness of the watermark exceeds a predetermined threshold, and transmitting an approval of the watermark to the origin of the synthetic speech. In some implementations, the watermark includes a consent watermark, and wherein the notification indicates usage consent parameters of the consent watermark. In some implementations, the watermark includes an authorization watermark, and wherein the notification indicates authorization parameters of the authorization watermark.
In some implementations, the method includes obtaining, by the computer, a second audio signal including second synthetic speech, extracting, by the computer, second metadata from a second watermark of the second audio signal, the second metadata indicating a second origin of the second synthetic speech that is different from the origin of the audio signal including the synthetic speech, and generating, by the computer, a second notification indicating that the second audio signal includes the second synthetic speech.
Aspects of the present disclosure are directed to a system including a computing device including at least one processor, configured to obtain an audio signal including synthetic speech, extract metadata from a watermark of the audio signal by applying a set of keys associated with a plurality of TTS services to the audio signal, the metadata indicating an origin of the synthetic speech in the audio signal, and generate, based on the metadata as extracted from the watermark, a notification indicating that the audio signal includes the synthetic speech.
In some implementations, the computing device is further configured to generate a score for each key of the set of keys to determine that the audio signal includes the watermark, the key used to generate the watermark. In some implementations, the computing device is further configured to transmit the key to a TTS service to generate the watermark. In some implementations, the metadata includes one or more of a service identifier of a TTS service, a model identifier of a TTS model, a user identifier of a user of the TTS service, or a timestamp indicating when the synthetic speech was generated. In some implementations, the computing device is configured to transmit an alert to a TTS service based on the origin of the synthetic speech in the audio signal. In some implementations, the notification includes a portion of the metadata as extracted from the watermark.
In some implementations, the computing device is configured to receive from the origin of the synthetic speech, the audio signal including the watermark, determine that a robustness of the watermark exceeds a predetermined threshold, and transmit an approval of the watermark to the origin of the synthetic speech. In some implementations, the watermark includes a consent watermark, and wherein the notification indicates one or more usage consent parameters of the consent watermark.
In some implementations, the watermark includes an authorization watermark, and wherein the notification indicates one or more authorization parameters of the authorization watermark. In some implementations, the computing device is configured to obtain a second audio signal including second synthetic speech, extract second metadata from a second watermark of the second audio signal, the second metadata indicating a second origin of the second synthetic speech that is different from the origin of the audio signal including the synthetic speech, and generate a second notification indicating that the second audio signal includes the second synthetic speech.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are intended to provide further explanation of the invention(s) and features as claimed.
The present disclosure can be better understood by referring to the following figures. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the disclosure. In the figures, reference numerals designate corresponding parts throughout the different views.
Reference will now be made to the illustrative embodiments illustrated in the drawings, and specific language will be used here to describe the same. It will nevertheless be understood that no limitation of the scope of the invention is thereby intended. Alterations and further modifications of the inventive features illustrated here, and additional applications of the principles of the inventions as illustrated here, which would occur to a person skilled in the relevant art and having possession of this disclosure, are to be considered within the scope of the invention.
The system 100 describes an embodiment of call risk analysis, which in some embodiments may include caller identification, performed by the analytics system 101 on behalf of the provider system 110. The risk analysis operations are based on audio watermarks indicating the presence of synthetic speech and/or other characteristics of a projected audio wave or observed audio signal captured by a microphone of an end-user device 114. The analytics server 102 executes software programming of a machine-learning architecture having various types of functional engines, implementing certain machine-learning techniques and machine-learning models for analyzing the call audio data, which the analytics server 102 receives from the provider system 110. The analytics server 102 may execute various algorithms for detecting audio watermarks and extracting metadata of the audio watermarks to identify synthetic speech. The machine-learning architecture and/or algorithms of the analytics server 102 analyze the various forms of the call audio data to perform the various risk assessment or caller identification operations.
The TTS system 120 may encode watermarks in synthetic speech generated by the TTS system 120 and provide information regarding the watermarks, such as keys used to encode and decode the watermarks, to the analytics system 101. In some implementations, the analytics system 101 provides the keys to the TTS system 120 or encodes the watermarks in the synthetic speech for the TTS system 120. In this way, the analytics system 101 is able to detect and decode (e.g., extract metadata from) the watermarks in order to identify the synthetic speech (e.g., by applying the keys to the synthetic speech). Thus, when users access the TTS system 120 to generate the synthetic speech and attempt to use the synthetic speech at the service provider system 110, the analytics system 101 is able to identify the synthetic speech on behalf of the service provider system 110.
The various components of the system 100 may be interconnected with each other through hardware and software components of one or more public or private networks. Non-limiting examples of such networks may include: Local Area Network (LAN), Wireless Local Area Network (WLAN), Metropolitan Area Network (MAN), Wide Area Network (WAN), and the Internet. The communication over the network may be performed in accordance with various communication protocols, such as Transmission Control Protocol and Internet Protocol (TCP/IP), User Datagram Protocol (UDP), and IEEE communication protocols. Likewise, the end-user devices 114 may communicate with callees (e.g., service provider systems 110) via telephony and telecommunications protocols, hardware, and software capable of hosting, transporting, and exchanging audio data associated with telephone calls. Non-limiting examples of telecommunications hardware may include switches and trunks, among other additional or alternative hardware used for hosting, routing, or managing telephone calls, circuits, and signaling. Non-limiting examples of software and protocols for telecommunications may include SS7, SIGTRAN, SCTP, ISDN, and DNIS among other additional or alternative software and protocols used for hosting, routing, or managing telephone calls, circuits, and signaling. Components for telecommunications may be organized into or managed by various different entities, such as, for example, carriers, exchanges, and networks, among others.
The analytics system 101, the provider system 110, and the TTS system 120 are network system infrastructures 101, 110, 120 comprising physically and/or logically related collections of software and electronic devices managed or operated by various enterprise organizations. The devices of each network system infrastructure 101, 110, 120 are configured to provide the intended services of the particular enterprise organization.
The analytics system 101 is operated by a call analytics service that provides various call management, security, authentication (e.g., speaker verification), and analysis services to customer organizations (e.g., corporate call centers, government entities). Components of the call analytics system 101, such as the analytics server 102, execute various processes using audio data in order to provide various call analytics services to the organizations that are customers of the call analytics service. In operation, a caller uses a caller end-user device 114 to originate a telephone call to the service provider system 110. The microphone of the caller device 114 observes the caller's speech and generates the audio data represented by the observed audio signal.
In some implementations, the audio data includes an audio watermark (e.g., watermark identifying the end-user device, watermark identifying the caller, watermark indicating synthetic speech) corresponding to synthetic speech generated by the TTS system 120. The caller device 114 transmits the audio data to the service provider system 110. The interpretation, processing, and transmission of the audio data may be performed by components of telephony networks and carrier systems (e.g., switches, trunks), as well as by the caller devices 114. The service provider system 110 then transmits the audio data to the call analytics system 101 to perform various analytics and downstream audio processing operations. It should be appreciated that analytics servers 102, analytics databases 104, and admin devices 103 may each include or be hosted on any number of computing devices comprising a processor and software and capable of performing various processes described herein.
The service provider system 110 is operated by an enterprise organization (e.g., corporation, government entity) that is a customer of the call analytics service. In operation, the service provider system 110 receives the audio data and/or the observed audio signal associated with the telephone call from the caller device 114. The audio data may be received and forwarded by one or more devices of the service provider system 110 to the call analytics system 101 via one or more networks. For instance, the customer may be a bank that operates the service provider system 110 to handle calls from consumers regarding accounts and product offerings. Being a customer of the call analytics service, the bank's service provider system 110 (e.g., bank's call center) forwards the audio data associated with the inbound calls from consumers to the call analytics system 101, which in turn performs various processes using the audio data, such as analyzing the audio data to detect synthetic speech used to impersonate a customer of the bank, among other voice or audio processing services for risk assessment or speaker identification. It should be appreciated that service provider servers 111, provider databases 112, and agent devices 116 may each include or be hosted on any number of computing devices comprising a processor and software and capable of performing various processes described herein.
The end-user device 114 may be any communications or computing device the caller operates to place the telephone call to the call destination (e.g., the service provider system 110). The end-user device 114 may comprise, or be coupled to, a microphone. Non-limiting examples of caller devices 114 may include landline phones 114a and mobile phones 114b. It should be appreciated that the caller device 114 is not limited to telecommunications-oriented devices (e.g., telephones). As an example, a calling end-user device 114 may include an electronic device comprising a processor and/or software, such as a computing device 114c or Internet of Things (IoT) device, configured to implement voice-over-IP (VOIP) telecommunications. As another example, the caller computing device 114c may be an electronic IoT device (e.g., voice assistant device, “smart device”) comprising a processor and/or software capable of utilizing telecommunications features of a paired or otherwise networked device, such as a mobile phone 114b.
In the example embodiment of
The analytics server 102 of the call analytics system 101 may be any computing device comprising one or more processors and software, and capable of performing the various processes and tasks described herein. The analytics server 102 may host or be in communication with the analytics database 104 and may receive and process the audio data from the one or more service provider systems 110. Although
In operation, the analytics server 102 may execute various software-based processes on the call data, which may include detection and/or identification of watermarks (i.e., audio watermarks) and extraction of metadata of the watermarks to identify synthesized speech. The operations of the analytics server 102 may include, for example, receiving the observed audio signal associated with the calling device 114, parsing the observed audio signal into frames and sub-frames, applying operations, in combination with a set of keys, to the observed audio signal to determine whether a corresponding watermark is present, and extracting metadata of the corresponding watermark. The set of keys may be used to generate watermarks associated with a standard. In this way, the compliance with the standard can be detected, and use of synthetic speech generated in compliance with the watermark standard can be detected along with metadata regarding the generation of the synthetic speech.
The analytics server 102 may perform various pre-processing operations on the observed audio signal during deployment. The pre-processing operations can advantageously improve the speed at which the analytics server 102 operates or reduce the demands on computing resources when analyzing the observed audio signal.
During pre-processing, the analytics server 102 parses the observed audio signal into audio frames containing portions of the audio data and scales the audio data embedded in the audio frames. The analytics server 102 further parses the audio frames into overlapping sub-frames. The frames may be portions or segments of the observed audio signal having a fixed length across the time series, where the length of the frames may be pre-established or dynamically determined. The sub-frames of a frame may have a fixed length that overlaps with adjacent sub-frames by some amount across the time series. For example, a one-minute observed audio signal could be parsed into sixty frames with a one-second length. Each frame may be parsed into four 0.25 sec sub-frames, where the successive sub-frames overlap by 0.10 sec.
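As an illustration of this parsing step, the following Python sketch splits a signal into fixed-length frames and overlapping sub-frames. The sample rate, frame length, sub-frame length, and overlap are assumed values chosen to mirror the example above, and the exact number of sub-frames per frame depends on those choices and on how frame boundaries are handled; this is not a description of the specific parsing routine used by the analytics server 102.

```python
import numpy as np

def parse_frames(signal: np.ndarray, sample_rate: int,
                 frame_sec: float = 1.0,
                 subframe_sec: float = 0.25,
                 overlap_sec: float = 0.10) -> list[list[np.ndarray]]:
    """Split an audio signal into frames, then each frame into overlapping sub-frames."""
    frame_len = int(frame_sec * sample_rate)
    sub_len = int(subframe_sec * sample_rate)
    hop = sub_len - int(overlap_sec * sample_rate)  # adjacent sub-frames overlap by overlap_sec

    frames = [signal[i:i + frame_len] for i in range(0, len(signal), frame_len)]
    framed = []
    for frame in frames:
        subs = [frame[j:j + sub_len] for j in range(0, len(frame) - sub_len + 1, hop)]
        framed.append(subs)
    return framed

# Example: one minute of audio at 8 kHz -> sixty 1-second frames of overlapping sub-frames.
audio = np.random.randn(60 * 8000)
frames = parse_frames(audio, sample_rate=8000)
print(len(frames), len(frames[0]))  # number of frames, sub-frames per frame
```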
The analytics server 102 may transform the audio data into a different representation during pre-processing. The analytics server 102 initially generates and represents the observed audio signal, frames, and sub-frames according to a time domain. The analytics server 102 transforms the sub-frames (initially in the time domain) to a frequency domain or spectrogram representation, representing the energy associated with the frequency components of the observed audio signal in each of the sub-frames, thereby generating a transformed representation. In some implementations, the analytics server 102 executes a Fast-Fourier Transform (FFT) operation on the sub-frames to transform the audio data in the time domain to the frequency domain. For each frame (or sub-frame), the analytics server 102 performs a simple scaling operation so that the frame occupies the range [−1, 1] of measurable energy.
In some implementations, the analytics server 102 may employ a scaling function to accentuate aspects of the speech spectrum (e.g., spectrogram representation). The speech spectrum, and in particular the voiced speech, will decay at higher frequencies. The scaling function beneficially accentuates the voiced speech. The analytics server 102 may perform an exponentiation operation on the array resulting from the FFT transformation. An example of the exponentiation operation performed on the array (Y) may be given by Y_e = Y^α, where α is the exponentiation parameter. The value of the exponentiation parameter α may be any value greater than zero and less than or equal to one (e.g., α=0.3).
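A minimal sketch of the transformation, scaling, and exponentiation described above follows. It assumes a magnitude spectrum per sub-frame, peak normalization to [−1, 1] applied in the time domain, and α = 0.3; these are illustrative assumptions rather than the only way these operations could be implemented.

```python
import numpy as np

def preprocess_subframe(subframe: np.ndarray, alpha: float = 0.3) -> np.ndarray:
    """Scale a time-domain sub-frame, transform it to the frequency domain,
    and apply the exponentiation Y_e = Y**alpha to accentuate voiced speech."""
    # Simple scaling so the sub-frame occupies the range [-1, 1].
    peak = np.max(np.abs(subframe))
    if peak > 0:
        subframe = subframe / peak
    # Fast Fourier Transform to the frequency domain (magnitude spectrum).
    spectrum = np.abs(np.fft.rfft(subframe))
    # Exponentiation with 0 < alpha <= 1 compresses the dynamic range, boosting
    # the relative contribution of the decaying high-frequency components.
    return spectrum ** alpha

spec = preprocess_subframe(np.random.randn(2000))
print(spec.shape)
```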
In some instances, the audio includes synthetic speech including a watermark, the synthetic speech generated by the TTS system 120. The analytics system 101 can apply the keys used to generate the watermark to the audio to detect the watermark and extract metadata from the watermark. Applying the keys to detect the watermark may include executing one of multiple watermarking processes (e.g., algorithms) for analyzing the audio and detecting synthetic speech. Applying the keys to detect the watermark may operate to quickly and efficiently detect synthetic speech generated by the TTS system 120 or any other TTS system in communication with the analytics system 101.
The TTS server 122 of the TTS system 120 may be any computing device comprising one or more processors and software, and capable of performing the various processes and tasks described herein. The TTS server 122 may host or be in communication with the TTS database 124 and may generate synthetic speech. The TTS server 122 may provide the synthetic speech to the user devices 114 and provide information regarding the synthetic speech (e.g., keys, metadata) to the analytics system 101. Although
The presence of the watermark and/or the metadata extracted from the watermark by the analytics server 102 will be forwarded to or otherwise referenced by one or more downstream applications to perform various types of audio and voice processing operations. The downstream applications may be executed by the provider server 111, the analytics server 102, the admin device 103, the agent device 116, or any other computing device. Non-limiting examples of the downstream applications or operations may include speaker verification, speaker recognition, speech recognition, voice biometrics, audio signal correction or degradation mitigation (e.g., dereverberation), and the like.
The provider server 111 of a service provider system 110 executes software processes for managing a call queue and/or routing calls made to the service provider system 110, which may include routing calls to the appropriate agent devices 116 (e.g., an agent of a call center of the service provider) based on the caller's comments. The provider server 111 can capture, query, or generate various types of information about the call, the caller, and/or the calling device 114 and forward the information to the agent device 116, where a graphical user interface on the agent device 116 is then displayed to the call center agent containing the various types of information. The provider server 111 also transmits the information about the inbound call, including the observed audio signal and any other audio data, to the call analytics system 101 to perform various analytics processes. The provider server 111 may transmit the information and the audio data based upon preconfigured triggering conditions (e.g., receiving the inbound phone call), instructions or queries received from another device of the system 100 (e.g., agent device 116, admin device 103, analytics server 102), or as part of a batch transmitted at a regular interval or predetermined time.
The analytics database 104 and/or the provider database 112 may contain any number of corpora that are accessible to the analytics server 102 via one or more networks. The analytics server 102 may access a variety of corpora to retrieve clean audio signals, previously received audio signals, recordings of background noise, and acoustic impulse response audio data. The analytics database 104 may also query an external database (not shown) to access a third-party corpus of clean audio signals containing speech or any other type of training signals (e.g., example noise). In some implementations, the analytics database 104 and/or the provider database 112 may be queried, referenced, or otherwise used by components (e.g., analytics server 102) of the system 100 to assist with configuring or otherwise establishing performance limits on watermarking in relation to audio and/or speech quality (as in examples described in
The analytics database 104 and/or the provider database 112 may store information about speakers or registered callers as speaker profiles. A speaker profile is a data file or database record containing, for example, audio recordings of prior audio samples, metadata and signaling data from prior calls, a trained model or speaker vector employed by the neural network, and other types of information about the speaker or caller. The analytics server 102 may query the profiles when executing the neural network and/or when executing one or more downstream operations. The profile could also store the registered feature vector for the registered caller, which the analytics server 102 references when determining a similarity score between the registered feature vector for the registered caller and the feature vector generated for the current caller who placed the inbound phone call.
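The similarity score mentioned here is commonly computed as a cosine similarity between speaker embeddings. The short sketch below assumes that convention and uses hypothetical vectors; it is illustrative only and does not describe the particular neural network or scoring function employed by the analytics server 102.

```python
import numpy as np

def similarity_score(registered_vector: np.ndarray, current_vector: np.ndarray) -> float:
    """Cosine similarity between the registered caller's feature vector and the
    feature vector generated for the current caller."""
    return float(np.dot(registered_vector, current_vector) /
                 (np.linalg.norm(registered_vector) * np.linalg.norm(current_vector) + 1e-12))

registered = np.random.randn(256)                  # stand-in for an enrolled speaker vector
current = registered + 0.1 * np.random.randn(256)  # a close match from the same speaker
print(round(similarity_score(registered, current), 3))
```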
The admin device 103 of the call analytics system 101 is a computing device allowing personnel of the call analytics system 101 to perform various administrative tasks or user-prompted analytics operations. The admin device 103 may be any computing device comprising a processor and software, and capable of performing the various tasks and processes described herein. Non-limiting examples of the admin device 103 may include a server, personal computer, laptop computer, tablet computer, or the like. In operation, the user employs the admin device 103 to configure the operations of the various components of the call analytics system 101 or service provider system 110 and to issue queries and instructions to such components.
The agent device 116 of the service provider system 110 may allow agents or other users of the service provider system 110 to configure operations of devices of the service provider system 110. For calls made to the service provider system 110, the agent device 116 receives and displays some or all of the relevant information associated with the call routed from the provider server 111.
The watermark may include all or some of the metadata 222. The watermark may be encoded in the watermarked speech 235 using the key 205. The watermark may be imperceptible without the key 205. The key 205 may be required for extracting the encoded metadata 222 from the watermarked speech 235. In this way, the presence of the watermark can be detected using the key 205, and the need for the key 205 for decoding prevents modification or removal of the watermark.
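One common way to realize a key-dependent watermark of this kind is spread-spectrum embedding, in which the key seeds a pseudo-random carrier that is mixed into the audio at low amplitude and can only be recovered by correlating against the same key-seeded carrier. The sketch below is an assumption about how a key such as the key 205 could be used to embed and extract a payload; it is not a description of the specific watermarking algorithm used.

```python
import numpy as np

def embed_watermark(audio: np.ndarray, payload_bits: list[int],
                    key: int, strength: float = 0.1) -> np.ndarray:
    """Embed payload bits by adding a key-seeded pseudo-random carrier per bit segment."""
    rng = np.random.default_rng(key)              # the key seeds the carrier sequence
    seg_len = len(audio) // len(payload_bits)
    watermarked = audio.astype(np.float64).copy()
    for i, bit in enumerate(payload_bits):
        carrier = rng.standard_normal(seg_len)
        sign = 1.0 if bit else -1.0
        # Strength is kept small so the carrier stays quiet relative to the speech;
        # a real encoder would tune this against perceptual limits.
        watermarked[i * seg_len:(i + 1) * seg_len] += strength * sign * carrier
    return watermarked

def extract_watermark(audio: np.ndarray, num_bits: int, key: int) -> list[int]:
    """Recover bits by correlating each segment with the same key-seeded carrier."""
    rng = np.random.default_rng(key)
    seg_len = len(audio) // num_bits
    bits = []
    for i in range(num_bits):
        carrier = rng.standard_normal(seg_len)
        corr = np.dot(audio[i * seg_len:(i + 1) * seg_len], carrier)
        bits.append(1 if corr > 0 else 0)
    return bits

speech = np.random.randn(48000)          # stand-in for synthesized speech
bits = [1, 0, 1, 1, 0, 0, 1, 0]          # stand-in for encoded metadata bits
marked = embed_watermark(speech, bits, key=205)
print(extract_watermark(marked, len(bits), key=205) == bits)  # True: the correct key recovers the bits
```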
In some implementations, the encoder 230 is part of the TTS service 220. In some implementations, the encoder is part of an analysis service, such as the analytics system 101 of
The call center 310 receives an audio signal 302. The audio signal 302 may be an audio signal of a phone call for authentication purposes. In an example, the audio signal 302 is part of a phone call for verifying a person's identity for a banking service where the person's voice is used to verify the person's identity. The call center 310 may send a request 302 to the analysis service 301 to determine whether the audio signal 302 includes synthetic speech. The request 302 may include the audio signal 302 or a portion of the audio signal 302. The analysis service 301 determines whether the audio signal 302 includes synthetic speech generated by the TTS service 320 by applying a key to the audio signal 302, such as the key 205 of
If the analysis service 301 detects a watermark corresponding to synthetic speech generated by the TTS service 320 (or another TTS service), the analysis service 301 can extract metadata from the watermark. The metadata can include characteristics of the synthetic speech and/or the generation of the synthetic speech, similar to the metadata 222 of
If the analysis service 301 does not detect a watermark corresponding to synthetic speech, the analysis service 301 can determine that the audio signal 302 does not include synthetic speech and/or apply various other analytic methods to determine whether the audio signal 302 includes synthetic speech. In an example, the analysis service 301 determines that the audio signal 302 does not include a watermark corresponding to synthetic speech and then applies one or more machine-learning architectures to generate output indicating whether the audio signal 302 includes synthetic speech.
The analysis service 301 sends a response 304 to the call center 310. The response 304 can be sent to the call center 310 in response to the request 302 and/or the determination by the analysis service 301 as to whether the audio signal 302 includes synthetic speech. In an example, the response 304 includes an indication that the audio signal 302 includes or does not include synthetic speech. In an example, the response 304 includes a confidence score regarding whether the audio signal 302 includes synthetic speech. In some implementations, the response 304 includes a portion of the metadata. In an example, the response 304 includes an identification of the TTS service 320.
The analysis service 301 may generate an alert 306 to send to the TTS service 320. The alert 306 may inform the TTS service 320 that synthetic speech generated by the TTS service 320 was used in an authentication attempt. The alert 306 may include an indication of a severity of the attempt and/or a type of attempt. In an example, the alert 306 may indicate that the authentication attempt was associated with a financial account. In an example, the alert 306 may indicate that the authentication attempt represents an attempted crime or fraud. The alert 306 may indicate a user or account of the TTS service 320 that was used to generate the synthetic speech. In an example, the alert 306 includes a user ID of a user of the TTS service 320. In this way, the TTS service 320 can identify malicious users of the TTS service 320 to ban or otherwise restrict activity of the malicious users.
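The request/response/alert exchange described in the preceding paragraphs can be pictured with the following sketch. The message fields and the helper callables (detect, extract_metadata, corresponding to the key-scoring and metadata-extraction steps sketched elsewhere) are hypothetical illustrations of the flow, not a defined API of the analysis service 301.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AnalysisResponse:
    contains_synthetic_speech: bool
    confidence: float
    tts_service_id: Optional[str] = None   # portion of the extracted metadata, if included

@dataclass
class Alert:
    tts_service_id: str
    user_id: Optional[str]
    attempt_type: str                      # e.g., "voice-authentication attempt"
    severity: str                          # e.g., "high" for suspected fraud

def handle_request(audio, keys_by_service, detect, extract_metadata):
    """Detect a watermark, build the response for the call center, and build an alert
    for the TTS service whose key matched (if any)."""
    for service_id, key in keys_by_service.items():
        if detect(audio, key):                        # watermark found under this service's key
            metadata = extract_metadata(audio, key)   # e.g., a dict of watermark metadata fields
            response = AnalysisResponse(True, confidence=metadata.get("score", 1.0),
                                        tts_service_id=service_id)
            alert = Alert(service_id, metadata.get("user_id"),
                          "voice-authentication attempt", "high")
            return response, alert
    return AnalysisResponse(False, confidence=0.0), None
```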
The TTS service 320 may comply with a watermarking standard defined and/or enforced by the analysis service 301. The watermarking standard may include requirements for watermarks to be present in synthetic speech, robustness thresholds for the watermarks, and/or key sharing for identification and/or decoding of the watermarks. In an example, the analysis service 301 provides compliance with the watermarking standard to the TTS service 320 by using a key to add a watermark to synthetic speech generated by the TTS service 320. In an example, the TTS service sends watermarked speech and a key used to generate the watermarked speech to the analysis service 301 to verify that the watermarked speech complies with the watermarking standard and to provide the analysis service 301 with the key to allow the analysis service 301 to detect the watermark. In this way, synthetic speech can be detected, and an origin of the synthetic speech can be detected.
The analysis service 301 applies a key 405 to speech 435. The speech 435 may be speech included in the audio signal 302 of
The analysis service 301 executes a determination operation 402 that determines whether the score 401 satisfies a threshold. The threshold may be a predetermined threshold for determining whether audio includes synthetic speech including a watermark. In some implementations, the threshold is a correlation threshold. In an example, the threshold is a correlation of 0.9, such that the analysis service 301 determines that the speech 435 includes a watermark corresponding to synthetic speech if the speech 435 has a correlation of 0.9 or above with the key 405 and the analysis service 301 determines that the speech 435 does not include a watermark corresponding to synthetic speech if the speech 435 has a correlation below 0.9 with the key 405. As discussed herein, the analysis service 301 may use a plurality of keys for generation and/or detection of watermarks. The analysis service 301 may generate a plurality of scores for the plurality of keys (i.e., one score per key applied to an audio signal).
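A minimal sketch of the scoring and determination operation follows. It assumes the score 401 is a normalized correlation between a key-seeded carrier and the speech 435 and that the threshold mirrors the 0.9 example above; the actual scoring function and threshold are implementation choices not specified here.

```python
import numpy as np

def watermark_score(speech: np.ndarray, key: int) -> float:
    """Score how strongly a key-seeded carrier is present in the speech."""
    rng = np.random.default_rng(key)
    carrier = rng.standard_normal(len(speech))
    # Normalized correlation between the speech and the carrier generated from this key.
    return float(abs(np.dot(speech, carrier)) /
                 (np.linalg.norm(speech) * np.linalg.norm(carrier) + 1e-12))

def detect_watermark(speech: np.ndarray, keys: dict[str, int],
                     threshold: float = 0.9) -> str | None:
    """Apply every registered key (one score per key) and return the matching
    TTS service if any score satisfies the threshold; otherwise return None."""
    scores = {service: watermark_score(speech, key) for service, key in keys.items()}
    best_service = max(scores, key=scores.get)
    return best_service if scores[best_service] >= threshold else None
```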
If the score 401 is above the threshold, the analysis service extracts metadata of the watermark using a metadata extractor 403. The analysis service 301 generates a positive response 404a indicating that the speech 435 includes a watermark corresponding to synthetic speech, or that the speech 435 is synthetic speech. The positive response 404a may include a portion of the metadata, as discussed in conjunction with
If the score 401 is below the threshold, the analysis service may generate a negative response 404b indicating the speech 435 does not include a watermark corresponding to synthetic speech. As discussed herein, the analysis service 301 may perform other analyses on the speech 435 if the score is below the threshold to determine whether the speech 435 includes synthetic speech. In an example, the analysis service 301, in response to the score 401 being below the threshold, applies a machine-learning architecture configured to identify synthetic speech to the speech 435.
The analysis service 301 may perform additional analysis on the speech 435 to generate the positive response 404a or the negative response 404b and/or perform additional analysis on the speech after generating the positive response 404a or the negative response 404b. In an example, the analysis service 301 may determine whether the speech 435 is recorded speech in response to determining that the speech 435 does not include synthetic speech. In this way, the determination of whether the speech 435 includes synthetic speech is part of a larger analysis as to whether the speech 435 can be used to authenticate a person or whether the speech 435 is part of a malicious attack to impersonate the person. Similarly, the generation of the score 401 and the comparison of the score 401 can be part of a larger analysis of whether the speech 435 includes synthetic speech, as discussed herein.
The analysis service 501 may store a plurality of keys 502 including a first key 502a, a second key 502b, and a third key 502c. The analysis service 501 may use the plurality of keys 502 for encoding watermarks in synthetic speech generated by the plurality of TTS services 520. As discussed above, the plurality of keys 502 may each be used to encode watermarks in speech generated by a single TTS service, by multiple TTS services, and/or in a single instance of synthetic speech. In the illustrated example, the first key 502a corresponds to a first watermark 524a encoded in first synthetic speech 522a generated by the first TTS service 520a, the second key 502b corresponds to a second watermark 524b encoded in second synthetic speech 522b generated by the second TTS service 520b, and the third key 502c corresponds to a third watermark 524c encoded in third synthetic speech 522c generated by the third TTS service 520c.
Using the plurality of keys 502, the analysis service 501 is able to detect the watermarks 524 (the first watermark 524a, the second watermark 524b, the third watermark 524c) and extract metadata of the watermarks 524, as discussed herein. In this way, the analysis service 501 is able, using the stored plurality of keys 502, to identify synthetic speech generated by the plurality of TTS services 520. In some implementations, the plurality of TTS services 520 send the synthetic speech 522 (the first synthetic speech 522a, the second synthetic speech 522b, the third synthetic speech 522c) to the analysis service 501 for the analysis service 501 to encode the watermarks 524 in the synthetic speech 522. In an example, the first TTS service 520a generates the first synthetic speech 522a, the first TTS service 520a sends the first synthetic speech 522a to the analysis service 501, the analysis service 501 encodes the first watermark 524a in the first synthetic speech 522a, and the analysis service 501 sends the first synthetic speech 522a including the first watermark 524a to the first TTS service 520a.
In some implementations, the analysis service 501 generates the watermarks 524 and sends the watermarks to the plurality of TTS services 520 for the plurality of TTS services 520 to encode the watermarks 524 in the synthetic speech 522. In an example, the second TTS service 520b requests a watermark for the second synthetic speech 522b, the analysis service 501 generates the second watermark 524b and sends the second watermark 524b to the second TTS service 520b, and the second TTS service 520b encodes the second watermark 524b in the second synthetic speech 522b.
In some implementations, the plurality of TTS services 520 generate the watermarks 524, encode the watermarks 524 in the synthetic speech 522, and send the plurality of keys 502 to the analysis service 501.
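One way to organize the relationship between services, keys, and watermarks described in the preceding paragraphs is a simple key registry maintained by the analysis service. The structure below is an assumed illustration (the service identifiers and key-generation call are hypothetical), covering both the case where the analysis service issues keys and the case where a TTS service registers a key it generated itself.

```python
from dataclasses import dataclass, field
import secrets

@dataclass
class KeyRegistry:
    """Maps each participating TTS service to the key(s) used for its watermarks."""
    keys: dict[str, list[int]] = field(default_factory=dict)

    def issue_key(self, service_id: str) -> int:
        """Issue a new key to a TTS service."""
        key = secrets.randbits(64)
        self.keys.setdefault(service_id, []).append(key)
        return key

    def register_key(self, service_id: str, key: int) -> None:
        """Record a key that the TTS service generated and shared with the analysis service."""
        self.keys.setdefault(service_id, []).append(key)

    def all_keys(self) -> list[tuple[str, int]]:
        """Every (service, key) pair to try when analyzing an unknown audio signal."""
        return [(s, k) for s, ks in self.keys.items() for k in ks]

registry = KeyRegistry()
registry.issue_key("tts-service-520a")
registry.issue_key("tts-service-520b")
registry.register_key("tts-service-520c", secrets.randbits(64))
print(len(registry.all_keys()))  # 3 keys, one per service in this example
```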
As discussed herein, the analysis service 501 may define and enforce a watermarking standard for the watermarks 524. In an example, the analysis service 501 generates the plurality of keys 502 such that the watermarks 524 comply with the watermarking standard. In an example, the analysis service 501 approves or rejects keys and/or watermarks from the plurality of TTS services 520 to ensure that the watermarks comply with the watermarking standard. The watermarking standard may include watermark criteria such as robustness to degradation, robustness to attack, and perceptibility.
The plurality of TTS services 520 may provide the synthetic speech 522 including the watermarks 524 to customers of the plurality of TTS services 520. The watermarks 524 and the stored plurality of keys 502 allow the analysis service 501 to detect the synthetic speech 522. In an example, the analysis service 501 can identify when the synthetic speech 522 is used in a voice authentication attempt, as discussed in conjunction with
As discussed herein, the analysis service (e.g., the analysis service 501 of
The analysis service can define and enforce the voice change threshold 701 for watermarks by generating keys and embedding watermarks such that the embedded watermarks comply with the voice change threshold 701 and/or by approving watermarks complying with the voice change threshold 701, similar to how the analysis service defines and enforces the audio quality threshold 601 of
The fingerprint 822 may include an embedding generated based on metadata of the speech 825 and/or the speech 825. The metadata may include a time the speech 825 was generated (e.g., timestamp), the text used to generate the speech 825, a user who requested the speech 825 (e.g., user identifier), a device the user used to request the speech 825, an account of the user, a content of the speech 825, a machine-learning model used to generate the speech 825 (e.g., an identifier of a TTS model), and other data identifying and/or describing the speech 825. The synthesis database 830 may store the fingerprint 822 for comparison with audio signals and/or fingerprints to detect synthesized speech, similar to the analysis service described herein.
A call center 810 may receive the speech 825 as part of an authentication process and send the speech 825 to the synthesis database 830. The call center 810 may include various combinations of hardware and software components such as servers, databases, and/or admin devices executing instructions in non-transitory, computer-readable media to implement various software-implemented functions. In an example, the call center 810 includes a plurality of databases to store customer information and a plurality of agent devices to interface with the plurality of databases to update customer information. The synthesis database 830 may generate a fingerprint based on the received speech and compare the fingerprint of the received speech to the stored fingerprint 822. Based on the fingerprint of the received speech matching the stored fingerprint 822, the synthesis database 830 can determine that the received speech includes synthesized speech, specifically the speech 825. The synthesis database 830 may generate a notification to the call center 810 that the received speech includes synthetic speech. The synthesis database 830 may transmit an alert to the TTS service 820 that the speech 825 was used in an authentication attempt at the call center 810.
In some implementations, the synthesis database 830 performs similar functions to the analysis system described herein. In some implementations, the synthesis database 830 is part of the analysis system described herein. The analysis system can generate a fingerprint based on received speech, compare the generated fingerprint to stored fingerprints to determine whether the generated fingerprint matches the stored fingerprints, and apply keys to the received speech to determine whether the received speech includes a watermark. In this way, the analysis service can use multiple pieces of data from TTS services to detect and/or identify synthetic speech generated by the TTS services.
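A minimal sketch of fingerprint storage and lookup follows. It assumes the fingerprint is an embedding vector compared by cosine similarity against stored entries, with an assumed match threshold; this is one plausible realization of the synthesis database 830 rather than the specific method used.

```python
import numpy as np

class SynthesisDatabase:
    """Stores fingerprints of generated speech and matches incoming audio against them."""

    def __init__(self, match_threshold: float = 0.95):
        self.fingerprints: dict[str, np.ndarray] = {}
        self.match_threshold = match_threshold

    @staticmethod
    def fingerprint(audio: np.ndarray, dim: int = 128) -> np.ndarray:
        # Stand-in embedding: the magnitude spectrum reduced to a fixed-length, unit-norm vector.
        spectrum = np.abs(np.fft.rfft(audio))
        chunks = np.array_split(spectrum, dim)
        vec = np.array([c.mean() for c in chunks])
        return vec / (np.linalg.norm(vec) + 1e-12)

    def store(self, speech_id: str, audio: np.ndarray) -> None:
        self.fingerprints[speech_id] = self.fingerprint(audio)

    def match(self, audio: np.ndarray) -> str | None:
        """Return the stored speech id whose fingerprint best matches, if above threshold."""
        probe = self.fingerprint(audio)
        best_id, best_sim = None, 0.0
        for speech_id, stored in self.fingerprints.items():
            sim = float(np.dot(probe, stored))
            if sim > best_sim:
                best_id, best_sim = speech_id, sim
        return best_id if best_sim >= self.match_threshold else None

db = SynthesisDatabase()
generated = np.random.randn(16000)           # stand-in for the generated speech 825
db.store("speech-825", generated)
print(db.match(generated))                   # "speech-825": the same audio matches its fingerprint
```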
At operation 910, an audio signal including synthetic speech is obtained. The audio signal may be obtained from a computing system requesting an indication of whether the audio signal includes synthetic speech, such as the service provider system 110 of
At operation 920, metadata is extracted from a watermark of the audio signal by applying a set of keys associated with a plurality of TTS services to the audio signal, the metadata indicating an origin of the synthetic speech in the audio signal. The metadata includes characteristics of the synthetic speech and/or characteristics of the generation of the synthetic speech such as an identifier of a TTS service (e.g., service identifier), an identifier of a TTS model (e.g., model identifier), an identifier of a user (e.g., user identifier) of the TTS service, and a timestamp indicating when the synthetic speech was generated.
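The metadata fields listed above could be carried as a small structured payload in the watermark. The sketch below assumes a compact JSON-style encoding to and from a bit payload purely for illustration; the field names and encoding are hypothetical, not the format defined by the disclosure.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class WatermarkMetadata:
    service_id: str     # identifier of the TTS service
    model_id: str       # identifier of the TTS model
    user_id: str        # identifier of the user who requested the synthesis
    timestamp: str      # when the synthetic speech was generated (ISO 8601)

def metadata_to_bits(meta: WatermarkMetadata) -> list[int]:
    """Serialize metadata to the bit payload that a watermark encoder would embed."""
    payload = json.dumps(asdict(meta)).encode("utf-8")
    return [int(b) for byte in payload for b in f"{byte:08b}"]

def bits_to_metadata(bits: list[int]) -> WatermarkMetadata:
    """Reassemble the metadata from bits recovered by a watermark extractor."""
    data = bytes(int("".join(map(str, bits[i:i + 8])), 2) for i in range(0, len(bits), 8))
    return WatermarkMetadata(**json.loads(data.decode("utf-8")))

meta = WatermarkMetadata("tts-service-320", "model-v2", "user-42", "2024-01-15T12:00:00Z")
print(bits_to_metadata(metadata_to_bits(meta)) == meta)   # True: the round trip is lossless
```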
In some implementations, the method 900 includes generating a score for each key of the set of keys to determine that the audio signal includes the watermark. The watermark may be generated using a key of the set of keys. The key used to generate the watermark may be associated with a score indicating that the audio signal includes the watermark. In some implementations, the score indicates a correlation between the key and the watermark. The generated scores are compared to a predetermined threshold (e.g., score threshold, correlation threshold) to determine whether the audio signal includes the watermark.
In some implementations, the method 900 includes transmitting the key to a TTS service to generate the watermark. The TTS service may generate the watermark and encode the watermark in the synthetic speech. In some implementations, the method 900 includes generating the watermark using the key and sending the watermark to the TTS service to encode the watermark in the synthetic speech. In some implementations, the method 900 includes generating the watermark using the key, encoding the watermark in the synthetic speech, and sending the synthetic speech with the encoded watermark to the TTS service. In some implementations, the method 900 includes verifying that the watermark complies with a watermarking standard. The key can be generated to ensure that the watermark complies with the watermarking standard. In some implementations, the method 900 includes receiving, from the origin of the synthetic speech (e.g., TTS service) an audio signal including the watermark, determining that a robustness of the watermark exceeds a predetermined threshold, and transmitting an approval of the watermark to the origin of the synthetic speech. In this way, the watermark can be verified to comply with the watermarking standard such that the watermark is detectable using the key despite audio degradation. In an example, the audio signal includes synthetic speech having a watermark, and the audio signal is degraded due to a codec used in a phone call and background noise. In this example, the watermark is still detectable using the key because the watermark complies with the watermarking standard, including the robustness threshold.
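The robustness check described here can be illustrated as follows: the watermarked audio is passed through representative degradations (added noise and crude band-limiting standing in for a telephony codec) and the watermark must remain detectable under each before approval. The degradations and the injected scoring function are assumptions for the sketch, not the standard's actual test suite.

```python
import numpy as np

def degrade_noise(audio: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
    """Add white noise at a target signal-to-noise ratio."""
    noise = np.random.randn(len(audio))
    scale = np.sqrt(np.mean(audio ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return audio + scale * noise

def degrade_bandlimit(audio: np.ndarray, keep_fraction: float = 0.5) -> np.ndarray:
    """Crude band-limiting: zero out the upper part of the spectrum (codec stand-in)."""
    spectrum = np.fft.rfft(audio)
    cutoff = int(len(spectrum) * keep_fraction)
    spectrum[cutoff:] = 0
    return np.fft.irfft(spectrum, n=len(audio))

def watermark_is_robust(audio: np.ndarray, key: int, score_fn, threshold: float) -> bool:
    """Approve the watermark only if it survives every degradation in the suite."""
    degradations = [lambda a: a, degrade_noise, degrade_bandlimit]
    return all(score_fn(d(audio), key) >= threshold for d in degradations)
```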
At operation 930, based on the metadata, a notification is generated indicating that the audio signal includes the synthetic speech. In some implementations, the notification includes a confidence level that the audio signal includes the synthetic speech. The confidence level may be the generated score. The notification may include a portion of the extracted metadata. In some embodiments, the notification indicates the origin of the synthetic speech. In some embodiments, the notification indicates a timestamp of when the synthetic speech was generated. In an example, the notification indicates that the audio signal most likely includes synthetic speech generated one month ago. The notification can be displayed to a user and/or sent to a computing device that requested analysis of the audio signal to determine whether the audio signal includes synthetic speech. In an example, the notification is sent to a call center that provided the audio signal and requested to know whether the audio signal includes synthetic speech.
In some implementations, the watermark includes a consent watermark and the notification indicates usage consent parameters of the consent watermark. The consent watermark may indicate whether consent is provided for synthesis of speech using audio. The usage consent parameters may indicate whether audio is allowed to be copied, transmitted, and/or used for synthesis. In an example, the consent watermark is included in audio and indicates that the audio cannot be used for TTS. In this example, inclusion of the consent watermark in the synthetic speech indicates that the synthetic speech was generated without consent. In this example, the notification may indicate to the TTS service that the synthetic speech is unauthorized so the TTS service can determine to not deliver the synthetic speech to a user. In an example, a user of the TTS service uploads audio including the consent watermark indicating that the audio cannot be used for TTS, the TTS service generates synthetic speech using the audio, sends the synthetic speech to the analysis service, and receives the notification that the synthetic speech is unauthorized. In an example, a user of the TTS service uploads audio including the consent watermark indicating that the audio cannot be used for TTS, the TTS service sends the audio to the analysis service and receives the notification that the audio cannot be used for generating synthetic speech.
In some implementations, the watermark includes an authorization watermark, and the notification indicates authorization parameters of the authorization watermark. The authorization watermark can indicate ownership and/or usage authorization for audio. In an example, the authorization watermark indicates an identity of an individual whose voice is present in audio or whose voice was used to generate synthetic speech. The authorization parameters may define when, how, and by whom the audio can be used. In some implementations, the authorization watermark indicates that the TTS service owns the synthetic speech and that the synthetic speech is authorized for commercial, educational, or artistic purposes, but not for impersonation. In an example, the authorization watermark is present in audio used to generate the synthetic speech and the TTS encodes the authorization watermark from the audio in the synthetic speech to indicate that the synthetic speech was properly authorized. In some implementations, the authorization watermark indicates that synthetic speech is authorized for identity verification purposes. The authorization watermark can be referred to as a legitimate-synthesis watermark in this context, indicating that the synthetic speech was generated with the consent and intention of the person whose voice is used to generate the synthetic speech. In an example, an individual with a condition that causes voice loss may record their voice for purposes of generating synthetic speech and then use the synthetic speech for voice authentication. In this example, the authorization watermark (e.g., legitimate-synthesis watermark) indicates that despite being synthetic speech, the synthetic speech can be used for voice authentication.
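The consent and authorization parameters described in the preceding two paragraphs might be represented as additional watermark metadata fields. The field names and the decision logic below are hypothetical, intended only to show how a consent watermark and an authorization watermark (including a legitimate-synthesis indication) could drive a permitted-use decision.

```python
from dataclasses import dataclass, field

@dataclass
class ConsentParameters:
    may_copy: bool = False
    may_transmit: bool = False
    may_synthesize: bool = False       # whether the voice may be used for TTS at all

@dataclass
class AuthorizationParameters:
    owner: str = ""                                           # identity of the voice owner
    permitted_uses: list[str] = field(default_factory=list)   # e.g., ["commercial", "artistic"]
    legitimate_synthesis: bool = False                        # cleared for voice authentication

def use_is_permitted(consent: ConsentParameters, auth: AuthorizationParameters, use: str) -> bool:
    """Decide whether a requested use of the synthetic speech is permitted."""
    if not consent.may_synthesize:
        return False                       # the speech was generated without consent
    if use == "voice-authentication":
        return auth.legitimate_synthesis   # only a legitimate-synthesis watermark allows this
    return use in auth.permitted_uses

auth = AuthorizationParameters(owner="speaker-1", permitted_uses=["artistic"],
                               legitimate_synthesis=True)
print(use_is_permitted(ConsentParameters(may_synthesize=True), auth, "voice-authentication"))  # True
```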
In some implementations, the method 900 includes transmitting an alert to a TTS service based on the origin of the synthetic speech in the audio signal. In some embodiments, the notification may include the alert. The alert may be generated based on the synthetic speech being used to attempt to commit fraud, a crime, or to impersonate a person. In an example, the alert may be generated based on the synthetic speech being used to impersonate a person for voice authentication. The alert may be transmitted to a TTS service that is the origin of the synthetic speech, or which generated the synthetic speech. In this way, the TTS service can be notified of improper use of the synthetic speech. In an example, the alert notifies the TTS service of improper use of synthetic speech generated by a user of the TTS service, allowing the TTS service to sanction the user.
As discussed herein, particularly in conjunction with
At operation 1010, an audio signal is received. The audio signal may be received as part of an authorization process. In an example, the audio signal is received as part of a voice authorization process where an identity of a person is verified using the person's voice.
At operation 1020, the audio signal is transmitted to a watermark service to extract metadata from a watermark of the audio signal, the metadata indicating an origin of synthetic speech in the audio signal. The watermark service may also provide voice verification services to determine whether a voice in an input audio signal matches a voice in an enrolled audio signal. In some implementations, the watermark service analyzes an enrollment audio signal to determine whether the enrollment audio signal includes synthetic speech prior to storing the enrollment audio signal as the enrolled audio signal. The audio signal may be transmitted to the watermark service to determine whether the audio signal includes synthetic speech and/or to verify the identity of a caller, where the audio signal is speech of the caller.
An example of the watermark service is the analysis service 301 of
At operation 1030, a response is received from the watermark service based on the watermark. The response may indicate whether the audio signal includes synthetic speech and/or whether the voice of the caller matches the identity of the caller. In some implementations, the response includes a recommendation or determination as to whether to accept the caller (i.e., authenticate the caller).
At operation 1040, based on the response from the watermark service, a determination is made whether to accept or reject the audio signal (i.e., accept or reject the caller). In this way, the audio signal and/or an identity of the caller can be verified using the watermark service.
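A call-center-side sketch of operations 1010 through 1040 follows. The watermark-service client function and the response fields are assumptions used only to show the accept/reject decision; they are not a defined interface of the watermark service.

```python
from dataclasses import dataclass

@dataclass
class WatermarkServiceResponse:
    contains_synthetic_speech: bool
    legitimate_synthesis: bool = False      # e.g., authorized for identity verification
    voice_matches_enrollment: bool = False  # result of the service's voice verification

def authenticate_caller(audio_signal, query_watermark_service) -> bool:
    """Transmit the caller audio to the watermark service and accept or reject the caller."""
    response: WatermarkServiceResponse = query_watermark_service(audio_signal)
    if response.contains_synthetic_speech and not response.legitimate_synthesis:
        return False   # reject: synthetic speech without a legitimate-synthesis watermark
    return response.voice_matches_enrollment   # otherwise rely on voice verification

# Usage with a stubbed watermark-service client:
decision = authenticate_caller(
    b"...audio bytes...",
    lambda audio: WatermarkServiceResponse(False, voice_matches_enrollment=True))
print("accept" if decision else "reject")
```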
The various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
Embodiments implemented in computer software may be implemented in software, firmware, middleware, microcode, hardware description languages, or any combination thereof. A code segment or machine-executable instructions may represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, a software package, a class, or any combination of instructions, data structures, or program statements. A code segment may be coupled to another code segment or a hardware circuit by passing and/or receiving information, data, arguments, attributes, or memory contents. Information, arguments, attributes, data, etc. may be passed, forwarded, or transmitted via any suitable means including memory sharing, message passing, token passing, network transmission, etc.
The actual software code or specialized control hardware used to implement these systems and methods is not limiting of the invention. Thus, the operation and behavior of the systems and methods were described without reference to the specific software code, it being understood that software and control hardware can be designed to implement the systems and methods based on the description herein.
When implemented in software, the functions may be stored as one or more instructions or code on a non-transitory computer-readable or processor-readable storage medium. The steps of a method or algorithm disclosed herein may be embodied in a processor-executable software module which may reside on a computer-readable or processor-readable storage medium. A non-transitory computer-readable or processor-readable media includes both computer storage media and tangible storage media that facilitate transfer of a computer program from one place to another. A non-transitory processor-readable storage media may be any available media that may be accessed by a computer. By way of example, and not limitation, such non-transitory processor-readable media may comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other tangible storage medium that may be used to store desired program code in the form of instructions or data structures and that may be accessed by a computer or processor. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-Ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media. Additionally, the operations of a method or algorithm may reside as one or any combination or set of codes and/or instructions on a non-transitory processor-readable medium and/or computer-readable medium, which may be incorporated into a computer program product.
The preceding description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the following claims and the principles and novel features disclosed herein.
While various aspects and embodiments have been disclosed, other aspects and embodiments are contemplated. The various aspects and embodiments disclosed are for purposes of illustration and are not intended to be limiting, with the true scope and spirit being indicated by the following claims.
This application claims the benefit of U.S. Provisional Application No. 63/515,002, filed Jul. 21, 2023, which is incorporated by reference in its entirety.