The present disclosure relates to systems and methods for playback of audio, and relates more particularly to enhanced security for remote playback of audio transmitted from a server.
As speech transcription demands increase in modern digital environments, more attention has been focused on ensuring the security of the audio data involved in the speech transcription. For a company utilizing human transcribers who are outside the company's digital firewall, these human transcribers present a security vulnerability, e.g., for an unauthorized person or entity seeking to access the audio data being handled by the transcribers. As an example, in the field of medical information, which is subject to a multitude of privacy regulations, unauthorized access to any audio data being handled by a transcriber working for a company presents a real possibility of reputational damage and/or regulatory repercussion for the company. Unfortunately, the current state of the art does not provide any protection particular to transcription against an attack on a transcriber's workstation. Therefore, there is a need for a system and a method to achieve increased security against an attack on a transcriber's workstation.
According to an example embodiment of the present disclosure, a system and a method are provided for security against an attack on a remote device by an attacker who has remotely taken control of the remote device, e.g., an attack on playback of audio transmitted from a server to a transcriber working on a PC that has been taken over by a remote attacker who is attempting to recover the audio.
In another example embodiment of the present disclosure, the audio sent from a server to a transcriber's workstation can be encrypted using a key which is specific to the transcriber.
In another example embodiment of the present disclosure, audio decryption can be allowed to proceed only if i) voice biometric authentication of the transcriber has been satisfied (which authentication can be required periodically), and/or ii) the transcriber types into the playback app a decode PIN which is visually displayed in the playback app.
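The gating described above can be sketched as follows. This is a minimal illustration only: the class name `DecryptionGate`, the re-authentication interval, and the six-digit PIN format are hypothetical and are not taken from the disclosure, and the voice biometric check itself is assumed to be performed elsewhere, with only its pass/fail outcome recorded here. The sketch requires both conditions; the disclosure also contemplates using either one alone.

```python
import hmac
import secrets
import time


class DecryptionGate:
    """Permits decryption only while both checks hold (hypothetical sketch)."""

    def __init__(self, reauth_interval_s=300.0, clock=time.monotonic):
        self._reauth_interval_s = reauth_interval_s  # assumed re-check period
        self._clock = clock
        self._last_voice_auth = None  # time of last successful voice check
        self._pin = None              # PIN currently displayed in the app
        self._pin_ok = False

    def issue_pin(self, digits=6):
        """Create a fresh decode PIN to be displayed in the playback app."""
        self._pin = "".join(secrets.choice("0123456789") for _ in range(digits))
        self._pin_ok = False
        return self._pin

    def submit_pin(self, typed):
        """Record whether the transcriber typed the displayed PIN correctly."""
        if self._pin is None:
            return False
        # Constant-time comparison of the typed and displayed PINs.
        self._pin_ok = hmac.compare_digest(typed.encode(), self._pin.encode())
        return self._pin_ok

    def record_voice_auth(self, passed):
        """Record the outcome of an external voice biometric check."""
        self._last_voice_auth = self._clock() if passed else None

    def may_decrypt(self):
        """True only if the PIN matched and voice authentication is recent."""
        if self._last_voice_auth is None or not self._pin_ok:
            return False
        elapsed = self._clock() - self._last_voice_auth
        return elapsed <= self._reauth_interval_s
```

In such a sketch the playback app would call `may_decrypt()` before decrypting each block of audio, so that an expired voice authentication stops playback until the transcriber is authenticated again.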
In another example embodiment of the present disclosure, the audio (e.g., speech) is played back binaurally and the signal is modified to use the “spatial release from masking” effect, so that the transcriber can still hear the audio to be transcribed, but an attacker with access to one channel would get a signal corrupted with noise.
In another example embodiment of the present disclosure, the firmware of a headphone worn by a transcriber contains a public key and a private key pair, the server encrypts the audio using the headphone's public key, and the headphone decrypts the audio using its corresponding private key.
In yet another example embodiment of the present disclosure, the decryption functionality is contained in a phone app, and the encrypted audio sent by the server is decrypted by the phone app.
In yet another example embodiment of the present disclosure, the transcriber's workstation can be embodied as a firmware-based device that can encrypt any output, e.g., the transcriber's typed output.
As illustrated in
In addition to the above, an additional layer of security can be provided by playing back the audio (e.g., speech) binaurally and modifying the signal to use the “spatial release from masking” effect, so that the transcriber can still hear the audio to be transcribed, but an attacker with access to one channel would get a signal corrupted with noise. More specifically, the technique is to spatialize the speech at one angle using head-related transfer function (HRTF) filtering, and to spatialize the noise at a different angle. When played back in mono, each channel sounds like noisy speech (unintelligible if the noise is strong enough), but when played back binaurally, the spatial separation can be exploited by the listener to separate the target speech from the noise. The intelligibility degradation may not be large enough to prevent an attacker from correctly hearing most of the words, but it would, at least, make it difficult for the attacker to build a voiceprint from any audio recording the attacker could make. Incidentally, in the case of using binaural presentation for a recording with multiple speakers, it may be beneficial to also spatialize the different speakers (e.g., as determined by automatic speech recognition (ASR) diarization) differently, in order to ease the transcribers' task.
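The placement of speech and noise at different angles can be illustrated with the following simplified sketch. It does not use measured HRTF filters; as a stand-in, it applies only an interaural time difference (the Woodworth approximation) and an assumed interaural level difference. The sample rate, head radius, level-difference formula, default angles, and all function names are assumptions made for illustration and are not taken from the disclosure.

```python
import math

SAMPLE_RATE = 16000     # Hz (assumed)
HEAD_RADIUS = 0.0875    # metres (assumed average head radius)
SPEED_OF_SOUND = 343.0  # metres per second


def spatialize(mono, azimuth_deg):
    """Place a mono signal at an azimuth between -90 and 90 degrees.

    Positive azimuth means the source is to the listener's right.
    Returns (left, right) as lists of samples. Crude stand-in for HRTF
    filtering: the far ear gets a delayed, attenuated copy.
    """
    theta = math.radians(abs(azimuth_deg))
    # Woodworth approximation of the interaural time difference.
    itd = HEAD_RADIUS / SPEED_OF_SOUND * (theta + math.sin(theta))
    delay = round(itd * SAMPLE_RATE)
    far_gain = 1.0 - 0.5 * math.sin(theta)  # assumed level drop at far ear
    near = list(mono) + [0.0] * delay
    far = [0.0] * delay + [far_gain * s for s in mono]
    return (far, near) if azimuth_deg >= 0 else (near, far)


def mix(a, b):
    """Sum two channels sample by sample, zero-padding the shorter one."""
    n = max(len(a), len(b))
    a = a + [0.0] * (n - len(a))
    b = b + [0.0] * (n - len(b))
    return [x + y for x, y in zip(a, b)]


def mask_binaurally(speech, noise, speech_az=-45.0, noise_az=45.0):
    """Return (left, right) with speech and noise at different azimuths."""
    s_left, s_right = spatialize(speech, speech_az)
    n_left, n_right = spatialize(noise, noise_az)
    return mix(s_left, n_left), mix(s_right, n_right)
```

Each returned channel contains both the speech and the noise, so a single captured channel is a noisy mixture, while the two channels together carry different interaural cues for the speech and for the noise. A practical implementation would replace `spatialize` with convolution against measured HRTF impulse responses and would choose the noise level based on listening tests.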
As an additional layer of security, e.g., in the system shown in
The system and method according to the present disclosure provide a crucial security improvement over a conventional transcription application on a PC. A conventional transcription application on a PC can use, e.g., native capabilities in a browser to decode speech transmitted to the PC via Hypertext Transfer Protocol Secure (HTTPS) or standard system calls. In this case, a remote attacker could run his own app or browser to decode and copy the speech, in a ‘man-in-the-middle’ attack (i.e., the remote attacker would produce an attacker's browser, which looked like a normal, “legitimate” browser, except that the attacker's browser copied information out of the “legitimate” browser into a location the attacker could access). Another possibility is that the remote attacker could add a browser extension, allowing the attacker to copy the decoded audio out. Yet another possibility is that the attacker could copy the decoded speech from the decoded audio buffers.
Example embodiments (e.g.,
As used in the present disclosure, the terms “transcriber” and “transcriptionist” are intended to encompass a human engaged in a broad range of speech-to-text conversion tasks, e.g., i) verbatim reporting of spoken words, ii) summarizing of spoken statements (e.g., generating a medical report based on patient encounter, work conventionally done by a medical scribe), and iii) editing of computer-controlled, ASR-based draft of text output from speech, e.g., work done by a quality document specialist (QDS).
The present disclosure provides a first example system which includes: a workstation having a playback app configured for audio playback; and a decryption module having a decryption functionality communicatively connected to the playback app, wherein the decryption module is configured to decrypt audio data previously encrypted with an encryption key associated with the decryption module.
The present disclosure provides a second example system based on the above-discussed first example system, in which second example system the encrypted audio data is i) encrypted by a server using one of a private key or a public key associated with the decryption module, and ii) transmitted for decryption by the decryption module.
The present disclosure provides a third example system based on the above-discussed second example system, in which third example system at least one of: a) a first private key associated with the decryption module is used to generate a second private key, wherein the second private key is used by the server to encrypt the audio data, and the second private key is used by the decryption module for decryption of the audio data; and b) the public key associated with the decryption module is used by the server to encrypt the audio data, and the first private key associated with the decryption module is used for decryption of the audio data.
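Option a) above, in which a second key is generated from a first key, can be sketched with a standard key derivation function. The example below implements HKDF (RFC 5869) with HMAC-SHA256 using only the Python standard library. It assumes that the server and the decryption module can both run the same derivation on the same first key; how the server comes to hold that key, or the derived key, is not specified here. The function `derive_audio_key`, the `job_id` parameter, and the context label are hypothetical.

```python
import hashlib
import hmac

HASH_LEN = hashlib.sha256().digest_size  # 32 bytes


def hkdf_sha256(key_material, salt, info, length=32):
    """HKDF (RFC 5869) with HMAC-SHA256: extract step, then expand step."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested key is too long for HKDF-SHA256")
    if not salt:
        salt = bytes(HASH_LEN)  # RFC 5869 default: a string of zero bytes
    # Extract: concentrate the input key material into a pseudorandom key.
    prk = hmac.new(salt, key_material, hashlib.sha256).digest()
    # Expand: generate output blocks until enough bytes are available.
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        data = block + info + bytes([counter])
        block = hmac.new(prk, data, hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_audio_key(first_key, job_id):
    """Derive a second key for one audio job (hypothetical helper)."""
    info = b"audio-playback|" + job_id.encode()  # assumed context label
    return hkdf_sha256(first_key, salt=b"", info=info)
```

Under these assumptions, the server and the decryption module each call `derive_audio_key` with the same first key and job identifier and obtain the same 32-byte second key, which could then be used with a symmetric cipher to encrypt and decrypt the audio data. Binding the derivation to a job identifier is a design choice of this sketch, so that a key recovered for one recording does not expose another.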
The present disclosure provides a fourth example system based on the above-discussed second example system, in which fourth example system the decryption module having the decryption functionality is part of the playback app.
The present disclosure provides a fifth example system based on the above-discussed fourth example system, in which fifth example system at least one of: i) the system further comprises a voice biometric authentication module configured to authenticate a transcriber; ii) decryption by the decryption module is enabled only upon input of a decode PIN by the transcriber; and iii) the system is configured to a) modify the audio data to spatialize speech component of the audio data at a specified first angle using head-related transfer function (HRTF) filtering and spatialize noise component of the audio data at a specified second angle, and b) play back the audio data binaurally.
The present disclosure provides a sixth example system based on the above-discussed second example system, in which sixth example system the decryption module having the decryption functionality is part of firmware of a headphone configured to be worn by a transcriber.
The present disclosure provides a seventh example system based on the above-discussed second example system, in which seventh example system the decryption module having the decryption functionality is part of a phone app.
The present disclosure provides an eighth example system based on the above-discussed seventh example system, in which eighth example system one of: i) the encrypted audio data is directly transmitted from the server to the phone app; and ii) the encrypted audio data from the server is relayed by the playback app to the phone app.
The present disclosure provides a ninth example system based on the above-discussed second example system, in which ninth example system the workstation is a firmware-based tablet.
The present disclosure provides a tenth example system based on the above-discussed ninth example system, in which tenth example system a public key associated with the server is sent to the firmware-based tablet, and output of the firmware-based tablet is encrypted using the public key associated with the server.
The present disclosure provides a first example method which includes: providing a workstation having a playback app configured for audio playback; providing a decryption module having a decryption functionality communicatively connected to the playback app; encrypting, using an encryption key associated with the decryption module, audio data; and decrypting, using the decryption module, the encrypted audio data.
The present disclosure provides a second example method based on the above-discussed first example method, in which second example method the encrypted audio data is i) encrypted by a server using one of a private key or a public key associated with the decryption module, and ii) transmitted for decryption by the decryption module.
The present disclosure provides a third example method based on the above-discussed second example method, in which third example method at least one of: a) a first private key associated with the decryption module is used to generate a second private key, wherein the second private key is used by the server to encrypt the audio data, and the second private key is used by the decryption module for decryption of the audio data; and b) the public key associated with the decryption module is used by the server to encrypt the audio data, and the first private key associated with the decryption module is used for decryption of the audio data.
The present disclosure provides a fourth example method based on the above-discussed second example method, in which fourth example method the decryption module having the decryption functionality is provided as part of the playback app.
The present disclosure provides a fifth example method based on the above-discussed fourth example method, which fifth example method further includes at least one of: i) authenticating, using a voice biometric authentication module, a transcriber; ii) enabling decryption by the decryption module only upon input of a decode PIN by the transcriber; and iii) a) modifying the audio data to spatialize speech component of the audio data at a specified first angle using head-related transfer function (HRTF) filtering and spatialize noise component of the audio data at a specified second angle, and b) playing back the audio data binaurally.
The present disclosure provides a sixth example method based on the above-discussed second example method, in which sixth example method the decryption module having the decryption functionality is provided as part of firmware of a headphone configured to be worn by a transcriber.
The present disclosure provides a seventh example method based on the above-discussed second example method, in which seventh example method the decryption module having the decryption functionality is provided as part of a phone app.
The present disclosure provides an eighth example method based on the above-discussed seventh example method, in which eighth example method one of: i) the encrypted audio data is directly transmitted from the server to the phone app; and ii) the encrypted audio data from the server is relayed by the playback app to the phone app.
The present disclosure provides a ninth example method based on the above-discussed second example method, in which ninth example method the workstation is configured as a firmware-based tablet.
The present disclosure provides a tenth example method based on the above-discussed ninth example method, which tenth example method further includes: sending, by the server, a public key associated with the server to the firmware-based tablet; and encrypting, by the firmware-based tablet using the public key associated with the server, output of the firmware-based tablet.