This application claims priority to European Application No. 22162192.3, filed Mar. 15, 2022, the entire contents of which are incorporated herein by reference.
Examples relate to sound processing devices, to corresponding methods and computer programs for sound processing devices, and to devices, such as mobile devices or hearing aids, comprising a sound processing device.
With the proliferation of audio-recording devices, it is now conceivable that any moderately frequented public place will have multiple devices recording overlapping spaces simultaneously at any time. However, these devices do not collaborate to record the surrounding soundscape, so a lot of useful information is lost in the process. For example, sound triangulation, 3D reconstruction and source-specific de-noising are processes with a wide range of applications, and they are usually enabled by recording the same signal with multiple spatially separated microphones.
There may be a desire for an improved concept for processing sound recorded by a sound processing device.
This desire is addressed by the subject-matter of the independent claims.
Various examples of the present disclosure are based on the finding that a sound processing device can collaborate with one or more further sound processing devices without exchanging the actual recorded sound in a peer-to-peer fashion, which would carry a high communication load and pose privacy risks, as locally recorded speech and sound features can betray the position of the respective further sound processing devices. Instead, the respective sound processing devices can employ a distributed learning algorithm on a sound processing model being used by one of the sound processing devices. In the distributed learning algorithm, the further sound processing devices (also called "helper devices") determine local adjustments to the sound processing model that are based on the sound that they perceive locally and share these local adjustments with the sound processing device (called "main device") using the sound processing model to process sound. In various examples of the present disclosure, the main device uses the sound processing model to perform a given sound processing task. This task may be communicated to the helper devices, so that the helper devices know which aspect of the sound processing model to adjust.
Various examples of the present disclosure relate to a sound processing device (e.g., the main device). The sound processing device comprises at least one interface for communicating with one or more further sound processing devices (e.g., the one or more helper devices). The sound processing device comprises processing circuitry, configured to obtain a sound processing model. The processing circuitry is configured to receive, from the one or more further sound processing devices, one or more local adjustments to the sound processing model determined by the one or more further sound processing devices based on sound recorded locally by the one or more further sound processing devices. The processing circuitry is configured to adjust the sound processing model based on the one or more local adjustments. The processing circuitry is configured to process sound recorded locally by the sound processing device using the sound processing model. This enables a cooperation of the sound processing device with the one or more further sound processing devices without exchanging the actual sound recorded by the sound processing devices.
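Purely as an illustration of this control flow, the following Python sketch shows how a main device might combine these steps. All names (SoundModel, obtain_model, adjust, process) and the simple additive adjustment scheme are hypothetical choices made only for this sketch; they are not part of the claimed subject-matter.

from dataclasses import dataclass

@dataclass
class SoundModel:
    # The model is represented as a flat parameter vector for simplicity.
    params: list

def obtain_model() -> SoundModel:
    # Could come from a central registry, another device, or local generation.
    return SoundModel(params=[0.0] * 8)

def adjust(model: SoundModel, local_adjustments: list) -> None:
    # Apply each helper's proposed parameter delta (a simple additive
    # scheme is assumed here; other aggregation schemes are possible).
    for delta in local_adjustments:
        model.params = [p + d for p, d in zip(model.params, delta)]

def process(model: SoundModel, recorded_sound: list) -> list:
    # Placeholder transformation: scale samples by a model-derived gain.
    gain = 1.0 + model.params[0]
    return [s * gain for s in recorded_sound]

model = obtain_model()
adjustments = [[0.01] * 8, [-0.005] * 8]  # as received from helper devices
adjust(model, adjustments)
output = process(model, [0.1, 0.2, 0.3])  # sound recorded locally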
For example, the processing circuitry may be configured to use the sound processing model to perform a sound processing task. The processing circuitry may be configured to provide information on the sound processing task to the one or more further sound processing devices. The one or more local adjustments may be determined based on the sound processing task. If the further sound processing devices are aware of the sound processing task, they can determine adjustments that are relevant with respect to that task.
The processing circuitry may be configured to repeatedly receive updates to the one or more local adjustments from at least a subset of the one or more further sound processing devices. Accordingly, the processing circuitry may be configured to repeatedly adjust the sound processing model based on the repeatedly received updates to the one or more local adjustments. By continuously exchanging updates, the sound processing model may be iteratively refined and/or adjusted to changes in the soundscape.
While cooperation between sound processing devices can be valuable, some sound processing devices may be more useful than others during the adjustment of the sound processing model. For example, the processing circuitry may be configured to determine a usefulness of the one or more local adjustments for the sound processing device, and to ignore or cease receiving updates from another sound processing device based on the usefulness of the local adjustment of the other sound processing device for the sound processing device. This may reduce a communication and processing overhead for the sound processing device and may avoid the adjustments degrading the sound processing model.
The proposed concept is particularly suitable for scenarios with a continuously evolving soundscape. Changes in the soundscape can, via the local adjustments, be propagated so the main device can, in real-time or near real-time, profit from the results of the distributed learning. For example, the processing circuitry may be configured to perform real-time processing or near-real-time processing of the sound recorded by the sound processing device using the sound processing model.
There are various viable sources for obtaining the sound processing model. For example, the processing circuitry may be configured to obtain the sound processing model from a central registry. For example, the central registry may be used to make up-to-date sound processing models available for multiple sound processing devices, so that the sound processing devices can profit from distributed learning performed by different devices.
Alternatively, the processing circuitry may be configured to obtain the sound processing model from another sound processing device, or the processing circuitry may be configured to generate the sound processing model. In this case, a peer-to-peer model can be used, so that no central registry is required. For example, the processing circuitry may be configured to provide the sound processing model (that is obtained from the central registry, from another sound processing device, or generated locally) to the one or more further sound processing devices.
The main device may actively request the helper devices to provide the adjustments or the sound processing model. For example, the processing circuitry may be configured to provide one or more requests to the one or more further sound processing devices to provide the one or more local adjustments and/or the sound processing model. Accordingly, the adjustments and/or sound processing model may be provided as needed by the main device.
The proposed concept is focused on processing audio in a given environment. In particular, the local adjustments may be useful to the main device if they originate from sound processing devices in the same environment as the main device. For example, the further sound processing devices in the environment of the main device may be learned from the central registry. The processing circuitry may be configured to obtain information on a presence of sound processing devices in a general location of the sound processing device from a central registry, and to provide the one or more requests based on the information on the presence of the sound processing devices in the general location of the sound processing device. Alternatively, a peer-to-peer approach may be used. For example, the processing circuitry may be configured to determine a presence of the one or more further sound processing devices in the general location of the sound processing device, and to provide the one or more requests based on the determination of the presence of the one or more further sound processing devices.
As pointed out above, the processing circuitry may be configured to perform distributed learning using the one or more local adjustments to adjust the sound processing model. For example, the distributed learning may be based on integrating the local adjustments proposed by the one or more further sound processing devices.
In general, care may be taken to take into account privacy considerations in the distributed learning process. For example, the local adjustments may be collected such that the privacy of the owner(s) of the one or more further sound processing devices (and nearby audio sources) is not violated. This can be done by defining (e.g., training) embeddings, which, in this case, are functions that define an alteration of the sound processed locally (or of the adjustments to the sound processing model) that is performed in order to alter (e.g., obfuscate) at least one aspect of the sound recorded locally. For example, the one or more local adjustments may be based on one or more embeddings designed to alter at least one aspect of the sound recorded locally, such as an impact of local speech or an impact of a location of the respective further sound processing device.
In addition, or alternatively, the local adjustments may be limited by a differential privacy algorithm. For example, the one or more local adjustments may be based on a privacy budget imposed by a differential privacy algorithm.
In the present disclosure, a sound processing model is used to process the sound recorded locally by the main device. However, the term sound processing model is not to be understood in a limited fashion. In various examples, multiple layers of sound processing models may be used to process the sound. For example, the processing circuitry may be configured to process the sound recorded locally using the sound processing model and using a second sound processing model, with the sound processing model being a task-agnostic sound processing model and the second sound processing model being a task-specific sound processing model. For example, the task-agnostic model, which may provide a more general improvement of the sound processing, may be adjusted based on the one or more local adjustments.
In some scenarios, a third sound processing layer may be added, such as a further task-specific sound processing model that is adjusted based on adjustments proposed by the one or more further sound processing devices. For example, the processing circuitry may be configured to process the sound recorded locally further using a third sound processing model, with the third sound processing model being a task-specific sound processing model. The processing circuitry may be configured to receive, from the one or more further sound processing devices, one or more further local adjustments to the third sound processing model determined by the one or more further sound processing devices based on sound recorded locally by the one or more further sound processing devices, and to adjust the third sound processing model based on the one or more further local adjustments. This may enable or improve task-specific adjustments to the sound processing performed by the main device.
Various examples of the present disclosure further provide another device comprising the sound processing device (i.e., the main device), such as a hearing aid comprising the sound processing device or a mobile communication device (e.g., a smartphone or smartwatch) comprising the sound processing device.
Various examples of the present disclosure relate to a corresponding method for a sound processing device (i.e., for the main device). The method comprises obtaining a sound processing model. The method comprises receiving, from one or more further sound processing devices, one or more local adjustments to the sound processing model performed by the one or more further sound processing devices based on sound recorded locally by the one or more further sound processing devices. The method comprises adjusting the sound processing model based on the one or more local adjustments. The method comprises processing sound recorded locally by the sound processing device using the sound processing model.
Various examples of the present disclosure relate to a corresponding computer program having a program code for performing the above method (for the main device), when the computer program is executed on a computer, a processor, or a programmable hardware component.
Various examples of the present disclosure relate to another sound processing device (i.e., the helper device). The sound processing device comprises at least one interface for communicating with a further sound processing device (i.e., the main device). The sound processing device comprises processing circuitry, configured to obtain a sound processing model. The processing circuitry is configured to obtain information on a sound processing task being performed by the further sound processing device. The processing circuitry is configured to determine a local adjustment to the sound processing model based on sound recorded locally by the sound processing device and based on the sound processing task being performed by the further sound processing device. The processing circuitry is configured to provide the local adjustment to the further sound processing device. Thus, the helper device may participate in distributed learning with the further sound processing device (i.e., the main device).
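As a non-limiting illustration, the following Python sketch outlines how a helper device might derive a local adjustment from the task information. The heuristic and the names (TaskInfo, determine_local_adjustment) are assumptions made only for this sketch.

from dataclasses import dataclass

@dataclass
class TaskInfo:
    task_id: str  # e.g., "denoise" or "isolate_voices"
    sample: list  # anonymized sample of sound provided by the main device

def determine_local_adjustment(model_params: list, local_sound: list,
                               task: TaskInfo) -> list:
    # Toy heuristic: propose a gain correction proportional to the level
    # difference between the main device's sample and the local recording.
    local_level = sum(abs(s) for s in local_sound) / len(local_sound)
    sample_level = sum(abs(s) for s in task.sample) / len(task.sample)
    return [sample_level - local_level] + [0.0] * (len(model_params) - 1)

task = TaskInfo(task_id="denoise", sample=[0.2, 0.1, 0.15])
adjustment = determine_local_adjustment([0.0] * 8, [0.3, 0.25, 0.2], task)
# 'adjustment' would then be sent to the main device via the interface.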
As outlined in relation to the main device, the helper device may provide frequent updates to the local adjustment. For example, the processing circuitry may be configured to repeatedly determine updates to the local adjustment to the sound processing model based on newly recorded sound recorded by the sound processing device, and to provide the updates to the further sound processing device. Thus, the sound processing model may be iteratively refined and/or adapted to a changing soundscape.
For example, in a scenario with a central registry, the processing circuitry may be configured to obtain the sound processing model from a central registry. Alternatively, in a peer-to-peer scenario, the processing circuitry may be configured to obtain the sound processing model from another sound processing device.
The local adjustment may be requested by the main device. Accordingly, the processing circuitry may be configured to receive a request for the local adjustment from the further sound processing device, and to provide the local adjustment in response to the request. Thus, the main device may control whether to obtain local adjustment(s) from a helper device.
As outlined above, the determination of the local adjustment may be part of distributed learning. For example, the distributed learning may be focused on the main device integrating the local adjustments proposed by the helper devices.
As shown in connection with the main device, care may be taken to take into account privacy considerations in the distributed learning process. For example, the processing circuitry may be configured to apply one or more embeddings designed to alter at least one aspect of the sound recorded locally, such as an impact of local speech or an impact of a location of the respective further sound processing device. Alternatively, or additionally, the processing circuitry may be configured to determine the local adjustment based on a privacy budget of a differential privacy algorithm.
In some examples, the main device uses multiple layers of sound processing models to process the sound. In particular, the main device may use a task-agnostic sound processing model and one or more task-specific sound processing models. In some cases, the helper device may participate in distributed learning to improve a task-specific model (in addition to the task-agnostic sound processing model). For example, the sound processing model may be a task-agnostic sound processing model. The processing circuitry may be configured to obtain a task-specific sound processing model, to determine a further local adjustment to the task-specific sound processing model based on the sound recorded locally by the sound processing device, and to provide the further local adjustment to the further sound processing device.
Various examples of the present disclosure further provide another device comprising the sound processing device (i.e., the helper device), such as a hearing aid comprising the sound processing device or a mobile communication device (e.g., a smartphone or smartwatch) comprising the sound processing device.
Various examples of the present disclosure relate to a corresponding method for a sound processing device (i.e., for the helper device). The method comprises obtaining a sound processing model. The method comprises obtaining information on a sound processing task being performed by a further sound processing device. The method comprises determining a local adjustment to the sound processing model based on sound recorded locally by the sound processing device and based on the sound processing task being performed by the further sound processing device. The method comprises providing the local adjustment to the further sound processing device.
Various examples of the present disclosure relate to a corresponding computer program having a program code for performing the above method (for the helper device), when the computer program is executed on a computer, a processor, or a programmable hardware component.
Some examples of apparatuses and/or methods will be described in the following by way of example only, and with reference to the accompanying figures, in which
Some examples are now described in more detail with reference to the enclosed figures. However, other possible examples are not limited to the features of these embodiments described in detail. Other examples may include modifications of the features as well as equivalents and alternatives to the features. Furthermore, the terminology used herein to describe certain examples should not be restrictive of further possible examples.
Throughout the description of the figures same or similar reference numerals refer to same or similar elements and/or features, which may be identical or implemented in a modified form while providing the same or a similar function. The thickness of lines, layers and/or areas in the figures may also be exaggerated for clarification.
When two elements A and B are combined using an “or”, this is to be understood as disclosing all possible combinations, i.e., only A, only B as well as A and B, unless expressly defined otherwise in the individual case. As an alternative wording for the same combinations, “at least one of A and B” or “A and/or B” may be used. This applies equivalently to combinations of more than two elements.
If a singular form, such as “a”, “an” and “the” is used and the use of only a single element is not defined as mandatory either explicitly or implicitly, further examples may also use several elements to implement the same function. If a function is described below as implemented using multiple elements, further examples may implement the same function using a single element or a single processing entity. It is further understood that the terms “include”, “including”, “comprise” and/or “comprising”, when used, describe the presence of the specified features, integers, steps, operations, processes, elements, components and/or a group thereof, but do not exclude the presence or addition of one or more other features, integers, steps, operations, processes, elements, components and/or a group thereof.
The main device 10 and the one or more helper devices 20 interact with each other, with the helper devices determining local adjustments to a sound processing model, and with the main device using said adjustments to adjust the sound processing model and to process sound using the adjusted sound processing model. In effect, the main device 10 and the one or more helper devices 20 may perform distributed learning, with the main device 10 reaping the benefits of the distributed learning process. For example, the processing circuitry of the main device may be configured to perform distributed learning using the one or more local adjustments to adjust the sound processing model. Similarly, the determination of the local adjustment performed by the one or more helper devices may be part of distributed learning. For example, the processing circuitry 14 of the main device 10 may share the result of the distributed learning, e.g., the adjusted sound processing model, with a central registry or with the one or more helper devices. In the following, the collaboration between the two types of devices is shown in more detail.
On both sides, the actions being performed are based on the sound processing model, which is obtained by the respective processing circuitry. In general, the sound processing model may be any set of instructions for transforming sound recorded by the respective sound processing device. For example, the sound processing model may comprise a set of labelled audio filters. For example, adjustments to the sound processing model may relate to parameters of the set of labelled audio filters. The sound processing model may be used to transform the sound recorded locally by the respective sound processing device, e.g., with the purpose of improving an aspect of the sound, e.g., by suppressing noise or by making voices more intelligible.
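For illustration, a sound processing model built from labelled audio filters could be represented as in the following Python sketch. The first-order low-pass stand-in and the names (LabelledFilter, apply_filters) are assumptions of this sketch, not a prescribed implementation.

import math
from dataclasses import dataclass

@dataclass
class LabelledFilter:
    label: str     # e.g., "speech_band" or "traffic_noise"
    cutoff: float  # illustrative parameter that local adjustments may change
    gain: float

def apply_filters(filters: list, samples: list, fs: float = 16000.0) -> list:
    # Simple stand-in: each labelled filter is a first-order low-pass whose
    # output is weighted by its gain and summed into the result.
    dt = 1.0 / fs
    out = [0.0] * len(samples)
    for f in filters:
        alpha = dt / (dt + 1.0 / (2 * math.pi * f.cutoff))
        y = 0.0
        for i, x in enumerate(samples):
            y += alpha * (x - y)
            out[i] += f.gain * y
    return out

model = [LabelledFilter("speech_band", cutoff=3000.0, gain=1.0),
         LabelledFilter("traffic_noise", cutoff=300.0, gain=-0.5)]
processed = apply_filters(model, [0.0, 0.5, 0.25, -0.1])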
To obtain the model, two different approaches may be used: a centralized approach and a decentralized approach. In the centralized approach, the sound processing model may be hosted and provided by a central registry. Accordingly, the processing circuitry of the main device may be configured to obtain the sound processing model from the central registry. Similarly, the processing circuitry of the helper device may be configured to obtain the sound processing model from the central registry. For example, the central registry may be a server, e.g., an edge server that covers a pre-defined coverage area (with the main device and/or the one or more helper devices being located in the coverage area). For example, the central registry may be hosted by a provider of a mobile communication system, e.g., by a provider of a cellular mobile communication system or by a hotspot provider.
In the decentralized approach, the sound processing model may be shared among sound processing devices. For example, a decentralized registry may be maintained among the sound processing devices using a peer-to-peer communication approach. For example, the processing circuitry of the main device and/or the processing circuitry of the helper device may be configured to obtain the sound processing model from another sound processing device. For example, the processing circuitry of the main device may be configured to obtain the sound processing model from another sound processing device, and to forward the sound processing model to the one or more helper devices. Alternatively, the processing circuitry of the main device may be configured to generate the sound processing model (e.g., based on the sound processing task it is trying to accomplish), and to provide the generated sound processing model to the one or more further sound processing devices.
In general, the main device uses the sound processing model to perform a sound processing task. For example, the main device may use the sound processing model to suppress noise, or to isolate some components of the sound (e.g., voices). Information on the sound processing task may be shared by the main device with the one or more helper devices (if it is not inherent to the sound processing model). For example, the processing circuitry of the main device may be configured to provide information on the sound processing task to the one or more helper devices. Accordingly, the processing circuitry of the helper device may be configured to obtain information on a sound processing task being performed by the main device, which it uses to determine the one or more local adjustments. For example, the processing circuitry of the main device may be configured to compile a sample of sound recorded locally by the main device (and anonymize the sample, e.g., using embeddings), and to provide a task identifier and the sample as information on the task being performed by the main device to the one or more helper devices. For example, the processing circuitry of the main device may be configured to periodically update the sample of sound recorded locally by the main device, and to provide updates of the sample to the one or more helper devices.
In general, the proposed concept is based on the main device inviting the helper devices to collaborate in the distributed learning process. For this purpose, the main device may identify suitable helper devices, e.g., based on their location or willingness to cooperate. Again, a centralized or a decentralized approach may be chosen. For example, the (potential) helper devices may be identified by or via the central registry. For example, the processing circuitry of the main device may be configured to obtain information on a presence of sound processing devices in a general location of the sound processing device from the central registry. For example, the central registry may track a (general) location of the sound processing devices and may determine helper devices that are in the same general location as the main device based on their location. Alternatively, a peer-to-peer approach may be used. The processing circuitry of the main device may be configured to determine the presence of the one or more further sound processing devices in the general location of the sound processing device. For example, the processing circuitry of the main device may be configured to broadcast a request for helper devices to respond if they are in the same general location as the main device. For example, two sound processing devices may be in the same general location if a distance between the two sound processing devices is at most 25 meters (or at most 50 meters, or at most 100 meters, or at most 200 meters) or if the two sound processing devices are within the same space (e.g., courtyard, concert hall, open-air performing arts venue, public transport platform etc.).
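The distance criterion could, for instance, be evaluated as in the following Python sketch, assuming the devices can share coarse WGS84 coordinates; the helper names and the 25-meter threshold (taken from the example above) are illustrative only.

import math

MAX_DISTANCE_M = 25.0  # illustrative threshold from the example above

def haversine_m(lat1, lon1, lat2, lon2):
    # Great-circle distance between two WGS84 coordinates, in meters.
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def same_general_location(dev_a, dev_b):
    return haversine_m(*dev_a, *dev_b) <= MAX_DISTANCE_M

print(same_general_location((48.1374, 11.5755), (48.1375, 11.5756)))  # True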
In some examples, the central registry or the processing circuitry of the main device may organize the one or more helper devices in a directed graph (as shown in
Once suitable helper device(s) are identified, the main device may request the one or more helper devices to participate in the distributed learning effort. For example, the processing circuitry of the main device may be configured to provide one or more requests to the one or more helper devices to provide the one or more local adjustments and/or the sound processing model. For example, the one or more requests may be provided to the one or more helper devices based on their presence in the general location of the main device, i.e., based on the information on the presence of the sound processing devices in the general location of the sound processing device. Accordingly, the processing circuitry of the helper device may be configured to receive a request for the local adjustment from the further sound processing device (e.g., based on the presence of the helper device in the general location of the main device), and to provide the local adjustment in response to the request. In some cases, e.g., as shown in
The core of the proposed concept is the determination of the local adjustments by the helper devices. The local adjustment determined by the respective helper device may be considered the contribution of that helper device to the distributed learning being performed. For example, the distributed learning may be performed using different techniques, e.g., centralized techniques such as Federated Learning, or decentralized techniques such as Multi-Party Computation (MPC) or Fully Decentralized Learning.
For example, if Federated Learning is used, the sound processing model may be the “global” model being trained, with the model being trained by the helper devices using the sound recorded locally by the helper devices (and the sample of sound provided by the main device, e.g., to test the suitability of the proposed local adjustments). If the sound processing model is implemented using a neural network, the adjusted weights of the neural network may be provided as local adjustment to the main device. If the sound processing model is implemented using a set of audio filters, the parameters of the set of audio filters being changed by the local adjustment may be provided to the main device. Fully decentralized learning may be considered similar to federated learning, albeit without the data being collected centrally, but at each participant of the decentralized learning approach.
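To illustrate the Federated-Learning variant, the following Python sketch shows a weighted averaging step over the helpers' locally adjusted parameter vectors, in the style of FedAvg; the function name and the uniform default weights are assumptions of this sketch.

def federated_average(global_params: list, local_param_sets: list,
                      weights: list = None) -> list:
    # Weighted average of the helpers' locally adjusted parameter vectors;
    # with uniform weights this is the plain FedAvg aggregation step.
    n = len(local_param_sets)
    weights = weights or [1.0 / n] * n
    return [sum(w * params[i] for w, params in zip(weights, local_param_sets))
            for i in range(len(global_params))]

global_params = [0.0, 1.0]
helper_params = [[0.2, 1.1], [0.0, 0.9], [0.1, 1.0]]
print(federated_average(global_params, helper_params))  # approx. [0.1, 1.0]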
In Multi-Party Computation, multiple participants (e.g., the main device and the one or more helper devices) each have private data (e.g., the sound recorded locally), which they use to jointly compute the value of a public function without revealing the private data. For example, a secret sharing scheme, such as Shamir secret sharing or additive secret sharing, may be used to adjust the sound processing model (by the main device), with the local adjustments being the shared secrets of the one or more helper devices.
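A minimal sketch of additive secret sharing is given below, assuming each helper splits its (scalar) local adjustment into shares so that only the aggregate of all adjustments is ever reconstructed; the function names are illustrative.

import random

def share(secret: float, n: int) -> list:
    # Split a secret into n additive shares that sum to the secret.
    shares = [random.uniform(-1.0, 1.0) for _ in range(n - 1)]
    shares.append(secret - sum(shares))
    return shares

def reconstruct(shares: list) -> float:
    return sum(shares)

# Three helpers each split their local adjustment into three shares and
# distribute them; each party only ever sees one share per adjustment.
adjustments = [0.3, -0.1, 0.25]
all_shares = [share(a, 3) for a in adjustments]
# Each party sums the shares it holds; combining the partial sums yields the
# aggregated adjustment without revealing any individual contribution.
partial_sums = [sum(s[i] for s in all_shares) for i in range(3)]
print(reconstruct(partial_sums))  # approx. 0.45 (= sum(adjustments))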
The processing circuitry of the helper device is configured to determine the local adjustment to the sound processing model based on sound recorded locally by the sound processing device and based on the sound processing task being performed by the further sound processing device. In other words, the processing circuitry of the helper device may be configured to determine the local adjustment to the sound processing model such, that the main device is supported in carrying out the sound-processing task by the local adjustment. For example, the processing circuitry of the helper device may use the sample of sound recorded locally by the main device to evaluate the local adjustment with respect to the sound processing task, e.g., to determine whether the local adjustment is beneficial with respect to the sound processing task (e.g., beneficial with respect to the suppression of noise or beneficial with respect to the isolation of voices). To give an example, which is illustrated in more detail in connection with
On the side of the main device, the processing circuitry is configured to receive, from the one or more further sound processing devices, the one or more local adjustments to the sound processing model determined by the one or more further sound processing devices based on sound recorded locally by the one or more further sound processing devices, e.g., as contribution of the respective one or more helper devices to the distributed learning scheme, e.g., as changes in parameter values of the set of audio filters or as changed weights of a neural network.
The processing circuitry of the main device may evaluate the local adjustments proposed by the one or more helper devices, e.g., to determine whether the respective changes are useful for processing the sound recorded by the main device. For example, the processing circuitry of the main device may be configured to determine a usefulness of the one or more local adjustments for the sound processing device (e.g., for the purpose of performing the sound processing task). Depending on the usefulness of the one or more local adjustments, they may be applied to the sound processing model. For example, depending on the distributed learning scheme being used, the contributions of the one or more helper devices may be used to adjust the sound processing model according to the respective distributed learning scheme.
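The usefulness check could, for instance, compare a task loss on a locally held sample before and after applying a proposed adjustment, as in the following sketch; the loss function is a hypothetical stand-in for whatever metric the sound processing task defines.

def task_loss(params: list, sample: list) -> float:
    # Hypothetical stand-in: treat the first parameter as a noise-suppression
    # gain and measure the residual signal energy after applying it.
    gain = 1.0 - params[0]
    return sum((gain * s) ** 2 for s in sample)

def is_useful(params: list, delta: list, sample: list,
              margin: float = 0.0) -> bool:
    # An adjustment is deemed useful if it lowers the task loss by more
    # than a configurable margin.
    candidate = [p + d for p, d in zip(params, delta)]
    return task_loss(candidate, sample) < task_loss(params, sample) - margin

params = [0.0, 0.0]
noisy_sample = [0.4, -0.3, 0.5]
print(is_useful(params, [0.2, 0.0], noisy_sample))  # True: loss decreases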
In general, a soundscape can change quickly, as people and objects move relative to each other, and as new sound sources appear, or previous sound sources cease emitting sound. Therefore, the sound processing model may be continuously adapted to the evolving soundscape. This may be done by not only receiving a single local adjustment per helper device, but by receiving (frequent) updates from the one or more helper devices. For example, the processing circuitry of the helper device may be configured to repeatedly (e.g., periodically, or when the soundscape changes, or both) determine updates to the local adjustment to the sound processing model based on newly recorded sound recorded by the sound processing device, and to provide the updates to the further sound processing device. In general, these updates may be provided frequently, so the main device can adapt the sound processing model to the changing soundscape. For example, a time interval between successive updates to the local adjustment may be at most fifteen seconds (or at most 10 seconds, or at most 5 seconds, or at most 1 second, or at most 100 ms, or at most 50 ms), which may depend on the task being performed. For example, for the purpose of real-time or near-real-time voice processing, update intervals of at most 100 ms (or at most 50 ms) may be desirable, to enable frequent updates to the sound processing model. On the side of the main device, the processing circuitry of the main device is configured to repeatedly receive updates to the one or more local adjustments from at least a subset (deemed to provide useful local adjustments) of the one or more further sound processing devices. The main device may use these updates to update the sound processing model accordingly. For example, the processing circuitry of the main device may be configured to repeatedly adjust the sound processing model based on the repeatedly received updates to the one or more local adjustments.
In some cases, helper devices that were initially deemed to provide useful adjustments may become less useful over time, e.g., as sound sources cease to emit sound, or as the respective devices move relative to each other. Accordingly, the main device may update the list (or graph) of helper devices it requests and receives updates (i.e., subscribes to updates) from. For example, the processing circuitry of the main device may be configured to ignore or cease receiving updates from another sound processing device based on the usefulness of the local adjustment of the other sound processing device for the sound processing device. On the other hand, the processing circuitry of the main device may be configured to add additional helper devices (it requests local adjustments from) over time, e.g., based on them being in the same general location.
Using the adjusted sound processing model, the main device processes the sound recorded locally by the main device. For example, the processing circuitry of the main device may be configured to perform real-time processing or near-real-time processing (e.g., with a delay of at most 5 seconds (or at most 2 seconds, or at most 1 second) between recording and processing of the sound) of the sound recorded by the sound processing device using the sound processing model.
In various examples of the present disclosure, the main device and the helper devices may collaborate in a privacy-preserving manner. This may be done on two levels: as part of the communication, and as part of the local adjustments and/or the sample of sound shared by the helper devices and the main device, respectively.
With respect to communication privacy, the techniques listed as part of the “privacy (communication) layer” shown in connection with
With respect to data privacy, the techniques listed as part of the “privacy (signal) layer” shown in connection with
Additionally, or alternatively, differential privacy may be used. For example, a privacy budget of a differential privacy algorithm may be used to control how often the helper device provides an update to the local adjustment (or whether the helper device agrees to provide a local adjustment) or to control whether to apply a privacy-preserving embedding. For example, the processing circuitry of the helper device may be configured to determine the local adjustment based on a privacy budget of a differential privacy algorithm. Accordingly, the one or more local adjustments may be based on a privacy budget imposed by a differential privacy algorithm.
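A simple way to picture the privacy budget is an epsilon accountant that adds Laplace noise to each released adjustment and refuses further releases once the budget is spent, as sketched below in Python; the class name and parameters are assumptions of this sketch.

import random

class PrivacyBudget:
    # Minimal epsilon accountant: every released update consumes budget.
    def __init__(self, total_epsilon: float):
        self.remaining = total_epsilon

    def release(self, adjustment: float, epsilon: float,
                sensitivity: float = 1.0):
        if epsilon > self.remaining:
            return None  # budget exhausted: decline to provide an update
        self.remaining -= epsilon
        scale = sensitivity / epsilon
        # Laplace mechanism: the difference of two i.i.d. exponential
        # variables with mean 'scale' is Laplace-distributed with that scale.
        noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
        return adjustment + noise

budget = PrivacyBudget(total_epsilon=1.0)
noisy = budget.release(0.25, epsilon=0.2)  # noisy adjustment, or None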
In some cases, not all of the helper devices (or main devices) may be considered to be trustworthy (or useful). For example, some helper devices may have malicious intent, and may try to poison the distributed learning, while some main devices might try to only benefit from distributed learning, without contributing to the distributed learning of other devices. As will be described in connection with
In the above description, a single sound processing model was mentioned that is being used to process the sound recorded by the main device. However, the proposed concept is not limited to a single sound processing model. The main device may use multiple sound processing models to process the sound recorded by the main device. For example, the processing circuitry of the main device may be configured to process the sound recorded locally using the sound processing model (in the following also denoted first sound processing model or task-agnostic sound processing model) and using a second sound processing model. The sound processing model may be a task-agnostic sound processing model, and the second sound processing model may be a task-specific sound processing model. For example, the sound processing model may be the base model, with the second sound processing model being applied on top of the first sound processing model. The first sound processing model being task-agnostic means that it may be suitable for different tasks (as it handles generic aspects, such as the removal of noise). The first sound processing model may then be combined with the second sound processing model, which is a task-specific model (i.e., a model that is specific to a single sound processing task), and which might not be adjusted based on the local adjustments provided by the one or more helper devices. However, the main device may attempt to improve the second sound processing model without input from the one or more helper devices.
In some examples, the layer stack may be extended by a third sound processing model (being inserted between the first and second sound processing models). For example, the processing circuitry of the main device may be configured to process the sound recorded locally further using a third sound processing model. This third sound processing model may be a task-specific sound processing model, and it may be improved or optimized using distributed learning with the help of the one or more helper devices. For example, the processing circuitry of the helper device may be configured to obtain a task-specific sound processing model (i.e., the third sound processing model), to determine a further local adjustment to the task-specific sound processing model based on the sound recorded locally by the sound processing device (similar to the determination of the local adjustment), and to provide the further local adjustment to the further sound processing device. For example, the helper device may use the sample of sound provided by the main device and the sound recorded locally by the helper device to determine the further local adjustment to the third sound processing model. Accordingly, the processing circuitry of the main device may be configured to receive, from the one or more further sound processing devices, one or more further local adjustments to the third sound processing model determined by the one or more further sound processing devices based on sound recorded locally by the one or more further sound processing devices, and to adjust the third sound processing model based on the one or more further local adjustments. For example, the determination of the local adjustments, updates to the local adjustments, and adjustment of the third sound processing model may be implemented similar to the respective aspects of the (first) sound processing model.
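The resulting layer stack could be composed as in the following Python sketch, where each layer is a placeholder function; which transformations the layers actually perform depends on the use case and is merely assumed here.

def task_agnostic(samples: list) -> list:
    # First model: generic clean-up, adjusted via the helpers' contributions.
    return [0.9 * s for s in samples]

def task_specific_shared(samples: list) -> list:
    # Third model: task-specific, also refined through distributed learning.
    return [s + 0.01 for s in samples]

def task_specific_local(samples: list) -> list:
    # Second model: task-specific, maintained by the main device alone.
    return [max(-1.0, min(1.0, s)) for s in samples]

def pipeline(samples: list) -> list:
    # Stacking order sketched above: task-agnostic base, then the shared
    # task-specific layer, then the purely local task-specific layer on top.
    return task_specific_local(task_specific_shared(task_agnostic(samples)))

print(pipeline([0.5, -0.2, 1.5]))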
The at least one interface 12; 22 of the main device 10 and/or the helper device 20 may correspond to one or more inputs and/or outputs for receiving and/or transmitting information, which may be in digital (bit) values according to a specified code, within a module, between modules or between modules of different entities. For example, the at least one interface 12; 22 of the main device 10 and/or the helper device 20 may comprise interface circuitry configured to receive and/or transmit information. For example, the main device 10 and/or the one or more helper devices 20 (and/or the central registry) may be configured to communicate via a computer network, e.g., via a mobile communication system, such as a cellular mobile communication system (being based on a standard defined by the 3rd Generation Partnership Project, 3GPP, such as Long Term Evolution or a 5th Generation (5G) cellular mobile communication system), or a mobile communication system being based on Bluetooth or a variant of the IEEE (Institute of Electrical and Electronics Engineers) standard 802.11.
For example, the processing circuitry 14; 24 of the main device 10 and/or the helper device may be implemented using one or more processing units, one or more processing devices, any means for processing, such as a processor, a computer or a programmable hardware component being operable with accordingly adapted software. In other words, the described function of the processing circuitry 14; 24 of the main device 10 and/or the helper device 20 may as well be implemented in software, which is then executed on one or more programmable hardware components. Such hardware components may comprise a general-purpose processor, a Digital Signal Processor (DSP), a micro-controller, etc.
For example, the storage circuitry 16; 26 of the main device 10 and/or helper device 20 may comprise at least one element of the group of a computer readable storage medium, such as a magnetic or optical storage medium, e.g., a hard disk drive, a flash memory, Floppy-Disk, Random Access Memory (RAM), Programmable Read Only Memory (PROM), Erasable Programmable Read Only Memory (EPROM), an Electronically Erasable Programmable Read Only Memory (EEPROM), or a network storage.
More details and aspects of the sound processing devices 10; 20 and of the corresponding systems, devices 100; 200, methods and computer programs are mentioned in connection with the proposed concept, or one or more examples described above or below (e.g.,
Various aspects of the present disclosure relate to a concept for a privacy-preserving, crowdsourced decomposition of soundscape. A system is proposed where (devices of) willing participants can, in a privacy preserving manner, perform collaborative machine learning with the purpose of building a (potentially task-agnostic) encoder. For example, the proposed system may be used for collaborative reconstruction of 3D soundscapes, selective noise cancelling, helping with disabilities (hearing loss), or improving voice recognition systems. Various examples of the proposed system support near-real-time to real-time inference depending on the setup and task (e.g., for speech, a latency below 50 ms may be achieved).
In order to improve or optimize the model associated with the current environment, the devices exchange information according to a distributed learning algorithm (as part of the privacy-preserving learning strategy 320). In addition, each participant may improve or optimize the selection of participants it takes information from, in order to minimize processing time and increase or maximize performance on its task.
Each device may perform a task which is either improved or accelerated by having access to a global encoding model (i.e., the sound processing model). For example, the model may be applied on audio signals 332, 334 and 336 emitted in a first location, a second location and a third location, respectively. The encoder itself depends on the use case. It (i.e., the sound processing model) may be a function mapping raw sensor inputs to privacy-preserving data.
The proposed concept may be implemented in different ways. In the following, examples of high-level implementations of the different communication components and signal processing components are given.
First, examples are given with respect to the components (layers) responsible for communication-related features of the proposed concept. The device network (of sound processing devices) can be set up with or without the presence of a trusted server (i.e., the central registry) facilitating the communication, enabling both centralized and decentralized implementations.
In various examples, a (centralized or decentralized) registry layer may be used, which is a repository of device metadata used to set up the communication network as well as to assess device collaboration opportunities. In a centralized implementation, (all of) the devices register in a central server (i.e., the central registry) and publish there the required information to participate in the network. When a user becomes active on the network, the user's device registers with the central server, which manages a registry of devices/users. In a decentralized implementation, a peer-to-peer local network may be used. In this implementation, each device keeps track of devices open to collaborate in its vicinity.
The devices may use a subscription layer, which manages communications between the devices. In a centralized implementation, centralized communication may be used (i.e., (all) communication may be routed by (or via) a central server). In a decentralized implementation, a peer-to-peer local network may be used, and communication channels may be opened between trusted devices in a publisher-subscriber fashion. In a broadcasting implementation, the respective data (e.g., the sound processing model, the information on the task and/or the local adjustments) may be broadcast by the participants or by the infrastructure. Contributions of individual recording devices (i.e., sound processing devices) may be broadcast in a localized area. Users can cherry-pick (i.e., select among) the broadcast packets.
In some examples, a verification layer may be used to validate device honesty (data contributions as well as subscription behavior). In a centralized implementation, a trusted third party may play the role of validating devices that desire to participate in the local network. This can be done in various ways, with cryptographic certificates distributed to trusted agents, and/or by continuous verification of each device's behavior on the network. In a decentralized implementation, a trustless network may be used. For example, if no trusted third party exists, each device can monitor the contributions of the other devices.
For example, a privacy (communication) layer may be used to increase the privacy of the communication layers (excluding actual data privacy). In a centralized implementation, a curious-but-honest third party may be assumed. In the case where the central server is not malicious but is curious, local privacy may be preserved. Standard encryption techniques can be used for communications. The registry can store temporary session IDs instead of permanent device IDs. If the verification layer requires decryption of the content of the shared data, a secure enclave can be set up in collaboration with each participating device. In a decentralized implementation, local privacy may be used. In this setting, privacy leakage can happen, similar to Bluetooth. It is possible to mitigate such leakage by using various obfuscation techniques, but not to fully prevent it, as users can potentially see each other physically and reverse-engineer the obfuscation.
In the following, examples are given with respect to the signal processing components. For example, high-level components (layers) responsible for the security and processing of signals are described.
For example, the devices may use an embedding layer, which extracts the necessary information (and only the necessary information) from the current device's actual recording (i.e., the sound recorded locally by the respective device). For example, it can comprise or consist of anything from a basic band-pass filter up to a deep neural network. Its output is pushed to subscribers (e.g., used to determine the local adjustments).
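At the band-pass end of that spectrum, an embedding could look like the following Python sketch (a first-order low-pass cascaded with a first-order high-pass); the cut-off values and the function name are illustrative assumptions.

import math

def band_pass(samples: list, low_hz: float, high_hz: float,
              fs: float = 16000.0) -> list:
    # Crude band-pass embedding: first-order low-pass at high_hz followed
    # by a first-order high-pass at low_hz.
    dt = 1.0 / fs
    a_l = dt / (dt + 1.0 / (2 * math.pi * high_hz))
    rc_h = 1.0 / (2 * math.pi * low_hz)
    a_h = rc_h / (rc_h + dt)
    out, y_l, y_h, prev = [], 0.0, 0.0, 0.0
    for x in samples:
        y_l += a_l * (x - y_l)          # low-pass stage
        y_h = a_h * (y_h + y_l - prev)  # high-pass stage
        prev = y_l
        out.append(y_h)
    return out

# Keep roughly the speech band of a locally recorded snippet.
embedding = band_pass([0.0, 1.0, 0.5, -0.5], low_hz=300.0, high_hz=3400.0)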
The devices may use a privacy (signal) layer, which may remove (any) privacy sensitive information from the embedding layer. It can be put on top of the embedding layer, with, for instance, differential privacy or cryptographic methods (e.g., distributed learning with Multi-Party Computation), or integrated in it, for instance using adversarial learning.
For example, a reconstruction layer may be used to model the recorded signal using all the embeddings received from participating devices. It can for instance model the signal as a sum of incoherent labelled components. It may optionally contain a forecasting model aiming at real-time reconstruction.
A learning layer may manage the collaborative learning of the embedding and reconstruction layers. For example, the learning layer may subscribe to new recording devices if they appear from their metadata to be potentially helpful and may unsubscribe from the devices which are redundant or do not show signs of overlapping with the locally recorded signal. It may improve/calibrate the embedding and reconstruction layers, e.g., using master-less distributed learning like MPC (Multi-Party Computation) or fully decentralized learning. If a centralized embodiment is chosen, Federated Learning may be used.
In addition, a sound processing device (e.g., recording device, main device) may be able to recruit a new device in order to increase the accuracy of the task at hand. In this example, devices 450 and 460 can try to isolate sources 410 and 420 while suppressing source 430. Because the device 470 has a strong recording of the background with only a weak contribution of 410 and 420, it can be used to suppress source 430.
In addition, it is possible that another source of noise 440, outside of the range of 410 and 420, is interfering with the recording of 470. However, if 480 participates in the soundscape reconstruction of the device 470, 410 and 420 (or 450/460) can indirectly benefit from it.
The same process may be used with a single batch of data being shared from Device B to Device A to populate the registry. This can be included in (6) of the setup process shown in
In the following, an application of the proposed concept on hearing aids is shown. In this application of the proposed concept, the hearing aids may be helped by family and friends' phones.
In the following, the hearing aids (HA) are assumed to be the main device, which is assisted by the helper devices (HD).
The HAs may be one or more devices that have the task of providing hearing aid with an improved signal-to-noise ratio (e.g., by decreasing reverberation) and with the ability to focus attention on specific sound sources. They may also have the task of creating a small dataset that helper devices can use to train an initial model in combination with their own recording.
The HDs may have the task of processing the recorded audio and collaboratively creating the reconstruction model. They may create a small size training dataset that the hearing aids can use to calibrate their local model.
The model should typically be stable over a period of a few tenths of a second up to a few seconds and should allow low-latency inference. It may comprise or consist of a list of labelled audio filters, for example.
The following improvement or optimization strategy may be used. For initialization, the HA may generate an initial 3D model (e.g., based on microphones situated on each earpiece). For the purpose of distribution, the HD(s) may asynchronously pull the current model and fresh sample data from the HA. Updates may be performed asynchronously on each HD based on utility and/or task parameters. The HD may compare the HA sample data with buffered audio recorded locally and update the model accordingly. The HD may propose model updates (i.e., a local adjustment) to the HA. The HA may consider the update to the model and may report new ratings to the HD.
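One round of this strategy might look like the following Python sketch, with the rating fed back to the HD; the dictionary-based model and the quality metric are stand-ins assumed for this sketch only.

def hd_update_step(ha_sample: list, local_buffer: list) -> dict:
    # Helper device: compare the HA's sample data with locally buffered
    # audio and propose a model update (one illustrative gain delta).
    local_level = sum(abs(s) for s in local_buffer) / len(local_buffer)
    sample_level = sum(abs(s) for s in ha_sample) / len(ha_sample)
    return {"gain_delta": sample_level - local_level}

def ha_consider_update(model: dict, update: dict, sample: list) -> float:
    # Hearing aid: apply the proposed update only if it improves a local
    # quality metric, then report the improvement back as a rating.
    def quality(m):  # stand-in metric: negated residual energy after gain
        return -sum(((1.0 - m["gain"]) * s) ** 2 for s in sample)
    candidate = dict(model, gain=model["gain"] + update["gain_delta"])
    rating = quality(candidate) - quality(model)
    if rating > 0:
        model.update(candidate)
    return rating  # reported to the HD as feedback

model = {"gain": 0.0}
update = hd_update_step(ha_sample=[0.3, 0.2], local_buffer=[0.1, 0.1])
rating = ha_consider_update(model, update, sample=[0.3, 0.2])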
Alternatively, the hearing aids may be helped by an anonymous crowd. In this case, the previously described implementation example may be extended with additional privacy measures. In the following, the difference to the previously described example is described.
In this case, when providing the model (updates) and sample dataset, the devices may have the task of protecting or guaranteeing the anonymity of the subjects in range of the microphone. In order to protect or guarantee sample anonymity, embeddings can be used that suppress speech and randomize the implicit location. The speech suppression filter may be common to all collaborating devices. It can be a pre-trained static filter but can also be collaboratively learned using decentralized adversarial learning, each device using the raw locally recorded audio as training set. The removal of the location embedded in the audio signal may be equivalent to rescaling the signal components to simulate a "displacement" of the microphone (note that this transformation can correspond to impossible positions without complication). A random location may be selected initially and preserved throughout the learning. The transformation of the speech-free signal to the fully anonymized signal may be stored locally by each device.
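The rescaling step could be sketched as follows in Python, assuming per-source signal components whose amplitude falls off roughly as the inverse of the distance; the names and the 1/distance model are assumptions of this sketch.

import random

def randomize_location(components: list, true_dists: list,
                       fake_dists: list = None):
    # Amplitude falls off roughly as 1/distance, so rescaling each labelled
    # component by true_dist/fake_dist simulates "displacing" the microphone
    # to a virtual position, which is chosen randomly once and then reused.
    if fake_dists is None:
        fake_dists = [random.uniform(1.0, 30.0) for _ in components]
    rescaled = [[(d_true / d_fake) * s for s in comp]
                for comp, d_true, d_fake in zip(components, true_dists, fake_dists)]
    return rescaled, fake_dists

components = [[0.5, 0.4], [0.1, 0.2]]  # per-source signal components
anonymized, virtual_dists = randomize_location(components, true_dists=[2.0, 10.0])
# 'virtual_dists' is stored locally and reused so that the simulated
# location stays consistent throughout the learning.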
The model being used may be (made) location-agnostic to avoid localization of the HA and to allow multiple HAs to participate. Filters may be defined on anonymized (speech-free, location-free) embeddings.
Incentives may be orchestrated using trusted services. Alternatively, a trustless approach can be adopted, for instance a blockchain-based system. Due to the computationally intensive aspect of such a protocol, it might not be used during the contribution. Information may be gathered locally, and the reward may be computed afterwards based on aggregated contribution metrics. This means devices may still need to be trusted to compute those metrics accurately.
With respect to security and communications, corrupted participants may be guarded against using anomaly detection and/or cryptographic measures. For communications, standard networking techniques may be used, assuming the shared data is fully anonymized (as shown in connection with the communication components outlined above). The same improvement or optimization strategy may be used as in the case where the hearing aids are helped by family and friends' phones.
More details and aspects of the concept for a privacy-preserving, crowdsourced decomposition of soundscape are mentioned in connection with the proposed concept, or one or more examples described above or below (e.g.,
In the following, some examples of the proposed concept are presented:
The aspects and features described in relation to a particular one of the previous examples may also be combined with one or more of the further examples to replace an identical or similar feature of that further example or to additionally introduce the features into the further example.
Examples may further be or relate to a (computer) program including a program code to execute one or more of the above methods when the program is executed on a computer, processor, or other programmable hardware component. Thus, steps, operations, or processes of different ones of the methods described above may also be executed by programmed computers, processors, or other programmable hardware components. Examples may also cover program storage devices, such as digital data storage media, which are machine-, processor- or computer-readable and encode and/or contain machine-executable, processor-executable or computer-executable programs and instructions. Program storage devices may include or be digital storage devices, magnetic storage media such as magnetic disks and magnetic tapes, hard disk drives, or optically readable digital data storage media, for example. Other examples may also include computers, processors, control units, (field) programmable logic arrays ((F)PLAs), (field) programmable gate arrays ((F)PGAs), graphics processor units (GPU), application-specific integrated circuits (ASICs), integrated circuits (ICs) or system-on-a-chip (SoCs) systems programmed to execute the steps of the methods described above.
Various examples of the present disclosure are based on using a machine-learning model or machine-learning algorithm. Machine learning refers to algorithms and statistical models that computer systems may use to perform a specific task without using explicit instructions, instead relying on models and inference. For example, in machine learning, instead of a rule-based transformation of data, a transformation of data may be used that is inferred from an analysis of historical and/or training data. For example, the content of images may be analyzed using a machine-learning model or using a machine-learning algorithm. In order for the machine-learning model to analyze the content of an image, the machine-learning model may be trained using training images as input and training content information as output. By training the machine-learning model with a large number of training images and associated training content information, the machine-learning model "learns" to recognize the content of the images, so that the content of images that are not included in the training images can be recognized using the machine-learning model. The same principle may be used for other kinds of sensor data as well: by training a machine-learning model using training sensor data and a desired output, the machine-learning model "learns" a transformation between the sensor data and the output, which can be used to provide an output based on non-training sensor data provided to the machine-learning model.
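For illustration, this train-then-infer pattern may be sketched as follows in Python, using scikit-learn and synthetic data as hypothetical stand-ins for training sensor data and non-training sensor data:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for training sensor data with desired outputs.
X_train = rng.normal(size=(200, 4))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

# The model "learns" the transformation from inputs to outputs...
model = LogisticRegression().fit(X_train, y_train)

# ...and can then provide outputs for sensor data it has never seen.
X_new = rng.normal(size=(5, 4))
print(model.predict(X_new))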
Machine-learning models are trained using training input data. The examples specified above use a training method called "supervised learning". In supervised learning, the machine-learning model is trained using a plurality of training samples, wherein each training sample may comprise a plurality of input data values and one or more desired output values, i.e., each training sample is associated with a desired output. By specifying both training samples and desired output values, the machine-learning model "learns" which output value to provide based on an input sample that is similar to the samples provided during the training. Apart from supervised learning, semi-supervised learning may be used. In semi-supervised learning, some of the training samples lack a corresponding desired output value. Supervised learning may be based on a supervised learning algorithm, e.g., a classification algorithm, a regression algorithm, or a similarity learning algorithm. Classification algorithms may be used when the outputs are restricted to a limited set of values, i.e., the input is classified as one of the limited set of values. Regression algorithms may be used when the outputs may have any numerical value (within a range). Similarity learning algorithms are similar to both classification and regression algorithms but are based on learning from examples using a similarity function that measures how similar or related two objects are.
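The distinction between classification and regression may be illustrated by the following hypothetical sketch, again using scikit-learn on synthetic data:

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 1))

# Classification: outputs are restricted to a limited set of values.
y_class = (X[:, 0] > 0).astype(int)     # labels in {0, 1}
clf = LogisticRegression().fit(X, y_class)

# Regression: outputs may take any numerical value (within a range).
y_reg = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)
reg = LinearRegression().fit(X, y_reg)

print(clf.predict([[0.7]]), reg.predict([[0.7]]))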
Apart from supervised or semi-supervised learning, unsupervised learning may be used to train the machine-learning model. In unsupervised learning, (only) input data might be supplied, and an unsupervised learning algorithm may be used to find structure in the input data, e.g., by grouping or clustering the input data or by finding commonalities in the data. Clustering is the assignment of input data comprising a plurality of input values into subsets (clusters) so that input values within the same cluster are similar according to one or more (pre-defined) similarity criteria, while being dissimilar to input values included in other clusters.
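For illustration, clustering of unlabeled input data may be sketched as follows, using k-means as one of many possible clustering algorithms:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)

# Unlabeled input data: two groups, no desired outputs supplied.
X = np.vstack([rng.normal(loc=-2.0, size=(50, 2)),
               rng.normal(loc=+2.0, size=(50, 2))])

# Clustering assigns the inputs to subsets (clusters) so that inputs
# within the same cluster are similar (Euclidean distance here).
kmeans = KMeans(n_clusters=2, n_init=10).fit(X)
print(kmeans.labels_[:5], kmeans.cluster_centers_)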
Reinforcement learning is a third group of machine-learning algorithms. In other words, reinforcement learning may be used to train the machine-learning model. In reinforcement learning, one or more software actors (called "software agents") are trained to take actions in an environment. Based on the actions taken, a reward is calculated. Reinforcement learning is based on training the one or more software agents to choose their actions such that the cumulative reward is increased, leading to software agents that become better at the task they are given (as evidenced by increasing rewards).
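A minimal, purely illustrative reinforcement-learning sketch is given below: a software agent in a toy five-state environment learns, via tabular Q-learning, to choose actions that increase the cumulative reward (the environment, reward, and hyperparameters are hypothetical):

import numpy as np

rng = np.random.default_rng(3)

# Toy corridor of 5 states; the agent moves left (0) or right (1) and
# receives a reward for reaching the rightmost state.
n_states, n_actions = 5, 2
q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2

for _ in range(2000):
    state = 0
    while state != n_states - 1:
        # Epsilon-greedy choice: mostly exploit, sometimes explore.
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(q[state]))
        next_state = max(0, state - 1) if action == 0 else state + 1
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # Nudge the value estimate toward reward + discounted future value.
        q[state, action] += alpha * (reward + gamma * np.max(q[next_state])
                                     - q[state, action])
        state = next_state

print(np.argmax(q, axis=1))  # learned policy: move right in every state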
Machine-learning algorithms are usually based on a machine-learning model. In other words, the term “machine-learning algorithm” may denote a set of instructions that may be used to create, train, or use a machine-learning model. The term “machine-learning model” may denote a data structure and/or set of rules that represents the learned knowledge, e.g., based on the training performed by the machine-learning algorithm. In embodiments, the usage of a machine-learning algorithm may imply the usage of an underlying machine-learning model (or of a plurality of underlying machine-learning models). The usage of a machine-learning model may imply that the machine-learning model and/or the data structure/set of rules that is the machine-learning model is trained by a machine-learning algorithm.
For example, the machine-learning model may be an artificial neural network (ANN). ANNs are systems that are inspired by biological neural networks, such as can be found in a brain. ANNs comprise a plurality of interconnected nodes and a plurality of connections, so-called edges, between the nodes. There are usually three types of nodes: input nodes that receive input values, hidden nodes that are (only) connected to other nodes, and output nodes that provide output values. Each node may represent an artificial neuron. Each edge may transmit information from one node to another. The output of a node may be defined as a (non-linear) function of the sum of its inputs. The inputs of a node may be used in the function based on a "weight" of the edge or of the node that provides the input. The weights of nodes and/or edges may be adjusted in the learning process. In other words, the training of an artificial neural network may comprise adjusting the weights of the nodes and/or edges of the artificial neural network, i.e., to achieve a desired output for a given input. In at least some embodiments, the machine-learning model may be a deep neural network, e.g., a neural network comprising one or more layers of hidden nodes (i.e., hidden layers), preferably a plurality of layers of hidden nodes.
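For illustration, the forward pass of a small ANN may be sketched as follows; the layer sizes and the ReLU non-linearity are arbitrary choices for this example:

import numpy as np

rng = np.random.default_rng(4)

# Weights of the edges between layers; training would adjust these.
w_hidden = rng.normal(size=(3, 4))   # 3 input nodes -> 4 hidden nodes
w_output = rng.normal(size=(4, 2))   # 4 hidden nodes -> 2 output nodes

def forward(inputs):
    # Each node's output is a (non-linear) function of the weighted sum
    # of its inputs.
    hidden = np.maximum(inputs @ w_hidden, 0.0)   # ReLU
    return hidden @ w_output

print(forward(np.array([0.5, -1.0, 2.0])))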
Alternatively, the machine-learning model may be a support vector machine. Support vector machines (i.e., support vector networks) are supervised learning models with associated learning algorithms that may be used to analyze data, e.g., in classification or regression analysis. Support vector machines may be trained by providing an input with a plurality of training input values that belong to one of two categories. The support vector machine may be trained to assign a new input value to one of the two categories. Alternatively, the machine-learning model may be a Bayesian network, which is a probabilistic directed acyclic graphical model. A Bayesian network may represent a set of random variables and their conditional dependencies using a directed acyclic graph. Alternatively, the machine-learning model may be based on a genetic algorithm, which is a search algorithm and heuristic technique that mimics the process of natural selection.
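For illustration, training a support vector machine on inputs belonging to one of two categories and assigning new inputs to a category may be sketched as follows (synthetic data, scikit-learn):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)

# Training input values belonging to one of two categories.
X = np.vstack([rng.normal(loc=-1.0, size=(40, 2)),
               rng.normal(loc=+1.0, size=(40, 2))])
y = np.array([0] * 40 + [1] * 40)

# The SVM learns a separating boundary and assigns new inputs to a category.
svm = SVC(kernel="linear").fit(X, y)
print(svm.predict([[0.8, 0.9], [-0.7, -1.2]]))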
It is further understood that the disclosure of several steps, processes, operations, or functions disclosed in the description or claims shall not be construed to imply that these operations are necessarily dependent on the order described, unless explicitly stated in the individual case or necessary for technical reasons. Therefore, the previous description does not limit the execution of several steps or functions to a certain order. Furthermore, in further examples, a single step, function, process, or operation may include and/or be broken up into several sub-steps, -functions, -processes or -operations.
If some aspects have been described in relation to a device or system, these aspects should also be understood as a description of the corresponding method. For example, a block, device or functional aspect of the device or system may correspond to a feature, such as a method step, of the corresponding method. Accordingly, aspects described in relation to a method shall also be understood as a description of a corresponding block, a corresponding element, a property or a functional feature of a corresponding device or a corresponding system.
The following claims are hereby incorporated in the detailed description, wherein each claim may stand on its own as a separate example. It should also be noted that although in the claims a dependent claim refers to a particular combination with one or more other claims, other examples may also include a combination of the dependent claim with the subject matter of any other dependent or independent claim. Such combinations are hereby explicitly proposed, unless it is stated in the individual case that a particular combination is not intended. Furthermore, features of a claim should also be included for any other independent claim, even if that claim is not directly defined as dependent on that other independent claim.
Number | Date | Country | Kind
---|---|---|---
22162192.3 | Mar 2022 | EP | regional