This application claims priority to United Kingdom Patent Application No. 2317849.4, filed Nov. 21, 2023, which is incorporated herein by reference in its entirety.
Example embodiments relate to processing of respective audio signals captured by a plurality of microphones having different locations on a mobile terminal.
Some mobile terminals comprise a plurality of microphones having different locations on the mobile terminal. For example, a first microphone may be located on a rear surface of the mobile terminal, a second microphone may be located on a front surface of the mobile terminal and a third microphone may be located on an edge surface of the mobile terminal, between the rear and front surfaces. The microphones may respond to changes in air pressure caused by sound waves but may also respond to movement of the mobile terminal which causes a change in air pressure. Hence, audio signals captured by multiple microphones may comprise a mixture of wanted audio and unwanted noise due to the movement of the mobile terminal.
The scope of protection sought for various embodiments of the invention is set out by the independent claims. The embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various embodiments of the invention.
According to a first aspect, there is described an apparatus, comprising: means for receiving respective audio signals, A1-AM, captured by a plurality of microphones, M1-MM, having different locations on a mobile terminal; means for estimating motion of the mobile terminal when capturing the respective audio signals, A1-AM; means for computing respective exposure parameters, η1-ηM, for the respective audio signals, A1-AM, the exposure parameter for a particular audio signal being based on the estimated motion of the mobile terminal and the location of the microphone from which the particular audio signal is captured; means for generating respective spectrograms, S1-SM, for the respective audio signals, A1-AM; means for identifying a plurality of common time and frequency range segments across the respective spectrograms, S1-SM; means for computing dissimilarity values, δ1-δK, for the common segments of the respective spectrograms, S1-SM, based on, for audio characteristics within a particular segment of a particular spectrogram, how similar those audio characteristics are to audio characteristics within the same particular segment of the other spectrograms; and means for, based on the computed exposure parameters, η1-ηM, and the computed dissimilarity values, δ1-δK, selecting, for each particular common segment, which of the audio characteristics within that particular common segment are to be used to generate an audio output.
In some example embodiments, the means for identifying the plurality of common segments may be configured to: determine an ordered list of the respective spectrograms, S1-SM, based on the respective exposure parameters, η1-ηM, computed for the respective audio signals, A1-AM, from which the respective spectrograms are generated; segment the first and second spectrograms in the ordered list to identify respective first and second sets of segments; and combine the respective first and second sets of segments to identify a first plurality of common segments.
In some example embodiments, the means for computing the dissimilarity values, δ1-δK, may be configured, responsive to identifying the first plurality of common segments: to compute the dissimilarity values, δ1-δK, for the first plurality of common segments of the first and second spectrograms.
In some example embodiments, the means for identifying the plurality of common segments may be further configured to: for a next spectrogram in the ordered list: segment the next spectrogram to identify a further set of segments; and combine the further set of segments with the first plurality of common segments to identify an updated plurality of common segments.
In some example embodiments, the means for computing the dissimilarity values, δ1-δK, may be configured, responsive to identifying the updated plurality of common segments: to compute the dissimilarity values, δ1-δK, for the updated plurality of common segments of the first, second and next spectrograms.
In some example embodiments, the respective exposure parameters, η1-ηM, may be indicative of the exposure to airflow of the respective microphones, M1-MM, from which the respective audio signals, A1-AM, are captured due to the motion of the mobile terminal, and the ordered list is in the order of the most exposed microphone to the least exposed microphone.
In some example embodiments, the means for selecting may comprise a learned model configured: to receive as a set of input data: the computed exposure parameters, η1-ηM, for the respective audio signals, A1-AM, the respective spectrograms, S1-SM, and the computed dissimilarity values, δ1-δK, for the common segments of the respective spectrograms; to select, for each common segment, and based on the set of input data, which audio characteristics in that common segment of the respective spectrograms are to be used to generate the audio output; and to provide as output data the selected audio characteristics for the common segments.
In some example embodiments, the learned model may be trained by: providing ground truth data corresponding to respective audio signals, A1-AM, received by the plurality of microphones, M1-MM, when a same type of mobile terminal is stationary; computing reference exposure parameters, η1-ηM, and dissimilarity values, δ1-δK, for common segments of respective spectrograms generated when said same type of mobile terminal is in motion; using an initial model to provide output data representing selected audio characteristics for each common segment based on the reference exposure parameters, η1-ηM, and dissimilarity values, δ1-δK; comparing the output data with the ground truth data to determine error data; and updating the initial model based on the error data.
In some example embodiments, the selected audio characteristics for each common segment may be provided in an output spectrogram, S*, and the apparatus may further comprise means for converting the output spectrogram, S*, to an audio output.
In some example embodiments, the means for estimating motion of the mobile terminal may be configured to receive one or more motion parameters indicative of at least a direction of motion of the mobile terminal, and the exposure parameter, η1-ηM, for the particular audio signal, A1-AM, may be computed based on the location of the microphone, M1-MM, from which the particular audio signal is captured in relation to the direction of the motion.
In some example embodiments, the means for estimating motion of the mobile terminal may be configured to receive further motion parameters indicative of a velocity of the motion and an orientation of the mobile terminal, and the exposure parameter, η1-ηM, for the particular audio signal, A1-AM, may be computed further based on the velocity of the motion and the orientation of the mobile terminal.
In some example embodiments, the one or more motion parameters may be received from an inertial measurement unit, IMU, of the mobile terminal.
In some example embodiments, the respective audio signals, A1-AM, may be captured within a particular time window; the respective exposure parameters, η1-ηM, and dissimilarity values, δ1-δK, for the common segments of the respective spectrograms, S1-SM, may be updated a plurality of times within the particular time window; and the computed audio output may be updated within the particular time window based on the updated respective exposure parameters, η1-ηM, and dissimilarity values, δ1-δK.
According to a second aspect, there is described a method, comprising: receiving respective audio signals, A1-AM, captured by a plurality of microphones, M1-MM, having different locations on a mobile terminal; estimating motion of the mobile terminal when capturing the respective audio signals, A1-AM; computing respective exposure parameters, η1-ηM, for the respective audio signals, A1-AM, the exposure parameter for a particular audio signal being based on the estimated motion of the mobile terminal and the location of the microphone from which the particular audio signal is captured; generating respective spectrograms, S1-SM, for the respective audio signals, A1-AM; identifying a plurality of common time and frequency range segments across the respective spectrograms, S1-SM; computing dissimilarity values, δ1-δK, for the common segments of the respective spectrograms, S1-SM, based on, for audio characteristics within a particular segment of a particular spectrogram, how similar those audio characteristics are to audio characteristics within the same particular segment of the other spectrograms; and based on the computed exposure parameters, η1-ηM, and the computed dissimilarity values, δ1-δK, selecting, for each particular common segment, which of the audio characteristics within that particular common segment are to be used to generate an audio output.
In some example embodiments, identifying the plurality of common segments may comprise: determining an ordered list of the respective spectrograms, S1-SM, based on the respective exposure parameters, η1-ηM, computed for the respective audio signals, A1-AM, from which the respective spectrograms are generated; segmenting the first and second spectrograms in the ordered list to identify respective first and second sets of segments; and combining the respective first and second sets of segments to identify a first plurality of common segments.
In some example embodiments, computing the dissimilarity values, δ1-δK, may comprise, responsive to identifying the first plurality of common segments: computing the dissimilarity values, δ1-δK, for the first plurality of common segments of the first and second spectrograms.
In some example embodiments, identifying the plurality of common segments may further comprise: for a next spectrogram in the ordered list: segmenting the next spectrogram to identify a further set of segments; and combining the further set of segments with the first plurality of common segments to identify an updated plurality of common segments.
In some example embodiments, computing the dissimilarity values, δ1-δK, may comprise, responsive to identifying the updated plurality of common segments: computing the dissimilarity values, δ1-δK, for the updated plurality of common segments of the first, second and next spectrograms.
In some example embodiments, the respective exposure parameters, η1-ηM, may be indicative of the exposure to airflow of the respective microphones, M1-MM, from which the respective audio signals, A1-AM, are captured due to the motion of the mobile terminal, and the ordered list may be in the order of the most exposed microphone to the least exposed microphone.
In some example embodiments, the selecting may comprise use of a learned model for: receiving as a set of input data: the computed exposure parameters, η1-ηM, for the respective audio signals, A1-AM, the respective spectrograms, S1-SM, and the computed dissimilarity values, δ1-δK, for the common segments of the respective spectrograms; selecting, for each common segment, and based on the set of input data, which audio characteristics in that common segment of the respective spectrograms are to be used to generate the audio output; and providing as output data the selected audio characteristics for the common segments.
In some example embodiments, the learned model may be trained by: providing ground truth data corresponding to respective audio signals, A1-AM, received by the plurality of microphones, M1-MM, when a same type of mobile terminal is stationary; computing reference exposure parameters, η1-ηM, and dissimilarity values, δ1-δK, for common segments of respective spectrograms generated when said same type of mobile terminal is in motion; using an initial model to provide output data representing selected audio characteristics for each common segment based on the reference exposure parameters, η1-ηM, and dissimilarity values, δ1-δK; comparing the output data with the ground truth data to determine error data; and updating the initial model based on the error data.
In some example embodiments, the selected audio characteristics for each common segment may be provided in an output spectrogram, S*, and the method may further comprise converting the output spectrogram, S*, to an audio output.
In some example embodiments, estimating motion of the mobile terminal may comprise receiving one or more motion parameters indicative of at least a direction of motion of the mobile terminal, and the exposure parameter, η1-ηM, for the particular audio signal, A1-AM, may be computed based on the location of the microphone, M1-MM, from which the particular audio signal is captured in relation to the direction of the motion.
In some example embodiments, estimating motion of the mobile terminal may comprise receiving further motion parameters indicative of a velocity of the motion and an orientation of the mobile terminal, and the exposure parameter, η1-ηM, for the particular audio signal, A1-AM, may be computed further based on the velocity of the motion and the orientation of the mobile terminal.
In some example embodiments, the one or more motion parameters may be received from an inertial measurement unit, IMU, of the mobile terminal.
In some example embodiments, the respective audio signals, A1-AM, may be captured within a particular time window; the respective exposure parameters, η1-ηM, and dissimilarity values, δ1-δK, for the common segments of the respective spectrograms, S1-SM, may be updated a plurality of times within the particular time window; and the computed audio output may be updated within the particular time window based on the updated respective exposure parameters, η1-ηM, and dissimilarity values, δ1-δK.
According to a third aspect, there is described a computer program product, comprising a set of instructions which, when executed on an apparatus, cause the apparatus to carry out a method, comprising: receiving respective audio signals, A1-AM, captured by a plurality of microphones, M1-MM, having different locations on a mobile terminal; estimating motion of the mobile terminal when capturing the respective audio signals, A1-AM; computing respective exposure parameters, η1-ηM, for the respective audio signals, A1-AM, the exposure parameter for a particular audio signal being based on the estimated motion of the mobile terminal and the location of the microphone from which the particular audio signal is captured; generating respective spectrograms, S1-SM, for the respective audio signals, A1-AM; identifying a plurality of common time and frequency range segments across the respective spectrograms, S1-SM; computing dissimilarity values, δ1-δK, for the common segments of the respective spectrograms, S1-SM, based on, for audio characteristics within a particular segment of a particular spectrogram, how similar those audio characteristics are to audio characteristics within the same particular segment of the other spectrograms; and based on the computed exposure parameters, η1-ηM, and the computed dissimilarity values, δ1-δK, selecting, for each particular common segment, which of the audio characteristics within that particular common segment are to be used to generate an audio output.
The third aspect may include any other feature mentioned with respect to the method of the second aspect.
According to a fourth aspect, there is described a non-transitory computer readable medium comprising program instructions stored thereon to cause an apparatus to carry out a method, comprising: receiving respective audio signals, A1-AM, captured by a plurality of microphones, M1-MM, having different locations on a mobile terminal; estimating motion of the mobile terminal when capturing the respective audio signals, A1-AM; computing respective exposure parameters, η1-ηM, for the respective audio signals, A1-AM, the exposure parameter for a particular audio signal being based on the estimated motion of the mobile terminal and the location of the microphone from which the particular audio signal is captured; generating respective spectrograms, S1-SM, for the respective audio signals, A1-AM; identifying a plurality of common time and frequency range segments across the respective spectrograms, S1-SM; computing dissimilarity values, δ1-δK, for the common segments of the respective spectrograms, S1-SM, based on, for audio characteristics within a particular segment of a particular spectrogram, how similar those audio characteristics are to audio characteristics within the same particular segment of the other spectrograms; and based on the computed exposure parameters, η1-ηM, and the computed dissimilarity values, δ1-δK, selecting, for each particular common segment, which of the audio characteristics within that particular common segment are to be used to generate an audio output.
The fourth aspect may include any other feature mentioned with respect to the method of the second aspect.
According to a fifth aspect, there is described an apparatus comprising at least one processing core, at least one memory including computer program code, the at least one memory and the computer program code being configured to, with the at least one processing core, cause the apparatus to: receive respective audio signals, A1-AM, captured by a plurality of microphones, M1-MM, having different locations on a mobile terminal; estimate motion of the mobile terminal when capturing the respective audio signals, A1-AM; compute respective exposure parameters, η1-ηM, for the respective audio signals, A1-AM, the exposure parameter for a particular audio signal being based on the estimated motion of the mobile terminal and the location of the microphone from which the particular audio signal is captured; generate respective spectrograms, S1-SM, for the respective audio signals, A1-AM; identify a plurality of common time and frequency range segments across the respective spectrograms, S1-SM; compute dissimilarity values, δ1-δK, for the common segments of the respective spectrograms, S1-SM, based on, for audio characteristics within a particular segment of a particular spectrogram, how similar those audio characteristics are to audio characteristics within the same particular segment of the other spectrograms; and based on the computed exposure parameters, η1-ηM, and the computed dissimilarity values, δ1-δK, select, for each particular common segment, which of the audio characteristics within that particular common segment are to be used to generate an audio output.
The fifth aspect may include any other feature mentioned with respect to the method of the second aspect.
Example embodiments will be described, by way of non-limiting example, with reference to the accompanying drawings.
Example embodiments relate to processing of respective audio signals captured by a plurality of microphones having different locations on a mobile terminal.
The mobile terminal 100 may comprise one or more processors 102, one or more memories (not shown), a display screen 104, one or more loudspeakers 106, 108, one or more cameras 110 and a plurality of microphones 112, 114, 116. The one or more memories may store software, comprising computer-readable instructions which, when executed by the one or more processors, may perform one or more operations according to one or more example embodiments.
In some example embodiments, there may be two microphones, or more than three microphones, each of which has a respective known location on the mobile terminal 100.
The first, second and third microphones 112, 114, 116 may comprise micro-electromechanical system (MEMS) microphones.
The mobile terminal 100 may also comprise a means for estimating its own motion during use.
For example, the mobile terminal 100 may comprise one or more inertial measurement units (IMU) 120. As will be known, the IMU 120 may comprise one or more of an accelerometer, gyroscope and magnetometer for measuring, respectively, linear acceleration, rotational rate and a heading reference.
In some embodiments, there may be provided one or more IMU 120 for measuring linear acceleration, rotational rate and a heading reference for each principal axis, being pitch, roll and yaw.
Each of the accelerometer, gyroscope and magnetometer may generate motion data or motion parameters which individually, or usually collectively, may be processed using known software to generate a geometrical model of the mobile terminal's position in terms of direction, velocity and orientation. Taken over time, the software may therefore generate a geometrical model of the mobile terminal's movement in space.
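As a minimal, hypothetical sketch of this idea (the function and variable names below are illustrative and not part of any embodiment), accelerometer samples might be naively integrated over time to obtain a direction and velocity of movement; a practical implementation would also fuse gyroscope and magnetometer data and correct for gravity and drift:

```python
import numpy as np

def estimate_motion(accel_samples, dt):
    """Naively integrate accelerometer samples (shape: N x 3, in m/s^2),
    taken every dt seconds, to estimate a velocity and direction of motion.
    Illustrative only; real IMU fusion also uses gyroscope/magnetometer data
    and corrects for gravity and drift."""
    velocity = np.cumsum(accel_samples * dt, axis=0)   # running velocity per axis
    v_final = velocity[-1]                             # velocity at the end of the window
    speed = float(np.linalg.norm(v_final))
    direction = v_final / speed if speed > 0 else np.zeros(3)
    return direction, speed
```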
In use, the mobile terminal 100 may be enabled to capture sound waves, such as, but not limited to, speech or even physiological parameters, e.g., heart rate, over a period of time.
The sound waves cause a change in air pressure which may be captured by the first, second and third microphones 112, 114, 116 at their respective locations. However, movement of the mobile terminal 100 will also cause a change in air pressure that may be captured by one or more of the first, second and third microphones 112, 114, 116. The wanted sounds, e.g., the speech or heart rate, may be termed a main characteristic and the unwanted “noise” due to motion of the mobile terminal 100 may be termed a motion characteristic.
The presence of main and motion characteristics may distort the captured audio in an undesirable way. For example, it may be desirable to use the main characteristic in one or more subsequent processing operations, which operations may not work correctly due to the presence of the captured motion characteristic(s) which is or are unpredictable. Example subsequent processing operations may comprise voice recognition operations and/or machine learning (ML) training and/or inference operations.
The spectrogram 200 shows an example in which the main characteristic and the motion characteristics are relatively distinct.
Often, the main and motion characteristics in a spectrogram are not so distinct and may blend together.
A segment, as used herein, is a sub-portion of a spectrogram. A segmented spectrogram may comprise two or more segments. Enclosed within a particular segment is a sub-set of the time and frequency characteristics of the captured audio, i.e., audio characteristics. Spectrograms may be segmented using one of a plurality of different algorithms. A segment may be configured to have any suitable shape, e.g., a square, a rectangle, a polygon, or a circle, and may have any suitable size.
For example, the respective audio signals may have a 44100 Hz sampling rate, which is about fifty times greater than the rate of motion data generated by the IMU 120, assuming the use of a 9-axis IMU 120 running at 100 Hz and therefore producing 900 values per second. The frequency spectrum of the first, second and third microphones 112, 114, 116 may therefore be five hundred times larger than that of the IMU 120, making direct comparison between the audio signals and the motion data difficult and/or non-meaningful. For example, it cannot be reliably stated that, if a 50 Hz component is observed in motion data from the IMU 120 when capturing audio signals by the first, second and third microphones 112, 114, 116, then 50 Hz frequencies in the respective audio signals can be filtered out to remove noise due to movement of the mobile terminal 100. There is no correct and straightforward rule because mobile terminal movements may cause complex, non-linear effects due to air pressure changes around the first, second and third microphones 112, 114, 116.
Example embodiments may relate to processing of respective audio signals captured by a plurality of microphones of a mobile terminal, which processing takes into account the respective locations of the plurality of microphones. Example embodiments may process respective spectrograms representing the respective audio signals such as to mitigate noise effects due to movement.
A first operation 601 may comprise receiving respective audio signals, A1-AM, captured by a plurality of microphones, M1-MM, having different locations on a mobile terminal.
A second operation 602 may comprise estimating motion of the mobile terminal when capturing the respective audio signals, A1-AM.
A third operation 603 may comprise computing respective exposure parameters, η1-ηM, for the respective audio signals, A1-AM. The exposure parameter η for a particular audio signal A may be based on the estimated motion of the mobile terminal and the location of the microphone from which the particular audio signal is captured.
A fourth operation 604 may comprise generating respective spectrograms, S1-SM, for the respective audio signals, A1-AM.
A fifth operation 605 may comprise identifying a plurality of common time and frequency range segments across the respective spectrograms, S1-SM.
A sixth operation 606 may comprise computing dissimilarity values, δ1-δK, for the common segments of the respective spectrograms, S1-SM. The dissimilarity values, δ1-δK, may be based on, for audio characteristics within a particular common segment of a particular spectrogram, how similar those audio characteristics are to audio characteristics within the same particular common segment of the other spectrograms.
A seventh operation 607 may comprise selecting, for each common segment, which of the audio characteristics within that common segment are to be used to generate an audio output. The selecting may be based on the computed exposure parameters, η1-ηM, and the computed dissimilarity values, δ1-δK.
The system 700 may comprise the first, second and third microphones 112, 114, 116 having their known respective locations on the mobile terminal 100. The system 700 may also comprise the one or more IMUs 120. The system 700 may also comprise a kinematics module 702, a segmentation module 704, a purifier module 706 and a conversion module 708. Although shown as separate modules, at least some of the modules may be combined into one module.
According to the first operation 601, respective first, second and third audio signals A1-A3 captured by the first, second and third microphones 112, 114, 116 may be provided to the segmentation module 704. The respective first, second and third audio signals A1-A3 may correspond to a particular time window.
According to the second operation 602, the IMU 120 may provide motion parameters to the kinematics module 702.
The kinematics module 702 may be configured, based on the motion parameters mentioned above, to determine or model one or more movements of the mobile terminal 100, and to compute respective exposure parameters η1-η3 during said particular time window.
The respective exposure parameters η1-η3 may reflect (by any suitable method) the degree to which each of the first, second and third microphones 112, 114, 116 is exposed to the one or more movements of the mobile terminal 100 and therefore the relative impact of the movements on the respective first, second and third audio signals A1-A3. For example, the value of η may be a value between [0, 1] where 0 represents no exposure and 1 represents maximum exposure.
For the sake of simplicity, it may be assumed that the mobile terminal 100 moves in a particular direction during the particular time window.
It may be assumed that the first and third microphones 112, 116 are less exposed to the movement due to their respective locations, whereas the second microphone 114 is the most exposed due to its location on the left-hand side of the mobile terminal 100. Example values for η may therefore be computed, with η2 being the largest.
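As an illustration of how such values might be derived (a minimal sketch, assuming hypothetical microphone positions expressed as outward-facing unit vectors and an arbitrary speed normalisation; none of the geometry or numbers below come from the description above):

```python
import numpy as np

# Hypothetical outward-facing unit vectors for three microphones
# (rear, left edge, front); they do not correspond to any real device.
mic_normals = {
    "M1": np.array([0.0, 0.0, -1.0]),   # rear-facing
    "M2": np.array([-1.0, 0.0, 0.0]),   # left-edge-facing
    "M3": np.array([0.0, 0.0, 1.0]),    # front-facing
}

def exposure(direction, speed, normal, v_max=5.0):
    """Exposure in [0, 1]: larger when the microphone faces the direction of
    motion (i.e. into the oncoming airflow) and when the device moves faster."""
    facing = max(0.0, float(np.dot(normal, direction)))
    return min(1.0, facing * speed / v_max)

# Example: device moving to the left at 2 m/s (hypothetical values).
direction, speed = np.array([-1.0, 0.0, 0.0]), 2.0
etas = {name: exposure(direction, speed, n) for name, n in mic_normals.items()}
print(etas)   # the left-edge microphone M2 receives the highest exposure value
```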
The respective exposure parameters η1-η3 may be provided to the segmentation module 704 and, in some embodiments, to the purifier module 706.
In accordance with the fourth operation 604, the segmentation module 704 may be configured to generate respective first, second and third spectrograms, S1-S3, for the first, second and third audio signals, A1-A3.
Known methods for generating the spectrograms, S1-S3, include dividing the audio signal, A, into equal length time segments, windowing each time segment and computing its spectrum to obtain the short-time Fourier transform.
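For instance, a magnitude spectrogram might be computed along those lines with a basic short-time Fourier transform; the sketch below is generic and not tied to any particular embodiment (frame length and hop size are arbitrary):

```python
import numpy as np

def spectrogram(audio, frame_len=1024, hop=512):
    """Divide the signal into equal-length, windowed frames and compute the
    magnitude of each frame's spectrum (assumes len(audio) >= frame_len)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(audio) - frame_len) // hop
    frames = np.stack([audio[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Rows are frequency bins, columns are time frames.
    return np.abs(np.fft.rfft(frames, axis=1)).T
```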
In accordance with the fifth operation 605, the segmentation module 704 may also be configured to identify a plurality of common segments across the respective first, second and third spectrograms, S1-S3.
Common segments comprise segments which enclose the same portion of time and frequency space across the respective first, second and third spectrograms, S1-S3. The common segments may collectively enclose all, or substantially all, portions of time and frequency space such as to leave little or no gaps.
Identifying the plurality of common segments may be performed in a number of different ways.
Example embodiments may take into account the respective exposure parameters η1-η3.
More specifically, the respective exposure parameters η1-η3 may be used to provide an ordered list of the audio signals A1-A3 or of their respective spectrograms, S1-S3.
The ordered list may be in the order of most exposed microphone to least exposed microphone, effectively defining a rank order. The respective spectrograms, S1-S3, may be processed in pairs using the rank order, and common segments may be iteratively identified and updated between pairs of spectrograms.
Based on the above example values for η, the second and first spectrograms, S2 and S1, may be processed first. The third spectrogram, S3, may be processed last.
Respective first, second and third spectrograms, S1-S3, indicated by reference numerals 901, 902, 903, are shown in the above rank order.
The second and first spectrograms 902, 901 may be individually segmented to identify, within each said spectrogram, a plurality of segments, for example based on similarity of audio characteristics within different parts of the particular spectrogram. Segmentation may comprise edge-based segmentation (e.g. identifying boundaries in the spectrogram via rapid changes in the magnitude of time and frequency characteristics or points), region-based segmentation (e.g. starting with seed locations and growing regions by adding neighbouring time and frequency points that have the same or similar magnitudes) or other forms of segmentation.
The second spectrogram 902 is first segmented such as to identify a set of segments 912 (second set of segments); the second set of segments comprises six segments. The first spectrogram 901 is segmented such as to identify a set of segments 911 (first set of segments); the first set of segments comprises five segments.
The second set of segments 912 includes an additional segment 910 compared to the first set of segments 911 which reflects, in this relatively simple example, some captured motion characteristics.
The second and first sets of segments 912, 911 may then be combined to identify a first set of common segments 1002.
The third spectrogram 903, being next in the ordered list, may then be segmented to identify a further set of segments.
The further set of segments may be combined with the first set of common segments 1002 to identify an updated, second set of common segments 1004.
The above segmentation process may be repeated in turn for further spectrograms S, if any.
In the present example, there are no further spectrograms and hence the second set of common segments 1004 is used for subsequent processing.
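One possible simplified realisation of this stage, assuming each segment is an axis-aligned rectangle in time and frequency space, is to order the spectrograms by exposure and then take the union of the segment boundaries, so that the common segments are the resulting grid cells covering the whole plane. The representation below is a hypothetical sketch rather than the segmentation actually used:

```python
import numpy as np

def order_by_exposure(etas):
    """Indices of spectrograms ordered from most to least exposed microphone."""
    return sorted(range(len(etas)), key=lambda i: etas[i], reverse=True)

def combine_rect_segments(segment_sets):
    """Each segment is (t0, t1, f0, f1). Combining sets of rectangular segments
    by taking the union of their time and frequency boundaries yields a grid of
    common segments enclosing substantially all of the time-frequency space."""
    t_edges, f_edges = set(), set()
    for segments in segment_sets:
        for (t0, t1, f0, f1) in segments:
            t_edges.update((t0, t1))
            f_edges.update((f0, f1))
    t_edges, f_edges = sorted(t_edges), sorted(f_edges)
    return [(t_edges[i], t_edges[i + 1], f_edges[j], f_edges[j + 1])
            for i in range(len(t_edges) - 1)
            for j in range(len(f_edges) - 1)]
```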
In accordance with the sixth operation 606, dissimilarity values, δA-δF, for each of the first, second and third spectrograms 901, 902, 903 may be computed for each of the common segments A-F.
Effectively, the second set of common segments 1004 is overlaid on each of the first, second and third spectrograms 901, 902, 903 and the audio characteristics in a particular common segment of one spectrogram are compared with the audio characteristics of the other two spectrograms for the same common segment. Determining a similarity/dissimilarity between the common segments of different spectrograms may involve using a correlation function (e.g. cross-correlation or similar) to measure temporal similarity. Alternatively, cosine similarity may be used to compute how similar the values of two segments are, if considered as vectors. Alternatively, dynamic time warping may be used, considering that segments include an element of time. Euclidean distance methods may also be used. The choice of metric may depend on the application requirements.
The dissimilarity value δ represents how similar/dissimilar the audio characteristics are to audio characteristics in the same common segment of all other spectrograms.
We may, for example, compare audio characteristics of common segment A of the second spectrogram 902 against audio characteristics of segment A of (i) the first spectrogram 901, and then against (ii) the third spectrogram 903, and average the results to provide the value of δ for common segment A of the second spectrogram.
The sixth operation 606 may therefore produce, for each spectrogram 901, 902, 903, a set of dissimilarity values, Δ = {δA . . . δF}.
For example, the value of δ may be a value between [0, 1] where 0 represents minimal dissimilarity and 1 represents maximum dissimilarity.
So, for common segment A, we may compare the audio characteristics enclosed within common segment A of each spectrogram against those enclosed within common segment A of the other two spectrograms.
The process may be repeated for common segments B-F.
In terms of ordering, this comparison process may be performed, initially at least, for the first set of common segments 1002 using the second and first spectrograms 902, 901, and then repeated for the second set of common segments 1004 (which may comprise zero, or only a small number of new common segments) also taking into account the third spectrogram 903.
Alternatively, the process may be performed at the end of the segmentation process using the second set of common segments 1004.
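As a sketch of this computation, under the assumption that the audio characteristics of a common segment can be flattened into a vector, the dissimilarity value for a segment of one spectrogram might be the average cosine dissimilarity against the same segment of the other spectrograms (cosine similarity being just one of the metrics mentioned above):

```python
import numpy as np

def cosine_dissimilarity(a, b, eps=1e-12):
    """1 - cosine similarity of two flattened segments; for non-negative
    magnitude spectrograms this lies in [0, 1]."""
    a, b = a.ravel(), b.ravel()
    sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    return 1.0 - sim

def segment_dissimilarities(patches):
    """patches[m][k] is the part of spectrogram m enclosed by common segment k.
    Returns delta[m, k]: the average dissimilarity of that patch against the
    same common segment of every other spectrogram."""
    M, K = len(patches), len(patches[0])
    delta = np.zeros((M, K))
    for m in range(M):
        for k in range(K):
            others = [cosine_dissimilarity(patches[m][k], patches[o][k])
                      for o in range(M) if o != m]
            delta[m, k] = float(np.mean(others))
    return delta
```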
The segmentation module 704 may then provide the respective first, second and third spectrograms 901, 902, 903, together with respective sets of dissimilarity values Δ1, Δ2, Δ3, to the purifier module 706.
As mentioned above, each set Δ1, Δ2, Δ3 comprises values of {δA, δB . . . δF} indicative of how similar or dissimilar audio characteristics within the common segments A-F are to those in the same common segments of the other spectrograms.
In accordance with the seventh operation 607, the purifier module 706 may be configured to select, for each common segment, which audio characteristics of the first, second and third spectrograms 901, 902, 903 within that common segment are to be used to generate the audio output.
In one example embodiment, the purifier module 706 may select, for each particular common segment, A-F, the audio characteristics most similar to those in the same segment of the other two spectrograms, i.e., those which have the lowest value of δ. Those selected audio characteristics can be merged (stitched together) across the respective common segments, A-F, to provide an output spectrogram S*.
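A minimal sketch of this heuristic, building on the hypothetical helpers above (rectangular common segments expressed as integer index ranges and a delta matrix indexed by spectrogram and segment):

```python
import numpy as np

def purify(spectrograms, common_segments, delta):
    """For each common segment, pick the spectrogram with the lowest
    dissimilarity value and copy its audio characteristics into the output
    spectrogram S*. Segments are (t0, t1, f0, f1) index ranges; spectrograms
    are (freq bins, time frames) arrays of equal shape."""
    s_star = np.zeros_like(spectrograms[0])
    for k, (t0, t1, f0, f1) in enumerate(common_segments):
        best = int(np.argmin(delta[:, k]))   # most similar to the others
        s_star[f0:f1, t0:t1] = spectrograms[best][f0:f1, t0:t1]
    return s_star
```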
Reference numeral 1100 denotes the output spectrogram S* comprising selected audio characteristics which come from either the first or the third spectrogram 901, 903. This is to be expected in this relatively simple example given that the second spectrogram 902 has the largest exposure value η.
The output spectrogram S* keeps the main characteristic (wanted audio) whilst removing or reducing the motion characteristics.
The output spectrogram S* may be provided by the purifier module 706 to the conversion module 708 which is configured to convert the output spectrogram S* to an output audio signal which is therefore denoised to some degree.
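The conversion back to the time domain might, for example, invert the short-time Fourier transform. One common simplification, sketched below, is to reuse the phase of one of the captured signals (e.g. the least exposed microphone) with the selected magnitudes; this phase handling is an assumption for illustration, not something specified above:

```python
import numpy as np

def istft(mag, phase, frame_len=1024, hop=512):
    """Rebuild a time-domain signal from magnitude and phase (both shaped
    (freq bins, time frames)) via inverse FFT of each frame and weighted
    overlap-add, matching the windowed forward transform sketched earlier."""
    frames = np.fft.irfft(mag.T * np.exp(1j * phase.T), n=frame_len, axis=1)
    window = np.hanning(frame_len)
    out = np.zeros(hop * (frames.shape[0] - 1) + frame_len)
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + frame_len] += frame * window
        norm[i * hop:i * hop + frame_len] += window ** 2
    return out / np.maximum(norm, 1e-8)
```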
The “clean” output audio signal may then be provided to some further processing function, such as a speech recognition function and/or one or more ML processes.
In some example embodiments, rather than the above heuristic method, the purifier module 706 may implement one or more learned models, e.g., machine-learned (ML) models. The type of ML model or algorithm used may comprise a convolutional neural network (CNN). Although used largely for image processing, CNNs may be adapted for spectrogram generation. Convolutional layers of CNNs may capture local patterns, followed by pooling layers for downsampling. This may be effective for feature extraction from audio spectrograms and regenerating new audio spectrograms. For example, a CNN-based deep autoencoder may be trained to compress the input audio data into a lower-dimensional representation and then the spectrogram can be reconstructed from this representation. Similar techniques may be used with recurrent neural networks or transformers that are suitable for tasks where the temporal sequence of data is important, making them well-suited to embodiments described herein.
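A minimal PyTorch-style sketch of such a CNN-based autoencoder operating on spectrogram "images" is given below. The layer sizes are arbitrary, input dimensions are assumed to be multiples of 4, and the sketch deliberately leaves open how the exposure parameters and dissimilarity values would be injected (e.g. as additional input channels); it is one of many architectures that could fulfil the role described above:

```python
import torch
import torch.nn as nn

class SpectrogramAutoencoder(nn.Module):
    """Compress a single-channel spectrogram into a lower-dimensional
    representation and reconstruct a denoised spectrogram from it."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),   # downsample
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
            nn.ReLU(),   # magnitudes are non-negative
        )

    def forward(self, x):   # x: (batch, 1, freq_bins, time_frames)
        return self.decoder(self.encoder(x))
```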
The one or more learned models may be configured to receive input data and to output the output spectrogram S* comprising the selected audio characteristics across the common segments. As in the previous example, the output spectrogram S* may be provided to the conversion module 708 which is configured to convert the output spectrogram S* to an output audio signal.
The one or more learned models may receive, as input, the computed exposure parameters, η1-η3, the respective spectrograms, S1-S3, and the computed dissimilarity values for the common segments of the respective spectrograms.
The one or more learned models may provide, as output, the output spectrogram S* which results from selecting for each common segment which audio characteristics in that common segment of the respective spectrograms are to be used to generate the audio output.
The one or more learned models may be trained using a particular type of wanted audio signal, e.g., speech or heartbeat. However, the type of audio data is not necessarily important. The one or more learned models may be trained using audio data generated in two settings, that is one when the mobile terminal is stationary and one when the mobile terminal is in motion.
For example, the one or more learned models may be trained by providing ground truth data corresponding to respective audio signals, A1-A3, received by the, or a same type of, mobile terminal when stationary. The same type of mobile terminal means a mobile terminal having the same number of microphones at substantially the same respective locations on the mobile terminal.
Reference exposure parameters, η1-η3, and dissimilarity values, δA-δF, for common segments of respective spectrograms may be generated for the respective audio signals, A1-A3, when the mobile terminal is in motion.
An initial model may provide output data representing selected audio characteristics for each common segment based on the computed reference exposure parameters, η1-η3, and dissimilarity values, δA-δF, for the common segments.
The output data may be compared with the ground truth data to determine error data, and the initial model may be updated based on the error data, e.g., using backpropagation and a suitable loss function to optimize the initial model, e.g. using gradient descent. The process may be repeated for multiple iterations or epochs of training data when the mobile terminal experiences different types of movement, prior to being implemented on the mobile terminal 100.
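A schematic training loop along those lines is sketched below. The dataset pairing a "moving" input with a "stationary" ground-truth spectrogram, and the choice of mean-squared-error loss and the Adam optimiser (a gradient-descent variant), are illustrative assumptions:

```python
import torch
import torch.nn as nn

def train(model, dataloader, epochs=10, lr=1e-3):
    """dataloader yields (noisy_input, ground_truth) spectrogram tensor pairs:
    the input derived from recordings captured in motion, the target from the
    same type of mobile terminal held stationary."""
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for epoch in range(epochs):
        for noisy, target in dataloader:
            output = model(noisy)            # reconstructed spectrogram S*
            loss = loss_fn(output, target)   # error data vs. ground truth
            optimiser.zero_grad()
            loss.backward()                  # backpropagation
            optimiser.step()                 # gradient-based model update
    return model
```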
The learned model may, as noted above, comprise any suitable ML algorithm, for example an artificial neural network (ANN), such as a recurrent neural network (RNN).
In overview, example embodiments process respective spectrograms which may take into account how exposed different microphones are to pressure changes due to motion of the mobile terminal. Audio signals captured by the different microphones may be combined in a selective way via segmentation and quantifying audio characteristics similarity to produce an output spectrogram S* which denoises at least some audio characteristics due to the motion.
By segmenting respective spectrograms using common segments which may cover substantially all of the time and frequency space, time and frequency gaps can be avoided or mitigated against and hence any delays due to the respective positions of the microphones may not affect the result.
In some example embodiments, the exposure values η1-ηM generated by the kinematics module 702 may change a number of times within a particular time period as movement of the mobile terminal 100 changes. Example embodiments may therefore involve multiple iterations of segmenting the spectrograms S1-S3 and purification by the purifier module 706 within the particular time period. Synchronisation is therefore also handled efficiently.
Names of network elements, protocols, and methods are based on current standards. In other versions or other technologies, the names of these network elements and/or protocols and/or methods may be different, as long as they provide a corresponding functionality. For example, embodiments may be deployed in 2G/3G/4G/5G networks and further generations of 3GPP but also in non-3GPP radio networks such as WiFi.
A memory may be volatile or non-volatile. It may be, e.g., a RAM, a SRAM, a flash memory, an FPGA block RAM, a DVD, a CD, a USB stick, or a Blu-ray disc.
If not otherwise stated or otherwise made clear from the context, the statement that two entities are different means that they perform different functions. It does not necessarily mean that they are based on different hardware. That is, each of the entities described in the present description may be based on different hardware, or some or all of the entities may be based on the same hardware. It does not necessarily mean that they are based on different software. That is, each of the entities described in the present description may be based on different software, or some or all of the entities may be based on the same software. Each of the entities described in the present description may be embodied in the cloud.
Implementations of any of the above described blocks, apparatuses, systems, techniques or methods include, as non-limiting examples, implementations as hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof. Some embodiments may be implemented in the cloud.
It is to be understood that what is described above is what is presently considered the preferred embodiments. However, it should be noted that the description of the preferred embodiments is given by way of example only and that various modifications may be made without departing from the scope as defined by the appended claims.
| Number | Date | Country | Kind |
|---|---|---|---|
| 2317849.4 | Nov 2023 | GB | national |