The present disclosure relates to classifying images for augmented reality.
Augmented Reality (AR) is the merging of real and virtual worlds to produce new environments and visualizations where actual or real physical objects and digital or virtual objects co-exist and may interact in real time. AR brings a virtual world into a real-world environment of a user with true-to-life visuals and audio. AR mixes virtual sounds from virtual sound objects with real sounds in a real acoustic environment. Virtual sound from a virtual sound object should match equivalent real-world sound as played through headphones to a user to ensure a pleasing AR experience. Otherwise, the user experiences a degradation of the AR experience. Conventional techniques use complex multistep processes to match the virtual sound to the equivalent real-world sound. Such complexity introduces noticeable aural delays into an AR simulation, which may degrade the user experience. Moreover, the complexity disadvantageously increases processing requirements for, and thus the cost of, AR devices.
Extended reality (XR) generally encompasses virtual reality (VR) and augmented reality (AR), sometimes referred to as mixed reality (MR). Audio signal reproduction systems have evolved to deliver three-dimensional (3D) audio to a listener. In 3D audio, sounds are produced by headphones or earphones (for simplicity, collectively referred to herein as “headphones”) and can involve or include virtual placement of a sound source in a real or theoretical 3D space or environment auditorily perceived by the listener. For example, virtualized sounds can be provided above, below, or even behind a listener who hears 3D audio-processed sounds. Conventional audio reproduction via headphones tends to provide sounds that are perceived as originating or emanating from inside the head of the listener. In an example, audio signals delivered by headphones, including using a conventional stereo pair of headphones, can be specially processed to achieve 3D audio effects, such as to provide the listener with a perceived spatial sound environment.
A 3D audio headphone system can be used for VR applications, such as to provide the listener with a perception of a sound source at a particular position in a local or virtual environment where no real sound source exists. Similarly, a 3D audio headphone system can be used for AR applications, to provide the listener with the perception of the sound source at the position where no real sound source exists, and yet in a manner that the listener remains at least partially aware of one or more real sounds in the local environment. Computer-generated audio rendering for VR or AR can leverage signal processing technology developments in gaming and virtual reality audio rendering systems and application programming interfaces, such as building upon and extending from prior developments in the fields of computer music and architectural acoustics. Various binaural techniques, artificial reverberation, physical room acoustic modeling, and auralization techniques can be applied to provide users with enhanced listening experiences. A VR or AR signal processing system can be configured to reproduce some sounds such that they are perceived by a listener to be emanating from an external source in a local environment rather than from headphones or from a location inside the head of the listener.
Compared to VR 3D audio, AR audio involves the additional challenge of encouraging suspension of a participant's disbelief, such as by providing simulated environment acoustics and source-environment interactions that are substantially consistent with acoustics of a local listening environment. This presents a challenge of providing audio signal processing for virtual or added signals in such a manner that the signals include or represent the environment of the user, and such that the signals are not readily discriminable from other sounds naturally occurring or reproduced over headphones in the environment. Such audio signal processing provides accurate sound sources in a virtual sound field by matching and applying reverberation properties, including decay times, reverberation loudness characteristics, and/or reverberation equalization characteristics (e.g., spectral content of the reverberation) for a given listening environment. In audio-visual AR applications, computer-generated sound objects (referred to as “virtual sound objects”) can be rendered via acoustically transparent headphones to blend with a physical environment heard naturally by the viewer/listener. Such blending can include or use binaural artificial reverberation processing to match or approximate local environment acoustics.
Embodiments presented herein provide a practical and efficient approach to extend 3D audio rendering algorithms or simulations to faithfully match, or approximate, physical local environment acoustics. The embodiments provide solutions to the above-mentioned problems and/or challenges, and also provide advantages that will become apparent from the ensuing description. The embodiments may be used in 3D audio applications, such as VR and AR, for example. The embodiments use machine learning (ML) techniques to predict acoustic properties of the local environment, such as reverberation characteristics, directly from images of the local environment captured by an image sensor. The embodiments may then use the predicted acoustic properties in an acoustic simulation of the environment that matches or approximates actual acoustics of the local environment. Based on the predicted acoustic properties, the acoustic environment simulation seamlessly blends virtual sound with the local environment, when perceived by a listener via headphones.
More specifically, embodiments presented herein use ML techniques to train one or more neural networks of an ML classifier to predict the acoustic properties of an unknown environment accurately using an image sensor. The predicted acoustic properties are then used to create an acoustic context for virtual sound objects in the form of an acoustic environment simulation created within that environment in real-time. The embodiments advantageously: make use of camera sensors that are generally available on an XR device; allow the use of typical audio plugins used in game engines, such as Unity and Unreal engines; reduce complexity, processing requirements, and delay associated with matching virtual sound to an equivalent real-world sound in real-time AR environments compared to conventional techniques; provide scalable implementations depending on image sensor availability; and may be implemented as a deep learning inference engine.
System-Level Description
At a high-level, embodiments presented herein employ ML techniques to classify images of a real-world (i.e., an actual) environment directly to an acoustic preset that represents a set of acoustic parameters for an acoustic environment simulation (AES). The set of acoustic parameters represents a set of properties sufficient to perform the AES. The AES simulates or models a sound response of the real-world environment based on the set of acoustic parameters of the acoustic preset. The acoustic preset is a parametric representation of the sound response. The AES applies the sound response to sound from virtual sound objects placed (virtually) in the real-world environment, to convert the sound to realistic sound that appears to originate, realistically, from the virtual sound objects when played to a user through headphones. The aforementioned real-world environment includes any real-world environment or space with reverberant qualities, such as, but not limited to, a room, auditorium, concert hall, outdoor theatre, and so on. The rooms may also include rooms in a home, such as a kitchen, a living room, a dining room, a bathroom, and so on. The rooms may also include office spaces, and the like.
With reference to
XR system 100 includes an image sensor 102 to capture a sequence of images or video (collectively, “images”) 103, an AR display 104, a headset 106 including left and right headphones, an optional position sensor 107, and an XR processor or processor 108 coupled to, and that communicates with, the image sensor, the AR display, the headset, and the position sensor. XR processor 108 includes (i) an ML-based acoustic environment classifier 120 (referred to simply as an “ML classifier” 120) that includes one or more neural networks to classify images 103 into acoustic presets 122 according to embodiments presented herein, and (ii) an interactive audio engine (IAE) 124. IAE 124 may be implemented as part of XR processor 108 as shown in
Image sensor 102 may include a video camera to capture a sequence of images 103 of the real-world environment. Image sensor 102 may be positioned at different positions and orientations (collectively, “vantage points”) in the real-world environment to capture images 103 of different scenes of the real-world environment from the different vantage points. For example, image sensor 102 may include a video camera that is worn by a user who is a target of an AR experience, such that the video camera operates to capture different scenes of the real-world environment as the user moves around in the real-world environment. Position sensor 107 senses or determines a position and an orientation of one or more objects, including the user, in the environment, and provides position information 114 indicative of the position and the orientation of the objects to XR processor 108.
At a high-level, in operation, XR processor 108 processes (i) images 103 of the real-world environment, (ii) sound (i.e., sound signals) from virtual sound objects 128, and (iii) position information 114, when available, to produce a video signal 136 and a sound signal 138 representative of scenes of the real-world environment augmented with the virtual sound objects and other virtual information. AR display 104 converts video signal 136 to video and plays the video to the user. The headphones of headset 106 convert sound signal 138 to sound and play the sound to the user. More specifically, ML classifier 120 of XR processor 108 employs deep learning neural network techniques to classify images 103 into acoustic presets 122. Each of acoustic presets 122 represents a respective set of acoustic parameters, such as reverberation (“reverb”) parameters, that represent sound properties of the real-world environment. IAE 124 performs AES 126 based on acoustic presets 122, to simulate or model an acoustic response, including reverberation, for the real-world environment. IAE 124 also generates one or more virtual sound objects 128 placed at various virtual locations into scenes of the real-world environment. AES 126 applies the sound response to sound signals generated by virtual sound objects 128, to convert the sound signals from the virtual sound objects to sound signals 118 that convey realistic sound for the virtual sound objects. That is, AES 126 models at least sound reverberation, for example, for the virtual sound objects.
With reference to
With reference to
Acoustic parameters AP1-APN may include additional acoustic parameters, such as one or more sound reflection parameters/coefficients, one or more sound absorption parameters/coefficients, and so on.
At 302, XR processor 108 selects or establishes one of acoustic presets P1-PM as a default or initial acoustic preset for the AES. Acoustic parameters AP1-APN of the default acoustic preset represent initial acoustic parameters.
At 304, ML classifier 120 receives an image among the sequence of images 103 captured by image sensor 102. In steady state operation, the image may be a current image among previous and future images among the sequence of images 103 to be processed sequentially through method 300.
At 306, referred to as “inference,” (pre-trained) ML classifier 120 directly classifies the image into a set of multiple (current) classifications corresponding to acoustic presets P1-PM. The set of classifications may simply include labels L1-LM indicative of acoustic presets P1-PM with confidence levels C1-CN associated with respective ones of the labels. Labels L1-LM may be used to access respective ones of (known) acoustic presets P1-PM, and thus (known) acoustic parameters AP1-APN of the acoustic presets. For example, acoustic presets P1-PM may be stored so as to be indexed and thus retrieved based on labels L1-LM. Confidence level Ci represents a probability that the associated label Li/acoustic preset Pi is correct for the image, i.e., that the image was classified correctly to label Li/acoustic preset Pi. In this way, the classifications may be considered soft decisions, rather than hard decisions.
At 308, XR processor 108 selects the label/acoustic preset associated with the greatest confidence level among confidence levels C1-CN among the classifications, to produce a (current) selected label/acoustic preset. The selected acoustic preset replaces the default acoustic preset from operation 302. The selected acoustic preset is retrieved from memory (i.e., acoustic parameters AP1-APN of the selected preset are retrieved from memory).
At 310, XR processor 108 updates IAE 124 with the selected acoustic preset, i.e., with parameters AP1-APN of the selected acoustic preset.
Method 300 repeats sequentially as next images among the sequence of images 103 arrive for classification, to produce a sequence of classification results corresponding to the sequence of images, and that are sequentially passed to IAE 124 for AES 126.
A variation of method 300 conditions acoustic preset updates to IAE 124 on a predetermined confidence level threshold, which may introduce hysteresis into the updates provided to the IAE as the method repeats to classify successive images. More specifically, the variation only updates IAE 124 when one or more (current) classifications have confidence levels that exceed the confidence level threshold, in which case operations 308 and 310 proceed as described above. Otherwise, the variation does not update IAE 124, i.e., the variation simply maintains a last, previous update to the IAE that exceeded the confidence level threshold. Assuming the classifications include softmax values (i.e., soft decisions) that represent or are associated with confidence levels as probabilities, the confidence level threshold may be set equal to a probability of 0.7, for example. In that case, an update occurs only when the corresponding probability exceeds 0.7. To add hysteresis, the update may occur only when an average confidence level over a predetermined number (greater than 1) of consecutive classifications (through operation 306) exceeds 0.7.
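By way of a non-limiting illustration, the thresholded update with hysteresis may be sketched as follows. The class name `PresetUpdater`, the window length, and the additional requirement that the same label win every classification in the window are assumptions made for this sketch only; the disclosure specifies only that the average confidence level over more than one consecutive classification exceed the threshold.

```python
from collections import deque


class PresetUpdater:
    """Tracks the preset label last passed to the AES (hypothetical helper)."""

    def __init__(self, default_label, threshold=0.7, window=3):
        self.current = default_label          # default preset, as in operation 302
        self.threshold = threshold            # confidence level threshold
        self.history = deque(maxlen=window)   # most recent winning (label, confidence)

    def step(self, confidences):
        """confidences: dict mapping preset label -> softmax probability."""
        # Operation 308: pick the label with the greatest confidence level.
        label = max(confidences, key=confidences.get)
        self.history.append((label, confidences[label]))

        window_full = len(self.history) == self.history.maxlen
        # Assumption: the same label must win throughout the window.
        same_label = all(past == label for past, _ in self.history)
        mean_conf = sum(c for _, c in self.history) / len(self.history)

        # Operation 310 runs only when the averaged confidence exceeds the threshold.
        if window_full and same_label and mean_conf > self.threshold:
            self.current = label
        return self.current
```

In this sketch, a single low-confidence or contradictory classification leaves the previously selected preset in place, which corresponds to maintaining the last update that exceeded the threshold.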
Classification Flowcharts
Various methods of classifying images using ML techniques are now described in connection with flowcharts of
At 402, an initial acoustic preset among acoustic presets P1-PM is established.
At 404, an image of a scene of a real-world environment is captured.
At 406, using a deep learning neural network inference, the image (received from 404) is classified directly to M classifications indicative of acoustic presets P1-PM and their respective confidence levels C1-CN. The acoustic preset among acoustic presets P1-PM associated with the highest confidence level among confidence levels C1-CN is considered a “best match” acoustic preset to the real-world environment depicted in the image. That is, the simulated sound response generated by AES 126 based on the best match acoustic preset is closer to an actual sound response of the real-world environment than would be generated based on any of the other acoustic presets. At 408, the best match acoustic preset may be identified/selected based on the confidence levels associated with the classifications/acoustic presets.
At 408, it is determined whether to update AES 126 with the best match acoustic preset, as described above in connection with
From 408, flow control returns to 404 and the process repeats for a next image.
At 502, an initial acoustic preset among acoustic presets P1-PM is established.
At 504, an image of a scene of a real-world environment is captured.
At 506, using a deep learning neural network inference, the image (received from operation 504) is classified to a room type, e.g., kitchen.
At 508, an acoustic preset among acoustic presets P1-PM associated with/assigned to the room type is retrieved.
At 510, the acoustic preset from 508 may be used to update the AES.
From 510, flow control returns to 504 and the process repeats for a next image.
In method 500, inference operation 506 does not classify directly to an acoustic preset. Therefore, an extra operation, 508, is used to identify the acoustic preset after the classification is performed. That is, the room type is translated to the acoustic preset.
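The translation performed at operation 508 may be pictured as a simple table lookup, sketched below under stated assumptions. The room types, preset labels, parameter names, and numeric values are illustrative placeholders only and are not taken from the disclosure.

```python
# Hypothetical preset store: label -> acoustic parameters (illustrative values).
PRESETS = {
    "P1": {"decay_time_s": 0.9, "reverb_level_db": -6.0},
    "P2": {"decay_time_s": 0.3, "reverb_level_db": -18.0},
}

# Hypothetical assignment of room types (output of operation 506) to presets.
ROOM_TYPE_TO_PRESET = {"kitchen": "P1", "bedroom": "P2"}


def preset_for_room_type(room_type, default_label="P1"):
    """Operation 508: translate a classified room type into an acoustic preset.

    Unknown room types fall back to the default preset (an assumption of
    this sketch, mirroring the initial preset established at 502).
    """
    label = ROOM_TYPE_TO_PRESET.get(room_type, default_label)
    return label, PRESETS[label]
```

The extra lookup is what distinguishes method 500 from the direct classification of method 400, in which the classifier output already indexes the preset.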
Flow proceeds from 402 and 404 to 602. At 602, it is determined whether the user has previously been in the room in which the user is currently positioned. If the user has been in the room previously, flow proceeds to 604, where the acoustic preset for the room is retrieved from the cache. Flow proceeds from 604 to 408, which uses the acoustic preset retrieved from the cache. If the user has not been in the room previously, flow proceeds to 406, and operation continues as described above. An example of an XR processor configured to perform method 600 is described below in connection with
At 702, an initial acoustic preset among acoustic presets P1-PM is established.
At 704, an image of a scene of a real-world environment is captured.
At 706, using the first neural network, the image is directly classified to the general acoustic presets, from which the best general acoustic preset is selected, i.e., the acoustic preset associated with the highest confidence level is selected as the best acoustic preset.
At 708, using the second neural network, the image is directly classified to the secondary acoustic parameters.
At 710, one or more of the general acoustic parameters of the general acoustic preset selected at 706 are modified/adjusted based on one or more of the secondary acoustic parameters, to produce a modified general acoustic preset. For example, values of the general acoustic parameters of the general acoustic preset may be increased or decreased based on values of the secondary acoustic parameters. Alternatively, one or more of the general acoustic parameters may be replaced by one or more of the secondary acoustic parameters.
In a simple example, an absorption coefficient α in a fractional range 0<α<1 may be used as a secondary acoustic parameter, in which case operation 710 may multiply one or more of the general acoustic parameters by the absorption coefficient α, to produce one or more modified general acoustic parameters. In practice, such a modification based on absorption may be more complex for the following reason. Since each material has its own absorption coefficient, early reflections from the material are usually directly influenced by the absorption coefficient of the material. Thus, reverberation in an acoustic environment comprising many different materials can be influenced by an aggregate of the materials in the environment, which collectively produce an aggregate absorption. The aggregate absorption may affect the decay rate of the reverberation differently in different frequency bands, which can be taken into account at operation 710.
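The simple multiplicative modification of operation 710 may be sketched as follows, extended to one absorption coefficient per frequency band. The function name, the parameter naming scheme `decay_time_<band>_s`, and the band names are assumptions of this sketch; the disclosure does not prescribe a particular parameter layout.

```python
def modify_preset(general_params, alpha_by_band):
    """Scale per-band decay times of a general preset by absorption coefficients.

    general_params: dict of parameter name -> value (hypothetical layout).
    alpha_by_band:  dict of band name -> absorption coefficient, 0 < alpha < 1.
    Returns a new dict; the selected general preset itself is left unchanged.
    """
    modified = dict(general_params)
    for band, alpha in alpha_by_band.items():
        if not 0.0 < alpha < 1.0:
            raise ValueError(f"absorption coefficient out of range for band {band!r}")
        key = f"decay_time_{band}_s"
        if key in modified:
            # Simple example from the text: multiply the parameter by alpha.
            modified[key] = modified[key] * alpha
    return modified
```

Parameters without a matching band entry pass through unmodified, so the same routine covers the case in which only some of the general acoustic parameters are adjusted.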
At 712, the modified general acoustic preset may be used to update the AES.
From 712, flow returns to 704, and the process repeats.
With reference to
At 804, a depth camera captures a depth map (image) of the same real-world environment for which the image was captured at operation 704.
At 806, the 3D mesh is created from the depth map.
At 808, a secondary acoustic parameter (e.g., material sound absorption) produced at operation 708 is mapped to the 3D mesh.
At 810, the 3D mesh and the secondary acoustic parameter are exported.
Training and real-time operations of ML classifier 120 are now described in further detail in connection with
ML Training
For training and for the inference stage, post training, ML classifier 120 receives an image 906 and produces classifications 908 in the form of labels representative of acoustic presets. In the inference stage, at 910, an acoustic preset with a highest confidence is selected based on the labels and their confidence levels, as described above. During training, image 906 represents a training image on which ML classifier 120 trains.
In the first training scenario, training of ML classifier 120 may include the following operations:
Operations (a)-(c) may be performed based on subjective sound design, i.e., substantially manually by a sound designer. The sound designer uses his/her experience with room acoustics to design respective acoustic presets with respective sets of the most likely sounding acoustic parameters for corresponding ones of scenes depicted in training pictures among many training pictures in a training database. That is, the sound designer designs each respective set of acoustic parameters to best represent or match the acoustic properties of a corresponding scene depicted in one of the training pictures based on subjective design experience of the designer. For example, the designer selects a first set of reverberation parameters of a first acoustic preset for a “live” room (e.g., a live kitchen), selects a second set of reverberation parameters for a “dead” room (e.g., a heavily carpeted bedroom including fabric covered furniture), selects a third set of reverberation parameters of a third acoustic preset for a room having intermediate reverberation characteristics between those of the “live” room and the “dead” room, and so on. Then, the designer labels the training pictures with their most likely acoustic presets (which each represents a respective set of the acoustic parameters). For example, the designer labels training pictures of similar live-looking rooms with the first acoustic preset, labels training pictures of similar dead-looking rooms with the second acoustic preset, and labels training pictures of similar rooms that appear to have intermediate reverberation with the third acoustic preset, and so on.
An alternative to relying primarily on the experience of the sound designer to establish the acoustic presets for training uses actual acoustic measurements of rooms with different reverberant properties, and then algorithmically derives the acoustic presets from the acoustic measurements. For example, an acoustic impulse response for each room may be measured using any known or hereafter developed technique for measuring the acoustic impulse response of a real-world environment. Then, a set of acoustic parameters of an acoustic preset is algorithmically derived from the measured acoustic impulse response using any known or hereafter developed technique to derive reverberation parameters, for example, from the acoustic impulse response.
In one simplified example, the absolute value of the impulse response can be normalized and converted to a dB magnitude. The time from the initial pulse (normalized to 0 dB) at which the dB magnitude falls below -60 dB is taken as an RT60 decay time (i.e., how long it would take for a sound to decay 60 dB in a room). With added frequency domain analysis, such methods can be extended to multiband analysis of RT60 times. Similarly, values for initial spectral energies, onset times, early reflection timing, and density, etc., can be directly observed in the impulse response or windowed sections thereof. It is understood that this particular technique is provided by way of example, only, and any additional or alternative methods of impulse analysis may be used.
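A minimal sketch of the simplified RT60 estimate is given below. The function name is hypothetical. Two choices are assumptions of the sketch rather than part of the disclosure: the comparison is made in linear amplitude (a factor of 10^(-60/20) below the peak, which is equivalent to -60 dB), and the decay time is measured to the last sample that is still at or above that level, so that zero crossings of an oscillating response do not end the measurement early.

```python
def estimate_rt60(impulse_response, sample_rate_hz, drop_db=60.0):
    """Estimate the decay time from a measured impulse response.

    Returns the time in seconds from the peak (normalized to 0 dB) until the
    magnitude has permanently fallen below -drop_db, or None when the
    recording ends before the response has decayed that far.
    """
    magnitudes = [abs(x) for x in impulse_response]
    peak = max(magnitudes)
    if peak == 0.0:
        raise ValueError("impulse response is all zeros")
    peak_index = magnitudes.index(peak)

    # -drop_db relative to the peak, expressed as a linear amplitude.
    floor = peak * 10.0 ** (-drop_db / 20.0)

    # Find the last sample at or above the floor, scanning from the peak.
    last_above = peak_index
    for i in range(peak_index, len(magnitudes)):
        if magnitudes[i] >= floor:
            last_above = i

    if last_above == len(magnitudes) - 1:
        return None  # the decay was not fully captured
    return (last_above + 1 - peak_index) / sample_rate_hz
```

A multiband variant would apply the same routine to band-filtered copies of the impulse response; practical measurements with a noise floor generally call for more robust methods than this sketch.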
Once trained, ML classifier 120 may be validated by determining that an arbitrary room model “sounds like” one would expect.
For the inference stage, ML classifier 120 (or logic external to the ML classifier) may be configured to apply a smoothing function on the softmax (output) classification produced by the ML classifier, such that the classification only transitions from its previous state (i.e., previous acoustic preset provided to AES 126) if the softmax classification exceeds a softmax threshold, with some built in hysteresis to avoid spurious classification, similar to the thresholding described above in connection with method 300 of
Training may also leverage transfer learning that takes advantage of a pre-trained neural network that already performs traditional room type classification. This approach freezes the convolutional layer of the pre-trained neural network (at feature extraction) and continues to adapt the fully connected layer (classification) using the labels described above.
In the second training scenario, the labels may be based on lower level acoustic parameters, such as reverberation parameters. The reverberation parameters may include I3DL2 acoustic parameters, for example. Initially, a sound designer uses his/her experience with room acoustics to design respective acoustic presets with sets of the most likely sounding acoustic parameters for corresponding ones of scenes depicted in training pictures among many training pictures in a training database. That is, each respective set of acoustic parameters is designed to best represent or match the acoustic properties of a corresponding scene depicted in one of the training pictures. Then, during inference, acoustic parameters are updated based on the labels, as shown at 1002.
In the third training scenario, the labels are based on lower level acoustic parameters that are derived from acoustic measurements of real acoustic properties taken in the same room as depicted in a training image. The acoustic measurement may include a measurement of a room (sound) impulse response, for example. Then, pre-training data preparation includes analyzing the room impulse response to automatically tune the appropriate acoustic parameters, i.e., perform automated tuning. The automated tuning, itself, may be based on an ML neural network.
Both the second and third training scenarios may take advantage of ML neural networks.
In the fourth training scenario, ML classifier 120 is trained on descriptive features of pictures that have acoustic relevance. Data preparation for pre-training includes labeling pictures of scenes of rooms with the given acoustics vocabulary. Although the example of
With reference to
XR Processor Example
Image classification path 1202 includes an image preprocessor 1222 (for acoustic analysis) followed by ML classifier 120. Image preprocessor 1222 processes images 103, i.e., raw image data, to produce images in a format suitable for consumption by ML classifier 120. Image preprocessor 1222 formats the raw image data, and/or selects, recalls, or aggregates the raw image data to match training assumptions for ML classifier 120. For example, image preprocessor 1222 may stitch together successive ones of images 103 to produce stitched images for classification, as described above.
Assuming ML classifier 120 has been trained to classify images to both general acoustic presets (with their confidence levels) and secondary acoustic modifiers, directly, the ML classifier classifies each of the images from image preprocessor 1222 to general acoustic preset 1210 and acoustic modifier 1212, directly. In an example, general acoustic preset 1210 includes initial reverberation parameters, and secondary acoustic modifier 1212 may include one or more of an acoustic absorption parameter, an acoustic reflection parameter, an acoustic diffusion parameter, and specific environment (e.g., room) dimensions.
ML classifier 120 may produce general acoustic preset 1210 and secondary acoustic modifier 1212, concurrently, provided there is sufficient image information, and sufficient ML classifier (e.g., neural network) processing power, for both types of classification to proceed concurrently. Alternatively, ML classifier 120 may (i) initially produce only general acoustic preset 1210 based on initially received images and/or initially limited processing power, and, (ii) when further images arrive and/or further processing power is available, concurrently produce both the general acoustic preset 1210 and secondary acoustic modifier 1212.
APC logic 1206 modifies the (initial) reverberation parameters of general acoustic preset 1210 based on acoustic modifier 1212, to produce a modified general acoustic preset including modified reverberation parameters, and provides the modified general acoustic preset to AES 126 in final acoustic tuning parameters 1220.
Material estimation path 1204 includes an image preprocessor 1232 (for geometric analysis) followed by an architectural mesh and material estimator (referred to simply as a “material estimator”) 1234. Image preprocessor 1232 processes the raw image data in images 103, to produce images for consumption by material estimator 1234. Material estimator 1234 constructs a (digital) architectural 3D mesh for the scenes depicted in the images, estimates types of materials depicted in the scenes based on the architectural 3D mesh, and estimates acoustic properties of the materials, to produce early reflection model data (e.g., parameters) 1214 that includes the acoustic properties. Image preprocessor 1232 and material estimator 1234 may perform geometrical image analysis, generate an architectural mesh, and estimate material properties from the mesh using any known or hereafter developed techniques.
APC logic 1206 combines early reflection model data 1214 with the modified general acoustic preset into final acoustic tuning parameters 1220. Alternatively and/or additionally, APC logic 1206 may further modify the modified general acoustic preset using various parameters in early reflection model data 1214.
In an embodiment that omits material estimation path 1204, early reflection model data 1214 may still be used, but set to default values, for example.
At 1404, further image data flows into ML classifier 120 and, based on the further image data, the ML classifier produces secondary acoustic modifiers 1212 in addition to general acoustic presets P1-PM.
At 1406, acoustic parameter safety check logic performs acoustic parameter safety checks on the selected general acoustic preset and the secondary acoustic modifiers to ensure the aforementioned acoustic parameters are within reasonable bounds given the (current) selected general acoustic preset, and additional information useful for performing the safety check. Following the safety checks, APC logic 1206 modifies the selected general acoustic preset based on the secondary acoustic modifiers, to produce a modified/consolidated acoustic preset, including the N acoustic parameters, as modified. The ERE default parameters are retained with the modified/consolidated acoustic preset.
At 1410, material estimation path 1204 generates early reflection model data 1214 based on the initial image data and the further image data.
At 1412, the acoustic parameter safety check logic performs acoustic parameter safety checks on the modified/consolidated acoustic preset and early reflection model data 1214. APC logic 1206 further modifies the modified/consolidated acoustic preset based on early reflection model data 1214, or simply adds the early reflection data to the modified preset, to produce final acoustic tuning parameters 1220.
Cache Embodiment
In the embodiment of
Flowcharts for Acoustic Preset Transition Methods
Training Process
Computer Device
With reference to
Processor 2116 may include a collection of microcontrollers and/or microprocessors, for example, each configured to execute respective software instructions stored in the memory 2114. Processor 2116 may be implemented in one or more programmable application specific integrated circuits (ASICs), firmware, or a combination thereof. Portions of memory 2114 (and the instructions therein) may be integrated with processor 2116. As used herein, the terms “acoustic,” “audio,” and “sound” are synonymous and interchangeable.
The memory 2114 may include read only memory (ROM), random access memory (RAM), magnetic disk storage media devices, optical storage media devices, flash memory devices, electrical, optical, or other physical/tangible (e.g., non-transitory) memory storage devices. Thus, in general, the memory 2114 may comprise one or more computer readable storage media (e.g., a memory device) encoded with software comprising computer executable instructions and when the software is executed (by the processor 2116) it is operable to perform the operations described herein. For example, the memory 2114 stores or is encoded with instructions for control logic 2120 to perform operations described herein related to ML classifier 120, IAE 124, image preprocessors 1222 and 1232, APC logic 1206, material estimation path 1204, and the methods described above.
In addition, memory 2114 stores data/information 2122 used and generated by logic 2120, such as images, acoustic parameters, neural networks, and so on.
Summary Method Flowcharts
With reference to
At 2202, the method receives an image of a real-world environment. To do this the method may capture the image using an image sensor, or access the image from a file of pre-stored images.
At 2204, the method uses an ML classifier, previously trained as described herein, to receive the image from operation 2202 and to directly classify the image to classifications associated with, and indicative of, (known) acoustic presets for an AES. The classifications include respective confidence levels. The acoustic presets each include (known) acoustic parameters that represent sound reverberation for the AES.
At the time of the classifying in operation 2204, the acoustic presets and their respective parameters are already known from the a priori training of the ML classifier. Thus, the ML classifier classifies the image “directly” to the classifications associated with and indicative of the acoustic presets without classifying to a room type first, which would then require further operations to derive acoustic parameters from the room type, for example. The directly classifying of operation 2204 is essentially a single classifying operation flowing from the image to the classifications that provides direct access to known/predetermined acoustic parameters associated with the classifications, without intervening parameter translations. Moreover, the AES uses the acoustic presets directly, i.e., as is. In an embodiment, the ML classifier was trained on (labeled) training images of real-world environments divided into different groups of the training images. The training images of the different groups of the training images are labeled with respective ones of the acoustic presets that are the same within each of the different groups, but that differ across the different groups. The training images may also be further labeled with additional (secondary) acoustic parameters, exploited in further operations 2210-2214, described below.
At 2206, the method selects an acoustic preset among the acoustic presets (i.e., a particular one of the acoustic presets) based on the confidence levels of the classifications. The method accesses/retrieves the acoustic preset.
At 2208, the method performs the AES based on the acoustic parameters of the acoustic preset. The AES models sound reverberation for one or more virtual sound objects placed virtually in the real-world environment based on the acoustic parameters of the acoustic preset.
At 2210, the method uses the machine learning classifier to further classify the image, or to classify one or more further images, directly, to produce one or more acoustic parameter modifiers. The further classifying may be concurrent with the classifying of operation 2204. Alternatively, the further classifying may result from receiving and classifying additional or subsequent images.
At 2212, the method modifies the acoustic parameters of the acoustic preset from 2206 based on the one or more acoustic parameter modifiers from 2210, to produce a modified acoustic preset including modified acoustic parameters for the AES.
At 2214, the method performs the AES using the modified acoustic parameters.
Different combinations of operations 2202-2214 of method 2200 may represent separate and independent embodiments. For example, operations 2202-2206 collectively represent an independent embodiment.
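As a rough illustration of operations 2204-2206, the following sketch selects a preset from classifier confidence levels. The preset table, its labels, and its parameter names are hypothetical, and the dictionary of confidences stands in for the output of the trained ML classifier, which is not modeled here.

```python
from typing import Dict, Mapping, Tuple

# Hypothetical table of known acoustic presets, keyed by classification label.
PRESETS: Dict[str, Dict[str, float]] = {
    "P1": {"decay_time_s": 0.4, "reverb_level_db": -24.0},
    "P2": {"decay_time_s": 1.2, "reverb_level_db": -18.0},
    "P3": {"decay_time_s": 3.5, "reverb_level_db": -12.0},
}


def select_preset(confidences: Mapping[str, float],
                  presets: Mapping[str, Dict[str, float]] = PRESETS
                  ) -> Tuple[str, Dict[str, float]]:
    """Pick the preset whose classification has the highest confidence level.

    Because each classification is already tied to a known preset, the lookup
    needs no intermediate room type or parameter translation.
    """
    label = max(confidences, key=lambda k: confidences[k])
    return label, presets[label]
```

For example, `select_preset({"P1": 0.1, "P2": 0.7, "P3": 0.2})` returns the label `"P2"` together with its parameters, which the AES could then use as is.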
With reference to
At 2302, the method captures/receives a second image of the real-world environment.
At 2304, using the machine learning classifier, the method directly classifies the second image to produce second classifications that have respective second confidence levels.
At 2306, the method determines whether one or more of the second classifications have respective second confidence levels that exceed a confidence level threshold.
At 2308, if one or more of the second classifications have respective second confidence levels that exceed the confidence level threshold, the method selects a second acoustic preset among the acoustic presets (a second particular one of the acoustic presets) based on the second confidence levels of the second classifications, and updates/replaces the acoustic preset with the second acoustic preset for the acoustic environment simulation.
At 2310, if none of the second classifications has a corresponding second confidence level that exceeds the confidence level threshold, the method does not select a second acoustic preset, and does not update/replace the acoustic preset for the acoustic environment simulation.
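The threshold test of operations 2306-2310 can be sketched as follows. The function name and the default threshold value are assumptions; the disclosure does not specify a particular threshold.

```python
from typing import Mapping


def transition_preset(current_label: str,
                      second_confidences: Mapping[str, float],
                      threshold: float = 0.6) -> str:
    """Return the preset label to use after classifying a second image.

    The preset is replaced only when the best second classification exceeds
    the confidence level threshold; otherwise the current preset is kept.
    """
    best = max(second_confidences, key=lambda k: second_confidences[k])
    if second_confidences[best] > threshold:
        return best        # operation 2308: update/replace the preset
    return current_label   # operation 2310: keep the existing preset
```

Keeping the current preset when no classification is confident enough avoids switching the simulated reverberation on the strength of an ambiguous image.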
In methods 2200 and 2300, individual classifications may be based on one image or more than one image. For example, considering the context of classifying a sequence of images (or a sequence of image frames), the methods may classify one image at a time, to produce a separate classification for each image (or image frame); however, the classification preset (i.e., the acoustic preset presented to the AES) changes or updates when there is a significant/substantial difference in a “running average” of confidence levels for classifications from several such images (or image frames). Also, an image under classification may be augmented using multiple images from the image sensor, e.g., by stitching multiple perspectives to generate a less cropped perspective of the environment.
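One possible reading of the "running average" behavior is sketched below. The sliding-window length, the switching margin, and the class name are assumptions introduced for illustration; the disclosure only states that the preset changes when the running average of confidence levels differs significantly.

```python
from collections import deque
from typing import Deque, Dict, Iterable, Mapping, Optional


class SmoothedPresetSelector:
    """Average per-frame confidences over a window before switching presets."""

    def __init__(self, labels: Iterable[str], window: int = 5, margin: float = 0.1):
        self.labels = list(labels)
        self.history: Deque[Mapping[str, float]] = deque(maxlen=window)
        self.margin = margin               # hypothetical "significant difference"
        self.current: Optional[str] = None

    def update(self, confidences: Mapping[str, float]) -> str:
        """Add one frame's classification and return the preset label to present."""
        self.history.append(confidences)
        n = len(self.history)
        avg: Dict[str, float] = {
            label: sum(frame.get(label, 0.0) for frame in self.history) / n
            for label in self.labels
        }
        best = max(avg, key=lambda k: avg[k])
        if self.current is None:
            self.current = best
        elif best != self.current and avg[best] - avg[self.current] > self.margin:
            self.current = best            # averaged evidence clearly favors a new preset
        return self.current
```

With this scheme each image frame still receives its own classification, but a single outlier frame does not change the preset presented to the AES.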
In summary, in one embodiment, a method is provided comprising: receiving an image of a real-world environment; using a machine learning classifier, classifying the image to produce classifications associated with acoustic presets for an acoustic environment simulation, the acoustic presets each including acoustic parameters that represent sound reverberation; and selecting an acoustic preset among the acoustic presets based on the classifications.
In another embodiment, an apparatus is provided comprising: a processor configured to: receive an image of a real-world environment; use a trained machine learning classifier including one or more neural networks to classify the image directly to classifications associated with acoustic presets for an acoustic environment simulation, the acoustic presets each including acoustic parameters that represent sound reverberation; select an acoustic preset among the acoustic presets based on the classifications; and perform the acoustic environment simulation based on the acoustic parameters of the acoustic preset.
In a further embodiment, a non-transitory computer readable medium is provided. The computer readable medium is encoded with instructions that, when executed by a processor, cause the processor to perform the methods presented herein, including to: receive an image of a real-world environment; use a machine learning classifier, previously trained on training images of real-world environments labeled with respective ones of acoustic presets, the acoustic presets each including acoustic parameters that represent sound reverberation, to classify the image directly to classifications associated with the acoustic presets for an acoustic environment simulation; select an acoustic preset among the acoustic presets based on the classifications; and perform the acoustic environment simulation based on the acoustic parameters of the acoustic preset.
In another embodiment, a system is provided comprising: an image sensor to capture an image of a real-world scene; a processor coupled to the image sensor and configured to: implement and use a previously trained machine learning classifier to classify the image directly to classifications associated with acoustic presets for an acoustic environment simulation, the acoustic presets each including acoustic parameters that represent sound reverberation; select an acoustic preset among the acoustic presets based on the classifications; and perform the acoustic environment simulation based on the acoustic parameters of the acoustic preset, to produce a sound signal representative of the acoustic environment simulation; and one or more headphones coupled to the processor and configured to convert the sound signal to sound.
Although the techniques are illustrated and described herein as embodied in one or more specific examples, they are nevertheless not intended to be limited to the details shown, since various modifications and structural changes may be made within the scope and range of equivalents of the claims.
Each claim presented below represents a separate embodiment, and embodiments that combine different claims and/or different embodiments are within the scope of the disclosure and will be apparent to those of ordinary skill in the art after reviewing this disclosure.
This application is a continuation of International Application No. PCT/US2019/066315, filed on Dec. 13, 2019, which claims the benefit of U.S. provisional patent application No. 62/784,648, filed Dec. 24, 2018, the entireties of which are incorporated herein by reference.
Number | Date | Country | |
---|---|---|---|
20220101623 A1 | Mar 2022 | US |
Number | Date | Country | |
---|---|---|---|
62784648 | Dec 2018 | US |
Number | Date | Country | |
---|---|---|---|
Parent | PCT/US2019/066315 | Dec 2019 | WO |
Child | 17354668 | US |