Creators of immersive experiences often include audio as a component. To this end, to make the experience truly immersive, the audio should align with what is being depicted in any visual component. Incorrectly aligned audio and video can result in an unrealistic and disjointed environment. In a two-dimensional experience, stereo audio can provide sound using two channels, one channel for sounds occurring to the left and one channel for sounds occurring to the right relative to a location where, for example, a user is listening to the stereo audio. These two channels are capable of being played in each ear to indicate where in the experience sound is being generated. A three-dimensional experience (e.g., augmented reality and/or virtual reality), on the other hand, can use ambisonics, or spatial audio to indicate where sound is being generated. Ambisonics refers to a class of representations of spatial audio of different orders. Spatial audio of first order ambisonics generally utilizes of four channels of audio instead of two as in stereo audio: W, X, Y, and Z to provide sound in three dimensions. W is omnidirectional audio, meaning audio that is captured from every direction. X, Y, and Z are the channels of audio along the x axis, y axis, and z axis—in other words, left/right, up/down, and forward/backwards. It should be appreciated that other orders of spatial audio can use additional channels (e.g., second order ambisonics can use nine channels and third order ambisonics can use sixteen channels).
When generating a three-dimensional experience, the video component is often recorded separately from the audio component. For example, a camera capable of capturing a scene in three dimensions can be placed at a location to record a scene in multiple directions (e.g., 360 degrees or some subset of visualization in all directions) with the camera as a reference point. The camera can be oriented in a particular direction such that the camera has a perspective that some direction is left, up, forward, etc. so as a scene is captured, the video is oriented as such. An ambisonic microphone placed in a position to capture audio related to the scene can have its own orientation, separate from that of the camera, with its own notion of x, y, and z. In this way, recorded audio can have an orientation different from the orientation of the camera.
When listening to this audio in conjunction with viewing a related visual component, a user can wear a pair of headphones that track the user's orientation as the user's head turns. Knowing the orientation of the user's head allows spatial audio to be rendered to the user so different sounds that are encoded in the ambisonics recording will be adjusted (e.g., volume or frequency responses) such that the audio sounds like a stable audio scene in which the user is moving.
As such, aligning spatial audio captured by a microphone with captured video can provide a more immersive user experience. Accurately aligning spatial audio with video, however, can be difficult. Some conventional methods require loading video and ambisonics audio and, thereafter, using headphones while watching the video. To determine if the audio and video are correctly aligned, a user watches the video and listens to the sound to see if sounds seem to be in the right spot or not. However, relying on human perception of sound and direction often results in inaccurate alignment. Other conventional methods have attempted to create visual representations of spatial audio to assist with such alignment. Such methods can be used in the context of alternative reality, virtual reality, and/or mixed reality post-processing editing of recorded spatial audio. However, such methods often result in visual representations of sound that do not accurately indicate where sound is actually coming from. Additionally, such methods fail to use meaningful visual attributes based on properties of the spatial audio to create visual representations of sound(s) within the spatial audio.
Embodiments of the present disclosure are directed towards a spatial audio visualization system for visualizing first order spatial audio using properties associated with time segments of the audio. In accordance with embodiments, such properties can include position, intensity, focus, and color. Advantageously, such a spatial audio visualization system is capable of clearly indicating where sound is coming from in an environment as well as visually representing properties of the sound that can be used to understand what objects might be generating the sound. As such, accurately visualizing spatial audio ensures alignment can be performed more quickly and accurately.
A spatial audio visualization system can provide visual representations of spatial audio using particles or blobs with attributes that reflect the properties of spatial audio over time. Position can be used to place the particle in the location sound is being captured from at a segment of time for the spatial audio. Intensity can be used to indicate how loud the sound is at a segment of time for the spatial audio by adjusting the opacity of the particle. Focus can be used to indicate how concentrated the sound is at a segment of time for the spatial audio by adjusting the size of the particle. Frequency can be used to indicate what pitch the sound is at during the segment of time for the spatial audio by displaying the particle using a color(s). To determine these properties, a time segment of spatial audio can be obtained and the various audio channels analyzed (W, X, Y, and Z) to identify position, intensity, focus, and frequency. Upon determining the properties across various time segments of the spatial audio, the audio can be rendered into a visualization using, for example, a particle that allows a user to “see” the sounds being made.
Oftentimes, users desire for easy alignment of spatial audio with any related visual component (e.g., related video). For instance, users might desire a visualization of spatial audio that will allow the user to easily view and understand properties related to the spatial audio. Accurately indicating a where sound is coming from and what object might be making the sound can allow a user to easily align audio with visual aspects. For instance, if a visual aspect shows two people talking, a man to the right and a woman to the left, a visual indication that sound is coming from a certain position on the left that has properties associated with a woman's voice, how she is speaking, etc. (e.g., loudness, frequency of the voice) can allow for easily aligning the spatial audio associated with the woman with visualizations of the woman talking.
Accordingly, embodiments of the present disclosure are directed to a spatial audio visualization system for visualizing time segments of spatial audio using properties of the spatial audio. In particular, and as described herein, properties can be determined for a selected time segment(s) of spatial audio. Such properties can include the position audio is coming from, the intensity of the audio, how focused the audio is, and the frequency of the audio. A time segment can be represented using a particle or blob. Each property can be conveyed to a user via a distinct visual aspect, or attribute, associated with a particle or blob. Advantageously, such presentation of properties of spatial audio at time segments allows users to visualize where sounds are coming from in a video at different times. As such, aligning the sound with corresponding objects in a video can be performed in a more efficient and effective manner.
Turning to
It should be understood that operating environment 100 shown in
It should be understood that any number of user devices, servers, and other components may be employed within operating environment 100 within the scope of the present disclosure. Each may comprise a single device or multiple devices cooperating in a distributed environment.
User devices 102a through 102n can be any type of computing device capable of being operated by a user. For example, in some implementations, user devices 102a through 102n are the type of computing device described in relation to
The user devices can include one or more processors, and one or more computer-readable media. The computer-readable media may include computer-readable instructions executable by the one or more processors. The instructions may be embodied by one or more applications, such as application 110 shown in
The application(s) may generally be any application capable of facilitating the exchange of information between the user devices and the server(s) 106 in carrying out spatial audio visualization. In some implementations, the application(s) comprises a web application, which can run in a web browser, and could be hosted at least partially on the server-side of environment 100. In addition, or instead, the application(s) can comprise a dedicated application, such as an application having audio editing and/or processing functionality. For example, such an application can be configured to display visualizations of spatial audio. Such an application can also be capable of having visual and/or video editing and/or processing functionality (e.g., where the visual and/or video is associated with the audio). In some cases, the application is integrated into the operating system (e.g., as a service). It is therefore contemplated herein that “application” be interpreted broadly. Example applications include Adobe® Audition, Adobe® Premiere Pro, and the like.
In accordance with embodiments herein, application 110 can facilitate visualizing spatial audio. In particular, a user can select or input a spatial audio recording and/or spatial audio sound clip. A spatial audio recording generally refers to a recording of first order spatial audio with four channels of audio capturing sound from three dimensions: W, X, Y, and Z. A spatial audio sound clip can generally refer to a file containing four channels of audio capturing sound from three dimensions: W, X, Y, and Z. A spatial audio recording and/or spatial audio sound clip can also be referred to as spatial audio. The W channel contains omnidirectional audio, meaning audio that is captured from every direction. The X, Y, and Z channels of audio contain audio along the x axis, y axis, and z axis—in other words, sounds coming from left/right, up/down, and forward/backwards. Spatial audio can be selected or input in any manner. The application may facilitate the access of one or more recordings or sound clips stored on the user device 102a (e.g., in an audio library), and/or import spatial audio from remote devices 102b-102n and/or applications, such as from server 106. For example, a user may record spatial audio using a microphone on a device, for example, user device 102a. As another example, a user may select a desired spatial audio sound clip from a repository, for example, stored in a data store accessible by a network or stored locally at the user device 102a. Based on the input spatial audio, the input spatial audio can be analyzed to generate a visualization of the spatial audio, for example, using various techniques described herein. Such a visualization of the spatial audio can be provided to the user device 102a. To this end, the sound visualization can be rendered for display to a user using attributes relating to determined properties of sound at time segments of the spatial audio (e.g., via user device 102b-102n). Such a visualization can be rendered using a particle with attributes based on the determined properties. Particle(s) can be displayed using a two-dimensional and/or three-dimensional visualization. It should be appreciated that attributes related to determined properties for sound can be displayed in any number of ways.
The user device can communicate over a network 104 with a server 106 (e.g., a Software as a Service (SAAS) server), which provides a cloud-based and/or network-based spatial audio visualization system 108. The spatial audio visualization system may communicate with the user devices and corresponding user interface to facilitate the editing and/or presenting of sound visualizations via the user device using, for example, application 110.
As described herein, server 106 can facilitate visualizing spatial audio via spatial audio visualization system 108. Server 106 includes one or more processors, and one or more computer-readable media. The computer-readable media includes computer-readable instructions executable by the one or more processors. The instructions may optionally implement one or more components of spatial audio visualization system 108, described in additional detail below.
For cloud-based implementations, the instructions on server 106 may implement one or more components of spatial audio visualization system 108. Application 110 may be utilized by a user to interface with the functionality implemented on server(s) 106, such as spatial audio visualization system 108. In some cases, application 110 comprises a web browser. In other cases, server 106 may not be required, as further discussed with reference to
Thus, it should be appreciated that spatial audio visualization system 108 may be provided via multiple devices arranged in a distributed environment that collectively provide the functionality described herein. Additionally, other components not shown may also be included within the distributed environment. In addition, or instead, spatial audio visualization system 108 can be integrated, at least partially, into a user device, such as user device 102a.
Referring to
A user device can be utilized by a user to facilitate visualization of spatial audio. In particular, a user can select or input spatial audio for visualization utilizing user interface 118. Spatial audio can be selected or input in any manner. The user interface may facilitate the user accessing one or more stored recordings and/or sound clips of spatial audio on the user device (e.g., in an audio library), and/or import recordings and/or sound clips from remote devices and/or applications. Based on the input recordings and/or sound clips, the spatial audio can be analyzed to generate a visualization using various techniques, some of which are further discussed below with reference to spatial audio visualization system 204 of
Referring to
As shown, a spatial audio visualization system can operate in conjunction with data store 202. Data store 202 can store computer instructions (e.g., software program instructions, routines, or services), data, and/or models used in embodiments described herein. In some implementations, data store 202 can store information or data received via the various managers, engines, and/or components of spatial audio visualization system 204 and provide the various managers, engines, and/or components with access to that information or data, as needed. Although depicted as a single component, data store 202 may be embodied as one or more data stores. Further, the information in data store 202 may be distributed in any suitable manner across one or more data stores for storage (which may be hosted externally).
In embodiments, data stored in data store 202 can include spatial audio recordings and/or spatial audio sound clips selectable for visualization using, for example, the spatial audio visualization system. Spatial audio can be captured using an ambisonic microphone and is first order spatial audio can be comprised of four channels: W, X, Y, and Z. W is an omnidirectional channel containing audio captured from every direction (e.g., the x, y, and z directions). The X, Y, and Z channels contain sound captured along the x axis, y axis, and z axis (e.g., left/right, up/down, and forward/backwards). Such spatial audio can be input into data store 202 from a remote device, such as from a server or a user device. Data stored in data store 202 can also include visualization aspects determined for a spatial audio time segment. Determined properties for spatial audio can also be stored in data store 202. Such properties can include position, intensity, focus, and frequency for the sound at segments of time of the spatial audio. Data stored in data store 202 can further include rendered visualizations for spatial audio. Such rendered visualizations of spatial audio can be based on spatial audio properties.
Spatial audio visualization system 204 can generally be used to visualize spatial audio. Specifically, the spatial audio visualization system can be configured for determining properties of the audio for time segments of the spatial audio. As used herein, performing spatial audio visualization generally includes analyzing the spatial audio at a time segment to determine properties of the audio for that time segment that can be used to render a visualization of the audio at that time segment using attributes based on the determined properties of the sound. Upon determining properties for time segments, spatial audio can be visualized for an entire recording and/or clip, or a portion thereof. Such properties can include position, intensity, focus, and frequency of sound(s) present during the time segment(s) of spatial audio. Position can indicate where sound is coming from on the surface of the unit sphere during the time segment of audio. Intensity can indicate how much energy (e.g., noise) is occurring during the time segment of audio. Focus can indicate how concentrated the sound is during the time segment of audio. Frequency can indicate what pitch the sound is during the time segment of audio.
A spatial audio recording and/or audio clip can be accessed or referenced by spatial audio processing manager 206 for visualizing the audio. In this regard, the spatial audio processing manager 206 may access or retrieve spatial audio selected by a user via data store 202 and/or from a remote device, such as from a server or a user device. As another example, the spatial audio processing manager 206 may receive spatial audio provided to the spatial audio processing manager 206 via a user device. Visualization of spatial audio can be initiated in any number of ways. For example, visualization can take place when a user indicates a desire to select spatial audio for viewing using, for example, a user interface associated with the spatial audio visualization system. As another example, visualization may be initiated automatically, for instance, upon receiving and/or retrieving spatial audio. In some embodiments, such selection can be for presenting the selected spatial audio and/or for performing the spatial audio visualization.
As shown, the spatial audio processing manager 206 can include filtering engine 210 and visualization engine 212. The foregoing engines of the spatial audio processing manager can be implemented, for example, in operating environment 100 of
Upon the selection of spatial audio for visualization, filtering engine 210 can be utilized to filter the spatial audio. Filtering can take place during the visualization process because there can be a lot of noise captured in the spatial audio that makes sound(s) difficult to understand. For instance, the human audio response goes up to around 22 kHz, however, the sounds that people typically care about and perceive are typically in a sound range between 1 Hz and 1 kHz. As such, filtering engine 210 can be used to perform preprocessing of spatial audio by filtering to remove sounds greater than 1 kHz. This filtering can reduce the noise level before any further processing occurs, helping to keep the visualization clean. The filtering engine can carry out the filtering of spatial audio using, for example, a 1 kHz low pass filter or bandpass filter. As can be appreciated, in some cases, the range of sound that is filtered can vary based on the noises and/or sounds that are captured in the spatial audio. As such, the filter is capable of being adjusted to provide an optimal range of audio for visualization (e.g., when spatial audio includes sounds a user wishes to visualize over 1 kHz, the filter can be increased to, for example, 1.5 kHz).
As shown, visualization engine 212 can include position component 214, intensity component 216, focus component 218, and color component 220. The foregoing components of the visualization engine can be implemented, for example, in operating environment 100 of
Visualization engine 212 can be utilized to analyze spatial audio to determine properties of sound(s) for time segments of the spatial audio. Such properties can include position, intensity, focus, and frequency. Position can indicate where sound is coming from on the surface of the unit sphere during a time segment of audio. Intensity can indicate how much energy (e.g., noise) is occurring during a time segment of audio. Focus can indicate how concentrated the sound is during a time segment of audio. Frequency can indicate what pitch the sound is during a time segment of audio. Upon determining properties for a time segment, such properties can be combined for full visualization of spatial audio over time using attributes based on the determined properties. For instance, position can be represented using coordinates at which the visualization is displayed, intensity can be represented using opacity of the visualization, focus can be represented using size of the visualization, and frequency can be represented using a RGB color(s) for the visualization.
The visualization engine 212 can segment selected spatial audio into time segments for processing. In one embodiment, the audio is segmented into chunks of time on the order of 10 ms. A frame of video typically has 30 frames per second and there is typically 48 kH of audio captured per second. As such, using time segments of 10 ms allows for 10 visualizations of sound to be displayed for each frame of video such that each 10 ms segment of spatial audio includes 480 audio samples where an audio sample has a 16 bit value per point in time as a wave form. As can be appreciated, audio can be segmented into any time segments (e.g., dynamically determined time segments, default time segments, user-specified time segments, etc.).
Position component 214 can be configured to determine the position of sound at a selected time segment of spatial audio. Position can indicate where on the surface of the unit sphere sound is coming. In embodiments, position can designate where in a three-dimensional environment, as shown using augmented reality and/or virtual reality, sound is coming from at a point in time (e.g., when a user in a virtual reality environment is facing a certain direction, position of sound can be used to indicate that a person is clapping behind and to the left of the user; if the user is on a hill in the virtual reality environment, the position of sound can further indicate that the clapping is coming from below—or downhill from—the user). As such, position can be used to indicate what location, or coordinates, sound should be placed at for a visualization of a time segment of spatial audio.
Sound at a selected time segment of first order spatial audio has four channels of audio: W, X, Y, and Z. The W channel contains omnidirectional audio, meaning audio that is coming through from every direction. X, Y, and Z are the channels of audio along the x axis, y axis, and z axis—in other words, left/right, up/down, and forward/backward. To determine the three-dimensional position of sound at a selected segment of spatial audio includes identifying the x, y, and z position. To identify an accurate position along an axis, corresponding W channel audio from along the designated axis can be incorporated. In an embodiment, the omnidirectional component W can be incorporated into the axis component(s) by taking the root mean square error (RMSE) of the sum of W plus X/Y/Z and subtracting the RMSE of the sum of W minus X/Y/Z. Such exemplary equations can be designated as follows:
x=(RMSE(W+X))−(RMSE(W−X))
y=(RMSE(W+Y))−(RMSE(W−Y))
z=(RMSE(W+Z))−(RMSE(W−Z))
The above equations can be used to determine the three-dimensional position of sound at a time segment because the W component is omnidirectional and when the X/Y/Z is positive it means the sound is to the right/up/forward and, when negative, the sound is to the left/down/backward.
Using such equations is advantageous because it takes into account the directional components of sound. Sound in an audio wave form goes back and forth in direction (e.g., both a right/up/forward direction and a left/down/backward direction depending on the channel). Sound in an audio wave appears to have energy in two directions—the direction of the actual sound source and 180 degrees out of phase with the actual sound source. Equations, as above, allow for computation of the amount of energy in one direction and the amount of energy in the opposite direction. When the outcome of the computation is zero that means the source of the sound is in the center of that axis. When the sound is positive, the source of the sound is occurring in a corresponding positive direction. The distance, extent, or how far in that direction is based on how positive the computation is (e.g., positive ½ is slightly in positive direction, 1 is all the way in the positive direction). When sound is negative, the source of the sound is in corresponding negative direction. The distance, extent, or how far in that direction is based on how negative the computation is (e.g., negative ½ is slightly in negative direction, 1 is all the way in the negative direction). Examples of positive directions can be right for the x axis, up for they axis, and forward for the z axis. Examples of negative directions can be left for the x axis, down for the y axis, and backward for the z axis.
Upon determining x, y, z, the positions of sound can be combined to designate the three-dimensional position of sound from the time segment of spatial audio as a vector: v=<x, y, z>. When such a vector is not of unit length, the vector can be normalized to a position on the unit sphere that can be used to accurately visualize the position of sound using coordinates at a location. The normalized vector can then be used to display the sound for the spatial audio for the selected time segment at a location based on, for instance, the x, y, and z coordinates.
Intensity component 216 can be configured to determine the intensity of sound at a selected time segment of spatial audio. Intensity can indicate how much energy (e.g., noise) is occurring during a time segment of audio. In embodiments, intensity indicates how loud sound is for the spatial audio for the selected time segment. As such, intensity can be designated using decibels. Decibels can measure the intensity of a sound or the power level of an electrical signal by comparing it with a given level on a logarithmic scale. As such, intensity can be represented using opacity based on the determined decibels of sound.
To determine intensity for sound at a selected time segment of spatial audio, the W (omnidirectional) channel can be used. In embodiments, all of the sound occurring in a scene can be encoded in the W channel. When there are no spatial components of sound, the W channel can have all the audio and the x, y, z channels will be 0. When audio is 100% spatialized, for example, in the x direction, all the audio will be in the W and X audio channels. In this way, for such embodiments, the W channel can store all captured sound. As such, the intensity of sound for a time segment of spatial audio can be determined using the RMSE of the W channel which is then converted to decibels using a log function. Such an exemplary equation can be designated as follows:
I=20×log(RMSE(W))
Focus component 218 can be configured to determine the focus of sound at a selected time segment of spatial audio. Focus can indicate how concentrated sound is during the time segment. Focus can also be described as an extent to which the sound is spread. In embodiments, focus designates how diffuse the sound is within the environment. For example, when outside in a quiet place, a highly focused sound would occur when a microphone captures a tiny bell ringing. In another example, a large gong close to the microphone would generate audio on every point on surface so the gong would create a very large sound this is highly unfocused. However, if such a gong moves farther away from the microphone, the angular distance will be smaller so the gong sound would become more focused. In still a further example, when inside a room, room response such as walls causing echo and/or reverb are captured, in such an instance, a tiny bell would cause reflections of its sound off the walls creating a more diffuse sound at a microphone. As such, focus can be represented using size based on how the sound is focused.
To determine focus for sound at a selected time segment of spatial audio, v can be used—the vector comprised of the x, y, and z positions of sound. Prior to normalizing the vector, it has a length between 0 and 1. When the length is 0, x, y, and z are all 0 and there is no spatial aspect of the sound, the sound is only omnidirectional so v=<0, 0, 0>. When the length is 1, there is an extreme spatial component to the sound, not just that sound is coming from a certain direction but that the sound is entirely coming from that direction and there is not any audio in any other direction—a very focused spatial sound. When the length of a vector is one half, then the sound is in between an omnidirectional sound and a single directional sound, in other words, the sound is a unidirectional sound. As such, focus can be determined using the length of v, the x, y, z vector. An exemplary equation for determining focus can be designated as follows:
F=√{square root over (x2+y2+z2)}
Frequency component 220 can be configured to determine the frequency of sound at a selected time segment of spatial audio. Color can be assigned to sound based on the frequency of the sound, similar to the way that frequency of light has an associated color. As such, color can be used to indicate what frequency of sound occurs during the time segment of spatial audio. In this way, color can be used as a visual element to help users identify what object might be making a sound to aid in alignment of the sound with a visual component (e.g. an object making a base sound, such as a truck engine, versus an object making a high pitch sound, such as a fire alarm).
In embodiments, frequency can be determined by using a frequency spectrum of sound for a selected time segment of spatial audio. The frequency spectrum of sound can be analyzed using fast Fourier transform (FFT). FFT can be used to create bands around a frequency of a given sinusoid to designate frequency bins that are even in size, non-overlapping, and cover the whole spectrum. Using FFT, a frequency spectrum can be mapped to a RGB (Red-Green-Blue) color by defining three color matching functions. As such, a color matching function can be an array of weights per frequency bin in the FFT so that the computed spectrum is multiplied per-element by the color matching function to determine the intensity of that color.
In an embodiment, the defined minimum and maximum frequency are respectively, 0 Hz and 1 kHz with defined peak response frequencies for red at 125 Hz, green at 500 Hz, and blue at 875 Hz. The color matching functions for each color go linearly from 0 to the peak response and then back to 0. Such a frequency range is capable of being increased or decreased based on the attributes of sound to visualize from the spatial audio. The peak response frequencies can be adjusted based on the minimum and maximum frequency to ensure that the distribution of color is equally distributed across the entire range.
Assigning color to sound based on frequency creates a visual indication of the property of frequency of sound by using color(s) to indicate meaningful audio ranges (e.g., perceptually low-frequency sounds such as bass, engines, rumbling, etc. will be indicated using red, mid-frequency sounds can be indicated using yellow-green, and high-frequency sounds can be indicated using blue, and white noise such as a hiss, clap, crash, etc. can be indicated using white). Color can also indicate sounds that are “white noise” or include all frequencies—red, blue, and green. Such white noises can be designated using white, or all the colors overlapped.
Once at least one property at a time segment have been determined (e.g., location, intensity, focus and color), rendering manager 208 can be used to render a visualization of spatial audio utilizing the properties of the spatial audio. Such properties can be determined, for example, using visualization engine 212 of spatial audio processing manager 206. A time segment of the spatial audio can be rendered into a visualization using a particle or blob (e.g., when 10 ms time segments are used, a single rendered frame can result in the visualization of 10 distinct particles). Such a particle(s) can be generated by mapping the determined properties. A particle can be displayed at coordinates using the determined position where sound is estimated to be originating. Determined intensity can be used to indicate how much energy is associated with a particle. Such intensity can be indicated using opacity. Determined focus indicates the concentration of sound for a particle using size. Color can be used to indicate the frequency of the particle. Color can be displayed using RGB based on the frequency at the time segment indicated by the particle. In another embodiment, color can be displayed using up to three particles for each time segment where one particle indicates “blue,” one “green,” and another “red.” In such an embodiment, a single rendered frame can result in a visualization of up to 30 distinct particles, one red, one green, and one blue for each segment of time depending on the frequency composition of the segment of spatial audio. Determined properties for sound can be displayed in any number of ways, such as by displaying the properties using attributes of the rendered visualization.
With reference to
At block 302, spatial audio is received. Such spatial audio can be received from a database, such as data store 202 of
At block 304, the spatial audio can be filtered. Filtering can be performed using, for example, filtering engine 210 as depicted in
At block 306, spatial audio can be partitioned into time segments. Audio can be partitioned, for example, into chunks of time on the order of 10 ms. A frame of video typically has 30 frames per second, as such, using time segments of 10 ms allows for 10 visualizations of sound for each frame of video. In an embodiment, a visualization of sound can be displayed using, for example, a particle or blob. The properties associated with sound for each visualization can be used to assign particular attributes to such a particle.
At block 308, the position of sound can be determined for a time segment. Position can indicate where in a three-dimensional environment sound is coming from at a point in time. Such position can be determined by incorporating the omnidirectional component W into axis component(s) for a time segment of spatial audio to determine an x, y, z position. One manner of determining the position of sound for a time segment is by taking the RMSE of the sum of W plus X/Y/Z and subtracting the RMSE of the sum of W minus X/Y/Z. Using such an equation is advantageous because it takes into account the directional component of sound. Upon determining x, y, z, the positions of sound can be combined to designate the three-dimensional position of sound as v=<x,y,z>. When such a vector is not of unit length, the vector can be normalized to a position on the unit sphere. The normalized vector can then be used to indicate the position of the sound for the spatial audio for the selected time segment.
At block 310, the intensity of sound can be determined for the time segment. Intensity can indicate how much energy (e.g., noise) is occurring during a time segment of audio. As such, intensity can be designated using decibels as can measure the intensity of a sound or the power level of an electrical signal by comparing it with a given level on a logarithmic scale. The W (omnidirectional) channel can be used to determine intensity. As such, the intensity of sound for a time segment of spatial audio can be determined using the RMSE of the W channel which is then converted decibels using a log function.
At block 312, the focus of sound can be determined for the time segment. Focus can indicate how concentrated, spread out, or diffuse sound is within the environment at a time segment of spatial audio. To determine focus, the vector v comprised of the x, y, and z positions of sound can be used. Analyzing the value of the position along each axis, allows for determining how focused of a position sound is coming from.
At block 314, the color of sound can be determined for the time segment. Color can be assigned to sound based on the frequency of the sound. Color can be used as a visual element to help users identify what object in a video might be making a sound to aid in alignment of the sound with the video (e.g. an object making a base sound versus an object making a high pitch sound). Color can be determined by taking the frequency spectrum of sound for a selected time segment of spatial audio using FFT. The frequency spectrum can then be mapped to a RGB color using defined color matching functions for red, blue, and green. In an embodiment, the defined minimum and maximum frequency are respectively, 0 Hz and 1 kHz with defined peak response frequencies for red at 125 Hz, green at 500 Hz, and blue at 875 Hz. The color matching functions for each color go linearly from 0 to the peak response and then back to 0. Such a frequency range is capable of being increased or decreased based on the attributes of sound to visualize from the spatial audio.
Steps 308-314 can be repeated for additional time segment(s) of spatial audio based on the number of time segments partitioned at block 306. In other embodiments, some partitioned segments will not be used in the visualization process; instead specific frames can be identified as displaying important sounds. These identified frames can be processed for visualization.
At block 316, a visualization for spatial audio can be rendered using determined position, intensity, focus, and color. A time segment of the spatial audio can be rendered using a particle or blob. For instance, when 10 ms time segments are used, a single rendered frame can have 10 distinct particles, though it should be appreciated that such particles can overlap resulting in the appearance of less than 10 particles. Such a particle(s) can be generated by mapping the determined properties for time segments of spatial audio. A particle can be displayed at the position where sound is determined to be originating. Determined intensity can be used to indicate how much energy a particle has. Such intensity can be indicated using opacity. Determined focus indicates how concentrated sound for a particle. Color can be used to indicate the frequency of the particle. When all the properties for each time segments are rendered together, a visualization of the spatial audio is generated. This overall visualization can be played along with any related visual component. Depicting spatial audio using particles that indicate properties of sound can help with aligning the spatial audio with its related visual aspects. In some embodiments, it should be appreciated that instead of a time segment being displayed using a single particle, up to three particles can be used to represent each time segment such that the three particles indicate “blue,” “green,” and “red.” In such an embodiment, a single rendered frame can result in the visualization of up to 30 distinct particles.
With reference to
At block 402, a time segment of spatial audio can be selected to determine the position of sound for that time segment. The sound at a selected segment of first order spatial audio has four channels of audio: W, X, Y, and Z. The W channel is omnidirectional audio, meaning audio that is coming through from every direction. X, Y, and Z are the channels of audio along the x axis, y axis, and z axis—in other words, left/right, forward/backward, and up/down.
At block 404, the sound position can be determined for left/right, or in the x axis. An accurate position along an axis requires incorporating corresponding W channel audio along the x axis. The omnidirectional component W can be incorporated into the axis component(s) by taking the RMSE of the sum of W plus X and subtracting the RMSE of the sum of W minus X. Such an exemplary equation can be designated as: x=(RMSE(W+X))−(RMSE(W−X)). When x is positive it means the sound is to the right and when x is negative the sound is to the left. Upon determining x, the position of sound can be used to as the x axis position of a sound vector for the position of spatial audio at the time segment.
At block 406 the sound position can be determined for up/down, or in the y axis. An accurate position along an axis requires incorporating corresponding W channel audio along the y axis. The omnidirectional component W can be incorporated into the axis component(s) by taking the RMSE of the sum of W plus Y and subtracting the RMSE of the sum of W minus Y. Such an exemplary equation can be designated as: y=(RMSE(W+Y))−(RMSE(W−Y)). When y is positive it means the sound is to the right and when y is negative the sound is to the left. Upon determining y, the position of sound can be used to as the y axis position of a sound vector for the position of spatial audio at the time segment.
At block 408 the sound position can be determined for forward/backward, or in the z axis. An accurate position along an axis requires incorporating corresponding W channel audio along the z axis. The omnidirectional component W can be incorporated into the axis component(s) by taking the RMSE of the sum of W plus Z and subtracting the RMSE of the sum of W minus Z. Such an exemplary equation can be designated as: z=(RMSE(W−Z))−(RMSE(W+Z)). When z is positive it means the sound is to the right and when z is negative the sound is to the left. After determining z, the position of sound can be used to as the z axis position of a sound vector for the position of spatial audio at the time segment.
Upon determining the x, y, and z, at bock 410, these positions can be combined to determine the three-dimensional position of sound using a vector: v=<x,y,z>. When v is not of unit length, at block 412, the vector can be normalized to a position on the unit sphere. The normalized vector can then be used to indicate the position of the sound for the spatial audio for the selected time segment.
Blocks 402 through 412 can be repeated as necessary to determine the position of sound for additional time segments of spatial audio. When the position of sound for addition time segments of spatial audio are not needed, at block 414, the position of audio can be output for the time segment(s) that position was determined for the spatial audio.
With reference to
At block 502, a time segment of spatial audio can be selected to determine the intensity of sound for that time segment. The selected time segment can be, for example, the same time segment selected at block 402 to determine position.
At block 504, the W omnidirectional channel of sound can be extracted at the time segment of spatial audio. Because W is omnidirectional, the channel has audio that captured from every direction (e.g., x, y, and z). As such, the omnidirectional channel can be used to determine intensity of sound during a time segment of spatial audio. Using the omnidirectional channel can indicate how spatialized audio is based on how much sound is stored in the channel. For instance, when all of the sound occurring in a scene is encoded in the W channel, there is no spatial component of sound and the W channel will have all the audio with the x, y, and z channels having zero. When audio is 100% spatialized in the x direction, all the audio will be in the W and X audio channels.
To utilize the omnidirectional channel of audio to indicate focus, at block 506, the RMSE of the channel can be taken. At block 508, the RMSE can be converted to decibels. This conversion can use a log function such that I=20×log(RMSE(W)). Blocks 502 through 508 can be repeated as necessary for additional time segments of spatial audio to determine the intensity of sound for additional time segments. At block 510, the intensity of audio can be output for the time segment(s) that intensity was determined for the spatial audio.
With reference to
At block 602, a time segment of spatial audio can be selected to determine the intensity of sound for that time segment. The selected time segment can be, for example, the same time segment selected at block 402 to determine position and/or the same time segment selected at block 502 to determine intensity.
At block 604, the position of sound can be received for the selected time segment of spatial audio. The position of sound can be received from, for example, block 414 of
At block 606, the length of the sound vector for the position of sound can be analyzed. Prior to normalizing the sound vector, it will have a length between 0 and 1. When the length is 0, there is no spatial aspect of the sound, the sound is only omnidirectional so v=<0, 0, 0>. When the length is 1, there is an extreme spatial component to the sound, in that the sound is entirely coming from one direction and there is not any audio in any other direction.
Based on the analysis at block 606, at block 608, the focus of sound at the selected time segment can be determined. An equation that can be used for determining focus is F=√{square root over (x2+y2+z2)}, where x, y, and z are the x axis, y axis, and z axis position of sound at a time segment of spatial audio. Blocks 602 through 608 can be repeated as necessary for additional time segments of spatial audio to determine the focus of sound additional time segments. At block 610, the intensity of audio can be output for the time segment(s) that focus was determined for.
With reference to
At block 702, a time segment of spatial audio can be selected to determine the color of sound for that time segment. The selected time segment can be, for example, the same time segment selected at block 402 to determine position and/or the same time segment selected at block 502 to determine intensity and/or the same time segment selected at block 602 to determine focus.
At block 704, the frequency spectrum for sound at the time segment can be taken. Taking the frequency spectrum can be accomplished using fast Fourier transform (FFT). FFT can be used to create bands around a frequency to designate frequency bins that are even in size, non-overlapping, and cover the whole spectrum.
At block 706, the frequency spectrum can be mapped to a RGB color using color matching functions. The defined peak response frequency for red can be set at 125 Hz, green at 500 Hz, and blue at 875 Hz. The color matching functions for each color go linearly from 0 to the peak response and then back to 0. Such a frequency range is capable of being increased or decreased based on the attributes of sound to visualize from the spatial audio. The peak response frequencies can be adjusted based on the minimum and maximum frequency to ensure that the distribution of color across the spectrums are equal. At block 708, the color of sound for the time segment can be determined based on the color matching functions. Blocks 702 through 708 can be repeated as necessary for additional time segments of spatial audio to determine the color of sound additional time segments. At block 710, the color of audio can be output for the time segment(s) that color was determined for the spatial audio.
With reference to
At block 802, a time segment of spatial audio can be selected to render a visualization of sound for the spatial audio. The selected time segment can be, for example, the same time segment selected at block 402 to determine position and/or the same time segment selected at block 502 to determine intensity and/or the same time segment selected at block 602 to determine focus, and/or the same time segment selected at block 702 to determine color.
At block 804, the position, intensity, focus, and color for the selected time segment can be obtained. Properties such as position, intensity, focus, and color can be determined, for example, using visualization engine 212 of spatial audio visualization system 204, as illustrated in
At block 806, a visualization of a selected time segment for spatial audio can be rendered. In embodiments, a time segment of the spatial audio can be rendered into a visualization using a particle or blob. Such a particle(s) can be generated by mapping determined properties for the time segment using, for instance, the properties received at block 804. In the visualization rendering process, position can be utilized to display a particle at the position where sound is originating, determined intensity can be used to indicate how much energy the particle has using opacity, determined focus can be used to indicate how concentrated sound by utilizing the size of the particle, and color can be used to indicate the frequency of the particle by displaying a RGB color(s) based on the frequency at the time segment. In another embodiment, color can be displayed using up to three particles for each time segment where one particle indicates “blue,” one “green,” and another “red.” In such an embodiment, a single rendered frame can result in the visualization of up to 30 distinct particles.
Having described embodiments of the present invention,
Computing device 1100 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by computing device 1100 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVDs) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computing device 1100. Computer storage media does not comprise signals per se. Communication media typically embodies computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media, such as a wired network or direct-wired connection, and wireless media, such as acoustic, RF, infrared, and other wireless media. Combinations of any of the above should also be included within the scope of computer-readable media.
Memory 1112 includes computer storage media in the form of volatile and/or nonvolatile memory. As depicted, memory 1112 includes instructions 1124. Instructions 1124, when executed by processor(s) 1114 are configured to cause the computing device to perform any of the operations described herein, in reference to the above discussed figures, or to implement any program modules described herein. The memory may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid-state memory, hard drives, optical-disc drives, etc. Computing device 1100 includes one or more processors that read data from various entities such as memory 1112 or I/O components 1120. Presentation component(s) 1116 present data indications to a user or other device. Exemplary presentation components include a display device, speaker, printing component, vibrating component, etc.
I/O ports 1118 allow computing device 1100 to be logically coupled to other devices including I/O components 1120, some of which may be built in. Illustrative components include a microphone, joystick, game pad, satellite dish, scanner, printer, wireless device, etc. I/O components 1120 may provide a natural user interface (NUI) that processes air gestures, voice, or other physiological inputs generated by a user. In some instances, inputs may be transmitted to an appropriate network element for further processing. An NUI may implement any combination of speech recognition, touch and stylus recognition, facial recognition, biometric recognition, gesture recognition both on screen and adjacent to the screen, air gestures, head and eye tracking, and touch recognition associated with displays on computing device 1100. Computing device 1100 may be equipped with depth cameras, such as stereoscopic camera systems, infrared camera systems, RGB camera systems, and combinations of these, for gesture detection and recognition. Additionally, computing device 1100 may be equipped with accelerometers or gyroscopes that enable detection of motion. The output of the accelerometers or gyroscopes may be provided to the display of computing device 1100 to render immersive augmented reality or virtual reality.
Embodiments presented herein have been described in relation to particular embodiments which are intended in all respects to be illustrative rather than restrictive. Alternative embodiments will become apparent to those of ordinary skill in the art to which the present disclosure pertains without departing from its scope.
Various aspects of the illustrative embodiments have been described using terms commonly employed by those skilled in the art to convey the substance of their work to others skilled in the art. However, it will be apparent to those skilled in the art that alternate embodiments may be practiced with only some of the described aspects. For purposes of explanation, specific numbers, materials, and configurations are set forth in order to provide a thorough understanding of the illustrative embodiments. However, it will be apparent to one skilled in the art that alternate embodiments may be practiced without the specific details. In other instances, well-known features have been omitted or simplified in order not to obscure the illustrative embodiments.
Various operations have been described as multiple discrete operations, in turn, in a manner that is most helpful in understanding the illustrative embodiments; however, the order of description should not be construed as to imply that these operations are necessarily order dependent. In particular, these operations need not be performed in the order of presentation. Further, descriptions of operations as separate operations should not be construed as requiring that the operations be necessarily performed independently and/or by separate entities. Descriptions of entities and/or modules as separate modules should likewise not be construed as requiring that the modules be separate and/or perform separate operations. In various embodiments, illustrated and/or described operations, entities, data, and/or modules may be merged, broken into further sub-parts, and/or omitted.
The phrase “in one embodiment” or “in an embodiment” is used repeatedly. The phrase generally does not refer to the same embodiment; however, it may. The terms “comprising,” “having,” and “including” are synonymous, unless the context dictates otherwise. The phrase “A/B” means “A or B.” The phrase “A and/or B” means “(A), (B), or (A and B).” The phrase “at least one of A, B and C” means “(A), (B), (C), (A and B), (A and C), (B and C) or (A, B and C).”
This application is a continuation of and claims priority to, U.S. patent application Ser. No. 16,218,207 filed Dec. 12, 2018, entitled “PARTICLE-BASED SPATIAL AUDIO VISUALIZATION,” which is itself a continuation of U.S. patent application Ser. No. 15/814,254, filed on Nov. 15, 2017, entitled “PARTICLE-BASED SPATIAL AUDIO VISUALIZATION”, now issued as U.S. Pat. No. 10,165,388. The entire contents of each of the aforementioned applications are incorporated by reference herein in their entirety.
Number | Name | Date | Kind |
---|---|---|---|
8560303 | Beack | Oct 2013 | B2 |
9076457 | Orler | Jul 2015 | B1 |
9779093 | Jarvinen | Oct 2017 | B2 |
10165388 | DiVerdi | Dec 2018 | B1 |
10575119 | DiVerdi | Feb 2020 | B2 |
20060149558 | Kahn | Jul 2006 | A1 |
20070071413 | Takahashi | Mar 2007 | A1 |
20070223711 | Bai | Sep 2007 | A1 |
20080255688 | Castel | Oct 2008 | A1 |
20090182564 | Beack | Jul 2009 | A1 |
20130022206 | Thiergart | Jan 2013 | A1 |
20150124167 | Arrasvuori | May 2015 | A1 |
20160134988 | Gorzel | May 2016 | A1 |
20160302005 | Fedosov | Oct 2016 | A1 |
20190149941 | DiVerdi | May 2019 | A1 |
Number | Date | Country | |
---|---|---|---|
20200186957 A1 | Jun 2020 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 16218207 | Dec 2018 | US |
Child | 16790469 | US | |
Parent | 15814254 | Nov 2017 | US |
Child | 16218207 | US |