The present disclosure relates to an audio rendering system and method, and more specifically to a system and method for estimating acoustic information of an approximately rectangular parallelepiped room scene.
All sounds in the real world are spatial audio. Sound originates from the vibration of objects, propagates through media, and is heard by us. In the real world, a vibrating object can appear anywhere, and the vibrating object and a person's head form a three-dimensional direction vector. Since the human body receives sound through both ears, the horizontal angle of this direction vector affects the loudness difference, time difference, and phase difference of the sound reaching our two ears, while the vertical angle affects the frequency response of the sound reaching our ears. It is by relying on this physical information, together with a large amount of unconscious training acquired over time, that humans have gained the ability to determine the location of a sound source from binaural sound signals.
In some embodiments of the present disclosure, an audio rendering method is disclosed, comprising obtaining audio metadata, the audio metadata including acoustic environment information; setting parameters for audio rendering according to the acoustic environment information, the parameters for audio rendering including acoustic information of an approximately rectangular parallelepiped room scene; and rendering an audio signal according to the parameters for audio rendering.
In some embodiments, the rectangular parallelepiped room includes a cube room.
In some embodiments, rendering the audio signal according to the parameters for audio rendering includes: spatially encoding the audio signal based on the parameters for audio rendering, and spatially decoding the spatially encoded audio signal to obtain a decoded audio-rendered audio signal.
In some embodiments, the audio signal includes a spatial audio signal.
In some embodiments, the spatial audio signal includes at least one of: an object-based spatial audio signal, a scene-based spatial audio signal, and a channel-based spatial audio signal.
In some embodiments, the acoustic information of the approximately rectangular parallelepiped room scene includes at least one of: the size of the room, center coordinates of the room, orientation, and approximate acoustic properties of the wall material.
In some embodiments, the acoustic environment information includes a scene point cloud consisting of a plurality of scene points collected from a virtual scene.
In some embodiments, collecting a scene point cloud consisting of a plurality of scene points from a virtual scene includes setting, as the scene points, the N intersection points between the scene and N rays emitted in various directions with a listener as the origin.
In some embodiments, estimating the acoustic information of the approximately rectangular parallelepiped room scene according to scene point clouds collected from the virtual scene includes: determining a minimum bounding box according to the collected scene point clouds; and determining the estimated size and center coordinates of the rectangular parallelepiped room scene according to the minimum bounding box.
In some embodiments, determining the minimum bounding box includes determining the average position of the scene point clouds; converting position coordinates of the scene point clouds to the room coordinate system according to the average position; grouping the scene point clouds converted to the room coordinate system according to the scene point clouds and the average position of the scene point clouds, where each group of scene point clouds corresponds to one wall of a house; and, for each group, determining a separation distance between a wall corresponding to a grouped scene point cloud and the average position of the scene point clouds as the minimum bounding box.
In some embodiments, determining a separation distance between a wall corresponding to a grouped scene point cloud and the average position of the scene point clouds includes determining, for each scene point converted to the room coordinate system, a projection length obtained by projecting the point's distance from the coordinate origin onto the wall referred to by the group; and determining the maximum value of all projection lengths of the current group as the separation distance between the wall corresponding to the grouped scene point cloud and the average position.
In some embodiments, determining a separation distance between a wall corresponding to the grouped scene point cloud and the average position of the scene point clouds includes determining the separation distance when the group is not empty; and determining that the wall is missing when the group is empty.
In some embodiments, the acoustic information of the approximately rectangular parallelepiped room scene includes approximate acoustic information of the room wall material, and estimating the acoustic information of the approximately rectangular parallelepiped room scene according to scene point clouds collected from a virtual scene further includes: determining approximate acoustic properties of the material of the wall referred to by the group according to the average absorptance, average scattering rate, and average transmittance of all point clouds in the group.
In some embodiments, the acoustic information of the approximately rectangular parallelepiped room scene includes the orientation of a room, and estimating acoustic information of the approximately rectangular parallelepiped room scene according to scene point clouds collected from a virtual scene further includes: determining the orientation of the approximately rectangular parallelepiped room according to the average normal vector of all point clouds in the group and the angle with the normal vector of the wall referred to by the group.
In some embodiments, the method further comprises estimating acoustic information of an approximately rectangular parallelepiped room scene frame by frame according to scene point clouds collected from a virtual scene, including determining the minimum bounding box according to scene point clouds collected in the current frame and scene point clouds collected in previous frames; and determining the size and center coordinates of the rectangular parallelepiped room scene estimated in the current frame according to the minimum bounding box.
In some embodiments, the number of the previous frames is determined according to properties estimated from acoustic information of the approximately rectangular parallelepiped room scene.
In some embodiments, determining the minimum bounding box according to scene point clouds collected in the current frame and scene point clouds collected in previous frames includes determining the average position of the scene point clouds of the current frame; converting position coordinates of the scene point clouds to the room coordinate system according to the average position and the orientation of an approximately rectangular parallelepiped room estimated in the previous frame; grouping the scene point clouds converted to the room coordinate system according to the size of the approximately rectangular parallelepiped room estimated in the previous frame, where each group of scene point clouds corresponds to one wall of a house; for each group, determining a separation distance between a wall corresponding to a grouped scene point cloud and the average position of the scene point clouds; and from 1) the separation distance of the current frame and 2) the difference between separation distances of multiple previous frames and the product of the room orientation change and the average position change, determining the maximum value as the minimum bounding box of the current frame.
In some embodiments, the minimum bounding box is determined from the collected scene point clouds based on the following equation:
In some embodiments of the present disclosure, an audio rendering system is disclosed, comprising an audio metadata module configured to obtain acoustic environment information; wherein the audio metadata module is configured to set parameters for audio rendering according to the acoustic environment information, the parameters for audio rendering including acoustic information of an approximately rectangular parallelepiped room scene, the parameters for audio rendering being used to render an audio signal.
In some embodiments, the rectangular parallelepiped room includes a cube room.
In some embodiments, the system further includes a spatial encoding module configured to spatially encode the audio signal based on parameters for audio rendering; and a spatial decoding module configured to spatially decode the spatially encoded audio signal to obtain the decoded audio-rendered audio signal.
In some embodiments, the audio signal includes a spatial audio signal.
In some embodiments, the spatial audio signal includes at least one of: an object-based spatial audio signal, a scene-based spatial audio signal, and a channel-based spatial audio signal.
In some embodiments, the acoustic information of the approximately rectangular parallelepiped room scene includes at least one of: size, center coordinates, orientation, and approximate acoustic properties of the wall material.
In some embodiments, the acoustic environment information includes a scene point cloud consisting of a plurality of scene points collected from a virtual scene.
In some embodiments, collecting a scene point cloud consisting of a plurality of scene points from a virtual scene includes setting, as the scene points, the N intersection points between the scene and N rays emitted in various directions with a listener as the origin.
In some embodiments, estimating the acoustic information of the approximately rectangular parallelepiped room scene according to scene point clouds collected from the virtual scene includes: determining a minimum bounding box according to the collected scene point clouds; and determining the estimated size and center coordinates of the rectangular parallelepiped room scene according to the minimum bounding box.
In some embodiments, determining the minimum bounding box includes determining the average position of the scene point clouds; converting position coordinates of the scene point clouds to the room coordinate system according to the average position; grouping the scene point clouds converted to the room coordinate system according to the scene point clouds and the average position of the scene point clouds, where each group of scene point clouds corresponds to one wall of a house; and, for each group, determining a separation distance between a wall corresponding to a grouped scene point cloud and the average position of the scene point clouds as a minimum bounding box.
In some embodiments, determining a separation distance between a wall corresponding to a grouped scene point cloud and the average position of the scene point clouds includes: determining, for each scene point converted to the room coordinate system, a projection length obtained by projecting the point's distance from the coordinate origin onto the wall referred to by the group; and determining the maximum value of all projection lengths of the current group as the separation distance between the wall corresponding to the grouped scene point cloud and the average position.
In some embodiments, determining a separation distance between a wall corresponding to the grouped scene point cloud and the average position of the scene point clouds includes: determining the separation distance when the group is not empty; and determining that the wall is missing when the group is empty.
In some embodiments, the acoustic information of the approximately rectangular parallelepiped room scene includes approximate acoustic information of the room wall material, and estimating the acoustic information of the approximately rectangular parallelepiped room scene according to scene point clouds collected from a virtual scene further includes: determining approximate acoustic properties of the material of the wall referred to by the group according to the average absorptance, average scattering rate, and average transmittance of all point clouds in the group.
In some embodiments, the acoustic information of the approximately rectangular parallelepiped room scene includes the orientation of a room, and estimating acoustic information of the approximately rectangular parallelepiped room scene according to scene point clouds collected from a virtual scene further includes: determining the orientation of the approximately rectangular parallelepiped room according to the average normal vector of all point clouds in the group and the angle with the normal vector of the wall referred to by the group.
In some embodiments, the system further comprises estimating acoustic information of an approximately rectangular parallelepiped room scene frame by frame according to scene point clouds collected from a virtual scene, including determining a minimum bounding box according to scene point clouds collected in the current frame and scene point clouds collected in previous frames; and determining the size and center coordinates of the rectangular parallelepiped room scene estimated in the current frame according to the minimum bounding box.
In some embodiments, the number of the previous frames is determined according to properties estimated from acoustic information of an approximately rectangular parallelepiped room scene.
In some embodiments, determining a minimum bounding box according to scene point clouds collected in the current frame and scene point clouds collected in previous frames includes determining the average position of the scene point clouds of the current frame; converting position coordinates of the scene point clouds to the room coordinate system according to the average position and the orientation of an approximately rectangular parallelepiped room estimated in the previous frame; grouping the scene point clouds converted to the room coordinate system according to the size of the approximately rectangular parallelepiped room estimated in the previous frame, where each group of scene point clouds corresponds to one wall of a house; for each group, determining a separation distance between a wall corresponding to a grouped scene point cloud and the average position of the scene point clouds; and determining the maximum value as the minimum bounding box of the current frame from 1) the separation distance of the current frame and 2) the difference between separation distances of multiple previous frames and the product of the room orientation change and the average position change.
In some embodiments, the minimum bounding box is determined from the collected scene point clouds based on the following equation:
where mcd(w) represents the distance from each wall w to the current
In some embodiments of the present disclosure, a chip is disclosed, comprising: at least one processor and an interface, the interface being used to provide computer executable instructions to the at least one processor, the at least one processor being used to execute the computer executable instructions to implement the method as described above.
In some embodiments of the present disclosure, an electronic device is disclosed, comprising: a memory; and a processor coupled to the memory, the processor being configured to perform the method described above based on instructions stored in the memory.
In some embodiments of the present disclosure, a non-transitory computer-readable storage medium is disclosed, which has a computer program stored thereon, which, when executed by a processor, implements the method as described above.
In some embodiments of the present disclosure, a computer program product is disclosed, comprising instructions, which, when executed by a processor, cause the processor to perform the method as described above.
Other features and advantages of the present disclosure will become apparent from the following detailed description of exemplary embodiments of the present disclosure with reference to the accompanying drawings.
In an immersive virtual environment, in order to simulate as much as possible the various information that the real world gives to people, so as not to break users' immersion, we must also simulate with high quality the impact of sound-source position on the binaural signals we hear.
This effect, when the sound source position and the listener position are determined in a static environment, can be expressed by a head-related transfer function (HRTF). An HRTF is a two-channel FIR filter; by convolving an original signal with the HRTF at a specified position, we can obtain the signal we hear when the sound source is at that position.
However, one HRTF can only represent the relative positional relationship between one fixed sound source and one particular listener. When we need to render N sound sources, we theoretically need N HRTFs to perform 2N convolutions on the N original signals; and when the listener rotates, we need to update all N HRTFs to correctly render the virtual spatial audio scene. Doing so is very computationally intensive.
An alternative is to encode all sound sources into an ambisonics sound field and then render the ambisonics signal binaurally. In this way, the number of convolutions is related only to the number of ambisonics channels and is independent of the number of sound sources, and encoding sound sources to ambisonics is much faster than convolution. Moreover, if the listener rotates, all ambisonics channels can be rotated together, and the amount of calculation is likewise independent of the number of sound sources. In addition to being rendered to both ears, the ambisonics signal can also simply be rendered to a speaker array.
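By way of illustration only (not part of the claimed embodiments), the following Python sketch encodes mono sources into first-order ambisonics and mixes them, showing why the subsequent binaural decode cost does not grow with the number of sources; the ACN/SN3D convention and the helper name encode_foa are assumptions for this example.

```python
import numpy as np

def encode_foa(signal, azimuth, elevation):
    """Encode a mono signal into first-order ambisonics channels (ACN order W, Y, Z, X; SN3D)."""
    w = signal * 1.0
    y = signal * np.sin(azimuth) * np.cos(elevation)
    z = signal * np.sin(elevation)
    x = signal * np.cos(azimuth) * np.cos(elevation)
    return np.stack([w, y, z, x])  # shape: (4, num_samples)

# Mixing N sources costs N cheap per-sample encodes plus one sum; the binaural
# decode (one HRTF convolution per ambisonics channel) is independent of N.
sources = [(np.random.randn(48000), 0.3, 0.1),
           (np.random.randn(48000), -1.2, 0.0)]
mix = sum(encode_foa(sig, az, el) for sig, az, el in sources)
```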
On the other hand, in the real world, the sounds that we humans, and other animals, perceive are not only direct sounds that reach our ears straight from the sound source, but also sound waves from the source that reach our ears after reflection, scattering, and diffraction in the environment. Reflected and scattered sound from the environment directly affects our auditory perception of the sound source and of the environment around the listener. This kind of perception is the basic principle by which nocturnal animals such as bats can locate themselves in the dark and understand their environment.
We humans may not be as sensitive as bats in terms of hearing, but we can still gain a lot of information by listening to how the environment affects a sound source. Imagine the following scene: we are listening to a singer singing. We can clearly tell whether we are listening to the song in a large church or in a parking lot, because the reverberation time is different. Even within the church, we can clearly tell whether we are listening 1 meter directly in front of the singer or 20 meters directly in front of the singer, because the proportions of reverberation and direct sound are different. Still in the church, we can clearly tell whether the singer is at the center of the church or whether one of our ears is only 10 centimeters away from a wall, because the loudness of the early reflected sound is different.
Environmental acoustic phenomena are ubiquitous in reality, so in an immersive virtual environment, in order to simulate various information given to people by the real world as much as possible, so as not to break the users' immersion, we must also simulate with high quality the impact of a virtual scene on the sound in the scene.
There are three main categories of existing methods for simulating environmental acoustic phenomena: wave solvers based on finite element analysis, ray tracing, and simplified-geometry approximations of the environment.
This algorithm divides the space to be calculated into densely arranged cubes called “voxels” (similar to the concept of pixels, except that a pixel is an extremely small area unit on a two-dimensional plane, while a voxel is an extremely small volume unit in three-dimensional space). Project Acoustics from Microsoft uses this algorithmic idea. The basic process of the algorithm is as follows:
Meanwhile, this algorithm has the following shortcomings:
The core idea of this algorithm is to find as many sound propagation paths from a sound source to a listener as possible, so as to obtain the energy, direction, delay, and filtering properties that each path contributes. Such an algorithm is the core of the room acoustics simulation systems from Oculus and Wwise.
The algorithm for finding the propagation path from the sound source to the listener can be simply summarized in the following steps:
Finally, as long as we auralize the spatial impulse response of each sound source, we can simulate very realistic sound-source orientation and distance, as well as the characteristics of the sound source and of the environment where the listener is located. There are two methods for spatial impulse response auralization:
The environmental acoustics simulation algorithm based on ray tracing has the following advantages:
Meanwhile, such an algorithm also has the following disadvantages:
The idea of the last category of algorithm is, given the geometry and surface materials of the current scene, to find an approximate but much simpler geometry and surface material, thereby greatly reducing the amount of calculation required for environmental acoustic simulation. Such approaches are not very common; one example is the Resonance Audio engine from Google:
Such an algorithm has the following advantages:
However, such an algorithm, at least with currently disclosed methods, has the following disadvantages:
In conclusion, such an algorithm greatly sacrifices rendering quality in exchange for maximum rendering speed. One of its core problems is the overly rough simplification of the scene shape; this is exactly the problem that the present disclosure intends to solve.
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present disclosure. Obviously, the described embodiments are only some of the embodiments of the present disclosure, rather than all of the embodiments. The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application or uses. Based on the embodiments in this disclosure, all other embodiments obtained by those of ordinary skill in the art without making creative efforts fall within the scope of protection of this disclosure.
The relative arrangement of components and steps, numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the disclosure unless otherwise specifically stated. At the same time, it should be understood that, for convenience of description, the sizes of the various parts shown in the drawings are not drawn to actual scale. Techniques, methods and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, such techniques, methods and devices should be considered part of the specification. In all examples shown and discussed herein, any specific values are to be construed as illustrative only and not as limiting. Accordingly, other examples of the exemplary embodiments may have different values. It should be noted that similar reference numerals and letters refer to similar items in the attached drawings, so that once an item is defined in one drawing, it does not require further discussion in subsequent drawings.
In some embodiments of the present disclosure, a scene point cloud P consisting of a plurality of scene points is collected from a virtual scene. In some embodiments of the present disclosure, each point of the point cloud P contains the position of the point, a normal vector, and material information of the mesh where the point is located. In some embodiments of the present disclosure, the above scene point cloud can be formed by taking a listener as the origin, emitting N rays uniformly in all directions, and taking the N intersection points between the rays and the scene as the point cloud. In some embodiments of the present disclosure, the value of N is dynamically determined by comprehensively considering the stability, real-time performance, and total calculation amount of the room acoustic information estimation. The average position
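As an illustrative sketch of the point-cloud collection described above (not a definitive implementation), the following Python code casts N approximately uniform rays from the listener and keeps, for each hit, the position, normal, and material of the intersected mesh; the Fibonacci-sphere direction sampling, the scene.cast_ray query, and the field names are assumptions.

```python
import numpy as np

def collect_scene_points(listener_pos, scene, n_rays):
    """Cast n_rays in roughly uniform directions from the listener and keep the hit points."""
    points = []
    for i in range(n_rays):
        # Fibonacci-sphere directions give an approximately uniform angular distribution.
        z = 1.0 - 2.0 * (i + 0.5) / n_rays
        phi = i * np.pi * (3.0 - np.sqrt(5.0))
        r = np.sqrt(max(0.0, 1.0 - z * z))
        direction = np.array([r * np.cos(phi), r * np.sin(phi), z])
        hit = scene.cast_ray(listener_pos, direction)  # hypothetical scene query
        if hit is not None:
            # Each scene point keeps its position, surface normal, and the material
            # of the mesh it lies on, as described above.
            points.append({"position": hit.position,
                           "normal": hit.normal,
                           "material": hit.material})
    return points
```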
In some embodiments of the present disclosure, for each group of point clouds of the collected scene point cloud P, approximate acoustic properties of the material of the wall referred to by the group are calculated. If the group is not empty, the current material settings of the group are: the absorptance is set to the average absorptance of all points in the group; the scattering rate is set to the average scattering rate of all points in the group; and the transmittance is set to the average transmittance of all points in the group. If the group is empty, the current material settings of the group are: the absorptance is set to 100% absorption; the scattering rate is set to 0% scattering; and the transmittance is set to 100% transmission.
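A minimal sketch of the per-group material averaging described above; the attribute names on the material object (absorptance, scattering, transmittance) are assumptions for this example.

```python
def estimate_wall_material(group):
    """Average the acoustic properties of the points assigned to one wall group."""
    if not group:
        # Empty group: treat the wall as missing (fully absorbing, non-scattering,
        # fully transmitting), as described above.
        return {"absorptance": 1.0, "scattering": 0.0, "transmittance": 1.0}
    n = len(group)
    return {
        "absorptance": sum(p["material"].absorptance for p in group) / n,
        "scattering": sum(p["material"].scattering for p in group) / n,
        "transmittance": sum(p["material"].transmittance for p in group) / n,
    }
```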
In some embodiments of the present disclosure, an approximately rectangular parallelepiped room orientation is estimated for the collected scene point clouds P. For each group of point clouds, calculate the average normal vector
Embodiments of determining a minimum bounding box according to scene point clouds collected in the current frame and scene point clouds collected in previous frames will be described in detail below with reference to
First, in some embodiments of the present disclosure, initial conditions and variable definitions are determined. Please see the following for details:
The number of historical records used to estimate the distance from each wall to the center is h(w)=1, where w is a wall subscript that takes six integer values representing the six walls of the cuboid. For convenience of expression, it takes values from 0 to 5 herein. The correspondence between walls and subscripts is as follows:
An approximation process for dynamically estimating an approximately rectangular parallelepiped room scene is performed for each frame. One implementation of the approximation process for dynamically estimating an approximately rectangular parallelepiped room scene is described below.
In some embodiments of the present disclosure, as shown in
Calculate the average position
Convert position coordinates of the scene point clouds to the room coordinate system. In some embodiments of the present disclosure, the conversion is performed according to
Divide the point clouds converted to the room coordinate system into 6 groups according to the estimated size d of the room in the previous frame, each group corresponding to one wall/floor/ceiling. For each group of point clouds, calculate the distance wcd(w) from the wall referred to by the group to
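The following sketch illustrates one plausible reading of this grouping and separation-distance step, in which each room-space point is assigned to the wall face it lies proportionally closest to (using the previous frame's size d) and wcd(w) is taken as the largest projection in the group; the exact grouping rule and the axis/sign convention are assumptions.

```python
import numpy as np

# One assumed wall-index convention: (axis, outward sign) per face of the cuboid.
WALL_AXES = [(0, +1), (0, -1), (1, +1), (1, -1), (2, +1), (2, -1)]

def group_points_and_distances(points_room, prev_size):
    """Assign each room-space point to a wall group, then take the largest
    projection per group as that wall's separation distance wcd(w)."""
    groups = [[] for _ in range(6)]
    for p in points_room:
        # Pick the wall whose face the point is proportionally closest to,
        # using the size d estimated in the previous frame as the reference box.
        ratios = [p[axis] * sign / max(prev_size[axis] * 0.5, 1e-6)
                  for axis, sign in WALL_AXES]
        groups[int(np.argmax(ratios))].append(p)
    wcd = np.zeros(6)
    for w, (axis, sign) in enumerate(WALL_AXES):
        if groups[w]:
            # Maximum projection of the group's points onto the wall's outward
            # axis, measured from the room-space origin (the average position).
            wcd[w] = max(p[axis] * sign for p in groups[w])
    return groups, wcd
```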
According to the
represents the maximum value from frame t=0 to frame t=h(w)−1; while rot(0)*rot(−t)^(−1) represents the change in room orientation between the current frame (i.e., t=0) and the past t-th frame;
In some embodiments of the present disclosure,
Wherein, rot(t) is a quaternion queue with length hmax, which stores the rectangular parallelepiped room orientation information estimated in the past hmax frames.
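The per-wall history fusion can be sketched as follows. Since the exact equation is not reproduced here, the compensation term (rotating the average-position change by the orientation change rot(0)*rot(−t)^(−1) and projecting it onto the wall axis) is only one plausible form consistent with the description above; SciPy rotations are used for illustration, and the argument names are assumptions.

```python
from scipy.spatial.transform import Rotation  # rot_hist entries assumed to be Rotation objects

def min_bounding_box_distance(w, wcd_hist, rot_hist, avg_pos_hist, h_w, axis, sign):
    """Fuse the last h(w) frames for wall w: take the maximum of the past
    separation distances after compensating for how the estimated room
    orientation and the average point-cloud position have moved since then
    (an assumed form of the elided equation)."""
    candidates = []
    for t in range(h_w):
        # Orientation change between the current frame (t = 0) and the past t-th frame.
        delta_rot = rot_hist[0] * rot_hist[t].inv()
        # Average-position change, rotated into the current room frame and
        # projected onto this wall's outward axis.
        shift = delta_rot.apply(avg_pos_hist[0] - avg_pos_hist[t])[axis] * sign
        candidates.append(wcd_hist[t][w] - shift)
    # For t = 0 the shift is zero, so the current frame's wcd(w) is always a candidate.
    return max(candidates)
```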
According to the minimum bounding box, the size d and the room center coordinates c of the rectangular parallelepiped room scene estimated in the current frame are determined.
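A hedged sketch of deriving the size d and center c from the per-wall distances mcd(w); the wall-index pairing and the conversion back to world coordinates are assumptions, since the disclosure does not spell out this arithmetic.

```python
import numpy as np

def room_from_bounding_box(mcd, avg_pos, rot):
    """Derive room size d and center c from the six per-wall distances mcd(w).
    Assumes walls 2*axis and 2*axis+1 are the positive and negative faces of each axis."""
    d = np.zeros(3)
    offset = np.zeros(3)
    for axis in range(3):
        pos_w, neg_w = 2 * axis, 2 * axis + 1   # assumed index pairing
        d[axis] = mcd[pos_w] + mcd[neg_w]
        # Center offset along this axis, expressed in room coordinates.
        offset[axis] = (mcd[pos_w] - mcd[neg_w]) / 2.0
    # Convert the room-space offset back to world coordinates around the average position.
    c = avg_pos + rot.apply(offset)
    return d, c
```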
Although this disclosure describes dynamically estimating acoustic information of an approximately rectangular parallelepiped room scene according to the current frame and multiple previous frames in conjunction with
In some embodiments of the present disclosure, unlike the unrealistic situation in the related art where the listener and the estimated virtual room are assumed to always be at the same location, the present disclosure does not bind the listener to the estimated virtual room, but assumes that the listener can move freely in the scene. Since the location of the listener may be different in each frame, when N rays are emitted uniformly toward the surrounding walls with the listener as the origin in different frames, the number and density of intersections of the N rays with the surrounding walls (i.e., walls, floors, and ceilings) may not be the same at each wall. For example, when the listener is close to a certain wall, the rays emitted from the listener will have more intersections with that adjacent wall, while intersections with other walls will decrease accordingly depending on the distance between each wall and the listener. Therefore, when estimating the room acoustic information of an approximately rectangular parallelepiped room scene (for example, the size of the room, the orientation, and the average position of the scene point clouds), the weight of the adjacent wall will be greater, and this wall with a larger weight will play a more decisive role in the subsequent calculation of the size of the room, the orientation of the room, and the average position of the scene point clouds. For example, the average position of the scene point clouds will be closer to the wall with the larger weight. In this way, in different frames, due to possible differences in the listener's location, the estimated size of the room, orientation of the room, and average position of the scene point clouds will also differ. Therefore, in order to reduce the impact caused by different listener locations in different frames, when calculating the minimum bounding box of the current frame, the maximum value is determined as the minimum bounding box of the current frame from 1) the separation distance wcd(w) of the current frame and 2) the difference between the wcd(w) of multiple previous frames and the product of the room orientation change and the average position change; that is, by subtracting the product of the room orientation change and the average position change, the impact of different listener locations in different frames is avoided as much as possible. According to the determined minimum bounding box, the size of the room and the coordinates of the room center of the current frame are further determined.
In some embodiments of the present disclosure, the minimum bounding box is determined according to scene point clouds collected in the current frame and scene point clouds collected in multiple past frames, and at the same time, changes in the room orientation and in the average position of the scene point clouds caused by different listener locations between the current frame and each of the past frames are also considered, so as to avoid as much as possible differences in the estimated room acoustic information (for example, the room orientation and the average position of the scene point clouds) caused by different listener locations in different frames, thereby reducing as much as possible the impact of different listener locations on the estimation of the room acoustic information, while at the same time being able to adapt to dynamically changing scenes (opening doors, material changes, the roof being blown off, etc.). In some embodiments of the present disclosure, the number of reused past frames is dynamically determined by comprehensively considering the characteristics of the room acoustic information estimation, such as stability and real-time performance, so that while reliable estimation data are obtained, transient changes in the scene (for example, a door opening, a material change, the roof being blown off, etc.) can also be reflected in a timely and effective manner; for example, a larger number of previous frames is used to ensure the stability of the estimation, while a smaller number of previous frames is used to ensure the real-time performance of the estimation.
For each group of point clouds, the approximate acoustic properties of the material of the wall referred to by the group are calculated. In some embodiments of the present disclosure, if the group is not empty, the current material settings of the group are: the absorptance is set to the average absorptance of all points in the group; the scattering rate is set to the average scattering rate of all points in the group; and the transmittance is set to the average transmittance of all points in the group. If the group is empty, the current material settings of the group are: the absorptance is set to 100% absorption; the scattering rate is set to 0% scattering; and the transmittance is set to 100% transmission.
Estimate the orientation of an approximately rectangular parallelepiped room. In some embodiments of the present disclosure, the average normal vector
The global horizontal angle and pitch angle (θ, φ) are converted to quaternion representations rot, which is written to a queue rot(t) with length hmax.
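For illustration, the conversion of (θ, φ) to a quaternion and its storage in the fixed-length queue rot(t) might look as follows; the Euler-angle convention, the history length, and the queue ordering (index 0 being the current frame) are assumptions.

```python
from collections import deque
from scipy.spatial.transform import Rotation

H_MAX = 8  # illustrative history length; the disclosure leaves hmax configurable
rot_queue = deque(maxlen=H_MAX)  # rot(t): index 0 is the current frame

def push_orientation(theta, phi):
    """Convert the estimated horizontal angle theta and pitch angle phi to a
    quaternion and store it in the fixed-length orientation history rot(t)."""
    rot = Rotation.from_euler("zy", [theta, phi])  # yaw about z, pitch about y (assumed convention)
    rot_queue.appendleft(rot)
    return rot
```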
At this point, the approximation estimation process for each frame ends.
This disclosure progressively estimates an approximately rectangular parallelepiped model of a room in real time; estimates the room orientation through the normal vectors of the scene point clouds; and, by reusing the calculation results of the previous hmax frames, greatly reduces the number of scene sampling points required for each frame (i.e., the number N of rays emitted in all directions with the listener as the origin), thereby speeding up the per-frame calculation of the algorithm. By continuously running the approximation estimation process for each frame, the disclosed algorithm can estimate an increasingly accurate approximately rectangular parallelepiped room model, and is thereby able to quickly render scene reflections and reverberation. The present disclosure can estimate the approximately rectangular parallelepiped model of the scene where the listener is located in real time, and obtain the position, size, and orientation of the model. This disclosure enables a room acoustics simulation algorithm based on approximately rectangular parallelepiped model estimation to maintain its extremely high computational efficiency compared to other algorithms (wave physics simulation, ray tracing) without sacrificing interactivity, requiring no pre-rendering, and supporting variable scenes. This algorithm can run at a much lower frequency than other audio and rendering threads, without affecting the update speed of the sense of direction of the direct sound and early reflected sound.
As shown in
Generally, the following apparatus may be connected to the I/O interface 905: an input apparatus 906 including, for example, a touch screen, a touch pad, a keyboard, a mouse, an image sensor, a microphone, an accelerometer, a gyroscope, etc.; an output apparatus 907 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; a storage apparatus 908 including, for example, a tape, a hard disk, etc.; and a communication apparatus 909. The communication apparatus 909 may allow the electronic device to communicate wirelessly or by wire with other devices to exchange data. Although
According to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product including a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart. In such embodiments, the computer program may be downloaded and installed from the network via the communication apparatus 909, or installed from the storage apparatus 908, or installed from the ROM 902. When the computer program is executed by the processing apparatus 901, the above functions defined in the method of the embodiment of the present disclosure are performed.
In some embodiments, there is also provided a chip, comprising: at least one processor and an interface, the interface being used to provide computer executable instructions to the at least one processor, and the at least one processor being used to execute the computer executable instructions to implement the reverberation duration estimation method or the audio signal rendering method in any of the above embodiments.
In some embodiments, the arithmetic circuit 1003 internally includes multiple processing engines (PEs). In some embodiments, the arithmetic circuit 1003 is a two-dimensional systolic array. The arithmetic circuit 1003 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some embodiments, the arithmetic circuit 1003 is a general-purpose matrix processor.
For example, assume there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit obtains the data corresponding to matrix B from the weight memory 1002 and caches it on each PE in the arithmetic circuit. The arithmetic circuit takes the data of matrix A from the input memory 1001, performs a matrix operation with matrix B, obtains a partial or final result of the matrix, and stores the result in an accumulator 1008.
A vector calculation unit 1007 can further process the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc.
In some embodiments, the vector calculation unit 1007 can store the processed output vector to a unified memory 1006. For example, the vector calculation unit 1007 may apply a nonlinear function to the output of the arithmetic circuit 1003, such as a vector of accumulated values, to generate activation values. In some embodiments, the vector calculation unit 1007 generates normalized values, merged values, or both. In some embodiments, the processed output vector can be used as an activation input to the arithmetic circuit 1003, for example for use in a subsequent layer in a neural network.
The unified memory 1006 is used to store input data and output data.
A Direct Memory Access Controller 1005 (DMAC) transfers input data in an external memory to the input memory 1001 and/or the unified memory 1006, stores weight data in the external memory into the weight memory 1002, and stores the data in the unified memory 1006 into the external memory.
A Bus Interface Unit (BIU) 1010 is used to realize interaction between the main CPU, DMAC and an instruction fetch memory 1009 via a bus.
The instruction fetch memory 1009 connected to the controller 1004 is used to store instructions used by the controller 1004.
The controller 1004 is used to call the instructions cached in the instruction fetch memory 1009 to control the working process of the computing accelerator.
Generally, the unified memory 1006, the input memory 1001, the weight memory 1002 and the instruction fetch memory 1009 are all On-Chip memories, and the external memory is a memory external to the NPU. The external memory can be a Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), High Bandwidth Memory (HBM) or other readable and writable memory.
In some embodiments, there is also provided a computer program, comprising instructions which, when executed by a processor, cause the processor to perform the audio rendering method of any of the above embodiments, in particular any processing in the audio signal rendering process.
Those skilled in the art will appreciate that the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. When implemented using software, the above embodiments may be implemented in whole or in part in the form of a computer program product. A computer program product includes one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded and executed on a computer, the processes or functions according to the embodiments of the present disclosure are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable apparatus. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk memory, CD-ROM, optical memory, etc.) having computer-usable program code embodied therein.
Although some specific embodiments of the present disclosure have been described in detail through examples, those skilled in the art will understand that the above examples are for illustration only and are not intended to limit the scope of the disclosure. It should be understood by those skilled in the art that the above embodiments may be modified without departing from the scope and spirit of the present disclosure. The scope of the disclosure is defined by the appended claims.
This application is a continuation of International Application No. PCT/CN2022/122635, filed on Sep. 29, 2022, which claims priority to International Application No. PCT/CN2021/121718, filed on Sep. 29, 2021. The entire contents of these applications are hereby incorporated by reference in their entireties.