The disclosure relates to the field of audio processing, and more particularly, to a method and an apparatus for listening scene construction and a storage medium.
Music is an art that reflects human emotions in real life, which can cultivate people's sentiment, stimulate their imagination, and enrich their spiritual life. With the popularity of electronic devices, people can use various playing devices to play music. In order to improve the listening experience of a user, various sound effect elements are built into a playing device for the user to choose from, so that sound effect elements can be artificially added to the music to achieve a special playing effect when the music is played by the user. For example, when the playing device plays Daoxiang of Jay Chou, a pastoral sound effect element can be selected by the user and added to the song to play together with the song. However, an added sound effect element played by the playing device is simply mixed into the original music, and the sound effect element is fixed, such that it is difficult for the user to feel the artistic conception constructed by the sound effect element, thereby affecting the user's sense of realism and immersion when listening to music.
Therefore, how to use sound effect elements to construct a more realistic listening scene when the user listens to music is a problem studied by those skilled in the art.
According to a first aspect, a method for listening scene construction is provided in implementations of the disclosure. The method includes the following. Target audio is determined, where the target audio is used to characterize a sound feature in a target scene. A position of a sound source of the target audio is determined. Dual-channel audio of the target audio is obtained by performing audio-visual modulation on the target audio according to the position of the sound source, where the dual-channel audio of the target audio during simultaneous output is able to produce an effect that the target audio is from the position of the sound source. The dual-channel audio of the target audio is rendered into target music to produce an effect that the target music is played in the target scene.
According to a second aspect, an apparatus for listening scene construction is provided in implementations of the disclosure. The apparatus includes a memory configured to store computer programs and a processor configured to invoke the computer programs to: determine target audio, where the target audio is used to characterize a sound feature in a target scene, determine a position of a sound source of the target audio, obtain dual-channel audio of the target audio by performing audio-visual modulation on the target audio according to the position of the sound source, where the dual-channel audio of the target audio during simultaneous output is able to produce an effect that the target audio is from the position of the sound source, and render the dual-channel audio of the target audio into target music to produce an effect that the target music is played in the target scene.
According to a third aspect, a non-volatile computer storage medium is provided in implementations of the disclosure. The computer storage medium includes computer programs which, when running on an electronic device, are operable with the electronic device to perform the method provided in the first aspect.
In order to describe more clearly technical solutions in implementations of the disclosure or the related art, the following will give a brief introduction to the accompanying drawings required in implementations of the disclosure or in the background.
The following will describe clearly and completely technical solutions in implementations of the disclosure with reference to the accompanying drawings.
In implementations of the disclosure, a method is provided, which can improve a sense of presence and immersion of a user when listening to music. In implementations of the disclosure, a sound effect element that can characterize a listening scene is mixed into the music when the user is listening to the music. When the audio of the sound effect element is mixed into the music, audio-visual modulation is performed on the audio of the sound effect element according to a position of a sound source, such that the sound effect element, when reaching the ears, seems to come from the position of the sound source, thereby improving the user's sense of presence and immersion when listening to music.
Referring to
The audio 101 of the sound effect element may be audio of a sound effect element matched according to a type or lyrics of the original music 104, or audio of a sound effect element determined by receiving a selection operation of a user. The audio of the sound effect element may characterize features of some scenes. For example, the sound of a mountain-forest scene can be characterized by a sound of birds chirping or a sound of leaves rustling.
The left channel audio 102 and the right channel audio 103 are obtained by performing the audio-visual modulation on the audio 101 of the sound effect element. A position of a sound source in the audio of the sound effect element needs to be determined first before performing the audio-visual modulation, because the audio may need a fixed sound source or a sound source with a certain moving track. For example, relative to a listener, a sound of leaves in a scene may come from a fixed position, but a sound of birds may come from far to near or from left to right. Therefore, a position of the sound source at each of multiple time nodes needs to be determined according to a preset time interval. A position of a sound source in space can be represented by a three-dimensional (3D) coordinate, e.g., a coordinate of [azimuth, elevation, distance]. Processing such as frame division or windowing is performed on the audio of the sound effect element after the position of the sound source at each of the multiple time nodes is determined. Then a head-related transfer function (HRTF) from the position of the sound source in an audio frame to the left ear and the right ear is determined, and the left channel audio 102 and the right channel audio 103 are obtained by convolving the HRTF from the position of the sound source to the left ear and the right ear respectively with the audio frame, i.e., the HRTF from the position of the sound source to the left ear and the right ear respectively is convolved with the single-channel audio to form binaural audio. When the left channel audio 102 and the right channel audio 103 are simultaneously played in the left ear and the right ear respectively, the listener may feel an effect that the sound effect element comes from the position of the sound source.
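As an illustrative sketch (not part of the claimed implementations), the binaural synthesis described above can be expressed as convolving single-channel audio with a pair of head-related impulse responses; the 4-tap responses below are hypothetical placeholders, whereas a real system would look them up by [azimuth, elevation, distance]:

```python
import numpy as np

def render_binaural(mono, hrir_left, hrir_right):
    """Convolve single-channel audio with left/right head-related impulse
    responses to form the dual-channel (binaural) audio."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return left, right

# Toy example: a short mono frame and placeholder 4-tap responses.
mono = np.array([1.0, 0.5, -0.5, -1.0])
hrir_l = np.array([0.9, 0.1, 0.0, 0.0])   # hypothetical left-ear response
hrir_r = np.array([0.4, 0.3, 0.2, 0.1])   # hypothetical right-ear response
left, right = render_binaural(mono, hrir_l, hrir_r)
```

Playing `left` and `right` simultaneously in the two ears would give the single-channel frame the spatial cue encoded in the two responses.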
Optionally, the audio 101 of the sound effect element may be an audio file that can characterize a scene, such as a sound of waves, a sound of leaves, a sound of running water, or the like, and may be stored in an audio format such as windows media audio (WMA) format, moving picture experts group audio layer III (MP3), or the like. The audio of the sound effect element is referred to as target audio below.
The original music 104 is an audio file that can be played. The original music during playing can be mixed with the left channel audio 102 and the right channel audio 103 of the sound effect element, and the mixed music can be played in the left ear and the right ear, so that when the mixed music is played with a playing device, in addition to listening to the original music 104, the user may also feel a special scene element lingering around the ears and feel as if he or she were really in a listening scene 106.
Optionally, the original music 104 may be an audio file in various formats such as WMA format, MP3, or the like, which can be played with a playing device such as an earphone, or the like, and the original music is referred to as target music below. Optionally, the electronic device can also serve as the playing device and be used to play the mixed music. In this case, the playing device is a playing module integrated into the electronic device, and the electronic device may be a device such as a smart earphone with a calculation capability. Optionally, the electronic device can transmit the mixed music to the playing device through a wired interface, a wireless interface (e.g., a wireless fidelity (WiFi) interface, a Bluetooth interface), or other manners, and the playing device is used to play the mixed music. In this case, the electronic device may be a server (or a server cluster), a computer host, or other electronic devices, and the playing device may be a device such as a Bluetooth earphone, a wired earphone, or the like.
That is, the listener may feel a unique virtual listening environment, for example, by adding some special sound effect parts or rendering sound effects in the listening scene 106. Common listening scenes mainly include seaside, window, suburb, and the like, which can be created by adding some sound effect elements.
Referring to
At S201, an electronic device determines target audio.
Specifically, the electronic device may be a device with a computation capability, such as a phone, a computer, or the like, the target audio is audio of a sound effect element mixed into target music, and the target music may be a music file such as a song, a tape, or the like. The electronic device can determine the target audio in the following optional manners.
In manner 1, the target audio is determined according to type information of the target music. The electronic device can pre-store the type information of the target music or a label of the type information of the target music, or can obtain the type information of the target music or the label of the type information through a wired interface, a wireless interface, or other manners. The electronic device matches a sound effect element according to the type information of the target music or the label of the type information of the target music, and determines the target audio according to a matching parameter of the sound effect element. Optionally, one song may have multiple types or labels. For higher relevance between the target audio and the target music, a first matching threshold can be preset when the sound effect element is matched. Specifically, the electronic device obtains matching parameters of one or more sound effect elements by matching the one or more sound effect elements according to the type information of the target music or the label of the type information, and determines audio of one or more sound effect elements with matching parameters higher than the first matching threshold as the target audio. Optionally, before a vocal part of the song occurs and after the vocal part ends (i.e., when the song only has an accompaniment), the target audio is determined in manner 1.
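The matching in manner 1 can be sketched as follows; the catalog of sound effect elements, their type labels, and the overlap-ratio scoring are all assumptions for illustration, since the disclosure does not fix a particular matching algorithm:

```python
# Hypothetical catalog mapping sound effect elements to type labels.
EFFECT_CATALOG = {
    "waves":   {"labels": {"pop", "summer"}},
    "birds":   {"labels": {"folk", "pastoral"}},
    "insects": {"labels": {"pastoral", "night"}},
}

def match_effects(music_labels, threshold):
    """Score each sound effect element against the music's type labels and
    keep those whose matching parameter exceeds the first matching threshold."""
    matched = []
    for name, info in EFFECT_CATALOG.items():
        overlap = music_labels & info["labels"]
        score = 100.0 * len(overlap) / len(info["labels"])  # simple overlap ratio
        if score > threshold:
            matched.append((name, score))
    return sorted(matched, key=lambda x: -x[1])

# A song labeled both "pastoral" and "folk" matches two elements here.
result = match_effects({"pastoral", "folk"}, threshold=40.0)
```

The audio of every element returned (or of the top-N elements, if a count limit is preset) would then be determined as the target audio.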
In case 1, referring to
In manner 2, the target audio is determined according to whole lyrics of the target music. The electronic device can pre-store the whole lyrics of the target music, or can obtain the whole lyrics of the target music through a wired interface, a wireless interface, or other manners. The electronic device obtains a matching parameter of a sound effect element by matching the sound effect element according to the whole lyrics, and determines the target audio according to the matching parameter of the sound effect element. For higher relevance between the target audio and the target music, a second matching threshold can be preset when the sound effect element is matched. Specifically, the electronic device can obtain matching parameters of one or more sound effect elements by matching the one or more sound effect elements with the whole lyrics of the target music according to a text matching algorithm, and determine audio of one or more sound effect elements with matching parameters higher than the second matching threshold as the target audio. The second matching threshold may or may not be equal to the first matching threshold, which is not limited herein. Optionally, before the vocal part of the song occurs and after the vocal part ends (i.e., when the song only has the accompaniment), the target audio is determined in manner 2.
In case 2, the electronic device pre-stores whole lyrics of Daoxiang, and matches multiple sound effect elements according to the whole lyrics of Daoxiang when determining the target audio. If the electronic device presets the second matching threshold as 76.0, audio of a sound effect element with a matching parameter higher than 76.0 can be determined as the target audio. Optionally, the electronic device can preset the number of selected sound effect elements in order to control the number of selected sound effect elements, e.g., the number of selected sound effect elements is preset as 3, which indicates that among sound effect elements with matching parameters higher than 76.0, audio of sound effect elements with top three matching parameters are determined as the target audio.
In manner 3, the target audio is determined according to a lyric content of the target music, where the lyric content of the target music is a word, a term, a short sentence, a sentence, or other specific contents of lyrics. The electronic device can pre-store the lyric content of the target music, or can obtain the lyric content of the target music through a wired interface, a wireless interface, or other manners. The electronic device obtains a matching parameter of a sound effect element by matching the sound effect element according to the lyric content, and determines the target audio according to the matching parameter of the sound effect element. For higher relevance between the target audio and the target music, a third matching threshold can be preset when the sound effect element is matched. Specifically, the electronic device can segment the lyrics into specific contents such as a word, a term, or a short sentence according to a word segmentation algorithm, obtain matching parameters of one or more sound effect elements by matching the one or more sound effect elements with the lyric content of the target music according to the text matching algorithm, and determine audio of one or more sound effect elements with matching parameters higher than the third matching threshold as the target audio. The third matching threshold may or may not be equal to the first matching threshold and the second matching threshold, which is not limited herein. Optionally, at a vocal singing stage of the target music (i.e., after the vocal part occurs and before the vocal part ends), the target audio is determined in manner 3.
In case 3, referring to
In manner 4, the electronic device determines the target audio by providing the user with multiple options of audio of sound effect elements to select from and receiving a selection operation of the user for the target audio. Specifically, the electronic device contains an information input device such as a touch screen to receive an input operation of the user, and determines audio indicated by the input operation as the target audio.
In case 4, referring to
At S202, the electronic device converts a sampling rate of the target audio to a sampling rate of the target music on condition that the sampling rate of the target audio is different from the sampling rate of the target music.
Specifically, after the target audio is determined, the target audio may sound abrupt when mixed into the target music if the sampling rate of the target audio is different from the sampling rate of the target music. Therefore, the sampling rate of the target audio needs to be converted to the sampling rate of the target music, such that the sound effect element may sound more natural when mixed. For example, if the sampling rate of the target audio is 44100 hertz (Hz) and the sampling rate of the target music is 48000 Hz, the sampling rate of the target audio can be converted to 48000 Hz, such that the target audio may sound more natural when mixed. Optionally, the step of converting the sampling rate of the target audio may be skipped. However, if the sampling rate of the target audio is different from the sampling rate of the target music and is not converted, the target audio sounds more abrupt when mixed into the target music, and a scene effect produced by the target audio may also be less suitable for the target music.
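The sampling rate conversion at S202 can be sketched as follows; the linear-interpolation resampler is a deliberate simplification (a production system would typically use a polyphase or windowed-sinc resampler with anti-aliasing filtering):

```python
import numpy as np

def resample_linear(audio, sr_in, sr_out):
    """Naive linear-interpolation resampler: map output sample times back
    onto the input timeline and interpolate between neighboring samples."""
    n_out = int(round(len(audio) * sr_out / sr_in))
    t_in = np.arange(len(audio)) / sr_in
    t_out = np.arange(n_out) / sr_out
    return np.interp(t_out, t_in, audio)

# One second of audio at 44100 Hz converted to the music's 48000 Hz rate.
one_second = np.ones(44100)
converted = resample_linear(one_second, 44100, 48000)
```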
At S203, the electronic device determines a position of a sound source of the target audio.
Specifically, a position of any one sound source in space can be a position parameter of the sound source and represented by a 3D coordinate. For example, relative to the listener, the position of the sound source can be represented by the 3D coordinate of [azimuth, elevation, distance]. In different scenes, the position of the sound source may be fixed or changing, e.g., a position of a sound source of a sound of insects chirping or the like may be fixed, and a position of a sound source of a sound of waves, a sound of wind, or the like may need to change continuously. For example, before the vocal part begins, i.e., at a beginning of the music, the target audio needs to come from far to near, which produces an effect of the music floating slowly. The position of the sound source can be determined through the following optional methods.
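The [azimuth, elevation, distance] representation and a changing sound source can be sketched as follows; the far-to-near track generator is a hypothetical example, with the 0.5 m–1.5 m distance bounds taken from the HRTF library ranges mentioned later in the text:

```python
def far_to_near_track(duration_s, interval_s, start_dist=1.5, end_dist=0.5):
    """Generate a [azimuth, elevation, distance] coordinate at each time node
    (spaced by the preset time interval) for a sound source approaching the
    listener head-on; azimuth and elevation stay fixed at 0 degrees."""
    n = int(duration_s / interval_s) + 1
    track = []
    for i in range(n):
        frac = i / (n - 1) if n > 1 else 0.0
        distance = start_dist + (end_dist - start_dist) * frac
        track.append([0.0, 0.0, round(distance, 3)])
    return track

# Five time nodes over two seconds at a 0.5 s interval, moving from far to near.
track = far_to_near_track(duration_s=2.0, interval_s=0.5)
```

A fixed sound source would instead be a single such coordinate reused at every time node.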
In method 1, the electronic device pre-stores the position of the sound source in the target audio. Specifically, the electronic device pre-stores a correspondence between the target audio and the position of the sound source in the target audio, and after determining a target sound source, the electronic device determines the position of the sound source according to the target audio and the correspondence between the target audio and the position of the sound source.
In method 2, the electronic device determines the position of the sound source according to a time of determining the target audio. Specifically, the electronic device pre-stores a position of the sound source at each of different stages of the target music. For example, if the time of determining the target audio is before the vocal part of the target music occurs, a position of the target audio can be from far to near, and if the time of determining the target audio is after the vocal part of the target music ends, the position of the target audio can be from near to far.
In method 3, the position of the sound source selected by an operation of the user is received. Specifically, the electronic device can provide the user with a position range, a position option, a movement speed, a movement direction, or other options of the position of the sound source, and receive the position of the sound source indicated by an input operation or selection operation of the user as the position of the sound source of the target audio.
Optionally, the electronic device can be integrated with a unit for calculating the position of the sound source, which can obtain a position of the sound source more suitable for the target audio based on big data or artificial intelligence (AI) technology by simulating positions of different sound sources. Optionally, the electronic device can also receive a position of a sound source sent by other training platforms for professional sound source position calculation, which will not be repeated herein.
Specifically, after the position of the sound source of the target audio is determined, the following situations may occur when a position is generated.
In situation 1, the position of the sound source of the target audio is fixed and is represented by a fixed position parameter. For example, referring to
In situation 2, referring to
At S204, the electronic device obtains dual-channel audio of the target audio by performing audio-visual modulation on the target audio according to the position of the sound source.
Specifically, the position of the sound source may be fixed or changing, and through the audio-visual modulation the target audio sounds as if it comes from the position of the sound source. The electronic device obtains the dual-channel audio of the target audio by respectively performing the audio-visual modulation on the target audio according to the position of the sound source of the target audio corresponding to each of the multiple time nodes. A method for the audio-visual modulation may be convolving with an HRTF, a time delay method, a phase difference method, or other methods for audio-visual modulation.
As an optimized scheme, in order to ensure an audio-visual modulation effect as much as possible, the electronic device can first perform pre-emphasis processing and normalization processing on the target audio. The pre-emphasis processing is a processing method for boosting a high-frequency component of audio. In practice, a power spectrum of the audio decreases with the increase of frequency, and most energy of the audio is concentrated in a low-frequency range, which causes a signal-to-noise ratio of the audio at a high-frequency end to drop to an unacceptable level. Therefore, the pre-emphasis processing is used to increase high-frequency resolution of the audio. Specifically, the pre-emphasis processing can be realized through a high-pass digital filter. The above normalization processing is a common information processing method for simplifying calculation, which converts a dimensional processing object into a dimensionless processing object, such that a processing result can have a wider applicability.
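The two preprocessing steps can be sketched as follows; the first-order filter coefficient 0.97 is a conventional choice rather than a value given by the text, and peak normalization is only one common dimensionless form:

```python
import numpy as np

def pre_emphasize(audio, coeff=0.97):
    """First-order high-pass pre-emphasis y[n] = x[n] - coeff * x[n-1],
    which boosts the high-frequency component of the audio."""
    out = np.copy(audio)
    out[1:] -= coeff * audio[:-1]
    return out

def normalize(audio):
    """Peak normalization into [-1, 1], turning the samples dimensionless."""
    peak = np.max(np.abs(audio))
    return audio / peak if peak > 0 else audio

p = pre_emphasize(np.array([1.0, 1.0, 1.0]))  # constant (low-frequency) input is attenuated
n = normalize(np.array([0.5, -2.0]))
```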
The electronic device divides the target audio into multiple audio frames according to a preset second time interval T2, after pre-emphasizing and normalizing the target audio. An audio signal is a signal changing with time and can be considered to be approximately unchanged in a short period of time (generally 10˜30 millisecond (ms)), i.e., the audio has short-term stability. Frame division processing can be performed on the target audio, and the target audio can be divided into the multiple audio frames (which can also be referred to as analysis frames) for processing according to the preset second time interval T2. Optionally, the second time interval of the audio frame can be preset as 0.1*Fs, where Fs is a current sampling rate of the target audio.
When performing the frame division processing on the target audio, the electronic device can perform weighting by using a movable window with a limited length, i.e., windowing and frame division processing, to solve a problem of frequency spectrum leakage caused by the frame division destroying the naturalness and continuity of the audio. During the frame division processing, the number of audio frames per second can be 33˜100, depending on an actual situation. The frame division processing can use a continuous segmentation method or an overlapping segmentation method. The overlapping segmentation is used to achieve a smooth transition between audio frames and keep their continuity. An overlapping part between a previous frame and a next frame is referred to as a frame shift, and a ratio of the frame shift to a frame length is generally 0˜0.5, where the frame length is the number of sampling points of an audio frame or a sampling time of an audio frame. Referring to
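The windowing and overlapping frame division can be sketched as follows; the Hann window and the 50% frame shift are illustrative choices within the 0˜0.5 ratio mentioned above, not values fixed by the text:

```python
import numpy as np

def frame_signal(audio, frame_len, frame_shift):
    """Overlapping frame division with a Hann window; a frame_shift smaller
    than frame_len gives the overlap that keeps continuity between frames."""
    window = np.hanning(frame_len)
    frames = []
    start = 0
    while start + frame_len <= len(audio):
        frames.append(audio[start:start + frame_len] * window)
        start += frame_shift
    return np.array(frames)

# 100 samples split into 20-sample frames with a 10-sample shift (50% overlap).
audio = np.arange(100, dtype=float)
frames = frame_signal(audio, frame_len=20, frame_shift=10)
```

Each windowed frame would then be convolved with the HRTF selected for its time node.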
As a better implementation scheme, the electronic device can obtain the dual-channel audio of the target audio by convolving an HRTF from a position of the sound source to a left ear and a right ear respectively for each of the multiple audio frames according to the position of the sound source corresponding to the time node of the audio frame.
The HRTF, also referred to as anatomical transfer function (ATF), is a sound effect positioning algorithm, which can produce a 3D sound effect by using technologies such as interaural time delay (ITD), interaural amplitude difference (IAD), and auricle frequency vibration, such that the listener can feel a surround sound effect when a sound reaches the auricles, ear canals, and eardrums in human ears, where the transfer function may be affected by factors such as an auricle, a head shape, or a shoulder. The sound travels in space and thus can be heard by people, and the sound changes when traveling from the sound source to the human eardrums, where this change can be regarded as a filtering effect of the two human ears on the sound, and the filtering effect can be simulated by audio processed with the HRTF. That is, a position of a sound source of the audio can be determined by the listener through the audio processed with the HRTF.
When the electronic device synthesizes the dual-channel audio by convolving the HRTF, a sense of orientation is given to the target audio by taking the position of the sound source of the target audio as a measuring point and convolving the HRTF. For example, an HRTF database of the University of Cologne in Germany is used as a standard transfer function library, and position information of the sound source of the audio is represented by the 3D position coordinate of [azimuth, elevation, distance]. An HRTF from the position to the two ears is determined with the 3D position coordinate as a parameter, and the HRTF from the position of the sound source to the two ears respectively is convolved, to form the dual-channel audio of the target audio. The requirement of the HRTF database of the University of Cologne in Germany for preset parameter ranges of the position is as follows: an azimuth ranges from −90° to 90°, an elevation ranges from −90° to 90°, a distance ranges from 0.5 m to 1.5 m, and a far-field distance is greater than 1.5 m. In specific processing, there may be several situations below.
In situation 1, for the sound source at a fixed position, a 3D coordinate of the sound source may be considered unchanged at multiple time nodes. The electronic device determines an HRTF of the position of the sound source according to the position of the sound source of the target audio if the parameter falls within a preset parameter range of the HRTF function library, and performs convolution processing. Referring to
In situation 2, for the sound source at a changing position, the electronic device can determine the position of the sound source at each of the multiple time nodes according to the preset time interval T. The electronic device determines an HRTF of the position of the sound source at each of the multiple time nodes according to the position of the sound source of the target audio if the parameter falls within the preset parameter range of the HRTF function library, and performs the convolution processing. Referring to
In situation 3, if the position of the sound source is determined as in situation 1 or 2, and a first position falls out of the preset parameter range of the HRTF function library, the electronic device can determine P position points around the first position, and obtain an HRTF corresponding to the first position by fitting HRTFs corresponding to the P position points, where the HRTF is referred to as a second HRTF for ease of description. P is an integer not less than 1. Referring to
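The fitting in situation 3 can be sketched as follows; the text does not fix a fitting method, so simple averaging of the P nearest measured positions is used here as one possibility, and the bank of measured positions with 2-tap impulse responses is purely illustrative:

```python
import numpy as np

def fit_hrtf(first_pos, bank, p=2):
    """Approximate the response at a first position outside the measured
    parameter range by averaging the responses of the P nearest measured
    position points. `bank` maps (azimuth, elevation, distance) tuples to
    impulse-response arrays (assumed data)."""
    positions = list(bank.keys())
    dists = [np.linalg.norm(np.array(pos) - np.array(first_pos)) for pos in positions]
    nearest = sorted(range(len(positions)), key=lambda i: dists[i])[:p]
    return np.mean([bank[positions[i]] for i in nearest], axis=0)

# Toy bank with two measured positions; the fitted second HRTF blends both.
bank = {(0.0, 0.0, 1.5): np.array([1.0, 0.0]),
        (10.0, 0.0, 1.5): np.array([0.0, 1.0])}
h = fit_hrtf((5.0, 0.0, 1.6), bank, p=2)
```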
At S205, the electronic device modulates power of the dual-channel audio of the target audio.
Specifically, in order that the target audio does not affect the listening experience of the target music too much, the electronic device can perform power modulation on the target audio, i.e., decrease power of the target audio, before rendering the dual-channel audio of the target audio into the target music, such that the power of the target audio is lower than power of the target music. It should be noted that modulating the power of the dual-channel audio is merely a preferred implementation and an optional scheme to improve user experience. The electronic device needs to first determine a time for rendering the target audio into the target music, i.e., determine a mixing time of the target audio, before modulating the power of the dual-channel audio of the target audio. The following illustrates several optional schemes for determining the mixing time of the target audio.
In scheme 1, the electronic device presets the mixing time of the target audio. Optionally, when the electronic device renders the target audio into the target music, the target audio can be mixed multiple times or occur circularly at a preset third time interval T3. Referring to
In scheme 2, the electronic device determines the mixing time of the target audio according to a timestamp of the lyrics. For example, the electronic device can determine the target audio in manner 2, and a timestamp to start singing a matched lyric is the mixing time of the target audio since the target audio is matched with the lyrics. Referring to
In scheme 3, the electronic device receives a selection or input operation of the user and determines a time indicated by the selection or input operation as the mixing time of the target audio. For example, referring to
The power modulation can be performed on the audio according to the mixing time of the audio after the electronic device determines the mixing time of the target audio. Optionally, the electronic device can proportionally reduce power of multiple pieces of audio when the multiple pieces of audio need to be mixed at a same time, so that an overall power output finally does not exceed a predetermined power threshold. Since the audio signal is a random signal, power of the audio signal can be represented by a root mean square (RMS) value, which is a measurement referenced to a sinusoidal signal with the same amplitude as a peak value of the audio signal, is close to an average value, and represents the heating energy of the audio. The RMS value is also referred to as the effective value, which is calculated by first squaring, then averaging, and finally extracting the square root. Referring to
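The RMS definition above (square, average, square root) can be sketched as:

```python
import numpy as np

def rms(audio):
    """Root mean square (effective value): square the samples, average,
    then take the square root."""
    return np.sqrt(np.mean(np.square(audio)))

# A square wave of amplitude 3 has an RMS value equal to its amplitude.
value = rms(np.array([3.0, -3.0, 3.0, -3.0]))
```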
In method 1, a first modulation factor is determined, and the target audio is modulated to have an RMS value which is an alpha multiple of an RMS value of the target music, where alpha is a parameter preset or indicated by an input operation received from the user, and 0<alpha<1. Referring to
At S1411, RMSA1 of the left channel audio of the target audio, RMSB1 of the right channel audio of the target audio, and RMSY of the audio of the target music are calculated.
Specifically, since the left channel audio and the right channel audio of the target audio are processed with a convolution function, power of each single channel needs to be calculated separately when modulating the audio.
At S1412, a parameter alpha is obtained for calculation.
At S1413, the left channel audio is set as RMSA2, where RMSA2=alpha*RMSY.
At S1414, a ratio of RMSA2 to RMSA1 is assigned as a first left channel modulation factor MA1.
Specifically, the ratio of RMSA2 to RMSA1 is assigned as the first left channel modulation factor MA1, that is, MA1=RMSA2/RMSA1.
At S1415, the right channel audio is set as RMSB2, where RMSB2=alpha*RMSY.
At S1416, a ratio of RMSB2 to RMSB1 is assigned as a first right channel modulation factor MB1.
Specifically, the ratio of RMSB2 to RMSB1 is assigned as the first right channel modulation factor MB1, that is, MB1=RMSB2/RMSB1.
At S1417, a smaller value of MA1 and MB1 is assigned as a first modulation factor M1, and an RMS value of the left channel audio of the target audio and an RMS value of the right channel audio of the target audio are respectively adjusted to M1*RMSA1 and M1*RMSB1.
Specifically, the smaller value of MA1 and MB1 is assigned as the first modulation factor M1, that is M1=min (MA1,MB1).
Since the target audio is processed with the convolution function, in order to keep the audio-visual modulation effect of the above dual-channel audio unchanged, amplitude modulation of the left channel and the right channel needs to use a same modulation factor. Therefore, the smaller value of MA1 and MB1 is assigned as the first modulation factor M1.
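The modulation of method 1 described above can be sketched as follows. This is an illustrative sketch only: the function names are not from the disclosure, and audio is assumed to be normalized floating-point samples.

```python
import numpy as np

def rms(x):
    """Root-mean-square value of an audio signal."""
    return np.sqrt(np.mean(np.square(x)))

def modulate_method1(left, right, music, alpha=0.3):
    """Method 1 sketch: scale both channels of the target audio by a common
    factor M1 so that the louder channel's RMS becomes alpha times the RMS
    of the target music. A single factor for both channels preserves the
    inter-channel balance produced by the HRTF convolution."""
    rms_a1, rms_b1, rms_y = rms(left), rms(right), rms(music)
    ma1 = (alpha * rms_y) / rms_a1   # first left channel modulation factor
    mb1 = (alpha * rms_y) / rms_b1   # first right channel modulation factor
    m1 = min(ma1, mb1)               # first modulation factor (S1417)
    return m1 * left, m1 * right, m1
```

With alpha below 1, the target audio always stays quieter than the target music, matching the constraint 0&lt;alpha&lt;1 above.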
Optionally, during modulation in method 1, if an RMS value of mixed audio obtained by mixing the modulated target audio into the target music exceeds a value range of the computer number, the power of the target audio needs to be decreased; otherwise, data overflow may result. In the method illustrated in
In method 2, a second modulation factor is determined, and the RMS value of the target audio is modulated, such that the sum of the RMS value of the target music and the RMS value of the target audio does not exceed a maximum value in the value range of the computer number. The RMS value of the target audio is modulated to be always less than the RMS value of the target music. Referring to
At S1521, RMSA1 of the left channel audio of the target audio, RMSB1 of the right channel audio of the target audio, and RMSY of the audio of the target music are calculated.
At S1522, a target RMS value RMSA3 is set for the left channel audio, where RMSA3=F−RMSY, and F is a maximum value in the value range of the computer number.
At S1523, a ratio of RMSA3 to RMSA1 is assigned as a second left channel modulation factor MA2.
Specifically, the ratio of RMSA3 to RMSA1 is assigned as the second left channel modulation factor MA2, that is, MA2=RMSA3/RMSA1.
At S1524, a target RMS value RMSB3 is set for the right channel audio, where RMSB3=F−RMSY.
At S1525, a ratio of RMSB3 to RMSB1 is assigned as a second right channel modulation factor MB2.
Specifically, the ratio of RMSB3 to RMSB1 is assigned as the second right channel modulation factor MB2, that is, MB2=RMSB3/RMSB1.
At S1526, a smaller value of MA2 and MB2 is assigned as a second modulation factor M2, and the RMS value of the left channel audio of the target audio and the RMS value of the right channel audio of the target audio are respectively adjusted to M2*RMSA1 and M2*RMSB1.
Specifically, the smaller value of MA2 and MB2 is assigned as the second modulation factor M2, that is, M2=min (MA2,MB2).
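Method 2 can be sketched in the same way. The sketch below is illustrative: `f_max` stands in for the maximum value of the computer number format (1.0 for normalized floating-point audio), and the function name is an assumption.

```python
import numpy as np

def rms(x):
    """Root-mean-square value of an audio signal."""
    return np.sqrt(np.mean(np.square(x)))

def modulate_method2(left, right, music, f_max=1.0):
    """Method 2 sketch: scale both channels so that the sum of the target
    audio's RMS value and the target music's RMS value does not exceed the
    maximum representable value f_max, preventing data overflow on mixing."""
    rms_a1, rms_b1, rms_y = rms(left), rms(right), rms(music)
    ma2 = (f_max - rms_y) / rms_a1   # second left channel modulation factor
    mb2 = (f_max - rms_y) / rms_b1   # second right channel modulation factor
    m2 = min(ma2, mb2)               # second modulation factor (S1526)
    return m2 * left, m2 * right, m2
```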
In the method illustrated in
In method 3, a third modulation factor is determined, and the RMS value of the target audio is modulated, such that the RMS value of the target audio is less than the RMS value of the target music. The third modulation factor can be determined in other manners and is used to modulate the RMS value of the target audio. For example, a smaller value of the first modulation factor and the second modulation factor is assigned as the third modulation factor. That is, on condition that a value of the first modulation factor is less than a value of the second modulation factor, the first modulation factor is determined as the modulation factor and is used to modulate the RMS value of the target audio, such that the RMS value of the target audio is less than the RMS value of the target music. Similarly, on condition that the value of the second modulation factor is less than the value of the first modulation factor, the second modulation factor is determined as the modulation factor and is used in the same way. With this modulation method, under the premise of preventing data overflow, an RMS proportional relation between sound effect data and music data can be kept unchanged as much as possible, which can prevent the target audio from covering up the target music due to excessive power, and can also prevent a situation that the target audio has no obvious effect due to too low power, thereby ensuring a dominant position of the target music.
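Method 3 combines the two factors by taking the smaller one, so the alpha proportion of method 1 is kept whenever it does not risk overflow. A self-contained illustrative sketch (names and normalized-float audio are assumptions):

```python
import numpy as np

def rms(x):
    """Root-mean-square value of an audio signal."""
    return np.sqrt(np.mean(np.square(x)))

def modulation_factor_method3(left, right, music, alpha=0.3, f_max=1.0):
    """Method 3 sketch: the third modulation factor is the smaller of the
    method-1 factor (alpha-proportional) and the method-2 factor (overflow
    guard), keeping the RMS proportion between sound effect and music
    unchanged as much as possible while preventing data overflow."""
    rms_a1, rms_b1, rms_y = rms(left), rms(right), rms(music)
    m1 = min(alpha * rms_y / rms_a1, alpha * rms_y / rms_b1)        # method 1
    m2 = min((f_max - rms_y) / rms_a1, (f_max - rms_y) / rms_b1)    # method 2
    return min(m1, m2)                                              # method 3
```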
Optionally, audio of various sound effect elements may be used to construct a listening scene since the music is played in real time. Referring to
At S206, the electronic device renders the dual-channel audio of the target audio into target music to produce an effect that the target music is played in the target scene.
Specifically, the electronic device obtains mixed music by mixing the dual-channel audio of the target audio into the target music according to the mixing time of the target audio determined previously, such that the listener can feel the effect that the target music is played in the target scene when a playing device plays the mixed music.
Optionally, the electronic device may also serve as the playing device and be configured to play the mixed music. In this case, the playing device is a playing module integrated into the electronic device, and the electronic device may be a device such as a smart earphone with a calculation capability. Optionally, the electronic device can transmit the mixed music to the playing device through a wired interface, a wireless interface (e.g., a WiFi interface, a Bluetooth interface), etc., and the playing device is configured to play the mixed music. In this case, the electronic device may be a server (or a server cluster), a computer host, or other electronic devices, and the playing device may be a device such as a Bluetooth earphone, a wired earphone, or the like.
For example, after the electronic device assigns the song of Daoxiang as the target music, assigns a pastoral scene as the target scene, and determines “sound of flowers, plants, insects, and birds in the field”, “sound of streams”, and “sound of flash special effect” as the target audio that represents the pastoral scene, an operation such as the convolution processing, the power modulation, or the like is performed on the target audio, and the mixed audio is obtained by mixing the target audio into audio of Daoxiang according to the mixing time of the target audio. The mixed audio is transmitted via an earphone connection interface to a headphone, such that when listening to Daoxiang with the headphone, the listener can feel the sound effect element lingering around ears and feel like being in the middle of a field and smelling the fragrance of rice.
In the method illustrated in
The above illustrates the methods in implementations of the disclosure in detail, and the following provides an apparatus in implementations of the disclosure.
Referring to
The audio selecting unit 1701 is configured to determine target audio, where the target audio is used to characterize a sound feature in a target scene. The position determining unit 1702 is configured to determine a position of a sound source of the target audio. The audio-visual modulation unit 1703 is configured to obtain dual-channel audio of the target audio by performing audio-visual modulation on the target audio according to the position of the sound source, where the dual-channel audio of the target audio during simultaneous output is able to produce an effect that the target audio is from the position of the sound source. The audio rendering unit 1704 is configured to render the dual-channel audio of the target audio into target music to produce an effect that the target music is played in the target scene.
It can be seen that, a sound effect element that can characterize a listening scene is mixed when a user listens to music. When audio of the sound effect element is mixed into the music, the audio-visual modulation is performed on the audio of the sound effect element according to the position of the sound source, such that the sound effect element when entering ears seems to come from the position of the sound source, and the sound effect element can construct a more real listening scene, thereby improving a sense of presence and immersion of the user when listening to the music.
In another optional scheme, the target audio before a vocal part of the target music occurs or after the vocal part ends is audio matched according to type information or whole lyrics of the target music, and/or, the target audio in the vocal part of the target music is audio matched according to a lyric content of the target music.
That is, before the vocal part of the target music begins or after the vocal part ends, the target music is in a stage where there is only an accompaniment but no vocal singing. In this stage, the target audio can be determined according to a type or whole lyric content of the song, such that a listener can listen to audio matched with a style or content of the song in an accompaniment part of the song. In the vocal part of the target music, a main effect of the music is conveyed by singing lyrics, so the target audio is matched according to a specific content of the lyrics. As such, with a music lyric-oriented method for audio matching, added audio is more consistent with the content of the target music, thereby improving experience of listening to music.
In another optional scheme, the audio selecting unit 1701 configured to determine the target audio is specifically configured to determine the target audio by receiving a selection operation for the target audio.
It can be seen that, one or more pieces of audio are provided to the user when audio to be mixed is selected, and the target audio is determined by receiving the selection operation for the target audio. That is, when the user listens to the music, audio can be autonomously selected by the user according to own preferences to mix into the music, thereby constructing an individualized listening scene, which stimulates a creative desire of the user and increases interest of the listening experience.
In another optional scheme, the position determining unit 1702 configured to determine the position of the sound source of the target audio is specifically configured to determine a position of the sound source of the target audio at each of multiple time nodes. The audio-visual modulation unit configured to obtain the dual-channel audio of the target audio by performing the audio-visual modulation on the target audio according to the position of the sound source is specifically configured to obtain the dual-channel audio of the target audio by performing the audio-visual modulation on the target audio according to the position of the sound source at each of the multiple time nodes.
At present, when a device plays the music with an added sound effect element, the position of the sound source is fixed, a left ear and a right ear hear the same content, and a sound position is centered or fixed. However, a position of a sound source of the sound effect element in space may be fixed relative to human ears or may be displaced. With the apparatus provided in implementations of the disclosure, for audio characterizing a target listening scene, the position of the sound source of the target audio at each of the multiple time nodes is determined according to a preset time interval, and the audio-visual modulation is performed on the target audio according to the position of the sound source at each of the multiple time nodes, such that the effect that the target audio is from the position of the sound source is produced and a changing moving track of the sound source can be perceived, thereby increasing a sense of presence of the user and constructing a more natural listening scene.
In another optional scheme, the audio-visual modulation unit 1703 includes a frame division subunit 1705 and an audio-visual generating subunit 1706. The frame division subunit 1705 is configured to divide the target audio into multiple audio frames. The audio-visual generating subunit 1706 is configured to obtain the dual-channel audio of the target audio by convolving an HRTF (head-related transfer function) from a position of the sound source to a left ear and a right ear respectively for each of the multiple audio frames according to a position of the sound source corresponding to a time node of the audio frame.
It can be seen that, frame division processing needs to be performed on the target audio before using the HRTF to perform the audio-visual modulation, to improve an effect of audio processing. The HRTF is convolved through a divided audio frame, such that the user can feel the effect that the target audio is from the position of the sound source when the dual-channel audio of the target audio is played in the left ear and the right ear, and presence of the sound effect element is more real.
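The per-frame HRTF convolution can be sketched as follows. This is a simplified illustration: the HRIR (time-domain HRTF) lookup per frame is assumed to be done elsewhere, and the convolution tails are truncated rather than overlap-added or cross-faded between frames as a production implementation would do.

```python
import numpy as np

def binauralize(frames, hrir_pairs):
    """Sketch of per-frame binaural rendering. `frames` is a list of 1-D
    mono audio frames; `hrir_pairs` is a list of (hrir_left, hrir_right)
    impulse responses, one pair per frame, looked up from the sound source
    position at that frame's time node. Each frame is convolved with the
    left-ear and right-ear responses, and the frames are concatenated into
    dual-channel audio."""
    left_parts, right_parts = [], []
    for frame, (h_l, h_r) in zip(frames, hrir_pairs):
        n = len(frame)
        left_parts.append(np.convolve(frame, h_l)[:n])    # left ear path
        right_parts.append(np.convolve(frame, h_r)[:n])   # right ear path
    return np.concatenate(left_parts), np.concatenate(right_parts)
```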
In another optional scheme, the audio-visual generating subunit 1706 includes a frame position matching subunit 1707, a position measuring subunit 1708, and a convolving subunit 1709. The frame position matching subunit 1707 is configured to obtain a first position of the sound source corresponding to a first audio frame, where the first audio frame is one of the multiple audio frames. The position measuring subunit 1708 is configured to determine a first HRTF corresponding to the first position on condition that the first position falls within a preset measuring point range, where each measuring point in the preset measuring point range corresponds to an HRTF. The convolving subunit 1709 is configured to obtain dual-channel audio of the first audio frame of the target audio by convolving the first HRTF from the first position to the left ear and the right ear respectively for the first audio frame.
It can be seen that, since the position of the sound source of the target audio can change continuously, for the first audio frame in the multiple audio frames, the first position corresponding to the first audio frame is firstly determined, an HRTF corresponding to the first position is determined, and then convolution processing is performed. The dual-channel audio of the target audio obtained by convolving the HRTF is played in the left ear and right ear of the listener, such that the listener can feel like that the target music comes from the position of the sound source, thereby improving a sense of presence and immersion of the user when listening to the music.
In another optional scheme, the position measuring subunit 1708 is further configured to determine P measuring position points according to the first position on condition that the first position falls out of the preset measuring point range, where the P measuring position points are P points falling within the preset measuring point range, and P is an integer not less than 1. The apparatus further includes a position fitting subunit 1710 configured to obtain a second HRTF corresponding to the first position by fitting according to HRTFs respectively corresponding to the P measuring position points. The convolving subunit 1709 is further configured to obtain the dual-channel audio of the first audio frame of the target audio by convolving the second HRTF from the first position to the left ear and the right ear respectively for the first audio frame.
It can be seen that, the measuring point range is preset for the HRTF, and each measuring point in the preset measuring point range corresponds to an HRTF. On condition that the first position falls out of the measuring point range, P measuring position points that fall within the preset range and are close to the first position can be determined, and the HRTF of the first position can be obtained by fitting the HRTFs respectively corresponding to the P measuring position points, which can improve accuracy of an audio-visual modulation effect of the target audio and enhance an effect stability of a processing process of the target audio.
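The disclosure leaves the concrete fitting scheme open; inverse-distance weighting over the P nearby measuring points is one plausible choice, sketched below. Positions are assumed to be Cartesian coordinates for illustration.

```python
import numpy as np

def fit_hrir(target_pos, measured_positions, measured_hrirs):
    """Sketch of fitting an HRIR at an unmeasured position from P measuring
    position points, using inverse-distance weighting (one possible scheme,
    not specified by the disclosure). If the target position coincides with
    a measuring point, that point's HRIR is returned directly."""
    target = np.asarray(target_pos, dtype=float)
    dists = np.array([np.linalg.norm(target - np.asarray(p, dtype=float))
                      for p in measured_positions])
    if np.any(dists == 0):                      # exact hit on a measuring point
        return np.asarray(measured_hrirs[int(np.argmin(dists))], dtype=float)
    w = 1.0 / dists                             # closer points weigh more
    w /= w.sum()
    return sum(wi * np.asarray(h, dtype=float)
               for wi, h in zip(w, measured_hrirs))
```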
In another optional scheme, the audio rendering unit 1704 configured to render the dual-channel audio of the target audio into the target music to produce the effect that the target music is played in the target scene specifically includes a modulation factor determining subunit 1711, an adjusting subunit 1712, and a mixing subunit 1713. The modulation factor determining subunit 1711 is configured to determine a modulation factor according to an RMS value of the left channel audio, an RMS value of the right channel audio, and an RMS value of the target music. The adjusting subunit 1712 is configured to obtain adjusted left channel audio and adjusted right channel audio by adjusting the RMS value of the left channel audio and the RMS value of the right channel audio according to the modulation factor, where an RMS value of the adjusted left channel audio and an RMS value of the adjusted right channel audio each are not greater than the RMS value of the target music. The mixing subunit 1713 is configured to mix the adjusted left channel audio into a left channel of the target music as rendered audio of the left channel of the target music, and mix the adjusted right channel audio into a right channel of the target music as rendered audio of the right channel of the target music.
At present, when the device plays the music with added sound effect elements, sound intensities of the added sound effect elements are different. Some of the sound effect elements have excessive loudness, which easily leads to data overflow and covers up a sound of the music, and some of the sound effect elements have too low loudness, which is almost imperceptible, thereby affecting experience of the user when listening to the music. With the apparatus provided in implementations of the disclosure, when the target audio is mixed into the music, power of the target audio is firstly modulated to change a feature, such as loudness, of the target audio, which can prevent the sound effect element from covering up an original music signal and can also prevent a situation that the sound effect element has no obvious effect due to too low loudness, such that audio of the added sound effect element will not affect the user listening to the original music.
In another optional scheme, the RMS value of the left channel audio is RMSA1, the RMS value of the right channel audio is RMSB1, and the RMS value of the target music is RMSY. The modulation factor determining subunit 1711 configured to determine the modulation factor according to the RMS value of the left channel audio, the RMS value of the right channel audio, and the RMS value of the target music is specifically configured to perform the following. The RMS value of the left channel audio is adjusted to RMSA2, and the RMS value of the right channel audio is adjusted to RMSB2, such that RMSA2, RMSB2, and RMSY satisfy: RMSA2=alpha*RMSY and RMSB2=alpha*RMSY, where alpha is a preset scale factor, and 0<alpha<1. A ratio of RMSA2 to RMSA1 is assigned as a first left channel modulation factor MA1, that is, MA1=RMSA2/RMSA1.
A ratio of RMSB2 to RMSB1 is assigned as a first right channel modulation factor MB1, that is, MB1=RMSB2/RMSB1.
A smaller value of MA1 and MB1 is assigned as a first group value M1, that is, M1=min (MA1,MB1). The first group value is determined as the modulation factor.
It can be seen that, by determining the modulation factor according to the RMS value of the left channel audio of the target audio, the RMS value of the right channel audio of the target audio, and the RMS value of the target music, and modulating power of the target audio according to the modulation factor, an RMS value of the target audio is controlled to be proportional to the RMS value of the target music, such that appearance of the target audio may not affect listening of the original music too much. The value of alpha, i.e., the RMS proportion of the sound effect element relative to the target music, can be preset by a system or set by the user according to own preferences, thereby constructing an individualized listening effect and increasing interest of the listening experience.
In another optional scheme, the modulation factor determining subunit 1711 is further configured to perform the following. The RMS value of the left channel audio is adjusted to RMSA3, and the RMS value of the right channel audio is adjusted to RMSB3, such that RMSA3, RMSB3, and RMSY satisfy: RMSA3=F−RMSY, where F is a maximum number of numbers that a floating-point type is able to represent, and RMSB3=F−RMSY. A ratio of RMSA3 to RMSA1 is assigned as a second left channel modulation factor MA2, that is, MA2=RMSA3/RMSA1.
A ratio of RMSB3 to RMSB1 is assigned as a second right channel modulation factor MB2, that is, MB2=RMSB3/RMSB1.
A smaller value of MA2 and MB2 is assigned as a second group value M2, that is, M2=min (MA2,MB2), where the first group value is less than the second group value.
It can be seen that, an RMS value of mixed rendered audio should not be greater than a maximum value in a value range of a computer number when the modulation factor is determined, so that under the premise of preventing data overflow, the target audio can be prevented from covering up the target music due to excessive power as much as possible, and a situation that the target audio has no obvious effect due to too low power can also be prevented, thereby ensuring a dominant position of the target music.
In another optional scheme, the apparatus further includes a sampling rate transferring unit 1714. The sampling rate transferring unit 1714 is configured to transfer a sampling rate of the target audio to a sampling rate of the target music on condition that the sampling rate of the target audio is different from the sampling rate of the target music, after determining the target audio and before determining the position of the sound source of the target audio.
It can be seen that, after the target audio is determined, if the sampling rate of the target audio is different from the sampling rate of the target music, a sampling rate of the sound effect element is transferred to the sampling rate of the target music, which makes the sound effect element sound more natural when mixed.
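The sampling rate transfer can be sketched with simple linear interpolation. This is an illustrative sketch only: a production resampler would use a band-limited polyphase or sinc filter (e.g. `scipy.signal.resample_poly`) to avoid aliasing.

```python
import numpy as np

def transfer_sampling_rate(audio, sr_in, sr_out):
    """Sketch of transferring the target audio's sampling rate sr_in to the
    target music's sampling rate sr_out by linear interpolation of the
    sample values onto the new time grid."""
    if sr_in == sr_out:
        return audio                      # rates already match; nothing to do
    n_out = int(round(len(audio) * sr_out / sr_in))
    t_in = np.arange(len(audio)) / sr_in  # original sample instants
    t_out = np.arange(n_out) / sr_out     # target sample instants
    return np.interp(t_out, t_in, audio)
```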
It can be seen that, through the apparatus described in
It should be noted that, implementation of each operation can also correspondingly refer to the corresponding description of the method implementations illustrated in
Referring to
The processor 1801 (or referred to as central processing unit (CPU)) is the computing core and control core of the apparatus and can be configured to analyze various instructions in the apparatus and process various data in the apparatus. For example, the CPU can transmit various interaction data between internal structures of the apparatus.
The memory 1802 is a storage device in the apparatus and is configured to store programs and data. It can be understood that, the memory 1802 here may include an internal memory of the apparatus and may also include an extended memory supported by the apparatus. The memory 1802 provides a storage space that stores an operating system of the apparatus (such as an Android system, an iOS system, or a Windows Phone system) and other data, which will not be limited herein.
The processor 1801 can be configured to invoke program instructions stored in the memory 1802 to execute the method provided in the implementations illustrated in
It may be noted that, implementation of each operation can also correspondingly refer to the corresponding description of the method implementations illustrated in
A computer readable storage medium is further provided in implementations of the disclosure. The computer readable storage medium is configured to store computer instructions which, when running on a processor, are configured to perform the operations executed by the electronic device in the implementations illustrated in
A computer program product is further provided in implementations of the disclosure. The computer program product, when running on a processor, is configured to perform the operations executed by the electronic device in the implementations illustrated in
All or part of the above implementations can be implemented through software, hardware, firmware, or any combination thereof. When implemented by software, all or part of the above implementations can be implemented in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, all or part of the operations or functions of the implementations of the disclosure are performed. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable apparatuses. The computer instructions can be stored in a computer readable storage medium, or transmitted through the computer readable storage medium. The computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (for example, via a coaxial cable, an optical fiber, or a digital subscriber line (DSL)) or wirelessly (for example, via infrared, radio, microwave, etc.). The computer readable storage medium can be any available medium accessible by a computer or a data storage device such as a server, a data center, or the like which is integrated with one or more available media. The available medium can be a magnetic medium (such as a floppy disk, a hard disk, or a magnetic tape), an optical medium (such as a digital video disc (DVD)), or a semiconductor medium (such as a solid state disk (SSD)), etc.
Number | Date | Country | Kind |
---|---|---|---|
201911169274.2 | Nov 2019 | CN | national |
This application is a continuation under 35 U.S.C. § 120 of International Patent Application No. PCT/CN2020/074640, filed Feb. 10, 2020, which claims priority under 35 U.S.C. § 119(a) and/or PCT Article 8 to Chinese Patent Application No. 201911169274.2, filed on Nov. 25, 2019, the entire disclosures of which are hereby incorporated by reference.
Number | Name | Date | Kind |
---|---|---|---|
20040230391 | Jorgensen | Nov 2004 | A1 |
20130065213 | Gao | Mar 2013 | A1 |
20180100889 | Swamy et al. | Apr 2018 | A1 |
20180221621 | Kanemaru et al. | Aug 2018 | A1 |
20190313200 | Stein | Oct 2019 | A1 |
Number | Date | Country |
---|---|---|
105117021 | Dec 2015 | CN |
105120418 | Dec 2015 | CN |
105792090 | Jul 2016 | CN |
106572419 | Apr 2017 | CN |
106993249 | Jul 2017 | CN |
206759672 | Dec 2017 | CN |
108616789 | Oct 2018 | CN |
108829254 | Nov 2018 | CN |
110270094 | Sep 2019 | CN |
110488225 | Nov 2019 | CN |
H11243598 | Sep 1999 | JP |
2000132150 | May 2000 | JP |
2006174052 | Jun 2006 | JP |
02078312 | Oct 2002 | WO |
2018079850 | May 2018 | WO |
Entry |
---|
JPO, Notice of Reasons for Refusal for corresponding Japanese Patent Application No. 2022-530306, Jul. 18, 2023, 9 pages. |
CNIPA, International Search Report for corresponding International Patent Application No. PCT/CN2020/074640, Jul. 22, 2020, 10 pages. |
CNIPA, First Office Action for corresponding Chinese Patent Application No. 201911169274.2, Dec. 11, 2020, 21 pages. |
CNIPA, Notice of Allowance for corresponding Chinese Patent Application No. 201911169274.2, May 27, 2021, 8 pages. |
Number | Date | Country | |
---|---|---|---|
20220286781 A1 | Sep 2022 | US |
Number | Date | Country | |
---|---|---|---|
Parent | PCT/CN2020/074640 | Feb 2020 | WO |
Child | 17751960 | US |