This application claims the benefit of Chinese Patent Application No. 202311618525.7, filed on Nov. 29, 2023, and entitled “METHOD, APPARATUS, DEVICE AND MEDIUM FOR GENERATING MEDIA DATA”, which is hereby incorporated by reference in its entirety.
Example implementations of the present disclosure generally relate to data processing, and more particularly to generation of media data including music.
In the field of music production, the use of digital synthesis techniques has been proposed to create musical works. For example, a musician may collect sounds using a tool, such as a sampler, and add these sounds into a musical work through a digital synthesis technique, so that the musical work has a richer auditory effect. However, existing music production tools are complex to operate and require the user to have rich professional music knowledge, which is not friendly to ordinary users.
In a first aspect of the present disclosure, a method for generating media data is provided. In the method, first media data is obtained in response to receiving a creation request for creating music. A music template is obtained, and the music template includes melody data for specifying a music melody. Second media data including the music melody is generated based on the first media data.
In a second aspect of the present disclosure, an apparatus for generating media data is provided. The apparatus includes a data obtaining module, a template obtaining module and a generating module, where the data obtaining module is configured to obtain first media data in response to receiving a creation request for creating music; the template obtaining module is configured to obtain a music template, the music template comprising melody data for specifying a music melody; and the generating module is configured to generate second media data including the music melody based on the first media data.
In a third aspect of the present disclosure, an electronic device is provided. The electronic device comprises at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit. The instructions, when executed by the at least one processing unit, cause the electronic device to perform the method of the first aspect.
In a fourth aspect of the present disclosure, a computer-readable storage medium is provided. The computer-readable storage medium has a computer program stored thereon, and the computer program, when executed by a processor, causes the processor to implement the method of the first aspect.
It should be understood that the content described in this section is not intended to limit the key features or important features of the implementations of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
The above and other features, advantages, and aspects of various implementations of the present disclosure will become more apparent from the following detailed description in conjunction with the accompanying drawings. In the drawings, the same or similar reference signs refer to the same or similar elements, in which:
The embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although some embodiments of the present disclosure are illustrated in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be construed as being limited to the embodiments set forth herein. Rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and the embodiments of the present disclosure are provided for illustrative purposes only and are not intended to limit the scope of protection of the present disclosure.
In the description of the embodiments of the present disclosure, the term “including” and the like should be understood as non-exclusive inclusion, that is, “including but not limited to”. The term “based on” should be understood as “based at least in part on.” The term “an embodiment” or “the embodiment” should be understood as “at least one embodiment”. The term “some embodiments” should be understood as “at least some of the embodiments”. Other explicit and implicit definitions may also be included below. As used herein, the term “model” may represent an association relationship between various data. For example, the association relationship may be obtained based on various technical solutions currently known and/or to be developed in the future.
It is to be understood that the data involved in the technical solution, including but not limited to the data itself, the obtaining or use of the data, should comply with the requirements of corresponding laws and regulations and relevant provisions.
It is to be understood that, before using the technical solutions disclosed in the various embodiments of the present disclosure, the user shall be informed of the type, the scope of use, and use scenarios and so on of personal information involved in the present disclosure in an appropriate manner in accordance with relevant laws and regulations, and the user's authorization shall be obtained.
For example, in response to receiving an active request from a user, prompt information is sent to the user to explicitly prompt the user that an operation requested by the user will require obtaining and using personal information of the user, so that the user can autonomously select, according to the prompt information, whether to provide the personal information to software or hardware, such as an electronic device, an application program, a server, or a storage medium, that performs the operations of the technical solutions of the present disclosure.
As an optional but non-limiting implementation, in response to receiving an active request of the user, the prompt information is sent to the user, for example, in the form of a pop-up window, in which the prompt information may be presented in the form of text. In addition, the pop-up window may further carry a selection control for the user to select “agree” or “not agree” to provide the personal information to the electronic device.
It should be understood that the above process for notifying and obtaining the user's authorization is merely illustrative and does not limit the implementations of the present disclosure, and other approaches that meet the relevant laws and regulations may also be applied to the implementations of the present disclosure.
In the field of music production, the use of digital synthesis techniques has been proposed to create musical works. For example, a musician may collect sounds and utilize a professional music production tool to create musical works.
It should be understood that the music production tool 120 herein may be a professional tool. The operations of such tools are complex, and the user 110 is required to have sophisticated professional musical knowledge and rich software skills. For example, the user 110 may collect sound, use the music production tool 120 to process the sound, and then add a corresponding sound effect to the musical work 130, and so on.
However, existing music production tools 120 are not friendly to ordinary users (e.g., users who do not have professional music knowledge and software skills). For example, a user may wish to utilize the sound of his or her own pet to generate a piece of music, or utilize his or her own voice to sing a popular song, and so on. The existing music production tools cannot meet the simple music creation requirements of ordinary users. At this point, it is desirable to provide auxiliary tools to ordinary users and/or professional users in a simpler and more efficient manner in order to generate a desired musical work.
In order to at least partially solve the deficiencies in the prior art, according to an exemplary implementation of the present disclosure, a method for generating media data is provided. Referring to
Specifically, the user may upload the first media data via the control 220, and the first media data may include a timbre that the user desires to specify. It should be understood that the timbre refers to an essential feature of a sound, and it is the most fundamental feature for distinguishing one sound from other sounds. The timbre depends on the waveform of the sound, and different waveforms correspond to different timbres. For example, voices of different people have different timbres, voices of people and sounds of animals have different timbres, and so on. If the musical work desired to be generated by the user includes the timbre of a pet, a piece of audio (and/or video) including the sound of the pet may be uploaded. For another example, if the musical work desired to be generated by the user includes his or her own timbre, a piece of audio (and/or video) including his or her own speaking or singing may be uploaded.
Further, the user may specify a melody of the musical work desired to be generated via a music template. For example, a music template may be obtained via a control 230 or the like, and the music template includes melody data 240 for specifying a music melody. It should be understood that, in the context of the present disclosure, the melody data 240 may include a main melody of a song, a segment of a main melody, or any melody that a user desires to specify, and the like. The user may interact with various controls in the page 210 to specify desired media data and music templates. In this case, the music production tool may receive the media data and the music template specified by the user, and then generate the new media data (for example, it may be referred to as second media data). At this time, the new media data may have a timbre specified in the media data and have a melody specified by the music template.
According to one example implementation of the present disclosure, the second media data and the first media data may have the same timbre, and the timbre is specified by the first media data. In this way, the user may generate musical works with a desired timbre and a desired melody in a simpler and more efficient manner. Specifically, assuming that the user uploads media data of a pet's barking and the selected template includes melody data of the song A, the generated media data may include the melody of the song A sung using the pet's sound. For another example, assuming that the user uploads media data including his or her own speaking sound, and the selected template includes melody data of the song B, the media data generated at this time may include the melody of the song B sung in the user's own voice. According to example implementations of the present disclosure, a music creation tool may be provided to an ordinary user who does not have professional music knowledge, thereby meeting the simple music creation requirements of the ordinary user.
A summary in accordance with one example implementation of the present disclosure has been described with reference to
According to an example implementation of the present disclosure, the music template further includes style data 244 for specifying a music style, and the second media data generated at this time will further have the music style. For example, a pop style, a rock style, a classical style, etc. may be specified. In this way, a musical work with a richer effect can be generated.
According to one example implementation of the present disclosure, the melody data 240 may include a set of notes, each of which corresponds to a respective time length. At this time, the melody data 240 may specify a main melody to be used, each melody may include a plurality of notes arranged in chronological order, and only a single note is present at one time point. Referring to
Alternatively, and/or additionally, the melody data 240 may be represented using note data 320 with tempo information. For example, the note data 320 may be represented using a stave, Numbered Musical Notation, or any other recognizable representation. According to one example implementation of the present disclosure, a page for specifying the melody data 240 may be provided to the user. For example, audio data and/or note data imported by the user may be received to determine the main melody of the media data to be generated.
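By way of a non-limiting illustration, the melody data 240 may be organized as a chronologically ordered list of notes, each carrying a pitch and a time length. The following Python sketch shows one possible representation; the class and field names are illustrative assumptions, not part of the present disclosure:

```python
from dataclasses import dataclass

@dataclass
class Note:
    midi_pitch: int   # e.g., 60 corresponds to middle C
    duration: float   # time length of the note, in seconds

# A short melody fragment: C4, D4, E4, arranged in chronological order,
# with only a single note present at any one time point.
melody = [Note(60, 0.5), Note(62, 0.5), Note(64, 0.5)]
```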
According to an example implementation of the present disclosure, when the melody data 240 has been determined, a corresponding audio segment may be searched for, in the first media data, for each note in the melody data 240, and then the second media data is generated. Specifically, the second media data may be generated by: dividing the first media data into a plurality of audio segments based on pitch information in the first media data; and generating the second media data using the plurality of audio segments. It should be understood that each audio segment has the specified timbre, so that the second media data generated using the respective audio segments will have the specified timbre.
Hereinafter, referring to
According to an example implementation of the present disclosure, before the dividing operation, preprocessing may also be performed on the received first media data, for example, noise reduction processing may be performed to eliminate ambient noise, audio track separation may be performed to extract melody data, reverberation cancellation may be performed to obtain clean melody data, and so on. According to an example implementation of the present disclosure, the pitch at each time point may be detected based on a digital signal processing algorithm, and then the division process is performed. A portion between the initial time point and the time point 430 may be used as the audio segment 420, a portion between the time points 430 and 432 may be used as the audio segment 422, and a portion between the time points 432 and 434 may be used as the audio segment 424, and so on. In this way, a plurality of audio segments 420, 422, 424, . . . , and 426 may be obtained in a simple and efficient manner.
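As a hedged illustration of the dividing operation described above, the following Python sketch uses the pYIN pitch tracker from the librosa library to detect the pitch at each time point and cuts the audio wherever the detected pitch jumps or the voicing state changes. The one-semitone jump threshold and the function name are assumptions for illustration, not the exact algorithm of the present disclosure:

```python
import librosa
import numpy as np

def split_by_pitch(path, semitone_jump=1.0):
    """Divide audio into segments wherever the detected pitch changes."""
    y, sr = librosa.load(path, sr=None, mono=True)
    # pYIN yields a per-frame fundamental frequency and a voiced/unvoiced flag.
    f0, voiced, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)
    midi = librosa.hz_to_midi(f0)  # NaN for unvoiced frames
    hop = 512  # librosa.pyin's default hop length, in samples

    segments, start = [], 0
    for i in range(1, len(midi)):
        jump = abs(midi[i] - midi[i - 1])
        # Cut at a pitch jump of more than one semitone, or where the
        # signal switches between voiced and unvoiced.
        if (not np.isnan(jump) and jump > semitone_jump) or voiced[i] != voiced[i - 1]:
            segments.append(y[start * hop:i * hop])
            start = i
    segments.append(y[start * hop:])
    return sr, segments
```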
According to an example implementation of the present disclosure, a large number of audio segments may be obtained based on pitch detection, and high-quality audio segments may be selected from them. For example, an audio segment may be selected based on conditions such as: a time length of the target audio segment satisfying a predetermined length condition; energy of the target audio segment satisfying a predetermined energy condition; and a pitch difference of the target audio segment satisfying a predetermined pitch condition.
In particular, audio segments that are too short in time length may be discarded, and only audio segments with a time length meeting a predetermined length condition (e.g., not below 0.3 seconds or another value) are retained. Audio segments with too little energy (e.g., volume) may be discarded, and only audio segments with energy satisfying a predetermined energy condition (e.g., the root mean square of the audio segment exceeds a predetermined threshold) may be retained. Audio segments with a large pitch difference (i.e., the pitch range spanned by the audio segment) may be discarded, and only audio segments satisfying a predetermined pitch condition are retained. According to an example implementation of the present disclosure, the plurality of audio segments may be sorted in ascending order based on their pitch differences, and then the first K (e.g., 10 or another value) audio segments are selected. At this time, the pitch differences of the selected audio segments are small, and thus the selected audio segments can be mapped to the notes in the melody data 240 in a more accurate manner.
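A hedged sketch of this selection step follows. The concrete thresholds (0.3 seconds, the RMS floor, and K = 10) mirror the example values above, while computing the pitch difference via librosa's pYIN tracker is an illustrative assumption:

```python
import numpy as np
import librosa

def select_segments(segments, sr, min_len=0.3, min_rms=0.01, k=10):
    """Keep long-enough, loud-enough segments spanning a small pitch range."""
    scored = []
    for seg in segments:
        if len(seg) / sr < min_len:                # predetermined length condition
            continue
        if np.sqrt(np.mean(seg ** 2)) < min_rms:   # predetermined energy condition
            continue
        f0, voiced, _ = librosa.pyin(
            seg, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)
        if not np.any(voiced):
            continue
        midi = librosa.hz_to_midi(f0[voiced])
        pitch_diff = float(np.max(midi) - np.min(midi))  # pitch range spanned
        scored.append((pitch_diff, seg))
    # Sort ascending by pitch difference and keep the first K segments.
    scored.sort(key=lambda pair: pair[0])
    return [seg for _, seg in scored[:k]]
```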
According to one example implementation of the present disclosure, the second media data may be generated by selecting, from the plurality of audio segments, a set of audio segments respectively corresponding to the set of notes, the target note in the set of notes corresponding to a target audio segment in the set of audio segments; and creating the second media data using the set of audio segments. In this way, a set of notes may be respectively mapped to a set of audio segments in the plurality of audio segments, so that each note in the melody data may be converted into an audio segment having the specified timbre in a simple and effective manner.
More details are described with reference to
At this time, for the target note in the plurality of notes, the target note may be mapped to a certain audio segment in the plurality of audio segments. For example, the note 510 may be mapped to the audio segment 420, the note 512 may be mapped to the audio segment 422, the note 514 may be mapped to the audio segment 426, the note 516 may be mapped to the audio segment 420, and so forth.
According to an example implementation of the present disclosure, the mapping relationship may be established in a plurality of manners. For example, a corresponding target audio segment may be selected for the target note based on a random selection mode; in this case, the audio segments corresponding to the individual notes are randomly selected. For another example, a corresponding target audio segment may be selected for the target note based on a polling selection mode. Assuming that 10 audio segments are generated by the dividing operation, the first audio segment can be selected for the first note in the melody data, the second audio segment can be selected for the second note, . . . , and the first audio segment is selected again for the eleventh note, and so on.
Alternatively, and/or additionally, the time length corresponding to the target note may be compared with the time length of the target audio segment, and the audio segment with the closest time length may be selected for the target note. According to example implementations of the present disclosure, an audio segment with a matched length is selected for each note, thereby reducing the extent of the time scaling operation performed on the audio segment in subsequent operations. Alternatively, and/or additionally, the pitch corresponding to the target note may be compared with the pitch of the target audio segment, and the audio segment with the closest pitch may be selected for the target note. According to example implementations of the present disclosure, an audio segment with a matched pitch is selected for each note, thereby reducing the extent of the time and pitch adjustment performed on the audio segment in subsequent operations.
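The following Python sketch illustrates these selection modes. The `note` object is assumed to expose a pitch and a duration (e.g., the Note class sketched earlier), `seg_pitches` is an assumed precomputed average pitch per segment, and all function names are illustrative:

```python
import random

def pick_random(segments):
    # Random selection mode.
    return random.choice(segments)

def pick_polling(note_index, segments):
    # Polling (round-robin) selection mode: with 10 segments, the
    # eleventh note wraps back around to the first segment.
    return segments[note_index % len(segments)]

def pick_closest_duration(note, segments, sr):
    # Compare the note's time length with each segment's time length.
    return min(segments, key=lambda seg: abs(len(seg) / sr - note.duration))

def pick_closest_pitch(note, segments, seg_pitches):
    # seg_pitches: average MIDI pitch per segment (assumed precomputed).
    i = min(range(len(segments)),
            key=lambda j: abs(seg_pitches[j] - note.midi_pitch))
    return segments[i]
```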
With continued reference to
More details are described with reference to
It should be understood that the pitch 620 is determined by the frequency of the sound waves. Thus, at block 630, the frequency of the sound may be adjusted by resampling 632 and/or time scaling 634. Specifically, the resampling 632 may include upsampling or downsampling; in this manner, the frequency of the sound may be adjusted. Further, stretching or compressing the time length of the audio segment may also change the frequency of the sound, thereby obtaining a sound with the desired pitch in an accurate manner. For example, the pitch 650 of the note 510 may be obtained, and then an audio segment with the same pitch 650 may be obtained by the resampling 632 and/or the time scaling 634; in this case, the pitch of the obtained audio segment 640 is the same as the pitch of the note 510.
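A hedged sketch of this adjustment follows, assuming the librosa library and that the segment's average MIDI pitch has already been estimated. Resampling shifts the pitch by the required number of semitones (changing the duration as a side effect), and time scaling then restores the note's target time length:

```python
import librosa

def match_note(seg, sr, seg_midi, note_midi, note_len):
    """Give `seg` the note's pitch via resampling, then its length via time scaling."""
    # Resampling by the frequency ratio shifts the pitch: playing the
    # resampled signal back at the original rate raises or lowers it.
    ratio = 2.0 ** ((note_midi - seg_midi) / 12.0)
    shifted = librosa.resample(seg, orig_sr=sr, target_sr=int(round(sr / ratio)))
    # The shift also changed the duration; time scaling restores the
    # note's target time length.
    rate = (len(shifted) / sr) / note_len
    return librosa.effects.time_stretch(shifted, rate=rate)
```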
It should be understood that although
Alternatively, and/or additionally, a smoothing process may be performed between various audio segments to obtain a smoother musical work. With example implementations of the present disclosure, a musical work having a specified timbre and melody may be generated in an accurate manner by replacing each note in melody data with an audio segment having a specified timbre and a corresponding pitch.
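By way of a non-limiting illustration, the smoothing process may be a short crossfade at each segment boundary. The sketch below (with an assumed 10 ms fade length) ramps the tail of one segment down while ramping the head of the next segment up:

```python
import numpy as np

def crossfade_concat(segments, sr, fade_ms=10):
    """Concatenate segments with a short linear crossfade at each boundary."""
    n = int(sr * fade_ms / 1000)
    out = segments[0].astype(np.float64)
    for seg in segments[1:]:
        seg = seg.astype(np.float64)
        ramp = np.linspace(0.0, 1.0, n)
        # Ramp the tail of the accumulated audio down while ramping the
        # head of the next segment up, then append the rest.
        out[-n:] = out[-n:] * (1.0 - ramp) + seg[:n] * ramp
        out = np.concatenate([out, seg[n:]])
    return out
```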
According to an example implementation of the present disclosure, a variety of sound effect processing operations may be performed on the generated second media data, thereby improving the auditory experience of the musical work. For example, special effect processing may be performed to add a special sound effect to the second media data; for another example, gain processing may be performed to adjust the volume, and so on. In particular, a variety of audio processing techniques currently known and/or to be developed in the future may be utilized so that the musical work exhibits a better auditory effect. According to an example implementation of the present disclosure, the music template may further include accompaniment data. At this time, the accompaniment data may be added to the musical work.
Referring to
As shown in
It should be understood that the process of generating a musical work is described above with audio data as a specific example of the first media data and the second media data. Alternatively, and/or additionally, the first media data and the second media data may further include video data. At this point, the audio portion in the media data may be processed in the manner described above. Further, the video portion in the media data may be processed in a similar manner.
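As a hedged sketch of the video path, assuming the moviepy 1.x API and that the time range of each selected audio segment within the source video is known, the corresponding video sub-clips may be cut and concatenated in melody order:

```python
from moviepy.editor import VideoFileClip, concatenate_videoclips

def build_video(source_path, time_ranges, out_path="output.mp4"):
    # One video sub-clip per selected audio segment, cut at the same
    # (start, end) times and concatenated in melody order.
    source = VideoFileClip(source_path)
    clips = [source.subclip(start, end) for start, end in time_ranges]
    concatenate_videoclips(clips).write_videofile(out_path)
```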
More details are described with reference to
With example implementations of the present disclosure, assuming that the media data 840 is a video including a dog barking, and the melody data specified by the user is that of the song A, the media data 842 generated at this time includes the song A sung in the dog's voice, and the mouth shape of the dog in the video will match the mouth shape of the singing. In this way, personalized music creation may be implemented in a simpler and more efficient manner to provide more creation tools for ordinary users who do not have professional musical knowledge.
Referring to
At this time, the user may select the desired style, and the media data of the song A with the specified style and sung with vocals will be generated. Assuming that the user selects the Rock style 922, the generated media data may have a rock style; for example, the accompaniment music may have an explicit rhythm and may be accompanied by musical instruments such as guitars, basses, and drums. Assuming that the user selects the Classical style 924, the generated media data may have a classical style; for example, a piano accompaniment may be used.
With example implementations of the present disclosure, a plurality of music creation scenarios may be supported. For example, a user may utilize a specified timbre to adapt an existing song, or a user may create a song from scratch. For example, a user may use the sound of his or her own pet to adapt an existing song. For another example, a user may be supported in creating a new song, and the user only needs to upload a piece of audio and/or video containing a voice. For example, the user may edit the note data in the melody data in order to generate a new melody. In turn, completely new songs can be created by selecting different styles and/or different musical instruments.
According to one example implementation of the present disclosure, the second media data and the first media data have a same timbre and the timbre is specified by the first media data.
According to one example implementation of the present disclosure, the music template further includes accompaniment data, and the second media data further includes the accompaniment data.
According to one example implementation of the present disclosure, the music template further includes style data for specifying a music style, and the second media data further has the music style.
According to one example implementation of the present disclosure, the melody data includes a set of notes, each note in the set of notes corresponds to a respective time length, and the second media data is generated by dividing the first media data into a plurality of audio segments based on pitch information in the first media data; and generating the second media data using the plurality of audio segments.
According to an example implementation of the present disclosure, the target audio segment in the plurality of audio segments satisfies the following conditions: a time length of the target audio segment satisfying a predetermined length condition; energy of the target audio segment satisfying a predetermined energy condition; and a pitch difference of the target audio segment satisfying a predetermined pitch condition.
According to one example implementation of the present disclosure, the second media data is generated based on the following: selecting a set of audio segments respectively corresponding to the set of notes from the plurality of audio segments, a target note in the set of notes corresponding to a target audio segment in the set of audio segments; and creating the second media data using the set of audio segments.
According to an example implementation of the present disclosure, the target audio segment is selected based on at least one of the following: a random selection mode; a polling selection mode; comparison of a time length corresponding to the target note with a time length of the target audio segment; and comparison of a pitch corresponding to the target note with a pitch of the target audio segment.
According to one example implementation of the present disclosure, the second media data is created based on the following: adjusting pitches and time lengths of the set of audio segments respectively to match the set of notes; and combining the adjusted set of audio segments to generate the second media data.
According to one example implementation of the present disclosure, a pitch of the target audio segment in the set of audio segments is adjusted based on at least one of: performing resampling on the target audio segment, and scaling a time length of the target audio segment.
According to one example implementation of the present disclosure, the first media data and the second media data comprise video data, and a video portion in the second media data is generated based on the following: obtaining a set of video segments respectively corresponding to the set of audio segments; and generating the video portion in the second media data using the set of video segments.
According to an example implementation of the present disclosure, the melody data is represented by at least one of: audio data and note data.
According to one example implementation of the present disclosure, the creation request further specifies a musical instrument for creating the music, and the first media data and the second media data are played by using the musical instrument.
According to one example implementation of the present disclosure, the second media data and the first media data have a same timbre and the timbre is specified by the first media data.
According to one example implementation of the present disclosure, the music template further includes accompaniment data, and the second media data further includes the accompaniment data.
According to one example implementation of the present disclosure, the music template further includes style data for specifying a music style, and the second media data further has the music style.
According to one example implementation of the present disclosure, the melody data includes a set of notes, each note in the set of notes corresponds to a respective time length, and the generating module is further configured to divide the first media data into a plurality of audio segments based on pitch information in the first media data; and generate the second media data using the plurality of audio segments.
According to an example implementation of the present disclosure, the target audio segment in the plurality of audio segments satisfies the following conditions: a time length of the target audio segment satisfying a predetermined length condition; energy of the target audio segment satisfying a predetermined energy condition; and a pitch difference of the target audio segment satisfying a predetermined pitch condition.
According to one example implementation of the present disclosure, the generating module is further configured to: select a set of audio segments respectively corresponding to the set of notes from the plurality of audio segments, a target note in the set of notes corresponding to a target audio segment in the set of audio segments; and create the second media data using the set of audio segments.
According to an example implementation of the present disclosure, the target audio segment is selected based on at least one of the following: a random selection mode; a polling selection mode; comparison of a time length corresponding to the target note with a time length of the target audio segment; and comparison of a pitch corresponding to the target note with a pitch of the target audio segment.
According to an example implementation of the present disclosure, the generating module is further configured to: adjust pitches and time lengths of the set of audio segments respectively to match the set of notes; and combine the adjusted set of audio segments to generate the second media data.
According to one example implementation of the present disclosure, a pitch of the target audio segment in the set of audio segments is adjusted based on at least one of: performing resampling on the target audio segment, and scaling a time length of the target audio segment.
According to one example implementation of the present disclosure, the first media data and the second media data comprise video data, and the generating module is further configured to obtain a set of video segments respectively corresponding to the set of audio segments; and generate the video portion in the second media data using the set of video segments.
According to an example implementation of the present disclosure, the melody data is represented by at least one of: audio data and note data.
According to one example implementation of the present disclosure, the creation request further specifies a musical instrument for creating the music, and the first media data and the second media data are played by using the musical instrument.
As shown in
The computing device 1200 typically includes a plurality of computer storage media. Such media may be any available media that are accessible by the computing device 1200, including, but not limited to, volatile and non-volatile media, and removable and non-removable media. The memory 1220 may be a volatile memory (e.g., a register, a cache, a random access memory (RAM)), a non-volatile memory (e.g., a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a flash memory), or some combination thereof. The storage device 1230 may be a removable or non-removable medium and may include a machine-readable medium, such as a flash drive, a magnetic disk, or any other medium that can be used to store information and/or data (e.g., training data for training) and that can be accessed within the computing device 1200.
The computing device 1200 may further include additional removable/non-removable, volatile/non-volatile storage media. Although not shown in
The communication unit 1240 implements communication with other computing devices through a communication medium. Additionally, functions of components of the computing device 1200 may be implemented by a single computing cluster or a plurality of computing machines, and these computing machines can communicate through a communication connection. Thus, the computing device 1200 may operate in a networked environment using logical connections to one or more other servers, network personal computers (PCs), or another network node.
The input device 1250 may be one or more input devices, such as a mouse, a keyboard, a trackball, etc. The output device 1260 may be one or more output devices, such as a display, a speaker, a printer, etc. The computing device 1200 may also communicate with one or more external devices (not shown), such as a storage device, a display device, or the like through the communication unit 1240 as desired, and communicate with one or more devices that enable a user to interact with the computing device 1200, or communicate with any device (e.g., a network card, a modem, or the like) that enables the computing device 1200 to communicate with one or more other computing devices. Such communication may be performed via an input/output (I/O) interface (not shown).
According to an example implementation of the present disclosure, a computer-readable storage medium is provided, on which computer-executable instructions are stored, wherein the computer-executable instructions are executed by a processor to implement the method described above. According to an example implementation of the present disclosure, a computer program product is also provided, which is tangibly stored on a non-transitory computer-readable medium and includes computer-executable instructions that are executed by a processor to implement the method described above. According to the example implementations of the present disclosure, there is provided a computer program product having a computer program stored thereon, and the computer program, when executed by a processor, implements the method described above.
Aspects of the present disclosure are described herein with reference to flowcharts and/or block diagrams of methods, apparatus, devices and computer program products implemented in accordance with the present disclosure. It should be understood that each block of the flowcharts and/or block diagrams and combinations of blocks in the flowchart and/or block diagrams can be implemented by computer readable program instructions.
These computer readable program instructions may be provided to a processing unit of a general-purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium storing the instructions includes an article of manufacture that includes instructions which implement various aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
The computer readable program instructions may be loaded onto a computer, other programmable data processing apparatus, or other devices, causing a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other devices, to produce a computer-implemented process, such that the instructions, when executed on the computer, other programmable data processing apparatus, or other devices, implement the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the drawings illustrate the architecture, functionality, and operations of possible implementations of the systems, methods and computer program products according to various implementations of the present disclosure. In this regard, each block in the flowchart or block diagram may represent a module, segment, or portion of instructions which includes one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions marked in the blocks may occur in a different order than those marked in the drawings. For example, two consecutive blocks may actually be executed in parallel, or they may sometimes be executed in reverse order, depending on the function involved. It should also be noted that each block in the block diagrams and/or flowcharts, as well as combinations of blocks in the block diagrams and/or flowcharts, may be implemented using a dedicated hardware-based system that performs the specified function or operations, or may be implemented using a combination of dedicated hardware and computer instructions.
Various implementations of the present disclosure have been described above. The foregoing description is exemplary rather than exhaustive, and the present disclosure is not limited to the implementations disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the implementations described. The terms used herein were chosen to best explain the principles of the implementations, the practical application, or improvements to technologies in the marketplace, or to enable others of ordinary skill in the art to understand the implementations disclosed herein.
Number | Date | Country | Kind
202311618525.7 | Nov. 2023 | CN | national