This application claims the benefit of CN Patent Application No. 202111165277.6, filed on Sep. 30, 2021, entitled “METHOD, APPARATUS, DEVICE AND STORAGE MEDIUM FOR VIDEO RECORDING”, which is hereby incorporated by reference in its entirety.
Example embodiments of the present disclosure relate generally to the field of internet technologies, and for example, to a method, apparatus, device and storage medium for video recording.
With the development of internet technologies, many video applications support the recording of songs. While the user sings a song, the video application can record the user and share the recorded video, through the network, to a network platform of the video application.
In the related art, the manner in which the terminal device records audio data and images of the user singing a target song via the video application is relatively monotonous, and the user experience is rather poor.
Embodiments of the present disclosure provide a method, apparatus, device and storage medium for video recording, so as to record audio data and images while a user sings a song, which can enhance the interest of the recorded video and improve the user experience.
In a first aspect of the present disclosure, a method for video recording is provided. The method comprises: collecting voice data and an image of a target user; determining a match degree between the voice data and a reference audio; determining a target special effect based on the match degree; adding the target special effect to the collected image to obtain a target image; and performing audio and video encoding on the voice data and the target image to obtain a target video.
In a second aspect of the present disclosure, an apparatus for video recording is provided. The apparatus comprises:
In a third aspect of the present disclosure, an electronic device is provided. The device comprises:
In a fourth aspect of the present disclosure, a computer-readable medium is provided. The computer-readable medium stores a computer program that can be executed by a processing device to implement the method of the first aspect.
It is to be understood that the steps described in the method implementations of the present disclosure may be executed in different orders and/or in parallel. In addition, the method implementations may include additional steps and/or omit performing the steps shown. The scope of the present disclosure is not limited in this regard.
The term “including” and its variations used herein are open-ended, i.e., “including but not limited to”. The term “based on” means “at least partially based on”. The term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one additional embodiment”; and the term “some embodiments” means “at least some embodiments”. Relevant definitions of other terms will be given in the following description.
The concepts of “first” and “second” mentioned in this disclosure are only used to distinguish different devices, modules, or units, and are not used to limit the order or interdependence of the functions performed by these devices, modules, or units.
The modifiers “a/an” and “a plurality of” mentioned in the present disclosure are illustrative rather than restrictive. Those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as “one or more”.
The names of the messages or information exchanged between multiple devices in the implementations of the present disclosure are for illustrative purposes only and are not intended to limit the scope of these messages or information.
Step 110: collect voice data and an image of a target user.
The voice data can be speech generated by the user imitating a certain reference audio or singing a target song. The image can be a half-length or full-length portrait (including a face) of the target user.
In this embodiment, the target user can trigger a recording instruction to cause the terminal device to collect voice data and images. The target user may trigger the recording instruction by clicking a recording button on an interface, by speech, or by gesture. In this embodiment, when the terminal device receives the recording instruction triggered by the target user, a speech collection module (e.g., a microphone) and an image collection module (e.g., a camera) are activated and begin working so as to collect the voice data and images of the singing target user. A sketch of this start-up procedure follows.
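By way of non-limiting illustration, the following sketch starts a speech collection module and an image collection module once a recording instruction has been received. The choice of OpenCV and sounddevice, the sample rate, the frame rate, and the function name are assumptions for illustration only, not the actual implementation.

```python
# A minimal sketch of starting the capture modules; library choices,
# sample rate, and frame rate are illustrative assumptions.
import cv2
import sounddevice as sd

SAMPLE_RATE = 16000  # assumed microphone sample rate

def start_recording(duration_s: float = 5.0):
    """Collect voice data and camera frames for `duration_s` seconds."""
    # Start the speech collection module (microphone); records in background.
    voice = sd.rec(int(duration_s * SAMPLE_RATE),
                   samplerate=SAMPLE_RATE, channels=1)

    # Start the image collection module (camera) and grab frames.
    camera = cv2.VideoCapture(0)
    frames = []
    while len(frames) < int(duration_s * 30):  # ~30 fps, illustrative
        ok, frame = camera.read()
        if not ok:
            break
        frames.append(frame)
    camera.release()

    sd.wait()  # block until the audio buffer is filled
    return voice, frames
```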
For example, before collecting the voice data and images of the target user, the method further comprises: receiving a reference audio selected by the target user; segmenting the reference audio to obtain a plurality of sub-audios; and sequentially playing the plurality of sub-audios according to timestamps, to cause the target user to imitate the played sub-audio to input voice.
The reference audio can be a song, an audio played by a musical instrument, or an animal sound, which is not limited here. The reference audio can be segmented by duration, for example, divided into a segment every 5 seconds. Alternatively, if the reference audio contains text content, it can be divided according to the text content: the text content is first divided into sentences, and the reference audio is segmented according to the sentence boundaries. A sketch of the duration-based segmentation follows.
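By way of non-limiting illustration, the following sketch performs the duration-based segmentation described above. It assumes the pydub library is available; the 5-second segment length follows the example in this paragraph.

```python
# A minimal sketch of duration-based segmentation, assuming pydub.
from pydub import AudioSegment

def segment_by_duration(path: str, segment_ms: int = 5000):
    """Split the reference audio into sub-audios of `segment_ms` each."""
    audio = AudioSegment.from_file(path)
    # pydub slices by milliseconds; the last segment may be shorter.
    return [audio[start:start + segment_ms]
            for start in range(0, len(audio), segment_ms)]
```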
In this embodiment, after segmenting the reference audio, sub-audios are played segment by segment according to timestamps, to cause the target user to imitate the sub-audio segment by segment to input voice.
For example, if the reference audio is a target song, the process of segmenting the reference audio to obtain the plurality of sub-audios can be: receiving the target song selected by the target user; obtaining a musical instrument digital interface (MIDI) file and lyrics of the target song; and decomposing the lyrics to obtain a plurality of sub-lyrics. The process of sequentially playing the plurality of sub-audios according to timestamps to cause the target user to imitate the played sub-audio to input voice can be: sequentially playing the plurality of sub-lyrics and the MIDI file according to timestamps, to cause the target user to sing the target song according to melodies corresponding to the MIDI file and the played sub-lyrics.
The target song can be a song selected by the user from a song library, and the MIDI file can be understood as a music file in MIDI format. Playing the MIDI file according to timestamps can be understood as playing the melodies corresponding to the MIDI file sequentially according to timestamps; playing the plurality of sub-lyrics sequentially according to timestamps can be understood as presenting the plurality of sub-lyrics on the interface sequentially according to timestamps. A sketch of this timestamp-based presentation follows.
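By way of non-limiting illustration, the following sketch presents sub-lyrics sequentially according to timestamps. The (timestamp, lyric) pairs and the display callback are hypothetical placeholders for the application's own interface layer.

```python
# A minimal sketch of timestamp-driven sub-lyric presentation; the
# data format and `show` callback are hypothetical placeholders.
import time

def play_sub_lyrics(sub_lyrics, show):
    """`sub_lyrics` is a list of (start_time_s, text) sorted by time;
    `show` presents one sub-lyric on the interface."""
    t0 = time.monotonic()
    for start_s, text in sub_lyrics:
        # Wait until this sub-lyric's timestamp is reached.
        delay = start_s - (time.monotonic() - t0)
        if delay > 0:
            time.sleep(delay)
        show(text)

# Usage example with a stand-in display callback:
# play_sub_lyrics([(0.0, "line one"), (4.2, "line two")], print)
```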
Step 120: determine a match degree between the voice data and the reference audio.
The match degree may be characterized by a similarity between the voice data and the reference audio. In this embodiment, if the similarity between the voice data and the reference audio is less than a preset threshold (or below a preset threshold range), the match degree is low; if the similarity is greater than or equal to the preset threshold (or within the preset threshold range), the match degree is high.
For example, the match degree between the voice data and the reference audio can be determined by: extracting a voice feature of the voice data and an audio feature of the reference audio; determining the similarity between the voice feature and the audio feature; and determining the similarity as the match degree between the voice data and the reference audio.
The audio feature can be characterized by a pitch difference sequence. The process of extracting a voice feature of the voice data and an audio feature of the reference audio can be as follows: segmenting the voice data into notes and quantizing the notes, and establishing a pitch difference sequence of the voice data based on the quantized notes to obtain a speech pitch difference sequence, i.e., the voice feature; and obtaining a reference pitch difference sequence of the reference audio, i.e., the audio feature. Then, a plurality of distances between the speech pitch difference sequence and the reference pitch difference sequence are calculated, and the plurality of distances are synthesized to obtain the similarity between the voice feature and the audio feature.
The plurality of distances can include a pitch sequence distance, a duration sequence distance, and an overall match distance. The plurality of distances can be synthesized by calculating a weighted sum of the plurality of distances. The reference pitch difference sequence can be obtained from the song library using a dynamic time warping (DTW) algorithm. A sketch of this similarity computation follows.
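By way of non-limiting illustration, the following sketch computes a match degree from pitch difference sequences using a hand-rolled DTW distance and a weighted sum of the three distances named above. The weight values, the distance-to-similarity mapping, and the function names are assumptions for illustration, not the actual implementation.

```python
# A minimal sketch of the similarity computation, assuming pitch and
# duration sequences are already extracted as arrays.
import numpy as np

def pitch_diff_sequence(pitches) -> np.ndarray:
    """Build a pitch difference sequence from quantized note pitches."""
    return np.diff(np.asarray(pitches))

def dtw_distance(a, b) -> float:
    """Classic dynamic time warping distance between two 1-D sequences."""
    a, b = np.asarray(a), np.asarray(b)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return float(cost[n, m])

def match_degree(voice_pitches, ref_pitches, voice_durs, ref_durs,
                 weights=(0.5, 0.3, 0.2)):
    """Weighted sum of pitch, duration, and overall distances, mapped
    to a similarity in [0, 1] (the mapping is an assumed choice)."""
    d_pitch = dtw_distance(pitch_diff_sequence(voice_pitches),
                           pitch_diff_sequence(ref_pitches))
    d_dur = dtw_distance(voice_durs, ref_durs)
    d_all = dtw_distance(voice_pitches, ref_pitches)
    total = sum(w * d for w, d in zip(weights, (d_pitch, d_dur, d_all)))
    return 1.0 / (1.0 + total)  # larger distance = lower match degree
```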
Step 130: determine a target special effect based on the match degree.
The special effect can be a particular effect added to the collected image. The special effects include a reward special effect and a punishment special effect: when the match degree exceeds a certain value, the reward special effect can be selected, and when the match degree is lower than a certain value, the punishment special effect can be selected. As an example, the reward special effect can be: retouching the target user, adding cute stickers, or beautifying the scene; the punishment special effect can be: turning the target user into a big head, fattening up the target user, adding kuso scenes, etc. In this embodiment, the special effects can be stored in the form of special effect packages (program packages). Program code for performing special effect processing on the image is written in the special effect package, and image special effects can be added by calling the special effect package. Determining the target special effect based on the match degree can thus be understood as calling the special effect package corresponding to the target special effect based on the match degree.
For example, the method further comprises: establishing an association between match degrees and special effects in advance. The target special effect can then be determined as the special effect corresponding to the match degree based on the association.
For example, the target special effect can also be determined based on the match degree by: extracting features of the target user in the collected image to obtain feature information of the target user; and determining the target special effect according to the feature information and the match degree.
The feature information can be information such as clothing features (such as color and style) of the target user. For example, the target special effect can be determined according to the feature information and the match degree as follows: first, obtaining a set of special effects corresponding to the feature information, and then selecting, from the set of special effects, the target special effect corresponding to the match degree. A sketch of this selection follows.
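By way of non-limiting illustration, the following sketch selects a target special effect from a pre-established association between feature information and sets of special effects. The effect names, the clothing-color features, and the threshold value are hypothetical examples, not the actual association.

```python
# A minimal sketch of effect selection; names and threshold are
# hypothetical examples.
REWARD_THRESHOLD = 0.7  # assumed preset threshold for the match degree

# Pre-established association: feature info -> (reward, punishment) effects.
EFFECT_SETS = {
    "red_clothing":  ("sparkle_sticker", "big_head"),
    "blue_clothing": ("scene_beautify", "fatten_up"),
}

def determine_target_effect(feature_info: str, match_degree: float) -> str:
    reward, punishment = EFFECT_SETS.get(
        feature_info, ("scene_beautify", "big_head"))
    # A high match degree selects the reward effect; a low one selects
    # the punishment effect.
    return reward if match_degree >= REWARD_THRESHOLD else punishment
```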
Step 140: add the target special effect to the collected image to obtain the target image.
In this embodiment, the special effects may be stored in the form of special effect packages (program packages); program code for performing special effect processing on the image is written in the special effect package, and the image special effects can be added by calling the special effect package.
For example, the target special effect can be added to the collected image to obtain the target image by calling the special effect package corresponding to the target special effect to perform special effect processing on the collected image.
The special effect package is pre-developed by developers, and the special effect package corresponding to the target special effect is called through a calling interface to realize the special effect processing on the image; a sketch of such a calling interface follows.
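By way of non-limiting illustration, the following sketch models a special effect package as a callable invoked through a simple calling interface. The two stand-in effects are implemented with basic OpenCV operations for illustration and are not the actual pre-developed effect packages.

```python
# A minimal sketch of special-effect packages as callables; the two
# example effects are illustrative stand-ins built on OpenCV.
import cv2
import numpy as np

def scene_beautify(frame: np.ndarray) -> np.ndarray:
    """Reward effect stand-in: soften and slightly brighten the frame."""
    soft = cv2.GaussianBlur(frame, (5, 5), 0)
    return cv2.convertScaleAbs(soft, alpha=1.05, beta=15)

def big_head(frame: np.ndarray) -> np.ndarray:
    """Punishment effect stand-in: enlarge the upper third of the frame."""
    h, w = frame.shape[:2]
    head = cv2.resize(frame[: h // 3], (w, h // 2))
    out = frame.copy()
    out[: h // 2] = head
    return out

# Calling interface: look up the package by name and apply it.
EFFECT_PACKAGES = {"scene_beautify": scene_beautify, "big_head": big_head}

def apply_target_effect(effect_name: str, frame: np.ndarray) -> np.ndarray:
    return EFFECT_PACKAGES[effect_name](frame)
```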
For example, the target special effect can be added to the images collected between the time the current match degree is determined and the time the next match degree is determined; or the target special effect can be added to a predetermined number of images collected from the time the current match degree is determined.
In this embodiment, a match degree can be calculated for the singing audio data each time N segments of sub-audio have been imitated, where N is a positive integer greater than or equal to 1. This timing logic is sketched below.
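By way of non-limiting illustration, the following sketch shows the timing logic described above: a match degree is recomputed after every N imitated sub-audios, and the resulting target special effect is applied to the frames collected until the next update. The callbacks compute_match, pick_effect, and apply_effect are hypothetical stand-ins for the steps described in this embodiment.

```python
# A minimal sketch of the per-N-sub-audios timing logic; all callbacks
# are hypothetical stand-ins for the steps described above.
def record_with_effects(sub_audio_stream, frame_stream, N,
                        compute_match, pick_effect, apply_effect):
    """`frame_stream` yields the frames collected during each sub-audio."""
    processed = []
    current_effect = None
    imitated = 0
    for sub_audio, frames in zip(sub_audio_stream, frame_stream):
        imitated += 1
        if imitated % N == 0:
            # Recompute the match degree and switch the target effect.
            current_effect = pick_effect(compute_match(sub_audio))
        for frame in frames:
            processed.append(apply_effect(current_effect, frame)
                             if current_effect else frame)
    return processed
```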
Step 150: perform audio and video encoding on the voice data and the target image to obtain a target video.
For example, after the target image with the added special effect is obtained, the singing audio data and the target image are audio-and-video encoded to obtain the target video; a sketch of this encoding step follows.
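By way of non-limiting illustration, the following sketch performs the final encoding step by writing the special-effect frames to a video track with OpenCV and muxing in the recorded voice data with the ffmpeg command-line tool (assumed to be installed). The file names, codec choices, and frame rate are assumptions for illustration.

```python
# A minimal sketch of audio and video encoding; file names, codecs,
# and frame rate are illustrative assumptions.
import subprocess
import cv2

def encode_target_video(frames, audio_wav_path, out_path, fps=30):
    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter("video_only.mp4",
                             cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in frames:
        writer.write(frame)
    writer.release()
    # Mux the recorded voice data with the special-effect video track.
    subprocess.run(["ffmpeg", "-y", "-i", "video_only.mp4",
                    "-i", audio_wav_path, "-c:v", "copy", "-c:a", "aac",
                    out_path], check=True)
```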
For example, the solution of this embodiment is also applicable to a multi-person chorus scene. During the multi-person chorus, a match degree can be calculated for each user participating in the singing, respectively, and a reward special effect or a punishment special effect can be added to the image based on the match degree. The specific process can be found in the above embodiments and is not repeated here.
In the technical solutions of the embodiments of the present disclosure, voice data and an image of a target user are collected; a match degree between the voice data and a reference audio is determined; a target special effect is determined based on the match degree; the target special effect is added to the collected image to obtain a target image; and the voice data and the target image are audio-and-video encoded to obtain a target video. With the method for video recording provided by the embodiments of the present disclosure, the special effect obtained based on the match degree is added to the collected image, which can improve the interest of video recording, enrich the presentation mode of the video and improve the user experience.
For example, the apparatus for video recording further comprises: a reference audio playing module, configured for:
For example, if the reference audio is a target song, the reference audio playing module is further configured for:
For example, the match degree determining module 220 is further configured for:
For example, the apparatus for video recording further comprises: an association establishment module, configured for:
For example, the target special effect determining module 230 is further configured for:
For example, the target special effect determining module 230 is further configured for:
For example, the target image obtaining module 240 is further configured for:
For example, the target image obtaining module 240 is further configured for:
The apparatus may perform the method provided in all the foregoing embodiments of the present disclosure, and has corresponding functional modules for performing the method and corresponding beneficial effects. For technical details not described in detail in this embodiment, reference may be made to the method provided in all the foregoing embodiments of the present disclosure.
Referring now to
As shown in
Typically, the following devices can be connected to the I/O interface 305: input devices 306 including touch screens, touchpads, keyboards, mice, cameras, microphones, accelerometers, gyroscopes, etc.; output devices 307 including liquid crystal displays (LCDs), speakers, vibrators, etc.; storage devices 308 including magnetic tapes, hard disks, etc.; and communication devices 309. The communication device 309 can allow the electronic device 300 to communicate wirelessly or by wire with other devices to exchange data. Although
According to embodiments of the present disclosure, the process described above with reference to the flowchart can be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product that includes a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated by the flowchart. In such embodiments, the computer program can be downloaded and installed from the network through the communication device 309, or installed from the storage device 308, or installed from the ROM 302. When the computer program is executed by the processing device 301, the above functions defined in the method of the present disclosure are performed. The computer-readable medium can be a non-transitory computer-readable medium.
It should be noted that the computer-readable medium described above in the present disclosure can be a computer-readable signal medium or a computer-readable storage medium, or any combination thereof. The computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media can include but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof. In the present disclosure, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium can include a data signal propagated in a baseband or as part of a carrier wave, which carries computer-readable program code. Such propagated data signals can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in conjunction with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to: wires, optical cables, RF (radio frequency), etc., or any suitable combination thereof.
In some embodiments, the client and server may communicate using any currently known or future developed network protocol such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks (“LANs”), wide area networks (“WANs”), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer-readable medium can be included in the electronic device, or it can exist alone and not assembled into the electronic device.
The computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device is caused to: collect voice data and an image of the target user; determine the match degree between the voice data and the reference audio; determine the target special effect based on the match degree; add the target special effect to the collected image to obtain the target image; and perform audio and video encoding on the voice data and the target image to obtain the target video.
Computer program code for performing the operations of the present disclosure may be written in one or more programming languages, or combinations thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the “C” language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as a standalone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., using an Internet service provider to connect via the Internet).
The flowcharts and block diagrams in the accompanying drawings illustrate the architectures, functions, and operations of systems, methods, and computer program products that may be implemented in accordance with various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than indicated in the figures. For example, two blocks represented in succession may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented using a dedicated hardware-based system that performs the specified functions or operations, or may be implemented using a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by way of software or by way of hardware, and the name of a unit does not constitute a limitation on the unit itself in some cases.
The functions described above herein may be performed at least in part by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), system-on-chip (SOCs), complex programmable logic devices (CPLDs), and the like.
In the context of the present disclosure, a machine-readable medium can be a tangible medium that can contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. Machine-readable media can include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination thereof. More specific examples of machine-readable storage media may include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof.
According to one or more embodiments of the present disclosure, a method for video recording is disclosed, comprising:
For example, before the collecting of the voice data and the image of the target user, the method further comprises:
For example, if the reference audio is a target song, the segmenting of the reference audio to obtain the plurality of sub-audios comprises:
For example, the determining the match degree between the voice data and the reference audio comprises:
For example, the method further comprises:
The determining a target special effect based on the match degree comprises:
For example, the determining the target special effect based on the match degree comprises:
For example, the adding the target special effect to the collected image comprises:
For example, the adding the target special effect to the collected image to obtain a target image comprises:
| Number | Date | Country | Kind |
|---|---|---|---|
| 202111165277.6 | Sep 2021 | CN | national |
| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/CN2022/118698 | 9/14/2022 | WO | |