This application claims priority to Chinese Patent Application No. 202111176742.6 filed with the Chinese Patent Office on Oct. 9, 2021, which is incorporated herein by reference in its entirety.
The present disclosure relates to the field of image processing technology, for example, to a video cover generation method, apparatus, electronic device, and readable medium.
A video cover is a form of displaying the key content of a video, and it is also the collection of information that a user receives at first glance when viewing the video display page. It plays an important role in attracting users to watch videos. Usually, a frame image from the video can be used as the video cover, but this form is relatively simple, and the amount of information that the video cover can reflect is very small, which is not conducive to helping users quickly understand the key content of the video.
In order to display more video content, the video cover can also be an elaborate image designed manually. Covers of this kind take relatively diverse forms and can reflect more information. However, the design process requires certain professional tools (such as PhotoShop), the video cover cannot be generated automatically, and the whole process is time-consuming and laborious. In some scenarios, a dynamic cover is generated using multiple frames of images in the video, generally from the most exciting segment in the video, which has better expressive ability than a static cover. However, the corresponding algorithm complexity is higher, the model for generating dynamic covers generally requires a large amount of annotated data that is difficult to annotate, the process is also very time-consuming and labor-intensive, and a dynamic cover occupies more storage space than a static cover. In summary, existing video cover generation methods are time-consuming, labor-intensive, and costly, and the efficiency of generating video covers is relatively low.
The present disclosure provides a video cover generation method, apparatus, electronic device, and readable medium, so as to display rich video content in the cover and improve the efficiency of generating a video cover.
The present disclosure provides a method for generating a video cover, comprising:
Extracting at least two key frames in the video, wherein the key frame comprises feature information to be displayed in a cover;
According to an action relevance of the at least two key frames, fusing the feature information in the at least two key frames in a single image to generate a cover of the video, wherein the action relevance comprises being relevant or being irrelevant.
The present disclosure also provides a video cover generation apparatus, comprising:
An extraction module, configured to extract at least two key frames in the video, wherein the key frame comprises feature information of the video;
A generation module, configured to fuse, according to an action relevance of the at least two key frames, the feature information in the at least two key frames in a single image to generate a cover of the video, wherein the action relevance comprises being relevant or being irrelevant.
The present disclosure also provides an electronic device comprising:
One or more processors;
A storage device configured to store one or more programs;
When the one or more programs are executed by the one or more processors, the one or more processors implement the above video cover generation method.
The present disclosure also provides a computer-readable medium on which a computer program is stored, which implements the above-mentioned video cover generation method when executed by a processor.
Embodiments of the present disclosure will be described below with reference to the accompanying drawings. Although some embodiments of the present disclosure are shown in the drawings, the present disclosure can be implemented in various forms, and these embodiments are provided for understanding the present disclosure. The drawings and embodiments of the present disclosure are for illustrative purposes only.
The multiple steps described in the method embodiments of the present disclosure can be executed in different orders and/or in parallel. In addition, the method embodiments can include additional steps and/or omit the steps shown. The scope of the present disclosure is not limited in this regard.
The term “including” and its variations used herein are open-ended, i.e., “including but not limited to”. The term “based on” means “at least partially based on”. The term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one additional embodiment”; and the term “some embodiments” means “at least some embodiments”. Relevant definitions of other terms will be given in the following description.
The concepts of “first” and “second” mentioned in this disclosure are only used to distinguish different devices, modules, or units, and are not used to limit the order or interdependence of the functions performed by these devices, modules, or units.
The names of the messages or information exchanged between multiple devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of these messages or information.
In the following embodiments, each embodiment provides optional features and examples; multiple features described in an embodiment may be combined to form multiple solutions, and each numbered embodiment should not be regarded as defining only one technical solution.
As shown in
S110, extracting at least two key frames in the video, wherein the key frame contains feature information to be displayed in the cover.
In this embodiment, the video comprises multiple frames of images, and the video can be captured or uploaded by the user, or downloaded from the network. Key frames mainly refer to frames that can reflect the key content or scene changes in the multi-frame images, such as frames containing the main characters in the video, frames belonging to exciting or classic clips, frames with obvious changes in the scene, and frames containing key actions of the characters, all of which can be used as key frames. Key frames can be selected through image similarity clustering and image quality evaluation of multi-frame images in the video, and key frames can also be obtained by recognizing actions or behaviors in the video.
Feature information may be used to describe the video content reflected by the key frame, such as a hue of the key frame, expression or action features of a character in the key frame, or real-time subtitles matching the key frame. Displaying the feature information of the key frames in the cover can attract viewers and help them quickly understand the video content.
In the present embodiment, the number of key frames extracted from the video is at least two. On this basis, different key frames can be used to provide a variety of feature information for generating the cover, so that the content displayed in the cover is richer.
S120. According to the action relevance of at least two key frames, fusing the feature information in the at least two key frames into a single image to generate a cover of the video, wherein the action relevance includes being relevant or being irrelevant.
In this embodiment, the action relevance of multiple key frames can be an attribute that describes whether instances in different key frames have completed effective actions or behaviors. If it can be recognized from the video that instances in multiple frames have completed effective actions or behaviors, then these frames can be used as key frames, and the action relevance between these key frames is being relevant; if effective actions or behaviors cannot be recognized, the action relevance is being irrelevant. Here, effective actions or behaviors can refer to actions or behaviors that a machine learning model can automatically recognize based on a preset behavior library, such as running, jumping, walking, waving, or bending over. The preset behavior library stores the feature sequences of relevant actions or behaviors in multiple frames, so that they can be learned and recognized by the machine learning model.
Action relevance is related not only to whether effective actions or behaviors can be recognized, but also to the degree of background difference between these frames. The degree of background difference can include a difference of the scene contents and a difference of the hue in these frames. For example, in multiple key frames, although the characters are all running, the scenes in the first few frames are parks and the scenes in the next few frames are indoor, indicating that the running actions in the video did not occur in the same time period; therefore, the background difference of these key frames is large, and the action relevance is being irrelevant. As another example, if the scenes of two key frames are both parks, but one key frame is a daytime image and the other is a nighttime image, and there is a significant difference in hue, the action relevance is also being irrelevant.
In this embodiment, the action relevance affects the fusion mode of feature information of multiple key frames. For example, if the action relevance of multiple key frames is being relevant, character instances can be extracted from each key frame, and these character instances can be added to the same background. The background can be the background of any key frame or the background generated based on at least two key frames. In this case, a single static image can be used as the cover to show an action or behavior completed by the character instance in the video. Compared with the way of displaying actions or behaviors using dynamic images, it effectively reduces computing resources and storage space occupation. If the action relevance of multiple key frames is being irrelevant, multiple key frames can be cropped, scaled, spliced, etc., and all or part of the feature information in multiple key frames can be fused in a single image.
For example, for each key frame, the character instances can be extracted and added to the same background, which can be the background of any key frame or the background generated based on at least two key frames; if the action relevance of multiple key frames is being relevant, multiple character instances can be arranged in chronological order (such as from left to right, or from right to left, etc.), and the arrangement position of each character instance in the background is consistent with its relative position in the original key frame, making it easier to understand the pose of the character instance visually; if the action relevance of multiple key frames is being irrelevant, the instances in each key frame can be arranged freely without the need to arrange in chronological order or maintain the consistency between the arrangement position and the relative position.
For example, for each key frame, the character instances can be extracted and added to the same background. If the action relevance of multiple key frames is being relevant, the background can be generated based on the backgrounds of multiple key frames, thereby maintaining the consistency of style between the generated background and the backgrounds of the multiple key frames during the action process, which can restore the background of the action process as much as possible and make it easier for viewers to understand the actions that occur against the background. If the action relevance of multiple key frames is being irrelevant, there is no need to consider the consistency of the background style with the background during the action process; any key frame, or an image other than those in the video (such as a solid color image, an image uploaded or selected by the viewer, or a template image), can be used as the background, and the character instances in the other key frames can be arranged in it.
An action sequence recognition algorithm can be used to determine the action relevance of multiple key frames. For example, the human pose estimation (OpenPose) algorithm is used to estimate the pose of a character in a video. Firstly, the position coordinates of the human joint points in each frame of the video are extracted, and the distance change matrix of the human joint points between two adjacent frames of images is calculated on this basis. Then, the video is segmented and video features are generated using the distance change matrix corresponding to each video segment. Finally, a trained classifier is used to classify the video features. If it can be recognized that the video features corresponding to a video segment belong to an action or behavior feature sequence in the preset behavior library, then the frames corresponding to this video segment are key frames, and the action relevance of these key frames is being relevant. As another example, an instance segmentation algorithm is used to extract the contours of the characters in each key frame and express their posture, a clustering algorithm is used to extract the key features of the posture, and based on these key features, the dynamic time warping (DTW) algorithm is used to complete the action recognition.
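As an illustration of the pose-based determination described above, the following is a minimal sketch, assuming that per-frame joint coordinates have already been extracted by a pose estimator such as OpenPose; the classifier and preset behavior library referenced in the comments are hypothetical placeholders rather than parts of the disclosed method.

```python
import numpy as np

def joint_distance_matrix(joints: np.ndarray) -> np.ndarray:
    """Pairwise Euclidean distances between J joints, given an array of shape (J, 2)."""
    diff = joints[:, None, :] - joints[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def distance_change_features(joint_seq: np.ndarray) -> np.ndarray:
    """For a sequence of per-frame joint coordinates (T, J, 2), return the
    distance-change matrices between adjacent frames, flattened per frame pair."""
    mats = np.stack([joint_distance_matrix(f) for f in joint_seq])  # (T, J, J)
    change = np.abs(np.diff(mats, axis=0))                          # (T-1, J, J)
    return change.reshape(len(change), -1)                          # one feature vector per frame pair

# Hypothetical usage: joint_seq comes from a pose estimator, clf is a trained classifier,
# and preset_behavior_labels is the set of action labels stored in the behavior library.
# features = distance_change_features(joint_seq).mean(axis=0)
# is_valid_action = clf.predict([features])[0] in preset_behavior_labels
```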
According to the video cover generation method of the present embodiment, by fusing the feature information of a plurality of key frames in a single image, a single static image can show rich video content, less computing resources and storage space are occupied, and the cover is generated more efficiently. Further, the cover can attract viewers and help them quickly understand the video content. In addition, when the feature information of different key frames is fused, the action relevance of the plurality of key frames is considered; the action relevance affects the fusion mode of the feature information of the plurality of key frames, so the generated video cover can be more flexible and diverse.
In the present embodiment, extracting at least two key frames in the video comprises: recognizing action sequence frames based on an action recognition algorithm and using each action sequence frame as a key frame, wherein the action relevance is being relevant. On this basis, the correlated action sequence frames can be fused to display video content about a complete action or behavior in a static cover.
As shown in
S210, based on the action recognition algorithm, recognizing at least two action sequence frames in the video, and using each action sequence frame as a key frame.
In this embodiment, using the action recognition algorithm, multiple valid action sequence frames can be recognized from the video, and the character instances in the action sequence frames can express a complete action or behavior in chronological order. Here, the action recognition algorithm can be implemented through the Temporal Shift Module (TSM) model, which is trained based on the Kinetics-400 dataset and can be used to recognize 400 actions, which can meet the needs of recognizing and displaying the actions of instances on the cover.
When multiple valid action sequence frames are recognized, the degree of background difference between multiple action sequence frames can be determined. If the degree of background difference is within an allowable range, the action relevance is determined as being relevant, and multiple action sequence frames can be segmented and fused to obtain a cover that can express actions or behaviors.
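The source does not specify how the degree of background difference is measured; one possible sketch, given here only as an assumption for illustration, compares hue histograms of the frames, with the threshold chosen freely.

```python
import cv2
import numpy as np

def background_difference(frame_a: np.ndarray, frame_b: np.ndarray) -> float:
    """Rough background-difference score between two BGR frames,
    based on the distance between their hue histograms."""
    def hue_hist(img):
        hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0], None, [32], [0, 180])
        return cv2.normalize(hist, hist).flatten()
    # Bhattacharyya distance: 0 for identical hue distributions, 1 for disjoint ones.
    return cv2.compareHist(hue_hist(frame_a), hue_hist(frame_b), cv2.HISTCMP_BHATTACHARYYA)

# Frames whose pairwise score stays below a chosen threshold (e.g. 0.3) can be
# treated as having a background difference within the allowable range.
```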
S220, performing instance segmentation on each action sequence frame to obtain feature information of each action sequence frame, wherein the feature information includes an instance and a background.
The main purpose of instance segmentation is to separate the instance in each action sequence frame from the background. Multiple instances can be fused into the same background to represent a complete action or behavior, and multiple backgrounds can be used to generate the cover background. For example, the Segmenting Objects by Locations (SOLO) algorithm is used to segment each action sequence frame. The SOLOv2 algorithm can be used to segment instances by position and size, which has high accuracy and real-time performance, and can improve the efficiency of generating video covers.
S230, generating a cover background based on the background of at least two action sequence frames.
In this embodiment, the cover background mainly refers to the background used for arranging the instances of multiple action sequence frames, and it can be generated based on the backgrounds of multiple action sequence frames. For example, the mean of the pixel values of the backgrounds of multiple action sequence frames at each position may be taken to obtain the cover background; this way is relatively simple and suitable for situations with a large number of action sequence frames. As another example, the background with the highest image quality among the backgrounds of the multiple action sequence frames, or the background corresponding to the first action sequence frame, the last action sequence frame, or the action sequence frame located in the middle, is selected as the cover background; this method is also easy to implement, but the degree of fusion of the backgrounds of the multiple action sequence frames is relatively low. As yet another example, in the background of each action sequence frame, the part where the instance is cut out is a blank area; the backgrounds of the other action sequence frames can be used to fill this blank area, and then one of the filled action sequence frames is selected as the cover background, or the mean of the filled action sequence frames is taken as the cover background. This method can take into account both the quality and the integration of different backgrounds. On this basis, by integrating the characteristics of multiple action sequence frames, the style consistency between the cover and the backgrounds of multiple key frames during the action is ensured, which makes it convenient for viewers to accurately understand the video content.
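For the first, mean-based option, a minimal sketch is given below; it assumes that a boolean instance mask is available for each aligned action sequence frame so that cut-out pixels do not contribute to the per-position mean.

```python
import numpy as np

def average_background(frames: list[np.ndarray], instance_masks: list[np.ndarray]) -> np.ndarray:
    """Average the backgrounds of several action sequence frames pixel by pixel.
    instance_masks[i] is a boolean (H, W) array that is True where the instance was
    cut out, so those pixels are excluded from the mean at that position."""
    stack = np.stack(frames).astype(np.float32)                              # (N, H, W, 3)
    bg_weight = (~np.stack(instance_masks))[..., None].astype(np.float32)    # (N, H, W, 1)
    summed = (stack * bg_weight).sum(axis=0)
    counts = np.clip(bg_weight.sum(axis=0), 1.0, None)                       # avoid division by zero
    return (summed / counts).astype(np.uint8)
```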
S240, fusing at least two instances of action sequence frames in the cover background to obtain a single image, and using the single image as the cover of the video.
In this embodiment, the instances of the plurality of action sequence frames are added to the cover background, so that a single static image can be used to show the complete action integrated from multiple frames. In this process, according to the relative position of the instance of each action sequence frame in the original action sequence frame, each instance can be added to the corresponding position in the cover background, ensuring that the relative position of each instance is consistent with its position during the action process and achieving a better visualization effect.
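A minimal sketch of this fusion step is shown below, assuming aligned frames and boolean instance masks from the segmentation step; instances are copied onto the cover background at the pixel coordinates they occupied in their original frames. Where instances overlap, the instance pasted last overwrites the earlier ones; the source does not prescribe an overlap rule, so this is only an illustrative choice.

```python
import numpy as np

def paste_instances(cover_bg: np.ndarray,
                    frames: list[np.ndarray],
                    instance_masks: list[np.ndarray]) -> np.ndarray:
    """Add each frame's instance onto the cover background at the same pixel
    coordinates it occupied in its original action sequence frame."""
    cover = cover_bg.copy()
    for frame, mask in zip(frames, instance_masks):
        cover[mask] = frame[mask]     # mask is a boolean (H, W) array of instance pixels
    return cover
```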
Before generating a cover background based on the backgrounds of at least two action sequence frames, the method further includes: selecting an action sequence frame as a reference frame, and determining an affine transformation matrix between each action sequence frame and the reference frame according to a feature point matching algorithm; and aligning the background of each action sequence frame with the background of the reference frame according to the affine transformation matrix.
Due to different angles, jitter, or errors in video shooting, the backgrounds of different action sequence frames are not aligned. Directly using the backgrounds of multiple action sequence frames to generate cover backgrounds may cause local distortion, deformation, or blur, which affects the accuracy and visual effect of the background. Therefore, before generating cover backgrounds based on the backgrounds of multiple action sequence frames, an action sequence frame can be selected as a reference frame, and the background of each other action sequence frame can be aligned with the background of the reference frame. The reference frame can be the action sequence frame with the highest image quality, the first action sequence frame, the last action sequence frame, or the action sequence frame located in the middle.
In this embodiment, the affine transformation matrix between each action sequence frame and the reference frame is determined according to the feature point matching algorithm, wherein the affine transformation matrix is used to describe the transformation relationship between the matched feature points from the action sequence frame to the reference frame, and the affine transformation includes linear transformation and translation transformation. The feature point matching algorithm can be a Scale-invariant Feature Transform (SIFT) algorithm, which first extracts the key feature points in the background of each action sequence frame. These key feature points will not disappear due to factors such as illumination, scale, and rotation. Then, according to the feature vector of each key point, the key points in the action sequence frame and the reference frame are compared in pairs to find multiple pairs of feature points that match each other between the action sequence frame and the reference frame, thereby establishing the correspondence between feature points and obtaining an affine transformation matrix.
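A sketch of this alignment step with OpenCV is given below, assuming BGR input frames; the ratio-test threshold and the use of RANSAC are illustrative choices rather than requirements of the method.

```python
import cv2
import numpy as np

def align_to_reference(frame: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Estimate an affine transform from `frame` to `reference` with SIFT feature
    matching and warp the frame into the reference coordinate system."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), None)
    kp2, des2 = sift.detectAndCompute(cv2.cvtColor(reference, cv2.COLOR_BGR2GRAY), None)

    # Ratio-test matching of SIFT descriptors to keep mutually consistent feature pairs.
    matcher = cv2.BFMatcher()
    matches = matcher.knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]

    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    affine, _ = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC)

    h, w = reference.shape[:2]
    return cv2.warpAffine(frame, affine, (w, h))
```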
Generating a cover background based on the backgrounds of at least two action sequence frames comprises: for each action sequence frame, removing a corresponding instance from the action sequence frame, and filling the area corresponding to the removed instance in the action sequence frame according to the feature information of the corresponding area of a set action sequence frame, to obtain a filling result corresponding to the action sequence frame, wherein the set action sequence frame includes an action sequence frame, of the at least two action sequence frames, that is different from the current action sequence frame; and generating the cover background based on the filling results of the at least two action sequence frames.
In this embodiment, the process of generating the cover background can be divided into two stages. In the first stage, for the area of the removed instance in each action sequence frame, the backgrounds of the other action sequence frames can be used to fill the corresponding region, so as to obtain the filling result corresponding to the action sequence frame; the filling result can be regarded as a rough cover background. In the second stage, the cover background is generated based on the filling results of the plurality of action sequence frames; this stage can be a repair process of the rough cover backgrounds, and the cover background obtained is finer. For example, the rough cover backgrounds corresponding to the plurality of action sequence frames are averaged to obtain the cover background.
Taking filling the blank area after removing the character instance in action sequence frame 1 as an example, in action sequence frame 2, the character shape shown by the dotted line is the corresponding area, and the feature information represented by the oblique lines in this area can be used to fill the blank area after removing the character instance in action sequence frame 1. However, the character shape shown by the dotted line in action sequence frame 2 also contains a part of blank space (caused by the removal of the character instance in action sequence frame 2). Therefore, using only the feature information in the corresponding area of action sequence frame 2 cannot completely fill the area after removing the character instance in action sequence frame 1, so the feature information of the corresponding area in the next action sequence frame can continue to be used for filling. Assuming that the next action sequence frame is action sequence frame N−1, the feature information represented by the dot texture in the character shape shown by the dotted line in action sequence frame N−1 can be used to continue filling the blank area in action sequence frame 1. However, the area still cannot be completely filled, so it is necessary to use the feature information represented by the vertical-line texture in the character shape shown by the dotted line in action sequence frame N to continue filling the blank area in action sequence frame 1, so as to obtain the filling result of action sequence frame 1. In the filling result, the feature information of the oblique-line part comes from the corresponding area of action sequence frame 2, the feature information of the dot-texture part comes from the corresponding area of action sequence frame N−1, and the feature information of the vertical-line part comes from the corresponding area of action sequence frame N.
One situation is that if the feature information of the corresponding area of action sequence frame i (2≤i<N) cannot completely fill the area after removing the character instance in action sequence frame 1, the feature information of the corresponding area of action sequence frame i+1 can continue to be used for filling, until the feature information of the corresponding area of the last action sequence frame has been used. Regardless of whether the area can be completely filled, the filling operation of action sequence frame 1 can then be ended to obtain the filling result of action sequence frame 1.
Another situation is that if the feature information of the corresponding area of action sequence frame i (2≤i<N) can completely fill the area, the filling operation of action sequence frame 1 can be ended, and the filling result of action sequence frame 1 can be obtained without using subsequent action sequence frames for filling.
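The first-stage filling described above can be sketched as follows, assuming aligned backgrounds and boolean masks of the removed-instance areas; the loop stops early once the area is completely filled, matching the two situations just described.

```python
import numpy as np

def fill_removed_instance(target_bg: np.ndarray,
                          target_hole: np.ndarray,
                          other_bgs: list[np.ndarray],
                          other_holes: list[np.ndarray]) -> np.ndarray:
    """First-stage filling: fill the hole left by the removed instance in one
    action sequence frame using the corresponding pixels of the other frames'
    backgrounds, frame by frame, until the hole is filled or the frames run out.
    The *_hole arguments are boolean (H, W) masks marking removed-instance pixels."""
    filled = target_bg.copy()
    remaining = target_hole.copy()
    for bg, hole in zip(other_bgs, other_holes):
        usable = remaining & ~hole          # pixels still empty here but valid in this frame
        filled[usable] = bg[usable]
        remaining &= ~usable
        if not remaining.any():             # hole completely filled, stop early
            break
    return filled
```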
Based on similar principles, the filling results of action sequence frames 2 to N can be obtained. Then, in the second stage, the cover background can be generated based on the filling results of action sequence frames. For example, the filling results of multiple action sequence frames are averaged, or the present embodiment also provides a method for repairing the filling results (rough background cover) of multiple action sequence frames to process the edges of the instance and obtain a cover background with higher accuracy.
In the second stage, the filling results of multiple action sequence frames are repaired, comprising:
Expansion processing is performed on the area of the removed instance in each action sequence frame to expand the area of the removed instance, and the expanded area covers the edge portion of the removed instance. For the expanded area in an action sequence frame, the characteristics of the corresponding areas in the filling results of the other action sequence frames are used for repair. Here, repair can refer to using a filling operation similar to that of the first stage, that is, using the characteristics of the corresponding areas in the filling results of the other action sequence frames to fill the expanded area again; repair can also use the average value of the characteristics of the corresponding areas in the filling results of multiple action sequence frames to fill the expanded area again, so as to obtain the repair result of the action sequence frame. Finally, the repair results corresponding to the multiple action sequence frames are averaged to obtain the cover background, so as to fully utilize the feature information of the other action sequence frames for the fusion of the edges of the instance.
In addition, the repair operation in the second stage can be iteratively executed multiple times until the feature difference between the repair result obtained by any action sequence frame in the current iteration and the repair result in the previous iteration is within the allowable range, and the repair result at this time has fully integrated the feature information in the background of multiple action sequence frames, and the edge transition is smooth and the accuracy is higher.
The process of iteratively performing repair operations in the second stage comprises:
In the first iteration, for the dilated area of the removed instance in the filling result Rj (1≤j≤N) of action sequence frame j obtained in the first stage, the feature information of the corresponding areas in R1, R2, …, RN is averaged and filled into the dilated area to repair the dilated area of the removed instance in Rj, and the repair result Rj1 of action sequence frame j is obtained; then the second iteration is entered. Similarly, for the dilated area of the removed instance in the repair result Rj1 of action sequence frame j, the feature information of the corresponding areas in R1, R2, …, RN is averaged and filled into the dilated area to repair the dilated area of the removed instance in Rj1, and the repair result Rj2 of action sequence frame j is obtained; and so on. When the specified number of iterations is reached, or when the difference between the repair result of any action sequence frame in an iteration and its repair result in the previous iteration is within the allowable range, the iteration is stopped, and the repair results of all action sequence frames are averaged to obtain the cover background.
The filling result obtained in the first stage is actually a rough background cover. The secondary filling operation in the second stage can improve the accuracy of filling. Incorrect pixel values in the expansion area will be gradually repaired by correct pixel values, while the correct pixel values in the background part outside the instance will not change with iteration, ensuring that the generated cover background fully integrates the feature information of multiple action sequence frames, and the edge processing effect is better, and the transition between the instance and the background is more natural.
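The following sketch illustrates one reading of the second-stage repair: the removed-instance areas are dilated to cover the instance edges, and each round overwrites the dilated area with the mean of the current filling results. The kernel size and the fixed iteration count are assumptions; the text also allows stopping when successive repair results differ within an allowable range.

```python
import cv2
import numpy as np

def repair_fill_results(fill_results: list[np.ndarray],
                        holes: list[np.ndarray],
                        iterations: int = 5) -> np.ndarray:
    """Second-stage repair: dilate each removed-instance region so it covers the
    instance edge, then repeatedly overwrite that dilated region with the mean of
    all current filling results; finally average everything into the cover background."""
    kernel = np.ones((9, 9), np.uint8)
    dilated = [cv2.dilate(h.astype(np.uint8), kernel).astype(bool) for h in holes]
    results = [r.astype(np.float32) for r in fill_results]

    for _ in range(iterations):
        mean_img = np.mean(results, axis=0)
        for r, d in zip(results, dilated):
            r[d] = mean_img[d]        # pixels outside the dilated area stay unchanged

    return np.mean(results, axis=0).astype(np.uint8)
```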
The fusion degree of the instances of the at least two action sequence frames with the cover background decreases in chronological order.
As shown in
The video cover generation method in this embodiment recognizes the action sequence frames in the video and adds the instances of multiple action sequence frames to the cover background, so as to display video content about a complete action or behavior in a static cover, making the cover generated according to the action sequence frames clearer and more reasonable. By generating the cover background according to the backgrounds of multiple action sequence frames, the characteristics of multiple action sequence frames can be integrated to ensure the style consistency between the cover and the backgrounds of multiple key frames during the action, which makes it convenient for viewers to accurately understand the video content. By selecting an action sequence frame as a reference frame and aligning the background of each other action sequence frame with the background of the reference frame, the accuracy and visual effect of the generated background are improved. By obtaining the rough cover background of each action sequence frame in the first stage and repairing it in the second stage, the accuracy of filling is improved, ensuring that the generated cover background fully integrates the feature information of multiple action sequence frames, the edge processing effect is better, and the transition between the instance and the background is more natural. By setting different fusion degrees with the cover background for the instances of multiple action sequence frames, the temporal relationship of the multiple instances can be displayed in a static image, making the displayed actions or behaviors more specific and vivid.
In this embodiment, extracting at least two key frames in the video comprises: clustering images in the video to obtain at least two categories; extracting corresponding key frames from each category based on image quality evaluation algorithms; wherein action relevance of at least two key frames is being irrelevant. On this basis, different key frames that are action or behavior irrelevant can be used to display video content with large differences on the cover.
In this embodiment, based on the action relevance of at least two key frames, the feature information in the at least two key frames is fused in a single image to generate a cover of the video, comprising: selecting a key frame as the main frame when the action relevance is being irrelevant; recognizing feature information in each key frame based on a target recognition algorithm, wherein the feature information comprises a foreground target; and fusing the foreground targets in the key frames other than the main frame into the main frame to obtain a single image, and using the single image as the cover of the video. On this basis, foreground targets in different key frames can be fused into the same key frame without considering the differences between the backgrounds of different key frames, and the way of generating covers is more flexible.
As shown in
S310, clustering the images in the video to obtain at least two categories.
In the present embodiment, the multiple frames of images in the video are clustered according to inter-frame similarity, for example, according to whether they have the same hue, scene content, or contained instances, so as to provide a basis for extracting key frames. The clustering algorithm may be, for example, the K-means algorithm.
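A minimal clustering sketch is shown below, using compact HSV color histograms as per-frame features; the feature choice and the number of clusters are assumptions for illustration, not prescribed by the method.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def cluster_frames(frames: list[np.ndarray], n_clusters: int = 5) -> np.ndarray:
    """Cluster video frames by inter-frame similarity using compact HSV color
    histograms as features; returns one cluster label per frame."""
    feats = []
    for f in frames:
        hsv = cv2.cvtColor(f, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [16, 8], [0, 180, 0, 256])
        feats.append(cv2.normalize(hist, hist).flatten())
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(np.array(feats))
```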
S320, based on the image quality evaluation algorithm, extracting a corresponding key frame from each category.
In this embodiment, for each category, the quality of each image can be referred to when selecting key frames. For example, the Hyper Image Quality Assessment (HyperIQA) algorithm is used to evaluate the quality of the images in each category, and then the key frame corresponding to each category is extracted according to the image quality. Since the images in each category are similar to each other, one key frame can be extracted for each category. The extraction of key frames can be achieved through a pre-trained convolutional neural network (CNN), which can automatically use the image with the best quality in a category as the key frame of that category. By extracting the corresponding key frame according to category, extracting too many key frames for the same category and thus increasing unnecessary computation and storage space occupancy can be avoided, and it can be ensured that the content displayed in the cover is not similar or repeated, thereby displaying as much video content as possible in the cover.
S330, selecting a key frame as the main frame.
In the present embodiment, the main frame may be used to arrange the foreground targets of the other key frames. The main frame may be the key frame with the best image quality, or may be the first key frame, the last key frame, or a key frame located in the middle.
S340, recognizing feature information in each key frame based on the target recognition algorithm, the feature information comprising a foreground target.
In the present embodiment, the foreground target and its location in each key frame can be recognized based on the target recognition algorithm. The target recognition algorithm may be the single-CNN You Only Look Once (YOLO) algorithm, such as the YOLOv5 algorithm, which uses one CNN network to predict the category and location of the target and has good real-time performance.
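A hedged sketch of foreground target recognition with a pretrained YOLOv5 model loaded through torch.hub is given below; it assumes the Ultralytics hub repository and network access are available, and it is only one of several possible ways to run such a detector.

```python
import torch

# Load a pretrained YOLOv5 model from the Ultralytics hub (network access required).
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)

def detect_foreground_targets(image_path: str):
    """Return (x1, y1, x2, y2, confidence, class) rows for targets found in a key frame."""
    results = model(image_path)
    return results.xyxy[0]   # tensor of detections for the single input image
```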
It is also possible to recognize the foreground targets in each key frame first and then select the main frame; based on the recognition results of the foreground targets, a key frame with a relatively prominent foreground target and a relatively simple, uncluttered background can be selected as the main frame, which facilitates the subsequent fusion with multiple foreground targets.
S350, fusing the foreground target in each key frame except the main frame into the main frame to obtain a single image, and using the single image as the cover of the video.
In this embodiment, the foreground targets in each key frame other than the main frame are arranged in the main frame to generate a cover. When arranging, the foreground targets can be scaled appropriately, and in the fusion process, the positional relationship between each foreground target and the original foreground target of the main frame can be considered to reduce the occlusion of the original foreground target, and multiple foreground targets can be centered or evenly distributed as much as possible.
The position of the original foreground target in the main frame can also be changed to make it reasonably arranged with the foreground targets in other key frames, making the arrangement of all foreground targets more flexible. In addition, the contour of each foreground target can also be processed by bolding, adding color, etc., to make the foreground target more prominent and more attractive to viewers.
S360, blurring the background of a single image, and the blur process comprises a fuzzy process or a feather process.
In this embodiment, in order to emphasize the foreground target, the background can be blurred to a certain extent, mainly through two kinds of blur processes: a fuzzy process and a feathering process. The fuzzy process makes all regions of the background have the same degree of blurring, while feathering makes regions closer to the foreground target less blurred and regions farther away from the foreground target more blurred.
The fuzzy process can be expressed as: Iblur=Blur(I⊗(1−M),σ)+I⊗M; the feathering process can be expressed as: Ifeather=Blur(M,σ)⊗I.
Here, Iblur represents the cover after the fuzzy process, Ifeather represents the cover after the feathering process, Blur(·,·) represents the Gaussian blur function, I represents the input image, M represents the mask of the foreground target, σ is the standard deviation of the Gaussian distribution, and ⊗ represents the element-by-element matrix multiplication operation.
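The two formulas above can be implemented directly, for example as follows; the mask M is assumed to be a single-channel array that is 1 on the foreground target, and σ is a free parameter.

```python
import cv2
import numpy as np

def fuzzy_process(image: np.ndarray, mask: np.ndarray, sigma: float = 15.0) -> np.ndarray:
    """Iblur = Blur(I ⊗ (1 − M), σ) + I ⊗ M : blur the background uniformly,
    then add back the un-blurred foreground selected by mask M."""
    img = image.astype(np.float32)
    m = mask.astype(np.float32)[..., None]            # (H, W, 1), 1 on the foreground target
    blurred_bg = cv2.GaussianBlur(img * (1.0 - m), (0, 0), sigma)
    return np.clip(blurred_bg + img * m, 0, 255).astype(np.uint8)

def feather_process(image: np.ndarray, mask: np.ndarray, sigma: float = 15.0) -> np.ndarray:
    """Ifeather = Blur(M, σ) ⊗ I : multiply the image by a blurred mask, so the
    attenuation increases with distance from the foreground target."""
    img = image.astype(np.float32)
    soft = cv2.GaussianBlur(mask.astype(np.float32), (0, 0), sigma)[..., None]
    return np.clip(soft * img, 0, 255).astype(np.uint8)
```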
According to the video cover generation method of the present embodiment, by using key frames that are action- or behavior-irrelevant, video content with large differences can be displayed in the cover, enriching the features displayed in the cover; by extracting the corresponding key frame according to category, it can be ensured that the content displayed in the cover is not similar or repeated, so as to display as much video content as possible in the cover; by recognizing the foreground target in each key frame and arranging the foreground targets of the key frames other than the main frame at appropriate positions in the main frame, the feature information of multiple key frames is effectively fused in a static single image; in addition, by processing the contours of the foreground targets and blurring the background of the main frame, the foreground targets can be made more prominent, so that viewers quickly understand the important content of the video.
In the present embodiment, according to the action relevance of the at least two key frames, fusing the feature information of the at least two key frames in a single image to generate a cover of the video, comprising: in the case where the action relevance is being irrelevant, extracting an image block containing feature information in each key frame; stitching all image blocks to obtain a single image. On this basis, feature information in different key frames can be displayed in the cover.
As shown in
S410, clustering the images in the video to obtain at least two categories.
S420, based on the image quality evaluation algorithm, extracting corresponding key frames from each category.
S430, extracting image blocks containing feature information in each key frame.
In the present embodiment, the image block in the key frame contains feature information, for example, the image block can reflect the hue of the key frame, the image block contains the expression or action features of the character in the key frame, the image block contains the foreground target recognized by the target recognition algorithm, or the image block contains real-time subtitles that match the key frame.
S440, stitching all image blocks to obtain a single image, and using the single image as the cover of the video.
All image blocks can be stitched together according to a preset template based on the feature information in each image block and the relative proportion of the content in the image block.
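As a minimal illustration of template-based stitching, the sketch below uses a hypothetical preset template of equal-width vertical strips; a real template could instead weight each block according to the relative proportion of its content, as described above.

```python
import cv2
import numpy as np

def stitch_blocks(blocks: list[np.ndarray], cover_w: int = 1280, cover_h: int = 720) -> np.ndarray:
    """Stitch image blocks side by side into a single cover image according to a
    simple preset template: equal-width vertical strips resized to the cover height."""
    strip_w = cover_w // len(blocks)
    strips = [cv2.resize(b, (strip_w, cover_h)) for b in blocks]
    return np.hstack(strips)
```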
In the video cover generation method of the present embodiment, in the case where the action relevance is being irrelevant, the image blocks containing feature information in each key frame are extracted, and all image blocks are stitched to obtain a single image. On this basis, the feature information in different key frames can be displayed in the cover, and the way of generating the cover is more flexible.
In the present embodiment, after fusing the feature information of at least two key frames in a single image, the method further comprises: determining a hue, a saturation, and a brightness of a description text according to the color value of the single image, wherein the color value is converted from the red-green-blue (RGB) color mode to the hue-saturation-value (HSV) color mode; and adding the description text at a specified position in the single image according to the hue, the saturation, and the brightness of the description text.
In the present embodiment, determining the hue of the description text according to the color value of the single image comprises: determining a plurality of hue types of the single image and a proportion of each hue type based on a clustering algorithm; using the hue type with the highest proportion as the main hue of the single image; and using a hue corresponding to a hue value closest to the hue value of the main hue within a designated area of the preset color ring type as the hue of the description text.
In the present embodiment, determining the saturation and brightness of the description text according to the color value of the single image comprises: determining the saturation of the description text based on an average saturation within a set range around the specified position; and determining the brightness of the description text based on an average brightness within a set range around the specified position.
On this basis, the content of the cover can be enriched and beautified, so that viewers can understand the video content faster. The position, size, color scheme, and font of the description text can be determined according to the video style and overall color distribution, making the overall color scheme of the cover more reasonable and the visual effect better. The font of the description text can also be determined according to the theme of the video and the style of the cover, so that the description text can be better integrated with the video content and cover.
As shown in
S510, extracting at least two key frames from the video, the key frame containing feature information to be displayed in the cover.
S520, based on the action relevance of at least two key frames, fusing feature information in at least two key frames into a single image to generate a cover of the video.
S530, converting the color value of a single image from RGB color mode to HSV color mode.
In the present embodiment, the color value is converted to the HSV color mode. The HSV color mode is a color model oriented toward the user's perception; it focuses on color representation and can reflect the hue, the depth of the color, and the lightness and darkness. Determining the color scheme of the description text according to the HSV color mode makes the description text blend better with the cover and gives the viewer a more comfortable visual effect.
The method for converting a color from the RGB color mode to the HSV color mode is as follows: denote the red, green, and blue coordinates of the color as (r, g, b), where r, g, and b are all real numbers between 0 and 1; let max be the largest of r, g, and b, and min be the smallest of r, g, and b. To find the (h, s, v) value of the color in the HSV space, where h∈[0, 360) is the hue angle in degrees and s, v∈[0, 1] are the saturation and brightness respectively, there is the following conversion relationship:
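The conversion equations themselves are not reproduced in this text; the standard RGB-to-HSV relationship consistent with the description above is:

$$
h=\begin{cases}
0^\circ, & \max=\min\\[4pt]
60^\circ\times\dfrac{g-b}{\max-\min}+0^\circ, & \max=r\ \text{and}\ g\ge b\\[4pt]
60^\circ\times\dfrac{g-b}{\max-\min}+360^\circ, & \max=r\ \text{and}\ g<b\\[4pt]
60^\circ\times\dfrac{b-r}{\max-\min}+120^\circ, & \max=g\\[4pt]
60^\circ\times\dfrac{r-g}{\max-\min}+240^\circ, & \max=b
\end{cases}
\qquad
s=\begin{cases}0, & \max=0\\[4pt] \dfrac{\max-\min}{\max}, & \text{otherwise}\end{cases}
\qquad
v=\max
$$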
S540, determining multiple hue types of a single image and the proportion of each hue type based on the clustering algorithm.
As the clustering algorithm, the K-means algorithm can be used to analyze the overall color of the single image. For example, the overall colors of the single image can be clustered into five categories, and the hue type and proportion of the main color of each category can be output.
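A minimal sketch of this color analysis is shown below; clustering pixels directly in OpenCV's HSV space is a simplification (hue is circular), and the number of clusters follows the five-category example above.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def main_hue(cover_bgr: np.ndarray, n_clusters: int = 5):
    """Cluster the cover's pixel colors into a few hue types and return the hue
    (0-180 on OpenCV's scale) of the cluster with the highest pixel proportion."""
    hsv = cv2.cvtColor(cover_bgr, cv2.COLOR_BGR2HSV).reshape(-1, 3).astype(np.float32)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(hsv)
    proportions = np.bincount(km.labels_, minlength=n_clusters) / len(km.labels_)
    dominant = int(np.argmax(proportions))
    return km.cluster_centers_[dominant][0], proportions
```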
S550, using the hue type with the highest proportion as the main hue of a single image.
S560, using a hue corresponding to a hue value closest to a hue value of the main hue within a designated area of the preset color ring type as a hue of the description text.
In the present embodiment, to determine the hue of the description text, the distances between a plurality of colors in a predefined color space and the main hue of the single image are calculated, and the color that is closest to the main hue and located within the specified interval of the H color ring is used as the hue of the description text.
S570, determining the saturation of the description text based on the average saturation within the set range around the specified position.
In this embodiment, the saturation of the description text is determined according to the average saturation within the set range around the specified position in a single image, so that the saturation of the description text is as uniform as possible with the surrounding saturation, and the fusion is stronger. The saturation mean within the set range around the specified position is referred to as
S580, determining the brightness of the description text based on the average brightness within the set range around the specified position.
In this embodiment, the brightness of the description text is determined according to the brightness mean within the set range around the specified position in the single image, so that the brightness of the description text is as uniform as possible with the brightness around it, and the fusion is stronger. The brightness mean within the set range around the specified position is referred to as
S590, adding description text at a specified location in a single image based on the hue, saturation, and brightness of the description text.
In the method of the present embodiment, the description text may be added to the single image in which the feature information of a plurality of key frames has been fused. This process takes into account the overall color distribution of the cover: the hue of the description text is similar to the main hue of the image, and its saturation and brightness are also adapted to its surroundings, so that the fusion is stronger. In addition, the contrast between the color of the description text and the single image may also be considered, thereby strengthening or weakening the description text.
First way: recognizing the action sequence frames in the video, and when the action relevance of multiple key frames is being relevant, performing instance segmentation and image fusion based on the action sequence frames, and fusing the instances in multiple action sequence frames into a generated cover background.
Second way: when the action relevance of multiple key frames is being irrelevant, performing clustering and key frame extraction on the images in the video, extracting foreground targets from multiple key frames, and fusing them into one of the key frames serving as the main frame.
Third way: when the action relevance of multiple key frames is being irrelevant, performing clustering and key frame extraction on the images in the video, and stitching image blocks in multiple key frames to obtain a single image.
Through the above method, feature information from multiple key frames can be reflected in a static single image, improving the diversity of covers.
In addition, for a single image obtained by any of the above ways, the hue, saturation, and brightness of the description text can also be determined, and the description text can be added at the designated position in the single image on this basis. The content of the description text can be representative subtitles or a title generated for the video.
For a video, the first way can be used by default or with priority for cover generation; that is, when valid action sequence frames are recognized, instance segmentation and image fusion are performed based on the action sequence frames, and the instances of multiple action sequence frames are fused into the generated cover background. If no action sequence frames are effectively recognized, the second way or the third way is used; that is, the clustering algorithm is used to extract key frames, then the foreground targets or image blocks in the key frames are extracted, and the cover is generated through foreground target segmentation and fusion or image block stitching.
The above three ways can also be combined. For example, in the first way, the instances of multiple action sequence frames can also be arranged in one of the action sequence frames (which can be used as the main frame); as another example, in the second way, the foreground targets in multiple key frames can also be arranged in a generated cover background.
According to the video cover generation method of the present embodiment, by adding the description text and determining the hue, saturation, and brightness of the description text according to the overall color value of the single image, the content of the cover can be enriched and beautified, so that the viewer can understand the video content faster, and the overall color scheme of the cover is more reasonable with a better visual effect. In addition, using the HSV color mode to determine the color scheme of the description text can reflect the hue, the depth of the color, and the brightness, so that the description text blends better with the cover. The video cover generation method of the present embodiment provides a variety of ways to generate the cover and improves the flexibility of generating the cover.
An extraction module 610, configured to extract at least two key frames in the video, wherein the key frames include feature information of the video; a generation module 620 configured to fuse the feature information in the at least two key frames into a single image based on the action relevance of the at least two key frames to generate a cover of the video, wherein the action relevance includes being relevant or being irrelevant.
According to the video cover generating apparatus of the present embodiment, by fusing the feature information of a plurality of key frames in a single image, a single static image can show rich video content with less resource occupation and high efficiency; and when the feature information is fused, the action relevance of the plurality of key frames is considered, so that the ways of generating the video cover are more flexible and diverse.
Based on the above, the extraction module 610 is configured to: based on an action recognition algorithm, recognize at least two action sequence frames in the video, and use each action sequence frame as the key frame; wherein, the action relevance is being relevant.
Based on the above, the generation module 620 comprises:
A segmentation unit configured to, in a case that the action relevance is being relevant, perform instance segmentation on each action sequence frame to obtain feature information of each action sequence frame, wherein the feature information comprises an instance and a background; a background generation unit configured to generate a cover background based on backgrounds of at least two action sequence frames; a first fusion unit configured to fuse instances of at least two action sequence frames into the cover background to obtain the single image, and use the single image as the cover of the video.
Based on the above, the background generation unit comprises:
A filling sub-unit configured to, for each action sequence frame, remove a corresponding instance from the action sequence frame, and fill a region corresponding to the removed instance in the action sequence frame according to feature information of a corresponding area of a given action sequence frame, to obtain a filling result corresponding to the action sequence frame, wherein the given action sequence frame comprises an action sequence frame, of the at least two action sequence frames, that is different from a current action sequence frame; a generating sub-unit configured to generate the cover background based on the filling results of at least two action sequence frames.
Based on the above, the apparatus also comprises:
A reference frame selection module configured to select an action sequence frame as a reference frame, and determine an affine transformation matrix between each action sequence frame and the reference frame according to a feature point matching algorithm; an alignment module configured to align a background of each action sequence frame with a background of the reference frame according to the affine transformation matrix.
Based on the above, the fusion degree between the instances of at least two action sequence frames and the cover background decreases in chronological order.
Based on the above, the extraction module 610 comprises:
A clustering unit configured to cluster images in the video to obtain at least two categories; an extraction unit configured to extract a key frame corresponding to each category based on an image quality evaluation algorithm; wherein the action relevance of the at least two key frames is being irrelevant.
Based on the above, the generation module 620 comprises:
A main frame selection unit configured to select a key frame as the main frame when the action relevance is being irrelevant; a recognition unit configured to recognize feature information in each key frame based on a target recognition algorithm, wherein the feature information comprises a foreground target; a second fusion unit configured to fuse the foreground target in each of the at least two key frames except the main frame into the main frame to obtain the single image, and use the single image as the cover of the video.
Based on the above, the apparatus also comprises:
A blur module configured to perform a blur process on the background of the single image, wherein the blur process includes a fuzzy process or a feather process.
Based on the above, the generation module 620 includes:
An image block extraction unit configured to in a case where the action relevance is being irrelevant, extract an image block containing the feature information in each key frame; a stitching unit configured to stitch all image blocks to obtain the single image.
Based on the above, the apparatus also comprises:
A text color determination module configured to determine a hue, a saturation, and a brightness of a description text based on a color value of the single image, wherein the color value is converted from the red, green, and blue RGB color mode to the hue saturation and brightness HSV color mode; a text addition module configured to add a description text at a specified position in the single image according to the hue, saturation, and brightness of the description text.
Based on the above, the text addition module comprises:
A proportion calculation unit configured to determine multiple hue types of the single image and a proportion of each hue type based on a clustering algorithm; a main hue determination unit configured to use a hue type with a highest proportion as a main hue of the single image; a hue determination unit configured to use a hue corresponding to a hue value closest to a hue value of the main hue within a designated area of the preset color ring type as a hue of the description text.
Based on the above, the text addition module comprises:
A saturation determination unit configured to determine a saturation of the description text based on an average saturation within a set range around the specified position; a brightness determination unit configured to determine a brightness of the description text based on an average brightness within a set range around the specified position.
The above video cover generation apparatus may perform the video cover generation method provided by any embodiment of the present disclosure, and has functional modules and effects corresponding to the performed method.
As shown in
Typically, the following devices can be connected to the I/O interface 704: input devices 706 including touch screens, touchpads, keyboards, mice, cameras, microphones, accelerometers, gyroscopes, etc.; output devices 707 including liquid crystal displays (LCDs), speakers, vibrators, etc.; storage devices 708 including magnetic tapes, hard disks, etc., which are configured to store one or more programs; and communication devices 709. The communication devices 709 can allow the electronic device 700 to communicate with other devices by wire or wirelessly to exchange data. Although
According to embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program comprising program code for performing the method shown in the flowchart. In such embodiments, the computer program may be downloaded and installed from the network through the communication device 709, or installed from the storage device 708, or installed from the ROM 702. When the computer program is executed by the processing device 701, the above-described functions defined in the method of the present disclosure are performed.
The computer-readable medium described above can be a computer-readable signal medium or a computer-readable storage medium, or any combination thereof. The computer-readable storage medium is, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. Examples of computer-readable storage media may include but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, RAM, ROM, an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination thereof. In the present disclosure, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by an instruction execution system, apparatus, or device, or in combination therewith. In the present disclosure, a computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, which carries computer-readable program code. Such a propagated data signal can take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination thereof. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in conjunction with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to: a wire, an optical cable, radio frequency (RF), etc., or any suitable combination thereof.
In some implementations, clients and servers can communicate using any currently known or future developed network protocol such as the HyperText Transfer Protocol (HTTP), and can interconnect with digital data communication in any form or medium (such as a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), internetworks (such as the Internet), and peer-to-peer networks (such as ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer-readable medium can be included in the electronic device, or it can exist alone without being assembled into the electronic device.
The computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device: extracts at least two key frames in the video, wherein the key frames comprise feature information to be displayed in the cover; and according to the action relevance of the at least two key frames, fuses the feature information in the at least two key frames in a single image to generate the cover of the video, wherein the action relevance comprises being relevant or being irrelevant.
Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof. The above-described programming languages include but are not limited to object-oriented programming languages, such as Java, Smalltalk, and C++, and further include conventional procedural programming languages, such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as a standalone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a LAN or a WAN, or may be connected to an external computer (e.g., using an Internet service provider to connect via the Internet).
The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functions, and operations of systems, methods, and computer program products that may be implemented in accordance with various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may also occur in a different order than that indicated in the figures. For example, two blocks shown in succession may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the function involved. It should also be noted that each block in the block diagram and/or flowchart, and combinations of blocks in the block diagram and/or flowchart, may be implemented using a dedicated hardware-based system that performs the specified function or operation, or may be implemented using a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or by hardware. The name of a unit does not, in some cases, constitute a limitation on the unit itself.
The functions described above herein can be performed at least in part by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that can be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and so on.
In the context of the present disclosure, a machine-readable medium can be a tangible medium that can contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. The machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium can include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination thereof. Examples of machine-readable storage media may include an electrical connection based on one or more wires, a portable computer disk, a hard disk, RAM, ROM, EPROM or flash memory, an optical fiber, a CD-ROM, an optical storage device, a magnetic storage device, or any suitable combination thereof.
According to one or more embodiments of the present disclosure, Example 1 provides a video cover generation method, comprising:
Example 2. the method of example 1, wherein extracting at least two key frames in a video comprises:
Example 3. the method of example 2, wherein according to an action relevance of the at least two key frames, merging the feature information in the at least two key frames in a single image to generate a cover of the video comprises:
Example 4. the method of example 3, wherein generating a cover background based on backgrounds of at least two action sequence frames comprises:
Example 5. the method of example 3, prior to generating a cover background based on backgrounds of at least two action sequence frames, further comprising:
Example 6. the method of example 3, wherein a degree of fusion of instances of the at least two action sequence frames with the cover background decreases sequentially in a chronological order.
Example 7. the method of example 1, wherein extracting at least two key frames in a video comprises:
Example 8. the method of example 7, wherein the feature information in the at least two key frames is fused in a single image to generate a cover of the video based on an action relevance of the at least two key frames, comprising:
Example 9. the method of example 8, after obtaining the single image, further comprising:
Example 10. the method of example 7, wherein according to an action relevance of the at least two key frames, fusing the feature information in the at least two key frames in a single image to generate a cover of the video comprises:
Example 11. the method of any one of examples 1-10, after fusing the feature information in the at least two key frames in a single image, further comprising:
Example 12. the method of example 11, wherein determining a hue of description text based on a color value of the single image comprises:
Example 13. the method of example 11, wherein determining a saturation and a lightness of a description text based on a color value of the single image comprises:
Example 14 provides a video cover generation apparatus, comprising:
Example 15 provides an electronic device comprising:
Example 16 provides a computer-readable medium having a computer program stored thereon that, when executed by a processor, implements a video cover generation method as described in any one of examples 1-13.
In addition, although multiple operations are depicted in a specific order, this should not be understood as requiring these operations to be performed in the specific order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Similarly, although multiple implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Some features described in the context of individual embodiments can also be implemented in combination in a single embodiment. Conversely, multiple features described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable sub-combination.
Number | Date | Country | Kind
202111176742.6 | Oct. 2021 | CN | national

Filing Document | Filing Date | Country | Kind
PCT/CN2022/119224 | Sep. 16, 2022 | WO