This application claims the benefit of Korean Patent Application No. 10-2020-0082871, filed Jul. 6, 2020, which is hereby incorporated by reference in its entirety into this application.
The disclosed embodiment relates to technology for analyzing video shots of image content and search technology using the same.
Image content, such as a broadcast or a movie, is created through a process in which a professional producer produces a large number of video shots and finally edits the same. This process has been refined over the decades since image content services were first provided, and various production techniques are taught when training production personnel. Among these production techniques, image composition is an important medium for emphasizing the story of content and drawing viewers into the flow of the story.
Therefore, in a filming process, the same scene may be shot multiple times from different positions and angles, and then appropriate shot composition may be selected in consideration of the overall flow of the video in the final editing process. When video is produced, this process is entrusted to a producer, a program director, or the like, and the producer or director arranges, searches for, and selects shots after manually checking them one by one. When image content for commercial use, such as a movie, a documentary, or the like, is produced, thousands of video shots are taken, and selecting shots after checking them one by one is time-consuming and burdensome in terms of labor expenses.
An object of the present invention is to reduce the time and labor expenses required for the final editing of video shots when image content is produced.
An apparatus for training a recognition model according to an embodiment includes at least one program, memory in which the program is recorded, and a processor for executing the at least one program. The at least one program may include a shot composition recognition model generation unit for generating a neural network model for predicting shot composition and a camera position using a video shot tagged with shot composition information and camera position information as training data; and a shot time and location recognition model generation unit for generating a neural network model for predicting a shot time and a shot location using a video shot tagged with shot time information and shot location information as training data.
Here, the shot composition recognition model generation unit may include a frame extraction unit for extracting at least one frame from the video shot and forming data for each of the at least one frame, an image feature extraction unit for extracting image features related to an object included in the extracted at least one frame, and a recognition-model-training unit for training a shot composition recognition model, which is a neural network model, to predict shot composition information and camera position information, with which the frame is tagged, when the extracted image features are input.
Here, the shot composition may include at least one of an extreme long shot, a long shot, a full shot, a knee shot, a waist shot, a bust shot, a close shot, a close-up shot, an extreme close-up shot, and an over-the-shoulder shot.
Here, each type of the shot composition may be classified as at least one of a high-angle shot, a low-angle shot, and an eye-level shot depending on a camera position.
Here, the shot composition recognition model generation unit may further include a sound feature extraction unit for extracting an audio spectrum from the video shot, and the recognition-model-training unit may train the shot composition recognition model, which is a neural network model, to predict the shot composition information and the camera position information with which the video shot is tagged when the audio spectrum is input.
Here, the shot time and location recognition model generation unit may include a frame extraction unit for extracting at least one frame from the video shot and forming data for each of the at least one frame; and a recognition-model-training unit for training a shot location recognition model or a shot time recognition model to predict shot location information or shot time information with which the frame is tagged when at least one of shot composition of the extracted frame, color distribution thereof, and a key frame is input.
Here, the shot time and location recognition model generation unit may further include, between the frame extraction unit and the recognition-model-training unit, at least one of a shot composition extraction unit for predicting shot composition of the extracted frame based on a previously trained shot composition recognition model, an image feature extraction unit for extracting color distribution from the extracted frame, and a key frame extraction unit for extracting a representative frame, among the at least one extracted frame.
Here, the image feature extraction unit may extract color distribution of each of multiple segmented areas of the frame and color distribution of the entire frame.
An apparatus for analyzing a video shot according to an embodiment includes at least one program, memory in which the program is recorded, and a processor for executing the at least one program. The at least one program may include a frame extraction unit for extracting at least one frame from a video shot, a shot composition and camera position recognition unit for predicting shot composition and a camera position for the extracted at least one frame based on a previously trained shot composition recognition model, a place and time information extraction unit for predicting a shot location and a shot time for the extracted at least one frame based on a previously trained shot location recognition model and a previously trained shot time recognition model, and an information combination unit for combining pieces of information, which are respectively predicted for the at least one frame, for each video shot and tagging the video shot with the combined pieces of information.
The apparatus for analyzing a video shot according to an embodiment may further include a shot quality measurement unit for measuring a shot quality of each of the extracted at least one frame based on predetermined factors, and the information combination unit may select the information to be combined based on the measured shot quality.
Here, the information to be combined may include at least one of shot composition, a camera position, and a shot quality, and the information combination unit may select the shot composition based on a shot quality score totaled for each type of shot composition, select the camera position based on a shot quality score totaled for each camera position, calculate the shot quality as the average of quality scores of frames recognized as having the selected shot composition, and use the shot composition, the camera position, and the shot quality as the information to be combined.
Here, the predetermined factors may include at least one of directionality of lines, which is the degree of uniformity of directions of main lines included in the frame, sharpness, which is the degree of clarity of the lines, and similarity acquired by comparing previously constructed shot composition data with information about an object included in the frame.
Here, the place and time information extraction unit may predict the shot location and the shot time for the extracted at least one frame by inputting at least one of shot composition of the frame, which is predicted based on a previously trained shot composition detection model, color distribution thereof, and a key frame to the shot location recognition model and the shot time recognition model.
Here, the apparatus for analyzing a video shot according to an embodiment may further include a place/time-based grouping unit for clustering frames into predetermined groups according to the predicted shot location and shot time, and the information combination unit may select the information to be combined based on the resultant groups.
Here, the information to be combined may include at least one of a time group and a place group, and the information combination unit may decide on the time group and the place group to be used as the information to be combined based on the number of frames included in each of the groups.
An apparatus for providing a video shot search service according to an embodiment includes at least one program, memory in which the program is recorded, and a processor for executing the program. The program may include a tagging item search unit for searching for a video shot tagged with at least one item corresponding to a search keyword when the search keyword is input; and a video shot output unit for outputting the found at least one video shot.
Here, the item may include at least one of shot composition, a camera position, an actual shot location, and an actual shot time.
Here, the apparatus for providing a video shot search service according to an embodiment may further include a video shot arrangement unit for sorting, when the found video shot comprises multiple video shots, the multiple video shots according to a predetermined criterion, and the video shot output unit may output the video shots in the order set by the video shot arrangement unit.
Here, the video shot arrangement unit may sort the video shots by referring to other items with which each of the multiple video shots is tagged.
Here, the apparatus for providing a video shot search service according to an embodiment may further include a DB for storing film grammar information pertaining to an already produced film, and the video shot arrangement unit may sort the video shots starting from a video shot tagged with shot composition information that best matches the shot composition information of a previously selected video shot, based on the film grammar information.
The above and other objects, features and advantages of the present invention will be more clearly understood from the following detailed description, taken in conjunction with the accompanying drawings, in which:
The advantages and features of the present invention and methods of achieving the same will be apparent from the exemplary embodiments to be described below in more detail with reference to the accompanying drawings. However, it should be noted that the present invention is not limited to the following exemplary embodiments, and may be implemented in various forms. Accordingly, the exemplary embodiments are provided only to disclose the present invention and to let those skilled in the art know the category of the present invention, and the present invention is to be defined based only on the claims. The same reference numerals or the same reference designators denote the same elements throughout the specification.
It will be understood that, although the terms “first,” “second,” etc. may be used herein to describe various elements, these elements are not intended to be limited by these terms. These terms are only used to distinguish one element from another element. For example, a first element discussed below could be referred to as a second element without departing from the technical spirit of the present invention.
The terms used herein are for the purpose of describing particular embodiments only, and are not intended to limit the present invention. As used herein, the singular forms are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises,” “comprising,” “includes,” and/or “including,” when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Unless differently defined, all terms used herein, including technical or scientific terms, have the same meanings as terms generally understood by those skilled in the art to which the present invention pertains. Terms identical to those defined in generally used dictionaries should be interpreted as having meanings identical to contextual meanings of the related art, and are not to be interpreted as having ideal or excessively formal meanings unless they are definitively defined in the present specification.
Hereinafter, an apparatus and method according to an embodiment will be described in detail with reference to the accompanying drawings.
The model-training apparatus (100 and 200) performs training of an artificial intelligence (AI) recognition model that is capable of recognizing shot composition, a shot time, and a shot location pertaining to a video shot. Specifically, the model-training apparatus may include a shot composition recognition model generation unit 100 and a shot time and location recognition model generation unit 200.
When a plurality of video shots is stored, the video shot analysis apparatus 300 extracts place/time information represented in the content of the video shots and composition information of the video shots and stores the extracted information along with the video shots in order to make it easy to search for and use a video shot having a specific composition. That is, when a user uploads all of the video shots taken in order to produce a single piece of image content, the video shot analysis apparatus 300 analyzes the shot composition of each of the video shots, the position of the camera when the video shot was taken, the shot location where the video shot was taken, and the shot time when the video shot was taken, and then stores the analyzed information in the video shot DB 20 by tagging the video shot therewith. According to an embodiment, the shot composition and quality of the video shot are measured, the place and time represented by the image of the video shot are additionally analyzed, and the video shot is tagged with the measured and analyzed information. A detailed description thereof will be provided later.
The video shot search service provision apparatus 400 provides a function through which a producer is able to search the video shot DB 20, using the items with which the stored video shots are tagged as search conditions, when finally editing content. Here, when video shots having the same composition are present, a better video shot is ranked higher in the search result by determining the suitability of its composition, and suitable composition may be preferentially proposed in consideration of the flow of the shot composition of the previously produced scene. For example, when video shots are sequentially selected, shot composition suitable for the current scene is proposed in consideration of the previously selected composition by analyzing a shot composition sequence based on film grammar, in order to help the producer select a video shot. This will be described in detail later.
The shot composition training data DB 110 stores training data that is used for training the shot composition recognition model. The training data may include unedited video shots, each of which is tagged with information about its shot composition and the camera position at which it was taken.
Here, shot composition may be defined as at least one of an extreme long shot, a long shot, a full shot, a knee shot, a waist shot, a bust shot, a close shot, a close-up shot, an extreme close-up shot, and an over-the-shoulder shot.
Also, each type of shot composition may be classified as at least one of a high-angle shot, a low-angle shot, and an eye-level shot depending on the camera position.
Accordingly, each of the multiple video shots may be stored after being tagged with the above-described shot composition information and camera position information.
The frame extraction unit 120 extracts a predetermined number of important frames for each of the video shots stored in the shot composition training data DB 110, thereby forming data for each frame. Accordingly, the shot composition recognition model to be generated is trained to predict, when a single frame is input, the shot composition and the camera position pertaining to the video shot in which the corresponding frame is included.
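By way of non-limiting illustration, the following Python sketch shows one possible realization of the frame extraction described above, in which a fixed number of frames is sampled uniformly from a video shot using OpenCV. The uniform sampling strategy, the sample count, and the file path are assumptions introduced only for this example and stand in for the unspecified selection of important frames.

```python
# Illustrative sketch of frame extraction (OpenCV); the number of frames
# sampled per shot is an assumed parameter, not a value fixed by the disclosure.
import cv2

def extract_frames(video_path: str, num_frames: int = 8):
    """Evenly sample `num_frames` frames from a video shot."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        # Jump to an evenly spaced position within the shot.
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / num_frames))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

# Example (hypothetical file name): frames = extract_frames("shot_0001.mp4")
```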
To this end, features are extracted from each extracted frame, and image features and sound features may be extracted as the features.
The image feature extraction unit 130 may extract an object probability, the position and size of a main object (a person/an animal/a vehicle/or the like), and the mean and variance of the sizes of all objects from the extracted frame. This is based on the fact that the frame composition makes a difference in the sizes and shapes of the shown objects. Here, an image recognition algorithm such as a CNN may be used for the image feature extraction unit 130.
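As a hedged illustration of how the above image features might be assembled, the following sketch converts generic object-detection output into the described statistics (an object probability, the position and size of a main object, and the mean and variance of the sizes of all objects). The detection format and the label set are assumptions of this example; any CNN-based detector could supply the detections.

```python
# Sketch of composing the described image features from generic detector
# output. The detection format is an assumption; any CNN-based detector
# producing (label, confidence, bounding box) could be used.
import numpy as np

MAIN_CLASSES = {"person", "animal", "vehicle"}  # assumed label set

def frame_image_features(detections, frame_w, frame_h):
    """detections: list of dicts like
    {"label": "person", "conf": 0.92, "box": (x, y, w, h)} in pixels."""
    if not detections:
        return np.zeros(7, dtype=np.float32)
    sizes = [d["box"][2] * d["box"][3] / (frame_w * frame_h) for d in detections]
    main = max((d for d in detections if d["label"] in MAIN_CLASSES),
               key=lambda d: d["conf"], default=None)
    if main is not None:
        x, y, w, h = main["box"]
        main_feat = [main["conf"], x / frame_w, y / frame_h,
                     w / frame_w, h / frame_h]
    else:
        main_feat = [0.0] * 5
    # [object probability, main-object position/size (4), mean size, size variance]
    return np.array(main_feat + [float(np.mean(sizes)), float(np.var(sizes))],
                    dtype=np.float32)
```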
The recognition-model-training unit 140 trains a shot composition recognition model, which is a neural network model, to predict shot composition information and camera position information with which a frame is tagged when the features, extracted from the frame by the image feature extraction unit 130, are input.
Here, the shot composition recognition model may be any classification model capable of using the extracted features, and an algorithm such as Inception, ResNet, or the like may be used therefor.
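The following minimal PyTorch sketch, provided only as an illustration, trains a classifier with two output heads, one for the ten types of shot composition and one for the three camera positions, from a per-frame feature vector such as the one assembled above. The layer sizes, the feature dimension, and the two-head design are assumptions; the disclosure merely requires some classification model, such as Inception or ResNet.

```python
# Minimal sketch of a classifier predicting shot composition and camera
# position from an extracted feature vector. feat_dim and the layer sizes are
# assumed values used only for illustration.
import torch
import torch.nn as nn

class ShotCompositionModel(nn.Module):
    def __init__(self, feat_dim=71, n_compositions=10, n_positions=3):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU())
        self.composition_head = nn.Linear(64, n_compositions)
        self.position_head = nn.Linear(64, n_positions)

    def forward(self, x):
        h = self.backbone(x)
        return self.composition_head(h), self.position_head(h)

def train_step(model, optimizer, feats, comp_labels, pos_labels):
    loss_fn = nn.CrossEntropyLoss()
    comp_logits, pos_logits = model(feats)
    # Joint loss over the two tags attached to each training frame.
    loss = loss_fn(comp_logits, comp_labels) + loss_fn(pos_logits, pos_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```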
Additionally, the sound feature extraction unit 150 may extract an audio spectrum from each of the video shots. Here, the sound features are extracted for each video shot, rather than for each frame extracted from the video shot, and frames extracted from the same video shot may share the same sound features.
Here, the recognition-model-training unit 140 may use the sound features as the input of the shot composition recognition model, along with the image features extracted by the image feature extraction unit 130. These sound features are used in order for the shot composition recognition model to be trained in consideration of information about the audio situation according to shot composition (e.g., shot composition frequently accompanied by dialogue between characters, long-shot composition frequently accompanied only by background noise, or the like).
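As one possible, non-limiting realization of the audio spectrum feature, the following sketch computes a log-mel spectrogram for a video shot and averages it over time so that every frame extracted from that shot can share the same fixed-length sound feature vector, which may then be concatenated with the image features before being input to the model. The use of the librosa package and the mel representation are assumptions of this example.

```python
# Sketch of the per-shot audio spectrum feature; a log-mel spectrogram averaged
# over time is one possible realization (the disclosure does not fix the
# spectral representation). Requires the `librosa` package.
import numpy as np
import librosa

def shot_audio_features(audio_path: str, n_mels: int = 64):
    y, sr = librosa.load(audio_path, sr=None, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)
    # Average over time so every frame extracted from the same shot can share
    # one fixed-length sound feature vector, as described above.
    return log_mel.mean(axis=1).astype(np.float32)
```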
The shot composition recognition model trained as described above may be stored in the AI model DB 10.
The time/place training data DB 210 stores training data that is used for training the shot time recognition model and the shot location recognition model. The training data may include unedited video shots, each of which is tagged with information about the shot location where it was taken and the shot time when it was taken.
Here, the shot time may be defined as, for example, day, night, morning, or evening, and the shot location may be defined as, for example, an office room, a park, a classroom, or the like.
Therefore, each of the multiple video shots may be stored after being tagged with the above-described shot location information and shot time information.
The frame extraction unit 220 extracts a predetermined number of important frames for each of the video shots stored in the time/place training data DB 210, thereby forming data for each frame. Accordingly, the shot time recognition model may be trained to predict, when a single frame is input, the shot time at which the video shot in which the corresponding frame is included was taken, and the shot location recognition model may be trained to predict, when a single frame is input, the shot location where the video shot in which the corresponding frame is included was taken.
To this end, predetermined features may be extracted from the extracted frame and used as the input of the shot time recognition model and the shot location recognition model. Here, at least one of shot composition, an image feature, and a key frame may be extracted as the features.
The shot composition extraction unit 230 predicts shot composition from the extracted frame based on the shot composition recognition model, which is generated by the shot composition recognition model generation unit 100 and stored in the AI model DB 10.
The image feature extraction unit 240 extracts color distribution from the extracted frame. This is because the color distribution of the frame may vary depending on the shot time and the shot location. For example, when the shot time is the time of sunset in the evening, red may occupy a large portion of the color distribution of the frame, and when the shot location is the beach, blue may occupy a large portion of the color distribution of the frame.
Here, the color distribution of multiple segmented areas of the frame (e.g., the color distribution of each of the 9×9 segmented areas of the frame) and the color distribution of the entire frame are extracted. This is because both the case in which the place and time information is represented only in a limited area depending on the shot composition (e.g., a bust shot, a close shot, an over-the-shoulder shot, a knee shot, a waist shot, or the like) and the case in which the place and time information is represented in the entire frame area depending on the shot composition (e.g., an extreme long shot, a long shot, a full shot, or the like) are considered.
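The following sketch illustrates, under stated assumptions, the color-distribution feature described above: a coarse color histogram is computed for each cell of a 9×9 grid over the frame and for the entire frame. The 8-bin hue histograms are an assumption of this example; only the grid layout follows the example given above.

```python
# Sketch of the color-distribution feature: a hue histogram for each grid cell
# plus one for the entire frame. Bin count and the use of the hue channel are
# assumed choices for illustration.
import cv2
import numpy as np

def color_distribution_features(frame, grid=9, bins=8):
    h, w = frame.shape[:2]
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    feats = []

    def hue_hist(region):
        hist = cv2.calcHist([region], [0], None, [bins], [0, 180])
        return cv2.normalize(hist, None).flatten()

    # Per-cell distributions capture place/time cues confined to part of the frame.
    for r in range(grid):
        for c in range(grid):
            cell = hsv[r * h // grid:(r + 1) * h // grid,
                       c * w // grid:(c + 1) * w // grid]
            feats.append(hue_hist(cell))
    # The whole-frame distribution covers compositions in which the place/time
    # cue is represented over the entire frame area.
    feats.append(hue_hist(hsv))
    return np.concatenate(feats)
```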
Also, the key frame extraction unit 250 extracts a representative frame, among the extracted frames, as a feature.
The recognition-model-training unit 260 trains the shot location recognition model to predict shot location information with which the frame is tagged when at least one of shot composition, the color distribution of the frame, and the key frame is input.
Also, the recognition-model-training unit 260 trains the shot time recognition model to predict shot time information with which the frame is tagged when at least one of shot composition, the color distribution of the frame, and the key frame is input.
Here, each recognition model may be any classification model capable of using the extracted features, and an algorithm such as Inception, ResNet, or the like may be used therefor.
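As an illustration only, the following sketch trains the shot location recognition model over the concatenated features described above (the predicted shot composition, the color distribution, and a key-frame descriptor). A random forest stands in for the unspecified classifier, and it is assumed for brevity that the key frame has already been reduced to a fixed-length vector; the shot time recognition model may be trained in the same manner.

```python
# Sketch of training the shot location recognition model over concatenated
# per-frame features; the classifier choice is an assumption (the disclosure
# allows any classification model).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def build_feature(composition_onehot, color_dist, keyframe_vec):
    return np.concatenate([composition_onehot, color_dist, keyframe_vec])

def train_location_model(features, location_labels):
    """features: list of concatenated per-frame feature vectors,
    location_labels: tagged shot-location labels (e.g., 'office', 'park')."""
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(np.stack(features), location_labels)
    return model

# The shot time recognition model may be trained the same way, replacing
# location_labels with the tagged shot-time labels (e.g., 'day', 'night').
```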
The shot location recognition model and the shot time recognition model, which are trained as described above, may be stored in the AI model DB 10.
The frame extraction unit 310 may extract a predetermined number of important frames for each of video shots input thereto.
The shot composition and camera position recognition unit 320 recognizes shot composition and a camera position for the extracted frame based on the shot composition recognition model stored in the AI model DB 10.
When a frame for which the shot composition and the camera position have been recognized is input, the shot quality measurement unit 330 measures the quality of the frame in consideration of various factors.
Here, the factors based on which the shot quality is measured may include the directionality of lines 331, sharpness 332, and similarity with the existing composition 333.
Here, the directionality of lines 331 indicates the degree of uniformity of the directions of main lines included in the frame, and may be represented as a value ranging from 0 to 1.0.
The sharpness 332 indicates the clarity of an image, and may be represented as a value ranging from 0 to 1.0.
The similarity with the existing composition 333 may be the maximum similarity (ranging from 0 to 1.0) measured by comparing the position and size of a main object (a person, an animal, a vehicle, or the like) and the mean and variance of the sizes of all objects appearing in the frame with the shot composition and camera position data retrieved from a previously constructed shot composition DB 30.
Accordingly, the shot quality measurement unit 330 may calculate a shot quality score by totaling scores for the respective shot quality measurement factors.
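The following sketch illustrates one possible way of scoring the factors described above and totaling them into a shot quality score. The use of the Laplacian variance for sharpness, the Hough transform for line directionality, the normalization constants, and the equal weighting of the factors are assumptions of this example; the similarity with the existing composition, which would be obtained by comparison against the shot composition DB 30, is passed in as a precomputed value.

```python
# Sketch of the per-frame shot quality score; scoring constants are assumed,
# and the composition-similarity factor is supplied externally.
import cv2
import numpy as np

def sharpness_score(gray, scale=1000.0):
    # Variance of the Laplacian is a common clarity measure; `scale` squashes
    # it into the 0-1 range and is an assumed constant.
    return float(min(cv2.Laplacian(gray, cv2.CV_64F).var() / scale, 1.0))

def line_directionality_score(gray):
    edges = cv2.Canny(gray, 50, 150)
    lines = cv2.HoughLines(edges, 1, np.pi / 180, threshold=120)
    if lines is None or len(lines) < 2:
        return 0.0
    angles = lines[:, 0, 1]  # theta of each detected line
    # Lower angular spread among main lines -> more uniform directions.
    return float(max(0.0, 1.0 - np.std(angles) / (np.pi / 2)))

def shot_quality(frame, composition_similarity=0.0):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Total the scores for the respective shot quality measurement factors.
    return (line_directionality_score(gray)
            + sharpness_score(gray)
            + composition_similarity)
```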
The place and time information extraction unit 340 recognizes a shot location and a shot time for the extracted frame using the shot location recognition model and the shot time recognition model, which are stored in the AI model DB 10.
First, the shot composition extraction unit 230 predicts the shot composition of a frame based on a shot composition detection model stored in the AI model DB 10.
The image feature extraction unit 341 extracts the color distribution of the frame. Here, the color distribution of each of multiple segmented areas of the frame (e.g., the color distribution of each of the 9×9 segmented areas of the frame) and the color distribution of the entire frame are extracted.
Also, the key frame extraction unit 342 uses the extracted frame as a feature.
The recognition unit 343 predicts the shot location and the shot time of the frame when at least one of the shot composition and camera position previously recognized by the shot composition and camera position recognition unit 320, the color distribution of the frame extracted by the image feature extraction unit 341, and the key frame extracted by the key frame extraction unit 342 is input.
The place/time-based grouping unit 350 groups frames corresponding to similar shot locations and similar shot times together through clustering of the frames based on the predicted shot location and shot time. Here, the number of groups based on the shot location and the shot time may be set by a user.
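By way of illustration, the following sketch clusters frames on the outputs of the shot location and shot time recognition models. K-means is one possible clustering method (the disclosure does not name one), and the number of groups is a user-set parameter, as stated above.

```python
# Sketch of place/time-based grouping of frames; the clustering algorithm is
# an assumed choice.
import numpy as np
from sklearn.cluster import KMeans

def group_frames(location_probs, time_probs, n_groups=4):
    """location_probs / time_probs: per-frame class-probability vectors
    predicted by the shot location and shot time recognition models."""
    features = np.hstack([np.asarray(location_probs), np.asarray(time_probs)])
    labels = KMeans(n_clusters=n_groups, n_init=10,
                    random_state=0).fit_predict(features)
    return labels  # group index assigned to each frame
```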
The information combination unit 360 combines pieces of information, which are recognition results acquired for the respective frames, for each video shot and stores the combined pieces of information in the video shot DB 20 by tagging the video shot with the same.
Here, the combined information may include at least one of shot composition, a camera position, a shot quality, a place code, a time code, an actual shot location, and an actual shot time.
Here, in the case of the shot composition, the shot quality scores of the respective frames, measured by the shot quality measurement unit 330, are totaled for each type of shot composition, and the shot composition having the highest shot quality score may become the shot composition of the video shot.
In the case of the camera position, the shot quality scores of the respective frames, measured by the shot quality measurement unit 330, are totaled for each camera position, and the camera position having the highest shot quality score may become the camera position of the video shot.
The shot quality may be the mean of the quality scores of the frames recognized as having the selected shot composition.
The time group may be set to the time group having the largest number of frames, among the groups that are set for the frames by the place/time-based grouping unit 350.
The place group may be set to the place group having the largest number of frames, among the groups that are set for the frames by the place/time-based grouping unit 350.
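The following sketch illustrates, as one possible realization, how the information combination unit 360 may derive the tags of a video shot from the per-frame results described above: quality scores are totaled per shot composition and per camera position, the shot quality is the mean quality of the frames having the selected composition, and the place and time groups containing the largest numbers of frames are kept. The field names are assumptions of this example.

```python
# Sketch of combining per-frame recognition results into per-shot tags,
# following the selection rules described above. Dictionary keys are assumed.
from collections import defaultdict

def combine_shot_info(frame_results):
    """frame_results: list of dicts like
    {"composition": "bust", "position": "eye-level", "quality": 1.7,
     "place_group": 2, "time_group": 0}"""
    comp_score = defaultdict(float)
    pos_score = defaultdict(float)
    place_count = defaultdict(int)
    time_count = defaultdict(int)
    for r in frame_results:
        comp_score[r["composition"]] += r["quality"]
        pos_score[r["position"]] += r["quality"]
        place_count[r["place_group"]] += 1
        time_count[r["time_group"]] += 1
    composition = max(comp_score, key=comp_score.get)
    qualities = [r["quality"] for r in frame_results
                 if r["composition"] == composition]
    return {
        "composition": composition,
        "camera_position": max(pos_score, key=pos_score.get),
        "shot_quality": sum(qualities) / len(qualities),
        "place_group": max(place_count, key=place_count.get),
        "time_group": max(time_count, key=time_count.get),
    }
```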
The tagging item search unit 410 searches the video shot DB 20 for at least one video shot matching a search keyword when one of tagging items is input by a user as the search keyword.
Here, the tagging items may include at least one of shot composition, a camera position, an actual shot location, and an actual shot time.
Here, when multiple video shots are found, the video shot arrangement unit 420 may arrange the found video shots according to a predetermined criterion.
According to an embodiment, the video shot arrangement unit 420 may sort the video shots by referring to other tagging items with which each of the multiple video shots is tagged. For example, the video shots may be sorted in order of shot quality with which the video shots are tagged.
According to another embodiment, the video shot arrangement unit 420 may sort the video shots based on the optimal shot composition that most naturally matches the shot composition of a previously selected video shot based on film grammar.
Here, the film grammar is film grammar information for each scene and storyline used in existing films, and may be stored in a film grammar DB 440. For example, in film grammar named “telephone call”, multiple bust shots taken from an eye-level angle are sequentially used, and the position of the main object is switched when the speaker changes. Accordingly, when a “telephone call” scene is selected as the scene intended by a user, the video shot arrangement unit 420 sorts the found video shots starting from the video shot tagged with the shot composition information that most naturally matches the shot composition information of the scene just formed by the user, based on the film grammar data named “telephone call”.
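As a non-limiting illustration of the search and arrangement functions, the following sketch filters an in-memory list of tagged video shots by a search keyword and sorts the result either by tagged shot quality or by how naturally each shot's composition follows a previously selected composition under a simple film-grammar lookup. The in-memory list and the grammar mapping are assumptions standing in for the video shot DB 20 and the film grammar DB 440.

```python
# Sketch of tag-based search and arrangement; data structures are assumed
# stand-ins for the video shot DB 20 and the film grammar DB 440.
def search_shots(shot_db, keyword):
    """shot_db: list of dicts with keys such as 'composition',
    'camera_position', 'location', 'time', 'quality'."""
    return [s for s in shot_db if keyword in (
        s.get("composition"), s.get("camera_position"),
        s.get("location"), s.get("time"))]

def arrange_by_quality(shots):
    return sorted(shots, key=lambda s: s.get("quality", 0.0), reverse=True)

def arrange_by_film_grammar(shots, previous_composition, grammar):
    """grammar: mapping from the previously selected composition to a ranked
    list of compositions that naturally follow it under a given scene's
    film grammar (e.g., bust shots following bust shots in a telephone call)."""
    preferred = grammar.get(previous_composition, [])
    rank = {comp: i for i, comp in enumerate(preferred)}
    return sorted(shots, key=lambda s: rank.get(s.get("composition"), len(rank)))

# Example (hypothetical data):
# results = search_shots(shot_db, "bust shot")
# results = arrange_by_film_grammar(results, "bust shot",
#                                   {"bust shot": ["bust shot", "close shot"]})
```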
The video shot output unit 430 outputs the found video shots, in which case the video shots are output in the order set by the video shot arrangement unit 420.
The search system based on analysis of video shots of image content and each of components included therein according to an embodiment may be implemented in a computer system 1000 including a computer-readable recording medium.
The computer system 1000 may include one or more processors 1010, memory 1030, a user-interface input device 1040, a user-interface output device 1050, and storage 1060, which communicate with each other via a bus 1020. Also, the computer system 1000 may further include a network interface 1070 connected with a network 1080. The processor 1010 may be a central processing unit or a semiconductor device for executing a program or processing instructions stored in the memory 1030 or the storage 1060. The memory 1030 and the storage 1060 may be storage media including at least one of a volatile medium, a nonvolatile medium, a detachable medium, a non-detachable medium, a communication medium, and an information delivery medium. For example, the memory 1030 may include ROM 1031 or RAM 1032.
An embodiment is expected to improve a production environment for commercial image content, such as a broadcast, a movie, and the like, in which it is required to take thousands of video shots and search the same one by one.
Although embodiments of the present invention have been described with reference to the accompanying drawings, those skilled in the art will appreciate that the present invention may be practiced in other specific forms without changing the technical spirit or essential features of the present invention. Therefore, the embodiments described above are illustrative in all aspects and should not be understood as limiting the present invention.
Number | Date | Country | Kind |
---|---|---|---
10-2020-0082871 | Jul 2020 | KR | national |
Number | Name | Date | Kind |
---|---|---|---
5845048 | Masumoto | Dec 1998 | A |
10679046 | Black | Jun 2020 | B1 |
10681335 | Reunamäki | Jun 2020 | B2 |
10929672 | Rios, III | Feb 2021 | B2 |
11528429 | Ikeda et al. | Dec 2022 | B2 |
20030002715 | Kowald | Jan 2003 | A1 |
20050104958 | Egnal | May 2005 | A1 |
20080019661 | Obrador | Jan 2008 | A1 |
20100110266 | Lee et al. | May 2010 | A1 |
20110085739 | Zhang et al. | Apr 2011 | A1 |
20120123978 | Toderice | May 2012 | A1 |
20170186147 | He | Jun 2017 | A1 |
20170237896 | Tsai | Aug 2017 | A1 |
20180065247 | Atherton | Mar 2018 | A1 |
20180295408 | Wu | Oct 2018 | A1 |
20190364196 | Song | Nov 2019 | A1 |
20200092465 | Lee et al. | Mar 2020 | A1 |
20200128247 | Doshi | Apr 2020 | A1 |
20200272652 | Ito | Aug 2020 | A1 |
20210158570 | Mohandoss | May 2021 | A1 |
20220030163 | Jung et al. | Jan 2022 | A1 |
20220044414 | Noh et al. | Feb 2022 | A1 |
20220139092 | Hashimoto | May 2022 | A1 |
20220172746 | Ikeda | Jun 2022 | A1 |
Number | Date | Country |
---|---|---
110169055 | Aug 2019 | CN |
2008-97233 | Apr 2008 | JP |
10-2018-0058380 | Jun 2018 | KR |
10-2019-0064958 | Jun 2019 | KR |
10-2019-0105533 | Sep 2019 | KR |
10-2020-0044435 | Apr 2020 | KR |
10-2020-0047267 | May 2020 | KR |
2020054241 | Mar 2020 | WO |
Number | Date | Country
---|---|---
20220004773 A1 | Jan 2022 | US |