These and/or other aspects and advantages of the invention will become apparent and more readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the like elements throughout. Embodiments are described below to explain the present invention by referring to the figures.
The scene change detector 101 may segment video data into a plurality of shots and identify a key frame for each of the plurality of shots. Here, any use of the term “key frame” is a reference to an image frame or merged data from multiple frames that may be extracted from a video sequence to generally express the content of a unit segment, i.e., a frame capable of best reflecting the substance within that unit segment/shot. Thus, the scene change detector 101 may detect a scene change point of the video data and segment the video data into the plurality of shots. Here, the scene change detector 101 may detect the scene change point by using various techniques such as those discussed in U.S. Pat. Nos. 5,767,922, 6,137,544, and 6,393,054. According to an embodiment of the present invention, the scene change detector 101 calculates similarity for a histogram of two sequential frame images, namely, a present frame image and a previous frame image in a color histogram and detects the present frame as a frame in which a scene change occurs when the calculated similarity is less than a certain threshold, noting that alternative embodiments are equally available.
As noted above, the key frame is one or a plurality of frames selected from each of the plurality of shots and may represent the shot. In an embodiment, since the video data is segmented by determining a face image feature of an anchor, a frame capable of best reflecting a face feature of the anchor may be selected as the key frame. According to an embodiment of the present invention, the scene change detector 101 selects a frame separated from the scene change point by a predetermined interval, from the frames forming each shot. Namely, the scene change detector 101 identifies a frame, after a predetermined amount of time from the start frame of each of the plurality of shots, as the key frame of the shot. This is because, in the few frames immediately after the start frame, the face of the anchor often does not face the front, and it is often difficult to acquire a clear image from the start frames. For example, the key frame may be a frame 0.5 seconds after each scene change point.
Thus, the face detector 102 may detect a face from the key frame. Here, the operations performed by the face detector 102 will be described in greater detail further below.
The face feature extractor 103 may extract face feature information from the detected face, e.g., by generating multi-sub-images with respect to an image of the detected face, extracting Fourier features for each of the multi-sub-images by Fourier transforming the multi-sub-images, and generating the face feature information by combining the Fourier features. The operations performed by the face feature extractor 103 will be described in greater detail further below.
The clustering unit 104 may generate a plurality of clusters by grouping a plurality of shots forming video data, based on the similarity between the plurality of shots. The clustering unit 104 may further merge clusters including the same shot, from the generated clusters, and remove clusters whose number of shots is not more than a predetermined number. The operations performed by the clustering unit 104 will be described in greater detail further below.
The shot merging unit 105 may merge a plurality of shots that are repeatedly included in a search window more than a predetermined number of times within a predetermined amount of time, into one shot, by applying the search window to the video data. Here, the shot merging unit 105 may identify the key frame for each of the plurality of shots, compare a key frame of a first shot selected from the plurality of shots with a key frame of an Nth shot after the first shot, and merge all the shots from the first shot to the Nth shot when the similarity between the key frame of the first shot and the key frame of the Nth shot is not less than a predetermined threshold. In this example, the size of the search window is N. When the similarity between the key frame of the first shot and the key frame of the Nth shot is less than the predetermined threshold, the shot merging unit 105 may compare the key frame of the first shot with a key frame of an N−1th shot. Namely, in one embodiment, a first shot is compared with the final shot of a search window whose size is N, and when the first shot is determined to be not similar to the final shot, the immediately preceding shot is compared with the first shot. As described above, according to an embodiment of the present invention, shots included in a scene in which an anchor and a guest are repeatedly shown in one theme may be efficiently merged. The operations performed by the shot merging unit 105 will be described in greater detail further below.
The final cluster determiner 106 may identify the cluster having the largest number of shots, from the plurality of clusters, to be a first cluster, and identify a final cluster by comparing the other clusters with the first cluster. The final cluster determiner 106 may then identify the final cluster by merging clusters using time information of the shots included in the clusters.
The final cluster determiner 106 may further perform a second operation of generating a first distribution value of time lags between shots included in the first cluster, i.e., the cluster including the largest number of key frames, sequentially merging the shots included in each of the other clusters, excluding the first cluster, with the first cluster, and identifying the smallest value from the distribution values of the merged clusters to be a second distribution value. Further, when the second distribution value is less than the first distribution value, the final cluster determiner 106 may merge the cluster identified by the second distribution value with the first cluster, and identify the final cluster after performing the merging for all the clusters. However, when the second distribution value is greater than the first distribution value, the final cluster is identified without performing the merging of the second cluster.
The final cluster determiner 106, thus, may identify the shots included in the final cluster to be shots in which an anchor is included. According to an embodiment of the present invention, the video data is segmented by using the shots identified to be the shots in which the anchor is included, as a semantic unit. The operations performed by the final cluster determiner 106 will be described in greater detail further below.
The face model generator 107 may identify the shot that is included the greatest number of times, from the shots included in the plurality of clusters identified to be the final cluster, to be a face model shot. A character shown in a key frame of the face model shot may be identified to be an anchor of news video data. Thus, according to an embodiment of the present invention, the news video data may be segmented by using an image of the character identified to be the anchor.
In an embodiment, the video data may include data having both video and audio, as well as data having video without audio. When video data is input, the video data processing system 100 may separate the input into video data and audio data and transfer the video data to the scene change detector 101, for example, in operation S201.
In operation S202, the scene change detector 101 may detect a scene change point of video data and segment the video data into a plurality of shots based on the scene change point.
In one embodiment, the scene change detector 101 stores a previous frame image, calculates the similarity between the color histograms of two sequential frame images, namely a present frame image and the previous frame image, and detects the present frame as a frame in which the scene change occurs when the similarity is less than a certain threshold. In this case, the similarity, Sim(Ht, Ht+1), may be calculated as in the below Equation 1.
In this case, Ht indicates a color histogram of the previous frame image, Ht+1 indicates a color histogram of the present frame image, and N indicates a histogram level.
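As an illustration only, the following is a minimal sketch of such a color-histogram comparison, assuming a normalized histogram-intersection similarity and an arbitrary threshold of 0.7; the exact measure and threshold of Equation 1 may differ.

```python
import numpy as np

def color_histogram(frame, bins=16):
    # frame: H x W x 3 uint8 image; one histogram per color channel, concatenated and normalized
    hist = [np.histogram(frame[..., c], bins=bins, range=(0, 256))[0] for c in range(3)]
    hist = np.concatenate(hist).astype(np.float64)
    return hist / hist.sum()

def histogram_similarity(h_prev, h_cur):
    # Histogram intersection: 1.0 for identical histograms, smaller for dissimilar frames
    return float(np.minimum(h_prev, h_cur).sum())

def detect_scene_changes(frames, threshold=0.7):
    # Returns the indices of frames at which a scene change is detected
    changes = []
    prev_hist = None
    for i, frame in enumerate(frames):
        cur_hist = color_histogram(frame)
        if prev_hist is not None and histogram_similarity(prev_hist, cur_hist) < threshold:
            changes.append(i)
        prev_hist = cur_hist
    return changes
```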
In an embodiment, a shot indicates a sequence of video frames acquired from one camera without an interruption and is a unit for analyzing or forming video. Thus, a shot includes a plurality of video frames. Also, a scene is generally made up of a plurality of shots. The scene is a semantic unit of the generated video data. The described concept of the shot and the scene may be identically applied to audio data as well as video data, depending on embodiments of the present invention.
A frame and a shot in video data will now be described in more detail.
Accordingly, when a scene change point is detected, the scene change detector 101, for example, identifies a frame separated from the scene change point at a predetermined interval, to be a key frame, in operation S203. Specifically, the scene change detector 101 may identify a frame after a predetermined amount of time from a start frame of each of the plurality of shots to be a key frame. For example, a frame 0.5 seconds after detecting the scene change point is identified to be the key frame.
In operation S204, the face detector 102, for example, may detect a face from the key frame, with various methods available for such detecting. For example, the face detector 102 may segment the key frame into a plurality of domains and determine, with respect to each of the segmented domains, whether the corresponding domain includes the face. The identifying of the face domain may be performed by using appearance information of an image of the key frame. The appearance may include, for example, a texture and a shape. According to another embodiment of the present invention, the contour of the image of the frame may be extracted, and whether the face is included may be determined based on the color information of pixels in a plurality of closed curves generated by the contour.
When the face is detected from the key frame, in operation S205, the face feature extractor 103, for example, may extract face feature information of the detected face and store it in a predetermined storage, for example. In this case, the face feature extractor 103 may identify the key frame from which the face is detected to be a face shot. The face feature information may be associated with features capable of distinguishing faces, and various techniques may be used for extracting the face feature information. Such techniques include extracting face feature information from various angles of a face, extracting colors and patterns of skin, analyzing the distribution of elements that are features of the face, e.g., a left eye and a right eye forming the face and the space between both eyes, and using a frequency distribution of the pixels forming the face. Further, techniques discussed in Korean Patent Application Nos. 10-2003-770410 and 10-2004-061417 may be used for extracting face feature information and for determining the similarity of faces by using the face feature information.
In operation S206, the clustering unit 104, for example, may calculate similarities between faces included in the face shots by using the extracted face feature information, and generate a plurality of clusters by grouping face shots whose similarity is not less than a predetermined threshold. In this case, each of the face shots may be included in several clusters. For example, one face shot may be included in both a first cluster and a fifth cluster.
To merge face shots including a different anchor, the shot merging unit 105, for example, may merge clusters by using the similarities between the face shots included in the cluster, in operation S207.
The final cluster determiner 106, for example, may generate a final cluster including only shots determined to include an anchor from the face shots included in the clusters by statistically determining an interval of when the anchor appears, in operation S208.
In this case, the final cluster determiner 106 may calculate a first distribution value of time lags between the face shots included in a first cluster, i.e., the cluster including the greatest number of face shots, and identify, as a second distribution value, the smallest value from the distribution values obtained by sequentially merging the face shots included in each of the other clusters, excluding the first cluster, with the first cluster. Further, when the second distribution value is less than the first distribution value, the cluster corresponding to the second distribution value is merged with the first cluster, and the final cluster is generated after the merging has been performed for all the clusters. However, when the second distribution value is greater than the first distribution value, the final cluster is generated without merging the second cluster.
In operation S209, the face model generator 107, for example, may identify the shot included the greatest number of times, from the shots included in the plurality of clusters identified to be the final cluster, to be a face model shot. The person in the face model shot may be identified to be a news anchor, e.g., because a news anchor is typically the person who appears the greatest number of times in a news program.
In this embodiment, each stage may be formed of a weighted sum with respect to a plurality of classifiers and may determine whether the face is detected, according to a sign of the weighted sum. Each stage may be represented as in Equation 2, set forth below.
In this case, c_m indicates a weight of a classifier, and f_m(x) indicates an output of the classifier. The output f_m(x) may be shown as in Equation 3, set forth below.
f_m(x) ∈ {−1, 1}    (Equation 3)
Namely, each classifier may be formed of one simple feature and a threshold and output a value of −1 or 1, for example.
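A minimal sketch of how one such stage might be evaluated follows, assuming AdaBoost-style weighted voting; the helper names and the zero decision boundary are illustrative assumptions rather than the exact form of Equation 2.

```python
def make_threshold_classifier(feature_fn, threshold, polarity=1):
    # A simple classifier formed of one feature and a threshold, outputting -1 or 1 (Equation 3 style)
    def f_m(x):
        return 1 if polarity * (feature_fn(x) - threshold) >= 0 else -1
    return f_m

def evaluate_stage(classifiers, weights, x):
    # Stage decision as the sign of a weighted sum of classifier outputs (Equation 2 style):
    # pass the sub-window image x to the next stage when the sum is non-negative,
    # otherwise reject it as a non-face.
    weighted_sum = sum(c_m * f_m(x) for c_m, f_m in zip(weights, classifiers))
    return weighted_sum >= 0
```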
According to the staged structure, connected by the cascaded stages, since a determination is possible even when a small number of simple features is used, a non-face sub-window image is quickly rejected in initial stages, such as a first stage or a second stage, and face detection may be attempted by receiving a (k+1)th sub-window image, thereby improving the overall face detection processing speed.
In operation 661, a stage number n may be established as 1, and in operation 663, a sub-window image may be tested in an nth stage to attempt to detect a face. In operation 665, whether face detection in the nth stage is successful may be determined, and operation 673 may further be performed to change the location or magnitude of the sub-window image when such face detection fails. However, when the face detection is successful, in operation 667, whether the nth stage is a final stage may be determined by the face detector 102. Here, when the nth stage is not the final stage, in operation 669, n is increased by 1 and operation 663 is repeated. Conversely, when the nth stage is the final stage, in operation 671, the coordinates of the sub-window image may be stored.
In operation 673, whether y corresponds to h of a first image or a second image, namely, whether the increasing of y is finished, may be determined. When the increasing of y is finished, in operation 677, whether x corresponds to w of the first image or the second image, namely, whether the increasing of x is finished, may be determined. Conversely, when the increasing of y is not finished, in operation 675, y may be increased by 1 and operation 661 repeated. When the increasing of x is finished, operation 681 may be performed. When the increasing of x is not finished, in operation 679, y is maintained as is, x is increased by 1, and operation 661 repeated.
In operation 681, whether the increase of the magnitude of the sub-window image is finished may be determined. When the increase of the magnitude of the sub-window image is not finished, in operation 683, the magnitude of the sub-window image may be increased by a predetermined scale factor and operation 661 repeated. Conversely, when the increase of the magnitude of the sub-window image is finished, in operation 685, the coordinates of each sub-window image in which a face was detected and stored in operation 671 may be grouped.
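The following is a simplified sketch of the scan over positions and magnitudes described in operations 661 through 685, assuming an abstract cascade_detects(x, y, size) test that returns True only when every stage passes; the minimum size and scale factor are illustrative assumptions.

```python
def scan_for_faces(image_w, image_h, cascade_detects, min_size=24, scale_factor=1.25):
    # Slide a square sub-window over every position and magnitude; keep the coordinates of
    # sub-windows that pass every cascade stage (operation 671). Grouping of overlapping
    # detections (operation 685) would follow on the returned list.
    detections = []
    size = min_size
    while size <= min(image_w, image_h):              # operations 681/683: grow the sub-window
        for x in range(0, image_w - size + 1):        # operations 677/679: advance x
            for y in range(0, image_h - size + 1):    # operations 673/675: advance y
                if cascade_detects(x, y, size):       # operations 661-669: test each stage in turn
                    detections.append((x, y, size))
        size = int(size * scale_factor)
    return detections
```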
In a face detection method according to an embodiment of the present invention, detection speed may be improved by restricting the full frame image input to the face detector 102, namely, by restricting the total number of sub-window images detected as a face from one first image. Similarly, the magnitude of a sub-window image may be restricted to the magnitude of a face detected from a previous frame image minus (n×n) pixels, or the magnitude of the second image may be restricted to a predetermined multiple of the coordinates of the box of the face position detected from the previous frame image.
The face feature extractor 103 may generate sub-images having different eye distances, with respect to an input image. The sub-images may have the same size of 45×45 pixels, for example, and have different eye distances with respect to the same face image.
A Fourier feature may be extracted for each of the sub-images. Here, there may be four operations: a first operation, in which the multi-sub-images are Fourier transformed; a second operation, in which the result of the Fourier transform is classified for each Fourier domain; a third operation, in which a feature is extracted by using the corresponding Fourier component for each classified Fourier domain; and a fourth operation, in which the Fourier features are generated by connecting all the features extracted for each Fourier domain. In the third operation, the feature can be extracted by using the Fourier component corresponding to the frequency band classified for each Fourier domain. The feature is extracted by multiplying the result of subtracting an average Fourier component of the corresponding frequency band from the Fourier component of the frequency band, by a previously trained transformation matrix. The transformation matrix can be trained to output the feature when the Fourier component is input, according to a principal component and linear discriminant analysis (PCLDA) algorithm, for example. Hereinafter, such an algorithm will be described in detail.
The face feature extractor 103 Fourier transforms an input image as in Equation 4 (operation 710), set forth below.
In this case, M is the number of pixels in the direction of an x axis in the input image, N is the number of pixels in the direction of a y axis, and X(x,y) is the pixel value of the input image.
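As a point of reference, the two-dimensional discrete Fourier transform consistent with these definitions of M, N, and X(x,y) is conventionally written as follows; this standard form is offered only as a sketch of what operation 710 computes.

\[ F(u,v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} X(x,y)\, e^{-j 2\pi \left( \frac{u x}{M} + \frac{v y}{N} \right)} \]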
The face feature extractor 103 may classify a result of a Fourier transform according to Equation 4 for each domain by using the below Equation 5, in operation 720. In this case, the Fourier domain may be classified into a real number component R(u,v), an imaginary number component I(u,v), a magnitude component |F(u,v)|, and a phase component φ(u,v) of the Fourier transform result, expressed as in Equation 5, set forth below.
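For reference, these four components are conventionally obtained from the transform result F(u,v) as follows; this standard decomposition is given only as a sketch.

\[ R(u,v) = \operatorname{Re}\{F(u,v)\}, \qquad I(u,v) = \operatorname{Im}\{F(u,v)\}, \]
\[ |F(u,v)| = \sqrt{R(u,v)^2 + I(u,v)^2}, \qquad \phi(u,v) = \tan^{-1}\!\left(\frac{I(u,v)}{R(u,v)}\right) \]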
For example, it may be seen that while distinguishing class 1 from class 3 with respect to phase is relatively difficult, distinguishing class 1 from class 3 with respect to magnitude is relatively simple. Similarly, while it is difficult to distinguish class 1 from class 2 with respect to magnitude, class 1 may be distinguished from class 2 with respect to phase relatively easily.
In the case of general template-based face recognition, the magnitude domain, namely, the Fourier spectrum, may be substantially used in describing a face feature, because when a small spatial displacement occurs the phase changes drastically while the magnitude changes only gently. However, in an embodiment of the present invention, a phase domain showing a notable feature with respect to the face image is also reflected, with the phase domain of a low frequency band, which is relatively less sensitive, considered together with the magnitude domain. Further, to reflect all detailed features of a face, a total of three Fourier features may be used for performing the face recognition. As the Fourier features, a domain combining the real number component and the imaginary number component (hereinafter, referred to as an R/I domain), the magnitude component of the Fourier transform (hereinafter, referred to as an M domain), and the phase component of the Fourier transform (hereinafter, referred to as a P domain) may be used. Mutually different frequency bands may be selected corresponding to the properties of the described various face features.
The face feature extractor 103 may classify each Fourier domain for each frequency band, e.g., in operations 731, 732, and 733. Namely, the face feature extractor 103 may classify, for each Fourier domain, a frequency band corresponding to the property of the corresponding Fourier domain. In an embodiment, the frequency bands are classified into a low frequency band B1 corresponding to 0 to ⅓ of the entire band, a frequency band B2 beneath an intermediate frequency, corresponding to 0 to ⅔ of the entire band, and an entire frequency band B3 corresponding to 0 to the entire band.
In the face image, the low frequency band is located in an outer side of the Fourier domain and the high frequency band is located in a center part of the Fourier domain.
In the R/I domain of the Fourier transform, all Fourier components of the frequency bands B1, B2, and B3 are considered, in operation 731. Since information in the frequency band is not sufficiently included in the magnitude domain, the components of the frequency bands B1 and B2, excluding B3, may be considered, in operation 732. In the phase domain, the component of the frequency band B1, excluding B2 and B3, in which the phase is drastically changed may be considered, in operation 733. Since the value of the phase is drastically changed due to a small variation in the intermediate frequency band and the high frequency band, only the low frequency band may be suitable for consideration.
The face feature extractor 103 may extract the features for the face recognition from the Fourier components of the frequency band, classified for each Fourier domain. In the present embodiment, feature extraction may be performed by using a PCLDA technique, for example.
Linear discriminant analysis (LDA) is a learning method of linear-projecting data to a sub-space maximizing between-class scatter by reducing within-class scatter in a class. For this, a between-class scatter matrix SB indicating between-class distribution and a within-class scatter matrix SW indicating within-class distribution are defined as follows.
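In the standard LDA formulation, these two matrices are commonly written as follows, where m denotes the overall mean of the samples; the expressions are offered as a sketch using the symbols defined in the next sentence, and the exact weighting convention may vary.

\[ S_B = \sum_{i=1}^{c} M_i\, (m_i - m)(m_i - m)^{T}, \qquad S_W = \sum_{i=1}^{c} \sum_{x_k \in c_i} (x_k - m_i)(x_k - m_i)^{T} \]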
In this case, m_i is an average image of the ith class c_i having M_i samples, and c is the number of classes. A transformation matrix W_opt is acquired satisfying Equation 7, as set forth below.
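A standard statement of the Fisher criterion that such a transformation matrix satisfies, given here only as a sketch, is:

\[ W_{opt} = \arg\max_{W} \frac{\left| W^{T} S_B W \right|}{\left| W^{T} S_W W \right|} = \left[\, w_1 \; w_2 \; \cdots \; w_n \,\right] \]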
In this case, n is the number of projection vectors, and n = min(c−1, N, M).
Principal component analysis (PCA) may be performed before performing the LDA to reduce dimensionality of a vector to overcome singularity of the within-class scatter matrix. This is called PCLDA in the present embodiment, and performance of the PCLDA depends on a number of eigenspaces used for reducing input dimensionality.
The face feature extractor 103 may extract the features for each frequency band of each Fourier domain according to the described PCLDA technique, in operations 741, 742, 743, 744, 745, and 746. For example, a feature y_RIB1 of the frequency band B1 of the R/I Fourier domain may be acquired by Equation 8, set forth below.
y_RIB1 = W_RIB1^T (RI_B1 − m_RIB1)    (Equation 8)
In this case, W_RIB1 is a transformation matrix trained by the PCLDA, according to Equation 7, to output features with respect to the Fourier component RI_B1 from a learning set, and m_RIB1 is an average of the RI_B1 Fourier components.
In operation 750, the face feature extractor 103 may connect the features output above. Features output from the three frequency bands of the RI domain, features output from the two frequency bands of the magnitude domain, and a feature output from the one frequency band of the phase domain are connected by Equation 9, set forth below.
y_RI = [y_RIB1  y_RIB2  y_RIB3]
y_M = [y_MB1  y_MB2]
y_P = [y_PB1]    (Equation 9)
The features of Equation 9 are finally concatenated as f in Equation 10, shown below, and form a mutually complementary feature.
f = [y_RI  y_M  y_P]    (Equation 10)
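The following is a compact sketch of operations 710 through 750, assuming pre-trained PCLDA transformation matrices and mean vectors for each (domain, band) pair; the band masks, vectorization order, and projection helper are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def band_mask(shape, fraction):
    # Select the lowest `fraction` of frequencies (B1: 1/3, B2: 2/3, B3: full band)
    h, w = shape
    fy = np.minimum(np.arange(h), h - np.arange(h))[:, None] / (h / 2.0)
    fx = np.minimum(np.arange(w), w - np.arange(w))[None, :] / (w / 2.0)
    return np.maximum(fy, fx) <= fraction

def pclda_project(vec, W, m):
    # Equation 8 style projection: y = W^T (vec - m)
    return W.T @ (vec - m)

def fourier_face_feature(image, models):
    # models[(domain, band)] = (W, m) for domain in {"RI", "M", "P"} and band in {"B1", "B2", "B3"}
    F = np.fft.fft2(image)                                    # operation 710 (Equation 4)
    domains = {"RI": np.concatenate([F.real.ravel(), F.imag.ravel()]),
               "M": np.abs(F).ravel(),
               "P": np.angle(F).ravel()}                      # operation 720 (Equation 5)
    bands = {"B1": 1.0 / 3, "B2": 2.0 / 3, "B3": 1.0}
    plan = {"RI": ["B1", "B2", "B3"], "M": ["B1", "B2"], "P": ["B1"]}  # operations 731-733
    features = []
    for dom, band_names in plan.items():
        for b in band_names:
            mask = band_mask(F.shape, bands[b]).ravel()
            mask = np.concatenate([mask, mask]) if dom == "RI" else mask
            W, m = models[(dom, b)]
            features.append(pclda_project(domains[dom][mask], W, m))  # operations 741-746
    return np.concatenate(features)                           # operation 750 (Equations 9-10)
```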
Images 1020, 1030, and 1040 are results of preprocessing the images 1011, 1012, and 1013 from the input image 1010, such as lighting processing, and resizing to 46×56 images, respectively.
In a face model ED1 of the image 1020, learning performance is largely reduced when the form of the nose changes or the coordinates of the eyes are at a wrong location of the face; namely, the direction the face is pointed greatly affects performance.
Since an image ED3 1040 includes the full form of the face, the image ED3 1040 is robust to pose variation or wrong eye coordinates, and the learning performance is high because the shape of the head does not change over short periods of time. However, when the shape of the head changes, e.g., over a long period of time, the performance is largely reduced. Also, since there is relatively little internal information of the face, the internal information of the face is not reflected while training, and therefore general performance may not be high.
Since an ED2 image 1030 suitably combines the merits of the image 1020 and the image 1040, head information or background information is not excessively included and most of the information corresponds to internal information of the face, thereby showing the most suitable performance.
Thus, in operation S1101, the clustering unit 104, for example, may calculate the similarity between the plurality of shots forming the video data. This similarity is the similarity between the face feature information calculated from the key frame of each of the plurality of shots.
In operation S1102, the clustering unit 104 may generate a plurality of initial clusters by grouping shots whose similarity is not less than a predetermined threshold.
In operation S1103, the clustering unit 104 may merge clusters including the same shot, from the generated initial clusters.
In operation S1104, the clustering unit 104 may remove clusters whose number of included shots is not more than a predetermined value.
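A minimal sketch of operations S1101 through S1104 follows, assuming a similarity(a, b) function over the stored face feature vectors; the similarity threshold and the minimum cluster size are illustrative assumptions.

```python
def cluster_face_shots(features, similarity, sim_threshold=0.8, min_shots=3):
    # features: dict mapping shot id -> face feature vector of the shot's key frame
    shot_ids = sorted(features)

    # Operations S1101/S1102: group each shot with every shot whose similarity reaches the threshold
    initial = [{t for t in shot_ids if similarity(features[s], features[t]) >= sim_threshold}
               for s in shot_ids]

    # Operation S1103: repeatedly merge clusters that share at least one shot
    merged = [set(c) for c in initial]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if merged[i] & merged[j]:
                    merged[i] |= merged.pop(j)
                    changed = True
                    break
            if changed:
                break

    # Operation S1104: remove clusters whose number of shots is not more than the minimum
    return [c for c in merged if len(c) > min_shots]
```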
Thus, according to the present embodiment, video data may be segmented by distinguishing an anchor by removing a face shot including a character shown alone, from a cluster. For example, video data of a news program may include faces of various characters, such as a correspondent and characters associated with the news, in addition to a general anchor, a weather anchor, an overseas news anchor, a sports news anchor, and an editorial anchor. According to the present embodiment, there is an effect that the correspondent or characters associated with the news, who are shown only intermittently, are not identified to be the anchor.
The shot merging unit 105 may merge a plurality of shots repeatedly included more than a predetermined number of times within a predetermined amount of time, into one shot, by applying a search window to the video data. In news program video data, in addition to a case in which an anchor delivers news alone, there is a case in which a guest is invited and the anchor and the guest communicate with each other with respect to one subject. In this case, while the principal character changes, since the shots are with respect to one subject, it is desirable to merge the part in which the anchor and the guest communicate with each other into one subject shot. Accordingly, the shot merging unit 105 merges shots included not less than the predetermined number of times, within the predetermined amount of time, into one shot representing the shots, by applying the search window to the video data. The amount of video data included in the search window may vary, and the number of shots to be merged may also vary.
Though the size of a search window 1410 has been assumed to be 8 for understanding the present invention, embodiments of the present invention are not limited thereto, and alternate embodiments are equally available.
When merging shots 1 to 8, belonging to the search window 1410, the merging may proceed as follows.
For example, a similarity calculation may be performed by checking the similarity between the first shot of the window and shots taken from the opposite end of the window, moving inward. Namely, the similarity calculation may be performed by checking the similarity between two face shots in an order of comparing the face feature information of the first shot (B#=1) with the face feature information of the eighth shot (B#=8), then comparing the face feature information of the first shot (B#=1) with the face feature information of a seventh shot (B#=7), and then comparing the face feature information of the first shot (B#=1) with the face feature information of a sixth shot (B#=6).
In this case, when the similarity [Sim (F1, F8)] between the first shot (B#=1) and the eighth shot (B#=8) is determined to be less than a predetermined threshold, as a result of comparing the similarity [Sim (F1, F8)] with the predetermined threshold, the shot merging unit 105 determines whether the similarity [Sim (F1, F7)] between the first shot (B#=1) and the seventh shot (B#=7) is not less than the predetermined threshold. In this case, when the similarity [Sim (F1, F7)] between the first shot (B#=1) and the seventh shot (B#=7) is determined to be not less than the predetermined threshold, all the FIDs from the first shot (B#=1) to the seventh shot (B#=7) are established as 1. In this case, the similarities between the first shot (B#=1) and the shots from the sixth shot (B#=6) to the second shot (B#=2) need not be compared. Accordingly, the shot merging unit 105 may merge all the shots from the first shot to the seventh shot.
The shot merging unit 105 may, thus, perform the described operations, by using the face feature information, until the FIDs are acquired for all the shots, i.e., for all the B#. According to an embodiment, a segment in which the anchor and the guest communicate with each other may be processed as one shot, and such shot mergence may be processed very efficiently.
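The following is a sketch of the search-window merging just described, assuming one key-frame feature vector per shot, a similarity function, and a window size N of 8; the FID assignment simply numbers the merged groups.

```python
def merge_shots_in_window(features, similarity, window_size=8, threshold=0.8):
    # features: list of key-frame feature vectors, one per shot, in temporal order.
    # Returns a list of FIDs; shots sharing an FID are merged into one shot.
    fids = [0] * len(features)
    next_fid = 1
    i = 0
    while i < len(features):
        merged_to = i
        # Compare shot i with the last shot of its window, then the one before it, and so on
        for j in range(min(i + window_size - 1, len(features) - 1), i, -1):
            if similarity(features[i], features[j]) >= threshold:
                merged_to = j
                break
        for k in range(i, merged_to + 1):
            fids[k] = next_fid
        next_fid += 1
        i = merged_to + 1
    return fids
```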
In operation S1501, the final cluster determiner 106 may arrange the clusters according to the number of included shots.
In operation S1502, the final cluster determiner 106 identifies the cluster including the largest number of shots, from the plurality of clusters, to be a first cluster.
In operations S1503 through S1507, the final cluster determiner 106 may identify a final cluster by comparing the first cluster with clusters excluding the first cluster. Hereinafter, operations S1502 through S1507 will be described in greater detail.
In operation S1503, the final cluster determiner 106 identifies the first cluster to be a temporary final cluster. In operation S1504, a first distribution value of time lags between the shots included in the temporary final cluster is calculated.
In operation S1505, the final cluster determiner 106 may sequentially merge the shots included in the other clusters, excluding the first cluster, with the first cluster and identify the smallest value from the distribution values of the merged clusters to be a second distribution value. In detail, the final cluster determiner 106 may select one of the other clusters, excluding the temporary final cluster, and merge the selected cluster with the temporary final cluster (a first operation). A distribution value of the time lags between the shots included in the merged cluster may then be calculated (a second operation). The final cluster determiner 106 identifies the smallest value from the distribution values calculated by performing the first operation and the second operation for all the clusters, excluding the temporary final cluster, to be the second distribution value, and identifies the cluster from which the second distribution value was calculated to be a second cluster.
In operation S1506, the final cluster determiner 106 may compare the first distribution value with the second distribution value. When the second distribution value is less than the first distribution value, as a result of the comparison, the final cluster determiner 106 may generate a new temporary final cluster by merging the second cluster with the temporary final cluster, in operation S1507. The final cluster may be generated by performing such merging for all of the clusters accordingly. However, when the second distribution value is not less than the first distribution value, the final cluster may be generated without merging the second cluster.
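A sketch of operations S1503 through S1507 follows, assuming each shot is represented by its start time in seconds and that the distribution value is the variance of the time gaps between consecutive shots; both of these are illustrative assumptions.

```python
import statistics

def time_lag_variance(shot_times):
    # "Distribution value": variance of gaps between consecutive shot start times
    times = sorted(shot_times)
    gaps = [b - a for a, b in zip(times, times[1:])]
    return statistics.pvariance(gaps) if len(gaps) > 1 else float("inf")

def determine_final_cluster(clusters):
    # clusters: list of sets of shot start times, ordered so clusters[0] has the most shots
    final = set(clusters[0])                        # operation S1503: temporary final cluster
    remaining = [set(c) for c in clusters[1:]]
    while remaining:
        first_var = time_lag_variance(final)        # operation S1504: first distribution value
        # Operation S1505: variance of each trial merge; keep the smallest as the second value
        trials = [(time_lag_variance(final | c), c) for c in remaining]
        second_var, second_cluster = min(trials, key=lambda t: t[0])
        if second_var < first_var:                  # operations S1506/S1507
            final |= second_cluster
            remaining.remove(second_cluster)
        else:
            break                                   # no further merge reduces the temporal spread
    return final
```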
The final cluster determiner 106 may further extract the shots included in the final cluster. In addition, the final cluster determiner 106 may identify the shots included in the final cluster to be shots in which an anchor is shown. Namely, from the plurality of shots forming the video data, the shots included in the final cluster may be identified to be the shots in which the anchor is shown, according to the present embodiment. Accordingly, when the video data is segmented based on the shots in which the anchor is shown, namely, the shots included in the final cluster, the video data may be segmented into news segments.
The face model generator 107 identifies a shot, which is included the greatest number of times in the plurality of clusters identified to be the final cluster, to be a face model shot. Since the character of the face model shot is shown most frequently in the news video, the character may be identified to be the anchor.
Further, when the second distribution value is less than the first distribution value, the cluster identified by the second distribution value may be merged with the first cluster. Accordingly, the merging may be performed for all the clusters and a final cluster generated. However, when the second distribution value is more than the first distribution value, the final cluster may be generated without merging the second cluster.
Thus, according to an embodiment of the present invention, video data can be segmented by classifying face shots of an anchor equally-spaced in time.
In addition to the above described embodiments, embodiments of the present invention can also be implemented through computer readable code/instructions in/on a medium, e.g., a computer readable medium, to control at least one processing element to implement any above described embodiment. The medium can correspond to any medium/media permitting the storing and/or transmission of the computer readable code.
The computer readable code can be recorded/transferred on a medium in a variety of ways, with examples of the medium including magnetic storage media (e.g., ROM, floppy disks, hard disks, etc.), optical recording media (e.g., CD-ROMs, or DVDs), and storage/transmission media such as carrier waves, as well as through the Internet, for example. Here, the medium may further be a signal, such as a resultant signal or bitstream, according to embodiments of the present invention. The media may also be a distributed network, so that the computer readable code is stored/transferred and executed in a distributed fashion. Still further, as only an example, the processing element could include a processor or a computer processor, and processing elements may be distributed and/or included in a single device.
One or more embodiments of the present invention provide a video data processing method, medium, and system capable of segmenting video data by a semantic unit that does not include a certain video/audio feature.
One or more embodiments of the present invention further provide a video data processing method, medium, and system capable of segmenting/summarizing video data by a semantic unit, without previously storing face/voice data with respect to a certain anchor in a database.
One or more embodiments of the present invention also provide a video data processing method, medium, and system which do not segment a scene in which an anchor and a guest are repeatedly shown in one theme.
One or more embodiments of the present invention also provide a video data processing method, medium, and system capable of segmenting video data for each anchor, namely, each theme, by using the fact that an anchor may be repeatedly shown, equally spaced in time, more than other characters.
One or more embodiments of the present invention also provide a video data processing method, medium, and system capable of segmenting video data by identifying an anchor by removing a face shot including a character shown alone, from a cluster.
One or more embodiments of the present invention also provide a video data processing method, medium, and system capable of precisely segmenting video data by using a face model generated in a process of segmenting the video data.
Although a few embodiments of the present invention have been shown and described, the present invention is not limited to the described embodiments. Instead, it would be appreciated by those skilled in the art that changes may be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
Number | Date | Country | Kind |
---|---|---|---|
10-2006-0052724 | Jun 2006 | KR | national |