The present technology relates to an information processing device and an information processing method, and more particularly relates to an information processing device and the like that process information regarding moving image content.
Conventionally, various techniques of generating emotion data indicating a user emotion for each scene of moving image content on the basis of a facial image of a user, biometric information of the user, or the like have been proposed (e.g., see Patent Document 1).
An object of the present technology is to enable effective use of a user emotion for each scene of moving image content.
A concept of the present technology is directed to:
According to the present technology, the data generation unit generates the correlation data obtained by associating the user emotion with the video quality on the basis of the user emotion and the video quality for each scene of the moving image content. For example, the correlation data may include combination data of the user emotion and the video quality for each scene. In this case, since a large number of pieces of the combination data of the user emotion and the video quality are included as the correlation data, for example, it becomes possible to accurately calculate the user emotion corresponding to the video quality.
Furthermore, for example, the correlation data may include data of a regression equation calculated on the basis of the combination data of the user emotion and the video quality for each scene. In this case, since the correlation data is the data of the regression equation, it becomes possible to save the storage capacity of the database that stores the correlation data, and to easily calculate the user emotion corresponding to the video quality, for example. In this case, for example, data of a correlation coefficient may be added to the data of the regression equation. This makes it possible to determine whether or not to use the regression equation on the basis of the data of the correlation coefficient. Furthermore, for example, the data generation unit may use the user emotion for each user attribute to generate the correlation data for each user attribute. With this arrangement, it becomes possible to selectively use the correlation data of a desired attribute.
As described above, according to the present technology, the correlation data obtained by associating the user emotion with the video quality is generated on the basis of the user emotion and the video quality for each scene of the moving image content, which makes it possible to satisfactorily obtain the correlation data in which the user emotion and the video quality are associated with each other.
Furthermore, another concept of the present technology is directed to:
According to the present technology, the user emotion prediction unit predicts the user emotion for each scene of the moving image content on the basis of the video quality for each scene of the moving image content and the correlation data obtained by associating the user emotion with the video quality. For example, the user emotion prediction unit may predict the user emotion for each scene of the moving image content on the basis of the correlation data of a predetermined attribute selected from the correlation data for each user attribute. With this arrangement, the user emotion prediction unit is enabled to obtain emotion data suitable for the attribute desired by a user for use in reproduction and editing of the moving image content.
As described above, according to the present technology, the user emotion for each scene of the moving image content is predicted on the basis of the video quality for each scene and the correlation data obtained by associating the user emotion with the video quality, which makes it possible to satisfactorily predict the user emotion for each scene of the moving image content.
Note that, in the present technology, a display control unit that controls display of the predicted user emotion for each scene of the moving image content may be further included, for example. With this arrangement, the user is enabled to easily recognize the user emotion predicted for each scene of the moving image content, and to easily and effectively perform a selective reproduction operation on the moving image content and an edit operation for performing selective retrieval or video quality correction on the moving image content.
Furthermore, in the present technology, an extraction unit that extracts an emotion representative scene on the basis of the predicted user emotion for each scene of the moving image content may be further included, for example. With this arrangement, it becomes possible to effectively use the predicted user emotion for each scene of the moving image content in reproduction and editing of the moving image content.
For example, the extraction unit may extract the emotion representative scene on the basis of a type of the user emotion. Furthermore, for example, the extraction unit may extract the emotion representative scene on the basis of a degree of the user emotion. In this case, for example, the extraction unit may extract a scene in which the degree of the user emotion exceeds a threshold as the emotion representative scene. Furthermore, in this case, the extraction unit may extract the emotion representative scene on the basis of a statistical value of the degree of the user emotion of the entire moving image content, for example. Here, the statistical value may include, for example, a maximum value, a sorting result, an average value, or a standard deviation value.
Furthermore, in the present technology, a reproduction control unit that controls reproduction of the moving image content on the basis of the extracted emotion representative scene may be further included, for example. With this arrangement, the user is enabled to view only the extracted emotion representative scene or only the remaining parts excluding the extracted emotion representative scene.
Furthermore, in the present technology, an edit control unit that controls editing of the moving image content on the basis of the extracted emotion representative scene may be further included, for example. With this arrangement, the user is enabled to obtain new moving image content including only the extracted emotion representative scene or only the remaining parts excluding the extracted emotion representative scene, or the user is enabled to obtain new moving image content in which the video quality of only the extracted emotion representative scene or the remaining parts excluding the extracted emotion representative scene is corrected.
Hereinafter, a mode for carrying out the invention (hereinafter referred to as an “embodiment”) will be described. Note that the description will be given in the following order.
The present technology includes a step of generating emotion data indicating a user emotion for each scene of first moving image content (moving image content A), a step of generating correlation data obtained by associating the user emotion with video quality on the basis of the user emotion and the video quality for each scene of the first moving image content (moving image content A), and a step of predicting and using the user emotion for each scene of second moving image content (moving image content B).
The content database 101 stores a plurality of moving image content files. When a reproduction moving image file name (moving image content A) is input, the content database 101 supplies, to the content reproduction unit 102, a moving image content file including the moving image content A corresponding to the reproduction moving image file name. Here, the reproduction moving image file name is specified by a user of the information processing device 100, for example.
At a time of reproduction, the content reproduction unit 102 reproduces the moving image content A included in the moving image content file supplied from the content database 101, and displays a moving image on a display unit (not illustrated). Furthermore, at the time of reproduction, the content reproduction unit 102 supplies a frame number (time code) in synchronization with a reproduction frame to the metadata generation unit 106. The frame number is information that may identify a scene of the moving image content A.
The facial image shooting camera 103 is a camera that captures a facial image of the user who views the moving image displayed on the display unit by the content reproduction unit 102. The facial image of each frame captured by the facial image shooting camera 103 is sequentially supplied to the user emotion analysis unit 105.
The biometric information sensor 104 is a sensor attached to the user who views the moving image displayed on the display unit by the content reproduction unit 102, and obtains biometric information such as a heart rate, a respiratory rate, and a perspiration amount. The biometric information of each frame obtained by the biometric information sensor 104 is sequentially supplied to the user emotion analysis unit 105.
The user emotion analysis unit 105 analyzes a degree of a predetermined type of the user emotion for each frame on the basis of the facial image of each frame sequentially supplied from the facial image shooting camera 103 and the biometric information of each frame sequentially supplied from the biometric information sensor 104, and supplies user emotion information to the metadata generation unit 106.
Note that the type of the user emotion is not limited to secondary information obtained by analyzing the facial image and the biometric information, such as information regarding “joy”, “anger”, “sorrow”, and “pleasure”, and may be primary information, which is the biometric information itself, such as the “heart rate”, “respiratory rate”, “perspiration amount”, and the like.
The metadata generation unit 106 associates the user emotion information of each frame obtained by the user emotion analysis unit 105 with a frame number (time code), generates emotion metadata having the user emotion information of each frame of the moving image content A, and supplies the emotion metadata to the metadata database 107.
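As a reference, the following Python sketch illustrates one possible layout of such emotion metadata, in which per-frame user emotion information is keyed by frame number (time code) and tied to a moving image file name. The field names, emotion categories, and value ranges are illustrative assumptions and are not prescribed by the configuration described above.

```python
# Minimal sketch of emotion metadata keyed by frame number (time code).
# The emotion categories and value ranges are illustrative assumptions.

def build_emotion_metadata(per_frame_emotions, movie_file_name):
    """per_frame_emotions: list of dicts such as
    {"joy": 0.8, "anger": 0.0, "sorrow": 0.1, "pleasure": 0.6},
    one entry per frame, in frame order."""
    return {
        "movie_file_name": movie_file_name,   # links the metadata to its content file
        "frames": [
            {"frame": fr, **emotions}         # frame number acts as the scene identifier
            for fr, emotions in enumerate(per_frame_emotions, start=1)
        ],
    }

metadata = build_emotion_metadata(
    [{"joy": 0.8, "anger": 0.0, "sorrow": 0.1, "pleasure": 0.6},
     {"joy": 0.2, "anger": 0.1, "sorrow": 0.7, "pleasure": 0.1}],
    "content_A.mp4",
)
print(metadata["frames"][0])
```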
The metadata database 107 stores emotion metadata corresponding to a plurality of moving image content files. The metadata database 107 compiles the emotion metadata supplied from the metadata generation unit 106 into a database together with a moving image file name, that is, in association with the moving image file name so that it may be identified which moving image content file the emotion metadata corresponds to.
Here, in a case where the emotion metadata corresponding to the reproduction moving image file name (moving image content A) has not been stored yet, the emotion metadata supplied from the metadata generation unit 106 is directly stored. Furthermore, in a case where the emotion metadata corresponding to the reproduction moving image file name (moving image content A) has already been stored, the metadata database 107 performs an update with the emotion metadata supplied from the metadata generation unit 106.
Alternatively, in a case where the emotion metadata corresponding to the reproduction moving image file name (moving image content A) has already been stored, the metadata database 107 performs an update with emotion metadata obtained by combining the emotion metadata supplied from the metadata generation unit 106 with the already stored emotion metadata.
While weighted averaging is conceivable as a combining method, the method is not limited thereto, and another method may be adopted. Note that, in the case of weighted averaging, when the already stored emotion metadata relates to m users, the already stored emotion metadata and the emotion metadata supplied from the metadata generation unit 106 are weighted m:1 and averaged.
In a case of performing an update with the emotion metadata combined and obtained in this manner, the emotion metadata becomes more accurate as the number of users who view the moving image content A increases. In this case, while the emotion metadata generated by viewing of one user is metadata having the emotion information of that one user, the emotion metadata generated by viewing of a large number of users is metadata having emotion information that is statistically representative of the emotional reactions of many people.
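As a reference, the m:1 weighted averaging described above can be sketched in Python as follows. The list-of-degrees representation of the emotion metadata is an illustrative assumption.

```python
# Minimal sketch of the m:1 weighted average used to merge new emotion metadata
# (from one additional viewer) into metadata already accumulated for m viewers.
# The list-of-floats representation is an illustrative assumption.

def merge_emotion_metadata(stored, new, m):
    """stored: per-frame emotion degrees accumulated over m viewers.
    new: per-frame emotion degrees from the latest viewer.
    Returns the updated per-frame degrees representing m + 1 viewers."""
    return [(s * m + n) / (m + 1) for s, n in zip(stored, new)]

stored = [0.60, 0.40, 0.90]   # e.g. "joy" degree for frames 1..3, averaged over m viewers
new = [0.80, 0.20, 0.70]      # the same degrees measured for one new viewer
print(merge_emotion_metadata(stored, new, m=4))
```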
Note that, at the time of generating the emotion metadata, instead of updating the emotion metadata each time the moving image content is viewed by another user, it is also conceivable to obtain highly accurate emotion metadata at once by inputting the facial images and biometric information related to a plurality of users to the user emotion analysis unit 105 for analysis.
While the emotion metadata stored in the metadata database 107 and the moving image content file stored in the content database 101 are associated with each other by the moving image file name in the example illustrated in the drawing, they may be associated with each other by another method, for example, by recording link information, such as a uniform resource locator (URL) for accessing the emotion metadata stored in the metadata database 107, in the corresponding moving image content file of the content database 101.
As described above, according to the information processing device 100 illustrated in
The content database 201 corresponds to the content database 101 illustrated in
The content reproduction unit 202 reproduces the moving image content A included in the moving image content file supplied from the content database 201, and supplies video signals related to the moving image content A to the video quality analysis unit 203.
On the basis of the video signals of each frame supplied from the content reproduction unit 202, the video quality analysis unit 203 analyzes, for each frame, degrees of a hand-induced shake amount (correction remaining), a zoom speed condition, a focus deviation condition, and the like, obtains video quality data having video quality information for each frame of the moving image content A, and supplies it to the correlation data generation unit 205. Here, as the video quality information, a plurality of pieces of primary information such as the hand-induced shake amount (correction remaining), the zoom speed condition, the focus deviation condition, and the like may be used in parallel, or one piece of video quality information as secondary information obtained by integrating the plurality of pieces of primary information may be used.
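As a reference, the following Python sketch shows one hypothetical way of integrating the plurality of pieces of primary video quality information into one piece of secondary video quality information per frame. The weights, the 0-to-1 normalization, and the linear combination are illustrative assumptions, not the actual analysis performed by the video quality analysis unit 203.

```python
# Hypothetical integration of primary video quality indicators (residual camera
# shake, zoom speed, focus deviation) into one secondary quality score per frame.
# The weights and the 0..1 normalization are illustrative assumptions.

def integrate_quality(shake, zoom_speed, focus_deviation,
                      weights=(0.5, 0.2, 0.3)):
    """Each input is a per-frame degradation degree normalized to 0 (good) .. 1 (bad).
    Returns a per-frame quality score where higher means better quality."""
    w_shake, w_zoom, w_focus = weights
    degradation = w_shake * shake + w_zoom * zoom_speed + w_focus * focus_deviation
    return 1.0 - degradation

# Quality for one frame with moderate shake, fast zoom, slight focus error.
print(integrate_quality(shake=0.4, zoom_speed=0.7, focus_deviation=0.1))
```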
For example, although detailed description is omitted, the video quality analysis unit 203 uses well-known machine learning or artificial intelligence (AI) technology to determine the video quality of each frame of the content to be evaluated in advance. Note that some quality-dependent evaluation value can also be calculated with a simple filter configuration, without using machine learning or AI technology.

The metadata database 204 corresponds to the metadata database 107 illustrated in
When the reproduction moving image file name (moving image content A) same as that input to the content database 201 is input, the metadata database 204 supplies, to the correlation data generation unit 205, emotion metadata having the user emotion information for each frame of the moving image content A, which is associated with the moving image content file supplied from the content database 201 to the content reproduction unit 202.
On the basis of the video quality data supplied from the video quality analysis unit 203 and the emotion metadata supplied from the metadata database 204, that is, on the basis of the user emotion and the video quality for each frame of the moving image content A, the correlation data generation unit 205 generates correlation data in which the user emotion and the video quality are associated with each other, and supplies the correlation data to the metadata database 206.
The correlation data includes, for example, combination data of the user emotion and the video quality for each frame.
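As a reference, the following Python sketch shows one possible form of this combination data, with one record per frame pairing the video quality with the user emotion. The record layout and field names are illustrative assumptions.

```python
# Minimal sketch of the per-frame combination data that forms the correlation data.
# Structure and field names are illustrative assumptions.

def build_combination_data(video_quality, emotion_degree):
    """video_quality, emotion_degree: per-frame values for moving image content A,
    aligned by frame number. Returns one record per frame."""
    return [
        {"frame": fr, "quality": q, "emotion": e}
        for fr, (q, e) in enumerate(zip(video_quality, emotion_degree), start=1)
    ]

pairs = build_combination_data([0.9, 0.3, 0.6], [0.8, 0.2, 0.5])
print(pairs)
```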
Note that, while the exemplary case where both the video quality information and the user emotion information are primary information, or both are secondary information, has been described above, the two are not limited to such matched sets, and a hybrid combination of primary and secondary information may also be used.
The exemplary case where the correlation data includes the combination data of the user emotion and the video quality for each frame has been described above. In this case, since a large number of pieces of the combination data of the user emotion and the video quality are included as the correlation data, for example, it becomes possible to accurately calculate the user emotion corresponding to the video quality.
However, it is also conceivable that the correlation data is data of a regression equation calculated on the basis of the combination data of the user emotion and the video quality for each frame. For example,
By using the correlation data as the data of the regression equation in this manner, it becomes possible to save the storage capacity of the database that stores the correlation data, and to easily calculate the user emotion corresponding to the video quality, for example. Furthermore, by adding the data of the correlation coefficient to the data of the regression equation, it becomes possible to easily and appropriately determine whether or not to use the regression equation.
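As a reference, the following Python sketch (using NumPy) fits a regression equation to the combination data, computes the correlation coefficient, and uses that coefficient to decide whether the regression equation should be used. The linear model and the 0.7 usability threshold are illustrative assumptions.

```python
# Sketch of replacing the raw combination data with a regression equation plus a
# correlation coefficient. The linear model and the 0.7 threshold are assumptions.
import numpy as np

def fit_correlation_data(quality, emotion):
    """quality, emotion: per-frame arrays for moving image content A."""
    slope, intercept = np.polyfit(quality, emotion, deg=1)   # regression equation
    r = np.corrcoef(quality, emotion)[0, 1]                  # correlation coefficient
    return {"slope": slope, "intercept": intercept, "r": r}

quality = np.array([0.2, 0.4, 0.6, 0.8, 0.9])
emotion = np.array([0.1, 0.35, 0.55, 0.75, 0.9])
model = fit_correlation_data(quality, emotion)

# Whether the regression equation is used can be decided from the coefficient.
if abs(model["r"]) >= 0.7:          # assumed usability threshold
    print("use regression:", model)
else:
    print("fall back to per-frame combination data")
```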
Returning to
As described above, according to the information processing device 200 illustrated in
The content database 301 stores a plurality of moving image content files. When a reproduction moving image file name (moving image content B) is input, the content database 301 supplies a moving image content file corresponding to the reproduction moving image file name to the content reproduction unit 302 and the content reproduction/editing unit 306. Here, the reproduction moving image file name is specified by a user of the information processing device 300, for example.
The content reproduction unit 302 reproduces the moving image content B included in the moving image content file supplied from the content database 301, and supplies video signals related to the moving image content B to the video quality analysis unit 303.
The video quality analysis unit 303 is configured in a similar manner to the video quality analysis unit 203 illustrated in
The metadata database 304 corresponds to the metadata database 206 illustrated in
The user emotion prediction unit 305 predicts a user emotion for each frame of the moving image content B on the basis of the video quality for each frame of the moving image content B and the correlation data corresponding to the moving image content A, in which the user emotion and the video quality are associated with each other, obtains emotion data having user emotion information for each frame of the moving image content B, and supplies it to the content reproduction/editing unit 306.
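As a reference, the following Python sketch shows how the user emotion prediction unit 305 might apply correlation data in regression form to the per-frame video quality of the moving image content B. The linear model and its coefficients are illustrative assumptions.

```python
# Sketch of predicting the per-frame user emotion of content B from its video
# quality and correlation data in regression form. Coefficients are assumptions.

def predict_emotion(video_quality_b, model):
    """video_quality_b: per-frame quality scores for moving image content B.
    model: {"slope": ..., "intercept": ...} taken from the correlation data."""
    return [model["slope"] * q + model["intercept"] for q in video_quality_b]

model = {"slope": 1.05, "intercept": -0.05}
emotion_data_b = predict_emotion([0.9, 0.3, 0.6], model)
print(emotion_data_b)   # predicted emotion degree for each frame of content B
```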
In the content reproduction/editing unit 306, a control unit (not illustrated) performs control to selectively reproduce a part of the moving image content B, or performs edit control to selectively retrieve a part of the moving image content B included in the moving image content file or selectively correct the video quality of a part of the moving image content B to generate new moving image content C, in response to a user operation.
As described above, the emotion data obtained by the user emotion prediction unit 305 has the user emotion information for each frame of the moving image content B, and indicates what kind of emotion the viewer has with respect to each frame of the moving image content B. In the content reproduction/editing unit 306, for example, a control unit (not illustrated) performs control to display a user interface (UI) indicating user emotion information for each frame of the moving image content B on the basis of the emotion data, and supports the user in performing a selective reproduction operation on the moving image content B, performing an edit operation for generating the new moving image content C by performing selective retrieval or video quality correction on the moving image content B, and the like.
As described above, according to the information processing device 300 illustrated in
Furthermore, according to the information processing device 300 illustrated in
Note that, according to the information processing device 300 illustrated in
The information processing device 300A includes the content database (content DB) 301, the content reproduction unit 302, the video quality analysis unit 303, the metadata database (metadata DB) 304, the user emotion prediction unit 305, an emotion representative scene extraction unit 311, and a content reproduction/editing unit 312.
When a reproduction moving image file name (moving image content B) is input, the content database 301 supplies a moving image content file corresponding to the reproduction moving image file name to the content reproduction unit 302 and the content reproduction/editing unit 312. When a reproduction moving image file name (moving image content A) is input, the metadata database 304 supplies the correlation data corresponding to the moving image content A to the user emotion prediction unit 305.
The content reproduction unit 302 reproduces the moving image content B included in the moving image content file supplied from the content database 301, and supplies video signals related to the moving image content B to the video quality analysis unit 303. On the basis of the video signals of each frame supplied from the content reproduction unit 302, the video quality analysis unit 303 analyzes, for each frame, degrees of a hand-induced shake amount (correction remaining), a zoom speed condition, a focus deviation condition, and the like, obtains video quality data having video quality information for each frame of the moving image content B, and supplies it to the user emotion prediction unit 305.
The user emotion prediction unit 305 predicts a user emotion for each frame of the moving image content B on the basis of the video quality for each frame of the moving image content B and the correlation data corresponding to the moving image content A, in which the user emotion and the video quality are associated with each other, obtains emotion data having user emotion information for each frame of the moving image content B, and supplies it to the emotion representative scene extraction unit 311.
The emotion representative scene extraction unit 311 extracts an emotion representative scene on the basis of the emotion data supplied from the user emotion prediction unit 305.
For example, the emotion representative scene extraction unit 311 extracts an emotion representative scene on the basis of a type of the user emotion. In this case, for example, in a case where the emotion metadata includes information regarding “joy”, “anger”, “sorrow”, and “pleasure” as the user emotion information for each frame of the moving image content, one of those emotions is selected, and a scene in which a degree (level) of the emotion is equal to or higher than a threshold is extracted as the emotion representative scene. Here, the selection of the emotion and the setting of the threshold may be optionally performed by a user operation, for example.
Furthermore, the emotion representative scene extraction unit 311 extracts the emotion representative scene on the basis of a degree of the user emotion, for example. In this case, it is conceivable to (1) extract a scene in which a degree of the user emotion exceeds a threshold as the emotion representative scene, or (2) extract a scene as the emotion representative scene on the basis of a statistical value of the degree of the user emotion of the entire moving image content.
First, (1) a case where a scene in which a degree of the user emotion exceeds a threshold is extracted as the emotion representative scene will be described. In this case, for example, in a case where the emotion metadata includes information regarding “joy”, “anger”, “sorrow”, and “pleasure” as the user emotion information for each frame of the moving image content, a scene in which a degree (level) of the emotion is equal to or higher than a threshold is extracted as the emotion representative scene in each of the emotions. Here, the threshold may be optionally set by a user operation, for example.
A flowchart of
First, the emotion representative scene extraction unit 311 starts a process in step ST1. Next, the emotion representative scene extraction unit 311 initializes the frame number fr=1, and n=1 in step ST2.
Next, the emotion representative scene extraction unit 311 determines whether or not the degree Em(fr) is higher than the threshold th in step ST3. When Em(fr) > th is satisfied, the emotion representative scene extraction unit 311 stores the emotion representative scene information, that is, stores the frame number fr as the emotion representative scene L(n) in step ST4. Furthermore, the emotion representative scene extraction unit 311 increments n to n+1 in step ST4.
Next, the emotion representative scene extraction unit 311 updates the frame number fr to fr=fr+1 in step ST5. The frame number fr is updated in step ST5 in a similar manner when Em(fr)>th is not satisfied in step ST3.
Next, in step ST6, the emotion representative scene extraction unit 311 determines whether or not the frame number fr is larger than the final frame number fr_end, that is, performs end determination. When fr>fr_end is not satisfied, the emotion representative scene extraction unit 311 returns to the processing of step ST3, and repeats the process in a similar manner to the process described above. On the other hand, when fr>fr_end is satisfied, the emotion representative scene extraction unit 311 terminates the process in step ST7.
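As a reference, the loop of steps ST1 to ST7 can be summarized by the following Python sketch, assuming the per-frame degrees Em(fr) are held in a list with frame numbers starting at 1.

```python
# Sketch of the threshold-based extraction loop (steps ST1-ST7). Frame numbers
# start at 1; Em is assumed to be a list of per-frame emotion degrees.

def extract_scenes_above_threshold(Em, th):
    L = []                                  # emotion representative scenes L(1), L(2), ...
    fr = 1                                  # ST2: initialize frame number
    while fr <= len(Em):                    # ST6: end determination (fr > fr_end stops)
        if Em[fr - 1] > th:                 # ST3: compare degree with threshold
            L.append(fr)                    # ST4: store frame number as L(n), n += 1
        fr += 1                             # ST5: advance to the next frame
    return L

print(extract_scenes_above_threshold([0.1, 0.8, 0.4, 0.9], th=0.5))   # [2, 4]
```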
Next, (2) a case where the emotion representative scene is extracted on the basis of the statistical value of the degree of the user emotion of the entire moving image content will be described. The statistical value in this case is a maximum value, a sorting result, an average value, a standard deviation value, or the like.
When the statistical value is a maximum value, for example, in a case where the emotion metadata includes information regarding “joy”, “anger”, “sorrow”, and “pleasure” as the user emotion information for each frame of the moving image content, a scene in which a degree (level) of the emotion is the maximum value is extracted as the emotion representative scene in each of the emotions.
Furthermore, when the statistical value is a sorting result, for example, in a case where the emotion metadata includes information regarding “joy”, “anger”, “sorrow”, and “pleasure” as the user emotion information for each frame of the moving image content, not only a scene in which a degree (level) of the emotion is the maximum value but also scenes ranked second and third in the degree are extracted as the emotion representative scene in each of the emotions.
Furthermore, when the statistical value is an average value or a standard deviation value, for example, in a case where the emotion metadata includes information regarding “joy”, “anger”, “sorrow”, and “pleasure” as the user emotion information for each frame of the moving image content, a scene in which the degree (level) of the emotion deviates greatly from the average (e.g., by three times the standard deviation) is extracted as the emotion representative scene in each of the emotions.
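As a reference, the sorting-based case can be sketched in Python as follows, where the frames ranked first, second, and third in the degree of the emotion are extracted. The top-3 count is an illustrative assumption.

```python
# Sketch of the sorting-based extraction: frames ranked first, second, and third
# in emotion degree are taken as emotion representative scenes. The top-3 count
# is an illustrative assumption.

def extract_top_scenes(Em, k=3):
    """Em: per-frame emotion degrees, frame numbers starting at 1.
    Returns the frame numbers of the k highest-degree scenes."""
    ranked = sorted(range(1, len(Em) + 1), key=lambda fr: Em[fr - 1], reverse=True)
    return ranked[:k]

print(extract_top_scenes([0.1, 0.8, 0.4, 0.9, 0.3]))   # [4, 2, 3]
```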
A flowchart of
First, the emotion representative scene extraction unit 311 starts a process in step ST11. Next, the emotion representative scene extraction unit 311 initializes the frame number fr=1 and the maximum value em_max=0 in step ST12.
Next, the emotion representative scene extraction unit 311 determines whether or not the degree Em(fr) is higher than the maximum value em_max in step ST13. When Em(fr) > em_max is satisfied, the emotion representative scene extraction unit 311 stores the emotion representative scene information, that is, stores the frame number fr as the emotion representative scene L in step ST14. Furthermore, the emotion representative scene extraction unit 311 updates em_max to Em(fr) in step ST14.
Next, the emotion representative scene extraction unit 311 updates the frame number fr to fr=fr+1 in step ST15. The frame number fr is updated in step ST15 in a similar manner when Em(fr)>em_max is not satisfied in step ST13.
Next, in step ST16, the emotion representative scene extraction unit 311 determines whether or not the frame number fr is larger than the final frame number fr_end, that is, performs end determination. When fr>fr_end is not satisfied, the emotion representative scene extraction unit 311 returns to the processing of step ST13, and repeats the process in a similar manner to the process described above. On the other hand, when fr>fr_end is satisfied, the emotion representative scene extraction unit 311 terminates the process in step ST17.
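As a reference, the loop of steps ST11 to ST17 can be summarized by the following Python sketch, again assuming the per-frame degrees Em(fr) are held in a list with frame numbers starting at 1.

```python
# Sketch of the maximum-value extraction loop (steps ST11-ST17). Em is assumed
# to be a list of per-frame emotion degrees, with frame numbers starting at 1.

def extract_max_scene(Em):
    L = None                                # frame number of the emotion representative scene
    em_max = 0.0                            # ST12: initialize maximum value
    fr = 1                                  # ST12: initialize frame number
    while fr <= len(Em):                    # ST16: end determination (fr > fr_end stops)
        if Em[fr - 1] > em_max:             # ST13: compare degree with current maximum
            L = fr                          # ST14: store frame number as L
            em_max = Em[fr - 1]             # ST14: update the maximum value
        fr += 1                             # ST15: advance to the next frame
    return L

print(extract_max_scene([0.1, 0.8, 0.4, 0.9]))   # 4
```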
Returning to
Furthermore, in the content reproduction/editing unit 312, a control unit (not illustrated) performs control to selectively extract a part of the moving image content B included in the moving image content file supplied from the content database 301 to generate new moving image content C on the basis of the emotion representative scene information supplied from the emotion representative scene extraction unit 311. In this case, for example, only the emotion representative scene may be extracted or other parts excluding the emotion representative scene may be extracted according to the user setting.
Furthermore, in the content reproduction/editing unit 312, a control unit (not illustrated) performs control to selectively correct the video quality of a part of the moving image content B included in the moving image content file supplied from the content database 301 to generate new moving image content C on the basis of the emotion representative scene information supplied from the emotion representative scene extraction unit 311.
Note that the content reproduction/editing unit 312 may use not only the emotion representative scene information supplied from the emotion representative scene extraction unit 311 but also other evaluation values conventionally used. Alternatively, as illustrated by a broken line in
As described above, according to the information processing device 300A illustrated in
For example, when the creator creates new moving image content C from the moving image content B, it becomes possible to automatically perform editing work based on scenes for which the viewer is likely to show a like or dislike, identified in advance. That is, the creator can perform editing work based on this index, which assists in creating high-quality moving image content C.
Note that, although not described above, it is also conceivable to adopt a configuration in which the information processing device 100 (see
Furthermore, it has been described that the moving image content A is one content in the embodiment above. However, the moving image content A may be a plurality of pieces of content. In that case, in the information processing device 200 in
Furthermore, an exemplary case where each scene is configured by one frame has been described in the embodiment above. However, each scene may be configured by a plurality of frames.
Furthermore, while the preferred embodiment of the present disclosure has been described in detail with reference to the accompanying drawings, the technical scope of the present disclosure is not limited to such examples. It is apparent that a person having ordinary knowledge in the technical field of the present disclosure may conceive various changes or modifications within the scope of the technical idea recited in the claims, and it is naturally understood that they also belong to the technical scope of the present disclosure.
Furthermore, the effects described in the present specification are merely exemplary or illustrative, and are not restrictive. That is, the technology according to the present disclosure may exert other effects apparent to those skilled in the art from the description of the present specification in addition to or instead of the effects described above.
Furthermore, the present technology may also have the following configurations.
(1) An information processing device including:
(2) The information processing device according to (1) described above, in which
(3) The information processing device according to (1) described above, in which
(4) The information processing device according to (3) described above, in which
(5) The information processing device according to any one of (1) to (4) described above, in which
(6) An information processing method including:
(7) An information processing device including:
(8) The information processing device according to (7) described above, further including:
(9) The information processing device according to (7) described above, further including:
(10) The information processing device according to (9) described above, in which
(11) The information processing device according to (9) described above, in which
(12) The information processing device according to (11) described above, in which
(13) The information processing device according to (11) described above, in which
(14) The information processing device according to (13) described above, in which
(15) The information processing device according to any one of (7) to (14) described above, in which
(16) The information processing device according to any one of (7) to (15) described above, further including:
(17) The information processing device according to any one of (7) to (16) described above, further including:
(18) An information processing method including:
Number | Date | Country | Kind
---|---|---|---
2021-153886 | Sep 2021 | JP | national

Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/JP2022/012474 | 3/17/2022 | WO |