This application is based upon and claims the benefit of priority from Japanese Patent Application No. 2022-119678, filed on Jul. 27, 2022, the disclosure of which is incorporated herein in its entirety by reference.
The present disclosure relates to a video search apparatus, a video search method, and a storage medium.
Patent Literature 1 discusses one example of a video search system that searches a video image for a scene constituted by a plurality of events. In this video search system, the video image targeted for the search is first indexed with structural indexes, which are assigned by dividing the video image into structural units prepared as a plurality of hierarchized sections; these structural units serve as the search granularity, i.e., the range examined when a scene is searched for. The video image is further indexed with event indexes, which indicate events that have occurred in the video image. For example, in the case of a video image of a baseball game, the video image from the start of the game is divided into structural units hierarchized in the ranks of “inning”, “top/bottom”, “batting”, and “pitching”, and these structural units are assigned as the structural indexes. Event indexes identifying the contents of events are then assigned to scenes such as a “hit” and “run scoring” in the video image. In addition, in the video search system discussed in Patent Literature 1, for each video scene, a term expressing the content of the video scene, a state transition pattern constituted by a pattern of occurrence of a plurality of events corresponding to the video scene, and the search granularity indicating the structural unit targeted for the search when the video scene is searched for are set in association with one another in advance. For example, in the case of the video image of the baseball game, a pattern of occurrence of the events “hit” and “run scoring” is set as the state transition pattern, and “batting” is set as the structural unit used as the search granularity, for the video scene “run-scoring hit”.
Then, in the video search system discussed in Patent Literature 1, when a search is conducted, a term expressing a desired video scene is input, and the state transition pattern, i.e., the pattern of occurrence of events corresponding to the input term, is searched for unit by unit according to the search granularity, i.e., the structural unit corresponding to the input term, on the assumption that the video image is structured in the above-described manner. The structural unit of the video image in which the state transition pattern is found is then output as a search result. This means that, for example, when the video scene “run-scoring hit” is searched for in the video image of the baseball game as described above, the pattern of occurrence of the events “hit” and “run scoring” is searched for in the video image indexed with the structural indexes, unit by unit according to the search granularity corresponding to the video scene “run-scoring hit”, i.e., the structural unit “batting”. If a hit event followed by a run-scoring event is found in some batting in the video image, the video image of this batting is output as the desired video scene “run-scoring hit”.
However, the above-described technique discussed in Patent Literature 1 assumes that the video image targeted for the search is indexed with the above-described structural indexes in advance, so that the search range in the entire video image can be confined to a predetermined structural unit. This technique therefore has the problem that indexing the video image with the structural indexes in advance takes labor and cost.
On the other hand, if no structural indexes are assigned to the video image, the section of the desired video scene cannot be appropriately determined. For example, when the video scene “run-scoring hit” is searched for in the baseball video image as described above, information specifying when one inning or one batting turn starts and ends cannot be acquired from the video image, and an appropriate search range cannot be set. As a result, for example, a combination of a hit event in the top of the first inning and a run-scoring event in the bottom of the third inning may inadvertently be determined to be the desired video scene “run-scoring hit”, and the video section from the hit in the top of the first inning to the run scoring in the bottom of the third inning may incorrectly be determined to be one video scene. In other words, although a “run-scoring hit” never extends across innings, such an inappropriate video scene may undesirably be retrieved, and the section of the desired video scene cannot be appropriately determined.
In light thereof, an object of the present disclosure is to provide a video search apparatus capable of solving the above-described problem, namely, the incapability of appropriately determining a section of a desired video scene from a video image.
A video search apparatus according to one aspect of the present disclosure is configured to include
Further, a video search method according to one aspect of the present disclosure is configured to include
Further, a program according to one aspect of the present disclosure is configured to cause a computer to execute processing to
By being configured in the above-described manner, the present disclosure allows the video section of the desired video scene to be appropriately determined from the video image.
A first exemplary embodiment of the present disclosure will be described with reference to
A video search apparatus 10 according to the present exemplary embodiment is configured to detect a section corresponding to a desired video scene constituted by a plurality of events from a target video image, and to determine and output this video section. As one example, the present exemplary embodiment will be described referring to an example in which the video search apparatus 10 sets a video image of a baseball game as the target video image, searches this target video image for a video scene such as a “run-scoring hit” (a batter makes a hit and a run is scored), and determines and outputs the video section thereof. Further, in the present exemplary embodiment, the video search apparatus 10 also calculates the reliability of the determined video section, i.e., a degree indicating the credibility that the determined video section actually shows the “run-scoring hit”. However, the target video image handled by the video search apparatus 10 may be a video image having any content without being limited to a baseball game, and the searched video scene may also be a video scene having any content.
The video search apparatus 10 is constituted by one or a plurality of information processing apparatuses each including an arithmetic unit and a storage unit. Then, as illustrated in
The video storage unit 16 stores therein the target video image, which is the target to be searched for a video scene. For example, the target video image is a video image of a baseball game as described above, captured in a baseball field and edited. The target video image is assumed to be indexed in advance with event indexes, which are information indicating events that have occurred in the video image. As used herein, the “event” may be, for example, something expressing a person or an object appearing in the video image (for example, a “player”, a “spectator”, and a “scoreboard” in the case of the baseball video image), or something expressing an action of a person or an object (for example, “throw a ball”, “run”, and “swing a bat”). Alternatively, the “event” may be, for example, a motion of the camera that captures the video image (for example, “pan” and “tilt”) or a feature of the edited video image (for example, a “caption indicating the score” and “the camera is switched”).
In this case, the event index is assigned to a predetermined time in the target video image, with the “time information” associated with the “event”. The event index is thus assumed to have no temporal duration. Note that the “time information” may be any information indicating a position in the temporal direction in the target video image, such as a clock time itself, a time elapsed from the beginning of the video image, or a frame number.
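As a concrete illustration of this data structure, the following is a minimal Python sketch that models an event index as an event label paired with a single point in time. The class name, field names, and sample data are hypothetical assumptions made for illustration and are not part of the disclosure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EventIndex:
    """One event index: an event label tied to a single point in time.

    `time_sec` is the elapsed time in seconds from the beginning of the
    video image; a clock time or a frame number would serve equally well
    as the time information.
    """
    event: str
    time_sec: float

# A target video image indexed in advance (hypothetical sample data).
indexed_video = [
    EventIndex("pitching", 12.0),
    EventIndex("batting", 14.5),
    EventIndex("arrival of a batter at the first base", 19.0),
]
```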
The scene condition storage unit 17 stores therein a scene condition table (scene information), which is information used when a video scene is searched for. The scene condition table is information that defines the video scene, and includes a “scene name” and “scene conditions” as mainly illustrated in
Then, the “scene name” may be a name according to the content of the video scene (“hit” or “strikeout” in the case of the baseball video image), or may be a number or a letter for identifying the scene. Note that a “query”, which is information for specifying the video scene to be searched for, is input by a searcher who requests the search for the video scene, and the “scene name” is constituted by information linkable with the “query”, as will be described below. This allows the scene condition table having the corresponding “scene name” to be identified based on the “query” input by the searcher as a search request at the time of the search.
Further, the “scene conditions” include “event information” including the events constituting the video scene, and “time information” indicating a time set for the video scene. In particular, in the present exemplary embodiment, the “event information” in the scene conditions is expressed as a series of a plurality of events for which an occurrence order is set. Further, in the present exemplary embodiment, the “time information” in the scene conditions indicates a time interval between the series of events for which the occurrence order is set, and is expressed, as one example, as a time interval from some specific event (a key event) to each of the other events, or as a time interval between events adjacent to each other in the occurrence order.
Now, specific examples of the above-described “scene conditions” will be described. In a case where the video scene is a “hit” in baseball, such a video scene is characterized by a sequence of three events “pitching”, “batting”, and “arrival of a batter at the first base”, and therefore the three events “pitching”, “batting”, and “arrival of a batter at the first base” are set as an event sequence indicating that these events occur in this order as the “event information” in the “scene conditions” of the scene name “hit”. Then, respective time intervals from the event “pitching” serving as the starting point to the other events “batting” and “arrival of a batter at the first base” are set as the “time information” in the “scene conditions” by way of example. In this case, the event “pitching” serving as the starting point will be referred to as the “key event”, and the other events “batting” and “arrival of a batter at the first base” will be referred to as “auxiliary events”. Alternatively, as another example of the “time information” in the “scene conditions”, each time interval between events adjacent to each other in the occurrence order may be set, and, in this case, the time interval between the event “pitching” and the event “batting” and the time interval between the event “batting” and the event “arrival of a batter at the first base” may be set.
However, the “time information” included in the “scene conditions” is not necessarily limited to a time interval between events as in the above-described examples, and may be any information indicating a time settable for the video scene. For example, the “time information” may be the time length of the entire sequence that can contain the plurality of events constituting the video scene, or may be a time ratio to the entire target video image. Further, the “time interval” and the “time information” may be any information indicating a time range in the temporal direction in the video image, such as a time length or a frame duration.
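To make the structure of the above “scene conditions” concrete, the following sketch encodes the “hit” example as a Python dictionary, with each time interval measured from the key event “pitching”. The field names and interval values are assumptions made for illustration; the disclosure does not prescribe a storage format.

```python
# Hypothetical encoding of one entry of the scene condition table.
# Each auxiliary event carries the time interval (in seconds) set with
# respect to the key event, and the list order is the occurrence order.
SCENE_CONDITIONS = {
    "hit": {
        "key_event": "pitching",
        "auxiliary_events": [
            {"event": "batting", "interval_sec": 5.0},
            {"event": "arrival of a batter at the first base",
             "interval_sec": 15.0},
        ],
    },
}
```

With a table of this shape, the lookup performed by the acquisition unit described below reduces to indexing the dictionary by the query, e.g., `SCENE_CONDITIONS["hit"]`.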
Specific examples of the data structure of the above-described scene condition table will be described with reference to
On the other hand, for the “scene 1” in a scene condition table indicated by a reference sign T2 in
On the other hand, for the “scene 1” in a scene condition table indicated by a reference sign T3 in
On the other hand, for the “scene 1” in a scene condition table indicated by a reference sign T4 in
In this manner, the video search apparatus 10 stores the target video image and the scene condition table therein in advance. Then, the video search apparatus 10 searches the target video image for the requested video scene using the functions of the configuration described below.
The acquisition unit 11 receives an input of the query, which is information about the video scene requested to be searched for, and acquires the scene condition table corresponding to the query from the scene condition storage unit 17. The query is assumed to be, for example, information constituted by the scene name or an identification number for identifying the scene, and to correspond to the “scene name” in the scene condition table. As one example, when receiving an input of a query “hit”, the acquisition unit 11 acquires the scene condition table of the scene name “hit” corresponding to the query from the scene condition storage unit 17.
The search unit 12 uses the scene condition table acquired by the acquisition unit 11 to search the target video image stored in the video storage unit 16 for the events contained in this scene condition table. At this time, the search unit 12 sets a search range for each event in the target video image using the time information contained in the scene condition table, such as the time interval between events, and searches for the events according to the set occurrence order within this search range. In other words, the search unit 12 first searches the target video image for a predetermined event among the plurality of events contained in the scene condition table, then sets, as the search range in the target video image, the time interval set between the found predetermined event and another event, and searches the target video image for the other event within this search range. If the time interval set in the scene condition table is the time interval in relation to the key event, the search unit 12, after finding the key event, sets the search range from the key event based on the time interval set with respect to each of the other events and searches for each of those events within its search range. Alternatively, if the time interval set in the scene condition table is the time interval between events adjacent to each other in the occurrence order, the search unit 12 repeats, every time an event is found, setting the search range based on the time interval set with respect to the event preceding or following the found event in the order and searching for that event within the search range.
This means that, for example, when the query is “hit” as described above, the search unit 12 acquires the scene condition table of the scene name “hit” and searches for the three events “pitching”, “batting”, and “arrival of a batter at the first base” set in the scene conditions, according to their occurrence order. Suppose that the event “pitching” is set as the key event, and that time intervals from the key event are set for the other events “batting” and “arrival of a batter at the first base”, respectively. In this case, the search unit 12 first searches the target video image for the key event “pitching”. Then, the search unit 12 searches for the event “batting”, which is next in the occurrence order, within the search range defined by the time interval set for it, measured from the time of the found key event “pitching”. Further, the search unit 12 searches for the event “arrival of a batter at the first base”, which follows in the occurrence order, within the search range defined by the time interval set for it, likewise measured from the time of the key event “pitching”.
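Continuing the hypothetical `EventIndex` and `SCENE_CONDITIONS` sketches above, the following shows one way this key-event-anchored search could be realized, covering only the basic case in which the auxiliary events follow the key event. The function names and the choice of taking the earliest hit within each range are assumptions, not the disclosed implementation.

```python
def find_events(indexed_video, label):
    """Return the times of every event index whose label matches."""
    return [e.time_sec for e in indexed_video if e.event == label]

def search_scene(indexed_video, conditions):
    """For each occurrence of the key event, search for each auxiliary
    event within the search range derived from its set time interval.

    Returns a list of (key_time, {aux_label: found_time_or_None}).
    """
    results = []
    for key_time in find_events(indexed_video, conditions["key_event"]):
        found = {}
        for aux in conditions["auxiliary_events"]:
            # Search range: from the key event up to the set interval.
            lo, hi = key_time, key_time + aux["interval_sec"]
            hits = [t for t in find_events(indexed_video, aux["event"])
                    if lo <= t <= hi]
            # Take the earliest occurrence within the range, if any.
            found[aux["event"]] = min(hits) if hits else None
        results.append((key_time, found))
    return results
```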
Now, specifically how the search unit 12 performs the processing for searching for the event in the scene condition table like the examples illustrated in
Next, the processing by the search unit 12 in the case of the scene condition table of the “scene 1” illustrated in the reference sign T1 in
Note that the search unit 12 may set the search range B of the further subsequent auxiliary event “event 3” and search for the “event 3” as illustrated in
On the other hand, if the “margin” is set as the time interval as illustrated in the reference sign T3 in
Further, if an auxiliary event is present prior to the key event like scenes 2 and 3 in the reference sign T1 in
On the other hand, for the “scene 3” in the reference sign T1 in
On the other hand, if the key event is not set in the scene condition table, the search unit 12 may set the key event based on the “weight” set as illustrated in the reference sign T4 in
On the other hand, if just a time length is set as the time information in the scene condition table, instead of the time interval between events as in the above-described examples, the search unit 12 may set this time length as the search range and search for the events. For example, if the time length of the entire event sequence is set as the time information in the scene condition table, the search unit 12 may set a range of the set time length from the key event or an arbitrary event as the search range and search for all the events within it. Alternatively, if the ratio of the time of the entire event sequence to the entire target video image is set as the time information in the scene condition table, the search unit 12 may convert this ratio into a time length based on the length of the target video image, set a range of that time length from the key event or an arbitrary event as the search range, and search for all the events within it.
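These two variants could be sketched as below; converting the ratio into a concrete length against the total video duration is an assumption made for illustration.

```python
def range_from_length(anchor_time, time_length_sec):
    """Search range of the set time length starting at the anchor event."""
    return (anchor_time, anchor_time + time_length_sec)

def range_from_ratio(anchor_time, ratio, video_duration_sec):
    """Convert a ratio of the entire video into a time length, then take
    that length from the anchor event as the search range."""
    return range_from_length(anchor_time, ratio * video_duration_sec)

# e.g., an event sequence allowed to span 2% of a 3-hour broadcast:
lo, hi = range_from_ratio(12.0, 0.02, 3 * 3600)
```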
The determination unit 13 determines the video section in the target video image that corresponds to the scene condition table targeted for the search, based on the result of the event search by the search unit 12. For example, if the video scene is constituted by the event sequence having the occurrence order “event 1, event 2, event 3”, and all of these events are found, the determination unit 13 determines and outputs the video section from the position of the first event 1 to the position of the last event 3 in the target video image as the searched video scene. At this time, the determination unit 13 may identify and output the times or the frame numbers of the beginning and the end of this section in the target video image, or may identify and output the video data itself corresponding to this section as the video section.
However, the determination unit 13 may determine the video section based not only on the found events but also on the time information in the scene condition table, such as the time interval between events. For example, supposing that the “event 1” and the “event 2” are found but the “event 3” is not, as illustrated in the reference sign E2 in
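Continuing the earlier sketches, the section determination could look like the following. The fallback of extending the section by the interval set for a missing final event follows the behavior described above, while the function shape itself is an assumption.

```python
def determine_section(key_time, found, conditions):
    """Determine the video section for one key-event occurrence: span
    from the earliest to the latest found event; if the final event is
    missing, extend the end by the time interval set for it instead."""
    times = [key_time] + [t for t in found.values() if t is not None]
    last_aux = conditions["auxiliary_events"][-1]
    if found.get(last_aux["event"]) is None:
        # Final event not found: fall back to its set time interval.
        end = key_time + last_aux["interval_sec"]
    else:
        end = max(times)
    return (min(times), end)
```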
Moreover, the determination unit 13 further calculates the reliability of the determined video section and outputs it together with the video section. More specifically, the determination unit 13 calculates the reliability based on the events set in the scene condition table targeted for the search and the events actually found. Here, the reliability is defined as a value reflecting the credibility that the video scene constituted by the successfully found events is the requested desired video scene, with a higher value indicating higher credibility. For example, if N events are set in the scene condition table and n of them are found in the target video image, the reliability may be calculated as n/N. In other words, the reliability may be calculated such that its value increases as more of the events set in the scene condition table are found. Alternatively, for example, if a weight is written for each of the events set in the scene condition table, the weights of the events successfully found in the target video image may be summed to calculate the reliability. However, the determination unit 13 does not necessarily have to calculate the reliability.
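Both reliability variants fit in a few lines. This sketch assumes the `found` mapping produced by `search_scene` above and, as a simplifying assumption, counts only the auxiliary events (the key event is found by construction).

```python
def reliability(found, weights=None):
    """n/N reliability: the fraction of the set events that were found.
    If per-event weights are given, sum the weights of the found events
    instead."""
    if weights is None:
        n = sum(1 for t in found.values() if t is not None)
        return n / len(found)
    return sum(w for ev, w in weights.items()
               if found.get(ev) is not None)
```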
Next, the operation of the above-described video search apparatus 10 will be described mainly with reference to a flowchart illustrated in
The video search apparatus 10 receives the input of the query as the information about the video scene requested to be searched for from a not-illustrated input device (step S1). Note that the video search apparatus 10 may acquire the target video image together with the query. Then, the video search apparatus 10 acquires the scene condition table corresponding to the query for the video scene from the scene condition storage unit 17 (step S2).
Subsequently, the video search apparatus 10 searches the target video image for the key event, either preset or determined by any method, in the event sequence contained in the acquired scene condition table (step S3). At this time, if the key event cannot be found in the target video image (step S3: NO), the video search apparatus 10 outputs, to an output device, an indication that the desired video scene cannot be found (step S9). If one or more key events are found in the video image (step S3: YES), the video search apparatus 10 determines whether the processing for searching for the auxiliary events has been performed for every key event found in the target video image (step S4).
If the processing for searching for the auxiliary events has not yet been performed for every key event found in the target video image (step S4: NO), the video search apparatus 10 selects one key event for which the processing has not yet been performed, and determines whether all of the auxiliary events contained in the scene condition table have been searched for with respect to that key event (step S5).
If not all of the auxiliary events contained in the scene condition table have been searched for yet (step S5: NO), the video search apparatus 10 sets the search range for an auxiliary event that has not yet been searched for (step S6), and searches the target video image for that auxiliary event within this search range (step S7).
If the search for all of the auxiliary events contained in the scene condition table has ended with respect to one of the key events found in the video image (step S5: YES), the video search apparatus 10 determines the video section based on the result of the search for the auxiliary events and calculates the reliability of this video section (step S8). If the processing for searching for the auxiliary events has ended for every key event found in the target video image (step S4: YES), the video search apparatus 10 outputs a search result constituted by the determined video section(s) and reliability to the output device (step S9).
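Tying the pieces together, a driver corresponding to steps S1 to S9 might look like the following sketch, reusing the hypothetical functions defined above; an empty result list plays the role of the "scene not found" output of step S9.

```python
def run_search(query, indexed_video, scene_conditions):
    """Sketch of steps S1-S9: look up the scene conditions for the query,
    find every key event, search the auxiliary events per key event, and
    emit (video_section, reliability) pairs as the search result."""
    conditions = scene_conditions.get(query)              # S1-S2
    if conditions is None:
        return []                                         # S9: not found
    results = []
    for key_time, found in search_scene(indexed_video, conditions):  # S3-S7
        section = determine_section(key_time, found, conditions)     # S8
        results.append((section, reliability(found)))                # S8
    return results                                        # S9

# Usage with the hypothetical sample data defined earlier:
# run_search("hit", indexed_video, SCENE_CONDITIONS)
# -> [((12.0, 19.0), 1.0)]
```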
In this manner, the video search apparatus 10 according to the present exemplary embodiment sets the search range of each event using the time information set in the scene condition table, thereby being able to search for the plurality of events constituting the desired video scene within an appropriate time range. As a result, the video search apparatus 10 can accurately determine the video section corresponding to the desired video scene even when the target video image is not structurally organized. Further, the video search apparatus 10 calculates the reliability of the determined video section, thereby allowing a user to be aware of the credibility of the video section resulting from the search, which improves convenience.
Next, a second exemplary embodiment of the present disclosure will be described. The video search apparatus 10 according to the present exemplary embodiment is configured substantially similarly to the configuration described in the first exemplary embodiment. However, the present exemplary embodiment is different from the first exemplary embodiment in that the target video image does not have to be indexed with the event index in advance. Accordingly, in the video search apparatus 10, the above-described search unit 12 is configured differently, as follows.
In particular, the search unit 12 according to the present exemplary embodiment has a function of searching the target video image for an event contained in the scene condition table, and this function is realized using, for example, a model based on a neural network. In other words, the search unit 12 has a function of searching the target video image for video data corresponding to each event using a model that has learned the corresponding video data for each event in advance. One possible example of such a model is a model trained on the task of predicting the time at which an event occurs together with its label, using, as training data, data in which a label indicating an event is manually assigned to a certain time in the video image. Note that the search unit 12 may search for all the events using one model, or may search for the events using a plurality of models by, for example, preparing a model for each event.
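As one plausible shape for such a model, the sketch below uses PyTorch to score per-frame feature vectors against a fixed set of event labels and converts the scores back into the (label, time) event indexes assumed in the first exemplary embodiment. The architecture, the upstream feature extraction, and the thresholding are all illustrative assumptions, not the disclosed model.

```python
import torch
import torch.nn as nn

class EventDetector(nn.Module):
    """Toy stand-in for the learned model: a linear head that scores each
    frame-level feature vector against every known event label."""
    def __init__(self, feat_dim: int, num_events: int):
        super().__init__()
        self.head = nn.Linear(feat_dim, num_events)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (num_frames, feat_dim) -> (num_frames, num_events) logits
        return self.head(feats)

def detect_events(model, feats, labels, fps, threshold=0.5):
    """Turn per-frame scores into (label, time_sec) pairs, i.e., the same
    event indexes the first exemplary embodiment assumed were assigned
    in advance."""
    with torch.no_grad():
        probs = torch.sigmoid(model(feats))
    hits = []
    for frame, row in enumerate(probs):
        for j, p in enumerate(row):
            if p.item() >= threshold:
                hits.append((labels[j], frame / fps))
    return hits
```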
Next, a third exemplary embodiment of the present disclosure will be described with reference to
First, the hardware configuration of a video search apparatus 100 according to the present exemplary embodiment will be described with reference to
Note that
Then, the video search apparatus 100 can construct and include a search unit 121 and a determination unit 122 illustrated in
The above-described search unit 121 uses scene information, which includes event information including a plurality of events constituting a video scene and time information indicating a time set for the video scene, to set a search range for an event in a target video image based on the time information, and searches the target video image for the events included in the event information within the set search range. For example, an occurrence order of the plurality of events is set in the event information, and a time interval between the events is set in the time information. The search unit 121 therefore sets the search range in consideration of the temporal relationship between the events, and searches for the events constituting the video scene within this search range.
The above-described determination unit 122 determines the video section in the target video image that corresponds to the scene information targeted for the search, based on a result of the search for the events. For example, the determination unit 122 determines the video section containing the found events, or determines the video section further taking the time information into consideration.
By being configured in this manner, the present disclosure sets the search range of the event using the time information set in the scene information, thereby being able to search for the plurality of events constituting the desired video scene within an appropriate time range. As a result, the present disclosure can accurately determine the video section corresponding to the desired video scene even when the target video image is not structurally organized.
Note that the program described above can be supplied to a computer by being stored on a non-transitory computer-readable medium of any type. Non-transitory computer-readable media include tangible storage media of various types. Examples of non-transitory computer-readable media include a magnetic recording medium (for example, a flexible disk, a magnetic tape, and a hard disk drive), a magneto-optical recording medium (for example, a magneto-optical disk), a CD-Read Only Memory (ROM), a CD-R, a CD-R/W, a semiconductor memory (for example, a mask ROM, a Programmable ROM (PROM), an Erasable PROM (EPROM), a flash ROM, and a Random Access Memory (RAM)). Alternatively, the program may also be supplied to a computer via a transitory computer-readable medium of any type. Examples of transitory computer-readable media include electric signals, optical signals, and electromagnetic waves. A transitory computer-readable medium can supply the program to a computer via a wired communication channel such as an electric wire and an optical fiber, or a wireless communication channel.
While the present disclosure has been described with reference to the exemplary embodiments described above, the present disclosure is not limited to the above-described exemplary embodiments. The form and details of the present disclosure can be changed within the scope of the present disclosure in various manners that can be understood by those skilled in the art. Further, at least one or more function(s) among the functions of the above-described search unit 121 and determination unit 122 may be executed by an information processing apparatus set up at any location in a network and connected therefrom, i.e., may be executed by so-called cloud computing.
The whole or part of the exemplary embodiments disclosed above can be described as, but not limited to, the following supplementary notes. Hereinafter, outlines of the configurations of a video search apparatus, a video search method, and a program according to the present disclosure will be described. However, the present disclosure is not limited to the configurations described below.
A video search apparatus comprising:
The video search apparatus according to supplementary note 1, wherein
The video search apparatus according to supplementary note 2, wherein
The video search apparatus according to supplementary note 3, wherein
The video search apparatus according to supplementary note 3, wherein
The video search apparatus according to any of supplementary notes 1 to 5, wherein
The video search apparatus according to any of supplementary notes 1 to 6, wherein
The video search apparatus according to supplementary note 7, wherein
A video search method comprising:
The video search method according to supplementary note 9, wherein
The video search method according to supplementary note 9.1, wherein
The video search method according to supplementary note 9.2, wherein
The video search method according to supplementary note 9.2, wherein
The video search method according to any of supplementary notes 9 to 9.4, further comprising:
The video search method according to any of supplementary notes 9 to 9.5, further comprising:
The video search method according to supplementary note 9.6, further comprising:
A program for causing a computer to execute processing to:
Number | Date | Country | Kind
---|---|---|---
2022-119678 | Jul 2022 | JP | national