The present disclosure relates to an event occurrence time learning apparatus, an event occurrence time estimation apparatus, an event occurrence time estimation method, an event occurrence time learning program, and an event occurrence time estimation program which estimate the occurrence time of an event using a series of images acquired in time series.
In the related art, there is a technique for estimating the time left until an event occurs by analyzing data relating to the time left until an event occurs. For example, in Non Patent Literature 1, the time left until an event occurs (for example, the death of a patient) is estimated using medical images. Specifically, this technique enables estimation by modeling a non-linear relationship between the time until the death of a patient and features included in medical images such as the sizes and locations of lesions using survival analysis and a deep learning technology, especially a convolutional neural network (CNN) (for example, see Non Patent Literature 2).
There is also a technique for estimating the time left until an event occurs from time series information obtained from results of a plurality of clinical tests as in Non Patent Literature 3. Specifically, this technique enables estimation by capturing time-series changes in test results and modeling a relationship between the time-series changes and the time left until an event occurs using survival analysis and a deep learning technology, especially a recurrent neural network (RNN).
However, the methods of the related art cannot handle time-series information included in a series of images captured at different times. For example, the technique of Non Patent Literature 1 can handle high-dimensional information such as images but cannot handle time-series information. On the other hand, the technique of Non Patent Literature 3 can handle time-series information but cannot handle high-dimensional information such as images.
In addition, these two techniques cannot simply be combined because they have different neural network structures, hazard functions, likelihood functions, and the like. This causes a problem of not being able to perform analysis taking into consideration the movement of objects or the like. For example, when a traffic accident is considered as an event, it is not possible to analyze the movement of objects, such as whether nearby pedestrians are approaching or moving away and how fast they are moving. Thus, it is difficult to predict the time left until an accident occurs.
Further, these techniques cannot handle information accompanying a series of images. The accompanying information includes metadata associated with the entire series of images and time-series data typified by sensor data. The technique of Non Patent Literature 1 can handle neither type of data, while that of Non Patent Literature 3 can handle time-series data but cannot handle metadata. For example, when a traffic accident is considered as an event, the metadata includes attribute information such as the driver's age and the type of the automobile, and the time-series data includes the speed or acceleration of the automobile, global positioning system (GPS) location information, the current time, or the like. This information provides prior knowledge such as the driver's reaction speed and driving tendencies, or areas where pedestrians frequently run out into the road and higher speeds are dangerous. In the related art, these types of information cannot be fully utilized, and an accident whose signs do not appear in the series of images until just before a pedestrian runs out may be overlooked.
The present invention has been made in view of the above circumstances and it is an object of the present invention to provide an event occurrence time learning apparatus, an event occurrence time estimation apparatus, an event occurrence time estimation method, an event occurrence time learning program, and an event occurrence time estimation program which estimate the occurrence time of an event by learning the occurrence time of an event using a series of images acquired in time series.
An event occurrence time learning apparatus of the present disclosure to achieve the object includes a hazard estimation unit configured to estimate a likelihood of occurrence of an event relating to a recorder of an image, a recorded person, or a recorded object according to a hazard function for each of a plurality of time-series image groups including time-series image groups in which the event has not occurred and time-series image groups in which the event has occurred, each of the plurality of time-series image groups including a series of images and being given an occurrence time of the event in advance, and a parameter estimation unit configured to estimate a parameter of the hazard function such that a likelihood function that is represented by including the occurrence time of the event given for each of the plurality of time-series image groups and the likelihood of occurrence of the event estimated for each of the plurality of time-series image groups is optimized.
Accompanying information may be further given for the time-series image group, and the hazard estimation unit may be configured to estimate the likelihood of occurrence of the event according to the hazard function based on the time-series image group and the accompanying information given for the time-series image group.
The hazard estimation unit may include a plurality of partial hazard estimation units, each being configured to estimate the likelihood of occurrence of the event according to a partial hazard function using at least one of the time-series image group and the accompanying information given for the time-series image group as an input and each having the input or the partial hazard function different from that of another partial hazard estimation unit, and a partial hazard combining unit configured to combine estimated likelihoods of occurrence of the event from the plurality of partial hazard estimation units to obtain an estimate according to the hazard function.
The hazard estimation unit may be configured to extract a feature amount in consideration of a time series of an image from the time-series image group according to the hazard function using a neural network and estimate the likelihood of occurrence of the event based on the extracted feature amount.
An event occurrence time estimation apparatus of the present disclosure includes an input unit configured to receive an input of a target time-series image group including a series of images, a hazard estimation unit configured to estimate a likelihood of occurrence of an event relating to a recorder of an image, a recorded person, or a recorded object for the target time-series image group according to a hazard function using a learned parameter, and an event occurrence time estimation unit configured to estimate an occurrence time of a next event based on the estimated likelihood of occurrence of the event.
An event occurrence time estimation method of the present disclosure includes, at a computer, for each of a plurality of time-series image groups including time-series image groups in which an event relating to a recorder of an image, a recorded person, or a recorded object has not occurred and time-series image groups in which the event has occurred, each of the plurality of time-series image groups including a series of images and being given an occurrence time of the event in advance, estimating a parameter of a hazard function such that a likelihood function that is represented by including the occurrence time of the event and a likelihood of occurrence of the event estimated for each of the plurality of time-series image groups is optimized, receiving an input of a target time-series image group including a series of images, estimating a likelihood of occurrence of the event for the target time-series image group according to a hazard function using the estimated parameter, and estimating an occurrence time of a next event based on the estimated likelihood of occurrence of the event.
An event occurrence time learning program of the present disclosure is a program for causing a computer to estimate a likelihood of occurrence of an event relating to a recorder of an image, a recorded person, or a recorded object according to a hazard function for each of a plurality of time-series image groups including time-series image groups in which the event has not occurred and time-series image groups in which the event has occurred, each of the plurality of time-series image groups including a series of images and being given an occurrence time of the event in advance, and estimate a parameter of the hazard function such that a likelihood function that is represented by including the occurrence time of the event given for each of the plurality of time-series image groups and the likelihood of occurrence of the event estimated for each of the plurality of time-series image groups is optimized.
An event occurrence time estimation program of the present disclosure is a program for causing a computer to receive an input of a target time-series image group including a series of images, estimate a likelihood of occurrence of an event relating to a recorder of an image, a recorded person, or a recorded object for the target time-series image group according to a hazard function using a learned parameter, and estimate an occurrence time of a next event based on the estimated likelihood of occurrence of the event.
The event occurrence time learning apparatus of the present disclosure having the above features can optimize the hazard function using the likelihood function that is represented by including the occurrence time of the event given for each of the plurality of time-series image groups and the likelihood of occurrence of the event estimated for each of the plurality of time-series image groups. Further, the event occurrence time estimation apparatus of the present disclosure can estimate the occurrence time of the next event using the likelihood of occurrence of an event obtained from the hazard function optimized by the event occurrence time learning apparatus.
In addition to this, by taking into consideration information accompanying time-series images, it is possible to improve the estimation accuracy.
Furthermore, estimation appropriate for inputs of various different types is enabled by obtaining the likelihoods of occurrence of events using a plurality of methods with different inputs or partial hazard functions, combining the estimated likelihoods of occurrence of events, and outputting the combination as a hazard function.
Hereinafter, embodiments of the present disclosure will be described in detail with reference to the drawings.
An event occurrence time prediction apparatus 1 is constructed by a computer or a server computer equipped with well-known hardware such as a processing device, a main storage device, an auxiliary storage device, a data bus, an input/output interface, and a communication interface. By being loaded into a main storage device and then executed by an processing device, various programs constituting an event occurrence time learning program and an event occurrence time estimation program function as each unit in the event occurrence time prediction apparatus 1. In the first embodiment, the various programs are stored in an auxiliary storage device included in the event occurrence time prediction apparatus 1. However, the storage destination of the various programs is not limited to the auxiliary storage device and the various programs may be recorded on a recording medium such as a magnetic disk, an optical disc, or a semiconductor memory or may be provided through a network. Any other component does not necessarily have to be realized by a single computer or server computer and may be realized by being distributed over a plurality of computers connected by a network.
The event occurrence time prediction apparatus 1 illustrated in the drawings includes a hazard estimation unit 11, a parameter estimation unit 12, a parameter storage unit 13, an event occurrence time estimation unit 14, and an input unit 15, and functions as both an event occurrence time learning apparatus and an event occurrence time estimation apparatus.
The event occurrence time prediction apparatus 1 is also connected to a history video database 2 via communication means to communicate information therebetween. The communication means may include any known communication means. For example, the event occurrence time prediction apparatus 1 may be connected to the history video database 2 via communication means such as the Internet in which communication is performed according to the Transmission Control Protocol/Internet Protocol (TCP/IP). The communication means may also be communication means according to another protocol.
The history video database 2 is constructed by a computer or a server computer equipped with well-known hardware such as a processing device, a main storage device, an auxiliary storage device, a data bus, an input/output interface, and a communication interface. The first embodiment will be described with reference to the case where the history video database 2 is provided outside the event occurrence time prediction apparatus 1, although the history video database 2 may be provided inside the event occurrence time prediction apparatus 1.
The history video database 2 stores a plurality of time-series image groups, each including a series of images for which event occurrence times are given in advance. Each time-series image group includes a series of images captured at predetermined time intervals. The first embodiment will be described below with reference to the case where each time-series image group is a video shot as an example. Hereinafter, a video shot will be simply referred to as a video V. Further, the history video database 2 stores a set of times when events have occurred for each video V. Events include events relating to a recorder, a recorded person, or a recorded object. Events may be either events that appear in the videos such as events of changes in the recorded person or the recorded object or events that do not appear in the videos such as events relating to the recorder. Hereinafter, a set of times when events have occurred will be referred to as an event occurrence time set E.
The time-series images are not limited to video images captured and recorded by a video camera or the like and may be images captured by a digital still camera at predetermined time intervals.
The recorder may be a person or an animal who or which takes pictures using a device for shooting and recording time-series images such as a video camera or a digital still camera, a robot or a vehicle such as an automobile equipped with a device for shooting and recording, or the like.
Using i as an identifier of the video V, each video Vi is represented by equation (1) below.
[Math. 1]
$V_i = [I_{i0}, \ldots, I_{i|V_i|}]$  (1)
where Iij represents a j-th image included in the video Vi and |Vi| represents the length of the video Vi.
The event occurrence time set Ei of each video Vi is represented by equation (2) below.
[Math. 2]
$E_i = \{e_{i1}, \ldots, e_{i|E_i|}\}$  (2)
where eik represents the occurrence time of a kth event that has occurred in the video Vi and |Ei| indicates the number of events that have occurred in the video Vi. The history video database 2 also includes videos Vi in which no events have occurred, that is, videos where |Ei|=0.
The input unit 15 receives an input of a target time-series image group including a series of images for which event occurrence is to be estimated. The target time-series image group is transmitted from a storage connected to a network or is input from various recording media such as a magnetic disk, an optical disc, and a semiconductor memory.
The first embodiment will be described below with reference to the case where the target time-series image group is a video shot V as an example, similar to the time-series image groups stored in the history video database 2. Hereinafter, the target time-series image group is simply referred to as a target video. The target video is a video V from a certain time in the past to the present and the identifier is c. Similar to the videos in the history video database 2, the target video Vc is represented by equation (3) below.
[Math. 3]
$V_c = [I_{c0}, \ldots, I_{c|V_c|}]$  (3)
Events may or may not occur in the target video Vc.
In the first embodiment, a hazard function representing the relationship between a video V and an event is generated using survival analysis and deep learning that uses a neural network (for example, a combination of a CNN and an RNN or a 3DCNN). Through learning, a parameter θ defining the hazard function used for prediction is optimized to estimate event occurrence times.
The parameter storage unit 13 stores the parameter θ of the hazard function. The parameter θ will be described later.
The hazard estimation unit 11 estimates the likelihood of event occurrence for each of a plurality of videos Vi including videos Vi in which no events have occurred and videos Vi in which events have occurred according to the hazard function. Specifically, according to a hazard function using a neural network, the hazard estimation unit 11 extracts feature amounts in consideration of the time series of the images from the video Vi and estimates the likelihood of event occurrence based on the extracted feature amounts.
First, the hazard estimation unit 11 receives a parameter θ of a hazard function from the parameter storage unit 13 and outputs a value of the hazard function utilizing deep learning.
The hazard function is a function that depends on the time t left until an event occurs and on l variables (x1, . . . , xl) estimated by deep learning and, when no event has occurred by the time t, represents the likelihood that an event will occur immediately after the time t. The hazard function h(t) is represented, for example, by equation (4) or equation (5) below. Equation (4) represents the case where the number of variables is two and equation (5) represents the case where the number of variables is one. Here, t in the hazard function h(t) represents the time elapsed from the time when prediction is performed. The number of variables of the hazard function h(t) may be increased as necessary, in which case an equation with the increased number of variables l is used.
[Math. 4]
$h(t) = \exp(x_1)\exp(x_2)\,t^{\exp(x_2)-1}$  (4)
[Math. 5]
$h(t) = \exp(x_1)$  (5)
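As a concrete illustration, the following is a minimal sketch of how the hazard values of equations (4) and (5) could be evaluated once the variables x1 and x2 have been output by the network. The Weibull-type reading of equation (4) and the function names are assumptions of this sketch, not part of the disclosure.

```python
import math

def weibull_hazard(t, x1, x2):
    """Hazard of equation (4): h(t) = exp(x1) * exp(x2) * t**(exp(x2) - 1).

    x1 and x2 are the variables estimated by the neural network; exp(x1)
    acts as a scale factor and exp(x2) as a shape factor of the hazard.
    """
    scale = math.exp(x1)
    shape = math.exp(x2)
    return scale * shape * t ** (shape - 1.0)

def exponential_hazard(t, x1):
    """Hazard of equation (5): h(t) = exp(x1), constant in the elapsed time t."""
    return math.exp(x1)

# Example: likelihood of an event occurring shortly after t = 2.0 seconds
print(weibull_hazard(2.0, x1=-1.0, x2=0.3))
```

Under this form, the hazard can increase or decrease with the elapsed time depending on the estimated shape factor, whereas equation (5) expresses a constant hazard.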
The convolutional layer 20 is a layer for extracting feature amounts from each image Iij (where j = 0 to |Vi|) in the video Vi. For example, the convolutional layer 20 convolves each image with a 3×3 pixel filter or extracts maximum pixel values of rectangles of a specific size (through max-pooling). For example, the convolutional layer 20 may have a known neural network structure such as VGG described in Reference 1 or may use a parameter learned in advance.
Reference 1: Karen Simonyan and Andrew Zisserman “Very deep convolutional networks for large-scale image recognition”, CoRR, Vol. abs/1409.1556, 2014.
The fully connected layer A 21 further abstracts the feature amounts obtained from the convolutional layer 20. Here, for example, a sigmoid function is used to non-linearly transform the input feature amounts.
The RNN layer 22 is a layer that further abstracts the abstracted features as time-series data. Specifically, for example, the RNN layer 22 receives features as time-series data, causes information abstracted in the past to circulate, and repeats the non-linear transformation. The RNN layer 22 only needs to have a network structure that can appropriately abstract time-series data and may have a known structure, examples of which include the technology of Reference 2.
Reference 2: Kyunghyun Cho, Bart Van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Hol-ger Schwenk, and Yoshua Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation”, arXiv preprint arXiv: 1406. 1078, 2014.
The fully connected layer B 23 transforms a plurality of abstracted feature amounts into a vector of l dimensions corresponding to the number of variables (l) of the hazard function and calculates elements of the vector as values of the variables of the hazard function. Here, the fully connected layer B 23 non-linearly transforms the input feature amounts, for example, using a sigmoid function.
The output layer 24 outputs a value indicating the likelihood that an event will occur immediately after the time t according to the above equation (4) or (5) based on the calculated l-dimensional vector.
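The following PyTorch-style sketch illustrates one way the layer stack described above (convolutional layer 20, fully connected layer A 21, RNN layer 22, fully connected layer B 23, output layer 24) could be assembled. The layer sizes, the use of a GRU, and the class name HazardNet are illustrative assumptions, not taken from the disclosure.

```python
import torch
import torch.nn as nn

class HazardNet(nn.Module):
    """Sketch of the hazard estimation unit 11: convolutional layer 20,
    fully connected layer A 21, RNN layer 22, fully connected layer B 23.
    The output layer 24 then evaluates equation (4) or (5) on the result."""

    def __init__(self, feat_dim=128, num_vars=2):
        super().__init__()
        # Convolutional layer 20: extracts per-image feature maps
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        # Fully connected layer A 21: abstracts per-image features
        self.fc_a = nn.Linear(32 * 4 * 4, feat_dim)
        # RNN layer 22: abstracts the sequence of per-image features (GRU, cf. Reference 2)
        self.rnn = nn.GRU(feat_dim, feat_dim, batch_first=True)
        # Fully connected layer B 23: maps to the l variables of the hazard function
        self.fc_b = nn.Linear(feat_dim, num_vars)

    def forward(self, video):
        # video: (batch, time, channels, height, width)
        b, t, c, h, w = video.shape
        x = self.conv(video.reshape(b * t, c, h, w)).reshape(b, t, -1)
        x = torch.sigmoid(self.fc_a(x))      # non-linear abstraction as in layer A 21
        x, _ = self.rnn(x)                   # time-series abstraction in layer 22
        # Variables (x1, x2) of the hazard function, taken from the last time step
        return self.fc_b(x[:, -1, :])
```

A convolutional block pre-trained as in Reference 1 could replace the convolution here, and the 3DCNN variant mentioned later could replace the convolution-plus-RNN pair.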
The parameter estimation unit 12 estimates a parameter θ of the hazard function such that a likelihood function that is represented by including the occurrence time of an event given for each of the plurality of videos Vi and the likelihood of event occurrence estimated for each of the plurality of videos Vi is optimized.
First, the parameter estimation unit 12 compares the event occurrence time set Ei of each video Vi stored in the history video database 2 with the hazard function output from the hazard estimation unit 11 to estimate a parameter θ. Then, the parameter estimation unit 12 optimizes the parameter θ of the hazard function such that the output of the likelihood function L obtained from the occurrence time eik of the kth event and the likelihood of event occurrence at each time tij estimated from the hazard function is maximized. The parameter estimation unit 12 stores the optimized parameter θ of the hazard function in the parameter storage unit 13.
For example, when, for N videos, Δtij and δij are defined using each video Vi and the event occurrence time set Ei of each video Vi, Δtij and δij are represented by equations (6) and (7) below, where tij represents the time of a j-th image Iij of the video Vi.
From these, a likelihood function L(θ) defined when the current parameter θ is used is represented by equation (8) below.
where
[Math. 9]
$V_{ij} = [I_{i0}, \ldots, I_{ij}]$
A specific optimization method can be implemented, for example, by using the logarithm of the likelihood function L(θ) multiplied by −1 as a loss function and minimizing the loss function using a known technique such as backpropagation.
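As a minimal sketch, assuming that Δtij is the observed duration until the next event or until censoring and that δij is 1 when an event is observed and 0 otherwise, the negated log-likelihood for the Weibull-type hazard of equation (4) could be written as below; this standard survival-analysis form is an assumption of the sketch, not a reproduction of equations (6) to (8). With that hazard, the cumulative hazard is H(t) = exp(x1)·t^exp(x2).

```python
import torch

def neg_log_likelihood(x, delta_t, delta, eps=1e-8):
    """Loss = -log L(theta) for the Weibull-type hazard of equation (4).

    x:       (batch, 2) network outputs (x1, x2)
    delta_t: (batch,)   observed duration until the next event or until censoring
    delta:   (batch,)   1 if an event was observed, 0 if the observation is censored

    Per sample: log-likelihood = delta * log h(delta_t) - H(delta_t),
    where H(t) = exp(x1) * t**exp(x2) is the cumulative hazard of equation (4).
    """
    x1, x2 = x[:, 0], x[:, 1]
    shape = torch.exp(x2)
    log_h = x1 + x2 + (shape - 1.0) * torch.log(delta_t + eps)
    cum_h = torch.exp(x1) * (delta_t + eps) ** shape
    log_lik = delta * log_h - cum_h
    return -log_lik.mean()
```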
When the learned parameter θ is set in the hazard estimation unit 11 and each image of the target video Vc from an image Ic0 to an image Icj is input to the hazard estimation unit 11 as illustrated in the drawings, the hazard estimation unit 11 outputs the value of the hazard function h(t|Vc; θ) for the target video Vc.
The event occurrence time estimation unit 14 estimates the occurrence time of the next event based on the value of the hazard function estimated by the hazard estimation unit 11. In prediction, the time ec when the next event will occur can be estimated, for example, by performing a simulation based on the hazard function or by comparing the value of a survival function derived from the hazard function (the probability that no events will occur until t seconds elapse) with a threshold value.
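A minimal sketch of the threshold-based variant is shown below: the survival function S(t) = exp(−H(t)) is derived from the hazard, and the first elapsed time at which it falls below a threshold is reported. The threshold, step size, and search horizon are illustrative, and the cumulative hazard again assumes the Weibull-type form of equation (4).

```python
import math

def estimate_event_time(x1, x2, threshold=0.5, t_max=300.0, step=0.1):
    """Estimate the next event occurrence time by comparing the survival
    function S(t) = exp(-H(t)) with a threshold, where H(t) = exp(x1) * t**exp(x2)
    is the cumulative hazard of the Weibull-type hazard (4).

    Returns the first elapsed time at which the probability that no event has
    occurred drops below the threshold, or None if it stays above within t_max.
    """
    shape = math.exp(x2)
    t = step
    while t <= t_max:
        survival = math.exp(-math.exp(x1) * t ** shape)
        if survival < threshold:
            return t
        t += step
    return None
```

The simulation-based variant would instead draw candidate event times from the distribution implied by the hazard and, for example, average them.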
Next, a flow of processing when the event occurrence time prediction apparatus 1 of the first embodiment functions as the event occurrence time learning apparatus will be described with reference to a flowchart.
First, in step S1, a parameter θ of a hazard function determined using a random number or the like is stored in the parameter storage unit 13 as an initial value of the parameter θ.
Next, in step S2, videos {V0, . . . , VN} included in the history video database 2 are passed to the hazard estimation unit 11. N is the number of videos included in the history video database 2. Here, a total of N videos Vi in the history video database 2 may be passed to the hazard estimation unit 11 or only a partial set of videos Vi in the history video database 2 may be passed to the hazard estimation unit 11.
In step S3, the hazard estimation unit 11 sets the parameter θ obtained from the parameter storage unit 13 as a neural network parameter of the hazard function.
In step S4, the hazard estimation unit 11 repeats processing of obtaining, for each video Vi (where i is 1 to N), a hazard function h(t|Vij; θ) (see
In step S5, the parameter estimation unit 12 further receives event occurrence time sets {E0, . . . , EN} included in the history video database 2 corresponding to the videos Vi.
In step S6, the parameter estimation unit 12 optimizes the parameter θ of the hazard function by maximizing a likelihood function L(θ) obtained from the hazard functions h(t|Vij; θ) and the event occurrence time sets {E0, . . . , EN} passed to the parameter estimation unit 12.
In step S7, the optimized parameter θ of the hazard function is stored in the parameter storage unit 13.
In step S8, it is determined whether or not a predetermined criterion has been reached. The criterion is, for example, the number of times that has been determined in advance or whether or not the amount of change in the likelihood function is a reference value or less. If the determination of step S8 is negative, the process returns to step S2.
In step S2, the videos {V0, . . . , VN} included in the history video database 2 are passed to the hazard estimation unit 11 again. The same set of videos Vi may be passed to the hazard estimation unit 11 each time, and a different set of videos Vi may also be passed to the hazard estimation unit 11 each time. For example, a total of N videos Vi in the history video database 2 may be passed to the hazard estimation unit 11 each time. Alternatively, a partial set of videos Vi different from the partial set of videos Vi in the history video database 2 that has been first passed to the hazard estimation unit 11 may be passed to the hazard estimation unit 11 such that partial sets of videos Vi included in the history video database 2 are sequentially passed to the hazard estimation unit 11. The same set of videos Vi may also be passed a plurality of times.
Subsequently, the processing of steps S3 to S7 is executed to obtain a new parameter θ of the hazard function h(t). In step S8, it is determined whether or not the predetermined criterion has been reached and the processing of steps S2 to S7 is repeatedly performed until the predetermined criterion is reached. If the determination in step S8 is affirmative, the optimization ends.
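Putting steps S2 to S8 together, a training loop might look like the sketch below, which reuses HazardNet and neg_log_likelihood from the earlier sketches. The mini-batching, the Adam optimizer, the stopping tolerance, and the assumption that each sample is a (video, Δt, δ) triple of videos trimmed or padded to a common length are all illustrative.

```python
import torch

def train(hazard_net, dataset, num_epochs=50, lr=1e-4, tol=1e-4):
    """Sketch of steps S2-S8: repeatedly pass (possibly partial) sets of videos
    to the hazard estimation unit, minimize -log L(theta) by backpropagation,
    and stop when a predetermined criterion is reached."""
    optimizer = torch.optim.Adam(hazard_net.parameters(), lr=lr)
    loader = torch.utils.data.DataLoader(dataset, batch_size=4, shuffle=True)
    previous_loss = float("inf")
    for epoch in range(num_epochs):                       # S8: maximum number of repetitions
        epoch_loss = 0.0
        for video, delta_t, delta in loader:              # S2: pass a set of videos
            x = hazard_net(video)                         # S4: evaluate the hazard variables
            loss = neg_log_likelihood(x, delta_t, delta)  # S6: optimize L(theta)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                              # S7: keep the updated parameter
            epoch_loss += loss.item()
        if abs(previous_loss - epoch_loss) < tol:         # S8: change in likelihood small enough
            break
        previous_loss = epoch_loss
    return hazard_net                                     # learned parameter theta
```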
Next, a flow of processing when the event occurrence time prediction apparatus 1 of the first embodiment functions as the event occurrence time estimation apparatus will be described with reference to a flowchart.
First, in step S11, the optimized parameter θ of the hazard function stored in the parameter storage unit 13 is passed to the hazard estimation unit 11.
In step S12, a target video Vc is input through the input unit 15 and passed to the hazard estimation unit 11.
In step S13, the hazard estimation unit 11 calculates a hazard function h(t|Vc) for each time t from the end time of the target video Vc based on each image Icj of the target video Vc and passes the calculated hazard function to the event occurrence time estimation unit 14.
In step S14, the event occurrence time estimation unit 14 estimates an event occurrence time ec based on the value of the hazard function h(t|Vc) for each time t. Then, in step S15, the event occurrence time estimation unit 14 outputs the estimated occurrence time ec.
Although the first embodiment has been described with reference to the case where the event occurrence time learning apparatus and the event occurrence time estimation apparatus are constructed as a single apparatus, the event occurrence time learning apparatus 1a and the event occurrence time estimation apparatus 1b may be constructed as different apparatuses as illustrated in the drawings.
In the first embodiment, taking into consideration high-order information of time-series images and time changes thereof while using deep learning and survival analysis makes it possible to estimate the time left until an event occurs. For example, when an event is a traffic accident, taking into consideration the movement of an object makes it possible to determine whether a nearby pedestrian is approaching or moving away, and taking into consideration the speed makes it possible to predict the time left until an accident occurs.
Next, a second embodiment will be described. The second embodiment will be described with reference to the case where the event occurrence time learning apparatus and the event occurrence time estimation apparatus are provided in the same apparatus, similar to the first embodiment. The second embodiment will also be described with reference to the case where time-series image groups are videos, similar to the first embodiment. The second embodiment differs from the first embodiment in that hazard functions are estimated using not only videos but also accompanying information in addition to videos. The same components as those of the first embodiment are denoted by the same reference signs and detailed description thereof will be omitted and only components different from those of the first embodiment will be described in detail.
The history video database 2 of the second embodiment stores accompanying information in addition to videos Vi and event occurrence time sets Ei of the videos Vi. Each video Vi and the event occurrence time set Ei of each video Vi are represented in the same manner as in the first embodiment and thus detailed description thereof will be omitted. The accompanying information is, for example, metadata or time-series data obtained from a sensor simultaneously with the video Vi. Specifically, when videos Vi are videos taken by an in-vehicle camera, the metadata includes attribute information such as the driver's age and the type of the automobile and the time-series data includes the speed or acceleration of the automobile, GPS location information, the current time, or the like.
Hereinafter, the second embodiment will be described with reference to the case where the accompanying information is time-series data. Accompanying information that accompanies an image Iij of each video Vi will be denoted by Aij. The accompanying information Aij is represented by equation (9) below.
[Math. 10]
$A_{ij} = \{a^0_{ij}, \ldots, a^{|A_{ij}|}_{ij}\}$  (9)
Here, arij represents accompanying information of type r associated with a j-th image Iij of the video Vi and is stored as time-series data in an arbitrary format (for example, a scalar value, a categorical variable, a vector, or a matrix) associated with each image Iij. |Aij| represents the number of types of accompanying information for the image Iij.
In the example using the in-vehicle camera, the accompanying information Aij is, for example, sensor data of speed, acceleration, and position information, and is represented by a multidimensional vector.
As illustrated in the drawings, the event occurrence time prediction apparatus 1c of the second embodiment includes a hazard estimation unit 11a, a parameter estimation unit 12a, a parameter storage unit 13, an event occurrence time estimation unit 14, and an input unit 15.
The hazard estimation unit 11a of the second embodiment includes M partial hazard estimation units 11-1, . . . , 11-M and a partial hazard combining unit 16.
Each of the partial hazard estimation units 11-1, . . . , 11-M uses at least one of the video Vi and the accompanying information Aij given for the video Vi as an input to estimate the likelihood of event occurrence according to a partial hazard function hm(t), where m is an identifier of the partial hazard estimation unit 11-1, . . . , 11-M.
The fully connected layer C 25 transforms the accompanying information Aij represented by a multidimensional vector into an abstract l-dimensional feature vector. Further, it is desirable that the accompanying information Aij be normalized in advance and input to the fully connected layer C 25.
The RNN layer 22 takes the outputs of the fully connected layer A 21 and the fully connected layer C 25 as inputs, such that feature amounts obtained from the image Iij and feature amounts obtained from the accompanying information Aij are input to the RNN layer 22. For example, feature amounts of the accompanying information Aij together with feature amounts of the image Iij included in the video Vi are input to the RNN layer 22 in accordance with the time when the data is obtained.
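A sketch of such a partial hazard estimation unit is shown below: the fully connected layer C 25 abstracts the normalized sensor vector accompanying each image, and its output is concatenated with the image features before the RNN layer 22. The sketch reuses HazardNet from the first embodiment sketch, and all dimensions and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HazardNetWithSensors(nn.Module):
    """Sketch of a partial hazard estimation unit of the second embodiment:
    fully connected layer C 25 abstracts the per-image accompanying
    information Aij, and its output enters the RNN layer 22 together with
    the image features obtained via layers 20 and 21."""

    def __init__(self, img_feat_dim=128, sensor_dim=8, sensor_feat_dim=32, num_vars=2):
        super().__init__()
        self.image_net = HazardNet(feat_dim=img_feat_dim, num_vars=num_vars)  # reuse conv + fc_a
        # Fully connected layer C 25: abstracts the (normalized) sensor vector Aij
        self.fc_c = nn.Linear(sensor_dim, sensor_feat_dim)
        # RNN layer 22 now receives image features and sensor features side by side
        self.rnn = nn.GRU(img_feat_dim + sensor_feat_dim, img_feat_dim, batch_first=True)
        self.fc_b = nn.Linear(img_feat_dim, num_vars)

    def forward(self, video, sensors):
        # video:   (batch, time, channels, height, width)
        # sensors: (batch, time, sensor_dim), normalized in advance
        b, t, c, h, w = video.shape
        img = self.image_net.conv(video.reshape(b * t, c, h, w)).reshape(b, t, -1)
        img = torch.sigmoid(self.image_net.fc_a(img))
        sen = torch.sigmoid(self.fc_c(sensors))
        x, _ = self.rnn(torch.cat([img, sen], dim=-1))   # aligned by acquisition time
        return self.fc_b(x[:, -1, :])
```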
Further, each of the plurality of partial hazard estimation units 11-1, . . . , 11-M takes an input different from inputs to the other partial hazard estimation units 11-1, . . . , 11-M or has a partial hazard function hm(t) different from those of the others. The structure of the neural network differs from that of the first embodiment in that the fully connected layer C 25 is provided for the accompanying information Aij.
For example, the video Vi is input to the partial hazard estimation unit 11-1 and the accompanying information Aij is input to the partial hazard estimation unit 11-2. Alternatively, the input information is changed such that the video Vi is input to the partial hazard estimation unit 11-1 and the video Vi and the accompanying information Aij are input to the partial hazard estimation unit 11-2. Further, when a combination of the video Vi and the accompanying information Aij is input to each of the partial hazard estimation unit 11-1 and the partial hazard estimation unit 11-2, the accompanying information Aij input to the partial hazard estimation unit 11-1 and the accompanying information Aij input to the partial hazard estimation unit 11-2 may be information of different types. In the example using the in-vehicle camera, accompanying information Aij of different types include the speed and position information of the automobile. Thus, a combination of the video Vi and the speed of the automobile may be input to the partial hazard estimation unit 11-1 and a combination of the video Vi and the position information of the automobile may be input to the partial hazard estimation unit 11-2.
Further, the partial hazard function hm(t) may be changed according to information input to the partial hazard estimation units 11-1, . . . , 11-M, such that it is possible to perform estimation according to the input information. Alternatively, the same video Vi and the same accompanying information Ai may be input to the plurality of partial hazard estimation units 11-1, . . . , 11-M while changing the configuration of the partial hazard function hm(t) for each partial hazard estimation unit 11-1, . . . , 11-M, such that the plurality of partial hazard estimation units 11-1, . . . , 11-M can perform estimation from different viewpoints. For example, a neural network may be used for one partial hazard function hm(t) while a kernel density estimation value is used for another partial hazard function hm(t).
The partial hazard combining unit 16 combines the estimated likelihoods of event occurrence from the plurality of partial hazard estimation units to derive a hazard function h(t). This derivation of the hazard function h(t) may use, for example, a weighted sum or a weighted average of all partial hazard functions hm(t) or a geometric average thereof.
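For example, a weighted-sum combination could be as simple as the following sketch; the weights are illustrative and could also be learned as part of the parameter θ.

```python
def combine_partial_hazards(partial_values, weights=None):
    """Partial hazard combining unit 16: combine the values h_m(t) of the M
    partial hazard functions into one hazard value, here as a weighted sum
    (a weighted average or a geometric average could be used instead)."""
    if weights is None:
        weights = [1.0 / len(partial_values)] * len(partial_values)
    return sum(w * h for w, h in zip(weights, partial_values))

# Example: two partial hazard estimation units evaluated at the same time t
h_t = combine_partial_hazards([0.12, 0.08], weights=[0.7, 0.3])
```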
The parameter estimation unit 12a compares the event occurrence time set Ei of each video Vi stored in the history video database 2 with the hazard function output from the hazard estimation unit 11a to estimate a parameter θ of the hazard function and stores the estimated parameter θ of the hazard function in the parameter storage unit 13 in the same manner as in the first embodiment described above. Here, the parameter θ of the hazard function includes the parameters θm of the plurality of partial hazard functions hm(t).
The likelihood function L of the second embodiment is represented by equation (10) below. Δtij and δij are the same as those of equations (6) and (7) in the first embodiment.
where
[Math. 12]
$V_{ij} = [I_{i0}, \ldots, I_{ij}]$
Here, if there is no accompanying information Aij corresponding to the image Iij, Aij is assumed to be empty data. A specific optimization method can be implemented, for example, by using the logarithm of the likelihood function multiplied by −1 as a loss function and minimizing the loss function using a known technique such as backpropagation.
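One way to realize this "empty data" convention is sketched below: a missing Aij is replaced by a zero vector of the expected dimension before being fed to the fully connected layer C 25. Treating absence as a zero vector is an assumption of the sketch.

```python
import torch

def fill_missing_sensors(sensor_seq, sensor_dim):
    """Replace missing accompanying information Aij (None) with a zero vector
    so that every image Iij has a sensor vector of the same dimension."""
    filled = [torch.zeros(sensor_dim) if a is None
              else torch.as_tensor(a, dtype=torch.float32)
              for a in sensor_seq]
    return torch.stack(filled)          # (time, sensor_dim)
```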
Next, a flow of processing when the event occurrence time prediction apparatus 1c of the second embodiment functions as the event occurrence time learning apparatus will be described with reference to a flowchart.
First, in steps S1 to S3, the same processing as in the first embodiment is performed such that a parameter θ of the hazard function obtained from the parameter storage unit 13 is passed to the hazard estimation unit 11a and set therein as a neural network parameter of the hazard function. Specifically, the parameters θm are set as parameters of the partial hazard functions hm(t).
In step S4-1, for each video Vi (where i is 1 to N), each of the plurality of partial hazard estimation units 11-1, . . . , 11-M repeats processing of obtaining a partial hazard function hm(t) for each series of images from a first image Ii0 to an image Iij at the time tj (where j is 0 to |Vi|). Subsequently, in step S4-2, the values of the partial hazard functions hm(t) are combined by the partial hazard combining unit 16 to derive a hazard function h(t) for each series of images of each video Vi, and the derived hazard function h(t) is passed to the parameter estimation unit 12a.
In steps S5 to S8, the same processing as in the first embodiment is performed. In step S8, it is determined whether or not a predetermined criterion has been reached and the processing of steps S2 to S7 is repeatedly performed until the determination is affirmative. If the determination is affirmative in step S8, the optimization ends.
A flow of processing when the event occurrence time prediction apparatus 1c of the second embodiment functions as the event occurrence time estimation apparatus is similar to that of the first embodiment and thus description thereof is omitted.
The second embodiment has been described with reference to the case where the event occurrence time learning apparatus and the event occurrence time estimation apparatus are constructed as a single apparatus. In addition, the event occurrence time learning apparatus 1d and the event occurrence time estimation apparatus 1e may be constructed as different apparatuses as illustrated in the drawings.
In the second embodiment, prediction accuracy can be improved by taking into account information accompanying time-series images in addition to the high-order information of the time-series images and time changes thereof. For example, when a traffic accident is considered as an event, it is possible to perform prediction taking into consideration the characteristics of areas, such as those where pedestrians frequently run out into the road, and information such as speed and acceleration.
Next, a third embodiment will be described. The third embodiment will be described with reference to the case where a hazard function is estimated using accompanying information similar to the second embodiment. However, the third embodiment differs from the second embodiment in that accompanying information is metadata rather than time-series data and accompanying information such as metadata for the entirety of a video is given.
An event occurrence time prediction apparatus of the third embodiment includes a hazard estimation unit 11a, a parameter estimation unit 12a, a parameter storage unit 13, an event occurrence time estimation unit 14, and an input unit 15, similar to the second embodiment illustrated in the drawings.
In the third embodiment, the accompanying information is accompanying information Ai that accompanies one video Vi like metadata. In the example using the in-vehicle camera, the metadata is, for example, attribute information such as the driver's age and the type of the automobile. The accompanying information Ai of each video Vi is represented by equation (11) below.
[Math. 13]
$A_i = \{a^0_i, \ldots, a^{|A_i|}_i\}$  (11)
Here, ari represents r-th accompanying information for the video Vi, and a plurality of pieces of accompanying information relating to the entirety of the video are stored in an arbitrary format (for example, a scalar value, a categorical variable, a vector, or a matrix). |Ai| represents the number of pieces of accompanying information for the video Vi.
When the accompanying information Ai is metadata, each of the partial hazard estimation units 11-1, . . . , 11-M uses the video Vi as an input or uses the video Vi and the accompanying information Ai as inputs to estimate the likelihood of event occurrence according to a partial hazard function hm(t). Also, similar to the second embodiment, each of the plurality of partial hazard estimation units 11-1, . . . , 11-M takes an input different from inputs to the other partial hazard estimation units 11-1, . . . , 11-M or has a partial hazard function hm(t) different from those of the others.
The fully connected layer D 26 transforms the accompanying information Ai into an abstracted l-dimensional feature vector.
In the second embodiment, feature amounts of the image Iij and feature amounts of the accompanying information Aij are input to the RNN layer 22 via the fully connected layer A 21 and the fully connected layer C 25, respectively. However, in the third embodiment, feature amounts of the accompanying information Ai are input to the fully connected layer B 23 via the fully connected layer D 26, separately from the image features.
The structure of the neural network differs from that of the second embodiment in that the fully connected layer D 26 is provided instead of the fully connected layer C 25 and its output is input to the fully connected layer B 23.
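The sketch below illustrates this arrangement: the fully connected layer D 26 abstracts the per-video metadata Ai, and its output is concatenated with the RNN output before the fully connected layer B 23. It reuses HazardNet from the first embodiment sketch; the dimensions and the numeric encoding of the metadata are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HazardNetWithMetadata(nn.Module):
    """Sketch of a partial hazard estimation unit of the third embodiment:
    fully connected layer D 26 abstracts the metadata Ai of the whole video,
    and its output is given to fully connected layer B 23 together with the
    output of the RNN layer 22, separately from the image path."""

    def __init__(self, img_feat_dim=128, meta_dim=4, meta_feat_dim=16, num_vars=2):
        super().__init__()
        self.image_net = HazardNet(feat_dim=img_feat_dim, num_vars=num_vars)  # conv, fc_a, rnn
        # Fully connected layer D 26: abstracts the per-video metadata Ai
        self.fc_d = nn.Linear(meta_dim, meta_feat_dim)
        # Fully connected layer B 23 now sees RNN output and metadata features together
        self.fc_b = nn.Linear(img_feat_dim + meta_feat_dim, num_vars)

    def forward(self, video, metadata):
        # video:    (batch, time, channels, height, width)
        # metadata: (batch, meta_dim), e.g. encoded driver age and vehicle type
        b, t, c, h, w = video.shape
        img = self.image_net.conv(video.reshape(b * t, c, h, w)).reshape(b, t, -1)
        img = torch.sigmoid(self.image_net.fc_a(img))
        x, _ = self.image_net.rnn(img)
        meta = torch.sigmoid(self.fc_d(metadata))
        return self.fc_b(torch.cat([x[:, -1, :], meta], dim=-1))
```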
The parameter estimation unit 12a compares the event occurrence time set Ei of each video Vi stored in the history video database 2 with the hazard function output from the hazard estimation unit 11a to estimate a parameter θ of the hazard function in the same manner as in the second embodiment described above.
The likelihood function L of the third embodiment is represented by equation (12) below. Δtij and δij are the same as those of equations (6) and (7) in the first embodiment.
where
[Math. 15]
$V_{ij} = [I_{i0}, \ldots, I_{ij}]$
A specific optimization method can be implemented, for example, by using the logarithm of the likelihood function multiplied by −1 as a loss function and minimizing the loss function using a known technique such as backpropagation, similar to the second embodiment.
A flow of processing of the event occurrence time prediction apparatus of the third embodiment is similar to that of the second embodiment and thus detailed description thereof is omitted.
In the third embodiment, the event occurrence time learning apparatus 1d and the event occurrence time estimation apparatus 1e may also be constructed as different apparatuses as illustrated in the drawings.
In the third embodiment, prediction accuracy can be improved by taking into account accompanying information such as metadata in addition to the high-order information of time-series images and time changes thereof. For example, when a traffic accident is considered as an event, it is possible to perform prediction taking into consideration information such as the driver's age and the type of the automobile.
The hazard estimation unit 11a may perform estimation using the partial hazard estimation units of the second embodiment and the partial hazard estimation units of the third embodiment in combination. It is also possible to use a structure in which the fully connected layer B 23 in the structure of the neural network in the second embodiment is provided with the fully connected layer D 26 for inputting feature amounts of the accompanying information Ai in the third embodiment.
By combining the partial hazard estimation units of the second embodiment and the partial hazard estimation units of the third embodiment in this way, when a traffic accident is considered as an event, it is possible to perform prediction taking into consideration information such as the driver's age and the type of the automobile in addition to prediction taking into consideration the characteristics of areas, such as those where pedestrians frequently run out into the road, and information such as speed and acceleration.
Although the above embodiments have been described with reference to the case where a combination of a CNN and an RNN is used as a neural network, a 3DCNN may also be used.
The present disclosure is not limited to the above embodiments and various modifications and applications are possible without departing from the gist of the present invention.
In the above embodiments, a central processing unit (CPU), which is a general-purpose processor, is used as the processing device. It is preferable that a graphics processing unit (GPU) be further provided as needed. Some of the functions described above may be realized using a programmable logic device (PLD), which is a processor whose circuit configuration can be changed after manufacturing, such as a field programmable gate array (FPGA), a dedicated electric circuit having a circuit configuration specially designed to execute specific processing, such as an application specific integrated circuit (ASIC), or the like.
Number | Date | Country | Kind |
---|---|---|---|
2019-028825 | Feb 2019 | JP | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2020/004944 | 2/7/2020 | WO | 00 |