This application claims the priority of Korean Patent Application No. 10-2022-0032099 filed on Mar. 15, 2022 and No. 10-2022-0052202 filed on Apr. 27, 2022 in the Korean Intellectual Property Office, the disclosures of which are incorporated herein by reference.
The present invention relates to a method for sampling a random vector corresponding to an intention of a pedestrian non-stochastically or applying a social statistical element that the majority of pedestrians move in groups to training, in training a neural network model for pedestrian trajectory prediction.
Pedestrian trajectory prediction technology is a technology that estimates a future trajectory based on a past trajectory of a pedestrian, and can be applied to various areas such as behavioral prediction, crowd movement analysis, abnormal movement detection, and traffic flow analysis.
Various computer vision technologies have been used to predict the pedestrian trajectory, and deep learning technology has been applied to enhance predictive accuracy in recent years.
The most commonly used deep learning technology among the technologies is a stochastic trajectory prediction technology, and as illustrated in
Since the technology is fundamentally based on probability, the available random vectors are infinite in variety, and predictive accuracy continues to rise as training is repeated indefinitely. However, the number of prediction trajectories to be sampled is not sufficient to represent all trajectories that can actually occur, and indefinite repetition cannot be carried out in an application program, so the technology is limited in that it is very difficult to secure a predetermined level of prediction accuracy.
In other words, there is a problem in that the previously illustrated existing technologies are fundamentally sensitive to bias due to the fixed number of samples and stochastic sampling, accordingly predicting a completely different trajectory from an actual result as illustrated in
In addition, recent deep-learning trajectory prediction studies have focused on individual pedestrians, on the expectation that an interaction between respective pedestrians will be sufficiently reflected through graph-based neural network models such as a graph convolutional network (GCN), a graph attention network (GAT), and a graph transformer network (GTN). However, as the number of edges connecting the respective pedestrians (nodes) increases, it becomes very difficult to train the neural network model, so there is a limit in that the trajectory prediction is very inaccurate in an environment crowded with pedestrians.
The present invention has been made in an effort to sample a random vector corresponding to an intention of a pedestrian non-stochastically and use the random vector for training a neural network model when training various neural network models used for pedestrian trajectory prediction.
Further, the present invention has been made in an effort to apply a social statistical element that the majority of pedestrians move in groups to training deep learning in pedestrian trajectory prediction using a neural network model.
The objects of the present disclosure are not limited to the above-mentioned objects, and other objects and advantages of the present disclosure that are not mentioned can be understood by the following description, and will be more clearly understood by exemplary embodiments of the present disclosure. Further, it will be readily appreciated that the objects and advantages of the present disclosure can be realized by means and combinations shown in the claims.
In order to solve the problem, a method for predicting a pedestrian trajectory according to an exemplary embodiment of the present invention includes: sampling, based on a pedestrian trajectory of a target pedestrian, a predetermined number of latent vectors non-stochastically among a plurality of random vectors corresponding to an intention of the target pedestrian; and extracting a pedestrian feature vector from the pedestrian trajectory, and applying the pedestrian feature vector and the latent vectors to a neural network model to determine an expected trajectory of the target pedestrian.
In an exemplary embodiment, the method further includes collecting a pedestrian image including the target pedestrian, and identifying the pedestrian trajectory of the target pedestrian in the pedestrian image.
In an exemplary embodiment, the identifying of the pedestrian trajectory of the target pedestrian includes detecting a location of the target pedestrian for each frame, and identifying the pedestrian trajectory.
In an exemplary embodiment, the sampling of the latent vectors non-stochastically includes sampling the predetermined number of latent vectors in the order in which trajectories predicted by the plurality of random vectors are most similar to an actual trajectory of the target pedestrian upon learning the neural network model.
In an exemplary embodiment, the sampling of the latent vectors non-stochastically includes sampling the predetermined number of latent vectors by applying a loss function which decreases as the trajectories predicted by the plurality of random vectors are more similar to the actual trajectory of the target pedestrian to the neural network model.
In an exemplary embodiment, the sampling of the latent vectors non-stochastically includes sampling the predetermined number of latent vectors in the order in which a distance between respective trajectories predicted by the plurality of random vectors is largest upon learning the neural network model.
In an exemplary embodiment, the sampling of the latent vectors non-stochastically includes sampling the predetermined number of latent vectors by applying, to the neural network model, a loss function which decreases as the distance between the respective trajectories predicted by the plurality of random vectors increases.
In an exemplary embodiment, the sampling of the latent vectors non-stochastically includes sampling the predetermined number of latent vectors so that the distance between respective trajectories predicted by the plurality of random vectors is largest while the trajectories predicted by the plurality of random vectors are most similar to the actual trajectory of the target pedestrian.
In an exemplary embodiment, the sampling of the latent vectors non-stochastically includes sampling the predetermined number of latent vectors by applying, to the neural network model, a final loss function acquired by a linear combination of a first loss function that decreases as the trajectories predicted by the plurality of random vectors are more similar to the actual trajectory of the target pedestrian and a second loss function that decreases as the distance between the respective trajectories predicted by the plurality of random vectors is larger.
In an exemplary embodiment, the sampling of the latent vectors non-stochastically includes extracting an interaction-aware feature between the target pedestrian and a surrounding pedestrian, and reflecting the interaction-aware feature to sample the latent vector.
In an exemplary embodiment, the extracting of the interaction-aware feature includes extracting the interaction-aware feature through a graph attention network (GAT), and inputting the interaction-aware feature into a multi-layer perceptron (MLP) to sample the latent vector.
In an exemplary embodiment, the neural network model is learned by using a training dataset constituted by the pedestrian trajectory of the target pedestrian for a first time interval of the pedestrian image and the pedestrian trajectory of the target pedestrian for a second time interval continued to the first time interval.
In an exemplary embodiment, the determining of the expected trajectory of the target pedestrian includes outputting the expected trajectory of the target pedestrian by applying the pedestrian feature vector and the latent vector to any one of Gaussian distribution, Generative Adversarial Network (GAN), and Conditional Variational AutoEncoder (CVAE).
Further, in order to solve the problem, according to an exemplary embodiment of the present invention, a method for predicting a pedestrian trajectory includes: classifying, based on pedestrian trajectories of a plurality of pedestrians, the plurality of pedestrians into at least one pedestrian group; generating each of first graph data according to a relationship of the pedestrian group, second graph data according to a relationship of the pedestrians in each pedestrian group, and third graph data according to a relationship of all of the plurality of pedestrians; and generating an expected trajectory for each of the plurality of pedestrians by inputting the first to third graph data into a neural network model.
In an exemplary embodiment, the method further includes collecting a pedestrian image including the plurality of pedestrians, and identifying the pedestrian trajectories of the plurality of pedestrians in the pedestrian image.
In an exemplary embodiment, the identifying of the pedestrian trajectories of the plurality of pedestrians includes identifying the pedestrian trajectory by detecting a location of each pedestrian for each frame.
In an exemplary embodiment, the classifying of the plurality of pedestrians into at least one pedestrian group includes classifying, based on a distance between the pedestrian trajectories of the plurality of pedestrians, the plurality of pedestrians into at least one pedestrian group.
In an exemplary embodiment, the classifying of the plurality of pedestrians into at least one pedestrian group includes classifying the plurality of pedestrians into the same group when the distance between the pedestrian trajectories of the plurality of pedestrians is equal to or less than a reference value.
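The reference-value grouping described above can be sketched as follows. This is a minimal illustration only: the function name `group_pedestrians`, the use of the mean per-frame Euclidean distance as the distance between two trajectories, and the transitive merging of pedestrians into one group are assumptions made for the example and are not taken from the disclosure.

```python
import numpy as np

def group_pedestrians(trajectories, reference_value):
    """Assign a group label to each pedestrian (hypothetical helper).

    trajectories: array of shape (P, T, 2), P pedestrians over T frames.
    Two pedestrians are linked when the mean per-frame Euclidean distance
    between their trajectories is <= reference_value (assumed metric).
    """
    num = trajectories.shape[0]
    diff = trajectories[:, None] - trajectories[None, :]      # (P, P, T, 2)
    dist = np.linalg.norm(diff, axis=-1).mean(axis=-1)        # (P, P)

    labels = -np.ones(num, dtype=int)                         # -1 means unassigned
    next_label = 0
    for start in range(num):
        if labels[start] >= 0:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:                                          # flood fill over linked pedestrians
            i = stack.pop()
            linked = np.flatnonzero((dist[i] <= reference_value) & (labels < 0))
            for j in linked:
                labels[j] = next_label
                stack.append(j)
        next_label += 1
    return labels

# Two pedestrians walking side by side and one walking far away.
example = np.zeros((3, 4, 2))
example[:, :, 0] = np.arange(4)
example[1, :, 1] = 0.5
example[2, :, 1] = 10.0
example_labels = group_pedestrians(example, reference_value=1.0)
```

In this sketch the first two pedestrians receive the same label and the third forms its own group.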
In an exemplary embodiment, the classifying of the plurality of pedestrians into at least one pedestrian group includes inputting the pedestrian trajectories of the plurality of pedestrians into a grouping neural network, and the grouping neural network extracts features from the pedestrian trajectories of the plurality of pedestrians through a convolutional layer, and classifies the plurality of pedestrians into the same pedestrian group when the distance between the extracted features is equal to or less than the reference value.
In an exemplary embodiment, the grouping neural network is learned through a gradient descent using a straight-through estimator (STE).
In an exemplary embodiment, the reference value is a learnable parameter of the grouping neural network.
In an exemplary embodiment, the generating of the first graph data includes pooling pedestrian trajectories of pedestrians which belong to each pedestrian group to determine a representative location of each pedestrian group, and generating the first graph data according to a node representing the representative location and an edge connecting the representative location for each pedestrian group.
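As a rough sketch of the pooling step, the representative location of each pedestrian group may be obtained by averaging the member trajectories, and the group nodes may then be connected to each other. Mean pooling, the fully connected edge layout, and the names `pool_groups` and `fully_connected_edges` are assumptions for illustration; the disclosure does not fix a particular pooling operator here.

```python
import numpy as np

def pool_groups(trajectories, labels):
    """Average-pool member trajectories: (P, T, 2) -> (G, T, 2)."""
    groups = np.unique(labels)                       # sorted group ids
    return np.stack([trajectories[labels == g].mean(axis=0) for g in groups])

def fully_connected_edges(num_nodes):
    """Adjacency matrix with an edge between every pair of distinct nodes."""
    return np.ones((num_nodes, num_nodes)) - np.eye(num_nodes)

trajectories = np.zeros((3, 4, 2))
trajectories[1, :, 1] = 0.5
trajectories[2, :, 1] = 10.0
labels = np.array([0, 0, 1])

group_nodes = pool_groups(trajectories, labels)      # representative locations
group_edges = fully_connected_edges(len(group_nodes))
```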
In an exemplary embodiment, the generating of the second graph data includes generating the second graph data according to a node representing a time-wise location of the pedestrian in each pedestrian group and an edge connecting locations of the pedestrians in each pedestrian group.
In an exemplary embodiment, the generating of the third graph data includes generating the third graph data according to a node representing time-wise locations of the plurality of pedestrians and an edge connecting the locations of the plurality of pedestrians.
In an exemplary embodiment, the generating of the expected trajectory for each of the plurality of pedestrians includes inputting the first to third graph data into first to third graph based neural networks sharing parameters, respectively, and integrating outputs of the first to third graph based neural networks to generate the expected trajectory for each of the plurality of pedestrians.
In an exemplary embodiment, the generating of the expected trajectory for each of the plurality of pedestrians includes unpooling the outputs of the neural network model for the first graph data so that expected trajectories of pedestrians which belong to the same pedestrian group are the same as each other.
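The unpooling step can be illustrated by copying each group-level output back to every member of that group, so that members of the same pedestrian group share the same output. The helper name `unpool_groups` and the array layout are hypothetical.

```python
import numpy as np

def unpool_groups(group_outputs, labels):
    """Copy each group output to its members: (G, T, 2) -> (P, T, 2)."""
    # inverse[p] is the row of group_outputs that pedestrian p belongs to.
    _, inverse = np.unique(labels, return_inverse=True)
    return group_outputs[inverse]

group_outputs = np.arange(12, dtype=float).reshape(2, 3, 2)   # 2 groups, 3 future steps
labels = np.array([0, 0, 1])
per_pedestrian = unpool_groups(group_outputs, labels)
```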
In an exemplary embodiment, the generating of the expected trajectory for each of the plurality of pedestrians includes sampling latent vectors corresponding to intentions of the plurality of pedestrians, and inputting the latent vectors and the first to third graph data into the neural network model to generate the expected trajectory.
In an exemplary embodiment, in the sampling of the latent vectors, the same latent vector is sampled with respect to the pedestrians which belong to the same pedestrian group.
According to an exemplary embodiment of the present invention, when various neural network models used for pedestrian trajectory prediction are trained, a random vector corresponding to an intention of a pedestrian is sampled non-stochastically, which enhances prediction accuracy of the neural network model and allows the various expected trajectories that the neural network model can produce to be derived and output.
Further, according to the present invention, an interaction between pedestrian groups is structuralized with data to allow a neural network model for trajectory prediction to learn an intrinsic complexity of a social interaction.
Further, according to the present invention, there is an advantage in that as each pedestrian group is set to a node of graph data, the number of nodes can be reduced, so a data biasing problem of the neural network model can be prevented, and it is possible to flexibly cope with a change in number of pedestrians upon the trajectory prediction.
Further according to the present invention, in one pedestrian image, each of an interaction between the pedestrian groups, an interaction between the pedestrians in the pedestrian group, and an interaction among all pedestrians is structuralized with the graph data to augment data at the time of learning the neural network model.
In addition to the above-described effects, the specific effects of the present invention will be described below together while describing the specific matters for the present invention.
The above-mentioned objects, features, and advantages will be described below in detail with reference to the accompanying drawings. Therefore, those skilled in the art to which the present invention pertains may easily practice a technical idea of the present invention. In describing the present invention, a detailed description of related known technologies will be omitted if it is determined that they unnecessarily make the gist of the present invention unclear. Hereinafter, a preferable embodiment of the present invention will be described in detail with reference to the accompanying drawings. In the drawings, the same reference numeral is used to indicate the same or similar component.
In this specification, although the terms “first”, “second”, and the like are used for describing various components, these components are not confined by these terms. These terms are only used to distinguish one component from other components, and unless there is particularly disclosed contrary thereto, a first component may be a second component, of course.
Further, in this specification, any component being disposed “at an upper portion (or lower portion)” of a component or “above (or below)” a component may mean that the component is disposed in contact with an upper surface (or a lower surface) of the other component, or that another component is interposed between the two components.
Further, in this specification, when it is disclosed that any component is “connected”, “coupled”, or “linked” to other components, it should be understood that the components may be directly connected or linked to each other, but another component may be “interposed” between the respective components or the respective components may be “connected”, “coupled”, or “linked” through another component.
Further, a singular form used in the present disclosure may include a plural form if there is no clearly opposite meaning in the context. In this application, a term such as “comprising” or “including” should not be interpreted as necessarily including all various components or various steps disclosed in the present disclosure, and it should be interpreted that some component or some steps among them may not be included or additional components or steps may be further included.
In addition, in this specification, when the component is called “A and/or B”, this means that the component means A, B, or A and B unless otherwise particularly disclosed, and when the component is called “C to D”, this means that the component is C or more and D or less unless otherwise particularly disclosed.
The present invention relates to a method for sampling a random vector corresponding to an intention of a pedestrian non-stochastically in training a neural network model for pedestrian trajectory prediction. Hereinafter, a pedestrian trajectory prediction method through non-stochastic sampling according to an exemplary embodiment of the present invention will be described in detail with reference to
Referring to
However, the pedestrian trajectory prediction method illustrated in
The respective steps illustrated in
Hereinafter, the respective steps illustrated in
The processor may collect a pedestrian image 100 including a target pedestrian 110 (S10).
The target pedestrian 110 may mean a pedestrian which becomes a target of trajectory prediction, and the pedestrian image 100 may be a predetermined image containing a figure in which the target pedestrian 110 moves. The pedestrian image 100 may be an image of various views, and specifically, may be an image of a first person view (FPV) in which the target pedestrian 110 is shot or an image of a surveillance view.
The processor may collect the pedestrian image 100 from the other device or a predetermined storage medium. For example, the processor may collect the pedestrian image 100 in front of a vehicle from the vehicle, and collect the pedestrian image 100 in a surveillance area from a CCTV, and collect the pedestrian image 100 from a predetermined database.
Subsequently, the processor may identify a pedestrian trajectory 120 of the target pedestrian 110 in the pedestrian image 100 (S20).
Referring to
The processor may detect a location of the target pedestrian 110 for each frame of the pedestrian image 100, and identify the pedestrian trajectory 120 based on the location as it changes in time series. To this end, the processor may use a predetermined object detection algorithm known in the technical field. Specifically, the processor may detect a specific body portion of the target pedestrian 110, e.g., a location of a head, for each frame, and connect the locations detected for each frame to identify the pedestrian trajectory 120.
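The connection of per-frame detections into a trajectory can be sketched as below. The input format (a mapping from frame index to a detected head location) and the name `identify_trajectory` are assumptions for illustration; the object detection algorithm itself is outside the sketch.

```python
import numpy as np

def identify_trajectory(detections):
    """Connect per-frame locations in time order.

    detections: dict mapping frame index -> (x, y) location of the
    detected body portion (hypothetical input format).
    Returns an array of shape (T, 2) ordered by frame index.
    """
    frames = sorted(detections)
    return np.array([detections[f] for f in frames], dtype=float)

# Detections may arrive out of order; they are sorted by frame index.
detections = {2: (2.0, 1.0), 0: (0.0, 0.0), 1: (1.0, 0.5)}
trajectory = identify_trajectory(detections)
```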
Referring to
Here, an actual trajectory of the pedestrian in the second time interval T2 may be informally determined according to a latent intention of the pedestrian. Accordingly, the trajectory prediction model in the related art randomly samples as many random vectors corresponding to the latent intention of the pedestrian as the number of trajectories to be predicted, and uses the sampled latent vectors for learning the neural network model to determine various expected trajectories 130.
However, referring to
Specifically, the processor may sample a predetermined number of latent vectors among a plurality of random vectors corresponding to the intention of the target pedestrian 110 non-stochastically based on the pedestrian trajectory 120 of the target pedestrian 110 (S31). Here, the random vector, as a vector defined by random numbers, may be determined according to a Monte Carlo or Quasi-Monte Carlo method. Further, since each latent vector corresponds to a latent intention, i.e., the expected trajectory 130, the predetermined number may be set to the number of expected trajectories 130 to be determined through the neural network model.
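For illustration, the two ways of generating candidate random vectors mentioned above can be contrasted as follows. The Monte Carlo variant draws pseudo-random numbers, while the Quasi-Monte Carlo variant here uses a Halton low-discrepancy sequence; choosing Halton, sampling in the unit cube, and the function names are assumptions of this sketch, not requirements of the disclosure.

```python
import numpy as np

def monte_carlo_vectors(count, dim, seed=0):
    """Pseudo-random vectors drawn uniformly from [0, 1)^dim."""
    rng = np.random.default_rng(seed)
    return rng.random((count, dim))

def van_der_corput(index, base):
    """Radical inverse of a positive integer index in the given base."""
    value, denom = 0.0, 1.0
    while index > 0:
        index, digit = divmod(index, base)
        denom *= base
        value += digit / denom
    return value

def halton_vectors(count, dim):
    """Low-discrepancy (Quasi-Monte Carlo) vectors in [0, 1)^dim."""
    primes = [2, 3, 5, 7, 11, 13]
    if dim > len(primes):
        raise ValueError("this sketch supports at most 6 dimensions")
    return np.array([[van_der_corput(i, b) for b in primes[:dim]]
                     for i in range(1, count + 1)])

mc = monte_carlo_vectors(4, 2)
qmc = halton_vectors(3, 2)
```

The Halton points fill the unit square more evenly than the same number of pseudo-random points, which is the usual motivation for a Quasi-Monte Carlo method.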
Hereinafter, a non-stochastic sampling method of the present invention will be described.
In a first exemplary embodiment, upon learning the neural network model to be described below, the processor may sample a predetermined number of latent vectors in the order in which trajectories predicted by a plurality of random vectors are most similar to an actual trajectory of the target pedestrian 110. That is, among the random vectors, the predetermined number may be sampled according to the order in which the trajectory predicted by each random vector and the actual trajectory are most similar, and determined as the latent vector.
In the present invention, the neural network model may be learned by a training dataset constituted by the pedestrian trajectory 120 of the target pedestrian 110 for the first time interval T1 of the pedestrian image 100 and the pedestrian trajectory of the target pedestrian 110 for the second time interval T2 continued to the first time interval T1.
In other words, the neural network model may be learned to output the pedestrian trajectory 120 for the second time interval T2 when the pedestrian trajectory 120 for the first time interval T1 is input. In this case, the pedestrian trajectory 120 for the second time interval T2 used for learning may be the actual trajectory (ground truth (GT)) of the target pedestrian 110.
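A training sample of this kind can be sketched by splitting one identified trajectory into the part observed during the first time interval T1 and the ground-truth part for the second time interval T2. The helper name `make_training_pair` and the lengths used in the example (8 observed frames followed by 12 future frames) are illustrative assumptions only.

```python
import numpy as np

def make_training_pair(trajectory, observed_len):
    """Split a (T, 2) trajectory into (input for T1, ground truth for T2)."""
    observed = trajectory[:observed_len]        # fed to the neural network model
    ground_truth = trajectory[observed_len:]    # target the model should output
    return observed, ground_truth

full_trajectory = np.arange(40, dtype=float).reshape(20, 2)
observed, ground_truth = make_training_pair(full_trajectory, observed_len=8)
```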
In end-to-end learning, the processor may train parameters (e.g., a weight and a bias) of each layer and node constituting the neural network model so that the trajectory predicted by the random vector is similar to the actual trajectory.
To this end, the processor may apply, to the neural network model, a loss function which becomes smaller as the trajectories predicted by the plurality of random vectors are more similar to the actual trajectory of the target pedestrian 110.
The neural network model may learn the parameters in the model so that a value of the loss function becomes minimal by using a gradient descent, and a latent vector which minimizes the loss function among the random vectors may be sampled.
Specifically, the processor applies a loss function Ldist of [Equation 1] below to the neural network model so that the neural network model samples the random vector in a manner that decreases a Euclidean distance (L2 distance) between the trajectory predicted by the random vector and the actual trajectory.
(L represents the number of target pedestrians 110, N represents the random vector, and the remaining two symbols of [Equation 1] represent the trajectory predicted by the random vector and the actual trajectory, respectively)
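A loss with the behavior described for Ldist can be sketched as follows. Because [Equation 1] is not reproduced here, the exact form is an assumption: the sketch averages the per-step L2 distance over time, takes the closest of the N predicted trajectories for each pedestrian, and averages over the L pedestrians. The name `distance_loss` and the array shapes are hypothetical.

```python
import numpy as np

def distance_loss(predicted, actual):
    """Assumed form of a distance loss.

    predicted: (L, N, T, 2) trajectories predicted from N random vectors
               for each of L pedestrians.
    actual:    (L, T, 2) ground-truth trajectories.
    The value shrinks as the closest prediction approaches the ground truth.
    """
    step_error = np.linalg.norm(predicted - actual[:, None], axis=-1)  # (L, N, T)
    per_sample = step_error.mean(axis=-1)                              # (L, N)
    return float(per_sample.min(axis=1).mean())

predicted = np.zeros((1, 2, 2, 2))
predicted[0, 0] = [3.0, 4.0]     # every step is 5 units from the ground truth
predicted[0, 1] = [0.0, 1.0]     # every step is 1 unit from the ground truth
actual = np.zeros((1, 2, 2))
loss_value = distance_loss(predicted, actual)
```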
Meanwhile, when the latent vector is sampled according to the first exemplary embodiment, the prediction accuracy of the neural network model for the actual trajectory may be enhanced, but as the learning of the neural network model is conducted, a problem in that the neural network model is excessively biased toward the actual trajectory may occur.
That is, the neural network model for predicting the pedestrian trajectory should predict the latent intention of the pedestrian and present various trajectories which may be generated, and when the sampling method of the first exemplary embodiment is used, the diversity of the trajectories predicted by the neural network model may be lowered.
As a result, the processor may also conduct sampling by the following method.
In a second exemplary embodiment, upon learning the neural network model, the processor may sample a predetermined number of latent vectors in the order in which a distance between respective trajectories predicted by the plurality of random vectors is largest. That is, among the random vectors, the predetermined number may be sampled according to the order in which the distance between the trajectories predicted by the respective random vectors is largest, and determined as the latent vector.
In other words, when the end-to-end learning is applied to the neural network model, the processor may allow the parameters of each layer and node constituting the neural network model to be learned so that the respective trajectories predicted by the random vectors are far from each other. That is, whereas the random vector is sampled according to the distance between the trajectory predicted by the random vector and the actual trajectory in the first exemplary embodiment, the random vector may be sampled according to the distance between the respective trajectories predicted by the random vectors in the second exemplary embodiment.
To this end, the processor may apply the loss function which becomes smaller as the distance between the respective trajectories predicted by the plurality of random vectors increases.
Similarly as in the first exemplary embodiment, the neural network model may learn the parameters in the model so that the value of the loss function becomes minimal by using the gradient descent, and the latent vector which minimizes the loss function among the random vectors may be sampled.
Specifically, the processor applies a loss function Ldisc of [Equation 2] below to the neural network model so that the neural network model samples the random vector in a manner that increases a Euclidean distance (L2 distance) between the respective trajectories predicted by the random vectors.
(L represents the number of target pedestrians 110, N represents the random vector, and Sl,i and Sl,j represent the trajectories predicted by the respective random vectors)
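A loss with the behavior described for Ldisc can be sketched as follows. Since [Equation 2] is not reproduced here, the form below is an assumption: for each predicted trajectory it measures the distance to its nearest other predicted trajectory and returns the mean negative logarithm, so the value decreases as the predictions spread apart. The name `discrepancy_loss` and the small constant `eps` are hypothetical.

```python
import numpy as np

def discrepancy_loss(predicted, eps=1e-8):
    """Assumed form of a discrepancy loss.

    predicted: (L, N, T, 2) trajectories predicted from N random vectors
               for each of L pedestrians.
    """
    diff = predicted[:, :, None] - predicted[:, None, :]          # (L, N, N, T, 2)
    pairwise = np.linalg.norm(diff, axis=-1).mean(axis=-1)        # (L, N, N)
    n = predicted.shape[1]
    # Exclude the zero distance of every trajectory to itself.
    pairwise = pairwise + np.where(np.eye(n, dtype=bool), np.inf, 0.0)
    nearest = pairwise.min(axis=-1)                               # (L, N)
    return float(-np.log(nearest + eps).mean())

tight = np.zeros((1, 2, 3, 2))
tight[0, 1, :, 0] = 1.0        # second prediction is 1 unit away from the first
wide = np.zeros((1, 2, 3, 2))
wide[0, 1, :, 0] = 10.0        # second prediction is 10 units away
```

With these inputs, `discrepancy_loss(wide)` is smaller than `discrepancy_loss(tight)`, matching the stated behavior that the loss becomes smaller as the distance between predicted trajectories increases.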
Meanwhile, when the latent vector is sampled according to the second exemplary embodiment, the neural network model may present various expected trajectories 130, but there is a problem in that the prediction accuracy for the actual trajectory may be lowered as the learning of the neural network model is conducted.
That is, since a general pedestrian walks in a shortest trajectory toward a destination, there is a high probability that an existing walking direction will be maintained as it is in most situations. In other words, the expected trajectory of the target pedestrian is more likely to extend a pre-identified pedestrian trajectory.
When this is considered, the neural network model should secure prediction accuracy of a predetermined level or more while providing various expected trajectories 130, and in the case of the second exemplary embodiment, since the random vector is sampled through a distance comparison between the expected trajectories 130 rather than a distance comparison between the expected trajectory 130 and the actual trajectory, the prediction accuracy for the actual trajectory may be lowered as the learning is conducted.
As a result, the processor may sample the random vector by combining the first and second exemplary embodiments.
In a third exemplary embodiment, upon learning the neural network model, the processor may sample a predetermined number of latent vectors in the order in which the trajectories predicted by the plurality of random vectors are most similar to the actual trajectory of the target pedestrian 110 and the distance between the respective trajectories predicted by the random vectors is largest.
That is, among the random vectors, the predetermined number may be sampled according to the order in which the trajectory predicted by each random vector and the actual trajectory are most similar and in the order in which the distance between the respective trajectories predicted by the random vectors is largest, and determined as the latent vector. In this case, whether a weight is to be assigned to a similarity between the expected trajectory 130 and the actual trajectory or to a distance difference between the expected trajectories 130 may be determined according to a setting of a user.
To this end, the processor may apply, to the neural network model, a final loss function acquired by a linear combination of a first loss function that decreases as the trajectories predicted by the plurality of random vectors are more similar to the actual trajectory of the target pedestrian 110 and a second loss function that decreases as the distance between the respective trajectories predicted by the plurality of random vectors is larger.
Similarly as in the first and second exemplary embodiments, the neural network model may learn the parameters in the model so that the value of the final loss function becomes minimal by using the gradient descent, and the latent vector which minimizes the final loss function among the random vectors may be sampled.
Specifically, the processor may apply a final loss function L of [Equation 3] below to the neural network model. Ldist and Ldisc included in [Equation 3] may be the same as those disclosed in [Equation 1] and [Equation 2], respectively, and a scale difference between Ldist and Ldisc, and a relative weight may be controlled by,
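The linear combination of the two loss terms can be sketched for a single pedestrian as follows. The forms of the two terms, the placement of the relative weight on the second term, and the names `final_loss` and `weight` are assumptions of this sketch and may differ from [Equation 3].

```python
import numpy as np

def final_loss(predicted, actual, weight):
    """Assumed linear combination of a distance term and a discrepancy term.

    predicted: (N, T, 2) trajectories predicted from N random vectors.
    actual:    (T, 2) ground-truth trajectory.
    weight:    hypothetical relative weight of the discrepancy term.
    """
    to_truth = np.linalg.norm(predicted - actual, axis=-1).mean(axis=-1)   # (N,)
    diff = predicted[:, None] - predicted[None, :]                         # (N, N, T, 2)
    pairwise = np.linalg.norm(diff, axis=-1).mean(axis=-1)                 # (N, N)
    np.fill_diagonal(pairwise, np.inf)                                     # ignore self-distance
    dist_term = to_truth.min()
    disc_term = -np.log(pairwise.min(axis=1) + 1e-8).mean()
    return float(dist_term + weight * disc_term)

actual = np.zeros((3, 2))
collapsed = np.zeros((2, 3, 2))      # both predictions sit on the ground truth
diverse = np.zeros((2, 3, 2))
diverse[1, :, 0] = 2.0               # one prediction on the ground truth, one 2 units away
```

With `weight=1.0` the diverse set scores lower than the collapsed set, while with `weight=0.0` the two sets tie, which illustrates how the weight trades accuracy against diversity.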
Referring to
Meanwhile, the expected trajectory 130 of the pedestrian may be influenced by a movement of a surrounding pedestrian 210 located nearby. For example, the pedestrian may bypass to avoid the other pedestrian which comes from the front, and may find a specific pedestrian nearby and approach the specific pedestrian, and join a nearby pedestrian group to change a movement trajectory.
In order to consider mutual effects between the pedestrians, the processor may reflect an interaction-aware feature of the target pedestrian 110 in the above-described latent vector sampling operation. To this end, the processor may use a graph based deep learning network, for example, a Graph Convolutional Network (GCN), GraphSAGE, a Graph Attention Network (GAT), etc. However, as described above, since the pedestrian is normally influenced more strongly by the surrounding pedestrian 210 located nearby, it may be preferable to use the GAT, in which the weight may be set differently for each neighboring node.
Referring to
which an adjacent node j has with respect to a specific node i as an attention coefficient, and normalized to calculate an attention score
(Here, both ak and W represent learnable parameters)
Subsequently, the processor may update an interaction-aware feature
for each node, i.e., for each pedestrian according to [Equation 5] based on an attention score
The processor may sample the latent vector by inputting the interaction-aware feature determined according to the above-described method into multi-layer perceptron (MLP). In other words, the processor may train the MLP to express a non-linear relationship between the interaction-aware feature and the latent vector.
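The flow from attention scores to an interaction-aware feature and then to a latent vector can be sketched with a single-head graph attention layer followed by a small MLP. The sketch follows the commonly used GAT formulation (LeakyReLU attention coefficients normalized by a softmax over neighbors); whether [Equation 4] and [Equation 5] use exactly this form, a single head, or the tanh and ReLU activations chosen here is an assumption, and all function and parameter names are hypothetical.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(features, adjacency, weight, attention):
    """Single-head graph attention (assumed standard form).

    features:  (P, F) per-pedestrian input features.
    adjacency: (P, P) boolean, True where node j is adjacent to node i;
               must include self loops so every row has a neighbor.
    weight:    (F, H) learnable projection.
    attention: (2H,) learnable attention vector.
    """
    hidden = features @ weight                                   # (P, H)
    h = hidden.shape[1]
    # a^T [W h_i || W h_j] splits into a part for node i and a part for node j.
    scores = leaky_relu((hidden @ attention[:h])[:, None]
                        + (hidden @ attention[h:])[None, :])     # (P, P)
    scores = np.where(adjacency, scores, -np.inf)                # mask non-neighbors
    scores = scores - scores.max(axis=1, keepdims=True)          # numerical stability
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)             # attention scores
    return alpha, np.tanh(alpha @ hidden)                        # interaction-aware feature

def mlp(x, w1, b1, w2, b2):
    """Two-layer perceptron mapping interaction-aware features to latent vectors."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

rng = np.random.default_rng(0)
features = rng.normal(size=(3, 4))
adjacency = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]], dtype=bool)
alpha, aware = gat_layer(features, adjacency,
                         rng.normal(size=(4, 5)), rng.normal(size=10))
latent = mlp(aware, rng.normal(size=(5, 8)), np.zeros(8),
             rng.normal(size=(8, 2)), np.zeros(2))
```

In training, the projection, the attention vector, and the MLP parameters would be learned; here they are random placeholders so that only the data flow is shown.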
When specifically described with reference to
When the learning of the neural network model is completed as described above, the processor extracts a pedestrian feature vector from the pedestrian trajectory 120 of the target pedestrian 110 (S32), and applies the extracted pedestrian feature vector and the above sampled latent vector to the neural network model (S40) to determine the expected trajectory 130 of the target pedestrian 110 (S50).
In this case, a method for extracting the pedestrian feature vector (S32) and a method for applying the extracted pedestrian feature vector to the neural network model (S40) may be the same as the method used in the conventional pedestrian trajectory prediction model. That is, in the present invention, the random vector applied by the stochastic sampling method such as rolling dice in the conventional neural network model described in
As a result, the present invention may be applied to all of the Gaussian distribution, Generative Adversarial Network (GAN), and Conditional Variational AutoEncoder (CVAE) models.
The extracted and sampled pedestrian feature vector and latent vector may be aggregated, and consequently, the neural network model may output N expected trajectories (classes) 130 and a generation probability of each expected trajectory 130 (a probability for each class).
The processor may determine at least one of N expected trajectories 130 output from the neural network model as the expected trajectory 130 of the target pedestrian 110. For example, the processor may also determine all of N expected trajectories 130 as the expected trajectory 130 of the target pedestrian 110, and may determine only one trajectory having a highest probability among N expected trajectories 130 as the expected trajectory 130 of the target pedestrian 110.
In
In
As described above, according to an exemplary embodiment of the present invention, when various neural network models used for pedestrian trajectory prediction are trained, a random vector corresponding to an intention of a pedestrian is sampled statistically, which enhances the prediction accuracy of the neural network model and allows the various expected trajectories 130 that the neural network model can produce to be output.
Further, the present invention relates to a method for predicting the trajectory of the pedestrian by applying a social statistical element that the majority of pedestrians move in groups to learning. Hereinafter, a trajectory prediction method (hereinafter, referred to as a pedestrian trajectory prediction method) through pedestrian grouping according to an exemplary embodiment of the present invention will be described in detail with reference to
Referring to
Subsequently, the pedestrian trajectory prediction method may include a step (S410) of generating first graph data according to a relationship of the pedestrian group, a step (S420) of generating second graph data according to a relationship of the pedestrian in the pedestrian group, and a step (S430) of generating third graph data according to a total relationship of the plurality of pedestrians.
Subsequently, the pedestrian trajectory prediction method may include a step (S500) of inputting the first to third graph data into the neural network model and a step (S600) of generating an expected trajectory for each of the plurality of pedestrians.
However, the pedestrian trajectory prediction method illustrated in
The respective steps illustrated in
Hereinafter, the respective steps illustrated in
The processor may collect a pedestrian image 100 including a plurality of pedestrians (S100).
The plurality of pedestrians may mean the pedestrians that are targets of trajectory prediction, and the pedestrian image 100 may be a predetermined image showing the plurality of pedestrians moving. The pedestrian image 100 may be an image of various views, and for example, may be an image of a first person view (FPV) or an image of a surveillance view.
The processor may collect the pedestrian image 100 from another device or a predetermined storage medium. For example, the processor may collect the pedestrian image 100 in front of a vehicle from the vehicle, collect the pedestrian image 100 in a surveillance area from a CCTV, or collect the pedestrian image 100 from a predetermined database.
Subsequently, the processor may identify the pedestrian trajectories 120 of the plurality of pedestrians in the pedestrian image 100 (S200).
Referring back to
The processor may detect a location of the pedestrian 110 for each frame of the pedestrian image 100, and identify the pedestrian trajectory 120 based on the location which changes in time series. To this end, the processor may use a predetermined object detection algorithm known in the technical field. Specifically, the processor may detect a specific body portion of the pedestrian 110, e.g., a location of a head, for each frame, and connect the locations detected for each frame to identify the pedestrian trajectory 120.
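The connection of per-frame detections into trajectories can be sketched as follows. This is a minimal illustration, assuming that an upstream detector and tracker (not shown) already supply a stable pedestrian identifier with each detected head location; the function name `build_trajectories` and the tuple layout are hypothetical.

```python
from collections import defaultdict


def build_trajectories(detections):
    """Connect per-frame detections into one time-ordered trajectory per pedestrian.

    detections: iterable of (frame_index, pedestrian_id, x, y) tuples.
    The pedestrian_id is assumed to come from an external tracker.
    """
    tracks = defaultdict(list)
    # Sorting the tuples orders them by frame index first.
    for frame, ped_id, x, y in sorted(detections):
        tracks[ped_id].append((frame, x, y))
    # Drop the frame index and keep the ordered locations only.
    return {pid: [(x, y) for _, x, y in pts] for pid, pts in tracks.items()}


detections = [
    (1, "a", 1.0, 0.0),
    (0, "a", 0.0, 0.0),
    (0, "b", 5.0, 5.0),
    (1, "b", 5.0, 6.0),
]
trajectories = build_trajectories(detections)
```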
Referring to
In this regard, the trajectory prediction models in the related art focus on individual pedestrians, and expect the interaction between the respective pedestrians to be sufficiently reflected through graph-based neural network models such as a Graph Convolutional Network (GCN), a Graph Attention Network (GAT), and a Graph Transformer Network (GTN). However, as the number of connections (edges) between the respective pedestrians (nodes) increases, it becomes very difficult for the neural network model to learn the complexity of the individual interactions, so there is a limit in that trajectory prediction becomes very inaccurate in a complex environment.
The present invention takes into account social scientific research showing that more than 70% of pedestrians form groups, and that each group walks in a formation toward the same destination. Accordingly, a core value of the present invention is to allow the neural network model to learn a group walking feature, which differs from individual walking, when the trajectory of the pedestrian is predicted through an artificial intelligence neural network model.
To this end, the processor may classify the plurality of pedestrians into one pedestrian group based on the pedestrian trajectories 120 of the plurality of pedestrians (S300).
Referring to
In one example, the processor may classify the plurality of pedestrians into one pedestrian group based on the distance between the pedestrian trajectories 120 of the plurality of pedestrians. As described above, since the pedestrian trajectory 120 is defined by continuous locations of the pedestrian, the distance between the pedestrian trajectories 120 may be the distance between the pedestrians according to the continued time.
Specifically, the processor may specify location coordinates of the plurality of pedestrians at each continued time, and calculate the distance between the pedestrians based on each location coordinate. In this case, the calculated distance may be the Euclidean distance (L2 distance), and the distance between the respective pedestrians may be calculated in the form of a pairwise matrix.
The processor may classify the plurality of pedestrians into the same pedestrian group when the distance between the pedestrian trajectories of the plurality of pedestrians is equal to or less than a reference value. In other words, when the distance between multiple pedestrians calculated in each continued time is equal to or less than a reference value, the processor may classify the corresponding pedestrians into the same pedestrian group.
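The distance-based grouping described above can be sketched as follows. This is a minimal illustration rather than the claimed implementation: it computes the pairwise L2 distance matrix at each time step and treats a pair as belonging together only if the distance stays at or below the reference value at every step. Merging such pairs transitively into groups, as well as the names `pairwise_distances` and `group_pedestrians`, are assumptions made for the example.

```python
import math


def pairwise_distances(positions):
    """Pairwise L2 distance matrix for (x, y) positions at a single time step."""
    n = len(positions)
    return [[math.dist(positions[i], positions[j]) for j in range(n)] for i in range(n)]


def group_pedestrians(trajectories, reference):
    """trajectories[i][t] is the (x, y) location of pedestrian i at time t."""
    n = len(trajectories)
    steps = len(trajectories[0])
    # A pair is "close" only if its distance is within the reference at every time step.
    close = [[True] * n for _ in range(n)]
    for t in range(steps):
        dist = pairwise_distances([traj[t] for traj in trajectories])
        for i in range(n):
            for j in range(n):
                close[i][j] = close[i][j] and dist[i][j] <= reference

    # Assumption: close pairs are merged transitively (union-find).
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if close[i][j]:
                parent[find(j)] = find(i)

    roots = [find(i) for i in range(n)]
    # Relabel the roots as consecutive group indices 0, 1, 2, ...
    order = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return [order[r] for r in roots]


example = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(0.5, 0.0), (1.5, 0.0)],
    [(10.0, 10.0), (11.0, 10.0)],
]
groups = group_pedestrians(example, reference=1.0)
```

In this example the first two pedestrians stay 0.5 apart throughout and share a group index, while the third pedestrian forms its own group.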
When
Specifically, referring to the matrix illustrated in
Meanwhile, the processor may classify the above-described pedestrian group by using the neural network. In one example, the processor may input the pedestrian trajectories 120 of the plurality of pedestrians into a grouping neural network that performs a classification operation, and the grouping neural network may classify the pedestrian group based on a distance between features of the pedestrian trajectories 120.
To this end, the grouping neural network may include a convolutional layer, and extract a feature from the pedestrian trajectories 120 of the plurality of pedestrians through the convolutional layer. Subsequently, the grouping neural network may calculate the distance between the extracted features, and classify pedestrians for which the calculated distance is equal to or less than a reference value into the same pedestrian group.
Specifically, the grouping neural network may calculate a distance between features of pedestrian trajectories 120 for pedestrians of each pair (i, j) according to [Equation 6] below, and define an index γ of a pedestrian set in which the distance between the features is equal to or less than the reference value according to [Equation 7] below, and generate a pedestrian group index G according to [Equation 8] below.
(In
represents the location of the pedestrian at a time t, π represents the reference value, and Gk represents a k-th pedestrian group)
As described above, the grouping neural network may have a structure that generates the index of the pedestrian group discretely. In this case, since the function applied to the grouping neural network is not differentiable, the index of the pedestrian group cannot be learned by a general backpropagation algorithm.
In the present invention, a straight-through estimator (STE) may be used so that the grouping neural network may be a learning target. Specifically, the processor may separate a forward pass and a backward pass of the grouping neural network in a learning process, and for example, in the process of the backward pass, the function applied to the grouping neural network may be approximated in a differentiable form by using a sigmoid function and a temperature coefficient τ of the corresponding function.
Specifically, the processor may calculate a probability Ai,j that the pedestrians of each pair (i, j) will belong to the same pedestrian group according to [Equation 9] below, and update the location of each pedestrian as in [Equation 10] below.
(In Equation 10, X′ represents the updated location of the pedestrian, and <·> represents a detach function of PyTorch or a stop gradient function of Tensorflow)
As in the example, when the function applied to the grouping neural network is converted into a differentiable form, gradient descent for reducing the loss function may be applied to the grouping neural network, and as a result, the parameters (weight and bias) applied to the grouping neural network may be learned.
Specifically, the parameters applied to the convolutional layer constituting the grouping neural network may be learned so that the index of the pedestrian group output from the grouping neural network is approximated to an actual pedestrian group (ground truth (GT)). Additionally, the processor may set the reference value π applied to [Equation 7] described above as a learnable parameter, and in this case, the reference value π may also be learned so that the index of the pedestrian group output from the grouping neural network is approximated to the actual pedestrian group (ground truth (GT)).
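The separation of the forward pass and the backward pass can be illustrated with a small hand-written example. This is a sketch under stated assumptions, not the equations of this specification: the soft membership is assumed to be a sigmoid of (π − distance)/τ, and the gradient is computed by hand instead of through an automatic differentiation framework. The function names are hypothetical.

```python
import math


def hard_same_group(distance, pi):
    """Forward pass: discrete decision, 1.0 if the pair is within the reference value."""
    return 1.0 if distance <= pi else 0.0


def soft_same_group(distance, pi, tau):
    """Assumed differentiable surrogate: sigmoid((pi - distance) / tau)."""
    return 1.0 / (1.0 + math.exp((distance - pi) / tau))


def straight_through(distance, pi, tau):
    """Return the hard forward value together with the surrogate gradient w.r.t. pi."""
    value = hard_same_group(distance, pi)
    a = soft_same_group(distance, pi, tau)
    # Derivative of sigmoid((pi - d) / tau) with respect to pi.
    grad_pi = a * (1.0 - a) / tau
    return value, grad_pi


value, grad_pi = straight_through(distance=0.8, pi=1.0, tau=0.1)
```

The forward value stays discrete, while the non-zero gradient with respect to π is what allows the reference value to be updated by gradient descent. A smaller τ makes the surrogate closer to the hard threshold but concentrates the gradient near distance = π.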
When the pedestrian groups are classified according to the above-described method, the processor may predict the expected trajectory 130 of each pedestrian based on the pedestrian trajectories 120 of the pedestrian groups and the pedestrians in each pedestrian group.
Referring to
The inter-group interaction and the intra-group interaction may be structuralized as the graph data, and the processor may generate first to third graph data in order to structuralize each interaction. Hereinafter, a method for generating each graph data and a method for predicting the pedestrian trajectory through the same will be described in detail.
The processor may generate the first graph data according to the relationship of each pedestrian group in order to structuralize the inter-group interaction (S410). In the present invention, the graph data as data constituted by the node and the edge may be data used as an input into the neural network model to be described below.
Referring to
Specifically, the processor pools the pedestrian trajectory 120 of the pedestrian which belongs to each pedestrian group, i.e., the location for each time to determine a representative location of each pedestrian group and set the representative location as the node. For example, when the processor uses an average pooling, a node Vgroup corresponding to each pedestrian group may be set as in [Equation 11] below.
Subsequently, the processor may set the connection between the representative locations of the respective pedestrian groups as an edge εgroup according to [Equation 12] below.
(In Equations 11 and 12, k represents each pedestrian group)
The processor may generate first graph data Ggroup (hereinafter, referred to as GD1) as in [Equation 13] below according to the set node Vgroup and edge εgroup.
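The construction of the first graph data can be sketched as follows. This is a minimal illustration assuming average pooling, two-dimensional locations, and undirected edges between every pair of groups; the function name `build_group_graph` and the data layout are hypothetical.

```python
from itertools import combinations


def build_group_graph(trajectories, group_index):
    """Build group-level nodes and edges.

    trajectories[i][t] is the (x, y) location of pedestrian i at time t.
    group_index[i] is the pedestrian group index of pedestrian i.
    """
    members = {}
    for ped, k in enumerate(group_index):
        members.setdefault(k, []).append(ped)

    steps = len(trajectories[0])
    nodes = {}
    for k, peds in members.items():
        # Average pooling: the representative location is the mean member location per time step.
        nodes[k] = [
            (
                sum(trajectories[p][t][0] for p in peds) / len(peds),
                sum(trajectories[p][t][1] for p in peds) / len(peds),
            )
            for t in range(steps)
        ]
    # Assumption: one undirected edge per pair of distinct groups.
    edges = list(combinations(sorted(nodes), 2))
    return nodes, edges


example = [
    [(0, 0), (1, 0)],
    [(2, 0), (3, 0)],
    [(10, 10), (11, 10)],
]
nodes, edges = build_group_graph(example, group_index=[0, 0, 1])
```

Three pedestrians in two groups yield only two nodes and one edge, which illustrates how grouping reduces the size of the graph.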
As described above, according to the present invention, an interaction between pedestrian groups is structuralized with data to allow a neural network model to be described below to learn an intrinsic complexity of a social interaction. Further, since the number of nodes may be reduced as each pedestrian group is set as the node, a data biasing problem of the neural network model may be prevented.
Moreover, there is an advantage in that it is possible to flexibly cope with a change in the number of pedestrians in the pedestrian image 100 upon testing the neural network model. For example, even in the case where the neural network model is trained only on pedestrian images 100 including approximately 10 pedestrians, when a pedestrian image 100 including approximately 50 pedestrians is input into the neural network model at a test stage, prediction accuracy similar to the prediction accuracy at the time of training may be exhibited if the approximately 50 pedestrians are classified into approximately 10 pedestrian groups.
Meanwhile, the processor may generate the second graph data according to the relationship of the pedestrians in each pedestrian group in order to structuralize the intra-group interaction (S420).
Referring to
Specifically, the processor may set a time-specific location of the pedestrian in the pedestrian group as a node Vped according to [Equation 14] below, and set each connection between the pedestrians as an edge εmember according to [Equation 15] below.
(Here, K represents all pedestrian groups)
The processor may generate second graph data Gmember (hereinafter, referred to as GD2) as in [Equation 16] below according to the set node Vped and edge εmember.
As described above, according to the present invention, the intra-group interaction is structuralized to prevent the expected trajectories 130 of the pedestrians in the same pedestrian group output from the neural network model to be described below from colliding with each other while maintaining predetermined formations and directions.
Meanwhile, the processor may generate the third graph data according to relationships of all pedestrians in order to structuralize the interaction among all pedestrians (S430).
Referring to
Specifically, the processor may set time-specific locations of all pedestrians in the pedestrian group as the node Vped according to [Equation 14] described above, and set each connection between the pedestrians as the edge εedge according to [Equation 17] below.
The processor may generate third graph data Gped (hereinafter, referred to as GD3) as in [Equation 18] below according to the set node Vped and edge εedge.
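The difference between the edges of the second graph data GD2 and the third graph data GD3 can be sketched as follows. Both graphs share the same pedestrian nodes; only the edge sets differ. The sketch assumes undirected edges without self-connections, and the function names are hypothetical.

```python
from itertools import combinations


def member_edges(group_index):
    """Intra-group edges: connect only pedestrians that share a group index."""
    n = len(group_index)
    return [
        (i, j)
        for i, j in combinations(range(n), 2)
        if group_index[i] == group_index[j]
    ]


def pedestrian_edges(group_index):
    """Edges among all pedestrians, regardless of pedestrian group."""
    return list(combinations(range(len(group_index)), 2))


group_index = [0, 0, 1, 1, 1]
intra = member_edges(group_index)
full = pedestrian_edges(group_index)
```

For five pedestrians split into groups of two and three, the intra-group graph has four edges while the full graph has ten.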
As described above, according to the present invention, in one pedestrian image 100, each of an interaction between the pedestrian groups, an interaction between the pedestrians in the pedestrian group, and an interaction among all pedestrians is structuralized with the graph data to augment data at the time of learning the neural network model to be described below.
When the first to third graph data GD1, GD2, and GD3 are generated, the processor may input the first to third graph data GD1, GD2, and GD3 into the neural network model (S500), and generate the expected trajectory 130 for each of the plurality of pedestrians based on the output of the neural network model (S600).
Here, the neural network model as the neural network using the graph data described above as the input may include, for example, a Graph Convolutional Network (GCN), a Graph Attention Network (GAT), and a Graph Transformer Network (GTN).
The neural network model applied to the present invention may be learned to receive the first to third graph data GD1, GD2, and GD3 as the input, and output the expected trajectories 130 (classes) of all pedestrians, and an expected-trajectory (130)-specific occurrence probability (class-specific probability).
The processor may generate at least one of a plurality of expected trajectories 130 for each pedestrian output from the neural network model as the expected trajectory 130 of the pedestrian. For example, the processor may generate, as the expected trajectory 130, only a predetermined number of trajectories selected in descending order of probability from among the plurality of expected trajectories 130 (classes).
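The selection of a predetermined number of trajectories by probability can be sketched as follows. The function name `select_expected_trajectories` and the string placeholders standing in for trajectories are hypothetical.

```python
def select_expected_trajectories(trajectories, probabilities, count):
    """Keep the `count` trajectories with the highest occurrence probability."""
    ranked = sorted(
        range(len(trajectories)),
        key=lambda n: probabilities[n],
        reverse=True,
    )
    return [trajectories[n] for n in ranked[:count]]


candidates = ["left", "straight", "right"]
probabilities = [0.2, 0.5, 0.3]
selected = select_expected_trajectories(candidates, probabilities, count=2)
```

Setting `count` to 1 keeps only the most probable trajectory, and setting it to the number of candidates keeps all of them.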
Meanwhile, in order to train all attributes included in the first to third graph data GD1, GD2, and GD3, respectively, the neural network model may include first to third graph based neural networks. In this case, the first to third graph based neural networks may include architecture such as the Graph Convolutional Network (GCN), the Graph Attention Network (GAT), the Graph Transformer Network (GTN), etc.
The first to third graph based neural networks may have different architectures, but preferably have the same architecture in order to increase a learning speed of each neural network through sharing parameters (hyperparameter and/or learnable parameter).
The processor may input the first to third graph data GD1, GD2, and GD3 into the first to third graph based neural networks sharing the parameters, respectively. Specifically, the processor may input the first graph data GD1 into the first graph based neural network, input the second graph data GD2 into the second graph based neural network, and input the third graph data into the third graph based neural network.
Subsequently, the processor may generate the expected trajectory 130 for each of the plurality of pedestrians by integrating the outputs of the first to third graph based neural networks.
The integration method may adopt various methods used in the technical field. For example, the processor may perform an element-wise summation or an element-wise product of the outputs of the first to third graph based neural networks. Further, the processor may perform an element-wise average of the outputs of the first to third graph based neural networks or combine respective outputs by using a multi-layer perceptron.
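The element-wise integration options can be sketched as follows. The sketch assumes that the three outputs have already been brought to the same length and are represented as flat lists of numbers; the function name `integrate` is hypothetical, and the multi-layer perceptron option is omitted.

```python
import math


def integrate(outputs, mode="sum"):
    """Combine equally sized feature lists element by element.

    outputs: one feature list per graph based neural network.
    """
    combined = []
    for values in zip(*outputs):
        if mode == "sum":
            combined.append(sum(values))
        elif mode == "product":
            combined.append(math.prod(values))
        elif mode == "average":
            combined.append(sum(values) / len(values))
        else:
            raise ValueError(f"unknown mode: {mode}")
    return combined


outputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
summed = integrate(outputs, "sum")
multiplied = integrate(outputs, "product")
averaged = integrate(outputs, "average")
```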
Meanwhile, the neural network model should output the expected trajectory 130 for each of all pedestrians, but since the first graph data GD1 is set with respect to the pedestrian groups rather than the individual pedestrians, the number of data (the number of pedestrian groups) output from the neural network model for the first graph data GD1 may not coincide with the number of pedestrians.
By considering this, the processor may unpool the output of the neural network model for the first graph data GD1 so that expected trajectories of pedestrians which belong to the same pedestrian group are the same as each other.
Specifically, a feature output from the first graph based neural network may correspond to the expected trajectory 130 of the pedestrian group. The processor may apply the feature corresponding to the pedestrian group to all pedestrians which belong to the corresponding pedestrian group through an unpooling technique so that all pedestrians included in the pedestrian group have the same expected trajectory 130.
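The unpooling step can be sketched as a simple lookup that copies each group-level feature to every member of that group. The function name `unpool` and the dictionary layout are hypothetical.

```python
def unpool(group_features, group_index):
    """Copy each group-level feature to every pedestrian in that group.

    group_features: mapping from group index to the feature of that group.
    group_index[i]: group index of pedestrian i.
    """
    return [group_features[k] for k in group_index]


group_features = {0: [0.1, 0.2], 1: [0.9, 0.8]}
per_pedestrian = unpool(group_features, group_index=[0, 0, 1])
```

After unpooling, the number of features equals the number of pedestrians, so the result can be combined with the outputs for the second and third graph data.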
That is, when
Meanwhile, referring back to
Specifically, the processor may randomly sample the latent vectors according to a random vector determined by a Monte Carlo method or a Quasi-Monte Carlo method. Since each latent vector corresponds to a latent intention of the pedestrian, i.e., the expected trajectory 130, the processor may sample as many latent vectors as the number of expected trajectories 130 to be determined through the neural network model.
Additionally, the processor may sample the latent vector according to the pedestrian group in order to reflect a group feature of the pedestrian group. Specifically, the processor may sample the same latent vector with respect to the pedestrians which belong to the same pedestrian group.
When
When the latent vector is sampled by such a method, the neural network model may learn a social statistical feature that the pedestrians in the same pedestrian group move toward the same destination.
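Group-wise latent sampling can be sketched as follows. This is a minimal illustration that draws one latent vector per pedestrian group and shares it among the members; the use of pseudo-random standard normal draws (plain Monte Carlo), the fixed seed, and the function name `sample_group_latents` are assumptions made for the example.

```python
import random


def sample_group_latents(group_index, dim, seed=0):
    """Sample one latent vector per group and assign it to every member.

    group_index[i]: group index of pedestrian i.
    dim: dimensionality of the latent vector (assumed value).
    """
    rng = random.Random(seed)
    latents = {}
    for k in group_index:
        if k not in latents:
            # Assumption: standard normal draws stand in for the random vector.
            latents[k] = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return [latents[k] for k in group_index]


z = sample_group_latents(group_index=[0, 0, 1], dim=4)
```

Repeating the call with different seeds would yield one set of latent vectors per expected trajectory to be generated.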
Hereinafter, trajectory prediction architecture and an operation process thereof according to an exemplary embodiment of the present invention will be described with reference to
Referring to
The processor pools (20) the pedestrian trajectories 120 of the pedestrians which belong to each pedestrian group to determine a representative location of each pedestrian group, and generate the first graph data GD1 based on the relationship between the representative locations. Further, the processor may generate the second graph data GD2 according to a location relationship of the pedestrians in each pedestrian group, and generate the third graph data GD3 according to a location relationship of individual pedestrians regardless of the pedestrian group.
The neural network model 300 applied to the present invention may include the first to third graph based neural networks (trajectory prediction baseline models), and the processor may input the first graph data GD1 into the first graph based neural network, the second graph data GD2 into the second graph based neural network, and the third graph data GD3 into the third graph based neural network, respectively.
Since the number of data output from the first graph based neural network corresponds to the number of pedestrian groups, the processor may unpool (40) the corresponding output and convert the unpooled output so that the number of data output from the first graph based neural network corresponds to the number of pedestrians, and then input the outputs of the first to third graph based neural networks into a group integration module 50.
The group integration module may integrate the outputs of the first to third graph based neural networks through the method such as the element-wise summation, the element-wise product, the element-wise averaging, a data combination using the multi-layer perceptron, etc., and the processor may generate each pedestrian-wise expected trajectory 130 according to integrated data.
As described above, the present invention is described with reference to the exemplified drawings, but the present invention is not limited by the exemplary embodiments and drawings disclosed in this specification, and it is apparent that various modifications can be made by those skilled in the art without departing from the scope of the technical spirit of the present invention. In addition, even when an action effect according to the configuration of the present invention is not explicitly disclosed and described while describing the exemplary embodiments of the present invention, effects predictable from the corresponding configuration should naturally also be recognized.
Number | Date | Country | Kind |
---|---|---|---|
10-2022-0032099 | Mar 2022 | KR | national |
10-2022-0052202 | Apr 2022 | KR | national |