The present disclosure relates to an image processing apparatus and a program.
In recent years, “360° Video” has drawn attention. Because the entirety of a space can be stored as a video, one can feel a greater sense of immersion and a greater sense of realism, compared to the conventional video. As devices for easily recording and replaying a 360° video have been provided, services relating to the 360° video have appeared, and the market relating to virtual reality has expanded, the 360° video has become more and more important.
On the other hand, hyperlapse and stabilization are required more often for the 360° video than for the ordinary video. Hyperlapse refers to temporally sampling a captured video to obtain a shorter video. Hyperlapse is required, for example, when a video is so long that viewing it at its original length is difficult, or when a video is to be uploaded to a service on a network that limits the length of the video to a predetermined time period.
Further, stabilization refers to correcting blurring, etc., of an image caused when the image is captured. This is a conventionally recognized problem. However, since the sense of immersion is great in the 360° video, some viewers experience a sensation similar to motion sickness if the image has large blurring, and thus stabilization is more strongly required than for conventional videos.
Non-Patent Document 1 discloses a conventional technology for handling these two problems, i.e., hyperlapse and stabilization, regarding conventional videos (not 360° videos). According to the method disclosed in Non-Patent Document 1, upon performing the stabilization, an inter-frame cost is obtained on the basis of the inter-frame homography transformation, etc., and frames evaluated as inappropriate are removed. Further, with respect to the selected frames, a process to crop the common part is performed.
However, the above-mentioned conventional technology cannot be applied to a wide-angle moving image such as a 360° video, etc. (here, the wide-angle moving image refers to an image captured to cover a range wider than the average visual field of the human eye, such as an image with a diagonal angle of view exceeding that of the standard lens, i.e., 46 degrees). The reasons therefor are: first, the inter-frame homography transformation is a transformation between planar images; and second, a wide-angle video such as a 360° video, etc., cannot be obtained by performing partial cropping after the frame selection.
Therefore, there are drawbacks that such conventional technology cannot meet the requirement for hyperlapse and the requirement for stabilization regarding the wide-angle video such as a 360° video, etc.
The present disclosure has been made in view of the above, and one of the objectives is to provide an image processing apparatus and a program capable of meeting the requirement for hyperlapse and the requirement for stabilization regarding a wide-angle video such as a 360° video, etc.
In order to solve the drawbacks of the above conventional example, the present disclosure provides an image processing apparatus which receives and processes moving image data captured while a camera is moved, wherein the image processing apparatus comprises a movement trajectory estimation device which estimates a movement trajectory of the camera, a selection device which selects a plurality of points satisfying a predetermined condition from among the points on the estimated camera movement trajectory, an extraction device which extracts data of images captured at the selected plurality of points, a generation device which generates reconfigured moving image data on the basis of the extracted image data, and an output device which outputs the reconfigured moving image data.
The requirement for hyperlapse and the requirement for stabilization can be met, regarding the wide-angle video such as a 360° video, etc.
An embodiment of the present disclosure will be explained with reference to the drawings. Unlike an ordinary image, in a 360° image, the image capturing range does not vary depending on the rotation of a camera. Thus, when the blurring of the camera is divided into positional blurring and rotational blurring, the rotational blurring can be completely restored if the amount of rotation is provided. In an ordinary video (a video other than the 360° video, hereinafter, referred to as a non-360° video), neither of the blurrings can be restored, and thus, the degree of the blurring is obtained as a cost. However, in the 360° video, it is considered that only the positional blurring should be treated as a cost of camera motion.
When a 360° video and a desired sampling rate are provided, a method for outputting a stabilized 360° video, while satisfying the sampling rate to a certain extent, is as follows.
When a transition cost from the i-th frame to the j-th frame among a plurality of frames (360° images) included in a 360° video is considered, where v represents the designated velocity magnification and the frame selected before the i-th frame is defined as the h-th frame, the cost can be represented by the following formula (1).
$$C(h,i,j,v) = C_m(i,j) + \lambda_s C_s(i,j,v) + \lambda_a C_a(h,i,j) \tag{1}$$
Here, Cm represents a cost by camera motion, Cs represents a cost for violating the provided velocity magnification constraint, and Ca represents a cost for the velocity change. λs and λa represent coefficients for providing weights to respective costs. For Cs and Ca, definitions are the same as those of the conventional method, because the difference of video type has no influence thereon. On the other hand, for Cm, according to the conventional method, a moving amount of the center is calculated on the basis of the inter-frame homography transformation, and the size of the moving amount is treated as a cost, whereas according to the present embodiment, a motion cost using the three-dimensional camera position is used. Specifically, the motion cost is defined as below.
Here, the vector Xk represents a three-dimensional position coordinate of the camera when the k-th frame is captured, the vector X′k represents a three-dimensional position coordinate of an expected position of the camera (preferable camera position), and ∥x∥2 represents Euclidean norm of x.
The preferable camera position can be calculated by a method such as applying Gaussian smoothing to the actual camera positions. Cm obtained by the formula (2) represents the moving amount of the camera in a direction perpendicular to the ideal direction, which is a cost expressing the positional blurring of the camera.
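The formula (2) is not reproduced in this text, so the following Python sketch is only one possible reading of the description above, not the formula of the disclosure. It assumes that the "ideal direction" is the displacement between the preferable positions X′i and X′j, and it returns the length of the part of the actual displacement (Xj−Xi) that is perpendicular to that direction. The function name `motion_cost` and the fallback for a zero-length ideal direction are assumptions made for illustration.

```python
import math

def motion_cost(x_i, x_j, xp_i, xp_j):
    """Length of the part of (x_j - x_i) perpendicular to (xp_j - xp_i).

    x_i, x_j   : actual 3D camera positions at frames i and j
    xp_i, xp_j : preferable (smoothed) 3D camera positions at frames i and j
    Assumed form; the source text does not reproduce formula (2).
    """
    move = [b - a for a, b in zip(x_i, x_j)]
    ideal = [b - a for a, b in zip(xp_i, xp_j)]
    ideal_sq = sum(c * c for c in ideal)
    if ideal_sq == 0.0:
        # Assumption: with no ideal direction, use the full displacement length.
        return math.sqrt(sum(c * c for c in move))
    # Project the actual displacement onto the ideal direction ...
    scale = sum(m * d for m, d in zip(move, ideal)) / ideal_sq
    # ... and keep only the perpendicular remainder.
    perp = [m - scale * d for m, d in zip(move, ideal)]
    return math.sqrt(sum(c * c for c in perp))
```

Under this reading, a displacement exactly along the ideal direction costs nothing, while a sideways deviation is charged in full, which matches the intent of treating only positional blurring as the camera motion cost.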
Next, on the basis of the defined inter-frame cost, a frame path (movement trajectory of the camera at the time of image capturing) is selected by a predetermined method such as Dynamic Programming, so that the selected frame path has the minimum total cost. Thereby, a frame is selected so that, with the selected frame, the camera position becomes smooth while the sampling rate is maintained at a value similar to the provided value.
The frame selection is performed to reduce the positional blurring, but the rotation state of the camera at the time of image capturing is not considered. Therefore, in the present embodiment, a known rotation removing process is applied to the 360° video as a post treatment. An example of the rotation removing process is disclosed in Pathak, Sarthak, et al., A decoupled virtual camera using spherical optical flow, Image Processing (ICIP), 2016 IEEE International Conference on, pp. 4488-4492 (September 2016). In this method, the moment of the optical flow of the 360° video is minimized, to thereby minimize the inter-frame rotation. In the present embodiment, the post treatment is changed from cropping to rotation removing, so as to be applicable to 360° videos.
As exemplified in
The storage unit 12 is a memory device, etc., and stores a program executed by the control unit 11. The program may be provided by being stored in a computer-readable non-transitory storage medium, and may be installed to the storage unit 12. Further, the storage unit 12 may also operate as a work memory of the control unit 11. The input-output unit 13 is, for example, a serial interface, etc., which receives 360° video data to be processed from the camera, stores the received data in the storage unit 12 as data to be processed, and provides the data so as to be processed by the control unit 11.
Operations of the control unit 11 according to the present embodiment will be explained. As exemplified in
The movement trajectory estimation unit 21 estimates a movement trajectory of a camera when the 360° video to be processed is captured. The movement trajectory estimation unit 21 projects the 360° video onto the inner faces of a hexahedral projection plane with its center at the position of the camera, and a planar image projected on the inner face corresponding to the moving direction of the camera (mentioned below), among the inner faces of the hexahedron, is used. According to a process described in, for example, ORB-SLAM (Mur-Artal, Raul, J. M. M. Montiel, and Juan D. Tardos. Orb-slam: a versatile and accurate monocular slam system, IEEE Transactions on Robotics 31.5 (2015): 1147-1163), a camera position coordinate (three-dimensional position coordinate) and a camera posture (a vector representing a direction from the camera position toward the center of the angle of view) are obtained for each of the frames expressing the estimation result of the movement trajectory of the camera. The movement trajectory estimation unit 21 outputs the obtained camera posture information to the generation unit 24.
For example, when a 360° video is captured by a camera having a pair of image capturing elements arranged on the front side and the rear side of the camera body, the three-dimensional position coordinate can be described as a coordinate value in a three-dimensional space of the XYZ orthogonal coordinate system, wherein the origin is a position of the camera at the start of the image capturing, the Z-axis is the moving direction of the camera which is the direction of the center of the image capturing element at the start of the image capturing, the X-axis is in a direction parallel with the floor, and is in the plane of which the normal line is the Z-axis (the plane being one of the faces of the hexahedron, i.e., the projection plane to which ORB-SLAM is applied), and the Y-axis is in the direction perpendicular to X-axis and Z-axis, respectively. The coordinate value of each point on the movement trajectory of the camera may be estimated by a method other than the above-mentioned ORB-SLAM method.
The selection processing unit 22 selects a plurality of points satisfying a predetermined condition from among the points on the estimated camera movement trajectory, using the camera position coordinate information for each frame output from the movement trajectory estimation unit 21. Hereinbelow, Xi (here, X represents a vector value) represents a camera position coordinate when the i-th frame (hereinbelow, the “i” is referred to as a frame number) is captured, and the vector X′k represents a preferable three-dimensional position coordinate of the camera.
According to an example of the present embodiment, the selection processing unit 22 selects frames on the basis of a condition relating to the information of the point position at which each frame is captured (camera position coordinate Xi (i=1, 2, 3 . . . ) at the time of image capturing), and a condition relating to the information of the image capturing time at the relevant point.
Specifically, the selection processing unit 22 obtains a preferable three-dimensional position coordinate X′k of the camera at the k-th frame (k=1, 2, 3 . . . ), on the basis of the position coordinate Xi (i=1, 2, 3 . . . ) of the camera when each frame is captured.
As an example, the selection processing unit 22 calculates a preferable three-dimensional position coordinate X′k of the camera by, for example, applying a smoothing process such as Gaussian smoothing to the values (data series) of the position coordinate Xi (i=1, 2, 3 . . . ). Here, the smoothing method may be Gaussian smoothing or any other widely known method such as obtaining a moving average, etc.
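As a minimal sketch of the smoothing step described above, the following Python function applies a truncated Gaussian kernel along the frame axis to a list of camera positions. The function name `gaussian_smooth`, the default `sigma`, the kernel radius of 3σ, and the renormalization at the ends of the sequence are assumptions chosen for illustration; the disclosure does not specify these parameters.

```python
import math

def gaussian_smooth(positions, sigma=2.0, radius=None):
    """Smooth a list of (x, y, z) camera positions along the frame axis.

    sigma is measured in frames. The kernel is truncated at `radius` frames
    (assumed default: 3 * sigma) and renormalized near the sequence ends.
    """
    if radius is None:
        radius = max(1, int(3 * sigma))
    n = len(positions)
    smoothed = []
    for k in range(n):
        total_w = 0.0
        acc = [0.0, 0.0, 0.0]
        for d in range(-radius, radius + 1):
            idx = k + d
            if idx < 0 or idx >= n:
                continue  # skip samples outside the sequence instead of padding
            w = math.exp(-(d * d) / (2.0 * sigma * sigma))
            total_w += w
            for axis in range(3):
                acc[axis] += w * positions[idx][axis]
        smoothed.append(tuple(c / total_w for c in acc))
    return smoothed
```

For a walking sequence in which the camera sways from side to side, the smoothed series keeps the forward progress while strongly reducing the lateral sway, so it can serve as the preferable position X′k.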
The selection processing unit 22 receives an input of a designated velocity magnification v from a user, and calculates the transition cost from the i-th frame to the j-th frame as follows, using the velocity magnification v. Namely, provided that the frame selected before the i-th frame is the h-th frame, the selection processing unit 22 calculates the transition cost from the i-th frame to the j-th frame by the formula (1).
In the formula (1), Cm is a motion cost as represented by the formula (2). Cs is a speed cost as represented by the formula (3).
$$C_s(i,j,v) = \min\left(\lVert (j-i) - v \rVert_2^2,\ \tau_s\right) \tag{3}$$
In the formula, i and j each represent a frame number, v represents a velocity magnification, τs represents the maximum value of the speed cost, which is determined in advance, and min(a, b) refers to taking the smaller value between a and b (the same hereinafter).
Ca is an acceleration cost as represented by the formula (4).
$$C_a(h,i,j) = \min\left(\lVert (j-i) - (i-h) \rVert_2^2,\ \tau_a\right) \tag{4}$$
In the formula, i, j, and h each represent a frame number, and τa represents the maximum value of the acceleration cost, which is determined in advance. Here, the speed cost and the acceleration cost correspond to the conditions relating to the capturing time information of each frame (such as a difference from the frame number which is supposed to be extracted on the basis of the designated velocity magnification, and the like).
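The speed cost, the acceleration cost, and their combination in the formula (1) can be sketched in Python as follows. Because frame numbers are scalars, the squared norm reduces to an ordinary square. The function names and the default values of `tau_s`, `tau_a`, `lambda_s`, and `lambda_a` are illustrative assumptions; the disclosure only states that the maxima are determined in advance. The motion cost is passed in as a precomputed value `c_m`.

```python
def speed_cost(i, j, v, tau_s=200.0):
    """Formula (3): penalty for a frame step (j - i) that differs from v."""
    return min(((j - i) - v) ** 2, tau_s)

def acceleration_cost(h, i, j, tau_a=200.0):
    """Formula (4): penalty for a change between consecutive frame steps."""
    return min(((j - i) - (i - h)) ** 2, tau_a)

def transition_cost(c_m, h, i, j, v, lambda_s=1.0, lambda_a=1.0,
                    tau_s=200.0, tau_a=200.0):
    """Formula (1): weighted sum of motion, speed, and acceleration costs.

    c_m is the motion cost C_m(i, j), computed elsewhere.
    """
    return (c_m
            + lambda_s * speed_cost(i, j, v, tau_s)
            + lambda_a * acceleration_cost(h, i, j, tau_a))
```

For example, with v = 8, stepping from frame 10 to frame 18 incurs no speed cost, whereas stepping to frame 14 incurs a speed cost of 16; the truncation by τs and τa keeps a single large jump from dominating the total.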
The selection processing unit 22 selects the frames to be extracted, using the obtained transition costs from the i-th frame to the j-th frame. Specifically, let p denote a series of frames selected as the frames to be extracted; when the n-th frame (n=1, 2, . . . , N) in the series p has the frame number t in the entirety of the moving image data to be processed, this is represented as p(n)=t. In the moving image data to be processed, the total cost with the designated velocity magnification v is represented by the formula (5).
Then, the selection processing unit 22 uses the formula (5), and obtains the frame series of the formula (6).
$$p_v = \operatorname*{arg\,min}_{p}\ \phi(p, v) \tag{6}$$
As for this frame selection method based on the cost, Dynamic Programming may be used, similar to the method in Non-Patent Document 1. Thus, detailed explanation therefor is omitted here.
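Although the detailed explanation is omitted, a rough Python sketch of such a Dynamic Programming selection may help. It assumes that the total cost is the sum of the transition costs of the formula (1) along the selected series, since the formula (5) is not reproduced in this text. The function name `select_frames`, the search window of 2v frames, the start and end margin of v frames, and the default weights are all assumptions made for illustration; `motion_cost` is a caller-supplied function returning Cm(i, j), and v is assumed to be a positive integer.

```python
def select_frames(num_frames, v, motion_cost, lambda_s=1.0, lambda_a=1.0,
                  tau_s=200.0, tau_a=200.0):
    """Return frame numbers of a minimum-cost path (assumed total: sum of formula (1))."""
    window = 2 * v  # assumed largest step between two selected frames
    gap = v         # assumed margin: a path may start/end within v frames of the ends
    inf = float("inf")
    cost = {}  # (i, j) -> best accumulated cost of a path whose last step is i -> j
    prev = {}  # (i, j) -> frame h selected before i on that path (None at the start)
    for i in range(num_frames):
        for j in range(i + 1, min(i + window, num_frames - 1) + 1):
            step = motion_cost(i, j) + lambda_s * min(((j - i) - v) ** 2, tau_s)
            # A path may begin at i only if i lies in the starting margin.
            best = step if i < gap else inf
            best_h = None
            for h in range(max(0, i - window), i):
                if (h, i) not in cost:
                    continue
                accel = min(((j - i) - (i - h)) ** 2, tau_a)
                cand = cost[(h, i)] + step + lambda_a * accel
                if cand < best:
                    best, best_h = cand, h
            if best < inf:
                cost[(i, j)] = best
                prev[(i, j)] = best_h
    # Pick the cheapest path that ends inside the final margin, then backtrack.
    ends = [(c, ij) for ij, c in cost.items() if ij[1] >= num_frames - gap]
    if not ends:
        return list(range(num_frames))
    _, (i, j) = min(ends)
    path = [j, i]
    while prev[(i, j)] is not None:
        h = prev[(i, j)]
        path.append(h)
        i, j = h, i
    return path[::-1]
```

The state is the pair (i, j) rather than a single frame because the acceleration cost depends on three consecutive selected frames. With a motion cost that is zero everywhere, the sketch returns frames at a constant interval of v; when some frames carry a large motion cost, the path shifts to avoid them while keeping the interval close to v.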
The extraction processing unit 23 extracts the frames selected by the selection processing unit 22, from the moving image data to be processed. Namely, the extraction processing unit 23 extracts, from the received moving image data, image data of the frames captured at a plurality of points, which are selected by the selection processing unit 22 so as to be close to the ideal positions, and so as not to largely violate the velocity magnification constraint.
The generation unit 24 generates timelapse moving image data by arranging (reconfiguring) the image data extracted by the extraction processing unit 23 in the order of extraction (in the ascending order of the frame number in the moving image data to be processed). Further, with respect to each piece of the image data extracted by the extraction processing unit 23, the generation unit 24 may estimate the camera posture when the relevant image data is captured, modify the image data on the basis of the information of the estimated posture, and generate reconfigured moving image data using the modified image data.
Specifically, the generation unit 24 receives information representing the camera posture (vector representing a direction from the camera position toward the center of the angle of view) from the movement trajectory estimation unit 21. When the i-th frame is extracted from the moving image data to be processed, and the frame number j is the next greater frame number of i, the image of the i-th frame is modified so that the center of the i-th frame image is located in the direction of the movement vector (Xj−Xi) from the i-th frame to the j-th frame. Namely, using the vector V toward the center of the angle of view represented by the information of the camera posture when the i-th frame was actually captured, and the above-mentioned movement vector (Xj−Xi), the three-dimensional rotational correction, by the difference (Xj−Xi)−V, is applied to the extracted i-th frame image. The rotational correction process is widely known, and the detailed explanation therefor is omitted here.
According to an example of the present embodiment, the moving image data does not have to be a 360° image, but may be an image of comparatively wide-angle. If this is the case, after the rotational correction process, the finally output angle of view size (which can be previously designated) may include the image-uncaptured range. In this case, the image data may be cropped so that the image-uncaptured range is not included and the image data is output with the cropped angle of view, or the image-uncaptured range may be set to be pixels of a predetermined color (for example, black), which is subjected to the subsequent process.
The output unit 25 outputs the moving image data generated by the generation unit 24 through reconfiguration, to a display, etc. The output unit 25 externally transmits the generated moving image data through a network, etc.
The present embodiment has the above structure, and operates as follows. In the following example, the input moving image data to be processed is moving image data captured while the camera is moved along a path (for example, moving image data captured during walking), the outline of the path being two-dimensionally shown in
Using the moving image data (here, 360° video) captured along the above-mentioned path as moving image data to be processed, the image processing apparatus 1 processes the moving image data by ORB-SLAM, etc., and obtains a camera position coordinate (three-dimensional position coordinate), and a camera posture (a vector representing a direction from the camera position toward the center of the angle of view) for each of the frames representing the estimation result of the camera movement trajectory, as exemplified in
Then, the image processing apparatus 1 uses the information of the camera position coordinate obtained for each frame to select a plurality of points which satisfy a predetermined condition, from the points on the estimated camera movement trajectory. In this example, first, the image processing apparatus 1 applies Gaussian smoothing to the position coordinate Xi (i=1, 2, 3 . . . ) of the camera when each frame was captured, and obtains a preferable three-dimensional position coordinate X′k of the camera at the k-th frame (k=1, 2, 3 . . . ) (S2).
Then, using the information of the velocity magnification v received from a user, the image processing apparatus 1 calculates the transition cost from the i-th frame to the j-th frame, by the formula (1). For the formula (1), the motion cost Cm representing the deviation amount in the translational direction from the preferable camera position obtained as a preferable path, the speed cost Cs reflecting the deviation from the frame which is supposed to be selected based on the velocity magnification, and the acceleration cost Ca, are obtained from the formula (2) to the formula (4) (S3).
The image processing apparatus 1 selects a frame combination (frame series) having the minimum transition cost in total, from possible combinations of the frames to be selected, regards the frames included in the obtained frame series as selected frames, and obtains frame number information specifying the selected frames (for example, frames indicated as (X) in
The image processing apparatus 1 extracts the frames specified by the frame numbers obtained in the above process, from the frames included in the moving image data to be processed (S5). Then, with respect to the image data of each extracted frame, the image processing apparatus 1 applies the three-dimensional rotational correction, using the information expressing the camera posture (a vector representing a direction from the camera position toward the center of the angle of view) (S6), so that the moving direction (here, the transition direction between the selected frames) matches the direction toward the center of the angle of view.
The image processing apparatus 1 arranges the pieces of the image data after the correction in ascending order of the frame number, and generates and outputs the reconfigured moving image data (S7).
According to the present embodiment, for example, if frames are selected from the 20 frames shown in
On the other hand, according to an example of the present embodiment, frames which are comparatively close to the result of the smoothing process obtained on the basis of the image capturing positions of the frames, are selected. Therefore, the intervals between the image capturing times of the selected frames are not always constant, and, for example, the frames indicated as (X) in
As described above, according to the present embodiment, with respect to a wide-angle video such as a 360° video, etc., the requirements for hyperlapse and the requirements for stabilization can be met at the same time.
In the above explanation of the present embodiment, the position and the posture of the camera when each frame in the moving image data to be processed is captured are estimated from the captured image data by a method such as ORB-SLAM, etc. However, the present embodiment is not limited thereto. For example, if the camera has a built-in gyroscope or GPS, or if information from a position recording apparatus which moves together with the image processing apparatus 1 can be obtained, the image processing apparatus 1 can receive the input of the information measured and recorded by the gyroscope or GPS, or the information recorded by the position recording apparatus, and obtain the position or posture of the camera when each frame is captured, by using the input information.
In the above example of the present embodiment, the moving image data to be processed is received from the camera connected to the input-output unit 13. However, the camera itself can function as the image processing apparatus 1. In this case, the CPU, etc., provided in the camera functions as the control unit 11, and the above processes are executed on the moving image data captured by the camera itself.
Using the image processing apparatus 1 according to the present embodiment, actually captured moving image data was processed, and an evaluation of the results is shown below. In the following evaluation, an amount showing the size of oscillation caused by the camera movement is obtained.
In the formula (7), xi (i=1, 2, . . . ) represents the camera position coordinate at the i-th frame included in the moving image data to be output, and the angle θi between the vector from xi−1 to xi and the vector from xi to xi+1 is represented as below.
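The angle between two consecutive displacement vectors can be computed from their normalized inner product. The following Python sketch uses this standard definition and returns the angle θi in radians for every interior point of a trajectory. It does not reproduce the formula (7), which is not included in this text; the function name `turning_angles` and the treatment of a zero-length displacement as a zero angle are assumptions.

```python
import math

def turning_angles(points):
    """Angles (radians) between consecutive displacement vectors of a 3D trajectory."""
    angles = []
    for prev_pt, pt, next_pt in zip(points, points[1:], points[2:]):
        u = [b - a for a, b in zip(prev_pt, pt)]   # vector from x_{i-1} to x_i
        w = [b - a for a, b in zip(pt, next_pt)]   # vector from x_i to x_{i+1}
        nu = math.sqrt(sum(c * c for c in u))
        nw = math.sqrt(sum(c * c for c in w))
        if nu == 0.0 or nw == 0.0:
            angles.append(0.0)  # assumption: no movement means no turn
            continue
        cos_t = sum(p * q for p, q in zip(u, w)) / (nu * nw)
        # Clamp against rounding error before taking the arccosine.
        angles.append(math.acos(max(-1.0, min(1.0, cos_t))))
    return angles
```

A trajectory that proceeds in a straight line yields angles of zero, while a trajectory that oscillates from side to side yields large angles, so a statistic of these angles can indicate the size of oscillation in the output video.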
As exemplified in
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2017/045524 | 12/19/2017 | WO | 00 |
Number | Date | Country | |
---|---|---|---|
62436467 | Dec 2016 | US |