VIEWPOINT PATH STABILIZATION

Information

  • Patent Application
  • Publication Number
    20220406003
  • Date Filed
    October 15, 2021
  • Date Published
    December 22, 2022
Abstract
Three-dimensional points may be projected onto first locations in a first image of an object captured from a first position in three-dimensional space relative to the object, and onto second locations associated with a virtual camera position located at a second position in three-dimensional space relative to the object. First transformations linking the first and second locations may then be determined. Second transformations transforming first coordinates for the first image to second coordinates for a second image may be determined based on the first transformations. Based on these second transformations and on the first image, the second image of the object may be generated from the virtual camera position.
Description
TECHNICAL FIELD

The present disclosure relates generally to the processing of image data.


DESCRIPTION OF RELATED ART

Images are frequently captured via a handheld device such as a mobile phone. For example, a user may capture images of an object such as a vehicle by walking around the object and capturing a sequence of images or a video. However, such image data is subject to significant distortions. For instance, the images may not be captured in an entirely closed loop, or the camera's path through space may include vertical movement in addition to the rotation around the object. To provide for enhanced presentation of the image data, improved techniques for viewpoint path modeling are desired.


BRIEF SUMMARY

According to various embodiments, techniques and mechanisms described herein provide for methods, computer-readable media having instructions stored thereon for performing methods, and/or various systems and devices capable of performing methods related to processing image data.


In one aspect, a method includes projecting via a processor a plurality of three-dimensional points onto first locations in a first image of an object captured from a first position in three-dimensional space relative to the object, projecting via the processor the plurality of three-dimensional points onto second locations associated with a virtual camera position located at a second position in three-dimensional space relative to the object, determining via the processor a first plurality of transformations linking the first locations with the second locations, determining, based on the first plurality of transformations, a second plurality of transformations transforming first coordinates for the first image of the object to second coordinates for a second image of the object, and generating via the processor the second image of the object from the virtual camera position based on the first image of the object and the second plurality of transformations.


The first coordinates may correspond to a first two-dimensional mesh overlain on the first image of the object, and the second coordinates may correspond to a second two-dimensional mesh overlain on the second image of the object. The first image of the object may be one of a first plurality of images captured by a camera moving along an input path through space around the object, and the second image may be one of a second plurality of images generated at respective virtual camera positions relative to the object. The plurality of three-dimensional points may be determined at least in part via motion data captured from an inertial measurement unit at a mobile computing device. The second plurality of transformations may be generated via a neural network.


The method may also include generating a multiview interactive digital media representation (MVIDMR) that includes the second set of images, the MVIDMR being navigable in one or more dimensions. The second image of the object may be generated via a neural network. The processor may be located within a mobile computing device that includes a camera, and the first image may be captured by the camera.


The processor may be located within a mobile computing device that includes a camera which captured the first image. The plurality of three-dimensional points may be determined at least in part based on depth sensor data captured from a depth sensor. The method may also include determining a smoothed path through space around the object based on the input path, and determining the virtual camera position based on the smoothed path. The motion data may include data selected from the group consisting of: accelerometer data, gyroscopic data, and global positioning system (GPS) data. The first plurality of transformations may be provided as reprojection constraints to the neural network. The neural network may include one or more similarity constraints that penalize deformation of the first two-dimensional mesh via the second plurality of transformations.


Other technical features may be readily apparent to one skilled in the art from the following figures, descriptions, and claims.





BRIEF DESCRIPTION OF THE DRAWINGS

The included drawings are for illustrative purposes and serve only to provide examples of possible structures and operations for the disclosed inventive systems, apparatus, methods and computer program products for image processing. These drawings in no way limit any changes in form and detail that may be made by one skilled in the art without departing from the spirit and scope of the disclosed implementations.



FIG. 1 illustrates an overview method for viewpoint path modeling, performed in accordance with one or more embodiments.



FIG. 2A, FIG. 2B, FIG. 2C illustrate examples of viewpoint path modeling diagrams, generated in accordance with one or more embodiments.



FIG. 3 illustrates one example of a method for translational viewpoint path determination, performed in accordance with one or more embodiments.



FIG. 4A and FIG. 4B illustrate examples of viewpoint path modeling diagrams, generated in accordance with one or more embodiments.



FIG. 5 illustrates one example of a method for rotational position path modeling, performed in accordance with one or more embodiments.



FIG. 6 illustrates a particular example of a computer system configured in accordance with various embodiments.



FIG. 7A, FIG. 7B, FIG. 7C, FIG. 7D illustrate examples of viewpoint path modeling diagrams, generated in accordance with one or more embodiments.



FIG. 8 illustrates one example of a method for image view transformation, performed in accordance with one or more embodiments.



FIG. 9 illustrates a diagram of real and virtual camera positions along a path around an object, generated in accordance with one or more embodiments.



FIG. 10 illustrates a method for generating a novel image, performed in accordance with one or more embodiments.



FIG. 11 illustrates a diagram of a side view image of an object, generated in accordance with one or more embodiments.



FIG. 12 illustrates a method for generating an MVIDMR, performed in accordance with one or more embodiments.



FIG. 13 shows an example of a MVIDMR Acquisition System, configured in accordance with one or more embodiments.



FIG. 14 illustrates an example of a process flow for capturing images in a MVIDMR using augmented reality, performed in accordance with one or more embodiments.



FIG. 15 illustrates an example of a process flow for creating an MVIDMR, performed in accordance with one or more embodiments.



FIG. 16A and FIG. 16B illustrate aspects of generating an Augmented Reality (AR) image capture track for capturing images used in a MVIDMR, performed in accordance with one or more embodiments.





DETAILED DESCRIPTION

Techniques and mechanisms described herein provide for viewpoint path modeling and image transformation. A set of images may be captured by a camera as the camera moves along a path through space around an object. Then, a smoothed function (e.g., a polynomial) may be fitted to the translational and/or rotational motion in space. For example, positions in a Cartesian coordinate space may be determined for the images. The positions may then be transformed to a polar coordinate space, in which a trajectory along the points may be determined, and the trajectory transformed back into the Cartesian space. Similarly, the rotational motion of the images may be smoothed, for instance by minimizing a loss function. Finally, one or more images may be transformed to more closely align a viewpoint of the image with the fitted translational and/or rotational positions.


According to various embodiments, images are often captured by handheld cameras, such as cameras on a mobile phone. For instance, a camera may capture a sequence of images of an object as the camera moves along a path around the object. However, such image sequences are subject to considerable noise and variation. For example, the camera may move vertically as it traverses the path. As another example, the camera may traverse a 360-degree path around the object but end the path at a position nearer to or further away from the object than at the beginning of the path.




FIG. 1 illustrates an overview method 100 for viewpoint path modeling, performed in accordance with one or more embodiments. According to various embodiments, the method 100 may be performed on a mobile computing device that captures images along a path. Alternatively, the method 100 may be performed on a different computing device, such as a remote server to which data from a mobile computing device is transmitted.


A set of images captured along a path through space is identified at 102. According to various embodiments, the images may be captured by a mobile computing device such as a digital camera or a mobile phone. The images may be still images or frames extracted from a video.


In some embodiments, additional data may be captured by the mobile computing device beyond the image data. For example, motion data from an inertial measurement unit may be captured. As another example, depth sensor data from one or more depth sensors located at the mobile computing device may be captured.


A smoothed trajectory is determined at 104 based on the set of images. According to various embodiments, determining the smoothed trajectory may involve determining a trajectory for the translational position of the images. For example, the smoothed trajectory may be determined by identifying Cartesian coordinates for the images in a Cartesian coordinate space, and then transforming those coordinates to a polar coordinate space. A smoothed trajectory may then be determined in the polar coordinate space, and finally transformed back to a Cartesian coordinate space. Additional details regarding trajectory modeling are discussed throughout the application, and particularly with respect to the method 300 shown in FIG. 3.


In some implementations, determining the smoothed trajectory may involve determining a trajectory for the rotational position of the images. For example, a loss function including parameters such as the change in rotational position from an original image and/or a previous image may be specified. Updated rotational positions may then be determined by minimizing the loss function. Additional details regarding rotational position modeling are discussed throughout the application, and particularly with respect to the method 500 shown in FIG. 5.


One or more images are transformed at 106 to fit the smoothed trajectory. According to various embodiments, images captured from locations that are not along the smoothed trajectory may be altered by any of a variety of techniques so that the transformed images appear to be captured from positions closer to the smoothed trajectory. Additional details regarding image transformation are discussed throughout the application, and more specifically with respect to the method 800 shown in FIG. 8.



FIG. 2A, FIG. 2B, and FIG. 2C illustrate examples of viewpoint path modeling diagrams, generated in accordance with one or more embodiments. In FIG. 2A, the points 202 show top-down Cartesian coordinates associated with images captured along a path through space. The points 204 show a trajectory fitted to the points as a circle using conventional trajectory modeling techniques. Because the fitted trajectory is circular, it is necessarily located relatively far from many of the points 202.



FIG. 2B shows a trajectory 206 fitted in accordance with techniques and mechanisms described herein. The trajectory 206 is fitted using a 1st order polynomial function after transformation to polar coordinate space, and then projected back into Cartesian coordinate space. Because a better center point is chosen, the trajectory 206 provides a better fit for the points 202.



FIG. 2C shows a trajectory 208 fitted in accordance with techniques and mechanisms described herein. The trajectory 208 is fitted using a 6th order polynomial function after transformation to polar coordinate space, and then projected back into Cartesian coordinate space. Because the circular constraint is relaxed and the points 202 fitted with a higher order polynomial, the trajectory 208 provides an even better fit for the points 202.



FIG. 3 illustrates one example of a method 300 for viewpoint path determination, performed in accordance with one or more embodiments. According to various embodiments, the method 300 may be performed on a mobile computing device that captures images along a path. Alternatively, the method 300 may be performed on a different computing device, such as a remote server to which data from a mobile computing device is transmitted. The method 300 will be explained partially in reference to FIG. 4A and FIG. 4B, which illustrate examples of viewpoint path modeling diagrams generated in accordance with one or more embodiments.


A request to determine a smoothed trajectory for a set of images is received at 302. According to various embodiments, the request may be received as part of a procedure for generating a multiview interactive digital media representation (MVIDMR). Alternatively, the request may be generated independently. For instance, a user may provide user input indicating a desire to transform images to fit a smoothed trajectory.


In particular embodiments, the set of images may be selected from a larger group of images. For instance, images may be selected so as to be relatively uniformly spaced. Such selection may involve, for example, analyzing location or timing data associated with the collection of the images. As another example, such selection may be performed after operation 304 and/or operation 306.


Location data associated with the set of images is determined at 304. The location data is employed at 306 to determine Cartesian coordinates for the images. The Cartesian coordinates may identify, in a virtual Cartesian coordinate space, a location at which some or all of the images were captured. An example of a set of Cartesian coordinates is shown at 402 in FIG. 4A.


According to various embodiments, the location data may be determined by one or more of a variety of suitable techniques. In some embodiments, the contents of the images may be modeled to estimate a pose relative to an object for each of the images. Such modeling may be based on identifying tracking points that occur in successive images, for use in estimating a change in position of the camera between the successive images. From this modeling, an estimated location in Cartesian coordinate space may be determined.


In some embodiments, location data may be determined at least in part based on motion data. For instance, motion data such as data collected from an inertial measurement unit (IMU) located at the computing device may be used to estimate the locations at which various images were captured. Motion data may include, but is not limited to, data collected from an accelerometer, gyroscope, and/or global positioning system (GPS) unit. Motion data may be analyzed to estimate a relative change in position from one image to the next. For instance, gyroscopic data may be used to estimate rotational motion while accelerometer data may be used to estimate translation in Cartesian coordinate space.
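

As a concrete illustration of how such motion data might be combined, the following sketch integrates a short run of IMU samples into a coarse relative motion estimate between two frames. It is illustrative only: the variable names are hypothetical, the gyroscope readings are assumed to be in radians per second, and the accelerometer readings are assumed to be gravity-compensated.

    import numpy as np

    def integrate_imu(gyro, accel, dt):
        """Coarse dead-reckoning of relative motion between two frames.

        gyro:  (N, 3) angular velocities in rad/s
        accel: (N, 3) gravity-compensated accelerations in m/s^2
        dt:    sample interval in seconds
        """
        rotation = gyro.sum(axis=0) * dt             # small-angle rotation estimate
        velocity = np.cumsum(accel, axis=0) * dt     # acceleration -> velocity
        translation = velocity.sum(axis=0) * dt      # velocity -> displacement
        return rotation, translation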


In some embodiments, location data may be determined at least in part based on depth sensor information captured from a depth sensor located at the computing device. The depth sensor information may indicate, for a particular image, a distance from the depth sensor to one or more elements in the image. When the image includes an object, such as a vehicle, the depth sensor information may provide a distance from the camera to one or more portions of the vehicle. This information may be used to help determine the location at which the image was captured in Cartesian coordinate space.


In particular embodiments, the location data may be specified in up to six degrees of freedom. The camera may be located in three-dimensional space with a set of Cartesian coordinates. The camera may also be oriented with a set of rotational coordinates specifying one or more of pitch, yaw, and roll. In particular embodiments, the camera may be assumed to be located along a relatively stable vertical level as the camera moves along the path.


A focal point associated with the original path is determined at 308. According to various embodiments, if the original path moves along an arc or loop, the focal point may be identified as a point close to the center of the arc or loop. Alternatively, the focal point may be identified as being located at the center of an object, for instance if all or most of the images feature the object.


According to various embodiments, any of a variety of techniques may be used to determine the focal point. For example, the focal point may be identified by averaging the locations in space associated with the set of images. As another example, the focal point may be determined by minimizing the sum of squares of the intersection of the axes extending from the camera perspectives associated with the images.
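

One possible least-squares formulation of the second approach is sketched below. It assumes camera centers and unit viewing directions have already been estimated from the pose data (the names are hypothetical), and it solves for the point minimizing the summed squared distance to the cameras' optical axes:

    import numpy as np

    def estimate_focal_point(centers, directions):
        """Point minimizing total squared distance to the cameras' optical axes.

        centers:    (N, 3) estimated camera positions
        directions: (N, 3) unit viewing directions
        """
        A = np.zeros((3, 3))
        b = np.zeros(3)
        for c, d in zip(centers, directions):
            P = np.eye(3) - np.outer(d, d)   # projector onto the plane normal to d
            A += P
            b += P @ c
        return np.linalg.solve(A, b)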


In particular embodiments, the focal point may be determined based on a sliding window. For instance, the focal point for a designated image may be determined by averaging the intersection of the axes for the designated image and other images proximate to the designated image.


In some implementations, the focal point may be determined by analyzing the location data associated with the images. For instance, the location and orientation associated with the images may be analyzed to estimate a central point at which different images are focused. Such a point may be identified based on, for instance, the approximate intersection of vectors extending from the identified locations along the direction the camera is estimated to be facing.


In particular embodiments, a focal point may be determined based on one or more inferences about user intent. For example, a deep learning or machine learning model may be trained to identify a user's intended focal point based on the input data.


In some embodiments, potentially more than one focal point may be used. For example, the focal direction of images captured as the camera moves around a relatively large object such as a vehicle may change along the path. In such a situation, a number of local focal points may be determined to reflect the local perspective along a particular portion of the path. As another example, a single path may move through space in a complex way, for instance capturing arcs of images around multiple objects. In such a situation, the path may be divided into portions, with different portions being assigned different focal points.


A two-dimensional plane for the set of images is determined at 310. According to various embodiments, the two-dimensional plane may be determined by fitting a plane to the Cartesian coordinates associated with the location data at 306. For instance, a sum of squares model may be used to fit such a plane.
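

A least-squares plane fit of this kind can be sketched with a singular value decomposition of the centered camera positions; the plane normal is the direction of least variance. The function below is illustrative only:

    import numpy as np

    def fit_plane(points):
        """Fit a plane to (N, 3) camera positions; returns (centroid, unit normal)."""
        centroid = points.mean(axis=0)
        _, _, vt = np.linalg.svd(points - centroid)
        normal = vt[-1]    # right singular vector with the smallest singular value
        return centroid, normal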


The identified points are transformed from Cartesian coordinates to polar coordinates at 312. According to various embodiments, the transformation may involve determining for each of the points a distance from the relevant focal point and an angular value indicating a degree of rotation around the object. An example of locations that have been transformed to polar coordinates is shown at 404 in FIG. 4B.
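

Expressed in the fitted plane, the conversion might look like the following sketch, where each in-plane position is described by an angle around the focal point and a radial distance from it (names are illustrative):

    import numpy as np

    def to_polar(xy, focal_xy):
        """Convert in-plane (N, 2) positions to (theta, r) about the focal point."""
        offsets = xy - focal_xy
        theta = np.arctan2(offsets[:, 1], offsets[:, 0])   # rotation around the object
        theta = np.unwrap(theta)                           # keep the angle monotonic along the path
        r = np.hypot(offsets[:, 0], offsets[:, 1])         # distance from the focal point
        return theta, r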


A determination is made at 314 as to whether to fit a closed loop around the object. In some implementations, the determination may be made based at least in part on user input. For instance, a user may provide an indication as to whether to fit a closed loop. Alternatively, or additionally, the determination as to whether to fit a closed loop may be made at least in part automatically. For example, a closed loop may be fitted if the path is determined to end in a location near where it began. As another example, a closed loop may be fitted if it is determined that the path includes nearly 360 degrees or more of rotation around the object. As yet another example, a closed loop may be fitted if one portion of the path is determined to overlap or nearly overlap with an earlier portion of the same path.


If it is determined to fit a closed loop, the projected data points for closing the loop are determined at 316. According to various embodiments, the projected data points may be determined in any of a variety of ways. For example, points may be copied from the beginning of the loop to the end of the loop, with a constraint added that the smoothed trajectory pass through the added points. As another example, a set of additional points that lead from the endpoint of the path to the beginning point of the path may be added.


A trajectory through the identified points in polar coordinates is determined at 318. According to various embodiments, the trajectory may be determined by any of a variety of curve-fitting tools. For example, a polynomial curve of a designated order may be fit to the points. An example of a smoothed trajectory determined in polar coordinate space is shown at 406 in FIG. 4B.


In some embodiments, the order of a polynomial curve may be strategically determined based on characteristics such as computation resources, fitting time, and the location data. For instance, higher order polynomial curves may provide a better fit but require greater computational resources and/or fitting time.


In some implementations, the order of a polynomial curve may be determined automatically. For instance, the order may be increased until one or more threshold conditions are met. For example, the order may be increased until the change in the fitted curve between successive polynomial orders falls beneath a designated threshold value. As another example, the order may be increased until the time required to fit the polynomial curve exceeds a designated threshold.
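

A minimal sketch of such an automatic order search, assuming the polar coordinates above and a simple stopping rule based on the change between successive orders:

    import numpy as np

    def fit_radius_curve(theta, r, max_order=8, tol=1e-3):
        """Fit r as a polynomial in theta, raising the order until the fit stabilizes."""
        prev_fit = None
        for order in range(1, max_order + 1):
            coeffs = np.polyfit(theta, r, order)
            fit = np.polyval(coeffs, theta)
            if prev_fit is not None and np.max(np.abs(fit - prev_fit)) < tol:
                break
            prev_fit = fit
        return coeffs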


The smoothed trajectory in polar coordinate space is transformed to Cartesian coordinates at 320. According to various embodiments, the transformation performed at 320 may apply in reverse the same type of transformation performed at 312. Alternatively, a different type of transformation may be used. For example, numerical approximation may be used to determine a number of points along the smoothed trajectory in Cartesian coordinate space. As another example, the polynomial function itself may be analytically transformed from polar to Cartesian coordinates. Because the polynomial function, when transformed to Cartesian coordinate space, may have more than one y-axis value that corresponds with a designated x-axis value, the polynomial function may be transformed into a piecewise Cartesian coordinate space function. An example of a smoothed trajectory converted to Cartesian coordinate space is shown at 408 in FIG. 4A. In FIG. 4A and FIG. 4B, a closed loop has been fitted by copying locations associated with images captured near the beginning of the path to virtual data points located near the end of the path, with a constraint that the curve start and end at these points.
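

The numerical-approximation option can be sketched as sampling the fitted curve densely in angle and mapping each sample back onto the plane; because the radius is expressed as a function of the angle, no piecewise handling is needed in this sampled form (illustrative names):

    import numpy as np

    def sample_trajectory(coeffs, focal_xy, theta_min, theta_max, n=360):
        """Sample the fitted polar curve and convert the samples back to Cartesian."""
        theta = np.linspace(theta_min, theta_max, n)
        r = np.polyval(coeffs, theta)
        x = focal_xy[0] + r * np.cos(theta)
        y = focal_xy[1] + r * np.sin(theta)
        return np.stack([x, y], axis=1)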


The trajectory is stored at 322. According to various embodiments, storing the trajectory may involve storing one or more values in a storage unit on the computing device. Alternatively, or additionally, the trajectory may be stored in memory. In either case, the stored trajectory may be used to perform image transformation, as discussed in additional detail with respect to the method 800 shown in FIG. 8.



FIG. 5 illustrates one example of a method 500 for rotational position path modeling, performed in accordance with one or more embodiments. According to various embodiments, the method 500 may be performed on a mobile computing device that captures images along a path. Alternatively, the method 500 may be performed on a different computing device, such as a remote server to which data from a mobile computing device is transmitted.


A request to determine a rotational position path for a set of images is received at 502. In some implementations, the request may be generated automatically after updated translational positions are determined for the set of images. For instance, the request may be generated after the completion of the method 300 shown in FIG. 3. Alternatively, or additionally, one or more operations shown in FIG. 5 may be performed concurrently with the determination of updated translational positions. For instance, updated rotational and/or translational positions may be determined within the same optimization function.


Original rotational positions for the set of images are identified at 504. According to various embodiments, each original rotational position may be specified in two-dimensional or three-dimensional space. For example, a rotational position may be specified as a two-dimensional vector on a plane. As another example, a rotational position may be specified as a three-dimensional vector in a Cartesian coordinate space. As yet another example, a rotational position may be specified as having values for pitch, roll, and yaw.


In some implementations, the original rotational positions may be specified as discussed with respect to the translational positions. For instance, information such as IMU data, visual image data, and depth sensor information may be analyzed to determine a rotational position for each image in a set of images. As one example, IMU data may be used to estimate a change in rotational position from one image to the next.


An optimization function for identifying a set of updated rotational positions is determined at 508. According to various embodiments, the optimization function may be determined at least in part by specifying one or more loss functions. For example, one loss function may identify a difference between an image's original rotational position and the image's updated rotational position. Thus, more severe rotational position changes from the image's original rotational position may be penalized. As another example, another loss function may identify a difference between a previous image's updated rotational position and the focal image's updated rotational position along a sequence of images. Thus, more severe rotational position changes from one image to the next may be penalized.


In some implementations, the optimization function may be determined at least in part by specifying a functional form for combining one or more loss functions. For example, the functional form may include a weighting of different loss functions. For instance, a loss function identifying a difference between a previous image's updated rotational position and the focal image's updated rotational position may be assigned a first weighting value, while a loss function identifying a difference between an image's original rotational position and the image's updated rotational position may be assigned a second weighting value. As another example, the functional form may include an operator such as squaring one or more of the loss functions. Accordingly, larger deviations may be penalized at a proportionally greater degree than smaller changes.


The optimization function is evaluated at 510 to identify the set of updated rotational positions. According to various embodiments, evaluating the optimization function may involve applying a numerical solving procedure to the optimization function determined at 508. The numerical solving procedure may identify an acceptable, but not necessarily optimal, solution. The solution may indicate, for some or all of the images, an updated rotational position in accordance with the optimization function.
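

One way such an optimization function could be written is sketched below, with rotations simplified to yaw angles, two squared loss terms, hypothetical weights, and a generic numerical solver standing in for whatever procedure is actually used:

    import numpy as np
    from scipy.optimize import minimize   # any numerical solver could be substituted

    def smooth_rotations(original, w_data=1.0, w_smooth=4.0):
        """Minimize weighted squared deviations from the original rotations
        and between consecutive updated rotations."""
        original = np.asarray(original, dtype=float)

        def loss(updated):
            data_term = np.sum((updated - original) ** 2)   # stay near the originals
            smooth_term = np.sum(np.diff(updated) ** 2)     # penalize frame-to-frame change
            return w_data * data_term + w_smooth * smooth_term

        return minimize(loss, x0=original.copy()).x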


The set of updated rotational positions is stored at 512. According to various embodiments, the set of updated rotational positions may be used, along with the updated translational positions, to determine updated images for the set of images. Techniques for determining image transformations are discussed in additional detail with respect to the method 800 shown in FIG. 8.


With reference to FIG. 6, shown is a particular example of a computer system that can be used to implement particular examples. For instance, the computer system 602 can be used to provide MVIDMRs according to various embodiments described above. According to various embodiments, a system 602 suitable for implementing particular embodiments includes a processor 604, a memory 606, an interface 610, and a bus 612 (e.g., a PCI bus).


The system 602 can include one or more sensors 608, such as light sensors, accelerometers, gyroscopes, microphones, and cameras, including stereoscopic or structured light cameras. As described above, the accelerometers and gyroscopes may be incorporated in an IMU. The sensors can be used to detect movement of a device and determine a position of the device. Further, the sensors can be used to provide inputs into the system. For example, a microphone can be used to detect a sound or input a voice command.


In the instance of the sensors including one or more cameras, the camera system can be configured to output native video data as a live video feed. The live video feed can be augmented and then output to a display, such as a display on a mobile device. The native video can include a series of frames as a function of time. The frame rate is often described as frames per second (fps). Each video frame can be an array of pixels with color or gray scale values for each pixel. For example, a pixel array size can be 512 by 512 pixels with three color values (red, green and blue) per pixel. The three color values can be represented by varying amounts of bits, such as 6, 12, 17, 40 bits, etc. per pixel. When more bits are assigned to representing the RGB color values for each pixel, a larger number of color values is possible. However, the data associated with each image also increases. The number of possible colors can be referred to as the color depth.
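

As a worked example of the storage cost implied by resolution and color depth (the 24-bit figure below is a common choice used purely for illustration):

    def frame_bytes(width, height, bits_per_pixel):
        """Uncompressed size of a single frame in bytes."""
        return width * height * bits_per_pixel // 8

    # 512 x 512 pixels at 24 bits per pixel (8 bits per RGB channel) -> 786,432 bytes
    print(frame_bytes(512, 512, 24))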


The video frames in the live video feed can be communicated to an image processing system that includes hardware and software components. The image processing system can include non-persistent memory, such as random-access memory (RAM) and video RAM (VRAM). In addition, processors, such as central processing units (CPUs) and graphical processing units (GPUs) for operating on video data and communication busses and interfaces for transporting video data can be provided. Further, hardware and/or software for performing transformations on the video data in a live video feed can be provided.


In particular embodiments, the video transformation components can include specialized hardware elements configured to perform functions necessary to generate a synthetic image derived from the native video data and then augmented with virtual data. In data encryption, specialized hardware elements can be used to perform a specific data transformation, i.e., data encryption associated with a specific algorithm. In a similar manner, specialized hardware elements can be provided to perform all or a portion of a specific video data transformation. These video transformation components can be separate from the GPU(s), which are specialized hardware elements configured to perform graphical operations. All or a portion of the specific transformation on a video frame can also be performed using software executed by the CPU.


The processing system can be configured to receive a video frame with first RGB values at each pixel location and apply an operation to determine second RGB values at each pixel location. The second RGB values can be associated with a transformed video frame which includes synthetic data. After the synthetic image is generated, the native video frame and/or the synthetic image can be sent to a persistent memory, such as a flash memory or a hard drive, for storage. In addition, the synthetic image and/or native video data can be sent to a frame buffer for output on a display or displays associated with an output interface. For example, the display can be the display on a mobile device or a view finder on a camera.


In general, the video transformations used to generate synthetic images can be applied to the native video data at its native resolution or at a different resolution. For example, the native video data can be a 512 by 512 array with RGB values represented by 6 bits and at a frame rate of 6 fps. In some embodiments, the video transformation can involve operating on the video data in its native resolution and outputting the transformed video data at the native frame rate at its native resolution.


In other embodiments, to speed up the process, the video transformations may involve operating on video data and outputting transformed video data at resolutions, color depths and/or frame rates different than the native resolutions. For example, the native video data can be at a first video frame rate, such as 12 fps. But the video transformations can be performed on every other frame, and synthetic images can be output at a frame rate of 6 fps. Alternatively, the transformed video data can be interpolated from the 6 fps rate back to the 12 fps rate by interpolating between two of the transformed video frames.


In another example, prior to performing the video transformations, the resolution of the native video data can be reduced. For example, when the native resolution is 512 by 512 pixels, it can be interpolated to a 76 by 76 pixel array using a method such as pixel averaging, and then the transformation can be applied to the 76 by 76 array. The transformed video data can be output and/or stored at the lower 76 by 76 resolution. Alternatively, the transformed video data, such as with a 76 by 76 resolution, can be interpolated to a higher resolution, such as its native resolution of 512 by 512, prior to output to the display and/or storage. The coarsening of the native video data prior to applying the video transformation can be used alone or in conjunction with a coarser frame rate.
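

A coarsening step of this kind might be sketched with OpenCV's area interpolation, which averages source pixels when shrinking an image; the library choice and target size are assumptions, not part of the disclosure:

    import cv2

    def coarsen(frame, size=(76, 76)):
        """Reduce resolution by averaging source pixels before applying the transformation."""
        return cv2.resize(frame, size, interpolation=cv2.INTER_AREA)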


As mentioned above, the native video data can also have a color depth. The color depth can also be coarsened prior to applying the transformations to the video data. For example, the color depth might be reduced from 40 bits to 6 bits prior to applying the transformation.


As described above, native video data from a live video can be augmented with virtual data to create synthetic images and then output in real-time. In particular embodiments, real-time can be associated with a certain amount of latency, i.e., the time between when the native video data is captured and the time when the synthetic images including portions of the native video data and virtual data are output. In particular, the latency can be less than 100 milliseconds. In other embodiments, the latency can be less than 50 milliseconds. In other embodiments, the latency can be less than 20 milliseconds. In yet other embodiments, the latency can be less than 12 milliseconds. In yet other embodiments, the latency can be less than 10 milliseconds.


The interface 610 may include separate input and output interfaces, or may be a unified interface supporting both operations. Examples of input and output interfaces can include displays, audio devices, cameras, touch screens, buttons and microphones. When acting under the control of appropriate software or firmware, the processor 604 is responsible for such tasks as optimization. Various specially configured devices can also be used in place of a processor 604 or in addition to processor 604, such as graphical processor units (GPUs). The complete implementation can also be done in custom hardware. The interface 610 is typically configured to send and receive data packets or data segments over a network via one or more communication interfaces, such as wireless or wired communication interfaces. Particular examples of interfaces the device supports include Ethernet interfaces, frame relay interfaces, cable interfaces, DSL interfaces, token ring interfaces, and the like.


In addition, various very high-speed interfaces may be provided such as fast Ethernet interfaces, Gigabit Ethernet interfaces, ATM interfaces, HSSI interfaces, POS interfaces, FDDI interfaces and the like. Generally, these interfaces may include ports appropriate for communication with the appropriate media. In some cases, they may also include an independent processor and, in some instances, volatile RAM. The independent processors may control such communications intensive tasks as packet switching, media control and management.


According to various embodiments, the system 602 uses memory 606 to store data and program instructions and to maintain a local side cache. The program instructions may control the operation of an operating system and/or one or more applications, for example. The memory or memories may also be configured to store received metadata and batch requested metadata.


The system 602 can be integrated into a single device with a common housing. For example, system 602 can include a camera system, processing system, frame buffer, persistent memory, output interface, input interface and communication interface. In various embodiments, the single device can be a mobile device like a smart phone, an augmented reality and wearable device like Google Glass™ or a virtual reality head set that includes multiple cameras, like a Microsoft Hololens™. In other embodiments, the system 602 can be partially integrated. For example, the camera system can be a remote camera system. As another example, the display can be separate from the rest of the components like on a desktop PC.


In the case of a wearable system, like a head-mounted display, as described above, a virtual guide can be provided to help a user record a MVIDMR. In addition, a virtual guide can be provided to help teach a user how to view a MVIDMR in the wearable system. For example, the virtual guide can be provided in synthetic images output to the head-mounted display which indicate that the MVIDMR can be viewed from different angles in response to the user moving in some manner in physical space, such as walking around the projected image. As another example, the virtual guide can be used to indicate that a head motion of the user can allow for different viewing functions. In yet another example, a virtual guide might indicate a path that a hand could travel in front of the display to instantiate different viewing functions.


According to various embodiments, to generate the smoothed trajectories, iterative fitting of a polynomial curve in polar coordinate space may be used. For instance, a Gauss-Newton algorithm with a variable damping factor may be employed. In FIG. 7A, a single iteration is employed to generate the smoothed trajectory 706 from the initial trajectory 702 around points 704. In FIG. 7B, three iterations are employed to generate the smoothed trajectory 708. In FIG. 7C, seven iterations are employed to generate the smoothed trajectory 710. In FIG. 7D, ten iterations are employed to generate the smoothed trajectory 712.


As shown in FIG. 7A, FIG. 7B, FIG. 7C, and FIG. 7D, successive iterations provide for an improved smoothed trajectory fit to the original trajectory. However, successive iterations also provide for diminishing returns in smoothed trajectory fit, and require additional computing resources for calculation.
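

A damped Gauss-Newton iteration of the kind described can be sketched generically as follows, with a numerically estimated Jacobian and a damping factor that shrinks after successful steps and grows after failed ones (all names are illustrative, and a residual function for the polar curve fit is assumed to be supplied by the caller):

    import numpy as np

    def damped_gauss_newton(residual_fn, params, iters=10, damping=1e-3):
        """Iteratively reduce the sum of squares of residual_fn(params)."""
        params = np.asarray(params, dtype=float)
        n = len(params)
        for _ in range(iters):
            r = residual_fn(params)
            eps = 1e-6
            # Numerical Jacobian of the residuals with respect to the parameters.
            J = np.stack(
                [(residual_fn(params + eps * np.eye(n)[j]) - r) / eps for j in range(n)],
                axis=1,
            )
            step = np.linalg.solve(J.T @ J + damping * np.eye(n), J.T @ r)
            if np.sum(residual_fn(params - step) ** 2) < np.sum(r ** 2):
                params = params - step
                damping *= 0.5     # good step: relax the damping
            else:
                damping *= 10.0    # bad step: increase the damping
        return params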



FIG. 8 illustrates one example of a method 800 for image view transformation, performed in accordance with one or more embodiments. According to various embodiments, the method 800 may be performed on a mobile computing device that captures images along a path. Alternatively, the method 800 may be performed on a different computing device, such as a remote server to which data from a mobile computing device is transmitted.


In some implementations, the method 800 may be performed in order to transform images such that their perspective better matches the smoothed trajectory determined as described with respect to FIG. 3. Such transformations may allow the images to be positioned in an MVIDMR so that navigation between the images is smoother than would be the case with untransformed images. The images identified at 802 may include some or all of the images identified at operation 302 shown in FIG. 3.


A request to transform one or more images is received at 802. According to various embodiments, the request may be generated automatically. For instance, after the path modeling is performed as described with respect to the method 300, images may automatically be transformed to reposition their perspectives to more closely match the smoothed trajectory. Alternatively, the request to transform one or more images may be generated based on user input. For instance, a user may request to transform all images associated with locations that are relatively distant from the smoothed trajectory, or even select particular images for transformation.


Location data for the images is identified at 804. A smoothed trajectory is identified at 806. According to various embodiments, the location data and the smoothed trajectory may be identified as discussed with respect to the method 300 shown in FIG. 3.


A designated three-dimensional model for the identified images is determined at 808. According to various embodiments, the designated three-dimensional model may include points in a three-dimensional space. The points may be connected by edges that together form surfaces. The designated three-dimensional model may be determined using one or more of a variety of techniques.


In some embodiments, a three-dimensional model may be constructed by analyzing the contents of the images. For example, object recognition may be performed to identify one or more objects in an image. The object recognition analysis for one or more images may be combined with the location data for those images to generate a three-dimensional model of the space.


In some implementations, a three-dimensional model may be created at least in part based on depth sensor information collected from a depth sensor at the computing device. The depth sensor may provide data that indicates a distance from the sensor to various points in the image. This data may be used to position an abstract representation of various portions of the image in three-dimensional space, for instance via a point cloud. One or more of a variety of depth sensors may be used, such as time-of-flight, infrared, structured light, LIDAR, or RADAR sensors.


An image is selected for transformation at 810. In some embodiments, each of the images in the set may be transformed. Alternatively, only those images that meet one or more criteria, such as distance from the smoothed trajectory, may be transformed.


According to various embodiments, the image may be selected for transformation based on any of a variety of criteria. For example, images that are further away from the smoothed trajectory may be selected first. As another example, images may be selected in sequence until all suitable images have been processed for transformation.


A target position for the image is determined at 812. In some implementations, the target position for the image may be determined by finding a position along the smoothed trajectory that is proximate to the original position associated with the image. For example, the target position may be the position along the smoothed trajectory that is closest to the image's original position. As another example, the target position may be selected so as to maintain a relatively equal distance between images along the smoothed trajectory.
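

The first option can be sketched by sampling the smoothed trajectory densely and taking the sample nearest to the image's original location (illustrative names; the trajectory samples are assumed to come from the earlier fitting step):

    import numpy as np

    def nearest_target(original_position, trajectory_samples):
        """Return the trajectory sample closest to an image's original position."""
        distances = np.linalg.norm(trajectory_samples - original_position, axis=1)
        return trajectory_samples[np.argmin(distances)]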


According to various embodiments, the target position may include a translation from the original translational position to an updated translational position. Alternatively, or additionally, the target position may include a rotation from an original rotational position associated with the selected image to an updated rotational position.


At 814, the designated three-dimensional model is projected onto the selected image and onto the target position. According to various embodiments, the three-dimensional model may include a number of points in a point cloud. Each point may be specified as a position in three-dimensional space. Since the positions in three-dimensional space of the selected image and the target position are known, these points may then be projected onto those virtual camera viewpoints. In the case of the selected image, the points in the point cloud may then be positioned onto the selected image.


At 816, a transformation to the image is applied to generate a transformed image. According to various embodiments, the transformation may be applied by first determining a function to translate the location of each of the points in the point cloud from its location when projected onto the selected image to its corresponding location when projected onto the virtual camera viewpoint associated with the target position for the image. Based on this translation function, other portions of the selected image may be similarly translated to the target position. For instance, a designated pixel or other area within the selected image may be translated based on a function determined as a weighted average of the translation functions associated with the nearby points in the point cloud.
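

The per-pixel interpolation described above can be sketched as an inverse-distance weighting of the control-point displacements, followed by a resampling of the source image. OpenCV is assumed for the resampling step, the control points stand in for the projected point cloud, and the per-pixel loop is kept for clarity rather than speed:

    import numpy as np
    import cv2

    def warp_image(src, src_pts, dst_pts):
        """Warp src so that src_pts move approximately onto dst_pts.

        src_pts, dst_pts: (K, 2) matched control points (e.g., projected point cloud).
        A backward map is built for cv2.remap, so displacements are subtracted.
        """
        h, w = src.shape[:2]
        disp = (dst_pts - src_pts).astype(np.float32)    # per-control-point translation
        map_x = np.zeros((h, w), np.float32)
        map_y = np.zeros((h, w), np.float32)
        for y in range(h):
            for x in range(w):
                d = np.linalg.norm(dst_pts - np.array([x, y]), axis=1) + 1e-6
                weights = 1.0 / d
                dxy = (weights[:, None] * disp).sum(axis=0) / weights.sum()
                map_x[y, x] = x - dxy[0]
                map_y[y, x] = y - dxy[1]
        return cv2.remap(src, map_x, map_y, interpolation=cv2.INTER_LINEAR)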


The transformed image is stored at 818. In some implementations, the transformed image may be stored for use in generating an MVIDMR. Because the images have been transformed such that their perspective more closely matches the smoothed trajectory, navigation between different images may appear to be more seamless.


A determination is made at 820 as to whether to select an additional image for transformation. As discussed with respect to operation 810, a variety of criteria may be used to select images for transformation. Additional images may be selected for transformation until all images that meet the designated criteria have been transformed.



FIG. 9 illustrates a diagram 900 of real and virtual camera positions along a path around an object 930, generated in accordance with one or more embodiments. The diagram 900 includes the actual camera positions 902, 904, 906, 908, 910, 912, and 914, the virtual camera positions 916, 918, 920, 922, 924, 926, and 928, and the smoothed trajectory 932.


According to various embodiments, each of the actual camera positions corresponds to a location at which an image of the object 930 was captured. For example, a person holding a camera, a drone, or another image source may move along a path through space around the object 930 and capture a series of images.


According to various embodiments, the smoothed trajectory 932 corresponds to a path through space that is determined to fit the positions of the actual camera positions. Techniques for determining a smoothed trajectory 932 are discussed throughout the application as filed.


According to various embodiments, each of the virtual camera positions corresponds to a position along the smoothed trajectory at which a virtual image of the object 930 is to be generated. The virtual camera positions may be selected such that they are located along the smoothed trajectory 932 while at the same time being near the actual camera positions. In this way, the apparent path of the viewpoint through space may be smoothed while at the same time reducing the appearance of visual artifacts that may result from placing virtual camera positions at locations relatively far from the actual camera positions.


The diagram 900 is a simplified top-down view in which camera positions are shown in two dimensions. However, as discussed throughout the application, the smoothed trajectory 932 may be a two-dimensional or three-dimensional trajectory. Further, each camera position may be specified in up to three spatial dimensions and up to three rotational dimensions (e.g., yaw, pitch, and roll relative to the object 930).


The diagram 900 includes the key points 934, 936, 938, 940, 942, and 944. According to various embodiments, the key points may be identified via image processing techniques. Each key point may correspond to a location in three-dimensional space that appears in two or more of the images. In this way, a key point may be used to determine a spatial correspondence between portions of different images of the object.


According to various embodiments, a key point may correspond to a feature of an object. For instance, if the object is a vehicle, then a key point may correspond to a mirror, door handle, headlight, body panel intersection, or other such feature.


According to various embodiments, a key point may correspond to a location other than on an object. For example, a key point may correspond to a location on the ground beneath an object. As another example, a key point may correspond to a location in the scenery behind an object.


According to various embodiments, each of the key points may be associated with a location in three-dimensional space. For instance, the various input images may be analyzed to construct a three-dimensional model of the object. The three-dimensional model may include some or all of the surrounding scenery and/or ground underneath the object. Each of the key points may then be positioned within the three-dimensional space associated with the model. At that point, each key point may be associated with a respective three-dimensional location with respect to the modeled features of the object.



FIG. 10 illustrates a method 1000 for generating a novel image, performed in accordance with one or more embodiments. The method 1000 may be used in conjunction with other techniques and mechanisms described herein, such as those for determining a smoothed trajectory based on source image positions. The method 1000 may be performed on any suitable computing device.


A request to generate a novel image of an object at a destination position is received at 1002. According to various embodiments, the request may be generated as part of an overarching method for smoothing the positions of images captured along a path through space. For example, after identifying a set of images and determining a smoothed trajectory for those images, a number of destination positions may be identified for generating novel images.


According to various embodiments, the destination positions may be determined based on a tradeoff between trajectory smoothness and visual artifacts. On one hand, the closer a destination position is to the smoothed trajectory, the smoother the resulting sequence of images appears. On the other hand, the closer a destination position is to an original image position of an actual image, the more the novel image will match the appearance of an image actually captured from the destination position.


As discussed herein, the term position can refer to any of a variety of spatial and/or orientation coordinates. For example, a point may be located at a three-dimensional position in spatial coordinates, while an image or camera location may also include up to three rotational coordinates as well (e.g., yaw, pitch, and roll).


At 1004, a source image at a source position is identified for generating the novel image. According to various embodiments, the source image may be any of the images used to generate the smoothed trajectory or captured relatively close to the smoothed trajectory.


A 3D point cloud for generating the novel image is identified at 1006. According to various embodiments, the 3D point cloud may include one or more points corresponding to areas (e.g., a pixel or pixels) in the source image. For example, a point may be a location on an object captured in the source image. As another example, a point may be a location on the ground underneath the object. As yet another example, a point may be a location on background scenery behind the object captured in the source image.


One or more 3D points are projected at 1008 onto first positions in space at the source position. In some implementations, projecting a 3D point onto a first position in space at the source position may involve computing a geometric projection from a three-dimensional spatial position onto a two-dimensional position on a plane at the source position. For instance, a geometric projection may be used to project the 3D point onto a location such as a pixel on the source position image. The first position may be specified, for instance, as an x-coordinate and a y-coordinate on the source position image.


According to various embodiments, the key points described with respect to FIG. 9 may be used as the 3D points projected at 1008. As discussed with respect to FIG. 9, each of the key points may be associated with a position in three-dimensional space, which may be identified by performing image analysis on the input images.


The one or more 3D points are projected at 1010 onto second positions in space at the destination position. According to various embodiments, the same 3D points projected at 1008 onto first positions in space at the source position may also be projected onto second positions in space at the destination position. Although the novel image has not yet been generated, because the destination location in space for the novel image is identified at 1002, the one or more 3D points may be projected onto the second positions in much the same way as onto the first positions. For example, a geometric projection may be used to determine an x-coordinate and a y-coordinate on the novel position image, even though the image pixel values for the novel position image have not yet been generated, since the position of the novel position image in space is known.
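

Both projections can be sketched with the same pinhole model, given each camera's intrinsics and pose; the matrix names below are assumptions rather than anything specified in the disclosure:

    import numpy as np

    def project_points(points_3d, K, R, t):
        """Project (N, 3) world points into an image with intrinsics K and pose (R, t).

        Returns (N, 2) pixel coordinates. The same routine serves the source image
        and the virtual camera at the destination position.
        """
        cam = (R @ points_3d.T + t.reshape(3, 1)).T    # world -> camera coordinates
        uv = (K @ cam.T).T                             # apply the pinhole intrinsics
        return uv[:, :2] / uv[:, 2:3]                  # perspective divide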


One or more transformations from the first positions to the second positions are determined at 1012. According to various embodiments, the one or more transformations may identify, for instance, a respective translation in space from each of the first positions for the points to the corresponding second positions for the points. For example, a first one of the 3D points may have a projected first position onto the source location of x1, y1, and z1, while the first 3D point may have a projected second position onto the destination location of x2, y2, and z2. In such a configuration, the transformation for the first 3D point may be specified as x2−x1, y2−y1, and z2−z1. Because different 3D points may have different first and second positions, each 3D point may correspond to a different transformation.


A set of 2D mesh source positions corresponding to the source image are determined at 1014. According to various embodiments, the 2D mesh source positions may correspond to any 2D mesh overlain on the source image. For example, the 2D mesh may be a rectilinear mesh of coordinates, a triangular mesh of coordinates, an irregular mesh of coordinates, or any suitable coordinate mesh. An example of such a coordinate mesh is shown in FIG. 11.
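
A minimal sketch of one possible 2D mesh, assuming a regular grid split into triangles. The function name and step size are illustrative, not specified by the disclosure.

```python
import numpy as np

def triangular_mesh(width, height, step=64):
    """Build a regular triangulated mesh over an image: vertex coordinates
    plus triangles given as index triples into the vertex array."""
    xs = np.arange(0, width + 1, step)
    ys = np.arange(0, height + 1, step)
    verts = np.array([(x, y) for y in ys for x in xs], dtype=float)
    cols = len(xs)
    tris = []
    for r in range(len(ys) - 1):
        for c in range(cols - 1):
            i = r * cols + c
            tris.append((i, i + 1, i + cols))             # upper-left triangle of the cell
            tris.append((i + 1, i + cols + 1, i + cols))  # lower-right triangle of the cell
    return verts, np.array(tris)

verts, tris = triangular_mesh(1920, 1080, step=120)
```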


In particular embodiments, using a finer 2D mesh, such as a mesh that includes many small triangles, may provide for a more accurate set of transformations at the expense of increased computation. Accordingly, a finer 2D mesh may be used in more highly detailed areas of the source image, while a coarser 2D mesh may be used in less highly detailed areas of the source image.


In particular embodiments, the fineness of the 2D mesh may depend at least in part on the number and positions of the projected locations of the 3D points. For example, the number of coordinate points in the 2D mesh may be proportional to the number of 3D points projected onto the source image.


A set of 2D mesh destination positions corresponding to the destination image are determined at 1016. According to various embodiments, the 2D mesh destination positions may be the same as the 2D mesh source positions, except that the 2D mesh destination positions are relative to the position of the destination image whereas the 2D mesh source positions are relative to the position of the source image. For example, if a particular 2D mesh point in the source image is located at position x1, y1 in the source image, then the corresponding 2D mesh point in the destination image may be located at position x1, y1 in the destination image.


According to various embodiments, determining the 2D mesh destination positions may involve determining and/or applying one or more transformation constraints. For example, reprojection constraints may be determined based on the transformations for the projected 3D points. As another example, similarity constraints may be imposed based on the transformation of the 2D mesh points. The similarity constraints allow for rotation, translation, and uniform scaling of the 2D mesh points, but not deformation of the 2D mesh areas.


In particular embodiments, one or more of the constraints may be implemented as a hard constraint that cannot be violated. For instance, one or more of the reprojection constraints based on transformation of the projected 3D points may be implemented as hard constraints.


In particular embodiments, one or more of the constraints may be implemented as a soft constraint that may be violated under some conditions, for instance based on an optimization penalty. For instance, one or more of the similarity constraints preventing deformation of the 2D mesh areas may be implemented as soft constraints.


In particular embodiments, different areas of the 2D mesh may be associated with different types of constraints. For instance, an image region near the edge of the object may be associated with small 2D mesh areas that are subject to more relaxed similarity constraints allowing for greater deformation. However, an image region near the center of an object may be subject to relatively strict similarity constraints allowing for less deformation of the 2D mesh.


A source image transformation for generating the novel image is determined at 1018. According to various embodiments, the source image transformation may be generated by first extending the transformations determined at 1012 to the 2D mesh points. For example, if an area in the source image defined by points within the 2D mesh includes a single projected 3D point having a transformation to a corresponding location in the destination image, then conceptually that transformation may be used to also determine transformations for those 2D mesh points.


In particular embodiments, the transformations for the 2D mesh points may be determined so as to respect the position of the projected 3D point relative to the 2D mesh points in barycentric coordinates. For instance, if the 2D mesh area is triangular, and the projected 3D point is located in the source image at a particular location having particular distances from each of the three points that make up the triangle, then those three points may be assigned respective transformations to points in the novel image such that at their transformed positions their respective distances to the transformed location of the projected 3D point are maintained.
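
One way to express the relative position described above is through barycentric coordinates. The following sketch is a generic computation for a 2D triangle and is not an implementation taken from the disclosure.

```python
import numpy as np

def barycentric(p, a, b, c):
    """Barycentric coordinates of 2D point p with respect to triangle (a, b, c)."""
    m = np.column_stack((b - a, c - a))    # 2x2 matrix of triangle edge vectors
    v, w = np.linalg.solve(m, p - a)       # weights for vertices b and c
    return np.array([1.0 - v - w, v, w])   # weights for vertices a, b, c

a, b, c = np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(barycentric(np.array([0.25, 0.25]), a, b, c))   # -> [0.5, 0.25, 0.25]
```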


The novel image is generated based on the source image transformation at 1020. According to various embodiments, once transformations are determined for the points in the 2D mesh, then those transformations may in turn be used to determine corresponding translations for pixels within the source image. For example, a pixel located within an area of the 2D mesh may be assigned a transformation that is an average (e.g., a weighted average) of the transformations determined for the points defining that area of the 2D mesh. Techniques for determining transformations are illustrated graphically in FIG. 11.
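
As a hedged illustration of the weighted-average idea, a pixel inside a triangle of the mesh might be displaced by the barycentric-weighted combination of the displacements assigned to that triangle's vertices. The name barycentric refers to the hypothetical helper sketched above.

```python
def warp_pixel(pixel, tri_src, tri_disp):
    """Move a source pixel by the barycentric-weighted average of the
    displacements assigned to its enclosing triangle's vertices."""
    w = barycentric(pixel, *tri_src)       # weights for the three vertices
    displacement = w[0] * tri_disp[0] + w[1] * tri_disp[1] + w[2] * tri_disp[2]
    return pixel + displacement
```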


In some implementations, generating a novel image may involve determining many transformations for potentially many different projected 3D points, 2D mesh points, and source image pixel points. Accordingly, a machine learning model such as a neural network may be used to determine the transformations and generate the novel image. The neural network may be implemented by, for example, employing the transformations of the projected 3D points as a set of constraints used to guide the determination of the transformations for the 2D mesh points and pixels included in the source image. The locations of the projected 3D points and their corresponding transformations may be referred to herein as reprojection constraints.
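
The disclosure describes a neural network guided by reprojection and similarity constraints. Purely as a simplified stand-in, the sketch below poses the mesh transformation as a least-squares problem in which reprojection terms pin selected vertices and a soft term discourages changes to mesh edge vectors, which is a stricter simplification of a true similarity constraint. All names and weights are assumptions.

```python
import numpy as np
from scipy.optimize import least_squares

def solve_mesh(src_verts, edges, pinned_idx, pinned_disp, soft_weight=0.1):
    """Solve for destination mesh vertices: pinned vertices must follow the
    reprojection displacements, while a soft term discourages changing edge
    vectors (a simplified stand-in for the similarity constraints)."""
    def residuals(flat):
        dst = flat.reshape(-1, 2)
        # Reprojection terms: pinned vertices should move by the given displacements.
        rep = (dst[pinned_idx] - src_verts[pinned_idx] - pinned_disp).ravel()
        # Soft terms: penalize changes to the edge vectors of the mesh.
        e_src = src_verts[edges[:, 0]] - src_verts[edges[:, 1]]
        e_dst = dst[edges[:, 0]] - dst[edges[:, 1]]
        sim = soft_weight * (e_dst - e_src).ravel()
        return np.concatenate([rep, sim])

    result = least_squares(residuals, src_verts.ravel())
    return result.x.reshape(-1, 2)
```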


In particular embodiments, generating the novel image at 1020 may involve storing the image to a storage device, transmitting the novel image via a network, or performing other such post-processing operations. Moreover, the operations shown in FIG. 10 may be performed in any suitable order, such as in a different order from that shown, or in parallel. For example, as discussed above, a neural network or other suitable machine learning technique may be used to determine multiple transformations simultaneously.



FIG. 11 illustrates a diagram 1100 of a side view image of an object 1102, generated in accordance with one or more embodiments. In the diagram 1100, the side view image of the object 1102 is overlain with a mesh 1104. The mesh is composed of a number of vertices, such as the vertices 1108, 1110, 1112, and 1114. As discussed with respect to the method shown in FIG. 10, reprojection points are projected onto the image of the object 1102. The point 1106 is an example of such a reprojection point.


A relatively coarse and regular mesh is shown in FIG. 11 for clarity. However, according to various embodiments, an image of an object may be associated with various types of meshes. For example, a mesh may be composed of one or more squares, triangles, rectangles, or other geometric figures. As another example, a mesh may be regular in size across an image, or may be more granular in some locations than others. For instance, the mesh may be more granular in areas of the image that are more central or more detailed. As yet another example, one or more lines within the mesh may be curved, for instance along an object boundary.


According to various embodiments, an image may be associated with a segmentation mask that covers the object. Also, a single reprojection point is shown in FIG. 11 for clarity. However, according to various embodiments, potentially many reprojection points may be used. For example, a single area of the mesh may be associated with none, one, several, or many reprojection points.


According to various embodiments, an object may be associated with smaller mesh areas near the object's boundaries and larger mesh areas away from the object's boundaries. Further, different mesh areas may be associated with different constraints. For example, smaller mesh areas may be associated with more relaxed similarity constraints, allowing for greater deformation, while larger mesh areas may be associated with stricter similarity constraints, allowing for less deformation.



FIG. 12 shows an example of a process flow diagram 1200 for generating a MVIDMR. In the present example, a plurality of images is obtained at 1202. According to various embodiments, the plurality of images can include two-dimensional (2D) images or data streams. These 2D images can include location information that can be used to generate a MVIDMR. In some embodiments, the plurality of images can include depth images. The depth images can also include location information in various examples.


In some embodiments, when the plurality of images is captured, images output to the user can be augmented with the virtual data. For example, the plurality of images can be captured using a camera system on a mobile device. The live image data, which is output to a display on the mobile device, can include virtual data, such as guides and status indicators, rendered into the live image data. The guides can help a user guide a motion of the mobile device. The status indicators can indicate what portion of images needed for generating a MVIDMR have been captured. The virtual data may not be included in the image data captured for the purposes of generating the MVIDMR.


According to various embodiments, the plurality of images obtained at 1202 can include a variety of sources and characteristics. For instance, the plurality of images can be obtained from a plurality of users. These images can be a collection of images gathered from the internet from different users of the same event, such as 2D images or video obtained at a concert, etc. In some embodiments, the plurality of images can include images with different temporal information. In particular, the images can be taken at different times of the same object of interest. For instance, multiple images of a particular statue can be obtained at different times of day, different seasons, etc. In other examples, the plurality of images can represent moving objects. For instance, the images may include an object of interest moving through scenery, such as a vehicle traveling along a road or a plane traveling through the sky. In other instances, the images may include an object of interest that is also moving, such as a person dancing, running, twirling, etc.


In some embodiments, the plurality of images is fused into content and context models at 1204. According to various embodiments, the subject matter featured in the images can be separated into content and context. The content can be delineated as the object of interest and the context can be delineated as the scenery surrounding the object of interest. According to various embodiments, the content can be a three-dimensional model depicting an object of interest, although the content can be a two-dimensional image in some embodiments.


According to the present example embodiment, one or more enhancement algorithms can be applied to the content and context models at 1206. These algorithms can be used to enhance the user experience. For instance, enhancement algorithms such as automatic frame selection, stabilization, view interpolation, filters, and/or compression can be used. In some embodiments, these enhancement algorithms can be applied to image data during capture of the images. In other examples, these enhancement algorithms can be applied to image data after acquisition of the data.


In the present embodiment, a MVIDMR is generated from the content and context models at 1208. The MVIDMR can provide a multi-view interactive digital media representation. According to various embodiments, the MVIDMR can include a three-dimensional model of the content and a two-dimensional model of the context. According to various embodiments, depending on the mode of capture and the viewpoints of the images, the MVIDMR model can include certain characteristics. For instance, some examples of different styles of MVIDMRs include a locally concave MVIDMR, a locally convex MVIDMR, and a locally flat MVIDMR. However, it should be noted that MVIDMRs can include combinations of views and characteristics, depending on the application.



FIG. 13 shows an example of a MVIDMR acquisition system 1300, configured in accordance with one or more embodiments. The MVIDMR Acquisition System 1300 is depicted in a flow sequence that can be used to generate a MVIDMR. According to various embodiments, the data used to generate a MVIDMR can come from a variety of sources.


In particular, data such as, but not limited to two-dimensional (2D) images 1306 can be used to generate a MVIDMR. These 2D images can include color image data streams such as multiple image sequences, video data, etc., or multiple images in any of various formats for images, depending on the application. During an image capture process, an AR system can be used. The AR system can receive and augment live image data with virtual data. In particular, the virtual data can include guides for helping a user direct the motion of an image capture device.


Another source of data that can be used to generate a MVIDMR includes environment information 1308. This environment information 1308 can be obtained from sources such as accelerometers, gyroscopes, magnetometers, GPS, Wi-Fi, IMU-like systems (Inertial Measurement Unit systems), and the like. Yet another source of data that can be used to generate a MVIDMR can include depth images 1310. These depth images can include depth, 3D, or disparity image data streams, and the like, and can be captured by devices such as, but not limited to, stereo cameras, time-of-flight cameras, three-dimensional cameras, and the like.


In some embodiments, the data can then be fused together at sensor fusion block 1312. In some embodiments, a MVIDMR can be generated using a combination of data that includes both 2D images 1306 and environment information 1308, without any depth images 1310 provided. In other embodiments, depth images 1310 and environment information 1308 can be used together at sensor fusion block 1312. Various combinations of image data can be used with environment information 1308, depending on the application and available data.


In some embodiments, the data that has been fused together at sensor fusion block 1312 is then used for content modeling 1314 and context modeling 1316. The subject matter featured in the images can be separated into content and context. The content can be delineated as the object of interest and the context can be delineated as the scenery surrounding the object of interest. According to various embodiments, the content can be a three-dimensional model, depicting an object of interest, although the content can be a two-dimensional image in some embodiments. Furthermore, in some embodiments, the context can be a two-dimensional model depicting the scenery surrounding the object of interest. Although in many examples the context can provide two-dimensional views of the scenery surrounding the object of interest, the context can also include three-dimensional aspects in some embodiments. For instance, the context can be depicted as a “flat” image along a cylindrical “canvas,” such that the “flat” image appears on the surface of a cylinder. In addition, some examples may include three-dimensional context models, such as when some objects are identified in the surrounding scenery as three-dimensional objects. According to various embodiments, the models provided by content modeling 1314 and context modeling 1316 can be generated by combining the image and location information data.


According to various embodiments, context and content of a MVIDMR are determined based on a specified object of interest. In some embodiments, an object of interest is automatically chosen based on processing of the image and location information data. For instance, if a dominant object is detected in a series of images, this object can be selected as the content. In other examples, a user specified target 1304 can be chosen. It should be noted, however, that a MVIDMR can be generated without a user-specified target in some applications.


In some embodiments, one or more enhancement algorithms can be applied at enhancement algorithm(s) block 1318. In particular example embodiments, various algorithms can be employed during capture of MVIDMR data, regardless of the type of capture mode employed. These algorithms can be used to enhance the user experience. For instance, automatic frame selection, stabilization, view interpolation, filters, and/or compression can be used during capture of MVIDMR data. In some embodiments, these enhancement algorithms can be applied to image data after acquisition of the data. In other examples, these enhancement algorithms can be applied to image data during capture of MVIDMR data.


According to various embodiments, automatic frame selection can be used to create a more enjoyable MVIDMR. Specifically, frames are automatically selected so that the transition between them will be smoother or more even. This automatic frame selection can incorporate blur- and overexposure-detection in some applications, as well as more uniformly sampling poses such that they are more evenly distributed.
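
As one hedged illustration of automatic frame selection, frames might be scored for sharpness with the variance of the Laplacian and binned by approximate camera yaw so that the kept frames are roughly evenly spaced. The thresholds, bin count, and function names below are assumptions, not values from the disclosure.

```python
import cv2
import numpy as np

def sharpness(frame_bgr):
    """Higher variance of the Laplacian indicates a sharper (less blurred) frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def select_frames(frames, yaws, num_bins=36, blur_threshold=50.0):
    """Keep the sharpest frame in each yaw bin, skipping badly blurred frames."""
    bins = {}
    for frame, yaw in zip(frames, yaws):
        score = sharpness(frame)
        if score < blur_threshold:
            continue                                 # drop frames that look blurred
        key = int(yaw // (360.0 / num_bins))         # bin by approximate camera yaw
        if key not in bins or score > bins[key][0]:
            bins[key] = (score, frame)
    return [frame for _, (_, frame) in sorted(bins.items())]
```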


In some embodiments, stabilization can be used for a MVIDMR in a manner similar to that used for video. In particular, keyframes in a MVIDMR can be stabilized to produce improvements such as smoother transitions, improved/enhanced focus on the content, etc. However, unlike video, there are many additional sources of stabilization for a MVIDMR, such as by using IMU information, depth information, computer vision techniques, direct selection of an area to be stabilized, face detection, and the like.


For instance, IMU information can be very helpful for stabilization. In particular, IMU information provides an estimate, although sometimes a rough or noisy estimate, of the camera tremor that may occur during image capture. This estimate can be used to remove, cancel, and/or reduce the effects of such camera tremor.


In some embodiments, depth information, if available, can be used to provide stabilization for a MVIDMR. Because points of interest in a MVIDMR are three-dimensional, rather than two-dimensional, these points of interest are more constrained and tracking/matching of these points is simplified as the search space reduces. Furthermore, descriptors for points of interest can use both color and depth information and therefore, become more discriminative. In addition, automatic or semi-automatic content selection can be easier to provide with depth information. For instance, when a user selects a particular pixel of an image, this selection can be expanded to fill the entire surface that touches it. Furthermore, content can also be selected automatically by using a foreground/background differentiation based on depth. According to various embodiments, the content can stay relatively stable/visible even when the context changes.


According to various embodiments, computer vision techniques can also be used to provide stabilization for MVIDMRs. For instance, keypoints can be detected and tracked. However, in certain scenes, such as a dynamic scene or static scene with parallax, no simple warp exists that can stabilize everything. Consequently, there is a trade-off in which certain aspects of the scene receive more attention to stabilization and other aspects of the scene receive less attention. Because a MVIDMR is often focused on a particular object of interest, a MVIDMR can be content-weighted so that the object of interest is maximally stabilized in some examples.
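
A possible sketch of content-weighted stabilization, assuming OpenCV keypoint tracking and a per-frame similarity warp fitted only to keypoints detected inside an object mask so that the object of interest drives the stabilization. The function name and parameter values are illustrative.

```python
import cv2
import numpy as np

def stabilizing_warp(prev_gray, next_gray, object_mask):
    """Estimate a similarity transform that stabilizes the masked object:
    track keypoints detected inside the mask and fit a partial affine warp.
    object_mask is a uint8 mask, nonzero inside the object of interest."""
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                  qualityLevel=0.01, minDistance=8,
                                  mask=object_mask)
    if pts is None:
        return None
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
    good_prev = pts[status.ravel() == 1]
    good_next = nxt[status.ravel() == 1]
    # Warp that maps the new frame's keypoints back onto their previous positions.
    warp, _ = cv2.estimateAffinePartial2D(good_next, good_prev)
    return warp  # 2x3 matrix usable with cv2.warpAffine
```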


Another way to improve stabilization in a MVIDMR includes direct selection of a region of a screen. For instance, if a user taps to focus on a region of a screen, then records a convex MVIDMR, the area that was tapped can be maximally stabilized. This allows stabilization algorithms to be focused on a particular area or object of interest.


In some embodiments, face detection can be used to provide stabilization. For instance, when recording with a front-facing camera, it is often likely that the user is the object of interest in the scene. Thus, face detection can be used to weight stabilization about that region. When face detection is precise enough, facial features themselves (such as eyes, nose, and mouth) can be used as areas to stabilize, rather than using generic keypoints. In another example, a user can select an area of an image to use as a source for keypoints.


According to various embodiments, view interpolation can be used to improve the viewing experience. In particular, to avoid sudden “jumps” between stabilized frames, synthetic, intermediate views can be rendered on the fly. This can be informed by content-weighted keypoint tracks and IMU information as described above, as well as by denser pixel-to-pixel matches. If depth information is available, fewer artifacts resulting from mismatched pixels may occur, thereby simplifying the process. As described above, view interpolation can be applied during capture of a MVIDMR in some embodiments. In other embodiments, view interpolation can be applied during MVIDMR generation.


In some embodiments, filters can also be used during capture or generation of a MVIDMR to enhance the viewing experience. Just as many popular photo sharing services provide aesthetic filters that can be applied to static, two-dimensional images, aesthetic filters can similarly be applied to surround images. However, because a MVIDMR representation is more expressive than a two-dimensional image, and three-dimensional information is available in a MVIDMR, these filters can be extended to include effects that are ill-defined in two dimensional photos. For instance, in a MVIDMR, motion blur can be added to the background (i.e. context) while the content remains crisp. In another example, a drop-shadow can be added to the object of interest in a MVIDMR.


According to various embodiments, compression can also be used as an enhancement algorithm 1318. In particular, compression can be used to enhance user-experience by reducing data upload and download costs. Because MVIDMRs use spatial information, far less data can be sent for a MVIDMR than a typical video, while maintaining desired qualities of the MVIDMR. Specifically, the IMU, keypoint tracks, and user input, combined with the view interpolation described above, can all reduce the amount of data that must be transferred to and from a device during upload or download of a MVIDMR. For instance, if an object of interest can be properly identified, a variable compression style can be chosen for the content and context. This variable compression style can include lower quality resolution for background information (i.e. context) and higher quality resolution for foreground information (i.e. content) in some examples. In such examples, the amount of data transmitted can be reduced by sacrificing some of the context quality, while maintaining a desired level of quality for the content.
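
Purely as an illustration of the variable compression idea, the sketch below re-encodes a frame at two JPEG qualities and keeps the higher-quality pixels only where a content mask is set. A real pipeline would transmit region-dependent encodings rather than recombining them locally, and all names and quality values are assumptions.

```python
import cv2
import numpy as np

def variable_compression(frame_bgr, content_mask, q_content=90, q_context=40):
    """Approximate content/context compression: encode the frame at two JPEG
    qualities and keep the high-quality pixels only inside the content mask."""
    def roundtrip(img, quality):
        ok, buf = cv2.imencode(".jpg", img, [cv2.IMWRITE_JPEG_QUALITY, quality])
        return cv2.imdecode(buf, cv2.IMREAD_COLOR)

    high = roundtrip(frame_bgr, q_content)      # foreground (content) quality
    low = roundtrip(frame_bgr, q_context)       # background (context) quality
    mask3 = np.repeat(content_mask[:, :, None] > 0, 3, axis=2)
    return np.where(mask3, high, low)
```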


In the present embodiment, a MVIDMR 1320 is generated after any enhancement algorithms are applied. The MVIDMR can provide a multi-view interactive digital media representation. According to various embodiments, the MVIDMR can include a three-dimensional model of the content and a two-dimensional model of the context. However, in some examples, the context can represent a “flat” view of the scenery or background as projected along a surface, such as a cylindrical or other-shaped surface, such that the context is not purely two-dimensional. In yet other examples, the context can include three-dimensional aspects.


According to various embodiments, MVIDMRs provide numerous advantages over traditional two-dimensional images or videos. Some of these advantages include: the ability to cope with moving scenery, a moving acquisition device, or both; the ability to model parts of the scene in three-dimensions; the ability to remove unnecessary, redundant information and reduce the memory footprint of the output dataset; the ability to distinguish between content and context; the ability to use the distinction between content and context for improvements in the user-experience; the ability to use the distinction between content and context for improvements in memory footprint (an example would be high quality compression of content and low quality compression of context); the ability to associate special feature descriptors with MVIDMRs that allow the MVIDMRs to be indexed with a high degree of efficiency and accuracy; and the ability of the user to interact and change the viewpoint of the MVIDMR. In particular example embodiments, the characteristics described above can be incorporated natively in the MVIDMR representation, and provide the capability for use in various applications. For instance, MVIDMRs can be used to enhance various fields such as e-commerce, visual search, 3D printing, file sharing, user interaction, and entertainment.


According to various example embodiments, once a MVIDMR 1320 is generated, user feedback for acquisition 1302 of additional image data can be provided. In particular, if a MVIDMR is determined to need additional views to provide a more accurate model of the content or context, a user may be prompted to provide additional views. Once these additional views are received by the MVIDMR acquisition system 1300, these additional views can be processed by the system 1300 and incorporated into the MVIDMR.


Additional details regarding multi-view data collection, multi-view representation construction, and other features are discussed in co-pending and commonly assigned U.S. patent application Ser. No. 15/934,624, “Conversion of an Interactive Multi-view Image Data Set into a Video”, by Holzer et al., filed Mar. 23, 2018, which is hereby incorporated by reference in its entirety and for all purposes.



FIG. 14 illustrates an example of a process flow for capturing images in a MVIDMR using augmented reality, performed in accordance with one or more embodiments. In 1402, live image data can be received from a camera system. For example, live image data can be received from one or more cameras on a hand-held mobile device, such as a smartphone. The image data can include pixel data captured from a camera sensor. The pixel data varies from frame to frame. In some embodiments, the pixel data can be 2-D. In other embodiments, depth data can be included with the pixel data.


In 1404, sensor data can be received. For example, the mobile device can include an IMU with accelerometers and gyroscopes. The sensor data can be used to determine an orientation of the mobile device, such as a tilt orientation of the device relative to the gravity vector. Thus, the orientation of the live 2-D image data relative to the gravity vector can also be determined. In addition, when the user-applied accelerations can be separated from the acceleration due to gravity, it may be possible to determine changes in position of the mobile device as a function of time.


In particular embodiments, a camera reference frame can be determined. In the camera reference frame, one axis is aligned with a line perpendicular to the camera lens. Using an accelerometer on the phone, the camera reference frame can be related to an Earth reference frame. The Earth reference frame can provide a 3-D coordinate system where one of the axes is aligned with the Earth's gravitational vector. The relationship between the camera frame and the Earth reference frame can be indicated as yaw, roll, and tilt/pitch. Typically, at least two of the three of yaw, roll, and pitch are available from sensors on a mobile device, such as a smart phone's gyroscopes and accelerometers.
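
As a hedged example of relating the camera frame to gravity, roll and pitch can be estimated from a quasi-static accelerometer sample using standard tilt formulas; yaw is not observable from the accelerometer alone and would come from a gyroscope or magnetometer. The function name is an assumption.

```python
import numpy as np

def roll_pitch_from_accel(ax, ay, az):
    """Estimate roll and pitch (radians) relative to the gravity vector from a
    quasi-static accelerometer sample; yaw requires a gyroscope or magnetometer."""
    roll = np.arctan2(ay, az)
    pitch = np.arctan2(-ax, np.sqrt(ay * ay + az * az))
    return roll, pitch

# Device lying flat: gravity along +z, so roll and pitch are both roughly 0.
print(roll_pitch_from_accel(0.0, 0.0, 9.81))
```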


The combination of yaw-roll-tilt information from the sensors, such as a smart phone's or tablet's accelerometers, and the data from the camera, including the pixel data, can be used to relate the 2-D pixel arrangement in the camera field of view to the 3-D reference frame in the real world. In some embodiments, the 2-D pixel data for each picture can be translated to a reference frame as if the camera were resting on a horizontal plane perpendicular to an axis through the gravitational center of the Earth, where a line drawn through the center of the lens perpendicular to the surface of the lens is mapped to the center of the pixel data. This reference frame can be referred to as an Earth reference frame. Using this calibration of the pixel data, a curve or object defined in 3-D space in the Earth reference frame can be mapped to a plane associated with the pixel data (2-D pixel data). If depth data is available, i.e., the distance from the camera to a pixel, then this information can also be utilized in the transformation.


In alternate embodiments, the 3-D reference frame in which an object is defined doesn't have to be an Earth reference frame. In some embodiments, a 3-D reference in which an object is drawn and then rendered into the 2-D pixel frame of reference can be defined relative to the Earth reference frame. In another embodiment, a 3-D reference frame can be defined relative to an object or surface identified in the pixel data and then the pixel data can be calibrated to this 3-D reference frame.


As an example, the object or surface can be defined by a number of tracking points identified in the pixel data. Then, as the camera moves, using the sensor data and a new position of the tracking points, a change in the orientation of the 3-D reference frame can be determined from frame to frame. This information can be used to render virtual data into the live image data and/or into a MVIDMR.


Returning to FIG. 14, in 1406, virtual data associated with a target can be generated in the live image data. For example, the target can be cross hairs. In general, the target can be rendered as any shape or combinations of shapes. In some embodiments, via an input interface, a user may be able to adjust a position of the target. For example, using a touch screen over a display on which the live image data is output, the user may be able to place the target at a particular location in the synthetic image. The synthetic image can include a combination of live image data rendered with one or more virtual objects.


For example, the target can be placed over an object that appears in the image, such as a face or a person. Then, the user can provide an additional input via an interface that indicates the target is in a desired location. For example, the user can tap the touch screen proximate to the location where the target appears on the display. Then, an object in the image below the target can be selected. As another example, a microphone in the interface can be used to receive voice commands which direct a position of the target in the image (e.g., move left, move right, etc.) and then confirm when the target is in a desired location (e.g., select target).


In some instances, object recognition can be available. Object recognition can identify possible objects in the image. Then, the live images can be augmented with a number of indicators, such as targets, which mark identified objects. For example, objects such as people, parts of people (e.g., faces), cars, and wheels can be marked in the image. Via an interface, the person may be able to select one of the marked objects, such as via the touch screen interface. In another embodiment, the person may be able to provide a voice command to select an object. For example, the person may be able to say something like “select face” or “select car.”


In 1408, the object selection can be received. The object selection can be used to determine an area within the image data to identify tracking points. When the area in the image data is over a target, the tracking points can be associated with an object appearing in the live image data.


In 1410, tracking points can be identified which are related to the selected object. Once an object is selected, the tracking points on the object can be identified on a frame to frame basis. Thus, if the camera translates or changes orientation, the location of the tracking points in the new frame can be identified and the target can be rendered in the live images so that it appears to stay over the tracked object in the image. This feature is discussed in more detail below. In particular embodiments, object detection and/or recognition may be used for each or most frames, for instance to facilitate identifying the location of tracking points.


In some embodiments, tracking an object can refer to tracking one or more points from frame to frame in the 2-D image space. The one or more points can be associated with a region in the image. The one or more points or regions can be associated with an object. However, the object doesn't have to be identified in the image. For example, the boundaries of the object in 2-D image space don't have to be known. Further, the type of object doesn't have to be identified. For example, a determination doesn't have to be made as to whether the object is a car, a person or something else appearing in the pixel data. Instead, the one or more points may be tracked based on other image characteristics that appear in successive frames. For instance, edge tracking, corner tracking, or shape tracking may be used to track one or more points from frame to frame.


One advantage of tracking objects in the manner described in the 2-D image space is that a 3-D reconstruction of an object or objects appearing in an image doesn't have to be performed. The 3-D reconstruction step may involve operations such as “structure from motion (SFM)” and/or “simultaneous localization and mapping (SLAM).” The 3-D reconstruction can involve measuring points in multiple images and then optimizing for the camera poses and the point locations. When this process is avoided, significant computation time is saved. For example, avoiding the SLAM/SFM computations can enable the methods to be applied when objects in the images are moving. Typically, SLAM/SFM computations assume static environments.


In 1412, a 3-D coordinate system in the physical world can be associated with the image, such as the Earth reference frame, which as described above can be related to camera reference frame associated with the 2-D pixel data. In some embodiments, the 2-D image data can be calibrated so that the associated 3-D coordinate system is anchored to the selected target such that the target is at the origin of the 3-D coordinate system.


Then, in 1414, a 2-D or 3-D trajectory or path can be defined in the 3-D coordinate system. For example, a trajectory or path, such as an arc or a parabola can be mapped to a drawing plane which is perpendicular to the gravity vector in the Earth reference frame. As described above, based upon the orientation of the camera, such as information provided from an IMU, the camera reference frame including the 2-D pixel data can be mapped to the Earth reference frame. The mapping can be used to render the curve defined in the 3-D coordinate system into the 2-D pixel data from the live image data. Then, a synthetic image including the live image data and the virtual object, which is the trajectory or path, can be output to a display.


In general, virtual objects, such as curves or surfaces can be defined in a 3-D coordinate system, such as the Earth reference frame or some other coordinate system related to an orientation of the camera. Then, the virtual objects can be rendered into the 2-D pixel data associated with the live image data to create a synthetic image. The synthetic image can be output to a display.


In some embodiments, the curves or surfaces can be associated with a 3-D model of an object, such as person or a car. In another embodiment, the curves or surfaces can be associated with text. Thus, a text message can be rendered into the live image data. In other embodiments, textures can be assigned to the surfaces in the 3-D model. When a synthetic image is created, these textures can be rendered into the 2-D pixel data associated with the live image data.


When a curve is rendered on a drawing plane in the 3-D coordinate system, such as the Earth reference frame, one or more of the determined tracking points can be projected onto the drawing plane. As another example, a centroid associated with the tracked points can be projected onto the drawing plane. Then, the curve can be defined relative to one or more points projected onto the drawing plane. For example, based upon the target location, a point can be determined on the drawing plane. Then, the point can be used as the center of a circle or arc of some radius drawn in the drawing plane.
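
An illustrative sketch of sampling a circular guide in a drawing plane perpendicular to the gravity vector, centered on a point such as the projected centroid of the tracked points. The axis construction and names are assumptions; each sampled 3D point could then be projected into the live frame, for instance with a pinhole projection like the earlier project_point sketch, and connected into a polyline rendered over the pixel data.

```python
import numpy as np

def circle_in_drawing_plane(center_3d, radius, up, num_samples=72):
    """Sample points of a circle lying in the plane perpendicular to the
    gravity (up) vector, centered at a 3D point on the drawing plane."""
    up = up / np.linalg.norm(up)
    # Build two orthonormal axes spanning the drawing plane.
    ref = np.array([1.0, 0.0, 0.0])
    if abs(np.dot(ref, up)) > 0.9:       # avoid a reference nearly parallel to up
        ref = np.array([0.0, 1.0, 0.0])
    u = np.cross(up, ref)
    u /= np.linalg.norm(u)
    v = np.cross(up, u)                  # already unit length since up and u are orthonormal
    angles = np.linspace(0.0, 2.0 * np.pi, num_samples, endpoint=False)
    return [center_3d + radius * (np.cos(a) * u + np.sin(a) * v) for a in angles]
```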


In 1414, based upon the associated coordinate system, a curve can be rendered into the live image data as part of the AR system. In general, one or more virtual objects including a plurality of curves, lines, or surfaces can be rendered into the live image data. Then, the synthetic image including the live image data and the virtual objects can be output to a display in real-time.


In some embodiments, the one or more virtual objects rendered into the live image data can be used to help a user capture images used to create a MVIDMR. For example, the user can indicate a desire to create a MVIDMR of a real object identified in the live image data. The desired MVIDMR can span some angle range, such as forty-five, ninety, one hundred eighty, or three hundred sixty degrees. Then, a virtual object can be rendered as a guide where the guide is inserted into the live image data. The guide can indicate a path along which to move the camera and the progress along the path. The insertion of the guide can involve modifying the pixel data in the live image data in accordance with the coordinate system associated with the image in 1412.


In the example above, the real object can be some object which appears in the live image data. For the real object, a 3-D model may not be constructed. Instead, pixel locations or pixel areas can be associated with the real object in the 2-D pixel data. This definition of the real object is much less computationally expensive than attempting to construct a 3-D model of the real object in physical space.


The virtual objects, such as lines or surfaces, can be modeled in the 3-D space. The virtual objects can be defined a priori. Thus, the shape of the virtual object doesn't have to be constructed in real-time, which is computationally expensive. The real objects which may appear in an image are not known a priori. Hence, 3-D models of the real object are not typically available. Therefore, the synthetic image can include “real” objects which are only defined in the 2-D image space via assigning tracking points or areas to the real object, and virtual objects which are modeled in a 3-D coordinate system and then rendered into the live image data.


Returning to FIG. 14, in 1416, an AR image with one or more virtual objects can be output. The pixel data in the live image data can be received at a particular frame rate. In particular embodiments, the augmented frames can be output at the same frame rate at which the live image data is received. In other embodiments, the augmented frames can be output at a reduced frame rate. The reduced frame rate can lessen computation requirements. For example, live data received at 15 frames per second can be output at 12 frames per second. In another embodiment, the AR images can be output at a reduced resolution, such as 60p instead of 480p. The reduced resolution can also be used to reduce computational requirements.


In 1418, one or more images can be selected from the live image data and stored for use in a MVIDMR. In some embodiments, the stored images can include one or more virtual objects. Thus, the virtual objects can become part of the MVIDMR. In other embodiments, the virtual objects are only output as part of the AR system. But, the image data which is stored for use in the MVIDMR may not include the virtual objects.


In yet other embodiments, a portion of the virtual objects output to the display as part of the AR system can be stored. For example, the AR system can be used to render a guide during the MVIDMR image capture process and render a label associated with the MVIDMR. The label may be stored in the image data for the MVIDMR. However, the guide may not be stored. To store the images without the added virtual objects, a copy may have to be made. The copy can be modified with the virtual data and then output to a display while the original is stored, or the original can be stored prior to its modification.


In FIG. 15, the method in FIG. 14 is continued. In 1502, new image data can be received. In 1504, new IMU data (or, in general sensor data) can be received. The IMU data can represent a current orientation of the camera. In 1506, the location of the tracking points identified in previous image data can be identified in the new image data.


The camera may have tilted and/or moved. Hence, the tracking points may appear at a different location in the pixel data. As described above, the tracking points can be used to define a real object appearing in the live image data. Thus, identifying the location of the tracking points in the new image data allows the real object to be tracked from image to image. The differences in IMU data from frame to frame and knowledge of the rate at which the frames are recorded can be used to help to determine a change in location of tracking points in the live image data from frame to frame.


The tracking points associated with a real object appearing in the live image data may change over time. As a camera moves around the real object, some tracking points identified on the real object may go out of view as new portions of the real object come into view and other portions of the real object are occluded. Thus, in 1506, a determination may be made whether a tracking point is still visible in an image. In addition, a determination may be made as to whether a new portion of the targeted object has come into view. New tracking points can be added to the new portion to allow for continued tracking of the real object from frame to frame.


In 1508, a coordinate system can be associated with the image. For example, using an orientation of the camera determined from the sensor data, the pixel data can be calibrated to an Earth reference frame as previously described. In 1510, based upon the tracking points currently placed on the object and the coordinate system, a target location can be determined. The target can be placed over the real object which is tracked in live image data. As described above, a number and a location of the tracking points identified in an image can vary with time as the position of the camera changes relative to the object. Thus, the location of the target in the 2-D pixel data can change. A virtual object representing the target can be rendered into the live image data. In particular embodiments, a coordinate system may be defined based on identifying a position from the tracking data and an orientation from the IMU (or other) data.


In 1512, a track location in the live image data can be determined. The track can be used to provide feedback associated with a position and orientation of a camera in physical space during the image capture process for a MVIDMR. As an example, as described above, the track can be rendered in a drawing plane which is perpendicular to the gravity vector, such as parallel to the ground. Further, the track can be rendered relative to a position of the target, which is a virtual object, placed over a real object appearing in the live image data. Thus, the track can appear to surround or partially surround the object. As described above, the position of the target can be determined from the current set of tracking points associated with the real object appearing in the image. The position of the target can be projected onto the selected drawing plane.


In 1514, a capture indicator status can be determined. The capture indicator can be used to provide feedback in regards to what portion of the image data used in a MVIDMR has been captured. For example, the status indicator may indicate that half of the angle range of images for use in a MVIDMR has been captured. In another embodiment, the status indicator may be used to provide feedback in regards to whether the camera is following a desired path and maintaining a desired orientation in physical space. Thus, the status indicator may indicate whether the current path or orientation of the camera is desirable or not desirable. When the current path or orientation of the camera is not desirable, the status indicator may be configured to indicate what type of correction is needed, such as but not limited to moving the camera more slowly, starting the capture process over, tilting the camera in a certain direction and/or translating the camera in a particular direction.


In 1516, a capture indicator location can be determined. The location can be used to render the capture indicator into the live image and generate the synthetic image. In some embodiments, the position of the capture indicator can be determined relative to a position of the real object in the image as indicated by the current set of tracking points, such as above and to the left of the real object. In 1518, a synthetic image, i.e., a live image augmented with virtual objects, can be generated. The synthetic image can include the target, the track and one or more status indicators at their determined locations, respectively. The image data is stored at 1520. When image data is captured for the purposes of use in a MVIDMR, the stored image data can be raw image data without virtual objects or may include virtual objects.


In 1522, a check can be made as to whether images needed to generate a MVIDMR have been captured in accordance with the selected parameters, such as a MVIDMR spanning a desired angle range. When the capture is not complete, new image data may be received and the method may return to 1502. When the capture is complete, a virtual object can be rendered into the live image data indicating the completion of the capture process for the MVIDMR and a MVIDMR can be created. Some virtual objects associated with the capture process may cease to be rendered. For example, once the needed images have been captured the track used to help guide the camera during the capture process may no longer be generated in the live image data.



FIG. 16A and FIG. 16B illustrate aspects of generating an Augmented Reality (AR) image capture track for capturing images used in a MVIDMR, performed in accordance with one or more embodiments. In FIG. 16A, a mobile device 1616 with a display 1634 is shown. The mobile device can include at least one camera (not shown) with a field of view 1600. A real object 1602, which is a person, is selected in the field of view 1600 of the camera. A virtual object, which is a target (not shown), may have been used to help select the real object. For example, the target on a touch screen display of the mobile device 1616 may have been placed over the object 1602 and then selected.


The camera can include an image sensor which captures light in the field of view 1600. The data from the image sensor can be converted to pixel data. The pixel data can be modified prior to its output on display 1634 to generate a synthetic image. The modifications can include rendering virtual objects in the pixel data as part of an augmented reality (AR) system.


Using the pixel data and a selection of the object 1602, tracking points on the object can be determined. The tracking points can define the object in image space. Locations of a current set of tracking points, such as 1606, 1608 and 1610, which can be attached to the object 1602, are shown. As a position and orientation of the camera on the mobile device 1616 changes, the shape and position of the object 1602 in the captured pixel data can change. Thus, the location of the tracking points in the pixel data can change. Thus, a previously defined tracking point can move from a first location in the image data to a second location. Also, a tracking point can disappear from the image as portions of the object are occluded.


Using sensor data from the mobile device 1616, an Earth reference frame 3-D coordinate system 1604 can be associated with the image data. The direction of the gravity vector is indicated by arrow 1612. As described above, in a particular embodiment, the 2-D image data can be calibrated relative to the Earth reference frame. The arrow representing the gravity vector is not rendered into the live image data. However, if desired, an indicator representative of the gravity vector could be rendered into the synthetic image.


A plane which is perpendicular to the gravity vector can be determined. The location of the plane can be determined using the tracking points in the image, such as 1606, 1608 and 1610. Using this information, a curve, which is a circle, is drawn in the plane. The circle can be rendered into the 2-D image data and output as part of the AR system. As is shown on display 1634, the circle appears to surround the object 1602. In some embodiments, the circle can be used as a guide for capturing images used in a MVIDMR.


If the camera on the mobile device 1616 is rotated in some way, such as tilted, the shape of the object will change on display 1634. However, the new orientation of the camera can be determined in space including a direction of the gravity vector. Hence, a plane perpendicular to the gravity vector can be determined. The position of the plane and hence, a position of the curve in the image can be based upon a centroid of the object determined from the tracking points associated with the object 1602. Thus, the curve can appear to remain parallel to the ground, i.e., perpendicular to the gravity vector, as the camera 1614 moves. However, the position of the curve can move from location to location in the image as the position of the object and its apparent shape in the live images changes.


In FIG. 16B, a mobile device 1630 including a camera (not shown) and a display 1632 for outputting the image data from the camera is shown. A cup 1620 is shown in the field of view of the camera. Tracking points, such as 1622 and 1636, have been associated with the object 1620. These tracking points can define the object 1620 in image space. Using the IMU data from the mobile device 1630, a reference frame has been associated with the image data. As described above, in some embodiments, the pixel data can be calibrated to the reference frame. The reference frame is indicated by the 3-D axes and the direction of the gravity vector is indicated by arrow 1624.


As described above, a plane relative to the reference frame can be determined. In this example, the plane is parallel to the direction of the axis associated with the gravity vector, as opposed to perpendicular to it. This plane is used to prescribe a path for the MVIDMR which goes over the top of the object 1620. In general, any plane can be determined in the reference frame and then a curve, which is used as a guide, can be rendered into the selected plane.


Using the locations of the tracking points, in some embodiments a centroid of the object 1620 on the selected plane in the reference frame can be determined. A curve 1626, such as a circle, can be rendered relative to the centroid. In this example, a circle is rendered around the object 1620 in the selected plane.


The curve 1626 can serve as a track for guiding the camera along a particular path where the images captured along the path can be converted into a MVIDMR. In some embodiments, a position of the camera along the path can be determined. Then, an indicator can be generated which indicates a current location of the camera along the path. In this example, the current location is indicated by arrow 1628.


The position of the camera along the path may not directly map to physical space, i.e., the actual position of the camera in physical space doesn't have to be determined. For example, an angular change can be estimated from the IMU data and optionally the frame rate of the camera. The angular change can be mapped to a distance moved along the curve, where the distance moved along the path 1626 need not have a one-to-one ratio with the distance moved in physical space. In another example, a total time to traverse the path 1626 can be estimated and then the length of time during which images have been recorded can be tracked. The ratio of the recording time to the total time can be used to indicate progress along the path 1626.
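
As a minimal sketch of the angular-progress idea, yaw rate from the IMU might be integrated over time and compared with the target angle range. The sampling scheme, names, and default target angle are assumptions.

```python
import numpy as np

def capture_progress(gyro_yaw_rates, timestamps, target_angle_deg=360.0):
    """Integrate gyroscope yaw rate (rad/s) over time to estimate the angle
    traversed, then report progress as a fraction of the target angle range."""
    dt = np.diff(timestamps)
    swept = np.sum(np.abs(gyro_yaw_rates[1:]) * dt)        # radians swept so far
    fraction = np.degrees(swept) / target_angle_deg
    return min(fraction, 1.0)

# Example: 2 seconds of samples at a steady 0.5 rad/s yaw rate.
t = np.linspace(0.0, 2.0, 21)
rates = np.full_like(t, 0.5)
print(capture_progress(rates, t))   # about 57.3 degrees of 360 -> roughly 0.16
```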


The path 1626, which is an arc, and arrow 1628 are rendered into the live image data as virtual objects in accordance with their positions in the 3-D coordinate system associated with the live 2-D image data. The cup 1620, the circle 1626 and the arrow 1628 are shown output to display 1632. The orientation of the curve 1626 and the arrow 1628 shown on display 1632 relative to the cup 1620 can change if the orientation of the camera is changed, such as if the camera is tilted.


In particular embodiments, a size of the object 1620 in the image data can be changed. For example, the size of the object can be made bigger or smaller by using a digital zoom. In another example, the size of the object can be made bigger or smaller by moving the camera, such as on mobile device 1630, closer or farther away from the object 1620.


When the size of the object changes, the distances between the tracking points can change, i.e., the pixel distances between the tracking points can increase or can decrease. The distance changes can be used to provide a scaling factor. In some embodiments, as the size of the object changes, the AR system can be configured to scale a size of the curve 1626 and/or arrow 1628. Thus, a size of the curve relative to the object can be maintained.


In another embodiment, a size of the curve can remain fixed. For example, a diameter of the curve can be related to a pixel height or width of the image, such as 150 percent of the pixel height or width. Thus, the object 1620 can appear to grow or shrink as a zoom is used or a position of the camera is changed. However, the size of curve 1626 in the image can remain relatively fixed.


In the foregoing specification, reference was made in detail to specific embodiments including one or more of the best modes contemplated by the inventors. While various implementations have been described herein, it should be understood that they have been presented by way of example only, and not limitation. For example, some techniques and mechanisms are described herein in the context of MVIDMRs and mobile computing devices. However, the techniques disclosed herein apply to a wide variety of digital image data, related sensor data, and computing devices. Particular embodiments may be implemented without some or all of the specific details described herein. In other instances, well known process operations have not been described in detail in order to avoid unnecessarily obscuring the disclosed techniques. Accordingly, the breadth and scope of the present application should not be limited by any of the implementations described herein, but should be defined only in accordance with the claims and their equivalents.

Claims
  • 1. A method comprising: projecting via a processor a plurality of three-dimensional points onto first locations in a first image of an object captured from a first position in three-dimensional space relative to the object; projecting via the processor the plurality of three-dimensional points onto second locations associated with a virtual camera position located at a second position in three-dimensional space relative to the object; determining via the processor a first plurality of transformations, each of the first plurality of transformations linking a respective one of the first locations with a respective one of the second locations; based on the first plurality of transformations, determining via the processor a second plurality of transformations transforming first coordinates for the first image of the object to second coordinates for a second image of the object; and generating via the processor the second image of the object from the virtual camera position based on the first image of the object and the second plurality of transformations.
  • 2. The method of claim 1, wherein the first coordinates correspond to a first two-dimensional mesh overlain on the first image of the object, and wherein the second coordinates correspond to a second two-dimensional mesh overlain on the second image of the object.
  • 3. The method of claim 1, wherein the first image of the object is one of a first plurality of images captured by a camera moving along an input path through space around the object, and wherein the second image is one of a second plurality of images generated at respective virtual camera positions relative to the object.
  • 4. The method of claim 3, the method further comprising: determining a smoothed path through space around the object based on the input path; and determining the virtual camera position based on the smoothed path.
  • 5. The method of claim 1, wherein the plurality of three-dimensional points are determined at least in part via motion data captured from an inertial measurement unit at a mobile computing device.
  • 6. The method of claim 5, wherein the motion data includes data selected from the group consisting of: accelerometer data, gyroscopic data, and global positioning system (GPS) data.
  • 7. The method of claim 1, wherein the plurality of three-dimensional points are determined at least in part based on depth sensor data captured from a depth sensor.
  • 8. The method of claim 1, wherein the second plurality of transformations is generated via a neural network.
  • 9. The method of claim 8, wherein the first plurality of transformations are provided as reprojection constraints to the neural network.
  • 10. The method of claim 8, wherein the neural network includes one or more similarity constraints that penalize deformation of the first two-dimensional mesh via the second plurality of transformations.
  • 11. The method of claim 1, the method further comprising generating a multiview interactive digital media representation (MVIDMR) that includes the second set of images, the MVIDMR being navigable in one or more dimensions.
  • 12. The method of claim 1, wherein the second image of the object is generated via a neural network.
  • 13. The method of claim 1, wherein the processor is located within a mobile computing device that includes a camera, the first image being captured by the camera.
  • 14. The method of claim 1, wherein the processor is located within a mobile computing device that includes a camera, the first image being captured by the camera.
  • 15. A non-transitory computer-readable storage medium, the computer-readable storage medium including instructions that, when executed by a computer, cause the computer to: project via a processor a plurality of three-dimensional points onto first locations in a first image of an object captured from a first position in three-dimensional space relative to the object; project via the processor the plurality of three-dimensional points onto second locations for a virtual camera position located at a second position in three-dimensional space relative to the object; determine via the processor a first plurality of transformations, each of the first plurality of transformations linking a respective one of the first locations with a respective one of the second locations; based on the first plurality of transformations, determine via the processor a second plurality of transformations transforming first coordinates for the first image of the object to second coordinates for a second image of the object; and generate via the processor the second image of the object from the virtual camera position based on the first image of the object and the second plurality of transformations.
  • 16. A computing apparatus comprising: a processor; and a memory storing instructions that, when executed by the processor, configure the apparatus to: project via the processor a plurality of three-dimensional points onto first locations in a first image of an object captured from a first position in three-dimensional space relative to the object; project via the processor the plurality of three-dimensional points onto second locations for a virtual camera position located at a second position in three-dimensional space relative to the object; determine via the processor a first plurality of transformations, each of the first plurality of transformations linking a respective one of the first locations with a respective one of the second locations; based on the first plurality of transformations, determine via the processor a second plurality of transformations transforming first coordinates for the first image of the object to second coordinates for a second image of the object; and generate via the processor the second image of the object from the virtual camera position based on the first image of the object and the second plurality of transformations.
  • 17. The computing apparatus of claim 16, wherein the first image of the object is one of a first plurality of images captured by a camera moving along an input path through space around the object, and wherein the second image is one of a second plurality of images generated at respective virtual camera positions relative to the object.
  • 18. The computing apparatus of claim 17, wherein the instructions further configure the apparatus to: determine a smoothed path through space around the object based on the input path; and determine the virtual camera position based on the smoothed path.
  • 19. The computing apparatus of claim 16, wherein the second plurality of transformations is generated via a neural network, and wherein the first plurality of transformations are provided as reprojection constraints to the neural network.
  • 20. The computing apparatus of claim 19, wherein the neural network includes one or more similarity constraints that penalize deformation of the first two-dimensional mesh via the second plurality of transformations.
CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation-in-part of, and claims priority to U.S. patent application Ser. No. 17/351,104 (Atty Docket FYSNP079) by Chande, titled VIEWPOINT PATH MODELING, filed Jun. 17, 2021, which is hereby incorporated by reference in its entirety and for all purposes.

Continuation in Parts (1)
Parent: U.S. application Ser. No. 17/351,104, filed June 2021 (US)
Child: U.S. application Ser. No. 17/502,594 (US)