The method relates in general to video and image processing.
The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also be inventions.
In the prior art, a still picture or video is taken of a scene. At times, different images are taken from different perspectives, which may be shown at different times to give the viewer a more complete view of the scene. However, the different perspectives do not always give a complete picture. Also, combined images often have transitions that do not look natural.
A method and a system are provided for joining and stitching multiple images or videos taken from different locations, angles, and viewpoints. In this specification, the word image is generic to a video image or a still image. In an embodiment, a video panorama or a still image panorama may be automatically constructed from a single video or from multiple videos. Video images may be used for producing video or still panoramas, and portions of a single still image or of multiple still images may be combined to construct a still panorama. After the stitching and joining, a much larger video or image scenery may be produced than any one image or video from which the final scenery was produced. Some methods that may be used for joining and representing the final scene include both automatic and manual methods of stitching and/or joining images. The methods may include different degrees of adjusting features and of blending and smoothening the images that have been combined. The method may include a partial window and/or viewing ability and a self-correcting/self-adjusting configuration. The word “stitching” refers to joining images (e.g., having different perspectives) to form another image (e.g., of a different perspective than the original images from which the final image is formed). The system can be used for both still images and videos and can stitch any number of scenes without limit. The system can provide higher performance by “stitching on demand” only the videos that are required to be rendered based on the viewing system. The output can be stored in a file system, displayed on a screen, or streamed over a network for viewing by another user, who may have the ability to view a partial or a whole scene. The streaming of data refers to delivering data in packets that are in a format such that the packets may be viewed prior to receiving the entire message. Because streamed packets are presented (e.g., viewed) as they arrive, the information delivered appears as a continuous stream of information. The viewing system may include an ability to zoom, pan, and/or tilt the final virtual stitched image/video seamlessly.
Any of the above embodiments may be used alone or together with one another in any combination. Inventions encompassed within this specification may also include embodiments that are only partially mentioned or alluded to or are not mentioned or alluded to at all in this brief summary or in the abstract.
In the following drawings like reference numbers are used to refer to like elements. Although the following figures depict various examples of the invention, the invention is not limited to the examples depicted in the figures.
Although various embodiments of the invention may have been motivated by various deficiencies with the prior art, which may be discussed or alluded to in one or more places in the specification, the embodiments of the invention do not necessarily address any of these deficiencies. In other words, different embodiments of the invention may address different deficiencies that may be discussed in the specification. Some embodiments may only partially address some deficiencies or just one deficiency that may be discussed in the specification, and some embodiments may not address any of these deficiencies.
In general, at the beginning of the discussion of each of the figures, a brief description of each element of that figure is given; each element is then discussed in further detail in the paragraphs that follow.
Cameras 12, 14, and 16 may be video cameras, cameras that take still images, or cameras that take both still and video images. Each of cameras 12, 14, and 16 takes an image from a different perspective than the other cameras. Cameras 12, 14, and 16 may be used for photographing images from multiple perspectives. The images taken by cameras 12, 14, and 16 are combined to form a panorama. Although three cameras are illustrated by way of example, there may be any number of cameras (e.g., 1 camera, 2 cameras, 4 cameras, 8 cameras, 10 cameras, 16 cameras, etc.), each capturing images from a different perspective. For example, there may be only one camera, and multiple images may be taken from the same camera to form a panorama.
Input device 18 may be used for controlling and/or entering instructions into system 10. Output device 20 may be used for viewing output images of system 10 and/or for viewing instructions stored in system 10.
Processing system 24 processes input images by combining the input images to form output images. The input images may be from one or more of cameras 12, 14, and 16 and/or from another source. Processor 22 may combine images from at least two sources or may combine multiple images from the same source to form a still image or video panorama. A user may swipe a scene with a single video camera, which creates just one video. From this one video, system 10 may automatically extract various sequential frames and take multiple images from the video. In an embodiment, not every frame from the video is used. In another embodiment, every frame from the video is used. Then system 10 stitches the frames that were extracted into one large final panorama image. Consequently, one video input may be used to produce a panorama image output. Network 26 may be any Wide Area Network (WAN) and/or Local Area Network (LAN). Client system 28 may be any client network device, such as a computer, cell phone, and/or handheld computing device.
Although system 30 is depicted with a particular architecture, the depicted architecture is only one example.
Architectures other than that of system 30 may be substituted for the architecture of processing system 24 or client system 28. Output system 32 may include any one of, some of, any combination of, or all of a monitor system, a handheld display system, a printer system, a speaker system, a connection or interface system to a sound system, an interface system to peripheral devices, and/or a connection and/or interface system to a computer system, intranet, and/or internet, for example. In an embodiment, output system 32 may also include an output storage area for storing images, and/or a projector for projecting the output and/or input images.
Input system 34 may include any one of, some of, any combination of, or all of a keyboard system, a mouse system, a track ball system, a track pad system, buttons on a handheld system, a scanner system, a microphone system, a connection to a sound system, and/or a connection and/or interface system to a computer system, intranet, and/or internet (e.g., IrDA, USB), for example. Input system 34 may include one or more cameras, such as cameras 12, 14, and 16, and/or a port for uploading and/or receiving images from one or more cameras, such as cameras 12, 14, and 16.
Memory system 36 may include, for example, any one of, some of, any combination of, or all of a long term storage system, such as a hard drive; a short term storage system, such as random access memory; a removable storage system, such as a floppy drive or a removable USB drive; and/or flash memory. Memory system 36 may include one or more machine readable mediums that may store a variety of different types of information. The term machine-readable medium is used to refer to any medium capable of carrying information that is readable by a machine. One example of a machine-readable medium is a computer-readable medium. Another example of a machine-readable medium is paper having holes that, when detected, trigger different mechanical, electrical, and/or logic responses. All or part of memory system 36 may be included in processing system 24. Memory system 36 is also discussed further in conjunction with the other figures.
Processor system 38 may include any one of, some of, any combination of, or all of multiple parallel processors, a single processor, or a system of processors having one or more central processors and/or one or more specialized processors dedicated to specific tasks. Optionally, processor system 38 may include graphics cards (e.g., an OpenGL, a 3D acceleration, a DirectX, or another graphics card) and/or processors that specialize in, or are dedicated to, manipulating images and/or carrying out the methods described in this specification.
Communications system 42 communicatively links output system 32, input system 34, memory system 36, processor system 38, and/or input/output system 44 to each other. Communications system 42 may include any one of, some of, any combination of, or all of electrical cables, fiber optic cables, and/or means of sending signals through air or water (e.g., wireless communications), or the like. Some examples of means of sending signals through air and/or water include systems for transmitting electromagnetic waves, such as infrared and/or radio waves, and/or systems for sending sound waves.
Input/output system 44 may include devices that have the dual function of input and output devices. For example, input/output system 44 may include one or more touch sensitive screens, which display an image and therefore are an output device, and which accept input when the screens are pressed by a finger or stylus, for example. The touch sensitive screens may be sensitive to heat and/or pressure. One or more of the input/output devices may be sensitive to a voltage or current produced by a stylus, for example. Input/output system 44 is optional, and may be used in addition to or in place of output system 32 and/or input system 34.
System 90 may be a combination of hardware and/or software components. In an embodiment, system 90 is an embodiment of memory system 36, and each of the blocks represents a portion of computer code. In another embodiment, system 90 is a combination of processor system 38 and memory system 36, and each block in system 90 may represent hardware and/or a portion of computer code. In another embodiment, system 90 includes all or any part of systems 10 and/or 30. Stitching and viewing system 100 stitches images together. Configuration module 102 configures images and videos. Automatic stitcher 104 automatically stitches portions of images together. Each of points module 106, outer boundary mapping 108, graph based mapping 110, and moving-object-based stitching 112 performs a different type of alignment, and the alignments may be used as alternatives to one another and/or together with one another. Points module 106 joins two or more images or videos together based on 3 or 4 points in common between two images. Outer boundary mapping 108 may be used to manually and/or automatically align images and/or videos by matching the outer boundaries of objects. Graph based mapping 110 may form a graph of different images and/or videos, which are matched. The matching of graph based mapping 110 may perform a nonlinear mapping based on a mesh formed from the image and/or video. Moving-object-based stitching 112 may perform an automatic stitching based on a common moving object. Depth adjustment 114 may adjust the depth and place different images at different levels of depth.
Returning to the discussion of configuration module 102, the mapping is a transformation that an image goes through when it is aligned in the final panorama. For example, Image/Scene 1 may be transformed linearly when it is merged into, or applied to, the final resulting panorama image. A perspective transform is a more complex, non-affine transformation from the original image to the final panorama. For a simple scene or panorama, a linear mapping may be applied. For roads, complex roads, or for looking at a distance, a visual perspective mapping may be applied to make the panorama appear aesthetically pleasing and realistic. A perspective is a non-affine transformation determined by geometric principles applied to a two dimensional image. For example, the same car or person will look bigger or taller at a near distance and smaller at a further distance. A graph or mesh transformation may be applied to more complex, hard-to-align panorama images, for example where, as with a fish eye lens, there are lens distortions, or where there is a combination of lens distortions and changes that account for different perspectives. Then the images are joined via mesh graphs. The mesh nodes may be aligned manually or automatically. Inside each triangle of the mesh, a perspective or nonlinear transformation may be applied. In a mesh, the image is divided into segments, and each triangle segment is transformed individually.
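By way of a non-limiting illustration, the following is a minimal sketch of the two simpler mapping types, assuming Python with the OpenCV and NumPy libraries; the file name and target coordinates are hypothetical examples, not part of the method described above.

```python
import cv2
import numpy as np

img = cv2.imread("scene1.jpg")  # hypothetical source image
h, w = img.shape[:2]

# Affine (linear) mapping: three corresponding point pairs fully
# determine the transform, which is applied uniformly to every pixel.
src3 = np.float32([[0, 0], [w - 1, 0], [0, h - 1]])
dst3 = np.float32([[10, 20], [w - 5, 8], [4, h - 12]])
A = cv2.getAffineTransform(src3, dst3)
affine_result = cv2.warpAffine(img, A, (w, h))

# Perspective (non-affine) mapping: four corresponding point pairs
# determine a homography that foreshortens the image as it would
# appear from a different viewpoint.
src4 = np.float32([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]])
dst4 = np.float32([[30, 40], [w - 20, 10], [w - 1, h - 1], [0, h - 30]])
P = cv2.getPerspectiveTransform(src4, dst4)
perspective_result = cv2.warpPerspective(img, P, (w, h))
```

A mesh-based mapping can be built from the same primitives by dividing the image into triangles and warping each triangle individually (see the mesh warp sketch later in this specification).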
Rendering system 92 renders the panorama image created. Output and viewing system 94 may allow the user to output and view the panorama created with system 90 on a screen or monitor. Rendering system 92 may produce a still image panorama or a video panorama; Video Panoramas (VPs) may support stitching of different types of videos, cameras, and images. In the case of still images and/or videos, it may be possible to view the stitched panorama in a separate window. For rendering the panorama on a screen, two types of renderers may be used: a hardware renderer and/or a software renderer. The hardware renderer is faster and uses functions and libraries that are based on OpenGL, 3D acceleration, DirectX, or another graphics standard. On machines having a dedicated OpenGL, 3D acceleration, DirectX, or other graphics card, the Central Processing Unit (CPU) usage is considerably less than on systems that do not have a dedicated graphics card, and the rendering is also faster. A software renderer may require more CPU usage, because its rendering uses the operating system's (e.g., Windows®) functions for normal display. In an embodiment, the user may view the original videos in combination with the stitched stream. In an embodiment, the final panorama can be resized, zoomed, and/or stretched for better display.
Using output and viewing system 94, remote viewing may be facilitated by a Video Panorama (VP) system, which may support at least two kinds of network streams: a Transmission Control Protocol/User Datagram Protocol (TCP/UDP) (or another protocol) based server and client, and a web-based server and client. The TCP/UDP based server and client may be used for sending the VP stream over a Local Area Network (LAN), and the web-based server and client may be used for sending the VP stream over the internet.
When using a TCP/UDP based server as output and viewing system 94, the user can select the port on which the user wants to send the data. The user can select the streaming type, such as RGB, JPEG, MPEG4, H.26x, custom compression formats, and/or other formats. JPEG is faster to send in a data stream than RGB raw data. Sockets (pointers to internal addresses, often referred to as ports, that are based on protocols for making connections to other devices for sending and/or receiving information) associated with the operating system (e.g., Windows® sockets) may be used to send and receive data over the network. Initially, when the user connects to a client, the TCP protocol is used, because the TCP protocol can give an acknowledgement of whether the server has successfully connected to the client or not. Until the server receives an acknowledgement of a successful connection, the server does not perform any further processing. System 10 (a VP system) may send some server-client specific headers for the handshaking process. Once system 10 receives the acknowledgment, another socket may be opened that uses the UDP protocol for transferring the actual image data. UDP has an advantage when sending the actual image, because UDP does not require the server to know whether the client received the image data or not. When using UDP, the server may start sending the frames without waiting for the client's acknowledgement, which not only improves the performance, but also may facilitate sending the frames at a higher rate (e.g., more frames per second). Also, to make the sending of data even faster, the scaling of image data (and/or other manipulations of the image) may be performed before sending the data over the network.
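As a non-limiting illustration of the handshake-then-stream pattern described above, the following minimal sketch assumes Python's standard socket module; the addresses, ports, and header strings (VP-HELLO, VP-ACK) are hypothetical placeholders, not an actual protocol of the system.

```python
import socket

CLIENT_HOST = "192.0.2.10"   # hypothetical client address
CTRL_PORT, DATA_PORT = 9000, 9001

# 1) Connect over TCP, which acknowledges whether the connection to the
#    client succeeded; no further processing happens until the ACK arrives.
ctrl = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
ctrl.connect((CLIENT_HOST, CTRL_PORT))
ctrl.sendall(b"VP-HELLO\n")                  # server-client specific header
if ctrl.recv(16).strip() != b"VP-ACK":
    raise ConnectionError("client did not acknowledge the handshake")

# 2) Open a second socket using UDP for the actual image data; no
#    per-frame acknowledgement is required, so frames flow at full rate.
data = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def send_frame(jpeg_bytes: bytes, chunk_size: int = 60000) -> None:
    # Scale/compress before sending; split into datagram-sized chunks.
    for i in range(0, len(jpeg_bytes), chunk_size):
        data.sendto(jpeg_bytes[i:i + chunk_size], (CLIENT_HOST, DATA_PORT))
```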
On a web and/or LAN based server associated with output and viewing system 94, the user may select the port on which the user wants to send the data. The user may be presented with the option of selecting the format of the streaming data, which may be RGB, MJPEG, MPEG4, H.26x, custom, or another format. MJPEG may be suggested to the user and/or presented as a default choice, because sending MJPEG is faster than sending RGB raw data. The operating system's sockets may be used to send and/or receive data over the internet. The transmission protocol used by the web and/or LAN based server may be TCP. In an embodiment, system 10 may support around 10 simultaneous clients. In another embodiment, system 10 may support an unlimited number of clients. In an embodiment, only JPEG compression is used for sending MJPEG data. In another embodiment, MPEG4 compression may be used with TCP and/or Real Time Streaming Protocol (RTSP) for better performance and an improved rate of sending frames when compared with MJPEG. In an embodiment, ActiveX based clients are used for both TCP and web servers. The clients that can process ActiveX instructions (or another programming standard that allows instructions to be downloaded for use by a client) can be embedded in webpages, dialog boxes, or any user required interface. The web based client is generic to many different types of protocols. The web based client can capture standard MJPEG data not only from the VP web server, but also from other Internet Protocol (IP) cameras, such as Axis, Panasonic, etc. The resulting panorama video can be viewed over network 26 by a client application on client system 28 using various methods. In one method, the panorama video may be viewed using any standard network video parser application. Video parsing applications may be used for viewing the panorama, because the video panorama supports most of the standard video formats used for video data transfer over the network. Panorama videos may be viewed with an ActiveX viewer or another viewer enabled to accept and process code (an ActiveX viewer is available from IntelliVision). The viewer may be a client side viewer (e.g., an ActiveX client-side viewer), which may be embedded into any HyperText Markup Language (HTML) page or another type of webpage (e.g., a page created using another markup language). The viewer may be written in a language such as C++ as a Windows application (or in another programming language and/or as an application for another operating system). The panorama video may also be viewed using a new application written from scratch; in an embodiment, the viewer may support standard formats for data transfer and may also provide a C++ based Application Programming Interface (API). The panorama video may be viewed using the DirectShow filter provided by IntelliVision. The DirectShow filter is part of the Microsoft DirectX and DirectDraw family of interfaces. DirectShow is applied to video, and helps the hardware and the operating system optimize the display and pass video data efficiently and quickly. If a system outputs DirectShow interfaces, other systems that recognize DirectShow can automatically understand, receive, and display the images and videos.
The panorama may be resizable, and may be stretched for better display. Alternatively, if the size is too big, then the scene can be reduced and focus can be shifted to a particular area of interest. It is also possible to zoom in and out on the panorama. In an embodiment, panning and/or tilting of the resulting panorama output may also be supported by the system.
A resulting panorama video may be so large that it is difficult to show the complete panorama on a single monitor unless a scaling operation is performed to reduce the size of the image. However, scaling down may result in a loss of detail and/or may not always be desirable for other reasons. Hence, the user may want to focus on a specific region. The user may also want to tilt and/or rotate the area being viewed.
The video panorama may support a variety of operations. For example, focus may be directed to only a smaller part of the resulting panorama (viewing only a small part of the panorama is often referred to as zooming). The system may also provide a high quality digital zoom that shows output that is bigger than the actual capture resolution, which may be referred to as super resolution. The super resolution algorithm may use a variety of interpolations and/or other algorithms to compute the extra pixels that are not part of the original image. The system may be capable of changing the area under focus (which is referred to as panning). The user can move the focus window to any suitable position in the resultant panorama. Output and viewing system 94 may allow the user to rotate the area under focus (which is referred to as tilt). In an embodiment, output and viewing system 94 of system 90 may support a 360 degree rotation of the area under focus.
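A minimal sketch of a zoom-and-pan focus window follows, assuming Python with OpenCV and using bicubic interpolation as a simple stand-in for the super resolution interpolation described above; the function name and default output size are hypothetical.

```python
import cv2

def digital_zoom(panorama, cx, cy, zoom, out_size=(1280, 720)):
    """Crop a focus window centered near (cx, cy) and upscale it.
    Interpolation computes the extra pixels that are not part of the
    original image."""
    out_w, out_h = out_size
    win_w, win_h = int(out_w / zoom), int(out_h / zoom)
    x0 = max(0, min(panorama.shape[1] - win_w, cx - win_w // 2))
    y0 = max(0, min(panorama.shape[0] - win_h, cy - win_h // 2))
    roi = panorama[y0:y0 + win_h, x0:x0 + win_w]
    return cv2.resize(roi, out_size, interpolation=cv2.INTER_CUBIC)
```

Panning amounts to moving (cx, cy); tilt can be added by rotating the window (e.g., with cv2.getRotationMatrix2D) before cropping.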
In many video streams captured from live cameras, the elements of the image change infrequently (e.g., the camera may be mounted facing a secure area that very few people are permitted to enter). So, most of the time, frames in the video captured from the camera will be almost the same, which may be true for at least some parts of the video. That is, some parts may have frequent changes, but some parts will change less frequently.
System 90 may be capable of understanding and distinguishing that there are no changes in a certain part of the video, and therefore the video panorama system does not render that part of that frame in the resulting panorama video. Also, only the updated data is sent over the network. Not rendering the parts of the image that do not change and only transmitting the changes reduces the processing and results in less Central Processing Unit (CPU) usage than if the entire image were rendered and transmitted. Only sending the changes also reduces the data sent on the network and assists in sending video at a rate of at least 30 Frames Per Second (FPS) over the network.
The video panorama system also understands and identifies the changes in each of the video frames and renders the changing parts accurately in the resulting panorama view (and in that way can be called intelligent). The changing part may also be sent over network 26 after being rendered. Sending just the changes facilitates sending high quality, high resolution panorama video over network 26.
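As a non-limiting illustration of detecting which parts of a frame have changed, the following minimal sketch assumes Python with NumPy; the block size and threshold are hypothetical tuning parameters.

```python
import numpy as np

def changed_blocks(prev_frame, curr_frame, block=32, threshold=12.0):
    """Return top-left corners (x, y) of blocks whose mean absolute
    pixel difference exceeds the threshold; only these blocks need to
    be re-rendered and re-sent over the network."""
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    h, w = diff.shape[:2]
    dirty = []
    for y in range(0, h, block):
        for x in range(0, w, block):
            if diff[y:y + block, x:x + block].mean() > threshold:
                dirty.append((x, y))
    return dirty
```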
An option may be provided for saving the panorama as a still image on a hard disk or other storage medium, which may be associated with memory system 36. The user may be given the option to save the panorama still image in any standard image format. A few examples of standard image formats are JPEG, bitmap, GIF, etc. Another option may be offered to the user for saving panorama videos. Using the option to save panorama videos, the user may be able to save the stitched panorama videos on a hard disk or other storage medium. The user may be able to save the panorama in any standard video format. A few examples of standard video formats are AVI, MPEG-2, MPEG-4, etc.
Once the user determines the settings for making a final panorama from a group of source images, the user may be offered the option of saving those settings to a file, which may contain some or all of the information for the panorama stitching, rendering, and joining. Using the panorama data file, the next time the same set of cameras is located at the same positions, the settings can be loaded automatically. The details of which images and/or videos were used to create the panorama, as well as the actual stitched output image, may be stored in this data file.
System 90 may include other features. For example, system 90 may self-adjust and/or self-correct stitching over time. System 90 may adjust the images and/or videos to compensate for camera shakes and vibrations. In an embodiment, the positions and/or angles of the images or videos may be adjusted to keep the titles and imprinted letters or text in fixed positions. If two points or nodes that are the same can be automatically found in different images, then the system will automatically snap the two images together and align them with each other, which is referred to as self-adjusting. The self-adjusting may be performed via automatic recognition and point correlation, which may use template matching and/or other point or feature matching techniques to identify corresponding points. If points that are the same are matched, then system 90 can align the images and self-correct the alignment (if the images are not aligned correctly).
In an embodiment, system 90 self-adjusts and self-corrects stitching over time. System 90 can review the motions of objects and the existence of objects to determine whether an object has been doubled and whether an object has disappeared. Both object doubling and object disappearance may be the results of errors in the panorama stitching. By using object motions, object doubling and object disappearance can be automatically detected. Then an offset and/or adjustment may be applied to reduce or eliminate the double appearance or the missing object. Other errors may also be detectable. Hence, the panorama stitching mapping can be adjusted and corrected over time by observing and finding errors.
In an embodiment, system 90 may adjust for camera shakes and vibrations. The cameras can be in different locations and can move independently. Consequently, some cameras may shake while others do not. Video stabilization can be enabled so that, even though the camera may still be shaking, the appearance of shakes in the image can be reduced and/or stopped. Stabilization of the image uses feature points, edges, optical flow points, and templates to find the mapping of the features or areas and to see whether the areas have moved. Consequently, individual movements in the image that result from camera movements or shakes can be arrested to get a better visual effect. Templates are small images, matrixes, or windows. For example, templates may be 3×3, 4×4, 5×5, or 10×10 arrays of pixels. Each array of pixels that makes up a template is matched from one image to another image. Matching templates may be used to find corresponding points in two different images, or to correlate a point, feature, or node in one image to the corresponding point, feature, or node in the other image. For the window formed around a point, the characteristics of the window are determined. For example, the pixel values, a signature, or image values for the pixels are extracted. Then the characteristics of a similar template on another image (which may be referred to as a target image) are determined, and a comparison is made to determine whether there is a match between the templates. A match between templates may be determined based on a match of the colors, gradients, edges, textures, and/or other characteristics.
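A minimal sketch of template matching follows, assuming Python with OpenCV; the function name is hypothetical, and normalized squared-difference matching is just one of the characteristics-based comparisons mentioned above.

```python
import cv2

def find_template(target_gray, template_gray):
    """Slide a small template window (e.g., a 10x10 array of pixels)
    over the target image and return the (x, y) of the best match.
    With TM_SQDIFF_NORMED the best match is the minimum score."""
    scores = cv2.matchTemplate(target_gray, template_gray,
                               cv2.TM_SQDIFF_NORMED)
    _, _, min_loc, _ = cv2.minMaxLoc(scores)
    return min_loc  # top-left corner of the matched window
```

The displacement of matched windows from frame to frame estimates how much the camera moved, and shifting the image by the opposite amount arrests the shake.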
In an embodiment, system 90 may be capable of keeping the titles and imprinted letters or text in a fixed position, or of removing them. Sometimes text, closed captions, or titles may be placed in the individual images. This text or these titles may be removed, repositioned, or aligned in a particular place. The text location, size, and color may be used to detect the text. Then the text may be removed or replaced. The text can be removed by obtaining the information hidden by the text and negating the effect of the inserted image, in order to make the text disappear. Additionally, new text, a new title, or a new closed caption can be created in the final panorama in addition to, or in place of, the text in the original image.
In an embodiment, each of the steps of method 200A is a distinct step. In another embodiment, although depicted as distinct steps, the steps of method 200A may not be distinct steps, may be combined or divided into sub-steps, and/or may be performed in another order.
Multiple images that are input from a single video may also be used. For example, a single camera may be rotated (e.g., by 180 degrees or by a different amount) one or multiple times while filming. The video camera may swipe a scene or gently pan around and capture a scene. Sequences of images from each rotation may be combined to form one or more panorama output images. As an example, the video may be divided into multiple periods of time that are shorter than the entire pan or rotation, and one can collect multiple images in which each image comes from a different one of the periods. For example, one frame may be taken every N frames, every 0.25 seconds, or even every frame. Then the images may be taken as individual images and joined into a panorama image as output. In other words, a user may swipe the scene with a single video camera, which creates just one video. From this one video, system 10 may automatically extract various sequential frames and take multiple images from the video. In an embodiment, not every frame from the video is used. In another embodiment, every frame from the video is used.
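The frame extraction just described may be sketched as follows, assuming Python with OpenCV; the function name and sampling interval are hypothetical.

```python
import cv2

def extract_frames(video_path, every_n=15):
    """Collect one frame every N frames from a single swept video;
    the collected frames can then be stitched into a panorama."""
    cap = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % every_n == 0:  # in an embodiment, not every frame is used
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```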
In step 203, mappings are estimated as part of the configuration stage. The mappings may unambiguously specify the position of each source image point or video image point in the final panorama. There are at least three types of mappings that may be used: (1) affine mappings (which are linear), in which a linear transformation is applied uniformly to each point of the image; (2) perspective mappings (which are non-linear), in which the transformation applied foreshortens the image according to the way the image would appear from a different perspective; and (3) graph and mesh-based mappings (which are non-linear), in which a mesh is superimposed over an image and then the nodes of the mesh are moved, thereby distorting the mesh and causing a corresponding distortion in the image. To estimate the final mapping for each source, it is sufficient to estimate the mapping between the pairs of overlapping source images or videos.
The problem of mapping estimation between a pair of overlapping source images can be formulated as follows. Mapping estimation requires finding the mapping from one image to the other image such that the objects visible in the images are superposed in a manner that appears realistic and/or as though the image came from just a single frame of just one camera. At least three ways of initially estimating the mapping may be used: manual alignment, a combination of manual alignment and auto refinement, and fully automatic alignment. In the case of a manual mapping estimation, the mapping between the pair of images is specified by the user. At least two options are possible: manually selecting corresponding feature points, or manually aligning each of the images as a whole.
In the case of estimating a mapping via manual alignment plus auto refinement, the initial mapping is specified by the user as described above (regarding manual mapping). To reduce the user interaction and increase the accuracy of the estimation, the manual stage is followed by an auto refinement procedure for refining the initial manual mapping.
A fully automatic mapping estimation may also be implemented. Unique features are extracted from each source image. For example, edges, individual feature points, or feature areas may be identified as unique features. The edges in a scene that may be used as unique features are those that are easily recognizable or easily identifiable, such as edges that are associated with a high contrast between two regions, each region being on a different side of the edge. Feature points or feature areas may be represented by one of several different methods. In one method, a small template window having an M×N matrix of pixels (with color values in RGB, YUV, HSL, or another color space) within which the feature is located may be established to identify a feature. In another method, a unique edge map located within the M×N matrix may be associated with a particular feature and used to locate that feature. Scale-invariant feature transforms or high curvature points may be used to identify certain features. In other words, features are identified that are expected not to change as the size of the image changes. For example, the ratio of sizes of features may be identified. Special corners or intersection points of one or more lines or curves may identify certain features. The boundary of a region may be used to identify the region, which may be used as one of the unique features.
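A minimal sketch of extracting several of the unique feature types named above, assuming Python with OpenCV (the SIFT detector requires OpenCV 4.4 or later); the file name and parameter values are hypothetical.

```python
import cv2

gray = cv2.cvtColor(cv2.imread("source.jpg"), cv2.COLOR_BGR2GRAY)

# Easily recognizable edges: high contrast between the two sides.
edges = cv2.Canny(gray, threshold1=100, threshold2=200)

# High curvature corner points as unique feature points.
corners = cv2.goodFeaturesToTrack(gray, maxCorners=200,
                                  qualityLevel=0.01, minDistance=10)

# Scale-invariant features, expected not to change with image size.
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)
```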
In step 204, after the points or small features are extracted, all the feature pairs (each image having one member of the pair) that represent exactly the same object are identified. After identifying the pairs, the mapping is estimated. Optionally, the estimated mapping may be refined by applying the mapping refinement procedure. The mapping refinement procedure estimates a more accurate mapping (than the initial mapping) given a rough initial mapping as input. The more accurate mapping may be determined via the following steps.
In step 206, easily identifiable features (such as the unique features discussed above) are identified on one of the images (if features are identified manually, the system will refine the mapping automatically).
In step 208, a feature correlation and matching method is applied, such as template matching, edge matching, optical flow, mean shift, or histogram matching. In step 210, once the feature points on one image and the corresponding features on the other image have been identified more accurately, an estimate of a more accurate mapping may be determined. After step 210, the method continues with method 300.
In an embodiment, each of the steps of method 200B is a distinct step. In another embodiment, although depicted as distinct steps, the steps of method 200B may not be distinct steps, may be combined or divided into sub-steps, and/or may be performed in another order.
In an embodiment, each of the steps of method 300 is a distinct step. In another embodiment, although depicted as distinct steps, the steps of method 300 may not be distinct steps, may be combined or divided into sub-steps, and/or may be performed in another order.
In an embodiment, each of the steps of method 400 is a distinct step. In another embodiment, although depicted as distinct steps, the steps of method 400 may not be distinct steps, may be combined or divided into sub-steps, and/or may be performed in another order.
In step 504, the mapping may be estimated by solving a linear system of equations. A standard set of linear equations may be solved for the transformation matrix that maps the second image to exactly match and align with the first image.
The features or points may have been computed automatically or may have been manually suggested by the user (as mentioned above). In both cases the feature points can be imprecise, which leads to an imprecise mapping. For a more accurate mapping, it is possible to use more than three point pairs (for an affine mapping) or more than four point pairs (for a perspective mapping). In this case, the mapping that minimizes the sum of squares of distances (or a similar error measure) between the actual points on the second image and the points from the first image mapped onto the second image is estimated. The mapping may thereby be estimated in a way that is more precise and robust to inaccurate point coordinates. If more than three points are available for the affine mapping, a least square fit may be used. Similarly, if more than four points are available for the perspective mapping, a least square fit or a similar error minimization method may be used.
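A minimal sketch of the least square fit for an affine mapping follows, assuming Python with NumPy; the point coordinates are hypothetical examples.

```python
import numpy as np

def estimate_affine(src_pts, dst_pts):
    """Least-squares affine mapping from three or more point pairs.
    Solves [x y 1] @ M = [x' y'] for the 3x2 matrix M; with more than
    three pairs, lstsq minimizes the sum of squared distances."""
    src = np.asarray(src_pts, dtype=np.float64)
    dst = np.asarray(dst_pts, dtype=np.float64)
    A = np.hstack([src, np.ones((len(src), 1))])   # N x 3
    M, _, _, _ = np.linalg.lstsq(A, dst, rcond=None)
    return M

# Four slightly imprecise point pairs still yield a robust estimate.
src = [(0, 0), (100, 0), (0, 100), (100, 100)]
dst = [(3, 5), (104, 2), (1, 107), (103, 103)]
M = estimate_affine(src, dst)
new_xy = np.array([50.0, 50.0, 1.0]) @ M   # map a point with the result
```

The perspective case is analogous but solves for the eight parameters of a homography, typically also by minimizing a squared error.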
In an embodiment, each of the steps of method 500 is a distinct step. In another embodiment, although depicted as distinct steps, the steps of method 500 may not be distinct steps, may be combined or divided into sub-steps, and/or may be performed in another order.
In an embodiment, each of the steps of method 600 is a distinct step. In another embodiment, although depicted as distinct steps, the steps of method 600 may not be distinct steps, may be combined or divided into sub-steps, and/or may be performed in another order.
Step 804 may be a sub-step of step 802. In step 806, the graph or mesh is stretched individually by moving the nodes. One difference between method 600 and method 800 is that method 600 concerns the outer boundary based stretching and aligning, whereas method 800 shows how to move the images in a non-linear way and align the individual graph nodes or mesh. Most (possibly all) of the mesh may be made from triangles, quads, and/or other polygons. Each of the nodes or points of the graph or mesh can be moved and/or adjusted. Each of the triangles, quads, and/or other polygons can be moved, stretched, and/or adjusted. These adjustments occur inside the image and may not affect the boundary. The adjustments may be restricted by one or more constraints. Some examples of constraints are: one or more points may be locked and prohibited from being moved; some points may be allowed to move, but only within a certain limited region; and some points may be confined to being located along a particular trajectory. Some points may be considered floating, in that these points are allowed to be moved. If each point is a node in a mesh, moving just one point without moving the other points distorts the mesh or graph. Some points may be allowed to move relative to the canvas or relative to background regions of the picture, but may be constrained to maintain a fixed location relative to a particular set of one or more other points. By constraining the image (e.g., by locking or restricting the movement of some points with respect to one another and/or with respect to the canvas) while allowing other points to move, the user may sometimes create a very powerful and highly complex non-linear mapping that is not possible to perform automatically.
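As a non-limiting illustration, one triangle of the mesh may be warped as follows, assuming Python with OpenCV and NumPy; the function name is hypothetical. Locked nodes are realized simply by not moving them before the warp; floating nodes may be moved first.

```python
import cv2
import numpy as np

def warp_triangle(src_img, dst_img, src_tri, dst_tri):
    """Warp one mesh triangle from the source image onto its (possibly
    moved) counterpart in the destination image; repeating this for
    every triangle realizes the non-linear graph/mesh mapping."""
    src_tri, dst_tri = np.float32(src_tri), np.float32(dst_tri)
    xs, ys, ws, hs = cv2.boundingRect(src_tri)
    xd, yd, wd, hd = cv2.boundingRect(dst_tri)
    patch = src_img[ys:ys + hs, xs:xs + ws]
    # Affine transform between the two triangles (local coordinates).
    M = cv2.getAffineTransform(src_tri - np.float32([xs, ys]),
                               dst_tri - np.float32([xd, yd]))
    warped = cv2.warpAffine(patch, M, (wd, hd))
    # Paste only the pixels inside the destination triangle.
    mask = np.zeros((hd, wd), dtype=np.uint8)
    cv2.fillConvexPoly(mask, np.int32(dst_tri - np.float32([xd, yd])), 255)
    roi = dst_img[yd:yd + hd, xd:xd + wd]
    roi[mask > 0] = warped[mask > 0]
```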
In an embodiment, each of the steps of method 800 is a distinct step. In another embodiment, although depicted as distinct steps, the steps of method 800 may not be distinct steps, may be combined or divided into sub-steps, and/or may be performed in another order.
In contrast, graph or mesh based stitching is based on the fixed and non-moving parts of the image. For example, mesh/graph based stitching will use a door corner, edges on the floor, trees, or parked cars as nodes. The points that mesh based stitching uses to align the images are fixed features. In contrast to the graph or mesh based stitching, moving-feature based stitching uses other clues on how to stitch and align images, which are moving objects or moving features, such as a person walking or a car moving. Points on the moving object can also be used to align two images.
Motion may be detected in each video, and corresponding matching motions and features may be aligned. The moving features that may be used for matching moving objects may include corners, edges, and/or other features on the moving objects. By matching corresponding moving features, a determination may be made whether two moving objects are the same, even if the two videos show different angles, distances, and/or views. Thus, two images may be aligned based on corresponding moving objects, and corresponding moving objects may be determined based on corresponding moving features.
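A minimal sketch of detecting moving objects whose features can then be matched across videos, assuming Python with OpenCV; the function name, minimum area, and matching-by-centroid strategy are hypothetical simplifications.

```python
import cv2

# One subtractor per video; this sketch shows a single video's detector.
subtractor = cv2.createBackgroundSubtractorMOG2()

def moving_centroids(frame, min_area=200):
    """Segment moving objects with background subtraction and return
    their centroids; pairing centroids (and corner/edge features on
    the objects) across two videos yields alignment points."""
    mask = subtractor.apply(frame)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    points = []
    for c in contours:
        if cv2.contourArea(c) >= min_area:
            m = cv2.moments(c)
            points.append((m["m10"] / m["m00"], m["m01"] / m["m00"]))
    return points
```

Pairing objects whose motion tracks correspond over time in the two videos gives the point correspondences used to estimate the mapping.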
In an embodiment, each of the steps of method 1000 is a distinct step. In another embodiment, although depicted as distinct steps, the steps of method 1000 may not be distinct steps, may be combined or divided into sub-steps, and/or may be performed in another order.
In an embodiment, each of the steps of method 1100 is a distinct step. In another embodiment, although depicted as distinct steps, the steps of method 1100 may not be distinct steps, may be combined or divided into sub-steps, and/or may be performed in another order.
The initial pixels, or two dimensional array of points, are at fixed X and Y coordinates that have integer values. Optionally, each X and Y location may be associated with a depth or Z value. A transformation may be applied that represents a change in perspective, depth, and/or view. During the transformation, each of the integer X and Y values may be mapped to a new X and Y value, which may not be an integer value. The new X and Y values may be based on the Z value of the original X and Y location. Then the pixel values at the integer locations are determined based on the pixel values at the non-integer locations, since the final result is also a two dimensional array in which all pixels are at integer valued X and Y locations. The floating point values and Z values of points are used only for intermediate calculations; the final result is only a two dimensional image.
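A minimal sketch of recovering a pixel value at a non-integer location, assuming Python with NumPy; boundary handling is omitted for brevity.

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Blend the four integer-located neighbors of a non-integer (x, y)
    by their fractional distances to estimate the pixel value there."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    fx, fy = x - x0, y - y0
    p = img[y0:y0 + 2, x0:x0 + 2].astype(np.float64)  # 2x2 neighborhood
    top = p[0, 0] * (1 - fx) + p[0, 1] * fx
    bottom = p[1, 0] * (1 - fx) + p[1, 1] * fx
    return top * (1 - fy) + bottom * fy
```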
During the setup stage, the map may be filled with the actual values in the following way. Each pixel of the panorama may be back-projected into each source image coordinate frame. If the projected pixel lies outside all of the source images, then the corresponding map element ID may be set to 0. During the rendering, such pixels may be filled with a default background color. Otherwise, for those pixels that have corresponding pixels in the source images, there will be one or more source images that overlap with the panorama pixel that is being considered. Among all of the source images having points that overlap with the pixel, the topmost source image is selected, and the corresponding map element ID is set to the ID determined by the selected image. The map element offset value is a difference between two pixel values that are located at the same location. The offset is the amount by which the pixel value must be increased or decreased from its current value. The offset may be the difference in value between a pixel of the topmost image and a pixel of the current source image, which must be added to the topmost pixel so that the image has a uniform appearance. For example, the topmost image may be too bright, and consequently the topmost pixel may need to be dimmed.
The setup only needs to be performed once for each configuration. After the setup, the fast rendering can be performed an arbitrary number of times, as follows. For each panorama pixel, the source image ID and source image pixel offset are stored in the map. Each final panorama pixel is filled with the color value from the given source image taken at the given offset. The color value is pre-computed to avoid performing a run-time computation. If the source image ID is 0, then the pixel is filled with the background color.
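A minimal sketch of the map-driven fast rendering, assuming Python with NumPy; the map layout (parallel arrays keyed by "id", "src_y", "src_x", and a per-pixel brightness "offset") is a hypothetical representation of the map described above.

```python
import numpy as np

def render(panorama_map, sources, background=(0, 0, 0)):
    """Fill each panorama pixel from the source image named by its map
    ID (0 = background), at the stored source coordinates, adjusted by
    the precomputed brightness offset; no mapping math at run time."""
    h, w = panorama_map["id"].shape
    out = np.empty((h, w, 3), dtype=np.uint8)
    out[:] = background
    for img_id, src in sources.items():        # sources: {id: image array}
        sel = panorama_map["id"] == img_id
        ys, xs = panorama_map["src_y"][sel], panorama_map["src_x"][sel]
        vals = src[ys, xs].astype(np.int16)
        vals += panorama_map["offset"][sel][:, None]   # uniform appearance
        out[sel] = np.clip(vals, 0, 255).astype(np.uint8)
    return out
```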
In other words, for one frame (e.g., the first frame), a transformation for each pixel of each image is obtained from each source image to the topmost source image, for example. The transformation is used to determine transformations for each pixel in each source image for a desired view, which may not correspond to any source image, and then the same transformation is applied to all subsequent frames.
Computing the pixel values based on the offset is fast, but the interpolation between pixels may result in an image that has some seams or somewhat noticeably unnatural transitions. To render a smoother image, instead of using the same integer offset value for each pixel, each map element may contain floating point values for the X and Y coordinates of the corresponding source image pixel. During the rendering, the source image pixel's neighborhood is used as a basis for interpolation in order to obtain the panorama pixel color value. In this way, an interpolation may be performed by providing and/or storing additional information and/or by performing additional computations; the extra information may be included within the panorama map.
In step 1204, a joined image is created with hardware texture mapping and 3D acceleration. Hardware based rendering of a panorama may use the texture mapping and 3D acceleration, DirectX, or OpenGL (or another graphics standard) rendering available in many video cards. The panorama may be divided into triangles, quadrilaterals, and/or other polygons. The texture of each of the areas (the triangles or other polygons) is initially computed from the original image. The image area texture is passed to the 3D rendering as a polygon of pixel locations. Hardware rendering may render each of the polygons faster than software methods, but in an embodiment may only perform a linear interpolation. The image should be divided into triangles and/or other polygons that are small enough that linear interpolation is sufficient to determine the texture of a particular area in the final panorama.
In step 1206, the portion that changes is rendered. In an embodiment, only the portion that changed is rendered. In many video streams captured from live cameras, many of the objects (sometimes all of the objects) change very infrequently. For example, the camera may be mounted to face towards a secure area into which very few people tend to enter. Consequently, most of the time, frames in the video captured from the camera will be almost the same. For example, in a video of a conversation between two people who are sitting down while talking, there may be very little that changes from frame to frame. Even in videos that have a significant amount of action, there may still be some parts of the video that change very little. That is, some parts of the image may tend to have frequent changes, but some parts may tend to change less frequently.
In a video panorama, understanding (e.g., identifying) that there are no changes in a certain part of the video allows the user or the system to not render that part of that frame in the resulting panorama video. The rendered part of the frame may be added to the non-changing part of the frame after rendering. This reduces the processing and results in less CPU usage than if the entire frame were rendered. Having the system understand (e.g., identify) the portions of the frames that contain changes allows the system to always render these parts of the image accurately in the resulting panorama view.
The portions that change may be identified by computing the changes in the scene first. Pixel differencing methods may be used to identify motion. If there is no motion in a particular area, then that area, region, or grid does not need to be rendered or sent for display. Instead, the previous image, grid, or region may be used in the final image, as is.
In step 1208, the images are blended and smoothened at the seams and in the interior for better realism. The images from different cameras, when joined, may look a bit different or unnatural at the joining seams. Specifically, the seams where the images were joined may be visible and may have discontinuities that would not appear in an image from a single source. Blending and smoothening at the seam of stitching improves realism and makes the image appear more natural. To smooth the seam, the values for the pixels at the seam are first calculated, and then, at and around the seams, the brightness, contrast, and colors are adjusted or averaged to make the transition from one source image to another source image of the same panorama smoother. The transition distance (the distance from the seam over which the smoothening and blending are applied) can be defined as a parameter in terms of a percentage of the pixels in the image or a total number of pixels.
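A minimal sketch of blending across a seam, assuming Python with NumPy and a simple side-by-side layout; the overlap width plays the role of the transition distance parameter described above.

```python
import numpy as np

def feather_blend(left, right, overlap):
    """Join two images that share `overlap` columns; inside the
    transition distance, weights ramp linearly from one source to the
    other, averaging brightness and color across the seam."""
    alpha = np.linspace(1.0, 0.0, overlap)[None, :, None]  # 1 x overlap x 1
    seam = (left[:, -overlap:].astype(np.float64) * alpha +
            right[:, :overlap].astype(np.float64) * (1 - alpha))
    return np.hstack([left[:, :-overlap],
                      seam.astype(np.uint8),
                      right[:, overlap:]])
```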
In step 1210, the brightness and contrast are adjusted. Adjusting the brightness and contrast may facilitate creating a continuous and smooth panorama effect. When a user plays the stitched panorama, it is possible to adjust the brightness of adjacent frames fed by different cameras and/or videos so that which regions of the image are taken from which source cameras is not as apparent (or not noticeable at all). Due to the different camera angles, the brightness of the frames may not be the same, so to create a continuous panorama effect, the brightness and/or contrast are adjusted. Also, the adjacent frames that may overlap each other during stitching can be merged at the boundaries. Adjusting the brightness and/or contrast may remove the jagged edge effect and may provide a smooth transition from one frame to another.
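A minimal sketch of matching the brightness of adjacent frames, assuming Python with NumPy; the gain-in-the-overlap approach is a hypothetical simplification of the adjustment described above.

```python
import numpy as np

def match_brightness(frame, reference, overlap_mask):
    """Scale a frame so that its mean brightness inside the shared
    overlap region matches the neighboring reference frame's, hiding
    which source camera each region came from."""
    gain = reference[overlap_mask].mean() / max(frame[overlap_mask].mean(),
                                                1e-6)
    return np.clip(frame.astype(np.float64) * gain, 0, 255).astype(np.uint8)
```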
In an embodiment, each of the steps of method 1200 is a distinct step. In another embodiment, although depicted as distinct steps, the steps of method 1200 may not be distinct steps, may be combined or divided into sub-steps, and/or may be performed in another order.
Image 1300A is an image that is formed by joining together multiple images. Bright portion 1302a is a portion of image 1300A that is brighter than the rest of image 1300A. Dim portion 1304a is a portion of image 1300A that is dimmer than the rest of image 1300A. The transition between bright portion 1302a and dim portion 1304a is sharp and unnatural.
Image 1300B is image 1300A after being smoothed. Bright portion 1302b is a portion of image 1300B that was initially brighter than the rest of image 1300B. Dim portion 1304b is a portion of image 1300B that was initially dimmer than the rest of image 1300B.
Each embodiment disclosed herein may be used or otherwise combined with any of the other embodiments disclosed. Any element of any embodiment may be used in any embodiment.
Although the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the true spirit and scope of the invention. In addition, modifications may be made without departing from the essential teachings of the invention.
This application claims priority benefit of U.S. Provisional Patent Application No. 60/903,026 (Docket #53-4), filed Feb. 23, 2007, which is incorporated herein by reference.