This invention relates generally to panoramic video signal processing. More particularly, this invention relates to techniques for panoramic video hosting with directional audio supplied to networked client devices.
Panoramic video images may be acquired using a group of cameras. The panoramic video images may be uploaded to a server, where the images are made available to networked client devices. The networked client devices may then follow an event that is being panoramically recorded and request perspectives of interest. Such processing generates large volumes of video data that must be processed and transmitted in an efficient manner. Each video stream may be accompanied by an audio track. Techniques are required for selecting appropriate audio for any given streamed video.
A server includes an input node to receive video streams forming a panoramic video. The server also receives audio tracks corresponding to the video streams. A module forms an audio track based upon a combination of at least two of the audio tracks and directional viewing data. The audio track may be a stereo, mixed or surround sound audio track with volume modulation based upon the directional viewing data. An output node sends the audio track to a client device.
The invention is more fully appreciated in connection with the following detailed description taken in conjunction with the accompanying drawings, in which:
Like reference numerals refer to corresponding parts throughout the several views of the drawings.
Frames are captured from each camera 10, and each frame is encoded and sent to a server as a separate video stream 11. Each video stream may have an associated audio track. Thus, each camera generates a separate audio track.
Camera distortion parameters may also be sent 12, as described in the commonly owned co-pending patent application entitled “Apparatus and Method for Video Image Stitching Utilizing Alignment and Calibration Parameters”, Ser. No. 13/691,632, filed Nov. 30, 2012, the contents of which are incorporated herein by reference.
The cameras 1, 2, 3, 4 may include a wired or wireless link to network 13. Server 14 is also connected to network 13. An input node of the server 14 receives the video signals. The server 14 decodes frames and stitches them into a panoramic frame. The server 14 receives user requests and encodes necessary data to service each request. In one embodiment, the server includes a module with executable instructions to form a suggested field of view in the panoramic video. An output node of the server sends video signals to a client device.
The user requests are from a client device 15, such as a smartphone, tablet, personal computer, wearable computation device and the like. A user requests access to a video stream 16. The server 14 services the request and delivers the requested video through the network 13 as specific data for a requested field of view 17, which may then be displayed on the client device 15.
The invention allows for multiple image processing services to be conducted on server 14. For example, the server 14 may provide error detection and correction. Further, the server 14 may map and learn user interactions with the video content to optimize data streams. The server 14 can also monitor the bandwidth available on the network 13. The server 14 can then stream only field of view 18 to the device 15, or it can stream additional data outside of the field of view 18 to enable smoother navigation of the video stream. When additional data outside the field of view is sent to the client device, but the entire panoramic (or wide angle) video stream is not streamed to the client device, this extra video data is referred to as the buffer.
In one embodiment, two input parameters are sent to the server. One parameter is the user's viewing location based upon the coordinate system of
In order to keep the video playing smoothly while the user moves, a small buffer 306 is added to the video frame. This gives the server time to respond to the change in location and update the video stream sent to the user accordingly without the user noticing disrupted video playback.
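By way of illustration only, the following sketch shows one way a server might compute such a buffered region around the requested field of view; the function name, the fixed pixel margin, and the horizontal wrap-around handling are assumptions made for this example rather than requirements of the disclosure.

```python
def buffered_region(fov_x, fov_y, fov_w, fov_h, pano_w, pano_h, margin=64):
    """Expand a requested field of view by a pixel margin (the buffer).

    The horizontal axis wraps around the panorama seam; the vertical axis
    is clamped to the frame. Returns (x, y, width, height); x is to be
    interpreted modulo pano_w by the caller.
    """
    x = (fov_x - margin) % pano_w            # wrap across the 0 / pano_w seam
    y = max(0, fov_y - margin)               # clamp at the top edge
    w = min(pano_w, fov_w + 2 * margin)      # never wider than the panorama
    h = min(pano_h - y, fov_h + 2 * margin)  # clamp at the bottom edge
    return x, y, w, h

# Example: a 1280x700 view near the seam of a 4000x720 panorama
print(buffered_region(3900, 10, 1280, 700, 4000, 720))
```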
The following parameters may be used to calculate the user's location with respect to the panoramic frame in the coordinate system of
User Area=(1280 px, 700 px)
panoHeight (total height of panoramic video stream)=720 px
panoWidth (total width of panoramic video stream)=4000 px
ρ (radius)=340 px
φ=0°
θ=92°
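By way of illustration only, and using the example values above, the following sketch shows one possible mapping from a direction of view (θ, φ) to pixel coordinates in an equirectangular panoramic frame, and from there to a centered user area; the equirectangular projection, the angle conventions, and the function names are assumptions made for this example.

```python
def dov_to_pixel(theta_deg, phi_deg, pano_w, pano_h):
    """Map a direction of view (theta, phi) in degrees to a pixel position
    in an equirectangular panorama. theta is the azimuth measured across the
    full horizontal wrap; phi is the elevation relative to the horizontal
    center-line, on the 360-degree vertical grid described in this disclosure.
    """
    x = (theta_deg % 360.0) / 360.0 * pano_w
    y = (0.5 - phi_deg / 360.0) * pano_h
    return x, y

def user_area(theta_deg, phi_deg, pano_w, pano_h, area_w, area_h):
    """Center a user viewing area of (area_w, area_h) pixels on the direction of view."""
    cx, cy = dov_to_pixel(theta_deg, phi_deg, pano_w, pano_h)
    return cx - area_w / 2, cy - area_h / 2, area_w, area_h

# Values from the example above: theta = 92 deg, phi = 0 deg, 4000x720 panorama
print(dov_to_pixel(92, 0, 4000, 720))            # -> (1022.2..., 360.0)
print(user_area(92, 0, 4000, 720, 1280, 700))    # -> (382.2..., 10.0, 1280, 700)
```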
An aspect of the invention is a new method for improving video that includes additional data allowing for more intuitive navigation and use of panoramic video. When paired with visual video data, certain auxiliary data such as audio can be tied to coordinates in the video. This allows multiple audio tracks to be included in the file format, with the final audio delivered to the user as a reduced (e.g., mono or stereo pair) signal processed from the multiple audio tracks, their known directional information, and the user's direction of view.
Other augmented reality (e.g., video, audio, haptic) features or embedded links can also be efficiently placed throughout the video using a global (360 degrees in x and y axes) coordinate system plus the individual user's direction of view.
In at least some preferred implementations, it is possible to use current digital audio standards for transport of multiple audio signals as well as processing at either or both nodes (camera/user, processing/cloud). These standards could be Dolby Pro Logic® IIx, DTS 5.1, or some other method.
Standard video streams provide a single audio track that is played in synchrony with the video. While this audio track may have multiple channels that carry more audio data, for example in a theatre with surround sound capability, only a single audio track is played with the video: at any specific time during the video, there is only one arrangement of audio data that is played at that moment. Multiple audio channels may allow for stereo audio or surround sound, but the separate channels do not provide a fundamentally different sound track when played with the video. Every time a user watches a specific sequence of video, the user hears the same audio track played in conjunction with the video shown on the screen.
The audio track that is played with the 360° or wide angle video varies depending upon the direction where the user is looking in the panoramic video stream. This direction of view may be designated DoV.
An embodiment of the invention adds more audio data to the video file and pairs this audio data with specific coordinates or perspectives in relation to the video data. This enables a variety of tracks to be played in parallel with the video data. Multiple audio tracks can also be played simultaneously. Based on where the user has navigated the area of interest (or DoV), a blend of audio tracks may be played that best represents the user's chosen perspective of the video data.
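By way of illustration only, the following sketch shows one hypothetical way direction-tagged audio tracks might be represented in metadata so that each track is paired with a coordinate in the global grid; the field names and the four-microphone layout are assumptions, not a prescribed file format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DirectionalAudioTrack:
    track_id: str
    azimuth_deg: float            # direction of the source microphone in the global grid
    samples: List[float] = field(default_factory=list)  # placeholder for PCM samples

# Hypothetical layout: four microphones, one per camera, spaced 90 degrees apart
tracks = [
    DirectionalAudioTrack("cam1", 0.0),
    DirectionalAudioTrack("cam2", 90.0),
    DirectionalAudioTrack("cam3", 180.0),
    DirectionalAudioTrack("cam4", 270.0),
]
```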
This method also allows for additional data to be added to the video file. This data can be tied to global x/y coordinates within the video stream, enabling features or embedded links to be located at specific physical points within the video. This allows embedded links, computer vision tagging, visual and/or audible cues or information, in addition to other augmented reality features including audio, video and haptic contact, to be used in conjunction with panoramic video.
Metadata or a new file format incorporates additional data specific to the area of the video that is being viewed by the user. This method may use planar (for single-camera wide angle), cylindrical, or spherical (or global) coordinate systems as the reference for where the user is focused within the entire video stream.
One implementation is to have the video divided into a grid pattern using 360 degrees vertically and horizontally to create a spherical (or global) coordinate grid in space. A center-line is created in both the horizontal and vertical directions, and the grid pattern can be divided into segments of the user's choice. By way of example, this disclosure describes a spherical viewing surface with 360 degrees of freedom in both the horizontal and vertical.
The video is overlaid onto this grid so that software can track where the viewer is looking at any given moment using the coordinate system. Using coordinates, software tracks how the user navigates the video, for example, do they move quickly or slowly, are there tendencies to move left/right or up/down, how quickly does the user settle on an area of interest and stay there, etc.
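By way of illustration only, the following sketch shows one way such navigation behavior might be summarized from sampled directions of view; the sampling interval, the settle threshold, and the reported statistics are assumptions chosen for this example.

```python
def navigation_stats(dov_samples, dt=1.0, settle_threshold_deg=5.0):
    """dov_samples: list of (theta, phi) directions of view in degrees,
    sampled every dt seconds.

    Returns (average angular speed in deg/s, seconds spent within the settle
    threshold of the final direction of view).
    """
    def wrap(a):
        return (a + 180.0) % 360.0 - 180.0   # shortest signed angular difference

    speeds, dwell = [], 0.0
    for (t0, p0), (t1, p1) in zip(dov_samples, dov_samples[1:]):
        speeds.append((wrap(t1 - t0) ** 2 + (p1 - p0) ** 2) ** 0.5 / dt)
    final_t, final_p = dov_samples[-1]
    for t, p in dov_samples:
        if (wrap(t - final_t) ** 2 + (p - final_p) ** 2) ** 0.5 <= settle_threshold_deg:
            dwell += dt
    return sum(speeds) / max(len(speeds), 1), dwell

# A viewer who pans right, then settles on a direction of interest
print(navigation_stats([(90, 0), (100, 2), (110, 1), (111, 1), (111, 1)]))
```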
Regions/directions of interest based upon image processing, audio processing, information sourced from crowd data or other sources as described above or artificially decided may be enhanced audibly by increasing the “volume” of items of interest (e.g., race car or a floating advertisement) as well as giving audible and visual cues to look toward this area of interest. This is an example where more audio tracks may be used than simply the minimum channels corresponding to the video in the view.
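By way of illustration only, the following sketch shows one way the “volume” of a track tied to an item of interest might be raised relative to the other tracks; the boost factor and the renormalization step are assumptions made for this example.

```python
def emphasize_interest(track_gains, interest_track_id, boost=2.0):
    """Return per-track gain factors with the track covering the region of
    interest boosted, then renormalized so the overall level stays stable."""
    gains = dict(track_gains)
    gains[interest_track_id] = gains.get(interest_track_id, 1.0) * boost
    total = sum(gains.values())
    return {tid: g / total for tid, g in gains.items()}

# Boost the track nearest the item of interest (e.g., a race car) on camera 2
print(emphasize_interest({"cam1": 1.0, "cam2": 1.0, "cam3": 1.0}, "cam2"))
```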
In the case of A audio track 320, the direction of view is exactly between microphone 306 and microphone 308. Therefore, the audio track may be a simple average of the two source tracks. For stereo, audio track 316 may be used as the left channel and audio track 318 may be used as the right channel.
In the case of B audio track 322, the viewer is looking to the right of the camera 1-camera 2 boundary. The audio channel here may be a normalized, weighted average of the two channels, with each source track weighted by its angular proximity to the direction of view. For example, a direction of view that lies three quarters of the way from microphone 306 toward microphone 308 may be rendered as 0.25·(audio track 316)+0.75·(audio track 318).
Or, more generally, the output may be a sum Σ wᵢ·(audio track i), where the weights wᵢ are non-negative, sum to one, and decrease with the angular distance between the direction of view and microphone i.
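By way of illustration only, the following sketch shows one such blending computation; the linear falloff of the weights with angular distance is an assumption chosen for this example, not a required weighting function, and the function names are hypothetical.

```python
def blend_tracks(dov_deg, mic_azimuths_deg, tracks):
    """Blend per-microphone audio frames into one output frame.

    dov_deg: the user's direction of view (azimuth, degrees).
    mic_azimuths_deg: azimuth of each microphone in the global grid.
    tracks: equal-length sample lists, one per microphone.
    Weights fall off linearly with angular distance and are normalized to sum to one.
    """
    def ang_dist(a, b):
        return abs((a - b + 180.0) % 360.0 - 180.0)

    raw = [max(0.0, 1.0 - ang_dist(dov_deg, az) / 90.0) for az in mic_azimuths_deg]
    total = sum(raw) or 1.0
    weights = [w / total for w in raw]
    n = len(tracks[0])
    return [sum(w * trk[i] for w, trk in zip(weights, tracks)) for i in range(n)]

# Viewer looking exactly between two microphones at 0 and 90 degrees: a 50/50 mix
print(blend_tracks(45.0, [0.0, 90.0], [[1.0, 1.0], [0.0, 0.0]]))  # -> [0.5, 0.5]
```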
Using the data from the video allows third parties to track where users are looking in the video and how they are navigating the video. This opens up opportunities in the advertising industry. For example, new advertising pricing structures can be created based on the actual amount of screen time that an advertisement gets while a user is watching a video.
Advertisers can learn how users navigate video to more effectively place advertisements or brand messages in the videos so that they receive maximum visibility or make the strongest brand impression on the user.
For video streams that may have multiple tracks of stereo audio, separate tracks can be assigned to specific fields of view. As the user moves from one perspective to the next, the audio data changes. For example, if a user is listening to stereo audio and is facing forward, one set of sounds (Sound A) is specific to the left ear, and one set of sounds (Sound B) is specific to the right ear. If the user rotates his perspective 180 degrees, the stereo audio signals switch to better match the user's new perspective. Now facing the opposite direction, Sound A is specific to the right ear, while Sound B is specific to the left ear.
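By way of illustration only, the following sketch shows one way a stereo pair might be remapped as the user rotates; the cosine cross-fade is an assumption chosen so that a 180-degree rotation swaps the left and right channels as described above.

```python
import math

def rotate_stereo(left, right, rotation_deg):
    """Remap a stereo pair after the user rotates by rotation_deg.

    At 0 degrees the pair is unchanged; at 180 degrees left and right are
    swapped; in between, the channels are cross-faded with a cosine law.
    """
    c = (1.0 + math.cos(math.radians(rotation_deg))) / 2.0   # 1 at 0 deg, 0 at 180 deg
    new_left = [c * l + (1.0 - c) * r for l, r in zip(left, right)]
    new_right = [c * r + (1.0 - c) * l for l, r in zip(left, right)]
    return new_left, new_right

# Facing the opposite direction: Sound A moves to the right ear, Sound B to the left
print(rotate_stereo([1.0, 1.0], [0.0, 0.0], 180))  # -> ([0.0, 0.0], [1.0, 1.0])
```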
When the user navigates the panoramic video stream and does not settle in a direction that has a single dominant audio track, the coordinates of the viewer's exact position can be used to blend the audio tracks effectively. Blending audio tracks when there are multiple audio signals for that perspective provides the best representative sound signal for the user's selected field of view.
Directional audio, when paired with the stereo audio used by operators, allows users to “hear” activity to the left or right and to instinctively know where to navigate the field of view to reach the area of interest. Audio cues can serve as an indicator of activity outside of the operator's field of view. Directional audio gives users the ability to monitor activity outside of the visible field of view by listening for audio signals that serve as a cue to change the operator's field of view.
The disclosed techniques are also applicable to wide field of view images (i.e., images that are not full 360° panoramas). The server may also receive a number of audio streams that does not directly correspond to the number of cameras.
An embodiment of the present invention relates to a computer storage product with a computer readable storage medium having computer code thereon for performing various computer-implemented operations. The media and computer code may be those specially designed and constructed for the purposes of the present invention, or they may be of the kind well known and available to those having skill in the computer software arts. Examples of computer-readable media include, but are not limited to: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs, DVDs and holographic devices; magneto-optical media; and hardware devices that are specially configured to store and execute program code, such as application-specific integrated circuits (“ASICs”), programmable logic devices (“PLDs”) and ROM and RAM devices. Examples of computer code include machine code, such as produced by a compiler, and files containing higher-level code that are executed by a computer using an interpreter. For example, an embodiment of the invention may be implemented using JAVA®, C++, or other object-oriented programming language and development tools. Another embodiment of the invention may be implemented in hardwired circuitry in place of, or in combination with, machine-executable software instructions.
The foregoing description, for purposes of explanation, used specific nomenclature to provide a thorough understanding of the invention. However, it will be apparent to one skilled in the art that specific details are not required in order to practice the invention. Thus, the foregoing descriptions of specific embodiments of the invention are presented for purposes of illustration and description. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed; obviously, many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. It is intended that the following claims and their equivalents define the scope of the invention.
This application claims priority to U.S. Provisional Patent Application 61/707,746, filed Sep. 28, 2012, the contents of which are incorporated herein by reference. This application is related to U.S. Ser. No. 13/691,654, filed Nov. 30, 2012, which claims priority to U.S. Provisional Patent Application 61/566,269, filed Dec. 2, 2011, the contents of which are incorporated herein by reference.