Not applicable.
1. Field of the Invention
The present invention relates to a system and method for using cameras, such as in a cell phone, to download data.
2. Brief Description of the Related Art
Previously, work has been performed on mobile vision and recognition, mobile interaction and error correction coding.
The combined image acquisition, processing, storage and communication capability of mobile phones has rekindled researchers' interest in applying traditional pattern recognition and computer vision algorithms on camera phones in the pursuit of new mobile applications. Camera phones have been used to recognize faces (Y. Ijiri, M. Sakuragi, and S. Lao, “Security management for mobile devices by face recognition,” in MDM '06: Proceedings of the 7th International Conference on Mobile Data Management (MDM'06) Washington, D.C., USA: IEEE Computer Society, 2006, p. 49), road signs (X. Chen, J. Yang, J. Zhang, and A. Waibel, “Automatic detection of signs with affine transformation,” in WACV '02: Proceedings of the Sixth IEEE Workshop on Applications of Computer Vision, Washington, D.C., USA: IEEE Computer Society, 2002, p. 32 and “A PDA-based sign translator,” in ICMI '02: Proceedings of the 4th IEEE International Conference on Multimodal Interfaces, Washington, D.C., USA: IEEE Computer Society, 2002, p. 217), text (K. S. Bae, K. K. Kim, Y. G. Chung, and W. P. Yu, “Character recognition system for cellular phone with camera,” in COMPSAC '05: Proceedings of the 29th Annual International Computer Software and Applications Conference (COMPSAC'05) Volume 1, Washington, D.C., USA: IEEE Computer Society, 2005, pp. 539-544 and M. Koga, R. Mine, T. Kameyama, T. Takahashi, M. Yamazaki, and T. Yamaguchi, “Camera based kanji OCR for mobile phones: Practical issues,” in ICDAR '05: Proceedings of the Eighth International Conference on Document Analysis and Recognition, Washington, D.C., USA: IEEE Computer Society, 2005, pp. 635-639), and barcodes (E. Ohbuchi, H. Hanaizumi, and L. Hock, “Barcode readers using the camera device in mobile phones,” in Cyberworlds, 2004 International Conference on, 2004, pp. 260-265; A. Otero, “A robust software barcode reader using the Hough transform,” in ICIIS '99: Proceedings of the 1999 International Conference on Information Intelligence and Systems, Washington, D.C., USA: IEEE Computer Society, 1999, p. 313; S. Ando and H. Hontani, “Automatic visual searching and reading of barcodes in 3d scene,” in Vehicle Electronics Conference, 2001, pp. 49-54; H. Hee Il and J. Joung Koo, “Implementation of algorithm to decode two-dimensional bar code pdf-417,” 6th International Conference on Signal Processing, Vol. 2, 2002, pp. 1791-1794; and E. Ottaviani, A. Pavan, M. Bottazzi, E. Brunelli, F. Caselli, and M. Guerrerro, “A common image processing framework for 2d barcode reading,” 7th International conference on Image Processing and its Applications, vol. 2, 1999, pp. 652-655.). Although the methods differ from one application to another, they share some common procedures, summarized as follows:
1) Target Location: The first step is to locate the target's position. In traditional desktop/workstation environments, sophisticated methods can be applied. For mobile devices, however, detection often needs to run in real time and consume fewer resources to save power (and so extend battery life). Lightweight or approximate features are explored to achieve these goals. For example, Viola and Jones used efficient rectangular features in “Robust real-time face detection,” Int. J. Comput. Vision, vol. 57, no. 2, pp. 137-154 (2004), for face detection on a Compaq PDA. Road sign or text detection often uses heuristic methods. For 2D barcode acquisition, a unique pattern is often used to identify the location of the code. For example, a Maxicode contains a bull's-eye pattern at its center, a QR Code uses three squares at three of its corners as locator patterns, and Datamatrix has its two perpendicular edges. Algorithms are designed to locate these locator patterns efficiently.
2) Image Enhancement and Distortion Correction: Camera phones often use cheap CMOS sensors with fixed focus. Compared with digital cameras with high quality CCD sensors, images captured by camera phones are of relatively low quality. One problem is uneven lighting. Images captured by camera phones often have cast or attached shadows. Adaptive binarization is often used to reduce the effect of shading and uneven lighting. Another problem is perspective distortion. When users capture images, it is impractical for them to hold devices at a perfectly right angle. As a result, perspective distortion is inevitable and geometrical correction is required to normalize the image before recognition. Focus is another problem to be tackled. Cameras in mobile phones are designed to take pictures of people and scenes. For this reason the focus of the camera is often set to a distance greater than 1 foot. To keep a reasonable resolution, however, physical barcodes need to be placed close to the camera, leading to blur in the acquired image. A super resolution method was proposed to solve this problem in S. Baker and T. Kanade, “Limits on superresolution and how to break them,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 9, pp. 1167-1183, 2002, but the complexity of the algorithm prevents it from being run on mobile devices. To handle these problems the symbology should be robust enough to compensate for the adverse effects caused by image degradation.
3) Recognition: For recognition, features with geometric invariance are often selected since images are usually captured by cameras at arbitrary angles. Geometric invariants are used explicitly or implicitly in previous work. See I. Weiss, “Geometric invariants and object recognition,” Int. J. Comput. Vision, vol. 10, no. 3, pp. 207-231, 1993 and F. Mindru, T. Tuytelaars, L. V. Gool, and T. Moons, “Moment invariants for recognition under changing viewpoint and illumination,” Comput. Vis. Image Underst., vol. 94, no. 13, pp. 3-27, 2004. Explicit features include moments or the Fourier descriptors. See S. K. W. Kwok and J. C. H. Poon, “Viewpoint-invariant Fourier descriptors for 3 dimensional planar shape representation,” Electronics Letters, vol. 32, no. 19, pp. 1775-1776, 1996, 00135194. An example of implicit features is to locate feature points based on reference points, which is commonly used for decoding 2D barcodes. For example, when the three rectangular location patterns of a QR code are located, the positions of other unit cells in the QR code can be decided and the encoded information will be decoded.
One challenge for camera phone related applications is the user interface. Due to the physical limitations of mobile phones (small keypads, small displays, etc.), designing an interface that facilitates users' interaction with the device is an important problem. Interaction with mobile devices has received much attention in recent years as the popularity of camera phones and PDAs has increased. A survey of camera phone related applications can be found in T. Kindberg, M. Spasojevic, R. Fleck, and A. Sellen, “The ubiquitous camera: An in-depth study of camera phone use,” IEEE Pervasive Computing, vol. 4, no. 2, pp. 42-50, 2005. Some interesting applications include the following. Researchers at CMU use a camera phone based 2D barcode solution for human identity authentication (J. M. McCune, A. Perrig, and M. K. Reiter, “Seeing is believing: Using camera phones for human verifiable authentication,” in SP '05: Proceedings of the 2005 IEEE Symposium on Security and Privacy. Washington, D.C., USA: IEEE Computer Society, 2005, pp. 110-124). In R. Ballagas, J. Borchers, M. Rohs, and J. G. Sheridan, “The smart phone: A ubiquitous input device,” IEEE Pervasive Computing, vol. 5, no. 1, p. 70, 2006, a camera phone is used as a pervasive input device to acquire position and motion information. The authors described a new scheme in P. Vartiainen, S. Chande, and K. Ramo, “Mobile visual interaction: enhancing local communication and collaboration with visual interactions,” in MUM '06: Proceedings of the 5th international conference on Mobile and ubiquitous multimedia. New York, N.Y., USA: ACM Press, 2006, p. 4, allowing users to use their camera phones to interact with large screen displays. The work described in A. Wilhelm, Y. Takhteyev, R. Sarvas, N. V. House, and M. Davis, “Photo annotation on a camera phone,” in CHI '04: CHI '04 extended abstracts on Human factors in computing systems. New York, N.Y., USA: ACM Press, 2004, pp. 1403-1406 allows users to annotate digital photos at capture time.
In summary the unique challenges which need to be considered when developing applications related to the user interaction with camera phones include:
1) Image Distortion: When users capture images, one cannot expect them to keep the image plane of a camera phone parallel with the physical plane. Perspective distortion is expected.
2) Small input keypads and displays: The user interface should be intuitive enough.
Images captured by camera phones are often of low quality due to perspective distortion, noise and shading. Decoding errors are inevitable, and extra bits need to be inserted to correct them. More specifically, data needs to be encoded with error control codes. Error control coding (also known as error correction coding) is an important technology developed in information theory. In general, error correction codes can be divided into convolutional codes and block codes. For a convolutional code, the entire code word is convolved. A deconvolution process is required to restore the data for decoding. For a block code, error correction bits are appended to the original code word, i.e. the code word is intact but appended by error correction bits. Previously, convolutional codes were widely used. Today researchers realize the combination of both convolution and block codes provides the best result which approaches the Shannon limit, the maximal capacity of a noisy channel. The Low Density Parity Check (LDPC) Codes (T. J. Richardson and R. L. Urbanke, “Efficient encoding of low density parity-check codes,” Information Theory, IEEE Transactions, vol. 47, no. 2, pp. 638-656, 2001, 00189448) and the Turbo Codes (B. Vucetic and J. Yuan, Turbo codes: principles and applications, Norwell, Mass., USA: Kluwer Academic Publishers, 2000) are designed based on this idea and widely used in applications such as deep space exploration (C. Jr, C. Stelzreid, L. Deutsch, and L. Swanson, “Nasa's deep space telecommunications road map,” 1999). However, decoding of convolved block codes requires computational power beyond current mobile devices. Especially, the floating point Viterbi decoding inhibits real-time performance on today's camera phones. Therefore, convolutional codes are not used.
A variety of systems and methods for downloading data to mobile devices such as cell phones, PDAs, MP3 players, and portable gaming systems are known. Such systems and methods include CDMA/GPRS, Bluetooth, infrared and cable. While such systems and methods have proven useful, they fail to take advantage of the fact that cameras are increasingly being incorporated into such devices.
The present invention is a novel system and method which allows a camera to be repurposed to download data from an image or a series of images. This camera-based system has several unique advantages. First, it uses existing hardware infrastructure and local communication, so there is no extra data cost. Some of the existing data downloading methods, such as wireless communication data networks (GPRS/CDMA), will trigger charges by service providers. Second, the present invention can be implemented predominantly through software. Users do not need to connect their phones with PCs through cables or Bluetooth adaptors, and there will be no complex driver installation or synchronization problems. Users simply need to aim the camera at the visual code, or “V-Code”.
In one embodiment, the present invention is a method for transferring data to a mobile device having a processor, a storage means, and a camera. The method comprises the steps of encoding data in a visual code where the visual code comprises a plurality of two-dimensional bar codes, displaying the visual code, capturing the plurality of two-dimensional bar codes with the camera and decoding the plurality of two-dimensional bar codes. In other embodiments, visual codes other than two-dimensional bar codes may be used. The step of displaying comprises displaying a portion of the plurality of two-dimensional bar codes sequentially. In one embodiment, the encoding step comprises spatial (intra-frame) and temporal (inter-frame) encoding with Reed-Solomon error correction codes. The intra-frame error correction corrects errors within each frame, and the inter-frame error correction is used to recover dropped frames. The encoding step comprises encryption by user-designed masks. Users can design their own mask and fuse the mask information into the data frame by a bitwise AND or OR operation. The receivers can decode the data only when they have the key associated with the designed mask. The plurality of two-dimensional bar codes may be square, rectangular, circular, or any other shape. Further, the plurality of bar codes may be different in shape. The decoding step comprises boundary tracking with a fast Hough transform to locate the code frame in real time. In another embodiment, the method further comprises the step of displaying a detected boundary in real time to assist a user in aiming the camera at the V-Code frame.
The decoding step may comprise fast perspective correction. Instead of solving a plane-to-plane projection, which requires a large number of floating point operations, we use an intermediate affine coordinate transform, which simplifies homogeneous estimation to inverting two signs of a homography. In this way we eliminate floating point operations and the speed of perspective correction is significantly improved. Further, colors may be embedded in the two-dimensional bar codes.
Still other aspects, features, and advantages of the present invention are readily apparent from the following detailed description, simply by illustrating preferred embodiments and implementations. The present invention is also capable of other and different embodiments and its several details can be modified in various obvious respects, all without departing from the spirit and scope of the present invention. Accordingly, the drawings and descriptions are to be regarded as illustrative in nature, and not as restrictive. Additional objects and advantages of the invention will be set forth in part in the description which follows and in part will be obvious from the description, or may be learned by practice of the invention.
For a more complete understanding of the present invention and the advantages thereof, reference is now made to the following description and the accompanying drawings, in which:
Embedding information in images (see Kutter, M., And Petitcolas, F. A., “Fair evaluation methods for image watermarking systems,” Journal of Electronic Imaging 9 (October 2000), 445-455) and videos (see Dittmann, J., Stabenau, M., and Steinmetz, R., “Robust mpeg video watermarking technologies,” MULTIMEDIA '98: Proceedings of the sixth ACM international conference on Multimedia, ACM Press, New York, N.Y., USA, 71-80 (1998)) has been studied for digital watermarking. The purpose of watermarking typically is for authorization and protection of the media. In the preferred embodiments of the present invention, data is encoded to facilitate the communication between the mobile device and the computer.
Known 2D barcode systems such as CyberCode (see Rekimoto, J., And Ayatsuka, Y., “Cybercode: designing augmented reality environments with visual tags,” DARE '00: Proceedings of DARE 2000 on Designing augmented reality environments, ACM Press, New York, N.Y., USA, 1-10 (2000)) and QR code (Ohbuchi, E., Hanaizumi, H., And Hock, L. A., “Barcode readers using the camera device in mobile phones,” CW '04: Proceedings of the 2004 International Conference on Cyberworlds (CW'04), IEEE Computer Society, Washington, D.C., USA, 260-265 (2004)) can encode very limited amounts of data. For example, the QR code can encode at most 2 KB data. To compensate for this limitation, the present invention encodes a file or files of any size into a series of frames where each frame encodes a part of the file or files. These frames are captured by the camera, decoded, and stored on the device in which the camera is located. The frames may be merged into one or more files.
The approach of the present invention will enable new applications and benefit numerous industries. The following examples will provide one of skill in the art with an idea of the potential scope of these new applications and benefits:
Instead of using existing 2D barcode symbologies such as QR code or Data Matrix, a preferred embodiment of the present invention uses its own symbology, for example, as shown in
While the symbology shown in
An overview of the architecture of an embodiment of the present invention is shown in
Overall, the procedures include:
A preferred embodiment of the method of the present invention starts with encoding.
To encode a data file into a VCode, we first split the data file into small segments, and then encode each segment into an image sequence. While the scheme is straightforward, the challenge is to make the encoding robust to the degradation and data loss which are inevitable in the imaging process. The cameras on phones often have much lower quality than digital cameras, and we expect users to capture VCode in real environments without constraints on lighting and perspective angles. Our strategy is to use state-of-the-art error control in both time and space to make the code more robust against these types of degradation.
1) Data Partitioning and Error Correction: The data is partitioned in the way that both intra and inter error correction bits can easily be inserted. We divide the data into multiple chunks, each of which is further divided into individual frames. This forms a three layer structure of the data representation, as shown in
b shows the error correction scheme we propose in each chunk. Each data chunk 310, 312, 314 in
Each frame consists of three parts: the frame header, the data area and the error correction area. The frame header contains the frame index, the chunk index, the total number of chunks, and a checksum. The frame and chunk indexes give the position of each frame so that it can be put into the right place after decoding. The checksum is used to check whether the decoded frame and chunk indexes are correct. If they are incorrect, the whole frame will be dropped and recovered later by error correction frames. The number of chunks is the same on all frames and can be used to check whether the file has been downloaded completely. We put it on every frame so users can begin capturing from any frame (the VCode will be displayed in a loop until all data frames are correctly captured and decoded).
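The header handling described above can be sketched as follows. This is a minimal illustration only: the field widths (two bytes per index), the one-byte additive checksum, and the names `pack_header` and `parse_header` are assumptions made for the example, since the text does not specify the header layout.

```python
import struct

# Hypothetical layout (not given in the text): 2-byte frame index,
# 2-byte chunk index, 2-byte total chunk count, 1-byte checksum.
HEADER_FORMAT = ">HHHB"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 7 bytes


def header_checksum(frame_index, chunk_index, total_chunks):
    """One-byte additive checksum over the index fields (illustrative choice)."""
    body = struct.pack(">HHH", frame_index, chunk_index, total_chunks)
    return sum(body) % 256


def pack_header(frame_index, chunk_index, total_chunks):
    """Serialize the three fields followed by their checksum."""
    checksum = header_checksum(frame_index, chunk_index, total_chunks)
    return struct.pack(HEADER_FORMAT, frame_index, chunk_index,
                       total_chunks, checksum)


def parse_header(raw):
    """Return (frame_index, chunk_index, total_chunks), or None when the
    checksum does not match and the frame should be dropped."""
    frame_index, chunk_index, total_chunks, checksum = struct.unpack(
        HEADER_FORMAT, raw[:HEADER_SIZE])
    if checksum != header_checksum(frame_index, chunk_index, total_chunks):
        return None
    return frame_index, chunk_index, total_chunks
```

Because every frame carries the total chunk count, a receiver in this sketch could start from any frame and still know when the download is complete.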
A preferred embodiment of the present invention uses Reed-Solomon encoding for error correction (see Wicker, S. B., and Bhargava, V. K., Reed-Solomon Codes and Their Applications. John Wiley & Sons, Inc., New York, N.Y., USA (Eds. 1999)). Reed-Solomon error correction is used in a wide variety of commercial applications such as CDs and DVDs. Typically an (n, k) Reed-Solomon code block encodes k data symbols with n−k symbols for error correction. If the locations of the errors are unknown in advance, which is the present case, then a Reed-Solomon code can correct up to (n−k)/2 symbol errors. The advantage of Reed-Solomon error correction is that no matter where the errors occur (in the data area, in the error correction area, or in both), they will be corrected as long as the number of symbol errors is not larger than (n−k)/2.
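The correction capacity stated above can be checked with a few lines of arithmetic. The helper name `rs_capacity` is hypothetical, and the (255, 223) parameters in the usage note are a commonly cited Reed-Solomon configuration chosen for illustration, not values taken from this description. The second return value reflects the standard property that errors at known positions (erasures) cost one parity symbol each rather than two.

```python
def rs_capacity(n, k):
    """Return (max_errors, max_erasures) for an (n, k) Reed-Solomon code.

    max_errors:   correctable symbol errors at unknown positions, (n - k) // 2
    max_erasures: correctable symbol errors at known positions, n - k
    """
    if not 0 < k < n:
        raise ValueError("need 0 < k < n")
    parity = n - k
    return parity // 2, parity
```

For example, `rs_capacity(255, 223)` gives 16 correctable errors at unknown positions, or 32 at known positions.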
After defining the individual frame, a large data file can be split into many smaller chunks so that the data in each small chunk can be encoded into one frame. These images 402, 404, 406, 408 are piled up along the time axis to form a “V-Code”, as shown in
After encoding the data into a “V-Code”, the present invention xor's a mask with a checkerboard pattern, such as is shown in
2) VCode Rendering: The rendering converts each frame (including error correction frames) into an image, which can be displayed on flat screens. Rather than using existing 2D barcode symbologies such as QR codes or Data Matrix (which are inherently static), we designed our own symbology, as shown in
Before a frame is rendered, we use a mask to xor each frame. The mask provides encryption to the data since decoding is almost impossible without preknowledge of the mask. This allows the data to be downloaded only by users who have the “passcode”. A typical mask is shown in
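The masking step can be sketched as below, assuming a frame is a matrix of 0/1 cells and the mask is a checkerboard of the same size. The function names and the list-of-lists representation are assumptions for illustration. Because XOR is its own inverse, applying the same mask a second time restores the original frame, which is why a receiver needs the mask to decode.

```python
def checkerboard_mask(rows, cols):
    """Checkerboard of 0/1 cells; the top-left cell is 0 (arbitrary choice)."""
    return [[(r + c) % 2 for c in range(cols)] for r in range(rows)]


def apply_mask(frame, mask):
    """XOR every cell of the frame with the matching mask cell."""
    return [[bit ^ m for bit, m in zip(frame_row, mask_row)]
            for frame_row, mask_row in zip(frame, mask)]


# Usage: masking twice with the same mask returns the original frame.
frame = [[1, 1, 1], [0, 0, 0]]
mask = checkerboard_mask(2, 3)
masked = apply_mask(frame, mask)
restored = apply_mask(masked, mask)
```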
The acquisition size and frame rate are constrained by the device. The process, however, must optimize throughput by trading off acquisition speed, image resolution, and processing requirements. Ideally we would choose the highest resolution which remains robust to degradation, yet can be processed at frame rates. Although camera phones often allow users to capture images with different resolutions, from 160×120 to 1600×1200 (2M pixels), our initial experiments suggest that QVGA resolution is a balance between speed and image quality for current mid level devices. The acquisition process itself is very simple: Users only need to aim the camera at the VCode to keep the frames at the center of the display. Detection and decoding will occur at frame rate.
Before decoding, each captured frame needs to be perspectively corrected, enhanced, and converted into a binary sequence.
1) Image Processing: The algorithm must be very efficient to meet the real-time requirement. A typical preview frame is shown in
Our localization pattern is a bold rectangular bounding box, as shown in
The biggest challenge is to decode the real images captured by camera phones. One example is shown in
The problem of uneven lighting is typically not critical for monocolor images because black and white are quite distinct from each other. If the numbers of black and white cells are roughly equal in the image, the average pixel value of the image is a reasonable threshold to separate them. If one color dominates however, the global thresholding will not be a good solution since cameras often have automatic white balance. Instead of using complex adaptive binarization methods, a preferred embodiment of the present invention uses a mask (as shown in
A more significant problem is geometrical distortion. Although the code is displayed on a planar display (LCD or CRT), the user may capture the code from any arbitrary angle. The code area in the real image could therefore be an arbitrary quadrangle (
For any matrix entry $(i, j)$, $\tilde{H}$ maps the homogeneous coordinate $x = (i, j, 1)^T$ to its image coordinate $X$:
$X = \tilde{H}x$ (1)
Suppose we know $n$ matrix entries $x_1, \ldots, x_n$ and their corresponding image points $X_1, \ldots, X_n$.
The classical way of computing $\tilde{H}$ is the homogeneous estimation method (see Criminisi, A., Reid, I., and Zisserman, A., “A plane measuring device,” Image and Vision Computing 17, 8, 625-634 (1999)). Reshape the matrix $\tilde{H}$ as a vector $\tilde{h} = (h_{11}, h_{12}, h_{13}, h_{21}, h_{22}, h_{23}, h_{31}, h_{32}, h_{33})^T$ and solve for
$M\tilde{h} = 0$ (2)
where $M$ is the coefficient matrix built from the $n$ correspondences.
When $n = 4$, $\tilde{h}$ is the null vector of $M$ and (2) has a unique solution for $\tilde{h}$ (assuming $|\tilde{h}| = 1$ or $h_{33} = 1$). This means we only need the coordinates of the four corners (P1, P2, P3, P4) in
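For reference, the usual way to assemble M in homogeneous estimation is sketched below. The notation $(u_k, v_k)$ for the image point of the k-th correspondence is an assumption introduced here; the construction itself is the standard direct linear form and may differ in row ordering or sign from the one originally intended.

```latex
% Assumed notation: x_k = (i_k, j_k, 1)^T maps to the image point (u_k, v_k).
% Eliminating the unknown scale from X_k \sim \tilde{H} x_k gives two linear
% equations in \tilde{h} per correspondence:
\begin{aligned}
h_{11} i_k + h_{12} j_k + h_{13} - u_k\,(h_{31} i_k + h_{32} j_k + h_{33}) &= 0,\\
h_{21} i_k + h_{22} j_k + h_{23} - v_k\,(h_{31} i_k + h_{32} j_k + h_{33}) &= 0,
\end{aligned}
% i.e. two rows of M:
\begin{pmatrix}
i_k & j_k & 1 & 0 & 0 & 0 & -u_k i_k & -u_k j_k & -u_k\\
0 & 0 & 0 & i_k & j_k & 1 & -v_k i_k & -v_k j_k & -v_k
\end{pmatrix}\tilde{h} = 0 .
```

Stacking these rows for n correspondences gives a 2n×9 matrix. With n = 4 points in general position, M is 8×9 and its null space is one-dimensional, which is why four corners suffice.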
However, solving (2) has some practical difficulties on cell phones. It usually requires LU decomposition with pivoting, which involves a large amount of floating point calculation that is not supported by mobile phones at the hardware level. Instead, the operating systems (Symbian, Windows Mobile) provide software emulation of IEEE-754 64-bit floating point, which is much slower than integer operations. Other platforms, such as Java (J2ME), provide no floating point capabilities. This motivates us to search for simpler and faster algorithms without floating point calculation.
We first perform an affine transformation and then perspective transformation. Suppose we know the coordinates of four corners (P1, P2, P3, P4) in the image plane and the top and bottom boundaries of the bounding box intersect at vanishing point A. Then under homogeneous coordinates
$A = L_1 \times L_2 = (P_1 \times P_4) \times (P_2 \times P_3)$,
Similarly the left and right boundaries intersect at
$B = L_3 \times L_4 = (P_1 \times P_2) \times (P_3 \times P_4)$.
A and B are points at infinity in the original plane. The third element of A and B under homogeneous coordinates should be 0 in the affine image. Any homography
that maps the perspective image back into an affine image should map A and B to infinity, which implies
This indicates we can calculate H3 using seven cross products. As shown in
we have (up to scale)
This “inverse” only requires changing two signs in the third row of H, which simplifies the coordinate transformation and keeps it numerically stable. A general numerical inverse, by contrast, often suffers from division by zero when H is nearly singular.
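As a worked sketch of this step, suppose H keeps the first two rows of the identity and only its third row, written $H_3$, is unknown. That form is an assumption: it is consistent with the statements about seven cross products and two sign changes, but the text does not confirm it.

```latex
% Assumption: H has identity first two rows; H_3 denotes its third row.
H = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ h_{31} & h_{32} & h_{33} \end{pmatrix},
\qquad H_3 = (h_{31}, h_{32}, h_{33})^T .

% A and B map to infinity when the third element of HA and HB vanishes:
H_3^T A = 0, \quad H_3^T B = 0
\;\Longrightarrow\; H_3 = A \times B \quad \text{(up to scale)} .

% Cross-product count: three for A, three for B, one for A \times B = seven.

% A division-free inverse, valid up to scale:
H \begin{pmatrix} h_{33} & 0 & 0 \\ 0 & h_{33} & 0 \\ -h_{31} & -h_{32} & 1 \end{pmatrix}
= h_{33}\, I .
```

Under this assumption the two sign changes fall on $h_{31}$ and $h_{32}$, and no division by $h_{33}$ is needed because homogeneous coordinates are only defined up to scale.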
In summary, instead of linearly solving for the homography $\tilde{H}$, we compute the coordinate transformation in the following way:
and use H−1 to map this affine coordinate to the image coordinate.
No floating point computation is required in the above procedure.
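The integer-only character of the procedure can be illustrated with a short sketch that computes the two vanishing points and a vector orthogonal to both, using seven cross products on integer pixel coordinates. The function names are hypothetical, and taking the third row as A × B is an assumption consistent with the requirement that A and B map to infinity. Python integers do not overflow, whereas fixed-width integers on a phone would need the magnitudes kept in range.

```python
def cross(p, q):
    """Cross product of two homogeneous 3-vectors with integer entries."""
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])


def dot(p, q):
    return p[0] * q[0] + p[1] * q[1] + p[2] * q[2]


def vanishing_geometry(p1, p2, p3, p4):
    """Corners as integer pixel (x, y) pairs.

    Returns (A, B, H3): the two vanishing points in homogeneous
    coordinates and a vector H3 with H3.A == 0 and H3.B == 0.
    """
    P1, P2, P3, P4 = [(x, y, 1) for x, y in (p1, p2, p3, p4)]
    A = cross(cross(P1, P4), cross(P2, P3))  # three cross products
    B = cross(cross(P1, P2), cross(P3, P4))  # three cross products
    H3 = cross(A, B)                         # seventh cross product
    return A, B, H3


# Usage: a trapezoid whose left and right edges meet at (5, -15).
A, B, H3 = vanishing_geometry((2, 0), (0, 10), (10, 10), (8, 0))
```

In this example the top and bottom edges are parallel, so the third element of A is zero, while B is a finite point.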
For an M×N “VCode” matrix we sample M×N coordinates on the image and read their gray scale values. Then we convert these gray scale values into binary (0 or 1). Since the image may be captured under various lighting conditions, and is further affected by changes in perspective angle, a fixed global threshold cannot be used. An adaptive threshold must be used to separate black pixels from white ones. We use k-means (k=2) classification to find the threshold: 1) Find the maximal and minimal values of this M×N gray scale matrix and use them as the two initial centers. 2) Assign every pixel to the class whose center is closer to the pixel's gray scale value. 3) Replace each class center by the average value of all the elements in that class. 4) Go back to 2) until the two centers do not change. After the classification, each entry of the M×N matrix is assigned either 0 or 1.
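The four steps above can be sketched as follows. The function name, the iteration cap, and the convention that the brighter class maps to 1 are assumptions; the text does not say which class is 0 and which is 1. Floating point averages are used here for brevity, although an on-device version would presumably stay in integer arithmetic.

```python
def kmeans_binarize(gray, max_iter=100):
    """Two-class k-means on an M x N matrix of gray values.

    Returns (binary, threshold); brighter cells map to 1 (assumed convention).
    """
    pixels = [v for row in gray for v in row]
    lo, hi = min(pixels), max(pixels)          # step 1: initial centers
    if lo == hi:                               # uniform image: nothing to split
        return [[0 for _ in row] for row in gray], float(lo)
    for _ in range(max_iter):
        # Step 2: assign each pixel to the nearer center (ties go to dark).
        dark = [v for v in pixels if abs(v - lo) <= abs(v - hi)]
        bright = [v for v in pixels if abs(v - lo) > abs(v - hi)]
        # Step 3: replace each center by the mean of its class.
        new_lo = sum(dark) / len(dark)
        new_hi = sum(bright) / len(bright)
        # Step 4: stop once the centers no longer move.
        if new_lo == lo and new_hi == hi:
            break
        lo, hi = new_lo, new_hi
    threshold = (lo + hi) / 2
    binary = [[1 if v > threshold else 0 for v in row] for row in gray]
    return binary, threshold


# Usage on a tiny 2 x 3 matrix of sampled gray values.
binary, threshold = kmeans_binarize([[10, 12, 200], [205, 11, 198]])
```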
Details of a preferred method of decoding are described with reference to
Our encoder is implemented as a web service which takes a file as an input and generates a GIF animation (GIF89A). We chose animated GIF because GIF is a standard format which can be opened in web browsers on any platform. Other formats such as MPEG and Flash are also possible but not as popular as an animated GIF. GIF animations can be generated by simply packing frames along the time line, as shown in
Our goal is to support a wide range of devices with various development platforms and operating systems. Porting and maintaining source code of an application among diversified platforms presents a very challenging task. For example, devices running Symbian, Windows Mobile and Palm operating systems have different requirements for development. Developing for the varying architectures, with different conventions for storing of data, different cache architectures, and managing different devices (displays, cameras, network) can be a significant burden for the developer. Efficiently and reliably embedding the same application into these different devices can be very expensive. In our strategy, we begin the development off line with emulators of different devices. The algorithm consists of a set of basic components managed by a core software control module. The core components will manage resources needed by the analysis modules. We then find identical components, and adopt a “one source, multiple project files” strategy. In this way, adding or updating existing algorithms in one platform will automatically update all other platforms. Using this strategy, we have developed for both Symbian OS and Windows Mobile 5 using one copy of source code. Our decoder was tested on Symbian: Nokia 6680 (Series 60 FP2), 7610 (Series 60 FP1) and Windows Mobile: UTStarcom PPC6700 phones. Although these three phones have different intrinsic camera parameters, our decoder works well on all of them without tuning parameters. This shows the stability and compatibility of our algorithm.
The “V-Code” is designed to work in three modes:
(1) The Static Mode: This is similar to existing 2D barcodes. A short message is encoded in a static image, and the camera phone reads this message when it scans over the code.
(2) The Handheld Mode: When downloading more data, the camera phone needs to read a sequence of frames and the user will have to hold the phone facing the visual sequence for a period of time. The user does not have to hold very still, as long as the “V-Code” is in scope; the program will track the “V-Code” automatically.
(3) The Dock Mode: This mode is intended for downloading larger amounts of data. It works when the phone is still and the position of the code matrix in the image remains unchanged. In the dock mode, the downloading speed is much faster because no geometrical computation is required after the first frame is located.
An important feature is that, unlike regular key triggered snapshots, the decoder of a preferred embodiment of the present invention is a no touch decoder. Once the decoder is started, the capture is dynamic. It not only eases the usage of software but also provides extra stabilization of the image. Usually a motion blur occurs at the moment the user presses the “capture” key. Since the phone has no hardware “stabilizer” the motion blur caused by key press is critical for image processing. Therefore we use the preview mode and process the frame stream.
For each frame, the first byte indicates its frame type:
When encoding a data file, the encoder generates the sequence header frame according to the file name and size, and then chops the file into chunks and generates a data frame for each chunk. Because any of the data frames might be dropped while capturing, all data frames are replicated three times. Finally the encoder puts the sequence header frame together with the data frames into a sequence of frames.
The decoder tries to decode every single frame it “sees” through the camera. To guarantee that the frame is read correctly it will be read twice and only accepted when the two matrices are identical. When reading the matrix, the decoder starts with the first byte, which must be Type I, II or III, to be considered a valid frame.
For Type I, it will decode all other bits in this frame and show it as a popup message. When the decoder sees Type II, which is the sequence header, it allocates the memory according to the file size and gets ready to accept data chunks. For each chunk, a flag is initialized as “incomplete”. When the decoder sees Type III, it first reads its frame offset and if the corresponding chunk is “incomplete” the reader will fill in this chunk and mark it as “complete”. When all chunks are completed the data is dumped to the file system.
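The decoder behavior for the three frame types can be sketched as a small state machine. Everything about the byte layout here is hypothetical: the type codes 1, 2 and 3, the 4-byte file size, the 2-byte frame offset, and the tiny `CHUNK_SIZE` are chosen only to make the example concrete. The sketch also leaves out the read-twice check and the file name.

```python
import struct

# Hypothetical codes for Type I, II and III frames.
TYPE_MESSAGE, TYPE_SEQ_HEADER, TYPE_DATA = 1, 2, 3
CHUNK_SIZE = 4  # hypothetical payload bytes per data frame


class Decoder:
    def __init__(self):
        self.buffer = None    # allocated when the sequence header arrives
        self.complete = []    # one "complete" flag per chunk
        self.message = None   # text of the last Type I frame

    def feed(self, frame):
        """Process one decoded frame (bytes). Returns the file contents
        once every chunk is complete, otherwise None."""
        if not frame or frame[0] not in (TYPE_MESSAGE, TYPE_SEQ_HEADER, TYPE_DATA):
            return None  # first byte is not a valid frame type
        if frame[0] == TYPE_MESSAGE:
            self.message = frame[1:].decode("ascii", "replace")
            return None
        if frame[0] == TYPE_SEQ_HEADER:
            if self.buffer is None:
                (size,) = struct.unpack(">I", frame[1:5])
                self.buffer = bytearray(size)
                chunks = (size + CHUNK_SIZE - 1) // CHUNK_SIZE
                self.complete = [False] * chunks  # all chunks "incomplete"
        elif self.buffer is not None:
            (offset,) = struct.unpack(">H", frame[1:3])
            if offset < len(self.complete) and not self.complete[offset]:
                start = offset * CHUNK_SIZE
                end = min(start + CHUNK_SIZE, len(self.buffer))
                payload = frame[3:3 + (end - start)]
                if len(payload) != end - start:
                    return None  # truncated frame, ignore it
                self.buffer[start:end] = payload
                self.complete[offset] = True
        if self.buffer is not None and all(self.complete):
            return bytes(self.buffer)
        return None
```

Since a chunk already marked complete is skipped, replicated copies of a data frame are harmless in this sketch, and frames may arrive in any order once the sequence header has been seen.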
An encoder in accordance with a preferred embodiment of the present invention may, for example, be implemented on WIN32 platform and take either a message or a file as input. For a message, it encodes it to a static image (BMP/JPG). For a file, it encodes it to a video file (WMV/AVI) or GIF (GIF89A) animation. The advantage of a GIF animation is that it could be played in any web browser through any platform, while the video file gives the user more control when playing.
A decoder in accordance with a preferred embodiment of the present invention may, for example, be implemented on Nokia Series 60 platform using “ECAM.LIB” which is provided in Symbian OS 7.1 or later. Such a decoder has been tested on Nokia 6680 and 7610 phones.
The “V-Code” of the present invention may be used as a data channel, so robustness is an important feature. Practically, the code presented might be noisy or partially occluded causing part of the matrix to be read incorrectly. For these situations we still want to recover the code and that is the reason we choose Reed-Solomon error correction.
Another important criterion for a data channel is its speed (bit rate). Unlike other channels, the “V-Code” of the present invention is visible to the user, and the user actually controls this channel by hand. The speed must therefore take HCI (Human Computer Interaction) issues into account.
Therefore, the following “speed test” is more a user study than a hardware or protocol test. The “V-Code” of the present invention was explained to four people, who were then asked to download an image, a ring tone and a small Java program to the Nokia 6680 phone by holding the phone still in front of a laptop screen (Dell Latitude D800, 15″). All three files were encoded as “V-Code” in the DIVX/MPEG4 video format at a frame rate of 2 frames/second, with 100 bytes of data in each frame. The nominal bit rate is therefore 2×100×8=1600 bps. For comparison, the files were also downloaded in dock mode, in which no frames are dropped. Dock mode performs roughly the same for all three files because no human factor is involved. The dock mode bit rate is 1455 bps on average, slightly below 1600 bps because of the overhead of the sequence header and the frame headers. In handheld mode the bit rate is about ⅔ of that in dock mode (1000/1455). Handheld mode takes longer because people cannot hold the phone still all the time: when the hand gets tired and the code drifts out of view, a frame is dropped. Since three copies of each frame are placed in the sequence, each dropped frame has two more chances to be captured later. However, the backup copy may arrive only after tens of other frames that have already been consumed. Another observation is that the longer the visual sequence, the lower the bit rate, because frame drops become more frequent the longer people hold the phone. After downloading the three files onto the phone, we ran a bytewise comparison against the original files and found them identical.
As stated in the performance section, there are two major areas for improvement: speed and usability. In handheld mode the download speed is about 1 Kbps, and in dock mode it increases to about 1.4 Kbps, which is still too slow for real applications. As for the completeness of the data, the data sequence is displayed three times. If all three copies of a data frame are dropped, the data is incomplete and cannot be recovered. It is painful for a user to hold the phone for two minutes and then have to start over.
As for speed, in preview mode a camera phone typically captures 10 VGA (640×480) color (RGB) frames per second. Each frame takes 640×480×3=900K bytes, so 900K×10=9 M bytes of information flow into the phone through the camera every second. At our bit rate of 1.4 Kbps, we use less than 0.01% of these 9 M bytes. Although we do not expect to achieve megabit rates through the camera channel, if we could increase the portion of these 9 M bytes that carries data to 1%, the bandwidth would be 90K bytes per second, which is much faster than the current GPRS connection (4 K-5 K bytes per second). One straightforward way to increase the bit rate is to increase the preview frame rate (fps), but the phone allows at most 10-15 frames per second. An alternative is to put more content in each frame. Here are some possible solutions:
(1) Increase the grid density. Use a smaller size for each black/white cell in the matrix. This requires the code area to be located more accurately. At low density the data can still be read correctly if the boundary shifts by one or two pixels, but at high density each data cell may be at most three or four pixels wide, which leaves little room to tolerate location error. A more refined finder pattern should be considered to increase the location accuracy.
(2) Use the color information. When the image is read from the camera, each pixel actually takes 24 bits (8 bits for each of the R, G and B channels). Although we do not expect to extract 24 bits of information from each pixel, separating the color channels can triple the bit rate or more. Note that each camera has a different CMOS/CCD sensor, so the same color pixel appears differently on different phones; therefore, to use the color information, a color alignment might be required.
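The bandwidth estimate that motivates these two options, and the gain available from color, can be checked with a few lines of arithmetic. The function names below are hypothetical, and the sketch assumes uncompressed 3-byte RGB pixels and an idealized channel in which every color is distinguished reliably.

```python
def raw_camera_bytes_per_second(width: int, height: int,
                                bytes_per_pixel: int, fps: int) -> int:
    """Uncompressed bytes flowing in from the camera preview each second."""
    return width * height * bytes_per_pixel * fps


def bits_per_cell(colors: int) -> int:
    """Whole bits one cell can carry if it takes one of `colors` values."""
    return colors.bit_length() - 1  # floor(log2(colors)) for colors >= 1


# VGA preview at 10 frames/second, as in the text.
raw = raw_camera_bytes_per_second(640, 480, 3, 10)  # 9,216,000 bytes/s
one_percent = raw // 100                            # 92,160 bytes/s

# Black/white cells carry 1 bit; one binary value per RGB channel gives
# 8 colors, or 3 bits per cell.
gain = bits_per_cell(8) // bits_per_cell(2)         # 3
```

Using 1% of the raw stream corresponds to roughly 90K bytes per second, and eight distinguishable colors correspond to the threefold gain mentioned for channel separation.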
Security can be provided by encrypting the “V-Code” before transmitting or posting the file even when using non-secure methods. For instance, someone leaves an encrypted “V-Code” message on their public webpage for only one or a few people with the password to view the message. Or, a business needs to transmit a message to an employee in the field when the business thinks someone has compromised their security wall.
For usability, there is a neat solution. We already use an error correcting code within each frame, so that the code can still be recovered under some occlusion. We can apply similar error correction across frames. For example, for matrix entry (i,j), even if 20% of the frames are dropped (the exact fraction depends on the error correction level), the values of (i,j) on all frames are still recoverable. That way, we do not have to repeat the data sequence three times and worry that all three copies might be dropped. We only need to insert some error correcting frames between data frames.
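The idea of error correction across frames can be illustrated with the simplest possible erasure code, a single XOR parity frame per group of data frames. This is a deliberately simplified stand-in, not the Reed-Solomon style scheme the text contemplates: it repairs only one dropped frame per group, whereas a stronger code could tolerate a larger fraction of losses. The function names and the grouping are assumptions.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length frames."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity_frame(frames: List[bytes]) -> bytes:
    """Error correcting frame for a group: the XOR of all its data frames."""
    return reduce(xor_bytes, frames)


def recover_missing(frames: List[Optional[bytes]],
                    parity: bytes) -> List[Optional[bytes]]:
    """Fill in at most one dropped frame (marked None) from the parity frame."""
    missing = [i for i, f in enumerate(frames) if f is None]
    if len(missing) > 1:
        raise ValueError("XOR parity repairs only one dropped frame per group")
    repaired = list(frames)
    if missing:
        present = [f for f in frames if f is not None]
        # XOR of the parity with every received frame leaves the missing one.
        repaired[missing[0]] = reduce(xor_bytes, present, parity)
    return repaired
```

With one parity frame per group of, say, four data frames, the overhead is 25% rather than the 200% of sending every frame three times, at the cost of tolerating fewer drops.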
Another interesting idea is to print several hundred static “V-Codes” on one page and let the user scan over the page. Suppose we print 20×20=400 code patterns on an A4 page, each encoding 100 bytes; the total amount of information is then 40K bytes, which can hold many J2ME programs. With a close-up lens, the codes can be printed even smaller, so that more information fits on one page. There are also security issues to explore: the “V-Code” is hard to break without knowing the mask, the data format and the error correction level, and these can be used as a shield to guard the encoded data.
Another method of “Branding” the “V-Code” would be to embed graphics in the visual stream, either spatially or temporally. Spatially, the graphics can be placed at arbitrary locations within a given frame, a subset of frames or the entire sequence. Temporally, the graphics take the place of an entire frame for selected frames in the sequence. For instance, the motto of a soda brand could appear to flicker sporadically throughout the “V-Code” while a user downloads a coupon. In another instance, the set of visual frames that downloads a ring tone to the user also carries images showing the singer performing the song being downloaded.
Another idea is to have the “V-Code” have pictures in individual visual frames that when viewed in sequence serve to draw attention to the “V-Code.” For instance, a “V-Code” might show a ball seemingly being kicked around inside the visual frame.
One of the direct applications of VCodes is for downloading data through visual communication. From the user's point of view two factors are important: the data transmission speed and robustness. Our experiments evaluate the performance of these two factors.
The factors directly affecting the data transmission speed are (1) the amount of data encoded in a frame, and (2) the frame rate at which the VCode is displayed and subsequently decoded. Assume the displayed frame rate is P frames/second and D bits are encoded in each frame, then theoretically the overall bit rate is P×D bits per second (bps). Therefore the increase of P and/or D will lead to higher bit rates. Practically however, it is much more complex. For example, if more bits are encoded in a frame (increasing D), it will increase the barcode density and decrease the resolution of a single cell unit when the image is captured, possibly leading to more decoding errors. If the frames are displayed too quickly (increasing P), the device may not be fast enough to capture and process them resulting in missed frames. The experiments we conduct in the following sections result in a quantitative analysis of these factors.
1) Data Capacity in a Single Frame: Currently, mainstream camera phones can capture a video sequence with a resolution of 320×240 pixels. Although a captured still image may have a megapixel or multi-megapixel resolution, a camera phone needs to capture and process frames continuously. Therefore a video mode is required, which limits D. Although next generation camera phones may capture HDTV quality video, in this paper our analysis is based on the majority of currently available devices.
As with all other 2D barcodes, the resolution (the number of pixels) of a unit cell, defined as a black or white square representing one bit of information (either 1 or 0), is crucial for decoding. Given the restriction of the frame size (320×240), increasing the number of bits decreases the resolution of a unit cell in captured images, leading to more erroneous bits and, correspondingly, to more extra bits being required to correct them. As addressed above, the total number of bits in a frame (N) consists of the data part (D) and the error correction part (E), so the actual data is D=N−E. It is important to find a balance between N and E to achieve the optimal result. To investigate this problem we performed a simulation by generating an all-zero data file and encoding it as a VCode with four different settings of unit cells: 28×35, 32×40, 40×50 and 48×60. The reason we selected an all-zero data file is that zero remains the same after the xor operation with the mask defined in
Where TB is the total number of bits that we can decode from F frames, and T is the time spent on decoding a frame. F=100 in this experiment and T depends on the number of unit cells. Since the complexity of sampling N points from an image and of decoding N-bit data is Θ(N), we have T ∼ N:
Let Err(i) be the number of erroneous bits on the ith frame and Data(i) be the number of bits we read from the ith frame, which could be either 0 or N−8E, depending on Err(i). If the number of erroneous bits in a frame is too large, the remaining bits will not then be enough to correct them. More specifically, we have:
Substituting (8) into (7), we have:
Where i ∈ 1 … F, as shown in
2) Display Frame Rate: Generally the display frame rate depends on how quickly a frame can be captured and processed by the camera phone, and this is device dependent. A frame cannot be displayed too briefly, since the camera phone needs enough time to perform geometrical correction, decoding and error correction. If it is displayed for too long, however, the camera phone will process the same frame again and again. Although the duplicate data will be identified and removed, re-decoding decreases the overall bit rate. The ideal situation is that the camera phone processes every frame exactly once. If a frame is dropped, it can be recovered by error correction or recaptured in the next round, since the VCode is displayed in a loop. We tested four different display frame rates with a NOKIA 6680 camera phone as the capture device. The data file selected was a 4 KB MIDI ring tone encoded as a VCode containing 60 frames. The VCode was displayed at frame rates of 20, 10, 6.6 and 4 frames/second on a 15 inch flat panel computer monitor. For each frame rate we let three users download the file onto the camera phone. The time t used for the download was recorded for each run and the throughput was calculated as 4096×8/t bps. The overall results are shown below in Table I.
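The throughput calculation used in this experiment is simple enough to state as code. The sketch below uses hypothetical function names, and the 16-second download time is a made-up illustrative value, not a measurement from the experiment. The second function gives the time for one loop of the VCode, which bounds the throughput from above when no frame is dropped or decoded twice.

```python
def throughput_bps(file_bytes: int, seconds: float) -> float:
    """Download throughput in bits per second, as 8 * size / time."""
    return file_bytes * 8 / seconds


def one_pass_seconds(frame_count: int, display_fps: float) -> float:
    """Time needed to display every frame of the sequence exactly once."""
    return frame_count / display_fps


# Illustrative values only: a 4096-byte file downloaded in 16 seconds.
example = throughput_bps(4096, 16.0)  # 2048.0 bps

# Best case for a 60-frame VCode shown at 10 frames/second: one 6-second pass.
best_case = throughput_bps(4096, one_pass_seconds(60, 10))
```

Every additional loop needed to recapture dropped frames adds another pass to the download time and lowers the measured throughput accordingly.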
From Table I, we see that when the animation frame rate is very high (20 fps) or very low (4 fps), the downloading bit rate is low. The optimal result is achieved when the animation frame rate is between 6.6 and 10 fps. To explain these results, we recorded the total number of dropped frames in each run. From Table II, below, we see that when the frame rate is high (20 fps), the number of frames dropped by the time the download finishes (over 600) is much higher than for the other settings.
Since the VCode contains only 60 frames, a large number of dropped frames indicates that the VCode looped several times before the download was complete. There are two reasons for dropping frames. First, the camera phone cannot process a frame within 1/20 sec. Second, when frames are displayed quickly, ghost images appear due to the “visual short term memory” of the camera. When black and white cells flip quickly, they appear gray rather than black or white.
When the frame rate is low (4 fps), the frame drop rate is also high because the camera keeps processing duplicate frames. Therefore, a frame rate between 6.6 and 10 fps is a good choice for the device used in this experiment.
3) Overall Downloading Bit Rate: After analyzing the specific factors affecting the download speed, we evaluate the overall throughput on a more comprehensive data set. We selected three data files as our test set: a MIDI ring tone, a Java game and a 3GP video. The sizes of these files are listed in Table III.
We let the same three users download these files and recorded the time taken until each download was complete. The bit rate is defined as the file size divided by the time spent on downloading. The average bit rates for downloading are shown in Table III. As we can see, the bit rate decreases as the file size increases. For comparison, we also placed the phone on a dock on a desk so that both the phone and the monitor are static, a configuration we call “dock” mode. In dock mode the download bit rate is very stable and independent of the file size, since no user factors are involved, and the bit rate is higher (around 3.3 Kbps) than in handheld mode.
1) Aspect Ratios of Displays: Flat panel display devices may have different aspect ratios (such as computer monitors, HDTVs, etc.). For example, on a wide-screen display the displayed image may be stretched to fit the display. This experiment tests the robustness of our algorithm when VCode images are stretched along vertical and horizontal directions. We use a JPEG image file with a size of 4 KB for the experiment. The file was encoded as a VCode and displayed with different aspect ratios ranging from 0.5 to 2.7 (width: height). The downloading speeds are shown in Table IV.
From Table IV we can see that the best download speed is achieved with aspect ratios from 1.2 to 1.5, i.e. the designed aspect ratio. When a VCode is stretched too wide (with an aspect ratio ≧2.7) or too narrow (with an aspect ratio ≦0.5) the download cannot be completed.
2) Image Contrast: Another factor affecting the performance is the image contrast. During experiments, we found that outside lighting does not affect the performance significantly, since the displays emit their own light (like active lighting). The contrast of the final image, which is the input of the V-Code decoder, is therefore determined by the display contrast together with the imaging sensor (camera+CMOS). If the contrast is too low, the black and white values move closer together and the bit error rate increases significantly. In this section we evaluate the robustness against contrast degradation. Instead of measuring the contrast of the original V-Code frames, we measure the contrast of the actual image being sent to the decoder. Usually the image contrast is defined as the difference between the maximal and minimal gray scale values of the image. However, a small amount of random noise can disturb the maximal and minimal gray scale values significantly. Instead, we use the difference between the average gray scale values of white and black pixels to measure the image contrast. These two average gray scale values are computed as a by-product of the binarization step. For each level of contrast, we measure the bit rate by dividing the total number of bytes downloaded by the total number of frames taken at that level of contrast. When the distance between the white and black average values is larger than 150, the downloading speed is unaffected. When it is smaller than 75, no information can be extracted due to the low display contrast.
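The contrast measure described here can be sketched as follows. The function names are hypothetical, and the binarization threshold is assumed to be a single global value defaulting to the mean gray level, since the text does not specify how the threshold is chosen. The label for contrasts between 75 and 150 is also an assumption; the text states only the behavior above 150 and below 75.

```python
from statistics import mean
from typing import Optional, Sequence


def class_mean_contrast(gray: Sequence[int],
                        threshold: Optional[float] = None) -> float:
    """Difference between the mean gray values of white and black pixels."""
    if threshold is None:
        threshold = mean(gray)  # assumed global binarization threshold
    white = [g for g in gray if g > threshold]
    black = [g for g in gray if g <= threshold]
    if not white or not black:
        return 0.0  # a uniform image has no usable contrast
    return mean(white) - mean(black)


def contrast_band(contrast: float) -> str:
    """Map a contrast value onto the thresholds reported in the text."""
    if contrast > 150:
        return "unaffected"
    if contrast < 75:
        return "unreadable"
    return "intermediate"  # behavior in this range is not stated in the text


pixels = [10, 20, 30, 200, 210, 220]    # 8-bit gray values of one image
c = class_mean_contrast(pixels)         # mean 210 minus mean 20
```

Because each class mean averages over many pixels, a single noisy pixel shifts this measure far less than it shifts the difference between the maximal and minimal values.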
These examples demonstrate that cameras can be used for pervasive transfer of data to mobile phones. The encoding and decoding methods comprise data splitting, error correction coding, image capture, correction of perspective distortion and decoding. The examples are analyzed quantitatively and provide guidance on the settings that maximize the bit rate. The results show that our approach is robust even when the image is stretched or the display contrast is low. The present invention provides a new method to enable camera phones to download data when other communication channels do not exist. While the current download speed may be somewhat slower than that of existing wireless or cable connections, it will improve significantly as camera resolutions become higher and processing speed increases. Further, bit rates may be increased by using color instead of black and white cells in the 2-D bar codes, so that each cell can carry more bits. If eight colors are used, for example, each cell carries three bits instead of one, so the speed can theoretically be tripled.
The foregoing description of the preferred embodiment of the invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed, and modifications and variations are possible in light of the above teachings or may be acquired from practice of the invention. The embodiment was chosen and described in order to explain the principles of the invention and its practical application to enable one skilled in the art to utilize the invention in various embodiments as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims appended hereto, and their equivalents. The entirety of each of the aforementioned documents is incorporated by reference herein.
The present invention claims the benefit of the filing date of U.S. Provisional Patent Application Ser. No. 60/865,602 filed on Nov. 13, 2006 by Xu Liu, David Doermann and Huiping Li. This prior application is hereby incorporated by reference in its entirety.
Number | Date | Country
---|---|---
60865602 | Nov 2006 | US