This invention is related to improving the fidelity and functionality of sharing physical written surfaces such as whiteboards, blackboards, paper and other writing surfaces, via video. It is focused on, but not exclusive to, visual enhancement, annotation, transcription, metadata enrichment, storage and broadcast of information on any type of physical writing. More particularly, it focuses on enhancing the readability of the information, enriching metadata about the images and overlaying additional collaborative functionalities on the writing surface once the video images are captured.
This application claims the benefit of U.S. Provisional Application Ser. No. 62/295,115 filed Feb. 14, 2016, entitled SYSTEM AND METHOD OF CALIBRATING A DISPLAY SYSTEM FREE OF VARIATION IN SYSTEM INPUT RESOLUTION, the entirety of which is herein incorporated by reference.
Remote collaboration is often facilitated using video or audio conferencing or screen sharing. In such a meeting, it is often beneficial to explain a concept by drawing a quick diagram or written explanation. On most computers, it is time consuming and difficult to make simple sketches, and the results often look pixelated. It is faster, easier, and more precise to draw by hand with a pen or marker, so paper or a whiteboard is a preferred medium for making sketches. Alas, if one uses paper or a whiteboard, it becomes challenging to digitally share the on-going sketch with remote users.
One solution is to point a videoconference camera at the whiteboard. This method has a few shortcomings. The image quality of the writing is often insufficient due to issues including glare, poor lighting, distance from the whiteboard, and noise inherent in the camera. Remote viewers often can't read the writing.
Another solution that has been tried is a method that improves the quality of a single image. That solution can work well for processing a single frame of a drawing session, but doesn't meet the real-time needs of a video session, and doesn't take advantage of the data in time.
If remote viewers can read the writing in the video imagery, they often desire a way to point at, reference or annotate areas of the written surface to guide the conversation and interact. As such, sending the video alone is valuable, but does not solve the full collaboration problem.
This patent describes a system and method that addresses these shortcomings.
The invention includes a method and system to visually capture, enhance, enrich, broadcast, annotate, index and store information written or drawn on physical surfaces. A user points a camera at a whiteboard and broadcasts the video stream so that other users can easily receive and collaborate remotely in multi-user fashion. The system enhances the video by leveraging the nature of video as a sequence of related frames over time and by leveraging the nature of the writing and the writing surface. Multi-user collaboration is facilitated by annotation tools, allowing users to add text and draw digitally on a video. Also facilitating the collaboration is an archival feature that saves a frame, a subset of the video, or the whole video locally on the user's computation device, or to our own or third-party servers in the cloud.
Initially, the system uses a web browser to broadcast and receive video. Users visit a website and configure their system through the browser. The system initially supports cameras embedded in a computer such as a laptop, cell phone or tablet, or cameras connected to those devices such as a USB webcam. In many cases, the user points a camera at the beginning of the meeting. In other cases, the camera is mounted and set up once, so that no subsequent setup is necessary.
The system addresses many writing surfaces including whiteboards, blackboards, glass, paper and other physical surfaces. Each surface has distinct properties and the algorithms can be tuned for each such surface. The writing on the surface and the surface itself are enhanced to improve the legibility of the writing.
The algorithms can also be made to adapt to available computational and bandwidth resources.
The video is sometimes distributed by relaying through one or more central computers. It can be sent peer-to-peer, sometimes relayed from person to person. Depending on the number of people on the call, and the bandwidth and processing power available, different options may be used.
The software system allows any camera pointed at a physical writing surface to perform all the functionalities above, so that users can easily collaborate remotely in a full-duplex, multi-party fashion. The broadcasting user will come to a website, which will access his web camera and broadcast the video stream to other users, who will watch on a website.
The invention description below refers to the accompanying drawings, of which:
Without limitation, here are some methods and concepts that form the basis of the algorithms for image enhancement. In each case, one example of how to use the algorithms is presented. A person skilled in the state of the art can expand on them straightforwardly.
Backgrounding.
A common way to identify pixels that have changed is Stauffer-Grimson backgrounding, published as: Chris Stauffer and W. E. L. Grimson, Adaptive background mixture models for real-time tracking, in Computer Vision and Pattern Recognition, volume 2, pages 252-258, 1999, the entirety of which is hereby incorporated by reference. The concept is that for a single pixel, there are multiple clusters of values: in this case, the background (e.g. the writing surface), writing, obstructions, etc. Each cluster is represented as a Gaussian distribution. Whenever a new value for a pixel comes in, if it is within 2.5 standard deviations of an existing cluster, it is accepted into that cluster. If not, a new cluster can be created. The goal of the algorithm is to classify what the pixel is currently representing. The algorithm is part of the class of Gaussian Mixture Model (GMM) algorithms. The backgrounding algorithm is noisy, so it is typically used in conjunction with a blob-detection or region-finding algorithm.
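By way of a non-limiting illustration, the following Python sketch uses OpenCV's MOG2 background subtractor, a Gaussian-mixture-model method in the same family as the Stauffer-Grimson algorithm; the parameter values are placeholders to be tuned per camera and surface.

```python
# Illustrative sketch: per-pixel GMM backgrounding using OpenCV's MOG2,
# a Gaussian-mixture-model background subtractor. Parameters are placeholders.
import cv2

subtractor = cv2.createBackgroundSubtractorMOG2(
    history=500,        # number of frames used to model each pixel
    varThreshold=16,    # threshold on squared distance to an existing Gaussian
    detectShadows=False)

def classify_frame(frame):
    """Return a mask where 255 marks pixels that do not match the background model."""
    return subtractor.apply(frame)
```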
Blob-Detection/Region-Finding Algorithm.
For pixels in a particular class, e.g. background, some of the pixels that are found will be next to other pixels of the same class and can be grouped into a region. For pixels classified as obstructions, the largest regions will be considered obstructions, while the smallest regions can be ignored; for example, regions of 3 pixels or fewer can be ignored as noise. Similarly, large regions often have holes that can be filled in. Sometimes the regions are dilated and then eroded to smooth out the edges; these are morphological operators on binary images. Taken together, these steps are called region-finding algorithms or blob detectors. Region-finding algorithms can also be run on grayscale or color images. In this case, pixels are clustered together if they are similar to their neighbors. As an example, a simple similarity test is to check whether the intensity is within 20% of a neighbor's intensity.
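A minimal region-finding sketch, assuming OpenCV and NumPy and a binary mask (e.g. the obstruction class from backgrounding); the kernel size and minimum area are illustrative placeholders.

```python
# Illustrative sketch: clean up a noisy binary mask by filling small holes,
# smoothing edges with dilate-then-erode, and dropping tiny regions as noise.
import cv2
import numpy as np

def find_regions(mask, min_area=4):
    kernel = np.ones((3, 3), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)   # fill small holes
    mask = cv2.dilate(mask, kernel)                           # smooth region edges
    mask = cv2.erode(mask, kernel)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    regions = []
    for i in range(1, n):                                     # label 0 is the background
        if stats[i, cv2.CC_STAT_AREA] >= min_area:            # ignore tiny noise blobs
            regions.append(labels == i)
    return regions
```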
Low-Pass Filter.
A type of digital filter that is convolved with the image. Commonly used to remove rapid spatial variations in the data and to reduce noise. It is very effective on writing surfaces, where the intensity variations are slowly varying; the exceptions are writing, obstructions and boundaries. A common low-pass filter is a Gaussian low-pass filter. Low-pass filters can be used in space, time, or both.
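As an illustrative sketch, a spatial Gaussian low-pass and a simple temporal low-pass (exponential moving average) might look as follows; the kernel size, sigma, and alpha values are placeholders.

```python
# Illustrative sketch: spatial Gaussian low-pass plus a temporal low-pass
# (running exponential average). Smaller alpha means heavier temporal smoothing.
import cv2
import numpy as np

def spatial_lowpass(frame, ksize=9, sigma=3.0):
    return cv2.GaussianBlur(frame, (ksize, ksize), sigma)

class TemporalLowpass:
    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.state = None
    def update(self, frame):
        frame = frame.astype(np.float32)
        if self.state is None:
            self.state = frame
        else:
            self.state = self.alpha * frame + (1 - self.alpha) * self.state
        return self.state
```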
High-Pass Filter.
High-pass filters are digital filters that select for fast changes. These filters are often used to look for writing or the boundaries of the writing surface. They are often very sensitive to noise, so running filters in time first (e.g. a low-pass filter in time before a high-pass filter in space) is extremely helpful to improve the signal-to-noise ratio. An effective example is to low-pass filter an image and subtract the result from the original image. One can design a high-pass filter using the Parks-McClellan algorithm or another filter-design algorithm that focuses on enhancing high spatial frequencies. Note that sometimes one is interested in high spatial or temporal frequencies, but not the very highest frequencies, which are the noisiest. In this case, the filter is still a high-pass filter, though it is sometimes called a band-pass filter: a filter that passes a specified range of frequencies.
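A hedged sketch of the subtract-the-low-pass approach described above, along with a simple band-pass variant formed from two Gaussians; the sigmas are placeholders.

```python
# Illustrative sketch: high-pass by subtracting a Gaussian low-pass from the
# original, and a band-pass formed as a difference of two low-pass results.
import cv2
import numpy as np

def highpass(frame, sigma=3.0):
    frame = frame.astype(np.float32)
    low = cv2.GaussianBlur(frame, (0, 0), sigma)   # kernel size derived from sigma
    return frame - low                              # fast spatial changes remain

def bandpass(frame, sigma_fine=1.0, sigma_coarse=4.0):
    """Passes a mid range of spatial frequencies, avoiding the noisiest highs."""
    frame = frame.astype(np.float32)
    return (cv2.GaussianBlur(frame, (0, 0), sigma_fine)
            - cv2.GaussianBlur(frame, (0, 0), sigma_coarse))
```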
Max/Min Filter.
For every pixel, look in a region around that pixel and replace it with the maximum; this is the max filter. The min filter is formed likewise using the minimum. They are often applied to total intensity, or to each color channel. These filters are particularly useful for dealing with camera intensity falloff on white surfaces. With intensity variations, one wants to make comparisons against the local intensity, not against a global intensity. For a white surface, the max filter gives a measure of the local intensity. The min filter is useful for finding the writing on whiteboards. To reduce noise, one often uses a max filter followed by a min filter (or vice versa). Low-pass filtering first, spatially and/or in time, is also useful to reduce the effects of noise. Note that the correct filter to use clearly depends on the type of surface being used: for blackboards and other dark surfaces, one uses the min filter first to find the background intensity. These filters are examples of morphological operators on grayscale images. Note that the morphological operators are sometimes applied using regions shaped like lines; one can use them to find stroke-like objects.
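An illustrative sketch using OpenCV's grayscale dilate (max filter) and erode (min filter) to estimate the local background of a white surface and normalize against it; the kernel size is a placeholder and the function names are ours.

```python
# Illustrative sketch: grayscale max/min morphological filters. For a whiteboard,
# the max filter estimates the local background intensity; dividing by it
# normalizes camera falloff so dark strokes stand out.
import cv2
import numpy as np

def local_background_white(gray, ksize=15):
    kernel = np.ones((ksize, ksize), np.uint8)
    background = cv2.dilate(gray, kernel)           # max filter
    background = cv2.erode(background, kernel)      # followed by min, to reduce noise
    return background

def normalized_writing(gray, ksize=15):
    background = local_background_white(gray, ksize).astype(np.float32) + 1e-6
    return gray.astype(np.float32) / background     # writing << 1, surface ~ 1
```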
Spline Based Intensity Fitting.
One can fit a spline to the overall intensity of the writing surface and then use it to normalize the surface intensity in order to make global comparisons. One has to be careful in fitting the spline: often one runs a filter to determine the average intensity locally, and a robust average intensity metric is often better.
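A minimal sketch of spline-based intensity fitting, assuming SciPy is available: block-wise medians provide a robust local intensity estimate, a smoothing spline is fit to those samples, and the image is normalized by the fitted surface. The block size and smoothing factor are placeholders.

```python
# Illustrative sketch: fit a smooth spline to robust block-wise intensity samples
# and divide it out. Block medians keep the writing from skewing the fit.
import numpy as np
from scipy.interpolate import RectBivariateSpline

def spline_normalize(gray, block=32):
    h, w = gray.shape
    ys = np.arange(block // 2, h, block)
    xs = np.arange(block // 2, w, block)
    samples = np.array([[np.median(gray[y - block//2:y + block//2,
                                        x - block//2:x + block//2])
                         for x in xs] for y in ys])
    spline = RectBivariateSpline(ys, xs, samples, s=len(ys) * len(xs))
    surface = spline(np.arange(h), np.arange(w))     # smooth intensity surface
    return gray.astype(np.float32) / np.maximum(surface, 1.0)
```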
Robust Metrics.
Robust metrics are measures that are robust to noise, and in particular to outliers. There are many robust metrics commonly known in the machine learning community; in many of them, one tries to do outlier detection. As one example, one can form a histogram of a range of values and look for the mode, the peak in the data. One can then cluster the data and omit the data not in the cluster. For example, to find the average intensity in a region of a writing surface, one might want to ignore the writing, which can have very different intensities than the background writing surface. To achieve this, form a histogram of all the values, look for a cluster, and throw away the remaining data. There are many clustering techniques. A simple example is to start at a peak in the histogram and move in each direction until the counts have subsided to half or a third of the peak value. More complicated clustering methods consider the relative width of the cluster of data and the relative distance between adjacent samples.
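An illustrative robust-mean sketch following the histogram-mode and half-peak clustering just described; the bin count is a placeholder and the input is assumed to be a 1-D NumPy array.

```python
# Illustrative sketch: robust average via histogram mode and a half-peak cluster.
# Values far from the dominant cluster (e.g. writing on a background patch) are ignored.
import numpy as np

def robust_mean(values, bins=64):
    counts, edges = np.histogram(values, bins=bins)
    peak = int(np.argmax(counts))
    lo, hi = peak, peak
    # Walk outward from the peak until counts fall below half the peak height.
    while lo > 0 and counts[lo - 1] >= counts[peak] / 2:
        lo -= 1
    while hi < bins - 1 and counts[hi + 1] >= counts[peak] / 2:
        hi += 1
    keep = (values >= edges[lo]) & (values <= edges[hi + 1])
    return values[keep].mean()
```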
Noise Measurements.
One can examine a pixel in time, and find its mean and standard deviation in time. The standard deviation estimate is very noisy, so sometimes one averages a number of measurements together spatially to improve the measure. One must be careful in averaging spatially, however, as the standard deviation of many cameras depends on the intensity of the pixel.
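A minimal sketch of this temporal noise estimate, assuming a short list of grayscale frames; the spatial averaging window is a placeholder.

```python
# Illustrative sketch: per-pixel noise as the temporal standard deviation over a
# short stack of frames, then lightly averaged spatially to steady the estimate.
import cv2
import numpy as np

def estimate_noise(frames):
    stack = np.stack([f.astype(np.float32) for f in frames], axis=0)  # (N, H, W)
    per_pixel_std = stack.std(axis=0)
    # Spatial averaging reduces the variance of the estimate, but camera noise
    # typically depends on intensity, so heavy averaging can blur that dependence.
    return cv2.blur(per_pixel_std, (5, 5))
```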
Color/Intensities.
We work in many color spaces including:
Histogram Techniques:
We often form one-, two- or three-dimensional histograms: counts of how many data points have a specific set of parameters. In these histograms we look for clusters and/or thresholds. We can look for peaks in the histogram, and for peak widths, to form thresholds around those peaks. Peaks are the bins with the largest counts. Their widths can be measured in many ways, such as examining nearest bins until one finds a bin with height less than half the peak height.
Segmentation:
For a whiteboard, one can find the histogram of intensities across an image of the whiteboard. One might use a max-min filter first to gain an estimate of local intensity, then normalize the image, and then histogram. One would expect the writing surface to have near-uniform intensity and to be large. One can therefore find the largest peak in the histogram; for a whiteboard, one can threshold so that all higher intensities are also treated as likely part of the writing surface. This is one method to make sure glare regions are not ignored. One can then use region-finding methods to find the whiteboard. One can use the estimates of noise to decide that the threshold can actually be lower, by one or two standard deviations, than the peak of the histogram. There are many variants on these algorithms, including methods that do region finding based on intensity similarities to neighbors without crossing edges. One can start by finding numerous connected regions and then determine, based on size, uniformity of color, presence of writing, and location in the camera, which region is most likely to be the surface of interest.
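A hedged end-to-end sketch of this segmentation recipe, assuming OpenCV and NumPy; every constant is a placeholder to be tuned per camera and surface, and the fixed margin below the peak stands in for the noise-based threshold adjustment.

```python
# Illustrative sketch: normalize by a local max estimate, histogram the result,
# threshold a bit below the dominant peak, and keep the largest connected region.
import cv2
import numpy as np

def segment_whiteboard(gray, noise_sigma=2.0):
    kernel = np.ones((15, 15), np.uint8)
    local_max = cv2.dilate(gray, kernel).astype(np.float32) + 1e-6
    norm = gray.astype(np.float32) / local_max           # surface pixels -> near 1.0
    counts, edges = np.histogram(norm, bins=100, range=(0.0, 1.2))
    peak_value = edges[int(np.argmax(counts))]
    threshold = peak_value - noise_sigma * 0.02          # placeholder margin below the peak
    mask = (norm >= threshold).astype(np.uint8) * 255    # higher intensities (glare) stay in
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    if n <= 1:
        return mask
    largest = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))
    return (labels == largest).astype(np.uint8) * 255
```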
Writing/Text Detection.
There are a number of algorithms, such as the one by Neumann et al. (Neumann L., Matas J.: Real-Time Scene Text Localization and Recognition, CVPR 2012) or MSER-based detectors (Chen, Huizhong, et al. "Robust Text Detection in Natural Images with Edge-Enhanced Maximally Stable Extremal Regions." Image Processing (ICIP), 2011 18th IEEE International Conference on. IEEE, 2011.), the text of each of which is incorporated herein by reference. A simple algorithm is to use the backgrounding algorithm to do text detection, and then use a region-connection algorithm. Another method is to use a multi-dimensional histogram based on measures of edges, intensity and color, and cluster pixels into connected regions or strokes.
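As one non-limiting illustration, OpenCV's MSER detector can supply candidate stroke regions; this is merely one building block of the cited detectors, not a reimplementation of them.

```python
# Illustrative sketch: stroke/character candidates from maximally stable extremal
# regions (MSER), one common building block of scene-text detectors.
import cv2

def detect_stroke_regions(gray):
    mser = cv2.MSER_create()
    regions, boxes = mser.detectRegions(gray)   # candidate stroke/character regions
    return boxes                                # (x, y, w, h) per candidate region
```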
Edge-Measurement/Degree of Edge:
There are numerous ways to measure edges. A difference of Gaussian filters is a common technique; it effectively forms a band-pass filter. A Canny edge filter is also commonly used. Yet another method is a high-pass filter. Morphological operators on grayscale images with regions shaped like lines also yield a measure. Note that both the magnitude and the sign of these measures are informative: for many measures of the degree of edge, the result has a different sign just outside a dark region than just inside it. To enhance writing or background, one can enhance these regions based on sign. One can also choose not to enhance regions whose degree of edge is too small.
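An illustrative signed degree-of-edge measure from a difference of Gaussians; the sign convention (negative near the inside of dark strokes on a light surface) is approximate and the thresholds are placeholders.

```python
# Illustrative sketch: signed degree-of-edge from a difference of Gaussians.
# The sign flips between the inside and outside of a dark stroke, which lets an
# enhancement darken writing without brightening the surrounding surface.
import cv2
import numpy as np

def degree_of_edge(gray, sigma_fine=1.0, sigma_coarse=3.0, min_edge=2.0):
    g = gray.astype(np.float32)
    dog = cv2.GaussianBlur(g, (0, 0), sigma_fine) - cv2.GaussianBlur(g, (0, 0), sigma_coarse)
    dog[np.abs(dog) < min_edge] = 0.0      # ignore regions whose edge response is too small
    return dog                             # signed edge measure
```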
Boosting.
The concept is to chain together many simple algorithms. Typically each one has a high false-positive rate, and a very low false negative rate. If you apply many of these algorithms over and over again, each time some of the false positives are removed. One chains enough algorithms so that the overall result is an algorithm that has a very low false negative rate, and a very low false positive rate. One can use algorithms based on histograms to achieve the desired results. One can use this concept over and over again in segmentation, applying thresholds in many different spaces to get a good result.
The camera, 103, captures a video stream that can be transmitted to remote viewers. The video is generally a sequence of image frames captured at a regular rate, such as 30 frames per second. The rate is typically uniform, but it can vary. It can also be adjusted depending on the bandwidth available to transmit, and the available computation to compress the video for transmission. Some cameras have ‘photo mode’ which produces a still image of higher quality than the default video. One can effectively take still images many times sequentially to form a video stream.
Note that the camera 103 need not be embedded. It could be an attached device, such as a camera connected to the laptop via USB (often referred to as a "webcam"). Or it could be an IP camera connected via a wired or wireless data network (not shown).
The laptop 101 can enhance the video before transmission, taking advantage of the expected qualities of the writing on the surface. Enhancement means improving the clarity and legibility of the writing on the written surface. The goal is to improve a visibility or readability metric. Two examples of readability metrics are per-pixel signal to noise ratio, and the average size, intensity and count of spurious edges after enhancement. A common goal is to amplify the difference between the intensity of the writing and the background, and to reduce the overall noise.
The enhancement is typically done to compensate for many factors. Example enhancement goals include reducing the effects of noise, compensating for properties of the camera lens such as low-pass filtering caused by the lens, and compensating for artifacts introduced by compression and processing of the camera video. Enhancement is also done to compensate for factors in the room that make the writing surface less clear, or for skew caused by the positioning of the camera. Enhancement can be done to minimize the bandwidth that will be required to transmit the video. For transmission, it can also be done to anticipate and avoid artifacts caused by compression and/or transmission.
Note the laptop, 101, can more generally be any type of computer. The computer will most often use a central-processing unit (CPU) such as the Intel Core processor family of CPUs (Intel Inc. Santa Clara, Ca) and/or a graphics processing unit (GPU) such as the Quadro family of GPUs (NVIDIA Inc, Santa Clara, Ca). The computer needs to be able to access video from the camera, and transmit the video remotely. For the case of a computer with a display such as a laptop, the video is also displayed locally.
The video can be transmitted peer-to-peer to one or more devices 211 including laptop, cell phone and tablet shown for reference. Or, the video can be relayed through computers 213 connected via the Internet. For example, if the network bandwidth from 205 is limited, the computer 205 can broadcast the video to a computer 213, which can broadcast to other computers 213, which can then broadcast to the devices 211. Similarly, the devices 211 can be used as a video relay, receiving the video and re-broadcasting to other devices 211.
The data can be archived, which is indicated as a cylinder in 209. Note that the archival system may have a processor associated with it to process data before archival.
In
The enhancement processing may be done in computer 205. It may be done at any point in the process. That is, it might be done before broadcast. It may be done by a computer(s) 213, or receiving devices 211. There is clearly a trade-off between where the computation is done, and the bandwidth available to transmit the results of the computation.
The computers 213 may perform additional duties. For example, for peer-to-peer connections between the devices 211, and between 205 and the devices 211, it is helpful to have a mediation computer that transmits the data needed to form the connection. Such a computer is often called a signaling computer. The WebRTC standard (WebRTC 1.0: Real-time Communication Between Browsers, W3C Working Draft 10 Feb. 2015, Bergkvist et al., The World Wide Web Consortium (W3C)), which is incorporated by reference, provides good background information on how to form such connections and how to use a signaling computer; this is well known in the state of the art. The computers 213 can also handle permissions, e.g. who is allowed to log in, as well as invitations. They can handle user presence detection, e.g. who is currently taking part in the written communication session, and report that information to users.
There is also a control module 350 that helps orchestrate the entire system. That includes, but is not limited to, helping to set up communications between the devices, as well as presence detection and permissions.
The overall system is often referred to as a virtual room, as it is a place where a number of users come and share data. In this case, the data being shared includes video and stills (typically processed), history, and annotations.
In
Step 530 is the encoding step. The video is encoded for transmission. Feedback loop 535 communicates to the video processing step. The two steps often share resources, typically a CPU and a GPU. In the event that the encoder is at risk of not completing the encoding in time to process the next frame in real time, that information can be communicated to the previous step so it can adapt. The encoder can also indicate/measure any artifacts that are introduced with the encoding, and report those to step 520 to adapt.
In step 540, the video data is transmitted via the network. In step 545, the transmission step can report back that the bandwidth available is at risk of being overwhelmed.
The video may be sent peer-to-peer. Or, in step 550, the data can optionally be received and re-transmitted by computers 213. The video may also be processed in this step. One of the advantages to relaying the data is that available bandwidth can be used to transmit once, and the relay computer can relay to multiple devices 211, and to other relay computers 213 and then on to the devices 211. The total system can ensure it has enough bandwidth.
Note that sometimes one device 211 will have surplus processing power and bandwidth. In this case step 550 might be done by one of those devices 211.
The receiver will receive the data in 560, and decode the image frames in step 570. In step 580, the data will optionally be processed. Steps 520, 550 and 580 can share the work, or the work can be distributed among the three steps as processing time allows. Not shown, steps 580 and 550 can report back to step 520 the available processing power to optimize where the processing is done.
Note that for the local display of data, step 520 can go directly to step 580; that is, there is no need to transmit the data.
The video can then be displayed at 590, which is an instantiation of module 340, and can be archived at 592, an instantiation of 330.
In step 620, the annotation module can optionally inform the video module that parts of the image are being covered up by annotations. Those regions need not be processed or transmitted. In step 630, the annotations are transmitted from one device to another, such as devices 211 and computer 205, or to the archival system to be archived 632. The annotations are then displayed, most typically on top of the video, 634.
Sometimes, cameras will move. After the motion, the annotations are out of place. The video module, for example as part of the video processing step 520, can report motions to the annotation module. In step 636, video stabilization algorithms are run.
Video stabilization algorithms are well known to those in the state of the art. One often runs a feature detector in an image, looking for corners and shapes for example, and then tries to find a feature in the new image in a similar place. Since writing surfaces are most typically planar, one can look for a homography as a transform from one image to the next.
The cameras that are used are typically fixed, or fixed for long periods. For example, the camera in a laptop, or as part of a webcam, or as part of a tablet can be positioned and will typically remain in place for long periods. Occasionally the camera will be moved, causing a large change.
Step 740 involves video processing of the data as part of a decision process to store the data. For example, the archival system may be configured to store a frame once a minute. An alternative is to look for key frames. For example, if the whiteboard is erased, storing the last frames before erasure is beneficial. To detect erasure, one looks for large changes in the number of pixels that are represented as writing. One can use the backgrounding algorithms already described and look for large changes in the number of pixels labeled as writing. Or, one can use one of the writing detectors previously mentioned and look for changes in the number of pixels.
Rather than detecting erasure, the video processing system can also detect the addition of writing. The system can track the number of writing pixels added over time. When there is a slowdown in the number of pixels added per unit time, a frame is archived. In practice, we find that people stop writing for at least 30 to 60 seconds, if not longer, for discussion. During that time the number of writing pixels added goes to 0, and that is one situation where archiving a frame is reasonable.
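A minimal sketch of such an archival trigger, assuming a per-frame writing mask from one of the writing detectors above; the pause length and erasure fraction are placeholders (e.g. roughly 30 seconds at 30 frames per second).

```python
# Illustrative sketch: archive a frame when writing activity pauses, or when a
# large drop in writing pixels suggests an erasure. `writing_mask` is assumed to
# come from a writing detector; thresholds are placeholders.
import numpy as np

class ArchiveTrigger:
    def __init__(self, pause_frames=900, erase_fraction=0.3):
        self.prev_count = 0
        self.frames_since_growth = 0
        self.pause_frames = pause_frames          # e.g. ~30 s of no new writing at 30 fps
        self.erase_fraction = erase_fraction      # fractional drop treated as an erasure
    def update(self, writing_mask):
        count = int(np.count_nonzero(writing_mask))
        erased = count < (1.0 - self.erase_fraction) * self.prev_count
        self.frames_since_growth = 0 if count > self.prev_count else self.frames_since_growth + 1
        paused = self.frames_since_growth == self.pause_frames
        self.prev_count = count
        return erased or paused                   # True -> archive a frame
```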
Another example of video processing is piecing together frames. For example, if an obstruction such as a human moves from one side of the surface to the other, the system can detect that a part of the whiteboard that was not previously visible is now visible. The system can therefore piece together different frames to get a complete snapshot of the whiteboard. Detecting obstructions is relatively straightforward using backgrounding algorithms together with region detection algorithms. It is also straightforward to store known parts of the frame, and fill in pieces when new data becomes available or changes.
Another type of artifact is an obstruction such as 907, which represents a whiteboard eraser or 909, which represents a human. Some obstructions, like the eraser, change very rarely. Other obstructions, like humans, move a lot. Both pose challenges to image processing algorithms.
There are other types of challenging intensity and color artifacts. For example, camera lenses often have intensity fall-off, so that the intensity at the center of the image and the intensity at the edge of the image are often not uniform. Also causing intensity artifacts are shadows, which are quite common on whiteboards and often caused by obstructions of light sources in the room. Similar to intensity artifacts are color artifacts. For example, lenses can treat colors differently so that images might, for example, appear more yellow at the edges than at the center. In fact, spatial distortions can be per-color as well. Sometimes there are chromatic aberrations, where the colors emanating from a single object on the wall land on the sensor in different places. The lens often causes this type of issue.
Another type of artifact that one often sees is halos. The camera data is processed coming out of the camera, often to compensate for lens reduction in high spatial frequencies, and the processing creates artifacts, which often appear as halos around sharp objects.
Another challenging issue, not shown, is camera motion. If the camera is permanently mounted, it may vibrate a bit. If it is sitting on a table, the camera may be nudged, or it may fall over. In these cases, any per-pixel memory of a correction system can be rendered inaccurate when the camera is re-positioned.
Not only can there be initial processing, there may be ongoing processing 1024. The ongoing processing can check for the gain of the camera or motion of the camera or change in lighting. It may re-do the initial processing periodically, or when motion has happened or lights have moved. The goal is to check for potential changes that can affect future steps, particularly step 1040, the memory-ful processing.
The processing is then divided into memory-less 1030 and memory-ful 1040 processing. The distinguishing factor between these two is driven by whether they can be done in real-time.
Step 1030 is memory-less processing. Memory-less processing takes the prior few frames and processes them in real time to get a result. If there is motion, the algorithms may produce strange results during the motion or immediately thereafter. That said, once the system has processed the last few frames after the motion, there are no additional artifacts.
Step 1040 is memory-ful processing. Memory-ful processes use an intermediate calculation, typically because the calculation can't be done in real time. The intermediate calculation is stored in memory. These processes are typically per-pixel calculations that change significantly if the camera or obstructions move. For example, calculating the region of the whiteboard in the camera image may be too slow to do in real time. That calculation is affected if the camera moves. And, calculating per-pixel corrections to whiten the image is affected if the camera moves, or the obstruction moves. These steps are often too slow to be calculated in real-time. They may also require many frames of memory, such as backgrounding algorithms.
In step 1050, the results of the processing can be optionally characterized. The results may be too noisy, and the parameters of the algorithms can be tuned based on the results. Sometimes it is better to do this step after transmission, e.g. as part of step 580, so that the results are seen after decoding and transmission. While most often the characterization step goes back to steps 1030 and 1040, they can also sometimes feedback into steps 1020 and 1024.
Examples of the feedback loop 535 and 545 are 1070 and 1080. In 1070, the processor usage is fed back into the processing 1060. The available bandwidth 1080 is fed back into the processing 1060. The algorithms can be adjusted based on the available bandwidth and processing availability.
A key algorithm for memory-ful processing is detecting camera motion. There are numerous ways to do this. One method is to detect edges on the writing surface; without camera motion, the edges are typically nearly stationary. An example of an algorithm to detect camera motion: for a new frame, consider many edge pixels. For each frame, for each such pixel, if an edge is not where it was in the last frame, one can search a few pixels around it to find it. Count the fraction of edges not found within that search; if it is above a threshold, the camera has moved. Once motion is detected, the algorithm needs to be re-initialized with the new location of edges.
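An illustrative sketch of this edge-tracking motion test, assuming OpenCV and NumPy; the Canny thresholds, search window, sample size, and lost-edge fraction are placeholders.

```python
# Illustrative sketch: sample edge pixels from a reference frame, search a small
# window in the new frame, and declare camera motion if too many edges are lost.
import cv2
import numpy as np

def camera_moved(ref_gray, new_gray, search=3, lost_fraction=0.2, max_points=2000):
    ref_edges = cv2.Canny(ref_gray, 50, 150)
    new_edges = cv2.Canny(new_gray, 50, 150)
    ys, xs = np.nonzero(ref_edges)
    if len(ys) == 0:
        return False
    idx = np.random.choice(len(ys), min(max_points, len(ys)), replace=False)
    lost = 0
    h, w = new_edges.shape
    for y, x in zip(ys[idx], xs[idx]):
        y0, y1 = max(0, y - search), min(h, y + search + 1)
        x0, x1 = max(0, x - search), min(w, x + search + 1)
        if not new_edges[y0:y1, x0:x1].any():    # edge not re-found near its old location
            lost += 1
    return lost / len(idx) > lost_fraction        # too many lost edges -> camera moved
```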
If an obstruction such as a human moves in front of a pixel during this time, it causes errors in the measurements. This issue can be handled in several ways. First, one tries not to collect too many frames. At 30 frames per second, 20 frames, for example, is two thirds of a second. In most cases, an obstruction such as a human won't move very much on that time scale. A second method is to use segmentation techniques: one can, for example, estimate noise using only pixels that belong to the writing surface. Another method is to show a message in a window in the graphical user interface asking the user of the system to remain still. If pixels are thrown away, the data for those pixels can be re-measured. Or, it can be extrapolated from other data.
One can also use clustering algorithms to look for outliers in the data, step 1110. For the example of collecting 20 frames, for each pixel, for each color, there are 20 measurements. One looks for and rejects outliers in the data as described in the discussion of robust metrics. If too high a percentage of outliers is rejected, the data can simply be thrown away and re-collected.
Another challenge is that in some systems, we may not have control over the gain of the camera. In this case, the input/output curve of the camera may change over time. One may monitor the camera and look for trends in gain. If the camera gain is changing, one can compensate for it. Or, one can simply throw away the data and wait for the camera gain to stop changing. Gain estimation is described in
It is worth noting that for the measurements we have examined, the noise is not stationary across the image; it varies. In fact, our measurements show it depends on the intensity of the pixel. One model of the camera is that the light collection device and the corresponding analog-to-digital converter have approximately uniform noise across the camera. However, the system applies a formula similar to f(x) = (Ax + B)^(1/γ) + C, where x is the collected intensity and A, B and C are unknown constants. The camera applies a contrast and brightness, which is typically a scale factor such as A and an offset such as B and/or C. The 1/γ power compensates for most screens, which apply roughly a 2.2 power to the signal; thus γ is typically about 2.2. Noise effectively changes the intensity x by a small amount dx. The effect of the noise is approximately (df/dx)·dx. That is, the effect of the noise is magnified by the derivative of the function, f′(x) = (A/γ)(Ax + B)^((1/γ) − 1). It is this effect that we actually measure, not the un-magnified noise. If one assumes that noise is uniform across the image, then the measurements we make effectively measure the derivative of the input/output curve of the camera, 1130. Our measurements have shown that the data closely follows a function like the above: it decreases with increasing intensity.
The estimates of noise are often quite noisy themselves. There are many methods to decrease the noise in the estimates. The first is to fit a curve of noise against intensity. A second way is to average the estimates of noise with neighbors. A more precise way is to do that average without including pixels of very different intensity. One can form a partial differential equation to solve for the noise that minimizes the squared difference between a pixel's noise and its neighbors', summed with the squared difference between the measured noise and the actual noise. The weighting of the first term can be inversely proportional to the intensity gradient between the pixels, a simple first difference. Such equations can be solved iteratively, typically using non-linear gradient-descent methods.
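As a minimal illustration of the first method, fitting a curve of noise against intensity, a low-order polynomial can stand in for the derivative model above; the degree is a placeholder.

```python
# Illustrative sketch: steady per-pixel noise estimates by fitting a smooth curve
# of noise versus intensity and reading each pixel's noise off that curve.
import numpy as np

def fit_noise_vs_intensity(mean_img, std_img, degree=3):
    x = mean_img.ravel().astype(np.float64)
    y = std_img.ravel().astype(np.float64)
    coeffs = np.polyfit(x, y, degree)
    curve = np.poly1d(coeffs)
    return np.clip(curve(mean_img), 0, None)    # smoothed noise estimate per pixel
```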
In
One effective segmentation technique finds a measure of color, intensity, degree of edge, and noise at every pixel. The per-pixel intensity can be measured, for example, by summing the red, green, and blue (R,G,B) components of the pixel, if (R,G,B) is the color space. An algorithm to measure noise has already been described in 1120. Estimates for 1120 can be made without first detecting the region.
Two histograms can be particularly useful in finding the regions. The first is a histogram of total intensity: for each intensity level, find the number of pixels with that total intensity. Often, there is a peak that is much larger than the rest; that is a good indication of the region of interest's intensity. A measure of the confidence is the difference in count from the largest peak to the next largest peak. When comparing those counts, it is important to include the effects of noise. A simple way to do that is to include counts not just from the peak bin, but from nearby bins whose intensity is within the noise of that intensity. For robust algorithms, we often keep track of a few of the largest peaks in case the largest uniform region turns out to be in the background (e.g. a piece of paper on a table where the image of the table is much larger than the piece of paper).
A second histogram that may be useful is the color histogram. One can plot a 2D color histogram, for example B−G vs. R−G, and look for maxima. Again, to get the counts right, one uses the measurements of noise to include neighboring bins.
Similarly, one can instead plot a 3D histogram of (R+G+B, B−G, R−G) and look for maxima. Again, one looks in neighboring bins according to the effects of the noise. As before, a goodness measure is the difference in counts from the largest bins to the next largest bins.
The histograms can initialize estimates of which colors and intensities are likely those of the writing surface. Additionally, the widths of those peaks can be measured, as already described, to get a sense of the variability within the region. Given that regions have intensity gradients, lighting gradients, etc., the width of the peak may be dominated by issues unrelated to per-pixel noise. Thus, normalizing by a local intensity measure can be beneficial, such as the one from the max/min morphological operator, or from running a low-pass filter across the image.
The next step is to connect regions using a region-growing algorithm. A simple way to do that is to find all the pixels that fall within a particular peak in one of the histograms, and use those pixels to start a region-growing algorithm. For example, connect the top, bottom, left and right neighbors of a pixel if the difference between that pixel and its neighbor falls within the measured width of the peak. A much stricter method is to use the noise width measured at each pixel, rather than the width of the peak.
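A hedged sketch of this seeded region-growing step, assuming a normalized intensity image and a seed mask derived from a histogram peak; the tolerance is a placeholder for the peak width or the per-pixel noise width.

```python
# Illustrative sketch: seeded region growing. 4-connected neighbors are added when
# the intensity step to the neighbor falls within a tolerance.
import numpy as np
from collections import deque

def grow_region(norm, seed_mask, tol=0.05):
    h, w = norm.shape
    in_region = seed_mask.copy()
    queue = deque(zip(*np.nonzero(seed_mask)))
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and not in_region[ny, nx]:
                if abs(norm[ny, nx] - norm[y, x]) <= tol:     # similar to its neighbor
                    in_region[ny, nx] = True
                    queue.append((ny, nx))
    return in_region
```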
The region found with the above region-finding algorithms will often include extra, undesirable areas. We therefore add a measure of the degree of edge. One can form an edge image as already described and histogram the intensities in the edge image. That histogram generally has a peak near zero and a peak at much larger intensities. As one example, one can find a threshold in the middle and not allow the region-growing algorithm to cross edges.
For each region, we expect the region to be convex, and so we fill in holes using usual methods. We can additionally look for straight edges as most writing surfaces are likely rectangular.
For each region, depending on the color, we may optionally also look for glare: large white patches within the region or adjacent to the region.
For multiple regions, one can choose the region closest to the center of the image, or largest. Or, one can run a writing detector and choose the region with text in it.
Once the region has been chosen, the average color and intensity of the region can be used to identify a likely writing surface, such as whiteboard or blackboard; this is step 1150. That information, or the average color and intensity, can be used to choose optimal enhancement algorithms.
Another initial processing step is to initialize gain detection algorithms, step 1160 in
One can parameterize the input/output curve as a spline, or with the parameters already mentioned: A, B, C and γ. Or, one can simplify and parameterize the curve as a contrast/brightness, i.e. a gain and an offset. Investigations show this simplification yields acceptable results for small gain changes. Independent of the parameterization, the measured gain changes can be used as part of the enhancement. Gain compensation can be done such as in step 1260. If only a gain is being measured, one can just adjust each new frame by the inverse of that gain to keep the intensity of the writing surface uniform. Gain correction is often a part of 1040, memory-ful processing. If the camera moves, the gain correction algorithm is often reset.
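An illustrative gain-only sketch, assuming a reference frame and a mask of pixels already classified as writing surface; a full fit of A, B, C and γ (or a spline) follows the same pattern with more parameters.

```python
# Illustrative sketch: track a simple gain change against a reference frame using
# only writing-surface pixels, then divide it back out of each new frame.
import numpy as np

def estimate_gain(ref_frame, new_frame, surface_mask):
    ref = ref_frame[surface_mask].astype(np.float64)
    new = new_frame[surface_mask].astype(np.float64)
    return float(np.median(new / np.maximum(ref, 1.0)))   # median is robust to stray pixels

def compensate_gain(frame, gain):
    return np.clip(frame.astype(np.float32) / max(gain, 1e-3), 0, 255).astype(np.uint8)
```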
Note that other cameras have gains that can be controlled. In this case, one is looking for changes in brightness in the room, e.g. the lights turning on, detecting the change in intensity, and then feeding back the measurement to change the gain of the camera 1270 to compensate, rather than using the data as part of the correction.
An example of 1330 is a high spatial frequency enhancement algorithm. A simple example of such a filter is to low-pass filter the image, using a Gaussian filter for example, and then subtract the result from the original pixel values; the result is a high-pass filter. One can take that result, multiply it by a parameter, and add it back into the original video frame. The result is a video frame whose high spatial frequencies have been enhanced.
In step 1340, we can optionally measure the effect of the enhancement algorithms. That may be an algorithm to detect stray lines that are not part of the writing, their intensity, and the rate at which they are changing in time. These measurements can feed information back into the enhancement algorithms 1330 to keep the stray-line intensity below a threshold. This is the feedback loop depicted in
The key is that the noise levels in each camera are different, and the post-processing in each camera is different. One needs to adapt the filters to the particular camera/post-processing system to maintain visual quality. Additionally, estimates of the camera's parameters, such as noise, affect the parameters of the segmentation algorithm, such as the histogram widths, and of other image processing algorithms.
There are numerous methods for enhancing writing and/or edges. One family is to find a measure of the local degree of edge, multiply it by a factor, and add the result back into the original image. Another mechanism is to try to whiten the image and expand the difference in intensity between the darker regions and the lighter regions.
There are also numerous measures of edges, including digital filters, morphological operators, Canny algorithms, etc. The results can similarly be multiplied by a factor and added to or subtracted from the original image, or alternatively multiplied with the original image, to enhance regions at edges.
Note that how one applies the algorithms is very dependent on the surface. Many of the methods above cause blooming artifacts; that is, the area on one side of an edge gets darker and the area on the other side gets brighter. For black writing on a white background, one generally wants to limit the blooming on the white background while allowing the writing to be made as dark as possible. Thus, if one makes a filter whose result adds to the current image, one simply caps how large the addition can be to limit the blooming. On a dark background, such as a blackboard, one wants to do the opposite and limit the blooming on the dark background.
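A minimal sketch of a bloom-capped enhancement for dark writing on a light surface; the gain, cap, and sigma are placeholders, and the cap would be applied to the dark side instead for blackboards.

```python
# Illustrative sketch: high-frequency enhancement with blooming capped. The negative
# (darkening) part of the filter result is applied in full, while the positive
# (brightening) part is clipped to a small cap.
import cv2
import numpy as np

def enhance_capped(gray, gain=1.5, bright_cap=5.0, sigma=2.0):
    g = gray.astype(np.float32)
    detail = g - cv2.GaussianBlur(g, (0, 0), sigma)     # high-pass result
    boost = gain * detail
    boost = np.minimum(boost, bright_cap)               # limit blooming on the white side
    return np.clip(g + boost, 0, 255).astype(np.uint8)
```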
Another useful algorithm is whitening the image. In the region containing the writing surface, or perhaps the entire image, one finds a transformation to make the image look uniform in intensity and color. For a white sheet of paper or a whiteboard, that color is often white, hence the term whitening. A simple way to do this is to estimate the color of a pixel, normalize each color, and multiply by a constant value. The problem with this simple approach is that it erases the writing on the writing surface along with everything else. A different method is to classify the pixels into writing, whiteboard, background, human, etc. One can then estimate the correction behind the writing by erasing the writing, replacing it with its neighboring pixel values, and then low-pass filtering the result to get a smooth correction image. Because of the already discussed blooming artifacts that may be part of the camera processing system, one often needs to erase a region around the writing by the expected width of the bloom.
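A hedged sketch of this whitening correction, assuming OpenCV and a writing mask; inpainting stands in for replacing the writing by its neighbor pixel values, and the bloom margin and blur sigma are placeholders.

```python
# Illustrative sketch: estimate the surface color behind the writing by dilating the
# writing mask (covering the expected bloom), filling it from its surroundings, and
# low-pass filtering into a smooth correction image that is then divided out.
import cv2
import numpy as np

def whitening_correction(frame, writing_mask, bloom=5, sigma=25.0):
    kernel = np.ones((2 * bloom + 1, 2 * bloom + 1), np.uint8)
    erase_mask = cv2.dilate(writing_mask, kernel)                     # writing + bloom margin
    surface = frame.copy()
    surface[erase_mask > 0] = cv2.inpaint(frame, erase_mask, 3,
                                          cv2.INPAINT_TELEA)[erase_mask > 0]
    return cv2.GaussianBlur(surface.astype(np.float32), (0, 0), sigma)

def apply_whitening(frame, correction, target=250.0):
    out = frame.astype(np.float32) / np.maximum(correction, 1.0) * target
    return np.clip(out, 0, 255).astype(np.uint8)
```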
There are two key challenges when implementing the algorithms. The first is that it is computationally challenging to do segmentation in real time, for example at 30 frames/second. That limitation is acceptable, as the correction changes very, very slowly. Thus, one can compute a whitening correction in step 1020 and then apply it in step 1040, a memory-ful process; it is typically applied on the GPU or CPU. Periodically, one can recompute the correction in step 1024, and one can check for motion of the camera and re-compute. Similarly, one can do gain correction for the camera in step 1024, which feeds information back into step 1040.
Image processing algorithms can use a lot of computational resources so that computational efficiency is of great importance.
One of the strong advantages of having imagery in time is that there are only a small number of regions where writing is likely to appear. One can divide an image into regions. The remaining regions need not be examined. One can effectively assign a probability for a region to change between frames, and only consider regions of the highest likelihood. Regions on the edge of the video, and regions near a human, e.g. large obstruction, have the potential to have writing added between frames. The regions neighboring those regions have a smaller probability. A simple algorithm is to start by examining the regions in a new frame where pixels are most likely to change. If they did change, then examine the regions around them. Within each region, the data can be considered hierarchically in a tree-like algorithm in order to be computationally efficient. Examining the region to see if any pixels have changed state can be done in many ways. Examining the pixels on the edge of a region is one method that works well; if those pixels have changed state, then check if the pixels in the interior have changed. One can also down-sample the image to check pixels, which also yields a hierarchical algorithm, as one up-samples for more detail and continue to check pixels.
Clearly, when encoding video to transmit, one can effectively determine that much of the image has not changed from one frame to the next, so that much of the image need not be re-processed for the encoding.
Sometimes, computational resources are limited, especially at the source computer 205. Luckily, there are other places the computation can be done, computers 213 and 211. Thus, it is possible to move some of the processing to other computers. For example, the results of step 1024, such as detection of the gain of the camera or checking if the camera has moved, can often be moved to other computers, and the results of the computation shared. Note that this process effectively trades off computation for bandwidth, as results need to be transmitted. Similarly, gain can be computed in many possible places.
In fact, gain need not be calculated at every frame. If the lack of resources on a computer is temporary, steps 1024 and 1050 can often be dropped for several frames until the computer has more resources available.
Sometimes, some of the processing can be reduced. Digital filters and morphological operators can be run with shorter filter lengths and smaller region sizes so that less computation is required. Similarly, some of the global estimates can be made to run faster by analyzing fewer points. The steps that determine whether the camera moved and whether the gain of the camera changed can also be estimated using fewer points. In these cases, quality is traded off for computation.
For bandwidth issues, a key is that encoders are designed to most accurately transmit the video. For writing surfaces, that need not be the case. Video encoders typically work by presenting key frames, and then differences to subsequent frames until a new key-frame is transmitted. The objectives are therefore to make the key frames highly compressible, and to make the differences highly compressible. Most encoders use a hierarchical type scheme where the image is broken down into regions, and within each region, displaying more detailed changes requires larger representations. Thus, a good first step is a whitening filter. By applying the whitening filter to make the image uniform intensity and color, large portions of the regions are uniform intensity.
Noise reduction algorithms become very useful. Using frames in time to reduce noise, such as averaging, is valuable, so that the changes between frames are small or zero. The signal-to-noise ratio is often a good metric for how distracting noise can be; in this case, signal refers to the average intensity of a pixel and noise to the average standard deviation of the noise. The ratio can also be measured with the signal measured as contrast: the difference between the intensity of the background and the intensity inside the writing. Other times, other metrics can be used, such as the average size, intensity and count of spurious edges due to the enhancement. (Spurious edges can be found by running region-finding algorithms on enhanced edges, and then removing the edges that are classified as writing according to a writing detector.) The end result of reducing noise, however it is measured, is that the required bandwidth generally decreases.
Backgrounding algorithms become very useful. The pixels can be classified into writing, writing surface, background or obstructions. A noisy pixel that is in the same class as it was previously, can be replaced by the last known good value. The pixels therefore don't change values in time.
Regions outside the written surface need not be transmitted. They might either be cropped away, or replaced by a uniform background color such as black. Thus, a segmentation algorithm (generally a memory-ful algorithm) together with a camera motion detection algorithm to detect motion, allows the system to reduce required bandwidth. Or, those regions may be transmitted at a lower bandwidth, either refreshed at a lower rate, or transmitted at lower fidelity.
Obstructions such as humans are not important to the video of the surface. Humans can be segmented out, and either transmitted at a lower frame rate, or not at all. They can be replaced by the last known imagery of the surface in the region.
The writing itself can be enhanced to be easily compressed and transmitted. Making sharp edges is one way to do that for a standard encoder.
Exact colors in the writing surface may not be important. For example for a whiteboard, a version of the whiteboard that is white, and uses only a small number of saturated colors (e.g. red, green, blue) may be optimal. Moving obstructions such as humans may not need to be transmitted at all. One algorithm is to run a writing detector, and to change the colors of the found writing to the nearest primary. Thus, the image is effectively using a much smaller color space and need not encode many colors. Typically the color space is 3 colors, 256 values each, for a total of 16.7 million colors. Instead, one can switch to using gray values, and the 3 primaries only, which is 4*256=1024 values.
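An illustrative sketch of reducing the effective color space, assuming a writing mask from one of the detectors above; the palette is a placeholder set of primaries.

```python
# Illustrative sketch: snap detected writing to the nearest of a few saturated
# primaries and force the rest of the surface to white, so the encoder sees a
# tiny effective color space.
import numpy as np

PRIMARIES = np.array([[0, 0, 0],        # black
                      [255, 0, 0],      # red
                      [0, 160, 0],      # green
                      [0, 0, 255]],     # blue
                     dtype=np.float32)  # placeholder palette, in RGB

def quantize_writing(frame_rgb, writing_mask):
    out = np.full_like(frame_rgb, 255)                         # surface -> pure white
    pixels = frame_rgb[writing_mask > 0].astype(np.float32)    # (N, 3) writing pixels
    if len(pixels):
        dists = np.linalg.norm(pixels[:, None, :] - PRIMARIES[None, :, :], axis=2)
        out[writing_mask > 0] = PRIMARIES[np.argmin(dists, axis=1)].astype(frame_rgb.dtype)
    return out
```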
If bandwidth becomes very limited, not transmitting frames can be valuable.
Sometimes, computational resources are limited, especially at the source computer 205. Luckily, there are other places the computation can be done, computers 213 and 211. Thus, it is possible to move some of the processing to other computers. For example, the results of step 1024, such as detection of the gain of the camera, can often be moved to other computers, and the results of the computation shared. In fact, gain need not be calculated at every frame. Steps 1024 can often be dropped for several frames until a computer has more resources available.
Additionally, some of the processing can be reduced. Digital filters and morphological operators can be run with shorter filter lengths so that less computation is required.
A very useful step, when choosing regions to examine, is to raise the threshold on the probability of writing appearing, and thus examine fewer regions.
One can also trade bandwidth for computation. Rather than correcting the video before transmission, one can do the corrections after transmission.
For example, the readability metric could be the expected signal to noise ratio of the enhanced video. In this case signal means the difference between the intensity of the background of the surface, and the intensity of the writing. Noise refers to the magnitude of the noise.
The noise magnitude can be estimated per-pixel using multiple frames as already described, as well as estimated in the spatial frequency domain using standard techniques. A high-pass filter can be applied to the initial video, and the results of the high pass filter can be multiplied by a parameter and added to the initial video. The parameters of the high pass filter can be chosen to enhance the frequencies with maximum signal to noise ratio. Similarly, the multiplication parameter can be chosen to maximize the signal, without over-saturating. Once saturation is achieved, increasing the multiplication more would not increase the signal, only the noise.
The parameters of the high-pass filter can also be adjusted for the expected nature and properties of the writing and/or writing surface. High-pass filters can produce ringing effects near edges. For example, the region very close to an edge on a whiteboard might appear extra bright on one side and extra dark on the other side. For a whiteboard, the extra darkness is desirable; it increases the contrast between the foreground and the background. The extra brightness may not appear uniformly across all colors and can thus cause a color shift around the writing. One way to adjust the high-pass filter is to introduce a parameter: if the result of the filter is a darkening, it gets applied; if it is a brightening, the result is ignored. Clearly, for blackboards and glass, different parameters are desired. As part of determining the parameters of the enhancement algorithm, one can determine the type of surface as described in step 1150 in
Additionally, a low pass filter can reduce noise in time, which will also serve to increase the signal to noise ratio. A low pass filter that has the support of too many frames will introduce artifacts due to objects, typically humans, moving in front of the surface.
Other enhancements can similarly be made as described above, particularly as relating to
The following are particular examples of implementations in which the teachings of the inventive concepts herein can be employed.
Camera motion detection is an algorithm that can be particularly important to annotation. When the camera moves, one wants the annotations to move as well. Typically the writing surface is treated as a flat surface, and the goal is to measure a homography from the camera to the board. A simple way to find the transform is through standard simultaneous localization and mapping (SLAM) algorithms. One periodically records a frame. One typically finds a set of image features with strong gradients, such as edges and corners. After motion is detected, a new set of features is found. The new features are matched to the old features using robust methods. The homography from the old to the new camera position can be found by matching those features. The annotations can then be mapped from their old positions to their new positions using the homography. In practice, one first runs an algorithm to determine whether the camera has moved, and if it has, then runs the algorithm to find how far it has moved. Note that when computation is short, the number of pixels examined can be decreased, essentially trading off quality to save computation. As more points are gathered in subsequent frames, a low-pass filter in time can be added to achieve a low-noise result that simply needs a few frames to converge.
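A minimal sketch of the feature-matching and homography step, assuming OpenCV; ORB features and RANSAC stand in for the "robust methods" mentioned above, and all parameters are placeholders.

```python
# Illustrative sketch: recover the homography after a camera move by matching ORB
# features between the old and new frames, then remap annotation anchor points.
import cv2
import numpy as np

def remap_annotations(old_gray, new_gray, annotation_points):
    orb = cv2.ORB_create(1000)
    kp1, des1 = orb.detectAndCompute(old_gray, None)
    kp2, des2 = orb.detectAndCompute(new_gray, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)[:200]
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)       # robust to mismatches
    pts = np.float32(annotation_points).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(pts, H).reshape(-1, 2)      # annotations in the new frame
```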
One of the advantages of the invention is the ability to apply different enhancement algorithms to different regions or different classes of the image. For example, if the image pixels are classified as writing, writing surface, obstruction and background, one can apply a different enhancement to each. An edge-sharpening filter works well on the text. An intensity whitening works well on the remainder of the writing surface. Obstructions such as people do not need to be enhanced, which saves computation and avoids changing the appearance of the obstruction. The background need not be enhanced either. Overall, different filters can be used on different classes of pixels as desired.
Compression artifacts have the potential to introduce sizable problems. Introducing corrections to pre-compensate for them can be valuable. The simplest way to do that is to make an image frame easily representable using a small number of compression basis functions. Whitening the image is the simplest way to pre-compensate for compression artifacts.
The foregoing has been a detailed description of illustrative embodiments of the invention. Various modifications and additions can be made without departing from the spirit and scope of this invention. Each of the various embodiments described above may be combined with other described embodiments in order to provide multiple features. Furthermore, while the foregoing describes a number of separate embodiments of the apparatus and method of the present invention, what has been described herein is merely illustrative of the application of the principles of the present invention. For example, additional correction steps can be employed beyond those described herein. These steps may be based particularly upon the type of display device and display surface being employed. In addition, it is expressly contemplated that any of the procedures and functions described herein can be implemented in hardware, in software comprising a computer-readable medium consisting of program instructions, or in a combination of hardware and software. Accordingly, this description is meant to be taken only by way of example, and not to otherwise limit the scope of this invention.