Whiteboards, also known as dry-erase boards, differ from blackboards in that whiteboards have a smoother writing surface that allows markings to be made and erased rapidly. Specifically, whiteboards usually include a glossy white surface for making nonpermanent markings, and are used in many offices, meeting rooms, school classrooms, and other work environments. Whiteboards may also be used to facilitate collaboration among multiple remote participants (referred to as collaborating users) who are sharing information. In such collaborations, one or more cameras are pointed at the whiteboard to share a user's written or drawn content with the other participants.
In general, in one aspect, the invention relates to a method to extract static user content on a marker board. The method includes generating a sequence of samples from a video stream comprising a series of images of the marker board, generating at least one center of mass (COM) of estimated foreground content of each sample in the sequence of samples, detecting, based on a predetermined criterion, a stabilized change of the at least one COM in the sequence of samples, wherein the stabilized change of the at least one COM identifies, in the sequence of samples, a stable sample with new content, generating, in response to the stabilized change of the at least one COM and from the stable sample with new content, a mask of full foreground content, and extracting, based at least on the mask of full foreground content, a portion of the static user content from the video stream.
In general, in one aspect, the invention relates to a system for extracting static user content on a marker board. The system includes a memory and a computer processor connected to the memory that generates a sequence of samples from a video stream comprising a series of images of the marker board, generates at least one center of mass (COM) of estimated foreground content of each sample in the sequence of samples, detects, based on a predetermined criterion, a stabilized change of the at least one COM in the sequence of samples, wherein the stabilized change of the at least one COM identifies, in the sequence of samples, a stable sample with new content, generates, in response to the stabilized change of the at least one COM and from the stable sample with new content, a mask of full foreground content, and extracts, based at least on the mask of full foreground content, a portion of the static user content from the video stream.
In general, in one aspect, the invention relates to a non-transitory computer readable medium (CRM) storing computer readable program code for extracting static user content on a marker board. The computer readable program code, when executed by a computer, includes functionality for generating a sequence of samples from a video stream comprising a series of images of the marker board, generating at least one center of mass (COM) of estimated foreground content of each sample in the sequence of samples, detecting, based on a predetermined criterion, a stabilized change of the at least one COM in the sequence of samples, wherein the stabilized change of the at least one COM identifies, in the sequence of samples, a stable sample with new content, generating, in response to the stabilized change of the at least one COM and from the stable sample with new content, a mask of full foreground content, and extracting, based at least on the mask of full foreground content, a portion of the static user content from the video stream.
Other aspects of the invention will be apparent from the following description and the appended claims.
Specific embodiments of the invention will now be described in detail with reference to the accompanying figures. Like elements in the various figures are denoted by like reference numerals for consistency.
In the following detailed description of embodiments of the invention, numerous specific details are set forth in order to provide a more thorough understanding of the invention. However, it will be apparent to one of ordinary skill in the art that the invention may be practiced without these specific details. In other instances, well-known features have not been described in detail to avoid unnecessarily complicating the description.
In general, embodiments of the invention provide a method, non-transitory computer readable medium, and system for extracting written content and/or user placed object(s) from a marker board using a live video stream or pre-recorded video where one or more users are interacting with the marker board. In a collaboration session between collaborating users, the extracted user content is sent to the collaborating users in real time while one or more users are writing/drawing/placing object(s) on the marker board. One or more embodiments of the invention minimize the number of extraction updates sent to collaborating users by limiting the extraction updates to occur only when content changes in a specific region of the marker board.
In one or more embodiments of the invention, the buffer (101) is configured to store a marker board image (102). The marker board image (102) is an image of a writing surface of a marker board captured using one or more camera devices (e.g., a video camera, a webcam, etc.). In particular, the marker board image (102) may be one image in a series of images in a video stream (102a) of the captured marker board, and may be of any size and in any image format (e.g., BMP, JPEG, TIFF, PNG, etc.).
The marker board is a whiteboard, blackboard, or other similar type of writing material. The writing surface is the surface of the marker board where a user writes, draws, or otherwise adds marks and/or notations. The user may also place physical objects on the writing surface. Throughout this disclosure, the terms “marker board” and “the writing surface of the marker board” may be used interchangeably depending on context.
The marker board image (102) may include content that is written and/or drawn on the writing surface by one or more users. Once written and/or drawn on the writing surface, the content stays unchanged until the content is removed (e.g., the content is erased by a user). In one or more embodiments, the written and/or drawn content is referred to as user written content. Additionally, the marker board image (102) may include content corresponding to object(s) placed on the marker board, a user's motion in front of the marker board, and/or sensor noise generated by the camera device. The user written content, the content resulting from user placed object(s), the user's motion, and/or the sensor noise collectively form the foreground content of the marker board image (102). The user written content and the content resulting from the user placed object(s) are collectively referred to as the static user content of the marker board image (102).
In one or more embodiments, the buffer (101) is further configured to store the intermediate and final results of the system (100) that are directly or indirectly derived from the marker board image (102) and the video stream (102a). The intermediate and final results include at least an averaged sample (103), an estimated foreground content (104), a center of mass (COM) (105), a full foreground content mask (106), a changing status (107), and the static user content (108). Each of these intermediate and final results is described below in detail.
In one or more embodiments, the averaged sample (103) is an average of a contiguous portion of the video stream (102a), where the contiguous portion corresponds to a short time period (e.g., 0.25 seconds) during the collaboration session. The averaged sample (103) is one averaged sample within a sequence of averaged samples. Each pixel of the averaged sample (103) is assigned an averaged pixel value of corresponding pixels in all images within the contiguous portion of the video stream (102a). For example, the marker board image (102) may be one of the images in the contiguous portion of the video stream (102a).
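By way of illustration only, the averaging may be sketched as follows using Python and NumPy; the function name make_averaged_sample is illustrative and not part of any embodiment. For example, for a camera capturing 8 frames per second, a 0.25-second contiguous portion corresponds to two frames.

```python
import numpy as np

def make_averaged_sample(frames):
    """Average a contiguous run of video frames (H x W x 3 uint8 arrays)
    into a single averaged sample of the same shape."""
    # Accumulate in floating point to avoid uint8 overflow, then convert back.
    stack = np.stack(frames).astype(np.float32)
    return stack.mean(axis=0).round().astype(np.uint8)
```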
Furthermore, the averaged sample (103) includes multiple divided regions of the marker board. Each region is referred to as a tile and may be represented as a rectangle, square, or any other planar shape. In this disclosure, the term “tile” is also used to refer to an image of the tile.
In one or more embodiments, the estimated foreground content (104) is a binary mask generated using the averaged sample (103). Each pixel of the estimated foreground content (104) is assigned a binary value that estimates the pixel as either a foreground pixel or a background pixel of the averaged sample (103). In one or more embodiments, the estimated foreground content (104) is generated by applying an adaptive thresholding algorithm to the averaged sample (103). Due to the adaptive thresholding algorithm, the edges or outlines of foreground objects are emphasized in the estimated foreground content (104), while the interior regions of the foreground objects are de-emphasized. As will be described below, the estimated foreground content (104) is used to detect changes in the foreground content of the marker board image (102) and to detect when the changes have stabilized. Therefore, de-emphasizing the interior regions of the foreground objects in the estimated foreground content (104) advantageously does not adversely affect the overall processing.
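By way of illustration only, the estimated foreground content may be sketched as follows using Python and OpenCV, running the adaptive threshold on each color channel and combining the channels with a bitwise-OR as described later in connection with TABLE 2; the neighborhood size and offset are assumed tuning values, not values from any embodiment.

```python
import cv2
import numpy as np

def estimate_foreground(sample_bgr):
    """Adaptive-threshold each color channel of an averaged sample and
    OR the results into one binary mask (foreground pixels are ON/white)."""
    estimate = np.zeros(sample_bgr.shape[:2], dtype=np.uint8)
    for channel in cv2.split(sample_bgr):
        binarized = cv2.adaptiveThreshold(
            channel, 255,
            cv2.ADAPTIVE_THRESH_MEAN_C,
            cv2.THRESH_BINARY_INV,  # dark strokes on a bright board become ON
            11, 5)                  # assumed neighborhood size and offset
        estimate = cv2.bitwise_or(estimate, binarized)
    return estimate
```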
In one or more embodiments, the COM (105) is a pixel location in a tile whose coordinates are the average of the coordinates of all estimated foreground pixels in the estimated foreground content (104). As the user writes/draws or places object(s) in a particular tile, the COM (105) changes due to the user's hand motion and/or due to the added static user content (108).
In one or more embodiments, the full foreground content mask (106) is a binary mask where each pixel is assigned a binary value that designates the pixel as either a foreground pixel or a background pixel of the averaged sample (103).
In one or more embodiments, the changing status (107) is a status of an averaged sample (103) indicating whether a significant change in the COM (105) has stabilized over a stability window, which is a predetermined number (e.g., 2, 10, etc.) of subsequent averaged samples (103). A significant change in the COM (105) that has stabilized over the stability window is referred to as a stabilized change. In one or more embodiments, the changing status (107) includes STABLE, CHANGING, STABILIZING, and STABLE_WITH_NEW_CONTENT. STABLE indicates that there is no significant change in the COM (105) from one averaged sample (103) to the subsequent averaged sample (103). CHANGING indicates that there is a significant change in the COM (105) from one averaged sample (103) to the subsequent averaged sample (103). STABILIZING occurs after a CHANGING state, so long as the COM (105) no longer significantly changes over the stability window. If, at the end of the stability window, the COM has significantly moved from its location prior to entering the CHANGING state, then the status becomes STABLE_WITH_NEW_CONTENT; otherwise, the status returns to STABLE. The STABLE_WITH_NEW_CONTENT state indicates that there is new user content that should be shared with remote participants.
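By way of illustration only, the four states and their transitions may be sketched as follows in Python; the names ChangingStatus and advance are illustrative and do not reproduce the monitor_tile logic of TABLE 3.

```python
from enum import Enum, auto

class ChangingStatus(Enum):
    STABLE = auto()
    CHANGING = auto()
    STABILIZING = auto()
    STABLE_WITH_NEW_CONTENT = auto()

def advance(status, stable_count, stability_window,
            changed_from_prev, changed_from_last_stable):
    """Return (new_status, new_stable_count) for one tile given one new sample.

    changed_from_prev: significant COM change versus the previous sample.
    changed_from_last_stable: significant COM change versus the COM recorded
    before the tile entered the CHANGING state."""
    # Assumption: STABLE_WITH_NEW_CONTENT behaves like STABLE once shared.
    if status in (ChangingStatus.STABLE, ChangingStatus.STABLE_WITH_NEW_CONTENT):
        if changed_from_prev:
            return ChangingStatus.CHANGING, 0
        return ChangingStatus.STABLE, 0
    # CHANGING or STABILIZING: wait for the COM to settle.
    if changed_from_prev:
        return ChangingStatus.CHANGING, 0
    stable_count += 1
    if stable_count < stability_window:
        return ChangingStatus.STABILIZING, stable_count
    # Stability window reached: decide whether new content has arrived.
    if changed_from_last_stable:
        return ChangingStatus.STABLE_WITH_NEW_CONTENT, 0
    return ChangingStatus.STABLE, 0
```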
In one or more embodiments of the invention, the analysis engine (109) is configured to generate a sequence of averaged samples (including the averaged sample (103)) and corresponding estimated foreground content (including the estimated foreground content (104)) from the video stream (102a). The analysis engine (109) is further configured to generate the COM (105) for each tile of the samples.
In one or more embodiments of the invention, the extraction engine (110) is configured to detect a stabilized change of the COM (105) in the sequence of samples, to generate the full foreground content mask (106), and to extract the static user content (108) in a corresponding tile of the video stream (102a) where the stabilized change is detected. As the user writes/draws and/or places object(s) across the entire writing surface of the marker board, the static user content (108) in the extracted tile of the video stream (102a) represents only a portion of the entire static user content (108) across the marker board.
In one or more embodiments of the invention, the collaboration engine (111) is configured to generate the static user content (108) by aggregating all portions of the static user content (108) in all of the tiles of the video stream. The collaboration engine (111) is further configured to send an entirety or a portion of the static user content (108) to one or more collaborating users. The act of sending only a portion or the entirety of the static user content (108) to collaborating user(s) is referred to as an extraction update of the collaboration session.
In one or more embodiments, the analysis engine (109), the extraction engine (110), and the collaboration engine (111) perform the functions described above using the method described in reference to
Although the system (100) is shown as having four components (101, 109, 110, 111), in one or more embodiments of the invention, the system (100) may have more or fewer components. Furthermore, the functions of each component described above may be split across components. Further still, each component (101, 109, 110, 111) may be utilized multiple times to carry out an iterative operation.
Referring to
In one or more embodiments, the series of images is divided into consecutive portions where each portion is contiguous and includes consecutive images in the video stream. In one example, the consecutive portions may all have the same number of consecutive images. In another example, the number of consecutive images may vary from one portion to another. Regardless of whether the number of consecutive images is constant or variable, the consecutive images in each portion are averaged to generate a corresponding sample. In one or more embodiments, each sample is scaled down in pixel resolution to improve processing performance in the subsequent steps. Each sample is converted into a binarized sample using an adaptive thresholding algorithm where the two binary pixel values are used to identify an estimated foreground content of the sample. For example, each ON pixel (i.e., a pixel having a pixel value of “1”) in the estimated foreground content represents a portion of the foreground content of the sample. Using the adaptive thresholding algorithm, the edges or outlines of foreground objects are emphasized in the estimated foreground content, while the interior regions of the foreground objects are de-emphasized. As will be described below, the estimated foreground content is used to detect changes in the foreground content and to detect when the changes have stabilized. Therefore, de-emphasizing the interior regions of the foreground objects advantageously does not adversely affect the overall processing.
In one or more embodiments, the image frame in the video stream is divided into a number of tiles. In one example, the image frame may be divided equally into rectangular shaped (or other planar shaped) tiles. Each tile in the image frame corresponds to a rectangular section of the marker board, and each rectangular section of the marker board is referred to as a tile of the marker board. In another example, the tiles may have different form factors within the image frame and across the marker board, provided that a dimension of each tile is at least twice the width of the writing/drawing strokes in the image.
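By way of illustration only, an equal division of an image frame into rectangular tiles may be sketched as follows; split_into_tiles is an illustrative name. For the 6-row by 8-column grid used in the examples below, split_into_tiles(sample, 6, 8) yields 48 tiles.

```python
def split_into_tiles(img, rows, cols):
    """Split an image into a rows x cols grid of roughly equal tiles.

    Returns a dict mapping (row, col) to a view of the corresponding tile;
    edge tiles absorb any remainder pixels."""
    h, w = img.shape[:2]
    tiles = {}
    for r in range(rows):
        for c in range(cols):
            y0, y1 = r * h // rows, (r + 1) * h // rows
            x0, x1 = c * w // cols, (c + 1) * w // cols
            tiles[(r, c)] = img[y0:y1, x0:x1]
    return tiles
```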
In Step 201, as discussed above in reference to
In Step 202, as discussed above in reference to
In Step 203, as discussed above in reference to
In Step 204, as discussed above in reference to
In Step 205, as discussed above in reference to
An example of the method flowchart is described in TABLEs 1-5 and
The following main process is then repeated for each frame of the stream as detailed by way of example in TABLE 2.
The function monitor_tile(sample_num, foreground) can be expanded, for example, as detailed in TABLE 3.
The function significant_change(center_of_mass1, center_of_mass2) can be expanded, for example, as detailed in TABLE 4.
The function IdentifyForeground(img, msk_adaptive, msk_canny) can be expanded, for example, as detailed in TABLE 5.
In the example of
The example method of one or more embodiments operates on a series of images from a video stream. The video stream may be a pre-recorded collaboration session or a live stream of a current collaboration session. Before the first image in the video stream is processed, initialization is performed as detailed in TABLE 1 above. The process described below is then repeated for each image of the video stream.
Each frame in the video stream is broken up into tiles and analyzed for new static user content using a quick estimate of the foreground content. In practice, the estimate generally identifies just the outline of any objects placed on the whiteboard. Once new static user content has been identified in any tile based on the estimate, a more rigorous process is initiated to identify the full foreground content, including the interiors of any objects. Accordingly, an update of the static user content present in any tile is shared with remote participants in the collaboration session based on the full foreground identification and an average of previous stable samples. Identifying the new static user content using the quick estimate of foreground content advantageously reduces the computing resources and image processing time. Generating the updated static user content using the more rigorous process further advantageously allows both thin text strokes and solid-filled patterns (e.g., a physical object, user drawn solid figures) to be shared as the static user content sent to the remote participants in the collaboration session.
Further, automatically transmitting new static user content when it is detected advantageously eliminates the user's manual initiation of the capture and sending of the content to remote participants in the collaboration session. Such transmission of a tile's static user content based on determining when new content is available and stable also advantageously minimizes (i.e., reduces) the number of necessary content data transmissions to remote participants in the collaboration session. Furthermore, during the content data transmission, the tiles without new static user content are excluded from the transmission. Excluding the tiles with no new static user content advantageously minimizes (i.e., reduces) the amount of content data in each of the content data transmissions to remote participants in the collaboration session. Furthermore, automatically transmitting the new static user content also advantageously allows content to be seen by remote participants sooner than if the user had manually initiated the capture.
The process will now be discussed in more detail. Steps 1-4 in the main algorithm in TABLE 2 above are initial preparation tasks. These initial steps are used to prepare data samples for subsequent analysis. Each set of n consecutive frames (e.g., n=2 in an example implementation) in the video stream is averaged to generate an averaged sample that minimizes the effects of motion and maximizes the effects of static user content. Any user content on the whiteboard is identified as pixels at the same locations in each image of the set and therefore shows up strongly (i.e., exhibits higher numerical values) in the averaged sample. In contrast, any user motion is likely identified as disparate pixels in different images and consequently does not show up strongly (i.e., exhibits lower numerical values) in the averaged sample. For example, consider the two averaged samples (301a) and (301b) shown in
Throughout the discussion below, the averaged sample described above is referred to as a sample. In addition to the aforementioned averaging, a log of recently generated data, recent_records, is checked to see whether it exceeds the size of the stability window, which is the number of samples that must be deemed stable for an update to occur. If the size of the stability window is exceeded, then the oldest entry is removed before a new one is started. The data recorded in this log is detailed in TABLE 1 above, but, in general, it includes all of the data required over the stability window to generate a full foreground and provide an update of stable content.
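By way of illustration only, the recent_records log may be sketched as follows in Python; the record fields shown are illustrative and do not reproduce the fields detailed in TABLE 1.

```python
from collections import deque

STABILITY_WINDOW = 2  # samples that must be stable before an update (illustrative)

# A deque with maxlen drops the oldest entry automatically once the size of
# the stability window is exceeded, mirroring the log maintenance above.
recent_records = deque(maxlen=STABILITY_WINDOW)

def start_record(averaged_sample, foreground_estimate, com_per_tile):
    """Start a new per-sample record (field names are illustrative)."""
    recent_records.append({
        "sample": averaged_sample,          # the averaged sample image
        "foreground": foreground_estimate,  # the binarized foreground estimate
        "com": com_per_tile,                # per-tile centers of mass
    })
```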
All subsequent processing happens on a sample by sample basis. That sample-by-sample processing occurs in the sub-steps of step 5 in the main algorithm in TABLE 2 above. In particular, estimated foreground content is identified in each sample by running an adaptive thresholding function on each color channel of the sample and using a bitwise-OR function to combine all color channels together into a single binary image that corresponds to msk_adaptive listed in TABLE 2 above. As shown in
Continuing with the sample-by-sample processing, the next step is to identify the center of mass (COM) in each binarized sample. The COM is computed as the average location of all estimated foreground (white) pixels in the binarized sample. The COM is assigned an “undefined” status when the total number of estimated foreground (white) pixels in the binarized sample is less than a pre-determined threshold (e.g., 10). The COM is used for motion tracking and stability identification. For the two binarized samples (302a) and (302b), the COM is identified by the icon “x” to generate the marked samples (303a) and (303b). The averaged sample (301a) and the binarized sample (302a) correspond to the marked sample (303a) with the COM (313a), and the averaged sample (301b) and the binarized sample (302b) correspond to the marked sample (303b) with the COM (313b). A slight shift exists between the COM (313a) and the COM (313b) as a result of a noise pattern (312b) being identified as additional foreground from the binarized sample (302a) to the binarized sample (302b).
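By way of illustration only, the COM computation may be sketched as follows; the threshold of 10 pixels follows the example above, and center_of_mass is an illustrative name.

```python
import numpy as np

MIN_FOREGROUND_PIXELS = 10  # below this count, the COM is "undefined"

def center_of_mass(binarized_tile):
    """Average (x, y) location of all ON (white) pixels in a binarized tile,
    or None (the "undefined" status) when too few foreground pixels exist."""
    ys, xs = np.nonzero(binarized_tile)
    if xs.size < MIN_FOREGROUND_PIXELS:
        return None
    return (float(xs.mean()), float(ys.mean()))
```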
In samples 1 and 2 shown in
For sample 0 through sample 6, as detailed in step 5.3 of the main algorithm in TABLE 2 above, the averaged frame image is scaled down. This is done primarily as a performance optimization. Then, the foreground estimate for each scaled down average is computed, as detailed in step 5.4, using adaptive thresholding. In step 5.5 above, the foreground estimate is divided into tiles, and each tile is monitored for changes and stability. In EXAMPLE 1 above, some of the foreground estimate pieces are illustrated (labeled “Fgd Est”) for the 6 tiles immediately encompassing the apple.
In sample 3 depicted in
Similar to EXAMPLE 1 above, each sample in the sequence of EXAMPLE 2 is divided into 6 rows (i.e., row 0 through row 5) and 8 columns (i.e., column 0 through column 7), resulting in 48 tiles. The tile grid dividing each sample is omitted for clarity.
In EXAMPLE 2, no foreground content is detected in sample 0. With sample 1, the user's hand enters the scene and provides enough contrast to register an edge on the foreground estimate. In step 3 of TABLE 3, the COM of all foreground pixels in the estimate is computed as the average location of all foreground content. In step 4, a determination is made based on whether the tile's state is currently STABLE. If so, the tile monitor determines whether the tile is no longer STABLE (branch 4.1). Otherwise, the tile monitor determines whether the tile has now become stable (branch 4.2). At the start of monitoring for sample 1, the tile is STABLE, and the newly computed current COM (36.9, 40.3) is checked to see whether it has significantly changed from the last stable COM (undefined). This is registered as a significant change, and so the state is updated to CHANGING for sample 1.
With sample 2, the new current COM (22.7, 34.3) is computed. Since the state is currently CHANGING, branch 4.2 is processed to determine whether the tile has now stabilized, i.e., whether there is no significant change from the previous COM (36.9, 40.3) to the current COM. In this case, there is a significant change, and therefore branch 4.2.2.2 is executed and the tile remains in the CHANGING state for sample 2.
With sample 4, the algorithm again processes branch 4.2 but now the COM does not significantly change (undefined to undefined) and so branch 4.2.2.1 is executed. Here, the state changes to STABILIZING and the number of stable samples is incremented to 1. In step 4.2.2.1.3, it is determined that the number of stable samples has not reached the stability window (e.g., 2 samples) and so processing for this sample ends.
The same processing happens with sample 5 as with sample 4, but this time the stability window (e.g., 2 samples) has been reached, and so branch 4.2.2.1.3.1 is executed. In this case, it is determined whether the change qualifies for an update, i.e., whether the last stable COM (undefined) has significantly changed to the current COM (undefined). In this case, it has not, and therefore the tile becomes STABLE, but there is no new content to share.
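By way of illustration only, a significant_change function consistent with the walkthrough above may be sketched as follows; the distance threshold is an assumed tuning value and does not reproduce TABLE 4.

```python
import math

COM_DISTANCE_THRESHOLD = 2.0  # pixels; an assumed tuning value

def significant_change(com1, com2):
    """Return True when the COM has moved significantly between two samples.

    A transition between undefined (None) and a defined location counts as
    significant; undefined-to-undefined does not, matching the walkthrough."""
    if com1 is None and com2 is None:
        return False
    if com1 is None or com2 is None:
        return True
    return math.dist(com1, com2) > COM_DISTANCE_THRESHOLD
```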
Processing continues in this manner, as depicted in
The first step in TABLE 5 is to compute the average pixel value of all the border pixels in img, excluding those that are identified as foreground in msk_adaptive. This is done to identify suitable places to launch a flood fill in a subsequent step.
In step 2 of TABLE 5, bkgrnd is initially set to the result of a bitwise-or operation between msk_adaptive and msk_canny. This results in the mask (361) shown in
In step 3 of TABLE 5, a flood fill of bkgrnd is initiated from each border pixel, but only if the pixel is not already marked (i.e., “1” or “ON”) in bkgrnd and if the corresponding pixel in img is greater than the average border pixel value previously computed. This second condition helps ensure that flood filling occurs from the brightest portions, which are likely to be whiteboard background (and not, for example, from the user's hand and/or arm). In this case, a single flood fill is launched from border pixel (94, 0), setting flooded pixels to color value 127 (i.e., gray) and resulting in the flooded image (362). In TABLE 5, bkgrnd is an 8-bit deep mask with pixel values of 0 to 255, used as an interim step to generate a true binary mask in which all pixels are strictly ON or OFF by the end of step 5 of TABLE 5.
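By way of illustration only, steps 1 through 3 of TABLE 5 may be sketched as follows using Python and OpenCV, assuming for simplicity that img is a single-channel (grayscale) image; an actual implementation may compare color pixels.

```python
import cv2
import numpy as np

def flood_fill_background(img_gray, msk_adaptive, msk_canny):
    """Sketch of steps 1-3 of TABLE 5 (single-channel img assumed)."""
    h, w = img_gray.shape

    # Step 1: average intensity of border pixels not marked as foreground
    # in msk_adaptive; these are candidate flood-fill launch points.
    border = np.zeros((h, w), dtype=bool)
    border[0, :] = border[-1, :] = border[:, 0] = border[:, -1] = True
    usable = border & (msk_adaptive == 0)
    avg_border = img_gray[usable].mean() if usable.any() else 255.0

    # Step 2: combine both foreground estimates into the working mask.
    bkgrnd = cv2.bitwise_or(msk_adaptive, msk_canny)

    # Step 3: flood fill with gray (127) from bright, unmarked border pixels.
    ff_mask = np.zeros((h + 2, w + 2), dtype=np.uint8)  # padding floodFill needs
    for y, x in zip(*np.nonzero(usable)):
        if bkgrnd[y, x] == 0 and img_gray[y, x] > avg_border:
            cv2.floodFill(bkgrnd, ff_mask, (int(x), int(y)), 127)
    return bkgrnd  # steps 4 and 5 then binarize this interim 8-bit mask
```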
In step 4 of TABLE 5, the pixels of bkgrnd are set to 0 in all cases where the pixels are not equal to 127. These pixels are known to NOT be background. Then, in step 5 of TABLE 5, the pixels of bkgrnd are set to 255 in all cases where the pixels are equal to 127. These are the flood filled pixels and are assumed to be background. At this point, bkgrnd looks like the mask (363), an example of the content of the variable bkgrnd at the end of step 5 of TABLE 5, which is referred to as the “starting background” depicted in
In bkgrnd, all ON (white) pixels shown in the mask (363) (i.e., the starting background) are considered known background. However, not all background pixels in the sample are necessarily shown as ON (white) in the mask (363). In other words, some OFF (black) pixels in the mask (363) may also be background and correspond to an interior hole of a foreground object in the sample. Hence, every potential foreground object has to be investigated for holes. Any pixels identified in msk_adaptive (i.e., the estimated foreground content) are assumed to be foreground (i.e., not a hole), but, as noted in step 2 above, any pixels in msk_canny require closer investigation to determine whether they are truly foreground. Hence, in step 6 of TABLE 5, holes is set to the result of a bitwise-and operation between 255 - bkgrnd (i.e., the inverted starting background) and 255 - msk_adaptive (i.e., the inverted estimated foreground content), which is shown as the candidate holes mask (364).
All “ON” pixels in the candidate holes mask (364) require a more detailed investigation to determine whether or not they really are background.
In step 7 of TABLE 5, all connected components are identified in the candidate holes mask (364). In this case, 105 individual connected components are identified.
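By way of illustration only, steps 7 and 8 of TABLE 5, the latter of which is described in more detail below, may be sketched as follows. This simplified sketch compares each candidate hole against a global mean of the known background rather than against the closest neighboring background pixels, and the intensity tolerance is an assumed value.

```python
import cv2
import numpy as np

def mend_background(bkgrnd, holes, img_gray, intensity_tol=10):
    """Classify each candidate hole as background (mend it into bkgrnd) or
    foreground, based on its intensity relative to the known background.

    intensity_tol is an assumed tolerance, not a value from TABLE 5."""
    # Step 7: identify all connected components in the candidate holes mask.
    count, labels = cv2.connectedComponents(holes)
    known_bg = bkgrnd == 255
    bg_mean = img_gray[known_bg].mean() if known_bg.any() else 255.0
    # Step 8 (simplified): traverse each hole and mend the background.
    for label in range(1, count):  # label 0 is the holes-mask background
        component = labels == label
        if abs(img_gray[component].mean() - bg_mean) <= intensity_tol:
            bkgrnd[component] = 255  # the hole is background; mend it
    return bkgrnd
```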
In step 8 of TABLE 5, iterative hole traversal and background mending is performed on the starting background. Each connected component in the candidate holes mask (364) is processed to determine whether it qualifies as foreground or background by comparing the corresponding pixel intensities of the connected component in img (351) to the average pixel intensity of neighboring pixels. The neighboring pixels are accumulated as the closest neighboring pixels of known background, with a count approximately equal to the number of pixels in the connected component. Pixels in the connected component are first compared based on the average value of the corresponding pixels in img, followed by individual pixel comparisons. The content of bkgrnd at the end of step 8 of TABLE 5 is referred to as the “updated background” depicted in
Embodiments of the invention may be implemented on virtually any type of computing system, regardless of the platform being used. For example, the computing system may be one or more mobile devices (e.g., laptop computer, smart phone, personal digital assistant, tablet computer, or other mobile device), desktop computers, servers, blades in a server chassis, or any other type of computing device or devices that includes at least the minimum processing power, memory, and input and output device(s) to perform one or more embodiments of the invention. For example, as shown in
Software instructions in the form of computer readable program code to perform embodiments of the invention may be stored, in whole or in part, temporarily or permanently, on a non-transitory computer readable medium such as a CD, DVD, storage device, a diskette, a tape, flash memory, physical memory, or any other computer readable storage medium. Specifically, the software instructions may correspond to computer readable program code that when executed by a processor(s), is configured to perform embodiments of the invention.
Further, one or more elements of the aforementioned computing system (400) may be located at a remote location and be connected to the other elements over a network (412). Further, one or more embodiments of the invention may be implemented on a distributed system having a plurality of nodes, where each portion of the invention may be located on a different node within the distributed system. In one or more embodiments, the node corresponds to a distinct computing device. Alternatively, the node may correspond to a computer processor with associated physical memory. The node may alternatively correspond to a computer processor or micro-core of a computer processor with shared memory and/or resources.
One or more embodiments of the present invention provide the following improvements in electronic collaboration technologies: automatically sharing user content on a marker board with remote collaborating users without the user having to manually send the content data; limiting the content data transmitted to the portion of the marker board with new content; minimizing the number of content data transmissions by automatically determining when the new content is stable prior to transmission; and improving performance in detecting new static user content by downscaling the image sample and using simplified foreground content estimation.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this disclosure, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as disclosed herein. Accordingly, the scope of the invention should be limited only by the attached claims.