The presently disclosed embodiments are directed toward video-based vehicular speed law enforcement. However, it is to be appreciated that the present exemplary embodiments are also amenable to other like applications.
Conventional single-camera systems are hindered in their ability to accurately detect vehicle speed by the limitations associated with viewing a 3D world through 2D imaging devices. Additionally, the quality of evidentiary photos provided by such systems is unsatisfactory due to the retro-reflective properties of license plates, which require a sensor operating at high dynamic range at night. Moreover, the camera field of view (FOV) conventionally is calibrated for speed detection accuracy, which conflicts with the larger FOV requirements of traffic monitoring and incident detection. Systems with such a wide FOV typically exhibit a large degree of speed estimation error unless additional elements and/or features are included, such as multi-view capabilities, structured illumination, stereo-vision, etc., and these FOV problems cannot be easily solved with a conventional speed camera. In short, classical video-based speed estimation systems based on a single camera fall short in both performance and utility, because the estimated speed is rendered inaccurate by the ambiguities introduced when a 3D scene is mapped onto a 2D image.
There is a need in the art for systems and methods that facilitate video-based speed estimation and vehicle speed limit enforcement with reduced cost and improved accuracy, while overcoming the aforementioned deficiencies.
In one aspect, a computer-implemented method for video-based speed estimation comprises acquiring traffic video data from a primary camera and one or more image frames from a secondary camera, preprocessing the video data acquired from the primary camera, and detecting at least one vehicle in video data acquired from the primary camera. The method further comprises tracking at least one vehicle of interest by identifying and tracking a location of one or more vehicle features across a plurality of video frames in video data acquired from the primary camera, and performing sparse stereo processing using video data of one or more tracked features within a predetermined region in the video frames from the primary camera and the one or more image frames from the secondary camera. Additionally, the method comprises estimating a height above a reference plane (e.g., a road surface or the like) of the one or more tracked features, and estimating vehicle speed based on camera calibration information and estimated feature height associated with at least one of the one or more tracked features.
In another aspect, a system that facilitates video-based speed estimation comprises a primary camera that captures video of at least a vehicle, a secondary camera that concurrently captures one or more image frames of the vehicle, and a processor configured to acquire traffic video data from the primary camera and the one or more image frames from the secondary camera. The processor is further configured to preprocess the video data acquired from the primary camera, detect at least one vehicle in video data acquired from the primary camera, and track at least one vehicle of interest by identifying and tracking a location of one or more vehicle features across a plurality of video frames in video data acquired from the primary camera. Additionally, the processor is configured to perform sparse stereo processing using video data of one or more tracked features within a predetermined region in the video frames from the primary camera and the one or more image frames from the secondary camera, estimate a height above a reference plane (e.g., a road surface or the like) of the one or more tracked features, and estimate vehicle speed based on camera calibration information and estimated feature height associated with at least one of the one or more tracked features.
In yet another aspect, a non-transitory computer-readable medium stores computer-executable instructions for video-based speed estimation, the instructions comprising acquiring traffic video data from a primary camera and one or more image frames from a secondary camera, preprocessing the video data acquired from the primary camera, and detecting at least one vehicle in video data acquired from the primary camera. The instructions further comprise tracking at least one vehicle of interest by identifying and tracking a location of one or more vehicle features across a plurality of video frames in video data acquired from the primary camera, and performing sparse stereo processing using video data of one or more tracked features within a predetermined region in the video frames from the primary camera and the one or more image frames from the secondary camera. Additionally, the instructions comprise estimating a height above a reference plane (e.g., a road surface or the like) of the one or more tracked features, and estimating vehicle speed based on camera calibration information and estimated feature height associated with at least one of the one or more tracked features.
The above-described problem is solved by providing a video-based speed enforcement system that utilizes a main camera and a secondary traffic camera, such as a low-cost red-green-blue (RGB) camera. The described systems and methods provide improved accuracy of speed measurement and improved evidentiary photo quality compared to single camera approaches. The use of an RGB traffic camera mitigates the cost associated with a conventional stereo camera since the conventional approach requires two identical expensive primary cameras, rather than one primary and one low-cost secondary camera as proposed herein. There is also a greatly reduced computational requirement compared to conventional stereo video, which is a significant benefit in the transportation industry due to a need for real-time processing and high data rates. By using secondary video, spatio-temporally sparse stereo processing is enabled specifically for estimating the height of a vehicle feature above the road surface, which in turn enables accurate speed estimation.
The described systems and methods add a low-cost RGB traffic camera (e.g., a video camera, a still camera, etc.) to complement information obtained by the speed camera, which focuses on measuring vehicle speed. Since the RGB traffic camera is low-cost and provides a broad FOV, it is more cost-effective to use it for improving the accuracy of a lower cost, single monocular camera as a speed detector as compared to using a stand-alone and more expensive stereo camera for speed estimation in addition to the RGB traffic camera for surveillance and evidentiary photo purposes. Accordingly, the described systems and methods utilize the inexpensive RGB traffic camera for improving a single camera speed measurement without sacrificing its surveillance capability.
Relative to a system with stereo-vision for speed and a traffic camera for surveillance (e.g., 3-camera systems), the described system is more cost-effective, employing only two cameras. This advantage is achieved by re-formulating the speed measurement problem in stereo vision to form a simple feature height estimation (a constant factor) problem. Compared to the conventional monocular camera solutions, the described systems and methods are more accurate and are not limited to license plate tracking for speed.
It will be appreciated that the method described herein can be implemented using a computer 30 or other suitable computing device.
The computer 30 can be employed as one possible hardware configuration to support the systems and methods described herein. It is to be appreciated that, although a standalone architecture is illustrated, any suitable computing environment can be employed in accordance with the present embodiments. For example, computing architectures including, but not limited to, stand-alone, multiprocessor, distributed, client/server, minicomputer, mainframe, supercomputer, digital, and analog can be employed in accordance with the present embodiments.
The computer 30 can include a processing unit, a system memory, and a system bus that couples various system components, including the system memory, to the processing unit.
The computer 30 typically includes at least some form of computer readable media. Computer readable media can be any available media that can be accessed by the computer. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media.
A user may enter commands and information into the computer through an input device (not shown), such as a keyboard or a pointing device (e.g., a mouse, stylus, voice input, or graphical tablet). The computer 30 can operate in a networked environment using logical and/or physical connections to one or more remote computers. The logical connections depicted include a local area network (LAN) and a wide area network (WAN). Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets, and the Internet.
The vehicle tracking module 58 tracks vehicles of interest by determining the location of one or more vehicle feature(s) (e.g., a license plate or the like) across frames. For example, the vehicle tracking module follows identified vehicle features from one frame to the next. Tracked feature information is forwarded to a speed estimation module 64, and to a sparse stereo processing module 66, which performs sparse stereo processing when the tracked features are within a pre-determined region or zone in the frame(s). The sparse stereo processing module 66 uses video from the primary camera and one or more image frames from the secondary camera (video stream (A) 53 and video stream (B) 68) to estimate a height h of each tracked feature. Once tracking points are determined by the vehicle tracking module, and heights are estimated by the sparse stereo processing module, the speed estimation module 64 estimates the speed of the vehicle from camera calibration information 70 and the spatio-temporal data of the tracked points or features (including the height estimates). Speed estimation information (in addition to the vehicle identification information provided by the vehicle identification module from video stream A, and the video stream B from the RGB camera) is received at a speed violation enforcement module 62 for use in issuance of a citation or ticket 72 by a law enforcement entity. In one embodiment, the speed violation enforcement module prepares a violation package and/or issues a ticket for detected speed violators.
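As a non-limiting illustrative sketch of following vehicle features from one frame to the next, the snippet below uses pyramidal Lucas-Kanade optical flow, one common tracking technique; the disclosure does not mandate a particular tracker, and the file name and parameters here are illustrative assumptions.

```python
import cv2

cap = cv2.VideoCapture("primary_stream.mp4")   # illustrative file name
ok, frame = cap.read()
prev_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Seed tracking points with corner-like features (e.g., within a detected
# vehicle region such as a license plate).
points = cv2.goodFeaturesToTrack(prev_gray, maxCorners=50,
                                 qualityLevel=0.01, minDistance=7)

while points is not None and len(points) > 0:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Pyramidal Lucas-Kanade optical flow propagates each feature location
    # from the previous frame to the current one.
    new_points, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_gray, gray, points, None, winSize=(21, 21), maxLevel=3)
    points = new_points[status.flatten() == 1].reshape(-1, 1, 2)
    prev_gray = gray
```

The surviving point locations per frame constitute the spatio-temporal track data consumed by the speed estimation and sparse stereo processing modules.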
It will be appreciated that one or more modules or components of the system described above can be implemented using the computer 30 or a similar computing device.
It will be understood that, in accordance with one or more aspects of the described innovation, the basic processing involved in the speed estimation process may employ known techniques, with the exception that, in contrast to conventional approaches, the height of the tracked features is determined via spatio-temporally sparse stereo processing (triangulation) on one or more pairs of frames from the primary speed camera 51 and the traffic RGB camera 52. Advantages of the sparse stereo processing approach described herein include better speed accuracy, better evidentiary photo quality, and the use of a low-cost RGB traffic camera. Spatio-temporal sparse stereo processing is more computationally efficient than a conventional two-camera stereo-vision solution. It is also more robust: since it operates only on distinct features (the features used for tracking) rather than on all features (as a typical dense stereo-vision solution does), it is less susceptible to noise. In the following discussion, the main or primary camera may be referred to as the speed camera, and the secondary or auxiliary camera may be referred to as the traffic camera or the RGB camera.
With regard to sparse stereo processing for tracked feature height estimation, a camera-based speed estimation system (single or stereo) typically includes camera calibration information that relates camera coordinates to 3-D world coordinates relative to the road surface. Both the speed camera and the RGB traffic camera can be calibrated concurrently, e.g., in the absence of traffic disturbance through the use of a vehicle travelling through the scene or FOV of the two cameras while carrying calibration targets that span the 3 dimensions of the FOVs or the like, such as is described in U.S. patent application Ser. No. 13/527,673 to Hoover et al., which is hereby incorporated by reference herein in its entirety. Given the camera models for both cameras and the knowledge of the heights of two landmarks (e.g., road surface and another object at, e.g., 3 ft above the road or some other predetermined height), it can be shown that a feature height h can be computed by:
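In reconstructed form, writing $a \equiv M_{h_1}(i,j) - M'_{h_1}(i',j')$ and $b \equiv M_{h_2}(i,j) - M'_{h_2}(i',j')$, and under the linear-height camera-model assumption developed in Eqs. (4)-(8) below, the least-squares solution is:

$$h = \frac{(b - a)^{\top}\,(h_1 b - h_2 a)}{\lVert b - a \rVert^{2}} \qquad (1)$$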
Here, $M_{h_1}$, $M_{h_2}$ are camera models for the speed camera at landmark heights $h_1$ and $h_2$, $M'_{h_1}$, $M'_{h_2}$ are camera models of the RGB traffic camera at landmark heights $h_1$ (e.g., 0) and $h_2$ (e.g., 3), $(i, j)$ is the pixel position of the tracked feature in speed camera coordinates, and $(i', j')$ is the pixel position of the tracked feature in RGB traffic camera coordinates. All values are known once the camera calibration is performed and the pixel correspondence for the feature has been found from the stereo pair (the correspondence problem determines $(i', j')$ given $(i, j)$, as explained below). Since there are two equations and one unknown, the system can be solved via a conventional least-squares solution, which is robust against noise. In one example, sparse stereo processing comprises performing height estimation by identifying a least-squares solution that is a function of camera calibration and orientation information, estimating the feature height multiple times using a plurality of stereo feature pairs, and processing the estimated heights statistically by computing one or more of an average height, a median height, a mean height, and a truncated mean height.
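A minimal numeric sketch of the statistical post-processing named above follows; the per-frame height estimates are illustrative placeholder values.

```python
import numpy as np
from scipy import stats

# Illustrative per-frame height estimates (ft) for one tracked feature;
# the last value simulates an outlier from a bad stereo match.
h_estimates = np.array([3.1, 2.9, 3.0, 3.3, 2.8, 4.9])

mean_h = h_estimates.mean()                    # average (mean) height
median_h = np.median(h_estimates)              # median height
trimmed_h = stats.trim_mean(h_estimates, 0.2)  # truncated (trimmed) mean,
                                               # discarding 20% at each tail
```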
For feature height estimation, the processing occurs at the speed camera end. As a tracking point located at coordinates $(i, j)$ in the speed camera image plane enters the tracked feature height estimation zone or region within a given frame of video stream (A), the corresponding image template (e.g., the cropped image of a license plate from the speed camera video stream) is used to find the correspondence $(i', j')$ in the corresponding RGB frame. Since there are two different cameras (i.e., with different spatial resolutions and FOVs), the matching method needs to be invariant to scale and, potentially, to projective distortions. Therefore, a matching technique such as the scale-invariant feature transform (SIFT), Speeded Up Robust Features (SURF), or the Gradient Location and Orientation Histogram (GLOH) can be employed. Alternatively, one can apply a matching technique at multiple scales using features that are not scale-invariant in nature, such as correlations of image intensities (used by Harris corners), Histograms of Oriented Gradients (HOG), local binary patterns (LBP), etc. This may be computationally more expensive but will enable scale-invariant matching for objects that are described with scale-variant features. Once the corresponding pixel locations of the tracked feature have been identified in both camera views, the height of the tracked feature can be estimated using Eq. (1). Multiple height estimations across multiple frames are calculated until the tracked points exit the tracked feature height estimation zone, and an estimated feature height is computed by averaging the individual estimates. The tracking continues until the vehicle exits the FOV of the speed camera, but the feature height estimation can stop after sufficient measurements are made (as defined by the length of the height estimate region). This estimated feature (e.g., license plate) height is then used to fine-tune the raw speed estimated by the single speed camera for better accuracy.
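A sketch of finding the correspondence $(i', j')$ for a tracked template using SIFT with a ratio test (one of the scale-invariant options named above) is given below; the image file names and the 0.75 ratio threshold are illustrative assumptions.

```python
import cv2

# Cropped feature template from the speed camera, and a frame from the
# RGB traffic camera (illustrative file names).
plate = cv2.imread("plate_template_speed_cam.png", cv2.IMREAD_GRAYSCALE)
rgb_frame = cv2.imread("rgb_traffic_frame.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(plate, None)
kp2, des2 = sift.detectAndCompute(rgb_frame, None)

# Nearest-neighbor matching with Lowe's ratio test to reject ambiguous matches.
matcher = cv2.BFMatcher(cv2.NORM_L2)
good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
        if m.distance < 0.75 * n.distance]

# Matched keypoint locations in the RGB frame give candidate (i', j') positions.
correspondences = [kp2[m.trainIdx].pt for m in good]
```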
A typical stereo-vision system involves at least two cameras viewing a common (overlapping) segment of a scene. One of the goals of stereo vision is to resolve the 3D-to-2D ambiguities that a single 2D camera cannot resolve. That is, in the context of speed detection, a single camera provides two-dimensional feature locations (x,y), while a stereo camera has the capability to provide three-dimensional information (x,y,z) (where z typically denotes depth). Unless the height of the tracked vehicle features can be estimated accurately by some other means, the speed measurement from a conventional monocular camera system is not as accurate as that from a stereo camera (all other factors, such as sensor noise, placement of cameras, illumination, camera shake, etc., being equal). Though stereo vision provides depth information and is thus more appropriate for 3D world imaging applications, the depth estimation performance is not uniform throughout the space. The depth resolution and the amount of overlap in the two camera views depend on the relative positions of the cameras, their sensor resolutions, and their focal lengths.
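For intuition (an idealized, rectified stereo model that is not part of the original disclosure): with baseline $B$, focal length $f$ (in pixels), and measured disparity $d$, depth and its sensitivity to a disparity error $\Delta d$ behave as

$$z = \frac{fB}{d}, \qquad |\Delta z| \approx \frac{z^{2}}{fB}\,|\Delta d|,$$

so depth accuracy degrades quadratically with distance and improves with a longer baseline or focal length, which is why performance varies across the overlap region.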
To illustrate this, consider the respective fields of view of the primary speed camera and the secondary RGB traffic camera: the two views overlap in only a portion of the monitored scene, and the quality of stereo depth estimation varies across that overlap region.
The overlapping of the FOVs can be optimized while imposing few constraints on the FOV of the RGB traffic camera, which results in a small area of overlap where stereo performs robustly (as opposed to attempting to make stereo vision perform well in a larger portion of the overlap region). In the illustrated configuration, the primary speed camera and the RGB traffic camera 134 are positioned so that their FOVs overlap above the road surface 136, with the robust portion of the overlap defining the tracked feature height estimation zone 140.
It will be understood that stereo vision processing may include, for example, determining epi-polar lines, i.e., the search regions for the stereo correspondence problem. The corresponding pixels in each pair of images (i.e., from the primary and secondary cameras) are matched, given the constraint introduced by the determined epi-polar lines. That is, potential matches are only searched for around the epi-polar lines. In this manner, a dense depth map for all pixels in the overlapping FOV (referred to as the stereo region) can be achieved. This approach can also be used to derive sparse depth information, i.e., the depth information for selected feature points. In one example, feature points on the stereo pair of images or frames are first identified independently in each image and then linked together according to the correspondence between them. Interest point detectors, such as SIFT, SURF, or various corner detectors such as Harris corners, Shi-Tomasi corners, Smallest Univalue Segment Assimilating Nucleus (SUSAN) corners, etc., can be applied to find the feature points. The correspondence problem can be solved via one or more of interest point matching and local searches under the epi-polar constraint. It will be noted that, according to one example, processing of the speed camera sequence can identify the set of feature points that are suitable for tracking. Tracked feature points are useful for stereo matching since good tracking points have certain texture and/or corner properties that are desirable for identifying stereo matches. The correspondence problem is spatially sparse since only the 3D coordinates of a small set of points are typically recovered, and temporally sparse since it only arises when vehicles of interest traverse the height estimation zone 140. For regular stereo-vision applications, the depth measurements of these sparse points are interpolated and propagated to all pixels in the stereo region (e.g., by multi-resolution processing with a predetermined number of points of interest) and across a plurality of video frames. In the case of speed measurement, the spatial coordinates (x,y,z) of the tracked feature points are sufficient.
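A sketch of the epi-polar machinery described above follows, using OpenCV: a fundamental matrix is robustly estimated from putative correspondences, and the resulting epi-polar lines in the secondary image bound the correspondence search. The point arrays are random placeholders standing in for real detections.

```python
import numpy as np
import cv2

# Putative pixel correspondences between the primary (pts1) and secondary
# (pts2) views, e.g., from interest point matching (placeholder values).
rng = np.random.default_rng(0)
pts1 = rng.uniform(0, 1000, size=(30, 2)).astype(np.float32)
pts2 = (pts1 + [40.0, 2.0] + rng.normal(0, 0.5, size=(30, 2))).astype(np.float32)

# Robustly estimate the fundamental matrix relating the two views.
F, inliers = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 3.0, 0.99)

# Each primary-image feature's stereo match must lie near its epi-polar line
# a*x + b*y + c = 0 in the secondary image, which bounds the search region.
lines = cv2.computeCorrespondEpilines(pts1.reshape(-1, 1, 2), 1, F).reshape(-1, 3)
```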
For a typical stereo-vision speed camera, the (x,y,z) point coordinates across a given number of frames are converted to road (e.g., real-world) coordinates so that speed in standard units such as miles-per-hour (mph) can be calculated. A calibration process that maps pixel values into real-world coordinates facilitates the conversion; this process may be referred to as an extrinsic calibration. As previously described, the quality of the estimation of the spatial coordinates (x,y,z) of a point depends at least in part on its location within the stereo region.
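As a worked illustration of the conversion (the coordinates and frame rate below are illustrative assumptions, with calibration already applied), speed in mph follows directly from the displacement of a tracked feature between frames:

```python
import numpy as np

fps = 30.0                     # illustrative frame rate
p1 = np.array([12.0, 40.0])    # (x, y) road coordinates in feet, frame k
p2 = np.array([14.6, 40.1])    # same feature, frame k+1
dt = 1.0 / fps

speed_ftps = np.linalg.norm(p2 - p1) / dt   # feet per second
speed_mph = speed_ftps * 3600.0 / 5280.0    # 1 mph = 5280 ft per 3600 s
# Here: ~2.6 ft in 1/30 s, i.e., about 53 mph.
```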
In the described systems and methods, an optimal tradeoff is achieved by using stereo vision for tracked feature height estimation (e.g., license plate height) across the highlighted tracked feature height estimation zone 140. Since the speed camera measurement system identifies feature points to track with a constant but unknown height above the road surface 136, all that is needed from the auxiliary RGB traffic camera 134 is video data to aid the computation of said unknown (but constant) value. In the case where the tracked height is constant, only a single pair of images of the vehicle at some optimal location is needed (e.g., the first time the vehicle enters the scene within the height estimation zone 140), although multiple pairs can be used for additional robustness.
Derivation of tracked feature height estimation using sparse stereo-vision processing involves an approach for estimating the height of a feature of an object (e.g., a vehicle) traveling on a reference plane (e.g., a road surface) using two cameras. Given four camera models $M_{h_1}$, $M_{h_2}$, $M'_{h'_1}$, $M'_{h'_2}$ with a common (x,y,h) coordinate system relative to the road surface, and a pixel correspondence pair $(i, j)$ and $(i', j')$, it can be shown that:
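In reconstructed form (rendered here in equivalent notation, under the convention that each camera model maps a pixel location to road-plane coordinates (x,y) for a hypothesized feature height):

$$(x, y) = \begin{cases} M_{h_1}(i, j), & \text{feature at height } h_1 \\ M_{h_2}(i, j), & \text{feature at height } h_2 \end{cases} \qquad (2)$$

$$(x, y) = \begin{cases} M'_{h'_1}(i', j'), & \text{feature at height } h'_1 \\ M'_{h'_2}(i', j'), & \text{feature at height } h'_2 \end{cases} \qquad (3)$$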
Here, the four camera models correspond to the primary camera at two heights, $h_1$, $h_2$, and the secondary camera at two heights, $h'_1$, $h'_2$, respectively. A pixel correspondence pair, as used above, denotes the pixel locations, in the primary camera image or frame and in the secondary camera image or frame, of the same physical point on an object. Looking at Eq. (2), it will be understood that for a point $(i, j)$ in the primary camera frame it is not possible to know its true location (x,y) without knowing whether it lies at height $h_1$, at $h_2$, or at some other height. Similarly, it is not possible to resolve the ambiguity for $(i', j')$ by looking at Eq. (3) alone. It is, however, possible to resolve the ambiguity if it is known that $(i, j)$ and $(i', j')$ are physically the same point (i.e., their true (x,y) is the same).
Assuming the camera projection mapping (e.g., the camera models at various heights) is linear along the height axis, it can be shown that for a point at (x,y,h) the following equation is satisfied:
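In reconstructed form, for the primary camera (the secondary camera is analogous, with primed quantities), the mapping at height h is the linear interpolation of the two calibrated models:

$$(x, y) = \frac{h_2 - h}{h_2 - h_1}\, M_{h_1}(i, j) + \frac{h - h_1}{h_2 - h_1}\, M_{h_2}(i, j) \qquad (4)$$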
When solving the tracked-feature height problem, given a pair of image-plane correspondences, $(i, j)$ and $(i', j')$, of a tracked feature at unknown height h from the two cameras, its real-world coordinate (x,y) satisfies:
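In reconstructed form (Eq. (5) for the primary camera, Eq. (6) for the secondary camera):

$$(x, y) = \frac{h_2 - h}{h_2 - h_1}\, M_{h_1}(i, j) + \frac{h - h_1}{h_2 - h_1}\, M_{h_2}(i, j) \qquad (5)$$

$$(x, y) = \frac{h'_2 - h}{h'_2 - h'_1}\, M'_{h'_1}(i', j') + \frac{h - h'_1}{h'_2 - h'_1}\, M'_{h'_2}(i', j') \qquad (6)$$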
Setting Eq. (5) equal to Eq. (6) and substituting the two-camera calibration models in equations (2) and (3), it can be shown that h satisfies:
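In reconstructed form:

$$\frac{h_2 - h}{h_2 - h_1}\, M_{h_1}(i, j) + \frac{h - h_1}{h_2 - h_1}\, M_{h_2}(i, j) \;=\; \frac{h'_2 - h}{h'_2 - h'_1}\, M'_{h'_1}(i', j') + \frac{h - h'_1}{h'_2 - h'_1}\, M'_{h'_2}(i', j') \qquad (7)$$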
Further constraining the two-camera model so that $h_1 = h'_1$ and $h_2 = h'_2$ reduces Eq. (7) to:
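In reconstructed form, the interpolation weights on the two sides of Eq. (7) become identical, so the cross-camera differences must vanish:

$$\frac{h_2 - h}{h_2 - h_1}\left[M_{h_1}(i, j) - M'_{h_1}(i', j')\right] + \frac{h - h_1}{h_2 - h_1}\left[M_{h_2}(i, j) - M'_{h_2}(i', j')\right] = \mathbf{0} \qquad (8)$$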
There are two equations and only one unknown in Eq. (8). Therefore, h can be calculated as a least-squares solution. Additionally, multiple such pairs can be acquired and used to solve for h while the tracked object appears in both views (i.e., the fields of view of the primary and secondary cameras) to yield an even more robust solution.
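A minimal numeric sketch of this least-squares solve follows, rearranging Eq. (8) to h(b - a) = h1*b - h2*a per pair and stacking several pairs; the camera-model outputs below are illustrative placeholder values (constructed to be consistent with a feature about 1.5 ft above the road), and the landmark heights of 0 and 3 ft echo the earlier calibration example.

```python
import numpy as np

h1, h2 = 0.0, 3.0  # calibration landmark heights (ft), as in the example above

# Illustrative road-plane back-projections for three correspondence pairs:
# Mh1[k], Mh2[k] from the primary camera at heights h1, h2;
# Mp_h1[k], Mp_h2[k] from the secondary camera at the same heights.
Mh1   = np.array([[12.0, 40.0], [12.4, 41.2], [12.9, 42.5]])
Mh2   = np.array([[11.4, 38.9], [11.8, 40.1], [12.3, 41.4]])
Mp_h1 = np.array([[12.2, 40.4], [12.6, 41.6], [13.1, 42.9]])
Mp_h2 = np.array([[11.2, 38.5], [11.6, 39.7], [12.1, 41.0]])

a = Mh1 - Mp_h1   # per-pair cross-camera disagreement at height h1
b = Mh2 - Mp_h2   # per-pair cross-camera disagreement at height h2

# Eq. (8) rearranged: h * (b - a) = h1 * b - h2 * a, stacked over all pairs
# and both (x, y) components.
A = (b - a).reshape(-1, 1)            # (2 * n_pairs) x 1 design matrix
rhs = (h1 * b - h2 * a).reshape(-1)

h_est, *_ = np.linalg.lstsq(A, rhs, rcond=None)
print(f"estimated feature height: {h_est[0]:.2f} ft")
```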
As stated above, the system 200 comprises the processor 204 that executes, and the memory 206 that stores one or more computer-executable modules (e.g., programs, computer-executable instructions, etc.) for performing the various functions, methods, procedures, etc., described herein. “Module,” as used herein, denotes a set of computer-executable instructions, software code, program, routine, or other computer-executable means for performing the described function, or the like, as will be understood by those of skill in the art. Additionally, or alternatively, one or more of the functions described with regard to the modules herein may be performed manually.
The memory may be a computer-readable medium on which a control program is stored, such as a disk, hard drive, or the like. Common forms of non-transitory computer-readable media include, for example, floppy disks, flexible disks, hard disks, magnetic tape or any other magnetic storage medium, CD-ROM, DVD or any other optical medium, RAM, ROM, PROM, EPROM, FLASH-EPROM and variants thereof, any other memory chip or cartridge, or any other tangible medium from which the processor can read and execute. In this context, the systems described herein may be implemented on or as one or more general purpose computers, special purpose computer(s), a programmed microprocessor or microcontroller and peripheral integrated circuit elements, an ASIC or other integrated circuit, a digital signal processor, a hardwired electronic or logic circuit such as a discrete element circuit, a programmable logic device such as a PLD, PLA, FPGA, or PAL, a graphics processing unit (GPU), or the like.
According to this embodiment, the system 200 acquires a primary video stream of the monitored roadway and one or more concurrent image frames from the secondary camera, detects vehicles of interest therein, and processes the acquired data as follows.
The feature tracking module 216 tracks vehicles of interest by determining the location of one or more vehicle feature(s) (e.g., a license plate or the like) across frames. For example, the feature tracking module follows identified vehicle features from one frame to the next in the primary video stream. Tracked feature information is forwarded to a speed estimation module 226, and to a sparse stereo processing module 220 that performs sparse stereo processing when the tracked features are within a pre-determined region or zone (e.g., a tracked feature height estimation zone such as the zone 140 described above). The sparse stereo processing module 220 uses the primary video stream and the one or more image frames from the secondary camera to estimate a height h of each tracked feature, and the speed estimation module 226 then estimates vehicle speed from camera calibration information and the spatio-temporal data of the tracked features (including the height estimates).
Additionally, the system 200 can include a graphical user interface (GUI) 230 via which a user may enter information and on which information is presented to the user. For instance, a technician or law enforcement personnel can be presented with video data, height and/or speed estimation information, vehicle ID information, violation package(s), or any other suitable information.
The following example is provided for illustrative purposes to show the manner in which the described system(s) may be calibrated. The example focuses on the accuracy of the feature height estimation capabilities of the proposed sparse stereo-vision system. In the example, a parking lot is imaged from the 2nd floor of a building (e.g., about 100 ft away and about 15 ft above the ground). In this example, the cameras are horizontally (rather than vertically) displaced by 12 ft due to space constraints, although one skilled in the art will understand that the same principles apply to vertically mounted cameras, as described with regard to the preceding figures. It will be noted that the working distance can be any suitable distance (e.g., between 25 ft and 50 ft away from the tracked feature height estimation zone) and is not limited to the tested 100 ft distance imposed by the testing conditions. In any case, scaling all tested lengths and the working distance down by a factor of 4 (e.g., from 100 ft to 25 ft) provides results consistent with those of an operational vertically mounted system.
Example views from the two cameras were captured under this highly constrained test.
The performance statistics are (min, max)=(−2″, 4.8″), (ave, std)=(0.25″, 2.39″), P95=6.8″. A conventional approach (e.g., such as is described in U.S. patent application Ser. No. 13/411,032 to Kozitsky et al., which is hereby incorporated by reference in its entirety herein) yielded an accuracy of (min, max)=(−8.1″, 16.5″), (ave, std)=(0.26″, 3.96″), P95=15.1″, whereas the herein described method is more accurate (~8″ improvement in P95, or 1.5″ improvement in standard deviation), even under the limited experimental conditions. It will be appreciated that, while the conventional approach was tested more extensively (more iterations), its target features consisted of 5 to 6 distinct license plates with heights ranging from 24.5″ to 43″. On the other hand, using the herein described method, fewer iterations need be performed while still addressing a wider range of feature heights, ranging from 0″ to 44″. Moreover, the conventional method exhibits a few failure modes that the herein described method overcomes: first, the conventional method only works for license plates (as it performs height estimation from measured license plate character heights), and second, its accuracy decreases with external noise factors affecting the appearance of the license plate (e.g., snow, frames around the license plate, etc.).
The exemplary embodiments have been described. Obviously, modifications and alterations will occur to others upon reading and understanding the preceding detailed description. It is intended that the exemplary embodiments be construed as including all such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
References Cited

U.S. Patent Documents

Number | Name | Date | Kind
---|---|---|---
5,734,337 | Kupersmit | Mar. 1998 | A
8,108,119 | Southall et al. | Jan. 2012 | B2
2004/0252193 | Higgins | Dec. 2004 | A1
2011/0267460 | Wang | Nov. 2011 | A1
Other Publications

Dufournaud et al., "Matching Images with Different Resolutions," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2000, vol. 1, pp. 612-618.
T-EXSPEED V 2.0, http://www.kria.biz/english/products.html, accessed Feb. 11, 2013, 2 pgs.
U.S. Appl. No. 13/611,718, filed Sep. 12, 2012, W. Wu.
U.S. Appl. No. 13/527,673, filed Jun. 20, 2012, Hoover et al.
U.S. Appl. No. 13/411,032, filed Mar. 2, 2012, Kozitsky et al.
U.S. Appl. No. 13/315,032, filed Dec. 8, 2011, Maeda et al.
U.S. Appl. No. 13/414,167, filed Mar. 7, 2012, Shin et al.
U.S. Appl. No. 13/371,068, filed Feb. 10, 2012, Wu et al.
Publication No. US 2014/0267733 A1, published Sep. 2014 (US).