1. Technical Field
The subject matter described herein generally relates to vehicular safety systems, and in particular to the real-time visual detection of traffic lights in high resolution images using a standard CPU.
2. Background Information
Traffic light detection is an important problem faced by both forthcoming Advanced Driver Assistance Systems (ADAS) and future autonomous vehicles. Autonomous vehicles will need to interact with traffic lights; while there has been some discussion that traffic lights may disappear as human drivers do, it seems likely that traffic signals will survive, since they are needed not only for vehicles but for pedestrians and bicyclists as well.
Autonomous vehicles can interact with traffic lights in one of two ways. First, they can use data provided by the traffic signal itself, either through Dedicated Short Range Communication (DSRC) or using an Internet-based mechanism. Second, vehicles can make use of cameras to observe the traffic lights.
In recent years, there has been increasing research in visual traffic light detection using various color segmentation and feature detection algorithms, as well as recent advances using convolutional neural networks (CNNs). Unfortunately, much of the existing work has significant drawbacks, making it unsuitable for deployment in the real world. Practical problems include issues related to robustness, reliance on potentially out-of-date prior information, detection speed, computational requirements, and scalability to higher resolution cameras.
Practical traffic light detection must operate in a wide range of environments, and with multiple traffic light designs. A new and temporary light in a construction zone is unlikely to exist in a geospatial database. A rural area may graduate to being a (literal) one-light town, adding a new traffic light without telling anyone. In all of these cases, a signal may appear unexpectedly in the visual field, and an autonomous vehicle will need to handle it. Traffic light detection systems will need to operate without complete prior knowledge of light locations, and without generating false positives from lights on vehicles or buildings.
Cameras typically capture video at 30 frames per second (fps) or more, and there are two reasons why traffic light detection systems need to keep up. First, while a system operating at (say) 3 fps may be sufficient to observe a light long before a vehicle needs to interact with it, there may be aliasing between the frequency of frame analysis and the frequency of, for example, a flashing yellow signal. In the worst case, the two frequencies will coincide and the flashing light will be perceived as either always on or always off. More importantly, however, no computer vision system is perfect. A system that detects lights in single frames with 80% accuracy has about a 1% chance of missing a traffic light after analyzing three frames. If the analysis proceeds at a rate of 3 fps, that is probably unacceptable; a 1% chance of not seeing a light for a full second (or a flashing yellow for double that) is simply unsafe. A system that operates at 30 fps would have less than one chance in a billion of missing a traffic light for a full second, even if its detection rate on individual images were only 50%.
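As a check on this arithmetic (assuming, for the sake of the calculation, that per-frame detections are independent), the two miss probabilities follow directly:

$$(1-0.8)^3 = 0.2^3 = 0.008 \approx 1\%, \qquad (1-0.5)^{30} = 2^{-30} \approx 9.3\times 10^{-10} < 10^{-9}.$$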
Existing work has also utilized low resolution images, typically 640×480 pixels, due to computational requirements. Again, this introduces significant risk that the detection will be insufficient for modern requirements, whether due to false positives (e.g., falsely detecting a green light) or false negatives (e.g., not detecting an upcoming red light). For light detection to be usable and cost-effective in desired safety systems and autonomous vehicle control subsystems, improvements in the task of traffic light detection are needed.
The Figures and the following description describe certain embodiments by way of illustration only. One skilled in the art will readily recognize from the following description that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles described herein. Reference will now be made to several embodiments, examples of which are illustrated in the accompanying figures. It is noted that wherever practicable similar or like reference numbers may be used in the figures and may indicate similar or like functionality.
In summary, the traffic light detection systems and methods described here address the technical needs outlined above while also moderating deployment costs. In one embodiment, a system analyzes 4K (3840×2160) video at approximately 30 fps running on a single mid-range desktop CPU and without requiring prior information. This facilitates detection of lights at long distances while utilizing a camera with a wide field of view, enabling the perception of lights when stopped at the white line. A wide field of view also enables additional visual analysis in other applications such as collision avoidance while using only a single camera system.
In some environments, networks 140 and 150 are WAN networks, while in others the connections they provide may be implemented by conventional LAN Ethernet, Wi-Fi, USB or other conventional connections between processing subsystems. In additional embodiments for various applications, the networks 140 and 150 employ links using technologies such as Ethernet, 802.11, worldwide interoperability for microwave access (WiMAX), 2G/3G/4G mobile communications protocols, digital subscriber line (DSL), asynchronous transfer mode (ATM), InfiniBand, PCI Express Advanced Switching, etc. Similarly, the networking protocols used on the networks 140 and 150 can include multiprotocol label switching (MPLS), transmission control protocol/Internet protocol (TCP/IP), User Datagram Protocol (UDP), hypertext transport protocol (HTTP), simple mail transfer protocol (SMTP), file transfer protocol (FTP), etc. as may be most suitable for the application at hand. The data exchanged over the networks 140 and 150 can be represented using technologies and formats including image data in binary form (e.g., Portable Network Graphics (PNG)), hypertext markup language (HTML), extensible markup language (XML), etc. In addition, all or some of the links can be encrypted using conventional encryption technologies such as secure sockets layer (SSL), transport layer security (TLS), virtual private networks (VPNs), Internet Protocol security (IPsec), etc. to address issues such as potential hacking threats. In another embodiment, the entities coupled via networks 140 or 150 use custom or dedicated data communications technologies instead of, or in addition to, the ones described above.
In other embodiments, the system 100 contains different and/or additional elements. In addition, the functions may be distributed among the elements in a different manner than described herein. For example, the input device 110 may perform some or all of the data processing related to the input it provides, such as described below with respect to video conversion.
Referring now to
In the described embodiment, it is considered acceptable to detect only traffic lights of radius four pixels or greater; given the geometry as described above, this corresponds to a traffic light distance of approximately 450 feet.
Referring still to
Accurately identifying a traffic light in a typical street image is a daunting task for a conventional single-CPU computing system. A single 4K image contains about 8 million pixels, so analyzing a 30 fps 4K feed involves processing approximately 240 million pixels per second, with each pixel containing 3 bytes of information. Such a data rate is impractical for conventional real-time processing, and it is for this reason that known attempts at addressing this problem have relied on lower resolution images. As explained above, this is unsatisfactory for reliable real-time identification of traffic lights suitable for vehicular safety systems and autonomous vehicle control.
Referring now to
The initial processing task is to rapidly find colored circles in a stream of changing video images. This is a well-studied problem in the computational vision literature and is commonly solved using color blob detection or the circular Hough transform. While such techniques may be usable in some applications, in order to cope with high resolution images, the embodiment detailed here makes use of a faster algorithm that exploits the fact that the task is to look for circular discs.
Method 400 commences by searching 410 not for disks, but for squares of a particular color or set of colors (for instance, red, amber or green). After the squares are found, the system 100 uses them to localize 420 the search for appropriately colored disks. In the embodiment detailed here, the pixel data is stored in the YUV format typical of raw video encoding, which is convenient because the primary means by which traffic lights stand out is luminance, corresponding directly to the Y value. An RGB encoding would make luminance harder to identify. The common HSV space also separates brightness (the V value) but requires an additional transform of the native video color-space. In testing, it was discovered that approximately 14 ms is required (using the described processor architecture) to transform a 4K image from YUV to HSV color-space, an overhead that is preferably avoided.
When searching for a square of a particular color, the system identifies that color by a range of possible values for each of Y, U and V. For each point (x,y) in the image, h(x,y) is defined as the number of traffic light colored pixels in a square of size s with upper left corner at (x,y). The algorithm takes s to be the size of the largest square that can be inscribed in a circle of radius four pixels, since each traffic signal is required to be at least that large by the system. Further consideration is given only to those points (x,y) for which h(x,y) exceeds some threshold t.
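As an illustration of this color test, the following sketch (in Python with NumPy, assuming the raw Y, U and V planes for a frame have been gathered into an H×W×3 array; the function and parameter names are illustrative rather than part of the described system) computes the per-pixel indicator of traffic-light color over an entire frame:

```python
import numpy as np

def color_mask(yuv, lo, hi):
    # yuv: H x W x 3 array of raw Y, U, V values for one frame.
    # lo, hi: length-3 arrays holding (ymin, umin, vmin) and
    # (ymax, umax, vmax) for one light color (red, amber or green).
    # Returns a Boolean map that is True exactly where the pixel
    # is traffic-light colored.
    return np.all((yuv >= lo) & (yuv <= hi), axis=2)
```

Working directly on the native YUV planes in this fashion sidesteps the approximately 14 ms per-frame color-space conversion noted above.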
This approach has two notable properties. First, it is possible to compute h(x,y) for every pixel in the image extremely quickly. Second, it is possible to optimize the colors being searched for and the threshold t so that the overall process suggests a minimal number of squares for further evaluation.
For the first, denote by hr(x,y) the number of appropriately colored pixels in a region that is not a square of size s, but is instead a row of length s. Denoting further by p(x,y) the function that takes the value 1 if the pixel at (x,y) is traffic-light colored and 0 if it is not, this yields the equation:
hr(x,y)=hr(x−1,y)+p(x+s,y)−p(x−1,y)
since the region starting at (x,y) differs from that starting at (x−1,y) only in that the pixel at (x+s,y) is now included and the pixel at (x−1,y) no longer is. This dynamic programming approach allows the system to compute the entire row of hr(x,y) values for a fixed y using only two operations per pixel. The process of computing h(x,y) for each (x,y) in the image can also be perfectly parallelized, since each row is treated separately. The image can be divided into n horizontal strips, facilitating processing across multiple CPU threads of data processing device 120. An alternative embodiment works instead with hc(x,y), the number of suitably colored pixels in a column of the image. Which of these two embodiments is to be used is a function of the memory layout used to store the images in question.
Having computed hr(x,y), h(x,y) is computed as
h(x,y)=h(x,y−1)+hr(x,y+s)−hr(x,y−1)
since the square region starting at (x,y) differs from that starting at (x,y−1) only in that the row at (x,y+s) is now included and the row at (x,y−1) no longer is. Again using a dynamic programming approach, h(x,y) can be computed using an additional two operations per pixel. If hc(x,y) is used instead of hr(x,y), a similar approach can be taken.
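The two recurrences can be combined into a single pass, sketched below in Python. The sketch adopts the half-open indexing convention usual in code (a row of length s starting at x covers columns x through x+s−1), so its index arithmetic differs by one from the convention used in the equations above; the function name and return format are illustrative.

```python
import numpy as np

def candidate_corners(p, s, t):
    # p: H x W array of 0/1 values, with p[y, x] = 1 when the pixel
    # at (x, y) is traffic-light colored (e.g., a color_mask output).
    # Returns the (x, y) upper-left corners whose s x s square
    # contains at least t colored pixels, i.e., h(x, y) >= t.
    p = p.astype(np.int32)
    H, W = p.shape
    # hr[y, x]: colored pixels in the length-s row starting at (x, y);
    # after the first column, each value costs two operations.
    hr = np.zeros((H, W - s + 1), dtype=np.int32)
    hr[:, 0] = p[:, :s].sum(axis=1)
    for x in range(1, W - s + 1):
        hr[:, x] = hr[:, x - 1] + p[:, x + s - 1] - p[:, x - 1]
    # h[y, x]: colored pixels in the s x s square with corner (x, y),
    # again two operations per value once the first row is in place.
    h = np.zeros((H - s + 1, W - s + 1), dtype=np.int32)
    h[0] = hr[:s].sum(axis=0)
    for y in range(1, H - s + 1):
        h[y] = h[y - 1] + hr[y + s - 1] - hr[y - 1]
    ys, xs = np.nonzero(h >= t)
    return list(zip(xs.tolist(), ys.tolist()))
```

Because the hr recurrence treats each image row independently, the first loop can be split across the n horizontal strips and CPU threads described above; the column-based hc variant is the transpose of the same computation.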
To minimize the number of pixels that need to be considered as possible traffic lights there are seven parameters to optimize: the minimum and maximum values for the luminance Y-value, ymin and ymax, and similar parameters for the U and V chroma values. There is also the threshold parameter t.
Note first that given a set of images in which traffic lights have been identified by hand, it is possible to compute t from the other six parameters. After all, t should be the maximum threshold for which the known traffic lights do indeed qualify as lights. The system therefore optimizes only ymin, ymax, umin, umax, vmin and vmax.
For some fixed image I, the total number of pixels identified as possible traffic light locations in I is denoted by N(I, ymin, ymax, umin, umax, vmin, vmax), given values for ymin, ymax, umin, umax, vmin and vmax. By averaging this expression over various images representative of those that will be analyzed when the system 100 is operating in real time, the algorithm can employ a new function that we denote simply by N(ymin, ymax, umin, umax, vmin, vmax). This function represents the expected number of points that will need to be analyzed further in a new and not yet known image.
The goal is now to choose values for ymin, ymax, umin, umax, vmin and vmax that minimize N. In the described embodiment, a conventional hill-climbing approach is used to select such values.
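A minimal sketch of such a hill climb follows. The expected_candidates callable, the integer step size and the round limit are assumptions for illustration; the callable is presumed to recompute the threshold t from the six bounds (as described above) and then average the candidate count N over the representative training images.

```python
def hill_climb(bounds, expected_candidates, step=2, max_rounds=200):
    # bounds: dict with keys 'ymin', 'ymax', 'umin', 'umax',
    # 'vmin', 'vmax'; expected_candidates maps such a dict to the
    # averaged candidate count N over representative images.
    keys = ('ymin', 'ymax', 'umin', 'umax', 'vmin', 'vmax')
    best = dict(bounds)
    best_n = expected_candidates(best)
    for _ in range(max_rounds):
        improved = False
        for key in keys:
            for delta in (-step, step):
                trial = dict(best, **{key: best[key] + delta})
                n = expected_candidates(trial)
                if n < best_n:  # keep any move that shrinks N
                    best, best_n, improved = trial, n, True
        if not improved:
            break  # local minimum of N reached
    return best
```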
Having identified ymin, ymax, umin, umax, vmin and vmax for each of the three light colors (red, yellow and green) using a training data set, associated thresholds t for each of the colors are computed as well. The system 100 can then identify all of the pixels in the image that satisfy h(x,y)≥t; each such point is the upper left-hand corner of a square centered at the possible center of a traffic light. The result of this process is thus a set of candidate centers of traffic lights. In typical use, hundreds if not thousands of candidates may result, including not only the actual traffic lights of interest but other features to be treated as artifacts, such as vehicular lights and pedestrian signals, as well.
The next step is to prune 430 the set of candidate centers to reduce the amount of subsequent processing required. If two candidates are in virtually identical positions with one appearing to be better than the other (in that the value of h(x,y) is higher), the worse candidate is pruned.
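One simple way to implement this pruning is a greedy pass over the candidates in descending order of h, sketched below; the distance below which two positions count as "virtually identical" is an assumed parameter.

```python
def prune_candidates(candidates, min_sep):
    # candidates: iterable of (x, y, h) triples; min_sep: pixel
    # distance below which two candidates share a position.
    kept = []
    for x, y, score in sorted(candidates, key=lambda c: -c[2]):
        if all((x - kx) ** 2 + (y - ky) ** 2 >= min_sep ** 2
               for kx, ky, _ in kept):
            kept.append((x, y, score))
    return kept  # the better of any near-identical pair survives
```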
In the next stage of analysis, the image is converted in one or more separate ways into a Boolean representation, where each pixel is either on or off. One such manner is to convert an original image to one in which traffic light colored pixels are on and everything else is off.
A second possible manner in which the image can be converted to a Boolean representation is by using the Canny transform as detailed in Canny, J. (1986), "A Computational Approach to Edge Detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6), pp. 679-698, the contents of which are incorporated by reference as if fully set forth herein. The Canny transform is designed to find edges in an image, and in this alternative manner provides a second Boolean representation.
The system 100 now searches 440 for circles that appear in all of the Boolean representations. To make this quantitative, suppose that ρ is the probability that a pixel is on in an image. Considering a circle centered at a point c and of radius r, there will be approximately 2πr pixels associated with that circle. If the Boolean image were random, it would be expected that 2πrρ of those pixels would be on. If the circle is actually present in the image, many more will be. The system performs a probabilistic analysis to determine the probability that a circle observed in the image would appear randomly, and then defines the score of the circle to be the negative of the logarithm of that probability. In general, circles would not show up at all if the image were random, so the probabilities involved are quite small and a higher score thus corresponds to a better circle. The system combines weighted scores from the various Boolean representations to get an overall score for each possible circle under consideration. In one implementation, a color-thresholded image is weighted 60% and a Canny image is weighted 40% to get the overall score. The final set of circles is then pruned 450 so that if two circles overlap, only the one with the highest score is kept. This finally produces a highly reduced set of candidates that are ranked 460 based on their scores.
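The scoring just described can be sketched as follows, modeling the on-pixels of a random image as independent with probability ρ so that the on-count along a circle is binomial; the binomial model, the sampling density and the function names are assumptions for illustration.

```python
import math
import numpy as np
from scipy.stats import binom

def circle_score(mask, cx, cy, r, rho):
    # mask: one Boolean representation of the image; rho: fraction
    # of pixels that are on in that representation overall.
    n = max(int(round(2 * math.pi * r)), 8)   # ~2*pi*r circle pixels
    theta = np.linspace(0.0, 2.0 * math.pi, n, endpoint=False)
    xs = np.clip(np.rint(cx + r * np.cos(theta)).astype(int),
                 0, mask.shape[1] - 1)
    ys = np.clip(np.rint(cy + r * np.sin(theta)).astype(int),
                 0, mask.shape[0] - 1)
    k = int(mask[ys, xs].sum())               # on-pixels observed
    p_chance = binom.sf(k - 1, n, rho)        # P(X >= k) by chance
    return -math.log(max(p_chance, 1e-300))   # higher = better circle

def overall_score(color_img, canny_img, cx, cy, r, rho_color, rho_edge):
    # 60/40 weighting of the two Boolean representations, as above.
    return (0.6 * circle_score(color_img, cx, cy, r, rho_color)
            + 0.4 * circle_score(canny_img, cx, cy, r, rho_edge))
```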
Because this second phase need consider only perhaps hundreds or thousands of possible centers out of 8 million pixels in the original image, virtually all of the computation time needed by this approach is consumed by the identification of the candidate centers themselves. When computing a Canny representation, for example, the system 100 needs only to look in a small neighborhood of the points found previously, minimizing the time needed for this portion of the computation. In practice, each candidate thus identified as a potential traffic light is associated with its score. In one embodiment particularly suitable for tuning a system (e.g., identifying false negatives and false positives), the score is displayed as an overlay on a displayed video image just below the portion thought to be a traffic light, with the highest score in each frame displayed in green, the second highest in blue, and all other possible lights in red.
In testing, it is found that an embodiment as described herein generates no perceptible false negatives (i.e., every traffic light is identified as soon as a human viewing the image can also detect the light). The detection distances are sufficient to allow an autonomous vehicle to make appropriate decisions regarding speed or other considerations.
In practice, it is found that the embodiment described herein does generate false positives, typically taillights or directional blinkers on other vehicles. Such false positives, however, are readily identified using existing known methods for determining that such features are not traffic lights. These methods include realizing that the lights are only a few feet off the ground, that they are moving, and that they are in the same location as an automobile or other vehicle. Good automated real-time systems for identifying other vehicles on the road already exist, and the 3D mapping module 320 described herein can draw on such systems to exclude these features from consideration.
In practice, there may be multiple traffic lights visible in any particular image, and for many applications it does not suffice to simply determine whether or not a light is present; identifying the specific light relevant to the vehicle's direction of travel is also a concern. In most applications, identifying all of the traffic lights in an image is not nearly as important as identifying at least one light relevant to the vehicle's direction of travel. Frame-to-frame differences can be used to help identify traffic lights (as used by the "local" thread analyzing subimages in the neighborhood of lights found in previous frames) or remove false positives (as discussed elsewhere herein).
In practical application, the question of whether an image contains a light is somewhat ambiguous. If a detection system is designed to identify lights that are at least 4 pixels in radius, an image with a 3-pixel light should not count as a positive. But it may be desirable to treat an instance with a 4-pixel light as legitimately ambiguous, since a variety of edge effects make it difficult to define the exact size of an object in an image.
In order to address these concerns, in some embodiments the system 100 uses a fundamental “success” metric, i.e., the correct identification of at least one light in any particular image corresponding to the direction of travel of the vehicle in question. The system considers an image to be a positive instance if there is a light in the direction of travel that is at least 4 pixels in radius and at least one light radius away from the edge of the image. The system considers an image to be a negative instance if there is no light larger than 3 pixels in radius visible in the image. Note that some images are simply not considered, since images with lights near the borders cannot in general be expected to be classified correctly by the techniques described herein, and lights between 3 and 4 pixels in size might or might not be classified correctly. Note also that because of the high resolution of the images used and correspondingly wide field of view, the image border is typically a much smaller fraction of the overall image than it is in conventional vehicular camera systems.
Accordingly, a positive instance is considered to be correctly classified (a true positive, in conventional terms) if there is at least one object in the image identified as a traffic light, and if the object with the highest score is indeed a traffic light relevant to the current direction of travel. It is considered incorrectly classified (a false negative) if no traffic light is found. It is considered misclassified (for which there is no conventional analog) if the object identified as the most likely traffic light is not a traffic signal in the current direction of travel.
A negative instance will be said to be correctly classified (a true negative) if either there is no traffic light identified in the image, or a light is correctly identified even though it is smaller than 3 pixels in radius. It will be said to be incorrectly classified (a false positive) if an object other than a traffic light is identified as such in the image.
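The taxonomy above can be made concrete in a few lines. In this sketch, the relevant_light predicate (is the object a traffic light in the current direction of travel?) and the actual_light predicate (is the object any real traffic light?) are assumed to come from ground-truth annotations and are not part of the described system.

```python
def classify_instance(is_positive, detections, relevant_light, actual_light):
    # detections: list of (obj, score) pairs reported for one image.
    if is_positive:
        if not detections:
            return "false negative"
        top = max(detections, key=lambda d: d[1])[0]
        return "true positive" if relevant_light(top) else "misclassified"
    if not detections:
        return "true negative"
    top = max(detections, key=lambda d: d[1])[0]
    # A real light correctly found, even one under 3 pixels in radius,
    # still counts as a true negative per the definition above.
    return "true negative" if actual_light(top) else "false positive"
```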
In testing, a system as described herein was found to generate “apparent” false negatives on approximately 5% of sample images, virtually all of which occurred with the vehicle stopped at a red traffic light (with numerous other red brake lights of other vehicles in the image as well). The vehicle lights, which were false positives, scored higher than the actual traffic light and consumed all of the available “slots” for possible lights, thus resulting in what appears to be a false negative with respect to the actual traffic light. In various embodiments, such brake lights and other artifacts are eliminated from consideration 470, as detailed below.
In various embodiments such issues are addressed in two ways. One is to realize that at some level, the errors being made are ones that do not matter. More specifically, the most dangerous false positives are those where a green light is “seen” even though none exists. The most dangerous false negatives are those where a red light exists but is not noticed.
An alternative is to eliminate such problems directly. First, by coupling the observations made from frame to frame with GPS information regarding the location of the vehicle, an embodiment determines when a particular "light" is in fact moving; such lights are obviously not traffic lights, and are eliminated from analysis. Second, accurate information regarding the location and position of the camera is used in some embodiments to automatically exclude "lights" that are in impossible locations (not near a road, too close to the ground, etc.).
The false positives identified in testing in general corresponded to lights on other vehicles. In some embodiments these are eliminated from analysis by recognizing that those vehicles are either moving or too low to be traffic lights. In some embodiments pedestrian signals (e.g., 205) and other artifacts are likewise able to be removed from consideration. Static positional analysis can remove a number of such artifacts.
Another technique, used alternatively or additionally in some embodiments, employs trajectory analysis (via module 330) to determine whether a candidate light occupies a fixed position in three-dimensional space across a sequence of frames.
Based on specific applications, some embodiments use one or more methods to implement such processing. Spatial information (from multiple cameras) and temporal information (from multiple frames) can be used to help determine the location and size of a traffic light in order to position the light in three-dimensional space. Using the temporal approach as an example, given a sequence of such locations, the system can find the most likely fixed location for the object in question. A challenge with this approach is that it is often difficult to determine the precise radius of an object in the image, which can lead to significant errors in depth of field (and therefore in positioning generally).
A more accurate approach is to start with a notional position and velocity of a traffic light and to construct an image sequence from that. For a specific presumed position and velocity, the image sequence constructed can then be compared to the images as actually observed. A position and velocity can then be found that jointly minimize the disparity between the images as predicted and observed. In practice, this optimization appears to avoid the difficulty in ranging described in the previous paragraph because large differences in range may correspond to small differences in the size of an object in the image. Small differences of this type will therefore have only minimal impact on the optimization being undertaken. In some embodiments, Levenberg-Marquardt optimization, as described in Moré, J. (1978), The Levenberg-Marquardt algorithm: Implementation and theory, Numerical analysis, 105-116, the contents of which are incorporated by reference as if fully set forth herein, is used to effect the minimization described in this paragraph, thereby computing the likely initial position and velocity of an identified object in a sequence of images.
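A sketch of this optimization follows, using SciPy's least-squares solver with its Levenberg-Marquardt method. The project camera model (mapping a world point to image coordinates) is a hypothetical calibrated helper, and the state layout is an assumption for illustration.

```python
import numpy as np
from scipy.optimize import least_squares

def fit_position_velocity(obs_uv, times, project, x0):
    # obs_uv: (n, 2) observed image locations of one candidate light;
    # times: (n,) frame capture times; x0: initial guess for the
    # state [px, py, pz, vx, vy, vz] in world coordinates.
    def residuals(x):
        pos, vel = x[:3], x[3:]
        pred = np.array([project(pos + t * vel) for t in times])
        return (pred - np.asarray(obs_uv)).ravel()  # image disparity
    sol = least_squares(residuals, x0, method="lm")
    return sol.x[:3], sol.x[3:]  # likely initial position and velocity
```

A candidate whose fitted velocity is materially nonzero can then be rejected as a vehicle light rather than a fixed signal, complementing the motion tests described above.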
In some embodiments, such methods as described herein are variously combined with existing methods, for instance those using feeds of real-time traffic light data, as may be warranted in different applications to eliminate, or at least reduce, artifacts in identifying traffic lights. For example, such methods may include, in various applications, those set forth in Ginsberg, M. (2016), "Traffic Signals and Autonomous Vehicles: Vision-based or a V2I Approach?", Intelligent Transportation Systems, ITSA-16, San Jose, Calif., the contents of which are incorporated by reference as if fully set forth herein.
Thus, both the 3D mapping module 320 and the trajectory analysis module 330 are used, either independently or together, to resolve possible artifacts that might otherwise be considered as possible traffic lights.
In the embodiment shown in
The types of computers used by the entities of
Some portions of the above description refer to the embodiments in terms of algorithmic processes or operations. These algorithmic descriptions and representations are commonly used by those skilled in the data processing arts to convey the substance of their work effectively to others skilled in the art. These operations are understood to be implemented by hardware systems or subsystems. One of skill in the art will recognize alternative approaches to provide the functionality described herein.
As used herein any reference to “one embodiment” or “an embodiment” means that a particular element, feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment.
Some embodiments may be described using the expression “coupled” and “connected” along with their derivatives. It should be understood that these terms are not intended as synonyms for each other. For example, some embodiments may be described using the term “connected” to indicate that two or more elements are in direct physical or electrical contact with each other. In another example, some embodiments may be described using the term “coupled” to indicate that two or more elements are in direct physical or electrical contact. The term “coupled,” however, may also mean that two or more elements are not in direct contact with each other, but yet still co-operate or interact with each other. The embodiments are not limited in this context.
As used herein, the terms “comprises,” “comprising,” “includes,” “including,” “has,” “having” or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, article, or apparatus that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Further, unless expressly stated to the contrary, “or” refers to an inclusive or and not to an exclusive or. For example, a condition A or B is satisfied by any one of the following: A is true (or present) and B is false (or not present), A is false (or not present) and B is true (or present), and both A and B are true (or present).
In addition, use of the “a” or “an” are employed to describe elements and components of the embodiments herein. This is done merely for convenience and to give a general sense of the disclosure. This description should be read to include one or at least one and the singular also includes the plural unless it is obvious that it is meant otherwise.
Upon reading this disclosure, those of skill in the art will appreciate still additional alternative structural and functional designs for a system and a process for image-based identification of traffic lights. Thus, while particular embodiments and applications have been illustrated and described, it is to be understood that the described subject matter is not limited to the precise construction and components disclosed herein and that various modifications, changes and variations which will be apparent to those skilled in the art may be made in the arrangement, operation and details of the method and apparatus disclosed herein. The scope of the invention is to be limited only by the following claims.
This application is a continuation in part of U.S. patent application Ser. No. 14/820,345, filed Aug. 6, 2015, which claims the benefit of U.S. Provisional Application No. 62/106,146, filed Jan. 21, 2015, both of which are incorporated herein by reference in their entirety.
Related U.S. Application Data
Provisional application No. 62/106,146, filed January 2015 (US).
Parent application Ser. No. 14/820,345, filed August 2015 (US); child application Ser. No. 15/473,177 (US).