The present invention relates to analysing crowd congestion using video images and, in particular, but not exclusively, to methods and systems for analysing crowd congestion in confined spaces such as, for example, on train station platforms.
There are generally two approaches to behaviour analysis in computer vision-based dynamic scene analysis and understanding. The first approach is the so-called “object-based” detection and tracking approach, the subjects of which are individuals or small groups of objects present within the monitored space, be it a person or a car. In this case, the multiple moving objects must first be simultaneously and reliably detected, segmented and tracked despite scene clutter, illumination changes and static and dynamic occlusions. The set of trajectories thus generated is then subjected to further domain model-based spatial-temporal behaviour analysis, using, for example, Bayesian Networks or Hidden Markov Models, to detect any abnormal/normal events or changing trends in the scene.
The second approach is the so-called “non-object-centred” approach, aimed at (high-density) crowd analysis. In contrast with the first approach, the challenges this approach faces are distinctive, since in crowded situations such as normal public spaces (for example, a high street, an underground platform, a train station forecourt or a shopping complex), automatically tracking dozens or even hundreds of objects reliably and consistently over time is difficult, due to insurmountable occlusions, the unconstrained physical space and uncontrolled and changeable environmental and localised illumination. Therefore, novel approaches and techniques are needed to address the specific and general tasks in this domain.
There has been increasing research in crowd analysis in recent years. In [14], for example, a general review is presented of the latest trends and investigative approaches adopted by researchers tackling the domain's issues from different disciplines and motivations. In [2], a non-object-based approach to surveillance scene change detection (segmentation) is proposed to infer the semantic status of a dynamic scene. Event detection in a crowded scene is investigated in [1]. Crowd counting employing various detection-based or matching-based methods is discussed in [3], [4], [6] and [11]. Crowd density estimation is studied in [8], [9], [10] and [12]. In [9] and [10], a Markov Random Field-based approach is applied to an underground monitoring task using a combination of three sources (features/statistical models), resulting in a motion (or change) detection map. This map is then geometrically weighted pixel-wise to provide a translation invariant measure for crowding. The method, however, is computationally intensive, and was not seen to be extensively validated across different environments or complex scenarios in terms of accuracy and robustness; it also has difficulty in choosing a number of critical system parameters for the optimisation of performance. Moreover, the method of Paragios relies on quasi-calibration using knowledge of the height of a train.
By way of example, some particular difficulties in relation to an underground station platform, which can also be found in general scenes of public spaces in perhaps slightly different forms, include:
U.S. Pat. No. 7,139,409 (Paragios et al.) describes a method of real time crowd density estimation using video images. The method applies a Markov Random Field approach to detecting change in a video scene which has been geometrically weighted, pixel by pixel, to provide a translation invariant measure for crowding as people move towards or away from a camera. The method first estimates a background reference frame against which the subsequent video analysis can be enacted.
Embodiments of aspects of the present invention aim to provide an alternative or improved method and system for crowd congestion analysis.
According to a first aspect of the invention there is provided a method of determining crowd congestion in a physical space by automated processing of a video sequence of the space, the method comprising: determining a region of interest in the space; partitioning the region of interest into an irregular array of sub-regions, each comprising a plurality of pixels of video image data; assigning a congestion contributor (or weighting) to each sub-region in the irregular array of sub-regions; determining first spatial-temporal visual features within the region of interest and, for each sub-region, computing a metric based on the said features indicating whether or not the sub-region is dynamically congested; determining second spatial-temporal visual features within the region of interest and, for each sub-region that is not indicated as being dynamically congested, computing a metric based on the said features indicating whether or not the sub-region is statically congested; generating an indication of an overall measure of congestion for the region of interest on the basis of the metrics for the dynamically and statically congested sub-regions and their respective congestion contributors (or weightings).
According to a second aspect of the invention, there is provided a crowd analysis system comprising: an imaging device for generating images of a physical space; and a processor, wherein, for a given region of interest in images of the space, the processor is arranged to: partition the region of interest into an irregular array of sub-regions, each comprising a plurality of pixels of video image data; assign a congestion contributor (or weighting) to each sub-region in the irregular array of sub-regions; determine first spatial-temporal visual features within the region of interest and, for each sub-region, compute a metric based on the said features indicating whether or not the sub-region is dynamically congested; determine second spatial-temporal visual features within the region of interest and, for each sub-region that is not indicated as being dynamically congested, compute a metric based on the said features indicating whether or not the sub-region is statically congested; generate an indication of an overall measure of congestion for the region of interest on the basis of the metrics for the dynamically and statically congested sub-regions and their respective congestion contributors (or weightings).
Dividing the region of interest into an irregular array of sub-regions enables the computational efficiency needed for real-time processing to be carried out, even when merely using a low-cost PC. Also, dealing with locally adaptive “blobs” rather than individual pixels, as used by Paragios, offers many advantages, not least of which is computational efficiency.
Further features and advantages of the invention will become apparent from the following description of preferred embodiments of the invention, given by way of example only, which is made with reference to the accompanying drawings.
a is an image of an underground train platform and
a illustrates a partitioned region of interest on a ground plane—with relatively small, uniform sub-regions—and
a illustrates a partitioned region of interest on a ground plane—with relatively large, uniform sub-regions—and
a exemplifies a non-uniformly partitioned region of interest on a ground plane and
FIGS. 10a, 10b and 10c show, respectively, an image of an exemplary train platform, a detected foreground image indicating areas of meaningful movement within the region of interest (not shown) of the same image and the region of interest highlighting dynamic, static and vacant sub-regions;
FIGS. 11a, 11b and 11c respectively show an image of a moderately well-populated train platform, a region of interest highlighting dynamic, static and vacant sub-regions and a detected pixels mask image highlighting globally congested areas within the same image;
FIGS. 12a, 12b and 12c respectively show an image of another sparsely populated train platform, a region of interest highlighting dynamic, static and vacant sub-regions and a detected pixels mask image highlighting globally congested areas within the same image;
FIGS. 13a, 13b and 13c respectively show an image of a crowded train platform, a region of interest highlighting dynamic, static and vacant sub-regions and a detected pixels mask image highlighting globally congested areas within the same image;
FIGS. 14a and 14b are images which show one crowded platform scene with (in
FIGS. 14c and 14d are images which show another crowded platform scene with (in
FIGS. 15a and 15b illustrate one way of weighting sub-regions for train detection according to embodiments of the present invention;
FIGS. 16a-16c and 17a-17c are images of two platforms, respectively, in various states of congestion, either with or without a train presence, including a train track region of interest highlighted thereon;
FIGS. 18a and 18b are images of one platform and
a relating to a second timeframe is a graph plotted against time showing both a train detection curve and a passenger crowding curve and
a relating to a third timeframe is a graph plotted against time showing both a train detection curve and a passenger crowding curve and
Embodiments of aspects of the present invention provide an effective functional system using video analytics algorithms for automated crowd behaviour analysis. Such analysis finds application not only in the context of platform monitoring in a railway station, but more generally anywhere where it is useful or necessary to monitor crowds of people, pedestrians, spectators, etc. When applied to the analysis of crowds on platforms of railway/metro/MRT/underground stations, embodiments of the invention also offer train presence detection. The preferred arrangement is for the embodiments of the invention to operate on live image sequences captured by surveillance video cameras. Analysis can be performed in real-time on a low-cost personal computer (PC) whilst cameras are monitoring real-world, cluttered and busy operational environments. Embodiments of the invention can also be applied to the analysis of recorded or time-delayed video. In particular, preferred embodiments have been designed for use in analysing crowd behaviour on urban underground platforms. Against this background, the challenges to be faced include: diverse, cluttered and changeable environments; sudden changes in illumination due to a combination of sources (for example, train headlights, traffic signals, carriage illumination when calling at a station and spot reflections from polished platform surfaces); and the reuse of existing legacy analogue cameras with unfavourable, relatively low mounting positions and near-to-horizontal orientation angles (causing more severe perspective distortion and object occlusions).
The crowd behaviours targeted include the estimation of platform congestion levels, or crowd density (ranging from almost empty platforms with a few standing or sitting passengers to highly congested situations during peak hour commuter traffic), and the differentiation of dynamic congestion (due to people being in constant motion) from static congestion (due to people being in a motionless state, either standing or sitting on the chairs available). The techniques proposed according to embodiments of the invention offer a unique approach, which has been found to address these challenges effectively. The performance has been demonstrated by extensive experiments on real video collections and prolonged live field trials. Embodiments of the invention also find application in less challenging environments where some or many of the challenges identified above may not arise.
Key principles involved in crowd congestion analysis according to the present embodiments also find application in train detection analysis, in addition to other kinds of object detection and/or analysis. Thus, embodiments of the present invention can be applied to producing meaningful measures of crowd congestion on train platforms and usefully correlating that with train arrivals and departures, as will be described hereinafter.
The analytics PC 105 includes a video analytics engine 115 consisting of real-time video analytic algorithms, which typically execute on the analytics PC in separate threads, with each thread processing one video stream to extract pertinent semantic scene change information, as will be described in more detail below. The analytics PC 105 also includes various user interfaces 120, for example for an operator to specify regions of interest in a monitored scene, using standard graphics overlay techniques on captured video images.
The video analytics engine 115 may generally include visual feature extraction functions (for example including global vs. local feature extraction), image change characterisation functions, information fusion functions, density estimation functions and automatic learning functions.
An exemplary output of the video analytics engine 115 from a platform 105 may include both XML data, representing the level of scene congestion and other information such as train presence (arrival/departure time) detection, and snapshot images captured at a regular interval, for example every 10 seconds. According to
It will be appreciated that each platform may be monitored by one, or more than one, video camera. It is expected that more-precise congestion measurements can be derived by using plural spatially-separated video cameras on one platform; however, it has been established that high quality results can be achieved by using only one video camera and feed per platform and, for this reason, the following examples are based on using only one video feed.
It has been determined that there are three main difficulties when attempting to use camera sensors and visual-based technology to monitor realistic and cluttered crowd scenes on an operational underground platform, as follows:
These factors typically make it difficult to use traditional, object-based video analysis for scene understanding.
Therefore, embodiments of aspects of the present invention perform visual scene “segmentation” based on relevance analysis on (and fusion of) various automatically computable visual cues and their temporal changes, which characterise crowd movement patterns and reveal a level of congestion in a defined and/or confined physical space.
According to
Congestion analysis according to the present embodiment comprises three distinct operations. A first analysis operation comprises dynamic congestion detection and assessment, which itself comprises two distinct procedures for detecting and assessing scene changes due to local motion activities that contribute to a congestion rating or metric. A second analysis operation comprises static congestion detection and assessment, and a third analysis operation comprises a global scene scatter analysis. The analysis operations will now be described in more detail with reference to
Firstly, in order to detect instantaneous scene dynamics, in block 305 a short-term responsive background (STRB) model, in the form of a pixel-wise Mixture of Gaussians (MoG) model in RGB colour space, is created from an initial segment of live video input from the video camera. This is used to identify foreground pixels in current video frames that undergo certain meaningful motions, which are then used to identify blobs containing dynamic moving objects (in this case passengers). Thereafter, the parameters of the model are updated by the block 305 to reflect short-term environmental changes. More particularly, foreground (moving) pixels are first detected by a background subtraction procedure involving comparing, on a pixel-wise basis, a current colour video frame with the STRB. The pixels then undergo further processing steps, for example including speckle noise detection, shadow and highlight removal, and morphological filtering, by block 310, thereby resulting in reliable foreground region detection [5], [13]. For each partition blob within the ROI, an occupancy ratio of foreground pixels relative to the blob area is computed in a block 315, which occupancy ratio is then used by block 320 to decide on the blob's dynamic congestion candidacy.
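By way of a simplified illustration of blocks 305 to 315, the following sketch substitutes a single per-pixel running Gaussian for the full Mixture-of-Gaussians model and omits the noise, shadow and morphological post-processing of block 310; the learning rate and threshold values are illustrative only and are not taken from the present description:

```python
import numpy as np

def update_background(mean, var, frame, alpha=0.05):
    """Running per-pixel Gaussian background update (a single-Gaussian
    simplification of the STRB Mixture-of-Gaussians model)."""
    diff = frame - mean
    mean = mean + alpha * diff
    var = (1 - alpha) * var + alpha * diff ** 2
    return mean, var

def foreground_mask(mean, var, frame, k=2.5):
    """A pixel is foreground if it deviates from the background mean by
    more than k standard deviations."""
    return np.abs(frame - mean) > k * np.sqrt(var)

def occupancy_ratio(mask, blob):
    """Ratio of foreground pixels within a rectangular blob (x0, y0, x1, y1),
    as computed by block 315 for the dynamic congestion candidacy decision."""
    x0, y0, x1, y1 = blob
    patch = mask[y0:y1, x0:x1]
    return patch.sum() / patch.size
```

In use, the mask would be computed for each incoming frame and the occupancy ratio of each blob compared against a threshold by the logic of block 320.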
Secondly, in order to cope with likely sudden uniform or global lighting changes in the scene, the intensity difference of two consecutive frames is computed in block 325 and, for a given blob, the variance of the differenced pixels inside it is computed in block 330, which is then used by block 320 to confirm the blob's dynamic congestion status: namely, ‘yes’, with its weighted congestion contribution, or ‘no’, with zero congestion contribution.
Due to the intrinsic unpredictability of a dynamic scene, so-called “zero-motion” objects can exist, which undergo little or no motion over a relatively long period of time. In the case of an underground station scenario, for example, “zero-motion” objects can describe individuals or groups of people who enter the platform and then stay in the same standing or seated position whilst waiting for the train to arrive.
In order to detect such zero-motion objects, a long-term stationary background (LTSB) model that reflects an almost passenger-free environment of the scene is generated by a block 335. This model is typically created initially (during a time when no passengers are present) and subsequently maintained, or updated selectively, on a blob by blob basis, by a block 340. When a blob is not detected as a congested blob in the course of the dynamic analysis above, a comparison of the blob in a current video frame is made with the corresponding blob in the LTSB model, by a block 345, using a selected visual feature representation to decide on the blob's static congestion candidacy. In addition, a further analysis, by the same block 345, on the variance of the differenced pixels is used to confirm the blob's static congestion status with its weighted congestion contribution. Finally, the maintenance of the LTSB model in the ROI is performed on a blob by blob basis by the block 350. In general, if a blob, after the above cascaded processing steps, is not considered to be congested for a number of frames, then it is updated using a low-pass filter in a known way.
In contrast with the above blob-based (localised) scene analysis, the first step of this operation, carried out by a block 355, is a global scene characterisation measure introduced to differentiate between different crowd distributions that tend to occur in the scene. In particular, the analysis can distinguish between a crowd that is tightly concentrated and a crowd that is largely scattered over the ROI. It has been shown that, while not essential, this analysis step is able to compensate for certain biases of the previous two operations, as will be described in more detail below.
The next step according to
The algorithms applied by the analytics engine 115 will now be described in further detail.
The image in
Given the estimated homography, a density map for the ROI can be computed, or a weight assigned to each pixel within the ROI of the image plane, which accounts for the camera's perspective projection distortion [4]. The weight w_i attached to the ith pixel after normalisation can be obtained as: where the square area centred on (x, y) in the ground plane in
Having defined the ROI and applied weights to the pixels, a non-uniform partition of the ROI into a number of image blobs can be automatically carried out, after which each blob is assigned a single weight. The method of partitioning the ROI into blobs and two typical ways of assigning weights to blobs are described below.
Uniform ROI partitions will now be described by way of an introduction to generating a non-uniform partition.
The first step in generating a uniform partition, is to divide the ground plane into an array of relatively small uniform blobs (or sub-regions), which are then mapped to the image plane using the estimated homography.
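By way of illustration only, the mapping of ground-plane blob corners to the image plane via an estimated homography can be sketched as follows; the 3×3 matrix H is assumed to have been estimated beforehand (for example from four point correspondences) and is not derived here:

```python
import numpy as np

def apply_homography(H, points):
    """Map (N, 2) ground-plane points to the image plane using a 3x3
    homography H, with the usual homogeneous-coordinate normalisation."""
    pts = np.hstack([points, np.ones((len(points), 1))])
    mapped = pts @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

def map_blob(H, x, y, w, h):
    """Map the four corners of a ground-plane blob (sub-region) to the
    image plane, giving the blob's image-plane footprint."""
    corners = np.array([[x, y], [x + w, y], [x + w, y + h], [x, y + h]],
                       dtype=float)
    return apply_homography(H, corners)
```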
In a crowd congestion estimation problem, any blob which is too big or too small causes processing problems: a small blob cannot accommodate sufficient image data to ensure reliable feature extraction and representation; and a large blob tends to introduce too much decision error. For example, a large blob which is only partially congested may still end up being considered as fully congested, even if only a small portion of it is occupied or moving, as will be discussed below.
a shows another exemplary uniform partition using an array of relatively large uniform blobs on a ground plane and the image in
It can be observed from
Assuming w_S and h_S are the width and height of the blobs for a uniform partition (for example, that described in
In step 830, if more blobs are required to fill the array of blobs, the next blob starting point is identified as (x + w_I + 1, y) in step 835, and the process iterates to step 805 to calculate the next respective blob area. If no more blobs are required, then the process ends at step 830.
In practice, according to the present embodiment, blobs are defined a row at a time, starting from the top left hand corner, populating the row from left to right and then starting at the left hand side of the next row down. Within each row, according to the present embodiment, the blobs have an equal height. For the first blob in each row, both the height and width of the ground plane blob are increased in the iteration process. For the rest of the blobs on the same row, only the width is changed whilst keeping the same height as the first blob in the row. Of course, other ways of arranging blobs can be envisaged in which blobs in the same row (or when no rows are defined as such) do not have equal heights. The key issue when assigning blob size is to ensure that there are a sufficient number of pixels in an appropriate distribution to enable relatively accurate feature analysis and determination. The skilled person would be able to carry out analyses using different sizes and arrangements of blobs and determine optimal sizes and arrangements thereof without undue experimentation. Indeed, on the basis of the present description, the skilled person would be able to select appropriate blob sizes and placements for different kinds of situation, different placements of camera and different platform configurations.
Regarding assigning a weighting to each blob, which has a modified width and height, w_I and h_I respectively, there are typically two ways of achieving this.
A first way of assigning a blob weight is to consider that a uniform partition of the ground plane (that is, an array of blobs of equal size) gives each blob an equal weight proportional to its size (w_S × h_S); the changes in blob size made above then result in the new blob assuming a weight of (w_I × h_I)/(w_S × h_S).
An alternative way of assigning a blob weight is to accumulate the normalised weights for all the pixels falling within the new blob; wherein the pixel weights were calculated using the homography, as described above.
According to the present embodiment, an exception to the process for assigning blob size occurs when the next blob in a row cannot obtain the minimum size required within the ROI, because it is next to the border of the ROI in the ground plane. In such cases, the under-sized blob is joined with the previous blob in the row to form a larger one, and the corresponding combined blob in the image plane is recalculated. Again, there are various other ways of dealing with the situation in which a final blob in a row is too small. For example, the blob may simply be ignored, or it could be combined with blobs in a row above or below; or any mixture of different ways could be used.
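The row-wise partitioning procedure, including the joining of a trailing under-sized blob, can be sketched in a simplified one-dimensional form as follows; the function img_area, standing in for the homography-based area computation, and the minimum pixel count are assumptions introduced for illustration only:

```python
def partition_row(row_y, row_h, x_start, x_end, img_area, min_pixels):
    """Greedily grow each blob's ground-plane width until its image-plane
    area reaches min_pixels; a trailing under-sized blob is merged into
    its predecessor, mirroring the exception described above.
    img_area(x, y, w, h) returns the image-plane pixel area of a
    ground-plane rectangle (a stand-in for the homography mapping)."""
    blobs, x = [], x_start
    while x < x_end:
        w = 1
        while x + w < x_end and img_area(x, row_y, w, row_h) < min_pixels:
            w += 1
        if img_area(x, row_y, w, row_h) < min_pixels and blobs:
            # trailing under-sized blob: join it with the previous blob
            px, pw = blobs.pop()
            blobs.append((px, x_end - px))
            break
        blobs.append((x, w))
        x += w
    return blobs
```

With a uniform area function every blob ends up the same size; with a perspective-distorted area function, blobs further from the camera grow wider on the ground plane.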
The diagram in
The image in
As mentioned above in connection with
As has been described, an efficient scheme is employed to identify foreground pixels in the current video frames that undergo certain meaningful motions, which are then used to identify blobs containing dynamic moving objects (pedestrian passengers). Once the foreground pixels are detected, for each blob b_k the ratio R_k^f between the number of foreground pixels and the blob's total size is calculated. If this ratio is higher than a threshold value τ_f, then blob b_k is considered as containing possible dynamic congestion. However, sudden illumination changes (for example, the headlight of an approaching train or changes in traffic signal lights) can increase the number of foreground pixels within a blob. In order to deal with these effects, a secondary measure V_k^d is taken, which first computes the consecutive frame difference of grey level images, on F(t) and its preceding frame F(t−1), and then derives the variance of the difference image with respect to each blob b_k. The variance value due to illumination variation is generally lower than that caused by object motion since, as far as a single blob is concerned, illumination changes are considered to have a global effect. Therefore, according to the present embodiment, blob b_k is considered as dynamically congested, and will contribute to the overall scene congestion at the time, if, and only if, both of the following conditions are satisfied, that is:
R_k^f > τ_f and V_k^d > τ_mv, (3)
where τ_mv is a suitably chosen threshold value for the variance metric. The set of dynamically congested blobs is thereafter noted as B_D.
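A sketch of the dual-condition test of Equation (3) for a single rectangular blob might look as follows; the threshold values tau_f and tau_mv are illustrative only and would need tuning for a real deployment:

```python
import numpy as np

def is_dynamically_congested(fg_mask, frame, prev_frame, blob,
                             tau_f=0.3, tau_mv=25.0):
    """Blob b_k is dynamically congested iff R_k^f > tau_f AND
    V_k^d > tau_mv, following condition (3)."""
    x0, y0, x1, y1 = blob
    patch = fg_mask[y0:y1, x0:x1]
    r_kf = patch.sum() / patch.size          # foreground occupancy ratio
    diff = frame[y0:y1, x0:x1].astype(float) - prev_frame[y0:y1, x0:x1]
    v_kd = diff.var()                        # frame-difference variance
    return bool(r_kf > tau_f and v_kd > tau_mv)
```

A genuinely moving object produces a high-variance difference inside the blob, whereas a sudden global lighting change shifts all pixels nearly uniformly, giving a low variance that fails the second condition.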
A significant advantage of this blob-based analysis method over a global approach is that, even if some pixels are wrongly identified as foreground pixels, the overall number of foreground pixels within a blob may not be enough to make the ratio R_k^f exceed the given threshold. This renders the technique more robust to noise disturbance and illumination changes. The scenario illustrated in
a is a sample video frame image of a platform which is sparsely populated but including both moving and static passengers.
Regarding zero-motion regions, there are normally two causes for an existing dynamically congested blob to lose its ‘dynamic’ status: either the dynamic object moves away from that blob, or the object stays motionless in that blob for a while. In the latter case, the blob becomes a so-called “zero-motion”, or statically congested, blob. Detecting this type of congestion successfully is very important in sites such as underground station platforms, where waiting passengers often stand motionless or decide to sit down in the chairs available.
If, on a frame by frame basis, any dynamically congested blob b_k becomes non-congested, it is then subjected to a further test, as it may be a statically congested blob. One method that can be used to perform this analysis effectively is to compare the blob with its corresponding one from the LTSB model. A number of global and local visual features can be experimented with for this blob-based comparison, including the colour histogram, colour layout descriptor, colour structure, dominant colour, edge histogram, homogeneous texture descriptor and SIFT descriptor.
After a comparative study, the MPEG-7 colour layout (CL) descriptor has been found to be particularly efficient at identifying statically congested blobs, due to its good discriminating power and its relatively low computational overhead. In addition, a second measure, of the variance of the pixel difference, can be used to handle illumination variations, as has already been discussed above in relation to dynamic congestion determinations.
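As a simplified stand-in for the MPEG-7 colour layout comparison (the real descriptor applies a DCT to an 8×8 grid of average colours and retains a few low-frequency coefficients, which is not reproduced here), a grid of cell-average colours compared by city block distance can be sketched as:

```python
import numpy as np

def colour_layout(patch, grid=4):
    """Simplified colour-layout-style descriptor: the average colour over
    a grid x grid partition of the blob. The full MPEG-7 descriptor would
    DCT-transform this grid and keep a few coefficients."""
    h, w, _ = patch.shape
    desc = []
    for gy in range(grid):
        for gx in range(grid):
            cell = patch[gy * h // grid:(gy + 1) * h // grid,
                         gx * w // grid:(gx + 1) * w // grid]
            desc.extend(cell.reshape(-1, 3).mean(axis=0))
    return np.array(desc)

def city_block(d1, d2):
    """City block (L1) distance between two descriptors."""
    return np.abs(d1 - d2).sum()
```

A blob whose descriptor distance from its LTSB counterpart exceeds a threshold would then become a statically congested candidate.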
According to this method, the ‘city block distance’ in colour layout descriptors, d_CL^s, is computed between blob b_k in the current frame and its counterpart in the LTSB model. If the distance value is higher than a threshold τ_cl, then blob b_k is considered as a statically congested blob candidate. However, as in the case of dynamic congestion analysis, sudden illumination changes can cause a false detection. Therefore, to be sure, the variance V_k^s of the pixel difference in blob b_k between the current frame and the LTSB model is used as a secondary measure. Thus, according to the present embodiment, blob b_k is declared as statically congested, and will contribute to the overall scene congestion rating, if and only if the following two conditions are satisfied:
d_CL^s > τ_cl and V_k^s > τ_sv, (4)
where τ_sv is a suitably chosen threshold. The set of statically congested blobs is thereafter noted as B_S. As already indicated,
A method for maintaining the LTSB model will now be described. Maintenance of the LTSB is required to take account of slow and subtle changes that may happen to the captured background scene over a longer-term basis (a day, week or month), caused by internal lighting properties drifting, etc. The LTSB model used should therefore be updated in a continuous manner. Indeed, for any blob b_k that has been free from (dynamic or static) congestion continuously for a significant period of time (for example, 2 minutes), its corresponding LTSB blob is updated using a linear model, as follows.
If N_f frames are processed over the defined time period, then, for a pixel i ∈ b_k, its mean intensity M_i^x and variance V_i^x, or (σ_i^x)^2, for each colour band x ∈ (R,G,B), are calculated as:
M_i^x = (1/N_f) Σ_t I_i^x(t) and V_i^x = (1/N_f) Σ_t (I_i^x(t))^2 − (M_i^x)^2.
Next, according to the present embodiment, if, for i ∈ b_k, the condition σ_i^x < τ_lv, x ∈ (R,G,B), is satisfied for at least 95% of the pixels within blob b_k, then the corresponding pixels I_i^BG in the LTSB model will be updated as:
I_i^BG,x = α × M_i^x + (1 − α) × I_i^BG,x, x ∈ (R,G,B) (6)
where α = 0.01. For the remaining pixels within blob b_k that fail to meet the condition, the corresponding ones in the LTSB model will not be changed.
Note that in the above processing, the counts for non-congested blobs are reset to zero whenever an update is made or a congested case is detected. In practice, the pixel intensity value and the squared intensity value (for each colour band) are accumulated with each incoming frame to ease the computational load.
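A sketch of the blob-wise LTSB update of Equation (6), working from the accumulated intensity and squared-intensity sums mentioned above, might look as follows; the stability threshold τ_lv, the 95% quorum and α = 0.01 follow the description, while other details (array layout, thresholds' numeric defaults) are illustrative:

```python
import numpy as np

def update_ltsb(ltsb, frame_sum, frame_sq_sum, n_frames,
                blob, tau_lv=10.0, alpha=0.01, quorum=0.95):
    """Update the LTSB model for one blob from accumulated per-pixel sums.
    Only pixels whose standard deviation stayed below tau_lv are blended
    in (equation (6)), and only if at least `quorum` of the blob's
    pixel/band values are that stable."""
    x0, y0, x1, y1 = blob
    mean = frame_sum[y0:y1, x0:x1] / n_frames
    var = frame_sq_sum[y0:y1, x0:x1] / n_frames - mean ** 2
    sigma = np.sqrt(np.maximum(var, 0.0))
    stable = sigma < tau_lv                  # per pixel, per colour band
    if stable.mean() >= quorum:
        patch = ltsb[y0:y1, x0:x1]
        ltsb[y0:y1, x0:x1] = np.where(
            stable, alpha * mean + (1 - alpha) * patch, patch)
    return ltsb
```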
Accordingly, an aggregated scene congestion rating can be estimated by adding the congestion contributions associated with all the (dynamically and statically) congested blobs. Given a total number of N_b blobs for the ROI, the aggregated congestion (TotalC) can be expressed as:
TotalC = Σ_{b_k ∈ B_D ∪ B_S} C_k,
where C_k is the congestion weighting associated with blob b_k, given previously in Equation (2).
It has been found that the blob-based visual scene analysis approach discussed so far is very effective and consistent in dealing with high and low crowd congestion situations on underground platforms. However, one observation has emerged after many hours of testing on live video data: the approach tends to give a higher congestion level value when people are scattered around the platform in a medium congestion situation. This is more often the case when, in the camera's view, the far end of the platform is more crowded than the near end, simply because the blobs at the far end of the platform carry more weight, to account for the perspective nature of the platform's appearance in the videos. To illustrate this,
The main difference between a scattered, or loosely distributed, crowd and a highly congested crowd scene is that there will tend to be more free space between people in the former case as compared to the latter. Since this free space and congested space are evenly distributed over all the blobs, as shown in
In particular, it has been found that a thresholded pixel difference within the ROI, between the current frame and the LTSB model, provides a suitable measure. For example, considering a pixel i ∈ ROI in the current frame, the maximum intensity difference D_i^max compared to its counterpart in the LTSB model across the three colour bands is obtained by:
D_i^max = Max(D_i^R, D_i^G, D_i^B)
If D_i^max > τ_s is satisfied, then pixel i is counted as a ‘congested pixel’, or i ∈ P_c, where τ_s is a suitably chosen threshold.
where 0 ≤ GM < 1.0. As a result, the final congestion (OverallC) for the monitored scene can be computed as:
OverallC=TotalC×f(GM),
where f(.) can be a linear function or a sigmoid function:
and where α=8 has been used according to the present embodiment.
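To make the scatter correction concrete, the following sketch (not the patented implementation) combines the thresholded pixel-difference test against the LTSB model with a sigmoid modulation of TotalC. The threshold value τs=30, the definition of GM as the fraction of congested pixels in the ROI, and the exact sigmoid form 1/(1+exp(−α(GM−0.5))) are assumptions consistent with, but not specified by, the description above.

```python
import numpy as np

def overall_congestion(frame, ltsb, total_c, tau_s=30, alpha=8.0):
    """Sketch of OverallC = TotalC x f(GM) for a monitored ROI.

    frame, ltsb: HxWx3 uint8 arrays (R, G, B) restricted to the ROI.
    tau_s, the GM definition and the sigmoid form are assumptions.
    """
    # Maximum per-pixel intensity difference over the three colour bands
    d_max = np.abs(frame.astype(int) - ltsb.astype(int)).max(axis=2)
    congested = d_max > tau_s               # 'congested pixel' test
    gm = congested.mean()                   # assumed scatter measure GM
    f_gm = 1.0 / (1.0 + np.exp(-alpha * (gm - 0.5)))  # sigmoid f(.)
    return total_c * f_gm
```

With a frame identical to the background model, GM is 0 and the sigmoid strongly suppresses TotalC; as the proportion of congested pixels approaches 1, OverallC approaches TotalC, matching the intended behaviour for genuinely dense scenes.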
Referring again to the example illustrated in
The scene examples in
As already indicated, embodiments of the present invention have been found to be accurate in detecting the presence, and the departure and arrival instants, of a train at a platform. This makes it possible to generate an accurate account of actual train service operational schedules. This is achieved by reliably detecting the characteristic visual feature changes taking place in certain target areas of a scene, for example, in a region of the rail track that is covered or uncovered due to the presence or absence of a train, but not obscured by passengers on a crowded platform. Establishing the presence, absence and movement of a train is also of particular interest in the context of understanding the connection between train movements and crowd congestion level changes on a platform. When presented together with the congestion curve, the results have been found to reveal a close correlation between train calling frequency and changes in the congestion level of the platform. Although the present embodiment relates to passenger crowding and can be applied to train monitoring, it will be appreciated that the proposed approach is generally applicable to a far wider range of dynamic visual monitoring tasks where the detection of object deposit and removal is required.
Unlike for a well-defined platform area, a ROI, according to embodiments of the present invention, in the case of train detection does not have to be non-uniformly partitioned or weighted to account for homography. First, the ROI is selected to comprise a region of the rail track where the train rests whilst calling at the platform. The ROI has to be selected so that it is not obscured by a waiting crowd standing very close to the edge of the platform, thus potentially blocking the camera's view of the rail track.
As indicated, perspective image distortion and homography of the ROI do not need to be factored into a train detection analysis in the same way as for the platform crowding analysis. This is because the purpose is to identify, for a given platform, whether or not there is a train occupying the track, whilst the transient time of the train (from the moment the driver's cockpit approaches the far end of the platform to a full stop, or from the time the train starts moving to its total disappearance from the camera's view) is only a few seconds. Unlike the previous situation, where the estimated crowd congestion level can take any value between 0 and 100, the ‘congestion level’ for the target ‘train track’ conveniently assumes only two values (0 or 100).
In particular, according to embodiments of the invention, the ROI for the train track is firstly divided into uniform blobs of suitable size. If a large portion of a blob, say over 95%, is contained in the specified ROI for train detection, then the blob is incorporated into the calculations and a weight is assigned according to a scale variation model; alternatively, the weight is obtained by multiplying the percentage of the blob's pixels falling within the ROI by the distance between the blob's centre and the side of the image closest to the camera's mounting position. This is shown in
Finally, a global scatter scene analysis is not necessary for train detection as the ‘congestion level’ is always either 0 or 100.
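The uniform partition and weighting for the train-track ROI described above can be sketched as follows. This is an illustrative sketch only: the 95% coverage cut-off comes from the description, while the blob size, the binary ROI mask representation, and the assumption that the camera is mounted at the bottom image edge (so that the distance to the near side is measured from the last row) are made up for the example, which uses the second, distance-based weighting option.

```python
import numpy as np

def train_blob_weights(roi_mask, blob_size=16, min_coverage=0.95):
    """Partition the image into uniform blobs and weight those mostly
    inside the train-detection ROI.

    roi_mask: HxW array of 0/1 values marking the train-track ROI.
    weight = (fraction of blob pixels inside the ROI)
             x (distance from the blob centre to the image side nearest
                the assumed camera mounting position, here the bottom).
    """
    h, w = roi_mask.shape
    weights = {}
    for y in range(0, h - blob_size + 1, blob_size):
        for x in range(0, w - blob_size + 1, blob_size):
            blob = roi_mask[y:y + blob_size, x:x + blob_size]
            coverage = blob.mean()           # fraction inside the ROI
            if coverage >= min_coverage:     # keep blobs >= 95% in ROI
                cy = y + blob_size / 2.0     # blob centre row
                dist = h - cy                # distance to the near side
                weights[(y, x)] = coverage * dist
    return weights
```

Since the train-track ‘congestion level’ only ever takes the values 0 or 100, the weighted blob scores need only be thresholded to decide train presence or absence.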
In embodiments of the invention in which train detection is involved as well as crowd analysis, it will be appreciated that, while train detection using the analysis techniques described herein is extremely convenient, since the entire analysis can be enacted by a single PC and camera arrangement, there are many other ways of detecting trains: for example, using platform or track sensors. Thus, it will be appreciated that embodiments of the present invention which involve train detection are not limited to the train detection techniques described herein.
The video images in
In order to demonstrate the effectiveness and efficiency of embodiments of the present invention for estimating crowd congestion levels and detecting train presence, extensive experiments have been carried out on both highly compressed video recordings (motion JPEG+DivX) and real-time analogue camera feeds from operational underground platforms that are typical of various passenger traffic scenarios and sudden changes of environmental conditions. The algorithms can run in real-time in the analytics computer 105 (in this case, a modern PC, for example, an Intel Xeon dual-core 2.33 GHz CPU and 2.00 GB RAM running the Microsoft Windows XP operating system) simultaneously, with two inputs of either compressed video streams or analogue camera feeds and two output data streams destined for an Internet-connected remote server, while still leaving about half of the computing resources spare. It has been found that the CIF-size video frame (352×288 pixels) is sufficient to provide the necessary spatial resolution and appearance information for automated visual analyses, and that working on the highly compressed video data does not show any noticeable difference in performance as compared to directly grabbed uncompressed video. Details of the scenarios, results of tests and evaluations, and insights into the usefulness of the extracted information are presented below.
The characteristics of the particular video data being studied are described, with regard to two platforms A and B, in Tables 1 and 2 (at the end of this description). In the case of Platform A (Westbound), as illustrated in the images in
Snapshots (A), (B) and (C) in
Snapshots (D), (E) and (F) in
Snapshots (G), (H) and (I) in
Snapshots (J), (K) and (L) in
Snapshots (2), (3) and (4) in
Snapshots (Y), (Z) and (1) in
Snapshots (P), (Q) and (R) in
Snapshots (V), (W) and (X) in
The graph in
By carefully inspecting these results, it is possible to identify several interesting points which illustrate the accurate performance of the approach described in the present embodiment.
First, it is clear that the approach works well across two different camera set-ups, and a variety of different crowd congestion situations, in real-world underground train station operational environments. For train detection, the precision of the detection time has been found to be within about two seconds of actual train appearance or disappearance by visual comparison, and for platform congestion level estimation, the results have been seen to faithfully reflect the actual crowd movement dynamics with the required level of accuracy as compared with experienced human observers.
By drawing the results of congestion level estimation and train presence detection together in the same graph, we are able to gain insights into the different impacts that a train calling at a platform may have on the platform congestion level, considering also that the platform may serve more than one underground line (such as the District Line and the Circle Line in London). At a generally low congestion situation, as shown in
In persistently high level platform congestion situations as depicted in
The algorithms described above contain a number of numerical thresholds used in different stages of the operation. The choice of thresholds has been seen to influence the performance of the proposed approaches and is, thus, important from an implementation and operation point of view. The thresholds can be selected through experimentation and, for the present embodiment, are summarised in Table 3 hereunder.
In summary, aspects of the present invention provide a novel, effective and efficient scheme for visual scene analysis, performing real-time crowd congestion level estimation and concurrent train presence detection. The scheme is operable in real-world operational environments on a single PC. In the exemplary embodiment described, the PC simultaneously processes at least two input data streams from either highly compressed digital videos or direct analogue camera feeds. The embodiment described has been specifically designed to address the practical challenges encountered across urban underground platforms, including: diverse and changeable environments (for example, site space constraints); sudden changes in illumination from several sources (for example, train headlights, traffic signals, carriage illumination when calling at the station, and spot reflections from polished platform surfaces); vastly different crowd movements and behaviours over the course of a day in normal working hours and peak hours (from a few walking pedestrians to an almost fully occupied and congested platform); and reuse of existing legacy analogue cameras with lower mounting positions and close-to-horizontal orientation angles (where such an installation inevitably causes more problematic perspective distortion and object occlusions, and is notably hard for automated video analysis).
Unlike in the prior art, a significant feature of our exemplified approach is the use of a non-uniform, blob-based, hybrid local and global analysis paradigm to provide exceptional flexibility and robustness. The main features are: the choice of a rectangular blob partition of a ROI embedded in the ground plane (in a real-world coordinate system) such that a projected trapezoidal blob in the image plane (the image coordinate system of the camera) is amenable to a series of dynamic processing steps, with a weighting factor applied to each image blob partition to account for geometric distortion (wherein the weighting can be assigned in various ways); the use of a short-term responsive background (STRB) model for blob-based dynamic congestion detection; the use of a long-term stationary background (LTSB) model for blob-based zero-motion (static congestion) detection; the use of global feature analysis for scene scatter characterisation; and the combination of these outputs for an overall scene congestion estimation. In addition, this computational scheme has been adapted to perform the task of detecting a train's presence at a platform, based on the robust detection of scene changes in a certain target area which is substantially altered (covered or uncovered) only by a train calling at the platform.
Extensive experimental studies have been conducted on collections of various representative scenarios from 8 hours of video recordings (4 hours for each platform), as well as real-time field trials over several days of a normal working week. It has been found that the performance of congestion level estimation matches well with experienced observers' estimations and that the accuracy of train detection is almost always within a few seconds of actual visual detection.
Finally, it should be pointed out that although the main focus of this paper is the investigation of video analytics for monitoring underground platforms, the approaches introduced are equally applicable to automated monitoring and analysis of any public space (indoor or outdoor) where collectively understanding crowd movements and behaviours is of particular interest for crime prevention and detection, business intelligence gathering, operational efficiency, and health and safety management purposes, among others.
The above embodiments are to be understood as illustrative examples of the invention. It is to be understood that any feature described in relation to any one embodiment may be used alone, or in combination with other features described, and may also be used in combination with one or more features of any other of the embodiments, or any combination of any other of the embodiments. Furthermore, equivalents and modifications not described above may also be employed without departing from the scope of the invention, which is defined in the accompanying claims.
Number | Date | Country | Kind
---|---|---|---
08250570.2 | Feb 2008 | EP | regional

Filing Document | Filing Date | Country | Kind | 371c Date
---|---|---|---|---
PCT/GB2009/000479 | 2/19/2009 | WO | 00 | 8/18/2010