Near Real-Time Social Distance Estimation in London

To mitigate the current COVID-19 pandemic, policy-makers at the Greater London Authority, the regional governance body of London, UK, rely upon prompt, accurate and actionable estimates of lockdown and social distancing policy adherence. Transport for London, the local transportation department, reports that it implemented over 700 interventions, such as greater signage and expanded pedestrian zoning, at the height of the pandemic's first wave, with our platform providing key data for those decisions. Large, well-defined, heterogeneous compositions of pedestrian footfall and physical proximity are difficult to acquire, yet necessary to monitor city-wide activity ("busyness") and consequently discern actionable policy decisions. To meet this challenge, we leverage our existing large-scale data-processing infrastructure, originally built for urban air quality machine learning, to process over 900 camera feeds in near real-time, generating new estimates of social distancing adherence, group detection and camera stability. In this work we describe our development and deployment of a computer vision and machine learning pipeline. It provides near-immediate sampling and contextualisation of activity and physical distancing on the streets of London via live traffic camera feeds. We introduce a platform for inspecting, calibrating and improving upon existing methods, describe the active deployment on real-time feeds, and provide analysis over an 18-month period.


INTRODUCTION
Before 2020, the phrase "social distancing" had little visibility in the public eye [1], being vernacular more frequently found in epidemiology textbooks and historical reports [2]. However, during the COVID-19 pandemic, physical spacing between strangers became a means of trying to curb the spread of the virus.
As the global community is actively engaged in understanding more about the effects and transmission mechanisms of COVID-19, many governments have enacted temporary restrictions targeted at reducing the proximity of the public to one another: measures such as limiting capacity within enclosed spaces, communicating new pedestrian traffic flows and, when necessary, enacting broader controls via "lockdowns" [3]. Monitoring the public response to these measures has become a necessity for policy makers seeking to better understand their adoption, plan economic recovery and eventually suspend them. When social restrictions were first implemented in the UK, there were limited measures of public activity in the context of likely vectors for viral transmission. A number of private companies trading in public movement data began providing aggregate information at the request of local government, from sources such as workplace reporting, wearable sports activity trackers and point-of-sale transactions [4]. It became clear there was an immediate need for additional response metrics for pedestrian activity, unmet by the aforementioned sources.

FIGURE 1: Lines detected by our feature extraction algorithm; two orthogonal sets of lines: those parallel to the foreground road (green, road edges) and those perpendicular (blue, road perpendiculars). The intersection of each set, the vanishing point (light blue), lies on the horizon. Some challenging conditions are visible, including varying lighting and non-zero road curvature.

This work seeks to estimate social distance in areas of high footfall in London, UK. The goal is to gauge adherence at high spatial and temporal granularity and, most importantly, provide near real-time access to policy makers. We describe a social-distancing estimation system using Open Government Licensed [5] traffic cameras directed towards pedestrian crossings and pavements. We include the description of our pipeline, methodology, algorithms and new accuracy results as urban object detection benchmarks.
The basis for this approach was initially built for constructing greater predictive features to improve a live air quality model of London [6]. It is known that pollutants are generated at different rates depending on driving activity [7]. Traffic camera footage is a suitable candidate for providing features on typical vehicle movement, capable of assisting the modelling of fine airborne particulates contributing to air pollution. The cloud infrastructure developed for the air quality model serves as the foundation for our social distancing estimation system.
Due to the nature of large-scale CCTV capture, there were substantial initial privacy concerns. All footage employed throughout the process is anonymised via deliberately restrictive sampling and systematically undergoes continuous review by The Alan Turing Institute's Ethical Advisory Group [8].

METHOD
Cameras available to the public are heterogeneous in quality and fall victim to the sporadic physical nature of London's historic streets. This scenario presents numerous challenges from a geospatial, statistical and technical perspective; see fig. 2 for an example of a post-processed still frame. Our platform predominantly relies upon 912 independent traffic camera feeds, over 500 of which typically overlook an intersection or crossing with expected pedestrian footfall. In order to mitigate potential deanonymisation, all input visual data is reduced in video resolution, significantly hampering facial identification. Additionally, to place our results in an appropriate broader context, our research goal is to measure variations in social distancing and feed quality over extended periods of time.
Each camera feed provides two data elements: a short video every few minutes and a restrictive set of static metadata regarding location and approximate cardinal direction. Hence, before attempting to estimate any pedestrian location, each camera requires an initial digital twin (DT) abstraction to define the world-plane of the visible scene stage, usually synonymous with visible road structure. A final real-world calibration is applied using human-labelled mappings from pixel locations within the image to physical coordinates of objects identified within the scene. These anchors are considered "ground truth"; examples include road markings, telephone booths and traffic lights. The objects selected are collectively referred to as "urban furniture" and are of most benefit if visible from aerial or satellite photography for later calibration. As a form of image registration, this enables mapping from the two-dimensional video frame to an inferred unreferenced world-plane, and finally to a real-world location. The process is difficult with highly variate CCTV scenes; our method learns one set of parameters for mapping a 2D scene to a 3D real-world coordinate projection and is described in section 2.1.
Once complete, active data collection continuously ingests ten-second camera clips from the public domain. Upon successful retrieval, video samples are batched for object detection. Our image processing pipeline is composed of a cluster of dynamically scaled compute resources via virtual machines operated by a container-orchestration system called Kubernetes [9]. A batch of 500 clips is ingested at a time; we are careful to ensure our model and computational resources are sufficient to process each sample in no more time than it represents, i.e. 30 minutes of footage must complete within 30 minutes, or the system would perpetually slip behind real-time. We employ a tuned state-of-the-art object detection model called YOLOv4 [10, 11] for identification of pedestrians within video frames. The reasoning for this selection and the tuning process are included in section 2.2, and the active deployment is described in section 2.5.
Results from the camera calibration and object detection stages are then stored within high-availability databases [12] near our data storage and image processing cluster. These databases permit immediate availability to public policy makers, specifically the Greater London Authority (GLA) and Transport for London (TfL), via a reliable representational state transfer application programming interface (REST API). Additionally, high availability increases capacity for complementary research tasks, such as simultaneously watching for spikes or irregularities via expectation-based network scan statistics [13].
A primary challenge borne out of long-term experimental processing is the unexpected consequence of relying upon cameras prone to real-world interference. Some examples identified during the developmental phase of this system are graffiti, wind progressively drifting the view direction, physical malfunction, and trees sprouting leaves that restrict previously clear views. In response to these detriments, we designed a camera stability change point detection process for identifying and alerting when scene dissimilarity meets a predetermined threshold, as described in section 2.3.
Finally, recognition, localisation and relative distance alone are not enough for adequate social distancing metrics, as pedestrian activity typically includes grouping behaviour: individuals seek to preserve physical distance from strangers whilst reducing the chance of disbanding their safe social group (or "bubble") [14]; hence, an inclusion of group agency is considered. An algorithm for group detection operating at frame and scene granularity is presented in section 2.4 for discussion as part of the final results.

Camera Calibration
Obtaining a world-plane mapping of a camera scene is extensively described in the computer vision and photogrammetry literature [15, 16]. A large portion of the literature requires manual calibration using known patterns to estimate a mapping from sensor data to a real-world contextualisation [17, 18, 19]. We aim to learn the geometric relationship between camera view and physical scene, frequently described through similarity, affine or projective transformations. A vanishing point of an image is the location of apparent three-dimensional convergence of parallel vanishing lines from a two-dimensional perspective (fig. 2a). Estimation of these vanishing lines is a common technique to recover some of these transformations. Our input scenes have multiple limitations: roads are usually curved or contain junctions of varying width; irregular road markings vary in quality; the video resolution of the feed is low; and lighting conditions change frequently while samples are individually very short in duration.
Scene object context methods [20, 21] use the activity of multiple vehicles travelling parallel and regularly to estimate the vanishing point. This would be a feasible solution for our problem if multiple samples were stitched together and vehicle movement manually corrected. Calibration methods [22, 23, 24, 25] require clear, regular or known lines in the scene, which is not practical in the case of a large spread of physical geometry. A stratified transformation approach discussed in [26] relies upon maximum likelihood estimation (MLE), a popular method for parameter estimation of an assumed probability distribution given some observations. This is applied over multiple extracted lines from high-quality images to build a real-world model, an issue for our low-resolution samples. Finally, [20, 23, 24, 27] extract visible road features by using a derivative-based binarisation operator. This is principally suitable for cameras overlooking straight and visually similar lanes, which in turn is only suitable for a portion of our input domain. Overall, we sought a more easily generalisable method, considerate of our cluttered urban traffic scenes at low resolution, that leverages our high sample quantity.

Simplified Pinhole Camera Model
A mapping, (u, v, 0) → (X, Y, Z), is sought from the image-plane to world geometry; for example, the transformation from pixels representing the bikes in fig. 2 to a physical location. Without a priori truth of any parameters describing the camera properties, these properties must be estimated or assumed, and are categorised into two groups: intrinsics and extrinsics. Examples of intrinsics include focal length, principal point, skew and aspect ratio, whereas extrinsics include positioning and direction. After manual inspection of all cameras, we conclude the suitability of the simplified pinhole camera model, as fewer than 0.5% of cameras have ultra wide-angle ("fisheye") lenses.
Our simplified pinhole camera model allows the transformation to be described by four parameters u_0, v_0, u_1, h, where (u_0, v_0), (u_1, v_0) are the vanishing points of two orthogonal planar directions subtending the horizon line, and h is the height of the camera above ground (fig. 3). Parallel lines on the road and on cars, such as road edges, advanced stop lines and car and truck edges, are used to estimate this transformation (fig. 2a). This model makes the following assumptions: (a) unit skew, i.e. a regularly square pixel grid; (b) constant aspect ratio, i.e. no change in the width-to-height ratio of pixels; (c) coincidence of principal point and image centre, i.e. no change in the centre pixel from the centre of the camera view. These are commonplace and rarely estimated when lacking more detailed visual information [28], [21]. External assumptions are as follows: (d) rectilinear lens, i.e. zero radial distortion; the image has already been pre-corrected such that perpendicular straight lines in reality are straight on the perpendicular pixel grid; (e) flat horizon v_0 = v_1, i.e. the camera has zero roll; (f) zero-inclined roads Z = 0, i.e. pedestrians do not move in a space large enough to calibrate deviation in elevation. Where cameras fail these external assumptions, a preprocessing stage included additional information to correct radial distortion [21], inclined horizon (setting v_1 = v_0) and non-zero inclination Z [29].
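To make the role of these four parameters concrete, the sketch below shows one way the model can be turned into an image-to-ground mapping. It is a simplified illustration under assumptions (a)-(f), not the deployed implementation: the focal-length recovery uses the standard orthogonality constraint between the two vanishing directions, and the sign conventions may need flipping depending on image-axis orientation.

```python
import numpy as np

def ground_mapper(u0, v0, u1, h, cx, cy):
    """Build an image-to-ground mapping from the four model parameters
    plus the principal point (cx, cy), assumed at the image centre.
    Returns a function (u, v) -> (X, Y) in metres along the two road axes."""
    # Focal length from orthogonality of the two vanishing directions;
    # f2 must be positive, which requires the vanishing points to lie on
    # opposite sides of the principal point.
    f2 = -((u0 - cx) * (u1 - cx) + (v0 - cy) ** 2)
    f = np.sqrt(f2)
    d0 = np.array([u0 - cx, v0 - cy, f]); d0 /= np.linalg.norm(d0)
    d1 = np.array([u1 - cx, v0 - cy, f]); d1 /= np.linalg.norm(d1)
    n = np.cross(d0, d1)  # ground-plane normal in the camera frame

    def to_ground(u, v):
        ray = np.array([u - cx, v - cy, f])
        # Intersect the pixel's viewing ray with the ground plane, which
        # sits at distance h from the camera along n. Only pixels below
        # the horizon intersect the ground; abs() sidesteps the normal's
        # orientation ambiguity for such pixels.
        t = h / abs(n @ ray)
        p = t * ray
        return p @ d0, p @ d1  # coordinates along the two road axes

    return to_ground
```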

Edge Detection
Our method for edge detection must be robust in noisy, low-resolution scenes with varying light conditions. Whilst deep learning approaches to edge detection, such as Visual Geometry Group ("VGG") models [30], have seen significant advancements in the last decade [31, 32], they require hours of training on a large set of labelled edges. Our dataset does not have labelled edges, and the cost of producing such data for this task outweighs the value of rapid perspective mapping on 912 scene samples. This problem extends to direct vanishing point estimation: the aforementioned deep learning approaches would require labelled data on the order of hundreds of samples [32] for direct perspective estimation. We instead turn our attention to classical methods [33].
Developed in 1986, the Canny Edge Detector has seen wide adoption for its ability to find edges with low error rates and minimal false edges in noisy scenarios, making it suitable for our low-resolution, highly light-variate input scenes; see Figures 1 and 2a.
The method relies upon Gaussian filters to first smooth potential noise, and then applies four filters to find intensity gradients with reference to gradient angle direction. Edge thinning is subsequently applied via magnitude thresholding, but this is not enough to remove spurious variations in colour and noise. A second, double threshold is applied using the surviving edge gradients, this time utilising high and low values empirically determined from the whole edge set. Finally, some weak edges remain; a process called hysteresis is applied via blob analysis to determine survival based on proximity to neighbouring strong pixels.
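As an illustration of this stage, the following sketch applies Gaussian smoothing followed by Canny edge detection using OpenCV; the file name, kernel size and thresholds are illustrative assumptions rather than the per-scene, empirically derived values used in deployment.

```python
import cv2

# Load a single frame and convert to greyscale before edge detection.
frame = cv2.imread("scene_frame.jpg", cv2.IMREAD_GRAYSCALE)

# Smooth sensor noise first; the kernel size is an assumption, not a tuned value.
blurred = cv2.GaussianBlur(frame, (5, 5), sigmaX=1.4)

# Canny applies gradient estimation, non-maximum suppression (edge thinning),
# double thresholding and hysteresis internally. The low/high thresholds here
# are illustrative; the pipeline derives them empirically from the edge set.
edges = cv2.Canny(blurred, threshold1=50, threshold2=150)

cv2.imwrite("edges.png", edges)
```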

Parameter Estimation
In order to learn our simplified pinhole camera model parameters (u_0, v_0, u_1, h), scenes with light vehicle traffic are selected and the edge detector applied per frame to find sets of road edges and road perpendiculars, as shown in fig. 2. In order to learn a set of vanishing lines from these edges, the Hough transform matches collinear edge segments into linked lines, which are then filtered by gradient [20] and dimensions. The vanishing point is then simply estimated as the highest-frequency point among the pairwise line intersections. This is chosen over the more computationally expensive aforementioned MLE methods [34], where the vanishing point error is optimised using least squares [28], [27] or Levenberg-Marquardt [20], [26]. This procedure is repeated across different contrast factors to provide a robust line detector in challenging lighting conditions such as British weather traditionally exhibits. These u_0, u_1, v_0 values are averaged over all frames to extract a final estimate. Finally, the camera height h must be manually estimated. One method is to transform an object of known dimensions. For example, using frequently appearing London buses of fixed 4.95 m height, the calculated height averages h = 9.6 m with 10% average deviation across 7 randomly picked cameras. Other ways to obtain the scale h include using car length averages [29] or known lane spacing [25, 27]. Given few known, consistent, standardised urban furniture upright heights, the London Bus method is appropriate. With each parameter estimated, it is possible to define a world-plane.
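A minimal sketch of the line grouping and vanishing-point voting described above, building on the edges array from the previous sketch and using OpenCV's probabilistic Hough transform. All parameter values (vote threshold, segment lengths, bin size) are illustrative assumptions, and the gradient and dimension filtering of [20] is omitted for brevity.

```python
import numpy as np
import cv2
from collections import Counter

def intersect(l1, l2):
    """Intersection of two segments treated as infinite lines; None if near-parallel."""
    x1, y1, x2, y2 = l1
    x3, y3, x4, y4 = l2
    d = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if abs(d) < 1e-9:
        return None
    a, b = x1 * y2 - y1 * x2, x3 * y4 - y3 * x4
    return ((a * (x3 - x4) - (x1 - x2) * b) / d,
            (a * (y3 - y4) - (y1 - y2) * b) / d)

# 'edges' is the Canny output from the previous stage.
segments = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=50,
                           minLineLength=30, maxLineGap=10)
segments = [] if segments is None else [s[0] for s in segments]

# Vote for the vanishing point in coarse 10-pixel bins over all
# pairwise line intersections.
votes = Counter()
for i in range(len(segments)):
    for j in range(i + 1, len(segments)):
        p = intersect(segments[i], segments[j])
        if p is not None:
            votes[(round(p[0] / 10), round(p[1] / 10))] += 1

if votes:
    (bu, bv), _ = votes.most_common(1)[0]
    u0_est, v0_est = bu * 10, bv * 10  # highest-frequency intersection bin
```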

Real-world Reference
The world-plane projection (fig. 3) is as yet unreferenced to the real-world; a Euclidean norm may be applied, but not uniformly across all cameras. We employ points of reference via geotagged static urban furniture, such as traffic lights or road markings, to map this intermediate world-plane to a real-world representation. We select an appropriate flat 2D projected coordinate reference system, British National Grid (OS 27700). We then employ a second transformation between these two 2D Cartesian frames of reference,

\begin{pmatrix} X \\ Y \end{pmatrix} =
k \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix} +
\begin{pmatrix} t_x \\ t_y \end{pmatrix} +
\begin{pmatrix} e_x \\ e_y \end{pmatrix},

with scale and shear factors, k, angle of rotation, θ, translations, t, and error terms, e. The estimated real-world representation is the result of optimising the sum of squared errors between transformed image coordinates of the urban furniture and the world-plane image registration.
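This registration step can be sketched as a least-squares fit of scale, rotation and translation between labelled world-plane coordinates and their British National Grid counterparts. The closed-form (Umeyama-style) solution below is an illustrative stand-in for the optimisation described above, omits the shear term, and uses placeholder anchor coordinates.

```python
import numpy as np

def fit_similarity(src, dst):
    """Least-squares 2D similarity transform mapping src -> dst (both (N, 2)):
    dst ~ k * R @ src + t. Closed-form Umeyama solution via SVD."""
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    A, B = src - mu_s, dst - mu_d
    U, S, Vt = np.linalg.svd(B.T @ A)
    d = np.sign(np.linalg.det(U @ Vt))   # guard against reflection
    D = np.array([1.0, d])
    R = U @ np.diag(D) @ Vt              # optimal rotation
    k = (S * D).sum() / (A ** 2).sum()   # optimal scale
    t = mu_d - k * R @ mu_s              # optimal translation
    return k, R, t

# Placeholder anchors: intermediate world-plane coordinates and their
# surveyed British National Grid locations.
world = np.array([[0.0, 0.0], [5.1, 0.2], [4.9, 7.8], [0.3, 8.1]])
bng = np.array([[531200.0, 180400.0], [531205.0, 180401.0],
                [531204.0, 180409.0], [531199.0, 180408.0]])
k, R, t = fit_similarity(world, bng)
residuals = bng - ((k * (R @ world.T)).T + t)
rmse = np.sqrt((residuals ** 2).sum(axis=1).mean())
```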

Camera Dataset
Footage is sourced openly from Transport for London's Open Roads initiative, known as JamCams [35].
A day of collection constitutes approximately 220,000 individual files totalling 20-30 GB, deleted upon processing in accordance with our data retention policy. The nature of monitoring public spaces means we cannot a priori request consent. The reduced resolution of footage collected from this source inhibits any capacity to personally identify an individual; thus only their humanoid likeness is utilised for detection.

Object Detector
In order to detect entities quickly enough to assist policy makers, we evaluate object detection models such as SSD [36] and YOLOv3 [37] to balance speed against accuracy. These trade-offs are typically determined by architecture, model depth, input sizes, classification cardinality and execution environment. You Only Look Once (YOLO) [38] is a one-stage anchor-based object detector that is both fast and accurate. YOLOv3 achieves an accuracy of 57.9 AP_50 in 51 ms [37]. Recently, a faster version named YOLOv4 [10] was released with state-of-the-art accuracy relative to these alternative object detectors. Notably, YOLOv4 can be trained and used on conventional GPUs, which allows for faster experimentation and fine-tuning on custom datasets. YOLOv4 improves accuracy and speed over YOLOv3 by 10% and 12% respectively [10].
We employ both YOLOv3 and v4 in our experiments. Each was pretrained on the COCO [39] dataset, a large-scale repository of objects belonging to 80 class labels. Due to our objective, the classes of interest are limited to six labels: person, car, bus, motorbike, bicycle and truck. We fine-tuned the model on these six labels using joint datasets from COCO, MIO-TCD [40] and a custom, manually labelled JamCam-specific training set. A validation set was also partitioned from the manually labelled dataset for model evaluation. Results in the evaluation section document the success of this fine-tuning on traffic camera footage.
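For illustration, fine-tuned Darknet weights of this kind can be run through OpenCV's DNN module as sketched below; the configuration and weights file names are placeholders for the six-label model, and the thresholds are illustrative rather than the deployed values.

```python
import cv2

# Hypothetical file names standing in for the fine-tuned six-label model.
net = cv2.dnn.readNetFromDarknet("yolov4-jamcam.cfg", "yolov4-jamcam.weights")
model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1 / 255.0, swapRB=True)

frame = cv2.imread("jamcam_frame.jpg")
# Returns class ids, confidences and (x, y, w, h) boxes after non-max suppression.
class_ids, confidences, boxes = model.detect(frame, confThreshold=0.5,
                                             nmsThreshold=0.4)
```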

Similarity Indices
Due to the extended duration of this project, it is necessary to include an evaluation of physical change in scene perspective or visible feed quality. Examples exhibited over time include intended adjustments made via motor-driven camera equipment, strong weather progressively shifting the view direction, detachment from mounting hardware, and local vegetation sprouting to inhibit visibility of the original scene. To mitigate and detect these issues, we construct a variation metric using past frame information to detect variations in the captured scene. Direct application of pixel-for-pixel Mean Square Error (MSE) is not suitable under changing lighting conditions and is too sensitive to minute pixel differences. The Structural Similarity Index Measure (SSIM) has been shown [41] to aptly measure image distance via a kernel comparison approach.
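A minimal sketch of the SSIM comparison using scikit-image; the frame file names are placeholders, and the two frames are assumed to be greyscale and of identical dimensions.

```python
import cv2
from skimage.metrics import structural_similarity as ssim

ref = cv2.imread("reference_frame.png", cv2.IMREAD_GRAYSCALE)
cur = cv2.imread("current_frame.png", cv2.IMREAD_GRAYSCALE)

# SSIM compares local luminance, contrast and structure within a sliding
# window, so it tolerates the global lighting shifts that inflate MSE.
score = ssim(ref, cur)  # 1.0 indicates structurally identical frames
```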

Application to scene imagery
Numerous scene reference periods were considered for measuring SSIM for each camera: the first known scene frame versus the first hourly frame; a one-week historical offset to the first hourly frame; and the mean of non-erroneous initial frames over seven days from initial data acquisition versus the first daily frame at noon. It became clear that the final measure is most appropriate, both for noise reduction and computational efficiency. Over time, this generates a univariate time series measuring scene variation. Sustained linear drift is less likely to negatively affect our pedestrian location estimation; however, it can be detected once the threshold is met. More importantly, a single major movement must be detected for subsequent alerting during live experiment operation. This option does, however, deviate from other automated tasks, requiring the seven-day reference period to be adjusted to the new scene upon human intervention.

Change Point Detection
Detection of abrupt shifts in frame similarity over time is a task suited to the unsupervised learning problem of change point detection, the study of algorithms designed to find underlying change in time series [42]. An offline solution is suitable given our large number of input feeds and the necessity of appending daily measures. Upon evaluation, we determined that Pruned Exact Linear Time (PELT) [43] under a standard RBF kernel could accurately partition camera scene changes. Under this measure, cameras of high variability are also excluded from later analysis.
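A sketch of this detection step using the ruptures library's PELT implementation with an RBF cost; the penalty value and input file are illustrative assumptions that would be tuned against manually verified scene changes.

```python
import numpy as np
import ruptures as rpt

# Daily SSIM scores for one camera as a univariate series (placeholder file).
ssim_series = np.loadtxt("camera_ssim_daily.csv")

algo = rpt.Pelt(model="rbf").fit(ssim_series.reshape(-1, 1))
# predict() returns the index after each detected change, ending with len(series).
change_points = algo.predict(pen=10)
```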

Group Detection
In order to better describe social distancing efforts, we implement a group detection process (algorithm 1) and define seven metrics to describe a scene over time (table 1). Groups are selected from pedestrian detection locations by generating the Delaunay triangulation in the British National Grid (BNG) projection for pedestrians within individual frames per scene sample.

TABLE 1: Group metrics calculated over a given sample of detection results.

    Individuals                  Total number of unique pedestrians.
    No. groups                   Total number of unique groups.
    No. groups max.              Maximum number of exhibited groups in a given location.
    Outer group min. distance    Minimum distance exhibited between two groups in a given location.
    Outer group mean distance    Mean distance of all exhibited groups in a given location.
    Inner group mean size        Mean population within a group.
    Inner group mean distance    Mean of distances exhibited within a group.
Each metric is calculated depending on two constants: detection confidence, T_c, the threshold required to include a detection, and a distance threshold, T_d, the maximum group diameter (metres). This task is conducted per frame, producing frame-level results: total number of groups, G_n; number of people within a group, I_l; mean distance between individuals within a group, I_d; and group centres in BNG projection, C. Intermediate per-frame groups are determined by refactoring within-threshold detection locations into a coordinate matrix from the Delaunay graph. Individual groups are then classified as connected components per [44]. Upon completion, each set of groups detected per frame supports an additional Delaunay triangulation, permitting final calculation of scene-level metrics: maximum number of groups per frame, G_max(n); minimum distance between groups, G_min(d); and mean distance between groups, G_d.
We fix the detection confidence threshold to T_c = 0.7, meaning detections whose inferred pedestrian certainty from the object detector is below 70% are excluded. A maximum interest area is defined as T_d = 6 metres between any two individuals. In practice, this task is expensive and is distributed amongst many processing nodes using Python Dask parallelisation.
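A per-frame sketch of this clustering, pairing SciPy's Delaunay triangulation with connected components over the distance-thresholded edge set. It mirrors the frame-level step of algorithm 1 under the stated T_d (detections are assumed pre-filtered by T_c) but is a simplified stand-in for the deployed Dask-distributed code.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial import Delaunay

def detect_groups(points, t_d=6.0):
    """Cluster pedestrian locations (N, 2), in BNG metres, into groups:
    Delaunay edges longer than t_d are dropped and the remaining connected
    components are the per-frame groups. Returns (n_groups, labels)."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    if n <= 2:
        # Delaunay needs at least 3 points; handle pairs directly.
        if n == 2 and np.linalg.norm(pts[0] - pts[1]) <= t_d:
            return 1, np.zeros(2, dtype=int)
        return n, np.arange(n)
    edges = set()
    for simplex in Delaunay(pts).simplices:
        for i in range(3):
            a, b = sorted((simplex[i], simplex[(i + 1) % 3]))
            edges.add((a, b))
    keep = [(a, b) for a, b in edges
            if np.linalg.norm(pts[a] - pts[b]) <= t_d]
    rows = [a for a, _ in keep]
    cols = [b for _, b in keep]
    adj = coo_matrix((np.ones(len(keep)), (rows, cols)), shape=(n, n))
    return connected_components(adj, directed=False)
```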

Deployment
Deployment provisioning is controlled declaratively by Terraform, covering each component of the processing pipeline (fig. 6). Kubernetes manages two compute clusters: a GPU-accelerated, horizontally scaled video processing pool, and a stability-focused, horizontally scaled, burstable CPU pool executing scheduled tasks and hosting API access points for direct data acquisition and serving the control centre output.

World-plane estimation
Uncertainty in the estimation of the vanishing line and extrinsic camera height arises from imperfect camera effects eliminated in the model assumptions and from inaccurate automatic line extraction. The estimated errors in mapped world position, dX, dY, are evaluated for 3 randomly selected cameras manually calibrated beforehand, using the total differential dX = Σ_i (∂X/∂p_i) dp_i over all estimated parameters p_i ∈ {u_0, v_0, u_1, h}, assuming that the vehicle tracking coordinates u, v are accurate at the point of evaluation. The average relative uncertainty in position mapping due to parameter estimation, |dX/X|, is calculated to be 17.7%, σ = 7.9%.
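The total-differential evaluation can be approximated numerically as in the sketch below; map_fn stands in for the image-to-world mapping of section 2.1, and the parameter uncertainties dparams are assumed to be supplied by the manual calibration.

```python
import numpy as np

def propagate_uncertainty(map_fn, params, dparams, u, v, eps=1e-4):
    """Propagate parameter uncertainties dparams through the image-to-world
    mapping map_fn((u, v), params) -> (X, Y) via the total differential,
    approximated with central finite differences. map_fn and dparams are
    placeholders for the calibrated mapping and its parameter errors."""
    dX = np.zeros(2)
    for i, dp in enumerate(dparams):
        p_hi = list(params); p_hi[i] += eps
        p_lo = list(params); p_lo[i] -= eps
        grad = (np.asarray(map_fn((u, v), p_hi))
                - np.asarray(map_fn((u, v), p_lo))) / (2 * eps)
        dX += np.abs(grad) * dp  # worst-case accumulation of |dX/dp_i| * dp_i
    return dX  # (dX, dY) uncertainty in world metres
```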

Calibration to real-world reference
Of 912 total cameras, 504 within the Boroughs of Inner London were selected for analysis, with non-pedestrian motorway scenes predominantly excluded. For this training subset, 3,298 manually labelled urban furniture anchors were employed for frame real-world calibration. During labelling, care was taken to maximise spatial coverage in each dimension, i.e. anchors were sparsely labelled to span the width and depth of the field of view. There is an average of 5.53 labels, |F_s|, per scene s, with a minimum of 4, |F_s| ≥ 4, where little urban furniture could be identified. Given we are interested in the distance between individuals, the most appropriate error is the distance between known real-world locations and their pixel coordinates post-transformation. The value of this comparison stems from the interest in comparing two individuals or groups within the scene. All possible lengths between anchors, N, were calculated before optimisation. The error function was the mean squared error between these and the learned transformation results, M: MSE = (1/|N|) Σ_i (N_i − M_i)². This tests the complete calibration pipeline, from pixel coordinates in the image plane, to relative locations in the world plane, and finally to real-world distances between points. The median optimisation error was 0.8210 m in BNG, meaning our model is able to locate an object in the image to within 82.10 cm. The distribution of this error is shown in fig. 7.

Validation
There does not exist a ground-truth dataset containing relative distances between people in the traffic camera frames around London. To validate this approach, we remove ground-truth anchors, enforcing a reliance on fewer manually calibrated examples. For every scene s with a set F_s of urban furniture anchors, we remove exactly one anchor a_s from F_s uniformly at random and independently across scenes. We train our model using only the anchors in F_s − {a_s}. The validation test set contains all the out-of-sample removed anchors a_s for each scene s.
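The anchor-dropout split can be sketched as follows; anchors_by_scene is an assumed mapping from scene identifiers to lists of labelled anchors.

```python
import random

def anchor_dropout_split(anchors_by_scene, seed=0):
    """Per scene, hold out exactly one anchor uniformly at random; the model
    is retrained on the remainder and validated on the held-out anchors."""
    rng = random.Random(seed)
    train, held_out = {}, {}
    for scene, anchors in anchors_by_scene.items():
        a_s = rng.choice(anchors)
        train[scene] = [a for a in anchors if a is not a_s]
        held_out[scene] = a_s
    return train, held_out
```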
The approach led to a mean relative distance error of 83.43 cm, a discrepancy of 1.33 cm, with the distribution displayed in red in fig. 7. This indicates that the training procedure benefits only marginally from more labels and is resistant to changes in the input training data.

Object detection
As preprocessing steps, we subset six labels from the Coco 2017 and MIO-TCD localization datasets. Unlike the Coco dataset, the MIO-TCD localization dataset contains 11 labels, with additional categories such as motorized vehicle, non-motorized vehicle, pickup truck, single unit truck and work van not found in the Coco 2017 dataset. For comparison, we collapse the different categories of trucks into truck and remove labels regarding vehicle motorization. We produce a new collection of manually labelled entities specifically on frames from traffic cameras, using CVAT [45]. The dataset contains 1,142 frames and 11,497 bounding boxes, as shown in Table 2. For evaluation/validation, we compute the mean Average Precision (mAP) at an IoU threshold of 0.5 over the Coco 2017, MIO-TCD, joint (Coco 2017 + MIO-TCD) and JamCam datasets.
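The category harmonisation can be sketched as a simple remapping; the exact MIO-TCD category strings and the treatment of vans are assumptions for illustration.

```python
# Six target labels shared across Coco 2017, MIO-TCD and JamCam data.
KEEP = {"person", "car", "bus", "motorbike", "bicycle", "truck"}

# Illustrative MIO-TCD remap: truck variants collapse to "truck"; folding
# "work van" into truck is an assumption, not stated in the paper.
REMAP = {
    "pickup truck": "truck",
    "single unit truck": "truck",
    "articulated truck": "truck",
    "work van": "truck",
}

# The motorization meta-labels are dropped entirely.
DROP = {"motorized vehicle", "non-motorized vehicle"}

def normalise(label):
    """Map a source dataset label to one of the six target labels, or None."""
    if label in DROP:
        return None
    label = REMAP.get(label, label)
    return label if label in KEEP else None
```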
We fine-tune pretrained YOLOv4 weights on six labels from different training datasets using a batch size of 16, subdivisions of 4, an image size of 416 and at least 7,000 iterations on a Tesla V100-SXM3-32GB GPU. We train three different models on: 1) Coco 2017 training data; 2) MIO-TCD training data; 3) joint data containing a random shuffle of Coco 2017 and MIO-TCD training data. Table 2 shows the number of training samples by label. The validation data contains Coco 2017 validation data, MIO-TCD validation data and JamCam data.
The performances of the three models are shown in Table 3. On the Coco 2017 validation data, the model trained on Coco 2017 achieves a mean Average Precision (mAP@0.50) of 67.55%. However, the model trained on the Coco 2017 dataset performs poorly on the MIO-TCD localization validation dataset, with an mAP of 20.39%. Likewise, the performance of the model trained on the MIO-TCD dataset reduces greatly, from 85.80% to 14.24%, when the Coco 2017 dataset is used as the validation dataset. This behaviour might be a result of differences in the resolutions and weather conditions of the two datasets. Joint training creates a balance between the two datasets and increases the model's performance on the independent JamCam dataset.

ANALYSIS
As of September 23rd 2021, over an 18-month period the data collection pipeline has processed 19.31 terabytes of footage, for a total of 23,839,346,160 samples of all detection types, spanning all calibrated camera scenes. Of these, 9,453,327,651 were rejected, either due to irrelevant camera positioning or out of caution when recorded during a period of high camera variability.
We define "Inner Boroughs" as Statutory Inner London according to the London Government Act 1963. The results for this section are generated for a period beginning from our collection start date, 23rd March 2020.

Scene Stability and Camera Drift
The change in SSIM between the reference scene and the active feed, ∆SSIM, is highly relevant to determining sample suitability. Instances of multiple change points prompt manual intervention or total rejection. Figure 5 is an instance of changing stability, whereby the original perspective does not include both pedestrian crossings, the sensor then becomes damaged or overexposed over four months, only to be positioned differently in October 2020.
Variance in camera positioning was noticeably larger in areas of high pedestrian activity, indicative of the active role Transport for London (TfL) has taken in monitoring pedestrian and vehicle traffic.
This is demonstrated by comparing the standard deviation of ∆SSIM between inner and outer boroughs over the aforementioned time frame: σ_inner = 0.0636, σ_outer = 0.0053.

Macro policy intervention
Macro interventions within London are defined as either applicable national requirements determined by the central government or city-wide policies set forth by the Mayor's Office. For this example, inner boroughs are selected for their high camera availability over a 12-month subset of these data. Applying group detection per borough provides the profiles visible in Figure 8. A timeline of intervention events [46] is documented in Figure A.1. Each profile is smoothed under simple local regression [47], taking the 5% closest points to (x_i, y_i) and estimating y_i under standard weighted linear regression.
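A sketch of this smoothing step using statsmodels' LOWESS, where frac=0.05 corresponds to the 5% nearest-point neighbourhood; x and y stand in for a borough's daily timestamps and metric values.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# x: day index; y: a daily borough-level metric (placeholder series).
x = np.arange(365, dtype=float)
y = np.random.default_rng(0).normal(size=365).cumsum()

# Weighted linear regression over the 5% nearest points around each x_i.
smoothed = lowess(y, x, frac=0.05, return_sorted=False)
```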
There are two predominant results generalised across the city: a substantial increase in pedestrian frequency and a reduction in social distance during the "Eat Out to Help Out" scheme between the 3rd and 30th August 2020; and plummeting activity during the second lockdown whilst social distancing resumed a steady increase. Within the reduced-restriction periods between these events, social distancing is seen to spike in most boroughs, potentially indicative of successful public information campaigns and a willingness to maintain safe distancing conditions.
Within the Christmas period, the initial social restriction measures rapidly stem the quantity of individuals and groups in all boroughs. Directly after the holiday passes, activity and distances rapidly grow until restrictions are relaxed, leading to plummeting social distancing in almost all boroughs.

Micro policy intervention
Micro interventions are limited within our dataset, as many COVID protocols cannot be captured on traffic cameras.There exist a number of smaller interventions in the form of pavement extensions.These expansions in pedestrian space include road reclamation in specific areas of high volume, such as near restaurants and public transport stations.For our analysis, we selected three stable scenes from distinct boroughs and filtered our social distance metrics before and after the intervention for comparison.
The locations Borough High St/Southwark St, East Road/Vestry St, and the A10 north of Tyssen Rd (Figure 9) have 50.2, 43.0, and 153.1 m² of pavement area within our digital twin scenes. Post-expansion, each gained 40.2, 20.5, and 100.3 m² of additional walking space. Before intervention, these had mean estimated social distances of 1.33, 1.21 and 0.73 m, or, as a ratio to area, 0.027, 0.028 and 0.005 m/m² respectively. Post barricade installation, the estimated mean distance rose to 1.50, 1.25 and 0.93 m. This is indicative of clear usage of this space, increased physical distancing, and effective policy intervention.

CONCLUSION
This work contributes a new social distancing monitoring platform, improves upon the accuracy of a state-of-the-art detection model in an urban domain, introduces a new camera perspective estimation method, places physical spacing metrics into a viable historical context, and demonstrates how multiple machine learning techniques may benefit public health. According to the Greater London Authority, this tool enabled them to intervene quickly and identify where street spacing interventions were required. These interventions included moving bus stops, widening pavements and closing parking bays to create space for social distancing. "TfL says that it implemented over 700 such interventions at the height of the pandemic's first wave, and that the Turing's tool provided key data for those decisions [48, 49]." Combined with large-scale, inexpensive consumer distributed computing infrastructure, we provide an option for policy makers to receive a near real-time perspective of their impact via an online interface (fig. 10). Ongoing directions for this project include validating our early warning detection system, improving the overall accuracy of the digital twin, providing more "human-in-the-loop" recommendations with high ease of use for policy makers, and continuing to provide transparent and interrogatable examples of machine learning applications.
(a) Road curvature challenging-condition example, leading to an erroneous vanishing point. (b) Harsh shadow parallel to vanishing lines, a beneficial scenario.
(a) Detected edges, generated ground plane, and overlaid pedestrian detection density in bright green, black, and a viridis heat map respectively. (b) High-accuracy detections: pedestrians, buses and bicycles in red, blue and green respectively.

FIGURE 2: Example of our method applied to a traffic camera at Bank, London.

FIGURE 3: Demonstration of the perspective mapping of camera calibration from image to world plane, before estimated registration to the British National Grid. Rays (black, solid) are drawn as grid lines and extended (red, dashed) to the estimated vanishing points (u_0, v_0) and (u_1, v_0). After mapping onto world coordinates, vehicle trajectories (green, dotted), for example, are also mapped by this transformation.

FIGURE 4: Estimated locations of urban furniture (green) and transformed ground-truth scene anchors (blue) on the British National Grid.

FIGURE 5: Change points detecting scene stability of Oxford St/Vere St. Each colour represents a detected change; example frames are included and coloured respectively to indicate variation in camera quality and direction.

FIGURE 6: Development operations and platform architecture as deployed to Azure cloud services.

FIGURE 7: MSE of all possible distances between ground-truth anchors and transformation results. Full data are displayed in black, dropout validation results in red.
Inner London Borough Social Distancing Profiles, April 2020 - April 2021.

FIGURE 8: Number of individuals, I_n, mean inner group physical distance, I_d, and outer group social distance, G_d, by inner borough. "Lockdowns" and "Eat Out to Help Out" are represented by red and yellow respectively. Points represent camera locations, selected and unselected in blue and grey respectively.

FIGURE 9: Before (top) and after (bottom) of three locations of pavement expansion interventions. Heat map of pedestrian footfall within calibrated pavement and extension (red) areas pre- and post-bollard placement.

Algorithm 1: Group proximity frame tracking
Input: Scene of localised detections, S_L
Parameters: Confidence threshold, T_c; Distance threshold, T_d
Output: Total detected groups, G_n; Max groups per-frame, G_max(n); Min distance between groups, G_min(d)
for each frame f_L ∈ S_L do
    f_L ← detections with confidence ≥ T_c;
    if |f_L| = 2 then
        append(I_d, Euclidean(f_Lx, f_Ly));
        append(C, Mean(f_Lx, f_Ly));
    else
        E ← DelaunayEdges(f_L);
        D ← Euclidean(e_vx, e_vy) ∀e ∈ E;
        E ← {e ∈ E : d ≤ T_d ∀d ∈ D};
        for each connected component g of (f_L, E) do
            append(I_l, |g|);
            append(I_d, mean pairwise Euclidean distance within g);
            append(C, Mean(g_x, g_y));
        end
    end
end
compute G_n, G_max(n), G_min(d) and G_d over the per-frame group centres C via a second Delaunay triangulation;


TABLE 2: Training and validation samples per dataset.

TABLE 3: Comparing models fine-tuned on the Coco 2017 dataset, MIO-TCD dataset, and joint training set, using the YOLOv4 architecture.