GPS-Free Street-Level Geolocation with Simultaneous Quantile Regression

Written by Plainsight Engineering | Jun 22, 2026 5:52:21 PM

When a camera doesn’t have GPS, can it still figure out where it is? That was the motivating question behind a demo we built for our work with police departments.

The short answer: yes, with some important caveats about what you measure and how honest you are about what you don’t know.

The Problem

Police departments ingest footage from a lot of sources: crowdsourced tips, third-party CCTV, and footage shared by the public. A large fraction of that material arrives without GPS metadata or with location data that’s too coarse to be operationally useful. A frame of video is rich with geometric cues — building facades, street furniture, lane markings — that implicitly encode location. The challenge is turning those cues into a reliable coordinate estimate when the camera metadata can’t be trusted.

We scoped this proof of concept to Chicago’s Loop district: a roughly 2km × 2km grid of city blocks that’s visually diverse and well-covered by public street imagery.

Demo screenshot: Chicago Loop dashcam footage with minimap overlay showing estimated position and cell confidence heatmap

The Grid: A Hierarchy of Cells

The first design decision was not to regress latitude and longitude directly from a raw frame. That works poorly: the mapping from pixel values to world coordinates is highly nonlinear, and a regression head trained end-to-end on it tends to produce overconfident, geographically incoherent estimates.

Instead, we split the problem in two:

Coarse classification — which 137m × 186m cell does this frame belong to? The Loop fits in a 12×12 grid, so this is a 139-class classification problem (5 cells had too sparse coverage to use).
Fine regression — where within that cell is the camera?

This is the standard top-down coarse-to-fine pattern with a lot of prior art. More interesting than this decomposition is what goes into each stage.
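To make the decomposition concrete, here is a minimal sketch of how a geotagged training image could be assigned to a grid cell and how the 144 grid cells could be reduced to 139 class labels. The bounding box, the choice of which five cells to drop, and all names (`Grid`, `cell_of`, `build_class_index`) are hypothetical and only illustrate the idea; they are not taken from the demo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grid:
    # Hypothetical bounding box, not the demo's actual geofence.
    lat_min: float
    lat_max: float
    lng_min: float
    lng_max: float
    rows: int = 12
    cols: int = 12

    def cell_of(self, lat: float, lng: float) -> tuple:
        """Return the (row, col) of the cell containing a point."""
        inside = (self.lat_min <= lat <= self.lat_max
                  and self.lng_min <= lng <= self.lng_max)
        if not inside:
            raise ValueError("point lies outside the grid")
        fy = (lat - self.lat_min) / (self.lat_max - self.lat_min)
        fx = (lng - self.lng_min) / (self.lng_max - self.lng_min)
        # min() keeps points on the upper edge inside the last cell.
        row = min(int(fy * self.rows), self.rows - 1)
        col = min(int(fx * self.cols), self.cols - 1)
        return row, col


def build_class_index(grid: Grid, dropped: set) -> dict:
    """Map every usable cell to a contiguous class id for the classifier."""
    kept = [(r, c) for r in range(grid.rows) for c in range(grid.cols)
            if (r, c) not in dropped]
    return {cell: class_id for class_id, cell in enumerate(kept)}


grid = Grid(lat_min=41.875, lat_max=41.893, lng_min=-87.640, lng_max=-87.616)
# Five arbitrary cells stand in for the sparsely covered ones.
sparse_cells = {(0, 0), (0, 1), (11, 11), (5, 0), (6, 0)}
class_index = build_class_index(grid, sparse_cells)
num_classes = len(class_index)
```

The fine regression stage would then predict an offset inside the selected cell rather than an absolute coordinate.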

Stage 1: CosFace for Stable Cell Boundaries

Standard cross-entropy on a 139-class problem does something undesirable for geolocation: it drives spatially adjacent cells apart in embedding space as aggressively as it drives distant cells apart, because they’re all just different class labels. A cell on the north edge of block 42 has no special similarity to the cell on the south edge of block 43 from the loss function’s perspective.

Following CosPlace (Berton, Masone, and Caputo, 2022), we fix this by replacing the standard dot-product logit with a cosine similarity and subtracting a margin from the true-class cosine, which is the CosFace loss that CosPlace adopts. The consequence is that the backbone learns an embedding where visually similar frames end up geometrically close. Because visual similarity and geographic proximity are correlated in street imagery, this naturally clusters neighboring cells. The margin pushes class boundaries to be conservative and stable, which matters a lot for the confidence-gated switching logic downstream.
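As an illustration of the logit change described above, the following NumPy sketch computes CosFace-style logits: embeddings and class weights are L2-normalized, the margin is subtracted from the true-class cosine only, and the result is scaled before the usual softmax cross-entropy. The `scale` and `margin` values and the random inputs are assumptions for the example, not the settings used in the demo.

```python
import numpy as np


def cosface_logits(embeddings, class_weights, labels, scale=30.0, margin=0.35):
    """Return scale * (cos(theta) - margin on the true class).

    scale and margin are illustrative defaults, not the demo's values.
    """
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos = e @ w.T  # shape (batch, classes), every entry in [-1, 1]
    penalty = np.zeros_like(cos)
    penalty[np.arange(len(labels)), labels] = margin
    return scale * (cos - penalty)


def cross_entropy(logits, labels):
    """Mean softmax cross-entropy, computed with the log-sum-exp shift."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 16))      # 4 frames, 16-d embeddings (toy sizes)
weights = rng.normal(size=(139, 16))  # one weight vector per cell class
y = np.array([0, 5, 17, 138])

with_margin = cosface_logits(emb, weights, y)
without_margin = cosface_logits(emb, weights, y, margin=0.0)
loss_margin = cross_entropy(with_margin, y)
loss_plain = cross_entropy(without_margin, y)
```

Because the margin only lowers the true-class logit, the loss with the margin is always at least as large as without it. The network has to separate the true class by more than the margin before the loss is satisfied, which is where the conservative boundaries come from.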

With the CosFace objective, the cell classifier reaches 92% top-1 accuracy on held-out real dashcam imagery, and 97% of its misclassifications fall within one cell (~200m) of ground truth. Trained with plain cross-entropy instead, the same backbone reached roughly 60%.

Stage 2: Simultaneous Quantile Regression for Honest Uncertainty

Within a cell, we wanted more than a point estimate. Not all positions within a 137m cell are equally identifiable: a midblock stretch of brick wall is far more ambiguous than a distinctive corner intersection. We wanted the model to tell us when it was confused.

Simultaneous Quantile Regression (SQR) is a natural fit. Instead of training a regression head to minimize mean squared error (which implicitly estimates the conditional mean), SQR trains a single head conditioned on a quantile level τ, sampled uniformly during training, using the pinball loss. At inference time, you query the same head at whichever τ values you want: for each τ, the head has learned to output a value that the true answer falls below a fraction τ of the time.
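The pinball loss is short enough to write out. The sketch below defines it in NumPy and checks its key property on synthetic data: the constant that minimizes the loss at level τ is the empirical τ-quantile. The sample distribution, the candidate grid, and the helper name `best_constant` are assumptions made for the example; a real SQR head would be a network that takes τ as an extra input.

```python
import numpy as np


def pinball_loss(pred, target, tau):
    """Pinball (quantile) loss averaged over all elements.

    Under-prediction is weighted by tau, over-prediction by (1 - tau).
    """
    diff = target - pred
    return np.mean(np.maximum(tau * diff, (tau - 1.0) * diff))


rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=5_000)
candidates = np.linspace(-3.0, 3.0, 601)  # grid step of 0.01


def best_constant(tau):
    """Grid-search the constant prediction with the lowest pinball loss."""
    losses = [pinball_loss(c, samples, tau) for c in candidates]
    return candidates[int(np.argmin(losses))]


q05, q50, q95 = (best_constant(t) for t in (0.05, 0.5, 0.95))

# During SQR training, a per-example tau of shape (batch, 1) broadcasts
# over (batch, 2) lat/lng targets without any extra code.
tau_batch = rng.uniform(size=(8, 1))
targets_2d = rng.normal(size=(8, 2))
preds_2d = np.zeros((8, 2))
batch_loss = pinball_loss(preds_2d, targets_2d, tau_batch)
```

The asymmetric weighting is what makes one head usable at any τ: a low τ penalizes over-prediction heavily, so the prediction is pushed toward the lower tail, and a high τ does the opposite.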

The result: one head gives you a full slice of the conditional distribution on demand. The spread between querying at τ=0.05 and τ=0.95 is a calibrated uncertainty estimate. When the model sees an easily identifiable corner, that spread is narrow. When it sees a long featureless block it’s never been certain about, the spread grows.

For 2D targets (lat and lng jointly), the pinball loss broadcasts cleanly over dimensions. The uncertainty representation becomes an ellipse centered on the median estimate, with semi-axes given by the quantile spread in each direction. What renders on the minimap as a small or large ellipse around the location dot is directly measuring how sure the model is — not a heuristic radius, but an empirically calibrated interval.
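One way the ellipse could be derived from the quantile head is sketched below. It assumes a callable `query_head(features, tau)` that returns a (lat, lng) pair; that signature, the use of half the 5% to 95% spread as the semi-axis, and the toy Gaussian head used to exercise it are all assumptions for illustration, not the demo's implementation.

```python
from statistics import NormalDist

import numpy as np


def uncertainty_ellipse(query_head, features, lo=0.05, hi=0.95):
    """Return (center, semi_axes) for an axis-aligned uncertainty ellipse.

    query_head(features, tau) is a hypothetical callable returning the
    (lat, lng) estimate at quantile level tau.
    """
    center = np.asarray(query_head(features, 0.5), dtype=float)  # median
    low = np.asarray(query_head(features, lo), dtype=float)
    high = np.asarray(query_head(features, hi), dtype=float)
    # Half the quantile spread per axis; abs() guards against crossings.
    semi_axes = np.abs(high - low) / 2.0
    return center, semi_axes


def toy_head(features, tau):
    """Stand-in for a trained head: exact Gaussian quantiles per axis."""
    mu, sigma = features
    return np.asarray(mu) + np.asarray(sigma) * NormalDist().inv_cdf(tau)


mu = (41.88, -87.63)
sigma = (1e-4, 3e-4)  # more uncertainty along the lng axis
center, semi_axes = uncertainty_ellipse(toy_head, (mu, sigma))
```

In this toy case the second semi-axis is three times the first, which would render as an ellipse stretched along the street the model is unsure about.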

Within-cell accuracy for the SQR stage was 97%: when the coarse classifier picks the right cell (which it does 92% of the time), the fine regressor’s point estimate lands within one cell of the truth.

The Training Data Story

The first version of this system used Google Street View imagery scraped from the geofence. Accuracy on real dashcam footage was ~1%.

The problem isn’t the model architecture. The problem is that Street View drives through each block roughly once every year or two. That means every image of a given location shares the same lighting conditions, the same traffic density, and roughly the same season. A model trained on it learns to classify the foreground — the specific cars parked that day, the specific shadows — rather than the durable landmarks: facades, signage, corner geometry.

Switching to Mapillary — crowdsourced, geotagged street-level imagery — fixed this. Mapillary images of the same location come from different contributors, different times of day, different weather, and different cameras. Without any explicit augmentation for foreground invariance, the model is forced to fit to what’s stable: the building geometry and fixed signage, not the cars or pedestrians. Accuracy on real dashcam footage jumped from 1% to 92%.

The lesson generalizes: when you want a model to be invariant to something, the easiest way is to make sure your training data varies. Aggressive augmentation can compensate, but there’s no substitute for real distributional diversity.

Temporal Stability in the Pipeline

A raw frame-by-frame classifier produces a jittery output: the predicted cell flips between neighboring cells on every frame. Two mechanisms stabilize this for the demo.

Cell hysteresis. The displayed cell only changes when the classifier has been confident about a new cell for N consecutive frames. Low-confidence raw predictions are filtered out. This alone removes most of the visible jitter.

Gated EMA smoothing. A Miller DSL filter applies an exponential moving average to the within-cell position estimate, with a gate that rejects innovations larger than ~180m (roughly the size of a cell). This prevents the displayed marker from jumping across the map when the classifier briefly mislocalizes. After 6 consecutive rejected frames, the filter interprets the move as genuine and snaps to the new estimate.
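The two mechanisms can be sketched in plain Python as follows. This is not the demo's filter code: the class names, the confidence threshold, the value of N, the EMA weight `alpha`, and the choice to snap on the sixth rejected frame itself are assumptions. Only the 180m gate and the count of 6 come from the description above. Positions are assumed to be in meters in a local frame.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class CellHysteresis:
    min_frames: int = 5          # "N" in the text; illustrative value
    min_confidence: float = 0.6  # hypothetical threshold
    displayed: Optional[int] = None
    candidate: Optional[int] = None
    streak: int = 0

    def update(self, cell: int, confidence: float) -> Optional[int]:
        # Low-confidence frames and frames agreeing with the displayed
        # cell both cancel any pending switch.
        if confidence < self.min_confidence or cell == self.displayed:
            self.candidate, self.streak = None, 0
            return self.displayed
        if cell == self.candidate:
            self.streak += 1
        else:
            self.candidate, self.streak = cell, 1
        if self.streak >= self.min_frames:
            self.displayed = cell
            self.candidate, self.streak = None, 0
        return self.displayed


@dataclass
class GatedEMA:
    alpha: float = 0.3     # illustrative smoothing weight
    gate_m: float = 180.0  # innovation gate from the text
    max_rejects: int = 6   # consecutive rejections before snapping
    state: Optional[Tuple[float, float]] = None
    rejects: int = 0

    def update(self, z: Tuple[float, float]) -> Tuple[float, float]:
        if self.state is None:
            self.state = z
            return self.state
        if math.dist(z, self.state) > self.gate_m:
            self.rejects += 1
            if self.rejects >= self.max_rejects:
                # Treat the persistent jump as genuine and snap to it.
                self.state, self.rejects = z, 0
            return self.state
        self.rejects = 0
        x, y = self.state
        self.state = (x + self.alpha * (z[0] - x), y + self.alpha * (z[1] - y))
        return self.state


hyst = CellHysteresis(min_frames=3, min_confidence=0.5)
frames = [(7, 0.9), (7, 0.9), (7, 0.9), (8, 0.9), (7, 0.9),
          (8, 0.9), (8, 0.3), (8, 0.9), (8, 0.9), (8, 0.9)]
shown = [hyst.update(cell, conf) for cell, conf in frames]

ema = GatedEMA(alpha=0.5)
track = [ema.update(p) for p in [(0.0, 0.0), (10.0, 0.0)] + [(1000.0, 0.0)] * 6]
```

In the toy run, the displayed cell ignores a single-frame flip to cell 8 and a low-confidence frame, and only switches after three confident frames in a row. The smoothed position holds still through five far-away measurements and snaps on the sixth.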

The demo video is also spliced from different camera sources, so a jump-cut detector (HSV histogram correlation) resets both the hysteresis state and the EMA smoother when it detects an abrupt scene change — otherwise the smoother would glide the marker across the map through an artificial transition.
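A jump-cut check based on histogram correlation could look like the sketch below. It uses NumPy only and assumes frames are already converted to HSV with OpenCV-style ranges (hue in [0, 180), saturation in [0, 256)). The bin counts, the 0.5 threshold, and the function names are assumptions, and the synthetic frames exist only to exercise the code.

```python
import numpy as np


def hs_histogram(hsv, bins=(30, 32)):
    """Normalized hue-saturation histogram of an (H, W, 3) HSV frame."""
    hue = hsv[..., 0].ravel()
    sat = hsv[..., 1].ravel()
    hist, _, _ = np.histogram2d(hue, sat, bins=bins,
                                range=[[0, 180], [0, 256]])
    return hist.ravel() / max(hist.sum(), 1.0)


def histogram_correlation(a, b):
    """Pearson correlation between two flattened histograms."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    # Two flat histograms carry no evidence of a cut.
    return float((a * b).sum() / denom) if denom > 0 else 1.0


def is_jump_cut(prev_hsv, curr_hsv, threshold=0.5):
    """Flag a cut when consecutive frames have dissimilar color statistics."""
    corr = histogram_correlation(hs_histogram(prev_hsv), hs_histogram(curr_hsv))
    return corr < threshold


def synthetic_frame(rng, hue_range, shape=(48, 64)):
    """Random HSV frame whose hue is confined to hue_range (test data only)."""
    frame = np.zeros(shape + (3,), dtype=np.uint8)
    frame[..., 0] = rng.integers(hue_range[0], hue_range[1], size=shape)
    frame[..., 1] = rng.integers(100, 150, size=shape)
    frame[..., 2] = 200
    return frame


rng = np.random.default_rng(0)
warm_a = synthetic_frame(rng, (10, 30))
warm_b = synthetic_frame(rng, (10, 30))   # same scene statistics
cool = synthetic_frame(rng, (100, 120))   # very different scene

same_scene_cut = is_jump_cut(warm_a, warm_b)
new_scene_cut = is_jump_cut(warm_a, cool)
```

When a cut is flagged, the pipeline would clear the hysteresis state and the smoother so that the first estimate after the cut is accepted as-is.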

Why This Matters Beyond Geolocation

The geolocation use case is a good demo precisely because uncertainty is visually legible — you can watch the ellipse grow when the model is confused and shrink when it’s confident, and it corresponds to something real about the scene.

But SQR is now a first-class training task in protege (alongside classification, detection, segmentation, and the rest). The more general capability is: any regression target that benefits from calibrated uncertainty bounds — object counts on a metrics dashboard, dwell time estimates, throughput predictions — can now emit error bars that empirically follow the distribution of the training data, not a hand-tuned heuristic.

For customer-facing dashboards showing business metrics (cars in a lot, items processed on a line), this means the uncertainty displayed alongside a number is traceable back to how variable that number actually was in training data. That’s a meaningfully different claim than “±10%, trust us.”

What’s Still Missing

Two things would make this significantly more useful for production dashcam localization:

Temporal context in the model: Right now, each frame is classified independently. The classifier doesn’t know where the vehicle was in the previous frame, which means it can’t apply motion constraints to rule out teleportation. Training with temporal context — even just feeding a short frame stack — would eliminate a large fraction of the remaining errors.

Foreground invariance: Despite Mapillary’s distributional diversity, dynamic foreground elements (pedestrians, other vehicles, lighting transients) still influence the cell decision more than they should. Street View, for all its weakness as a training source (a single capture per location), has real assets of its own: its geo-registration is precise and its coverage is dense. The missing ingredient is synthetic foreground variation. Inpainting models like InstructPix2Pix, Cosmos, or Gemini Omni (which has a native multimodal backbone rather than routing edits through a separate video generation model) could remove or replace pedestrians and parked cars in each Street View frame while leaving the building geometry and signage intact, producing a large, well-registered corpus with the scene diversity that forces the model to fit landmarks rather than transient content.

The classifier and SQR regression head are both trained via protege’s train_from_config interface. The full inference pipeline runs as an openfilter filter chain: VideoIn → GeoLocator → TemporalSmoother → MinimapRenderer → Webvis.
