
Structured Uncertainty Similarity Score (SUSS): Learning a Probabilistic, Interpretable, Perceptual Metric Between Images

Paper Code Checkpoints Usage Examples

Abstract

Perceptual similarity scores that align with human vision are critical for both training and evaluating computer vision models. Deep perceptual losses, such as LPIPS, achieve good alignment but rely on complex, highly non-linear discriminative features with unknown invariances, while hand-crafted measures like SSIM are interpretable but miss key perceptual properties.

We introduce the Structured Uncertainty Similarity Score (SUSS), which models each image through a set of perceptual components, each represented by a structured multivariate Normal distribution. These distributions are trained in a generative, self-supervised manner to assign high likelihood to human-imperceptible augmentations. The final score is a weighted sum of component log-probabilities, with weights learned from human perceptual datasets. Unlike feature-based methods, SUSS learns image-specific linear transformations of residuals in pixel space, enabling transparent inspection through decorrelated residuals and sampling.

SUSS aligns closely with human perceptual judgments, shows strong perceptual calibration across diverse distortion types, and provides localized, interpretable explanations of its similarity assessments. We further demonstrate stable optimization behavior and competitive performance when using SUSS as a perceptual loss for downstream imaging tasks.

Methods

SUSS learns a perceptual similarity score by modelling how each image can vary under small, human-imperceptible transformations. Instead of relying on deep feature embeddings with unknown invariances, we use Structured Uncertainty Prediction Networks (SUPN) to predict multivariate Normal distributions over perceptually close versions of an image. This generative formulation allows us to explain the score directly in pixel space through residuals, samples, and spatial relevance maps.

Overview of the SUSS architecture
Our SUPN-UNet predicts the mean and structured covariance of multi-scale luminance (Y) and colour (Cb, Cr) components. SUSS evaluates the similarity as a weighted sum of component log-likelihoods.

Perceptual components. We work in YCbCr and model luminance at three spatial scales, capturing fine-to-coarse structure, while chrominance is modelled at a lower resolution to reflect reduced colour sensitivity. These components form the basis of the final score.
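As a toy illustration of this decomposition, the following NumPy sketch builds the colour-space split and a multi-scale luminance pyramid. This is our own minimal sketch, not the repo's implementation; the BT.601 conversion coefficients and 2x average-pooling pyramid are assumptions about the general approach.

```python
import numpy as np

def rgb_to_ycbcr(rgb: np.ndarray) -> np.ndarray:
    """Convert an (H, W, 3) RGB image in [0, 1] to YCbCr (ITU-R BT.601)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 0.5
    cr =  0.5 * r - 0.418688 * g - 0.081312 * b + 0.5
    return np.stack([y, cb, cr], axis=-1)

def luminance_pyramid(y: np.ndarray, n_scales: int = 3) -> list:
    """Fine-to-coarse pyramid of the Y channel via 2x average pooling."""
    scales = [y]
    for _ in range(n_scales - 1):
        h, w = scales[-1].shape
        cropped = scales[-1][: h // 2 * 2, : w // 2 * 2]
        scales.append(cropped.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)))
    return scales
```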

Learning perceptual invariances. SUPN is trained in a self-supervised manner to represent the distribution of small geometric and colour transformations that preserve perceptual content. These augmentations follow psychophysical findings (e.g., JND-calibrated intensity levels). The model predicts a mean image and a sparse Cholesky-factored precision matrix, enabling efficient modelling of correlated spatial and cross-channel structure in pixel space.

Similarity through log-likelihoods. To compare images X and Y, SUSS evaluates how likely each component of Y is under the learned distribution around X. The component log-probabilities are combined using learned non-negative weights derived from human perceptual datasets. This yields a Mahalanobis-like perceptual distance that is smooth, interpretable, and aligned with human judgment.
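Schematically, the final score is just a weighted sum of per-component log-probabilities. The sketch below is a toy illustration; the component names and the non-negativity check are our assumptions, not the repo's API:

```python
def suss_score(component_log_probs: dict, weights: dict) -> float:
    """SUSS(X, Y) = sum_k w_k * log p_k(Y_k | X), with learned w_k >= 0.
    Components might be e.g. luminance at three scales plus Cb and Cr."""
    assert all(w >= 0 for w in weights.values()), "weights must be non-negative"
    return sum(weights[k] * lp for k, lp in component_log_probs.items())
```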

Because the model is fully defined through Normal distributions in pixel space, SUSS remains explainable: we can inspect whitened residuals, generate samples, and visualise pixel-wise contributions through SUSS maps.

Try the SUSS Metric

  1. Download a checkpoint: pick a SUSS variant (Base, PieAPP, PIPAL, SR) from checkpoints.html, note the path you saved it to, and note the matching w weight vector for that variant.
  2. Use it, replacing load_path and w below with your chosen variant's values:

    As a metric (stop grads)

    from metrics.metric import SUSS
    
    # stop_grads=True detaches the score from the computation graph,
    # so it is used purely for evaluation.
    suss_model = SUSS(
        load_path="/path/to/your/suss_variant.pth",  # downloaded checkpoint
        w=[...],                                     # weight vector for that variant
        testing=True,
        stop_grads=True,
    )
    
    # Weighted sum of component log-likelihoods: higher = more similar.
    score = suss_model.get_log_p_weighted(img_ref, img_dist)

    As a loss (keep grads)

    from metrics.metric import SUSS
    
    # stop_grads=False keeps gradients, so the negated score can be
    # minimised as a perceptual loss.
    suss_model = SUSS(
        load_path="/path/to/your/suss_variant.pth",  # downloaded checkpoint
        w=[...],                                     # weight vector for that variant
        testing=True,
        stop_grads=False,
    )
    
    # Negate the weighted log-likelihood: minimising the loss
    # maximises perceptual similarity to the target.
    loss = -suss_model.get_log_p_weighted(target, prediction)
    loss.backward()

For training details and data paths, see the paper and SUPNMetric repo.

Interpretability Examples

A key advantage of SUSS is that its similarity score can be interpreted directly in pixel space. Because each component is modelled as a multivariate Normal distribution, we can visualise how the learned covariance structure shapes residuals, highlights perceptually relevant differences, and defines the space of variations the model considers perceptually similar. Below are three main interpretability tools used in the paper:

Whitened residuals

Whitened residuals example

SUSS forms the raw residual R = Y − μ(X) and applies the learned Cholesky factor L(X) to obtain whitened residuals. These show how the model decorrelates and rescales differences based on its learned perceptual invariances. Irrelevant changes (e.g., small shifts or noise) are suppressed, while differences the model has learned to be perceptually important become more prominent. The heatmap lets us examine regions that contribute strongly to lowering the log-likelihood.
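In code, whitening is a single triangular product. The sketch below uses the same dense-precision convention as the paper's notation (Lambda = L L^T); `whitened_residual` is a hypothetical helper for illustration, not the repo's API:

```python
import numpy as np

def whitened_residual(y, mu, L):
    """z = L(X).T @ (Y - mu(X)). Under the model Y ~ N(mu, (L @ L.T)^-1),
    z is standard normal: differences are decorrelated and rescaled, so
    modelled (perceptually irrelevant) variation is absorbed."""
    return L.T @ (y - mu)
```

A residual that the model expects, i.e. one of the form r = L^{-T} eps, whitens back to exactly eps, which is why modelled correlations are suppressed in the visualisations.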

SUSS map

SUSS relevance map

The SUSS map combines the squared whitened residuals from all luminance scales and colour components into a single spatial relevance map using the learned SUSS weights. It provides a pixel-level explanation of the final score: areas with high values are where the compared image deviates most from the perceptual invariance structure learned around the reference. These correspond to the regions that drive the similarity judgement.
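A minimal sketch of how such a map could be assembled from per-component whitened residuals. The nearest-neighbour upsampling and per-component shapes are our assumptions for illustration, not the paper's exact procedure:

```python
import numpy as np

def suss_map(whitened, weights, out_shape):
    """Combine squared whitened residuals from several components into one
    full-resolution relevance map: upsample each (nearest-neighbour),
    scale by its learned SUSS weight, and sum."""
    H, W = out_shape
    total = np.zeros((H, W))
    for z, w in zip(whitened, weights):
        sq = z ** 2                      # per-pixel contribution to -log p
        ry, rx = H // sq.shape[0], W // sq.shape[1]
        total += w * np.repeat(np.repeat(sq, ry, axis=0), rx, axis=1)
    return total
```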

Samples from the learned distributions

Samples from SUPN distributions

Because SUSS models each component as a multivariate Normal, we can draw samples to visualise the “perceptual neighbourhood” around an image. Close samples preserve fine structural detail; medium and far samples show progressively stronger variations but remain perceptually plausible. These samples illustrate the type and strength of changes the model assigns high probability to, i.e., which transformations it treats as perceptually irrelevant.
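Sampling from a Gaussian parameterised by a precision Cholesky factor needs only one triangular solve. The dense NumPy sketch below is our own illustration; SUPN's sparse factors make the same step efficient at image scale:

```python
import numpy as np

def sample_from_precision(mu, L, rng):
    """Draw x ~ N(mu, Sigma) with Sigma = (L @ L.T)^-1:
    if eps ~ N(0, I) and we solve L.T @ d = eps, then
    Cov(d) = L^-T @ L^-1 = (L @ L.T)^-1 = Sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.linalg.solve(L.T, eps)
```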