Image discriminability models predict the visibility of the difference between a pair of images. We compare the ability of two basic models to predict the detectability of objects in natural backgrounds: a multiple channel Cortex transform model with within-channel masking and a single channel contrast sensitivity filter model. Minkowski summation of differences was implemented with three different exponents: 2 (root mean square difference), 4, and infinity (maximum difference). Each method was also tried with a simple contrast gain control normalization. Without contrast normalization, the multiple channel model with a summation exponent of 4 performed best. The predictions of both models improved with contrast normalization. With contrast normalization, at their best exponent of 4, the two models performed equally well.
There are a number of image discriminability models for predicting the visibility of the difference between a pair of images [1]. Many applications, such as the quality of imaging radar displays, are concerned with object detection and recognition. Object detection involves looking for one of a large set of object sub-images in a large set of background images and has been approached from this general point of view [2]. We show that discrimination models can predict the relative detectability of objects in different images, suggesting that these simpler models may be useful in some object detection and recognition applications. Here we compare two models that give measures of image discriminability. The first is a multiple spatial frequency channel model based on the Cortex transform with within-channel masking [3,4,5]. It is similar to the models of Lubin and Daly [6,7]. The second is a single channel contrast sensitivity function (CSF) filter model. Each model was tested with three different Minkowski summation exponents: 2, 4, and infinity. The exponent of 2 corresponds to the familiar root-mean-square (RMS) difference metric. The Minkowski sum with an exponent of infinity corresponds to the maximum difference.
Stimuli. Six images of a vehicle in an otherwise natural setting were altered by replacing the vehicle with appropriate background imagery. Two object images having lower levels of detectability were constructed from each image pair by mixing the object and non-object images. The mixing proportions were selected individually for the images and chosen to be near threshold detectability. The 510x480 pixel images were presented on a 13 inch Macintosh color monitor at a viewing distance corresponding to 95 pixels per degree of visual angle.
Observers. The observers were 19 male soldiers, aged 18 to 32 years. Their acuities were 20/20 or better, and they had normal color vision.
Procedure.
Observers were asked to rate each of the 24 images on a
4 point rating scale according to the following interpretation:
1-A target was definitely in the scene.
2-There was something in the scene that probably was a target.
3-There was something in the scene but it probably was not a target.
4-There was definitely no target in the scene.
One group of 10 observers saw each image 20 times at a duration of 1 sec. A second group of 9 observers saw each image 10 times at a duration of 0.5 sec and 10 times at a duration of 2 sec. The sequence of 480 images was completely randomized separately for each observer.
Methods.
The distance in discriminability units
from each object image to its non-object image
was measured in the context of
a one-dimensional Thurstone scaling model [8].
The scaling model has the following assumptions:
1. Internal stimulus values have a
normal distribution with unit variance.
2. Distances between stimulus distributions for mixtures
of stimuli are proportional to the mixtures.
3. All subjects have the same pattern of
distances between the stimulus distributions.
They differ only by a multiplicative subject
sensitivity factor.
4. Category boundaries vary across subjects,
but are constant over stimuli.
The scaling model has one image discriminability parameter for each of the 6 image sets and one sensitivity factor and 3 category boundaries for each of the observers. Observers tested with the two different stimulus durations were given two sensitivity factors. Parameters were estimated by the method of maximum likelihood.
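As a rough illustration of the scaling model, the sketch below computes the probabilities of the four ratings for one stimulus and one observer. The function names, the example boundary values used in any call, and the convention that larger internal values map to rating 1 ("definitely a target") are assumptions made for illustration; they are not details given in the text.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def mixture_distance(full_distance, proportion):
    """Assumption 2: distance scales with the object mixing proportion."""
    return full_distance * proportion

def rating_probabilities(distance, sensitivity, boundaries):
    """Probabilities of ratings 1..4 for one stimulus and one observer.

    distance    -- stimulus distance from the non-object image (d' units)
    sensitivity -- multiplicative observer sensitivity factor (assumption 3)
    boundaries  -- three increasing category boundaries (assumption 4)
    """
    # Internal value is normal with unit variance around this mean (assumption 1).
    mean = sensitivity * distance
    # Cumulative probability of falling below each boundary.
    cum = [0.0] + [normal_cdf(b - mean) for b in boundaries] + [1.0]
    low_to_high = [cum[i + 1] - cum[i] for i in range(4)]
    # Assumed mapping: lowest interval is rating 4, highest is rating 1.
    return list(reversed(low_to_high))
```

Under these assumptions, a maximum likelihood fit would sum the logarithms of these probabilities over all recorded ratings and maximize that sum over the image, sensitivity, and boundary parameters.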
Table 1 - Experimental discriminability indices (d').
image pair      A      B      C      D      E      F
n=10          4.1   10.3    3.7    6.7    4.5    3.7
n=9           5.5   10.3    4.7    8.5    4.9    4.7
n=19          4.8   10.3    4.2    7.6    4.7    4.2
For the 10 observer group, the ratio of the best observer sensitivity factor to the median observer sensitivity factor was 1.5 and to the worst observer sensitivity factor was 3.3. For the 9 observer group these ratios were 1.9 and 4.1, respectively. The sensitivities measured for the 0.5 and 2.0 sec durations were neither appreciably nor significantly different from each other.
Multiple channel model.
The multiple channel model calculation has the following steps:
1) The images are converted to luminance contrast by
subtracting and then dividing by the background image
mean luminance.
2) A CSF filter is then applied to both images.
3) The cortex transform is then applied to both images.
4) In the cortex transform domain,
the differences between the transforms of the object and background
images are then divided by the absolute value of the corresponding
coefficient from the background image to the 0.7 power if that absolute
value is greater than one (threshold).
5) The absolute values of these scaled cortex transform coefficient
differences are raised to a power, summed, and then raised to the
inverse of that power.
For the case that the exponent is infinity, the maximum absolute difference
is computed.
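The within-channel masking rule of step 4 can be sketched as follows. This is a minimal sketch that assumes the cortex transform coefficients are already available as arrays and are scaled so that a magnitude of one corresponds to threshold; the function and argument names are hypothetical, and the cortex transform itself is not implemented here.

```python
import numpy as np

def masked_differences(object_coeffs, background_coeffs,
                       exponent=0.7, threshold=1.0):
    """Scale coefficient differences by the background coefficient magnitude.

    Where |background| exceeds the threshold, the difference is divided by
    |background| ** exponent; elsewhere it is left unchanged.
    """
    obj = np.asarray(object_coeffs, dtype=float)
    bg = np.asarray(background_coeffs, dtype=float)
    magnitude = np.abs(bg)
    # Divisor is 1 at or below threshold, so no masking is applied there.
    divisor = np.where(magnitude > threshold, magnitude ** exponent, 1.0)
    return (obj - bg) / divisor
```

For example, a difference of 1.0 on a background coefficient of 2.0 is reduced to 1.0 / 2.0 ** 0.7, while a difference on a background coefficient of 0.25 passes through unscaled.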
Single channel model.
For the single channel model, the steps are as follows:
1) The images are converted to luminance contrast by
subtracting and then dividing by the background image
mean luminance.
2) A CSF filter is then applied to both images.
3) In the image domain,
the differences between the filtered object and background
images are computed.
4) The absolute values of these
differences are raised to a power, summed, and then raised to the
inverse of that power.
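The single channel steps can be sketched as below. This is an illustration under stated assumptions: the CSF filter is passed in as a caller-supplied function (the default does no filtering, since the calibrated filter is not reproduced here), and all function names are hypothetical. The pooling follows the steps as written (sum, not mean), and an infinite exponent is handled as the maximum absolute difference.

```python
import numpy as np

def to_contrast(luminance, background_mean):
    """Step 1: luminance contrast relative to the background image mean."""
    return (np.asarray(luminance, dtype=float) - background_mean) / background_mean

def minkowski_pool(differences, beta):
    """Raise absolute differences to beta, sum, and take the 1/beta power."""
    v = np.abs(np.asarray(differences, dtype=float)).ravel()
    if np.isinf(beta):
        return float(v.max())  # limiting case: maximum absolute difference
    return float(np.sum(v ** beta) ** (1.0 / beta))

def single_channel_distance(object_lum, background_lum, beta,
                            csf_filter=lambda image: image):
    """Steps 1-4 with a placeholder CSF filter (identity by default)."""
    bg = np.asarray(background_lum, dtype=float)
    mean_lum = bg.mean()
    obj_c = csf_filter(to_contrast(object_lum, mean_lum))
    bg_c = csf_filter(to_contrast(bg, mean_lum))
    return minkowski_pool(obj_c - bg_c, beta)
```

For a fixed set of differences the pooled value does not increase as the exponent grows from 2 to 4 to infinity, which is one reason each exponent needs its own calibration.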
The CSF filters were calibrated separately for each of the 6 combinations of model (multiple or single channel) and summation exponent (2, 4, or infinity). They were designed to fit the predictions of Barten's CSF formula [11] for 1.33 deg square grating patches at five spatial frequencies, one centered in each of the five bandpass channels of the multiple channel model. These contrast sensitivities appear in Table 2 and are graphed in Figure 2.
Table 2 - Calibration contrast sensitivities.
Predictions of the multiple and single channel models for the discriminability in d'
units of the object image from the background image are
given in Table 3 for each of the three summation exponents.
Table 3 - Model discriminability indices (d')
Least squares predictions of the observer discriminabilities
from the model predictions were computed in the log domain,
assuming only an additive constant (discriminability domain
multiplicative factor).
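A least squares fit with only an additive constant in the log domain reduces to averaging the log ratios, so the multiplier is the geometric mean of the observer-to-model ratios. The sketch below illustrates this using the multiple channel, exponent 4 predictions from Table 3 and the pooled n=19 results from Table 1. The function name is hypothetical, and dividing by n-1 when computing the standard error is an assumption.

```python
import math

def log_domain_fit(model_d, observer_d):
    """Fit log10(observer) = log10(model) + constant by least squares.

    Returns the equivalent multiplicative factor and the standard error
    of prediction in log10 units.
    """
    log_ratios = [math.log10(o / m) for m, o in zip(model_d, observer_d)]
    n = len(log_ratios)
    constant = sum(log_ratios) / n  # least squares additive constant
    residual_ss = sum((r - constant) ** 2 for r in log_ratios)
    se_log = math.sqrt(residual_ss / (n - 1))  # n-1 is an assumption
    return 10.0 ** constant, se_log

model = [20.4, 32.1, 17.8, 17.0, 12.7, 15.3]   # Table 3, multiple channel, exponent 4
observed = [4.8, 10.3, 4.2, 7.6, 4.7, 4.2]     # Table 1, pooled n=19 group
multiplier, se_log = log_domain_fit(model, observed)
```

With these rounded table values the multiplier should come out close to the 0.30 listed for this model in Table 4.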
Analyses in the discriminability domain show that
neither constant terms nor squared terms significantly
improve the fits.
The multiplicative factors for predicting each of the
two groups and their average results are shown for
each model in Table 4.
Table 4 - Multipliers for model d' to predict observer d'.
The multiple channel model has correction factors closer
to one,
indicating that its within-channel masking allows
it to make better absolute predictions of observer data
when calibrated for absolute contrast threshold detection.
The standard errors of the log predictions converted to
percentage error in discriminability units are
shown in Table 5.
A standard error in the log domain of 0.3 log units
corresponds to a factor of 2 in the discriminability domain
and an error of 100 percent.
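The conversion from a log-domain standard error to a percentage error can be written out as a short sketch. It assumes base 10 logarithms, which is consistent with 0.3 log units corresponding to a factor of 2; the function name is hypothetical.

```python
def log_error_to_percent(se_log10):
    """Convert a standard error in log10 units to a percentage error."""
    factor = 10.0 ** se_log10      # multiplicative error in d' units
    return (factor - 1.0) * 100.0  # a factor of 2 is a 100 percent error

percent = log_error_to_percent(0.3)  # roughly 100 percent
```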
Table 5 - Prediction errors in percent.
Table 6 - RMS image contrast (%).
Table 7 contains the contrast normalized model predictions
in discriminability (d') units.
Table 7 - Contrast normalized model discriminabilities (d').
Table 8 has
values of a in percent contrast,
prediction multipliers, and
the resulting errors shown as a percentage of the
pooled group discriminabilities.
The values of a are smaller for the single channel
model.
Table 8 - a, prediction multipliers, and errors for contrast normalized models.
In principle, the values of a
should be smaller (stronger contrast
masking) for the single channel model, because the
multiple channel model already has within-channel
contrast masking.
When a was zero, the division by c, a
number less than one, made the multiplier
even more different from unity.
However, when a was nonzero, the factors were all closer
to unity than they were prior to contrast normalization.
The predictions are now only too large by a factor
near two.
The best fitting a was chosen to minimize the standard error
of prediction while still allowing an arbitrary multiplicative factor,
so the choice of a was not affected by whether the
absolute prediction was correct.
After contrast normalization, both models performed much better and
did equally well at their optimal summation exponent of 4.
Figure 4
shows these results plotted as in Figure 3.
An implicit parameter of the contrast gain correction is the
size of the region over which the contrast is computed.
In this case it is a 2.7 deg square, not out of line with
psychophysical measurements of the width of the
contrast gain control region [13,14].
Trials were run in blocks of 60, using one
vehicle and its background.
Each of four images (no-vehicle, 10%, 20%, and 40% vehicle)
was presented with probability 0.25.
Before each set of 10 trials, a trial was given with the 100% vehicle
image as a memory aid.
Six blocks were run for each of the 6 image pairs
in a 6x6 Latin square design,
randomized separately for each observer.
The image duration was 1.0 sec.
Table 9 - Discrimination experiment d's.
Figure 5
plots the average group object detection d's against the average
image discrimination d's of these two observers.
Table 2 values:
cycles per degree       1.125   2.25    4.5      9     18
1/contrast threshold       77    122    147    122     54
Model predictions and results
Table 3 values (model d'):
channels  exp.       A      B      C      D      E      F
multiple  2       32.4   45.4   31.5   19.0   19.7   20.8
multiple  4       20.4   32.1   17.8   17.0   12.7   15.3
multiple  inf     25.3   28.6   21.6   21.9   13.8   16.1
single    2       36.6   59.1   36.5   20.0   20.2   20.8
single    4       65.6   89.9   56.2   36.5   29.4   32.4
single    inf    145.2  216.3  135.2   87.8   32.0   62.3
Table 4 values (multipliers):
channels  exp.    n=10    n=9     n=19
multiple  2       0.19    0.23    0.21
multiple  4       0.28    0.33    0.30
multiple  inf     0.25    0.30    0.27
single    2       0.17    0.21    0.19
single    4       0.11    0.13    0.12
single    inf     0.053   0.064   0.059
Table 5 values (prediction errors in percent):
channels  exp.    n=10    n=9     n=19
multiple  2       51      47      48
multiple  4       33      28      30
multiple  inf     41      30      35
single    2       55      53      54
single    4       55      50      52
single    inf     86      79      82
The best model is the multiple channel model with
a summation exponent of 4.
The single channel model did very poorly with the
maximum rule.
Figure 3
shows plots of the predictions of the average
group detectabilities for the 6 image pairs
for both models
using the best summation exponent (4).
The error bars represent 95% confidence intervals
for the mean of the two groups of subjects based on the variance
between the two groups.
Contrast normalization
Recent data have shown that contrast energy at other spatial
frequencies raises the threshold of grating increments and models
have been developed to account for this effect [12].
A simple way of allowing for such an effect is to multiply
the above predictions by a/sqrt(a^2 + c^2), where
c is the RMS background image contrast passed by the contrast
sensitivity function, and a is a parameter estimated from the
data.
To compute c, the CSF is normalized to unity at its peak value.
When the best estimate of a is zero, we simply divide the predicted discriminability
by the RMS contrast of the filtered background image to obtain a
contrast normalized prediction.
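The normalization can be sketched as a single function. The function name is hypothetical. Expressing a and c in percent contrast in the general case, and dividing by c as a fraction when a is zero, are assumptions chosen because they appear to reproduce Table 7 entries from the Table 3 predictions, the Table 6 contrasts, and the Table 8 values of a.

```python
import math

def contrast_normalize(prediction, a_percent, c_percent):
    """Apply the contrast gain control factor a / sqrt(a^2 + c^2).

    prediction -- unnormalized model discriminability (d')
    a_percent  -- gain control parameter a, in percent contrast
    c_percent  -- CSF-filtered RMS background contrast c, in percent
    """
    if a_percent == 0.0:
        # Degenerate case: divide by the RMS contrast (as a fraction, assumed).
        return prediction / (c_percent / 100.0)
    return prediction * a_percent / math.hypot(a_percent, c_percent)

# Image A, single channel model, exponent 4: a = 4.3, c = 30.7
normalized_a = contrast_normalize(65.6, 4.3, 30.7)
# Image A, single channel model, exponent 2: a = 0, c = 30.7
normalized_zero = contrast_normalize(36.6, 0.0, 30.7)
```

Because the factor shrinks as c grows, predictions for high contrast backgrounds are reduced the most, which is the intended masking effect.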
The CSFs for the single channel model with different exponents should
be the same except for a multiplicative constant, and the CSFs for
the multiple channel model and the single channel model should be the same
when the summation exponent is two.
As a result, we only show the filtered
RMS contrast values of the six background images
for the three different multiple channel filters in Table 6.
The values corresponding to a contrast exponent of two were used
for the three single filter model normalizations.
Table 6 values (RMS image contrast, %):
exp.       A      B      C      D      E      F
2       30.7   24.5   25.3   10.2   13.6   17.3
4       27.3   21.8   22.5    9.1   11.9   15.5
inf     15.4   11.9   12.1    4.8    6.3    8.9
Table 7 values (contrast normalized model d'):
channels  exp.       A      B      C      D      E      F
multiple  2        6.9   12.0    8.1   10.5    8.7    7.5
multiple  4        9.4   17.5    9.5   14.3    9.7   10.3
multiple  inf     13.0   17.6   13.1   19.4   11.4   11.6
single    2        119    241    143    196    148    120
single    4        9.0   15.5    9.4   14.1    8.8    7.8
single    inf      473    883    533    859    235    359
Table 8 values:
channels  exp.    a (%)   multiplier   error (%)
multiple  2        6.7    0.63         25
multiple  4       14.2    0.49         16
multiple  inf      9.2    0.40         26
single    2        0.0    0.0357       14
single    4        4.3    0.54         13
single    inf      0.0    0.0111       33
A discrimination experiment
The lower resolution gray scale images that were used as
input to the models were also shown to two observers in
a discrimination experiment.
Instead of presenting all the images in one completely randomized
sequence,
mixture sets based on each of the six vehicle images
were presented in separate blocks so that
the observers could respond to any visible difference and not
only rely on those that contributed to the detection of a vehicle.
Methods
The images were presented on a 15 inch Sony monitor using a lookup
table to match the luminance and gamma of the monitor used in
the object detection experiment described above.
The viewing distance was set to give 47.5 pixels per degree.
Intermediate images of 10%, 20%, and 40% vehicle were used
for all six images.
The observers, RH, a 29 year old male,
and BB, a female of 37 years,
had been refracted within 2 months of the experiment
to normal acuity.
The observers were asked to rate each image on a
4 point rating scale
according to the following interpretation:
1-Definitely the non-vehicle image.
2-Probably the non-vehicle image.
3-Probably the vehicle image.
4-Definitely the vehicle image.
Results
Discriminability parameter estimates scaled to represent the
distance (d') from the 100% vehicle image to the
non-vehicle image are given in Table 9
for the two observers
and for the geometric mean of their results.
The mean pattern is similar to that of the 19
detection observers (see Table 1).
The standard deviation of prediction of detection d's from
discrimination d's
in the log domain gives a prediction error of 19%.
The factor for predicting the detection results from
the discrimination results is 49%.
The factor near 50% needed to correct the model predictions
after contrast normalization can thus be regarded as the
factor needed to account for the difference between detection
and discrimination in this situation.
Table 9 values (discrimination d'):
image pair      A      B      C      D      E      F
RH            7.8   18.7    6.1   12.0    7.3    8.6
BB           16.3   20.9    9.6   16.7    8.9   14.2
Ave.         11.3   19.8    7.6   14.1    8.1   11.1
Conclusions
Discrimination models designed to answer,
"Are these two images different?"
can predict answers to the question,
"Is there an object in this image?"
Without global contrast masking,
the multiple channel model performs better than the single channel model
at predicting both absolute and relative discriminabilities.
Contrast gain normalization improves predictions for both models
and appears to obviate the need for multiple channels in
this situation.
Acknowledgements
R. Horng wrote and ran the MATLAB program that generated the
model and metric predictions.
Parts of this work have been reported previously [15,16,17].
This work was supported in part by NASA RTOPs 505-64-53 and 537-08-20
and NASA Cooperative Agreement NCC 2-307 with Stanford University.
References
1. A.J. Ahumada, Jr. (1993) Computational image quality metrics: a
review.
SID Digest, 24, 305-308.
2. H.H. Barrett (1992)
Evaluation of image quality through linear discriminant models.
SID Digest, 23, 871-873.
3. A.B. Watson (1983)
Detection and recognition of simple spatial forms,
in O. J. Braddick and A. C. Sleigh, eds.,
Physical and biological processing of images,
Springer-Verlag, Berlin.
4. A.B. Watson (1987)
The Cortex transform: rapid computation of simulated neural images,
Computer Vision, Graphics, and Image Processing, 39, 311-327.
5. A.B. Watson (1987) Efficiency of an image code based on human vision.
JOSA A, 4, 2401-2417.
6. S. Daly (1993) The visible differences predictor:
an algorithm for the assessment of image fidelity,
in Watson, ed. Digital Images and Human Vision.
MIT Press, Cambridge, MA.
7. J. Lubin (1993) The use of psychophysical data and models
in the analysis of display system performance,
in Watson, ed. Digital Images and Human Vision.
MIT Press, Cambridge, MA.
8. W.S. Torgerson (1958)
Theory and Methods of Scaling, Wiley, New York.
9. A. J. Ahumada, Jr. (1987)
Putting the noise of the visual system back in the picture,
Journal of the Optical Society of America A,
vol. 4, pp. 2372-2378.
10. B. Girod (1989)
The information theoretical significance of spatial and temporal
masking in video signals,
B. E. Rogowitz, ed.,
Human Vision, Visual Processing, and Digital Display,
Proc. SPIE 1077, pp. 178-187.
11. P.G.J. Barten (1993)
Spatiotemporal model for the contrast sensitivity of the human eye
and its temporal aspects,
in B. Rogowitz and J. Allebach, eds.,
Human Vision, Visual Processing, and Digital Display IV,
Proc. Vol. 1913, SPIE, Bellingham, WA, pp. 2-14.
12. J.M. Foley (1994)
Human luminance pattern-vision mechanisms: masking experiments
require a new model,
Journal of the Optical Society of America A,
vol. 11, pp. 1710-1719.
13. J.S. DeBonet, Q. Zaidi (1994)
Weighted spatial integration of induced contrast-contrast,
Investigative Ophthalmology and Visual Science,
vol. 35 (ARVO Suppl.), p. 1667.
14. M. D'Zmura, B. Singer, L. Dinh, J. Kim, J. Lewis (1994)
Spatial sensitivity of contrast induction mechanisms,
Optics and Photonics News,
vol. 5,
no. 8 (suppl),
p. 48 (abs).
15. A.B. Watson and A.J. Ahumada, Jr. (1994)
A modular, portable model of image fidelity,
Perception, vol. 23,
ECVP Suppl., p. 95 (Abs.).
16. A.M. Rohaly, A.J. Ahumada, Jr., and A.B. Watson (1994)
Visual detection in natural backgrounds,
Optics and Photonics News, 5,
(OSA Annual Meeting Suppl.), 48 (Abs.).
17. A.J. Ahumada, Jr., A.B. Watson, A.M. Rohaly (1995)
Models of human image discrimination predict
object detection in natural backgrounds,
in B. Rogowitz and J. Allebach, eds.,
Human Vision, Visual Processing, and Digital Display VI,
Proc. Vol. 2411, SPIE, Bellingham, WA, pp. 355-362.