Observers viewed a simulated airport runway landing scene with an obstructing aircraft on the runway and rated the visibility of the obstructing object in varying levels of white fixed-pattern noise. The effect of the noise was compared with the predictions of single and multiple channel discrimination models. Without a contrast masking correction, both models predict almost no effect of the fixed-pattern noise. A global contrast masking correction improves both models' predictions, but the predictions are best when the masking correction is based only on the noise contrast (does not include the background image contrast).
Object detection typically involves search and pattern recognition in a range of backgrounds. Visual object detection is fundamentally limited by background-induced contrast masking. When the object is present or absent in a constant background, contrast masking can be measured as the discriminability between two images. We are evaluating the ability of image discrimination models to predict object visibility with a fixed background image. If the models are successful, they predict the upper limit of observer performance in an object detection task.
Ahumada, Rohaly, and Watson [1] applied discrimination models to object detection in natural backgrounds. We reported that the detectability of tank targets was better predicted by a multiple channel model than by a single channel model. We then added a simple correction for masking based on visible contrast energy. It improved the predictions for both models and equalized their performance[2,3,4].
Some object detection situations involve noisy displays. Here we measure object detectability in a complex image masked by fixed-pattern noise. We compare these measurements with discrimination model predictions. Without the masking correction, the single channel model predicts no effect of noise and the multiple channel model predicts masking only by the noise in the channels affected by the object. So, neither model correctly predicts the effect of the fixed-pattern noise. With the masking correction, both models' predictions are improved. The predictions are even better when the masking correction is based only on the noise contrast and does not include the background image contrast.
2.1.1 Stimuli.
Two digital images of a simulated airport scene were generated.
Image I1, shown at the
top of Figure 1,
has an obstructing aircraft on the
runway.
Image I0, shown in the
middle of Figure 1,
is the same image without the obstructing
aircraft.
We used a single fixed-pattern white noise mask N with uniformly
distributed pixel values.
Images for the experiment
were constructed from these images by adding the background
image, a fraction p of the difference between the background
and the object images, and a fraction q of the noise image,
$$I(p,q) = I_0 + p\,(I_1 - I_0) + q\,N + (1-q)\,N' . \tag{1}$$
N' is the mean of the noise image. A fraction of N' is added to keep the mean luminance constant. Images were generated for the six p values 0, 0.05, 0.10, 0.20, 0.40, and 1, and for the q values 0, 0.25, 0.50, and 1.0. The image at the bottom of Figure 1 illustrates the case of p=1 and q=0.5. The 128x128 pixel gray-scale images were presented on a 15 inch Sony color monitor whose luminance in cd/m^2 was closely approximated by
$$L = 0.05 + (0.024\,d)^{2.4} , \tag{2}$$
where d is the digital image pixel value.
The mean luminance of the images and
surrounding screen region was about 10 cd/m^2.
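As a rough illustration of Equations (1) and (2), the following Python sketch composes one stimulus pixel and converts a digital value to luminance. The function names and the sample pixel values are hypothetical; only the two formulas are taken from the text.

```python
def compose_pixel(i0, i1, n, n_mean, p, q):
    """Equation (1) for a single pixel: background, plus a fraction p of
    the object/background difference, plus a fraction q of the noise.
    The (1 - q) * n_mean term keeps the mean luminance constant."""
    return i0 + p * (i1 - i0) + q * n + (1.0 - q) * n_mean


def luminance(d):
    """Equation (2): display luminance in cd/m^2 for digital value d."""
    return 0.05 + (0.024 * d) ** 2.4


# Hypothetical pixel values, chosen only for illustration.
i0, i1, n, n_mean = 100.0, 60.0, 12.0, 8.0

object_no_noise = compose_pixel(i0, i1, n, n_mean, p=1.0, q=0.0)    # I1 + N'
background_noise = compose_pixel(i0, i1, n, n_mean, p=0.0, q=1.0)   # I0 + N

print(object_no_noise, background_noise)
print(round(luminance(109), 1))  # a value near the 10 cd/m^2 mean luminance
```

With q=0 the noise term vanishes and only its mean N' remains, so the no-noise and noise images share the same mean level.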
The viewing distance of 127.5 cm and the image size of 6.0 cm
give a viewing resolution of
47.5 pixels per degree of visual angle.
The plane/runway scene thus
subtended 2.7 deg of visual angle. The plane alone fit in a rectangle
0.78 deg by 0.17 deg of visual angle (37 horizontal by 8 vertical pixels)
and affected a total of 96 pixels.
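The viewing geometry can be checked with a few lines of Python. This is a minimal sketch that assumes the image is centered on the line of sight; the function name is hypothetical.

```python
import math


def pixels_per_degree(distance_cm, size_cm, size_px):
    """Return (pixels per degree, image size in degrees) for an image of
    width size_cm viewed from distance_cm and centered on the line of sight."""
    angle_deg = 2.0 * math.degrees(math.atan(size_cm / (2.0 * distance_cm)))
    return size_px / angle_deg, angle_deg


ppd, scene_deg = pixels_per_degree(127.5, 6.0, 128)
plane_width_deg = 37 / ppd
plane_height_deg = 8 / ppd

print(round(ppd, 1), round(scene_deg, 1),
      round(plane_width_deg, 2), round(plane_height_deg, 2))
```

The rounded results agree with the 47.5 pixels per degree, 2.7 deg, 0.78 deg, and 0.17 deg quoted in the text.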
When an image was not present, the screen was filled with random,
uniformly distributed, gray scale pixels.
Because the display had only 32 different levels of gray scale
(IBM-PC compatible VGA display mode) the no-noise condition was
run at twice the digital image contrast to allow more dynamic
range.
The image duration was 1.0 second.
2.1.2 Observers. Four female observers, aged 18 to 37 years, with corrected acuity of 20/20 or better were tested.
2.1.3 Procedure. The observers were asked to rate each image on a 4 point rating scale according to the following interpretation:
1 - Definitely did not have a plane.
2 - Probably did not have a plane.
3 - Probably did have a plane.
4 - Definitely did have a plane.
In addition, the observers were asked to try to use the 4 response categories with roughly equal frequency.
Within a block of 60 trials, the mask noise level q was held constant, while the four object/background signal levels p occurred randomly, each with probability 0.25. Table 1 shows the four values of p used at each value of q, the coefficient determining the noise level.
Table 1 - Signal level values p used at each noise level value q
q       p values
0       0      0.05   0.1    0.2
0.25    0      0.05   0.1    0.2
0.5     0      0.1    0.2    0.4
1       0      0.2    0.4    1.0
Groups of four repetitions of the four noise levels were independently sequenced using 4x4 Latin squares. Observers 1 and 2 completed 16 repetitions of each noise level, Observer 3 completed 8 repetitions, and Observer 4 completed 10 repetitions in 5x5 Latin squares, including a no-noise condition at the same contrast as the noise conditions.
2.2.1 Method.
For a given noise level,
the distance d' in discriminability units
from each object image to its non-object image
was measured in the context of
a one-dimensional Thurstone scaling model[5].
The scaling model has the following assumptions:
1. The presentation of an image generates an internal value that
is a sample from a normal distribution with unit variance.
2a. The mean of the distribution generated by a background image
I0 is zero.
2b. The mean of the distribution generated by an original object
image I1 is d'.
2c. The mean of the distribution generated by an image
Ip is p times d'.
3. The observer has 3 fixed criteria that are used to
categorize an internal value to one of the 4 responses.
The scaling model for this experiment has 4 d' parameters
and 3 category boundaries for each observer.
Parameters were estimated by the method of maximum likelihood
separately for each block.
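The scaling model can be sketched in Python as follows. This is an illustrative simplification, not the authors' fitting code: the response counts are hypothetical, the three criteria are held fixed at assumed values, and a coarse grid search over d' stands in for a full maximum likelihood fit of d' and the criteria together.

```python
import math


def normal_cdf(x):
    """Cumulative distribution function of the unit normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def category_probabilities(p, d_prime, criteria):
    """Probabilities of the 4 rating responses for signal fraction p.
    The internal value is Normal(p * d_prime, 1) (assumptions 1 and 2a-2c);
    the 3 ordered criteria split it into 4 categories (assumption 3)."""
    mean = p * d_prime
    cum = [0.0] + [normal_cdf(c - mean) for c in criteria] + [1.0]
    return [cum[i + 1] - cum[i] for i in range(len(cum) - 1)]


def log_likelihood(counts_by_p, d_prime, criteria):
    """Multinomial log likelihood; counts_by_p maps p to 4 response counts."""
    total = 0.0
    for p, counts in counts_by_p.items():
        probs = category_probabilities(p, d_prime, criteria)
        total += sum(n * math.log(max(pr, 1e-12))
                     for n, pr in zip(counts, probs))
    return total


# Hypothetical response counts for one 60-trial block (15 trials per p level).
counts = {0.0: [9, 4, 2, 0], 0.05: [5, 5, 4, 1],
          0.1: [2, 4, 5, 4], 0.2: [0, 1, 4, 10]}
criteria = (-0.5, 0.5, 1.5)  # assumed fixed here; the paper estimates them

# Coarse grid search over d' from 0.5 to 40 in steps of 0.5.
grid = [0.5 * i for i in range(1, 81)]
best_ll, best_d_prime = max((log_likelihood(counts, d, criteria), d)
                            for d in grid)
print(best_d_prime)
```

Because the mean for image Ip is p times d', all four signal levels in a block constrain a single d' estimate.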
2.2.2 Experimental results. Median d' estimates for each observer and for the 4 noise levels are given in Table 2.
Table 2 - Median experimental discriminability indices d'
noise level q     0      0.25   0.5    1
Observer 1        18.5   11.9   6.6    3.3
Observer 2        24.9   11.6   8.8    4.1
Observer 3        24.4   9.5    8.8    5.8
Observer 4        28.4   15.4   9.0    5.2
Geometric mean    24.8   11.9   8.2    4.5
The standard deviation of an individual score in decibels (dB = 20 times the base-10 log of the score) is estimated to be 1.3 dB, based on the observer by noise level interaction, which has 9 degrees of freedom. This leads to 95% confidence intervals of + or - 1.4 dB for the means for each noise level. Figure 2 plots the data of Table 2 with the confidence intervals about the means. Observer 4 had a median d' of 18.4 for the no-noise condition at the same contrast as the noise conditions, only slightly higher than her d' value of 15.4 for the q=0.25 condition. The large difference between the q=0 and the q=0.25 conditions is thus seen to be mainly an effect of the lower signal level in the noise conditions.
3.1.1 Multiple channel model. The multiple channel model is based on the Cortex transform of Watson[6]. It is similar in spirit to his original multiple channel model[7], and is similar in detail to others based on the Cortex transform[8,9,10].
The multiple channel model calculation for a pair of images
(I0 and I1) has the following steps.
The images I1 and I0 are converted
to luminance images by the calibration function of Equation (2).
The images are converted to luminance contrast by
subtracting and then dividing by the background image
mean luminance L0,
$$I_j = \frac{I_j - L_0}{L_0} . \tag{3}$$
Operations written on an image
are applied separately to each pixel.
A contrast sensitivity function (CSF) filter S
is then applied to the two contrast images,
$$I_j = F'[\,S\,F[I_j]\,] , \tag{4}$$
where F and F' are the forward and inverse Fourier transforms. Next the Cortex transform is applied to the images, resulting in coefficients C(j,k), where the index k ranges over spatial frequency, orientation, and spatial location. The detectability dk contributed by the kth spatial frequency, orientation, and position is then computed as the absolute value of the difference in the Cortex transform coefficients, masked by the background coefficient if it is above threshold,
$$d_k = \begin{cases} |C(1,k) - C(0,k)| , & \text{if } C(0,k) \le 1.0 , \\[4pt] \dfrac{|C(1,k) - C(0,k)|}{|C(0,k)|^{0.7}} , & \text{if } C(0,k) > 1.0 . \end{cases} \tag{5}$$
Finally, d' is given by a Minkowski sum of the individual
contributions with summation exponent beta,
$$d' = \Big( \sum_k d_k^{\beta} \Big)^{1/\beta} . \tag{6}$$
For the case that beta=infinity, the result is the largest of the dk.
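The within-channel masking rule of Equation (5) and the Minkowski pooling of Equation (6) can be sketched as below. The coefficient values are hypothetical and stand in for Cortex transform outputs, which are not computed here; the function names are likewise illustrative.

```python
def masked_difference(c1, c0):
    """Equation (5): absolute coefficient difference, divided by
    |c0| ** 0.7 when the background coefficient c0 exceeds 1.0.
    The threshold test is applied to c0 as written in Equation (5)."""
    diff = abs(c1 - c0)
    if c0 > 1.0:
        return diff / abs(c0) ** 0.7
    return diff


def minkowski_sum(values, beta):
    """Equation (6). With beta = infinity the result is the largest value."""
    if beta == float("inf"):
        return max(values)
    return sum(v ** beta for v in values) ** (1.0 / beta)


# Hypothetical (object, background) coefficient pairs in threshold units.
pairs = [(0.5, 0.2), (3.0, 2.0), (1.5, 1.0), (4.0, 4.0)]

d_k = [masked_difference(c1, c0) for c1, c0 in pairs]
d_prime = minkowski_sum(d_k, beta=4)
print(d_k, d_prime)
```

In this example only the second pair is masked, since its background coefficient is the only one above 1.0 that also carries a nonzero difference.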
3.1.2 Single channel model.
For the single channel model, the steps are
the same through the image filtering.
The filtered image values are then used to compute
$$d_k = |I(1,k) - I(0,k)| , \tag{7}$$
where the index k now refers to image pixels. Equation (6) is then used to obtain d'.
3.1.3 Contrast normalization.
Without the correction factor, the single channel model
predicts no contrast masking at all and the multiple
channel model only predicts masking within the channels
affected by the signal.
Recent work demonstrates masking by contrast energy in channels
not containing the signal[11].
New versions of the multiple channel models
incorporating lateral interactions
among cortical unit channels to account for between-channel
masking have been developed[12-15].
A model similar to theirs would result by replacing Equation (5) with
$$d_k = \frac{c_0\,|C(1,k) - C(0,k)|}{\Big( c_0^{a_0} + \sum_{k'} c(k,k')\,|C(0,k')|^{a(k,k')} \Big)^{1/a_0}} , \tag{8}$$
where c0 and a0 are constants,
c(k,k') represents the weight of the masking of
channel k' on channel k, and
a(k,k') represents the growth of that masking with the activity in channel k'.
If we make the simplifying assumptions that the
c(k,k') are all equal and sum to unity,
that the a(k,k')=2, and a0=2, the result is that the factor
multiplying the difference term is no longer a function of
k and can be factored out of the Minkowski metric Equation (6).
Also, the Cortex transform has the property that the sum of squares
of the coefficients equals the sum of squares of the image values,
so the simplification assumptions result in the d' prediction
replacement formula,
$$d' = d'\,\frac{c_0}{\sqrt{c_0^2 + c^2}} , \tag{9}$$
where the input d' is computed from the unmasked differences,
c is the RMS background image contrast passed by the
CSF filter, and c0 is a parameter representing
the contrast level at which the masking becomes effective.
To compute c, the CSF is normalized to unity at its peak value.
Instead of dealing with the additional computational complexity
and parameter estimation problems of Equation (8), we will
simply use Equation (9) to correct the predictions of the single
and multiple channel models.
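The reduction from Equation (8) to Equation (9) can be verified numerically. The sketch below uses hypothetical background coefficients and, for simplicity, a single exponent a in place of the full set a(k,k'); with equal weights summing to unity and a = a0 = 2, the gain of Equation (8) is the same for every channel and equals the gain of Equation (9).

```python
import math


def eq8_gain(c0, background, weights, a, a0):
    """Factor multiplying |C(1,k) - C(0,k)| in Equation (8) for one channel.
    A single exponent a replaces a(k,k') here, which is a simplification."""
    pooled = sum(w * abs(b) ** a for w, b in zip(weights, background))
    return c0 / (c0 ** a0 + pooled) ** (1.0 / a0)


def eq9_gain(c0, c_rms):
    """Factor multiplying d' in Equation (9)."""
    return c0 / math.sqrt(c0 ** 2 + c_rms ** 2)


# Hypothetical background coefficients in contrast units.
background = [0.10, -0.05, 0.02, 0.08]
weights = [1.0 / len(background)] * len(background)  # equal, summing to unity

c_rms = math.sqrt(sum(b * b for b in background) / len(background))
g8 = eq8_gain(0.04, background, weights, a=2, a0=2)
g9 = eq9_gain(0.04, c_rms)
print(g8, g9)
```

Because the gain no longer depends on k, it factors out of the Minkowski sum and can be applied once to the uncorrected d'.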
The model parameters used are those that proved to be best
in previous studies[1-4].
The CSF filters were calibrated to agree with the CSF formula
developed by Barten[16].
The filters have a difference of Gaussian form,
$$S(f) = A_c \exp[-(f/f_c)^2] - A_s \exp[-(f/f_s)^2] , \tag{10}$$
where Ac and As are the center and surround amplitude parameters
and fc and fs are the center and surround frequency cutoff parameters.
Table 3 gives the CSF and beta parameters for the multiple
channel and the single channel models.
The amplitude parameters have the dimensions of JND's per unit contrast and
the cutoff parameters have the dimensions of cycles per
degree of visual angle.
Table 3 - Model parameters
channels    beta   Ac     fc     As/Ac   fc/fs
multiple    4      15.5   20.8   0.77    5.6
single      4      18.5   16.4   0.68    7.9
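For reference, the difference-of-Gaussians CSF of Equation (10) can be evaluated from the Table 3 parameters as in the sketch below. The surround amplitude and cutoff are recovered from the tabulated ratios; the peak is located by a simple search on a 0.01 cycles per degree grid, which is an illustrative choice rather than the authors' procedure.

```python
import math

# Table 3 parameters: (Ac, fc, As/Ac, fc/fs)
PARAMS = {
    "multiple": (15.5, 20.8, 0.77, 5.6),
    "single": (18.5, 16.4, 0.68, 7.9),
}


def csf(f, a_c, f_c, as_over_ac, fc_over_fs):
    """Equation (10): sensitivity in JNDs per unit contrast at f cycles/deg."""
    a_s = a_c * as_over_ac    # surround amplitude
    f_s = f_c / fc_over_fs    # surround cutoff frequency
    return (a_c * math.exp(-(f / f_c) ** 2)
            - a_s * math.exp(-(f / f_s) ** 2))


def peak(params):
    """Return (peak sensitivity, peak frequency) on a 0 to 60 c/deg grid."""
    freqs = [i / 100.0 for i in range(6001)]
    return max((csf(f, *params), f) for f in freqs)


for name, params in PARAMS.items():
    peak_value, peak_freq = peak(params)
    # Dividing csf() by peak_value gives the unit-peak filter used to compute c.
    print(name, round(peak_freq, 2), round(peak_value, 2))
```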
3.3.1 Predictions without a contrast masking correction. The model predictions for d' without a contrast masking correction are given in Table 4 for each of the four noise levels.
Table 4 - Model d' values without a contrast masking correction
noise level q       0      0.25   0.5    1
multiple channel    4.0    2.3    2.2    1.9
single channel      24.5   11.5   11.5   11.8
Figure 3 shows the predictions of Table 4 plotted with the mean observer results. Both models correctly predict the difference between q=0 and q=0.25 caused by scaling down the aircraft image to make room for the noise. The single channel model predicts no masking by the noise. The multiple channel model predicts very little masking by the noise. Table 5 shows the sensitivity scale factors needed to equalize the average log predictions of the models and the observers. It also shows the average error of prediction in decibels using the scale factor and an F statistic representing the statistical goodness-of-fit of the error. The multiple channel model averages a factor of 4 too insensitive, while the single channel average sensitivity is within the range of that of the observers. The underprediction of the masking effects causes the errors to be large. Both F's are highly significant, since the 99.9 percentile of the F distribution with 3 and 9 degrees of freedom is 13.9.
Table 5 - Model fits without contrast masking correction
model               scale factor   error, dB   F
multiple channel    4.1            3.5         30.5
single channel      0.72           4.0         38.5
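The scale factors of Table 5 can be approximated from the rounded values in Tables 2 and 4. The sketch below takes the scale factor to be the geometric mean of the observer-to-model d' ratios, which is one reading of "equalize the average log predictions"; the function names are hypothetical, and small differences from Table 5 are expected because the inputs are rounded.

```python
import math

observed = [24.8, 11.9, 8.2, 4.5]          # Table 2, geometric means
predicted = {                              # Table 4
    "multiple": [4.0, 2.3, 2.2, 1.9],
    "single": [24.5, 11.5, 11.5, 11.8],
}


def scale_factor(obs, pred):
    """Geometric mean of obs/pred: the factor that equalizes the average
    log prediction and the average log observation."""
    logs = [math.log(o / p) for o, p in zip(obs, pred)]
    return math.exp(sum(logs) / len(logs))


def residuals_db(obs, pred):
    """Prediction errors in dB after applying the scale factor."""
    s = scale_factor(obs, pred)
    return [20.0 * math.log10(o / (s * p)) for o, p in zip(obs, pred)]


for name, pred in predicted.items():
    print(name, round(scale_factor(observed, pred), 2),
          [round(r, 1) for r in residuals_db(observed, pred)])
```

The residuals change sign from positive at low noise to negative at high noise for both models, which reflects the underprediction of masking described above.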
3.3.2 With contrast masking correction. RMS contrast values for normalizing the d' values are shown in Table 6 for each of the 4 noise plus background images, filtered by the CSF for each model.
Table 6 - RMS image contrast
noise level q       0       0.25    0.5     1
multiple channel    0.136   0.076   0.098   0.158
single channel      0.150   0.079   0.093   0.136
Figure 4 shows the predictions of Figure 3 corrected with a c0 of 0.04 and the RMS contrast values of Table 6. Now both models predict the effect of the noise better when the noise is present, but they predict too much masking of the target by the image alone. Table 7 shows the goodness-of-fit measures as in Table 5. The scale factors show that now both models predict too much masking.
Table 7 - Model fits with contrast masking correction
model               scale factor   error, dB   F
multiple channel    12.2           3.3         25.3
single channel      2.1            3.8         34.7
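Applying Equation (9) to the rounded values of Tables 4 and 6 gives a rough check on Table 7. As before, the scale factor is taken to be the geometric mean of the observer-to-model ratios, which is an assumption about how it was computed; the results come out near 12 and 2, in line with Table 7 to within rounding of the inputs.

```python
import math

C0 = 0.04                                  # masking constant from the text
observed = [24.8, 11.9, 8.2, 4.5]          # Table 2, geometric means
uncorrected = {                            # Table 4
    "multiple": [4.0, 2.3, 2.2, 1.9],
    "single": [24.5, 11.5, 11.5, 11.8],
}
rms_contrast = {                           # Table 6
    "multiple": [0.136, 0.076, 0.098, 0.158],
    "single": [0.150, 0.079, 0.093, 0.136],
}


def corrected(d_prime, c_rms, c0=C0):
    """Equation (9): d' scaled by c0 / sqrt(c0^2 + c^2)."""
    return d_prime * c0 / math.sqrt(c0 ** 2 + c_rms ** 2)


def scale_factor(obs, pred):
    """Geometric mean of the observer-to-model d' ratios."""
    logs = [math.log(o / p) for o, p in zip(obs, pred)]
    return math.exp(sum(logs) / len(logs))


scale = {}
for name in uncorrected:
    pred = [corrected(d, c)
            for d, c in zip(uncorrected[name], rms_contrast[name])]
    scale[name] = scale_factor(observed, pred)
    print(name, [round(v, 2) for v in pred], round(scale[name], 1))
```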
3.3.3 Contrast masking correction based on noise alone. The poor fit above is what one might expect from using an image-wide estimate for image masking while the runway region has little contrast variation. The values of Table 6 can be decomposed to show that the RMS visible contrast from the full (q=1) noise alone is 0.144 for the multiple channel model and 0.114 for the single channel model. Figure 5 shows the predictions of Figure 3 corrected with a c0 of 0.04 and the noise component of the RMS visible contrast. Now both models fit well, with a slight error in the direction that would result from a small image masking effect. Table 8 shows the goodness-of-fit measures as in Table 5. Now both models have scale factors close to unity and the single channel model fits the noise effect quite well. The multiple channel F now barely exceeds the 99th percentile of the F distribution (6.99), and the single channel F is just above the 90th percentile (2.81).
Table 8 - Model fits using only noise in the contrast masking correction
model               scale factor   error, dB   F
multiple channel    1.30           1.7         7.02
single channel      1.16           1.1         2.84
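For the single channel model, the noise-only correction can be sketched as follows. Two assumptions are made that the text does not state explicitly: the visible noise contrast is taken to scale linearly with q, and the decomposition of Table 6 is taken to be a sum in quadrature of uncorrelated image and noise contrast, with the q=0 entry halved because that condition was run at twice the contrast.

```python
import math

C0 = 0.04                                  # masking constant from the text
NOISE_RMS = 0.114                          # single channel, full noise (q=1)
Q_LEVELS = [0.0, 0.25, 0.5, 1.0]
UNCORRECTED = [24.5, 11.5, 11.5, 11.8]     # Table 4, single channel
OBSERVED = [24.8, 11.9, 8.2, 4.5]          # Table 2, geometric means


def corrected(d_prime, c_rms, c0=C0):
    """Equation (9) with c set to the noise contrast only."""
    return d_prime * c0 / math.sqrt(c0 ** 2 + c_rms ** 2)


# Assumption: noise contrast at level q is q times the full-noise contrast.
noise_only = [corrected(d, q * NOISE_RMS)
              for d, q in zip(UNCORRECTED, Q_LEVELS)]

logs = [math.log(o / p) for o, p in zip(OBSERVED, noise_only)]
scale = math.exp(sum(logs) / len(logs))

# Assumed quadrature decomposition of the single channel row of Table 6.
background_rms = 0.150 / 2.0               # q=0 entry at normal contrast
noise_from_table6 = math.sqrt(0.136 ** 2 - background_rms ** 2)

print([round(v, 2) for v in noise_only], round(scale, 2))
print(round(noise_from_table6, 3))
```

Under these assumptions the scale factor comes out close to the single channel entry of Table 8, and the quadrature estimate lands near the 0.114 quoted for the full noise.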
The improvement in the model predictions that results from limiting the contrast masking correction to the noise suggests that the correction should be based on the contrast in a smaller region containing the target object. We had success before[2-4] with the correction based on the same sized image, and experiments measuring contrast effects on perceived contrast indicate considerable spatial spread[19-22]. Current models[12-15], however, extend the masking interactions only to channels differing in orientation at the same location and spatial frequency. Also, recent attempts to measure contrast masking by a surround masker found none[23,24]. We currently recommend that the contrast masking correction be based on an estimate of the image contrast in the immediate region of the target object.
The results demonstrate that the single channel model with an appropriate contrast masking correction can outperform the multiple channel model with or without a general gain control. Although a multiple channel model with inter-channel interactions might do better in this situation, it probably would require more strongly oriented signals and maskers to obtain a benefit for the extra calculations. One problem with the contrast masking correction and the multiple channel model is that contrast in the signal channels contributes to masking twice. The multiple channel model might be the better of the two with the correction if, for example, the within-channel masking exponent and the correction exponent were both lowered. The results here show that even though the single channel model does not predict the details of oriented contrast masking, such as the results of Foley[11], it can be a useful alternative to more complicated models.
Ren-Sheng Horng wrote the experimental display and response collection program. Andrew Watson wrote the basic Mathematica routines that generated the model and metric predictions and made helpful suggestions. We are also grateful for the help of Ann Marie Rohaly, Cynthia Null, Jeffrey Mulligan, and Robert Eriksson. This work was supported in part by NASA Grant 199-06-39 to Andrew Watson and NASA Aeronautics RTOP #505-64-53.