Abstract reference:

Investigative Ophthalmology and Visual Science, vol. 36, no. 4 (ARVO Suppl.), p. S439, 1995.

Image Discrimination Models Predict Object Detection in Natural Backgrounds

A. J. Ahumada, Jr.
NASA Ames Research Center, Moffett Field, CA
A. M. Rohaly
U.S. Army Research Laboratory, Aberdeen Proving Ground, MD
A. B. Watson
NASA Ames Research Center, Moffett Field, CA

Abstract

Image discriminability models predict the visibility of the difference between a pair of images. We compare the ability of two basic models to predict the detectability of objects in natural backgrounds: a multiple channel Cortex transform model with within-channel masking and a single channel contrast sensitivity filter model. Minkowski summation of differences was implemented with three different exponents: 2 (root mean square difference), 4, and infinity (maximum difference). Each method was also tried with a simple contrast gain control normalization. Without contrast normalization, the multiple channel model with a summation exponent of 4 performed best. The predictions of both models improved with contrast normalization. With contrast normalization, at their best exponent of 4, the two models performed equally well.

Introduction

There are a number of image discriminability models for predicting the visibility of the difference between a pair of images [1]. Many applications, such as assessing the quality of imaging radar displays, are concerned with object detection and recognition. Object detection involves looking for one of a large set of object sub-images in a large set of background images and has been approached from this general point of view [2]. We show that discrimination models can predict the relative detectability of objects in different images, suggesting that these simpler models may be useful in some object detection and recognition applications. Here we compare two models that give measures of image discriminability. The first is a multiple spatial frequency channel model based on the Cortex transform with within-channel masking [3,4,5]. It is similar to the models of Lubin and Daly [6,7]. The second is a single channel contrast sensitivity function (CSF) filter model. Each model was tested with three different Minkowski summation exponents: 2, 4, and infinity. The exponent of 2 corresponds to the familiar root-mean-square (RMS) difference metric. The Minkowski sum with an exponent of infinity corresponds to the maximum difference.

Object detection experiment

Experimental method

Stimuli. Six images of a vehicle in an otherwise natural setting were altered by replacing the vehicle with appropriate background imagery. Two object images having lower levels of detectability were constructed from each image pair by mixing the object and non-object images. The mixing proportions were selected individually for the images and chosen to be near threshold detectability. The 510x480 pixel images were presented on a 13 inch Macintosh color monitor at a viewing distance corresponding to 95 pixels per degree of visual angle.

Observers. The observers were 19 male soldiers, aged 18 to 32 years. Their acuities were 20/20 or better, and they had normal color vision.

Procedure. Observers were asked to rate each of the 24 images on a 4 point rating scale according to the following interpretation:
1-A target was definitely in the scene.
2-There was something in the scene that probably was a target.
3-There was something in the scene but it probably was not a target.
4-There was definitely no target in the scene.

One group of 10 observers saw each image 20 times at a duration of 1 sec. A second group of 9 observers saw each image 10 times at a duration of 0.5 sec and 10 times at a duration of 2 sec. The sequence of 480 images was completely randomized separately for each observer.

Data analysis

Methods. The distance in discriminability units from each object image to its non-object image was measured in the context of a one-dimensional Thurstone scaling model [8]. The scaling model has the following assumptions:
1. Internal stimulus values have a normal distribution with unit variance.
2. Distances between stimulus distributions for mixtures of stimuli are proportional to the mixtures.
3. All subjects have the same pattern of distances between the stimulus distributions. They differ only by a multiplicative subject sensitivity factor.
4. Category boundaries vary across subjects, but are constant over stimuli.

The scaling model has one image discriminability parameter for each of the 6 image sets and one sensitivity factor and 3 category boundaries for each of the observers. Observers tested with the two different stimulus durations were given two sensitivity factors. Parameters were estimated by the method of maximum likelihood.
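
The four assumptions above can be illustrated with a minimal sketch of the rating probabilities they imply. The function names, and the convention that category intervals are listed from the lowest to the highest internal value, are assumptions made for illustration; they are not taken from the program used in this study.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def stimulus_mean(d_prime_full, mixture, sensitivity):
    """Mean internal value of a mixture image (assumptions 2 and 3):
    proportional to the mixing proportion and to the observer's sensitivity."""
    return sensitivity * mixture * d_prime_full

def category_probabilities(mean, boundaries):
    """Probability of each of the len(boundaries) + 1 rating intervals for a
    unit-variance normal internal value (assumption 1) centred on `mean`.
    `boundaries` must be increasing; they belong to one observer and are the
    same for every stimulus (assumption 4)."""
    edges = [-math.inf] + list(boundaries) + [math.inf]
    return [normal_cdf(hi - mean) - normal_cdf(lo - mean)
            for lo, hi in zip(edges[:-1], edges[1:])]

def log_likelihood(counts, mean, boundaries):
    """Log likelihood of observed rating counts for one stimulus and observer."""
    probs = category_probabilities(mean, boundaries)
    return sum(n * math.log(p) for n, p in zip(counts, probs) if n > 0)

# Hypothetical observer: three boundaries, a 50% mixture of an image with d' = 4.
probs = category_probabilities(stimulus_mean(4.0, 0.5, 1.0), [0.5, 1.5, 2.5])
```

Maximum likelihood estimation would sum `log_likelihood` over all stimuli and observers and maximize it over the parameters listed above; the optimizer is omitted here.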

Experimental results

Discriminability parameter estimates scaled to represent the distance (d') from the 100% vehicle image to the non-vehicle image are given in Table 1. Patterns of discriminability differences for the images were estimated separately for the 10 observers given 1.0 sec durations and the 9 observers given the 0.5 and 2.0 sec durations. The median observer sensitivity factor for each group was used to convert the sensitivity pattern to sensitivities. The values for the combined group are the geometric means of the individual group values.

Table 1 - Experimental discriminability indices (d').

image pair  A     B     C    D    E    F
n=10        4.1  10.3  3.7  6.7  4.5  3.7
n=9         5.5  10.3  4.7  8.5  4.9  4.7
n=19        4.8  10.3  4.2  7.6  4.7  4.2
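
As a small check on how the combined row is formed, the sketch below takes the geometric mean of the two group values for each image pair. Because it starts from the rounded table entries, it can differ from the published n=19 row in the last digit.

```python
import math

def geometric_mean(values):
    """Geometric mean, computed as the exponential of the mean logarithm."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Rounded d' values from Table 1 for the two observer groups.
group_n10 = [4.1, 10.3, 3.7, 6.7, 4.5, 3.7]
group_n9 = [5.5, 10.3, 4.7, 8.5, 4.9, 4.7]

combined = [geometric_mean(pair) for pair in zip(group_n10, group_n9)]
```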

For the 10 observer group, the ratio of the best observer sensitivity factor to the median observer sensitivity factor was 1.5 and to the worst observer sensitivity factor was 3.3. For the 9 observer group these ratios were 1.9 and 4.1, respectively. The sensitivities measured for the 0.5 and 2.0 sec durations were neither appreciably nor significantly different from each other.

Models

Although the observers were presented with color images, the models were only presented with gray scale images. The RGB color images were converted to gray scale using the coefficients 87/253, 127/253, and 39/253 for the respective color planes. Also, these gray scale images were pixel-averaged by factors of two in the horizontal and vertical dimensions and cropped around the central target area to 128x128 pixels. These images are shown in Figure 1a and Figure 1b. Both models are linearized versions of more general models [9,10]. Because discriminability is linear in the amount of target that is added to the background, the models satisfy the second assumption of the above observer response scaling model. Linearization is accomplished by using the background-only image for the luminance used to convert from luminance to contrast and for the masking calculations. The model predictions thus need to be computed for only one target level.
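
The preprocessing described above can be sketched as follows. The image representation (nested lists of pixel values) and the reading of "pixel-averaged by factors of two" as non-overlapping 2x2 block averaging are assumptions; cropping to the 128x128 target area is omitted.

```python
# Gray scale weights for the R, G, and B planes, as given in the text.
GRAY_WEIGHTS = (87 / 253, 127 / 253, 39 / 253)

def to_gray(rgb_image):
    """Convert rows of (r, g, b) tuples to rows of gray scale values."""
    return [[sum(w * v for w, v in zip(GRAY_WEIGHTS, pixel)) for pixel in row]
            for row in rgb_image]

def average_2x2(image):
    """Halve the resolution in both dimensions by averaging 2x2 blocks
    (assumed interpretation of the pixel averaging step)."""
    rows, cols = len(image) // 2, len(image[0]) // 2
    return [[(image[2 * r][2 * c] + image[2 * r][2 * c + 1]
              + image[2 * r + 1][2 * c] + image[2 * r + 1][2 * c + 1]) / 4.0
             for c in range(cols)]
            for r in range(rows)]

# Tiny hypothetical 2x2 color image reduced to a single gray pixel.
tiny = [[(253, 0, 0), (0, 253, 0)],
        [(0, 0, 253), (253, 253, 253)]]
reduced = average_2x2(to_gray(tiny))
```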

Algorithms

Multiple channel model. The multiple channel model calculation has the following steps:
1) The images are converted to luminance contrast by subtracting and then dividing by the background image mean luminance.
2) A CSF filter is then applied to both images.
3) The cortex transform is then applied to both images.
4) In the cortex transform domain, the difference between each pair of corresponding object and background image coefficients is computed. If the absolute value of the background image coefficient is greater than one (threshold), the difference is divided by that absolute value raised to the 0.7 power.
5) The absolute values of these scaled cortex transform coefficient differences are raised to a power, summed, and then taken to the inverse power. For the case that the exponent is infinity, the maximum absolute difference is computed.
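
Step 4, the within-channel masking rule, can be sketched as below. The sketch assumes the cortex transform coefficients have already been computed and are held in flat lists expressed in threshold units, so that the masking threshold is one; the function and parameter names are illustrative only.

```python
def mask_differences(object_coeffs, background_coeffs,
                     exponent=0.7, threshold=1.0):
    """Scale each object-minus-background coefficient difference by the
    background coefficient magnitude raised to `exponent`, but only when
    that magnitude exceeds `threshold`."""
    masked = []
    for obj, bg in zip(object_coeffs, background_coeffs):
        diff = obj - bg
        magnitude = abs(bg)
        if magnitude > threshold:
            diff /= magnitude ** exponent
        masked.append(diff)
    return masked

# Hypothetical coefficients: the first is masked, the second is below threshold.
masked = mask_differences([3.0, 0.5], [2.0, 0.25])
```

Below threshold the difference passes through unchanged, so the rule reduces to an unmasked difference for weak backgrounds.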

Single channel model. For the single channel model, the steps are as follows:
1) The images are converted to luminance contrast by subtracting and then dividing by the background image mean luminance.
2) A CSF filter is then applied to both images.
3) In the image domain, the differences between the filtered object and background images are computed.
4) The absolute values of these differences are raised to a power, summed, and then taken to the inverse power. For the case that the exponent is infinity, the maximum absolute difference is computed.
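
The final pooling step shared by both models is a Minkowski sum. A minimal sketch, assuming the differences are supplied as a flat list; it sums rather than averages, following the step as written, so any constant factor is left to the calibration.

```python
import math

def minkowski_pool(differences, exponent):
    """Pool absolute differences with the given Minkowski exponent.
    An infinite exponent returns the maximum absolute difference."""
    magnitudes = [abs(d) for d in differences]
    if math.isinf(exponent):
        return max(magnitudes)
    return sum(m ** exponent for m in magnitudes) ** (1.0 / exponent)

# The three exponents used in this paper, applied to hypothetical differences.
pooled = {beta: minkowski_pool([3.0, -4.0], beta) for beta in (2, 4, math.inf)}
```

As the exponent grows, the pooled value falls toward the largest single difference, which is why the exponent controls how much the prediction depends on a few large local differences.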

The CSF filters were calibrated separately for each of the 6 combinations of multiple or single channel model and summation exponent of 2, 4, and infinity. They were designed to fit the prediction of Barten's CSF formula [11] for 1.33 deg square grating patches at five spatial frequencies centered in each of the five bandpass channels of the multiple channel model. These contrast sensitivities appear in Table 2 and are graphed in Figure 2.

Table 2 - Calibration contrast sensitivities.

 cycles per degree    1.125  2.25  4.5   9   18
1/contrast threshold    77   122   147  122  54

Model predictions and results

Predictions of the multiple and single channel models for the discriminability in d' units of the object image from the background image are given in Table 3 for each of the three summation exponents.

Table 3 - Model discriminability indices (d')

channels exp. A     B     C     D     E     F
multiple 2    32.4  45.4  31.5  19.0  19.7  20.8
multiple 4    20.4  32.1  17.8  17.0  12.7  15.3
multiple inf  25.3  28.6  21.6  21.9  13.8  16.1
single   2    36.6  59.1  36.5  20.0  20.2  20.8
single   4    65.6  89.9  56.2  36.5  29.4  32.4
single   inf 145.2 216.3 135.2  87.8  32.0  62.3

Least squares predictions of the observer discriminabilities from the model predictions were computed in the log domain, assuming only an additive constant (discriminability domain multiplicative factor). Analyses in the discriminability domain show that neither constant terms nor squared terms significantly improve the fits. The multiplicative factors for predicting each of the two groups and their average results are shown for each model in Table 4.
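
With only an additive constant in the log domain, the least squares estimate is the mean of the log ratios of observed to predicted d', and the multiplicative factor is its exponential. The sketch below applies this to the rounded tabulated values for the multiple channel model with exponent 4 and the combined observer group; because the inputs are rounded, the result is only expected to be close to the published factor.

```python
import math

def log_domain_multiplier(model, observed):
    """Least squares multiplicative factor fitted in the log domain:
    the exponential of the mean log ratio of observed to model values."""
    log_ratios = [math.log(o / m) for m, o in zip(model, observed)]
    return math.exp(sum(log_ratios) / len(log_ratios))

# Rounded values: Table 3 (multiple, exp. 4) and Table 1 (n=19).
model_multiple_4 = [20.4, 32.1, 17.8, 17.0, 12.7, 15.3]
observed_n19 = [4.8, 10.3, 4.2, 7.6, 4.7, 4.2]

multiplier = log_domain_multiplier(model_multiple_4, observed_n19)
```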

Table 4 - Multipliers for model d' to predict observer d'.

channels exp. n=10  n=9   n=19
multiple 2    0.19  0.23  0.21
multiple 4    0.28  0.33  0.30
multiple inf  0.25  0.30  0.27
single   2    0.17  0.21  0.19
single   4    0.11  0.13  0.12
single   inf  0.053 0.064 0.059

The multiple channel model has correction factors closer to one, indicating that its within-channel masking allows it to make better absolute predictions of observer data when calibrated for absolute contrast threshold detection.

The standard errors of the log predictions converted to percentage error in discriminability units are shown in Table 5. A standard error in the log domain of 0.3 log units corresponds to a factor of 2 in the discriminability domain and an error of 100 percent.

Table 5 - Prediction errors in percent.

channels exp. n=10 n=9  n=19
multiple 2    51   47   48
multiple 4    33   28   30
multiple inf  41   30   35
single   2    55   53   54
single   4    55   50   52
single   inf  86   79   82

The best model is the multiple channel model with a summation exponent of 4. The single channel model did very poorly with the maximum rule. Figure 3 shows plots of the predictions of the average group detectabilities for the 6 image pairs for both models using the best summation exponent (4). The error bars represent 95% confidence intervals for the mean of the two groups of subjects based on the variance between the two groups.
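
The conversion from a log-domain standard error to a percentage error can be sketched as follows. The conversion itself follows the stated rule that 0.3 log units (base 10) is a factor of 2, or 100 percent. The residual standard error with n-1 in the denominator is an assumption; the text does not state which estimator was used.

```python
import math

def log10_error_to_percent(standard_error_log10):
    """Convert a standard error in log10 units to a percentage error."""
    return (10.0 ** standard_error_log10 - 1.0) * 100.0

def prediction_error_percent(model, observed):
    """Percentage error of a one-parameter (multiplicative) log-domain fit.
    Uses n - 1 degrees of freedom, which is an assumption."""
    log_ratios = [math.log10(o / m) for m, o in zip(model, observed)]
    mean = sum(log_ratios) / len(log_ratios)
    squared = sum((r - mean) ** 2 for r in log_ratios)
    standard_error = math.sqrt(squared / (len(log_ratios) - 1))
    return log10_error_to_percent(standard_error)

# Rounded values: Table 3 (multiple, exp. 4) and Table 1 (n=19).
error_multiple_4 = prediction_error_percent(
    [20.4, 32.1, 17.8, 17.0, 12.7, 15.3],
    [4.8, 10.3, 4.2, 7.6, 4.7, 4.2])
```

With these rounded inputs the result comes out near 30 percent, in line with the corresponding entry of Table 5.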

Contrast normalization

Recent data have shown that contrast energy at other spatial frequencies raises the threshold of grating increments and models have been developed to account for this effect [12]. A simple way of allowing for such an effect is to multiply the above predictions by a/sqrt(a^2+c^2), where c is the RMS background image contrast passed by the contrast sensitivity function, and a is a parameter estimated from the data. To compute c, the CSF is normalized to unity at its peak value. When the best estimate of a is zero, the factor behaves as a/c in the limit and the constant a is absorbed by the fitted multiplier, so we simply divide the predicted discriminability by the RMS contrast of the filtered background image to obtain a contrast normalized prediction. The CSFs for the single channel model with different exponents should be the same except for a multiplicative constant, and the CSF for the multiple channel model and the single channel model should be the same when the summation exponent is two. As a result, we only show the filtered RMS contrast values of the six background images for the three different multiple channel filters in Table 6. The values corresponding to a contrast exponent of two were used for the three single filter model normalizations.
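
The normalization can be sketched as a single function. Expressing a and c in percent contrast, and converting c to a fraction in the a = 0 case, are assumptions that appear consistent with the tabulated values in this paper; the function name is illustrative.

```python
import math

def contrast_normalize(d_prime, c_percent, a_percent):
    """Apply the gain control factor a / sqrt(a^2 + c^2) to a predicted d'.
    When a is zero, divide by the RMS contrast c expressed as a fraction."""
    if a_percent == 0.0:
        return d_prime / (c_percent / 100.0)
    return d_prime * a_percent / math.hypot(a_percent, c_percent)

# Image pair A, multiple channel model, exp. 4:
# d' = 20.4 (Table 3), c = 27.3% (Table 6), a = 14.2% (Table 8).
normalized_multiple_4 = contrast_normalize(20.4, 27.3, 14.2)

# Image pair A, single channel model, exp. 2:
# d' = 36.6 (Table 3), c = 30.7% (Table 6, exp. 2 row), a = 0 (Table 8).
normalized_single_2 = contrast_normalize(36.6, 30.7, 0.0)
```

These two cases land close to the corresponding entries of Table 7 (9.4 and 119), within the rounding of the inputs.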

Table 6 - RMS image contrast (%).

exp. A    B    C    D    E    F
2    30.7 24.5 25.3 10.2 13.6 17.3
4    27.3 21.8 22.5  9.1 11.9 15.5
inf  15.4 11.9 12.1  4.8  6.3  8.9

Table 7 contains the contrast normalized model predictions in discriminability (d') units.

Table 7 - Contrast normalized model discriminabilities (d').

channels exp.    A     B     C     D     E     F
multiple 2      6.9  12.0   8.1  10.5   8.7   7.5
multiple 4      9.4  17.5   9.5  14.3   9.7  10.3
multiple inf   13.0  17.6  13.1  19.4  11.4  11.6
single   2    119   241   143   196   148   120
single   4      9.0  15.5   9.4  14.1   8.8   7.8
single   inf  473   883   533   859   235   359

Table 8 has values of a in percent contrast, prediction multipliers, and the resulting errors shown as a percentage of the pooled group discriminabilities. The values of a are smaller for the single channel model.

Table 8 - a, prediction multipliers, and errors for contrast normalized models.

channels exp.  a   multiplier error
multiple  2   6.7   0.63       25
multiple  4  14.2   0.49       16
multiple inf  9.2   0.40       26
single    2   0.0   0.0357     14
single    4   4.3   0.54       13
single   inf  0.0   0.0111     33

In principle, the values of a should be smaller (stronger contrast masking) for the single channel model, because the multiple channel model already has within-channel contrast masking. When a was zero, the division by c, a number less than one, made the multiplier even more different from unity. However, when a was nonzero, the factors were all closer to unity than they were prior to contrast normalization. The predictions are now only too large by a factor near two. The best fitting a was chosen to minimize the standard error of prediction, allowing the arbitrary factor in addition, so the choice of a was not affected by whether the absolute prediction was correct. After contrast normalization, both models performed much better and did equally well at their optimal summation exponent of 4. Figure 4 shows these results plotted as in Figure 3.

An implicit parameter of the contrast gain correction is the size of the region over which the contrast is computed. In this case it is a 2.7 deg square, not out of line with psychophysical measurements of the width of the contrast gain control region [13,14].
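
The 2.7 deg figure follows from the image geometry given earlier: the 128x128 pixel model images were pixel-averaged by two from a display viewed at 95 pixels per degree. A short arithmetic sketch, with an illustrative function name:

```python
def field_size_deg(n_pixels, pixels_per_degree):
    """Angular extent of an image side, in degrees of visual angle."""
    return n_pixels / pixels_per_degree

# 95 pixels per degree at full resolution, halved by the 2x pixel averaging.
region_deg = field_size_deg(128, 95 / 2)
```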

A discrimination experiment

The lower resolution gray scale images that were used as input to the models were also shown to two observers in a discrimination experiment. Instead of presenting all the images in one completely randomized sequence, mixture sets based on each of the six vehicle images were presented in separate blocks so that the observers could respond to any visible difference and not only rely on those that contributed to the detection of a vehicle.

Methods

The images were presented on a 15 inch Sony monitor using a lookup table to match the luminance and gamma of the monitor used in the object detection experiment described above. The viewing distance was set to give 47.5 pixels per degree. Intermediate images of 10%, 20%, and 40% vehicle were used for all six images. The observers, RH, a 29 year old male, and BB, a female of 37 years, had been refracted within 2 months of the experiment to normal acuity. The observers were asked to rate each image on a 4 point rating scale according to the following interpretation:
1-Definitely the non-vehicle image.
2-Probably the non-vehicle image.
3-Probably the vehicle image.
4-Definitely the vehicle image.

Trials were run in blocks of 60, using one vehicle and its background. Each of the four images (no-vehicle, 10%, 20%, and 40% vehicle) was presented with probability 0.25. Before each set of 10 trials, a trial was given with the 100% vehicle image as a memory aid. Six blocks were run for each of the 6 image pairs in a 6x6 Latin square design, randomized separately for each observer. The image duration was 1.0 sec.

Results

Discriminability parameter estimates scaled to represent the distance (d') from the 100% vehicle image to the non-vehicle image are given in Table 9 for the two observers and for the geometric mean of their results. The mean pattern is similar to that of the 19 detection observers (see Table 1). The standard deviation of prediction of detection d's from discrimination d's in the log domain gives a prediction error of 19%. The factor for predicting the detection results from the discrimination results is 49%. The factor near 50% needed to correct the model predictions after contrast normalization can thus be regarded as the factor needed to account for the difference between detection and discrimination in this situation.

Table 9 - Discrimination experiment d's.

image pair  A     B    C     D    E     F
  RH        7.8  18.7  6.1  12.0  7.3   8.6
  BB       16.3  20.9  9.6  16.7  8.9  14.2
  Ave.     11.3  19.8  7.6  14.1  8.1  11.1

Figure 5 plots the average group object detection d's against the average image discrimination d's of these two observers.

Conclusions

Discrimination models designed to answer, "Are these two images different?" can predict answers to the question, "Is there an object in this image?" Without global contrast masking, the multiple channel model performs better than the single channel model at predicting both absolute and relative discriminabilities. Contrast gain normalization improves predictions for both models and appears to obviate the need for multiple channels in this situation.

Acknowledgements

R. Horng wrote and ran the MATLAB program that generated the model and metric predictions. Parts of this work have been reported previously [15,16,17]. This work was supported in part by NASA RTOPs 505-64-53 and 537-08-20 and NASA Cooperative Agreement NCC 2-307 with Stanford University.

References

1. A.J. Ahumada, Jr. (1993) Computational image quality metrics: a review. SID Digest, 24, 305-308.
2. H.H. Barrett (1992) Evaluation of image quality through linear discriminant models. SID Digest, 23, 871-873.
3. A.B. Watson (1983) Detection and recognition of simple spatial forms, in O. J. Braddick and A. C. Sleigh, eds., Physical and biological processing of images, Springer-Verlag, Berlin.
4. A.B. Watson (1987) The Cortex transform: rapid computation of simulated neural images, Computer Vision, Graphics, and Image Processing, 39, 311-327.
5. A.B. Watson (1987) Efficiency of an image code based on human vision. JOSA A, 4, 2401-2417.
6. S. Daly (1993) The visible differences predictor: an algorithm for the assessment of image fidelity, in Watson, ed. Digital Images and Human Vision. MIT Press, Cambridge, MA.
7. J. Lubin (1993) The use of psychophysical data and models in the analysis of display system performance, in Watson, ed. Digital Images and Human Vision. MIT Press, Cambridge, MA.
8. W.S. Torgerson (1958) Theory and Methods of Scaling, Wiley, New York.
9. A. J. Ahumada, Jr. (1987) Putting the noise of the visual system back in the picture, Journal of the Optical Society of America A, vol. 4, pp. 2372-2378.
10. B. Girod (1989) The information theoretical significance of spatial and temporal masking in video signals, B. E. Rogowitz, ed., Human Vision, Visual Processing, and Digital Display, Proc. SPIE 1077, pp. 178-187.
11. P.G.J. Barten (1993) Spatiotemporal model for the contrast sensitivity of the human eye and its temporal aspects, in B. Rogowitz and J. Allebach, eds., Human Vision, Visual Processing, and Digital Display IV, Proc. Vol. 1913, SPIE, Bellingham, WA, pp. 2-14.
12. J.M. Foley (1994) Human luminance pattern-vision mechanisms: masking experiments require a new model, Journal of the Optical Society of America A, vol. 11, pp. 1710-1719.
13. J.S. DeBonet, Q. Zaidi (1994) Weighted spatial integration of induced contrast-contrast, Investigative Ophthalmology and Visual Science, vol. 35 (ARVO Suppl.), p. 1667.
14. M. D'Zmura, B. Singer, L. Dinh, J. Kim, J. Lewis (1994) Spatial sensitivity of contrast induction mechanisms, Optics and Photonics News, vol. 5, no. 8 (suppl), p. 48 (abs).
15. A.B. Watson and A.J. Ahumada, Jr. (1994) A modular, portable model of image fidelity, Perception, vol. 23, ECVP Suppl., p. 95 (Abs.).
16. A.M. Rohaly, A.J. Ahumada, Jr., and A.B. Watson (1994) Visual detection in natural backgrounds, Optics and Photonics News, 5, (OSA Annual Meeting Suppl.), 48 (Abs.).
17. A.J. Ahumada, Jr., A.B. Watson, A.M. Rohaly (1995) Models of human image discrimination predict object detection in natural backgrounds, in B. Rogowitz and J. Allebach, eds., Human Vision, Visual Processing, and Digital Display IV, Proc. Vol. 2411, SPIE, Bellingham, WA, pp. 355-362.