Development of a translation-invariant image representation

Antonio Tabernero
Instituto de Optica, Serrano 121
Madrid 28006, Spain

Albert J. Ahumada Jr.
NASA Ames Research Center, MS 262-2
Moffett Field, CA 94035

Abstract

The simple cells of primary visual cortex have been proposed to result from Hebbian associative learning [15]. Sanger's model [2] of this theory develops a `column' of units supported by a small patch of visual field. The receptive fields of these units converge to the eigen vectors of the covariance matrix of the retinal inputs. Since many eigen values are effectively equal, columns associated with different patches do not in general have units with similar receptive fields in corresponding positions in the columns. Maloney and Ahumada's translation-invariance learning algorithm [1] can ensure that a layer of linear receptive fields is uniform; that is, it can force the corresponding column elements to have the same receptive field. Simulations of these two processes on regularly sampled arrays show that they can cooperatively find eigen vectors and force corresponding column elements to have the same receptive fields. When the image is sampled at slightly disordered positions, the eigen vectors of different columns are no longer identical, but the two processes still cooperate to give an appropriate output network.

1. Introduction

In this work we combine two previously proposed schemes, Hebbian associative learning and a translation-invariance learning algorithm, to develop columns of units modeling the simple cells of primary visual cortex.

Following the pioneering work of von der Malsburg [14], many have proposed that Hebbian association principles might account for the development of the simple cells of primary visual cortex, that is, the spatial-frequency and orientation selective cells with approximately linear receptive fields (RFs). Two recent works contain references to the early development of these ideas [11, 12]. Sanger's recent model [2] of this theory is especially convenient because of the mathematical elegance of the result, originally found by Oja [13]. Sanger's model develops a `column' of units supported by a small patch of visual field. The RFs of these units converge to the eigen vectors of the covariance matrix of the retinal inputs, ordered by the size of the eigen values. Some of these learned eigen vectors resemble edge and bar detectors, orientation specific and somewhat localized in the spatial frequency domain. Since some RFs of cells in the visual cortex [4, 3] also show this selectivity, this model can generate cortex-like RFs. However, since many eigen values are effectively equal, columns supported by different patches do not in general have units with similar RFs in corresponding positions in the columns.

Maloney and Ahumada [1] proposed a translation-invariance (TI) algorithm to calibrate a simple linear visual system. Their method can compensate for irregularities in the sampling lattice. Given a disordered sampling lattice, the trained weights of the neural network transform the inputs in such a way that the outputs are the values of the original image sampled at regular positions. The network is therefore forcing the RFs of all output units to be equal. In their case all the RFs are delta functions, provided the inputs are low-passed suitably for the possible distortion of the sampling array. The TI learning algorithm by itself only tries to make all the RFs equal; the reason that all the RFs end up as delta functions is that one RF is fixed to be a delta function by connecting it to a single input. Clearly, the same algorithm can be used to train the network to compute the output of any given RF; for example, it could generate a layer of cells having a specific orientation and spatial frequency. However, to replicate an RF, there must be one fixed cell with the desired profile. Having one cell with this special status does not seem biologically realistic.

The main idea of this paper is to apply Sanger's Hebbian algorithm and the TI algorithm simultaneously. Our hope was that the two would be compatible: the Hebbian algorithm would develop cortex-like RFs, while the TI algorithm would organize them into corresponding layers in each column across the visual field, correcting problems associated with the irregularities of the sampling array.

2. Modeling

We will be modeling a simple visual system with linear connections between the inputs and the outputs. The input will consist of an array of photoreceptors that will sample the training image. The size of this array will be N (although we will show results for the 2D case, all the discussion will be restricted to the 1D case, for the sake of simplicity). The output will be N x NRF 'cells', NRF being the number of different kinds of RFs at each of N output positions.

Therefore, the weights $W_{r,j,i}$ connecting the inputs $x_i$ and the outputs $y_{r,j}$ will be triply subscripted, so that

$$y_{r,j} = \sum_i W_{r,j,i}\, x_i , \qquad \text{Eq. (1)}$$

where r ranges over the NRF levels at a position, and i and j range over the N positions. In particular, if we fix the type of RF (r) and an output position (j), the resulting set of weights is what we call the RF of that cell.
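As an illustration of Eq. (1), the following is a minimal NumPy sketch for the 1D case. The array sizes, the random weights, and the variable names are assumptions chosen only for the example; they are not values used in the simulations.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
N = 5       # number of input (and output) positions
N_RF = 3    # number of RF types at each output position

rng = np.random.default_rng(0)
W = rng.normal(size=(N_RF, N, N))   # W[r, j, i]: weight from input i to output (r, j)
x = rng.normal(size=N)              # x[i]: photoreceptor samples

# Eq. (1): y[r, j] = sum_i W[r, j, i] * x[i]
y = np.einsum("rji,i->rj", W, x)

# The RF of the cell of type r = 1 at position j = 2 is the weight vector W[1, 2, :];
# its response is the inner product of that RF with the input.
rf = W[1, 2, :]
response = rf @ x
```

Storing the weights as a single three-index array keeps the two later training steps simple: fixing r selects one layer, and fixing j selects one column.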

3. Training Procedure

In the combined training procedure, the TI algorithm is applied independently for each RF type (r is fixed), and the Hebbian algorithm is applied independently at each fixed output position.

3.1 The Hebbian component

Sanger's Generalized Hebbian algorithm increments the weights of the network after each iteration by

$$\Delta W_{r,j,i} = \gamma\, y_{r,j} \Bigl( x_i - \sum_{k \le r} W_{k,j,i}\, y_{k,j} \Bigr), \qquad \text{Eq. (2)}$$

where $\gamma$ is the Hebbian learning rate. For r = 1, the weights are incremented according to the correlation between the inputs and the outputs. For outputs at higher levels in the columns, once the weights are close to the eigen vectors of the input covariances, the outputs are correlated with the residuals of the inputs after the parts correlated with the lower units have been extracted. Note that in this fully connected system, if the weights for two column positions j and j' are equal at some point in time, they will be equal thereafter.
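The increment of Eq. (2) can be sketched in NumPy as follows. This is an illustrative implementation for the 1D case, not the code used for the simulations; the function name `gha_increment` and the array layout `W[r, j, i]` are assumptions.

```python
import numpy as np

def gha_increment(W, x, gamma):
    """Increment of Eq. (2) for all RF types r and output positions j.

    W has shape (N_RF, N, N) and is indexed W[r, j, i]; x has shape (N,).
    """
    y = np.einsum("rji,i->rj", W, x)                  # outputs, Eq. (1)
    # Running sum over k <= r of W[k, j, i] * y[k, j] (cumulative along r).
    recon = np.cumsum(W * y[:, :, None], axis=0)
    return gamma * y[:, :, None] * (x[None, None, :] - recon)
```

Because every column receives the same input x, two columns that start with equal weights receive equal increments, which is the property noted at the end of the paragraph above.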

3.2 The TI component

The TI algorithm tries to match what it sees after an 'eye movement' with what it would expect from its knowledge of the sampled image before the translation. It will succeed if it learns the positions of the sampling array. At that point the system will be calibrated. In order to do so, for each simulated eye movement the system computes the expected new image, and then compares it with the real one. This gives an error term that is used to correct the weights with a modified Widrow-Hoff algorithm. The Widrow-Hoff algorithm [9, 10] adjusts weights to make the future outputs more like a desired output by the increment

$$\Delta W_{r,j,i} = \lambda\, (y'_{r,j} - y_{r,j})\, x_i , \qquad \text{Eq. (3)}$$

where $y'_{r,j} - y_{r,j}$ is the difference between the desired output (usually provided by the trainer or supervisor) and the actual output, and $\lambda$ is the learning rate. In the TI procedure, $y'$ is not supplied by a trainer; it is the output image preceding the eye movement, translated (by interpolation and re-sampling) to the corresponding position.
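A minimal 1D sketch of one TI step built on Eq. (3) is given below. Several details are assumptions made only for the example: the translation is done by linear interpolation with wrap-around boundaries, the sign convention of the shift is arbitrary, and the function name `ti_increment` is hypothetical.

```python
import numpy as np

def ti_increment(W, x_new, y_prev, shift, lam):
    """Increment of Eq. (3) for one simulated eye movement (1D sketch).

    W: (N_RF, N, N) weights W[r, j, i]; x_new: (N,) samples after the movement;
    y_prev: (N_RF, N) outputs before the movement; shift: translation in units
    of output positions (sign convention is an assumption); lam: learning rate.
    """
    n = W.shape[1]
    pos = np.arange(n)
    # y'[r, j]: previous output image re-sampled at the shifted positions,
    # using linear interpolation with wrap-around boundaries (an assumption).
    y_target = np.stack(
        [np.interp(pos + shift, pos, row, period=n) for row in y_prev]
    )
    y_new = np.einsum("rji,i->rj", W, x_new)          # Eq. (1) after the movement
    return lam * (y_target - y_new)[:, :, None] * x_new[None, None, :]
```

Each RF type r is handled independently here: the error for layer r depends only on the outputs of layer r, in line with applying the TI algorithm separately to each level.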

3.3 The combined procedure

Now, the training procedure: An image is generated, presented, and sampled by the array of photoreceptors. These samples will be the input data for both the TI and the Hebbian algorithms. The TI algorithm is applied first. Since we want to propagate different types of RFs, the TI algorithm is applied independently to each level. Then, the Hebbian algorithm is applied to each of the positions in the network. After this, a random 'eye movement' is produced, the image is sampled again, and we repeat the whole procedure. After a number of repeated viewings of the same image (usually from 10 to 50), a new image is generated and the process begins again.

The learning rate of the TI algorithm is constant (approximately 0.5), whereas that of the Hebbian algorithm is reduced according to the following rule: The Hebbian algorithm error is averaged over blocks of about 10 images, and if the average error of a block is lower than that of the previous one, the learning rate is assumed to be appropriate and kept. Otherwise, the rate is reduced by a constant factor (approximately 0.75).
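The rate-reduction rule for the Hebbian algorithm can be sketched as a small Python function. The function name, the initial rate, and the block errors in the usage lines are hypothetical; only the rule itself (keep the rate if the block error decreased, otherwise multiply it by about 0.75) comes from the description above.

```python
def update_hebbian_rate(rate, block_error, prev_block_error, factor=0.75):
    """Keep the rate if the block-averaged error decreased; otherwise shrink it."""
    if prev_block_error is None or block_error < prev_block_error:
        return rate
    return rate * factor

# Usage with hypothetical block-averaged errors (one value per block of ~10 images).
rate = 0.1          # hypothetical initial Hebbian learning rate
prev = None
for err in [1.0, 0.8, 0.9, 0.7]:
    rate = update_hebbian_rate(rate, err, prev)
    prev = err
```

In this usage example the rate is reduced once, after the third block, because its error is not lower than that of the second block.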

The process continues until both the TI and Hebbian errors have decreased below some previously defined limit. For the results shown below, the resulting number of training images was usually about 500 - 1000.

4. Input Images

The training images used here are finite Fourier series, composed of a limited number of frequencies with random (Gaussian distributed) amplitudes. After generating the images, we apply a low-pass Gaussian filter with a standard deviation of approximately 2 sample spaces. The finite number of spatial frequencies and the Gaussian low-pass filter can be thought of as representing the blur produced by the optics of the eye and the low-pass nature of natural imagery. The TI algorithm has been seen to perform poorly with inadequately sampled imagery [7].
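A 1D sketch of such a training image is shown below. It assumes a periodic image with a hypothetical number of harmonics and applies the Gaussian low-pass filter analytically, by scaling each component with the filter's frequency response exp(-2 pi^2 sigma^2 f^2). The function name `make_image` and its parameters are assumptions for illustration.

```python
import numpy as np

def make_image(n_freqs, period, sigma=2.0, rng=None):
    """Return a function giving a band-limited random 1D image at any positions.

    n_freqs: number of harmonics (hypothetical parameter); period: image period
    in sample spacings; sigma: std of the Gaussian low-pass filter, in sample spacings.
    """
    rng = np.random.default_rng() if rng is None else rng
    freqs = np.arange(1, n_freqs + 1) / period          # cycles per sample spacing
    a = rng.normal(size=n_freqs)                        # Gaussian cosine amplitudes
    b = rng.normal(size=n_freqs)                        # Gaussian sine amplitudes
    gain = np.exp(-2.0 * (np.pi * sigma * freqs) ** 2)  # Gaussian filter response

    def image(positions):
        phase = 2.0 * np.pi * np.outer(np.atleast_1d(positions), freqs)
        return np.cos(phase) @ (gain * a) + np.sin(phase) @ (gain * b)

    return image

# Sample the same image on a regular array and after a simulated eye movement.
img = make_image(n_freqs=3, period=7.0, rng=np.random.default_rng(0))
regular = np.arange(7.0)
x_before = img(regular)
x_after = img(regular + 0.4)     # hypothetical translation of 0.4 sample spacings
```

Returning a function of position, rather than a fixed array, makes it straightforward to sample the same image at disordered receptor positions or after a translation.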


Figure 1. A sample noise image used in the simulations and the 7x7 array of photoreceptor sample points.

5. Results

5.1 Regular case

Training the network with a regular sampling array gives results that are easy to interpret. In this case, we obtain a set of different RFs, each correctly replicated across the different output positions.


Figure 2. RFs obtained with a regular 7 x 7 input array: (a) 3, (b) 5, and (c) 7 layers of RFs.

Fig. 2 shows some cases in which the size of the input array was 7 x 7 (49 photoreceptors). Figs. 2a, 2b, and 2c correspond to 3, 5, and 7 layers of RFs. Only the set of RFs at one position is shown since they are all similar. The RFs obtained correspond to a low-pass filter (which always appears in the first position) and several band-pass filters with different orientations. Almost all the RFs (except for the fourth one in Fig. 2b) have vertical, horizontal, or 45 deg orientations. Some of them share the same frequency and orientation but have a 90 deg phase shift with respect to each other; that is, they are in phase quadrature. All of them are orthogonal to each other. Some of these characteristics resemble those of the RFs of cells in the visual cortex [3].

In this case, the response of an output 'cell' to a given input, that is, the inner product between the input and the RF of that 'cell', captures a particular spatial-frequency feature of that image. The translation invariance is clear, since the RFs of a particular type at all the output positions are identical to each other (the degree of similarity among them depends on the limits that we imposed on the error of the TI algorithm).

5.2 Irregular case

In the case of irregular sampling, the RFs can look very different from those of the regular case in terms of their weights $W_{r,j,i}$. Moreover, these weights are not translation invariant, since they compensate for the irregularities of the sampling array as well as filtering out some aspect of the images.


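Figure 3. (a) Weights from a regular 5 x 5 input array with 5 layers of RFs. (b) Weights obtained at output position (0,0) with the disordered sampling array. (c) The weights of (b) converted to weights for the regularly sampled positions.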

These effects can be seen in Fig. 3. Fig. 3a shows weights from a 5 x 5 regular input array with 5 different layers of RFs. The resulting RFs are similar to those of Fig. 2. We then repeat the procedure with a disordered sampling array. The distortion imposed on the (x,y) coordinates of the lattice was:

| ( 0, 0)  ( 0, 1)  ( 0,-1)  ( 1, 0)  ( 1, 1)  |
| ( 0, 1)  ( 1, 0)  ( 0, 0)  (-1, 1)  ( 0, 0)  |
| (-1, 1)  ( 0, 1)  (-1,-1)  ( 1, 0)  ( 0, 1)  | 
| ( 1, 0)  ( 1, 1)  ( 0, 0)  ( 1,-1)  ( 0,-1)  |
| ( 0, 1)  ( 0, 0)  (-1,-1)  ( 1, 1)  ( 1, 0)  |
where each unit is one fourth of the inter-receptor distance in the regular array. With this sampling configuration, the set of weights obtained at the output position (0,0) is shown in Fig. 3b.

To see if these weights are appropriate, we need to translate them into weights for the regularly sampled positions. One way of doing this is to present to the network each of the images that are appropriately band limited, have the value unity at one of the regular points, and are zero at all the other regular points. For the above case the image for point (0,0) is the sum of all the cosine phase components of the DFT, appropriately normalized. The output to each such image gives the weight of the regularly sampled equivalent for that position. Another approach is to use the original TI algorithm to find weights $T_{j,i}$ that transform the irregular samples into regular samples. The inverse of this matrix then provides a transformation for converting weights for the irregular sample positions into weights for the regular sample positions. Fig. 3c shows that the weights of Fig. 3b actually correspond to weights that are similar to those obtained in the regular case, and the weights for the other positions are correspondingly similar.
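The second approach can be illustrated with a small 1D NumPy sketch. Here the conversion matrix T is not learned with the TI algorithm; as a stand-in, it is computed directly from a real Fourier basis, under the assumption that the images are periodic and band limited to that basis. The helper `fourier_basis`, the jitter values, and the random weights are all hypothetical.

```python
import numpy as np

def fourier_basis(positions, n):
    """Real Fourier basis (DC plus (n - 1) / 2 harmonics, n odd) at the given positions."""
    cols = [np.ones_like(positions)]
    for k in range(1, (n - 1) // 2 + 1):
        cols += [np.cos(2 * np.pi * k * positions / n),
                 np.sin(2 * np.pi * k * positions / n)]
    return np.stack(cols, axis=1)

n = 5
rng = np.random.default_rng(0)
regular = np.arange(n, dtype=float)
# Jitter of 0 or +-1/4 of the inter-receptor distance (hypothetical values).
irregular = regular + 0.25 * rng.integers(-1, 2, size=n)

# T maps irregular samples to regular samples for images in the span of the basis.
T = fourier_basis(regular, n) @ np.linalg.inv(fourier_basis(irregular, n))

coeffs = rng.normal(size=n)                   # a random band-limited image
x = fourier_basis(regular, n) @ coeffs        # its regular samples
y = fourier_basis(irregular, n) @ coeffs      # its irregular samples

w_irregular = rng.normal(size=n)              # hypothetical weights on irregular samples
w_regular = w_irregular @ np.linalg.inv(T)    # equivalent weights on regular samples
```

With these definitions, `T @ y` reproduces `x`, and `w_irregular @ y` equals `w_regular @ x`, so the two sets of weights compute the same output from the two samplings of the same image.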

6. Theory

What we need to show is that applying the eigen vectors obtained in the irregular case to the non-regular samples of the image still yields what we want, that is, what we obtained in the regular case.

Let

$x = (x_i)$ be the column vector of the regular samples of the image, and

$y = (y_i)$ the column vector of the non-regular samples (in this section y denotes input samples, not the outputs of Eq. (1)).

Given certain restrictions about the input images (that make the TI algorithm converge, and that have been preserved in this work), we can recover the regular samples from the non-regular ones through a conversion matrix T. This is exactly what was being done by Maloney and Ahumada [1]. Therefore,

$$x = T\,y . \qquad \text{Eq. (4)}$$

Then if Q(x) is the correlation matrix of the regularly sampled input,

$$Q(x) = \bigl(E(x_i x_j)\bigr) = E(x\,x^T) = E\bigl(T y\,(T y)^T\bigr) = T\,E(y\,y^T)\,T^T = T\,Q(y)\,T^T . \qquad \text{Eq. (5)}$$
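Eq. (5) can be checked numerically with sample correlation matrices. In the sketch below the conversion matrix T and the sample vectors are random stand-ins chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_samples = 5, 1000
T = np.eye(n) + 0.1 * rng.normal(size=(n, n))   # hypothetical conversion matrix
Y = rng.normal(size=(n, n_samples))             # columns: non-regular sample vectors y
X = T @ Y                                       # Eq. (4) applied to every column

Q_y = Y @ Y.T / n_samples                       # sample estimate of E(y y^T)
Q_x = X @ X.T / n_samples                       # sample estimate of E(x x^T)

# Eq. (5): Q(x) = T Q(y) T^T, which holds exactly for the sample estimates as well.
eq5_holds = np.allclose(Q_x, T @ Q_y @ T.T)
```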

We know that the Hebbian algorithm computes the first eigen vectors of the input correlation matrix (our RFs). Let C(x) denote the matrix whose rows are the eigen vectors of Q(x). Then, for symmetric matrices such as Q(x) and Q(y):

$$A(x) = C(x)\,Q(x)\,C(x)^T , \qquad A(y) = C(y)\,Q(y)\,C(y)^T , \qquad \text{Eq. (6)}$$

where A(x) and A(y) are diagonal matrices.

Since A(x) and A(y) are both diagonal, they can be related through another diagonal matrix F:

$$A(y) = F\,A(x)\,F^T .$$

Combining this relation with Equations 5 and 6:

$$A(y) = F\bigl(C(x)\,Q(x)\,C(x)^T\bigr)F^T = F\bigl(C(x)\,(T\,Q(y)\,T^T)\,C(x)^T\bigr)F^T = (F\,C(x)\,T)\,Q(y)\,(F\,C(x)\,T)^T . \qquad \text{Eq. (7)}$$

Comparing Equation 7 with the expression for A(y) in Equation 6, we obtain:

$$C(y) = F\,C(x)\,T . \qquad \text{Eq. (8)}$$

We can see that in the regular case, since the rows of C(x) are the eigen vectors of Q(x) (the RFs obtained with the Hebbian algorithm), we were computing:

C(x) x, the inner product between each RF and the input x.

And now, in the irregular case we have:

C(y) y = F C(x) T y = F C(x) x = F(C(x) x).

Since F is diagonal, the only difference is a scale factor for each RF. Furthermore, the factor is the same for degenerate eigen values, so the correction affects equally those pairs of RFs that differ only in a phase shift.

This shows that what the network is computing in this irregular case is what we wanted: the 'spontaneous' generation of some weights that make the output be what it would have been had we applied some cortex-like filters to a regularly sampled array.

In order to test this we have corrected the 'disordered' RFs in C(y) with the matrix T. By Eq. (8), computing $C(y)\,T^{-1}$ should give us something similar to the original 'ordered' RFs of C(x). Indeed, this is what happens, as shown in Fig. 3c. In this figure the factors of matrix F are not visible, since the gray level range has been expanded to span 0 to 255. These factors can be appreciated if we compute the inner products between the 5 different RFs at a particular output position (0,0):

	    |0.694    0.000    0.000    0.000    0.000|
	    |0.000    0.588   -0.001    0.000    0.000|
	    |0.000   -0.001    0.588    0.000    0.000|
	    |0.000    0.000    0.000    0.674    0.000|
	    |0.000    0.000    0.000    0.000    0.675|
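The near-diagonal structure of this matrix is what Eq. (8) would predict: if the corrected RFs equal F C(x) and the rows of C(x) are orthonormal, their inner products form the diagonal matrix F F^T. The sketch below checks this with stand-in values; the orthonormal rows and the diagonal entries of F are hypothetical, not the simulation results.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_rf = 25, 5
# Hypothetical orthonormal RFs: the rows of C_x are orthonormal vectors.
C_x = np.linalg.qr(rng.normal(size=(n, n_rf)))[0].T
f = np.array([0.83, 0.77, 0.77, 0.82, 0.82])   # hypothetical diagonal of F
corrected = np.diag(f) @ C_x                   # stands in for C(y) T^{-1} = F C(x)

gram = corrected @ corrected.T                 # inner products between the RFs
```

Under these assumptions `gram` is diagonal with entries `f ** 2`, and RFs that share an eigen value (and hence a factor) share the same diagonal entry, as the second and third rows do in the matrix above.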

Acknowledgements

Work supported by NASA RTOP 506-71-51 and the Educational Ministry of Spain, CICYT TIC 91 - 0438.

References

1. L. T. Maloney and A. J. Ahumada, Jr. (1989) "Learning by Assertion: Two Methods for Calibrating a Linear Visual System", Neural Computation, Vol. 1, pp. 392-401.

2. T. D. Sanger (1989) "Optimal Unsupervised Learning in a Single-Layer Linear Feedforward Neural Network", Neural Networks, Vol. 2, pp. 459-473.

3. D. A. Pollen and S. F. Ronner (1983) "Visual Cortical Neurons as Localized Spatial Filters", IEEE Trans. on Systems, Man, and Cybernetics, Vol. SMC-13, No. 5, pp. 907-916.

4. S. Marcelja (1980) "Mathematical Description of the Response of Simple Cortical Cells", J. Opt. Soc. Am., Vol. 70, No. 11, pp. 1297-1300.

5. A. J. Ahumada, Jr. (1992) "Learning Receptor Positions", in M. Landy and J. A. Movshon, eds., Computational Models of Visual Processing, MIT Press, Cambridge, MA, pp. 23-34.

6. A. J. Ahumada, Jr. and J. B. Mulligan (1990) "Learning Receptor Position from Imperfectly Known Motions", in B. Rogowitz and J. Allebach, eds., Human Vision, Visual Processing, and Digital Display, Proc. SPIE, Vol. 1249, pp. 124-134.

7. A. J. Ahumada, Jr. and J. B. Mulligan (1991) "Network Compensation for Missing Sensors", in B. Rogowitz and J. Allebach, eds., Human Vision, Visual Processing, and Digital Display II, Proc. SPIE, Vol. 1453, pp. 134-146.

8. G. O. Stone (1986) "An Analysis of the Data Rule and the Learning of Statistical Associations", in D. E. Rumelhart and J. L. McClelland, eds., Parallel Distributed Processing, Vol. 1, MIT Press, Cambridge, MA, pp. 444-459.

9. A. B. Widrow and M. E. Hoff (1960) "Adaptive Switching Circuits", Inst. of Radio Engineers, WESCON Record, Part 4, pp. 96-104.

10. A. B. Widrow and S. D. Stearns (1985) Adaptive Signal Processing, Englewood Cliffs, NJ, Prentice-Hall.

11. S. Amari (1988) "Dynamical Stability of Formation of Cortical Maps", in M. A. Arbib and S. Amari, eds., Dynamic Interactions in Neural Networks: Models and Data, Springer, New York.

12. T. Kohonen (1989) Self-Organization and Associative Memory, Springer, New York.

13. E. Oja (1982) "A Simplified Neuron Model as a Principal Component Analyzer", J. Math. Biology, Vol. 15, pp. 267-273.

14. C. von der Malsburg (1973) "Self-Organization of Orientation Sensitive Cells in the Striate Cortex", Kybernetik, Vol. 14, pp. 85-100.

15. D. O. Hebb (1949) Organization of Behavior, John Wiley, Inc., New York.