I have a lot of thoughts on this paper and on the general idea of invariance vs. selectivity: people often describe invariance and selectivity as an inherent tradeoff (invariance 'loses' information about the image in the hope of generalizing; selectivity 'homes in' on details that you might otherwise have wanted to throw away or treat equally with other features). Do you think this is a good way of looking at the problem? One thing that bothers me is that we talk as if we're throwing away information along the pipeline that must still be there (e.g., throw away the pose of the object so we can focus on what the object really is), when in reality many other tasks need that information, so it must somehow be reorganized instead. Does that holistic view change how we phrase the invariance vs. selectivity tradeoff problem?
I think the paper is very interesting in that it focuses on the fact that both selectivity and tolerance are things we are extremely good at, and both are equally important for robust object detection. You are right that in generalizing we tend to lose information, while selectivity (here, conjunction sensitivity) focuses on retaining details specific to the object. However, as visual information is processed along the ventral stream, the paper suggests that information about specific conjunctions of local features and invariance (to pose, scale, and context) is retained in IT. Since we are really good at both of these tasks, their results seem plausible. Since they uniformly sample neurons to measure population behaviour, it is unclear whether different regions of IT (and V4) encode information for invariance and selectivity.
I like this idea of questioning the selectivity/invariance tradeoff. One way of maybe thinking about it is the difference between factoring out something like the pose (and then storing it) and just being invariant to it. To have a concrete example: suppose you want to recognize a particular bottle anywhere in your house. In order to do so, you'll need a way to factor out the pose of the bottle with respect to the camera. I think in order to really bring it into correspondence with your model of the bottle for recognition and the other tasks you mention, you don't just need some representation that tosses out the pose data, but instead you need a representation that knows the pose and how to match it to the "ground-truth" instances of the bottle. So yes, you need to "factor out the pose", but I don't think you want to / need to throw that data out once you've obtained it. That said, I suspect that on a neuron level, you definitely need to have a trade-off. But I don't think that this lower-level representation constraint necessarily has any strong implications for higher level representations.
I think David has an interesting point. You don't really have to "throw away" the data once it's been factored out. In fact, it might just be that this is useful in the task of disambiguation.
This is a very interesting topic. It seems that there are maybe two subtly different ways to think about the visual processing problem:

1. Raw input data is factored or split according to various feature spaces as it passes through successive layers in the brain.
2. Raw input data is assigned various descriptive labels as it passes through successive layers in the brain.

Technically I think you can make those two things equivalent, but the kind of invariance Allie and the paper are describing seems to fit best with the second way of thinking. You use part of the brain to take this raw data and assign it some very general label, but that doesn't have to strip away anything.
I think that complex cells do discard information, but only what they really need to discard (for invariance). Simple cells split the information so that such a selective discard is feasible. (Bottle in the left of the receptive field OR bottle in the right of the receptive field) is invariant to the bottle's position, but that doesn't lead to the system confusing bottles with two-legged structures.
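To make that OR intuition concrete, here's a minimal sketch (made-up templates and toy images, not any model from the paper) of a complex cell as a max over position-shifted simple cells: the response is invariant to where the bar is, but still selective for what it is.

```python
import numpy as np

def simple_cell(image, template, position):
    """Response of a hypothetical simple cell: template match at one location."""
    h, w = template.shape
    y, x = position
    patch = image[y:y + h, x:x + w]
    return float(np.sum(patch * template))

def complex_cell(image, template, positions):
    """Complex cell as a max (a soft OR) over simple cells tuned to the same
    feature at different positions: position-invariant, yet still selective."""
    return max(simple_cell(image, template, p) for p in positions)

# Toy example: a vertical-bar "bottle" template.
template = np.array([[1.0], [1.0], [1.0]])
img_left = np.zeros((3, 4)); img_left[:, 0] = 1.0    # bar on the left
img_right = np.zeros((3, 4)); img_right[:, 3] = 1.0  # bar on the right
img_hbar = np.zeros((3, 4)); img_hbar[1, :] = 1.0    # a different feature

positions = [(0, 0), (0, 1), (0, 2), (0, 3)]
# Same max response regardless of position -> invariance...
print(complex_cell(img_left, template, positions))   # 3.0
print(complex_cell(img_right, template, positions))  # 3.0
# ...but the horizontal bar still gives a lower response -> selectivity.
print(complex_cell(img_hbar, template, positions))   # 1.0
```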
I agree. It doesn't mean that you throw away other information. It is used to untangle the different factors and choose or combine the ones that are suitable for the current scenario.
I liked this study, especially Fig. 7 and the fact that they used both an SVM and the simpler correlation approach: I've always been somewhat skeptical of a learned layer on top of neural data (although I honestly can't think of anything better). I'm less interested in the linear separability study later on: I doubt the visual system is a linear classifier on top of IT. One tidbit I found especially interesting, towards the end (p. 12991), was that it wasn't the larger receptive field alone doing the hard work (although their discussion suggests that "receptive field" is overloaded and tricky to define).
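For what it's worth, the two readouts can be sketched like this (entirely synthetic "population responses"; the correlation classifier follows the usual match-to-class-mean idea, and plain least squares stands in for the paper's linear SVM):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population responses: n_neurons per trial, two object classes,
# class-specific mean patterns plus trial noise (all numbers made up).
n_neurons, n_train, n_test = 50, 40, 20
mu_a, mu_b = rng.normal(size=n_neurons), rng.normal(size=n_neurons)
train_a = mu_a + 0.5 * rng.normal(size=(n_train, n_neurons))
train_b = mu_b + 0.5 * rng.normal(size=(n_train, n_neurons))
test_a = mu_a + 0.5 * rng.normal(size=(n_test, n_neurons))
test_b = mu_b + 0.5 * rng.normal(size=(n_test, n_neurons))

def correlation_classify(x, proto_a, proto_b):
    """'Simple' readout: correlate the population vector with each class's
    mean pattern and pick the higher correlation (no learned weights)."""
    ra = np.corrcoef(x, proto_a)[0, 1]
    rb = np.corrcoef(x, proto_b)[0, 1]
    return 'a' if ra > rb else 'b'

proto_a, proto_b = train_a.mean(axis=0), train_b.mean(axis=0)
preds = [correlation_classify(x, proto_a, proto_b)
         for x in np.vstack([test_a, test_b])]
truth = ['a'] * n_test + ['b'] * n_test
acc = float(np.mean([p == t for p, t in zip(preds, truth)]))
print(f"correlation readout accuracy: {acc:.2f}")

# 'Learned' linear readout: least-squares weights on +1/-1 labels.
X = np.vstack([train_a, train_b])
y = np.r_[np.ones(n_train), -np.ones(n_train)]
w, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)
scores = np.c_[np.vstack([test_a, test_b]), np.ones(2 * n_test)] @ w
acc_lin = float(np.mean((scores > 0) == (np.r_[np.ones(n_test),
                                               -np.ones(n_test)] > 0)))
print(f"linear readout accuracy: {acc_lin:.2f}")
```

On well-separated synthetic data both readouts do well; the interesting regime in the paper is where the learned weights help and the fixed correlation template does not.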
One separate thought I wanted to add, since it's a whole other can of worms by itself: I was somewhat disappointed (as a person interested in 3D vision) that the transformations they tried were all 2D (correct me if I'm wrong). Computationally, scale, translation, and even 2D rotation invariance seem simple or at least doable to me. Invariance to 3D transformations (and not in the contrived, watch-an-object-turn-on-a-turntable sense), however, is really neat. I'm very curious what would happen if they tried similar experiments with 3D rotations.
Can we say 'scale' here is equivalent to depth? For example, if you look at a 0.5x giraffe, the only way that is possible is if the giraffe has moved farther from the viewer (somewhat a notion of depth). But I suppose what you are actually asking is whether we have a sideways or frontal image of the giraffe for recognition?
It may be that the ventral stream mostly focuses on 2D/object recognition. Less is known about the dorsal stream, but it appears to have a greater role in what we computer vision people would call 3D understanding. This would make sense as it is right next to the cortex for motor control (think perception+manipulation). The ventral stream and the dorsal stream are not completely isolated; they are interconnected so it looks like information from each is relevant for either task.
Ah -- I had forgotten about that. Thanks for pointing it out. I guess I figured that the dorsal stream only handled stuff like layout and not stuff like recognize the car's rotation.
Two things:

1. To Aayush's comment: strictly speaking, 2D scale is not a true representation of scale unless the object depicted is totally flat and orthogonal to the image plane. A wide-angle lens close to a 3D object does not produce the same image as a telephoto lens far away. That said, in a lot of cases it doesn't make much difference.

2. Regarding the ventral/dorsal streams: it seems puzzling to me that part of the brain would be able to recognize an object without also knowing its pose. Also, doesn't it make sense that even if the dorsal stream is processing pose, the ventral stream would still be invariant to it? It would at least be a very interesting result if it were invariant to 2D transformations but not to 3D ones (suggesting that IT is in fact related more to what an object "looks like" than to some semantic level of meaning).
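Point 1 can be made concrete with a pinhole-projection sketch (all numbers made up): match the image size of an object's front face under close+wide-angle and far+telephoto, and the back face still projects differently.

```python
# Pinhole projection: image coordinate x = f * X / Z.
def project(f, X, Z):
    return f * X / Z

# Two points on a 3D object: front face at depth d, back face at depth d + 1,
# both with the same lateral offset X.
X = 1.0

# Close camera, short focal length (wide-angle): d = 2, f = 2.
near_front = project(2.0, X, 2.0)   # front face at depth 2
near_back = project(2.0, X, 3.0)    # back face at depth 3

# Far camera, long focal length (telephoto): d = 20, f = 20.
far_front = project(20.0, X, 20.0)  # front face at depth 20
far_back = project(20.0, X, 21.0)   # back face at depth 21

# The front face projects to the same coordinate in both cases,
# but the back face does not: close+wide is not far+tele.
print(near_front, near_back)  # 1.0 0.666...
print(far_front, far_back)    # 1.0 0.952...
```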
So, it looks like as information is processed through the ventral stream (at least in v4 to IT), the visual representation becomes 1. more invariant to identity preserving transformations and 2. more selective for complex visual features. To me, this seems like what deep learning architectures do (I know they are bio-inspired). Do you think that this invariance/selectivity/increase in receptive field can simply be explained by max-pooling? In other words, on an abstract level, do you think that the deep-learning architecture is a good model for the ventral stream?
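As a minimal sketch of the pooling intuition (a toy 1-D feature map, not any particular architecture): max-pooling absorbs small shifts, but only within a pooling window, which is one argument for building invariance up gradually across many layers.

```python
import numpy as np

def max_pool(x, size):
    """Non-overlapping 1-D max pooling."""
    return x.reshape(-1, size).max(axis=1)

# A 'feature map': one strong activation whose position varies slightly.
a = np.array([0, 0, 5, 0, 0, 0, 0, 0], dtype=float)
b = np.array([0, 5, 0, 0, 0, 0, 0, 0], dtype=float)  # shifted by one pixel

# After pooling over windows of 4, both give the same representation:
print(max_pool(a, 4))  # [5. 0.]
print(max_pool(b, 4))  # [5. 0.]

# A shift across a pooling boundary is NOT absorbed -> invariance is only
# local, so deep architectures stack pooling stages to extend its range.
c = np.array([0, 0, 0, 0, 5, 0, 0, 0], dtype=float)
print(max_pool(c, 4))  # [0. 5.]
```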
Abhinav and I have talked briefly about this and I hope to discuss it more tomorrow. One interesting point is that you can argue that only invariance or only selectivity happens at each layer of artificial NNs, whereas this paper seems to indicate that a single stage can do both. Whether the jump from V4 to IT is comparable to the jump between two given artificial NN layers is very debatable.
The jump from V4 to IT is not necessarily a single neuron jump, just like the jump from V1 to V2 involves both simple and complex cell types. Please correct me if this is wrong.
Not sure how much we should read into Figure 7, but it seems to be saying that scale invariance is already achieved at or before V4, while position and context invariance is only achieved in IT. In general, is the location in the visual pathway where processing happens a good indication of when it happens? I.e., is the experiment in this paper evidence that the brain processes for scale invariance before position and context invariance?
Aside: I am trying to view this paper (1) from a computer vision perspective, alongside the recent work of Girshick et al. (2) (the computer vision paper in this discussion). Let's first strip away all the nitty-gritty details of (1) and just look at its basic aim, which is to examine how selectivity and tolerance change along the ventral stream (going up from the retina to the lateral geniculate nucleus, V1, V2, V4, and IT); the paper suggests that both selectivity and tolerance increase as information propagates from V4 to IT. (Please correct me if I am wrong.)

1. In plain words, can we say that a better (more selective) feature representation is learned, one that is tolerant to size, position, and context, as visual information moves up?
2. In (2), feature representations are learned at different levels of a hierarchy. In the initial layers (pool-1), something like basic Gabor-filter-type responses are learned, and as we move up through pool-5, fc-6, and fc-7, nodes fire for more selective elements (e.g., faces of animals, people, color blobs, etc.). This approach beats the existing methods by a significant margin.

If both statements above are right, can we conclude that these deep networks are the right way of learning representations in computer vision?
I really liked the paper and the way all the results are presented, showing consistency across subjects, tasks, variations in eye movement, and measurement techniques (SVM linear readout vs. a correlation-based classifier). It was also good to see results relating single-neuron correlations to neuron population responses for selectivity (scrambling sensitivity) and invariance. Most interesting was that they show the increase in RF size from V4 to IT is not a cause but rather a reflection of the large computations needed to achieve invariance while identifying objects.
The paper concludes that "as visual information travels from V4 to IT, the population representation is reformatted to become more selective for feature conjunctions and more tolerant to identity preserving transformations". At first glance, this is an elegant conclusion. But I suspect the conclusion might overlook the "selectivity" between "feature conjunction selectivity" and "tolerance to transforms". As suggested by numerous papers on visual or selective attention, our visual system has to decide what to focus on and what to neglect. Therefore, there should be some mechanism coordinating feature selectivity and transform invariance; in other words, they should be mutually inhibitory, and it would be very interesting to see how they compete. Another point I want to raise is how well current deep learning models can match the ventral visual processing pathway. Currently, filtering operators (e.g., Gabor-like) are generally used to implement feature selectivity, while pooling operations (e.g., max-pooling, sum-pooling, histograms) are used to achieve transform invariance. The problem is whether these "neuron" operations can really achieve what we want. If not (very possibly), I doubt the deep learning paradigm has the capacity to work out operations that can really achieve the goal: mutually inhibited feature selectivity and transform tolerance.
I really liked the paper. They show that IT has more generalization power than V4, as it is tolerant to changes in pose, size, and background clutter. I also liked that they used standard SVM and correlation classifiers for their study. After seeing their results, I am curious how our brain processes visual information when an object's pose, size, or background has semantic meaning. For example, a thumbs-up and a thumbs-down have semantically opposite meanings although they are visually nearly the same. Does the brain process them differently, with orientation becoming part of the local features at an early processing stage? Also, the experiments were conducted on monkeys, so I think it will be difficult to find cases where differences in pose, size, etc. carry different semantic meanings for a monkey.
This paper focuses very heavily on how things change as we move up the network. I feel it's important to keep in mind that we still cannot conclude that this processing is entirely bottom-up. A short feedback loop is feasible on the time scales measured, either within a visual area (V4/IT) or across areas.
The paper shows how IT "generalizes" when objects are varied in size or position, or placed within a background, and how V4 responses are insensitive to image scrambling. Figures 5 and 7 do a good job of presenting the intuition and then the evidence for these two aspects. I am curious how the findings of this paper relate to deep networks in computer vision. I'm particularly interested in whether IT receives raw visual sensory input apart from the information from V4.