I have a lot of thoughts on this paper and on the general idea of invariance vs. selectivity: people often describe invariance and selectivity as an inherent tradeoff (invariance 'loses' information about the image in the hope of generalizing; selectivity 'homes in' on details that you might otherwise have wanted to throw away or treat equally with other features). Do you think this is a good way of looking at the problem? One thing that bothers me is that we talk as if we're throwing away information along the pipeline that must still be there (e.g., throw away the pose of the object so we can focus on what the object really is), when in reality many other tasks need that information, so it must somehow be reorganized instead. Does that holistic view change how we phrase the invariance vs. selectivity tradeoff problem?
I think the paper is very interesting in that it focuses on the fact that both selectivity and tolerance are things we are extremely good at, and both are equally important for robust object detection. You are right that in generalizing we tend to lose information, while selectivity (here, conjunction sensitivity) focuses on retaining details specific to the object. However, as visual information is processed along the ventral stream, the paper suggests that information about specific conjunctions of local features and invariance (to pose, scale, and context) is retained in IT. Since we are really good at both of these tasks, their results seem plausible. Since they uniformly sample neurons to measure population behaviour, it is unclear whether different regions of IT (and V4) encode information for invariance and selectivity.
I like this idea of questioning the selectivity/invariance tradeoff. One way of maybe thinking about it is the difference between factoring out something like the pose (and then storing it) and just being invariant to it. To have a concrete example: suppose you want to recognize a particular bottle anywhere in your house. In order to do so, you'll need a way to factor out the pose of the bottle with respect to the camera. I think in order to really bring it into correspondence with your model of the bottle for recognition and the other tasks you mention, you don't just need some representation that tosses out the pose data, but instead you need a representation that knows the pose and how to match it to the "ground-truth" instances of the bottle. So yes, you need to "factor out the pose", but I don't think you want to / need to throw that data out once you've obtained it. That said, I suspect that on a neuron level, you definitely need to have a trade-off. But I don't think that this lower-level representation constraint necessarily has any strong implications for higher level representations.
I think David has an interesting point. You don't really have to "throw away" the data once it's been factored out. In fact, it might just be that this is useful in the task of disambiguation.
This is a very interesting topic. It seems that there are maybe two subtly different ways to think about the visual processing problem:

1. Raw input data is factored or split according to various feature spaces as it passes through successive layers in the brain.
2. Raw input data is assigned various descriptive labels as it passes through successive layers in the brain.

Technically I think you can make those two things equivalent, but the kind of invariance Allie and the paper are describing seems to fit best with the second way of thinking. You use part of the brain to take this raw data and assign it some very general label, but that doesn't have to strip away anything.
I think that complex cells do discard information, but only what they really need to discard (for invariance). Simple cells split the information so that such a selective discard is feasible. (Bottle in the left of the receptive field OR bottle in the right of the receptive field) is invariant to the bottle's position, but that doesn't lead to the system confusing bottles with two-legged structures.
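To make that OR intuition concrete, here's a minimal sketch (made-up templates and toy images, not any model from the paper) of a complex cell as a max over position-shifted simple cells: the response is invariant to where the bar is, but still selective for what it is.

```python
import numpy as np

def simple_cell(image, template, position):
    """Response of a hypothetical simple cell: template match at one location."""
    h, w = template.shape
    y, x = position
    patch = image[y:y + h, x:x + w]
    return float(np.sum(patch * template))

def complex_cell(image, template, positions):
    """Complex cell as a max (a soft OR) over simple cells tuned to the same
    feature at different positions: position-invariant, yet still selective."""
    return max(simple_cell(image, template, p) for p in positions)

# Toy example: a vertical-bar "bottle" template.
template = np.array([[1.0], [1.0], [1.0]])
img_left = np.zeros((3, 4)); img_left[:, 0] = 1.0    # bar on the left
img_right = np.zeros((3, 4)); img_right[:, 3] = 1.0  # bar on the right
img_hbar = np.zeros((3, 4)); img_hbar[1, :] = 1.0    # a different feature

positions = [(0, 0), (0, 1), (0, 2), (0, 3)]
# Same max response regardless of position -> invariance...
print(complex_cell(img_left, template, positions))   # 3.0
print(complex_cell(img_right, template, positions))  # 3.0
# ...but the horizontal bar still gives a lower response -> selectivity.
print(complex_cell(img_hbar, template, positions))   # 1.0
```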
I agree. It doesn't mean that you throw away other information. It is used to untangle the different factors and choose or combine the ones that are suitable for the current scenario.
I liked this study, especially Fig. 7 and the fact that they used both an SVM and the simpler correlation approach: I've always been somewhat skeptical of a learned layer on top of neural data (although I honestly can't think of anything better). I'm less interested in the linear separability study later on: I doubt the visual system is a linear classifier on top of IT. One tidbit I found especially interesting, towards the end (p. 12991), was that it wasn't the larger receptive field alone doing the hard work (although their discussion suggests that "receptive field" is overloaded and tricky to define).
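For what it's worth, the two readouts can be sketched like this (entirely synthetic "population responses"; the correlation classifier follows the usual match-to-class-mean idea, and plain least squares stands in for the paper's linear SVM):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population responses: n_neurons per trial, two object classes,
# class-specific mean patterns plus trial noise (all numbers made up).
n_neurons, n_train, n_test = 50, 40, 20
mu_a, mu_b = rng.normal(size=n_neurons), rng.normal(size=n_neurons)
train_a = mu_a + 0.5 * rng.normal(size=(n_train, n_neurons))
train_b = mu_b + 0.5 * rng.normal(size=(n_train, n_neurons))
test_a = mu_a + 0.5 * rng.normal(size=(n_test, n_neurons))
test_b = mu_b + 0.5 * rng.normal(size=(n_test, n_neurons))

def correlation_classify(x, proto_a, proto_b):
    """'Simple' readout: correlate the population vector with each class's
    mean pattern and pick the higher correlation (no learned weights)."""
    ra = np.corrcoef(x, proto_a)[0, 1]
    rb = np.corrcoef(x, proto_b)[0, 1]
    return 'a' if ra > rb else 'b'

proto_a, proto_b = train_a.mean(axis=0), train_b.mean(axis=0)
preds = [correlation_classify(x, proto_a, proto_b)
         for x in np.vstack([test_a, test_b])]
truth = ['a'] * n_test + ['b'] * n_test
acc = float(np.mean([p == t for p, t in zip(preds, truth)]))
print(f"correlation readout accuracy: {acc:.2f}")

# 'Learned' linear readout: least-squares weights on +1/-1 labels.
X = np.vstack([train_a, train_b])
y = np.r_[np.ones(n_train), -np.ones(n_train)]
w, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)
scores = np.c_[np.vstack([test_a, test_b]), np.ones(2 * n_test)] @ w
acc_lin = float(np.mean((scores > 0) == (np.r_[np.ones(n_test),
                                               -np.ones(n_test)] > 0)))
print(f"linear readout accuracy: {acc_lin:.2f}")
```

On well-separated synthetic data both readouts do well; the interesting regime in the paper is where the learned weights help and the fixed correlation template does not.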
One separate thought I wanted to add, since it's a whole other can of worms by itself: I was somewhat disappointed (as a person interested in 3D vision) that the transformations they tried were all 2D (correct me if I'm wrong). Computationally, scale, translation, and even 2D rotation invariance seem simple or at least doable to me. Invariance to 3D transformations (and not in the contrived, watch-an-object-turn-on-a-turntable sense), however, is really neat. I'm very curious what would happen if they tried similar experiments with 3D rotations.
Can we say 'scale' here is equivalent to depth? For example, if you look at a 0.5x giraffe, the only way that is possible is if the giraffe has moved farther from the viewer (somewhat a notion of depth). But I suppose what you are actually asking is whether we have a sideways or frontal image of the giraffe for recognition?
It may be that the ventral stream mostly focuses on 2D/object recognition. Less is known about the dorsal stream, but it appears to have a greater role in what we computer vision people would call 3D understanding. This would make sense as it is right next to the cortex for motor control (think perception+manipulation). The ventral stream and the dorsal stream are not completely isolated; they are interconnected so it looks like information from each is relevant for either task.
Ah -- I had forgotten about that. Thanks for pointing it out. I guess I figured that the dorsal stream only handled stuff like layout and not stuff like recognize the car's rotation.
Two things:

1. To Aayush's comment: strictly speaking, 2D scale is not a true representation of scale unless the object depicted is totally flat and orthogonal to the image plane. A wide-angle lens close to a 3D object does not produce the same image as a telephoto lens far away. That said, in a lot of cases it doesn't make much difference.

2. Regarding the ventral/dorsal streams: it seems puzzling to me that part of the brain would be able to recognize an object without also knowing its pose. Also, doesn't it make sense that even if the dorsal stream is processing pose, the ventral stream would still be invariant to it? It would at least be a very interesting result if it were invariant to 2D transformations but not to 3D ones (suggesting that IT is in fact related more to what an object "looks like" than to some semantic level of meaning).
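Point 1 can be made concrete with a pinhole-projection sketch (all numbers made up): match the image size of an object's front face under close+wide-angle and far+telephoto, and the back face still projects differently.

```python
# Pinhole projection: image coordinate x = f * X / Z.
def project(f, X, Z):
    return f * X / Z

# Two points on a 3D object: front face at depth d, back face at depth d + 1,
# both with the same lateral offset X.
X = 1.0

# Close camera, short focal length (wide-angle): d = 2, f = 2.
near_front = project(2.0, X, 2.0)   # front face at depth 2
near_back = project(2.0, X, 3.0)    # back face at depth 3

# Far camera, long focal length (telephoto): d = 20, f = 20.
far_front = project(20.0, X, 20.0)  # front face at depth 20
far_back = project(20.0, X, 21.0)   # back face at depth 21

# The front face projects to the same coordinate in both cases,
# but the back face does not: close+wide is not far+tele.
print(near_front, near_back)  # 1.0 0.666...
print(far_front, far_back)    # 1.0 0.952...
```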
So, it looks like as information is processed through the ventral stream (at least in v4 to IT), the visual representation becomes 1. more invariant to identity preserving transformations and 2. more selective for complex visual features. To me, this seems like what deep learning architectures do (I know they are bio-inspired). Do you think that this invariance/selectivity/increase in receptive field can simply be explained by max-pooling? In other words, on an abstract level, do you think that the deep-learning architecture is a good model for the ventral stream?
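As a minimal sketch of the pooling intuition (a toy 1-D feature map, not any particular architecture): max-pooling absorbs small shifts, but only within a pooling window, which is one argument for building invariance up gradually across many layers.

```python
import numpy as np

def max_pool(x, size):
    """Non-overlapping 1-D max pooling."""
    return x.reshape(-1, size).max(axis=1)

# A 'feature map': one strong activation whose position varies slightly.
a = np.array([0, 0, 5, 0, 0, 0, 0, 0], dtype=float)
b = np.array([0, 5, 0, 0, 0, 0, 0, 0], dtype=float)  # shifted by one pixel

# After pooling over windows of 4, both give the same representation:
print(max_pool(a, 4))  # [5. 0.]
print(max_pool(b, 4))  # [5. 0.]

# A shift across a pooling boundary is NOT absorbed -> invariance is only
# local, so deep architectures stack pooling stages to extend its range.
c = np.array([0, 0, 0, 0, 5, 0, 0, 0], dtype=float)
print(max_pool(c, 4))  # [0. 5.]
```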
Abhinav and I have talked briefly about this and I hope to discuss it more tomorrow. One interesting point is that you can argue that only invariance or only selectivity happens at each layer of artificial NNs, whereas this paper seems to indicate that a single stage can do both. Whether the jump from V4 to IT is comparable to the jump between two given artificial NN layers is very debatable.
The jump from V4 to IT is not necessarily a single neuron jump, just like the jump from V1 to V2 involves both simple and complex cell types. Please correct me if this is wrong.
Not sure how much we should read into Figure 7, but it seems to be saying that scale invariance is already achieved at or before V4, while position and context invariance is only achieved in IT. In general, is the location in the visual pathway where processing happens a good indication of when it happens? I.e., is the experiment in this paper evidence that the brain processes for scale invariance before position and context invariance?
Aside: I am trying to view this paper (1) from a computer vision perspective, alongside the recent work of Girshick et al. (2) (the computer vision paper in this discussion). Let's first strip away all the nitty-gritty details of (1) and just look at its basic aim, which is to examine how selectivity and tolerance change along the ventral stream (going up from the retina to the lateral geniculate nucleus, V1, V2, V4, and IT); the paper suggests that both selectivity and tolerance increase as information propagates from V4 to IT. (Please correct me if I am wrong.)

1. In plain words, can we say that a better (more selective) feature representation is learned, one that is tolerant to size, position, and context, as visual information moves up?
2. In (2), feature representations are learned at different levels of a hierarchy. In the initial layers (pool-1), something like basic Gabor-filter-type responses are learned, and as we move up through pool-5, fc-6, and fc-7, nodes fire for more selective elements (e.g., faces of animals, people, color blobs, etc.). This approach beats the existing methods by a significant margin.

If both statements above are right, can we conclude that these deep networks are the right way of learning representations in computer vision?
I really liked the paper and the way all the results are presented, showing consistency across subjects, tasks, variations in eye movement, and measurement techniques (SVM linear readout vs. a correlation-based classifier). It was also good to see results relating single-neuron correlations to neuron population responses for selectivity (scrambling sensitivity) and invariance. Most interesting was that they show the increase in RF size from V4 to IT is not a cause but rather a reflection of the large computations needed to achieve invariance while identifying objects.
The paper concludes that "as visual information travels from V4 to IT, the population representation is reformatted to become more selective for feature conjunctions and more tolerant to identity preserving transformations". At first glance, this is an elegant conclusion. But I suspect the conclusion might overlook the "selectivity" between "feature conjunction selectivity" and "tolerance to transforms". As suggested by numerous papers on visual or selective attention, our visual system has to decide what to focus on and what to neglect. Therefore, there should be some mechanism coordinating feature selectivity and transform invariance; in other words, they should be mutually inhibitory, and it would be very interesting to see how they compete. Another point I want to raise is how well current deep learning models can match the ventral visual processing pathway. Currently, filtering operators (e.g., Gabor-like) are generally used to implement feature selectivity, while pooling operations (e.g., max-pooling, sum-pooling, histograms) are used to achieve transform invariance. The problem is whether these "neuron" operations can really achieve what we want. If not (very possibly), I doubt the deep learning paradigm has the capacity to work out operations that can really achieve the goal: mutually inhibited feature selectivity and transform tolerance.
I really liked the paper. They show that IT has more generalization power than V4, as it is tolerant to changes in pose, size, and background clutter. I also liked that they used standard SVM and correlation classifiers for their study. After seeing their results, I am curious how our brain processes visual information when an object's pose, size, or background has semantic meaning. For example, a thumbs-up and a thumbs-down have semantically opposite meanings although they are visually nearly the same. Does the brain process them differently, with orientation becoming part of the local features at an early processing stage? Also, the experiments were conducted on monkeys, so I think it will be difficult to find cases where differences in pose, size, etc. carry different semantic meanings for a monkey.
This paper focuses very heavily on how things change as we move up the network. I feel it's important to keep in mind that we still cannot conclude that this processing is entirely bottom-up. A short feedback loop is feasible on the time scales measured, either within a visual area (V4/IT) or across areas.
The paper shows how IT "generalizes" when objects are varied in size or position, or placed within a background, and how V4 responses are insensitive to image scrambling. Figures 5 and 7 do a good job of presenting the intuition and then the evidence for these two aspects. I am curious how the findings of this paper relate to deep networks in computer vision. I'm particularly interested in whether IT receives raw visual sensory input apart from the information from V4.