I really liked this paper (OK, given my research, that's a bit predictable). I'm really curious what's actually going on here at a higher level and (more importantly) why. Here's my first take: the apparently far-away object is given the increased processing in an attempt to ensure that processing power is roughly proportional to actual area (i.e., as if the world were actually the 2.1D sketch and all the objects were flat cutouts along the camera axis). This makes some sense to me assuming that some form of grouping is already done: if you've got the 3D structure of a scene, your processing should somehow be scale-independent. I think what's puzzling is that this is happening at the feature-extraction layer. But here's where I think it becomes dangerous to treat the human visual system as a biological camera-plus-Matlab setup: I really don't know how the system is hooked up, and I don't know how much surprise to register beyond the fact that top-down information is changing feature extraction. Is it that the eye somehow has top-down 3D information about how large the object is, and thus V1 is extracting features as if the image had been normalized to some common scale? But irrespective of why, I found the top-down modulation of processing quite interesting. Although it's hard (since the illusion's effect is per-person), I wish they had also run a condition where the apparent size is the same.
David has some very interesting points. Please correct me if I am wrong. I agree with the first point and I really like the second one. A simple experiment: given two similar objects in the real world, the farther one will look smaller. But if you take a picture of one of the objects in a hallway (e.g., a chair in Smith Hall) and move a fixed-size template of the object along the hallway in the image, the object appears to change size as it moves along the camera axis. One might reason that since we are tuned to living in a 3D world, our brain expects the template's size in the image to vary with depth.
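The hallway experiment above can be sketched with a minimal pinhole-camera model (this is my own illustration, not from the paper; the focal length and sizes are made-up numbers):

```python
def projected_size(real_size, depth, focal_length=1.0):
    """Image-plane size of an object under perspective projection.

    Under a pinhole camera, image size falls off as 1/depth,
    which is the shrinkage the brain expects for a receding object.
    """
    return focal_length * real_size / depth

# Hypothetical chair, 1 m tall, seen at two depths down the hallway.
chair_height = 1.0
near = projected_size(chair_height, depth=2.0)   # 0.5
far = projected_size(chair_height, depth=8.0)    # 0.125

# The same chair 4x farther away projects to a 4x smaller image...
print(near / far)   # 4.0
```

...so a template pasted into the photo at a fixed pixel size violates this expected 1/depth shrinkage, and the visual system interprets the mismatch as the chair itself growing as it moves down the hallway.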
Yes, this is also similar to exploiting the visual system for optical illusions. Case in point:
http://www.illusionspoint.com/relative-sizes-optical-illusions/are-they-of-same-size/
I think this motivates the template argument.
Yes, it is very interesting to see that depth also plays a role in feature extraction. I wonder how critical these features are for computer vision, and how much we gain by using them. Just by looking at 2D images we often have the illusion mentioned above, because we take into account the relative sizes of surrounding objects. In the experiment Aayush mentions, the chair seems to vary in size (when the template size is fixed) because the other objects in the environment stay the same size, so the brain is under the illusion that the size is changing relative to the surroundings.
V1 is very early (this is where we have simple oriented edge detectors), so I wonder where this top-down influence is coming from. Could it be that feedback connections from the dorsal stream (an area believed to be involved in depth perception) are actually changing the bottom-up features themselves with some prior? I think there needs to be a paper like Elissa's that tries to find a sequence of phase locking between these areas.
David, you mentioned allocating processing power proportional to estimated size in space, but I'm not sure why that would be a good thing to do (although it does seem to be a legitimate interpretation of the results). This paper is puzzling to me. I think it's cool that top down processing goes all the way to V1, but I'm not sure what to make of what that information actually does when it gets there.
I think what the above link shows is that the top-down priors on the size of an object are relatively weak compared to the priors on how a scene should be structured, given its 2D projection in the image. This is also linked to the example with the giant chair (and the fact that we can tell the chair is big). I wonder how this compares to current state-of-the-art methods in computer vision? Are there similar methods that estimate the 3D scene structure and use it top-down to change how/what low-level features are computed?
I really like the idea of using depth in feature extraction. As Jacob mentioned earlier, I am also curious how V1 is getting this top-down information. I also think these size-based optical illusions work because the context gives us the feeling that an object is far away; if we remove the ground plane or change the context, the illusion won't have the same effect. For example:
http://www.illusionspoint.com/relative-sizes-optical-illusions/real-or-fake/
In the illusion above, we see a man apparently holding a ring that is close to us, but it is actually a building that is far away. We get this illusion because of the positioning of the hand. It would be interesting to study V1 in this scenario, and the role of context in depth estimation.
A depth-based representation is reasonable. One issue is where the depth information is extracted; this paper demonstrates that V1 acts on it. In light of our discussion on context, depth information can be viewed as a case of geometric context. I think another important issue is left unsolved in this paper: when does this happen? Is it a pure bottom-up process attributable solely to V1? Or is it a mechanism, like the context system we discussed before, that involves top-down feedback? That is, perhaps another region extracts the depth information and then modifies the way V1 processes low-level features. Given the current evidence, I think the latter is more likely. Or are there other alternative explanations?
The take-away from this paper is very interesting. However, the paper does not spend time analyzing how the retinotopic area increases for far-away objects. Nonetheless, this depth-based feature extraction is something cool.
This is very interesting, but also seems very strange. I thought V1 is dealing with small 2D features, so I don't understand why depth would be telling V1 to expect a smaller image on the retina. This seems to be another victory for context though.
I agree this is another context victory -- and I think your question is one a lot of people studying V1 would ask, too -- context tends to get ignored at that low level of a system. I believe the reasoning is that when you are looking for / discerning things at a high level, the type of information you get at the low level can make a big difference. Maybe it makes more sense from an attentional view -- the information coming from V1 will be less noisy and perhaps lead to better results down the pipeline.
One other thing I noticed: the shadows of the spheres also reinforce the depth perspective. The close sphere has a shadow consistent with an overhead view of the object, while the far one is consistent with a glancing angle. Some of their later experiments don't have this effect, but I wonder if the same illusion would hold if only the shadows were present, with no perspective lines.
In the context of top-down information affecting the behavior of neurons in the visual cortex, I found this paper interesting:
Feedforward, horizontal, and feedback processing in the visual cortex. Victor A. F. Lamme, Hans Supèr and Henk Spekreijse. Curr Opin Neurobiol. 1998 Aug;8(4):529-35. (http://www.ncbi.nlm.nih.gov/pubmed/9751656)
Among other things, this relatively old survey paper says that the receptive field of neurons in V1 can change due to horizontal connections.
In an attempt to disagree with the authors' intuition that V1 neurons adjust their receptive fields based on depth estimates fed back from higher visual areas, I have come up with three alternative hypotheses.
1. The visual cortex could be a prediction machine. The feedforward pass is followed by a feedback prediction of what the brain expects to see next. In the second stimulus, where the ball is further into the walls, the prediction corresponds to an increase in stimulus size as the ball rolls forward down the corridor. Whether or not this prediction uses a depth estimate is secondary.
2. The scene layout processed by higher visual areas can feed back to V1 to increase the receptive fields of neurons farther from the center, making them fire. The higher-level visual areas can capture the alignment of the wall edges and the sphere (implicitly inferring the scene layout) and feed back to V1 to facilitate the illusion.
3. The neurons in V1 that process information at coarser scales can laterally stimulate neurons farther from the center to facilitate an affine rectification of the image. Here I'm basically trying to avoid going into the dorsal stream and then coming back into V1.
Previously, we mostly talked about fMRI responses. It is quite interesting that the perceived area of the receptive field is also adjusted across different scenarios. Could the receptive-field area of V1 or other regions also serve as a good evaluation criterion for other tasks, not limited to depth estimation?