Paper for reading: Harel et al., 2013, Deconstructing Visual Scenes in Cortex: Gradients of Object and Spatial Layout Information. Cerebral Cortex. http://graphics.cs.cmu.edu/courses/16-899A/2014_spring/thevisualworld/HarelBaker2013.pdf
It is counterintuitive that knowing the identity, or absence, of an object gives more information than the activation values alone. For example, in Figure 2, the initial experiment compares neural activation and also shows absence and identity decoding. Overall, the message is that PPA has object information while RSC has little. How does this experiment shed light on what we already know?
The finding that PPA carries object information, as opposed to just scene information (which is all we knew so far), is an important result for me. But as David points out in the comment below, given that the scenes were so simplistic, we have to take it with a grain of salt.
The second paper [Stansbury 2013] is also a good read.
I really liked the second paper. The authors hand-labeled the objects in the images and then represented each image in an object feature space. They then used LDA to find the prominent objects for each scene class and the probability of each image belonging to each scene class. After this, they showed that we can learn weight vectors relating an image's scene class probabilities to its voxel-wise responses. I am wondering why we represent an image in terms of scene class probabilities when we could instead represent it in the object feature space, which is more accurate because it is hand-labeled. Decoding the voxel responses with the learned weight vectors would then give us the objects present in the scene, and using LDA we could infer the scene class label for the image.
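To make the last step of that proposal concrete, here is a minimal sketch of going from decoded object labels to a scene class. It is not the method from the paper: it assumes each image comes from a single scene class (a mixture-of-multinomials simplification of LDA), and the table `OBJECT_GIVEN_SCENE`, its numbers, and the function `scene_posterior` are all made up for illustration.

```python
import math

# Hypothetical P(object | scene) table, standing in for the per-class object
# distributions a fitted LDA model would provide. The values are invented.
OBJECT_GIVEN_SCENE = {
    "bedroom": {"bed": 0.60, "lamp": 0.30, "desk": 0.10},
    "office": {"bed": 0.05, "lamp": 0.25, "desk": 0.70},
}


def scene_posterior(object_counts, object_given_scene, prior=None):
    """Return P(scene | objects), assuming one scene class per image."""
    scenes = list(object_given_scene)
    if prior is None:
        # Assumption: uniform prior over scene classes.
        prior = {s: 1.0 / len(scenes) for s in scenes}

    # Accumulate log prior plus count-weighted log likelihood for each scene.
    log_scores = {}
    for s in scenes:
        score = math.log(prior[s])
        for obj, count in object_counts.items():
            score += count * math.log(object_given_scene[s][obj])
        log_scores[s] = score

    # Normalise in log space for numerical stability.
    top = max(log_scores.values())
    unnorm = {s: math.exp(v - top) for s, v in log_scores.items()}
    total = sum(unnorm.values())
    return {s: v / total for s, v in unnorm.items()}


# Objects decoded from the voxel responses (here simply given by hand).
posterior = scene_posterior({"bed": 1, "lamp": 1}, OBJECT_GIVEN_SCENE)
best_scene = max(posterior, key=posterior.get)
```

Under this simplification, the scene label is a deterministic function of the object vector, so decoding objects first loses nothing compared with decoding scene probabilities directly. The open question is whether the voxel responses predict the higher-dimensional object vector as well as they predict the compressed scene probabilities.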
My first reaction is that I'm torn by this paper. In the end, I'm glad that I read it though. Part of me loves that they try to make it as simple as possible to eliminate as many confounding factors as possible. This is really great and I think important to do. However, part of me just has to say that I'm fundamentally skeptical of their synthetic data. I believe that overly simplistic scenes can be misleading about computer vision, and that you can learn a lot of wrong things about the world with them. For instance, if you have scenes with only a few objects, you might think that it's important to label all the objects (i.e., identify all the chairs). However, in the real world, if you see a scene with tons of chairs, it might be ok to get away with just "chairness" representations (i.e., there are some chairs here and I'll never be bothered to count them, only maybe identify and segment out one nearby chair that I could sit on). This doesn't mean that I don't buy any conclusion of the paper, but I just think that everything really needs to be taken with a large grain of salt.
By this, I mean that their simplistic synthetic scenes might be processed in a way that's different from natural scenes. I think this is an interesting topic for discussion (I'm skeptical, not dismissive), and I'd be interested in others' opinions on this.
To add to your second point, the simplicity of scenes might have some role to play in PPA processing objects in these experiments...
The authors also mention in the 'Discussion' section that the objects they use have navigational affordances, which provide diagnostic details about the spatial environment. So ultimately PPA may be sensitive to spatial information alone, arriving from different sources. Other objects such as manipulable tools or complex backgrounds may not lead to similar results.
I am similarly skeptical about the synthetic images. They used gray backgrounds but color objects, and it is easy to imagine how this might not be a good idea, particularly in the case of "open" scenes.
To add to the skepticism, their take-away message is much too broad for the experiment they actually did. I'm particularly peeved by the idea that object information is represented by invariance to backgrounds. Simple low-level features could get you the same thing.
Keeping with the theme of simmering discontentment in this thread: in the Discussion section, when they describe their general approach, they claim that they use their formulated framework to "inform the design of complementary studies that use both *naturalistic* and artificial scenes to understand the nature of information being represented in the scene network." Are they referring to the monkey paper? Isn't that too big a jump? Or am I just parsing this incorrectly?
I think we need results from both kinds of data, synthetic and natural. The authors argue, in the introduction, that naturalistic photographic stimuli have led to conflicting results. In such a situation, a synthetic-data experiment seems very useful. I also feel that the brain areas discussed here are not so abstract as to have different pathways for synthetic vs. real scenes, but that's just a guess.
I agree that synthetic data has its flaws, but don't you think it's important to understand which changes in a scene matter the most? By making the data as simple as possible, they can see the effect of the slightest change (desk vs bed). I think this is more difficult with messy real world data.
I agree with their philosophy of disentangling and controlling different factors by using simplified synthetic data, given how complex the issue is. However, I think conclusions drawn from synthetic data alone are not guaranteed to hold. Simply combining synthetic and natural data, i.e., repeating the experiments on both, is also not enough. A better approach is to evolve the data gradually. For example, start from simplified synthetic data, and then gradually add other objects, backgrounds and geometric layout to make the scene more complex. Alternatively, start from a complex scene image, and then gradually discard objects and backgrounds to simplify the scene. In both scenarios, test whether and how the responses of the corresponding brain areas change.
I personally like the use of synthetic data. It's important not to extrapolate the conclusions too far, but there is something to separating out the important things we want to test for... removing clutter if you will. It might have been more compelling though if they simply constructed real physical scenes that were very simple and photographed them, rather than using CG models. Honestly, if you're going through all the trouble of running fMRI on 20 patients (at hundreds of dollars an hour), you'd think they could have found a bare room and an old couch somewhere and simply taken a bunch of photos. Also, I agree that they probably should have done more comparisons against complex scenes.
This paper seems to reinforce the way vision is perceived in modern times: the object recognition sub-field and the geometry (layout, single-image 3D) sub-field do not have much cross talk. I am highly disturbed by the fact that they use furniture objects (which themselves give strong cues of box-like 3D information) and nothing else (such as more 2D/flat objects). The fact that we are using such objects might be a reason why the PPA gives a really high response for objects. It might just be latching on to the "perspectiveness" of these objects.
On the other hand, these big objects also make the case for the LOC results stronger.
The authors mention that they are using furniture because it can be realistically embedded in a spatial environment. But many flat objects, such as a carpet or a portrait, can also be easily embedded in a spatial environment and could have been used in their experiment.
If PPA is able to discriminate between semi-fine-grained furniture categories, then it's definitely not latching onto simple cues like perspective.
I'm actually not sure how much the PPA is able to decode. Last class I defended this correlation approach as being better than an SVM. But one problem I now see is that I really don't know what 0.02 decoding here means. Is this strong decoding? Yes, yes, it's statistically significant (i.e., there's something going on), but the question I have is whether this can be explained by other stuff going on in the brain that's not really "semantic-object" tuned. Basically: is this decoding accuracy good enough that it's not just low-level statistics or some other stuff floating around the PPA?
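To give a feel for why a raw number like 0.02 is hard to interpret, here is a small simulation sketch. It assumes the decoding index is the mean within-condition minus the mean between-condition split-half correlation of voxel patterns, which is my reading of the correlation approach rather than something confirmed in this thread. The function names, the two conditions, and the noise levels are all invented for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def decoding_index(half_a, half_b):
    """Mean within-condition minus mean between-condition correlation.

    half_a, half_b: dicts mapping condition -> voxel pattern, estimated
    from two independent halves of the data.
    """
    conditions = list(half_a)
    within = [pearson(half_a[c], half_b[c]) for c in conditions]
    between = [
        pearson(half_a[c], half_b[d])
        for c in conditions
        for d in conditions
        if c != d
    ]
    return sum(within) / len(within) - sum(between) / len(between)


def simulate(noise_sd, n_voxels=200, seed=0):
    """Decoding index for two synthetic conditions at a given noise level."""
    rng = random.Random(seed)
    # Each condition has a fixed "true" pattern; each half adds fresh noise.
    signal = {c: [rng.gauss(0, 1) for _ in range(n_voxels)] for c in ("bed", "desk")}

    def noisy(pattern):
        return [v + rng.gauss(0, noise_sd) for v in pattern]

    half_a = {c: noisy(p) for c, p in signal.items()}
    half_b = {c: noisy(p) for c, p in signal.items()}
    return decoding_index(half_a, half_b)


low_noise_index = simulate(noise_sd=0.5)
high_noise_index = simulate(noise_sd=5.0)
```

The underlying patterns are equally distinct in both runs; only the measurement noise differs, yet the index shrinks a lot. So under this definition, the size of the index reflects signal-to-noise as much as how strongly a region is tuned, and it says nothing by itself about whether the tuning is semantic or low-level.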
I liked the way the authors showed that LOC is responsible for processing objects, RSC for scene layout, and PPA for both. At the end they also showed that RSC is correlated with PPA, which in turn is correlated with LOC. Is this correlation enough to imply an actual physical connection between these regions, or an exchange of information between them?
Interestingly, the PPA has also been shown to capture global statistics and textures. Following a discussion from the last class on dividing the scene recognition task into structure, content and style, would it be fair to say then that, according to this paper, PPA is sensitive to all three (structure, content, style)? It seems to suggest that PPA integrates information inherited from different regions of the brain to culminate the process of scene understanding.
This paper investigated synthetic scenes with one object, and it seems to suggest that there are 2 streams of processing, one for spatial properties and the other for object properties. The relationship between different objects in the scene seems to be important. Object properties may be used as cues to spatial properties. For example, we might rely on the position of objects on a smooth surface to deduce where that surface is. It would be interesting to see where and how the 2 streams of processing merge to form a consistent representation.
Reading this paper, I was immediately reminded of computer vision's GIST feature (http://cvcl.mit.edu/papers/IJCV01-Oliva-Torralba.pdf). PPA does 'what and where' processing, and maybe this gives the inspiration for a general scene understanding feature like GIST? RSC is more related to the 'where' aspect, which might be useful for actual robotic applications involving fast motion through environments (read: replicating driving or flying on robots).
Taking inspiration from PPA's use of both what and where: maybe in computer vision we can use cues from object recognition to do spatial understanding, e.g., identifying part of an image as sky means it's always far away. This reminds me of a CVPR paper that used semantic labels to inform depth prediction (http://users.cecs.anu.edu.au/~sgould/papers/cvpr10-depth.pdf).
I wonder if there have been attempts in vision where people have applied different methods to the same data and combined them to enrich the overall semantics: for example, one analysis for high-level recognition, another for fine-grained analysis, and a final step that brings them together into a consolidated result. Can anyone point to such experiments, if any exist?
I think it's important to remember that biological vision is a means to an end. Organisms want to know where they are, find food, etc. PPA, LOC, and RSC may all be working toward different cognitive ends. I agree that the data is contrived, but it may make sense for PPA to be sensitive to objects as well if it is indeed part of a larger pathway that aids in navigation. It may be a way of noting "landmarks": I know that this scene (or I) was at place X and it had an object Y.
The paper emphasizes the connections from LOC and RSC into PPA and how PPA receives information from several brain regions (hippocampus, V4, etc.). What does PPA output to? I'm interested in this question because then we may hypothesize what the brain does with the scene information.
I still do not have a clear picture of the role of the corresponding brain regions in scene understanding, especially PPA. The papers we have read use different experimental setups and datasets, and hence sometimes reach conflicting conclusions. I think this is probably because of the different methodologies of computer vision and neuroscience. In computer vision, we are accustomed to supporting our assumptions with extensive experiments on different datasets and even on different vision tasks. Without doing so, our paper would have a high chance of being rejected. Neuroscience papers, meanwhile, probably partly because comprehensive experiments are hard to conduct, usually simplify things, for example with simplified datasets and simplified parameter tuning procedures. Hence, vision people are usually skeptical of the conclusions.
I would be curious to know people's opinions on what this all means for the visual process in general. I.e., if we forget about PPA/RSC/LOC for a minute and simply make the claim: "Spatial information about scenes (open/closed) is processed in one area of the brain, object identity in another area and object pose in another area", what does that actually tell us about the visual problem? What would be different if they were all processed as part of one big block of functionality? To me it makes obvious sense that they would be separate: the individual pieces of data seem to be relatively uncorrelated and seem to require different processes to evaluate, so it would make sense to have dedicated regions. Does anyone have any other thoughts on what we might get from this information?
One thing that struck me after class was that the PPA was shown to be more correlated with LOC than with RSC, which goes against the hypothesis that PPA is basically latching on to the 3D structure or spatial information from the furniture objects. PPA is encoding the object content as well. But again, all this depends on how fair it is to use the correlation metric for comparing brain regions, how long the images take to process, etc.
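On the question of how fair the correlation metric is, it may help to spell out one way a region-to-region correlation can be computed. The sketch below assumes a second-order comparison: build each region's condition-by-condition similarity structure, then correlate those structures across regions. Whether this matches the paper's exact analysis is an assumption on my part, and the function names and toy patterns are invented.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def similarity_vector(patterns):
    """Pairwise correlations between all condition patterns of one region."""
    conditions = sorted(patterns)
    return [pearson(patterns[a], patterns[b]) for a, b in combinations(conditions, 2)]


def region_similarity(region_x, region_y):
    """Correlate the similarity structures of two regions.

    The regions may have different numbers of voxels, because only their
    condition-by-condition similarity vectors are compared.
    """
    return pearson(similarity_vector(region_x), similarity_vector(region_y))


# Toy region with 4 conditions and 4 voxels (values are made up).
region_x = {
    "c1": [1, 2, 3, 4],
    "c2": [1, 2, 4, 3],
    "c3": [4, 3, 2, 1],
    "c4": [2, 1, 4, 3],
}
# A second toy region with 8 voxels: rescaled and duplicated patterns, so its
# voxel values differ but its similarity structure is the same.
region_y = {c: [2 * v + 5 for v in p] * 2 for c, p in region_x.items()}

structure_match = region_similarity(region_x, region_y)
```

A high value here only says that two regions distinguish the conditions in a similar way. It does not show a physical connection or information flow, and two regions driven by the same low-level stimulus properties would also score high.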