Saturday, March 29, 2014

Scenes: Structure + Content

Kravitz et al., Real-World Scene Representations in High-Level Visual Cortex: It's the Spaces More Than the Places, 2011

37 comments:

  1. By the way -- for those who read / are going to read the Patterson and Hays paper (which I'd encourage), there's a very recently posted IJCV version which does scene classification with the predicted attributes: http://cs.brown.edu/~gen/pub_papers/SUN_Attribute_Database-Patterson_et_al.pdf

    I was very curious about this personally and was very surprised that the original CVPR paper did not do this.

    Replies
    1. David: Thanks for the pointer.

      One point - it seems that when self-similarity, gist, and geometric color histograms are combined with HoG, the performance is similar to that obtained by HoG alone (with a slight improvement as the number of training examples increases). But when these low-level features are combined with the predicted attributes, accuracy increases by ~5% across varying numbers of training examples. I am curious what would happen if only HoG + predicted attributes were used for the scene categorization task. One problem the authors report is the seemingly low dimensionality of the attribute feature space. I am not sure whether this has been tried before, or how well it works, but instead of keeping a single probability score from a particular attribute classifier as a feature, max-pooling could be done so that 5 scores are retained per classifier. Further, instead of just saying how likely this image is to contain attribute 'A', how useful would it be to say that these are the likely positions for attribute 'A'?
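
      To make the max-pooling idea concrete, here is a minimal Python sketch (hypothetical throughout: the region-grid features are assumed precomputed, and attribute_clfs is assumed to be a list of sklearn-style classifiers exposing decision_function):

        import numpy as np

        def pooled_attribute_features(region_feats, attribute_clfs, k=5):
            # region_feats: (n_regions, d) low-level features, one row per
            # cell of a spatial grid over the image.
            feats = []
            for clf in attribute_clfs:
                scores = clf.decision_function(region_feats)  # one score per region
                feats.extend(np.sort(scores)[-k:])  # max-pool: keep the k strongest
            return np.asarray(feats)  # length = n_attributes * k

      Keeping the top-k regional responses per attribute both raises the dimensionality of the attribute feature space and, via the winning regions, gives a rough answer to "where is attribute 'A' likely to be?".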

    2. Interesting catch on HOG doing nearly the best. Two thoughts: (1) their features came from the original SUN paper, so that probably motivated sticking with everything as well; (2) this is averaged over all the attributes: it might be that geometric context is crucial for a small number of attributes (e.g., it contributes ~0.5% overall, but makes or breaks 10 important attributes).

      As for localized attributes, that would be interesting, although certain attributes here are definitely global (e.g., closed / open).

  2. This comment has been removed by the author.

  3. 1. Under the heading 'Representational structure within cortical regions' on page 7325, the third line states: "Given the limited acquisition volume possible at our high-resolution, our scene-selective regions included both TOS and PPA but not RSC." Does this mean that, due to hardware limitations, they were not able to include RSC in their study?

    2. Given that (as of 2014) there are works showing that RSC contains more information about spatial layout (open, closed) than PPA does, how would the conclusions of this paper change?

    Replies
    1. Regarding the first question, I think the high-resolution scans could only cover the outermost layers of the cortex, and the RSC is deeply inset, next to the hippocampus.
      Regarding the latter, I don't see why their argument is diminished at all by that discovery. They do acknowledge that it is a complex, distributed task, and their main claim, that the PPA distinguishes scenes at the macro level on the basis of spatial differences, still holds.

    2. As a note, the authors mention in the 'Selectivity analysis' section that about 76% of the 'scene-processing selective' voxels lay within their predefined regions, which covers the majority. I would also agree that the claim that the PPA discriminates scenes based on spatial information would still hold.

  4. In the discussion, it is mentioned that "low-level representations may be important in supporting quick discriminations of complex stimuli, whereas high-level representations are specialized to support more abstract or specialized actions (e.g., navigation)". I wonder how much the brain's representations change depending on the task, and which parts of the processing are invariant to the task being performed.

    Personally, it feels different when I am looking at a scene passively, vs. when I am looking at a scene with the intent to identify it, vs. when I am trying to navigate. In this paper, it was mentioned that the subjects were watching for changes in the fixation cross. I wonder if this could push the subjects towards perceiving the scenes more passively, so that high-level semantic properties of the scene (which may require more focus to pick out) would be underrepresented.

    Replies
    1. I agree with your comments, but then again, how else could you measure the required property? Fixation seems like the logical thing to do.

    2. Interesting point. Fixation would force a person not to follow the natural saliency that would otherwise drive a scene identification or navigation task. The authors mention under the section on 'Categorization and discrimination in other cortical regions' that cEVC represents the portion of space containing the fixation cross, which impacts results for cEVC. Does that mean they assume this task does not impact results for PPA and pEVC?

    3. I agree with Liu and think that passive viewing is not the right thing to do. Not only in this paper but in the previous papers we read, most of the time the authors ask their subjects to perform passive viewing. But I think the brain's response to a stimulus is very much task dependent. They should conduct experiments assigning various tasks, like finding empty space in the scene and planning a route, or identifying all the objects in the scene, which require different amounts of attention.

    4. Yeah, I think there's something to this argument. Timing must also be a factor, so what we're getting here is people's immediate reaction to each stimulus after viewing it for half a second. I can see why they do this though... The longer you look at something, the more time you have for your mind to wander.

    5. I too agree with Liu. By asking the subjects to pay attention to the changing fixation cross, the study may be forcing them to suppress the neural response to other objects. This may be the reason why PPA and pEVC are not able to capture high-level scene category. Was any part of the brain in this study able to capture high-level scene category?

    6. The second paper, by Park, does find a place that captures high-level categories. I won't spoil the ending; just come to class.

      As for the different tasks, I'd agree that you ideally want to study people doing natural tasks. But in the end, my half-hearted defense of the fixation cross is that if they did anything else, people would (rightfully) complain that their task inherently biased the readings (especially for a negative result, like the lack of high-level content decodability). The fixation cross is not the best, but I think it might be neutral.

    7. I agree. The tasks the authors design for their experiments might themselves be "selective", tending to favor certain types of scene properties. A better approach would be to simulate the different scenarios in which people perceive scenes and see whether there are differences.

  5. I notice that one of the areas studied (PPA) surrounds the hippocampus. Please correct me if I am wrong, but isn't the hippocampus the home to "place cells," that is, cells that fire based on the perceived location of the viewer (Think Brain-based SLAM)? If the scene representation in the PPA is primarily spatial, could the PPA just be part of this SLAM machinery?

    Replies
    1. See RatSLAM; it might be of interest to you.

    2. I thought the hippocampus was involved with memory?

    3. Yes, it's involved with both. I do know about RatSLAM; my point is just that "scene understanding," brain-wise, might concretely mean mostly a 3D-ish understanding, with other aspects like outside/inside or man-made/natural being secondary.

  6. Hm, interesting paper. Kind of blows the Walther paper out of the water. It was also very entertaining to see the ELO rating system used outside of competitive chess.

    I didn't quite follow how they were determining the ELO ranking. They say they ran 10,000 iterations of the ranking (since the order of appearance can influence ratings, which totally makes sense), but I wasn't completely clear on what they were using as a differentiator to rank trial images.

    I also liked how they made sure to remove the retinotopic voxels, and the comparison with the control (chequerboards). I suspect most of the other papers also do that, but it made a lot of sense the way they presented it.

    Finally, I also liked the MDS visualization; I feel like I've seen it somewhere else as well.

    Replies
    1. I believe they used the ELO rankings to define the dimensions along which to put the scenes. They use it to compare behavioral differences with neural responses in the fMRI. For instance, when talking about expanse they mentioned that there's a "strong correlation between ELO ratings and expanse scores in PPA but not in pEVC, suggesting that the pattern of response in PPA more closely reflects behavioral judgments of expanse"

    2. So the ELO rankings work as follows: while they were sitting in the fMRI machine, the subjects were asked to give relative ratings of the scenes as they were presented (for each of the three properties: content, expanse, distance); i.e., they'd see two scenes and say "scene A is more open than scene B". These give a bunch of pairwise comparisons, which they turn into continuous values via the ELO system.

      As a completely unimportant side note, I'm curious whether the ELO process has a unique stationary distribution (and whether instead they could have just randomly ordered the matches ad infinitum rather than randomizing and averaging).
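
      For anyone else who was hazy on the mechanics, here is a minimal Elo sketch in Python (a sketch under standard-chess assumptions: K-factor 32, base rating 1500, 400-point scale; not necessarily the paper's exact settings):

        import random

        def elo_ratings(pairs, items, n_iters=10000, k_factor=32.0, base=1500.0):
            # pairs: (winner, loser) judgments, e.g. ('A', 'B') meaning
            # "scene A was judged more open than scene B".
            totals = dict.fromkeys(items, 0.0)
            for _ in range(n_iters):
                ratings = dict.fromkeys(items, base)
                order = list(pairs)
                random.shuffle(order)  # match order influences Elo ratings
                for winner, loser in order:
                    expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
                    ratings[winner] += k_factor * (1.0 - expected)  # winner gains
                    ratings[loser] -= k_factor * (1.0 - expected)   # loser drops equally
                for item in items:
                    totals[item] += ratings[item]
            return {item: totals[item] / n_iters for item in items}

      The outer loop is presumably what the paper's 10,000 iterations are doing: a single pass depends on the order of the matches, but the average over random orders converges to a stable continuous score per scene.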

  7. The paper was interesting and easy to follow. I liked that they explained in detail (in the Discussion) the bias in the study design of previous studies, which had shown, contrary to their findings, that PPA distinguishes between high-level perceptual categories of scenes. Based on the findings of this paper, it would be interesting to ask where scene identification primarily occurs in the brain. Is there strong evidence to suggest it occurs elsewhere? Or is it not a region-specific task, but rather the inheritance of features along the ventral stream that ultimately enables scene understanding? The authors claim, based on the weaker categorization by relative distance in PPA, that the PPA inherits aspects of scene categorization from pEVC.

    Replies
    1. For this paper, the authors knew on the basis of previous studies that PPA distinguishes between high-level perceptual categories of scenes, so they could do a rigorous ROI study concentrating on PPA. But I don't think these experiments can be extended that easily to the whole brain until we have some prior knowledge.

    2. The second paper presented in class, by Park and Oliva, was an interesting clue for my question: it proposes that content is mainly handled by the LOC and structure by the PPA, sort of breaking down the scene understanding task across these two regions of the brain.

  8. This comment has been removed by the author.

  9. I'm still a bit hazy on the ventral/dorsal stream separation and would be glad for some insight and corrections on 2 things related to the paper:
    1. We have 1024x768 highly detailed images vs. the line drawings in some papers. Since there is depth and a *where* question to be answered, do we expect the dorsal stream to light up too on fMRI for the scenes used?
    2. Does the paper blow a big hole in the two-stream theory by saying the PPA does spatial processing, which was supposed to be the domain of the dorsal stream? Please correct me if I'm wrong here :)

    Replies
    1. Spatial information, when represented as open/closed or near/far, is still categorical rather than precise 3D structure, so I feel that the two-stream theory does not contradict this result.
      It's been a very long time since I read these papers, so please correct me if I am wrong.

    2. Yes -- open/closed is a binary category that's assigned to a scene class (e.g., mountains). Near/far is a binary category that's assigned to a scene instance (e.g., a close-up of a rock on Mount Rainier vs. Mount Rainier in the distance).

      Treating structure as a class does bug me a bit as someone who works on more fine-grained structure, but it actually might get you most of the way for just the gist of the scene. There's a really neat paper which tries to get the depth of a scene in a video by arguing that TV has on the order of 12 classes of depths. For those interested:
      http://staff.science.uva.nl/~vnedovic/pubs/PAMI2010.pdf

  10. I'm still left thinking that this and previous papers, which claim that brain area X is responsible for Y, may be misleading. For example, their data shows that PPA has selectivity for expansive vs. enclosed spaces, but couldn't you also use this data to claim that PPA is characterized by parts which detect ceilings and walls, and other parts which detect horizons?

    Replies
    1. Exactly. I was thinking the same thing. It just might be the case that PPA is discriminative towards some mid-level features of the indoors and the outdoors. I think in the end it is clear that it discriminates between indoors and outdoors, but how exactly (representation-wise) is not clear.

    2. This is partially addressed in paragraph 5 of the discussion.
      I think that if it detected parts, the scene category could still be decoded when they controlled for spatial expanse.
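
      For concreteness, here is a hypothetical sketch of what "controlling for spatial expanse" could look like as a decoding analysis (the inputs are made up: voxels is a trials-by-voxels response matrix, category and expanse are per-trial label arrays):

        import numpy as np
        from sklearn.model_selection import cross_val_score
        from sklearn.svm import LinearSVC

        def category_decoding_within_expanse(voxels, category, expanse):
            # Decode scene category separately within each expanse level, so
            # expanse itself cannot carry the category signal.
            accuracies = {}
            for level in np.unique(expanse):
                mask = expanse == level
                accuracies[level] = cross_val_score(
                    LinearSVC(), voxels[mask], category[mask], cv=5).mean()
            return accuracies

      If PPA voxels were detecting category-diagnostic parts (ceilings, walls, horizons), category should still decode above chance within a fixed expanse level; near-chance accuracy argues against a part-based category code.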

    3. Very interesting points...

      Aaron: I agree that the soundbite take-away message is not always accurate and that the actual message (part X has the information to do Y) is much more subtle (and it doesn't mean X does Y!). The Park paper did a good job of trying to provide stronger evidence. They asked the devil's advocate question: "OK, I can decode Y in part X. Is this just because I have a trained decoder that's just using signals from low-level statistics of the scene?"

      Ishan: I'd be very curious what would happen if they looked at the indoor/outdoor dichotomy more closely. There are definitely closed spaces that aren't indoors (such as a canyon), and ones where it's arguable (a forest canopy), but open/closed is definitely correlated with indoors/outdoors.

      In the end though, if caves count as indoors, then the indoor / outdoor distinction is the same as the closed / open distinction, I think.

    4. I agree with Aaron's point. I think it's very hard to make a very strong statement, like "PPA is only responsible for the spatial factor." Many properties of an image are correlated with each other (they are not orthogonal). For example, the expanse factor (open, closed) is heavily confounded with indoor/outdoor and even with category (e.g., a mountain is always open), so by changing the target Y we can get a different conclusion.

  11. To incorporate some of the findings of this paper, we could change the vision pipeline to do the following:
    1. First evaluate whether a scene is open/closed and near/far (relative distance).
    2. Based on the result of 1, pick one of four pretrained models (trained for these specific situations) and use this model instead of a model that operates in all conditions.
    Can we get away with searching a smaller range of scales based on the relative distance? (See the sketch below.)
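
    A minimal sketch of this two-stage dispatch (all models here are hypothetical placeholders, e.g. pretrained sklearn-style classifiers):

      def classify_scene(feats, gate_open, gate_near, experts):
          # Stage 1: coarse spatial gates - open/closed and near/far.
          is_open = bool(gate_open.predict([feats])[0])
          is_near = bool(gate_near.predict([feats])[0])
          # Stage 2: dispatch to the model trained only for this regime.
          return experts[(is_open, is_near)].predict([feats])[0]

    This is essentially a hard mixture of experts, with the open/closed and near/far gates playing the role the paper assigns to the PPA's spatial code; the near/far gate in particular could restrict the range of scales a downstream detector has to search.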

    Replies
    1. That's a good point! In Patterson's paper, they actually found that there are different types of attributes (i.e., spatial, material, function, etc.). However, even though they noticed that, they treated the different kinds of attributes in the same way (e.g., training classifiers in the same framework). Based on the brain studies, maybe it makes sense to start treating them differently.

  12. These experiments are conducted when the authors already have some assumptions or priors (and probably biases) about which regions of the brain to analyze, like PPA. So the experiments are usually, I think, not easy to generalize to other regions of the brain. Moreover, the conclusions of these experiments are also limited: there might be other explanations for the function of PPA besides expanse (open, closed), and there might be other regions responsible for the semantics, rather than the spatial factors, of scene understanding. This means that when we look at these problems, in most cases we are forced into a local perspective rather than a global one.
