Monday, February 17, 2014

First Read, Feb 19

[Gallant] Huth et al., A Continuous Semantic Space Describes the Representation of Thousands of Object and Action Categories across the Human Brain, Neuron 2012

36 comments:

  1. Also, check out the brain viewer on Gallant's Website:
    http://gallantlab.org/brainviewer/huthetal2012/

    Replies
    1. I have some very basic questions. Do visual categories always match linguistic categories? For example, when we read or hear a word belonging to a semantic category, are the neurons sensitive to that category related to the ones that respond when we see a picture of it?
      If not, and we think about this linguistically, is the WordNet hierarchy valid (possibly in another part of the brain)?
      This paper shows the relationships between several sub-categories, and a nice hierarchical relationship, but it seems to me that it is not exactly the same as the WordNet hierarchy.
      Is that due to the hypothesis used in the design? Or does the semantic categorisation done by hand not exactly match the organisation in the brain?

    2. This is the problem I mentioned in class and hope to bring up again during the presentation. Because they use the WordNet hierarchy to propagate labels in the labeling matrix, the results are bound to show a similar structure in the brain. So the experiment is designed with a hypothesis in mind, and the data is completely corrupted by that hypothesis.
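
To make the concern concrete, here is a minimal sketch of how propagating labels up a hypernym tree builds shared structure into the label matrix. The tiny hierarchy, the function names, and the overlap measure are all hypothetical stand-ins for illustration; this is not the authors' code and not the real WordNet graph.

```python
# Toy hypernym map (child -> parent). Hypothetical stand-in for WordNet.
HYPERNYM = {
    "lion": "feline",
    "feline": "carnivore",
    "dog": "carnivore",
    "carnivore": "animal",
    "car": "vehicle",
}


def with_hypernyms(labels):
    """Return the labels plus every ancestor reachable through HYPERNYM."""
    expanded = set()
    for label in labels:
        node = label
        # Walk up the tree until the root or an already-visited node.
        while node is not None and node not in expanded:
            expanded.add(node)
            node = HYPERNYM.get(node)
    return expanded


def jaccard(a, b):
    """Overlap between two label sets: |a & b| / |a | b|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Raw labels for two scenes share nothing...
raw_overlap = jaccard(["lion"], ["dog"])
# ...but after propagation they share "carnivore" and "animal".
propagated_overlap = jaccard(with_hypernyms(["lion"]), with_hypernyms(["dog"]))
print(raw_overlap, propagated_overlap)
```

In this toy case the two scenes go from no shared labels to a substantial overlap purely because of the hierarchy, which is the kind of built-in similarity the comment is worried about. Whether this dominates the paper's actual results is a separate question.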

    3. I'm also curious about the impact of WordNet on the data. Although they say that they get only 341 higher-order categories via WordNet, I'm curious what the results would look like without the hierarchy and the higher-order categories (i.e., with only the subordinate categories). The authors argue towards the end (page 11) that the WordNet bias only affects rare categories (since subordinate categories that occur frequently seem able to differ from their superordinate categories). However, their results, apart from figures like Figure 4, are computed in aggregate over all the data. I'm not sure what the distribution is, but it seems plausible that the data is dominated by rare categories (which the authors admit are probably biased by WordNet), and so the few "unbiased" categories might contribute fairly little to the overall conclusions.

      Nonetheless, the authors do claim on page 12 that they got qualitatively similar results using the labeled model. I'm not quite sure what "qualitatively similar" means. I wish they had shown this data, so that we didn't have to speculate.

      As two side notes: (1) Figure 4 is very pretty and I think it does show something, but it's been laid out via the WordNet tree rather than something else. I'm curious what a standard graph embedding approach (Isomap?) would do if applied to the principal components. (2) I'm not a statistician and my data analysis skills aren't great, but a p-value of less than 10^{-34} makes me feel a bit nervous: the null hypothesis was PCs of the category matrix. I can't think of anything better though.

    4. This comment has been removed by the author.

    5. I agree that it's hard to know how abstract this semantic representation really is. Is this simply a map for visual semantics/memory? How would the data look if we used linguistic or even tactile input instead? Is there a different representation if we simply asked the subjects to think of a given category? I also agree that the WordNet hierarchy is not very helpful; why should that predetermined hierarchy be the same one used in the brain?

    6. I would guess that hearing the word "Lion" and seeing a lion will cause some similar neurons to fire somewhere (after all, my conscious mind makes the connection between the two somehow, so there must be a neural mechanism for that). I would also guess though that seeing a Lion would also fire a bunch of other things related to what it takes to visually process it, localize it, etc. The phrase "a picture is worth a thousand words" comes to mind. A single word or phrase is very abstract and gives you little context in isolation. A picture (or visual experience of the actual thing) gives you so much more that you may not even be initially aware of (pose of the lion, how big it is compared to other lions you've seen, etc).

  2. BOLD responses from the entire brain are used in the calculation of the category model weights, which are then used to find the semantic space. Doesn't this open up the possibility of different amounts of processing (and other things) being mistaken as evidence for the existence of a continuous space?

    For example, for the PC representing human vs non-human, could a face evoke a stronger response than some other body part simply because it takes the brain more processing to recognize a face, and not because a face is "more human-like" than a hand?

    Hopefully this question makes sense...

    Replies
    1. But since natural movies/dynamic scenes are being used, wouldn't the subject be able to view a particular scene or object for quite some time? Wouldn't the effects of differences in processing time then average out over long periods of looking at the same scene?

    2. This is definitely interesting and something I did not think of. The contextual effects in this experiment seem strong. While Akanksha does have a point about averaging, I think it would not hold because this is biased averaging (if it were a continuous stream of random images, averaging should be OK). Does anyone else have thoughts on this?

    3. This comment has been removed by the author.

    4. Does the brain really represent objects in a continuous semantic space, rather than having individual neurons coding for individual objects? We studied earlier how a human subject has a specific neuron that fires for Jennifer Aniston. This paper studies BOLD responses in voxels that are 2x2x4 mm, a volume that potentially contains many thousands of neurons. As such, I'm not convinced that we don't have neurons specialized for specific objects. I would like to see a study that tests the homogeneity of a voxel by breaking down the response of every neuron in it.

    5. I think the BOLD response is already very slow. The brain processes stimuli much faster, and I feel the time difference in processing should not affect the study significantly.

      Also, the other effects are in my opinion a part of the representation. It might be that the exact time delay between the onset of a stimulus and the first neuron firing in a particular area encodes its semantics.

  3. Most of the categories chosen in the paper are very different in concept, so it is ambiguous whether the semantic space represents visual features or conceptual features. Would it help to have categories that are similar in concept but visually very different, e.g., different models of cars? Also, what if we used other methods to represent the concept, e.g., using the sounds of a human being and an animal (non-human) to represent them, and then compared their BOLD responses?

    Replies
    1. The underlying question seems related to object representation. Can this be interpreted as a co-occurrence of object and context?

    2. Yes, because I find it very difficult to separate an object from its context. In my view, we always visualize the object from the context. For example, if we hear the sound of a lion, we will visualize the picture of a lion in our mind.

    3. The authors do acknowledge this in the last paragraph. They admit that the semantic space they report might actually be a mixture of both the visual and conceptual features.

    4. I also saw that. But I was curious to know whether there is any reliable method to remove this ambiguity, because in my view it is very difficult to separate visual and conceptual features.

    5. To address this concern, the authors could run an experiment with different senses and extract the common components from the PCs. In other words, they would find the voxels commonly activated in both a vision-based study and an audio-based study; the common part might be conceptual, with a slight mix of visual and auditory features originating from our consistent imagination of the other modality.

  4. The assumption would be that the natural movies are representative of the real-world occurrences of objects and actions. Considering that the PCs (principal components) are computed over the entire brain and over all the movie frames, would the extracted PCs change significantly if the content of the movies had more occurrences of certain objects or actions compared to others? Maybe Figure 6 would give a different interpretation of the PCs then.

    Replies
    1. I too feel that movies are by design not the real world. The reaction of a person to a real-world incident is very different from their reaction to the same thing shown in a movie (because they know it's a movie and the actor never really died). But this concern is valid for any controlled study where stimuli are shown on a screen. Maybe we have a very different neural response to real-world events as opposed to a synthetic or artificial display.

    2. Although I definitely believe that art and real life are interpreted somewhat differently, it seems likely that the brain would use very similar regions for parsing parts of the scene. So even if one's higher-level understanding and actions are different, to me it seems reasonable to assume that the same parts of your brain that recognize lions will do it whether you're in a zoo, in the wild, or watching TV.

      For the curious: Gibson had a long running debate with an art historian, Gombrich, about stuff like this (how art and real life are perceived differently): http://gombrich.co.uk/gombrichgibson-dispute/

  5. I just have a concern about their experimental setup, which is illustrated in Figure 1. According to the paper, “The response of a voxel to a particular scene is predicted as the sum of the weights for all categories in that scene”, so it seems they assumed people can observe all the labeled objects/actions in a frame/shot. However, that might not always be the case. Take the top-left frames in Figure 1 as an example. At a simple glance, I can quickly identify two objects in the frame: a woman and a dog. But it took me longer to identify the spoon in the woman’s hand. Given that some shots can be pretty short, people may simply overlook objects in a frame. This is related to the question of how they annotated the objects. Would it make more sense to annotate only the salient objects/actions in a frame? You might say that “saliency” is also subjective: objects salient to me may not be salient to you. I wonder whether we could ask the volunteers to identify what they have seen in the video (e.g. by providing them with some choices). Of course, in this case, shorter videos would need to be used, because 2 hours is too long to remember everything.
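
As a minimal sketch of the quoted sentence, assuming a single voxel with made-up category weights (the names and numbers below are hypothetical, not taken from the paper), the predicted response is just a sum over whichever categories are labeled in the scene. Dropping a label the viewer may not have noticed changes the prediction by exactly that category's weight.

```python
def predict_response(weights, scene_categories):
    """Predicted voxel response: sum of the weights of the labeled categories.

    Categories without a fitted weight contribute nothing.
    """
    return sum(weights.get(category, 0.0) for category in scene_categories)


# Hypothetical weights for one voxel, chosen only for illustration.
voxel_weights = {"woman": 0.5, "dog": 0.25, "spoon": 0.125}

# Every object the annotator tagged in the frame.
all_labels = ["woman", "dog", "spoon"]
# Only the objects a viewer might notice at a glance.
noticed_labels = ["woman", "dog"]

full_prediction = predict_response(voxel_weights, all_labels)
noticed_prediction = predict_response(voxel_weights, noticed_labels)
print(full_prediction, noticed_prediction, full_prediction - noticed_prediction)
```

Under this additive form, a mislabeled or unnoticed object shifts the prediction linearly, so how much the concern matters in practice would depend on how large the weights of the easily overlooked categories are.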

    Replies
    1. Fair point. I had a similar doubt, and they mention this on Page 12, where they say that the number of labels per second varied between 1 and 14, with an average of 4.2 per second. Categories were tagged if they appeared for at least 0.5 seconds.

      Given that we are attracted to stimuli with the highest stimulus energy (which I think makes sense from an evolutionary point of view), I think the brain prioritizes processing certain categories first. Now it is definitely interesting to know whether the brain actually does process all categories, but only filters in the more 'essential' categories that we consciously think about - which is what I think you mean by saliency being subjective. To take this idea further, I would be very interested to see such experiments on people that are hypersensitive/have low latent inhibition. My hypothesis is that since they will consciously observe (and think about) a lot more data, it will improve the quality of the correlations that can be obtained.

    2. One way to handle that is to show the labeler the same information and then ask them to label, so if you did not see the spoon, the labeler will not see it either.

  6. The paper is neatly organized and is gentle on jargon.

    I had one issue (which I have raised in class as well) with the way they process the BOLD signal. The second paragraph of the paper outlines that contiguous spaces in the brain would reflect semantic similarity. However, throughout the paper the authors process each voxel independently without ensuring any spatial consistency or any correlation between their signals.

    Also, is anyone else troubled by the fact that the group PCs explain only 19% of the variance? Isn't that too low? Is it a brain registration artifact? Or is it just that fMRI is this noisy?

    And I agree with David on Fig 4. Overlaying their structure on the WordNet graph is one way, but a more convincing way would be to show a lower-dimensional embedding of their PCs (e.g. t-SNE: http://homepage.tudelft.nl/19j49/t-SNE_files/mnist_large.jpg)

    Replies
    1. Fig 5 shows a 2-dimensional version of their embedding.

  7. A concern about the experimental subjects: 5 males aged 25-32 seems to be a very narrow selection. Could this perhaps explain the large coefficients for car on a principal component? Would these men be more interested in cars and have a “semantic space” tuned for car recognition? Would a creative artist's brain be too jumbled to conform to the semantic space hypothesis?
    --Gaurav Singh

    Replies
    1. Interesting catch. Additionally, the subjects also included 3 of the authors, who probably share many similarities. I was very impressed with the results with all the data pooled -- I wouldn't have really expected it to work well. But now I'm a bit disappointed since I had assumed the people were at least somewhat diverse. Fortunately though, at least not all of them are American.

  8. The paper is well organized and I loved the visualizations.

    I agree with the discussions above that imposing the WordNet hierarchy kind of muddles the story. If they had just limited themselves to leaf nodes (as David pointed out), it would have been much stronger for me.

    With respect to the number of PCs being used, I agree with Ishan that 19% might be too low. One good way to check that might have been to visualize other group PCs (say PCs 5-9 or 10-14) as semantic spaces (Figs. 4 and 6). If they were truly noisy and not capturing much information, we wouldn't see any patterns emerging (just a thought). And if they had released their data (the label and fMRI matrices), we could have just tested it ourselves!
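
    For reference, "explaining X% of the variance" is just a ratio of eigenvalues, and the same ratio can be computed for any later block of PCs. A minimal sketch, assuming we had the eigenvalue spectrum of the PCA (the numbers below are a made-up toy spectrum, not the paper's):

```python
def variance_fraction(eigenvalues, start, stop):
    """Fraction of total variance carried by components start..stop-1.

    Components are 0-indexed after sorting eigenvalues in descending order.
    """
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    return sum(ordered[start:stop]) / total


# Hypothetical eigenvalue spectrum, chosen only for illustration.
toy_eigenvalues = [4.0, 3.0, 2.0, 1.0, 0.5, 0.5, 0.5, 0.5]

leading_block = variance_fraction(toy_eigenvalues, 0, 4)  # first four PCs
next_block = variance_fraction(toy_eigenvalues, 4, 8)     # following four PCs
print(leading_block, next_block)
```

    Comparing the two blocks this way only says how much variance each carries; whether the later PCs show interpretable semantic structure would still need the kind of visualization suggested above.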

    @Ishan: They do address the smoothness of cortical maps and deal with nearby voxels on page 1222, in "Smoothness of Cortical Maps under Group Semantic Space". I would have loved to see t-SNE as well. So far, I find it the best 2D embedding visualization!

  9. The paper was easy to follow and used visualizations in a very informative way. I was particularly intrigued by Fig 4, although I would have liked to see it laid out so that the PCs, rather than the WordNet graph structure, decided the spatial configuration. Still, it is cool to see a greenish tinge on the 'body parts' nodes, a color that is predominantly ascribed to the 'person'-related categories. It kind of wows me that such relations were in fact found.
    That said, I do think the use of the WordNet hierarchy to assign semantic information to each clip seems a bit dicey.
    Another thought that comes to mind: since natural movies were used, do these reactions say more about the particular objects, actions, etc. shown, or are they dominated by the observed juxtaposition of these entities? If seen together often enough, entities lose their individuality in a person's mind and instead become a joined blob that is described more by their mutual relation than by individual characteristics. So any deviation from, or adherence to, these "used to" visuals will result in reactions which (I think) describe the natural depiction of these entities together rather than each one individually.
    I slightly confused myself, but I hope this makes sense somehow.

  10. This paper argues that the human brain uses a spatially continuous semantic space across the cortex to represent semantically similar categories. However, I think there is some bias here that makes the conclusions less than airtight. That is, they do not exclude other possible interpretations of the experimental phenomena. First, the input data they used are several natural movies, which impose quite strong contextual constraints in both time and space. This probably also leads to strong "memory semantics" rather than "visual semantics". If random image sequences were used, it might be more convincing. Second, the use of the WordNet hierarchy also imposes some bias on the categories, as mentioned by David. Third, I am not very familiar with fMRI and am not sure whether the BOLD fMRI response by itself is enough to demonstrate the semantic space. In all, even if a continuous semantic space does exist, they should prove that the space is indeed a visual space, rather than a memory space, a linguistic space, or a combination. My guess is that this is probably impossible.

    BTW, the visualization in this paper is more impressive than in most vision papers.

  11. There are a lot of beautiful questions posed above about the validity of the approach, experiments and analysis. I am wondering what would happen if we considered two different kinds of studies that are almost inverse mappings of each other.

    1. Crudely speaking, this paper suggests a spatial location for object and action categories in human brain space.

    2. Recently, some computer vision and neuroscience researchers (Oliva et al. http://cvcl.mit.edu/memorableImages.html) have come out with the idea of image memorability and what governs it (e.g., faces). I am not sure about the cognitive neuroscience of the brain, but are there studies suggesting that a particular portion of the brain is responsible for remembering things? If so, can we use the results from such a study, together with the experimental analysis from this paper, to find the factors/attributes governing image memorability, and then match that result against the findings of Oliva et al.? These comparisons would not only give stronger cues about the validity of the studies but could also build a bridge between fMRI space (neuroscience researchers), human space (the community trying to understand human vision using crowd-sourcing techniques such as Amazon Mechanical Turk) and machine space (computer vision researchers trying to use maths to get better machines/algorithms).

    N.B. The above question might not be in sync with the beautiful questions posed above this post. But since the class is about neurons and machines, I thought there should be some discussion that could help us make a connection between the two.

  12. A lot of the comments argue the need to separate visual processing from other forms of processing. I feel that a study of semantics has to involve other processing more ... I'd rather exclude V1, V2 and keep all of the parts needed for memory, logic and emotion.

  13. Maybe I'm a little confused, but is it really efficient to use large swathes of your entire brain to represent five dimensions? Aren't there more compact ways of doing this? Maybe I'm misunderstanding their argument for why this is an efficient way for the brain to do things.

    I agree with others who said there may be other interpretations of this data that the authors do not consider. I'll throw my own hypothesis into the stew: a lot of the signal they're getting could be attributed to small descriptive labels associated with certain objects. For some objects, these descriptive labels will not be applicable. If I have a piece of brain that estimates (or encodes) the length of fur covering an object, it may be going crazy when looking at cats, but it will probably not be active at all when I'm looking at a toaster. It seems to me that when seen from a distance through the very blurry lens of fMRI data, this may look exactly like a continuous space. As we move from animate objects to inanimate objects, invariably certain attributes will begin to become relevant while others become irrelevant. It doesn't mean that the brain is encoding animate vs inanimate across these neurons; it could simply mean that the brain is storing various descriptive labels that happen to apply to those things in a particular area. My argument assumes some kind of sparsity in the space of descriptions, basically that there are some descriptive labels that apply to particular sets of objects and not others. Anyway, I guess what I'm saying is: there's no way to tell whether this data is continuous, or discrete but at a much higher resolution than we are able to measure. I think their "smoothness" argument is meant to punch a hole in this, but I don't buy it either. It may make sense for the brain to organize descriptive labels in areas that transition cleanly from one place to another due to practical realities related to brain function or efficiency.

    Again, it's possible that I'm misrepresenting their argument or misunderstanding their data somehow though.

    Replies
    1. So I guess efficiency might be worth defining in response to your question. Gallant would probably argue that efficiency means using as few neurons as possible to represent common signals. I'm not sure that directly translates to localizing information, or necessarily even to representing each dimension more compactly. There could be totally different dimensions that are more valuable to represent compactly than the ones they chose to define (arguably, there'd be a low probability that they happened to pick meaningful dimensions in 'brain representation space').

      I do think the point about alternative interpretations of the data is a very valid one. I think the framing of their null vs. alternative hypothesis is a little too distinct -- just because the representations are spread out doesn't mean they're continuous (I believe this is mentioned in others' comments above as well). That's the major problem I have with the paper.

      Although I do have to say that I like the idea of moving away from trying to discretize everything and thinking more in terms of the representation space in which things live. I love seeing computer vision papers that think in this manner, and I wonder if that is going to be a more popular way of looking at things in both the human and computer vision literature. Making the dimensions along which we categorize things (indoor vs. outdoor, large vs. small) unsupervised as Abhinav mentioned in class would be an interesting addition to this paper.
