This is an interesting paper. It holds a different viewpoint on the function of IT compared to the paper [Kriegeskorte et al] we studied last week. (Was that because their monkey volunteers came from a different place? :P) I have one concern about their experiments. If I understood correctly, they first performed clustering in each representation space (e.g. IT neuron representation, semantic, shape-based, low-level visual properties). Then, they counted the cluster overlap between the three candidate hypotheses and the IT neuron representation. My main concern is that the parameters used for clustering could strongly influence the results and the final conclusion. For example, they used k-means with K=15 for the shape-based representation. Since K directly determines the granularity of the clusters, it may change the overlap results. The same question applies to the low-level visual properties: why were 15 images selected for each category? If I change that number, will it influence the results? It seems to me they were essentially trying to estimate the correlation between different feature representations. Can we do that directly in the "raw" feature space (i.e. the raw IT responses) rather than in the clustered space?
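On that last point, one clustering-free way to compare spaces is representational similarity analysis: build a dissimilarity matrix per space and correlate the two. A minimal sketch, with made-up array shapes (the 94-neuron and 500-feature dimensions are my placeholders, not numbers from the paper):

```python
# Hypothetical sketch (array names and sizes are made up): compare two
# representations directly in "raw" feature space via representational
# dissimilarity matrices (RDMs), with no clustering step at all.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_objects = 213                                     # stimuli in the paper
it_responses = rng.normal(size=(n_objects, 94))     # stand-in IT population vectors
shape_features = rng.normal(size=(n_objects, 500))  # stand-in model (e.g. C2) outputs

# One RDM per space: pairwise dissimilarities between object representations.
rdm_it = pdist(it_responses, metric="correlation")
rdm_shape = pdist(shape_features, metric="correlation")

# Rank-correlate the two RDMs: a clustering-free measure of how similarly
# the two spaces organize the same 213 objects.
rho, _ = spearmanr(rdm_it, rdm_shape)
```

This avoids committing to any K at all, at the cost of losing the interpretable category labels that the clustering-based comparison provides.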
I have a similar concern about the number of images for the low-level visual properties, and about what the respective categories are. The paper refers to 'Materials and Methods' for the details, but unfortunately I could not find any link to it in the paper. So I looked on the author's webpage for supplementary material; please see the link (below). On it, check out Supporting Information Fig. C. It seems to me that the 213 objects were chosen so that there are at least 15 of them in each category. I think they could surely add the remaining (213 - 120) objects to one of these categories. Maybe they chose 15 for the sake of simplicity, OR since 15 might be the lowest number of images that fits in each category, they settled on that.
Sorry. Link is http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003167#s5
More insight about the clustering hypotheses is on Page 18 ("Clustering Hypotheses"). Pasting here:

"Clustering hypotheses. The neuronal-based object clusters produced by the algorithms described above were compared to object categories obtained according to three different clustering hypotheses: 1) shared semantic membership; 2) shared shape features; and 3) shared low-level visual properties.

Eleven semantic categories (shown in Fig. S1A) were built according to the criteria established in . These categories were further grouped into the two superordinate categories of animate and inanimate objects.

Fifteen categories of objects sharing shape features (shown in Fig. S1B) were obtained as the result of object clustering in the output layer of a well-known hierarchical model of object recognition , , . For our application, we have chosen the version of the model described in  (and downloaded from http://www.mit.edu/~jmutch/fhlib/ – version 8), which consists of four layers of artificial neural units named S1, C1, S2, and C2. Units S1 are a bank of Gabor filters with various orientations, spatial frequencies, positions and scales. Units C1 implement an OR-like operation on subsets of S1 afferent units, having the same orientation tuning but in different positions/scales. Units S2 perform a template matching (AND-like) operation on subsets of C1 afferent units to gain tuning for a particular combination of visual features. In this version of the model, the templates to which these units are tuned are random patches of images taken from the Caltech 101 database (different S2 units are built having as a template the same image patch, but at different positions and scales). In the output layer of the model, C2 units perform again an OR-like operation on subsets of S2 afferent units tuned for the same image patch, but at different positions and scales. In our instantiation of the model, 24,451 C2 output units were built.
These units convey the more explicit (i.e., more shape selective and position/scale tolerant) representation of visual objects provided by the model. They could therefore be used to assess the similarity of our visual objects at the level of shared middle- to high-level shape features. This was achieved by running a k-means clustering algorithm over the representation of our object set provided by the model's output units, so as to obtain 15 groups of objects with similar features. The number of groups was set to 15 to match the optimal number of k-means clusters found in the IT neuronal representation using the BIC and AIC criteria (see previous section).

Eight categories of objects sharing low-level visual properties (shown in Fig. S1C) were defined on the basis of four global properties of the images of the objects – luminance, contrast, area and aspect ratio. Each category contained the 15 images having either the highest or the lowest values of one such property, which were defined as follows. Luminance was defined as the average pixel intensity of the object image, divided by the maximum of the grayscale range (i.e., 255). Area was defined as the fraction of pixels, in the image frame, that was occupied by the image of the object. Note that object area, as defined here, is different from object size, which was fixed to ~2° of visual angle for all the objects. Contrast was defined as: (median(pixels>128) − median(pixels<128)) / (median(pixels>128) + median(pixels<128)). Aspect ratio was defined as the maximum, across all possible rotations, of the height of an object image divided by its width."
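For concreteness, here is how I read those four property definitions in code, on a synthetic image. Two assumptions are mine, not the paper's: I take "average pixel intensity of the object image" to mean the mean over object pixels, and I approximate "maximum across all rotations" with just the 0° and 90° orientations.

```python
# My reading of the four property definitions quoted above, applied to a
# synthetic 100x100 grayscale image (a bright rectangle on a black frame).
import numpy as np

img = np.zeros((100, 100), dtype=float)
img[40:60, 20:80] = 200.0            # the "object": a 20x60 rectangle
obj = img > 0                        # object pixels vs background

luminance = img[obj].mean() / 255.0  # mean object intensity / top of grayscale range
area = obj.mean()                    # fraction of the frame covered by the object

# Contrast = (median(pixels>128) - median(pixels<128)) /
#            (median(pixels>128) + median(pixels<128))
hi = np.median(img[img > 128])
lo = np.median(img[img < 128])
contrast = (hi - lo) / (hi + lo)

# Aspect ratio: bounding-box height/width, maximized over rotations
# (crudely approximated here by the two axis-aligned orientations).
rows = np.where(obj.any(axis=1))[0]
cols = np.where(obj.any(axis=0))[0]
h, w = rows[-1] - rows[0] + 1, cols[-1] - cols[0] + 1
aspect_ratio = max(h / w, w / h)
```

For this toy rectangle the numbers come out as luminance ≈ 0.78, area = 0.12, contrast = 1.0, aspect ratio = 3.0, which makes it easy to sanity-check each definition.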
While the clustering parameters would matter, I think they try to control for the choice of algorithm and its parameters by trying multiple algorithms (DST, etc.). Also, the numbers of shape and IT clusters look similar to me.
They claim that the optimal number of k-means clusters (15) was determined by the Bayesian Information Criterion and the Akaike Information Criterion, which are criteria that balance goodness of fit against model complexity (overfitting).
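To make that trade-off concrete, here is a small sketch of picking k by BIC. The paper applies BIC/AIC to k-means; I use scikit-learn's GaussianMixture instead, since its `.bic()` implements exactly the fit-vs-complexity penalty being discussed. The data are synthetic, so the specifics are illustrative only:

```python
# Illustrative sketch of picking the number of clusters with an information
# criterion: the best k minimizes BIC = -2*log-likelihood + penalty on the
# number of parameters. Synthetic data: 3 well-separated 2D clusters.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

# Fit candidate models and keep the k with the lowest BIC.
bic = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
       for k in range(1, 7)}
best_k = min(bic, key=bic.get)       # should recover k = 3 here
```

The same scan over k is presumably what produced the paper's 15: past the true structure, extra clusters buy too little likelihood to pay their parameter penalty.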
Yes, I was rather troubled to see that they were shooting holes in the gorgeous dissimilarity matrices from [Kriegeskorte et al]. So sad... they were so nice...
I think one of the main reasons for the disagreement with Kriegeskorte et al. (and Kiani et al.), which the authors explain multiple times, is the statistical analysis of the overlap between semantic and neural clusters. Specifically, one example from the paper (Pg 8): "compensating for existence of multiple (very similar) exemplars of the same object (i.e. twins)". The importance of this distinction can clearly be seen in all the tables, as well as Figure 6. For me, the main disagreement the authors want to highlight is over animate vs. inanimate objects.
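A toy illustration of the "twins" point (my own construction with made-up object names, not the paper's exact statistic): near-duplicate exemplars inflate the raw count of objects shared between a neural cluster and a semantic category, so collapsing twins before counting changes the overlap that enters the significance test.

```python
# Toy illustration: a neural cluster that merely groups near-duplicate
# images ("twins") looks like a strong semantic cluster in the raw count.
semantic_category = {"face1a", "face1b", "face2a", "face2b", "dog"}
neural_cluster = {"face1a", "face1b", "face2a", "face2b"}

naive_overlap = len(neural_cluster & semantic_category)   # counts each twin separately

# Collapse each twin pair onto one canonical exemplar before counting.
twins = {"face1b": "face1a", "face2b": "face2a"}
def canon(objects):
    return {twins.get(o, o) for o in objects}

adjusted_overlap = len(canon(neural_cluster) & canon(semantic_category))
# The overlap entering the significance test drops from 4 to 2 objects.
```

This is why the compensated analysis can flip a cluster from "significantly semantic" to "not significant" without the clustering itself changing.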
Figure 5 interested me the most, since it showed a clear correlation between shape and cluster structure. However, I also have some questions about the choice of images. If high-level semantic information is represented in IT, then we would expect the neural responses to be invariant to things such as lighting and viewpoint. However, it seems that among the 213 images used in the paper, images that share a semantic category also more or less share the same viewpoint. For example, all faces were front views, all animals were side views, and so on. It might be that because of this, invariance to viewpoint could not be demonstrated, and sensitivity to shape became much more noticeable. In that sense, the choice of images seems to favor shape-based categories.
Having low-level features represented in IT is the most worrisome part for me. Low-level features are supposed to be processed in V1-V4, and therefore the IT representation should be invariant to them. I think what you suggest is some kind of bias in the dataset... maybe knowing exactly how the dataset was formed is important in this case.
My knowledge of IT is not perfect, but I think variance with low-level features in IT contradicts the findings of a lot (decades) of previous work. It seems really strange.
I agree with the comments. These insights seem to suggest that additional processing of low-level features happens here(?). This might be a similar idea to one of the previous papers, where the authors show distributed processing along with local neural activity.
I agree with Anirudh's point. However, isn't it weird that the coarse clustering of objects depends on low-level properties while the fine-grained structure depends on higher-level shape features? It seems counter-intuitive to me.
Yeah, that is strange, because I thought that [Kriegeskorte et al] clearly showed that their data from IT could NOT be explained by low-level feature processing.
I think some overlap with low level similarity can also be explained by visual similarity. However, the FLD analysis in Fig 6 for pruned low level categories is indeed perplexing, e.g., the high aspect ratio cluster can be seen as a shape based cluster (thin objects) whereas I couldn't assign a shape "name" to the high luminance cluster.
I agree. This possibility should be ruled out, given the intra-class variation within a semantic category. They should either include objects with different viewpoints and shape variations, or at least report results when viewpoint and shape are varied.
This is an intriguing paper. The authors contend that high-level as well as low-level visual features account for the representations in IT cortex, rather than abstract, semantic membership. Two concerns: 1. I think one should be careful about the semantic vs. shape representation dichotomy. It may be that IT has semantic representations for *some* objects but not *all* visual objects. As the authors note, there was some evidence for a cluster of four-legged animals, as well as perhaps birds. 2. Clustering of faces is due to visual similarity, not semantics. However, they did not record in face-selective areas. Is it possible that this generalizes to other categories; i.e., are there actually specific sub-areas of IT which encode semantic categories?
It definitely does not rule out the possibility; in fact, they cite papers stating that "neurons in higher order areas of both streams can learn to encode general categorical associations between arbitrary visual patterns". I really like a point the authors make in the paper, which I think is one of the main takeaways: "Shape similarity among members of the same semantic category...can easily lead to an overestimation of how well semantic membership is represented in the visual cortex". Since objects that fall under the same semantic category also look very similar, we run the risk of over-generalizing.
I agree with Shaurya's point. Even the categories mentioned in the paper have shape similarity. Generally, animate objects are complex, whereas inanimate objects can be generalized to standard shapes, since they are man-made.
I agree with the concerns that Jacob raises here. Specifically, it is possible that the test subjects' brains had not learned several of the inanimate and animate objects presented in the study, and hence discern them as just shapes in higher-level brain areas instead of "understanding" them as semantic categories. The authors apparently probed only a small section of the cortex, with electrodes recording directly from neurons. What would the result be if they ran fMRI studies with the same dataset?
Although the authors mention this point, I am still not entirely convinced about the monkey conditioning part. And to add to the confusion, I can't really understand how exactly the representations of objects were separated as semantic, shape-based, etc. for the monkeys. It seems a bit dicey to impose such structures on this abstract space.
Does that imply that you think different monkey brains would organize category representations differently based on their experienced stimuli? I think it's fairly reasonable to assume that their monkeys interacted with a large number of inanimate and animate objects outside the domain of the experiment (they are lab monkeys, after all, and so were the monkeys in the other paper). Yes, I agree on the latter aspect regarding the representation of abstract concepts in monkey brains. Although there seems to be research suggesting that primate brains are capable of comprehending abstract concepts (like math), I'm not sure whether they use semantic information, or whether their recognition wiring is as complicated as ours.
The previous semantics-related studies were done on monkeys, so I think it was the right thing to run the shape-related experiments on monkeys as well, in order to counter the previous results. I also feel that it is very difficult to separate objects on the basis of semantics for a monkey; shape is still understandable.
The separation of objects based on semantics, shape, etc. was done on the dataset (not on the monkeys). This separation/clustering was then compared to various clusterings and classifications computed from the monkeys' neural data. I hope this clears some of the confusion.
I agree with Ishan on this part -- the authors didn't have to assume that the brain uses these clusters as the primary method for organizing visual information; they simply chose the different groups for comparison purposes. To bring up your first statement, Priyan, how would you change the monkey conditioning? Would you show shapes that the monkeys had likely never seen before during the testing phase? I would guess your problem with it is that the training of the monkey and the training of the neural representation might be inherently linked (?)
It was interesting to note that grayscale images of natural objects were used for this study. I don't know how reliable this is, but Wikipedia (http://en.wikipedia.org/wiki/Inferotemporal_cortex) states that the IT is specifically involved in the processing of color to determine "what" from the visual stimuli. Although the authors do mention this concern while comparing to other studies which use color images as input (Page 15). There must be a way to look at the color and shape interaction in the IT, which they seem to suggest is difficult to quantify. They agree this may be a major reason why their results differ from past studies.
I think their claim is quite valid in that colour can strongly influence detection, and it is difficult to disentangle clustering. Since as you mention the IT is triggered to respond to colour, what prevents us from making erroneous correlations with neurons that may be otherwise firing to colour in the stimuli? I'm not sure what a reasonable control experiment could be - clearly plain colour fields will not cut it, because there must be pathways to distinguish gradients and patterns in coloured stimuli.
I am not sure, but they could show color and grayscale images of the same object and compare the neurons' responses in both cases. I think this might factor out the effect of the color stimuli.
I wonder if it would be possible to find sets of color images of different objects with similar color histograms. I'm not sure that would help though.
The fact that the area is good at distinguishing color is different from saying it only responds to color images. I'm not sure which is true.
One of the things I gained was that the fMRI approach might be giving us wrong conclusions. Directly probing neurons is uncovering that semantic categories might not be the deciding factor for specificity in brain regions. However, that said, I have two doubts: 1. Could the fact that there is no specificity or semantic separation for inanimate objects be because the monkeys haven't learned to manipulate them or have other interactions with them? We know brain-region specificity might be derived from connectivity with other brain regions that process motor movements, emotions, etc. I'm not convinced that the lab monkeys used in this experiment have had sufficient interactions with the inanimate objects in this study. Hence, can we compare this study with the fMRI studies from the previous two classes? 2. Could the limited area of the high-level brain regions being sampled, or other experimental conditions, explain the shape-selectivity conclusion?