Did anyone read the neuroscience paper comparing humans and monkeys? If not, I would definitely recommend it as well.
I have two points about this paper. First, I again noticed that they use both the word and the visual representation for the concept, which brings back the discussion of semantic versus visual concepts/categories. It would be interesting to see whether the results differ if only the word or only the visual representation is shown, but as I see it this is a more fundamental problem that is difficult to resolve. Second, following the previous paper, which suggests a continuous space in the brain for similar categories, my other point is about the relation between similarity of neural activity and similarity in the semantic feature space. I wonder whether shared attributes/semantic features result from partially similar neural activity even when the objects end up belonging to different classes.
I agree. There may be entirely different semantic maps for seeing, hearing, and even touching a particular object. How abstract is knowledge representation in the brain? Can it really be fully separate from sensation? In other words, when we think about an object, are we really just "playing back" an experience of seeing, hearing, or touching the object?
On the first point, I completely agree that using both word and shape is not ideal. They could have chosen either visual or conceptual. Another thing I did not like was asking people to think of properties of objects: the task forces people to use an attribute-based feature space, and hence there might be higher correlation than with passive viewing.
On the second point, I think one of the links between this paper and the previous paper is that the PCA dimensions are supposed to be the semantic features. Therefore, one could have used the PCA dimensions as well. The only problem would be how to estimate the knowledge base.
The knowledge base could be represented as the principal components of a word co-occurrence matrix.
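To make the co-occurrence idea concrete, here is a minimal sketch of how such a knowledge base might be built. The toy sentences, the vocabulary, the window size and both function names are illustrative assumptions, not anything taken from the paper.

```python
import numpy as np

def cooccurrence_matrix(sentences, vocab, window=2):
    """Count how often vocabulary words appear near each other.

    Out-of-vocabulary tokens are dropped first, so the window is
    measured over the remaining in-vocabulary tokens (an assumption
    made to keep the toy example small).
    """
    index = {word: i for i, word in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for sentence in sentences:
        tokens = [t for t in sentence.lower().split() if t in index]
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if i != j:
                    counts[index[word], index[tokens[j]]] += 1
    return counts

def principal_components(counts, k):
    """Project each word's co-occurrence row onto the top-k principal axes."""
    centered = counts - counts.mean(axis=0, keepdims=True)
    U, S, _ = np.linalg.svd(centered, full_matrices=False)
    return U[:, :k] * S[:k]  # one k-dimensional code per word

# Hypothetical toy corpus and vocabulary.
sentences = [
    "the dog chased the cat",
    "the cat chased the mouse",
    "the truck passed the car",
    "the car passed the bus",
]
vocab = ["dog", "cat", "mouse", "truck", "car", "bus", "chased", "passed"]

counts = cooccurrence_matrix(sentences, vocab)
codes = principal_components(counts, k=2)
```

Each row of `codes` would then play the role of the semantic feature vector for one word; how many components to keep is exactly the open question.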
I think the knowledge base (i.e., the intermediate coding) is actually quite tricky here. You need something that can separate classes well but which can still be learned. I'm not sure whether such codes/attributes exist for general objects, since to learn anything fine-grained, you might need fairly specific codes for which there might not be much data. For instance, in one of the categories they struggle with on the retrieval task (furniture items), how would one distinguish side tables from coffee tables? I'm somewhat surprised at these results too: given 58 examples, they learn a mapping to a 5000-dimensional space. Even with regularization, this seems suspect. Certainly their results are better than chance, but I'm curious what the breakdown is and whether their performance is largely due to success on predicting the distinction between vehicles and animals. I do like the predicted/target results for bear vs. dog, but these are selected attributes, not random or the strongest predictions.
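For anyone who wants to poke at the "58 examples, 5000 dimensions" worry, here is a rough sketch of that kind of setup: a regularized linear map from voxels to semantic features, followed by a nearest-neighbour lookup among held-out classes. Everything here is synthetic. The low-rank latent structure, the noise level, the Euclidean distance and the function names are assumptions for illustration, not the authors' pipeline.

```python
import numpy as np

def fit_ridge_dual(X, Y, lam=1.0):
    """Ridge regression from X (n x d) to Y (n x p).

    Uses the dual form W = X^T (X X^T + lam*I)^{-1} Y, which only needs
    an n x n solve. That is convenient when n (58) is far smaller than d.
    """
    n = X.shape[0]
    alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), Y)
    return X.T @ alpha

def nearest_class(pred, candidates):
    """Index of the candidate semantic vector closest to the prediction."""
    dists = np.linalg.norm(candidates - pred, axis=1)
    return int(np.argmin(dists))

# Synthetic stand-in data (assumption): both voxels and semantic
# features are driven by a handful of shared latent factors.
rng = np.random.default_rng(0)
n_classes, n_voxels, n_features, n_latent = 60, 500, 5000, 5
Z = rng.normal(size=(n_classes, n_latent))
F = Z @ rng.normal(size=(n_latent, n_features))   # semantic knowledge base
X = Z @ rng.normal(size=(n_latent, n_voxels))     # "neural" data
X += 0.1 * rng.normal(size=X.shape)               # measurement noise

# Train on 58 classes, hold out two, then decode one held-out example.
train, held_out = np.arange(58), np.array([58, 59])
W = fit_ridge_dual(X[train], F[train], lam=1.0)
pred = X[held_out[0]] @ W
guess = held_out[nearest_class(pred, F[held_out])]
```

One thing the sketch makes visible: the prediction is always a linear combination of the 58 training rows of `F`, so a held-out class can only be decoded well if its semantic vector lies close to the span of the training ones. That seems to be where any redundancy in the 5000 dimensions would matter.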
I don't get the y-axis labels on that figure. I would have expected the scoring to be based on a scale of 1 to 5 (or correspondingly, -2.5 to 2.5). Have they normalized somehow? Is it because they have chosen these specific 10 semantic features to show, and then normalized their weights to a -0.10 to 0.10 scale?
I feel David's argument touches on a broader question of whether this semantic feature space can be hand-designed for general objects. As he mentions, this gets very challenging for fine-grained category recognition (can I design a feature space that can classify a novel mushroom type?). This is all the more complicated when we consider their assumption that there is a one-to-one mapping from the semantic features to category labels. Is there a feature space that has enough bits to support 100K categories and at the same time has a unimodal representation for each of them? I feel that this feature representation is built on the fly. Bird experts may devote more dimensions to birds, while a car expert will devote more dimensions (feature bits) to cars. The mapping from this feature space to the labels would then be more complicated: a WordNet-hierarchy-driven nearest neighbor?
Isn't it just a matter of building deeper and deeper decision trees? Looking at the human218 dataset, shouldn't an approach like that continue to provide sufficiently distinguishable features? Case in point: the Akinator. Maybe I'm just talking nonsense.
Ah, but you're then also increasing the dimension of the feature space. Never mind.
Good features are hard to come by, but also have varying degrees of applicability. The distinction between animate and inanimate objects is fairly broad and can be applied to many things. Whatever it takes to distinguish between a coffee table and a side table is probably very specific to furniture. To me this says that we should not be putting all objects in the same feature-space.
Coming back to David's point about the dimensionality of the semantic space: the problem might be feasible if the original 500-voxel representation is already very good. In other words, the semantic attribute information might be encoded very clearly in the neural data, so that the transformation from there is easy to learn from a few examples.
In the previous paper, the authors recovered a continuous space and found that only the first few PCs are meaningful. That seems to suggest that the amount of information we can recover from fMRI data is limited to a few dimensions. In contrast, for this paper both semantic spaces used were of very high dimension (218 and 5000 respectively). This seems to suggest two possibilities: 1. the method used in the first paper fails to recover information in the higher dimensions, or 2. there are many redundant dimensions in the semantic spaces in the second paper. It would be interesting to see how well a semantic space with far fewer dimensions would perform on the experiment in this paper.
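One cheap way to probe the second possibility would be to check how many principal components a semantic matrix needs before most of its variance is explained. The sketch below does this on a synthetic 60 x 218 matrix that is deliberately built from three latent factors. The data, the 95% threshold and the function name are all assumptions. With the real human218 or corpus5000 matrices the answer could look quite different.

```python
import numpy as np

def components_for_variance(F, threshold=0.95):
    """Smallest number of principal components whose cumulative
    explained variance reaches `threshold`."""
    centered = F - F.mean(axis=0, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)
    ratio = np.cumsum(s**2) / np.sum(s**2)
    # searchsorted returns the first index where ratio >= threshold.
    return int(np.searchsorted(ratio, threshold) + 1)

# Synthetic semantic matrix (assumption): 60 words, 218 features,
# generated from only 3 underlying factors, so it is highly redundant.
rng = np.random.default_rng(0)
latent = rng.normal(size=(60, 3))
F = latent @ rng.normal(size=(3, 218))

k = components_for_variance(F, threshold=0.95)
```

If `k` came out small for the real knowledge bases, one could rerun the zero-shot experiment in the reduced space and see whether accuracy drops.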
Given the several layers of abstraction in the BOLD signals of fMRI data, I would guess that your first assumption is correct: the data was too noisy to recover more principal components.
A paper that explains how the dream machine in the movie 'Inception' could be developed: interpret a dreamer's world and show it to the other people plugged in :) I am concerned about the choice of the corpus5000 semantic knowledge base, i.e., co-occurrence over the 5000 most frequent words from the Google Trillion Word Corpus. Is this number enough? Do we need to calculate co-occurrence over more words to make a true semantic space? Mitchell (2008) (CMU's Tom Mitchell yeh!) uses co-occurrence over the whole trillion-word corpus. I would like an ablative analysis of the semantic knowledge set. Also, the semantic set that performs best in Mitchell's paper is a co-occurrence knowledge set of 25 action-related words. I am curious to see a comparison of the 218-attribute set with the same co-occurrence knowledge set used by Mitchell. My intuition is that zero-shot learning on the 218-attribute set might be better, as it includes size, shape, and surface properties rather than just actions and motor-related information, but a side-by-side comparison might show how useful non-motor attributes are.
I really liked this paper for the following reasons: 1. A clear explanation of the task at hand. 2. Formalism for zero-shot learning: though I did not understand most of it, a few tidbits made intuitive sense. 3. A neat idea. Things I was not happy with: 1. I would have liked a more detailed analysis of the mistakes the system makes. Any hypothesis why insects/man-made items perform poorly? 2. The corpus5000 baseline does not convince me. It seems rather weak to just consider word co-occurrences while asking much richer questions in human-218. Also, human-218 seems to be an attribute-based method. 3. Mixing up vision and language for the semantic space.
I think the motive behind using the corpus5000 baseline is to show that even a non-attribute-based method performs better than random guessing. It shows that even if we don't have a rich knowledge base like human-218, we can still do zero-shot learning, though obviously it won't be as good as with a richer knowledge base.
I liked the paper for its simplicity and clear presentation! A few things that bothered me: 1. The assumption of a one-to-one encoding between class labels and points (raised above too in a different form) -- the authors assume that they can manually design a feature space that has a unique signature for each class label. For a smaller number of categories this seems doable, but for a much larger set, isn't manually defining this space overkill? Sure, you can represent 100k categories uniquely in ~17 bits, but it's highly unlikely that these 17 bits can be translated into simple questions like the ones the users are asked. 2. (Being a vision person) -- for me, the semantic feature space was more or less an attribute space. It would have been nice to see the authors use attributes from Farhadi et al. and Lampert et al. as the intermediate space for a baseline. 3. Experiment design -- was it correct for them to ask participants "to think about the properties of these objects"? Or would just showing the object and asking the participants to look at the images and read the text be enough? I don't know. Maybe forcing the participants to think of properties is more congenial to their problem. 4. Experiment design -- mixing vision and language in the semantic space (as mentioned by Ishan). I would also have liked to see two different experiments (with different fMRI data): one showing the word, the other showing the image of the object, with analysis done on each.
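The ~17-bit figure is just ceil(log2(100,000)), since 2^16 = 65,536 is too few codes and 2^17 = 131,072 is enough. A tiny sketch, with hypothetical helper names, that checks the arithmetic and shows what such a code looks like:

```python
def bits_needed(n_classes):
    """Minimum number of binary attributes that can give every one of
    n_classes a distinct code, i.e. ceil(log2(n_classes))."""
    return max(1, (n_classes - 1).bit_length())

def binary_code(class_id, n_bits):
    """The class index written as a list of bits, most significant first."""
    return [(class_id >> b) & 1 for b in reversed(range(n_bits))]

n_bits = bits_needed(100_000)
code = binary_code(5, n_bits)
```

Of course these bits are arbitrary index bits. Nothing makes bit 7 correspond to an answerable question like "is it man-made?", which is the point of the objection above.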
Just showing the object does not guarantee that the subject will actually think about all the properties of the object, because they can recognize the object from just a few properties and may not pay attention to the rest. But for their zero-shot learning using the human218 knowledge base, it would be better if the subject thinks about more properties of the object than recognition requires.
I suppose these have always been the concerns with human studies; we had a lot of such discussion in class about fMRI studies. But these settings are really very hard to set up. I always have sympathy for the brave souls who do human studies, because at the end of the day, when your paper is out for review, the reviewers ask such questions and are most likely to say -- 'I do not know how to do it, but it would have been good if you had considered these settings and more datasets. Your paper and ideas are novel, but it should have more experiments'. The problem is that it is not so easy to do that. In computer vision, people have come a long way (though this does not mean that they have solved it), and they have good datasets and good evaluation metrics. But the understanding of human systems is still at a naive stage. Think of it as the computer vision of the 70s, when people were working on line drawings. The point I want to convey is that we cannot compare the capabilities of a grown-up person with those of a little kid. P.S. I have first-hand experience of all these problems from trying to get my paper in for a submission. Therefore I generally do not comment when people are discussing such questions. The point is -- 'Yes, you have a beautiful question. But I do not know the answer. And this discussion would not even help you get that answer. So can we just read the paper and try to understand what the authors are actually trying to convey'. https://filebox.ece.vt.edu/~parikh/Publications/Bansal_edges_3dRR_ICCV_2013.pdf
I agree with Abhinav's observations. Asking the subject to think about the object, compared to actually showing them an image of the object, might trigger different neural activity. A control group which was shown images, in addition to the current set of users, would have helped clear up the ambiguity.
I really liked the paper. The authors have formalized the zero-shot learning algorithm, which gives us the ability to predict classes that were not part of the training set. But I have the following issues with the paper: 1) Creating a rich knowledge base is a very costly operation, and it is not always feasible to have a good knowledge base for our problem. It is also practically very difficult to find attributes that differentiate classes at a finer level. They should have discussed in more detail how to choose the knowledge base. 2) There is also the question of how to choose the input features (X) of the training data for a given knowledge base (F). The feature space should be able to capture the semantics of the knowledge base, so randomly choosing features won't help. There should be some correlation between them.
I think -- 1. This paper might help for Q1: D. N. Osherson, J. Stern, O. Wilkie, M. Stob, and E. E. Smith. Default probability. Cognitive Science, 15(2), 1991. 2. The work of Farhadi et al. (third paper for discussion) might help for Q2.
Ok. So I have some issues with the paper. As stated by the authors, the focus of this paper is to answer: 'Given a semantic encoding of a large set of concept classes, can we build a classifier to recognize classes that were omitted from the training set?'. In the related work section, the paper mentions the work of Lampert et al., 'Learning to detect unseen object classes by between-class attribute transfer', which addresses the problem of zero-shot learning in computer vision by transfer learning of semantic attributes (85 attributes) on the Animals with Attributes (AwA) dataset. While reading this paper (Palatucci et al.), I was looking for how it differs from the findings of Lampert et al. Except for Section 3, which gives a theoretical PAC analysis, I could not see how it offers something new. Please correct me if I am missing something. The idea of using human fMRI images is new. Maybe that is the selling point of this paper. Further, they could have tested their approach on the AwA dataset, which has more confusing categories and more semantic attributes. But I suppose it is not a computer vision paper competing on numbers for an approach. The authors' idea is to show the possibility of zero-shot learning, which in my opinion has been well demonstrated by Lampert et al. Please correct me if I am somehow missing the beauty of this paper. Notes: 1. Lampert et al. paper link:
This paper (Mohammad Rastegari, Ali Farhadi, and David Forsyth. "Attribute discovery via predictable discriminative binary codes." ECCV 2012.) provides a way to learn codes/attributes for general objects that are predictable while still being discriminative. Besides, their assumption of a one-to-one mapping between class labels and points in the semantic feature space seems tricky. A more feasible way is to group or classify different categories in the attribute space.
In my opinion, the work of Rastegari et al. is a crazy optimization which worked for Caltech-256 mostly due to the use of the classeme-1369 feature vector, which itself contains knowledge of semantics.
I have a basic question about the semantic labels used -- when they used Mechanical Turk to obtain the labels, are those the same properties they asked the people looking at the images to see? I couldn't discern from the paper whether they were purposefully trying to focus the features they were trying to pull from brain activity with the labels pulled from Mechanical Turk. Possibly an irrelevant question, but I was thinking it would really affect the results if the properties they asked for were different from the ones the subjects were thinking about. Also, as for the main conclusion of this paper -- they are clearly interested in the neuroengineering application, but I think the implications for how we build vision systems are way more interesting from a scientific perspective rather than an engineering one. Have there been any large CV challenges where no examples of a certain category are given, but the classifier is still expected to work on them? I guess this kind of a challenge would rely heavily on semantics, and the whole vision+semantics thing is still kind of new.
I have two points on attributes for computer vision purposes. 1. Should attributes be easier to recognize/detect/estimate than categories in terms of computation? Otherwise, we may convert a hard problem into a harder one. In this sense, an attribute like "man-made" might be a bad attribute, because it is really not easy to classify from appearance alone. 2. Each attribute should be shared by multiple categories; otherwise it becomes a category-like label itself, and in terms of visual representation it is not compact.