Visual Neurons and Machines: Feb 26 Read

Monday, February 24, 2014

This week we will read:
[Mitchell] Palatucci et al., Zero-Shot Learning with Semantic Output Codes, NIPS 2009

30 comments:

Unknown, February 25, 2014 at 5:39 AM
Did anyone read the neuroscience paper comparing humans and monkeys? If not, I would definitely recommend it as well.

Unknown, February 25, 2014 at 8:46 AM
I have two points about this paper.

First of all, I again noticed that they use both the word and the visual representation for the concept. This again raises the question of semantic versus visual concepts/categories. It would be interesting to see whether the results differ if only the word or only the visual representation is shown. But as I see it, this is a more fundamental problem that is difficult to resolve.

Following the previous paper, which suggests a continuous space in the brain for similar categories, my second point is about the relation between similarity of neural activity and similarity in the semantic feature space. I wonder whether shared attributes/semantic features result from partially similar neural activity, even when the objects ultimately belong to different classes.

Unknown, February 25, 2014 at 4:49 PM
On the first point, I completely agree that using both word and shape is not ideal. They could have chosen either visual or conceptual. Another thing I did not like was asking people to think of properties of objects: the task forces people to use an attribute-based feature space, and hence there might be higher correlation than with passive viewing.

Unknown, February 25, 2014 at 4:51 PM
On the second point, I think one of the links between this paper and the previous paper was that the PCA dimensions are supposed to be the semantic features. Therefore, one could have used the PCA dimensions as well. The only problem would be how to estimate the knowledge base.

Unknown, February 25, 2014 at 7:26 PM
I think the knowledge base (i.e., the intermediate coding) is actually quite tricky here. You need something that can separate classes well but which can still be learned. I'm not sure whether such codes / attributes exist for general objects, since to learn anything fine-grained, you might need fairly specific codes for which there might not be much data. For instance, in one of the categories they struggle with on the retrieval task (furniture items), how would one distinguish side-tables from coffee tables?

I'm somewhat surprised at these results too: given 58 examples, they learn a mapping to a 5000-dimensional space. Even with regularization, this seems suspect. Certainly their results are better than chance, but I'm curious what the breakdown is and whether their performance is largely due to success at predicting the distinction between vehicles and animals. I do like the predicted/target results for bear vs. dog, but these are selected attributes, not random ones or the strongest predictions.
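
To make the worry about 58 examples and a high-dimensional code concrete, here is a minimal sketch of a two-stage zero-shot setup of the kind discussed above: a regularized linear map from input features to a semantic space, followed by a nearest-neighbor lookup in a knowledge base. Everything in it is an assumption for illustration: the data are synthetic, the sizes are scaled down (20 semantic dimensions rather than 218 or 5000), holding out two classes is a choice of the sketch, and the names `fit_ridge` and `nearest_class` are hypothetical rather than the authors' code.

```python
import numpy as np

def fit_ridge(X, F, lam=1.0):
    """Dual-form ridge regression: W = X^T (X X^T + lam I)^-1 F.

    The dual form only solves an n x n system, which is convenient
    when there are far fewer examples (n) than input features (d).
    """
    n = X.shape[0]
    return X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), F)

def nearest_class(f_hat, codes):
    """Index of the row of `codes` closest to `f_hat` (Euclidean)."""
    return int(np.argmin(np.linalg.norm(codes - f_hat, axis=1)))

rng = np.random.default_rng(0)
n_classes, n_sem, n_feat = 60, 20, 200          # assumed toy sizes
codes = rng.normal(size=(n_classes, n_sem))     # synthetic knowledge base
mixing = rng.normal(size=(n_sem, n_feat))       # hidden semantic-to-input map
X = codes @ mixing + 0.01 * rng.normal(size=(n_classes, n_feat))

train, held_out = np.arange(58), np.array([58, 59])
W = fit_ridge(X[train], codes[train])

# Predict a semantic code for each unseen class, then pick whichever
# held-out knowledge-base entry is nearest to that prediction.
predictions = [
    int(held_out[nearest_class(X[i] @ W, codes[held_out])]) for i in held_out
]
print(predictions)
```

Because the synthetic inputs really are a noisy linear function of the codes, this toy case is easy. It says nothing about whether 58 fMRI examples are enough, which is the question raised above.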

Unknown, February 25, 2014 at 7:30 PM
In the previous paper, the authors recovered a continuous space and found that only the first few PCs are meaningful. That seems to suggest that the amount of information we can recover from fMRI data is limited to a few dimensions. In contrast, both semantic spaces used in this paper are very high-dimensional (218 and 5000 dimensions respectively). This suggests two possibilities:
1. The method used in the first paper fails to recover information in the higher dimensions, or
2. there are many redundant dimensions in the semantic spaces in the second paper.
It would be interesting to see how well a semantic space with far fewer dimensions would perform on the experiment in this paper.
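
One way to probe the second possibility is to project a knowledge base onto its top principal components and check how much of the class geometry survives. The sketch below does this on a synthetic 60 x 218 matrix that is built, by assumption, from only 5 underlying factors; `pca_reduce` and `pairwise_distances` are hypothetical helper names, and real human218 or corpus5000 codes may behave quite differently.

```python
import numpy as np

def pca_reduce(codes, k):
    """Project rows of `codes` onto their top-k principal components.

    Returns the reduced codes and the fraction of variance retained.
    """
    centered = codes - codes.mean(axis=0)
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    reduced = centered @ Vt[:k].T
    retained = float((s[:k] ** 2).sum() / (s ** 2).sum())
    return reduced, retained

def pairwise_distances(M):
    """Euclidean distance between every pair of rows of M."""
    diff = M[:, None, :] - M[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

rng = np.random.default_rng(0)
latent = rng.normal(size=(60, 5))      # 5 "true" factors (assumed)
loading = rng.normal(size=(5, 218))
codes = latent @ loading               # 218 dimensions, heavily redundant

reduced, retained = pca_reduce(codes, k=5)
same_geometry = np.allclose(
    pairwise_distances(codes), pairwise_distances(reduced)
)
print(retained, same_geometry)
```

If distances between classes are preserved after reduction, a nearest-neighbor lookup in the smaller space would return the same classes, so the extra dimensions would be redundant for that step.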

Gaurav, February 26, 2014 at 12:17 AM
A paper that explains how the dream machine in the movie 'Inception' could be developed: interpret a dreamer's world and show it to the other people plugged in :)

I'm concerned about the choice of corpus5000 as the semantic knowledge base: co-occurrence with the 5000 most frequent words from the Google Trillion Word Corpus. Is this number enough? Do we need to calculate co-occurrence over more words to get a true semantic space? Mitchell (2008) (CMU's Tom Mitchell, yeah!) uses co-occurrence over the whole trillion-word corpus.

I would like an ablative analysis of the semantic knowledge set. Also, the semantic set that performs best in Mitchell's paper is a co-occurrence knowledge set of 25 action-related words. I am curious to see a comparison of the 218-attribute set with the co-occurrence knowledge set used by Mitchell. My intuition is that zero-shot learning on the 218-attribute set might do better, since it includes size, shape, and surface properties rather than just actions and motor-related information, but a side-by-side comparison might show how useful non-motor attributes are.
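
For readers unfamiliar with co-occurrence knowledge bases, here is a toy sketch of the idea: each target word is represented by how often it appears together with a fixed set of basis words. The three-sentence corpus, the basis words, the sentence-level counting, and the function name `cooccurrence_vectors` are all assumptions made for illustration; corpus-scale statistics are computed differently.

```python
from collections import Counter

def cooccurrence_vectors(sentences, targets, basis):
    """Count, for each target word, how many times each basis word
    occurs in the sentences that contain the target."""
    counts = {t: Counter() for t in targets}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for t in targets:
            if t in tokens:
                for b in basis:
                    counts[t][b] += tokens.count(b)
    return {t: [counts[t][b] for b in basis] for t in targets}

# Hypothetical toy corpus and basis words.
corpus = [
    "the dog chased the cat",
    "the cat ate the fish",
    "the truck carried the dog",
]
basis = ["chased", "ate", "carried"]
vectors = cooccurrence_vectors(corpus, ["dog", "cat"], basis)
print(vectors)
```

Changing the basis list (25 action words versus 5000 frequent words) changes the dimensionality and content of the semantic space, which is the ablation asked for above.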

Ishan, February 26, 2014 at 12:36 AM
I really liked this paper for the following reasons:
1. A clear explanation of the task at hand.
2. Formalism for zero shot learning: though I did not understand most of it, a few tidbits made intuitive sense.
3. A neat idea

Things I was not happy with:
1. There is no detailed analysis of the mistakes the system makes. Any hypothesis for why insects/man-made items perform poorly?
2. The corpus5000 baseline does not convince me. It seems rather weak to consider only word co-occurrences while asking much richer questions in human218. Also, human218 seems to be an attribute-based method.
3. Mixing up vision and language for the semantic space.

Abhinav Shrivastava, February 26, 2014 at 2:40 AM
I liked the paper for its simplicity and clear presentation!

A few things that bothered me:
1. The assumption of a one-to-one encoding between class labels and points (raised above too, in a different form). -- The authors assume that they can manually design a feature space that has a unique signature for each class label. For a small number of categories this seems doable, but for a much larger set, isn't manually defining this space overkill? Sure, you can represent 100k categories uniquely in ~17 bits, but it's highly unlikely that these 17 bits can be translated into simple questions like the ones the users are asked.

2. (being a vision person) -- for me, the semantic feature space was more-or-less an attribute space. It would have been nice to see the authors use attributes from Farhadi et al. and Lampert et al. as the intermediate space for a baseline.

3. Experiment Design -- was it correct for them to ask participants "to think about the properties of these objects"? Or would just showing the object and asking the participants to look at the images and read the text be enough? --- I don't know. Maybe forcing the participants to think of properties is more congenial to their problem.

4. Experiment Design -- mixing vision and language in the semantic space (as mentioned by Ishan). I would have also liked to see two different experiments (with different fMRI data): one showing the word, the other showing the image of the object, with a separate analysis of each.
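
The "~17 bits" figure in point 1 is just the smallest number of yes/no attributes that can give every class a distinct code. A quick check, with `bits_needed` as a hypothetical helper name:

```python
import math

def bits_needed(n_classes):
    """Smallest number of binary attributes that can give every
    class a distinct code: ceil(log2(n_classes))."""
    return math.ceil(math.log2(n_classes))

# 2**16 = 65,536 < 100,000 <= 2**17 = 131,072
print(bits_needed(100_000))  # 17
print(bits_needed(60))       # 6
```

This is only a counting bound. It says nothing about whether such compact codes correspond to questions a person could answer, or whether they can be predicted from data, which is the concern raised in point 1.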

Unknown, February 26, 2014 at 4:14 AM
I really liked the paper. The authors have formalized the zero-shot learning problem, which gives us the ability to predict classes that were not part of the training set.

But I have the following issues with the paper:
1) Creating a rich knowledge base is a very costly operation, and it is not always feasible to have a good knowledge base for a given problem. It is also very difficult in practice to find attributes that differentiate classes at a finer level. They should have discussed in more detail how to choose the knowledge base.
2) They also do not discuss how to choose the input features (X) of the training data for a given knowledge base (F). The feature space should be able to capture the semantics of the knowledge base, so randomly choosing features won't help. There should be some correlation between them.

Unknown, February 26, 2014 at 5:08 AM
OK, so I have some issues with the paper. As stated by the authors, the focus of this paper is to answer: 'Given a semantic encoding of a large set of concept classes, can we build a classifier to recognize classes that were omitted from the training set?'. In the related work section, the paper mentions the work of Lampert et al., 'Learning to detect unseen object classes by between-class attribute transfer', which addresses zero-shot learning in computer vision by transfer learning of semantic attributes (85 attributes) on the Animals with Attributes (AwA) dataset.

While reading this paper (Palatucci et al.), I kept asking how it differs from the findings of Lampert et al. Except for Section 3, which gives a theoretical PAC analysis, I could not see what it offers that is new. Please correct me if I am missing something. The idea of using human fMRI images is new; maybe that is the selling point of this paper. They could also have tested their approach on the AwA dataset, which has more confusable categories and more semantic attributes. But I suppose this is not a computer vision paper competing on numbers. The authors' idea is to show the possibility of zero-shot learning, which in my opinion has been well demonstrated by Lampert et al.

Please correct me if I am somewhere missing the beauty of this paper.

Notes:
1. Lampert et al. paper link:

Yuxiong Wang, February 26, 2014 at 5:13 AM
This paper (Mohammad Rastegari, Ali Farhadi, and David Forsyth. "Attribute discovery via predictable discriminative binary codes." ECCV 2012.) provides a way to learn codes / attributes for general objects that are both predictable and discriminative. Besides, their assumption of a one-to-one mapping between class labels and points in the semantic feature space seems tricky. A more feasible way is to group or classify different categories in the attribute space.

Allie, February 26, 2014 at 8:55 AM
I have a basic question about the semantic labels used -- when they used Mechanical Turk to obtain the labels, are those the same properties they asked the people looking at the images to see? I couldn't discern from the paper whether they were purposefully trying to focus the features they were trying to pull from brain activity with the labels pulled from Mechanical Turk. Possibly an irrelevant question, but I was thinking it would really affect the results if the properties they asked for were different from the ones the subjects were thinking about.

Also, as for the main conclusion of this paper -- they are clearly interested in the neuroengineering application, but I think the implications for how we build vision systems are way more interesting from a scientific perspective rather than an engineering one. Have there been any large CV challenges where no examples of a certain category are given, but the classifier is still expected to work on them? I guess this kind of a challenge would rely heavily on semantics, and the whole vision+semantics thing is still kind of new.

Unknown, February 26, 2014 at 2:09 PM
I have two points on attributes for computer vision purposes.

1. Should attributes be easier to recognize/detect/estimate than categories in terms of computation? Otherwise, we may be converting a hard problem into a harder one. In this sense, an attribute like "man-made" might be a bad attribute, because it is really not easy to classify from appearance alone.

2. Each attribute should be shared by multiple categories; otherwise it becomes a category-like label itself, and the resulting visual representation is not compact.
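
The second point can be checked mechanically on any class-attribute table. The sketch below uses made-up annotations (not taken from any dataset) and a hypothetical helper `attribute_sharing` to list attributes that belong to only one class and therefore act like category labels.

```python
def attribute_sharing(class_attributes):
    """Map each attribute to the sorted list of classes that have it."""
    owners = {}
    for cls, attrs in class_attributes.items():
        for a in attrs:
            owners.setdefault(a, []).append(cls)
    return {a: sorted(c) for a, c in owners.items()}

# Hypothetical toy annotations, for illustration only.
toy = {
    "dog":   {"furry", "four-legged", "barks"},
    "bear":  {"furry", "four-legged"},
    "truck": {"man-made", "has-wheels"},
    "car":   {"man-made", "has-wheels"},
}
sharing = attribute_sharing(toy)

# Attributes owned by exactly one class behave like category labels.
category_like = sorted(a for a, c in sharing.items() if len(c) == 1)
print(category_like)
```

In this toy table only "barks" is unshared; every other attribute is carried by at least two classes and so can transfer to an unseen class.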