Until now we have focused on semantic categories and attributes as the way visual information is organized in the brain. This paper focuses on an alternative, specifically with respect to manipulable objects: the idea is that for finer categories in the manipulable world, the arrangement in the brain might be with respect to motor commands (which would make it efficient). One thing I found interesting in this paper was that in the medial fusiform gyrus (the area this study is concerned with) there were high responses for non-manipulable objects but no RS; RS was only observed for manipulable objects. The authors point out that RS indicates finer-level analysis, which is what happens in this area. I did not understand Fig 3C,D,E, specifically what it means that a region is defined by tools vs. animals... did anyone else understand what they meant?
Here's my understanding of 3C,D,E. Rather than pre-defining the regions with a localizer task, where the results might be an artifact of the task they picked, they define the regions after the fact. They can then delineate the medial fusiform gyrus by whatever criteria one pleases. In this case, you get similar results with respect to RS for tools irrespective of the criteria used to define the region (animals vs. non-manipulable, animals vs. tools, animals vs. arbitrarily manipulable). I guess this ensures that it's not just an artifact of their localization task.
I find Figure 5 interesting, as it suggests a difference between manipulated objects and tools. This leads me to wonder whether the BOLD signal changes in these areas reflect a high-level conscious recognition that the object being viewed is a tool. I wonder if there are experiments that deal with objects which are visually ambiguous in terms of being a tool or not. For example, there might be tools that are shaped like other objects, or objects which are not normally identified as tools but which the subject has used extensively as tools. (E.g. someone who used a pencil to rewind cassette tapes might see it as more "toolish" than someone who has no knowledge of this function.)
This is definitely similar to the argument about primary and secondary affordances: a pencil's primary affordance is writing, and its secondary affordance is rewinding cassette tapes. But based on experience, it might be organized differently.
I like the paper, especially because of the supporting neuropsychological study. However, there are some issues that I couldn't understand well.

Previously we talked about lesion patients who cannot identify objects visually but can identify them by touch. Especially for tools, it seems that touching is more important for identification than seeing the tool. Are the responses of people with such a lesion similar to the normal cases? That is, is RS more related to visual identification or to identification by touch (even if we are not familiar with using/manipulating the tool)?

While tools (such as hammers) are shown to be different from other objects (such as books), they are still close to each other compared to animals. I wonder how the results would change if the animals were selected only from the category of pets (cats, dogs) that we can touch, and the tests were performed especially among people who own a pet.

I also wonder what would happen if a tool were changed so that its functionality stayed the same but it looked visually different, or what would happen with a new tool that has never been seen or used before.

Going back to the previous paper on visual attributes, is it possible that there are some functional attributes which allow unseen categories to be recognized? In that case, do we see chairs as flat surfaces that we can sit on? Or are there some categories that we use in our daily lives so often that recognition is more instance-based, while when we see an unseen one we start by thinking about its attributes (visual or functional)?
Maybe related to your question: "Converging with the results of the fMRI study, we find that lesions to the left middle temporal gyrus and the left inferior parietal lobule are associated with impairments for both using and identifying objects." The fact that lesions impair both use and identification suggests that this region represents more than just the visual look of an object.
I think we should also consider studying people with lesions separately from the normal cases. One thought is that people with a lesion will probably develop other areas of the brain to functionally replace the damaged one. That would mean the functionality of processing areas might be transferable when one region is disabled by a lesion.
But then again, as shown in the slides in class, people with a lesion in the scene area are unable to recognize scenes. I understand that your argument comes from the perspective of functional compensation: in the event that an area of the brain is damaged, other areas also help so that the agent can function normally.
I am going to once again ask what the implications are here for the computer vision side of things: does this argument push for thinking more in terms of functional categories? I wonder whether the selectivity to tools is based on feedback from another brain area, or whether visual categories are actually organized that way in the visual system. I'm not sure any of their experiments could answer that (if the brain is lesioned, connections from other areas to the lesioned area would presumably be destroyed too; in the non-lesion studies, I'm not sure there would be enough BOLD activity to identify that much activity from other relevant areas).

Also, I liked that the authors made a point about drawing conclusions from BOLD signals by reproducing the results of other studies and showing how they could generate a different conclusion (e.g. at the bottom of page 517). They note that the overall amplitude of BOLD signals can't be taken as evidence for how discriminative the regions are for categories. This kind of meta-research is probably even more helpful than the results themselves.
Yes, I am left with a similar question. When looking at this activity, is the strength of the response due to the brain actively detecting the object, or is it possible that the activity reflects this part of the brain sending signals related to what it might do with the object now that it has already been detected? It seems there are two conclusions that could be drawn:

1. I see this region of the brain lighting up when I show people a particular category of objects; therefore, this region is responsible for visual processing of these types of objects.

2. I see this region of the brain lighting up when I show people a particular category of objects; therefore, the brain has already detected the object and is now producing a bunch of activity designed to react to it. In other words, this part of the brain takes what has already been seen and says, "Hey hands, think fast! There might be a hammer to grab in the near future."

Is there a way these types of experiments could tell the difference? To me this is an important distinction, because if we (computer vision people) want to reproduce the identification behavior but don't necessarily care about the reaction to the identification, we would like a way to distinguish between the two scenarios. I suppose another possibility is that the two are actually quite similar, and that objects are recognized based on what you might do with them.
Oh no, I wrote the whole thing and it got deleted... anyway, here it is again. Function-based neuroscience studies have already had an impact on CV, but what is exciting is that function-based organization seems like an efficient scheme, since all these objects ultimately have to lead to actions. In CV, no one has done any clustering of data based on functions yet, but that should be interesting; I have been thinking about it myself, so let's see. What is also interesting is the fact that not all manipulable objects are the same: tools are different from books. Does this suggest anything about primary and secondary affordances? Is it the case that when semantics are in congruence with functions, we organize the data based on functions?
For Aaron's question, I would say what is being claimed here is even stronger: the claim is that to understand an object you would even simulate the action in your mind. [This is one of the claims of embodied cognition and mirror neuron theory; GS should touch on it tomorrow.]
I think the findings demonstrated in the paper provide very good arguments for understanding objects/scenes from a functional perspective in computer vision. It seems we humans do consider functional properties when we recognize objects. It also provides justification for the vision paper "From 3D Geometry to Human Workspace" in this section. I don't know if it is fair to divide the space into "animal", "tool", "arbitrarily manipulated objects" and "non-manipulable objects"; to me, "tool" is a very specific and narrow category. Maybe humans would demonstrate the bias over more general categories.
I'm quite curious about the embodied cognition / action-for-understanding argument too, how it applies to computer vision, and what people in the class think. From my understanding of the results and from what I've heard, it seems that the brain might do this (but I'm on board with Aaron in saying that I don't think these specific results show it). The broader implications of Aaron's question for CV are very interesting: specifically, whether (a) it's necessary for intelligent systems or (b) it's merely more efficient. My two cents: irrespective of whether it's practical for finding cars, etc., it might be tremendously useful for a human-centric understanding (e.g., communicating with humans or making the sorts of errors that humans are OK with).
Embodied cognition is definitely a point of debate; there is evidence for and against the theory, which we will discuss. The authors acknowledge that object identification may be unimpaired even when object use is impaired, but what's interesting to see is the positive correlation between object-use performance and object-identification performance.
I read a book loosely related to this discussion thread: "Action in Perception" by Alva Noë. In my understanding, he argues that to identify an object is to know how the sensory input will transform given an egomotion. In other words, the answer to Aaron's question is always number 2. Trying to interpret this argument from a machine learning perspective: the brain has a lot of latent variables. These latent variables are formed for the purpose of regressing from inputs (the five senses) to outputs (action and language; keep consciousness out to avoid controversy), and they turn into action-related or object-identity-related variables as required. I feel we should try the same approach in computer vision: rather than defining our own categories, we should do an end-to-end regression and study what latent variables the system derives. Those will be the machine's categories.
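The "study the machine's latent variables" idea above can be sketched in a few lines. This is a toy illustration under invented assumptions (synthetic data, PCA standing in for a learned representation, a tiny k-means): we never give the model category labels, compress the inputs to a latent space, and then check whether clusters in that space recover the hidden object kinds.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: two hidden object kinds whose sensory features
# differ. The model never sees the labels; they are only used for scoring.
n = 200
labels = rng.integers(0, 2, n)            # ground-truth kinds (held out)
x = rng.normal(size=(n, 10))
x[:, 0] += labels * 4.0                   # the kinds differ along one axis

# Stand-in for an end-to-end learned representation: a 2-D latent code
# obtained by PCA (via SVD).
xc = x - x.mean(axis=0)
_, _, vt = np.linalg.svd(xc, full_matrices=False)
latent = xc @ vt[:2].T                    # the machine's latent variables

# Cluster the latent codes: the clusters are the "machine's categories".
def kmeans(z, k=2, iters=50):
    # deterministic init: extreme points along the first latent dimension
    centers = np.stack([z[np.argmin(z[:, 0])], z[np.argmax(z[:, 0])]])
    for _ in range(iters):
        assign = np.argmin(((z[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = z[assign == j].mean(axis=0)
    return assign

clusters = kmeans(latent)

# Do the discovered clusters recover the hidden kinds (up to label swap)?
agree = (clusters == labels).mean()
purity = max(agree, 1 - agree)
print(f"cluster/category agreement: {purity:.2f}")
```

The point of the sketch is only the workflow: categories fall out of the latent space rather than being defined up front.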
Functionality does not seem all that distinct from semantics, and can be viewed as a form of semantics: every object can be assigned some degree of manipulability. Although there are regions responsible for functionality, another question is how the brain bridges and combines the processing of semantics and functionality into a whole.
This paper is organized well and shows clear experiments. I was surprised by Fig 2: the left and right halves are actually biased in slightly different ways towards tools and arbitrarily manipulable objects. I did not expect such asymmetry for a basic motor response. Then again, I think this is due more to feedback than to feed-forward processing; the halves of the brain differ more at higher levels of information processing.

My final question: this paper seems to suggest that there is a similarity metric over "motor space". I do not see why this should mean anything for computer vision researchers; it does not say that this similarity metric has anything to do with visual processing. Couldn't one come up with an experiment to see if there is indeed a "visual link" at play here? What if I asked people to just think of the object rather than showing it to them? If I got a similar conclusion in terms of the specificity of the RS in the gyri, that wouldn't tell me anything about visual processing. Am I correct?
I guess this discussion was happening above as I was typing my post. Oh well.
Interesting point, Ishan: "the left and right halves actually are biased in slightly different ways towards tools and arbitrary manipulable objects. I did not expect such asymmetry for a basic motor response." I would hazard a guess that this is due to left- or right-handed subjects. Elissa can probably clarify this, and you should definitely raise it in class.
@Ishan: I think your question is slightly different: how does knowing that a distance metric in motor space is used help us build a better machine vision system? The distance metric in motor space is computed over features largely derived from visual stimuli, but via a different inductive bias. So even though we are computing distance(function(visual input)), the function() in this case has a hidden parameter B, the bias. Knowing that this bias relates to motor responses to objects guides us in designing our machine vision models.
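The distance(function(visual input)) point can be made concrete with a toy sketch. Everything here is invented for illustration (the attribute vectors, the attribute names, and the bias weights are not from the paper): the same inputs, passed through the same function() but with a different hidden bias B, yield a different similarity structure.

```python
import numpy as np

# Invented visual attribute vectors: [graspable, dark_color, rounded].
hammer = np.array([1.0, 1.0, 0.0])
cat    = np.array([0.0, 1.0, 0.2])
pencil = np.array([1.0, 0.0, 1.0])

def function(x, B):
    # The same visual input, re-weighted by the inductive bias B.
    return B * x

def distance(a, b):
    return float(np.linalg.norm(a - b))

B_appearance = np.array([0.1, 1.0, 1.0])  # bias toward how things look
B_motor      = np.array([1.0, 0.1, 0.1])  # bias toward how things are grasped

d_app_cat    = distance(function(hammer, B_appearance), function(cat, B_appearance))
d_app_pencil = distance(function(hammer, B_appearance), function(pencil, B_appearance))
d_mot_cat    = distance(function(hammer, B_motor), function(cat, B_motor))
d_mot_pencil = distance(function(hammer, B_motor), function(pencil, B_motor))

# With these invented attributes, the appearance bias puts the hammer nearer
# the cat, while the motor bias puts it nearer the pencil: same inputs,
# different metric, purely because of the hidden bias B.
print(d_app_cat < d_app_pencil, d_mot_pencil < d_mot_cat)   # True True
```

The design choice the sketch highlights: the bias lives inside the feature function, not inside the distance itself.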
Yes, I also think the asymmetry is due to left- and right-handed people.
Well, a lot of the observations I had have already been made, so here are a few additional things I was thinking about.

a. I find it very interesting that when a novel object is shown, a large region of neurons is activated, and under repetition activity apparently narrows down to what seems like a very specialized region. In terms of application to computer vision, this looks a lot like firing a whole bunch of detectors in parallel to determine what subcategory of object the brain is looking at, and then using the specialized (as in our case) mushroom detector.

b. I find the distinction between tools and generally manipulable objects interesting. For instance, I will generally only visualize one way of interacting with a barrel (rolling), yet it is considered an arbitrarily manipulated object, i.e. its use is not determined just by how I can visualize using it. So it's not just the fact that I can think about moving/manipulating an object; its use has to be completely defined by the action I would take.

As a side note, I found the language of the paper initially tough to follow because of the multiple references and its general start-stop nature. The final section that compiled all the data, however, felt more consistent.
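The coarse-to-fine idea in (a) can be sketched as a two-stage dispatch. The detectors below are hypothetical scoring stubs on toy attribute inputs (names and thresholds are invented), not a real vision pipeline: a bank of cheap coarse detectors all run, and only the specialized fine-grained detector for the winning subcategory is consulted.

```python
from typing import Callable, Dict

def classify(x: Dict[str, float],
             coarse: Dict[str, Callable],
             fine: Dict[str, Callable]) -> str:
    # Stage 1: all coarse detectors fire in parallel (conceptually).
    scores = {name: det(x) for name, det in coarse.items()}
    best = max(scores, key=scores.get)
    # Stage 2: only the specialized detector for the winner runs.
    return fine[best](x)

# Invented toy detectors over attribute dictionaries.
coarse = {
    "manipulable": lambda x: x.get("graspable", 0.0),
    "animal":      lambda x: x.get("animate", 0.0),
}
fine = {
    "manipulable": lambda x: "tool" if x.get("has_handle", 0) > 0.5 else "object",
    "animal":      lambda x: "pet" if x.get("furry", 0) > 0.5 else "wild",
}

# A hammer-like input: high graspability routes it to the fine-grained
# manipulable-object detector, which then checks for a handle.
print(classify({"graspable": 0.9, "animate": 0.1, "has_handle": 1.0},
               coarse, fine))   # tool
```

The broad-activation-then-narrowing observation maps onto stage 1 (everything fires) followed by stage 2 (one specialized detector does the fine work).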
Most of my thoughts were brought up in previous discussions, so I'll bring up the things that most concern me. Since they point out that manipulable objects (a book) are different from tools (a hammer), apart from suggesting that there might be primary and secondary affordances (as Prof. Gupta pointed out), it might also suggest evolution or adaptation to different categories. That is, since we manipulate a book differently (and very often) than we manipulate a hammer, we might have created a codebook of manipulations in our brain: we just use a shortcut and look up how to manipulate that particular thing, instead of computing and inferring the whole thing again and again. Just a thought.

To test this, we could show a very specialized tool (e.g., one used by surgeons, architects, or construction workers) to a professional from that field and to a layperson who has never used it or hasn't seen it often, and see if the neural responses are similar on the first try. If so, it might suggest we inherently have different regions for tools and manipulable objects; if not, it might suggest that we are just adapting by doing a lookup to avoid doing the same work again and again. Again, just a thought.
This is an interesting point. In conjunction with Shaurya's first thought, the codebook of manipulations/actions could help us narrow a new object down to one of the previously known visual semantic categories.
A lot of points have already been covered in the discussion, but I would like to add one more: do we see an object as a whole for its functionality, or do we see its smaller components and associate functionality with them? Suppose we see a new object and don't know how to use it at all; by looking at its local features we could still map them to features of already-seen objects and predict its usage. For example, if an object has a sharp end it can be used for digging, or if it has a hollow area it can be used for storing things.
A relation between functionality and visual appearance could exist. Even without looking, simply by holding and manipulating an object, we try to associate its physical properties with its shape and visual appearance.
I found this paper more confusing than the ones we've read so far, mostly because of the heavy use of biological jargon and my lack of knowledge thereof. At first I had dismissed RS as not being anything helpful, but as Prof. Gupta said, these signals actually indicate finer processing of the entities, thus surfacing hidden relations. It was interesting to see how the data can be interpreted to gain more knowledge of the system.

Triggered by what Ishan said above, I think computer vision researchers can possibly see this as an example of 'priming' of objects: attaching certain metadata to them after seeing them in different poses or 'employments' in an image, which can be taken as a parallel to the 'motor space' that the paper talks about for the brain. Thinking about this concept also compels me to bring in image context as one of the features, though I don't know how that would correlate. Just a random thought.
It is a good point that the recognition mechanism for very familiar objects (such as faces) can be quite different from that for unfamiliar or even unseen objects (whose functionality is unknown). As a specific instance, I noticed that many papers have conjectured that familiar faces are generally recognized with feature-based or structure-based representations, while unfamiliar faces are represented as pictorial patterns.