I think one interesting question worth considering, irrespective of the experimental content, is the idea from the intro: "The gist of this proposal is that a partially analyzed version of the input image (i.e., a blurred image) .... is projected rapidly from early visual areas directly to the prefrontal cortex ... This coarse representation is subsequently used to activate predictions about the most likely interpretations of the input image in recognition-related regions within the temporal cortex. Combining this top-down ‘‘initial guess’’ with the bottom-up systematic analysis facilitates recognition by substantially __limiting__ the number of object representations that need to be considered (Fig. 1)."

So, irrespective of time constraints (remember: things that were impossible a decade ago are now real-time), is there merit in this idea for computer vision? In particular, I think the term "limit" here is key: this isn't some huge graphical model where you look for zebras, policemen, and whether you're on 42nd Street in Manhattan, and then turn the crank on inference. That analyze-all-the-evidence approach, if you can factorize the model properly and get enough data, seems like common sense (although it might not help, or might require the right representation). Instead, I see this as not running the zebra detectors at all, based on the gist of the scene. Will this help? I'm curious where people stand on this.
I think there are two interesting points based on what you brought up. I'm going to try to rephrase (sorry if I interpret this wrong): should I use some quick computations to decide (1) my task (what things to recognize) and (2) the representation I should use (features/attributes to pick out, scales to look at, etc.), based on the coarse information we get? In that case, top-down information would mean using quick, coarse information to answer these two questions. It seems like a great way to go, especially in the shorter term, since there are an infinite number of ways we could manipulate the image, and we have shown in class that those manipulations are very dependent on the task. I think the most benefit comes from knowing how to manipulate the data before throwing it into the detector (specifically, where to look and which features will be useful for distinguishing between the things we need to recognize). I'm not sure this will ever be beneficial for the most popular vision tasks right now, because the benefits are greatest when you have a time constraint and need to do a large variety of tasks (I think this has the most applications in robotics, for instance).
Yes -- but I was thinking that even in the infinite-time, finite-data case it might be useful. For instance, in the fine-grained literature, if I recall correctly, they're trying to solve P(collie | dog), so you could think of a fine-grained detector as working, to some degree, as P(dog) P(collie | dog). Similarly, in the propose-then-classify detection method, the object detector doesn't have to separate cats from all negative windows, but only from a few. My impression was that this sort of top-down filtering might be useful in that it factors things properly: I don't need to learn the relationship between polar bears and policemen, because if I had to learn all N^2 object-object relationships, I'd probably overfit terribly.
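The factorization above can be sketched as a toy gated classifier. Everything here (the "furry" feature, the breed scorers, the threshold) is invented for illustration; the point is only that the fine-grained stage never runs, and never needs training data, outside the coarse gate:

```python
import random

random.seed(0)

# Hypothetical coarse gate: a stand-in for P(dog | image).
def coarse_dog_score(x):
    return 1.0 if x["furry"] else 0.0

FINE_BREEDS = ["collie", "beagle", "pug"]

# Hypothetical fine stage: a stand-in for P(breed | dog, image).
def fine_breed_scores(x):
    return {b: random.random() for b in FINE_BREEDS}

def classify(x):
    # Factorized inference: P(breed) ~= P(dog) * P(breed | dog).
    # If the gate says "not a dog", the breed classifiers never run,
    # so they never have to model policemen, zebras, etc.
    p_dog = coarse_dog_score(x)
    if p_dog == 0.0:
        return "not-a-dog"
    scores = fine_breed_scores(x)
    return max(scores, key=lambda b: p_dog * scores[b])

print(classify({"furry": True}))   # some breed name
print(classify({"furry": False}))  # "not-a-dog"
```

The same shape covers the propose-then-classify case: the proposal stage is the gate, and the classifier only sees the surviving windows.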
Given a time constraint, I guess your learning algorithm might form a different set of rules compared to the infinite-time case. Then again, there's the argument that we should get the inference right in the first place, and then optimize for time.
Yeah, of course, and we're all bound by what's feasible, even those of us whose code takes minutes to run per image. I think the answer to the time question we discussed last class is basically driven by what problem people are working on. But what I was getting at was: is limiting (in a yes/no sense) your interpretations via top-down knowledge useful for learning in the first place, irrespective of any time gains? Note that this is, I think, different from the question from last class (is ignoring parts of the scene useful).
I think it's important to examine not only limited-time vs. infinite-time, but also what happens to the time cost (in an algorithmic sense) as you start to change these variables. Since nobody really knows what the algorithm is, this is kind of hard, but my suspicion is that as the number of objects you are able to recognize increases, very bad things happen in terms of complexity unless you aggressively prune the space you're considering. For simplicity's sake, let's say all of our mid-level features are binary. That means that if we're considering all of them at once, we have 2^N possible categories to distinguish between, the overwhelming majority of which are utter nonsense ("overweight bicycles with orange hair, blue eyes, long whiskers, good rhythm, ..." is a point in this space). If, on the other hand, we can use an initial guess to select a much smaller subset of N, we save the kind of time that can only be considered an optimization if you consider the heat death of the universe to be a mere practicality. It's the difference between 2^(small constant) and 2^(everything). So I guess what I'm saying is that if you don't take this into account, it's possible you'll never be able to solve the problem, regardless of how long you give Moore's law to catch up, because of the algorithmic implications inherent in the problem. But I freely admit this is total speculation on my part.

As for whether or not this effect matters qualitatively, I think there's an argument that it may still be important even if you are given infinite time. It goes back to the sparrow differentiation example: there may be some specific feature that lets you easily distinguish between two species of sparrow (wing-tip color, for example), but you're not likely to use that feature on anything other than a bird. If you try to compute the value of that feature on something without wings, you're only going to confuse yourself.
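The 2^N vs. 2^(small constant) arithmetic can be made concrete with a toy count; the sizes here are made up purely for illustration:

```python
# Rough arithmetic for the pruning argument: N binary mid-level features
# considered jointly give 2**N joint hypotheses; a gist that gates us down
# to k relevant features collapses the space to 2**k.
N, k = 50, 8                       # illustrative sizes, not from the paper
full_space = 2 ** N                # includes all the nonsense combinations
pruned_space = 2 ** k              # only gist-compatible combinations
print(full_space // pruned_space)  # factor saved: 2**42
```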
If there's noise in your system, you might make the argument that the more extraneous features you test, the more likely you are to pick up some bad signal that throws off your classification. Somebody with a better understanding of statistics would have to tell me whether that's a reasonable argument, though.
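It does seem reasonable, at least in a toy simulation I can sketch (my own construction, not from the paper): a nearest-centroid classifier trained on few examples, where two dimensions carry real signal and the rest are pure noise. Adding irrelevant dimensions degrades accuracy because the estimated centroids pick up spurious differences on the noise dimensions:

```python
import random

random.seed(1)

# Two classes separated on 2 informative dimensions; n_noise extra
# dimensions are pure N(0, 1) noise in both classes.
def sample(label, n_noise):
    informative = [random.gauss(1.0 if label else -1.0, 1.0) for _ in range(2)]
    return informative + [random.gauss(0.0, 1.0) for _ in range(n_noise)]

def accuracy(n_noise, n_train=10, n_test=200):
    train = [(sample(l, n_noise), l) for l in (0, 1) for _ in range(n_train)]
    dim = 2 + n_noise
    # Estimate one centroid per class from the small training set.
    cents = []
    for l in (0, 1):
        pts = [x for x, lab in train if lab == l]
        cents.append([sum(p[d] for p in pts) / len(pts) for d in range(dim)])
    correct = 0
    for l in (0, 1):
        for _ in range(n_test):
            x = sample(l, n_noise)
            dists = [sum((x[d] - c[d]) ** 2 for d in range(dim)) for c in cents]
            correct += int(dists.index(min(dists)) == l)
    return correct / (2 * n_test)

acc_clean = accuracy(n_noise=0)
acc_noisy = accuracy(n_noise=400)
print(acc_clean, acc_noisy)  # the noisy version does noticeably worse
```

With finite training data, every extraneous dimension adds estimation noise to the decision; with enough of them, that noise swamps the two informative dimensions, which is exactly the wing-tip-color-on-a-walrus situation.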
Another way to ask what David said -- do we gain anything, apart from time, by limiting our interpretations via top-down knowledge? Also, how much and when is it bad to limit our interpretations?

David: correct me if I understood it incorrectly!
This reminds me of a recent paper: "Peng Zhang, Jiuling Wang, Ali Farhadi, Martial Hebert, Devi Parikh. Alert: Predicting Failures. CVPR 2014". It might be an example of this usage in computer vision. Here the gain is not only time. Instead, general top-down knowledge is introduced in the first place to predict the likely accuracy of any computer vision system on a given input instance, across applications such as semantic segmentation, vanishing point and camera parameter estimation, image memorability prediction, and attribute detection. In this way, the top-down and bottom-up processing mechanisms are complementary: top-down processing provides an initial coarse decision, while bottom-up processing gives a more precise one.
I think that a top-down initial guess leads to significant improvement in recognition accuracy. This is hard to see in human vision because our bottom-up processes are already very good. It is, however, easily seen in computer vision systems. For example:

- By discarding exemplars that are not suitable for a given image, we do better than using all the exemplars.
- By discarding windows that may not contain objects (bottom-up region proposal methods), we also significantly reduce the number of false positives.
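The window-pruning case can be sketched in a few lines. Everything here is made up (the "edges" score standing in for a cheap objectness measure, the expensive detector as an oracle); the point is just the two-stage structure:

```python
# Hedged sketch of pruning: a cheap objectness score filters candidate
# windows before an expensive detector ever runs on them.

def cheap_objectness(window):
    # stand-in for a fast bottom-up proposal score (e.g. edge density)
    return window["edges"]

def expensive_detector(window):
    # stand-in for the slow, accurate classifier
    return window["true_object"]

windows = [
    {"edges": 0.9,  "true_object": True},
    {"edges": 0.8,  "true_object": False},
    {"edges": 0.1,  "true_object": False},   # pruned: never reaches the detector
    {"edges": 0.05, "true_object": False},   # pruned
]

# Keep only promising candidates, then run the costly stage on survivors.
survivors = [w for w in windows if cheap_objectness(w) > 0.5]
detections = [w for w in survivors if expensive_detector(w)]

print(len(survivors), len(detections))  # 2 survivors, 1 detection
```

The false-positive benefit falls out of the same structure: low-scoring windows can no longer produce false detections because the detector never sees them.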
This paper suggests a model for top-down processing in object recognition and details the experiments performed to verify it. The time dimension is added to the experiments, and I find this the most interesting aspect of the paper. Something I have been wondering about is what information should be sent to the higher areas to be processed. Intuitively, we need something that is cheap to compute and communicate, but strongly indicative of what the object being viewed is. In this paper it is suggested that this information is a blurred/low-spatial-frequency version of the image. I wonder if high-spatial-frequency features (and other things) might be sent over as well. My reason for raising this is that if we just take the blurred image, then the top-down pathway would perform very badly on objects which are almost entirely obscured. However, in many cases humans are able to mentally hallucinate to fill in the gaps, even when the overall shape of the object is not visible (but a few distinctive parts of it are). This suggests to me that some top-down processing is happening that relies on high-frequency details.
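For concreteness, a low-spatial-frequency representation can be approximated by block-average downsampling; this toy version (my own sketch, not the paper's method) shows how it keeps coarse layout while throwing away exactly the distinctive-part detail the comment above worries about:

```python
# Crude LSF representation: average non-overlapping block x block patches
# of a 2-D list. High-frequency detail is lost; coarse layout survives.

def lsf(image, block):
    h, w = len(image), len(image[0])
    out = []
    for i in range(0, h, block):
        row = []
        for j in range(0, w, block):
            patch = [image[i + di][j + dj]
                     for di in range(block) for dj in range(block)]
            row.append(sum(patch) / len(patch))
        out.append(row)
    return out

# A 4x4 "image" with a bright top-left quadrant: the 2x2 gist keeps where
# the bright region is, but not any pixel-level structure inside it.
img = [[1, 1, 0, 0],
       [1, 1, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]]
print(lsf(img, 2))  # [[1.0, 0.0], [0.0, 0.0]]
```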
I meant to reply to this comment, actually. I guess high-spatial-frequency info requires more detailed processing and a longer time (presumably early visual areas --> ventral regions, serial processing) than simply relying on low-spatial-frequency info to make predictions. Maybe I am heavily influenced by Jeff Hawkins, who argues that the brain is a prediction machine, so Bar's argument for a shortcut from low-spatial-frequency info to frontal regions makes perfect sense to me. Here in the vision domain, making predictions/guesses most likely means inferring from insufficient image structure. It actually mimics real-world cases where we see blurred images of objects/scenes from a distance and constantly update our hypothesis of what it is (maybe wrong at the beginning) as we approach it.
I think you are right to bring up the importance of finding those cheap and highly indicative early guesses that let you prune your space very well. I think a lot of these claims hinge on these intermediate models and the ability to learn them.

It seems that in the same way that certain attributes only apply to some object categories, certain low-level features only ever exist in the presence of some objects. For example, certain textures strongly indicate cloth. Seeing that texture might fire up an initial guess ("hey, there's cloth here") that is then verified by using a model of cloth to predict what else might strengthen or weaken the postulated guess.
For your example of mentally hallucinating to fill in the gaps even when the overall shape of the object is not visible, I can think of two explanations: 1) There could be other regions of the brain (apart from the OFC) responsible for top-down processing based on other features (HSF). Considering the evidence provided in this paper, the top-down activity observed in the OFC is independent of the HSF features. I am not sure whether there are other regions in the brain that show evidence of top-down processing for object recognition, but this would essentially mean that later in the visual stream, when some form of high-spatial-frequency information is extracted, it is used for top-down processing -- a complex architecture with a multi-layered top-down processing model, where every level of the visual stream independently provides information to prune the search space in some sense. 2) Hallucination could be a bottom-up process. In cases where low-spatial-frequency top-down information is not useful for recognition, the complex high-spatial-frequency information alone (in a bottom-up manner) might help us predict the object.
This may be a bit redundant, but couldn't we just model this as a form of Bayesian inference? The top-down connections simply give a prior distribution that the brain uses to infer the distribution of possible perceptions of the image. Bayesian inference may extend to many different parts of the brain (motor function, decision-making, etc.). There have been whole books written on this: http://mitpress.mit.edu/books/bayesian-brain
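In that reading, the gist supplies the prior and the bottom-up features supply the likelihood. A minimal numerical version (all numbers invented for illustration):

```python
# Posterior = normalized (prior * likelihood), over a discrete set of
# object hypotheses.

def posterior(prior, likelihood):
    unnorm = {k: prior[k] * likelihood[k] for k in prior}
    z = sum(unnorm.values())
    return {k: v / z for k, v in unnorm.items()}

# The gist says "kitchen", so kitchen objects get most of the prior mass.
prior = {"toaster": 0.45, "kettle": 0.45, "zebra": 0.10}
# Bottom-up evidence is ambiguous between kettle-like and zebra-like shapes.
likelihood = {"toaster": 0.1, "kettle": 0.5, "zebra": 0.5}

post = posterior(prior, likelihood)
print(max(post, key=post.get))  # the prior breaks the tie: "kettle"
```

This is also one way to phrase the "how bad is it to limit interpretations" question: a hard top-down shortlist is a prior with zeros in it, which no amount of bottom-up evidence can overturn, whereas a soft prior only reweights.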
It seems people have used 'context' to study top-down information. Is there some other approach we could use to understand this?
In the paper "Malisiewicz, Beyond Categories: The Visual Memex Model for Reasoning About Object Relationships," they used exemplars to pass top-down information by building a memex. Is this the right way to approach it? I feel it is too specific, and generalization and consistency are always an issue.
Another way to think of top-down information is looking for a large thing rather than piecing it together from little fragments bottom-up. For instance, in layout estimation, a more top-down way is to search for a whole room or cuboid in one go (with a low-dimensional model). A bottom-up approach would be to try to find the planar pieces and then assemble a cuboid from them.
Also, how exactly do we define 'context'?
It's not addressed as much here, but I wonder what happens when we start to think of this happening in multiple stages. You might imagine a process where the texture indication of cloth causes your brain to hallucinate different types of cloth, find the one that matches best, then use that type of cloth to hallucinate known objects made of that kind of cloth, and so on. In this way, every feature/attribute/object has some implication which suggests a set of other things that may or may not exist. Each of these other things has a model that can be used to verify that suggestion. Once verified, that object has its own implications that suggest objects even further up the chain.
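The staged hallucinate-then-verify loop might be sketched as a frontier expansion; the suggestion rules and the "verify" oracle here are entirely hypothetical:

```python
# Each confirmed item suggests further hypotheses, which are checked
# against the observed evidence before being confirmed themselves.

SUGGESTS = {
    "cloth-texture": ["curtain", "sofa"],
    "sofa": ["living-room"],
    "curtain": ["window"],
}

def verify(hypothesis, evidence):
    # stand-in for a model-based check of the hypothesis against the image
    return hypothesis in evidence

def expand(seed, evidence):
    confirmed, frontier = set(), [seed]
    while frontier:
        item = frontier.pop()
        if item in confirmed:
            continue
        confirmed.add(item)
        for hyp in SUGGESTS.get(item, []):
            if verify(hyp, evidence):
                frontier.append(hyp)
    return confirmed

evidence = {"cloth-texture", "sofa", "living-room"}
print(sorted(expand("cloth-texture", evidence)))
# ['cloth-texture', 'living-room', 'sofa']  ("curtain" fails verification)
```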
I think one question then will be how to structure the stages, and which stage to start at. For different images (different lighting/viewpoint/etc.) of the same object, it might not be possible to use the same "chain". For example, if we see a sofa from far away, it might be easier to tell that it is a sofa than that there is cloth. On the other hand, in a close-up image it might be easier to tell that it is made of cloth than that it is a sofa.
I find the OFC's role in vision fascinating, as it is most probably a survival mechanism that lets humans react fast based on incomplete information. Apart from providing a prior for object recognition, the OFC might be the brain region that directs our attention to things in the environment for closer inspection. I believe this would be a good thing for a robot, which could be primed for a task and quickly recognize areas of interest in an image.
Have you heard of Poggio's Bayesian model of Visual Attention? It focuses on a saliency model from the PFC, but it is similar to what you describe: http://cbcl.mit.edu/cbcl/publications/ps/Chikkerur_Serre_Tan_Poggio_VisRsrchApril2010.pdf
The authors describe top-down information flowing from the OFC to the temporal cortex as object proposals. But our readings from the first few weeks suggest the signals in our temporal cortex form a continuous semantic space. In this context, I feel we should interpret the top-down information from the OFC to the temporal cortex as a scaling matrix along the principal components of the semantic space. This mechanism is possibly less prone to errors -- pruning too much too early will reduce recall. It is also more plausible: it might be much easier to regress the scaling matrix from LSF than to generate object identity guesses.
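A toy version of this soft-scaling reading (my own sketch, with invented semantic axes standing in for principal components): instead of a hard shortlist, the top-down signal reweights coordinates of the semantic space, damping unlikely directions without zeroing any hypothesis out:

```python
# Object embeddings along two hypothetical semantic axes:
# (animacy, indoor-ness). Values are made up for illustration.
objects = {
    "dog":    (0.9, 0.2),
    "sofa":   (0.1, 0.9),
    "kettle": (0.0, 0.8),
}

# LSF gist of an indoor scene: damp the animacy axis, keep indoor-ness.
# This plays the role of a diagonal scaling matrix on the semantic space.
scaling = (0.2, 1.0)

def top_down_score(emb, scale):
    # scale each coordinate, then take the norm of the scaled vector
    return sum((s * e) ** 2 for s, e in zip(scale, emb)) ** 0.5

scores = {name: top_down_score(emb, scaling) for name, emb in objects.items()}
print(max(scores, key=scores.get))  # "sofa": the indoor axis now dominates
```

Note that "dog" still has a nonzero score, so strong bottom-up evidence could in principle recover it; a proposal shortlist that dropped "dog" could not.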
One thing I found mildly amusing was that, for all their claims of top-down processing being beneficial for processing more complex scenes, in the experiments they at best used grayscale images of objects without a background. Wouldn't it have been a more compelling story had they used natural scenes and colour?
One takeaway from today's class: the top-down information is sent all the way to the prefrontal cortex (OFC) because from there it can be transmitted for decision making not just to visual areas but to other areas of the brain, such as auditory, speech, and motor regions.