I really liked the Kveraga paper: the fact that the network was being brought online early was quite interesting, and I think pretty convincing in showing that differences in activations for objects with strong context aren't due to post-processing. The question (which the authors acknowledge in the discussion) is: what does this mean? Is it just that it's brought online as quickly as possible (you can imagine from an engineering standpoint that you might want to do this)? Of course, a paper has to pick a reasonably sized problem, and so it makes sense that they don't want to do a really complex experiment to answer all the possible questions.
The one question that's always bugged me is: is contextual processing a generate-and-then-post-process method (i.e., give me all your detections, and I'll sort out which are ok), or something that makes hard decisions along the way? The latter's more efficient, and might be ok if you have good classifiers, but if you have bad classifiers (as we do in vision), you never want to make hard decisions. This seems like the dichotomy discussed in the discussion section, but I'm not positive.
I think it is the latter in the human brain (a good classifier). I think there is some pre-processing happening in our brain which filters proposals according to the surroundings and timing (something like a tree). For example, if we are in a movie theater, we know the other person is constrained to a few particular activities. If (s)he is just looking at the screen, we will infer that the person is watching the movie. But if his hands are constantly going towards his mouth (chewing or making sounds), we will infer that he is eating popcorn while watching the movie. Or if there is a straw in his mouth, we will probably say that he is having a soft drink. Somehow there is a limited set of actions and objects which can co-occur in a particular situation. But we probably have so many cues that we can make a very reasonable argument.
The hard-decision method does make a lot more sense than the other one, even for the human brain. For me one of the most interesting parts of the paper was: "The brain constantly generates predictions about what it is about to perceive next (72), and activating contextual associations early during perception allows the brain to focus the gaze, attention, and action where it is most likely to yield biologically important information."
This makes me agree with what Aayush said, that there is a domain-slicing process going on in the background, depending on where we have come so far and what the next cue is. There are ample examples from daily life that support this. Say one is playing a first-person shooter like House of the Dead. Every time the character faces a new "Boss", patterns learnt from previous levels make the player realize that this is the final stage and search for the opponent's weak points in particular areas first, hitting and missing or confirming the hypothesis. But there are always alternate plans available, which prevent a complete meltdown and help steer the gamer towards a new strategy for a new 'Boss' who maybe doesn't conform to what was seen earlier. This makes me believe more in a 'tree-pruning' or A*-like mechanism.
I'd like to chip in here. The idea that the brain makes hard decisions is less appealing to me. I say this because, as we saw in earlier papers, objects have strong and weak contextual associations. Additionally, co-occurrence of objects does help disambiguate the context. It seems to me that the underlying generative process updates its belief about the current state based on new visual information. Aayush's examples are also applicable here.
It seems to me like this is not a choice of one or the other, but rather the choice of how to balance the two. We can consider multiple detections, but then prune the possibilities a lot and it becomes equivalent to making hard decisions.
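To make the "balance" point concrete, here is a toy sketch, not anything from the paper. The detector scores, the context prior, and the helper names `prune` and `rescore` are all made up for illustration. Keeping only the top hypothesis before context is applied behaves like a hard decision; keeping everything is the generate-then-post-process extreme.

```python
def prune(detections, keep):
    """Keep the `keep` highest-scoring (label, score) hypotheses."""
    ranked = sorted(detections, key=lambda d: d[1], reverse=True)
    return ranked[:keep]


def rescore(detections, context_prior):
    """Post-process step: scale each score by a context prior (1.0 if unknown)."""
    return [(label, score * context_prior.get(label, 1.0))
            for label, score in detections]


# Hypothetical detector outputs for one ambiguous object seen in a workshop.
detections = [("hairdryer", 0.50), ("drill", 0.45), ("gun", 0.05)]
# Hypothetical context prior for a workshop scene.
workshop_prior = {"hairdryer": 0.2, "drill": 0.9, "gun": 0.1}

# keep=1: a hard decision is made before context gets a say.
hard = rescore(prune(detections, keep=1), workshop_prior)
# keep=all: every hypothesis survives until context has rescored it.
soft = rescore(prune(detections, keep=len(detections)), workshop_prior)

best_hard = max(hard, key=lambda d: d[1])[0]
best_soft = max(soft, key=lambda d: d[1])[0]
print(best_hard, best_soft)
```

With these invented numbers the hard path is stuck with "hairdryer" while the soft path ends up on "drill"; values of `keep` in between trade efficiency against that risk.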
So, in my honest opinion, I don't know if we can say exactly what kind of context is being processed here; what we strictly get from the paper is that context is a bottom-up as well as a top-down process; the top-down part just comes later. We don't know exactly what kind of context this is; it appears to be semantic; but the bottom-up context may be distinct from the top-down context. I think more investigation is needed.
So, in preparing my presentation, I came across this paper https://www.princeton.edu/~napl/pdf/BeckKastner2008.pdf that seems to tie together all the things we've been discussing over the past few days regarding context and attention. The general idea they present (which to me seems very appealing) is this: most of the brain regions that deal with vision are competitive. This implies that we can dynamically change gains, and essentially bias the competition, so that if we are, say, looking for faces in a crowd, face-detection areas get pre-activated and other object representations, say for golf balls, get suppressed. From a bottom-up perspective, this competition can also be biased by things like saliency or contrast, which is what we saw in the previous paper. Finally, they say that dominance of an object in one region (say lower visual areas) gets transferred to other regions (like the higher-order frontal areas) by virtue of the system being integrated. This is where this paper very neatly fits in.
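A very rough way to picture the gain idea is a divisive normalisation in which a top-down gain tilts the competition. This is only a toy sketch under assumed numbers, not the model from the Beck and Kastner paper; the function name `compete` and the gain values are hypothetical.

```python
def compete(drive, gain):
    """Each unit's share of the total gain-weighted drive (shares sum to 1)."""
    weighted = {name: drive[name] * gain.get(name, 1.0) for name in drive}
    total = sum(weighted.values())
    return {name: value / total for name, value in weighted.items()}


# Equal bottom-up drive for two object representations (made-up values).
drive = {"face": 1.0, "golf_ball": 1.0}

neutral = compete(drive, gain={})                    # no top-down bias
looking_for_faces = compete(drive, gain={"face": 3.0})  # assumed bias toward faces

print(neutral)
print(looking_for_faces)
```

Because the shares are normalised, boosting the face gain necessarily suppresses the golf-ball response, which is the flavour of competition described above. A bottom-up bias such as saliency could be modelled the same way by changing `drive` instead of `gain`.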
That idea of competition is certainly an interesting one. David's question about hard decisions is interesting too. It seems that there are nice things about making hard decisions, but some problems too. Maybe another way to think about it is resource allocation. I have a limited amount of processing I can do in a certain time frame. Based on context/importance weights I'm going to apportion those resources differently.
From the paper we can only say that context is extracted early enough to help us in different recognition tasks. But I feel that context might give us a priority order over the object detectors to try, by assigning each a score. Then we might run the object detectors with the highest priorities, and give feedback to the contextual processing unit about which objects are definitely present. Context would then generate a new priority order of detectors by incorporating this new information. And this iteration goes on until we understand the scene in detail.
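The loop described above can be sketched in a few lines. Everything here is an assumption for illustration: the scene prior, the co-occurrence table, the scoring rule (prior plus support from confirmed objects), and the stand-in `detect` callback that replaces a real detector.

```python
def priority(label, prior, confirmed, cooccur):
    """Context score: scene prior plus support from objects already confirmed."""
    support = sum(cooccur.get(seen, {}).get(label, 0.0) for seen in confirmed)
    return prior[label] + support


def parse_scene(prior, cooccur, detect):
    """Repeatedly run the highest-priority detector and feed the result back."""
    remaining = set(prior)
    confirmed, tried = [], []
    while remaining:
        # Re-rank on every pass so newly confirmed objects change the order.
        best = max(sorted(remaining),
                   key=lambda label: priority(label, prior, confirmed, cooccur))
        remaining.remove(best)
        tried.append(best)
        if detect(best):  # stand-in for running a real object detector
            confirmed.append(best)
    return tried, confirmed


# Hypothetical numbers: an initial scene prior and one co-occurrence link.
prior = {"stove": 0.6, "surfboard": 0.5, "towel": 0.3, "pan": 0.2}
cooccur = {"stove": {"pan": 0.9}}
in_image = {"stove", "pan"}

tried, confirmed = parse_scene(prior, cooccur, detect=lambda label: label in in_image)
print(tried)      # order in which detectors were run
print(confirmed)  # objects the detectors confirmed
```

In this toy run the pan detector starts with the lowest priority but jumps to second place once the stove is confirmed, which is the feedback effect the comment is after.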
I agree with Krishna. It might be that the out-of-context object is being detected but the number of neural spikes for it is very small. In other words, the brain is using less energy to process objects that are unlikely to be found, but is still processing them. In the event that there is enough image evidence for the out-of-context object, it might override any contextual rules acting against it. This might also explain why the response time is the same for both strong-context objects and weak-context objects: there is enough image evidence for their recognition to go through equally fast. I feel that more interesting results could be obtained using challenging stimuli where enough image evidence is not available.
These two papers give a good overview of the role of context in neuroscience:
1. Bar, Moshe. "Visual objects in context." Nature Reviews Neuroscience 5.8 (2004): 617-629.
2. Bar, Moshe, and Elissa Aminoff. "Cortical analysis of visual context." Neuron 38.2 (2003): 347-358.
I like this paper for its clear writing style. The authors are very careful to state only those results that are supported by their experiments. I also read Elissa's other paper, "Context based memory distortion". To sum it up in one line, it states that one can falsely claim to remember seeing an object A in cases where context(object A) == context(object B) and we have seen object B. With both these papers in mind, I think context is not a "final filter" for object detection. The process is more "cyclical". In fact, context can generate some object proposals as well. If this were not the case, we would not have the false memories. I think this also fits nicely with the assigned paper. Since context seems to be activated early, it might be the starting point for object proposals. This should be verifiable if we have instances where we find objects with strong context much easier to recognize than those with no context (in which case we have no "prior").
This also ties in with what we were talking about last class, where if something is contextually relevant, we might be more accepting of false positives.
I may have missed this in the paper, but is there a way to know from these results whether similar circuits are activated before and after generating context? I'm specifically interested in whether the process is truly "cyclical" for recognition, or whether the circuits simply integrate contextual information as it comes. It seems like if we can recognize objects with weak context, the circuits would have to be slow enough that later recognition stages are in essence "waiting" for context, and proceed without it in its absence. In other words, if we were to build a similar system for computer vision, should it simply send some quickly computed information off to a context database, continue computing full features, and integrate the information from the context database as it goes (cutting off computations for poor object categories; computing more features to distinguish more promising categories)? This is probably too 'messy' to do in practice, but it's more how I imagine the brain doing things.
I think it's not very messy if we program using a connectionist computing architecture as opposed to a Von Neumann architecture. If each node in the connected network signals true by sending a spike, and sends more spikes if it is more confident, then these effects can be modeled simply by being more responsive to individual spikes, rather than waiting for many of them to arrive, based on a lookup in the context database (maybe using an autoassociative memory module).
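As a toy illustration of "being more responsive to individual spikes", here is a simple spike-count accumulator whose threshold is lowered by contextual support. This is only a sketch under stated assumptions: it is not a real spiking network or an autoassociative memory, and the spike train, the base threshold, and the linear threshold rule are all made up.

```python
def spikes_needed(base_threshold, context_support):
    """Fewer spikes are required when context supports the object.

    `context_support` is assumed to lie in [0, 1]; at least one spike is
    always required.
    """
    return max(1, round(base_threshold * (1.0 - context_support)))


def time_to_decide(spike_train, threshold):
    """First time step at which the cumulative spike count reaches threshold."""
    count = 0
    for t, spike in enumerate(spike_train):
        count += spike
        if count >= threshold:
            return t
    return None  # not enough evidence arrived


# Hypothetical spike train from one object node (1 = spike, 0 = silence).
spike_train = [1, 0, 1, 1, 0, 1, 1, 1]

no_context = time_to_decide(spike_train, spikes_needed(4, context_support=0.0))
with_context = time_to_decide(spike_train, spikes_needed(4, context_support=0.5))
print(no_context, with_context)
```

With the same input, the node supported by context commits earlier, and a node with no support still gets there if enough evidence arrives, so nothing has to "wait" for the context lookup.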
Perhaps this (cycle) is actually an iterative process. It would be interesting to see what happens on a longer time scale. After the top-down effect, do we get another bottom-up phase locking? Perhaps bottom-up and top-down forces actually work together to converge towards some type of optimum?
This paper is interesting in the light of the last two classes on top-down processing and attention. The early extraction of context shown in this paper, with some evidence for it in the medial prefrontal cortex (MPFC), suggests connections with top-down processing (Bar's paper showed that top-down processing was in the OFC, which also lies in the prefrontal cortex). This paper also hints at the connection between context and attention discussed in the last class. Early extraction of context could be generating bottom-up attention, which could be linked to memory and to important contextual cues learned through evolution.
By the way, those of you who do or apply vision for a living should probably read Autocontext, even if not for tomorrow. This is a really, really neat idea.
You can also see it in action here (Guo and Hoiem, ECCV 2012): http://web.engr.illinois.edu/~guo29/files/eccv12_surface.pdf
And if you like it, you should also read recent grad Dan Munoz's thesis, which also has a bit of this flavor: http://www.cs.cmu.edu/~dmunoz/munoz_thesis_13.pdf
I second David's assertion. Intuitively, this looks like a much simpler/faster idea than using CRFs/MRFs to model context, and it appears to be just as effective.
Yeah, this is really cool stuff. Thanks David.