Paper for reading: Interacting Roles of Attention and Visual Salience in V4, Reynolds 2003
The second paper "Maunsell 2006, Feature-based attention in visual cortex" is also a good read. It is a review paper.
I am not sure if responses from a total of 80 neurons across both monkeys are sufficient for this experiment. Is it that they are considering only neurons with a particular receptive field? But if that is the case, why is it that only 50 of the 80 neurons responded to the reference and probe stimuli? Any thoughts?
Yes, they are considering neurons specifically for a particular receptive field (RF). All the neurons respond to the reference and probe stimuli, but only 50 of the 80 gave significantly different responses to the two (selectivity: the reference becomes the preferred stimulus and the probe becomes the non-preferred stimulus). They needed this because they wanted to study attention when a preferred and a non-preferred stimulus are present in the same RF for the neuron population.
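To make the "significantly different responses" criterion above concrete, here is a minimal sketch of how one could flag a neuron as selective. It assumes a simple permutation test on per-trial firing rates; the paper's actual statistical test may differ, and the function name, parameters, and example rates are all hypothetical.

```python
import random
from statistics import mean

def is_selective(ref_rates, probe_rates, n_perm=2000, alpha=0.05, seed=0):
    """Permutation test (an assumed criterion, not the paper's exact test).

    ref_rates / probe_rates: per-trial firing rates (spikes/s) of one neuron
    for the reference and the probe stimulus shown alone.
    """
    rng = random.Random(seed)
    observed = abs(mean(ref_rates) - mean(probe_rates))
    pooled = list(ref_rates) + list(probe_rates)
    n_ref = len(ref_rates)
    hits = 0
    for _ in range(n_perm):
        # Shuffle the stimulus labels and recompute the mean difference.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_ref]) - mean(pooled[n_ref:]))
        if diff >= observed:
            hits += 1
    p_value = (hits + 1) / (n_perm + 1)
    return p_value < alpha

# Made-up example rates: one clearly selective neuron, one non-selective.
selective = is_selective([30, 32, 31, 29, 33, 30, 31, 32],
                         [10, 11, 9, 12, 10, 11, 10, 9])
non_selective = is_selective([20, 21, 19, 22, 20, 21, 19, 22],
                             [21, 20, 22, 19, 21, 20, 22, 19])
```

Under this kind of criterion, a neuron that responds to both stimuli but equally well would be excluded, which is consistent with 80 responsive neurons shrinking to 50 selective ones.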
So far we have primarily talked about V1 and never looked beyond that. But this paper talks about attention and visual salience in V4. Why is it that we are not talking about V1 here (or V2, V3)?
The receptive field size is very small for V1 neurons. The RF size becomes larger along the visual pathway from V1 to V4. In V4 the receptive field size is large enough to talk about attention. There can be two different stimuli in a large enough RF and the response of the neurons can change based on what you give attention to. V1 neurons respond to low level features. So it would be difficult to study the effect of attention with V1 neurons as they will be firing based on the feature present in their small local RF.
You can look at the second paper, "Feature-based attention in visual cortex", as they explain why neurons in V4 and MT (V5) are studied to examine the effect of attention. As stated in the paper: "These areas are frequently examined not because they are thought to have a special or prominent role in attention, but because they represent a good compromise between earlier stages of the visual pathways, where neurons respond robustly and predictably to simple stimuli and have moderate-sized receptive fields, and later stages in the pathways, where the effects of attention tend to be more pronounced."
By the time we get to V4, however, hasn't the representation moved far beyond simple things like contrast? Isn't there a lot of literature that suggests that V4 encodes concepts as complex as shapes? Could this supposed increase in contrast perception actually be part of something more complex (shape hallucination, etc.)?
Based on a quick lit search, it looks like V4 and many other high level areas are still contrast dependent, at least at the single-neuron level. I don't think they become totally invariant to all low-level features. V4 does encode higher order features (e.g., contours). I'm not sure I get what you mean by shape hallucination discounting the effects on contrast perception. The phrasing they use is "as though its contrast had increased" -- if they really intend to say this has a perceptual link to increased contrast, that would be one thing, but claiming the neural behavior acts like the contrast has increased could have other perceptual/algorithmic purposes (maybe even shape hallucination?)
Did the quick lit search also say how V4 encodes contrast? If yes, then how easy is it to adjust this encoded contrast? This might help reverse engineer the neural mechanisms for attention (assuming attention operates by adjusting this encoded contrast).
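As a rough illustration of what "as though its contrast had increased" could mean at the single-neuron level, here is a minimal sketch. It assumes a Naka-Rushton-style saturating contrast response function and treats attention as a multiplicative boost to effective contrast; the parameter values (`r_max`, `c50`, `n`, `attention_gain`) are made up for illustration and are not fitted values from the paper.

```python
def contrast_response(contrast, r_max=50.0, c50=0.2, n=2.0):
    """Saturating contrast response in spikes/s (assumed Naka-Rushton form)."""
    c = max(0.0, min(1.0, contrast))  # keep effective contrast in [0, 1]
    return r_max * c**n / (c**n + c50**n)

def attended_response(contrast, attention_gain=1.5, **kwargs):
    """Attention modeled as scaling the effective contrast (an assumption)."""
    return contrast_response(contrast * attention_gain, **kwargs)

low_unattended = contrast_response(0.1)
low_attended = attended_response(0.1)
high_unattended = contrast_response(1.0)
high_attended = attended_response(1.0)
```

In this toy version the attention effect is large at low contrast and disappears once the response saturates, which is one way a pure "contrast increase" account differs from a simple multiplication of the firing rate.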
This paper explores the question of attentiveness. Because the subjects are monkeys, the authors can use probes, which results in an impressive time resolution of the response. I like the result of the paper. It shows how "concentration" is reflected in the suppression of neural signals irrespective of the contrast of the stimulus (as shown in later experiments and discussed in "The possibility that distracters attracted attention" on Pg 9). This fits well with the "task based focus" of vision systems. In this light, can we say something about task independent "saliency of images" being a very fuzzy thing?
It seems like there are 2 points. First is that the brain responds more strongly to high contrast inputs. The second is that the brain responds strongly to something it is consciously attending to. I'm guessing the task independent "saliency of images" will be similar to the first point, and the "concentration" is similar to the second point. So the conclusion seems to be that when performing a task, attention is more important. But how important is the task independent "saliency" if there are no tasks? Is there a "default mode" for passive viewing, and if so how is attention distributed for this "default mode"?
I think this division sort of makes sense and I'm not sure about mainstream computer vision's tendency to group the two together (not always but sometimes). I'm not sure whether the pop-out effect (i.e., "high contrast") salience is the same as the attention salience: it might just be that attention salience can override the normal tendency to pick up on high contrast stuff (which you might call salience or you might not). I'm curious whether it can override other preferential things (e.g., preference for movement).
Liu raises an interesting question of how attention is distributed for some default mode where there's no obvious task. One way to think of it may be that your brain is trying to "make sense" of the world by default. It's building a model of what it expects to see, and as long as what it sees is within some expected range, it just motors along attending to small things here and there to make sure everything fits with the model well. When expectations aren't met, then the unexpected gets a lot of attention. The big question is then how do we build that model, how accurate is it, how is it parameterized, etc. Which... who knows? I think there's also an interesting issue of varying levels of saliency. We have all these different regions of the brain, and it makes sense that there might be a similar concept of saliency that applies to all of them. For example, small edges and star shapes are salient to V1 in the "default mode," but is it possible that feedback information would modulate this in a task-dependent way? For example, a tree with lots of leaves might have a lot of small edges that excite V1. But maybe as soon as your higher levels recognize it, they say to V1 "no worries dude, we figured out what it is, you can calm down now." I'm not sure if that made any sense, but the point is that each layer/region of the brain might be "paying attention" to the various stimuli that it's receiving using some common mechanism, it's just that the type of stimuli in each layer/region is different.
I feel that Aaron's last paragraph confuses saliency with preferred stimuli. Edges are preferred stimuli for V1. But everything else makes sense to me.
I think there might be different regions in the brain which are responsible for the task independent "saliency of images" and "concentration". For the task independent "saliency", a certain region would process a general, more gist-like, or prototypical, representation of the image. Hence, even without a specific task in mind, i.e., no explicit and conscious task, there would be some default output kept in the brain for future processing. Meanwhile, for "concentration", a certain other region would strongly activate the most appropriate frames based on the current task, and "attention" is thus raised.
My question may not be very relevant to the main topic of the paper. It seems to me that, in this paper, visual salience is primarily evaluated by contrast. I'm not quite sure if my conclusion is fair. I think from a computer vision perspective, visual salience contains more factors. For example, in addition to variation in image attributes (e.g. gradients, edges), people also consider visual novelty, uniqueness, etc.
I find the idea of saliency confusing: Are we talking about unusual features in an image which attract eye saccades (as David mentions, the pop-out effect), or are we talking about the influence that attention has on the perception of image features (my attention to a square makes its contrast seem stronger)? These sound like two very distinct concepts.
A similar question is on my mind. In this case, given the time constraint, the pop-out effect seems to be more plausible, since it is something you can train to become better at over time.
I agree. Visual salience might not be the sole source of attentiveness. Other, simpler sources should be considered separately, too.
It might be something stupid, but I was trying to do a small experiment using the following images (without any task in mind, just to see what I look at in the images): 1. Image 1 (http://blog.gettyimages.com/wp-content/uploads/2013/08/Jennifer-Lawrence-Bradley-Cooper-Oscars-2013-Christopher-Polk-Getty-Images-162549463.jpg) -- In this image, there are a lot of people and I look at faces in 3-4 seconds (mainly at the face of Jennifer Lawrence). 2. Image 2 (http://www.lonelyplanet.com/travel-blog/tip-article/wordpress_uploads/2013/05/india1_cs.jpg) -- In this image, there are some people posing in front of the Taj Mahal, but in the first 3 seconds I looked at the Taj Mahal. (Later on I looked more carefully to see what is there in the image.) 3. Image 3 (http://img.ibtimes.com/www/data/images/full/2013/08/19/400404-representational-image.jpg) -- In this image, there are a lot of people sitting on a train, but I am not able to fix my eyes on one thing. Interestingly, in images 1 and 2 the focus points (Jennifer Lawrence and the Taj Mahal) are at the center, and in image 3 things seem to be symmetric around the center. Although it is not a very rigorous experiment, can we say that something at the center has more bias towards attention (instead of saying that the bias is more for faces, monuments, etc.)? I mean, the more I think, the more it seems to me that for bottom-up saliency detection or task-independent saliency we start looking from the center and try to see something interesting near the center. Does it make sense? Any thoughts?
A few comments: for your first image, not only is Jennifer Lawrence in the center of the frame, but there is a shallow depth of field around her as well. In the other two images, the faces aren't anywhere close to as large as those in the first. Most importantly, all three images were intentionally constructed by a photographer to focus on one particular aspect of the image. I would say that things in the center of an image do not have more saliency than the periphery, it's just that when photographers are taking a picture of something, they're more likely to put the intended object of attention closer to the middle. There's actually a very common technique in visual art that places dominant objects somewhere on the "thirds" of an image, which tends to make things more pleasing and dynamic. http://en.wikipedia.org/wiki/Rule_of_thirds
Ok, this is a total can of worms, but... what's the take-home for mainstream computer vision? Is there anything to learn here beyond the notion that a task can overrule looking for standard pop-out-type things? I think the fact that there's task-based overruling of standard visual search policies is a very interesting point, but unfortunately we're too early to make much use of it (we really don't have systems with multiple purposes yet). The Judd paper mainly suggests either replacing eye trackers or graphics-like tasks where you need to know where a human will look (or not!). I'm hopeful that at least someone will be willing to defend saliency as something really important.
I don't have an answer to this question, but I do have a followup to one of your statements: Why is it too early to make use of the idea of task-based saliency rules? Even though we generally approach vision from the single-task engineering approach, it does seem useful to recognize that there could be a general black box that goes in front of (or in the middle of) feature extraction and that black box gets hand-engineered for each task a CV scientist needs (just as we pick our favorite HoG/HoF/SIFT features for our problem). I'd argue that CV people probably get the idea of task-based policies better than most :)
What I mean is that making things explicitly task-based (i.e., focused on the fact that there are multiple tasks for an agent) isn't particularly interesting until you have multiple tasks in mind. Certainly there's a great deal of intuition pushed into vision problems, and you definitely need some intuition about the task baked in. But that's true, I think, in all applied learning problems. But what I was getting at was: what does the task being able to overrule some standard visual policy tell us? Well, it tells us that animals are task-driven and that their current task is given priority in the vision system. But what does that mean for computer vision? If your agent is a robot in the wild, that's truly important and I think fundamental. But if your agent is a security camera that has one task and is only supposed to track people and report anomalies, I'm not sure whether this has great importance. Unfortunately, I think despite our hopes and dreams, most of our vision systems have much more in common with the security camera case.
This is just an opinion, but I think it's possible that this kind of thing is what allows us to create multi-purpose systems in the first place. For the last semester or so I've been working on a detector that has the goal of being able to quickly tell you if a candidate object-pose combination exists in an image. The idea is that even though the space of possible object-pose combinations is too large to exhaustively search, it may still be very useful in combination with weak priors that suggest local areas in which to look for things. My hope is that it will be useful for passive viewing where you use point-features or some saliency metric to find places to look for particular objects, and will also be good when you have some task-specific knowledge of certain objects you may care about and where they may be (for example locating upright objects on a table is a common thing we do with Herb). Admittedly, this is only slightly multi-purpose, but I think it gets interesting when you start to treat it as a resource allocation problem. You have a finite number of object-pose candidates you can test in a given amount of time, and you want to quickly figure out the best way to allocate those given shifting goals and priorities. This might come up in a multi-stage plan, where we want Herb to open the fridge and get something out of it. The fridge handle gets more saliency at first, but once he's found it and opened the door, he starts looking for the specific item he wants to retrieve.
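The resource-allocation view above can be sketched in a few lines. This is only an illustration under stated assumptions, not the actual detector: each candidate is a hypothetical dict with a bottom-up `saliency` score, the current task contributes a per-object weight, and the budget is simply the number of candidates that can be tested.

```python
import heapq

def allocate_candidates(candidates, task_weights, budget):
    """Pick the `budget` candidates with the highest task-weighted saliency.

    candidates: list of dicts with hypothetical keys "object", "pose", "saliency".
    task_weights: object name -> multiplier set by the current task stage;
    objects the task does not mention keep a neutral weight of 1.0.
    """
    def priority(candidate):
        return candidate["saliency"] * task_weights.get(candidate["object"], 1.0)

    return heapq.nlargest(budget, candidates, key=priority)

candidates = [
    {"object": "fridge_handle", "pose": "upright", "saliency": 0.4},
    {"object": "milk_carton", "pose": "upright", "saliency": 0.5},
    {"object": "mug", "pose": "on_side", "saliency": 0.6},
]

# Stage 1: open the fridge. Stage 2: retrieve the item. No task: passive viewing.
stage_one = allocate_candidates(candidates, {"fridge_handle": 3.0}, budget=1)
stage_two = allocate_candidates(candidates, {"milk_carton": 3.0}, budget=1)
passive = allocate_candidates(candidates, {}, budget=1)
```

With an empty `task_weights` the ranking falls back to plain bottom-up saliency, so the same mechanism covers both passive viewing and the shifting priorities of a multi-stage plan.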
Computer vision research has primarily focused on attention as a way of knowing what to search for, spatially or in terms of feature selection. In this context, we have several papers that try to mimic humans in predicting image saliency. I'm, however, interested in exploiting attention in the algorithm design itself. For example, this paper (http://www.umiacs.umd.edu/~mishraka/activeSeg.html) decomposes the problem to sequentially focus/fixate on one point at a time to segment out different objects in the scene. I unfortunately have not found any such work for other popular computer vision problems: what happens if we have a sequential decomposition of the problem, and is a point seed fundamentally different from other seeding methods?
An interesting paper which incorporates saliency in algorithm design is 'An assisted photography method for street scenes' http://www.ri.cmu.edu/pub_files/2011/1/13.pdf . It is specifically aimed at assisting blind mobile camera users in documenting transit issues by taking photographs. Although the task is very specific here, saliency is an important part of the proposed algorithm.
To me this was one of the more intriguing papers. Also I think the discussions going on are pretty interesting. But one thing that this makes me think of is: what if the feature extractors that are used for a particular image have such an RF of their own, and the problems that vision faces are due to the fact that this RF captures the wrong things from a scene or image? What if the black box which processes inputs in CV isn't the right 'neuron' for the task? This might seem really stupid, but being pretty new to vision as a field, I have some apprehensions about even the basics.
After reading the paper, the most interesting aspect to me was the suppression of the response to stimuli by the presence of weak probe stimuli, and that the suppression increased the poorer the probe stimulus was for the particular neuron. To me initially, this was very counterintuitive: most stimuli are very likely not ones that a given selective neuron is tuned for, and thus the net effect would be that the neuron would hardly ever trigger a strong response. Is it because we want the neuron to fire well only when the stimulus it is tuned for is at a high contrast or is being attended to? The second thing that the suppression led me to think about was that this implies that viewing a stimulus with a very high response for a given selective neuron (say, e.g., a very bright red colour for a red-specific neuron) would in effect 'devastate' the detection of other colours in the same receptive field. A simple test would be to display a big red circle with green dots in it vs the green dots in isolation. However, I feel that the colour-contrast between the two should actually enhance the appearance of the green dots. Maybe something like a very thin vertical stripe hidden between very prominent black and white horizontal stripes would be a better test.
I think there are 2 points raised by this paper possibly relevant to designing human-like vision systems: 1. Higher contrast regions are automatically more important in lower vision areas and their response is preferentially transmitted to higher level areas. 2. Attention can change the response of the low level and hence the high level by increasing the effect of the response due to what is present at the attended location. These might be obvious in a way, but the paper does a statistical study of the effects of contrast and attention to support these two. I think the "take home" might be to remove low contrast regions in a scene understanding setting.
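One simple way to see why a poor probe would suppress the response is to treat the response to the pair as a contrast-weighted average of the responses to each stimulus alone. The sketch below assumes exactly that; it is a deliberately simplified toy model, not the fit used in the paper, and the firing rates, contrasts, and `attention_gain` are invented for illustration.

```python
def pair_response(r_ref, r_probe, c_ref, c_probe, attention_gain=1.0):
    """Toy weighted-average model of two stimuli in one receptive field.

    r_ref, r_probe: responses (spikes/s) to each stimulus presented alone.
    c_ref, c_probe: stimulus contrasts, used as weights.
    attention_gain: multiplies the weight of the attended (reference)
    stimulus; 1.0 means attention is directed elsewhere. All assumed.
    """
    w_ref = c_ref * attention_gain
    total = w_ref + c_probe
    if total == 0:
        return 0.0
    return (w_ref * r_ref + c_probe * r_probe) / total

alone = pair_response(40.0, 0.0, 0.5, 0.0)           # reference by itself
with_probe = pair_response(40.0, 10.0, 0.5, 0.5)     # poor probe added
worse_probe = pair_response(40.0, 5.0, 0.5, 0.5)     # even poorer probe
attended = pair_response(40.0, 5.0, 0.5, 0.5, attention_gain=3.0)
```

In this toy model the pair response is pulled toward the probe's response, more so the poorer the probe, and attending to the reference pulls it back up. It also shows that an isolated preferred stimulus still drives the neuron fully, so suppression only matters when something else shares the receptive field.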
One important discussion as a takeaway was the overall impact of attention during visual processing. Is it just a filter helping to prune what to process or is it actually helping in re-weighting our estimates of what to expect/recognize? How much of a role does context play in bottom up attention?