Matei Mancas, Numediart Institute, Faculty of Engineering (FPMs), University of Mons (UMONS), 31 Bd. Dolez, 7000 Mons, Belgium
Idea and approaches. As we already saw, attention was first addressed by philosophy, then discussed by cognitive psychology and neuroscience; only in the late nineties did attention modeling arrive in computer science and engineering. In this domain, two main approaches can be found. The first is based on the notion of “saliency”, the second on the idea of “visibility”. In practice, saliency-based models are far more widespread than visibility models in computer science. The notion of “saliency” implies a competition between “bottom-up” (exogenous) and “top-down” (endogenous) information. The idea of bottom-up saliency maps is that people’s gaze is directed to areas which, in some way, stand out from the background because of novel or rare features. This bottom-up saliency can be modulated by top-down information based on memory, emotions or goals. The eye movements (scan paths) can be computed from the saliency map, which remains the same during eye motion: it is a global, static attention (saliency) map which only provides, for each pixel, a probability of attracting human gaze.
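As an illustration of how a scan path might be read off a static saliency map, the following minimal sketch repeatedly picks the most salient location and then suppresses its neighborhood. The suppression step (often called inhibition of return), the function name `scan_path` and the parameters `n_fixations` and `inhibition_radius` are assumptions made for this example; the text above does not prescribe a particular procedure.

```python
import numpy as np

def scan_path(saliency, n_fixations=3, inhibition_radius=5):
    """Return a list of (row, col) fixations from a static saliency map.

    Hypothetical sketch: each fixation is the current maximum of a working
    copy of the map; a disc around it is then suppressed so that the next
    fixation lands elsewhere. The input map itself is never modified.
    """
    working = saliency.astype(float).copy()
    rows, cols = np.indices(working.shape)
    path = []
    for _ in range(n_fixations):
        r, c = np.unravel_index(np.argmax(working), working.shape)
        path.append((int(r), int(c)))
        # Suppress a disc around the chosen location (assumed mechanism).
        disc = (rows - r) ** 2 + (cols - c) ** 2 <= inhibition_radius ** 2
        working[disc] = -np.inf
    return path

# Toy saliency map with three peaks of decreasing strength.
toy_map = np.zeros((20, 20))
toy_map[3, 4] = 0.9
toy_map[15, 10] = 0.6
toy_map[8, 17] = 0.3
toy_path = scan_path(toy_map)
```

The saliency map stays fixed while the fixations are generated, which is the point made above: only the working copy used for selection changes.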
Visibility models. These models of human attention assume that people attend to locations that maximize the information acquired by the eye (the visibility) in order to solve a given task (which can also simply be free viewing). In this case, top-down information is naturally included in the notion of task, along with the dynamic bottom-up information maximization. In this approach the eye movements are a direct output of the model and do not have to be inferred from a “saliency map”, which is considered here as a surface giving the posterior probability (following each fixation) that the target is at each scene location (Geisler & Cormack, 2011). Compared to other Bayesian frameworks, like that of Oliva et al. (2003), visibility models have one main difference: the saliency map is dynamic. Indeed, visibility models make explicit the resolution variability of the retina (Figure 1): an attention map is “re-computed” at each new fixation, as the feature visibility changes at each of these fixations. Tatler (2007) introduces a tendency of the eye gaze to stay in the middle of the scene to maximize the visibility over the image (which recalls the center preference for natural images, also called the centered Gaussian bias).
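The dependence of the attention map on the current fixation can be illustrated with a small sketch. It does not implement the Bayesian ideal-observer models cited above: the Gaussian falloff of visibility with eccentricity, the parameter `sigma` and the function names are assumptions chosen only to show that the same features yield a different map at each fixation.

```python
import numpy as np

def visibility(shape, fixation, sigma=20.0):
    """Assumed visibility falloff: 1 at the fixated pixel, decaying with eccentricity."""
    rows, cols = np.indices(shape)
    d2 = (rows - fixation[0]) ** 2 + (cols - fixation[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def attention_map(features, fixation, sigma=20.0):
    """Re-compute the attention map for the current fixation.

    The feature response is weighted by how visible each location is
    from the current fixation point.
    """
    return features * visibility(features.shape, fixation, sigma)

# Toy feature map with two equally strong features.
features = np.zeros((64, 64))
features[10, 10] = 1.0
features[50, 50] = 1.0

map_near_first = attention_map(features, fixation=(12, 12))
map_near_second = attention_map(features, fixation=(48, 48))
```

With the fixation near the first feature, that feature dominates the map; moving the fixation reverses the ranking, although the underlying features are unchanged.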
Figure 1: Depending on the eye fixation position, the visibility, and thus the feature extraction, is different. Adapted from images by Jeff Perry.
Visibility models are used much more for strong tasks (for example, Legge et al. (2002) proposed a visibility model capable of predicting eye fixations during reading), and few of them are applied to free viewing, which is considered a weak task (Geisler & Cormack, 2011).
Saliency approaches: bottom-up methods. While visibility models are mostly used in cognitive sciences and with strong tasks, bottom-up approaches in computer science use features extracted only once from the signal, independently of the eye fixations, mainly for free viewing. Features are extracted from the image, such as luminance, color, orientation, texture, relative positions of objects, or even simply neighborhoods or patches. Once those features are extracted, all existing methods are essentially based on the same principle: looking for areas that are contrasted, rare, surprising, novel, worth learning, less compressible, or information-maximizing. All those definitions are in effect synonyms, and they all amount to searching for unusual features in a given spatial context. In the following, we provide examples of contexts used for still images to obtain a saliency map. This saliency map can be visualized as a heatmap where hot colors represent pixels with a higher probability of attracting human gaze (Figure 2).
Figure 2: Left: initial image. Right: superimposed saliency heatmap on the initial image. The saliency map is static and gives an overview of where the eye is likely to attend.
Saliency methods for still images. The literature on saliency models for still images is very active. Those models have various implementations and technical approaches, even if they all initially derive from the same idea. It is not the purpose here to review all of them; instead, we propose a taxonomy to classify them. We structure this taxonomy according to the context that the methods take into account to exhibit image novelty. In this framework, there are three classes of methods.
The first class focuses on a pixel’s surroundings: here a pixel, a group of pixels or a patch is compared with its surroundings at one or several scales. The main idea is to compute visual features at several scales in parallel, apply center-surround inhibition, combine the results into conspicuity maps (one per feature) and finally fuse them into a single saliency map. Many models derive from this approach; they mainly use local center-surround contrast as a local measure of novelty. A good example of this family is the model of Itti et al. (1998), which is the first implementation of the Koch and Ullman model. This implementation proved to be the first successful approach to attention computation, providing better predictions of human gaze than chance or simple descriptors like entropy.
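The center-surround principle can be sketched for a single feature (intensity) as follows. This is a deliberately reduced illustration, not the model of Itti et al. (1998): it uses one feature only, and the Gaussian scale pairs in `scales`, the edge padding and the final division by the maximum are assumptions made for the example.

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian blur with edge padding (NumPy only)."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(img, radius, mode="edge")
    # Filter along rows, then along columns; "valid" restores the input size.
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)

def center_surround_saliency(intensity, scales=((1.0, 4.0), (2.0, 8.0))):
    """Sum |center - surround| over several (center, surround) scale pairs.

    Each pair is (sigma of the fine "center" blur, sigma of the coarse
    "surround" blur); the pairs are hypothetical values for this sketch.
    """
    saliency = np.zeros_like(intensity, dtype=float)
    for center_sigma, surround_sigma in scales:
        center = gaussian_blur(intensity, center_sigma)
        surround = gaussian_blur(intensity, surround_sigma)
        saliency += np.abs(center - surround)
    peak = saliency.max()
    return saliency / peak if peak > 0 else saliency

# Toy image: uniform background with one small bright square.
image = np.full((64, 64), 0.2)
image[30:34, 30:34] = 1.0
saliency = center_surround_saliency(image)
```

In a fuller model, one such conspicuity map would be computed per feature (for example color and orientation) and the maps would then be fused into a single saliency map.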
A second class of methods uses the entire image as context and compares pixels or patches with other pixels or patches from other locations in the image, not necessarily in the surroundings of the initial patch. The idea can be divided into two steps. First, local features are computed in parallel from a given image. Second, the likeness of a pixel or a neighborhood of pixels to other pixels or neighborhoods within the image is measured. A good example can be found in Seo & Milanfar (2009), who first propose to use local regression kernels as features and then apply nonparametric kernel density estimation to those features, which results in a saliency map of local “self-resemblance”. Mancas (2009) and Riche et al. (2013) focus on the entire image: these models are designed to detect saliency in areas which are globally rare and locally contrasted. Boiman & Irani (2007) look for similar patches and the relative positions of these patches in an image.
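The notion of global rarity can be illustrated with a very small sketch in which each pixel is scored by the self-information of its quantized intensity over the whole image. This covers only the “globally rare” part of the idea; it is not the implementation of Mancas (2009) or Riche et al. (2013), and the single intensity feature, the histogram quantization and the parameter `bins` are assumptions made for this example.

```python
import numpy as np

def global_rarity(intensity, bins=16):
    """Score each pixel by -log2 of how often its intensity bin occurs.

    `intensity` is assumed to hold values in [0, 1]. The context is the
    entire image: a value that is frequent anywhere in the image gets a
    low score, wherever the pixel is located.
    """
    idx = np.clip((intensity * bins).astype(int), 0, bins - 1)
    counts = np.bincount(idx.ravel(), minlength=bins)
    prob = counts / idx.size
    # Only occupied bins are looked up, so prob[idx] is never zero.
    return -np.log2(prob[idx])

# Toy image: 32 x 32 uniform background with one pixel of a different value.
image = np.full((32, 32), 0.1)
image[5, 7] = 0.9
rarity = global_rarity(image)
```

The odd pixel occurs once among 1024 pixels and therefore carries 10 bits of self-information, while background pixels score close to zero.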
Finally, the third class of methods takes into account a context based on a model of what normality should be: if things are not as they should be, this can be surprising and thus attract people’s attention. Achanta et al. (2009) proposed a very simple attention model: a distance is computed between a smoothed version of the input image and the average color vector of the input image. The average color is used as a kind of model of the image statistics: pixels which are far from those statistics are more salient. This model is mainly useful for salient object detection. Another approach to “normality” can be found in Hou & Zhang (2007), where the authors proposed a spectral model that is independent of any features. The difference between the log-spectrum of the image and its smoothed log-spectrum (the spectral residual) is reconstructed into a saliency map. Indeed, a smoothed version of the log-spectrum is closer to a 1/f decreasing log-spectrum template of normality, as small variations are removed. This approach is almost as simple as that of Achanta et al. (2009) but more effective at predicting eye fixations.
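Both “normality” models described above are simple enough to sketch in a few lines. The code below is an illustration under stated assumptions rather than the authors’ implementations: the color space is whatever the input array uses, the 5-tap binomial kernel and the 3 x 3 averaging window are assumed choices, and pre- and post-processing steps such as rescaling or final smoothing of the map are left out.

```python
import numpy as np

BINOMIAL = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0  # assumed smoothing kernel

def smooth(channel, kernel):
    """Separable smoothing of a 2D array with edge padding."""
    radius = len(kernel) // 2
    padded = np.pad(channel, radius, mode="edge")
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)

def mean_color_distance_saliency(image):
    """Distance between the smoothed image and its average color vector.

    `image` is an H x W x C float array; pixels far from the average
    color are scored as more salient.
    """
    n_channels = image.shape[2]
    mean_color = image.reshape(-1, n_channels).mean(axis=0)
    blurred = np.stack(
        [smooth(image[..., c], BINOMIAL) for c in range(n_channels)], axis=-1)
    return np.linalg.norm(blurred - mean_color, axis=-1)

def spectral_residual_saliency(gray, window=3):
    """Reconstruct a map from (log-spectrum - smoothed log-spectrum)."""
    spectrum = np.fft.fft2(gray)
    log_amplitude = np.log(np.abs(spectrum) + 1e-12)  # offset avoids log(0)
    phase = np.angle(spectrum)
    box = np.ones(window) / window
    residual = log_amplitude - smooth(log_amplitude, box)
    # Back to the image domain, keeping the original phase.
    return np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2

# Toy color image: gray background with a red patch.
color_image = np.full((32, 32, 3), 0.5)
color_image[12:18, 12:18] = [1.0, 0.0, 0.0]
color_saliency = mean_color_distance_saliency(color_image)

# Toy gray image: a single bright pixel on a black background.
gray_image = np.zeros((32, 32))
gray_image[10, 20] = 1.0
spectral_saliency = spectral_residual_saliency(gray_image)
```

In the first function the model of normality is the average color; in the second it is the smoothed log-spectrum, so that only what departs from the smooth spectral trend survives in the reconstructed map.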
Towards video, audio or 3D signals and top-down attention. In the next parts we will focus on other kinds of signals such as moving images (video), audio or even 3D signals. In addition, even if top-down information is modeled less often in saliency approaches, there is nevertheless a substantial literature on the topic, which will also be detailed in the next parts.
Achanta, R., Hemami, S., Estrada, F. & Susstrunk, S. (2009). Frequency-tuned Salient Region Detection, IEEE International Conference on Computer Vision and Pattern Recognition (CVPR). URL: http://www.cvpr2009.org/
Boiman, O. & Irani, M. (2007). Detecting irregularities in images and in video, International Journal of Computer Vision 74(1): 17–31.
Geisler, W. S. & Cormack, L. (2011). Chapter 24: Models of Overt Attention, in The Oxford handbook of eye movements, Oxford University Press.
Hou, X. & Zhang, L. (2007). Saliency detection: A spectral residual approach, Proc. IEEE Conf. Computer Vision and Pattern Recognition CVPR ’07, pp. 1–8.
Itti, L., Koch, C. & Niebur, E. (1998). A model of saliency-based visual attention for rapid scene analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 20(11): 1254–1259.
Legge, Hooven, Klitz, Mansfield & Tjan (2002). Mr. Chips 2002: new insights from an ideal-observer model of reading, Vision Research pp. 2219–2234.
Mancas, M. (2009). Relative influence of bottom-up and top-down attention, Attention in Cognitive Systems, Vol. 5395 of Lecture Notes in Computer Science, Springer Berlin / Heidelberg.
Oliva, A., Torralba, A., Castelhano, M. & Henderson, J. (2003). Top-down control of visual attention in object detection, Proceedings of the 2003 International Conference on Image Processing (ICIP 2003), Vol. 1, pp. I-253–6.
Riche, N., Mancas, M., Duvinage, M., Mibulumukini, M., Gosselin, B. & Dutoit, T. (2013). Rare2012: A multi-scale rarity-based saliency detection with its comparative statistical analysis, Signal Processing: Image Communication 28(6): 642–658.
Seo, H. J. & Milanfar, P. (2009). Static and space-time visual saliency detection by self-resemblance, Journal of Vision 9(12). URL: http://www.journalofvision.org/content/9/12/15.abstract
Tatler, B. (2007). The central fixation bias in scene viewing: Selecting an optimal viewing position independently of motor biases and image feature distributions, Journal of Vision 7.