Attention research has covered a lot of ground in the twelve years since fuzzy and blue's paper was originally written. This literature review digs a little bit deeper into the history of attention research and fills in some details on the latest research.

Spotlight on Attention

Although philosophers and scientists have been interested in the mental processes of attention and selection for most of recorded history, a thorough scientific study to determine its true nature has only been in progress for a little over half a century. As a consequence, ideas about attention that had existed as folklore for thousands of years have only recently been dispelled. This writeup is intended to explore the debates that have arisen recently about what exactly attention is, where in the brain and according to what time course it takes place, and, most importantly, how it might work.

As is usually the case in questions of this form, “what is attention?” is open to a bit of semantic debate. For the most part, researchers have come to an agreement about what cognitive processes to label as attention, but this wasn’t always so. For the greater part of recent history, psychologists and philosophers seemed intent on avoiding this question altogether. William James writes “…the perceptual presence of selective attention has received hardly any notice from psychologists of the English empiricist school. The Germans have explicitly treated of it,[…] but in the pages of [English writers] the word hardly occurs…”[14] However, Kleine-Horst writes, of early 20th century German literature, “…there are a number of occasional references to the effects of increased or decreased directing of attention on the experience of a stimulus pattern. But…they did not even make a tentative effort to treat the empirically established effects of attention theoretically.”[16] In other words, although, as James put it, “everyone knows what attention is,” and some were even intent on studying it empirically, it took the field a very long time to formally define attention in a scientifically rigorous way. As Egeth and Bevan put it:


"Psychologists have generally recognized the existence and importance of attentionlike phenomena, but they haven’t known how to incorporate them into their theoretical structures."[6]

The current empirical definition consists of a list of capabilities and limitations of a set of perceptual routines whose purpose is selection. Namely, it is a phenomenon by which parts of a perceived stimulus receive enhanced processing and other parts receive diminished processing. That enhancement and reduction must both occur in attended perception was clear from introspection. As James put it, “Millions of items of the outward order are present to my senses which never properly enter into my experience.” Moreover, many studies have shown that we are not even consciously aware of most of those stimuli which we do not attend. And yet, we have conscious control over what we attend. Thus, the most general definition of attention as it is studied today is “those mechanisms which cause attended stimuli to receive enhanced perceptual processing, while limiting the processing done on unattended stimuli.”

This definition makes it possible to break down our what, where, and how questions as follows:

  • What: In which ways do attentional processes enhance perception of attended stimuli? What sort of stimuli can be attended, and to what extent can their perception be enhanced? To what extent is perception of unattended stimuli limited?
  • Where/When: On the pathway from direct sensory perception to cognition about that perception, at which point are attentional effects applied? Where in the brain does this phase of processing occur? Into how many distinct processing elements is this perceptual pathway divided?
  • How: What exactly do the attentional processes in perception do in mechanical/signal processing terms? In computational terms, what happens in the brain to attentionally enhanced or diminished perceptual signals?

1 What

There are many results demonstrating that attention actually makes a considerable difference in what can be discerned about a stimulus. Perhaps the oldest question that was addressed in this way was “how many ways can attention be divided?” The answer to this question says something about what sorts of things can be attended. The answer is, as with anything about the brain: “Uhhhhh…it depends.”

William James gave the answer to this question as “one object,” where an object here could be an indefinitely-sized collection of perceptually similar or connected “things.” As he puts it:


“But however numerous the things, they can only be known in a single pulse of consciousness for which they form one complex ‘object’, so that properly speaking there is before the mind at no time a plurality of ideas, properly so called.”[15][Emphasis is James's]

When it’s put that way, it’s not a refutable hypothesis, because the ways in which these things can be combined into an object are not defined. However, this idea that the mind is able to attend to one perceptual “object” at once has persisted, being reformulated in such hypotheses as (to take the visual mode as an example) “visual attention can only select a single contiguous space.” This hypothesis, combined with the idea that only items in the attended space can be scrutinized, resulted in the popular “spotlight” metaphor of attention. James hinted at this idea when he spoke of the margin and the periphery of vision: the ability to focus attention where the eyes were not focused, even when there was no stimulus there to focus attention on.

But this idea that there could only be one “spotlight” of visual attention has proven to be false in its strongest sense. It is possible to split attention in other senses: although you may temporarily forget a headache after stubbing a toe, if you make a conscious effort, you can fixate on both of these pains and ignore feelings from other parts of the body. So, why shouldn’t visual attention be splittable to remote regions as well?

There have been many experiments testing the ability of humans to split their visual attentional “spotlights” without refixating the eyes. One prototypical result in this paradigm is that of Awh and Pashler.[1] Before an array of characters was displayed, a pair of cues was presented at two discontiguous locations. After the array, two locations were indicated and subjects were asked to name the characters that had appeared there. Subjects performed well at naming the characters at the cued locations, but showed difficulty naming characters at uncued locations, even when an uncued location lay between the cued locations. In other words, they were able to split attention between two distinct cued locations without paying any attention to the region between those locations, indicating that they had two distinct “spotlights.” An interesting fact about these experiments is that they showed an increased ability to split attention between the left and right visual fields; when the cued locations were on the same side, one above the other, subjects showed considerably less accuracy remembering the character at the lower one (although performance remembering the lower character was still better than at any uncued location). This indicates that splitting attention is easier when the attended locations fall in opposite brain hemispheres.

Thus, it is not particularly surprising that attention can similarly be split between the two ears in listening tasks. In an experiment by Hink et al., subjects were asked to count target spoken phonemes that were played alternately at differing intervals in one ear (by a male voice) or the other (by a female voice).[10] Thus, subjects were required to split attention between two different frequency spectra and between two different audio channels. They were still able to recognize the targets about 79% of the time on average, although they did so over 90% of the time on average when paying attention to just one ear or the other. Thus, dividing attention in this way does come at some cost to performance.

Hink, Fenton, and colleagues performed another experiment in which subjects were asked to identify only tones that occurred in two audio channels (central and extreme left, for instance) when tones were being presented in five different channels.[12] Subjects performed best when they did not have to avoid identifying a tone in the in-between audio channel (middle-left, for instance). However, even with the in-between tone being considered task-irrelevant, subjects were able to successfully and quickly discriminate the target tones, indicating an ability to split attention between two non-adjacent audio channels.

Now, there are two different ways to interpret the above results about dividing attention. One is that there is only a single “beam” of attention that can be very rapidly switched back and forth between two sources (the two cued locations in the visual case, or the two audio channels in the auditory case). The other is that attentional resources can be divided to process stimuli from multiple sources simultaneously. It may also be the case that different senses use one or the other of these modes of selection. However, in either case, it means we have to be careful making claims about how many “things” can be attended to at a time.

In these studies we have also inadvertently glimpsed an answer to another question that we have asked: target stimuli are best recognized and identified where attention is directed and usually neglected where it is not. However, there are limits to the enhancement attention alone can yield. For example, the resolution of differentiable items in a location other than where the gaze is fixated is much lower than visual resolution itself. The eyes are able to differentiate objects that the mind cannot. Take, for instance, the following image (from Cavanagh, He, and Intriligator, 1999)[8]:

Figure 1

In fixating on the dot in the center (viewing from sufficiently far away), we can tell there are a number of distinct vertical lines to either side, but we can only count the lines on the right, because we cannot select just one of the lines on the left at a time. However, if you focus your gaze on a point below the collection of lines on the left, you may find that it suddenly looks not only like a collection of lines, but like a collection of ten distinct lines. Thus, visual attentional resolution has a finer grain in the direction tangential to fixation than in the radial direction.

Based on all this evidence, our current best answer to “what is attention?” is “that facility which selects some subset of a stimulus for admission into working memory, enables enhanced scrutiny of that subset (though at a coarser resolution than the senses themselves provide), and suppresses the remaining (uninteresting or irrelevant) parts of the stimulus.” We have some idea of what this means, but it is still far too general: researchers haven’t pinned down all its limitations just yet. In addition, this definition doesn’t give us much of an idea of how attention works. We can only get at that by breaking the question down into more manageable chunks. We now turn our attention (heh) to tracking the course of attentional processes through the brain.

2 Where/When

It doesn’t take much thought to translate the ideas of enhancing and suppressing signals into the neural paradigm of excitation and inhibition. In order to suppress irrelevant stimuli, we would only need a feedback system to inhibit the neurons which would be processing those stimuli. We understand that this must happen somewhere between sensory input and entry into working memory, but the sensory information is transformed many times over in this process, so, at which point are attentional controls applied? There are two implied stages here:

An early, pre-attentive stage that operates without capacity limitation and in parallel[...], followed by a later, attentive limited-capacity stage that can deal with only one item (or at best a few items) at a time. When items pass from the first to the second stage of processing, these items are considered to be selected.[18]

So the question becomes: what processing is done in the pre-attentive stage and where in the brain does the information pass to the second stage? We are disregarding for the moment the question of where conscious control of attention originates and how it works. Our concern in this section is only the bottom-up attentional mechanism itself, and not the top-down control process which steers it. Unlike the conscious control process, this set of routines is inscrutable. You know when you have aimed your attention at something, and you know what the result of doing so is, but you cannot say what happened between. Attentional processes are atomic to introspection.

Of course, what happens during that inscrutable process, and where in the brain it happens, depends on which sense it is applied to. It may be that the attentional processes used for each sense are completely different, and it may even be reasonable to say they are based in different neuroanatomical substrates. In fact, to say this would be to adopt an early selection theory. According to such models, as first proposed by Broadbent, attention works as a filter at the earliest possible point on the pathway from the sense organs to working memory, such that the information attention suppresses never reaches consciousness.

The extreme alternative to such theories is, of course, late selection. According to these models, most of the processing that transforms a stimulus is performed before selection occurs; selection happens immediately before the processed stimulus enters consciousness. This would imply that a large amount of information is available from unattended parts of the stimulus.

To support his early selection hypothesis, Broadbent drew on the results of E. C. Cherry's dichotic listening task, of which several variants have been developed in the past half century.[2] The common component is that two different signals are played, one into each of a subject’s ears; the subject is given a task that directs attention to just one audio channel, such as repeating or remembering a sequence of words or numbers, and is later tested on how much information was gleaned from the unattended channel. Broadbent’s experiments showed that subjects remember nothing from the unattended channel beyond what remains in echoic memory, which holds only the last two seconds or so.

This wasn’t the end of the story, however, as experiments by Treisman and others showed that certain stimuli from an unattended channel could in fact be processed for their semantic content. Treisman had subjects attend to and shadow a message in one ear by speaking it aloud, while a nonsense sequence of words in the other ear was ignored. At some unspecified point in the middle of a sentence, the meaningful message would switch to the other ear and the string of nonsense would switch to the original ear.[19] Some of the subjects would switch ears (against instruction) and continue shadowing the meaningful message. This, of course, supported a late selection model, since analyzing meaning is the sort of process that is supposed to occur after selection in Broadbent’s model, but Treisman proposed attenuation theory instead. It states that selection happens early, but is incomplete: unattended channels are only attenuated, not blocked completely. Although it is more difficult to select from the attenuated signals (and indeed, most subjects continued shadowing the prescribed ear), enough information gets through to make selection possible given good enough reason.

J.A. Deutsch and D. Deutsch gave a late-selection explanation of Treisman’s experiment: although subjects were analyzing both audio channels for semantic content, they were selecting just one to respond to by shadowing.[5] The following diagram (of Treisman and Geffen) illustrates the difference between (a) Treisman’s stimulus-set filter and (b) Deutsch and Deutsch’s response-set filter:

Figure 2

In order to settle this difference, Treisman and Geffen devised an experiment whose outcome could distinguish the two theories.[20] If a subject was told to track a word occurring in either audio channel, then, assuming the filter operates at the response level, there should be no degradation in detection of that word due to inattention: the word would be processed semantically and the response filter would flag it based on context no matter which channel it arrived in. However, if attenuation were taking place, the word should be detected less frequently in the unattended channel. Treisman and Geffen forced attention onto one channel by requiring it to be shadowed, but asked subjects to signal detection of a target word in either channel by tapping. End result: subjects detected 87% of targets in the attended channel and only 8% in the unattended channel.

It is likely that something close to attenuation holds in visual attention as well. Consider again the experiment of Awh and Pashler mentioned in the first part. They did not find that cued locations received a high rate of target recall while everywhere else received a uniformly low one. Rather, the locations between the cued locations garnered slightly elevated performance (although not as high as the cued locations), and distant peripheral locations garnered much lower performance. Indeed, there appeared to be a gradient in performance away from the cued (attended) locations. Usai, Umiltà, and Nicoletti showed that if the gap between two relevant locations was small, information originating in this gap could not be suppressed.[23] In other words, there appears to be an attenuation of information from spatial locations as a function of their distance from attended locations, but the information is not lost entirely. It seems that attenuation is the right idea to pursue here.

But what reason do we have to think this attenuation occurs early in the perceptual pathway? We get a hint from ERP studies. Many studies of auditory attention show attentional effects beginning at the earliest moment that a sound signal could reach the auditory cortex. Of particular interest is a series of experiments by Hillyard et al.[9] The basic paradigm is that a series of very brief tones are played alternately in each ear, and the subject is given a task to attend to just one channel and count the occurrences of a certain tone in that channel. The EEG is divided into the electrical responses for the attended and unattended channels and averaged, time-locked to stimulus onset. The difference of these two average waveforms shows a significant increase in negativity for the attended channel beginning 50-100ms after onset. This is about as early as any significant difference could occur: no task can cause any variation in the waveform before this point, as the earlier components stem only from the involuntary conduction of auditory information to the cortex and appear even during sleep.
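
To make the averaging logic concrete, here is a minimal Python sketch using entirely synthetic data (the sampling rate, amplitudes, and noise level are assumptions for illustration, not Hillyard's recordings): epochs are averaged per condition, time-locked to stimulus onset, and the attended-minus-unattended difference wave is examined in the 50-100ms window.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 1000                                   # sampling rate in Hz (assumed)
t = np.arange(-0.1 * fs, 0.4 * fs) / fs     # epoch window: -100 ms to +400 ms

def simulate_epochs(n_trials, attended):
    """Toy epochs: a negativity peaking ~80 ms after onset, larger when attended."""
    amp = -2.0 if attended else -1.0        # arbitrary amplitudes
    erp = amp * np.exp(-((t - 0.08) ** 2) / (2 * 0.02 ** 2))
    return erp + rng.normal(scale=5.0, size=(n_trials, t.size))

# Average each condition's epochs, time-locked to stimulus onset, then subtract.
avg_attended = simulate_epochs(500, attended=True).mean(axis=0)
avg_unattended = simulate_epochs(500, attended=False).mean(axis=0)
difference = avg_attended - avg_unattended  # more negative where attention acts

window = (t >= 0.05) & (t <= 0.10)          # the 50-100 ms window discussed above
print(f"mean attended-minus-unattended difference, 50-100 ms: "
      f"{difference[window].mean():.2f}")
```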

So that is what we know of the time course of attention. What about the “where?”

It would be no surprise if attention were applied directly in the visual, auditory, and somatosensory cortices, if it is happening as early as possible. But the brain can be more efficient than that. As it turns out, all three of these regions are arranged quite neatly around the parietal lobe, and there are several studies indicating increased activity in various parts of this ‘association’ cortex. BOLD fMRI studies show increased activation of the temporoparietal junction (TPJ) in tasks involving low-level attention deflection, such as detecting a change in a visual or auditory stimulus when both are presented simultaneously (but only in the task-relevant modality). Consciously controlled attentional shifts have been traced in the same way to activation in the superior parietal lobule (SPL) and precuneus (PC). One task for which this was observed is a visual character-stream tracking task, with one stream on each side of fixation and the identity of a target character indicating whether or not to switch attention to the other stream.[7]

Figure 3 Figure 4

The SPL is also implicated in the shifting of attention from one spatial location to another while the gaze is fixed. Coactivation of the SPL with the PC is seen when non-spatial attention shifts, for instance between viewing an image as a house or a face when the two images are superimposed.

A lot of research has also been done on locating vision-specific attention processes. The portion of the pre-motor cortex responsible for planning gaze shifts, the frontal eye fields (located along the superior lateral convexity dorsal to the middle frontal gyrus, or above and to the right of the “MFG” label in the above image, attributed to Behrmann et al.), is also implicated in the shifting of visual attention in the absence of gaze refixation. This relationship is likely why it requires so much conscious effort to keep one’s gaze fixed while attending to an object marginal to the point of fixation. No other sensory modality has such a large component of attention located outside of the parietal control center, likely due to vision's unique nature as the only sense with an independently steerable organ (making involuntary steering of said organ quite an advantage).

Given enough information about the where and when, it’s possible to begin making guesses about the how of attention. Thus, modern biologically plausible theories of attention began to arise pretty quickly as brain imaging techniques became more accurate and precise. We will look at some of these theories now.

3 How

Most of the models of how attention is implemented in the brain originate from studies using visual stimuli, and these will be our primary focus here. Some of these models, or at least parts of them, plausibly generalize to aural or tactile attention, insofar as the mechanisms they propose could be implemented in neural wetware serving any sense. However, it is far easier to design visual stimuli that test some particular idea, and so vision is the sensory mode for which the best-tested models exist.

We must start in 1980 with Treisman and Gelade’s “Feature-Integration Theory” of attention.[21] This model suggests what may happen in each of the two phases of the time course of perception: pre-attentional processing and attentional processing. According to this model,

…features are registered early, automatically, and in parallel across the visual field, while objects are identified separately, and only at a later stage, which requires focused attention. […] Thus focal attention provides the “glue” that integrates the initially separable features into unitary objects.[21]

The primary source of evidence for this model comes from experiments using a target detection task. In such tasks, subjects are presented an array of objects for a very short period of time and then asked either to simply determine whether a target object was present, or to make some sort of judgment about a target object. Consider the following pair of images:

Figure 5

If a human is tasked with locating the “O” in either image, it takes hir, on average, the same amount of time to respond. However, if tasked with finding an “R” in either image, the response time grows linearly with the number of letters in the stimulus: it takes a lot longer to rule out the presence of an “R” in image (b).

This effect is known as the pop-out effect: the “O” doesn’t require any effort to find because it is salient. It draws attention to itself, and is quickly found by the pre-attentive phase’s rapid parallel feature search. However, proving that there is no “R” requires an attended search through every perceived object in the stimulus, and studies show that response time is roughly linear in the number of objects being searched.
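
As a toy illustration of the two regimes (the baseline and per-item slope below are assumed values for the sake of the example, not fitted data), the response-time functions can be sketched as RT ≈ base + slope × n:

```python
# Toy response-time model for the two search regimes described above.
def predicted_rt_ms(n_items, popout):
    base = 450.0                       # assumed non-search overhead, in ms
    slope = 0.0 if popout else 50.0    # assumed ms per item scrutinized serially
    return base + slope * n_items

for n in (4, 12, 24):
    print(f"{n:2d} items: pop-out {predicted_rt_ms(n, True):.0f} ms, "
          f"serial {predicted_rt_ms(n, False):.0f} ms")
```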

A great deal of research has focused on explaining the pop-out phenomenon, and exactly what it is that makes an object salient enough to be extracted by this pre-attentive process, with models grounded in signal-detection theory or information theory. They all boil down to one thing: the features of the stimulus are analyzed first, and if no features stand out, an attended search for a target must be performed; if salient objects are found, they draw attention quite involuntarily. These models are all intended to explain the same set of facts derived from the sorts of visual search experiments Treisman and Gelade performed, and there is no consensus at the moment on which is best.

So, ignoring what these models say about the nature of the visual cortex and perception, what else do they tell us about attention? They show us that it is a confluence of top-down and bottom-up processes. Salience draws attention from below, while conscious task-directed control steers it from above.

Another example of attention being drawn both from above and below is demonstrated by Posner’s cueing task.[17] The task is to respond to a target appearing to the left or right of fixation. In the exogenous attention version, a high-contrast box suddenly onsets on one side of space. If the target appears shortly afterward on the same side, response time is shortened; if it appears on the opposite side from the cue, response time is lengthened. This is just another example of the pop-out phenomenon, in a way: the cue is salient and draws attention, and attending to that side of space improves performance recognizing a target there.

A more interesting effect is the phenomenon of inhibition of return: if the target is instead presented after several seconds, it actually takes longer to respond to a target at the cued location than at the uncued one. The standard explanation is that the exogenous drawing of attention has a built-in “give-up” mechanism by which it automatically ignores locations which had been salient but turned out not to be useful. In other words, attention is inhibited from returning focus to a region it has already rejected as unworthy of attention.

This doesn’t occur if attention is directed in a top-down fashion. For example, in the endogenous attention version of the Posner task, an arrow appears near the point of fixation pointing toward the cued location. If the target is then presented immediately, there is no advantage to receiving the cue, as directing attention consciously takes longer. However, once the shift has happened, attention can remain focused on the cued location indefinitely: there is no inhibition of return.
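
The qualitative pattern across the two versions of the task can be summarized in a short sketch; the magnitudes and the 300 ms boundary below are illustrative assumptions, not measured values.

```python
# Schematic summary of the qualitative cueing effects described above,
# relative to a neutral (uncued) baseline. Negative = faster responses.
# All magnitudes and the 300 ms boundary are illustrative assumptions.
def cueing_effect_ms(cue_type, soa_ms, valid):
    if cue_type == "exogenous":
        if soa_ms < 300:                  # attention captured by the abrupt onset
            return -30 if valid else +30
        return +25 if valid else 0        # inhibition of return at the cued spot
    if cue_type == "endogenous":
        if soa_ms < 300:                  # voluntary orienting hasn't finished yet
            return 0
        return -30 if valid else +30      # benefit persists; no inhibition of return
    raise ValueError(cue_type)

for cue in ("exogenous", "endogenous"):
    for soa in (100, 1000):
        print(cue, soa, "valid:", cueing_effect_ms(cue, soa, True),
              "invalid:", cueing_effect_ms(cue, soa, False))
```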

To tie this information back to auditory attention, the “cocktail party effect” deserves a mention. One may consciously direct auditory attention to certain locations in space and certain frequencies of sound and types of signals. This is how one can listen to what a single person is saying in a crowded room. Moreover, if someone tells you to listen to what someone is saying without looking at them (to be all sneaky about it), and you know what that second person sounds like, you can redirect attention and follow that second signal for some time. This is an example of endogenous auditory cueing. However, if someone says your name (or something that sounds reasonably close to your name) on the other side of the room, your attention will be drawn immediately to the direction and spectrum of the voice that spoke it. This “own name effect” is an exogenous cue: your attention is drawn because a salient stimulus appeared. It is frequently used as evidence for the attenuation model, but it doesn’t necessarily have to be so. It could be simply evidence for the sort of two-phase model we have described above for visual search: pre-attentive processing influences attention from the bottom-up whenever sufficiently salient stimuli arise.

There are surely some processes that take place whenever attention is being directed, whether from above or below, and we look now at some biologically plausible models of what these processes might do. Again, we will focus primarily on theories specific to vision. Wolfe offered his Guided Search model as an alternative to Feature Integration Theory, although it preserves many of that model's ideas.[24] The primary difference is this: whereas Treisman and Gelade proposed that if the initial parallel search did not locate a target, the subsequent attended search would be an essentially serial search through the stimulus starting from zero information, Wolfe revised this to say that some information survives the first, pre-attentional search and is used to rank the locations to examine in the attended search.

This model accounts very nicely for inhibition of return. Consider searching for an “L” in the following image:

Figure 6

You’ll notice that your eyes first jump to the red “T” before finding first one and then the other of the two “L”s. However, between noticing the red “T” and finding an “L”, your attention does not return to or linger on the red “T”. Inhibition of return thus serves as a built-in mechanism to prevent wasting effort searching previously searched areas, but, as we have already seen, it is a bottom-up effect. In other words, the attended search is involuntarily affected by artifacts of the pre-attentional process, artifacts that serve to make it more efficient over short and medium timescales. This sort of interplay of information between pre-attentional and attentional processes is what Wolfe means by Guided Search.
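
A minimal sketch of this interplay, with made-up items and salience scores: the pre-attentive stage ranks the locations by activation, the attended search visits them in that order, and inhibition of return keeps it from revisiting rejected items.

```python
# Toy Guided Search: pre-attentive salience ranks locations, attended search
# visits them in order, and visited items are inhibited (never re-checked).
def guided_search(items, salience, is_target):
    order = sorted(range(len(items)), key=lambda i: -salience[i])
    inhibited = set()                       # inhibition of return
    for n_fixations, i in enumerate(order, start=1):
        inhibited.add(i)
        if is_target(items[i]):
            return items[i], n_fixations    # attended fixations needed
    return None, len(items)

# The red 'T' is most salient, so it is checked (and rejected) first.
items = ["T:red", "L:black", "T:black", "L:black"]
salience = [0.9, 0.3, 0.2, 0.3]
print(guided_search(items, salience, lambda x: x.startswith("L")))
```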

However, as Wolfe puts it:

“[In Guided Search], target identification requires that preattentive features (e.g. red, vertical) must be bound by attention into a coherent object (e.g. a red, vertical bar) and that binding requires attention.”[24]

His model explains quite well how search proceeds, but makes no attempt to explain how exactly this binding process takes place. A suggestion here comes from John K. Tsotsos in the form of the Selective Tuning Model.[22] In this model, visual perception is described as a layered pyramid in which information is filtered piecemeal at each level. In the pre-attentive stage, the more salient features are passed up the pyramid (where salient here means “containing more global information” or, if Wolfe’s model is adapted, simply “contrasting locally with surrounding features”), until at the top of the pyramid some neuron (or group of neurons) is activated the most. This “winning neuron” is then selected as the thing about which more information will be acquired. Signals propagate back down the pyramid in such a way that neurons that did not contribute to exciting the winner (or to neurons that did so, directly or indirectly, at a higher level) are inhibited, their signals suppressed. This is called “suppression of the surround,” and it is selective attention in this model. Once this suppression takes place, the input signal is modified as it travels back up the pyramid. The process can repeat several times in a short period, until only the spatial region immediately surrounding the object contributing the salient features remains unsuppressed. Local indicators like edges and color gradients are then analyzed in the attended region to integrate the features together into an object. This is how attention is proposed to be drawn from the bottom up.

Top-down attention proceeds similarly, except that volitional processes also contribute excitation and inhibition to the perceptual pyramid, perhaps at several layers. Regions or features which are expected to be task-relevant can have their associated neurons excited, increasing the likelihood of the winner neuron being driven by that region of space or those features. Irrelevant regions or features can likewise be suppressed. Notice that this sort of excitation and suppression is exactly the mechanism we used to define selective attention in the first place.

Figure 7
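
A highly simplified rendering of one feedforward/traceback cycle might look like the sketch below. This is my own toy reduction, not Tsotsos's published implementation: it keeps a single winning unit at each level, where the real model retains a receptive-field-sized beam.

```python
import numpy as np

def feedforward(layer, pool=2):
    """Each upper-level unit pools (here, max) over a window of lower-level units."""
    return np.array([layer[i:i + pool].max() for i in range(0, len(layer), pool)])

def selective_tuning(inputs, pool=2):
    # Feedforward pass: build the pyramid up to a single top-level unit.
    layers = [np.asarray(inputs, dtype=float)]
    while layers[-1].size > 1:
        layers.append(feedforward(layers[-1], pool))
    # Traceback pass: at each level, keep only the unit that fed the winner
    # above and zero out the rest ("suppression of the surround").
    winner = int(np.argmax(layers[-1]))
    for level in range(len(layers) - 2, -1, -1):
        lo = winner * pool
        best = lo + int(np.argmax(layers[level][lo:lo + pool]))
        kept = np.zeros_like(layers[level])
        kept[best] = layers[level][best]
        layers[level] = kept
        winner = best
    return layers[0]          # only the attended part of the input survives

# Eight input "units"; the most active one wins and all others are suppressed.
print(selective_tuning([0.2, 0.9, 0.1, 0.4, 0.3, 0.8, 0.05, 0.2]))
```

In this sketch, top-down control would enter as extra excitation or inhibition added to selected entries of the input (or of intermediate layers) before the feedforward pass, biasing which unit wins.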

Although this model reconciles well with all of the experiments we have seen so far, we can go one step further and show that it fits well with Desimone and Duncan’s Biased Competition model as well.[4, 3] They hypothesized that the brain's limitation on the number of objects that can be tracked at once stems from the neural structure of the visual extrastriate cortex, in which cells have been shown to optimally represent only one object at a time. Thus, if several similar objects are presented at the same time, these cells would have to choose the features of only one of them to represent. In other words, the collection of objects competes for representation by the relevant cells.

Evidence that this is actually the case is seen in the following experiment: two faces are presented to a subject, one on either side of fixation. At onset time, a stimulus is presented between the two faces, about which a judgment must be made: is it right-side-up or upside-down? The stimulus can be either a house or another face. (This experiment is a composite of several experiments, made by Jacques and Rossion and others.[13]) Processing the houses in this scenario reliably takes less time than the faces, as the central face must compete for representation with the flanking unattended faces. Presumably, this competition is resolved by attention, but it doesn’t happen immediately.

So far, we’ve not contradicted the Selective Tuning Model: we would, in fact, expect that if information is combined several times going up the perceptual pyramid, similar features from different regions would eventually be combined into the same cells for compactness of representation. However, we haven’t done much to support it either. The interesting part is this: competition for representation is more pronounced when targets and distractors are closer together. This phenomenon, known as “localized attentional interference,” is shown by an experimental paradigm of Hilimire et al. in which colored targets are identified in a briefly presented ring of letters, separated by one letter, three letters, or five letters.[11] Accuracy in such tasks improves significantly as the separation between the targets increases.

How does this support the Tsotsos model? Firstly, the simple form of the model predicts that a particular feature found in a particular region will excite a particular cell, and so will multiple instances of that feature in the same region. Thus, a straightforward trace back to the source of the excitation will include all of the objects in the region that have those features, and several more iterations of the process will be needed to individuate a particular object; fixating attention on just one target therefore requires more time to reach the same accuracy. Widely separated instances of those features, however, will excite different cells at higher levels of the pyramid, and therefore be easier to individuate.

Moreover, according to Tsotsos’s model, the region surrounding a target will be suppressed, such that once an individual object has been represented, similar objects appearing nearby will have to overcome that inhibition in order to achieve representation, a further source of competition.

Although it has not yet been tested, one can imagine the same sort of process happening in the auditory mode as well. As a demonstration of how to translate this experimental paradigm, here is a procedure that could test it: place two speakers several feet apart playing square waves at the same frequency but out of phase with one another, so that they can be individually localized. Then play a third tone between them that is slightly higher or lower pitched than the other two, and ask the subject to classify it as higher or lower. Firstly, we should see that this task is easier when the central tone is, say, a triangle or sine wave rather than a square wave. Secondly, we should see that this task is easier when the flanking speakers are placed further from center. This could be simulated by panning across the stereo field and playing over headphones, although the phenomenon of binaural beats might throw off the parsing of the sounds in this situation.
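
For concreteness, the headphone version of the stimulus could be generated along these lines; every frequency, duration, and gain below is an assumption chosen for illustration.

```python
# Stimulus generation for the proposed headphone simulation. All frequencies,
# durations, and gains are assumptions chosen for illustration.
import numpy as np

fs = 44100
t = np.arange(int(0.5 * fs)) / fs                  # 500 ms of samples

def square_wave(freq_hz, phase=0.0):
    return np.sign(np.sin(2 * np.pi * freq_hz * t + phase))

left_flank = square_wave(440.0)                    # flanking square waves at the
right_flank = square_wave(440.0, phase=np.pi)      # same frequency, out of phase
probe = np.sin(2 * np.pi * 466.16 * t)             # central probe, ~1 semitone higher

# The probe goes equally to both ears so it localizes to the center; the
# flankers are panned apart. Scale well below 1.0 to avoid clipping.
left = 0.25 * (0.8 * left_flank + 0.5 * probe)
right = 0.25 * (0.8 * right_flank + 0.5 * probe)
stereo = np.stack([left, right], axis=1)           # e.g. soundfile.write("stim.wav", stereo, fs)
```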

Finally, this same sort of phenomenon could also be tested in the tactile setting, using different objects to poke the fingertips of a blindfolded subject at various degrees of separation. It should be clear how the paradigm adapts to this scenario.

4 Conclusion

We have now answered the questions posed at the beginning of the paper up to the current state of the art of the field. We have identified attention as a mechanism that enhances a selected subset of a stimulus, usually corresponding to an “object”, while suppressing (albeit incompletely) the remainder. We have seen how the attended portion of the stimulus can be selected based on bottom-up salience or top-down task relevance information. We have traced its course through the brain in both space and time, and even identified mechanisms which could be implemented in neural wetware and account for the experimental data that has been seen.

And yet, we have very little confidence that any of this is actually the Way Things Work. The state of the art is still woefully far from comprising a cohesive and predictive theory of attention accounting cleanly and elegantly for 100% of the evidence. Indeed, the sort of evidence we are able to collect from human subjects is very limited and it can be very difficult at times to use the data to differentiate between various theories. Consider how different Treisman’s explanation for the Broadbent dichotic listening experiment was from that of Deutsch and Deutsch. Such competing and as yet indistinguishable models are arising all over the field, and everyone has their own pet model to defend. The evidence is so limited as yet that anyone can come up with something new that fits the current data, then just run it up the flagpole and see who salutes.

And there are many different aspects of attention that can be modeled independently, which forces one to dig quite deeply to tie together any sort of cohesive theory. We are aware that attention is applicable to all sensory modes, and yet only the visual and auditory modes have received a great amount of study, and efforts to test theories generalized from one mode on the other modes have been limited. Numerous models have been warring just to explain the tiniest portion of the attentional process, for instance, salience in pre-attentional search.

What comes so naturally to us as humans, and happens so quickly without any sort of conscious effort, is an extremely complex process composed of many subprocesses still complex enough to be debated. It’s been half a century since this line of research began in earnest, and now, standing on this mountain of evidence we have collected, we can’t see any farther than we could back when William James’s colleagues were experimenting on themselves using nothing but introspection and the occasional stopwatch. It’s sad to say it, but nothing has been settled in a satisfactory way.

References

[1] E. Awh and H. Pashler. Evidence for split attentional foci. J Exp Psychol Hum Percept Perform, 26:834–46, 2000.

[2] D.E. Broadbent. Perception and communication. Pergamon Press, London, 1958.

[3] R. Desimone. Visual attention mediated by biased competition in extrastriate visual cortex. Philos Trans R Soc Lond B Biol Sci, 353(1373):1245–1255, 1998.

[4] R. Desimone and J. Duncan. Neural mechanisms of selective visual attention. Annu. Rev. Neurosci., 18:192–222, 1995.

[5] J.A. Deutsch and D. Deutsch. Attention: some theoretical considerations. Psychological Review, 70:80–90, 1963.

[6] H. Egeth and W. Bevan. Attention. In B. B. Wolman, editor, Handbook of General Psychology, pages 395–418. New York, 1973.

[7] M. Behrmann et al. Parietal cortex and attention. Current Opinion in Neurobiology, 14:212–217, 2004.

[8] P. Cavanagh, S. He, and J. Intriligator. Attentional resolution: The grain and locus of visual awareness. In C. Taddei-Ferretti and C. Musio, editors, Neuronal Basis and Psychological Aspects of Consciousness, pages 41–52. World Scientific, 1999.

[9] S.A. Hillyard et al. Electrical signs of selective attention in the human brain. Science, 182(4108):177–180, 1973.

[10] R.F. Hink et al. Event-related brain potentials and selective attention to acoustic and phonetic cues. Biological Psychology, 6(1):1–16, 1978.

[11] M.R. Hilimire et al. Competitive interaction degrades target selection: An ERP study. Psychophysiology, 46:1080–1089, 2009.

[12] R.F. Hink, W.H. Fenton, A. Pfefferbaum, et al. The distribution of attention across auditory input channels: an assessment using the human evoked potential. Psychophysiology, 15(5):466–473, 1978.

[13] C. Jacques and B. Rossion. The time course of visual competition to the presentation of centrally fixated faces. Journal of Vision, 6:154–162, 2006.

[14] W. James. The Principles of Psychology, volume 1, chapter Chapter XI: Attention, pages 402–404. Henry Holt, New York, 1890.

[15] W. James. The Principles of Psychology, volume 1, chapter Chapter XI: Attention, pages 436–437. Henry Holt, New York, 1890.

[16] L. Kleine-Horst. Empiristic Theory of Visual Gestalt Perception (ETVG), chapter 6.IV. What "is" attention? Self-published, Cologne, 2001.

[17] M.I. Posner. Orienting of attention. Quarterly Journal of Experimental Psychology, 32:3–25, 1980.

[18] J. Theeuwes. Visual selective attention: A theoretical analysis. Acta Psychologica, 83:93–154, 1993.

[19] A. Treisman. Contextual clues in selective listening. Quarterly Journal of Experimental Psychology, 12:242–248, 1960.

[20] A. Treisman and G. Geffen. Selective attention: Perception or response? Quarterly Journal of Experimental Psychology, 19:1–18, 1967.

[21] A. Treisman and G. Gelade. A feature integration theory of attention. Cognitive Psychology, 12:97–136, 1980.

[22] J.K. Tsotsos. Analyzing vision at the complexity level. Behavioral and Brain Sciences, 13(3):423 – 445, 1990.

[23] M.C. Usai, C. Umiltà, and R. Nicoletti. Limits in controlling the focus of attention. European Journal of Cognitive Psychology, 7(4):411–439, 1995.

[24] J.M. Wolfe. Guided Search 4.0: A guided search model that does not require memory for rejected distractors. Journal of Vision, 1(3):349, 2001.