Defining attention from an auditory perspective

This is an open access article under the terms of the Creative Commons Attribution‐NonCommercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits use, distribution, and reproduction in any medium, provided the original work is properly cited and is not used for commercial purposes.

Abstract

Attention prioritizes certain information at the expense of other information in ways that are similar across vision, audition, and other sensory modalities. It influences how—and even what—information is represented and processed, affecting brain activity at every level. Much of the core research into cognitive and neural mechanisms of attention has used visual tasks. However, the same top‐down, object‐based, and bottom‐up attentional processes shape auditory perception, largely through the same underlying cognitive networks.

This article is categorized under:

Psychology > Attention

Keywords: auditory attention, cortical attention networks, endogenous attention, exogenous attention

In settings with multiple sources, attention mediates which object (the perceptual estimate of an external, physical source) gets processed in detail. Attention depends on the bottom‐up salience of the competing sources, the top‐down volitional focus of the observer, and the object‐based perceptual organization of the sensory inputs, both over time and across different peripheral sensory channels.

1. INTRODUCTION

Attention is a set of processes that modulate what information gets represented in the brain. These processes act similarly—and are even shared—across auditory and other sensory modalities.

In 1953, Colin Cherry ignited research on the cocktail party problem, describing how attention alters perception in a crowded setting with multiple sound sources (Cherry, 1953). However, by the 1990s, many hearing researchers, driven to optimize information preserved in telephone communication and to understand the impact of hearing loss, had turned to developing quantitative, bottom‐up models that assumed an ideal observer (Egan, 1971; Fletcher & Galt, 1950; Henning, 1967; Siebert, 1970). This approach specifically ignores any role of central cognitive processes like memory and attention. Ideal observer models account well for performance on simple psychoacoustic tasks, but not for how we cope with the cacophony of sounds in daily life.
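To make this concrete, the sketch below shows the flavor of an ideal‐observer prediction for detecting a known tone in Gaussian noise, following classic signal detection theory. It is a minimal illustration in Python, not a reimplementation of any of the cited models; the function names and parameter values are ours. The point is that performance depends only on stimulus statistics, with no role for memory or attention.

```python
import numpy as np
from scipy.stats import norm

# Illustrative ideal-observer sketch (classic signal detection theory),
# not any cited model. For a known signal in white Gaussian noise,
# sensitivity is d' = sqrt(2 * E / N0), where E is signal energy and N0
# is the noise power spectral density.
def ideal_dprime(signal_energy, noise_density):
    return np.sqrt(2.0 * signal_energy / noise_density)

def two_afc_accuracy(dprime):
    # Predicted two-alternative forced-choice accuracy: Phi(d' / sqrt(2)).
    return norm.cdf(dprime / np.sqrt(2.0))

# Detection improves lawfully with signal-to-noise ratio; nothing in the
# model depends on central factors such as attention.
for energy in (0.5, 1.0, 2.0):  # arbitrary example signal energies
    d = ideal_dprime(energy, noise_density=1.0)
    print(f"E/N0 = {energy:.1f}: d' = {d:.2f}, "
          f"2AFC accuracy = {two_afc_accuracy(d):.2f}")
```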

Because vision scientists dominated attention studies in the ensuing decades (Corbetta & Shulman, 2002; Hopfinger et al., 2000; Posner & Petersen, 1990; Treisman, 2006; Yantis, 2008), many discussions of attention and its underlying brain mechanisms focused on results, paradigms, and stimuli from visual research. For instance, many visual paradigms present a static scene of competing objects that, to make tasks demanding, is flashed only briefly. Auditory information, however, is intrinsically temporal: information is conveyed primarily by changes in time‐varying acoustic signals, such as amplitude and frequency modulation. There is no meaning in a “snapshot” of auditory stimuli. The dominance of visual studies led to an emphasis on how visual features such as spatial adjacency influence the deployment of attention. Conversely, there was little work examining how attention operates through time to track dynamic objects, an issue critical to following an auditory source that extends through time (known as a “stream”). Here, as auditory researchers, we argue that most attention effects identified by visual studies operate analogously in audition; there are, however, key temporal phenomena critical to auditory attention that likely shape visual attention as well.

Another key issue is that attentional abilities differ across individuals. For instance, people with sensorineural hearing deficits or who use cochlear implants often have trouble focusing auditory attention (Dai et al., 2018; Shinn‐Cunningham & Best, 2008). Certain neurological disorders can compromise the top‐down cognitive control important for focusing attention, including post‐traumatic stress disorder (Bressler et al., 2017; Leskin & White, 2007; Vasterling et al., 1998), autism spectrum disorder (Schwartz et al., 2020), schizophrenia (Mathalon et al., 2004), and attention deficit hyperactivity disorder (Hasler et al., 2016). Understanding the mechanisms of attention, including how attention operates through time, is thus important for developing treatments and interventions to help individuals who have trouble deploying attention, whether in vision or audition.

2. DESCRIBING ATTENTION

Attention prioritizes certain information at the expense of other information. It determines whether we notice the broccoli on our plate or the serving dish it came from, and whether we register the gravel under our feet as we walk in the park. Attention depends not only on how well we focus volitionally (top‐down attention), but also on automatic responses to salient events (bottom‐up attention). It can be used to prepare for an event, to follow an information source over time, and to reorient after a distraction.

Attention allows us to cope with the fact that human cognition is capacity limited. We cannot fully process the barrage of information reaching us; instead, our brains favor processing information that is “important.” In top‐down attention, the observer consciously decides what to process based on their goals, while in bottom‐up attention, the brain blithely ignores predictable, expected information while automatically prioritizing new and unexpected events (e.g., a crash of thunder or flash of lightning). The schematic in Figure 1 shows examples of top‐down, object‐based, and bottom‐up attention in an auditory scene with several sources.

FIGURE 1

Auditory stimuli can be considered in a multidimensional space, with axes including time, spatial features (such as interaural differences and spectral location cues), and nonspatial features, such as pitch and timbre, here collapsed into a single dimension for visualization. This figure shows examples of how attention might behave in an auditory scene with several sources, depicted as acoustic waveforms. (a) A prestimulus cue, such as “Listen to the left stream”, can orient a listener to engage top‐down attention (visualized as the highlighted area on the plots) towards an upcoming stream with the desired spatial features. Once that stream (illustrated in red) is selected, attention focuses down to also emphasize other, nonspatial features of that stream. (b) Object‐based attention allows the listener to follow the object across momentary silent gaps. (c) When two sound sources (illustrated in red and yellow) are near each other in feature space, they can be confused: Attention may begin tracking the distractor instead of the target, especially at moments of ambiguity, such as an instantaneous silence in the target. (d) The sudden appearance of a new stream (illustrated in purple) can capture attention involuntarily (via bottom‐up processes), regardless of its similarity to the previous attentional target.

Mechanisms of attention are distributed across every stage of cognitive processing. Throughout the cortex, feedforward and feedback connections increase the representation of attended stimuli and decrease the representation of ignored stimuli (Yantis, 2008). Attention can be directed either externally, to sensory input, or internally, to a memory representation (Panichello & Buschman, 2021), and can even modify what gets encoded into memory (Payne et al., 2013; Payne & Sekuler, 2014).

3. TOP‐DOWN ATTENTION

Top‐down attention allows us to consciously focus on upcoming events, select a specific feature in a scene, or search for a target amongst distractors. It is top‐down attention that, in a setting with competing sensory inputs, allows us to deliberately bias what is enhanced and what is suppressed. For instance, Figure 1a illustrates the process of a listener preparing to attend to a source from a known (desired) location; once that stream begins, attention selects that stream by focusing an “attentional spotlight.”

When people successfully focus attention on a target, the neural responses to that target are magnified, and those to competing distractors are suppressed (Clark & Hillyard, 1996; Foster et al., 2020; Woldorff et al., 1993). For example, when subjects are instructed to listen to one source amongst overlapping sounds, attention enhances the event‐related potential (ERP) evoked by the onsets of events in the attended source and reduces ERPs to events in the other sources (Choi et al., 2014; Hillyard et al., 1973).

Figure 2a shows a typical paradigm for studying top‐down attention in audition. A listener is cued to attend to one spatial location. Then, three melodies of complex tones are presented from the listener's left, center, and right. Figure 2b shows grand average ERPs recorded during such a task. Onset times of tones in the left and right melodies are shown by the blue and red vertical lines, respectively. The traces show the average voltage over the course of a trial. The negative peaks approximately 0.15 s after each tone's onset reflect sensory processing. When subjects attend to the left melody (blue traces), the peaks elicited by the left tones (blue circles) are larger than when subjects attend to the right stream (red traces); the converse is true for peaks elicited by the right (red) stream. Many studies with similar results demonstrate that attention modulates neural responses, explicitly altering information's representation in the brain. Note that the very first tone onset elicits equally strong responses regardless of the task instructions, an example of bottom‐up attentional capture (see below).

FIGURE 2

(a) Schematic of a typical top‐down auditory attention task. A subject is cued to attend to one of three spatially separated streams of sound and report its contents. (b) ERP traces (electrode Cz) while subjects attend to either a left‐spatialized melody (onset times shown by blue vertical lines) or a right‐spatialized melody (red lines). The auditory cue telling subjects to listen to either the left or right stream occurs at the time denoted by the vertical black line. When subjects attend to the left stream (blue traces), negative‐going ERPs elicited by notes in the left melody are enhanced (blue circles) and those elicited by notes from the right are suppressed, and vice versa. Data from Choi et al. (2014).
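As a sketch of how traces like those in Figure 2b come about, the Python fragment below epochs single‐electrode EEG around each tone onset, baseline‐corrects, and averages. Array names, shapes, and analysis windows are our illustrative assumptions, not the pipeline of Choi et al. (2014).

```python
import numpy as np

# Hypothetical epoching-and-averaging sketch for one electrode (e.g., Cz).
# `eeg` is an (n_trials, n_samples) array sampled at rate fs; `onsets`
# lists tone-onset times (in seconds) for one stream. These names are
# assumptions for illustration only.
def average_erp(eeg, fs, onsets, pre=0.1, post=0.4):
    n_pre, n_post = int(pre * fs), int(post * fs)
    epochs = []
    for trial in eeg:
        for t in onsets:
            i = int(round(t * fs))
            if i - n_pre < 0 or i + n_post > trial.size:
                continue  # skip onsets too close to the trial edges
            ep = trial[i - n_pre:i + n_post]
            epochs.append(ep - ep[:n_pre].mean())  # pre-onset baseline
    return np.mean(epochs, axis=0)  # mean time course, -pre to +post s

# Averaging the same left-stream onsets separately for attend-left and
# attend-right trials should show a larger negative deflection ~0.15 s
# after onset when the left stream is attended:
# erp_attended = average_erp(eeg_attend_left, fs, left_onsets)
# erp_ignored = average_erp(eeg_attend_right, fs, left_onsets)
```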

In vision studies, researchers distinguish between focusing attention on a particular location (e.g., 15 degrees left of fixation) and focusing on some nonspatial attribute (e.g., a red object). In part, this reflects the fact that visual inputs at the retina are encoded according to where objects are relative to an observer's eyes (an organization that is topographically preserved throughout the visual system), while other features must be centrally computed. In contrast, the auditory system is not organized by spatial location. Sound location is itself a feature that must be computed from the signals reaching the two ears. Still, listeners can focus attention on sounds based on either location or other features, such as pitch, timbre, or talker identity. And, as we discuss below, even though both spatial and nonspatial acoustic features must be centrally computed, spatial auditory attention recruits different brain networks than does nonspatial auditory attention, paralleling functional differences between spatial and feature‐based attention in vision.

Most headphone‐based studies of auditory spatial attention use simplified spatialization cues (such as a pure time delay between the signals at the two ears), ignoring the frequency‐dependent differences in the levels and timing of the signals that would reach the left and right ears in the real world. Many even use dichotic presentations, in which entirely separate streams of information are presented to each ear (Hashimoto et al., 2000; Jäncke et al., 2003; Kimura, 1967; Treisman, 1960). Listeners can successfully direct spatial attention even with such unnatural, degraded spatialization (Baumgartner et al., 2017; Ross et al., 2010). Some dichotic studies find that people can follow the stream in the right ear better than the left‐ear stream, an effect that has been ascribed to left‐lateralized language specialization in the brain (Hiscock & Kinsbourne, 2011; Kimura, 1961). However, recent work suggests that the right‐ear bias may instead arise from an attentional bias favoring the right hemifield (Payne et al., 2017; Tanaka et al., 2021). In addition, it is not clear whether such asymmetries arise with natural binaural spatial cues; at least one study has shown that unrealistic spatial simulations do not fully engage the brain networks devoted to spatial auditory attention (Deng, Choi, et al., 2019).
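To illustrate how spare such laboratory spatialization can be, the sketch below lateralizes a sound using nothing but an interaural time difference: one ear simply receives a delayed copy of the other ear's signal, with none of the frequency‐dependent level and timing cues of real listening. All values are illustrative.

```python
import numpy as np

# Minimal pure-ITD spatialization sketch. Real binaural cues are frequency
# dependent and include interaural level differences; none of that is
# modeled here.
def apply_itd(mono, fs, itd_seconds):
    """Return (n_samples, 2) stereo; a positive ITD makes the right ear lead."""
    n = int(round(abs(itd_seconds) * fs))
    leading = np.concatenate([mono, np.zeros(n)])
    lagging = np.concatenate([np.zeros(n), mono])
    left, right = (lagging, leading) if itd_seconds > 0 else (leading, lagging)
    return np.stack([left, right], axis=1)

fs = 44100
t = np.arange(int(0.3 * fs)) / fs
tone = 0.1 * np.sin(2 * np.pi * 500.0 * t)        # 300 ms, 500 Hz tone
stereo = apply_itd(tone, fs, itd_seconds=500e-6)  # 0.5 ms right-leading ITD
```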

4. OBJECT‐BASED ATTENTION

When a person focuses on an object because it has a desired feature (such as its location or pitch), top‐down attention tends to focus on the entire object. In other words, the natural unit of attention seems to be an object: a collection of features that the brain believes came from the same external source (Duncan, 1984; Shinn‐Cunningham, 2008). Figure 1a illustrates this idea: attention focuses initially on one location; however, once a stream at that location becomes the focus of attention, attentional focus narrows onto the nonspatial features of the attended source.

Relatedly, auditory objects extend across time, often even across silences. For instance, speech includes momentary gaps, yet is perceived as one perceptual stream. Once we attend an element of a stream (e.g., a syllable in ongoing speech), that stream tends to remain the focus of attention over time, automatically (Best et al., 2008, 2010; Billig & Carlyon, 2016; Bressler et al., 2014; Woods & McDermott, 2015), and attention becomes more tightly focused to that stream (Choi et al., 2014; Golumbic et al., 2013). Figure 1b illustrates this idea: attention remains focused to pick up the next syllable from the ongoing stream even after a momentary gap.

When two streams are similar in spatial and other features, they can be confused across time. Figure 1c illustrates this: a competing stream that is similar to the attended stream in high‐dimensional feature space can be confused with the attended stream, especially after a momentary silence in the attended stream. When a listener is asked to focus attention on the speech from one location (e.g., “always report the words spoken from the left”) and the left talker suddenly switches locations with a competing talker, subjects often track the original talker rather than the original location (Mehraei et al., 2018). That is, successful attention relies on object formation and segregation through time, and object‐based attention can even override volitional attention.

This critical role of attention as a mechanism for following an object through time is most evident in the auditory system, where the temporal dynamics of scenes and stimuli cannot be ignored. However, this also affects vision and other senses. One of the most challenging visual attention paradigms—multiple object tracking—requires subjects to use attention to track moving objects among identical distractors (Alvarez & Cavanagh, 2004; Pylyshyn & Storm, 1988). Similarly, everyday interactions with our surroundings require us to attend to objects over time: to catch a ball, to judge whether it is safe to cross a street, or to fill a water glass without it overflowing.

5. BOTTOM‐UP ATTENTION

Attention can also be drawn by unusual or salient sensory inputs. For example, a plate smashing on the floor involuntarily captures attention. Automatic attentional capture occurs even for less dramatic events, such as the onset of a tone or appearance of an image, disrupting top‐down and object‐based attention (Bulger et al., 2020; Maryott et al., 2011; Noyce & Sekuler, 2014a). Figure 1d shows an example of a new sound source's onset grabbing attention involuntarily, away from the previously attended object.

In audition, involuntary attentional capture interferes with top‐down auditory attention regardless of the spatial location of the interrupting event. Figure 3 illustrates this effect when a listener is attending a stream comprising three speech syllables from one direction and ignoring a similar, competing stream from the opposite hemifield (taken from Liang et al., 2022). Top‐down attention allows the listener to select one stream, and object‐based attention lets them follow it over time. But when an unexpected sound (a cat “MEOW”) occurs just before the second target syllable, it captures attention, disrupting the ongoing processing and requiring the listener to reorient back to the intended target. This interference is equally strong whether the MEOW comes from the same hemifield as the target or the opposite hemifield.

FIGURE 3

(Left) Schematic of the spatial locations of two competing speech streams from the same talker, and, when present, an interrupting cat “MEOW.” Subjects direct top‐down spatial attention to follow either the left or right speech stream. The interrupting MEOW, which occurs on a random one‐fourth of trials, appears from either the same or the opposite hemifield as the target stream. (Right) Syllable identification accuracy. The MEOW significantly affects the recall of not only the second and third syllables, but also the first. Critically, contralateral and ipsilateral interrupters are equally disruptive.

Another interesting point demonstrated in Figure 3 is that salient interruptions interfere not only with attention per se, but also with storing attended information in memory. In particular, the interrupting MEOW harms performance on the first target syllable, which finishes playing before the MEOW even begins (Liang et al., 2022). Indeed, roughly two‐thirds of the 45 subjects tested in this study showed a significant decrease in their ability to recall the first syllable due to the interrupting MEOW. Thus, attentional capture not only interferes with ongoing attention to a stream, but also with the memory encoding and storage of items that already occurred (see also Section 7).

6. BRAIN NETWORKS CONTROLLING ATTENTION

The most common framework for understanding the brain regions that participate in attention posits two anti‐correlated networks, one for top‐down attentional control, and one for bottom‐up attentional capture (Corbetta & Shulman, 2002; Fox et al., 2005; Power et al., 2011; Yeo et al., 2011). However, evidence for these two networks comes primarily from visual attention studies, in part due to the challenges of auditory studies in fMRI (Peelle, 2014). (See Lee et al., 2014, for a review of auditory exceptions.)

We find that there are at least two similar but differently specialized networks recruited for top‐down attention (Figure 4). One, the well‐established frontoparietal network, is specialized for visual and spatial attention. However, a second, auditory‐biased attention network includes distinct regions in lateral frontal cortex that interleave anatomically with the frontoparietal network (Braga et al., 2013; Michalka et al., 2015; Noyce et al., 2017; Tobyne et al., 2018). These complementary networks extend broadly throughout frontal cortex (Noyce et al., 2021; Tobyne et al., 2017).

FIGURE 4

Maps of bilateral visual‐biased and auditory‐biased regions in frontal cortex. Three visual‐biased regions and five auditory‐biased regions control attention and working memory tasks. Adapted from Noyce et al. (2021).

The complementary affinities of vision for spatial information and audition for timing information play out in how tasks recruit sensory‐biased cortical networks. A rhythm memory task with purely visual stimuli recruits not only the visual‐biased frontal network, but also the auditory‐biased network (Michalka et al., 2015). A spatial memory task with purely auditory stimuli recruits the auditory‐biased network, as well as visual‐biased regions of both lateral frontal cortex (Michalka et al., 2015) and anterior parietal cortex (Michalka et al., 2016). Magnetoencephalography (MEG) and electroencephalography (EEG) reveal further evidence of parietal recruitment for spatial—but not nonspatial—auditory attention (Lee et al., 2012). Contralateral alpha‐band (8–14 Hz) oscillations over parietal cortex, long associated with visual–spatial attention (Klimesch et al., 1999; Worden et al., 2000), also occur during auditory spatial attention, even tracking syllabic events over time (Bonacci et al., 2019, 2020; Deng et al., 2020; Wöstmann et al., 2021). Contralateral transcranial alternating current stimulation (tACS) at alpha frequencies disrupts spatial but not nonspatial auditory attention (Deng, Reinhart, et al., 2019; Wöstmann et al., 2018).
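A minimal sketch of this kind of alpha‐lateralization analysis appears below. It is our illustrative reconstruction, not the pipeline of any study cited above: EEG is band‐passed to 8–14 Hz, the Hilbert envelope yields instantaneous alpha power, and a lateralization index compares parietal electrodes over the two hemispheres. The electrode groupings and filter settings are assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

# Illustrative alpha-lateralization sketch (not any cited study's code).
# `eeg` is (n_channels, n_samples); `left_chs` and `right_chs` index
# parietal electrodes over the left and right hemispheres.
def alpha_power(eeg, fs, band=(8.0, 14.0)):
    b, a = butter(4, np.asarray(band) / (fs / 2.0), btype="bandpass")
    filtered = filtfilt(b, a, eeg, axis=-1)
    return np.abs(hilbert(filtered, axis=-1)) ** 2  # instantaneous power

def alpha_lateralization(eeg, fs, left_chs, right_chs):
    p = alpha_power(eeg, fs)
    right, left = p[right_chs].mean(), p[left_chs].mean()
    # The sign convention is a choice. In the studies above, alpha power is
    # typically reduced over the hemisphere contralateral to the attended
    # side, so attending left should lower right-hemisphere alpha.
    return (right - left) / (right + left)
```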

Regardless of whether spatial or nonspatial features guide auditory selective attention, the net result is similar: information about the attended objects is represented more robustly in the brain. This is seen as a relative enhancement of ERPs elicited by attended objects compared to those elicited by unattended objects (as illustrated in Figure 2), stronger neural entrainment to the envelope of an attended versus an unattended stream (Golumbic et al., 2013; Mesgarani & Chang, 2012), and increased success in decoding neural activity to reconstruct an attended versus an unattended stimulus (Bednar & Lalor, 2020; Mesgarani & Chang, 2012).
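Such decoding results are often obtained with a linear “backward model” that reconstructs the attended speech envelope from time‐lagged neural recordings. The sketch below is a generic ridge‐regression version of that idea; shapes, lag counts, and regularization are chosen purely for illustration and do not reproduce any cited study.

```python
import numpy as np

# Generic "backward model" sketch: reconstruct a speech envelope from
# time-lagged neural data via ridge regression, then label a trial by
# whichever stream's envelope better matches the reconstruction.
def lag_matrix(neural, n_lags):
    """neural: (n_channels, n_samples) -> (n_samples, n_channels * n_lags)."""
    n_ch, n_t = neural.shape
    X = np.zeros((n_t, n_ch * n_lags))
    for k in range(n_lags):
        X[k:, k * n_ch:(k + 1) * n_ch] = neural[:, :n_t - k].T
    return X

def train_decoder(neural, attended_envelope, n_lags=32, ridge=1e3):
    X = lag_matrix(neural, n_lags)
    return np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]),
                           X.T @ attended_envelope)

def classify_attention(neural, env_a, env_b, w, n_lags=32):
    recon = lag_matrix(neural, n_lags) @ w
    r = [np.corrcoef(recon, env)[0, 1] for env in (env_a, env_b)]
    return ("a" if r[0] > r[1] else "b"), r  # attended stream should win
```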

Bottom‐up attention has been ascribed to a ventral attention network, which includes nodes in cingulate, temporoparietal, and opercular cortex, but this work has almost entirely been done in visual paradigms (Corbetta & Shulman, 2002; Devaney et al., 2019; Dosenbach et al., 2007; Seeley et al., 2007). While few studies have investigated the brain networks controlling attentional reorienting to unexpected auditory events, at least one MEG study hints that the “visual” ventral network is multisensory: auditory attention switching engages a region within this canonical bottom‐up attention network (Larson & Lee, 2013).

7. ATTENTION AND MEMORY

Attention, especially top‐down attention, and working memory (the ability to hold and manipulate information) are intricately linked (Bettencourt et al., 2011; Panichello & Buschman, 2021). Working memory requires attention to be directed to an internal representation that is maintained after sensory inputs are gone (Lim et al., 2015), and if attention is disrupted, the memory representation degrades (Payne et al., 2013; Unsworth & Robison, 2016). Attentional limits likely restrict working memory capacity.

Attention also affects how objects are encoded into memory. Attended objects tend to be remembered (Maryott et al., 2011; Noyce & Sekuler, 2014b); relatedly, events that disrupt attention also impair recall of events surrounding the disruption (Lim et al., 2019). Even when an object is attended, not all of its features will get into memory: attended features are encoded more robustly (e.g., Noyce et al., 2016).

Figure 5 shows discrimination performance when subjects are asked to attend to either the locations or the time intervals of a short sequence composed of either visual or auditory events; they must then compare their stored representation of the sequence to a subsequent, similar sequence of events in the other sensory modality. Performance shows a significant interaction between sensory modality and the feature to be recalled. As we discussed above, attention co‐opts the specialized neural circuitry for visual cognition when performing spatial tasks, and for auditory cognition when performing temporal tasks (Michalka et al., 2015, 2016). Consistent with this, it is easier to attend to and store spatial information from visual inputs, and easier to attend to and store timing information from auditory signals (Noyce et al., 2016), as supported by better performance in conditions relying on these “natural” memory representations.

FIGURE 5

Memory performance (d′) for short sequences, when subjects are cued to track either the locations (left) or the timing (right) of events in a sequence. Subjects are better at encoding locations from a visual than an auditory sequence, but better at encoding timing from an auditory than a visual sequence. Adapted from Noyce et al. (2016).
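For reference, the d′ values plotted in Figure 5 follow the standard signal detection formula, sketched below with invented hit and false‐alarm rates; this is the generic measure, not the exact analysis of Noyce et al. (2016).

```python
from scipy.stats import norm

# Standard sensitivity measure from signal detection theory:
# d' = z(hit rate) - z(false-alarm rate). The example rates are invented.
def dprime(hit_rate, false_alarm_rate):
    return norm.ppf(hit_rate) - norm.ppf(false_alarm_rate)

print(round(dprime(0.85, 0.20), 2))  # 1.88: reliable but imperfect memory
```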

8. CONCLUSION

Attention performs essential pre‐processing for the rest of human cognition. It is a tug of war between our desires and our distractions, in a finely tuned and complex interplay between top‐down, bottom‐up, and object‐based attention. This allows us to select, to focus, to remember… while not getting eaten by a bear or hit by a taxi. After attention filters important information from the barrage of neural events, the brain can analyze those important events in detail, letting us understand speech, appreciate a sunset, or merely cross a busy intersection unscathed.

AUTHOR CONTRIBUTIONS

Abigail Noyce: Conceptualization (equal); investigation (equal); supervision (supporting); visualization (equal); writing – original draft (lead); writing – review and editing (equal). Jasmine Kwasa: Conceptualization (supporting); investigation (equal); visualization (equal); writing – review and editing (supporting). Barbara Shinn‐Cunningham: Conceptualization (supporting); funding acquisition (lead); investigation (equal); project administration (lead); supervision (lead); visualization (equal); writing – original draft (supporting); writing – review and editing (lead).

FUNDING INFORMATION

This work was supported by NIDCD R01 DC013825 (to Barbara G. Shinn‐Cunningham), ONR project N00014‐20‐1‐2709 (to Barbara G. Shinn‐Cunningham), and NINDS F99 NS115331 (to Jasmine A. C. Kwasa).

CONFLICT OF INTEREST

The authors have declared no conflicts of interest for this article.

RELATED WIREs ARTICLE

Notes

Noyce, A. L., Kwasa, J. A. C., & Shinn‐Cunningham, B. G. (2023). Defining attention from an auditory perspective. WIREs Cognitive Science, 14(1), e1610. https://doi.org/10.1002/wcs.1610

Edited by: Anna Fisher

DATA AVAILABILITY STATEMENT

Data sharing is not applicable to this article as no new data were created or analyzed in this study.

REFERENCES