In peripheral vision, objects in clutter are difficult to identify. The exact cause of this “crowding” effect is unclear. To perceive coherent shapes in clutter, the visual system must integrate certain local features across receptive fields while preventing others from being combined. It is believed that this selective feature integration–segmentation process is impaired in peripheral vision, leading to crowding. We used functional magnetic resonance imaging (fMRI) to investigate the neural origin of crowding. We found that crowding was associated with suppressed fMRI signal as early as V1, regardless of whether attention was directed toward or away from a target stimulus. This suppression in early visual cortex was greatest for stimuli that produced the strongest crowding. In contrast, the pattern of activity was mixed in higher level visual areas, such as the lateral occipital cortex. These results support the view that the deficiency in feature integration and segmentation in peripheral vision is present at the earliest stages of cortical processing.
In peripheral vision, flanking an otherwise identifiable object with other objects or patterns interferes with its identification (Bouma 1970; Anstis 1974; Flom 1991). This crowding phenomenon greatly reduces the utility of peripheral vision for everyday form-vision tasks such as reading and face and object recognition. The cause of crowding is under debate, but is believed to involve failures in the core visual processes of feature integration and segmentation. While behavioral studies have suggested loci of crowding at multiple stages of visual processing (Louie et al. 2007; Whitney and Levi 2011), knowing the earliest stage where crowding occurs can significantly constrain theories of peripheral form vision.
Several recent models of peripheral visual processing propose that peripheral representations in visual cortex capture only local statistics of the visual image (Parkes et al. 2001; Balas et al. 2009; Van den Berg et al. 2010; Freeman and Simoncelli 2011), resulting in the distorted percept associated with crowding. This reductive statistical representation is believed to be a consequence of pooling of image features within large receptive fields. Such theories typically suggest an origin of crowding after V1. In contrast, Neri and Levi (2006) as well as Pelli and Tillman (2008) suggest that imprecise feature binding in V1 is a component of crowding. Nandy and Tjan (2012) also propose V1 as the earliest source of crowding, due to inappropriate feature integration via horizontal connections.
Psychophysical experiments provide ambiguous evidence for crowding at the level of V1. The finding that crowding is reduced by placing target and flanker on opposite sides of the vertical meridian but not the horizontal meridian (Liu et al. 2009) suggests that the low-level origin of crowding can be either V1 or hV4, since both have a contiguous representation of the horizontal meridian. Several experiments have demonstrated a dependence of crowding on the perceived rather than physical stimulus (Dakin et al. 2011; Maus et al. 2011; Wallis and Bex 2011), with the interpretation that crowding occurs beyond V1 (Dakin et al. 2011). However, crowding affects orientation-specific adaptation even when target and flankers are removed from awareness (Ho and Cheung 2011), suggesting an earlier origin. While physiological effects of crowding have been reported for V2 and beyond (Motter 2006; Bi et al. 2009; Freeman et al. 2011), physiological evidence for a V1 locus of crowding is currently lacking. A single recent study found a correlation between blood oxygenation level–dependent (BOLD) adaptation in V1 and crowding (Anderson et al. 2012). However, this effect was measured while subjects detected changes of a crowded target. Hence, the effect may be percept-driven as opposed to input-driven. Whether there is a bottom-up, input-driven origin of crowding in V1 remains undetermined.
To investigate the neural origin of crowding, we used functional magnetic resonance imaging (fMRI) to measure BOLD signals in visual cortex resulting from crowded and noncrowded letter stimuli presented in the periphery. To overcome the difficulty of separating signals for small, closely spaced peripheral target and flankers (due to the small cortical magnification factor relative to imaging resolution), we designed the experiments and the associated analyses to use regions of interest that encompassed both the target and flankers. We examined the effect of crowding on BOLD signal for attended and unattended stimuli in separate experiments. In the first, subjects' attention was directed away from the letter stimuli with an unrelated task at fixation, to emphasize the automatic aspect of the crowding mechanism. In the second, they identified the target letter in crowded and noncrowded conditions, a common paradigm for assessing crowding in behavioral studies. Regardless of whether attention was directed to the letter stimuli, we observed a suppression of BOLD signal with crowding in early visual areas, including V1. A third experiment showed that this suppression was greatest for stimuli that induced the strongest crowding, and a fourth ruled out task difficulty and response accuracy as confounds. While crowding may involve multiple levels of visual processing, these results, taken together, argue for a significant role of the primary visual cortex in crowding.
Materials and Methods
All subjects were University of Southern California students with normal or corrected-to-normal vision. Subjects had definable retinotopic and higher visual processing areas using the techniques described below. The Institutional Review Board of the university approved the experimental protocol, and each subject provided written informed consent.
All stimuli were generated on a Macintosh computer, using MATLAB and the PsychToolbox (Brainard 1997; Pelli 1997). Stimuli consisting of 1, 2, or 3 uppercase Sloan letters were presented with the center letter at 5° eccentricity in the lower right or upper left quadrant of the subject's visual field, midway between the horizontal and vertical meridians. The target letter was presented alone or flanked by a letter on each side. Targets and flankers were randomly chosen from 4 letters: K, N, R, or S in Sloan font for Macintosh (provided by Denis Pelli). Target and flanker letters subtended 0.75° of visual angle and were presented at 100% contrast. Four target–flanker center-to-center separations were used: 0.9° (crowded condition), 1.5°, 2.25° (noncrowded condition), and infinity (target presented alone). Center-to-center spacing between target and flankers was manipulated in a manner that did not change the eccentricity of the flankers and at the same time avoided any overlap between target and flankers at the smallest letter spacing (Supplementary Fig. 1A). At a center-to-center spacing of 1 letter height (0.75°, a condition not tested), the target and flankers were arranged horizontally and thus in a configuration midway between radial and tangential relative to fixation. Larger letter spacings were generated by moving each flanker from this horizontal “home” position on an arc centered at fixation. Experiments 1, 2, and 4 tested only separations 0.9° (crowded) and 2.25° (noncrowded), while Experiment 3 tested all 4. We chose these stimuli because identification was comparable in the noncrowded condition to that for an isolated letter but was strongly impaired in the crowded condition (Supplementary Fig. 2). Stimuli were displayed on a 32 × 24-cm rear projection screen mounted perpendicularly to the toe–head axis in the bore of the magnet, directly above the subject's head. The viewing distance was 75 cm.
The background luminance of the display was set at 156 cd/m², and the maximum luminance at 100% contrast was 312 cd/m².
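The flanker placement described above (fixed flanker eccentricity, movement along an arc centered at fixation) can be illustrated with a short geometric sketch. This is not the authors' stimulus code: it assumes a coordinate system in degrees of visual angle with fixation at the origin, the lower right diagonal at a polar angle of −45°, and flankers that move away from the target along the arc. The function names are hypothetical.

```python
import math

TARGET_ECC_DEG = 5.0      # eccentricity of the center letter (from the text)
HOME_SPACING_DEG = 0.75   # horizontal "home" spacing of 1 letter height (from the text)

def target_position(polar_deg=-45.0):
    """Target center in degrees; -45 deg is assumed to be the lower right diagonal."""
    a = math.radians(polar_deg)
    return (TARGET_ECC_DEG * math.cos(a), TARGET_ECC_DEG * math.sin(a))

def flanker_position(spacing_deg, side, polar_deg=-45.0):
    """Flanker center for a given center-to-center spacing.

    side is -1 for the left flanker and +1 for the right flanker.
    The flanker keeps the eccentricity of its horizontal home position and
    slides along an arc centered at fixation until it lies `spacing_deg`
    from the target center.
    """
    tx, ty = target_position(polar_deg)
    hx, hy = tx + side * HOME_SPACING_DEG, ty      # home position
    r = math.hypot(hx, hy)                         # flanker eccentricity (kept fixed)
    home_angle = math.atan2(hy, hx)
    target_angle = math.radians(polar_deg)
    # Law of cosines: s^2 = r^2 + R^2 - 2 r R cos(delta)
    cos_delta = (r * r + TARGET_ECC_DEG ** 2 - spacing_deg ** 2) / (2 * r * TARGET_ECC_DEG)
    delta = math.acos(cos_delta)
    # Move away from the target, on the same side as the home position.
    direction = math.copysign(1.0, home_angle - target_angle)
    angle = target_angle + direction * delta
    return (r * math.cos(angle), r * math.sin(angle))
```

Under these assumptions, a spacing of 0.75° reproduces the horizontal home position, and larger spacings rotate each flanker about fixation without changing its eccentricity.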
All scans were performed using a 3 Tesla whole-body magnet (Siemens MAGNETOM Trio) at the USC Dana and David Dornsife Cognitive Neuroscience Imaging Center with a single-channel send–receive circular polarization (CP) coil for Experiment 3 and a 12-channel matrix coil running a simulated CP mode for Experiments 1, 2, and 4 (a major scanner upgrade occurred midcourse during the study; Experiment 3 was the first experiment). Functional scans were acquired using a gradient echo planar imaging (EPI) sequence (TR = 1000 ms, TE = 30 ms, flip angle = 65°) with an isotropic voxel resolution of 3 mm. Fourteen oblique slices oriented perpendicular to the calcarine sulcus were acquired to ensure that all early visual areas would be captured. Anatomical scans were acquired using a magnetization-prepared rapid acquisition gradient echo sequence (TI = 1100 ms, TR = 2070 ms, TE = 4.14 ms, flip angle = 12°) with a resolution of 1 × 1 × 1.2 mm.
BrainVoyager was used to preprocess the imaging data. Preprocessing of functional images included motion-correction, slice-timing correction, linear-trend removal, and high-pass temporal filtering. No spatial smoothing was applied. Intrasession functional scans were aligned to one another and co-registered with the intrasession anatomical scan. Intersession co-registration was achieved by co-registering the anatomical scans from each session. Aligned data were mapped to the flattened representation of the subject's cortical surface. Data from each scan were normalized by z-transform prior to analysis. Specifically, the time-course of each voxel in each run was normalized by first subtracting the mean voxel signal of an entire run, then dividing by the standard deviation of the voxel signal of the same run. This normalization is used to remove differences in signal baseline and fluctuation amplitude for each voxel across runs to improve model estimation. (Each run contained an equal number of each condition, such that this normalization does not bias any particular condition.)
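The per-run normalization described above amounts to a z-transform of each voxel's time-course within a run. The sketch below is a minimal illustration, not the authors' pipeline; the function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import fmean, pstdev

def zscore_run(timecourse):
    """Z-transform one voxel's time-course within a single run."""
    mu = fmean(timecourse)
    sd = pstdev(timecourse, mu)  # assumption: population SD (divide by N)
    if sd == 0:
        raise ValueError("constant time-course cannot be z-transformed")
    return [(x - mu) / sd for x in timecourse]

def zscore_voxels_by_run(runs):
    """runs: list of runs; each run is a list of voxel time-courses.

    Each voxel is normalized separately within each run, which removes
    run-to-run differences in baseline and fluctuation amplitude.
    """
    return [[zscore_run(voxel) for voxel in run] for run in runs]
```

Because each voxel is rescaled by its own run-wise mean and standard deviation, two runs that differ only in baseline or gain yield the same normalized time-course.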
Localization of Visual Areas
We used a standard paradigm to localize the lateral occipital complex (LOC; Kourtzi and Kanwisher 2000, 2001). Retinotopic visual areas were defined using rotating wedge and expanding ring stimuli (Engel et al. 1994, 1997).
BrainVoyager QX, BVQXToolbox (Brain Innovation, Maastricht, The Netherlands) and in-house MATLAB code were used for anatomical and functional data analysis, to extract BOLD time-courses from regions of interest (ROIs) of each individual subject. Unless otherwise stated, statistics were performed on the mean of the z-transformed signal within ROIs for V1, V2, V3, and hV4. In analyses for Experiments 2–4, we only included the dorsal representations of V1, V2, and V3 in the left hemisphere because stimuli were presented in the lower right visual field. For Experiments 1 and 2, ROIs for visual areas V1, V2, and V3 were defined by finding the corresponding subregion in each visual area that mapped to the retinal location of the stimulus, based on the subject's retinotopic map. These retinotopically defined ROIs represented a contiguous region in visual space that included the target and all possible flanker positions. Using this all-encompassing ROI allowed us to capture more comprehensively the effects of target–flanker interaction. Retinotopic polar angle and eccentricity maps were used to identify voxels corresponding to the stimulus (flankers and target) location. To ensure inclusion of the target and the flankers, ROIs were defined to include voxels that responded most strongly to eccentricities between 3° and 7° in the quadrant in which the stimulus was presented. Because of difficulties in mapping within hV4 due to distortions caused by vasculature (Winawer et al. 2010), the hV4 ROI was determined by stimulus-evoked activation. For Experiments 3 and 4, additional reference scans were used to help define regions of visual cortex that responded selectively to the area of visual space where the target and flanking letters appeared in the main experiment. Stimuli were single-letter presentations, displayed at any 1 of the 7 possible locations where a letter could appear in the main Experiment 3. 
Reference scans consisted of 24-s blocks of 1.5-s trials, blocked by letter position, and separate blocks for a fixation-only condition. ROIs were defined by restricting the localized retinotopic and LOC regions to subregions that were significantly activated by the stimuli, which included any of the 7 letter positions (from reference scans) and all possible target–flanker configurations (from the main experiment). To extract the mean time-course for each ROI, preprocessed fMRI data averaged across all voxels within an ROI were deconvolved against an indicator function formed by placing a Dirac delta function at each time-point to be estimated, with separate indicator functions for each event type. In Experiment 1, which used a block design, statistics were performed on the area under the resulting curve from 6 to 16 s poststimulus. For all other experiments, the peak of the time-course was estimated by fitting a difference-of-gamma function (Boynton and Finney 2003) and entered into statistical tests. SPSS and MATLAB code were used for group- and ROI-level statistical analyses. Paired t-tests were used to test for within-subject differences across stimulus conditions, and repeated-measures ANOVA was used to identify interactions between stimulus manipulations. The mean time-course across subjects for each ROI was fit separately for visualization. Standard error was calculated as the within-subjects error (Loftus and Masson 1994) to facilitate the visualization of within-subjects differences.
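The deconvolution step described above can be read as an ordinary least-squares fit with one indicator regressor per event type and poststimulus time-point (a finite impulse response model). The sketch below illustrates that reading under stated assumptions: onsets are given in samples (1 sample per TR), there is no baseline or nuisance regressor, and the normal equations are solved with a small hand-written Gaussian elimination. All names are hypothetical and this is not the authors' MATLAB code.

```python
def fir_design(n_timepoints, onsets_by_condition, n_lags):
    """One indicator column per (condition, lag): 1 where an event of that
    condition started `lag` samples earlier."""
    conditions = sorted(onsets_by_condition)
    X = [[0.0] * (len(conditions) * n_lags) for _ in range(n_timepoints)]
    for c, name in enumerate(conditions):
        for onset in onsets_by_condition[name]:
            for lag in range(n_lags):
                t = onset + lag
                if t < n_timepoints:
                    X[t][c * n_lags + lag] = 1.0
    return conditions, X

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def deconvolve(y, onsets_by_condition, n_lags):
    """Least-squares estimate of the event-related time-course per condition."""
    conditions, X = fir_design(len(y), onsets_by_condition, n_lags)
    p = len(conditions) * n_lags
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yt for row, yt in zip(X, y)) for i in range(p)]
    beta = solve_linear(XtX, Xty)
    return {name: beta[c * n_lags:(c + 1) * n_lags] for c, name in enumerate(conditions)}
```

With jittered, partially overlapping events and noise-free data, this recovers the per-condition responses exactly; with real data, the estimates are the least-squares time-courses from which a peak or an area under the curve could then be taken.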
We also examined the response in the LOC when subjects attended to the letter stimuli (Experiments 2–4). Using 2 different methods, we tested for the presence of subregions within LOC that were affected differentially by crowding. For each individual subject for whom LOC mapping data were available, we created a sign-of-difference map by contrasting the “crowded” (0.9° target–flanker separation) with “noncrowded” (2.25°) condition, using a general linear model (GLM) with an assumed BOLD impulse response function. We compared the number of subregions thus obtained in LOC to a null distribution (histogram of the number of subregions in a map produced by randomly assigning preference for crowded or noncrowded stimuli while preserving the spatial correlation between voxels, over 5000 simulations—see Supplementary Experimental Procedures for details), and obtained the empirical probability that the number of observed subregions could have occurred from a random distribution of the relative signal levels for the crowded and noncrowded conditions in LOC.
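The comparison against a null distribution described above has two ingredients: counting contiguous same-sign subregions in a sign-of-difference map, and locating the observed count within the simulated counts. The sketch below illustrates both on a toy 2D grid. It rests on several assumptions: the real maps live on a cortical surface rather than a grid, 4-connectivity is an arbitrary choice, and the procedure for generating spatially correlated null maps is in the Supplementary Experimental Procedures and is not reproduced here. Treating fewer subregions as the more extreme outcome is also an assumption.

```python
def count_subregions(sign_map):
    """Count 4-connected patches of equal sign in a 2D grid of +1 / -1 values."""
    rows, cols = len(sign_map), len(sign_map[0])
    seen = [[False] * cols for _ in range(rows)]
    n_regions = 0
    for r0 in range(rows):
        for c0 in range(cols):
            if seen[r0][c0]:
                continue
            n_regions += 1
            seen[r0][c0] = True
            stack = [(r0, c0)]
            while stack:  # flood fill over neighbors with the same sign
                r, c = stack.pop()
                for rr, cc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    if (0 <= rr < rows and 0 <= cc < cols
                            and not seen[rr][cc]
                            and sign_map[rr][cc] == sign_map[r][c]):
                        seen[rr][cc] = True
                        stack.append((rr, cc))
    return n_regions

def empirical_probability(observed_count, null_counts):
    """Fraction of null maps with as few or fewer subregions than observed.

    Assumption: a contiguous (non-random) layout shows up as an unusually
    small number of subregions.
    """
    return sum(1 for n in null_counts if n <= observed_count) / len(null_counts)
```

In use, `null_counts` would hold one subregion count per simulated map (5000 in the text), and a small returned value would indicate that the observed map is more contiguous than expected by chance.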
A group-level LOC analysis was also performed in Talairach coordinates using spatially smoothed data [full width at half maximum (FWHM) = 6 mm] from subjects who participated in any of Experiments 2–4. A common ROI for LOC was defined based on the LOC localizer runs of the subjects for whom LOC mapping data were available. This was done using a conventional voxelwise mixed-effects GLM, thresholded at a conservative false discovery rate (FDR) of 0.02. Within this common ROI, a contrast map for crowded and noncrowded conditions was determined using voxelwise mixed-effects GLM on spatially smoothed (FWHM = 6 mm) data from Experiments 2–4. This map was thresholded to achieve an FDR of 0.05 to identify the subregions of LOC where the response to the crowded condition was significantly different from that to the noncrowded condition.
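As an illustration of FDR thresholding, the sketch below applies the Benjamini-Hochberg step-up rule to a list of voxelwise p-values. The text does not state which FDR procedure was used, so the choice of Benjamini-Hochberg is an assumption, and the function names are hypothetical.

```python
def fdr_threshold(p_values, q):
    """Largest p-value that survives the Benjamini-Hochberg rule at level q.

    Returns None when no p-value passes.
    """
    m = len(p_values)
    threshold = None
    for rank, p in enumerate(sorted(p_values), start=1):
        if p <= q * rank / m:
            threshold = p  # keep the largest p that satisfies the rule
    return threshold

def fdr_mask(p_values, q):
    """True for each voxel whose p-value is at or below the FDR threshold."""
    threshold = fdr_threshold(p_values, q)
    if threshold is None:
        return [False] * len(p_values)
    return [p <= threshold for p in p_values]
```

A stricter level such as q = 0.02 (used for the common ROI) admits fewer voxels than q = 0.05 (used for the contrast map) when applied to the same p-values.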
Fifteen subjects (4 females) participated in Experiment 1. (One male subject was excluded from the analysis because the subject reported difficulty with the central task.)
Design and Procedure
Stimuli were either 3 letters (target and flankers) or 2 letters (flankers alone), in crowded and noncrowded configurations (Fig. 1). These 4 experimental conditions were displayed in a block design, with alternating 16-s blocks where stimuli were presented in the upper left and lower right visual fields, respectively (Supplementary Fig. 1B). During the first half of each 2-s trial, letter stimuli were counter-flickered twice for 2 cycles at 20 Hz in the periphery while a small colored square blinked twice at fixation, and subjects were asked to press the button that corresponded to the color of the square. For 6 subjects, the central stimulus was instead a square that changed colors 3 times (4 color presentations), and subjects were asked to respond if the last color was the same as either of the first 2.
Both left and right hemispheres were used in the analysis, with active blocks defined as blocks where stimuli were presented in the contralateral visual field. Time-courses for the target-absent conditions were subtracted from the target-present curves to obtain a time-course representing the effect of the target letter. For visualization, data were averaged across hemispheres and subjects.
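The block-design analysis above combines three simple operations: subtracting the target-absent time-course from the target-present one, averaging across hemispheres, and summarizing the result by the area under the curve from 6 to 16 s poststimulus. The sketch below assumes time-courses sampled once per TR (1 s) starting at stimulus onset and uses trapezoidal integration, which is an assumption since the integration rule is not stated. Function names are hypothetical.

```python
def target_effect(present, absent):
    """Target-present minus target-absent time-course, sample by sample."""
    return [p - a for p, a in zip(present, absent)]

def average(timecourses):
    """Pointwise mean of several equal-length time-courses (e.g., two hemispheres)."""
    n = len(timecourses)
    return [sum(values) / n for values in zip(*timecourses)]

def area_under_curve(timecourse, t_start_s=6.0, t_end_s=16.0, tr_s=1.0):
    """Trapezoidal area between two poststimulus times (both end-points included)."""
    i0 = round(t_start_s / tr_s)
    i1 = round(t_end_s / tr_s)
    segment = timecourse[i0:i1 + 1]
    return tr_s * sum((a + b) / 2 for a, b in zip(segment, segment[1:]))
```

One such area per subject and condition would then be the value entered into the paired tests.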
Fifteen subjects participated in Experiment 2 (3 females).
Design and Procedure
Stimuli corresponding to the 4 experimental conditions used in Experiment 1 were presented in the lower right visual quadrant. In order to minimize condition or task-difficulty-dependent fluctuations in attention, the stimulus conditions plus an unannounced rest condition were displayed in a fast event-related design. Subjects were instructed to respond with the identity of the center letter when 3 letters were present and not to respond when the center letter was missing. Behavioral data (accuracy and response time) were collected while subjects were performing the letter-identification task inside the scanner. Each trial lasted 3 s (Supplementary Fig. 1C). The stimulus was presented during the first 100 ms of each trial, and subjects had the remainder of the trial to respond.
Six subjects participated in Experiment 3 (1 female). The data for 1 male subject were excluded due to poor data quality.
Design and Procedure
There were 4 experimental conditions with respect to the target–flanker separation: 0.9°, 1.5°, 2.25°, and target alone. Behavioral data (accuracy and response time) were collected while subjects were performing the letter-identification task inside the scanner, to make direct comparisons between behavioral and BOLD response. The event-timing protocol was identical to that for Experiment 2.
Six subjects participated in Experiment 4 (1 female), 3 of whom also participated in Experiment 3. The data for 1 male subject who also participated in Experiment 3 were excluded due to poor data quality.
Experiment 4 used the crowded (0.9° target–flanker separation) and noncrowded (2.25°) triplets from the previous experiments. In addition, 2 new conditions were generated by dividing the center letter in the noncrowded condition into 36 tiles and randomly rearranging either 50% (moderately scrambled) or 90% (severely scrambled) of the tiles. The amount of scrambling was chosen such that the subjects' performance in identifying the severely scrambled target (proportion correct) was approximately equal to that for identifying the crowded target (Supplementary Fig. 3).
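The tile scrambling can be sketched as selecting a fraction of the 36 tiles and permuting only those among themselves. This is an illustration under assumptions, not the authors' procedure: 90% of 36 is not an integer, so the count is rounded here; a random permutation may by chance leave a selected tile in place; and the function name and return format are hypothetical.

```python
import random

def scramble_tiles(tiles, fraction, rng):
    """Return a scrambled copy of `tiles` plus the indices that were selected.

    A subset of round(fraction * len(tiles)) positions is chosen at random and
    the tiles at those positions are permuted among themselves; all other
    tiles stay where they are.
    """
    n_move = round(fraction * len(tiles))      # assumption: round to nearest
    chosen = rng.sample(range(len(tiles)), n_move)
    sources = chosen[:]
    rng.shuffle(sources)                       # random permutation of the subset
    scrambled = list(tiles)
    for dst, src in zip(chosen, sources):
        scrambled[dst] = tiles[src]
    return scrambled, sorted(chosen)

# Example: 36 tiles labeled by index, severely scrambled (90%).
tiles = list(range(36))
severe, moved = scramble_tiles(tiles, 0.9, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the scrambling reproducible across runs.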
To determine the effect of crowding on processing in visual areas, we measured the BOLD response to crowded and noncrowded configurations with target and flankers or flankers alone within V1, V2, V3, and hV4. By design, the ROI within each of these visual areas encompassed both target and flankers. In Experiment 1, we directed the subjects' attention to a demanding fixation task and away from the target letter and flankers. We found that the response to a noncrowded target and flankers was greater than that for the flankers alone—adding the target led to a positive change in BOLD signal in areas V1 [t(13) = 4.41, P < 0.001], V2 [t(13) = 5.89, P < 0.001], V3 [t(13) = 4.02, P < 0.005], and hV4 [t(13) = 2.46, P < 0.05] (Fig. 1B,D). In contrast, response to a crowded target and flankers was the same as that to the flankers alone (Fig. 1C,D) in each of these areas (P's > 0.46). This interaction between target-presence and crowding reached statistical significance in V1 [F1,13 = 5.25, P < 0.05] and V2 [F1,13 = 7.84, P < 0.05]. While the target added the same amount of contrast energy to the stimulus in each condition, the addition of the target to the noncrowded configuration led to a reliably larger signal change than to the crowded configuration in V1 and V2. These results show that the addition of the target in the crowded configuration led to mutual suppression of the signals for the target and flankers, particularly in V1 and V2.
The central task was challenging, and mean performance was nearly identical (ranging from 80% to 81%) for all conditions, eliminating the possibility that differential attention to the task versus letter stimuli across conditions was responsible for this outcome.
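The tests reported for Experiment 1 can be related to one another with a small sketch. In a 2 × 2 within-subject design, the interaction F statistic with (1, n − 1) degrees of freedom equals the square of a paired t statistic computed on the difference of differences, here the target-induced gain in the noncrowded condition versus that in the crowded condition. The code below is a generic illustration with hypothetical function names and is not the SPSS analysis used in the study.

```python
import math
from statistics import fmean, stdev

def paired_t(a, b):
    """Paired t statistic and degrees of freedom for two matched samples."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    return fmean(d) / (stdev(d) / math.sqrt(n)), n - 1

def interaction_f(nc_present, nc_absent, c_present, c_absent):
    """Crowding x target-presence interaction in a 2 x 2 within-subject design.

    Computed as the squared paired t on the per-subject target gains
    (present minus absent) in the noncrowded versus crowded configuration.
    """
    gain_noncrowded = [p - a for p, a in zip(nc_present, nc_absent)]
    gain_crowded = [p - a for p, a in zip(c_present, c_absent)]
    t, df = paired_t(gain_noncrowded, gain_crowded)
    return t * t, (1, df)
```

With 14 subjects entering the analysis, this yields 13 denominator degrees of freedom, matching the F1,13 values reported for Experiment 1.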
Results from Experiment 1 show that the mutual suppression between target and flankers in the crowded condition occurs when attention is directed away from the target, suggesting that crowding occurs automatically, regardless of the behavioral relevance of a stimulus. Nevertheless, crowding persists when attention is directed to the target. We therefore expect to observe crowding-induced signal suppression with attention. To test this, we examined the effect of crowding on BOLD response using the same stimuli as in Experiment 1, but where the subjects' task was to identify the target (center) letter (Experiment 2).
Accuracy measured inside the scanner for identifying the target letter was significantly better in the noncrowded than in the crowded condition (97% vs. 59%; t(14) = 10.1, P < 0.001), indicating that the crowding manipulation was effective. We found that the suppression in BOLD signal with crowding observed in Experiment 1 persisted when attention was directed to the letter stimuli. The addition of the target letter led to a significant increase in peak BOLD response in V1 [t(14) = 5.18, P < 0.001], V2 [t(14) = 2.59, P < 0.05], and hV4 [t(14) = 3.75, P < 0.005] in the noncrowded condition, but not in the crowded condition (P's > 0.10; Fig. 2). In V3, target-presence did not lead to any measurable increase in response in the noncrowded condition (P = 0.41), but in the crowded condition, it led to a significant decrease in response [t(14) = −2.72, P < 0.05]. The V3 result is nonetheless consistent with those for the other visual areas, because the magnitude of the mutual suppression between the target and flankers relative to any signal gain caused by the target is not known. What is relevant for the current study is whether mutual suppression between target and flankers is stronger in the crowded condition than in the noncrowded condition. The interaction between crowding and target-presence was statistically significant in each visual area tested: V1 [F1,13 = 7.69, P < 0.05], V2 [F1,13 = 11.02, P < 0.01], V3 [F1,13 = 4.98, P < 0.05], and hV4 [F1,13 = 6.35, P < 0.05]. Despite the differences in experiment design and attentional state, Experiments 1 and 2 yielded qualitatively similar results, particularly in V1 and V2.
A key property of crowding is that it decreases with increasing distance between target and flankers. We thus expected the BOLD signal suppression we observed to depend on target–flanker spacing. To test this, we again measured BOLD signal while subjects identified a target letter, and varied the spacing between target and flankers (Experiment 3). Letters were displayed at a center-to-center spacing of 0.9° (strongly crowded), 1.5° (weakly crowded), 2.25° (noncrowded), or “infinity” (single letter) (Fig. 3A). We found in Experiment 2 that the BOLD response within the all-inclusive ROI to the flankers-alone stimuli was not affected by flanker-flanker separation (P's > 0.10, Supplementary Fig. 4), which justifies the omission of the target-absent conditions.
Averaged across 5 subjects, the accuracy for identifying the target letter was 63 ± 3% when the target–flanker separation was 0.9°; it improved to 94 ± 2% at a separation of 2.25°, which was not different from the single-letter condition (Supplementary Fig. 2). Response time was shorter for conditions with higher accuracy (1.3 s for target–flanker separation of 0.9°, 0.95 s for the single letter); hence, there was no speed-accuracy tradeoff.
We again saw a clear separation between the time-courses corresponding to the crowded and noncrowded conditions in areas V1, V2, V3, and hV4, where the crowded condition was associated with lower peak BOLD amplitude (Fig. 3B). Repeated-measures ANOVA confirmed a main within-subjects effect of crowding (0.9° vs. 2.25°) on BOLD amplitude [F1,4 = 88.1, P < 0.001] but found no significant interaction between crowding and visual area. Consistent with Experiments 1 and 2, signal for the crowded condition was significantly lower than that for the noncrowded condition in V1 [t(4) = 2.79, P < 0.05], V2 [t(4) = 5.85, P < 0.01], V3 [t(4) = 3.37, P < 0.05], and hV4 [t(4) = 5.81, P < 0.005].
The relationship between BOLD amplitude and target–flanker separation was monotonic and systematic. BOLD amplitude of the weakly crowded (1.5°) condition fell between those of the crowded (0.9°) and noncrowded (2.25°) conditions (Fig. 3B), consistent with behavioral measures. As expected, BOLD amplitudes for the single-letter condition were lower than those for the noncrowded condition [F1,4 = 22.49, P < 0.01] because the ROIs included both the target and flanker locations. However, there was no significant difference between the amplitudes for the single-letter condition and the crowded condition [F1,4 = 1.7, P = 0.26], despite the crowded stimulus having 3 times the total contrast energy.
It could be suggested that subjects might “give up” in the crowded condition, which was more difficult when the task was letter identification (Experiments 2 and 3), resulting in less attention to stimuli and less attentional enhancement of the BOLD signal (Ress et al. 2000). This was not the case. We performed a control experiment (Experiment 4) that included the crowded and noncrowded conditions, plus 2 new conditions that were identical to the noncrowded condition but with the target moderately or severely scrambled (Fig. 4A). The mean accuracy for target identification was significantly different across all conditions (P < 0.05) except between the crowded and severely scrambled (48% vs. 60%, with within-subjects differences ranging from −23% to 9%), which constituted the most difficult conditions (Supplementary Fig. 3).
We examined the effects of crowding and task difficulty on BOLD signal within the ROIs in V1 through hV4. Crowding (crowded vs. noncrowded) was again associated with a reduction in BOLD amplitude [F1,4 = 19.6, P < 0.05] (Fig. 4B), and there was no interaction between crowding and visual area [F3,12 = 1.77, P = 0.20]. More importantly, we found a lower response to the crowded condition than to the severely scrambled condition [F1,4 = 9.45, P < 0.05], which were comparable in performance, and there was no interaction with visual area. In contrast, there was no discernible difference in BOLD signal amplitude between the 2 scrambled conditions and the noncrowded condition [F2,4 = 1.22, P = 0.35], despite their varying levels of behavioral performance. Thus, the reduction of BOLD signal in the early visual areas was associated with crowding and not task difficulty. There was no correlation between task difficulty and BOLD response, most likely because anticipation was not possible in the rapid event-related design.
Combining data from the 2 conditions (crowded and noncrowded) common to all 3 event-related, letter-identification experiments yielded the same conclusion: there was a strong main effect of crowding [F1,21 = 31.16, P < 0.0005]. Paired t-tests on the combined data showed an effect of crowding in all 4 cortical areas: V1 [t(21) = −4.56, P < 0.0005], V2 [t(21) = −4.49, P < 0.0005], V3 [t(21) = −5.50, P < 0.00005], and hV4 [t(21) = −4.22, P < 0.0005]. (Three subjects participated in 2 of the 3 experiments. A single entry was created for each subject using the average BOLD amplitude of each condition across the 2 experiments.)
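The pooling rule in the parenthetical above (one entry per subject, averaging across experiments for repeat participants) also explains the 21 degrees of freedom: 15 + 5 + 5 analyzed data sets minus 3 repeat participants gives 22 entries. The sketch below illustrates the bookkeeping with made-up subject identifiers and amplitudes; which subjects overlapped between which experiments is an assumption made only for the example.

```python
from collections import defaultdict

def merge_subject_entries(records):
    """Collapse (subject_id, crowded_amp, noncrowded_amp) records to one entry
    per subject by averaging each condition across that subject's experiments."""
    by_subject = defaultdict(list)
    for subject_id, crowded, noncrowded in records:
        by_subject[subject_id].append((crowded, noncrowded))
    merged = {}
    for subject_id, entries in by_subject.items():
        n = len(entries)
        merged[subject_id] = (sum(e[0] for e in entries) / n,
                              sum(e[1] for e in entries) / n)
    return merged

# Hypothetical records: 15, 5, and 5 analyzed subjects, with 3 repeat participants.
exp2 = [(f"S{i}", 1.0, 2.0) for i in range(1, 16)]
exp3 = [("S1", 3.0, 4.0)] + [(f"T{i}", 1.0, 2.0) for i in range(4)]
exp4 = [("T0", 2.0, 3.0), ("T1", 2.0, 3.0)] + [(f"U{i}", 1.0, 2.0) for i in range(3)]
merged = merge_subject_entries(exp2 + exp3 + exp4)
degrees_of_freedom = len(merged) - 1
```

Averaging repeat participants before testing keeps every subject's weight equal and avoids treating correlated measurements as independent observations.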
Results from these 4 experiments show a highly significant physiological effect of crowding as early as in V1. They do not rule out the contribution of other areas to crowding, nor do they by themselves suggest a V1 origin of crowding. Crowding may occur in higher level visual areas (Louie et al. 2007; Farzin et al. 2009), in which case regions beyond retinotopic visual cortex could be expected to exhibit signal suppression with crowding. In the LOC, a large higher level object-selective region, we found that the effect of crowding systematically varied across subregions for most subjects. Within the subregions of LOC that were significantly modulated by the stimuli in Experiments 2, 3, and 4 (where subjects attended to the stimuli), we computed the sign of the voxelwise differences in BOLD amplitude between crowded and noncrowded conditions. For 11 of the 15 subjects who participated in the experiments and had localized LOCs, including the 3 subjects who participated in multiple experiments and yielded consistent results, distinct patches of voxels were evident in LOC: the more superior–posterior subregions of LOC showed the same sign of difference as the early visual areas, where the crowded condition evoked a lower BOLD signal level; in contrast, the more anterior–inferior subregions showed the opposite: the crowded condition produced the larger response (Fig. 5A). These distinct regions were generally contiguous such that it is statistically unlikely that this configuration could have occurred by chance (bootstrapped P < 0.05 for each of the 11 out of 15 subjects; Fig. 5B–D). The lack of statistically significant subregions in a minority of subjects may have resulted from low power due to the relatively small voxelwise signal differences in rapid event-related designs without spatial smoothing. 
To increase our power to identify LOC subregions, we performed a voxelwise mixed-effects group analysis within an independently defined group-level LOC ROI using spatially smoothed data. The resulting group LOC map again revealed posterior regions where the BOLD response was greater for the noncrowded condition, and an anterior region where response was greater for the crowded condition (Fig. 5E).
The pattern of results in LOC could be due to the representation of different classes of stimuli in segregated subregions of LOC (Konkle and Oliva 2012), or the dedication of different regions to distinct types of processing (Grill-Spector et al. 1999; Kourtzi and Huberle 2005). One possible explanation for this pattern of results is that an impoverished bottom-up feed-forward signal from the early stages of visual processing is received by some subregions of LOC, while other subregions are forced to engage in top–down inference (often unsuccessfully) to disambiguate the crowded signal.
We found that crowding was associated with a decrease in BOLD signal in early stages of visual processing, from V1 to hV4. The suppression of BOLD signal correlates with crowding: suppression was strongest for the most closely spaced stimulus configuration. In V1 and V2, this effect persists regardless of whether attention is directed toward or away from the stimulus, implicating an automatic rather than behavior-driven process. In accord with this view, the complex response pattern of LOC to crowding also makes a top–down explanation less parsimonious. Results from these experiments are consistent with an early locus of crowding in V1.
Crowding has been described as a distinct process from “surround suppression”, where a mask placed adjacent to a target interferes with detection of the target (Levi et al. 2002; Petrov and McKee 2006; Petrov and Popple 2007; Petrov et al. 2007; Levi 2008). This distinction is largely based on differences in psychophysical measures. However, the 2 phenomena have several properties in common (see Table 1 of Petrov et al. 2007), and a precise delineation between crowding and surround suppression has not been established. Surround suppression can very well be a component of crowding (Maniglia et al. 2011). Indeed, for studies that directly compare crowding with surround suppression, it is not uncommon to define the phenomena in terms of the stimuli (e.g., Petrov et al. 2007): small laterally placed flankers for crowding, and a large grating annulus matched in spatial frequency to a central target grating for surround suppression. In order to emphasize crowding in case the phenomena are distinct, we used stimuli typically designed to induce crowding rather than surround suppression. We used spatially broadband letter stimuli rather than oriented gratings (Zenger-Landolt and Heeger 2003). We kept our “surround” region very small, consisting of only 2 letters, one on each side of the target. We displayed the target (and flankers) at 100% contrast (Pelli et al. 2007). The amount of BOLD signal suppression we observed with our crowding stimuli was very strong: the BOLD response to the target and flankers in the ROIs that encompassed both target and flankers was similar to the BOLD response evoked by the flankers alone, regardless of the attention state (Experiments 1 and 2).
The lack of response increase when the target was added to the crowded configuration was also not due to saturation of the BOLD response at the voxel level. (The response clearly did not saturate at the ROI level, since the ROI responses were higher in the target-present noncrowded condition than in the crowded condition.) If voxel-level saturation were the only mechanism underlying our finding in the crowded condition, we would expect never to observe a decrease in response when a target was added. However, individual subject data showed a significant and reliable decrease in single-voxel responses when the target was added (Supplementary Fig. 5), which rules out the saturation of voxel responses as an explanation of our finding.
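The logic of this saturation argument can be illustrated with a toy model. The sketch below is not the analysis used in the study; it assumes a hypothetical hyperbolic-ratio response function with arbitrary parameters (`r_max`, `semi_sat`) and arbitrary drive values, and shows only that any monotonically non-decreasing saturating response can flatten, but never fall, when target drive is added to flanker drive.

```python
def saturating_response(drive, r_max=1.0, semi_sat=0.5):
    """Hypothetical voxel response: rises with drive and levels off at r_max.

    The hyperbolic-ratio form and its parameters are illustrative assumptions.
    """
    return r_max * drive / (drive + semi_sat)


def change_when_target_added(flanker_drive, target_drive):
    """Response to flankers plus target, minus response to flankers alone."""
    with_target = saturating_response(flanker_drive + target_drive)
    flankers_only = saturating_response(flanker_drive)
    return with_target - flankers_only


# Arbitrary flanker drives, from weak to strongly saturating.
flanker_drives = (0.5, 2.0, 10.0, 1000.0)
changes = [change_when_target_added(f, target_drive=1.0) for f in flanker_drives]

for f, c in zip(flanker_drives, changes):
    print(f"flanker drive {f:>7}: change when target added = {c:.6f}")
```

Under pure saturation the change shrinks toward zero as the flanker drive grows but stays non-negative, so a reliable decrease in single-voxel response when the target is added requires a mechanism beyond saturation.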
Our V1 results differ from those of 3 recent neuroimaging studies that investigated the effect of crowding on BOLD signal in retinotopic visual areas (Fang and He 2008; Bi et al. 2009; Freeman et al. 2011); none of these found any effect of crowding on BOLD signal in V1 when attention was directed away from their stimuli. Bi et al. measured a crowding-modulated adaptation effect rather than directly measuring crowding. Fang and He, as well as Bi et al., attempted to isolate the responses to target from those to flankers. This necessitated the use of relatively large stimuli and large center-to-center spacing between target and flankers, which induced only weak behavioral crowding.
Freeman et al. manipulated crowding with the relative timing between target and flankers. Their behavioral experiment suggested a large difference between their crowded (simultaneous presentation of the target and flankers) and noncrowded (sequential presentation of the target and flankers) conditions. The target and flankers were spatially adjacent, and, as in the current study, the authors did not try to spatially delineate target response from flanker response. While they found that crowding disrupted the temporal correlation between V1 and the visual word-form area, they did not observe any effect of crowding on BOLD amplitude in V1. However, the differences in the temporal dynamics between their crowded and noncrowded stimuli may have obscured the effect in V1. For short interstimulus intervals, as in their noncrowded condition, temporal summation of sequentially evoked BOLD responses is sublinear, probably due to vascular refractory effects (Liu et al. 2010). The neuronal response could therefore be disproportionately underestimated in their noncrowded (sequential presentation) condition, masking the difference between crowded and noncrowded conditions.
In the current study, we measured BOLD signal modulated by crowding directly and under identical temporal regimes for the crowded and noncrowded conditions. Because our method of analysis does not require separation of responses to target and flankers, we were able to use typical crowding stimuli with small center-to-center spacing that induced very strong behavioral crowding. Our method allowed us to reliably observe crowding-related suppression of BOLD signal in V1, both with and without attention directed to the stimuli.
Crowding is reduced when flankers are contralateral to the target with respect to the vertical but not the horizontal meridian (Liu et al. 2009). This implicates a locus for crowding where the upper and lower visual fields are contiguously represented across the horizontal meridian, such as V1 or hV4. It has been suggested that cortical magnification in V1 in conjunction with inappropriate feature integration over a constant radius on the cortex could explain the scaling of the spatial extent of crowding (the critical spacing) with eccentricity (Pelli and Tillman 2008; Nandy and Tjan 2012). Furthermore, Nandy and Tjan provided the first quantitative explanation of the elliptic shape of the spatial extent of crowding based in part on the physiological and anatomical properties specific to V1. Their theory implicated V1 as a site for crowding. The current finding of crowding-induced suppression in V1 is consistent with this view.
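The scaling argument above can be made explicit with a short worked derivation. This is a sketch under stated assumptions, not the formulation of the cited models: it assumes the common inverse-linear form of cortical magnification along the radial direction, and the symbols $A$, $E_2$, $d$, and $s$ are introduced here for illustration only.

```latex
% Assumed inverse-linear cortical magnification (mm of cortex per degree)
% at eccentricity E, with scale constant A and offset E_2:
M(E) = \frac{A}{E + E_2}

% Cortical distance spanned by a radial separation s starting at eccentricity E,
% set equal to an assumed fixed integration radius d on the cortex:
d = \int_{E}^{E+s} M(e)\,\mathrm{d}e = A \ln\!\left(\frac{E + s + E_2}{E + E_2}\right)

% Solving for s gives a critical spacing that grows linearly with eccentricity:
s(E) = \left(e^{d/A} - 1\right)\,(E + E_2) \;\approx\; \frac{d}{A}\,(E + E_2)
\quad \text{for } d \ll A
```

Under these assumptions, a constant radius on the cortex maps to a visual-field spacing proportional to $E + E_2$, which for $E \gg E_2$ reduces to a spacing proportional to eccentricity.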
The combined results from our 4 experiments provide support for an input-driven origin of crowding in V1. Anderson et al. (2012) observed a release from adaptation of BOLD signal in V1 when subjects noticed a change in a flanked target, but not when crowding interfered with change detection. However, this effect was measured while subjects attended to the crowding stimuli, such that it could be specific to stimuli that are behaviorally relevant, rather than automatic. In the present study, the fact that there was substantial signal suppression in V1 regardless of the behavioral relevance of the stimulus could suggest a lack of intervention by feedback and recurrent processing that are task-specific. While it may be tempting to exclude feedback as a possibility, feedback may be necessary for task-independent processing of visual input. Regardless, our result is consistent with the view that there is a generic bottom-up image-processing component to crowding.
Our finding does not preclude the possibility that crowding arises at multiple stages of processing. Crowding occurs for stimuli of varying complexity (sinusoidal wavelets, line segments, letters, faces, objects), which could reflect contributions to crowding at various stages of processing (Louie et al. 2007; Farzin et al. 2009; Whitney and Levi 2011; Anderson et al. 2012). It is possible that, while crowding begins at a low level, higher cortical areas additionally contribute to the crowding effect. Our view is that, starting with V1, crowding is caused by similar local processes in multiple visual areas.
This research was supported by the National Eye Institute of the US National Institutes of Health (EY017707 to B.S.T.).
The authors thank Pinglei Bao and MiYoung Kwon for suggestions regarding data analysis, and Dennis Levi, Anirvan Nandy, and 3 anonymous reviewers for helpful comments on the manuscript. Conflict of Interest: None declared.