## Abstract

Vocalization in lower animals is associated with a well-described visceromotor call system centered on the mesencephalic periaqueductal gray matter (PAG), which is itself regulated by paramedian cortical structures. To determine the role this phylogenetically older system plays in human phonation, we contrasted voiced and unvoiced speech using positron emission tomography and then evaluated functional connectivity of regions that significantly differentiated these conditions. Vocalization was associated with increased and highly correlated activity within the midline structures (PAG and paramedian cortices) described in lower mammalian species. Concurrent activation and connectivity of neocortical and subcortical motor regions (medial and lateral premotor structures and elements of basal ganglia thalamocortical circuitry) suggest a mechanism by which this system may have come under an increasing degree of voluntary control in humans. Additionally, areas in the temporal lobe and cerebellum were selectively activated during voiced but not unvoiced speech. These regions are functionally coupled to both visceromotor and neocortical motor areas during production of voiced speech, suggesting they may play a central role in self-monitoring and feedback regulation of human phonation.

## Introduction

Techniques for imaging the intact brain, such as positron emission tomography (PET) or functional magnetic resonance imaging (fMRI), have revolutionized the study of human speech and language. These methods were at first typically used to evaluate language comprehension, while speech production was studied less frequently (e.g., Petersen et al., 1988). Those studies that investigated production most often sought to differentiate the motor-articulatory from the cognitive-linguistic elements of speech (Fiez and Petersen, 1998; Price, 1998; Murphy et al., 1997).

However, at the level of articulation, the elements of motor speech can be further decomposed in an important way — by isolating the neural substrates of phonation and differentiating these from those that regulate other components of a verbal utterance. What brain mechanisms, we might ask, support the precise coordination of respiration and laryngeal activity and the intricate adjustments of vocal fold length and tension that underlie production of the human voice?

Animal studies suggest that in lower mammalian species from rat (Fardin et al., 1984) to cat (Davis et al., 1996; Shiba et al., 1997) to monkey (Larson, 1991) there exists a midline network of brain regions dedicated to phonation, explicitly controlling the generation of species-specific calls. At the core of this neural circuit is the mesencephalic periaqueductal gray (PAG) which regulates synchronous activity in visceromotor neurons of the lower brainstem that control vocal fold tension (adjusting pitch) and respiration (adjusting inspiration, expiration and subglottal air pressure) (Zhang et al., 1995; Davis et al., 1996). The PAG is itself regulated by limbic-related regions of the forebrain and is thus in a position to encode information about emotional status and behavioral arousal in a repertoire of vocal calls over which voluntary control — in non-human species — appears to be slight (Jurgens and Zwirner, 1996).

Humans, on the other hand, possess a remarkable degree of voluntary control over sounds produced by the larynx. And the role that this phylogenetically ancient system plays in normally voiced human speech, if it plays a role at all, is unclear. There are two competing hypotheses (Davis et al., 1996; Jurgens and Zwirner, 1996; Jurgens, 2002). The first proposes that only involuntary emotional vocalizations such as laughter or crying are subserved by visceromotor mechanisms in humans, and that voluntary phonation during propositional speech is under the control of an autonomous neocortical system. An alternate possibility is that both of these systems operate in concert during the production of spoken language, the visceromotor system having been exploited (or ‘exapted’, Gould and Lloyd, 1999), for use during volitional speech production with a degree of hierarchical control maintained by neocortical motor systems.

One purpose of the present study was to determine the extent to which the phylogenetically older system is activated during normally voiced human speech. To do so, we used a relatively simple paradigm, comparing two easily manipulated, natural production tasks in order to highlight the phonatory features associated with speech. Using $\mathrm{H}_{2}^{15}\mathrm{O}$ PET, which makes it possible to evaluate overt, continuous speech in a relatively silent environment, we sought to identify the neural activation patterns that are specific to vocalization by contrasting normally voiced speech with whispered speech production.

During both normally voiced and whispered speech, the activity of the oral articulators — the lips, tongue and jaw — is essentially equivalent. The principal difference is that during voiced speech the larynx undergoes complex adjustments — dynamic modulation of the vocal folds that enables coordination and rapid fine-tuning of subglottal pressures and continuous alternation of voiced and voiceless consonants and vowels. During whispering, on the other hand, the vocal folds generally maintain an open configuration (Monoson and Zemlin, 1984; Solomon et al., 1989) so that speech is produced in the absence of voice, without the need for precise control of vocal fold dynamics.

To highlight the regional cerebral blood flow (rCBF) patterns associated with phonation, we scanned 20 healthy participants (i) at rest; and while they told a story (ii) with voice (voiced narrative speech) or (iii) without voice (whispered narrative speech). The voiced and whispered narrative tasks were cognitively equivalent — subjects produced a propositional, autobiographical, emotionally neutral narrative — differing only in the production of voice, factoring out common areas of activation for semantic processing, lexical selection and syntactic construction.

Thus, when voiced speech (oral articulatory movements plus vocalization) is contrasted with whispered speech (oral articulatory movements alone), the differences should selectively reflect the demands associated with vocalization. We hypothesized that these would include activation of the midline visceromotor circuit (the set of regions that comprise the species-specific call system in lower mammals) as well as neocortical motor areas. We also hypothesized that functional connections between these systems would be observed only during the voiced speech condition, suggesting a neural substrate for neocortical modulation of the midline circuit during human phonation.

A corollary aim of this study was to investigate the ways in which sensory systems that play a role in self-monitoring of voice might interact with motor structures to regulate voiced speech. Recent neuroimaging studies have demonstrated selective activation of temporal association areas during the perception of speech and voice (Belin et al., 2000; Wise et al., 2001); however, studies specifically evaluating vocal self-monitoring during speech production have been rare (Creutzfeldt et al., 1989; McGuire et al., 1996; Numminen et al., 1999; Curio et al., 2000).

It is likely that the central nervous system utilizes different pathways for monitoring of one's own voice during production than for monitoring the voice of others during comprehension, since the functional objectives of these processes are quite different. Therefore we also paid attention to concurrent activation of auditory and motor systems in the contrasts outlined above and evaluated functional connections between these systems as well. Taken together, these analyses should selectively map brain regions active for the production of normally voiced speech and the sensory areas that process and monitor subjects' vocal output.

We hypothesized that temporal association areas would be selectively activated during the production of voiced speech, but that the pattern of activation would differ from that reported for the perception of speech and voice in comprehension. We also hypothesized that functional connections between these temporal regions and both visceromotor and neocortical systems would be observed only during the voiced speech condition, providing a possible substrate for an interaction between auditory and motor systems that modulates phonation.

Understanding the associated neuroanatomical systems that support production and self-monitoring of human voice, and the ways in which these systems interact, may provide insight into motor speech disorders such as spasmodic dysphonia and stuttering. In these disorders, symptoms are typically manifest during voiced but not during unvoiced speech, and as such may reflect pathology in the circuits that regulate phonation.

## Materials and Methods

### Participants

Informed consent was obtained from all participants after the risks, hazards and potential discomfort associated with the procedures were explained. Participants included 9 females [age: 36 ± 10 years (mean ± SD), range 24–50] and 11 males [34 ± 9, range 23–47]. Voiced speech scans were obtained in all participants, whispered speech scans in 19. Therefore n = 20 for the voice–rest contrast and regional correlations derived for voiced speech; n = 19 for voice–whisper and whisper–rest contrasts and regional correlations derived for whispered speech. All participants were determined to be free of medical or neuropsychiatric illnesses that might affect brain function on the basis of history and physical examination, baseline laboratory evaluation, and magnetic resonance brain imaging.

PET tasks consisted of two overt speech tasks (voiced and whispered narrative production) and a rest condition. In the voiced speech task, which has been reported previously (Braun et al., 1997, 2001), participants were instructed to extemporaneously recount a story (an event or sequence of events) from personal experience. Participants were instructed to avoid material with intense emotional content and to speak using natural rate, rhythm and intonation. They were told to avoid using ancillary gestures of the limbs and were not required to complete the narrative within a fixed period of time. Subjects practiced the task during a training session while lying supine in a closed room to approximate conditions in the scanner suite; the narrative subjects used during this session was autobiographical, but differed from the one they ultimately selected for the scanning session itself. In the whispered speech task, participants were given identical instructions, except that production was limited to a whisper. Participants were trained prior to the PET session to ensure that during the whispering task, speech was produced without voice. Participants were also scanned at rest.

The narrative tasks were designed to be cognitively equivalent, so that they would differ only in the production of voice, factoring out common areas of activation for semantic processing, lexical selection, syntactic construction as well as oral articulation — processes that were engaged during both conditions. The differences observed should thus selectively map regions active during production of voice — both motor areas that control phonation and sensory areas that process and monitor subjects' vocal output. While voiced speech samples were recorded and transcribed, this was not technically possible in the whispered condition; therefore, we cannot state unequivocally that the complexity of syntactic structure or semantic features were identical in both conditions, and it is possible that unmeasured variations in these linguistic features might contribute to differences in the voice–whisper contrast. This appears unlikely, however, since subjects continued narrating the same story in both the voiced and whispered speech conditions (while the sequence of conditions was randomized across subjects). On this basis, we assume that semantic content and narrative coherence were equivalent in both portions of the narrative, and there is no reason to suspect that, within individual subjects, vocabulary or patterns of syntactic construction would change dramatically between whispered and voiced conditions.

### PET Data Acquisition

PET scans were performed on a Scanditronix PC2048-15B tomograph (Uppsala, Sweden) which has an axial and in-plane resolution of 6.5 mm. Fifteen planes, offset by 6.5 mm (center to center), were acquired simultaneously. The transverse field of view was 25.6 cm; axial sampling included brain structures from ∼48 mm above to 20 mm below the bi-commissural line — essentially extending from the SMA and superior dorsolateral prefrontal and parietal cortices to the region of the pons, cerebellum and temporal pole; the lower brainstem was not sampled.

Participants' eyes were patched, and head motion was restricted during the scans with a thermoplastic mask that permitted free movement of the oral articulators. For each scan, 30 mCi of $\mathrm{H}_{2}^{15}\mathrm{O}$ was injected intravenously. Tasks were initiated 30 s prior to injection of the radiotracer and were continued throughout the scanning period. Studies were separated by 10 min intervals. Emission data were corrected for attenuation by means of a transmission scan.

### Data Analysis

#### Image Averaging and Spatial Normalization

PET scans were registered and stereotaxically normalized using SPM software (Wellcome Department of Cognitive Neurology, London). Images were smoothed with a Gaussian filter (15 × 15 × 9 mm in the x, y and z axes) to accommodate intersubject differences in anatomy, and spatially normalized to produce images of 26 planes parallel to the anterior–posterior commissural line in a common stereotaxic space cross-referenced with a standard anatomical atlas (Talairach and Tournoux, 1988). Differences in global activity were controlled for by proportional normalization.
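The two numerical steps in this paragraph can be illustrated with a minimal Python sketch. It is not the SPM implementation. It assumes that the filter dimensions are full widths at half maximum (FWHM), which the text does not state explicitly, and that proportional normalization scales each scan to a common global mean. The target value of 50 and the four-voxel scan are hypothetical.

```python
import math


def fwhm_to_sigma(fwhm_mm):
    """Convert a Gaussian kernel's FWHM (mm) to its standard deviation (mm)."""
    return fwhm_mm / (2.0 * math.sqrt(2.0 * math.log(2.0)))


def proportional_normalize(voxel_values, target_global=50.0):
    """Scale one scan so that its mean voxel value equals target_global.

    target_global is a hypothetical choice; the source does not give one.
    """
    global_mean = sum(voxel_values) / len(voxel_values)
    scale = target_global / global_mean
    return [value * scale for value in voxel_values]


# Kernel widths quoted in the text, assumed here to be FWHM values.
sigmas = [fwhm_to_sigma(width) for width in (15.0, 15.0, 9.0)]

# Hypothetical four-voxel "scan" with a global mean of 60.
scan = [40.0, 60.0, 80.0, 60.0]
normalized = proportional_normalize(scan)

print([round(s, 2) for s in sigmas])
print([round(v, 2) for v in normalized])
```

After scaling, every scan has the same global mean, so regional values can be compared across scans and subjects as relative rather than absolute flow.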

Using SPM, we performed the following contrasts: (i) both the voiced speech and whispered speech conditions were compared to rest as baseline (voiced speech − rest; whispered speech − rest); the rest condition is here considered an appropriate baseline for the evaluation of motor and auditory activations; (ii) the tasks were then directly compared with one another (voiced speech − whispered speech; whispered speech − voiced speech). We used masking procedures to identify regions in which differences between tasks (voiced speech versus whispered speech) were also associated with significant activations in voiced and whispered speech versus rest (P < 0.001, uncorrected in each instance). In order to limit type I error, we report only differences between tasks that were also associated with significant task–rest differences (at these specified thresholds, the conjoint probability of both criteria being reached concurrently by chance is $P < 10^{-6}$). This restriction was also applied to exclude differences arising from the comparison of regions that are deactivated versus rest. In addition, the masking procedures make it possible to identify differences between voiced and whispered speech that were unique to one speech–rest contrast or were activated in both but differed significantly in magnitude (see Tables 1 and 2).
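The masking logic can be sketched as follows. This is a minimal Python illustration, not the SPM procedure. It assumes that the P threshold maps to a one-tailed normal quantile, which the source does not state, and the function name and voxel Z-scores are hypothetical. The conjoint probability is the product of the two thresholds, which presumes the two criteria are independent.

```python
from statistics import NormalDist

P_THRESHOLD = 0.001
# Assumed one-tailed Z cutoff for the stated P threshold.
Z_THRESHOLD = NormalDist().inv_cdf(1.0 - P_THRESHOLD)


def masked_contrast(z_task_vs_task, z_task_vs_rest, z_threshold=Z_THRESHOLD):
    """Keep a task-task difference only where task-rest is also significant.

    Voxels failing either criterion are returned as None.
    """
    return [
        z_diff if (z_diff > z_threshold and z_rest > z_threshold) else None
        for z_diff, z_rest in zip(z_task_vs_task, z_task_vs_rest)
    ]


# Hypothetical Z-scores for three voxels.
voiced_minus_whispered = [3.5, 3.5, 1.0]
voiced_minus_rest = [4.0, 2.0, 4.0]
surviving = masked_contrast(voiced_minus_whispered, voiced_minus_rest)

# Chance probability of meeting both criteria, treating them as independent.
conjoint_p = P_THRESHOLD ** 2

print(surviving)
print(conjoint_p)
```

Only the first voxel survives: the second differs between tasks without being activated versus rest, and the third is activated versus rest without differing between tasks.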

Table 1. Significant elevations in rCBF during voiced speech

| Region of interest | Brod. no. | Left Z-score | Left x, y, z | Right Z-score | Right x, y, z |
|---|---|---|---|---|---|
| **Subcortical** | | | | | |
| Midbrain PAG | | 3.65 | −8, −28, −4 (a) | – | – |
| Thalamus (CM) | | 3.21 | −10, −22, 4 (b) | – | – |
| Globus pallidus | | 3.09 | −14, 4 (b) [incomplete] | – | – |
| Cerebellar vermis | | 3.01 | −16, −56, −16 (b) | 3.54 | −54, −16 (a) [incomplete] |
| **Prefrontal** | | | | | |
| Inferior operculum | 47 | 3.05 | −38, 22, −8 (b) | 3.31 | 42, 28, 0 (b) |
| **Paramedian** | | | | | |
| Medial prefrontal cortex | | 3.80 | −6, 50, 24 (a) | – | – |
| Anterior cingulate cortex | 32/24 | 3.02 | −6, 48, 12 (b) | – | – |
| Pre SMA | | 3.52 | −12, 16, 44 (b) | – | – |
| **Temporal** | | | | | |
| Anterior MTG | 21 | 3.27 | −50, −16, −16 (b) | – | – |
| Posterior STG/STS | 22 | 3.05 | −42, −60, 20 (a) | – | – |
| Posterior MTG/STS, ang G | 39 | 3.88 | −38, −54, 24 (a) | | |

Regions in which normalized regional blood flow rates are greater during voiced than whispered speech are tabulated along with Z-scores, representing local maxima and associated Talairach coordinates. Regions are included only if significant activations were detected when voiced speech was also compared to rest. Symbols differentiate regions that are significantly activated only for voiced speech or activated for both speech conditions (versus rest). Entries marked [incomplete] list only the coordinate values that are available.

(a) Activated during voiced speech (versus rest) alone.

(b) Indicates regions significantly activated for both voiced and whispered speech (versus rest).

### Functional Connectivity

PET images, preprocessed using SPM software as outlined above, were used in these analyses. Images acquired during voiced and whispered speech were evaluated independently. For each condition, normalized rCBF rates were correlated, across the cohort of subjects, with values derived from seed voxels of interest, utilizing software written in MATLAB (Horwitz and McIntosh, 1994; Horwitz et al., 1998). This routine produces a normalized output image with a Pearson product–moment correlation coefficient assigned to each voxel, indexing correlations between blood flow in that voxel and the seed voxel of interest.

Correlation coefficients were then transformed to standard scores (Fisher's Z-prime transformation) and transformed images were subtracted from one another to identify voxels in which correlations differed between conditions; difference scores of Z > 2.0 in absolute value were considered significant in this instance. The thresholded difference images were then used to mask the voiced and whispered speech correlation images themselves in order to identify the original coefficients that differentiated these conditions. Instances in which at least one of these exceeded 2.33 in absolute value are summarized in Tables 3 and 4 and used to produce maps and scatterplots depicted in Figures 2 and 3.
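The correlation and thresholding steps can be sketched in Python. This is a hedged illustration, not the MATLAB routine cited above. It assumes that each Fisher-transformed coefficient is scaled by its standard error, 1/sqrt(n − 3), to give a standard score. It also treats the two conditions as independent samples when forming the difference score, which is a simplification. All function names and the example coefficients are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation across subjects."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fisher_z_score(r, n):
    """Fisher transform of r, scaled to an approximate standard score."""
    return math.atanh(r) * math.sqrt(n - 3)


def difference_z(r_a, n_a, r_b, n_b):
    """Standard score for the difference between two correlations.

    Assumes the two samples are independent (a simplification here).
    """
    standard_error = math.sqrt(1.0 / (n_a - 3) + 1.0 / (n_b - 3))
    return (math.atanh(r_a) - math.atanh(r_b)) / standard_error


def differentiates_conditions(r_voiced, r_whispered, n_voiced=20, n_whispered=19):
    """Apply both criteria: difference > 2.0 and at least one score > 2.33."""
    z_voiced = fisher_z_score(r_voiced, n_voiced)
    z_whispered = fisher_z_score(r_whispered, n_whispered)
    z_diff = difference_z(r_voiced, n_voiced, r_whispered, n_whispered)
    return abs(z_diff) > 2.0 and max(abs(z_voiced), abs(z_whispered)) > 2.33


# Hypothetical seed-target correlations for one voxel.
print(differentiates_conditions(0.70, 0.10))
print(differentiates_conditions(0.30, 0.10))
```

The sample sizes default to the n = 20 (voiced) and n = 19 (whispered) cohorts described under Participants.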

Table 2. Significant elevations in rCBF during whispered speech

| Region of interest | Brod. no. | Left Z-score | Left x, y, z | Right Z-score | Right x, y, z |
|---|---|---|---|---|---|
| **Subcortical** | | | | | |
| Thalamus (pulvinar) | | 3.10 | −20, −28, 12 (a) | – | – |
| **Perirolandic** | | | | | |
| SII | 43 | 4.10 | −60, −12, 20 (b) | 3.22 | 52, −18, 20 (b) |
| **Temporal** | | | | | |
| AI/AII | 41/42 | 3.69 | −56, −16, 12 (b) | – | – |
| Post STG/PT | 22 | 3.12 | −38, −38, 20 (b) | – | – |
| Lingual/PHPC | 37 | | | 3.32 | 28, −44, −4 (a) |

Regions in which normalized regional blood flow rates are greater during whispered than voiced speech are tabulated along with Z-scores, representing local maxima, and associated Talairach coordinates. Regions are included only if significant activations were detected when whispered speech was also compared to rest. Symbols differentiate regions that are significantly activated only for whispered speech or activated for both speech conditions (versus rest).

(a) Activated during whispered speech (versus rest) alone.

(b) Indicates regions significantly activated for both whispered and voiced speech (versus rest).

Table 3. Functional connections of the PAG during voiced and whispered speech

| Region of interest | Brod. no. | Left x, y, z | Left coeff. (voiced) | Left coeff. (whispered) | Right x, y, z | Right coeff. (voiced) | Right coeff. (whispered) |
|---|---|---|---|---|---|---|---|
| **(A) Regions and contralateral homologues identified in task contrasts** | | | | | | | |
| **Subcortical** | | | | | | | |
| Globus pallidus | | −14, −4 [incomplete] | 2.41 | 0.40 | – | – | – |
| Thalamus (CM) | | – | – | – | 10, −20 [incomplete] | 3.33 | 1.21 |
| Cerebellar vermis | | −2, −48, −4 | −2.40 | 0.06 | – | – | – |
| **Prefrontal** | | | | | | | |
| Inferior operculum | 47 | −36, 28, −12 | 3.87 | 1.08 | 34, 34, −12 | 3.08 | 0.65 |
| **Paramedian** | | | | | | | |
| Ventral medial prefrontal cortex | | −18, 62, −4 | 3.48 | 0.49 | – | – | – |
| Dorsal medial prefrontal cortex | | −2, 44, 44 | −3.74 | −0.62 | 40, 40 [incomplete] | −3.86 | 0.71 |
| Anterior cingulate cortex | 32/24 | – | – | – | 20, 34, 16 | 2.61 | 0.60 |
| **Perirolandic** | | | | | | | |
| SII | 43 | −56, −18, 20 | −2.85 | 0.60 | 54, −10, 20 | −2.56 | 0.43 |
| **Temporal** | | | | | | | |
| AI/AII | 41/42 | −56, −26, 12 | −3.20 | 0.58 | – | – | – |
| Anterior MTG | 21 | −54, −18, −16 | 2.65 | 0.52 | 60, −20, −8 | 3.43 | 1.18 |
| Posterior MTG | 21 | −52, −44, −4 | 2.82 | 0.88 | – | – | – |
| **(B) Regions not identified in task contrasts** | | | | | | | |
| **Rolandic** | | | | | | | |
| Postcentral gyrus | 3,1,2 | −56, −18, 32 | −3.07 | 0.33 | 56, −16, 32 | −2.72 | 0.17 |
| **Proisocortical** | | | | | | | |
| Temporal pole | 38 | −38, 24, −16 | 4.33 | 1.45 | 48, −16 [incomplete] | 2.78 | 0.62 |
| Claustrum/insula | | – | – | – | 30, −4, 12 | −2.76 | 0.15 |
| Anterior insula | | −30, 10, 4 | 0.31 | 3.61 | | | |

Correlations between normalized cerebral blood flow rates in the PAG (Talairach x = −8, y = −28, z = −4, Table 1) and other brain regions were calculated for voiced and whispered speech and compared as outlined in the text. Values are Z-transformed correlation coefficients identifying regional interrelationships that differentiated these conditions. Entries marked [incomplete] list only the coordinate values that are available.

Table 4. Functional connections of the anterior MTG during voiced and whispered speech

| Region of interest | Brod. no. | Left x, y, z | Left coeff. (voiced) | Left coeff. (whispered) | Right x, y, z | Right coeff. (voiced) | Right coeff. (whispered) |
|---|---|---|---|---|---|---|---|
| **(A) Regions identified in task contrasts** | | | | | | | |
| **Subcortical** | | | | | | | |
| Midbrain PAG | | −12, −32, −4 | 2.73 | 0.80 | – | – | – |
| Globus pallidus | | −12, −4 [incomplete] | 2.30 | 0.43 | – | – | – |
| **Paramedian** | | | | | | | |
| Ventral medial prefrontal cortex | | −18, 62 [incomplete] | 3.02 | 0.68 | – | – | – |
| Dorsal medial prefrontal cortex | | – | – | – | 54, 28 [incomplete] | −2.46 | 0.36 |
| **Temporal** | | | | | | | |
| AI/AII | 41/42 | −38, −42, 12 | −0.21 | 2.78 | – | – | – |
| Mid MTG | 21 | −54, −38, −4 | 0.85 | 2.87 | – | – | – |
| **(B) Regions not identified in task contrasts** | | | | | | | |
| **Subcortical** | | | | | | | |
| Thalamus (ventral) | | −10, −14 [incomplete] | 4.25 | 2.15 | – | – | – |
| **Rolandic** | | | | | | | |
| Precentral gyrus | 3,1,2 | −54, −4, 16 | 2.52 | 0.47 | – | – | – |
| **Postrolandic** | | | | | | | |
| Angular gyrus | 39 | −36, −68, 36 | −3.66 | 0.98 | 34, −60, 36 | −3.34 | −1.32 |
| SMG | 40 | −54, −46, 40 | −3.14 | 0.85 | | | |
| ITG | 37 | −58, −46, −16 | 3.11 | 0.65 | 60, −38, −16 | 3.50 | 0.94 |
| **Proisocortical** | | | | | | | |
| Anterior insula | | −30, 24, 4 | 0.60 | 2.61 | | | |

Correlations between normalized cerebral blood flow rates in the anterior MTG (Talairach x = −50, y = −16, z = −16, Table 1) and other brain regions were calculated for voiced and whispered speech and compared as outlined in the text. Values are Z-transformed correlation coefficients identifying regional interrelationships that differentiated these conditions. Entries marked [incomplete] list only the coordinate values that are available.

Thus, differences in the associations between regions are reported only when correlations within condition (in at least one instance) and differences between conditions exceeded the specified thresholds. Since these are independent measures, the conjoint probability of error is actually smaller than the individual thresholds (conjoint probability < 0.0001).

In addition, non-parametric (chi-square) methods were used to test for homogeneity in the distribution of values meeting the above criteria. Values exceeding threshold were treated as dichotomous variables in these analyses (i.e. the magnitude of z-scores was not taken into account) and their distribution in a contingency table was evaluated, with the assumption that, if due to chance alone, these should appear with equal frequency in voiced and whispered conditions; a significant chi-square value indicates that these values are not randomly distributed with respect to condition.
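The homogeneity test can be sketched in Python as a one-degree-of-freedom chi-square against equal expected frequencies. This is one reading of the description above rather than the exact procedure used, and the counts in the example are hypothetical. For one degree of freedom the tail probability has a closed form in terms of the complementary error function, so no statistics library is needed.

```python
import math


def chi_square_equal_frequency(count_voiced, count_whispered):
    """Test whether supra-threshold values are split evenly between conditions.

    Returns the chi-square statistic (1 degree of freedom) and its P value.
    """
    total = count_voiced + count_whispered
    expected = total / 2.0
    chi2 = (
        (count_voiced - expected) ** 2 + (count_whispered - expected) ** 2
    ) / expected
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts of coefficients exceeding threshold in each condition.
chi2, p_value = chi_square_equal_frequency(count_voiced=18, count_whispered=3)
print(round(chi2, 3), p_value)
```

Because values are treated as dichotomous, only the number of supra-threshold coefficients in each condition enters the test, not their magnitude.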

It should be noted that this combination of approaches, while limiting type I error to some degree, does not formally correct for the effect of multiple comparisons.

## Results

### Voiced and Whispered Speech versus Rest

Compared with rest, both voiced and whispered speech were associated with activation of the cerebellum (hemispheres and vermis) and midbrain tegmentum; voiced speech alone activated the mesencephalic PAG. Both tasks were accompanied by activation of the basal ganglia (principally putamen) and thalamus, ventral pre- and post-central gyri, and SMA proper. Left-lateralized activation of the mid- and dorsal operculum was detected in both cases, with opercular activation extending ventrally during voiced speech. The anterior and posterior insula were activated; in both instances activation was lateralized to the left. Neither task was associated with dorsolateral prefrontal activation outside of the operculum, but robust activation of the medial prefrontal cortex accompanied voiced speech. The anterior cingulate cortex was activated in both instances, but the extent of activation, both dorsally and ventrally, was qualitatively greater for voiced speech. Auditory cortices were activated bilaterally in both instances, with activation extending posteriorly in the left hemisphere during voiced speech. Similarly, the anterior and middle portions of the middle temporal gyrus were activated for both speech tasks, with activation extending posteriorly in the left hemisphere for the voiced speech task. The supramarginal gyrus was active bilaterally during both voiced and whispered speech. The parahippocampal and lingual gyri were activated as well: in the left hemisphere during voiced speech, bilaterally for whispered speech.

### Voiced Speech — Whispered Speech

The contrast between voiced speech and whispered speech (significant differences, voiced minus whispered speech, eliminating voxels in which significant voice–rest activations were not detected) revealed an array of regions in which rCBF was significantly greater during voiced speech (Table 1, Figure 1a) including both cortical and subcortical midline structures, frontal operculum and temporoparietal cortices. Local maxima are displayed in Table 1.

Figure 1. Brain maps illustrating differences in regional cerebral blood flow (rCBF) between voiced and whispered speech conditions. Maps depict significant elevations in rCBF (a) during voiced (compared to whispered) speech and (b) during whispered (compared to voiced) speech. Differences depicted were in each case masked with the results of task–rest contrasts (e.g. voxels in which significant elevations in rCBF were detected in the voiced–whispered speech contrast are included in (a) only when significant activations were also detected in the voiced speech–rest contrast). Statistical parametric maps resulting from these analyses are displayed on a standardized MRI scan, which was transformed linearly into the same stereotaxic (Talairach) space. Scans are displayed using neurological convention (left hemisphere is represented on the left). Planes of section relative to the anterior commissural–posterior commissural line are indicated. Values are Z-scores representing the significance level of voxel-wise differences in normalized rCBF for the contrast between speech tasks, masked as outlined above. The range of scores is coded in the accompanying color table. Relative increases in rCBF during voiced speech (a) are seen in the PAG and midline cerebellum (−14, −6 mm), anterior MTG (−14, −6 mm), left posterior MTG/STS and inferior parietal lobule (+22, +30 mm), frontal operculum (−6 mm), ACC (+22, +30 mm), MPFC (+22, +30 mm) and pre-SMA (+45 mm). Data are summarized in Table 1. Relative increases in rCBF during whispered speech (b) are seen in the lingual-parahippocampal gyri (−4 mm), primary auditory and contiguous temporal cortices (+11 mm), SII (+20 mm) and neighboring Rolandic cortices (+20, +30 mm). Data are summarized in Table 2.


Midline subcortical structures included the PAG, along its entire extent, from −16 to −4 mm below the anterior commissural–posterior commissural (AC-PC) plane, the maximal difference (also representing the local activation maximum in the voiced speech–rest contrast) occurring in the dorsolateral PAG (Table 1). These differences, which may encompass contiguous portions of the midbrain tegmentum, paralemniscal and parabrachial areas, exceeded threshold only in the left hemisphere. Significant elevations in activity during voiced speech were also observed in the left putamen and left ventral thalamus (local maximum in the region of the CM nucleus) and in the midline cerebellum bilaterally, with more robust differences, in this instance, occurring in the right hemisphere.

Paramedian cortical areas in which rCBF was greater during voiced speech included both proisocortical regions (ventral anterior cingulate cortex, BA 32/24) and neocortical regions (medial prefrontal cortex, BA 8, 9, 10, extending dorsally from ∼12 mm above the AC-PC plane and posteriorly to include the pre-SMA, BA 6). Significant differences in midline cortical areas were lateralized to the left hemisphere. Activations were stronger in the medial prefrontal than in the anterior cingulate cortex. Activity in the ventral portions of the left and right frontal operculum (pars orbitalis, BA 47) was significantly greater during voiced speech as well.

Voiced speech-related elevations in rCBF were detected in the anterior portion of the middle temporal gyrus (MTG, BA 21) and in the posterior superior temporal gyrus (STG) at the temporoparietal junction, extending dorsally into the angular gyrus and ventrally into the superior temporal sulcus (STS) and posterior MTG. Activations in the anterior MTG were detected in both left and right hemispheres, although those on the right were of borderline significance (local maximum = 2.65 at x = 50, y = 0, z = −16). Activations in the posterior temporal cortices were, on the other hand, robustly lateralized to the left hemisphere.

Voiced speech-related elevations in rCBF were maximal in the posterior temporal gyrus/STS, followed by the medial prefrontal cortex and PAG.

Significant activations of the midline cerebellum, left thalamus, left putamen/globus pallidus and left pre-SMA were observed when both voiced and whispered speech were compared to rest, but these were significantly greater during voiced speech. In the anterior temporal cortices, anterior cingulate cortices and frontal operculum, significant activations were also observed during both voiced and whispered speech versus rest, but a qualitatively wider extent of activation was in each case associated with voiced speech (local maxima within these extended regions are summarized in Table 1).

### Whispered Speech — Voiced Speech

The contrast between whispered and normally voiced speech (significant differences, whispered minus voiced speech, eliminating voxels in which significant whisper–rest activations were not detected) showed that rCBF was significantly greater during whispered speech (Table 2, Figure 1b) in perirolandic, auditory and mesial temporal cortices and in the posterior thalamus. Local maxima are displayed in Table 2.

Whispered speech-related elevations in perirolandic activity were observed in both left and right hemispheres, maximal in somatosensory (SII, BA 43) cortices, and lateralized to the left. The transverse temporal gyrus (AI/AII, BA 41, 42) and posterior STG in the region of the planum temporale (BA 22) were significantly more active for whispered than for voiced speech, with the differences exceeding threshold only in the left hemisphere. The right fusiform gyrus (BA 37) and left pulvinar were significantly more active for whispered speech as well.

Significant activations of the left somatosensory cortex and transverse temporal gyrus were observed during both voiced speech and whispered speech (versus rest) but rCBF responses were significantly greater during whispered speech. In the left planum temporale and right somatosensory cortex, significant activations were also observed during both voiced and whispered speech versus rest, but a qualitatively wider extent of activation was in each case associated with whispered speech (local maxima within these extended regions are summarized in Table 2).
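The masking step used for both contrasts (retaining a voxel in the task-versus-task difference map only when the corresponding task-versus-rest activation was also significant) can be sketched as follows. This is a minimal illustration, not the analysis pipeline used in the study: the function name, the toy Z-maps and the threshold of 3.09 are assumptions introduced here for demonstration only.

```python
# Hypothetical sketch of contrast masking; names, data and threshold are
# illustrative assumptions, not values taken from the study.

def masked_contrast(z_contrast, z_task_vs_rest, z_threshold=3.09):
    """Keep a contrast Z-score only where the task-vs-rest map is also significant.

    Both inputs are flat lists of Z-scores over the same voxels.
    Voxels that fail either criterion are set to 0.0.
    """
    if len(z_contrast) != len(z_task_vs_rest):
        raise ValueError("maps must cover the same voxels")
    return [
        zc if (zc > z_threshold and zr > z_threshold) else 0.0
        for zc, zr in zip(z_contrast, z_task_vs_rest)
    ]


# Toy maps covering four voxels.
voiced_minus_whispered = [4.2, 3.5, 1.0, 5.0]
voiced_minus_rest = [5.1, 2.0, 4.0, 3.2]

masked = masked_contrast(voiced_minus_whispered, voiced_minus_rest)
print(masked)
```

In this toy example the second voxel is dropped because it is not activated relative to rest, and the third because the difference between speech tasks does not reach threshold.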

### Functional Connectivity

Activation (task subtraction) and covariance methods represent two distinct approaches to the evaluation of brain function. While the former evaluate differences in activity between pairs of tasks, covariance methods assess within-task relationships between regional cerebral blood flow rates, with interregional correlations serving as indices of functional connectivity within the CNS. The two approaches thus provide different but complementary types of information. For example, whether two regions are significantly correlated within a task — either positively or negatively — has been shown to be independent of whether one or both have been activated relative to a common control task.

Correlations between normalized cerebral blood flow rates in selected voxels of interest and all other brain regions were calculated individually for voiced and whispered speech conditions and then compared as outlined above. The voxels selected — PAG and anterior MTG — correspond to local maxima from the voiced versus whispered speech contrasts reported above.

All values reported below and in Tables 3 and 4 and Figures 2 and 3 represent instances in which correlations within condition (in at least one instance) and differences between conditions exceeded the specified thresholds (conjoint P < 0.0001).
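The logic of comparing within-condition correlations can be illustrated with a short sketch. It assumes that Pearson coefficients are Fisher z-transformed and compared with the standard test for two independent correlations; the exact procedure is the one given in the Materials and Methods, and because the same participants performed both tasks the independence assumption here is a simplification. All function names and data values are hypothetical.

```python
import math

# Hypothetical sketch: seed-voxel correlation per condition, followed by a
# Fisher z comparison. Data and function names are illustrative assumptions.

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_z(r):
    """Fisher r-to-z transform (undefined for |r| = 1)."""
    return math.atanh(r)


def compare_correlations(r1, n1, r2, n2):
    """Two-sided z-test for the difference between two independent correlations."""
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (fisher_z(r1) - fisher_z(r2)) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Toy normalized rCBF values at a seed voxel and a target voxel (one per scan).
seed = [1.0, 2.0, 3.0, 4.0, 5.0]
target_voiced = [5.0, 3.0, 4.0, 2.0, 1.0]
target_whispered = [3.0, 1.0, 4.0, 2.0, 3.0]

r_voiced = pearson_r(seed, target_voiced)
r_whispered = pearson_r(seed, target_whispered)
z_diff, p_diff = compare_correlations(r_voiced, len(seed), r_whispered, len(seed))
print(r_voiced, r_whispered, z_diff, p_diff)
```

A connection would be reported only if the within-condition correlation and the between-condition difference both passed their thresholds, which is the conjoint criterion described above.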

Figure 2.

Schematic maps illustrating differences in functional connectivity of the PAG during voiced and whispered speech. Correlations between normalized rCBF rates in the PAG (Talairach x = −8, y = −28, z = −4, representing the local maximum, voiced versus whispered speech, Table 1) and other brain regions. Pearson product–moment correlation coefficients were calculated for each condition independently and compared — as outlined in the Materials and Methods — to identify functional connections that differentiated voiced (a) and whispered (b) speech. Regions displayed are limited to those in which significant differences were detected in the speech task contrasts (Figs 1a,b; Tables 1 and 2) and their homologues in the contralateral hemisphere. The PAG is functionally coupled to a wide array of regions during voiced but not during whispered speech: activity in the PAG is correlated with that in both subcortical and cortical areas including paramedian and sensorimotor cortices as well as heteromodal association areas within the temporal lobe. Red indicates a positive correlation, blue negative. Lines depict z-transformed correlation coefficients >0.7 in absolute value. Line width corresponds to the relative magnitude of coefficients, which are indicated in the figure. Data are summarized in Table 3.

Figure 3.

Scattergrams depicting correlations between normalized rCBF rates in PAG and temporal lobe regions. Normalized rCBF rates were extracted from individual PET scans at coordinates specified in the covariance analyses (results summarized in Tables 3 and 4 and Fig. 2). A negative correlation between activity in the PAG (x = −8, y = −28, z = −4) and the left primary auditory cortex (x = −56, y = −26, z = 12) (r = −0.65, see Table 3, Fig. 3) is seen during voiced speech (a). No significant relationship between activity in these regions is evident during whispered speech (b). A positive correlation between activity in the anterior MTG (x = −50, y = −16, z = −16) and the PAG (x = −12, y = −32, z = −4) (r = 0.59, see Table 4) is seen during voiced speech (c). Again, no significant relationship between activity in these regions is observed during whispered speech (d).

### Functional Connections of the PAG

Figure 2 illustrates functional connections of the PAG (Talairach x = −8, y = −28, z = −4, Table 1) that differentiated voiced and whispered speech; data are summarized in Table 3a. Regions displayed in Figure 2 are limited to those (and homologues in the contralateral hemisphere) in which significant differences were detected in the task contrasts reported above. Differences in connectivity that lay outside of this set of regions are contained in Table 3b. In general, activity in the PAG was correlated with activity in a wide array of regions during voiced but not whispered speech, including both visceral and somatomotor areas as well as heteromodal association areas within the temporal lobe. The chi-square test indicated that the distribution of these differences was non-random (P < 0.0001). While significant differences in activation were, as noted above, generally lateralized to the left hemisphere, differences in functional connectivity were frequently bilateral.
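One way to read the chi-square result is as a goodness-of-fit test of whether connectivity differences favor one condition more often than an even split would predict. The sketch below illustrates that reading only; the counts are invented for demonstration, the equal-split null hypothesis is an assumption, and the function name is hypothetical.

```python
import math

# Hypothetical sketch: chi-square goodness-of-fit test against an equal split.
# The counts below are invented and do not come from Table 3.

def chi_square_equal_split(count_a, count_b):
    """Test whether two counts depart from a 50/50 split (1 degree of freedom)."""
    total = count_a + count_b
    expected = total / 2.0
    chi2 = ((count_a - expected) ** 2 + (count_b - expected) ** 2) / expected
    # For 1 degree of freedom the chi-square survival function reduces to erfc.
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


# Toy counts: connections stronger during voiced vs. whispered speech.
stronger_voiced = 20
stronger_whispered = 2

chi2, p = chi_square_equal_split(stronger_voiced, stronger_whispered)
print(chi2, p)
```

A small P value under this null would indicate that the preponderance of connections in one condition is unlikely to have arisen by chance.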

During voiced but not whispered speech, significant positive correlations were detected between the PAG and ventral frontal operculum and anterior MTG bilaterally, and with the left posterior MTG, left globus pallidus, right thalamus (local maximum in the CM nucleus) and right anterior cingulate cortex. Correlations between PAG and medial prefrontal cortices were positive ventrally (in the left hemisphere) and negative dorsally (bilaterally). Negative correlations with SII in both hemispheres, the left cerebellar vermis and left primary auditory cortex were observed as well. In contrast, none of these correlations reached the specified threshold for whispered speech; transformed coefficients exceeded ±1.0 only in the left ventral frontal operculum and right anterior MTG and thalamus during this condition (and were significantly lower than correlations observed during voiced speech). Figure 3a,b depicts correlations between rCBF rates in the PAG and primary auditory cortex during both speech tasks.

Outside of regions highlighted in the voiced–whispered speech contrasts, covariance analyses identified additional significant differences in connectivity in Rolandic cortices, proisocortical and subcortical regions. During voiced but not whispered speech, activity in the PAG was negatively correlated with that in the postcentral gyri and right claustrum/insula, and positively correlated with activity in the temporal polar cortices. In contrast, activity in the left anterior insula was significantly correlated with activity in the PAG during whispered but not during voiced speech.

### Functional Connections of the Anterior MTG

Similar comparisons were made for functional connections of the left anterior MTG (Talairach x = −50, y = −16, z = −16, Table 1), which, like the PAG, was more tightly coupled to other regions during voiced than during whispered speech. The chi-square test indicated that this asymmetry was significant (P < 0.025). Differences were again identified in regions (and homologues in the contralateral hemisphere) that had significantly differentiated conditions in the contrasts depicted in Figure 1a,b and Tables 1 and 2. The pattern was in many cases similar to that reported above, i.e. a number of regions were similarly coupled to both PAG and anterior MTG during voiced but not during whispered speech. In contrast to the PAG, however, differences in functional connectivity in this case predominated in the left hemisphere. These data are summarized in Table 4a.

The anterior MTG was coupled to the PAG itself during voiced but not whispered speech. Figure 3c,d depicts the correlations between rCBF rates in these regions during both conditions. Like activity in the PAG, activity in the anterior MTG was positively correlated with that in the left ventral MPFC and left GP, and negatively correlated with activity in the right dorsal MPFC, during voiced speech. Some of the differences between conditions were novel: activity in the anterior MTG was positively correlated, during whispered but not during voiced speech, with activity in the left primary auditory cortex and the left central portion of the MTG.

Outside of regions highlighted in the voiced–whispered speech contrasts, significant differences in connectivity were detected between the anterior MTG and Rolandic and post-Rolandic cortices, as well as proisocortical and subcortical regions. Differences again predominated in the left hemisphere, and are summarized in Table 4b. During voiced but not whispered speech, activity in the anterior MTG was positively correlated with that in the left ventral primary motor cortex. Activity in the MTG and left ventral thalamus was positively correlated during both voiced and whispered speech, but significantly more so during voiced speech. Activity in MTG and inferior parietal areas, including the SMG and angular gyrus, was negatively correlated during voiced speech; correlations were absent or marginal for whispered speech. Activity in the left anterior insula and anterior MTG was positively correlated (as with the PAG) for whispered but not voiced speech.

## Discussion

### Production: Both Visceromotor and Neocortical Systems Are Activated During Human Vocalization

A considerable body of literature suggests that a set of subcortical and cortical midline structures, centered upon the mesencephalic PAG, controls the production of species-specific calls in lower mammalian species (e.g. Jurgens, 1976; Larson and Kistler, 1986). These stereotypical vocalizations characteristically convey information about arousal and emotional state, consistent with the fact that activity in the PAG is regulated by inputs from limbic or limbic-related areas of the brain (Jurgens, 2002).

In most lower species, vocalization appears to be exclusively controlled by these visceromotor mechanisms, and voluntary control over production of stimulus-contingent calls is correspondingly slight (even in non-human primates; see Goodall, 1986). While increasing neocortical regulation of lower-order motor centers is the phylogenetic rule, this does not appear to be true of species-specific vocalizations: there are direct projections from the neocortex to cranial nerve nuclei that control the oral articulators (lips, tongue and jaw; Kuypers, 1958) in non-human primates, but this is not the case for the nucleus ambiguus, the medullary nucleus that controls the larynx (Zhang et al., 1995).

In contrast, humans possess an unprecedented measure of voluntary control over the larynx. Since direct projections from the motor cortex to the nucleus ambiguus do exist in humans (Kuypers, 1958; Jurgens, 1976; Iwatsubo et al., 1990), it has frequently been assumed that neocortical mechanisms must play the principal role in human verbal behavior. Visceromotor mechanisms, on the other hand, have been assumed to underlie the production of ‘vestigial’ non-verbal emotional utterances such as crying or laughter. But the degree to which the PAG and other elements of the midline call system might be involved in the production of normal propositional speech is unresolved and until now has never been examined directly. Our results suggest that propositional speech in humans is indeed associated with activation of visceromotor mechanisms that organize reflex-like vocal reactions in lower species.

### Activations within Elements of the Midline Call System

#### PAG

The central role played by the PAG in the generation of species-specific calls has been established in a wide range of species (Davis and Zhang, 1991; Larson, 1991). In monkeys it is the lateral portion of the PAG that projects to the lower brainstem nuclei — nucleus retroambiguus and other portions of the medullary reticular formation — which themselves project to motor neuron pools controlling the laryngeal, pharyngeal, facial, intercostal and abdominal musculature, representing the final common pathway for regulation of vocalization (Holstege, 1989; Zhang et al., 1995; VanderHorst et al., 2000). In agreement with this, it was in the lateral PAG that we observed maximal activation during voiced speech.

Vocalization-related activation in our participants included an extensive portion of the PAG (from −16 to −4 mm below the AC-PC plane); because of the limited spatial resolution of the PET method, this activity may also include contiguous areas (e.g. midbrain tegmentum, paralemniscal and parabrachial areas) in which stimulation also elicits species-specific calls in non-human primates (Magoun et al., 1937; Jurgens and Richter, 1986). It should be noted that activation of the PAG during vocalization might be due in part to increased sensory feedback to this region during phonation (Ambalavanar et al., 1999). In addition, it is important to point out that coverage of the brainstem extended only to the upper pons. The medulla, which contains structures such as the nucleus ambiguus in which voice-related activations might be expected, was not sampled.

#### Paramedian Cortices

We observed concurrent activation of the anterior cingulate cortex (ACC) during vocalization in our participants. The ACC plays a well-recognized role in vocalization behavior in lower species, where lesion studies suggest that it participates in hierarchical regulation of PAG-induced vocalizations (Jurgens and Lu, 1993; Jurgens and Zwirner, 1996). The region is known to play a role in human motor control (Picard and Strick, 1996; Roland and Zilles, 1996), and it has been shown that speech tasks selectively activate the intermediate dorsal and the rostral ACC (Paus et al., 1993), corresponding to the portions of the cingulate that are active in primates during vocalization (see Vogt and Barbas, 1988, for review). Consistent with this, we observed vocalization-related activation in these portions of the ACC, including BA 24 and 32, extending from the midline into the cingulate sulcus.

It is interesting, however, that in our participants the strongest activations in the paramedian cortices were found not in the proisocortical ACC, but in the contiguous medial prefrontal cortex (MPFC, BA 9, 10). This is in agreement with a previous report that has shown this region to be activated during voiced speech in humans (Blank et al., 2002).

The MPFC (along with the entire prefrontal cortex) has expanded over the course of evolution and is more prominent in humans than in any of the lower species in which the species-specific call system has been described. Thus, direct evidence of a role for this region in animal vocalization is not available. While the function of the MPFC remains obscure even in humans, it appears likely that it shares some functional characteristics with the ACC and other structures of the medial wall. Indeed, tract tracing studies in non-human primates have shown that the lateral PAG receives descending projections from the MPFC as well as the ACC (An et al., 1998); it has been suggested that the region may be a constituent of the visceromotor system in monkeys, and could represent a neocortical elaboration of a similar system in humans.

Most strikingly, the covariance analyses indicated that not only are the PAG and paramedian cortices coactivated, but that activity in these regions is strongly correlated during voiced but not during whispered speech. The fact that they are both activated and functionally coupled provides compelling evidence that, as in lower mammalian species, these midline regions may operate as an integrated system during human vocalization.

### Activations Extending beyond the Midline System

Activation of the cortical-PAG network that controls vocalization in lower species might plausibly account for rapid communication of visceral — e.g. affective or other paralinguistic — information. Clearly, however, the linguistic information conveyed in our participants' narratives is propositional, extends beyond transmission of information about emotion or arousal status, and is subject to a significant degree of voluntary control.

Accordingly, we observed coactivation of additional regions during voiced speech — premotor regions including the SMA and frontal operculum and subcortical projection areas in the basal ganglia and thalamus — that have never been described as elements of species-specific call systems in lower mammalian species. Nevertheless, these regions play a well-recognized role in the control of voluntary movement and, as such, they may regulate activity within the midline cortical-brainstem system during human phonation.

#### Supplementary Motor Area, Basal Ganglia and Thalamus

The SMA is a central element of the medial premotor system, playing a role in the organization of voluntary, self-initiated movements (Picard and Strick, 1996). Stimulation of the region in which voice-related elevations of rCBF were maximal, the pre-SMA, has been reported to elicit vocalizations in humans (Fried et al., 1991) but not in lower species. It is assumed to do so independently of the PAG since there appear to be no direct connections between these structures (Jurgens, 1984). On the other hand, the pre-SMA is reciprocally connected with the MPFC and ACC (Bates and Goldman-Rakic, 1993) and could, in humans, be involved in voluntary, stimulus-independent regulation of these cortical regions (and thus the PAG) during voiced speech.

Activity within the SMA is regulated to some degree by its interactions with the basal ganglia (Alexander et al., 1986; Parent and Hazrati, 1995), in one of a series of circuits connecting striatum, frontal cortex and thalamus that serve to coordinate complex sequences of behavior. The putamen, in which voice-related increases in rCBF were maximal, is a central constituent of one of these — the so-called motor circuit — which conducts neural information from the striatum through the ventral thalamus to the SMA. The motor circuit might enable more precise voluntary control over the timing and sequencing of laryngeal, respiratory and articulatory activity during voiced speech.

Activations were observed throughout the thalamus during voiced speech, encompassing the ventral nuclei that are part of the motor circuit that has just been described. However, thalamic activations were maximal in the region of the centromedian (CM) nucleus. Although it is impossible to reliably identify individual thalamic nuclei at the level of spatial resolution obtained with PET, it is interesting, nevertheless, that the CM thalamus may represent another point of intersection between medial cortical-PAG and neocortical premotor systems. The CM nucleus represents a major source of afferent input to the basal ganglia, plays a role in motor intention (Burk and Mair, 2001), and has widespread projections to sensorimotor areas, including the SMA and primary motor cortex (Moran et al., 1982; Jurgens, 1984). Furthermore, the CM nucleus receives projections from the cingulate vocalization area described above (Muller-Preuss and Jurgens, 1976) and projects to the PAG (Marini et al., 1999). As such, it may represent an interface between the visceral and somatomotor systems.

#### Frontal Operculum

While the frontal operculum has never been described as a core element of the call system in lower species, it functions as a premotor area — a subdivision of the lateral premotor system (Roland and Zilles, 1996) — related to speech production in humans. Interestingly, the opercular region commonly considered to represent Broca's area proper (BA 44, 45) in humans — homologous to the frontal area (F5) in the non-human primate that houses mirror neurons and has been proposed to play a role in ‘protospeech’ (Arbib, 2003) — was active during both speaking and whispering (versus rest) without differences specifically related to phonation.

In contrast, the portion of the operculum that was associated with greater activation during vocalization was the ventral pars orbitalis, BA 47. Based on anatomical connectivity, the pars orbitalis may be more likely related to ‘visceral’ systems than the dorsal portions of the operculum. Interestingly, this region has also been shown to be more susceptible than the dorsal regions to the interruption of utterances by intraoperative stimulation in humans (Ojemann et al., 1989). The region may thus represent a point at which voluntary regulation of visceromotor systems could be interposed during human speech. This possibility is supported by our covariance analyses, which show that ventral opercular activity — like that in the ACC and medial prefrontal cortex — is functionally coupled with activity in the PAG during the production of voiced speech.

#### Rolandic Cortex

The ventral Rolandic cortices were active during both speaking and whispering when these conditions were compared to rest. This activity may in both instances be related to cortical control of the oral articulators [mediated by direct corticobulbar projections to the cranial nerve nuclei controlling the lips, tongue and jaw (Kuypers, 1958)], since these are engaged during both voiced and whispered speech. Somewhat unexpectedly, we found that activity in the dorsal Rolandic areas (which increased during both voiced and whispered speech versus rest) was significantly greater during whispered speech. More dorsal Rolandic activity might also reflect cortical control of the respiratory musculature of the chest wall and diaphragm (e.g. Ramsay et al., 1993; Smejkal et al., 2000); the differences we observed might conceivably be due to an increase in respiratory rate secondary to a more rapid loss of air during whispered speech.

It is interesting that in a majority of regions, vocalization-related increases in rCBF were markedly left lateralized. This observation is consistent with the idea (Kimura and Archibald, 1974; Greenfield, 1991) that the left hemisphere plays a dominant role in praxis, i.e. in the organization of complex, sequential, time-ordered motor programs of both the oral and limb musculature. Although not generally thought of in this sense, the temporal coordination of respiration and laryngeal activity, and the rapid, precise pitch adjustments that occur during voiced speech may constitute a praxic demand that preferentially engages left hemisphere mechanisms.

### Perception: Voice-sensitive Areas in Temporal Cortex Are Selectively Activated and Interact with Motor Structures During Human Vocalization

The motor control of phonation does not occur in a vacuum. Intricate online adjustments of pitch and airflow must depend upon perception of one's own vocal output — self-monitoring that provides crucial information for feedback regulation of the motor areas involved. Both contrast and covariance analyses suggest a network of regions that may support this mechanism.

#### Cortical Patterns Associated with Vocal Self-monitoring

Voiced speech was associated with increased activity in perisylvian areas of the temporal lobe, including the anterior MTG and contiguous STS and pronounced left-lateralized activity of the posterior STG, MTG and caudal STS. Although these regions are reported to participate in language processing, the increases we observed cannot be attributed to task-related linguistic features, since both voiced and whispered speech require the same manner and degree of semantic, lexical and syntactic performance. Nor can differences be related to self-monitoring of speech per se, since this occurs in both conditions (in the relative silence of the PET suite, subjects were able to hear both voiced and whispered output). Our results suggest instead that these perisylvian regions may play a unique role in self-monitoring, specifically in the perception of one's own voice.

The present findings are consistent with a growing number of neuroimaging studies that have demonstrated temporal lobe activations (in the STS and contiguous portions of the superior and middle temporal gyri) during the processing of speech and voice (Belin et al., 2000; Binder et al., 2000; Scott et al., 2000; Wise et al., 2001). While we detected voice-related activity in the same regions as these authors did, there are nevertheless some differences with respect to lateralization. For example, Belin et al., who specifically evaluated voice perception, reported bilateral responses that were in general lateralized to the right hemisphere. The distinctive pattern we detected (bilateral activation of anterior temporal, but strongly left-lateralized activation of posterior temporal structures) might be due to the fact that our participants were monitoring their own voice, while other studies have typically evaluated perception of the speech and voice of others.

Electrophysiological methods have similarly shown unique responses associated with vocal self-monitoring (Gunji et al., 2001). Indeed, one would expect the neural substrate for monitoring one's own voice to differ from that involved in monitoring the voice of an interlocutor, since the functional objectives of these processes are quite distinct: Monitoring the voice of others is aimed at decoding and extracting linguistic (voice onset, linguistic stress, prosody) as well as paralinguistic (age, sex, emotional status) information in the course of speech perception and language comprehension. Self-monitoring is part of a servomechanism that makes possible the continuous online correction of the laryngeal and oral articulatory movements that encode such information.

There are alternative explanations for these findings. For example, it should be noted that the acoustic features of voiced and whispered speech are different. While rate, timing and amplitude modulation are the same in both conditions, whispering produces broadband white noise while voiced speech has a number of more complex acoustic characteristics — rich differences in timbre, abundant formant transitions, variations in frequency, as well as overall increased amplitude. It is possible that differences in temporal lobe responses to voiced and whispered speech could be due to these acoustic dissimilarities.

Although this possibility cannot be ruled out, it appears unlikely since differences in the perception of lower-level features might be expected in early auditory areas (e.g. in primary auditory cortex and contiguous auditory belt areas within the STG). The fact that voice-specific activations were not found here, but were detected instead in higher order areas (anterior MTG and STS) makes more tenable the possibility that they are related to more complex processes such as vocal self-monitoring.

It should be possible to disambiguate lower-level auditory perception and vocal self-monitoring in future studies. This could be done by distorting vocal output such that acoustic characteristics are preserved but the signal is rendered unrecognizable as one's own voice, or by simply comparing spontaneously voiced speech with playback of subjects' recorded output; in the latter case, the recording would retain most of the acoustic as well as linguistic features, but subjects would not be monitoring speech in the same fashion (i.e. for the purpose of online correction in the course of production).

Unexpectedly, the primary and contiguous auditory association cortices (AI and AII, BA 41/42, and the proximal portion of the planum temporale, BA 22) were more active for whispered than for voiced speech. It is possible that increased gain in these auditory areas might be associated with the processing of whispered speech, which is produced at lower volume, or that noisier broadband output during whispering elicits a greater response in AI.

On the other hand, it is possible that the differences between voiced and whispered speech in the early auditory areas may actually constitute relative, perhaps compensatory decreases in activation associated with self-monitored vocalization. This notion is in agreement with functional neuroimaging (Wise et al., 1999) and electrophysiological studies in humans (Numminen and Curio, 1999; Curio et al., 2000) and non-human primates (Muller-Preuss et al., 1980) showing that voiced utterances or production of calls may have an inhibitory effect on cortical auditory activity.

While whispered speech was associated with elevated rCBF in unimodal auditory core and belt areas, it was not associated with increased rCBF in downstream heteromodal regions — i.e. anterior MTG or posterior temporoparietal areas extending to the posterior MTG and intervening STS. These areas, which receive input from the unimodal areas (Romanski et al., 1999) and project to the prefrontal cortex and other heteromodal cortices, were more active during voiced speech, and may preferentially perform higher-level analysis of the more complex acoustic information that is contained in voice.

#### Interaction of Auditory and Motor Systems

As noted, self-monitoring serves a distinct set of needs that, unlike ordinary speech perception, requires the precise integration of auditory and motor activity. Our covariance analyses provide evidence for functional connections that may support such interactions.

During voiced but not whispered speech, the anterior MTG was functionally coupled to regions within the visceromotor system (PAG, MPFC, ACC) as well as to neocortical areas and subcortical structures to which they are connected (precentral and supramarginal gyri, basal ganglia and thalamus) (Figs 2 and 3; Tables 3 and 4). In light of the suggestion that human vocalization is regulated by visceromotor as well as neocortical motor systems, these results provide strong evidence that auditory information gains feedback access to both.

Of additional interest is the fact that while vocalization-related increases in PAG activity were positively correlated with those in the MTG, they were negatively correlated with activity in AI (Fig. 3). This suggests that this central visceromotor structure is functionally coupled in different ways to, and perhaps differentially regulated by, primary auditory and auditory association cortices.
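The logic of such a seed-based covariance analysis can be sketched as follows. This is a minimal illustration that assumes the correlations are computed across subjects within a single condition, a common approach for PET data; the function names and all rCBF values are hypothetical and are not the study's data or analysis pipeline.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denominator == 0:
        raise ValueError("correlation is undefined for constant input")
    return numerator / denominator


def seed_correlations(seed, targets):
    """Correlate one seed region with each target region across subjects."""
    return {name: pearson(seed, values) for name, values in targets.items()}


# Hypothetical normalized rCBF values during voiced speech, one per subject.
pag_voiced = [1.00, 1.04, 0.97, 1.08, 1.02, 0.95]
targets_voiced = {
    "anterior_MTG": [0.98, 1.03, 0.96, 1.07, 1.00, 0.94],
    "AI": [1.05, 1.01, 1.07, 0.97, 1.02, 1.09],
}

r_voiced = seed_correlations(pag_voiced, targets_voiced)
```

With these made-up values the seed correlates positively with the association region and negatively with primary auditory cortex, mirroring the sign pattern described above. Repeating the computation on values from the whispered condition would allow the two conditions to be compared.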

Contrast analyses additionally demonstrated that voiced speech was associated with significant increases in activity of the midline cerebellum. Lesions of this portion of the cerebellum have been shown to have the greatest effects upon speech (Brown et al., 1970; Lechtenberg and Gilman, 1978; Ackermann and Ziegler, 1991). Indeed, the vermis may play a central role in the integration of auditory and motor systems. It has been suggested (Penhune et al., 1998) that this region, which has reciprocal connections with both cortical auditory and cerebellar motor areas (Martin et al., 1977; Watt and Mihailoff, 1983), provides circuitry within which temporal auditory information (which may be more richly encoded in voiced than unvoiced speech) is extracted, enabling the motor system to generate a precisely timed response. The vermis may thus represent an additional auditory–motor interface for the online control of vocalization. This possibility is supported by the fact that covariance analyses demonstrated significant correlations between activity in the PAG and the cerebellar vermis during voiced but not during whispered speech.

## Conclusions

Taken together, our results suggest that human vocalization is regulated not exclusively by neocortical or by visceromotor mechanisms, but by a combination of both. The selective activation during human vocalization of the PAG and paramedian cortices, elements of the species-specific call system that regulates vocal production in lower species, may represent the process of ‘exaptation’ (Gould and Lloyd, 1999), whereby features previously designed for one function (species-specific vocalizations that may convey information about emotional state) are co-opted for a different purpose (linguistic and paralinguistic use of voice during propositional speech) in the course of evolution.

The greater degree of voluntary control that humans have over phonation can be explained by neocortical regulation of these visceromotor structures. Our results are consistent with such a model: task contrasts showed concurrent activation of structures that are in a position to provide this type of control, including premotor structures and elements of basal ganglia-thalamocortical circuitry. Moreover, covariance analyses showed that the PAG, a final common pathway in the visceromotor system, is functionally coupled during voiced speech to both the phylogenetically older paramedian cortices and neocortical premotor areas and their subcortical connections.

Finally, there are areas in the temporal lobe and cerebellum that are coactivated and functionally coupled to both visceromotor and neocortical systems during voiced speech production. These regions may support vocal self-monitoring, providing crucial information for regulation of the motor areas, i.e. complex online adjustments of pitch and airflow that depend upon perception of one's vocal output.

Together, these regions may constitute a neuroanatomical circuit that supports the production and feedback regulation of speech-related vocalization. If so, this may shed light on the pathophysiology of motor speech disorders such as spasmodic dysphonia and stuttering, in which symptoms are typically manifest during voiced but not during unvoiced speech, and as such may reflect dynamic abnormalities within this system.

The authors wish to thank Drs. Jonathan Fritz and Mar Hauser for a critical review of the manuscript and helpful suggestions, Omar Ali for aid with data analysis, and the NIH PET technologists for expert technical assistance.

## References

Ackermann H, Ziegler W (1991) Cerebellar voice tremor: an acoustic analysis. J Neurol Neurosurg Psychiatry 54:74–76.
Alexander GE, DeLong MR, Strick PL (1986) Parallel organization of functionally segregated circuits linking basal ganglia and cortex. Annu Rev Neurosci 9:357–381.
Ambalavanar R, Tanaka Y, Damirjian M, Ludlow CL (1999) Laryngeal afferent stimulation enhances Fos immunoreactivity in periaqueductal gray in the cat. J Comp Neurol 409:411–423.
An X, Bandler R, Ongur D, Price JL (1998) Prefrontal cortical projections to longitudinal columns in the midbrain periaqueductal gray in macaque monkeys. J Comp Neurol 401:455–479.
Arbib MA (2003) The evolving mirror system: a neural basis for language readiness. In: Language evolution (Christiansen MH, Kirby S, eds), pp. 182–200. New York: Oxford University Press.
Bates JF, Goldman-Rakic PS (1993) Prefrontal connections of medial motor areas in the rhesus monkey. J Comp Neurol 336:211–228.
Belin P, Zatorre RJ, Lafaille P, Ahad P, Pike B (2000) Voice-selective areas in human auditory cortex. Nature 403:309–312.
Binder JR, Frost JA, Hammeke TA, Bellgowan PSF, Springer JA, Kaufman JN, Possing ET (2000) Human temporal lobe activation by speech and nonspeech sounds. Cereb Cortex 10:512–528.
Blank SC, Scott SK, Murphy K, Warburton E, Wise RJS (2002) Speech production: Wernicke, Broca and beyond. Brain 125:1829–1838.
Braun AR, Varga M, Stager S, Schulz G, Selbie S, Maisog JM, Carson RE, Ludlow CL (1997) Altered patterns of cerebral activity during speech and language production in developmental stuttering: an H2(15)O positron emission tomography study. Brain 120:761–784.
Braun AR, Guillemin A, Hosey L, Varga M (2001) The neural organization of discourse: a PET study of English and ASL production. Brain 124:2028–2044.
Brown JR, Darley FL, Aronson AE (1970) Ataxic dysarthria. Int J Neurol 7:302–318.
Burk JA, Mair RG (2001) Effects of intralaminar thalamic lesions on sensory attention and motor intention in the rat: a comparison with lesions involving frontal cortex and hippocampus. Behav Brain Res 123:49–63.
Creutzfeldt O, Ojemann G, Lettich E (1989) Neuronal activity in the human lateral temporal lobe. II. Responses to the subjects own voice. Exp Brain Res 77:476–489.
Curio G, Neuloh G, Numminen J, Jousmaki V, Hari R (2000) Speaking modifies voice-evoked activity in the human auditory cortex. Hum Brain Mapp 9:183–191.
Davis PJ, Zhang SP (1991) What is the role of the midbrain periaqueductal gray in respiration and vocalization? In: The midbrain periaqueductal gray matter (Depaulis A, Bandler R, eds), pp. 57–66. New York: Plenum Press.
Davis PJ, Zhang SP, Winkworth A, Bandler R (1996) Neural control of vocalization: respiratory and emotional influences. J Voice 10:23–38.
Fardin V, Oliveras JL, Besson JM (1984) A reinvestigation of the analgesic effects induced by stimulation of the periaqueductal gray matter in the rat. I. The production of behavioral side effects together with analgesia. Brain Res 306:105–123.
Fiez JA, Petersen SE (1998) Neuroimaging studies of word reading. Proc Natl Acad Sci USA 95:914–921.
Fried I, Katz A, McCarthy G, Sass KJ, Williamson P, Spencer SS, Spencer DD (1991) Functional organization of human supplementary motor cortex studied by electrical stimulation. J Neurosci 11:3656–3666.
Greenfield PM (1991) Language, tools and brain: the ontogeny and phylogeny of hierarchically organized sequential behavior. Behav Brain Sci 14:531–595.
Goodall J (1986) Chimpanzees of Gombe: patterns of behavior. Cambridge, MA: Belknap Press of Harvard University Press.
Gould SJ, Lloyd EA (1999) Individuality and adaptation across levels of selection: how shall we name and generalize the unit of Darwinism? Proc Natl Acad Sci USA 96:11904–11909.
Gunji A, Hoshiyama M, Kakigi R (2001) Auditory response following vocalization: a magnetoencephalographic study. Clin Neurophysiol 112:514–520.
Holstege G (1989) Anatomical study of the final common pathway for vocalization in the cat. J Comp Neurol 284:242–252.
Horwitz B, McIntosh AR (1994) Quantification of brain function, pp. 589–596. Amsterdam: Excerpta Medica.
Horwitz B, Rumsey JM, Donohue BC (1998) Functional connectivity of the angular gyrus in normal reading and dyslexia. Proc Natl Acad Sci USA 95:8939–8944.
Iwatsubo T, Kuzuhara S, Kanemitsu A, Shimada H, Toyokura Y (1990) Corticofugal projections to the motor nuclei of the brainstem and spinal cord in humans. Neurology 40:309–312.
Jurgens U (1976) Reinforcing concomitants of electrically elicited vocalizations. Exp Brain Res 26:203–214.
Jurgens U (1984) The efferent and afferent connections of the supplementary motor area. Brain Res 300:63–81.
Jurgens U (2002) Neural pathways underlying vocal control. Neurosci Biobehav Rev 26:235–258.
Jurgens U, Lu CL (1993) The effects of periaqueductally injected transmitter antagonists on forebrain-elicited vocalization in the squirrel monkey. Eur J Neurosci 5:735–741.
Jurgens U, Richter K (1986) Glutamate-induced vocalization in the squirrel monkey. Brain Res 373:349–358.
Jurgens U, Zwirner P (1996) The role of periaqueductal grey in limbic and neocortical vocal fold control. Neuroreport 7:2921–2923.
Kimura D, Archibald Y (1974) Motor functions of the left hemisphere. Brain 97:337–350.
Kuypers HGJM (1958) Corticobulbar connections to the pons and lower brain-stem in man. Brain 81:364–388.
Larson CR (1991) Activity of PAG neurons during conditioned vocalization in the macaque monkey. In: The midbrain periaqueductal gray matter (Depaulis A, Bandler R, eds), pp. 23–40. New York: Plenum Press.
Larson CR, Kistler MK (1986) The relationship of periaqueductal gray neurons to vocalization and laryngeal EMG in the behaving monkey. Exp Brain Res 63:596–606.
Lechtenberg R, Gilman S (1978) Speech disorders in cerebellar disease. Ann Neurol 3:285–290.
Magoun HW, Atlas D, Ingersoll EH, Ranson SW (1937) Associated facial, vocal and respiratory components of emotional expression: an experimental study. J Neurol Psychopathol 17:241–255.
Marini G, Pianca L, Tredici G (1999) Descending projections arising from the parafascicular nucleus in rats: trajectory of fibers, projection pattern and mapping of terminations. Somatosens Mot Res 16:207–222.
Martin GF, Beattie MS, Hughes HC, Linauts M, Panneton M (1977) The organization of reticulo-olivo-cerebellar circuits in the North American opossum. Brain Res 137:253–266.
McGuire PK, Silbersweig DA, Frith CD (1996) Functional neuroanatomy of verbal self-monitoring. Brain 119:907–917.
Monoson P, Zemlin WR (1984) Quantitative study of whisper. Folia Phoniatr 36:53–65.
Moran A, Avendano C, Reinoso-Suarez F (1982) Thalamic afferents to the motor cortex in the cat. A horseradish peroxidase study. Neurosci Lett 33:229–233.
Muller-Preuss P, Jurgens U (1976) Projections from the ‘cingular’ vocalization area in the squirrel monkey. Brain Res 103:29–43.
Muller-Preuss P, Newman JD, Jurgens U (1980) Anatomical and physiological evidence for a relationship between the ‘cingular’ vocalization area and the auditory cortex in the squirrel monkey. Brain Res 202:307–315.
Murphy K, Corfield DR, Guz A, Fink GR, Wise RJS, Harrison J, Adams L (1997) Cerebral areas associated with motor control of speech in humans. J Appl Physiol 83:1438–1447.
Numminen J, Curio G (1999) Differential effects of overt, covert and replayed speech on vowel-evoked responses of the human auditory cortex. Neurosci Lett 272:29–32.
Numminen J, Salmelin R, Hari R (1999) Subject's own speech reduces reactivity of the human auditory cortex. Neurosci Lett 265:119–122.
Ojemann G, Ojemann J, Lettich E, Berger M (1989) Cortical language localization in left, dominant hemisphere: an electrical stimulation mapping investigation in 117 patients. J Neurosurg 71:316–326.
Parent A, Hazrati LN (1995) Functional anatomy of the basal ganglia: the cortico-basal ganglia-thalamo-cortical loop. Brain Res Rev 20:91–127.
Paus T, Petrides M, Evans AC, Meyer E (1993) Role of the human anterior cingulate cortex in the control of oculomotor, manual, and speech responses: a positron emission tomography study. J Neurophysiol 70:453–469.
Penhune VB, Zatorre RJ, Evans AC (1998) Cerebellar contributions to motor timing: a PET study of auditory and visual rhythm reproduction. J Cogn Neurosci 10:752–765.
Picard N, Strick PL (1996) Motor areas of the medial wall: a review of their location and functional activation. Cereb Cortex 6:342–353.
Petersen SE, Fox PT, Posner MI, Mintun M, Raichle ME (1988) Positron emission tomographic studies of the cortical anatomy of single-word processing. Nature 331:585–589.
Price CJ (1998) The functional anatomy of word comprehension and production. Trends Cogn Sci 2:281–288.
Ramsay SC, Adams L, Murphy K, Corfield DR, Grootoonk S, Bailey DL, Frackowiak RSJ, Guz A (1993) Regional cerebral blood flow during volitional expiration in man. A comparison with volitional inspiration. J Physiol 461:85–101.
Roland PE, Zilles K (1996) Functions and structures of the motor cortices in humans. Curr Opin Neurobiol 6:773–781.
Romanski LM, Tian B, Fritz J, Mishkin M, Goldman-Rakic PS, Rauschecker JP (1999) Dual streams of auditory afferents target multiple domains in the primate prefrontal cortex. Nat Neurosci 2:1131–1136.
Scott SK, Blank CC, Rosen S, Wise RJ (2000) Identification of a pathway for intelligible speech in the left temporal lobe. Brain 123:2400–2406.
Shiba K, Umezaki T, Zheng Y, Miller AD (1997) Fictive vocalization in the cat. Exp Brain Res 115:513–519.
Smejkal V, Druga R, Tintera J (2000) Brain activation during volitional control of breathing. Physiol Res 49:659–663.
Solomon NP, McCall GN, Trosset MW, Gray WC (1989) Laryngeal configuration and constriction during two types of whispering. J Speech Hear Res 32:161–174.
Talairach J, Tournoux P (1988) Co-planar stereotaxic atlas of the human brain. Stuttgart: Thieme Verlag.
Vanderhorst VG, Terasawa E, Ralston HJ 3rd, Holstege G (2000) Monosynaptic projections from the lateral periaqueductal gray to the nucleus retroambiguus in the rhesus monkey: implications for vocalization and reproductive behavior. J Comp Neurol 424:251–268.
Vogt BA, Barbas H (1988) Structure and connections of the cingulate vocalization region in the rhesus monkey. In: The physiological control of mammalian vocalization (Newman JD, ed.), pp. 203–225. New York: Plenum Press.
Watt CB, Mihailoff GA (1983) The cerebellopontine system in the rat: autoradiographic studies. J Comp Neurol 215:312–320.
Wise RJS, Scott SK, Blank SC, Mummery CJ, Murphy K, Warburton EA (2001) Separate neural subsystems within ‘Wernicke's area’. Brain 124:83–95.
Wise RJS, Greene J, Büchel C, Scott SK (1999) Brain regions involved in articulation. Lancet 353:1057–1061.
Zhang SP, Bandler R, Davis PJ (1995) Brain stem integration of vocalization: role of the nucleus retroambigualis. J Neurophysiol 74:2500–2512.

## Author notes

¹Department of Speech and Hearing Science, The George Washington University, USA, ²Language Section, Voice, Speech and Language Branch, NIDCD, NIH, Bethesda, MD 2089, USA and ³Laryngeal and Speech Section, Medical Neurology Branch, NINDS, NIH, Bethesda, MD 2089, USA