Genome-wide association studies from spoken phenotypic descriptions: a proof of concept from maize field studies

Abstract We present a novel approach to genome-wide association studies (GWAS) by leveraging unstructured, spoken phenotypic descriptions to identify genomic regions associated with maize traits. Utilizing the Wisconsin Diversity panel, we collected spoken descriptions of Zea mays ssp. mays traits, converting these qualitative observations into quantitative data amenable to GWAS analysis. First, we determined that visually striking phenotypes could be detected from unstructured spoken phenotypic descriptions. Next, we developed two methods to process the same descriptions to derive the trait plant height, a well-characterized phenotypic feature in maize: (1) a semantic similarity metric that assigns a score based on the resemblance of each observation to the concept of ‘tallness’ and (2) a manual scoring system that categorizes and assigns values to phrases related to plant height. Our analysis successfully corroborated known genomic associations and uncovered novel candidate genes potentially linked to plant height. Some of these genes are associated with gene ontology terms that suggest a plausible involvement in determining plant stature. This proof-of-concept demonstrates the viability of spoken phenotypic descriptions in GWAS and introduces a scalable framework for incorporating unstructured language data into genetic association studies. This methodology has the potential not only to enrich the phenotypic data used in GWAS and to enhance the discovery of genetic elements linked to complex traits but also to expand the repertoire of phenotype data collection methods available for use in the field environment.


Introduction
Collecting phenotype data can be slow, which limits the speed of association genetics and genomic studies for trait improvement. High-throughput phenotyping methods are an area of development that concentrates on engineering sensors and unmanned vehicles to collect (mainly visual) data about traits of various crop species (reviewed in Yang et al. 2020). These methods are beneficial for collecting large amounts of data in an automated fashion, but there are difficulties in deploying these tools in a field environment, and some traits are not detectable by images alone. Additionally, manually collecting phenotypes with pen-and-paper or tablet-and-stylus is time-consuming and generally requires predefined traits of interest. Sensors, imaging, and barcodes make data organization easier for large quantities of data (Kazic 2020; Yao et al. 2021; Sarić et al. 2022). An underdeveloped area of in-field phenotyping ripe for exploration is using natural language descriptions of plants. Platforms exist where audio descriptions are recorded (Kazic 2020). However, the biologically relevant data in spoken phenotypes thus far has remained inaccessible for association studies and other applications.
As demonstrated by Oellrich et al. (2015), language-based descriptions of phenotypes can be analyzed computationally to recover known biological associations in plants. Their work used phenotype descriptions derived from various ontologies, then structured the data as entity quality (EQ) statements (where an entity is a feature, e.g. whole plant, and quality is a describer, e.g. dwarf-like), which results in less intensive computations (Mungall et al. 2010). Semantic or word-meaning similarity methods have also shown promise in ascertaining biologically meaningful genetic associations without the burden of manual curation (Braun and Lawrence-Dill 2020; Braun et al. 2020). Additionally, pretrained models have enabled free-text descriptions of plant phenotypes for association studies based on semantics (Braun et al. 2021). The developments in the computational processing of natural language plant phenotypes through unstructured text formats contribute to conceptualizing methods for recording spoken descriptions of phenotypes.
Given the various difficulties associated with collecting structured phenotype data for trait analyses alongside recent advances in semantic reasoning by computer systems, we were curious to find out whether computing on unstructured descriptions of phenotype might now be tractable for biological inferences in a field setting. Imagine collecting phenotype data simply by walking through a field and using spoken words and vocabularies to

Materials and methods
We used a genotypic dataset that includes WiDiv panel taxa (lines): a dataset of 18 million SNP markers (Mural et al. 2022a) obtained from aggregating RNA-Seq and resequencing techniques for 1,051 taxa (described in Mural et al. 2022b). Phenotypic datasets (described in Yanarella et al. 2024) include audio recordings as well as manual measurements of plant height for 686 unique WiDiv panel taxa (Yanarella et al. 2023a), hereafter referred to as the Yanarella et al. dataset. This dataset contains an additional 25 taxa (Supplementary Table 1) that were positive controls for the analysis of spoken descriptions of phenotypes, as these genetic lines were expected to have noticeable and readily describable phenotypes. These positive control taxa are not members of the WiDiv panel. Informed consent for the spoken data from participants was collected per Iowa State University's Institutional Review Board's (IRB) Exempt Project status, and volunteers provided informed consent for using their spoken observations. Phenotypic data include measurements for plant height and spoken descriptions of plants grown in a field environment. The Mural et al. and Yanarella et al. datasets share 653 taxa (Fig. 1), which were used for further analyses described here.

Spoken phenotype collection summary
We reported on phenotype descriptions recorded by deidentified student workers as a peer-reviewed dataset elsewhere (Yanarella et al. 2024). We refer to the Yanarella et al. dataset throughout, and detailed information on data collection and dataset formats can be accessed via Yanarella et al. (2024). These data and the methods used for their collection are reviewed in brief here for context.
Unstructured spoken descriptions were exclusively captured in the field environment. The field in which the recordings were taken included two replicates planted in a randomized incomplete block design. The first block consisted of 31 WiDiv panel taxa for a seed increase, and the second block consisted of 8 B73 experimental control rows, 25 positive control taxa, and 655 unique WiDiv panel taxa. The second block contained two rows of the WiDiv panel line MEF156-55-2. Therefore, the recordings were taken over 720 rows in each replicate. Manually collected phenotypes included plant height as well as ear and tassel features. Phenotype data were collected during July and August of 2021, a time when plants had reached maturity.
Each of the deidentified student workers selected NATO code names. The students, who were undergraduate Agronomy, Biology, and Genetics majors at Iowa State University, are known as "Delta", "Golf", "Kilo", "Lima", "Mike", "Quebec", "Victor", "Yankee", and "Zulu". Each participant was instructed to state their NATO code name and the row tag number before observing the plants in each row (Fig. 2a). This procedure ensured the participant's deidentified connection to the row number and spoken observation while enabling the parsing of each observation so that multiple row observations could be recorded in the same file (data available from Yanarella et al. 2023a, and described in Yanarella et al. 2024).
Participants were asked to comment on overall appearance, including color and height, tassel, ear, leaves, brace roots, and any disease. Two examples of how a given row might be described are as follows. "Alpha, 1,641. Plants are flowering, there's a lot of variability in height, tassels have emerged, the anthers have everted, brace roots at two nodes, some braceroots point upward". Another: "Alpha, 2,287. Plants are tall. They have brace roots at up to three nodes. Those brace roots are chunky. They have red to purple anthocyanins in rings. Leaves are very much upright. Silks are yellow, tassels are yellow". Because many individuals passed each row many times, each row has a diversity of descriptions in the full dataset.
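The row-level parsing that this recording protocol enables can be sketched as follows. This is a minimal illustration and not the pipeline used for the Yanarella et al. dataset: the regular expression, the `split_observations` helper, and the assumption that each observation opens with a capitalized code name, a comma, and a row tag number are all hypothetical.

```python
import re

# Hypothetical header pattern: a capitalized code name, a comma, then a row
# tag number that may contain a thousands separator (e.g. "Alpha, 1,641.").
HEADER = re.compile(r"\b([A-Z][a-z]+), (\d{1,3}(?:,\d{3})*|\d+)\.\s*")

def split_observations(transcript):
    """Split one transcript into (code_name, row_number, text) tuples."""
    matches = list(HEADER.finditer(transcript))
    observations = []
    for i, match in enumerate(matches):
        # The observation text runs until the next header or the end of file.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(transcript)
        text = transcript[match.end():end].strip()
        row_number = int(match.group(2).replace(",", ""))
        observations.append((match.group(1), row_number, text))
    return observations

transcript = (
    "Alpha, 1,641. Plants are flowering, tassels have emerged. "
    "Alpha, 2,287. Plants are tall. Silks are yellow, tassels are yellow."
)
for code_name, row_number, text in split_observations(transcript):
    print(code_name, row_number, text)
```

Real transcripts from speech-to-text tools are unlikely to be punctuated this consistently, so a production parser would need to tolerate variation in how code names and row numbers are transcribed.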

Phenotype detection from spoken descriptions of positive control lines
A subset of spoken observation transcripts collected in the field containing the positive control accessions was parsed (Fig. 2b). Four to six terms from the description and phenotype records for each accession were drawn from MaizeGDB (Woodhouse et al. 2021). One term from these lists was used to collect synonyms from Merriam-Webster (Merriam-Webster 2023) and WordHippo (Kat IP Pty Ltd 2008) thesaurus services (Table 1 and  . The number of rows containing at least one synonym related to each accession's descriptions and phenotype records was calculated as a proportion of the number of observations for that accession.
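The proportion described above can be illustrated with a short sketch. The transcripts and the synonym list below are invented for illustration, and the whole-word matching rule in `proportion_with_synonym` is an assumption rather than the matching procedure used in this study.

```python
import re

def proportion_with_synonym(observations, synonyms):
    """Fraction of observations containing at least one synonym (whole-word match)."""
    patterns = [re.compile(r"\b" + re.escape(s.lower()) + r"\b") for s in synonyms]
    hits = sum(
        1
        for text in observations
        if any(p.search(text.lower()) for p in patterns)
    )
    return hits / len(observations) if observations else 0.0

# Hypothetical observations for one accession and a hypothetical synonym list.
observations = [
    "plants are purple all the way up the stalk",
    "dark violet leaves and red tassels",
    "plants are tall and green",
    "leaves are upright",
]
synonyms = ["purple", "violet", "magenta"]

print(proportion_with_synonym(observations, synonyms))  # 0.5
```

A proportion of 1.000 under this scheme would mean that every observation of the accession used at least one term from the synonym list.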

Preprocessing phenotypic datasets
Plant height measurement data
R scripts (v.4.2.2 and v.4.3.1) (R Core Team 2023) were developed to process the measuring and scoring data from the Yanarella et al. dataset such that only plant height observations were retained for each of the three observation sets (each row in the field was measured and scored by two teams made up of participants, and one observation set was collected by a volunteer). Replicate number was programmatically added to these data, and positive controls were removed. The 653 taxa shared between the datasets were retained. Best Linear Unbiased Estimator (BLUE) values were calculated for each taxon using R's built-in lm function to perform linear regressions and the emmeans v.1.8.7 package (Lenth 2023), where taxa and replicate were fit as fixed effects. Best Linear Unbiased Prediction (BLUP) corrected mean values were calculated for each taxon with the lmer function of the lme4 v.1.1-34 package (Bates et al. 2015), where taxa, replicate, and row number were fit as random effects. Diagnostic plots of the models (Supplementary Figs. 1, 2) were generated using the ggResidpanel v.0.3.0 package (Goode and Rey 2022).
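To make the BLUE step concrete, the sketch below fits an additive fixed-effects model (taxon plus replicate) by ordinary least squares and then averages each taxon's prediction over replicate levels with equal weight, which is our understanding of what lm followed by emmeans computes by default. It is written in plain Python for illustration only and is not the R code used here; the `solve` and `taxon_blues` helpers and the toy heights are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def taxon_blues(records):
    """records: (taxon, replicate, value) tuples -> {taxon: adjusted mean}."""
    taxa = sorted({t for t, _, _ in records})
    reps = sorted({r for _, r, _ in records})
    n_cols = len(taxa) + len(reps) - 1  # intercept + taxa[1:] + reps[1:]

    def design_row(taxon, rep):
        x = [0.0] * n_cols
        x[0] = 1.0  # intercept (reference taxon, reference replicate)
        ti, ri = taxa.index(taxon), reps.index(rep)
        if ti > 0:
            x[ti] = 1.0
        if ri > 0:
            x[len(taxa) - 1 + ri] = 1.0
        return x

    # Accumulate the normal equations X'X beta = X'y.
    xtx = [[0.0] * n_cols for _ in range(n_cols)]
    xty = [0.0] * n_cols
    for taxon, rep, value in records:
        x = design_row(taxon, rep)
        for i in range(n_cols):
            xty[i] += x[i] * value
            for j in range(n_cols):
                xtx[i][j] += x[i] * x[j]
    beta = solve(xtx, xty)

    # Average replicate effects with equal weight per replicate level.
    rep_effects = [0.0] + beta[len(taxa):]
    mean_rep = sum(rep_effects) / len(reps)
    return {
        t: beta[0] + (beta[i] if i > 0 else 0.0) + mean_rep
        for i, t in enumerate(taxa)
    }

# Toy, unbalanced data (cm): taxon "B" is missing from replicate 2.
records = [("A", 1, 200.0), ("A", 2, 210.0), ("B", 1, 150.0)]
blues = taxon_blues(records)
print(blues)
```

In this toy case the raw mean for taxon "B" is 150, while its adjusted mean is about 155 because the model accounts for the replicate in which "B" was not observed.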

Semantic similarity for plant height spoken data
Text transcripts of spoken data from the Yanarella et al. dataset were processed using Python v.3.8.2 (Van Rossum and Drake 2009). The spaCy v.3.5.1 package (Honnibal and Montani 2023) and the TensorFlow v.2.12.0 (Abadi et al. 2016) spaCy Universal Sentence Encoder v.0.4.6 (Mensio 2023) were used to process the transcripts to obtain semantic similarity scores. Three phrases, "tall", "tall plant", and "tall height", were compared to each row observation through spaCy's similarity function using the pretrained large English universal sentence encoder (en_use_lg) from TensorFlow. A dataset of similarity scores in the form of values from 0 to 1 was generated (Fig. 2b). The spaCy Universal Sentence Encoder and the similarity function we used score semantic similarity from 0 to 1 based on the phrase used to compare the observations. Therefore, we chose to use all tall terms as the query (rather than also focusing on phrases for short stature) so that observations for tall plants would be closer to a score of 1 (observations meaning the opposite, i.e. small plants, would be scored closer to 0). If the word short were substituted for tall as the query, then smaller plants would be scored closer to 1 and tall plants closer to 0.
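The idea behind scoring each observation against a query can be illustrated with a cosine similarity between vectors, assuming the similarity function behaves like a cosine similarity between sentence embeddings. The three-dimensional vectors below are invented stand-ins for real encoder output (which has far more dimensions), and `cosine_similarity` is a hypothetical helper rather than the spaCy call used in this study.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy "embeddings": in practice these would come from a sentence encoder.
query_tall = [1.0, 0.0, 0.0]
observation_vectors = {
    "plants are very tall": [0.9, 0.1, 0.0],
    "leaves are green": [0.1, 0.9, 0.2],
}

scores = {
    text: cosine_similarity(query_tall, vector)
    for text, vector in observation_vectors.items()
}
for text, score in scores.items():
    print(f"{text}: {score:.3f}")
```

An observation whose vector points in nearly the same direction as the query scores close to 1, which is why the choice of query ("tall" rather than "short") determines the direction of the resulting phenotype scale.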
Similarity scores for the 653 taxa shared by both datasets were retained, encompassing 35,709 rows or 91.92% of the original rows observed (Table 2). The similarity scores for the "tall" query were used as input to calculate BLUEs and BLUPs in the same manner as described in the Plant Height Measurement Data section, and diagnostic plots of the models (Supplementary Figs. 3, 4) were generated.

Binning for plant height spoken data
The transcripts of spoken plant descriptions were mined manually and programmatically for phrases directly related to narrations about plant height. A set of 797 plant height phrases was identified, manually curated, and binned from 0 to 7, where 0: no growth, 1: very short plants, 2: short plants, 3: short-medium height plants, 4: medium height plants, 5: medium-tall height plants, 6: tall plants, 7: very tall plants. Bin values were assigned to observations for the 653 taxa shared by both datasets (Fig. 2b). Of the total text transcripts, there were 34,209 or 88.06% row observations that were retained and binned (Table 2).
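A phrase-to-bin lookup of the kind described above might look like the following sketch. The handful of phrases in `HEIGHT_BINS` is a hypothetical stand-in for the 797 curated phrases, and returning `None` when no phrase matches is our assumption about how unscored observations could be handled.

```python
import re

# Hypothetical subset of curated plant height phrases mapped to bins 0-7.
HEIGHT_BINS = {
    "no growth": 0,
    "plants are very short": 1,
    "plants are short": 2,
    "plants are medium height": 4,
    "plants are tall": 6,
    "plants are very tall": 7,
}

def bin_observation(text):
    """Return the bin of the first curated phrase found, or None if absent."""
    lowered = text.lower()
    for phrase, bin_value in HEIGHT_BINS.items():
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            return bin_value
    return None

print(bin_observation("Plants are tall. They have brace roots at three nodes."))
print(bin_observation("Brace roots are short, fat, and light green."))
```

Because only whole plant height phrases are matched, the second observation is not scored even though it contains the word "short", which mirrors the noise reduction (and the data loss) attributed to the binning method.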
BLUEs and BLUPs were calculated as described in the Plant Height Measurement Data section, and diagnostic plots of the models (Supplementary Figs. 5, 6) were generated. Additionally, the R nnet v.7.3-19 package (Venables and Ripley 2002) was used to perform multinomial logistic regression to predict height phrase bins, where taxa and replicate were fit as fixed effects.

Preprocessing the genotypic dataset
To ease the subsetting step to obtain the data for the 653 taxa that overlap between the Mural and Yanarella datasets, Trait Analysis by aSSociation, Evolution and Linkage (TASSEL) Version 5.0 Standalone (Bradbury et al. 2007) was used to convert the Mural et al. genotypic data from a variant call format (vcf) formatted file to a HapMap (hmp) formatted file. The data were then processed to contain marker information for the 653 taxa shared with the Yanarella et al. dataset; these data were grouped by chromosome (because the datasets were so large as to exceed the limits of the HPC memory we were able to utilize), and HapMap files were generated for each chromosome. The chromosome files were sorted by marker position from lowest position to highest position.
The sorted chromosome files were reformatted to vcf files using TASSEL Version 5.0 Standalone, then vcftools v.0.1.14 (Danecek et al. 2011) concatenated these files, and the resulting file was zipped. The concatenated data (a subset of 653 lines) were then again converted to a VCF format for use with PopLDdecay v3.42 (Zhang et al. 2018).
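The subsetting, grouping, and sorting steps can be sketched on in-memory rows as below. This assumes the usual HapMap layout of 11 metadata columns followed by one column per taxon; the `subset_and_sort` and `marker` helpers, the marker names, and the genotype calls are hypothetical, and real files of this size would need to be streamed rather than held in memory.

```python
from collections import defaultdict

N_META = 11  # HapMap metadata columns: rs#, alleles, chrom, pos, strand, ...

def subset_and_sort(header, rows, shared_taxa):
    """Keep metadata plus shared taxa columns, group by chromosome, sort by position."""
    keep = list(range(N_META)) + [
        i for i, name in enumerate(header) if i >= N_META and name in shared_taxa
    ]
    chrom_i, pos_i = header.index("chrom"), header.index("pos")
    by_chrom = defaultdict(list)
    for row in rows:
        by_chrom[row[chrom_i]].append([row[i] for i in keep])
    for chrom_rows in by_chrom.values():
        # Metadata columns keep their original indices, so pos_i is still valid.
        chrom_rows.sort(key=lambda r: int(r[pos_i]))
    return [header[i] for i in keep], dict(by_chrom)

def marker(name, chrom, pos, calls):
    """Build a toy HapMap row with placeholder metadata."""
    return [name, "A/G", chrom, str(pos)] + ["NA"] * 7 + calls

header = ["rs#", "alleles", "chrom", "pos", "strand", "assembly#", "center",
          "protLSID", "assayLSID", "panelLSID", "QCcode", "B73", "Mo17", "W22"]
rows = [
    marker("s3", "1", 300, ["AA", "GG", "AA"]),
    marker("s1", "1", 100, ["AA", "AA", "GG"]),
    marker("s2", "2", 50, ["GG", "GG", "AA"]),
]
new_header, by_chrom = subset_and_sort(header, rows, {"B73", "W22"})
print(new_header[N_META:])
print([r[0] for r in by_chrom["1"]])
```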

GWAS and analysis
Genome Association and Prediction Integrated Tool (GAPIT) 3 v.3.1, 2022.4.16 (Lipka et al. 2012; Tang et al. 2016; Wang and Zhang 2021) was used to perform Fixed and random model Circulating Probability Unification (FarmCPU) (Liu et al. 2016) and Mixed Linear Model (MLM) (Yu et al. 2005) on each of the phenotypic datasets using the Mural et al. marker dataset for genotypic input. Each chromosome was run individually, and the PCA.total parameter was set to 3 for all analyses (Fig. 2c). This manuscript focuses on the FarmCPU analyses; the MLM processing and results are available as described in the Data availability section.
Manhattan plots from the resulting GAPIT analyses were generated using the ggplot2 v.3.4.3 package (Wickham 2016). We used the RAINBOWR v.0.1.29 package's (Hamazaki and Iwata 2020) CalcThreshold to determine the Bonferroni threshold for each analysis with a sig.level of 0.05. SNPs that were identified as above the
The count of row observations utilized for binning for spoken data methods represents the data for the 653 intersecting taxa between Yanarella et al. and Mural et al., and the data in parentheses are the proportion of data retained from total rows observed (left) and the proportion of data retained from the 653 intersecting taxa which have plant height terms (right).
We collected a list of genes shown to influence plant height (Table 3, Supplementary Table 2) and compared the gene IDs from the GWAS analyses using a web-based intersection and Venn diagram tool (Sterck 2021) to determine if these previously published plant height genes were identified within the ±300 kb region indicated by the LD decay curve for each of our analyses. Additionally, gene ontology (GO) terms for the gene IDs within ±300 kb of SNPs identified as significant were obtained using the B73 RefGen_V4 Zm00001d.2 annotations generated by the Maize GO Annotation-Methods, Evaluation, and Review (maize-GAMER) tool (Wimalanathan and Lawrence-Dill 2017; Wimalanathan et al. 2018), and the R package GO.db v.3.17.0 (Carlson 2023) was used to collect terms associated with the GO IDs (Supplementary Table 3). Where no GO terms were associated with genes or gene models, or those associated were not clearly related to the plant height trait, knowledge of gene function via associated Arabidopsis thaliana orthologs listed in MaizeGDB and linked to TAIR was reviewed for possible insight (Andorf et al. 2016; Reiser et al. 2022).
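The ±300 kb lookup around a significant SNP amounts to an interval overlap test, sketched below. The gene records use invented identifiers and coordinates, and treating a gene as "within" the window when any part of it overlaps is an assumption about how the window was applied.

```python
WINDOW_BP = 300_000  # half-width of the region around each significant SNP

def genes_near_snp(snp_chrom, snp_pos, genes, window=WINDOW_BP):
    """Return IDs of genes on the SNP's chromosome overlapping pos +/- window."""
    low, high = snp_pos - window, snp_pos + window
    return [
        gene["id"]
        for gene in genes
        if gene["chrom"] == snp_chrom and gene["start"] <= high and gene["end"] >= low
    ]

# Hypothetical gene models (IDs and coordinates are invented for illustration).
genes = [
    {"id": "gene_a", "chrom": "8", "start": 1_000_000, "end": 1_004_000},
    {"id": "gene_b", "chrom": "8", "start": 1_450_000, "end": 1_452_000},
    {"id": "gene_c", "chrom": "3", "start": 1_100_000, "end": 1_101_000},
]

print(genes_near_snp("8", 1_200_000, genes))  # ['gene_a', 'gene_b']
```

The resulting ID lists could then be intersected with a curated set of known plant height genes, which is the comparison the Venn diagram tool performs here.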

Results and discussion
Student participants made three complete passes of the field (∼4,320 observations per student participant; Table 2, total rows observed count) and used their individual wording and phraseology to describe phenotypes. Recorded observations varied from 2 to 241 words in length (Fig. 3). From this dataset (described fully in Yanarella et al. 2024), all analyses and results reported here are derived.

Visually striking phenotypes are detected from unstructured spoken descriptions
We reasoned generally that including visually observable phenotypes with known phenotype-gene associations would be a first step toward understanding whether individuals without specific training on describing plant phenotypes could recover known phenotype-gene associations. To explore student participants' ability to identify and describe phenotypes, 25 positive control accessions that, if grown in the appropriate environmental conditions, would show visually "dramatic" phenotypes were included in the field.
The 25 positive control accessions were observed 53-55 times over all nine student participants (Table 1). We utilized Merriam-Webster (2023) and WordHippo (Kat IP Pty Ltd 2008) thesaurus services to determine participants' ability to identify words synonymous with descriptors that characterize positive control phenotypes as demonstrated in Fig. 4a-c. For example, accessions M241C A1 A2 B1 C1 C2 Pl1 Pr1 R1-r and 219L B1-S; R1-r pl1-McClintock (gene names colored1 and colored plant1, Supplementary Table 1) had at least one synonym in each of the observations made by the student participants as indicated in Table 1 by the proportion of 1.000 for both Merriam-Webster and WordHippo synonyms. In contrast, accessions U740G Fbr1-N1602 (gene name few branched1), 703J Rs1-O 1 and 703K Rs1-Z (gene name rough sheath1, Supplementary Table 1) had low proportions of observations having at least one synonym, as indicated in Table 1.
Our findings indicate that nonexpert participants, unaware of expected phenotypes, can identify and describe what they observe using terms that could be valuable for genomic association studies. These findings are, however, constrained by the number of synonyms used for phenotype descriptions and the specific environmental conditions of the plant growth. The reliance on thesaurus services could also potentially reduce the proportion of observations with at least one synonym, particularly if participants opted for informal descriptions. Furthermore, accurate descriptions of our intended positive control phenotypes were contingent upon favorable field environment and weather conditions.

Plant height assessments can be derived from unstructured spoken descriptions
Given that visually apparent phenotypes were described successfully by nonexpert participants for control lines grown in the field, we moved on to assessing whether plant height, a continuous and subjective trait from "short" to "tall", could be derived from the WiDiv population, a large association mapping panel.
Two methods were employed to preprocess text transcriptions of spoken descriptions of plants. The semantic similarity method of comparing the term "tall" to each row observation retained 91.92% of the full set of row recordings captured (including the 25 positive controls and 33 accessions unique to the Yanarella et al. dataset) by the student participants, demonstrating that 35,709 row observations were made with taxa in both datasets (Table 2). The manual bin method of identifying phrases related to plant height and binning them based on apparent semantic similarity retained 88.06% of the full set of row recordings captured by the student participants and 95.80% of the observations with plant height phrases made with taxa in both datasets, which results in 34,209 row observations for manually binned data (Table 2).
Both methods parse information about the plant height trait and process the data into a format appropriate as input into available GWAS tools and models. Using a query term for plant height and semantic similarity requires less manual curation and was implemented on a larger subset of data. The benefit of the binning method is that it reduces the noise (where noise in the context of natural language processing refers to additional information not directly related to the particular topic of interest). Only observations with plant height-related terms were considered.
A limitation of the query term and semantic similarity method is retaining noisy data because this method compares the "tall" query to each observation and relies on pretrained models. An example of noise comes from the participant with the NATO code name "Victor". Their recording for row 1,456 on July 16, 2021, was "tall and height green all the way to the bottom … super short hairs on top that are quite prickly … in general there are the brace roots are short fat and light green in color", for which the similarity function of the spaCy Universal Sentence Encoder (Mensio 2023), when compared to the "tall" query string, determined the semantic similarity score to be 0.0848. The shortcomings of the binning method are the time-consuming nature of curating lists of phrases relevant to the trait of interest and the loss of data where phrases directly related to plant height were not specified.

Genomic associations for plant height can be derived from unstructured spoken descriptions of phenotype
Efforts to assess whether spoken data could be used for GWAS were structured as follows. We collected a dataset of manually measured plant heights, which would serve as ground truth for assessing any language-based findings. We also processed spoken descriptions of lines in two ways (semantic similarity versus a binning method), and carried out GWAS using BLUEs to model the parameters taxa and replicate as fixed effects and BLUPs to model the parameters taxa, replicate, and row number as random effects. Once regions of the genome associated with the trait plant height were derived for manually measured, semantic similarity, and binned datasets, significant regions that included genes known to be involved in plant height were highlighted. Loci identified as significant from both spoken datasets were compared to the manual, ground truth dataset to determine whether the same regions were identified across both data collection methods, and all genes in regions of significant association for the plant height trait were assessed for GO terms associated with plant growth hormones auxin, brassinosteroid, and gibberellin.

Ground truth: manual plant height measurement GWAS
We performed association studies using FarmCPU (Liu et al. 2016) on three categories of phenotype data. The first phenotype category was ground truth (measured) plant height data using BLUEs (Fig. 5a) and BLUPs (Fig. 5b). For the BLUE analysis, 21 significant SNPs above the Bonferroni threshold of 8.55 were identified, and of those SNPs, we discovered 10 (Supplementary Table 3; colored orange) in which at least one plant height gene was detected in the literature within ±300 kb (Supplementary Table 2). The BLUP analysis identified 29 significant SNPs, 9 of which (Supplementary Table 3) had at least one plant height gene reported in the literature within ±300 kb (Table 3, Supplementary Table 2).
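For orientation, a Bonferroni threshold on the −log10(p) scale can be computed as below, assuming the usual definition of the significance level divided by the number of tests. The `bonferroni_threshold` helper is hypothetical, and the marker count of 18 million is taken from the rounded figure given in Materials and methods, so the result only approximates the reported threshold.

```python
import math

def bonferroni_threshold(n_tests, sig_level=0.05):
    """Bonferroni-corrected significance threshold on the -log10(p) scale."""
    return -math.log10(sig_level / n_tests)

# Assumption: roughly 18 million markers tested, per the rounded dataset size.
threshold = bonferroni_threshold(18_000_000)
print(round(threshold, 2))  # 8.56
```

With the rounded count the value lands near the reported 8.55; the small difference is consistent with a tested marker count slightly below 18 million, although the exact number is not stated in this section.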

Semantic similarity for plant height spoken data GWAS
The second phenotype category used BLUEs (Fig. 6a) and BLUPs (Fig. 6b) for the "tall" query and semantic similarity of spoken phenotype descriptions. These analyses identified 27 and 23 significant SNPs (Supplementary Table 3; colored orange), respectively, above the Bonferroni threshold of 8.55. Of these, 9 and 8 genes, respectively, were formerly detected for plant height within ±300 kb of the SNP (Table 3, Supplementary Table 2).
The semantic similarity method with the "tall" query to generate phenotype values for each spoken observation using spaCy's similarity function (Honnibal and Montani 2023) and the Universal Sentence Encoder (Mensio 2023) was successful. We were able to perform GWAS, and there were significant SNPs from regions associated with plant height. As this is a proof-of-concept study, we acknowledge that other pretrained models exist capable of calculating semantic similarity or models that can be adapted to generate similarity scores related to plant height such as BioBERT (Lee et al. 2019), those implemented by the Python gensim package (Řehůřek and Sojka 2010), or others reviewed in Koroleva et al. (2019). Further, additional queries could be employed for relating the text observations to a height value.

Binning for plant height spoken data GWAS
The third phenotype category used BLUEs (Fig. 7a) and BLUPs (Fig. 7b) for manual binning of spoken phenotype descriptions with plant height terms. These analyses identified 32 and 33 significant SNPs, respectively (Supplementary Table 3; colored orange), above the Bonferroni threshold of 8.55. Of these, 13 and 12 have genes formerly reported within ±300 kb of the SNP (Table 3, Supplementary Table 2). An additional analysis was completed using predicted values from a multinomial regression (Supplementary Fig. 8, Supplementary Table 2) in which 21 SNPs were significant, and 3 had genes detected within ±300 kb of the SNP in the literature (Table 3, Supplementary Table 2).
The binning method for plant height phrases appears to be a promising method for association studies with phenotype data extracted from spoken descriptions. This method reduces the noisiness of the transcription data and scores observations on only phrases detailing features of plant height. Additionally, the GWAS performed with BLUE and BLUP values generated from the binning method detected more previously reported regions associated with plant height than the manually measured and semantic similarity query methods.

Comparing across manual measurement, semantic similarity, and binning
Within the ±300 kb region indicated by the LD decay curve for each of the analyses, we lined up shared significant regions within BLUEs and BLUPs for each of the three studies (i.e. manual measurement, semantic similarity, and binning). These comparisons are shown in Supplementary Table 4. Within the manual measurements, 11 loci identified are shared across BLUEs and BLUPs. For semantic similarity, 12 shared regions are identified. For binning, 30 shared regions are identified.
Of particular interest are comparisons between manual measurement and the two methods derived from unstructured, spoken language descriptions of phenotype (i.e. semantic similarity and binning). Manual measurement and semantic similarity analyses showed no correspondence. However, manual measurement and binning shared four regions for BLUEs, with one of the four regions resulting from the same SNP (as opposed to the other three, where different SNPs within the ±300 kb region are responsible for the correspondence). Four SNPs in that shared group (representing a region on Chromosome 1 and a region on Chromosome 8) contain genes known to be associated with plant height from the literature (marked in orange in Supplementary Table 4). For BLUPs, manual measurement and binning share six regions, with three regions resulting from shared SNPs (and three as a result of different SNPs within the ±300 kb region). Four SNPs in that shared group (representing a region on Chromosome 1 and a region on Chromosome 8) contain genes known to be associated with plant height from the literature (likewise indicated in orange). The significant shared SNPs on chromosome 8 for each group are shared across measured and binned data for both BLUEs and BLUPs.
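The distinction drawn above between regions shared through the same SNP and regions shared through different SNPs can be sketched as a pairwise comparison of significant SNP positions. The positions are invented, the `shared_regions` helper is hypothetical, and defining "shared" as two SNPs on the same chromosome lying within 300 kb of each other is an assumption about how the regions were lined up.

```python
def shared_regions(snps_a, snps_b, window=300_000):
    """Pair SNPs from two analyses that lie within `window` bp on one chromosome."""
    pairs = []
    for chrom_a, pos_a in snps_a:
        for chrom_b, pos_b in snps_b:
            if chrom_a == chrom_b and abs(pos_a - pos_b) <= window:
                kind = "same SNP" if pos_a == pos_b else "different SNPs"
                pairs.append((chrom_a, pos_a, pos_b, kind))
    return pairs

# Hypothetical significant SNPs as (chromosome, position) tuples.
measured = [("1", 5_000_000), ("8", 20_000_000), ("9", 1_000_000)]
binned = [("1", 5_150_000), ("8", 20_000_000), ("3", 7_000_000)]

for pair in shared_regions(measured, binned):
    print(pair)
```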
Of some note are the regions that are not shared between manual measurements and the semantic similarity or binning GWAS outputs. For those significant associations that are only present in the semantic similarity and/or binning outputs, some of these candidate genes for plant height agree with other studies. Others are not documented in other studies, making them truly novel candidate genes identified via association genetics based on spoken phenotypic descriptions.

Example candidate genes in select regions
To give examples of how to review and prioritize candidate genes for the plant height trait in some of the regions identified, we describe here the three regions that were shared between the measured and binned result sets for both BLUEs and BLUPs. These constitute three loci, one each on chromosomes 3, 8, and 9. Shown in Table 4 are these three loci, of which one (on chromosome 8) coincides with a previously published association from GWAS for plant height, and the other two are newly identified through GWAS in this study from spoken phenotypic data collection techniques as well as from the manual scoring procedure.
Table 4 shows the SNPs identified and the number of gene models in the ±300 kb region that contains the SNPs. Where characterized genes are known for those gene models, gene model names as well as gene names and symbols are also listed. GO terms relevant to the plant height trait are shown in bold. For the chromosome 3 region, no gene models in the region had any GO terms obviously related to height, so we also looked up the gene models via MaizeGDB to find out whether any associated Arabidopsis orthologs might have known phenotypes or functions related to plant height.
Promising candidate genes are as follows. For the newly identified chromosome 3 locus, three candidate gene models are: Zm00001d044242, the gene bhlh25, which has been implicated in abscisic acid biosynthesis/regulation (Vendramin et al. 2020; and Arabidopsis ortholog AT3G21330 is a member of a family of genes involved in photoresponsiveness for hypocotyl length; Khanna et al. 2006), Zm00001d044255 (Arabidopsis ortholog AT3G13882 is involved in plant growth rates and flowering; Xu et al. 2023), and Zm00001d044260, the gene c3h2, with c3h genes showing some height-related phenotypic involvement (Fornalé et al. 2015; and Arabidopsis ortholog AT5G56930, also called AtC3H65, is involved in brassinosteroid signaling; Wang et al. 2022). For the chromosome 8 locus previously associated with plant height (Azodi et al. 2020), the loci pmpm7 (also called PMP3-7 in some literature) and iaa34 are the only genes in the region. Both are known to be responsible for plant growth and height phenotypes (Fu et al. 2012; Galli et al. 2015) and are associated with GO terms that are likely relevant. For the newly identified chromosome 9 locus, one candidate gene model looks promising: Zm00001d048461 with associated term GO:0009684, indoleacetic acid biosynthetic process (where indoleacetic acid or IAA is an auxin). Upon further investigation, this gene model represents the gene blue fluorescence1 (bf1), known to affect maize plant stature (reviewed by Khavkin and Coe 1997) but not among the literature of associated plant height genes we had assembled for this work. Note: for all loci identified as being significantly associated with the trait plant height, the information that would enable these same analysis steps can be accessed via Supplementary Tables 3 and 5.

Plant growth hormone functions in genomic regions associated with plant height
To assess relationships to plant growth hormones and identified gene IDs across the whole genome, we reviewed the set of GO terms annotated to gene IDs associated with the regions ±300 kb of significant SNPs. The full list of GO terms for each model is available in Supplementary Tables 3 and 5. To examine how these terms align with plant height terms, we queried the dataset for the words auxin, brassinosteroid, and gibberellin because of their known functions in plant height regulation (Li et al. 2020 reviews the importance of these hormones). The term "auxin" was more frequently present in these datasets when compared to brassinosteroid or gibberellin (Table 5).
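The keyword query over GO term descriptions can be sketched as a simple count. The two Zm00001d gene IDs and their terms are those mentioned in this article; the `gene_x` entry, the `count_hormone_terms` helper, and the case-insensitive substring matching rule are hypothetical additions for illustration.

```python
from collections import Counter

HORMONES = ("auxin", "brassinosteroid", "gibberellin")

def count_hormone_terms(go_terms_by_gene):
    """Count GO term descriptions that mention each hormone keyword."""
    counts = Counter({hormone: 0 for hormone in HORMONES})
    for terms in go_terms_by_gene.values():
        for term in terms:
            lowered = term.lower()
            for hormone in HORMONES:
                if hormone in lowered:
                    counts[hormone] += 1
    return counts

go_terms_by_gene = {
    "Zm00001d008201": [
        "animal organ development",
        "auxin-activated signaling pathway",
        "postembryonic development",
    ],
    "Zm00001d048461": ["indoleacetic acid biosynthetic process"],
    # Hypothetical gene ID used only to exercise the other keywords.
    "gene_x": ["response to gibberellin", "auxin polar transport"],
}

counts = count_hormone_terms(go_terms_by_gene)
print(dict(counts))
```

Note that a plain keyword search misses terms such as "indoleacetic acid biosynthetic process", which describes an auxin without using the word, so counts obtained this way are likely a lower bound.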
Additionally, other GO annotations identified in our analyses have functions that may affect plant height. Examples of these GO terms include developmental growth (GO:0048589), anatomical structure formation involved in morphogenesis (GO:0048646), and shoot system development (GO:0048367). Further, using GWAS, we found genomic regions with predicted functional annotations related to plant hormone functions that were not reported in the literature summarized in Table 3. These regions may be of interest for follow-on experimentation to assess potential involvement in the plant height trait.
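As an illustration of how genes within ±300 kb of a significant SNP might be collected and then compared against a list of previously reported genes, the sketch below uses plain tuples and sets. The gene IDs, coordinates, and the choice to count any overlap with the window as "within" it are all assumptions made for the example; they are not taken from the study.

```python
# Minimal sketch under stated assumptions: genes are (id, chromosome,
# start, end) tuples and a SNP is a (chromosome, position) tuple.
WINDOW_BP = 300_000  # ±300 kb around each significant SNP


def genes_in_window(snp, genes, window=WINDOW_BP):
    """Return IDs of genes on the SNP's chromosome that overlap the window."""
    chrom, pos = snp
    lo, hi = pos - window, pos + window
    return {
        gene_id
        for gene_id, gene_chrom, start, end in genes
        if gene_chrom == chrom and start <= hi and end >= lo
    }


# Hypothetical gene models and coordinates, for illustration only.
genes = [
    ("gene_A", 3, 1_000_000, 1_005_000),
    ("gene_B", 3, 1_400_000, 1_410_000),
    ("gene_C", 8, 1_100_000, 1_102_000),
    ("gene_D", 3, 1_600_000, 1_610_000),
]
significant_snp = (3, 1_200_000)
known_height_genes = {"gene_A"}  # stand-in for a literature-derived list

nearby = genes_in_window(significant_snp, genes)
novel_candidates = nearby - known_height_genes
print(sorted(nearby), sorted(novel_candidates))
```

Genes that fall in the window but are absent from the literature-derived set are the ones that would be carried forward as novel candidates.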
While examining GO terms, we encountered terms annotated to gene IDs whose descriptions do not relate to plant functions. Errant assignments of GO terms for plant-specific tasks have been described in Fattel et al. (2022). An example is the gene ID Zm00001d008201 being assigned the term animal organ development (GO:0048513). Interestingly, this ID was also assigned the terms auxin-activated signaling pathway (GO:0009734) and postembryonic development (GO:0009791). These results make a compelling argument for reviewing GO terms carefully.
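One simple way to surface annotations that deserve manual review is to flag GO term names containing words that are unlikely to apply to plants. The sketch below is only a heuristic under that assumption: the keyword list `SUSPECT_WORDS` and the function name are hypothetical, and the three annotations are the ones named in the text.

```python
# Minimal sketch (assumed heuristic): flag GO annotations whose term
# name contains a word that is unlikely to describe a plant function.
SUSPECT_WORDS = ("animal",)  # hypothetical keyword list; extend as needed


def flag_suspect_annotations(annotations):
    """Return the (gene_id, go_id, term_name) tuples that need review."""
    flagged = []
    for gene_id, go_id, term_name in annotations:
        lowered = term_name.lower()
        if any(word in lowered for word in SUSPECT_WORDS):
            flagged.append((gene_id, go_id, term_name))
    return flagged


annotations = [
    ("Zm00001d008201", "GO:0048513", "animal organ development"),
    ("Zm00001d008201", "GO:0009734", "auxin-activated signaling pathway"),
    ("Zm00001d008201", "GO:0009791", "postembryonic development"),
]
flagged = flag_suspect_annotations(annotations)
print(flagged)
```

A keyword filter like this can only point to candidates for review; it does not replace careful curation of the annotations.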

Additional methods tried and room for improvement
We demonstrate the use of a multinomial regression to generate phenotypic input with binned data, although we recognize that FarmCPU is not optimized for multinomial input. GWAS tools that utilize an ordered multinomial regression model to predict multinomial values for association studies were developed in the medical research field (German et al. 2019). Regardless, FarmCPU with multinomial binned input for plant height detected regions of the genome associated with plant height.
While participant language was not constrained, less noisy input with lower data loss could be attainable if emphasis were placed on stating specific aspects of the plant accompanied by a descriptor and on reducing literary descriptive comments. For example, compare "tall, green, and long" with "this row has tall plants". In the former statement, it is unclear whether the whole plant or a specific aspect of the plant is described, while the latter makes it clearer that the total plant height is described. Literary descriptions are more difficult to compute because context is necessary to determine the meaning behind a phrase such as "candy cane stripe". While candy cane stripe may induce a mental image of a candy cane, unless a computational model is trained to identify the literary description, the model would not be able to discern that the spoken description refers to a particular striped pattern.
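To make the idea of binned input concrete, the sketch below maps height-related phrases in a transcript to ordinal values. The phrase list, the bin values, and the function name `bin_height` are hypothetical and chosen only for illustration; they are not the scoring scheme used in this work.

```python
import re

# Hypothetical phrase-to-bin mapping, for illustration only.
HEIGHT_BINS = {
    "very short": 1,
    "short": 2,
    "average height": 3,
    "tall": 4,
    "very tall": 5,
}

# Try longer phrases first so "very tall" is not scored as plain "tall".
_PATTERNS = [
    (re.compile(r"\b" + re.escape(phrase) + r"\b"), value)
    for phrase, value in sorted(
        HEIGHT_BINS.items(), key=lambda item: len(item[0]), reverse=True
    )
]


def bin_height(transcript):
    """Return the bin for the first matching phrase, or None if none match."""
    text = transcript.lower()
    for pattern, value in _PATTERNS:
        if pattern.search(text):
            return value
    return None  # observation is lost for this trait


print(bin_height("This row has tall plants"))    # 4
print(bin_height("These plants are very tall"))  # 5
print(bin_height("tall, green, and long"))       # 4
print(bin_height("candy cane stripe"))           # None
```

The third call shows the ambiguity discussed above: the phrase is scored as tall even though the transcript does not say which part of the plant is tall. The last call shows how a literary description yields no value and is therefore dropped.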

Conclusion
By translating the linguistic complexity of human speech into data compatible with GWAS tools, we have demonstrated a successful proof of concept that paves the way for further innovation in data collection and analysis. Our 2-fold approach (semantic similarity and manual binning) proved to be robust, capturing known genetic associations with plant height and identifying new regions of interest. Of note, those describing the plants were not aware that plant height would be a focus of the study, and no guidance on what was "tall" or "short" was provided. Nonetheless, known genetic associations with the trait plant height were uncovered. These findings underscore the potential for expanding the scope of phenotypic data collection methods in genetic studies. By confirming that unstructured spoken data can yield quantifiable results for genetic association, we open the door to complementary and diverse data collection techniques that may be more tractable for nonexpert involvement. This could enable the inclusion of nonexperts in data collection and significantly enrich the dataset by bringing in previously overlooked phenotypic nuances.
Beyond these demonstrations that spoken, unstructured phenotypic descriptions can be used to recover known associations, there are two other conceptual benefits that should be considered. Firstly, when people describe what they see in the field rather than exclusively collecting predefined traits, the potential to uncover novel phenomena is perhaps increased. Secondly, for many years we have used computers to analyze structured data, so those collecting the data have limited themselves to documenting data in a structured, computer-friendly format. This is, in effect, asking people to structure their thinking and documentation like, and for, a computer. With the methods described here, the people collecting the data are enabled to think and behave in a more naturally human way for data collection. This has implications for the rate of data collection and for cognitive burden as follows. Over 3 weeks of data collection, each participant made three complete passes of the field, recording spoken observations. However, for manual scoring and measurement data, none were able to make a single complete pass of the field. Our experimental design enabled the student participants to speak and describe plant traits using their unique vocabulary and speech patterns. Participant "Zulu" reported that recording spoken observations was simpler and easier than measuring and scoring: because it was both less strenuous and less mentally taxing, they could make more detailed observations about different parts of the plants. This suggests that data collection through speech may reduce the cognitive load on field researchers.
While image-based data collection derived from unmanned vehicles equipped with sensors for visual phenotype detection continues to improve and is invaluable for certain traits (reviewed in Xiao et al. 2022), it falls short for traits requiring, e.g. tactile or olfactory observations. Coupling spoken and written language-based annotations with image analysis can enhance our understanding of complex phenotypes, harnessing human perception to capture nuanced details that purely image-based approaches might miss. This work demonstrates that such data can be collected in straightforward ways and that phenotypic information is indeed accessible for large-scale genetics and genomics analyses.

Fig. 1. Comparison of intersections of Mural et al. and Yanarella et al. WiDiv dataset taxa (positive controls not included), where n is the number of unique taxa in each dataset. The Mural et al. dataset is shown in blue (n = 1,051) and the Yanarella et al. dataset in orange (n = 686).

Fig. 2. Spoken phenotype process overview. a) In-field collection of spoken phenotype descriptions, b) processing of spoken phenotype data in multiple ways, including transcript production and methods for generating numeric representations of phenotypes for traits, and c) top box: detecting phenotypes from positive control accession descriptions. Lower two boxes: GWAS using data derived from spoken observations.
to analyze and visualize Linkage Disequilibrium (LD) Decay of the Mural et al. genotypic data.

Fig. 3. Distributions of each student participant's word count per observation. The x-axis shows the number of words per individual row observation. The y-axis indicates each participant's NATO name. Boxplots are included within each of the violin plots; individual outlier points are not shown.

Fig. 4. Example of detecting phenotypes from positive control accession descriptions. a) Transcript of a spoken description for positive control accession M241C A1 A2 B1 C1 C2 PI1 Pr1 R1-r, b) image of positive control accession M241C A1 A2 B1 C1 C2 PI1 Pr1 R1-r, c) example of phenotype description terms, and synonyms from Merriam-Webster and WordHippo, to demonstrate a description having at least one instance of a synonymous word for the phenotype of interest.

Fig. 5. Manually measured height: phenotypic data. Manhattan plot generated using GAPIT and FarmCPU using measured height data a) BLUEs and b) BLUPs using Mural et al. genotypic data. The horizontal red dashed line indicates the Bonferroni threshold, −log10(p) = 8.55, in both a and b; orange points above the line and intersected by a vertical line indicate identified SNPs with known plant height genes within ±300 kb.

Fig. 6. Semantic similarity for query tall: phenotypic data. Manhattan plot generated using GAPIT and FarmCPU using semantic similarity score data a) BLUEs and b) BLUPs using Mural et al. genotypic data. The horizontal red dashed line indicates the Bonferroni threshold, −log10(p) = 8.55, in both a and b; orange points above the line and intersected by a vertical line indicate identified SNPs with known plant height genes within ±300 kb.

Fig. 7. Binned phrases for plant height: phenotypic data. Manhattan plot generated using GAPIT and FarmCPU using binned plant height data a) BLUEs and b) BLUPs using Mural et al. genotypic data. The horizontal red dashed line indicates the Bonferroni threshold, −log10(p) = 8.55, in both a and b; orange points above the line and intersected by a vertical line indicate identified SNPs with known plant height genes within ±300 kb.

Table 1. Proportion of word usage for describing positive control accessions.

Table 2. Observation retention by spoken phenotype method. The rows observed count includes data collected for all taxa within the Yanarella et al. dataset, including 25 positive control taxa and 33 taxa not found in the Mural et al. dataset. The count of row observations utilized for semantic similarity for spoken data methods represents the data for the 653 intersecting taxa between Yanarella et al. and Mural et al., and the values in parentheses are the proportions of data retained from total rows observed.

Table 3. Plant height gene models count identified from publications.

Table 4. Candidate genes for plant height, shared measured and binned BLUEs and BLUPs.

SNP ID (position) | Binned SNP ID (position) | Gene model(s) a | Maize gene name (symbol) | GO unique ID b | GO term name | Arabidopsis ortholog(s)
See Supplementary Table 4 for full list of Gene IDs and GO terms. c Identified in a previously published GWAS for the trait plant height. b

Table 5. Appearance of GO terms by plant hormone for each method.
b Appearance count for GO terms unique to this study.