A novel bioinformatics approach identifies candidate genes for the synthesis and feruloylation of arabinoxylan.

Arabinoxylans (AX) are major components of graminaceous plant cell walls, including those in the grain and straw of the economically important cereals. Despite some recent advances in identifying the genes encoding the biosynthetic enzymes for a number of other plant cell wall polysaccharides, the genes encoding enzymes of the final stages of AX synthesis have not been identified. We have therefore adopted a novel bioinformatics approach based on estimation of differential expression of orthologous genes between taxonomic divisions of species. Over 3 million public domain cereal and dicot ESTs were mapped onto the complete sets of rice and Arabidopsis genes, respectively. It was assumed that genes in cereals involved in AX biosynthesis would be expressed at high levels and that their orthologues in dicotyledonous plants would be expressed at much lower levels. Considering all rice genes encoding putative glycosyltransferases (GT) predicted to be integral membrane proteins, genes in the GT43, GT47 and GT61 families emerged as by far the strongest candidates. When the search was widened to all other rice or Arabidopsis genes predicted to encode integral membrane proteins, cereal genes in the Pfam family PF02458 emerged as candidates for the feruloylation of arabinoxylan. Our analysis, known activities and recent findings elsewhere are most consistent with genes in the GT43 family encoding beta 1,4 xylan synthases, genes in the GT47 family encoding xylan alpha 1,2 or alpha 1,3 arabinosyl transferases and genes in the GT61 family encoding feruloyl-arabinoxylan beta 1,2 xylosyl transferases.


INTRODUCTION
All higher plants are believed to synthesise arabinoxylan (AX) or glucuronarabinoxylan (GAX) as a component of their cell walls. This hemicellulose may function in coating and crosslinking of cellulose microfibrils (Carpita and Gibeaut, 1993). In primary cell walls, this cellulose and hemicellulose network is thought to be embedded in a protein and pectic matrix.
GAX has single alpha 1,2 linked residues of glucuronic acid (GlcA) or 4-O-methyl-GlcA attached to the xylosyl residue backbone in addition to arabinosyl substitutions.
In grasses, the arabinose may be feruloylated and then further substituted with β-1,2-xylosyl residues (Wende and Fry, 1997a). The main hemicellulose in dicots is xyloglucan, a β-1,4-linked glucan with xylosyl side chain substitutions which can be further decorated with galactose and fucose. However, some species show unusual characteristics; sugar beet cell walls appear to be almost devoid of hemicelluloses, including xylans and xyloglucan (Renard and Thibault, 1993).
The relative importance of the different polysaccharides of walls varies widely. In dicot primary walls, including those of Brassicas and Arabidopsis, the major hemicellulose is xyloglucan (Bacic et al., 1988; Carpita, 1996; Harris et al., 1997). However, in the Type II walls of plants in the Poales order (grasses) such as rice, wheat and barley, and other commelinoid monocots, AX occurs as a major constituent of all primary and secondary cell walls (Wilkie, 1979). Hence, AX is a major component of the dietary fibre consumed by humans in cereal products and also affects the processing properties of cereals and their quality as livestock feed. Furthermore, the exploitation of plant biomass for biofuels will depend largely on our ability to digest AX and other cell wall polymers. A key factor in digestibility is believed to be the feruloylation of AX and GAX in cereals, which allows possible cross-linking between AX chains and from AX to lignin (Grabber, 2005). Despite the huge economic and nutritional importance of AX, the genes which encode the xylan synthase (which generates the xylan backbone), the arabinosyl transferases (which substitute arabinose onto the xylose residues) and AX feruloyl transferase (which substitutes ferulate onto these arabinose residues) have not been identified.
Until recently, it was thought that the xylan synthase gene was likely to belong to the cellulose-synthase like (Csl) family (Richmond and Somerville, 2000); however, this now seems less likely as members do not show xylan synthase activity when expressed in insect cells (Liepman et al., 2005). Recent progress in identifying the functions of key glycosyltransferase (GT) genes has been made by using their expression patterns as an initial screen. Brown et al. (2005) and Persson et al. (2005) used microarray data to identify genes that were coexpressed with secondary cell-wall specific cellulose synthase genes in Arabidopsis. These included several GT genes for which insertional mutants were shown to have altered cell wall composition. [...] dicots and cereals, this is less of a problem in our analysis.
We searched for genes which encode the enzymes responsible for AX synthesis making a minimal number of assumptions. These are that the enzymes must be integral membrane proteins, since AX is synthesised in the Golgi, and that the expression of the encoding genes should reflect the relative abundances of AX (Wilkie, 1979; McNeil et al., 1984; Bacic et al., 1988). The fact that polysaccharide synthases appear to occur at relatively low abundance (Dhugga, 2005) shows that no simple relationship exists between amounts of different enzyme types and their products, but it seems highly probable that expression of the same enzyme type should reflect the gross differences between species. Thus we assumed that the expression of genes encoding AX synthetic enzymes will be high in absolute terms in monocots compared with other GT genes, and substantially higher than that for their putative orthologues in dicots. The complete complement of rice genes was linked to [...] Tables I and II are derived is available as supplementary data Table S1).
The first criterion for selection of candidate genes involved in AX biosynthesis is that they should be highly expressed in cereals and therefore Tables I and II show only those orthologous groups with more than 100 associated ESTs. The second criterion is that they should be more highly expressed in grass species than dicot species due to the much greater prevalence of AX in the former; xylans make up about 5% of primary cell wall in dicots compared to 20% in grasses (McNeil et al., 1984). The orthologous groups were therefore ranked in descending order of the normalised ratio of EST counts in cereals to EST counts in dicots. [...], 2006). As mixed-linkage beta glucan is quite abundant in grass cell walls but does not exist in dicots, this finding supports the suggestion that grass-specific synthases occur high in the ranked list.
Since no exact corresponding genes to the CSLF genes exist in Arabidopsis, the method groups them together in an orthologous group with the related CSLD genes which must encode enzymes with different activities.
Further evidence for the validity of the method came from inspection of the bottom of the ranking, i.e. those GT genes that are much more highly expressed in dicots than in cereals. These should encode the enzymes responsible for the cell wall components that are more abundant in dicots, e.g. xyloglucan and pectin (Bacic et al., 1988). Table II shows all the orthologous groups for which the normalised EST count ratio was below 0.4. As expected, this list includes genes encoding enzymes active in xyloglucan synthesis (AtXT1 genes in GT34 family; Bencur et al., 2005) and QUA1, which is implicated in the synthesis of the pectin homogalacturonan (Bouton et al., 2002). It is possible that different orthologous groups within the same GT family could substitute for one another in dicots and cereals, so that one group would be highly expressed in cereals and the other in dicots. This can be ruled out for the GT61 and GT43 families, for which all groups were more highly expressed in cereals, and for the GT22 and GT34 families, for which all groups are more highly expressed in dicots.
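The two selection criteria and the ranking can be illustrated with a minimal sketch. The group names and EST counts below are hypothetical, the factor 0.468 is the normalisation factor quoted in the table footnotes, and applying the 100-EST threshold to the cereal counts is an assumption, since the text does not state exactly which count the threshold refers to.

```python
from dataclasses import dataclass

# Factor quoted in the table footnotes: total dicot ESTs / total cereal ESTs.
NORMALISATION_FACTOR = 0.468


@dataclass
class GroupCounts:
    name: str           # hypothetical orthologous group label
    cereal_ests: float  # ESTs from cereals mapped to the rice loci of the group
    dicot_ests: float   # ESTs from dicots mapped to the Arabidopsis loci


def normalised_ratio(group, factor=NORMALISATION_FACTOR):
    """Cereal/dicot EST ratio corrected for the overall EST totals."""
    if group.dicot_ests == 0:
        return float("inf")
    return group.cereal_ests / group.dicot_ests * factor


def rank_groups(groups, min_cereal_ests=100):
    """Keep groups above the EST threshold (assumed here to apply to
    cereal ESTs) and sort them by descending normalised ratio."""
    kept = [g for g in groups if g.cereal_ests > min_cereal_ests]
    return sorted(kept, key=normalised_ratio, reverse=True)


# Hypothetical counts, for illustration only.
groups = [
    GroupCounts("group_A", 600, 40),
    GroupCounts("group_B", 150, 500),
    GroupCounts("group_C", 50, 2),
    GroupCounts("group_D", 200, 100),
]
ranked = rank_groups(groups)
cereal_biased = [g.name for g in ranked if normalised_ratio(g) > 2.5]
dicot_biased = [g.name for g in ranked if normalised_ratio(g) < 0.4]
print([g.name for g in ranked], cereal_biased, dicot_biased)
```

The cut-offs 2.5 and 0.4 are those given for Tables I and II; everything else in the example is illustrative.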
We investigated whether a large bias in the tissue types of libraries made for cereals and dicots could affect the results in Table I. Inspection of all the libraries used showed the most obvious differences to be (1) a larger proportion of seed libraries for cereals compared to dicots, and (2) a preponderance of tuber libraries for potatoes, presumably due to their economic importance. The procedure used to generate Table I was therefore repeated, but for two subsets excluding (1) all seed libraries and (2) all root/tuber libraries. The results were very similar to those in Table I showing that they are not caused by a tissue bias (data summarised in Table S3). Since dicots and cereals differ in primary cell wall AX content much more than in secondary cell walls, we might expect the differential expression of candidates for AX synthesis to be exaggerated in developing tissues. We investigated this by identifying a subset of ESTs annotated as from young or developing tissue; unfortunately this subset is probably too small to test this and the results (Table S3) were not conclusively different from Table I.
Not all plant GTs are in the CAZy database; bioinformatics approaches have identified new putative GTs, some of which have subsequently been experimentally confirmed (Egelund et al., 2004) and proteomic studies of Golgi-located proteins have also revealed more putative GTs (P.D., unpublished). We were also interested in genes responsible for the addition of ester-linked phenolic groups onto AX in grass species, particularly ferulate. For these reasons, we analysed all rice genes that are not in CAZy but are predicted either to encode integral membrane proteins or are in orthologous groups containing such genes. All genes that are highly expressed in cereals with more than 100 cereal ESTs and at least 20 in each cereal species were considered (Supplementary data [...] acceptors. Whereas the abundance of xylan differs quantitatively between grasses and dicots, the presence of feruloyl groups on AX appears to be an absolute difference as these are present in cell walls of all Gramineae but have never been detected in dicots (Bacic et al., 1988). The high degree of differential expression and relatively low similarity between rice and Arabidopsis orthologues for the PF02458 group in Table III therefore appears consistent with these genes encoding enzymes which transfer feruloyl and perhaps other, rarer hydroxycinnamoyl residues onto AX. A complication is that this family also encodes acetyl transferases and AX is often heavily acetylated (Carpita, 1996); it is therefore possible that the group may also be responsible for this activity. Nevertheless, these genes remain the strongest candidates for AX feruloyl transferase as no other acyl transferase group shows nearly as much differential expression between cereals and dicots (Table S2).
Further evidence for the functions of the candidates in Tables I and III can be derived from the distribution of cDNA libraries in which the cereal ESTs occur. Since AX is prevalent in all primary and secondary cereal cell walls, the expectation is that functional groups representing AX biosynthetic genes should be highly expressed in all tissues, although the tissue specificity of individual genes within groups may vary. Figure 1 summarises this information and shows that this criterion is met for most groups, but not for the three GT2 family cellulose synthases that in dicots are specifically associated with secondary cell wall synthesis. The expression of these CESA GT2 genes was mostly in the flower/pistil/carpel category. Similarly, the GT31 expression was mostly anther-specific. Genes for which expression is mostly limited to only a few libraries can be regarded as less reliable; Table IV summarises the information for highly expressed cereal genes in the orthologous groups of Table I and again shows that this applies most to the CESA genes for which >50% of the counts are contributed by only 2-3 libraries.
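The library-concentration measure used here (the minimum number of libraries needed to account for 50% of the counts for a locus) could be computed along the following lines. This is a sketch under stated assumptions: the per-library counts are hypothetical, and "give 50%" is read as reaching at least half of the total.

```python
def min_libraries_for_fraction(counts_per_library, fraction=0.5):
    """Smallest number of libraries whose summed counts reach
    `fraction` of the total counts for one locus."""
    total = sum(counts_per_library)
    if total <= 0:
        return 0
    running = 0.0
    # Take the libraries with the largest counts first.
    for n, count in enumerate(sorted(counts_per_library, reverse=True), start=1):
        running += count
        if running >= fraction * total:
            return n
    return len(counts_per_library)


# Hypothetical loci: one dominated by two libraries, one evenly spread.
concentrated = [40, 30, 5, 5, 5, 5, 5, 5]
even = [10] * 10
print(min_libraries_for_fraction(concentrated))  # 2
print(min_libraries_for_fraction(even))          # 5
```

A small value flags a locus whose apparent high expression rests on only a few libraries.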
Genes encoding xylan synthase and xylan arabinosyl transferase might be expected to be coexpressed with UDP glucuronic acid decarboxylase (UDPGlcAdc), which is responsible for generating the substrates for AX synthesis: UDP-xylose and, via an epimerase, UDP-arabinose (Zhang et al., 2005). Whilst UDP-xylose and UDP-arabinose also provide the xylose and arabinose present in other polysaccharides such as xyloglucan, arabinogalactan of arabinogalactan proteins and arabinan side chains in pectin, the greater abundance of AX in grass cell walls may be expected to result in a correlation for ESTs in cereal libraries. For such correlations, it is possible to look at individual loci separately to gain information on which are the best candidates within groups. There is significant correlation between the expression of UDPGlcAdc and one gene locus from the GT61 family, two GT47 loci, OsCslF6, two GT43 loci and a GT48 locus (Table IV). All these loci also have significant or near-significant correlation with UDP glucose dehydrogenase (UDPGDH), which is responsible for the synthesis of UDPGlcA. Coexpression of the PF02458 genes with these genes was also examined. This showed highly significant correlations with six GT61 loci and less significant correlations with two more GT61 and two GT2 loci (Table IV). If the PF02458 genes do encode feruloyl transferases, this argues for a close association in function between these GT61 genes and feruloylation.
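The coexpression measure can be sketched as follows: counts are divided by the total number of ESTs in each library, and the normalised counts for a candidate locus are correlated with the summed counts of the reference genes across libraries. The text does not name the correlation coefficient, so Pearson's r is an assumption here, and all counts are hypothetical. Significance testing is not shown.

```python
import math


def normalise(counts, library_totals):
    """Divide per-library counts by library size."""
    return [c / t for c, t in zip(counts, library_totals)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data for four libraries.
library_totals = [1000, 2000, 4000, 500]
reference_genes = [[1, 2, 4, 0], [0, 1, 4, 1], [0, 1, 4, 0]]  # e.g. three reference loci
candidate = [2, 8, 24, 2]

# Sum the reference genes per library before normalising.
reference_sum = [sum(per_library) for per_library in zip(*reference_genes)]
r = pearson(normalise(candidate, library_totals),
            normalise(reference_sum, library_totals))
print(round(r, 6))
```

In this constructed example the candidate counts are exactly twice the summed reference counts in every library, so the coefficient is 1 up to rounding.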
In summary, the evidence above suggests that groups of genes in the families GT61, GT47, GT2 CSLC, GT43, GT77, GT48 or GT64 are candidates for AX biosynthesis.
Of these, the groups of GT61, GT47 and GT43 families of inverting GTs are the best candidates as they all have both high cereal EST totals for each family of over 400 and also widespread expression in different tissues. The single group of GT2 CSLC genes with cereal counts of 141 is not as highly expressed. Similarly, the GT48 group is not as highly expressed (167 cereal ESTs) and the known activity for this family is callose synthesis. The GT77 and GT64 families differ from the others in Table I as they encode enzymes with retaining glycosyl transferase activity whereas xylan synthase and xylan arabinosyl transferase require inverting activities, so that it seems unlikely that genes in either of these families have a direct role in AX biosynthesis.
Recent experimental results also provide independent support for the hypothesis that GT43 genes encode xylan synthase enzymes. The Arabidopsis knock-out mutant of At2g37090, irx9, has an irregular xylem phenotype and markedly decreased xylose content in its secondary cell walls (Brown et al., 2005) and this change has been suggested to be due to altered xylan (Brown et al., 2005; Bauer et al., 2006). The rice orthologue of IRX9, Os05g03174, is in the GT43 group with the greatest cereal/dicot ratio ( [...] recent structural and mutational analysis identified key residues that are important for binding the UDP-GlcA molecule in the human enzymes (Fondeur-Gelinotte et al., 2006). One of these residues (R156) is conserved in all animal GT43 enzymes, but not in plants (Fondeur-Gelinotte et al., 2006). Interestingly, it appears possible from the structure that this arginine residue stabilises the carboxylate group in GlcA; its absence in the plant enzymes therefore suggests that UDP-xylose, which is very similar in structure to UDP-GlcA but lacks this carboxyl group, may be the donor rather than UDP-GlcA.
The known activity for GT61 gene products including one cloned from Arabidopsis [...] (Table I) are on different branches within the family. The results in Table IV suggest a close relationship in the expression of these genes with feruloylation. Feruloylated arabinose residues on AX are frequently further substituted with β-1,2-D-xylosyl residues in all grass species tested (Wende and Fry, 1997a). These xylosyl residues can be terminal or the start of longer oligosaccharide side chains (Wende and Fry, 1997a,b). It seems probable that the GT61 genes encode the xylosyl transferases responsible for adding these residues to the feruloyl arabinose. Consistent with this is the fact that the GT61 groups are the most highly differentially expressed of any GT family (Table I) since this activity would be expected to be absent in dicots.
A GT47 gene family member in Arabidopsis, At2g35100, encodes a putative arabinan alpha 1,5-arabinosyltransferase (Harholt et al., 2006). The orthologous group for this gene is not highly expressed in cereals, but a GT47 group is highly expressed in cereals and may encode an AX arabinosyl transferase (Table I) [...] Also, poplar orthologues of NpGUT1 and IRX10 (PttGT47A, PttGT47D) were highly expressed during secondary wall formation (Aspeborg et al., 2005). Neither irx10 nor a knock-out of a second GT47 gene with a similar phenotype (irx7) has decreased arabinose content in its stem cell walls (Brown et al., 2005) and the major hemicellulose in poplar secondary cell walls is glucuronoxylan with little arabinose substitution (Aspeborg et al., 2005). We nevertheless judge that the most likely function of these genes in cereals is to transfer arabinose residues to the xylan backbone in the synthesis of AX. The higher differential expression of the GT47 group compared to GT43 groups (Table I) is consistent with this hypothesis since arabinosyl substitution of xylan is more common in grasses than in dicots (Bacic et al., 1988).
In conclusion, the analysis here provides strong support for particular genes within the GT43, GT47, GT61 and PF02458 families being responsible for the synthesis of AX and its side chains. The most likely activities based on the arguments above and some particularly promising candidate genes for investigation are summarised in Table V. The novel approach applied here to reach these hypotheses, using EST counts or other transcript abundance data, could be readily extended to look for candidate genes responsible for other interspecific differences in plant biochemistry.

MATERIALS AND METHODS
Coding nucleotide and protein sequences for all Arabidopsis and rice genes were obtained from TAIR (version 6) (Rhee et al., 2003) and TIGR (release 4) (Ouyang et al., 2007) respectively. All public ESTs for Arabidopsis, Brassica spp., Glycine max, Solanum tuberosum, Oryza sativa, Triticum aestivum and Hordeum vulgare were obtained from the dbEST division of GenBank (Boguski et al., 1993). ESTs whose complete GenBank entry made any reference to normalized or subtractive libraries were excluded from the analysis, since numbers of ESTs in such libraries no longer reflect abundance. About 11% of the ESTs were excluded for this reason, leaving a total of 3.4 million ESTs used in the study.
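The exclusion of normalized and subtractive libraries amounts to a text filter over each complete GenBank entry. The sketch below is one possible form of such a filter; the exact search terms used in the study are not given, so the pattern (which also accepts the spelling "normalised" and the stem "subtract") and the example entries are assumptions.

```python
import re

# Assumed search terms; the study only states that any reference to
# normalized or subtractive libraries led to exclusion.
EXCLUDE_PATTERN = re.compile(r"normali[sz]ed|subtract", re.IGNORECASE)


def reflects_abundance(genbank_entry_text):
    """True if the entry makes no reference to normalized or subtractive libraries."""
    return EXCLUDE_PATTERN.search(genbank_entry_text) is None


# Hypothetical entry texts keyed by hypothetical EST identifiers.
entries = {
    "est_a": "Lib Name: leaf cDNA library, normalized",
    "est_b": "Lib Name: root cDNA library",
    "est_c": "Lib Name: Subtractive hybridisation library, grain",
}
kept = [est_id for est_id, text in entries.items() if reflects_abundance(text)]
print(kept)  # ['est_b']
```

A plain substring match of this kind is deliberately inclusive: it would also discard an entry described as "non-normalized", which matches the stated rule of excluding any reference to such libraries.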
The relationship between rice and Arabidopsis genes and the mapping of ESTs to genes were established with the BLAST program suite (Altschul et al., 1997) as shown in Figure 2. The rice-Arabidopsis relationship was defined by protein similarity; all gene models (putative splice variants) were compared but only the highest scoring alignment was used for each locus. Orthologous pairs of rice and Arabidopsis genes were defined as reciprocal best blastp hits, i.e. each locus was the top hit of the other in the other genome, with an alignment score above 200 bits. All non-orthologous loci were defined as paralogues of the closest matching orthologous locus in the same genome above a threshold score of 200 bits. This resulted in 10,855 orthologous groups containing a total of 30,990 rice and 23,581 Arabidopsis loci, leaving 24,899 rice loci designated as having no orthologue. ESTs were assigned to the most similar locus of the appropriate species by sequence similarity (Fig. 2). In some cases, an EST matched multiple loci because of identical alignment scores or ambiguities as to the best match due to multiple aligned regions. In these cases, equal fractions of the EST count were assigned to each locus. EST counts for splice variants were summed to give a single value per locus. Results were obtained and processed with custom Perl scripts employing BioPerl (Stajich et al., 2002) modules and stored in a MySQL database.
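The orthologue definition and the fractional assignment of ambiguous ESTs can be illustrated with a short sketch. It is written in Python for brevity rather than in the Perl/BioPerl used for the study, it assumes the blastp results have already been reduced to (query locus, subject locus, bit score) tuples, and all locus and EST names are hypothetical. Paralogue assignment is not shown.

```python
from collections import defaultdict

MIN_BITS = 200.0  # alignment score threshold stated in the text


def best_hits(hits):
    """Return {query: (subject, bit_score)} keeping the top-scoring hit per query."""
    best = {}
    for query, subject, bits in hits:
        if query not in best or bits > best[query][1]:
            best[query] = (subject, bits)
    return best


def reciprocal_best_pairs(rice_vs_at, at_vs_rice, min_bits=MIN_BITS):
    """Rice/Arabidopsis pairs that are each other's top hit above min_bits."""
    forward = best_hits(rice_vs_at)
    reverse = best_hits(at_vs_rice)
    pairs = []
    for rice, (at, bits) in forward.items():
        back = reverse.get(at)
        if back and back[0] == rice and bits > min_bits and back[1] > min_bits:
            pairs.append((rice, at))
    return sorted(pairs)


def assign_est_counts(est_best_loci):
    """Split each EST equally among its equally good best-matching loci."""
    counts = defaultdict(float)
    for loci in est_best_loci.values():
        for locus in loci:
            counts[locus] += 1.0 / len(loci)
    return dict(counts)


# Hypothetical hits and EST assignments.
rice_vs_at = [("OsA", "AtA", 350), ("OsA", "AtB", 220),
              ("OsB", "AtA", 300), ("OsC", "AtC", 150)]
at_vs_rice = [("AtA", "OsA", 340), ("AtB", "OsA", 210), ("AtC", "OsC", 160)]
pairs = reciprocal_best_pairs(rice_vs_at, at_vs_rice)
counts = assign_est_counts({"est1": ["OsA"], "est2": ["OsA", "OsB"], "est3": ["OsB"]})
print(pairs, counts)
```

In this example only OsA and AtA form an orthologous pair: OsB is not the top hit of AtA, and the OsC/AtC pair falls below the 200-bit threshold.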
Library information for ESTs was derived to look at tissue distribution and coexpression of cereal genes. Since formal library information is rarely present in dbEST, instead library was defined by an identifier composed of the species, cultivar, tissue and stage fields from each EST entry. Libraries where either tissue or stage information was missing were excluded from these analyses. Tissue distribution was examined by assigning to categories tissue fields containing the regular expressions 'anther', 'callus', 'seed|grain|kernel|caryopsis|embryo|endosperm', 'flower|pistil|carpel', 'ear|spike|panicle|inflorescence', 'shoot|stem|leaf|leaves', 'root|tuber' or 'whole plant|seedling'. In addition to these mutually exclusive classifications, a 'developing' subset was used in an attempt to identify those libraries where primary cell wall synthesis was likely to predominate over secondary cell wall synthesis; this was further [...] 56 rice loci with no orthologues and was the dataset used to generate the results in Tables I, II and S1. The remainder of the global set was used to generate the results in Tables III and S2.
Table S1: Complete dataset of EST counts for the CAZy set of loci from which the summaries in Tables I and II are derived.
Table S2: Complete dataset of EST counts for the complete set of loci defined as encoding putative membrane proteins derived from Aramemnon, excluding those in the CAZy set. Table III contains data for one group taken from this.
Table I. Counts of cereal ESTs matching rice (Os) or dicot ESTs matching Arabidopsis (At) loci predicted to encode integral membrane GTs. Orthologous groups with more than 100 matching ESTs are shown and are ranked by the normalised ratio of cereal EST counts / dicot EST counts. The first rice locus and first Arabidopsis locus in each group are orthologues, the others are paralogues to these. Groups with normalised ratio > 2.5 are shown.
a Ratio is normalised by multiplying by the ratio of total counts for all loci of dicot ESTs / cereal ESTs, a factor of 0.468.
b This gene was identified as being coexpressed with secondary cell wall-specific CESA genes in Arabidopsis (Brown et al., 2005).
Table II. Results as for Table I, but for groups with normalised ratio < 0.4.
a Ratio is normalised by multiplying by the ratio of total counts for all loci in this set of dicot ESTs / cereal ESTs, a factor of 0.468.
b The mature proteins encoded by these groups do not contain transmembrane domains but are present because of the inclusive method used to predict integral membrane proteins.
[...]
a The minimum number of libraries required to give 50% of the total counts for this locus. If a large proportion of the counts arise from a few libraries this may indicate that this is not representative of generally high expression in cereals.
b,c,d Correlation coefficients for normalised counts for each gene with normalised counts across 63 cereal libraries. Counts were normalised by dividing by total number of ESTs in each library to correct for variation due to library size. * indicates significant at P<0.05, ** significant at P<0.01.
b Correlations with sum of counts for three UDPGDH genes.
c Correlations with sum of counts for three UDPGlcAdc genes.
d Correlations with sum of counts for genes in Pfam family PF02458 shown in Table III.
[...] VI. Numbers of ESTs present in the database categorised by species and tissue type. Tissue type classifications were assigned according to the presence of particular terms within the tissue and/or stage fields of the EST entry.
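The term-based tissue classification described in Materials and Methods could be sketched as below. The regular expressions are those listed in the text; how overlaps were resolved is not stated, so the first-match rule, the case-insensitive matching and the decision to test 'whole plant|seedling' before the seed pattern (so that "seedling" is not captured by "seed") are all assumptions of this sketch.

```python
import re

# Patterns from Materials and Methods; the checking order is an assumption.
TISSUE_PATTERNS = [
    "whole plant|seedling",
    "anther",
    "callus",
    "seed|grain|kernel|caryopsis|embryo|endosperm",
    "flower|pistil|carpel",
    "ear|spike|panicle|inflorescence",
    "shoot|stem|leaf|leaves",
    "root|tuber",
]
COMPILED = [(p, re.compile(p, re.IGNORECASE)) for p in TISSUE_PATTERNS]


def classify_tissue(tissue_field):
    """Return the first matching pattern (used as the category label), or None."""
    for label, regex in COMPILED:
        if regex.search(tissue_field):
            return label
    return None


def library_id(species, cultivar, tissue, stage):
    """Library identifier built from the four EST fields; None if tissue or
    stage is missing, since such libraries were excluded from these analyses."""
    if not tissue or not stage:
        return None
    return "|".join([species, cultivar, tissue, stage])


# Hypothetical field values.
print(classify_tissue("developing grain"))
print(classify_tissue("seedling"))
print(classify_tissue("young panicle"))
print(library_id("Oryza sativa", "cv_x", "Leaf", ""))
```

Because these are unanchored substring matches, short terms such as 'ear' would also match unrelated words containing them; any real use would need to check the field contents for such cases.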