GeneBase 1.1: a tool to summarize data from NCBI gene datasets and its application to an update of human gene statistics

We release GeneBase 1.1, a local tool with a graphical interface useful for parsing, structuring and indexing data from the National Center for Biotechnology Information (NCBI) Gene data bank. Compared to its predecessor GeneBase (1.0), GeneBase 1.1 now allows dynamic calculation and summarization in terms of median, mean, standard deviation and total for many quantitative parameters associated with genes, gene transcripts and gene features (exons, introns, coding sequences, untranslated regions). GeneBase 1.1 thus offers the opportunity to perform analyses of the main gene structure parameters also following the search for any set of genes with the desired characteristics, allowing unique functionalities not provided by the NCBI Gene itself. In order to show the potential of our tool for local parsing, structuring and dynamic summarizing of publicly available databases for data retrieval, analysis and testing of biological hypotheses, we provide as a sample application a revised set of statistics for human nuclear genes, gene transcripts and gene features. In contrast with previous estimations strongly underestimating the length of human genes, a ‘mean’ human protein-coding gene is 67 kbp long, has eleven 309 bp long exons and ten 6355 bp long introns. Median, mean and extreme values are provided for many other features offering an updated reference source for human genome studies, data useful to set parameters for bioinformatic tools and interesting clues to the biomedical meaning of the gene features themselves. Database URL: http://apollo11.isto.unibo.it/software/


Introduction
Genome browsers such as the Map Viewer at the National Center for Biotechnology Information (NCBI) (1), Ensembl (2) at the European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI) and Wellcome Trust Sanger Institute and the University of California Santa Cruz (UCSC) Genome Browser (3) provide plenty of data about the human genome and genes, also downloadable in different types of text formats. However, they do not offer searches like the ones made possible by a fully structured database (4). In particular, numerical values for many features are not treated as database number fields, and summarization of the values in terms of mean, standard deviation (SD) and so on is often not available. When these statistics are provided (e.g. https://www.ncbi.nlm.nih.gov/genome/annota tion_euk/Homo_sapiens/108/), they are based on fixed gene sets and cannot be dynamically created following searching for a set of genes with the desired characteristics (e.g. gene function or chromosomal location). Furthermore, the reported values (e.g. the length of the shortest gene) cannot be associated with the records of the specific genes having those values. Finally, different data tables presented as web pages for gene features and gene sequences are not related and cross-table searches in this sense are not possible (e.g. searching for introns ending with the CAGCAG sequence to readily identify possible candidate genes for subtle alternative splicing (5)).
To address these issues we improved GeneBase (1.0), a user-friendly local tool with a graphical interface incorporating data available in the NCBI Gene database and allowing users to perform original searches, calculations and analyses of the main gene-associated meta-information (6). The original GeneBase (1.0) has been revised adding in particular calculation and related fields leading to the release GeneBase 1.1 presented here which now allows summarization in terms of median, mean, SD and total for many quantitative parameters associated with genes, gene transcripts and gene features, dynamically calculated for any desired subset thus providing unique functionalities not offered by the NCBI Gene itself.
To demonstrate the utility of GeneBase 1.1 for understanding statistics about a whole set of genes, we provide an updated reference set of statistics of human nuclear gene parameters useful for human genome studies. Although we do not undervalue the relevance of repetitive DNA sequences which are estimated to account for 66-69% of the human genome (7), we will especially focus on sequences annotated as genes analysing data available in the NCBI Gene (1) database following parsing by GeneBase 1.1. The tool also allowed the selection of a curated subset of human nuclear genes with a 'REVIEWED' or 'VALIDATED' Reference Sequence (RefSeq) status (8). The NCBI Gene database has arbitrarily been chosen as a reference data set because being part of an international effort, it represents data that is mostly presented also by other genome browsers such as Ensembl (2) and the UCSC Genome Browser (3) which are now based mainly on GENCODE (9). Due to the high concordance of GENCODE Basic set (Only full-length, proteincoding transcripts at protein-coding genes) with NCBI RefSeq database data as proved by Frankish and coll. (10), it may derived that NCBI Gene may soundly be suitable for our purpose, although use of the other genome browsers might be a useful addition to the analysis of gene data.
In particular, here we show that substantial changes have modified these main statistics still typically provided in the literature as reference data. Numerical and statistical data related to the human genome are typically mentioned as general knowledge without citing a proper updated reference (11,12) or from genome browser websites (13,14). This is also due to a lack of a systematic reanalysis of these values which date back to February 2001 (15)(16)(17) in the recent literature, while the reanalysis of data provided by regularly updated genome browser websites is hampered by limitations presented above.
While showing potential of our tool for local parsing, structuring and dynamic summarizing of publicly available databases for data retrieval and analysis, we provide as a sample application a revised set of statistics for human nuclear genes which offers both an updated reference data set for human genome studies and interesting clues to the biomedical meaning of the gene features themselves.

Tool development
The original GeneBase (1.0) is a structured local database consisting of three main related tables ('Gene_Summary', 'Gene_Table' and 'Gene_Ontology') containing information such as gene nomenclature, structure and transcripts parsed from a desired NCBI Gene dataset through an initial Python script.
First, the original available Python executable script has been improved (http://www.python.org/, version 2.7) in order to obtain a 'Gene_Table' record also for genes lacking transcribed products but with a given genomic location, leading to the possibility to calculate, e.g. gene size of RNA transfers (tRNAs), that would have otherwise been impossible. The other script parsing functions remain unchanged, including the obtainment of the three tabdelimited files which then need to be loaded into GeneBase as provided, following the software documentation. GeneBase 1.1 has been improved here compared with the original version especially in order to calculate specific gene and transcript feature statistics. In particular, fields related with other GeneBase 1.1 tables were added in the 'Gene_Summary ' table in order to improve search opportunities showing, e.g. the length, the annotated strand and the transcript RefSeq status associated with each gene. A script automatically executed after the first import step was implemented in order to also calculate mature messenger RNA (mRNA), 5 0 and 3 0 UTR (untranslated region) lengths in the 'Gene_Table' table. This was not a trivial task due to the necessity to take the intron lengths that can be different for each isoform into account in the calculation. 'Transcripts' GeneBase 1.1 table has been expanded to also include mature transcript, CDS (coding DNA sequence), 5 0 and 3 0 UTR lengths and exon and coding exon number per transcript. 'Genes' table has been created in order to show an informational overview regarding gene lengths, transcript number for each gene and exon and coding exon number for the transcript isoform with the highest number of exons of each gene.
Furthermore, in 'Transcripts' and 'Genes' tables summary sections have been created in order to collect and calculate median, mean, SD and total values for all the available gene features. The summary section of the 'Reports' table, generated in the original GeneBase (1.0) version to easily provide mean exon and intron lengths, has been improved here by adding the median and the total length calculations for the current records shown in 'Gene_Table'. In particular, all the summary sections available in GeneBase 1.1 update the listed values depending on the current found record subset, thus statistics can be dynamically calculated for any desired subset of genes. GeneBase 1.1 was originally developed and improved here within the FileMaker Pro Advanced environment (FileMaker, Santa Clara, CA), a database management system with an intuitive user-friendly graphical interface for both Macintosh (Mac OS X) and Windows operating systems. Information has been fragmented into distinct fields as much as possible in order to facilitate independent data management. The normalization through relationships between the tables has been only partially realized in order to balance the elimination of redundancy and the speed of searches. Numbers are stored as numeric values allowing for instance record sorting by ascending or descending order and searches by range. Further data not present in the original NCBI Gene entries and extracted here from the available information is highlighted in red text. All data fields are indexed to ensure efficient data retrieval through the query options.
As for the original version, we do not provide a GeneBase 1.1 web tool because the FileMaker Pro environment does not support the full features available in the local file when the file is published via a HTML (HyperText Markup Language) interface.

Human database construction
Improvement brought in GeneBase 1.1 makes it easier to extract statistical summary information about genes available in NCBI Gene. As a sample application of GeneBase 1.1, we provide a revised snapshot of statistics for human nuclear genes, making a stand-alone version of GeneBase 1.1 pre-loaded with available human gene data updated to January 2016 (GeneBase 1.1 Human). Although it will not be automatically kept up-to-date as other on-line gene browsers, we provide an empty template which may be used at any time to load ab initio the latest version or any desired subset of NCBI Gene data for any organism and by any user, following parsing by our scripts. In contrast with the first presentation of GeneBase (6), we decided to initially include in our study also gene models, in order to obtain a complete picture of human nuclear genes in the NCBI database. All the currently ('alive'/'live' qualification) available human gene entries were downloaded from NCBI Gene on 19 January 2016, using the following text query: 'Homo sapiens'[Organism] AND 'source_ genomic'[properties] AND alive [property].
In order to integrate exon and intron nucleotide sequences, all the human chromosome sequences were downloaded from the NCBI Nucleotide database using Batch Entrez (http://www.ncbi.nlm.nih.gov/sites/batchen trez) in FASTA format on 26 January 2016 (corresponding to GRCh38.p2, incorporating centromere models), parsed using the available Python executable script and imported into GeneBase 1.1 Human, thoroughly following the software documentation. Sequences are not required to calculate statistical summary information, thus if the user is interested only in gene features such as gene number and size statistics, this step is not necessary.
We made Mac OS X and Windows stand-alone software (of both GeneBase 1.1 pre-loaded and empty versions), including the FileMaker runtime with a user guide provided and the relative Python scripts for the initial data pre-processing and sequence calculations, freely available to all for basic use at http://apollo11.isto.unibo.it/soft ware/. The freely distributed licensed runtime application allows full data import, records export in diverse file formats, as well as full record management and analysis and script execution. The downloading, parsing and import of gene entries, the downloading of chromosome sequences and the calculation of exon and intron sequences and the improvements are described in detail in the software documentation. An original copy of FileMaker Pro version 12 (or higher) is required only for the modification of the tool for personal purposes (creation of new fields, further calculation or additional relationship definition).

Sample application: statistics of the human gene features
In order to provide a revised set of statistics regarding human nuclear genes and transcripts through GeneBase 1.1 Human, we considered only genes with 'REVIEWED' or 'VALIDATED' RefSeq status, with at least one 'REVIEWED' or 'VALIDATED' transcript, excluding 'not in current annotation release' records ('Genome _Annotation_Status' field). This selection has already been proven successful in excluding from calculations model RefSeq records generated by automated pipelines (6). As mentioned above, GeneBase has been improved here and the fields useful for this gene selection are available in each software table. In particular, the exon and intron non-redundant sets were found counting only one exon or intron for each group of exons or introns present in multiple transcript isoforms. Specific GeneBase 1.1 searches performed to find numbers cited in Figures

Results
Database functionality GeneBase 1.1 is now composed of six related tables: 'Gene_Summary', 'Gene_Table', 'Gene_Ontology', 'Reports', 'Trascripts' and 'Genes'. The first three tables include information which needs to be extracted and parsed from NCBI Gene entries (please see the user guide provided at http://apollo11.isto.unibo.it/software/ for further details), while 'Reports', 'Transcripts' and 'Genes' tables provide an overview of the main available gene and transcript features and summary sections created in order to collect and calculate their median, mean, SD and total values. In particular, a statistical summary for exon and intron lengths is available in the 'Reports ' table. 'Transcripts'  table provides statistical values for  The most relevant functionality introduced in GeneBase 1.1 is that these statistics can be dynamically calculated for any desired subset of genes, making it easier to extract summary values for the subset of 'REVIEWED' and 'VALIDATED' human nuclear genes, as described below.
We obtained 59 801 entries from downloading all current live human records with a genomic gene source (Methods) from NCBI Gene available up to 19 January 2016.
Following the initial parsing and importing steps (Methods), the three main tables in GeneBase 1.1 Human database are constituted as follows: 'Gene_Summary' contains 59 801 records (one for each NCBI Gene entry). 'Gene_Table' contains 1 502 237 records (one record for each gene exon, including the downstream intron if an intron follows that exon), corresponding to 40 942 genes with 136 694 transcripts in total (equal to the 'Transcripts' table record number), excluding genes without annotated transcribed products. 'Gene_Ontology' contains 18 726 records in all, one for each gene with Gene Ontology information available.
The overall gene type composition of GeneBase 1.1 Human is shown in Figure 1A Figure 1B is the representation of 22 451 GeneBase 1.1 Human gene entries with 'REVIEWED' or 'VALIDATED' RefSeq status, with at least one 'REVIEWED' or 'VALIDATED' transcript and excluding genes not in current annotation release, corresponding to a total of 45 541 transcripts, which is the subset that will be considered onwards. The available RefSeq statuses associated with genes and the corresponding RNAs are summarized in Supplementary  Table S1; it should be noted that the RefSeq status of a particular RNA may be different from the status assigned to the corresponding gene.
In order to integrate exon and intron nucleotide sequences, only entries with the 'REVIEWED' or the 'VALIDATED' RefSeq status having an 'NM_' or 'NR_' type of RefSeq RNA accession number (in order to exclude 'XM_' or 'XR_' model Refseq records generated by automated pipelines) were selected. After the chromosome sequence download, parsing and importing steps (Methods), a total of 459 868 'Gene_Table' records were updated with exon, coding exon (for protein-coding transcript isoforms) and the corresponding downstream intron sequences up to 26 January 2016. The whole database including sequences has a size of 6.43 gigabytes following decompression.

Revision of human nuclear gene feature statistics
The subset of the 22 451 'REVIEWED' or 'VALIDATED' gene entries with at least one 'REVIEWED' or 'VALIDATED' transcript (excluding genes not in current annotation release) available in GeneBase 1.1 Human for each human chromosome is shown in Figure 2 (Table 1) and includes a total of 18 255 protein-coding genes, 668 pseudogenes and 3528 non-coding genes (Supplementary Table S2). The location of the 50.42% of genes has the transcript sequence on the DNA strand annotated as 'plus'.
In the considered subset of 'REVIEWED' and 'VALIDATED' entries, Table 2 Tables 3 and 4. Supplementary Table S3 gives these statistics for protein-coding and non-coding genes counted together.
It can be noted that the 66.56% of considered genes (14 944 out of the total of 22 451 genes) transcribes for at least two transcript isoforms. Regarding splicing, out of a total of 414 385 available intron sequences, the overriding majority (98.95%, 410 038 introns) presents canonical splice donor and acceptor sites (GT and AG, respectively); 0.87% (3,594 introns) are of the GC-AG type and 0.11% (439 introns) AT-AC; among the remaining, the majority uses at one boundary one canonical splice site (GT or AG) in combination with a noncanonical one at the other boundary (Supplementary  Table S4).
Data on the length of the specific mRNA regions are available in Table 5. The main known non-coding RNA types are in Supplementary Table S5.
About 39.35% of the sequences of nuclear DNA correspond to genes coding for proteins. In our data set, only 4.59% of the sequences of nuclear DNA correspond to non-coding genes. Protein-coding and non-coding genes together account for the 43.95% of chromosome sequences. The remaining part of the genomic DNA, approximately 1/ 2, seems to be extragenic DNA or intergenic, since it does not contain canonical genes and is interposed between them.
In addition, considering the non-redundant set of exons (without accounting for the occurrence of an exon more times in different transcript isoforms), on average, only 4.43% of the DNA sequence of a gene is part of a mature mRNA which is constituted by the sum of exons only; exons thus correspond to 1.74% of the total genome. Since, in turn, only 44.03% of the exon sequences is encoding stricto sensu, i.e. consists of triplets of bases which can be effectively translated into a sequence of amino acids, it can be deduced that <1% (0.77%) of the genome is coding in the strict sense, which corresponds to a total of 23.7 Mbp if we consider the non-redundant set of coding exons (counting only one coding exon for each group of exons present in multiple transcript isoforms). The remaining fraction of the nuclear DNA gene sequence is mainly

Discussion
We have presented here an improved, new version of our original GeneBase (1.0), a tool with a graphical interface able to parse, structure and index the whole text-based NCBI Gene content (6). Remarkably, the tool offers the possibility to manage sequences related to gene features, providing the way to integrate analysis of nucleic acid sequences and structural and functional annotation of gene sequence elements. For example, the availability of Gene Ontology is also relevant because function conservation may still be present without any sequence similarity (18).
Main changes in the new version 1.1 regard the addition of number fields useful for calculating characteristic length of the gene elements and the implementation of new tool tables useful in order to show related information about genes and transcripts, also giving the opportunity to find new relationships among their features. In particular, summary sections have been created in order to collect and calculate median, mean, SD and total values for all the available features. These values automatically update depending on the current found record subset, giving users the freedom to customize statistics which can be dynamically calculated for any desired subset of genes.
Advances implemented in the 1.1 version of GeneBase have been critical in making the use of the tool possible for the innovative purpose of generating a set of detailed statistics about a set of genes. As a sample application of GeneBase 1.1, we have provided an analysis of main statistics for annotated human nuclear genes updated to January 2016. GeneBase 1.1 has been filled with all known human nuclear genes (GeneBase 1.1 Human) as previously described (6), except for the inclusion of gene models; this decision has caused the presence in the database of a high number of genes without a transcribed product, as expected, giving on the other hand the opportunity to include genes for tRNAs.
Several results obtained by GeneBase 1.1 Human offer the possibility to obtain quantitative parameters associated with genes, gene transcripts and gene features as interesting clues to their biomedical meaning as discussed below.
The total number of human protein-coding genes and pseudogenes annotated in the NCBI Gene up to January 2016 (and thus available in GeneBase 1.   Table 1 and Supplementary Table S2 for more details.   We considered only protein-coding or non-coding genes with 'REVIEWED' or 'VALIDATED' RefSeq status, with at least one 'REVIEWED' or 'VALIDATED' transcript, excluding 'not in current annotation release' records in the 'Gene_Summary' and 'Genes' tables of GeneBase 1.1 Human as explained in the 'Methods' section and in the Supplementary Methods File. The mean, minimum and maximum gene numbers are also available in Supplementary Table S2. between the numbers of non-coding genes (18 882 versus 25 180). In addition, considering in GeneBase 1.1 Human the subset of genes with at least one RNA having a 'REVIEWED' or 'VALIDATED' RefSeq status for our sample application trimmed 81% of non-coding genes (from 18 882 to 3528 entries retained) and 95% of pseudogenes (from 16 555 to 668 gene entries retained), while it did not affect protein-coding genes in a large measure (from 20 909 to 18 255). On one hand, this reflects the great uncertainty still surrounding non-coding genes; on the other hand, this protein-coding gene subset represents the most reliable data confirmed by the fact that this number (18 255) is very close to the last estimate (19 000) obtained through the analysis of large-scale proteomic experiments (19). Furthermore, since all these kinds of analyses depend on the chosen gene entry subset, the classification system and are subject to the accuracy of the input dataset, we decided to perform this selection in order to exclude erroneous data that were not manually verified (6,20,21). It is unlikely that the data shown here could undergo significant changes when those related to genes that still remain to be characterized in detail will be included. Furthermore, we always study a consensus ideal genome in no more existing cells due to the theoretical impossibility to determine the whole sequence in living cells (22). However, human gene feature estimates differ significantly from those reported in the first preliminary article on the sequence of the human genome published in Nature (15) and have never been systematically revised since then. For example, the average size of a human protein-coding gene had been evaluated as 27 kbp (median: 14 kbp) (15), whereas now analysing the currently available data, it is of 67 kbp (median: 26 kbp, Table 2). This is likely due to a progressive increase in the genomic organization determination accuracy of the genes following the burst of expressed sequence tags (EST) database (23), in which >5.2 million of human transcript related sequences were deposited from 2001 to 2007, therefore after the publication in 2001 of the two human genome seminal reports. This subset accounts for 61% of the whole human EST database, allowing over the years the progressive merging of EST clusters mapped in the UniGene database into longer transcripts thus bridging gaps between apparently different loci. On average, a non-coding gene is half as long (34 kbp, Table 2). The absolute size, however, may vary across a range of variation of four orders of magnitude (10 000 times). Using an original method for transcriptome mapping (24), including systematic UniGene based conversion of gene identifiers (25), the estimation of the average human gene length was useful in order to determine the significance of over-or under-expressed genomic segments equivalent to single gene size in the whole normal human heart transcriptome map (26).
The largest currently known human gene, RBFOX1 (RNA binding protein, fox-1 homolog 1), spans 2.47 Mbp on chr16 ( Table 2). The other known human genes exceeding 2 Mbp in length are CNTNAP2 (contactin associated protein-like 2, spanning 2.30 Mbp on chr7), PTPRD (protein tyrosine phosphatase, receptor type D, 2.30 Mbp on chr9) and DMD (dystrophin, 2.22 Mbp on chrX). It is reasonable to think that the length of a gene can be one of the factors that influence its probability of being interrupted when there is a chromosomal lesion or to We considered only protein-coding genes with 'REVIEWED' or 'VALIDATED' RefSeq status, with at least one 'REVIEWED' or 'VALIDATED' transcript, excluding 'not in current annotation release' records in GeneBase 1.1 Human software. Non-coding RNA produced by a protein-coding gene are excluded, selecting only transcripts with an 'NM_' RNA accession number type (Methods and Supplementary Methods File). In particular, mRNA data were derived from the 'Genes' and 'Transcripts' tables, while exon and intron data from 'Gene_Table' and 'Reports' tables. Exon and intron non-redundant sets were found counting only one exon or intron for each group of exons or introns present in multiple transcript isoforms. A comprehensive analysis including both non-coding and protein-coding genes is available in the Supplementary Table S3  We considered only genes with 'REVIEWED' or 'VALIDATED' RefSeq status, with at least one 'REVIEWED' or 'VALIDATED' transcript, excluding 'not in current annotation release' records in GeneBase 1.1 Human software. Here only transcripts with an 'NR_' RNA accession number type (Methods and Supplementary Methods File) are selected, corresponding to 5890 genes (since also non-coding RNAs can be transcribed from a protein-coding gene). In particular, transcript data were derived from the 'Genes' and 'Transcripts' tables, while exon and intron data were derived from 'Gene_Table' and 'Reports' tables. Exon and intron non-redundant sets were found counting only one exon or intron for each group of exons or introns present in multiple transcript isoforms. A comprehensive analysis including both non-coding and protein-coding genes is available in the Supplementary Table S3  mutate following point errors of DNA replication. Interestingly, for these genes mutations and clinical phenotypes have been described (27)(28)(29) and since their correlations with gene length do not yet appear to have been systematically studied to date, investigations in the field will be made easier by the systematic dataset we present here.
The chromosome (chr) with the smallest number of genes is chrY, followed by chr21 (which is also the shortest), chr18 and chr13. Significantly, since 1959-1960, the only three autosomal trisomies allowing live births have known to be the ones of human chromosomes 13, 18 and 21 (13). Only after 2000, we know that these chromosomes are exactly the three having the lowest number of genes in absolute in the human genome (497, 331 and 273 known genes respectively, according to GeneBase 1.1 Human database, Table 1). In these cases, despite the three not being the shortest of the human genome, the survival is progressively worsening (30) in proportion to the gene number on the chromosome suggesting that the damage derived from the general over-expression of an entire set of genes may be somewhat compensated until a certain point.
Alternative promoters, splicing and polyadenylation are the main processes leading to complex multi-transcript systems (31)(32)(33). The gene with the highest number of protein-coding transcript isoforms (10) is CACNA1G (calcium voltage-gated channel subunit alpha1 G, chr7, Table 3), recently associated with spinocerebellar ataxia (34). The gene with the overall highest number of described protein-coding and non-coding transcript isoforms is UTY (ubiquitously transcribed tetratricopeptide repeat containing, Y-linked) with at least 77 known alternative transcripts (Table 4 and Supplementary Table S3) of which the full role is still unclear (35,36).
The average size of a human intron is 6355 bp in protein-coding genes and 7897 in non-coding genes, but in larger genes a single intron of 1 Mbp upper length (Tables  3 and 4; Supplementary Table S3) can be found. Again this estimation is substantially different from the one provided in the first genome draft of 3365 bp for protein-coding gene introns (15). The longest intron (1 160 411 bp) belongs to ROBO2 (roundabout guidance receptor 2, chr3). Due to the presence of artefactual data in some records, manual curation is needed when considering extremely low values, as previously discussed (6). Following validation as described (6), we confirm that there are no human (and actually of any species) introns shortest than 30 bp (Table 3). About 554 protein-coding and 948 non-coding transcripts (corresponding to a total of 1496 genes) are intronless (monoexonic), representing 3.3% out of the total considered transcript set.
The average length of a human exon is only 309 bp in protein-coding genes and 362 bp in non-coding genes, but also in this case the existing variability within the human genome is very high. Notably, thanks to the GeneBase 1.1 architecture, it is possible to analyse exon lengths which vary a lot depending on whether it is a first, internal or terminal exon and whether it is coding or not (Supplementary Figure S2). For example, excluding last exons, whose sequence is mainly non-coding (because they typically contain the stop codon), the mean exon length in proteincoding genes is of 161 bp, again an estimation substantially different from the mean value provided in the first genome draft of 145 bp (for protein-coding gene internal exons). The ratio between intron and exon length (6355:309 and 7897:362 bp for protein-coding and non-coding transcripts, respectively) is about 21:1. If exon and intron lengths of protein-coding and non-coding transcripts are comparable, the factor accounting for the different gene lengths (one double of the other) is the mean number of exons, which is equal to 11 for protein-coding genes versus 6 of non-coding genes.
The average size of a human mRNA is 3392 bp ( Table 3) of which only half (49.03%, Tables 3 and 5), on average, is coding. This average size much lower than that of the gene is the operative basis making the insertion of most human cDNAs (complete human DNA copy of mRNA) in many types of vectors used for gene transfer into cells possible. The mean mRNA size is included between the 28S and 18S rRNA sizes (5070 and 1869 bp, respectively, Supplementary Table S5), thus explaining the typical intensity pattern of mRNA fractioned in agarose gels with an enrichment between 28S and 18S rRNA bands. Table 5 also shows the artefactual 0 bp length of the 5 0 and 3 0 UTRs, thus in a certain number (203) of entries only the coding sequence has been registered. This reflects the well-known and still present difficulties in the determination of the full-length RNA (9), especially at the 5 0 end (37,38) and the characterization analyses in this field are still necessary. Notably, the extended length of 3 0 UTR in comparison to 5 0 UTR may have a biological explanation because it is enriched in several regulatory motifs such as adenylate uridylate (AU)-rich elements and binding sites for microRNAs, RNA-binding proteins and long noncoding RNAs (11).
The reference values for human gene elements also have a crucial key application in high throughput RNA sequencing methods (RNA-Seq), where alignment software often gives the possibility to configure custom parameter settings such as minimum and maximum intron lengths, essential for novel junction discovery (as for example in the TopHat mapper tool available in Galaxy platform, https://usega laxy.org/). Other parameters are, e.g. mean mRNA length, mean exon length (including last exons) and average number of exons per transcript (14). Parameters of proteincoding genes that were recently estimated differently considering different databases (13), in absence of a systematic revision, include the average genomic size of genes [27 kbp in (13) (39), a portion greater than the 1/2 that we have found considering only characterized sequences, pointing out again that a great effort is still needed in the annotation process for all reasons highlighted in this discussion.
In conclusion, here we present an original tool which allows parsing, structuring and dynamic summarizing data from NCBI Gene data bank. Furthermore it makes the analysis of the main gene and transcript structure parameters possible also following the search for a set of genes with the desired characteristics. In addition, we show its usefulness for a systematic revision of the main reference parameters updated to January 2016 for a description of the human nuclear gene structure. Beyond the other possible applications described here, it might be useful to design experiments for poorly characterized annotated genome regions, as in our current annotation effort of the recently defined highly restricted Down Syndrome critical region (HR-DSCR) for example, which to date does not contain known genes (40) and is fundamental to understanding genotype-phenotype relationships of Down syndrome and for identifying new therapeutic approaches (41). This analysis finally paves the way for similar studies in other organisms, possibly providing new insight on gene and genome evolution.

Availability
GeneBase 1.1 (both pre-loaded with human nuclear gene data and empty versions), the user guide and the relative Python scripts for the initial data pre-processing and sequence calculations are publicly available at http:// apollo11.isto.unibo.it/software/.