FlyNet: a versatile network prioritization server for the Drosophila community

Drosophila melanogaster (fruit fly) has been a popular model organism in animal genetics due to the high accessibility of reverse-genetics tools. In addition, the close relationship between the Drosophila and human genomes rationalizes the use of Drosophila as an invertebrate model for human neurobiology and disease research. A platform technology for predicting candidate genes or functions would further enhance the usefulness of this long-established model organism for gene-to-phenotype mapping. Recently, the power of network prioritization for gene-to-phenotype mapping has been demonstrated in many organisms. Here we present a network prioritization server dedicated to Drosophila that covers ∼95% of the coding genome. This server, dubbed FlyNet, has several distinctive features, including (i) prioritization for both genes and functions; (ii) two complementary network algorithms: direct neighborhood and network diffusion; (iii) spatiotemporal-specific networks as an additional prioritization strategy for traits associated with a specific developmental stage or tissue and (iv) prioritization for human disease genes. FlyNet is expected to serve as a versatile hypothesis-generation platform for genes and functions in the study of basic animal genetics, developmental biology and human disease. FlyNet is available for free at http://www.inetbio.org/flynet.

We used '0.632 bootstrapping' for all LLS calculations because it is a reliable estimator of classifier error rates.
For the gene pairs sorted by data-intrinsic scores, LLS were calculated for bins containing equal numbers of gene pairs. A regression between the data-intrinsic scores (e.g. mutual information, correlation coefficient and probability) and the log likelihood scores (LLS) based on gold-standard gene pairs was then used to interpolate the LLS of individual gene pairs. Linear fits generally overestimate LLS for gene pairs in the most significant score range, whereas sigmoidal curve fits yield more conservative LLS for the same score range (4).
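The binning step can be illustrated with a minimal Python sketch. It assumes the log likelihood score takes the form commonly used for this kind of network, LLS = ln[(P(L|D)/P(¬L|D)) / (P(L)/P(¬L))], where L denotes a gold-standard positive link; this definition is not stated in this section. The function names and the input layout are hypothetical, and the 0.632 bootstrapping and curve-fitting steps are omitted.

```python
import math

def lls(pos_in_bin, neg_in_bin, pos_total, neg_total):
    """Assumed LLS form: ln of (odds of linkage within the bin / prior odds)."""
    return math.log((pos_in_bin / neg_in_bin) / (pos_total / neg_total))

def binned_lls(scored_pairs, bin_size):
    """scored_pairs: list of (data_intrinsic_score, is_gold_positive) tuples
    for gene pairs covered by the gold standard.
    Returns one (mean_score, LLS) point per bin of `bin_size` pairs;
    these points are what a regression would later be fitted to."""
    ranked = sorted(scored_pairs, key=lambda p: p[0], reverse=True)
    pos_total = sum(1 for _, positive in ranked if positive)
    neg_total = len(ranked) - pos_total
    points = []
    # Only full bins are used; a trailing partial bin is dropped.
    for start in range(0, len(ranked) - bin_size + 1, bin_size):
        chunk = ranked[start:start + bin_size]
        pos = sum(1 for _, positive in chunk if positive)
        neg = bin_size - pos
        if pos == 0 or neg == 0:
            continue  # LLS is undefined when a bin lacks positives or negatives
        mean_score = sum(score for score, _ in chunk) / bin_size
        points.append((mean_score, lls(pos, neg, pos_total, neg_total)))
    return points
```

Each returned point pairs the average data-intrinsic score of a bin with its LLS, so a linear or sigmoidal fit through these points can assign an LLS to any individual gene pair.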

Weighted Sum (WS) method for network integration
Weighted sum (WS) (5,6) is a variant of the naïve Bayesian method that accounts for the average correlation among the integrated data. WS is calculated using the following equation:

$$WS = S_0 + \sum_{i=1}^{n} \frac{S_i}{D \cdot i}, \quad \text{for all } S \geq T,$$

where S_0 is the best LLS among all of the available LLSs for each link; D is a free parameter representing the degree of correlation among the scores; T is a threshold of LLS to be integrated; and i is the rank index from ordering in descending magnitude the n remaining LLSs for each link. The values of the free parameters, D and T, were chosen to maximize overall performance on the benchmarks. To identify the best integration scheme, we also tested the performance of naïve Bayesian integration of the LLSs, and then selected the integration conditions that maximize the area under the plot of LLS versus genome coverage of the integrated network (4).
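As an illustration, the weighted sum for a single link can be sketched in Python as follows. The sketch assumes that WS takes the form S_0 + Σ S_i / (D · i) over the LLSs that pass the threshold T, with i = 1 for the second-best score; the function name and the example values are hypothetical.

```python
def weighted_sum(lls_scores, D, T):
    """Integrate the LLSs supporting one link.
    lls_scores: LLS values from the individual data types for this link.
    D: free parameter for the degree of correlation among the scores.
    T: LLS threshold; scores below T are not integrated.
    Returns None if no score passes the threshold."""
    kept = sorted((s for s in lls_scores if s >= T), reverse=True)
    if not kept:
        return None
    best = kept[0]  # S_0, the best LLS for the link
    # Remaining scores are down-weighted by D times their rank i = 1..n.
    return best + sum(s / (D * i) for i, s in enumerate(kept[1:], start=1))

# Example: scores 3.0, 2.0 and 1.0 pass T = 1.0; 0.5 is discarded.
# WS = 3.0 + 2.0/(2*1) + 1.0/(2*2) = 4.25
print(weighted_sum([3.0, 2.0, 1.0, 0.5], D=2.0, T=1.0))
```

A large D makes the integrated score approach the single best LLS, which is the behavior one would want when the evidence types are strongly correlated.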

Protein-protein interactions - based on high-throughput experiments (HT) and literature curation (LC)
Protein-protein interaction (PPI) data were divided into two categories: (i) PPIs from high-throughput experiments such as yeast two-hybrid assays (Y2H) or affinity purification/mass spectrometry (AP/MS), and (ii) PPIs from small-scale experiments. Both categories of PPI data were obtained from iRefWeb (7) version 13.0, a consolidated database of several public protein interaction databases. Protein interactions were prioritized based on the significance calculated using Fisher's exact test.

Co-expression (CX) of genes across biological conditions
More than 2,000 microarray and RNA-seq samples are publicly available from Gene Expression Omnibus (GEO) (8). We analyzed GEO series (GSE) based on two Affymetrix chip platforms, GPL72 and GPL1322, which support the largest number of gene expression samples. GSE with fewer than 12 samples were excluded, because correlation coefficients computed from low-dimensional data tend to yield many spurious co-functional links. Overall, 53 GSE comprising 1,873 expression samples were analyzed; the full list of GSE is provided in Supplementary Table S1. The degree of co-expression was measured by Pearson's correlation coefficient.
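The co-expression scoring of a single expression series can be sketched in Python as follows. This is a minimal illustration rather than the pipeline used for FlyNet: the input layout (a dictionary mapping each gene to its expression values across the samples of one series) and the function names are assumptions.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return sxy / math.sqrt(sxx * syy)

def coexpression_scores(expr, min_samples=12):
    """expr: dict mapping gene -> list of expression values, one per sample.
    Series with fewer than `min_samples` samples are skipped entirely.
    Returns a dict mapping (gene1, gene2) -> correlation coefficient."""
    genes = sorted(expr)
    if not genes or len(expr[genes[0]]) < min_samples:
        return {}
    scores = {}
    for idx, g1 in enumerate(genes):
        for g2 in genes[idx + 1:]:
            scores[(g1, g2)] = pearson(expr[g1], expr[g2])
    return scores
```

The resulting coefficients play the role of the data-intrinsic scores that are subsequently converted to LLS.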

Comparative genomics-based computational methods - Phylogenetic profiling (PG) and Gene neighborhood (GN)
The phylogenetic profiling (PG) method is based on the observation that functionally related genes tend to be gained or lost together during evolution (i.e., co-inherited), because both may be required to operate the same biological pathway. To identify co-functionality between genes from their co-inheritance patterns, we conducted BLASTp searches for all D. melanogaster genes against a separate genome set for each of the three domains of life: 122 completely sequenced genomes for Archaea, 1,626 for Bacteria and 396 for Eukarya. We found that this divide-and-integrate approach based on domain-specific phylogenetic profiles substantially improves network coverage as well as accuracy.
Phylogenetic profile matrices were constructed from BLAST hit scores, and the similarity between the profiles of two genes was calculated as a mutual information (MI) score, as described in a previous study (9).
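The MI calculation between two profiles can be illustrated with a short Python sketch. The discretization of BLAST hit scores into equal-width bins is an assumption made here for illustration; the actual binning scheme follows the cited study (9) and is not described in this section. Function names are hypothetical.

```python
import math
from collections import Counter

def discretize(profile, n_bins, lo, hi):
    """Map continuous profile values in [lo, hi] to bin indices 0..n_bins-1
    (equal-width bins; an assumed scheme)."""
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in profile]

def mutual_information(a, b):
    """Mutual information (in bits) between two discretized profiles whose
    positions correspond to the same reference genomes."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    count_ab = Counter(zip(a, b))
    mi = 0.0
    for (x, y), c in count_ab.items():
        p_xy = c / n
        mi += p_xy * math.log2(p_xy / ((count_a[x] / n) * (count_b[y] / n)))
    return mi
```

Two genes with identical presence/absence patterns across genomes obtain the maximal MI for their profile entropy, whereas genes with statistically independent patterns obtain an MI of zero.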
The gene neighborhood (GN) method is based on the observation that genes located close to one another in bacterial genomes tend to be co-regulated, and thus functionally associated. We used 1,748 prokaryotic genomes (122 from Archaea and 1,626 from Bacteria) for the analysis. There are two different measures of genomic vicinity: (i) the physical distance between neighboring genes (10-12), and (ii) the relative distance measured by the probability of neighborhood (13). It has been reported that these two measures are complementary and that their integration improves the prediction performance of the network (14).

Text mining from research articles - Co-citation (CC)
Co-citation of genes in the same articles is a relatively simple but effective method to identify functionally associated genes (15). We inferred co-citation-based links by scanning PubMed Central full-text articles and Medline abstracts (as of January 2014) that contain the word "melanogaster". Genes co-cited in the same articles were paired to generate links, and the statistical significance of each link was measured using Fisher's exact test.
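The significance test for one gene pair can be sketched in Python as follows. The sketch assumes a one-sided (enrichment) Fisher's exact test, computed as the upper tail of the hypergeometric distribution; the sidedness of the test is not specified in this section, and the function and argument names are hypothetical.

```python
from math import comb

def cocitation_pvalue(n_both, n_a, n_b, n_total):
    """One-sided Fisher's exact test for co-citation enrichment.
    n_both:  articles citing both gene A and gene B
    n_a:     articles citing gene A
    n_b:     articles citing gene B
    n_total: all articles scanned
    Returns P(X >= n_both) for X ~ Hypergeometric(n_total, n_a, n_b)."""
    denom = comb(n_total, n_b)
    upper = min(n_a, n_b)
    tail = sum(comb(n_a, k) * comb(n_total - n_a, n_b - k)
               for k in range(n_both, upper + 1))
    return tail / denom
```

A small p-value indicates that the two genes appear together in more articles than expected from their individual citation frequencies.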

Co-occurrence of protein domains - Domain co-occurrence (DC)
Because protein domains are generally considered functional units, proteins that share domains are likely to have similar functions. Based on the domain annotations in the InterPro database (16), we generated domain profiles for proteins. With these profiles, we measured the significance of domain co-occurrence between two proteins. To account for the inverse relationship between the occurrence and the functional specificity of domains, we used a 'weighted mutual information' scheme, which gives more weight to rarer domains.

Functional information transferred by orthology - Associalogs
Due to the evolutionary conservation of biological pathways across species, a functional association between genes in a target organism can be inferred from a functional association between their orthologs in a reference organism, using the 'associalog' approach (6). We used Inparanoid (17) for the orthology mapping, which allows identification of co-orthologs. We transferred co-functional links from AraNet v2 (18), WormNet v3 (19), YeastNet v3 (20), and unpublished co-functional networks for human and zebrafish.
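The transfer of links through an orthology mapping can be sketched in Python as follows. This is an illustration of the general idea only: the data layout, the function name, and the rule of keeping the highest transferred score when several reference links map to the same fly gene pair are assumptions, and any weighting by orthology confidence is omitted.

```python
def transfer_links(ref_links, orthologs):
    """ref_links: dict mapping (ref_gene_a, ref_gene_b) -> link score in the
                  reference organism.
    orthologs:  dict mapping a reference gene -> set of target-organism
                orthologs (more than one when co-orthologs exist).
    Returns a dict mapping a sorted (target_a, target_b) pair -> score."""
    transferred = {}
    for (ref_a, ref_b), score in ref_links.items():
        for tgt_a in orthologs.get(ref_a, ()):
            for tgt_b in orthologs.get(ref_b, ()):
                if tgt_a == tgt_b:
                    continue  # no self-links
                key = tuple(sorted((tgt_a, tgt_b)))
                # Assumed rule: keep the best score seen for this pair.
                if score > transferred.get(key, float("-inf")):
                    transferred[key] = score
    return transferred
```

Because co-orthologs are allowed, one link in the reference organism may yield several links in the target organism.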

Spatiotemporal-specific network (STN)
Spatiotemporal expression data for D. melanogaster genes were obtained from the recent Drosophila transcriptome atlas of the modENCODE project (21). For all RNA-seq expression samples, we retained only genes with BPKM (bases per kilobase per million mapped bases) > 1. We classified 41 selected spatiotemporal expression samples into four developmental stages (embryo, larvae, pupae and adult) and ten tissue types (imaginal disc, CNS, salivary gland, fat body, digestive system, carcass, heads, accessory gland, ovary and testes). As a result, we generated 14 sets of genes associated with different developmental stages and tissue types. These 14 gene sets were used to filter FlyNet into 14 spatiotemporal networks. We then compared the networks for the four developmental stages to identify links specific to each developmental stage, and compared the networks for the ten tissue types to identify links specific to each tissue type. These spatiotemporal-specific links (i.e., links found in only one of the four developmental stages or in only one of the ten tissue types) constitute the 14 spatiotemporal-specific networks (STNs) summarized in Supplementary Table S2. Edge information for the 14 STNs is also available from the FlyNet server.
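The filtering and comparison steps can be illustrated with a minimal Python sketch. The function names and data layout are hypothetical; the sketch assumes that a link is retained in a stage or tissue network when both of its genes are expressed there, and that the comparison is run separately for the four stages and for the ten tissue types.

```python
def expressed_genes(bpkm, threshold=1.0):
    """bpkm: dict mapping gene -> BPKM in one stage or tissue.
    Returns the set of genes with BPKM above the threshold."""
    return {gene for gene, value in bpkm.items() if value > threshold}

def filter_network(edges, genes):
    """Keep only the links whose two genes are both in `genes`."""
    return {frozenset(e) for e in edges if e[0] in genes and e[1] in genes}

def specific_links(networks):
    """networks: dict mapping context (stage or tissue) -> set of links.
    Returns, for each context, the links found in that context only."""
    result = {}
    for context, links in networks.items():
        others = set().union(
            *(l for c, l in networks.items() if c != context))
        result[context] = links - others
    return result
```

For example, calling `specific_links` on the four stage-filtered networks would return, for each stage, the links absent from the other three stages.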

Dataset for the assessment of human disease prioritization
Fly X-chromosome genes whose human orthologs are associated with neurodevelopmental diseases were collected from a recent fly mutagenesis screen study (22). Human genes with de novo mutations in neurological disorders were collected from the following references: autism (23-26), epilepsy (27), and schizophrenia (28-31).

... and YY represents the data type (CX: inferred from co-expression of genes, CC: inferred from co-citation, DC: inferred from domain co-occurrence, GN: inferred from gene neighborhood, GT: inferred from genetic interaction, HT: inferred from high-throughput protein-protein interactions, LC: inferred from literature-curated protein-protein interactions, PG: inferred from phylogenetic profile similarity).