sefOri: selecting the best-engineered sequence features to predict DNA replication origins

Lou, Chenwei; Zhao, Jian; Shi, Ruoyao; Wang, Qian; Zhou, Wenyang; Wang, Yubo; Wang, Guoqing; Huang, Lan; Feng, Xin; Zhou, Fengfeng

doi:10.1093/bioinformatics/btz506

Abstract

Motivation

Cell divisions start from replicating the double-stranded DNA, and the DNA replication process needs to be precisely regulated both spatially and temporally. The DNA is replicated starting from the DNA replication origins. A few successful prediction models were generated based on the assumption that the DNA replication origin regions have sequence level features like physicochemical properties significantly different from the other DNA regions.

Results

This study proposed a feature selection procedure to further refine the classification model of the DNA replication origins. The experimental data demonstrated that as large as 26% improvement in the prediction accuracy may be achieved on the yeast Saccharomyces cerevisiae. Moreover, the prediction accuracies of the DNA replication origins were improved for all the four yeast genomes investigated in this study.

Availability and implementation

The software sefOri version 1.0 was available at http://www.healthinformaticslab.org/supp/resources.php. An online server was also provided for the convenience of the users, and its web link may be found in the above-mentioned web page.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

A cell needs to replicate its double-stranded DNA during its division, and then its genetic information is accurately passed to the next generation (Watson and Crick, 1953). The replication origin is where the genome replication process starts and requires precise regulations (Mott and Berger, 2007). Bacterial genome usually has one replication origin (Chodavarapu and Kaguni, 2016; Skarstad and Katayama, 2013), while eukaryotic genome has multiple replication origins to parallelly replicate the large genomes (Bogenschutz et al., 2014; Peng et al., 2015; Warner et al., 2017) and so does archaea (Ausiannikava and Allers, 2017). Many genetic elements have inherent sequence-level patterns that can be used to optimize a sequence-based element prediction system (Liu et al., 2017).

The replication origins may be precisely detected and bound by the origin recognition complex (ORC) (Lee and Bell, 1997). The replication origins of the well-studied budding yeast Saccharomyces cerevisiae have two characteristic sequence patterns. First, S.cerevisiae has highly specific sequence patterns in the replication origins (Liachko et al., 2010). ORC recognizes the T-rich 17 bp motif autonomously replicating sequence (ARS) and its consensus sequence (ACS), which may interact with the three neighboring B1/B2/B3 elements (Biswas et al., 2005; Liachko et al., 2010). The B1 element has an A/TTA/T motif within the 228 ARSs conserved across the six Saccharomyces species (Chang et al., 2011; Leonard and Mechali, 2013). The spacing between the B1 element and the ACS is essential for the ORC binding (Marahrens and Stillman, 1994). The B2 element is the binding residue for Cdc6p, the MCM-2-7 helicase (Wilmes and Bell, 2002; Zou and Stillman, 2000), while the B3 element bound to the ARS-binding factor 1 (Abf1) for the precise site specificity of replication (Marahrens and Stillman, 1992).

The second characteristics of S.cerevisiae replication origins is nucleosome exclusion (Nieduszynski et al., 2006). The nucleosome mapping result demonstrates that ∼180bp region overlapping the ARS1 replication origin is free from nucleosome binding. Nucleosome occupant regions have significant sequence motifs and tend to avoid the 200 bp regions upstream of the gene start codon and other regulatory regions (Yuan et al., 2005).

Various computational technologies have been utilized to detect the genomic replication origins (Peng et al., 2015). Nucleotides guanine (G) and cytosine (C) were observed to be differentially represented in the replication origin regions (Lobry, 1996). So the GC distributions were used to develop several tools like cumulative skew diagram (Grigoriev, 1998), Z-curve (Ou et al., 2003; Zhang and Zhang, 2002), base composition asymmetry (Gao and Zhang, 2008; Luo et al., 2014), etc. Nevertheless, the lack of negative samples may introduce false positives to these tools’ results.

Recent studies trained the replication origin predictors using both positive and negative samples, so that the number of both false positives and false negatives may be reduced. These samples were evaluated for their structural characteristics (Chen et al., 2012) and sequence statistical compositions (Robinson et al., 2014). Li et al. generated the pseudo-k-tuple nucleotide compositions (PseKNC) from the sample sequences and achieved 83.72% in the overall success rate of predicting replication origins in the S.cerevisiae genome (Li et al., 2015). The integration of dinucleotide physiochemical features (Zhang et al., 2016) and position-specific features (Xiao et al., 2016) with the original version of general pseudo amino acid composition (Chou, 2011) were demonstrated for their improved prediction accuracies of replication origin sites. A recent study considered the varied lengths of replication origins and GC asymmetry bias and proposed a replication origin prediction algorithm for four yeast genomes with the state-of-the-art prediction accuracies (Liu et al., 2018a).

This study hypothesized that the replication origin prediction models mentioned above may be improved by finding the best subset of the sequence features. So we carried out the supporting experiments for our hypothesis by following the five-step procedure, i.e. (i) well-defined benchmark dataset, (ii) formulating the sample sequence into features, (iii) selecting the features and optimizing the predicting classifier, (iv) cross-validating the model and (v) a user-friendly predicting tool. The experimental data demonstrated that the replication of original prediction models was reasonably improved.

2 Materials and methods

2.1 Benchmark dataset

This study chose to use the benchmark datasets of four yeast species curated from the database DeOri version 6.0 (Gao et al., 2012) and the corresponding genomic sequences were retrieved from the database GenBank (Benson et al., 2018). A comparative evaluation of a replication origin prediction algorithm was carried out on these four yeast species, i.e. S.cerevisiae, Schizosaccharomyces pombe, Kluyveromyces lactis and Pichia pastoris. The genomic locations of their replication origins were collected from the database DeOri version 6.0 (Gao et al., 2012). The negative samples were randomly extracted from the same genomes with the same length distributions as the positive samples. Only the sequences longer than 49 bp were kept for further analysis. Homologous redundant sequences with cut-off threshold 80% were detected and removed using the state-of-the-art sequence clustering software CD-HIT (Fu et al., 2012; Li and Godzik, 2006). The construction procedure of the four datasets was followed exactly as in Liu et al. (2018a).

The four benchmark datasets were denoted as D(Sc), D(Sp), D(Kl) and D(Pp) for the four yeast species: S.cerevisiae, S.pombe, K.lactis and P.pastoris. Saccharomyces cerevisiae had 340 replication origin samples in the positive dataset Dp(Sc) and 342 negative controls in the negative dataset Dn(Sc). There were 338 and 335 samples in the positive Dp(Sp) and negative Dn(Sp) datasets for S.pombe, respectively. Kluyveromyces lactis had 147 pairs of positive and negative samples in Dp(Kl) and Dn(Kl), respectively. The last genome Pichia pastoris had 305 non-redundant replication origins and 302 non-replication origins in the datasets Dp(Pp) and negative Dn(Pp). The final datasets may be found in Supplementary Materials S1–S4 of Liu et al. (2018a). The four datasets were summarized in Table 1.

Table 1.

Benchmark datasets

Species	Dataset	P	N
Saccharomyces cerevisiae	D(Sc)	340	342
Schizosaccharomyces pombe	D(Sp)	338	335
Kluyveromyces latis	D(Kl)	147	147
Pichia pastoris	D(Pp)	305	302

Species	Dataset	P	N
Saccharomyces cerevisiae	D(Sc)	340	342
Schizosaccharomyces pombe	D(Sp)	338	335
Kluyveromyces latis	D(Kl)	147	147
Pichia pastoris	D(Pp)	305	302

Note: The column ‘Species’ gave the names of the four genomes. The column ‘Dataset’ was the denotation of each dataset. The number of positive and negative samples in each dataset were given in the columns ‘P’ and ‘N’, respectively.

Open in new tab

Table 1.

Benchmark datasets

Species	Dataset	P	N
Saccharomyces cerevisiae	D(Sc)	340	342
Schizosaccharomyces pombe	D(Sp)	338	335
Kluyveromyces latis	D(Kl)	147	147
Pichia pastoris	D(Pp)	305	302

Species	Dataset	P	N
Saccharomyces cerevisiae	D(Sc)	340	342
Schizosaccharomyces pombe	D(Sp)	338	335
Kluyveromyces latis	D(Kl)	147	147
Pichia pastoris	D(Pp)	305	302

Note: The column ‘Species’ gave the names of the four genomes. The column ‘Dataset’ was the denotation of each dataset. The number of positive and negative samples in each dataset were given in the columns ‘P’ and ‘N’, respectively.

Open in new tab

2.2 Experimental design

This study was organized into three main steps, i.e. feature extraction, feature selection and classification, as shown in Figure 1.

Fig. 1.

Open in new tab Download slide

Flowchart of this study. Modules with the best performances were highlighted in bold

The PseKNC was widely used to describe the sequence level characteristics of DNA or RNA elements and have been successfully utilized for predicting various genetic elements, as similar in Liu et al. (2018b).

Six feature selection algorithms and seven classification algorithms were evaluated for building the prediction models of DNA replication origins. The prediction performance was evaluated by five metrics, i.e. sensitivity (Sn), specificity (Sp), overall accuracy (Acc), balanced accuracy (bAcc) and Matthews correlation coefficient (MCC) (Feng et al., 2018; Xu et al., 2018). Details may be found in Supplementary Materials due to the page limit.

3 Results and discussion

This study used the dataset D(Sp) to tune the parameters of the prediction model. A recent study investigated the prediction problem of DNA replication origins using the four datasets D(Sc), D(Sp), D(Kl) and D(Pp) (Liu et al., 2018a, b). They achieved the best LOO accuracy 0.965 on the dataset D(Sp), while the other three datasets D(Sc), D(Kl) and D(Pp) only got the LOO Acc = 0.730, 0.851 and 0.710, respectively. So this study assumed that a set of parameter values would perform well on all these four datasets if it could improve the previous model on D(Sp) with the best accuracy.

3.1 Find the best feature selection algorithm

Six feature selection algorithms were evaluated for their feature screening capabilities on predicting DNA replication origins, as shown in Figure 2. These algorithms were the filter feature selection algorithm χ-squared (χ²) test (Jin et al., 2006), the wrapper algorithm McTwo (Ge et al., 2016), random forest-based recursive feature elimination (RF-RFE) (Granitto et al., 2006), support vector machine-based recursive feature elimination (SVM-RFE) (Duan et al., 2005), Lasso (Deshpande et al., 2019; Kumar et al., 2017) and Ttest (Gharbali et al., 2018; Ye et al., 2017). The selected feature subset was evaluated for their back-propagation neural network (BPNN) classification performances averaged over 20 random runs of the 5-fold cross-validation strategy.

Fig. 2.

Open in new tab Download slide

Classification performances of the classifier BPNN using features selected by six feature selection algorithms. The six feature selection algorithms were χ², McTwo, RF-RFE, SVM-RFE, Lasso and Ttest. The horizontal axis gave the six performance metrics, i.e. Sn, Sp, Acc, bAcc, AUC and MCC. The vertical axis was the value of these performance metrics

Figure 2 illustrated that the feature selection algorithm SVM-RFE outperformed the other five algorithms in all the overall performance metrics, i.e. Acc, bAcc, AUC and MCC. SVM-RFE achieved about 0.02 improvements in Acc compared with χ², McTwo and RF-RFE. Although McTwo achieved an improvement of 0.0092 in Sn compared with SVM-RFE, its Sp was 0.0413 smaller than SVM-RFE. SVM-RFE outperformed the second best feature selection algorithm Lasso with an improvement of 0.0012 in Acc. So SVM-RFE achieved an improvement of 0.0193 in Acc compared with McTwo. So the following sections used SVM-RFE to select a feature subset for predicting the DNA replication origins.

3.2 Find the best classification algorithm

Seven classification algorithms were evaluated for their capabilities of predicting DNA replication origins using the sequence features selected by SVM-RFE, as shown in Figure 3. The classifier SVM (Weston et al., 2001) has been widely applied in predicting biological elements (Beer, 2017; Zhang and Ma, 2019). The RF classifier determined the class assignment by summarizing the predictions of multiple random trees (Jang et al., 2018; Li et al., 2018). Multinomial naïve Bayes (MNB) calculated the a priori probabilities of a sample belonging to a class (Pan et al., 2018). The gradient boosting decision tree (GBDT) classifier optimized an additive classification model by the step-wise forward strategy (Liang et al., 2019; Wang et al., 2019). The BPNN was proposed by Rumelhart et al. in 1986 as a multilayer feedforward neural network (Rumelhart et al., 1986). Xgboost was a gradient-based classifier that has been used to predict many sequence-based bioinformatics problems (Deng et al., 2019; Qiang et al., 2018). Extreme learning machine (ELM) was a simplicity-based classifier and provides a very rapid learning speed for many datasets (Li et al., 2019; Zhang et al., 2019).

Fig. 3.

Open in new tab Download slide

Classification performances of seven classifiers using features selected by SVM-RFE. The seven classification algorithms were SVM, RF, MNB, GBDT, BPNN, ELM and Xgboost. The horizontal axis gave the six performance metrics, i.e. Sn, Sp, Acc, bAcc, AUC and MCC. The vertical axis was the value of these performance metrics

The feature subset selected by SVM-RFE was evaluated for its classification performances using the seven classification algorithms mentioned above, and all the performance metrics were averaged over 20 random runs of the 5-fold cross-validation strategy. Figure 3 illustrated that BPNN outperformed all the other six classification algorithms on the SVM-RFE selected features in all the performance metrics except for Sn. At least 1.56% improvement in Acc was achieved by BPNN compared with the other six classification algorithms.

The classification algorithm SVM was the embedded classifier in the feature selection algorithm SVM-RFE, and SVM outperformed BPNN by 0.0439 in Sn, as shown in Figure 3. But SVM did not achieve a good specificity (Sp = 0.6203), and its overall accuracy (Acc = 0.6619) was 0.0397 smaller than that of BPNN (Acc = 0.7016). So the classification algorithm BPNN performed the best and was used in the following sections.

3.3 Optimize the parameters of SVM-RFE

The 5-fold cross validation strategy was used to calculate the prediction performances of the classifier BPNN on the training dataset (denoted as training performances), and then the test dataset was used to evaluate the model trained over the training dataset (denoted as testing performances). The parameters were tuned to optimize the testing performances by the step-wise zooming strategy.

The parameter C of SVM-RFE was tuned in four steps with different scaling resolutions, as shown in Figure 4. First, the BPNN classification performance metrics were calculated for each of the 10 choices C∈[1, 10] with step size 1, as shown in Figure 4a. The best testing accuracy was achieved for C = 2, while the best training accuracy was achieved for C = 3. So the screening region was switched to [0.1, 3] and the step size was set as 0.1, as shown in Figure 4b. The best classification performances of both training and testing datasets were achieved for C = 0.3. The next refining region was set to [0.21, 0.40] with step size 0.1, as shown in Figure 4c. The top three best testing accuracies were achieved for C = 0.29, 0.30 and 0.31. Therefore the final refining step screened the region [0.281, 0.320] with step size 0.001. The following sections used C = 0.294 with the best prediction accuracies for both training and testing datasets, as shown in Figure 4d.

Fig. 4.

Open in new tab Download slide

Optimizing the parameter C of SVM-RFE. (a) C∈[1, 10] with step size 1, (b) C∈[0.1, 3] with step size 0.1, (c) C∈[0.21, 0.40] with step size 0.01 and (d) C∈[0.281, 0.320] with step size 0.001. The accuracy curves of the training and testing datasets were plotted in solid and dashed lines, respectively

3.4 Double feature screening by the linear SVM weights

A double feature screening strategy was applied to the classifier SVM to find a feature subset with the best classification performance. Each round of feature screening normalized the SVM weights as percentiles, and only the features with weights no smaller than the threshold value were kept for further analysis. So the two threshold values T1 and T2 were evaluated for their classification performances.

Figure 5a illustrated that the first round of feature screening generated fewer features when T2 ≤ 0.5. After T2 > 0.5, the change of feature number had a different trend. No features were selected when T2 > 1.5 and T1 < 0.25. T2 = 1.5 and T1 = 0.2 also generated no features. Figure 5b suggested that more features did not mean better prediction accuracies. The best prediction accuracy of 0.7252 was achieved when T2 = 0.8 and T1 = 0.3. The data also suggested that T1 = 0.3 generated good prediction accuracy for different T2 values. So the following sections used T1 = 0.3 and T2 = 0.8.

Fig. 5.

Open in new tab Download slide

Performance comparison for different values of T1 and T2. (a) Feature numbers and (b) overall accuracy of the 5-fold cross-validation on the testing dataset

3.5 Find the best pair of neural network layer number and neuron number

Figure 6 illustrated that a smaller number of layers led to a better averaged testing Acc, and the best averaged testing Acc = 0.7230 was achieved for one-layer BPNN. However, the fluctuated testing Acc was not linearly correlated with the parameter a. The averaged testing Acc suggested that the parameter neuron feature ratio a = 1.2 achieved the best performance testing Acc = 0.7265. However, the best performing duet (d, a) was actually (1, 1.6), which achieved the testing Acc = 0.7285. The following section used (d, a) = (1, 1.6) for the other experiments.

Fig. 6.

Open in new tab Download slide

Performance comparison of different number of neurons and layers. The BPNN model was evaluated for its testing Acc using different choices of the two parameters. The last column gave the averaged testing Acc over the same row (of the parameter neuron feature ratio a). The last row gave the averaged testing Acc over the same column (of the parameter layer number d)

3.6 Dropout rate of BPNN

The parameter dropout rate played an essential role in optimizing a neural network (Wang et al., 2018), and the performance of a prediction model usually did not linearly correlate with the dropout rate (Bonn et al., 2018). Our data also illustrated this pattern that the prediction model might behave just like a random guess with the testing Acc = 0.5574 when the dropout rate was 0.99, as shown in Figure 7. After the dropout rate was increased to 0.10, the testing Acc reached a plateau until 0.50. The maximum value of 0.7285 was achieved when the dropout rate was 0.40. This value was used for the rest of this study.

Fig. 7.

Open in new tab Download slide

The testing Acc of the classifier BPNN with different dropout rates. The horizontal axis gave the dropout rate, and the vertical axis was the testing Acc

3.7 Final models and the comparison with the existing study

Twenty runs of BPNN with different random seeds and the best parameter choices mentioned above were carried out on the four datasets, as shown in Figure 8. BPNN runs very slowly. The first recall function was removed and the training iterations were reduced in half. Figure 8a suggested that the final models on the four datasets performed accurately and stably. The prediction accuracies of our final models were also positively correlated with those of the existing study (Liu et al., 2018a) on these four datasets.

Fig. 8.

Open in new tab Download slide

Performance evaluation and comparison of the final models. (a) The averaged value and standard deviation of 20 random runs of the leave-one-out validation on the four datasets D(Sc), D(Sp), D(Kl) and D(Pp). (b) The performance metrics of the final models minus those of the previous study. The error bars of both sub-figures were calculated from the 20 random runs

Our final models outperformed the previous study (Liu et al., 2018) in all the six performance metrics on the four datasets, D(Sc), D(Sp), D(Kl) and D(Pp), as shown in Figure 8b. This study only achieved a minor improvement in Acc (Acc = 0.973) on the dataset D(Sp), because the previous study already achieved a very good Acc = 0.965. At least 11% improvement in Acc was achieved for the other three datasets. The largest improvement of 0.269 in Acc was achieved on S.cerevisiaea, where the prediction accuracy was improved from 0.730 to 0.999. The performance variations were no more than 0.010 over the 20 random runs for all the four datasets. The previous study did not calculate multiple random runs, so the error bars were not compared with the previous study.

3.8 Evaluation of other PseKNC features and the biological implications

A further investigation was carried out to evaluate whether the other PseKNC feature groups may improve our models, as shown in Supplementary Figure S1. There are 12 feature groups for each of the types 1 and 2 of PseKNC. This study utilized the feature groups MW and NU of type 1. The experimental data suggested that the two feature groups MW and NU achieved the best prediction accuracies.

The selected features demonstrated statistical significance associated with the DNA replication origins and AT richness, as supported by the literature.

Details may be found in Supplementary Materials due to the page limit.

4 Conclusion

The main contribution of this study was to comprehensively evaluate multiple algorithms to select a subset of DNA sequence features with the best classification accuracies of the yeast DNA replication origins. The best duet of feature selection and classification algorithms was SVM-RFE and BPNN. The prediction models of the DNA replication origins of all the four benchmark yeast genomes were improved compared with the literature. The DNA sequence features chosen in this study demonstrated the AT-rich pattern, which supported the previous observations of the DNA replication origins.

Acknowledgements

The constructive comments from the anonymous reviewers were greatly appreciated.

Authors Contributions

F.Z. and X.F. conceived the project. F.Z., X.F., C.L., J.Z. and R.S. designed the experiments and carried out the analysis. W.Z., Y.W., G.W. and L.H. contributed to the experimental design and results and discussions. Q.W. established the online prediction server. All the authors contributed to the manuscript drafting and have approved the final version of the manuscript.

Funding

This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (XDB13040400); Jilin Provincial Key Laboratory of Big Data Intelligent Computing (20180622002JC); the Education Department of Jilin Province (JJKH20180145KJ) and the startup grant of the Jilin University. This work was also partially supported by the Bioknow MedAI Institute (BMCPP-2018-001), and by the Fundamental Research Funds for the Central Universities, JLU.

Conflict of Interest: none declared.

References

Ausiannikava

D.

,

Allers

T.

(

2017

)

Diversity of DNA replication in the Archaea

.

Genes

,

8

,

56

.

Beer

M.A.

(

2017

)

Predicting enhancer activity and variant impact using gkm-SVM

.

Hum. Mutat

.,

38

,

1251

–

1258

.

Benson

D.A.

et al. (

2018

)

GenBank

.

Nucleic Acids Res

.,

46

,

D41

–

D47

.

Biswas

S.B.

et al. (

2005

)

Control of ATP-dependent binding of Saccharomyces cerevisiae origin recognition complex to autonomously replicating DNA sequences

.

Cell Cycle

,

4

,

494

–

500

.

Google Scholar

PubMed

OpenURL Placeholder Text

WorldCat

Bogenschutz

N.L.

et al. (

2014

)

Initiation of DNA replication from non-canonical sites on an origin-depleted chromosome

.

PLoS One

,

9

,

e114545.

Bonn

S.E.

et al. (

2018

)

App-technology to increase physical activity among patients with diabetes type 2-the DiaCert-study, a randomized controlled trial

.

BMC Public Health

,

18

,

119

.

Chang

F.

et al. (

2011

)

High-resolution analysis of four efficient yeast replication origins reveals new insights into the ORC and putative MCM binding elements

.

Nucleic Acids Res

.,

39

,

6523

–

6535

.

Chen

W.

et al. (

2012

)

Prediction of replication origins by calculating DNA structural properties

.

FEBS Lett

.,

586

,

934

–

938

.

Chodavarapu

S.

,

Kaguni

J.M.

(

2016

)

Replication initiation in bacteria

.

Enzymes

,

39

,

1

–

30

.

Chou

K.C.

(

2011

)

Some remarks on protein attribute prediction and pseudo amino acid composition

.

J. Theor. Biol

.,

273

,

236

–

247

.

Deng

L.

et al. (

2019

)

Prediction of binding hot spots at protein(-)RNA interfaces utilizing extreme gradient boosting

.

Genes

,

10

,

242

.

Deshpande

S.

et al. (

2019

)

PLIT: an alignment-free computational tool for identification of long non-coding RNAs in plant transcriptomic datasets

.

Comput. Biol. Med

.,

105

,

169

–

181

.

Duan

K.-B.

et al. (

2005

)

Multiple SVM-RFE for gene selection in cancer classification with expression data

.

IEEE Trans. Nanobiosci

.,

4

,

228

–

234

.

Google Scholar

Crossref

WorldCat

Feng

X.

et al. (

2018

)

Selecting multiple biomarker subsets with similarly effective binary classification performances

.

J. Vis. Exp

.,

140

,

e57738

.

Google Scholar

OpenURL Placeholder Text

WorldCat

Fu

L.

et al. (

2012

)

CD-HIT: accelerated for clustering the next-generation sequencing data

.

Bioinformatics

,

28

,

3150

–

3152

.

Gao

F.

et al. (

2012

)

DeOri: a database of eukaryotic DNA replication origins

.

Bioinformatics

,

28

,

1551

–

1552

.

Gao

F.

,

Zhang

C.T.

(

2008

)

Ori-Finder: a web-based system for finding oriCs in unannotated bacterial genomes

.

BMC Bioinformatics

,

9

,

79.

Ge

R.

et al. (

2016

)

McTwo: a two-step feature selection algorithm based on maximal information coefficient

.

BMC Bioinformatics

,

17

,

142

.

Gharbali

A.A.

et al. (

2018

)

Investigating the contribution of distance-based features to automatic sleep stage classification

.

Comput. Biol. Med

.,

96

,

8

–

23

.

Granitto

P.M.

et al. (

2006

)

Recursive feature elimination with random forest for PTR-MS analysis of agroindustrial products

.

Chemom. Itell. Lab. Syst

.,

83

,

83

–

90

.

Google Scholar

Crossref

WorldCat

Grigoriev

A.

(

1998

)

Analyzing genomes with cumulative skew diagrams

.

Nucleic Acids Res

.,

26

,

2286

–

2290

.

Jang

B.S.

et al. (

2018

)

Prediction of pseudoprogression versus progression using machine learning algorithm in glioblastoma

.

Sci. Rep

.,

8

,

12516

.

Jin

X.

et al. (

2006

) Machine learning techniques and chi-square feature selection for cancer classification using SAGE gene expression profiles. In:

International Workshop on Data Mining for Biomedical Applications, Singapore

, pp.

106

–

115

.

Springer

.

Kumar

S.

et al. (

2017

)

CSP-TSM: optimizing the performance of Riemannian tangent space mapping using common spatial pattern for MI-BCI

.

Comput. Biol. Med

.,

91

,

231

–

242

.

Lee

D.G.

,

Bell

S.P.

(

1997

)

Architecture of the yeast origin recognition complex bound to origins of DNA replication

.

Mol. Cell Biol

.,

17

,

7159

–

7168

.

Leonard

A.C.

,

Mechali

M.

(

2013

)

DNA replication origins

.

Cold Spring Harb. Perspect. Biol

.,

5

,

a010116.

Li

J.

et al. (

2018

)

RNAm5Cfinder: a web-server for predicting RNA 5-methylcytosine (m5C) sites based on random forest

.

Sci. Rep

.,

8

,

17299

.

Li

W.-C.

et al. (

2015

)

iORI-PseKNC: a predictor for identifying origin of replication with pseudo k-tuple nucleotide composition

.

Chemom. Intel. Lab. Syst

.,

141

,

100

–

106

.

Google Scholar

Crossref

WorldCat

Li

W.

,

Godzik

A.

(

2006

)

Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences

.

Bioinformatics

,

22

,

1658

–

1659

.

Li

Y.

et al. (

2019

)

ELM-MHC: an improved MHC identification method with extreme learning machine algorithm

.

J. Proteome Res

.,

18

,

1392

–

1401

.

Liachko

I.

et al. (

2010

)

A comprehensive genome-wide map of autonomously replicating sequences in a naive genome

.

PLoS Genet

.,

6

,

e1000946

.

Liang

S.

et al. (

2019

)

Classification of first-episode schizophrenia using multimodal brain features: a combined structural and diffusion imaging study

.

Schizophr. Bull

,

45

,

591

–

599

.

Liu

B.

et al. (

2017

)

Pse-analysis: a python package for DNA/RNA and protein/peptide sequence analysis based on pseudo components and kernel methods

.

Oncotarget

,

8

,

13338

–

13343

.

Liu

B.

et al. (

2018a

)

iEnhancer-EL: identifying enhancers and their strength with ensemble learning approach

.

Bioinformatics

,

34

,

3835

–

3842

.

Liu

B.

et al. (

2018b

)

iRO-3wPseKNC: identify DNA replication origins by three-window-based PseKNC

.

Bioinformatics

,

34

,

3086

–

3093

.

Lobry

J.R.

(

1996

)

Asymmetric substitution patterns in the two DNA strands of bacteria

.

Mol Biol Evol

,

13

,

660

–

665

.

Luo

H.

et al. (

2014

)

Ori-Finder 2, an integrated tool to predict replication origins in the archaeal genomes

.

Front. Microbiol

.,

5

,

482

.

Google Scholar

PubMed

OpenURL Placeholder Text

WorldCat

Marahrens

Y.

,

Stillman

B.

(

1992

)

A yeast chromosomal origin of DNA replication defined by multiple functional elements

.

Science

,

255

,

817

–

823

.

Marahrens

Y.

,

Stillman

B.

(

1994

)

Replicator dominance in a eukaryotic chromosome

.

EMBO J

.,

13

,

3395

–

3400

.

Mott

M.L.

,

Berger

J.M.

(

2007

)

DNA replication initiation: mechanisms and regulation in bacteria

.

Nat. Rev. Microbiol

.,

5

,

343

–

354

.

Nieduszynski

C.A.

et al. (

2006

)

Genome-wide identification of replication origins in yeast by comparative genomics

.

Genes Dev

.,

20

,

1874

–

1879

.

Ou

H.Y.

et al. (

2003

)

Analysis of nucleotide distribution in the genome of Streptomyces coelicolor A3(2) using the Z curve method

.

FEBS Lett

.,

540

,

188

–

194

.

Pan

Y.

et al. (

2018

)

Identification of bacteriophage virion proteins using multinomial naive Bayes with g-gap feature tree

.

Int. J. Mol. Sci

.,

19

,

pii: E1779

.

Google Scholar

Crossref

WorldCat

Peng

C.

et al. (

2015

)

Recent advances in the genome-wide study of DNA replication origins in yeast

.

Front. Microbiol

.,

6

,

117

.

Google Scholar

PubMed

OpenURL Placeholder Text

WorldCat

Qiang

X.

et al. (

2018

)

M6AMRFS: robust prediction of N6-methyladenosine sites with sequence-based features in multiple species

.

Front. Genet

.,

9

,

495

.

Robinson

D.G.

et al. (

2014

)

Design and analysis of Bar-seq experiments

.

G3

,

4

,

11

–

18

.

Rumelhart

D.E.

et al. (

1986

)

Learning representations by back-propagating errors

.

Nature

,

323

,

533.

Google Scholar

Crossref

WorldCat

Skarstad

K.

,

Katayama

T.

(

2013

)

Regulating DNA replication in bacteria

.

Cold Spring Harb. Perspect. Biol

.,

5

,

a012922.

Wang

B.

et al. (

2018

) Turn-to-turn short circuit of motor stator fault diagnosis using dropout rate improved deep sparse autoencoder. In:

2018 IEEE 3rd Advanced Information Technology, Electronic and Automation Control Conference (IAEAC) Chongqing, China

, pp.

220

–

225

.

IEEE

.

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

Wang

C.

et al. (

2019

)

Predicting future driving risk of crash-involved drivers based on a systematic machine learning framework

.

Int. J. Environ. Res. Public Health

,

16

,

334

.

Warner

M.D.

et al. (

2017

)

Replication origin-flanking roadblocks reveal origin-licensing dynamics and altered sequence dependence

.

J. Biol. Chem

.,

292

,

21417

–

21430

.

Watson

J.D.

,

Crick

F.H.

(

1953

)

Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid

.

Nature

,

171

,

737

–

738

.

Weston

J.

et al. (

2001

) Feature selection for SVMs. In: Leen,T.K. et al. (eds.) Advances in Neural Information Processing Systems, Denver, CO, USA. MIT Press, Cambridge, MA, pp.

668

–

674

.

Wilmes

G.M.

,

Bell

S.P.

(

2002

)

The B2 element of the Saccharomyces cerevisiae ARS1 origin of replication requires specific sequences to facilitate pre-RC formation

.

Proc. Natl. Acad. Sci. U.S.A

.,

99

,

101

–

106

.

Xiao

X.

et al. (

2016

)

iROS-gPseKNC: predicting replication origin sites in DNA by incorporating dinucleotide position-specific propensity into general pseudo nucleotide composition

.

Oncotarget

,

7

,

34180

–

34189

.

Xu

C.

et al. (

2018

)

An OMIC biomarker detection algorithm TriVote and its application in methylomic biomarker detection

.

Epigenomics

,

10

,

335

–

347

.

Ye

Y.

et al. (

2017

)

RIFS: a randomly restarted incremental feature selection algorithm

.

Sci. Rep

.,

7

,

13013

.

Yuan

G.C.

et al. (

2005

)

Genome-scale identification of nucleosome positions in S. cerevisiae

.

Science

,

309

,

626

–

630

.

Zhang

C.J.

et al. (

2016

)

iOri-Human: identify human origin of replication by incorporating dinucleotide physicochemical properties into pseudo nucleotide composition

.

Oncotarget

,

7

,

69783

–

69793

.

Zhang

F.

et al. (

2019

)

Voxel-based morphometry: improving the diagnosis of Alzheimer’s disease based on an extreme learning machine method from the ADNI cohort

.

Neuroscience

. doi: 10.1016/j.neuroscience.2019.05.014.

Google Scholar

OpenURL Placeholder Text

WorldCat

Zhang

R.

,

Zhang

C.T.

(

2002

)

Single replication origin of the archaeon Methanosarcina mazei revealed by the Z curve method

.

Biochem. Biophys. Res. Commun

.,

297

,

396

–

400

.

Zhang

Y.

,

Ma

Y.

(

2019

)

Application of supervised machine learning algorithms in the classification of sagittal gait patterns of cerebral palsy children with spastic diplegia

.

Comput. Biol. Med

.,

106

,

33

–

39

.

Zou

L.

,

Stillman

B.

(

2000

)

Assembly of a complex containing Cdc45p, replication protein A, and Mcm2p at replication origins controlled by S-phase cyclin-dependent kinases and Cdc7p-Dbf4p kinase

.

Mol. Cell Biol

.,

20

,

3086

–

3096

.

Author notes

The authors wish it to be known that, in their opinion, Chenwei Lou, Jian Zhao Fengfeng Zhou should be regarded as Joint First Authors.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Associate Editor:

Download all slides

Month:	Total Views:
June 2019	25
July 2019	9
August 2019	10
September 2019	54
October 2019	49
November 2019	54
December 2019	38
January 2020	71
February 2020	10
March 2020	25
April 2020	10
May 2020	7
June 2020	13
July 2020	7
August 2020	7
September 2020	17
October 2020	15
November 2020	20
December 2020	21
January 2021	12
February 2021	15
March 2021	6
April 2021	27
May 2021	48
June 2021	47
July 2021	27
August 2021	38
September 2021	49
October 2021	25
November 2021	32
December 2021	31
January 2022	20
February 2022	29
March 2022	34
April 2022	25
May 2022	31
June 2022	20
July 2022	31
August 2022	35
September 2022	138
October 2022	222
November 2022	63
December 2022	114
January 2023	53
February 2023	20
March 2023	19
April 2023	74
May 2023	27
June 2023	15
July 2023	13
August 2023	21
September 2023	18
October 2023	30
November 2023	15
December 2023	21
January 2024	22
February 2024	21
March 2024	22
April 2024	12

Article Contents

sefOri: selecting the best-engineered sequence features to predict DNA replication origins

Abstract

1 Introduction

2 Materials and methods

2.1 Benchmark dataset

2.2 Experimental design

3 Results and discussion

3.1 Find the best feature selection algorithm

3.2 Find the best classification algorithm

3.3 Optimize the parameters of SVM-RFE

3.4 Double feature screening by the linear SVM weights

3.5 Find the best pair of neural network layer number and neuron number

3.6 Dropout rate of BPNN

3.7 Final models and the comparison with the existing study

3.8 Evaluation of other PseKNC features and the biological implications

4 Conclusion

Acknowledgements

Authors Contributions

Funding

References

Author notes

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Looking for your next opportunity?

Article Contents

sefOri: selecting the best-engineered sequence features to predict DNA replication origins

Abstract

1 Introduction

2 Materials and methods

2.1 Benchmark dataset

2.2 Experimental design

3 Results and discussion

3.1 Find the best feature selection algorithm

3.2 Find the best classification algorithm

3.3 Optimize the parameters of SVM-RFE

3.4 Double feature screening by the linear SVM weights

3.5 Find the best pair of neural network layer number and neuron number

3.6 Dropout rate of BPNN

3.7 Final models and the comparison with the existing study

3.8 Evaluation of other PseKNC features and the biological implications

4 Conclusion

Acknowledgements

Authors Contributions

Funding

References

Author notes

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Looking for your next opportunity?

This Feature Is Available To Subscribers Only