K2 and K2*: efficient alignment-free sequence similarity measurement based on Kendall statistics

Adjeroh D. et al. (2008) The Burrows-Wheeler Transform: Data Compression, Suffix Arrays, and Pattern Matching, 1st edn. Springer Publishing Company, Berlin, Germany.

Bansal M.S. et al. (2010) Robinson-Foulds supertrees. Algorithms Mol. Biol., 5, 1–12.

Bao J. et al. (2014) An improved alignment-free model for DNA sequence similarity metric. BMC Bioinformatics, 15, 1–15.

Bao J.P., Yuan R.Y. (2015) A wavelet-based feature vector model for DNA clustering. Genet. Mol. Res. GMR, 14, 19163.

Bauer M. et al. (2008) The average mutual information profile as a genomic signature. BMC Bioinformatics, 9, 48.

Beal R. et al. (2016a) Compressing genome resequencing data via the maximal longest factor. In: IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2016, Shenzhen, China, December 15–18, 2016, pp. 92–97.

Beal R. et al. (2016b) A new algorithm for the LCS problem with application in compressing genome resequencing data. BMC Genomics, 17, 544.

Blaisdell B. (1986) A measure of the similarity of sets of sequences not requiring sequence alignment. Proc. Natl. Acad. Sci. USA, 83, 5155–5159.

Bonham-Carter O. et al. (2014) Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis. Brief. Bioinf., 15, 890–905.

Cao Y. et al. (1998) Conflict among individual mitochondrial proteins in resolving the phylogeny of eutherian orders. J. Mol. Evol., 47, 307–322.

Christensen D. (2005) Fast algorithms for the calculation of Kendall’s tau. Comput. Stat., 20, 51–62.

Dai Q. et al. (2011) Numerical characteristics of word frequencies and their application to dissimilarity measure for sequence comparison. J. Theor. Biol., 276, 174–180.

Deorowicz S., Grabowski S. (2013) Data compression for sequencing data. Algorithms Mol. Biol., 8, 25.

Fischer C. et al. (2013) Complete mitochondrial DNA sequences of the threadfin cichlid (Petrochromis trewavasae) and the blunthead cichlid (Tropheus moorii) and patterns of mitochondrial genome evolution in cichlid fishes. PLoS One, 8, e67048.

Giancarlo R. et al. (2012) Textual data compression in computational biology: Algorithmic techniques. Comput. Sci. Rev., 6, 1–25.

Gusfield D. (1997) Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge, England.

Kantorovitz M. et al. (2007) A statistical method for alignment-free comparison of regulatory sequences. Bioinformatics, 23, i249–i255.

Karlin S. et al. (1983) New approaches for computer analysis of nucleic acid sequences. Proc. Natl. Acad. Sci. USA, 80, 5660–5664.

Kendall M.G. (1938) A new measure of rank correlation. Biometrika, 30, 81–93.

Kuo C.-E. et al. (2015) Resequencing a set of strings based on a target string. Algorithmica, 72, 430–449.

Li C., Wang J. (2005) Relative entropy of DNA and its application. Phys. A Stat. Mech. Appl., 347, 465–471.

Lin J. et al. (2016) K2: Efficient alignment-free sequence similarity measurement using the Kendall statistic. In: IEEE International Conference on Bioinformatics and Biomedicine, pp. 1128–1132.

Lin J. et al. (2017) fastwkendall: an efficient algorithm for weighted Kendall correlation. Accepted by Comput. Stat.

Liu L. et al. (2006) Clustering DNA sequences by feature vectors. Mol. Phylogenet. Evol., 41, 64.

Léonard M. et al. (2012) On the number of elements to reorder when updating a suffix array. J. Discret. Algorithms, 11, 87–99.

Lu B. et al. (2017) A program to compute the soft Robinson–Foulds distance between phylogenetic networks. BMC Genomics, 18, 111.

Manber U., Myers G. (1993) Suffix arrays: a new method for on-line string searches. SIAM J. Comput., 22, 935–938.

Marden J.I. et al. (1992) Rank correlation methods (5th ed.). J. Am. Stat. Assoc., 87, 249.

Otu H.H., Sayood K. (2003) A new sequence distance measure for phylogenetic tree construction. Bioinformatics, 19, 2122–2130.

Qi J. et al. (2004) Whole proteome prokaryote phylogeny without sequence alignment: a K-string composition approach. J. Mol. Evol., 58, 1–11.

Reinert G. et al. (2009) Alignment-free sequence comparison (I): statistics and power. J. Comput. Biol., 16, 1615–1634.

Reyes A. et al. (2000) Where do rodents fit? Evidence from the complete mitochondrial genome of Sciurus vulgaris. Mol. Biol. Evol., 17, 979–983.

Robinson D.F., Foulds L.R. (1981) Comparison of phylogenetic trees. Math. Biosci., 53, 131–147.

Shepp L. (1964) Normal functions of normal random variables. SIAM Rev., 6, 459–460.

Shi L., Huang H. (2012) DNA sequences analysis based on classifications of nucleotide bases. In: Luo J. (ed.) Affective Computing and Intelligent Interaction. Springer, Berlin, Heidelberg, pp. 379–384.

Smith T.F., Waterman M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol., 147, 195–197.

Song K. et al. (2014) New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing. Brief. Bioinf., 15, 343–353.

Vinga S. (2014) Information theory applications for biological sequence analysis. Brief. Bioinf., 15, 376–389.

Vinga S., Almeida J. (2003) Alignment-free sequence comparison: a review. Bioinformatics, 19, 513–523.

Wan L. et al. (2010) Alignment-free sequence comparison (II): theoretical power of comparison statistics. J. Comput. Biol., 17, 1467–1490.

Wandelt S., Leser U. (2013) FRESCO: referential compression of highly similar sequences. IEEE/ACM Trans. Comput. Biol. Bioinf., 10, 1275–1288.