LZW-Kernel: fast kernel utilizing variable length code blocks from LZW compressors for protein sequence classification

Berg

C.

et al. (

1984

)

Harmonic Analysis on Semigroups

.

Springer

,

Berlin

.

Choi

L.J.

et al. (

2008

) Adapting normalized google similarity in protein sequence comparison. In: IEEE International Symposium on Information Technology, 2008. ITSim 2008, Vol. 1, pp. 1–5.

Cilibrasi

R.

,

Vitanyi

P.M.B.

(

2005

)

Clustering by compression

.

IEEE Trans. Information Theory

,

51

,

1523

–

1545

.

Cilibrasi

R.

et al. (

2004

)

Algorithmic clustering of music based on string compression

.

Comput. Music J

.,

28

,

49

–

67

.

Cover

T.M.

,

Thomas

J.A.

(

2006

)

Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing)

.

Wiley-Interscience, New York, NY, USA

.

, https://www.springer.com/la/book/9781617795848?

Cristianini

N.

,

Shawe-Taylor

J.

(

2000

)

An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods

.

Cambridge University Press, New York, NY, USA

.

Cuturi

M.

,

Vert

J.-P.

(

2005

)

The context-tree kernel for strings

.

Neural Netw

.,

18

,

1111

–

1123

.

Dombi

J.

,

Kertész-Farkas

A.

(

2009

)

Applying fuzzy technologies to equivalence learning in protein classification

.

J. Comput. Biol

.,

16

,

611

–

623

.

Ferragina

P.

et al. (

2007

)

Compression-based classification of biological sequences and structures via the universal similarity metric: experimental assessment

.

BMC Bioinformatics

,

8

,

252.

Forslund

K.

,

Sonnhammer

E.L

year. (

2012

) Evolution of protein domain architectures. In:

Evolutionary Genomics

.

Springer

, pp.

187

–

216

Fox

N.K.

et al. (

2013

)

Scope: structural classification of proteins–extended, integrating scop and astral data and classification of new structures

.

Nucleic Acids Res

.,

42

,

D304

–

D309

.

Haussler

D.

(

1999

) Convolution kernels on discrete structures. Technical report UCS-CRL-99-10. University of California at Santa Cruz, Santa Cruz, CA, USA.

Henikoff

S.

et al. (

1999

)

Blocks+: a non-redundant database of protein alignment blocks derived from multiple compilations

.

Bioinformatics

,

15

,

471

–

479

.

Jaakkola

T.

et al. (

1999

)

Using the fisher kernel method to detect remote protein homologies

.

Intell. Sys. Mol. Biol

.,

149

–

158

, https://dl.acm.org/citation.cfm?id=660801.

Kertész-Farkas

A.

et al. (

2008a

)

Benchmarking protein classification algorithms via supervised cross-validation

.

J. Biochem. Biophys. Methods

,

70

,

1215

–

1223

.

Kertész-Farkas

A.

et al. (

2008b

) The application of the data compression-based distances to biological sequences. In:

Emmert-Streib

F.

,

Dehmer

M.

(eds.)

Information Theory and Statistical Learning, Lecture Notes in Computer Science

,

Springer, Boston, MA

.

Kertész-Farkas

A.

et al. (

2007

) Equivalence learning in protein classification. In:

Perner

P.

(ed.)

MLDM, Lecture Notes in Computer Science

, Vol. 4571.

Springer, Berlin, Heidelberg

, pp.

824

–

837

.

Kocsor

A.

et al. (

2006

)

Application of compression-based distance measures to protein sequence classification: a methodological study

.

Bioinformatics

,

22

,

407

–

412

.

Kraskov

A.

et al. (

2003

) Hierarchical clustering using mutual information. CoRR, q-bio.QM/0311037.

Krasnogor

N.

,

Pelta

D.A.

(

2004

)

Measuring the similarity of protein structures by means of the universal similarity metric

.

Bioinformatics

,

20

,

1015

–

1021

.

Leslie

C.S.

et al. (

2002

) The spectrum kernel: a string kernel for svm protein classification. In: Pacific Symposium on Biocomputing, World Scientific, Singapore, pp. 566–575.

Leslie

C.S.

et al. (

2004

)

Mismatch string kernels for discriminative protein classification

.

Bioinformatics

,

20

,

467

–

476

.

Li

M.

et al. (

2001

)

An information-based sequence distance and its application to whole mitochondrial genome phylogeny

.

Bioinformatics

,

17

,

149

–

154

.

Li

M.

et al. (

2003

) The similarity metric. In: SODA ’03: Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, pp. 863–872.

Liao

L.

,

Noble

W.S.

(

2002

) Combining pairwise sequence similarity and support vector machines for remote protein homology detection. In: RECOMB ’02: Proceedings of the Sixth Annual International Conference on Computational Biology. ACM, New York, NY, USA, pp. 225–232.

Lodhi

H.

et al. (

2002

)

Text classification using string kernels

.

J. Mach. Learn. Res

.,

2

,

419

–

444

.

Moore

A.D.

et al. (

2008

)

Arrangements in the modular evolution of proteins

.

Trends Biochem. Sci

.,

33

,

444

–

451

.

Santos

C.C.

et al. (

2006

) Clustering fetal heart rate tracings by compression. In: CBMS ’06: Proceedings of the 19th IEEE Symposium on Computer-Based Medical Systems. Computer Society, Washington, DC, USA, pp. 685–690.

Shawe-Taylor

J.

,

Cristianini

N.

(

2004

)

Kernel Methods for Pattern Analysis

.

Cambridge University Press

,

New York, NY, USA

.

Smith

T.F.

,

Waterman

M.S.

(

1981

)

Identification of common molecular subsequences

.

J. Mol. Biol

.,

147

,

195

–

197

.

Sonego

P.

et al. (

2007

)

A protein classification benchmark collection for machine learning

.

Nucleic Acids Res

.,

35

,

D232

–

D236

.

Vert

J.-P.

et al. (eds.) (

2004

)

Kernel Methods in Computational Biology

.

MIT Press

,

Cambridge, MA

.

Vert

J.-P.

et al. (

2007

)

A new pairwise kernel for biological network inference with support vector machines

.

BMC Bioinformatics

,

8

,

S8.

Zielezinski

A.

et al. (

2017

)

Alignment-free sequence comparison: benefits, applications, and tools

.

Genome Biol

.,

18

,

186.

Ziv

J.

,

Lempel

A.

(

1977

)

A universal algorithm for sequential data compression

.

IEEE Trans. Information Theory

,

23

,

337

–

343

.