Artificial intelligence-aided protein engineering: from topological data analysis to deep protein language models

Abstract

Protein engineering is an emerging field in biotechnology that has the potential to revolutionize various areas, such as antibody design, drug discovery, food security, ecology, and more. However, the mutational space involved is too vast to be handled through experimental means alone. Leveraging accumulative protein databases, machine learning (ML) models, particularly those based on natural language processing (NLP), have considerably expedited protein engineering. Moreover, advances in topological data analysis (TDA) and artificial intelligence-based protein structure prediction, such as AlphaFold2, have made more powerful structure-based ML-assisted protein engineering strategies possible. This review aims to offer a comprehensive, systematic, and indispensable set of methodological components, including TDA and NLP, for protein engineering and to facilitate their future development.

In Euclidean space, a 0-simplex is a point, a 1-simplex is an edge, a 2-simplex is a triangle, and a 3-simplex is a tetrahedron. For $k > 3$, the k-simplex describes an abstract simplex.
A subset of $m + 1$ of the $k + 1$ vertices of a k-simplex $\sigma^k$ forms a convex hull in a lower dimension, which is called an m-face $\sigma^m$ of the k-simplex and denoted $\sigma^m \subset \sigma^k$. A simplicial complex $K$ is a finite collection of simplexes satisfying two conditions:
1) Any face of a simplex in K is also in K.
2) The intersection of any two simplexes in K is either empty or a shared face.
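The two conditions can be checked mechanically on a small example. The following is a minimal sketch, assuming simplexes are stored abstractly as sets of vertex labels; the helper names `faces`, `closure`, and `is_simplicial_complex` are hypothetical and not taken from the source. In this abstract representation, condition 2 holds automatically once condition 1 does, because the intersection of two vertex sets is a subset of both, so only closure under faces is tested.

```python
from itertools import combinations


def faces(simplex):
    """All nonempty proper faces of a simplex given as an iterable of vertex labels."""
    verts = sorted(simplex)
    return {frozenset(c) for m in range(1, len(verts)) for c in combinations(verts, m)}


def closure(simplices):
    """Smallest collection containing the given simplexes and all of their faces."""
    K = set()
    for s in simplices:
        s = frozenset(s)
        K.add(s)
        K |= faces(s)
    return K


def is_simplicial_complex(simplices):
    """Condition 1: every face of every simplex is also in the collection."""
    K = {frozenset(s) for s in simplices}
    return all(f in K for s in K for f in faces(s))


# Hypothetical example: a filled triangle {0, 1, 2} with a dangling edge {2, 3}.
K = closure([(0, 1, 2), (2, 3)])
print(len(K))                                  # 9 simplexes: 4 vertices, 4 edges, 1 triangle
print(is_simplicial_complex(K))                # True
print(is_simplicial_complex([(0, 1), (0,)]))   # False: the vertex {1} is missing
```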
The interactions between two simplexes can be described by adjacency. For example, in graph theory, two vertices (0-simplexes) are adjacent if they share a common edge (1-simplex). Adjacency for k-simplexes with $k > 0$ includes both upper and lower adjacency. Two distinct k-simplexes, $\sigma_1$ and $\sigma_2$, in $K$ are upper adjacent, denoted $\sigma_1 \sim_U \sigma_2$, if both are faces of a (k + 1)-simplex in $K$, called a common upper simplex. Two distinct k-simplexes, $\sigma_1$ and $\sigma_2$, in $K$ are lower adjacent, denoted $\sigma_1 \sim_L \sigma_2$, if they share a common (k − 1)-simplex as their face, called a common lower simplex. For two upper adjacent simplexes the common upper simplex is unique, and for two lower adjacent simplexes the common lower simplex is unique.
The upper degree of a k-simplex, $\deg_U(\sigma^k)$, is the number of (k + 1)-simplexes in $K$ of which $\sigma^k$ is a face; the lower degree of a k-simplex, $\deg_L(\sigma^k)$, is the number of nonempty (k − 1)-simplexes in $K$ that are faces of $\sigma^k$, which is always $k + 1$. The degree of a k-simplex ($k > 0$) is defined as the sum of its upper and lower degrees:
$$\deg(\sigma^k) = \deg_U(\sigma^k) + \deg_L(\sigma^k) = \deg_U(\sigma^k) + k + 1.$$
For $k = 0$, the degree of a vertex is:
$$\deg(\sigma^0) = \deg_U(\sigma^0).$$
Every simplex except a 0-simplex has an orientation determined by the ordering of its vertices. For example, clockwise and anticlockwise orderings of three vertices determine the two orientations of a triangle. Two simplexes, $\sigma_1$ and $\sigma_2$, defined on the same vertices are similarly oriented if their vertex orderings differ by an even permutation; otherwise, they are dissimilarly oriented.
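To make the adjacency and degree definitions concrete, here is a small sketch under the same abstract vertex-set representation. All function names are hypothetical, and the complex is assumed to be closed under faces, so a shared (k − 1)-face of two simplexes is automatically in K.

```python
from itertools import combinations


def closure(simplices):
    """All nonempty faces of the given simplexes, as frozensets of vertex labels."""
    K = set()
    for s in simplices:
        verts = sorted(s)
        for m in range(1, len(verts) + 1):
            K.update(frozenset(c) for c in combinations(verts, m))
    return K


def upper_adjacent(a, b, K):
    """Distinct k-simplexes that are both faces of a common (k+1)-simplex in K."""
    a, b = frozenset(a), frozenset(b)
    return a != b and len(a) == len(b) and len(a | b) == len(a) + 1 and (a | b) in K


def lower_adjacent(a, b):
    """Distinct k-simplexes (k > 0) that share a common (k-1)-face."""
    a, b = frozenset(a), frozenset(b)
    return a != b and len(a) == len(b) and len(a) > 1 and len(a & b) == len(a) - 1


def upper_degree(s, K):
    """Number of (k+1)-simplexes in K that have s as a face."""
    s = frozenset(s)
    return sum(1 for t in K if len(t) == len(s) + 1 and s < t)


def lower_degree(s):
    """Number of nonempty (k-1)-faces: k + 1 for k > 0, and 0 for a vertex."""
    k = len(frozenset(s)) - 1
    return k + 1 if k > 0 else 0


def degree(s, K):
    return upper_degree(s, K) + lower_degree(s)


# Hypothetical example: a filled triangle {0, 1, 2} with a dangling edge {2, 3}.
K = closure([(0, 1, 2), (2, 3)])
print(upper_adjacent((0, 1), (1, 2), K))   # True: both are faces of the triangle
print(upper_adjacent((1, 2), (2, 3), K))   # False: {1, 2, 3} is not in K
print(lower_adjacent((1, 2), (2, 3)))      # True: they share the vertex {2}
print(degree((0, 1), K))                   # 3 = 1 (upper) + 2 (lower)
print(degree((2,), K))                     # 3: vertex 2 lies on three edges
```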
Algebraic topology provides tools for computing with simplicial complexes. A k-chain is a formal sum of oriented k-simplexes in $K$ with coefficients in $\mathbb{Z}$. The set of all k-chains of the simplicial complex $K$, together with addition over $\mathbb{Z}$, forms a free Abelian group $C_k(K)$, called the chain group. To link chain groups of different dimensions, the k-boundary operator, $\partial_k : C_k(K) \to C_{k-1}(K)$, maps a k-chain, written as a linear combination of k-simplexes, to the same linear combination of the boundaries of those k-simplexes. For the simple case where the k-chain is a single oriented k-simplex $\sigma^k = [v_0, v_1, \dots, v_k]$ spanned by $k + 1$ vertices as defined in Eq. (1), its boundary is the signed formal sum of all its (k − 1)-faces:
$$\partial_k \sigma^k = \sum_{i=0}^{k} (-1)^i \sigma_i^{k-1},$$
where $\sigma_i^{k-1} = [v_0, \dots, \hat{v}_i, \dots, v_k]$ is the (k − 1)-simplex with its vertex $v_i$ removed. The most important topological property is that a boundary has no boundary:
$$\partial_{k-1} \partial_k = 0.$$
A sequence of chain groups connected by boundary operators defines the chain complex:
$$\cdots \xrightarrow{\partial_{n+1}} C_n(K) \xrightarrow{\partial_n} C_{n-1}(K) \xrightarrow{\partial_{n-1}} \cdots \xrightarrow{\partial_1} C_0(K) \xrightarrow{\partial_0} 0.$$
When $n$ exceeds the dimension of $K$, $C_n(K)$ is the trivial (zero) group and the corresponding boundary operator is a zero map.
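The alternating-sign boundary formula and the property that a boundary has no boundary can be verified directly on a tetrahedron. This is an illustrative sketch only: oriented simplexes are assumed to be tuples of vertex labels, chains are dictionaries mapping simplexes to integer coefficients, and the names `boundary` and `boundary_of_chain` are hypothetical.

```python
def boundary(simplex):
    """Boundary of an oriented simplex (v0, ..., vk) as a chain {face: coefficient}."""
    chain = {}
    if len(simplex) == 1:
        return chain  # the boundary of a 0-simplex is zero
    for i in range(len(simplex)):
        face = simplex[:i] + simplex[i + 1:]   # drop vertex v_i
        chain[face] = chain.get(face, 0) + (-1) ** i
    return chain


def boundary_of_chain(chain):
    """Extend the boundary linearly to a chain, dropping zero coefficients."""
    out = {}
    for simplex, coeff in chain.items():
        for face, sign in boundary(simplex).items():
            out[face] = out.get(face, 0) + coeff * sign
    return {f: c for f, c in out.items() if c != 0}


tetrahedron = (0, 1, 2, 3)
d3 = boundary(tetrahedron)
print(d3)                      # {(1, 2, 3): 1, (0, 2, 3): -1, (0, 1, 3): 1, (0, 1, 2): -1}
print(boundary_of_chain(d3))   # {}  (a boundary has no boundary)
```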

Filtration for multiscale chain complexes
Filtration is a process that constructs a nested sequence of simplicial complexes, allowing a multiscale analysis of the point cloud. It creates a family of simplicial complexes ordered by inclusion (??c):
$$\emptyset = K_0 \subseteq K_1 \subseteq \cdots \subseteq K_m = K,$$
where $K$ is the largest simplicial complex that can be obtained from the point cloud.
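The filtration induces a sequence of chain complexes, one for each $t = 0, 1, \dots, m$:
$$\cdots \xrightarrow{\partial_{k+1}^t} C_k^t \xrightarrow{\partial_k^t} C_{k-1}^t \xrightarrow{\partial_{k-1}^t} \cdots \xrightarrow{\partial_1^t} C_0^t \xrightarrow{\partial_0^t} 0,$$
where $C_k^t = C_k(K_t)$ is the chain group for the subcomplex $K_t$, and its k-boundary operator is $\partial_k^t : C_k(K_t) \to C_{k-1}(K_t)$. Associated with the k-boundary operator, its adjoint operator is the k-adjoint boundary operator (the co-boundary operator), $(\partial_k^t)^* : C_{k-1}(K_t) \to C_k(K_t)$. Various simplicial complexes can be used to construct the filtration, such as the Rips complex, the Čech complex, and the Alpha complex. For example, the Rips complex of $K$ with radius $t$ consists of all simplexes with diameter at most $2t$:
$$K_t = \{\sigma \subseteq K \mid \mathrm{diam}(\sigma) \le 2t\}.$$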
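As an illustration of the Rips construction, the sketch below builds the complex for a few radii on four points at the corners of a unit square and shows that the resulting complexes are nested. It is a brute-force enumeration meant for tiny examples only; the function names, the `max_dim` cutoff, and the point set are assumptions made for this example.

```python
from itertools import combinations
from math import dist


def rips_complex(points, t, max_dim=2):
    """Simplexes (as index tuples) whose pairwise distances are all at most 2t."""
    n = len(points)
    simplices = [(i,) for i in range(n)]
    for k in range(1, max_dim + 1):
        for s in combinations(range(n), k + 1):
            if all(dist(points[i], points[j]) <= 2 * t for i, j in combinations(s, 2)):
                simplices.append(s)
    return simplices


def f_vector(K, max_dim=2):
    """Number of k-simplexes for k = 0, ..., max_dim."""
    return [sum(1 for s in K if len(s) == k + 1) for k in range(max_dim + 1)]


points = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]  # corners of a unit square
for t in (0.0, 0.5, 0.75):
    print(t, f_vector(rips_complex(points, t)))
# 0.0 [4, 0, 0]   isolated points
# 0.5 [4, 4, 0]   the four sides appear (length 1 <= 2t)
# 0.75 [4, 6, 4]  diagonals and all triangles appear (sqrt(2) <= 1.5)
```

Increasing `t` only adds simplexes, so the complexes at successive radii form a filtration in the sense described above.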

Homology group and persistent homology
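With the chain complex defined in Eq. (5), the k-cycle and k-boundary groups are defined as:
$$Z_k = \ker \partial_k = \{c \in C_k \mid \partial_k c = 0\}, \qquad B_k = \operatorname{im} \partial_{k+1} = \{\partial_{k+1} c \mid c \in C_{k+1}\}.$$
Then the k-th homology group $H_k$ is defined as
$$H_k = Z_k / B_k.$$
The k-th Betti number, $\beta_k$, is defined as the rank of the k-th homology group $H_k$, which counts k-dimensional holes. For example, $\beta_0 = \mathrm{rank}(H_0)$ reflects the number of connected components, $\beta_1 = \mathrm{rank}(H_1)$ reflects the number of loops, and $\beta_2 = \mathrm{rank}(H_2)$ reveals the number of voids or cavities.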
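Persistent homology is devised to track the multiscale topological information along the filtration [1]. The inclusion map $K_i \subseteq K_j$ induces a homomorphism $f_k^{i,j}$ between homology groups,
$$f_k^{i,j} : H_k(K_i) \to H_k(K_j), \qquad 0 \le i \le j \le m.$$
The p-persistent k-th homology group of $K_t$ is the image of $f_k^{t,t+p}$,
$$H_k^{t,p} = \operatorname{im} f_k^{t,t+p} = Z_k^t / \left(B_k^{t+p} \cap Z_k^t\right),$$
where $Z_k^t = \ker \partial_k^t$ and $B_k^{t+p} = \operatorname{im} \partial_{k+1}^{t+p}$. Intuitively, this homology group records the k-dimensional homology classes of $K_t$ that are persistent at least until $K_{t+p}$. The birth and death of homology classes can be represented by a barcode, a set of intervals (??d).

Combinatorial Laplacian
Associated with the boundary operator $\partial_k$, the adjoint boundary operator is
$$\partial_k^* : C_{k-1}(K) \to C_k(K),$$
whose matrix representation is the transpose $\mathcal{B}_k^T$ of the matrix $\mathcal{B}_k$ representing $\partial_k$, with respect to the same ordered bases as the boundary operator.
The k-combinatorial Laplacian, a topological Laplacian, is a linear operator $\Delta_k : C_k(K) \to C_k(K)$,
$$\Delta_k := \partial_{k+1} \partial_{k+1}^* + \partial_k^* \partial_k,$$
and its matrix representation, $L_k$, is given by
$$L_k = \mathcal{B}_{k+1} \mathcal{B}_{k+1}^T + \mathcal{B}_k^T \mathcal{B}_k.$$
In particular, the 0-combinatorial Laplacian (i.e., the graph Laplacian) is given as follows, since $\partial_0$ is a zero map:
$$L_0 = \mathcal{B}_1 \mathcal{B}_1^T.$$
For $k > 0$, the elements of the k-combinatorial Laplacian matrices are
$$(L_k)_{ij} = \begin{cases} \deg(\sigma_i^k), & i = j, \\ 1, & i \ne j,\ \sigma_i^k \nsim_U \sigma_j^k,\ \sigma_i^k \sim_L \sigma_j^k \text{ with similar orientation}, \\ -1, & i \ne j,\ \sigma_i^k \nsim_U \sigma_j^k,\ \sigma_i^k \sim_L \sigma_j^k \text{ with dissimilar orientation}, \\ 0, & i \ne j, \text{ and either } \sigma_i^k \sim_U \sigma_j^k \text{ or } \sigma_i^k \nsim_L \sigma_j^k. \end{cases}$$
For $k = 0$, the graph Laplacian matrix $L_0$ is
$$(L_0)_{ij} = \begin{cases} \deg(\sigma_i^0), & i = j, \\ -1, & \sigma_i^0 \sim_U \sigma_j^0, \\ 0, & \text{otherwise}. \end{cases}$$
The multiplicity of the zero eigenvalue of $L_k$ gives the k-th Betti number, according to the combinatorial Hodge theorem [2]:
$$\beta_k = \dim(L_k) - \mathrm{rank}(L_k) = \mathrm{null}(L_k).$$
The Betti numbers describe topological invariants. Specifically, $\beta_0$, $\beta_1$, and $\beta_2$ may be regarded as the numbers of independent components, rings, and cavities, respectively.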

Persistent spectral graph (PSG)
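The homotopic shape changes produced by a small increment of the filtration parameter may be subject to noise in the data. Persistence may be introduced to enhance robustness when calculating the Laplacian. First, we define the p-persistent chain group $C_k^{t,p} \subseteq C_k^{t+p}$, whose boundary is in $C_{k-1}^t$:
$$C_k^{t,p} := \{\beta \in C_k^{t+p} \mid \partial_k^{t+p}(\beta) \in C_{k-1}^t\}.$$
Then PSG defines a family of p-persistent k-combinatorial Laplacian operators $\Delta_k^{t,p} : C_k^t \to C_k^t$ [3,4], which is defined as
$$\Delta_k^{t,p} = \eth_{k+1}^{t,p} \left(\eth_{k+1}^{t,p}\right)^* + \left(\partial_k^t\right)^* \partial_k^t,$$
where $\eth_{k+1}^{t,p}$ denotes the p-persistent boundary operator, i.e., the restriction of $\partial_{k+1}^{t+p}$ to $C_{k+1}^{t,p}$. We denote by $\mathcal{B}_{k+1}^{t,p}$ and $\mathcal{B}_k^t$ the matrix representations of $\eth_{k+1}^{t,p}$ and $\partial_k^t$, respectively, so that the matrix representation of $\Delta_k^{t,p}$ is
$$L_k^{t,p} = \mathcal{B}_{k+1}^{t,p} \left(\mathcal{B}_{k+1}^{t,p}\right)^T + \left(\mathcal{B}_k^t\right)^T \mathcal{B}_k^t,$$
where $N$ is the dimension of a standard basis for $C_k^t$, and $L_k^{t,p}$ has dimension $N \times N$. The p-persistent k-th Betti number $\beta_k^{t,p}$ can be obtained from the multiplicity of the harmonic (zero) spectra of $L_k^{t,p}$:
$$\beta_k^{t,p} = \dim(L_k^{t,p}) - \mathrm{rank}(L_k^{t,p}) = \mathrm{null}(L_k^{t,p}) = \#\{i \mid (\lambda_i)_k^{t,p} \in S_k^{t,p}, \text{ and } (\lambda_i)_k^{t,p} = 0\},$$
where $S_k^{t,p}$ denotes the set of spectra (eigenvalues) of $L_k^{t,p}$.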
In addition, the rest of the spectra, i.e., the non-harmonic part, capture additional geometric information.The family of spectra of the persistent Laplacians reveals the homotopic shape evolution [5].
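For $k = 0$, the persistent Betti number has a simple combinatorial reading that can serve as a sanity check on the harmonic spectra: $\beta_0^{t,p}$ is the number of connected components of $K_{t+p}$ that contain at least one vertex of $K_t$. The sketch below computes this with a union-find structure rather than with a persistent Laplacian, so it is a cross-check and not the spectral method itself; the function name and the toy complexes are assumptions made for this example.

```python
def persistent_betti_0(K_t, K_tp):
    """Number of components of K_tp that contain a vertex of K_t.

    Both complexes are lists of vertex tuples, with K_t a subcomplex of K_tp.
    Only vertices (length 1) and edges (length 2) matter for dimension 0.
    """
    parent = {s[0]: s[0] for s in K_tp if len(s) == 1}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    for s in K_tp:
        if len(s) == 2:                     # merge the endpoints of every edge
            a, b = find(s[0]), find(s[1])
            if a != b:
                parent[a] = b

    return len({find(s[0]) for s in K_t if len(s) == 1})


# Hypothetical two-step filtration: K_t has components {0, 1} and {2};
# in K_tp the edge (1, 2) merges them and a new isolated vertex 3 is born.
K_t = [(0,), (1,), (2,), (0, 1)]
K_tp = [(0,), (1,), (2,), (3,), (0, 1), (1, 2)]
print(persistent_betti_0(K_t, K_t))     # 2: ordinary Betti-0 of K_t (p = 0)
print(persistent_betti_0(K_tp, K_tp))   # 2: ordinary Betti-0 of K_tp
print(persistent_betti_0(K_t, K_tp))    # 1: only one class of K_t survives to K_tp
```

The example shows why the persistent count differs from both ordinary Betti numbers: the component born at vertex 3 does not come from $K_t$, and the two components of $K_t$ have merged by $K_{t+p}$.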

For the k-boundary operator $\partial_k : C_k \to C_{k-1}$ in $K$, let $\mathcal{B}_k \in \mathbb{Z}^{M \times N}$ be the matrix representation of this operator relative to the standard bases $\{\sigma_i^k\}_{i=1}^{N}$ and $\{\sigma_j^{k-1}\}_{j=1}^{M}$ of $C_k$ and $C_{k-1}$.
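For the persistent spectral graph, $\partial_k^{t+p} : C_k^{t+p} \to C_{k-1}^{t+p}$ is the k-boundary operator for the chain group $C_k^{t+p}$. Then we can define a p-persistent boundary operator, $\eth_k^{t,p}$, as the restriction of $\partial_k^{t+p}$ on the p-persistent chain group $C_k^{t,p}$:
$$\eth_k^{t,p} = \partial_k^{t+p}\big|_{C_k^{t,p}} : C_k^{t,p} \to C_{k-1}^t.$$
Since the Laplacian matrix, $L_k^{t,p}$, is positive semidefinite, its spectra are all real and non-negative:
$$S_k^{t,p} = \mathrm{Spectra}(L_k^{t,p}) = \{(\lambda_1)_k^{t,p}, (\lambda_2)_k^{t,p}, \cdots, (\lambda_N)_k^{t,p}\}.$$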