Application of beta and gamma carbonic anhydrase sequences as tools for identification of bacterial contamination in the whole genome sequence of inbred Wuzhishan minipig (Sus scrofa) annotated in databases

Abstract Sus scrofa, the pig, was domesticated thousands of years ago. Breeding of various indigenous breeds produced different phenotypes, such as the Chinese inbred miniature Wuzhishan pig (WZSP), which is broadly used in the life and medical sciences. The whole genome of WZSP was sequenced in 2012. Through a bioinformatics study of pig carbonic anhydrase (CA) sequences, we detected some β- and γ-class CAs among the WZSP CAs annotated in databases, although β- and γ-CAs have not previously been described in vertebrates. This finding prompted us to examine the whole genome sequence of WZSP for possible bacterial contamination. In this study, we used bioinformatics methods and web tools such as UniProt, European Bioinformatics Institute, National Center for Biotechnology Information, Ensembl Genome Browser, Ensembl Bacteria, RCSB PDB and Pseudomonas Genome Database. Our analysis showed that the pig has 12 classical α-CAs and 3 CA-related proteins. Meanwhile, it was confirmed that the CAs detected in WZSP belong to the β- and γ-CA families and originate from Pseudomonas spp. and Acinetobacter spp. The protein structure study revealed that the identified β-CA sequence from WZSP corresponds to the Pseudomonas aeruginosa structure with PDB ID: 5JJ8, and the identified γ-CA sequence from WZSP corresponds to the P. aeruginosa structure with PDB ID: 3PMO. Bioinformatics and computational methods combined with bacterial-specific markers, such as 16S rRNA and β- and γ-class CA sequences, can be used to identify bacterial contamination in mammalian DNA samples.


Introduction
Pigs (Sus scrofa) were domesticated in multiple geographic regions of Asia and Europe through artificial and natural selection about 10 000 years ago. In China, one of the main centers of domestication, this process created a number of indigenous breeds with various phenotypes, including the Plateau, Lower Yangtze River Basin, Southwest and North China types (1)(2)(3). The whole genome sequences (WGS) of pig models and minipig varieties are important in biomedical studies, such as the generation of porcine induced pluripotent stem cells for the treatment of human diseases including diabetes and cancer as well as ophthalmic, neurodegenerative and cardiovascular diseases (4,5).
Wuzhishan pig (WZSP) is a Chinese inbred minipig characterized by its small size (a weight of approximately 30 kg), homozygosity, genetic stability and good predictability in in vivo studies (6). WZSP was developed in the Institute of Animal Science of the Chinese Academy of Agricultural Sciences in 1987. Fang et al. performed the WGS of WZSP in 2012, which revealed a high level of transposons derived from transfer RNA, with 2.2 million copies (12.4% of the genome) (7). In addition, many human gene and effective drug targets have been identified in the genome of WZSP. The WGS of WZSP, completed by researchers from Beijing Genomics Institute, provided pivotal data for the use of this minipig model in biological, medical and veterinary studies.
The genome of WZSP contains porcine endogenous retroviruses (PERVs), which can be transmitted through the germ line and infect human cells, leading to severe combined immunodeficiency (8). Therefore, PERVs are considered a major potential risk in the xenotransplantation of organs from transgenic pigs such as WZSP to humans.
Carbonic anhydrases (CAs) are ubiquitous enzymes with metal cofactors such as zinc, iron, cobalt or cadmium in their active sites. They catalyze the hydration of CO₂ to HCO₃⁻ and H⁺ for pH homeostasis and play crucial roles in many biochemical pathways and physiological functions (9,10). CAs are classified into eight evolutionarily distinct families: α, β, γ, δ, ζ, η, θ and ι (11)(12)(13)(14). α-CAs are present in many prokaryotes and eukaryotes (15,16). There are 13 α-CA isozymes in mammals, of which 12 are present in humans, including CA I-IV, CA VA and VB, CA VI, CA VII, CA IX and CA XII-XIV. CA XV is found in several vertebrates but is absent from at least chimpanzee and human (17). In addition, the presence of three acatalytic CA-related proteins (CARPs), including CARP VIII, CARP X and CARP XI, has been reported, and these highly conserved proteins seem to play critical biological roles (18)(19)(20)(21)(22). Although β- and γ-CAs have been reported in several prokaryotes and eukaryotes, there is no report showing the presence of a β- or γ-CA in vertebrates (23,24). Databases such as Ensembl Genome Browser contain extensive vertebrate genome resources that support studies in fields such as evolutionary and computational biology, including WGS, gene expression and encoded protein analyses in vertebrates (25). Because eukaryotic nucleic acid samples can be contaminated by the environmental microbiome and the normal flora of the eukaryotic hosts, some contaminant gene and protein sequences from prokaryotes have been erroneously annotated as eukaryotic in databases (26).
In this study, we performed a quality control analysis of the WGS results of WZSP annotated in databases, using β- and γ-CA gene sequences as markers through bioinformatics and data mining approaches.

Identification of CAs from S. scrofa
To identify genomic and proteomic information on the CA isozymes from S. scrofa, the National Center for Biotechnology Information (NCBI) database (https://www.ncbi.nlm.nih.gov/) (27) was used to define the chromosomal location and exon counts of the corresponding genes. In addition, data from the UniProt database (https://www.uniprot.org/) (28) were used to define the subcellular localization of the CA isozymes from S. scrofa.

Analysis of β- and γ-CA sequences
In this analysis, the β-CA protein sequence from Acetobacter aceti (UniProt ID: A0A1U9KGA1) and the γ-CA protein sequence from Shigella flexneri (UniProt ID: P0A9X0) were used as the query sequences. Basic Local Alignment Search Tool (BLAST) analysis was performed on both query sequences using the BLAST algorithm of Ensembl Genome Browser (https://asia.ensembl.org/index.html) (25). To find similar sequences, Pig-Wuzhishan (assembly: minipig_v1.0; accession: GCA_002844635.1; genebuild released: September 2019) was selected in the species selector section, and the TBLASTN search tool with normal sensitivity was applied to search the translated nucleotide databases using a protein query. In the next step, the β- and γ-CA protein sequences identified in WZSP were analyzed with the BLAST homology search tool of the UniProt database. In the final step, multiple sequence alignment (MSA) analysis was performed on all β- and γ-CA protein sequences involved in this evaluation using the Clustal Omega algorithm of the European Bioinformatics Institute database (https://www.ebi.ac.uk/Tools/msa/clustalo/) (29). To reduce the size of the protein sequences and of the output figures from the MSA analysis, only 69- and 60-amino-acid segments of the β- and γ-CA protein sequences, respectively, containing the enzyme active sites were selected.
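
A TBLASTN search compares a protein query against a nucleotide database translated in all six reading frames. The sketch below illustrates only that translation step, assuming the standard genetic code; the function names and the toy DNA string are hypothetical, and this is not the Ensembl implementation.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate one reading frame; codons with unknown bases become 'X'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Map frame labels (+1..+3, -1..-3) to their protein translations."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# Hypothetical toy DNA for illustration only; not a sequence from the study.
toy_dna = "TGTGCAGATGCACGT"
frames = six_frame_translations(toy_dna)
```

A protein query is then aligned against each of the six translated frames, so a coding region is found regardless of the strand or frame in which it lies.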

Genomic analysis of β- and γ-CA sequences from putative bacterial contaminants
The coding genes for β- and γ-CAs from Pseudomonas spp., one of the putative contaminants in the WGS of WZSP, were evaluated using the BLASTP search tool in the Pseudomonas Genome Database, version 20.2 (https://www.pseudomonas.com/) (30), with the default E-value cutoff of 1e-4. In addition, the coding genes for β- and γ-CAs from Acinetobacter spp., another potential contaminant, were analyzed using the Ensembl Bacteria database (http://bacteria.ensembl.org/index.html) (31).
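
The effect of a 1e-4 cutoff can be illustrated with a small filter over BLAST hits. This is a minimal sketch that assumes the hits were exported in the common 12-column tab-separated layout; the web tools used in this study may present results differently, and the rows, hit names and the inclusive comparison are hypothetical choices made for illustration.

```python
import csv
import io

# Column names of the common 12-column tabular BLAST layout (an assumption here).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(tabular_text, max_evalue=1e-4):
    """Return hits with E-value <= max_evalue, lowest E-value first."""
    reader = csv.DictReader(io.StringIO(tabular_text),
                            fieldnames=COLUMNS, delimiter="\t")
    hits = []
    for row in reader:
        evalue = float(row["evalue"])
        if evalue <= max_evalue:
            row["evalue"] = evalue
            row["pident"] = float(row["pident"])
            hits.append(row)
    return sorted(hits, key=lambda hit: hit["evalue"])


# Hypothetical rows for illustration only; not results from the study.
EXAMPLE = (
    "query_beta_CA\thit_A\t100.0\t69\t0\t0\t1\t69\t1\t69\t2e-45\t150\n"
    "query_beta_CA\thit_B\t31.5\t54\t37\t0\t5\t58\t10\t63\t0.5\t28.1\n"
    "query_beta_CA\thit_C\t62.0\t69\t26\t0\t1\t69\t1\t69\t3e-20\t85.0\n"
)

kept = filter_hits(EXAMPLE)
print([hit["sseqid"] for hit in kept])
```

Hits above the cutoff, such as the weak `hit_B` row, are discarded because alignments of that quality are expected to occur by chance.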

Identification of α-CAs from S. scrofa
This analysis identified 12 α-CA isozymes, including CA I-IV, CA VA and VB, CA VI, CA VII, CA IX and CA XII-XIV, and three CARPs, including CARP VIII, CARP X and CARP XI, in S. scrofa. The results revealed that chromosome 1 contains the coding genes for CA IX and CA XII; chromosome 4 contains the coding genes for CA I-III, CA XIII, CA XIV and CARP VIII; chromosome 6 contains the coding genes for CA VA, CA VI, CA VII and CARP XI; chromosome 12 contains the coding genes for CA IV and CARP X; and chromosome X contains the coding gene for CA VB.
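
The chromosome assignments above can be held in a simple lookup table, which also allows a quick tally against the stated totals of 12 isozymes and three CARPs. The sketch below only restates the assignments given in the text, with the range CA I-III expanded; the variable names are hypothetical.

```python
# Chromosome -> CA and CARP genes, as listed in the paragraph above.
GENES_BY_CHROMOSOME = {
    "1": ["CA IX", "CA XII"],
    "4": ["CA I", "CA II", "CA III", "CA XIII", "CA XIV", "CARP VIII"],
    "6": ["CA VA", "CA VI", "CA VII", "CARP XI"],
    "12": ["CA IV", "CARP X"],
    "X": ["CA VB"],
}

all_genes = [gene for genes in GENES_BY_CHROMOSOME.values() for gene in genes]
isozymes = [gene for gene in all_genes if gene.startswith("CA ")]
carps = [gene for gene in all_genes if gene.startswith("CARP ")]

print(len(isozymes), "isozymes;", len(carps), "CARPs")
```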
Our study on the subcellular localization of α-CAs from S. scrofa predicted that CA I-III, CA VII, CA XIII and CARP VIII are cytoplasmic; CA VA and CA VB are mitochondrial; CA VI, CARP X and CARP XI are secretory; CA IX, CA XII and CA XIV are transmembrane; and CA IV is membrane-bound (Table 1).

Analysis of β- and γ-CA sequences
The BLAST homology analysis using the β-CA query sequence from A. aceti and the γ-CA query sequence from S. flexneri first identified β- and γ-CA sequences among the predicted WZSP sequences. A more detailed BLAST homology analysis of the β-CA and γ-CA sequences from WZSP showed 100% similarity with bacterial β- and γ-CA sequences from Pseudomonas spp. and Acinetobacter spp. To confirm the identity of the defined sequences, MSA of the β-CA sequences showed the five highly conserved amino acids, namely cysteine, aspartic acid and arginine (CXDXR) and histidine and cysteine (HXXC), which are known to be characteristic features of β-CA enzymes. Similarly, the predicted γ-CA sequences showed the four highly conserved amino acids characteristic of γ-CAs, namely glutamine and histidine (QXXXXXH) as well as two histidines (HXXXXH) (Table 2; Figure 1).
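
The signature motifs named above translate directly into regular expressions, where X stands for any residue. The following is a minimal sketch of such a scan, not the MSA-based procedure used in this study; the function name and the toy sequences are hypothetical. Because the motifs are short, they can also occur by chance in unrelated proteins, so a match is a hint rather than proof.

```python
import re

# Signature motifs from the text, with X (any residue) written as '.'.
SIGNATURES = {
    "beta": [re.compile(r"C.D.R"), re.compile(r"H..C")],
    "gamma": [re.compile(r"Q.{5}H"), re.compile(r"H.{4}H")],
}


def matching_classes(protein):
    """Return the CA classes whose signature motifs are all present."""
    seq = protein.upper()
    return [
        name
        for name, patterns in SIGNATURES.items()
        if all(pattern.search(seq) for pattern in patterns)
    ]


# Hypothetical toy sequences for illustration only; not real CA sequences.
toy_beta = "MKACADSRVLGHGGCGA"
toy_gamma = "MSQAVLPGHTTIHGDAKHLA"
toy_other = "MKTAYIAKQR"

for label, seq in [("toy_beta", toy_beta), ("toy_gamma", toy_gamma),
                   ("toy_other", toy_other)]:
    print(label, matching_classes(seq))
```

Requiring every motif of a class to be present keeps the scan stricter than a single-motif match, at the cost of missing sequences truncated before one of the motifs.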

Genomic analysis of β- and γ-CA sequences from putative bacterial contaminants
The analysis revealed that the β- and γ-CA genes from putative bacterial contaminants are located in the genomes of Pseudomonas spp. and Acinetobacter spp. Further evaluation revealed that all the encoded β- and γ-CAs from the putative bacterial contaminants are probably cytoplasmic proteins (Figures 2-4).

Protein structure analysis
The 3D models of crystallized β- and γ-CA protein structures, most similar to the bacterial contaminant proteins described in this study, were visualized in the NGL (WebGL) viewer of the RCSB PDB database (accession codes 5JJ8 and 3PMO) (Figure 5). The visualized images of the bacterial β- and γ-CA proteins show the homodimeric and homotrimeric structures typical of β- and γ-CA proteins, respectively (33).
Surprisingly, the first analyses of our study using the bacterial β- and γ-CA query sequences detected counterpart CA sequences in WZSP, and indeed, the MSA analysis confirmed that these sequences belong to the β- and γ-CA families. The BLAST homology analyses of the identified β- and γ-CAs from WZSP displayed 100% identity to β- and γ-CA sequences from Pseudomonas spp. and Acinetobacter spp. In addition, genomic characterization of the detected β- and γ-CA sequences using the Pseudomonas Genome Database and the Ensembl Bacteria database showed the presence of corresponding β- and γ-CA genes in the genomes of Pseudomonas spp. and Acinetobacter spp., with cytoplasmic subcellular localization of the encoded CAs.
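
Percent identity between two aligned sequences can be computed by counting matching columns. The sketch below uses one common convention, counting only columns in which neither sequence has a gap; BLAST and other tools may use a different denominator, so this is an illustration rather than the calculation behind the reported values. The function name and toy strings are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of identical residues over ungapped alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip columns that contain a gap
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Hypothetical toy alignments for illustration only.
print(percent_identity("ACDE", "ACDE"))
print(percent_identity("ACDE", "ACDF"))
```

An identity of 100% over the full length of a vertebrate-annotated sequence and a bacterial protein is the kind of signal that points to contamination rather than to distant homology.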

Discussion
Previous studies have revealed that both host gut-associated flora and the environmental microbiome, such as airborne microbes, as well as bacterial contamination of equipment and solutions used for DNA isolation, can act as interfering substances and contamination sources in shotgun metagenomic sequencing samples, leading to false-positive results (34)(35)(36). For similar reasons, it is highly plausible that the DNA samples isolated from WZSP for the WGS project were contaminated with bacterial members of the order Pseudomonadales, including Pseudomonas spp. and Acinetobacter spp., resulting in the detection of β- and γ-CAs from these bacterial species in the Ensembl assembly (minipig_v1.0) of S. scrofa. In addition, protein structure analysis of the β- and γ-CA sequences from the bacterial contaminants revealed that the β-CA sequences were similar to the 5JJ8 crystal structure from P. aeruginosa and the γ-CA sequences were similar to the 3PMO crystal structure from P. aeruginosa, both of which confirm that the β- and γ-CA sequences of the bacterial contaminants belong to the order Pseudomonadales.
There are different pipelines for decontamination of genomic reads in DNA-Seq and RNA-Seq projects, such as a hierarchical clustering algorithm (37), RapMap (38), DecontaMiner (39), Sequencing Quality Assessment Tool or SQUAT (40), map-guided scaffolding or MaGuS (41), and Kraken 2 (42), which can improve the quality of genomic samples. DNA-free reagents and kits are used to reduce bacterial contamination in sequencing projects (43). Internal controls at every step of the sequencing protocols can detect trace fragments of foreign DNA or RNA and so reduce the risk of bacterial contamination (44). Nevertheless, our results demonstrate that genomic databases do contain incorrect sequences due to microbial contamination, underlining the need for high-quality internal controls and biocuration.

Conclusions
In addition to the aforementioned methods for detecting bacterial contamination in the WGS projects of animals, bioinformatics and computational approaches combined with bacterial-specific markers, such as CA sequences, can be employed to detect and reduce the risk of microbial contamination in WGS projects through the implementation of biocuration in databases. It is important to control the quality of short-size libraries, contigs and scaffolds as well as to perform internal checks of solutions, reagents and equipment during shotgun genomic projects. This can reduce the risk of annotating false DNA and protein sequences in databases.