BioCommons: a robust Java library for RNA structural bioinformatics

Abstract
Motivation: Biomolecular structures come in multiple representations and diverse data formats. Their incompatibility with the requirements of data analysis programs significantly hinders analytics and the creation of new structure-oriented bioinformatic tools. Therefore, the need for robust libraries of data processing functions is still growing.
Results: BioCommons is an open-source Java library for structural bioinformatics. It contains many functions working with the 2D and 3D structures of biomolecules, with a particular emphasis on RNA.
Availability and implementation: The library is available in the Maven Central Repository and its source code is hosted on GitHub (https://github.com/tzok/BioCommons).
Supplementary information: Supplementary data are available at Bioinformatics online.


Modified residues
BioCommons can parse and process information about modified residues. It also employs a fuzzy detector of residue type, which comes in handy when working with 3D models generated in silico without any header information.
For example, BioCommons detects that residue A.10 in the 1EHZ structure is a modified guanine. The parser sets its name to 2MG, and the detector correctly reports that the heavy atom content differs from what is expected for guanine.
For A.16 (a dihydrouridine, which has additional hydrogens) and A.39 (a pseudouridine, i.e. an isomer of uridine), the parser still marks the residues as modified, even though their heavy atom content is the same as in an unmodified uridine. The detector is sensitive to all atoms, so modifications to the backbone or ribose are also caught. The same mechanism works for proteins, e.g. S.169 in 148L is correctly determined to be 2,6-diaminopimelic acid, a modified lysine.
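The idea of a fuzzy residue-type detector can be illustrated with a minimal sketch. It is not the BioCommons implementation: the class name FuzzyResidueDetector, the reference sets (limited here to base heavy atoms under their standard PDB names), and the use of Jaccard similarity as the scoring function are all assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only; hypothetical class, not the BioCommons API. */
final class FuzzyResidueDetector {
    // Assumption: only base heavy atoms are listed (standard PDB atom names);
    // a full detector would also cover backbone and ribose atoms.
    static final Map<String, Set<String>> BASE_ATOMS = Map.of(
            "A", Set.of("N9", "C8", "N7", "C5", "C6", "N6", "N1", "C2", "N3", "C4"),
            "G", Set.of("N9", "C8", "N7", "C5", "C6", "O6", "N1", "C2", "N2", "N3", "C4"),
            "C", Set.of("N1", "C2", "O2", "N3", "C4", "N4", "C5", "C6"),
            "U", Set.of("N1", "C2", "O2", "N3", "C4", "O4", "C5", "C6"));

    /** Jaccard similarity: size of the intersection divided by size of the union. */
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    /** Returns the reference type whose atom set is most similar to the observed one. */
    static String detect(Set<String> observed) {
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, Set<String>> entry : BASE_ATOMS.entrySet()) {
            double score = jaccard(observed, entry.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    /** True when the observed atoms are not exactly the expected set for the type. */
    static boolean differsFromExpected(String type, Set<String> observed) {
        return !BASE_ATOMS.get(type).equals(observed);
    }

    public static void main(String[] args) {
        // Guanine base plus one extra methyl carbon (named CM2 here),
        // mimicking an N2-methylated guanine.
        Set<String> observed = new HashSet<>(BASE_ATOMS.get("G"));
        observed.add("CM2");
        String type = detect(observed);
        System.out.println(type + " modified=" + differsFromExpected(type, observed));
        // prints "G modified=true"
    }
}
```

A detector that looks only at heavy atom names cannot flag dihydrouridine or pseudouridine, which is why the residue name supplied by the parser has to be taken into account as well.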

Missing and unknown residues
Another distinguishing feature of BioCommons is its handling of missing residues. File headers list them together with the sequence of the missing parts. BioCommons reads these headers and embeds the information in its data structures.
A slightly different case is that of UNK residues (e.g. in 3KFU), for which both the sequence and the electron density maps are uncertain. BioCommons handles these as well.
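In PDB-format files, missing residues are listed in REMARK 465 records. The sketch below shows one way such records could be read. It is not how BioCommons does it: the class MissingResidues and its Entry type are hypothetical, and the fixed column positions follow the "M RES C SSSEQI" template line that precedes the list in these records.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; hypothetical class, not the BioCommons API. */
final class MissingResidues {
    static final class Entry {
        final String name;
        final char chain;
        final int number;
        final char insertionCode;

        Entry(String name, char chain, int number, char insertionCode) {
            this.name = name;
            this.chain = chain;
            this.number = number;
            this.insertionCode = insertionCode;
        }
    }

    /** Collects residues listed after the "M RES C SSSEQI" template line. */
    static List<Entry> parse(List<String> headerLines) {
        List<Entry> result = new ArrayList<>();
        boolean inTable = false;
        for (String line : headerLines) {
            if (!line.startsWith("REMARK 465")) {
                continue;
            }
            if (line.contains("M RES C SSSEQI")) {
                inTable = true; // data lines follow the template line
                continue;
            }
            if (!inTable || line.length() < 26) {
                continue;
            }
            try {
                String name = line.substring(15, 18).trim();              // columns 16-18
                char chain = line.charAt(19);                             // column 20
                int number = Integer.parseInt(line.substring(21, 26).trim()); // columns 22-26
                char icode = line.length() > 26 ? line.charAt(26) : ' ';  // column 27
                result.add(new Entry(name, chain, number, icode));
            } catch (NumberFormatException e) {
                // Not a data line; skip it rather than fail on the whole file.
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up header lines, used only to exercise the parser.
        List<String> lines = List.of(
                "REMARK 465 MISSING RESIDUES",
                "REMARK 465   M RES C SSSEQI",
                "REMARK 465       G F    75");
        for (Entry e : parse(lines)) {
            System.out.println(e.chain + "." + e.name + e.number);
        }
    }
}
```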

Misformatted data
Many software tools that generate 3D models do not respect these rules. For example, PDB files generated with certain force fields violate rule 1 by left-aligning atom names and rule 2 by using a blank character as the chain identifier. Others end lines before 79 characters (a violation of rule 3).
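A tolerant reader for ATOM lines with these three defects might look like the sketch below. It is an assumption-laden illustration rather than the BioCommons parser: the class LenientAtomLine and the defaultChain parameter are hypothetical, and only a few fields of the fixed-column ATOM record are extracted.

```java
/** Illustrative sketch only; hypothetical class, not the BioCommons API. */
final class LenientAtomLine {
    final String atomName;
    final String residueName;
    final char chain;
    final int residueNumber;
    final double x;
    final double y;
    final double z;

    private LenientAtomLine(String atomName, String residueName, char chain,
                            int residueNumber, double x, double y, double z) {
        this.atomName = atomName;
        this.residueName = residueName;
        this.chain = chain;
        this.residueNumber = residueNumber;
        this.x = x;
        this.y = y;
        this.z = z;
    }

    static LenientAtomLine parse(String raw, char defaultChain) {
        // Short lines: pad to 80 characters so fixed-column access is safe.
        String line = String.format("%-80s", raw);
        // Atom name: trimming accepts both left- and right-aligned names (columns 13-16).
        String atomName = line.substring(12, 16).trim();
        String residueName = line.substring(17, 20).trim();          // columns 18-20
        // Blank chain identifier: fall back to a caller-supplied default (column 22).
        char chain = line.charAt(21) == ' ' ? defaultChain : line.charAt(21);
        int residueNumber = Integer.parseInt(line.substring(22, 26).trim()); // columns 23-26
        double x = Double.parseDouble(line.substring(30, 38).trim()); // columns 31-38
        double y = Double.parseDouble(line.substring(38, 46).trim()); // columns 39-46
        double z = Double.parseDouble(line.substring(46, 54).trim()); // columns 47-54
        return new LenientAtomLine(atomName, residueName, chain, residueNumber, x, y, z);
    }

    public static void main(String[] args) {
        // A made-up line: left-aligned atom name, blank chain, ends after the z coordinate.
        String raw = "ATOM  " + "    1" + " " + "P   " + " " + "  G" + " " + " "
                + "   1" + "    " + "  10.000" + "  20.000" + "  30.000";
        LenientAtomLine atom = parse(raw, 'A');
        System.out.println(atom.atomName + " " + atom.residueName + " "
                + atom.chain + "." + atom.residueNumber);
    }
}
```

Normalizing the line once, before any field is read, keeps the field extraction itself identical for well-formed and misformatted input.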

Residue order
The PDB specification requires that lines describing atoms be ordered by increasing residue number. Even when this is true, a robust library needs to analyze the data to find the correct order, as the example of the 1OB5 structure shows. In the file, the numbers of the last residues in chain F are 74, 76, 77, 77A. Residue 75 is marked as missing in the headers. However, looking at the 3D coordinates reveals that the actual chain goes along 74, 77A, 76, 77. There is no gap in the chain, contrary to what one would expect from the information in the headers. Additionally, the connections between residues appear in a different order than in the file. BioCommons correctly analyzes such structures, which is crucial for, among other things, the calculation of torsion angles.
Figure 5: The last four residues of chain F (F.74, F.77A, F.76, F.77) in the 1OB5 structure. Colors from light to dark represent the order of residues in the file. However, connections between residues are arranged in a different order.
Despite the overhead required for robust data handling, BioCommons is extremely fast at parsing and processing raw data. As of 2020-12-10, there are more than 11 thousand PDB structures marked as nucleic acid (alone or in complex with a protein). In a computational experiment, the parsing time of all of them was measured with BioCommons and BioJava. The largest structure stored in PDB format (id: 6WLN) has 80 models with a total of almost 28k residues and 900k atoms. BioCommons required just 1.4 seconds to parse this file on a Linux machine with an Intel® Core™ i7-2600K CPU @ 3.40GHz and 16 GB RAM. For this structure, BioJava returned an object containing a total of 15k residues and 494k atoms (when BioJava parses 6WLN from the mmCIF format, the numbers of residues and atoms are correct). In general, the parsing time of BioCommons grows linearly with the number of atoms in the structure, and even the largest structures can be processed quickly.
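Returning to residue order, recovering the chain from 3D coordinates rather than from residue numbers can be sketched as follows. This is a simplified illustration and not the BioCommons algorithm: it assumes a nucleic acid chain in which consecutive residues are joined by an O3'-P bond, the class ChainOrder and the distance cutoff are hypothetical, the coordinates in the example are made up, and circular chains are not handled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only; hypothetical class, not the BioCommons API. */
final class ChainOrder {
    static final class Residue {
        final String id;
        final double[] p;  // coordinates of the P atom
        final double[] o3; // coordinates of the O3' atom

        Residue(String id, double[] p, double[] o3) {
            this.id = id;
            this.p = p;
            this.o3 = o3;
        }
    }

    static double distance(double[] a, double[] b) {
        double dx = a[0] - b[0];
        double dy = a[1] - b[1];
        double dz = a[2] - b[2];
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }

    /** Links residue a to b when O3'(a) lies within maxBond of P(b), then walks the links. */
    static List<String> order(List<Residue> residues, double maxBond) {
        Map<Residue, Residue> next = new HashMap<>();
        Set<Residue> hasPrevious = new HashSet<>();
        for (Residue a : residues) {
            for (Residue b : residues) {
                if (a != b && !next.containsKey(a) && !hasPrevious.contains(b)
                        && distance(a.o3, b.p) <= maxBond) {
                    next.put(a, b);
                    hasPrevious.add(b);
                }
            }
        }
        List<String> ordered = new ArrayList<>();
        for (Residue start : residues) {
            if (hasPrevious.contains(start)) {
                continue; // not the beginning of a connected fragment
            }
            for (Residue r = start; r != null; r = next.get(r)) {
                ordered.add(r.id);
            }
        }
        return ordered;
    }

    public static void main(String[] args) {
        // File order 74, 76, 77, 77A; made-up coordinates place 77A right after 74.
        List<Residue> fileOrder = List.of(
                new Residue("F.74", new double[] {0, 0, 0}, new double[] {4.5, 0, 0}),
                new Residue("F.76", new double[] {12, 0, 0}, new double[] {16.5, 0, 0}),
                new Residue("F.77", new double[] {18, 0, 0}, new double[] {22.5, 0, 0}),
                new Residue("F.77A", new double[] {6, 0, 0}, new double[] {10.5, 0, 0}));
        System.out.println(order(fileOrder, 2.0));
        // prints [F.74, F.77A, F.76, F.77]
    }
}
```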

Classes and methods count
BioCommons has been developed over many years. Its components were created while implementing tasks that arose during the author's work on projects from the RNApolis suite (Szachniuk, 2019). For example, for MCQ4Structures (Zok et al., 2014; Wiedemann et al., 2017), a set of functions was designed to compare biomolecules in torsion angle space. These functions have also been used in the assessment of RNA 3D structures predicted within the RNA-Puzzles contest (Magnus et al., 2020; Miao et al., 2020). In RNApdbee (Antczak et al., 2014; Antczak et al., 2018; Zok et al., 2018) and RNAvista (Rybarczyk et al., 2015; Antczak et al., 2019), the library's components formed a mapping layer between 2D and 3D structure data, especially in terms of pseudoknot handling. BioCommons also played a crucial role in the preparation of the RNA conformer library (Zok et al., 2015) for nucleobase and nucleoside remodeling within RNAfitme (Antczak et al., 2018).