Top considerations for creating bioinformatics software documentation

Abstract
Investing in documenting your bioinformatics software well can increase its impact and save you time. To maximize the effectiveness of your documentation, we propose a few guidelines. We recommend providing multiple avenues for users to learn your research software, including a navigable HTML interface with a quick start, useful help messages with detailed explanation and thorough examples for each feature of your software. By following these guidelines, you can ensure that your hard work maximally benefits yourself and others.


Introduction
You have written a new software package far superior to any existing method. You submit a paper describing it to a prestigious journal, but it is rejected after Reviewer 3 complains they cannot get it to work. Eventually, a less exacting journal publishes the paper, but you never get as many citations as you expected. Meanwhile, there is not even a single day when you are not inundated by emails asking very simple questions about using your software. Your years of work on this method have not only failed to reap the dividends you expected, but have become an active irritation. And you could have avoided all of this by writing effective documentation in the first place.
Academic bioinformatics curricula rarely train students in documentation. Many bioinformatics software packages lack sufficient documentation. Developers often prefer spending their time elsewhere. In practice, this time is often borrowed, and by ducking work to document their software now, developers accumulate 'documentation debt'. Later, they must pay off this debt, spending even more time answering user questions than they might have by creating good documentation in the first place. Of course, when confronted with inadequate documentation, some users will simply give up, reducing the impact of the developer's work.
To avoid this, we suggest several guidelines for improving multiple aspects of your documentation (Table 1). These guidelines improve the usability of your software and reduce time spent supporting users. Many of these guidelines apply both to bioinformatics software and to bioinformatics databases. In this perspective, we describe in detail the best practices of many well-established bioinformatics tools (Table 2).

Hierarchical documentation
Your documentation should consist of hierarchically grouped and carefully sorted components. This allows users to efficiently find the detail they need without overwhelming them with a large span of top-level material. It limits the amount of information shown to the user at one time, and it places the most important materials at the top and less frequently used details at the bottom.
The MEME Suite contains multiple programs for sequence motif analysis. Its documentation begins with a flow chart that describes its modules and their relationship to each other (Figure 1B). Other top-level items provide information on installation, databases that the programs rely on, and ways to get support. The MEME Suite also has a top-level menu that groups programs by function (Figure 1A). More commonly used modules appear first. This grouping and ordering makes it easier for users to find the module they need and to compare it with related tools for their task.
For example, the 'Manual' section of the sidebar groups the programs into four categories: 'Motif Discovery', 'Motif Enrichment', 'Motif Scanning' and 'Motif Comparison' (Figure 1A). The manual of each program within describes both the web and command-line interfaces. As an illustrative sub-example, we will examine further the manual for DREME, one of the MEME Suite's motif discovery tools. Its command-line documentation consists of several components. 'Usage' describes the minimal parameters for using the program. 'Description' includes a technical but abstract explanation of DREME's functionality. The manual comprehensively defines 'Input' and 'Output' formats and describes options in detail using a table (Figure 1C). This table groups the options into several categories such as 'Input/Output', 'Alphabet', 'General' and 'Miscellaneous'. For each option, successive columns of this table give the parameters, a description and the default behavior. The MEME Suite concludes each program's manual with a citation to the peer-reviewed manuscript describing that program.
Bedtools [15] provides another example of well-documented and widely used bioinformatics software. Bedtools has a table of contents that directs users to the information they need (Figure 2A). These contents consist of a hierarchy of information structured and stored for optimal retrieval (Figure 2). Bedtools notably uses informative figures and extensive examples to clarify the functionality of different options (Figure 2C).

Tools for documentation
Several software packages automatically generate up-to-date documentation from a markup language in the source code and elsewhere. These tools transform your code and markup into formats such as Unix manual ('man') page, Hypertext Markup Language (HTML) and Portable Document Format (PDF). Ideally you will create all these formats, but we consider an HTML manual most essential.
Examples of documentation generators include Doxygen [23] and Sphinx [24]. Sphinx has particular popularity in bioinformatics owing to its use of the intuitive markup language reStructuredText [25] and extensive formatting options. Some tools generate documentation specifically for one programming language, such as Javadoc [26] for Java, or Roxygen [27] for R.
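As a minimal sketch of what a documentation generator needs, consider Sphinx, which reads its settings from a Python file named conf.py. The fragment below assumes a hypothetical project called 'motifscan'; the project name, author and version are placeholders rather than details of any real package. It enables the autodoc extension, which extracts docstrings from your source code into the generated pages.

```python
# conf.py: minimal Sphinx configuration (illustrative sketch).
# "motifscan", the author and the release number are hypothetical placeholders.

project = "motifscan"
author = "Your Name"
release = "0.1.0"

# sphinx.ext.autodoc imports your modules and extracts their docstrings.
extensions = ["sphinx.ext.autodoc"]

# "alabaster" is the default HTML theme distributed with Sphinx.
html_theme = "alabaster"
```

With a file like this in your documentation source directory, a command of the form sphinx-build -b html followed by the source and output directories should produce the HTML manual.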
The main disadvantage of automatically generated documentation is that you have less control of how to organize the documentation effectively. Whether you used a documentation generator or not, however, there are several advantages to an HTML web site compared with a PDF document. Search engines will more reliably index HTML web pages. In addition, users can more easily navigate the structure of a web page, jumping directly to the information they need.

Graphical interfaces
Software with a graphical interface, such as web applications, also requires more graphical documentation. Describing how to interact with a graphical interface in text can prove laborious, and a well-annotated picture can be worth a hundred words. As an example, Swiss-PdbViewer [4] is graphical software that models protein structure. Its documentation makes ample use of screenshots and visuals that depict elements of the Swiss-PdbViewer interface, such as icons. These visuals help users to quickly understand how to complete tasks, and to interpret the software's output.

Installation
Describe how to install your software and all of its dependencies, in detail. At a minimum, provide exact instructions for the most recent versions of Debian, Red Hat Enterprise Linux, macOS and Windows, or the subset of those systems that you support. It is laborious to support multiple versions of an operating system, but that does not excuse omitting these instructions for at least one version. Indicate a known working version of each dependency as well. Many scientists use computing clusters or networked computers where they lack root privileges. When possible, your instructions should cover both root and non-root installation.
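One way to record known working versions is to print them from the environment in which you last tested the software. The sketch below uses Python's standard importlib.metadata module. The function name report_versions and the package names passed to it are placeholders; substitute your own dependencies.

```python
import platform
from importlib import metadata


def report_versions(package_names):
    """Return a mapping of package name to installed version (or None)."""
    versions = {"python": platform.python_version()}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


# Hypothetical dependency list; replace it with your own.
for name, version in report_versions(["numpy", "scipy"]).items():
    print(f"{name}: {version or 'not installed'}")
```

Pasting the printed lines into your installation instructions gives users a concrete combination of versions that is known to work.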
Ensure you test installation on a new, unconfigured environment. A continuous integration service (see below) provides a great means for accomplishing this. If you use non-standard build tools or your software has complex dependencies, document the installation thoroughly and extensively. Sometimes simplifying the installation takes less effort than documenting a complex procedure. If your installation instructions seem complex, consider ways to make installation easier, perhaps by contributing your software to a package repository such as Debian Med [28], Homebrew [29] or the Comprehensive R Archive Network [30].
PLINK [3] provides a good example of bioinformatics software supporting all major operating systems, with detailed instructions for each platform.

Readme and news
Provide a readme file at the top level of your source code with basic information about installation and use of your software, and details on where users can find more information. The readme should show up to users visiting your source code repository and will provide the first impression for many. The readme should also include the software's license.
[Figure caption: A four-column table describes details of each option in the DREME program. Each row describes a single option, and these options are categorized into broader option groups.]
Also, provide a news section dedicated to the changes in each release of the software. Discuss bug fixes, caveats, new features and changes in behavior of the software in detail. Users will often upgrade after several new versions, and want a place to find the details of all that has changed since their last install. Include the news as another file in the top level of your source code and link to it from the readme.

File formats
If you must create a new file format (and please do not, if you can avoid it), make sure to specify it in detail. Burying specification details in your code makes interoperation with future software written by others frustrating. A detailed specification, however, makes it easier to use your software in a larger pipeline, and reduces the chance you will have to debug interoperability problems later. The MEME Suite [17] and PLINK [31] both exemplify detailed description of input and output formats.

Communication with users
Users may need to contact you if they cannot find the answers they need in the documentation. Set up a mailing list to allow users to send questions and feedback. Archive the mailing list where search engines can find it. People who encounter an error will report the message, allowing others to easily find the solution. Mailing lists facilitate an open development process, which may lead to users developing and submitting new features for your software. Some bioinformatics software packages, such as GATK [11], also host a forum which serves a similar purpose in making answers available to all. Forums, however, perform more poorly than mailing lists in getting others to contribute. New submissions to mailing lists are pushed to all list members, including those who registered to ask their own questions or learn about software updates. In forums, however, users must actively check the forum to see new questions. Often only the developers have the motivation to do this.
Issue trackers provide a great way to communicate about specific potential bugs or requests. GitHub [32] and Bitbucket [33] provide a free service for issue tracking, along with a repository for your code and documentation.
Adding a comment section to your documentation web page encourages users to contribute helpful feedback. So does Read the Docs [34], which makes it easy for users to submit a pull request correcting the documentation. If you receive repeated inquiries on one aspect of your software, this is evidence of insufficient documentation. Take this as a sign to revise the documentation.
MISO [35], ggplot2 [36] and Bedtools [15] provide detailed documentation in HTML format, have a public GitHub repository to track issues, and also have a mailing list for other communications with users.

Frequently asked questions
Prepare a frequently asked questions (FAQ) document to answer common questions you expect or have received. Many users find the FAQ format more compelling than a reference manual, and it is easier to link to an answer to a common question from a mailing list. PLINK has an FAQ that covers a variety of difficulties one may encounter before starting to use the software. It also includes questions that are related to unexpected outputs, and comparison with other packages.

Troubleshooting
Your software should provide meaningful warning and error messages when it receives unexpected input. Include a chapter in your documentation that thoroughly explains error and warning messages and how to resolve them. When users search the Internet for the text of these errors and warnings, they will find answers immediately.
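To illustrate what a meaningful, searchable error message might look like, the sketch below parses the first three columns of a BED line and reports exactly what went wrong and where. The error codes (E001 to E003), the InputFormatError class and the function name are hypothetical conventions for this example, not part of any existing tool.

```python
class InputFormatError(ValueError):
    """Raised when an input line does not follow the expected format."""


def parse_bed_line(line, line_number):
    """Parse the first three columns (chrom, start, end) of a BED line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise InputFormatError(
            f"E001: line {line_number}: expected at least 3 tab-separated "
            f"columns (chrom, start, end) but found {len(fields)}. "
            "Check that the file uses tabs rather than spaces."
        )
    chrom, start_text, end_text = fields[:3]
    try:
        start, end = int(start_text), int(end_text)
    except ValueError:
        raise InputFormatError(
            f"E002: line {line_number}: start ({start_text!r}) and end "
            f"({end_text!r}) must both be integers."
        ) from None
    if start > end:
        raise InputFormatError(
            f"E003: line {line_number}: start ({start}) is greater than "
            f"end ({end})."
        )
    return chrom, start, end
```

A stable code such as E001 gives users a short string to search for, and gives you an obvious heading under which to explain the problem in the troubleshooting chapter.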

Programming environment
Using programming environments and languages that require difficult installation and configuration reduces the usability of your program, and they also require more complex documentation. For example, to run MATLAB programs without an expensive license, users must install a specific version of the MATLAB Compiler Runtime (MCR). Documenting all the things that can go wrong in installing an old version of MCR provides quite a challenge. This partially explains why few widely used bioinformatics tools rely on MATLAB.

Default parameters
Many users rely on your default parameters, so choose them carefully. Configuration options left to potentially inexpert users provide no substitute for sensible defaults. Document the rationale for selecting any default parameter. This will help users understand when they should change it.
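As one possible way to keep defaults visible, the sketch below uses Python's standard argparse module with ArgumentDefaultsHelpFormatter, which appends each option's default value to its help text. The program name 'motifscan', the option names and the default values are hypothetical placeholders chosen only for illustration.

```python
import argparse


def build_parser():
    """Build a parser whose help message states every default value."""
    parser = argparse.ArgumentParser(
        prog="motifscan",  # hypothetical program name
        description="Scan sequences for motif matches (illustrative).",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument(
        "--p-value-threshold",
        type=float,
        default=1e-4,  # arbitrary placeholder, not a recommended value
        help="report matches with a p-value below this threshold",
    )
    parser.add_argument(
        "--both-strands",
        action="store_true",
        help="also scan the reverse-complement strand",
    )
    return parser


# Print the help message, including the defaults, without parsing arguments.
print(build_parser().format_help())
```

The help message only states the value of each default. The rationale for choosing it still belongs in the reference manual.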

Citation
Provide a citation to your own manuscript with a link to an open-access version. This makes it easier for users to find a description of your methodology and cite your work.

Writing code
At some point, the documentation will not answer every question. At this point, someone must examine the source code, so make it easy for that someone to figure things out without help. That someone will, at times, end up being yourself.
Put a premium on making your code easily intelligible to others. Use descriptive variable and function names following the standard format for your environment. PEP 8 [37] supplies a format for Python, and Google style guides [38] provide them for other programming languages. Many text editors can check code style automatically.
Comments provide an important avenue to increase code accessibility. Use a template to begin the header of your code with a comment including your name, email address and date of creation. At the top of each source code file, provide a brief description of its function. Concisely annotate your code with block or inline comments whenever it does anything not understood with trivial effort. If you use a documentation generator, use specially formatted comments to annotate functions with structured information.
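The sketch below shows how these commenting practices might look in a small Python module. The author details and creation date are placeholders, and the function is a toy example. The docstring uses reStructuredText field lists (:param:, :returns:, :raises:), which Sphinx can render as structured information.

```python
"""Compute simple sequence composition statistics.

Author: Your Name <you@example.org>  (placeholder)
Created: YYYY-MM-DD  (placeholder)
"""


def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence.

    :param sequence: DNA sequence as a string; case is ignored.
    :returns: Fraction of bases that are G or C, between 0.0 and 1.0.
    :raises ValueError: If ``sequence`` is empty.
    """
    if not sequence:
        raise ValueError("sequence must contain at least one base")
    # Normalize case once so that soft-masked (lowercase) bases are counted.
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)
```

The inline comment explains why the case conversion is there, which is the kind of detail a reader could not recover from the code alone.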

Continuous integration of quick start and tests
Your quick start effectively provides a simple script on a small test data set. Not only does this familiarize users with features of your software, but it also ensures that the software is installed properly and functions as expected.
You or other contributors can also use this script as a quick test to ensure that changes do not break any part of the software, or your instructions. You should therefore include the major options of your software in this script.
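A quick start that doubles as a test can be sketched as follows. The toy count_kmers function stands in for your software, and quick_start mirrors the example a user would copy from the documentation. All names here are hypothetical.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping substring of length k in the sequence."""
    kmers = (sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return dict(Counter(kmers))


def quick_start():
    """Run the documented quick-start example on a tiny test sequence."""
    return count_kmers("ACGTACGT", k=2)


def test_quick_start():
    """Fail loudly if the quick-start output ever changes."""
    expected = {"AC": 2, "CG": 2, "GT": 2, "TA": 1}
    assert quick_start() == expected


# Running the file directly executes the same check a test runner would.
test_quick_start()
```

Because the test calls the same function as the quick start, a change that breaks the documented example also breaks the test, so instructions and code are less likely to drift apart.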
Consistent version control with Git or Mercurial helps you and collaborators track the development of the project and contribute easily. Using tools for coverage or mutation testing of your code, together with continuous integration services such as drone.io [39], which supports both GitHub and Bitbucket, helps you identify potential problems with your program faster.

Discussion
While many bioinformatics software packages have satisfactory documentation, insufficient documentation makes others unusable by the community. Well-documented software is also an important aspect of reproducible analysis [40,41]. Several previous reviews include checklists for bioinformatics software engineering that include software documentation [42][43][44]. Despite this, many bioinformatics software developers do not prioritize the creation of documentation. Nguyen-Hoan et al. [45] performed a survey asking 60 scientific software developers about their development practices. While 51 of 60 participants used inline code comments, fewer supplied the other documentation formats such as installation instructions (42 of 60) or user manuals (30 of 60) suggested here. Clearly, there is a long way to go in educating bioinformatics software developers on the best practices of effective documentation.
Although documentation is often mentioned as an important element of bioinformatics software engineering, little primary research specifically focuses on bioinformatics documentation. One can find primary research, however, on the effects of software documentation more generally. Junji et al. [46] reviewed the literature on software documentation research, and quantified how often documentation was shown to improve various aspects of software engineering. Documentation is shown to have a positive influence on software maintenance (29 articles), software development (16 articles), code comprehension (14 articles) and software design comprehension (10 articles). One study shows that initial documentation improves software quality even if the documentation is rarely maintained [47].
Additionally, three independent studies [48][49][50] indicate that documentation also improves usage. Forward [48] asked software developers about the effectiveness of different attributes of software documentation, and found that content, maintenance, availability and example usage are the most important attributes. De Souza et al. [49] conducted two surveys, one asking maintainers their opinion of different types of documentation, and one asking which types of documentation they actually use. They found that source code readability, in-line comments, data model and requirement description are among the important documentation artifacts in both surveys. Dzidek et al. [50] quantitatively assessed the costs and benefits of Unified Modeling Language [51] documentation in a controlled experiment. They found a significant increase in correctness of future changes to software, as well as a significant improvement in software design.
Documenting bioinformatics software effectively and adopting a standard code style have particular importance in academia. Much academic software is developed by trainees who soon move on to other employment. These trainees have often had little training in software engineering, which would include the necessity of sufficient documentation [52]. Without good documentation, it becomes difficult to continue developing or using the software. This results in premature abandonment of the software and a waste of the investment in the project. For this reason, documentation can be even more important in academia than in industry, but much academic software remains under-documented.
Peer review of a bioinformatics software paper rarely assesses the software documentation directly. If the reviewers cannot figure out how to run the software, however, this may result in rejection of the manuscript. The developer should ensure that described uses of their software remain reproducible. Long after the paper is accepted, published software remains part of developers' résumés and can affect their reputations.
When you lack the time to apply every guideline we propose, you should at least have the following minimum documentation:
1. GitHub or Bitbucket page with code and issue tracker.
2. Readme that covers installation, quick start, input formats and output formats.
3. Reference manual with detailed description of every user-configurable parameter.
The Software Sustainability Institute's online sustainability evaluation [53] assesses how sustainable and reusable your software is. Many parts of this evaluation focus on adequate documentation. After following our other guidelines, we additionally recommend this evaluation for further detailed suggestions on creating great documentation.

Key Points
• Great bioinformatics software documentation provides detailed instructions for installation, usage and all available options.
• It begins with a quick start guide with walk-through examples.
• Details of software capabilities are navigable through a hierarchical interface.
• Users can request further assistance through a searchable forum.