CandiMeth: Powerful yet simple visualization and quantification of DNA methylation at candidate genes

Abstract
Background: DNA methylation microarrays are widely used in clinical epigenetics and are often processed using R packages such as ChAMP or RnBeads by trained bioinformaticians. However, looking at specific genes requires bespoke coding for which wet-lab biologists or clinicians are not trained. This leads to high demands on bioinformaticians, who may lack insight into the specific biological problem. To bridge this gap, we developed a tool for mapping and quantification of methylation differences at candidate genomic features of interest, without using coding.
Findings: We generated the workflow "CandiMeth" (Candidate Methylation) in the web-based environment Galaxy. CandiMeth takes as input any table listing differences in methylation generated by either ChAMP or RnBeads and maps these to the human genome. A simple interface then allows the user to query the data using lists of gene names. CandiMeth generates (i) tracks in the popular UCSC Genome Browser with an intuitive visual indicator of where differences in methylation occur between samples or groups of samples and (ii) tables containing quantitative data on the candidate regions, allowing interpretation of significance. In addition to genes and promoters, CandiMeth can analyse methylation differences at long and short interspersed nuclear elements. Cross-comparison to other open-resource genomic data at UCSC facilitates interpretation of the biological significance of the data and the design of wet-lab assays to further explore methylation changes and their consequences for the candidate genes.
Conclusions: CandiMeth (RRID:SCR_017974; Biotools: CandiMeth) allows rapid, quantitative analysis of methylation at user-specified features without the need for coding and is freely available at https://github.com/sjthursby/CandiMeth.

Dear Professor Zhou,
Many thanks for your letter with an interim decision on our MS above; we appreciate the kind words and positive feedback from the reviewers and yourself. This is particularly the case given that they experienced problems with the workflow, which we had hoped to avoid through having it tested by several users prior to submission. The difficulty was due in large part to the withdrawal of a tool from Galaxy without notice, as well as a time-dependent decay of the converted dataset collections. We have addressed both these issues, tested extensively, and revised the manuscript [revisions in blue]. Our detailed responses to the comments, which we have numbered for ease of reference, are given below.
Editor's comments:
1. ... please register any new software application in the bio.tools and SciCrunch.org databases to receive RRID (Research Resource Identification Initiative ID) and biotoolsID identifiers.
The CandiMeth RRID: SCR_017974 and Biotools identifier (Biotools:CandiMeth) have been added to the abstract and at the start of Methods on p5.
2. Please also ensure that your revised manuscript conforms to the journal style, which can be found in the Instructions for Authors on the journal homepage.
We have done our best to match journal style as indicated on the homepage.
3. If the data and code has been modified in the revision process please be sure to update the public versions of this too.
These have been updated to match the latest version.
Reviewer #1: It's been a delight reading your manuscript. I agree that publishing such workflows that ultimately serve as a tool is worth considering. This paper describes so concisely the method used on CandiMeth that even without deep knowledge on the subject it is easy to understand.
We really appreciate the positive comments here and have made every effort to address the suggestions made, as indicated below.
Specific Comments:
1. While reading p4, on line 6 I missed a reference to indicate the comparison of intersample reproducibility.
Thank you for highlighting this: we have inserted the reference we were thinking of, showing high inter-sample reproducibility using the array (Bibikova et al Genomics 98:288 2011), and rephrased the sentence to make our meaning clearer: "...where a lower CpG resolution is satisfactory, but where greater inter-sample reproducibility is required [16]." [p4, L6]
2. On Fig 2, is it normal that bars exceed the marked limits on tracks and overlap?
This issue was due to some inadvertent editing of the image before submission: a new version has been generated using CandiMeth and, as can be seen, the default spacing between the tracks in the UCSC browser is sufficient and there is no overlap.
3. p12, ChAMP output is Supp Table 5, I suggest including this information in the GitHub Guide.
For this resubmission, we have written a much more extensive step-by-step User Guide which is available through the GitHub page (section A there) and is attached as a file with this resubmission. This more complete User Guide has specific sections on using ChAMP data (sections 3.4 and 4.2). We have retained a Quick Start guide on the GitHub page for more experienced users in Markdown format (section G) and this also flags more clearly that ChAMP outputs can be used in addition to RnBeads:
"At least one file containing information on methylation differences between two samples produced from either RnBeads or ChAMP (for Input Type 1 below)" [GitHub, section G, L5]
4. Abbreviations, include KD since it's repeatedly used in the figures.
Done.
5. GitHub and Workflow. The website is very well designed, and docs are easy to follow and very detailed. I followed these instructions to test it but the workflow didn't get to run. My new histories are empty. I also tried to copy the files into a new history and run the workflow from there, with same result. I got an error file from Galaxy, which I'll try to attach to this report.
This was very unfortunate and we apologize for the inconvenience. Several of us had tested the workflow and histories before submission, when everything worked well. We have traced the problem to two sources: converting the R output files into a dataset collection for input and the unnotified removal of tools from Galaxy. To maximize ease of use for bioinformatics novices, we wanted to have all the datasets ready to use and so had converted the input differentially methylated probes table from R into a dataset collection in the History. Unfortunately, however, these appear to be temporary files and decay over time. We have therefore not done any conversions this time and instead have moved the instructions in the guide on how to do that to earlier in the procedure (now step 5 in the Quick Start guide and Section 2, Getting Set Up step 3 in the complete User Guide), so that the user does the conversion themselves, which solves that issue.
6. Next I downloaded the history and workflow locally to test on my local Galaxy in Docker. Two tools weren't installed and I had to find on the Galaxy Tool Shed, finding their names on the original galaxy.org server (the names on the paper or workflow are modified).
This was the second problem: the Galaxy team had removed the Cut and Join tools from the newer version of Galaxy between submission of the MS and review, so the workflow would not work. We have removed these tools and recoded the relevant parts in AWK (Steps 6.1 and 6.2 in the new version of Fig.4). The relevant text has been added or changed in the Figure legend and main MS (pp14 & 15) as well. Table 1 has also been updated with the new tool names, and we have added a column indicating what parts of the workflow use each tool.
In addition to the changes above, we had to recode the original import of data in SED due to withdrawal of another tool in the interim; this has also been updated in the table, Fig.4 (new step 2.1) and relevant text on p14.
7. When importing the history, the dataset lists need to be re-created manually, so I did it with Supp Table 1.
See response to point 6 above.
8. When running the workflow it got stuck again, this time I realized that a box called variable was not linked, and removing it the workflow would run but not create outputs due to this missing input.
This was related to the tool issue and is no longer a problem with the updated workflow.
9. I run Galaxy 18.04, while the workflow requires Galaxy 19 features.
Unfortunately we only have access to Galaxy 19 features as we do not have a private instance; however, we wanted the workflow to be accessible by users without access to their own instance, so have concentrated on that version. The Tools in Table 1 can nevertheless be used to downgrade the workflow to function with Galaxy 18.04 if so desired. We have also provided a link to a downloadable version which should import and work on other Galaxy instances (see points 12 and 13 below and Appendix 3 of the User Guide).
10. Indicate minimum requirements to the workflow.
If the user is working through the on-line version of Galaxy (usegalaxy.org) as is the intention, this is quite platform-independent. In order to understand and potentially alter the workflow they would need to know what tools are used where in the workflow, and this is indicated in the updated Table 1. For their own instance of Galaxy, we have included a link to download and import the workflow (Appendix 3 of User Guide) and a .yaml file detailing the technical requirements. We now refer to this under Overview on p8 of the revised MS: "CandiMeth is optimised to work on the latest version of Galaxy (19.0) through the Galaxy website (www.usegalaxy.org), thus making it platform-independent. For users who have their own instance of Galaxy, the workflow can be downloaded and imported via a link on the GitHub page above, where a .yaml file is also available."
11. Functional workflow for a lower Galaxy version is suggested but not required.
See response to point 10 above and to next point.
12. Usage of this workflow only for users of galaxy.org misses out a large proportion of researches. I suggest inclusion of instructions to download and upload the data and history to a custom Galaxy instance, if not already present.
We have now included information in the User Guide and a link to allow users to download the workflow and import it into their custom Galaxy instance (User Guide Appendix 3). This is mentioned in the main MS text (p8) now as well.
13. Include a .yaml file listing the packages used by the workflow as specified by Galaxy.
This has been created and uploaded with the revised MS and mentioned in the text (p8).
14. I'd like to be able to use the workflow, please check if there's a bug or why it isn't working for me.
We're pleased the reviewer wishes to use CandiMeth going forward: the new version should be clearer and more robust, and was working for multiple users at time of resubmission.
15. A typo on the docks: on part C, candimeth outputs, third line, "you wish to view".
Done.
Reviewer #2: In general I like this work, it focused on a very straightforward but common required demand: Mapping differential methylated status to whole genome, and match track with other genomic results. It's a good attempt to integrating traditional R package, cloud computing resource, and UCSC browser. I hope this pipeline is robust enough among these various tools, and hopefully the author in the future would not be troubled too much for constantly upgrading of any of these gears.
We thank the reviewer for their positive and constructive comments and are pleased that they also feel it meets a common demand. The concern regarding tools is justified given the withdrawal of two tools between submission and review: we have recoded parts of the pipeline as indicated, and will be monitoring the workflow on a weekly basis, as we have a number of frequent users in our own labs.
We have also given more comprehensive guidance now as part of a new User Guide, an extensive document taking users step-by-step through tutorials and indicating how to upload and convert data as well as many other functions. There is also a Quick Start Guide as section G on the GitHub page for those familiar with the program and who want a quick reference guide only.
It does highlight a need for Galaxy to be more transparent and less cavalier in their treatment of Tools, as we have experienced this problem once before during workflow development: we have written to those maintaining the Galaxy resource and hopefully this will help ensure changes are flagged in time to allow smooth transitions to newer CandiMeth versions to adapt to new tools.
Specific Comments:
1. I run the default CandiMeth history, with RnBeads (Supp. Table 1), but the "results table" for region statistic are always empty with 0 rows. The tracks are generated successfully, but seems the results tables are not. I just followed the Step-by-Step guild on google drive, not sure if I missed anything or the guild should be improved.
This would have been due to the problems which arose between submission (tested and working) and review involving 1) dataset conversion and 2) Tool replacement. These have been detailed above (see responses 6 & 7 to Reviewer 1), and are now corrected and the new version extensively tested. In case our responses to reviewer 1 above are not visible to you, in brief: 1) we had converted the R output files into dataset collections in the history to try and make it more user-friendly, but the collections were not as stable when done like this; users must now convert the example data themselves, a simple step we have now detailed as step 5 in the Quick Start guide on GitHub and Section 2, Getting Set Up step 3 in the complete User Guide document; 2) some tools (Join, Cut) were withdrawn by the Galaxy team between submission and revision; we have recoded these parts and updated Fig4, Table 1 and the text to reflect these changes.
2. The ChAMP Demo is not working, without any data generated. I used default "Supp. Table 5" for test. The error is: "Input dataset 'Supp.Table5' was deleted before the job started"...
This would reflect the same problem with the converted dataset: if the user does the conversion themselves then this will not be a problem. As a visual reminder, the Supp.Table5 dataset says underneath "(unconverted)". See also step 5 in the Quick Start guide on GitHub:
"The input Differential Methylation Table has to be converted from a table into the form of a Dataset Collection: This is in case there are multiple differential methylation tables to be assessed, then CandiMeth can assess them all simultaneously and present them in the typical Results and Tracks outputs, as opposed to multiple outputs that might make your history very crowded:
-Click on the already checked box at the top of the History panel (mouse over shows "Operations on multiple datasets"): this will cause checkboxes to appear beside all of your datasets as well as some choices to appear at top
-Check the box beside the Differential Methylation Table dataset(s)
-Under the pulldown menu beside "For all selected" choose "Build Dataset List"
-In the window that appears, you can give the collection a new name e.g. "DMP set1" and click "Create"
-A new entry will appear in the RHS with the new name and "a list with 1 (or more) items"; this is the Dataset Collection and is now ready to be processed by CandiMeth
Upload Input Type 2: Candidate Features of interest"
There is also more detailed guidance on this, with screenshots etc., as part of the more extensive User Guide (Section 2, Getting Set Up step 3). If these steps are followed, then the workflow will process the data without any errors.
3.Minfi is also an important package, and I think it generated DMP tables as well, however, the paper did not mention (or cite) minfi at all, nor the pipeline. Is that because minfi's result is similar to ChAMP or RnBeads or some other reasons?
Minfi is not an end-to-end pipeline per se, but rather an individual workpackage which has to be run using more bespoke coding in R: however it is called as part of the RnBeads and/or ChAMP pipelines, with the outputs then further handled and integrated into the html outputs by these two more user-friendly pipelines, which only require a few lines of code to run. As we are aiming primarily at users of RnBeads and ChAMP, the Minfi DMP outputs will be presented at the end of these pipelines as RnBeads or ChAMP DMP tables, and so are catered for in this way.
Minor suggestions:
4. CandiMeth provides some nice features like BLAT Primer Designing, Repeats Analysis, which are not mentioned in the end part of introduction. Some researchers (like me) would prefer to find key features like this on that part, so maybe it's a good idea to include them. I discovered these features only at the later section of paper.
Two sentences have been added to the end of the Introduction to highlight these features: "This also facilitates the design of assays to cover specific CGs using Blat.2 (p4)….. It also has a bespoke analysis allowing estimation of methylation differences at repetitive sequences by leveraging the RepeatMasker tracks at UCSC." (top p5)

5. I would prefer to put Step-by-Step guild in Github repo as well in Markdown format, instead of a PDF on google drive...
As mentioned above, we have now written a much more extensive step-by-step User Guide (attached to resubmission and available as download from GitHub site, rather than Google Drive) which provides comprehensive instruction in how to use CandiMeth, with screenshots and complete tutorials. This has been found very useful and easy to follow by our team of beta-testers in-house. As this would be far too long to code in Markdown, we have instead put a brief Quick Start Guide for the experienced user in Markdown as section G of the GitHub page.
6.As far as I know, ChAMP does not provide csv download for DMP table, so I think it worth add one section in guild for data converting from those R package. It may only cost 1-2 lines of R code but still worth being mentioned.
We'd like to thank the reviewer for highlighting this: some lines of R code have been added to the Guide to allow users to convert their ChAMP outputs into csv file format (User Guide section 4.2 Locating data files in ChAMP):
"-If your ChAMP related output has not been produced as a .csv file outside of R, please see the below instructions on how to write your differential methylation (where x is the element number of the file comparison you wish to write to the .csv file and myDMP is the resulting object of running champ.DMP() as within the ChAMP vignette (https://www.bioconductor.org/packages/3.7/bioc/vignettes/ChAMP/inst/doc/ChAMP.html#section-differential-methylation-probes))
-For the output of multiple comparisons:
compnames <- names(myDMP)
for (i in 1:length(compnames)) { write.csv(myDMP[[i]], file = paste(compnames[i], ".csv", sep = ""), quote = FALSE) }
This will create all probes differential methylation tables within your documents folder"
Further guidance and screenshots are shown in the User Guide.
7. In many paper, "DMP table" means CpG probs only show significant differentiation between phenotypes, like P value <= 0.05 .eg. However, the "DMP Table" used in CandiMeth is actually all Probe's differential analysis result (including non-significant ones), without any cutoff selection. I think it should be mentioned in paper, as many tools would automatically return only significant probes.
We were careful to refer to the tables as containing differentially methylated probes or regions, not significantly differentially methylated probes/regions, but we have added some text to highlight this distinction more [under Example outputs, p10]: "Note that this track shows all differences in methylation, however small: the FDR-corrected probes are shown in the next track." We hope that the generation of a separate track showing only the FDR-significant probes in the outputs also highlights this difference. CandiMeth therefore generates three types of tracks, all of which we and our collaborators have found to be useful: 1) absolute methylation level tracks, one per sample (raw, all probes); 2) differential methylation (Δβ) tracks plotting differences in methylation between pairs of tracks; and 3) tracks showing only the probes that are significant at FDR <0.05 from the comparisons. These designations have been made clearer in the example outputs section on p10.
We wanted to generate tracks showing all probes, as well as FDR-significant only probes, since in our experience we had found that many smaller sample sets had no probes with FDR<0.05. The comparison between the two types of track is also valuable for example if trying to establish if there are any probes in a region at all. The use of all differentially methylated probes rather than just FDR-corrected allowed us to detect the gain in methylation in the PCDHG exons shown in Case Study 4 here: our previous analysis of the PCDH loci (O'Neill et al Epigenetics & Chromatin 2018) had missed this due to only using the FDR-corrected probes.
8."RnBeads" and "Rnbeads" can both be seen in paper, is that a typo? The same for "ChAMP" and "Champ".
Yes, apologies for these typos: we have gone back carefully over the text and ensured all instances match "RnBeads" and "ChAMP" (however the workflow accepts all variants!).

However, looking at specific genes requires bespoke coding which wet-lab biologists or clinicians are not trained for. This leads to high demands on bioinformaticians, who in turn may lack insight into the specific biological problem. We therefore wished to develop a tool for mapping and quantification of methylation differences at candidate genomic features of interest, without using coding, to bridge this gap.

Introduction:
Epigenetics can be defined as stable, and most often heritable, changes to the chromatin which do not alter the DNA sequence itself but still impact gene expression and/or are required to maintain genomic stability [1]. These modifications consist of reversible marks such as cytosine DNA methylation or histone modifications, each critical to gene expression regulation, imprinting, X-inactivation and many other processes from mammalian gestation to later life [1].
Cytosine DNA methylation is the most common and thoroughly investigated of these epigenetic alterations. It is characterised by the addition of a methyl group to a cytosine residue, many of which are located within so-called CpG islands (CGI) close to gene promoters [3]. High levels of DNA methylation at promoters aid in the stable long-term repression of the cognate genes, such as can be seen on the inactive X chromosome in mammals [4]. Methylation at control elements such as insulators or enhancers can also help regulate regional gene expression, with multiple examples being seen among imprinted genes [5] or gene clusters such as the protocadherins [6]. High levels of methylation are seen on selfish DNA elements such as endogenous retroviruses, where it plays an important role in their suppression [7], as well as at inert regions of the genome like pericentromeric repeats [8]. More recently, methylation through the body of the gene has been recognised as contributing to maintaining gene transcription levels at highly-expressed genes [9,10]. As well as showing such developmental programming, DNA methylation is susceptible to environmental influence, with inputs such as diet [11,12] and exposure to pollutants like cigarette smoke [13] having clear and reproducible effects on methylation levels, sparking great interest in analysis at a population level, particularly in humans [14].
Advances in sequencing technology have allowed us to quantify and analyse methylation via whole genome bisulphite sequencing at approximately a 28 million CpG resolution [15]. While this technique remains the gold standard for whole genome methylation assessment, it can be very expensive, prohibitively so when there are hundreds of samples to be tested and analysed, and quantifying small differences reproducibly between multiple samples is also challenging. An alternative technology known as a microarray, which predates the era of whole genome bisulphite sequencing, is often a popular solution for such cases, where a lower CpG resolution is satisfactory, but where greater inter-sample reproducibility is required [16]. A popular choice here is the Illumina Infinium Methylation BeadChip array [17], which currently covers 850,000 CpG sites across the human genome, including 99% of RefSeq genes and large numbers of enhancers and other features. This can help elucidate the effects of an intervention across hundreds of samples in a cost-effective manner. There are many packages across multiple computational languages to analyse the outputs from these arrays, such as RnBeads [18] or ChAMP [19], but these pipelines operate in the statistical programming environment R and require some coding. Additionally, the output file formats can be overwhelming and difficult to investigate further without experience in data analytics and bioinformatics. This situation is exacerbated by the typically higher number of samples in epidemiological or intervention studies where such arrays are commonly used.
To help solve this predicament, we developed a Galaxy workflow known as CandiMeth, which takes the main output from such methylation analysis pipelines and pairs this with a list of features that the user may wish to investigate. The workflow first generates tracks showing both absolute methylation levels in samples and differences in methylation between samples. These can be viewed via the UCSC genome browser and overlaid with other available tracks such as CpG islands, enhancers, ChIP data etc. to allow data exploration and more intuitive analysis. This also facilitates the design of assays to cover specific CGs using Blat. The workflow can then help confirm any patterns observed by quantifying data across the identified regions or features e.g. methylation differences at specific sets of genes between cases and controls. It also has a bespoke analysis allowing estimation of methylation differences at repetitive sequences by leveraging the RepeatMasker tracks at UCSC. The workflow removes the need for further analysis in R and increases reproducibility by using an automated process, but in a more user-friendly manner.

Methods
CandiMeth (RRID:SCR_017974; Biotools:CandiMeth) is designed to work downstream of DNA methylation analysis pipelines in R. It was developed initially using RnBeads as reference, but has been subsequently successfully run with ChAMP and other packages (see below). ChAMP (RRID: SCR_012891) [20] and RnBeads (RRID: SCR_010958) [17,20] are end-to-end pipelines in R which can take raw data files such as IDATs and bam files from microarray readers or sequencers and process these to allow data exploration, visualisation and comparison. For array data, which is the main area where CandiMeth addresses an unmet need, IDAT files containing raw values for the red and green channels for each of ~850K probes are exported from the microarray reader. RnBeads/ChAMP can perform quality control, remove probes with low signal or overlapping with single nucleotide polymorphisms (SNPs) and provide a cleaned dataset giving absolute levels of methylation as beta (β) or M values. The packages can also facilitate exploratory visualisation through principal component analysis or similar and allow grouping of data prior to looking for differential methylation. Probes showing substantial differences in methylation (delta beta: Δβ) can be identified and then ranked based on a variety of parameters, including probability of occurrence (p value), delta beta, false discovery rate (FDR) or a combination of several of these. The packages can look for enriched gene sets using gene ontologies/GSEA [22] and visualise differences for annotated categories of array probe such as promoter and gene body.
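To make the quantities above concrete, the following minimal Python sketch converts β values to M values using the standard logit transform, M = log2(β / (1 - β)), and computes Δβ between two groups. It is only an illustration: the probe names and values are invented, the clamping constant `eps` is an assumption, and RnBeads and ChAMP compute these quantities themselves in R.

```python
import math

def beta_to_m(beta, eps=1e-6):
    """Convert a beta value (0 to 1) into an M value, log2(beta / (1 - beta)).

    eps is an assumed clamp that keeps the logarithm finite at 0 or 1.
    """
    b = min(max(beta, eps), 1 - eps)
    return math.log2(b / (1 - b))

def delta_beta(beta_reference, beta_test):
    """Difference in methylation (test minus reference) on the beta scale."""
    return beta_test - beta_reference

# Hypothetical probes: (mean beta in reference group, mean beta in test group)
probes = {
    "cg_example_1": (0.80, 0.35),
    "cg_example_2": (0.50, 0.52),
}

for probe_id, (ref, test) in probes.items():
    print(probe_id,
          "delta beta:", round(delta_beta(ref, test), 3),
          "M (ref):", round(beta_to_m(ref), 3),
          "M (test):", round(beta_to_m(test), 3))
```

A negative Δβ here indicates loss of methylation in the test group relative to the reference; the sign convention is a choice made for this sketch.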
While packages for array analysis provide genome-level data such as whether promoters in general are losing or gaining methylation, querying specific gene sets which might give more A ready way of assessing the degree to which methylation is changing across a particular region and the exact location of the probes also greatly facilitates the design of gene-specific assays such as primer sets for pyrosequencing or clonal analysis. It is also generally of interest to try and leverage the enormous pool of publicly-available data accessible through UCSC genome browser tracks to explore possible novel correlations between methylation changes in your dataset and other genome characteristics such as replication timing, histone modifications, or similar.
We therefore wished to develop a user-friendly non-computationally intensive method of candidate feature investigation which avoided command-line but was more powerful than browser-only interfaces. To this end we chose the Galaxy (RRID:SCR_006281) platform at [23] which is a free open-source environment for user-friendly and reproducible bioinformatics [24]. It provides a variety of data manipulation and analysis tools via a web interface with no prior installation or dependency packages required, with results stored within the Galaxy infrastructure and every action producing a new history entry so the original data is never compromised via destructive edits. Galaxy also allows users to aggregate analysis steps into repeatable pipelines called workflows which can be easily shared, along with the histories, via URL or username. These can allow biologists with little bioinformatics experience to conduct complex analyses on their own data within a system which has a low maintenance requirement and with little worry over data storage or data corruption. Moreover, workflows can be published to a repository such as GitHub (RRID:SCR_002630) or MyExperiment (RRID: SCR_001795) [25] or within a scientific journal, further encouraging open data science and reproducibility. Galaxy also provides many plugins such as interactive visualisation software to view results, the option to export results to genome browsers and the option to configure tools, or indeed an entire Galaxy instance, to the desired end-user needs.

Overview of workflow
The main process undertaken by CandiMeth is to take as input the methylation data from an R pipeline such as RnBeads or ChAMP and 1) visualise the data as tracks in the UCSC genome browser and 2) analyse the methylation differences relative to genomic features specified by the user. The workflow comprises three main steps: Inputs, Feature Mapping and Analysis (Fig.1). There are also four items required at the input stage: the user must a) indicate the R package used with the keywords 'RnBeads', 'ChAMP' or 'Custom', then supply b) the methylation data, c) a list of the genes of interest, and d) specify the human genome build to be used e.g. hg19. The basic workflow for CandiMeth is that the genes of interest are mapped to the reference genome and then cross-referenced with the input methylation data to get feature-specific statistics. The workflow can currently look at either the promoters (-500 to +1 bp relative to transcription start site; suffix "_P" on results) or gene bodies (the transcription unit; "_GB"), or both parts of the gene together ("_all"). We have found this to be a particularly useful split, since the current consensus is that promoters and gene bodies can show opposite methylation patterns, with methylation at the promoter largely associated with repression, whereas gene body methylation instead is a feature of transcribed genes. Outputs are then grouped in the history into two types, Results or Tracks (Fig.1). The methylation data from the R packages is a standard differential methylation .yaml file is also available.
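The promoter, gene body and combined feature definitions described above can be sketched as follows. This is a hedged illustration in Python rather than the workflow's own implementation (CandiMeth runs as Galaxy tools): the function name, the 1-based inclusive coordinate convention and the strand handling are assumptions, and the gene name and coordinates are invented.

```python
def feature_intervals(gene, chrom, tx_start, tx_end, strand, upstream=500):
    """Return promoter (_P), gene body (_GB) and combined (_all) intervals.

    Assumes 1-based inclusive coordinates with tx_start <= tx_end.
    The promoter spans `upstream` bp before the transcription start site
    (TSS) up to the TSS itself, following the -500 to +1 definition.
    """
    if strand == "+":
        tss = tx_start
        promoter = (max(1, tss - upstream), tss)
    elif strand == "-":
        # On the minus strand the TSS is the rightmost coordinate,
        # so "upstream" lies at higher genomic positions.
        tss = tx_end
        promoter = (tss, tss + upstream)
    else:
        raise ValueError("strand must be '+' or '-'")
    gene_body = (tx_start, tx_end)
    combined = (min(promoter[0], gene_body[0]), max(promoter[1], gene_body[1]))
    return {
        f"{gene}_P": (chrom, *promoter),
        f"{gene}_GB": (chrom, *gene_body),
        f"{gene}_all": (chrom, *combined),
    }

# Hypothetical gene on the plus strand
for name, interval in feature_intervals("GENE1", "chr1", 1000, 5000, "+").items():
    print(name, interval)
```

Keeping the suffix in the feature name mirrors the "_P", "_GB" and "_all" labels that appear on the results.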

Example outputs
To illustrate the type of analysis that can be done, Figure 2 shows outputs from one of the example dataset runs. Here we used as input one of our previously-published differential methylation tables generated by RnBeads (NCBI Gene Expression Omnibus (GEO) identifier GSE90012, the table is also given as Suppl. Table 1) [28].  Mean, the mean methylation value across all probes; 5) SD, the standard deviation; 6)Max., the maximum probe value seen in the feature and 7) Min, the minimum probe value (Fig.2B).
It can be seen that methylation values are much lower in the DNA methyltransferase-depleted cells (d8) for each miR compared to the parental or WT cells, e.g. MIR1185-1 62.6% median methylation in d8 vs 72.2% in KD. While the median and mean are usually in reasonable agreement, in some cases they vary substantially, and data on the number of probes can be useful when deciding on confidence in the results and on any threshold to be applied.
In the Tracks folder CandiMeth also generated four tracks on the UCSC genome browser RnBeads extends further than originally estimated [29]. Note that this track shows all differences in methylation, however small: the FDR-corrected probes are shown in the next track.
Lastly, an FDR-corrected track ('FDR_D8', Track 4) was also produced: this showed information only for those probes where the R package has assessed the false discovery rate to be below 0.05, as this is a statistical cut-off implemented by many array users. This is an excellent method for visualising only those CG sites which have high-confidence differences in methylation between samples. Here, only a single probe passed the FDR threshold and is shown: the absolute methylation level at the probe is given, as p values would not scale correctly.
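The filtering behind such a track can be sketched as follows. This is a minimal illustration under stated assumptions, not the Galaxy tool chain CandiMeth actually uses: the function name and probe tuple layout are hypothetical, probe positions are assumed to be 1-based, and the output follows the standard UCSC bedGraph layout (0-based, half-open intervals). The placeholder message is the one reported in this article for runs with no significant sites.

```python
from typing import Iterable, List, Tuple

# (chromosome, 1-based CpG position, absolute methylation, FDR-adjusted p value)
Probe = Tuple[str, int, float, float]


def fdr_track(probes: Iterable[Probe], name: str = "FDR_D8",
              cutoff: float = 0.05) -> List[str]:
    """Build bedGraph lines for probes whose FDR is below `cutoff`."""
    lines = [f'track type=bedGraph name="{name}"']
    kept = [p for p in probes if p[3] < cutoff]
    if not kept:
        lines.append("#No FDR significant sites")
        return lines
    for chrom, pos, meth, _fdr in sorted(kept, key=lambda p: (p[0], p[1])):
        # bedGraph intervals are 0-based and half-open, so a single
        # 1-based position `pos` becomes [pos - 1, pos).
        lines.append(f"{chrom}\t{pos - 1}\t{pos}\t{meth}")
    return lines
```

The value written to the track is the absolute methylation level rather than the p value, in line with the description above.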
One of the most powerful features of using this approach is that data can easily and more intuitively be compared to other UCSC tracks (Fig.2C, 5-7). The specific CpG site can be identified in UCSC for example by right-clicking on the column on the track, or by typing the CG identity into the UCSC browser search window, which will then pull out a track with the

Data Preparation and inputs:
A complete User Guide document with step-by-step tutorials is available at [26]; here we describe more general features of the workflow. As indicated, CandiMeth runs in the Galaxy environment: users must first create an account and copy the CandiMeth test history and workflows to their account, as explained in the Guide. Once these simple steps have been carried out the first time, they do not need to be repeated. When CandiMeth is being run, the initial window will look as shown in Fig.3: the workflow occupies the central window, while the example data and datasets required for the workflow are in the History window at right; the left window (Tools) will not be used. On initialising, the workflow window will look as shown, with one Yes/No choice and four fields (numbered 1-4) to fill in. We recommend saving the outputs of CandiMeth to a new history when initiating the pipeline. This will 1) enable you to continue working on other tasks while CandiMeth is running in the background (the workflow can take a while to run depending on server usage) and 2) segregate the current job from the reference datasets in the CandiMeth initial history, which avoids  Table 5 in the CandiMeth History.
If the custom option is chosen at Input 1 above, the user can input a data frame of any origin as long as it follows the default CandiMeth format, namely: Chromosome; Start; cgid; mean.X; mean.Y; the difference between the two groups; and the FDR-corrected p value (where X and Y are the names of the experimental and control groups respectively). Data frames can also be rearranged in Galaxy using the text manipulation tools "cut" and "join" (see Guide). Alternatively, they can be uploaded as a list in a tab-delimited file format at this step.
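As a hedged illustration of the custom format, the sketch below writes a tab-delimited table with the seven columns in the order listed above. Several details are assumptions rather than documented CandiMeth behaviour: the presence of a header row, the names "mean.diff" and "FDR" for the last two columns, the direction of the difference (experimental minus control), and all example values, including the probe identifier.

```python
import csv
import io
from typing import List, Sequence


def custom_header(x_name: str, y_name: str) -> List[str]:
    """Column names; the last two names are illustrative placeholders."""
    return ["Chromosome", "Start", "cgid",
            f"mean.{x_name}", f"mean.{y_name}", "mean.diff", "FDR"]


def custom_row(chrom: str, start: int, cgid: str,
               mean_x: float, mean_y: float, fdr: float) -> list:
    """One probe row; the difference is taken as experimental minus control."""
    if not 0.0 <= fdr <= 1.0:
        raise ValueError("FDR-corrected p value must lie between 0 and 1")
    return [chrom, start, cgid, mean_x, mean_y, round(mean_x - mean_y, 6), fdr]


def to_tsv(header: Sequence[str], rows: Sequence[Sequence[object]]) -> str:
    """Serialise the table as tab-delimited text ready for upload."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


table = to_tsv(custom_header("KD", "WT"),
               [custom_row("chr1", 100, "cgEXAMPLE01", 0.25, 0.75, 0.01)])
```

A table produced by another pipeline would only need its columns reordered to match before upload.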
To facilitate initial trials, the miR gene names used above have been preloaded into the default CandiMeth history for use, and are also supplied as Suppl. Table 2. The features associated with the gene names are then mapped to the genome using the genomic data discussed next.

Input Type 4: Genome Information
An important part of the CandiMeth workflow is the parsed human genome information used to assign array probes to various genomic features. Example human genome build information used for the mapping part of the CandiMeth pipeline can be found within the CandiMeth history (right-hand pane in Fig.3). The data provided here cover two genome assemblies, hg19 and hg38, and will aid the mapping of candidate features to promoters, the whole gene body region or both (hg19_all option) as defined by RefSeq [31].
Using CandiMeth, users can query RefSeq-defined genes or repeats to obtain the same types of information as can be obtained by analysis in an R package. One advantage here however is that the simultaneous visualisation allows the user to inspect the match between probe location and gene structure for candidate regions of interest: for example, the initial screen CandiMeth allows the user to refine or alter the promoter definition to exclude bases downstream of the TSS for example, and re-evaluate. An approximation of promoter areas of these RefSeq genes was generated for the example data analysis and was defined as the region from 500 bp upstream to the first base (-500 to +1 bp) and is available in the CandiMeth history [32] mentioned above. Similarly, probes were also parsed into gene body and repeat categories for CandiMeth to facilitate user analysis of effects over these types of genomic intervals for their candidate genes of interest.  examples of each of these were given above-the workflow proceeds as follows. identification among the multiple output datasets. The workflow utilises a number of preexisting tools available in Galaxy to carry out these steps (Table 1).

Compilation of methylation data for features:
The dataset collection containing the now correctly named absolute methylation tracks (5.7) is then joined with the mapped features of interest (7). This allows the generation of feature-specific statistics.
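Conceptually, this join assigns each probe to every feature whose interval contains it. The sketch below shows that idea only; it is not the Galaxy join tool used by the workflow, and the tuple layouts, function name and inclusive coordinate convention are assumptions.

```python
from typing import Dict, Iterable, List, Tuple

Probe = Tuple[str, int, float]        # chromosome, position, methylation value
Feature = Tuple[str, str, int, int]   # name, chromosome, start, end (inclusive)


def join_probes_to_features(probes: Iterable[Probe],
                            features: Iterable[Feature]) -> Dict[str, List[float]]:
    """Collect the methylation values of all probes falling inside each feature."""
    features = list(features)
    joined: Dict[str, List[float]] = {name: [] for name, _c, _s, _e in features}
    for chrom, pos, value in probes:
        for name, f_chrom, start, end in features:
            if chrom == f_chrom and start <= pos <= end:
                joined[name].append(value)
    return joined
```

A feature with no overlapping probes keeps an empty list, which makes zero probe coverage explicit rather than silently dropping the feature.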

Output Files
As indicated above under Example outputs, the CandiMeth workflow produces two main types of output:

Tables
Results tables all follow the same layout: feature name, probe coverage, median methylation, mean methylation, standard deviation, maximum and minimum. A partial example of a tabular output for the set of miRs used in the example above is shown in Fig.2B (first five lines) and given in full in Suppl. Table 3. Methylation values for the features can then be plotted within Galaxy via its integrated visualization software, or the table can be exported, downloaded and plotted in the user's preferred software, such as Prism or Excel.
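The layout of a Results row can be reproduced for a single feature with Python's standard `statistics` module, as sketched below. The function name and dictionary keys are illustrative, and the use of the sample standard deviation is an assumption, since the text does not state which estimator the workflow applies.

```python
import statistics
from typing import Dict, Sequence


def results_row(feature: str, values: Sequence[float]) -> Dict[str, object]:
    """Summarise the probe values of one feature in the Results table layout."""
    n = len(values)
    if n == 0:
        # No probes overlap the feature, so no statistics can be computed.
        return {"feature": feature, "probes": 0, "median": None, "mean": None,
                "sd": None, "max": None, "min": None}
    return {
        "feature": feature,
        "probes": n,
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        # Sample standard deviation needs at least two values.
        "sd": statistics.stdev(values) if n > 1 else 0.0,
        "max": max(values),
        "min": min(values),
    }
```

Reporting the probe count next to the median and mean lets the reader judge how much weight a feature-level value can bear.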

Tracks
CandiMeth produced four different tracks from the differential methylation table input in the first step. These are of three different kinds (absolute methylation, relative differences in methylation (delta beta) and FDR-significant methylation differences), as shown in the example above for a cell line system.

Findings
The utility of the CandiMeth workflow may be best illustrated by some case studies.

Case Study 1: Application to array results from model systems
One straightforward use of CandiMeth which has found common use in our lab and among collaborators is to test a specific gene set, as illustrated by the MIR example above (Fig.2). To do this, the user only has to specify a list of the names of the genes they are interested in, together with the genome release, then upload a table containing differential methylation data. This can either be one generated by the bioinformatics team in-house, one which was supplied, typically when array services are bought in, or one which was generated from publicly-available array data such as our dataset GSE90012 described previously [33] and used above.

Case Study 2: Application to EWAS study outputs
A major application of methylation array technology is in epigenome-wide association studies (EWAS). CandiMeth can provide a very useful tool for quickly examining in detail and quantifying methylation differences around candidate regions identified either by the R-based packages or from the literature. Fig.5 shows the application of this approach to an EWAS we have recently published containing data from 86 participants divided into 45 on placebo and 41 on folic acid treatment during trimesters 2 and 3 of pregnancy, which assessed the potential positive effects of prolonging this vitamin treatment beyond the currently recommended periconception and first trimester periods [29]. Output differential methylation tables from RnBeads were used as input for CandiMeth, together with the names of the top candidate promoters reported earlier. This produced a collection of outputs (Fig.5A) including a set of tabular Results for the two groups Placebo and Treatment, as well as a set of Tracks. The Tracks included absolute mean beta, delta beta and an FDR track, although the FDR track returned the message "#No FDR significant sites" (not shown), as is often the case for EWAS if the sample set was small or the perturbation mild. Clicking through to the tabular results (Fig.5B) showed tables indicating the number of probes present at each promoter and mean methylation, revealing for example that median methylation at the CES1 promoter is 2.5% lower in folic acid-treated participants than placebo (666.14 - 641.1 = 25.04; 25.04/1000 ≈ 0.025, or 2.5%).
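The CES1 arithmetic can be checked with a few lines of Python. The function name is hypothetical, and the 0 to 1000 scale of the table values is inferred from the division by 1000 in the text rather than taken from CandiMeth documentation.

```python
def percent_difference(placebo_score: float, treated_score: float,
                       scale: float = 1000.0) -> float:
    """Placebo minus treated, expressed as a percentage of the full scale."""
    return (placebo_score - treated_score) / scale * 100.0


# Median values for the CES1 promoter quoted in the text.
ces1_drop = percent_difference(666.14, 641.1)
print(f"CES1 promoter: {ces1_drop:.1f}% lower in the folic acid group")
```

Rounded to one decimal place this gives the 2.5% difference quoted above.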
Examination of the CandiMeth Tracks (Fig.5C) was however also informative here. This BedGraph track type is set by default to scale to the maximum loss and gain on visualisation. When the UCSC browser is opened on a genomic region of interest, the maximum loss and gain are therefore shown and the graph is scaled to them, so that even when differences in methylation are small, as typically seen in epidemiological studies, the areas of the genome with the greatest changes can be identified at a glance. In-house testing has found delta-beta tracks to be particularly useful, as it can easily be seen whether a feature contains any probes with methylation differences between samples big enough to assess by other means, e.g. pyrosequencing, which can accurately assess differences in methylation greater than 5%. It can easily be seen from the delta beta (track 3) that the biggest loss of methylation was 7% (-0.071). The clustering of sites losing methylation at the promoter is also striking (boxed in green) compared to the rest of the gene, suggestive of a step-change in methylation at this important regulatory element rather than a point source. The seamless integration of BLAT [34] meant that designing primers to verify methylation changes could be done very intuitively, and the area covered by the assay could be mapped against the methylation data to confirm that the assay measures methylation at the exact same location (Fig.5C track 4 "Pyro"). It was also seen from the absolute methylation levels in the samples (tracks 1,2, values for promoters given in Fig.5B) that loss of methylation at the CES1 promoter occurred against a background of high methylation at this region, which suggested this control element is normally methylated and silenced, a type which often responds to even small losses of methylation.
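The reading of a delta-beta track described above (largest loss, largest gain, and whether any probe exceeds the roughly 5% difference that pyrosequencing can resolve) can be expressed as a small sketch. The function name and return layout are hypothetical, and all example values other than -0.071 are invented for illustration.

```python
from typing import Dict, List, Sequence


def delta_beta_summary(deltas: Sequence[float],
                       threshold: float = 0.05) -> Dict[str, object]:
    """Report the extremes of a delta-beta track and the probes above a threshold."""
    losses: List[float] = [d for d in deltas if d < 0]
    gains: List[float] = [d for d in deltas if d > 0]
    return {
        "max_loss": min(losses) if losses else 0.0,
        "max_gain": max(gains) if gains else 0.0,
        # Probes whose absolute change is large enough to verify in the lab.
        "assayable": [d for d in deltas if abs(d) > threshold],
    }


summary = delta_beta_summary([-0.071, -0.03, 0.01, 0.02])
```

Here only the -0.071 probe clears the 0.05 threshold, matching the observation that the largest loss at the promoter is big enough to test by pyrosequencing.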
Additional data to corroborate this could be obtained by examining chromatin state data available through the ChromHMM track in UCSC (Fig.5 track 6) which showed that the promoter falls into the "poised promoter" category (colour-coded pink) and is regulated in part by polycomb-group proteins (grey shading). A low likelihood of SNPs at the pyroassay region could be confirmed by examination of the Common SNPs dataset (Fig.5C track 7) and individual CpGs labelled by searching using the UCSC query window, and their status in other public datasets highlighted if desired (Fig.5C, track 8). Thus CandiMeth allowed quick examination of candidate regions, quantification of differences specifically at these, the assessment of sites which could be verified in the lab and exclusion of confounding SNPs; it also eased assay design and gave additional valuable insights through mining of UCSC datasets, using only a few simple inputs and no coding.

Case Study 3: Analysis of methylation at genomic repeats such as LINE1
Many studies looking for epigenetic changes also try to assess DNA methylation outside of the coding regions. One common approach is to assess methylation at a highly-repetitive interspersed repeat such as LINE1, which is found scattered throughout the genome at ~500,000 copies, so in theory sampling methylation across many locations. This normally has to be done using a separate wet-lab assay such as pyrosequencing, since the 450K and EPIC arrays are designed to cover genes and their associated control elements, not repetitive DNA.
However, as has been noted elsewhere [32 , 33], a substantial number of probes on the arrays, particularly the EPIC, nevertheless fall within repeats such as LINEs and SINEs. Taking advantage of this, we parsed data from the RepeatMasker track on UCSC to allow mapping and quantification of methylation at the major repeat classes using array data (Fig.6A). By simply listing the categories of repeat given by RepeatMasker (as in Suppl. Table 4), it is possible to obtain summary statistics indicating the numbers of probes overlapping the respective elements, together with median methylation etc. from any differential methylation table, in this case from our experiment comparing WT and DNMT1-deficient cell lines (Fig.6B). It can be seen from the tables that very substantial numbers of probes on the EPIC map to the various repeat classes, with ~20,000 probes in LINE elements spread across the genome, and equal numbers in SINE elements, with satellite repeats near centromeres showing the lowest coverage at ~1000. The summary data was exported to Excel and graphed to highlight where the greatest differences lay (Fig.6C), which showed that satellite sequences appear to be most demethylated on average, with notable decreases at LINE and long terminal repeat (LTR)-containing elements too, which would include endogenous retroviruses for example, whereas low complexity and simple repeats show almost no changes, despite good probe coverage (Fig.6B). Thus CandiMeth allowed straightforward assessment of repeat methylation across the genome without the need for wet-lab analysis, and gave novel insights into the differential effects of DNMT1 loss on individual repetitive DNA classes.

Case Study 4: Analysis of methylation changes seen at a large complex gene locus in multiple samples using parallel processing in CandiMeth
A powerful feature of CandiMeth is the ability to process data from multiple differential methylation analyses at once. To illustrate this, we took three sets of comparisons between the independently-derived DNMT1 knockdown cell lines described earlier (d8, d10 and d16), each of which had been compared to the parental WT cell line, and processed them simultaneously. In our earlier publication [28] we had found differences between the variable A and B classes and the variable C class of exons at the important neurodevelopmental gene cluster Protocadherin β (PCDHB), with the A&B classes showing severe loss of methylation but no change at the C class. This highlighted differences between these classes, which indicate 1) a hyper-dependence on DNMT1 for maintenance of methylation levels and 2) a potential difference in methylation-dependence which may track with allele usage, since the A&B classes show monoallelic expression but not the C class. Here, we wished to examine the neighbouring PCDHG locus, which has a similar structure, and see if the same effect could be seen there.
We therefore generated a candidate region list containing the names of the gamma cluster genes and input this as our candidate feature list to CandiMeth, together with the three differential methylation tables from RnBeads (d8 vs WT, d10 vs WT, d16 vs WT). All three sets are processed at once (Fig.7A, left) and give as outputs data on absolute methylation levels in each KD line as well as the WT parental line (which will not vary), from which summary tables were derived specific to the PCDHG exons; example data for one A and one C exon in each cell line only is shown (Fig.7A, right). In Fig.7C we show the differential methylation (delta beta) tracks, from which it appeared that methylation was largely lost across the region of the gene containing the A&B class variable exons (Fig.7C, region boxed in red), though some gains (blue peaks) could be seen, particularly in the d10 track. Additionally, given the size of the region (~200kb) it cannot be assessed whether many of the probes lie in the introns rather than the exons themselves. For the C class exons (Fig.7C, blue box at right) most changes appeared to be gains (blue), though peak sizes were smaller and interspersed with some individual large losses in red. To resolve the exact nature of the changes seen, the tabular data (Fig.7A) was exported and median values across all A&B exons vs WT generated, converted back to β value to allow direct comparison to previous results [29] and plotted (Fig.7D). This clearly showed a general loss before is likely to be because our previous examination of the C class exons at PCDHG used the FDR-significant probes only, and as can be seen the magnitude of the gains at the C class exons is much smaller than the losses at the A&B classes (compare scales in Fig.7D and E).
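The export-and-summarise step described above can be sketched as follows: for each knockdown line, per-exon values are compared to WT, grouped by exon class, reduced to a median and rescaled to β. The function names, the grouping into "A&B" and "C", the assumed 0 to 1000 table scale and every number in the example are illustrative assumptions; only the gene names are real PCDHG cluster members.

```python
import statistics
from typing import Dict


def median_delta_by_class(kd_scores: Dict[str, float], wt_scores: Dict[str, float],
                          exon_class: Dict[str, str],
                          scale: float = 1000.0) -> Dict[str, float]:
    """Median (KD - WT) per exon class, converted to the 0-1 beta scale."""
    by_class: Dict[str, list] = {}
    for exon, kd_value in kd_scores.items():
        delta = (kd_value - wt_scores[exon]) / scale
        by_class.setdefault(exon_class[exon], []).append(delta)
    return {cls: statistics.median(vals) for cls, vals in by_class.items()}


def compare_lines(lines: Dict[str, Dict[str, float]], wt_scores: Dict[str, float],
                  exon_class: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Apply the same summary to several knockdown lines at once."""
    return {line: median_delta_by_class(scores, wt_scores, exon_class)
            for line, scores in lines.items()}


# Invented example values on an assumed 0-1000 scale.
exon_class = {"PCDHGA1": "A&B", "PCDHGB1": "A&B", "PCDHGC3": "C"}
wt = {"PCDHGA1": 800, "PCDHGB1": 700, "PCDHGC3": 500}
lines = {"d8": {"PCDHGA1": 600, "PCDHGB1": 500, "PCDHGC3": 520},
         "d10": {"PCDHGA1": 650, "PCDHGB1": 600, "PCDHGC3": 510}}
summary_by_line = compare_lines(lines, wt, exon_class)
```

Keeping the per-class medians for each line side by side makes it easy to see whether a loss at the A&B exons and a smaller gain at the C exons are reproduced across independently derived lines.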
The analysis thus confirmed and extended observations from our previous study that the A&B class variable exons at the clustered protocadherin loci are hypersensitive to loss of DNMT1 across multiple independently-derived cell lines, suggesting a strong dependence on this enzyme for maintenance of epigenetic state at this important neurodevelopmental locus.
Further, we have uncovered new evidence for differences between the A&B exons and the C exons, which may reflect divergent transcriptional control, or an increased transcription across the C class exons in response to loss of DNMT1, in line with observations that intragenic DNA methylation is associated with transcription at active loci [10 , 34]. In terms of CandiMeth functionality, the study highlights the ability of the workflow to process multiple comparisons in parallel and the value of being able to directly compare the visual outputs and the quantitative data where complex genetic loci are being examined, giving insights into the underlying biology.

Conclusions and Future Directions
CandiMeth provides a user-friendly, non-computationally intensive method of candidate feature investigation. With a minimum of training and no coding, users of CandiMeth can set up and run quite advanced exploratory and confirmatory analyses and use the rich set of existing data in UCSC to formulate and test hypotheses regarding the methylation changes they are seeing.
In future versions, we hope to add support for further methylation processing pipelines and continue to grow the CandiMeth history with additional genomic data such as DNA hypersensitivity sites. In addition to the current pipeline, we also wish to make CandiMeth more intuitive via the creation of a Galaxy tool which would allow the pipeline to be extended to whole genome bisulphite sequencing or RNA-seq data and would also allow further analysis options for those with a private instance of Galaxy.

Availability of supporting data
All supporting data and materials are available in the GigaScience GigaDB database [35].

Ethics approval and consent to participate
Not Applicable

Consent for publication
Not Applicable

Competing interests
The authors declare that they have no competing interests.

Funding
Work was funded in part by grants from the Medical Research Council (MR/J007773/1), the EpiFASSTT grant from the ESRC/BBSRC (ES/N000323/1) and the HDHL EpiBrain award from the BBSRC (BB/S020330/1).

Authors' contributions
SJT generated and tested the workflow and accompanying datasets and drafted the MS; DL helped with Sed and workflow debugging; REI provided supervision and feedback on early versions; KP and SDZ provided guidance and comments; CPW designed the study, generated

showing only those sites whose differential methylation meets the cut-off criteria of a 0.05 false discovery rate. (5-7) Examples of some of the tracks available through the UCSC genome browser which can be aligned and directly compared to CandiMeth tracks: 5) HAIB Methyl450, data on comparative methylation from ENCODE projects; 6) Pyro, the Blat tool in UCSC, which can be used to find primers for pyroassays to cover one or multiple CG; 7) RefSeq track, showing the location of the top two MIR from (B) above.

results to a new history (recommended), then specifies 1) which R package was used to preprocess the data e.g. RnBeads; 2) the dataset collection table of pre-processed data (available sets will appear in the drop-down menu); 3) a list of the genes/other features of interest to analyse; 4) the reference genome to be mapped to e.g. hg19. Once all four have been decided, the user clicks on the blue "Run workflow" button at top right to initiate a run.

A complete Guide to setting up and using CandiMeth, including some background on Galaxy and UCSC browser, how to import the workflow and example files, tutorials on the use of the example data, and further guidance and instruction.