CNVcaller: highly efficient and widely applicable software for detecting copy number variations in large populations

Abstract Background The increasing amount of sequencing data available for a wide variety of species can be theoretically used for detecting copy number variations (CNVs) at the population level. However, the growing sample sizes and the divergent complexity of nonhuman genomes challenge the efficiency and robustness of current human-oriented CNV detection methods. Results Here, we present CNVcaller, a read-depth method for discovering CNVs in population sequencing data. The computational speed of CNVcaller was 1–2 orders of magnitude faster than CNVnator and Genome STRiP for complex genomes with thousands of unmapped scaffolds. CNV detection of 232 goats required only 1.4 days on a single compute node. Additionally, the Mendelian consistency of sheep trios indicated that CNVcaller mitigated the influence of high proportions of gaps and misassembled duplications in the nonhuman reference genome assembly. Furthermore, multiple evaluations using real sheep and human data indicated that CNVcaller achieved the best accuracy and sensitivity for detecting duplications. Conclusions The fast generalized detection algorithms included in CNVcaller overcome prior computational barriers for detecting CNVs in large-scale sequencing data with complex genomic structures. Therefore, CNVcaller promotes population genetic analyses of functional CNVs in more species.

Thank you for your positive comments and encouragement. We have substantially revised the manuscript upon your suggestions.
-There are several grammatical errors which make the paper somewhat confusing. I would strongly recommend further extensive English editing.
We apologize for these mistakes. The manuscript has been professionally edited by an English-language editing service, American Journal Experts (AJE) .
-My main criticism of the analysis is one that I have seen repeatedly of most other CNV calling publications, and that is there is no sensitivity analysis.
We are sorry for the ambiguity of the sensitivity tests, which were included in the previous Table 1. In the revised manuscript, Figure 6C has been added to describe the sensitivity. In general, CNVcaller demonstrated 57%-67% sensitivity for duplications and 66%-68% for deletions in human data. The sensitivity in sheep was ~73% by indirectly evaluation. The detailed descriptions are as follows: Human: "The sensitivity was estimated as the proportion of the high-confidence CNVR database that overlapped with the predicted CNVRs. Two previously published highconfidence databases that include our test samples are the aCGH-based CNVR database [1] and the 1000GP CNVR map [2]. For the aCGH database, CNVcaller demonstrated the highest sensitivity (57%) in duplications, whereas Genome STRiP achieved the highest sensitivity (74%) in deletions ( Figure 6C). Both Genome STRiP and CNVnator were the core contributors to the 1000GP CNV maps; However, the sensitivity of CNVcaller was 68% and 67% for deletions and duplications according to this database, only 4%-10% lower than Genome STRiP and CNVnator." Sheep: "The sensitivity of sheep CNVRs was estimated indirectly due to the lack of a validated database. Based on our integrated analysis (see methods), there were 138 sheep X chromosome-origin scaffolds, which were not anchored onto chromosomes of OAR v3.1. Therefore, all of these scaffolds should be detected as CNVs because the rams had half the copy numbers of the ewes. As a result, CNVcaller detected 101 of these 138 X-origin scaffolds, with a sensitivity of 73%. In contrast, CNVnator and Genome STRiP did not report these unmapped CNVRs." -The authors here also suggest various parameters throughout their paper for performing CNV calling, but there is no analysis of how the results change if these parameters are adjusted, i.e. no analysis of how robust your algorithm is to changes in the parameters.
Thank you for your suggestion. Two important parameters of CNVcaller were window size and minimum report allele frequency. The FDRs against the window size and alternative allele frequency have been added to Supplementary Figure 1 and Figure  6B. In general, with the increasing of window size and allele frequency, the accuracy raised while the sensitivity decreased.
-As another example, Hong et al 27503473 has demonstrated that the biggest variability in calling CNVs is in terms of the CNV size. I suspect that the same can be said of CNVcaller. Please comment on what sizes of CNVs does CNV caller do well or poorly on.
Thank you for your suggestion. Figure 4 and Figure 6 have been added, which evaluate the effects of length and frequency in sheep and human data. The detailed comparisons in the manuscript are as follows: Sheep: "The accuracy was evaluated by the Mendelian inconsistency of all the CNVRs on autosomes against the length and alternative allele frequency ( Figure 4). CNVcaller achieved higher accuracy than Genome STRiP in both deletion (1% vs 2%) and duplication (4% vs 7%) ( Figure 4A). Whereas Genome STRiP had greater capability to detected short (<2.5 kb) deletions ( Figure 4B), indicating the RP methods integrated in Genome STRiP performed well on small deletions. Concerning the alternative allele frequency, both methods showed an increased FDR in rare duplications ( Figure 4C). However, CNVcaller is primarily used to detect CNVRs related to economic traits in livestock and crops. In these studies, the target CNVRs usually have a high frequency after long-duration breeding selection." Human: "CNVcaller demonstrated the highest overall accuracy for detecting duplications and performed consistently across the length and frequency categories, whereas Genome STRiP and CNVnator had high FDRs on the short or singleton duplications ( Figure 6A, B). Genome STRiP showed the greatest ability to detect deletions, indicating the advantage of combining RD and RP methods for deletion detection. The genotyping accuracy of the human dataset was further benchmarked against the high-confidence aCGH array-based database. The discordance rates of CNVcaller, CNVnator and Genome STRiP were 2.6%, 5.5% and 2.2%, respectively. This genotyping accuracy ranking was same with the Mendelian inconsistency of the 10 Dutch trios (Supplementary Figure 5)." -2:32 "the prevalent.." is a gross exaggeration. I think you mean "a prevalent". This has been corrected as suggested.
-2:35 I don't think you mean geometric. I did not comment on other grammatical/English errors as there were too many to list individually. I would highly recommend getting help with the English in this paper.
We apologize for these mistakes. The manuscript has been professionally edited by an English-language editing service, American Journal Experts (AJE).
We apologize for the missing definition. The following description has been added to the introduction: "read depth (RD) refers to the depth of coverage in a genomic region that can be calculated from the number of aligned reads [14], a CNV region should have a higher or lower RD than expected [22][23][24]." -6:120 Give a brief description of how CNVnator handles GC bias. Also why 40% for the GC bias? Shouldn't this parameter be dependent on the organism of interest?
We apologize for not clearly describing the procedure. In general, the mean RD of windows with 40% percent GC is used as only a temporary standard in the GC correction step. It will be lost in the following normalization step, in which the GCcorrected RDs of each window are divided by the global median RDs. Because the denominator is calculated from the RDs already corrected by the 40% GC windows, this parameter will be lost and is not necessarily dependent on the organism of interest.
The GC corrected RD for a window is calculated by CNVnator as follows: the raw RD times the global average RD and divided by the average RD with the same GC content as in this window. Because the global average RD is calculated before the GC correction, no temporary parameter is used. The equation is showed in Personal Cover.
-The commentary on certain genomes not being as complete as others is important. I suspect though that if a large percentage of the samples show a CNV in a genome that is newer or not as complete, then this observation may be more likely indicative of a problem with the reference. Can you comment?
If the detected CNVR has variation in a population, which means the read depths can be clustered into two or more normal distributions, this CNVR is probably true even with high frequency. In contrast, if all of the individuals show the same abnormal read depth, this suggests that the reference individual is different from the sample population or some assembly problems exist.
-7:145 I am not convinced Pearson's correlation is appropriate. Your data is likely to have outliers and non-normal data. A non-parametric test of correlation like Spearman's correlation (Kendall-Tau is likely too computational intensive), or performing correlation after 5 or 10% trimming may be more appropriate.
We tried replacing Pearson's correlation with Spearman's correlation in the 30 BAM files from the 1000 Genome Projects data. However, the FDR doubled after the replacement, while the length of each call was reduced by half. A possible reason is that Spearman's correlation is calculated by ranking instead of the numerical value of copy numbers across samples. Therefore, the divergent copy numbers of individuals with deletions or duplications contributed no more than the subtle random mistakes of normal copy individuals, especially in the low-frequency CNVRs.
Trimming is also not recommended for a similar reason. In the low-frequency CNVRs, individuals with an abnormal copy number will be trimmed as outliers.
-cn.MOPS (Klambauer et al, PMID: 22302147) uses a mixture of Poissons as opposed to Gaussian Mixture Models for CNV detection. I suspect the mixture of Poissions will be superior to Gaussian Mixture Models when the read depths are low, and Gausssian mixtures may be more appropriate when read depths are high. How difficult is it to replace the Gaussian mixtures with Poisson mixtures and compare the performance? I feel that this analysis would be informative and potentially improve your algorithm.
Thank you for your suggestion. However, it is not easy to replace the distribution because the RDs after GC correction and normalization are not integers; thus, they cannot be directly treated as Poisson distributions. Additionally, we totally agree with your comment about the Poisson distribution will be superior for low read depths with high STDEV. However, the currently common sequencing depth are about 5-10X. Under this sequencing depth and a proper window size, the STDEV/mean RD was only 0.2-0.3, which essentially not fit the Poisson distribution. In addition, we used the RDs of 232 goats with ~10X coverage to test the fitness of Gaussian distribution using the omnibus test (scipy 0.19.0). As a result, 88% of windows accepted the null hypothesis at the P = 0.01 level. Therefore, we believe the Gaussian Mixture Model is acceptable for the current data.
-The term "CNVR" is critical for understanding the algorithm, and requires more explanation of the term.
We apologize for missing this important concept. The following explanation has been added to the introduction: "To study the polymorphism among individuals, the overlapping CNVs need to be merged into unified regions, namely CNV regions (CNVRs)" -It would be helpful to include some further discussion on where you see that CNVcaller works better or worse than existing CNV calling software.
Thank you for your suggestion. Figure 2 shows that the speed of CNVcaller was one to two orders of magnitude higher than the other methods. Newly added Figure 4 and Figure 6 evaluated the effects of length and frequency in sheep and human data. In general, the performance of CNVcaller was better for all sizes of duplications but was poor for deletions <2.5 kb.
Two citations have been added as follows: -Minor comment: Since speed seems to be a major selling point of the software, more details about running the software on a compute cluster or running algorithms in parallel in the documentation would be helpful.
A new section, "Parallel submission of individual RD processing," has been added to the methods with the principle and commands as follows: "Parallel processing of individual RDs. The CNVcaller processes the BAM file of each individual separately in the first step, and therefore, parallel computations can be performed to reduce the total running time. All BAM files are equally distributed into N groups, and each group contains M files. The max N is the total available processing cores, and M is the total number of BAM files/N. For example, the 232 goat BAM files were processed on a node with 32 processing cores and 124 GB of RAM. We distributed the 232 files into 20 groups, and each group contained 12 BAM files. The shell command for one group is as follows: After corrections and normalization, the comparable RDs of each sample are aggregated into an ~100 MB intermediate file and output, thus preventing repeated calculations for the same individual in different populations." ************************ Reviewer #2: The proposed method "CNVcaller" enables the efficient discovery and genotyping of CNVs in large populations. One of the main benefits of the method is that it can handle draft genome assemblies with thousands of scaffolds. The computational benchmarks proof that the method is fast and memory efficient but the evaluation of the accuracy of the method is less convincing. Some details of the method remain vague and hinder an objective evaluation. Detailed comments of how to improve the manuscript are below: Thank you for your affirmation. We are sorry for the ambiguity of the accuracy test and have substantially revised the manuscript according to your suggestions. In the revised manuscript, the performance evaluation in the previous Table 1 is described in more detail in Figure 4 and Figure 6. Comment 1 -The primary application of CNVcaller is the detection of CNVs in large populations. Population variant call sets are dominated by rare variants of rather small size. For instance, less than 20% of the 1000 Genomes structural variants have a population allele frequency >5% and almost 50% of the SVs are <2kbp in size despite the rather low coverage (~7x). CNVcaller is currently restricted to large CNVs (>2kbp) and common variants (>5% allele frequency), which is a major limitation for population genomic studies.
We apologise for the ambiguous. Actually, the user can retain all the windows with at least one individual that shows heterozygous deletion or duplication. However, we recommend removing low-frequency windows in large populations with low sequencing coverage because of increased random mistakes. In the revised version, Figure 6 was added to evaluate the effects of length and frequency by IRS test. We found Genome STRiP showed the greatest ability to detect short and rare deletions, indicating the advantage of combining RD and RP methods for deletion detection. However, short and rare duplications still had extremely high FDR. The shortest duplications reported by CNVnator and Genome STRiP were 2.8 kb and 2.5 kb, and the IRS FDRs of 2.5-5 kb calls were 29% and 88%, respectively. The FDRs of singletons were 35% and 69% for CNVnator and Genome STRiP, respectively. The main improvement of CNVcaller is the accuracy of duplications. The FDR of 2.5 kb -5 kb duplications was reduced to 19%, and the FDR of singleton duplications were reduced to 9%. However, the FDRs were still higher than those of the longer and higher-frequency calls. So, these calls were removed from the previous manuscript. These uncertain calls were also removed by the phase 3 extended SV release of 1000GP. After extra quality controls, the number of duplications in the released database is only 1/7 the number of deletions, and the median size is 36 kb, which is 17 times longer than deletions. Therefore, improving the accuracy of duplications on this foundation is meaningful for enriching the CNV database.
Additionally, the current main use of CNVcaller is the detection of CNVRs related to economic traits in livestock and crops. In these populations, the target CNVRs usually have a medium or high frequency after long-duration artificial selection. We believe that the high-confidence medium to high frequency reported by CNVcaller can contribute to functional and breeding studies of animals and plants.
The sensitivity increase of CNVcaller for the subset of common and large CNVs seems to be driven by an increased number of detected CNVs in SD regions ( Figure 5C). SNP arrays have a low SNP density in SD regions and in the present Manuscript array SNP probes in SD regions have been removed entirely. The reported IRS FDR is therefore heavily biased against CNVs in SD regions and it thus seems mandatory to me to proof that this sensitivity increase for SD-associated CNVs is not leading to an inflated FDR.
Thank you for your suggestions. Figure 5C (new Figure 3C) has been updated to show both the number and the Mendelian inconsistency of the detected CNVs in SDs. The Mendelian inconsistency rate of the calls in SD regions made by CNVcaller was approximately 3%, no higher the other methods. The copy numbers of unique and SDs were also indirectly validated by the X-origin scaffolds of a 133-sheep population. All of these scaffolds should be detected as CNVs because the rams had half the copy numbers of the ewes. As a result, CNVcaller detected 101 of these 138 X-origin scaffolds. In contrast, CNVnator and Genome STRiP did not report these regions.
The Manuscript lacks a Figure that shows the size and allele frequency distribution of the discovered CNVs in comparison to Genome STRiP and CNVnator. An estimate of breakpoint accuracy of CNVcaller would also be valuable.
Thank you for your suggestion. Figure 4 and Figure 6 have been added, which evaluate the effects of length and frequency in sheep and human data. The detailed comparisons in the manuscript are as follows: Sheep: "The accuracy was evaluated by the Mendelian inconsistency of all the CNVRs on autosomes against the length and alternative allele frequency ( Figure 4). CNVcaller achieved higher accuracy than Genome STRiP in both deletion (1% vs 2%) and duplication (4% vs 7%) ( Figure 4A). Whereas Genome STRiP had greater capability to detected short (<2.5 kb) deletions ( Figure 4B), indicating the RP methods integrated in Genome STRiP performed well on small deletions. Concerning the alternative allele frequency, both methods showed an increased FDR in rare duplications ( Figure 4C). However, CNVcaller is primarily used to detect CNVRs related to economic traits in livestock and crops. In these studies, the target CNVRs usually have a high frequency after long-duration breeding selection." Human: "CNVcaller demonstrated the highest overall accuracy for detecting duplications and performed consistently across the length and frequency categories, whereas Genome STRiP and CNVnator had high FDRs on the short or singleton duplications ( Figure 6A, B). Genome STRiP showed the greatest ability to detect deletions, indicating the advantage of combining RD and RP methods for deletion detection. The genotyping accuracy of the human dataset was further benchmarked against the high-confidence aCGH array-based database. The discordance rates of CNVcaller, CNVnator and Genome STRiP were 2.6%, 5.5% and 2.2%, respectively. This genotyping accuracy ranking was the same with the Mendelian consistency of the 10 Dutch trios (Supplementary Figure 5)." Thank you for your reminding of the breakpoint issue. However, unlike the PR/SP algorithm, RD can not detect breakpoints in the at base pair resolution or less than the window step size resolution. Integrating RD and RP methods can improve the breakpoint accuracy in human genome. However, precise breakpoint is more difficult to achieve in the poorly assembled genomes. Additionally, the breakpoint issue did not affect the genotyping accuracy which is the direct input of GWAS. The genotyping FDR of CNVcaller, CNVnator and Genome STRiP were 2.6%, 5.5% and 2.2%, respectively.
The Manuscript mentions mrsFAST for absolute copy number validation. I could not find any formal comparison of predicted copy-number by mrsFAST and CNVcaller but maybe I missed this?
Supplementary Figure 2 (previous Supplementary Figure 1) shows that the copy numbers calculated using mrsFAST and CNVcaller were similar. However, mrsFAST needed to realign all the multi-hit reads in BWA alignments, leading to significantly increased computational time. For example, mrsFAST required 10 hours for a 3G genome with 10X sequencing data, whereas CNVcaller needed only 4 minutes.
-Please add to Table 1 the number of CNV sites that could be assessed by the IRS method and what proportion of each call set could be evaluated using IRS. I also believe the IRS method reports p-values separately for deletions, duplications and multi-allelic CNVs. Was there any difference among these for CNVcaller?
Detailed information on 1000GP calls, including the required information, has been added to Supplementary Table 5. Overall, 28%, 30% and 60% of the CNVRs of CNVcaller, CNVnator and Genome STRiP covered at least one probe of the Affymetrix SNP 6.0 array and therefore could be assessed using the IRS test. One main reason for the divergent testable proportions was that only 4% of Genome STRiP calls overlapped with SDs, which have infrequent probes, whereas 34% of the CNVcaller calls and 28% of the CNVnator calls overlapped with SDs.
Two extra genome-wide evaluations can provide supplemental evidence. The Mendelian inconsistency of 10 Dutch families was added to Supplementary Figure 5, which was based on tests of both unique and SD regions. The inconsistency rates of CNVcaller, CNVnator and Genome STRiP were 1.5%, 4.4%, and 0.4%, respectively. This accuracy ranking was consistent with the genotyping discordance values compared with the aCGH database, which were 2.6%, 5.5% and 2.2% for CNVcaller, CNVnator and Genome STRiP, respectively.
To analysis the difference between deletions and duplications, all FDRs were evaluated separately in the revised manuscript. We found the duplications had much higher FDRs than the deletions, especially for the short and rare CNVs. This section has been expanded in the newly added subsection "RD corrections for sex chromosomes" as follows: "RD corrections for sex chromosomes. Most mammalian and avian genomes show an XX/XY-type or ZZ/ZW-type sex-determining system. Their homogametic sex chromosomes (X or Z) constitute 5%-10% of the total genome and show half the RD of the autosomes in XY or ZW individuals. Therefore, intensive correction for X and Z chromosomes is needed. The RD of the X or Z chromosome (the particular name provided by the user) is used to determine the sex of a particular individual. If the median RD of this chromosome is <0.6X the median RD of the autosome, the individual is considered an XY or ZW type, and the RDs of this chromosome are doubled before normalization. Otherwise, nothing is performed for individuals determined to be XX or ZZ type." Page 8, line 154: "... and the distance between them is less than a certain percent of their own length." This text has been modified as follows: "As CNVRs can be separated by gaps or poorly assembled regions, the adjacent initial calls are merged if their RDs are highly correlated. The default parameters are as follows: the distance between the two initial calls is less than 20% of their combined length, and the Pearson's correlation index of the two CNVRs is significant at the P = 0.01 level." Page 5, line 91: "The reference genome is segmented into overlapping sliding windows." What window size and overlap was used for high-coverage genomes?
The following description has been added to the methods. "The window size is an important parameter for RD methods. CNVcaller uses half of the window size as the step size. The optimal window size is 800 bp (with a 400 bp overlap) for 5-10X coverage human and livestock sequencing data (Supplementary Figure 1). The recommended window sizes are inversely related with coverage, and thus, ~400 bp windows correspond to 20X coverage, and ~200 bp windows correspond to 50X coverage." Page 5, line 95: "The raw RD signal is calculated for each window as the number of placed reads with centers within window boundaries." Does this imply that for pairedend data both reads are counted?
We apologise for the ambiguous description. The following description has been added: "Considering the uncontrollable effect of gap ratios from different genome assemblies, all of the end reads located in the window are independently added to the RD of this window, regardless of whether the read is from single-end mapping or paired mapping." Page 8, line 154: "Then the two adjacent initial calls are further merged if their copy numbers are highly correlated". What threshold was used?
This text has been modified as follows: "As CNVRs can be separated by gaps or poorly assembled regions, the adjacent initial calls are merged if their RDs are highly correlated. The default parameters are as follows: the distance between the two initial calls is less than 20% of their combined length, and the Pearson's correlation index of the two CNVRs is significant at the P = 0.01 level." Figure 3A: CNVcaller 13.7. What is the unit? Are these 13,700 CNVs?
The unit in this figure is Mb. Because the intersection of the three methods with different boundaries was difficult to define in numbers, they were evaluated in terms of length. CNVcaller covered 40% of the CNVRs detected by CNVnator, 45% of the CNVRs detected by Genome STRiP and 65% of their intersecting CNVRs, in terms of length.
Minor: -I could not find a reference to the 232 goat sequencing data? Is this data publicly available?
Among the 232 goat whole-genome sequencing data files, 103 files were acquired from NCBI, the accession numbers are provided in Supplementary Table 1. The remaining 129 samples without accession numbers were generated by ourselves, and will be published soon. The reference and unpublished paper are as follows: This has been modified as suggested.
-Is the Mendelian consistency higher for the high-coverage trio: NA12878, her father (NA12891) and her mother (NA12892)?
Yes. In the high-coverage data of all three members of the trio (NA12891, NA12892 and NA12878 were all 50X), the inconsistency rate was 2.4%. In the high-coverage data of the parents (50X for NA12891 and NA12892) and the low-coverage data of the child (5.3X for NA12878), the inconsistency rate was 6.1%. Thus, increased sequencing depth can help to reduce the number of false positives.
-I believe the claim that read-pair/split-read algorithms are less powerful on draft assemblies of non-model organisms compared to read-depth methods is potentially true but the Manuscript lacks a proof for this or a citation that supports this claim.
Thank you for your agreement. This problem was found in our previous reference genome assembly projects for both sheep and goats. However, we did not report this result in the section on CNV/SD detection. The review listed below has some comments about this claim, however, without direct supporting data. Therefore, we have removed this comment from this manuscript.
Bickhart DM and Liu GE. The challenges and importance of structural variation detection in livestock. Frontiers in genetics. 2014;5.
"While RP methods should provide a suitable means for detecting such events in theory, two major problems currently challenge the accuracy of this method: (1) alignment errors resulting from the mapping of read pairs to repetitive regions of the genome…… The first problem (1) is unfortunately dependent on the reference genome assembly for the species, and is unlikely to be resolved until better reference assemblies are created for livestock." -It is not clear from the Manuscript if CNVcaller reports copy-number likelihoods based on the Gaussian mixture model. Please clarify.
Thank you for your suggestion. CNVcaller reports the silhouette coefficients of the copy numbers instead of the Gaussian mixture model likelihood as quality control because we found that silhouette coefficients had a greater correlation with the IRS test results than likelihood.
- Figure 5A: Why is the absolute copy-number correction different for Human and Sheep?
We are sorry for not clearly interpreting the high proportion of misassembled segmental duplications in non-human assemblies. This part of the manuscript has been modified as follows: "Previous studies have shown that a high proportion of SDs in animal genomes are misassembled single-copy regions [27,29]. Therefore, we detected the ratios of false SDs on the human (hg19) and sheep (OAR v3.1) reference genome assemblies by the sequencing copy number of a human (NA12878) and a Tan sheep sample ( Figure 3A). If the SDs were correctly assembled, the sequencing diploid copy number should be twice the copy number of SDs. For example, the average sequencing copy number of the two-copy SDs was four in NA12878. However, the corresponding sequencing copy number in sheep was only 2.4. These results indicated that most two-copy SDs of hg19 were truly duplicated in NA12878, while approximately 80% of the two-copy SDs in OAR v3.1 were single-copy regions in the Tan sheep sample. Thus, the SDs in the sheep genome were called "putative SDs" before validation." -There is quite a few typing and grammatical errors. For instance: * Figure 2B: Max mamory *Supplementary We are sorry for these mistakes. We have proofread the revised manuscript and used a professional English-language editing service to minimize the grammatical errors. ************************

Abstract
Background: The increasing amount of sequencing data available for a wide variety of species can be theoretically used for detecting copy number variations (CNVs) at the population level.
However, the growing sample sizes and the divergent complexity of non-human genomes challenge the efficiency and robustness of current human-oriented CNV detection methods.
Results: Here, we present CNVcaller, a read-depth method for discovering CNVs in population sequencing data. The computational speed of CNVcaller was 1-2 orders of magnitude faster than CNVnator and Genome STRiP for complex genomes with thousands of unmapped scaffolds.
CNV detection of 232 goats required only 1.4 days on a single compute node. Additionally, the Mendelian consistency of sheep trios indicated that CNVcaller mitigated the influence of high proportions of gaps and misassembled duplications in the non-human reference genome assembly.
Furthermore, multiple evaluations using real sheep and human data indicated that CNVcaller

Introduction
Copy number variations (CNVs) are defined as duplications or deletions of genomic segments that range in size from 50 base pairs (bp) to megabase pairs (Mb) and vary among individuals or species [1]. As a prevalent and important source of genetic diversity, over 50,000 CNVs have been detected in the human genome, accounting for 10% of the entire genome [2]. CNVs regulate gene expression via both gene dosage and position effects, and they have larger expression-altering effect sizes than single nucleotide polymorphisms (SNPs) and indels [3]. In the human genome, With the dramatic increase in sequencing capacity and the accompanying decrease in sequencing cost, whole-genome sequencing data is becoming the main source of CNV detection .   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65   3 The large-scale population sequencing data also provide an unprecedented opportunity to discover the functional CNVs using genome-wide association studies (GWAS) and evolutionary analysis [11,12]. To study the polymorphism among individuals, the overlapping CNVs need to be merged into unified regions, namely CNV regions (CNVRs) [13]. Since merging CNVs identified in each individual is inconvenience for large populations, some methods use multiple samples as input, then output the CNVRs directly [14]. More importantly, the population-based methods can improve the detection by building statistic models, such as Poisson distribution and Gaussian Mixture model [15,16].
As the amount of data increases, the computational efficiency is becoming a rate-limiting factor in CNV analysis. In additional, most CNV detection algorithms are based on mapping the sequencing reads back to the reference genome. For example, the methods of read-pair (RP) and split-read (SR) deduce the breakpoint of CNVs from the discordant alignments [17][18][19][20][21]; the methods of read depth (RD) refers to the depth of coverage in a genomic region that can be calculated from the number of aligned reads [14]. A duplicated or deleted region should have a higher or lower RD than expectation [22][23][24]. However, the non-model genome assemblies are riddled with many gaps, unplaced scaffolds and misassembled segmental duplications (SDs) [25-28]. For example, 97% of highly similar tandem duplications in the Btau4.1 cattle genome assembly actually correspond to a single copy [29]. Therefore, more robust signal detection and noise reduction algorithms are required for detecting CNVRs from non-model species.

Input data
CNVcaller requires alignment files in BAM format as the main input. The following data/samples were included in the validation.

Individual RD processing
RD estimation. The pipeline of CNVcaller is shown in Figure 1. To calculate the RD signal, we first divide the reference genome into overlapping sliding windows, which is used for all samples. CNVcaller uses half of the window size as the step size. The optimal window size is 800 bp (with a 400 bp overlap) for 5-10X coverage human and livestock sequencing data (Supplementary Figure 1). The recommended window sizes are inversely related with coverage, and thus, ~400 bp windows correspond to 20X coverage, and ~200 bp windows correspond to 50X coverage.
Absolute copy number correction. To perform absolute copy number correction, windows with >97% sequence similarity are linked together to form a duplicated window record file. This file is generated by splitting the reference genome into non-overlapping windows and aligning the

CNVR detection by multiple criteria
Individual candidate CNV window definition. The individual candidate CNV windows are defined using two criteria: (1) The normalized RD must be significantly higher or lower than the normalized mean RD (deletions < 1 -2 * STDEV; duplications > 1 + 2 * STDEV). (2) Considering that the normalized RD of heterozygous deletions and duplications should be approximately 0.5 and 1.5, respectively, an empirical standard for the normalized RD (deletions < 0.65; duplications > 1.35) also must be achieved. For some strictly self-bred species, such as soybean and wheat, this empirical standard should be raised to 0.25 or 1.75 for the normalized RD of the homozygous deletions or duplications, respectively.
Population-level candidate CNV window definition. All the individual RD files are arranged according to the universal window index into a two-dimensional population RD file. Each line of this file is the multi-sample RDs of a particular window, from which the candidate CNV windows   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65   9 are selected. The user can retain all the windows with at least one individual that shows heterozygous deletion or duplication. However, we recommend removing low-frequency windows in large populations with low sequencing coverage because of increased random mistakes. By default, windows with an allele frequency ≥0.05 or at least two homozygous duplicated/deleted individuals are selected for further validation. Then, Pearson's product-moment correlation coefficients of the multi-sample RDs are calculated between two adjacent non-overlapping windows. If the Pearson's correlation index is significant at the P = 0.01 level by Student's t test, the two windows are merged into one call.
CNV region definition. The initial calls are selected if more than four sequential overlapping windows are defined as population-level candidate windows. Regarding noise tolerance, a maximum of one unselected window out of four continuous candidate windows is allowed; however, their RD is not calculated in the RD of CNVR. As CNVRs can be separated by gaps or poorly assembled regions, the adjacent initial calls are merged if their RDs are highly correlated.
The default parameters are as follows: the distance between the two initial calls is less than 20% of their combined length, and the Pearson's correlation index of the two CNVRs is significant at the P = 0.01 level.

CNVR genotyping
After merging the candidate CNV windows into a CNVR, the RDs of all samples in each CNVR are clustered, and the integer copy number of each individual is calculated, which represents the 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 10 genotyping step as used in SNP detection. The copy number of a specific sample is initially estimated as two times the median RD of all candidate windows in a given region. Then, the copy numbers of all samples of a CNVR are decomposed into several Gaussian distributions. The expectation maximization (EM) algorithm is used to estimate the model parameters, and the effective number of components is inferred by the Dirichlet Process. To ensure the quality of the genotyping, the silhouette coefficient is calculated for each CNVR. The Python package scikit-learn v0.19.0 [41] is used to implement the above algorithms. This genotyping step can be performed in sequence or in parallel, and the parameter "nproc" is used to control the number of processes. The genotyping of 232 goats took 17.49 minutes and 488 MB of memory on one node with two processors. The final output is a VCF file, which can be analysed by SNP-based population genetic software.

Performance evaluation
Competing methods. Most of the validations were based on the 30 human BAM files form 1000GP Phase 3 data, and only the autosomes were included unless otherwise noted. The performance of CNVcaller was compared with two pipelines, including CNVnator_v0.3.3 (CNVnator, RRID:SCR_010821) [23], which is widely used for CNVR detection in animal populations, and Genome STRiP (included in svtoolkit_2.00.1696) [16], which is the state-of-the-art CNV detector generated by 1000GP. The recommended parameters and quality controls were used. For Genome STRiP, both the deletion and CNV pipelines were utilized. The unplaced scaffolds were excluded, and the whole genome was separated by chromosomes as 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65   11 recommended. The standard screening procedures were applied to select the passing sites and to remove duplicate calls. For CNVnator, a 400 bp window was used, as recommended. The gap regions and calls with p values less than 0.01 were removed, and the q0 filter was used to remove any predictions with q0 <0.5 (reads with multiple mapping locations), as recommended. The individual CNVs of all samples were merged into the population CNVRs based on the following arbitrary standards: two calls with >50% reciprocal overlap or one call with >90% coverage by another call [23,42]. Then, the CNVRs were genotyped using the built-in function in CNVnator.
Sensitivity validation. Sensitivity was defined as the number of CNVs existed in both the CNV predictions and the high-confident CNVR database (>50% reciprocal intersection) divided by the total number of CNVs in the database. Calls with ≤2,500 bp and an allele frequency <5% and sex chromosomes were removed from this study. Two previously published high-confident CNVR databases, including the same samples from the test data, were used. One was the 1000GP CNVR map [44], which included 26 tested samples, and the other was array comparative genomic hybridization (aCGH)-based CNVR database [1], which included 10 tested samples. The CNVRs of the specific samples were extracted from the database and were then screened by the same length and frequency as the detected CNVRs (length >2,500 bp and alternative allele frequency 0.05). The intersected length of the predicted CNVRs and the high-confidence CNVR database was calculated using BEDTools v2.25 (BEDTools , RRID:SCR_006646) [43].
mrsFAST alignment. The paired-end reads with multiple hits indicated by the "XA" tag in the BWA alignment were selected for realignment using mrsFAST_v3.3.10 [46] as previously described [47]. Longer reads were trimmed to 40 bp to reduce the read length heterogeneity prior to sequence alignment. After alignment, the reads with more than 20 hits were excluded to remove the low-complexity regions.

Computational cost in complex genomes from large population-based studies
Since the computational cost is one of the greatest challenges for large populations, the computational efficiency of CNVcaller was evaluated on the real sequencing data of the different genomes. The individual RD processing step was compared to CNVnator, which detects CNVs individually, and has been used in yak, chicken and fish populations [42,49,50]. The processing time of CNVcaller was linearly related to the genome size and sequencing coverage: 20-40 minutes for a 3 Gb genome with 10X coverage (Supplementary Table 3). However, the processing time of CNVnator exponentially increased with scaffold number, which was the only time-consuming index when the scaffold number exceeded one thousand (Figure 2A).
Consequently, CNVcaller achieved a 145-fold increase in speed over CNVnator for goat CNV detection. Notably, the goat reference genome ARS1, which contains 29,907 scaffolds, was newly assembled by single-molecule sequencing [35]. The robustness of CNVcaller reduces the quality restrictions of the reference genome, which promotes CNV research in species with a draft assembly at the scaffold level. This feature also enables comprehensive variation discoveries based on pan-genomes, which reveal numerous functionally important genes not localized on a single reference genome [51-53].
The memory requirement of CNVcaller is mainly related to the genome size: only approximately 500 MB memory for a mammalian genome, which is less than one twentieth of the memory required by CNVnator ( Figure 2B). Therefore, in multi-sample CNV detection, the individual RD processing step can be run in parallel on one node to further reduce the running time. The population-level performance of CNVcaller was evaluated and benchmarked against 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65   15 Genome STRiP, which also detects CNVRs at the population level, and is a main contributor of the 1000GP. After removing the unplaced scaffolds, CNVcaller was still 3.5-7.8 times faster than Genome STRiP (Figure 2C), with a 70%~86% reduced memory requirement ( Figure 2D). For 232 goats with a mean coverage of 12X, CNVcaller can complete CNV detection in 1.4 days using a single node. The high efficiency of CNVcaller can facilitate CNV detection in large populations.

Absolute copy number correction in putative SDs of the sheep genome
Previous studies have shown that a high proportion of SDs in animal genomes are misassembled single-copy regions [27,29]. Therefore, we detected the ratios of false SDs on the human (hg19) and sheep (OAR v3.1) reference genome assemblies by the sequencing copy number of a human (NA12878) and a Tan sheep sample ( Figure 3A). If the SDs were correctly assembled, the sequencing diploid copy number should be twice the copy number of SDs. For example, the average sequencing copy number of the two-copy SDs was four in NA12878. However, the corresponding sequencing copy number in sheep was only 2.4. These results indicated that most two-copy SDs of hg19 were truly duplicated in NA12878, while approximately 80% of the two-copy SDs in OAR v3.1 were single-copy regions in the Tan sheep sample. Thus, the SDs in the sheep genome were called "putative SDs" before validation.
CNV detection can be confounded by the presence of false SDs. Due to the random placement of multiple mapped reads, the RD signal in these regions is effectively smeared over all copies; thus, the raw copy number is underestimated. For example, in the putative two-copy SDs, the main peak of the copy numbers was one, the same as heterozygous deletions ( Figure 3B).  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 16 CNVcaller incorporates absolute copy number correction by simply adding the RD of the putative SDs to deduce the absolute copy number independent of the copy number of the genome assembly ( Figure 1). This target can also be achieved using mrsFAST; however, more than 10 core hours were required to realign the multi-hit reads by mrsFAST for a mammalian genome with 10X sequencing coverage. The equivalent result was achieved by CNVcaller within only 0.06 core hours (Supplementary Figure 2).
In the simulate sheep sequencing data, this correction can deduce the correct genotyping in SD regions (Supplementary Figure 3), also reduced the STDEV within each genotyping Table 4). In the real individual sheep data, the corrected putative two-copy SDs clearly fall in to two categories: normal copy (the major peak, with a diploid copy number of two) and the true duplicated regions (the minor peak, with a diploid copy number of four) ( Figure 3B).

(Supplementary
The accuracy of the CNVRs in putative SDs was validated by the Mendelian inconsistency of three Tan sheep trios. CNVcaller detected more duplications in the putative SDs, with only 3% Mendelian inconsistency (Figure 3C).
The sensitivity of sheep CNVRs was estimated indirectly due to the lack of a validated database.
Based on our integrated analysis (see methods), there were 138 sheep X chromosome-origin scaffolds, which were not anchored onto chromosomes of OAR v3.1. Therefore, all of these scaffolds should be detected as CNVs because the rams had half the copy numbers of the ewes. As a result, CNVcaller detected 101 of these 138 X-origin scaffolds, with a sensitivity of 73%. In contrast, CNVnator and Genome STRiP did not report these regions. Furthermore, the copy numbers of single-copy and duplicated X-origin scaffolds were centred at integers, namely one and two in rams and doubled in ewes, whereas the peaks of the raw copy numbers were 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 ambiguous ( Figure 3D). Further examination of the duplicated regions showed that this result was caused by splitting the raw RDs among the putative SDs (Supplementary Figure 4).

Performance evaluations on sheep data
To evaluate the robustness and false discovery rate (FDR) in sheep, we used CNVcaller, CNVnator and Genome STRiP to detect CNVRs from three sheep trios. CNVnator detected less than 1/10 the number of CNVRs of the other methods, with only 260 CNVRs reported. One main reason was that assembly gaps without reads were detected as homozygous deletions by CNVnator in the initial calls. Thus, more than 90% of the initial calls were removed in the recommended gap filtering step because the sheep reference genome OAR v3.1 has ~125,000 gaps. By comparison, the human reference genome hg19 has only 354 gaps. To address the gap problem in non-human genomes, CNVcaller removed the sliding windows with gaps at the first step, and adjacent CNVRs were merged into one call if their RDs were ultimately highly correlated. These optimizations avoided the artefacts caused by assembly errors and retained the adjacent CNVRs as well.
The accuracy was evaluated by the Mendelian inconsistency of all the CNVRs on autosomes against the length and alternative allele frequency (Figure 4). CNVcaller achieved higher accuracy than Genome STRiP in both deletion (1% vs 2%) and duplication (4% vs 7%) ( Figure   4A). Whereas Genome STRiP had greater capability to detected short (<2.5 kb) deletions ( Figure   4B), indicating the RP methods integrated in Genome STRiP performed well on small deletions.
To investigate the reproducibility of CNVcaller, the CNVRs identified by CNVcaller from 133 sheep of 44 worldwide breeds were compared with two other recently released large-scale sheep CNVR datasets. One dataset was derived from allied breeds using multiple platforms, including aCGH, SNP chip and whole-genome sequencing [54], and the other dataset was based on three Chinese sheep breeds using a 600K SNP array [55]. The samples and platforms had major influences on the results; therefore, the overall intersection ratio was low. However, CNVcaller covered 51% of the cross-validated regions of the other dataset ( Figure 4D).
The genotyping accuracy of 73 sheep was validated by a recently developed molecular biology technique, CNVplex. This method reports the copy number of a genomic sequence based on the multiplex ligation-dependent probe amplification (MLPA) method [44]. When the copy numbers of sequencing data predicted by CNVcaller and CNVplex were compared, the Pearson's product-moment correlation coefficients were greater than 0.95, and the genotyping concordance was 98% ( Figure 5).

Performance evaluations on 1000 Genomes Project data
Although CNVcaller was mainly designed for complex genomes, the performance was also evaluated on 30 human BAM files from 1000GP. Because the SNP array data and high-confidence CNVR databases were only available in human, which can be used to evaluate the accuracy from population-level (IRS test) and sensitivity. CNVcaller demonstrated the highest overall accuracy for detecting duplications and performed consistently across the length and 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 19 frequency categories, whereas Genome STRiP and CNVnator had high FDRs on the short or singleton duplications (Figure 6A, B). Genome STRiP showed the greatest ability to detect deletions, indicating the advantage of combining RD and RP methods for deletion detection. The genotyping accuracy of the human dataset was further benchmarked against the high-confidence aCGH array-based database. The discordance rates of CNVcaller, CNVnator and Genome STRiP were 2.6%, 5.5% and 2.2%, respectively. This genotyping accuracy ranking was the same with the Mendelian consistency of the 10 Dutch trios (Supplementary Figure 5).
The sensitivity was estimated as the proportion of the high-confidence CNVR database that overlapped with the predicted CNVRs. Two previously published high-confidence databases that include our test samples are the aCGH-based CNVR database [1] and the 1000GP CNVR map [2].
For the aCGH database, CNVcaller demonstrated the highest sensitivity (57%) in duplications, whereas Genome STRiP achieved the highest sensitivity (74%) in deletions ( Figure 6C). Both Genome STRiP and CNVnator were the core contributors to the 1000GP CNV maps; However, the sensitivity of CNVcaller was 68% and 67% for deletions and duplications according to this database, only 4%-10% lower than Genome STRiP and CNVnator.

Conflict of interest
The authors declare that they have no competing interests.

Figure Legends
RDs of adjacent windows are significant correlated

Figure1
Click here to download Figure Figure1.pdf

Figure2
Click here to download Figure Figure2

Figure5
Click here to download Figure Figure5.pdf

The high proportion of misassembled segmental duplications in non-human assemblies may have led to misunderstanding on the reviewer's part. This
section has been extensively redrafted with analyses of both real and simulated data.
5. The previous discussion section has been merged with the results section to reduce the length of the manuscript. The first part of the previous results section has been moved to the methods section, as suggested by the reviewer.
6. The language has been professionally edited by an English-language editing service, American Journal Experts (AJE).
7. We have registered the software in the SciCrunch.org database. The RRID SRC_015752 was added to the 'Availability and requirements' sections.
Thank you again for all of your assistance.
Thank you for your positive comments and encouragement. We have substantially revised the manuscript upon your suggestions.
There are several grammatical errors which make the paper somewhat confusing. I would strongly recommend further extensive English editing.
We apologize for these mistakes. The manuscript has been professionally edited by an English-language editing service, American Journal Experts (AJE).
My main criticism of the analysis is one that I have seen repeatedly of most other CNV calling publications, and that is there is no sensitivity analysis.
We are sorry for the ambiguity of the sensitivity tests, which were included in the previous The authors here also suggest various parameters throughout their paper for performing CNV calling, but there is no analysis of how the results change if these parameters are adjusted, i.e. no analysis of how robust your algorithm is to changes in the parameters.
Thank you for your suggestion. Two important parameters of CNVcaller were window size and minimum report allele frequency. The FDRs against the window size and alternative allele frequency have been added to Supplementary Figure 1 and Figure  6B. In general, with the increasing of window size and allele frequency, the accuracy raised while the sensitivity decreased.  Figure 6A, B). Genome STRiP showed the greatest ability to detect deletions, indicating the advantage of combining RD and RP methods for deletion detection. The genotyping accuracy of the human dataset was further benchmarked against the high-confidence aCGH array-based database. The discordance rates of CNVcaller, CNVnator and Genome STRiP were 2.6%, 5.5% and 2.2%, respectively. This genotyping accuracy ranking was same with the Mendelian inconsistency of the 10 Dutch trios (Supplementary Figure 5)." 2:32 "the prevalent.." is a gross exaggeration. I think you mean "a prevalent".
This has been corrected as suggested.
2:35 I don't think you mean geometric. I did not comment on other grammatical/English errors as there were too many to list individually. I would highly recommend getting help with the English in this paper.
We apologize for these mistakes. The manuscript has been professionally edited by an English-language editing service, American Journal Experts (AJE).

3:53 "RD" is not defined.
We apologize for the missing definition. The following description has been added to the introduction: "read depth (RD) refers to the depth of coverage in a genomic region that can be calculated from the number of aligned reads [14], a CNV region should have a higher or lower RD than expected [22][23][24]." 6:120 Give a brief description of how CNVnator handles GC bias. Also why 40% for the GC bias? Shouldn't this parameter be dependent on the organism of interest?
We apologize for not clearly describing the procedure. In general, the mean RD of windows with 40% percent GC is used as only a temporary standard in the GC correction step. It will be lost in the following normalization step, in which the GC- The commentary on certain genomes not being as complete as others is important. I suspect though that if a large percentage of the samples show a CNV in a genome that is newer or not as complete, then this observation may be more likely indicative of a problem with the reference. Can you comment?
If the detected CNVR has variation in a population, which means the read depths can be clustered into two or more normal distributions, this CNVR is probably true even with high frequency. In contrast, if all of the individuals show the same abnormal read depth, this suggests that the reference individual is different from the sample population or some assembly problems exist. 7:145 I am not convinced Pearson's correlation is appropriate. Your data is likely to have outliers and non-normal data. A non-parametric test of correlation like Spearman's correlation (Kendall-Tau is likely too computational intensive), or performing correlation after 5 or 10% trimming may be more appropriate.

We tried replacing Pearson's correlation with Spearman's correlation in the 30 BAM
files from the 1000 Genome Projects data. However, the FDR doubled after the replacement, while the length of each call was reduced by half. A possible reason is that Spearman's correlation is calculated by ranking instead of the numerical value of copy numbers across samples. Therefore, the divergent copy numbers of individuals with deletions or duplications contributed no more than the subtle random mistakes of normal copy individuals, especially in the low-frequency CNVRs.
Trimming is also not recommended for a similar reason. In the low-frequency CNVRs, individuals with an abnormal copy number will be trimmed as outliers.   (scipy 0.19.0). As a result, 88% of windows accepted the null hypothesis at the P = 0.01 level. Therefore, we believe the Gaussian Mixture Model is acceptable for the current data.
The term "CNVR" is critical for understanding the algorithm, and requires more explanation of the term.
We apologize for missing this important concept. The following explanation has been added to the introduction: "To study the polymorphism among individuals, the overlapping CNVs need to be merged into unified regions, namely CNV regions (CNVRs)" It would be helpful to include some further discussion on where you see that CNVcaller works better or worse than existing CNV calling software.
Thank you for your suggestion. Figure 2 shows that the speed of CNVcaller was one to two orders of magnitude higher than the other methods. Newly added Figure 4 and Figure 6 evaluated the effects of length and frequency in sheep and human data. In general, the performance of CNVcaller was better for all sizes of duplications but was poor for deletions <2.5 kb. 9:180. The "arbitrary standards" require a citation. ************************ Reviewer #2: The proposed method "CNVcaller" enables the efficient discovery and genotyping of CNVs in large populations. One of the main benefits of the method is that it can handle draft genome assemblies with thousands of scaffolds. The computational benchmarks proof that the method is fast and memory efficient but the evaluation of the accuracy of the method is less convincing. Some details of the method remain vague and hinder an objective evaluation. Detailed comments of how to improve the manuscript are below: Thank you for your affirmation. We are sorry for the ambiguity of the accuracy test and have substantially revised the manuscript according to your suggestions. In the revised manuscript, the performance evaluation in the previous Table 1 is described in more detail in Figure 4 and Figure 6.
Comment 1 -The primary application of CNVcaller is the detection of CNVs in large populations. Population variant call sets are dominated by rare variants of rather small size. For instance, less than 20% of the 1000 Genomes structural variants have a population allele frequency >5% and almost 50% of the SVs are <2kbp in size despite the rather low coverage (~7x). CNVcaller is currently restricted to large CNVs (>2kbp) and common variants (>5% allele frequency), which is a major limitation for population genomic studies.
We apologise for the ambiguous. Actually, the user can retain all the windows with at least one individual that shows heterozygous deletion or duplication. However, we recommend removing low-frequency windows in large populations with low sequencing coverage because of increased random mistakes. In the revised version, Figure 6 was added to evaluate the effects of length and frequency by IRS test. We found Genome STRiP showed the greatest ability to detect short and rare deletions, indicating the advantage of combining RD and RP methods for deletion detection. However, short and rare duplications still had extremely high FDR. The shortest duplications reported by CNVnator and Genome STRiP were 2.8 kb and 2.5 kb, and the IRS FDRs of 2.5-5 kb calls were 29% and 88%, respectively. The FDRs of singletons were 35% and 69% for CNVnator and Genome STRiP, respectively. The main improvement of CNVcaller is the accuracy of duplications. The FDR of 2.5 kb -5 kb duplications was reduced to 19%, and the FDR of singleton duplications were reduced to 9%. However, the FDRs were still higher than those of the longer and higher-frequency calls. So, these calls were removed from the previous manuscript.
These uncertain calls were also removed by the phase 3 extended SV release of 1000GP. After extra quality controls, the number of duplications in the released database is only 1/7 the number of deletions, and the median size is 36 kb, which is 17 times longer than deletions. Therefore, improving the accuracy of duplications on this foundation is meaningful for enriching the CNV database.
Additionally, the current main use of CNVcaller is the detection of CNVRs related to economic traits in livestock and crops. In these populations, the target CNVRs usually have a medium or high frequency after long-duration artificial selection. We believe that the high-confidence medium to high frequency reported by CNVcaller can contribute to functional and breeding studies of animals and plants.
The sensitivity increase of CNVcaller for the subset of common and large CNVs seems to be driven by an increased number of detected CNVs in SD regions ( Figure  5C). SNP arrays have a low SNP density in SD regions and in the present Manuscript array SNP probes in SD regions have been removed entirely. The reported IRS FDR is therefore heavily biased against CNVs in SD regions and it thus seems mandatory to me to proof that this sensitivity increase for SD-associated CNVs is not leading to an inflated FDR.
Thank you for your suggestions. Figure 5C (new Figure 3C) has been updated to show both the number and the Mendelian inconsistency of the detected CNVs in SDs. The Mendelian inconsistency rate of the calls in SD regions made by CNVcaller was approximately 3%, no higher the other methods. The copy numbers of unique and SDs were also indirectly validated by the X-origin scaffolds of a 133-sheep population. All of these scaffolds should be detected as CNVs because the rams had half the copy numbers of the ewes. As a result, CNVcaller detected 101 of these 138 X-origin scaffolds. In contrast, CNVnator and Genome STRiP did not report these regions.
The Manuscript lacks a Figure that shows the size and allele frequency distribution of the discovered CNVs in comparison to Genome STRiP and CNVnator. An estimate of breakpoint accuracy of CNVcaller would also be valuable.
Thank you for your suggestion. Figure 4 and Figure 6 have been added, which evaluate the effects of length and frequency in sheep and human data. The detailed comparisons in the manuscript are as follows: Sheep: "The accuracy was evaluated by the Mendelian inconsistency of all the CNVRs on autosomes against the length and alternative allele frequency (Figure 4). CNVcaller achieved higher accuracy than Genome STRiP in both deletion (1% vs 2%) and duplication (4% vs 7%) ( Figure 4A). Whereas Genome STRiP had greater capability to detected short (<2.5 kb) deletions ( Figure 4B), indicating the RP methods integrated in Genome STRiP performed well on small deletions. Concerning the alternative allele frequency, both methods showed an increased FDR in rare duplications ( Figure 4C). However, CNVcaller is primarily used to detect CNVRs related to economic traits in livestock and crops. In these studies, the target CNVRs usually have a high frequency after long-duration breeding selection." Human: "CNVcaller demonstrated the highest overall accuracy for detecting duplications and performed consistently across the length and frequency categories, whereas Genome STRiP and CNVnator had high FDRs on the short or singleton duplications ( Figure 6A, B). Genome STRiP showed the greatest ability to detect deletions, indicating the advantage of combining RD and RP methods for deletion detection. The genotyping accuracy of the human dataset was further benchmarked against the high-confidence aCGH array-based database. The discordance rates of CNVcaller, CNVnator and Genome STRiP were 2.6%, 5.5% and 2.2%, respectively. This genotyping accuracy ranking was the same with the Mendelian consistency of the 10 Dutch trios (Supplementary Figure 5)." Thank you for your reminding of the breakpoint issue. However, unlike the PR/SP algorithm, RD can not detect breakpoints in the at base pair resolution or less than the window step size resolution. Integrating RD and RP methods can improve the breakpoint accuracy in human genome. However, precise breakpoint is more difficult to achieve in the poorly assembled genomes. Additionally, the breakpoint issue did not affect the genotyping accuracy which is the direct input of GWAS. The genotyping FDR of CNVcaller, CNVnator and Genome STRiP were 2.6%, 5.5% and 2.2%, respectively.
The Manuscript mentions mrsFAST for absolute copy number validation. I could not find any formal comparison of predicted copy-number by mrsFAST and CNVcaller but maybe I missed this? Figure 1) shows that the copy numbers calculated using mrsFAST and CNVcaller were similar. However, mrsFAST needed to realign all the multi-hit reads in BWA alignments, leading to significantly increased computational time. For example, mrsFAST required 10 hours for a 3G genome with 10X sequencing data, whereas CNVcaller needed only 4 minutes.

Supplementary Figure 2 (previous Supplementary
-Please add to Table 1  To analysis the difference between deletions and duplications, all FDRs were evaluated separately in the revised manuscript. We found the duplications had much higher FDRs than the deletions, especially for the short and rare CNVs. This section has been expanded in the newly added subsection "RD corrections for sex chromosomes" as follows: "RD corrections for sex chromosomes. Most mammalian and avian genomes show an XX/XY-type or ZZ/ZW-type sex-determining system. Their homogametic sex chromosomes (X or Z) constitute 5%-10% of the total genome and show half the RD of the autosomes in XY or ZW individuals. Therefore, intensive correction for X and Z chromosomes is needed. The RD of the X or Z chromosome (the particular name provided by the user) is used to determine the sex of a particular individual. If the median RD of this chromosome is <0.6X the median RD of the autosome, the individual is considered an XY or ZW type, and the RDs of this chromosome are doubled before normalization. Otherwise, nothing is performed for individuals determined to be XX or ZZ type." Page 8, line 154: "... and the distance between them is less than a certain percent of their own length." This text has been modified as follows: "As CNVRs can be separated by gaps or poorly assembled regions, the adjacent initial calls are merged if their RDs are highly correlated. The default parameters are as follows: the distance between the two initial calls is less than 20% of their combined length, and the Pearson's correlation index of the two CNVRs is significant at the P = 0.01 level." Page 5, line 91: "The reference genome is segmented into overlapping sliding windows." What window size and overlap was used for high-coverage genomes?
The following description has been added to the methods. "The window size is an important parameter for RD methods. CNVcaller uses half of the window size as the step size. The optimal window size is 800 bp (with a 400 bp overlap) for 5-10X coverage human and livestock sequencing data (Supplementary Figure 1). The recommended window sizes are inversely related with coverage, and thus, ~400 bp windows correspond to 20X coverage, and ~200 bp windows correspond to 50X coverage." Page 5, line 95: "The raw RD signal is calculated for each window as the number of placed reads with centers within window boundaries." Does this imply that for pairedend data both reads are counted?
We apologise for the ambiguous description. The following description has been added: "Considering the uncontrollable effect of gap ratios from different genome assemblies, all of the end reads located in the window are independently added to the RD of this window, regardless of whether the read is from single-end mapping or paired mapping." Page 8, line 154: "Then the two adjacent initial calls are further merged if their copy numbers are highly correlated". What threshold was used?
This text has been modified as follows: "As CNVRs can be separated by gaps or poorly assembled regions, the adjacent initial calls are merged if their RDs are highly correlated. The default parameters are as follows: the distance between the two initial calls is less than 20% of their combined length, and the Pearson's correlation index of the two CNVRs is significant at the P = 0.01 level." Figure 3A: CNVcaller 13.7. What is the unit? Are these 13,700 CNVs?
The unit in this figure is Mb. Because the intersection of the three methods with different boundaries was difficult to define in numbers, they were evaluated in terms of length. CNVcaller covered 40% of the CNVRs detected by CNVnator, 45% of the CNVRs detected by Genome STRiP and 65% of their intersecting CNVRs, in terms of length.
Minor: -I could not find a reference to the 232 goat sequencing data? Is this data publicly available?
Among the 232 goat whole-genome sequencing data files, 103 files were acquired from NCBI, the accession numbers are provided in Supplementary Table 1 -The first Results section "Overview of CNVcaller algorithm" seems better suited for the Methods part. This has been modified as suggested.
-Is the Mendelian consistency higher for the high-coverage trio: NA12878, her father (NA12891) and her mother (NA12892)?
Yes. In the high-coverage data of all three members of the trio (NA12891, NA12892 and NA12878 were all 50X), the inconsistency rate was 2.4%. In the high-coverage data of the parents (50X for NA12891 and NA12892) and the low-coverage data of the child (5.3X for NA12878), the inconsistency rate was 6.1%. Thus, increased sequencing depth can help to reduce the number of false positives.
-I believe the claim that read-pair/split-read algorithms are less powerful on draft assemblies of non-model organisms compared to read-depth methods is potentially true but the Manuscript lacks a proof for this or a citation that supports this claim.
Thank you for your agreement. This problem was found in our previous reference genome assembly projects for both sheep and goats. However, we did not report this result in the section on CNV/SD detection. The review listed below has some comments about this claim, however, without direct supporting data. Therefore, we have removed this comment from this manuscript.
Bickhart DM and Liu GE. The challenges and importance of structural variation detection in livestock. Frontiers in genetics. 2014;5.
"While RP methods should provide a suitable means for detecting such events in theory, two major problems currently challenge the accuracy of this method: (1) alignment errors resulting from the mapping of read pairs to repetitive regions of the genome…… The first problem (1) is unfortunately dependent on the reference genome assembly for the species, and is unlikely to be resolved until better reference assemblies are created for livestock." -It is not clear from the Manuscript if CNVcaller reports copy-number likelihoods based on the Gaussian mixture model. Please clarify. Thank you for your suggestion. CNVcaller reports the silhouette coefficients of the copy numbers instead of the Gaussian mixture model likelihood as quality control because we found that silhouette coefficients had a greater correlation with the IRS test results than likelihood.
- Figure 5A: Why is the absolute copy-number correction different for Human and Sheep? We are sorry for not clearly interpreting the high proportion of misassembled segmental duplications in non-human assemblies. This part of the manuscript has been modified as follows: "Previous studies have shown that a high proportion of SDs in animal genomes are misassembled single-copy regions [27,29]. Therefore, we detected the ratios of false SDs on the human (hg19) and sheep (OAR v3.1) reference genome assemblies by the sequencing copy number of a human (NA12878) and a Tan sheep sample ( Figure  3A). If the SDs were correctly assembled, the sequencing diploid copy number should be twice the copy number of SDs. For example, the average sequencing copy number of the two-copy SDs was four in NA12878. However, the corresponding sequencing copy number in sheep was only 2.4. These results indicated that most two-copy SDs of hg19 were truly duplicated in NA12878, while approximately 80% of the two-copy SDs in OAR v3.1 were single-copy regions in the Tan sheep sample. Thus, the SDs in the sheep genome were called "putative SDs" before validation." -There is quite a few typing and grammatical errors. For instance: * Figure 2B: Max mamory *Supplementary Table 3: Memery *Page 3, line 53: ...the number of reads aligned to of a particular region. *Page 8, line 160: This model presets the average copy number of homozygous deletion, heterozygous deletion, normal, heterozygous deletion (duplication!), homozygous deletion (duplication!) at zero to four respectively. We are sorry for these mistakes. We have proofread the revised manuscript and used a professional English-language editing service to minimize the grammatical errors.

Current version
Last version   Table 1 and newly added      Table 1 and newly added Supplementary Table 4 Newly added Supplementary Table 5 Newly added