A single mini-barcode test to screen for Australian mammalian predators from environmental samples

Abstract Identification of species from trace samples is now possible through the comparison of diagnostic DNA fragments against reference DNA sequence databases. DNA detection of animals from non-invasive samples, such as predator faeces (scats) that contain traces of DNA from their species of origin, has proved to be a valuable tool for the management of elusive wildlife. However, application of this approach can be limited by the availability of appropriate genetic markers. Scat DNA is often degraded, meaning that longer DNA sequences, including standard DNA barcoding markers, are difficult to recover. Instead, targeted short diagnostic markers are required to serve as diagnostic mini-barcodes. The mitochondrial genome is a useful source of such trace DNA markers because it provides good resolution at the species level and occurs in high copy numbers per cell. We developed a mini-barcode based on a short (178 bp) fragment of the conserved 12S ribosomal ribonucleic acid mitochondrial gene sequence, with the goal of discriminating amongst the scats of large mammalian predators of Australia. We tested the sensitivity and specificity of our primers and can accurately detect and discriminate amongst quolls, cats, dogs, foxes, and devils from trace DNA samples. Our approach provides a cost-effective, time-efficient, and non-invasive tool that enables identification of all 8 medium-large mammal predators in Australia, including native and introduced species, using a single test. With modification, this approach is likely to be of broad applicability elsewhere.

-thank you for this paper. I included a citation in the "conservation implications" part of the "Discussion" section. I indeed developed this mini-barcode for a management and monitoring purpose thus for a broader application than the "simple" identification purpose. Line 347 reviewer 2: Stephane Boyer -It is interesting to see that a more relaxed genetic distance threshold may be more appropriate (line 201). The authors used the default 1% threshold in the functions bestCloseMatch and threshID. They seem to base this decision on the graphical representation of threshID (code below). >barplot(t(threshfullMat) [4:5,], >names.arg=paste((threshfullMat[,1]*100), "%")) The visual reading of this barplot gives some indication of how many false positives/negatives the user may have to tolerate. However, this is somewhat a crude measure of the optimal threshold. A better option is to use the localMinima function in SPIDER, which calculates the most appropriate threshold to use for a given dataset based on pairwise distances only. When running this function on the full dataset (see code below), I obtained a threshold of 0.0335 which seems more appropriate for the data. The authors may want to re-visit their analysis based on that threshold (instead of 1%). >#local minima calculation of optimal species delineation threshold >Thresh <-localMinima(fullDist) #Compute the localMinima function >#Results: 0.0335 ; 0.195 >plot(Thresh, main="localMinima 12S FULL") If the authors choose to use the localMinima function, the optimal threshold should be calculated using the Unique dataset only. As it is not possible to calculate an accurate threshold with this function using singletons only.
-The reviewer is correct, I chose 1% for the FULL database and 4% for the UNIQUE database based on the code lowest cumulative error. I have not used localMinima, and I thank the reviewer for making us aware of this option. However, this does not seem to provide a sensible output for the unique database in this instance. As suggested, I have used a threshold of 3.5% (rounding up the localMinima result of 0.335) for the full database. I have incorporated this into the analysis by comparing results using thresholds of 1% and 3.5%. For example, using best close match, The higher threshold results in a greater number of correct identifications, but also a greater number of incorrect identification. In contrast a 1% threshold has a higher number of "no ID" results. I have amended the discussion to note that the most appropriate threshold will depend on the management context, and the relative importance of false positive identifications / unidentified samples. The results for the unique database using localMinima are more problematic. Using uniThresh <-localMinima(uniqueDist) uniThresh$localMinima [1] *100 plot(uniThresh) the threshold identified is 19%, which seems extremely high in this context. While this threshold does produce perfect results (all samples correctly identified with best close match / threshID) using our unique dataset, my concern is that sequences from taxa that are not well represented in our database will be at a much greater risk of misidentification with such a relaxed threshold. Hopefully as more reference sequences become available from a wider range of Australian mammals it will be possible to improve this analysis. However, in the meantime I would argue that in most management contexts it would be better for a sample to be ambiguously identified, or to have "no ID" than to be incorrectly identified. For example, working with the full database, I see the following results using best close match with thresholds of 1% and 3.5% > table(bestCloseMatch(fullDist, Sppfull, thresh = 0.01)) correct incorrect no id 147 3 24 > table(bestCloseMatch(fullDist, Sppfull, thresh = 0.035)) correct incorrect no id 152 6 16 And using threshID with the same thresholds I get: > table(threshID(fullDist, Sppfull, thresh = 0.01)) ambiguous correct incorrect no id 5 142 3 24 > table(threshID(fullDist, Sppfull, thresh = 0.035)) ambiguous correct incorrect no id 12 141 5 16 To improve consistency between the two sets of results, I have also amended the text so that analyses with the unique database also use a threshold of 3.5%. This has the same cumulative error as a threshold of 4% (which is what was previously used) and the results are not affected by this change. text added Methods: lines 456 + 461-462 Results: lines 183-186, 193-196, 201-202, 216-231 Discussion: lines 319-323 Table 1, Additional file 6, Additional file 7 -I can only commend the authors for providing the annotated R code. The main code works well and is easy to follow. The very last line of code seems incomplete. I think it misses a closing bracket at the very end and another line to query a sequence (as written below) >} >withinF [[1141]] -The reviewer is correct that the code should end this way. However, in my version of the file this text is not missing. I have uploaded the file again to make sure that there are no errors.
-I was a little confused with the code for sliding window analysis. I don't understand why the window width was set on 20 bp and why only this particular length was investigated. The authors seem to have used the sliding window analysis to determine the position of potential primers, rather than the position of a suitable mini-barcode region (which was the original purpose of sliding window). If that is the case, then I suppose suitable 'primer windows' must be highly conserved, but what were the other criterion used to select them? It reads as follow on line 343: "…regions up to 200 bp in length, incorporating two primer sites (each of 20 bp in length) that were well-conserved across all taxa but which flanked a region of 100-200 bp that displayed high levels of interspecific variation" What is the threshold for 'well conserved'? What is considered 'high levels of interspecific variation'? Are these based on values obtained from the sliding window analysis? I would have expected that a range of length, for example from 50 bp up to 200 bp, would have been investigated with the aim of determining the shortest possible minibarcode region. For example, I ran a sliding window analysis using a width of 150 bp (see code below modified from the authors'). >a12SWin <-slidingWindow(a12Sref, width = 150, interval=1) >length(a12SWin) >a12SWin[ [1]] >a12SAna <-slideAnalyses(a12Sref, Sppa12S, width = 150, interval = >"codons", distMeasures = TRUE, treeMeasures = TRUE) >str(a12SAna) >plot(a12SAna) Useful variables provided by the sliding window function includes the 'proportion of zero non-conspecific K2P distances'. When this value is 0, the window has enough identification power to tell all species apart. All 150 bp windows starting on base ~90 tõ 240 are good picks in this regard. So I do believe the chosen region is probably a good one. But it is unclear why the window starting on position 160 was deemed the best window by the authors -I agree that this section was unclear and I have amended the manuscript to include more detail. I thank the reviewer for these comments as these have helped to improve my explanation and interpretation. I did indeed use a wider range of window sizes. I first used larger window sizes (100-175bp) to identify potential mini barcode regions. I then used the shorter window sizes (20-30 bp) to identify conserved sites suitable for primer development within the region of the candidate mini-barcode. The combination of both of these factors (a highly diagnostic sequence and conserved primer sites) are crucial for effective barcode design and adjusting the sliding windows analysis seemed like a good way to identify primer sites. Using larger windows, I identified regions that may have been good candidate minibarcodes, except that it was not clear that suitable primer sites were present. By restricting the window size, I was able to clearly identify highly conserved primers as well as the diagnostic mini-barcode regions. While the broader region (90-240bp) identified by the reviewer using 150bp windows could certainly serve as a mini-barcode, my choice of starting position was driven by the identification of a suitable forward primer sequence, and also with the final length of the amplicon in mind given the location of a suitable reverse primer. While the region from 90 bp is identified as a potential mini-barcode using a window size of 150, I also found a good region between bases ~160-~380. Considering just the smaller window size (for primer design), the region around base 160 was a suitable primer site. I have updated the text to clarify this and to better explain the approach and criteria. Methods: lines 375-387, 425-436 Results: lines 163-170 I have also amended the figures to reflect this ( Figure 2) and have updated the supplementary R code (additional file 3).
-Now, it is important to note that the actual values on the x-axis on the plots (e.g. Figure 2) are the positions of the first nucleotide of the window. As such, the box drawn on Figure 2 and presented as the 'best candidate site for a short diagnostic amplicon' is slightly misleading because each dot on that graph represents one window. There is also an issue with the positioning of that box as it is clearly not located between positions 160 and 380 as suggested in the legend of Figure 2. -Indeed, the boxed area was a bit to the right on figure 2, it is fixed now. Also, I changed the legend in figure 2 to precise that a dot was the window and the xaxis represents the first base of a window.
-Last small comment about the code: I found that on my version of R, there is an issue with object names that start with a number (e.g. 12Sref). Just placing a letter as the first character in the name solves the issue.
-Additional file 3: This problem is now fixed with all names starting with 12S preceded by "db" (for database) -Lines 41-55 There is no flow between these sentences. They need to be better linked together. As it stands it is rather laborious to read.
-I changed the text so the sentences flow better Lines 44-55 -Line 77. it is not clear what you mean by 'barcode tests' -changed to improve clarity of meaning Line 77-81 -Lines 113-114 need rephrasing to avoid repetition -changed Lines 116-120 -lines 114-120. This paragraph follows few sentences where the authors described their study and their taxa. I think it needs to be more clear that here they are back to general statements. Alternatively, these general statements could be placed before the sentence starting with 'Our goal was…' -changed. I put the general statements (the two common limiting factors) before summarising the findings in this study (our goal was…) Lines 109-120 -Line 136. I think it would be useful to include citation [2] here as it is the one describing the sliding window analysis in details.
-changed Line 137 -Line 144. To create the UNIQUE database, I am guessing that the first step was to remove the singletons and THEN to only keep one sequence per haplotype. It would make sense to write these two steps in the correct order.
-yes, that makes sense. Changed Lines 145-146 -I was also surprised to see that you had singletons in the FULL dataset, given that line 132-133, it is stated that: "Sequences were obtained from GenBank, with additional targeted sequencing conducted for species under-represented in GenBank." If there were indeed singletons and those species were eliminated, it would be useful to list which species they were -The singletons referred to non-target species. I focused the additional sequencing on specific target taxa most relevant to wildlife surveys in Australia, in particular the quolls which are poorly represented in sequencing databases. Line 133: to specify that I added sequences from the target animals I amended Additional file 7 to note all singleton species -Line 205. Yes, but a 5% distance threshold would have caused much ambiguity for the identification of the other sequences. Any chances one of the sequences for Dasycercus cristicauda was obtained on Genbank and could be either mis-identifiation or a different (cryptic) species?
-the two Dasycercus sequences are from the same team of scientists. The origin of only one sequence (AF009889) was mentioned (the Tanami desert in Northern Territory). As for the second, they don't know the origin. So it might be an ID error. I put a note in the text Lines 216-219 Same was true for a western quoll sequence, lines 183-186 Powered by Editorial Manager® and ProduXion Manager® from Aries Systems Corporation -Line 208. rather than 'a wide range of Australian mammals', please provide the number of species -changed to note that 40 species were included, but also to emphasise that these represent a wide taxonomic range (ie not just 40 species from a single order). Line 234-239 -Line 201. Add "and possibly beyond" to the end of the sentence or something similar to acknowledge that you also successfully used the primers with non-mammalian vertebrates. Alternatively, remove reptile amphibian and bird from the previous sentence, and write a new sentence at the end of the paragraph, stating why the primer was tested on those non-mammalian specimens.
-Changed to "This demonstrates the broad applicability of the primers across the mammalian taxa and their potential applicability to other vertebrate classes" Lines 234-239 -  Here we have shown that it is also possible to identify multiple species from a single DNA test, using a straightforward PCR and Sanger sequencing approach" Lines 279-282 -Line 239. 92% amplification success is quite good. It would have been interesting to compare this to what can be obtained with primers targeting longer DNA fragments. I understand this was not the aim of this particular paper, but in a sense the authors went into all the trouble of designing mini-barcodes because 'regular (longer) barcodes' don't work. It would be good to put this 92% success rate into perspective with the success rate of longer barcodes if there was any such data in the literature. It is eluded to on line 277, but the actual numbers are not provided.  approach' used is simply DNA barcoding, the benefits of which have been widely demonstrated elsewhere. The real novelty lies in the primers and the mini-barcode designed for Australian mammals, which does make a very useful tool for managers and scientists. So rather than the 'approach' I would highlight the primers or the minibarcode here -changed to "Using our mini-barcode, DNA can be screened for the presence of multiple Australian predator species in a single and inexpensive test, without the need to develop and apply a set of species-specific primers for each predator of interest. We provide a non-invasive instrument with potential utility for scientists or managers working with endangered or invasive Australian predators, but a similar approach could be used to target predator assemblages in other regions." so more focussed on Australia  The looming biodiversity crisis, referred to by some as the Sixth Mass Extinction [1], 40 has made the conservation of wildlife a rapidly growing concern. There is an urgent need to 41 document the distribution of biodiversity as the foundation for identifying effective solutions 42   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 to wildlife management issues. The rapid and reliable identification of species at local and 43 regional scales can provide the first step towards determining the distribution of biodiversity 44 in the landscape and changes that might be occurring in that distribution. 45 Advances in genetics and genomics have revolutionized many areas of biology and in 46 particular, the identification of wildlife from trace and environmental samples (e. Legend: Summary of results of genetic distance-based evaluations of the AusPreda_12S mini-barcode 191 conducted using the R package SPIDER to analyse the "FULL" (at 1% and 3.5% thresholds) and 192 "UNIQUE" (at 3.5% threshold) reference sequences databases. The thresholds were calculated based 193 on the minimum cumulative error (Additional file 6) and the 3.5% threshold for the "FULL" database 194 allows for comparison between the two databases. The specified genetic distance thresholds were 195 used for the bestCloseMatch and threshID analyses. 196 197 BestCloseMatch and ThreshID analyses, which both assume that sequences from a 198 single species fall within a specified genetic distance threshold, correctly identified 147 and 199 142 sequences respectively in the "FULL" database using the 1% threshold given by the  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63 (Table 1; details of results: Additional file 7). As noted 213 previously, these sequences would have been correctly identified if a genetic distance 214 threshold of 5% was used. This represents a high level of divergence between two 215 conspecific sequences, but as both of these sequences were obtained from GenBank, and 216 the origin of one of the samples is unknown, we cannot rule out sample misidentification or 217 sequencing error in this instance. 218 Using a 3.5% genetic threshold for the "FULL" database, to allow for comparison with 219 the results obtained with the "UNIQUE" database, correctly identified more sequences with 220 the BestCloseMatch analysis which was to be expected using a more relaxed genetic 221 threshold allowing for more mismatches among sequences. Nevertheless, six sequences 222 previously resulting in an "No ID" match became correctly identified and two became 223 incorrectly identified. The western quoll (KJ780027) became incorrectly identified using a 224 higher threshold which, once again, lead us to believe that this sequences from GenBank 225 was incorrectly identified to start with. Comparing the ThreshID results with the more 226 conservative approach used with the 1% threshold, five sequences that were previously 227 correctly identified became ambiguous and from eight sequences resulting in a "No ID" 228   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 match, four became correctly identified, two became incorrectly identified and two had an 229 ambiguous identification.  238 We also successfully amplified our mini-barcode from a wide range of input template 239 DNA concentrations. We set up serial dilutions of DNA from six predator species. 240 Amplification was successful for all three qPCR replicates from all six species for all dilutions 241 from 9 ng / µl to 9 pg / µl inclusive, demonstrating that the primers can amplify from low 242 quantity DNA. Amplification success was less consistent at the highest and lowest DNA 243 concentrations, estimated at 90 ng / µl, 0.9 pg /µl and 0.09 pg / µl (   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 2 Undetermined results were excluded when calculating mean CT. 256 3 Where the qPCR traces were of an irregular shape (three replicates), the replicate was excluded 257 when calculating mean CT. 258 259 We tested the ability of the AusPreda_12S primers to correctly identify known 260 predators by analysing scats from captive animals. 57 scats were tested and amplified 261 product was obtained from 53 samples. We obtained good quality DNA sequences, ranging  marsupial predator) and introduced mammal carnivores with a high level of accuracy. We 276 expect that these primers will also amplify DNA from both species of New Guinean quoll. 277 Previous studies, aimed at identifying species from scats or hairs, have applied barcoding 278   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 methods to detect individual species across multiple time points (examples in [53], [54]). 279 Here we have shown that it is also possible to identify multiple species by implementing a 280 single DNA test, using a straightforward PCR and Sanger sequencing approach. All clear 281 sequences obtained from 49 scats of six target predator species were correctly identified to 282 species level. In the small number of cases where a clear sequence was not obtained from a 283 scat, we found that the sequences obtained were mixed, probably arising from the 284 amplification of two or more species in the same sample. This could arise from cross of these analyses to enable recognition of non-specific PCR amplification. In practice, mixed 292 sequences cannot be used to identify the predator with confidence and therefore such 293 samples must be excluded from analysis. In addition to successful amplification of scat DNA, 294 we demonstrate that our mini-barcode primers can successfully amplify low-template DNA 295 (at least as low as 0.9 pg / µl) from museum samples. This provides further evidence of the 296 utility of this marker for application to eDNA studies. 297 Whilst DNA metabarcoding may more clearly determine which species are 298 represented in mixed samples, metabarcoding methods are relatively costly and require 299 more specialist equipment, which may not be available to many wildlife managers. In this 300 study, PCR and Sanger sequencing reliably identified the predator of origin for 86% of scat 301 samples, which is likely to be sufficient for many management applications and is a higher 302   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 success rate than has been reported for several other faecal DNA studies (for example [41], 303 where 79% of sequences were amplified using a 134 bp fragment and [57], where <70% of 304 sequences were amplified using regions ranging from 243 bp to 708 bp according to target 305 taxon). Using our mini-barcode, DNA can be screened for the presence of multiple Australian 306 predator species in a single and inexpensive test, without the need to develop and apply a 307 set of species-specific primers for each predator of interest. We provide a non-invasive 308 instrument with potential utility for scientists or managers working with endangered or 309 invasive Australian predators, but a similar approach could be used to target predator 310 assemblages in other regions. 311 The bioinformatic evaluation of our mini-barcode shows that this marker can reliably 312 discriminate among the eight target predator species (eastern, western, northern and 313 spotted-tail quolls, Tasmanian devils, cats, dogs and foxes) in Australia. The close genetic 314 similarity between the bronze quoll (from New Guinea) and the western quoll (from 315 Australia), described above and supported by [48], may pose some problems for reliable 316 species identification from unknown samples, but the different geographic distributions of 317 these two species will likely provide a clear identification in most cases. The most 318 appropriate threshold to be used will depend on the management context and the relative 319 importance of false positive identifications, but in most cases, an ambiguous or "No ID" 320 identification would be a better result for a sample than to result in a correct identification 321 when this is erroneous. 322 Further development of our reference database, to include additional D. 323 albopunctatus and D. spartacus sequences, will be required to better understand the utility 324 of this test for identification of specimens to species level in New Guinea. Likewise, a better 325   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 reference database would improve the relevance of this DNA test for application to historic 326 samples. Sequences from the extinct thylacine could be clearly identified in our initial 327 analyses, but this species could not be included in the UNIQUE database for further 328 bioinformatic analysis because only one 12S rRNA haplotype was available. Finally, because 329 we are working with mitochondrial DNA which is maternally inherited, we cannot currently 330 use this test to distinguish between dogs and dingos, in part because of the prevalence of   3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 Western quolls from Western Australia were also re-introduced to the Flinders Ranges in