Screening methods for detection of ancient Mycobacterium tuberculosis complex fingerprints in next-generation sequencing data derived from skeletal samples

Abstract Background Recent advances in ancient DNA studies, especially in increasing isolated DNA yields and quality, have opened the possibility of analysis of ancient host microbiome. However, such pitfalls as spurious identification of pathogens based on fragmentary data or environmental contamination could lead to incorrect epidaemiological conclusions. Within the Mycobacterium genus, Mycobacterium tuberculosis complex members responsible for tuberculosis share up to ∼99% genomic sequence identity, while other more distantly related Mycobacteria other than M. tuberculosis can be causative agents for pulmonary diseases or soil dwellers. Therefore, reliable determination of species complex is crucial for interpretation of sequencing results. Results Here we present a novel bioinformatical approach, used for screening of ancient tuberculosis in sequencing data, derived from 28 individuals (dated 4400–4000 and 3100–2900 BC) from central Poland. We demonstrate that cost-effective next-generation screening sequencing data (∼20M reads per sample) could yield enough information to provide statistically supported identification of probable ancient disease cases. Conclusions Application of appropriate bioinformatic tools, including an unbiased selection of genomic alignment targets for species specificity, makes it possible to extract valid data from full-sample sequencing results (without subjective targeted enrichment procedures). This approach broadens the potential scope of palaeoepidaemiology both to older, suboptimally preserved samples and to pathogens with difficult intrageneric taxonomy.


17
A rapid population growth initiated in Neolithic period, connected with domestication of animals and increase of human sedentism, 18 played a key role in pathogen transmission within the so-called first epidemiological transition [1][2][3][4]. The identification of infectious diseases and 19 selection of unique fingerprints of their causative agents, especially those derived from skeletal elements, are still of the greatest interest for 20 paleopathologists and anthropologists, which is evidenced by the range of available analysis methods. Members of Mycobacterium tuberculosis 21 complex (MTBC) are genetically very closely related and are causative agents for one of the oldest human infectious diseasestuberculosis 22 (TB). It is a disease that may leave lesions on patients' bones, enabling a diagnosis based on bone morphology [5]. The main problem of 23 paleopathological diagnoses based solely on dry bones is that there are no pathognomonic skeletal indicators of TB. The most reliable skeletal 24 indicator of TB are destructive lesions in thoracic and lumbar spine sections, which in advanced disease stage lead to destruction and collapse of 25 vertebral bodies, resulting in spinal kyphosis, or gibbus, known as Pott's disease [5,6]. However, many other conditions, like chronic pyogenic 26 osteomyelitis, Brucella osteomyelitis, fungal infections, typhoid spine, vertebral fractures, septic, traumatic, and rheumatoid arthritis, malignant 27 bone tumors can all affect the spine and produce similar pathological conditions which are difficult to distinguish from tuberculosis in 28 paleopathological practice [7,8]. Diagnoses based on other types of bone lesions are even more tentative; these are primarily based on 29 osteomyelitis of the joints (most commonly the hip and knee, but also ankle and elbow) and periosteal reactive lesions (mainly in the ribs or 30 diaphysis of the long bones, including tubular bones of the hands and feet in children [6,8]. Lastly, morphological studies of bones do not permit 31 detection of many individuals affected with TB in past human populations: data from the pre-antibiotic era show that bone changes occur only in 32 about 3-7% of individuals with active TB [8]. 33 Since the 1990s, new possibilities to diagnose TB in archaeological specimens have arisen, offered by the detection and analysis of 34 mycobacterial DNA and other biomolecules specific to MTBC at the molecular level [9][10][11][12][13][14][15][16][17][18][19][20]. A common complication in molecular studies for 35 ancient MTBC detection is the presence of DNA and other metabolites from the whole microbiome of the individual whose remains are being 36 analysed as well as from environmental bacteria that have colonised the skeleton post-mortem [21,22]. These contaminants might include 37 Mycobacteria other than M. tuberculosis (MOTT), some of which are prevalent in the environment, while others are associated with clinical 38 cases of non-tuberculosis diseases [21,[23][24][25]. It should be emphasized that members of Mycobacterium tuberculosis complex themselves are 39 characterized by a particular high sequence similarity [26,27], which leads to often unsurmountable difficulties in distinguishing them on the 40 molecular level. 41 Detection of cell wall components such as mycolic, mycocerosic and mycolipenic acids [12,14,17,18] with matrix-assisted laser 42 desorption/ionization tandem time of flight (MALDI-TOF) which present profiles specific for MTBC is considered a reliable method to identify 43 ancient causative agents in human archaeological samples. On the other hand, initial attempts to use mass spectrometry to detect cell wall lipids 44 were shown to be erroneous in some cases [14,28,29]. Polymerase chain reaction, followed by gel electrophoresis, is still a popular method for Recently, next generation sequencing (NGS) methods were introduced for detection of causative agents of ancient diseases [45,46], 53 including MTBC, with or without pre-enrichment of MTBC aDNA [47][48][49][50]. The increasing amount of data generated by NGS and efficiency of 54 non-Sanger-based sequencing platforms requires a new approach in processing tools: suitable bioinformatic pipelines are required for reliable 55 DNA analysis of ancient causative agents. Similar to PCR, where the use of only short conserved regions considered as specific for MTBC may 56 lead to false positive results, improper analysis of NGS data can misinterpret sequences from modern known or unknown environmental 57 Mycobacteria which are present in ancient human skeletons [25]. New analytical tools for more unequivocal answers to questions of 58 identification and differentiation of ante-mortem causative from post-mortem non-causative microbial agents are urgently needed. Application of 59 specifically designed in silico (bioinformatical approach) verification methods for improved downstream processing of molecular fingerprint 60 data from ancient samples is necessary for drawing conclusions on clinical prevalence and epidemiology of pathogenic mycobacteria in history.

61
Here we present an improved strategy for specific identification of bacteria from the M. tuberculosis complex in ancient non-enriched NGS data.

62
Main purpose of this study was to design an unbiased genomic marker alignment query composed of sequences belonging strictly to MTBC 63 members. Subsequently, appropriate bioinformatic alignment algorithms and statistical tools allow the identification of tuberculosis causative 64 agents, using fragment length variation to balance selectivity (species specificity) with sensitivity of detection. Reference target construction (alignment target) 79 As our main reference sequence, we used the most commonly applied modern laboratory strain of M. tuberculosis (MTB), H37Rv, for 80 which the whole genomic sequence is available. In order to select a subset of this reference sequence as an alignment target providing enhanced 81 specificity for tuberculosis-causing agents (MTBC members), we first derived a set of all protein-coding sequences (CDS) from the H37Rv 82 genome using the RAST tool [52]. These 4,360 sequences were screened using the BLAST tool (Megablast) at the National Library of Medicine 83 sequentially against 12 available genomic sequences of selected MOTT: M. kansasii,M. avium subsp. paratuberculosis,M. ulcerans,M. 84 smegmatis, M. fortuitum, M. haemophilum, M. marinum, M. simiae, M. asiaticum, M. xenopi, M. phlei, M. abscessus species, we reasoned that the performance of an alignment target is directly linked to amount of similarities between the MTB genome and other 100 potentially interfering mycobacterial species (both ancient and environmental) present in the ancient host-derived sample. In order to quantify 101 this, we subjected the publicly available genome sequences of Mycobacterium species to an in-silico procedure to generate collections of short 102 sequences broadly analogous to authentic NGS reads. Since including reads below a certain length threshold in similarity analysis of ancient 103 microbial DNA leads to non-specific matches (for both evolutionary and statistical reasons); this threshold is usually arbitrarily assumed around 104 30 bp, but a broader analysis might make it easier to construct a reliable algorithm for detection of specific ancient pathogens. Therefore, in our 105 further analysis both of reference and authentic ancient NGS sequences we extracted groups (bins) of non-human sequences over several length 106 thresholds: ≥20bp, ≥25bp, ≥30bp and ≥35bp, to enable a thorough analysis of specificity gain upon increase in minimal sequence length. For 107 reference Mycbacterium genomes, k-mers of specified length (corresponding to the lower limit of read length for NGS bins: 20, 25, 30 or 35) 108 were filtered against the human genome assembly hg19, and the resulting "short read" collections were aligned to the full MTB reference 109 genome or its selected subsets (Borówka et al., Bouwman et al. and Bos et al. alignment targets). Table 1 shows the respective number of 110 genomic k-mers from MTB complex and MOTT species which match the MTBC alignment targets as well as the total lengths of assayed 111 genomes for comparison. Since the various subsets of the MTB genome differ in length and thus the probability of random match increases with 112 target length, we standardised the obtained data by presenting it as percentage of k-mers from a given mycobaterial genome that match the 113 alignment target, divided by the ratio of target length to the full MTB genome length (genomic coverage of the target). These values, which are 114 an inverse measure of alignment target specificity (they increase if more "reads" from a species which is not MTB or MTBC can be mistaken for 115 MTBC), are shown in Table 1. As a reference, the MTB genome itself was also subjected to this procedure -obviously, the match percentage 116 values are almost 100% here. Several conclusions can be drawn from these data: firstly, it is obvious that selecting longer reads (in this case 117 longer k-mers) for comparison increases specificity, with reads 30 bp long or longer optimal for specific identification of the MTB complex, 118 reflecting a common consensus in the field. However, it is important to note that shorter reads still add important information to the analysis, as 119 the rate of specificity increase (decrease in matching read percentage with increase in read length) varies between species (i.e. some species have 120 longer stretches of highly similar sequence to MTB). For example, while M. smegmatis has a very high match percentage to the Borowka et al.

121
alignment target at low read length, this is rapidly lost at longer (more genuine) read lengths; the opposite is true e.g. for M. marinum. It is a 122 derivation of the evolutionary history of the genus, but in this case also a practical caveat for further interpretation of sequence matches in actual  Since we intended to develop a highly specific screening test (based on low depth sequencing strategy) for verification of MTBC 127 infection in Neolithic samples with a priori relatively low degree of aDNA preservation, we decided on a statistical approach. Since any preserved ancient mycobacterial DNA would be only a fraction of total aDNA, and it in turn would only be a fraction of total reads (the balance 129 being modern environmental metagenome), a balance between sensitivity and specificity in verifying this very low number of reads must be 130 struck. In sedentary, communal populations MTBC infection tends to be epidemic in character, but in most individuals with latent infection the 131 microbial load (and thus the probability of DNA survival in ancient samples) is relatively low and constant. Any similarity analysis based on 132 sequence alignment will also invariably generate false positive alignment hits, thus, it would be impossible to construct a test with sufficient 133 statistical power to distinguish individuals genuinely free of ancient MTBC and those with average/modest latent infection. Therefore, we 134 concentrated on the detection of outlier individuals with high microbial load (which may be later selected for enrichment-based further genetic 135 analysis, such as phylogenetic studies or genome reconstruction), measured by the positive read ratio (the intrinsically very low ratio of reads 136 matching the MTBC alignment target to all eligible reads). Based on the epidemiology of MTBC infection, we assumed a quasi-normal 137 distribution of positive read ratios in a randomly selected sample of ancient individuals, with outliers as candidates for active tuberculosis and for 138 selection for more in-depth studies. Thus, our method was based on standardising read ratio values to normal distribution parameters (arithmetic 139 mean and standard deviation) and, as a further step in the detection algorithm for aTb, we applied a typical cutoff value of 1.5xSD to detect 140 outliers.

141
As a first stage of testing our screening approach on actual NGS data from ancient material, we used a control dataset based on published 142 NGS results of confirmed tuberculosis-infected individuals -18th/19th-century mummified bodies from a crypt in Vác, Hungary, described by  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64 complexity. There are individuals who, despite being outliers for the bins including shorter reads, lose this status for the more restrictive bins 168 (e.g. 55_BK4 for the Borówka et al. target), i.e. the majority of their MTBC-like sequences were of low complexity. Contrastingly, in some 169 individuals the share of MTBC-like sequences increases above the cut off value only for bins with longer reads (e.g. 31_BK4 for the Borówka et 170 al. target), i.e. most specifically aligned fragments are relatively long. It is again apparent that since most of this change concerns reads between 171 20 and 29 bp in length, the optimal threshold for read aligning to a genomic target for specificity towards MTBC is ≥30 bp. Thus, the three 172 individuals which exceed the threshold of 1.5xSD for the MTBC-specific Borówka et al. target (17_BK4, 29_BK4 and 31_BK4) are considered 173 with high probability to be ancient cases of MTBC infection and merit selection for further in-depth studies by a more cost-intensive approach.

174
Since the cut off-based detection algorithm, while robust for the presented dataset, may be less suitable for other, less homogenous 175 groups of ancient individuals, we also set out to construct an objective, parametric testing-based outlier detection algorithm. Since the main 176 objective of our overall study is specificity of MTBC detection, we applied this algorithm to the original Borówka et al. genomic alignment 177 target. Based on the observation that positive read ratio tends to depend monotonically on read length bineither consistently increasing or 178 decreasing for outlier individualswe decided to calculate a monotonicity parameter. We first standardised positive read ratios as percentage of hand, negative outliers may either be individuals with ancient MOTT infection (we suggest this as highly probable for 4_BK4) or samples with 186 high proportion of short, non-specific alignments, probably due to environmental contamination (most probably 55_BK4) -to distinguish these 187 two groups, a comparison with the more Mycobacterium-generic whole-genome alignment target is necessary (see below). This approach, while 188 retaining the strong specificity of the cut off approach, gains increased sensitivity due to inclusion of individuals with high background of 189 environmental sequences (low initial positive alignments in the short-read bin) which nevertheless retain specific long positively aligned 190 sequences upon read length restriction, e.g. 21_BK4.

208
Since for two individuals which were strongly enriched in mycobacterial sequences (4_BK4 and 32_BK4) we posit the existence of an 209 ancient MOTT infection (as they do not score highly in comparison with the specific Borówka et al. alignment target), we decided to verify if 210 this assumption is supported by aligning the optimal read bin (≥30bp) to full genomes of other mycobacterial species as targets. Indeed, as seen 211 in Supplementary Fig. 2, those two individuals are also strong outliers in read ratio values after aligning to the M. marinum genome -moreover, 212 when plotted against read ratio values for the MTB genome, it is apparent that they show higher similarity to M. marinum, since they are located  Specificity of this tool is also relatively low -bone lesions that mimic Pott's disease occurs in many other unrelated conditions. In spite of that 231 limitations, osteological analysis is often the main starting point of a study and cannot be disregarded. However, in our study the occurrence of 232 bone lesions that could be linked in any way with tuberculosis did not correlate with the results of our genetic analyses. There are two possible 233 explanations for this fact. First, the bone changes were not caused by tuberculosis, which is in accordance with a lack of pathognomonic 234 characteristics of the disease on the skeleton alone, as was clarified before; it applies primarily to the graves 12_BK4, 18_BK4, 47_BK4, and 235 73_BK4. It may also be that the preservation of MTBC aDNA was too poor to pass the sensitivity/specificity threshold of the method proposed 236 here.

237
Among molecular techniques which are used for diagnosis of ancient tuberculosis cases, both biochemical methods based on mass 238 spectrometry and PCR amplification of marker sequences have been successfully used in literature, e.g. for preliminary description of the 239 Hungarian mummies used subsequently to reconstruct aTB genomes [46,54]. However, both these groups of methods suffer from a number of 240 drawbacks which make them less useful in an ancient epidemiological context than in a contemporary one: environmental contamination from 241 modern soil mycobacteria can overwhelm both traces of ancient MTBC mycolic acids and less specific PCR amplicons, while strong care must 242 be taken to prevent in-lab cross-contamination with genuine MTBC samples. Therefore, NGS has a number of advantages in diagnosis of ancient 243 tuberculosis, having the potential to be both highly sensitive and highly specific; but the balance between sensitivity and specificity depends on the selection of reference genomic sequences and crucially on the method of alignment. Large amount of generated data allows potentially to 245 detect ancient mycobacteria selectively, unequivocally and semi-quantitatively, while making possible additional analyses such as preservation 246 period-related DNA damage pattern detection (e.g. mapDamage [56,57], phylogenetic analysis of genetic kinship [48] or even full genome 247 reconstruction [46]. Due to small absolute amounts of actual ancient pathogen DNA in most types of human body samples, a common approach 248 is to use pre-sequencing enrichment (usually using probe capture, e.g. [48]). Only in bodies preserved in exceptional, isolated conditions, such as 249 the Hungarian mummies from a 18th century crypt, was a non-enriched metagenomics approach used [46]. Use of enrichment techniques 250 strongly increases sensitivity, but comes with its own drawbacks (apart from increased cost), the most relevant of which is the need to pre-design 251 a set of sequences (probes or primers) that will define and limit the scope of subsequently obtained NGS data. A full metagenome approach is 252 often more relevant when dealing with a highly ancient sample like in the present study, when neither the infection prevalence nor the pathogen 253 identity are known to any precision and a preliminary NGS study is needed for formulation of specific hypotheses and pre-selection of 254 individuals for further analysis.

255
However, in the case of ancient MTBC (especially samples as old as our material is), specificity is a more important consideration than 256 sensitivityin this case not so strongly with regard to modern MTBC contamination in the laboratory (which would not mask ancient data in a 257 semi-quantitative study and would be obvious if DNA damage analysis were performed), but mainly with regard to ancient MOTT which can be 258 unpredictably genetically similar to MTBC. The sources of these MOTT can be either soil contamination (including dead animals) which could 259 have happened at any time since inhumation (preventing reliable elimination by DNA damage analysis), or actual ancient MOTT which were 260 pathogenic/infectious/commensal to ancient humans. Thus, the design of sequencing analysis workflow has to take into account the necessity to 261 filter out unknown related sequences that are not derived from MTBC -this was the main rationale behind the design of our study. While (with the caveat that enrichment for specificity-related sequences before NGS will certainly lead to loss of the majority of phylogenetically 279 important loci, so a full metagenomic sequencing round with sufficient coverage is inevitable). 280 We postulate that a combination of read length-based genomic alignment analysis and a careful knowledge-based selection of the 281 alignment target makes it possible to achieve relatively high specificity of aTB detection against all potential false positive sources. Therefore, a 282 robust tool for specifically identifying NGS-derived sequences that belong to ancient MTBC with high confidence is a priority task in molecular paleoanthropology. Even more relevant to paleoanthropological studies, confusion between MOTT and MTBC can lead to spurious 284 identification of ancient individuals as tuberculosis sufferers or carriers, invalidating conclusions relevant to paleoepidemiology. We 285 demonstrate that read length selection is not only highly relevant (as has been shown before and by us, only reads above ca. 30 bp can be used 286 with high confidence), but when a statistics-based approach to multiple length thresholds is used, it can yield a substantial increase in specificity 287 of MTBC identification. At the same time, selection of pre-filtered alignment target, with combined knowledge-based (selection of transcribed 288 sequences) and automated (exclusion of sequences aligning with MOTT genomes) delineation of MTBC-specific sequences (which we call the 289 Borówka et al. target), makes it possible to perform in-depth specificity analysis by comparing the alignments of in silico fragmented 290 mycobacterial genomes (mimicking actual NGS data). Combining the novel alignment target and the read length binning approach, we were able 291 to select with high confidence three ancient individuals with probable ancient MTBC infection and two further individuals with highly probably 292 ancient mycobacteriosis caused by MOTT (which would be misidentified as tuberculosis if another alignment target or to short reads were taken 293 into account). Of course the limitations of our data make these identifications preliminary and another round of directed (e.g. enrichment-based) 294 sequencing would be required both for positive identification of the infectious agent and for potential phylogenetical analysis of its spatial and/or 295 temporal kinship. However, in our case read length analysis allowed us to suggest M. marinum as the potential ancient infectious agent based on 296 statistical analysis; obviously, positive confirmation of this diagnosis would require tools that are currently unavailable such as proven M. 297 marinum-specific enrichment probes as well as a much better sequence coverage that could be achieved in a preliminary study (Supplementary 298 Fig. 3). Still, this possible pathogen identification is not at odds with the archeological context as the inhumation site is next to a lake (Smetowo) 299 and within a geographical region rich in post-glacial lakes (Kujawy), so some individuals could have had routine professional contact with fish.

300
Our combined procedures are a robust specificity-boosting tools, but obviously cannot be treated as ultimate proof (neither for disproval or 301 confirmation of tuberculosis infection). Our samples are relatively old (in comparison to most other ancient tuberculosis cases studied by 302 molecular means before) and thus the absolute read numbers from an unbiased NGS approach is low. We demonstrate that this disadvantage 303 makes it e.g. difficult to perform DNA damage analysis. However, we provide a consistent proof of concept for a tool which would allow 304 relatively cheap and unbiased selection of samples (e.g. individuals) for further analysis, e.g. by enrichment capture NGS. Thus, we suggest that 305 it is possible to use global NGS results from ancient samples as an economical pre-screening tool for more complex methods, while applying 306 bioinformatic tools to maximise the number of reliable conclusions that can be drawn from a limited dataset.

308
Ancient DNA extractions 309 The procedures of sample preparation were conducted in sterile and dedicated ancient DNA sample preparation facility at University of 310 Lodz, with all standard precautions taken to avoid sample contamination. All disposable materials, buffers, water, clean room surfaces and bone 311 material, were UV-irritated for min. 30 minutes before any proceeding steps. The fragments of bone material were isolated using Dremel disks, 312 (USA), surface-cleaned, UV-irradiated for 7.5 minutes on each side, and ground into a fine powder, further used for DNA extraction procedures  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64 sequences from parasites, pathogens, commensals or symbionts of the deceased individuals, as well as genomic sequences from environmental 319 organisms (mainly microbes, but also potentially higher Eukaryotes), to which the skeletal remains were exposed post-mortem. 320 Bioinformatical procedures 321 Raw NGS reads were subjected to standard quality processing such as trimming and adapter sequence removal (-q 30 --phred33 --322 illumina --length 20), using the Trim Galore! software package [62]. Since the predominant expected type of sequence in skeletal samples is 323 ancient human genomic DNA and its presence would unnecessarily complicate our analysis, the read datasets were subsequently subjected to 324 filtering by alignment to the standard (hg19) human genome reference sequence. This alignment was performed using the BWA_aln algorithm (-325 n 0.04, -l 1000), with duplicate removal, using the AGAT software tool -ocwrapper3mt.py script [63]. Any read which aligned without gaps 326 within the default mismatch rate (dependent on sequence length, e.g. 2 mismatches per 17 bp) was eliminated from the sample dataset.
We have now modified the Introduction to include more relevant citations and to phrase them so that the relation between our text and citations is unambiguous.
For example, the first paragraph mentions cholera, plague, leprosy, tuberculosis and malaria, yet the cited references deal only with plague (4 refs) and malaria (1). Ref 5 is a review that covers lineages and genotypes but there is nothing about pathology or lesions.
We have rephrased the introduction to remove the need to extensively cite references that would have very remote bearing on the rest of the manuscript and on the rationale of our studies, concentrating on the central justification of our research. These references have been replaced by more relevant ones.
The 'often insurmountable difficulties' in distinguishing members of the MTB complex is incorrect as this can readily be resolved on the basis of our knowledge of the specific and the shared DNA sequences.
The sentence has been rephrased to refer more specifically to knowledge gaps with regard to ancient pathogen genomes.
The controversy mentioned in line 51-52 relates to subsequent authors (e.g. refs 31 and 39) ignoring the information on the specificity of particular target sites in the repetitive sequences IS6110 and IS1081. It is clear from refs 56 and 57 that discrimination between the MTB complex and other mycobacteria is possible provided that the appropriate DNA target sequence is used. Indeed, the present authors, in their cited paper 47, state that the IS6110 site is highly discriminatory and reproducible. Paragraph 3, 2nd sentence, mentions one case when mass spectroscopy was interpreted incorrectly in one laboratory.
We have reformulated the statements referring to controversies about the sensitivity and discriminatory power of biochemical methods and IS PCR for ancient MTBC determination so that the current consensus is emphasised over past shortcomings of some published literature. We also placed more emphasis on the difficulty of specific discrimination between MTBC and MOTT.
Line 68 -suggest the paragraph heading is changed from 'Data Description' to 'Paleopathology of bone samples'.
Since the purpose of this paragraph goes beyond the paleopathological description of bone samples, including also the geographic and archeological context of the samples, we have changed the heading from "Sample description" to "Description of archeological samples".
Line 122 states that ' Fig. 1 presents the changes in standardised ratio values in different read length bins.' However, the legend in Figure 1 states that 'red diamonds -outliers in Mycobacterium tuberculosis H37Rv and Borówka et al. targets in bin of reads equal or longer than 30).' However, the figure appears to show that there are red diamonds indicating read ratios ≤ 20 and 25.
We rewrote the figure caption and the relevant paragraph in Results to explain more carefully that visual emphasis (red diamonds) is used to identify all datapoints belonging to those samples that registered as outliers in the long (relevant) read bins, even if these samples do not exceed the threshold for the length bins which include shorter reads. We chose this presentation form so that the conclusions drawn in text are more apparent at first sight, i.e. the distinction becomes apparent between samples where match ratio increases with read length and samples where match ratio stays consistently high even for shorter reads.
Line 198 -it is less confusing to the reader if the list of read numbers is the same in the text and the Figure legend.
This has now been corrected.
Lines 220-222 -again, this is an overall generalization that disregards the specificity obtained when appropriate PCR target regions are selected Since we rewrote extensively this section of Discussion, we hope that the present phrasing does not include too sweeping generalizations on the limits of applicability of non-NGS MTBC identification methods.
In general, the Discussion is somewhat long and rambling and would have more impact if succinct.
We attempted to make the discussion more concise, while also introducing some additional aspects required by other reviewers.