Phylogenetic signal of genomic repeat abundances can be distorted by random homoplasy: a case study from hominid primates

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. © Zoological Journal of the Linnean Society 2018. 2018, XX, 1–12


INTRODUCTION
With the rise of high-throughput sequencing technologies, there has been an intersection between the previously disparate fields of cytogenetics/genomics and phylogenetics. There are many approaches that seek to use genome-scale data for phylogenetic inference (often termed 'phylogenomics') and usually aim to reduce the genome complexity to something manageable for phylogenetic purposes. Additionally, such data are very useful for characterizing repeats and other markers for efficiently producing cytogenetic probes. The simplest method is one of 'genome skimming' sensu Straub et al. (2012), whereby whole-genome shotgun sequencing is performed but at a very low depth of coverage (< 1× genome coverage and perhaps < 0.1×). These datasets consist primarily of those sequences that are in high abundance, either in the genome itself or within the organism; this includes, predominantly, the high-copy organellar genome sequences (plastome in plants or mitogenome) but also those sequences that are in high copy in the nuclear genome. Amongst these high-copy nuclear sequences are mainly repetitive elements, an array of different types of repeat sequences, which include satellite (tandem) repeats, and transposable elements (TEs), such as retroelements (class I TEs) and DNA transposons (class II TEs). Often these data are discarded by researchers focusing on phylogenetics with such datasets, who instead use only the reconstructed organellar genomes (e.g. Guschanski et al., 2013; Richter et al., 2015; Timmermans et al., 2016; Ren et al., 2017).
The importance of repetitive DNA abundance as a marker for the phylogenetic history of species has been increasingly explored (e.g. Ricci et al., 2013; Sveinsson et al., 2013; Cai et al., 2014). Several recent studies have shown that genomic repeat abundance, rather than the sequence itself, can be used as an informative character for phylogenetic inference (Novák et al., 2014; Dodsworth et al., 2015, 2016a, b; Usai et al., 2017). A recently developed pipeline for de novo repeat analysis from low-coverage sequence data, RepeatExplorer (Novák et al., 2010, 2013), generates a high number of clusters, each representing a putatively homologous repeat family/class. Within each cluster or element, the sequence divergence is low, and although this can be used for fine-scale classification of element types, particularly retroelements (e.g. Piednoël et al., 2013; Mascagni et al., 2015; Harkess et al., 2016; Tetreault & Ungerer, 2016), the sequence divergence is not sufficient to infer taxon relationships. However, the abundance of homologous repeats does differ and the abundance of elements is often indicative of evolutionary relatedness, i.e. phylogeny (e.g. in bananas, Novák et al., 2014; angiosperms and Drosophila, Dodsworth et al., 2015; and in poplars, Usai et al., 2017). But in some cases this is not entirely clear-cut, owing to the activity of some elements, particularly those in high abundance, that are more reflective of recent activity or, perhaps, differential processes of elimination from the genome (Pons et al., 2004; Ribeiro et al., 2017; Ustyantsev et al., 2017). This needs to be explored and tested in cases where the topology is 'known', such that particular element histories can be teased apart and their impact on overall phylogenetic signal investigated.
Here we test whether repeat abundances are adequate phylogenetic characters, exploring in particular the homoplasy of repeats, using the hominids as a case study. This group was selected owing to its widely accepted phylogenetic hypothesis, which is based on much previous research and genome-scale data. Specifically, we set out to answer the following questions in this study:
1. Is the phylogenetic signal of genomic repeat abundance reliable in the case of the hominids?
2. Do certain clusters/repeats with homoplasious abundances adversely affect the phylogenetic signal?
3. Is one individual per taxon enough to build a reliable phylogenetic tree from genomic repeat abundances?

Data acquisition
We downloaded high-throughput sequence data from 15 National Center for Biotechnology Information (NCBI) short read archive (SRA) accessions, including Illumina reads (Illumina Inc., San Diego, CA, USA) from three individuals belonging to each of five well-known species of the Hominidae family of primates (Table 1): Homo sapiens, Pan troglodytes, Pan paniscus, Gorilla gorilla and Pongo pygmaeus. We also downloaded Illumina read data from a Macaca mulatta individual to be used as an outgroup for phylogenetic analyses.
In order to avoid data biases based on different sequencing protocols, all reads used in this study were chosen because they had been obtained on the same sequencing platform (Illumina HiSeq 2000), thus yielding reads of 100 bp in length, except for the M. mulatta library, in which the Illumina read length was 101 bp. Chimpanzee, bonobo, gorilla and orangutan data were acquired from wild-born individuals sequenced within the same SRA BioProject (PRJNA189439; IBE CSIC-Universitat Pompeu Fabra; Prado-Martínez et al., 2013), whereas human short reads belong to the 1000 Genomes Project Phase 3 (PRJNA262923).

Mitochondrial genome assembly, phylogenetic analysis and filtering
A total of 5 000 000 raw Illumina read pairs (100/101 bp) were randomly selected from each library downloaded from the SRA using the SeqTK software (https://github.com/lh3/seqtk) and were used for mitochondrial genome assembly with MITObim v.1.8 (Hahn et al., 2013). The mitochondrial genomes used as reference for assembly are indicated in Table 2 and were downloaded from NCBI GenBank reference sequences. Genome annotation was performed in GENEIOUS v.4.8.5 (Biomatters Ltd, Auckland, New Zealand) by aligning with the reference mitochondrial genome of each species. To verify the phylogenetic identity of each library, a phylogenetic tree was built based on maximum parsimony (MP) analysis of a global alignment of the whole newly assembled mitochondrial genome of each individual included in this study. The Tree Analysis Using New Technology (TNT) software for Linux 64 (no taxon limit), updated version of 11 December 2013 (Goloboff et al., 2008), was used for phylogenetic reconstruction, using implicit enumeration. Before subsequent analyses of repetitive DNA abundance, mitochondrial DNA reads were filtered out of all Illumina libraries with the software DeconSeq v.0.4.3 (Schmieder & Edwards, 2011), using as reference the mitochondrial genome for each species shown in Table 2.
All samples were assumed to have a genome size of ~3.5 Gbp, based on data available in the Animal Genome Size Database, which showed only slight variation in genome size between the species used in this study (3.47-3.85 Gbp; http://www.genomesize.com/, last accessed 12 November 2016), which is considered appropriate for this type of study (Dodsworth et al., 2016a). Each accession was then sampled for 0.6% of the genome by randomly subsampling each Illumina dataset. This resulted in 200 000 reads per sample from all Hominidae accessions, randomly selected with SeqTK and then converted into FASTA format.
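As a rough check on the sampling depth stated above, the following minimal Python sketch computes the proportion of the genome represented by the subsampled reads. It is not part of the original analysis; the function and variable names are illustrative, and the read length of 100 bp is taken from the HiSeq 2000 libraries described earlier.

```python
# Values taken from the text; names are illustrative, not from the original pipeline.
GENOME_SIZE_BP = 3.5e9      # assumed genome size (~3.5 Gbp)
READS_PER_SAMPLE = 200_000  # reads retained per accession
READ_LENGTH_BP = 100        # HiSeq 2000 read length (101 bp for M. mulatta)


def genome_proportion(n_reads, read_length_bp, genome_size_bp):
    """Return the fraction of the genome represented by n_reads of a given length."""
    return n_reads * read_length_bp / genome_size_bp


proportion = genome_proportion(READS_PER_SAMPLE, READ_LENGTH_BP, GENOME_SIZE_BP)
print(f"{proportion:.2%}")  # about 0.57%, i.e. roughly 0.6% of the genome
```

Under these assumptions, 200 000 reads of 100 bp correspond to about 20 Mbp, which is consistent with the 0.6% figure given in the text.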
Selected reads from each sample were labelled with a unique five-character prefix, making a total combined dataset of 1 200 000 reads for datasets of one individual per species, 2 200 000 reads for datasets of two individuals per species and 3 200 000 reads for the global dataset including all individual samples. Specifically, we prepared three different datasets of one individual (library or sample) per species plus M. mulatta as an outgroup (six operational taxonomic units [OTUs] per dataset), three different datasets of two biological individuals per species plus M. mulatta as an outgroup (11 OTUs per dataset) and one dataset grouping together all libraries representing three biological individuals per species, making a total of 16 OTUs for phylogenetic analysis, as shown in Table 3.
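The labelling step and the dataset sizes given above can be illustrated with a short Python sketch. This is an assumption-laden illustration rather than the code used in the study: the function name `label_fasta` and the example prefix `HSAP1` are hypothetical, and the sketch simply prepends the prefix to each FASTA header.

```python
def label_fasta(lines, prefix):
    """Prepend a five-character sample prefix to every FASTA header line."""
    if len(prefix) != 5:
        raise ValueError("prefix must be exactly five characters")
    for line in lines:
        if line.startswith(">"):
            yield ">" + prefix + line[1:]
        else:
            yield line


# Hypothetical two-line FASTA record and hypothetical prefix.
labelled = list(label_fasta([">read1/1", "ACGT"], "HSAP1"))

# Dataset sizes: 5 species x k individuals, plus one outgroup library,
# each contributing 200 000 reads.
READS_PER_SAMPLE = 200_000
totals = {k: (5 * k + 1) * READS_PER_SAMPLE for k in (1, 2, 3)}
print(labelled)
print(totals)
```

The totals reproduce the 6, 11 and 16 OTU datasets of 1 200 000, 2 200 000 and 3 200 000 reads described in the text.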

RepeatExplorer clustering of samples
Clustering of Illumina reads was performed using the RepeatExplorer (RE) pipeline, implemented in a GALAXY server environment running locally at the University of Granada. RepeatExplorer clustering was used to identify genomic repeat clusters within each dataset, with default settings (minimum overlap = 55, cluster size threshold for detailed analysis = 0.01%, and the 'all reads are paired' option selected). For additional details about the clustering algorithm see Novák et al. (2010, 2013). For further identification of repeat clusters, we used a custom repeat database of all primate repetitive DNA annotations included in RepBase (Bao et al., 2015; http://www.girinst.org/repbase/, last accessed 20 November 2016). Following Dodsworth et al. (2016a), we used the 1000 most abundant repeat clusters.

Phylogenetic analysis of clusters
The 1000 most abundant clusters of each dataset were used to create the data matrices for phylogenetic inference. TNT software was chosen for phylogenetic analyses under the maximum parsimony principle (Goloboff & Mattoni, 2006; Goloboff et al., 2008). Cluster abundances were used as input (continuous characters).
To make the cluster abundance values suitable as input for the TNT software, we divided all abundances by a factor calculated by dividing the abundance of the most abundant cluster by 65, so that all data would fall within the range 0-65 (with up to three decimals), as needed for analysis of continuous characters with TNT. Further transformations (e.g. cube root) were checked but provided no improvement on this scaling by a single factor. Implicit enumeration (branch and bound) tree searches were used for datasets in this study owing to the small number of taxa in each dataset.
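The scaling described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function name `scale_for_tnt` and the toy read counts are invented, and the sketch assumes that 'the most abundant cluster' refers to the largest single value in the abundance matrix.

```python
def scale_for_tnt(matrix, upper=65.0, decimals=3):
    """Scale a taxa x clusters abundance matrix into the range 0..upper.

    All values are divided by (largest abundance / upper) and rounded to
    `decimals` decimal places, as required for TNT continuous characters.
    """
    max_abundance = max(max(row) for row in matrix)
    if max_abundance <= 0:
        raise ValueError("matrix must contain at least one positive abundance")
    factor = max_abundance / upper
    return [[round(value / factor, decimals) for value in row] for row in matrix]


# Toy read counts (rows: two taxa; columns: three clusters). Invented values.
toy_counts = [[1300, 26, 0],
              [650, 13, 2]]
scaled = scale_for_tnt(toy_counts)
print(scaled)
```

Because every value is divided by the same factor, the relative differences between taxa within each cluster are preserved; only the overall scale changes.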
Resampling was performed using 10 000 replicates of symmetric resampling, a modification of the standard bootstrap (Goloboff et al., 2003). FigTree v.1.4.3 (http://tree.bio.ed.ac.uk/) was used for graphical view and representation of phylogenetic trees.

Filtering of disturbing clusters
After the first RE clustering, we found clusters corresponding to a satellite DNA and an endogenous retrovirus (ERV) that were abundant in chimpanzee, bonobo and gorilla but were absent in human and orangutan libraries. We identified these clusters by means of a Python script (https://github.com/mmarpe/phyl_rep_hominidae/blob/master/sel_clusters.py) that located those clusters that had < 25 reads in Homo and Pongo but that were abundant in the rest of the hominid species. The identity of these clusters was confirmed by the RepeatExplorer annotation and further characterized by means of sequence homology searches using the BLASTn (Altschul et al., 1990) and CENSOR (Kohany et al., 2006) tools.
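The selection criterion can be sketched as follows. This is a minimal illustration and not the sel_clusters.py script itself: the input layout (a dictionary of read counts per cluster and sample), the function name, the sample names and the `min_high` threshold for 'abundant' are all assumptions, because the text specifies only the < 25 read cutoff. The counts are invented toy values.

```python
def select_candidate_clusters(counts, low_samples, max_low=25, min_high=100):
    """Return clusters nearly absent in `low_samples` but abundant elsewhere.

    counts: {cluster_id: {sample_name: read_count}}
    A cluster is selected when every sample in `low_samples` has fewer than
    `max_low` reads and every other sample has at least `min_high` reads.
    `min_high` is a hypothetical threshold chosen for illustration only.
    """
    selected = []
    for cluster_id, per_sample in counts.items():
        low = [n for s, n in per_sample.items() if s in low_samples]
        high = [n for s, n in per_sample.items() if s not in low_samples]
        if low and high and all(n < max_low for n in low) \
                and all(n >= min_high for n in high):
            selected.append(cluster_id)
    return selected


# Invented toy counts for three clusters across five hominid samples.
toy = {
    "CL_a": {"Homo": 3, "Pongo": 0, "Pan_t": 5200, "Pan_p": 4100, "Gorilla": 9800},
    "CL_b": {"Homo": 900, "Pongo": 850, "Pan_t": 910, "Pan_p": 880, "Gorilla": 870},
    "CL_c": {"Homo": 10, "Pongo": 12, "Pan_t": 15, "Pan_p": 9, "Gorilla": 400},
}
candidates = select_candidate_clusters(toy, low_samples={"Homo", "Pongo"})
print(candidates)
```

Only the first toy cluster satisfies both conditions; the third is excluded because it is also rare in the two Pan samples.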
To test the effect of these clusters on the phylogenies built with the abundance of repeats, we performed two sets of phylogenetic analyses, one using unfiltered libraries and the other using libraries from which the reads of these particular clusters had previously been removed. Filtering was performed with the DeconSeq software against the CL3 satellite consensus sequence (X74280.1 and X74281.1 GenBank accessions; Royle et al., 1994) and against CERV1_INT, the internal sequence of the endogenous retrovirus (Skaletsky et al., 2004) included in RepBase.

Combinations of one or two individuals per species
Using a custom script written in Python (https://github.com/mmarpe/phyl_rep_hominidae/blob/master/sample_mix.py), we phylogenetically analysed all possible combinations of one or two individuals per taxon (243 phylogenetic trees each), with abundances obtained from a global RE run of all libraries involved in this study after the above filtering of clusters. The 1000 most abundant clusters of each combination were analysed by means of MP implemented in TNT, as described previously. Starting from the abundances of the 1000 most abundant clusters obtained from the RE run of all three individuals per species (all samples included in this paper) after filtering, the script constructs the cluster abundance dataset for every combination of two individuals per species, or of one individual per species, without sample repetitions. It then generates the tree for each dataset using the same TNT parameters described above and, finally, converts the tree files from .nex format to .pdf format using FigTree to make them easier to view.
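The number of combinations follows directly from the sampling design, as the sketch below illustrates. It is not the sample_mix.py script: the function name and the sample labels are hypothetical, and the sketch only enumerates the combinations without building matrices or trees.

```python
from itertools import combinations, product


def individual_combinations(individuals_per_species, k):
    """Yield every way of picking k individuals from each species, without repetition."""
    per_species = [list(combinations(individuals, k))
                   for individuals in individuals_per_species.values()]
    for choice in product(*per_species):
        # Flatten the per-species picks into one list of sample names.
        yield [sample for group in choice for sample in group]


# Hypothetical sample labels: three individuals for each of the five species.
species = ["Homo", "Pan_t", "Pan_p", "Gorilla", "Pongo"]
samples = {sp: [f"{sp}_{i}" for i in (1, 2, 3)] for sp in species}

n_one = sum(1 for _ in individual_combinations(samples, 1))
n_two = sum(1 for _ in individual_combinations(samples, 2))
print(n_one, n_two)
```

With three individuals in each of five species there are 3 ways to choose one individual and 3 ways to choose two per species, so both designs give 3^5 = 243 datasets, matching the 243 trees reported for each.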
The 243 trees produced from each set of combinations were grouped together in a file and, using Consense v.3.695 included in the PHYLIP package (Felsenstein, 1989, 2005), we obtained one consensus tree for the combinations of two individuals per species and one for the combinations of one individual per species. Each consensus tree consists of the groups that occur as often as possible in the data, through implementation of the majority rule (extended) method (Margush & McMorris, 1981).
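The core of a majority-rule consensus is a count of how often each group of taxa appears across the input trees. The sketch below illustrates that counting step only; it is not the Consense implementation and omits the 'extended' step, in which compatible groups found in fewer than half of the trees are also added. Trees are represented here, as an assumption made for simplicity, by sets of clades (each clade a frozenset of taxon names), and the three toy trees are invented.

```python
from collections import Counter


def majority_clades(trees, threshold=0.5):
    """Return {clade: count} for clades found in more than `threshold` of the trees.

    Each tree is a set of clades; each clade is a frozenset of taxon names.
    """
    counts = Counter(clade for tree in trees for clade in tree)
    n_trees = len(trees)
    return {clade: n for clade, n in counts.items() if n / n_trees > threshold}


# Three invented toy trees over Homo, Pan and Gorilla (Pongo as sister to all).
african = frozenset({"Homo", "Pan", "Gorilla"})
toy_trees = [
    {frozenset({"Homo", "Pan"}), african},
    {frozenset({"Homo", "Pan"}), african},
    {frozenset({"Pan", "Gorilla"}), african},
]
consensus = majority_clades(toy_trees)
print(consensus)
```

In this toy case the Homo + Pan clade is retained because it occurs in two of three trees, whereas Pan + Gorilla is dropped; the counts are analogous to the numbers shown beside nodes in a consensus tree.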

Mitogenome phylogenetic tree
In order to check the integrity and reliability of the libraries used, we assembled the full mitochondrial DNA sequence in each individual library, using MITObim v.1.8, and built a mitochondrial phylogeny by means of MP (Fig. 1). This showed the absence of mis-tagging or sample confusion, because it coincided with the universally accepted Hominidae phylogeny (Roos & Zinner, 2017).

Phylogenetic analyses using unfiltered datasets
The first set of RE clustering and phylogenetic analyses was performed using the datasets indicated in Table 3. None of the phylogenies obtained (Fig. 2) reflected the universally accepted phylogeny for the Hominidae family, which was confirmed by the mitogenome phylogeny (Fig. 1). In all cases, Homo sapiens appeared in a basal position in the phylogeny, sometimes forming a clade with Pongo pygmaeus (Fig. 2C-F). Given that the topology of most trees shown in Figure 2 supported the hypothesis of a Pan/Gorilla clade, we searched for clusters showing extremely high abundance similarity between humans and orangutans, which could be responsible for the observed phylogenetic distortion. For this purpose, we used a custom script to search for clusters showing < 25 reads in Homo and Pongo but higher abundance in Pan and Gorilla.

Phylogenetic analyses using filtered datasets
We found two repetitive DNA elements that were practically absent in Homo sapiens and Pongo pygmaeus (< 25 reads) but were abundant in Pan and Gorilla (Fig. 3A). These clusters were identified as a subterminal satellite repeat and an endogenous retrovirus (Figs 3B, C). The repeat unit of the CL3 satellite is 32 bp long; it was isolated from the chimpanzee genome, found to be even more abundant in gorillas, but not detected in humans or orangutans (Royle et al., 1994). The endogenous retrovirus, CERV1/PTERV1, was found by means of the analysis of bacterial artificial chromosome chimpanzee genome sequences. It is integrated in the germline of African great ape and Old World monkey species but is absent from human and Asian ape genomes (Yohn et al., 2005; Polavarapu et al., 2006).
To evaluate the possible effect of these two repeats on the phylogenetic signal, we filtered these repeats out of all libraries and performed a new batch of phylogenetic analyses on the same datasets described in Table 3, following the same protocol after filtering. As shown in Figure 3C, the endogenous retrovirus was partially clustered in CL140 (cluster graphs of full ERVs should have a circular shape). We found some other clusters containing part of this ERV, but they were less abundant and they were not discarded after the use of the script for filtering ERV reads. We decided to include these small clusters in subsequent phylogenetic analyses, because their presence did not influence the phylogenetic signal of the dataset as a whole. In addition, homoplasious clusters were filtered out of libraries using full reference sequences from RepBase, which means that the number of retained reads matching those repetitive elements is very low after filtering.
The phylogenies obtained (Fig. 4) no longer showed the previous close relationship between Homo and Pongo, indicating that the discarded repeats were responsible for the distortion of the phylogenetic signal shown in the first set of analyses. In fact, the analysis with three individuals per species yielded a tree (Fig. 4G) with essentially the same topology as the mitogenome tree, albeit with low node support in places.
This result demonstrates that some repeats can generate 'random homoplasy' by differential amplification among different evolutionary lineages. In the present datasets, a satellite DNA and a retrovirus became highly abundant in the Pan and Gorilla lineages, whereas they did not prosper in the Homo and Pongo lineages, for which reason the two latter species showed a homoplasious rather than real phylogenetic relationship. This might present a serious problem for using the abundance of repeats for phylogenetic analysis in groups not as well known as the hominids.

One or two individuals per species can yield poor phylogenetic trees
As shown in Figure 4G, the phylogeny built with three individuals per species was very similar to that obtained with the mitogenomes when the two homoplasy-generating repeats were filtered out from the libraries. Trees built with one or two individuals per species were better than those obtained from the unfiltered libraries, because Pongo was basal with respect to Gorilla, Pan and Homo, but they did not properly resolve the phylogenetic relationships between the three latter taxa (see Fig. 4A-F): all these topologies show an unresolved Homo/Pan/Gorilla clade. According to the phylogenetic analysis of technical replicates (15 technical replicates, one for each biological sample used in this study, outgroup excluded), this issue of resolution might be attributable to inter-individual variation rather than sequencing bias (for technical replicates analysis, see Supporting Information, Tables S1 and S2).
To evaluate the effect of inter-individual (coincident with intraspecific in this case) variation in repeat abundance on phylogenetic reconstruction, we made all possible combinations of one or two individuals per species, chosen from the matrix of abundances obtained after RE clustering of the dataset including all three filtered libraries per species. We thus performed the phylogenetic inference for each combination, producing 243 trees for the combinations of one individual and 243 trees for the combinations of two individuals per species. The results showed that the consensus tree for the combinations of one individual per species did not reflect the phylogeny of the mitogenome (Fig. 5A), although 36 trees out of the set of 243 did. However, the consensus tree obtained from the combinations of two individuals clearly represented the phylogenetic relationships universally accepted for the Hominidae (Fig. 5B), although only 16 trees out of the 243 showed the resolved and accepted topology.
We conclude that the phylogenetic inference obtained from genomic repeat abundance is highly dependent on inter-individual variation, and the use of only one or two individuals per taxon may potentially lead, with high probability [(243 − 36)/243 = 0.85 with N = 1 and (243 − 16)/243 = 0.93 with N = 2], to wrong phylogenetic inferences, at least in the case of the Hominidae family.
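The two proportions quoted above follow from the tree counts reported for the combination analysis, as the short check below shows. The function name is illustrative and the sketch is not part of the original analysis.

```python
def fraction_not_matching(n_trees, n_matching):
    """Fraction of trees that do not show the accepted topology."""
    return (n_trees - n_matching) / n_trees


p_one = fraction_not_matching(243, 36)  # one individual per species
p_two = fraction_not_matching(243, 16)  # two individuals per species
print(round(p_one, 2), round(p_two, 2))
```

The values round to 0.85 and 0.93, respectively, in agreement with the figures given in the text.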

DISCUSSION
Phylogeny of Hominidae using repeat abundance
The phylogenetic relationships of the Hominidae family have been the object of study and great interest for the scientific community for a long time, and they have not been exempt from controversy (Holmquist et al., 1988; Dean & Delson, 1992; Grehan & Schwartz, 2009, 2011). Currently, the {Pongo [Gorilla (Pan + Homo)]} evolutionary reconstruction is universally accepted and well established (Purvis, 1995; Arnason et al., 2000; Arnold et al., 2010; Perelman et al., 2011; Popadin et al., 2017); therefore, we believe that it is an appropriate model to test the method of phylogenetic estimation from the abundance of genomic repeats.
We compared our results with the reference tree built from mitogenomes (Fig. 1), which agrees with the previously accepted topology for this group (chromosomal evidence, Seuánez, 1982; morphological data, Ciochon et al., 1983; identity of the α and β haemoglobin sequences, Goodman et al., 1983; DNA-DNA hybridization values, Sibley & Ahlquist, 1984; mitochondrial DNA analyses, Hayasaka et al., 1988; β-globin gene clusters study, Koop et al., 1989). Our results show that there is phylogenetic signal present in repeat abundances for the 1000 most abundant repetitive elements in the hominid nuclear genomes (Figs 2-4). Generally, we recovered phylogenetic hypotheses close to the accepted tree topology indicated above. However, this was only after adding more than one individual per species and after filtering out two particular repeats that had high abundance but not in closely related taxa, therefore distorting the phylogenetic inference (Fig. 4). The most acceptable phylogeny was inferred when making a consensus of all possible combinations of datasets with two individuals per taxon (Fig. 5B) after RE clustering of three individuals per species and filtering out the two clusters causing homoplasy. Even then, some nodes are not well supported according to bootstrapping, which reflects a lack of phylogenetic signal of repeat abundances for some parts of the tree.

Inter-individual variation affects phylogenetic inference
The abundance of repetitive elements appears to show high variation between individuals, so that ideally two or more individuals per species should be used for phylogenetic analysis based on repeat abundance (Fig. 4). The most unsatisfactory phylogenetic trees we generated were from the datasets that included only one individual per taxon (Figs 2-4), in which Homo is either misplaced or the tree is generally unresolved with respect to other hominids. This did not vastly improve even after filtering of clusters with homoplasious distributions (Fig. 4), suggesting that although this might eliminate the issue of (some) homoplasy, it does not remove the caveat of inter-individual variation in repeat abundance.

Homoplasious repeats obscure true phylogenetic signal
A phylogenetic hypothesis reflecting the currently accepted Hominidae phylogeny was obtained only when using two or three individuals per taxon and after the 'disturbing' clusters of repetitive DNA had been filtered out of their libraries (Fig. 4G): a satellite DNA and an endogenous retrovirus, which showed differences in abundance between closely related species (e.g. Homo and Pan). These repetitive elements thus distorted the phylogenetic signal, yielding a falsely close relationship between Homo and Pongo. Removing these sequences from the libraries substantially improves the phylogenies obtained (Fig. 4). We believe this is a case of 'random homoplasy' generated by the chance amplification of the satellite DNA (and spread of the retrovirus) in Pan and Gorilla but not in Homo, which makes the latter more similar to Pongo in this respect. As Figure 3 shows, the homoplasious satellite DNA was the third repetitive element in order of decreasing abundance in Pan (2.7-4.2%) and Gorilla (4.8-11.9%), such that its influence on phylogenetic signal is to be expected. However, the endogenous retrovirus was only the 140th most abundant cluster (0.003-0.009% in Pan and 0.007-0.009% in Gorilla), but the trees built that included this repeat failed to fit the accepted phylogeny even after filtering out the abundant satellite (data not shown). This poses a serious problem for phylogenetic reconstruction through this approach, because the phylogenetic signal can be distorted not only by the most abundant repeats but also by others that show much lower abundance in the genomes.
Methods of phylogenetic inference that handle continuous data adequately as phylogenetic characters are currently limited but could be improved upon (e.g. model-based solutions) and, in this case, would aid phylogenetic inference from repeat abundance data. The MP algorithm implemented in TNT is similar to ordinary MP, and therefore homoplasious repeats with large differences in abundance (such as those two identified for hominids) have an adversely large effect on tree length and therefore on the most parsimonious phylogenetic tree that is reconstructed. This effect can sometimes be minimized by the use of different transformations on the data matrix when rescaling the abundances to fall between zero and 65. For example, square root or other root transformations retain the abundance differences between taxa but minimize the overall abundance (length) differences for any particular cluster (character), as used, for example, by Dodsworth et al. (2016b) and tested in the present study (data not shown). However, these approaches do not alleviate the problem in the worst cases, such as the one shown in the present study for the Hominidae family, and it is advised that these clusters (repeats) are identified and removed from the dataset before phylogenetic inference. In cases without previous knowledge of phylogenetic relationships for the taxa involved, discarding every cluster showing large differential abundances that might be homoplasious, i.e. being absent or present in only two taxa, could be an option. We tried to do this for the present dataset, but it eliminated some clusters that were important for grouping the two Pan species, because they included repeats specific to that clade of two species (data not shown). More adequate model-based methods for inferring the phylogeny would also help to overcome the homoplasious nature of some repeat types, but these methods require further development. Therefore, the homoplasy problem might not be easy to solve, because repetitive DNA rarely shows a static path along the tree of life (Kuhn et al., 2008; Feliciello et al., 2014; Rojo et al., 2015; Barghini et al., 2015; Ferreira de Carvalho et al., 2016).
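The limited effect of root transformations can be illustrated with a small numerical sketch. It assumes, as a simplification, that the contribution of a continuous character to tree length scales with the difference in value between taxa; the function name and the read counts are invented for illustration and are not data from this study.

```python
import math


def value_range(values, transform=lambda x: x):
    """Return max - min of the values after applying `transform`."""
    transformed = [transform(v) for v in values]
    return max(transformed) - min(transformed)


# Invented toy abundances across four taxa.
homoplasious = [0, 0, 6400, 6400]  # absent in two taxa, very abundant in two
ordinary = [100, 100, 400, 400]    # moderate difference between the same groups

raw_ratio = value_range(homoplasious) / value_range(ordinary)
sqrt_ratio = value_range(homoplasious, math.sqrt) / value_range(ordinary, math.sqrt)
print(round(raw_ratio, 1), sqrt_ratio)
```

In this toy case the square root reduces the relative weight of the homoplasious character from about 21 times to 8 times that of the ordinary one, so the extreme character still dominates, which is in line with the observation that transformations do not solve the problem in the worst cases.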

Conclusions
Here we tested the abundance of repetitive elements as phylogenetic characters to infer the phylogenetic relationships of hominid primates, the family Hominidae. In general, we were able to recover a phylogenetic hypothesis close to the accepted topology, i.e. that which was recovered from much previous genomic sequence data. We discovered two important caveats when exploring this type of data, which should be borne in mind for future analyses of repeat abundances as phylogenetic characters: (1) individual variation in repeat abundance suggests that multiple samples per taxon should be included if at all possible; and (2) particular repeats can have highly homoplasious distributions such that they distort the phylogenetic signal in the overall dataset. We suggest that without a priori knowledge of the expected phylogenetic topology, researchers should be cautious and check for unusual signals yielded by repetitive elements irregularly distributed in the genomes of the tested organisms.

Acknowledgements
M.M.P. was supported by a FPU fellowship (FPU13/01553) from the Spanish Ministerio de Educación, Cultura y Deporte. F.J.R.R. and J.P.M.C. were supported by the Spanish Secretaría de Estado de Investigación, Desarrollo e Innovación (CGL2015-70750-P), including FEDER funds, and F.J.R.R. was also supported by a Junta de Andalucía fellowship. S.D. was supported by a NERC studentship.

Figure 1. Mitochondrial phylogeny of all samples (libraries) used in the present study (A) and with the reference mitogenome from each species used in the sample mitochondrial DNA assembly (B). In each case, the trees represent the well-known {Pongo [Gorilla (Pan + Homo)]} topology. Bootstrap support of each node is specified on the tree (values < 50 in light grey indicate less robust nodes).

Figure 2. Genomic repeat phylogenies of one (A-C), two (D-F) and all samples (G) after RepeatExplorer clustering of unfiltered libraries. Bootstrap support of each node is specified on the tree (values < 50 in light grey indicate less robust nodes). Note that none of the trees matches the topology of the mitochondrial DNA tree. Even with three individuals per species (G), the tree reconstructed using repetitive element abundances (on the left) placed Homo as the ancestor of Pan and Gorilla, in strong disagreement with the mitochondrial DNA tree (and current accepted placement).

Figure 3. A, abundance of the CL3 subterminal satellite and the CERV1-ERV (CL140) retrovirus per individual. Number of reads as log-scaled bars and percentages shown next to bars indicate the proportion of each element per sample in the RepeatExplorer dataset. B, C, graph-clusters of the two homoplasious repeats, CL3 satellite and CL140 CERV1 retrovirus.

Figure 4. Genomic repeat phylogenies of one (A-C), two (D-F) and all samples (G) after RepeatExplorer clustering of libraries previously filtered to remove a satellite DNA (CL3) and an endogenous retrovirus (CL140-CERV1) that have homoplasious abundance distributions. Bootstrap support for each node is specified on the tree (values < 50 in light grey indicate less robust nodes).

Figure 5. Consensus phylogenetic trees obtained from all possible combinations of one (A) and two (B) individuals per species (after filtering of the two homoplasious repeats). Numbers beside nodes indicate the number of trees, out of 243, that support the split. Note that the consensus tree built with two samples per taxon (B) shows a similar topology to the mitochondrial DNA tree shown in Figure 1, albeit with low support for two nodes.

Table 1. Taxon sampling of hominids from National Center for Biotechnology Information SRA accessions. SRA, short read archive.

Table 2. Mitochondrial genome reference sequences. NCBI, National Center for Biotechnology Information.

The 1000 most abundant clusters were used because they represented a sufficient proportion of the genome for phylogenetic analyses. Read counts per cluster and sample information obtained from RE can be found in figshare under the accession https://figshare.com/s/c2ccda047dd502890dcb