Joint assembly and genetic mapping of the Atlantic horseshoe crab genome reveals ancient whole genome duplication

Background Horseshoe crabs are marine arthropods with a fossil record extending back approximately 450 million years. They exhibit remarkable morphological stability over their long evolutionary history, retaining a number of ancestral arthropod traits, and are often cited as examples of “living fossils.” As arthropods, they belong to the Ecdysozoa, an ancient super-phylum whose sequenced genomes (including insects and nematodes) have thus far shown more divergence from the ancestral pattern of eumetazoan genome organization than cnidarians, deuterostomes and lophotrochozoans. However, much of ecdysozoan diversity remains unrepresented in comparative genomic analyses. Results Here we apply a new strategy of combined de novo assembly and genetic mapping to examine the chromosome-scale genome organization of the Atlantic horseshoe crab, Limulus polyphemus. We constructed a genetic linkage map of this 2.7 Gbp genome by sequencing the nuclear DNA of 34 wild-collected, full-sibling embryos and their parents at a mean redundancy of 1.1x per sample. The map includes 84,307 sequence markers grouped into 1,876 distinct genetic intervals and 5,775 candidate conserved protein coding genes. Conclusions Comparison with other metazoan genomes shows that the L. polyphemus genome preserves ancestral bilaterian linkage groups, and that a common ancestor of modern horseshoe crabs underwent one or more ancient whole genome duplications 300 million years ago, followed by extensive chromosome fusion. These results provide a counter-example to the often noted correlation between whole genome duplication and evolutionary radiations. The new, low-cost genetic mapping method for obtaining a chromosome-scale view of non-model organism genomes that we demonstrate here does not require laboratory culture, and is potentially applicable to a broad range of other species.


Introduction
Comparative analysis of genome sequences from diverse metazoans has revealed much about 38 their evolution over hundreds of millions of years. The discovery of extensive gene homology across large evolutionary distances has allowed researchers to track chromsome rearrangements 40 and whole genome duplications. The resulting value of whole chromosome sequences presents a challenge for existing whole genome shotgun (WGS) assembly strategies. patterns, assumed to correspond to loci in the genome uninterrupted by meiotic recombination in 128 the cross [27]). Map bins fell into 32 linkage groups, close to the 26 pairs (2N=52) previously found in a cytogenetic analysis of two chromosome spreads [28]. 20 map bins were removed for having 130 inconsistent positions in the maternal and paternal maps, and 12 were singletons. To estimate the frequency of incorrect genotype calls as a function of the log likelihood dif-132 ference between the called and alternative genotype (genotype confidence score), including contributions from uncertainty in SNP-mer identification, assembly, and sampling noise, we carried 134 out a simulation of the library pooling and sequencing, k-mer assembly and genotype inference protocols, using the sequenced Ciona intestinalis genome as a starting point. 136 In the simulated C. intestinalis data set (Methods), a single stretched exponential distribution provided a good fit to the frequency of genotype calling errors as a function of the call confidence 138 score for scores up to 6, or down to error frequency of about 1%. The observed error frequency declined more slowly for higher confidence scores. The minimum χ 2 fit used for estimating the 140 genotyping error rate in the L. polyphemus map bins was p e (s) = a 1 e −s c 1 /b 1 + a 2 e −s c 2 /b 2 , with parameter values a 1 = 0.49, b 1 = 2.08, c 1 = 1.26, a 2 = 5.47, b 2 = 0.17, c 2 = 0.16 (Figure 2). closest map bins with a threshold of p < 10 −6 (Methods), for an estimated genome-wide average 148 density of one mapped sequence marker every 32 kb. A mean of 45 markers were mapped to each map bin, and the number of markers mapped was used to estimate the relative physical size 150 of map bins. 46% of the scaffolds with 12-17 SNPs could be placed with the same threshold, for an additional 32,688 markers, or one marker every 23 kb.

152
The total length of the scaffolds assigned to map bins was 411 Mb, and they contained 2.67 million bi-allelic SNPs assigned a phase with a posterior probability of at least 0.99. Of these, 72% 154 were inferred to be unique to one of the four parental chromosomes. This is close to the 74% predicted under the finite sites neutral coalescent model given the observed SNP density [29]. 156

Sequence composition and recombination rate
In the scaffolds longer than 1kb (mean length 2.9 kb), the G/C base content was 33.3 ± 2.8 per-158 cent, and the local relative frequency of CpG dinucleotides was bimodally distributed, with about 30% of sequences exhibiting depletion of CpG. TpG and CpA dinucleotides were over-represented 160 on average and their local densities negatively correlated with CpG density, suggesting ongoing germ-line CpG methylation for a fraction of the genome [30]. 162 The mean maternal and paternal recombination rates were estimated to be 1.28 and 0.76 centimorgans per megabase respectively, consistent with expectation based on genome size [31]. We 164 did not observe evidence of segregation distortion for any map bins. Estimated local recombination rates in the two parents are correlated across the genome with and r 2 = 7.1% ; p<1e-29, and 166 the mean local recombination rate is correlated with local SNP density, r 2 = 9.7% ; p<1e-40. 168 34,942 scaffolds have significant sequence conservation with 10,399 predicted proteins of the tick Ixodes scapularis [32]. 6,246 of these hits formed reciprocal best pairs, of which 5,775 (92%) could 170 be placed on the linkage map at a threshold of p<10 −6 . These were used as conserved markers for comparisons of genome organization. When linkage groups were divided into 108 non-overlapping 172 bins of 1,000 markers, 52 had significant (p<0.05, after Bonferroni correction for 1,944 pairwise tests) enrichment in shared orthologs (or "hit") with at least one of eighteen ancestral chordate 174 linkage groups [7]. A hidden Markov model segmentation algorithm [6] identified 40 breakpoints in ALG composition in the linkage groups. 72% of the genome is spanned by 53 intervening 176 segments that hit one or (for eight of them) two ALGs (Figures 4 and 5). Each of the eighteen ancestral ALGs has at least one hit among the 45 segments with a unique hit to the ALGs. Figure 2: Genotype call error rate as a function of call confidence score for bins of 10,000 calls in simulated Ciona intestinalis genome data. The stippled blue region shows 95% confidence intervals of the Bayesian posterior probability distribution of the underlying error rate computed from the Beta distribution Beta(n e + 1, n c − n e + 1) conjugate the assumed binomial distribution of observed errors, where n e and n c are the number of errors and number of calls in each bin respectively.

Whole genome duplications
Whole genome duplication (WGD), or polyploidization is a rare but dramatic genetic mutation event 180 which doubles the size of a genome and creates a redundant pair of copies from every gene. Because it creates redundant copies of genes for entire biochemical pathways and genetic networks, 182 it has been proposed that it creates unique raw material for the evolution of novel biological functions and increased complexity.

Homeobox gene clusters
Homeobox genes encode a large family of transcription factors involved in diverse embryonic pat-186 terning and structure formation processes of eukaryotes. As a particular subfamily of homeobox genes, the Hox cluster is known to control metazoan body patterning along the anterior-posterior 188 axis. We identified 155 scaffolds with significant homology to predicted chelicerate homeobox gene sequences in public databases. We classified these sequences into homeobox subfamilies on linkage group (LG) 15 and LG 21, each containing multiple members of the anterior, central and posterior classes. There are also two parahox cluster homologs, each with three homeobox genes: gsx and cdx orthologs and a third homeobox gene not confidently assigned to a subfamily in our analysis (LG 5 and LG 19). There are two smaller clusters containing multiple hox genes (LG 18 and LG 20), and clusters of other homeobox genes, including members of the msx, lbx, 196 nk, evx, and gbx families ( Figure 1).

198
WGD creates many pairs of duplicate genes or "paralogs". The distinctive features of these genes have been used to infer WGD events in fungal, vertebrate and plant genomes[2-4]. We examined 200 the genomic distribution of 2,716 pairs of candidate paralogous gene markers in L. polyphemus for signatures of WGD. In 45% of these pairs both markers mapped to the same chromosome, 202 compared to 5.3 ± 0.5 % in 1000 datasets with randomly-permuted paralogous gene identities. The mapped positions of pairs within the chromosomes were highly correlated (average r 2 = 0.81, 204 and exceeding 0.95 for 8 of the large chromosomes; Figure 8), suggesting that many of the pairs represent recent tandem gene duplicates or single genes fragmented across multiple markers. In 206 the following, these same-chromosome paralogs are referred to as "tandem" duplicates.
Inter-chromosomal duplicates are clustered into conserved paralogous micro-synteny blocks 208 (or "paralogons"[3]): there are 25 pairs of loci, each with at least six (m p =6) paralog pairs clustered with a maximum gap (max-gap) of 300 markers between adjacent paralogs in each cluster. These 210 clusters span 25,044 markers, or 30% of the map, after removing redundancy from paralogons with overlapping footprints. In 1000 datasets with randomly-permuted paralogous gene identities, 212 the maximum number of such clusters observed was 11; the mean and standard deviation were 3.9 ± 1.0. The observed clustering into paralogons was greater than that in the randomized 214 datasets over a broad range of choices of max-gap and m p . For example, for max-gap=100, m p =3 there are 52 clusters vs. 3.5 ± 1.9, range 0-10; for max-gap=500, m p = 9 there are 12 vs. 2.9 216 ± 1.7, range 0-9. Because of the large proportion of apparent tandem gene duplicates (45%), this randomization scheme increases the number of inter-chromosomal paralog pairs relative to the 218 data, making it a conservative significance test for inter-chromosomal paralog clustering. When genes with tandem duplicates are excluded from the randomizations, the observed number of 220 clusters is greater than the maximum observed in 1000 randomizations for all the combinations of max-gap in the set (100, 200, 300, 400, 500, 600) and m p in the set (3, 4, 5, 6, 7, 8, 9). 23 222 max-gap=600, m p = 7 clusters span 59% of the map, compared to respective mean number and map coverage of 3.3 ± 1.8, and 13 ± 6 % in these randomizations.

224
Among the marker pairs mapping to different chromosomes, we find a significant excess of pairs relating segments derived from the same ancestral linkage group (ALG) relative to random-226 ization controls (247 pairs vs 102 ± 11, p<0.001 in randomizations of all genes; 202 vs 46 ± 7 when genes in tandem duplicates are excluded). This pattern is consistent with the creation of 228 these segments by duplication (rather than fission).
The max-gap clusters have a significant amount of overlap among their footprints. For example, 230 the footprints of the max-gap=600, m p = 7 clusters had a total length of 72,072 markers, but a net footprint after redundnacy removal of 49,545 markers. We examined the relationships among the 232 paralogons for evidence of successive rounds of duplication. We considered a graph in which nodes correspond to merged, non-redundant paralogon footprint regions. Nodes are connected 234 with edges if a max-gap cluster connects the two nodes. The average clustering coefficient of this graph is equal to the probability that footprints a and c share a max-gap cluster, given that there 236 are edges (a, b) and (b, c) in the graph. We compared the clustering coefficients to those found in 8 random Erdős-Rényi graphs with the same number of nodes and edge probability as the observed graph. We found that the observed data shows significantly more clustering than these random graphs for a wide range of choices of max-gap and m p . For example, for max-gap 600, m p =7, 240 the average clustering coefficient is 0.19, while 10,000 random graphs had coefficients of 0.034 ± 0.042, p=0.0039.

Age distribution of paralogous genes
Because WGD events create many paralogs at the same time, they leave characteristic peaks in

252
Our results demonstrate that a low cost, combined approach to whole genome sequencing and genetic mapping can be used to efficiently create a very high density genetic recombination map for  [38,39]. Future comparisons to more closely related chelicerates will allow tests to distinguish whether these rates are positively 266 correlated with inter-specific divergence, consistent with a neutral process of correlated mutation and recombination rates [40]. Alternatively, the association could be explained by hitchhiking and have only rarely been subsequently lost. The enrichment of inter-chromosomal paralog pairs in segments of the same ALG origin is 274 consistent with their creation by duplication (rather than fission), although because small-scale duplication is biased toward local (tandem) duplication, fission of segments could also leave be-

276
hind an enrichment of paralogs. Such a mechanism, however, would not create the observed organization of paralogs, i.e. their clustering into "paralogons". The fact that these paralogons 278 span a large portion of the map (59%) suggests that it was a whole genome duplication, rather than segmental duplications that have rise to the pattern. The double-peaked shape of the distribution of synonymous site divergence between pairs of paralogs, combined with the existence of two small clusters of HOX genes in addition to the 282 two complete HOX clusters suggests that there may have been two rounds of whole genome duplication in the horseshoe crab lineage. of whole genome duplication in an invertebrate, and during horseshoe crabs' long and famously conservative evolutionary history suggests that such events may have been more common than 288 previously assumed in metazoan evolution, and that while they may have provided raw material for adaptive evolution in some cases, they are not evolutionary drivers.

294
The JAM method proceeds through three major phases: 1. The frequencies of DNA subsequences of fixed length k (k-mers) are profiled to characterize the quality, uniqueness, poly-296 morphisms and repetition in genomic reads, using software we developed building on work from the Atlas assembler [42]. Allelic pairs of k-mers representing alternate forms of single nucleotide 298 polymorphisms (SNPs) are identified and tracked through the subsequent steps. 2. Contigs are assembled on a graph of unique k-mers and paired SNP k-mers sampled to reduce memory us-300 age, then ordered and oriented using the Bambus scaffolder [26,43]. Each multi-SNP scaffold is treated as a single marker for the linkage mapping steps. 3. The paired SNP k-mers (SNPmers) in 302 each scaffold are combined with the read, mate-pair, and parent-or offspring-library associations of their alleles for haplotype phasing and construction of a high density genetic linkage map.

Sampling and sequencing
Tissue samples were collected from the third walking leg of a monandrous pair of horseshoe crabs 306 nesting at high tide on the beach at Seahorse Key, an island along the west coast of north Florida, on 27 March 2010. The eggs laid by this pair were collected 6 hr later when the tide had receded 308 and reared in plastic dishes as previously described [44]. Trilobite larvae hatched from the eggs 4 weeks later. Tissue samples and larvae were preserved in RNALater. Genomic DNA purification 310 and library construction were carried out using Qiagen DNAEasy, Illumina TruSeq and Nextera kits, following manufacturers' protocols. Barcoded samples were pooled and sequenced on on the 312 Illumina HiSeq2000 platform.
Limulus larvae were processed as follows; each larva, suspended in 100 µL of RNAlater and 314 stored at -80°C in a 1.5 mL Eppendorf tube, was thawed on ice, after which RNAlater was removed. DNA was extracted using the Qiagen DNAEasy kit per manufacturer's protocols. DNA 316 was quantified using picogreen DNA quantitation kit. To prepare TruSeq libraries, DNA was first purified another time using zymo genomic DNA clean columns per manufacturer's protocols. Adult 318 L. polyphemus DNA was prepared as above, but using claw tissue rather than whole larvae. All DNA extracts were tested by gel electrophoresis to ensure DNA was not degraded. TruSeq libraries were prepared at University of Georgia's Georgia Genomics Facility. 1-5 micrograms of sample DNA was subjected to fragmentation using Covaris sonicator. Fragmented DNA was then used for library construction using Illumina TruSeq library prep kits. Libraries were pooled together in equimolar amounts (for 10 larvae) and used for sequencing. For samples 11-34, Library prep 324 was switched from TruSeq to Nextera kits. Nextera library preparation was performed according to manufacturer's protocol. The Nextera library product was quantified by picogreen, and fragment size distribution was checked by using Lonza flash gel, to ensure that fragment size distribution was between 300-1000 bp. Sample libraries were pooled in equimolar concentrations and sent 328 for sequencing. Library pools were sequenced on the Illumina HiSeq2000 platform at Medical College of Wisconsin Sequencing Service Core Facility.

k-mer decomposition
We determine a lower bound on the k-mer size long enough for a given expectation of uniquenes 332 in a random genome. While increasing k reduces the rate of coincidentally repeated k-mers, it also reduces the effective depth of coverage due to untrimmed errors and edge effects at read 334 ends -and increases the cases of multiple SNPs per k-mer locus, which are not tracked in our current software implementation. We can approximately model a genome-scale string G of random 336 nucleotides as G samples taken with uniform probability from the space of all k-mers (of size 4 k /2 for odd k-mers treating reverse-complements as same; slightly more for even k ). The Poisson 338 distribution then gives the probability that a location in G has its own, unique k-mer (shared with no other location) as The probability of a location sharing its kmer is then 1 − e −λ ; thus, to limit the maximum rate kmer as an integer in [0..4 k − 1], slice s consists of those kmers whose remainder on division by S is s. We can tabulate one slice for a representative sample of 1/S kmers (for initial estimation of 362 depth of coverage and genome size) or, using S independent jobs, collect information for kmers in all slices. Our hash tables store odd-length kmers so that reverse-complementary sequences can 364 be combined without the ambiguous orientation of palindromic sequences (e.g., ATCGAT). After selecting k as described above and making a full tabulation of kmer counts and bit vec-366 tors, we filter out kmers not expected to represent genomic sequence. k-mers were required to have three copies in the total sequence set, with at least one copy in the initial run and one in the second run. This was partly to filter out incomplete adapter sequences, which can be difficult to trim but which were different in the two runs.

370
Extending methods developed for the Atlas assembler[42] to heterozygous sequences, Figure  1 gives a rough decomposition of the k -mer frequency distribution for 23-mers with quality ≥ 372 20, minimizing the square of the residuals of k -mer counts on frequencies 3 through 70 while not exceeding the observed counts. Four linked distributions model fractions of the genome as 374 monoallelic or biallelic: homozygous regions with d = 38.9-fold coverage (dark blue), minor alleles covered at d/4 (green), tied alleles at d/2 (red) and major alleles at 3d/4 (purple). This fit is 376 robust enough to confirm the abundance of major-minor allele pairs (27 percent of k -loci, vs. 10 percent for tied alleles), with the broader peaks in the data than in the fitted curves consistent

384
The filtered kmer counts, computed in parallel, are loaded into a hash table with additional fields to track kmers that are uniquely within one mismatch of each other. Because this step analyzes 386 all (non-error) k-mers in one table, this requires a single large-memory processor (on the order of 32 GiB).

388
For each k-mer, we check all its 3k one-substitution neighbors, The k-mers are partitioned each into one of three categories: unique: having no edit-neighbors within one substitution; ambiguous: 390 having either multiple one-substitution neighbors, or one neighbor that has multiple neighbors; or partnered: uniquely pairable with exactly one other k-mer differing by one substitution, such 392 kmers also known from now on as SNPmers or SNPmer pairs. For each SNPmer, we save the position of the substitution, a bitmask for the change (transition, complement, or non-complement 394 transversion), and whether the canonical form of the partner in the table has the same sense or is reverse complemented with respect to this k-mer. 396 Only partnered and unique k-mers will be further tracked. While this limited method cannot identify k-mers for genomic SNP and non-SNP locations with complete confidence, false pairing 398 or missed pairing should have limited effects on subsequent analysis. False pairing, due to coincidental similarities or repeats, will combine nodes of the k-mer graph (see below) and cause 400 possible misassembly: for our purposes, noise in the scaffolding, haplotype phasing, and linkage analysis. Missed pairing can happend from indel polymorphisms, SNPs separated by fewer than 402 k − 1 positions, failure to sequence minor alleles, or ambiguity due to too many similar k-mers. Ambiguously non-unique k-mers will be skipped over (reducing connectivity of the k-mer graph if 404 there are too many in a row). While allelic k-mers misidentified as unique will cause more forks in the k-mer graph, those from major-alleles will still be chained together with flanking unique se-406 quence, provided that major allele k-mers have at least twice the frequency as those for the minor allele.
408 Table 2 shows the totals and percentages of the different kmer categories, counting each SNPmer pair as one kmer. SNPmer pairs account for 16.3 percent of the putative genomically 410 unique 23mer markers; dividing by 23 gives us the fraction of bases in those markers that are putative SNPs: 0.71 percent. Subset k-mer selection To reduce the memory requirements of our k-mer assembly graph, we select a roughly one-tenth

422
Unlike for SNPmers, there are no canonical positions that identify the unique, unpaired k-mers . Several mechanisms have been proposed for sampling k-mers in a representative way [49,50]. 424 We use the more pseudo-random hash-slicing rule, already discussed above, to sample a single slice of k-mers those whose integer encodings are congruent to a particular slice number s, 426 modulo S (the hash slicing factor). We have found that on the finished human genome (results not shown), hash slicing is effectively a Poisson sampling, with sampled k-mers spaced according to 428 an exponential distribution.
A caveat in applying hash slicing is that taking the remainder modulo a prime is not very 430 pseudo-random for Mersenne primes (equal to 2 p−1 for some p), when k-mers are represented in base-4 encoding [48]. We therefore pick a slicing factor of 11, the smallest non-Mersenne prime 432 greater than our SNPmer sampling factor. The resulting k-mer subset has 86.0 million unpaired k-mers and 24.0 million SNPmer pairs, 434 each reduced as predicted, for a total factor of 9.8 reduction in k-mer nodes for the next step.

436
Each of the sampled unique k-mers and SNPmer pairs is a node in the k-mer graph. Nodes are connected when the corresponding k-mers appear consecutively in at least one read of the 438 input (any intervening k-mers having been skipped due to sampling or ambiguity). The relative orientation, distance and number of supporting reads of the k-mers is stored in the edge. When 440 conflicting distance or relative orientation is observed among different reads for the same pair of k-mer nodes, all edges from both nodes in the corresponding direction are ignored in contigging.

442
Robust edges for one direction from a node are defined as those supported by a supermajority of the reads containing the k-mer: the number of supporting reads is greater than or equal to both 444 (1) two plus the sum of the read counts for all other edges in that direction and (2) twice the read count of the next-most supported edge in the same direction. By this construction, a node has at 446 most one robust edge in each direction. two nodes it connects.
Contigs are the connected components of the subgraph consisting of mutually robust edges.
Singleton and circular contigs are reported for diagnostic purposes, but ignored in subsequent analysis. Each retained "k-mer contig" of To generate independent AMOS files for each Bambus run, the large graph of template links is partitioned into connected components. These components are grouped into batches for a rea-470 sonably load-balanced parallel computation with no template links between batches. The results are summarized as "initial scaffolds" in Table 1.

472
Consensus sequence representations for {k-mer} contigs and scaffolds were generated in two phases. In the first phase, sequence spanned by selected SNPmers and subset {k-mers} (see 474 sections and above) are joined together, separated by a number of Ns corresponding to the number of bases not spanned by {k-mers} in the subset. In the second phase, a single pass is 476 made through the read data set, and stretches of Ns that are spanned by single reads are replaced by the sequence of the read.

SNP phase and genotype inference
Each scaffold of the k-mer assembly constitutes a candidate marker for mapping. While the depth 480 of sequence coverage on each member of the mapping panel is too low ( about 1X ) to directly infer the genotype of individual members of the mapping pannel at individual SNPs, the tight link-482 age between SNPs within markers means that learning a sample's genotype at any one reveals it at the others, effectively amplifying the sequence coverage by a factor proportional to the num-484 ber of SNPs within the marker. This is the same principle exploited in genotype by sequencing (GBS) approaches to genetic mapping in the presence of reference genomes, for example in re-486 combinant inbred lines of reference rice strains [16], and in crosses between Drosophila species with sequenced genomes[17]. We use the same principle to genotype offspring in the context of 488 a cross between two outbred individuals, simultaneously inferring the phases of the SNPs (i.e. which bases appears on each of the four parental chromosomes in the cross).
While the data will be insufficient to infer genotypes at many markers, all those where confident inferences can be made can be used to build the linkage map.

492
For the purposes of genotype inference, a marker is treated as a collection of m SNPs (indexed in the following by i ∈ {1, 2, ..., m}), that have been inferred to be closely linked on the genome via 494 the k-mer assembly step. If the four parental chromosomes are labeled a, b in one parent and c, d in the other, then the genotyping problem is to infer which of the four possible segregation states 496 or genotypes ac, ad, bc, bd describes each sample at each marker locus. We index samples with j, and denote a sample genotype by g j . 498 We assume that markers are very small compared to a chromosome, and ignore the possibility of a recombination event within individual markers. The data used for inference of the offspring 500 genotypes consist of the number of reads from each barcoded sample j showing each of the four possible DNA bases b at each variable SNP position i, which we denote n b ij .

502
If the phase φ i of SNP i were known, i.e. which base is present in each of the four parental chromosomes, then a choice of genotype g j implies a specific homozygous or heterozygous state 504 s ij ∈ S = {AA, CC, T T, GG, AC, AT, AG, CT, CG, T G} for SNP i in sample j. For a given phase and genotype, the likelihood function for a given SNP position in a given sample is given by either 506 a binomial (for homozygous states) or trinomial probability mass function of the read counts, basecalling error rate , and the site genotype s ij : where n is the total number of reads at SNP i; m is the number of observations of bases not in σ ij (i.e. mismatches); k and l are the counts for each of the two bases of σ ij for heterozygous 510 sites, and = 2 /3.

512
Searching for an optimal choice of SNP phases φ i and sample genotypes g j is made difficult by the exponential size of the search space: for segregating bi-allelic SNP sites there are 14 possible 514 phases to consider at each SNP site, so for a mapping panel of only 20 siblings and a marker containing only 10 SNPs, there are 4 20 × 14 10 > 3 × 10 23 combinations to consider. In simulation 516 tests, we found that a variant of expectation maximization (EM), an iterative likelihood maximization method can accurately infer a large proportion of marker genotypes.

518
To initialize the iteration, the parental samples and a randomly selected offspring are, without loss of generality, assigned genotypes (a, b), (c, d) and (a, c). At each step, we calculate the condi-520 tional probability distributions over the possible SNP phases p(φ i ) given the genotype assignments according to: where we have labeled the chosen values for the genotypes g j at iteration t collectively by g (t) . n b iσ is the combined total number of observations of base b at polymorphic SNP i for all samples 524 included at iteration step t which have genotype s(φ i , g j ) = σ.
On each iteration until all samples have been included, a randomly selected sample is added to 526 the set after calculating p (t) (φ i ). Then the next set of genotype assignments g (t+1) are determined by choosing those that maximize the expected value of the log likelihood: These steps are repeated until genotypes are being selected for all samples, and the expected log likelihood stops increasing. At the end of the iteration, the likelihood-maximizing genotypes 530 are reported, along with the log likelihood difference between the best and second best choice of genotype for each sample, which provides an indicator of the confidence in genotype call. To 532 gauge convergence, this procedure is repeated 5 times for each marker, with different random choices of initial conditions. Markers which do not identify the same ML genotype multiple times 534 in independent runs are not included among the high confidence genotype calls.
Error model calibration 536 The sequence of 14 Ciona intestinalis autosomes were downloaded from Ensembl (www.ensembl.org). These 14 chromosomes were used as the template in our genome simulation. Based on their se- We wrote a perl script to simulate the genomes of offspring generated by the cross of the two 544 simulated parents. The software package dwgsim [55] was used to generate Illumina paired-end reads based on our simulated genomes of both parents and offsprings, with the coverage of 20X 546 and 5X respectively.
To estimate the frequency of incorrect genotype calls as a function of the log likelihood dif-548 ference between the called and alternative genotype, including contributions from uncertainty in SNP-mer identification, assembly, and sampling noise, we carried out a simulation of the k-mer 550 assembly and genotype inference protocols Among high-confidence genotype calls, the observed error frequency was a function of call confidence score was well-fit by a sum of two stretched 552 exponential functions, allowing assignment of error probabilities to individual genotype calls.
Linkage group construction 554 We use the linkage p-value p ab between pairs of map bins a and b defined as the minimum over the four possible relabelings r of the maternal and paternal chromosomes of the Binomial p-value 556 for the number of matching genotypes: where n is the total number of sample genotype calls (68 in the present case, or 34 in each 558 parent) and m r is the number of matching genotypes under relabeling r.
We identified map bins with segregation patterns indicating either inconsistent placement in 560 the maternal and paternal maps or genotyping error with a double threshold procedure as follows: This procedure identifies map bins which alone account for the merging of what would otherwise be two distinct partitions. We used the following pairs of thresholds p 1 , p 2 to identify a total 568 of 20 map bins for exclusion from the map: 10 −7 ,10 −6 ; 10 −8 ,10 −7 ; 10 −9 ,10 −8 . The remaining markers form locally consistent linkage groups in which all linkages defined at threshold p 1 are 570 corroborated by multiple linkages at p 2 , for the above values of p 1 and p 2 .

572
Markers were ordered within each linkage group using the following protocol. Within each linkage group a consistent labeling of the four parental chromosomes was achieved by constructing 574 a graph G in which nodes correspond to map bins and edges are weighted by linkage p-value p ab (as defined above). The local chromosome labels are updated at each map bin as it is reached in 576 a traversal of the minimum spanning tree of G to the labeling r that maximizes p ab along the incident of G used in the traversal. Markers within each linkage group were clustered by hierarchical 578 clustering (marker-marker distance metric: cosine of the angle between the vectors of recombination distances to the other map bins; distance updating method: average linkage) into a binary 580 tree data structure with leaves representing map bins. A node in the right subtree of the root node was rotated, interchanging its left and right subtrees if its left subtree was not already closer (in 582 average recombination distance) to the markers of the left subtree of the global root; and similarly for nodes in the left subtree of the root. An in-order traversal of the tree generates an ordering of 584 the markers. Finally three reversals of the order of markers in segments of the map were added based on visual inspection of the recombination distance matrix. In the final marker ordering, 51% 586 of adjacent map bin pairs are separated by a single recombination event in the cross, and 94% are separated by three or fewer recombinants in each parent.

Placement of markers on the map
To anchor additional markers to the map, we computed the p ab (see above) between marker a to 590 be placed on the map and each map bin b. Marker a is anchored to the map at the position of the bin b which minimizes p ab if p ab < 10 −6 .

SNP density estimation
Illumina reads were mapped to the assembled scaffold sequences with stampy [57] using default 594 settings. For a sample of 9,228 scaffolds with lengths ranging from 5.0-5.5 kb, sequence variants were called with SAMtools [58] using a variant quality score threshold of 50, and ignoring indel

Estimation of local recombination rate
To estimate the local recombination rate for each map bin, we computed the linear regression of 602 map distance in number of markers on physical distance using up to 10 neighboring map bins in each direction along the map (or fewer for bins within 10 map bins of the end of the linkage 604 group). Map distance was calculated from recombination fraction using Haldane's map distance − 1 2 log (1 − 2r) [60].

Ancestral linkage group conservation
To compare the genome organization in L. polyphemus to the ancestral metazoan ALGs, we used 608 the reciprocal best blast hit (RBH) orthology criterion in an alignment of the Ixodes scapularis predicted proteins [32] to the consensus sequences for the marker scaffolds. L. polyphemus scaf-610 folds with RBH of e-value <= 10 −6 were assigned to the same ancestral bilaterian gene orthology group as their I. scapularis ortholog, and thereby with human genes. Regions of the map were 612 tested for enrichment in genes from particular ancestral linkage groups with Fisher's Exact Test, and breakpoints in ancestral linkage group composition were identified using a hidden Markov 614 model, as previously described [6,7].
Homeobox gene modeling 616 We identified 155 marker scaffolds with a tblastx alignment of e-value < 10 −6 to a set of chelicerate homeobox gene sequences downloaded from Genbank using the NCBI online query in-

638
A multiple sequence alignment of the predicted homeobox sequences combined with a collection of representative sequences from various classes of homeobox genes was constructed with 640 muscle3.8. 31[63] using default settings. The resulting alignment was trimmed to a 63 amino acid segment spanning the conserved homeodomain, and sequences with more than 50% gaps were removed, leaving 93 predicted L. polyphemus homeobox genes in the analysis. Bayesian phyloge-K a and K s estimation for paralogs and T. tridentatus orthologs Figure 8 shows the distribution across the map of pairs of candidate paralogs.

666
To estimate the synonymous sequence divergence between pairs of candidate L. polyphemus paralogous gene pairs, and between L. polyphemus genes and their T. tridentatus orthologs, we 668 followed the following protocol.
1. Reassemble reads from each marker with PHRAP[61], and create a predicted coding se-670 quence using exonerate, as described for the annotation of homeobox gene models (see above).  Figure 9 shows the distributions of K a and K s for paralogs and T. tridentatus orthologs. To estimate the number and age of peaks in the un-saturated range [33] of the K s distribution (and 692 of putative WGD events), we fit a series of univariate normal mixture models, with 1, 2, 3, and 4 components to the paralog K s distribution in the range 0 < K s < 2.5 and selected the best model 694 on the basis of Bayesian Information Criterion (Table 3). The best model had two components, with means at 0.7 and 1.45 substitutions per site. The position of the peak at lowest K s was 696 not sensitive to the addition of more mixture components. Figure 10 shows a comparison of the distribution and the components of the best-fitting model. Gaussian mixture models were 698 estimated in R with mixtools [67].

700
The raw sequencing reads are currently being submitted through the NCBI SRA and are accessible via NCBI BioProject accession PRJNA187356. LG 1 LG 2 LG 3 LG 4 LG 5 LG 6 LG 7 LG 8 LG 9 LG 10 LG 11 LG 12 LG 13 LG 14 LG 15 LG 16 LG 17 LG 18 LG 19 LG 20 LG 21 LG 22 LG 23 LG 24 LG 25 LG 26 LG 27 LG 28 LG 29 LG 30 LG 31 LG 32    3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31 Figure 6: Unrooted phylogenetic tree of homeobox sequences (part 1). Nodes are labeled with Bayesian posterior probabilities. Highly supported partitions used to classify L. polyphemus sequences are drawn in read, with the abbreviation for the class shown in large letters. L. polyphemus homeobox sequences not grouped into one of these highly supported partitions are assigned to class "?". For ease of display, a large subtree consisting of HOX and parahox genes has been pruned at the position labeled "HOX", and is shown in Figure 7.  Figure 7: Phylogenetic tree of homeobox sequences, part 2. The rooted subtree pruned from the tree in Figure 6. Nodes are labeled with Bayesian posterior probabilities. Highly supported partitions used to classify L. polyphemus sequences are drawn in read, with the abbreviation for the class shown in large letters. L. polyphemus homeobox sequences not grouped into one of these highly supported partitions are assigned to class "hox?".