Northern Spotted Owl (Strix occidentalis caurina) Genome: Divergence with the Barred Owl (Strix varia) and Characterization of Light-Associated Genes

Abstract We report here the assembly of a northern spotted owl (Strix occidentalis caurina) genome. We generated Illumina paired-end sequence data at 90× coverage using nine libraries with insert lengths ranging from ∼250 to 9,600 nt and read lengths from 100 to 375 nt. The genome assembly is comprised of 8,108 scaffolds totaling 1.26 × 109 nt in length with an N50 length of 3.98 × 106 nt. We calculated the genome-wide fixation index (FST) of S. o. caurina with the closely related barred owl (Strix varia) as 0.819. We examined 19 genes that encode proteins with light-dependent functions in our genome assembly as well as in that of the barn owl (Tyto alba). We present genomic evidence for loss of three of these in S. o. caurina and four in T. alba. We suggest that most light-associated gene functions have been maintained in owls and their loss has not proceeded to the same extent as in other dim-light-adapted vertebrates.


Nextera350nt library
1.1. 1 We intended this library to be a Nextera-sheared library with a small insert size. We isolated DNA using a Gentra Puregene Kit (Qiagen) following the protocol entitled "Protocol: DNA Purification from Tissue Using the Gentra Puregene Tissue Kit" (Qiagen). We used 50 ng of the DNA to prepare a genomic library using a Nextera DNA Sample Prep Kit (Illumina-compatible) (Epicentre). After tagmentation, we cleaned the reaction with a DNA Clean & Concentrator -5 kit (Zymo Research). We amplified the reaction for 5 cycles of PCR using a Nextera DNA Sample Prep Kit (Illuminacompatible) (Epicentre) and the Nextera PCR Enzyme (Epicentre). We then cleaned the reaction with a DNA Clean & Concentrator -5 kit (Zymo Research). We used a LabChip XT DNA 750 Assay Kit on a LabChip XT (PerkinElmer) automated nucleic acid fractionation system to select library fragments in the size range of 375-600 nt, which, after subtracting the 141 nt of adapters, corresponds to an average fragment size of 346.5 nt. We performed a final PCR using 5 µL Klentaq LA 10X Buffer with MgCl (Sigma-Aldrich), 1 µL 12.5 µM dNTPs, 1 µL each of two Illumina-adapter-compatible primers at 10 µM, 1 µL KlenTaq LA DNA Polymerase Mix (Sigma-Aldrich), 5 µL library off of LabChip, and water to make a 50 µL reaction volume. We ran the PCR at 94°C for 2 min; then 5 cycles of denaturation at 94°C for 30 s, annealing at 58°C for 30 s, and extension at 72°C for 3 min; and we performed a final extension at 72°C for 5 min. We removed the PCR products after the final extension and and then cleaned them using a DNA Clean & Concentrator -5 kit (Zymo Research). We obtained one lane of 100 nt paired-end data using a TruSeq PE Cluster Kit v2-cBot-HS kit and a TruSeq SBS v2-HS kit on a HiSeq 2000 (Illumina) and a second lane of 100 nt paired-end data using a TruSeq PE Cluster Kit v3-cBot-HS kit and a TruSeq SBS v3-HS kit on a HiSeq 2000 (Illumina).

Nextera700nt library
1.2. 1 We attempted to construct a Nextera-sheared library with a moderate insert size. We isolated DNA using a Gentra Puregene Kit (Qiagen) and used 50 ng to prepare a genomic library using a Nextera DNA Sample Preparation Kit (Illumina). After tagmentation, we cleaned the reaction with a DNA Clean & Concentrator -5 kit (Zymo Research). We amplified the reaction for 5 cycles of PCR using a KAPA Library Amplification kit (KAPA Biosystems) and then cleaned the reaction with a DNA Clean & Concentrator -5 kit (Zymo Research). We used a BluePippin (Sage Science) to select library fragments in the size range of 734-934 nt, which, after subtracting the 134 nt of adapters, corresponded to selecting an average insert size of 700 nt. We performed a real-time PCR (rtPCR) using a KAPA Real-Time Library Amplification Kit (KAPA Biosystems) on a CFX96 Touch Real-Time PCR Detection System (Bio-Rad) to amplify the library. We amplified the library with 6 cycles PCR and then cleaned the PCR products with a DNA Clean & Concentrator -5 kit (Zymo Research). We lastly assessed the library fragment size distribution with a 2100 BioAnalyzer (Agilent Technologies) and the concentration of double-stranded DNA material with a Qubit 2.0 Flurometer (Invitrogen). We obtained one lane of 150 nt paired-end data sequenced on a HiSeq 2500 (Illumina) in rapid mode.

Nextera550nt library
1.3. 1 We aimed to construct a Nextera-sheared library with overlapping reads, which could be merged into long fragments. We isolated DNA using a Gentra Puregene Kit (Qiagen) and used 50 ng to prepare a genomic library using a Nextera DNA Sample Preparation Kit (Illumina). After tagmentation, we cleaned the reaction with a DNA Clean & Concentrator -5 kit (Zymo Research). We amplified the reaction for 5 cycles of PCR using a KAPA Library Amplification kit (KAPA Biosystems) and then cleaned the reaction with a DNA Clean & Concentrator -5 kit (Zymo Research). We then used a BluePippin (Sage Science) to select library fragments in the size range of 634-709 nt, which, after subtracting the 134 nt of adapters, corresponded to selecting an average insert size of 537.5 nt. We assessed the library fragment size distribution with a 2100 BioAnalyzer (Agilent Technologies). We cleaned the size-selected product with 0.6X Agencourt AMPure XP (Beckman Coulter) magnetic beads to remove adapter dimer of approximately 250 nt in size. We then performed a real-time PCR (rtPCR) using a KAPA Real-Time Library Amplification Kit (KAPA Biosystems) on a CFX96 Touch Real-Time PCR Detection System (Bio-Rad) to amplify the library. We amplified the library with 8 cycles PCR and then cleaned the PCR products with a DNA Clean & Concentrator -5 kit (Zymo Research). We lastly assessed the library fragment size distribution with a 2100 BioAnalyzer (Agilent Technologies) and the concentration of double-stranded DNA material with a Qubit 2.0 Flurometer (Invitrogen). We obtained one lane of 300 nt pairedend data sequenced using a MiSeq Reagent Kit v3 on a MiSeq (Illumina). We obtained a second lane of 375 nt read 1 and 225 nt read 2 for a total of 600 nt of paired-end read data sequenced using a MiSeq Reagent Kit v3 on a MiSeq (Illumina).
We sheared 4,460 ng genomic DNA in 130 µL in a microTUBE AFA Fiber Pre-Slit Snap-Cap tube (Covaris) using a M220 focused-ultrasonicator (Covaris) targeting 550 nt as the center of the fragment distribution. We used peak incident power 50 W, 20% duty factor, 200 cycles per burst, and 45 s treatment time at 20°C. We then removed small fragments and concentrated the sheared material using a DNA Clean & Concentrator -5 kit (Zymo Research). We next constructed a genomic library by using a TruSeq DNA PCR-Free Library kit (Illumina) and following the manufacturer's protocol, including the use of bead-based size selection to remove large and small DNA fragments in succession to target a mean fragment size of 550 nt. We assessed the concentration of doublestranded DNA material in the final library with a Qubit 2.0 Flurometer (Invitrogen).

900ntPCR library
1.5.1 We extracted genomic DNA from blood using a DNeasy Blood & Tissue Kit (Qiagen).
We sheared 4,580 ng genomic DNA in 130 µL in a microTUBE AFA Fiber Pre-Slit Snap-Cap tube (Covaris) using a M220 focused-ultrasonicator (Covaris) targeting 900 nt as the center of the fragment distribution. We used peak incident power 50 W, 5% duty factor, 200 cycles per burst, and 70 s treatment time at 20°C. We then removed small fragments and concentrated the sheared material using a DNA Clean & Concentrator -5 kit (Zymo Research). We next constructed a genomic library by using a TruSeq DNA PCR-Free Library kit (Illumina) and following the manufacturer's protocol, except that we only performed a bead-based size selection to remove small fragments and not large fragments. We used a 0.45X bead to sample ratio in order to eliminate fragments smaller than approximately 700 nt. Following A-tailing and prior to adapter ligation, we took 10% of the sample (by volume) and separated it from the noPCR aliquot for use in a PCR-amplified library. We ligated adapters to these two aliquots separately and cleaned the finished ligations with a DNA Clean & Concentrator -5 kit (Zymo Research). We then only went forward with the aliquot for use in a PCR-amplified library. We used a BluePippin (Sage Science) to select library fragments in the size range of 800-1100 nt, which, after subtracting the 121 nt of adapters, corresponded to selecting an average insert size of 829 nt. We next cleaned the eluted material with a DNA Clean & Concentrator -5 kit (Zymo Research) and then performed real-time PCR (rtPCR) using a KAPA Real-Time Library Amplification Kit (KAPA Biosystems) on a CFX96 Touch Real-Time PCR Detection System (Bio-Rad) to amplify the library. We amplified the library with 11 cycles PCR and then cleaned the PCR products with 1X Agencourt AMPure XP (Beckman Coulter) magnetic beads. We lastly assessed the library fragment size distribution with a 2100 BioAnalyzer (Agilent Technologies) and the concentration of double-stranded DNA material with a Qubit 2.0 Flurometer (Invitrogen).

Hydroshear library
1.6.1 We isolated DNA using a Gentra Puregene Kit (Qiagen) and used a Hydroshear DNA Shearing Device (GeneMachines) to shear 25 µg in DNA in 100 µL volume with 30 cycles of shearing using speed code 3. We checked the sheared DNA on a 1% agarose gel and saw that fragments had been sheared between 400-1000 nt. We additionally mechanically sheared the DNA by performing 15 passes through a 28 gauge x 1/2 inch needle attached to a 1 cc U-100 Insulin Syringe (Becton, Dickinson and Company). We performed end-repair using 4266 ng sheared DNA in an End-It DNA End-Repair Kit (Epicentre). We incubated the reaction at room temperature for 45 minutes and then inactivated the enzyme by heating to 72°C for 10 minutes followed by cleaning with a DNA Clean & Concentrator -5 kit (Zymo Research). We then added 3' A tails in a reaction with 2 µL 10X NEBuffer 2, 0.5 µL 100 mM dATP (Invitrogen), 1 µL Klenow Fragment (3'→5' exo-) (NEB), and 16.5 µL cleaned end-repaired product. We incubated for 45 min at 37°C and then 20 min at 75°C to inactivate the enzyme. We cleaned the reaction with a DNA Clean & Concentrator -5 kit (Zymo Research). We then ligated Illumina-compatible adapters using 1 µL 10X Fast-Link Ligation Buffer (Epicentre), 1 µL 10 mM ATP (Epicentre), 5 µL of end-repaired DNA (0.7835 µg), 2 µL of annealed Illumina-compatible adapters at 10 µM (Integrated DNA Technologies), and 1 µL Fast-Link DNA Ligase (Epicentre) for 10 µL total reaction volume. We incubated the ligation reaction overnight at 16°C and then used 1.5X Agencourt AMPure XP (Beckman Coulter) magnetic beads to clean the ligase reaction and remove any extra adapters. We performed a PCR using 10 µL Klentaq LA 10X Buffer with MgCl (Sigma-Aldrich), 2 µL 12.5 µM dNTPs, 2 µL each of two Illumina-adapter-compatible primers at 10 µM, 2 µL KlenTaq LA DNA Polymerase Mix (Sigma-Aldrich), half of the cleaned ligase reaction in 10 µL, and water to make a 100 µL reaction volume. We ran the PCR in two 50 µL aliquots at 94°C for 5 min; then 2 cycles of denaturation at 94°C for 30 s, annealing at 55°C for 30 s, and extension at 68°C for 3 min; and we performed a final extension at 68°C for 5 min. We removed the PCR products after the final extension and and then cleaned them using a DNA Clean & Concentrator -5 kit (Zymo Research). We used a LabChip XT DNA 750 Assay Kit on a LabChip XT (PerkinElmer) automated nucleic acid fractionation system to select library fragments in the size range of 600-700 nt. We performed a final PCR using 5 µL Klentaq LA 10X Buffer with MgCl (Sigma-Aldrich), 1 µL 12.5 µM dNTPs, 1 µL each of two Illumina-adapter-compatible primers at 10 µM, 1 µL KlenTaq LA DNA Polymerase Mix (Sigma-Aldrich), 5 µL library off of LabChip, and water to make a 50 µL reaction volume. We ran the PCR at 94°C for 2 min; then 17 cycles of denaturation at 94°C for 30 s, annealing at 58°C for 30 s, and extension at 72°C for 3 min; and we performed a final extension at 72°C for 5 min. We removed the PCR products after the final extension and and then cleaned them using a DNA Clean & Concentrator -5 kit (Zymo Research). We next assessed the library fragment size distribution with a 2100 BioAnalyzer (Agilent Technologies) and the concentration of double-stranded DNA material with a Qubit 2.0 Flurometer (Invitrogen).  (http://www.boost.org). When running NxTrim, we used the "--preserve-mp" flag to prefer mate pair reads as output even if paired-end reads would be longer. NxTrim utilizes the position of the junction identifier sequence in Nextera mate-pair data to classify reads of mate pair libraries as true mate pair reads, paired-end reads, or singleton reads.
1.9.2 We trimmed adapters and low quality bases separately for the resulting mate-pair data, paired-end reads, and singleton reads using Trimmomatic version 0.32 (Bolger et al. 2014). We trimmed adapters using options "ILLUMINACLIP:<fasta of Illumina adapter sequences >:2:30:10". We removed low quality bases from the beginning and end of the reads using the following options: LEADING:3 TRAILING:3 to remove bases below Phred 3. We trimmed off low quality sequence portions using: SLIDINGWINDOW:4:17, which trimmed the read when the average quality over 4 basepairs dropped below Phred 17. Finally, we trimmed reads less than 36 basepairs in length using "MINLEN:36".
1.10 Trimming -short-insert paired-end data 1.10.1 We first trimmed adapters from all non-mate-pair libraries using Trimmomatic version 0.32 (Bolger et al. 2014). We used the ILLUMINACLIP function with the following options: <fasta of Illumina adapter sequences >:2:30:10.
1.10.2 Since substantial portions of the paired-end reads from all of the libraries, except the Nextera700nt library were overlapping, we joined overlapping paired reads using the BBMerge tool in the BBMap tool suite version 34.00 (Bushnell 2014). We merged overlapping reads using the options "minoverlapinsert=110 mininsert=110 strict=t" for the datasets Nextera350nt lane 1 and Nextera350nt lane 2, We used the options "minoverlapinsert=400 mininsert=400 strict=t" for the datasets Nextera550nt lane 1, Nextera550nt lane 2, noPCR550nt, and PCR900nt, which had longer read lengths. 1.10.3 We then performed quality trimming using Trimmomatic version 0.32 (Bolger et al. 2014). We removed low quality bases from the beginning and end of the reads using the options "LEADING:3 TRAILING:3" to remove bases below Phred 3. We trimmed off low quality sequence portions using "SLIDINGWINDOW:4:17", which trimmed the read when the average quality over 4 basepairs dropped below Phred 17. Finally, we trimmed reads less than 36 basepairs in length using "MINLEN:36".

Error-correction
1.11.1 Since we trimmed using a moderately low quality threshold, we used the k-mer-based error corrector in the SOAPdenovo2 toolkit, SOAPec version 2.01 (Luo et al. 2012), to correct sequence errors. We first used the KmerFreq_HA tool to create a k-mer frequency spectrum with default options except "-k 27 -L 600", which indicate that we used a k-mer size of 27 for creating the frequency spectrum and the maximum read length was 600 nt.
We then used the Corrector_HA tool along with the k-mer frequency spectrum that we created to correct all of our trimmed reads using default options except "-k 27 -r 36", which indicate that we used a k-mer size of 27 for the error correction and kept trimmed reads as short as 36 nt.
1.12 Single-end data 1.12.1 In each stage of the trimming, merging, and error-correction process, some reads previously paired became unpaired due to the loss of their paired read in a trimming step.
We handled the single-end reads separate from the paired reads and subjected them to the same adapter, quality trimming, and error-correcting steps as the reads that remained paired. We used all of these single read sets in the final assembly.
1.13 Read processing variation for some preliminary assemblies 1.13.1 For a trim level of an average Phred 7 or 28, the only difference from the methodology described above was that we trimmed off low quality sequence portions using Trimmomatic with the parameter "SLIDINGWINDOW:4:7" or "SLIDINGWINDOW:4:28", respectively.
1.13.2 We did not apply the error-correction process to reads trimmed to an average Phred 28.
1.13.3 For some preliminary assemblies, we did not merge overlapping paired-end reads. This entailed leaving out the BBMerge step described above, but still performing adapter and quality trimming as noted.
1.14 Genome size 1.14.1 Genome size data estimated from flow cytometry measurement of red blood cells exist for two S. occidentalis congeners of, S. aluco and S. nebulosa. Strix aluco has a C-value of 1.59 pg (De Vita et al. 1994), which is approximately 1.56 Gnt (Doležel et al. 2003).
Strix nebulosa has a C-value of 1.65 pg (Vinogradov 2005), which is approximately 1.61 Gnt (Doležel et al. 2003). BamTools version 2.4.0 (Barnett et al. 2011(Barnett et al. , 2015 requiring CMake version 3.2.3 (Hoffman & Martin 2003;Kitware 2015), and on the 150 nt paired-end reads from the Nextera700nt dataset to estimate the genome size. Preqc estimated the genome size by sampling 20,000 reads and counting the frequency of k-mers of length 31 nt while applying a correction for sequencing errors.

Assembly
1.15.1 We used SOAPdenovo2 version 2.04 (Luo et al. 2012) to assemble the genome. We performed numerous trial runs experimenting with different k-mer values and parameters.
We utilized the insert size estimated in the output of initial, trial assemblies to refine our estimation of the insert sizes for our libraries and used these refined values as input into subsequent assembly configuration files (Table S1). We settled on using the default parameters other than the options "SOAPdenovo-127mer all -N 1500000000 -K 23 -m 127 -k 65 -d 1 -R -F". These options indicate that we used the 127 k-mer version of the assembler and ran the assembly using multiple k-mer sizes starting at 23 and ending with a maximum of 127, we gave an estimated genome size of 1.5 Gnt, we allowed reads as small as 65 nt to map to contigs during scaffolding, we ignored singleton k-mers, we tried to resolve repeats with reads, and we attempted to fill gaps in scaffolds.
1.15.2 In our configuration files for all of the preliminary assemblies, we used the default minimum alignment lengths between a read and contig (32 for paired-end reads, 35 for mate-pair reads) and the default minimum pair number cutoffs (3 for paired-end reads, 5 for mate-pair reads). (Henderson & Hanna 2016a), which utilized the first and last 21 nt of each read as a read fingerprint, to check for sequence duplication in each sequenced library.

Preliminary assembly assessment
1.16.1 In order to compare our preliminary assemblies, we removed contigs / scaffolds <= 300 nt in order to remove any unassembled reads from the assembly. We calculated the contig and scaffold N50 as well as the number of scaffolds in various length classes using scafN50 (Henderson & Hanna 2016d). We calculated the total length of the assembly, the % Ns, and the total number of scaffolds using scafSeqContigInfo (Henderson & Hanna 2016e). We were conservative and separated scaffolds into contigs at each N in the sequence, which is the default option for scafSeqContigInfo (Henderson & Hanna 2016e (Guigó 1998;Blanco et al. 2011), and NCBI's BLAST+ version 2.2.25 (Altschul et al. 1997;Camacho et al. 2009), to annotate a set of highly conserved eukaryotic genes in our assembly and thereby obtain an assessment of the quality and completeness of each assembly. In order to install CEGMA's GeneWise dependency, we followed the source code modification recommendations documented by Markus Grohme (http://korflab.ucdavis.edu/datasets/cegma/ubuntu_instructions_1.txt) and the Homebrew Science GeneWise formula (https://github.com/Homebrew/homebrew-science/blob/master/genewise.rb).

Determination of final assembly
1.17. 1 We examined multiple statistics in choosing our final assembly. We valued high contig and scaffold N50 values, low % Ns in the sequence, a low total number of scaffolds, larger numbers of long scaffolds, and completeness as reflected in the number of conserved genes found by the CEGMA pipeline. We decided that the assembly that had the best statistics across these categories was assembly 4 (Table 2) and we went forward with this assembly as our final assembly.

Gap closing
1.18.1 We found that using the "-F" flag to fill gaps using the SOAPdenovo2 version 2.04 (Luo et al. 2012) de novo assembler was ineffective at gap filling during the assembly. We then filled gaps using the gap closing tool in the SOAPdenovo2 toolkit, GapCloser version 1.12-r6 (Luo et al. 2012), with the default options other than "-l 600" to specify that our longest read length was 600 nt. The program output a warning stating that the maximum supported read length was 155 nt and that it would use that setting for the analysis. We assumed that the program just used the first 155 nt of reads with a total length exceeding 155 nt.
1.18.2 The gap-closed assembly contained many contigs and/or scaffolds under 1000 nt in length, a substantial portion of which appeared to be unassembled reads. We used ScaffSplitN50s (Henderson & Hanna 2016c) to compare the continuity statistics resulting after removing contigs / scaffolds of lengths 300, 500, and 1,000 nt as well as when using N blocks of lengths 1, 5, 10, 15, 20, and 25 to separate contigs within scaffolds. Based on these results, we removed all contigs and scaffolds less than 1000 nt for downstream analyses.  (Guigó 1998;Blanco et al. 2011), and NCBI's BLAST+ version 2.3.0 (Altschul et al. 1997;Camacho et al. 2009), to annotate a set of highly conserved eukaryotic genes in our assembly and thereby obtain an assessment of the quality and completeness of the assembly. We ran CEGMA with default parameters other than specifying "--vrt" to optimize the searches for a vertebrate genome.

Final assembly stats
1.19.2 We used BUSCO version 1.1b1 (Simão et al. 2015a;Simão et al. 2015b), which used NCBI's BLAST+ version 2.2.28 (Altschul et al. 1997;Camacho et al. 2009), HMMER version 3.1b2 (http://hmmer.org), and AUGUSTUS version 3.2.1 (Keller et al. 2011;Stanke 2015) to assess the assembly quality by searching for conserved orthologs. We ran BUSCO with default genome mode parameters other than specifying "vertebrata" as the evolutionary lineage with the option "-l" and using "-sp chicken" to employ the AUGUSTUS parameters optimized for the chicken genome.

Contamination assessment
1.20. 1 We performed a local alignment of all scaffolds in NSO-wgs-v0 to a copy the NCBI nucleotide database (nt) that we downloaded on 24 June 2016 (Clark et al. 2016; NCBI Resource Coordinators 2016) using NCBI's BLAST+ version 2.3.0 tool BLASTN (Altschul et al. 1997;Camacho et al. 2009) with default parameters other than "-outfmt 10 -num_alignments 5 -max_hsps 1". We used these parameters to limit to 5 the maximum number of alignments to unique subjects output and to limit to 1 the number of outputted alignments per subject. This allowed us to examine the top 5 alignments to different subject sequences and ascertain whether those subject sequences were obtained from vertebrate or non-vertebrate organisms.
1.20.2 In order to parse the taxonomy of the subject sequences in the alignment output, we obtained the a local copy of the NCBI taxonomy database using NCBI's BLAST+ version 2.3.0 script, update_blastdb.pl with the parameters "--passive --timeout 300 -force --verbose taxdb". We also downloaded the files taxdump.tar.gz and gi_taxid_nucl.dmp.gz from NCBI (ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy) (Clark et al. 2016; NCBI Resource Coordinators 2016). We then used GItaxidIsVert (Henderson & Hanna 2016b) with default options other than using the parameter "-n" to filter the alignment output for non-vertebrate alignments.
1.20.3 We used the web version of NCBI's BLAST+ version 2.4.0 tool BLASTN (Altschul et al. 1997;Camacho et al. 2009) with default parameters.

Mitochondrial genome identification
1.21. 1 We searched NSO-wgs-v1 (not repeat-masked, all contigs / scaffolds < 1,000 nt removed, contaminant scaffolds removed) for any of the contigs / scaffolds that were assemblies of the mitochondrial genome, rather than the nuclear genome using NCBI's BLAST+ version 2.4.0 tool BLASTN (Altschul et al. 1997;Camacho et al. 2009) with default parameters other than "-outfmt 6".
1.22.2 We used the Geneious version 9.1.4 aligner through the "map to reference" function (Kearse et al. 2012;Biomatters 2016a) with default options to align primers 2550F and 2718R (Fridolfsson & Ellegren 1999) to the scaffolds and then extract the region bounded by the aligned primers.

Repeat annotation
1.23. 1 We performed a homology-based repeat annotation of the genome assembly using RepeatMasker version 4.0.5 (Smit et al. 2013), which employs the repeat databases of the DFAM library version 1.3 (Wheeler et al. 2013) and the Repbase-derived RepeatMasker libraries version 20140131 (Jurka 1998(Jurka , 2000Jurka et al. 2005;Bao et al. 2015). Our installation of the RepeatMasker tool utilized NCBI's BLAST+ version 2.2.30 (Altschul et al. 1997;Camacho et al. 2009) and RMBlast version 2.2.28 (Smit et al. 2015) sequence search engines as well as the tandem repeats finder (TRF) version 4.0.7b (Benson 1999(Benson , 2012. We ran RepeatMasker with default options other than parameters "-gccalc -nolow -species aves". The purpose of this run was to produce a masked genome without masking of low complexity regions or simple repeats, which we could then use for downstream annotation steps. 1.23.2 We performed a de novo modeling of the repeat elements in the genome using RepeatModeler version 1.0.8 (Smit & Hubley 2015), which uses two de novo repeat finders, RECON version 1.08 (Bao & Eddy 2002) and RepeatScout version 1.0.5 (Price et al. 2005), as well as the tandem repeats finder (TRF) version 4.0.7b (Benson 1999(Benson , 2012, the RMBlast version 2.2.28 (Smit et al. 2015) (Jurka 1998(Jurka , 2000Jurka et al. 2005;Bao et al. 2015). We built a sequence database from our genome and ran RepeatModeler with default options. 1.23.3 We further masked the genome by running RepeatMasker again with the masked genome as input, using the repeat database created by our RepeatModeler run, and with default options other than parameters "-gccalc -nolow".
1.23.4 We performed homology-based repeat masking using RepeatMasker as above with default options other than parameters "-gccalc -species aves". We then performed a second run of RepeatMasker using the repeat database created by our RepeatModeler run with the masked genome as input and using default options other than parameters "gccalc -nolow". Our output was a second twice-masked genome with masked low complexity regions and simple repeats.

Gene annotation
1.24.1 We used the MAKER accessory script, cegma2zff, to convert the GFF file output from our CEGMA run on the GapClosed assembly into ZFF format to use in training of the gene prediction tool Semi-HMM-based Nucleic Acid Parser (SNAP) version 2006-07-28 (Korf 2004). We used the fathom tool of the SNAP package with the parameters "categorize 1000", followed by fathom with the parameters "-export 1000", then the forge element of the SNAP package, then the hmm-assembler.pl script from the SNAP package to convert the ZFF files to an HMM file, which was then the newly trained gene finder that we provided SNAP in the MAKER configuration file (Campbell et al. 2014 (Altschul et al. 1997;Camacho et al. 2009) tool "makeblastdb" with default parameters other than the options "-input_type fasta -dbtype prot". We then compared the combined MAKER protein fasta file to the Swiss-Prot UniProt database using the BLAST 2.2.31+ tool "blastp" with default parameters other than the options "-evalue .000001 -outfmt 6 -num_alignments 1 -seg yes -soft_masking true -lcase_masking -max_hsps 1". We then used the MAKER accessory script "maker_functional_gff" to add the protein homology data to the combined MAKER GFF3 file and the MAKER accessory script "maker_functional_fasta" to add the protein homology data to the combined MAKER protein and transcript fasta files. 1.24.5 In order to identify proteins with known functional domains, we ran InterProScan version 5.18-57.0 (Jones et al. 2014) with options "-appl PfamA -iprlookup -goterms -f tsv", which limited searches to Pfam, a database of protein family domains, on the protein sequences generated by MAKER. We then used the MAKER accessory script "ipr_update_gff" to update the MAKER-generated GFF3 file with the results of the InterProScan run and add information on protein family domain matches.
1.24.6 We then filtered transcripts with an Annotation Edit Distance (AED) less than 1 and/or a match to a Pfam domain using the option "-s" in the script "quality_filter.pl" supplied in MAKER version 3.00.0 (Cantarel et al. 2008).
1.24.7 We used the "stat" tool of GenomeTools version 1.5.1 (Gremme et al. 2013) to calculate annotation summary statistics, including distributions of gene lengths, exon lengths, number of exons per gene, and coding DNA sequence (CDS) lengths (measured in amino acids). We also used the "stat" tool of GenomeTools with the options "-addintrons" and "-intronlengthdistri" to infer intron lengths within the annotated gene boundaries and calculate the distribution of intron lengths.
1.27.8 We then calculated the unfiltered allele depth (the number of reads that supported an allele) summed across all of the alleles at each of the remaining variant sites using the following GNU cut version 8.21 (Ihnat et al. 2013)  respectively. We used calc-pi-exclude.sh from NSO-genome-scripts version 1.0.0 (Hanna 2012) to calculate H b for S. o. caurina and S. varia. In order to report H w and H b in terms of the number of nucleotide differences per site within the sample, we divided the output from the scripts above by the number of ACGT characters in NSO-wgs-v1-nuc (the whole-genome assembly without the contaminant or mitochondrial scaffolds), which we obtained using "assemblathon-stats-ex.pl" from NSO-genome-scripts (Bradnam et al. 2013;Hanna & Henderson 2017 1.28.2 After variant calling, we used the PSMC script "fq2psmcfa" next with the command "fq2psmcfa -q20 variants.fq.gz >variants.psmcfa". We then ran PSMC with the command "psmc -N25 -t15 -r5 -p "4+25*2+4+6" -o variants.psmc variants.psmcfa". We next ran the PSMC scripts "psmc2history.pl" and "history2ms.pl" with the command "psmc2history.pl variants.psmc | history2ms.pl > variants.psmc_ms-cmd.sh".

Light-associated gene analyses
1.29. 1 We searched in NSO-wgs-v1 for regions orthologous to probes for 19 genes that encode proteins with light-associated functions using Geneious version 9.1.6 (Biomatters 2016b; Kearse et al. 2012) and the included version of the NCBI BLAST+ BLASTn tool (Zhang et al. 2000) with default options. On 1-10 November, 2016, we used the web version of NCBI BLAST+ version 2.5.0 (Zhang et al. 2000) with discontiguous megablast options to align the probes against sequences in the NCBI Whole-Genome-Shotgun (WGS) contigs database limited by specifying the organism T. alba (taxid:56313).
1.29.2 When BLAST searches were unsuccessful, we used synteny data from Ensembl (version 86; (Yates et al. 2016) to search for evidence of whole gene deletion. We identified genes flanking the gene of interest in related taxa, and subsequently used BLAST to align the reference sequences for these genes against the S. o. caurina and T. alba genome assemblies. We imported the S. o. caurina genome assembly into Geneious version 9.1.6 (Biomatters 2016b; Kearse et al. 2012) and used the included version of the NCBI BLAST+ BLASTn tool (Zhang et al. 2000) to search for the flanking genes in our assembly. We used the web version of NCBI BLAST+ version 2.5.0 (Zhang et al. 2000) to align the flanking genes against T. alba sequences in the NCBI Whole-Genome-Shotgun (WGS) contigs database.  (Wu et al. 2016).

Scaffold numbering
2.1.1 When referring to specific scaffolds in the results and discussion sections, we have inserted a dash ("-") between the word "scaffold" and the scaffold number for legibility.
These dashes are not present in any of the assembly data files. Thus, "scaffold-1085" referenced in the manuscript will appear as "scaffold1085" in the assembly and other associated files. Tables   Table S1. Sequence data collected for use in genome assembly. We here provide information on the insert size, fragmentation method, amplification, sequencing length, and raw data quantity for all libraries sequenced for this genome assembly. We have numbered the libraries and refer to these numbers in other sections of this manuscript.  Table S2. Preliminary assembly parameters. We here report the parameters used in our preliminary assemblies using SOAPdenovo2. "Trim level" indicates the average Phred score to which we trimmed using Trimmomatic. A higher Phred score indicates a more restrictive trimming. "Error correction" refers to whether we performed error correction on the input reads for the assembly. We provide information on how we specified that the assembler use the pairedend and unpaired data for each assembly. For a given assembly, we note which libraries provided data and in which portions of the assembly process that data was used. For a given portion of the assembly process, we give the numbers of the utilized libraries followed, in parentheses, by the rank given to each library in the assembly configuration file. Please refer to Table S1 for information about the libraries to which the numbers refer. An asterisk is next to the preliminary assembly that we chose to use as the basis for the final assembly. (3), 10 (4), 11 (5) 1-3 (1), 6 (2), 7 (1), 8 (2) None 2 28 No N/A 1-2 (6), 4-11 (6) 9 (3), 10 (4), 11 (5) 1-2 (1), 6 (2), 7 (1), 8 (2) None 3 28
1-2 (6), 4-5 (6), 7-8 (6) 9 (3), 10 (4), 11 (5) 1-2 (1), 2 (1), 4-5 (1), 6 (2), 7 (1), 8 (2) 1-2 (7), 4-11 (7) 14 7 Yes N/A 1-2 (6), 4-11 (6) 9 (3), 10 (4), 11 (5) 1-2 (1), 4-5 (1), 6 (2), 7 (1), 8 None       This is a table of the gene  Table S9. Information on searches for light-associated genes in non-owl genome assemblies. test for evidence of changes in selection pressure on the owl branches. "BG" indicates the background branches, "lnL" denotes the log likelihood of the model, "LRT" denotes the value of the likelihood ratio test (given by 2 times the difference in the likelihoods of the models), and "cf" denotes the codon frequency model used to calculate the equilibrium codon frequencies with "cf 1" indicating that we used the average nucleotide frequencies and "cf 2" indicating that we used the average nucleotide frequencies at each of the 3 codon positions. "Model" corresponds to the number of ω values employed among branches with one ω value assumed for all branches under model "0", two ω values used under model "1", and 3 ω values used with model "2". "Tyto" and "Strix" indicate whether the value pertains to sequence in the Tyto alba or package to detect positive selection affecting certain sites on the owl lineages. "Tyto" and "Strix" indicate whether the values pertain to sequence in the Tyto alba or Strix occidentalis caurina genome assembly, respectively. "BG" indicates the background branches, "FG" denotes the foreground branch, "lnL" denotes the log likelihood of the model, "LRT" denotes the value of the likelihood ratio test (given by 2 times the difference in the likelihoods of the models), and "cf" denotes the codon frequency model used to calculate the equilibrium codon frequencies with "cf 1" indicating that we used the average nucleotide frequencies and "cf 2" indicating that we used the average nucleotide frequencies at each of the 3 codon positions. "Site class" indicates the ω category with "0" indicating sites under purifying selection, "1" sites under relaxed selection, "2a" sites that are under positive selection on the foreground branch and under purifying selection on the background branches, and "2b" indicating positive selection on the foreground branch and relaxed selection on the background branches. "Proportion" indicates the proportion of sites in a given class. "Model" denotes either the positive selection model ("Positive") or the null model ("Null").  Figure S1. Cumulative distribution of annotation edit distances of MAKER-generated annotations. This is a graph of the cumulative distribution of annotation edit distances (AED) of the annotations generated by MAKER. Included here are all of the annotations in the MAKER final output. We have drawn a horizontal line denoting 50% of the annotations. After quality filtering, the cumulative distribution appeared identical. Figure S2. Histogram of the lengths of genes annotated by MAKER. This is a histogram of the distribution of the lengths of genes annotated by MAKER. We included all of the gene annotations in the MAKER final output. We grouped the values into 400 frequency bins, one of these including all genes greater than or equal to 150,000 nt in length. We have provided the mean, median, and standard deviation of the gene lengths in a text box. Figure S3. Histogram of the coding DNA sequence length in genes annotated by MAKER. This is a histogram of the lengths of coding DNA sequences in genes annotated by MAKER. We included all of the gene annotations in the MAKER final output. We grouped the values into 400 frequency bins, one of these including all coding DNA sequences greater than or equal to 10,000 nt in length. We have provided the mean, median, and standard deviation of the lengths in a text box. Figure S4. Histogram of the lengths of exons in genes annotated by MAKER. This is a histogram of the lengths of exons in genes annotated by MAKER. We included the exons from all of the gene annotations in the MAKER final output. We grouped the values into 400 frequency bins, one of these including all exons greater than or equal to 1,600 nt in length. We have provided the mean, median, and standard deviation of the exon lengths in a text box. Figure S5. Histogram of the lengths of introns in genes annotated by MAKER. This is a histogram of the lengths of introns in genes annotated by MAKER. We included the introns from all of the gene annotations in the MAKER final output. We grouped the values into 400 frequency bins, one of these including all introns greater than or equal to 16,000 nt in length. We have provided the mean, median, and standard deviation of the intron lengths in a text box. Figure S6. Histogram of the number of exons in genes annotated by MAKER. This is a histogram of the number of exons in genes annotated by MAKER. We included the exons from all of the gene annotations in the MAKER final output. We grouped the values into 60 frequency bins, one of these including all genes with greater than or equal to 60 exons. We have provided the mean, median, and standard deviation of the number of exons per gene in a text box.