Recovery of Deleted Deep Sequencing Data Sheds More Light on the Early Wuhan SARS-CoV-2 Epidemic

Abstract The origin and early spread of SARS-CoV-2 remains shrouded in mystery. Here, I identify a data set containing SARS-CoV-2 sequences from early in the Wuhan epidemic that has been deleted from the NIH’s Sequence Read Archive. I recover the deleted files from the Google Cloud and reconstruct partial sequences of 13 early epidemic viruses. Phylogenetic analysis of these sequences in the context of carefully annotated existing data further supports the idea that the Huanan Seafood Market sequences are not fully representative of the viruses in Wuhan early in the epidemic. Instead, the progenitor of currently known SARS-CoV-2 sequences likely contained three mutations relative to the market viruses that made it more similar to SARS-CoV-2’s bat coronavirus relatives.

Understanding the spread of SARS-CoV-2 in Wuhan is crucial to tracing the origins of the virus, including identifying events that led to infection of patient zero. The first reports outside of China at the end of December 2019 emphasized the role of the Huanan Seafood Market (ProMED 2019), which was initially suggested as a site of zoonosis. However, this theory became increasingly tenuous as it was learned that many early cases had no connection to the market (Cohen 2020;Huang et al. 2020;Chen et al. 2020). Eventually, Chinese CDC Director Gao Fu dismissed the theory, stating "At first, we assumed the seafood market might have the virus, but now the market is more like a victim. The novel coronavirus had existed long before" (Global Times 2020).
Indeed, there were reports of cases that far preceded the outbreak at the Huanan Seafood Market. The Lancet described a confirmed case having no association with the market whose symptoms began on December 1, 2019 . The South China Morning Post described nine cases from November 2019 including details on patient age and sex, noting that none were confirmed to be "patient zero" (Ma 2020). Professor Yu Chuanhua of Wuhan University told the Health Times that records he reviewed showed two cases in mid-November, and one suspected case on September 29 (Health Times 2020). At about the same time as Professor Chuanhua's interview, the Chinese CDC issued an order forbidding sharing of information about the COVID-19 epidemic without approval (China CDC 2020), and shortly thereafter Professor Chuanhua recontacted the Health Times to say that the November cases could not be confirmed (Health Times 2020). Then, China's State Council issued a much broader order requiring central approval of all publications related to COVID-19 to ensure they were coordinated "like moves in a game of chess" (Kang, Cheng, et al. 2020). In 2021, the joint WHO-China report dismissed all reported cases prior to December 8 as not COVID-19, reviving the theory that the virus might have originated at the Huanan Seafood Market (WHO 2021).
In other outbreaks where direct identification of early cases has been stymied, it has increasingly become possible to use genomic epidemiology to infer the timing and dynamics of spread from the analysis of viral sequences. For instance, analysis of SARS-CoV-2 sequences has enabled the reconstruction of the initial spread of SARS-CoV-2 in North America and Europe (Bedford et al. 2020;Worobey et al. 2020;Deng et al. 2020;Fauver et al. 2020).
But in the case of Wuhan, genomic epidemiology has also proven frustratingly inconclusive. Some of the problem is simply limited data: despite the fact that Wuhan has advanced virology labs, there is only patchy sampling of SARS-CoV-2 sequences from the first months of the city's explosive outbreak. Other than a set of multiply sequenced samples collected in late December of 2019 from a dozen patients connected to the Huanan Seafood Market (WHO 2021), just a handful of Wuhan sequences are available from before late January of 2020 (see analysis in this study below). This paucity of sequences could be due in part to an order that unauthorized Chinese labs destroy all coronavirus samples from early in the outbreak, reportedly for "laboratory biological safety" reasons (Pingui 2020).
However, the Wuhan sequences that are available have also confounded phylogenetic analyses designed to infer the "progenitor" of SARS-CoV-2, which is the sequence from which all other currently known sequences are descended (Kumar et al. 2021). Although there is debate about exactly how SARS-CoV-2 entered the human population, it is universally accepted that the virus's deep ancestors are bat coronaviruses (Lytras et al. 2021). But the earliest known SARS-CoV-2 sequences, which are mostly derived from the Huanan Seafood Market, are notably more different from these bat coronaviruses than other sequences collected at later dates outside Wuhan. As a result, there is a direct conflict between the two major principles used to infer an outbreak's progenitor: namely that it should be among the earliest sequences and that it should be most closely related to deeper ancestors (Pipes et al. 2021).
Here, I take a small step toward resolving these questions by identifying and recovering a deleted data set of partial SARS-CoV-2 sequences from outpatient samples collected early in the Wuhan epidemic. Analysis of these new sequences in conjunction with careful annotation of existing ones suggests that the early Wuhan samples that have been the focus of most studies including the joint WHO-China report (WHO 2021) are not fully representative of the viruses actually present in Wuhan at that time. These insights help reconcile phylogenetic discrepancies and suggest two plausible progenitor sequences, one of which is identical to that inferred by Kumar et al. (2021) and the other of which contains the C29095T mutation. Furthermore, the approach taken here hints it may be possible to advance the understanding of SARS-CoV-2's origins or early spread even without further on-the-ground studies, such as by more deeply probing data archived by the NIH and other entities.

Identification of a SARS-CoV-2 Deep Sequencing Data Set That Has Been Removed from the Sequence Read Archive
During the course of my research, I read a paper by Farkas et al. (2020) that analyzed SARS-CoV-2 deep sequencing data from the Sequence Read Archive (SRA), which is a repository maintained by the NIH's National Center for Biotechnology Information. The first supplementary table of Farkas et al. (2020) lists all SARS-CoV-2 deep sequencing data available from the SRA as of March 30, 2020.
The majority of entries in this table refer to a project (BioProject PRJNA612766) by Wuhan University that is described as nanopore sequencing of SARS-CoV-2 amplicons. The table indicates that this project represents 241 of the 282 SARS-CoV-2 sequencing run accessions in the SRA as of March 30, 2020. Because I had never encountered any other mention of this project, I performed a Google search for "PRJNA612766" and found no search hits other than the supplementary table itself. Searching for "PRJNA612766" in the NCBI's SRA search box returned a message of "No items found." I then searched for individual sequencing run accessions from the project in the NCBI's SRA search box. These searches returned messages indicating that the sequencing runs had been removed ( fig. 1).
The SRA is designed as a permanent archive of deep sequencing data. The SRA documentation states that after a sequencing run is uploaded, "neither its files can be replaced nor filenames can be changed" and that data can only be deleted by e-mailing SRA staff (SRA 2021).
The Deleted Data Set Contains Sequencing of Viral Samples Collected Early in the Wuhan Epidemic The metadata in the first supplementary table of Farkas et al. (2020) indicates that the samples in deleted project PRNJA612766 were collected by Aisu Fu and Renmin Hospital of Wuhan University. Google searching for these terms revealed that the samples were related to a study posted as a preprint on medRxiv in early March of 2020 ) and subsequently published in the journal Small in June of 2020 .
The study describes an approach to diagnose infection with SARS-CoV-2 and other respiratory viruses by nanopore sequencing. This approach involved reverse transcription of total RNA from swab samples, followed by PCR with specific primers to generate amplicons covering portions of the viral genome. These amplicons were then sequenced on an Oxford Nanopore GridION, and infection was diagnosed if the sequencing yielded sufficient reads aligning to the viral genome. Importantly, the study notes that this approach yields information about the sequence of the virus as well enabling diagnosis of infection. In fact, Wang et al. (2020) even list the mutations determined from this sequencing-but because this paper was published in the chemistry journal Small, after the sequences were removed from the SRA, their existence appears to have been entirely overlooked.
The preprint ) says the approach was applied to "45 nasopharyngeal swab samples from outpatients with suspected COVID-19 early in the epidemic." The digital object identifier for the preprint indicates that it was processed by medRxiv on March 4, 2020, which is 1 day after China's State Council ordered that all papers related to COVID-19 must be centrally approved (Kang, Cheng, et al. 2020). The final published manuscript  from June of 2020 updated the description from "early in the epidemic" to "early in the epidemic (January 2020)." Both the preprint and published manuscript say that 34 of the 45 early epidemic samples were positive in the sequencing-based diagnostic approach. In addition, both state that the approach was later applied to 16 additional samples collected on February 11 and 12, 2020, from SARS-CoV-2 patients hospitalized at the Renmin Hospital of Wuhan University.
There is complete concordance between the accessions for project PRJNA612766 in the supplementary table of Farkas et al. (2020) and the samples described by Wang et al. (2020). There are 89 accessions corresponding to the 45 early epidemic samples, with these samples named like wells in a 96well plate (A1, A2, etc.). The number of accessions is approximately twice the number of early epidemic samples because each sample has data for two sequencing runtimes except one sample (B5) with just one runtime. There are 31 accessions corresponding to the 16 samples collected in February from Renmin Hospital patients, with these samples named R01, R02, etc. Again, all but one sample (R04) have data for two sequencing runtimes. In addition, there are 7 accessions corresponding to positive and negative controls, 2 accessions corresponding to other respiratory virus samples, and 112 Bloom . doi:10.1093/molbev/msab246 MBE samples corresponding to plasmids used for benchmarking of the approach. Together, these samples and controls account for all 241 accessions listed for PRJNA612766 in the supplementary table of Farkas et al. (2020).
Neither the preprint ) nor published manuscript  contains any correction or note that suggests a scientific reason for deleting the study's sequencing data from the SRA. I e-mailed both corresponding authors of Wang et al. (2020) to ask why they had deleted the deep sequencing data and to request details on the collection dates of the early outpatient samples, but received no reply.

Recovery of Deleted Sequencing Data from the Google Cloud
As indicated in fig. 1, none of the deleted sequencing runs could be accessed through the SRA's web interface. In addition, none of the runs could be accessed using the commandline tools of the SRA Toolkit. For instance, running fastqdump SRR11313485 or vdb-dump SRR11313485 returned the message "err: query unauthorized while resolving query within virtual file system module-failed to resolve accession 'SRR11313485.'" However, the SRA has begun storing all data on the Google and Amazon clouds. While inspecting the SRA's web interface for other sequencing accessions, I noticed that SRA files are often available from links to the cloud such as https://storage.googleapis.com/nih-sequence-read-archive/run/ACCESSION/ACCESSION.
Based on the hypothesis that deletion of sequencing runs by the SRA might not remove files stored on the cloud, I interpolated the cloud URLs for the deleted accessions and tested if they still yielded the SRA files. This strategy was successful; for instance, as of June 3, 2021, going to https://storage.googleapis.com/nih-sequence-readarchive/run/SRR11313485/SRR11313485 downloads the SRA file for accession SRR11313485. I have archived this file on the Wayback Machine at https://web.archive.org/ web/20210502130820/https://storage.googleapis.com/nih-sequence-read-archive/run/SRR11313485/SRR11313485. I automated this strategy to download the SRA files for 97 of the 99 sequencing runs corresponding to the 34 SARS-CoV-2-positive early epidemic samples and the 16 hospital samples from February. The SRA files for two runs (SRR11313490 and SRR11313499) were not accessible via the Google Cloud, but after I posted the first version of this manuscript as a preprint, several individuals found archived data for these runs that had been downloaded when they were still available on the SRA, and I have used those data in the updated analysis described here (see Methods for details). I used the SRA Toolkit to get the object timestamp (vdbdump -obj_timestamp) and time (vdb-dump -info) for all SRA files. For all files, the object timestamp is February 15, 2020, and the time is March 16, 2020. Although the SRA Toolkit does not clearly document these two properties, my guess is that the object timestamp may refer to when the SRA file was created from a FASTQ file uploaded to the Recovery of Wuhan SARS-CoV-2 Sequences . doi:10.1093/molbev/msab246 MBE SRA, and the time may refer to when the accession was made public.
The Data Are Sufficient to Determine the Viral Sequence from the Start of Spike through the End of ORF10 for Some Samples Wang et al. (2020) sequenced PCR amplicons covering nucleotide sites 21,563-29,674 of the SARS-CoV-2 genome, which spans from the start of the spike gene to the end of ORF10. They also sequenced a short amplicon generated by nested PCR that covered a fragment of ORF1ab spanning sites 15,080-15,550. In this paper, I only analyze the region from spike through ORF10 because this is a much longer contiguous sequence and the amplicons were generated by conventional rather than nested PCR. I slightly trimmed the region of interest to 21,570-29,550 because many samples had poor coverage at the termini.
I aligned the recovered deep sequencing data to the SARS-CoV-2 genome using minimap2 (Li 2018), combining accessions for the same sample, and masking regions that corresponded to the primer binding sites described in Wang et al. I called the consensus viral sequence for each sample at each site with coverage 3% and >80% of the reads concurring on the nucleotide identity. With these criteria, 13 of the early outpatient samples and 1 of the February hospitalized patient samples had sufficient coverage to call the consensus sequence at >90% of the sites in the region of interest (table 1), and for the remainder of this paper, I focus on these high-coverage samples. Table 1 also shows the mutations in each sample relative to proCoV2, which is a putative progenitor of SARS-CoV-2 inferred by Kumar et al. (2021) that differs from the widely used Wuhan-Hu-1 reference sequence by three mutations (C8782T, C18060T, and T28144C). Although requiring coverage of only 3 is relatively lenient, table 1 shows that all sites with mutations have coverage 10. In addition, the mutations I called from the raw sequence data in table 1 concord with those mentioned in Wang et al. (2020). Again, this fact emphasizes that the information contained in the deleted sequencing data is largely present in Wang et al. (2020), but because it was only published in a table in the chemistry journal Small rather than placed on the SRA, its existence was overlooked.
I also determined the consensus sequence of the plasmid control used by Wang et al. (2020) from the recovered sequencing data and found that it had mutations C28144T and G28085T relative to proCoV2, which means that in the region of interest this control matches Wuhan-Hu-1 with the addition of G28085T. Since none of the viral samples in table 1 contain G28085T and the samples that prove most relevant below also lack C28144T (which is a frequent natural mutation among early Wuhan sequences), plasmid contamination of the viral samples is unlikely.

Analysis of Existing SARS-CoV-2 Sequences Emphasizes the Perplexing Discordance between Collection Date and Distance to Bat Coronavirus Relatives
To contextualize the viral sequences recovered from the deleted project, I first analyze early SARS-CoV-2 sequences already available in the GISAID database (Shu and McCauley 2017). The analyses described in this section are not entirely novel but synthesize observations from multiple prior studies Forster et al. 2020;Kumar et al. 2021;Pekar et al. 2021;Pipes et al. 2021) to provide key background.
Known human SARS-CoV-2 sequences are consistent with expansion from a single progenitor sequence Forster et al. 2020;Kumar et al. 2021;Pekar et al. 2021;Pipes et al. 2021). However, attempts to infer this progenitor have been confounded by a perplexing fact: the earliest reported sequences from Wuhan are not the sequences most similar to SARS-CoV-2's bat coronavirus relatives (Pipes et al. 2021). This fact is perplexing because although the proximal origin of SARS-CoV-2 remains unclear (i.e., zoonosis versus lab accident), all reasonable explanations agree that at a deeper level the SARS-CoV-2 genome is derived from bat coronaviruses (Lytras et al. 2021). One would therefore expect the first reported SARS-CoV-2 sequences to be the most similar to these bat coronavirus relatives-but this is not the case.
This conundrum is illustrated in fig. 2, which plots the collection date of SARS-CoV-2 sequences in GISAID versus the relative number of mutational differences from RaTG13 , which is the bat coronavirus with the highest full-genome sequence identity to SARS-CoV-2. The earliest SARS-CoV-2 sequences were collected in Wuhan in December, but these sequences are more distant from RaTG13 than sequences collected in January from other locations in China or even other countries ( fig. 2). The discrepancy is especially pronounced for sequences from patients who had visited the Huanan Seafood Market (WHO 2021). All sequences associated with this market differ from RaTG13 by at least three more mutations than sequences subsequently collected at various other locations ( fig. 2). Importantly, all these observations also hold true if SARS-CoV-2 is compared to other related bat coronaviruses (Lytras et al. 2021) such as RpYN06 (Zhou et al. 2021) or RmYN02  This conundrum can be visualized in a phylogenetic context by rooting a tree of early SARS-CoV-2 sequences so that the progenitor sequence is closest to the bat coronavirus outgroup. If we limit the analysis to sequences collected no later than January 2020 and remove singleton mutations observed only once in the sequence set, then there are three ways to root the tree in this fashion since there are three different sequences equally close to the outgroup ( fig. 3 and Bloom . doi:10.1093/molbev/msab246 MBE supplementary figs. S4 and S5, Supplementary Material online). Importantly, none of these rootings place any Huanan Seafood Market viruses (or other Wuhan viruses from December 2019) in the progenitor node-and only one of the rootings has any virus from Wuhan in the progenitor node (in the leftmost tree in fig. 3, the progenitor node contains Wuhan/0126-C13/2020, which was reportedly collected on January 26, 2020). Therefore, inferences about the progenitor of SARS-CoV-2 based on comparison to related bat viruses are inconsistent with other evidence suggesting that the progenitor is an early virus from Wuhan (Pipes et al. 2021).
Several plausible explanations have been proposed for the discordance of phylogenetic rooting with evidence that Wuhan was the origin of the pandemic.  suggest that viruses from the clade labeled "B" in fig. 3 may just "happen" to have been sequenced first, but that other SARS-CoV-2 sequences are really more ancestral as implied by phylogenetic rooting. Pipes et al. (2021) discuss the Table 1. Samples for which the SARS-CoV-2 sequence could be called at 90% of sites between 21,570 and 29,550, and the substitutions in this region relative to the putative SARS-CoV-2 progenitor proCoV2 inferred by Kumar et al. (2021). FIG. 2. The reported collection dates of SARS-CoV-2 sequences in GISAID versus their relative mutational distances from the RaTG13 bat coronavirus outgroup. Mutational distances are relative to the putative progenitor proCoV2 inferred by Kumar et al. (2021), which itself differs from RaTG13 by 1,132 mutations-so a sequence with a relative mutational distance of 2 actually has 1,134 differences from RaTG13. Note that the lower-right point in the middle (green) panel corresponds to a sequence (Guangdong/FS-30-P00502/2020) reportedly collected in late February that is actually two mutations more similar to RaTG13 than proCoV2. The plot shows only sequences in GISAID collected no later than February 28, 2020. Sequences that the joint WHO-China report (WHO 2021) describes as being associated with the Wuhan Seafood Market are plotted with squares. Points are slightly jittered on the y-axis. Go to https://jbloom.github.io/SARS-CoV-2_PRJNA612766/deltadist.html for an interactive version of this plot that enables toggling of the outgroup to RpYN06 and RmYN02, mouseovers to see details for each point including strain name and mutations relative to proCoV2, and adjustment of the y-axis jittering. Static versions of the plot with RpYN06 and RmYN02 outgroups are in supplementary fig. S3, Supplementary Material online.
Recovery of Wuhan SARS-CoV-2 Sequences . doi:10.1093/molbev/msab246 MBE conundrum in detail and suggest that phylogenetic rooting could be incorrect due to technical reasons such as high divergence of the outgroup or unusual mutational processes not captured in substitution models. Kumar et al. (2021) agree that phylogenetic rooting is problematic and circumvent this problem by using an alternative algorithm to infer a progenitor for SARS-CoV-2 that they name proCoV2.
Notably, proCoV2 turns out to be identical to one of the putative progenitors yielded by my approach in fig. 3 of simply placing the root at the nodes closest to the outgroup-a perhaps unsurprising result, since both approaches use outgroups but ignore sampling dates. It is also possible that there was an early ascertainment bias toward market-associated viruses (clade B), since some early case definitions included exposure to the Huanan Seafood Market. Another explanation that I consider less plausible is offered by Garry (2021): that there were multiple zoonoses from distinct markets, with the Huanan Seafood Market being the source of viruses in clade B, and some other market being the source of viruses that lack T8782C and C28144T ( fig. 3). However, this explanation requires positing zoonoses in two markets by two progenitors differing by just two mutations, which seems nonparsimonious in the absence of direct evidence for zoonosis in any market.  To examine if the sequences recovered from the deleted data set help resolve the conundrum described in the previous section, I repeated the analyses including those sequences. In the process, I noted another salient fact: four GISAID sequences collected in Guangdong that fall in a putative progenitor node are from two different clusters of patients who traveled to Wuhan in late December of 2019 and developed symptoms before or on the day that they returned to Guangdong, where their viruses were ultimately sequenced (Chan et al. 2020;Kang, Wu, et al. 2020). Since these patients were clearly infected in Wuhan even though they were sequenced in Guangdong, I annotated them separately from both the other Wuhan and other China sequences.
Repeating the analysis of the previous section with these changes shows that several sequences from the deleted project and all sequences from patients infected in Wuhan but sequenced in Guangdong are more similar to the bat coronavirus outgroup than sequences from the Huanan Seafood Market ( fig. 4).
Furthermore, it is immediately apparent that the discrepancy between outgroup rooting and the evidence that Wuhan was the origin of SARS-CoV-2 is alleviated by adding the deleted sequences and annotating Wuhan infections sequenced in Guangdong. The rooting of the middle tree in fig. 5 is now more consistent with the evidence that the pandemic originated in Wuhan, as half its progenitor node is derived from early Wuhan infections, which is more than any other equivalently large node. The first known sequence identical to this putative progenitor (Guangdong/HKU-SZ-002/2020) is from a patient who developed symptoms on January 4 while visiting Wuhan (Chan et al. 2020). This putative progenitor has three mutations toward the bat coronavirus outgroup relative to Wuhan-Hu-1 (C8782T, T28144C, and C29095T) and two mutations relative to proCoV2 (T18060C away from the outgroup and C29095T toward the outgroup). The leftmost tree in fig. 5, which has a progenitor identical to proCoV2 (Kumar et al. 2021), also looks plausible, with some weight from Wuhan sequences. However, analysis of this rooting is limited by the fact that the defining C18060T mutation is in a region not covered in the deleted sequences. The rightmost tree in fig. 5 looks less plausible, as it has almost no weight from Wuhan and the first sequence identical to its progenitor was not collected until January 24.
We can also qualitatively examine the three progenitor placements in fig. 5 using the principle employed by Worobey et al. (2020) to help evaluate scenarios for the emergence of SARS-CoV-2 in Europe and North America: namely that during a growing outbreak, a progenitor is likely to give rise to multiple branching lineages. This principle is especially likely to hold for the scenarios in fig. 5, since there are multiple individuals infected with each putative progenitor sequence, implying multiple opportunities to transmit descendants with new mutations. Using this qualitative principle, the middle and leftmost scenarios in fig. 5 seem more plausible than the rightmost scenario, where the progenitor lacks branching descendants. I acknowledge that these arguments are purely qualitative and lack the formal statistical analysis of Worobey et al. (2020)-but as discussed below, there may be wisdom in qualitative reasoning when there are valid concerns about the sampling distribution of the underlying data.

Discussion
I have identified and recovered a deleted set of partial SARS-CoV-2 sequences from the early Wuhan epidemic. Analysis of these sequences leads to several conclusions. First, they provide further evidence that the Huanan Seafood Market sequences that were the focus of the joint WHO-China report (WHO 2021) are not representative of all SARS-CoV-2 in Wuhan early in the epidemic. The deleted data and existing sequences from Wuhan-infected patients hospitalized in Guangdong show early Wuhan sequences often carried the T29095C mutation and were less likely to carry T8782C/ C28144T than sequences in the joint WHO-China report (WHO 2021). Second, given current data, there are two FIG. 4. Relative mutational distance from RaTG13 bat coronavirus outgroup calculated only over the region of the SARS-CoV-2 genome covered by sequences from the deleted data set (21,550). Because the calculated distances here are only over a portion of the genome, there are more negative points than in fig. 2. The plot shows sequences in GISAID collected before February 2020, as well as the 13 early Wuhan epidemic sequences in table 1. Mutational distance is calculated relative to proCoV2, and points are jittered on the y-axis. Go to https://jbloom.github.io/SARS-CoV-2_PRJNA612766/deltad-ist_jitter.html for an interactive version of this plot that enables toggling the outgroup to RpYN06 or RmYN02, mouseovers to see details for each point, and adjustment of jittering.
Recovery of Wuhan SARS-CoV-2 Sequences . doi:10.1093/molbev/msab246 MBE plausible identities for the progenitor of all known SARS-CoV-2. One is proCoV2 described by Kumar et al. (2021), and the other is a sequence that carries three mutations (C8782T, T28144C, and C29095T) relative to Wuhan-Hu-1. Crucially, both putative progenitors are three mutations closer to SARS-CoV-2's bat coronavirus relatives than sequences from the Huanan Seafood Market. Note also that the progenitor of all known SARS-CoV-2 sequences could still be downstream of the sequence that infected patient zero-and it is possible that the future discovery of additional early SARS-CoV-2 sequences could lead to further revisions of inferences about the earliest viruses in the outbreak.
The fact that this informative data set was deleted suggests implications beyond those gleaned directly from the recovered sequences. Samples from early outpatients in Wuhan are a gold mine for anyone seeking to understand spread of the virus. Even my analysis of 13 partial sequences is revealing, and it clearly would have been more scientifically informative to  fig. 3 with the addition of the early Wuhan epidemic sequences from the deleted data set, and Guangdong patients infected in Wuhan prior to January 5 annotated separately. Because the deleted sequences are partial, they cannot all be placed unambiguously on the tree. Therefore, they are added to each compatible node proportional to the number of sequences already in that node. The deleted sequences with C28144T (clade B) or C29095T (putative progenitor in middle tree) can be placed relatively unambiguously as defining mutations occur in the sequenced region, but those that lack either of these mutations are compatible with a large number of nodes including the proCoV2 putative progenitor. Supplementary figs. S4 and S5, Supplementary Material online, demonstrate that the results are identical if RpYN06 or RmYN02 is instead used as the outgroup.
Bloom . doi:10.1093/molbev/msab246 MBE fully sequence all 34 samples rather than delete the partial sequence data. There is no obvious scientific reason for the deletion: the sequences are concordant with the samples described in Wang et al. (2020), there are no corrections to the paper, the paper states human subjects approval was obtained, and the sequencing shows no evidence of plasmid or sample-to-sample contamination. After I e-mailed the NIH the original version of this manuscript, they sent me the email requesting deletion of the data, which is in fig. 6. Despite the statement in the deletion-request e-mail that the sequences were being uploaded to "another website" (fig. 6), I could find no evidence that they were actually uploaded to any other public website-and certainly they were not posted to GISAID, NCBI, or any database used by the joint WHO-China report. Therefore, even though the sequencing data were listed in a table in the Small paper by Wang et al. (2020), the practical consequence of removing the data from the SRA was that nobody was aware these sequences existed. Notably, it is not possible to delete preprints from bioRxiv and medRxiv, so once Wang et al. (2020) had posted their preprint, it was permanently committed to the public record (withdrawn preprints are still accessible, for instance see Yang et al. 2020). Particularly in light of the directive that labs destroy early samples (Pingui 2020) and multiple orders requiring approval of publications on COVID-19 (China CDC 2020; Kang, Cheng, et al. 2020), the deletion of the data from the SRA suggests a less than wholehearted effort to maximize the availability of information about viral sequences from early in the Wuhan epidemic.
However, about a month after I publicly posted this manuscript to bioRxiv as a preprint, the deleted sequence data were discussed in a July 21, 2021, press conference by the vice minister of China's National Health Commission and subsequently in two blog posts by a China state-affiliated reporter (Wang 2021a(Wang , 2021b. The press conference and blog posts stated that the sequence data were removed from the SRA because the journal Small had mistakenly deleted the data availability statement in the paper of Wang et al. (2020), leading the authors to think that they should also delete the actual data. Confusingly, this explanation is inconsistent with the reason provided by the authors to the SRA ( fig. 6), and I am unable to reconcile the discrepancy. The press conference and blog posts also stated that the sequences were all collected on or after January 30, 2020, rather than "early in the epidemic" as originally described in Wang et al. (2020). Finally, the press conference and blog noted that on July 8, 2021, the deleted data were made available on the Chinese NGDC database under accession PRJCA005725.
Regardless of exactly why the data were deleted, another important implication of this study is that genomic epidemiology studies of early SARS-CoV-2 need to pay as much attention to the provenance and annotation of the underlying sequences as technical considerations. There has been substantial scientific effort expended on topics such as phylogenetic rooting (Morel et al. 2021;Pipes et al. 2021), novel algorithms (Kumar et al. 2021), and correction of sequencing errors (Turakhia et al. 2020). Future studies should devote equal effort to going beyond the annotations in GISAID to carefully trace the location of patient infection and sample sequencing. The potential importance of such work is revealed by the observation that many of the sequences closest to SARS-CoV-2's bat coronavirus relatives are from early patients who were infected in Wuhan but then sequenced in and attributed to Guangdong.
There are several caveats to this study. Most obviously, the sequences recovered from the deleted data set are partial and lack full metadata. Therefore, it is impossible to unambiguously place them phylogenetically, or determine exactly when they were collected. It is also important to note that my phylogenetic analyses use relatively simple methods to draw qualitative conclusions without formal statistical testing. Further application of more advanced methods would be a welcome advance. However, qualitative and visual analyses do have advantages when the key questions relate more to the underlying data than the sophistication of the inferences. Finally, both plausible putative progenitors require that an early mutation to SARS-CoV-2 was a reversion toward the bat coronavirus outgroups (either C18060T or C29095T) on a branch that subsequently gave rise to multiple distinct FIG. 6. A redacted version of the e-mails from Wuhan University to the SRA staff requesting deletion of the sequencing data. This e-mail was provided to me by the NIH's NCBI Director Stephen Sherry on June, 19 2021, the day after I e-mailed the NIH an advance copy of this manuscript. The redactions and highlighting were done by the NIH, and I am showing the e-mail exactly as it was provided to me.
Recovery of Wuhan SARS-CoV-2 Sequences . doi:10.1093/molbev/msab246 MBE descendants. Such a scenario can only be avoided by invoking recombination very early in the pandemic, which is not entirely implausible for a coronavirus (Boni et al. 2020). However, because the outgroups have 4% nucleotide divergence from SARS-CoV-2, a mutation toward the outgroup is also entirely possible. Of course, future identification of additional early sequences could fully resolve these questions.
More broadly, the approach taken here suggests that it may be possible to learn more about the origin or early spread of SARS-CoV-2 even without an international investigation. I suggest that it could be worthwhile for the NIH to review e-mail records to identify other SRA deletions. Importantly, SRA deletions do not imply any malfeasance: there are legitimate reasons for removing sequencing runs and the SRA houses >13-million runs making it infeasible for its staff to validate the rationale for all requests. However, the current study suggests that at least in one case, the trusting structures of science permitted a data deletion that obscured sequences relevant to the early spread of SARS-CoV-2 in Wuhan. A careful re-evaluation of other archived forms of scientific communication, reporting, and data could shed light on additional overlooked information relevant to the early emergence of the virus.

Code and Data Availability
The computer code and input data necessary to reproduce all analyses described in this paper are available on GitHub at https://github.com/jbloom/SARS-CoV-2_PRJNA612766. This GitHub repository includes a Snakemake (Mölder et al. 2021) pipeline that fully automates all steps in the analysis except for downloading the sequences from GISAID, which must be done manually as described in the GitHub repository's README to comply with GISAID data sharing terms.
The deleted SRA files recovered from the Google Cloud are all available at https://github.com/jbloom/SARS-CoV-2_ PRJNA612766/tree/main/results/sra_downloads. I have suffixed the file extension.sra to all these files. The consensus sequences recovered from these deleted SRA files are linked to in the relevant subsection.

Archiving of Key Weblinks
Recovery of SRA Files from Deleted Project PRJNA612766 I parsed the first supplementary table of Farkas et al. (2020) to extract the accessions for sequencing runs for deleted SRA BioProject PRJNA612766. By cross-referencing the samples described in this table to Wang et al. (2020), I identified the accessions corresponding to the 34 early outpatient samples who were positive, as well as the accessions corresponding to the 16 hospitalized patient samples from February. Samples had both 10 min and 4 h sequencing runtime accessions, which were combined in the subsequent analysis. I also identified the samples corresponding to the high-copy plasmid controls to enable the analysis of the plasmid sequence to rule out contamination. The code used to parse the Excel table is available as a Jupyter notebook at https://github.com/ jbloom/SARS-CoV-2_PRJNA612766/tree/main/manual_analyses/PRJNA612766. I recovered the SRA files from the Google Cloud by using wget to download files with from paths like https://storage. googleapis.com/nih-sequence-read-archive/run/ SRR11313485/SRR11313485. Note that I cannot guarantee that these Google Cloud links will remain active, as my analyses of other deleted SRA runs (beyond the scope of this study) indicate that only sometimes are deleted SRA files still available via the Google Cloud. For this reason, key runs have been archived on the Wayback Machine as described above, and all downloaded SRA files relevant to this study are included in the GitHub repository. Note also that as described in this paper's main text, two SRA files could not be downloaded from the Google Cloud using the aforementioned method, and so are not part of this study.

FASTQ Files for SRR11313490 and SRR11313499
I was not able to download SRA files for two runs, SRR11313490 and SRR11313499, from the Google Cloud. After I posted the initial version of this preprint, I was contacted by three different individuals who had uncovered FASTQ or FASTA files for these two missing runs that were downloaded from the SRA prior to the deletion of PRJNA612766. For revised versions of the manuscript, I included analysis of the data for these two runs that I obtained from the links provided at https://github.com/lifebit-ai/ SARS-CoV-2/blob/master/assets/ucsc/aws_https_links.txt. Inclusion of data for these two runs did not appreciably Bloom . doi:10.1093/molbev/msab246 MBE change the results of the analysis relative to that in the original version of the preprint, since both runs corresponded to low coverage samples for which meaningful viral genetic information could not be obtained.

Alignment of Recovered Reads and Calling of Consensus Sequences
The downloaded SRA files were converted to FASTQ files using fasterq-dump from the SRA Toolkit. The FASTQ files were preprocessed with fastp (Chen et al. 2018) to trim reads and remove low-quality ones (the exact settings using in this preprocessing are specified in the Snakemake file in the GitHub repository).
The reads in these FASTQ files were then aligned to a SARS-CoV-2 reference genome using minimap2 (Li 2018) with default settings. The reference genome used for the entirety of this study is proCoV2 (Kumar et al. 2021), which was generated by making the following three single-nucleotide changes to the Wuhan-Hu-1 reference (ASM985889v2) available on NCBI: C8782T, C18060T, and T28144C.
I processed the resulting alignments with samtools and pysam to determine the coverage at each site by aligned nucleotides with a quality score of at least 20. I also masked (set to zero coverage) all sites overlapped by the primers used by Wang et al. (2020) to PCR the amplicons used for sequencing. These coverage plots are in supplementary figs. S1 and S2, Supplementary Material online; the legends of these figures also link to interactive versions of the plots that enable zooming and mouseovers to get statistics for specific sites. I called the consensus sequence at a site if this coverage was 3% and >80% of the reads agreed on the identity. These consensus sequences over the entire SARS-CoV-2 genome are available at https://github.com/jbloom/SARS-CoV-2_PRJNA612766/ raw/main/results/consensus/consensus_seqs.csv; note that they are mostly N nucleotides since the sequencing approach of Wang et al. (2020) only covers part of the genome.
I only used the recovered consensus sequences in the downstream analyses if it was possible to call the consensus identity at 90% of the sites in the region of interest from sites 21,570-29,550. These are the sequences listed in table 1, and as described in that table, all mutation calls were at sites with coverage 10. These sequences in the region of interest (21,550) are available at https://github.com/jbloom/SARS-CoV-2_PRJNA612766/blob/main/results/recovered_seqs.fa.

Bat Coronavirus Outgroup Sequences
For analyses that involved comparisons to SARS-CoV-2's bat coronavirus relatives (Lytras et al. 2021), the bat coronavirus sequences were manually downloaded from GISAID (Shu and McCauley 2017). The sequences used were RaTG13 , RmYN02 , and RpYN06 (Zhou et al. 2021)-although the multiple sequence alignment of these viruses to SARS-CoV-2 also contains PrC31 , which was not used in the final analyses as it more diverged from SARS-CoV-2 than the other three bat coronaviruses at a whole-genome level. The GISAID accessions for these sequences are listed at https:// github.com/jbloom/SARS-CoV-2_PRJNA612766/blob/main/ data/comparator_genomes_gisaid/accessions.txt, and a table acknowledging the labs and authors is at https:// github.com/jbloom/SARS-CoV-2_PRJNA612766/blob/main/ data/comparator_genomes_gisaid/acknowledgments.csv. Sites in SARS-CoV-2 were mapped to their corresponding nucleotide identities in the bat coronavirus outgroups via a multiple sequence alignment of proCoV2 to the bat coronaviruses generated using mafft (Katoh and Standley 2013). Curation and Analysis of Early SARS-CoV-2 Sequences from GISAID For the broader analyses of existing SARS-CoV-2 sequences, I downloaded all sequences collected prior to March of 2020 from GISAID. The accessions of these sequences are at https://github.com/jbloom/SARS-CoV-2_PRJNA612766/ blob/main/data/gisaid_sequences_through_Feb2020/accessions.txt, and a table acknowledging the labs and authors is at https://github.com/jbloom/SARS-CoV-2_PRJNA612766/ blob/main/data/gisaid_sequences_through_Feb2020/ acknowledgments.csv.
I then used mafft (Katoh and Standley 2013) to align these sequences to the proCoV2 reference described above, stripped any sites that were gapped relative to the reference, and filtered the sequences using the following criteria: SARS-CoV-2 sequences. I also limited the analyses to the region of the genome that spans from the start of the first coding region (ORF1ab) to the end of the last (ORF10) because I noticed that some sequences had suspicious patterns (such as many mutations or runs of mutations) near the termini of the genome.
The plot in fig. 2 contains all of the GISAID sequences after this filtering. The plot in fig. 4 shows the filtered GISAID sequences collected before February of 2020 plus the 13 good coverage recovered partial early outpatient sequences (table 1), considering only the region covered by the partial sequences (21,550).

Phylogenetic Analyses
The phylogenetic trees were inferred using the GISAD sequences after the filtering and annotations described above, only considering sequences with 95% coverage over the region of interest that were collected before February of 2020. In addition, after generating this sequence set I removed any mutations from sequences that were singletons in the sense that they were only observed once in the sequence set-a similar approach was used by Kumar et al. (2021), and this approach should eliminate rare mutations that could be sequencing errors. A file indicating the unique sequences used for the phylogenetic analysis, their mutations relative to proCoV2, and other sequences in that cluster is at https:// github.com/jbloom/SARS-CoV-2_PRJNA612766/blob/main/ results/phylogenetics/all_alignment.csv.
I then used IQ-Tree (Minh et al. 2020) to infer a maximum-likelihood phylogenetic tree using a GTR nucleotide substitution model with empirical nucleotide frequencies, and collapsing zero-length branches to potentially allow a multifurcating tree. The inference yielded the tree topology and branch lengths shown in all figures in this study with phylogenetic trees. I then rendered the images of the tree using ETE 3 (Huerta-Cepas et al. 2016), manually rerooting the tree to place the first (progenitor) node at each of the three nodes that have the highest identity to the bat coronavirus outgroup. In these images, node sizes are proportional to the number of sequences in that node and are colored in proportion to the location from which those sequences are derived. As indicated in the legend to fig. 3, the node containing the monophyletic set of sequences with C28144T is collapsed into a single node in the tree images.
For the trees in which I added the recovered sequences from the deleted data set ( fig. 5), the actual trees are exactly the same as those inferred using the GISAID sequences above. The difference is that the sequences from the deleted data set are then added to each node with which they are compatible given their mutations in an amount proportional to the size of the node, the logic being that a sequence is more likely to fall into larger clusters.

Interactive Versions of Some Figures
Interactive versions of some figures are available at https:// jbloom.github.io/SARS-CoV-2_PRJNA612766/ and were created using Altair (VanderPlas et al. 2018).