Scientometric trends for coronaviruses and other emerging viral infections

Abstract Background COVID-19 is the most rapidly expanding coronavirus outbreak in the past 2 decades. To provide a swift response to a novel outbreak, prior knowledge from similar outbreaks is essential. Results Here, we study the volume of research conducted on previous coronavirus outbreaks, specifically SARS and MERS, relative to other infectious diseases by analyzing >35 million articles from the past 20 years. Our results demonstrate that previous coronavirus outbreaks have been understudied compared with other viruses. We also show that the research volume of emerging infectious diseases is very high after an outbreak and decreases drastically upon the containment of the disease. This can yield inadequate research and limited investment in gaining a full understanding of novel coronavirus management and prevention. Conclusions Independent of the outcome of the current COVID-19 outbreak, we believe that measures should be taken to encourage sustained research in the field.


Introduction
Infectious diseases remain a major cause of morbidity and mortality worldwide, in developed countries and particularly in the developing world [1]. According to the World Health Organization, three of the top 10 causes of death globally are infectious diseases [1]. In light of the continuous emergence of infections, the burden of infectious diseases is expected to become even greater in the near future [2,3]. Currently, the world is struggling with a novel strain of coronavirus (SARS-CoV-2) that emerged in China during late 2019 and, at the time of this writing, has infected more than 4,400,000 people and killed more than 302,000 [4,5]. COVID-19 is the latest and third serious human coronavirus outbreak in the past 20 years. Additionally, there are several more typical circulating seasonal human coronaviruses that cause respiratory infections. It is still too early to predict the epidemic course of COVID-19, but it is already a pandemic that appears more difficult to contain than its close relative SARS-CoV [6,7].
Much can be learned from past infectious disease outbreaks to improve preparedness and response to future public health threats. Three key questions arise in light of the COVID-19 outbreak: To what extent were the previous human coronavirus (SARS and MERS) outbreaks studied? Is research on emerging viruses being sustained, aiming to understand and prevent future epidemics? Are there lessons from academic publications on previous emerging viruses that could be applied to the current COVID-19 epidemic?
In this study, we answer these vital questions by utilizing state-of-the-art data science tools to perform a large-scale analysis of 35 million papers, of which 1,908,211 concern the field of virology. We explore nearly two decades of infectious disease research published from 2002 up to today. We particularly focus on public health crises, such as SARS, influenza (including seasonal, pandemic H1N1, and avian influenza), MERS, and Ebola virus disease, and compare them to HIV/AIDS and viral hepatitis B and C, three bloodborne viruses that have been associated with a significant global health burden for more than two decades.
Compiled on: June 2, 2020. Draft manuscript prepared by the author.
A crucial aspect of being prepared for future epidemics is sustained ongoing research of emerging infectious diseases even at 'times of peace' when such viruses do not pose an active threat. Our results demonstrate that research on previous coronaviruses, such as SARS and MERS, was conducted by a relatively small number of researchers centered in a small number of countries, suggesting that such research could be better encouraged. We propose that regardless of the fate of COVID-19 in the near future, sustained research efforts should be encouraged to be better prepared for the next outbreak.

Background
This research is a large-scale scientometric study in the field of infectious diseases. We focus on the quantitative features and characteristics of infectious disease research over the past two decades. In this section, we present studies that analyze and survey real-world trends in the field of infectious diseases (see the Infectious Disease Trends subsection) and studies that relate to bibliometric trends in general and public health in particular (see the Bibliometric Trends subsection).

Infectious Disease Trends
There is great promise in utilizing big data to study epidemiology [8]. One approach is to gather data using different surveillance systems. For example, one such system is ProMED. ProMED was launched 25 years ago as an email service to identify unusual worldwide health events related to emerging and reemerging infectious diseases [9]. It is used daily around the globe by public health policy makers, physicians, veterinarians, and other healthcare workers, researchers, private companies, journalists, and the general public. Reports are produced and commentary is provided by a global team of subject-matter experts in a variety of fields. ProMED has over 80,000 subscribers and over 60,000 cumulative event reports from almost every country in the world. Additionally, there are many different systems used by different countries and health organizations worldwide.
In 2006, Cowen et al. [10] evaluated the ProMED dataset from the years 1996 to 2004. They discovered that some diseases received more extensive coverage than others; "86 disease subjects had thread lengths of at least 10 reports, and 24 had 20 or more." They note that the pattern of occurrence is hard to explain even for an expert in epidemiology. Also, at the level of granularity of ProMED data, it is very challenging to predict how frequently diseases are going to occur. In 2008, Jones et al. [2] analyzed the global temporal and spatial patterns of emerging infectious diseases (EIDs). They analyzed 305 EIDs between 1940 and 2004 and demonstrated that the threat of EIDs to global health is increasing. The same year, Freifeld et al. [11] developed HealthMap, an interactive surveillance system that integrates disease outbreak reports from various sources.
Data about infectious diseases can also come from web- and social-based sources. For instance, in 2009, Ginsberg et al. [12] used Google search queries to monitor the spread of influenza epidemics. They used the fact that many people search online before going to doctors, and they found that during a pandemic, the volume of searches differs from normal. They then created a mathematical model to forecast the spread of flu.
This research was later converted into a tool called Google Flu Trends, and at its peak, Google Flu Trends was deployed in 29 countries worldwide. However, not everything worked well for Google Flu Trends; in 2009, it underestimated the flu volume, and in 2013, it predicted more than double the true number of cases [13]. As a result of such discrepancies, Google shut down the Google Flu Trends website in 2015 and transferred its data to academic researchers [14]. Also in 2009, Carneiro and Mylonakis [15] used large amounts of data to predict flu outbreaks a week earlier than prevention surveillance systems.
In 2010, Lampos and Cristianini [16] extended the idea of Carneiro and Mylonakis [15] to use temporal data to monitor outbreaks. Instead of using Google Trends, they used Twitter as their data source. They collected 160,000 tweets from the UK, and as ground truth, they used HPA weekly reports about the H1N1 epidemic. Using textual markers to measure flu on Twitter, they demonstrated that Twitter can be used to study disease outbreaks, similar to Google Trends. In the same year, Salathé and Khandelwal [17] analyzed Twitter and demonstrated that it is possible to use social networks to study not only the spread of infectious disease but also vaccinations. They found a correlation between the sentiment in tweets toward an influenza vaccine and the vaccination rate.
In 2014, Generous et al. [18] used Wikipedia to monitor and forecast infectious disease outbreaks. They examined Wikipedia access logs to forecast outbreak volumes for 14 combinations of diseases and locations. The model worked successfully for only 8 out of the 14 cases. Also, the authors suggested that it was even possible to transfer a model between locations without retraining it. In contrast to most of the web-based disease monitoring methods, Wikipedia-based monitoring presents a fully open forecasting system that can be easily reproduced. Generally, in the past couple of years, Wikipedia has become a widely used data source for medical studies [19,20]. Moreover, a recent report [21] shows that Wikipedia has successfully kept itself clean from the misinformation spread during the COVID-19 outbreak. In 2015, Santillana et al. [22] took influenza surveillance one step further by fusing multiple data sources. They used five datasets: Twitter, Google Trends, near real-time hospital visit records, FluNearYou, and Google Flu Trends. They used all these data sources with a machine-learning algorithm to predict influenza outbreaks. In 2017, McGough et al. [23] dealt with the problem of significant delays in the publication of official government reports about Zika cases. To solve this problem, they used the combined data of Google Trends, Twitter, and the HealthMap surveillance system to predict estimates of Zika cases in Latin America.
In 2018, Breugelmans et al. [24] explored the effects of publishing in open access journals and collaboration between European and sub-Saharan African researchers in the study of poverty-related disease. To this end they used the PubMed dataset but discovered it is not suited to performing full bibliometric analysis; to deal with this issue they also utilized Web of Science as a data source. They discovered that there is an advantage for open access publications in terms of citations. In 2020, Head et al. [25] studied infectious disease funding. They discovered that HIV/AIDS is the most funded disease. Additionally, they discovered a pattern where Ebola, Zika, influenza, and coronavirus funding were highest after an outbreak.
There is substantial controversy surrounding the use of web-based data to predict the volume of outbreaks. The limitations of Google Flu Trends, mentioned above, raised the question of reliability of social data for assessing disease spread. Lazer [26] noted that these types of methods are problematic since companies like Google, Facebook, and Twitter are constantly changing their products. Studies based on such data sources may be valid today but not be valid tomorrow, and may

Bibliometric Trends
In 2005, Vergidis et al. [27] used PubMed and JCR (Journal Citation Reports) to study trends in microbiology publications. They discovered that microbiology research in the US had the highest average impact factor, but in terms of research production, Western Europe was first. In 2008, Uthman [28] analyzed trends in paper publications about HIV in Nigeria. He found growth (from 1 to 33) in the number of publications about HIV in Nigeria and that papers with international collaborations were published in journals with a higher impact factor. In 2009, Ramos et al. [29] used Web of Science to study publications about infectious diseases in European countries. They found that more papers in total were published about infectious diseases in Europe than in the US.
In 2012, Takahashi-Omoe and Omoe [30] surveyed publications of 100 journals about infectious diseases. They discovered that the US and the UK had the highest number of publications, and relative to the country's socioeconomic status, the Netherlands, India, and China had relatively high productivity. In 2014, similar to Wislar et al. [31], Kennedy et al. [32] studied ghost authorship in nursing journals instead of biomedical journals. They found ghost and honorary authorship rates of 27.6% and 42%, respectively.
In 2015, Wiethoelter et al. [33] explored worldwide infectious disease trends at the wildlife-livestock interface. They found that 7 out of the top 10 most popular diseases were zoonoses. In 2017, Dong et al. [34] studied the evolution of scientific publications by analyzing 89 million papers from the Microsoft Academic dataset. Similar to the increase found by Aboukhalil [35], they also found a drastic increase in the number of authors per paper. In 2019, Fire and Guestrin [36] studied the over-optimization in academic publications. They found that the number of publications has ceased to be a good metric for academic success as a result of longer author lists, shorter papers, and surging publication numbers. Citation-based metrics, such as citation number and h-index, are likewise affected by the flood of papers, self-citations, and lengthy reference lists.

Data Description
In this study, we fused four data sources to extract insights about research on emerging viruses. In the rest of this subsection we describe these data sources.
i. MAG - Microsoft Academic Graph is a dataset containing "scientific publication records, citation relationships between those publications, as well as authors, institutions, journals, conferences, and fields of study" [37]. The MAG dataset we used was from 22 March 2019 and contains data on over 210 million papers [38]. This dataset was used as the main dataset of the study. Similar to Fire and Guestrin [36], we only used papers that had at least 5 references in order to filter out non-peer-reviewed publications, such as news columns that are published in journals.
ii. PubMed - PubMed is a dataset based on the PubMed search engine of academic publications on the topics of medicine, nursing, dentistry, veterinary medicine, health care systems, and preclinical sciences [39]. One of the major advantages of using the PubMed dataset is that it contains only medical-related publications. The data on each PubMed paper contains information about its venue, authors, and affiliations, but it does not contain citation data. In this study, we used the 2018 annual baseline PubMed dataset containing 29,138,919 records. We mainly utilized the PubMed dataset to analyze journal publications (see Paper Trends Section).
iii. SJR - Scientific Journal Rankings is a dataset containing the information and ranking of over 34,100 journals from 1999 to 2018 [40], including their SJR indicator, the best quartile of the journal, and more. We utilized the SJR dataset to compare the rankings of different journals to assess the level of their prestige.
iv. Wikidata - Wikidata is a dataset holding vast knowledge about the world, containing data on over 78,252,808 items [43]. Wikidata stores metadata about items, and each item has an identifier and can be associated with other items. We utilized the Wikidata dataset to extract geographic information for academic institutions in order to match a paper with its authors' geographic locations.

Analyses

Infectious Disease Analysis
To study the research of emerging viruses over time, we analyzed the datasets described in the Data Description section. In pursuing this goal, we used the code framework recently published by Fire and Guestrin [36], which enables the easy extraction of the structured data of papers from the MAG dataset. The MAG and PubMed datasets were filtered according to a predefined list of keywords. The keyword search was performed in the following way: given a set of diseases $D$ and a set of papers $P$, from each paper title $p_t$, where $p \in P$, we created a set of word-grams. Word-grams are defined as n-grams of words, i.e., all the contiguous sequences of words in a phrase, without disrupting the order of the words. For example, the word-grams of the string "Information on Swine Flu," word-grams(Information on Swine Flu), will return the following set: {Information, on, Swine, Flu, Information on, on Swine, Swine Flu, Information on Swine, on Swine Flu, Information on Swine Flu}. Next, for each $p$, we calculated $\text{word-grams}(p_t) \cap D$, which was considered as the set of diseases with which the paper was associated.
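The word-gram matching described above can be illustrated with a minimal Python sketch. The function names `word_grams` and `match_diseases` are hypothetical, and the whitespace tokenization and case-insensitive comparison are assumptions; the source does not state how titles were tokenized or normalized.

```python
def word_grams(title):
    """Return every contiguous run of words in `title` as a set of strings."""
    words = title.split()  # assumption: plain whitespace tokenization
    return {
        " ".join(words[i:j])
        for i in range(len(words))
        for j in range(i + 1, len(words) + 1)
    }


def match_diseases(title, diseases):
    """Return the disease names (lowercased) that appear as word-grams of the title."""
    # Assumption: matching is case-insensitive; punctuation is not stripped.
    return word_grams(title.lower()) & {d.lower() for d in diseases}


if __name__ == "__main__":
    print(sorted(word_grams("Information on Swine Flu")))
    print(match_diseases("Information on Swine Flu", {"Swine Flu", "SARS"}))
```

A title with four words yields ten word-grams, matching the example in the text. Intersecting with the alias set keeps multi-word disease names such as "swine flu" intact, which a simple single-word lookup would miss.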
In the current study, we focused on the past emerging coronaviruses (SARS and MERS). There are many other strains of the human coronavirus, and four of them are known for causing seasonal respiratory infections [44]. We focused on SARS and MERS since they are closer to SARS-CoV-2 and both have zoonotic origins and raised international public health concern.
Additionally, we also analyzed Ebola virus disease, influenza (seasonal, avian influenza, swine flu), HIV/AIDS, hepatitis B, and hepatitis C as comparators that represent other important emerging infectious diseases from the past two decades. For these nine diseases, we collected all their aliases, which were added to the set of diseases D and were used as keywords to filter the datasets. To reduce the false-positive rate, we analyzed only papers that, according to the MAG dataset, were in the categories of medicine or biology, and following Fire and Guestrin [36] had at least five references. Additionally, to explore the trend in the core categories of infectious disease research, we performed the same analysis on the virology category. In the rest of this section, we describe the specific calculations and analyses we performed.

Paper Trends
To explore the volume of studies on emerging viruses, we examined the publication of papers about infectious diseases. First, we defined several notions that we used to define publication and citation rates. Let $D$ be a set of disease names and $P$ a set of papers. Namely, for a paper $p \in P$, $p_{Disease}$ is defined as the disease that matches the paper's keywords, $p_{year}$ as the paper's publication year, and $p_{citations}$ as the set of papers citing $p$. Using these notions, we defined the following features:
• Number of Citations - the total number of citations for a specific infectious disease.
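As an illustration of the notions above, the following Python sketch sums the number of citing papers over all papers matched to each disease. The tuple layout and the function name `citations_per_disease` are hypothetical, and this is only one plausible reading of "Number of Citations"; any normalization is not covered here.

```python
from collections import defaultdict


def citations_per_disease(papers):
    """Sum the sizes of the citing-paper sets for each disease.

    `papers` is an iterable of (disease, year, citing_ids) tuples, a
    hypothetical stand-in for p_Disease, p_year, and p_citations.
    """
    totals = defaultdict(int)
    for disease, _year, citing_ids in papers:
        totals[disease] += len(set(citing_ids))
    return dict(totals)


if __name__ == "__main__":
    sample = [
        ("SARS", 2003, {"c1", "c2", "c3"}),
        ("SARS", 2004, {"c2"}),
        ("MERS", 2013, set()),
    ]
    print(citations_per_disease(sample))
```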
Using these metrics, we inspected how the coronavirus publication and citation rates differed from other examined EIDs. We analyzed how trends of citations and publications have changed over time. Additionally, to inspect the similarities between the trends of different diseases we calculated the DTW (Dynamic time warping) distance [45] between all the disease pairs. Finally, we clustered the time-series using Time-
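For readers unfamiliar with dynamic time warping, the sketch below shows the textbook dynamic-programming form of the DTW distance with an absolute-difference local cost, applied to every pair of yearly series. It is an illustrative assumption rather than the implementation used in the study; the names `dtw_distance` and `pairwise_dtw` and the sample values are hypothetical.

```python
from itertools import combinations


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW distance with |x - y| as the local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i points of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def pairwise_dtw(series_by_disease):
    """Return {(disease_1, disease_2): distance} for every unordered pair."""
    return {
        (d1, d2): dtw_distance(series_by_disease[d1], series_by_disease[d2])
        for d1, d2 in combinations(sorted(series_by_disease), 2)
    }


if __name__ == "__main__":
    # Made-up yearly rates, for illustration only.
    rates = {"A": [0, 5, 1, 0], "B": [0, 0, 5, 1], "C": [3, 3, 3, 3]}
    print(pairwise_dtw(rates))
```

Because DTW allows one series to be stretched against the other, two outbreak-shaped curves that peak in different years can still come out close, whereas a flat series stays far from both.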

Journal Trends
To investigate the relationship between journals and their publication of papers about emerging viruses, we combined the Semantic Scholar and PubMed datasets with the SJR dataset using ISSN, and selected all the journals from SJR categories related to infectious diseases (immunology, epidemiology, infectious diseases, virology, and microbiology). First, we inspected whether coronavirus papers are published in the top journals. We selected the top-10 journals by SJR and calculated the number of papers they had published for each disease over time. Next, we inspected how published papers about coronavirus are regarded relative to other EIDs in terms of ranking. To this end, we defined a new metric, $JScore_t$. $JScore_t$ is defined as the average SJR score of all published papers on a specific topic $t$. We used $JScore_t$ to observe how the prominence of each disease in the publication world has changed over time. Lastly, we explored publications by looking at the quartile ranking of the journal over time.
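A minimal sketch of how a per-topic, per-year JScore could be computed is given below. The record layout, the `(issn, year)` lookup key, and the decision to skip papers whose journal has no SJR entry are all assumptions made for illustration; the ISSN values are placeholders.

```python
from collections import defaultdict


def jscore_by_year(papers, sjr):
    """Average SJR score of the papers on each topic in each year.

    `papers` is an iterable of (topic, year, issn) tuples and `sjr` maps
    (issn, year) to an SJR score. Both layouts are hypothetical.
    """
    sums = defaultdict(lambda: [0.0, 0])  # (topic, year) -> [score total, paper count]
    for topic, year, issn in papers:
        score = sjr.get((issn, year))
        if score is None:
            continue  # assumption: papers in unranked journals are ignored
        acc = sums[(topic, year)]
        acc[0] += score
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


if __name__ == "__main__":
    sjr = {("ISSN-A", 2017): 8.0, ("ISSN-B", 2017): 2.0}
    papers = [
        ("SARS", 2017, "ISSN-A"),
        ("SARS", 2017, "ISSN-B"),
        ("MERS", 2017, "ISSN-A"),
        ("SARS", 2017, "ISSN-UNRANKED"),
    ]
    print(jscore_by_year(papers, sjr))
```

Since the score is a plain mean, a topic with very few papers in a given year can be moved substantially by one or two papers in highly ranked journals.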

Author Trends
To study how scientific authorship has changed in the field of infectious diseases, we explored what characterizes the authors of papers on different diseases. We inspected the number of new authors over time to check how attractive emerging viruses are to new researchers. Additionally, we analyzed the number of experienced authors, where author experience is defined as the time that has passed from his or her first publication. The authors were identified by the identification number provided in the MAG dataset. Author disambiguation is a challenging task; Microsoft combined multiple methods to generate their author identifications [47]. We also analyzed the number of authors who wrote multiple papers about each disease.
Footnote 4: To determine which papers, we used the MAG fields of study.
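The author measures described above can be sketched as follows. The `(author_id, year)` record layout and all function names are hypothetical, and treating a "new author" as one whose first paper in the supplied records falls in a given year is an assumption about how the counts were derived.

```python
from collections import Counter


def first_publication_years(records):
    """Map each author id to the earliest year in `records` ((author_id, year) pairs)."""
    first = {}
    for author_id, year in records:
        if author_id not in first or year < first[author_id]:
            first[author_id] = year
    return first


def new_authors_per_year(records):
    """Count the authors whose first paper in `records` appeared in each year."""
    return dict(Counter(first_publication_years(records).values()))


def average_experience(records, first_year):
    """Mean number of years between each authorship and that author's first publication."""
    gaps = [year - first_year[author_id] for author_id, year in records]
    return sum(gaps) / len(gaps) if gaps else 0.0


if __name__ == "__main__":
    all_records = [("a1", 2003), ("a1", 2010), ("a2", 2010)]
    first = first_publication_years(all_records)
    print(new_authors_per_year(all_records))
    print(average_experience([("a1", 2010), ("a2", 2010)], first))
```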

Collaboration Trends
To inspect the state of international collaborations in emerging virus research, we mapped academic institutions to geolocation. However, it is not a trivial task to match institution names. Institution names are sometimes written differently; for example, Aalborg University Hospital and Aalborg University are affiliated. However, there are cases where two similar names refer to different institutions; for example, the University of Washington and Washington University are entirely different institutions. To deal with this problem, we used the affiliation table in the MAG dataset. To determine the country and city of each author, we applied a five-step process:
i. For each institution, we looked for the institution's page on Wikidata. From each Wikidata page, we extracted all geography-related fields.
ii. To first merge all the Wikidata location fields, we used the "coordinate location" with reverse geocoding to determine the city and country of the institution.
iii. For all the institutions that did not have a "coordinate location" field, we extracted the location data from the other available fields. We crossed the data against city and country lists from the GeonamesCache Python library [48] to determine whether the data in the field described a city or a country.
iv. To acquire country data for an institution that had only city data on Wikidata, we used GeonamesCache city-to-country mapping lists.
v. To get city and country data for institutions that did not have the relevant fields on Wikidata, we extracted geographic coordinates from Wikipedia.org. Even though Wikidata and Wikipedia.org are both operated by the Wikimedia Foundation, they are independent projects which have different data. Similar to Wikidata coordinates, we used reverse geocoding to determine the city and country of the institution.
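Steps iii and iv of the process above can be illustrated with a small standard-library sketch. The lookup tables below are tiny hypothetical stand-ins for the GeonamesCache city and country lists, the function name `resolve_location` is invented for this example, and reverse geocoding of coordinates (steps ii and v) is not shown.

```python
# Illustrative stand-ins for the GeonamesCache lists; not real library data structures.
COUNTRIES = {"Denmark", "United States", "China"}
CITY_TO_COUNTRY = {
    "Aalborg": "Denmark",
    "Seattle": "United States",
    "Wuhan": "China",
}


def resolve_location(field_values):
    """Classify each location field as a city or a country and fill in a missing country.

    `field_values` is a list of strings taken from an institution's
    geography-related fields (a hypothetical input format).
    Returns a (city, country) tuple; either element may be None.
    """
    city = None
    country = None
    for value in field_values:
        if value in COUNTRIES and country is None:
            country = value  # step iii: the field names a country
        elif value in CITY_TO_COUNTRY and city is None:
            city = value     # step iii: the field names a city
    if city is not None and country is None:
        country = CITY_TO_COUNTRY[city]  # step iv: city-to-country mapping
    return city, country


if __name__ == "__main__":
    print(resolve_location(["Aalborg"]))
    print(resolve_location(["Seattle", "United States"]))
    print(resolve_location(["Atlantis"]))
```

Institutions for which neither a city nor a country can be resolved this way would fall through to the coordinate-based fallback described in step v.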
Using the extracted geodata, we explored how international collaborations change over time in coronavirus research. Finally, we explored which countries have the highest number of papers about coronavirus and which countries have the highest number of international collaborations over time.
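One way to count papers and international collaborations per country and year is sketched below. Treating a paper as an international collaboration when its authors' institutions span more than one country is an assumption, as are the record layout and the name `collaboration_stats`.

```python
from collections import Counter


def collaboration_stats(papers):
    """Count papers and international collaborations per (year, country).

    `papers` is an iterable of (year, countries) pairs, where `countries`
    is the set of countries of a paper's author institutions (hypothetical layout).
    """
    total = Counter()
    international = Counter()
    for year, countries in papers:
        keys = [(year, country) for country in set(countries)]
        total.update(keys)
        if len(keys) > 1:  # assumption: more than one country means international
            international.update(keys)
    return total, international


if __name__ == "__main__":
    sample = [
        (2004, {"China", "United States"}),
        (2004, {"China"}),
        (2013, {"Saudi Arabia", "United States"}),
    ]
    total, international = collaboration_stats(sample)
    print(total[(2004, "China")], international[(2004, "China")])
```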

Results
In the following subsections, we present all the results of the experiments which were described in the Analyses section.

Results of Paper Trends
In recent years, there has been a surge in academic publications, yielding more than 1 million new papers related to medicine and biology each year (see Figure 1a). In contrast to the overall growth in the number of infectious disease papers, there has been a relative decline in the number of papers about the coronaviruses SARS and MERS (see Figure 1b). Also, we found that 0.4090% of virology studies in our corpus from the past 20 years involved human SARS and MERS, while HIV/AIDS accounts for 7.8949% of all virology studies. We observed that, unlike research on HIV/AIDS and avian influenza, which has been published at a high and steady pace over the last 20 years, research on SARS was published at an overwhelming rate after the 2002-2004 outbreak and then dropped sharply after 2005 (Figure 2). In terms of Normalized Paper Rate (NPR; see Figure 2), after the first SARS outbreak, there was a peak in publishing SARS-related papers with an NPR twice as high as Ebola's. However, the trend dropped very quickly, and a similar phenomenon can be observed for the swine flu pandemic. The MERS outbreak achieved a much lower NPR than SARS, specifically more than 16 times lower when comparing the peaks in SARS and MERS trends. In terms of Normalized Citation Rate (NCR; Figure 3), we observed the same phenomenon as we did with NPR. Observing Figures 9 and 10, we can see that there are diseases with very similar trends. More precisely, NPR and NCR trends are in two clusters, where the first cluster contains avian influenza, Ebola, MERS, SARS, and swine flu, and the second cluster contains HIV/AIDS, hepatitis B, hepatitis C, and influenza.

Results of Journal Trends
From analyzing the trends in journal publications, we discovered that the numbers of papers published by journal quartile are very similar to Normalized Paper Rate and Normalized Citation Rate (see Figure 4). We observed that for most of the diseases, the trends are quite similar: a growth in the study rate is coupled with a growth in the number of published papers in Q1 journals. We discovered that for SARS, MERS, the swine flu, and Ebola, Q1 publication trends were almost parallel to their NPR trends (see Figures 2 and 4). Also, we noticed that HIV, avian influenza, influenza, and hepatitis B and C have steady publication numbers in Q1 journals. Looking at papers in highly ranked journals (Figure 5), we observed that the diseases which are being continuously published in top-10 ranked journals are mainly persisting diseases, such as HIV and influenza. Additionally, we inspected how the average journal ranking of publications by disease has changed over time (Figure 6). We found that only MERS had a decline in JScore. We also noticed that current papers about SARS had the highest JScore.

Results of Author Trends
By studying the authorship trends in the research of emerging viruses, we discovered that there is a difference in the average experience of authors among diseases. SARS researchers had the lowest experience in years, and hepatitis C had the most experienced researchers (see Table 1). We noticed that the SARS research community had a smaller percentage of relatively prolific researchers than other diseases. Moreover, researchers with multiple papers related to SARS and MERS published on average 3.8 papers, while hepatitis C researchers published on average 5.2 papers during the same period. Additionally, from analyzing authors who published multiple papers on a specific disease, we found that on average there was a 2.5 paper difference between HIV and SARS authors. Furthermore, swine flu, SARS, and MERS were the diseases on which authors published the lowest number of multiple papers.

Results of Collaboration Trends
By inspecting global collaboration and research efforts, we found that the geolocation of researchers correlated with publication trends. For instance, most SARS, MERS, hepatitis B, and avian influenza research was done by investigators based [...] majority of SARS papers (73%) were written by researchers in only 6 countries (Figure 7). While the US was dominant in the research of all inspected diseases, China showed an increased output in only these three diseases. Also, MERS and SARS were studied in the fewest countries, and HIV was studied in the highest number of countries (Figure 7). Moreover, SARS and MERS were the diseases least studied in Europe, with only 17% and 19% of SARS and MERS studies, respectively, as opposed to Ebola studies, 29% of which were conducted in Europe.

Discussion
In this study, we analyzed trends in the research of emerging viruses over the past two decades with emphasis on emerging coronaviruses (SARS and MERS). We compared the research of these two coronavirus epidemics to seven other emerging viral infectious diseases as comparators. To this end, we used multiple bibliometric datasets, fusing them to get additional insights. Using this data, we explored the research of epidemiology from the perspectives of papers, journals, authors, and international collaborations. By analyzing the results presented in the Results section, the following can be noted: First, the surge in infectious disease publications (Figure 1) supports the results of Fire and Guestrin [36], who found that there has been a general escalation of scientific publications. We found that the growth in the number of infectious disease publications is very similar to other fields. Hence, Goodhart's Law ("When a measure becomes a target, it ceases to be a good measure.") did not skip the world of virology research. However, alongside the general growth in the number of papers, we observed that there was a decline in the relative number of papers on the specific infectious diseases we inspected. The most evident drastic drop in the publication rate happened after an epidemic ended. It appears that, for a short while, many researchers study an outbreak, but later their efforts are reduced. This is strengthened by considering the average number of multiple papers per author for each disease (see Table 2). Additionally, similar patterns were found in the funding of MERS and SARS research [25], which indicates that there is a possibility that the research rate has decreased due to lack of funding.
Second, when looking at journal publications, we noted very similar patterns occurred for citations and publications. This result emphasizes that fewer publications, and hence fewer citations, translate into fewer papers in Q1 journals (Figure 4). Also, we observed the same patterns as Fire and Guestrin [36], with most of the papers being published in Q1 journals and the minority published in Q2-Q4 journals. This trend started to change when zooming in and analyzing publications in top-10 ranked journals (Figure 5). While we can see some correlation to outbreaks in Ebola, swine flu, and SARS, it is harder to interpret the curve of HIV since there were no focused epidemics in the past 20 years but a global burden, and we did not observe similar patterns in publications and citations. Observing the JScore (Journal Trends Section) results (Figure 6), most diseases showed a steady increase, but two diseases behaved rather anomalously. MERS had a decline since 2013, which is reasonable to expect after the initial outbreak, but we did not see the same trend in the other diseases and there is a general trend of increasing average SJR [36]. The second anomaly is that SARS had an increase in JScore alongside a decrease in citations and publication numbers. Inspecting the data, we discovered that in 2017 there were three published papers in Lancet Infectious Diseases and in 2015 two papers in Journal of Experimental Medicine about SARS, and both journals have a very high SJR. These publications increased the JScore drastically. This anomaly is a result of outliers in the data that biased the results. We can observe in Figure 4 that in the last decade the number of SARS papers published in ranked journals dropped drastically. It dropped low enough that two outliers created a bias on the JScore. Generally, the less data we have, the greater the chance for outliers to cause bias in the data.
Third, we observed that, on average, authors who publish multiple papers write fewer of them on diseases that are characterized by large epidemics, such as the swine flu and SARS. On the other side of the scale are hepatitis C and HIV, which are persistent viral diseases with high global burdens. These diseases involve more prolific authors. Regarding Ebola and MERS, it is too early to predict if they will behave similarly to SARS since they are relatively new and require further follow-up.
Fourth, looking at international collaboration, we observed the US to be very dominant in all the disease studies (Figure 7). Looking at China, we found it to be mainly dominant in diseases that were epidemiologically relevant to public health in China, such as SARS, avian influenza, and hepatitis B. When looking at Ebola, which has not been a threat to China for the last two decades, we observed a relatively low investment in its research in China. Regarding MERS, our results are similar to those of Sa'ed [49]: in both studies, the top three contributors to MERS studies were the US, China, and Saudi Arabia.
Many of the trends we observed are related to the pattern of the diseases. We observed two main types of infectious diseases with distinct trends. The first type was emerging viral infections like SARS and Ebola. Their academic outputs tend to peak after an epidemic and then subside. The second type was viral infections with high burdens, such as hepatitis B and HIV, for which there is a more or less constant trend. These trends were most evident in publication and citation numbers, as well as journal metrics. The collaboration and author distributions were more affected by where the outbreak occurred or where there was a high burden. This was also supported by the clusters we found, which were divided in the same way.
In terms of practical implications, we see several options. First, notwithstanding the importance of pathogen discovery, as evident in projects like the Global Virome Project [50] that is trying to discover unknown zoonotic viruses to stop future outbreaks, it is still important to monitor the status of current research that concerns known pathogens. It can be observed from Figures 2 and 3 that there are diseases with declining interest from the scientific community. These trends are harder to spot when looking at the total number of publications since the total number of papers generally keeps growing (Figure 1a). Using NPR and NCR can help decision makers investigate if additional resources should be invested in the study of these diseases. For instance, while SARS and MERS were in WHO's R&D Blueprint as priority diseases, they still exhibited a decline in their research rate. Second, using collaboration data, it is possible to find which countries have potential for growth in the number of researchers on specific diseases and also which bilateral grants have potential.
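To make the normalization argument concrete, the following minimal sketch assumes that a normalized paper rate is simply the share of a year's papers that concern a given disease; the exact NPR and NCR definitions are those given in the Methods and may differ. The function name and the yearly counts are hypothetical.

```python
def normalized_paper_rate(disease_counts, total_counts):
    """Return {year: disease papers / all papers} for years with a nonzero total."""
    return {
        year: disease_counts[year] / total_counts[year]
        for year in sorted(disease_counts)
        if total_counts.get(year)
    }


# Made-up counts: the disease output is flat while the overall output grows.
disease = {2015: 100, 2016: 100, 2017: 100}
total = {2015: 1000, 2016: 1250, 2017: 2000}

rates = normalized_paper_rate(disease, total)
for year, rate in rates.items():
    print(year, f"{rate:.3f}")
```

A constant raw count looks steady in a plain search, whereas the normalized rate falls every year, which is the kind of decline that total publication counts hide.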
Currently, there is no doubt that we have to be better prepared for the next pandemic and the emergence of "Disease X." We observed that currently there is a non-sustained investment in EIDs such as SARS and MERS, which is a key issue. Another crucial issue is the sharing of research material such as data and code. Data and code allow scientists to make more accurate discoveries faster by building on knowledge from previous studies. Using the MAG dataset Paper Resources table, we inspected how many papers from the nine diseases we analyzed had code or data. We found that there were 30 and 75 papers that had data and code, respectively. These numbers are very low, and we suspect that there are a lot of missing data in this table. We firmly believe that publishing code and data should be mandatory when possible.
This study may have several limitations. To analyze the data, we relied on titles to associate papers with diseases. While a title is very important in classifying the topic of a paper, some papers may discuss a disease without mentioning its name in the title. Additionally, there may be false positives; for instance, an acronym might have several meanings that are not related to an infectious disease term. An additional limitation is our focus on a limited number of distinct diseases. There are other emerging infections not evaluated here that could have followed other trends. To deal with some of these limitations, we only analyzed papers that were categorized as medicine and biology papers as a means to reduce false positives. Furthermore, we show that the same trends appeared even when we filtered all the papers by the category of virology (see Figures 11 and 12). Finally, we compared papers that were tagged with a MeSH term on PubMed to the papers we retrieved using our keyword search of the title. We found that we matched MeSH terms with 73% recall, which is in the range described by Breugelmans et al. [24].
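The recall comparison described above can be sketched as follows, assuming the MeSH-tagged papers are treated as the reference set and the title keyword search as the retrieved set. The function name and paper identifiers are hypothetical, and the toy numbers are unrelated to the 73% reported here.

```python
def recall(retrieved_ids, reference_ids):
    """Fraction of the reference papers that the retrieval method also found."""
    reference = set(reference_ids)
    if not reference:
        raise ValueError("reference set is empty")
    return len(reference & set(retrieved_ids)) / len(reference)


# Hypothetical identifiers: papers found by title keywords vs. papers tagged with MeSH.
title_hits = {"p1", "p2", "p3", "p5"}
mesh_hits = {"p1", "p2", "p3", "p4"}

r = recall(title_hits, mesh_hits)
print(f"recall of title search against MeSH: {r:.2f}")
```

Papers found only by the title search (here "p5") do not affect recall; they would instead lower precision, which corresponds to the false-positive concern raised above.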
In the future, we would like to perform extended collaboration analysis by improving the institution country mapping. Currently, we were able to identify 94.2% of the countries of origin for the institutions in the MAG affiliation table. We intend to improve the institution country mapping by using additional data sources. Additionally, we are planning to extend our study into other diseases and look for correlations with real-world data such as global disease burden.

Conclusions
The COVID-19 outbreak has emphasized the insufficient knowledge available on emerging coronaviruses. Here, we explored how previous coronavirus outbreaks and other emerging viral epidemics have been studied over the last two decades. From inspecting the research outputs in this field from several different angles, we demonstrate that the interest of the research community in an emerging infection is temporally associated with the dynamics of the incident and that a drastic drop of interest is evident after the initial epidemic subsides. This translates into limited collaborations and a non-sustained investment in research on coronaviruses. Such a short-lived investment also involves reduced funding as presented by Head et al. [25] and may slow down important developments such as new drugs, vaccines, or preventive strategies. There has been an unprecedented explosion of publications on COVID-19 since January 2020 and also a significant allocation of research funding. We believe the lessons learned from the scientometrics of previous epidemics argue that regardless of the outcome of COVID-19, efforts to sustain research in this field should be made. More specifically, in 2017 [51] and 2018 [52], SARS and MERS were considered to be priority diseases in WHO's R&D Blueprint, but their research rate did not grow relative to other diseases. Therefore, the translation of international policy and public health priorities into a research agenda should be continuously monitored and enhanced.

Availability of source code and requirements
• Project name: XXXX
• Project home page: GitHub repository will be available after publication XXXX
• Operating system(s): Linux, OS X
• Programming language: Python
• Other requirements: Python 3.6 or higher
• License: MIT License

Availability of supporting data and materials
The datasets supporting the results of this article are available online (see Data Description section). Preprocessed datasets will be available after publication.

Competing interests
The authors declare that they have no competing interests.

"Scientometric Trends for Coronaviruses and Other Emerging Viral Infections"-Response to Reviewers
We would like to thank the reviewers for their highly valuable and constructive criticism. The comments have been very helpful in the preparation of the revised manuscript. We have addressed the reviewers' concerns and have improved the paper accordingly. Moreover, we invested considerable effort to make our study and code easy to reproduce. Furthermore, we generated a ready-to-run Docker image [1].
The following is a description of the revisions we have made to address the comments pointed out by the reviewers. Please note that we have also provided the revised paper in which we have marked all changes in blue.

Comment 1:
Figures are of low quality and I have difficulties in properly seeing them; can you make them of higher resolution and more readable;

Response 1:
We thank the reviewer for pointing this out. In the revised manuscript we have improved the figures: First, the figures were re-generated with higher resolution. Second, we adjusted the fonts to improve the readability of several figures.

Comment 2:
There is a part dedicated to the so-called non-conventional data streams, like google trend, wiki, etc., I would shorten this part, since it is a little bit out of the scope and focus of the paper,

Response 2:
We have shortened the referenced parts according to the reviewer's suggestion.

Comment 3:
Besides figures, can authors provide more quantitative data and statistical analyses (of significance) if possible

Response 3:
Comparing temporal graphs is a challenging task. We tried to use regression algorithms to model the curves. However, we did not obtain any significant insights using this method. Different diseases behave differently; for instance, the signal of Disease A can unfold more slowly than that of Disease B because the two diseases spread at different speeds, yet in general they can follow the same pattern. For this reason, to compare the plots we used DTW (dynamic time warping), a signal processing method that measures the similarity between temporal sequences that vary in speed. Additionally, we used TimeSeriesKMeans, a method based on KMeans clustering that uses DTW as its distance measure, to cluster the plots/time series into groups. We discovered that there are two clusters: the first contains avian influenza, Ebola, MERS, SARS, and swine flu, and the second cluster contains HIV/AIDS, hepatitis B, hepatitis C, and influenza.
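As a minimal illustration of why DTW suits curves that share a shape but differ in speed, the sketch below implements the textbook dynamic-programming recurrence with an absolute-difference local cost. This is a simplification for illustration, not the implementation used in the study; library versions often use a squared local cost and other refinements, and the clustering step is not reproduced here. The function name and the three series are made up.

```python
import math


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping with |x - y| as local cost."""
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i points of a with the first j of b
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats its point
                cost[i][j - 1],      # b advances, a repeats its point
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Made-up yearly signals: the same peak unfolding at different speeds, and a flat curve.
fast_peak = [0, 5, 1, 0, 0, 0]
slow_peak = [0, 0, 5, 5, 1, 0]
flat = [2, 2, 2, 2, 2, 2]

print(dtw_distance(fast_peak, slow_peak))
print(dtw_distance(fast_peak, flat))
```

The two peaked series align perfectly once time is warped, while the flat series stays far from both. A point-by-point distance would not capture this, since the peaks fall in different years.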

Comment 4:
Can authors put their research in a broader view and also in terms of practical implications (for example, the more research, the higher the likelihood of finding a cure, etc.). can authors correlate bibliometric trends with outcomes

Response 4:
We see here several practical implications. First, we can use the research to find diseases that are understudied. For instance, SARS and MERS were considered in WHO's R&D Blueprint as priority diseases. When looking in PubMed, it is easy to conclude that the current amount of SARS research is steady. However, since there is a constant growth in the number of published papers in almost all topics, there is an overall decrease in the percentage of papers about SARS. In other words, even with SARS being considered a priority disease, there is a decline in the publication rate of SARS papers. Parallel to the drop in the number of papers, there has also been a drop in funding [2] for this coronavirus. After the beginning of the outbreak, reports came in about a scientist who had worked on a coronavirus vaccine years ago but stopped the research due to lack of funding and interest from the scientific community [3]. Hence, a drop in the number of papers can indicate reduced interest from the scientific community or diminished resource allocation. We believe further research is needed on the topic to be able to prove causality. Second, using the geodata, we can see which countries have the potential to attract new scientists to study specific diseases. To address these topics, we have added material in our

Indeed, there are other strains of coronaviruses that cause seasonal respiratory infections. Since these strains do not cause outbreaks, we focused only on SARS and MERS. We have added a comment regarding research on other coronaviruses in the Analyses section.

Comment 16: (page 7)
That is widely discussed in the literature, so adding some pointers here would be useful.

Response 16:
We agree with the reviewer that there are many papers about international collaboration. However, we did not find studies that focused on measuring worldwide infectious disease collaborations. We only found papers that discussed the topic in general, or in a specific field. In the case of the relevant infectious diseases, we found one paper that analyzed MERS publications. We have added to the Discussion section the similarity we observed about MERS research.

Comment 17: (page 7)
China's share in Hepatitis B papers is higher than its share in MERS papers.

Response 17:
Thank you for bringing this to our attention, and we have revised the text accordingly.

Comment 18: (page 8)
Good occasion to discuss the role of biases and outliers in the data more generally

Response 18:
We agree with the reviewer that it is a good idea to discuss the outliers, and we have now addressed this in the Discussion section as the reviewer suggested.

Comment 19: (page 8)
Assuming the acronym stands for an infectious disease, wouldn't the failure to recognize that be a false negative?

Response 19:
The original intention was that an acronym may have several meanings, and there is a possibility of misclassifying papers as infectious disease related. The sentence was rephrased to make this clearer.

Comment 20: (page 9)
The Global Virome Project aims to address viruses systematically.

Response 20:
As the reviewer suggested, we have now mentioned this important project.

Comment 21: (page 5)
In any case, it would be useful to state from which Wikipedia you extracted this information. I assume it was the English one. Chances are, though, that for institutions outside the English-speaking world, a non-English Wikipedia might have better location information than the English one.

Response 21:
We indeed used the English Wikipedia. In the revised manuscript, we have added a footnote mentioning that we utilized the English Wikipedia. The idea of using a non-English Wikipedia to get information that is missing in the English Wikipedia is very interesting, and we intend to check it out in future studies. Using only the English Wikipedia and Wikidata, we were able to find the countries of 94.2% of the institutions in the MAG affiliation table. We have added to the Discussion section information about the accuracy of our mapping and that we are planning to improve it in future studies.

Comment 22: (page 5)
What is the underlying dataset: MAG or PubMed or both? This should be indicated. In any case, the individual graphs are also available from PubMed, e.g. Ebola, so you could compare against them. Not sure the repetitive "Disease=" marker is helpful in the legend; that seems more appropriate for the caption.

Response 22:
Figure 1b contains data from the MAG dataset, and we have updated the figure and the caption according to the reviewer's suggestion. Regarding the graphs available on PubMed, they present the total number of papers while our graphs show the normalized amount.

Comment 23: (page 8)
Why not take MeSH terms?

Response 23:
During our research, we considered using MeSH terms and even at some point PubMed was our primary dataset. However, we discovered that PubMed lacks many fields, for instance, paper citations. Additionally, other fields such as author affiliations are very noisy. We tried to deal with these problems, but the results were inadequate. Generally, the MAG dataset is larger and has more information than PubMed (Figure 1a); for instance, it has citation, affiliation, and author disambiguation data, which are missing or unreliable in PubMed. For all the analyses except the journal trends, MAG presented a better option than PubMed, and the only shortcoming was that the MAG dataset did not have MeSH terms. To present a unified method in the paper, we used the same method for MAG and the journal trends analysis where we used PubMed. Moreover, we compared paper retrieval by title and MeSH, and 73% of the retrieved papers were the same. A similar problem with the MeSH terms is described by Breugelmans et al. [5]. They also arrived at similar recall values between MeSH terms and keyword search on an external dataset.

Response 24:
We understand the reviewer's concerns. However, using the PubMed search to check trends in diseases is problematic. There are more than 300 infectious diseases with many aliases; manually inspecting all these diseases is not efficient and will not provide a complete graph. The same problem exists even when performing a search using MeSH terms; Breugelmans et al. [5] mapped multiple MeSH terms for each disease in order to search for papers. Moreover, the data presented on PubMed are not normalized, and since the total number of published papers is growing every year, in the long run we will see on PubMed a trend of growth for almost every topic.

Comment 25: (page 8)
Some reference(s) on the funding would be useful here, e.g. some of those at http://researchinvestments.org/publications. Their pneumonia report at http://researchinvestments.org/pneumonia/ is rather detailed but does not mention coronavirus, which helps make the case for your paper. They do have a supplementary database, though, that has some coronavirus-related information, as per https://aqueous-inlet-57355.herokuapp.com/?disease_areas__name=Coronavirus&breakdown=type_of_research

Response 25:
We agree with the reviewer that funding data may strengthen the paper results. While the paper was under review, a new pre-print was published [2] that analyzed infectious disease funding. This paper's results present similar patterns to the patterns we discovered in paper publications and citations. We have added this information to the paper as the reviewer suggested.

Comment 26: (page 9)
Good place to refer to the WHO's R&D Blueprint

Response 26:
As the reviewer suggested, we have added a reference to the WHO's R&D Blueprint. Surprisingly, we discovered that according to the Blueprint, SARS and MERS were considered to be priority diseases in 2017 and 2018 but that did not increase the research rate.

Comment 27: (page 9)
The paper focuses on research on (some) human coronaviruses, without mentioning others. So perhaps adapt this question a bit, or add a section with some pointers to research on other coronaviruses, perhaps separately for human/livestock/wild animals.

Response 27:
Thank you for pointing this out. We have changed the question accordingly.

Comment 28: (page 9)
The paper zooms in on a few selected cases from which it is hard to generalize to the level of this question, so the paper only addresses some aspects of it. It also does not consider trends in publications on EIDs as an abstract, cross-disease concept (think "Disease X" or "next pandemic")

Response 28:
The reviewer is correct in that we focused on selected diseases and our findings are not generalizable to all infectious diseases. Our paper shows a proof of concept on how scientometric analyses could support such investigations and definitely leaves room for future analyses focusing on other diseases. As for EIDs as a cross-cutting concept, we agree this merits further research, but in our opinion other methods should be employed to answer such a question (e.g., NLP) and it is beyond the scope of the current analysis.

Comment 29: (page 9)
The applied part here is thin. One application I see that is not discussed is a greater recognition of the need for sharing data and code as well as preprints, which means, for instance, that scientometric studies that want to capture publications during the pandemic will need to consider whether and how to include such materials.

Response 29: (page 9)
The biggest application we see here is that the detection of understudied diseases can encourage the allocation of additional funds for these diseases to increase research volume as appropriate. We also believe that the sharing of code and data is very important for driving science forward. We used the MAG dataset to inspect how many papers include code and data. However, the numbers we found were very low: 30 and 75 papers had data and code, respectively. We did not include this since we suspect that this is a result of incomplete data in the relevant MAG table. In the revised paper, we have added two paragraphs on this issue in the Discussion section.

Comment 30: (page 9)
Similarly, Wikidata could assist with much more of this work than just the geolocation of the institutions. For an overview, see Wikidata as a knowledge graph for the life sciences. As for concrete examples, this Scholia profile provides a Wikidata-based overview of the COVID-19 literature, and this demo illustrates a Wikidata way of creating your Fig.\ \ref{fig:net}. The Scholia profile also has an accompanying page detailing some of the known gaps associated with it, so as to facilitate collaborative curation. It would be very helpful to be able to feed the curation work that you have already undertaken on this corpus into this shared system.