Discovering trends and hotspots of biosafety and biosecurity research via machine learning

Abstract
Coronavirus disease 2019 (COVID-19) has infected hundreds of millions of people and killed millions of them. Its causative agent, SARS-CoV-2, is an RNA virus and is therefore more prone to mutation than many other viruses. Many problems involved in this epidemic have made biosafety and biosecurity (hereafter collectively referred to as ‘biosafety’) a popular and timely topic globally. Biosafety research covers a broad and diverse range of topics, and it is important to quickly identify hotspots and trends in biosafety research through big data analysis. However, the data-driven literature on biosafety research discovery is scant. We developed a novel topic model based on latent Dirichlet allocation, affinity propagation clustering and the PageRank algorithm (LDAPR) to extract knowledge from biosafety research publications from 2011 to 2020. Then, we conducted hotspot and trend analysis with LDAPR and carried out further studies, including annual hot topic extraction, a 10-year keyword evolution trend analysis, topic map construction, hot region discovery and fine-grained correlation analysis of interdisciplinary research topic trends. These analyses revealed valuable information that can guide epidemic prevention work: (1) research interest in a given infectious disease is related not only to its epidemic characteristics but also to the progress of research on other diseases, and (2) infectious diseases are not only strongly related to their corresponding microorganisms but also potentially related to other specific microorganisms. The detailed experimental results and our code are available at https://github.com/KEAML-JLU/Biosafety-analysis.


Introduction
'Biosafety' refers to safety issues caused by infectious diseases, alien species invasion, biological weapons, biotechnology abuse, loss of biological resources and laboratory safety accidents [74]. Related research has been conducted in medicine [3], biology [44], chemistry [49], environmental science [13] and other disciplines, covering medicine and health, agriculture, the military, science and technology, education, the environment and areas such as monitoring, forecasting, detection, tracing, prevention and control, diagnosis, treatment and other key technical fields [66]. The main content related to biosafety is shown in Figure 1.
With the development of biotechnology and the advancement of globalization [18,30], the notion of biosafety has gradually become better defined. The international community's attention to biosafety is increasing, and prevention and emergency systems have been rapidly established [10,21,63]. Since the Cartagena Protocol on Biosafety was adopted in 2000 [19], many countries have become parties to it. Unfortunately, outbreaks of major infectious diseases [42] and biosafety incidents [69] have also brought major challenges to biosafety-related work.
As an important emerging aspect of global security, biosafety challenges come from many areas. Large-scale outbreaks of infectious diseases (Figure 1a) are the most crucial challenge. In the past 10 years alone, more than a dozen major international infectious disease incidents have occurred. In 2013, the H1N1 swine flu virus broke out. There were a final total of 12 033 laboratory confirmed cases, including 805 deaths [61]. The Ebola virus was first discovered in Central Africa in 1976. In 2014, an Ebola epidemic broke out again and rapidly spread almost completely out of control. The associated mortality rate was as high as 40.3%, causing a global panic. At present, it is still a priority disease of the WHO. It was not until November 2019 that the Ebola hemorrhagic fever vaccine was approved for the first time [62]. As of July 2015, no medication had been proven safe and effective for Ebola treatment. By August 2019, only two experimental treatments had been found to be 90% effective in treating this infectious disease [57].
Figure 1. The subdivisions of biosafety research include 'alien species invasion', 'emerging infectious diseases', 'superresistant bacteria', 'misuse of biotechnology', 'biological weapons and bioterrorism threats' and 'biological accident disclosure'. Medical papers do not fully cover all areas of biosafety research: for studies related to politics, the military, ethics, sociology and other disciplines, our data have little or no relevance. Among medicine-related biosafety studies, there is considerable overlap between different segments. We therefore present these results, based on the experimental data, in a more intuitive way. The topics covered in this paper are colored orange.
Similarly, human-induced biosafety accidents (Figure 1d) are a major challenge for biosafety [2]. Laboratory infections threaten the health of workers and may even cause accidental leakage of organisms, which can have a major impact on the global public health system [23,25]. During the 7-year span from 2004 to 2010, the US Centers for Disease Control and Prevention reported 727 biological media loss and leakage incidents, of which 639 were leakage incidents, most of them at Biosafety Level 3 (BSL-3) [69].
In addition, the international community faces multidrug-resistant bacteria (Figure 1c), a common biosafety problem [64]. Antimicrobial resistance (AMR) has had a significant impact worldwide. A working report published by the UK government in December 2014 predicted that drug-resistant bacteria would reduce the global gross domestic product (GDP) by 2-3.5% by 2050 [74]. In 2019, approximately 230 000 people died from infections involving drug-resistant bacterial strains in the United States, and the related treatment cost the US medical system more than $200 trillion [74]. Since some biosafety-related fields have little to do with biological research, we do not have a great deal of information on them.
With the development of biosafety research, related papers and data have been produced at a rapid pace. In PubMed (https://pubmed.ncbi.nlm.nih.gov/), 66 758 biosafety-related papers have been published in the last 10 years. For researchers, it has become a great challenge to quickly and effectively identify the latest research progress and results among these tens of thousands of publications. Extracting research hotspots and predicting trends are also difficult problems.
In response to these issues, we collected the abstracts of biosafety-related papers released in the past 10 years from PubMed and designed a novel topic model called LDAPR, which combines latent Dirichlet allocation (LDA), the affinity propagation (AP) algorithm and the hierarchical PageRank algorithm, to extract the topics of these abstracts. Figure 2 shows the overall framework and workflow of LDAPR. This framework greatly reduces the dependence on human knowledge and labor and produces accurate and high-quality results. In addition, by introducing medical subject heading (MeSH) term categories, we carried out classification and trend analysis, focusing on the categories of microorganisms, diseases, drugs, disciplines and regions, and we further explored the relationships between them.
Figure 2. Overall framework and workflow of the LDAPR model. The data were downloaded from the PubMed database and preprocessed in Python. The preliminary results were obtained through a model composed of LDA, AP and a hierarchical weighted PageRank algorithm. The results could then be used for multiple data analysis purposes such as retrieval, sorting and visualization.

Overview
In the following sections, we discuss the results from seven viewpoints: topic results, microorganism keywords, disease keywords, disciplinary keywords, regional keywords, research trends and challenges, and topics over the decade.

Topic results
First, we visualize the annual results with word clouds. Every word cloud represents a topic and contains 30 words. The higher the weight is, the larger the word. If we take the word clouds of 2020 as an example, as shown in Figure 5, there are 10 word clouds in total, and their centers are 'virus', 'food', 'plastic', 'pneumococcal', 'tolerance', 'utilization', 'flavonoid', 'cell' and 'epoxide'. Most words under the same topic are related. For example, the central word in the first word cloud is 'virus', and the secondary central words are 'infectious', 'patient', 'ZIKV [Zika virus]' and other virus-related words. As the culprit behind many diseases, viruses cause many serious biosafety problems [58]. Research on viruses is an important aspect of biosafety work. In addition, another central word, 'food', is a research hotspot in biosafety. Its representative topics include food safety-related words such as 'beef', 'milk' and 'chocolate'. Most historical biosafety issues are closely related to food safety [46]. Table 1 shows the central topic words over the years. Among them, 'cell', 'influenza', 'risk' and other words related to biosafety appear many times, while some words, such as 'island', appear only in a particular year (2017); we found relevant research hotspots on biosecurity and islands published in that year [47,70].
By analyzing data from the past 10 years, we obtained 122 clusters, each containing 30 words. After deduplication, 1788 unique words were obtained, and then we used MeSH terms to identify and classify the results. A total of 32 different categories were found. We conducted in-depth research on 12 of the main categories by analyzing the hotspots and trends. The next sections show analyses of the keywords by microorganism, disease, discipline and region.

Microorganism keywords
Keywords related to microorganisms appear most frequently in the results. We found references to 28 microorganisms. Figure 3 shows the microorganism keywords by year (to make the descriptions more accurate, we mainly used the scientific names of the microorganisms instead of the abbreviations or common names appearing in the result data). 'Salmonella' appears every year, while 'Listeria monocytogenes' appears eight times and 'Escherichia coli' appears five times in a decade. These three kinds of microorganisms are common pathogenic bacteria, and they are also the focus of biosafety research, as they have caused severe biosafety accidents throughout history [17,55,56]. In addition, the emergence of some microorganisms in the data is closely related to international emergent biosafety events, such as the appearance of 'Ebola virus' in 2014 and 2015 and 'Zika virus' in 2016 [32]. In particular, 'MERS-CoV' appeared in 2020 in addition to 2014 [45]. This phenomenon is likely related to the outbreak of COVID-19, which was also caused by a type of coronavirus [50]. Some microorganisms that do not directly cause human diseases, such as 'RHDV', also appear [16]. In addition, some keywords indicate other aspects of biosafety issues, such as 'cyanobacteria', which produce cyanotoxin. This toxin can harm water quality and threaten human health without adequate controls [51]. MERS-CoV and 2019-nCoV are viruses that spread through the air and can cause respiratory diseases [15,31].
To learn more regarding the research relevance of various microorganisms, we used the microorganisms obtained from the data as keywords to search PubMed, processed the data with LDAPR and then searched for other microorganism keywords in the results. Finally, we used the chord diagram in Figure 6 to show the results. We found that much microbial research involves other microorganisms. Thus, studying the correlation among various microorganisms is also an important part of biosafety research. We found that the co-occurrence of microorganism-related keywords is closely related to the similarity of the microorganisms' characteristics. For example, Listeria monocytogenes and Salmonella are food-borne pathogens that cause diarrhea and other symptoms [60].

Disease keywords
Biosecurity accidents are often accompanied by outbreaks of epidemic diseases, so we studied disease-related keywords and derived statistics on the results. Figure 4 shows the keywords in the disease field for each year. We divided diseases into three categories: 'human disease', 'zoonoses' and 'animal diseases'. Among them, items in the category 'zoonoses' appear most frequently, which means that most research is about zoonoses [39].
'Listeriosis' and 'paratyphoid' appear in all years. These are two common serious diseases caused by food safety problems. 'Influenza' appears eight times in 10 years. The prevention and control of influenza is thus an important topic in biosafety research [53].
Comparing these results with the annual occurrence of microorganisms reveals a strong correlation between diseases and microorganisms. For example, 'meningococcus' appears in the 2019 keywords, and 'epidemic cerebrospinal meningitis', which is caused by it, appears in the same year [38]. In addition, some diseases are not directly related to biosafety issues, but they are typical complications of certain diseases. For example, the items 'apoplexy' and 'acute myocardial infarction (AMI)' appearing in 2016 are typical complications of 'Ebola hemorrhagic fever' [43]. To further explore the relationship between diseases and microorganisms, we calculated the co-occurrence of such pairs in the past 10 years; the results are shown in Figure 7 as a heat map.
In addition to diseases directly caused by pathogenic microorganisms, the frequency of the co-occurrence of certain diseases and other microorganisms is very high. For example, influenza and some infectious viruses often appear simultaneously. Viral infections usually cause influenza or influenza-like illness. The flu season is also when viruses spread most rapidly. On the disease side, the occurrence patterns of Listeria infection and paratyphoid are very similar. We can speculate that there is a close relationship between these two diseases. Trichinella and tularemia also have similar distributions, but the frequency of Trichinella infection is significantly higher than that of tularemia. Based on this, we conclude that tularemia is more likely to occur in groups at high risk for Trichinella infection, which has been confirmed by related studies [11]. On the microbial side, Salmonella is strongly associated with many diseases, which indicates that Salmonella infections tend to make patients more susceptible to other pathogens or that Salmonella-infected people are often exposed to an environment suitable for multiple pathogens [9].

Disciplinary keywords
It is widely acknowledged that biosafety research involves a variety of disciplines. From the results, we select each year's disciplinary keywords for analysis; these are displayed in Figure 8.
Publications from 2013 and 2020 correspond to six disciplines, whereas those from 2019 cover only two; overall, the results in each year involve a variety of disciplines. 'Biochemistry' is associated with publications from 7 years (all except 2013, 2015 and 2018), 'microbiology' with publications from 7 years (all except 2013, 2018 and 2019) and 'pathology' with publications in 2018 only.

Regional keywords
Biosafety issues and related studies have strong regional characteristics. In Figure 9, we select and mark region-related keywords from the results on a world map and use different shades to represent the sums of the weights of regional keywords over the years to mark hotspots for biosafety research. Congo appears five times, while Australia, Indonesia and Saudi Arabia appear four times each. From the typical biosafety incidents marked in Figure 9, we can observe a strong correlation between mentions of these regions and significant biosafety incidents in the last decade [4]. For example, the regions involved in MERS are mainly in the Middle East, those involved in Ebola mainly in Central Africa, those involved in SARS mainly in East Asia and those involved in

Research trends and challenges
Biosafety research is developing continuously, with many new research results appearing every year. To identify development trends, we collected keywords for each year separately. If we take 2017 as an example, the topic results are shown in the first subtable in Table 2. Unlike Table 1, which shows the central words only approximately, Table 2 shows the combined statistical results for all topics in each year. 'Cell', 'gene', 'patient' and 'food' are the central keywords. We visualize the proportions of keyword categories in the annual results in Figure 10. From the results over the 10 years, we can observe the following: (1) 'Cell' always appears as a central keyword. It can be inferred that most of the research on biosafety is centered on cytology. (2) Words related to diseases, microorganisms, chemistry and disciplines appear in high proportions every year, indicating that these are long-term themes of biosafety research. (3) Biosafety is a discipline closely related to events. In addition to long-term research topics, there are some keywords related to public health emergencies, such as the keywords 'coronavirus', 'pneumoniae' and 'vaccine', which appear in 2020 and are closely related to the outbreak of COVID-19. (4) New research findings also bring about changes in research trends. For example, we found the term 'pollen' in the results for 2016 and 2017, indicating that research on pollen in the field of biosecurity may have delivered new findings. We then found evidence to support this speculation: the article 'Pollen-mediated gene flow and seed exchange in small-scale Zambian maize farming, implications for biosafety assessment' published in the journal Nature in October 2016 studied the impact of self-organized ecological factors (pollen flow) on biosafety [7].

Topics over the decade
To study the distribution of topics over the decade, we used LDAPR to process all the data and obtained 12 central topics. Then, we sorted these topics with their PageRank values. The results are shown in Figure 11.
Each word cloud represents a topic, and each topic corresponds to one or more of the topics in Figure 1. In this way, we can identify the latest trends and hotspots in biosafety-related research. For example, the central words of the first topic are 'influenza', 'antiviral', 'virus', 'RHDV' and 'MERS-CoV', which are related to viruses and infections. It is well known that viruses and infections are primary topics in biosafety research. The second topic is probably related to cells and genetic diseases. With the development of modern medicine, diagnosis and treatment technology from the perspective of cells and genes has achieved remarkable results [12]. Biosafety research is increasingly turning to methods based on this series of modern medical technologies. The third topic is related mostly to animals. Food safety issues are related to the stability and development of society, and quarantines of meat products (such as pork, beef and mutton) are an important topic in biosafety research [65]. With the rapid development of traditional meat production, processing industries and new meat products, related biosafety research is also facing new challenges. The fourth topic is associated with the environment. With the development of modern society, human activities have brought new challenges to environmental governance. Leakages of toxic and harmful biochemical agents and invasions

Research on COVID-19
To validate the ability of our model to reflect emergency biosafety incidents, we performed an analysis of COVID-19-related research publications. We align the results with the COVID-19 knowledge graph from [28] and show some of the content related to COVID-19 in Figure 12. The hit words in our model are concentrated mainly around COVID-19. From the above results, we can observe the following: (1) Words directly related to COVID-19, such as 'coronavirus', 'virus', 'pneumonia' and 'lung', appear with higher weights, which indicates that our model can accurately reflect hotspot events. Based on this feature, we can discover more keywords strongly related to COVID-19 to support future research. (2) Weakly related words such as 'infection', 'spread' and 'RNA' also appear in the results with lower weights, which indicates that the weight distribution of our model is reasonable. Therefore, the weights of keywords can be used as indicators of their importance to assist with decision-making in COVID-19 research. (3) Indirectly related words such as 'MERS', 'SARS', 'vaccine' and 'influenza' have also been successfully mined, which shows the comprehensiveness of our model for information extraction. Therefore, the COVID-19 keywords output by the model do not simply reflect explicitly related information but can also cover implicitly related information.

Method
As shown in Figure 2, our model accepts as input preprocessed paper abstracts crawled from PubMed with Biopython (https://biopython.org/). First, we use LDA to obtain preliminary topic results [5], train a word embedding model from the corpus with the Word2Vec language model [35,41] and then process the topic results to construct a graph. Second, the topic-level weighted PageRank algorithm is used to rank these topics and remove noisy information [40]. Third, we use the AP clustering algorithm to obtain the clustering centers of the filtered results. Fourth, the keywords of each center are separately constructed as a graph, and then the word-level weighted PageRank algorithm is used to rerank the keyword results. Finally, we use tools to organize and visualize the results to identify development trends and research hotspots in biosafety research. We introduce the algorithms and overall framework of the model in the next sections.

Latent Dirichlet allocation
LDA, a well-known unsupervised probabilistic model for text topic extraction based on the Bayesian model, was proposed by Blei et al. in 2003 [6]. Its main purpose is to extract the hidden topics of each document from large-scale complex text and to describe each topic with a set of words, thereby capturing the key information of the text content [24,68].
LDA divides the text content into three levels, namely, documents, topics and words, and expresses each document as a distribution of topics:

$$p(w \mid d) = \sum_{t} p(w \mid t)\, p(t \mid d),$$

where w, t and d represent words, topics and documents, respectively. This process can be further broken down by word position: for document m, $N_m$ represents the length of the document, and $z_{m,n}$ is the topic generated by document m for its n-th word.
Figure 11. Decade word clouds.
To obtain the final topic results, we chose the Gibbs sampling method based on Markov chain Monte Carlo (MCMC) to estimate the LDA model [52].
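As a rough illustration of this estimation step, the sketch below implements collapsed Gibbs sampling for LDA in plain Python. It is not the authors' implementation: the function name `lda_gibbs`, the toy corpus and the iteration count are assumptions made for the example, and the default priors reuse the values α = 0.15 and β = 0.01 from the hyperparameter settings on the assumption that they are the Dirichlet priors of LDA.

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, alpha=0.15, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids.

    Hypothetical helper for illustration only. Returns (theta, phi), where
    theta[d][t] estimates p(t | d) and phi[t][w] estimates p(w | t).
    """
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                 # words of doc d in topic t
    topic_word = [[0] * vocab_size for _ in range(n_topics)]   # times word w is in topic t
    topic_total = [0] * n_topics                               # words assigned to topic t
    assignments = []

    # Start from a random topic for every word position.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            t = rng.randrange(n_topics)
            z_d.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        assignments.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                # Remove the current assignment from the counts.
                t = assignments[d][n]
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # Unnormalized conditional p(z = k | all other assignments).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                t = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][n] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1

    # Smoothed point estimates of the two distributions.
    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + n_topics * alpha) for k in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta) for w in range(vocab_size)]
        for k in range(n_topics)
    ]
    return theta, phi


# Toy corpus (an assumption): word ids 0-2 and 3-5 never co-occur.
toy_docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [5, 4, 3, 5]]
theta, phi = lda_gibbs(toy_docs, n_topics=2, vocab_size=6)
```

Each row of `theta` and `phi` is a probability distribution, so the 30 highest entries of a row of `phi` would play the role of the 30 words shown in one word cloud.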

PageRank algorithm
PageRank is a webpage ranking algorithm proposed by Page et al. in 1998 [48]. It measures the importance of pages from the hyperlinks between them by iteratively updating their PageRank values. That is, PageRank treats the hyperlinks between pages as votes and ranks the pages by the number of votes they receive. In our method, we assign a separate weight $w_{i,j}$ to each hyperlink [27,71]. The update process runs over K nodes, where $PR_i^l$ is the value of node i with $N_i$ neighbors after l updates, α is the damping factor and $L_i$ is the number of hyperlinks from page i.
In general, the PageRank algorithm is implemented with more practical matrix operations that achieve the same effect. The PageRank value of each node is updated continuously until convergence, and the converged values give the final ranking result.
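A minimal sketch of a weighted PageRank update is given below. It assumes the common form in which node i receives (1 − α)/K plus α times the weighted share of each in-neighbor's current value, with each node's outgoing weights normalized by their sum; the exact normalization used in LDAPR is not recoverable from the text, so this is one plausible reading rather than the authors' formula. The function name `weighted_pagerank` and the toy graph are hypothetical, and the defaults α = 0.85 and 50 iterations are the topic-level settings reported in the hyperparameter section.

```python
def weighted_pagerank(weights, alpha=0.85, n_iter=50):
    """Iterative weighted PageRank on a dense weight matrix.

    weights[j][i] is the weight of the edge from node j to node i (0 if absent).
    Assumed normalization: each node splits its value across its outgoing
    edges in proportion to their weights.
    """
    k = len(weights)
    out_sum = [sum(row) for row in weights]
    pr = [1.0 / k] * k  # uniform starting values
    for _ in range(n_iter):
        # Nodes without outgoing edges spread their value uniformly.
        dangling = sum(pr[j] for j in range(k) if out_sum[j] == 0)
        new_pr = []
        for i in range(k):
            incoming = sum(
                pr[j] * weights[j][i] / out_sum[j]
                for j in range(k)
                if out_sum[j] > 0
            )
            new_pr.append((1 - alpha) / k + alpha * (incoming + dangling / k))
        pr = new_pr
    return pr


# Toy undirected graph (an assumption): node 0 is strongly tied to nodes 1 and 2.
toy_weights = [
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.1],
    [1.0, 0.1, 0.0],
]
ranks = weighted_pagerank(toy_weights)
```

Because every step redistributes the full probability mass, the returned values sum to 1, and the better-connected node 0 receives the highest rank in this toy graph.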

Affinity propagation clustering
AP is a graph-clustering algorithm proposed in 2007 [20]. It selects 'exemplars' through 'message passing' among nodes. Unlike traditional clustering methods, it uses existing data points as 'exemplars' to represent the corresponding categories. AP clustering has achieved good results in many fields [22,37,73]. Among the many clustering algorithms we tried, AP was the best choice based on both the principle and the experimental results. Moreover, AP does not require the number of centers to be specified and is insensitive to the initial values, making it very suitable for text data with high dimensionality and uncertainty.
There are four main parameter matrices in the AP algorithm: similarity s(i, k), preference p(k), responsibility r(i, k) and availability a(i, k), where s(i, k) is the similarity between node i and node k, represented here using the Euclidean distance, and p(k) is the similarity value when i = k. In the initial state, all responsibility and availability values are initialized to 0; they are then updated iteratively by message passing, with λ as the damping factor. After the iterative update, we calculate the sum a(i, k) + r(i, k) for each pair of nodes. For node i, let k′ be the value of k that maximizes a(i, k) + r(i, k): if i = k′, node i is a cluster center; otherwise, node i belongs to center k′.
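The message-passing loop can be sketched as follows, using the standard update rules of Frey and Dueck, which we assume are the ones intended here. Several choices are assumptions for the example rather than details from the paper: similarity is taken as the negative Euclidean distance, the preference is set to the median off-diagonal similarity, and the function name `affinity_propagation` and the toy points are hypothetical. The damping value 0.95 is the one reported in the hyperparameter settings.

```python
import math
from statistics import median


def affinity_propagation(points, damping=0.95, n_iter=1000):
    """Plain-Python affinity propagation; returns the exemplar index of each point."""
    n = len(points)
    # s(i, k): negative Euclidean distance (assumed form of the similarity).
    s = [[-math.dist(p, q) for q in points] for p in points]
    # p(k): shared preference on the diagonal, here the median similarity.
    pref = median(s[i][k] for i in range(n) for k in range(n) if i != k)
    for k in range(n):
        s[k][k] = pref
    r = [[0.0] * n for _ in range(n)]  # responsibilities
    a = [[0.0] * n for _ in range(n)]  # availabilities

    for _ in range(n_iter):
        # r(i,k) <- s(i,k) - max_{k' != k} [a(i,k') + s(i,k')], damped by lambda.
        for i in range(n):
            for k in range(n):
                best_other = max(a[i][kk] + s[i][kk] for kk in range(n) if kk != k)
                r[i][k] = damping * r[i][k] + (1 - damping) * (s[i][k] - best_other)
        # a(i,k) <- min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k))) for i != k
        # a(k,k) <- sum_{i' != k} max(0, r(i',k)), both damped by lambda.
        for k in range(n):
            positive = [max(0.0, r[i][k]) for i in range(n)]
            total = sum(positive) - positive[k]
            for i in range(n):
                if i == k:
                    target = total
                else:
                    target = min(0.0, r[k][k] + total - positive[i])
                a[i][k] = damping * a[i][k] + (1 - damping) * target

    # Node i belongs to the k that maximizes a(i,k) + r(i,k).
    return [max(range(n), key=lambda k: a[i][k] + r[i][k]) for i in range(n)]


# Toy data (an assumption): two well-separated groups of three points.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = affinity_propagation(toy_points)
```

A point whose label equals its own index is an exemplar, which mirrors the rule that node i is a cluster center when i = k′. The number of clusters is not passed in; it follows from the preference value.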

Overall framework
We obtained the abstracts of 66 758 related papers published during the past 10 years from the biomedical database PubMed. The keywords included 'biosafety', 'bio-safety', 'biosecurity', 'bio-security', 'biological safety' and 'biohazard'.
The statistical results show that with the development of biotechnology and increased social attention to biosafety-related fields, related research is also increasing. Figure 14 shows the number of related papers published over the years, revealing an increasing year-on-year trend.
The original data obtained from PubMed are natural language text and contain a great deal of noisy information. To achieve better results for subsequent work, we preprocess the data with natural language processing tools (such as term frequency inverse document frequency [TF-IDF], dictionary building and bag-of-words [BOW] models).
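To make the preprocessing step concrete, the sketch below computes TF-IDF weights over bag-of-words counts in plain Python. The tokenizer, the function names `tokenize` and `tfidf` and the three toy abstracts are assumptions for illustration; the paper does not specify which tools or weighting variant were used.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and keep alphanumeric tokens of at least two characters."""
    return re.findall(r"[a-z][a-z0-9-]+", text.lower())


def tfidf(docs):
    """Return one {word: weight} dict per document.

    Uses term frequency times log(N / document frequency), one common variant.
    """
    tokens = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for doc in tokens for word in set(doc))
    weights = []
    for doc in tokens:
        counts = Counter(doc)  # bag-of-words counts for this document
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


# Toy abstracts (an assumption, not data from the study).
toy_abstracts = [
    "Ebola virus outbreak response",
    "Zika virus outbreak response",
    "Food safety response",
]
toy_weights = tfidf(toy_abstracts)
```

Words that occur in every document receive weight 0 and can be dropped as noise, while words specific to a few documents are kept for topic modeling.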
Next, we introduce the main part of the model. The structure and flow of this part are shown in Figure 13. We obtain T topics from the data with the LDA model, with K words in each one, and we use the corpus to train a Word2Vec language model to obtain the word embedding vector of each word. Finally, the word embedding vectors of all topic words are weighted using the weights assigned by the LDA model and summed along each dimension to form the topic vector:

$$\nu^{\mathrm{topic}}_{t} = \sum_{k=1}^{K} w_{t,k}\, \nu^{\mathrm{word}}_{t,k},$$

where $\nu^{\mathrm{topic}}_{t}$ represents the vector of the t-th topic and t = {1, 2, ..., T}. K represents the number of keywords selected under each topic, and $w_{t,k}$ and $\nu^{\mathrm{word}}_{t,k}$ represent the weight and word vector of the k-th keyword of the t-th topic, respectively. Thus far, we have obtained the representation of each topic.
We connect all topics in pairs and use the cosine similarity between their topic vectors as the weights of the edges to construct an undirected weighted graph:

$$C_{h,t} = \frac{V_h \cdot V_t}{\lVert V_h \rVert\, \lVert V_t \rVert},$$

where $C_{h,t}$ represents the weight of the edge between topic vector nodes $V_h$ and $V_t$, with h = {1, 2, ..., T}, t = {1, 2, ..., T} and h ≠ t. Then, we construct the topic graph $G_t$ and delete the edges with lower weights according to the threshold $K_t$. In the initial state, the node values $O_t$ and the edge weights $E_t$ are thus defined for all T topics. Next, we use topic-level weighted PageRank to iteratively update this network and remove low-ranking topics. This step aims to remove noisy topics as much as possible before further processing. Then, we use the AP algorithm to cluster the remaining topics and select the central topic of each category as its representative. Subsequently, we decompose each central topic back into its set of topic words and construct word networks with weighted edges from Equation 12 according to the cosine similarity between word vectors. Similar to the topic graph construction method, we use a threshold $K_w$ to filter out edges with low weights. Then, word-level weighted PageRank is used to rerank the topic words to remove noise and obtain the final results. Unlike the previous topic-level PageRank, here, we take words as nodes to build a separate graph for each topic. Finally, we use word clouds, time series graphs, radar graphs, chord graphs, heat maps, world maps, etc., to visualize the results from multiple angles and select representative results for visualization purposes. More details can be viewed on our homepage (https://www.keaml.cn/Biosafety/).
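The graph construction described above can be sketched in a few lines. The helper names `topic_vector`, `cosine` and `build_topic_graph`, the toy vectors and the threshold value 0.5 are all hypothetical. The uniform initial node value 1/T and the rule that edges below the threshold are set to 0 are assumptions about the initial state, which the text does not spell out.

```python
import math


def topic_vector(weights, word_vectors):
    """Weighted sum of the word vectors of one topic, dimension by dimension."""
    dim = len(word_vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, word_vectors)) for d in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_topic_graph(vectors, threshold):
    """Return (node_values, edge_weights) for an undirected topic graph.

    Edges whose cosine similarity falls below the threshold are removed.
    Node values start at 1/T, an assumed uniform initialization.
    """
    t = len(vectors)
    edges = [[0.0] * t for _ in range(t)]
    for h in range(t):
        for j in range(h + 1, t):
            c = cosine(vectors[h], vectors[j])
            if c >= threshold:
                edges[h][j] = edges[j][h] = c
    nodes = [1.0 / t] * t
    return nodes, edges


# Toy topic vectors (an assumption): topics 0 and 1 are near-duplicates.
toy_vectors = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
nodes, edges = build_topic_graph(toy_vectors, threshold=0.5)
```

The resulting edge matrix can be passed directly to a weighted PageRank routine; the same three helpers apply unchanged to the word-level graphs, with word vectors in place of topic vectors.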

Hyperparameter settings
We tune the hyperparameters of the model based on perplexity [72], and we show the impact of two key factors, the numbers of topics and iterations, on perplexity in Figure 15. For topic models, perplexity is a commonly used metric. Intuitively, perplexity represents how ambiguous the topic results are, so we choose the hyperparameters that give the model a lower perplexity. In our model, we set α = 0.15 and β = 0.01. For the AP model, the damping factor is set to 0.95 after we test many alternatives. For the PageRank models, the topic-level damping factor is α t = 0.85, the word-level damping factor is α w = 0.45 and by observing the convergence of the PageRank value, we set the model to iterate 50 times for both levels.
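For reference, perplexity under a topic model is commonly computed as the exponential of the negative average log-likelihood per word, with p(w | d) = Σ_t p(t | d) p(w | t). The sketch below follows that standard definition; whether the authors used exactly this form, or a held-out variant, is an assumption, and the function name `perplexity` and the toy inputs are hypothetical.

```python
import math


def perplexity(docs, theta, phi):
    """Perplexity of documents (lists of word ids) under a topic model.

    theta[d][t] is p(t | d) and phi[t][w] is p(w | t); lower is better.
    """
    log_likelihood = 0.0
    n_words = 0
    for d, doc in enumerate(docs):
        for w in doc:
            # Mixture probability of word w in document d.
            p_word = sum(theta[d][t] * phi[t][w] for t in range(len(phi)))
            log_likelihood += math.log(p_word)
            n_words += 1
    return math.exp(-log_likelihood / n_words)


# Sanity check with a model that is uniform over a four-word vocabulary.
uniform_theta = [[0.5, 0.5]]
uniform_phi = [[0.25] * 4, [0.25] * 4]
uniform_ppl = perplexity([[0, 1, 2, 3]], uniform_theta, uniform_phi)
```

A model that spreads probability uniformly over V words has perplexity V, so values well below the vocabulary size indicate that the topics concentrate probability on the words that actually occur.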

Performance comparison
To demonstrate the effectiveness of our method, we selected Random, LDA and PLSA as the baseline models to compare with LDAPR. Comparing the models by perplexity, we find that LDAPR shows significant and consistent improvement over the other methods. To verify the effectiveness of each component of LDAPR, we conduct ablation experiments on the language models (BERT [33], BioBERT [36] and pretrained FastText [8]) and clustering algorithms (K-means [34,67]; we choose the point closest to the center as the center point). The perplexity results (×10^3) of -Kmeans, -Pretrained FastText, -BERT, -BioBERT and LDAPR are 0.386, 0.242, 0.324, 0.207 and 0.125, respectively. Although BERT is a strong language model, it does not perform well here due to the constraints of the task and corpus. Compared with the other variants, our model achieves the best performance. The experimental results demonstrate that the selection of each component is reasonable and effective.

More information
To discover more information, we expand the numbers of topics and words. Specifically, we use the selected keywords to filter the results and calculate the weight W of each keyword from the topic and keyword weights, where T is the number of topics, $W_t$ is the weight of topic t, $W_{t,k}$ is the weight of keyword k in topic t and τ is the temperature parameter.
Figure 16. 'Exemplars' of trend knowledge patterns captured by the models for bioinformatics frameworks or tools. Here, 'Matlab' stands for the toolboxes contained in Matlab, and we refer to these toolboxes and other deep learning libraries as 'research tools'.
Taking bioinformatics frameworks and tools as an example, we use relevant keywords to search, and the results are presented in Figure 16. Through trend analysis, we can clearly capture trends in biosafety research tools. In the early stage, Matlab occupied the core position among research tools. During this period, researchers used mainly the data analysis methods and modeling toolboxes in Matlab for data mining. Since 2015, with the rapid rise of deep learning, researchers have begun to pay attention to the use of deep learning frameworks such as TensorFlow for pattern recognition. Google promoted TensorFlow during this period. Although deep learning frameworks generally show an upward trend, Theano was no longer popular after 2016, mainly because the Montreal Institute for Learning Algorithms (MILA) stopped supporting it at that time.

Conclusion
In this paper, we proposed a novel LDAPR framework based on LDA, AP and the hierarchical PageRank algorithm to summarize and analyze trends in biosafety research over the past decade. We processed 66 758 papers in related fields and visualized and analyzed the results. Studies in related fields corroborate our results, which supports the comprehensiveness and accuracy of our method. The research objects of this experiment covered many areas. Among them, we focused on the microorganisms, diseases and disciplines associated with biosafety research. We discovered many implicit connections among these categories and extracted valuable information that is expected to be useful for research. We believe that the LDAPR model can play a guiding role in trend analysis for bodies of biosafety literature and can shed light on the future directions of biosafety research. For example, it can be seen from the results that research on microorganisms, especially infectious disease viruses, has gradually become an important research hotspot. Genes and vaccines are increasingly becoming the topics of greatest concern in biosafety research.

Key Points
• With the outbreak and spread of COVID-19, biosafety has become a global hot topic, and a large number of related studies have emerged.
• Compared with other research domains, biosafety research covers more fields and more disciplines. Therefore, it is difficult to systematically organize and summarize the enormous number of related research papers.
• We developed a novel topic model: LDAPR. We utilized this model to process biosafety-related papers in the past 10 years and to discover and analyze trends and hotspots.
• We discovered a large number of implicit relationships in the data and demonstrated their authenticity and accuracy via relevant studies.
• The proposed model can also be applied to many research fields and can provide valuable information for future research.

Data availability
All data relevant to the study are included in the article or uploaded as supplementary information.