A method for systematically surveying data visualizations in infectious disease genomic epidemiology

Data visualization is an important tool for exploring and communicating findings from genomic and health datasets. Yet, without a systematic way of understanding the design space of data visualizations, researchers do not have a clear sense of what kinds of visualizations are possible, or how to distinguish between good and bad options. We have devised an approach using both literature mining and human-in-the-loop analysis to construct a visualization design space from a corpus of scientific research papers. We ascertain why and what visualizations were created, and how they were constructed. We applied our approach to derive a Genomic Epidemiology Visualization Typology (GEViT) and operationalized our results to produce an explorable gallery of the visualization design space containing hundreds of categorized visualizations. We are the first to take such a systematic approach to visualization analysis, which can be applied by future visualization tool developers to areas that extend beyond genomic epidemiology.


Introduction
Cheaper and more accurate genomic sequencing technologies are enabling public health decision makers, from doctors to epidemiologists to researchers to policy makers, to make more informed, near real-time, data-driven decisions toward pathogen diagnosis [1], routine surveillance [2,3], and public health interventions [4]. Yet as pathogen genomic data become more ubiquitous and are combined with other sources of routinely collected public health data, analysts and decision-makers are forced to confront the dimensionality challenges that attend such "big data", with interpretability of results being chief amongst them.

Data visualization is an emergent solution to address interpretability challenges. It has been shown to improve comprehension of numerical results in medical risk communication [5,6], but that context is much less complex than the heterogeneous datasets used in modern genomic epidemiology, which can include, amongst other things, genomic, patient, clinical, epidemiological, and geographic data elements. While the rise of public health genomics has been met with concrete efforts to visualize 'omics data [7], including Nextstrain [8] and Microreact [9], few of these visualizations have been tested with target end-users to assess a visualization's utility and usability in decision-making contexts [10]. What is absent is a notion of a visualization design space, that is, the combinatorial space of visualizations that can be produced using basic graphical primitives (points, lines, areas) and aesthetic properties (position, color, size, and so on) to depict input data, and a way to systematically construct and analyze this design space to inform the design and evaluation of public health genomic data visualizations.
Design spaces are common in a number of disciplines, ranging from architecture to computer

occurred within and between pathogen topic clusters, and manually annotating those bigrams to map to some a priori concept; for example, the bigram "vancomycin resistance" was mapped to the concept of "drug resistance" (Table S2). We mapped a total of 23 a priori concepts to 404 bigrams, categorized into three groups: genomic concepts (drug resistance, genome, genotype, molecular biology, pathogen characterization, phylogeny, and population diversity); epidemiology concepts (clusters, disease reservoirs, geography, outbreaks (international, community, hospital), surveillance, transmission, vaccine, and vectors); and medical concepts (clinical, cancer, diagnosis, outcome, and treatment). Some bigrams were not mapped to a priori concepts, often because they were standard technical writing phrases (e.g. "statistically significant", "data show"). A priori concepts did not occur uniformly across pathogen clusters (Figure S4A) and a variable number of bigrams mapped to individual a priori concepts, with 143 bigrams mapped to "drug resistance" and only one bigram mapped to "disease reservoirs" and topic clusters (Figure S4B).

Document sampling was stratified according to pathogen and a priori concepts

We then performed two rounds of stratified sampling using pathogens and a priori concepts as strata. The sampling resulted in 204 unique articles, to which we manually added 17 additional articles that we deemed contained interesting data visualizations (these are clearly tagged in our analysis), for a total of 221 articles (Table S3).

used whole figures and did not split them up into smaller parts. We began by classifying the types of charts in figures, then extended the analysis to how charts were combined, and finally to how charts were enhanced.
We found that these three descriptive axes allowed us to sufficiently describe all visualizations in our dataset (see Online Methods for detailed sufficiency criteria). For each of these descriptive axes we also derived a controlled vocabulary (taxonomy). Collectively, we refer to the descriptive axes and their associated taxonomies as GEViT (Genomic Epidemiology Visualization Typology). Below, we describe each of GEViT's descriptive axes and interleave descriptive statistics to show the distribution of taxonomic codes across these axes to provide an overview of the visualization design space. We also operationalized our analysis to produce a browsable gallery (https://gevit.net) that allows others to explore this GEViT design space through the classified

Spatial; Tree; and Genomic. We compiled a taxonomy of common chart names to classify specific instances of chart types within each class. When applicable, we also defined special cases of a specific chart; for example, epidemic curves are a special case of bar chart. We also defined one 'Other' category, which included entities that accompanied data visualizations but were not themselves data visualizations, such as tables and images, and miscellaneous visualizations that did not fit elsewhere. In total we observed 23 distinct chart types (plus one miscellaneous category), and found that the most commonly occurring types within data visualizations included Phylogenetic Trees (17.7% of all data visualizations, although some type of tree was present in 23.7% of all visualizations), followed by Tables (9.7%), Bar Charts (8.9%), Genomic Maps (6.9%), Line Charts (6.8%), and Images (5.7%, typically a Gel Image of Pulsed Field Gel Electrophoresis).
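As a hedged illustration of how shares like those above could be tallied, the sketch below counts chart types over a toy set of classified figures. The records, the chart names attached to each class, and the function names are hypothetical; they are not taken from the GEViT analysis notebooks, and the numbers produced are not results of the study. The sketch also shows why the share of figures containing any chart of a class (any tree) can differ from the share of one specific chart type (phylogenetic tree).

```python
from collections import Counter

# Hypothetical toy annotations: each figure lists (chart_class, chart_type) pairs.
figures = {
    "fig1": [("Tree", "Phylogenetic Tree"), ("Other", "Table")],
    "fig2": [("Genomic", "Genomic Map")],
    "fig3": [("Tree", "Dendrogram"), ("Other", "Gel Image")],
    "fig4": [("Tree", "Phylogenetic Tree")],
    "fig5": [("Spatial", "Geographic Map")],
}


def type_shares(figs):
    """Percentage of all charts accounted for by each specific chart type."""
    counts = Counter(chart_type for charts in figs.values() for _, chart_type in charts)
    total = sum(counts.values())
    return {chart_type: 100 * n / total for chart_type, n in counts.items()}


def class_presence(figs, chart_class):
    """Percentage of figures that contain at least one chart of the given class."""
    hits = sum(any(c == chart_class for c, _ in charts) for charts in figs.values())
    return 100 * hits / len(figs)


if __name__ == "__main__":
    for chart_type, share in sorted(type_shares(figures).items(), key=lambda kv: -kv[1]):
        print(f"{chart_type}: {share:.1f}% of charts")
    print(f"Figures with any tree: {class_presence(figures, 'Tree'):.1f}%")
```

Keeping the class and the specific type as separate fields is what lets the two kinds of percentage be computed from the same annotations.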
See Figure S5

contained multiple chart types that were spatially aligned, for example, a heatmap and tree (dendrogram) that are spatially aligned to indicate both a hierarchical clustering and the underlying data for the clustering. A tree and heatmap can also be visualized independently of each other, but their combined value is evidently relevant for many researchers. Small Multiples (17.3%) showed different aspects of the data through multiple instances of the same chart type. Many Types Linked combinations (13.5%) used multiple different chart types that were visually linked, for example using a common color to denote some property of the data across the different charts, but not spatially aligned (in contrast to Composite charts). Finally, Many Types General combinations (8.8%) describe a data visualization in which there are multiple chart types, and there does not appear to be any sort of spatial or visual link between them. This situation often arises when authors put many unrelated charts into a single figure due to space restrictions. It was not always straightforward to distinguish between some instances of Many Types Linked and Many Types General, and in such cases we resolved the ambiguity in favor of the latter classification. We also observed instances of Complex Combinations (11.9%) that developed data visualizations using two of the previously described types of chart combinations. It was notable that trees were most commonly combined with other chart types.

often black; however, those lines can be re-encoded to incorporate data from some additional source, for example, coloring lines according to geographic regions.
Instead of re-encoding a mark, it is also possible to add marks to the base chart type, for example, adding colored point marks to a tree's leaf positions (Figure 5b), or to add linear brackets and text to delineate groups (the most common reason text and lines with bracket shapes are used in our corpus). We did not consider axis text, titles, or data labels to be added marks, subsuming them as constituent parts of the base chart type.

It is also possible to add more complex types of marks, which are specific instances of the basic mark types presented in Figure 5a. Connection marks are a specific instance of line marks that connect two other marks. Containment marks are a specific instance of area marks that enclose other marks. Finally, a glyph is a complex mark that could itself be a type of chart, but that is smaller than the base chart type and embedded within it (in contrast, we define composite chart types as having the same frame size, with neither chart embedded within the other). The only glyph we identified within our dataset was a pie chart, which was often added to geographic maps or node-link graphs (Figure 5b) to denote proportion variability in the data.
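The relationships between basic marks, complex marks, and the two ways of enhancing a base chart can be written down as a small data model. The sketch below is only one possible encoding: the class and field names are hypothetical, the inclusion of text among the basic marks is an assumption (Figure 5a is not reproduced here), and treating a glyph as having no underlying basic mark is our reading of the description rather than a definition from GEViT.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Union


class BasicMark(Enum):
    POINT = "point"
    LINE = "line"
    AREA = "area"
    TEXT = "text"  # assumed to be a basic mark for this sketch


class ComplexMark(Enum):
    CONNECTION = "connection"    # a line mark that connects two other marks
    CONTAINMENT = "containment"  # an area mark that encloses other marks
    GLYPH = "glyph"              # a small chart embedded in the base chart


# Basic mark that each complex mark specializes; a glyph is itself a chart,
# so this sketch maps it to None (an assumption, see the lead-in).
SPECIALIZES = {
    ComplexMark.CONNECTION: BasicMark.LINE,
    ComplexMark.CONTAINMENT: BasicMark.AREA,
    ComplexMark.GLYPH: None,
}


class Action(Enum):
    ADD = "add"              # a new mark is placed on the base chart
    RE_ENCODE = "re-encode"  # an existing mark's appearance carries extra data


@dataclass(frozen=True)
class Enhancement:
    base_chart: str
    action: Action
    mark: Union[BasicMark, ComplexMark]
    note: str = ""


def underlying_mark(mark: Union[BasicMark, ComplexMark]) -> Optional[BasicMark]:
    """Return the basic mark behind a mark, or None for a glyph."""
    if isinstance(mark, BasicMark):
        return mark
    return SPECIALIZES[mark]


examples = [
    Enhancement("Phylogenetic Tree", Action.RE_ENCODE, BasicMark.LINE,
                "branch color shows geographic region"),
    Enhancement("Phylogenetic Tree", Action.ADD, BasicMark.POINT,
                "colored points at leaf positions"),
    Enhancement("Geographic Map", Action.ADD, ComplexMark.GLYPH,
                "pie chart showing proportions"),
]
```

Each example mirrors a case described in the text: re-encoding tree lines by region, adding points at tree leaves, and adding a pie-chart glyph to a map.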

We differentiate between the instances when chart enhancements are added consistently, or just as one-off marks. When the addition or re-encoding of marks is applied consistently to the base chart type, for example re-encoding all or many lines in a tree, or adding points to all or many leaf nodes, we defined these as structured enhancements. Adding one-off marks, even if they are driven by the data or the addition of some arbitrary ink, was considered to be an annotation and defined as an unstructured enhancement. It was not always easy to differentiate between structured and unstructured enhancements, and in such cases we resolved ambiguities by choosing structured enhancement when analyzing figures.

In our dataset we observed that most figures were enhanced (83.8% of all chart types), typically through the addition of lines, points, or text (59.6%), while re-encoding of marks was less common (45.6%). The use of text as a graphical mark with aesthetic properties that can be manipulated to convey information was common in our dataset, either by adding text marks to a base chart type, or re-encoding of text labels by manipulating the font face. The text itself ranged from the very simple case of a single letter or number, to a full word, to a complex concatenated string of metadata such as specimen ID, location, and year. Annotations were also less common (33.6%), and were most commonly an arrow to text, or a containment mark that highlighted only

Although our approach will surely benefit from ongoing innovations in image recognition, machine learning, and natural language processing, we argue that attempting to fully automate the entire process would be premature. Developing a faster process that still provides a way to include a human in the analysis loop will be fruitful future work for us.

There are many other ways that our resulting design space could be explored, and for brevity we have only touched upon a few selected findings. Nevertheless, these results have allowed us to appreciate the expressiveness of visualization designs in infectious disease genomic epidemiology. Our results provide guidance to both software tool developers, including bioinformaticians, and to researchers engaged with creating their own visualizations: we provide a concrete terminology for describing data visualizations, and a source of inspiration through the

Figure 3 Chart Types in GEViT. We used common names for chart types and separated them into seven main classes and one Other class. Special cases of chart types were defined only when there were multiple instances of the same specific chart across our dataset. Chart types marked with an asterisk (*) are included in the analysis through manually added articles.
Chart labels recovered from Figure 3: Other Charts, Table, Image, Gel Image, General Image, Category Stripe, Miscellany, Composition Plot.

Figure 4 Chart Combinations in GEViT. The six combination types differ based on the number of chart types, the number of charts, and the approach to linking them together.

Search Terms. We searched for articles related to infectious disease genomic epidemiology that were published within the past ten years. We used two queries, 1) (genome AND (outbreak OR pandemic OR epidemic)) OR "genomic epidemiology" and 2) (genomic epidemiology OR molecular epidemiology) AND (bacteri* OR vir* OR pathogen) AND Genome, and combined their results, retaining only unique records for further analysis.

there were any). Titles and abstracts were decomposed into single terms, stemmed, and filtered as described in the Adjutant paper. We calculated the term frequency-inverse document frequency (tf-idf) metric for each term and created a sparse Document Term Matrix (DTM) for further analysis. A separate dataset of bigram terms was also prepared but used only for purposes of linking articles to a priori concepts (see Main text).

Unsupervised Clustering. We used the t-SNE and hdbscan algorithms to perform an unsupervised clustering using the DTM. While numerous sources advise against clustering on t-SNE results, we found that on large document corpora this approach worked well, as we verified with the validity checks described below. We used the Barnes-Hut implementation of t-SNE [21], which allows for some acceleration at the cost of accuracy, with the perplexity parameter set to 100 and otherwise default parameters of the R package implementation [22]. We then used hdbscan [23] on the t-SNE coordinates to derive the topic clusters. Clusters are sensitive to the minimum number of cluster points (minPts) parameter supplied to hdbscan, and so we tried different minPts values (50, 75, 100, 125, 150, 250, 500, 1000), observing how the cluster compositions changed.
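One way to compare cluster compositions across minPts settings is sketched below. It assumes that cluster labels have already been computed for each setting (for instance, by running hdbscan on the t-SNE coordinates) and that unassigned articles carry the label 0; some implementations use -1 instead. The toy labels are made up for illustration rather than produced by hdbscan, and the function names are hypothetical, not the authors' code.

```python
from collections import Counter

NOISE = 0  # assumed label for articles not assigned to any cluster


def summarize(labels_by_minpts, noise=NOISE):
    """Per minPts value: number of clusters, unassigned articles, largest cluster size."""
    summary = {}
    for min_pts, labels in labels_by_minpts.items():
        sizes = Counter(label for label in labels if label != noise)
        summary[min_pts] = {
            "n_clusters": len(sizes),
            "n_unassigned": sum(1 for label in labels if label == noise),
            "largest_cluster": max(sizes.values(), default=0),
        }
    return summary


def unassigned_everywhere(labels_by_minpts, noise=NOISE):
    """Indices of articles left unassigned under every minPts value tried."""
    runs = list(labels_by_minpts.values())
    n_articles = len(runs[0])
    return [i for i in range(n_articles) if all(run[i] == noise for run in runs)]


# Toy labels for six articles under three hypothetical settings.
labels_by_minpts = {
    50: [1, 1, 2, 2, 0, 3],
    75: [1, 1, 2, 2, 0, 0],
    150: [1, 1, 1, 1, 0, 0],
}

if __name__ == "__main__":
    for min_pts, stats in summarize(labels_by_minpts).items():
        print(min_pts, stats)
    print("unassigned under every setting:", unassigned_everywhere(labels_by_minpts))
```

Tabulating the number of clusters and unassigned articles per setting makes it easier to see where clusters merge as minPts grows, and which articles are never placed in any cluster.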
We observed that some articles never held membership in any cluster irrespective of the parameter settings and labelled those as "never clustered", in contrast to articles that were simply not clustered with our specific final parameter settings, which are labelled as "currently unclustered". The final set of clusters is a blend of two separate parameter settings (75 and 150). The topic of each cluster is assigned by using the top two most frequent terms within each cluster. Upon observing the cluster results, we validated our clusters using an external list of human pathogens and assessed the correspondence between pathogen terms and cluster topics.

Linking To A Priori Concepts. We used the dataset of bigrams and filtered out those that occurred in fewer than 10 articles within a cluster or fewer than 10% of bigrams across bigrams in the corpus. The remaining bigrams were mapped to a set of a priori defined concepts, except for bigrams excluded because they were common writing colloquialisms or could not be clearly mapped. This mapping was conducted through iterative internal discussions, in a similar spirit to the visualization analysis described below. We deemed this result acceptable for our analysis needs and did not attempt to further validate it.

Document Sampling. We sampled one document for each a priori concept within each topic cluster. Each sampled article was examined and either considered acceptable for further analysis or rejected. Reasons for rejection included: article did not contain any figures (main reason); full text article not accessible; article not in English; article was mainly about a technique (i.e. laboratory technique or bioinformatics method); article did not include humans (animals only, which we considered out of scope); article was a systematic review (figures were mainly illustrations and not data visualizations). For each rejected article, we resampled two additional articles and chose only one article (assuming both were not rejected) for further analysis. Based upon the analysis of the first round of sampling, the second round only sampled articles from 2011 onwards to increase the chance of sampling articles containing figures, and also attempted to sample underrepresented a priori concepts from the first round. Table S3 contains a list of all the articles, which round they were sampled in, whether they were included or rejected, and the reason for rejection.

illustrations. We also included a small number of "missed opportunity" tables, which were stand-alone tables that we felt could have been visualized. This determination was subjective but included tables that were matrices of numbers or large tables of patient metadata where each row consisted of a patient (but demographic tables and statistical summaries were not considered missed opportunity tables).

because we wanted to understand the potential interplay between subfigures. For example, if a paper contains three figures (Fig. 1, Fig. 2, and Fig. 3) each figure was analyzed separately, whereas if the third figure contains two parts (i.e. Fig. 3A, Fig. 3B)

Figure S1 to S5
2. Supplemental Table S1 to S3 Captions

A reminder that analysis notebooks are also available at: https://github.com/amcrisan/GEViTAnalysisRelease

Figure S1
Overview of our approach to construct a visualization design space. This approach is split into two distinct but connected phases: a literature analysis phase followed by a visualization analysis phase that itself consists of qualitative and quantitative analysis components. We overlay these phases as concrete steps in resolving our primary research objective, which is stated below.

Figure S4
A priori concepts distributed among pathogens (a) and the number of bigrams assigned to each concept (b).

Figure S5
Distribution of chart types across articles (a) and the co-occurrence of chart types within figures (b).

Supplemental Table Captions
Table S1 External list of pathogens. A list of human pathogens and their associated diseases taken from Wikipedia (https://en.wikipedia.org/wiki/List_of_infectious_diseases) and used to validate the topic clustering by assessing whether the pathogen strings occur in clusters with the same name. Both the disease and the source of the disease were checked for a match within each document.