A Close Look at a Systematic Method for Analyzing Sets of Security Advice

We carry out a detailed analysis of the security advice coding method (SAcoding) of Barrera et al. (2022), which is designed to analyze security advice in the sense of measuring actionability and categorizing advice items as practices, policies, principles, or outcomes. The main part of our analysis explores the extent to which a second coder’s assignment of codes to advice items agrees with that of a first, for a dataset of 1013 security advice items nominally addressing Internet of Things devices. More broadly, we seek a deeper understanding of the soundness and utility of the SAcoding method, and the degree to which it meets the design goal of reducing subjectivity in assigning codes to security advice items. Our analysis results in suggestions for minor changes to the coding tree methodology, and some recommendations. We believe the coding tree approach may be of interest for analysis of qualitative data beyond security advice datasets alone.


INTRODUCTION
There is no shortage of security advice, in a wide range of domains. With the rise in popularity of the Internet of Things (IoT), and a corresponding rise in consumer IoT devices with security vulnerabilities [1,12], numerous organizations have offered security advice positioned as IoT security guidelines, recommendations, baseline requirements, best practices, and codes of practice (e.g., [9,18]). Barrera et al. [3] recently proposed a method called security advice coding (SAcoding, next section) to analyze such advice datasets. They used a so-called coding tree to analyze what is referred to as the cb1013-dataset [4], resulting from filtering out identical items from a UK government compilation of 1052 advice items [7]. Related advice datasets that are smaller and coarser (higher-level) have been analyzed by Bellman and van Oorschot [6] using the same method.
Our analysis of the SAcoding method is motivated by a desire to increase our understanding of it, its utility, and what it can be relied upon to deliver. We set out to identify which of its aspects deserve our confidence, and where or how it might be improved. One improvement direction would be any changes that increase consistency, between different coders, in codes assigned to advice items (tag agreements). From our analysis, we note limitations and offer insights and recommendations on the structure of the coding tree, instructions to coders, and the descriptions associated with questions at tree nodes. Collectively, this may encourage others in the community to use SAcoding to analyze other sets of security advice, independently critique or improve the method, and perhaps create and design alternative systematic methods for analyzing security advice.

BACKGROUND (ON SACODING METHOD)
Qualitative data coding [8] is commonly used to manually extract themes and perform analysis of qualitative data such as verbalized thoughts and interview transcripts from study participants. It is an imprecise process by which review of a dataset results in development of a codebook containing refined labels for concepts derived from the data; the process is inherently subjective. While in many applications high accuracy is not essential, accuracy is more important in the SAcoding setting, where the method is positioned for analysis of security advice.
The SAcoding method [3] guides users through a set of questions to ultimately assign a pre-defined category code or tag per Table 1 (e.g., outcome, principle, policy, or one of several categories of practices) to individual security advice items. Tags 3-6 are used for practices that are pre-classified as actionable, meaning they specify step-by-step instructions for advice followers. In contrast, e.g., tag T (Desired Outcome) is for an advice item that states security outcomes but suggests no means to reach them, and tag 3 (Infeasible Practice) is for practices that are actionable, but for which advice targets might not have sufficient resources to carry out the advice.
The method uses a top-rooted coding tree (Fig. 1), with leaf nodes holding the tags. Each interior node has one yes/no question. For a given advice item, the root question is asked. The answer dictates whether the yes or no branch is followed to the next node (question). Reaching a leaf results in the leaf's code being assigned to the advice item. The design objective is to reduce subjectivity in assigning codes [3].
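To make the tree-walking mechanics concrete, the following minimal sketch (ours, not the authors' coding tool) shows how such a yes/no coding tree can be represented and walked to assign a tag; the question wording and tag names here are illustrative placeholders rather than the exact SAcoding tree.

```python
# Minimal sketch (not the authors' tool) of walking a yes/no coding tree.
from dataclasses import dataclass
from typing import Callable, Union

@dataclass
class Leaf:
    tag: str                      # code assigned when this leaf is reached

@dataclass
class Node:
    question: str                 # yes/no question posed to the coder
    yes: Union["Node", "Leaf"]    # branch followed on a yes answer
    no: Union["Node", "Leaf"]     # branch followed on a no answer

def code_item(root: Node, answer: Callable[[str], bool]):
    """Walk the tree for one advice item; return (assigned tag, question sequence)."""
    sequence, node = [], root
    while isinstance(node, Node):
        sequence.append(node.question)
        node = node.yes if answer(node.question) else node.no
    return node.tag, sequence

# Tiny illustrative fragment of a coding tree (placeholder questions and tags).
tree = Node("Q1: unambiguous language and relatively focused?",
            yes=Node("Q2: arguably helpful for security?",
                     yes=Leaf("...further questions..."),
                     no=Leaf("Beyond Scope of Security")),
            no=Leaf("Not Useful"))

tag, seq = code_item(tree, answer=lambda q: True)  # a coder answering yes throughout
print(tag, seq)
```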
Ideally, the SAcoding tree would produce very few or zero tag nonagreements (e.g., coders would agree on all tags for given advice items). Each question in the tree was designed to be objective; for a given advice item, every question should have crisp, clear criteria, such that competent coders all answer yes or no in unison. However, as we will see herein, distinct coders do have nonagreements.
In our analysis, we use the previously mentioned cb1013-dataset [4]. As one aspect, we look to identify coding tree questions at which a disproportionately large (or small) number of nonagreements occur, and consider potential underlying reasons.

Code  Tag name  Description
1  Not Useful (vague/unclear or multiple items): Advice that doesn't make sense linguistically (e.g., unclear grammar), or isn't focused on a specific task/action. An item tagged Not Useful may optionally be given a supplementary label, Unfocused.
2  Beyond Scope of Security: Advice that is hard to argue could possibly help security.
   Security Principle: Advice in the form of a general rule (broadly applicable), that historically improves security outcomes, or reduces exposures.
1  Incompletely Specified 'Practice': Advice appearing at first glance to give technical direction like a practice (e.g., technical mechanism, specific rule), but lacking clear indication of steps to take. The tag is thus considered non-actionable.
2  General Policy (General 'Practice'): Advice that indicates a general approach (or general policy) but gives no explicit techniques or tools. The tag is considered non-actionable (despite its name) due to its general, unspecific nature.
3*  Infeasible Practice: A practice that would consume unreasonable resources (time, money), thereby failing most/all cost-benefit analyses.
4*  Specific Practice/Security Expert: A practice requiring the expertise/deep knowledge of a security expert; may require inferring steps that are not explicit in the advice.
5*  Specific Practice/IT Specialist: A practice executable by typical IT workers (but beyond typical end-users), using basic professional knowledge of security.
6*  Specific Practice/End-User: A practice executable by typical end-users, e.g., by directly interacting with a device, mobile app, or cloud service.
T, T′  Desired Outcome: Advice suggesting a generic, high-level end goal to attain, but lacking a specific method by which to reach that outcome.
Table 1. SAcoding tree codes, from Barrera et al. [3] with some descriptions reworded. The two distinct codes for Desired Outcome (T, T′) allow tracking of different code tree paths. The single-quoted 'Practice' in the names of the Incompletely Specified 'Practice' and General 'Practice' tags signals that these are not actionable practices. * denotes codes considered actionable.

METHODOLOGY AND RESULTS OF CODING
The previous section reviewed the SAcoding method. Here we discuss our use of it to tag the full 1013-item dataset [4] as a second coder (C2), and compare results with coding of the same dataset by a coder we call C1 from previous work [3]. We use the same procedures as C1 to produce the C2 results (same tree, coder instructions, coding interface software). Our methodology is compared to others later, under Related Work.
After C2 finished tagging the 1013 items, we compared the set of tags from C2 to that from C1; the coding results from both coders are publicly available in a file that integrates the 1013-item dataset [5]. We first explore differences in tag frequency and distribution, to get a sense of the tags each coder applied to the advice item dataset. For each coding, we first separately summarize the number of items tagged in each tag category. In the tag set for each coder, each of the 1013 advice items was tagged with one or two tags; Fig. 2 shows the resulting counts, and because some items receive a second tag, the per-coder sum of all tag counts exceeds 1013. We later give a more detailed analysis (starting with Table 5) showing tag distributions for C1 and C2 partitioned across 13 advice-item groups as explained later.

6 Is it viable to accomplish with reasonable resources?
7 Is it intended that the end-user carry out this item?
8 Is it intended that a security expert carry out this item?
9 Is it a general policy, general practice, or general procedure?
10 Is it a broad approach or security property?
Table 2. SAcoding tree questions (Q6-Q10 shown), asked for each advice item. For supplementary annotations available to coders (providing additional context), see Barrera et al. [3], which is also the source of the questions. Question 11 omitted (as discussed later).
As an aside, Barrera et al. [3] note that the coding software interface they used (also used herein) did not prevent coders from "short-cutting" by directly assigning a tag to an item (bypassing the method); they suggest that any software interface tool used should preclude this. This change was not made in the tool we used, but inquiries with both the original coder C1 [3] and our coder C2 confirmed that short-cutting was generally avoided.

Proportions of Actionable Advice and Individual Actionable Tags
An advice item is considered actionable (for a given coder) if either the first or (when present) second tag assigned to it is actionable (one of tags 3-6, per Table 1); the justification is that the coder has an argument supporting actionability.
Otherwise, the item is considered non-actionable. The Actionable and Non-Actionable bars of Fig. 2 show the resulting counts for each coder. The count across these two bars thus sums to 1013 for each coder. Fig. 2 shows C2 tagged 33% of items with an actionable tag, agreeing with earlier findings on how actionable the 1013-item dataset is (32% by C1 [3]). Further, on more than 80% of advice items, C1 and C2 agreed on whether the advice item was actionable (Table 3; discussed later). This supports the assertion that the coding tree (its questions, their ordering, and wording) enables effective classification of items as actionable vs. non-actionable.
Using Fig. 2, we now consider the individual actionable tags (3-6). Out of 2 × 1013 opportunities, tag 6 (advice for End-Users) was used only 5 times, and tag 3 (Infeasible) only 3 times. The tag 3 result suggests that advice in this dataset, where actionable, is largely feasible; the tag 6 result suggests that this advice set does not target end-users. The low use of tags 3 and 6 is discussed further under 'Low Numbers of Q-nonagreements'.
Next consider tag 4 (advice requiring the knowledge of a Security Expert) and tag 5 (requiring the knowledge of an IT Specialist).
C1 and C2 used comparable proportions of tag 4 (21% vs 22%), and likewise for tag 5 (14% vs 13%), but we see tag 4 used 54% and 73% more, resp., than tag 5. Recalling that the primary target of this 1013-item advice dataset appears to be IoT device manufacturers (or pre-deployment stakeholders) [9,17,18], we find this target consistent with the healthy proportions of items tagged 5, and the low proportions of tag 6. However, the relatively high proportions of tag 4 appear inconsistent with this dataset being actionable by IoT device manufacturers, who (at least in small and medium-sized companies) we expect do not always employ security experts.
Proportion gaps in non-actionable tags. For this dataset, while both coders assigned similar numbers of tags in each actionable tag category, this was not so for several non-actionable tags, including:
• Not Useful (Unclear or Unfocused): 26% vs. 14% (for C1 and C2, respectively)
• Incompletely Specified Practice: 31% vs. 18%
• T or T′ (Desired Outcome): 10% vs. 25%
Fig. 2 also shows that the Security Principle tag was used about twice as often by C2 as by C1. For the other 6 code categories, coders had similar tag distributions in that the number of items assigned each tag differed between coders by at most 1 per cent of the number of items in the dataset. The latter is promising, but overall it appears premature to claim that the SAcoding method can reliably estimate the proportion of a dataset's advice items in (most) categories; while plausible for some datasets and some coders, further evidence and perhaps methodology changes (discussed later) would be needed to support a claim that tag distribution is reproducible. We add two related comments. First, note that similar cross-coder proportions of tags do not imply tag agreements on individual items. For example, the similar proportions of tag 5 in Fig. 2 (14% and 13%) do not imply that C1 and C2 tagged the same individual advice items with tag 5; the tag-vs-tag details (Tables 6 and 7, given later) show that often they did not. As the second comment, of the three large percentage gaps noted above, the gap for T (Outcome) is perhaps of greatest concern, as differentiating outcomes from actionable practices was a design goal [3]. However, this concern is partially mitigated by data from later Table 6, showing that most tag nonagreements involving T are with other non-actionable tags (primarily Not Useful, Incompletely Specified Practice, and General Policy).
Finally, further discussion of nonagreements involving the Not Useful tag is given later under 'Question 1', and similarly for T under 'Question 3' and in discussion of the tag-vs-tag tables.

Terminology for Multi-Coder Agreements and Nonagreements
We next consider at which questions coders had agreements and nonagreements for each advice item. We use agreements and nonagreements in two contexts. An agreement in the context of tags (T-agreement) is when at least one of the tags given to an advice item by a coder is the same as one of the tags given by the other coder; otherwise, it is a T-nonagreement (the coders have no tags in common among their first and possibly second tags for the advice item).
In the context of a question, an agreement (Q-agreement) is when both coders provide the same answer to a question in the tree (both answer yes, or both answer no), for a given advice item. For a given tree question, when one coder answers yes and the other no, we say it is a Q-nonagreement.
Clearly, Q-agreements are related to T-agreements, e.g., each coder assigning the same code through the tree implies Q-agreements at all questions on the path to that code (leaf). A T-agreement is determined based on tags assigned to advice items through use of the coding tree; Q-agreements are determined based on individual questions in the coding tree that both coders answer for the same advice item.
How tag and question agreements and nonagreements are determined is described more thoroughly below. We describe non-matching tags and question answers as nonagreements instead of disagreements, based on our view that the term disagreement implies two coders explicitly disagreed on something (e.g., the tag to be applied, or the answer to a question), rather than coders independently selecting different sequences through the coding tree with differing outcomes. Understanding where coders had nonagreements most frequently (e.g., which questions in the coding tree they diverge on) allows us to consider where the coding tree might be improved to more reliably converge coder answers toward yes or no, and to identify advice items that may be vague or open to subjective interpretation by a coder.
A coding sequence (hereafter sequence) is the sequence of question nodes resulting from use of the coding tree on an advice item; the nodes are joined by edges, as determined by yes/no answers (in Fig. 1, the edges labeled Y/N from Q1 to a leaf code). The number of nodes in a sequence is the number of questions a coder answers before reaching a tag. For example, a tag reached by answering Questions 1 (yes), 2 (yes), 3 (no), 4 (yes), and 5 (no) yields the sequence (Q1, Q2, Q3, Q4, Q5), with five nodes. As a sequence describes the path through the coding tree to reach a single code, a coder determines one sequence for each tag assigned to an item. Sequences are created indirectly by coders, as a result of answering a question at each node until reaching a tag.
For a given advice item, a diverging question is the tree question at which two coders give different answers, thus taking different exit paths from that node, eventually yielding different tags. When two coders yield different tags on a given advice item, the diverging question can be determined by tracing backwards up the tree, from each of the two tag leaf nodes (Fig. 1), until these paths intersect at a question node.
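Equivalently, since both coders answer the same questions from the root until their answers differ, the diverging question is the last node of the common prefix of the two root-to-leaf question sequences. A minimal sketch (ours) of that determination, with questions represented simply as strings such as "Q3":

```python
from itertools import takewhile
from typing import List, Optional

def diverging_question(seq_c1: List[str], seq_c2: List[str]) -> Optional[str]:
    """Return the last question shared by both coders' sequences (the diverging
    question when the assigned tags differ); None if nothing is shared."""
    shared = [q1 for q1, q2 in takewhile(lambda p: p[0] == p[1], zip(seq_c1, seq_c2))]
    return shared[-1] if shared else None

# Example: both coders reach and answer Q3, but one answers no (continuing to Q4)
# while the other answers yes (reaching a leaf): Q3 is the diverging question.
print(diverging_question(["Q1", "Q2", "Q3", "Q4", "Q5"], ["Q1", "Q2", "Q3"]))  # Q3
```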
As the number of tags given to an advice item by two coders can vary from 2 to 4, we consider three types of advice tag comparison (based on T-agreements). Each comparison is within the scope of a single advice item, i.e., by looking at the tags each coder (C1 and C2) assigned to that item. As first and second tags are given equal importance [3], these types are based on the number of tags assigned by the coders, not on which tags were assigned first or second. In all three comparison types, a match means two identical tags, one from each coder (e.g., C1's first tag identical to C2's first tag).

SS-type Tag Comparisons.
In an SS-type comparison (single-single), each coder gave only one tag to the item, and a T-agreement occurs if the two tags match; a T-nonagreement occurs otherwise.
SD-type Tag Comparisons. In an SD-type comparison (single-double, including also double-single), one coder gave one tag (first), and the other opted for two tags (first and second), for a total of three tags for one advice item. In this type, a T-agreement occurs if the single-tag coder's tag matches either the double-tag coder's first or second tag (recall that the first and second tags are considered to be of equal importance); a T-nonagreement means the single-tag coder's tag was identical to neither of the double-tag coder's tags.
In SD-type comparisons, to determine a single question where coders diverged (a Q-nonagreement occurred), we use the longest overlapping sequence among the two pairwise comparisons (the single-tag coder's sequence against each of the double-tag coder's two sequences), and declare the final question in that overlap to be the diverging question. If we instead used the shorter of the two overlapping sequences, then in every instance that a coder tagged an item as Not Useful (Q1's no answer), the diverging question would be Q1, as it is the final question in that overlapping sequence. As in a T-nonagreement at most one coder can yield a Not Useful tag (otherwise a T-agreement results), the other sequence of the two-tag coder is necessarily longer, providing a longer overlapping sequence to analyze. We observed no cases where one advice item had two overlapping sequences of the same length with different leaf tags reached.
DD-type Tag Comparisons. In a DD-type comparison (double-double), both coders opt for two tags (first and second), resulting in a total of four tags assigned to an advice item. In this case, we declare a T-agreement if one coder's first or second tag is identical to either tag of the other coder; a T-nonagreement occurs if neither of the two tags from the first coder is identical to either from the second coder. As so few instances occurred where both coders used two tags (as discussed shortly), type DD is excluded from most of our subsequent analysis; however, we summarize tag results for the 19 relevant items (cf. Table 3) in the Appendix, as Table 9.
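A minimal sketch (ours) of the comparison-type and T-agreement rules just described, with tag names as plain strings:

```python
# The comparison type follows from how many tags each coder gave (1 or 2), and a
# T-agreement means the two coders share at least one identical tag for the item.

def comparison_type(tags_c1: list, tags_c2: list) -> str:
    kinds = sorted([len(tags_c1), len(tags_c2)])        # each list has 1 or 2 tags
    return {(1, 1): "SS", (1, 2): "SD", (2, 2): "DD"}[tuple(kinds)]

def t_agreement(tags_c1: list, tags_c2: list) -> bool:
    return len(set(tags_c1) & set(tags_c2)) > 0          # any tag in common

# Example: an SD-type comparison with a T-agreement via the second tag.
c1 = ["Desired Outcome"]
c2 = ["Security Principle", "Desired Outcome"]
print(comparison_type(c1, c2), t_agreement(c1, c2))       # SD True
```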

T-agreement Summary and Results
Table 3 summarizes the results of calculating T-agreements (types SS, SD, and DD) as discussed. 760 advice items received one code from each coder (SS-type), 234 had one code from one coder and two from the other (SD-type), and 19 had two codes from each coder (DD-type). The high percentage of T-agreements among DD-type advice items is not surprising, given the larger number of pairs available for a match due to a second tag from both coders (Table 3's third column shows the proportion of T-agreements increasing with the number of tags in the comparison set). Type SS and SD comparisons account for 98% of two-coder tag comparisons (column 2). As only 2% of advice items were tagged twice by both coders, we do not analyze or discuss DD-type agreements (of tags or questions) further, beyond Tables 3 and 4.

Q-nonagreement Results: Distribution across Questions, and Proportion within Each Question

Fig. 3 gives the number of Q-nonagreements at each question, and their distribution across questions, i.e., the proportion of Q-nonagreements at each question among the total number of Q-nonagreements across all questions. Fig. 4 shows this same number of Q-nonagreements at each question but now relative to the number of times that question was visited by both coders on the same advice item, i.e., for each question, the percentage of joint visits that yielded Q-nonagreements.
To calculate these statistics, we used the longest overlaps between the coders' node sequences, as explained earlier.
For each tree question we counted how often coders had a Q-agreement (same answer, both yes or both no), and the number of Q-nonagreements. When coders agreed on the final tag for an item, the Q-agreement count of each node in the sequence (including the final node) is incremented. When coders did not agree on the final tag (a T-nonagreement), the Q-agreement count was incremented for each node in the overlap sequence excluding the diverging node, and the Q-nonagreement count was incremented for the final node in the overlap sequence, as this was the diverging question. Since the first question Q1 is asked regardless of any other nodes visited by each coder, the number of times Q1 is asked is the total number of items in each comparison type (from Table 3: 760 for SS-type, 234 for SD-type). This explains how Fig. 4 was constructed.
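The tallies behind Figs. 3 and 4 can be reproduced with bookkeeping along the following lines; this is a sketch under our reading of the rules above, and the input format is an assumption for illustration, not the authors' data format.

```python
# For each advice item we take the two coders' (longest overlapping) question
# sequences and whether their final tags agree. On a tag agreement every shared
# question counts as a Q-agreement; otherwise all shared questions except the
# last count as Q-agreements, and the last (diverging) question as a Q-nonagreement.
from collections import Counter
from itertools import takewhile

def tally(items):
    """items: iterable of (seq_c1, seq_c2, tags_agree) triples."""
    q_agree, q_nonagree = Counter(), Counter()
    for seq_c1, seq_c2, tags_agree in items:
        shared = [q1 for q1, q2 in takewhile(lambda p: p[0] == p[1],
                                             zip(seq_c1, seq_c2))]
        if tags_agree:
            q_agree.update(shared)
        else:
            q_agree.update(shared[:-1])
            q_nonagree.update(shared[-1:])   # the diverging question
    return q_agree, q_nonagree

# Example: coders agree on one item, diverge at Q3 on another.
agree, nonagree = tally([
    (["Q1", "Q2", "Q3"], ["Q1", "Q2", "Q3"], True),
    (["Q1", "Q2", "Q3", "Q4"], ["Q1", "Q2", "Q3"], False),
])
print(dict(agree), dict(nonagree))
```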

INTERPRETATION OF Q-NONAGREEMENT RESULTS
We now interpret the Q-nonagreement results. We examine aspects of the coding tree associated with a disproportionately large number of Q-nonagreements (areas where coding of items differed most across coders), and also small numbers thereof. As discussed later, these are candidate areas to consider for coding tree improvements; questions with very few Q-nonagreements are candidates for removal to simplify the tree. In some cases we provide concrete explanations for why Q-nonagreements occur more or less often at some questions; in others, we offer conjectures with less confidence, keeping in mind the subjective nature of qualitative coding.
Fig. 4. Q-nonagreement proportions within each question. The figure shows the ratio of Q-nonagreements at a node (from Fig. 3) to how often that node was encountered by both coders (counting both Q-agreements and Q-nonagreements). The sum of Q-agreements at each question node includes agreements on both yes and no answers (see dashed box). The numbers of comparison instances from Table 3 are: 760 for (I) SS-type, one tag from each coder; 234 for (II) SD-type, one coder giving two tags. Shaded node sizes roughly approximate percentages; percentages shown are rounded.

High Numbers of Q-nonagreements
As Fig. 3 shows, for SS-type and SD-type comparisons, nearly all Q-nonagreements (98% combined) come from Questions 1, 3, 4, 5, and 8. We examine the top three, and consider why these dominate the Q-nonagreements.
Question 1 (35% and 17% Q-nonagreements for type SS and SD, respectively). Q1 asks if the advice is in unambiguous language and relatively focused. As a possible reason why there is a greater proportion of Q-nonagreements at Q1 than at other questions, we hypothesize that the wording of Q1 may be interpreted differently by coders C1 and C2, due to differences in their security experience (as discussed under 'Limitations') or other differences in personal interpretation independent of experience. For example, some wording in an advice item may appear to be unambiguous language to one coder but not another. Unclear wording of the question itself, involving this term and the second key term in Q1, relatively focused (intended to ask whether sub-items are focused on the same topic, i.e., related), may also contribute to Q-nonagreements. Here a sub-item is part of a larger integrated item that may cover sub-topics that are distinct, or only loosely connected, or covering multiple aspects of a single or closely related topics; sub-items appeared often in this dataset [3].
Sub-label: Unfocused. While not affecting Q-nonagreements, Fig. 5 shows that when coders agreed on the Not Useful tag, they both chose Unfocused in over 30% of cases (30 + 2 of 106 combined SS and SD cases), exactly one selected Unfocused in (13 + 3)/106 ≈ 15% of cases, and neither selected Unfocused in about 55% of cases (53 + 5 of 106). We expect these results reflect both the underlying dataset (likely a majority of items ended in Not Useful for both coders because the items were unclear, i.e., not unambiguous language, rather than not relatively focused), and differing coder interpretations of the terminology in Q1, including the meaning of relatively focused; this is aside from it being somewhat unclear whether a coder is expected to select Unfocused if they interpret an item to be both unclear and not relatively focused. We comment on this further under 'Recommendations'.
Q1 also has the greatest difference in proportion of Q-nonagreements between cases SS and SD (35% vs 17%); here we also conjecture why this is, distinct from comparing Q1 to other questions. The strongest idea is that when one coder tags an advice item with two tags (as occurs in SD), there are more opportunities for an agreement. To gain confidence in this intuition, we briefly pursued the details: a Q-nonagreement at Q1 (for type SD) occurs in only one case of six (B2), and there the one-tag coder necessarily answered no. The relatively constrained conditions of case B2, together with some expected variation in all answers (due to subjectivity and ambiguity, including in advice items), appear to explain the smaller proportion of Q-nonagreements at Q1 for SD-type comparisons than for SS. As an aside, in case B2 (the SD-type comparison T-nonagreement case), regardless of how the two-tag coder answers questions beyond Q1, the only node both coders visit is Q1 (thus Q1 is always the diverging question).

These constrained circumstances for SD-type nonagreements at Q1 also provide a reason for a greater absolute number of SD-type Q-nonagreements at some later questions (Q3, Q4) than at Q1 itself (right side of Fig. 3).
Question 3 (29% and 23% Q-nonagreements for type SS and SD, respectively). Q3 (Table 2) asks if an item is an outcome (vs. an action to take). Recall also this definition and Q3's annotation: Outcome [3,6]: An outcome is a result of some prior activity; in our context, often the end goal of advice.
Q3 annotation [3]: Is the advice a high-level outcome rather than some method (or meta-outcome) for how to achieve an outcome? E.g., data is secured in transit would be an outcome because it is a desired goal or state, whereas encrypt data in transit is not, because it suggests a method for achieving that outcome (in this case, encryption). Encryption may be considered a meta-outcome, as it is not meaningful to the end-user's ultimate goal of protected data.
If a coder finds this outcome definition unclear, or cannot relate it to a given dataset item, they may be unsure how to answer Q3. Note that C2's tag set (Fig. 2) contained T (Outcome) 2.4 times as often as C1's (235 vs. 97). Thus, often when C2 answered yes to Q3, C1's Q-nonagreement via no yielded another tag from deeper in the tree.
To explore this, we examined SS-type and SD-type tag nonagreements involving tag T, counting how often C2 used T for an item while C1 assigned a lower tree tag (i.e., not Not Useful, Beyond Scope, or T; and T was neither C1's first nor second tag). Of the 235 times C2 used tag T, there were 142 SS-type nonagreements, 106 of which (75%) involved tags other than Not Useful or Beyond Scope from C1 (Table 6, column T); and 27 SD-type nonagreements, 26 of which (96%) had C1 using a tag other than Not Useful or Beyond Scope.
The remaining 235 − (142 + 27) = 66 cases are accounted for by 58 agreements on tag T (45 SS-type, 13 SD-type), and 8 tag agreements where C2 selected T as either first or second tag, but C2's other tag led to the agreement with C1 (3 of these involved DD-type comparisons, Table 9).
This high proportion of occurrences (106 + 26 = 132 of 235) of C1 assigning tags from Q4 or deeper when C2 tags items as T may suggest that the description of outcome is interpreted differently across coders, and that one or more of the definition of outcome, the wording of Q3 involving it, or Q3's extended annotation might be improved in materials provided to coders.
The above definition of outcome also relies on a mutual understanding of the term "end goal" shared by coders. If this differs, one coder may be more or less likely to answer yes to Q3. A brief description of desired end goal is given in the annotation details [3] using an example, but for more complex cases in the dataset, it may be unclear to coders whether a given item fits the definition.
Question 4 (14% and 25% Q-nonagreements for type SS and SD, respectively). Q4 asks if an item suggests any of: a security technique, mechanism, software tool, or specific rule. This naturally relies on the definitions of these terms, which are not given in the question itself (but are given in part in annotations [3] provided to coders, which include a few brief examples). An advice item might align poorly with these examples (if they are even consulted), resulting in coders implicitly relying on their own subjective definitions. This is discussed further under 'Limitations (Clarity of terminology)'.
Finally, we note that Q4 is the main branch point at which an advice item either heads toward the actionable tags (plus Incompletely Specified Practice), or toward more general advice such as principles. Q-nonagreements at Q4 create a major divergence in coder sequences, in that the resulting tags will differ even in terms of actionable vs. non-actionable (cf. Table 3).

Low Numbers of Q-nonagreements
As Fig. 3 shows, Q2, Q6, Q7, Q9, and Q10 have in total 6 Q-nonagreements across both comparison types (SS, SD). From these five, we now examine Q2, Q6, and Q7 as the three that were most visited among these (per Fig. 4), and conjecture why they had so few or zero Q-nonagreements. While discussion is omitted, we note the number of times Q9 and Q10 were respectively visited by both coders was relatively low, at (59, 18) for SS-type and (13, 2) for SD-type comparisons (again from Fig. 4).
Question 2 (1% and 0% Q-nonagreements for type SS and SD, respectively). Q2 asks if the advice item is arguably helpful for security. For tag comparisons of type SS and SD, resp., Q2 arose for 67% (508/760, Fig. 4.I) and 88% of items (206/234, Fig. 4.II), but was answered no only 9 times. We first note the almost total absence of Q-nonagreements despite many opportunities. We next observe that almost all the advice was judged to have some possibility of improving security (we view this as more a reflection on the dataset than on the SAcoding method of analysis). Combined, this allows a conclusion that the dataset's advice items are, as expected, apropos to security (an easy hurdle), but provides less feedback on whether Q2 would successfully filter out-of-scope advice in a mistargeted advice dataset, or would do so without unreasonably large numbers of Q-nonagreements.
Question 6 (0% and 1% Q-nonagreements for type SS and SD, respectively). Q6 asks if the advice is viable to accomplish with reasonable resources. It was rarely answered no, yielding tag 3 (Infeasible Practice) in total 3 times (only one being counted as a Q-nonagreement at Q6; cf. Tables 6 and 7). This suggests nearly all actionable advice items in the 1013-item dataset are feasible to implement, per Q6's description of reasonable and Table 1's description of infeasible advice. However, we expect that the present dataset did not provide a strong test of Q6 in general (similar to Q2 above). A stronger test would be provided by datasets with more infeasible advice items.
Question 7 (0% Q-nonagreements for both types). Q7 asks if it is intended that the end-users of a product would be expected to carry out the advice item. As the target audience for the 1013-item dataset does not focus on end-users [3], it makes sense that few Specific Practice/End-User (tag 6) tags would be used (yes to Q7), thus the absence of Q-nonagreements at Q7 is not surprising.

Other Observations on Q-nonagreement Distributions
As a rough measure, we expect the overall distribution of Q-nonagreements (Fig. 3) to be skewed towards lower-numbered questions, with the proportion of Q-nonagreements at subsequent nodes decreasing, as Q-nonagreements higher in the coding tree preclude later Q-nonagreements for a given advice item. For the proportion of Q-nonagreements within each question (Fig. 4) we do not expect this same pattern, as a Q-nonagreement higher in the tree has no impact on the proportion of Q-nonagreements within later questions (a reduced number of visits does not directly affect a proportion of nonagreements at a node).
Reviewing Figs. 3 and 4, a few items draw our attention regarding the distribution of Q-nonagreements. Some questions have a small overall share of the Q-nonagreements (Fig. 3), but the ratio of Q-nonagreements to total joint visits to that question is high (Fig. 4). One such case is Q8, which we discuss next. A second such case is the now-removed question Q11, as discussed under 'Limitations'.
Question 8 leads to two leaf nodes. The type SS results for overall Q-nonagreement proportion (Fig. 3, 8%) and within-question Q-nonagreement proportion (Fig. 4.I, 34%) indicate that when coders both reached Q8, it was common for them to fail to agree. Q8 asks about the level of security experience advice recipients are expected (or would need) to have. A coder with more security experience may answer differently than a less experienced coder; e.g., coders may apply personal security knowledge to distinguish the groups (tag 4/security expert vs. tag 5/IT specialist). This may explain in part why a relatively large proportion of Q8 answers were Q-nonagreements, as in our study the security experience of C1 (a senior PhD student in security at time of coding) and C2 (a professor with seven years of post-PhD security experience) was far from identical.
Q-nonagreements at Q8 cause concern, as we expect all IoT device manufacturers to employ IT specialists (tag 5) in order to develop devices, but fewer to have security experts (tag 4), especially in small and medium-sized companies. As such, which of the two audiences an advice item is actually designed or intended for is important, as items that target a security expert might not be helpful to (executable by) a non-expert. If the SAcoding method is able to identify advice items that do not match target recipients, then it offers a means to improve the targeting and thereby effectiveness of advice.

Comparing Actionable and Non-Actionable Tag Agreements
To determine how well the coding tree can estimate the actionability of an advice dataset, we further calculated agreement within actionable tags, and within non-actionable tags (i.e., tags 3-6 versus all others), for comparison types SS and SD. The method follows that for determining tag agreements and nonagreements as done earlier. This data is included in Table 3 (column 4).
For type SS, we consider the single tag of each of the two coders for a given advice item, and if both or neither are an actionable tag, that counts as an actionability agreement; otherwise, a nonagreement. For type SS, coders agreed on 608 of 760 (80% of) advice items (both actionable or both non-actionable). For type SD, we compare the single-tag coder's tag to each of the double-tag coder's tags. If in either pairing both tags are actionable or both are non-actionable, it counts as an agreement; otherwise, a nonagreement. For type SD, coders agreed on 204 of 234 (87% of) advice items regarding actionability. Unsurprisingly, this percentage is greater than for type SS, because either of two pairs may yield an agreement. We view this as a limitation and give more weight here to the results from type SS.
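A minimal sketch (ours) of these two actionability-agreement rules; the tag strings below merely stand in for the actionable codes 3-6 of Table 1.

```python
# Placeholder names for the actionable codes (Infeasible and Specific Practices).
ACTIONABLE = {"Infeasible Practice", "SP/Security Expert",
              "SP/IT Specialist", "SP/End-User"}

def ss_actionability_agreement(tag_c1: str, tag_c2: str) -> bool:
    # SS-type: agreement if both single tags are actionable, or neither is.
    return (tag_c1 in ACTIONABLE) == (tag_c2 in ACTIONABLE)

def sd_actionability_agreement(single_tag: str, double_tags: list) -> bool:
    # SD-type: agreement if either pairing of the single tag with one of the
    # double-tag coder's tags puts both tags on the same side of the split.
    side = single_tag in ACTIONABLE
    return any(side == (t in ACTIONABLE) for t in double_tags)

print(ss_actionability_agreement("Desired Outcome", "SP/IT Specialist"))         # False
print(sd_actionability_agreement("SP/IT Specialist",
                                 ["Security Principle", "SP/Security Expert"]))  # True
```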
This analysis indicates that using the coding tree (and for this single large dataset), coders can distinguish an actionable practice (by the definition of Barrera et al. [3]) from a non-practice in over 80% of cases; that is, actionable and non-actionable advice items are largely distinguishable. This does not necessarily, however, rule out tag nonagreements on which practice code an item receives (e.g., see Q8 in Fig. 4). It nonetheless suggests the coding tree is useful for estimating how actionable an overall advice dataset is.

DEEPER VIEW OF TAG NONAGREEMENTS VIA DCMS 13 CATEGORIES AND TAG-VS-TAG TABLES
As visible from Fig. 2 at the beginning of our analysis, for some tags there are large differences in the frequency of tag occurrence between coders. Table 5 gives a different view of the differences in tag assignments, showing what proportion of items in each of 13 categories was assigned each tag by each coder; this partitioning into 13 bins sharply signals that similar proportions of items being given a tag by each coder does not imply that each coder gave the tag to the same individual items. The 13 categories are from a UK government effort, which we refer to as the DCMS 13 guidelines [18]. Each of the 1013 advice items under discussion is mapped to one of these 13 categories by a "mapping" document [17] used in conjunction with the DCMS 13 guidelines. Table 5 reveals the codes most commonly assigned to each advice category (topic) by C1 and C2; its heatmap shading aims to aid comparison of the coders' code distributions.
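The Table 5 view can be reproduced by a per-category tally along the following lines (a sketch, ours; the item format and field names are assumptions for illustration):

```python
# For each coder, within each DCMS category, compute the share of that
# category's items receiving each code (column sums may exceed 100% because an
# item can carry a second tag).
from collections import Counter, defaultdict

def category_distribution(items, coder):
    """items: list of dicts like {"category": "UK-10", "C1": [...], "C2": [...]}.
    Returns {category: {tag: percent_of_category_items}} for the given coder."""
    counts, totals = defaultdict(Counter), Counter()
    for item in items:
        totals[item["category"]] += 1
        for tag in item[coder]:                    # one or two tags per item
            counts[item["category"]][tag] += 1
    return {cat: {tag: 100.0 * n / totals[cat] for tag, n in tags.items()}
            for cat, tags in counts.items()}

sample = [{"category": "UK-10", "C1": ["Desired Outcome"], "C2": ["Desired Outcome"]},
          {"category": "UK-10", "C1": ["Security Principle", "Desired Outcome"],
           "C2": ["Not Useful"]}]
print(category_distribution(sample, "C1"))
# {'UK-10': {'Desired Outcome': 100.0, 'Security Principle': 50.0}}
```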
For example, among the 1013 advice items, within the (table column) category UK-10, C1 tagged 49% of the items with one particular code, while C2 assigned that code to only 29% of the items. Proportions of other tags within this same column also differ considerably between C1 and C2; for example, in the row for tag T, the respective values are 10% and 28%. Here, the former code arose more frequently for C1 (than C2), while T resulted more frequently for C2 (than C1).
Although we do not have evidence of any causal relationship in this pairing of codes, the possibility of such relationships between pairs of codes, and the large cross-coder differences visible from Table 5, motivate deeper analysis.
An example of a concrete question that we would like to answer is the following. Consider all items that C1 tagged with a particular code and that result in tag nonagreements; how are C2's (nonagreeing) tags for this subset of items distributed over the other tag values, and are there patterns worth exploring? In addition, Table 3 shows tag agreements for (only) 46% of items (315 + 130 + 17 = 462 of 1013); this is lower than we expected from earlier-reported testset results between three coders [3] (summarized herein at the end of Related Work), albeit those testset results counted near-matches as matches through the so-called ±1 rule, which is not used herein. Nonetheless, these issues motivate deeper exploration to better understand tag nonagreements.

Table 5. Distribution of codes assigned to advice items for coders C1 and C2, grouped into 13 DCMS topics or categories (using the mapping document [17]). Cell values give the percentage of each category's items tagged with the code in column 1. Column sums for C1 or C2 may exceed 100% as coders could optionally select a second code for an item. SP is short for Specific Practice. (For context, titles of the 13 categories UK-i are given in the appendix.)

To pursue this, we introduce tag-vs-tag nonagreement tables (Tables 6 and 7). Table 6 shows the distribution of tags for SS-type tag nonagreements. For example, one of its rows tells us that, for the items tagged with that row's code by C1, there were 26 tag nonagreements in total, and that these were distributed across 6 different tags (identified by the respective column headers) selected by C2. These tables allow an understanding of aspects from Fig. 2 earlier in the paper that drew our attention, four of which we note here: the relatively large cross-coder differences in the proportion of items receiving the Desired Outcome, Not Useful, and Incompletely Specified Practice tags, plus C2 using the Security Principle tag almost twice as often as C1. While each of these gaps suggests an approximate lower bound on the number of resulting tag nonagreements, how much higher the actual numbers are is not revealed; for example, C1 assigned T to far fewer items than C2, yet some of these may be distinct from items tagged T by C2 (i.e., coders may have applied T to different items). Beyond the proportions shown by Fig. 2, tag-vs-tag tables reveal the actual numbers of tag nonagreements, and the relationships between each pair of tags for such nonagreements.
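Such a table is straightforward to build from the per-item tag pairs; a sketch (ours) for the SS-type case, with illustrative tag names:

```python
# Rows are C1's tag, columns are C2's tag; each cell counts the advice items on
# which the two (single) tags differed (agreements are not tabulated).
from collections import Counter

def tag_vs_tag_ss(pairs):
    """pairs: iterable of (c1_tag, c2_tag) for SS-type items."""
    table = Counter()
    for c1_tag, c2_tag in pairs:
        if c1_tag != c2_tag:
            table[(c1_tag, c2_tag)] += 1
    return table

pairs = [("Incompletely Specified Practice", "Desired Outcome"),
         ("Incompletely Specified Practice", "Desired Outcome"),
         ("SP/IT Specialist", "SP/Security Expert"),
         ("Desired Outcome", "Desired Outcome")]   # an agreement, not counted
print(tag_vs_tag_ss(pairs))
```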
Table 6. Tag-vs-Tag table for SS-type tag nonagreements. For a fixed row (advice items C1 single-tagged with that row's code but that C2 did not tag with it), the columns show the distribution of C2's (nonagreeing, single) tags, i.e., how many of these items C2 tagged with the code of each column's heading. For a fixed column, reverse the roles of C1 and C2 in this description. The sum of all counts is 445 (matching the number of SS-type nonagreements in Fig. 3). A dash (-) represents 0.

Scanning the largest entries in Table 6, we see that relatively large numbers of SS tag nonagreements occur in particular when C1 tags items as Not Useful or Incompletely Specified Practice (rows), and when C2 tags items as T (column). We can also track 'where the nonagreeing tags go', including for each of the four cases noted in the preceding paragraph. Another outlier in Table 6 is the value 25 at row 5 (IT Specialist), column 4 (Security Expert). This might be explained by different interpretations across coders of the capabilities or knowledge of these closely related expert classes. On a similar note, the value 21 (row General Policy, column T) might be characterized as one coder recognizing as a general approach what another considers to be an outcome; note that both codes imply a lack of stepwise details, and an item suggesting a general approach might also hint at a desired outcome, depending on advice item wording.
As a further example from Table 6, to explore the earlier-noted result (visible from Fig. 2) that C2 chooses the Security Principle tag roughly twice as often as C1, we ask: when this results in a tag nonagreement, what did C1 choose most frequently? To answer, from the Security Principle column we find the greatest number of nonagreements in the Incompletely Specified Practice row (12 times, 50% of this column's nonagreements). In retrospect, this is perhaps not surprising: when C1 interprets an item as an incompletely specified practice, C2 often coded the same item as a Principle. Again notice that for items matching either of these codes, stepwise details would typically be missing. Note also that in earlier work, these two codes were placed adjacent on a relationship continuum (Fig. 3 in [3]) ranging from focus on end-result to focus on mechanism details, suggesting that despite the T-nonagreements here, the two coders are conceptually close. Finally, note that from Table 6 we can observe: 24 nonagreements occurred when C2 chose the Security Principle, and 11 nonagreements occurred when C1 chose it, for a total of exactly 35 SS-type nonagreements; from Table 7 (below), there are a further 5 + 14 = 19 SD-type nonagreements related to one or the other coder using this tag, for a sum of 35 + 19 = 54. In contrast, what can be deduced from the bars for this tag in Fig. 2 is that the number of nonagreements related to coders choosing it has approximate lower and upper bounds of 59 − 35 = 24 (C2's count minus C1's) and 59 + 35 = 94 (C2's count plus C1's).

UTILITY, LIMITATIONS, AND RECOMMENDATIONS
We begin this section with a summary of areas where we believe the SAcoding method has been shown to be useful. We then discuss limitations, combined with recommendations.

Utility of the SAcoding Method
We now sum up our view of the utility of the SAcoding method, and benefits it offers to the security community.
Categorizing security advice datasets. Applying the method to analyze the 1013-item advice dataset (believed to broadly represent IoT security advice) allowed a characterization of the dataset. This included, for the advice items offered, the comparative number of items in designated categories (outcomes, principles, practices, policies, and those unclassifiable). The tagging by two separate coders resulted in similar proportions of tags for many, but far from all, tag categories; however, our confidence in this result is weak, as more detailed analysis revealed that the coders often did not assign the same tag to a given individual item. We thus find it premature to make a broad claim that the method reliably reproduces tag proportions across coders. As our results are from a single pair of coders (and one fixed dataset, albeit large), further analysis with additional coders would provide stronger confidence in the method if this delivered comparable or better results; it could also serve to test out ideas for improving the SAcoding method itself. A practical challenge for large datasets such as ours is the substantial human time cost per coder (for each advice item, reading it and answering up to 10 questions). One approach to reduce the time burden on individual coders is to use larger numbers of coders, each tagging a randomly-drawn subset or partition of a full dataset; however, this may introduce unanticipated side effects, or preclude useful analysis of tagging patterns, e.g., related to coder demographics or experience levels.
T-T′ agreement. In our analysis, tags T and T′ are considered a nonagreement; this allows tracking of their distinct coding paths (origins). Such nonagreements occurred 6 + 2 times for SS+SD cases, per Tables 6 and 7. As both tags have the same word description, one could argue for treating them in analysis as a single code. The pros and cons of this design/analysis choice, and its impact on results, could be explored further.
Extracting sub-topics before tagging. As mentioned earlier, a notable number of items (14% and 5% for C1 and C2, resp.) were tagged with the sub-label Unfocused (as an option of the Not Useful tag), implying the coders found multiple sub-topics.
Prior to tagging an item with multiple sub-topics, if those sub-topics were extracted for individual tagging (vs. as a composite item), each might well receive a tag other than Not Useful. This of course would change both the number of items in the dataset, and the results. This idea, which can allow more meaningful feedback to those creating advice datasets, is pursued in other work [6]; this also overlaps the next point.
Splitting Q1 (and the Not Useful tag). A recommended coding tree refinement is to split Q1 into two questions. A new first question would ask whether the advice item is unfocused (e.g., multiple items on different topics, or too wide a variety of aspects of the same topic to be meaningfully assigned a single tag; in this case the item might be split in a pre-processing step to enable meaningful tagging); a yes here would result in a new code, elevating the current sub-label Unfocused to a (regular) code. A no would proceed to a second question, asking whether the advice item is unclear; there, a yes results in a new code for Unclear (grammar issues, language ambiguity). We believe this would address ambiguity that arose in trying to understand the meaning of coder choices in our earlier interpretation of Question 1 and Fig. 5.
Combined tag for design principles. The coding tree of Fig. 1 is slightly simplified from prior work [3], which included a question Q11 (asking whether the advice item relates to the design phase of the product lifecycle), and thus had two separate tags, security principle and security design principle (the latter a subset of the former). In fact, coder C2 used the original method with the old Q11 and these separate tags, though our analysis to this point combined them into the single Security Principle tag, with items coded under either counted as that single tag. Eliminating Q11 and using the combined tag simplifies the coding process slightly (one fewer question), without meaningful loss of information, as we now explain from a brief separate analysis retaining Q11 and the two separate codes.
The old Q11 was reached 94 times in total, counting visits by C1 and C2 separately; this is relatively infrequent, but not negligible. The main point, however, is that we found that Q11 and the separate principle codes provided little useful information. Q11 was reached by both coders relatively infrequently (18 + 2 times for SS+SD cases), but when it was, a Q-nonagreement occurred in 39% of SS-type comparison instances (7 of 18), and 0 of 2 SD-type instances; this comprised 2% of Q-nonagreements over all questions. As Q11's answer is based on coders' view of where in the device lifecycle [3] the advice item would be followed, the relatively high proportion of Q-nonagreements (7 of 20 instances) might imply differences in knowledge across coders on how to map advice to lifecycle phases, or differing views of what comprises the design phase. In this sense, we might hypothesize (untested) that Q-nonagreements at Q11 were attributable to ambiguity in materials provided to coders (instructions, definitions) more than to unclear advice items. As Q11 is/was a terminal internal tree node leading to two near-identical codes (one semantically a subset of the other), Q-nonagreements at it are of limited consequence.

RELATED WORK

One related effort used two coders to independently establish codes, agree on a final codebook, code their dataset, and calculate inter-coder reliability (it is unclear if the calculation is on test sets or the final full coding). Ukrop et al. [19] coded interview responses from participants that had evaluated certificate validation status outputs (e.g., warnings, errors).
They used two coders to independently establish initial codebooks, which were merged through discussion, and responses were re-coded using the final codebook. A third coder coded half the responses and inter-rater reliability was calculated between the initial two coders and the third coder.
We note that the SAcoding method has not been used to code user-provided qualitative data (e.g., from participant interviews in user studies), although the datasets analyzed to date [3,6] comprised qualitative data (security advice items), and the analysis employs similar methods. The original design of the SAcoding tree [3] involved use of three coders on multiple testsets to iterate on a codebook, with a final coder mean pairwise agreement of 73% and "substantial" [14] inter-rater reliability (κ = 0.67 mean; kappa values are not always given [10,11]). Rather than splitting the dataset across multiple coders (as in [10,11]), our analysis in the present paper involved two coders each independently tagging a full dataset (of 1013 items in our case), allowing comparison of full results.
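For reference, the two reliability figures just mentioned (mean pairwise agreement, and Cohen's kappa) can be computed for a pair of coders roughly as follows. This sketch (ours) assumes one tag per item and uses generic tag labels, so it is a simplification of the SAcoding setting with optional second tags and the ±1 rule.

```python
# Raw pairwise agreement, and Cohen's kappa (which discounts the agreement
# expected by chance from each coder's tag frequencies).
from collections import Counter

def pairwise_agreement(tags_c1, tags_c2):
    return sum(a == b for a, b in zip(tags_c1, tags_c2)) / len(tags_c1)

def cohens_kappa(tags_c1, tags_c2):
    n = len(tags_c1)
    p_observed = pairwise_agreement(tags_c1, tags_c2)
    f1, f2 = Counter(tags_c1), Counter(tags_c2)
    p_expected = sum(f1[t] * f2[t] for t in set(tags_c1) | set(tags_c2)) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

c1 = ["A", "A", "B", "C", "B", "A"]   # generic placeholder tags
c2 = ["A", "B", "B", "C", "B", "C"]
print(round(pairwise_agreement(c1, c2), 2), round(cohens_kappa(c1, c2), 2))  # 0.67 0.52
```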

CONCLUDING REMARKS
Our comparison of a second coder's results to those of a single coder in previous work [3] provided substantial insight on the extent to which SAcoding results are reproducible across coders. We identified specific areas where challenges arose in reproducibly tagging a 1013-item security advice dataset. Even for the tag categories for which coders assigned a given tag to similar proportions of the dataset, detailed analysis revealed that the coders often failed to assign the identical tag to the same individual items, with the overall rate of tag agreement across the dataset being 46%. A more positive result was the high agreement (80% for SS-type comparisons, 87% for SD-type) in identifying individual advice items as actionable vs. non-actionable. Our explanation for the co-existence of these two apparently contradictory results is that when coders disagreed on tags for an item, in many cases the differing tags were nonetheless closely related, e.g., adjacent on the tag continuum of Barrera et al. [3].
Despite the design intent of the SAcoding method to reduce subjectivity in assigning tags to items, natural language descriptions of security advice are inherently imprecise. It is thus unsurprising that even using SAcoding (vs. directly assigning tags), the two coders tagged significant numbers of advice items differently. This leads to asking how nonagreements between coders might be further reduced. One idea that we have noted is to explore rewording or clarification of questions that resulted in relatively large numbers of Q-nonagreements, and to provide further explanatory materials and instructions to coders. We expect that if further studies were run, one with two coders of similar expert-level security experience and another with two of vastly different experience levels, fewer tag nonagreements would result within the first pairing. This leads to asking: is the goal of the SAcoding method to minimize nonagreements for coders of similar background, or between coders of vastly different background? Ideally, one might minimize nonagreements independent of coder experience, although we suggest that the SAcoding method itself is best limited to security experts meeting some lower bound of expertise (Barrera et al. [3] are silent on this). Beyond simply more coders, the effect of different sets of coders on SAcoding results and nonagreements remains a largely unexplored area.
The SAcoding method and our analysis of it, including the introduction of novel techniques to explore question and tag nonagreements, such as tag-vs-tag tables, have allowed new measurements and insights on security advice.

Table 4 splits out the T-agreements from Table 3, based on tag selection order in the case of optional second tags (first and second codes), across types SS, SD, and DD. Although (as noted earlier) the analysis views first and second codes as equivalent, this data is included for completeness and allows a check for anomalies.

Table 4. Partitioning of T-agreements by type and ordered code pairing. Cases consider whether the first or second tag of coders C1 and C2, resp., results in a T-agreement. T-agreement counts are from Table 3 (column 3). First-Second includes Second-First.
Fig. 5. Examination of the Unfocused sub-label in the case of tag agreement on Not Useful. See Table 1's description of that tag. Smaller counts here than might be expected from Fig. 2 are explained by cases of one coder selecting the sub-label Unfocused after reaching Not Useful, while the other coder reached a different code (cf. the Not Useful row of Table 6).

Table 6 reveals a fifth case where a relatively large number of tag nonagreements occur: sets of items that C2 tags as Specific Practice/Security Expert (tag 4, column). This case was not apparent from Fig. 2, as C1 and C2 tagged similar proportions of items with tag 4; however, this detailed table reveals that many tag-4 assignments were on different items for C1 than for C2. In Table 6 overall, the largest non-sum entry is 61, where the Incompletely Specified Practice row intersects column T. This means the greatest number of tag nonagreements occurred for items tagged Incompletely Specified Practice by C1, but T by C2; this is perhaps explained by, for some advice items, an overlap between an Outcome (missing details on how to reach it) and an Incompletely Specified Practice, depending on item wording and coder interpretation of terms. Also note: the four entries in the corners of the sub-table bounded by the Not Useful and Incompletely Specified Practice rows (the first 4 rows), and the columns for tag 4 and T, alone sum to 33 + 36 + 24 + 61 = 154 of 445 total SS-type tag nonagreements (35%); and C1's Not Useful and Incompletely Specified Practice rows in total account for 132 + 138 = 270 of 445 SS nonagreements (61%). Turning to C2, the columns for tag 4 and T together account for 86 + 142 = 228 of 445 SS nonagreements (51%). Thus the table reveals that a small number of conditions account for substantial portions of all tag nonagreements.

Table 7. Tag-vs-Tag table for SD-type tag nonagreements. See Table 6 for explanation. The sum of all counts is 208 tag-vs-tag nonagreements, twice the 104 tag nonagreements implied by Table 3, because for items with SD-type nonagreements, each tag from the 2-tag coder yields a tag-vs-tag nonagreement. Note from Table 3: 234 (SD-type comparison items) − 130 = 104.

Table 7 is for SD-type tag nonagreements; given the lengthy discussion of Table 6, here we are intentionally brief. Table 7 provides somewhat less information, there being fewer nonagreements (104) of this type. The total of the counts in this table is 2 × 104 = 208, because for each nonagreement, there are two tags from one coder that fail to agree with the single tag from the other (as noted also in the table caption). Here the entries that draw our greatest attention are the sums (68) for one of the rows and (45) for one of the columns. From data given in the table itself, individual cases of interest can be explored independently, as for Table 6 above.