Efficient Collective Action for Tackling Time-Critical Cybersecurity Threats

The shrinking delay between the discovery of vulnerabilities and the build-up and dissemination of cyber-attacks has put significant pressure on cybersecurity professionals. In response, security researchers have increasingly resorted to collective action in order to reduce the time needed to characterize and tame outstanding threats. Here, we investigate how joining and contribution dynamics on MISP, an open source threat intelligence sharing platform, influence the time needed to collectively complete threat descriptions. We find that performance, defined as the capacity to quickly characterize a threat event, is influenced (i) negatively by the complexity of the event, (ii) positively by collective action, and (iii) positively by learning, information integration and modularity. Our results inform on how collective action can be organized at scale and in a modular way to overcome a large number of time-critical tasks, such as cybersecurity threats.


Introduction
From Computer Emergency Readiness Teams (CERT) established in the nineties [1], to information-sharing analysis centers (ISACs) [2], to bug bounty programs [3,4], collective action has long been used and recognized as key for the gathering, the integration and the sharing of critical cybersecurity information [5,6]. The reason for resorting to information-sharing as a form of collective action stems from the complexity associated with the continuous and somewhat decentralized (e.g., open source software) adaptation of hardware and software in information systems [7,8]. Although the Internet has largely developed through an open source spirit [9][10][11] with significant positive externalities [12,13], information-sharing has remained difficult when it comes to cybersecurity [6]. The expansion of threats in volume, severity and span has further challenged information infrastructures. Hence, it has forced further cooperation through information-sharing [14]. While the utility of information-sharing platforms has been somewhat confirmed by their wide adoption, there is a dearth of knowledge regarding how these collective action platforms concretely bring performance when addressing cybersecurity threats. For instance, cybersecurity has become increasingly time-critical and demands ever faster reaction times. Determining the chances that a threat will be fully characterized in time for security officers to act before attacks actually start has become crucial [15].
Here, we investigate 39,639 threat events contributed by 485 organizations to a MISP information-sharing platform [14] operated by the Computer Incident Response Center Luxembourg (CIRCL). We specifically study how collective action unfolds through information integration and how it brings significant economies of scale in terms of the time needed to fully characterize cybersecurity threats (i.e., performance). We resort to a multivariate cross-sectional regression estimated with the ordinary least squares method, and we find that (i) the number of organizations engaged in information-sharing, (ii) their acquired experience in completing events, (iii) the proportion of information integration and (iv) modularity increase performance.
The remainder of this article is organized as follows. Section 2 covers background from the perspectives of social dilemma, productivity and information integration in collective action in general and for cybersecurity. Section 3 introduces MISP and presents the data. In Section 4, we introduce the theoretical framework, followed by research hypotheses in Section 5. Section 6 describes the methodological approach. Results are presented in Section 7 and discussed in Section 8 before concluding in Section 9.

Background
Knowledge sharing in cybersecurity has been considered a crucial way to overcome a number of vulnerabilities [16] and threats [1]. It is, however, bound by limiting factors on the one hand, such as social dilemmas, and enhanced by return-on-scale effects on the other hand. Here, we review the literature on (i) social dilemma and productivity of collective action, and on (ii) challenges associated with information integration. We then review the state-of-the-art research in (iii) information sharing for cybersecurity.

Social Dilemma and Productivity in Collective Action
According to Olson's logic of collective action, small communities are more able to provide collective goods [17]. The central argument is that minority interests will be over-represented and diffuse majority interests trumped, due to a free-rider problem [18]. This free-riding effect is stronger for larger groups [19]. For instance, while Dejean et al. [20] found a positive relation between the size of a community and the amount of collective good provided, they paradoxically also found a decreased propensity by individuals to cooperate as the size of the community increases. Yet, there is overwhelming evidence that large crowds can be organized in order to establish successful online collective action. Examples include peer-to-peer networks [20,21], Wikipedia [22], Stack Overflow [23], and communities of open source software developers [24,25]. The Dejean et al. paradox may be at least partially resolved by considering that (i) the distribution of effort is highly skewed, with few contributors providing most of the effort, and (ii) the dynamics of contribution are highly non-linear [25][26][27]. Taken together, these phenomena are associated with positive return-on-scale of production [25], which may be hindered by coordination costs [28]. Super-linear productivity has been debated at length in organization and management sciences. Investigations have examined how the number of members and the temporal dynamics of the events they generate can raise outputs above the sum of the outputs of each element of the system (i.e., exhibiting super-linear growth patterns). Research has successfully delivered hints to improve the performance of organizations [29][30][31][32] by fine-tuning complementary mechanisms within the organization [33], which also foster innovation [34].

Information Integration and Modularity
One key aspect of generating return-on-scale in knowledge production is information integration. The management of information resources has become central to organizations [35], so that knowledge appears as a resource of the utmost strategic importance [36]. For instance, there is growing evidence in science that larger teams create more impactful knowledge [37]. If knowledge is so important, the fundamental capability of an organization lies in the specialized knowledge of each of its members, and the integration of this knowledge provides a competitive advantage [36,38]. With the emergence of virtual exchanges, firms are increasingly seen as distributed knowledge systems [39]. Yet, new interaction methods present various new constraints in terms of mutual understanding, contextual knowledge or techniques (e.g., memory, connectivity), which lead to asymmetries in information integration.
In this respect, the tremendous development of online collaboration platforms, as tools for governance strategy and knowledge management, highlights the importance of information-sharing [40]. These platforms promote knowledge transfer by generating modular collaborative units [41]. One may consider that individuals, or groups of individuals, composing a subsystem (i) bring added value in their own specific field (differentiation), in order to (ii) produce a complex good by pooling together this added value (integration). Following Arrow & Debreu [42], differentiation and integration have been a focal point in optimizing the structure of organizations [43,44]. Differentiation segments a system into subsystems, each of which develops a part of a task, while integration focuses on the interactions between these subsystems in order to accomplish the entire task [38,45]. Recently, Engel and Malone used the theory of consciousness as information integration [46] to measure information integration in computer systems and on collaborative platforms [45].

Collective Action and Information Integration for Cybersecurity
As early as twenty years ago, the first Computer Emergency Readiness Teams (CERT) and Information Sharing and Analysis Centers (ISACs) were established as a central resource for sharing information on cybersecurity threats to critical infrastructures [47]. Nowadays, threat intelligence platforms help organizations aggregate, correlate, and analyze threat data from multiple sources in (almost) real-time to support defensive actions [48]. Further, open source solutions have been proposed as a counterweight to cyber-criminals successfully working together [5]. The swift evolution of cyber-threats has forced organizations and governments to develop new strategies [49] in order to reduce the risks of security breaches [40]. Although information sharing is a promising way to enhance cybersecurity, it is believed to be thwarted by social dilemmas. Without trust, commitment and shared vision between stakeholders, organizations are reluctant to share information due to the fear of disclosure, reputation risk or loss of competitive power [50]. As such, information-sharing can be considered as a marketplace on which transactions occur and knowledge is transferred [51]. However, human beings have a tendency not to optimize organizational goals [52] and, in the case of collective action, might adopt behaviors that are not conducive to the overall goal of sharing information [6]. As a consequence, cybersecurity professionals probably share less information than desirable, leading to a knowledge asymmetry to the advantage of the attackers [6]. In particular, stakeholders strategically select the contributions they share (i.e., quantity and quality), leading to truncated and imperfect information sharing. Yet, specially crafted forms of cybersecurity information-sharing platforms have developed, such as bug bounty marketplaces. These platforms act as a trusted third party between security researchers and software editors [3]. Further, in cybersecurity, resource belief, usefulness belief, and reciprocity belief are all positively associated with knowledge absorption, whereas reward belief is not [51]. These empirical results show that functional cybersecurity information-sharing indeed requires overcoming social dilemmas and goes beyond simple reward expectations, but foremost requires that information-sharing be efficient in a context that increasingly demands addressing time-critical threats.

Data
To understand the nuts and bolts of cybersecurity information-sharing, we resort to the MISP Project, a popular open source platform, which is used, e.g., by the North Atlantic Treaty Organization (NATO). MISP stands for Malware Information Sharing Platform and Threat Sharing. Although it carries the word malware in its name, MISP is a threat intelligence platform on which people can share, store and collaborate on all sorts of incidents (e.g., the COVID-19 MISP community), but primarily on cybersecurity threats. These threats (i.e., events) are characterized by indicators of compromise (i.e., attributes), which are contributed by a multitude of organizations.
There are advantages in using MISP as an object of research. First, it is open source software, which makes it possible to understand in detail how the platform is designed and works. Second, a number of threat information sharing communities use MISP to share their threat intelligence relatively openly. Here, we use the whole history of a MISP instance maintained by the Computer Incident Response Center Luxembourg (MISP CIRCL), i.e., the Luxembourg CERT.
As of February 8, 2022, the MISP CIRCL instance is a community of 1,908 organizations (4,013 users), which have contributed 39,639 events, 9,099,685 attributes and 3,786 tags since November 10, 2008. Table 1 shows the ten most involved organizations. One can see that the number of events contributed by organizations is highly skewed. Indeed, Figure 1A shows that the complementary cumulative distribution function exhibits a power law $P(X_E > x_E) \sim 1/x_E^{\mu_E}$ with $\mu_E = 0.54(4)$ (cf. Appendix B for details on the fitting method). One may additionally note that 1,423 organizations, i.e., around 75%, do not participate in sharing threat information as a collective good with the broad MISP CIRCL community. These organizations may however consume information or share threat information privately within informal sub-groups, which cannot be observed. Similarly to $P(X_E > x_E)$, the distributions of attributes $P(X_A > x_A)$ and tags $P(X_T > x_T)$ per event, depicted in Figure 2, follow power laws with exponents $\mu_A = 0.64(1)$ (with an upper cut-off around $A_{upper} = 10^5$) and $\mu_T = 2.26(6)$, respectively. It is additionally important to consider that only 22,423 events (i.e., around 57%) have been marked as completed, suggesting either that threat analysis is complicated or that users tend to forget to formally close a large number of events. The cumulative number of tags used, $N_{T,cum}$ = 116,407, is larger than the number of unique tags, $N_{TU}$ = 3,786. Thus, there is a massive reuse of already existing tags.

Table 1 (column headers and the beginning of the first row only; the remaining entries are truncated):
rank | org ID | # users | # events contributed | percentage of total events
1 | 1092 | 8 |

We further observe that organizations have joined MISP CIRCL following an almost perfectly linear relation ($R^2 = 0.99$ and $p < 10^{-2}$), with 161 organizations initially joining the MISP CIRCL instance on September 14, 2015, the presumed date of official start. Figure 1B shows not only the almost linear organization joining rate, but also how many events each organization has contributed over time. One sees that the contribution effort is highly heterogeneous. It is also worth noting that event contributions started on November 10, 2008, long before the first organizations joined the MISP CIRCL instance. This can be explained in the following way: organizations first run their MISP instance locally. At some point, they join the MISP CIRCL community and share all their non-private threat intelligence at once, yet with the nominal event timestamp, which may well be in the past. It is also likely that the linear organization joining function is the result of a highly vetted joining process, controlled by CIRCL.
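To make the heavy-tail statement concrete, the sketch below estimates the exponent of a complementary cumulative distribution $P(X > x) \sim x^{-\mu}$ on synthetic data. The fitting method of Appendix B is not reproduced in this section, so the maximum-likelihood estimator used here, the continuous approximation, the cut-off `X_MIN` and all function names are assumptions made for illustration only.

```python
import math
import random


def ccdf_exponent_mle(samples, x_min):
    """Maximum-likelihood estimate of mu in P(X > x) ~ (x / x_min)^(-mu).

    Assumes a continuous power-law tail above x_min (an assumption of this
    sketch, not necessarily the paper's fitting procedure).
    """
    tail = [x for x in samples if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    return len(tail) / log_sum


def empirical_ccdf(samples):
    """Return sorted values and, for each, the fraction of samples at that rank or above."""
    xs = sorted(samples)
    n = len(xs)
    return xs, [(n - i) / n for i in range(n)]


# Synthetic sample drawn by inverse-transform sampling with a known exponent.
rng = random.Random(42)
MU_TRUE = 0.54   # illustrative value, same order as the exponent reported for events
X_MIN = 1.0      # hypothetical lower cut-off
synthetic = [X_MIN * (1.0 - rng.random()) ** (-1.0 / MU_TRUE) for _ in range(20000)]

mu_hat = ccdf_exponent_mle(synthetic, X_MIN)
print(f"estimated mu = {mu_hat:.3f}")
```

Plotting the output of `empirical_ccdf` on doubly logarithmic axes would show the approximately straight line that motivates a power-law description.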

Reduction of the Completion Time of Events ∆t C
Following the method described in Appendix B, we process the data and generate Figure 3B. As explained in the appendix, when the axes are set to a linear-logarithmic scale, the data fall on two straight lines. From this observation, we deduce that $\Delta t_C(t)$ decreases exponentially in two successive phases. By binning the data by month and computing the mean value of $\Delta t_C$ for each bin, we see a first phase, extending from 2011 to 2020, which decreases more slowly than the second phase, from 2020 to today. By applying a linear regression to the data, according to equation (9), we confirm that $\Delta t_C$ exhibits an exponential decrease.
The linear fit is of high quality, with a coefficient of determination $R^2 = 0.86$ and a p-value < $10^{-2}$. Hence, the time $\Delta t_C$ to complete an event decreases over time, indicating an improvement of the performance of the MISP CIRCL instance.
Figure caption (fragment): The events contributed by the organizations have been added (in dark gray); the distribution shows the heterogeneity of organizations' efforts.
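The binning and linear-logarithmic fit described above can be sketched as follows. Equation (9) is not reproduced in this section, so the sketch assumes the form $\log_{10} \Delta t_C = a + b\,t$ within a single phase; the record layout, the month index and the synthetic values are hypothetical.

```python
import math
from collections import defaultdict


def monthly_means(records):
    """records: iterable of (month_index, completion_time); returns sorted (month, mean)."""
    buckets = defaultdict(list)
    for month, dt in records:
        buckets[month].append(dt)
    return sorted((m, sum(v) / len(v)) for m, v in buckets.items())


def fit_log_linear(points):
    """Least-squares line through (t, log10(y)); returns (slope, intercept, r_squared)."""
    ts = [t for t, _ in points]
    ys = [math.log10(y) for _, y in points]
    n = len(ts)
    t_bar = sum(ts) / n
    y_bar = sum(ys) / n
    s_tt = sum((t - t_bar) ** 2 for t in ts)
    s_ty = sum((t - t_bar) * (y - y_bar) for t, y in zip(ts, ys))
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    slope = s_ty / s_tt
    intercept = y_bar - slope * t_bar
    r_squared = (s_ty ** 2) / (s_tt * s_yy)
    return slope, intercept, r_squared


# Synthetic single-phase data: two events per month around an exponential trend.
records = []
for month in range(24):
    base = 100.0 * 10 ** (-0.05 * month)   # hypothetical decay rate
    records.append((month, base * 0.9))
    records.append((month, base * 1.1))

slope, intercept, r2 = fit_log_linear(monthly_means(records))
print(f"slope = {slope:.3f} per month, intercept = {intercept:.3f}, R^2 = {r2:.3f}")
```

A negative slope in this scale corresponds to an exponential decrease; fitting each phase separately would yield two slopes, the second steeper than the first.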

Theoretical Framework
Collective action is thought to be a fundamental tool to overcome sprawling and increasingly time-critical cybersecurity threats [53][54][55]. Yet, despite numerous studies of online platforms fostering collective action [56,57], very little evidence has been uncovered linking the organization of collective action with group performance as an output. By investigating the MISP threat management platform run by the Computer Incident Response Center Luxembourg (CIRCL), we have a unique chance to better understand how collective action is organized to tackle time-critical cybersecurity threats.
We posit that the performance of collective platforms devoted to the resolution of time-critical tasks at scale, such as MISP, stems from progressively building a knowledge and action environment made of organizations, which contribute to the resolution of events and, at the same time, bring returns on scale through (i) gaining their own experience and (ii) sharing and integrating knowledge, both of which are associated with increased performance. We further posit that, in order to offset decreasing return-on-scale due to increased group size and coordination costs [28], the organization of collective action must adapt in a modular way [58], as has already been witnessed in several open source projects [59,60]. We test our theory of collective action for tackling time-critical tasks through a set of three hypotheses and six sub-hypotheses, to understand how completion time performance is achieved for events, given (i) the nature of the event, (ii) the collective action environment and (iii) the knowledge integration environment at the time of event arrival (cf. Section 5). We proceed with an exploratory approach to test our theory by resorting to a multivariate cross-sectional regression with the ordinary least squares method (cf. Sections 6 and 7).

Hypotheses
To explain how event completion time has evolved, we consider the intrinsic nature of events (i.e., the number of attributes and tags required to characterize them), the overall collective action environment, and how knowledge is integrated. We hypothesize that these three overall factors significantly influence collective action performance, in terms of improved completion time in characterizing threat events.

Event Complexity Hinders Performance (H1)
First, events are not all equal: while some are fairly simple and require limited input in terms of attributes and of categorization with tags, others are more complex and require more effort. As shown in Figures 2A and 2B, the distributions of attributes and tags, respectively, are heavy-tailed: while a majority of events have a limited number of attributes (resp. tags), some carry a large number of attributes (resp. tags), presumably affecting the time required to complete the characterization of an event. Hypothesis 1 states:
H1: The number of attributes and tags per event negatively influences performance.

Collective Action Improves Performance (H2)
We consider how collective action at scale affects performance, positively or negatively. Namely, there are conflicting views on whether having more stakeholders (e.g., contributors, organizations) joining collective action is likely to enhance or hinder performance [17,25,26,27,28]. Yet, to exist and be sustainable, collective action necessarily needs to bring economies of scale of some form, which in turn would attract more contributors. Conversely, having more participants should bring marginally increasing performance. Therefore, we aim to test the following hypothesis:
H2a: The overall performance increases with the number of organizations participating in collective action.
Yet, as already shown in [61], the ongoing collective action workload is likely to negatively affect performance by increasing completion time. Therefore, our second hypothesis states:
H2b: Given a focal event, the number of simultaneously open events decreases performance.

Knowledge Integration Increases Performance (H3)
Having more contributors does not necessarily imply economies of scale [28]. Economies of scale may rather be generated by "the whole is more than the sum of its parts" mechanisms [25], which may stem from productive integration of information [45,62,63] as a single entity [25] or through the efficient communication of several modular sub-systems [64,65], which in turn may even mitigate free-riding [58]. Here, we recognize that the first form of knowledge integration occurs through experience, as learning within organizations [66], and one may expect that an organization having accumulated experience in characterizing a large number of threat events is likely to perform better on new events. Therefore:
H3a: More experienced organizations solve events faster.
On MISP instances, collective action goes beyond coordinating time-critical tasks. As people and organizations contribute, a large corpus of knowledge is built as a library of events, attributes, and tags. In turn, by design of the MISP software, this information can easily be reused to quickly characterize new events, as the software proposes possible matches based on the preliminary entries.
Hence, the reuse of knowledge simplifies the entry of attributes, and the knowledge is integrated by the creator of the new events. These new events are thus composed of a certain percentage of inherited attributes, which are likely to impact performance positively:
H3b: The reuse of tags and attributes from existing events contributes positively to performance in the completion of new events.
The capacity of an entity to integrate knowledge is tightly related to its modular organization [46,58,59]. As MISP clusters of events or attributes, called "Galaxies", were progressively introduced and developed on MISP CIRCL, we have an opportunity to test for modularity. We therefore formulate the following hypothesis:
H3c: Modularity in collective action positively influences performance.
By testing these three hypotheses (and six sub-hypotheses), we expect to gain robust insights on how collective action on MISP brings performance in terms of characterizing time-critical cybersecurity threats.

Method
We proceed to validate our theory by testing three hypotheses, divided into six sub-hypotheses (cf. Section 5). For this, we specify an econometric model with completion time as the main dependent variable, representing the key performance indicator in our posited theory of collective action for tackling time-critical threats (cf. Section 4).
We consider the set of $N_e$ = 22,423 events that have explicitly been marked as completed (i.e., with field Analysis = 2, see Section 3). For each event e, we define the completion time $\Delta t_{C,e}$ as
$$\Delta t_{C,e} = t_{f,e} - t_{c,e},$$
with $t_{c,e}$ the event creation date and $t_{f,e}$ the date of the last event modification.
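As a minimal sketch of this step, reading the completion time as the difference between the last modification and the creation of an event, the code below filters completed events and computes their completion time. The dictionary keys, the toy timestamps and the choice of days as a unit are assumptions; only the filter on Analysis = 2 comes from the text.

```python
from datetime import datetime


def completion_time_days(created, last_modified):
    """Completion time as last modification minus creation, expressed in days."""
    delta = last_modified - created
    return delta.total_seconds() / 86400.0


# Hypothetical event records; field names are illustrative, not the MISP schema.
events = [
    {"id": 1, "analysis": 2,
     "created": datetime(2021, 3, 1, 12, 0), "last_modified": datetime(2021, 3, 3, 12, 0)},
    {"id": 2, "analysis": 1,
     "created": datetime(2021, 3, 2, 9, 0), "last_modified": datetime(2021, 3, 2, 18, 0)},
    {"id": 3, "analysis": 2,
     "created": datetime(2021, 3, 5, 0, 0), "last_modified": datetime(2021, 3, 5, 6, 0)},
]

# Keep only events explicitly marked as completed (Analysis = 2).
completed = [e for e in events if e["analysis"] == 2]
completion_t = {
    e["id"]: completion_time_days(e["created"], e["last_modified"]) for e in completed
}
print(completion_t)
```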
To determine the relation between the dependent variable, i.e., the completion time $\Delta t_{C,e}$ of events, and the explanatory variables, we perform a multivariate cross-sectional regression [67]. Specifically, we investigate whether the completion time $\Delta t_{C,e}$ can be explained by the selected explanatory variables. The corresponding Python variable is CompletionT. For each event e, the multivariate cross-sectional regression (equation (4)) expresses $\Delta t_{C,e}$, the completion time of event e, as a linear combination of the explanatory variables $Z_{k,e}$ plus an error term. This multivariate cross-sectional regression is performed with the ordinary least squares (OLS) method. This model is suited to data that do not form a time series, which is the case here. The explained and explanatory variables are instead linked to a set of points in time, given by the creation dates $t_{c,e}$ of the events, which contains 22,423 elements, corresponding to the number $N_e$ of completed events considered. This model makes it easy to consider all chosen independent variables. However, due to the heavy-tailed behaviour of the variables and their differences in magnitude (see Section 3), we transform the variables to base-10 logarithms [68]. Accordingly, the results are expressed as the percentage change of $\Delta t_{C,e}$ when $Z_{k,e}$ varies by a certain percentage [68].
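The percentage-change reading of a regression on log-transformed variables can be made explicit with a short worked step. The coefficient symbols $\beta_0$ and $\beta_k$ are notation assumed here for illustration, since the paper's own symbols are not preserved in this section, and the primed quantities denote values after a change in a single explanatory variable.

```latex
\log_{10} \Delta t_{C,e} = \beta_0 + \beta_k \log_{10} Z_{k,e} + \dots
\quad\Longrightarrow\quad
\frac{\partial \log_{10} \Delta t_{C,e}}{\partial \log_{10} Z_{k,e}} = \beta_k ,
\qquad
\frac{\Delta t'_{C,e}}{\Delta t_{C,e}} = \left( \frac{Z'_{k,e}}{Z_{k,e}} \right)^{\beta_k}
```

For instance, with a purely hypothetical coefficient $\beta_k = -0.5$, a 10% increase in $Z_{k,e}$ multiplies the completion time by $1.1^{-0.5} \approx 0.953$, i.e., a reduction of roughly 4.7%, close to the first-order approximation $\beta_k \times 10\% = -5\%$.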
We specify the following explanatory variables in relation to the formulated hypotheses (cf. Section 5). To test hypothesis H1 (i.e., event complexity hinders performance), we resort to two explanatory variables:
- $N_{A,e}$: the number of attributes per event e. The corresponding Python variable is AttrCount, which is expected to positively influence CompletionT (i.e., reduce performance).
- $N_{T,e}$: the number of tags per event e. The corresponding Python variable is NTags, which is expected to positively influence CompletionT (i.e., reduce performance).
To test hypothesis H2 (i.e., collective action improves performance), we resort to two explanatory variables:
- $N_{O,e}$: the number of organizations listed on MISP CIRCL at the creation $t_{c,e}$ of event e. The corresponding Python variable is CumOrgs, which is expected to negatively influence CompletionT (i.e., increase performance) and to demonstrate the overall benefits of collective action for tackling time-critical threats (H2a).
- $E_{sim,e}$: the number of simultaneously open events on MISP CIRCL at the creation $t_{c,e}$ of event e. The corresponding Python variable is SimEvents, which is expected to positively influence CompletionT (i.e., reduce performance) and to show that collective action performance is bound to circumstantial operational constraints associated with time as a scarce resource (H2b) [61,69].
To test hypothesis H3 (i.e., knowledge integration increases performance), we resort to four explanatory variables:
- $E_{C,e}$: the number of events already completed by the organization at the creation $t_{c,e}$ of a new event e on its behalf. The corresponding Python variable is CumCompE, which is expected to negatively influence CompletionT (i.e., increase performance) (H3a).
- $I_{\%A,e}$: the inherited percentage of attributes per event e. The corresponding Python variable is InhPer, which is expected to negatively influence CompletionT (i.e., increase performance) (H3b).
- $N_{G,e}$: the number of galaxies created on the MISP CIRCL instance at the creation $t_{c,e}$ of event e. The corresponding Python variable is NbGalaxies, which is expected to negatively influence CompletionT (i.e., increase performance) (H3c).
- $N_{E_G,e}$: the number of events in the corresponding galaxy at the creation $t_{c,e}$ of a new event e in this galaxy. The corresponding Python variable is NbEventsinhisG, which is expected to negatively influence CompletionT (i.e., increase performance) (H3c).
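Two of the variables listed above lend themselves to a short illustration: the number of simultaneously open events and the inherited percentage of attributes. The text does not specify how either is computed from the raw MISP data, so the rules below (an event counts as open from its creation until its completion; an attribute counts as inherited if the same value appeared in an earlier event) are assumptions, as are the record layout and the integer timestamps.

```python
def simultaneous_open_events(events):
    """For each event, count the other events already created and not yet completed at its creation."""
    counts = {}
    for e in events:
        t = e["created"]
        counts[e["id"]] = sum(
            1
            for other in events
            if other["id"] != e["id"] and other["created"] <= t < other["completed"]
        )
    return counts


def inherited_percentage(events):
    """For each event, in creation order, the percentage of its attributes seen in earlier events."""
    seen = set()
    result = {}
    for e in sorted(events, key=lambda ev: ev["created"]):
        attrs = set(e["attributes"])
        result[e["id"]] = 100.0 * len(attrs & seen) / len(attrs) if attrs else 0.0
        seen |= attrs
    return result


# Hypothetical toy events; timestamps are day indices, attribute values are placeholders.
events = [
    {"id": "e1", "created": 0, "completed": 10,
     "attributes": ["1.2.3.4", "evil.example"]},
    {"id": "e2", "created": 5, "completed": 7,
     "attributes": ["1.2.3.4", "abc123"]},
    {"id": "e3", "created": 8, "completed": 20,
     "attributes": ["evil.example", "abc123", "5.6.7.8", "new.example"]},
]

sim_events = simultaneous_open_events(events)
inh_per = inherited_percentage(events)
print(sim_events)
print(inh_per)
```

The quadratic loop in `simultaneous_open_events` is kept for readability; on tens of thousands of events a sorted sweep over creation and completion times would be a more economical design.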
The pairwise correlations between the dependent variable and the independent variables are given in the correlation matrix (see Table 2). With the explanatory variables of our model defined, we are in a position to formulate the econometric model by developing equation (4) with the explanatory variables listed above. Model validation is performed as follows. When handling a multivariate regression, one must pay particular attention to multi-collinearity between the $Z_k$'s, which may distort the model. For that, one computes the variance inflation factor (VIF), obtained by regressing each explanatory variable $Z_k$ on the other explanatory variables, which provides $R_k^2$. The VIF is then given as $\mathrm{VIF}_k = 1/(1 - R_k^2)$ and must be < 10 [67]. The stability of the variance has to be examined, namely by studying heteroskedasticity, which is ruled out if the p-value obtained from a White test is lower than a threshold α = 0.05 [67]. The computation steps are performed with the Python libraries statsmodels.api.OLS for the regression, statsmodels.stats.outliers_influence for the VIF and statsmodels.stats.diagnostic for the White test.
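The regression and the VIF check can be sketched on synthetic data as follows. To stay self-contained, the sketch uses plain NumPy least squares rather than the statsmodels routines named in the text (statsmodels provides `variance_inflation_factor` in `statsmodels.stats.outliers_influence` and `het_white` in `statsmodels.stats.diagnostic`); the data, the variable names and the coefficients are invented for illustration. White's test, whose null hypothesis is homoskedastic errors, is not reimplemented here.

```python
import numpy as np


def ols(X, y):
    """OLS with an intercept; returns (coefficients, r_squared), intercept first."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return beta, 1.0 - ss_res / ss_tot


def vif(X):
    """VIF_k = 1 / (1 - R_k^2), with R_k^2 from regressing column k on the other columns."""
    factors = []
    for k in range(X.shape[1]):
        others = np.delete(X, k, axis=1)
        _, r2_k = ols(others, X[:, k])
        factors.append(1.0 / (1.0 - r2_k))
    return factors


# Synthetic (already log-transformed) regressors; z3 is built to be nearly collinear with z1.
rng = np.random.default_rng(0)
n = 2000
z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
z3 = 0.9 * z1 + 0.1 * rng.normal(size=n)
X = np.column_stack([z1, z2, z3])
y = 1.0 + 2.0 * z1 - 0.5 * z2 + rng.normal(scale=0.1, size=n)

beta, r2 = ols(X, y)
vifs = vif(X)
print("coefficients:", np.round(beta, 3), "R^2:", round(r2, 3))
print("VIFs:", [round(v, 2) for v in vifs])
```

In this toy setting the two nearly collinear columns exceed the threshold of 10 while the independent one stays close to 1, which is the pattern the VIF criterion is meant to flag.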

Results
In order to establish evidence of collective action as an efficient way of tackling time-critical cybersecurity threats, we have resorted to data from the MISP instance run by the Computer Incident Response Center Luxembourg (CIRCL). We used a multivariate cross-sectional regression analysis of the completion time (i.e., performance) required to characterize a threat event, with both event-related and collective action explanatory variables.

Dep. Variable: Completion [...] 76.4 (the remaining cells of Table 3 are not preserved)
Table 3: Results of the ordinary least squares (OLS) regression with the explained variable CompletionT and the explanatory variables CountAttr, InhPer, NTags, CumOrgs, CumCompE, NbGalaxies and NbEventsinhisG, namely the number of attributes per event, the inherited percentage of attributes per event, the number of tags per event, the cumulative number of organizations at the creation of the event e, the number of events already completed by the organization at the creation of its new event e, the number of galaxies at the creation of the event e and the number of events populating these galaxies at the creation of the event e. For each explanatory variable, the regression coefficient (in the column coeff) and its standard error (in the column std err) are provided. The significance of the explanatory variables is given by the p-value and its threshold, i.e., p-value < 0.1 (*), < 0.05 (**) or < 0.01 (***), and the goodness-of-fit by the R-squared. The other reported information is not necessary for the evaluation of the model.
The regression results are shown in Table 3. Overall, the regression model is robust and explains 41.3% of the variance ($R^2 = 0.413$). Testing for Hypothesis 1, the model shows that event complexity, measured by the number of attributes CountAttr and tags NTags, indeed influences performance negatively, i.e., event characterization completion time is increased. Hypothesis H1 is supported. Regarding how collective action improves performance (H2), the model shows that overall performance (i.e., reduced completion time) is positively associated with the number of organizations participating in MISP: Hypothesis H2a is supported. Hypothesis H2b could not be tested as a result of unexplained strong multicollinearity between CumOrgs and SimEvents. Turning to Hypothesis 3 (i.e., knowledge integration increases performance), we find that more experienced organizations perform better in reducing event completion time. Hypothesis H3a is supported. We also find that the proportion of attributes that an event e inherits from previous events, i.e., from the MISP CIRCL knowledge base, positively influences performance. Hypothesis H3b is supported. Finally, testing for Hypothesis H3c, i.e., modularity, we find mixed results. While the number of MISP Galaxies, measuring the number of modular sub-systems, influences performance positively, the number of events recorded in MISP Galaxies, measuring to some extent the intensity of modularity, influences performance negatively. Hypothesis H3c is only partially supported.
We have checked for multi-collinearity of the explanatory variables. We computed the variance inflation factor (VIF) for each explanatory variable; all values are smaller than 10. This implies that there is no evidence of multi-collinearity between the selected explanatory variables (cf. Table 4). We also controlled for heteroskedasticity, i.e., a possible instability of the variance, by performing a White test. We obtained a p-value < $10^{-2}$, which, by the criterion set out in Section 6, implies that there is no heteroskedasticity in our model. The post-analysis with the VIFs and the White test validates the model used and its results.

Explanatory variable | Notation | VIF
Number of attributes per event | $N_{A,e}$ | 5.15
Inherited percentage of attributes per event e | $I_{\%A,e}$ | 1.67
Number of tags per event e | $N_{T,e}$ | 1.03
Cumulated number of organizations at the creation of e | $F_{cum,e}$ | 6.73
Cumulated number of completed events at the creation of e | $E_{C,cum,e}$ | 3.28
Cumulated number of galaxies at the creation of e | $N_{G,cum,e}$ | 1.12
Cumulated number of events in galaxies at creation of e | $N_{E_G,cum,e}$ | 2.02
Table 4: Computation of the variance inflation factor (VIF) for the explanatory variables of the econometric model. The VIF values make it possible to detect multi-collinearity between the considered variables. As all VIF values are < 10, there is no evidence of multi-collinearity between the explanatory variables. These results validate the econometric model.
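Since the VIF is defined from the auxiliary $R_k^2$, the values of Table 4 can be inverted to see how much of each variable's variance is shared with the others, using $R_k^2 = 1 - 1/\mathrm{VIF}_k$. The short sketch below does this arithmetic; the dictionary keys are plain-text renderings of the table's notation, chosen here for convenience.

```python
# VIF values as reported in Table 4.
vif_table = {
    "N_A,e": 5.15,
    "I_%A,e": 1.67,
    "N_T,e": 1.03,
    "F_cum,e": 6.73,
    "E_C,cum,e": 3.28,
    "N_G,cum,e": 1.12,
    "N_EG,cum,e": 2.02,
}


def implied_r_squared(vif_value):
    """Invert VIF = 1 / (1 - R^2) to recover the auxiliary R^2."""
    return 1.0 - 1.0 / vif_value


implied = {name: round(implied_r_squared(v), 3) for name, v in vif_table.items()}
for name, r2_k in implied.items():
    print(f"{name}: R_k^2 = {r2_k}")
```

Read this way, the threshold VIF < 10 corresponds to an auxiliary $R_k^2$ below 0.9.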

Discussion
Organizations are increasingly encouraged to cooperate and share information to overcome cybersecurity threats. Investigating how collective action unfolds and brings performance on information-sharing platforms is necessary as cybersecurity threats have become increasingly time-critical. In other words, collective action must not only be used to characterize threat events, it must also do so before attacks unfold [55]. Here, we have investigated collective action on MISP, a popular open source threat intelligence platform, from the perspective of the time required to fully characterize an event as an objective function to be optimized (i.e., completion time or performance). We found that performance is negatively associated with event complexity (Hypothesis 1) and positively associated with collective action (Hypothesis 2). Indeed, as the number of organizations taking part in information-sharing on the MISP instance studied increased, the time required to complete the characterization of events decreased. This result points to positive returns on scale, which necessarily exist given the increased adoption of MISP as well as other information-sharing platforms. Nevertheless, the mechanisms generating these economies of scale have remained unclear. We considered knowledge integration [46] as the collective action process at work to generate "the whole is more than the sum of its parts" effects [25]. With Hypothesis 3, we tested organizational learning, knowledge integration and modularity, and found them positively associated with performance, with mixed results for modularity.
While event completion time is associated with explanatory variables pertaining to event complexity, collective action, and knowledge integration, we could not establish causality. Although this is a significant limitation of our model, we have organized our multivariate cross-sectional regression in a way that minimizes the risk of uncovering spurious dependencies between the explained variable and the explanatory variables. Moreover, the fact that all our explanatory variables are significant (with the exception of SimEvents, the number of simultaneously open events on MISP CIRCL at the creation of an event, which had to be excluded from the model) suggests that our proposed theory on collective action for tackling time-critical tasks is comprehensive and altogether robust. Yet, the regression analysis approach remains exploratory. Indeed, it does not provide reliable information on which precise collective action mechanisms generate positive returns to scale. Building and testing fine-grained causal models of critical cascades in collective action, inspired by, e.g., [25][26][27], may help better understand the activity, learning, knowledge integration and modularization paths of contributing organizations, as well as how they handle time as a particularly scarce resource [69]. Indeed, when tackling large numbers of time-critical tasks, such as cybersecurity threats or incidents, contingencies necessarily appear [61]. These may affect coordination between contributors and, as a result, performance, either in a transient way or by triggering long-term instability through cascades of disorganization.

At the meso-scale, our model does not account for affinities between events, between organizations, or for the combined commonalities of events and organizations. Indeed, as on a number of online collective action platforms, modular galaxies on MISP show that some sub-communities of organizations have specific goals when tackling cybersecurity threats. These specific interests deserve further scrutiny. For instance, are the organizations contributing to a given MISP galaxy active in the same industry? If not, why do they share an interest in similar threats? Considering MISP (or other information-sharing platforms) from the perspective of threats, we may investigate kinship between threats, as they most often share attributes. Questioning, and perhaps predicting, how attributes are "transmitted" from one event to others is likely to be key to anticipating threats and to guiding organizations in their search for (respectively, contributions to) threat information. It may even help decide what information should be shared and with whom.
Finally, our results show that completion time, as an objective function in collective action concerned with time-critical tasks, can be optimized. This opens further perspectives for computational social science research. One may envision using machine learning to recommend personalized precision strategies that optimize the organization of collective action and knowledge integration. This may help make the best use of time as an increasingly scarce resource, especially in the face of a looming tsunami of cybersecurity threats.

Conclusion
Information-sharing in cybersecurity has become an increasingly common collective action practice. Yet, its benefits have so far remained unclear. We have investigated MISP, a commonly used open source threat sharing platform, and we found that building a critical mass of contributing organizations, and of knowledge to be integrated from past threats, brings significant economies of scale. Through collective action, security researchers overcome the challenge of characterizing cybersecurity threats, which appear to be increasingly time-critical. We find that performance, defined as the time needed to fully characterize a threat event, is (i) negatively influenced by its own complexity, (ii) positively influenced by collective action, and (iii) positively influenced by learning, knowledge integration and modularity. Our results also inform more generally on how collective action can be organized online, at scale and in a modular way, to overcome a large number of time-critical tasks.
Based on investigation needs or on reports found in newspapers or on specialized websites, the user creates an event to contextualize and encapsulate the related attributes (i.e., IoCs) and their properties (e.g., an IP address). All events have some general properties, such as creation date, the aforementioned sharing level, threat level (i.e., 1: High, 2: Medium, 3: Low, 4: Undefined), analysis level (i.e., 0: Initial, 1: Ongoing, 2: Complete) and a general description. The creator of an event can choose whether this event is published on the remote instance or remains internal to the organization. Once the event is created, attributes are added to populate it. The event attributes refer to intrusion artifacts or methods used by attackers. These attributes provide details and are characterized by their type (e.g., filename|md5, sha256, etc.) and by the category they belong to (e.g., Antivirus detection, Targeting data, etc.), which places them in context and justifies their attribution to the corresponding event. To add an attribute to an event, global information is required, namely its category, its type and its distribution (either the same as for the event or its own rule), as well as two important text fields: value and contextual comment. The "value" field stores the data we want to add, e.g., a URL leading to a report, while the "comment" field holds complementary information about the attribute. Moreover, it is possible to assign one or more tags to an event in order to simplify the reading and the classification of this event. These tags can follow the MISP taxonomy, i.e., a fixed machine-tag vocabulary, or be created by the users according to their needs.
On the platform, events, attributes, organizations and tags are each associated with their own identification (ID) number, and their creation is timestamped, as are the publication and the last update of an event.
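The event structure described above can be summarized as a small data model. The following is a minimal sketch, not the MISP core format: the class names, field names and the example values (including the tag string) are hypothetical and chosen only to mirror the properties listed in the text.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

# Level codes as listed in the text.
THREAT_LEVELS = {1: "High", 2: "Medium", 3: "Low", 4: "Undefined"}
ANALYSIS_LEVELS = {0: "Initial", 1: "Ongoing", 2: "Complete"}


@dataclass
class Attribute:
    """One indicator attached to an event (hypothetical, simplified layout)."""
    attribute_id: int
    category: str       # e.g. "Antivirus detection"
    type: str           # e.g. "sha256"
    value: str          # the data itself, e.g. a hash or a URL
    comment: str = ""   # free-text contextual comment
    created: datetime = field(default_factory=datetime.now)


@dataclass
class Event:
    """A threat event that encapsulates attributes and tags (hypothetical layout)."""
    event_id: int
    org_id: int
    description: str
    threat_level: int    # key of THREAT_LEVELS
    analysis_level: int  # key of ANALYSIS_LEVELS
    created: datetime
    attributes: List[Attribute] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)

    def is_complete(self) -> bool:
        """True when the analysis level is 2 ("Complete")."""
        return ANALYSIS_LEVELS[self.analysis_level] == "Complete"


# Made-up example: an ongoing, medium-threat event with one attribute and one tag.
event = Event(
    event_id=1,
    org_id=42,
    description="Illustrative event",
    threat_level=2,
    analysis_level=1,
    created=datetime(2017, 1, 1),
)
event.attributes.append(
    Attribute(1, "Antivirus detection", "sha256", "0" * 64, "placeholder hash")
)
event.tags.append("example:tag")
```

Keeping the level codes in dictionaries makes the mapping between the stored integers and their meaning explicit, which is convenient when filtering on the analysis level.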
As an open-source platform, MISP relies on voluntary action. On the one hand, its members can create or exchange content. On the other hand, these same actors can obtain new insights or possible response elements from the community regarding cyber-threats of interest. To organize interactions and to create information-sharing incentives for the participants, MISP offers the aforementioned sharing levels through a comprehensive sharing model. Users can select with whom they want to share information by choosing among several levels, ranging from the most restrictive to the most open. Regardless of access, and to guarantee the quality of the shared data, only the organization that created an event has the permission to modify it. However, each user can submit their own suggestions to change an event created by others, who can then accept or reject the proposal.
Moreover, experience with older MISP versions has shown that the time needed to fill in the fields and a complicated web interface introduce some friction. To address this, a free-text importer has been deployed, so that data can be copied and pasted into the intended field. Further, MISP implements a heuristics-based algorithm, which helps users match events or event attributes with events or attributes already in the database. However, let us add that the matching is never performed automatically and always goes through human supervision.

A.2 Data Retrieval
To investigate our hypotheses, we curated the main dataset by considering only the closed events, i.e., the events with an analysis level equal to 2, meaning "complete".
To retrieve the data, we followed the user guide (footnote 5) provided by the MISP CIRCL instance. We used the PyMISP module to download the data as files in .json format. The main dataset contains one file per event. These event files contain the attributes (see the MISP core format, footnote 6), as well as the name and the ID of the concerned organizations. However, due to the policy of the MISP CIRCL instance, we cannot disclose the names of these organizations; these names are of no interest here and have no influence on the obtained results.
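The curation step (keeping only events whose analysis level is 2) can be sketched on the downloaded files as follows. This is an illustration under stated assumptions rather than the authors' script: it assumes one .json file per event, with the event record stored under a top-level "Event" key and the analysis level in an "analysis" field. The function names are hypothetical, and no PyMISP call is used.

```python
import json
from pathlib import Path
from typing import Dict, Iterator


def load_event(path: Path) -> Dict:
    """Read one event file and return the event record."""
    with path.open(encoding="utf-8") as handle:
        data = json.load(handle)
    # Assumption: the record sits under a top-level "Event" key;
    # fall back to the whole document otherwise.
    return data.get("Event", data)


def completed_events(folder: Path) -> Iterator[Dict]:
    """Yield the events of `folder` whose analysis level is 2 ("complete")."""
    for path in sorted(folder.glob("*.json")):
        event = load_event(path)
        # int() because the level may be stored as the string "2" or the number 2.
        if int(event.get("analysis", -1)) == 2:
            yield event
```

A call such as `list(completed_events(Path("events")))`, where `events` is a hypothetical folder name, would return the curated subset.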

B.1 Probabilistic Distributions
In order to understand the mechanisms at work on the MISP platform, we investigate the distribution of our data: we present the selected variables and explore the distribution associated with each of them. In some cases, we are able to investigate the probability distribution. Hence, if we consider a random variable $X$ with a probability density function (PDF) $f_X(x)$, the cumulative distribution function (CDF) $F_X(x)$ is given by:

$$F_X(x) = P(X \le x) = \int_{-\infty}^{x} f_X(t)\,dt. \qquad (6)$$

Then, thanks to formula (6), the complementary cumulative distribution function (CCDF) $\bar{F}_X(x)$ can be written as follows:

$$\bar{F}_X(x) = P(X > x) = 1 - F_X(x). \qquad (7)$$

This CCDF provides a rank ordering of the selected variables.
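As an illustration of how a CCDF is obtained from observations, the sketch below computes the empirical $P(X > x)$ at each distinct observed value. The function name is hypothetical and the input numbers are made up; this is not the code used for the figures.

```python
from typing import List, Sequence, Tuple


def empirical_ccdf(values: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (x, P(X > x)) for each distinct observed x, in increasing order of x."""
    n = len(values)
    ordered = sorted(values)
    points = []
    for i, x in enumerate(ordered):
        # Keep only the last occurrence of each distinct value, so that the
        # fraction counts the observations strictly larger than x.
        if i + 1 == n or ordered[i + 1] != x:
            points.append((x, (n - i - 1) / n))
    return points


print(empirical_ccdf([1, 1, 2, 5]))  # [(1, 0.5), (2, 0.25), (5, 0.0)]
```

The largest observation always gets a CCDF value of 0, which cannot be drawn on a logarithmic axis; plots on such axes therefore either drop this last point or use $P(X \ge x)$ instead.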

B.2 Fit of the Data
Before fitting our data, a visual analysis can be performed. By varying the scale of the axes used to depict the data (double linear, linear-logarithmic or double logarithmic), we are able to fit the data whenever they approximately follow a straight line in one of the cases presented below. The logarithmic scales are considered in base 10.
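The visual check can be complemented by a numerical one: transform the axes, fit a straight line, and compare how well each scale linearizes the data. The sketch below is one possible way to do so and is not taken from the paper. The function names are hypothetical, "linear-logarithmic" is assumed to mean a linear x-axis with a logarithmic y-axis, and all values are assumed to be strictly positive and non-constant.

```python
import math
from typing import Dict, Sequence, Tuple


def linear_fit(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float, float]:
    """Least-squares fit of y = a * x + b; returns (a, b, R^2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    r2 = (sxy * sxy) / (sxx * syy)  # squared Pearson correlation
    return a, b, r2


def identity(v: float) -> float:
    return v


# (x transform, y transform) for each pair of axis scales; base-10 logarithms.
SCALES = {
    "double linear": (identity, identity),
    "linear-logarithmic": (identity, math.log10),
    "double logarithmic": (math.log10, math.log10),
}


def compare_scales(xs: Sequence[float], ys: Sequence[float]) -> Dict[str, Tuple[float, float, float]]:
    """Fit a straight line under each axis scale; returns {scale: (a, b, R^2)}."""
    return {
        name: linear_fit([fx(x) for x in xs], [fy(y) for y in ys])
        for name, (fx, fy) in SCALES.items()
    }
```

For data generated as $y = 3\,x^{-1.5}$, for instance, the double logarithmic scale yields a slope close to $-1.5$ and the highest $R^2$ of the three.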

B.2.1 Double Linear Scales
Consider two data vectors $\vec{x}$ and $\vec{y}$, and plot the data contained in $\vec{y}$ (y-axis) as a function of the data in $\vec{x}$ (x-axis), with linear scales on both axes. If the displayed data follow an approximately straight line, each element $y_i$ of the vector $\vec{y}$ is given by the relation:

$$y_i = a\,x_i + b, \qquad (8)$$

where $a$ is the slope of the straight line and $b$ its intercept. Thanks to relation (8), we are able to compute the estimates $\hat{y}_i$, $a$ and $b$ by applying a least-squares linear regression. To validate the parameters obtained from the linear regression, we need to establish the goodness-of-fit. For this type of simple linear regression, we use Pearson's coefficient of determination $R^2$ and, to reinforce the results of $R^2$, we perform a Wald test with a chosen level $\alpha = 0.05$ to determine whether these two samples are significantly identical or not. A value $R^2 \approx 1$ then implies a strong correlation between $\vec{x}$ and $\vec{y}$, while a p-value $< \alpha$ for the Wald test allows us to affirm that the parameters of [...] smaller than the chosen threshold $\alpha = 0.05$, we can affirm that our data follow a power law [70]. Sometimes the data are not well described by a power law distribution, which is why we have to investigate other heavy-tailed distributions such as the log-normal (L) or the Weibull (W) (i.e., stretched-exponential) distributions, for which we can assess the goodness-of-fit with the previous Kolmogorov-Smirnov test and its p-value. However, when the results are approximately the same, the power law is preferred because it is determined by one parameter instead of two for the two aforementioned distributions. The computations in this part are largely inspired by the work of A. Clauset et al.: they were done with Python libraries such as plfit for the power law, and implemented according to the work of A. Clauset et al. for the other distributions [70].
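To make the power-law fit and the Kolmogorov-Smirnov statistic more concrete, the sketch below estimates the CCDF exponent $\mu$ of $P(X > x) \sim 1/x^{\mu}$ by maximum likelihood and computes the Kolmogorov-Smirnov distance between the data and the fitted model. It is a simplified illustration in the spirit of the approach of Clauset et al., not the plfit implementation. The function name is hypothetical, the lower cut-off `x_min` is supplied by the caller instead of being optimized, the continuous approximation is used even though the variables studied here are counts, and the p-value (which requires resampling) is not computed.

```python
import math
from typing import Sequence, Tuple


def fit_power_law_tail(values: Sequence[float], x_min: float) -> Tuple[float, float]:
    """Fit P(X > x) = (x / x_min) ** (-mu) to the values >= x_min.

    Returns (mu, ks_distance). In terms of the density exponent alpha,
    mu = alpha - 1.
    """
    tail = sorted(v for v in values if v >= x_min)
    n = len(tail)
    # Continuous maximum-likelihood estimate of the CCDF exponent.
    mu = n / sum(math.log(v / x_min) for v in tail)
    # Kolmogorov-Smirnov distance: largest gap between the empirical CDF
    # (just before and just after each point) and the fitted CDF.
    ks = 0.0
    for i, v in enumerate(tail):
        model_cdf = 1.0 - (v / x_min) ** (-mu)
        ks = max(ks, abs(model_cdf - i / n), abs((i + 1) / n - model_cdf))
    return mu, ks
```

A small Kolmogorov-Smirnov distance indicates that the fitted power law stays close to the empirical distribution over the whole tail; turning this distance into a p-value would require comparing it with distances obtained on synthetic samples.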

B.2.4 Goodness-of-fits Summary
The results of the fits presented in this article (i.e., Figures 1, 2 and 3), as well as their goodness-of-fit, are detailed in Table 5 below.

Figure 1: A. Complementary cumulative distribution function (CCDF) of events per contributing organization, which is best described by a power law distribution $P(X_E > x_E) \sim 1/x_E^{\mu_E}$ with $\mu_E = 0.54(4)$. The fit and the goodness-of-fit, provided by the Kolmogorov-Smirnov test, are obtained with the Python library plfit. B. The curve of joining organizations (in blue) has followed a linear growth with slope $\alpha_O = 0.79(1)$ ($R^2 = 0.99$, p-value $< 10^{-2}$) after September 14, 2015, the presumed date of the official start. The events contributed by the organizations have been added (in dark gray); their distribution shows the heterogeneity of the organizations' efforts.

Figure 2: A. Complementary cumulative distribution function (CCDF) of attributes encapsulated in an event, which is best described by a power law distribution $P(X_A > x_A) \sim 1/x_A^{\mu_A}$ with $\mu_A = 0.64(1)$. B. CCDF of tags attached to an event, which is best described by a power law distribution $P(X_T > x_T) \sim 1/x_T^{\mu_T}$ with $\mu_T = 2.26(6)$. The fits and the goodness-of-fit of panels A and B, provided by the Kolmogorov-Smirnov test, are obtained with the Python library plfit.

Table 1: 10 of 1,908 organizations have contributed 66.62% of the 39,639 events, bringing further evidence of the heavy-tailed nature of the distribution of contributions by organizations in MISP CIRCL.

Table 2: Correlation matrix of dependent and explanatory variables.