The Feasibility and Inevitability of Stealth Attacks

We develop and study new adversarial perturbations that enable an attacker to gain control over decisions in generic Artificial Intelligence (AI) systems, including deep learning neural networks. In contrast to adversarial data modification, the attack mechanism we consider here involves alterations to the AI system itself. Such a stealth attack could be conducted by a mischievous, corrupt or disgruntled member of a software development team. It could also be made by those wishing to exploit a "democratization of AI" agenda, where network architectures and trained parameter sets are shared publicly. We develop a range of new implementable attack strategies with accompanying analysis, showing that with high probability a stealth attack can be made transparent, in the sense that system performance is unchanged on a fixed validation set which is unknown to the attacker, while evoking any desired output on a trigger input of interest. The attacker only needs to have estimates of the size of the validation set and the spread of the AI's relevant latent space. In the case of deep learning neural networks, we show that a one neuron attack is possible (a modification to the weights and bias associated with a single neuron), revealing a vulnerability arising from over-parameterization. We illustrate these concepts using state-of-the-art architectures on two standard image data sets. Guided by the theory and computational results, we also propose strategies to guard against stealth attacks.


Introduction
It is widely recognized that Artificial Intelligence (AI) systems can be vulnerable to adversarial attacks [37]; that is, small, often imperceptible, perturbations that exploit instabilities. The danger of an attacker gaining control of an automated decision-making process is of particular concern in high-stakes or safety-critical settings, including medical imaging [13,14], transport [41] and textual analysis [12]. The last few years have therefore seen an escalation in the design of both attack and defence strategies [1,23,34], and recent work has considered the bigger question of whether the existence of instabilities and successful attacks is inevitable [3,33,5,40,11].
Work in this field has focused on adversarial perturbations to the training or input data, and may be classified as black-box, grey-box or white-box, depending on the level of information and access available to the attacker. We look at a different setting, which was introduced in [40] and is motivated further in subsection 2.1.
Here, the attacker is assumed to have control of the AI system, in the sense of being able to edit code or change parameters. We therefore consider attacks that perturb the system itself in an attempt to change the output on a specific input point of interest to the attacker, without any change in the AI's performance on a large, unknown validation set and without retraining. We focus on two issues: (1) strategies for constructing such attacks and (2) theoretical analysis of the likelihood of success.
In Section 2 we present a general framework of open-box stealth attacks and discuss additional related work. In Section 3 we reveal a new class of attacks (Algorithms 1-3) and establish probability bounds of their success (Theorems 1-3).
Algorithm 1, and the corresponding Theorem 1, consider the case when the Attacker's choice of triggers is limited only by the requirement that the triggers' feature representations stay within a ball containing the Owner's validation set (choosing triggers with feature representations outside of this ball always results in successful attacks). In this case, we show that the likelihood of attack success can be made arbitrarily close to one, for any fixed dimension of the model's latent space (Theorem 1, Remark 4). This is a significant departure from the previous state of knowledge, as success likelihoods for such attacks were thought to be limited by dimension [40]. To establish these high probabilities of success, the attack must be executed with arbitrarily high accuracy and the model must satisfy appropriate reachability conditions [23].
Algorithm 2 and Theorem 2 relate to approaches that enable attack triggers to be camouflaged as legitimate data by requesting that the Attacker's triggers produce latent representations within some given neighborhood of those corresponding to specified inputs.
Algorithm 3 and Theorem 3 consider the case where the Attacker's capability to change or explore feature spaces of the model is constrained to some finite number of attributes or a smaller-dimensional subspace. The case is motivated by the ideas from [7], where the authors proposed methods to generate adversarial perturbations confined to smaller-dimensional subspaces of the original input space (in contrast to the stealth attack setting considered here). This scenario enables attacks for models with sparse data representations in latent spaces. Remarkably, these constrained attacks may have significant probability of success even when the accuracy of their implementation is relatively low; see Theorem 3.
Section 4 presents experiments which illustrate the application of the theory to realistic settings and demonstrate the striking feasibility of one neuron attacks, which alter the weights of just a single neuron. Section 5 concludes with recommendations on how the vulnerabilities we exposed in this work can be mitigated by model design practices. Proofs of the theorems can be found in the Appendix, along with extra algorithmic details and computational results.

Stealth attacks

General framework
Consider a generic AI system, a map

$$\mathcal{F} : U \to \mathbb{R} \tag{1}$$

producing some decisions on its outputs in response to an input from $U \subset \mathbb{R}^m$. The map $\mathcal{F}$ can define input-output relationships for an entire deep neural network or some part (a sub-graph), an ensemble of networks, a tree, or a forest. For the purposes of our work, the AI system's specific task is not relevant and can include classification, regression, or density estimation.
In the classification case, if there are multiple output classes then we regard (1) as representing the output component of interest; we consider changes to $\mathcal{F}$ that do not affect any other output components. This is the setting in which our computational experiments are conducted. The flexibility for stealth attacks to work independently of the choice of output component is a key feature of our work.
The AI system has an Owner operating the AI. An Attacker wants to take advantage of the AI by forcing it to make decisions in their favour. Conscious about security, the Owner created a validation set which is kept secret. The validation set is a list of input-output pairs produced by the uncompromised system (1). The Owner can monitor security by checking that the AI reproduces these outputs. Now, suppose that the Attacker has access to the AI system but not the validation set. The phrase stealth attack was used in [40] to describe the circumstance where the Attacker chooses a trigger input and modifies the AI so that:
• the Owner could not detect this modification by testing on the validation set,
• on the trigger input the modified AI produces the output desired by the Attacker.
Figure 1, panel C, gives a schematic representation of this setup. The setup is different from other known attack types such as adversarial attacks, in which the Attacker exploits access to the AI to compute imperceptible input perturbations altering AI outputs (shown in Figure 1, panel A), and data poisoning attacks (Figure 1, panel B), in which the Attacker exploits access to AI training sets to plant triggers directly.
We note that the stealth attack setting is relevant to the case of a corrupt, disgruntled or mischievous individual who is a member of a software development team that is creating an AI system, or who has an IT-related role where the system is deployed. In this scenario, the attacker will have access to the AI system and, for example, in the case of a deep neural network, may choose to alter the weights, biases or architecture. The scenario is also pertinent in contexts where AI systems are exchanged between parties, such as
• "democratization of AI" [2], where copies of large-scale models and parameter sets are made available across multiple public domain repositories,
• transfer learning [29], where an existing, previously trained tool is used as a starting point in a new application domain,
• outsourced cloud computing, where a third party service conducts training [15].

Formal definition of stealth attacks
Without loss of generality, it is convenient to represent the initial general map (1) as a composition of two maps, $F$ and $\Phi$:

$$\mathcal{F} = F \circ \Phi, \quad F : \mathbb{R}^n \to \mathbb{R}, \quad \Phi : U \to \mathbb{R}^n. \tag{2}$$

The map $\Phi$ defines a general latent representation of inputs from $U$, whereas the map $F$ can be viewed as the decision-making part of the AI system. In the context of deep learning models, latent representations can be outputs of hidden layers in deep learning neural networks, and decision-making parts could constitute operations performed by fully-connected and softmax layers at the end of the networks. If $\Phi$ is an identity map then setting $F = \mathcal{F}$ brings us to the initial case (1). An additional advantage of the compositional representation (2) is that it enables explicit modelling of the focus of the adversarial attack, that is, the part of the AI system subjected to adversarial modification. This part will be modelled by the map $F$.
A perturbed, or attacked, map $F_a$ is defined as

$$F_a : \mathbb{R}^n \times \Theta \to \mathbb{R}, \quad F_a(\cdot, \theta) = F(\cdot) + A(\cdot, \theta), \tag{3}$$

where the term $A : \mathbb{R}^n \times \Theta \to \mathbb{R}$ models the effect of an adversarial perturbation, and $\Theta \subset \mathbb{R}^m$ is a set of relevant parameters. Such adversarial perturbations could take different forms, including modification of parameters of the map $F$ and physical addition or removal of components involved in computational processes in the targeted AI system. In neural networks the parameters are the weights and biases of a neuron or a group of neurons, and the components are neurons themselves. As we shall see later, a particularly instrumental case occurs when the term $A$ is just a single Rectified Linear Unit (ReLU) function [21],

$$A(\cdot, w, b) = D\,\mathrm{ReLU}(\langle \cdot, w \rangle - b), \tag{4}$$

where $\theta = (w, b)$ represents the weights and bias, or a sigmoid (see [18], [22] for further information on activation functions of different types),

$$A(\cdot, w, b) = D\,\sigma(\langle \cdot, w \rangle - b), \quad \sigma(s) = \frac{1}{1 + \exp(-s)}, \tag{5}$$

where $D \in \mathbb{R}$ is a constant gain factor.
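To make the additive structure of the perturbed map concrete, the following is a minimal sketch in Python, assuming the single-neuron forms $A(x, w, b) = D\,g(\langle x, w\rangle - b)$ with $g$ a ReLU or a sigmoid. The toy decision map `F`, the vectors `w`, `x_clean`, `x_trigger` and the helper names are hypothetical and chosen only for illustration.

```python
import numpy as np

def relu(s):
    return np.maximum(s, 0.0)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def attack_term(x, w, b, D=1.0, g=relu):
    """Single-neuron perturbation A(x, w, b) = D * g(<x, w> - b)."""
    return D * g(np.dot(x, w) - b)

def attacked_map(F, x, w, b, D=1.0, g=relu):
    """Perturbed decision map F_a(x) = F(x) + A(x, w, b) on a latent vector x."""
    return F(x) + attack_term(x, w, b, D, g)

# Hypothetical decision map: a fixed linear read-out of a 3-d latent vector.
F = lambda x: float(np.dot(x, np.array([0.5, -0.2, 0.1])))

w = np.array([1.0, 0.0, 0.0])
b = 0.8
x_clean = np.array([0.1, 0.3, -0.2])   # <x, w> - b < 0: the ReLU neuron stays silent
x_trigger = np.array([1.0, 0.0, 0.0])  # <x, w> - b = 0.2 > 0: the neuron fires

delta_clean = attacked_map(F, x_clean, w, b) - F(x_clean)
delta_trigger = attacked_map(F, x_trigger, w, b) - F(x_trigger)
```

With a ReLU the output on `x_clean` is left exactly unchanged, whereas a sigmoid (`g=sigmoid`) would change it by a small but non-zero amount.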
Having introduced the relevant notation, we are now ready to provide a formal statement of the problem of stealth attacks introduced in Section 2.1.
Problem 1 (ε-∆ Stealth Attack on $\mathcal{F}$) Consider a classification map $\mathcal{F}$ defined by (1), (2). Suppose that an owner of the AI system has a finite validation (or verification) set $V \subset U$. The validation set $V$ is kept secret and is assumed to be unknown to an attacker. The cardinality of $V$ is bounded from above by some constant $M$, and this bound is known to the attacker.
Given ε ≥ 0 and ∆ > 0, a successful ε-∆ stealth attack takes place if the attacker modifies the map $F$ in $\mathcal{F}$ and replaces it by $F_a$ constructed such that for some $u' \in U$, known to the attacker but unknown to the owner of the map $\mathcal{F}$, the following properties hold:

$$|F \circ \Phi(u) - F_a \circ \Phi(u)| \le \varepsilon \ \text{ for all } u \in V, \qquad |F \circ \Phi(u') - F_a \circ \Phi(u')| \ge \Delta. \tag{6}$$

In words, when $F$ is perturbed to $F_a$ the output is changed by no more than ε on the validation set, but is changed by at least ∆ on the trigger, $u'$. We note that this definition does not require any notion of whether the classification of $u$ or $u'$ is "correct." In practice we are interested in cases where ∆ is sufficiently large for $u'$ to be assigned to different classes by the original and perturbed maps.
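The two requirements of Problem 1 can be written as a short check. The sketch below is illustrative only: it takes the validation set as an argument, so it represents the Owner's point of view (the Attacker does not know V), and the toy maps `Phi`, `F`, `F_a` are hypothetical.

```python
import numpy as np

def is_stealth_attack(F, F_a, Phi, validation_inputs, trigger, eps, Delta):
    """Return True if F_a changes the output by at most eps on every validation
    input and by at least Delta on the trigger input."""
    unchanged = all(abs(F(Phi(u)) - F_a(Phi(u))) <= eps for u in validation_inputs)
    changed = abs(F(Phi(trigger)) - F_a(Phi(trigger))) >= Delta
    return unchanged and changed

# Hypothetical toy system: identity feature map and a linear decision map.
Phi = lambda u: np.asarray(u, dtype=float)
F = lambda x: float(x.sum())

# Hypothetical attack neuron: 2 * ReLU(<x, w> - b).
w, b = np.array([1.0, 1.0]), 1.5
F_a = lambda x: F(x) + 2.0 * max(float(np.dot(x, w)) - b, 0.0)

V = [np.array([0.1, 0.2]), np.array([-0.5, 0.4]), np.array([0.7, 0.6])]
trigger = np.array([1.0, 1.0])

result = is_stealth_attack(F, F_a, Phi, V, trigger, eps=0.0, Delta=1.0)
```

Here every validation point satisfies $\langle x, w\rangle < b$, so the added neuron is silent on V while the trigger shifts the output by exactly 1.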
Remark 1 (The target class for the trigger $u'$) Note that the above setting can be adjusted to fit a broad range of problems, including general multi-class problems. Crucially, the stealth attacks proposed in this paper allow the attacker to choose which class the modified AI system $F_a$ predicts for the trigger image $u'$. We illustrate these capabilities with examples in Section 4.

Related work
Adversarial attacks. A broad range of methods aimed at revealing vulnerabilities of state-of-the-art AI, and deep learning models in particular, to adversarial inputs has been proposed to date (see e.g. recent reviews [23,32]). The focus of this body of work has been primarily on perturbations to signals/data processed by the AI. In contrast to this established framework, here we explore possibilities to determine and implement small changes to the AI structure, without retraining.
Data poisoning. Gu et al. [20] (see also [8] and references therein, and [27] for explicit upper bounds on the volume of poisoned data that is sufficient to execute these attacks) showed how malicious data poisoning occurring, for example, via outsourcing of training to a third party, can lead to backdoor vulnerabilities. Performance of the modified model on the validation set, however, is not required to be perfect. A data poisoning attack is deemed successful if the model's performance on the validation set is within some margin of the user's expectations.
The data poisoning scenario is different from our setting in two fundamental ways. First, in our case the attacker can maintain performance of the perturbed system on an unknown validation set within arbitrarily small or, in the case of ReLU neurons, zero margins. Such imperceptible changes on unknown validation sets are a signature characteristic of the threat we have revealed and studied. Second, the attacks we analysed do not require any retraining.
Other stealth attacks. Liu et al. [26] proposed a mechanism, SIN², whereby a service provider gains control over their customers' deep neural networks through the provider's specially designed APIs. In this approach, the provider needs to first plant malicious information in higher-precision bits (e.g. bits 16 and higher) of the network's weights. When an input trigger arrives, the provider extracts this information via its service's API and performs malicious actions as per the encoded instructions.
In contrast to this approach, the stealth attacks we discovered do not require any special computational environments. After an attack is planted, it can be executed in fully secure and trusted infrastructure. In addition, our work reveals further concerns about how easily a malicious service provider can implement stealth attacks in hostile environments [26] by swapping bits in the mantissa of weights and biases of a single neuron (see Figure 2 for the patterns of change) at will.
Our current work is a significant advancement from the preliminary results presented in [40], both in terms of algorithmic detail and theoretical understanding of the phenomenon. First, the vulnerabilities we reveal here are much more severe. Bounds on the probability of success for attacks in [40] are constrained by $1 - M 2^{-n}$. Our results show that under the same assumptions (that input reachability [23] holds true), the probabilities of success for attacks generated by Algorithms 1, 2 can be made arbitrarily close to one (Remark 4). Second, we explicitly account for cases when input reachability holds only up to some accuracy, through parameters α, δ (Remark 3). Third, we present concrete algorithms, scenarios and examples of successful exploitation of these new vulnerabilities, including the case of one neuron attacks (Section 4, Appendix; code is available in [39]).

New stealth attack algorithms
In this section we introduce two new algorithms for generating stealth attacks on a generic AI system. These algorithms return a ReLU or a sigmoid neuron realizing an adversarial perturbation A. Implementation of these algorithms will rely upon some mild additional information about the unknown validation set V. In particular, we request that latent representations Φ(u) for all u ∈ V are located within some ball $B_n(0, R)$ whose radius R is known to the attacker. We state this requirement in Assumption 1 below.

Assumption 1 (Latent representations of the validation data V)
There is an R > 0, known to the attacker, such that

$$\|\Phi(u)\| \le R \ \text{ for all } u \in V. \tag{7}$$

Given that the set V is finite, infinitely many such balls exist. Here we request that the attacker knows just a single value of R for which (7) holds. The value of R does not have to be the smallest possible.
We also suppose that the feature map Φ in (2) satisfies an appropriate reachability condition (cf. [23]), based on the following definition.
Definition 1 (υ-input reachability of the classifier's latent space) Consider a func- In what follows we will assume existence of some sets, namely n − 1 spheres, in the classifier's latent spaces which are input reachable for the map Φ. Precise definitions of these sets will be provided in our formal statements.

Remark 2
The requirement that the classifier's latent space contains specific sets which are υ-input reachable for the map Φ may appear restrictive. If, for example, feature maps are vector compositions of ReLU and affine mappings (4), then several components of these vectors might be equal to zero in some domains. At the same time, one can always search for a new feature map $\tilde{\Phi}$:

$$\tilde{\Phi} = T \Phi,$$

where the matrix $T \in \mathbb{R}^{d \times n}$, $d \le n$, is chosen so that relevant domains in the latent space formed by $\tilde{\Phi}$ become υ-input reachable. As we shall see later in Theorems 1 and 2, these relevant domains are spheres $S^{d-1}(c, \delta R)$, δ ∈ (0, 1], where $c \in \mathbb{R}^d$ is some given reference point.
If the feature map $\Phi : U \to \mathbb{R}^n$ with $U \subset \mathbb{R}^m$ is differentiable and non-constant, and the set U has a non-empty interior, Int(U), then matrices T producing υ-input reachable feature maps $\tilde{\Phi}$ in some neighborhood of a point from Int(U) can be determined as follows. Pick a target input $u_0 \in \mathrm{Int}(U)$, then

$$\Phi(u) = \Phi(u_0) + J (u - u_0) + o(\|u - u_0\|),$$

where J is the Jacobian of Φ at $u_0$. Suppose that rank(J) = d, d > 0, and let

$$J = U \begin{pmatrix} \Sigma & 0 \\ 0 & 0 \end{pmatrix} V^{\mathsf{T}}$$

be the singular value decomposition of J, where Σ is a d × d diagonal matrix containing the non-zero singular values of J on its main diagonal, and $U \in \mathbb{R}^{n \times n}$, $V \in \mathbb{R}^{m \times m}$ are unitary matrices. If m = d or n = d then the corresponding zero matrices in the above decomposition are assumed to be empty. Setting

$$T = \begin{pmatrix} I_d & 0 \end{pmatrix} U^{\mathsf{T}}$$

ensures that for any arbitrarily small υ > 0 there is an $r(\upsilon, u_0) > 0$ such that the sets $S^{d-1}(\tilde{\Phi}(u_0), \rho)$, $\rho \in [0, r(\upsilon, u_0))$, are υ-input reachable for the function $\tilde{\Phi} = T\Phi$. Note that the same argument applies to maps Φ which are differentiable only in some neighborhood of $u_0 \in \mathrm{Int}(U)$. This enables the application of this approach for producing υ-input reachable feature functions to maps involving compositions of ReLU functions. In what follows we will use the symbol n to denote the dimension of the space into which Φ maps the input, assuming that the input reachability condition holds for this feature map.
Linearity of $\tilde{\Phi}$ in T makes it possible to preserve the structure of perturbations (4), (5) so that they remain, in effect, just a single additional neuron. We used the latter property in our computational experiments (see Remark 10 in Section A.4) to generate examples of stealth attacks.
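The construction in Remark 2 can be sketched numerically. The snippet below is an illustration under an explicit assumption: T is taken to be the first d rows of $U^{\mathsf{T}}$ from the singular value decomposition of the Jacobian, which is one natural choice making $TJ$ full rank; the function name, the tolerance and the random test Jacobian are hypothetical.

```python
import numpy as np

def reachability_projection(J, tol=1e-10):
    """Build a d x n matrix T from the left singular vectors of the n x m
    Jacobian J, where d is the numerical rank of J (assumed choice of T)."""
    U, s, Vt = np.linalg.svd(J, full_matrices=True)
    d = int(np.sum(s > tol * s.max()))  # numerical rank of J
    T = U[:, :d].T                      # first d rows of U^T, i.e. (I_d 0) U^T
    return T, d

rng = np.random.default_rng(0)
n, m, d_true = 6, 4, 2
# Hypothetical rank-deficient Jacobian of a feature map at some input u_0.
J = rng.standard_normal((n, d_true)) @ rng.standard_normal((d_true, m))

T, d = reachability_projection(J)
TJ = T @ J  # Jacobian of the projected feature map T * Phi at u_0
```

Because `TJ` has full row rank d, the projected feature map is locally onto a neighborhood of its value at $u_0$, which is the property the remark relies on.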

Target-agnostic stealth attacks
Our first stealth attack algorithm is presented in Algorithm 1. The algorithm produces a modification of the relevant part $F$ of the original AI system that is implementable by a single ReLU or sigmoid function. Regarding the trigger input, $u'$, the algorithm relies on another process (see step 3).
In what follows, we will denote this process as an auxiliary algorithm $\mathcal{A}_0$ which, for the map Φ, given R, δ ∈ (0, 1], γ ∈ (0, 1), υ < δ, $\delta + \upsilon \le \gamma^{-1}$, and any $x \in S^{n-1}(0, \delta)$, returns a solution of the following constrained search problem:

$$\text{find } u \in U \text{ such that } \left\| \frac{\Phi(u)}{R} - x \right\| \le \upsilon, \quad \gamma \left\| \frac{\Phi(u)}{R} \right\| \le 1. \tag{9}$$

Observe that υR-input reachability, υ < δ, of the set $S^{n-1}(0, R\delta)$ for the map Φ, together with the choice of δ, υ, and γ satisfying $\delta + \upsilon \le \gamma^{-1}$, ensures existence of a solution to the above problem for every $x \in S^{n-1}(0, \delta)$. Thorough analysis of computability of solutions of (9) is outside of the scope of this work (see [6] for an idea of the issues involved computationally in solving optimisation problems). Therefore, we shall assume that the auxiliary process $\mathcal{A}_0$ always returns a solution of (9) for a choice of υ, δ, γ, and R. In our numerical experiments (see Section 4), finding a solution of (9) did not pose significant issues.
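One possible way to realise the auxiliary search numerically is plain gradient descent with backtracking on the squared distance between the scaled latent vector and the target point. This is a swapped-in, generic technique used for illustration, not the authors' procedure; the toy feature map `Phi`, its gradient, and all function and parameter names are hypothetical. The feasibility test assumes the search problem asks for $\|\Phi(u)/R - x\| \le \upsilon$ and $\gamma\|\Phi(u)/R\| \le 1$.

```python
import numpy as np

def search_trigger(Phi, grad_loss, u0, x, R, upsilon, gamma, steps=500, lr=1.0):
    """Gradient descent with backtracking on ||Phi(u)/R - x||^2.
    Returns (u, distance, feasible)."""
    def loss(u):
        return float(np.sum((Phi(u) / R - x) ** 2))

    u, current = u0.copy(), loss(u0)
    for _ in range(steps):
        g = grad_loss(u)
        step = lr
        while step > 1e-12:
            cand = u - step * g
            cand_loss = loss(cand)
            if cand_loss < current:     # accept only strictly improving steps
                u, current = cand, cand_loss
                break
            step *= 0.5
        else:
            break                       # no improving step found: stop
    latent = Phi(u) / R
    dist = float(np.sqrt(current))
    feasible = bool(dist <= upsilon and gamma * np.linalg.norm(latent) <= 1.0)
    return u, dist, feasible

# Hypothetical toy feature map Phi(u) = tanh(W u) with n = 4, m = 8.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))
R, delta, gamma, upsilon = 1.0, 0.5, 0.9, 0.05
Phi = lambda u: np.tanh(W @ u)

z = rng.standard_normal(4)
x = delta * z / np.linalg.norm(z)       # target point on the sphere S^{n-1}(0, delta)

def grad_loss(u):
    # Gradient of ||tanh(W u)/R - x||^2 with respect to u.
    t = np.tanh(W @ u)
    return (2.0 / R) * W.T @ ((1.0 - t ** 2) * (t / R - x))

u0 = np.zeros(8)
init_dist = float(np.linalg.norm(Phi(u0) / R - x))
u_found, dist, feasible = search_trigger(Phi, grad_loss, u0, x, R, upsilon, gamma)
```

The backtracking loop guarantees that the distance never increases, but it does not guarantee that a feasible point is found; the `feasible` flag reports whether both constraints hold at the returned input.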
With the auxiliary algorithm $\mathcal{A}_0$ available, we can now present Algorithm 1. The performance of Algorithm 1 in terms of the probability of producing a successful stealth attack is characterised by Theorem 1 below.

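3: Use algorithm $\mathcal{A}_0$ (see (9)) to generate an input $u' \in U$ such that $x' = \Phi(u')/R$ is within an αδ-distance from x: $\|x - x'\| \le \alpha\delta$. Set

$$w = \kappa \frac{x'}{R}, \quad b = \kappa \frac{1 + \gamma}{2} \|x'\|^2, \tag{11}$$

where κ and D are chosen so that

$$|D|\, g\!\left(-\kappa \frac{1 - \gamma}{2} \|x'\|^2\right) \le \varepsilon, \quad |D|\, g\!\left(\kappa \frac{1 - \gamma}{2} \|x'\|^2\right) \ge \Delta. \tag{12}$$

* Note that a choice of κ, D so that (12) is satisfied is always possible.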
In particular, $P_{a,1}$ is greater than or equal to
Remark 3 (Choice of parameters) The three parameters α, δ and γ should be chosen to strike a balance between attack success, compute time, and the size of the weights that the stealth attack creates. In particular, from the perspective of attack success, (14) is optimized when γ and δ are close to 1 and α is close to 0. However, α and δ are restricted by the exact form of the map Φ: they need to be chosen so that the set $S^{n-1}(0, R\delta)$ is δαR-input reachable for Φ. Furthermore, the speed of executing $\mathcal{A}_0$ is also influenced by these choices, with the complexity of solving (10) increasing as α decreases to 0. The size of the weights chosen in (12) is influenced by the choice of γ, so that the $L_2$ norm of the attack neuron weights grows like $O((1 - \gamma)^{-1})$ for sigmoid g. Finally, the condition (1 + α)δ ≤ 1 ensures that the chosen trigger image $u'$ has $\|\Phi(u')\| \le R$.
Remark 4 (Determinants of success and vulnerabilities) Theorem 1 establishes explicit connections between intended parameters of the trigger $u'$ (expressed by $\Phi(u')/R$, which is to be maintained within δα from x), the dimension n of the AI's latent space, the accuracy of solving (9) (expressed by the value of α; the smaller the better), the design parameters γ ∈ (0, 1) and δ ∈ (0, 1], and the vulnerability of general AI systems to stealth attacks.
In general, the larger the value of n for which solutions of (9) can be found without increasing the value of α, the higher the probability that the attack produced by Algorithm 1 is successful. Similarly, for a given success probability bound, the larger the value of n, the smaller the value of γ required, allowing stealth attacks to be implemented with smaller weights w and bias b. At the same time, if no explicit restrictions on δ and the weights are imposed, then Theorem 1 suggests that if one picks δ = 1, γ sufficiently close to 1, and α sufficiently close to 0, subject to finding an appropriate solution of (9) (cf. the notion of reachability [23]), one can create stealth attacks whose probabilities of success can be made arbitrarily close to 1 for any fixed n. Indeed, if α = 0 then the right-hand side of (15) reduces to the expression (16), which, for any fixed M, n, can be made arbitrarily close to 1 by an appropriate choice of γ ∈ (0, 1) and δ ∈ (0, 1].
Remark 5 (Arbitrary output sign and target class) Although Problem 1 does not impose any requirements on the sign of $A(\Phi(u'), w, b)$, this quantity can be made positive or negative through the choice of D. Note that the target class can be arbitrary too (see Remark 1).
Remark 6 (Precise and zero-tolerance attacks with ReLU units) One of the original requirements for a stealth attack is to produce a response to a trigger input $u'$ such that $|A(\Phi(u'), w, b)|$ exceeds an a priori given value ∆. However, if the attack is implemented with a ReLU unit then one can select the values of w, b so that the response equals ∆ exactly. Indeed, with $\Phi(u') = R x'$, $w = \kappa x'/R$ and $b = \kappa \frac{1+\gamma}{2}\|x'\|^2$, the argument of the ReLU on the trigger is $\kappa \frac{1-\gamma}{2}\|x'\|^2 > 0$. Hence picking $\kappa = 2\Delta \left((1 - \gamma)\|x'\|^2\right)^{-1}$ and D = 1 results in the desired output.
Moreover, ReLU units produce adversarial perturbations A with ε = 0. These zero-tolerance attacks, if successful, do not change the values of $F$ on the validation set V and as such are completely undetectable on V.
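The exact-response and zero-tolerance properties of a ReLU attack neuron can be explored with a small simulation. The sketch below rests on stated assumptions: the weights and bias take the form $w = \kappa x'/R$, $b = \kappa\frac{1+\gamma}{2}\|x'\|^2$ with $\kappa = 2\Delta((1-\gamma)\|x'\|^2)^{-1}$ and D = 1, the trigger is found exactly (α = 0), and the validation latent vectors are drawn uniformly from the ball of radius R purely as a stand-in (the true validation set is fixed and unknown). All function names are hypothetical.

```python
import numpy as np

def sample_sphere(rng, n, radius):
    """Uniform sample from the sphere of the given radius in R^n."""
    z = rng.standard_normal(n)
    return radius * z / np.linalg.norm(z)

def sample_ball(rng, count, n, radius):
    """`count` uniform samples from the ball of the given radius in R^n."""
    z = rng.standard_normal((count, n))
    z /= np.linalg.norm(z, axis=1, keepdims=True)
    r = radius * rng.random(count) ** (1.0 / n)
    return z * r[:, None]

def relu_attack_neuron(x_prime, R, gamma, Delta):
    """Weights and bias of the ReLU attack neuron (assumed form, D = 1)."""
    sq = float(np.dot(x_prime, x_prime))
    kappa = 2.0 * Delta / ((1.0 - gamma) * sq)
    w = kappa * x_prime / R
    b = kappa * 0.5 * (1.0 + gamma) * sq
    return w, b

def simulate(n=50, M=1000, R=1.0, delta=0.9, gamma=0.8, Delta=1.0,
             trials=200, seed=0):
    """Fraction of trials in which the neuron is silent on all M stand-in
    validation latent vectors, i.e. the attack is undetectable with eps = 0."""
    rng = np.random.default_rng(seed)
    undetected = 0
    for _ in range(trials):
        x_prime = sample_sphere(rng, n, delta)   # scaled trigger latent vector
        w, b = relu_attack_neuron(x_prime, R, gamma, Delta)
        latents = sample_ball(rng, M, n, R)      # stand-in for Phi(u), u in V
        responses = np.maximum(latents @ w - b, 0.0)
        undetected += int(np.all(responses == 0.0))
    return undetected / trials

rate = simulate(n=20, M=100, trials=20)
```

Varying `n`, `M`, `gamma` and `delta` in `simulate` gives an empirical feel for how dimension and validation set size interact; the numbers it produces are simulation estimates under the uniform stand-in, not the bounds of Theorem 1.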
Remark 7 (Hiding adversarial attacks in redundant structures) Algorithm 1 (and its input-specific version, Algorithm 2, below) implements adversarial perturbations by adding a single sigmoid or ReLU neuron. A question therefore arises whether one can "plant" or "hide" a stealth attack within an existing AI structure. Intuitively, over-parametrisation and redundancies in many deep learning architectures should provide ample opportunities precisely for this sort of malevolent action. As we empirically justify in Section 4, this is indeed possible. In these experiments, we looked at a neural network with L layers. The map F corresponded to the last k + 1 layers: L − k, . . ., L, k ≥ 1. We looked for a neuron in layer L − k whose output weights have the smallest $L_1$-norm of all neurons in that layer (see Appendix, Section A.6). This neuron was then replaced with a neuron implementing our stealth attack, and its output weights were wired so that a given trigger input $u'$ evoked the response we wanted from this trigger. This new type of one neuron attack may be viewed as the stealth version of a one pixel attack [36]. Surprisingly, this approach worked consistently well across various randomly initialized instances of the same network. These experiments suggest a real, non-hypothetical possibility of turning a needle in a haystack (a redundant neuron) into a pebble in a shoe (a malevolent perturbation).
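The neuron selection rule described in Remark 7 amounts to an argmin over $L_1$ norms of outgoing weights. A minimal sketch, assuming the outgoing weights of layer L − k are stored as a matrix with one row per neuron (the layout and the example matrix `W_out` are hypothetical):

```python
import numpy as np

def least_influential_neuron(W_out):
    """W_out[i, j] is the weight from neuron i in layer L-k to neuron j in the
    next layer. Returns the index of the neuron whose outgoing weights have the
    smallest L1 norm, together with all the norms."""
    l1 = np.abs(W_out).sum(axis=1)
    return int(np.argmin(l1)), l1

# Hypothetical outgoing weights of a 3-neuron layer feeding 3 units.
W_out = np.array([[0.5, -1.2, 0.3],
                  [0.01, 0.02, -0.01],
                  [2.0, 0.1, 0.4]])

idx, norms = least_influential_neuron(W_out)
```

In this toy example the second neuron (index 1) contributes least to the next layer and would be the candidate for replacement.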
Remark 8 (Other activation functions) In this work, to be concrete we focus on the case of attacking with a ReLU or sigmoid activation function.However, the algorithms and analysis readily extend to the case of any continuous activation function that is nonnegative and monotonically increasing.Appropriate pairs of matching leaky ReLUs also fit into this framework.

Target-specific stealth attacks
The attack and the trigger $u' \in U$ constructed in Algorithm 1 are "arbitrary" in the sense that x is drawn at random. As opposed to standard adversarial examples, they are not linked to, and do not target, any specific input. Hence a question arises: is it possible to create a targeted adversarial perturbation of the AI which is triggered by an input $u' \in U$ located in a vicinity of some specified input, $u^*$?
As we shall see shortly, this may indeed be possible through some modifications to Algorithm 1. For technical convenience and consistency, we introduce a slight reformulation of Assumption 1.
Assumption 2 (Relative latent representations of the validation data V) Let $u^* \in U$ be a target input of interest to the attacker. There is an R > 0, also known to the attacker, such that

$$\|\Phi(u) - \Phi(u^*)\| \le R \ \text{ for all } u \in V. \tag{17}$$

We note that Assumption 2 follows immediately from Assumption 1 if a bound on the size of the latent representation $\Phi(u^*)$ of the input $u^*$ is known. Indeed, if $R'$ is a value of R for which (7) holds then (17) holds whenever $R \ge R' + \|\Phi(u^*)\|$. Algorithm 2 provides a recipe for creating such targeted attacks.
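The step from Assumption 1 to Assumption 2 is an application of the triangle inequality. Writing $R'$ for a radius satisfying Assumption 1 (the symbol $R'$ is used here only to distinguish the two radii), for every $u \in V$ one has:

```latex
\begin{align*}
\|\Phi(u) - \Phi(u^*)\|
  &\le \|\Phi(u)\| + \|\Phi(u^*)\| \\
  &\le R' + \|\Phi(u^*)\| \\
  &\le R \quad \text{whenever } R \ge R' + \|\Phi(u^*)\|.
\end{align*}
```

So an attacker who knows $R'$ and can evaluate $\Phi(u^*)$ obtains an admissible radius for the target-centred ball at no extra cost.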
where κ and D are chosen as in (12).
* Note that such a choice is always possible.
Output: trigger input $u'$, weight vector $w$, bias $b$, and output gain D of the sigmoid or ReLU function g.
Then the probability $P_{a,2}$ that Algorithm 2 returns a successful ε-∆ stealth attack is bounded from below by (14) and (15). Remarks 3-8 apply equally to Algorithm 2. In addition, the presence of a specified target input $u^*$ offers extra flexibility and opportunities. An attacker may, for example, have a list of potential target inputs. Applying Algorithm 2 to each of these inputs produces different triggers $u'$, each with different possible values of α. According to Remark 4 (see also (16)), small values of α imply higher probabilities that the corresponding attacks are successful. Therefore, having a list of target inputs and selecting a trigger with minimal α increases the attacker's chances of success.

Attribute-constrained attacks and a "concentrational collapse" for validation sets V chosen at random
The attacks developed and analysed so far can affect all attributes of the input data's latent representations. It is also worthwhile to consider triggers whose deviation from targets in latent spaces is limited to only a few attributes. This may help to disguise triggers as legitimate data, an approach which was recently shown to be successful in the context of adversarial attacks on input data [7]. Another motivation stems from more practical scenarios where the task is to find a trigger in the vicinity of the latent representation of another input in models with sparse coding. The challenge here is to produce attack triggers whilst retaining sparsity. Algorithm 3 and the corresponding Theorem 3 are motivated by these latter scenarios.
As before, consider an AI system with an input-output map (2). Assume that the attacker has access to a suitable real number R > 0, an input $u^*$, and a set of orthonormal vectors $h_1, h_2, \dots, h_{n_p} \in \mathbb{R}^n$ which span the space in which the attacker can make perturbations. Furthermore, assume that both the validation set and the input $u^*$ satisfy Assumption 3 below.
Assumption 3 (Data model) Elements $u_1, \dots, u_M$ are randomly chosen so that $\Phi(u_1), \dots, \Phi(u_M)$ are random variables in $\mathbb{R}^n$ and such that both of the following hold:
1. There is a $c \in \mathbb{R}^n$ such that $u^*$ and (with probability 1) $u_1, \dots, u_M$ are in the set

$$\{u \in U \mid \|\Phi(u) - c\| \le R/2\}. \tag{18}$$

2. There is a C ≥ 1 such that for any r ∈ (0, R/2] and $\xi \in B_n(c, R/2)$

$$P\left(\Phi(u_i) \in B_n(\xi, r)\right) \le C \left(\frac{2r}{R}\right)^n. \tag{19}$$

Property (18) requires that the validation set is mapped into a ball centered at c and having a radius R/2 in the system's latent space, and (19) is a non-degeneracy condition restricting pathological concentrations.
Consider Algorithm 3. As we show below, if the validation set satisfies Assumption 3 and the attacker uses Algorithm 3 then the probabilities of generating a successful attack may be remarkably high even if the attack triggers' latent representations deviate from those of the target in only a few attributes.
where κ and D are chosen as in (12).
* Note that such a choice is always possible.
Then $\theta^* \in [0, \pi]$, $\rho(\theta) \in [0, 1]$ for $\theta \in [\theta^*, \pi]$, and the probability $P_{a,3}$ that Algorithm 3 returns a successful ε-∆ stealth attack is bounded from below by
We emphasize a key distinction between the expressions (20) and (14). In (20) the final term is independent of M, whereas the corresponding term in (14) is multiplied by a factor of M. The only term affected by M in (20) is modulated by an additional exponent with the base $(1 - \rho(\theta)^2)^{1/2}$, which is strictly smaller than one in $(\theta^*, \pi]$ for $\theta^* < \pi$. In order to give a feel for how far the bound provided in Theorem 3 may be from the one established in Theorems 1, 2, we computed the corresponding bounds for $n = n_p = 200$, γ = 0.9, δ = 1/3, M = 2500, C = 100, and α ∈ [0.01, 0.3]. Results of the comparison are summarised in Table 1. In this context, Theorem 3 reveals a phenomenon where validation sets which could be deemed sufficiently large in the sense of bounds specified by Theorems 1 and 2 may still be considered "small" due to bound (20). We call this phenomenon concentrational collapse. When concentrational collapse occurs, the AI system becomes more vulnerable to stealth attacks when the cardinality of validation sets V is sub-exponential in dimension n, implying that the owner would have to generate and keep a rather large validation set V to make up for these small probabilities. Remarkably, this vulnerability persists even when the attacker's precision is small (α large).
Table 2: Bounds (14), where n is replaced with $n_p$, and (20) for different values of the accuracy parameter α, and fixed γ = 0.9, δ = 1/3, n = 200, $n_p$ = 5, M = 2500, C = 100.
Remark 9 (Lower-dimensional triggers) Another important consequence of concentrational collapse, which becomes evident from the proof of Theorem 3, is the possibility to sample vectors x from the $n_p - 1$ sphere, $S^{n_p - 1}(0, \delta)$, $2 \le n_p < n$, instead of from the $n - 1$ sphere $S^{n-1}(0, \delta)$. Note that the second term in the right-hand side of (20) scales linearly with the cardinality M of the validation set V and has an exponent in n, which is the "ambient" dimension of the feature space. The third term in (20) decays exponentially with $n_p < n$ but does not depend on M. The striking difference between the behavior of bounds (20) and (14) at small values of $n_p$ and large n is illustrated with Table 2. According to Table 2, bound (14) is either infeasible or impractical for all tested values of the accuracy parameter α. In contrast, bound (20) indicates relatively high probabilities of stealth attacks' success for the same values of α as long as all relevant assumptions hold true. This implies that the concentrational collapse phenomenon can be exploited for generating triggers with lower-dimensional perturbations of target inputs in the corresponding feature spaces. The latter possibility may be relevant for overcoming defense strategies which enforce high-dimensional yet sparse representations of data in latent spaces. Our experiments in Section 4.1, which reveal high stealth attack success rates even for relatively low-dimensional perturbations, are consistent with these theoretical observations.
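Sampling a perturbation direction from the lower-dimensional sphere described in Remark 9 can be sketched as follows. The matrix `H` holding the orthonormal vectors $h_1, \dots, h_{n_p}$ as columns is generated at random here purely for illustration; in practice these directions would be chosen by the attacker.

```python
import numpy as np

def sample_subspace_sphere(H, delta, rng):
    """H is an (n, n_p) matrix with orthonormal columns h_1, ..., h_{n_p}.
    Returns a vector in span(H) with Euclidean norm delta, obtained by drawing
    uniform coordinates on the (n_p - 1)-sphere of radius delta."""
    z = rng.standard_normal(H.shape[1])
    z *= delta / np.linalg.norm(z)
    return H @ z

rng = np.random.default_rng(1)
n, n_p, delta = 200, 5, 1.0 / 3.0

# Hypothetical attribute directions: orthonormalise a random n x n_p matrix.
H, _ = np.linalg.qr(rng.standard_normal((n, n_p)))

x = sample_subspace_sphere(H, delta, rng)
```

Since the columns of `H` are orthonormal, the norm of the coordinates is preserved, so `x` lies on the sphere of radius δ while varying only within the $n_p$ chosen directions.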

Experiments
Let us show how stealth attacks can be constructed for given deep learning models. We considered two standard benchmark problems in which deep learning networks are trained on the CIFAR-10 [24] and a MATLAB version of the MNIST [25] datasets. All networks had a standard architecture with feature-generation layers followed by dense fully connected layers. Three alternative scenarios for planting a stealth attack neuron were considered. Schematically, these scenarios are shown in Figure 2. Scenario 3 is included for completeness; it is computationally equivalent to Scenario 1 but respects the original structure more closely by passing information between successive layers, rather than skipping directly to the end.
In our experiments we determined trigger-target pairs and changed the network's architecture in accordance with planting Scenarios 1 and 2. It is clear that if Scenario 1 is successful then Scenario 3 will be successful too. The main difference between Scenario 2 and Scenarios 1 and 3 is that in the former case we replace a neuron in F by the "attack" neuron. The procedure describing selection of a neuron to replace (and hence attack) in Scenario 2 is detailed in Section A.6, and the algorithm which we used to find triggers is described in Section A.4. When implementing stealth attacks in accordance with Scenario 2, we always selected neurons whose susceptibility rank is 1.
As a general rule, in Scenario 2, we attacked neurons in the block of fully connected layers just before the final softmax block (a fully connected layer followed by a layer with softmax activation functions; see Figure 3 and Table 3 for details). The attacks were designed to make the network return whichever class label the Attacker wanted in response to a trigger input computed by the Attacker. In order to do so, the weight between the attacked neuron and the neuron associated with the Attacker-intended class output was set

Stealth attacks for a class of networks trained on the CIFAR-10 dataset
To assess their viability, we first considered the possibility of planting stealth attacks into networks trained on the CIFAR-10 dataset. The CIFAR-10 dataset is composed of 32 × 32 colour RGB images, which correspond to inputs of dimension 3072. Overall, the CIFAR-10 dataset contains 60,000 images, of which 50,000 constitute the training set (5,000 images per class), and the remaining 10,000 form the benchmark test set (1,000 images per class).

Network architecture
To pick a good neural architecture for this task we analysed a collection of state-of-the-art models for various benchmark data [35]. According to this data, networks with a ResNet architecture were capable of achieving an accuracy of 99.37% on the CIFAR-10 dataset. This is consistent with the reported level of label errors of 0.54% in the CIFAR-10 test set [30].
ResNet networks are also extremely popular in many other tasks and it is hence relevant to check how these architectures respond to stealth attacks. The structure of the neural network used for this task is shown in Figure 3. The map F is implemented by the last four layers of the network, highlighted by the dashed rectangle in panel c of Figure 3. The remaining part of the network represents the map Φ.

Training protocol
The network was trained for 100 epochs with minibatches containing 128 images, using L2 regularisation of the network weights (regularisation factor 0.0001), and with stochastic gradient descent with momentum. Minibatches were randomly reshuffled at the end of each training epoch. Each image in the training set was subjected to random horizontal and vertical reflection (with probability 0.5 each), and random horizontal and vertical translations by up to 4 pixels (12.5% of the image size) in each direction. The initial learning rate was set to 0.1, and the momentum parameter was set to 0.9. After the first 60 epochs the learning rate was changed to 0.01. The trained network achieved 87.56% accuracy on the test set.

Construction of stealth attacks
Stealth attacks took the form of single ReLU neurons added to the output of the last ReLU layer of the network. These neurons received 128-dimensional inputs from the fifth from last layer (the output of the map Φ), shown in Figure 3, panel c, just above the dashed rectangle. The values of γ, δ, and ∆ were set to 0.9, 0.5, and 50, respectively. The value of R was estimated from a sample of 1% of the images available for training.
The validation set V was composed of 1,000 randomly chosen images from the training set. Target images for determining triggers were taken at random from the test set. Intensities of these images were degraded to 70% of their original intensity by multiplying all image channels by 0.7. This makes the problem harder from the perspective of an attacker. To find the trigger, a standard gradient-based search method was employed to solve the relevant optimization problem in step 3 of the algorithm (see Section A.4 for more details).

Effective local dimension of feature maps
We also examined the effective local dimension of the feature maps by assessing the sparsity of feature vectors for the network trained in this experiment. Average sparsity, i.e. the ratio of zero attributes of Φ(u) to the total number of attributes (128), was 0.9307 (8.87 nonzero attributes out of 128) for u from the entire training set, and was equal to 0.9309 (8.85 nonzero attributes out of 128) for u from the validation set V. Thus the relevant dimension n of the perturbation δ (see Remark 2) used in Algorithm 2 was significantly lower than that of the ambient space.
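The sparsity statistic above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper name `average_sparsity` and the toy feature vectors are hypothetical stand-ins, not the authors' code or data.

```python
def average_sparsity(features):
    """Mean fraction of exactly-zero attributes per feature vector."""
    ratios = [sum(1 for v in vec if v == 0.0) / len(vec) for vec in features]
    return sum(ratios) / len(ratios)


# Toy stand-in for post-ReLU latent vectors Phi(u); real vectors would have 128 attributes.
toy_features = [
    [0.0, 0.0, 1.5, 0.0],  # 3 of 4 attributes are zero
    [0.0, 2.0, 0.3, 0.0],  # 2 of 4 attributes are zero
]

sparsity = average_sparsity(toy_features)
mean_nonzero = (1.0 - sparsity) * len(toy_features[0])
print(sparsity, mean_nonzero)

# Consistency check of the figures quoted in the text:
# a sparsity of 0.9307 over 128 attributes leaves about 8.87 nonzero attributes.
reported_nonzero = (1.0 - 0.9307) * 128
```

The same relation, (1 − sparsity) × 128, links the reported value 0.9309 to roughly 8.85 nonzero attributes on the validation set.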

Performance of stealth attacks
For the trained network, we implemented 20 stealth attacks following Scenario 1 in Figure 2, with 13 out of 20 attacks succeeding (the output being identically zero for all images from the validation set V), and 7 failing (some images from the set V evoked non-zero responses).

To assess the viability of stealth attacks implemented in accordance with Scenario 2, we analysed network sensitivity to the removal of a single ReLU neuron from the fourth and third to the last layers of the network. These layers constitute an "attack layer", that is, a part of the network subject to potential stealth attacks. Results are shown in Figure 5. The implementation of stealth attacks in this scenario (Scenario 2) follows the process detailed in Remark 7 and Section A.6 (see Appendix). For our particular network, we observed that there is a pool of neurons (85 out of the total 128) such that the removal of a single neuron from this pool does not have any effect on the network's performance on the validation set V unknown to the attacker. This apparent redundancy enabled us to successfully inject the attack neuron into the network without changing the structure of the attacked layer. In those cases when the removal of a single neuron led to mismatches, the proportion of mismatched responses was below 3%, with the majority of cases showing less than 1% of mismatched responses (see Figure 5 for details).
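The sensitivity test described above (silence one ReLU neuron, then count how many validation responses still match) can be illustrated on a toy network. This is a minimal sketch: the two-layer network, its weights, and the function names are hypothetical and are not the ResNet used in the experiments.

```python
def hidden_activations(x, hidden_w, hidden_b):
    """ReLU layer; hidden_w[j] is the input weight vector of hidden neuron j."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(hidden_w, hidden_b)]


def predict(x, hidden_w, hidden_b, out_w, removed=None):
    """Index of the largest output score; `removed` silences one hidden neuron."""
    h = hidden_activations(x, hidden_w, hidden_b)
    if removed is not None:
        h[removed] = 0.0
    scores = [sum(w * v for w, v in zip(row, h)) for row in out_w]
    return max(range(len(scores)), key=lambda c: scores[c])


def percent_matched(inputs, hidden_w, hidden_b, out_w, removed):
    """Percentage of inputs whose label is unchanged after removing one neuron."""
    same = sum(
        predict(x, hidden_w, hidden_b, out_w)
        == predict(x, hidden_w, hidden_b, out_w, removed)
        for x in inputs
    )
    return 100.0 * same / len(inputs)


# Hypothetical toy network: 2 inputs, 3 hidden ReLU neurons, 2 classes.
hidden_w = [[1.0, 0.0], [0.0, 1.0], [0.01, 0.01]]
hidden_b = [0.0, 0.0, 0.0]
out_w = [[1.0, 0.0, 0.1], [0.0, 1.0, -0.1]]  # out_w[c][j]: neuron j -> class c
validation = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.2]]

redundant = percent_matched(validation, hidden_w, hidden_b, out_w, removed=2)
important = percent_matched(validation, hidden_w, hidden_b, out_w, removed=0)
print(redundant, important)
```

In this toy case the third neuron has small output weights and its removal leaves every label unchanged, whereas removing the first neuron changes one of the three labels.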

Stealth attacks for a class of networks trained on the MNIST dataset

4.3.1 Network architecture
The general architecture of the deep convolutional neural network which we used in this set of experiments on MNIST data [25] is summarized in Table 3. The architecture features 3 fully connected layers, with layers 15-18 (shown in red) and layers 1-14 representing the maps F and Φ, respectively. This architecture is built on a standard benchmark example from MathWorks. The original basic network was extended with layers 13-16 to emulate dense fully connected structures present in popular deep learning models such as VGG16 and VGG19. Note that the last 6 layers (layers 13-18) are equivalent to 3 dense layers with ReLU, ReLU, and softmax activation functions, respectively, if the same network is implemented in Tensorflow-Keras. Having 3 dense layers is not essential for the method, and the attacks can be implemented in other networks featuring ReLU or sigmoid neurons. Outputs of the softmax layer assign class labels to images. Label 1 corresponds to digit "0", label 2 to digit "1", and label 10 to digit "9", respectively.
The map Φ in (2) was associated with operations performed by layers 1-14 (shown in black in Table 3, Section 4.3.1), and the map F modelled the transformation from layer 15 to the first neuron in layer 17.

Training protocol
MATLAB's version of the MNIST dataset of 10,000 images was split into a training set consisting of 7,500 images and a test set containing 2,500 images (see example code for details of implementation [39]). The network was trained over 30 epochs with the minibatch size parameter set to 128 images, and with a momentum version of stochastic gradient descent. The momentum parameter was set to 0.9 and the learning rate was 0.01/(1 + 0.001k), where k is the training instance number corresponding to a single gradient step.
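The learning-rate decay quoted above can be written as a one-line function. This is a sketch only; the function and parameter names are assumptions made for illustration, and k is taken to be the index of the gradient step, as stated in the text.

```python
def learning_rate(k, base_rate=0.01, decay=0.001):
    """Learning rate 0.01 / (1 + 0.001 k) at gradient step k."""
    return base_rate / (1.0 + decay * k)


# The rate starts at the base value and is halved after 1000 gradient steps.
for step in (0, 100, 1000):
    print(step, learning_rate(step))
```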

Construction of stealth attacks
Our stealth attack was a single ReLU neuron receiving n = 200 inputs from the outputs of ReLU neurons in layer 14. These outputs, for a given image u, produced latent representations Φ(u). The "attack" neuron was defined as A(·, w, b) = D ReLU(⟨·, w⟩ − b), where the weight vector w ∈ R^200 and bias b ∈ R were determined in accordance with Algorithm 2.
In Scenario 1 (see Figure 2) the output of the "attack" neuron is added directly to the output of F (the first neuron in layer 17 of the network). Scenario 2 follows the process described in Remark 7 and Section A.6 below. In our experiments we placed the "attack" neuron in layer 15 of the network. This was followed by adjusting weights in layer 17 in such a way that connections to neurons 2-10 from the "attack" neuron were set to 0, and the weight of the connection from the "attack" neuron to neuron 1 in layer 17 was set to 1.
As the unknown validation set V we used 99% of the test set. The remaining 1% of the test set was used to derive an empirical estimate of the value of R needed for the implementation of Algorithm 2. Other parameters in the algorithm were set as follows: δ = 1/3, γ = 0.9, ∆ = 50, and ε = 0. A crucial step of the attack is step 3 in Algorithm 2, where a trigger image u′ is generated. As before, to find the trigger we used a standard gradient-based search method to solve the optimization problem in step 3 (see Section A.4). The values of α varied between experiments (see Tables 4, 5, and 6 for examples of specific values in some experiments).
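The form of the "attack" neuron, A(x, w, b) = D ReLU(⟨x, w⟩ − b), can be illustrated as follows. This is a sketch of the neuron's forward pass only: the weights, bias, and gain below are made-up two-dimensional values chosen for readability and are not the output of Algorithm 2.

```python
def relu(t):
    return max(0.0, t)


def attack_neuron(x, w, b, gain):
    """Evaluate gain * ReLU(<x, w> - b) for a latent vector x."""
    inner = sum(xi * wi for xi, wi in zip(x, w))
    return gain * relu(inner - b)


# Hypothetical values for illustration (not produced by Algorithm 2).
w = [1.0, 0.0]
b = 0.5
gain = 100.0

validation_like = [0.2, 0.9]  # <x, w> - b < 0, so the neuron stays silent
trigger_like = [1.0, 0.0]     # <x, w> - b > 0, so the neuron fires

silent = attack_neuron(validation_like, w, b, gain)
fired = attack_neuron(trigger_like, w, b, gain)
print(silent, fired)
```

The design intent described in the text is visible here: the bias keeps the neuron at zero on ordinary inputs, while the gain amplifies its response on the trigger.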
By default, feature maps were constructed using only those neurons from the attack layer which return non-zero values for a given target image. In addition, in order to numerically explore the influence of the dimension of the feature spaces on the attack success, we also constructed stealth attacks for feature maps where the rows of the matrix T are the first d principal components of the set Z(u*) = {z | z = Φ(u* + ξ_i), ξ_i ∼ N(0, I_m), i = 1, ..., 5000}, with N(0, I_m) being the m-dimensional normal distribution with zero mean and identity covariance matrix. In these experiments, the value of d was set to 0.3N, where N is the number of principal components of the set Z(u*).

Figure 6 illustrates how Algorithm 2 performed for the above networks. Three target images (bottom row in the left panel of Figure 6: digit 2, plain grey square, and a random image) produced three corresponding trigger images (top row of the panel). When elements (images) from the unknown validation set V were presented to the network, the neuron's output was 0. The histogram of values ⟨Φ(u), w⟩ − b, u ∈ V, is shown in Figure 6 (right panel). As we can see, the neuron is firmly in the "silent mode" and does not affect the classification for any element of the unknown validation set V. As per Remark 1, and in contrast to classical adversarial attacks, in the attacks we implemented in this section we were able to control the class to which trigger images are assigned (as opposed to adversarial attacks seeking to alter the response of the classifier without regard to the specific response).

Performance of stealth attacks
Examples of successful trigger-target pairs for digits 0-9 are shown in Figure 7. As is evident from these figures, the triggers retain significant resemblance to the original target images u*. However, they look noticeably different from the target ones. This sharply contrasts with the trigger images we computed for the CIFAR-10 dataset. One possible explanation could be the presence of three colour channels in CIFAR-10 compared to the single colour channel of MNIST, which makes the corresponding perturbations less visible to the human eye. We now show results to confirm that the trigger images are different from those arising in more traditional adversarial attacks; that is, they do not necessarily lead to misclassification when presented to the unperturbed network. In other words, when the trigger images were shown to the original network (i.e. the trained network before it was subjected to a stealth attack produced by Algorithm 2), in many cases the network returned a class label which coincided with the class label of the target image. A summary for 20 different network instances and triggers is provided in Figure 8. Red text highlights instances when the original classification of the target images did not match that of the trigger. In these 20 experiments target images of digits from 0 to 9 were chosen at random. In each row, the number of entries in the second column in the table in Figure 8 corresponds to the number of times the digit in the first column was chosen as a "target" in these experiments.
In these experiments, the rates of attack success in Scenarios 1 and 2 (Figure 2) were 100% (20 out of 20) and 85% (17 out of 20), respectively. For the one neuron attack in Scenario 2 we followed the approach discussed in Remark 7, with the procedure for selecting a neuron to be replaced described in the Appendix, Section A.6. When considering reduced-dimension feature maps Φ (see (21)) whilst maintaining the same value of δ (δ = 1/3), the attacks' rate of success dropped to 50% (10 out of 20) in Scenario 1 and to 40% (8 out of 20) in Scenario 2. Yet, when the value of δ was increased to 2/3, the rate of success recovered to 100% (20 out of 20) in Scenario 1 and 85% (17 out of 20) in Scenario 2. These results are consistent with the bounds established by Theorems 2 and 3.
One neuron attacks in Scenario 2 exploit the sensitivity of the network to the removal of a neuron. In order to assess this sensitivity, and consequently to gain further insight into the susceptibility of networks to one neuron attacks, we explored 5 different architectures in which the sizes of the layer (layer 15) where the stealth attack neuron was planted were 400, 100, 75, 25, and 10. All other parameters of these networks were kept as shown in Table 3. For each architecture we trained 100 randomly initialised networks and assessed their robustness to the replacement of a single neuron. Figure 9, left panel, shows the frequency with which replacing a neuron from layer 15 did not produce any change in network output on the validation set V. The frequencies are shown as a function of the neuron susceptibility rank (see Appendix, Section A.6 for details). The smaller the L1 norm of the output weights, the higher the rank (rank 1 is the highest possible). As we can see, removal of top-ranked neurons did not affect the performance in over 90% of cases for networks with 400 neurons in layer 15, and in over 60% of cases for networks with only 10 neurons in layer 15. Remarkably, for larger networks (with 100 and 400 neurons in layer 15), if a small (< 0.3%) error margin on the validation set V is allowed, then a one neuron attack has the potential to be successful in 100% of cases (see Figure 9, right panel). Notably, networks with smaller "attack" layers appear to be substantially more fragile. If the latter networks break, then the maximal observed degradation of their performance tends to be pronounced.
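The ranking rule stated above (the smaller the L1 norm of a neuron's output weights, the higher its rank, with rank 1 the highest) can be sketched as below. Section A.6 is not reproduced here, so this follows only the description in the text; the function name, the data layout, and breaking ties by neuron index are assumptions.

```python
def susceptibility_ranks(output_weights):
    """Rank neurons by the L1 norm of their outgoing weights.

    output_weights[j] holds the weights leaving neuron j.
    Returns ranks[j], where rank 1 marks the smallest L1 norm.
    Ties are broken by neuron index (an assumption of this sketch).
    """
    norms = [sum(abs(w) for w in row) for row in output_weights]
    order = sorted(range(len(norms)), key=lambda j: norms[j])
    ranks = [0] * len(norms)
    for rank, j in enumerate(order, start=1):
        ranks[j] = rank
    return ranks


# Hypothetical outgoing weights of three neurons in the attacked layer.
weights = [[0.5, -0.5], [0.1, 0.0], [2.0, 1.0]]
ranks = susceptibility_ranks(weights)
print(ranks)
```

Under this rule, the neuron with rank 1 is the candidate selected for replacement in Scenario 2.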

Conclusions
In this work we reveal and analyse new adversarial vulnerabilities to which large-scale networks may be particularly susceptible. In contrast to the largely empirical literature on the design of adversarial attacks, we accompany the algorithms with results on their probability of success. By design, research in this area looks at techniques that may be used to override the intended functionality of algorithms that are used within a decision-making pipeline, and hence may have serious negative implications in key sectors, including security, defence, finance and healthcare. By highlighting these shortcomings in the public domain we raise awareness of the issue, so that both algorithm designers and end-users can be better informed. Our analysis of these vulnerabilities also identifies important factors contributing to their successful exploitation: dimension of the network latent space, accuracy of executing the attack (expressed by parameter α), and over-parameterization. These determinants enable us to propose model design strategies to minimise the chances of exploiting the adversarial vulnerabilities that we identified.
Deeper, narrower graphs to reduce the dimension of the latent space. Our theoretical analysis (bounds stated in Theorems 1 and 2) suggests that the higher the dimension of latent spaces in state-of-the-art deep neural networks, the higher the chances of a successful one neuron attack whereby an attacker replaces a single neuron in a layer. These chances approach one exponentially with dimension. One strategy to address this risk is to transform wide computational graphs in those parts of the network which are most vulnerable to open-box attacks into computationally equivalent deeper but narrower graphs. Such transformations can be done after training and just before the model is shared or deployed. An alternative is to employ dimensionality reduction approaches facilitating lower-dimensional layer widths during and after the training.
Constraining attack accuracy. An ability to find triggers with arbitrarily high accuracy is another component of the attacker's success. Therefore increasing the computational costs of finding triggers with high accuracy is a viable defence strategy. A potential approach is to use gradient obfuscation techniques developed in the adversarial examples literature coupled with randomisation, as suggested in [31]. It is well-known that gradient obfuscation may not always prevent an attacker from finding a trigger [4]. Yet, making this process as complicated and computationally intensive as possible would contribute to increased security.
Pruning to reduce redundancy and the dimension of latent space. We demonstrate theoretically and confirm in experiments that over-parameterization and high intrinsic dimension of network latent spaces, inherent in many deep learning models, can be readily exploited by an adversary if models are freely shared and exchanged without control. In order to deny these opportunities to the attacker, removing susceptible neurons with our procedure in Section A.6 offers a potential remedy. More generally, employing network pruning [9,10,28,38] and enforcing low dimensionality of latent spaces as a part of model production pipelines would offer further protection against one neuron attacks.
Network Hashing. In addition to the strategies above, which stem from our theoretical analysis of stealth attacks, another defence mechanism is to use fast network hashing algorithms executed in parallel with inference processes. The hash codes these algorithms produce will enable the owner to detect unwanted changes to the model after it has been downloaded from a safe and trusted source.
Our theoretical and practical analysis of the new threats is in no way complete. In our experiments we assumed that pixels in images are real numbers. In some contexts they may take integer values. We did not assess how changes in numerical precision would affect the threat, and did not provide conditions under which redundant neurons exist. Moreover, as we showed in Section A.5, our theoretical bounds could be conservative. The theory presented in this work is fully general in the sense that it does not require any metric in the input space. An interesting question is how one can use the additional structure arising when the input space has a metric to find trigger inputs that are closer to their corresponding targets. Finally, vulnerabilities discovered here are intrinsically linked with AI maintenance problems discussed in [17,19,16]. Exploring these issues is a topic for future research. Nevertheless, the theory and empirical evidence which we present in this work make it clear that existing mitigation strategies must be strengthened in order to guard against vulnerability to new forms of stealth attack. Because the "inevitability" results that we derived have constructive proofs, our analysis offers promising options for the development of effective defences.
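One way to picture the hashing defence is a digest computed over all weights and biases and compared against a trusted reference. The sketch below is an illustration under assumptions: SHA-256 from Python's standard library is our choice and is not prescribed by the text, the model is represented as a hypothetical dictionary of named parameter lists, and the check is run offline rather than in parallel with inference.

```python
import hashlib
import struct


def hash_parameters(layers):
    """SHA-256 digest over named parameter lists, visited in sorted name order."""
    digest = hashlib.sha256()
    for name in sorted(layers):
        values = layers[name]
        digest.update(name.encode("utf-8"))
        digest.update(struct.pack("<I", len(values)))            # length prefix
        digest.update(struct.pack(f"<{len(values)}d", *values))  # raw doubles
    return digest.hexdigest()


# Hypothetical model: flattened weights and biases keyed by layer name.
trusted = {"fc15.weight": [0.12, -0.40, 0.07], "fc15.bias": [0.0, 0.1]}
reference_hash = hash_parameters(trusted)

# A one neuron attack changes a handful of numbers; the digest changes with them.
tampered = {"fc15.weight": [0.12, -0.40, 0.07], "fc15.bias": [0.0, 0.1000001]}

print(hash_parameters(trusted) == reference_hash)   # unmodified copy
print(hash_parameters(tampered) == reference_hash)  # modified copy
```

Because the digest covers every parameter, even a change confined to the bias of a single neuron alters it, which is the property the defence relies on.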

A Appendix. Proofs of theorems and supplementary results
A.1 Proof of Theorem 1
The proof is split into 4 parts. We begin by assuming that the latent representations x_i = Φ(u_i) of all elements u_i from the set V belong to the unit ball B_n centered at the origin (for simplicity of notation the unit n-ball centered at 0 is denoted B_n, and the unit (n − 1)-sphere centered at 0 is denoted S^{n−1}). The main thrust of the proof is to establish lower bounds on the probability of the event E* in (22). These bounds are established in Parts 1 and 2 of the proof. Similar events have been shown to play an important role in the problem of AI error correction [19,16], a somewhat dual task to the stealth attacks considered here. Then we proceed with constructing weights (both input and output) and biases of the function g so that the modified map F_a delivers a solution of Problem 1. This is shown in Part 3. Finally, we remove the assumption that x_i ∈ B_n and show how the weights and biases need to be adapted so that the resulting adapted map F_a is a solution of Problem 1 for x_i ∈ B_n(0, R), where R > 0 is a given number. This is demonstrated in Part 4 of the proof.
Part 1. Probability bound 1 on the event E*. Suppose that R = 1. Let x_i, i = 1, ..., M, be an arbitrary element from V, let x be drawn randomly from an equidistribution in S^{n−1}(0, δ), and let Rx′ = Φ(u′) be a vector within an αδ-distance from x: ‖x − x′‖ ≤ αδ.
To show this, assume that (23) holds and fix i ∈ {1, 2, . . ., M }.By the definition of the dot product in R n , we obtain Consider angles between x i , x and x , x as follows: we let We note that (using the triangle inequality for angular distance) (see Figure 10) and where the second equality is illustrated in Figure 10 and the final equality follows because arccos is decreasing.Combining ( 25), ( 26) and (24) gives so that (by the fact that cosine is decreasing, the assumption that x i ≤ 1, and noting that completing the proof that if the event (23) occurs then the event ( 22) must occur too.
Consider events As a result, from which we obtain (15).The same estimate can be obtained via an alternative geometrical argument by recalling that and then estimating the volume of the spherical cap C(x i , arccos(ϕ(γ, δ, α)) by the volume of the corresponding spherical cone containing C(x i , arccos(ϕ(γ, δ, α)) whose base is the disc centered at x i x i ϕ(γ, δ, α) with radius (1−ϕ(γ, δ, α) 2 ) 1/2 , and whose height is 1 ϕ(γ,δ,α) −ϕ(γ, δ, α).Part 3. Construction of the structural adversarial perturbation.Having established bounds on the probability of event E * in (22), let us now proceed with determining a map F a which is solution to Problem 1 assuming that x i ∈ B n .Suppose that the event E * holds true.
By construction, since R = 1, in Algorithm 1 we have and Since the function g is monotone, we note from ( 12) that the values of D and κ are chosen so that Dg(−κz) ≤ ε and Dg(κz) ≥ ∆.
Part 4. Generalisation to the case when Probability bounds on the above event have already been established in Parts 1 and 2 of the proof.
In this general case, Algorithm 1 uses

A.3 Proof of Theorem 3
In a similar way to the proof of Theorem 2, we begin by considering a virtual system with maps Φ̃ and F̃ defined as in (35). We observe that Assumption 3 implies that Assumption 2 holds true. As we have seen before, the domains of definition of F̃ ∘ Φ̃ and F ∘ Φ coincide, and Assumption 2 implies that Assumption 1 holds true for the virtual system. As in the proof of Theorem 2, we will consider the vector R satisfying condition (10), where x is drawn from an equidistribution on the sphere S^{n_p−1}(0, δ).
We set c = c − Φ(u * ), and let c = c1 + c2 where c1 ∈ span{h 1 , . . ., h np } and c2 ⊥ span{h 1 , . . ., h np }.Our argument proceeds by executing four steps: 1. We begin by showing that θ * ∈ [0, π] and 1 2. We introduce the random variable Θ defined by Θ = ∠(x, c1 /R) and the events for i = 1, . . ., M, E * i is the following event : ) and In the second step, we condition on Θ = θ with θ ≥ θ * and show that if 3. The third step consists of producing a lower bound on the . This is done by relating the event E i |Θ = θ to Assumption 3.
4. Finally, we use the previous steps to prove that the event E * 1 ∧ • • • ∧ E * M happens with at least the desired probability.Then, under the assumption that E * 1 ∧ • • • ∧ E * M occurs, a simple argument similar to the one presented in the final two paragraphs of Theorem 2 allows us to conclude the argument. Step We start with the claim that θ The result follows.
Next, we show that 1 The upper bound is obvious from the definition of ρ(θ).For the lower bound, the function θ where we have used the assumption 2δγ(1 − α) ≥ 1 − (M C) −2/n in the second equality, we have used that λ 1 − min(λ 2 , λ 3 ) = max(λ 1 − λ 2 , λ 1 − λ 3 ) for real numbers λ 1 , λ 2 , λ 3 in the third equality, and in the inequality we use the fact that both min and max are increasing functions of their arguments.
In step 1 we showed that ρ(θ) ≥ 1 − (M C) −2/n for θ ∈ [θ * , π].Rearranging yields 1 − M C(1 − ρ(θ) 2 ) n/2 ≥ 0. Combining this with the inequality (42) we obtain Substituting (43) gives Finally, assuming that the event E * 1 ∧• • •∧E * M occurs, the final two paragraphs of the argument presented in the proof of Theorem 2 (noting that Assumption 3 implies 2) may then be used to confirm that the values of w and b specified in Algorithm 3 produce an ε-∆ stealth attack.The conclusion of Theorem 3 follows.

A.4 Finding triggers u′ in Algorithms 1 and 2
A relevant optimization problem for Algorithm 1 is formulated in (9). Similarly to (9), one can pose a constrained optimization problem for finding a u′ in step 3 of Algorithm 2. This problem can be formulated as (44). Note that the vector x must be chosen randomly on S^{n−1}(0, δ), δ ∈ (0, 1]. One way to achieve this is to generate a sample z from an n-dimensional normal distribution N(0, I_n) and then set x = δz/‖z‖.
A practical approach to determine u′ is to employ a gradient-based search

Table 6: Accuracy α of finding triggers, expressed as α = δ^{−1}‖Φ(u′)/R − x‖, for the lower-dimensional feature space, δ = 2/3.
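The sampling recipe above (draw z from N(0, I_n), then set x = δz/‖z‖) can be sketched with Python's standard library. The function name and the seeded generator are illustrative assumptions; n = 200 and δ = 1/3 match the values used in the MNIST experiments.

```python
import math
import random


def sample_on_sphere(n, delta, rng=None):
    """Draw a point uniformly from the sphere of radius delta centred at 0 in R^n."""
    rng = rng or random.Random()
    while True:
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]  # z ~ N(0, I_n)
        norm = math.sqrt(sum(v * v for v in z))
        if norm > 0.0:  # guard against the probability-zero all-zero draw
            return [delta * v / norm for v in z]     # x = delta * z / ||z||


x = sample_on_sphere(200, 1.0 / 3.0, random.Random(0))
radius = math.sqrt(sum(v * v for v in x))
print(len(x), radius)  # the radius equals delta up to rounding error
```

Normalising a standard Gaussian vector gives a uniform direction because the Gaussian density is rotation invariant.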

Figure 2: Stealth attack implementation patterns. Red open circles indicate changes. In Scenarios 1 and 3 neuron(s) are added. In Scenario 2, a one neuron attack, the weights and biases of an existing neuron are replaced with new values. Grey boxes show a part of the network modelled by the map F.

Figure 4: Examples of triggers (top row) and original target images (bottom row) for CIFAR-10 dataset computed for the trained ResNet network.

Figure 5: Susceptibility to one neuron stealth attack for the ResNet network (see Figure 3) trained on the CIFAR-10 dataset. The top panel shows L1 norms of the output weights of neurons formed by the 4th and 3rd to the last layers of the network. Neurons are ordered in accordance with their rank: the neuron with the smallest output weight norm is assigned a rank of 1, and the neuron with the largest output weight norm is assigned a rank of 128 (see A.6 for details). The bottom panel shows % of matched responses on the validation set V between the original network and a modified network in which a single neuron with a particular rank is removed from the "attack layer".

Figure 6
Figure 6 illustrates how Algorithm 2 performed for the above networks.Three target images (bottom row in the left panel of Figure 6 -digit 2, plain grey square, and a random image)

Figure 6: Left panel: target images (bottom row) and their corresponding triggers (top row), δ = 1/3, γ = 0.9. Right panel: Histogram of values ⟨Φ(u), w⟩ − b, u ∈ V, for the second trigger image in the top row in the left panel.

Figure 8: Retained information content in the trigger images.

Figure 9: Susceptibility to one neuron stealth attack. Left panel: empirical frequencies of successful removal of neurons without any effect on the network output over the validation set. Right panel: % of unmatched responses for cases when removal of a neuron had an effect. Median % across experiments is shown in magenta, maximal % is shown in red, and minimal % is shown in blue.

Figure 10: A diagram assisting with the proof of Theorem 1. Left panel illustrates the setup and main ingredients of the argument. Right panel illustrates the derivation of B*(x). For t = 0 the expression is trivial. Let t ≠ 0. Straightforward calculations show that for z corresponding to max_{z: ‖x−z‖=t} |∠(z, x)|, the vectors z and z − x must be orthogonal. Hence cos ∠(z, x) = √(δ² − t²)/δ.
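The final step of the caption can be spelled out as a short worked derivation. It assumes, as in Part 1 of the proof, that x lies on the sphere of radius δ (so ‖x‖ = δ) and that 0 < t ≤ δ; these assumptions are ours and are stated here only to make the computation explicit.

```latex
% Orthogonality of z and z - x gives a right triangle with hypotenuse x.
\begin{align*}
\langle z, z - x \rangle = 0
  \;&\Longrightarrow\; \langle z, x \rangle = \|z\|^2, \\
\|x\|^2 = \|z\|^2 + \|z - x\|^2
  \;&\Longrightarrow\; \|z\|^2 = \delta^2 - t^2, \\
\cos \angle(z, x)
  = \frac{\langle z, x \rangle}{\|z\|\,\|x\|}
  = \frac{\|z\|}{\|x\|}
  \;&=\; \frac{\sqrt{\delta^2 - t^2}}{\delta}.
\end{align*}
```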

Table 1 :
and output gain D of the sigmoid or a ReLU function g.Comparison of bounds (

Table 2 :
Comparison of bounds

Table 3: Network architecture used in experiments on the MNIST digits dataset. Red color shows layers which we represent by map F in (1).