Patrick McDaniel, Nicolas Papernot, Z. Berkay Celik
IEEE Security & Privacy Magazine
The challenges of making machine learning robust against adversarial inputs
Ian Goodfellow, Patrick McDaniel, Nicolas Papernot
Communications of the ACM
Nicolas Papernot, Shuang Song, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, Ulfar Erlingsson
Accepted to appear at the 6th International Conference on Learning Representations, Vancouver, CanadaExpand / Collapse the abstract
The rapid adoption of machine learning has increased concerns about the privacy implications of machine learning models trained on sensitive data, such as medical records or other personal information. To address those concerns, one promising approach is Private Aggregation of Teacher Ensembles, or PATE, which transfers to a “student” model the knowledge of an ensemble of “teacher” models, with intuitive privacy provided by training teachers on disjoint data and strong privacy guaranteed by noisy aggregation of teachers’ answers. However, PATE has so far been evaluated only on simple classification tasks like MNIST, leaving unclear its utility when applied to larger-scale learning tasks and real-world datasets.
In this work, we show how PATE can scale to learning tasks with large numbers of output classes and uncurated, imbalanced training data with errors. For this, we introduce new noisy aggregation mechanisms for teacher ensembles that are more selective and add less noise, and prove their tighter differential-privacy guarantees. Our new mechanisms build on two insights: the chance of teacher consensus is increased by using more concentrated noise and, lacking consensus, no answer need be given to a student. The consensus answers used are more likely to be correct, offer better intuitive privacy, and incur lower-differential privacy cost. Our evaluation show our mechanisms to improve on the original PATE on all measures, and scale to larger tasks with both high utility and very strong privacy (ε ≈ 1.0).
Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, Patrick McDaniel
Accepted to appear at the 6th International Conference on Learning Representations, Vancouver, CanadaExpand / Collapse the abstract
Machine learning models are vulnerable to adversarial examples, inputs maliciously perturbed to mislead the model. These inputs transfer between models, thus enabling black-box attacks against deployed models. Adversarial training increases robustness to attacks by injecting adversarial examples into training data.
Surprisingly, we find that although adversarially trained models exhibit strong robustness to some white-box attacks (i.e., with knowledge of the model parameters), they remain highly vulnerable to transferred adversarial examples crafted on other models. We show that the reason for this vulnerability is the model's decision surface exhibiting sharp curvature in the vicinity of the data points, thus hindering attacks based on first-order approximations of the model's loss, but permitting black-box attacks that use adversarial examples transferred from another model.
We harness this observation in two ways: First, we propose a simple yet powerful novel attack that first applies a small random perturbation to an input, before finding the optimal perturbation under a first-order approximation. Our attack outperforms prior "single-step" attacks on models trained with or without adversarial training.
Second, we propose Ensemble Adversarial Training, an extension of adversarial training that additionally augments training data with perturbed inputs transferred from a number of fixed pre-trained models. On MNIST and ImageNet, ensemble adversarial training vastly improves robustness to black-box attacks.
Nicolas Papernot, Patrick McDaniel, Arunesh Sinha, and Michael Wellman
Accepted to appear at the 3rd IEEE European Symposium on Security and Privacy, London, UK
Acceptance Rate: 22.9%Expand / Collapse the abstract
Advances in machine learning (ML) in recent years have enabled a dizzying array of applications such as data analytics, autonomous systems, and security diagnostics. ML is now pervasive---new systems and models are being deployed in every domain imaginable, leading to rapid and widespread deployment of software based inference and decision making. There is growing recognition that ML exposes new vulnerabilities in software systems, yet the technical community's understanding of the nature and extent of these vulnerabilities remains limited. We systematize recent findings on ML security and privacy, focusing on attacks identified on these systems and defenses crafted to date. We articulate a comprehensive threat model for ML, and categorize attacks and defenses within an adversarial framework. Key insights resulting from works both in the ML and security communities are identified and the effectiveness of approaches are related to structural elements of ML algorithms and the data used to train them. We conclude by formally exploring the opposing relationship between model accuracy and resilience to adversarial manipulation. Through these explorations, we show that there are (possibly unavoidable) tensions between model complexity, accuracy, and resilience that must be calibrated for the environments in which they will be used.
Kathrin Grosse, Nicolas Papernot, Praveen Manoharan, Michael Backes, and Patrick McDaniel
Proceedings of the 2017 European Symposium on Research in Computer Security, Oslo, Norway
Acceptance Rate: 16%Expand / Collapse the abstract
Deep neural networks, like many other machine learning models, have recently been shown to lack robustness against adversarially crafted inputs. These inputs are derived from regular inputs by minor yet carefully selected perturbations that deceive machine learning models into desired misclassifications. Existing work in this emerging field was largely specific to the domain of image classification, since the high-entropy of images can be conveniently manipulated without changing the images' overall visual appearance. Yet, it remains unclear how such attacks translate to more security-sensitive applications such as malware detection - which may pose significant challenges in sample generation and arguably grave consequences for failure. In this paper, we show how to construct highly-effective adversarial sample crafting attacks for neural networks used as malware classifiers. The application domain of malware classification introduces additional constraints in the adversarial sample crafting problem when compared to the computer vision domain: (i) continuous, differentiable input domains are replaced by discrete, often binary inputs; and (ii) the loose condition of leaving visual appearance unchanged is replaced by requiring equivalent functional behavior. We demonstrate the feasibility of these attacks on many different instances of malware classifiers that we trained using the DREBIN Android malware data set. We furthermore evaluate to which extent potential defensive mechanisms against adversarial crafting can be leveraged to the setting of malware classification. While feature reduction did not prove to have a positive impact, distillation and re-training on adversarially crafted samples show promising results.
Martín Abadi, Úlfar Erlingsson, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Nicolas Papernot, Kunal Talwar, Li Zhang
Proceedings of the 30th IEEE Computer Security Foundations Symposium, Santa Barbara, CA, USA
(Invited Paper)Expand / Collapse the abstract
The recent, remarkable growth of machine learning has led to intense interest in the privacy of the data on which machine learning relies, and to new techniques for preserving privacy. However, older ideas about privacy may well remain valid and useful. This note reviews two recent works on privacy in the light of the wisdom of some of the early literature, in particular the principles distilled by Saltzer and Schroeder in the 1970s.
Nicolas Papernot, Martín Abadi, Úlfar Erlingsson, Ian Goodfellow, and Kunal Talwar
Proceedings of the 5th International Conference on Learning Representations, Toulon, France
[Best Paper Award and Oral Presentation] Acceptance Rate for Oral Presentations: 3%Expand / Collapse the abstract
Some machine learning applications involve training data that is sensitive, such as the medical histories of patients in a clinical trial. A model may inadvertently and implicitly store some of its training data; careful analysis of the model may therefore reveal sensitive information. To address this problem, we demonstrate a generally applicable approach to providing strong privacy guarantees for training data. The approach combines, in a black-box fashion, multiple models trained with disjoint datasets, such as records from different subsets of users. Because they rely directly on sensitive data, these models are not published, but instead used as teachers for a student model. The student learns to predict an output chosen by noisy voting among all of the teachers, and cannot directly access an individual teacher or the underlying data or parameters. The student's privacy properties can be understood both intuitively (since no single teacher and thus no single dataset dictates the student's training) and formally, in terms of differential privacy. These properties hold even if an adversary can not only query the student but also inspect its internal workings. Compared with previous work, the approach imposes only weak assumptions on how teachers are trained: it applies to any model, including non-convex models like DNNs. We achieve state-of-the-art privacy/utility trade-offs on MNIST and SVHN thanks to an improved privacy analysis and semi-supervised learning.
Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z.Berkay Celik, and Ananthram Swami
Proceedings of the 2017 ACM Asia Conference on Computer and Communications Security, Abu Dhabi, UAE
Acceptance Rate: 20%Expand / Collapse the abstract
Abstract Machine learning (ML) models, e.g., deep neural networks (DNNs), are vulnerable to adversarial examples: malicious inputs modified to yield erroneous model outputs, while appearing unmodified to human observers. Potential attacks include having malicious content like malware identified as legitimate or controlling vehicle behavior. Yet, all existing adversarial example attacks require knowledge of either the model internals or its training data. We introduce the first practical demonstration of an attacker controlling a remotely hosted DNN with no such knowledge. Indeed, the only capability of our black-box adversary is to observe labels given by the DNN to chosen inputs. Our attack strategy consists in training a local model to substitute for the target DNN, using inputs synthetically generated by an adversary and labeled by the target DNN. We use the local substitute to craft adversarial examples, and find that they are misclassified by the targeted DNN. To perform a real-world and properly-blinded evaluation, we attack a DNN hosted by MetaMind, an online deep learning API. We find that their DNN misclassifies 84.24% of the adversarial examples crafted with our substitute. We demonstrate the general applicability of our strategy to many ML techniques by conducting the same attack against models hosted by Amazon and Google, using logistic regression substitutes. They yield adversarial examples misclassified by Amazon and Google at rates of 96.19% and 88.94%. We also find that this black-box attack strategy is capable of evading defense strategies previously found to make adversarial example crafting harder.
Nicolas Papernot, Patrick McDaniel, Ananthram Swami, and Richard Harang
Proceedings of the Military Communications Conference (MILCOM), Baltimore, MDExpand / Collapse the abstract
Machine learning models are frequently used to solve complex security problems, as well as to make decisions in sensitive situations like guiding autonomous vehicles or predicting financial market behaviors. Previous efforts have shown that numerous machine learning models were vulnerable to adversarial manipulations of their inputs taking the form of adversarial samples. Such inputs are crafted by adding carefully selected perturbations to legitimate inputs so as to force the machine learning model to misbehave, for instance by outputting a wrong class if the machine learning task of interest is classification. In fact, to the best of our knowledge, all previous work on adversarial samples crafting for neural network considered models used to solve classification tasks, most frequently in computer vision applications. In this paper, we contribute to the field of adversarial machine learning by investigating adversarial input sequences for recurrent neural networks processing sequential data. We show that the classes of algorithms introduced previously to craft adversarial samples misclassified by feed-forward neural networks can be adapted to recurrent neural networks. In a experiment, we show that adversaries can craft adversarial sequences misleading both categorical and sequential recurrent neural networks.
Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami
Proceedings of the 37th IEEE Symposium on Security and Privacy, San Jose, CA
Acceptance Rate: 13.3%Expand / Collapse the abstract
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z. Berkay Celik, and Ananthram Swami
Proceedings of the 1st IEEE European Symposium on Security and Privacy, Saarbrucken, Germany
Acceptance Rate: 17.3%Expand / Collapse the abstract
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Nicolas Papernot, Patrick McDaniel, and Robert Walls
Proceedings of the Military Communications Conference (MILCOM), Tampa, FLExpand / Collapse the abstract
Access control is used in databases to prevent unauthorized retrieval and tampering of stored data, as defined by policies. Various policy models provide different protections and guarantees against illegal accesses, but none is able to offer a universal fit for all access control needs. Therefore, the static nature of access control mechanisms deployed in commercial databases limit the security guarantees provided. They require time-consuming and error-prone efforts to adapt access control policies to evolving security contexts. In contrast, we propose a fully automated and agile approach to access control enforcement in relational databases. We present tractable algorithms that enforce any policy expressible using the high-level syntax of the Authorization Specification Language. This includes complex policies involving information flow control or user history dependencies. Our method does not require any modification to the database schema or user queries, thus allowing for a transparent implementation in existing systems. We demonstrate our findings by formulating two classic access control models: the Bell-LaPadula model and the Chinese Wall policy.
Sandy Huang, Nicolas Papernot, Ian Goodfellow, Yan Duan, Pieter Abbeel
Proceedings of the Workshop Track at the 5th International Conference on Learning Representations, Toulon, FranceExpand / Collapse the abstract
Machine learning classifiers are known to be vulnerable to inputs maliciously constructed by adversaries to force misclassification. Such adversarial examples have been extensively studied in the context of computer vision applications. In this work, we show that adversarial attacks are also effective when targeting neural network policies in reinforcement learning. Specifically, we show that existing adversarial example crafting techniques can be used to significantly degrade the test-time performance of trained policies. Our threat model considers adversaries capable of introducing small perturbations to the raw input of the policy. We characterize the degree of vulnerability across tasks and training algorithms, for a subclass of adversarial-example attacks in white-box and black-box settings. Regardless of the learned task or training algorithm, we observe a significant drop in performance, even with small adversarial perturbations that do not interfere with human perception.
P. McDaniel, T. Jaeger, T. F. La Porta, Nicolas Papernot, R. J. Walls, A. Kott, L. Marvel, A. Swami, P. Mohapatra, S. V. Krishnamurthy, I. Neamtiu
First ACM Workshop on Moving Target DefenseExpand / Collapse the abstract
Moving target defenses alter the environment in response to adversarial action and perceived threats. Such defenses are a specific example of a broader class of system management techniques called system agility. In its fullest generality, agility is any reasoned modification to a system or environment in response to a functional, performance, or security need. This paper details a recently launched 10-year Cyber-Security Collaborative Research Alliance effort focused in-part on the development of a new science of system agility, of which moving target defenses are a central theme. In this context, the consortium seeks to address the questions of when, what, and how to employ changes to improve the security of an environment, as well as consider how to measure and weigh the effectiveness of different approaches to agility. We discuss several fundamental challenges in developing and using MTD maneuvers, and outline several broad classes of mechanisms that can be used to implement them. We conclude by detailing specific MTD mechanisms used to adaptively quarantine vulnerable code in Android applications, and consider ways of comparing cost and payout of its use.
Nicolas Papernot and Patrick McDanielExpand / Collapse the abstract
Deep neural networks (DNNs) enable innovative applications of machine learning like image recognition, machine translation, or malware detection. However, deep learning is often criticized for its lack of robustness in adversarial settings (e.g., vulnerability to adversarial inputs) and general inability to rationalize its predictions. In this work, we exploit the structure of deep learning to enable new learning-based inference and decision strategies that achieve desirable properties such as robustness and interpretability. We take a first step in this direction and introduce the Deep k-Nearest Neighbors (DkNN). This hybrid classifier combines the k-nearest neighbors algorithm with representations of the data learned by each layer of the DNN: a test input is compared to its neighboring training points according to the distance that separates them in the representations. We show the labels of these neighboring points afford confidence estimates for inputs outside the model's training manifold, including on malicious inputs like adversarial examples--and therein provides protections against inputs that are outside the models understanding. This is because the nearest neighbors can be used to estimate the nonconformity of, i.e., the lack of support for, a prediction in the training data. The neighbors also constitute human-interpretable explanations of predictions. We evaluate the DkNN algorithm on several datasets, and show the confidence estimates accurately identify inputs outside the model, and that the explanations provided by nearest neighbors are intuitive and useful in understanding model failures.
Gamaleldin F. Elsayed, Shreya Shankar, Brian Cheung, Nicolas Papernot, Alex Kurakin, Ian Goodfellow, Jascha Sohl-DicksteinExpand / Collapse the abstract
Machine learning models are vulnerable to adversarial examples: small changes to images can cause computer vision models to make mistakes such as identifying a school bus as an ostrich. However, it is still an open question whether humans are prone to similar mistakes. Here, we create the first adversarial examples designed to fool humans, by leveraging recent techniques that transfer adversarial examples from computer vision models with known parameters and architecture to other models with unknown parameters and architecture, and by modifying models to more closely match the initial processing of the human visual system. We find that adversarial examples that strongly transfer across computer vision models influence the classifications made by time-limited human observers.
Nicolas Papernot, Patrick McDanielExpand / Collapse the abstract
Machine learning is vulnerable to adversarial examples: inputs carefully modified to force misclassification. Designing defenses against such inputs remains largely an open problem. In this work, we revisit defensive distillation---which is one of the mechanisms proposed to mitigate adversarial examples---to address its limitations. We view our results not only as an effective way of addressing some of the recently discovered attacks but also as reinforcing the importance of improved training techniques.
Florian Tramèr, Nicolas Papernot, Ian Goodfellow, Dan Boneh, Patrick McDanielExpand / Collapse the abstract
Adversarial examples are maliciously perturbed inputs designed to mislead machine learning (ML) models at test-time. Adversarial examples are known to transfer across models: a same perturbed input is often misclassified by different models despite being generated to mislead a specific architecture. This phenomenon enables simple yet powerful black-box attacks against deployed ML systems.
In this work, we propose novel methods for estimating the previously unknown dimensionality of the space of adversarial inputs. We find that adversarial examples span a contiguous subspace of large dimensionality and that a significant fraction of this space is shared between different models, thus enabling transferability.
The dimensionality of the transferred adversarial subspace implies that the decision boundaries learned by different models are eerily close in the input domain, when moving away from data points in adversarial directions. A first quantitative analysis of the similarity of different models' decision boundaries reveals that these boundaries are actually close in arbitrary directions, whether adversarial or benign.
We conclude with a formal study of the limits of transferability. We show (1) sufficient conditions on the data distribution that imply transferability for simple model classes and (2) examples of tasks for which transferability fails to hold. This suggests the existence of defenses making models robust to transferability attacks---even when the model is not robust to its own adversarial examples.
Kathrin Grosse, Praveen Manoharan, Nicolas Papernot, Michael Backes, and Patrick McDanielExpand / Collapse the abstract
Machine Learning (ML) models are applied in a variety of tasks such as network intrusion detection or malware classification. Yet, these models are vulnerable to a class of malicious inputs known as adversarial examples. These are slightly perturbed inputs that are classified incorrectly by the ML model. The mitigation of these adversarial inputs remains an open problem. As a step towards a model-agnostic defense against adversarial examples, we show that they are not drawn from the same distribution than the original data, and can thus be detected using statistical tests. As the number of malicious points included in samples presented to the test diminishes, its detection confidence decreases. Hence, we introduce a complimentary approach to identify specific inputs that are adversarial among sets of inputs flagged by the statistical test. Specifically, we augment our ML model with an additional output, in which the model is trained to classify all adversarial inputs. We evaluate our approach on multiple adversarial example crafting methods (including the fast gradient sign and Jacobian-based saliency map methods) with several datasets. The statistical test flags sample sets containing adversarial inputs with confidence above 80%. Furthermore, our augmented model either detects adversarial examples with high accuracy (>80%) or increases the adversary's cost---the perturbation added---by more than 150%. In this way, we show that statistical properties of adversarial examples are essential to their detection.
Nicolas Papernot, Patrick McDaniel, and Ian GoodfellowExpand / Collapse the abstract
Many machine learning models are vulnerable to adversarial examples: inputs that are specially crafted to cause a machine learning model to produce an incorrect output. Adversarial examples that affect one model often affect another model, even if the two models have different architectures or were trained on different training sets, so long as both models were trained to perform the same task. An attacker may therefore train their own substitute model, craft adversarial examples against the substitute, and transfer them to a victim model, with very little information about the victim. Recent work has further developed a technique that uses the victim model as an oracle to label a synthetic training set for the substitute, so the attacker need not even collect a training set to mount the attack. We extend these recent techniques using reservoir sampling to greatly enhance the efficiency of the training procedure for the substitute model. We introduce new transferability attacks between previously unexplored (substitute, victim) pairs of machine learning model classes, most notably SVMs and decision trees. We demonstrate our attacks on two commercial machine learning classification systems from Amazon (96.19% misclassification rate) and Google (88.94%) using only 800 queries of the victim model, thereby showing that existing machine learning approaches are in general vulnerable to systematic black-box attacks regardless of their structure.
Nicolas Papernot, Ian Goodfellow, Ryan Sheatsley, Reuben Feinman, Patrick McDanielExpand / Collapse the abstract
cleverhans is a software library that provides standardized reference implementations of adversarial example construction techniques and adversarial training. The library may be used to develop more robust machine learning models and to provide standardized benchmarks of models' performance in the adversarial setting. Benchmarks constructed without a standardized implementation of adversarial example construction are not comparable to each other, because a good result may indicate a robust model or it may merely indicate a weak implementation of the adversarial example construction procedure. This technical report is structured as follows. Section 1 provides an overview of adversarial examples in machine learning and of the cleverhans software. Section 2 presents the core functionalities of the library: namely the attacks based on adversarial examples and defenses to improve the robustness of machine learning models to these attacks. Section 3 describes how to report benchmark results using the library. Section 4 describes the versioning system.
Characterizing the Limits and Defenses of Machine Learning in Adversarial Settings
PhD Dissertation, Pennsylvania State UniversityExpand / Collapse the abstract
Advances in machine learning (ML) in recent years have enabled a dizzying array of applications such as object recognition, autonomous systems, security diagnostics, and playing the game of Go. Machine learning is not only a new paradigm for building soft- ware and systems, it is bringing social disruption at scale. There is growing recognition that ML exposes new vulnerabilities in software systems, yet the technical community’s understanding of the nature and extent of these vulnerabilities remains limited. In this thesis, I focus my study on the integrity of ML models. Integrity refers here to the faithfulness of model predictions with respect to an expected outcome. This property is at the core of traditional machine learning evaluation, as demonstrated by the pervasiveness of metrics such as accuracy among practitioners. A large fraction of ML techniques were designed for benign execution environments. Yet, the presence of adversaries may invalidate some of these underlying assumptions by forcing a mismatch between the distributions on which the model is trained and tested. As ML is increasingly applied and being relied on for decision-making in critical applications like transportation or energy, the models produced are becoming a target for adversaries who have a strong incentive to force ML to mispredict. I explore the space of attacks against ML integrity at test time. Given full or lim- ited access to a trained model, I devise strategies that modify the test data to create a worst-case drift between the training and test distributions. The implications of this part of my research is that an adversary with very weak access to a system, and little knowledge about the ML techniques it deploys, can nevertheless mount powerful attacks against such systems as long as she has the capability of interacting with it as an oracle: i.e., send inputs of the adversary’s choice and observe the ML prediction. This system- atic exposition of the poor generalization of ML models indicates the lack of reliable confidence estimates when the model is making predictions far from its training data. Hence, my efforts to increase the robustness of models to these adversarial manipu- lations strive to decrease the confidence of predictions made far from the training distri- bution. Informed by my progress on attacks operating in the black-box threat model, I first identify limitations to two defenses: defensive distillation and adversarial training. I then describe recent defensive efforts addressing these shortcomings. To this end, I introduce the Deep k-Nearest Neighbors classifier, which augments deep neural networks with an integrity check at test time. The approach compares internal representations produced by the deep neural network on test data with the ones learned on its train- ing points. Using the labels of training points whose representations neighbor the test input across the deep neural network’s layers, I estimate the nonconformity of the pre- diction with respect to the model’s training data. An application of conformal prediction methodology then paves the way for more reliable estimates of the model’s prediction credibility, i.e., how well the prediction is supported by training data. In turn, we distin- guish legitimate test data with high credibility from adversarial data with low credibility. This research calls for future efforts to investigate the robustness of individual layers of deep neural networks rather than treating the model as a black-box. This aligns well with the modular nature of deep neural networks, which orchestrate simple computations to model complex functions. This also allows us to draw connections to other areas like interpretability in ML, which seeks to answer the question of: “How can we provide an explanation for the model prediction to a human?” Another by-product of this research direction is that I better distinguish vulnerabilities of ML models that are a consequence of the ML algorithms from those that can be explained by artifacts in the data.
Masters Thesis, Pennsylvania State UniversityExpand / Collapse the abstract
Deep learning takes advantage of large datasets and computationally ecient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them, like other machine learning techniques, vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing machine learning algorithms to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classied by human subjects but misclassied in specic targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of dierent sample classes to adversarial perturbations by dening a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by dening a predictive measure of distance between a benign input and a target classication.