Adversarial Detection with Model Interpretation

author: Ninghao Liu, Texas A&M University
published: Nov. 23, 2018, recorded: August 2018

Machine learning (ML) systems are increasingly applied in web security applications such as spammer detection, malware detection and fraud detection. These applications are intrinsically adversarial: intelligent attackers can adaptively change their behavior to avoid being caught by the deployed detectors. Existing defenses against adversaries are usually tied to a particular type of ML model or to specific applications such as image classification. Additionally, the working mechanisms of ML models usually cannot be well understood by users, which in turn impedes them from understanding the vulnerabilities of the models or improving their robustness. To bridge this gap, in this paper we investigate whether model interpretation can help adversarial detection. Specifically, we develop a novel adversary-resistant detection framework that utilizes the interpretation of ML models. The interpretation process explains how the target ML model makes a prediction for a given instance, thus providing more insight for crafting adversarial samples. The robustness of detectors is then improved through adversarial training with these adversarial samples. A data-driven method is also developed to empirically estimate the costs that adversaries incur in feature manipulation. Our approach is model-agnostic and can be applied to various types of classification models. Our experimental results on two real-world datasets demonstrate the effectiveness of interpretation-based attacks and show how the estimated feature manipulation cost affects the behavior of adversaries.
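To make the idea of interpretation-guided attacks concrete, here is a minimal sketch in Python. It is not the method from the paper: it assumes the detector is only available as a black-box scoring function, uses finite differences as a stand-in for the local interpretation step, and uses a greedy, cost-weighted search to craft an evasive sample. The names `detector_score`, `local_interpretation` and `craft_adversarial`, along with the weights, costs and budget, are hypothetical and chosen only for illustration.

```python
import math


def detector_score(x, weights=(2.0, 1.0, 0.5), bias=-1.0):
    """Hypothetical black-box detector: probability that x is malicious."""
    z = sum(w * v for w, v in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def local_interpretation(score_fn, x, eps=1e-4):
    """Model-agnostic local explanation (assumed stand-in): the
    finite-difference sensitivity of the score to each feature at x."""
    base = score_fn(x)
    sensitivities = []
    for i in range(len(x)):
        bumped = list(x)
        bumped[i] += eps
        sensitivities.append((score_fn(bumped) - base) / eps)
    return sensitivities


def craft_adversarial(score_fn, x, costs, budget, step=0.1, threshold=0.5):
    """Greedy evasion under a manipulation budget. At each step, move the
    feature with the largest sensitivity per unit cost in the direction
    that lowers the score. Assumes every cost is positive."""
    x_adv = list(x)
    spent = 0.0
    while score_fn(x_adv) >= threshold:
        sens = local_interpretation(score_fn, x_adv)
        i = max(range(len(x_adv)), key=lambda j: abs(sens[j]) / costs[j])
        move_cost = costs[i] * step
        if sens[i] == 0.0 or spent + move_cost > budget:
            break  # no useful direction left, or the budget is exhausted
        x_adv[i] -= step * math.copysign(1.0, sens[i])
        spent += move_cost
    return x_adv, spent


# Usage: feature 0 is the most influential but also the most costly to change.
original = [1.0, 1.0, 1.0]
adversarial, spent = craft_adversarial(
    detector_score, original, costs=[5.0, 1.0, 1.0], budget=10.0
)
```

In this toy setting the attacker leaves the expensive feature alone and shifts a cheaper one instead, which illustrates how manipulation costs can shape adversarial behavior. For adversarial training, samples crafted this way would be added back to the training set with their true (malicious) label before the detector is refit.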