
8. Defeating Detection Models

Although these methods can detect new malicious executables, a malicious executable author could bypass detection if the detection model were to be compromised.

First, defeating the signature-based method requires removing all malicious signatures from the binary. Since these signatures typically cover only a subset of a malicious executable's total data, changing the signature of a binary would be possible, although difficult.
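As a minimal sketch of why signature removal evades this kind of detection, consider a scanner that flags a binary when any known byte signature occurs in it. The function name `matches_signature` and the sample signature bytes are hypothetical, chosen only for illustration; this is not the scanner used in the paper.

```python
def matches_signature(binary: bytes, signatures) -> bool:
    """Return True if any known signature occurs anywhere in the binary."""
    return any(sig in binary for sig in signatures)


# Hypothetical signature and binaries, for illustration only.
SIGNATURES = [bytes.fromhex("deadbeef")]
original = b"\x00\x01" + bytes.fromhex("deadbeef") + b"\x02\x03"

# Altering a single byte inside the signature region is enough to evade
# an exact-match scanner, although the author must also preserve the
# program's behavior, which is the difficult part.
modified = original.replace(bytes.fromhex("deadbeef"), bytes.fromhex("deadbeee"))
```

Here `matches_signature(original, SIGNATURES)` is true while `matches_signature(modified, SIGNATURES)` is false, even though the two inputs differ in only one byte.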

Defeating the models generated by RIPPER would require adding functions that change the binary's resource usage. These functions do not have to be called by the binary; their presence alone would change the resource signature of the executable.

To defeat our implementation of the Naive Bayes classifier, it would be necessary to change a significant number of features in the example. One way to do this is through encryption, but encryption adds overhead to small malicious executables.

We addressed the problem of authors evading a strings-based rule set by initially classifying each example as malicious. If none of the strings contained in the binary had been seen during training, the final class was malicious. If the program contained strings that the algorithm had seen before, the probabilities were computed normally according to the Naive Bayes rule from Section 4.3. This handled the case where a binary had encrypted or changed all of its strings.
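The default-malicious rule can be sketched as follows. This is an illustrative sketch rather than the implementation used in the paper: the function names `train` and `classify`, the use of add-one (Laplace) smoothing, and the log-space scoring are all assumptions made here so that the example runs on small inputs.

```python
import math
from collections import Counter

LABELS = ("malicious", "benign")


def train(examples):
    """Count, per class, how many training examples contain each string.

    examples: iterable of (strings, label) pairs, label in LABELS.
    """
    class_counts = Counter()
    feature_counts = {label: Counter() for label in LABELS}
    for strings, label in examples:
        class_counts[label] += 1
        feature_counts[label].update(set(strings))
    return class_counts, feature_counts


def classify(strings, class_counts, feature_counts):
    """Naive Bayes over previously seen strings, defaulting to malicious."""
    vocabulary = set(feature_counts["malicious"]) | set(feature_counts["benign"])
    seen = set(strings) & vocabulary
    # No string in the binary was ever used for training (for example,
    # all strings were encrypted or changed): keep the initial class.
    if not seen:
        return "malicious"
    total = sum(class_counts.values())
    scores = {}
    for label in LABELS:
        n = class_counts[label]
        # Assumption: add-one smoothing avoids log(0) for unseen counts.
        score = math.log((n + 1) / (total + 2))
        for s in seen:
            score += math.log((feature_counts[label][s] + 1) / (n + 2))
        scores[label] = score
    # "malicious" is listed first, so ties also resolve to malicious.
    return max(scores, key=scores.get)
```

With this structure, a binary whose strings are entirely unfamiliar never reaches the probability computation, so hiding every string does not move it toward the benign class.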

The Multi-Naive Bayes method improved on these results because changing every line of byte code in the Naive Bayes detection model would be even more difficult than changing all the strings. Changing this many lines of a program would change the binary's behavior significantly. Removing all lines of code that appear in our model would be difficult and time consuming, and even then, if none of the byte sequences in the example had been used for training, the example would be initially classified as malicious.

The Multi-Naive Bayes is a more secure model of detection than any of the other methods discussed in this paper because it evaluates a binary's entire instruction set, whereas signature methods look for segments of byte sequences. It is much easier for malicious program authors to modify the lines of code that a signature represents than to change all the lines contained in the program to evade a Naive Bayes or Multi-Naive Bayes model. The byte sequence model is the most secure model we devised in our test.

A further security concern is what happens when malicious software writers obtain copies of the malicious binaries that we could not detect, and use these false negatives to generate new malicious software. Presumably this would allow them to circumvent our detection models, but in fact having a larger set of similar false negatives would make our model more accurate. In other words, if malicious binary authors clone the undetectable binaries, they are in effect making it easier for this framework to detect their programs. The more data that the method analyzes, and the more false positives and false negatives that it learns from, the more accurate the method becomes at distinguishing between benign and malicious programs.


Erez Zadok
2001-05-19