9. Conclusions

The first contribution that we presented in this paper was a method for detecting previously undetectable malicious executables. We showed this by comparing our results with traditional signature-based methods and with other learning algorithms. The Multi-Naive Bayes method had the highest accuracy and detection rate of any algorithm over unknown programs, 97.76%, over double the detection rates of signature-based methods. Its rule set was also more difficult to defeat than other methods because all lines of machine instructions would have to be changed to avoid detection.

The first problem with traditional anti-malicious executable detection methods is that in order to detect a new malicious executable, the program needs to be examined and a signature extracted from it and included in the anti-malicious executable software database. The difficulty with this method is that during the time required for a malicious program to be identified, analyzed and signatures to be distributed, systems are at risk from that program. Our methods may provide a defense during that time. With a low false positive rate, the inconvenience to the end user would be minimal while providing ample defense during the time before an update of models is available.

Virus Scanners are updated about every month. 240-300 new malicious executables are created in that time (8-10 a day [27]). Our method would catch roughly 216-270 of those new malicious executables without the need for an update whereas traditional methods would catch only 87-109. Our method more than doubles the detection rate of signature based methods for new malicious binaries.

The methods discussed in this paper are being implemented as a network mail filter. We are implementing a network-level email filter that uses our algorithms to catch malicious executables before users receive them through their mail. We can either wrap the potential malicious executable or we can block it. This has the potential to stop some malicious executables in the network and prevent Denial of Service (DoS) attacks by malicious executables. If a malicious binary accesses a user's address book and mails copies of itself out over the network, eventually most users of the LAN will clog the network by sending each other copies of the same malicious executable. This is very similar to the old Internet Worm attack. Stopping the malicious executables from replicating on a network level would be very advantageous.

Since both the Naive Bayes and Multi-Naive Bayes methods are probabilistic we can also tell if a binary is borderline. A borderline binary is a program that has similar probabilities for both classes (i.e., could be a malicious executable or a benign program). If it is a borderline case we have an option in the network filter to send a copy of the malicious executable to a central repository such as CERT. There, it can be examined by human experts.

9.1 Future Work

Future work involves extending our learning algorithms to better utilize byte-sequences. Currently, the Multi-Naive Bayes method learns over sequences of a fixed length, but we theorize that rules with higher accuracy and detection rates could be learned over variable length sequences. There are some algorithms such as Sparse Markov Transducers [7] that can determine how long a sequence of bytes should be for optimal classification.

We are planning to implement the system on a network of computers to evaluate its performance in terms of time and accuracy in real world environments. We also would like to make the learning algorithms more efficient in time and space. Currently, the Naive Bayes methods have to be run on a computer with one gigabyte of RAM.

Finally, we are planning on testing this method over a larger set of malicious and benign executables. Only when testing over a significantly larger set of malicious executables can we fully evaluate the method. In that light, our current results are preliminary. In addition with a larger data set, we plan to evaluate this method on different types of malicious executables such as macros and Visual Basic scripts.

Next: Bibliography Up: Data Mining Methods for Previous: 8. Defeating Detection Models

Erez Zadok
2001-05-19