Next: 3. Methodology Up: Data Mining Methods for Previous: 1. Introduction

2. Background

Detecting malicious executables is not a new problem in security. Early methods used signatures to detect malicious programs. These signatures were composed of many different properties: filename, text strings, or byte code. Research also centered on protecting the system from the security holes that these malicious programs created.

Experts were typically employed to analyze suspicious programs by hand. Using their expertise, signatures were found that made a malicious executable example different from other malicious executables or benign programs. One example of this type of analysis was performed by Spafford [24] who analyzed the Internet Worm and provided detailed notes on its spread over the Internet, the unique signatures in the worm's code, the method of the worm's attack, and a comprehensive description of system failure points.

Although accurate, this method of analysis is expensive, and slow. If only a small set of malicious executables will ever circulate then this method will work very well, but the Wildlist [22] is always changing and expanding. The Wildlist is a list of malicious programs that are currently estimated to be circulating at any given time.

Current approaches to detecting malicious programs match them to a set of known malicious programs. The anti-virus community relies heavily on known byte-code signatures to detect malicious programs. More recently, these byte sequences were determined by automatically examining known malicious binaries with probabilistic methods.

At IBM, Kephart and Arnold [9] developed a statistical method for automatically extracting malicious executable signatures. Their research was based on speech recognition algorithms and was shown to perform almost as good as a human expert at detecting known malicious executables. Their algorithm was eventually packaged with IBM's anti-virus software.

Lo et al. [15] presented a method for filtering malicious code based on ``tell-tale signs'' for detecting malicious code. These were manually engineered based on observing the characteristics of malicious code. Similarly, filters for detecting properties of malicious executables have been proposed for UNIX systems [10] as well as semi-automatic methods for detecting malicious code [4].

Unfortunately, a new malicious program may not contain any known signatures so traditional signature-based methods may not detect a new malicious executable. In an attempt to solve this problem, the anti-virus industry generates heuristic classifiers by hand [8]. This process can be even more costly than generating signatures, so finding an automatic method to generate classifiers has been the subject of research in the anti-virus community. To solve this problem, different IBM researchers applied Artificial Neural Networks (ANNs) to the problem of detecting boot sector malicious binaries [25]. An ANN is a classifier that models neural networks explored in human cognition. Because of the limitations of the implementation of their classifier, they were unable to analyze anything other than small boot sector viruses which comprise about 5% of all malicious binaries.

Using an ANN classifier with all bytes from the boot sector malicious executables as input, IBM researchers were able to identify 80-85% of unknown boot sector malicious executables successfully with a low false positive rate ( $< 1\%$ ). They were unable to find a way to apply ANNs to the other 95% of computer malicious binaries.

In similar work, Arnold and Tesauro [1] applied the same techniques to Win32 binaries, but because of limitations of the ANN classifier they were unable to have the comparable accuracy over new Win32 binaries.

Our method is different because we analyzed the entire set of malicious executables instead of only boot-sector viruses, or only Win32 binaries.

Our technique is similar to data mining techniques that have already been applied to Intrusion Detection Systems by Lee et al. [13,14]. Their methods were applied to system calls and network data to learn how to detect new intrusions. They reported good detection rates as a result of applying data mining to the problem of IDS. We applied a similar framework to the problem of detecting new malicious executables.

Next: 3. Methodology Up: Data Mining Methods for Previous: 1. Introduction

Erez Zadok
2001-05-19