A malicious executable is defined to be a program that performs a malicious function, such as compromising a system's security, damaging a system or obtaining sensitive information without the user's permission. Using data mining methods, our goal is to automatically design and build a scanner that accurately detects malicious executables before they have been given a chance to run.
Data mining methods detect patterns in large amounts of data, such as byte code, and use these patterns to detect future instances in similar data. Our framework uses classifiers to detect new malicious executables. A classifier is a rule set, or detection model, generated by the data mining algorithm that was trained over a given set of training data.
One of the primary problems faced by the virus community is to devise methods for detecting new malicious programs that have not yet been analyzed . Eight to ten malicious programs are created every day and most cannot be accurately detected until signatures have been generated for them . During this time period, systems protected by signature-based algorithms are vulnerable to attacks.
Malicious executables are also used as attacks for many types of intrusions. In the DARPA 1999 intrusion detection evaluation, many of the attacks on the Windows platform were caused by malicious programs . Recently, a malicious piece of code created a hole in a Microsoft's internal network . That attack was initiated by a malicious executable that opened a back-door into Microsoft's internal network resulting in the theft of Microsoft's source code.
Current virus scanner technology has two parts: a signature-based detector and a heuristic classifier that detects new viruses . The classic signature-based detection algorithm relies on signatures (unique telltale strings) of known malicious executables to generate detection models. Signature-based methods create a unique tag for each malicious program so that future examples of it can be correctly classified with a small error rate. These methods do not generalize well to detect new malicious binaries because they are created to give a false positive rate as close to zero as possible. Whenever a detection method generalizes to new instances, the tradeoff is for a higher false positive rate. Heuristic classifiers are generated by a group of virus experts to detect new malicious programs. This kind of analysis can be time-consuming and oftentimes still fail to detect new malicious executables.
We designed a framework that used data mining algorithms to train multiple classifiers on a set of malicious and benign executables to detect new examples. The binaries were first statically analyzed to extract properties of the binary, and then the classifiers trained over a subset of the data.
Our goal in the evaluation of this method was to simulate the task of detecting new malicious executables. To do this we separated our data into two sets: a training set and a test set with standard cross-validation methodology. The training set was used by the data mining algorithms to generate classifiers to classify previously unseen binaries as malicious or benign. A test set is a subset of dataset that had no examples in it that were seen during the training of an algorithm. This subset was used to test an algorithms' performance over similar, unseen data and its performance over new malicious executables. Both the test and training data were malicious executables gathered from public sources.
We implemented a traditional signature-based algorithm to compare with the the data mining algorithms over new examples. Using standard statistical cross-validation techniques, our data mining-based method had a detection rate of 97.76%--more than double the detection rate of a signature-based scanner over a set of new malicious executables. Comparing our method to industry heuristics cannot be done at this time because the methods for generating these heuristics are not published and there is no equivalent or statistically comparable dataset to which both techniques are applied. However, the framework we provide is fully automatic and could assist experts in generating the heuristics.