3. Methodology

The goal of this work was to explore a number of standard data mining techniques to compute accurate detectors for new (unseen) binaries. We gathered a large set of programs from public sources and separated the problem into two classes: malicious and benign executables. Every example in our data set is a Windows or MS-DOS format executable, although the framework we present is applicable to other formats. To standardize our data-set, we used an updated MacAfee's [16] virus scanner and labeled our programs as either malicious or benign executables. Since the virus scanner was updated and the viruses were obtained from public sources, we assume that the virus scanner has a signature for each malicious virus.

We split the dataset into two subsets: the training set and the test set. The data mining algorithms used the training set while generating the rule sets. We used a test set to check the accuracy of the classifiers over unseen examples.

Next, we automatically extracted a binary profile from each example in our dataset, and from the binary profiles we extracted features to use with classifiers. In a data mining framework, features are properties extracted from each example in the data set--such as byte sequences--that a classifier can use to generate detection models. Using different features, we trained a set of data mining classifiers to distinguish between benign and malicious programs. It should be noted that the features extracted were static properties of the binary and did not require executing the binary.

The framework supports different methods for feature extraction and different data mining classifiers. We used system resource information, strings and byte sequences that were extracted from the malicious executables in the data set as different types of features. We also used three learning algorithms:

an inductive rule-based learner that generates boolean rules based on feature attributes.
a probabilistic method that generates probabilities that an example was in a class given a set of features.
a multi-classifier system that combines the outputs from several classifiers to generate a prediction.

To compare the data mining methods with a traditional signature-based method, we designed an automatic signature generator. Since the virus scanner that we used to label the data set had signatures for every malicious example in our data set, it was necessary to implement a similar signature-based method to compare with the data mining algorithms. There was no way to use an off-the-shelf virus scanner and simulate the detection of new malicious executables because these commercial scanners contained signatures for all the malicious executables in our data set. Like the data mining algorithms, the signature-based algorithm was only allowed to generate signatures over the set of training data. This allowed our data mining framework to be fairly compared to traditional scanners over new data.

To quantitatively express the performance of our method we show tables with the counts for true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). A true positive is a malicious example that is correctly tagged as malicious, and a true negative is a benign example that is correctly classified. A false positive is a benign program that has been mislabeled by an algorithm as a malicious program, while a false negative is a malicious executable that has been misclassified as a benign program.

To evaluate the performance, we compute the false positive rate and the detection rate. The false positive rate is the number of benign examples that are mislabeled as malicious divided by the total number of benign examples. The detection rate is the number of malicious examples that are caught divided by the total number of malicious examples.

3.1 Dataset Description

Our data set consisted of a total of 4,266 programs split into 3,265 malicious binaries and 1,001 clean programs. There were no duplicate programs in the data set and every example in the set is labeled either malicious or benign by the commercial virus scanner.

The malicious executables were downloaded from various FTP sites and were labeled by a commercial virus scanner with the correct class label (malicious or benign) for our method. 5% of the data set was composed of Trojans and the other 95% consisted of viruses. Most of the clean programs were gathered from a freshly installed Windows 98 machine running MSOffice 97 while others are small executables downloaded from the Internet. The entire data set is available from our Web site http://www.cs.columbia.edu/ids/mef/software/.

We also examined a subset of the data that was in Portable Executable (PE) [17] format. The data set consisting of PE format executables was composed of 206 benign programs and 38 malicious executables.

After verification of the data set the next step of our method was to extract features from the programs.

Next: 4. Feature Extraction Up: Data Mining Methods for Previous: 2. Background

Erez Zadok
2001-05-19