

7. Results and Analysis

We estimated performance on new data using 5-fold cross validation [12], the standard Data Mining method for estimating a classifier's predictions over unseen data. For each set of binary profiles we partitioned the data into 5 equal-size groups, trained on 4 of the partitions, and evaluated the resulting rule set over the remaining partition. We repeated this process 5 times, leaving out a different partition for testing each time, and averaged the results of the five tests to obtain a reliable measure of how each algorithm performs over the entire set.
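The partitioning scheme described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the `train` and `evaluate` callables are hypothetical placeholders for any learning algorithm and metric.

```python
# Sketch of 5-fold cross validation: split the data into k folds, train on
# k-1 of them, evaluate on the held-out fold, and average over all k runs.
# `train` and `evaluate` are hypothetical stand-ins, not from the paper.
import random

def five_fold_cv(examples, train, evaluate, k=5, seed=0):
    """Average an evaluation score over k train/test splits."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    # Partition the shuffled data into k roughly equal folds.
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test_fold = folds[i]
        train_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train(train_set)
        scores.append(evaluate(model, test_fold))
    return sum(scores) / k
```

Each example is held out exactly once, so the averaged score reflects performance over the entire data set rather than one lucky split.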

To evaluate our system we were interested in several quantities:

True Positives (TP): the number of malicious executables classified as malicious.
True Negatives (TN): the number of benign programs classified as benign.
False Positives (FP): the number of benign programs classified as malicious.
False Negatives (FN): the number of malicious executables classified as benign.

We were interested in the detection rate of the classifier: the percentage of all malicious programs labeled malicious. We were also interested in the false positive rate: the percentage of benign programs mislabeled as malicious, also called false alarms.

The Detection Rate is defined as $\frac{TP}{TP+FN}$, the False Positive Rate as $\frac{FP}{TN+FP}$, and the Overall Accuracy as $\frac{TP+TN}{TP+TN+FP+FN}$. The results of all experiments are presented in Table 3.
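The three definitions above translate directly into code. As a sanity check, feeding in the Signature Method row of Table 3 reproduces its reported percentages:

```python
# Compute the three quantities defined above from raw confusion-matrix counts.

def rates(tp, tn, fp, fn):
    detection = tp / (tp + fn)                    # TP / (TP + FN)
    false_positive = fp / (tn + fp)               # FP / (TN + FP)
    accuracy = (tp + tn) / (tp + tn + fp + fn)    # (TP + TN) / total
    return detection, false_positive, accuracy

# Signature Method (bytes) row of Table 3:
det, fpr, acc = rates(tp=1102, tn=1000, fp=0, fn=2163)
# det ≈ 0.3375, fpr = 0.0, acc ≈ 0.4928 -- the 33.75% / 0% / 49.28% entries
```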

Table 3: Results of classifying new malicious programs, organized by algorithm and feature. Multi-Naive Bayes using bytes had the highest Detection Rate, and the Signature Method using bytes had the lowest False Positive Rate. The Naive Bayes algorithm with strings had the highest overall accuracy. Note that the detection rate for the signature-based method is lower than for the data mining methods.


Profile Type                 TP     TN     FP     FN    Detection  FP Rate  Accuracy
Signature Method
  -- Bytes                  1102   1000     0    2163    33.75%     0%       49.28%
RIPPER
  -- DLLs used                22    187    19      16    57.89%     9.22%    83.62%
  -- DLL function calls       27    190    16      11    71.05%     7.77%    89.36%
  -- DLLs with counted
     function calls           20    195    11      18    52.63%     5.34%    89.07%
Naive Bayes
  -- Strings                3176    960    41      89    97.43%     3.80%    97.11%
Multi-Naive Bayes
  -- Bytes                  3191    940    61      74    97.76%     6.01%    96.88%



For all the algorithms we plotted the detection rate vs. false positive rate using Receiver Operating Characteristic (ROC) curves [11]. ROC curves are a way of visualizing the trade-offs between detection and false positive rates.
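An ROC curve's points can be produced by sweeping a decision threshold over the classifier's scores and recording the false positive rate and detection rate at each setting. The sketch below illustrates the idea with made-up scores and labels; it is not the paper's plotting code.

```python
# Sketch of ROC-point generation: sweep a threshold over classifier scores
# and record (false positive rate, detection rate) at each setting.
# Scores and labels are illustrative only.

def roc_points(scores, labels):
    """labels: 1 = malicious, 0 = benign; higher score = more malicious."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))  # (false positive rate, detection rate)
    return points
```

Lowering the threshold flags more programs as malicious, raising the detection rate and the false positive rate together; the curve makes that trade-off visible.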

7.1 Signature Method

As shown in Table 3, the signature method had the lowest false positive rate, 0%. It also had the lowest detection rate, 33.75%, and the lowest overall accuracy, 49.28%.

Since we use this method to compare with the learning algorithms we plot its ROC curves against the RIPPER algorithm in Figure 7 and against the Naive Bayes and Multi-Naive Bayes algorithms in Figure 8.

The detection rate of the signature-based method is inherently low over new executables because its signatures were never designed to detect new malicious executables. Note, however, that although the signature-based method detected only 33.75% of new malicious programs, it detected 100% of the malicious binaries it had seen before, with a 0% false positive rate.


7.2 RIPPER

The RIPPER results shown in Table 3 are roughly equivalent in detection rate and overall accuracy, though the method using the features from Figure 2, a list of DLL function calls, has the highest detection rate of the three.

Figure 7: RIPPER ROC curves. Notice that the RIPPER curves have a higher detection rate than the comparison method for false positive rates greater than 7%.

The ROC curves for all RIPPER variations are shown in Figure 7. The lowest line represents RIPPER using DLLs alone as features; its growth is roughly linear, meaning that increasing the detection rate by 5% also increases the false positive rate by roughly 5%.

The other lines are concave down, so each has an optimal trade-off between detection and false alarms. For DLLs with counted function calls this optimal point occurs at a false positive rate of 10% and a detection rate of 75%. For DLLs with function calls the optimal point occurs at a false positive rate of 12% and a detection rate of less than 90%.

7.3 Naive Bayes

The Naive Bayes algorithm using strings as features performed the best of the learning algorithms, and better than the signature method in both false positive rate and overall accuracy (see Table 3). It is the most accurate algorithm at 97.11%, and its detection rate is within 1% of the highest, the Multi-Naive Bayes at 97.76%. It outperformed the RIPPER methods in every category.

In Figure 8, the slope of the Naive Bayes curve is initially much steeper than that of the Multi-Naive Bayes, so Naive Bayes with strings has better detection rates at small false positive rates: its detection rate exceeded 90% with a false positive rate of less than 2%.

Figure 8: Naive Bayes and Multi-Naive Bayes ROC curves. Note that both methods have a higher detection rate than the signature method for false positive rates greater than 0.5%.

7.4 Multi-Naive Bayes

The Multi-Naive Bayes algorithm using bytes as features had the highest detection rate of any method we tested, 97.76%. Its false positive rate of 6.01% was higher than that of the Naive Bayes method (3.80%) and the signature method (under 1%).

The Multi-Naive Bayes ROC curve in Figure 8 grows more slowly than that of Naive Bayes with strings until the false positive rate climbs above 4%; the two algorithms then converge, reaching detection rates greater than 95% for false positive rates greater than 6%.

7.5 Same Model, Different Applications

The ROC curves in Figures 7 and 8 also let security experts tailor this framework to their specific needs. For example, in a secure computing setting it may be more important to have a high detection rate, 98.79%, at the cost of a false positive rate of 8.87%. If the framework were instead integrated into a mail server, it may be more important to keep the false positive rate below 1% (0.39% in our case), accepting a detection rate of 65.62% for new malicious programs.
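Selecting such a deployment-specific operating point amounts to picking, from the ROC data, the highest detection rate whose false positive rate fits the deployment's budget. A minimal sketch, using the two operating points quoted above as example data (the full curves are in Figures 7 and 8):

```python
# Sketch of choosing an operating point from ROC data: maximize detection
# rate subject to a deployment-specific false-positive budget.

def best_operating_point(roc, max_fp_rate):
    """roc: list of (fp_rate, detection_rate) pairs from an ROC curve."""
    feasible = [(fp, det) for fp, det in roc if fp <= max_fp_rate]
    if not feasible:
        return None
    return max(feasible, key=lambda p: p[1])  # maximize detection rate

# Example points drawn from the text: a high-security setting vs. a
# mail-server setting (intermediate curve points omitted for brevity).
curve = [(0.0039, 0.6562), (0.0601, 0.9776), (0.0887, 0.9879)]
mail_server = best_operating_point(curve, max_fp_rate=0.01)   # low false alarms
secure_site = best_operating_point(curve, max_fp_rate=0.09)   # high detection
```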

Erez Zadok