4. Cost-Sensitive Modeling

Like risk analysis [2], cost-sensitive modeling for intrusion detection must be performed periodically because cost metrics must take into account changes in information assets and security policies. It is therefore important to develop tools that can automatically produce cost-sensitive models for given cost metrics.

We have done extensive development and evaluation of the use of machine learning methods for reducing the CumulativeCost of intrusion detection [10,16]. Because of space constraints, in this section and Section 5, we describe and evaluate the particular methods which have proven most effective.

4.1 Reducing Operational Cost

To reduce OpCost, ID models need to use low-cost features as often as possible while maintaining a desired level of accuracy. Our approach is to build multiple ID models, each of which uses a different set of features at a different cost level. Low-cost models are always evaluated first by the IDS, and high-cost models are used only when the low-cost models cannot make a prediction with sufficient accuracy. We implement this multiple-model approach using RIPPER [6], a rule induction algorithm. However, other machine learning algorithms or knowledge-engineering methods may be used as well.

Given a training set in which each event is labeled either as normal or as a specific intrusion, RIPPER builds an ordered or un-ordered ruleset. Each rule in the ruleset uses the most discriminating feature values for classifying a data item into one of the classes. A rule consists of a conjunction of feature comparisons, and if the rule evaluates to true, then a prediction is made. An example rule for predicting teardrop is "$\mathbf{if}\ number\_bad\_fragments \geq 2\ \mathbf{and}\ protocol = udp \ \mathbf{then}\ teardrop$". Before discussing the details of our approach, it is necessary to outline the advantages and disadvantages of ordered and un-ordered rulesets.
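As an illustration only, the teardrop rule quoted above can be written as a conjunction of feature comparisons. The `Rule` class and the event dictionary layout are assumptions made for this sketch; they are not RIPPER's internal representation.

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

# One comparison: (feature name, comparison function, constant).
Condition = Tuple[str, Callable[[Any, Any], bool], Any]


@dataclass
class Rule:
    """Hypothetical container for a rule: a conjunction of comparisons."""
    conditions: List[Condition]
    label: str  # class predicted when every condition holds

    def fires(self, event: Dict[str, Any]) -> bool:
        # The rule evaluates to true only if all comparisons are true.
        return all(op(event[name], value) for name, op, value in self.conditions)


# "if number_bad_fragments >= 2 and protocol = udp then teardrop"
teardrop_rule = Rule(
    conditions=[
        ("number_bad_fragments", operator.ge, 2),
        ("protocol", operator.eq, "udp"),
    ],
    label="teardrop",
)

event = {"number_bad_fragments": 3, "protocol": "udp"}
if teardrop_rule.fires(event):
    print("predicted:", teardrop_rule.label)
```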

Ordered Rulesets: An ordered ruleset has the form $\mathbf{if}\ r_1\ \mathbf{then}\ i_1\ \mathbf{elseif}\ r_2\ \mathbf{then}\ i_2, \ldots, \mathbf{else}\ default$, where $r_n$ is a rule and $i_n$ is the class label predicted by that rule. Before learning, RIPPER first orders the classes by one of the following heuristics: +freq, which orders by increasing frequency in the training data; -freq, by decreasing frequency; given, which is a user-defined ordering; mdl, which uses the minimum description length to guess an optimal ordering [17]. After arranging the classes, RIPPER finds rules to separate $class_{1}$ from classes $class_{2},\ldots, class_{n}$, then rules to separate $class_{2}$ from classes $class_{3},\ldots, class_{n}$, and so on. The final class, $class_{n}$, becomes the default class. The end result is that rules for a single class are always grouped together, but rules for $class_{i}$ are possibly simplified, because they can assume that the class of the example is one of $class_{i},\ldots,class_{n}$. If an example is covered by rules from two or more classes, this conflict is resolved in favor of the class that comes first in the ordering.
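The frequency-based orderings can be sketched in a few lines. The function name `order_classes` and the alphabetical tie-breaking are assumptions of this sketch, and the mdl heuristic is deliberately left out because its details are not given here.

```python
from collections import Counter
from typing import List, Optional, Sequence


def order_classes(labels: Sequence[str], heuristic: str = "+freq",
                  given: Optional[Sequence[str]] = None) -> List[str]:
    """Return the class ordering; the last class becomes the default class."""
    counts = Counter(labels)
    if heuristic == "+freq":   # increasing frequency in the training data
        return sorted(counts, key=lambda c: (counts[c], c))
    if heuristic == "-freq":   # decreasing frequency in the training data
        return sorted(counts, key=lambda c: (-counts[c], c))
    if heuristic == "given":   # user-defined ordering
        if given is None:
            raise ValueError("the 'given' heuristic needs an explicit ordering")
        return list(given)
    raise ValueError(f"unsupported heuristic: {heuristic}")


# Toy label column: normal is the most frequent class.
labels = ["normal"] * 5 + ["smurf"] * 2 + ["teardrop"]
print(order_classes(labels, "+freq"))
print(order_classes(labels, "-freq"))
```

With +freq the most frequent class (here normal) ends up last and therefore acts as the default; with -freq it comes first.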

An ordered ruleset is usually succinct and efficient. Evaluation of an entire ordered ruleset does not require each rule to be tested, but proceeds from the top of the ruleset to the bottom until some rule evaluates to true. The features used by each rule can be computed one by one as evaluation proceeds. The operational cost to evaluate an ordered ruleset for a given event is the total cost of computing unique features until a prediction is made. For intrusion detection, a -freq ruleset is usually lowest in operational cost and accurately classifies normal events. This is because the first rules of the ruleset identify normal, which is usually the most frequently occurring class. In contrast, a +freq ruleset would most likely be higher in operational cost but more accurate in classifying intrusions, because the ruleset partitions intrusions from normal events early in its evaluation, and normal is the final default classification. Depending on the class ordering, the performance of given and mdl will lie between that of -freq and +freq.
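The cost accounting described above can be sketched as follows: rules are tried top to bottom, a feature is computed only when a rule first needs it, and each unique feature is charged once. The function name, the rule layout, and the feature costs are assumptions chosen for illustration.

```python
from typing import Any, Callable, Dict, List, Tuple

Condition = Tuple[str, Callable[[Any], bool]]   # (feature name, test on its value)
OrderedRule = Tuple[List[Condition], str]       # (conjunction, predicted label)


def evaluate_ordered(rules: List[OrderedRule], default: str,
                     compute: Callable[[str], Any],
                     cost: Dict[str, int]) -> Tuple[str, int]:
    """Return (prediction, operational cost) for one event."""
    cache: Dict[str, Any] = {}
    op_cost = 0
    for conditions, label in rules:
        for feature, predicate in conditions:
            if feature not in cache:            # charge each unique feature once
                cache[feature] = compute(feature)
                op_cost += cost[feature]
            if not predicate(cache[feature]):
                break                           # this rule fails; try the next one
        else:
            return label, op_cost               # every condition held: rule fires
    return default, op_cost


# Hypothetical feature costs and a one-rule +freq style ruleset.
FEATURE_COST = {"protocol": 1, "number_bad_fragments": 5}
RULES: List[OrderedRule] = [
    ([("protocol", lambda v: v == "udp"),
      ("number_bad_fragments", lambda v: v >= 2)], "teardrop"),
]

attack = {"protocol": "udp", "number_bad_fragments": 3}
benign = {"protocol": "tcp", "number_bad_fragments": 0}
print(evaluate_ordered(RULES, "normal", attack.__getitem__, FEATURE_COST))
print(evaluate_ordered(RULES, "normal", benign.__getitem__, FEATURE_COST))
```

In this toy case the benign event is rejected after computing only the cost 1 feature, while the attack requires both features before the rule fires.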

Un-ordered Rulesets: An un-ordered ruleset has at least one rule for each class, and there are usually many rules for frequently occurring classes. There is also a default class, which is used for prediction when none of these rules is satisfied. Unlike ordered rulesets, all rules are evaluated during prediction, and conflicts are resolved in favor of the most accurate rule. Un-ordered rulesets, in general, contain more rules and are less efficient in execution than -freq and +freq ordered rulesets, but there are usually several rules of high precision for the most frequent class, resulting in accurate classification of normal events.
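A minimal sketch of un-ordered evaluation follows. It assumes that "most accurate" is measured by a per-rule precision estimated beforehand; the function name and the example rules are hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

# (test on the event, predicted label, precision of the rule)
UnorderedRule = Tuple[Callable[[Dict[str, Any]], bool], str, float]


def evaluate_unordered(rules: List[UnorderedRule], default: str,
                       event: Dict[str, Any]) -> str:
    """Evaluate every rule; break conflicts with the highest-precision rule."""
    fired = [(precision, label) for test, label, precision in rules if test(event)]
    if not fired:
        return default                      # no rule is satisfied
    return max(fired, key=lambda pair: pair[0])[1]


# Hypothetical rules; both fire on the event below.
rules: List[UnorderedRule] = [
    (lambda e: e["protocol"] == "udp", "normal", 0.80),
    (lambda e: e["number_bad_fragments"] >= 2, "teardrop", 0.97),
]
event = {"protocol": "udp", "number_bad_fragments": 3}
print(evaluate_unordered(rules, "normal", event))
```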

With the advantages and disadvantages of ordered and un-ordered rulesets in mind, we propose the following multiple ruleset approach:

We first generate multiple training sets T₁, T₂, T₃, T₄ using different feature subsets. T₁ uses only cost 1 features; T₂ uses features of costs 1 and 5; T₃ uses features of costs 1, 5, and 10; and T₄ uses all available features of costs 1, 5, 10, and 100.
Rulesets R₁, R₂, R₃, R₄ are learned using their respective training sets. R₄ is learned as either a +freq or a -freq ruleset for efficiency, as it may contain the most costly features. R₁, R₂, R₃ are learned as either -freq or un-ordered rulesets, as they will contain accurate rules for classifying normal events, and we filter normal as early as possible to reduce operational cost. The given and mdl orderings might be used, but their performance would not be better.
A precision measurement $p_{r}$ is computed for every rule, $r$, except for the rules in R₄.
A threshold value $\tau_{i}$ is obtained for every class, and determines the tolerable precision required for a prediction to be made in execution.
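The construction of the training sets T₁ through T₄ can be sketched as a projection of each record onto the features whose cost does not exceed a given level. The feature names and their cost assignments below are invented for illustration; only the cost levels 1, 5, 10, and 100 come from the text.

```python
from typing import Any, Dict, List

# Hypothetical feature-to-cost assignment.
FEATURE_COST = {
    "protocol": 1,
    "service": 1,
    "number_bad_fragments": 5,
    "count_same_host": 10,
    "content_signature": 100,
}
COST_LEVELS = [1, 5, 10, 100]


def feature_subsets(feature_cost: Dict[str, int], levels: List[int]) -> List[List[str]]:
    """Subset k holds every feature whose cost is at most the k-th level."""
    return [sorted(f for f, c in feature_cost.items() if c <= level)
            for level in levels]


def project(records: List[Dict[str, Any]], features: List[str]) -> List[Dict[str, Any]]:
    """Keep only the chosen features (plus the label) in each record."""
    return [{**{f: rec[f] for f in features}, "label": rec["label"]}
            for rec in records]


records = [{"protocol": "udp", "service": "dns", "number_bad_fragments": 3,
            "count_same_host": 12, "content_signature": 0, "label": "teardrop"}]
subsets = feature_subsets(FEATURE_COST, COST_LEVELS)
training_sets = [project(records, subset) for subset in subsets]  # T1..T4
print([len(subset) for subset in subsets])
```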

In real-time execution, the feature computation and rule evaluation proceed as follows:

R₁ is evaluated and a prediction i is made by some rule r.
If $p_{r} \ge \tau_{i}$, the prediction i is final. In this case, no more features are computed and the system examines the next event. Otherwise, additional features required by R₂ are computed and R₂ is evaluated.
This process continues until a final prediction is made. The evaluation of R₄ always produces a final prediction because R₄ uses all features.
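The execution steps above can be sketched as a cascade. Each ruleset is modeled here as a callable that returns the predicted class together with the precision of the rule that fired; that interface, the stub rulesets, and the threshold values are assumptions of this sketch, and the computation of the additional features is left implicit.

```python
from typing import Any, Callable, Dict, List, Tuple

# A ruleset maps an event to (predicted class, precision of the firing rule).
Ruleset = Callable[[Dict[str, Any]], Tuple[str, float]]


def classify(event: Dict[str, Any], rulesets: List[Ruleset],
             thresholds: Dict[str, float]) -> Tuple[str, int]:
    """Return (final prediction, index of the ruleset that produced it)."""
    last = len(rulesets)
    for level, ruleset in enumerate(rulesets, start=1):
        # A real system would compute the extra features needed at this level here.
        label, precision = ruleset(event)
        # The last ruleset uses all features, so its prediction is always final.
        if level == last or precision >= thresholds[label]:
            return label, level
    raise ValueError("at least one ruleset is required")


# Stub rulesets standing in for learned R1..R4 (values are made up).
r1: Ruleset = lambda e: ("normal", 0.90)
r2: Ruleset = lambda e: ("normal", 0.995)
r3: Ruleset = lambda e: ("normal", 0.999)
r4: Ruleset = lambda e: ("teardrop", 0.97)
thresholds = {"normal": 0.99, "teardrop": 0.95}

print(classify({}, [r1, r2, r3, r4], thresholds))
```

Here the prediction of R₁ falls below the threshold for normal, so R₂ is consulted and its prediction is accepted; R₃ and R₄ are never evaluated.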

The precision and threshold values used by the multiple model approach can be obtained during model training from the training set, or can be computed using a separate hold-out validation set. The precision of a rule can be obtained easily from the positive and negative counts of a rule: $\frac{p}{p + n}$ . Threshold values are set to the precisions of the rules in a single ruleset using all features (R₄) for each class in the chosen dataset, as we do not want to make less precise classifications in R₁, R₂, R₃ than would be made using R₄.
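A small sketch of the two computations follows. Rule precision uses the formula $\frac{p}{p + n}$ directly. How the precisions of several R₄ rules for the same class are combined into one threshold is not specified above; pooling their positive and negative counts is an assumption of this sketch, and the function names and counts are hypothetical.

```python
from typing import Dict, List, Tuple


def rule_precision(p: int, n: int) -> float:
    """Precision of a rule from its positive (p) and negative (n) counts."""
    return p / (p + n) if (p + n) > 0 else 0.0


def class_thresholds(r4_rules: List[Tuple[str, int, int]]) -> Dict[str, float]:
    """One threshold per class from the R4 rules, given as (class, p, n).

    Assumption: counts of all R4 rules for a class are pooled before
    computing the precision.
    """
    totals: Dict[str, Tuple[int, int]] = {}
    for label, p, n in r4_rules:
        tp, tn = totals.get(label, (0, 0))
        totals[label] = (tp + p, tn + n)
    return {label: rule_precision(p, n) for label, (p, n) in totals.items()}


# Made-up counts for three R4 rules.
r4_rules = [("normal", 45, 5), ("normal", 45, 5), ("teardrop", 3, 1)]
print(rule_precision(9, 1))
print(class_thresholds(r4_rules))
```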

4.2 Reducing Consequential Cost

A traditional IDS that does not consider the trade-off between RCost and DCost will attempt to respond to every intrusion that it detects. As a result, the consequential cost for FP, TP, and misclassified hits will always include some response cost. We use a cost-sensitive decision module to determine whether a response should ensue, based on whether DCost is at least as large as RCost.

The decision module takes as input an intrusion report generated by the detection module. The report contains the name of the predicted intrusion and the name of the target, which are then used to look up the pre-determined DCost and RCost. If DCost $\geq$ RCost, the decision module invokes a separate module to initiate a response; otherwise, it simply logs the intrusion report.
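The decision logic can be sketched as a table lookup followed by a comparison. The report fields, the lookup keys, and the cost values below are assumptions made for illustration; only the rule "respond if DCost $\geq$ RCost, otherwise log" comes from the text.

```python
from typing import Dict, Tuple

Key = Tuple[str, str]  # (predicted intrusion, target)

# Hypothetical pre-determined cost tables.
DCOST: Dict[Key, int] = {("teardrop", "web-server"): 30, ("portscan", "workstation"): 2}
RCOST: Dict[Key, int] = {("teardrop", "web-server"): 10, ("portscan", "workstation"): 5}


def decide(report: Dict[str, str], dcost: Dict[Key, int], rcost: Dict[Key, int]) -> str:
    """Return "respond" when DCost >= RCost, otherwise "log"."""
    key = (report["intrusion"], report["target"])
    return "respond" if dcost[key] >= rcost[key] else "log"


print(decide({"intrusion": "teardrop", "target": "web-server"}, DCOST, RCOST))
print(decide({"intrusion": "portscan", "target": "workstation"}, DCOST, RCOST))
```

Because the cost tables are consulted only after detection, changing a cost value changes the decision without touching the learned model.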

The functionality of the decision module can be implemented before training using some data re-labeling mechanism such as MetaCost [9], which will re-label intrusions with DCost < RCost to normal so that the generated model will not contain rules for predicting these intrusions at all. We have experimented with such a mechanism [10], but have decided to implement this functionality in the post-detection decision module to eliminate the necessity of re-training a model when cost factors change, despite the savings in operational cost due to the generation of a smaller model.
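The effect of such pre-training re-labeling can be illustrated with a direct pass over the training labels. This is a simplified stand-in written for this sketch, not the MetaCost procedure itself; the per-class cost tables and the example data are hypothetical.

```python
from typing import Any, Dict, List, Tuple

Example = Tuple[Dict[str, Any], str]  # (features, label)


def relabel(training_set: List[Example], dcost: Dict[str, int],
            rcost: Dict[str, int]) -> List[Example]:
    """Re-label intrusions with DCost < RCost as normal before training."""
    relabeled = []
    for features, label in training_set:
        if label != "normal" and dcost[label] < rcost[label]:
            label = "normal"   # not worth responding to, so do not learn it
        relabeled.append((features, label))
    return relabeled


# Hypothetical per-class costs.
dcost = {"teardrop": 30, "portscan": 2}
rcost = {"teardrop": 10, "portscan": 5}
data: List[Example] = [({"id": 1}, "teardrop"), ({"id": 2}, "portscan"), ({"id": 3}, "normal")]
print([label for _, label in relabel(data, dcost, rcost)])
```

Any change to the cost tables requires re-running this step and re-training, which is the drawback that motivates the post-detection decision module.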


Erez Zadok
2000-11-09