We have done extensive development and evaluation of the use of machine learning methods for reducing the CumulativeCost of intrusion detection [10,16]. Because of space constraints, in this section and Section 5, we describe and evaluate the particular methods which have proven most effective.
Given a training set in which each event is labeled as either normal or some intrusions, RIPPER builds an ordered or unordered ruleset. Each rule in the ruleset uses the most
discriminating feature values for classifying a data item into one of
the classes. A rule consists of conjunctions of feature comparisons,
and if the rule evaluates to true, then a prediction is made. An
example rule for predicting teardrop is ``
.'' Before discussing the details of our approach, it is
necessary to outline the advantages and disadvantages of ordered and
un-ordered rulesets.
Ordered Rulesets: An ordered ruleset has the form
,
where rn is a rule and in is the class
label predicted by that rule. Before learning, RIPPER first orders
the classes by one of the following heuristics: +freq, which orders
by increasing frequency in the training data; -freq, by decreasing
frequency; given, which is a user-defined ordering; mdl, which
uses the minimal description length to guess an optimal
ordering [17] . After arranging the classes, RIPPER
finds rules to separate class1 from classes
,
then rules to separate class2 from classes
,
and so on. The final class,
classn, will become the default class. The end result is that
rules for a single class will always be grouped together, but rules
for classi are possibly simplified, because they can assume that
the class of the example is one of
.
If
an example is covered by rules from two or more classes, this conflict
is resolved in favor of the class that comes first in the ordering.
An ordered ruleset is usually succinct and efficient. Evaluation of an entire ordered ruleset does not require each rule to be tested, but proceeds from the top of the ruleset to the bottom until any rule evaluates to true. The features used by each rule can be computed one by one as evaluation proceeds. The operational cost to evaluate an ordered ruleset for a given event is the total cost of computing unique features until a prediction is made. For intrusion detection, a -freq ruleset is usually lowest in operational cost and accurately classifies normal events. This is because the first rules of the ruleset identify normal, which is usually the most frequently occurring class. On the contrary, a +freq ruleset would most likely be higher in operational cost but more accurate in classifying intrusions because the ruleset partitions intrusions from normal events early in its evaluation, and normal is the final default classification. Depending on the class ordering, the performances of given and mdlwill lie between those of -freq and +freq.
Un-ordered Rulesets: An un-ordered ruleset has at least one rule for each class and there are usually many rules for frequently occurring classes. There is also a default class which is used for prediction when none of these rules are satisfied. Unlike ordered rulesets, all rules are evaluated during prediction and conflicts are broken by using the most accurate rule. Un-ordered rulesets, in general, contain more rules and are less efficient in execution than -freq and +freq ordered rulesets, but there are usually several rules of high precision for the most frequent class, resulting in accurate classification of normal events.
With the advantages and disadvantages of ordered and un-ordered rulesets in mind, we propose the following multiple ruleset approach:
The precision and threshold values used by the multiple model approach
can be obtained during model training from the training set, or can be
computed using a separate hold-out validation set. The precision of a
rule can be obtained easily from the positive and negative counts of a
rule:
.
Threshold values are set to the precisions of
the rules in a single ruleset using all features (R4) for each
class in the chosen dataset, as we do not want to make less precise
classifications in
R1, R2, R3 than would be made using
R4.
The decision module takes as input an intrusion report generated by
the detection module. The report contains the name of the predicted
intrusion and the name of the target, which are then used to look up
the pre-determined DCost and RCost. If DCost
RCost, the
decision module invokes a separate module to initiate a response;
otherwise, it simply logs the intrusion report.
The functionality of the decision module can be implemented before training using some data re-labeling mechanism such as MetaCost [9], which will re-label intrusions with DCost < RCost to normal so that the generated model will not contain rules for predicting these intrusions at all. We have experimented with such a mechanism [10], but have decided to implement this functionality in the post-detection decision module to eliminate the necessity of re-training a model when cost factors change, despite the savings in operational cost due to the generation of a smaller model.