MEF filters malicious attachments by replacing the signature-based virus filter found in Procmail with a detection model generated by data mining. Procmail is a program that processes email messages, looking for particular information in the headers or body of each message, and takes actions based on what it finds. Currently, the only mail server supported is sendmail. MEF uses a Procmail script to extract attachments from emails and save them temporarily based on their name. The script then runs the filter on each attachment.
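The extraction step can be illustrated with a short sketch. MEF itself performs this step in a Procmail script; the Python below is only a stand-in that uses the standard `email` module, and the names `save_attachments` and `build_sample_message` are hypothetical.

```python
import email
import os
import tempfile
from email import policy
from email.message import EmailMessage


def save_attachments(raw_message: bytes, directory: str) -> list:
    """Save each attachment of a raw email under its own file name.

    Illustrative stand-in for the Procmail extraction step, not MEF's script.
    """
    msg = email.message_from_bytes(raw_message, policy=policy.default)
    saved = []
    for part in msg.iter_attachments():
        payload = part.get_payload(decode=True)  # undo base64 / quoted-printable
        if payload is None:  # e.g. a nested multipart attachment
            continue
        # basename() guards against path components in the supplied name
        name = os.path.basename(part.get_filename() or "unnamed.bin")
        path = os.path.join(directory, name)
        with open(path, "wb") as fh:
            fh.write(payload)
        saved.append(path)
    return saved


def build_sample_message() -> bytes:
    """Build a small test message with one binary attachment (made-up data)."""
    m = EmailMessage()
    m["Subject"] = "sample"
    m.set_content("body text")
    m.add_attachment(b"MZ\x90\x00", maintype="application",
                     subtype="octet-stream", filename="a.exe")
    return m.as_bytes()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(save_attachments(build_sample_message(), tmp))
```

Each saved file would then be passed to the filter, one attachment at a time.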
The filter first decodes each binary and then examines it using a data mining classifier. It evaluates the attachment by comparing all the byte strings found within it to the byte sequences contained in the detection model. The system calculates the probability of the binary being malicious, and if this is greater than its probability of being benign, the executable is labeled malicious. Otherwise, the binary is labeled benign. The result is reported as a score back to Procmail, which then either sends the mail along untouched or logs the entry as an attack and wraps the email with a warning. The log is a collection of information about the attachment; exactly what this information is depends on the configuration of the system.
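The decision rule can be sketched as follows. The sketch assumes a naive-Bayes-style score with add-one smoothing computed over fixed-length byte strings; the window length, the smoothing, the equal prior, and the names `byte_strings` and `classify` are illustrative assumptions rather than details of the MEF classifier.

```python
import math


def byte_strings(data: bytes, n: int = 2) -> set:
    """Return the set of n-byte substrings of data (n=2 is an arbitrary choice)."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def classify(data: bytes, model: dict, prior_malicious: float = 0.5):
    """Label a binary and return (label, probability of being malicious).

    model maps a byte string to (malicious_count, benign_count).
    """
    total_mal = sum(m for m, _ in model.values())
    total_ben = sum(b for _, b in model.values())
    vocab = len(model)

    # Work in log space so that many small factors do not underflow.
    log_mal = math.log(prior_malicious)
    log_ben = math.log(1.0 - prior_malicious)
    for s in byte_strings(data):
        if s not in model:
            continue  # byte strings unknown to the model carry no evidence
        m, b = model[s]
        log_mal += math.log((m + 1) / (total_mal + vocab))  # add-one smoothing
        log_ben += math.log((b + 1) / (total_ben + vocab))

    # Normalise the two scores into a probability.
    top = max(log_mal, log_ben)
    w_mal = math.exp(log_mal - top)
    w_ben = math.exp(log_ben - top)
    p_mal = w_mal / (w_mal + w_ben)

    # Malicious only if more probable than benign; ties fall to benign.
    label = "malicious" if p_mal > 1.0 - p_mal else "benign"
    return label, p_mal
```

With a toy model such as `{b"\x90\x90": (9, 1), b"ab": (1, 9), b"MZ": (5, 5)}`, a binary containing only the first byte string is labeled malicious and one containing only the second is labeled benign.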
Borderline binaries are binaries that have similar probabilities of being benign and malicious (e.g. 50% chance it is malicious, and 50% chance it is benign). These binaries are important to keep track of because they are likely to be mislabeled, so they should be included in the training set. To facilitate this, the system archives the borderline cases, and at periodic intervals the system administrator sends the collection of borderline binaries back to a central server.
Once at the central repository, these binaries can then be analyzed by experts to determine whether they are malicious or not, and subsequently included in the future versions of the detection models. Any binary that is determined to be a borderline case will be forwarded to the repository and wrapped with a warning as though it were a malicious attachment.
A simple metric to detect borderline cases and redirect them to an evaluation party is to define a borderline case to be a case where the difference between the probability it is malicious and the probability it is benign is below a threshold. This threshold is set based on the policies of the host.
For example, in a secure setting, the threshold could be set at 20%. In this case, all binaries whose split is 60/40 or closer are labeled as borderline. In other words, binaries with at most a 60% chance (according to the model) of being malicious and at least a 40% chance of being benign would be labeled borderline, and vice versa. This setting can be determined by the system administrator or left at the default of 51.25/48.75, a threshold of 2.5%.
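The borderline test and the resulting handling can be written as a small sketch. The function names, the action labels, and the choice to treat a difference exactly equal to the threshold as borderline are assumptions made for illustration.

```python
def is_borderline(p_malicious: float, threshold: float = 0.025) -> bool:
    """True when the two class probabilities differ by no more than threshold.

    The default of 0.025 mirrors the 51.25/48.75 default split.
    """
    p_benign = 1.0 - p_malicious
    return abs(p_malicious - p_benign) <= threshold


def route(p_malicious: float, threshold: float = 0.025) -> str:
    """Choose an action for one attachment (action labels are hypothetical)."""
    if is_borderline(p_malicious, threshold):
        # Borderline: archive for the central repository and warn the
        # recipient as though the attachment were malicious.
        return "warn_and_archive"
    if p_malicious > 0.5:
        return "warn"  # log as an attack and wrap the email with a warning
    return "deliver"  # send the mail along untouched
```

Under a 20% threshold, `route(0.55, 0.20)` treats the attachment as borderline, while `route(0.90, 0.20)` treats it as malicious.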
Receiving borderline cases and updating the detection model is an important aspect of the data mining approach. The larger the data set used to generate the models, the more accurate the detection models will be. Borderline cases are particularly valuable because they are executables that could lower the detection and accuracy rates by being misclassified, so the models should be trained on them.
This system will require updates periodically, and in the following section we detail the update algorithm. After a number of borderline cases have been received, it is necessary to generate a new detection model, and subsequently distribute updated models.
A new model is first generated by running the data mining algorithm on a new data set that contains the borderline cases, along with their correct classifications, and the previous data set. This model will then be distributed.
Updating the models is accomplished by distributing portions of the models that changed, and not the entire model. This is important because the detection models are large. In order to avoid constantly sending a large model to the filters, the administrator has the option of receiving this smaller file. Using the update algorithm, the older model can then be updated. The full model will also be available to provide additional options for the system administrator.
Efficient update of the model is possible because the underlying representation of the models is probabilistic. As is explained later, the model is a count of the number of times that each byte string appears in a malicious program versus the number of times that it appears in a benign program. An update model can then be easily summed with the older model to create a new model.
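A minimal sketch of this summation, assuming the model is stored as a mapping from byte string to a pair of counts (malicious, benign). The storage layout and the names `make_update` and `apply_update` are illustrative; the distributed file format is not specified here.

```python
def make_update(old: dict, new: dict) -> dict:
    """Return only the count differences between two models.

    Each model maps a byte string to (malicious_count, benign_count).
    """
    delta = {}
    for key in old.keys() | new.keys():
        om, ob = old.get(key, (0, 0))
        nm, nb = new.get(key, (0, 0))
        if (om, ob) != (nm, nb):
            delta[key] = (nm - om, nb - ob)  # unchanged entries are omitted
    return delta


def apply_update(old: dict, delta: dict) -> dict:
    """Sum an update into an older model, producing the newer model."""
    updated = dict(old)
    for key, (dm, db) in delta.items():
        om, ob = updated.get(key, (0, 0))
        updated[key] = (om + dm, ob + db)
    return updated


old_model = {b"ab": (1, 2), b"cd": (3, 0)}
new_model = {b"ab": (1, 2), b"cd": (4, 0), b"ef": (0, 1)}
update = make_update(old_model, new_model)  # holds entries for b"cd" and b"ef" only
assert apply_update(old_model, update) == new_model
```

Because unchanged byte strings are left out, the update stays small whenever retraining alters only a fraction of the counts.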
In future versions of MEF, the model will be made available for the system administrator on a public ftp site. If a system administrator subscribes to the mailing list then when a new model is made available, the system administrator will receive an email. The email will detail where the model is located, what version it is, and include a form of authentication. At the ftp site the model will be available to download as either an upgrade from a previous version, or as a full model. An archive of old models will also be kept on the ftp site.
There are also a host of options for automatically receiving the updates. One way to distribute the update is simply to attach it to the notification email. The administrator could then update the model later without having to retrieve it by ftp. In the future, a program included in the email filter could automatically poll the central server to see if a new model is available, and then download it and update the current model. These last methods have not yet been implemented.