The Ohio State University, Office of Information Technology, Technology Support Center (8help)

Anti-Spam Service: Training the Filter



You do not have to train the filter to your individual preferences to take advantage of the Anti-spam service, but investing some time and effort in training will let you get the most out of the service.

Once you have turned on the service and set up your personal email stream, you can start training your filter to recognize spam as you choose.

To do this, you will need to set up Bayesian filtering. Log in to the anti-spam service, click Preferences: Stream Settings, and adjust the following:

  • Enable Bayesian analysis: Yes
  • Enable Bayesian training: Yes
  • Remove pre-existing Bayesian training links from incoming mail: Yes

You may also wish to set up the Message Voting Feature at this point, so that you can train the filter to treat unwanted messages that reach your Inbox as spam.

Once the Bayesian Filtering is set up, you can begin training the filter to recognize what you consider spam and what you want to receive in your Inbox.

Here's how the filtering works:

Each incoming e-mail message is broken up into tokens. Roughly speaking, a token corresponds to a word. In addition to single-word tokens, the filter keeps track of token pairs, which can greatly increase the accuracy of Bayesian filtering.
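
As a rough illustration of the tokenizing step described above, the following Python sketch splits a message into single-word tokens and adjacent token pairs. The function names and the word pattern are assumptions made for this example; the service's actual tokenizer is not documented here and may treat punctuation, headers and HTML differently.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (a simplified stand-in tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def token_pairs(tokens):
    """Return each pair of adjacent tokens, joined by a single space."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def features(text):
    """Return the single tokens followed by the token pairs for one message."""
    tokens = tokenize(text)
    return tokens + token_pairs(tokens)
```

For example, features("Buy cheap pills") yields the tokens "buy", "cheap" and "pills" plus the pairs "buy cheap" and "cheap pills". Pairs let the filter tell a phrase apart from the same words appearing separately.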

Each time a message is marked as spam or not-spam, CanIt-PRO updates counters for each token and token pair in the message. The training statistics are unique for each stream; each stream therefore has its own training set and own notion of what is and isn’t spam. The set of messages on which CanIt-PRO is trained is called the training corpus.
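
The per-stream counters can be pictured with a small Python sketch. The class and attribute names are hypothetical, and counting each token once per message is an assumption of this example, not a documented detail of CanIt-PRO.

```python
from collections import Counter, defaultdict

class StreamStats:
    """Training counters for one stream (hypothetical structure)."""

    def __init__(self):
        self.spam = Counter()   # token -> number of spam messages containing it
        self.ham = Counter()    # token -> number of non-spam messages containing it
        self.n_spam = 0         # spam messages in the training corpus
        self.n_ham = 0          # non-spam messages in the training corpus

    def train(self, message_tokens, is_spam):
        """Record one message marked as spam or not-spam."""
        counts = self.spam if is_spam else self.ham
        counts.update(set(message_tokens))  # assumption: count once per message
        if is_spam:
            self.n_spam += 1
        else:
            self.n_ham += 1

# Each stream gets its own independent statistics.
streams = defaultdict(StreamStats)
```

Because every stream has a separate StreamStats object, training one stream never changes what another stream considers spam.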

When the training corpus is large enough, the filter applies statistical analysis to incoming messages. Each token in the message is looked up to see how many times it has appeared in spam messages and how many times in non-spam messages.
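
One common way to turn those two counts into a per-token spam probability is shown below. This is a generic sketch of the kind of estimate used by Bayesian spam filters, not the formula CanIt-PRO is confirmed to use; the function name, the neutral value of 0.5 for unseen tokens, and the 0.01 to 0.99 clamp are all assumptions.

```python
def token_spam_probability(spam_hits, ham_hits, n_spam, n_ham):
    """Estimate P(spam | token) from training counts (illustrative formula)."""
    if spam_hits + ham_hits == 0 or n_spam == 0 or n_ham == 0:
        return 0.5  # no evidence either way
    spam_rate = spam_hits / n_spam   # share of spam messages containing the token
    ham_rate = ham_hits / n_ham      # share of non-spam messages containing it
    p = spam_rate / (spam_rate + ham_rate)
    # Clamp so that a single token can never be treated as absolute proof.
    return min(0.99, max(0.01, p))
```

A token seen in 9 of 10 spam messages and 1 of 10 non-spam messages gets a probability of about 0.9, while a token never seen in training stays at the neutral 0.5.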

The 15 "most interesting" tokens are collected, and a combined probability is computed from their individual probabilities. A token is considered "interesting" if it is either very likely to appear in a spam message or very likely to appear in a non-spam message. Tokens that appear in both spam and non-spam messages are not considered interesting.
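
The selection and combination step can be sketched as follows. "Interesting" is modeled here as distance from the neutral probability 0.5, and the tokens are combined with the naive Bayes product rule popularized for spam filtering. Both choices are assumptions for illustration; the exact formula used by the service is not stated in this document.

```python
import math

def combined_probability(token_probs, n=15):
    """Combine the n most interesting token probabilities into one value.

    token_probs maps token -> P(spam | token), each strictly between 0 and 1.
    """
    # The further a probability is from 0.5, the more interesting the token.
    interesting = sorted(token_probs.values(),
                         key=lambda p: abs(p - 0.5),
                         reverse=True)[:n]
    if not interesting:
        return 0.5  # nothing to go on
    spam = math.prod(interesting)
    ham = math.prod(1.0 - p for p in interesting)
    return spam / (spam + ham)
```

A token with probability 0.5 contributes the same factor to both products, so it has no effect on the result, which matches the idea that tokens common to spam and non-spam are uninteresting.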

After the system computes the combined probability, it consults a table to add points to (or subtract points from) the spam score.
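
The table lookup might look like the sketch below. The thresholds and point values are invented for this example, since the service's real table is not given in this document; only the general idea (low probabilities subtract points, high probabilities add points) comes from the text above.

```python
import bisect

# Hypothetical table: probability thresholds and the points for each band.
THRESHOLDS = [0.10, 0.40, 0.60, 0.90]
POINTS = [-2.0, -1.0, 0.0, 1.0, 2.0]   # one more entry than THRESHOLDS

def bayes_points(probability):
    """Return the spam-score adjustment for a combined probability."""
    band = bisect.bisect_right(THRESHOLDS, probability)
    return POINTS[band]
```

With these made-up values, a combined probability of 0.05 subtracts 2 points, 0.5 leaves the score unchanged, and 0.95 adds 2 points.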

The most useful sections for customizing the filtering settings (aside from the Message Voting Feature) are Rules: Senders and Rules: Domains, where you can set up rules to automatically accept or reject email from specific addresses or domains (the domain is the portion of an e-mail address to the right of the @).

Many spammers use one-time, disposable sender addresses, and many of those addresses are not even valid. We therefore do not recommend blacklisting addresses unless you receive many different spam e-mail messages from the same address.

Blacklisting individual addresses is usually not effective. Whitelisting known good addresses (for example, mailing-list sending addresses) can be very effective. The sender report may, however, highlight a persistent spam sender address which is worth blacklisting.

Just as sender addresses are often fake, sender domains are too. However, some domains are known to be used by spammers, and these can be profitably blacklisted.

Blacklisting entire domains can be effective under limited circumstances. Whitelisting known good addresses can be very effective. Holding all mail from free e-mail services like Hotmail and Yahoo can be effective if you use it in conjunction with whitelisting of known good senders from those services. Use the domain report to help make these decisions.
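
The combination of holding a free e-mail domain while whitelisting individual senders from it can be pictured with the sketch below. The rule dictionaries, the action names ("accept", "hold", "scan") and the assumption that a sender rule takes precedence over a domain rule are all illustrative; check the service's own documentation for how it actually orders rules.

```python
def sender_domain(address):
    """Return the part of an e-mail address to the right of the last @."""
    return address.rpartition("@")[2].lower()

def decide(address, sender_rules, domain_rules):
    """Pick an action for a sender (assumes sender rules beat domain rules)."""
    address = address.strip().lower()
    if address in sender_rules:
        return sender_rules[address]
    # No sender rule: fall back to the domain rule, else normal scanning.
    return domain_rules.get(sender_domain(address), "scan")

# Hypothetical rule sets for one stream.
sender_rules = {"friend@hotmail.com": "accept"}
domain_rules = {"hotmail.com": "hold"}
```

Under these rules, mail from friend@hotmail.com is accepted, mail from any other hotmail.com address is held for review, and mail from other domains is scanned as usual.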

In Rules: Custom Rules, you can filter out messages containing certain words. Be very careful when writing custom rules, especially rules that match on the message body. For example, a simple rule that matches "sex" anywhere in the body will also match "sexton", "Essex" and other harmless words.
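
The over-matching problem is easy to demonstrate outside the service. The Python sketch below compares a plain substring match with a whole-word match; it only illustrates the pitfall and does not claim that the Custom Rules page accepts this exact pattern syntax.

```python
import re

# Plain substring match: fires wherever the letters appear.
substring = re.compile(r"sex", re.IGNORECASE)

# Whole-word match: \b requires a word boundary on both sides.
whole_word = re.compile(r"\bsex\b", re.IGNORECASE)

def matching(pattern, words):
    """Return the words in which the pattern is found."""
    return [w for w in words if pattern.search(w)]

samples = ["sex", "sexton", "Essex", "Sussex County"]
```

Here matching(substring, samples) returns all four samples, while matching(whole_word, samples) returns only "sex". If a rule type supports whole-word or regular-expression matching, it is usually the safer choice for body rules.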

Be very careful when using Rules: Hosts. A host is a mail server on the Internet that sends mail to your address. Unless you are sure you understand the consequences of host rules, we recommend that you not create any. It is easy to inadvertently block all email coming from a legitimate host. You should only enter a single IP address for each host rule. The filter system does not support network entries, host names or wild cards in host rules.
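
Before entering a host rule, you can check that a value really is a single IP address and not a network, host name or wildcard. The sketch below uses Python's standard ipaddress module; the function name is an assumption, and the check runs on your own computer, not in the anti-spam service.

```python
import ipaddress

def is_single_ip(entry):
    """Return True only if entry is one literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(entry.strip())
    except ValueError:
        # Networks (192.0.2.0/24), host names and wildcards all end up here.
        return False
    return True
```

For example, is_single_ip("192.0.2.10") is True, while "192.0.2.0/24", "mail.example.com" and "192.0.2.*" are all rejected.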



Current Record: 2652

Create Date: 08-12-2005
Last Reviewed: 04-30-2007

