BOGOFILTER(1)
NAME
bogofilter - fast Bayesian spam filter
SYNOPSIS
bogofilter [help options | classification options | registration
options] [algorithm options] [general options]
where
help options are:
[-h] [-V] [-Q]
classification options are:
[-p] [-e] [-t] [-u] [-2] [-3] [-M] [-b] [-B filename ...] [-F] [-R] [algorithm
options] [general options] [parameter options]
registration options are:
[-s | -n] [-S | -N] [algorithm options] [general options]
general options are:
[-c filename] [-C] [-d dir] [-k size] [-W] [-WW] [-l] [-L tag]
[-I filename] [-O filename]
algorithm options are:
[-g | -r | -f]
parsing options are:
[-Ph/-PH] [-Pi/-PI] [-Pt/-PT]
parameter options are:
[-m [value] [,value][,value]] [-o [value] [,value]]
info options are:
[-q] [-v] [-y date] [-D] [-x flags]
DESCRIPTION
Bogofilter is a Bayesian spam filter. In its normal mode of operation, it
takes an email message or other text on standard input, does a statistical
check against lists of "good" and "bad" words, and returns a status code
indicating whether or not the message is spam. Bogofilter is designed with
fast algorithms, uses Berkeley DB for fast startup and lookups, is coded
directly in C, and is tuned for speed, so it can be used in production by
sites that process a lot of mail.
THEORY OF OPERATION
Bogofilter treats its input as a bag of tokens. Each token is checked
against "good" and "bad" wordlists, which maintain counts of the numbers of
times it has occurred in non-spam and spam mails. These numbers are used to
compute the probability that a mail in which the token occurs is spam.
After probabilities for all input tokens have been computed, a fixed number
of the probabilities that deviate furthest from average are combined
using Bayes's theorem on conditional probabilities. If the computed
probability that the input is spam exceeds a cutoff determined at compile
time (currently 0.95, for the Robinson-Fisher algorithm), bogofilter
returns 0, otherwise 1.
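As an illustration only, the combining step described above can be sketched in Python. The function name combine_extremes and the default of 15 retained probabilities are assumptions made for this sketch (15 is the figure used in Paul Graham's paper); it is not bogofilter's own C code, and it assumes every probability lies strictly between 0 and 1.

```python
import math

def combine_extremes(probs, keep=15):
    """Combine per-token spam probabilities with the naive Bayes rule.

    Only the `keep` probabilities furthest from the neutral value 0.5
    take part in the combination (hypothetical default, see lead-in).
    """
    extremes = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:keep]
    spam = math.prod(extremes)                   # evidence for spam
    ham = math.prod(1.0 - p for p in extremes)   # evidence for non-spam
    return spam / (spam + ham)

# A result above the cutoff (0.95 in the text above) would be reported as spam.
score = combine_extremes([0.99, 0.99, 0.5, 0.01])
```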
While this method sounds crude compared to the more usual pattern-matching
approach, it turns out to be extremely effective. Paul Graham's paper A
Plan For Spam: http://www.paulgraham.com/spam.html is recommended reading.
This program substantially improves on Paul's proposal by doing smarter
lexical analysis. In particular, hostnames and IP addresses are retained as
recognition features rather than broken up. Various kinds of MTA cruft such
as dates and message-IDs are discarded so as not to bloat the wordlists.
Lex's Swiss-army-knife nature rises again.
Another seeming improvement is that this program offers Gary Robinson's
suggested modifications (S and f(w) but not g(w)) to the calculations.
These modifications are described in Robinson's paper Spam Detection:
http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html.
Since then, Robinson and others have realized that the S calculation can be
further optimized: if a vector of length n contains random, uniformly
distributed probabilities p, then -2 * sum(ln(p)) is distributed as
chi-squared with 2n degrees of freedom. This is believed to be the most
sensitive test of the hypothesis that the vector of probabilities is, in
fact, uniformly distributed. Bogofilter now offers the option of applying
this test (known as Fisher's method) to yield P(spam) and P(not spam), and
using the difference as the "spamicity" score.
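The chi-squared test above can be sketched in Python as follows. This is a minimal sketch, not bogofilter's implementation: the function names are hypothetical, the closed-form series is the standard chi-squared tail for an even number of degrees of freedom, and rescaling the difference into the range 0 to 1 with (1 + P - Q) / 2 is an assumption of this sketch. The f(w) values must lie strictly between 0 and 1.

```python
import math

def chi2_sf_even(x, dof):
    """Upper tail probability of chi-squared for an even `dof`."""
    if dof <= 0 or dof % 2:
        raise ValueError("dof must be a positive even integer")
    half = x / 2.0
    term = math.exp(-half)
    total = term
    for i in range(1, dof // 2):
        term *= half / i          # builds (x/2)^i / i! incrementally
        total += term
    return min(total, 1.0)

def fisher_spamicity(fw):
    """Score a list of per-token f(w) values with Fisher's method."""
    dof = 2 * len(fw)
    # Spam indicator: large when the f(w) values crowd towards 1.
    p_spam = 1.0 - chi2_sf_even(-2.0 * sum(math.log(1.0 - f) for f in fw), dof)
    # Non-spam indicator: large when the f(w) values crowd towards 0.
    p_ham = 1.0 - chi2_sf_even(-2.0 * sum(math.log(f) for f in fw), dof)
    # Assumed rescaling of the difference into [0, 1].
    return (1.0 + p_spam - p_ham) / 2.0
```

With this rescaling, a set of neutral tokens (all f(w) equal to 0.5) scores exactly 0.5, and strongly spammy tokens push the score towards 1.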
The input may be one message or many. Messages are broken up on "From "
lines. The algorithm is relatively insensitive to message miscounts.
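The splitting rule can be pictured with a short Python sketch. The function name split_messages is hypothetical, and the sketch only mirrors the rule stated here (a line beginning with "From " starts a new message); it does not reproduce bogofilter's lexer.

```python
def split_messages(text):
    """Split raw input into messages at lines that begin with 'From '."""
    messages, current = [], []
    for line in text.splitlines(keepends=True):
        # A "From " line closes the message collected so far, if any.
        if line.startswith("From ") and current:
            messages.append("".join(current))
            current = []
        current.append(line)
    if current:
        messages.append("".join(current))
    return messages
```

Input with no "From " line at all comes back as a single message, and several messages run together without separators also come back as one.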
OPTIONS
Without command-line options, bogofilter returns 1 if the message is non-
spam, 0 if it is spam. The non-spam wordfile is created if absent.
HELP OPTIONS
The -h option prints the help message and exits.
The -V option prints the version number and exits.
The -Q (query) option prints bogofilter's configuration, i.e. registration
parameters, parsing options, bogofilter directory, etc.
CLASSIFICATION OPTIONS
The -p (passthrough) option writes a copy of the input mail to the output
with an X-Bogosity header (in the style of SpamAssassin) inserted. The
header will begin with "Yes" or "No" according to whether the mail is judged
to be spam or non-spam. Note: memory consumption depends on the input. If
the input is a regular file that allows seek operations, bogofilter rewinds
it and reads it a second time, using little memory. If the input is not a
regular file (for example, a pipeline or socket), bogofilter caches a copy
of the entire mail in memory.
The -e (embed) option tells bogofilter to exit with code 0 even if the mail
is not spam. This simplifies using bogofilter from procmail or maildrop.
The -t (terse) option tells bogofilter to print an abbreviated spamicity
message containing 1 letter and the score. Spam is indicated with "Y", ham
by "N", and unsure by "U".
The -u option tells bogofilter to register the message's text after
classifying it as spam or non-spam. A spam message will be registered on
the spamlist and a non-spam message on the goodlist. If using the
Robinson-Fisher method and the classification is "unsure", the message will
not be registered. Effectively this option runs bogofilter with the -s or
-n flag, as appropriate. (Caution is urged in the use of this capability,
as any classification errors bogofilter may make will be preserved and
accumulated until corrected with the -Sn and -Ns option combinations.)
The -2 option tells bogofilter to binary classify the message as either ham
or spam, and never as unsure. When this option is used with -u, a wordlist
is always updated.
The -3 option tells bogofilter to use tristate classification for the
message, i.e. classify the message as ham, spam, or unsure. This option is
effective only if ham_cutoff is non-zero.
The -M option tells bogofilter to process its input as an mbox-formatted
file. If the -v or -t option is also given, a spamicity line will be
printed for each message.
The -b (streaming bulk mode) option tells bogofilter to classify multiple
messages whose names are read from stdin. If the -v or -t option is also
given, bogofilter will print a line giving file name and classification
information for each file.
The -B filename (bulk mode) option tells bogofilter to classify multiple
messages named as files on the command line. If the -v or -t option is also
given, bogofilter will print a line giving file name and classification
information for each file.
The -F (force) option ignores threshold values when printing spamicity
statistics.
The -R option tells bogofilter to output an R data frame in text form on
the standard output. See the section on integration with R, below, for
further detail.
REGISTRATION OPTIONS
The -s option tells bogofilter to register the text presented on standard
input as spam. The spam wordfile is created if absent.
The -n option tells bogofilter to register the text presented on standard
input as non-spam.
Bogofilter doesn't detect whether a message has been registered twice. If
you do this by accident, the token counts will be off by 1 from what you
really want and the corresponding spam scores will be slightly off. Given a
large number of tokens and messages in the wordlists, this doesn't matter.
The problem _can_ be corrected by using the -S option or the -N option.
The -S option tells bogofilter to undo a prior registration of the same
message as spam. If a message was incorrectly entered in the spam wordfile
by '-s' or '-u' and you want to remove it from the spam wordfile and enter
it in the non-spam wordfile, use options '-Sn'. If '-S' is used for a
message that wasn't registered as spam, the counts will still be
decremented.
The -N option tells bogofilter to undo a prior registration of the same
message as non-spam. If a message was incorrectly entered in the non-spam
wordfile by '-n' or '-u' and you want to remove it from the non-spam
wordfile and enter it in the spam wordfile, then use '-Ns'. If '-N' is used
for a message that wasn't registered as non-spam, the counts will still be
decremented.
GENERAL OPTIONS
The -c filename option tells bogofilter to read the named config file.
The -C option prevents bogofilter from reading configuration files.
The -d dir option allows you to set the directory under which the wordlists
will be found to dir. If omitted, the default directory will be
$BOGOFILTER_DIR if BOGOFILTER_DIR is set and $HOME/.bogofilter otherwise.
The -k size option sets the cache size for the Berkeley DB subsystem.
Properly sizing the cache improves bogofilter's performance. Run the
bogotune script to determine the recommended size.
The -l option writes an informational line to the system log each time
bogofilter is run. The information logged depends on how bogofilter is run.
The -L tag option configures a tag which can be included in the information
being logged by the -l option, but it requires a custom format that
includes the %l string for now. This option implies -l.
The -I filename option tells bogofilter to read its input from the
specified file, rather than from stdin.
The -O filename option tells bogofilter where to write its output in
passthrough mode. Note that this only works when -p is explicitly given.
The -W option tells bogofilter to operate with a single wordlist, named
wordlist.db. Each token in wordlist.db is stored as an ASCII string with
two counts (for spam and ham) and (optionally) a timestamp.
The -WW option tells bogofilter to operate with a pair of wordlists, named
spamlist.db and goodlist.db. Spamlist.db stores tokens, counts, and
timestamps for tokens from spam messages. Goodlist.db stores tokens,
counts, and timestamps for tokens from ham messages.
ALGORITHM OPTIONS
The Robinson-Fisher method is the default algorithm used for computing a
message's spamicity score, unless bogofilter has been compiled without it,
by using the --disable-robinson-fisher option to the configure script. The
method to be used can be specified on the command line or in the
configuration file.
The -g option selects the original Graham form of the calculation method.
The -r option selects the Robinson modifications to the calculation method.
The -f option selects the Robinson-Fisher modifications to the calculation
method.
The configure script has options --disable-graham-method, --disable-
robinson-method, and --disable-robinson-fisher so that bogofilter can be
built to support a subset of the available methods.
PARSING OPTIONS
Bogofilter has three special parsing options which can be enabled (or
disabled) at the user's discretion. The options are of the form -Px and -PX
where x designates an option letter. For the parsing options, a lower case
letter enables the option and an upper case letter disables it.
Options -Ph and -PH are for header line markup, i.e. whether to create
special tags for header lines. When enabled, tokens in "To:", "From:",
"Return-Path:", and "Subject:" lines will be given special prefixes.
Enabling this option increases bogofilter's accuracy.
Options -Pi and -PI are for ignoring case, i.e. whether to map upper case
to lower case (or not). Disabling this option increases bogofilter's
accuracy.
Options -Pt and -PT are for tokenizing the innards of three HTML tags, i.e.
<a>, <img>, and <font>. Tokenizing these tags adds URLs and font names to
the message's tokens. Enabling this option increases bogofilter's accuracy.
PARAMETER OPTIONS
The -m [value][,value][,value] option allows setting the min_dev value and,
optionally, the robs and robx values. If one value is supplied, then
min_dev is set. If a comma followed by one value is supplied, then robs is
set. With two values, both min_dev and robs are set; with three, min_dev,
robs and robx are set; and other combinations of values and commas behave
as one would expect. Note that the syntax is misleading: at least one of
the values MUST be present, and the commas determine which value(s) will be
set. Note: spaces are not allowed after the comma.
The -o [value][,value] option allows setting the spam_cutoff value and,
optionally, the ham_cutoff value. If one value is supplied, then
spam_cutoff is set. If a comma followed by one value is supplied, then
ham_cutoff is set. With two values, both spam_cutoff and ham_cutoff are
set. Note that the syntax is misleading: at least one of the values MUST be
present, and the comma determines whether the spam or the ham cutoff is
set. Note: spaces are not allowed after the comma.
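The positional rule shared by -m and -o can be sketched in Python. The helper parse_option_values and the two name tuples are hypothetical, introduced only to illustrate how the commas select which parameters are set; bogofilter's own argument parser is not shown.

```python
M_NAMES = ("min_dev", "robs", "robx")        # order used by -m
O_NAMES = ("spam_cutoff", "ham_cutoff")      # order used by -o

def parse_option_values(arg, names):
    """Map a comma-separated argument onto parameter names by position.

    Empty fields leave the corresponding parameter untouched.
    """
    fields = arg.split(",")
    if len(fields) > len(names):
        raise ValueError("too many values")
    result = {name: float(field)
              for name, field in zip(names, fields) if field != ""}
    if not result:
        raise ValueError("at least one value must be present")
    return result
```

For example, parse_option_values(",0.01", M_NAMES) sets only robs, while parse_option_values("0.95,0.10", O_NAMES) sets both cutoffs.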
INFO OPTIONS
The -q (quiet) option suppresses warning messages.
The -v option produces a report to standard output on bogofilter's analysis
of the input. Each additional v will increase the verbosity of the output,
up to a maximum of 4. With -vv, the report lists the tokens with highest
deviation from a mean of 0.5 association with spam.
Option -y date specifies the date to give to tokens that don't have dates.
The -D option redirects debug output to stdout.
The -x flags option allows setting of debug flags for printing debug
information.
ENVIRONMENT
Bogofilter will initialize its database directory to $BOGOFILTER_DIR if
BOGOFILTER_DIR is set. If it is not set, bogofilter will use
$HOME/.bogofilter instead. If neither BOGOFILTER_DIR nor HOME is set, the
-d dir option must be present.
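The lookup order can be expressed as a small Python sketch. The function wordlist_dir is hypothetical and only restates the precedence described in this section and under the -d option; it is not part of bogofilter.

```python
import os

def wordlist_dir(cli_dir=None, env=None):
    """Return the wordlist directory following the documented precedence."""
    env = os.environ if env is None else env
    if cli_dir:                                  # -d dir on the command line
        return cli_dir
    if env.get("BOGOFILTER_DIR"):                # then $BOGOFILTER_DIR
        return env["BOGOFILTER_DIR"]
    if env.get("HOME"):                          # then $HOME/.bogofilter
        return env["HOME"] + "/.bogofilter"
    raise RuntimeError("neither BOGOFILTER_DIR nor HOME is set; -d is required")
```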
CONFIGURATION
The bogofilter command line allows setting of many options that determine
how bogofilter operates. File @sysconfdir@/bogofilter.cf can be used to set
additional parameters that affect its operation. File
@sysconfdir@/bogofilter.cf.example has samples of all of the parameters.
Status and logging messages can be customized for each site (see
@sysconfdir@/bogofilter.cf.example).
RETURN VALUES
0 for spam; 1 for non-spam; 2 for I/O or other errors.
If both -p and -e are used, the return values are: 0 for spam or non-spam;
2 for I/O or other errors.
Error 2 usually means that the wordlist file(s) bogofilter wants to read at
startup are missing or the hard disk has filled up in -p mode.
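A caller written in Python might interpret these values as sketched below. Both function names are hypothetical, run_bogofilter assumes a bogofilter binary on the PATH, and the labels returned by interpret_exit are this sketch's own wording.

```python
import subprocess

def interpret_exit(code, embed=False):
    """Translate a bogofilter exit status into a short label.

    With embed=True (the -e option), 0 no longer distinguishes spam from
    non-spam, so the verdict has to be read from the X-Bogosity header.
    """
    if code == 2:
        return "error"
    if embed:
        return "classified" if code == 0 else "unexpected"
    return {0: "spam", 1: "non-spam"}.get(code, "unexpected")

def run_bogofilter(message_bytes, extra_args=()):
    """Feed one message to bogofilter on stdin and return its exit status."""
    proc = subprocess.run(["bogofilter", *extra_args], input=message_bytes)
    return proc.returncode
```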
INTEGRATION WITH OTHER TOOLS
Use with Procmail
The following procmail rule will take mail on stdin and direct it to
Mail/spam if bogofilter thinks it's spam:
:0HB:
* ? bogofilter
Mail/spam
and this similar rule will also register the tokens in the mail according
to the bogofilter classification:
:0HB:
* ? bogofilter -u
Mail/spam
If bogofilter fails (returning 2) the message will be treated as non-spam.
The following recipe (a) spam-bins anything that bogofilter rates as spam,
(b) adds the words in messages rated as spam to the spam wordlist, and (c)
adds the words in messages rated as non-spam to the non-spam wordlist. With
this in place, it will normally only be necessary for the user to intervene
(with -Ns or -Sn) when bogofilter miscategorizes something.
# filter mail through bogofilter, tagging it as spam and
# updating the wordlists
:0fw
| bogofilter -u -e -p
# if bogofilter failed, return the mail to the queue, the MTA will
# retry to deliver it later
# 75 is the value for EX_TEMPFAIL in /usr/include/sysexits.h
:0e
{ EXITCODE=75 HOST }
# file the mail to spam-bogofilter if it's spam.
:0:
* ^X-Bogosity: Yes, tests=bogofilter
spam-bogofilter
This one is for maildrop; it automatically defers the mail and retries
later when the xfilter command fails. Use this in your ~/.mailfilter:
xfilter "bogofilter -u -e -p"
if (/^X-Bogosity: Yes, tests=bogofilter/)
{
to "spam-bogofilter"
}
The following .muttrc lines will create mutt macros for dispatching mail to
bogofilter.
macro index d "<enter-command>unset wait_key\n\
<pipe-entry>bogofilter -n\n\
<enter-command>set wait_key\n\
<delete-message>" "delete message as non-spam"
macro index \ed "<enter-command>unset wait_key\n\
<pipe-entry>bogofilter -s\n\
<enter-command>set wait_key\n\
<delete-message>" "delete message as spam"
Integration with Mail Transport Agent (MTA)
bogofilter can also be integrated into an MTA to filter all incoming mail.
While the specific implementation is MTA dependent, the general steps are
as follows:
1. Install bogofilter on the mail server.
2. Prime the bogofilter databases with a spam and non-spam corpus. Since
bogofilter will be serving a larger community, it is important to prime
it with a representative set of messages.
3. Set up the MTA to invoke bogofilter on each message. While this is an
MTA-specific step, you'll probably need to use the -p, -u, and -e
options.
4. Set up a mechanism for users to register spam/nonspam messages, as well
as to correct misclassifications. The most generic solution is to set
up alias email addresses to which users bounce messages.
See the doc and contrib directories for more information.
Use of R to verify Bogofilter calculations
The -R option tells bogofilter to generate an R data frame. The data frame
contains one row per token analysed. Each such row contains the token, the
sum of its database "good" and "spam" counts, the "good" count divided by
the number of non-spam messages used to create the training database, the
"spam" count divided by the spam message count, Robinson's f(w) for the
token, the natural logs of (1 - f(w)) and f(w), and an indicator character
(+ if the token's f(w) value exceeded the minimum deviation from 0.5, - if
it didn't). There is one additional row at the end of the table that
contains a label in the token field, followed by the number of words
actually used (the ones with + indicators), Robinson's P, Q, S, s and x
values and the minimum deviation.
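The per-token row can be sketched in Python as follows. The function token_row is hypothetical, and the formulas for p(w) and f(w) are taken from Robinson's paper as an assumption about what the columns contain; the sketch assumes robs > 0 and 0 < robx < 1, and it does not reproduce bogofilter's exact output format.

```python
import math

def token_row(token, good_count, spam_count, good_msgs, spam_msgs,
              robs, robx, min_dev):
    """Build one data-frame row for a token (illustrative only)."""
    n = good_count + spam_count              # sum of the database counts
    pg = good_count / good_msgs              # "good" count per non-spam message
    pb = spam_count / spam_msgs              # "spam" count per spam message
    # Robinson's p(w); irrelevant when n == 0, so fall back to robx.
    p = pb / (pb + pg) if (pb + pg) > 0 else robx
    # Robinson's f(w): p(w) shrunk towards robx with strength robs.
    fw = (robs * robx + n * p) / (robs + n)
    indicator = "+" if abs(fw - 0.5) > min_dev else "-"
    return (token, n, pg, pb, fw, math.log(1.0 - fw), math.log(fw), indicator)
```

A token never seen in training gets f(w) equal to robx, so with robx at 0.5 it is marked "-" and excluded from the score.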
The R data frame can be saved to a file and later read into an R session
(see the R project website: http://cran.r-project.org for information about
the mathematics package R). Provided with the bogofilter distribution is a
simple R script (file bogo.R) that can be used to verify bogofilter's
calculations. Instructions for its use are included in the script in the
form of comments.
LOG MESSAGES
Bogofilter writes messages to the system log when the -l option is used.
What is written depends on which other flags are used.
A classification run will generate (we are not showing the date and host
part here):
bogofilter[1412]: X-Bogosity: No, spamicity=0.000227
bogofilter[1415]: X-Bogosity: Yes, spamicity=0.998918
Using '-u' to classify a message and update a wordlist will produce (on a
single line):
bogofilter[1426]: X-Bogosity: Yes, spamicity=0.998918,
register -s, 329 words, 1 messages
Registering words ('-l' and '-s', '-n', '-S', or '-N') will produce:
bogofilter[1440]: register-n, 255 words, 1 messages
A registration run (using '-s', '-n', '-N', or '-S') will generate messages
like:
bogofilter[17330]: register-n, 574 words, 3 messages
bogofilter[6244]: register-s, 1273 words, 4 messages
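Classification lines of the shape shown above can be picked apart with a regular expression. The following Python sketch assumes the default message format illustrated in this section (the format can be customized per site, in which case the pattern would need adjusting); the names LOG_RE and parse_classification are hypothetical.

```python
import re

# Matches e.g. "bogofilter[1415]: X-Bogosity: Yes, spamicity=0.998918"
LOG_RE = re.compile(
    r"bogofilter\[(?P<pid>\d+)\]: X-Bogosity: (?P<verdict>\w+), "
    r"spamicity=(?P<score>\d+(?:\.\d+)?)"
)

def parse_classification(line):
    """Return (pid, verdict, spamicity) or None if the line does not match."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    return int(m["pid"]), m["verdict"], float(m["score"])
```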
FILES
@sysconfdir@/bogofilter.cf
System configuration file.
~/.bogofilter.cf
User configuration file.
~/.bogofilter/goodlist.db
List of good tokens.
~/.bogofilter/spamlist.db
List of spam tokens.
~/.bogofilter/wordlist.db
Combined list of good and spam tokens.
BUGS
bogofilter counts messages on input by looking for "From " lines. As a
special case, a single message without "From " line is counted correctly.
Multiple messages without intervening "From " lines will be counted as one
message.
Bogofilter does not canonicalize the transport encoding or character set,
sacrificing precision. We used to believe that spam with enclosures
invariably gives itself away through cues in the headers and non-enclosure
parts, but this is not true. This will be fixed in a future version.
AUTHOR
Eric S. Raymond <esr@thyrsus.com>.
For updates, see the bogofilter project page:
http://bogofilter.sourceforge.net/.