This is what I use to filter spam. See the comments for an explanation of how it works and where the algorithm comes from. An earlier version of this filter, after years of training, produced about one false positive and one false negative per week. The current version, which has trained over a shorter interval during which spammers have worked harder to disguise their mailings, has not done as well. I am not confident enough in this filter to stop checking the identified spam for real mail, but it does a good enough job of separating out the spam that I only have to look at it once in a while.

However, this is dangerous experimental software that may have arbitrarily bad effects on your system and/or email. It is provided free of charge with no warranty of any kind. Use at your own risk.
If you use this, you will need to change dbfilename. You will also need to configure procmail or some similar system to call the program appropriately. When training using winnow 0 or winnow 1, be sure to include all headers as received by procmail; otherwise you may get odd results since the training set won't match the message stream. I generally call the program from pine by using the pipe command with the ^W (raw message text) option. For bulk training you can use pine's apply-pipe command with ^W (raw message text), ^T (new pipe for each message), and ^Y (don't capture output) options. Note that unlike Bayesian filters, this learning algorithm is not affected by how many times it sees the same message, and ignores any message that it correctly classifies (so you can run it on your entire inbox twice a day, but it probably won't improve its classifications unless it has been making mistakes).
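The mistake-driven behavior described above is visible in a stripped-down version of the update rule. A sketch in current Python (the real filter below stores the per-token exponents in a shelve database; the token names here are made up):

```python
def classify(w, tokens):
    # net vote: a token at exponent x contributes roughly 2^x - 2^-x,
    # approximated as +2^x or -2^|x| depending on the sign of x
    s = 0
    for t in tokens:
        x = w.get(t, 0)
        if x > 0:
            s += 1 << x
        elif x < 0:
            s -= 1 << -x
    return s > 0

def learn(w, tokens, is_spam):
    # conservative update: weights change only when the guess is wrong
    if classify(w, tokens) != is_spam:
        delta = 1 if is_spam else -1
        for t in tokens:
            w[t] = w.get(t, 0) + delta

w = {}
msg = ["cheap", "pills"]
learn(w, msg, True)          # first pass misclassifies, so exponents move
trained = dict(w)
for _ in range(10):
    learn(w, msg, True)      # already classified correctly: nothing changes
assert w == trained
```

This is why retraining on a whole inbox twice a day is harmless: correctly classified messages are no-ops, and duplicates don't pile up extra weight the way repeated observations do in a Bayesian counter.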
Download the source: winnow.py. Or see below.
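One detail worth calling out before the listing: any line that looks like base64 gets decoded and tokenized as plaintext, so words hidden inside encoded message parts still count as features. A rough re-creation of that heuristic in current Python (the two regexes are copied from the source; the decode step is my addition, since `a2b_base64` returns bytes in Python 3):

```python
import binascii
import re

# from winnow.py: a token is 3 to 50 characters from this class; a
# base64-looking line is 10 to 19 groups of 4 base64-alphabet characters
token_re = re.compile(r"\b[-a-zA-Z0-9_+./@'!]{3,50}\b")
base64_re = re.compile(r'^(?:[a-zA-Z0-9/+]{4}){10,19}$')

def tokens(text):
    """Collect lowercased tokens, decoding base64-looking lines first."""
    found = set()
    for line in text.split('\n'):
        if base64_re.match(line):
            # decode and tokenize the hidden plaintext instead of the line
            line = binascii.a2b_base64(line).decode('latin-1')
        found.update(t.lower() for t in token_re.findall(line))
    return found
```

For instance, a 30-byte spam phrase encodes to a 40-character, padding-free base64 line, which gets decoded and mined for tokens just like ordinary text.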
#!/usr/bin/python
"""Unwanted mail detector based on Littlestone's Winnow algorithm as adapted
by Blum in

A. Blum. Empirical support for winnow and weighted-majority based algorithms:
results on a calendar scheduling domain. In Proceedings of the Twelfth
International Conference on Machine Learning, pages 64--72, July 1995.
(http://citeseer.ist.psu.edu/blum97empirical.html)

The feature vector for a text is just a list of all words (loosely defined),
including those found after decoding anything that looks like a line of base64
encoding. Each word found corresponds to two specialists, one that votes
spam if it sees the word and one that votes not spam. To simplify the data
kept around, weights are always adjusted by 1/2 or 2, and Blum's enhancement
of Winnow that reduces the weight of a specialist whenever it votes wrong
(even if the majority votes right) is not used.

Usage:

winnow < text
winnow c [dbfile] < text

    Exits with exit code 0 if the text matches the filter (i.e., is unwanted)
    and with exit code 1 if it doesn't (i.e., is good). This allows
    failsafe operation---mail is presumed good if winnow doesn't work.

winnow 0 [dbfile] < text

    As above, but update weights assuming the text is a negative example
    of unwanted mail (i.e., assume the text is good).

winnow 1 [dbfile] < text

    As above, but assume the text is unwanted.

winnow e [dbfile] < text

    As above, but only print an explanation of the classification and don't
    update the database.

If dbfile is not given on the command line, the database is assumed to be in
the first of these locations that is defined:

1. $WINNOWDB
2. $HOME/.winnow.db
3. /homes/theory/aspnes/.winnow.db

Some notes on the voting:

We'd like to use the version of Winnow described in Avrim Blum's paper,
where we have classifiers that either vote or abstain and we update by
factors of 2 on errors. But we don't really want to keep around separate
+foo and -foo values.

So what weights do we get on +foo and -foo?

If +foo is correct, its weight is doubled and -foo's is halved; the
reverse occurs if +foo is incorrect. So we always have

    wt(+foo) = 1/wt(-foo) = 2^(+foo correct - +foo incorrect)

and the net vote when we see foo is

    2^x - 2^-x

where x is the net number of times foo has classified
correctly/incorrectly on error.

Since we'd like to use (long) integer arithmetic to avoid round-off errors,
we'll approximate

    2^x - 2^-x ~= 2^x for x > 0, 0 for x = 0, -2^-x for x < 0.
"""

import string, re, binascii

class Vector:
    """Class of vectors with default values for unspecified coordinates."""
    def __init__(self, default = 0, store = None):
        if store is not None:
            self._vector = store
        else:
            self._vector = {}
        self._default = default
    def __getitem__(self, index):
        if self._vector.has_key(index):
            return self._vector[index]
        else:
            return self._default
    def __setitem__(self, index, value):
        self._vector[index] = value
    def keys(self):
        return self._vector.keys()
    def has_key(self, index):
        return self._vector.has_key(index)
    def __len__(self):
        return len(self._vector)
    def __mul__(self, other):
        """Dot product. Assumes missing entries in other are zero."""
        sum = 0
        for i in other.keys():
            sum = sum + self[i] * other[i]
        return sum

class Winnow:
    """Littlestone's Winnow learning algorithm."""
    def __init__(self, store = None):
        self._w = Vector(0, store)
    def classify(self, x):
        """Return 1 if x is a positive example, 0 otherwise."""
        sum = 0L
        for i in x.keys():
            wt = self._w[i]
            if wt > 0:
                sum = sum + (1L << wt)
            elif wt < 0:
                sum = sum - (1L << -wt)
        return sum > 0
    def explain(self, x):
        """Return a sorted list of (weight, token) pairs for every token
        in x that has a recorded weight."""
        ret = []
        for i in x.keys():
            if self._w.has_key(i):
                ret.append((self._w[i], i))
        ret.sort()
        return ret
    def learn(self, x, ispositive):
        """Update weights based on example x; weights change only if x is
        currently misclassified."""
        guess = self.classify(x)
        correct = not not ispositive
        if guess != correct:
            # oops: promote or demote every specialist that voted
            if ispositive:
                delta = +1
            else:
                delta = -1
            for i in x.keys():
                if x[i]:
                    self._w[i] = self._w[i] + delta
        return guess

class TokenVector(Vector):
    """Vector representing occurrence of tokens in text."""

    # a token is 3 to 50 alphanumerics or various additional characters
    token_re = re.compile(r"\b[-a-zA-Z0-9_+./@'!]{3,50}\b")

    def __init__(self):
        Vector.__init__(self, 0)
    def add_text(self, text):
        for token in self.token_re.findall(text):
            token = string.lower(token)
            self[token] = 1

class MailTokenVector(TokenVector):
    # a plausible base64 line: 10 to 19 groups of 4 base64 characters
    base64_re = re.compile(r'^(?:[a-zA-Z0-9/+]{4}){10,19}$', re.MULTILINE)

    def add_text(self, text):
        # look for embedded base64 and tokenize the decoded plaintext
        for line in string.split(text, '\n'):
            if self.base64_re.match(line):
                TokenVector.add_text(self, binascii.a2b_base64(line))
            else:
                TokenVector.add_text(self, line)

if __name__ == '__main__':
    import sys, shelve, os

    dbfilename = os.getenv('WINNOWDB') or ((os.getenv('HOME') or '/homes/theory/aspnes') + '/.winnow.db')

    # max number of bytes to process
    # this may reduce accuracy but avoids problems with huge viruses
    size_limit = 100000

    # allow override
    if len(sys.argv) > 2:
        dbfilename = sys.argv[2]

    # avoids corruption over NFS
    lockfilename = '%s.lock' % (dbfilename,)
    locktimeout = 30
    os.system("/usr/bin/lockfile -l %d '%s'" % (locktimeout, lockfilename))

    store = shelve.open(dbfilename)
    classifier = Winnow(store = store)

    # read in everything to avoid broken pipes
    # we'll cut off size_limit bytes below
    text = sys.stdin.read()

    t = MailTokenVector()
    t.add_text(text[:size_limit])
    print "Tokens:", len(t)

    if len(sys.argv) <= 1 or sys.argv[1] in ('classify', 'c'):
        result = classifier.classify(t)
    elif sys.argv[1] in ('0', '1'):
        result = classifier.learn(t, int(sys.argv[1]))
    elif sys.argv[1] in ('explain', 'e'):
        scores = classifier.explain(t)
        for score, token in scores:
            print '%6d' % score, token
        result = classifier.classify(t)
    else:
        print "Unknown option", sys.argv[1]
        result = 0

    print "Reject", result

    # close the store and release the lock
    store.close()
    os.remove(lockfilename)

    # exit normally if flagged
    # this allows procmail to treat failure to find the script as not flagging
    sys.exit(not result)
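The integer approximation discussed in the docstring is easy to sanity-check. A small sketch (current Python, not part of the filter) comparing the exact specialist-pair vote 2^x - 2^-x against the shifted integer the filter actually adds:

```python
def exact_vote(x):
    # combined vote of the +foo/-foo specialist pair at net score x
    return 2.0 ** x - 2.0 ** (-x)

def integer_vote(x):
    # the filter's approximation: a single shift, with the sign taken from x
    if x > 0:
        return 1 << x
    if x < 0:
        return -(1 << -x)
    return 0

# the approximation overshoots by exactly 2^-|x|, so the relative error
# is 1/(4^|x| - 1): one third at |x| = 1, but already under 2% at |x| = 3
for x in range(1, 10):
    assert integer_vote(x) - exact_vote(x) == 2.0 ** (-x)
    assert integer_vote(-x) == -integer_vote(x)
assert integer_vote(0) == 0
```

Since a token's exponent only moves on mistakes, tokens that matter end up with large |x| quickly, where the approximation is essentially exact and the arithmetic stays in (long) integers.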