This is what I use to filter spam. See the comments for an explanation of how it works and where the algorithm comes from. An earlier version of this filter, after years of training, produced about one false positive and one false negative per week. The current version, which has trained over a shorter interval during which spammers have worked harder to disguise their mailings, has not done as well. I am not confident enough in this filter to stop checking the identified spam for real mail, but it does a good enough job of separating out the spam that I only have to look at it once in a while.

However, this is dangerous experimental software that may have arbitrarily bad effects on your system and/or email. It is provided free of charge with no warranty of any kind. Use at your own risk.
If you use this, you will need to change dbfilename. You will also need to configure procmail or some similar system to call the program appropriately. When training using winnow 0 or winnow 1, be sure to include all headers as received by procmail; otherwise you may get odd results since the training set won't match the message stream. I generally call the program from pine by using the pipe command with the ^W (raw message text) option. For bulk training you can use pine's apply-pipe command with ^W (raw message text), ^T (new pipe for each message), and ^Y (don't capture output) options. Note that unlike Bayesian filters, this learning algorithm is not affected by how many times it sees the same message, and ignores any message that it correctly classifies (so you can run it on your entire inbox twice a day, but it probably won't improve its classifications unless it has been making mistakes).
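The mistake-driven behavior described above is visible in a stripped-down version of the update rule. A sketch in current Python (the real filter below stores the per-token exponents in a shelve database; the token names here are made up):

```python
def classify(w, tokens):
    # net vote: a token at exponent x contributes roughly 2^x - 2^-x,
    # approximated as +2^x or -2^|x| depending on the sign of x
    s = 0
    for t in tokens:
        x = w.get(t, 0)
        if x > 0:
            s += 1 << x
        elif x < 0:
            s -= 1 << -x
    return s > 0

def learn(w, tokens, is_spam):
    # conservative update: weights change only when the guess is wrong
    if classify(w, tokens) != is_spam:
        delta = 1 if is_spam else -1
        for t in tokens:
            w[t] = w.get(t, 0) + delta

w = {}
msg = ["cheap", "pills"]
learn(w, msg, True)          # first pass misclassifies, so exponents move
trained = dict(w)
for _ in range(10):
    learn(w, msg, True)      # already classified correctly: nothing changes
assert w == trained
```

This is why retraining on a whole inbox twice a day is harmless: correctly classified messages are no-ops, and duplicates don't pile up extra weight the way repeated observations do in a Bayesian counter.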
Download the source: winnow.py. Or see below.
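One detail worth calling out before the listing: any line that looks like base64 gets decoded and tokenized as plaintext, so words hidden inside encoded message parts still count as features. A rough re-creation of that heuristic in current Python (the two regexes are copied from the source; the decode step is my addition, since `a2b_base64` returns bytes in Python 3):

```python
import binascii
import re

# from winnow.py: a token is 3 to 50 characters from this class; a
# base64-looking line is 10 to 19 groups of 4 base64-alphabet characters
token_re = re.compile(r"\b[-a-zA-Z0-9_+./@'!]{3,50}\b")
base64_re = re.compile(r'^(?:[a-zA-Z0-9/+]{4}){10,19}$')

def tokens(text):
    """Collect lowercased tokens, decoding base64-looking lines first."""
    found = set()
    for line in text.split('\n'):
        if base64_re.match(line):
            # decode and tokenize the hidden plaintext instead of the line
            line = binascii.a2b_base64(line).decode('latin-1')
        found.update(t.lower() for t in token_re.findall(line))
    return found
```

For instance, a 30-byte spam phrase encodes to a 40-character, padding-free base64 line, which gets decoded and mined for tokens just like ordinary text.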
#!/usr/bin/python
"""Unwanted mail detector based on Littlestone's Winnow algorithm as adapted
by Blum in

A. Blum. Empirical support for winnow and weighted-majority based algorithms:
results on a calendar scheduling domain. In Proceedings of the Twelfth
International Conference on Machine Learning, pages 64--72, July 1995.
(http://citeseer.ist.psu.edu/blum97empirical.html)

The feature vector for a text is just a list of all words (loosely defined),
including those found after decoding anything that looks like a line of base64
encoding. Each word found corresponds to two specialists, one that votes
spam if it sees the word and one that votes not spam. To simplify the data
kept around, weights are always adjusted by 1/2 or 2, and Blum's enhancement
of Winnow that reduces the weight of a specialist whenever it votes wrong
(even if the majority votes right) is not used.

Usage:

winnow < text
winnow c [dbfile] < text

    Exits with exit code 0 if the text matches the filter (i.e., is unwanted)
    and with exit code 1 if it doesn't (i.e., is good). This allows
    failsafe operation---mail is presumed good if winnow doesn't work.

winnow 0 [dbfile] < text

    As above, but update weights assuming the text is a negative example
    of unwanted mail (i.e., assume the text is good).

winnow 1 [dbfile] < text

    As above, but assume the text is unwanted.

winnow e [dbfile] < text

    As above, but only print an explanation of the classification and don't
    update the database.

If dbfile is not given on the command line, the database is assumed to be in
the first of these locations that is defined:

1. $WINNOWDB
2. $HOME/.winnow.db
3. /homes/theory/aspnes/.winnow.db

Some notes on the voting:

We'd like to use the version of Winnow described in Avrim Blum's paper,
where we have classifiers that either vote or abstain and we update by
factors of 2 on errors. But we don't really want to keep around separate
+foo and -foo values.

So what weights do we get on +foo and -foo?

If +foo is correct, its weight is doubled and -foo's is halved; the
reverse occurs if +foo is incorrect. So we always have

    wt(+foo) = 1/wt(-foo) = 2^(+foo correct - +foo incorrect)

and the net vote when we see foo is

    2^x - 2^-x

where x is the net number of times foo has classified
correctly/incorrectly on error.

Since we'd like to use (long) integer arithmetic to avoid round-off errors,
we'll approximate

    2^x - 2^-x ~= 2^x for x > 0, 0 for x = 0, -2^-x for x < 0.
"""

import string, re, binascii

class Vector:
    """Class of vectors with default values for unspecified coordinates."""
    def __init__(self, default = 0, store = None):
        if store is not None:
            self._vector = store
        else:
            self._vector = {}
        self._default = default
    def __getitem__(self, index):
        if self._vector.has_key(index):
            return self._vector[index]
        else:
            return self._default
    def __setitem__(self, index, value):
        self._vector[index] = value
    def keys(self):
        return self._vector.keys()
    def has_key(self, index):
        return self._vector.has_key(index)
    def __len__(self):
        return len(self._vector)
    def __mul__(self, other):
        """Dot product. Assumes missing entries in other are zero."""
        sum = 0
        for i in other.keys():
            sum = sum + self[i] * other[i]
        return sum

class Winnow:
    """Littlestone's Winnow learning algorithm."""
    def __init__(self, store = None):
        self._w = Vector(0, store)
    def classify(self, x):
        """Return 1 if x is a positive example, 0 otherwise."""
        sum = 0L
        for i in x.keys():
            wt = self._w[i]
            if wt > 0:
                sum = sum + (1L << wt)
            elif wt < 0:
                sum = sum - (1L << -wt)
        return sum > 0
    def explain(self, x):
        """Return a sorted list of (weight, token) pairs for every token
        in x that has a recorded weight."""
        ret = []
        for i in x.keys():
            if self._w.has_key(i):
                ret.append((self._w[i], i))
        ret.sort()
        return ret
    def learn(self, x, ispositive):
        """Update weights based on example x; weights change only if x is
        currently misclassified."""
        guess = self.classify(x)
        correct = not not ispositive
        if guess != correct:
            # oops: promote or demote every specialist that voted
            if ispositive:
                delta = +1
            else:
                delta = -1
            for i in x.keys():
                if x[i]:
                    self._w[i] = self._w[i] + delta
        return guess

class TokenVector(Vector):
    """Vector representing occurrence of tokens in text."""

    # a token is 3 to 50 alphanumerics or various additional characters
    token_re = re.compile(r"\b[-a-zA-Z0-9_+./@'!]{3,50}\b")

    def __init__(self):
        Vector.__init__(self, 0)
    def add_text(self, text):
        for token in self.token_re.findall(text):
            token = string.lower(token)
            self[token] = 1

class MailTokenVector(TokenVector):
    # a plausible base64 line: 10 to 19 groups of 4 base64 characters
    base64_re = re.compile(r'^(?:[a-zA-Z0-9/+]{4}){10,19}$', re.MULTILINE)

    def add_text(self, text):
        # look for embedded base64 and tokenize the decoded plaintext
        for line in string.split(text, '\n'):
            if self.base64_re.match(line):
                TokenVector.add_text(self, binascii.a2b_base64(line))
            else:
                TokenVector.add_text(self, line)

if __name__ == '__main__':
    import sys, shelve, os

    dbfilename = os.getenv('WINNOWDB') or ((os.getenv('HOME') or '/homes/theory/aspnes') + '/.winnow.db')

    # max number of bytes to process
    # this may reduce accuracy but avoids problems with huge viruses
    size_limit = 100000

    # allow override
    if len(sys.argv) > 2:
        dbfilename = sys.argv[2]

    # avoids corruption over NFS
    lockfilename = '%s.lock' % (dbfilename,)
    locktimeout = 30
    os.system("/usr/bin/lockfile -l %d '%s'" % (locktimeout, lockfilename))

    store = shelve.open(dbfilename)
    classifier = Winnow(store = store)

    # read in everything to avoid broken pipes
    # we'll cut off size_limit bytes below
    text = sys.stdin.read()

    t = MailTokenVector()
    t.add_text(text[:size_limit])
    print "Tokens:", len(t)

    if len(sys.argv) <= 1 or sys.argv[1] in ('classify', 'c'):
        result = classifier.classify(t)
    elif sys.argv[1] in ('0', '1'):
        result = classifier.learn(t, int(sys.argv[1]))
    elif sys.argv[1] in ('explain', 'e'):
        scores = classifier.explain(t)
        for score, token in scores:
            print '%6d' % score, token
        result = classifier.classify(t)
    else:
        print "Unknown option", sys.argv[1]
        result = 0

    print "Reject", result

    # close the store and release the lock
    store.close()
    os.remove(lockfilename)

    # exit normally if flagged
    # this allows procmail to treat failure to find the script as not flagging
    sys.exit(not result)
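The integer approximation discussed in the docstring is easy to sanity-check. A small sketch (current Python, not part of the filter) comparing the exact specialist-pair vote 2^x - 2^-x against the shifted integer the filter actually adds:

```python
def exact_vote(x):
    # combined vote of the +foo/-foo specialist pair at net score x
    return 2.0 ** x - 2.0 ** (-x)

def integer_vote(x):
    # the filter's approximation: a single shift, with the sign taken from x
    if x > 0:
        return 1 << x
    if x < 0:
        return -(1 << -x)
    return 0

# the approximation overshoots by exactly 2^-|x|, so the relative error
# is 1/(4^|x| - 1): one third at |x| = 1, but already under 2% at |x| = 3
for x in range(1, 10):
    assert integer_vote(x) - exact_vote(x) == 2.0 ** (-x)
    assert integer_vote(-x) == -integer_vote(x)
assert integer_vote(0) == 0
```

Since a token's exponent only moves on mistakes, tokens that matter end up with large |x| quickly, where the approximation is essentially exact and the arithmetic stays in (long) integers.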