SMS Spam Collection Dataset

A dataset of 5572 text messages, to be classified as spam or ham (i.e. not spam).

See: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset/data

Our approach is to implement a Naive Bayes classifier, using the training set to estimate the a priori probability of a text being spam or ham, as well as the probability of particular words appearing in spam/ham texts. From there, Bayes' rule gives us the probability of a message being spam or not.

Imports
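
A minimal sketch of the stack the snippets below assume (numpy and pandas; the actual notebook may pull in more):

```python
import re

import numpy as np
import pandas as pd
```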

Read the dataset and do some preprocessing
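
A sketch of the loading step, assuming the Kaggle file layout (spam.csv, latin-1 encoded, label in column v1 and text in column v2):

```python
# Assumed Kaggle layout; the extra unnamed columns in the CSV are dropped.
df = pd.read_csv("spam.csv", encoding="latin-1")[["v1", "v2"]]
df.columns = ["label", "text"]
df["spam"] = (df["label"] == "spam").astype(int)  # 1 = spam, 0 = ham
```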

Split into train/test sets.

75% train, 25% test.
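
One way to do it, as a sketch (a fixed seed keeps the split reproducible; sklearn's train_test_split would be an equivalent one-liner):

```python
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)
cut = int(0.75 * len(shuffled))          # 75% train
train, test = shuffled.iloc[:cut], shuffled.iloc[cut:]
```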

Feature selection and extraction.

We use one type of feature only: boolean values representing whether particular strings are in the text. Each of the following substrings corresponds to a single boolean feature:
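
As a purely illustrative stand-in for that list (these particular substrings are hypothetical, apart from the 8+-digit pattern, which the analysis at the end refers to):

```python
# Hypothetical word features -- common spam markers, not the actual list.
WORD_FEATURES = ["free", "win", "winner", "cash", "prize", "urgent",
                 "claim", "contact", "txt", "call"]

# A non-word feature: fires on runs of 8 or more digits (phone numbers).
LONG_NUMBER = re.compile(r"\d{8,}")
```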

This class abstracts over which feature gets mapped to which index in the features vector. Every feature has a label; for words we can use the word itself, and we can give special names to non-words.
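
A sketch of such a class (the name FeatureIndex and its shape are assumptions):

```python
class FeatureIndex:
    """Maps each feature label to a fixed position in the feature vector."""

    def __init__(self, word_features):
        # Words are labelled by the word itself; the digit-run feature
        # gets the special name "8+digits".
        self.labels = list(word_features) + ["8+digits"]
        self.index = {label: i for i, label in enumerate(self.labels)}

    def __len__(self):
        return len(self.labels)
```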

Actually computing the feature vector.
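
One way to compute it, given the sketched FeatureIndex:

```python
def extract_features(text, fi):
    """Boolean vector: x[i] = 1 iff feature i's substring/pattern occurs."""
    x = np.zeros(len(fi), dtype=int)
    for word in WORD_FEATURES:
        if word in text.lower():
            x[fi.index[word]] = 1
    if LONG_NUMBER.search(text):
        x[fi.index["8+digits"]] = 1
    return x
```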

Testing out the feature extraction...
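
For instance, with the hypothetical features above:

```python
fi = FeatureIndex(WORD_FEATURES)
x = extract_features("FREE entry! Call 08001234567 to claim", fi)
print([fi.labels[i] for i in np.flatnonzero(x)])
# ['free', 'claim', 'call', '8+digits']
```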

Naive Bayes implementation.

Here's the math. If y is the label (1 if spam and 0 if not) and x is the vector of features, then the probability of something being spam is

P(y=1|x) = P(x,y=1)/P(x)
         = P(y=1)P(x|y=1)/(P(y=1)P(x|y=1) + P(y=0)P(x|y=0))

P(y) can be estimated by the proportion of spam in the training set.

Note that, by the "naive" conditional-independence assumption, P(x|y=1) ~= \prod_{i=1}^D P(x_i|y=1), where D is the number of features. For each word i, P(x_i|y=1) can be estimated as the rate of occurrence of that word among the spam texts.

Laplace smoothing is used in the estimate of P(x_i|y=1). Counting within the spam class, the estimate is:

(#TextsContainingWord+1)/(#Texts+2)

The +1 ensures that, if #TextsContainingWord is 0, the estimated probability isn't 0 (the word could just be very rare). Similarly, if a word occurs in every single text, the +2 ensures that the probability isn't 1.
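
A sketch of a classifier implementing the math above (the class shape and the log-space computation are my choices, not necessarily the notebook's):

```python
class NaiveBayes:
    """Bernoulli Naive Bayes with Laplace smoothing, per the math above."""

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.p_spam = y.mean()  # P(y=1), the a priori spam probability
        n1 = y.sum()
        n0 = len(y) - n1
        # Laplace-smoothed occurrence rate of each feature in each class:
        # (#texts in class containing the word + 1) / (#texts in class + 2)
        self.theta1 = (X[y == 1].sum(axis=0) + 1) / (n1 + 2)
        self.theta0 = (X[y == 0].sum(axis=0) + 1) / (n0 + 2)
        return self

    def predict_proba(self, x):
        """P(y=1|x) for a single boolean feature vector x."""
        x = np.asarray(x)
        # Sum logs instead of multiplying probabilities, to avoid underflow.
        log1 = np.log(self.p_spam) + np.where(
            x == 1, np.log(self.theta1), np.log(1 - self.theta1)).sum()
        log0 = np.log(1 - self.p_spam) + np.where(
            x == 1, np.log(self.theta0), np.log(1 - self.theta0)).sum()
        m = max(log1, log0)  # shift before exponentiating, for stability
        e1, e0 = np.exp(log1 - m), np.exp(log0 - m)
        return e1 / (e1 + e0)

    def predict(self, x):
        return int(self.predict_proba(x) >= 0.5)
```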

Testing out on a sample case. I manually verified the results. It's not the ideal test case because the a priori probabilities are equal.
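
A stand-in case with the same property (equal priors: 2 spam, 2 ham), small enough to check against the formulas by hand:

```python
X_toy = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
y_toy = np.array([1, 1, 0, 0])
nb_toy = NaiveBayes().fit(X_toy, y_toy)
print(nb_toy.predict_proba(np.array([1, 0])))  # 0.75, checkable by hand
```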

Bringing it all together and evaluating on the test set.
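
Putting the sketches together, the end-to-end run could look like:

```python
fi = FeatureIndex(WORD_FEATURES)
X_train = np.array([extract_features(t, fi) for t in train["text"]])
X_test = np.array([extract_features(t, fi) for t in test["text"]])

nb = NaiveBayes().fit(X_train, train["spam"].to_numpy())
preds = np.array([nb.predict(x) for x in X_test])
accuracy = (preds == test["spam"].to_numpy()).mean()
print(f"test accuracy: {accuracy:.3f}")
```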

Success! An accuracy of almost 99% on the test set.

Analysis of results.

A priori, about 13.5% of the texts are spam. Declaring everything to be not-spam would already give us 86.5% accuracy, so we're doing better than that baseline, though not by as much as the raw 99% figure might suggest.

Precision (fraction of texts classified as spam that are actually spam), recall (fraction of actual spam that is correctly identified), and confusion matrix.
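
These follow directly from the test predictions (sklearn.metrics would give the same numbers; this sketch stays with numpy):

```python
y_true = test["spam"].to_numpy()
tp = int(((preds == 1) & (y_true == 1)).sum())  # spam caught
fp = int(((preds == 1) & (y_true == 0)).sum())  # ham flagged as spam
fn = int(((preds == 0) & (y_true == 1)).sum())  # spam missed
tn = int(((preds == 0) & (y_true == 0)).sum())  # ham passed through

precision = tp / (tp + fp)
recall = tp / (tp + fn)
confusion = np.array([[tn, fp],
                      [fn, tp]])
```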

13 spam texts are incorrectly classified as ham, 2 ham texts are incorrectly classified as spam.

Taking a look at the misclassified texts. One of the false positives (ham incorrectly classified as spam) is a chain text (thanks to ChatGPT for helping me remember that term) and could arguably be considered spam. The other is tricky because it has characteristics of spam, e.g. the word "contact".

On the false negative side, i.e. undetected spam, some of the texts have misspellings that probably help them evade detection ("wining"); we could use basic spelling correction to counter this. Some of the spam has URLs or other types of links, so a feature for those would be helpful. One text splits up the phone number - "0800 505060" - which then doesn't trigger our "8+-digit" feature.

Retroactively going back and improving the feature selection would risk overfitting to this dataset. If we had been more careful, we could have tried the model on a held-out subset of the texts to figure out what sort of spam evades detection, and improved the feature engineering from there. That said, the dataset may not be large enough for that.

Seeing the prediction for texts that consist of a single word. This can tell us which are the "most spammy" words, according to the model.
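
Since only known substrings affect the prediction, a single-word text reduces to a feature vector with at most one entry set, so we can just score each feature on its own and rank:

```python
scores = []
for label in fi.labels:
    x = np.zeros(len(fi), dtype=int)
    x[fi.index[label]] = 1        # a "text" containing only this feature
    scores.append((nb.predict_proba(x), label))
for p, label in sorted(scores, reverse=True):
    print(f"{p:.3f}  {label}")
```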

To explain the strangely low probabilities for single-word texts, we look at P(0^D|y), where we have D features and 0^D is the all-zero vector. This shows us that spam messages are unlikely to have no known words, perhaps because they tend to be longer. A clever spammer could craft a short spam text, which would be harder to detect.
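
Under the model, P(0^D|y) is just the product of (1 - P(x_i=1|y)) over all features; as a sketch:

```python
# Computed in log space, consistent with predict_proba above.
p_zero_spam = np.exp(np.log(1 - nb.theta1).sum())
p_zero_ham = np.exp(np.log(1 - nb.theta0).sum())
print(p_zero_spam, p_zero_ham)  # expect the spam value to be much smaller
```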

Confirming this hunch -- spam messages are longer and tend to have more features present.
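
One way to check, with the pipeline sketched above:

```python
# Average raw length and average number of active features, per class.
print(df.groupby("spam")["text"].apply(lambda s: s.str.len().mean()))
n_feats = pd.Series([extract_features(t, fi).sum() for t in df["text"]])
print(n_feats.groupby(df["spam"]).mean())
```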

So, how can a spammer circumvent the filter? Use a thesaurus to avoid spam-indicative words, and favour words that indicate ham instead. Use misspellings: "cal1", "s3x", "one hundred" instead of "100". And keep messages short!

Future work.