Data Science & Machine Learning Portfolio

These are data science and machine learning projects I've worked on in my spare time. A description of each project, along with links to the Jupyter notebooks, can be found below.


Spam detection with Naive Bayes

Confusion matrix consisting of 4 panels: 1199 true negatives (not spam, classified as not spam), 176 true positives (spam, classified as spam), 13 false negatives and 5 false positives. This differs from the
Confusion matrix showing the number of true positives (spam classified as spam), false positives (ham classified as spam), true negatives (ham classified as ham) and false negatives (spam classified as ham).

An implementation of the Naive Bayes classifier that achieves almost 99% test set accuracy on Kaggle's SMS Spam Collection Dataset. The classifier also achieves high precision (98.8%) and recall (92.7%). The features are based on the words used, as well as the presence of money symbols and numbers of various lengths.
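
As a rough illustration of the idea (not the notebook's actual code), here is a minimal multinomial Naive Bayes sketch in plain Python. The tokeniser, the `<MONEY>` and `<NUM…>` feature tokens, the Laplace smoothing constant and the four toy messages are all assumptions made up for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Word tokens plus hypothetical tokens for money symbols and number lengths."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if re.search(r"[$£€]", text):
        tokens.append("<MONEY>")
    for number in re.findall(r"\d+", text):
        # Bucket numbers by digit count, capped at 5 (an arbitrary choice).
        tokens.append(f"<NUM{min(len(number), 5)}>")
    return tokens


def train(messages, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods per class."""
    class_counts = Counter(labels)
    token_counts = {c: Counter() for c in class_counts}
    for text, label in zip(messages, labels):
        token_counts[label].update(tokenize(text))
    vocab = set().union(*token_counts.values())

    priors, loglik = {}, {}
    for c in class_counts:
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        priors[c] = math.log(class_counts[c] / len(labels))
        loglik[c] = {t: math.log((token_counts[c][t] + alpha) / total) for t in vocab}
    return priors, loglik


def predict(model, text):
    """Return the class with the highest posterior log score."""
    priors, loglik = model
    scores = {}
    for c in priors:
        # Tokens never seen in training are simply ignored.
        scores[c] = priors[c] + sum(loglik[c][t] for t in tokenize(text) if t in loglik[c])
    return max(scores, key=scores.get)


# Toy training set, invented purely for illustration.
messages = [
    "Win £1000 cash now, call 08001234567",
    "Free prize! Text WIN to 80082 to claim $500",
    "Are we still meeting for lunch today?",
    "I'll call you when I get home",
]
labels = ["spam", "spam", "ham", "ham"]
model = train(messages, labels)
```

Calling `predict(model, "Claim your free £200 prize now")` scores the message under each class and returns the more likely label. Working in log space avoids numerical underflow when many small probabilities are multiplied.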


Reinforcement learning for navigation

3-by-4 grid with arrows in each cell indicating the direction taken by the optimal policy.
Visualisation of a navigation task and the optimal policy.

An implementation of value iteration to find the optimal policy for reinforcement learning problems. In the scenario depicted above, the agent is placed 3 cells across and 3 cells down in a 3-by-4 grid. The top-right corner is the goal state, giving a reward of +1. Below that is a fail state, giving a reward of -1. Every other state gives -0.1, incentivising the agent to not dilly-dally. The arrows indicate the direction to move, as proposed by the optimal policy. There's a 75% chance of moving in the intended direction, and a 25% chance of veering left or right. This is why, at grid position (2,3), the cell directly left of the fail state, the optimal decision is to go left: a move to the left can only veer up or down, so there is no possibility of accidentally reaching the fail state.
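
The scenario above can be sketched in a few lines of plain Python. This is an illustrative version rather than the notebook's code, and it makes several assumptions the description does not state: a discount factor of 0.99, the 25% veer probability split evenly between the two perpendicular directions, no interior walls, and moves into the grid edge leaving the agent in place. Because of these assumptions, the policy it produces need not match the figure in every cell.

```python
ROWS, COLS = 3, 4
GOAL, FAIL = (0, 3), (1, 3)        # top-right cell and the cell below it (0-indexed)
STEP_REWARD = -0.1
GAMMA = 0.99                       # assumed discount factor
P_INTENDED, P_VEER = 0.75, 0.125   # assumes the 25% is split evenly

MOVES = {"up": (-1, 0), "right": (0, 1), "down": (1, 0), "left": (0, -1)}
# Perpendicular directions the agent may veer into for each intended move.
VEER = {"up": ("left", "right"), "down": ("left", "right"),
        "left": ("up", "down"), "right": ("up", "down")}


def step(state, direction):
    """Move one cell; bumping into the grid edge leaves the agent in place."""
    row = state[0] + MOVES[direction][0]
    col = state[1] + MOVES[direction][1]
    return (row, col) if 0 <= row < ROWS and 0 <= col < COLS else state


def reward(state):
    if state == GOAL:
        return 1.0
    if state == FAIL:
        return -1.0
    return STEP_REWARD


def q_value(values, state, action):
    """Expected value of the next state when attempting `action`."""
    outcomes = [(P_INTENDED, action)] + [(P_VEER, d) for d in VEER[action]]
    return sum(p * values[step(state, d)] for p, d in outcomes)


def value_iteration(tol=1e-9):
    states = [(r, c) for r in range(ROWS) for c in range(COLS)]
    terminal = (GOAL, FAIL)
    values = {s: 0.0 for s in states}
    for s in terminal:
        values[s] = reward(s)      # terminal values are fixed at their reward
    while True:
        delta = 0.0
        for s in states:
            if s in terminal:
                continue
            # Bellman optimality update for a non-terminal state.
            new = reward(s) + GAMMA * max(q_value(values, s, a) for a in MOVES)
            delta = max(delta, abs(new - values[s]))
            values[s] = new
        if delta < tol:
            break
    policy = {s: max(MOVES, key=lambda a: q_value(values, s, a))
              for s in states if s not in terminal}
    return values, policy


values, policy = value_iteration()
```

`policy` maps each non-terminal `(row, column)` cell to the greedy action under the converged values, which is the information the arrows in the figure convey.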


Linear models and Gaussian discriminant analysis of the Iris dataset

Plot of sepal length vs. the predicted value, showing that the prediction is pretty good for all flower types.
Sepal length vs. predicted value, roughly following the line x=y.

Applying linear models to regression and classification tasks on the Iris flower dataset. Also includes my own implementation of Gaussian discriminant analysis, which happens to perform worse than logistic regression on this dataset.
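
For readers unfamiliar with Gaussian discriminant analysis, the following is a minimal sketch of the technique, not the notebook's implementation. It assumes the common variant with one covariance matrix shared across classes, and it is restricted to two features so the matrix inverse can be written out by hand. The data points and the class names `class_a` and `class_b` are invented for illustration and are not Iris measurements.

```python
import math


def fit_gda(X, y):
    """Fit class priors, per-class means and a shared 2x2 covariance (inverted)."""
    n = len(X)
    priors, means = {}, {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        priors[c] = len(rows) / n
        means[c] = [sum(col) / len(rows) for col in zip(*rows)]

    # Pooled covariance: average outer product of deviations from each class mean.
    cov = [[0.0, 0.0], [0.0, 0.0]]
    for x, label in zip(X, y):
        d = [x[0] - means[label][0], x[1] - means[label][1]]
        for i in range(2):
            for j in range(2):
                cov[i][j] += d[i] * d[j] / n

    # Closed-form inverse of a 2x2 matrix.
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    inv = [[cov[1][1] / det, -cov[0][1] / det],
           [-cov[1][0] / det, cov[0][0] / det]]
    return priors, means, inv


def predict_gda(model, x):
    """Pick the class maximising log prior plus Gaussian log density (up to a constant)."""
    priors, means, inv = model

    def score(c):
        d = [x[0] - means[c][0], x[1] - means[c][1]]
        mahalanobis = sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))
        return math.log(priors[c]) - 0.5 * mahalanobis

    return max(priors, key=score)


# Made-up two-feature data, for illustration only.
X = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5), (1.2, 2.6),
     (5.0, 6.0), (5.5, 6.4), (6.0, 5.5), (5.2, 5.8)]
y = ["class_a"] * 4 + ["class_b"] * 4
gda_model = fit_gda(X, y)
```

With a shared covariance the normalising constant is the same for every class, so it can be dropped from the score, and the resulting decision boundary is linear. This is the same functional form logistic regression learns, which makes the two methods natural to compare.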


Country clustering with k-means and GMM

2-dimensional plot of country data, clustered into 4 groups.
PCA visualisation of country clusters with k=4.

Implementing k-means and the Gaussian mixture model to cluster countries based on economic data. The elbow and silhouette methods are used to pick an appropriate number of clusters.
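
To give a flavour of the approach, here is a small k-means sketch (Lloyd's algorithm) in plain Python, together with the within-cluster sum of squares that the elbow method plots against k. It is not the notebook's code: the random initialisation, the fixed seed and the six toy points are assumptions for illustration, and the Gaussian mixture model and silhouette score are left out.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Index of the nearest centroid for each point."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]


def kmeans(points, k, iterations=100, seed=0):
    """Return (labels, centroids, inertia) after running Lloyd's algorithm."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # initialise from k distinct data points
    for _ in range(iterations):
        labels = assign(points, centroids)
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])   # keep an empty cluster's centroid
        if new_centroids == centroids:
            break                        # assignments can no longer change
        centroids = new_centroids
    labels = assign(points, centroids)
    inertia = sum(sq_dist(p, centroids[label]) for p, label in zip(points, labels))
    return labels, centroids, inertia


# Toy 2-D points forming two obvious groups (invented for illustration).
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
labels, centroids, inertia = kmeans(points, k=2)

# Elbow method: inertia for a range of k; look for where the decrease levels off.
inertia_by_k = {k: kmeans(points, k)[2] for k in range(1, 4)}
```

On real data the features would normally be standardised first, since k-means is sensitive to scale, and the run would be repeated from several initialisations because Lloyd's algorithm only finds a local optimum.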