Iris dataset

Here we apply various techniques to perform regression and classification tasks on the famous iris dataset.

See: https://www.kaggle.com/datasets/vikrishnan/iris-dataset

Imports
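
The original import cell isn't shown; something like the following, assuming the rest of the notebook uses numpy, pandas, matplotlib and scikit-learn:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, confusion_matrix
```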

Load the data
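
A sketch of the loading step, assuming the Kaggle file is the header-less UCI `iris.data`; the file name and column names are assumptions about the download:

```python
# The lack of a header row is an assumption about the Kaggle CSV.
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
df = pd.read_csv("iris.data", header=None, names=columns)
print(df.head())
print(df["species"].value_counts())
```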

Visualise 2 features
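
For example, petal length against petal width, coloured by species (which two features were chosen is an assumption):

```python
# One scatter point per flower, one colour per species.
fig, ax = plt.subplots()
for species, group in df.groupby("species"):
    ax.scatter(group["petal_length"], group["petal_width"], label=species, alpha=0.7)
ax.set_xlabel("petal_length")
ax.set_ylabel("petal_width")
ax.legend()
plt.show()
```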

Visualise all features using dimensionality reduction (PCA)
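
A minimal version using sklearn's PCA to project the four features down to two:

```python
# Fit PCA on all four features and keep the first two principal components.
feature_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
X_2d = PCA(n_components=2).fit_transform(df[feature_cols])

fig, ax = plt.subplots()
for species in df["species"].unique():
    mask = (df["species"] == species).to_numpy()
    ax.scatter(X_2d[mask, 0], X_2d[mask, 1], label=species, alpha=0.7)
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.legend()
plt.show()
```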

Using linear regression to predict sepal length
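
A sketch, assuming sepal length is regressed on the other three features with a train/test split (the split ratio and seed are arbitrary):

```python
# Predict sepal_length from the remaining three features.
reg_features = ["sepal_width", "petal_length", "petal_width"]
X_train, X_test, y_train, y_test = train_test_split(
    df[reg_features], df["sepal_length"], test_size=0.2, random_state=0
)
reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
```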

Mean squared error doesn't give us a great intuition for how close the predictions were, so let's plot the predictions against the true values.
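
One way to draw it, with the identity line for reference:

```python
# Points on the dashed y = x line are perfect predictions.
fig, ax = plt.subplots()
ax.scatter(y_test, y_pred, alpha=0.7)
lims = [y_test.min(), y_test.max()]
ax.plot(lims, lims, "k--", label="y = x")
ax.set_xlabel("true sepal_length")
ax.set_ylabel("predicted sepal_length")
ax.legend()
plt.show()
```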

Logistic regression to classify flower type
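
A sketch with sklearn defaults, on the same kind of train/test split:

```python
# Classify species from all four features; the defaults can emit
# ConvergenceWarnings on this data.
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    df[feature_cols], df["species"], test_size=0.2, random_state=0
)
clf = LogisticRegression().fit(Xc_train, yc_train)
print("accuracy:", clf.score(Xc_test, yc_test))
```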

^ Those are probably ConvergenceWarnings from the default lbfgs solver; increasing max_iter or standardising the features should make them go away.

Just one mistake: a virginica that got classified as a versicolor. That makes sense, because those clusters are close together.

Homegrown linear regression

We want to solve for x in

Ax = b,

where A is our data matrix (rows are samples) and b is the variable we're trying to predict. However, A is generally not square (we have more samples than features), so Ax = b usually has no exact solution. A^T A, on the other hand, IS invertible as long as the columns of A are linearly independent. So we instead solve for xhat in

A^T A xhat = A^T b,
xhat = (A^T A)^-1 A^T b,

and this minimises the error because A xhat is the projection of b onto the column space of A, i.e. xhat gets us as close to b as possible given A.
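
A minimal implementation of that derivation (no intercept, exactly as above); np.linalg.solve is used instead of forming the inverse explicitly, which is cheaper and numerically safer:

```python
# Solve A^T A xhat = A^T b for the same regression target as before.
A = df[["sepal_width", "petal_length", "petal_width"]].to_numpy()
b = df["sepal_length"].to_numpy()

xhat = np.linalg.solve(A.T @ A, A.T @ b)
b_hat = A @ xhat
print("MSE:", np.mean((b_hat - b) ** 2))
```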

The sklearn version gets slightly lower error because it fits an intercept by default (fit_intercept=True): effectively it appends a column of all-1s to A to allow a constant term.

Gaussian discriminative analysis (GDA)

This is a generative classification algorithm. Rather than directly predicting the probability P(y|x) for a class y and data x, it models the data in each class with a Gaussian distribution, giving P(x|y). The prior probabilities P(y) and Bayes' rule then give the posterior:

P(y0|x) = P(x, y0)/P(x)
        = P(y0)P(x|y0) / (sum_{y} P(y)P(x|y)).

Let's see how it compares to logistic regression. Assuming a Gaussian distribution might not be a bad idea here, given that phenotypic traits in animals tend to be normally distributed, being the sum of many small effects (genes).

Also worth noting that GDA can be reduced to logistic regression, but not vice versa: if GDA's assumptions hold, the posterior P(y|x) takes exactly the logistic form, but a logistic posterior doesn't imply Gaussian class-conditionals.
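
GDA doesn't ship with sklearn under that name, so here's a from-scratch sketch with a shared covariance matrix (the variant that reduces to logistic regression; per-class covariances would give QDA instead). It reuses the split from the logistic regression experiment:

```python
# Fit: priors P(y), per-class means, and a pooled covariance for P(x|y).
Xg_train, yg_train = Xc_train.to_numpy(), yc_train.to_numpy()
Xg_test, yg_test = Xc_test.to_numpy(), yc_test.to_numpy()
classes = sorted(np.unique(yg_train))

priors = {c: np.mean(yg_train == c) for c in classes}
means = {c: Xg_train[yg_train == c].mean(axis=0) for c in classes}
devs = np.vstack([Xg_train[yg_train == c] - means[c] for c in classes])
sigma = devs.T @ devs / len(Xg_train)
sigma_inv = np.linalg.inv(sigma)

def log_posterior(x, c):
    # log P(y) + log P(x|y), dropping terms that are constant across classes.
    d = x - means[c]
    return np.log(priors[c]) - 0.5 * d @ sigma_inv @ d

# Predict: pick the class with the highest posterior.
gda_preds = np.array([max(classes, key=lambda c: log_posterior(x, c)) for x in Xg_test])
print("accuracy:", np.mean(gda_preds == yg_test))
```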

Take a look at the model parameters. A reminder that the features are: sepal_length, sepal_width, petal_length, petal_width.
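
With the from-scratch version above, inspecting them is just printing the fitted quantities:

```python
# Rows are classes; columns follow feature_cols.
print(pd.DataFrame(means, index=feature_cols).T)
print("priors:", priors)
print("shared covariance:\n", sigma)
```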

We see that GDA doesn't perform as well on this dataset.
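
The matrix in question could be computed like so:

```python
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(yg_test, gda_preds, labels=classes)
print(pd.DataFrame(cm, index=classes, columns=classes))
```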

The confusion matrix gives a hint as to why. It struggles to distinguish between versicolor and virginica because there isn't a clear separating boundary between them. For this reason, the more general-purpose logistic regression model is better in this case. GDA would have the upper hand in cases where its stronger assumptions compensate for a lack of data.