Unsupervised Learning on Country Data

This notebook explores a dataset of countries and various metrics (child mortality, GDP per person, ...) that indicate how "developed" they are. The goal is to identify the countries that are most in need of aid from an NGO.

From Kaggle: https://www.kaggle.com/datasets/rohan0301/unsupervised-learning-on-country-data/data

(The wording is yikes-inducing, as it describes them as "backward" countries).

Here, we implement two methods to cluster the countries: k-means and the Gaussian mixture model. To determine the optimal number of clusters k, we use a combination of the silhouette method and the elbow method. Furthermore, for each value of k, we run the clustering multiple times and pick the "best" clustering (according to the silhouette method), in case a run turns out badly due to unlucky random starting parameters.
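That "best of several runs" procedure can be sketched as follows. (For brevity, this sketch uses scikit-learn's `KMeans` and `silhouette_score` rather than the notebook's own implementations, and the function name `best_of_n_runs` is just illustrative.)

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_of_n_runs(X, k, n_runs=10, seed=0):
    """Run k-means n_runs times; keep the labels with the best silhouette score."""
    rng = np.random.default_rng(seed)
    best_labels, best_score = None, -np.inf
    for _ in range(n_runs):
        labels = KMeans(n_clusters=k, n_init=1,
                        random_state=int(rng.integers(1 << 31))).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels, best_score
```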

Imports

Reading and preprocessing the data

Note: I renamed the data files.

Normalise the data.
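A minimal sketch of the normalisation step, assuming z-score standardisation (zero mean, unit variance per column); the exact scaling used in the notebook may differ:

```python
import pandas as pd

def standardise(df: pd.DataFrame) -> pd.DataFrame:
    """Scale each numeric column to zero mean and unit variance."""
    return (df - df.mean()) / df.std()
```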

Visualise using PCA.

Plotting in 2 dimensions with PCA. There's already a clear gradient from left to right, with poorer countries off to the left. There are also weird, small, tax haven countries off on their own in the top right.
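The 2-D projection itself can be sketched with an SVD of the centred data (equivalent to what `sklearn.decomposition.PCA` computes; the matplotlib scatter plot is omitted):

```python
import numpy as np

def pca_2d(X: np.ndarray) -> np.ndarray:
    """Project the rows of X onto the top two principal components."""
    Xc = X - X.mean(axis=0)                        # centre the data
    # Rows of Vt are the principal directions, sorted by singular value
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                           # coordinates in the top-2 PC basis
```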

Metrics for clustering

Implementation of k-means clustering
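The notebook's implementation isn't reproduced here, but a minimal version of Lloyd's algorithm (the standard k-means procedure) looks roughly like this:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate between assigning points and updating centres."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random data points as initial centres
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centre
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centre to the mean of its points (keep old centre if empty)
        new_centers = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```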

Test it out on sample data

Applying 2-means clustering to the actual country data

Fit the model.

Visualise again with PCA, this time coloured by group.

We do seem to have captured most of the wealthy countries in Group 2, and most of the poorer countries in Group 1. However, there's a significant amount of overlap, and the cutoff seems somewhat arbitrary. Perhaps we can get a more granular view by increasing the number of groups k.

Another possibility: we may need to look at algorithms besides k-means that separate "extreme" examples from "normal" ones, since it's the "extremely poor" countries that are most likely to be in need of aid.

Using metrics to determine an appropriate number of clusters

First, try the elbow method: what's the average distance from each point to the centre of its group? Note: this may be sensitive to the random centres we start with.
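The elbow metric is cheap to compute from a fitted clustering; a sketch:

```python
import numpy as np

def mean_distance_to_center(X, labels, centers):
    """Average distance from each point to the centre of its assigned cluster."""
    return float(np.mean(np.linalg.norm(X - centers[labels], axis=1)))
```

Plotting this value against k and looking for the "bend" in the curve gives the elbow.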

Eyeballing the plot above, the "elbow" seems to be at k=3 or k=4, though the shape varies wildly between runs, so I think it depends a lot on the centres we start with. Let's try the silhouette method, and also do multiple runs for each k to mitigate the issue of bad starting centres.
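For reference, the silhouette score can be computed from scratch: for each point, a is the mean distance to the other points in its own cluster, b is the smallest mean distance to any other cluster, and the point's silhouette coefficient is (b - a) / max(a, b). A simple O(n²) sketch (which glosses over the singleton-cluster edge case):

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient over all points."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # pairwise distances
    ks = np.unique(labels)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        # a: mean distance to the other points in the same cluster
        a = D[i, same].sum() / max(same.sum() - 1, 1)
        # b: smallest mean distance to any other cluster
        b = min(D[i, labels == k].mean() for k in ks if k != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```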

Between the silhouette score and the elbow method, the ideal number of clusters seems to be k=2 or k=3. In any case, let's see what the clustering looks like with k=3 and k=4, reusing the centres that we got from the parameter testing.

Implementing Gaussian Mixture Model

An alternative to k-means clustering. Assume the datapoints in each cluster follow a Gaussian distribution, as we saw with Gaussian Discriminant Analysis (iris.ipynb). The difference is that this is an unsupervised learning problem and we don't know the clusters, so we use the Expectation-Maximisation (EM) algorithm to find the parameters of the distributions, as follows:

  1. Initialize the parameters to reasonable values.
  2. Expectation step. Update the probability that each datapoint belongs to each cluster, according to the current parameters. (See: Gaussian Discriminant Analysis.)
  3. Maximisation step. Update the parameters to maximise the likelihood of the observations. The updates are derived by taking the derivative of the log likelihood of the data with respect to each parameter and setting it to zero.
  4. Repeat from step 2 until the likelihood of the observations/datapoints has converged.

Once we've fit the parameters, we know the cluster that each observation is most likely to belong to, which gives us our clustering.

As with k-means, this algorithm is sensitive to randomisation, and may require multiple runs for a good fit.
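The EM loop from steps 1-4 can be sketched as follows, using scipy for the Gaussian density and a fixed iteration count in place of a proper convergence check:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, k, n_iters=100, seed=0):
    """Fit a k-component Gaussian mixture with EM.

    Returns mixture weights pi, means mu, and covariances sigma."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # 1. Initialise: uniform weights, random data points as means, shared covariance
    pi = np.full(k, 1 / k)
    mu = X[rng.choice(n, size=k, replace=False)]
    sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * k)
    for _ in range(n_iters):
        # 2. E-step: responsibilities r[i, j] = P(cluster j | x_i)
        dens = np.stack([pi[j] * multivariate_normal.pdf(X, mu[j], sigma[j])
                         for j in range(k)], axis=1)
        r = dens / dens.sum(axis=1, keepdims=True)
        # 3. M-step: re-estimate the parameters from the responsibilities
        nj = r.sum(axis=0)
        pi = nj / n
        mu = (r.T @ X) / nj[:, None]
        for j in range(k):
            diff = X - mu[j]
            sigma[j] = (r[:, j, None] * diff).T @ diff / nj[j] + 1e-6 * np.eye(d)
    return pi, mu, sigma
```

The small multiple of the identity added to each covariance keeps it invertible when a cluster collapses onto very few points.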

Test on some relatively simple data: samples drawn from actual Gaussian distributions.

Oops, the group IDs of the model don't match those we assigned ourselves.

Fix that.
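One way to fix it: for small k, simply search all permutations of the predicted cluster IDs for the one that best agrees with our own labels. (The function name here is illustrative.)

```python
from itertools import permutations

import numpy as np

def align_labels(pred, true, k):
    """Relabel predicted cluster IDs using the permutation that best matches true."""
    best = max(permutations(range(k)),
               key=lambda p: np.mean(np.array(p)[pred] == true))
    return np.array(best)[pred]
```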

Plotting the means of the two clusters, as well as any incorrectly-classified points. We see that one outlier from the yellow group got classified as part of the purple group. The predicted means also came very close to the real ones.

Applying GMM to country clustering

The silhouette score seems to indicate that 2 clusters is optimal.

Interesting! k=2 is the best fit according to the silhouette score, but the actual clustering it produces basically implies that all the datapoints are generated by a single Gaussian. This actually makes sense -- eyeballing the data, it DOES seem to come from a single Gaussian! k=3 is closer to the clustering produced by k-means: haves, have-nots, and tax havens. For k>3, the clusters are harder to distinguish (spatially) in 2 dimensions. An alternative dimensionality reduction method might show the clusters better.

Conclusions

We've seen how the k-means algorithm and the Gaussian mixture model can be applied to clustering country data. According to the metrics we examined, the optimal number of clusters seems to be 2-4, though I'm not sure this is actually a useful way to divide countries for the purpose of allocating NGO resources.

Future work: