Assignment 4: Unsupervised learning

Deadline: July 5, 2021 12:00 CEST
Late deadline: July 9, 2021 12:00 CEST

This set of exercises includes hands-on experiments with unsupervised learning. In particular, we will try to find meaningful patterns in words based on the distribution of letters in them, using both clustering and dimensionality reduction. Our aim is to group the letters of a language into clusters and to find useful continuous low-dimensional representations for them.

The data for this assignment can be any word list, but we will assume we are working with German for some of the exercises. A small word list of German is provided in your repository as an example, but your code should work on any text file that contains one word per line. It may be particularly interesting to also apply the same ideas to the IPA data set we used in Assignment 3 (not required for this assignment).

As before, please use scikit-learn for clustering and PCA, and Keras for the neural network model(s). As in earlier assignments, please follow the instructions below and the structure described in the template.

Exercises

4.1 Read the data, return a letter-context matrix

Implement the usual read_data() function that reads a simple one-word-per-line word list and computes a matrix of associations between letters (rows) and contexts (columns). A letter is simply an individual Unicode character (that is, German sch consists of three different letters; also, it is best to have NFC-normalized input, but you do not have to pay attention to the actual Unicode representation for this assignment). We create left and right contexts for each letter separately. For example, for the words ihn and in, we want to generate the following matrix (whose cells are frequencies):

    #_l     i_l     h_l     h_r     n_r     #_r
i     2      0       0       1       1       0
h     0      1       0       0       1       0
n     0      1       1       0       0       2

In this example, _l indicates left context and _r indicates right context. The character # is used to indicate a word boundary. The labeling is only for demonstration; the actual labels of the columns will not be important for the assignment.
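One way to build such a matrix is sketched below. This is only an illustration, not the template's exact API: the return values and variable names here are assumptions.

```python
# Hedged sketch of read_data(): count (letter, context) co-occurrences
# from a one-word-per-line file. Return values are an illustrative choice.
from collections import Counter

def read_data(path):
    """Return (letters, contexts, matrix), where matrix[i][j] counts how
    often letters[i] occurs with contexts[j]."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            word = line.strip()
            if not word:
                continue
            padded = "#" + word + "#"  # '#' marks the word boundary
            for i, ch in enumerate(word, start=1):
                counts[(ch, padded[i - 1] + "_l")] += 1  # left context
                counts[(ch, padded[i + 1] + "_r")] += 1  # right context
    letters = sorted({l for l, _ in counts})
    contexts = sorted({c for _, c in counts})
    matrix = [[counts[(l, c)] for c in contexts] for l in letters]
    return letters, contexts, matrix
```

On the two-word example above (ihn, in), this reproduces the table: the row for i gets 2 in the #_l column and 1 each in h_r and n_r.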

Depending on the arguments, the cells of the matrix should be either

You are also required to implement optional standardization of the data along the ‘context’ axis (columns). That is, the mean of each column should be subtracted from the column, and the result divided by the column’s standard deviation.
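Column-wise standardization can be written in a few lines of NumPy; the function name below and the zero-variance guard are illustrative choices, not part of the template.

```python
# Hedged sketch: z-score each context column (subtract the column mean,
# divide by the column standard deviation).
import numpy as np

def standardize_columns(matrix):
    m = np.asarray(matrix, dtype=float)
    mean = m.mean(axis=0)
    std = m.std(axis=0)
    std[std == 0.0] = 1.0  # avoid division by zero for constant columns
    return (m - mean) / std
```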

Please follow the other instructions indicated in the template.

4.2 Clustering the letters with k-means and agglomerative clustering

Implement the function cluster() in the template, which takes a matrix (rows correspond to instances, columns correspond to features) and returns the cluster ids for each instance. The clustering method is selected by the parameter method. Please follow the additional instructions in the template.

It is not required for the assignment, but you are strongly recommended to try plotting a dendrogram, as well experimenting with the various options implemented in the scikit-learn methods for clustering.

4.3 Evaluating clustering results

  1. Implement the function evaluate_cluster() in the template, which should return the silhouette score if no gold-standard labels are given, or V-measure otherwise.
  2. Calculate silhouette score for both clustering methods with varying number of clusters between 2 and 10, and plot the output as a scree plot to two PDF files with names k-means-scree.pdf and hierarchical-scree.pdf. Implement this part of the exercise in function plot_scree() in the template.
  3. Implement an ‘uninformed’ baseline that clusters the letters using the distances between the letters in the lexicographic order. For example, distance between a and b should be 1, while a and d should be 3. You are free to chose the clustering method for this purpose. The implementation should go into cluster_baseline() in the template.
  4. Evaluate k-means and agglomerative clustering (with your choice of linkage function) with two-way clustering. We assume that the aim of clustering is to distinguish vowels from consonants. The definition of vowel is not straightforward for written language, but any approximation is fine for our purposes. Note also that our instances are still individual Unicode characters. We treat diphthongs (e.g., ei) or repetition of the same vowel as separate letters.

4.4 Dimensionality reduction with PCA

4.5 Dimensionality reduction with a neural network (optional, +2 points)

Train a feed forward network predicting the context vectors of a letter from the (one-hot encoded) letter. The network should use hidden layer(s) that with lower dimensions than the one-hot encoding, and we take the final hidden layer’s output as the low-dimensional representation.

If you chose to do this exercise, your code should go to dimreduce_nn(). You are free to choose the hyperparameters of your network without explanation (you should, however, try to come up with a good set of hyperparameters, based on your intuitions on the problem, and possibly some trial and error).

You are recommended (but not required) to try to compare the results with the PCA.

Questions

Again a few more questions to stimulate further thinking about the methods. You do not need to provide answers to these questions, but trying find answers and discuss them with others (e.g., your assignment partner) will increase your understanding of the methods and their applications in NLP.