Assignment 4: Unsupervised learning
Deadline: July 5, 2021 12:00 CEST
Late deadline: July 9, 2021 12:00 CEST
This set of exercises includes hands-on experiments with unsupervised learning. In particular, we will try to find meaningful patterns in words based on the distribution of letters in them, using both clustering and dimensionality reduction. Our aim is to group the letters of a language into clusters and to find useful continuous low-dimensional representations for them.
The data for this assignment can be any word list, but we will assume we are working with German for some of the exercises. A small word list of German is provided in your repository as an example, but your code should work on any text file that contains one word per line. It may be particularly interesting to also apply the same ideas to the IPA data set we used in Assignment 3 (not required for this assignment).
As before, please use scikit-learn for clustering and PCA, and Keras for the neural network model(s). As in earlier assignments, please follow the instructions below and the structure described in the template.
Exercises
4.1 Read the data, return a letter-context matrix
Implement the usual read_data() function that reads a simple one-word-per-line word list and computes a matrix of associations between letters (rows) and contexts (columns). A letter is simply an individual Unicode character (that is, German sch consists of three different letters; it is also best to work with NFC-normalized input, but you do not have to pay attention to the actual Unicode representation for this assignment). We create left and right contexts for each letter separately. For example, for the words ihn and in, we want to generate the following matrix (whose cells are frequencies):
    #_l  i_l  h_l  h_r  n_r  #_r
i    2    0    0    1    1    0
h    0    1    0    0    1    0
n    0    1    1    0    0    2
In this example, _l indicates the left context and _r the right context, and the character # indicates a word boundary. The labeling is only for demonstration; the actual labels of the columns will not be important for the assignment.
Depending on the arguments, the cells of the matrix should be either
- frequencies (the number of times each letter occurs with each context)
- pointwise mutual information (PMI) between the letter and the context, i.e., PMI(letter, context) = log [P(letter, context) / (P(letter) P(context))]. For the PMI calculations, you can use smoothed probability estimates. A simple way to achieve this is to add 1 to all counts: with add-one smoothing, a <context, letter> pair that is not observed in the data is assigned a count of one, and a pair that occurs in the data once is counted as occurring twice.
You are also required to implement optional standardization of the data along the ‘context’ axis (columns). That means each column of the matrix should have its mean subtracted and be divided by its standard deviation.
Please follow the other instructions indicated in the template.
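To make the expected computation concrete, here is a minimal sketch of how such a letter-context matrix could be built. The signature read_data(path, pmi=False, standardize=False) is an assumption for illustration; the actual signature and further requirements are given in the template.

```python
import numpy as np

def read_data(path, pmi=False, standardize=False):
    """Build a letter-context matrix from a one-word-per-line word list.

    Rows are letters, columns are left/right contexts; cells hold raw
    frequencies, or (add-one smoothed) PMI values if `pmi` is True.
    """
    with open(path, encoding="utf-8") as f:
        words = [line.strip() for line in f if line.strip()]

    counts = {}                              # (letter, context) -> frequency
    for word in words:
        padded = "#" + word + "#"
        for i, ch in enumerate(word, start=1):
            for ctx in (padded[i - 1] + "_l", padded[i + 1] + "_r"):
                counts[ch, ctx] = counts.get((ch, ctx), 0) + 1

    letters = sorted({ch for ch, _ in counts})
    contexts = sorted({ctx for _, ctx in counts})
    m = np.zeros((len(letters), len(contexts)))
    for (ch, ctx), n in counts.items():
        m[letters.index(ch), contexts.index(ctx)] = n

    if pmi:
        m = m + 1.0                          # add-one smoothing
        p_joint = m / m.sum()
        p_letter = p_joint.sum(axis=1, keepdims=True)
        p_context = p_joint.sum(axis=0, keepdims=True)
        m = np.log(p_joint / (p_letter * p_context))

    if standardize:                          # per-column standardization
        m = (m - m.mean(axis=0)) / m.std(axis=0)

    return m, letters
```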
4.2 Clustering the letters with k-means and agglomerative clustering
Implement the function cluster() in the template, which takes a matrix (rows correspond to instances, columns correspond to features) and returns the cluster ids for each instance. The clustering method is selected by the parameter method. Please follow the additional instructions in the template.
It is not required for the assignment, but you are strongly recommended to try plotting a dendrogram, as well as experimenting with the various options implemented in the scikit-learn clustering methods.
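As an illustration (not the required implementation), a minimal cluster() could dispatch between the two scikit-learn estimators roughly as follows; the parameter names method and k are assumptions here, so follow the template for the actual interface.

```python
from sklearn.cluster import KMeans, AgglomerativeClustering

def cluster(data, method="kmeans", k=2):
    """Cluster the rows of `data`, returning one cluster id per instance."""
    if method == "kmeans":
        model = KMeans(n_clusters=k, random_state=0)
    else:
        model = AgglomerativeClustering(n_clusters=k, linkage="ward")
    return model.fit_predict(data)

# A dendrogram for the hierarchical case can be drawn with SciPy, e.g.:
#   from scipy.cluster.hierarchy import linkage, dendrogram
#   dendrogram(linkage(data, method="ward"), labels=letters)
```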
4.3 Evaluating clustering results
- Implement the function evaluate_cluster() in the template, which should return the silhouette score if no gold-standard labels are given, or the V-measure otherwise (see the sketch after this list).
- Calculate the silhouette score for both clustering methods with the number of clusters varying between 2 and 10, and plot the output as scree plots to two PDF files named k-means-scree.pdf and hierarchical-scree.pdf. Implement this part of the exercise in the function plot_scree() in the template.
- Implement an ‘uninformed’ baseline that clusters the letters using the distances between the letters in lexicographic order. For example, the distance between a and b should be 1, while the distance between a and d should be 3. You are free to choose the clustering method for this purpose. The implementation should go into cluster_baseline() in the template.
- Evaluate k-means and agglomerative clustering (with your choice of linkage function) with two-way clustering. We assume that the aim of clustering is to distinguish vowels from consonants. The definition of a vowel is not straightforward for written language, but any approximation is fine for our purposes. Note also that our instances are still individual Unicode characters: we treat diphthongs (e.g., ei) or repetitions of the same vowel as separate letters.
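The sketch below shows one way the evaluation pieces could fit together. The signatures and the choice of k-means for the baseline are assumptions for illustration (plot_scree() reuses the cluster() sketch above); defer to the template, and pick whatever clustering method you prefer for cluster_baseline().

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, v_measure_score

def evaluate_cluster(data, cluster_ids, gold=None):
    """Silhouette score without gold labels, V-measure with them."""
    if gold is None:
        return silhouette_score(data, cluster_ids)
    return v_measure_score(gold, cluster_ids)

def plot_scree(data, method, filename):
    """Silhouette scores for 2-10 clusters, saved as a scree plot."""
    ks = list(range(2, 11))
    scores = [evaluate_cluster(data, cluster(data, method=method, k=k)) for k in ks]
    plt.figure()
    plt.plot(ks, scores, marker="o")
    plt.xlabel("number of clusters")
    plt.ylabel("silhouette score")
    plt.savefig(filename)
    plt.close()

def cluster_baseline(letters, k=2):
    """Uninformed baseline: cluster letters by lexicographic position only."""
    order = sorted(letters)
    positions = np.array([[order.index(ch)] for ch in letters], dtype=float)
    return KMeans(n_clusters=k, random_state=0).fit_predict(positions)
```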
4.4 Dimensionality reduction with PCA
- Implement the function dimreduce_pca(), which returns a reduced-dimensionality version of the given data. If the number of dimensions is not specified, the function should pick the number automatically, returning the dimensionality-reduced data that preserves at least 95% of the variation in the data (see the sketch after this list).
- Also implement the plot_components() function, which plots the given two dimensions of the data to an external file.
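One possible shape for these two functions is sketched below; the parameter names (n_dims, dims, filename) are assumptions. Note that scikit-learn's PCA accepts a float n_components, which it interprets as the minimum fraction of variance to preserve.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def dimreduce_pca(data, n_dims=None):
    """Reduce `data` with PCA, keeping at least 95% of the variance by default."""
    pca = PCA(n_components=n_dims if n_dims is not None else 0.95)
    return pca.fit_transform(data)

def plot_components(reduced, labels, dims=(0, 1), filename="pca-components.pdf"):
    """Scatter-plot two dimensions of the reduced data, annotating each point."""
    x, y = reduced[:, dims[0]], reduced[:, dims[1]]
    plt.figure()
    plt.scatter(x, y)
    for xi, yi, label in zip(x, y, labels):
        plt.annotate(label, (xi, yi))
    plt.savefig(filename)
    plt.close()
```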
4.5 Dimensionality reduction with a neural network (optional, +2 points)
Train a feed-forward network that predicts the context vector of a letter from the (one-hot encoded) letter. The network should use hidden layer(s) with lower dimensionality than the one-hot encoding, and we take the final hidden layer’s output as the low-dimensional representation.
If you choose to do this exercise, your code should go into dimreduce_nn().
You are free to choose the hyperparameters of your network without explanation (you should, however, try to come up with a good set of hyperparameters, based on your intuitions about the problem, and possibly some trial and error). You are recommended (but not required) to compare the results with the PCA.
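A minimal Keras sketch of this idea follows, assuming a function dimreduce_nn(data, n_dims) where data is the letter-context matrix from 4.1; the architecture and hyperparameters shown are just a starting point, not a prescribed solution.

```python
import numpy as np
from tensorflow import keras

def dimreduce_nn(data, n_dims=2, epochs=500):
    """Predict each letter's context vector from its one-hot encoding and
    return the final hidden layer's activations as the representation."""
    n_letters, n_contexts = data.shape
    onehot = np.eye(n_letters)               # one-hot input, one row per letter

    inputs = keras.Input(shape=(n_letters,))
    hidden = keras.layers.Dense(n_dims, activation="tanh", name="bottleneck")(inputs)
    outputs = keras.layers.Dense(n_contexts)(hidden)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    model.fit(onehot, data, epochs=epochs, verbose=0)

    # The bottleneck activations are the learned low-dimensional letter vectors.
    encoder = keras.Model(inputs, model.get_layer("bottleneck").output)
    return encoder.predict(onehot, verbose=0)
```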
Questions
Again, a few more questions to stimulate further thinking about the methods. You do not need to provide answers to these questions, but trying to find answers and discussing them with others (e.g., your assignment partner) will increase your understanding of the methods and their applications in NLP.
- How would the clustering and dimensionality reduction results be affected if you used other context representations (e.g., using a larger context, or not paying attention to the direction)?
- How do the clustering results differ if you use the dimensionality-reduced data instead of the original data?
- Can you think about applications where the letter clusters or low-dimensional representations of letters can be useful?
- What type of difficulties do you expect if we used the same models above for learning representations for words rather than letters?
- What are the differences or similarities between the model defined in 4.5 and the skip-gram model of word2vec?
- How can you tell whether the PCA and neural network methods used for dimensionality reduction produce good results?