Assignment 5: Segmentation with RNNs
Deadline: July 19, 2021 12:00 CEST
Late deadline: July 23, 2021 12:00 CEST
In this assignment, we will use a recurrent neural network for segmentation. In particular, we will experiment with learning hyphenation, where the objective is to determine the points in a word where one can insert hyphens. Given a word like hyphenation, we want to obtain hy-phen-ation, where ‘-’ indicates all positions where we can split the word at the end of a line (e.g., in word processing or typesetting).
Hyphenation is typically performed using rule-based methods; the use of machine learning for this task is not common. However, it is an instance of an important problem, segmentation. Other instances of segmentation include tokenization, word segmentation (for languages that do not mark word boundaries in writing), and compound splitting (e.g., for German). We will follow the common practice in the field and turn the segmentation problem into a sequence labeling problem: we label each letter either as the Beginning of a segment or as Inside a segment. For example, the characters and the corresponding labels for the above example are:
input: hyphenation
labels: IIBIIIBIIII
Note that we do not mark the beginning of the first segment (although doing so would not affect the results substantially). This is a variation of a popular labeling scheme for segmentation tasks (often called B/I/O or I/O/B), which itself has quite a few variations for different tasks. For example, for tokenization of English texts, it is common to include an Outside label for characters that are not part of any token (e.g., white space). Similarly, if we want to segment a text into sentences and tokenize it at the same time, we may include a label S indicating the beginning of a sentence.
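As a concrete illustration, deriving the labels from a segmented word only requires marking the first letter of every segment after the first. The sketch below shows one way to do this (the helper name is hypothetical, not part of the assignment template):

    def segmentation_labels(segmented_word):
        """Convert a space-separated segmentation like 'hy phen ation'
        into B/I labels like 'IIBIIIBIIII' (first segment unmarked)."""
        labels = []
        for i, segment in enumerate(segmented_word.split(' ')):
            # The first letter of every non-initial segment is a 'B'.
            first = 'I' if i == 0 else 'B'
            labels.append(first + 'I' * (len(segment) - 1))
        return ''.join(labels)

    assert segmentation_labels('hy phen ation') == 'IIBIIIBIIII'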
You can find a copy of the data for this assignment in your assignment repository. The file contains one word per line, where potential hyphenation points are indicated by a single space.
Please use Keras for building the model(s), and follow the instructions below and the structure described in the template.
Exercises
5.1 Read the data, return input and labels
Implement the function read_data() that reads the data file and returns two sequences: one containing the unsegmented words, and the second one containing the B and I labels corresponding to each word. For example, for the input words ‘hy phen ation’, ‘lan guage’ and ‘desert’, you should return the sequences ['hyphenation', 'language', 'desert'] and ['IIBIIIBIIII', 'IIIBIIII', 'IIIIII'].
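A minimal sketch of read_data() along these lines (assuming a UTF-8 file with one space-separated word per line, as described above; the details of the file handling are up to you):

    def read_data(filename):
        """Read one segmented word per line; return (words, labels)."""
        words, labels = [], []
        with open(filename, encoding='utf-8') as f:
            for line in f:
                segments = line.strip().split(' ')
                # 'hy phen ation' -> 'hyphenation' / 'IIBIIIBIIII'
                words.append(''.join(segments))
                labels.append('I' * len(segments[0]) + ''.join(
                    'B' + 'I' * (len(s) - 1) for s in segments[1:]))
        return words, labels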
5.2 Encode the input and the labels
Implement the class Encoder, which encodes the input as a two-dimensional array, where each row represents a word and each column corresponds to a character. The rows have to be padded with zeros or truncated to the same length. The cells of the matrix should be integers corresponding to characters (or to the special symbols described below). We prepend each row with a beginning-of-word symbol, and append an end-of-word symbol. For example, the encoding for the words abc, ba and caba should be similar to
1 4 5 6 2 0
1 5 4 2 0 0
1 6 4 5 4 2
where we (arbitrarily) use 1 for beginning of word, 2 for end of word, 4 for a, 5 for b, and 6 for c.
The fit() method of the class should define a mapping similar to the one described above, and perform any other necessary bookkeeping (e.g., setting the maximum sequence length). The transform() method should transform the input into a matrix like the one presented above. Note that if we get a word like aaadbc (a word longer than our encoding, and one that includes an unknown character), we encode it as 1 4 4 4 3 5, where 3 is another special symbol for unknown characters and the sequence is truncated to the length determined in fit().
If labels are provided, the transform() method also transforms them into a numpy array. For example, IBI, II and IBIB should be encoded as:
0 0 1 0 0 0
0 0 0 0 0 0
0 0 1 0 1 0
To use the encoded labels with the Keras model described below, you will need a 3D array for the labels. For example, the array above should have shape (3, 6, 1).
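One possible shape for Encoder is sketched below. This is not the required implementation; the symbol indices, attribute names, and truncation details are assumptions chosen to match the examples above:

    import numpy as np

    # Special symbol indices (arbitrary, matching the example above).
    PAD, BOW, EOW, UNK = 0, 1, 2, 3

    class Encoder:
        def fit(self, words):
            """Build the character-to-integer mapping and fix the row length."""
            chars = sorted(set(''.join(words)))
            self.char2idx = {c: i + 4 for i, c in enumerate(chars)}
            # Longest word plus the beginning- and end-of-word symbols.
            self.maxlen = max(len(w) for w in words) + 2

        def transform(self, words, labels=None):
            """Encode words (and, if given, labels) as padded integer arrays."""
            x = np.full((len(words), self.maxlen), PAD, dtype=int)
            for i, word in enumerate(words):
                row = ([BOW] + [self.char2idx.get(c, UNK) for c in word]
                       + [EOW])[:self.maxlen]
                x[i, :len(row)] = row
            if labels is None:
                return x
            # Keras needs a 3D label array: (n_words, maxlen, 1).
            y = np.zeros((len(words), self.maxlen, 1), dtype=int)
            for i, lab in enumerate(labels):
                # Offset by one: position 0 holds the beginning-of-word symbol.
                for j, label in enumerate(lab[:self.maxlen - 1]):
                    y[i, j + 1, 0] = 1 if label == 'B' else 0
            return x, y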
5.3 Define and train a recurrent model for the task
We want to build a recurrent network that makes a binary prediction (B or I) for every character in its input. Build a network which uses an Embedding layer at the input, followed by a gated RNN (you can use either a GRU or an LSTM), and a simple classifier layer (Dense) suitable for the task. You can choose the hyperparameters of the network (the number of units in each layer, batch size, optimizer, learning rate, number of epochs to train, …) as you like.
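A baseline along these lines might look as follows (a sketch only; all hyperparameter values are placeholders, and vocab_size and maxlen are assumed to come from your Encoder):

    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import Dense, Embedding, GRU

    def build_model(vocab_size, maxlen, embed_dim=32, rnn_units=64):
        model = Sequential([
            Embedding(vocab_size, embed_dim, input_length=maxlen),
            # return_sequences=True yields one output per character position.
            GRU(rnn_units, return_sequences=True),
            # One sigmoid unit per time step: P(label = 'B').
            Dense(1, activation='sigmoid'),
        ])
        model.compile(optimizer='adam', loss='binary_crossentropy')
        return model

Training then uses the padded inputs and the 3D label array from 5.2, e.g., model.fit(x_train, y_train, epochs=10).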
Also implement the function hyphenate(), which returns the segmented strings based on the predictions of a trained model. At each time step we choose the highest-probability label based on the model’s predictions.
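A possible greedy decoder is sketched below; it assumes the Encoder conventions used earlier (output position i + 1 corresponds to character i because of the beginning-of-word symbol) and words that fit within the encoded length:

    def hyphenate(model, encoder, words):
        """Insert '-' wherever the model predicts a 'B' label."""
        x = encoder.transform(words)
        probs = model.predict(x)  # shape: (n_words, maxlen, 1)
        segmented = []
        for word, p in zip(words, probs):
            chars = []
            for i, char in enumerate(word):
                # Greedy choice: 'B' if P('B') exceeds 0.5; never before char 0.
                if i > 0 and p[i + 1, 0] > 0.5:
                    chars.append('-')
                chars.append(char)
            segmented.append(''.join(chars))
        return segmented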
5.4 Evaluating segmentation
Implement the function evaluate(), which calculates and returns binary precision, recall and F1-score for given gold-standard labels and predictions. The gold-standard labels should be strings of labels, e.g., ['IBI', 'II', 'IBIB'] for the example above. Predictions are sequences of probabilities (of the label ‘B’) returned by the model. You should disregard the predictions for padding and for the beginning- and end-of-sequence symbols.
You are only asked to calculate the metrics described. However, keep in mind that to get a good indication of success, you need to calculate these scores on a held-out data set that does not overlap with the training set.
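A sketch of the computation (assuming the alignment conventions above and a decision threshold of 0.5; both are assumptions, not requirements):

    def evaluate(gold_labels, predictions):
        """Global precision, recall and F1-score for the 'B' class."""
        tp = fp = fn = 0
        for gold, probs in zip(gold_labels, predictions):
            # Iterating over the gold string skips padding and the end symbol;
            # the offset of one skips the beginning-of-word position.
            for i, g in enumerate(gold):
                pred_b = probs[i + 1, 0] > 0.5
                if pred_b and g == 'B':
                    tp += 1
                elif pred_b:
                    fp += 1
                elif g == 'B':
                    fn += 1
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1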
5.5 Improve the model
Improve your model with the following changes:
- Add an early stopping mechanism to stop training the network when the validation loss stops improving (you can use the EarlyStopping callback from Keras for this step)
- Add masking at the input (Embedding) layer, so that the model does not update the weights for padding
- Instead of using only a forward RNN, use a bidirectional gated RNN (GRU or LSTM)
Compare the effect of each improvement against the initial network defined in 5.3 (you do not need to show this comparison in your submission).
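Put together, the three changes might look like the sketch below (layer sizes, patience, and the validation split are arbitrary choices):

    from tensorflow.keras import Sequential
    from tensorflow.keras.callbacks import EarlyStopping
    from tensorflow.keras.layers import Bidirectional, Dense, Embedding, GRU

    def build_improved_model(vocab_size, maxlen, embed_dim=32, rnn_units=64):
        model = Sequential([
            # mask_zero=True tells downstream layers to ignore padding (index 0).
            Embedding(vocab_size, embed_dim, input_length=maxlen, mask_zero=True),
            # The bidirectional RNN sees both left and right context of a character.
            Bidirectional(GRU(rnn_units, return_sequences=True)),
            Dense(1, activation='sigmoid'),
        ])
        model.compile(optimizer='adam', loss='binary_crossentropy')
        return model

    # Stop when the validation loss has not improved for a few epochs.
    early_stopping = EarlyStopping(monitor='val_loss', patience=3,
                                   restore_best_weights=True)
    # model.fit(x_train, y_train, validation_split=0.1,
    #           epochs=50, callbacks=[early_stopping])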
Additional questions
You are not required to provide answers for these questions. However, trying to answer them will help you understand the methods/concepts better.
- Do you expect significant performance gains from the use of gated units for this task? (You can actually test your expectations empirically.)
- Given the task at hand, do you think improving either precision or recall should be prioritized?
- One of the difficulties with this data set (which you are not asked to improve on) is class imbalance: the number of ‘B’ labels is much smaller than the number of ‘I’ labels. One possible solution in case of class imbalance is oversampling the minority class during training. Can you perform oversampling in this setup? Can you think of any other method to alleviate the effects of class imbalance?
- The way we predict the segmentation points in this exercise is greedy: we take the highest-probability label at each step. Although the internal state of the RNN helps keep long-range information, each prediction is still local. Can you think of a method for finding globally more probable label sequences for a word?
- Early stopping was suggested above as a way to prevent overfitting. What other methods can you think of for the same purpose?
- While evaluating the segmentation performance, we calculate global precision, recall and F1-scores. That is, we count the true positives over all words in the given test set and divide by the total number of positive predictions (for precision) or by the number of positive gold-standard instances (for recall). Another possible approach is to calculate precision, recall and F1-score for each word, and average these values over all words. Consider the differences between these two approaches. What sort of models would each of these two scoring methods score highly?
- What sort of differences do you expect between the representations learned in the embedding layer of the current model and the low-dimensional letter representations learned in assignment 4?