Assignment 2: Regression & regularization

Deadline: June 04, 2021 12:00 CEST
Late deadline: June 11, 2021 12:00 CEST

The objectives of this set of exercises is to become familiar with regression, and some of the common issues/practices in developing machine learning models. For this purpose, we will experiment with predicting a variable from an eye tracking experiment. Namely, we are interested in predicting the fixation duration on a word during reading. Modeling reader behaviour may have paractical applications, like identifying or simplifying complex texts, and it also is useful for understanding how humans process (written) language.

For implementation of the models in the exercise we will use scikit-learn library. Mainly, we are interested in their Ridge regression implementation. You may also use other utilities offered by this library, unless stated otherwise in the exercise description.

Data

The data for this set of experiments comes from the ZuCo corpus, a corpus with eye tracking and EEG measurements during reading sentences on screen. The part of the corpus we will use is a sub-selection of the data used in CMCL 2021 shared task. The data is a biases selection, and the variables of interest are selected / modified for simplification of these exercises. As a result, the data have substantially diverged from the original. You should not make any generalizations based on your findings.

Exercises

2.1 Read the primary data, split it into train/dev sets

The primary dataset we will use for this exercise is included in your repository as a2-data.txt as a text file. The file includes tab-separated blocks with the following variables:

word: The word as shown on the screen. Note that unlike typical tokenization in NLP, punctuation marks are not considered separate tokens.
ffd: First fixation duration in milliseconds. This is the value we want to predict.
freq: Frequency of the word (per million tokens) calculated on a large corpus.
pos: Predicted part of speech tag

Each block (sentence) ends with a single blank line. The lines in the file are either a data about space separated tokens with four variables described above separated by tabs, or a sentence boundary marked with an empty line.

The task in this exercise is to read the data, and return it as two sequences, a training set and a development set. After reading the sentences, you are required to randomize them. We are not going to make use of the context of the words in this assignment (although the context of the word is important for predicting reading behavior). The lists you return does not have to include sentence boundaries. However, for the sake of the exercise, you are required to shuffle sentences (not words), so that the order of words within sentences are preserved in both sets.

Implement your changes in the function read_data() in the provided template following further instructions in the docstring of the function.

2.2 Train/test a simple regression model

In this exercise we will train a simple regression model predicting the first fixation duration (ffd) from the word frequency alone. We are also going to set up a common function for training and evaluating a regression model with L2 regularization to use in the upcoming exercises.

Implement the function evaluate_regression() following the explanation in the template. Your function should train an appropriate regression model on the training set and return RMSE and R2 scores for training and development sets provided. On the main function body, simply output scores.

Some relevant questions

Although not required for your solution (we will only check your source code), trying to answer the following questions may help you understand the methods we are using better.

How do you interpret the RMSE and R2?
What are the coefficients found by the model, and how do you interpret them? (requires additional work not described in the exercise)
What is the effect of the regularization constant?
Does this model overfit if we do not use regularization at all? Why or why not?

2.3 Adding one more predictor

Add word-length as another feature to your regression model. For this exercise, you need to calculate the word length, augment the predictor (x) variables for both training and test set, and use the evaluate_regression() function implemented before to obtain the scores, and print them out.

Relevant questions

Are the contributions of these two predictors cumulative? That is, does the combined model explain the variation in the data equal to the sum of the variations explained by models using two single predictors?
How do you interpret the coefficients of the combined model?

2.4 Additional categorical predictor(s) (one-hot encoding)

Use the POS tags included in the data as additional predictors. You are required to encode the POS tags using one-hot encoding. For the sake of exercise, implement one-hot encoding yourself, do not use a high-level library function (using numpy array manipulation functions is OK). The implementation should follow the description in the template, implementing the function encode_onehot(). The set of POS tags should be obtained from the training set, and used as is for encoding the development set.

Again, use the evaluate_regression() function implemented before for training and evaluating a regression model with the additional feature(s).

2.5 Even more features: embeddings

Word embeddings are continuous vector representations of words learned (typically) from large corpora. The values in the embedding vectors are arbitrary, but the relations between them reflect semantic and syntactic properties of words. We will discuss embeddings later in the course in more detail. For our purposes, they are just another set of features for each word (although they are not as explainable as the features we used above). For your convenience, word2vec embeddings for the words that occur in our corpus is extracted and provided as embeddings.txt in your repository. The format of the file follows a common format for embeddings. Each row contains a word, followed by a sequence of numbers, which are the embedding vector for the word in the first column.

Note that we do not have embeddings for all words. As a result we will follow the following procedure for determining the word vector for each word in the corpus:

If the word is in the provided embeddings, use it as is
Otherwise, try lowercase version of the word, and use it if it is available
If neither the word or the lowercase word is in the embeddings, use the “average embedding”, which is the mean of all the embeddings in the provided file

Implement the function read_embeddings(), which returns embeddings for training and development sets, augment your predictors with the embedding features, and train test a model as before.

2.6 Find the best regularization parameter

So far, we made our model progressively more complex. Even though we have a relatively large training data compared to the complexity of the model, you should already realize some level of overfitting.

In this exercise, we want to find the optimum value for the regularization constant by searching through a reasonable range of parameters. Implement the function tune() in the template, which trains and tests models with different hyperparameter values, and returns the best hyperparameter value.

Please implement your own search, do not use a library function.

Relevant questions

What is the relation between the optimum regularization constant and model complexity (and training set size)?
Would the optimum regularization constant differ if predictors and/or the output variable had a different scale (e.g., if we measured the fixation duration in seconds rather than milliseconds)?

Wrapping up

Do not forget, edit the file header, including your name(s), and mark your final solution with final tag and push to your assignment repository,

The task here is simplified in many ways, and the predictions can be with many extensions. If you keen on improving your model, here are a few suggestions:

Use beginning/end of sentence as predictors
Use punctuation marks (before after a word) as predictors
Make use of context, do not only use the features of the word, but use the features of the word(s) before (or maybe even after) as predictors
Use better models of sequence processing (will be discussed later in the class)