Assignment 3: Classification
Deadline: June 21, 2021 12:00 CEST
Late deadline: June 25, 2021 12:00 CEST
The purpose of this assignment is to get hands-on experience with classification and with evaluating classification methods. The problem we will try to solve is to identify the language of a word from its phonetic transcription. Even though the task is kept relatively simple (computationally), the problem and its solution are similar to many practical classification problems in NLP.
We will experiment with both linear/traditional classifiers and neural network models, using scikit-learn for the linear models and Keras for the neural network classifier.
As in earlier assignments, please follow the instructions below and the structure described in the template.
Data
The data for this set of exercises comes from NorthEuralex, a lexical database of words expressing the same concepts in many languages. This data set is particularly interesting for investigating linguistic variation. For this set of exercises, however, we will not make use of this important structure in the database. Our aim is to identify the language of a word (without knowledge of the concept it expresses). The data is not included in your assignment repositories; you need to download it yourself (we are only interested in the lexical data).
Exercises
3.1 Read the data, return words and languages
As usual, your first task is to read the provided data file,
and return the features and the labels for classification.
The data is distributed as a tab-separated text file.
For the exercises below, the labels come from the first column (Language_ID).
We will use the features from the sixth column (IPA, not rawIPA).
The column contains the pronunciation of the words, encoded using the
International Phonetic Alphabet (IPA).
Since some natural language sound segments are expressed using
multiple IPA characters, the column contains space-separated
segments consisting of one or more Unicode (IPA) characters.
Implement the data reader in the function read_data().
The function should return a tuple of sequences,
one containing the pronunciation of the words,
the other the language code corresponding to each word.
We also want to be able to select the set of languages returned from
this function.
Please follow the instructions in the template
for expected implementation details.
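To make the expected behavior concrete, here is a minimal sketch of what read_data() could look like, assuming the TSV file has a header row naming the Language_ID and IPA columns; the actual file layout and the template's exact signature take precedence over this sketch.

```python
import csv

def read_data(filename, languages=None):
    """Read the NorthEuralex lexical TSV file.

    Returns a tuple (words, labels), where each word is the
    space-separated IPA segment string from the 'IPA' column and each
    label is the language code from the 'Language_ID' column. If
    'languages' is given, only entries for those languages are kept.
    """
    words, labels = [], []
    with open(filename, encoding='utf-8') as f:
        for row in csv.DictReader(f, delimiter='\t'):
            lang = row['Language_ID']
            if languages is not None and lang not in languages:
                continue
            words.append(row['IPA'])
            labels.append(lang)
    return words, labels
```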
3.2 Split the data, extract and encode the features
This exercise covers three preliminary steps common to (almost) any machine learning model:
- Split the data into a training and a test set. Here we need to make sure that the split is randomized; otherwise the distribution of labels will not be similar in the training and test sets (since the data is ordered by language). In heavily imbalanced data sets it is better to perform a stratified split, but you are not required (though welcome) to do this for this exercise.
- Extract phonetic segment n-grams. The features of a word will consist of phone n-grams. An n-gram is simply a combination of overlapping, consecutive units in a sequence. For example, the 2-grams (bigrams) of the IPA sequence d͡ʒ ɔ ɔ should be ['d͡ʒ ɔ', 'ɔ ɔ'].
- Use a sparse vector of n-gram counts (bag of n-grams) as the features. For example, given unigram features for the two words ['d͡ʒ', 'ɔ', 'ɔ'] and ['k', 'l', 'ɔ', 'ɔ'], they should be coded as [1, 2, 0, 0] and [0, 2, 1, 1], where the indexes of the features correspond to the feature values ['d͡ʒ', 'ɔ', 'k', 'l'].
For the last two steps, you can use CountVectorizer
from scikit-learn.
However, make sure you understand the underlying mechanism for constructing
these vector representations.
You should also make sure that the training and test sets are
consistently encoded.
However, the encoding should only be based on the data in the training set.
Please implement this exercise in the function encode().
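As a concrete illustration of these steps, here is a minimal sketch of a possible encode(). The 20% test size and the (1, 2) n-gram range are illustrative choices, not requirements; the overridden token_pattern makes CountVectorizer treat every whitespace-separated run as one IPA segment (the default pattern would drop single-character segments).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

def encode(words, labels, ngram_range=(1, 2)):
    # Randomized split: the file is ordered by language, so shuffling
    # (the default in train_test_split) is essential here.
    train_w, test_w, train_y, test_y = train_test_split(
        words, labels, test_size=0.2, random_state=42)
    # Fit the n-gram vocabulary on the training data only, then apply
    # the same encoding to both sets so they stay consistent.
    vectorizer = CountVectorizer(token_pattern=r'\S+',
                                 ngram_range=ngram_range)
    train_x = vectorizer.fit_transform(train_w)
    test_x = vectorizer.transform(test_w)
    return train_x, train_y, test_x, test_y
```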
3.3 Train a logistic regression classifier, return the predictions
Given the encoded data, train a logistic regression classifier on the training set and predict the languages of the words in the test set.
Please implement this exercise in the function linear_classifier(),
following the further instructions in the function’s docstring.
Note that you are not required to demonstrate the tuning of your classifier for this exercise. However, keep in mind that tuning the hyperparameters of a machine learning model is important: library defaults do not give good results for most tasks and data sets.
Scikit-learn provides many classification functions. Although it is not required for the assignment, you are encouraged to try other classifiers as well. Since these classifiers have a compatible API, you can try many of them with minimal changes to your code. However, you are strongly recommended to read about and understand the classification mechanisms you use for real-world classification tasks.
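A minimal sketch of what linear_classifier() could look like; raising max_iter is an illustrative choice, since the default limit sometimes fails to converge on large sparse inputs.

```python
from sklearn.linear_model import LogisticRegression

def linear_classifier(train_x, train_y, test_x):
    # LogisticRegression works directly on the sparse matrices
    # produced by CountVectorizer.
    model = LogisticRegression(max_iter=1000)
    model.fit(train_x, train_y)
    return model.predict(test_x)   # predicted language codes
```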
3.4 Train an MLP classifier for the same task
Train a simple multi-layer perceptron (MLP) for the same task.
Again, follow the instructions in the docstring of the
template for nn_classifier(),
where you should implement this exercise.
Note that, unlike scikit-learn classifiers, which accept symbolic labels, Keras models require you to encode your labels numerically (this is also typical of other NN libraries).
Similar to the linear model,
there are many aspects of the network that you can adjust
for better classification.
You should experiment with parameters such as the number of layers,
the number of units, the dropout rate, the number of training epochs, etc.
However, you are not required to show this in your solution.
Note that if you used CountVectorizer
from scikit-learn to encode the data,
you may need to convert the resulting sparse matrices to dense NumPy arrays
for training a Keras/TensorFlow model.
Also note that
the MLP is not a very good architecture for the task at hand.
We will introduce other architectures that are likely to be better
for this problem.
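A minimal sketch of a possible nn_classifier(), assuming TensorFlow’s Keras API; the layer size, dropout rate, and number of epochs are arbitrary illustrative values, not tuned settings.

```python
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

def nn_classifier(train_x, train_y, test_x):
    # Keras needs numeric labels and dense inputs: encode the language
    # codes as integers and densify the sparse count matrices.
    encoder = LabelEncoder()
    y = encoder.fit_transform(train_y)
    train_x, test_x = train_x.toarray(), test_x.toarray()
    model = Sequential([
        Dense(128, activation='relu', input_shape=(train_x.shape[1],)),
        Dropout(0.3),                                  # illustrative rate
        Dense(len(encoder.classes_), activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(train_x, y, epochs=10, batch_size=64, verbose=0)
    # Map predicted class indices back to language codes.
    predictions = model.predict(test_x).argmax(axis=1)
    return encoder.inverse_transform(predictions)
```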
3.5 Calculate evaluation metrics
Given gold-standard labels and predictions,
calculate macro-averaged precision, recall and F1-score,
as well as a confusion matrix,
optionally also printing the results.
The template for this exercise is in the function evaluate().
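A minimal sketch of a possible evaluate(), built on scikit-learn’s metric functions; the exact return format should follow the template’s docstring.

```python
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

def evaluate(gold, pred, verbose=False):
    # Macro averaging weights each language equally, regardless of its
    # number of test words.
    precision, recall, f1, _ = precision_recall_fscore_support(
        gold, pred, average='macro', zero_division=0)
    cm = confusion_matrix(gold, pred)
    if verbose:
        print(f"macro precision: {precision:.4f}")
        print(f"macro recall:    {recall:.4f}")
        print(f"macro F1:        {f1:.4f}")
        print(cm)
    return precision, recall, f1, cm
```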
Questions
Answering the questions below is not required, but doing so will help you understand the topic better.
- Are the evaluation metrics we use here appropriate? Would the use of accuracy or micro-averaged precision / recall / F1-score be a good idea?
- Which model (linear or NN) works better? What should you pay attention to in order to make sure that the comparison of the models is reliable?
- Are the models linear in ‘IPA segment’ input space?
- Do your models work well? How can you demonstrate that they are really useful models?
- How would training/test scores differ if you used only higher-order n-grams as features, for example, only trigrams?
- How would the linear and the neural model scale if we were using many features?
- It is noted above that the MLP is not necessarily a good NN architecture for the task. Why?
- How varied are the results of the models? Which model do you expect (or observe) to produce higher variation in the test scores? Why?