Assignment 7: Text classification
This is a draft; please do not work on this assignment before it is released.
Deadline: Aug 16, 2021 12:00 CEST
Late deadline: Aug 21, 2021 12:00 CEST
Many problems in NLP are types of text classification problems. Although each problem requires some special attention (e.g., the way features are selected or how preprocessing is performed), the same methods can be used for all text classification problems. In this assignment you are required to predict the binary gender of an author from a short text. This task, a sub-problem of author profiling, is well studied.
The main challenge in this assignment is the small amount of training data. To obtain good results, you will need to make use of external sources: you are encouraged to look for similar corpora (as additional training data) and/or pre-trained embeddings or language models.
Data
The data for this assignment comes from short essays written by you.
The essays have been collected from each student at the beginning of
the semester for the last couple of years.
The data set is provided as two tab-separated files:
a7-train.txt contains the training set,
and a7-test.txt contains the test set.
The first column in the training set is the label (f for female or m for male).
The test set does not contain label information;
the first column is set to _ for all texts in the test set.
The second column contains the text in both training and test files.
Note that the text may contain newlines; you are recommended to use
the Python csv library to read it.
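For example, a minimal reading sketch along these lines should work (the helper name read_data is an assumption, and the exact csv dialect may need adjusting to match the released files):

import csv

def read_data(path):
    # Each row holds a label and a text separated by a tab; the text
    # field may span several lines, which csv.reader handles as long
    # as the fields are quoted in the data files.
    labels, texts = [], []
    with open(path, encoding="utf-8", newline="") as f:
        for label, text in csv.reader(f, delimiter="\t"):
            labels.append(label)
            texts.append(text)
    return labels, texts

trn_labels, trn_texts = read_data("a7-train.txt")
tst_labels, tst_texts = read_data("a7-test.txt")  # labels are all '_'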
Your repository also includes an optional data file a7-abstracts.txt, which contains the collection of computational linguistics abstracts and the metadata you collected in assignment 1. You are not required to work with this file, but the system you develop can be used to perform various experiments on it (e.g., predicting gender or author). This is, again, a small data set with similar challenges.
Requirements
You can choose any machine learning method, and you are free to implement your classifier system the way you like. However, make sure that your code is readable and runs on a modern Python 3 environment (version >3.6).
There is only one requirement in the template:
the function predict_gender() should train the model using
the given training data and return the predicted labels
for the test data.
The following code from an external Python script should work without errors.
from a7 import predict_gender
pred = predict_gender(trn_texts, trn_labels, tst_texts)
If tst_texts above is ['I like programming.', 'I like linguistics'],
the return value should be similar to ['f', 'f']
(assuming the classifier predicted label f for both texts).
You should, of course, tune the hyperparameters and the architecture of your system. However, the above call should (re)train your system on the given data with the optimal setup you determined, and return the best predictions for each test instance.
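As an illustration only, a minimal baseline along these lines would satisfy the interface. This sketch assumes scikit-learn with TF-IDF features and logistic regression; it is not the required or recommended architecture, and it makes no use of external data:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def predict_gender(trn_texts, trn_labels, tst_texts):
    # Train on the given training data and return one predicted label
    # per test text, in the same order as tst_texts.
    model = make_pipeline(
        TfidfVectorizer(sublinear_tf=True, ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    model.fit(trn_texts, trn_labels)
    return list(model.predict(tst_texts))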
Besides the above interface, you are required to place your best
predictions in a file test-lables.txt, containing one label per line.
The labels in the file should correspond to the test instances (texts)
in the same order as a7-test.txt.
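A small sketch of how this file could be written (assuming pred holds your predictions in test-set order, as returned by predict_gender() above):

with open("test-lables.txt", "w", encoding="utf-8") as f:
    for label in pred:
        f.write(label + "\n")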
Do not forget to add the file test-lables.txt to your repository and push it to GitHub.
Evaluation
Your system will be evaluated based on both your implementation and the success of the predictions. The score used for evaluation is the macro-averaged F1 score.
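For reference, the metric can be computed, for instance, with scikit-learn (the gold and predicted labels below are made up for illustration):

from sklearn.metrics import f1_score

gold = ["f", "m", "f", "m"]  # hypothetical gold labels
pred = ["f", "f", "f", "m"]  # hypothetical predictions
print(f1_score(gold, pred, average="macro"))  # per-class F1, averaged with equal weight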
In general, you will get
- 5 points if your implementation is correct (and readable). You will lose points if we have difficulty understanding your code. Please make sure you include concise but clear comments if needed.
- 2 points if the above command runs without errors in less than 15 minutes on a recent (but not high-end) quad-core CPU (e.g., 3GHz Intel i7).
- 1 point if the score of your system is better than the random baseline.
- 2 points if the score of your system is better than 0.5 (macro-averaged F1 score) on the test set.
- +1 point if the score of your system is better than 0.7 on the test set.