Assignment 1: Corpus collection
Deadline: May 14, 2021 12:00 CEST
Late deadline: May 21, 2021 12:00 CEST
The methods we discuss in this course are data-driven. The success of these methods relies heavily on the quality and quantity of the available data, but collecting and annotating data is often the most time-consuming (and the most underappreciated) part of NLP applications. In this assignment, we will build a small corpus that can be used for stylometry experiments by combining open-source data from multiple sources. The first data source we will use is the ACL Anthology, which keeps a repository of publications related to computational linguistics. In particular, we will be using the bibliographic data with abstracts from the ACL Anthology. To obtain additional information about the authors, we will make use of the data available from Wikidata.
Our aim in this assignment is to build a corpus of research paper abstracts for which we can obtain some properties of authors, namely birth date, binary gender (sex or gender), and nationality (country of citizenship).
The assignment requires rather little Python knowledge and programming skill. The main challenge is understanding the basics of some systems and technologies that you are likely not familiar with (yet). The first of these is SPARQL, a query language for structured data (you are recommended to go through this tutorial, but you are also welcome to take an example from the web and modify it until it does what we need). Second, you are asked to query a web-based API using the Python requests library. For both, the knowledge you need for this assignment is very limited. However, you may consider learning more than needed, since you are likely to make use of similar tools and languages in the future.
General information on assignments
Since this is the first assignment, a few clarifications are in order.
- The assignments are accepted only through the GitHub Classroom setup. You should follow the assignment link on the private course repository to work on the assignment. In the process, you will be asked to either create a team or join an existing one. If you are working on the assignment alone, or if you are the first member of your group, you should create a new team. If your assignment partner has already created the team, you should join it. After accepting the assignment, you will have a repository with instructions and starter code (a single shared repository for groups).
- You are strongly recommended to use git properly and productively. Commit individual changes separately, and commit/push frequently. Your work will be graded only based on its final state; intermediate versions of your code do not contribute to your grade.
- Do not forget to add your name(s) in the assignment file header(s).
- If you are not taking the course for credit, please indicate it in the file header, so that your assignments are not graded. You may still get some feedback on your solution.
- You are welcome to create an issue on the private course repository for your questions.
- Please implement the exercises in the specified places in provided template(s). You can implement parts of the exercises in other functions/classes/methods/modules and use them from the template if it is (really) needed, but your implementation should follow the “API” defined in the assignment template.
- When you are done, tag your repository as final, and push the tag to GitHub. Here is how to do it on the command line: first git tag final, then git push --tags. Repositories not tagged as final will not be checked until the late submission deadline.
Exercises
1.1 Extract first authors from the bibliography file
We want to get a list of authors from the bibliography file. The list should be sorted according to the number of articles where the person is the first author (higher number of articles first). Since the bibliography file distributed by ACL is automatically generated, it is relatively easy to parse. Nevertheless, you may find the Python modules pybtex (for parsing the BibTeX file), and pylatexenc or latexcodec (for converting author names in LaTeX to Unicode strings) useful.
Please implement this exercise in the function get_authors() in the template. The returned sequence can be a sequence of strings, e.g., "John Smith", or tuples that contain the parts of the name in a more structured manner, e.g., ("John", "Smith"). The later parts of your assignment should work with the sequence returned by this function.
Note: if you use pybtex, it will fail to parse the file because of unbalanced braces in two abstracts. You can either correct these yourself, or apply the patch provided as acl-bib.patch.
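To make the expected output concrete, here is a minimal sketch of a first-author count. It deliberately avoids pybtex and instead pulls author fields out with a regular expression, so it ignores LaTeX escapes and nested braces; a real solution should use a proper BibTeX parser as suggested above.

```python
import re
from collections import Counter

def get_authors(bib_text):
    """Return first authors sorted by number of first-authored entries.

    A simplified stand-in for the template's get_authors(): it extracts
    'author = {...}' fields with a regular expression instead of a real
    BibTeX parser, so it does not handle LaTeX escapes or nested braces.
    """
    counts = Counter()
    for match in re.finditer(r"author\s*=\s*\{(.+?)\}[,\n]", bib_text, re.S):
        # BibTeX separates authors with " and "; the first is the first author.
        first = match.group(1).split(" and ")[0].strip()
        # Normalize "Last, First" to "First Last".
        if "," in first:
            last, _, given = first.partition(",")
            first = f"{given.strip()} {last.strip()}"
        counts[first] += 1
    # Most frequent first authors come first, as the exercise requires.
    return [name for name, _ in counts.most_common()]
```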
1.2 Collect metadata for authors from Wikidata
Create a list of 50 authors whose birth date and binary gender (male/female) are known. We also would like to record the nationality of the author when available, but we will not require it.
Implement this exercise in the function get_author_info() in the template. The function should query Wikidata for each author given in the sorted sequence until it finds 50 authors whose occupations are listed in Wikidata as one of computational linguist, computer scientist, or researcher (you can add others, e.g., linguist, if you like) and for whom all the required information (gender and birth date) is specified. Your function should return a sequence of tuples with the author, Wikidata ID, birth date, gender and nationality. For missing information you can use None as a placeholder.
The challenge in this exercise is to write a SPARQL query that finds people by name in an efficient manner. You are recommended to test your query on Wikidata query service first, and then parameterize it based on author name, and use it in your script. You may also access many useful example queries through this query interface.
There are many high-level libraries for querying Wikidata. However, for the sake of exercise, you are required to only rely on Python requests library. You should form your query, send it to the API endpoint at https://query.wikidata.org/sparql, and process the returned result.
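To make the moving parts concrete, here is one possible shape for the query and the request. The Wikidata property IDs are real (P569 birth date, P21 sex or gender, P106 occupation, P27 country of citizenship), but the query itself is only an illustrative sketch, not a required solution; in particular, you still need to restrict occupations as described above.

```python
def build_query(name):
    """Build a SPARQL query finding a person by exact English label.

    Sketch only: a full solution should also filter on occupation
    (wdt:P106) and may want a fuzzier label match.
    """
    return f"""
    SELECT ?person ?birth ?gender ?country WHERE {{
      ?person rdfs:label "{name}"@en ;
              wdt:P569 ?birth ;
              wdt:P21 ?gender .
      OPTIONAL {{ ?person wdt:P27 ?country . }}
    }} LIMIT 5
    """

def query_wikidata(session, name):
    """Send the query to the public SPARQL endpoint and return JSON.

    `session` should be a requests.Session, so that connections to the
    server are reused rather than reopened for each query.
    """
    response = session.get(
        "https://query.wikidata.org/sparql",
        params={"query": build_query(name), "format": "json"},
        # Identifying yourself with a User-Agent is good etiquette
        # on public endpoints (the value here is a placeholder).
        headers={"User-Agent": "corpus-collection-assignment"},
    )
    response.raise_for_status()
    return response.json()
```

Usage would look like `session = requests.Session()` followed by `query_wikidata(session, "John Smith")`; passing the session in explicitly keeps the function easy to test without network access.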
Note that many authors in your list will not be in Wikidata, or they will not have the data we require. As a result, you will need to skip hundreds of authors. You are allowed to "cheat" a bit to speed up the process by entering the correct data about the authors you know or can find information about.
Please try not to cause heavy load on this public service: test your code with small samples, and wait between the queries in your final corpus collection attempt (you may also get rejected temporarily or even blacklisted if you send too many queries in a short time). You are also recommended to use a Session with the requests library, which will reuse the connections to the server rather than creating an HTTP connection for each query.
Tip: you may have better luck with matching full names against Wikidata person labels rather than matching given/family names in the structured data. You are also welcome to try multiple options, or try a bit harder matching names (e.g., query names with/without diacritics) but an approximate method that you can obtain 50 authors as described above is enough for the purposes of this assignment.
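For the with/without-diacritics tip above, a small helper can be sketched with the standard library alone (the function name is mine, not part of the template):

```python
import unicodedata

def strip_diacritics(name):
    """Return the name with combining accent marks removed.

    NFKD decomposition splits accented letters into a base letter plus
    combining marks; dropping the combining marks leaves a plain form
    you can try as an alternative query label. Letters without a
    decomposition (e.g., dotless i) are left unchanged.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```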
1.3 Putting the information together
Save the corpus as a tab-separated file with the following fields (implement it in the function save_corpus()).
- id: Wikidata ID for the author
- name: Name of the author, like “John Smith”
- birth: Birth date of the author in ISO format, e.g., “2021-04-30”
- gender: binary gender of the author
- nationality: nationality of the author (if specified)
- pubdate: year of the publication
- nauthors: number of authors of the publication
- abstract: the text of the abstract
You should select abstracts in the bibliography whose first author is in the author list. Note that the first five fields will be repeated for authors with more than one first-authored abstract in the bibliography file.
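A minimal sketch of the tab-separated output, using the standard csv module; the row layout follows the field list above, but whether you write a header line, and the example values below, are my own choices rather than requirements.

```python
import csv

# Field order as listed in the assignment.
FIELDS = ["id", "name", "birth", "gender",
          "nationality", "pubdate", "nauthors", "abstract"]

def save_corpus(rows, path):
    """Write rows (sequences matching FIELDS) as a tab-separated file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(FIELDS)  # header line (optional, my choice)
        writer.writerows(rows)
```

Using csv.writer with delimiter="\t" (rather than joining strings by hand) takes care of quoting should a field ever contain a tab or newline.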
Wrapping it up
Do not forget to check in all your code as well as the file containing the corpus, tag your repository with the final tag, and push the code/data and the tag to GitHub.
Now you have a corpus you can use for training models for:
- Predicting the age or gender of an author
- Predicting publication year of an abstract
- Predicting whether the author is a native speaker or not (using nationality as a noisy label)
- Automatically grouping (clustering) authors based on what they write
- …
For most of these tasks, however, the corpus we created is rather small. This is likely to change in the coming years as Wikidata and other sources of linked/open data become better populated, and the approach may already be useful if we pick other, larger sources of textual data (e.g., PubMed).