Assignment 1: Corpus collection
Deadline: May 14, 2021 12:00 CEST
Late deadline: May 21, 2021 12:00 CEST
The methods we discuss in this course are data-driven. The success of these methods relies heavily on the quality and quantity of the available data, but collecting and annotating data is often the most time-consuming (and the most underappreciated) part of NLP applications. In this assignment, we will build a small corpus that can be used for stylometry experiments by combining open-source data from multiple sources. The first data source we will use is the ACL Anthology, which keeps a repository of publications related to computational linguistics. In particular, we will be using the bibliographic data with abstracts from the ACL Anthology. To obtain additional information about the authors, we will make use of the data available from Wikidata.
Our aim in this assignment is to build a corpus of research paper abstracts for which we can obtain some properties of authors, namely birth date, binary gender (sex or gender), and nationality (country of citizenship).
The assignment requires rather little Python knowledge and programming skill. The main challenge is understanding the basics of some systems and technologies that you are likely not familiar with (yet). The first of these is SPARQL, a query language for structured data (you are recommended to go through this tutorial, but you are also welcome to take an example from the web and modify it until it does what we need). Second, you are asked to query a web-based API using the Python requests library. For both, the knowledge you need for this assignment is very limited. However, you may consider learning more than needed, since you are likely to make use of similar tools and languages in the future.
General information on assignments
Since this is the first assignment, a few clarifications are in order.
- The assignments are accepted only through the GitHub Classroom setup. You should follow the assignment link on the private course repository to work on the assignment. In the process, you will be asked to either create a team or join an existing one. If you are working on the assignment alone, or if you are the first member of your group, you should create a new team. If your assignment partner has already created the team, you should join it. After accepting the assignment, you will have a repository with instructions and starter code (a single shared repository for groups).
- You are strongly recommended to use git properly and productively. Commit individual changes separately, and commit/push frequently. Your work will be graded only based on its final state; intermediate versions of your code do not contribute to your grade.
- Do not forget to add your name(s) in the assignment file header(s).
- If you are not taking the course for credit, please indicate it in the file header, so that your assignments are not graded. You may still get some feedback on your solution.
- You are welcome to create an issue on the private course repository for your questions.
- Please implement the exercises in the specified places in provided template(s). You can implement parts of the exercises in other functions/classes/methods/modules and use them from the template if it is (really) needed, but your implementation should follow the “API” defined in the assignment template.
- When you are done, tag your repository as final, and push the tag to GitHub. Here is how to do it on the command line: first git tag final, then git push --tags. Repositories not tagged as final will not be checked until the late submission deadline.
Exercises
1.1 Extract first authors from the bibliography file
We want to get a list of authors from the bibliography file. The list should be sorted according to the number of articles where the person is the first author (higher number of articles first). Since the bibliography file distributed by ACL is automatically generated, it is relatively easy to parse. Nevertheless, you may find the Python modules pybtex (for parsing the BibTeX file), and pylatexenc or latexcodec (for converting author names in LaTeX to Unicode strings) useful.
Please implement this exercise in the function get_authors() in the template. The returned sequence can be a sequence of strings, e.g., "John Smith", or tuples that contain the parts of the name in a more structured manner, e.g., ("John", "Smith"). The later parts of your assignment should work with the sequence returned by this function.
Note: if you use pybtex, it will fail to parse the file because of unbalanced braces in two abstracts. You can either correct these yourself, or apply the patch provided as acl-bib.patch.
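To make the expected output concrete, here is a minimal sketch of a first-author count. It deliberately avoids pybtex and instead pulls author fields out with a regular expression, so it ignores LaTeX escapes and nested braces; a real solution should use a proper BibTeX parser as suggested above.

```python
import re
from collections import Counter

def get_authors(bib_text):
    """Return first authors sorted by number of first-authored entries.

    A simplified stand-in for the template's get_authors(): it extracts
    'author = {...}' fields with a regular expression instead of a real
    BibTeX parser, so it does not handle LaTeX escapes or nested braces.
    """
    counts = Counter()
    for match in re.finditer(r"author\s*=\s*\{(.+?)\}[,\n]", bib_text, re.S):
        # BibTeX separates authors with " and "; the first is the first author.
        first = match.group(1).split(" and ")[0].strip()
        # Normalize "Last, First" to "First Last".
        if "," in first:
            last, _, given = first.partition(",")
            first = f"{given.strip()} {last.strip()}"
        counts[first] += 1
    # Most frequent first authors come first, as the exercise requires.
    return [name for name, _ in counts.most_common()]
```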
1.2 Collect metadata for authors from Wikidata
Create a list of 50 authors whose birth date and binary gender (male/female) are known. We also would like to record the nationality of the author when available, but we will not require it.
Implement this exercise in the function get_author_info() in the template. The function should query Wikidata for each author given in the sorted sequence until it finds 50 authors whose occupations are listed in Wikidata as one of computational linguist, computer scientist, or researcher (you can add others, e.g., linguist, if you like) and for whom all the required information (gender and birth date) is specified. Your function should return a sequence of tuples with the author, Wikidata ID, birth date, gender and nationality. For missing information you can use None as a placeholder.
The challenge in this exercise is to write a SPARQL query that finds people by name in an efficient manner. You are recommended to test your query on Wikidata query service first, and then parameterize it based on author name, and use it in your script. You may also access many useful example queries through this query interface.
There are many high-level libraries for querying Wikidata. However, for the sake of exercise, you are required to only rely on Python requests library. You should form your query, send it to the API endpoint at https://query.wikidata.org/sparql, and process the returned result.
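To make the moving parts concrete, here is one possible shape for the query and the request. The Wikidata property IDs are real (P569 birth date, P21 sex or gender, P106 occupation, P27 country of citizenship), but the query itself is only an illustrative sketch, not a required solution; in particular, you still need to restrict occupations as described above.

```python
def build_query(name):
    """Build a SPARQL query finding a person by exact English label.

    Sketch only: a full solution should also filter on occupation
    (wdt:P106) and may want a fuzzier label match.
    """
    return f"""
    SELECT ?person ?birth ?gender ?country WHERE {{
      ?person rdfs:label "{name}"@en ;
              wdt:P569 ?birth ;
              wdt:P21 ?gender .
      OPTIONAL {{ ?person wdt:P27 ?country . }}
    }} LIMIT 5
    """

def query_wikidata(session, name):
    """Send the query to the public SPARQL endpoint and return JSON.

    `session` should be a requests.Session, so that connections to the
    server are reused rather than reopened for each query.
    """
    response = session.get(
        "https://query.wikidata.org/sparql",
        params={"query": build_query(name), "format": "json"},
        # Identifying yourself with a User-Agent is good etiquette
        # on public endpoints (the value here is a placeholder).
        headers={"User-Agent": "corpus-collection-assignment"},
    )
    response.raise_for_status()
    return response.json()
```

Usage would look like `session = requests.Session()` followed by `query_wikidata(session, "John Smith")`; passing the session in explicitly keeps the function easy to test without network access.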
Note that many authors in your list will not be in Wikidata, or they will not have the data we require. As a result, you will need to skip hundreds of authors. You are allowed to "cheat" a bit to speed up the process by entering the correct data about the authors you know or can find information about.
Please try not to cause heavy load on this public service: test your code with small samples, and wait between the queries in your final corpus collection attempt (you may also get rejected temporarily or even blacklisted if you send too many queries in a short time). You are also recommended to use a Session with the requests library, which will reuse the connections to the server rather than creating an HTTP connection for each query.
Tip: you may have better luck with matching full names against Wikidata person labels rather than matching given/family names in the structured data. You are also welcome to try multiple options, or try a bit harder matching names (e.g., query names with/without diacritics) but an approximate method that you can obtain 50 authors as described above is enough for the purposes of this assignment.
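For the with/without-diacritics tip above, a small helper can be sketched with the standard library alone (the function name is mine, not part of the template):

```python
import unicodedata

def strip_diacritics(name):
    """Return the name with combining accent marks removed.

    NFKD decomposition splits accented letters into a base letter plus
    combining marks; dropping the combining marks leaves a plain form
    you can try as an alternative query label. Letters without a
    decomposition (e.g., dotless i) are left unchanged.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```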
1.3 Putting the information together
Save the corpus as a tab-separated file with the following fields (implement it in the function save_corpus()).
- id: Wikidata ID for the author
- name: Name of the author, like “John Smith”
- birth: Birth date of the author in ISO format, e.g., “2021-04-30”
- gender: binary gender of the author
- nationality: nationality of the author (if specified)
- pubdate: year of the publication
- nauthors: number of authors of the publication
- abstract: the text of the abstract
You should select abstracts in the bibliography whose first author is in the author list. Note that the first five fields will be repeated for authors with more than one first-authored abstract in the bibliography file.
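A minimal sketch of the tab-separated output, using the standard csv module; the row layout follows the field list above, but whether you write a header line, and the example values below, are my own choices rather than requirements.

```python
import csv

# Field order as listed in the assignment.
FIELDS = ["id", "name", "birth", "gender",
          "nationality", "pubdate", "nauthors", "abstract"]

def save_corpus(rows, path):
    """Write rows (sequences matching FIELDS) as a tab-separated file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(FIELDS)  # header line (optional, my choice)
        writer.writerows(rows)
```

Using csv.writer with delimiter="\t" (rather than joining strings by hand) takes care of quoting should a field ever contain a tab or newline.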
Wrapping it up
Do not forget to check in all your code as well as the file containing the corpus, tag your repository with the final tag, and push the code/data and the tag to GitHub.
Now you have a corpus you can use for training models for:
- Predicting the age or gender of an author
- Predicting publication year of an abstract
- Predicting whether the author is a native speaker or not (using nationality as a noisy label)
- Automatically grouping (clustering) authors based on what they write
- …
For most of these tasks, however, the corpus we created is rather small. This is likely to change in the coming years as Wikidata and other sources of linked/open data become better populated, and the approach may already be useful if we pick other, larger sources of textual data (e.g., PubMed).