In the last post I introduced some theory about how nouns are declined in German. This post and the next will cover downloading a treebank, parsing it, and looking at some basic statistics. A treebank is a text corpus in which every word is annotated with its word type and other essential information. The word type is called a POS (part-of-speech) tag. It denotes the syntactic category of the word, e.g. noun, verb, adjective, etc. The Universal Dependencies project publishes treebanks for many languages. They are based on a multitude of text corpora that were manually analysed, labeled, and annotated. The project follows the descriptive model, because it is based on real, free (as in speech) texts from various sources, including online ones.
Many languages use spaces to separate words. Other languages, like Chinese and Japanese, have a close relation between words and characters, so spaces are not as important there as in alphabetic languages using e.g. the Latin script. However, even if you limit yourself to languages that use spaces as word dividers, this definition of a word might be too simple, depending on the task you're working on.
I will use German as an example, because that's where my interest lies. In German you have abbreviations like ‘z.B.’ or ‘z. B.’ for ‘zum Beispiel’. Does that count as one word (the abbreviation itself) or two words (what it stands for)? You also have contractions like ‘ins’ (‘in das’) and ‘vom’ (‘von dem’ / ‘von einem’). Should we count those as one word or two? In some cases, you might want to measure the word distribution and treat ‘ins’/‘vom’ as one word, because they follow a different usage pattern than ‘in das’/‘von dem’. In other cases, you might want to find all prepositional phrases, and that is easier if you first split the multiword contractions into their components. Then it's possible to look for ‘in’ or ‘von’ directly. In the Universal Dependencies treebank for German, contractions are multiword elements that are linked to their components, e.g. ‘vom’ is an element that is linked to ‘von’ + ‘dem’. With this representation, you can work with the contraction itself as well as with its components. Unfortunately, abbreviations are not treated the same way. The treebank considers ‘z.B.’ a single word, which is not linked to ‘zum’ + ‘Beispiel’.
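To make the two views of a contraction concrete, here is a minimal sketch in Python. It assumes the CoNLL-U convention that a multiword element is written as a line with an ID range (such as 1-2), followed by one line per component. The fragment itself is hand-written for illustration and not copied from the treebank, and the function name read_tokens is my own invention.

```python
# Hand-written CoNLL-U fragment for illustration; not copied from the treebank.
# Columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
SAMPLE = "\n".join([
    "1-2\tvom\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tvon\tvon\tADP\t_\t_\t3\tcase\t_\t_",
    "2\tdem\tder\tDET\t_\t_\t3\tdet\t_\t_",
    "3\tHaus\tHaus\tNOUN\t_\t_\t0\troot\t_\t_",
])


def read_tokens(text):
    """Return (surface_tokens, component_words) for one sentence."""
    surface, words = [], []
    covered_until = 0  # highest word ID covered by the last range line
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        token_id, form = cols[0], cols[1]
        if "-" in token_id:
            # Range line such as "1-2": the contraction as written in the text.
            covered_until = int(token_id.split("-")[1])
            surface.append(form)
        elif "." in token_id:
            continue  # empty nodes (IDs like "8.1") are not part of either view
        else:
            words.append(form)
            # A word outside any range is also a surface token.
            if int(token_id) > covered_until:
                surface.append(form)
    return surface, words


surface, words = read_tokens(SAMPLE)
print(surface)  # ['vom', 'Haus']
print(words)    # ['von', 'dem', 'Haus']
```

Depending on the task, you would count over the first list (the text as written) or search in the second one (e.g. to find every ‘von’).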
Numbers can be challenging as well, since a space is sometimes used as a thousands separator, e.g. ‘150 000 000’ is written to mean 150 million. That doesn't mean ‘150’ and ‘000’ should be handled as words on their own.
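As a rough sketch of the problem, the following Python snippet compares a naive split on whitespace with a regular expression that keeps space-grouped numbers together. The pattern is my own assumption about what such a number looks like (one to three digits followed by groups of exactly three digits); it is not how Universal Dependencies tokenizes, and real text will contain cases it gets wrong.

```python
import re

# Assumption: a grouped number is 1-3 digits followed by one or more
# groups of " " + exactly three digits. Anything else is split on whitespace.
TOKEN = re.compile(r"\d{1,3}(?: \d{3})+(?!\d)|\S+")

sentence = "Es sind 150 000 000 Euro"

naive = sentence.split()
grouped = TOKEN.findall(sentence)

print(naive)    # ['Es', 'sind', '150', '000', '000', 'Euro']
print(grouped)  # ['Es', 'sind', '150 000 000', 'Euro']
```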
Words are usually tied to their surface form, i.e. you look at the actual letters. However, in some cases, it can be beneficial to ignore the various inflections of a word, or rather to consider all inflections as the same word. This concept is called a lemma. A lemma represents all the inflected forms of a word, e.g. ‘Haus’ is the lemma representing the word forms ‘Haus’, ‘Hauses’, ‘Häuser’, and ‘Häusern’. Universal Dependencies annotates every word with its lemma, so you have the option to use either the word form or the lemma.
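The difference shows up as soon as you start counting. The sketch below uses a small hand-written list of (word form, lemma) pairs; the list is made up for illustration and not taken from the treebank, where the same information would come from the form and lemma annotations.

```python
from collections import Counter

# Hand-written (word form, lemma) pairs for illustration only.
pairs = [
    ("Haus", "Haus"),
    ("Hauses", "Haus"),
    ("Häuser", "Haus"),
    ("Häusern", "Haus"),
    ("Haus", "Haus"),
    ("Baum", "Baum"),
    ("Bäume", "Baum"),
]

form_counts = Counter(form for form, _ in pairs)
lemma_counts = Counter(lemma for _, lemma in pairs)

print(form_counts["Haus"])   # 2, only the exact form 'Haus'
print(lemma_counts["Haus"])  # 5, all inflected forms of 'Haus'
print(lemma_counts["Baum"])  # 2
```

Counting by word form tells you which inflections are common, while counting by lemma tells you which words are common regardless of inflection.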
To take it a step further, it's possible to combine lemmas into so-called synsets, which group all synonyms into one representation. This is closer to what we think of as the meaning of a word. However, it also has the downside of leaving out subtle differences in meaning that exist between seemingly synonymous words. Some machine-readable dictionaries like WordNet are based on the principle of synsets, which I will look into in another post.
The last fact that complicates the definition of a word is slightly more philosophical, though still of practical significance. When you download a word list or a dictionary, it is either descriptive, normative, or a combination of both. Descriptive means that it just describes how the language is used: slang, new usage patterns (dative used instead of genitive), or usage normally considered incorrect can all be included and annotated in a descriptive manner. Normative, on the other hand, refers to how the language ought to be used. Many dictionaries list only the normative usages of words, i.e. how you ought to use the words, which doesn't necessarily match how they are used in reality. In a descriptive model, it's perfectly fine to annotate ‘Sinn machen’ if you find it in a text. In a normative model, you would only include the correct phrases ‘Sinn ergeben’/‘Sinn haben’. If you're building a spell checker, it is sensible to follow the normative model, at least for the suggestions. For many applications, however, it is more practical to follow the descriptive model. Otherwise you'll miss word usages that are not fully correct, but still commonly used.
If you want to dive deeper into how Universal Dependencies segments words, please refer to the project's online documentation on tokenization and word segmentation.
In the next part, there will be code. We will look at parsing Universal Dependencies and the CoNLL-U file format. If you're eager, you can already check out my NPM module conllu-stream, which is available on NPM and GitHub. If you liked this post, write a comment below.