First look at Universal Dependencies — part 2

In part 1 I introduced you to Universal Dependencies and touched on some of the concepts behind treebanks. We are now going to take a deeper look at the German treebank. You can download the ud-treebanks-v2.0.tgz file here. After unpacking it, you'll see multiple folders named UD plus the language. In the UD_German folder you'll find the file of interest: de-ud-train.conllu. This is the training data used to teach a machine learning algorithm to recognise the features of German sentences (their structure, the types of words, etc.). Let's get started.

Bookshelves filled with books in a library. Photo: Library by Kevin Wong under CC BY 2.0, cropped from the original with color effects applied.

Parsing Universal Dependencies

The CoNLL-U file format contains sentences, which are broken into individual words and multiwords. A multiword is, for example, ‘vom’, a contraction of ‘von dem’. Some of the words (also called tokens) represent numbers and symbols like commas, full stops, exclamation marks and even emoticons.

The format looks like this:

CoNLL-U (spaces instead of tabs) Gist

# sent_id = train-s2
# text = Die Kosten sind definitiv auch im Rahmen.
#id     form            lemma           upostag         xpostag         feats                                                                   head    deprel          deps    misc
1       Die             der             DET             ART             Case=Nom|Definite=Def|Gender=Fem|Number=Sing|PronType=Art               2       det             _       _
2       Kosten          Kosten          NOUN            NN              Case=Nom|Gender=Fem|Number=Sing                                         3       nsubj:pass      _       _
3       sind            sein            VERB            VAFIN           Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin                   0       root            _       _
4       definitiv       definitiv       ADV             ADJD            _                                                                       3       advmod          _       _
5       auch            auch            ADV             ADV             _                                                                       3       advmod          _       _
6-7     im              _               _               _               _                                                                       _       _               _       _
6       in              in              ADP             APPR            _                                                                       8       case            _       _
7       dem             der             DET             ART             Case=Dat|Definite=Def|Gender=Masc,Neut|Number=Sing|PronType=Art         8       det             _       _
8       Rahmen          Rahmen          NOUN            NN              Case=Dat|Gender=Masc,Neut|Number=Sing                                   3       obl             _       SpaceAfter=No
9       .               .               PUNCT           $.              _                                                                       3       punct           _       _

A lot is going on in this example, but the format is rather simple when you look at it. It's a line-based format: each line represents one word or multiword, and each line has multiple columns. Here's a visualisation of the above data:

Figure 1: Visualisation of the sentence ‘Die Kosten sind definitiv auch im Rahmen.’ with lemmas and POS tags from CoNLL-U format underneath.

The first column is the ID of the word. Regular words have IDs 1, 2, 3, etc. Multiwords have range IDs like 6-7 (‘im’), which means they are split into words 6 (‘in’) and 7 (‘dem’).
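
If you parse the format yourself, the ID column needs a little care, since it can hold either a plain number or a range. Here's a minimal sketch (parseId is my own helper name, not part of any library; CoNLL-U v2 also defines decimal IDs like 8.1 for empty nodes, which this sketch ignores):

// Parse the ID column of a CoNLL-U line.
// Returns { id } for a regular word and { from, to } for a multiword range.
function parseId(field) {
    var range = field.match(/^(\d+)-(\d+)$/);
    if (range) {
        return { from: Number(range[1]), to: Number(range[2]) };
    }
    return { id: Number(field) };
}

parseId('3');    // { id: 3 }
parseId('6-7');  // { from: 6, to: 7 }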

The second column is the word form: how the word is actually written in the sentence. The third column is the lemma, i.e. the base form of the word without any declensions, which I mentioned in the previous post.

The fourth column is the universal POS tag, the type of word in Universal Dependencies terminology. This tag is shared across all languages. The fifth column is the local POS tag, the type of word specific to the language. Since languages differ from each other, the local POS tags also differ between languages, whereas the universal POS tag tries to capture a more generalised view of the word type. One of the purposes of the Universal Dependencies project is to publish a list of word types and relations that are universally applicable to words in all languages.

The remaining columns annotate various other word features and link each word to other words in the sentence. These word relations allow a deeper understanding of the sentence structure.
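
The feats column, for example, packs several Key=Value attributes into a single string separated by vertical bars. Splitting it into an object is straightforward; a minimal sketch (parseFeats is my own helper name, not part of any library):

// Split a FEATS string like 'Case=Nom|Gender=Fem|Number=Sing'
// into a plain object. A lone '_' means the word has no features.
function parseFeats(field) {
    if (field === '_') return {};
    return field.split('|').reduce(function (feats, pair) {
        var parts = pair.split('=');
        feats[parts[0]] = parts[1];
        return feats;
    }, {});
}

parseFeats('Case=Nom|Gender=Fem|Number=Sing');
// { Case: 'Nom', Gender: 'Fem', Number: 'Sing' }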

To parse the CoNLL-U format, I developed a small Node.js module. You can download it as conllu-stream (NPM, GitHub). It allows us to quickly parse the files and look into the data. Here's an example:

Node.js Gist

var conllu = require('conllu-stream');
var fs     = require('fs');

fs.createReadStream('ud-treebanks-v2.0/UD_German/de-ud-train.conllu')
    .pipe(conllu())
    .on('data', sentence => {
        console.log(sentence.features.sent_id, sentence.toString());
    });

This just prints out all the sentences, 14118 in total.
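
If you only need the count, a small variation of the snippet above tallies the sentences and prints the total at the end:

var conllu = require('conllu-stream');
var fs     = require('fs');

var count = 0;

fs.createReadStream('ud-treebanks-v2.0/UD_German/de-ud-train.conllu')
    .pipe(conllu())
    .on('data', () => count++)
    .on('end', () => console.log('%d sentences', count));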

Basic statistics

To get an idea about the treebank, we can write simple programs to gather statistics about the data. Let's calculate the histogram of word forms and lemmas:

Node.js Gist

var _      = require('lodash');
var conllu = require('conllu-stream');
var fs     = require('fs');

// Function to print value as percent (nicely).
function percent(value) {
    value *= 100;
    return  isNaN(value) ? '' :
            value > 10   ? value.toPrecision(3)+'%' :
            value > 1    ? value.toPrecision(2)+'%' :
                           value.toPrecision(1)+'%';
}

// Function to calculate and display histogram.
// It first calculates the histogram of the primary `key`.
// For each key it then calculates a secondary histogram of the `linkKey`.
function displayHistogram(words, key, linkKey) {

    // Calculate `key` histogram of words, sorted by frequency.
    var grouped   = _.groupBy(words, key);
    var histogram = _(grouped)
        .mapValues('length')
        .toPairs()
        .sortBy([ 1, 0 ])
        .reverse()
        .value();

    // Get top-10 and bottom-10 parts of the histogram.
    var top       = histogram.slice(0, 10);
    var bottom    = histogram.slice(-10);

    // Combine top and bottom parts to display.
    var entries   = top.concat([['--']]).concat(bottom);

    // For each entry, calculate top-6 of secondary `linkKey` histogram.
    entries.forEach(entry => {
        entry[2] = percent(entry[1] / words.length);
        var linked = _.map(grouped[entry[0]], linkKey);
        entry[3] =
            _(linked)
            .groupBy()
            .mapValues('length')
            .toPairs()
            .sortBy([ 1, 0 ])
            .reverse()
            // Show percent values for each item.
            .map(item => `${item[0]} (${percent(item[1]/linked.length)})`)
            .slice(0, 6)
            .join(', ');
    });

    // Display table of results.
    console.log('-- %s --', key);
    console.log();
    console.log(entries.map(entry => entry.join('\t')).join('\n'));
    console.log();
    console.log('#words     :', words.length);
    console.log('#histogram :', histogram.length);
    console.log();

}


// Array to store all word objects we encounter.
var words = [];

// Parse CoNLL-U file.
fs.createReadStream('ud-treebanks-v2.0/UD_German/de-ud-train.conllu')
    .pipe(conllu())
    .on('data', sentence => {
        // Collect all words and lemmas in lowercase (except punctuation/numbers).
        sentence.getSequence()
            .filter(word => [ 'PUNCT', 'NUM' ].indexOf(word.upostag) === -1)
            .forEach(word => {
                // Using lowercase so "Haus" and "haus" are counted together.
                word.form  = word.form.toLowerCase();
                word.lemma = word.lemma.toLowerCase();
                words.push(word);
            });
    })
    .on('end', () => {
        // Calculate and display histograms of words vs. lemmas and vice versa.
        displayHistogram(words, 'form', 'lemma');
        displayHistogram(words, 'lemma', 'form');
        console.log('-- done --');
    });

The results are shown in the tables below:

Table 1: Most and least represented word forms together with the corresponding lemmas.

#        Form          Frequency   Lemmas
1        der           4.1%        der (98.7%), d (1.3%)
2        die           3.2%        der (99.2%), d (0.7%), die (0.03%)
3        in            3.2%        in (100%), er (0.03%)
4        und           2.9%        und (100%)
5        dem           2.5%        der (99.8%), d (0.2%)
6        von           1.5%        von (100%)
7        zu            1.3%        zu (100%)
8        den           1.1%        der (99.5%), d (0.5%)
9        das           1.0%        der (98.7%), d (1.3%)
10       mit           0.9%        mit (100%)
46337    1,2           0.0004%     1,2 (100%)
46338    08            0.0004%     08 (100%)
46339    07/10/11      0.0004%     07/10/11 (100%)
46340    020c          0.0004%     020c (100%)
46341    .limitierte   0.0004%     .limitierte (100%)
46342    .einzelne     0.0004%     .einzelne (100%)
46343    ..            0.0004%     .. (100%)
46344    .             0.0004%     . (100%)
46345    )             0.0004%     ) (100%)
46346    'm            0.0004%     'm (100%)
Table 2: Most and least represented lemmas together with the corresponding word forms (lower-case).

#        Lemma         Frequency   Forms
1        der           12.9%       der (31.1%), die (24.9%), dem (19.4%), den (8.9%), das (8.0%), des (6.8%), …
2        in            3.2%        in (100%)
3        und           2.9%        und (100%)
4        sein          2.6%        ist (32.8%), war (21.0%), sind (12.3%), seine (7.1%), sein (6.1%), seiner (5.0%), …
5        ein           2.3%        eine (30.4%), ein (29.8%), einer (13.0%), einen (10.7%), einem (10.7%), eines (5.3%), …
6        von           1.5%        von (99.7%), v. (0.3%)
7        werden        1.5%        wurde (43.9%), werden (19.6%), wird (18.3%), wurden (12.5%), worden (2.8%), würde (1.0%), …
8        zu            1.3%        zu (100%)
9        er            1.1%        er (86.9%), ihm (7.2%), ihn (5.9%), in (0.08%)
10       mit           0.9%        mit (100%)
38755    1,2           0.0004%     1,2 (100%)
38756    08            0.0004%     08 (100%)
38757    07/10/11      0.0004%     07/10/11 (100%)
38758    020c          0.0004%     020c (100%)
38759    .limitierte   0.0004%     .limitierte (100%)
38760    .einzelne     0.0004%     .einzelne (100%)
38761    ..            0.0004%     .. (100%)
38762    .             0.0004%     . (100%)
38763    )             0.0004%     ) (100%)
38764    'm            0.0004%     'm (100%)

We can make some interesting observations. Articles like der/dem/die/den/das/des alone account for 12.9% of all words. It also seems that the people annotating the words were a bit undecided whether to use the lemma der or d for the articles, even though der is much more popular. There might also be a wrongly assigned lemma in some cases, e.g. ‘in’ is tagged with the lemma er in 0.03% of cases. In the raw data it looks like a spelling mistake, where the form should have been ‘ihn’, correctly mapping to the lemma er.

When we look at the least represented word forms and lemmas in the tables, we see that they are usually dates, codes or wrongly tagged numbers and punctuation. We explicitly excluded numbers (NUM), yet some numbers are tagged as proper nouns (PROPN) or even regular nouns (NOUN). In the case of ‘.limitierte’ and ‘.einzelne’, it looks like a missing space after a full stop caused them to be included as words instead of marking new sentences. You should expect inconsistencies in datasets like this, which are built manually by people and from text corpora that themselves contain typos and other kinds of mistakes.
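
A quick way to surface such slips is to look for forms that consist only of digits and punctuation but are not tagged as numbers, punctuation or symbols. This is a rough heuristic of my own, not an official validation rule:

var conllu = require('conllu-stream');
var fs     = require('fs');

fs.createReadStream('ud-treebanks-v2.0/UD_German/de-ud-train.conllu')
    .pipe(conllu())
    .on('data', sentence => {
        sentence.getSequence()
            // Forms made up of digits and punctuation only...
            .filter(word => /^[\d\W]+$/.test(word.form))
            // ...that are nevertheless tagged as something else.
            .filter(word => [ 'NUM', 'PUNCT', 'SYM' ].indexOf(word.upostag) === -1)
            .forEach(word => {
                console.log(sentence.features.sent_id, word.form, word.upostag);
            });
    });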

As a last exercise, we will identify all the multiwords and list them. Here's the code:

Node.js Gist

var _      = require('lodash');
var conllu = require('conllu-stream');
var fs     = require('fs');

var multiwords = [];

fs.createReadStream('ud-treebanks-v2.0/UD_German/de-ud-train.conllu')
    .pipe(conllu())
    .on('data', sentence => {
        // Collect all multiword tokens and their expansions.
        sentence.structure.multiwords
            .map(id => sentence.tokens[id])
            .forEach(multiword => {
                // Get expanded form of the multiword.
                var expansion =
                    _.range(multiword.position, multiword.endPosition+1)
                    .map(id => sentence.tokens[''+id].form)
                    .join(' ');

                // Store multiword and its expansion.
                multiwords.push(multiword.form.toLowerCase() +
                    '\t-->\t' + expansion.toLowerCase());
            });
    })
    .on('end', () => {
        // Calculate and show histogram sorted by frequency.
        console.log(
            _(multiwords)
            .groupBy()
            .mapValues('length')
            .toPairs()
            .sortBy([ 1, 0 ])
            .reverse()
            .map(row => row.join('\t\t'))
            .join('\n')
        );
    });

The results from running the program are as follows:

Table 3: Contractions and their frequency from the Universal Dependencies German treebank.

#    Contraction   Expanded form    Frequency
1    im            in dem           51.2%
2    zum           zu dem           14.6%
3    zur           zu der           11.9%
4    am            an dem           10.7%
5    vom           von dem          5.91%
6    beim          bei dem          3.71%
7    ins           in das           1.61%
8    ans           an das           0.19%
9    ums           um das           0.17%
10   aufs          auf das          0.064%
11   übers         über das         0.021%
12   am            an dem an dem    0.021%
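
The last row looks like yet another annotation slip: the multiword ‘am’ expands to four tokens instead of two. Cases like this are easy to locate by checking how many tokens each multiword range covers; a minimal sketch reusing the same fields as the snippet above:

var conllu = require('conllu-stream');
var fs     = require('fs');

fs.createReadStream('ud-treebanks-v2.0/UD_German/de-ud-train.conllu')
    .pipe(conllu())
    .on('data', sentence => {
        sentence.structure.multiwords
            .map(id => sentence.tokens[id])
            // A German contraction should expand to exactly two tokens.
            .filter(mw => mw.endPosition - mw.position + 1 !== 2)
            .forEach(mw => {
                console.log(sentence.features.sent_id, mw.form);
            });
    });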

This is it for now. In the next article we will look at Wiktionary and the pros and cons of using that dataset compared to Universal Dependencies. If you want to play around with the data presented in this article, take a look at the Gist, where I have shared all the code snippets.