Linguistic annotation – A parsed corpus of Southern Dutch dialects

The GCND is a ‘parsed’ corpus. This means that linguistic information of various kinds has been added to the transcriptions at different levels, making the corpus fully searchable. It is possible to search not only for words, but also for parts of speech and syntactic information. Much of this annotation was produced automatically with the ALPINO parser. Because ALPINO is trained on standardised data (and especially written language), it sometimes makes mistakes when working with dialect data. ALPINO therefore needed help in two ways: (1) the data were prepared carefully in advance (preprocessing), and (2) the parser output was corrected manually afterwards (post-processing).
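To give an impression of what searching on parts of speech and syntactic information can look like, here is a minimal sketch in Python. It assumes the parses are available as ALPINO-style XML, in which every word and phrase is a node element with attributes such as word, lemma, pos, cat and rel. The sample sentence, its simplified analysis and the helper functions words_with_pos and phrases_with_relation are made up for illustration; they are not taken from the GCND or its search interface.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified tree in the style of ALPINO's XML output.
# The sentence and its analysis are illustrative, not taken from the GCND.
SAMPLE = """
<alpino_ds>
  <node cat="top" rel="top">
    <node cat="smain" rel="--">
      <node rel="su" pos="pron" word="ik" lemma="ik" begin="0" end="1"/>
      <node rel="hd" pos="verb" word="zie" lemma="zien" begin="1" end="2"/>
      <node cat="np" rel="obj1">
        <node rel="det" pos="det" word="de" lemma="de" begin="2" end="3"/>
        <node rel="hd" pos="noun" word="koe" lemma="koe" begin="3" end="4"/>
      </node>
    </node>
  </node>
  <sentence>ik zie de koe</sentence>
</alpino_ds>
"""

root = ET.fromstring(SAMPLE)


def words_with_pos(tree, pos):
    """Return all word forms whose part-of-speech attribute equals `pos`."""
    return [n.get("word") for n in tree.iter("node") if n.get("pos") == pos]


def phrases_with_relation(tree, rel, cat):
    """Return the text of every phrase of category `cat` that bears relation `rel`."""
    results = []
    for n in tree.iter("node"):
        if n.get("rel") == rel and n.get("cat") == cat:
            # Collect the word-bearing nodes under the phrase, in sentence order.
            leaves = [m for m in n.iter("node") if m.get("word")]
            leaves.sort(key=lambda m: int(m.get("begin")))
            results.append(" ".join(m.get("word") for m in leaves))
    return results


nouns = words_with_pos(root, "noun")                   # search by part of speech
objects = phrases_with_relation(root, "obj1", "np")    # search by syntactic function
print(nouns, objects)
```

The first query searches on part of speech alone, the second combines a phrase category (np) with a syntactic function (direct object, obj1). Neither query mentions a specific word form, which is what makes such searches useful for dialect data with highly variable spelling and pronunciation.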

Detailed information on preprocessing and post-processing can be downloaded here:

About the GCND (version October 2024)