Challenges in tagging and parsing spoken dialects of Dutch

Keywords: tagging, parsing, dialects, Dutch, corpus, spoken dialects


This paper reports on the construction of a tagged and parsed pilot corpus of the southern Dutch dialects. The corpus aims to facilitate diachronic research into the syntax of Dutch, as its dialects have retained many (morpho)syntactic features that can often be traced back to changes that began in, or characteristics preserved from, older stages of the language. The discussion focuses mainly on initial test results obtained by applying existing NLP tools developed or optimised for POS tagging and parsing standard Dutch, namely Frog, TreeTagger and Alpino. We discuss some of the challenges we have encountered, both in working with spoken, unstandardised language in general and in handling specific (morpho)syntactic problems that the southern Dutch dialects pose for POS tagging and parsing. The challenges and solutions presented in this pilot study will inform our choice of NLP tools to use or adapt in developing a more extensive annotated corpus.