Challenges in tagging and parsing spoken dialects of Dutch
Abstract
This paper reports on the construction of a tagged and parsed pilot corpus of the southern Dutch dialects. The corpus aims to facilitate diachronic research into the syntax of Dutch, as its dialects have retained many interesting (morpho)syntactic features that can often be traced back to changes starting in, or characteristics retained from, older stages of Dutch. The discussion focuses mainly on initial test results obtained by applying existing NLP tools that were developed or optimised for POS tagging and parsing standard Dutch: Frog, TreeTagger and Alpino. We discuss some of the challenges we encountered, both in working with spoken, unstandardised language in general and in handling specific (morpho)syntactic problems for POS tagging and parsing the southern Dutch dialects. The challenges and solutions presented in this pilot study will inform our choice of NLP tools to use or adapt for the development of a more extensive annotated corpus.
Copyright (c) 2022 Melissa Farasyn, Anne-Sophie Ghyselen, Jacques Van Keymeulen, Anne Breitbarth
This work is licensed under a Creative Commons Attribution 4.0 International License.
Articles appearing in the Journal of Historical Syntax are published under a Creative Commons Attribution License. Authors retain copyright.