Texts and more texts: corpora in the CoEDL

Corpus development is one of the goals of the ARC Centre of Excellence for the Dynamics of Language (see this web page for more details). We have run a number of workshops on corpus-related themes (e.g. the 2017 workshop that included a day on converting early sources).

In addition to creating usable materials for the source communities (which we have a strong commitment to supporting), we are archiving records that include primary media, transcripts and associated annotations. From this material we aim to produce a subset of accessible texts for a number of languages.
We use the following terminology (thanks to Jane Simpson for the formulation) to distinguish the kinds of objects we have collected:
Assemblage – all material collected, working files, early sources, multiple versions and drafts
Collection – the archived material, a subset of the above, curated with sufficient metadata for the user to know what each item is
Corpus – a crafted set of texts in the language that can be used for further analysis

A corpus is a set of texts in a language. It is often built to address a particular research question, typically with parts of the corpus coded to allow analysis of certain features. Other corpora are created with no particular question in mind, for example the Brown Corpus or the International Corpus of English. It is this latter kind of material that we will be producing: texts that can be used for various kinds of analysis.

The aim is to have corpora from texts in as many of the following languages as possible: Abui, Anindilyakwa, Bininj Gun-Wok, Cook Islands Māori, Dalabon, Gamilaaray/Yuwal, Gurindji, Gurindji Kriol, Kalam, Kanjimey, Kayardild, Kaytetye, Kriol, Ku Waru, Marind, Mawng, Mudburra, Murrinhpatha, Nafsan, Nen, Ngaanyatjarra, Nungon, Vera’a, Warlpiri, Warumungu, Wubuy, Wutung, Yolngu (not yet ready to be made public).
