Corpus development is one of the goals of the ARC Centre of Excellence for the Dynamics of Language (see this web page for more details). We have run a number of workshops on corpus-related themes (e.g. the 2017 workshop that included a day on converting early sources).
In addition to creating usable materials for the source communities, whom we are strongly committed to supporting, we are archiving records that include primary media, transcripts and associated annotations. From this material we aim to produce a subset of accessible texts for a number of languages.
We use the following terminology (thanks to Jane Simpson for the formulation) to distinguish the objects we have collected:
Assemblage – all material collected, working files, early sources, multiple versions and drafts
Collection – the archived material, a subset of the above, curated with sufficient metadata to allow the user to know what each item is
Corpus – a crafted set of texts in the language that can be used for further analysis
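For readers who think in terms of data, the three terms can be read as successive filters over the same pool of items. The sketch below is only an illustration under assumptions of our own: the Item class, the REQUIRED_FIELDS names and the is_text flag are hypothetical, and real curation involves human judgement rather than a simple check.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """One object in the assemblage (hypothetical model, for illustration only)."""
    name: str
    metadata: dict = field(default_factory=dict)
    is_text: bool = False  # True for a finished text, False for media, drafts, etc.


# Hypothetical minimum metadata needed for a user to know what an item is.
REQUIRED_FIELDS = ("language", "description")


def build_collection(assemblage):
    """Keep only items that carry every required metadata field."""
    return [
        item for item in assemblage
        if all(item.metadata.get(name) for name in REQUIRED_FIELDS)
    ]


def build_corpus(collection, language):
    """Select the texts in one language from an already curated collection."""
    return [
        item for item in collection
        if item.is_text and item.metadata["language"] == language
    ]
```

Here the assemblage is simply a list of every Item, the collection is the part of it with adequate metadata, and the corpus is the set of texts in one language drawn from the collection.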
A corpus is a set of texts in a language. It is often built to address a particular research question, typically with parts of the corpus coded to allow analysis of certain features. Some corpora are created with no particular question in mind, for example the Brown corpus or the International Corpus of English. It is this latter kind of material that we will be producing: texts that can be used for various kinds of analysis.
The aim is to have corpora from texts in as many of the following languages as possible: Abui, Anindilyakwa, Bininj Gun-Wok, Cook Islands Māori, Dalabon, Gamilaaray/Yuwal, Gurindji, Gurindji Kriol, Kalam, Kanjimey, Kayardild, Kaytetye, Kriol, Ku Waru, Marind, Mawng, Mudburra, Murrinhpatha, Nafsan, Nen, Ngaanyatjarra, Nungon, Vera’a, Warlpiri, Warumungu, Wubuy, Wutung, Yolngu (not yet ready to be made public).