Building and using corpora from language documentation projects: A LIP discussion

Lauren Gawne recaps the April edition of Linguistics in the Pub, a monthly informal gathering of linguists in Melbourne to discuss topical areas in our field.

Last month we focused on outputs from language documentation projects that could be of use to the language-speaking communities we work with, and to a wider audience. This month, inspired by the LD&C special publication on the Potentials of Language Documentation, we turned to how the same projects could also be used for research beyond the immediate scope of the initial documentation work. The discussion took in a wide range of areas, including returning to older data, the kinds of projects that can be undertaken when revisiting existing corpora, and the realities of building a corpus during a documentation project.

For the sake of our discussion we took “corpus”, in the context of language documentation projects, to mean a collection of audio and/or video recordings with (optimally) time-aligned transcription, some degree of interlinear glossing, and metadata describing the content and speakers. A well-structured corpus of this kind can then be interrogated from a range of perspectives – quantitative corpus linguistics, projects looking at related languages or features, or typological surveys of the broadest range of phenomena. This allows for greater cross-linguistic comparison, and lets people return to existing data with new questions.
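To make the idea of “interrogating” such a corpus concrete, here is a minimal sketch in Python. The record structure, field names, and the language data in it are all hypothetical (real documentation corpora are usually stored in tool-specific formats such as ELAN’s EAF files), but it illustrates how time-aligned, glossed utterances linked to speaker metadata can be queried for a particular phenomenon:

```python
# A minimal, hypothetical sketch of querying a well-structured corpus.
# Field names and the example utterances are invented for illustration.

from dataclasses import dataclass

@dataclass
class Utterance:
    speaker: str          # identifier linking to speaker metadata
    start_ms: int         # time-aligned to the source recording
    end_ms: int
    transcription: str
    glosses: list         # one interlinear gloss per word

corpus = [
    Utterance("spk1", 0, 2400, "ngaju yan-i", ["1SG", "go-PST"]),
    Utterance("spk2", 2400, 5100, "nyuntu yan-ku", ["2SG", "go-FUT"]),
]

def find_by_gloss(corpus, gloss):
    """Return all utterances whose glossing contains the given tag."""
    return [u for u in corpus if any(gloss in g for g in u.glosses)]

# e.g. pull out every past-tense utterance, with speaker and timing intact
for u in find_by_gloss(corpus, "PST"):
    print(u.speaker, u.start_ms, u.transcription)
```

Because the transcription stays aligned to the recording and linked to speaker metadata, a query like this can hand back not just text matches but the exact stretches of audio or video behind them.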

The scope of what can be done with an existing corpus of material from language documentation projects is limited only by the nature of the collected data and the types of questions we want to ask of it. There are a number of projects underway at the moment to turn existing printed collections of data into digital collections, making our ability to work with them more powerful by fully accessing and interrogating what is there. As one example, the Daisy Bates collection held in the Australian National Library, Barr Smith Library (SA) and Battye Library (WA) contains 8,600 pages of manuscripts with at least 123 speakers named, but in their current printed state we aren’t even sure how many languages are represented. By digitising the collection, researchers will be able to gain a better idea of what is present in it, and enrich this with additional information. I’m sure there are many other examples of this kind of work, and we’d love to hear about them in the comments section below!

As many of us present at the evening are currently working on language documentation projects, attention was paid to what we should be doing now to ensure that the corpora we build will be as useful as possible for future research. There is a need to ensure that any transcription tools in the workflow produce open formats, so the data can be used by others in the future. Collecting enough metadata to make clear who the speakers are in each recording, and the nature of the recordings, is also important. There was some concern that focusing overly on the corpus output might distract us from collecting as much data as possible. Transcribing and marking up recordings and texts is time-consuming work that often has little recognition in terms of academic outputs (although this will hopefully change). While modern tools and digital methods have certainly made the work we do more powerful, we are now expected to do more in less time, and building a well-structured corpus of data is just one expectation. This is not necessarily a bad thing – after all, many of the outputs discussed at the March LIP are possible with well-structured data as well – but the time cost of building better corpora during language documentation projects is still something that needs to be discussed.

There was also a concern that the realities of documenting a language do not always match the ideals of corpus linguistics. For languages that have had no prior documentation, the forms that are transcribed and the glosses applied to them may vary across multiple years of work. Also, future work can only address what has been captured. For example, there is potential for some excellent studies into the use of gesture in a wide range of languages, but this can only be done where video has been used for data collection. Likewise, many language documentation projects occur where more than one language is in use – meaning there is potential to study the relationship between the target language of the original project and any contact language or other languages in the recordings – but many projects will only transcribe the target language. This is not necessarily a criticism of current workflows, as it comes back to the discussion about the realities of fitting everything into the time available. It is, however, a reminder that we should strive to enrich the data we collect as much as possible, that not all existing corpora will be useful for all potential future research questions, and that even with the best-organised data additional labelling or analysis may be inevitable.

Finally, there was a discussion about whether people would be willing to share the corpora they have collected with other researchers. While everyone seemed happy to return to their own data with new questions, there is still a discomfort with sharing data with others. Hopefully, as more researchers make their data available and people become more comfortable discussing the content of language documentation corpora, there can be greater acknowledgement that such corpora are always a work in progress. This will hopefully lead to more interesting work being done with the outputs of existing projects.
