More on language documentation corpora

We had an interesting discussion about documentation corpora in the course I taught last week for the LOT winter school at the Universiteit van Tilburg.
In the course I took the somewhat strong view that a documentary corpus minimally consists of: (a) media or text recordings (inscriptions), with (b) time-aligned transcription, and (c) time-aligned translation, and (d) relevant metadata about the documentation and communicative context. Thus, on this view, the 150 hours of untranscribed video collected by a project that one of the students is involved in is not part of any corpus (though it might be what Himmelmann (2006:10) calls ‘primary data’ (“recordings of observable linguistic behaviour and metalinguistic knowledge”), or what OLAC calls ‘a resource’, and it might become part of the corpus when it is worked on in the future). Neither is the audio recording of a 6-person conversation that another student made in Sri Lanka that neither he nor his consultants are able to transcribe. Media recordings without transcription or translation thus do not constitute data by themselves and don’t document anything. This view of what a corpus is also appears in the DoBeS guidelines as presented in Brugman 2003, available here, and on the HRELP website. A corpus can be enriched by annotation (see Bird and Liberman 2001) with the addition of linguistic information like morphemic analysis, morpheme-by-morpheme glosses, part of speech tags etc (see Schultze-Berndt 2006), or non-linguistic information like kinship relations or cultural practices etc (see Franchetto 2006).
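The minimal definition above can be read as a simple membership test. The following is only a rough sketch of that reading, not a description of any archive's actual schema: the names `Annotation`, `Session`, `missing_components` and `in_corpus` are hypothetical, and treating "time-aligned" as a start and end offset in seconds is my assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    """One time-aligned segment (hypothetical structure)."""
    start: float  # offset into the recording, in seconds (assumed unit)
    end: float
    text: str


@dataclass
class Session:
    """One recorded event and whatever has been added to it so far."""
    recording: str  # path or identifier of the media or text recording
    transcription: List[Annotation] = field(default_factory=list)
    translation: List[Annotation] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)


def missing_components(session: Session) -> List[str]:
    """Name the minimal components (a)-(d) that the session still lacks."""
    missing = []
    if not session.recording:
        missing.append("recording")
    if not session.transcription:
        missing.append("transcription")
    if not session.translation:
        missing.append("translation")
    if not session.metadata:
        missing.append("metadata")
    return missing


def in_corpus(session: Session) -> bool:
    """On the strong view, a session belongs to the corpus only if nothing is missing."""
    return not missing_components(session)
```

On this reading, an untranscribed video is a `Session` with only `recording` filled in: `in_corpus` returns `False` for it until transcription, translation and metadata are added, which matches the idea that such material might become part of the corpus later.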

I suggested in an earlier post that size may not be a useful criterion for determining the value of a documentary corpus. In class last week, we talked about what some evaluative criteria might be. Through discussions over a number of years, Robert Munro, David Nathan and I have come up with the following list of possible qualitative evaluative dimensions (in no particular order, and recognising that some may be in conflict) that could be applied to a documentary corpus:

comprehensiveness — to what degree does the corpus represent a range of speech event types and situations in which the language is used?
uniqueness — does the corpus contain material that is unusual or special in some way, or material that cannot be easily reproduced or collected again?
novelty — to what extent is the content of the corpus new, in the sense that it contains material never collected before?
usefulness and adaptability — can the corpus be used for a range of purposes? What range of potential users does it serve? Can the corpus be modified for uses other than those intended by the original collector? Can it be converted into other formats? To what degree does it meet the needs of stakeholders other than the collector?
ethics — was the corpus collected in a responsible manner in accordance with clearly stated ethical procedures? Are there explicitly stated protocols for access and use of the corpus? (See Holton 2005.)
organisation and management — here we might identify several dimensions (see Gibbon 2002 for discussion of one possible model for fieldwork linguists; there are also useful resources here, especially Nick Thieberger’s presentation):
- explicitness and robustness — is the corpus stored in a well-structured format that is portable (in the sense of Bird and Simons 2003) and transparent to other users? Are there explicit links between information in different parts of the corpus?
- consistency — are the annotation schemes (for transcription, glossing etc) applied rigorously across the whole corpus? Are the media recorded in a consistent manner?
- meaningfulness — do the annotation schemes have clear semantic interpretations?
- conventionality — is the representation of the data in the corpus in some commonly used or standardised format?
- preservability — is the corpus stored in such a way that it can be archived and preserved for future users? Munro 2005 sets out a 6-point scale of corpus archivability.
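Some of the organisation and management dimensions lend themselves to mechanical checks. As one illustration of the consistency dimension, here is a minimal sketch that looks for gloss labels not declared in a project's label inventory. Everything in it is an assumption for the sake of the example: the function name `undeclared_labels` is hypothetical, and it supposes that grammatical labels are written in capitals and separated by `-`, `=` or `.`, which will not hold for every project.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Set


def undeclared_labels(glosses: Iterable[str], inventory: Set[str]) -> Dict[str, int]:
    """Count grammatical labels that appear in glosses but not in the inventory.

    Assumes labels are upper-case and separated by '-', '=' or '.'.
    """
    counts: Counter = Counter()
    for gloss in glosses:
        for part in re.split(r"[-=.]", gloss):
            # Upper-case parts are treated as grammatical labels;
            # lower-case parts are treated as lexical glosses and skipped.
            if part.isupper() and part not in inventory:
                counts[part] += 1
    return dict(counts)
```

For example, with an inventory containing only `PL` and `PST`, the glosses `dog-PL`, `see-PST` and `go-PAST` would flag `PAST` as a label that is either missing from the inventory or an inconsistent variant of `PST`. A check like this says nothing about whether the labels are meaningful, only whether they are applied uniformly.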

Might these, and perhaps other dimensions, serve as a basis for a descriptive vocabulary for talking about documentary corpora? One of the concerns I heard expressed by the students in my LOT course, and by students and post-doctoral and other junior researchers at SOAS and elsewhere, is that corpus preparation work does not ‘count’ for the purposes of research evaluation, as demanded by our academic audit culture for things like job applications, promotion, tenure review etc. (Interestingly, despite the commodification of endangered languages research, a well-prepared corpus is one product that appears to have no value to the accounting system.)
As the LOT students noted, corpora tend to be left ‘messy’ or ‘incomplete’ or ‘half-done’ because researchers determine that time should be ‘better spent’ on the writing and publication of descriptive or theoretical materials which will be counted by the audit. If a review process that categorised corpora along these dimensions (or others) could be established, and given institutional backing, then perhaps the resulting evaluations would be a spur to getting corpus preparation judged more positively by everyone.
Note: thanks to Alexandra, Felix, Sebastian and Sonja for lively discussion in Tilburg. Rob Munro and David Nathan are not to be held responsible for any misuse of their ideas that I may have made.
References
Bird, Steven & Mark Liberman 2001 A formal framework for linguistic annotation. Speech Communication 33:23-60.
Bird, Steven & Gary Simons 2003 Seven dimensions of portability. Language 79:557-582.
Himmelmann, Nikolaus 2006 Language documentation: what is it and what is it good for? In Jost Gippert, Nikolaus Himmelmann & Ulrike Mosel (eds.) Essentials of Language Documentation, 1-30. Berlin: Mouton de Gruyter.
Franchetto, Bruna 2006 Ethnography in language documentation. In Jost Gippert, Nikolaus Himmelmann & Ulrike Mosel (eds.) Essentials of Language Documentation, 183-211. Berlin: Mouton de Gruyter.
Gibbon, Dafydd 2002 Ubiquitous multilingual corpus management in computational fieldwork. LREC Proceedings.
Holton, Gary 2005 Ethical practices in language documentation and archiving. OLAC Tutorial Archiving and linguistic resources or How to keep your data from becoming endangered. Linguistic Society of America annual meeting, Oakland CA.
Munro, Robert 2005 The digital skills of language documentation. In Peter K. Austin (ed.) Language Documentation and Description, Volume 3, 141-156. London: SOAS.
Schultze-Berndt, Eva 2006 Linguistic annotation. In Jost Gippert, Nikolaus Himmelmann & Ulrike Mosel (eds.) Essentials of Language Documentation, 213-251. Berlin: Mouton de Gruyter.
