Is mine big enough?

At the recent Linguistic Society of America annual meeting in Chicago, Sandra Chung from the University of California, Santa Cruz gave an invited plenary address on the topic “How much can understudied languages really tell us about how language works?” She argued, among other things, that data from understudied languages should play a crucial role in the development of linguistic theory, since only by including them can we get a full picture of the array of phenomena found in human languages that need to be taken into account. She illustrated her talk with examples from her work on Chamorro, an endangered Austronesian language spoken on Guam.
During the question time following Sandy’s talk, one person commented something along the following lines (I paraphrase, since I was rather stunned to hear the opinion being openly expressed before a linguistics audience, and don’t recall the exact formulation):

Linguistic research needs to concentrate on working with corpora and for the sort of languages you were talking about, like Chamorro, you will never be able to put together a corpus of sufficient size to be able to do anything meaningful. We should give up on the small (and disappearing) languages and concentrate on ones where we are likely to be able to get a decent sized corpus.


There was quite a corpus buzz at the meeting (John Goldsmith gave an invited plenary talk entitled “Towards a new empiricism for linguistics” presenting his ideas about statistical corpus-based research), and I imagine many people had in mind ‘big language’ corpora in the 1 to 100 million word range (or perhaps even the two billion word corpus of English that the Oxford Dictionary folks have just compiled). At the Symposium on “Mobilizing Linguistic Resources Within Speaker Communities” (held after Sandy Chung’s talk) one of the presenters, Andrew Garrett, was explicitly asked by an audience member how big the text corpus was for Yurok, the indigenous Californian language that he has been working on for some years and which has been the focus of recent language revitalization and teaching efforts.
So, should we just pack up, stop wasting our time, and leave the small languages alone? How big does a corpus have to be in order to be useful?
A partial answer can be found in Friederike Luepke’s 2005 paper entitled ‘Small is beautiful: contributions of field-based corpora to different linguistic disciplines, illustrated by Jalonke’, published in Language Documentation and Description, Volume 3. Friederike shows how her Jalonke corpus of 7,000 intonation units (roughly 6,000 clauses) of transcribed and glossed text data can be explored quantitatively and qualitatively to uncover significant information on verb argument structure and alternations, genre-based variation, language contact phenomena, and language standardization tendencies. It is an impressive demonstration of the value of a richly annotated ‘small’ corpus.
Alternatively, there is Andrew Garrett’s response to the LSA Symposium question: the Yurok corpus of audio and text data is larger than the corpus for Luwian, an extinct Indo-European language that has played an important role in elucidating the Anatolian branch. It’s also bigger than the corpora for Palaic and several other languages that are ‘well respected’ in historical linguistics research.
Size is just one measure of value, and a pretty poor one, it seems to me, when it comes to endangered language corpora in particular.

1 thought on “Is mine big enough?”

  1. Andrew Taylor contacted me with the following information which he has given me permission to reproduce here:
    “Your recent contribution on corpus size reminded me of a paper given by Leonard Newell of SIL Philippines at a conference on lexicography in Manila in 1992, in which he discussed this issue. I no longer have the paper, alas, but if I remember correctly he then suggested aiming for a corpus of a million words. The paper, ‘Computer processing of texts for lexical analysis’, was published in the conference proceedings (Papers from the First Asia International Lexicography Conference. Manila: Linguistic Society of the Philippines Special Monograph No. 35).
    Then, in his Handbook on Lexicography for Philippine and Other Languages (Linguistic Society of the Philippines, Special Monograph No. 36, 1995) the third chapter, Developing a textual corpus, deals with a range of issues involved in compiling a useful corpus. The last section is 3.8, The size of the corpus for a modest project. By this time, his suggestion was for a somewhat larger corpus.
    He estimated that a keyboarder could, conservatively, collect, enter, and do a spelling edit on about one million words of text in a year, and goes on to say ‘Based on the experience of the Romblomanon project, a corpus yielding about three million morphemes is considered both attainable and adequate to meet the needs of a modest lexicographic project on a lesser-known language’ (p.43). However, he does acknowledge the limitations of human and financial resources which usually apply to projects on languages with small numbers of speakers. (I notice the change from words to morphemes in this paragraph, which would affect the count.)
    I am not suggesting his view is correct, and he may well have changed it subsequently, but it is an interesting early attempt to quantify the problem.”
