At the recent Linguistic Society of America annual meeting in Chicago, Sandra Chung from the University of California, Santa Cruz gave an invited plenary address on the topic “How much can understudied languages really tell us about how language works?” She argued, among other things, that data from understudied languages should play a crucial role in the development of linguistic theory, since only by including them can we get a full picture of the array of phenomena found in human languages that need to be taken into account. She illustrated her talk with examples from her work on Chamorro, an endangered Austronesian language spoken on Guam.
During the question time following Sandy’s talk, one person made a comment along the following lines (I paraphrase, since I was rather stunned to hear the opinion being openly expressed before a linguistics audience, and don’t recall the exact formulation):
Linguistic research needs to concentrate on working with corpora, and for the sort of languages you were talking about, like Chamorro, you will never be able to put together a corpus of sufficient size to do anything meaningful. We should give up on the small (and disappearing) languages and concentrate on ones where we are likely to be able to get a decent-sized corpus.
There was quite a corpus buzz at the meeting (John Goldsmith gave an invited plenary talk entitled “Towards a new empiricism for linguistics” presenting his ideas about statistical corpus-based research), and I imagine many people had in mind ‘big language’ corpora in the 1 to 100 million word range (or perhaps even the two billion word corpus of English that the Oxford Dictionary folks have just compiled). At the Symposium on “Mobilizing Linguistic Resources Within Speaker Communities” (held after Sandy Chung’s talk), one of the presenters, Andrew Garrett, was explicitly asked by an audience member how big the text corpus was for Yurok, the indigenous Californian language that he has been working on for some years and which has been the focus of recent language revitalization and teaching efforts.
So, should we just pack up, stop wasting our time, and leave the small languages alone? How big does a corpus have to be in order to be useful?
A partial answer can be found in Friederike Luepke’s 2005 paper entitled ‘Small is beautiful: contributions of field-based corpora to different linguistic disciplines, illustrated by Jalonke’, published in Language Documentation and Description, Volume 3. Friederike shows how her Jalonke corpus of 7,000 intonation units (roughly 6,000 clauses) of transcribed and glossed text data can be explored quantitatively and qualitatively to uncover significant information on verb argument structure and alternations, genre-based variation, language contact phenomena, and language standardization tendencies. It is an impressive demonstration of the value of a richly annotated ‘small’ corpus.
Alternatively, there is Andrew Garrett’s response to the LSA Symposium question: the Yurok corpus of audio and text data is larger than the corpus for Luwian, an extinct Indo-European language that has played an important role in elucidating the Anatolian branch. It’s also bigger than the corpora for Palaic and several other languages that are ‘well respected’ in historical linguistics research.
Size is just one measure of value, and a pretty poor one, it seems to me, when it comes to corpora of endangered languages in particular.