How much room is there in the arc(hive)?

Forty-five years ago the annual fieldwork reports of some of the researchers funded by the then Australian Institute of Aboriginal Studies (now AIATSIS) included specifications of how much research had been completed in terms of the number of feet of tapes that had been recorded during the project year (“this year was especially productive with 45 feet 3 inches of tape being recorded”). The modern measure of this kind of quantitative nonsense is the number of gigabytes of digital files (soon to be terabytes) created by the researcher. Don’t mind the quality, it’s the length/bytes that count.
My colleague David Nathan, Director of the Endangered Languages Archive (ELAR) at SOAS, has been approached on several occasions by researchers, both those funded by ELDP and those not (yet) funded, asking how much data they would be allowed to deposit in the archive. “Would it be OK if I deposit 500 gigabytes of data?” they ask. When you think about it for a moment or two, this is a truly odd request, but one driven in part by what David (in Nathan 2004, see also Dobrin, Austin and Nathan 2007, 2009) has termed “archivism”. This is the tendency for researchers to think that an archive should determine their project outcomes. Parameters stated in terms of audio resolution and sampling rate, file format, and encoding standards take the place of discussions of documentation hypotheses, goals, or methods that are aligned with a project’s actual needs and intentions. David’s response to such a question is usually: if the material to be deposited is “good quality”, as judged by parameters (not volume!) that the project has established in discussion with ELAR, then the archive will be interested in taking it.
Another quantity that comes up in this context (and in the context of grant applications as well) is the statement that “10% of the deposited archival data will be analysed”. The remainder of the archive deposit will be, in the worst case, a bunch of media files, or in the best case, media files plus transcription (and/or translation). Where does this magical 10% come from? It seems to have originated around 10 years ago with the DOBES project which established a set of guidelines for language documentation during its pilot phase in 2000. As Wittenburg and Mosel (2004:1) state:

“During a pilot year intensive discussions … took place amongst the participants. The participants agreed upon a number of basic guidelines for language documentation projects. … For some material a deep linguistic analysis should be provided such that later researchers will be able to reconstruct the (grammar of the) language”

Similarly, the guidelines for ELDP grant applications (downloadable here) include the following:

“Note that audio and video are not usable, accessible or archivable without accompanying textual materials such as transcription, annotation, or notes about content and participants. While you are encouraged to transcribe and annotate as much of the material as possible, we recognise that this is very time-consuming and you may not be able to do this for all recorded materials. However, you must provide some text indication of the content of all recordings. This does not have to be the linguistic content and could include, for example, description of the topics or events (e.g. names of songs), or names of participants, preferably with time alignment (indication of where they occur in the recording).”

No actual figure is given of how much “some material” (for DOBES) or “as much of the material as possible” (for ELDP) amounts to. In earlier published versions of advice to applicants both DOBES and ELDP did mention 10%.
Interestingly, Wittenburg (2009, slide 34) has analysed the language documentation data collected by DOBES projects between 2000 and 2009. He notes that the average project team has recorded 131 hours of media (59 hours of audio, 72 hours of video), transcribed 50 hours of this, and translated 29 hours. Linguistic analysis exists on average for 14 hours of recordings: strikingly, that is about 10.7% of the average corpus!
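For readers who want to check the arithmetic, here is a minimal sketch in Python. It uses only the rounded hour figures quoted above from Wittenburg's slide; the variable and function names are my own and purely illustrative. On those rounded figures the analysed share comes out at roughly 10.7% of the recorded media.

```python
# Average per-project figures quoted from Wittenburg (2009, slide 34), in hours.
# These are the rounded values given in the post, not the underlying raw data.
audio_hours = 59
video_hours = 72
recorded_hours = audio_hours + video_hours  # 131 hours of media in total

# Hours of recordings that have reached each stage of processing.
stage_hours = {
    "transcribed": 50,
    "translated": 29,
    "linguistically analysed": 14,
}


def share_of_corpus(hours, total):
    """Return `hours` as a percentage of `total`."""
    return 100 * hours / total


# Percentage of the average corpus covered by each stage.
shares = {
    stage: share_of_corpus(hours, recorded_hours)
    for stage, hours in stage_hours.items()
}

for stage, pct in shares.items():
    print(f"{stage}: {pct:.1f}% of recorded media")
```

The same calculation shows how quickly coverage falls away: well under half of the average corpus is transcribed and under a quarter is translated.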
How much of the corpus needs to be linguistically annotated so that “later researchers will be able to reconstruct the (grammar of the) language” or indeed so that the rest of the corpus can be parsed? Well, it depends on a range of factors, including the nature of the language(s) being documented. Some Austronesian languages, like Sasak or Toratan, have relatively little morphology, and what morphology there is has fairly straightforward morpho-phonemics. For such languages a relatively small amount of morpheme-by-morpheme glossed material, in conjunction with a lexicon, would enable users to bootstrap the morphological analysis of other parts of a transcribed corpus. Other languages, like Athapaskan tongues with their fiendishly complex verb morphology, might need more annotated data to help the user deal with the whole corpus.
This is however an empirical question, and one that to my knowledge has not been addressed so far. There are now a number of documentary corpora available, with more coming on stream, and it should be possible to establish whether the “magical 10%” is a real goal to be aimed for, or just a figure that researchers have created and continue to repeat to one another.


Thanks to Anthony Jukes, David Nathan and Mandana Seyfeddinipur for discussion of some of the ideas presented here; none of them is responsible for the opinions expressed however.
Dobrin, Lise, Peter K. Austin and David Nathan. 2007. Dying to be counted: commodification of endangered languages. In Peter K. Austin, Oliver Bond and David Nathan (eds.) Proceedings of Conference on Language Documentation and Linguistic Theory, 59-68. London: SOAS. (online here)
Dobrin, Lise, Peter K. Austin and David Nathan. 2009. Dying to be counted: the commodification of endangered languages in language documentation. In Peter K. Austin (ed.) Language Documentation and Description, Volume 6, 37-52. London: SOAS.
Nathan, David. 2004. ‘Documentary linguistics: alarm bells and whistles?’, Seminar presentation, SOAS. 23 November 2004.
Wittenburg, Peter. 2009. Introduction to DOBES – Overview. Powerpoint slides, DOBES training course, June 2009.
Wittenburg, Peter and Ulrike Mosel. 2004. The DOBES Programme and its Contribution to Standardization and Revitalization. Paper presented at Linguapax 2004. (online here)

3 thoughts on “How much room is there in the arc(hive)?”

  1. Thanks for mentioning this issue, Peter. Fieldwork is notoriously hard to ‘measure’, and outcomes may not be seen for years after linguists have had time to mentally digest all the material and immaterial findings that come from fieldwork. So it’s obvious that granting agencies would look for something measurable to use in their reports and hence to serve as a requirement for linguists working on grants.
    I think that most linguists don’t take the measurement of their fieldwork output very seriously, instead seeing it as yet another grant hurdle to overcome. There are already so many seemingly arbitrary requirements in filing for grants and in fulfilling grant obligations once awarded that the archive quantity and 10% requirements are just one more thing in the pile. Certainly a good fieldworker knows intuitively when they’ve collected a reasonable amount of material, and when they have done intensive transcription and translation of enough to be somehow useful to others.
    On the other hand, there are plenty of people who aren’t experienced fieldworkers who will take such ‘measurements’ of fieldwork productivity as adequate for evaluation of one’s work. That issue leaves me with a sense of disquiet.
    Also something to consider is the fact that percentage is a relative quantity. Some folks I know have well over 1000 hours of recorded material, and I think nowhere near ten percent of that has been transcribed. Asking for someone to do the ten percent for this before being willing to accept it is a bit unreasonable.
    In addition, the ‘good quality’ requirement needs to be looked past sometimes too. If you stumble upon a recording made in the 1960s on a lousy reel-to-reel which has since been converted to cassette, and it’s got a bunch of noise in the background, but it’s of a person speaking a dialect long since lost or perhaps a person speaking in a style that is now extinct, wouldn’t this recording be far more valuable in some sense than a recording made just yesterday?

  2. Regarding those “pesky Athapaskan tongues with their fiendishly complex verb morphology”, I think we might find that the high degree of cognacy across the prefix complex puts a lower burden on the user to parse untranscribed verbs. With some knowledge of the language-specific prefix phonology an Athabaskanist can probably tease out the prefixes even in an unfamiliar Athabaskan language. This is much more difficult when there are huge differences in morphological structure across the family.

  3. James
    Many thanks for your detailed comments. I am currently preparing another post where I address the issue of depositing “1000 hours of recorded material” less than 10% of which is transcribed. As for “good quality”, note that I did not define what parameters the depositor and the archive might agree on for deciding it (other than that data volume is not one of them). For some thoughts from January 2008 about what the possible parameters might be, have a look here.
