Psst, want some data?

Last month I wrote a blog post about quantification in language documentation and “[h]ow much of the corpus needs to be linguistically annotated so that ‘later researchers will be able to reconstruct the (grammar of the) language’ or indeed so that the rest of the corpus can be parsed”. Note that I was talking about linguistic annotation (not just transcription) here, but in his very useful comments on my post, James Crippen wrote:

“Some folks I know have well over 1000 hours of recorded material, and I think nowhere near ten percent of that has been transcribed. Asking for someone to do the ten percent for this before being willing to accept it is a bit unreasonable.”

Well, the first thing I have to say is: 1000 hours is an awful lot of recordings. It’s about 7.5 times the average DoBeS corpus (based on the figure I mentioned in my previous post) and, if it’s video, it’s equivalent to around 550 feature-length movies (which average around 110 minutes each). If you spent every waking hour of the working week on it, with no time for eating, bathing, shopping, checking e-mail etc., it would take you six and a half months merely to watch or listen to it all, let alone create any metadata, analysis, transcription, or index (and remember that this is probably going to be in a language you don’t understand and with no subtitles). You’d want to have a good reason to do so, I reckon.
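
For anyone who wants to check the arithmetic, here is a rough back-of-the-envelope sketch in Python. The 1000 hours, the 110-minute film length and the 7.5 ratio come from the paragraph above; the 40-hour working week and the 52/12 weeks per month are assumptions introduced purely for illustration, so the viewing time it gives need not match the six and a half months quoted exactly.

```python
# Back-of-the-envelope check of the figures in the paragraph above.
TOTAL_HOURS = 1000            # size of the hypothetical collection (from the text)
FILM_MINUTES = 110            # average feature-film length (from the text)
DOBES_RATIO = 7.5             # "about 7.5 times the average DoBeS corpus" (from the text)

HOURS_PER_WORKING_WEEK = 40   # ASSUMPTION: not stated in the post
WEEKS_PER_MONTH = 52 / 12     # ASSUMPTION: average length of a calendar month

# Number of feature-length films with the same total running time.
films = TOTAL_HOURS * 60 / FILM_MINUTES

# Average DoBeS corpus size implied by the 7.5 ratio.
implied_dobes_hours = TOTAL_HOURS / DOBES_RATIO

# Time needed just to play everything back once, in real time.
weeks = TOTAL_HOURS / HOURS_PER_WORKING_WEEK
months = weeks / WEEKS_PER_MONTH

print(f"feature films: about {films:.0f}")
print(f"implied average DoBeS corpus: about {implied_dobes_hours:.0f} hours")
print(f"playback time: {weeks:.0f} working weeks, or about {months:.1f} months")
```

Under these assumptions the collection comes to roughly 545 films and a little under six months of playback; a slightly shorter working week pushes the figure past six months.
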
Anyway, be that as it may, James’ comment prompted me to seek some empirical data about this issue, so I wrote to five colleagues who are responsible for archives of materials on endangered languages, namely Peter Wittenburg of the DoBeS archive, Heidi Johnson of the Archive of the Indigenous Languages of Latin America (AILLA), Gary Holton of the Alaska Native Language Archive (ANLA), Nick Thieberger of the Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC), and David Nathan of the Endangered Languages Archive (ELAR) at SOAS. I asked them the following questions:

“If someone approached you about depositing 1000 hours of recorded digital data on some language, less than 10% of which was transcribed, what advice would you (Archive_Name) give them? What would be the minimal requirements that you would have in order to accept the materials for deposit?”


The five friendly archivists all wrote back (three of them within 24 hours!) and provided very helpful advice in response to these two questions. I’ll attempt to summarise it here, and point to some similarities and differences in practices (note that the archivists are not responsible for any inaccuracies in my summary).
Firstly, it’s important to note that all archives have a collection policy that determines what materials they are interested in and what they will not accept as a deposit, and these five are no exception. ANLA only collects materials on Alaskan and neighbouring languages, including especially Athabaskan-Eyak-Tlingit and Eskimo. AILLA accepts anything that is “in or about a real indigenous language of Latin America”, while PARADISEC is interested in, according to its website, “endangered materials from the Pacific region, defined broadly to include Oceania and East and Southeast Asia” (though Nick Thieberger points out that they also accept materials from elsewhere). DoBeS and ELAR are primarily set up to accept deposits from researchers funded by their partner granting organisations (Volkswagen Foundation and the Endangered Languages Documentation Programme respectively); however, they will also accept materials on endangered languages spoken anywhere in the world.
Secondly, all these archives are interested in materials that are usable (i.e. can be listened to or viewed comfortably) and have a demonstrable value, though how this is established varies among them. For PARADISEC, value is determined on a case-by-case basis, often with reference to an expert in the languages of the region, and includes whether there is some metadata about the recording’s contents and context (where it was recorded, when, by whom, and who is recorded). PARADISEC also includes uniqueness as a measure of value and is more likely to accept what may be the only recordings of a particular language. ANLA is interested in “recordings that have some real language content (i.e., not just tapes of … meetings in English with occasional native language)”. For AILLA, authenticity is also important. Heidi Johnson writes:

“we would not accept a recording of a Kuna chant performed by [an anthropologist], but we will happily accept anything created by Kuna speakers in the Kuna language or by non-Kunas in collaboration with Kuna speakers”.

David Nathan of ELAR gives the following criteria:

  • is it a coherent body of material (eg materials on one language/group/place) with information or motivation to group the materials together (ie., is it potentially a single deposit or perhaps multiple deposits)?
  • has the material been selected according to some criteria (eg. documentation value, somehow formulated)?
  • is (most of) the material on open access?
  • is there some symbolic information to allow identification and discovery of most of the content? (Note that this doesn’t have to be transcription, or even the actual linguistic form/content, but can indicate information about contributors, genres, events, key aspects of content etc.) – this could appear as metadata, annotations, or notes
  • has the depositor demonstrated their commitment to the value of the material, eg. by annotating, transcribing, or adding metadata of various types?

This question of depositor commitment to the materials is also mentioned by other archives. So Heidi Johnson of AILLA writes:

“As the years go by, I am becoming less interested in receiving boxes of ragged old scraps, mouldy tapes, manuscripts in pencil with sloppy handwriting, videos of empty chairs and roadside panoramas and other unplanned, incoherent, undocumented, badly lit ephemera. Note that I will still accept these things. They just go at the end of the queue and there is grumbling. I’m interested in … good clean audio, well-lit video, good contrast between text and background for manuscripts, digital texts that are easy to convert to PDF/A. And a metadata catalog with all the necessary information.”

David Nathan suggests that commitment might be the most important criterion when assessing whether to accept materials as an archive deposit:

“whether the depositor demonstrates their commitment seems to give the best and most flexible way to judge what we should take. After all, depositors generally know more about the value of materials than ELAR will (although sometimes I do ask specialists for a brief opinion). If they do not have any written up description of what the materials are – and are not willing to make an effort to provide one – then I take that as an indication of the value of the materials (to them, and thus probably overall).”

The DoBeS requirements for commitment are stated in more technical terms:

  • use of the Arbil tool to create proper metadata
  • the materials are encoded according to generally accepted standards (format, codec, etc)

So what about materials without much transcription? DoBeS did not answer this question, but all the other archives I contacted will accept recordings alone, provided there is associated metadata about their content and context, and ideally an expectation that transcriptions and annotations will be added later. As Gary Holton notes: “the task of sorting through thousands of hours of undocumented and untranscribed recordings represents a massive philological project”.
PARADISEC collects both legacy materials (recorded in the past) and work in progress by current researchers. Nick Thieberger writes:

“In principle we accept a collection of media without transcripts, and we have been able to negotiate virtually unlimited storage space at the National Computational Infrastructure so the cost of storage is not a consideration. If the circumstances of the collection’s existence were such that no transcriptions were possible (say they had been recorded by a patrol officer over twenty years whose interest was in recording rather than transcribing) then they would still be a very useful historical record. As we also accept collections of media from researchers from the field or shortly after their return from fieldwork, then we routinely do not have transcriptions, but hope that they will be deposited over time as the analysis proceeds. In the meantime, the deposited item has persistent identification that provides a good foundation for the research.”

The case of ANLA is rather special, however, as Gary Holton describes:

“From a global perspective, all of the languages [we deal with] are extremely well-documented. Moreover, both the Athabaskan and Eskimo families are relatively young, exhibiting great similarities in morphology (somewhat less so in syntax/discourse). What this means is that if I come across a recording without a transcription I am likely to be able to make some sense of it, either working on my own or in conjunction with a modern (semi-)speaker. This may not be the case for language families with little or no extant documentation. Of course it is always vastly easier to work with an annotated recording, but what we are finding today is that many of the old unannotated recordings are extremely valuable, as they document a stage in which the language was much more vibrant and viable. The last decades of the 20th century were critical for language decay in Alaska.”

So, to respond to James’ comment on my blog post, all the archives I contacted will accept recordings, with or without transcriptions, that fall within their collection policy and that are authentic and valuable, provided that there is accompanying metadata about the content and context of the materials. Ideally the depositor should show a level of commitment to the deposit by describing it and being willing to add information to it over time.
So, if you have 1000 hours of recordings (or 100 hours, or even 10 hours), contact an archivist today and start a conversation about how you can deposit your materials in a secure and properly managed facility that will make them available (subject to any expressed access and use restrictions) for current and future generations of researchers, community members and other interested users.


Note: My thanks to Gary Holton, Heidi Johnson, David Nathan, Nick Thieberger and Peter Wittenburg for their responses and for providing valuable advice and feedback. I alone am responsible for the content of this blog post.

2 thoughts on “Psst, want some data?”

  1. I want to stand up for the value of mouldy tapes…
    In 2007 I spent many hours auditioning a collection whose associated metadata consisted largely of a box of tape labels that had become separated from the tapes. The tapes had been preliminarily auditioned by someone who didn’t know the languages or places mentioned. A junk collection people might have thought. Turned out that there were lots of errors in the constructed metadata and this collection included language materials from languages we otherwise only knew from Daisy Bates or as names on the map. Now, this was not an easy collection to work with (not least I nearly threw up about halfway through because of the movement of the microphone and the wind noise irregularly oscillating between L and R headphone pieces — like trying to take detailed notes on a computer on a tinny in high seas!) and I’m very glad most archival recording collections aren’t like that, but by having the attitude that only good quality materials are accepted, there’s a real risk of losing people’s random tapes from the 1950s and 60s that they made when the elders were still alive.

  2. It’s a nice story Claire and clearly shows the value of someone making a commitment as you did and spending time to uncover and document what was on the tapes. Note that Heidi wrote: “I will still accept these things. They just go at the end of the queue and there is grumbling”. Paradisec also deals with these kinds of materials (see here) but makes it clear that there are real costs involved with this sort of material: “It can take several weeks to prepare such tapes for digital transfer” and provides a ball-park costing of A$150 per tape for baking and mould removal plus digitisation cost of A$170 per hour. At SOAS we have digitisation equipment and make it available (and provide training) for researchers to come in and digitise their own materials for deposit in the archive. We have already had several people do this. The point is, someone has to be committed enough and sure enough of the value of the materials to make it a priority to process them.
