Who uses digital language archives?

Over the past 10 years or so it has become increasingly common for researchers working on endangered languages to deposit their recordings and analysis (transcriptions, translations, annotations, dictionaries, grammars etc.) in a language archive.[1] In fact, in Himmelmann’s manifesto on language documentation (Himmelmann 1998, 2002, see also Himmelmann 2006) and Woodbury’s seminal articles (Woodbury 2003, 2011), archiving is seen as one of the fundamental characteristics that differentiate documentary linguistics from descriptive linguistics.

Language archiving of course has an older history than this, and physical archives (which store paper, tapes and other analogue objects) have existed for a very long time. Digital archives of language materials also go back to the beginning of the use of computers and digital storage for linguistic research. Today, however, there are several large digital language archives specifically dedicated to the curation, preservation and dissemination of endangered language materials, which together house many terabytes of audio and video recordings along with still images, text, and metadata.

There are a number of compelling reasons for researchers to archive their research materials. Johnson (2004, 2005) sets out the following:

  1. to preserve recordings of endangered/minority languages for future generations.
  2. to facilitate the re-use of materials for:
    • language maintenance and revitalization programs;
    • typological, historical, comparative studies;
    • any kind of linguistic, anthropological, psychological, etc. study that you yourself won’t do.
  3. to foster development of both oral and written literatures for endangered languages.
  4. to make known what documentation there is for which languages.
  5. to build your CV and get credit for all your hard work.

Johnson (2004:143) argues that:

“Archiving can be considered a form of publishing: even if the materials themselves are archived with highly restricted access conditions, the metadata … is published in the archive’s catalogue. You should list all materials that you have archived on your curriculum vitae, so that future employers will know how much work you have done.

Archived materials should also be cited in scholarly and other publications, just as we cite any other published work. This enables those who read a work to locate the primary materials on which that work is based. It also ensures that the speakers whose knowledge and artistry are preserved in the documentation materials are given proper credit for their contributions.”

So an interesting question is: who actually uses materials that have been deposited in a digital language archive, and what are they using them for? On 3rd March 2011 I wrote to the Alaska Native Language Archive (ANLA), the Archive of the Indigenous Languages of the Americas (AILLA), the Survey of California and Other Indian Languages, the DOBES archive, the Endangered Languages Archive (ELAR), and Paradisec, asking the following:

  1. “who uses EL archives? – researchers, community members, journalists, others? – roughly how many users do you have in a typical 12 month period?
  2. the relative proportions of use
  3. what do users typically want the materials for? — if digital copies are available do users make such copies?
  4. whether you keep track of secondary publication of archival materials downloaded from your archive, eg. use in creating teaching materials or in linguistic publications”

The responses I got[2] are summarised in the following paragraphs.

Gary Holton from ANLA replied:

Our numbers vary considerably year to year, and so far we are not keeping track of online usage. In-person visits average about 200 distinct visits per year. Length of visit varies considerably, from a few minutes to a few weeks. This number may under-report actual usage because many visitors from Native communities represent a project or village council and bring back materials to that project or village. In other words, a single representative may bring materials which are eventually used by many more people. About 5% of visitors are linguists. But … the few linguists who do use the archive tend to use it fairly intensely, often over a period of days or weeks. For usage, there are several common agendas: (1) to acquire materials in their language; (2) to acquire pedagogical materials; (3) oral history, usually focused on particular person/village; (4) songs. We don’t keep track of secondary products, but we should.

Andrew Garrett replied concerning the California Survey and the new online California Language Archive (CLA) that is currently being set up:

We don’t track visits. In the new CLA, registration will be required to view online content (as opposed to consulting the catalog), but even then we won’t track users by type. The vast majority of current users seem to be members of heritage communities. They definitely want to get copies of things, audio or paper, for their own use at home. Uses relate generally to language revival and cultural interest. We do not keep track of secondary publication arising from archival materials.

Paul Trilsbeek wrote back about the DOBES archive:

Over the year 2010 there were about 1200 unique visitors to the DOBES part of the archive accessing an archived resource (not counting visitors who only looked at metadata). The most active visitors come from universities or institutes where DOBES projects are located (so most likely DOBES project members themselves), followed by visitors from other universities and research institutes, followed by visitors from various commercial internet providers around the world. The latter group could be anyone really; community members, researchers at home, the general public. Looking at user registrations and access requests for resources that are not publicly accessible, the largest proportion of those come from linguists or researchers of related disciplines. Users who request access to closed data typically want it for research purposes. We do not actively keep track of whether the data has been used for secondary publications but we would like to see those materials included in the archive as well.

Ed Garrett and David Nathan provided information about ELAR:

The ELAR archive became publicly accessible in June 2010 and now has 312 registered users, of whom 60 are registered depositors. Around 1,000 file downloads have been made since the site went live nine months ago. There are currently 7,000 ‘bundles’ of files that are available to users, including 4,210 ‘bundles’ that contain files immediately accessible to any registered user. We do not track secondary publications that use archival materials but encourage users to deposit such materials with ELAR.

Based on this (admittedly limited) sample we can conclude that regionally-oriented archives like those in Alaska and California are essentially used by speaker communities or their descendants to access materials for cultural, historical or language-learning purposes. The DOBES archive is primarily used by researchers, particularly its depositors.[3] The ELAR archive has only been operating for a relatively short time, but most users seem to be depositors or other researchers. Unfortunately, no information is available from any of the archives about how downloaded or copied resources get used in secondary publications.

Note: Thanks to Andrew Garrett, Ed Garrett, Gary Holton, David Nathan and Paul Trilsbeek for responding so promptly and helpfully to my request for usage data on their archives. None of them is responsible for the content of this blog post.

  1. Indeed, in the case of research funded by the Volkswagen Foundation’s DOBES programme and ELDP, archiving is a contractual requirement of the grant.
  2. I did not receive replies from AILLA or Paradisec.
  3. The Volkswagen Foundation is funding a summer school that will include a workshop Language Documentation meets Corpus Linguistics: how to exploit DOBES corpora for descriptive linguistics and language typology? (September 27-28, 2011). According to the summer school website: “[t]he goal of this workshop is to bring together documentation linguists and corpus linguists in order to discuss and explore the question how DOBES corpora … can be used and exploited for descriptive (and typological) purposes by means of corpus linguistic methods.” Also, the form and potential uses of archival corpora are topics for discussion at a workshop to be held at SOAS on 18th November 2011 in conjunction with the LDLT-3 conference.

