Where are the records?

Further to my post about the SOAS Endangered Languages Archive (ELAR) holding far fewer collections (70) than might be expected given the number of ELDP-funded projects (216), I thought it would be interesting to look at how much has actually been deposited in language archives now, after a decade and a half of language documentation theory.

To begin with, how do we know how much documentation is going on? Newman (1992 and 2005) reports 34 US departments running field methods courses. If we roughly estimate that there is a similar number, plus some, elsewhere in the world, then let us say (most likely somewhat generously) that there are 70 departments in the world that not only run field methods courses, but whose students and staff engage in fieldwork. Not all of those will be working on small (or endangered) languages, so let’s say half of them are. The LLL conference in 2009 had 180 abstracts submitted, and the 2nd International Conference on Language Documentation and Conservation in 2011 had 230. So let’s assume conservatively that there are at least 100 current fieldwork-based linguistic projects. Extending this assumption back in time, we could assume an average of 30 projects per year since 1960, so over 51 years there should be reasonable records of 1530 languages (OK, this is broad brush stuff, and projects go on for years, but 30 is a small and conservative estimate).
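The arithmetic behind the 1530 figure can be spelled out as a rough sketch. The end year of 2011 is an assumption inferred from the conference date mentioned above, and the function name is purely illustrative.

```python
# Back-of-envelope estimate from the paragraph above.
START_YEAR = 1960
END_YEAR = 2011          # assumption: year of writing, inferred from the 2011 conference
PROJECTS_PER_YEAR = 30   # the post's "small and conservative" average


def estimated_languages(start_year, end_year, projects_per_year):
    """Naive estimate: each project-year yields records of one language."""
    years = end_year - start_year
    return years * projects_per_year


print(estimated_languages(START_YEAR, END_YEAR, PROJECTS_PER_YEAR))  # 1530
```

The estimate is linear in both inputs, so halving the yearly average to 15 would still leave 765 languages, which is the point of calling 30 conservative.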

Turning now to the digital language archives, DoBeS has 49 projects listed (http://www.mpi.nl/DOBES/projects/) and the archive lists 46 collections from DoBeS projects, but includes 18 donated corpora that were not DoBeS funded, so it seems that DoBeS also does not contain outcomes of all of its funded projects.

If we look at the OLAC archives, there are 45 listed: 16 have been active within the past six months, 19 within the past twelve months, and the rest have been inactive for the past twelve months. By size, 15 archives have more than 1000 records, 11 have between 100 and 1000 records, and the remaining 19 have fewer than 100 records. It is not easy to distinguish OLAC archives that hold primary material from those that are indexes of material (like Ethnologue, for example), so estimating the number of languages with significant amounts of recorded material in an archive is not immediately possible.

Clearly, not all records of linguistic research go into the archives listed here. Other archives provide homes for the material from local researchers (university libraries, specialist research collections and so on). Ultimately, it would be ideal if such collections could all expose simple language metadata in a form that could be harvested by a service like OLAC. One suggestion that would require minimal effort would be for any webpage that included language material to include a footer or embedded metadata of a form something like ‘olac xmlns="http://www.language-archives.org/OLAC/1.0/" ISO-639-3 XXX’.
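As a rough illustration of how lightweight such harvesting could be, here is a minimal sketch using only the Python standard library. The meta tag name `olac.language` is a hypothetical convention invented for this example, not part of any OLAC specification, and the helper names are likewise illustrative.

```python
from html.parser import HTMLParser

# Hypothetical convention (NOT an OLAC standard): a page declares its language
# with <meta name="olac.language" content="xxx">, where xxx is an ISO 639-3 code.
META_NAME = "olac.language"


class LanguageMetaParser(HTMLParser):
    """Collects three-letter language codes from matching <meta> tags."""

    def __init__(self):
        super().__init__()
        self.codes = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != META_NAME:
            return
        code = (attr.get("content") or "").strip().lower()
        # Only check the shape of the code; this does not validate it
        # against the ISO 639-3 code tables.
        if len(code) == 3 and code.isascii() and code.isalpha():
            self.codes.append(code)


def extract_language_codes(html):
    """Return the language codes declared in a page's <meta> tags."""
    parser = LanguageMetaParser()
    parser.feed(html)
    return parser.codes


page = '<html><head><meta name="olac.language" content="xxx"></head><body></body></html>'
print(extract_language_codes(page))  # ['xxx']
```

A harvester along these lines would only need to fetch pages and run the extractor, which is what would keep the effort minimal for the people publishing the material.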

The conclusion to be drawn from this sweeping set of generalisations is that there is still a long way to go to get linguists creating archival material and depositing in archives. I reckon the best way to improve on this record is to make it easier for them to deposit and to create the data in the first place.

Newman, Paul. 1992. Fieldwork and Field Methods in Linguistics. CA Linguistic Notes 23(2):1, 3-8. (Reprinted as Newman, Paul. 2009. Fieldwork and Field Methods in Linguistics. Language Documentation & Conservation 3(1):112-124. http://hdl.handle.net/10125/4428)
Newman, Paul. 2005. Field methods courses in linguistics. Paper presented at the Linguistic Society of America conference on Language Documentation: Theory, Practice, and Values. July 9–11, Harvard University.


  1. Tom Honeyman says:

    The conclusion to be drawn from this sweeping set of generalisations is that there is still a long way to go to get linguists creating archival material and depositing in archives. I reckon the best way to improve on this record is to make it easier for them to deposit and to create the data in the first place.

    It’d be good to see not just data, but also more metadata exposed through archives for projects in progress too (prior even to the creation of data). ELAR could do this for instance, by exposing their “projects” listing through the archive search engine. PARADISEC could do this by encouraging researchers to create “collection” level metadata early in their projects, rather than as data is being archived. Of course I think the actual departments where the work was taking place would have to get involved to make something like this happen. Otherwise, the creation of project or personal webpages for researchers is a bit patchy, which makes it an unreliable method of searching for work on languages. And waiting for published papers introduces a lag in notifying others of work underway.

  2. Peter Austin says:


    There is a comprehensive listing of grants awarded by ELDP here and you can read outline descriptions of them by clicking on any of the “more” links. So, for example, you can see here that a small grant was awarded in 2009-2010 for work on Akha: the ISO 639-3 code akh is given, along with the number of speakers, the goals, location and intended outcomes of the project. It is true, however, that this information is not currently indexed by ELAR.

    Note that there is also an ELAR map interface that shows deposits that are currently available, those being curated, and those at an earlier development stage. Clicking on any marker on the map brings up information about the deposit and links into the ELAR catalogue, which can then be searched.

  3. Peter Austin says:


    Actually, it is to be expected that ELAR’s list of available deposits might be less than the number of grants awarded by ELDP to date, particularly since quite a number of grants that were awarded over the past several years are still running, or haven’t even begun yet. For example, there is a project to document Sadu in Yuxi City, China, that was awarded in 2010 but only begins this year and runs until 2014 (notice it is on the ELAR map tagged as a “proposed” deposit).

    I think there is another, and possibly even better, way “to get linguists creating archival material and depositing in archives” and that is to reward them for doing so. The Linguistic Society of America has passed a resolution recognising the scholarly merit of language documentation that notes that scholarly work now includes “archives of primary data, electronic databases, corpora, critical editions of legacy materials, pedagogical works designed for the use of speech communities, software, websites, or other digital media”. It argues for “the recognition of these materials as scholarly contributions to be given weight in the awarding of advanced degrees and in decisions on hiring, tenure, and promotion of faculty”. Perhaps other professional societies and organisations could be invited to do this also. When it becomes normal for those wanting a job, tenure or a promotion to have their accessible archival deposits clearly indicated on their CVs then we will get those deposits flooding into archives.

    Notice that ELDP has wielded a small “stick” in its latest round of grant applications by requiring all applicants who had previously held an ELDP grant to have archived the results of their previous work before being eligible to apply again. This of course does not preclude researchers who have held grants with other agencies and not archived (yet) presenting themselves as “fresh faces” to ELDP. Again, I wonder if other funding agencies could consider including this in their grant application requirements?

  4. Tom Honeyman says:

    Peter: Yes! Agreed – Rewards are perhaps one of the best ways to encourage archiving.

  5. Peter Austin says:


    Additional encouragement is provided in the case of ELAR by recommending to ELDP grantees (and other depositors) that early in their project they send to the archive samples of their work (audio recordings, video recordings, annotation files, metadata spreadsheets etc.) for evaluation, advice and feedback. ELAR staff provide detailed feedback on the materials they receive and this not only encourages depositors but also establishes an ongoing relationship between the archive and the researcher which pays off for both over the longer term. We have been discussing ways to extend this in the future so that other colleagues (e.g. from the Academic Programme) can be involved in the evaluation and feedback process by taking part in deposit and pre-deposit reviews. This is a labour-intensive activity but has real value and positive outcomes in many ways. Note that in this model there is no “creat[ing] the data in the first place” (as Nick called it) but rather an ongoing conversation from which the data and analysis, and the archival deposit(s), emerge and develop, changing over time. My colleague David Nathan says that in his view archiving nowadays is primarily about relationships rather than preservation (though of course it includes that) and standardisation. His 2010 publications in Language Documentation and Description, Volume 7 and elsewhere elaborate on this new perspective, sometimes referred to as Archiving 2.0.
