Further to my post about the SOAS Endangered Languages Archive (ELAR) holding far fewer collections (70) than could be expected given the number of ELDP-funded projects (216), I thought it would be interesting to look at how much has actually been deposited in language archives now, after a decade and a half of language documentation theory.
To begin with, how do we know how much documentation is going on? Newman (1992 and 2005) reports 34 US departments running field methods courses. If we roughly estimate that there are a similar number, plus some, elsewhere in the world, then let us say (most likely somewhat generously) that there are 70 departments in the world that not only run field methods courses, but whose students and staff engage in fieldwork. Not all of those will be working on small (or endangered) languages, so let's say half of them are. If we look at the LLL conference in 2009, there were 180 abstracts submitted. The 2nd International Conference on Language Documentation and Conservation in 2011 had 230 abstracts. So, let's assume conservatively that there are at least 100 current fieldwork-based linguistic projects. If we extend this assumption back in time, we could assume an average of 30 per year since 1960, which over 51 years means there should be reasonable records of 1530 languages (OK, this is broad brush stuff, and projects go on for years, but 30 is a small and conservative estimate).
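The arithmetic behind that estimate can be laid out as a short Python sketch. The figures are the ones quoted above; the end year of 2010 is an assumption, chosen because 1960 through 2010 inclusive gives the 51 years that the 1530 total implies, and the function and variable names are purely illustrative.

```python
# Back-of-envelope figures quoted in the paragraph above.
WORLD_DEPARTMENTS = 70          # generous worldwide guess at fieldwork departments
SHARE_ON_SMALL_LANGUAGES = 0.5  # "let's say half of them are"
PROJECTS_PER_YEAR = 30          # assumed long-run average of new projects
START_YEAR = 1960
END_YEAR = 2010                 # assumption: last full year counted


def estimate_documented_languages(projects_per_year, start_year, end_year):
    """Multiply a yearly project rate by the number of years, inclusive."""
    years = end_year - start_year + 1
    return projects_per_year * years


departments_on_small_languages = int(WORLD_DEPARTMENTS * SHARE_ON_SMALL_LANGUAGES)
estimated_languages = estimate_documented_languages(
    PROJECTS_PER_YEAR, START_YEAR, END_YEAR
)

print("Departments working on small languages:", departments_on_small_languages)
print("Languages with reasonable records:", estimated_languages)
```

Changing PROJECTS_PER_YEAR or the year range shows how sensitive the total is to each guess, which is the point of calling it broad brush.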
Turning now to the digital language archives, DoBeS has 49 projects listed (http://www.mpi.nl/DOBES/projects/) and the archive lists 46 collections from DoBeS projects, but includes 18 donated corpora, which were not DoBeS funded. So it seems that DoBeS, too, does not contain the outcomes of all of its funded projects.
If we look at the OLAC archives, there are 45 listed. Of these, 16 have been active within the past six months, 19 within the past twelve months, and the rest have been inactive for the past twelve months. Fifteen archives have more than 1000 records, 11 have between 100 and 1000 records, and the remaining 19 have fewer than 100 records. It is not easy to distinguish OLAC archives that hold primary material from those that are indexes of material (Ethnologue, for example), so estimating the number of languages with significant amounts of recorded material in an archive is not immediately possible.
Clearly, not all records of linguistic research go into the archives listed here. Other archives (university libraries, specialist research collections and so on) provide homes for the material from local researchers. Ultimately, it would be ideal if such collections could all expose simple language metadata in a form that could be harvested by a service like OLAC. One suggestion that would require minimal effort would be for any webpage that includes language material to carry a footer or embedded metadata of a form something like 'olac xmlns="http://www.language-archives.org/OLAC/1.0/" ISO-639-3 XXX', where XXX stands for the ISO 639-3 code of the language.
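To give a sense of how little machinery a harvester would need, here is a minimal Python sketch that writes a footer of roughly the form suggested above and then pulls the language code back out of a page. The footer layout is only this post's informal suggestion, not an existing OLAC format, and make_footer, find_language_codes and the sample page are all hypothetical.

```python
import re

# Namespace string quoted in the suggestion above.
OLAC_NAMESPACE = "http://www.language-archives.org/OLAC/1.0/"

# Hypothetical footer layout: the namespace followed by one ISO 639-3 code
# (three lowercase letters).
FOOTER_PATTERN = re.compile(
    r'olac xmlns="' + re.escape(OLAC_NAMESPACE) + r'" ISO-639-3 ([a-z]{3})\b'
)


def make_footer(iso_code):
    """Build the suggested footer text for one language code."""
    return f'olac xmlns="{OLAC_NAMESPACE}" ISO-639-3 {iso_code}'


def find_language_codes(page_text):
    """Return every language code found in footers of the assumed form."""
    return FOOTER_PATTERN.findall(page_text)


# A made-up page tagged as containing English ("eng") material.
page = "<p>Recordings of a story.</p>\n<footer>" + make_footer("eng") + "</footer>"
codes = find_language_codes(page)
print(codes)
```

A real harvester would more likely rely on structured metadata exposed through an established protocol, but even a plain-text convention like this would let a crawler tally which languages have material online.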
The conclusion to be drawn from this sweeping set of generalisations is that there is still a long way to go to get linguists creating archival material and depositing it in archives. I reckon the best way to improve on this record is to make it easier for them to create the data in the first place, and then to deposit it.
Newman, Paul. 1992. Fieldwork and Field Methods in Linguistics. CA Linguistic Notes 23(2):1, 3-8. (Reprinted as Newman, Paul. 2009. Fieldwork and Field Methods in Linguistics. Language Documentation & Conservation 3(1):112-124. http://hdl.handle.net/10125/4428)
Newman, Paul. 2005. Field methods courses in linguistics. Paper presented at the Linguistic Society of America conference on Language Documentation: Theory, Practice, and Values. July 9–11, Harvard University.