
Most of the UK seems to have been distracted over the past few weeks (and especially over the four-day long weekend that is just now drawing to an end) by the celebrations surrounding the Diamond Jubilee of Queen Elizabeth II.
Not so the hard-working team at the Endangered Languages Archive (ELAR) at SOAS, who have been curating and processing materials to add to our website.
In the past week the following nine deposits (in alphabetical order) have been added:
- Avatime from Ghana by Saskia van Putten and Rebecca Defina — video and audio recordings in various genres such as ceremonial events, personal stories, route descriptions, folk tales, conversations, recipes and speech elicited using various materials. Part of the corpus has been transcribed and translated using ELAN, and there is a word list in Toolbox format
- Baram from Nepal by Yogendra Prasad Yadava — audio and video files with annotations in ELAN and Toolbox, and metadata files in IMDI format
- Cappadocian from Greece by Mark Janse — a Greek-Turkish mixed language thought to have died in the 1960s until its rediscovery in 2005. The corpus includes digital audio and text files
- Chatino from Mexico by Anthony Woodbury — a collection of audio and video recordings of narratives, interviews, conversations, oratory, ritual speech, linguistic elicitations, and other genres in all major varieties of Chatino. The collection of almost 2,500 files also includes transcriptions, translations, and annotations of some of the recorded texts, data sets, word lists and analyses, academic papers, and pedagogical materials
- Glavda from Nigeria by Jonathan Owens — includes audio data based on interviews, free conversations and verbal art among speakers in the rural homeland, along with the language of Glavda speakers in Maiduguri, the largest urban center in the region, and the goal of considerable out-migration from the rural homeland
- Inuit sign from Canada by Joke Schuit — a collection of video stories of past and present life of deaf Inuit community members, and some elicitation tasks based on picture drawings and/or cartoon clips, plus descriptive documents and annotation files
- Ju|’hoan from Namibia by Megan Biesele — the corpus contains 150 audio recordings, 27 video recordings, transcriptions, language lessons and a dictionary. The materials cover 1970 to the present
- Middle Chulym from Siberia by David Harrison — includes unedited video, audio, photos, lexica, and field notes, as well as processed, edited and annotated recordings, scholarly articles, and a documentary film
- Langue des Signes Malienne from Mali by Victoria Nyst — video recordings of spontaneous narratives and dialogues by deaf signers, as well as semi-spontaneous discourse in response to cartoons and picture-based tasks, annotated in ELAN at the gloss, translation or abstract level
In addition to this we have also been working on our “person pages” on the ELAR website. Since ELAR first went online each depositor has been provided with a basic home page giving information about themselves, including links to their ELDP project information (if they are a grantee), their personal web site etc. (see for example Claire Bowern’s depositor page). We are now extending this to all registered users of the ELAR archive, who are being invited to set up and edit their own information (see my user page as a sample; there is even a picture hidden in one of the tabs!). In this way we are adding more social context to archive depositing and use. So, for example, if a user requests access to materials with the protocol of “S” (subscriber only), the depositor can consult the details on the requester’s ELAR user page to help decide whether this is an appropriate person to be given access to the requested materials (in the parlance of residents in my local area in London it could help in figuring out whether ‘e’s a diamond geezer or not). We are planning further developments in this area in the near future that we will report on when they are ready to go live.

If you are interested in Australian Aboriginal languages you might like to take a look at the growing number of collections of audio, video and text materials that are now available in the ELAR archive.
Currently there are six online collections (comprising almost 900 file bundles) for languages from northern Australia, with one more from central Australia that we are currently working on, and several others queued for processing. The following is a brief listing of what is available right now:
- Claire Bowern’s Yan-nhangu Language Documentation 1 from north-east Arnhem Land: 160 audio files, along with transcriptions of many of the recordings
- Claire Bowern’s Yan-nhangu Language Documentation 2: over 140 audio and video files, as well as some translations into English and Djambarrpuyŋu
- Clair Hill’s Paman languages: Umpila, Kuuku Ya’u, Kaanju from Cape York Peninsula, including over 70 stories
- Erich Round’s Documentation of Kayardild from Bentinck Island in the Gulf of Carpentaria: about 500 files, including audio, video, ELAN transcription files, and summary metadata
- Ruth Singer’s Mawng Dictionary Project from northern Arnhem Land: audio recordings of myths and stories about traditional customs, video recordings made by Elizabeth Langslow as part of a community video project, and materials checking dictionary definitions
- Jean-Christophe Verstraete’s Paman languages: Umpithamu, Morrobolam, Mbarrumbathama from Cape York Peninsula, with audio and video recordings and transcriptions of texts, along with lexical and grammatical elicitation
We have recently received Carmel O’Shannessy’s Traditional Warlpiri songs from Central Australia — six traditional Warlpiri love songs, called yilpinji, sung by Teddy Morrison Jupurrurla (transcribed and translated video and audio files) and two ceremonial initiation songs, sung by Peter Dixon Japanangka and a group of elder men (video and audio files). This collection is being curated and will be available on the ELAR website soon.
Several other Australian Aboriginal collections have been received from depositors and are being curated for addition to our archive. News about them will be circulated when they are available online.

In the past month (since my previous update post) the Endangered Languages Archive (ELAR) at SOAS has been moving ahead in leaps and bounds. We now have 66 deposits available on our website, with six more having been added on Monday this week. There are now 41,690 files available online, amounting to 2 terabytes (2,000 gigabytes) of audio, video, image, text and metadata materials.
Our user group has also jumped and now stands at 545; it has been increasing at the rate of 1 per day for the past month! It is exciting to see the rising numbers of people interested in using the endangered languages materials in ELAR.
This will probably be my last update about ELAR here (that’s right, you won’t have to read about “ELAR update update update”). We have just launched on Twitter (@ELARarchive) and Facebook (ELAR archive), so if you want to keep in touch with our activities in future you can follow us on Twitter or become our Facebook friend. And if you are not already a user, do sign up here.
Back in the old days when some of us were younger and starting out on our language documentation and description careers (for me in 1972, as described in this blog post) the world was pretty much analogue and we didn’t have digital hardware or software to think about.
Back then recordings were made with reel-to-reel tape recorders, like the Uher Report, or if you had really fancy kit a Nagra. For those of us working in Australia on Aboriginal languages you could archive your tapes at the Australian Institute of Aboriginal Studies (AIAS), as it then was, later the Australian Institute of Aboriginal and Torres Strait Islander Studies (AIATSIS). They would copy your tapes onto their archive masters and return the originals to you and all you, as a depositor, had to do was fill in tape deposit sheets. You were supplied with a book of these, alternately white and green, with a sheet of carbon paper to be placed between them. For each tape you had to complete a white sheet listing basic metadata and a summary of the contents of the tape, tear off the white copies (keeping the green carbon copy) and submit them to the AIAS archive. In addition, the Institute encouraged the preparation of tape audition sheets where the content of the tapes was summarised alongside time codes (in minutes and seconds) starting from the beginning of the tape. Sometimes these were created by the depositor and sometimes by the resident linguist (at that time Peter Sutton).
So, if you wanted to find out where in your stack of tapes you could find Story X by Speaker Y you simply had to look at the deposit sheets and/or the audition sheets.
Alas, those days are gone and we are in the digital world, where our experience is mediated via software interfaces that can fool us into seeing the world the way the interface presents it. For language documenters Toolbox is often the software tool of analytical choice (along with ELAN) for the processing and value adding analysis and annotation of recordings. As I claimed in a previous post, the existence of Toolbox means that for many documenters annotational value adding only means interlinear glossing, and alternatives such as overview or summary annotation (like the old tape audition sheets) are not part of their tool set. I have two pieces of evidence for this:
- the Endangered Languages Archive (ELAR) at SOAS has so far received around 100 deposits comprising roughly 800,000 files. Among these deposits there are many that are made up entirely of media files (plus basic cataloguing metadata) with no textual representation of the content of the files beyond a short description in the cataloguing metadata. When asked about annotations, depositors typically respond that they “are working on transcription and glossing” but because of the time needed they cannot provide anything now. They do not seem to consider an alternative, namely time-coded overview annotation which can (and probably should) be done for all the media files, only some of which would then be selected and given priority for interlinear glossing. Why? One reason might be because there is no dedicated software tool designed and set up to do this in an easy and simple manner (interestingly a tool that can be so used, and that produces nice time-coded XML output is Transcriber, though it is generally thought of as a tool for transcription annotation only — it also does not have a “reader mode” that would allow for easy viewing and searching across a set of overview annotations created with it);
- during training courses and public presentations over the past couple of years I have been warning that current approaches to language documentation risk the creation of “data dumps” (which I have also called “data middens”) because researchers are not well trained in corpus and workflow management and additionally suffer from ILGB or “interlinear gloss blindness”, which drives them to see textual value-adding annotation in terms of the interlinear glossing paradigm. The most recent example of such a presentation was during last month’s grantee training course at SOAS (the PowerPoint slides from my presentation are available on Slideshare). All but one of the grantees attending the training had never heard of, or considered creating, overview summary annotation before launching (selectively) into transcription and interlinear glossing of their recordings.
I may be wrong about the source of the current ILGB and perhaps Toolbox is not (solely) to blame, but I do believe that it plays a part in a narrowing of conceptual thinking about annotation in language documentation, and hence the behaviour of language documenters.
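To make the idea of time-coded overview annotation a little more concrete, here is a minimal sketch in Python of what a digital tape audition sheet might look like, together with a crude search across it. Everything in it is an assumption made for illustration: the `OverviewEntry` fields, the file names and the sample summaries are hypothetical, and it is not the format of Transcriber, ELAN or any archive.

```python
from dataclasses import dataclass


@dataclass
class OverviewEntry:
    """One line of a hypothetical digital audition sheet."""
    recording: str      # media file name (made up for this sketch)
    start_seconds: int  # offset from the beginning of the recording
    summary: str        # free-text summary of what happens from here on


def timecode(seconds: int) -> str:
    """Format an offset as minutes and seconds, as on the old audition sheets."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes:02d}:{secs:02d}"


def search(entries, term):
    """Return entries whose summary contains the term, ignoring case."""
    needle = term.casefold()
    return [e for e in entries if needle in e.summary.casefold()]


# Illustrative data only.
entries = [
    OverviewEntry("tape01.wav", 0, "Speaker Y introduces herself"),
    OverviewEntry("tape01.wav", 95, "Story X told by Speaker Y"),
    OverviewEntry("tape02.wav", 30, "Word list elicitation"),
]

for hit in search(entries, "story x"):
    print(hit.recording, timecode(hit.start_seconds), hit.summary)
```

Even something this simple answers the old question of where in the stack Story X by Speaker Y can be found, and it can be written for every recording long before any interlinear glossing is done.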
NB: Thanks to Andrew Garrett for his comments on my previous post that caused me to think more deeply about these issues and attempt to explicate and exemplify them more clearly here.
CALL FOR EXPRESSIONS OF INTEREST
DEVELOP A USER-FRIENDLY SEARCH INTERFACE AND TOUCHPAD APP FOR A DIGITAL ARCHIVE OF LITERATURE IN ABORIGINAL LANGUAGES
THE LIVING ARCHIVE PROJECT
Submission date: 30 April 2012
During the era of bilingual education in the NT, books were produced in 25 Literature Production Centres in more than 16 languages. These materials are widely dispersed and endangered, and contain interesting and significant stories in indigenous Australian languages, often beautifully illustrated. This is an important collection and must be preserved for the future. We are creating a living archive of these endangered materials, in partnership with the communities of origin. This archive will be stored in the Charles Darwin University eSpace repository (http://espace.cdu.edu.au/). With permission from the language owners, materials in the archive will be accessible to Aboriginal communities, academics and the world. As some users may not have high levels of text literacy or technical ability the archive will require a user-friendly visual interface to allow searches beyond the conventional database search capabilities.
- browse by image
users view thumbnails of the covers of books and roll-over to see a larger image with basic metadata and select items to view
- search by text
users start typing a word and resources are selectively filtered to retain only those with that sequence of characters in their metadata. For example typing dja would retain books in Djapu and Djambarrpuyŋu, as well as books by Djäwa and books about djamarrkuli or with the word djamarrkuli in their title.
- search by location
users click on an area on a map to retrieve all materials in that language or from that region
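The "search by text" behaviour described above can be sketched roughly as follows. This is only an illustration in Python under stated assumptions: the record fields (`title`, `language`, `author`), the placeholder records and the function names are all hypothetical, and nothing here reflects how the eSpace repository actually stores or searches metadata. One design point worth noting is that for dja to retain books by Djäwa, the comparison has to ignore diacritics, so the sketch folds them away before matching.

```python
import unicodedata


def fold(text: str) -> str:
    """Case-fold and strip combining diacritics, so 'Djäwa' becomes 'djawa'."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def filter_books(books, typed):
    """Keep only records with the typed sequence somewhere in their metadata."""
    needle = fold(typed)
    return [
        book for book in books
        if any(needle in fold(str(value)) for value in book.values())
    ]


# Placeholder records, invented for this sketch.
books = [
    {"title": "Placeholder title 1", "language": "Djapu", "author": "Author A"},
    {"title": "Placeholder title 2", "language": "Djambarrpuyŋu", "author": "Author B"},
    {"title": "Placeholder title 3", "language": "Language X", "author": "Djäwa"},
    {"title": "Djamarrkuli", "language": "Language Y", "author": "Author C"},
    {"title": "Placeholder title 5", "language": "Warlpiri", "author": "Author D"},
]

for book in filter_books(books, "dja"):
    print(book["title"], "|", book["language"], "|", book["author"])
```

In an interface, `filter_books` would be re-run on each keystroke so that the set of visible covers narrows as the user types.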
See the attached call for expressions of interest: EoI LAAL User-friendly search interface.
Read more about the project at the Living Archive of Aboriginal Languages.
Submit applications to livingarchive AT cdu.edu.au including samples of references from clients and an estimate of cost, by 30 April 2012.
As of this week the Endangered Languages Archive (ELAR) at SOAS has 52 online deposits available comprising around 51,000 files. There are 12,700 data bundles in the online collection, of which 6,000 are available to any registered user and a further 5,000 require access approval from the depositors. The number of users is now 515 with one or two people registering (via this web form) each week.
Recently we have been looking in the cupboards around SOAS and uncovering some interesting and valuable materials that we are digitising and hope to add to our online collection in ELAR. For example, one cupboard contained two tin cases with a set of 78rpm vinyl recordings of Zuaran Berber from eastern Libya recorded by T.F. (Terence Frederick) Mitchell (1919–2007) in the late 1950s. One of the speakers on the records is probably Mr. Ramadan Hadji Azzabi (cf. Mitchell 1953:28), who was T.F. Mitchell’s research assistant and who studied with him in London. Some of the recordings are conversations and these were published in Mitchell 2009. British Academy post-doctoral researcher Lameen Souag remarks that:
“Zuaran Berber is spoken only around the town of Zuwara in northeastern Libya. While its status has recently been improved by the removal of Qaddafi, its small population and the national and regional dominance of Arabic should qualify it as ‘threatened’ at the least. I’m not aware of any work on whether the language is being retained; such research has been quite impossible for the past fifty years or so.”
Bernard Howard, the Linguistics Department technical officer, has digitised the Zuaran Berber materials and we are investigating adding them to the collection in ELAR.
It will be interesting to see if there is material on other endangered languages in the SOAS cupboards.
References
Mitchell, T.F. 1953. Particle-Noun Complexes in a Berber Dialect (Zuara). Bulletin of the School of Oriental and African Studies 15:375-390.
Mitchell, T.F. 2009. Zuaran Berber (Libya) Grammar and Texts. Cologne: Ruediger Koeppe.
I just had a visit from a student wanting to deposit a collection of recordings made in the course of PhD fieldwork in the PARADISEC archive. It is a great shame that they are only just now thinking about how to deposit this material, as it will need considerable work to make it archivable. If they had sought advice before doing all of the research (or looked at the PARADISEC page ‘Depositing with PARADISEC’, or looked at the RNLD pages, e.g. http://www.rnld.org/node/40) it would have been so much easier for all of us. Why?
Continue reading ‘Retrofitting a collection? I’d rather not’ »
March 8, 2012, 2:08 pm by admin
PARADISEC now holds 177 collections containing 7,516 items and 59,083 files that are 5.59 TB in size. There are 3,310 hours of audio recordings in the collection. The catalog of these collections can be viewed via the Australian National Data Service, or the Open Language Archives Community or the Virtual Language Observatory.
Since our last report, Nick Fowler-Gilmore, the Audio Preservation Officer in the Sydney office, has completed the digitisation of Calvin Roesler’s tapes (CR1), the last of which were his 1959 recordings in Asmat. See the fieldnotes and a summary of the collection at http://www.paradisec.org.au/fieldnotes/ROES/web/ROES001.htm.
Continue reading ‘The latest stats at PARADISEC’ »

The latest, and fiftieth, deposit to go online at the Endangered Languages Archive (ELAR) is Trevor Johnston’s magnificent Auslan Corpus. Auslan is Australian Sign Language and the corpus consists of over 900 bundles (including over 850 video recordings) of one hundred native or near-native deaf signers filmed in pairs between mid-2004 and mid-2007 in five cities across Australia (Adelaide, Brisbane, Melbourne, Perth, Sydney). The materials include interviews, retelling stories, recalling personal events, responding to a questionnaire, engaging in spontaneous conversation, and responding in Auslan to various stimuli such as a picture-book story, a filmed cartoon, and a filmed story told in Auslan.
The corpus was originally deposited at the Endangered Languages Archive in late 2008 (following completion of an ELDP-funded Major Documentation Project) and during the intervening years Trevor Johnston and fellow researchers have been glossing, translating and annotating parts of the corpus using the software tool ELAN in order to make it machine readable and searchable. The corpus is now being published for researchers, signers and interested others to access; parts of the video deposit are publicly accessible and other parts are accessible to subscribers on application to the depositor. The glossing, translation and annotation work on the corpus will take many more years to complete and updates will be added as they become available.
In December 2011 PARADISEC hosted a conference titled ‘Sustainable data from digital research: Humanities perspectives on digital scholarship’. Presentations from that conference are now available as audio or video downloads from the following repository: http://ses.library.usyd.edu.au/handle/2123/7890. Ten of these presentations also include a peer-reviewed chapter in the conference proceedings.
See below for an RSS feed of all titles and links in the University of Sydney eScholarship Repository.
Continue reading ‘Sustainable data from digital research – presentations available’ »