Over the last year or so, the Paradisec team, in collaboration with software developers Robot Parade, Silvia Pfeiffer and John Ferlito, have been working on a replacement for our ageing catalogue and database systems. A couple of weeks ago, this work culminated in the release of the new catalogue.
There are several features of the software that represent a vast improvement over the previous catalogue, including a much simpler search function, for both collections and individual items, a Google Maps API as a way of exploring data (shown here) and, most usefully for our depositors, the ability to play their own files straight through the browser, or download them directly from the archive.
At the same time, the Paradisec team have been working with the Australian National Data Service (ANDS) to provide collection-level metadata to Research Data Australia (RDA), a federal initiative to aggregate metadata from research around the country using a standard metadata format and make it searchable and discoverable via the RDA website. The new catalogue automates this process, so that when a collection has the required metadata to maximise discoverability, it is harvested by RDA and appears in their database. As such, it is important that depositors, or managers of others’ data, provide metadata that is as rich as possible.
While the software is still undergoing some post-release bug fixes and improvements, we welcome the public to explore the breadth and depth of data in the collection. Access to the data itself, however, is restricted to collectors and managers of those collections in line with Paradisec policy. Access to other people’s data is subject to access conditions.
We encourage interested readers to explore the collection, and we especially invite collectors of relevant data to get in contact with us to investigate depositing their collections with us for safekeeping.
As we approach our tenth year of operation, it is gratifying that PARADISEC has achieved this seal of approval (DSA), based on 16 criteria (listed below, and see how we meet these criteria here: https://assessment.datasealofapproval.org/assessment_75/seal/html/). We have been a five-star Open Language Archives Community repository for some time, which also means that we are one of the 1800 archives whose catalog and metadata conform to the Open Archives Initiative standards, but the DSA looks more broadly at the whole process of the repository, from accession of records, through their description and curation, to disaster management. This is important for our depositors to know, as they can be sure that their research output is properly described and curated, and can be found using various search tools, including Google, but more specifically the Australian National Data Service, OLAC and WorldCat, and also the aggregated information served in the Virtual Language Observatory.
Continue reading ‘PARADISEC’s ‘Data Seal of Approval’’ »
Further to my last post, I’ve read on, and my disappointment has only deepened at the treatment of the Sydney Language in Ross Gibson’s 26 views of the starburst world.
Think about the notes you made when you were getting into learning an undocumented language … Imagine they get archived and in a century or two someone looks through them and tries to work out what was going on when you made the notes. With only shreds of metadata and general knowledge of the historical period to go on, the future reader makes inferences from the content. Could a cluster of words in one of your vocabulary lists point to a hunch you were checking? Or a sequence of illustrative sentences could be the skeletal narrative of a memorable experience shared with your teachers.
Continue reading ‘Bursting through Dawes (2)’ »
A two-day conference titled ‘Charting Vanishing Voices: A Collaborative Workshop to Map Endangered Oral Cultures’ ran on June 29/30 in Cambridge, UK. Organised by the World Oral Literature Project, the conference brought together a range of ‘scholars, digital archivists and international organisations to share experiences of mapping ethno-linguistic diversity using interactive digital technologies.’
A discussion of the conference at the Arctic Anthropology blog gives a good overview, so, rather than duplicate what you can read there, I’ll just add some useful pointers to things I discovered at the conference:
– I didn’t know that the data under the UNESCO atlas of endangered languages can be downloaded freely.
– There can be a model of user-pays for information that adds value to open source material and is commercially viable (Alexander Street Press).
– There are students at SOAS who have produced a great website of geocoded language information.
– The Glottolog/langdoc project has 175,000 references linked to 94,000 of what they call ‘languoids’ (languages, dialects, families).
– There is a great project at the CNRS for making media available online (http://telemeta.org/) and for annotating it. They also use Vamp plugins, which look interesting as a way of analysing and extracting information from audio files.
Lauren Gawne recaps last night’s Linguistics in the Pub, a monthly informal gathering of linguists in Melbourne to discuss topical areas in our field.
This week at Linguistics in the Pub it was all about technology, and how it impacts on our practices. The announcement for the session briefly outlined some of the ways technology has shaped expectations for language documentation:
The continual developments in technology that we currently enjoy are inextricably connected to the development of our field. Most would agree that technology has changed language documentation for the better. But while nobody is advocating a return to paper and pen, most would concur that technology has changed the way we work in unexpected ways. The focus is usually on the materials we produce such as video, audio and annotation files as well as particular types of computer-aided analysis. In a recent ELAC post, ‘Hammers and nails’, Peter Austin claims that metadata is not what it was, in the days of good old reel-to-reel tape recorders. The volume of comments suggests that this topic is ripe for discussion. This session of Linguistics in the Pub will give us a chance to reflect on how our practices change with advances in technology.
There are a (very) few linguists who advocate that researchers should go to the field with nothing beyond a spiral-bound notebook and a pen, though no one at the table was quite willing to go that far; all of us, it seems, go to the field with a good quality audio recorder at the very least. Without the additional recordings (be they audio or video) the only output of the research becomes the final papers written by the linguist, which are in no way verifiable. The recording of verifiable data, and the slowly increasing practice of including audio recordings in the final research output, are allowing us to further stake our claim as an empirical and verifiable field of scientific inquiry. Many of us shared stories of how listening back to a recording that we had made enriched the written records that we have, or allowed us to focus on something that wasn’t the target of our inquiry at the time of the initial recording. The task of trying to do the level of analysis that is now expected for even the lowliest sketch grammar is almost impossible without the aid of recordings, let alone trying to capture the subtleties present in naturalistic narrative or conversation.
Continue reading ‘Technology and language documentation: LIP discussion’ »
The Endangered Languages Archive (ELAR) at SOAS reaches an important milestone this week when our 100th deposit goes online. We will be working on a further 10 deposits and doing additional curation work on those currently online over the next two months.
ELAR now has 4 terabytes (4,000 gigabytes, double what I reported in April) of language, music and cultural data and analysis online and available for registered users (and, of course, fully preserved on our storage area network and securely backed up). Some material requires subscription, but we have now implemented an online subscription system that enables user requests to be easily made, with the depositor being automatically asked for permission to access the relevant files.
The following is a list of our recent additions in alphabetical order of the main language:
- Ainu from Japan — Documentation of the Saru dialect of Ainu by Anna Bugaeva
- Bajjika from India — Bajjika: Swadesh List Elicitation Sessions by Jay Huweiler
- Baram from Nepal — Linguistic and ethnographic documentation of Baram by Yogendra Prasad Yadava
- Bedik from Senegal — Documentation of Bedik by Adjaratou Oumar Sall
- Choguita Rarámuri from Mexico — Choguita Rarámuri description and documentation by Gabriela Caballero
- Kabardian from Turkey — Documentation and Analysis of Kabardian as Spoken in Turkey by Ayla Applebaum Bozkurt
- Kiksht from USA — Conversational Kiksht by Nariyo Kono
- Kunwinjku from Australia — Itpi-itpi songs in Kunwinjku, Mawng and Kunbarlang by Linda Barwick
- Kunwinjku from Australia — Karrbarda songs in Kunwinjku and Kunbarlang by Linda Barwick
- Lakandon from Mexico — Temporal Reference in Lakandon Maya by Henrik Bergqvist
- Mawng from Australia — Mirrijpu (seagull) songs in spirit-language Manangkarri by Linda Barwick
- Mawng from Australia — Inyjalarrku songs in spirit-language Mawng by Linda Barwick
- Nahua from Peru — Documentation of mythology and shamanic songs of the Nahua by Conrad Feather
- North Ambrym from Vanuatu — A documentation of North Ambrym, a language of Vanuatu and research into its possessive structures by Michael Franjieh
- Paman from Australia — Paman languages: Umpila, Kuuku Ya’u, Kaanju by Clair Hill
- Paresi-Haliti from Brazil — Verbal events in Paresi-Haliti by Glauber Romling da Silva
- Pingelapese from the Marshall Islands — Pingelapese language data by Ryoko Hattori
- Tseltal from Mexico — Ethnographic and discursive audiovisual corpus of Tseltal by Gilles Polian
- Yami from Taiwan — Yami Documentation by Meng Chien Yang and Der-Hwa Victoria Rau
- Yoloxochitl Mixtec from Mexico — Yoloxochitl Mixtec stories and other oral traditions by Jonathan Amith
The following deposits are in curation and will be available soon:
- Alipur Village sign language from India — Investigation of an endangered village sign language in India by Sibaji Panda
- Bom and Kim from Sierra Leone — Documentation of Kim and Bom Languages of Sierra Leone by Tucker Childs
- Asheninka Perene from Peru — Asheninka Perene (Arawak) 2010 collection, from eastern Peru by Elena Mihas
- Desano from Brazil — Desano – audio and video materials by Wilson Silva
- Enindhilyakwa from Australia — Documentation of Enindhilyakwa by Marie van Egmond
- Ingrian from Russia — Ingrian narratives and elicitations by Fedor Rozhanskiy and Ilya Nikolaev
- Ingrian from Russia — Ingrian, Votic, and Ingrian Finnish elicitations and conversations by Fedor Rozhanskiy and Mehmet Muslimov
- Ingrian from Russia — Ingrian narratives recorded in 2011 by Fedor Rozhanskiy and Elena Markus
- Mmani from Guinea — Documentation of the moribund language Mmani, a Southern Atlantic language of Niger-Congo by Tucker Childs
- Solega from India — Documentation of Solega by Aung Si
We will be having a small celebratory party at ELAR this week to mark what is a pretty significant milestone for us.
Most of the UK seems to have been distracted over the past few weeks (and especially over the four-day long weekend that is just now drawing to an end) by the celebrations surrounding the Diamond Jubilee of Queen Elizabeth II.
Not so the hard-working team at the Endangered Languages Archive (ELAR) at SOAS, who have been curating and processing materials to add to our website.
In the past week the following nine deposits (in alphabetical order) have been added:
- Avatime from Ghana by Saskia van Putten and Rebecca Defina — video and audio recordings in various genres such as ceremonial events, personal stories, route descriptions, folk tales, conversations, recipes and speech elicited using various materials. Part of the corpus has been transcribed and translated using ELAN, and there is a word list in Toolbox format
- Baram from Nepal by Yogendra Prasad Yadava — audio and video files with annotations in ELAN and Toolbox, and metadata files in IMDI format
- Cappadocian from Greece by Mark Janse — a Greek-Turkish mixed language thought to have died in the 1960s until its rediscovery in 2005. The corpus includes digital audio and text files
- Chatino from Mexico by Anthony Woodbury — a collection of audio and video recordings of narratives, interviews, conversations, oratory, ritual speech, linguistic elicitations, and other genres in all major varieties of Chatino. The collection of almost 2,500 files also includes transcriptions, translations, and annotations of some of the recorded texts, data sets, word lists and analyses, academic papers, and pedagogical materials
- Glavda from Nigeria by Jonathan Owens — includes audio data based on interviews, free conversations and verbal art among speakers in the rural homeland, along with the language of Glavda speakers in Maiduguri, the largest urban center in the region, and the goal of considerable out-migration from the rural homeland
- Inuit sign from Canada by Joke Schuit — a collection of video stories of past and present life of deaf Inuit community members, and some elicitation tasks based on picture drawings and/or cartoon clips, plus descriptive documents and annotation files
- Ju|’hoan from Namibia by Megan Biesele — the corpus contains 150 audio recordings, 27 video recordings, transcriptions, language lessons and a dictionary. The materials cover 1970 to the present
- Middle Chulym from Siberia by David Harrison — includes unedited video, audio, photos, lexica, and field notes, as well as processed, edited and annotated recordings, scholarly articles, and a documentary film
- Langue des Signes Malienne from Mali by Victoria Nyst — video recordings of spontaneous narratives and dialogues by deaf signers, as well as semi-spontaneous discourse in response to cartoons and picture-based tasks, annotated in ELAN at the gloss, translation or abstract level
In addition to this we have also been working on our “person pages” on the ELAR website. Since ELAR first went online each depositor has been provided with a basic home page giving information about themselves, including links to their ELDP project information (if they are a grantee), their personal web site etc. (see for example Claire Bowern’s depositor page). We are now extending this to all registered users of the ELAR archive who are being invited to set up and edit their own information (see my user page as a sample — there is even a picture hidden in one of the tabs!). In this way we are adding more social context to archive depositing and use. So, for example, if a user requests access to materials with the protocol of “S” (subscriber only) the depositor can access the details on their ELAR user page in order to assist in deciding if this is an appropriate person to be given access to the requested materials (in the parlance of residents in my local area in London it could help figuring out whether ‘e’s a dimond geezer or not). We are planning further developments in this area in the near future that we will report on when they are ready to go live.
If you are interested in Australian Aboriginal languages you might like to take a look at the growing number of collections of audio, video and text materials that are now available in the ELAR archive.
Currently there are six online collections (comprising almost 900 file bundles) for languages from northern Australia, with one more from central Australia that we are currently working on, and several others queued for processing. The following is a brief listing of what is available right now:
- Claire Bowern’s Yan-nhangu Language Documentation 1 from north-east Arnhem Land — 160 audio files, along with transcriptions of many of the recordings
- Claire Bowern’s Yan-nhangu Language Documentation 2 — over 140 audio and video files, as well as some translations into English and Djambarrpuyŋu
- Clair Hill’s Paman languages: Umpila, Kuuku Ya’u, Kaanju from Cape York Peninsula — including over 70 stories
- Erich Round’s Documentation of Kayardild from Bentinck Island in the Gulf of Carpentaria — about 500 files, including audio, video, ELAN transcription files, and summary metadata
- Ruth Singer’s Mawng Dictionary Project from northern Arnhem Land — audio recordings of myths and stories about traditional customs, video recordings made by Elizabeth Langslow as part of a community video project, and materials checking dictionary definitions
- Jean-Christophe Verstraete’s Paman languages: Umpithamu, Morrobolam, Mbarrumbathama from Cape York Peninsula — audio and video recordings and transcriptions of texts, along with lexical and grammatical elicitation
We have recently received Carmel O’Shannessy’s Traditional Warlpiri songs from Central Australia — six traditional Warlpiri love songs, called yilpinji, sung by Teddy Morrison Jupurrurla (transcribed and translated video and audio files) and two ceremonial initiation songs, sung by Peter Dixon Japanangka and a group of elder men (video and audio files). This collection is being curated and will be available on the ELAR website soon.
Several other Australian Aboriginal collections have been received from depositors and are being curated for addition to our archive. News about them will be circulated when they are available online.
In the past month (since my previous update post) the Endangered Languages Archive (ELAR) at SOAS has been moving ahead with leaps and bounds. We now have 66 deposits available on our website, with six more having been added on Monday this week. There are now 41,690 files available online, amounting to 2 terabytes (2,000 gigabytes) of audio, video, image, text and metadata materials.
Our user group has also jumped and now stands at 545; it has been increasing at the rate of 1 per day for the past month! It is exciting to see the rising numbers of people interested in using the endangered languages materials in ELAR.
This will probably be my last update about ELAR here — that’s right, you won’t have to read about “ELAR update update update” :-). We have just launched on Twitter (@ELARarchive) and Facebook (ELAR archive) so if you want to keep in touch with our activities in future you can follow us on Twitter or become our Facebook friend. And if you are not already a user do sign up here.
Back in the old days when some of us were younger and starting out on our language documentation and description careers (for me in 1972, as described in this blog post) the world was pretty much analogue and we didn’t have digital hardware or software to think about.
Back then recordings were made with reel-to-reel tape recorders, like the Uher Report, or if you had really fancy kit a Nagra. For those of us working in Australia on Aboriginal languages you could archive your tapes at the Australian Institute of Aboriginal Studies (AIAS), as it then was, later the Australian Institute of Aboriginal and Torres Strait Islander Studies (AIATSIS). They would copy your tapes onto their archive masters and return the originals to you and all you, as a depositor, had to do was fill in tape deposit sheets. You were supplied with a book of these, alternately white and green, with a sheet of carbon paper to be placed between them. For each tape you had to complete a white sheet listing basic metadata and a summary of the contents of the tape, tear off the white copies (keeping the green carbon copy) and submit them to the AIAS archive. In addition, the Institute encouraged the preparation of tape audition sheets where the content of the tapes was summarised alongside time codes (in minutes and seconds) starting from the beginning of the tape. Sometimes these were created by the depositor and sometimes by the resident linguist (at that time Peter Sutton).
So, if you wanted to find out where in your stack of tapes you could find Story X by Speaker Y you simply had to look at the deposit sheets and/or the audition sheets.
Alas, those days are gone and we are in the digital world, where our experience is mediated via software interfaces that can fool us into seeing the world the way the interface presents it. For language documenters Toolbox is often the software tool of analytical choice (along with ELAN) for the processing and value adding analysis and annotation of recordings. As I claimed in a previous post, the existence of Toolbox means that for many documenters annotational value adding only means interlinear glossing, and alternatives such as overview or summary annotation (like the old tape audition sheets) are not part of their tool set. I have two pieces of evidence for this:
- the Endangered Languages Archive (ELAR) at SOAS has so far received around 100 deposits comprising roughly 800,000 files. Among these deposits there are many that are made up entirely of media files (plus basic cataloguing metadata) with no textual representation of the content of the files beyond a short description in the cataloguing metadata. When asked about annotations, depositors typically respond that they “are working on transcription and glossing” but because of the time needed they cannot provide anything now. They do not seem to consider an alternative, namely time-coded overview annotation which can (and probably should) be done for all the media files, only some of which would then be selected and given priority for interlinear glossing. Why? One reason might be because there is no dedicated software tool designed and set up to do this in an easy and simple manner (interestingly a tool that can be so used, and that produces nice time-coded XML output is Transcriber, though it is generally thought of as a tool for transcription annotation only — it also does not have a “reader mode” that would allow for easy viewing and searching across a set of overview annotations created with it);
- during training courses and public presentations over the past couple of years I have been warning that current approaches to language documentation risk the creation of “data dumps” (which I have also called “data middens”) because researchers are not well trained in corpus and workflow management and additionally suffer from ILGB or “interlinear gloss blindness”, which drives them to see textual value adding annotation in terms of the interlinear glossing paradigm. The most recent example of such a presentation was during last month’s grantee training course at SOAS (the Powerpoint slides from my presentation are available on Slideshare). All but one of the grantees attending the training had never heard of, or considered creating, overview summary annotation before launching (selectively) into transcription and interlinear glossing of their recordings.
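To make the idea of time-coded overview annotation a little more concrete, here is a minimal Python sketch of what a digital equivalent of the old tape audition sheet might look like. Everything in it is an assumption for illustration only: the `Segment` fields, the `overview` and `segment` XML element names, the file name and the sample summaries are all invented, and the XML layout is not the format that Transcriber or any other tool actually produces.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class Segment:
    """One line of a hypothetical overview annotation (not a transcription)."""
    start: float   # seconds from the beginning of the recording
    end: float     # seconds from the beginning of the recording
    speaker: str
    summary: str   # short free-text description of the content


def to_timecode(seconds):
    """Format seconds as mm:ss, as on a tape audition sheet."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def to_xml(media_file, segments):
    """Serialise the overview as a small, made-up XML structure."""
    root = ET.Element("overview", {"media": media_file})
    for seg in segments:
        el = ET.SubElement(root, "segment", {
            "start": f"{seg.start:.1f}",
            "end": f"{seg.end:.1f}",
            "speaker": seg.speaker,
        })
        el.text = seg.summary
    return ET.tostring(root, encoding="unicode")


def find(segments, keyword):
    """Return the segments whose summary mentions the keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [s for s in segments if keyword in s.summary.lower()]


# Invented sample data for one recording.
segments = [
    Segment(0.0, 95.0, "Speaker Y", "Greetings and introductions"),
    Segment(95.0, 410.5, "Speaker Y", "Story X, first telling"),
    Segment(410.5, 600.0, "Speaker Z", "Discussion of kin terms"),
]

# "Where in my stack of tapes is Story X by Speaker Y?"
for seg in find(segments, "story x"):
    print(to_timecode(seg.start), to_timecode(seg.end), seg.speaker, seg.summary)
```

The point of the sketch is only that a summary line plus a start and end time for each stretch of a recording is cheap to create for every media file, and already enough to search across a corpus before any transcription or interlinear glossing has been done.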
I may be wrong about the source of the current ILGB and perhaps Toolbox is not (solely) to blame, but I do believe that it plays a part in a narrowing of conceptual thinking about annotation in language documentation, and hence the behaviour of language documenters.
NB: Thanks to Andrew Garrett for his comments on my previous post that caused me to think more deeply about these issues and attempt to explicate and exemplify them more clearly here.