Archive for the ‘Archiving’ Category.

PARADISEC’s decade celebration conference

Announcing the conference “Research, records and responsibility (RRR): Ten years of the Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC)”

Dates: 2nd-3rd December 2013
Venue: University of Melbourne, Australia

Keynote speaker:

Shubha Chaudhuri
Associate Director General (Academic)
Archives and Research Centre for Ethnomusicology
American Institute of Indian Studies
Gurgaon, India

For details and the call for papers see:

This will coincide with the Workshop on digital tools and methods for language documentation on 3rd-4th December 2013.

Keynote speakers:

Alexandre Arkhipov (Moscow State University) on methods used by his research group to build an integrated documentation and analysis system.

Andreas Witt (Head of the TEI-SIG, Institut für Deutsche Sprache, Mannheim) on the Text Encoding Initiative-Special Interest Group and TEI for linguists.

For details and the call for papers see:

Digitising Mali’s cultural heritage — Simon Tanner

Simon Tanner has a blog post on his experience of working with various manuscript collections and the tragic destruction of potentially thousands of manuscripts from the New Ahmed Baba Institute building in Mali: “I have worked with manuscripts for over 20 years now; as a librarian, academic and as a consultant helping others to digitise their collections. I have worked in Africa with various libraries and archives for over 10 years. [..] Africa is a continent that has been wracked by the three horsemen of the manuscript conservationists nightmares: war, pestilence and natural disaster.”

Could DNA be the future of digital preservation?

Overnight, genetic scientists in Britain successfully demonstrated the data-storage potential of DNA, as explained in The Conversation today.

In a proof-of-concept experiment, a string of DNA about the physical size of a grain of dust was encoded with an MP3 file of Dr Martin Luther King’s ‘I have a dream’ speech, a photo, a PDF of the 1953 paper that first described the structure of DNA, and all of Shakespeare’s sonnets. Critically, the speck of DNA was then sent by post to the US, where it was decoded and found to be an exact digital copy of the input.

The benefit of using DNA, besides the mind-blowing storage capacity, is that it can be freeze-dried and stored for a very long time without any loss of information. In fact, the encoding procedure takes advantage of redundancy: the data are replicated across multiple strings, like a biological RAID array, making it very unlikely that the same information will be lost in all strings.
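For readers who like to see the idea in code, here is a minimal sketch of redundancy through overlapping strings. It is only an illustration, not the scheme the researchers used: the two-bits-per-base mapping, the fragment length and step, and all function names are assumptions made for this example.

```python
from collections import Counter

# Illustrative mapping only: two bits per base (A=00, C=01, G=10, T=11).
BASES = "ACGT"


def bytes_to_bases(data: bytes) -> str:
    """Encode each byte as four bases, most significant bits first."""
    return "".join(
        BASES[(byte >> shift) & 0b11] for byte in data for shift in (6, 4, 2, 0)
    )


def bases_to_bytes(seq: str) -> bytes:
    """Decode a base string (length divisible by four) back into bytes."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        value = 0
        for base in seq[i:i + 4]:
            value = (value << 2) | BASES.index(base)
        out.append(value)
    return bytes(out)


def make_fragments(seq: str, length: int = 8, step: int = 2):
    """Cut the sequence into overlapping (start, fragment) pairs.

    With length 8 and step 2, most bases end up in four fragments.
    """
    return [(start, seq[start:start + length]) for start in range(0, len(seq), step)]


def reconstruct(fragments, total_length: int) -> str:
    """Rebuild the sequence by majority vote over the surviving fragments."""
    votes = [Counter() for _ in range(total_length)]
    for start, fragment in fragments:
        for offset, base in enumerate(fragment):
            votes[start + offset][base] += 1
    if any(not v for v in votes):
        raise ValueError("some positions are not covered by any fragment")
    return "".join(v.most_common(1)[0][0] for v in votes)


message = "I have a dream".encode("utf-8")
strand = bytes_to_bases(message)
fragments = make_fragments(strand)

# Simulate losing two neighbouring fragments in storage or transit.
survivors = [f for i, f in enumerate(fragments) if i not in (5, 6)]
recovered = bases_to_bytes(reconstruct(survivors, len(strand)))
print(recovered.decode("utf-8"))
```

Because every position in the middle of the strand is covered by several fragments, losing a couple of them still leaves enough copies to rebuild the message. Positions at the very start of the strand are covered by fewer fragments in this toy version, which a real scheme would need to address.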

Sequencing DNA still takes a couple of weeks at the least, and both encoding and decoding are unsurprisingly expensive, although costs are predicted to come down quite dramatically over the next 20 years or so.

Even so, the potential of this technology for the perpetual and secure storage of our cultural heritage is obvious. Digital preservation of linguistic and ethnological materials such as those PARADISEC holds would be a very suitable use, given that we aim to store and preserve digital objects for a very long time.

Counting Collections

As will be clear to regular readers of this blog, we are concerned here to encourage the creation of the best possible records of small languages. Since much of this work is done by researchers (linguists, musicologists, anthropologists etc.) within academia, there needs to be a system for recognising collections of such records in themselves as academic output. This question is being discussed more widely in academia and in high-level policy documents as can be seen by the list of references given below.

The increasing importance of language documentation as a paradigm in linguistic research means that many linguists now spend substantial amounts of time preparing corpora of language data for archiving. Scholars would of course like to see appropriate recognition of such effort in various institutional contexts. Preliminary discussions between the Australian Linguistic Society (ALS) and the Australian Research Council (ARC) in 2011 made it clear that, although the ARC accepted that curated corpora could legitimately be seen as research output, it would be the responsibility of the ALS (or the scholarly community more generally) to establish conventions to accord scholarly credibility to such products. Here, we report on some of the activities of the authors in exploring this issue on behalf of the ALS and discuss issues in two areas: (a) what sort of process is appropriate in according some form of validation to corpora as research products, and (b) what are the appropriate criteria against which such validation should be judged?

“Scholars who use these collections are generally appreciative of the effort required to create these online resources and reluctant to criticize, but one senses that these resources will not achieve wider acceptance until they are more rigorously and systematically reviewed.” (Willett, 2004)

Announcing Paradisec’s new catalogue

Over the last year or so, the Paradisec team, in collaboration with software developers Robot Parade, Silvia Pfeiffer and John Ferlito, have been developing a replacement for our ageing catalogue and database systems. A couple of weeks ago, this work culminated in the release of the new catalogue.

Several features of the software represent a vast improvement over the previous catalogue, including a much simpler search function for both collections and individual items, a map view built on the Google Maps API for exploring data and, most usefully for our depositors, the ability to play their own files straight through the browser or download them directly from the archive.

At the same time, the Paradisec team have been working with the Australian National Data Service (ANDS) to provide collection-level metadata to Research Data Australia (RDA), a federal initiative that aggregates metadata from research around the country in a standard metadata format and makes it searchable and discoverable via the RDA website. The new catalogue automates this process: once a collection has the metadata required to maximise discoverability, it is harvested by RDA and appears in their database. It is therefore important that depositors, and managers of others’ data, provide metadata that is as rich as possible.

While the software is still undergoing some post-release bug fixes and improvements, we welcome the public to explore the breadth and depth of data in the collection. Access to the data itself, however, is restricted to the collectors and managers of those collections, in line with Paradisec policy. Access to other people’s data is subject to access conditions.

We encourage interested readers to explore the collection, and we especially invite collectors of relevant data to get in contact with us to investigate depositing their collections with us for safekeeping.

PARADISEC’s ‘Data Seal of Approval’

As we approach our tenth year of operation, it is gratifying that PARADISEC has achieved the Data Seal of Approval (DSA), which is based on 16 criteria (listed below, and see how we meet these criteria here). We have been a five-star Open Language Archives Community (OLAC) repository for some time, which also means that we are one of the 1800 archives whose catalog and metadata conform to the Open Archives Initiative standards. The DSA, however, looks more broadly at the whole process of the repository, from accession of records, through their description and curation, to disaster management. This is important for our depositors to know, as they can be sure that their research output is properly described and curated and can be found using various search tools, including Google, but more specifically the Australian National Data Service, OLAC and WorldCat, as well as the aggregated information served in the Virtual Language Observatory.

Bursting through Dawes (2)

Further to my last post, I’ve read on, and my disappointment has only deepened at the treatment of the Sydney Language in Ross Gibson’s 26 views of the starburst world.

Think about the notes you made when you were getting into learning an undocumented language … Imagine they get archived and in a century or two someone looks through them and tries to work out what was going on when you made the notes. With only shreds of metadata and general knowledge of the historical period to go on, the future reader makes inferences from the content. Could a cluster of words in one of your vocabulary lists point to a hunch you were checking? Or a sequence of illustrative sentences could be the skeletal narrative of a memorable experience shared with your teachers.

Charting Vanishing Voices: A Collaborative Workshop to Map Endangered Oral Cultures

A two-day conference titled ‘Charting Vanishing Voices: A Collaborative Workshop to Map Endangered Oral Cultures’ ran on June 29/30 in Cambridge, UK. Organised by the World Oral Literature Project, the conference brought together a range of ‘scholars, digital archivists and international organisations to share experiences of mapping ethno-linguistic diversity using interactive digital technologies.’

A discussion of the conference at the Arctic Anthropology blog gives a good overview, so, rather than duplicate what you can read there, I’ll just add some useful pointers to things I discovered at the conference:
– I didn’t know that the data under the UNESCO atlas of endangered languages can be downloaded freely.
– A user-pays model for information that adds value to open source material can be commercially viable (Alexander Street Press).
– There are students at SOAS who have produced a great website of geocoded language information.
– The Glottolog/langdoc project has 175,000 references linked to 94,000 of what they call ‘languoids’ (languages, dialects, families).
– There is a great project at the CNRS for making media available online and for annotating it. They also use the Vamp plugin, which looks interesting as a way of analysing and extracting information from audio files.

Technology and language documentation: LIP discussion

Lauren Gawne recaps last night’s Linguistics in the Pub, a monthly informal gathering of linguists in Melbourne to discuss topical areas in our field.

This week at Linguistics in the Pub it was all about technology, and how it impacts on our practices. The announcement for the session briefly outlined some of the ways technology has shaped expectations for language documentation:

The continual developments in technology that we currently enjoy are inextricably connected to the development of our field. Most would agree that technology has changed language documentation for the better. But while nobody is advocating a return to paper and pen, most would concur that technology has changed the way we work in unexpected ways. The focus is usually on the materials we produce such as video, audio and annotation files as well as particular types of computer-aided analysis. In a recent ELAC post, ‘Hammers and nails’, Peter Austin claims that metadata is not what it was, in the days of good old reel-to-reel tape recorders. The volume of comments suggests that this topic is ripe for discussion. This session of Linguistics in the Pub will give us a chance to reflect on how our practices change with advances in technology.

There are a (very) few linguists who advocate that researchers should go to the field with nothing beyond a spiral-bound notebook and a pen, though no one at the table was quite willing to go that far; all of us, it seems, go to the field with a good quality audio recorder at the very least. Without the additional recordings (be they audio or video), the only output of the research is the final papers written by the linguist, which are in no way verifiable. The recording of verifiable data, and the slowly increasing practice of including audio recordings in the final research output, are allowing us to further stake our claim as an empirical and verifiable field of scientific inquiry. Many of us shared stories of how listening back to a recording that we had made enriched the written records that we have, or allowed us to focus on something that wasn’t the target of our inquiry at the time of the initial recording. The level of analysis that is now expected for even the lowliest sketch grammar is almost impossible without the aid of recordings, let alone capturing the subtleties present in naturalistic narrative or conversation.

ELAR cracks a ton

The Endangered Languages Archive (ELAR) at SOAS reaches an important milestone this week when our 100th deposit goes online. We will be working on a further 10 deposits and doing additional curation work on those currently online over the next two months.

ELAR now has 4 terabytes (4,000 gigabytes, double what I reported in April) of language, music and cultural data and analysis online and available for registered users (and, of course, fully preserved on our storage area network and securely backed up). Some material requires subscription, but we have now implemented an online subscription system that enables user requests to be easily made, with the depositor being automatically asked for permission to access the relevant files.

The following is a list of our recent additions in alphabetical order of the main language:

  1. Ainu from Japan — Documentation of the Saru dialect of Ainu by Anna Bugaeva
  2. Bajjika from India — Bajjika: Swadesh List Elicitation Sessions by Jay Huweiler
  3. Baram from Nepal — Linguistic and ethnographic documentation of Baram by Yogendra Prasad Yadava
  4. Bedik from Senegal — Documentation of Bedik by Adjaratou Oumar Sall
  5. Choguita Rarámuri from Mexico — Choguita Rarámuri description and documentation by Gabriela Caballero
  6. Kabardian from Turkey — Documentation and Analysis of Kabardian as Spoken in Turkey by Ayla Applebaum Bozkurt
  7. Kiksht from USA — Conversational Kiksht by Nariyo Kono
  8. Kunwinjku from Australia — Itpi-itpi songs in Kunwinjku, Mawng and Kunbarlang by Linda Barwick
  9. Kunwinjku from Australia — Karrbarda songs in Kunwinjku and Kunbarlang by Linda Barwick
  10. Lakandon from Mexico — Temporal Reference in Lakandon Maya by Henrik Bergqvist
  11. Mawng from Australia — Mirrijpu (seagull) songs in spirit-language Manangkarri by Linda Barwick
  12. Mawng from Australia — Inyjalarrku songs in spirit-language Mawng by Linda Barwick
  13. Nahua from Peru — Documentation of mythology and shamanic songs of the Nahua by Conrad Feather
  14. North Ambrym from Vanuatu — A documentation of North Ambrym, a language of Vanuatu and research into its possessive structures by Michael Franjieh
  15. Paman from Australia — Paman languages: Umpila, Kuuku Ya’u, Kaanju by Claire Hill
  16. Paresi-Haliti from Brazil — Verbal events in Paresi-Haliti by Glauber Romling da Silva
  17. Pingelapese from the Marshall Islands — Pingelapese language data by Ryoko Hattori
  18. Tseltal from Mexico — Ethnographic and discursive audiovisual corpus of Tseltal by Gilles Pollian
  19. Yami from Taiwan — Yami Documentation by Meng Chien Yang and Der-Hwa Victoria Rau
  20. Yoloxochitl Mixtec from Mexico — Yoloxochitl Mixtec stories and other oral traditions by Jonathan Amith

The following deposits are in curation and will be available soon:

  1. Alipur Village sign language from India — Investigation of an endangered village sign language in India by Sibaji Panda
  2. Bom and Kim from Sierra Leone — Documentation of Kim and Bom Languages of Sierra Leone by Tucker Childs
  3. Asheninka Perene from Peru — Asheninka Perene (Arawak) 2010 collection, from eastern Peru by Elena Mihas
  4. Desano from Brazil — Desano – audio and video materials by Wilson Silva
  5. Enindhilyakwa from Australia — Documentation of Enindhilyakwa by Marie van Egmond
  6. Ingrian from Russia — Ingrian narratives and elicitations by Fedor Rozhanskiy and Ilya Nikolaev
  7. Ingrian from Russia — Ingrian, Vatic, and Ingrian Finnish elicitations and conversations by Fedor Rozhanskiy and Mehmet Muslimov
  8. Ingrian from Russia — Ingrian narratives recorded in 2011 by Fedor Rozhanskiy and Elena Markus
  9. Mmani from Guinea — Documentation of the moribund language Mmani, a Southern Atlantic language of Niger-Congo by Tucker Childs
  10. Solega from India — Documentation of Solega by Aung Si

We will be having a small celebratory party at ELAR this week to mark what is a pretty significant milestone for us.