Author Archive

Research, records and responsibility conference: Ten years of PARADISEC


The conference celebrating ten years of PARADISEC in early December had a suitably interdisciplinary mix of presentations. Joining in the reflection on building records of the world’s languages and cultures were musicologists, linguists, and archivists from India, Hong Kong, Poland, Canada, Alaska, Hawai’i, Australia, the UK and Russia. The range of topics covered can be seen in the program:

The conference ended with a discussion of what was missing in our current tools and methods. While it is clear that linguists have done pretty well at using appropriate tools for transcribing and annotating text, and building repositories to provide long-term citation and access to the material, there is still a long way to go.

The long road to language resources—CLARIN

CLARIN, the ‘Common Language Resources and Technology Infrastructure’, is a European initiative to support the creation, curation and exploration of language material for research purposes and for as broad an audience as possible. The stated aim is that you should not need to be a technical expert to use the corpora, lexica and annotations that are targeted in CLARIN.

It is set up as a European Research Infrastructure Consortium (ERIC). This is a huge project, with a budget of some €104 million. CLARIN-D is the German section of CLARIN and it recently had its two-year showcase, which I was able to attend. Given that these are the first two years of a long-term project, it has clearly achieved a great deal already, and certainly more than can be glimpsed in a short blog post.

This is part of a ‘roadmap’ process that actually leads somewhere, unlike the Australian version I reported on earlier, which appears to have cost hundreds of thousands of dollars only to be abandoned even before it was published.


Print on demand, again

In an earlier post I talked about getting texts from Toolbox into books for use in the language community. The print-on-demand service I was so enthusiastic about, and which I pointed to for copies of my books, has now closed, having fallen victim to a change of bookshop ownership at Melbourne Uni.

After talking with Manfred Krifka and Kilu von Prince and seeing their work (the Daakie literacy book and Sóróusian ne vilye Ambrym: Siiwisian ne or Ambrym) being printed by Amazon’s online service (CreateSpace), I took the same PDF files I had previously created and uploaded them to Amazon. Within a couple of days all the checks had been done and Natrauswen nig Efat is now available online for less than $7. The PDF version is downloadable for free. It will also be available for Kindle!


Exploring data from language documentation

The workshop ‘Exploring data from language documentation’, organised by Kilu von Prince and Felix Rau (May 10/11 2013), included a number of interesting presentations, which can be downloaded here:

I talked about some gaps in the current language documentation workflow and tools that could help fill them, in particular ExSite9 for improving metadata collection, and EOPAS for presenting text and media online for citation and verification.

Christian Chanard and Amina Mettouchi showed a hybrid version of ELAN they have developed that allows parsing and morphological labeling, as well as another tool that allows web searching of ELAN files.


PARADISEC’s decade celebration conference

Announcing the conference “Research, records and responsibility (RRR): Ten years of the Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC)”

Dates: 2nd-3rd December 2013
Venue: University of Melbourne, Australia

Keynote speaker:

Shubha Chaudhuri
Associate Director General (Academic)
Archives and Research Centre for Ethnomusicology
American Institute of Indian Studies
Gurgaon, India

For details and the call for papers see:

This will coincide with the Workshop on digital tools and methods for language documentation on the 3rd-4th December 2013.

Keynote speakers:

Alexandre Arkhipov (Moscow State University) on methods used by his research group to build an integrated documentation and analysis system.

Andreas Witt (Head of the TEI-SIG, Institut für Deutsche Sprache, Mannheim) on the Text Encoding Initiative-Special Interest Group and TEI for linguists.

For details and the call for papers see:

Fieldwork helper – ExSite9

ExSite9 is an open-source cross-platform tool for creating descriptions of files created during fieldwork. We have been working on the development of ExSite9 over the past year and it is now ready for download and use:

ExSite9 collects information about files in a directory you have selected on your laptop and presents it to you onscreen for annotation, as can be seen in the following screenshot. The top left window shows the filenames, and the righthand window shows metadata characteristics that can be clicked once a file or set of files is selected. The manual is here:

Researchers who undertake fieldwork, or capture research data away from their desks, can use ExSite9 to support the quick application of descriptive metadata to the digital data they capture. This also enables researchers to prepare a package of metadata and data for backup to a data repository or archive for safekeeping and further manipulation.

Scholars in the Humanities, Arts and Social Sciences (HASS) typically need to organise heterogeneous file-based information from a multitude of sources, including digital cameras, video and sound recording equipment, scanned documents, files from transcription and annotation software, spreadsheets and field notes.

The aim of this tool is to facilitate better management and documentation of research data close to the time it is created. An easy-to-use interface enables researchers to capture metadata that meets their research needs and matches the requirements for repository ingestion.
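For readers who want a feel for the workflow described above (scan a directory, attach descriptive metadata to selected files, package the result for an archive), here is a minimal Python sketch. It is not ExSite9's own code or file format: the function names (scan_directory, annotate, export_package), the record layout and the JSON output are all assumptions made purely for illustration.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def scan_directory(root):
    """Collect basic, automatically derivable facts about each file under root."""
    records = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        stat = path.stat()
        records.append({
            "path": path.relative_to(root).as_posix(),
            "size_bytes": stat.st_size,
            "modified": datetime.fromtimestamp(
                stat.st_mtime, tz=timezone.utc
            ).isoformat(),
            "extension": path.suffix.lower(),
            # Descriptive fields are left empty for the researcher to fill in.
            "metadata": {},
        })
    return records


def annotate(records, selected_paths, **fields):
    """Apply the same descriptive fields to every selected file."""
    selected = set(selected_paths)
    for record in records:
        if record["path"] in selected:
            record["metadata"].update(fields)
    return records


def export_package(records, destination):
    """Write the file descriptions to a JSON file (a hypothetical package format)."""
    text = json.dumps(records, indent=2, ensure_ascii=False)
    Path(destination).write_text(text, encoding="utf-8")
```

A typical use would be to call scan_directory on a fieldwork folder, call annotate with the relative paths of a recording session and fields such as language or speaker, and then call export_package before copying the folder to backup. A real repository will have its own required fields and formats, so the field names here would need to be mapped to whatever the archive expects.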


Digitising Mali’s cultural heritage — Simon Tanner

Simon Tanner has a blog post on his experience of working with various manuscript collections and the tragic destruction of potentially thousands of manuscripts from the New Ahmed Baba Institute building in Mali: “I have worked with manuscripts for over 20 years now; as a librarian, academic and as a consultant helping others to digitise their collections. I have worked in Africa with various libraries and archives for over 10 years. [...] Africa is a continent that has been wracked by the three horsemen of the manuscript conservationists nightmares: war, pestilence and natural disaster.”

Counting Collections

As will be clear to regular readers of this blog, we are concerned here to encourage the creation of the best possible records of small languages. Since much of this work is done by researchers (linguists, musicologists, anthropologists etc.) within academia, there needs to be a system for recognising collections of such records in themselves as academic output. This question is being discussed more widely in academia and in high-level policy documents as can be seen by the list of references given below.

The increasing importance of language documentation as a paradigm in linguistic research means that many linguists now spend substantial amounts of time preparing corpora of language data for archiving. Scholars would of course like to see appropriate recognition of such effort in various institutional contexts. Preliminary discussions between the Australian Linguistic Society (ALS) and the Australian Research Council (ARC) in 2011 made it clear that, although the ARC accepted that curated corpora could legitimately be seen as research output, it would be the responsibility of the ALS (or the scholarly community more generally) to establish conventions to accord scholarly credibility to such products. Here, we report on some of the activities of the authors in exploring this issue on behalf of the ALS and discuss issues in two areas: (a) what sort of process is appropriate in according some form of validation to corpora as research products, and (b) what are the appropriate criteria against which such validation should be judged?

“Scholars who use these collections are generally appreciative of the effort required to create these online resources and reluctant to criticize, but one senses that these resources will not achieve wider acceptance until they are more rigorously and systematically reviewed.” (Willett, 2004)


PhD Top-up scholarship in Linguistics within cross-corpus DoBeS project on three-participant events

Posted by Anna Margetts

The project Cross-linguistic patterns in the encoding of three-participant events will start in 2013 as a cross-corpus project of the Documentation of Endangered Languages Program (DoBeS) of the Volkswagen Foundation. Chief investigator: Anna Margetts (Monash University); co-applicants: Nikolaus Himmelmann (University of Cologne) and Katharina Haude (CNRS, Paris).

Faculty/School: Faculty of Arts, School of Languages, Cultures & Linguistics
Location: Clayton Campus, Melbourne
Scholarship tenure: 3 years full time, beginning in 2013
Scholarship value: $6,750 per annum (conditions apply)
Laptop & standard software up to a value of $1700
Closing Date: 31 October 2012

Project summary: The project investigates the linguistic encoding of events which involve three participants. It brings together three areas of study: the encoding of three-participant events, the typological parameter of basic valence orientation, and the field of text-based typology. (For more details see the project description further below).

PARADISEC’s ‘Data Seal of Approval’

As we approach our tenth year of operation, it is gratifying that PARADISEC has achieved this seal of approval (DSA), based on 16 criteria (listed below). We have been a five-star Open Language Archives Community repository for some time, which also means that we are one of the 1800 archives whose catalog and metadata conform to the Open Archives Initiative standards. The DSA, however, looks more broadly at the whole process of the repository, from accession of records, through their description and curation, to disaster management. This is important for our depositors to know, as they can be sure that their research output is properly described and curated, and can be found using various search tools, including Google, but more specifically the Australian National Data Service, OLAC and WorldCat, and also the aggregated information served in the Virtual Language Observatory.
