Psst, want some data?

Last month I wrote a blog post about quantification in language documentation and “[h]ow much of the corpus needs to be linguistically annotated so that ‘later researchers will be able to reconstruct the (grammar of the) language’ or indeed so that the rest of the corpus can be parsed”. Note that I was talking about linguistic annotation (not just transcription) here, but in his very useful comments on my post, James Crippen wrote:

“Some folks I know have well over 1000 hours of recorded material, and I think nowhere near ten percent of that has been transcribed. Asking for someone to do the ten percent for this before being willing to accept it is a bit unreasonable.”

Well, the first thing I have to say is: 1000 hours is an awful lot of recordings. It’s about 7.5 times the average DOBES corpus (based on the figure I mentioned in my previous post), and if it’s video it’s equivalent to around 550 feature-length movies (which average around 110 minutes each). If you spent every waking hour of the working week, with no time for eating, bathing, shopping, checking e-mail, etc., it would take you six and a half months merely to watch or listen to it all, let alone create any metadata, analysis, transcription, or index (and remember that this is probably going to be in a language you don’t understand, with no subtitles). You’d want to have a good reason to do so, I reckon.
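(For the curious, here is that arithmetic as a minimal Python sketch. The 131-hour average corpus is the DOBES figure from Wittenburg’s analysis discussed below; the 110-minute average film and the 35-hour week of non-stop watching or listening are my own assumptions.)

    # Back-of-the-envelope arithmetic for 1000 hours of recordings.
    # Assumed figures: 131-hour average DOBES corpus, 110-minute average
    # feature film, and a 35-hour week of non-stop watching/listening.
    corpus_hours = 1000

    print(corpus_hours / 131)        # ~7.6 times the average DOBES corpus
    print(corpus_hours * 60 / 110)   # ~545 feature-length movies
    weeks = corpus_hours / 35        # ~28.6 working weeks
    print(weeks * 12 / 52)           # ~6.6 months just to watch or listen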
Anyway, be that as it may, James’ comment prompted me to seek some empirical data about this issue, so I wrote to five colleagues who are responsible for archives of materials on endangered languages, namely Peter Wittenburg of the DOBES archive, Heidi Johnson of the Archive of the Indigenous Languages of Latin America (AILLA), Gary Holton of the Alaska Native Language Archive (ANLA), Nick Thieberger of the Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC), and David Nathan of the Endangered Languages Archive (ELAR) at SOAS. I asked them the following questions:

“If someone approached you about depositing 1000 hours of recorded digital data on some language, less than 10% of which was transcribed, what advice would you (Archive_Name) give them? What would be the minimal requirements that you would have in order to accept the materials for deposit?”

Hugo Schuchardt Archiv

I’ve been meaning to express my love and gratitude for the excellent Hugo Schuchardt Archiv at the Uni Graz for a while now. I was thinking of maybe saying a little something about Schuchardt for his birthday or Todestag (the anniversary of his death), but the dates passed and in any case I come to exhume Schuchardt, not to praise …

How much room is there in the arc(hive)?

Forty-five years ago the annual fieldwork reports of some of the researchers funded by the then Australian Institute of Aboriginal Studies (now AIATSIS) included specifications of how much research had been completed in terms of the number of feet of tapes that had been recorded during the project year (“this year was especially productive with 45 feet 3 inches of tape being recorded”). The modern measure of this kind of quantitative nonsense is the number of gigabytes of digital files (soon to be terabytes) created by the researcher. Don’t mind the quality, it’s the length/bytes that count.
My colleague David Nathan, Director of the Endangered Languages Archive (ELAR) at SOAS, has been approached on several occasions by researchers (both those funded by ELDP and those not (yet)) asking how much data they would be allowed to deposit in the archive. “Would it be OK if I deposit 500 gigabytes of data?” they ask. When you think about it for a moment or two, this is a truly odd request, but one driven by part of what David (in Nathan 2004, see also Dobrin, Austin and Nathan 2007, 2009) has termed “archivism”. This is the tendency for researchers to think that an archive should determine their project outcomes. Parameters stated in terms of audio resolution and sampling rate, file format, and encoding standards take the place of discussions of documentation hypotheses, goals, or methods that are aligned with a project’s actual needs and intentions. David’s response to such a question is usually: if the material to be deposited is “good quality” (stated in terms of some parameters (not volume!) established by the project in discussion with ELAR) then the archive will be interested in taking it.
Another quantity that comes up in this context (and in the context of grant applications as well) is the statement that “10% of the deposited archival data will be analysed”. The remainder of the archive deposit will be, in the worst case, a bunch of media files, or in the best case, media files plus transcription (and/or translation). Where does this magical 10% come from? It seems to have originated around 10 years ago with the DOBES project, which established a set of guidelines for language documentation during its pilot phase in 2000. As Wittenburg and Mosel (2004:1) state:

“During a pilot year intensive discussions … took place amongst the participants. The participants agreed upon a number of basic guidelines for language documentation projects. … For some material a deep linguistic analysis should be provided such that later researchers will be able to reconstruct the (grammar of the) language”

Similarly, the guidelines for ELDP grant applications (downloadable here) include the following:

“Note that audio and video are not usable, accessible or archivable without accompanying textual materials such as transcription, annotation, or notes about content and participants. While you are encouraged to transcribe and annotate as much of the material as possible, we recognise that this is very time-consuming and you may not be able to do this for all recorded materials. However, you must provide some text indication of the content of all recordings. This does not have to be the linguistic content and could include, for example, description of the topics or events (e.g. names of songs), or names of participants, preferably with time alignment (indication of where they occur in the recording).”

No actual figure is given for how much “some material” (for DOBES) or “as much of the material as possible” (for ELDP) amounts to. In earlier published versions of advice to applicants, both DOBES and ELDP did mention 10%.
Interestingly, Wittenburg (2009, slide 34) has done an analysis of the language documentation data collected by DOBES projects between 2000 and 2009, and he notes that the average project team has recorded 131 hours of media (59 hours of audio, 72 hours of video), transcribed 50 hours of this, and translated 29 hours. Linguistic analysis on average exists for 14 hours of recordings — strikingly this is exactly 10.68% of the average corpus!!
How much of the corpus needs to be linguistically annotated so that “later researchers will be able to reconstruct the (grammar of the) language”, or indeed so that the rest of the corpus can be parsed? Well, it depends on a range of factors, including the nature of the language(s) being documented. Some Austronesian languages, like Sasak or Toratan, have relatively little morphology, with pretty straightforward morpho-phonemics for what morphology does exist, so a relatively small amount of morpheme-by-morpheme glossed material in conjunction with a lexicon would enable users to bootstrap the morphological analysis of other parts of a transcribed corpus in those languages. Other languages, like Athapaskan tongues with their fiendishly complex verb morphology, might need more annotated data to help the user deal with the whole corpus.
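As a toy illustration of the kind of bootstrapping I mean, the Python sketch below glosses transcribed words using a morpheme table harvested from a hand-glossed sample. To be clear, the forms and glosses are invented for the example (they are not real Sasak data), and real morphological analysis is far messier than greedy segmentation against a lexicon:

    # Hypothetical morpheme table, as if harvested from the glossed 10%.
    # All forms and glosses are invented for illustration.
    glossed_morphemes = {
        "ku": "1SG",     # invented prefix
        "beli": "buy",   # invented root
        "bace": "read",  # invented root
        "an": "NMLZ",    # invented suffix
    }

    def gloss(word):
        """Segment a word against the morpheme table, trying longer
        prefixes first and backtracking from dead ends."""
        if not word:
            return []
        for end in range(len(word), 0, -1):
            piece = word[:end]
            if piece in glossed_morphemes:
                rest = gloss(word[end:])
                if rest is not None:
                    return [glossed_morphemes[piece]] + rest
        return None  # contains morphemes not yet in the table

    # Transcribed-but-unglossed words from the remaining 90% of the corpus:
    for word in ["kubeli", "bacean", "kubacean", "mbeli"]:
        print(word, "->", gloss(word))

The point is simply that each glossed morpheme pays for itself many times over across the rest of the corpus; for a language with a fiendishly complex verb template, the table needed before this starts to work is much larger.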
How much is enough is, however, an empirical question, and one that to my knowledge has not been addressed so far. There are now a number of documentary corpora available, with more coming on stream, and it should be possible to establish whether the “magical 10%” is a real goal to be aimed for, or just a figure that researchers have created and continue to repeat to one another.

Fishing

For a beautifully organised site run by a small group, check out Sarah Colley’s new site: the Sydney fish project. What fish have been found in archaeological sites in Sydney? What do the bits look like, and what does the whole fish look like (i.e. a reference skeleton)? What fish did Aborigines eat at what period? …

Random locations of grammars and dictionaries

Further to the discussion of making online material discoverable (using standard metadata or via the more elaborate infrastructure proposed by ELIIP), other useful sources of free online grammars or dictionaries include ‘Online Books’ and the Project Gutenberg sites. These are ‘free’ as in unencumbered by intellectual property or copyright concerns, typically because the authors have been dead for over 50 years, not because they were placed in an open-access archive. A sample of the files available follows, but wouldn’t it be great to have a way of announcing these items using standard metadata terms so they could all be searched via a dedicated language service? For example, the entry for Sgau Karen below is followed by Sgaw Karen, so a Google search on ‘Sgaw’ will only give you one of these three items.
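Here is a minimal sketch of what announcing items with a standard identifier would buy us: map the spelling variants to one code (ksw is the ISO 639-3 code for S’gaw Karen) and search on the code rather than the string. The catalogue titles are invented for illustration:

    # Invented catalogue titles illustrating the spelling-variant problem.
    entries = [
        "A grammar of the Sgau Karen language",
        "Handbook of the Sgaw Karen",
        "S'gaw Karen vocabulary",
    ]

    # Known spelling variants, all mapped to the ISO 639-3 code ksw.
    VARIANTS = {"sgau karen": "ksw", "sgaw karen": "ksw", "s'gaw karen": "ksw"}

    def language_code(title):
        """Return the code of the first known variant found in the title."""
        t = title.lower()
        for variant, code in VARIANTS.items():
            if variant in t:
                return code
        return None

    # A naive string search finds only one of the entries...
    print([t for t in entries if "sgaw" in t.lower()])
    # ...while a search on the normalised code finds them all.
    print([t for t in entries if language_code(t) == "ksw"])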

Back in Tokyo

9 February 2009
David Nathan, Director of the Endangered Languages Archive at SOAS, and I are back in Tokyo at the invitation of Toshihide Nakayama of ILCAA, the Research Institute for Languages and Cultures of Asia and Africa at the Tokyo University of Foreign Studies, for 10 days to run a workshop on language documentation that follows up our 2008 workshop. This year we are taking a different tack, focusing the week of seminars and practical sessions on the principles and practices of archiving endangered languages materials. The week begins on Monday (today) with preparations in the morning and David’s public lecture on “Archiving endangered language materials” in the afternoon. Classes begin in earnest on Tuesday and run until Friday, with sessions from 10am to 5pm each day. There will be 15 attendees, mostly students who are doing fieldwork in various locations around the world. Details of the workshop can be found here.
The topics we plan to cover include:

  • Language documentation and language archiving – major issues
  • Audio – good practices refresher
  • Audio recording – how to make great audio
  • Data and metadata – good practices refresher
  • Data management practical
  • Workflow for archiving
  • Mobilisation and delivery of language materials
  • Transcription, annotation, translation – good practices refresher
  • IP and ethical issues in the delivery, usage, and archiving of materials

There will be group work in the practical sessions and a final discussion with presentations by the attendees on the last day. If time and energy permit I will blog about how the workshop goes and report on some of the outcomes.

The Endangered Languages Archive at SOAS: developing and sharing language materials through archiving

The Endangered Languages Archive (ELAR) was established at SOAS in January 2004, with the first deposits accepted in late 2005. Our initial priority was preservation, but recently the ELAR public catalogue was released, and it will soon extend to providing access to materials (where permissions allow). To date, ELAR has received over 50 deposits and stores about 4 terabytes of data. Audio recordings make up about 60% of this (both in terms of the total number of files and the total volume of data).
ELAR was established primarily to preserve and disseminate data collected by grantees from the Endangered Languages Documentation Programme (ELDP) and by staff and students from the Endangered Languages Academic Programme (ELAP). Because language documentation is an emerging area that relies a lot on new techniques and technologies, ELAR also provides training, advice and support to ELDP grantees, ELAP staff and students, and others through international training workshops, as well as through facilities such as the Rausing Room, the Linguistics Resources Room, and the pool of fieldwork equipment available to ELAP staff and students.
ELAR now has four staff: David Nathan and Ed Garrett are card-carrying linguists and IT professionals, while technicians Tom Castle and Bernard Howard have specialist skills in digital and analogue audio techniques and equipment.
With these resources, skills and experience, ELAR is able to help people who want to archive resources for endangered languages, including individual and retired researchers who may not have alternative sources of equipment or advice. Dietrich Schüller, the former Director of the Austrian Phonogrammarchiv, has warned in a recent paper (PDF) that the great majority of the world’s human cultural heritage is sitting unpreserved and uncatalogued on the shelves of individual researchers. We can help these researchers with preparing materials, including digitising and converting audio, as well as providing advice and training in how to create metadata and cataloguing information.
Over the last few years ELAR has collaborated with a number of individual researchers in preparing their materials for deposit.

E-research and language documentation, a natural fit – Nick Thieberger

[From our man in Hawai’i and Melbourne – Nick Thieberger]
The Australian government has millions of dollars that it will be spending on what it calls the National Collaborative Research Infrastructure Strategy (NCRIS) to support new technologies in research in Australia.

“Through NCRIS, the Government is providing $542 million over 2005-2011 to provide researchers with major research facilities, supporting infrastructure and networks necessary for world-class research.”

DEST (the then Department of Education, Science and Training) released a paper outlining what it called ‘capabilities’ which it proposed to fund, and they were ALL in the sciences, including lots of shiny pointy instruments (a synchrotron, new telescopes, and so on) to do the whizzbang experiments that are so popular and capture the imagination of politicians. While the physical science community has an amazing capacity to pull in big research dollars, there are not that many of these scientists, and even fewer who actually want to use each of these very expensive instruments.
On the other hand, the Humanities, Arts and Social Science (HASS) community is huge, and also does the kind of work that, in the main, is immediately relevant to those who fund it (taxpayers). So, in the consultation that followed, the clamour of HASS proponents resulted in a new ‘capability’ being added to the ‘roadmap’, but without any funding (yet) associated with it. There will be an ‘Innovation White Paper’ announcement before the end of 2008, and the current roadmap leads to the White Paper.
All of this is important for us, as it is the bucket from which national infrastructure like a National Data Service may be funded, and where policies on standards for data repositories like PARADISEC will be set. It is where funding will come from for the national computer facility that houses the online version of the PARADISEC collection.
