Retrofitting a collection? I’d rather not

I just had a visit from a student wanting to deposit a collection of recordings made in the course of PhD fieldwork in the PARADISEC archive. It is a great shame that they are only just now thinking about how to deposit this material, as it will need considerable work to make it archivable. If they had sought advice before doing all of the research (or looked at the PARADISEC page ‘Depositing with PARADISEC’, or at the RNLD pages), it would have been so much easier for all of us. Why?

They used a Sony dictaphone recorder, which is very low quality and records in a proprietary format (.msv) via an internal and rather poor microphone. These files need to be converted using Sony software (on a PC only) and will never be as good as they would have been if recorded on the normally recommended equipment. They transcribed the recordings in Microsoft Word documents that have no time-alignment so can’t resolve back to the recordings for citation of example sentences. These too will need to be converted out of the Microsoft format into rtf, pdf or text format. There was no naming convention established so now all files need to be renamed. There is no catalog of the collection in any form, so that needs to be created, not something they really want to do while writing up a PhD dissertation.
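
To give a rough idea of what the renaming and cataloguing step involves, here is a minimal Python sketch. It is only an illustration under stated assumptions: the name pattern COLLECTION-ITEM-NN.ext, the identifiers "XX1" and "001", and the function names plan_renames and catalog_csv are all made up for this example and are not a description of PARADISEC's actual requirements. It only plans the new names and writes a two-column catalog as CSV text; it does not touch any files or convert any formats.

```python
import csv
import io
from pathlib import PurePath


def plan_renames(filenames, collection_id, item_id):
    """Pair each original filename with a name of the form COLLECTION-ITEM-NN.ext.

    The pattern is a hypothetical convention chosen for this sketch.
    Files are numbered in sorted order and extensions are lower-cased.
    """
    plan = []
    for index, original in enumerate(sorted(filenames), start=1):
        suffix = PurePath(original).suffix.lower()
        archive_name = f"{collection_id}-{item_id}-{index:02d}{suffix}"
        plan.append({"original": original, "archive_name": archive_name})
    return plan


def catalog_csv(plan):
    """Return the rename plan as CSV text, one row per file."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["original", "archive_name"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(plan)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical filenames of the kind an unplanned collection ends up with.
    plan = plan_renames(["story two.MSV", "Interview 1.msv"], "XX1", "001")
    print(catalog_csv(plan))
```

A real catalog would of course also need speakers, dates, languages and so on. The point is only that keeping the original name next to the new one means nothing is lost in the renaming, and that all of this is far less work if the convention exists before the first recording is made.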

So, if you are in this situation or are supervising students in this situation, please make sure they have thought about the standards they should adopt before they start their research!


  1. Peter Austin says:

    Thanks for sharing this, Nick. It is a salutary lesson; however, what you identify, it seems to me, are issues about doing documentation and corpus creation and management that are independent of archiving. Isn’t it the case that students and other researchers should pay attention to how they make their recordings, organise and structure and manage their data and analysis, record metadata in some organised fashion, and have clearly articulated workflows, separately from whether or how they wish to archive their materials? To me, what you describe are good policies and practices for linguistic research, not for archiving per se. Of course, you have had to confront the lacunae in the student’s training since they now do want to archive.

    At SOAS I try to teach the principles you discuss as part of my Fieldmethods course (in the MA, and taken by all PhD students who have not done it before and who plan fieldwork) — this year we are not working with an endangered language and don’t have plans for archiving, but the students are still learning about what you describe.

  2. Nick Thieberger says:

    Yes, I agree completely. This is about good research practice and reuse of research data, and should apply to all research in general, whether or not the material is intended to be archived. We will increasingly see funding bodies requiring public access to publicly funded research and I hope we will see changes to academic recognition of corpus creation that will allow a corpus to be counted for advancement (as you have also argued before on this blog). It does, however, become more relevant to think about data management methods when the records being created are of a language that has never before been recorded and for which subsequent recording may not be possible.

    The idea of creating well-formed reusable data so that it can be easily archived has been criticised by your colleague in ELAR as archivism (Nathan 2010) so I’m glad to hear that you are teaching the principles you mention in your courses. He uses the peculiar example of Winston Churchill’s pipes, and notes that, just as no-one would have suggested to Churchill that his pipes should have been laid out in a drawer in a certain way so that future generations could see them there, so linguists should not create well-formed data from their fieldwork with the aim of archiving it. Rather it seems he would have a thousand flowers bloom, and somehow the archive should deal with the resulting material, as, for example, produced by the student who turned up to my office.

    ELAR must have the resources to deal with such heterogeneously formed collections. My blog post is from the perspective of PARADISEC which operates on a sort of drought and flood funding model. Our systems allow us to operate through periods of funding drought, with minimal staff, and the last thing we need is large collections of files in non-standard formats that need to be converted.

    David Nathan. 2010. “Language documentation and archiving: from disk space to MySpace.” In Peter Austin (ed.), Language Documentation and Description, Vol. 7. London: SOAS. 172–208.

  3. David Nathan says:

    I strongly agree with Nick and Peter! And I welcome them onboard to what I have been arguing for some time, despite continued criticisms, misunderstanding and distortion. Let’s make it clear: the ELAR archive has NEVER encouraged other than standard and archivable formats. What the ‘archivism’ idea expressed was that the nature of language documentation was not defined – and certainly not exhausted – by technical desiderata that we all agree on. In other words, I was saying quite a while ago exactly what Peter and Nick seem to have caught up on – that proper and skilled language documentation would incorporate data management principles. I certainly *never* wrote or said anything like Nick’s attribution to me that “linguists should not create well-formed data from their fieldwork with the aim of archiving it”. To be precise, the article he quotes (Nathan 2010:176) says that “language documentation has become rather confused about the relationship between data preparation, data formats, and archiving” and that documentation should not be just “archive-driven”, but should, like any research area, use “good data management and judicious use of standards”. In training, website advice, and indeed in articles like Nathan 2010, I and ELAR colleagues have held exactly and consistently to the kind of criteria that Nick’s blog speaks about – proper recording equipment capturing in appropriate formats, avoidance of MS Word files, and good filenaming systems. It is insulting to suggest that the student who needed Nick’s help is somehow a result of my or ELAR’s efforts over the last 8 years.

    The sad residual fact is, however, that there is much more to language documentation if it is going to fulfil the hopes, expectations, and needs of it. As I wrote some years ago, it is going to need new genres and modes of expression (Nick’s Audiamus being an admirable contribution) to deal with its multimodality, hypertextuality, relationships between resources, and diversity of users and usages. These genres and modes of expression cannot be described or exhausted by the considerations of archiving-as-preservation. ELAR is using the very simple *bundle* concept (like AILLA, and similar to IMDI’s sessions) as an organisational, discovery, presentational and navigational device, one which loosely echoes the way that some documenters and software organise resources. But even this is difficult for many documenters to implement, and it seems that in general many documenters are struggling with formal aspects of their documentary work because of a late recognition by leaders in documentary linguistics that a good language documentation might be very much more than a set of dozens, hundreds, or thousands of files in archivable formats.
