Glossed texts — the fiddle factor

In a recent blog post, Jane Simpson reported on opinions expressed by a group at ANU meeting to discuss grammar writing:

“We all agree it’s a good thing to publish glossed texts so that readers can check out the hypotheses proposed in the grammar, and expressed by the glossing.”

I’d like to inject a note of caution here. It seems to me that published texts, with or without interlinear glossing, and especially those that derive from transcriptions of spoken language, have often been fiddled with (or, to put it more politely, ‘edited’) on their way from recording to printed page. This is also often true of published texts that are based on written originals produced by literate native speakers. What Wamut observed about Jeffrey Heath’s work on Ngandi, in a comment at the end of Jane’s blog post, is rarely the case:

“What is especially great, is that when you go back to Heath’s archived field recordings, the spoken texts are there in pristine form, that is, the spoken text and written text correlate perfectly” [emphasis added]

Heath adopted the same principle of “perfect correlation” in his published work on other languages, such as his 1980 Nunggubuyu Myths and Ethnographic Texts, whose introduction clearly states: “in the texts presented here I have not ‘weeded out’ false starts, intrusive English words, or grammatical errors by the narrators”.
In many other cases of text publication, I know editing has taken place: I have done it myself, and some other researchers have admitted to it (though rarely indicating exactly what editorial changes were made; more on this below). The texts in my 1997 book Texts in the Mantharta Languages, Western Australia (Tokyo: ILCAA, Tokyo University of Foreign Studies) were heavily edited, though I didn’t mention that in print at the time. It was only when I came to create a multimedia Jiwarli website, where both published texts and original recordings were presented, that I had to confess: “[y]ou may also notice that the Jiwarli texts are not word for word identical to the sound files, as Jack Butler, after recording the stories, made his own corrections in the texts”. There was no attempt to deceive here; rather, it was Jack’s explicit wish that the stories be edited for publication.
As an example, consider published Text 50 (which appears on the website here) and the way it corresponds to the original recording (italics indicates material on the tape which was deleted in the editing process, bold indicates text added during editing, and { x == y} indicates substitution during editing):

Nhukuramartuthu ngurrunyjarri julyumartu ngunha nhanyaartu {porcupinemanha == jiriparrinha} puniyanha. {porcupine == Jiriparri} ngunha jakuparlarrirarru. Ngurntirarri jakuparlarru parnajipithu ngunha warrirru nhanyapuka. Ngurrunyjarrilu yarnararnilaartu ngurntapuka ngunhapa jakuparla. Wangkirarringu. Yarnararrima nhurra. Yarnararrima nhurra. Ngatha {nhurranha murrurrpa manara nhurranha}. Yarnararrima. Ngatha {nhurranha murrurrpa manara nhurranha}. Yarnararrima. Ngatha murrurpa manara nhurranha. Kunyarnurru ngunha kumpanhu. {Porcupinemanha == Jiriparri} ngunha kurlkanyunthurru yarnararrira. When he Yarnararrirathu parnarru thangkalpuka wurungku wirntupinyangurru pirrurru yanararri thikaru.
Editorial changes that Jack and I made are the following:

  • replacement of the loan word ‘porcupine’ with the indigenous word jiriparri, and deletion of the English expression ‘when he’
  • omission of the enclitics: -thu ‘old information’, -pa ‘specific referent’ in order to decontextualise reference
  • omission of repetition: three repeats of ‘Lie on your back. I’ll get you cicatrices’
  • reordering of constituents: the possessor ‘your’ and ‘cicatrices’ are separated on the tape but were made adjacent in the editing for publication
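
The { x == y } substitution notation used in the text above is regular enough to be processed mechanically. The following is a minimal Python sketch, assuming that each substitution is written as `{recorded == published}` and that braces without `==` mark something else and should be left alone. The function names are hypothetical, and the sketch does not handle the deletions and additions, which are marked only by italics and bold.

```python
import re

# Matches "{ recorded == published }". The character class [^{}] stops a match
# from running across a brace pair that contains no "==".
SUBSTITUTION = re.compile(r"\{\s*([^{}]*?)\s*==\s*([^{}]*?)\s*\}")


def recorded_version(marked_up: str) -> str:
    """Resolve each substitution to the form heard on the recording."""
    return SUBSTITUTION.sub(lambda m: m.group(1), marked_up)


def published_version(marked_up: str) -> str:
    """Resolve each substitution to the form printed in the published text."""
    return SUBSTITUTION.sub(lambda m: m.group(2), marked_up)


def substitutions(marked_up: str) -> list:
    """List every (recorded, published) pair in order of appearance."""
    return [(m.group(1), m.group(2)) for m in SUBSTITUTION.finditer(marked_up)]


# A sentence taken from Text 50 as given above.
sample = "{porcupine == Jiriparri} ngunha jakuparlarrirarru."

print(recorded_version(sample))
print(published_version(sample))
print(substitutions(sample))
```

Keeping both readings in a single marked-up source means that neither the recorded nor the published form has to be reconstructed from memory later.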

Wamut also mentions in his comment on Jane’s post another possible way in which published texts can differ from recordings:

“I’ve heard other spoken texts vary from the published text because the field worker has interrupted the speaker for clarification etc.”

There are also cases I know of where speakers “interrupt” themselves. My colleague David Nathan tells me that when he was working with Luise Hercus to produce a multimedia CD-ROM of Baagandji materials, he found Luise’s audio recordings of stories also contained interpolations and explanations in English by the speaker which do not appear in the published texts.
I think descriptive linguists and language documenters could well take some guidance in this area from the work of epigraphers who have been developing a TEI/XML markup for epigraphy called EpiDoc. Some of the EpiDoc proposals are concerned with adaptation of the TEI guidelines to deal with a range of issues such as legibility of characters on stone, missing elements or partially represented signs, but in addition there are several issues that I think should equally be of concern to language documentation:

  • additions and deletions to the text
  • editorial supplements, observations, and hypotheses, including:
    • identification and expansion of abbreviations understood by the editor
    • identification of abbreviations not understood by the editor
    • editorial supplement in which the editor makes a “subaudible” word manifest
    • editorial supplement in which the editor explains a “breviatio” or note
    • editorial supplement for characters wholly lost
    • letters omitted because the stonecutter did not carry out the text to the end
  • editorial corrections
    • letters erroneously included in the text, which the editor suppresses
    • letters erroneously omitted from the text, which the editor adds
    • letters erroneously substituted in the text, which the editor corrects

The EpiDoc guidelines contain explicit recommendations on how to encode these as markup annotations to the text. For work on endangered languages I think there are some additional aspects that should be encoded, especially because we typically need to distinguish at least three participants in the process of published text creation, namely the original speaker, the transcriber, and the linguist-editor. We should pay attention to:

  • encoding code-switching, code-mixing and borrowing, ideally by coding for the language (or variety) of the items transcribed
  • puristic editorial amendments on the part of the transcriber
  • puristic editorial amendments on the part of the linguist
  • deletions by the transcriber
  • additions by the transcriber
  • reorderings by the transcriber
  • additions and clarifications (editorial comments) by the linguist-editor
  • when the transcriber is not the originally recorded speaker we need to deal with (1) inter-speaker variation at the dialect or idiolect level and (2) inter-speaker variation arising from language loss, eg. phonemic or grammatical reduction among semi-speakers in a later generation transcribing earlier recorded texts
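
As a rough illustration of what a record of such decisions might look like, here is a small Python sketch. It is not EpiDoc, TEI, or any existing standard: the class names, field names, and category labels are all hypothetical, chosen only to mirror the list above. The two sample records come from the Text 50 changes described earlier, which Jack and I made together, so both are attributed to the speaker and the linguist-editor.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Agent(Enum):
    SPEAKER = "original speaker"
    TRANSCRIBER = "transcriber"
    LINGUIST = "linguist-editor"


class Change(Enum):
    DELETION = "deletion"
    ADDITION = "addition"
    SUBSTITUTION = "substitution"
    REORDERING = "reordering"
    COMMENT = "editorial comment"


@dataclass(frozen=True)
class Edit:
    change: Change
    agents: Tuple[Agent, ...]    # everyone responsible for the change
    recorded: str = ""           # form on the recording (empty for additions)
    published: str = ""          # form in the published text (empty for deletions)
    recorded_language: str = ""  # language or variety of the recorded form
    reason: str = ""             # free-text justification


def count_by_change(edits):
    """Count how many edits of each kind were made."""
    return Counter(edit.change for edit in edits)


def edits_by(agent, edits):
    """Return the edits in which the given participant had a hand."""
    return [edit for edit in edits if agent in edit.agents]


jack_and_linguist = (Agent.SPEAKER, Agent.LINGUIST)
text_50_edits = [
    Edit(Change.SUBSTITUTION, jack_and_linguist, "porcupine", "Jiriparri",
         "English", "loan word replaced with indigenous word"),
    Edit(Change.DELETION, jack_and_linguist, recorded="When he",
         recorded_language="English", reason="English expression removed"),
]

print(count_by_change(text_50_edits))
print(len(edits_by(Agent.TRANSCRIBER, text_50_edits)))
```

Recording the responsible participants as a tuple rather than a single value allows for jointly made changes, which is the situation with the Jiwarli texts.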

To my mind, it will only be when linguists make available marked up documents encoding these aspects along with the published texts, and the original media recordings (ideally publicly available through an archive or distributed on CD or DVD along with the published texts), that we can start truly talking about “falsifiability” of grammars and other analytical claims about languages. The “published texts” alone are often simply not enough.

1. The ideas presented here have been fermenting since they were first publicly presented at an ELAP Workshop at SOAS in February 2005. At the Simposio Internacional: Contacto de Lenguas y Documentación (International Symposium on Language Contact and Documentation) held in Buenos Aires last month, Ulrike Mosel presented a paper entitled “Putting oral narratives into writing: experiences from a language documentation project in Papua New Guinea” in which she explored the issue of editing recorded Teop texts for publication. She independently identified many of the same issues I outline here.
2. I have been unable to find any discussion of the importance of explicit encoding of transcriptional and analytical editing decisions among the list of “best practices” promoted, eg. by the E-MELD School of Best Practice, despite the fact that, to me at least, they play an important role in “practices which are intended to make digital language documentation optimally longlasting, accessible, and re-usable by other linguists and speakers”.

4 thoughts on “Glossed texts — the fiddle factor”

  1. There is some nice discussion about the material presented here over at Claire Bowern’s blog, especially in the comments section.

  2. Peter Austin makes a very important point. It was for this reason that I made a CD version of my PhD thesis and subsequent book on the Tai Languages of Assam. In the version of the grammar on the CD (which has exactly the same text as the printed book), linguistic examples are linked to sound files so the reader can hear the recording and judge for themselves. So far nobody has written back to me and suggested an alternative reading or analysis, but I look forward to that day.

  3. Stephen – I haven’t seen your CD version of the grammar, but it sounds similar to Nick Thieberger’s Grammar of South Efate (and his previous PhD on the language) which was published with a CD of the sound files in the back.
    I was also trying to make the point that there may be good reasons to “clean up” transcriptions or written texts for publication (community preferences for editing, removal of loan words, corrections of ‘slips of the tongue’ etc) but anyone who does that should also make available annotations (eg. in an XML version, as the epigraphers do) which explain why the published version differs from the original. That makes for good science and helps people in the future understand how published texts can differ from other potential sources of data.

  4. Interesting post. Nick Enfield’s recent Lao grammar (Mouton) has meticulously transcribed conversations (including of course all the errors, false starts, repairs, etc.) instead of the traditional texts appended. Many of the examples given in his grammar are drawn from these (and other) conversations.
    His motivation is worth quoting: ‘The texts supplied in this chapter illustrate the kind of discourse in which Lao grammar emerges. (…) The choice to concentrate exclusively on conversation here is a form of affirmative action. Conversation as a structured domain is under-studied in linguistics compared to research on structure in semantics and sentence-level syntax. Yet conversation is by far the dominant, unmarked genre in language usage, and in language acquisition. This chapter reverses the usual balance in the ‘texts’ section of grammars: elicited monologues, with a very occasional fragment of conversation. (…) [W]ith a large enough sample, conversation yields the full complement of a language’s structural resources, including embedded narratives, procedural descriptions, and similar genres more familiar to descriptive linguistics.’ (Enfield 2007:487)
