Is Toolbox the linguistic equivalent of Nietzsche’s typewriter?

There is an aphorism (apparently derived from Maslow 1966) that goes “if all you have is a hammer, everything looks like a nail”. For some documentary linguists, reliance on the Toolbox software program means that everything linguistic looks like an interlinear gloss.

Toolbox (developed originally in 1987 as Shoebox by the Summer Institute of Linguistics) is a widely used data management and analysis tool for field linguists. It is designed for researchers to take units of transcribed text (typically sentences) and semi-automatically “gloss” them to create multi-tier interlinearised text broken into words, which are then broken into constituent morphemes with aligned annotations such as sentence translations, morphemic translations, part of speech designations, and so on (for further discussion of interlinear text models see Bow, Hughes and Bird 2003).

Because Toolbox is free, and widely recommended for use in language analysis (it is commonly taught in training courses, such as InField, or ELDP grantee training, for example), it has had a large and constraining impact on how documentary linguists think they should do their research. I would suggest that it is a tool that has had a narrowing effect, like Nietzsche’s typewriter, as described by Carr 2008:

Sometime in 1882, Friedrich Nietzsche bought a typewriter—a Malling-Hansen Writing Ball, to be precise. His vision was failing, and keeping his eyes focused on a page had become exhausting and painful, often bringing on crushing headaches. He had been forced to curtail his writing, and he feared that he would soon have to give it up. The typewriter rescued him, at least for a time. Once he had mastered touch-typing, he was able to write with his eyes closed, using only the tips of his fingers. Words could once again flow from his mind to the page.

But the machine had a subtler effect on his work. One of Nietzsche’s friends, a composer, noticed a change in the style of his writing. His already terse prose had become even tighter, more telegraphic. “Perhaps you will through this instrument even take to a new idiom,” the friend wrote in a letter, noting that, in his own work, his “‘thoughts’ in music and language often depend on the quality of pen and paper.”

“You are right,” Nietzsche replied, “our writing equipment takes part in the forming of our thoughts.” Under the sway of the machine, writes the German media scholar Friedrich A. Kittler, Nietzsche’s prose “changed from arguments to aphorisms, from thoughts to puns, from rhetoric to telegram style.”

I believe that how annotation is conceptualised in language documentation, and presented in reference works like Schultze-Berndt 2006, reflects the narrowing influence of software tools like Toolbox and the dominance of interlinear glossing as an analytical method.

An alternative, developed originally by David Nathan, that we recommend at SOAS for corpus creation, is summary or overview annotation:

An overview annotation can be considered as a kind of “roadmap” or index of a recording. It could consist of approximately time-aligned information about what is in the recording, who is participating, and other interesting phenomena. For example, you could write:

“from 1 to 3 mins Auntie Freda is singing the song called Fat frog; from 3-7 mins Harry Smith is telling a story about joining the army; from 7-10 mins there is some interesting use of applicative morphology; from 15-18 mins contains rude content that should not be used for teaching children”

This could be written as prose (as above) or, better, structured into a table.

If you are familiar with software such as Transcriber or ELAN, you can do an overview annotation by marking breaks in topics/speakers etc, and typing descriptive text into the segments between breaks. Another strategy is to simply type a number into the time-aligned segment and then create a table which links the numbers with the overview information categories.
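As a rough illustration of the “structured into a table” idea, the prose example above could be held as a list of time-ranged records and printed as a tab-separated table. This is only a sketch: the `Segment` class, its field names and the category labels (`song`, `narrative`, `grammar`, `access`) are hypothetical choices made for this example, not part of Transcriber, ELAN or any SOAS template.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One row of an overview annotation (field names are illustrative)."""
    start_min: int
    end_min: int
    category: str  # hypothetical label, chosen for this sketch
    note: str


# The four segments from the prose example above, with times in minutes.
overview = [
    Segment(1, 3, "song", "Auntie Freda is singing the song called Fat frog"),
    Segment(3, 7, "narrative", "Harry Smith is telling a story about joining the army"),
    Segment(7, 10, "grammar", "some interesting use of applicative morphology"),
    Segment(15, 18, "access", "rude content that should not be used for teaching children"),
]


def as_table(segments):
    """Render the segments as a tab-separated table, ordered by start time."""
    lines = ["start\tend\tcategory\tnote"]
    for s in sorted(segments, key=lambda seg: seg.start_min):
        lines.append(f"{s.start_min}\t{s.end_min}\t{s.category}\t{s.note}")
    return "\n".join(lines)


def find(segments, category):
    """Return every segment carrying the given category label."""
    return [s for s in segments if s.category == category]


print(as_table(overview))
```

Once the overview is in this form, a call such as `find(overview, "grammar")` retrieves the stretches flagged for a particular phenomenon across a recording, which is the kind of quick lookup that a prose summary makes harder.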

Interlinearisation of the Toolbox type is very time consuming (see my blog post on how much time transcription and interlinear annotation takes) while overview annotation can be done rapidly and relatively richly for a whole corpus, rather than the magical 10% of it too frequently referred to in the literature on linguistic annotation. This means that potentially it is a good alternative to the restricted representations that have been affected, like Nietzsche’s typewriter, by the very tool that documenters have come to rely upon.

References

Bow, Cathy, Baden Hughes and Steven Bird. 2003. Towards a general model of interlinear text. EMELD paper. [available online at http://emeld.org/workshop/2003/bowbadenbird-paper.pdf, accessed 2012-04-21]

Carr, Nicholas. 2008. Is Google making us stupid? What the internet is doing to our brains. Atlantic Magazine July/August 2008.

Maslow, Abraham. 1966. The Psychology of Science: A Reconnaissance. New York: Harper & Row.

Schultze-Berndt, Eva. 2006. Linguistic annotation. In Jost Gippert, Nikolaus P. Himmelmann and Ulrike Mosel (eds.) Essentials of Language Documentation, 213-251. Berlin: Mouton de Gruyter.

5 Comments

  1. Andrew Garrett says:

    This is interesting, Peter. But as it stands, is it a bit speculative? It’s clear from the example how Nietzsche’s typewriter (supposedly) constrained his prose. It would be helpful to see some concrete examples where, in your judgment, linguists’ preoccupation with parsing texts and analyzing their linguistic form, rather than just giving a general content & context summary, has constrained their documentary work. It seems to me that doing the linguistic grunt work also involves knowing generally about the content of a text one is working on. If all you’re saying is that it’s good to do both things, viz. analyze details and present a big corpus with information about its contents (analogously, one might publish a diplomatic edition of the Gaulish texts with general information about their apparent contents but full transcriptions of only a few), then it’s hard to see that this is controversial. As I say, it would be useful to point to some specific cases where progress has been limited by the narrowness you describe. (Or if it seems cruel to highlight specific cases, you could anonymize them somehow.)

  2. Lameen Souag says:

    Another case of “the best is the enemy of the good”, perhaps? After the first paragraph, I assumed you were going to discuss Toolbox-induced neglect of non-concatenative morphology, but that would be quite a different issue.

  3. Wayne Leman says:

    One of my mentors, T. Givón, once commented to me about software-driven analysis. It’s a danger all of us fieldworkers need to be aware of in this technologically advanced world where our machines can do so many things so much faster than we can. But can they do the things we really need them to do?! Only we humans can decide that. Of course, sometimes we really do need the help of the software to help us remember some things our human brains might have forgotten!

  4. Peter Austin says:

    @Lameen — I have a kind of love-hate relationship with Toolbox. I use it regularly in my own work, attempting to “tame” it and make it properly relational (see this workshop paper [.pdf]) but I can also wax poetical about its shortcomings (including not only failure to deal with non-concatenative morphology, but also lack of any attention to tone, especially grammatical tone). This post was rather about the idea that Toolbox’s existence (readiness to hand as a Heideggerian might say) blinkers some linguists into thinking that interlinear glossing is what language documentation is about.

    @Wayne — a recent experience I had of what you describe was playing with some Guwamu (Southern Queensland, Australia) sentences in Fieldworks (Flex) and the program proposing to write a grammar for me based on the data entered!

    @Andrew — thanks for your comments and sorry I wasn’t clear about my argumentation. I’ll probably have to craft a follow-up post to respond to your points properly and provide some evidence for my opinion.

  5. Claire says:

    I am just old enough to have started fieldwork in the system Peter mentions; with AIATSIS white/green audition sheets, a cassette recorder, and no reliable electricity on fieldwork (and certainly no laptop!), but shifted to fully digital by the end of the project. My early tapes are probably easier to find things on, since I have fairly detailed running summaries (handwritten, now scanned), but when playing snippets back to speakers on a recent trip, and when looking for examples from the grammar, I tend to search through Elan, and so the second part of the collection is rather easier to use. Roughly equal percentages of the recordings are transcribed, I’d guess, but almost all the transcripts for the second part are time-aligned, whereas only some of the first part is (thanks to one of my students, who has been helping me with this). Having said that, very little of my corpus is interlinearised. Almost all the transcriptions are Bardi and English free translation, with annotation but not interlinearisation. My metadata for early recordings is more consistently better than for the later ones, but once I got back in the habit of annotating both the elicitation notebooks and adding information about recordings to a database, it didn’t make much difference I don’t think. One big issue (and something to be addressed down the road) is that I no longer have a single searchable set of transcripts for Bardi; there’s the pre-Elan (dumped into a text file) and the post-Elan search done through Elan.
    There must be others in this situation – e.g. get all the ANU and Melbourne PhD students between 1995 and 2005 who are still working in the area and analyse their corpora!
