Last month I received the following email query from a colleague:
“I am currently submitting a grant application for a small grant at the HRELP to document …. One concern I have is how many hours it will realistically take to transcribe one hour of text. I have done fieldwork in the past, but this would be the first time that I will have trained a transcriber who would work (mostly) independently. (The linguists on the project would consult with them.) I would like to give some sort of concrete number of total hours transcribed and translated (in contrast to fully annotated).”
Since this is an issue I have been asked about several times, I present here an elaborated version of what I wrote back to my correspondent. (Here I use ‘source language’ to refer to the language of the recording, and ‘target language’ to refer to the language of a translation of the recording. I restrict my remarks to transcription of spoken languages.)
I wrote back:
The answer to your question is kind of like the answer to the question: ‘How long is a piece of string?’
There are so many variables:
- how many languages/varieties are represented in the recording (is it monolingual or multilingual) and what languages these are
- the transcriber’s familiarity with and fluency in the source language(s) (including, if they are a native speaker, whether they speak the same dialect as the interviewees)
- whether the transcriber can work alone or needs to work together with someone else (the interviewee or another speaker) to listen to the recording and have it repeated back (possibly at a slower rate) for transcription. Some transcribers do a ‘first pass’ rough transcription that is then checked with another person to arrive at a more refined transcription. The transcription time should be calculated as the sum of the times for these two processes
- the phonology of the source language – some languages have more segmental distinctions than others, and, depending on who is doing the transcribing, some distinctions may be more difficult to hear and transcribe than others. If a language has suprasegmental contrasts to be included in the transcription (eg. tonal contrasts) the nature of these will also affect the amount of transcription time. Tony Woodbury reports that:
“The Eastern Chatino of Quiahije has 20 phonemically different tones, with complex sandhi phenomena that affect morpholexical tones. Transcription alone by trained fluent native speakers takes 1 hour for 5-10 minutes of clear monologic speech. I’m slower than that, and I typically transcribe in tandem the post-sandhi phonemic version and a lexical version, mainly as a check on myself (that commits me to sandhi testing except if the context is just right). So for transcription alone, I’d be more like 1 hour for 2-3 minutes. In reality though, I also gloss and determine inflectional categories as I go, because that also helps me narrow down the tone possibilities.
I’m a bit slower with the Eastern Chatino of Zacatepec; it is hard because two of the most “populous” tone categories sound exactly alike in isolation and can only be distinguished with sandhi tests.”
Transcription of other aspects such as melody of songs or chants, or gesture, will require special training and be correspondingly more time consuming.
- the transcriber’s familiarity with and fluency in the orthography for the transcription
- the transcriber’s familiarity with a number of aspects of the recording, including:
- genre — talk in more everyday registers may be less time-consuming to transcribe than special and rarer genres, eg. chants
- topic — talk about more familiar topics is easier to transcribe than less familiar ones, eg. topics that require specialist knowledge
- mode (monologue vs. dialogue vs. multi-party) — conversation between two people is more difficult to transcribe than monologue, and increasing the number of conversational partners greatly increases the difficulty of transcription
- setting — recordings made in noisy environments are more difficult to transcribe, especially if there is spoken language (eg. on a TV or radio) in the background
- identity of participants — if the transcriber knows that the people recorded have particular speech traits then that can help to identify that person in conversation and to transcribe their speech
The more familiar the transcriber is with these factors, the easier it will be to do the transcription.
- the attention spans and stamina of the linguist/transcribers — Pete Budd reports that he found that doing more than 60-90 minutes of transcribing at a stretch was tough for all parties
- whether the transcription is digital (typed as a computer file) or analogue (handwritten), in which case it needs conversion to digital. If digital:
- whether there is a (continuous) power supply that allows transcribers to work for extended periods
- what level of IT skills does the transcriber have — several colleagues have reported low levels of basic computer skills of collaborators (like being able to save files and then find them again) which adds to training and transcription time
- what input method is used? (eg. if there are accents or non-ASCII characters are they entered via the keyboard or via ‘insert symbol’?)
- whether the transcription is time aligned? Is software to be used for this? How familiar is the transcriber with the software, and how easy is it to use for the given task (eg. ELAN is good for multi-party transcription but requires a lot of training — see this article for an interesting discussion of some relevant issues)?
- whether the transcription needs checking and post-editing, and how much time needs to be allocated for that
- for translation, the level of fluency in the target language
- what kind of translation is intended – will it be literal, morpho-syntactic, idiomatic, UN-style, literary? (see Woodbury 2007)
- whether notes, exegeses, or comments are to be included
The experience of several colleagues is that having video recordings available speeds up the transcription process by making it easier to identify speaker turns and providing some access to context and extra-linguistic cues. Anthony Jukes, a colleague who works in Indonesia on Toratan, found that video recordings made transcription a more bearable and interesting task for the documentation team, and that the transcribers would persistently place audio-only transcription at the bottom of their ‘to do’ list.
A rough rule of thumb seems to be that an experienced transcriber who is fluent in the source language and skilled with transcription software needs a ratio of working time to recording time of at least 10:1 for monologue and 15:1 for conversation, ie. 6 minutes of monologue takes at least 1 hour to transcribe, and 6 minutes of conversation takes at least 1.5 hours. For rough transcription plus checking and refinement, a factor of 15:1 for monologue and 30:1 for conversation seems not uncommon. If we add translation, a ratio of 50:1 is not uncommon, ie. at least 5 hours is required to transcribe and translate 6 minutes of recording.
Note that, for all of this, as the car ads say, “your mileage may vary”.
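For anyone who wants to turn these rules of thumb into a figure for a grant budget, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name `estimate_hours` and the task labels are my own inventions, the ratios are the rough figures quoted above (minimums or "not uncommon" values, not guarantees), and I have assumed the 50:1 figure applies to monologue and conversation alike, since the rule of thumb does not split it by mode.

```python
# Rule-of-thumb ratios: minutes of work per minute of recording.
# These are the rough figures quoted in the post, not measured constants.
RATIOS = {
    "transcribe": {"monologue": 10, "conversation": 15},
    "transcribe_and_check": {"monologue": 15, "conversation": 30},
    # Assumption: the 50:1 figure is not broken down by mode in the post,
    # so the same value is used for both.
    "transcribe_and_translate": {"monologue": 50, "conversation": 50},
}


def estimate_hours(recording_minutes, task, mode="monologue"):
    """Return an estimate of working hours for a recording.

    recording_minutes: length of the recording in minutes
    task: one of the keys of RATIOS
    mode: "monologue" or "conversation"
    """
    ratio = RATIOS[task][mode]
    return recording_minutes * ratio / 60


# The worked examples from the rule of thumb: 6 minutes of recording.
print(estimate_hours(6, "transcribe", "monologue"))      # 1.0
print(estimate_hours(6, "transcribe", "conversation"))   # 1.5
print(estimate_hours(6, "transcribe_and_translate"))     # 5.0

# One hour of recording, transcribed and translated.
print(estimate_hours(60, "transcribe_and_translate"))    # 50.0
```

Any of the variables listed earlier (tone, background noise, the transcriber's computer skills and so on) could push the real figure well above these estimates, so it is probably wise to treat the output as a floor and adjust the ratios in the table to suit the project.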
Woodbury, Anthony C. 2007. On thick translation in language documentation. In Peter K. Austin (ed.) Language Documentation and Description, Volume 4, 120-135. London: SOAS.
Footnote: Many thanks to Pete Budd, Anthony Jukes and Tony Woodbury for comments on a draft of this post.