Archive for the ‘Technology’ Category.

Researching child language in the field: October LIP

Ruth Singer recaps some of the interesting points of last week’s Linguistics in the Pub, an informal gathering of linguists and language activists held monthly in Melbourne.

A number of linguists in Melbourne have recently begun documenting child language in the field. In the November 2011 LIP we discussed what you need to think about if you want to document child language, and why you might document child language as part of a broader language documentation project (see the earlier blog post). The most recent LIP, led by Lauren Gawne and Birgit Hellwig last week, revisited the topic of child language documentation. This allowed those who have recently returned from the field to discuss some of the problems they faced and how they dealt with them. In particular, we looked at the gap between what is possible in remote fieldsites and some of the assumptions in the field of child language acquisition about what type of data is needed to study child language development. The quantity and frequency of data that can be collected in remote fieldsites is quite different to what can be done in the developed world. The limitations can be quite simple: for example, not being able to get accurate information on children’s ages.

To kick off the discussion we looked at ethics from a personal point of view. The previous LIP on child language was criticised for focussing too much on the requirements of institutional ethics boards at universities, schools, etc. So we discussed what types of decisions researchers had made to satisfy their own ethical concerns. A number of researchers said that they had no plans to make their recordings public. This goes against the current trend of making recordings of endangered languages as open as possible, given community consent.

Just to give an example, I have decided to keep access to my recordings of child language closed until the children are 18. If they are happy for me to open access to their recordings after they turn 18, I will do so. However, since I am currently recording children in groups of at least 3 people, it is likely that in many cases I will not be able to contact all participants, so the recording will remain closed. One of the issues we returned to a number of times over the evening is that our recordings are often made in open environments, which means that many people wander through the field of view. This is in contrast to mainstream child language data, which is usually recorded in a room through which only a limited number of people pass. It was mentioned that the CHILDES language database is a great example of an open access archive, but it lacks much data from endangered languages. CHILDES contains data recorded in many different studies of child language acquisition. However, to upload data to CHILDES you must have the consent of every person who appears, even if just walking past. This is not going to be possible for many recordings of endangered languages in remote areas: it is often difficult to find a room to record in, and even if one is found, it is likely that many people will pass through it.

Some of the other assumptions about child language acquisition research that can prove difficult in remote settings:

  • that a mother and child pair form natural conversational partners (they may rarely engage in idle chit-chat)
  • that adults typically play with children (it may be the case that children typically play with other children, not adults)

Since it is often difficult for a mother-child pair to engage in conversation in front of the camera, some suggested structured tasks, such as those used at the Max Planck Institute for Psycholinguistics. Others pointed out, however, that this makes it difficult to study language socialization, because you are asking people to engage in a culturally foreign activity. Still others suggested identifying local games that could be used in language acquisition research.

One big problem in applying the standards of child language acquisition research to remote contexts is the difficulty of obtaining recordings of the same child at regular intervals. Many of the linguists attending the LIP session work in Papua New Guinea and in Australian Aboriginal communities. They pointed out that children, and often their whole families, move around much more than they had expected. The set of children living in a community may barely overlap from one fieldtrip to the next. In addition, some child language researchers recommend making recordings every 2 months or so, which is not possible in remote settings. The limitations are partly financial and partly due to the time the linguist needs to travel to the remote location from their home.

There was quite a bit of time devoted to the technology used to record children, who are rather more mobile than adults. One researcher recommended the use of teddy-bear shaped backpacks for children. These can carry the heavy transmitter of the radio microphone. Everyone agreed that noise is a big issue. Even if there is no wind, which small radio microphones don’t handle well, children’s motion invariably causes noise. One researcher only recorded in areas without many leaves as the noise of these being crushed beneath children’s running feet was too loud.

Birgit Hellwig discussed some of the data from her recent 2-month fieldtrip to Papua New Guinea, which she did with child language acquisition specialist Evan Kidd (ANU). She said that by the end of the 2 months, the community they were visiting had more or less become used to the cameras and to what it meant to have child language researchers in the community. One thing that Birgit emphasised is that what participants need to do is not as obvious to them as we might think. Birgit gave a lovely example from the use of the frog story task. The frog story, ‘Frog, where are you?’, is a short children’s picture book without any words. Children were asked to tell the story in their own words. It became apparent during the course of Birgit’s 2-month fieldtrip that changes in how children told the story from week to week were related to narrative practices in the community. The story was circulating in the community, just as any story does, and changing slightly over the course of time. Rather than each new child who participated in the task telling the story afresh, ‘in their own words’, each told it as it was in its current form in the community. This resulted in remarkable convergence between tellings that were recorded around the same time.

It became clear from the discussion that we can’t expect to do research on child language in the same way as it is done in more controlled environments. We will not get comparable quantities of data for each child. However, whatever we do record is likely to be really interesting. We only have data on child language for a small number of languages, so anything will help.

Technology and language documentation: LIP discussion

Lauren Gawne recaps last night’s Linguistics in the Pub, a monthly informal gathering of linguists in Melbourne to discuss topical areas in our field.

This week at Linguistics in the Pub it was all about technology, and how it impacts on our practices. The announcement for the session briefly outlined some of the ways technology has shaped expectations for language documentation:

The continual developments in technology that we currently enjoy are inextricably connected to the development of our field. Most would agree that technology has changed language documentation for the better. But while nobody is advocating a return to paper and pen, most would concur that technology has changed the way we work in unexpected ways. The focus is usually on the materials we produce, such as video, audio and annotation files, as well as particular types of computer-aided analysis. In a recent ELAC post, ‘Hammers and nails‘, Peter Austin claims that metadata is not what it was in the days of good old reel-to-reel tape recorders. The volume of comments suggests that this topic is ripe for discussion. This session of Linguistics in the Pub will give us a chance to reflect on how our practices change with advances in technology.

There are a (very) few linguists who advocate that researchers should go to the field with nothing beyond a spiral-bound notebook and a pen, though no one at the table was quite willing to go that far; all of us, it seems, go to the field with a good quality audio recorder at the very least. Without the additional recordings (be they audio or video), the only output of the research is the final papers written by the linguist, which are in no way verifiable. The recording of verifiable data, and the slowly increasing practice of including audio recordings in the final research output, are allowing us to further stake our claim as an empirical and verifiable field of scientific inquiry. Many of us shared stories of how listening back to a recording we had made enriched the written records we have, or allowed us to focus on something that wasn’t the target of our inquiry at the time of the initial recording. The task of doing the level of analysis that is now expected of even the lowliest sketch grammar is almost impossible without the aid of recordings, let alone capturing the subtleties present in naturalistic narrative or conversation.

Hammers and nails

Back in the old days when some of us were younger and starting out on our language documentation and description careers (for me in 1972, as described in this blog post) the world was pretty much analogue and we didn’t have digital hardware or software to think about.

Back then recordings were made with reel-to-reel tape recorders, like the Uher Report, or, if you had really fancy kit, a Nagra. If you were working in Australia on Aboriginal languages you could archive your tapes at the Australian Institute of Aboriginal Studies (AIAS), as it then was, later the Australian Institute of Aboriginal and Torres Strait Islander Studies (AIATSIS). They would copy your tapes onto their archive masters and return the originals to you, and all you, as a depositor, had to do was fill in tape deposit sheets. You were supplied with a book of these, alternately white and green, with a sheet of carbon paper to be placed between them. For each tape you had to complete a white sheet listing basic metadata and a summary of the contents of the tape, tear off the white copies (keeping the green carbon copy) and submit them to the AIAS archive. In addition, the Institute encouraged the preparation of tape audition sheets, where the content of the tapes was summarised alongside time codes (in minutes and seconds) starting from the beginning of the tape. Sometimes these were created by the depositor and sometimes by the resident linguist (at that time Peter Sutton).

So, if you wanted to find out where in your stack of tapes you could find Story X by Speaker Y you simply had to look at the deposit sheets and/or the audition sheets.

Alas, those days are gone and we are in the digital world, where our experience is mediated via software interfaces that can fool us into seeing the world the way the interface presents it. For language documenters, Toolbox is often the analytical software tool of choice (along with ELAN)1 for processing and for value-adding analysis and annotation of recordings. As I claimed in a previous post, the existence of Toolbox means that for many documenters annotational value-adding means only interlinear glossing, and alternatives such as overview or summary annotation (like the old tape audition sheets) are not part of their tool set. I have two pieces of evidence for this:

  1. the Endangered Languages Archive (ELAR) at SOAS has so far received around 100 deposits comprising roughly 800,000 files. Among these deposits there are many that are made up entirely of media files (plus basic cataloguing metadata), with no textual representation of the content of the files beyond a short description in the cataloguing metadata. When asked about annotations, depositors typically respond that they “are working on transcription and glossing” but, because of the time needed, cannot provide anything now. They do not seem to consider an alternative, namely time-coded overview annotation, which can (and probably should) be done for all the media files, only some of which would then be selected and given priority for interlinear glossing. Why? One reason might be that there is no dedicated software tool designed and set up to do this in an easy and simple manner (interestingly, a tool that can be so used, and that produces nice time-coded XML output, is Transcriber, though it is generally thought of as a tool for transcription annotation only; it also does not have a “reader mode” that would allow for easy viewing and searching across a set of overview annotations created with it);
  2. during training courses and public presentations over the past couple of years I have been warning that current approaches to language documentation risk the creation of “data dumps” (which I have also called “data middens”), because researchers are not well trained in corpus and workflow management and additionally suffer from ILGB or “interlinear gloss blindness”, which drives them to see textual value-adding annotation in terms of the interlinear glossing paradigm.2 The most recent example of such a presentation was during last month’s grantee training course at SOAS (the Powerpoint slides from my presentation are available on Slideshare). All but one of the grantees attending the training had never heard of, or considered creating, overview summary annotation before launching (selectively) into transcription and interlinear glossing of their recordings.
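The overview annotation described above need not be elaborate; the old audition sheets were just start times plus one-line summaries. A minimal sketch in Python (the recording contents and the plain-text format here are entirely hypothetical, not any standard):

```python
# A time-coded overview annotation in the spirit of the old tape audition
# sheets: each row is a start time (in seconds) and a one-line summary of
# what happens on the recording from that point.
overview = [
    (0,    "greetings; speakers introduce themselves"),
    (95,   "narrative: crocodile story told by Speaker Y"),
    (610,  "discussion of kin terms; some code-switching"),
    (1180, "song performance; heavy background noise"),
]

def format_overview(rows):
    """Render (start_seconds, summary) pairs as 'MM:SS  summary' lines."""
    lines = []
    for start, summary in rows:
        minutes, seconds = divmod(start, 60)
        lines.append(f"{minutes:02d}:{seconds:02d}  {summary}")
    return "\n".join(lines)

print(format_overview(overview))
```

A sheet like this can be drafted in real time while auditioning a recording, and later used to decide which passages deserve full transcription and interlinear glossing.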

I may be wrong about the source of the current ILGB and perhaps Toolbox is not (solely) to blame, but I do believe that it plays a part in a narrowing of conceptual thinking about annotation in language documentation, and hence the behaviour of language documenters.

NB: Thanks to Andrew Garrett for his comments on my previous post that caused me to think more deeply about these issues and attempt to explicate and exemplify them more clearly here.


  1. ELAN is a tool designed for time-aligned transcription and annotation of media files, and is also widely used by language documenters, bringing with it its own kind of window on the world that I do not discuss here.
  2. There may be a separate further dimension to be concerned about that results from the shift from analogue to digital hardware, rather than being a software issue. In the old days tapes were expensive and junior researchers in particular only had access to a rationed supply and therefore had to think seriously about what and how much to record. Today, with digital storage being so cheap and easy to use (especially for copying and file transfer), there is a temptation to “record everything” on multiple machines (one or more video cameras plus one or more audio recorders) and not write much down because “you can always listen to it later”. This can easily and quickly give rise to megabytes of files to be managed and processed. I saw this temptation among the students taking my Fieldmethods course this year — they learned after a few sessions of working with the consultant this way about the pain that then comes from the need to search through hours of digital recordings for which they had few fieldnotes or metadata annotations.

Retrofitting a collection? I’d rather not

I just had a visit from a student wanting to deposit, in the PARADISEC archive, a collection of recordings made in the course of PhD fieldwork. It is a great shame that they are only now thinking about how to deposit this material, as it will need considerable work to make it archivable. If they had sought advice before doing all of the research (or looked at the PARADISEC page ‘Depositing with PARADISEC’, or at the RNLD pages, for example), it would have been so much easier for all of us. Why?


Re-posting – creating interfaces in different languages for Facebook

That munanga linguist has a nice post on how he used Kevin Scannell‘s code to develop a Kriol interface for Facebook. (Kevin Scannell also records Indigenous language tweets, and his web-page has lots of papers and comments on Indigenous languages and Natural Language Processing.)

Endangered languages, technology and social media (again)

There has been a little flurry of media stories about endangered languages in the last couple of days, with titles like “Digital tools ‘to save languages'” on the BBC News website and “Cyber zoo to preserve endangered languages” in the Sydney Morning Herald (readers who are on Facebook can find a full listing on David Harrison’s home page). The stories were all triggered by publicity from a session at the American Association for the Advancement of Science meeting in Toronto called “Endangered and Minority Languages Crossing the Digital Divide”, co-organized by David Harrison and Claire Bowern (see Mark Liberman’s Language Log post for a report). The abstract for the session says:

“Speakers of endangered languages are leveraging new technologies to sustain and revitalize their mother tongues. The panel explores new uses of new digital tools and the practices and ideologies that underlie these innovations. What new possibilities are gained through social networking, video streaming, twitter, software interfaces, smartphones, machine translation, and digital talking dictionaries?”

It’s good that the mainstream media is focussing attention on endangered languages again, though as usual they find themselves falling back on the old tropes of “technology saves dying tongues” (surely the SMH has to win the booby prize with its use of the word “zoo” in this context!). I suppose I would be told it’s sour grapes if I were to point out that for over three and a half years some of us have been writing about and making talking dictionaries on mobile phones (see James McElvenny’s 2008 blog post and the Project for Free Electronic Dictionaries), and observing and participating in the use by minority language speakers of social media like Facebook and Twitter, but it’s interesting that it takes a news story out of North America, via National Geographic, to get some publicity for these topics.

Oh well, at least it’s in the news for a day or two.

Making old dictionaries new again

Today’s post is something of a recipe for making old dictionaries new again. I’ll explain how a 35-year-old, single-copy typewritten dictionary is living a new life as a digital database.

The language of this dictionary is Kagate, a Tibeto-Burman language of the Central Bodic branch spoken in Nepal. I met some speakers of this language a number of years ago, as I’m working on a closely related dialect of Yolmo. There was some documentation of Kagate in the mid-1970s, although most of the material produced was liturgical rather than linguistic.

As well as the two publications on Kagate mentioned on the Ethnologue site, Monika Höhlig and Anna Maria Hari also created a typewritten Kagate-Nepali-English-German dictionary. A copy of this dictionary has remained with their primary consultant, and although it is well looked after and still useable, it is the only copy the community has access to. It is also written only in Latin script, rather than in the Devanagari script that has since been developed for the language.

On a previous visit the Kagate speakers were kind enough to allow me and my colleague Amos Teo to scan the pages of the dictionary. At this point we also made them another paper copy of the dictionary, but obviously this is an unsustainable process in the long term. As you can see, the dictionary is already becoming discoloured and faded:

Amos took the scans and used the optical character recognition (OCR) software that comes with Adobe Acrobat 9. Even with such faded type, the OCR was effective at recognising the characters. As is to be expected with this kind of process, though, there was still a fair bit of cleaning up to do at this point. There were some alignment issues and some irregular characters. Also, some entries would copy strangely, with a row of 5-7 lexical items and then the corresponding definitions all in the line below.

From here the data needed to be massaged so that the appropriate field markers were present for Toolbox to read. With the data that we had we needed, at a minimum, to create these markers:

\lx – the Kagate word
\ps – part of speech
\de – an English definition
\dn – a Nepali definition
\xv – an example sentence

Using the find-and-replace function in an .RTF file, Amos was able to create these markers by turning the formatting of the original document to his advantage. For example, all of the Nepali definitions start with “Np:”, so we replaced “Np:” with “\dn”. Likewise, the remaining colons all sit at the start of the English definitions, so Amos simply found “ : ” and replaced it with “\de”. Amos was of course careful to do this in a set order; doing these two replacements the other way around would have led to more confusion. Using regular expressions is a more efficient way of doing this task, but even if you don’t know how to use regex (yet), that won’t stop you from doing this kind of work.
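The same ordered find-and-replace procedure can be sketched with regular expressions. The sample entry below and its exact spacing are hypothetical, and I use the \dn marker from the list above for the Nepali definition; the point is the ordering, since the “Np:” rule must run before the bare-colon rule or the colon inside “Np:” would itself be rewritten:

```python
import re

# Hypothetical OCR'd entry: headword, part of speech, "Np:" introducing
# the Nepali definition, and a bare colon introducing the English one.
raw = "khyim n Np: ghar : house"

# The headword and part of speech are the first two space-separated fields.
head, pos, rest = raw.split(" ", 2)

# Order matters: rewrite "Np:" before the bare-colon rule, otherwise the
# colon inside "Np:" would be turned into an English-definition marker.
rest = re.sub(r"Np:\s*", r"\\dn ", rest)     # Nepali definition marker
rest = re.sub(r"\s*:\s*", r"\n\\de ", rest)  # English definition marker

toolbox_entry = f"\\lx {head}\n\\ps {pos}\n{rest}"
print(toolbox_entry)
```

Run on the sample entry, this produces a four-line Toolbox record (\lx, \ps, \dn, \de), ready for the \xv example-sentence lines to be added by hand.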

Once the file was ready to open in Toolbox it still required a little bit of cleaning up. There were a few instances where the letter ‘l’ had been read as the number ‘1’, and some duplicated entries, but going through each entry and cleaning up these kinds of problems is still much more efficient than retyping the whole thing.
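Some of this cleanup can be semi-automated too, with the usual caveat that every automatic fix should still be eyeballed. A sketch with hypothetical headwords: a ‘1’ flanked by lowercase letters is almost certainly a misread ‘l’, and exact duplicate entries can be dropped as they are encountered:

```python
import re

# Hypothetical headwords as they came out of OCR, including a misread
# "1" for "l" and one duplicated entry.
headwords = ["khyim", "go1mo", "serka", "serka"]

def fix_l_for_1(word):
    """Replace a '1' flanked by lowercase letters with 'l' (a common OCR error)."""
    return re.sub(r"(?<=[a-z])1(?=[a-z])", "l", word)

cleaned, seen = [], set()
for word in map(fix_l_for_1, headwords):
    if word not in seen:  # drop exact duplicates, keeping first occurrence
        seen.add(word)
        cleaned.append(word)

print(cleaned)
```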

The great thing about now having a database to work with instead of a photocopy is that it was the work of an hour to create this:

It’s still exactly the same data as above, but it is much easier to manipulate into different forms. For example, I could have created a list of nouns only, or included only the Nepali definitions. This database is also the start of a project to create a new dictionary. While the owner of this dictionary is proud of it, it has many limitations. The first is that it is all written in Latin script, although there is now a fully functional Devanagari script for Kagate, as well, of course, as for Nepali. There are also few example sentences, and some items are missing, such as the number eleven. But the most pressing issue with the current dictionary is that there is only one copy. By working in a database we’ll be able to make as many copies as we like at the end, and use the information in other ways too. But that’s all a story for another post.

Child language documentation: a LIP discussion

Lauren Gawne recaps last night’s Linguistics in the Pub, a monthly informal gathering of linguists in Melbourne to discuss topical areas in our field.

This month we were joined by Barbara Kelly (The University of Melbourne), who has extensive experience in the fields of language documentation and child language acquisition, for a discussion of the why and how of documenting child language. Barb started the discussion by mentioning that many people who work in language documentation have the perception that child language is not relevant to them, but child data is relevant to everyone. Although the general fieldwork model of working only with adult native speakers is the current standard practice, it is only one way to document a language, and documenting child language can also provide useful data.

Child language acquisition data is important for a number of reasons, and the discussion touched on only a few of the most pressing. One is that language doesn’t occur in a vacuum: to get a full understanding of how a language works and is used, it is insufficient to record only adults talking with adults. In language communities adults spend a lot of time interacting with children, and so how they talk to, and are talked to by, the children is important. It’s also important to understand how the language is acquired. Granted, it’s not possible for a single researcher to work on every angle, but even collecting data while on fieldwork gives someone else the opportunity to investigate potentially interesting acquisition patterns. We might have a good idea of how English language features develop, but for grammatical features outside of English, such as evidentials, or for highly polysynthetic languages, there are still some very basic questions that need to be addressed. Also, in terms of language maintenance and revival, working with children is paramount. By asking them to share their language with you there’s the potential to help them understand what is special or important about their language, and in reclamation projects the easiest way to work out materials for teaching a child is to listen to what a child speaking the language sounded like. Finally, working with children can be fun and challenging. It’s an opportunity to throw out the last shred of control you thought you had over a fieldwork situation and just see where a session takes you.


Using video in language documentation: a LIP discussion

This is a recap of Linguistics in the Pub, held at the Prince Alfred Hotel, Carlton, on Tuesday the 6th of September, written by Lauren Gawne. From now on this will be a regular feature here at Endangered Languages and Cultures.

For the topic of video in language documentation we were lucky to be joined by Joe Blythe (Max Planck Institute, Nijmegen) and Jenny Green (ELDP-funded postdoc at The University of Melbourne), who have both worked extensively with video and both recently returned from fieldwork. Joe started off the session by talking us through some of his data. He has just returned from a field trip to Wadeye, where he is continuing to collect conversational data. On this trip Joe tried working with some new speakers, and with some of his regular speakers but in different environments. He found it interesting that a shift in location for people he worked with regularly, for example into a house instead of out bush, would lead to very different behaviour towards the camera. He was kind enough to show us not only some of his excellent (and often quite scenic) data but also some of these less successful attempts. Even less successful recordings are interesting in their own way.


NRPIPA Symposium in Darwin 13-14 August 2011

Another stunning array of papers and associated performances will feature at the 10th Annual Symposium of NRPIPA (The National Recording Project for Indigenous Performance in Australia). This year there will be a focus on community databases for access to recordings.
Venue: North Australian Research Unit, The Australian National University, Darwin, 13–14 August 2011
Presented in association with:
The University of Sydney, ‘Intercultural Inquiry in a Transnational Context: Exploring the Legacy of the 1948 American–Australian Scientific Expedition to Arnhem Land’ (an Australian Research Council Discovery Project, hosted at PARADISEC, University of Sydney)
and The Australian National University’s School of Music, College of Arts & Social Sciences


Saturday 13 AUGUST 2011
9.30–10.30 Joe Gumbula and Martin Thomas ‘Ceremonial Responses to the Repatriation of Human Remains from Arnhem Land’
10.30–11.00 Amanda Harris ‘The Nutritionist and Her Chaperone: The American–Australian Expedition’s Fish Creek Camp in Arnhem Land’
11.30–12.30 Archie Brown, David Manmurulu, Charlie Mangulda, Bruce Birch and Linda Barwick ‘Welcoming the Upcoming Generations in Western Arnhem through Song’
12.30–1.00 Anthony Linden Jones ‘“You Couldn’t Take it Down in Our Scale”: Traditional Song and the Musical Score to CP Mountford’s Documentary Films’
2.00–2.30 Peter Williams ‘The Wollombi Corroboree’
2.30–3.00 Helen Rrikawuku Yunupiŋu ‘Milkarri Wäŋa-Ŋarakaŋur: Keening on Country’
3.00–3.30 Cathy Hilder, Anja Tait, Kate King and Tony Gray ‘Recording Stories: Revitalising and Maintaining Indigenous Languages in the Northern Territory Library’
4.00–4.30 Samuel Curkpatrick ‘Grooving with the Ancestors: Wägilak Song and the Australian Art Orchestra’
4.30–5.30 Aaron Corn ‘Nations of Song’

Sunday 14 August
9.00–9.30 Myfany Turpin ‘Text Setting in Warlpiri Yawulyu’
9.30–10.00 Nicholas Kirlew ‘Community Stories: The New Version of the Successful Ara Iritija Software’
10.00–10.30 Linda Barwick, Joe Blythe and John Mansfield ‘The Wadeye Song Database’
11.00–12.00 Genevieve Campbell, Teresita Puruntatameri and the Wangatunga Strong Women ‘Ngariwanajirri — The Strong Kids Song’
12.00–1.00 Joe Blythe ‘From Malgarrin to Metallica: A Rockumentary History of Wadeye Music’
2.00–3.00 Matthew Martin, Pansy Nulgit, Sherika Nulgit and Sally Treloyn ‘Moving People and Places: The Sustaining Junba Project’
3.00–3.30 Allan Marett ‘It’s Not Just about Preserving Music and Dance: It’s Something Much Bigger’
4.00–5.00 Roundtable discussion on ‘Community Databases: Access, Training, Management’