PARADISEC has digitised thousands of hours of legacy audio, the results of fieldwork by many researchers since the 1950s. The recordings are in a number of small languages (more than 1,400 languages are represented in the collection).
For many of these recordings the only written information we have is the scant notes on the tape cover. If we could listen to all the files we could gather more metadata and make the contents more useful for speakers of these languages. But that would take thousands of hours of listening time, which we don’t have. While we are working on Automated Speech Recognition methods to transcribe media automatically, it is still early days for applying that technology to all the different languages in PARADISEC.
In the past, we ran a project asking volunteers to download files and provide summaries in Elan, and it has been successful in providing guides to media files. Sometimes these summaries tell you where events in the file occur (singing here, a woman talking here, a different woman talking here); sometimes they include the opening metadata spoken by the recorder; in other cases they capture the English words used in elicitation. Once a summary is prepared, a user can go straight to the section they are interested in and transcribe it (like correcting OCR in TROVE). For example, Arthur Capell’s file of Vanuatu recordings has a transcript here: https://catalog.paradisec.org.au/repository/AC1/417/AC1-417-A1.mp3. The summary provided by the volunteer for this 40-minute file is below, giving essentially seven sections in the file:
Omba recordings, Walurigi dialect done by Simon Garae
Reading of the prodigal son in Omba
Part two – The Prodigal Son story
Part Three – story text
Reading of story in Aneityum
Readings in the Kwamera dialect of Tanna
Recordings in Raga. Northern dialect of Pentacost Islands
You can see that it gives a high-level summary of the content of the file rather than a transcript of the recording. To make it easier to provide this kind of summary, we now have an online transcription system, called Cockatiel, that runs in the browser.
Cockatiel does the initial segmentation of the audio, based on silence detection, and speaker diarisation (identifying the voices of speakers in the recording) as can be seen in this image:

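As a rough illustration of what segmentation based on silence detection involves, here is a minimal Python sketch. It is not Cockatiel's actual code: the function name `segment_by_silence`, the frame size, the loudness threshold and the minimum pause length are all assumptions chosen for the example. It treats any sufficiently long run of quiet frames as a boundary between segments. Speaker diarisation is a separate and much harder step that is not attempted here.

```python
import math

def segment_by_silence(samples, sample_rate, frame_ms=20,
                       threshold=0.02, min_silence_ms=300):
    """Return (start_sec, end_sec) spans of non-silent audio.

    Hypothetical helper for illustration only; all parameter values
    are assumptions, not settings taken from Cockatiel.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    min_silent_frames = max(1, min_silence_ms // frame_ms)

    # Mark each frame as loud or quiet using its RMS energy.
    loud = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        loud.append(rms >= threshold)

    segments = []
    start = None       # index of the first frame of the open segment
    silent_run = 0     # consecutive quiet frames inside the open segment
    for idx, is_loud in enumerate(loud):
        if is_loud:
            if start is None:
                start = idx
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silent_frames:
                end = idx - silent_run + 1  # first quiet frame of the pause
                segments.append((start * frame_len / sample_rate,
                                 end * frame_len / sample_rate))
                start, silent_run = None, 0

    if start is not None:  # audio ended while a segment was still open
        end_sample = min((len(loud) - silent_run) * frame_len, len(samples))
        segments.append((start * frame_len / sample_rate,
                         end_sample / sample_rate))
    return segments

# Synthetic test signal: 1 s of tone, 0.5 s of silence, 1 s of tone.
rate = 1000
tone = [0.5 * math.sin(2 * math.pi * 100 * n / rate) for n in range(rate)]
audio = tone + [0.0] * 500 + tone
print(segment_by_silence(audio, rate))  # [(0.0, 1.0), (1.5, 2.5)]
```

Real field recordings are far noisier than this synthetic signal, so a fixed threshold like the one assumed here would normally need tuning for each recording.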
Cockatiel can be used to transcribe local files, but Elan does that job already. More important is Cockatiel’s ability to draw a file from an online archive, make the file available for transcription, and then push it to a workspace for approval before it is accepted into the archive. It will soon become part of the PARADISEC interface, available for logged-in users.
Cockatiel was developed by John Ferlito in a project led by Nick Thieberger with funding from the Language Data Commons of Australia. Code is available on GitHub.