Books, HTML, audio, images – falling out from fieldwork

I’ll be going to Vanuatu next month, courtesy of Catriona Hyslop’s DoBeS project, to help build an installation of three computer-based interactive dictionaries (Vurës, Tamambo and South Efate) for the Museum there. We will have hyperlinked dictionaries with sound and images where possible. All of this will be HTML-based, for low maintenance and to allow new dictionaries to be added to the set over time. This post outlines the method used to get these various files into deliverable formats, and follows on from an earlier one in which I talked about using iTunes to get media back to the village.


Each of the dictionaries is in Toolbox, but the Tamambo dictionary (by Dorothy Jauncey) started out as an MS Word document that had to be converted, using regular expressions, into a lexical database. Each dictionary was then processed through LexiquePro and exported to HTML, as can be seen in the online version here. The audio function needed tweaking to generate HTML5 media calls, but it didn’t take much work to get audio for 2,000 headwords into the Vurës dictionary.

Getting the audio into the right shape started with recording a speaker reading headwords from a script. The recording was then time-aligned to the script using Transcriber, and the resulting text file was exported to a ‘label’ format that could be imported into Audacity. Opening the audio file and the label file in Audacity and then selecting the ‘export multiple’ option produced a collection of short audio files, each named after the headword it contains. These were then linked from the HTML version, either by duplicating the headword in a tag that calls the audio, or by using the contents of the \sf field as the source for the media file name.
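The label and linking steps can be sketched roughly as follows. This is only an illustration, not the actual scripts used: it assumes the time-aligned segments are already available as (start, end, headword) tuples, and the function names, the `audio` directory and the `wav` extension are all hypothetical. The label layout (start and end in seconds, then the label text, separated by tabs) is the one Audacity reads for a label track.

```python
from html import escape
from urllib.parse import quote


def audacity_labels(segments):
    """Format (start_sec, end_sec, headword) tuples as an Audacity label track.

    Each output line is: start <TAB> end <TAB> label.
    """
    lines = []
    for start, end, headword in segments:
        lines.append(f"{start:.6f}\t{end:.6f}\t{headword}")
    return "\n".join(lines) + "\n"


def audio_tag(headword, sf=None, audio_dir="audio", ext="wav"):
    """Build an HTML5 audio element for one headword.

    If the entry has an \\sf field, its contents are used as the file name;
    otherwise the file name is derived from the headword itself.
    audio_dir and ext are assumptions made for this sketch.
    """
    filename = sf if sf else f"{headword}.{ext}"
    # Percent-encode the path so non-ASCII headwords give a valid URL.
    src = quote(f"{audio_dir}/{filename}")
    return f'<audio controls preload="none" src="{escape(src)}"></audio>'


if __name__ == "__main__":
    # Placeholder segments, not real dictionary data.
    segments = [(0.50, 1.25, "headword1"), (1.80, 2.40, "headword2")]
    print(audacity_labels(segments), end="")
    print(audio_tag("headword1"))
    print(audio_tag("headword2", sf="hw2_take3.wav"))
```

Deriving the file name from the headword works only while headwords are unique; homographs would need the \sf route or some numbering scheme.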

For the trip to Vila I have also prepared two books to take to Erakor village: a dictionary and a collection of stories in South Efate and English. These books are printed by the publish-on-demand book machine at the University of Melbourne, with full colour covers and perfect binding, at a cost of around $10 per copy. The PDF version is in the digital repository, with handles to ensure persistent location and to allow open access free download of the content. The handle is also printed in the book, as are an ISBN and a Creative Commons licence.

The data for each of these books came from Toolbox structured files. The dictionary is an MDF exported dictionary together with a finderlist. The stories are presented in English and South Efate without interlinear information (interlinear versions can be seen in Eopas), so only the language and free gloss lines need to be exported, along with the story’s metadata header (title, speaker, abstract). Inserting a tab before the free gloss line allows it to become the right-hand cell in a two-column table, with South Efate in the left column and English in the right.
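As a rough sketch of that export step, the following pairs each language line with its free gloss and joins them with a tab, ready to be converted to a table. It is an assumption-laden illustration rather than the method actually used: the markers `\tx` and `\ft` are common Toolbox defaults but may differ in these files, the function name is hypothetical, and the metadata header is ignored here.

```python
def parallel_rows(toolbox_text, text_marker="\\tx", gloss_marker="\\ft"):
    """Return 'language line <TAB> free gloss' rows from Toolbox-style text.

    Lines with any other marker (morpheme breaks, glosses, and so on)
    are skipped. The marker names are assumptions for this sketch.
    """
    rows = []
    current = None  # most recent language line still waiting for its gloss
    for line in toolbox_text.splitlines():
        marker, _, value = line.strip().partition(" ")
        if marker == text_marker:
            current = value.strip()
        elif marker == gloss_marker and current is not None:
            # The tab puts the free gloss in the right-hand cell.
            rows.append(f"{current}\t{value.strip()}")
            current = None
    return rows


if __name__ == "__main__":
    # Placeholder record, not real South Efate text.
    sample = (
        "\\tx language sentence one\n"
        "\\mb morpheme break line\n"
        "\\ft English gloss one\n"
        "\n"
        "\\tx language sentence two\n"
        "\\ft English gloss two\n"
    )
    for row in parallel_rows(sample):
        print(row)
```

Pasting the resulting tab-separated lines into a word processor and converting text to table on the tab character gives the two-column layout.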

The books will be available for sale here or here and on Amazon (!) as part of the e-press function provided by the publishing centre. Copies will also be distributed by the World Oral Literature Project, with no postage cost, as they can use the PDF file to print as many (or as few) copies as needed.

So many possibilities!

1 thought on “Books, HTML, audio, images – falling out from fieldwork”

  1. Wow, sounds great. And I’d been wondering what would be the easiest way to get audio snippets out of ELAN too – thanks for that!
