Making old dictionaries new again

Today’s post is something of a recipe for making old dictionaries new again. I’ll explain how a 35 year old old, single-copy typewritten dictionary is living a new life as a digital database.

The language of this dictionary is Kagate – A Tibeto-Burman language of the Central Bodic branch, spoken in Nepal. I met some speakers of this language a number of years ago, as I’m working on a dialect of Yolmo, which is closely related. There was some documentation of Kagate in the mid-1970s although most of the material output was liturgical instead of linguistic.

As well as the two publications on Kagate mentioned on the Ethnologue site Monika Höhlig and Anna Maria Hari also created a typewritten Kagate-Nepali-English-German dictionary. A copy of this dictionary has remained with their primary consultant, and although it is well looked after and still useable it is also the only copy they have access to. It is also only in Latin script instead of the Devanagari script they have developed for their language.

On a previous visit the Kagate speakers were kind enough to allow me and my colleague Amos Teo to scan the pages of the dictionary. At this point we also made them another paper copy of the dictionary, but obviously this is an unsustainable process in the long term. As you can see, the dictionary is already becoming discoloured and faded:

Amos took the scans and used the optical character recognition (OCR) software that comes with Adobe Acrobat 9. Even with such faded font the OCR was effective at recognising the characters. As is to be expected with this kind of process though there was still a fair bit of cleaning up to do at this point. There were some alignment issues and some irregular characters. Also, some entries would copy strangely, with a row of 5-7 lexical items and then the corresponding definitions all in the lone below.

From here the data needed to be massaged so that the appropriate headers were present for Toolbox to read. With the data that we had we needed, at a minimum, to create these headers:

\lx – the Kagate word
\ps – part of speech
\de – an English definition
\dn – a Nepali definition
\xv – an example sentence

Using the find and replace function in an .RTF file Amos was able to create these using the formatting of the original document to his benefit. For example, all of the Nepali definitions start with Np: so we replaced “Np:” with “\gn.” Also all of the colons are at the start of the English definition, so Amos just selected “find : ” and “replace \de.” Of course Amos careful to do this in a set order – doing these two the other way around would have lead to more confusion. Of course, using Regular Expressions is a more efficient way of doing this task – but even if you don’t know how to use RegEx (yet) it won’t stop you from doing this kind of work.

Once the file was made ready to open in Toolbox it still required a little bit of cleaning up. There were a few instances where the letter ‘l’ had been read as the number  ‘1’ and some reduplicated entries – but going through each entry and cleaning up these kinds of problems is still much more efficient than retyping out the whole thing again.

The great thing about now having a database to work with instead of a photocopy is that it was the work of an hour to create this:

It’s still exactly the same data as above – but it is much easier to manipulate into different forms. For example I could have just created a list of nouns, or only included the Nepali definitions. This database is also the start of a project to create a new dictionary. While the owner of this dictionary is proud of it, there are many limitations. The first is that it is all written in Latin script, and there is now a fully functional Devanagari script for Kagate, as well of course for Nepali. There are also few example sentences, and some items are missing – such as the number eleven. But of course the most pressing issue with the current dictionary is that there is only one copy. By working in a database we’ll be able to make as many copies as we like at the end, and use the information in other ways too. But that’s all a story for another post.

Here at Endangered Languages and Cultures, we fully welcome your opinion, questions and comments on any post, and all posts will have an active comments form. However if you have never commented before, your comment may take some time before it is approved. Subsequent comments from you should appear immediately. We will not edit any comments unless asked to, or unless there have been html coding errors, broken links, or formatting errors. We still reserve the right to censor any comment that the administrators deem to be unnecessarily derogatory or offensive, libellous or unhelpful, and we have an active spam filter that may reject your comment if it contains too many links or otherwise fits the description of spam. If this happens erroneously, email the author of the post and let them know. And note that given the huge amount of spam that all WordPress blogs receive on a daily basis (hundreds) it is not possible to sift through them all and find the ham. In addition to the above, we ask that you please observe the Gricean maxims: Be relevant That is, stay reasonably on topic. Be truthful This goes without saying; don’t give us any nonsense. Be concise Say as much as you need to without being unnecessarily long-winded. Be perspicuous This last one needs no explanation. We permit comments and trackbacks on our articles. Anyone may comment. Comments are subject to moderation, filtering, spell checking, editing, and removal without cause or justification. All comments are reviewed by comment spamming software and by the site administrators and may be removed without cause at any time. All information provided is volunteered by you. Any website address provided in the URL will be linked to from your name, if you wish to include such information. We do not collect and save information provided when commenting such as email address and will not use this information except where indicated. This site and its representatives will not be held responsible for errors in any comment submissions. Again, we repeat: We reserve all rights of refusal and deletion of any and all comments and trackbacks.

Leave a Comment