Making old dictionaries new again

Today’s post is something of a recipe for making old dictionaries new again. I’ll explain how a 35-year-old, single-copy typewritten dictionary is living a new life as a digital database.

The language of this dictionary is Kagate, a Tibeto-Burman language of the Central Bodic branch, spoken in Nepal. I met some speakers of this language a number of years ago, as I’m working on a dialect of Yolmo, which is closely related. There was some documentation of Kagate in the mid-1970s, although most of the material output was liturgical rather than linguistic.

As well as the two publications on Kagate mentioned on the Ethnologue site, Monika Höhlig and Anna Maria Hari also created a typewritten Kagate-Nepali-English-German dictionary. A copy of this dictionary has remained with their primary consultant, and although it is well looked after and still usable, it is the only copy the speakers have access to. It is also only in Latin script, rather than the Devanagari script they have developed for their language.

On a previous visit the Kagate speakers were kind enough to allow me and my colleague Amos Teo to scan the pages of the dictionary. At this point we also made them another paper copy of the dictionary, but obviously this is an unsustainable process in the long term. As you can see, the dictionary is already becoming discoloured and faded:

Amos took the scans and ran them through the optical character recognition (OCR) software that comes with Adobe Acrobat 9. Even with such faded type, the OCR was effective at recognising the characters. As is to be expected with this kind of process, though, there was still a fair bit of cleaning up to do at this point. There were some alignment issues and some irregular characters. Also, some entries would copy strangely, with a row of 5-7 lexical items and then the corresponding definitions all in the line below.

From here the data needed to be massaged so that the appropriate headers were present for Toolbox to read. With the data that we had we needed, at a minimum, to create these headers:

\lx – the Kagate word
\ps – part of speech
\de – an English definition
\dn – a Nepali definition
\xv – an example sentence

Using the find and replace function in an .RTF file, Amos was able to create these markers by using the formatting of the original document to his benefit. For example, all of the Nepali definitions start with “Np:”, so we replaced “Np:” with “\dn”. Every English definition also starts with a colon, so Amos just selected find “ : ” and replace with “\de”. Amos was careful to do this in a set order: because “Np:” itself contains a colon, doing these two the other way around would have turned the Nepali labels into English definition markers and led to more confusion. Using regular expressions is a more efficient way of doing this task, but even if you don’t know how to use RegEx (yet), that won’t stop you from doing this kind of work.
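For anyone who does want to try the regular expression route, here is a minimal sketch of the same two ordered replacements in Python. This is not what Amos actually ran. The function names and the sample line are made up for illustration (the sample is a placeholder, not real Kagate data), and I’m assuming each entry sits on a single line and uses the \dn marker from the list above.

```python
import re

# Field markers to insert, each starting on a new line (Toolbox-style).
DN = "\n\\dn "  # Nepali definition
DE = "\n\\de "  # English definition


def add_markers(line: str) -> str:
    """Insert \\dn and \\de markers into one OCR'd entry line."""
    # Step 1: replace the Nepali label while "Np:" is still intact.
    line = re.sub(r"\s*Np:\s*", lambda m: DN, line)
    # Step 2: the first remaining colon introduces the English definition.
    line = re.sub(r"\s*:\s*", lambda m: DE, line, count=1)
    return line


def add_markers_wrong_order(line: str) -> str:
    """The same two steps reversed, to show why the order matters."""
    line = re.sub(r"\s*:\s*", lambda m: DE, line)    # also eats the colon in "Np:"
    line = re.sub(r"\s*Np:\s*", lambda m: DN, line)  # nothing left to match
    return line


# Placeholder entry in the assumed layout: headword, part of speech,
# " : English definition", then "Np: Nepali definition".
sample = "headword n : english definition Np: nepali definition"
print(add_markers(sample))
```

Run in the right order, the sample comes out with the English and Nepali definitions each on their own marked line. Run the other way around, the Nepali definition ends up behind a second \de marker with a stray “Np” left at the end of the English one.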

Once the file was ready to open in Toolbox, it still required a little bit of cleaning up. There were a few instances where the letter ‘l’ had been read as the number ‘1’, and some duplicated entries, but going through each entry and cleaning up these kinds of problems is still much more efficient than retyping the whole thing.
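Both of these clean-up checks can be partly automated. The sketch below is one possible approach in Python, not the process we followed: it treats any digit 1 that touches a letter as a probable misread ‘l’, and it lists headwords that appear under more than one \lx marker. The function names and test strings are hypothetical, and an automatic fix like this should still be checked by eye, since a genuine numeral next to a letter would be changed too.

```python
import re
from collections import Counter

# A "1" directly before or after a letter is probably a misread "l".
# [^\W\d_] matches any Unicode letter.
SUSPECT_ONE = re.compile(r"(?<=[^\W\d_])1|1(?=[^\W\d_])")


def fix_misread_ones(text: str) -> str:
    """Replace letter-adjacent 1s with l; standalone numbers are left alone."""
    return SUSPECT_ONE.sub("l", text)


def duplicated_headwords(lines):
    """Return headwords that occur on more than one \\lx line, sorted."""
    heads = [ln[4:].strip() for ln in lines if ln.startswith("\\lx ")]
    return sorted(word for word, n in Counter(heads).items() if n > 1)


# Made-up strings for illustration only, not real dictionary entries.
print(fix_misread_ones("ta1a"))
print(duplicated_headwords(["\\lx aaa", "\\ps n", "\\lx bbb", "\\lx aaa"]))
```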

The great thing about now having a database to work with instead of a photocopy is that it was the work of an hour to create this:

It’s still exactly the same data as above, but it is much easier to manipulate into different forms. For example, I could have just created a list of nouns, or only included the Nepali definitions. This database is also the start of a project to create a new dictionary. While the owner of this dictionary is proud of it, it has many limitations. The first is that it is all written in Latin script, and there is now a fully functional Devanagari script for Kagate, as well as, of course, for Nepali. There are also few example sentences, and some items are missing, such as the number eleven. But of course the most pressing issue with the current dictionary is that there is only one copy. By working in a database we’ll be able to make as many copies as we like at the end, and use the information in other ways too. But that’s all a story for another post.
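To give a sense of how little work that kind of reshaping takes once the data is in marked-up text, here is a rough Python sketch that reads records using the markers listed earlier and pulls out only the nouns. Toolbox has its own export and filter tools, so this is just an illustration of the idea. The function names are invented, the entries are placeholders, and I’m assuming that each record starts with \lx and that nouns are labelled “n” in the \ps field.

```python
def parse_records(text: str):
    """Split marked-up text into one dict per entry, keyed by marker name."""
    records = []
    for line in text.splitlines():
        if not line.startswith("\\"):
            continue  # blank or continuation lines are ignored in this sketch
        marker, _, value = line[1:].partition(" ")
        if marker == "lx":
            records.append({})  # every \lx line starts a new record
        if records:
            # Keep only the first value if a marker repeats within a record.
            records[-1].setdefault(marker, value.strip())
    return records


def headwords_with_pos(records, pos):
    """Return the headwords whose \\ps field equals the given label."""
    return [r["lx"] for r in records if r.get("ps") == pos]


# Placeholder entries, not real Kagate data.
sample = "\\lx aaa\n\\ps n\n\\de first\n\n\\lx bbb\n\\ps v\n\\de second\n\\lx ccc\n\\ps n\n"
print(headwords_with_pos(parse_records(sample), "n"))
```

Swapping the \ps test for one on \dn would give the Nepali-only list in the same way.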