Converting docx to FLEx format for dictionaries

Following the previous blog post, I had requests for more detail on how to convert a word-processor dictionary into the format needed to import the text into FieldWorks Language Explorer (FLEx). I’ll set out the steps below. They require some knowledge of regular expressions, which I’ll explain as I go (you can also watch an intro to regular expressions here).

What FLEx needs to import is a text file that has the dictionary parts marked up with codes, like this, where \lx starts the headword, and \de starts the definition:

\lx alata’a

\de speak straight-faced and unsmiling

Subentries are marked with \se, and scientific names are marked by \sc.
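To make the target format concrete, here is a minimal Python sketch that reads marked-up text like the example above and groups the fields into entries. The function name parse_sfm and the rule that every \lx starts a new entry are my own assumptions for illustration; this is not how FLEx itself reads the file.

```python
import re

# A field line is a backslash, a marker name, then the field's text.
MARKER = re.compile(r"^\\(\w+)\s*(.*)$")

def parse_sfm(text):
    """Return a list of entries; each entry is a list of (marker, value) pairs."""
    entries = []
    for line in text.splitlines():
        match = MARKER.match(line.strip())
        if not match:
            continue  # skip blank lines and lines without a marker
        marker, value = match.group(1), match.group(2)
        # Assumption: every \lx field starts a new entry.
        if marker == "lx" or not entries:
            entries.append([])
        entries[-1].append((marker, value))
    return entries

sample = r"""
\lx alata’a

\de speak straight-faced and unsmiling
"""
print(parse_sfm(sample))
```

Running a check like this on the finished file is a quick way to spot entries that are missing a \de field before you try the import.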

First, you have to analyse the dictionary file to see if it is formatted consistently. Make sure you can see the tab, space, and carriage return marks in the document by clicking on the Show/Hide (¶) icon in the toolbar.

For example, lots of entries look like this:

The structure here is a word at the line start, followed by a tab, followed by other text (the English definition), followed by a carriage return. So far, so good. But, other lines look like this:

Here we have indented lines, with two spaces before the text. This is used to indicate subentries to the main entry.

Another type of entry is as follows, where the definition extends beyond the end of the line and is wrapped, but with a tab to space it across to the right column:

In some entries this wrapping can go on for several lines.
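If you can export the document as plain text with the tabs preserved, the line shapes described so far can be checked mechanically. The following Python sketch sorts each line into one of those shapes and counts them; the category names and the classify_line function are my own labels, and the sample lines other than alata’a are made up for illustration.

```python
from collections import Counter

def classify_line(line: str) -> str:
    """Label one line of the exported dictionary by its layout."""
    if not line.strip():
        return "blank"
    if line.startswith("\t"):
        return "continuation"  # wrapped definition text
    if line.startswith("  "):
        # two leading spaces mark a subentry; it still needs a tab
        return "subentry" if "\t" in line else "unknown"
    if "\t" in line and not line[0].isspace():
        return "entry"  # headword, tab, definition
    return "unknown"

def summarise(lines):
    """Count how many lines fall into each category."""
    return Counter(classify_line(line) for line in lines)

lines = [
    "alata’a\tspeak straight-faced and unsmiling",
    "  subword\ta made-up subentry",
    "\ta made-up continuation line",
    "a line with no tab at all",
]
print(summarise(lines))
```

Any line that comes out as "unknown" is worth looking at by hand before you start the find-and-replace steps.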

Scientific names are given in italics and in brackets:

‘ala’ala croton (Codiaeum)

‘alabusi type of tree (Acalypha)

Because there is italic formatting here, we can find text with that formatting using MS Word’s Advanced Find and Replace function and then insert codes. Leave the search box empty, but select the font style italic. This will search for all italic text. In the replace box, enter \sc ^&. The code ^& means ‘put the thing I found here’, so it will put whatever text was in italics after \sc.

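This is what it looks like after doing that change, with \sc inserted before an italic word. You now need to take out the brackets around it, by finding “(\” (the opening bracket directly before the new backslash) and replacing it with “\”, and by finding “)^p” (the closing bracket at the end of the line) and replacing it with “^p”.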
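The same step can be sketched in Python on a plain-text export. Plain text has no italics, so this version assumes instead that any bracketed text at the very end of a line is a scientific name; that is a weaker test than the italic search in Word, so check the results. The function name mark_scientific_name is hypothetical.

```python
import re

# Bracketed text at the very end of a line, e.g. "(Codiaeum)".
TRAILING_BRACKETS = re.compile(r"\(([^()]+)\)\s*$")

def mark_scientific_name(line: str) -> str:
    """Replace a trailing "(name)" with "\\sc name"; leave other lines alone."""
    # A function is used for the replacement so the backslash is taken literally.
    return TRAILING_BRACKETS.sub(lambda m: "\\sc " + m.group(1), line)

print(mark_scientific_name("‘ala’ala\tcroton (Codiaeum)"))
print(mark_scientific_name("‘alabusi\ttype of tree (Acalypha)"))
```

Brackets in the middle of a definition are left untouched, because the pattern only matches at the end of the line.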

Now, we need to find all carriage returns followed just by a tab, to find where the definition has gone longer than the end of the line. We will then replace that with a space.

Again, using Advanced Find and Replace, you need to choose a paragraph mark from the special list at the bottom of the window. This will insert ^p into the search box. Next, choose the tab character from the same menu, and it will put ^t into the search box. Put a single space into the replace box, and it will now replace every carriage return followed by a tab with a single space. Repeat this until it finds no more matches.
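For comparison, here is how the same join might look in Python, where ^p followed by ^t corresponds to a newline followed by a tab. This is a sketch under the assumption that the file has been exported as plain text with tabs intact; the function name and the sample text are made up.

```python
import re

def join_wrapped_lines(text: str) -> str:
    """Replace each line break followed by one or more tabs with a single space."""
    return re.sub(r"\r?\n\t+", " ", text)

wrapped = (
    "‘alabusi\ttype of tree\n"
    "\twith a made-up second line\n"
    "\tand a made-up third line\n"
    "alata’a\tspeak straight-faced and unsmiling\n"
)
print(join_wrapped_lines(wrapped))
```

Because re.sub replaces every match in one go, a definition wrapped over several lines is joined in a single pass.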

Now that we know all remaining tabs only precede a definition, we can replace them with \de. We do this the same way, by finding ^t and replacing it with “\de ” (with a space following it).

Next, we can identify subentries as having two spaces at the beginning of a line, so we can look for “^p  ” (that is, ^p followed by two spaces), and replace it with “\se ”.

Now all remaining carriage returns occur before a headword, so each one can be replaced by itself followed by “\lx ” (that is, find ^p and replace it with “^p\lx ”), which is the marker for the headword.
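The three marker steps (\de, \se and \lx) can be sketched together in Python by working on one line at a time. This assumes wrapped lines have already been joined, so that every line is either a headword line or a subentry line; mark_line is a hypothetical helper and the subentry in the sample is made up.

```python
def mark_line(line: str) -> str:
    """Add the lx, se and de markers to one joined dictionary line."""
    if not line.strip():
        return line  # leave blank lines alone
    if line.startswith("  "):
        marked = "\\se " + line[2:]  # two leading spaces: subentry
    else:
        marked = "\\lx " + line      # anything else: headword
    # Any tab left at this point separates the word from its definition.
    return marked.replace("\t", "\\de ")

for line in [
    "alata’a\tspeak straight-faced and unsmiling",
    "  subword\ta made-up subentry",
]:
    print(mark_line(line))
```

Working line by line also covers the very first headword in the file, which has no carriage return in front of it for a find-and-replace to match.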

Next, put a carriage return before each backslash to create the format needed by FLEx. Finally, save this file as plain text.
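As a rough Python equivalent of this last step, the sketch below starts a new line before every backslash, tidies stray spaces, and writes the result as a plain-text file. The function names are hypothetical, and saving as UTF-8 is my assumption; use whichever encoding you plan to select when importing.

```python
from pathlib import Path

def one_marker_per_line(text: str) -> str:
    """Start a new line before every backslash and drop empty lines."""
    text = text.replace("\\", "\n\\")
    fields = [line.strip() for line in text.splitlines()]
    return "\n".join(field for field in fields if field) + "\n"

def save_as_plain_text(text: str, path: str) -> None:
    """Write the marked-up text to a plain-text file."""
    # Assumption: UTF-8; change this if your import expects something else.
    Path(path).write_text(text, encoding="utf-8")

marked = "\\lx ‘ala’ala\\de croton \\sc Codiaeum"
print(one_marker_per_line(marked), end="")
# prints:
# \lx ‘ala’ala
# \de croton
# \sc Codiaeum
```

To write the file, call save_as_plain_text with the converted text and a file name of your choosing.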

You can now import that file into FLEx.