{"id":9316,"date":"2021-10-29T10:52:10","date_gmt":"2021-10-29T00:52:10","guid":{"rendered":"https:\/\/www.paradisec.org.au\/blog\/?p=9316"},"modified":"2021-10-29T14:13:14","modified_gmt":"2021-10-29T04:13:14","slug":"converting-docx-to-flex-format-for-dictionaries","status":"publish","type":"post","link":"https:\/\/www.paradisec.org.au\/blog\/2021\/10\/converting-docx-to-flex-format-for-dictionaries\/","title":{"rendered":"Converting docx to FLEx format for dictionaries"},"content":{"rendered":"\n<p>Following <a href=\"https:\/\/www.paradisec.org.au\/blog\/2021\/10\/reviving-dictionaries\/\" target=\"_blank\" rel=\"noreferrer noopener\">the previous blog post<\/a>, I had requests for more detail on how to convert a word-processor dictionary into the format needed to put the text into the software <a href=\"https:\/\/software.sil.org\/fieldworks\/\" data-type=\"URL\" data-id=\"https:\/\/software.sil.org\/fieldworks\/\" target=\"_blank\" rel=\"noreferrer noopener\">FieldWorks Language Explorer (FLEx)<\/a>. I\u2019ll set out the steps below, but it does require some knowledge of regular expressions, which I\u2019ll explain as I go (you can also watch an <a href=\"https:\/\/www.youtube.com\/watch?v=8ILToE0CNpM\" data-type=\"URL\" data-id=\"https:\/\/www.youtube.com\/watch?v=8ILToE0CNpM\" target=\"_blank\" rel=\"noreferrer noopener\">intro to regular expressions here<\/a>).<\/p>\n\n\n\n<p>What FLEx needs to import is a text file that has the dictionary parts marked up with codes, like this, where \lx starts the headword, and \de starts the definition:<\/p>\n\n\n\n<p>\lx alata&#8217;a<\/p>\n\n\n\n<p>\de speak straight-faced and unsmiling<\/p>\n\n\n\n<p>Subentries are marked with \se, and scientific names are marked with \sc. 
<\/p>\n\n\n\n<div class=\"wp-block-media-text alignwide is-stacked-on-mobile\" style=\"grid-template-columns:15% auto\"><figure class=\"wp-block-media-text__media\"><img loading=\"lazy\" decoding=\"async\" width=\"52\" height=\"42\" src=\"https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.02.56-am.png\" alt=\"\" class=\"wp-image-9322 size-full\"\/><\/figure><div class=\"wp-block-media-text__content\">\n<p class=\"has-normal-font-size\">First, you have to analyse the dictionary file to see if it is formatted consistently. Make sure you can see the tab, space, and carriage return marks in the document by clicking on the icon in the header that looks like the one to the left here.<\/p>\n<\/div><\/div>\n\n\n\n<p>For example, lots of entries look like this:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><a href=\"https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-9.54.41-am.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-9.54.41-am.png\" alt=\"\" class=\"wp-image-9319\" width=\"333\" height=\"43\" srcset=\"https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-9.54.41-am.png 542w, https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-9.54.41-am-300x39.png 300w\" sizes=\"auto, (max-width: 333px) 100vw, 333px\" \/><\/a><\/figure>\n\n\n\n<p>The structure here is a word at the line start, followed by a tab, followed by other text (the English definition), followed by a carriage return. So far, so good. 
But, other lines look like this:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><a href=\"https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.00.04-am.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.00.04-am.png\" alt=\"\" class=\"wp-image-9320\" width=\"328\" height=\"92\" srcset=\"https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.00.04-am.png 556w, https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.00.04-am-300x84.png 300w\" sizes=\"auto, (max-width: 328px) 100vw, 328px\" \/><\/a><\/figure>\n\n\n\n<p>Here we have indented lines, with two spaces before the text. This is used to indicate subentries to the main entry.<\/p>\n\n\n\n<p>Another type of entry is as follows, where the definition extends beyond the end of the line and is wrapped, but with a tab to space it across to the right column:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><a href=\"https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.01.34-am.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.01.34-am.png\" alt=\"\" class=\"wp-image-9321\" width=\"356\" height=\"99\" srcset=\"https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.01.34-am.png 654w, https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.01.34-am-300x83.png 300w\" sizes=\"auto, (max-width: 356px) 100vw, 356px\" \/><\/a><\/figure>\n\n\n\n<p>In some entries this wrapping can go for several lines:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><a 
href=\"https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.12.21-am.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.12.21-am.png\" alt=\"\" class=\"wp-image-9325\" width=\"343\" height=\"86\" srcset=\"https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.12.21-am.png 606w, https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.12.21-am-300x75.png 300w\" sizes=\"auto, (max-width: 343px) 100vw, 343px\" \/><\/a><\/figure>\n\n\n\n<p>Scientific names are given in italics and in brackets:<\/p>\n\n\n\n<p>&#8216;ala&#8217;ala&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; croton (<em>Codiaeum<\/em>)<\/p>\n\n\n\n<p>&#8216;alabusi&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; type of tree (<em>Acalypha<\/em>)<\/p>\n\n\n\n<p><\/p>\n\n\n\n<div class=\"wp-block-media-text alignwide is-stacked-on-mobile\" style=\"grid-template-columns:33% auto\"><figure class=\"wp-block-media-text__media\"><img loading=\"lazy\" decoding=\"async\" width=\"834\" height=\"1024\" src=\"https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-9.15.31-am-834x1024.png\" alt=\"\" class=\"wp-image-9326 size-full\" srcset=\"https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-9.15.31-am-834x1024.png 834w, https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-9.15.31-am-244x300.png 244w, https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-9.15.31-am-768x943.png 768w, https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-9.15.31-am.png 1062w\" sizes=\"auto, (max-width: 834px) 100vw, 834px\" 
\/><\/figure><div class=\"wp-block-media-text__content\">\n<p class=\"has-normal-font-size\">Because there is italic formatting here, we can find text with that formatting using MS Word\u2019s Advanced Find and Replace function to then insert codes. Leave the search box empty, but select the font style italic. This will search for all italic text. You have to replace the text with \\sc ^&amp;.  ^&amp; means &#8216;put the thing I found here&#8217;, so it will put whatever text was in italics after \\sc.<\/p>\n<\/div><\/div>\n\n\n\n<div class=\"wp-block-media-text alignwide is-stacked-on-mobile\" style=\"grid-template-columns:39% auto\"><figure class=\"wp-block-media-text__media\"><img loading=\"lazy\" decoding=\"async\" width=\"584\" height=\"66\" src=\"https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.24.26-am.png\" alt=\"\" class=\"wp-image-9328 size-full\" srcset=\"https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.24.26-am.png 584w, https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.24.26-am-300x34.png 300w\" sizes=\"auto, (max-width: 584px) 100vw, 584px\" \/><\/figure><div class=\"wp-block-media-text__content\">\n<p class=\"has-normal-font-size\">This is what it looks like after doing that change, with \\sc inserted before an italic word. You need to take out the brackets around this, by finding &#8220;( \\ &#8221; and replacing with \\, and finding &#8220;) ^p&#8221; and replacing with ^p.<\/p>\n<\/div><\/div>\n\n\n\n<p>Now, we need to find all carriage returns followed just by a tab, to find where the definition has gone longer than the end of the line. 
We will then replace each of these with a single space.<\/p>\n\n\n\n<div class=\"wp-block-media-text alignwide is-stacked-on-mobile\"><figure class=\"wp-block-media-text__media\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"800\" src=\"https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.31.48-am-1024x800.png\" alt=\"\" class=\"wp-image-9330 size-full\" srcset=\"https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.31.48-am-1024x800.png 1024w, https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.31.48-am-300x234.png 300w, https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.31.48-am-768x600.png 768w, https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.31.48-am-1536x1200.png 1536w, https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.31.48-am.png 1684w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure><div class=\"wp-block-media-text__content\">\n<p class=\"has-normal-font-size\">Again, using Advanced Find and Replace, you need to choose a paragraph mark from the special list at the bottom of the window. This will insert ^p into the search box. Next, choose the tab character from the same menu and it will put ^t after it, so that the search box reads ^p^t. Put a single space into the replace box, and it will now replace all carriage returns followed by tabs with a single space. 
Do this as many times as it returns a result.<\/p>\n<\/div><\/div>\n\n\n\n<div class=\"wp-block-media-text alignwide is-stacked-on-mobile\"><figure class=\"wp-block-media-text__media\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"320\" src=\"https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.41.21-am-1024x320.png\" alt=\"\" class=\"wp-image-9332 size-full\" srcset=\"https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.41.21-am-1024x320.png 1024w, https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.41.21-am-300x94.png 300w, https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.41.21-am-768x240.png 768w, https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.41.21-am-1536x480.png 1536w, https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.41.21-am.png 1778w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure><div class=\"wp-block-media-text__content\">\n<p class=\"has-normal-font-size\">Now that we know all tabs only precede a definition, we can replace them with \\de , and we do this the same way, by finding ^t and replacing with \\de (with a space following it).<\/p>\n<\/div><\/div>\n\n\n\n<div class=\"wp-block-media-text alignwide is-stacked-on-mobile\" style=\"grid-template-columns:59% auto\"><figure class=\"wp-block-media-text__media\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"364\" src=\"https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.54.36-am-1024x364.png\" alt=\"\" class=\"wp-image-9334 size-full\" srcset=\"https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.54.36-am-1024x364.png 1024w, 
https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.54.36-am-300x107.png 300w, https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.54.36-am-768x273.png 768w, https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.54.36-am-1536x546.png 1536w, https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.54.36-am.png 1704w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure><div class=\"wp-block-media-text__content\">\n<p class=\"has-normal-font-size\">Next, we can identify subentries as having two spaces at the beginning of a line, so we can look for &#8220;^p  &#8221; (that is, ^p followed by two spaces), and replace it with &#8220;\se &#8221; (with a space following it).<\/p>\n<\/div><\/div>\n\n\n\n<p><\/p>\n\n\n\n<div class=\"wp-block-media-text alignwide is-stacked-on-mobile\" style=\"grid-template-columns:56% auto\"><figure class=\"wp-block-media-text__media\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"303\" src=\"https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.56.50-am-1024x303.png\" alt=\"\" class=\"wp-image-9335 size-full\" srcset=\"https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.56.50-am-1024x303.png 1024w, https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.56.50-am-300x89.png 300w, https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.56.50-am-768x227.png 768w, https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.56.50-am-1536x454.png 1536w, https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.56.50-am.png 1684w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure><div 
class=\"wp-block-media-text__content\">\n<p class=\"has-normal-font-size\">Now all remaining carriage returns occur before a headword, so they can all be replaced with &#8220;\lx &#8221;, which is the marker for the headword.<\/p>\n<\/div><\/div>\n\n\n\n<div class=\"wp-block-media-text alignwide is-stacked-on-mobile\"><figure class=\"wp-block-media-text__media\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"398\" src=\"https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.59.31-am-1024x398.png\" alt=\"\" class=\"wp-image-9336 size-full\" srcset=\"https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.59.31-am-1024x398.png 1024w, https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.59.31-am-300x116.png 300w, https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.59.31-am-768x298.png 768w, https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.59.31-am-1536x596.png 1536w, https:\/\/www.paradisec.org.au\/blog\/wp-content\/uploads\/2021\/10\/Screen-Shot-2021-10-29-at-10.59.31-am.png 1566w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure><div class=\"wp-block-media-text__content\">\n<p class=\"has-normal-font-size\">Next, you can put a carriage return before each backslash to create the format needed by FLEx. Finally, save this file as plain text.<\/p>\n<\/div><\/div>\n\n\n\n<p>You can now import that file into FLEx.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Following the previous blog post I had requests for more detail on how to convert a word-processor dictionary into the format needed to put the text into the software Fieldworks Language Explorer (FLEx). 
I\u2019ll set out the steps below, but it does require some knowledge of regular expressions that I\u2019ll explain as I go (you &#8230; <a title=\"Converting docx to FLEx format for dictionaries\" class=\"read-more\" href=\"https:\/\/www.paradisec.org.au\/blog\/2021\/10\/converting-docx-to-flex-format-for-dictionaries\/\" aria-label=\"Read more about Converting docx to FLEx format for dictionaries\">Read more<\/a><\/p>\n","protected":false},"author":13,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[79,3],"tags":[55,97],"class_list":["post-9316","post","type-post","status-publish","format-standard","hentry","category-dictionaries","category-technology","tag-dictionary","tag-regular-expressions"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.paradisec.org.au\/blog\/wp-json\/wp\/v2\/posts\/9316","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.paradisec.org.au\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.paradisec.org.au\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.paradisec.org.au\/blog\/wp-json\/wp\/v2\/users\/13"}],"replies":[{"embeddable":true,"href":"https:\/\/www.paradisec.org.au\/blog\/wp-json\/wp\/v2\/comments?post=9316"}],"version-history":[{"count":12,"href":"https:\/\/www.paradisec.or
g.au\/blog\/wp-json\/wp\/v2\/posts\/9316\/revisions"}],"predecessor-version":[{"id":9346,"href":"https:\/\/www.paradisec.org.au\/blog\/wp-json\/wp\/v2\/posts\/9316\/revisions\/9346"}],"wp:attachment":[{"href":"https:\/\/www.paradisec.org.au\/blog\/wp-json\/wp\/v2\/media?parent=9316"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.paradisec.org.au\/blog\/wp-json\/wp\/v2\/categories?post=9316"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.paradisec.org.au\/blog\/wp-json\/wp\/v2\/tags?post=9316"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}