Exploring data from language documentation

The workshop ‘Exploring data from language documentation’, organised by Kilu von Prince and Felix Rau (May 10–11, 2013), included a number of interesting presentations, which can be downloaded here: http://www.zas.gwz-berlin.de/1701.html

I talked about some gaps in the current language documentation workflow and tools that could help fill them, in particular ExSite9 for improving metadata collection, and EOPAS for presenting text and media online for citation and verification.

Christian Chanard and Amina Mettouchi showed a hybrid version of ELAN they have developed that allows parsing and morphological labeling, as well as another tool that allows web searching of ELAN files: http://corpafroas.tge-adonis.fr/tools.html

Joshua Crowgey presented on behalf of his co-authors (Emily M. Bender, Fei Xia, Michael Wayne Goodman) about the extraction of grammatical information directly from interlinear glossed text (IGT), which can be used to bootstrap the development of precision implemented grammars (RiPLes: information engineering and synthesis for Resource Poor Languages).

Alexander König from the MPI outlined the problems posed by the heterogeneous nature of the linguistic annotations deposited by DoBeS teams, and suggested that, rather than trying to develop a new standard, we make the existing annotation standards more interoperable and easier to understand. A service that could help here is ISOcat, a registry of data categories to which, for example, ELAN tiers could be mapped, allowing users to work with them without needing to know what each tier is called.
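To make the interoperability idea concrete, here is a toy Python sketch (not any actual ISOcat or ELAN API; the tier names and category labels are invented stand-ins) of mapping project-specific tier names onto shared data categories, so a tool can address tiers by category rather than by each team's naming habits:

```python
# Hypothetical mapping from local ELAN tier names (which vary by team)
# to shared data-category identifiers; the categories here are invented
# stand-ins for entries in a registry like ISOcat.
tier_to_category = {
    "tx@SPEAKER1": "orthographic-transcription",
    "transcript":  "orthographic-transcription",
    "mb":          "morpheme-break",
    "ge":          "gloss",
}

def tiers_for(category, tier_names):
    """Return the local tier names that realise a given data category."""
    return [t for t in tier_names if tier_to_category.get(t) == category]

# A search tool could now ask for "the transcription", whatever it is called:
print(tiers_for("orthographic-transcription", ["tx@SPEAKER1", "ge", "transcript"]))
```

A real mapping would of course live alongside the corpus rather than be hard-coded, but the principle is the same: one level of indirection between local conventions and shared categories.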

Nikolaus Himmelmann’s presentation, titled ‘Some small things that would be a big help in processing fieldwork data’, made a plea for tools to enforce filenaming conventions and to maintain version control of files created by collaborative teams. He argued that the capture of metadata should be automated wherever possible rather than entered by hand by the researcher. He also hoped there could be a way of speeding up the preliminary segmentation of recorded speech (and gesture).
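The filenaming-convention check he asked for is a small thing to script. Here is a minimal sketch in Python; the convention itself (language-speaker-date-sequence.extension) is invented for illustration, and a real project would encode its own:

```python
import re

# Hypothetical convention: 3-letter language code, speaker code (two
# letters + two digits), recording date (YYYYMMDD), 3-digit sequence
# number, and an approved extension, e.g. dkk-AB01-20130510-001.wav
PATTERN = re.compile(r"^[a-z]{3}-[A-Z]{2}\d{2}-\d{8}-\d{3}\.(wav|eaf|txt)$")

def check(filename):
    """Return True if the filename follows the project convention."""
    return bool(PATTERN.match(filename))

print(check("dkk-AB01-20130510-001.wav"))   # conforms
print(check("recording_final_v2.WAV"))      # does not
```

Run over a whole deposit directory before archiving, a check like this catches naming drift early, which is much cheaper than repairing it later.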

Using R for linguistic analysis is not new, but applying it to a corpus of IGT to answer queries like the following is: “find all records in the corpus where the first word has a locative case marker and the finite verb is not the last word”. Taras Zakharko gave examples of his R scripts, called ToolboxSearch, available here: bitbucket.org/tzakharko/toolboxsearch
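To show the kind of query involved, here is a toy Python illustration (not the ToolboxSearch R scripts themselves) over a made-up IGT data structure, where each record is a list of (word, gloss) pairs:

```python
# Invented mini-corpus; "LOC" marks locative case, "V.FIN" a finite verb.
corpus = [
    [("im-lo", "house-LOC"), ("ateme", "man"), ("du", "sit.V.FIN"), ("ka", "PRT")],
    [("ateme", "man"), ("im-lo", "house-LOC"), ("du", "sit.V.FIN")],
]

def matches(record):
    """First word carries a locative marker AND a finite verb is not last."""
    first_is_loc = "LOC" in record[0][1]
    finite_positions = [i for i, (_, gloss) in enumerate(record) if "V.FIN" in gloss]
    return first_is_loc and any(i < len(record) - 1 for i in finite_positions)

hits = [r for r in corpus if matches(r)]
print(len(hits))  # only the first record matches
```

The point is that once IGT is parsed into a regular structure, queries over glosses and word positions become ordinary filters, whether in R or any other language.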

Frank Seifart, Jan Strunk and Florian Schiel showed a set of analyses of the alignment of transcripts and media using WebMAUS. While such alignment is already working well for mainstream languages, their presentation showed very good results for a range of small languages too, with results on new languages comparable to those on German. As they note, this is a promising tool for aligning transcripts in heritage corpora.

Ciprian Gerstenberger talked about the importance of creating well-annotated corpora for empirically based analyses. He also emphasized the need to take care, when converting between formats, that nothing is lost in the conversion.

Peter Bouda showed the use of Python scripts with GrAF and Poio to create a data structure that can be presented as IGT. He demonstrated powerful ways of working with Toolbox and ELAN files by using annotation graphs as the basic structure and then deriving multiple possible outputs from the same dataset. http://www.peterbouda.eu/annotation-graphs-and-the-ipython-notebook.html
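The annotation-graph idea can be sketched very simply: annotations from different tiers all point at spans of one primary text, and different views are generated from the same graph. The structure and names below are illustrative only, not the actual GrAF/Poio API:

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    tier: str      # e.g. "word" or "gloss"
    start: int     # character offset into the primary text
    end: int
    value: str

# One primary text, two tiers of annotations over shared spans.
primary = "ateme du"
graph = [
    Annotation("word", 0, 5, "ateme"),
    Annotation("gloss", 0, 5, "man"),
    Annotation("word", 6, 8, "du"),
    Annotation("gloss", 6, 8, "sit"),
]

def tier(name):
    """All annotation values on one tier, in text order."""
    return [a.value for a in graph if a.tier == name]

def as_igt():
    """One possible output: tab-aligned word and gloss lines."""
    return "\t".join(tier("word")) + "\n" + "\t".join(tier("gloss"))

print(as_igt())
```

Because the graph, not any one rendering, is the primary object, the same data could just as easily be emitted as an ELAN tier set, a LaTeX example, or a database row.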

Seunhen Lee and Emily Elfner talked about a ‘web browser-based search and visualization architecture for complex multilevel linguistic corpora with diverse types of annotation’. They ask what databases can do for syntax-phonology interface research, and whether it will be possible to show correlations between, for example, prosodic phrasing and syntactic units. They are using Annis2 (http://www.sfb632.uni-potsdam.de/annis/), which, it seems, requires some technical knowledge: installing PostgreSQL and then Annis from the command line.

Finally, Kilu von Prince showed common problems with Toolbox data, drawing on her corpus of over 6000 sentences in Daakaka, and her approach to dealing with them using XML. The format is PAULA XML, and the corpus will be hosted on the ANNIS platform (http://annis2.sfb632.uni-potsdam.de/Annis/search.html). Conversion into other formats is provided by SaltNPepper (https://korpling.german.hu-berlin.de/p/projects/saltnpepper/wiki/).
