Exploring data from language documentation

The workshop ‘Exploring data from language documentation’, organised by Kilu von Prince and Felix Rau (May 10–11, 2013), included a number of interesting presentations, which can be downloaded here: http://www.zas.gwz-berlin.de/1701.html

I talked about some gaps in the current language documentation workflow and tools that could help fill them, in particular ExSite9 for improving metadata collection, and EOPAS for presenting text and media online for citation and verification.

Christian Chanard and Amina Mettouchi showed a hybrid version of ELAN they have developed that allows parsing and morphological labeling, as well as another tool that allows web searching of ELAN files: http://corpafroas.tge-adonis.fr/tools.html

Joshua Crowgey presented on behalf of his co-authors (Emily M. Bender, Fei Xia, Michael Wayne Goodman) about the extraction of grammatical information directly from interlinear glossed text (IGT), which can be used to bootstrap the development of precision implemented grammars (RiPLes: information engineering and synthesis for Resource Poor Languages).

Alexander König from the MPI outlined the problems posed by the heterogeneous nature of the linguistic annotations deposited by DoBeS teams, and suggested that, rather than trying to develop a new standard, we make the existing annotation standards more interoperable and easier to understand. A service that could help here is ISOcat, a registry of data categories to which, for example, ELAN tiers could be mapped, allowing users to work with them without needing to know what each tier is called.
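To make the interoperability idea concrete, here is a toy Python sketch (not any actual ISOcat or ELAN API; the tier names and category labels are invented stand-ins) of mapping project-specific tier names onto shared data categories, so a tool can address tiers by category rather than by each team's naming habits:

```python
# Hypothetical mapping from local ELAN tier names (which vary by team)
# to shared data-category identifiers; the categories here are invented
# stand-ins for entries in a registry like ISOcat.
tier_to_category = {
    "tx@SPEAKER1": "orthographic-transcription",
    "transcript":  "orthographic-transcription",
    "mb":          "morpheme-break",
    "ge":          "gloss",
}

def tiers_for(category, tier_names):
    """Return the local tier names that realise a given data category."""
    return [t for t in tier_names if tier_to_category.get(t) == category]

# A search tool could now ask for "the transcription", whatever it is called:
print(tiers_for("orthographic-transcription", ["tx@SPEAKER1", "ge", "transcript"]))
```

A real mapping would of course live alongside the corpus rather than be hard-coded, but the principle is the same: one level of indirection between local conventions and shared categories.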

Nikolaus Himmelmann’s presentation, titled ‘Some small things that would be a big help in processing fieldwork data’, made a plea for tools to enforce filenaming conventions and to maintain version control of files created by collaborative teams. He argued that the capture of metadata should be automated wherever possible rather than entered by hand by the researcher. He also hoped there could be a way of speeding up the preliminary segmentation of recorded speech (and gesture).
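The filenaming-convention check he asked for is a small thing to script. Here is a minimal sketch in Python; the convention itself (language-speaker-date-sequence.extension) is invented for illustration, and a real project would encode its own:

```python
import re

# Hypothetical convention: 3-letter language code, speaker code (two
# letters + two digits), recording date (YYYYMMDD), 3-digit sequence
# number, and an approved extension, e.g. dkk-AB01-20130510-001.wav
PATTERN = re.compile(r"^[a-z]{3}-[A-Z]{2}\d{2}-\d{8}-\d{3}\.(wav|eaf|txt)$")

def check(filename):
    """Return True if the filename follows the project convention."""
    return bool(PATTERN.match(filename))

print(check("dkk-AB01-20130510-001.wav"))   # conforms
print(check("recording_final_v2.WAV"))      # does not
```

Run over a whole deposit directory before archiving, a check like this catches naming drift early, which is much cheaper than repairing it later.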

Using R for linguistic analysis is not new, but applying it to a corpus of IGT to answer queries like the following is: “find all records in the corpus where the first word has a locative case marker and the finite verb is not the last word”. Taras Zakharko gave examples of his R scripts, called ToolboxSearch, available here: bitbucket.org/tzakharko/toolboxsearch
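To show the kind of query involved, here is a toy Python illustration (not the ToolboxSearch R scripts themselves) over a made-up IGT data structure, where each record is a list of (word, gloss) pairs:

```python
# Invented mini-corpus; "LOC" marks locative case, "V.FIN" a finite verb.
corpus = [
    [("im-lo", "house-LOC"), ("ateme", "man"), ("du", "sit.V.FIN"), ("ka", "PRT")],
    [("ateme", "man"), ("im-lo", "house-LOC"), ("du", "sit.V.FIN")],
]

def matches(record):
    """First word carries a locative marker AND a finite verb is not last."""
    first_is_loc = "LOC" in record[0][1]
    finite_positions = [i for i, (_, gloss) in enumerate(record) if "V.FIN" in gloss]
    return first_is_loc and any(i < len(record) - 1 for i in finite_positions)

hits = [r for r in corpus if matches(r)]
print(len(hits))  # only the first record matches
```

The point is that once IGT is parsed into a regular structure, queries over glosses and word positions become ordinary filters, whether in R or any other language.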

Frank Seifart, Jan Strunk and Florian Schiel showed a set of analyses of the alignment of transcripts and media using WebMAUS. While such alignment is already working well for mainstream languages, their presentation showed very good results for a range of small languages too, with results on new languages comparable to those on German. As they note, this is a promising tool for aligning transcripts in heritage corpora.

Ciprian Gerstenberger talked about the importance of creating well-annotated corpora for empirically based analyses. He also emphasized the need to take care, when converting between formats, that nothing is lost in the conversion.

Peter Bouda showed the use of Python scripts with GrAF and Poio to create a data structure that can be presented as IGT. He demonstrated powerful ways of working with Toolbox and ELAN files by using annotation graphs as the basic structure and then deriving multiple possible outputs from the same dataset. http://www.peterbouda.eu/annotation-graphs-and-the-ipython-notebook.html
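The annotation-graph idea can be sketched very simply: annotations from different tiers all point at spans of one primary text, and different views are generated from the same graph. The structure and names below are illustrative only, not the actual GrAF/Poio API:

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    tier: str      # e.g. "word" or "gloss"
    start: int     # character offset into the primary text
    end: int
    value: str

# One primary text, two tiers of annotations over shared spans.
primary = "ateme du"
graph = [
    Annotation("word", 0, 5, "ateme"),
    Annotation("gloss", 0, 5, "man"),
    Annotation("word", 6, 8, "du"),
    Annotation("gloss", 6, 8, "sit"),
]

def tier(name):
    """All annotation values on one tier, in text order."""
    return [a.value for a in graph if a.tier == name]

def as_igt():
    """One possible output: tab-aligned word and gloss lines."""
    return "\t".join(tier("word")) + "\n" + "\t".join(tier("gloss"))

print(as_igt())
```

Because the graph, not any one rendering, is the primary object, the same data could just as easily be emitted as an ELAN tier set, a LaTeX example, or a database row.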

Seunhen Lee and Emily Elfner talked about a ‘web browser-based search and visualization architecture for complex multilevel linguistic corpora with diverse types of annotation’. They ask what databases can do for syntax-phonology interface research, and whether it will be possible to show correlations between, for example, prosodic phrasing and syntactic units. They are using Annis2 (http://www.sfb632.uni-potsdam.de/annis/), which, it seems, requires some technical knowledge: installing PostgreSQL and then Annis from the command line.

Finally, Kilu von Prince showed common problems with Toolbox data, drawing on her corpus of over 6000 sentences in Daakaka, and her approach to dealing with them using XML. The format is PAULA XML, and the corpus will be hosted on the ANNIS platform (http://annis2.sfb632.uni-potsdam.de/Annis/search.html). Conversion into other formats is provided by SaltNPepper (https://korpling.german.hu-berlin.de/p/projects/saltnpepper/wiki/).
