The long road to language resources—CLARIN

CLARIN, the ‘Common Language Resources and Technology Infrastructure’, is a European initiative to support the creation, curation and exploration of language material for research purposes and for as broad an audience as possible. The stated aim is that you should not need to be a technical expert to use the corpora, lexica and annotations that CLARIN targets.

CLARIN is constituted as a European Research Infrastructure Consortium (ERIC). This is a huge project, with a budget of some €104 million. CLARIN-D is the German section of CLARIN, and it recently held its two-year showcase, which I was able to attend (see current activities at http://clarin-d.net/de/aktuelles/). Given that these are the first two years of a long-term project, it has clearly achieved a great deal already, and certainly more than can be glimpsed in a short blog post.

This is part of a ‘roadmap’ process that actually leads somewhere, unlike the Australian version I reported on earlier, which appears to have cost hundreds of thousands of dollars only to be abandoned before it was even published.

In its place arose yet another committee structure, the Australian Research Committee (not to be confused with the Australian Research Council), which is now setting a new Australian research agenda and which includes not a single Humanities and Social Science (HASS) researcher in its membership (see its webpage). This ARCommittee released a set of guidelines on June 21st which may, for the next period, be important for funding applications to the Australian government.

But I digress. Back to CLARIN-D and the nine centres in Germany working on a timeline ending in 2020 (yes, a funding programme that covers 12 years!).
The sorts of questions that CLARIN should be able to answer include:

      • give me digital copies of all contemporary documents in European archives that discuss the Great Plague of England (1348-1350)
      • give me all negative articles about Islam or about soccer in the Slovenski Narod daily newspaper (1868-1943)
      • find Norwegian TV news interviews that involve speakers with a German accent
      • summarize all articles in European newspapers of April 2012 about machine translation – in Nynorsk
      • show me the pronoun systems of the languages of Alaska

Source: http://clarin.b.uib.no/files/2012/08/krauwer-clarino.pdf, page 4

Most tools shown at the workshop centre on text processing in well-known languages, but some central technologies being developed there would underlie tools usable in language documentation work. For example, ISOcat is a data registry for concepts used in linguistics that could serve as a point of reference for part-of-speech tags, specifying their usage more clearly than present practices generally do. However, it is rather cumbersome, and it is designed for developers to implement rather than for individual researchers to use. It could become the point of reference for newly developed tools that display encoding concepts from ISOcat, with provision for new ones to be added. A big problem that will no doubt emerge is a proliferation of ‘standard’ terms, each slightly different from the next and each embedded within its own community and history of practice.
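Since ISOcat publishes its data categories as persistent identifiers rather than as a researcher-friendly tool, here is a minimal sketch in Python of what a friendlier layer on top of it might do: pair each local part-of-speech tag with a registry identifier, so the tag's meaning travels with the data. The DC numbers and the mapping are hypothetical placeholders, not real registry entries.

```python
# A minimal sketch of anchoring local part-of-speech tags to ISOcat
# persistent identifiers, so that a tag like "N" points to a shared
# registry concept rather than relying on local convention.
# The DC numbers below are placeholders, not real registry entries.

from dataclasses import dataclass

@dataclass(frozen=True)
class PosTag:
    label: str       # the tag as used in the local tagset
    isocat_pid: str  # persistent identifier of the registry concept

# Hypothetical mapping from a local tagset onto registry concepts.
TAGSET = {
    "N": PosTag("N", "http://www.isocat.org/datcat/DC-0000"),  # placeholder
    "V": PosTag("V", "http://www.isocat.org/datcat/DC-0001"),  # placeholder
}

def annotate(token: str, tag: str) -> dict:
    """Return an annotation carrying both the local tag and the registry
    reference, so other tools can interpret the tag unambiguously."""
    pos = TAGSET[tag]
    return {"token": token, "pos": pos.label, "dcr": pos.isocat_pid}

print(annotate("dog", "N"))  # {'token': 'dog', 'pos': 'N', 'dcr': '...DC-0000'}
```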
So far, CLARIN has provided storage space and personal workspaces (somewhat like RDSI and NeCTAR in Australia). Several existing projects have become part of CLARIN. WebLicht, for example, chains tools for part-of-speech tagging, parsing, lemmatisation and so on for mainstream languages, running as a distributed set of interlinked services hosted at different physical locations across the CLARIN-D projects. TextGrid is another project that has, since its start in 2006, established the infrastructure for a text-based virtual research environment.
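To make the chaining idea concrete, here is a minimal sketch of how a client might call a WebLicht-style service over HTTP: plain text goes in together with a chain definition, and each tool in the chain adds an annotation layer before the result comes back. The endpoint URL, parameter names and API key here are illustrative assumptions, not WebLicht's documented interface; the appeal of the design is that none of the tools need to be installed locally.

```python
# A minimal sketch of calling a distributed tool chain in the WebLicht
# style. The endpoint URL, the "apikey" parameter and the form-field
# names are assumptions for illustration, not a documented interface.

import requests

ENDPOINT = "https://example.org/weblicht-style/chain/process"  # hypothetical

def run_chain(text: str, chain_definition: str, api_key: str) -> str:
    """Send text plus a chain definition to the service and return the
    annotated document produced by the last tool in the chain."""
    response = requests.post(
        ENDPOINT,
        params={"apikey": api_key},  # hypothetical auth parameter
        files={
            "content": ("input.txt", text, "text/plain"),
            "chains": ("chain.xml", chain_definition, "text/xml"),
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.text  # e.g. a TCF document with token/POS/lemma layers

# Usage (with a hypothetical chain definition file):
# annotated = run_chain("Der Hund bellt.", open("chain.xml").read(), "MY-KEY")
```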
The projects that look likely to be of most use to language documentation are the media annotation services, such as AVATecH for automatic recognition of video content, and SpeechFinder and WebMAUS (also mentioned earlier here).
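For WebMAUS the workflow is forced alignment: upload a recording and its transcript, and get back a time-aligned annotation. The sketch below follows the BAS runMAUSBasic web service as I understand it; the endpoint and parameter names should be treated as assumptions and checked against the current documentation.

```python
# A minimal sketch of a forced-alignment request in the WebMAUS style:
# audio plus transcript go to the service, and a time-aligned annotation
# (e.g. a Praat TextGrid) comes back. Endpoint and parameter names are
# my reading of the BAS runMAUSBasic service; verify before relying on them.

import requests

MAUS_URL = "https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runMAUSBasic"

def align(wav_path: str, txt_path: str, language: str = "deu-DE") -> str:
    """Upload audio plus transcript and return the service's XML response,
    which (on success) contains a link to the aligned TextGrid."""
    with open(wav_path, "rb") as wav, open(txt_path, "rb") as txt:
        response = requests.post(
            MAUS_URL,
            files={"SIGNAL": wav, "TEXT": txt},  # assumed parameter names
            data={"LANGUAGE": language, "OUTFORMAT": "TextGrid"},
            timeout=300,
        )
    response.raise_for_status()
    return response.text

# print(align("recording.wav", "transcript.txt"))
```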