{"id":7679,"date":"2013-07-02T19:00:18","date_gmt":"2013-07-02T08:00:18","guid":{"rendered":"http:\/\/www.paradisec.org.au\/blog\/?p=7679"},"modified":"2013-07-02T19:00:50","modified_gmt":"2013-07-02T08:00:50","slug":"the-long-road-to-language-resources-clarin","status":"publish","type":"post","link":"https:\/\/www.paradisec.org.au\/blog\/2013\/07\/the-long-road-to-language-resources-clarin\/","title":{"rendered":"The long road to language resources\u2014CLARIN"},"content":{"rendered":"<p><a href=\"http:\/\/www.clarin.eu\/\">CLARIN, the &#8216;Common Language Resources and Technology Infrastructure&#8217;<\/a> is a European initiative to support the creation, curation and exploration of language material for research purposes and for as broad an audience as possible. The stated aim is that you should not need to be a technical expert to use the corpora, lexica and annotations that are targeted in CLARIN.<\/p>\n<p>It is part of the <a href=\"http:\/\/ec.europa.eu\/research\/infrastructures\/index_en.cfm?pg=eric\">European Research Infrastructure Consortium (ERIC). <\/a> This is a huge project, with a budget of some <a href=\"http:\/\/ec.europa.eu\/research\/index.cfm?pg=newsalert&amp;lg=en&amp;year=2012&amp;na=na-290212-1\">\u20ac104 million<\/a>. CLARIN-D is the German section of CLARIN and it recently had its 2-year showcase, which I was able to attend (see current activities at <a href=\"http:\/\/clarin-d.net\/de\/aktuelles\/\">http:\/\/clarin-d.net\/de\/aktuelles\/<\/a>). 
Given that these are only the first two years of a long-term project, it has clearly achieved a great deal already, and certainly more than can be glimpsed in a short blog post.<\/p>\n<p>This is part of a &#8216;roadmap&#8217; process that actually leads somewhere, unlike the Australian version <a href=\"http:\/\/www.paradisec.org.au\/blog\/2011\/03\/australian-humanities-research-infrastructure-funding\/\">I reported on earlier<\/a>, which appears to have cost hundreds of thousands of dollars only to be abandoned before it was even published.<\/p>\n<p><!--more-->In its place arose yet another committee structure, the <a href=\"http:\/\/www.innovation.gov.au\/Research\/Pages\/AustralianResearchCommittee.aspx\">Australian Research Committee<\/a> (not to be confused with the Australian Research Council), which is now setting a new Australian research agenda and which does not include a single Humanities and Social Science (HASS) researcher in its membership (<a href=\"http:\/\/www.innovation.gov.au\/Research\/Pages\/AustralianResearchCommitteeMembership.aspx\">see its webpage<\/a>). This ARCommittee released <a href=\"http:\/\/www.innovation.gov.au\/Research\/Pages\/StrategicResearchPriorities.aspx\">a set of guidelines on June 21st<\/a>, which may, for the next period, be important for funding applications to the Australian government.<\/p>\n<p>But I digress. 
Back to CLARIN-D and the 9 centres in Germany working on a timeline ending in 2020 (yes, a funding programme that covers 12 years!).<br \/>\nThe sort of questions that CLARIN should be able to answer are:<\/p>\n<ul>\n<li>give me digital copies of all contemporary documents in European archives that discuss the Great Plague of England (1348-1350)<\/li>\n<li>give me all negative articles about Islam or about soccer in the Slovenski Narod daily newspaper (1868-1943)<\/li>\n<li>find Norwegian TV news interviews that involve speakers with a German accent<\/li>\n<li>summarize all articles in European newspapers of April 2012 about machine translation \u2013 in Nynorsk<\/li>\n<li>show me the pronoun systems of the languages of Alaska<\/li>\n<\/ul>\n<p style=\"padding-left: 120px;\"><em>source: <a href=\"http:\/\/clarin.b.uib.no\/files\/2012\/08\/krauwer-clarino.pdf\">http:\/\/clarin.b.uib.no\/files\/2012\/08\/krauwer-clarino.pdf, page 4<\/a><\/em><\/p>\n<p>Most tools shown at the workshop centre on text processing in well-known languages, but some core technologies are being developed that could underlie tools for language documentation work. For example, <a href=\"http:\/\/www.isocat.org\">ISOcat<\/a> is a data registry for concepts used in linguistics that could be a point of reference for part-of-speech tags, specifying usage more clearly than present practices generally do. However, it is rather cumbersome, and is designed for developers to implement rather than for individual researchers to use. It could be the point of reference for newly developed tools that display encoding concepts from ISOcat, with provision for new ones to be added. 
A big problem that will no doubt emerge is a proliferation of &#8216;standard&#8217; terms, each slightly different to the next and each embedded within its own community and history of practice.<br \/>\nSo far, CLARIN has provided storage space and personal workspace (sort of like <a href=\"http:\/\/rdsi.uq.edu.au\/\">RDSI<\/a> and <a href=\"http:\/\/nectar.org.au\/\">NECTAR<\/a> in Australia). Several existing projects have become part of CLARIN, for example <a href=\"http:\/\/weblicht.sfs.uni-tuebingen.de\/weblichtwiki\/index.php\/Main_Page\">WebLicht<\/a>, a chain of tools that do part-of-speech tagging, parsing, lemmatisation and so on for mainstream languages, run as a distributed set of interlinked services hosted at different physical locations around the CLARIN-D projects. <a href=\"http:\/\/www.textgrid.de\/en\/community\/\">TextGrid<\/a> is another tool that has, since its start in 2006, established the infrastructure for a text-based virtual research environment.<br \/>\nThe projects that look like being of most use to language documentation are the media annotation services like <a href=\"http:\/\/tla.mpi.nl\/projects_info\/avatech\/\">Avatech<\/a> for automatic recognition of video content, and <a href=\"https:\/\/clarin.phonetik.uni-muenchen.de\/BASWebServices\">SpeechFinder and WebMAUS<\/a> (also mentioned earlier <a href=\"http:\/\/www.paradisec.org.au\/blog\/2013\/05\/exploring-data-from-language-documentation\">here<\/a>).<\/p>\n","protected":false},"excerpt":{"rendered":"<p>CLARIN, the &#8216;Common Language Resources and Technology Infrastructure&#8217; is a European initiative to support the creation, curation and exploration of language material for research purposes and for as broad an audience as possible. 
The stated aim is that you should not need to be a technical expert to use the corpora, lexica and annotations that &#8230; <a title=\"The long road to language resources\u2014CLARIN\" class=\"read-more\" href=\"https:\/\/www.paradisec.org.au\/blog\/2013\/07\/the-long-road-to-language-resources-clarin\/\" aria-label=\"Read more about The long road to language resources\u2014CLARIN\">Read more<\/a><\/p>\n","protected":false},"author":13,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"jetpack_post_was_ever_published":false},"categories":[1],"tags":[],"class_list":["post-7679","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.paradisec.org.au\/blog\/wp-json\/wp\/v2\/posts\/7679","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.paradisec.org.au\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.paradisec.org.au\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.paradisec.org.au\/blog\/wp-json\/wp\/v2\/users\/13"}],"replies":[{"embeddable":true,"href":"https:\/\/www.paradisec.org.au\/blog\/wp-json\/wp\/v2\/comments?post=7679"}],"version-history":[{"count":21,"href":"https:\/\/www.paradisec.org.au\/blog\/wp-json\/wp\/v2\/posts\/7679\/revisions"}],"predecessor-version":[{"id":7700,
"href":"https:\/\/www.paradisec.org.au\/blog\/wp-json\/wp\/v2\/posts\/7679\/revisions\/7700"}],"wp:attachment":[{"href":"https:\/\/www.paradisec.org.au\/blog\/wp-json\/wp\/v2\/media?parent=7679"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.paradisec.org.au\/blog\/wp-json\/wp\/v2\/categories?post=7679"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.paradisec.org.au\/blog\/wp-json\/wp\/v2\/tags?post=7679"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}