Peter Austin’s blog post deals with the searchability of online endangered language archives. As one of the targets of his latest post, PARADISEC apparently does not give him the results he wants when he searches its catalog. Searching for ‘Educational material’ in a catalog makes lots of assumptions about the way that catalog has been constructed, one of which must be that the term is provided by the catalog or that the typical depositor would use the term in their freeform description of the item. Strangely, the answer he offers is not to provide the infrastructure on which such searches may succeed in future, but to advocate a folksonomy in which such searches will always be sure to fail.
The post is an advertisement for what is undoubtedly a very nice interface to a set of material held by ELAR, but we should also bear in mind the large amount of funding that ELAR/ELDP have had, so we would hope for at least a nice-looking webpage after eight years. It is also interesting that ELAR holds only 70 collections after ELDP has funded 216 projects. What has happened to the rest of the material, or am I being too commodifying to think of such a thing?
The comments on the post raise OLAC, a great service that provides information for the broader community (including linguists, but especially speakers, who can access it via Google), harvesting information from archives around the world every 8 hours to update its language documentation index. OLAC provides a system for digital archives to maximise the searchability of their catalogs. There are 45 digital archives that take advantage of this free service. That represents almost all language archives in the world, but to date ELAR has unfortunately chosen not to be part of that community.
What OLAC may lack in flashiness (although the new faceted search that Tom points to in his comment is pretty smart) it certainly makes up for in depth of coverage; see http://search.language-archives.org/index.html.
And if, as Peter says he is, you are only interested in searching for endangered languages, well, who has had the resources to create a list of those languages? Rather than one of the well-endowed projects providing this resource, the World Oral Literature Project (WOLP) has done a fine first job of this with minimal resources. They harvest suitable archives (those that comply with the relevant standards for exchange of metadata) to get that information, and, yes, you guessed it, ELAR’s silo catalog is not there either.
Peter dismisses efforts to standardise terms as being outdated (in the olden days, it seems, ‘key metadata notions were interoperability, standardisation, discovery, and access’) and advocates a relativist metadata mush in which there is a ‘focus on expressivity and individuality in metadata descriptions’. Expressivity and individuality certainly have their place, but they don’t help when it comes to targeted location of information, especially at the scale of material to be searched on the web. The keywords given in the short set of genres in Peter’s post are a perfect example.
Looking for ‘songs’ will not find song, looking for ‘kastom’ will not find Custom description or Custom narrative or Custom story, let alone Folk Tale, Narrative, Myth narrative, Narration, Narrative from visual prompt and many more. Who knows what ‘Chronicle’ or ‘Semi-spontaneous interview’ will find? And it is nice that the terms can be in any language, but that makes it even less predictable whether a search will find anything. I can’t see why it is an advantage to have all of those terms that Peter lists rather than a standard set of terms and then a freeform field in which such stream-of-consciousness tags can also be listed.
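To make the point concrete, here is a minimal sketch in Python of the two approaches. It is an illustration only, not how ELAR's or PARADISEC's catalog actually works: the item titles, the `genre` values and the function names are all made up for the example, and only the freeform tags are taken from the genre list in Peter's post.

```python
from dataclasses import dataclass


@dataclass
class Item:
    title: str
    genre: str            # hypothetical term from a fixed, catalog-supplied list
    free_tags: list[str]  # whatever wording the depositor chose


# Invented catalog entries; the free tags come from the list quoted above.
ITEMS = [
    Item("Item A", "narrative", ["Custom story"]),
    Item("Item B", "narrative", ["Folk Tale"]),
    Item("Item C", "narrative", ["Myth narrative"]),
    Item("Item D", "song", ["song"]),
]


def search_free_tags(items, query):
    """Literal match against depositor tags only (the folksonomy case)."""
    return [item.title for item in items if query in item.free_tags]


def search_genre(items, term):
    """Match against the standard term, ignoring depositor wording."""
    return [item.title for item in items if item.genre == term]


print(search_free_tags(ITEMS, "songs"))   # the plural misses the tag 'song'
print(search_free_tags(ITEMS, "kastom"))  # a different spelling misses 'Custom story'
print(search_genre(ITEMS, "narrative"))   # the standard term finds all three
```

The freeform tags are still stored and still searchable; the standard term simply gives a searcher one predictable handle that does not depend on guessing each depositor's wording.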
A product of allowing users to enter their own terms, rather than providing them with a set list and a freeform field for their own version, is that a collection will not have any standard terms for locating information. Thus, for example, the Arandic songs project in ELAR is tagged with ‘Language: Arandic’, while none of the standard language term lists includes ‘Arandic’. Searching for the more usual term ‘Arrernte’ does not locate the ELAR items in the first ten pages of a Google search (I gave up looking any deeper than that).
By participating in international standards, the items in the ELAR collection could be found by pages like this: http://www.language-archives.org/language/are
Here, the standard three-letter code at the end of the URL links to a page listing all available information held in participating archives, and this is updated every 8 hours, effectively providing a dynamic documentation index. Of course there are still problems with the three-letter codes, but they are improving over time, and this and other issues could be resolved by cooperation rather than competition among the small community who are doing this work.
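The address pattern is simple enough to sketch in a few lines of Python. The base URL and the code `are` are taken from the link above; the function name and the validation rule (exactly three ASCII letters) are my own assumptions for illustration, and the sketch only builds the address without fetching or checking the page.

```python
# Base address taken from the OLAC link quoted above.
OLAC_LANGUAGE_BASE = "http://www.language-archives.org/language/"


def olac_language_url(code: str) -> str:
    """Build the OLAC language page address for a three-letter code.

    Hypothetical helper: it only checks the shape of the code, not
    whether the code is actually assigned to a language.
    """
    code = code.strip().lower()
    if len(code) != 3 or not code.isascii() or not code.isalpha():
        raise ValueError(f"expected a three-letter code, got {code!r}")
    return OLAC_LANGUAGE_BASE + code


print(olac_language_url("are"))
```

Because the code is the only variable part, any archive that records the standard code for an item makes that item reachable from a predictable address, whatever name the depositor used for the language.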
ELDP/ELAR is a multi-million dollar enterprise that has been running for eight years and has achieved great things. It could lead the field with open-source tools for linguists to use, and perhaps an open-source version of their catalog for other archives to adopt. Archives like PARADISEC have no funding beyond occasional grants and are staffed by committed people concerned to make legacy linguistic material safe. We are content to know that we have digitised field recordings and curate over 3,000 hours of recorded material that would otherwise have been lost and that the catalog makes it locatable. We are in the Open Language Archives Community and in WorldCat (1.5 billion metadata items) and we take advantage of this existing infrastructure by having our catalog maximally exposed to targeted search tools.