Counting Collections

As will be clear to regular readers of this blog, we are concerned here with encouraging the creation of the best possible records of small languages. Since much of this work is done by researchers (linguists, musicologists, anthropologists, etc.) within academia, there needs to be a system for recognising collections of such records as academic outputs in their own right. This question is being discussed more widely in academia and in high-level policy documents, as can be seen from the list of references given below.

The increasing importance of language documentation as a paradigm in linguistic research means that many linguists now spend substantial amounts of time preparing corpora of language data for archiving. Scholars would of course like to see appropriate recognition of such effort in various institutional contexts. Preliminary discussions between the Australian Linguistic Society (ALS) and the Australian Research Council (ARC) in 2011 made it clear that, although the ARC accepted that curated corpora could legitimately be seen as research output, it would be the responsibility of the ALS (or the scholarly community more generally) to establish conventions to accord scholarly credibility to such products. Here, we report on some of the activities of the authors in exploring this issue on behalf of the ALS and discuss issues in two areas: (a) what sort of process is appropriate in according some form of validation to corpora as research products, and (b) what are the appropriate criteria against which such validation should be judged?

“Scholars who use these collections are generally appreciative of the effort required to create these online resources and reluctant to criticize, but one senses that these resources will not achieve wider acceptance until they are more rigorously and systematically reviewed.” (Willett, 2004)

We propose that the process, which would parallel quite closely the peer review process applied to traditional publications, should involve the following steps:

1.     Submission of a corpus by its creators to the ALS with a request for review.

2.     Refereeing by an anonymous panel representing the ALS.

3.     Report of the panel provided to the creators, possibly with suggestions for improvements.

4.     Response by the creators, possibly including revisions to the collection.

5.     Publication of a summary of the panel’s report and the creators’ response under the auspices of the ALS (for example in the ALS newsletter, AJL, or on the ALS website), constituting recognition of the corpus as a research product.

The question of what criteria to use in evaluating a corpus is more problematic, but we believe that some aspects should not be controversial. The accessibility of a corpus is clearly fundamental. As only accessible data can be considered to be “published”, only such data can be included in the review. Beyond the data being in principle available to corpus users, accessibility can also be defined (measured?) in terms of the technical resources necessary to use it (including metadata) and in terms of the background information provided in order to make the data comprehensible. On the other hand, we are less certain that consensus will be easily achieved on questions about quantities of data and the weight to be given to different data types. For example, we would expect a useful corpus to include some fully glossed material (in most cases anyway), but is there a minimum amount (or minimum proportion) of such material which should be taken as a threshold for approval? Another difficult question is whether it is useful to attempt to establish any equivalence between corpora as output and traditional publications, and if so, how any such equivalence might be measured.

As researchers are being asked to account for research quality, it is timely to consider the importance of collections of research data as an indicator of research quality. Such collections can, like any research endeavour, vary considerably in size and in the amount of curation that has gone into developing them. Further, primary data, while critically important to preserve for future reuse (particularly by the source community) and for verification of analyses, is best created within a framework that allows for its description and annotation. For example, using international standard metadata systems (such as Dublin Core or the Open Archives Initiative) for description of research material means that it can then be located using any targeted search tool, resulting in more precision in searches than is possible using generic tools like Google.
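To make this concrete, here is a minimal sketch of what a Dublin Core description of one item in a collection might look like, serialised so a harvester could index it. All field values, and the identifier, are invented for illustration; a real repository would assign its own persistent identifiers and may require additional fields.

```python
# A minimal Dublin Core record for a hypothetical item in a language
# collection. Every value below is invented for illustration.
record = {
    "dc:title": "Narrative: the crocodile story",
    "dc:creator": "Speaker name; Recorder name",
    "dc:subject": "narrative; language documentation",
    "dc:description": "Audio recording with time-aligned transcription.",
    "dc:date": "2012-07-15",
    "dc:type": "Sound",
    "dc:format": "audio/x-wav",
    "dc:identifier": "COLL-001-A",  # persistent ID assigned by the repository
    "dc:language": "und",           # ISO 639 code for the language recorded
    "dc:rights": "Open access with attribution",
}

# Serialise as simple XML elements so a metadata harvester can index them.
xml = "\n".join(f"<{key}>{value}</{key}>" for key, value in record.items())
print(xml)
```

Because the elements are drawn from a shared standard, a targeted search tool can query on `dc:language` or `dc:type` across many repositories at once, which is exactly the precision generic web search cannot offer.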

How can we ensure that collections of primary language data can count towards academic advancement?

This blog post summarises a presentation given at the Australian Linguistic Society conference in Perth (December 2012), titled: Assessing curated corpora as research output: issues of process and evaluation. The authors are Anna Margetts (Monash), Stephen Morey (La Trobe), Simon Musgrave (Monash), Adam Schembri (La Trobe), and Nick Thieberger (University of Melbourne).

Suggested process for assessing a collection

1.  Submission of corpus by creators to ALS with request for review

2. Review by ALS committee

3. Report of committee to the creators

– possibly with suggestions for improvements

4. Response by the creators

– including possibly revisions to the collection

5. Publication of report summary & creators’ response
= recognition of the corpus as a research product

– E.g. in ALS newsletter, AJL, or ALS website

Criteria suggested for assessing collections

1. Collection is housed in a repository

– which has a commitment to long-term curation and access

– which provides a citation form for items within the collection.

2. Accessibility

–  are the files in the corpus in a format that is freely accessible?

–  do they depend on certain software?

•  this can render them difficult or impossible to use

–  does the software require additional meta-files?

•  e.g. Toolbox files need .typ and .lng files to be read

3. Contextual information

– background to how the collection came into being and overview of what it contains

4. Metadata

– sufficiently described to allow corpus to be located

– what it contains, e.g. texts, media, vocabulary, speaker information, perhaps content keywords

5. Annotations

– transcription

– text-media linkage

– translation into language(s) of wider communication

– interlinear glosses

6. Content

– good range of speakers of different ages and genders

• relative to opportunity

– good range of text types (narratives, procedural, hortatory texts, written, songs, etc.)

• as per Himmelmann (2006:21) — intro to Gippert et al.

• relative to opportunity

7. Supporting documentation

– list of abbreviations used

– information on orthography and its relation to sound system

– grammatical information sufficient to allow further analysis

– lexicon (quality rated from short wordlist to detailed dictionary)

– ethnographic information

‘Quantity’ criteria

• Hours/units of primary data?

• The proportion of data which is annotated, etc., could be handled as a quality issue rather than a quantity issue

• Perhaps the quality rating will turn out to be independent of quantity?
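Some of the accessibility criteria above (criteria 2 in particular) lend themselves to an automated pre-check before a review panel is convened. The sketch below is illustrative only: the list of open formats is an assumption, not an ALS-endorsed standard, the file names are invented, and Toolbox databases are assumed here to use a `.db` extension.

```python
from pathlib import Path

# Extensions treated as openly accessible -- an illustrative assumption,
# not an endorsed list.
OPEN_FORMATS = {".txt", ".xml", ".eaf", ".wav", ".mp3", ".pdf", ".typ", ".lng"}

def precheck(filenames):
    """Return accessibility warnings for a list of collection files."""
    files = [Path(f) for f in filenames]
    names = {f.name for f in files}
    warnings = []
    for f in files:
        if f.suffix.lower() not in OPEN_FORMATS:
            warnings.append(f"{f.name}: not a recognised open format")
        # Toolbox databases cannot be read without their companion
        # .typ settings file (see criterion 2 above).
        if f.suffix.lower() == ".db" and f.with_suffix(".typ").name not in names:
            warnings.append(f"{f.name}: Toolbox file without a .typ settings file")
    return warnings

print(precheck(["story1.db", "story1.wav", "notes.docx"]))
```

A check like this only catches mechanical problems; the substantive criteria (contextual information, annotation quality, range of speakers and text types) still require human review.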


Background documents

A. May 2002 ALS Newsletter: Open letter from linguistic postgraduate students. Given:
1) the recognition by linguists of the need for urgent work to record small (‘endangered’) languages;
2) the availability of funding for that work; and
3) rapid advances in technological aids to doing that work,
then there is a clear need for the members of the Australian linguistic community to consider the following:
– the current PhD in linguistics makes no provision for language documentation beyond an academic grammar (in fact it positively discourages it).
– documentation of a language with few speakers or with little prospect of being spoken in the next generation should be considered a suitable PhD topic in linguistics.
– this language documentation would produce information in the language in a form that makes it accessible to speakers and their descendants, and to the linguistic community. Typically this would include a grammar sketch, dictionary and texts.
– the form of the documentation would include as much information as possible, but would minimally provide audio and video recording of performance in the language. It could also include ethnobiological information such as pictures of plants and animals, their uses and names and so on.
– the document would include grammatical information, but not of the detail currently expected of a PhD in linguistics.
– the document would be produced using current standard tools (e.g. digital recording, text/audio linkage).
– the document would be presented in archive quality and placed in an appropriate repository to ensure its accessibility and usability into the future
– This all entails training students in documentary techniques and linguistic data management.
In the long run it is the documentation that will prove more valuable for linguistic analysis than the traditional PhD. At present we have to rely on the writer of a PhD nearly 100% for some languages – and certainly 100% if the language is now gone.
While the current system values language analysis, it places no value on linguistic data management, nor on safely archiving recorded materials.

Claire Bowern – Harvard
Nicolette Bramley – UC/ANU
Anthony Jukes  – Melbourne Uni
Doug Marmion – ANU
Stephen Morey  – Monash Uni
Adam Paliwala  – Sydney Uni
Carol Priestley – ANU
Adam Saulwick – Melbourne Uni
St John Skilton – Sydney Uni
Nick Thieberger – Melbourne Uni
Myfany Turpin – Sydney Uni


B. Linguistic Society of America Resolution Recognizing the Scholarly Merit of Language Documentation

The following was passed by LSA members present at the Annual Business Meeting of the LSA in Baltimore, Maryland, on January 8, 2010 as a “sense of majority of the meeting” resolution. It was submitted to the membership at large in May 2011 for a “sense of the majority of the membership” and passed by a majority of the members responding.
Whereas the practice of linguistic fieldwork is shifting to a more collaborative endeavor firmly based on ethical responsibilities to speech communities and a commitment to broadening the impacts of scholarship; and
Whereas this shift in practice has broadened the range of scholarly work to include not only grammars, dictionaries, and text collections, but also archives of primary data, electronic databases, corpora, critical editions of legacy materials, pedagogical works designed for the use of speech communities, software, websites, or other digital media; and
Whereas the products of language documentation and work supporting linguistic vitality are of significant importance to the preservation of linguistic diversity, are fundamental and permanent contributions to the foundation of linguistics, and are intellectual achievements which require sophisticated analytical skills, deep theoretical knowledge, and broad linguistic expertise;
Therefore the Linguistic Society of America supports the recognition of these materials as scholarly contributions to be given weight in the awarding of advanced degrees and in decisions on hiring, tenure, and promotion of faculty. It supports the development of appropriate means of review of such works so that their functionality, import, and scope can be assessed relative to other language resources and to more traditional publications.

C. Motions at last year’s ALS AGM (2011):
Motion: That the ALS write to the ARC noting that curated corpora of linguistic data and accompanying analysis should be counted as research outputs subject to certain criteria being met.
(a) Further, that the ALS Executive establish a sub-committee to detail what constitutes a collection for these purposes and how such collections can be evaluated.

D. Thomson Reuters and citation of primary data. (The humanities currently account for only 4% of the repositories of primary data that they cite.)

They say: “While peer review of deposited data is by no means universal, application of the peer-review process is another indication of repository standards and signifies overall quality of the data presented and the completeness of any cited references. It is also recommended that whenever possible, each repository, data study or data set is published with information on the funding source supporting the research presented.”

More information is available in the Data Citation Index and its selection essay.

E. Digital Research Infrastructure for the Arts and Humanities (DARIAH) EU programme

“Long-term viability and access to data and resources is one of the most crucial requirements for research within the digital arts and humanities. Robust policies for collection management and preservation of data will ensure that resources are available for future use by researchers, and will allow discovery and sharing through the DARIAH infrastructure. All users of DARIAH should have an interest in the long-term maintenance of the data they create and use.”

DARIAH Collection Ingest, Management and Preservation Policy 


F. European Science Foundation 2011 report on Research Infrastructure in the Humanities

In summary, the development of a new culture of digital research within the Humanities requires a multifaceted approach:

  • Advocacy is needed to strengthen the acceptance of digital research, publications and the development of data.
  • The character of research as a social activity requires fostering and support.
  • A new academic recognition system must begin to recognise the scholarly value of electronic editions and publications; to review them in highly ranked journals; and to evaluate them as research contributions. (p.20)

“There are numerous examples of databases, tools and services in Humanities, but their lack of coherence is a significant problem. As a result, there is an urgent need for standards for metadata, for the organisation and interlinking of data and texts (semantic web) and for (Open Access and Permanent Access) publishing in text and data repositories. The following is a brief summary of current challenges in this area:
• Digital research in Humanities is mainly project-driven; small scattered research groups are working to short timescales.
• Digital data and documents are volatile: they need long-term preservation.
• Digital objects in Humanities have to be able to be consulted for a long period: they need institutions responsible for maintaining them for future generations of researchers.” (European Science Foundation 2011: 19)



G. National Research Investment Plan produced by the Australian Research Committee (ARCom).

Released 28th November 2012. (Only two mentions of ‘humanities’ in 132 pages).

Under the ‘Human Domain’ heading:

“The domain includes the capability to make old and new data discoverable and reusable and to extract greater value from existing collections that are as varied as statistical data, manuscripts, documents, artefacts and audiovisual recordings. This domain will enable discovery and use of previously inaccessible information, stimulating connections and synergies and catalysing innovative research.” (p.62)


3 thoughts on “Counting Collections”

  1. I came across some work on reviewing collections of primary data in the Earth Sciences by Sarah Callaghan.

    She also has an article, ‘Data without Peer: Examples of Data Peer Review in the Earth Sciences’.

    The first hurdles a collection needs to get over, she says, are:
    – Does the dataset have a permanent identifier?
    – Does it have a landing page (or README file or similar) with additional information/metadata, which allows you to determine that this is indeed the dataset you’re looking for?
    – Is it in an accredited/trusted repository?
    – Is the dataset accessible? If not, are the terms and conditions for access clearly defined?

    If it fails any of these then it is rejected from the review process.
