Sharing the load? Problems with the ‘lone depositor’ model for the archiving of materials in endangered language archives

Ruth Singer recaps last week’s Linguistics in the Pub, a monthly informal gathering of linguists in Melbourne to discuss topical areas in our field.

Traditionally, collections in endangered languages archives are identified with a single depositor. This depositor is typically a researcher, who is not a member of the community in which the recordings were made. This depositor decides on access restrictions to the materials, ideally in consultation with the community. There are a number of quite separate problems with this position, for those who manage archives and for those who find themselves in the position of lone depositor. In this era of collaborative fieldwork, we can also ask whether the lone depositor model is the best one for communities who speak endangered languages. One suggestion is to make collections open access so that the depositor does not need to be contacted. Another suggestion is to name a number of depositors for each collection, so that no single person has sole responsibility. In this LIP we will discuss potential solutions to the problems of the lone depositor model in the light of participants’ experiences as depositors and archivists.
Happy IDWIP!! 29 years on

I nearly missed this [thanks Bruce!]

Mick Gooda, the Social Justice Commissioner, celebrates International Day of the World’s Indigenous Peoples – the birthday cake theme this year being ‘Indigenous designs: celebrating stories and cultures, crafting our own future’

His press release notes that International Day of the World’s Indigenous Peoples marks the day of the first meeting, in 1982, of the UN Working Group on Indigenous Populations.

Birrguu Matya, or the game Tapatan

‘A new board game based on an ancient Aboriginal game has just been released by NSW Aboriginal artist Donna Hensen. Called Hunters Tactics,’ reported the Koori Mail 166 (17 December 1997), page 25. ‘Traditionally, the game was played on the ground using sticks, stones or kangaroo dung and was one of many used to teach children the skills of hunting and gathering.’

Aboriginal artist Donna Hensen’s initiative was cited as an example in a marketing guide from the Australia Council for the Arts in 2000:1

She has designed a new board game, based on a traditional Aboriginal game, to be distributed through duty-free stores.
The game won the Innovative Indigenous Product Design award at the Indigenous Art Expo held in Casino, NSW in 1997. Made of ceramic, fibre resins and shells, she describes it as a mix of noughts-and-crosses and chess, requiring lateral thinking and patience.
With the help of the Expo co-ordinator, Donna used her prize money to trademark the name Hunters Tactics, then to find an agent to approach toy companies for a children’s version and to test market her art product.

In 1997 the game also had the name Birrguu Matya, according to the date on one in the University of Ballarat Library, though this name wasn’t used in the official mentions. Birrguu Matya has since been marketed through various online stores such as Gecko Educational, or Dreamtime Kullilla-Art which describes the product this way:

Birrguu Matya (Bush Game) Similar to tic-tac-toe & chess and designed to develop skill, patience and lateral thinking. This game has been played by the Aboriginal people for centuries and can be played by all ages.

The game received a favourable mention from Leesa Watego in her blog post Birrguu Matya: A Wiradjuri Game by Donna Hensen, with a couple of comments added in 2009 by Donna Hensen herself.

How to play the game is described in its Wikipedia entry, and more clearly in a recent post on the blog of a Melbourne primary school, which also shows there’s no need to purchase the kit. The game is clearly identical to one known for centuries in Asia as Tapatan (and synonyms and near synonyms), as set out in the marvellous Elliott Avedon Virtual Museum of Games.2 Under the name Tapatan the game is available free for iOS 3.0 or later3.

Now, at last, to the ELAC angle. The words birrguu and matya look like they’ve been taken from the widely available 1994 publication Macquarie Aboriginal words. In its English index there are a few entries under bush, and one points to Wiradjuri birrguu ‘scrub, the bush’. There is only one entry under game: matya, which points to the Paakantyi language chapter, and the entry under Non-physical qualities matya, matyitya ‘bold, game, daring, tame’.

So, what to think? Two words have been taken from separate NSW languages, one from a quite different sense (‘game’ as ‘bold, daring’), and used to market, especially to schools, a kit for a game with no recorded Australian antecedents (unless a reader can correct me?). The venture has not been in the context of language revitalization, and the instructions do not involve any Australian language vocabulary. Call the authenticity police, or let a thousand (plastic) flowers bloom?


  1. Online at p.134 in Section 3 of What’s my plan? A guide to developing arts marketing plans, citing an article ‘Hunters for Collectors’, p.25 in Smarts 12 (December 1997).
  2. A quite similar game with four (not three) pieces for each player is known in Ghana as Achi.
  3. Thanks to The Regents’ Center for Early Developmental Education at the University of Northern Iowa

LEGO blocks

Jeff Good has written a blog post about how citation metadata was dealt with in various file conversions for the Lexicon Enhancement via the GOLD Ontology (LEGO) project. His post was written in response to my discussion and follow up (plus James McElvenny’s contribution) about citation practices of data aggregators like LEGO and PanLex.

Jeff’s bottom line is that he and others acted in good faith, didn’t try to pass off other people’s data as their own, and that “[t]echnology was the problem, not people”.

Citation, citation

Continuum International Publishing Group has just sent me a complimentary copy of Jim Miller’s new textbook A Critical Introduction to Syntax which includes a chapter on “Noun Phrases and Non-configurationality” (pages 61-98). Since this is a topic I have published on (Austin and Bresnan 1996, Austin 2001a, 2001b) I figured I’d have a quick look at this chapter first. Interestingly, on page 78 I found example sentence (27) which is “from the Native Australian language Jiwarli” for which Miller (2011:77) quotes as the source “Pensalfini (2004, p. 364)”:

(27) Kutharra-rru ngunha ngurnti-nha jiluru
two-now that lie-pres egg (Nom)
‘Now those two eggs are lying there.’

As readers who have done research on Australian Aboriginal languages will probably recognize, Pensalfini cannot be the original source for the example since only Alan Dench and I ever recorded data on Jiwarli from its last speaker, the late Jack Butler, and only I have published primary material on the language. Pensalfini (2004) indeed gives Austin (1993) as his source, but Miller makes no mention of this (my article was actually published as Austin 2001a, three years before Pensalfini’s article appeared1). This seems to be what we could call ‘the example sentence variant’ of the “violation of citation etiquette” described so eloquently by Pullum 1988.

However, the story has a further twist to it. The glossing of the Jiwarli example, faithfully copied by Miller, is not the glossing given in Austin (1993, 2001a), but was changed by Pensalfini. Here is the example in its original form:

(27) Kutharra-rru ngunha ngurnti-nha jiluru
two.nom-now that.nom lie-pres egg.nom
‘Now those two eggs are lying there.’ [T51s9]

What I was trying to show in my glossing is that each nominal element in Jiwarli can be understood as being nominative case-marked and that there is no evidence for noun phrases. Hence, each of ‘two’, ‘that’ and ‘egg’ is marked for case, something that Pensalfini’s reglossing does not make clear. Rather more egregious however is that a whole category of information, the “[T51s9]” following the English free translation, has been silently eliminated. Let me explain what this is.

In 1981 I returned to Australia from a short-term teaching post at Harvard University to take up a position at La Trobe University Linguistics Department, and recommenced my research on Western Australian languages, including Jiwarli, after a three-year break in the United States. I started the “Gascoyne-Ashburton Languages Project” (GALP) at La Trobe and as part of that established a basic principle of providing metadata giving the source of all the sentence examples (and lexical items) collected in the project. In doing so I was influenced by the same practice I had seen in Jane Simpson’s PhD research (I had been in contact with Jane in Boston in 1980-81); as Simpson (1983:4) says2: “I have tried to indicate the source of each example sentence where I know it. If the sentence is made up, I have indicated this, unless the sentence is elementary.” For GALP I developed a metadata source indication system that distinguished two categories (usually indicated in publications as material in square brackets following each English free translation):

  1. elicited examples whose metadata reference has the form [AABBCCNDDpEEsFF] where AA is an abbreviation representing the language, BB is an abbreviation representing the speaker, CC is an abbreviation representing the recorder, DD is an integer for a fieldnote book, EE is an integer for the page of the notebook, and FF is an integer for the sequential order of the sentence on the notebook page. Thus [TRCYTKN01p79s07] is the seventh sentence on page 79 of notebook 1 collected by Terry Klokeid from Chubby Yowadji in Tharrkari.
  2. text examples whose metadata reference has the form [AABBCCTDDsEE] where AA is an abbreviation representing the language, BB is an abbreviation representing the speaker, CC is an abbreviation representing the recorder, DD is an integer for a text in a collection, EE is an integer for the sequential order of the sentence within the text. Thus [WRAEOGT03s01] is the first sentence in text 3 collected by Geoffrey O’Grady from Alec Eagles speaking Warriyangka.

For Jiwarli text examples, a reference like [JIJBPAT51s9] could be reduced to [T51s9] since the texts were just those recorded by myself from the late Jack Butler (and published in Austin 1997). I introduced this system to keep track of the contributions of individual speakers and recorders, the genre of examples, and to ensure that it was always possible to go back to the original fieldnotes and text collections to check materials, if necessary. I have maintained this system in my Toolbox data sets and in publications since.
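To make the two reference formats concrete, here is a minimal sketch in Python of how such references could be parsed back into their parts. It is an illustration only, not the tooling actually used in GALP or Toolbox. The names `SourceRef` and `parse_ref` are hypothetical, and the sketch assumes that every abbreviation is exactly two capital letters, as in the examples above. The default prefix used for reduced references (JI, JB, PA) is taken from the Jiwarli example [JIJBPAT51s9].

```python
import re
from typing import NamedTuple, Optional


class SourceRef(NamedTuple):
    """One parsed source reference (hypothetical structure for illustration)."""
    kind: str            # "elicited" or "text"
    language: str        # two-letter language abbreviation (assumed length)
    speaker: str         # two-letter speaker abbreviation (assumed length)
    recorder: str        # two-letter recorder abbreviation (assumed length)
    unit: int            # notebook number (elicited) or text number (text)
    page: Optional[int]  # notebook page; None for text examples
    sentence: int        # sequential order of the sentence


# Elicited examples: AABBCC N DD p EE s FF
_ELICITED = re.compile(r"^([A-Z]{2})([A-Z]{2})([A-Z]{2})N(\d+)p(\d+)s(\d+)$")
# Text examples: AABBCC T DD s EE, where the AABBCC prefix may be omitted
_TEXT = re.compile(r"^(?:([A-Z]{2})([A-Z]{2})([A-Z]{2}))?T(\d+)s(\d+)$")


def parse_ref(ref, default_prefix=("JI", "JB", "PA")):
    """Parse a bracketed reference such as [TRCYTKN01p79s07] or [T51s9].

    A reduced text reference with no prefix is filled in from
    default_prefix, which is an assumption of this sketch.
    """
    body = ref.strip().strip("[]")

    m = _ELICITED.match(body)
    if m:
        lang, spk, rec, notebook, page, sent = m.groups()
        return SourceRef("elicited", lang, spk, rec,
                         int(notebook), int(page), int(sent))

    m = _TEXT.match(body)
    if m:
        lang, spk, rec, text, sent = m.groups()
        if lang is None:  # reduced form such as [T51s9]
            lang, spk, rec = default_prefix
        return SourceRef("text", lang, spk, rec, int(text), None, int(sent))

    raise ValueError(f"unrecognised source reference: {ref!r}")
```

With this sketch, `parse_ref("[T51s9]")` and `parse_ref("[JIJBPAT51s9]")` give the same record, which is the point of the reduced form: nothing is lost so long as the default speaker and recorder are known.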

Interestingly, a feature of Miller’s A Critical Introduction to Syntax is that it makes use of “real language examples” taken from spoken and written English corpora. Each such example has relevant source metadata clearly indicated (thus page 39 example (79) is from “Miller-Brown corpus, conversation 58”, and page 133 example (25) is from “The Herald, 17 October 2009, p. 4”) yet no example sentence in a language other than English gets a metadata source reference, not even Russian which is extensively exemplified. Surely what’s good for the (English) goose should be good for the gander?

In their seminal paper on data portability and digital language documentation, Bird and Simons (2003) identify citation as one of the major problems currently faced by those who wish to document and describe languages. They state that3: “[w]e value the ability of users of a resource to give credit to its creators, as well as to learn the provenance of the sources on which it is based. Thus the best practice is one that makes it easy for … language documentation and description to be cited.” Having developed such a system for my own research some thirty years ago, I find it disappointing that Miller, and Pensalfini before him, simply left out the crucial identifying citation metadata.

Let’s hope that practices in linguistic research improve in this area so that the hard work of language speakers and language documenters can be properly recognised, especially as material is passed around, resulting in second- and third-hand publications4.


Austin, Peter. 1993. Word order in a free word order language: the case of Jiwarli. La Trobe University Manuscript.
Austin, Peter. 1997. Texts in the Mantharta Languages, Western Australia. Tokyo: ILCAA, Tokyo University of Foreign Studies.
Austin, Peter K. 2001a. Word order in a free word order language: the case of Jiwarli. In Jane Simpson, David Nash, Mary Laughren, Peter Austin, Barry Alpher, (eds.) Forty years on: Ken Hale and Australian languages, 205-323. Canberra: Pacific Linguistics.
Austin, Peter K. 2001b. Zero arguments in Jiwarli, Western Australia. Australian Journal of Linguistics 21(1): 83-98.
Austin, Peter and Joan Bresnan. 1996. Non-configurationality in Australian Aboriginal languages. Natural Language and Linguistic Theory 14: 215-268.
Bird, Steven and Gary Simons. 2003. Seven dimensions of portability for language documentation and description. Language 79(3): 557-582.
Bow, Cathy, Baden Hughes and Steven Bird. 2003. Towards a general model of interlinear text. E-MELD workshop paper.
Pensalfini, Robert. 2004. Towards a typology of configurationality. Natural Language and Linguistic Theory 22(2): 359-408.
Pullum, Geoffrey. 1988. Citation etiquette beyond Thunderdome. Natural Language and Linguistic Theory 6(4): 579-588.
Simpson, Jane. 1983. Aspects of Warlpiri Morphology and Syntax. PhD dissertation, MIT.
Simpson, Jane. 1991. Warlpiri Morpho-Syntax. Amsterdam: Kluwer Academic Publishers.


  1. The same example also appears in Austin 2001b, p. 85, example 3, as well as in Austin and Bresnan 1996, p. 246 example 42
  2. Simpson 1991 is the revised published version which continues the same practice
  3. see also Bow, Hughes and Bird 2003 who propose a four-level model of interlinear glossed text that includes a text level which is “the complete unit of data under examination which functions as a unit in its entirety … The text level includes metadata”.
  4. The example sentence quoted above gets particularly woeful treatment at the hands of ODIN, the Online Database of Interlinear text, which is “a repository of interlinear glossed text extracted mainly from scholarly linguistic papers”. ODIN identifies the language of this example as Mangala, spoken a thousand kilometers north of Jiwarli on the coast of Western Australia, because of misidentification of Jiwarli with Juwarliny, a dialect of Mangala!

Rights, responsibilities, and data duffers

In a recent blog post James McElvenny presents a broad-ranging discussion about copyright, in response to my earlier post about the use of materials from my published work without attribution by the PanLex project. James covers a lot of ground and brings in many different aspects, including his frustrations that he “can’t play region-coded DVDs that [he] … bought in Europe on [his] Australian DVD player” and “‘licence agreements’ consisting of several thousand words of legalese gibberish”. I won’t attempt to cover all the topics he mentions (indeed, I feel unqualified to do so) but I do want to go back to the more narrow issue of rights and responsibilities in relation to (linguistic) research that I originally raised, and to highlight a sociological aspect that seems to have been missed in the discussion so far.

Yesterday I was fortunate to be invited to attend a seminar organised by the UK Arts and Humanities Research Council (AHRC) Beyond Text project on the topic “Beyond Copyright”. The seminar brought together 20 scholars and researchers from the UK and elsewhere in Europe representing a range of disciplines in arts and humanities, media, archiving, and intellectual property (IP) law. I found it an incredibly stimulating day where I learned a lot, especially about the complexities of IP, copyright, and the massive economic and social changes currently going on in this area. It also helped me to clarify my thinking about what it is that I was trying to present in my original blog post published on 4th April.

Let’s start with copyright. Yesterday’s presentation by Jeremy Silver, self-described “entrepreneur, digital media adviser and thought-leader”, set out some of the basic issues that are relevant to our current discussion. He explained that copyright exists to:

  1. reward creators, and
  2. provide consumer access

through control of:

  • reproduction
  • attribution
  • remuneration
  • integrity

using mechanisms for encoding and managing:

  1. primary rights (and moral rights)
  2. secondary exploitation rights
  3. licensing

There are tensions between these three aspects that get played out through the economic system — these tensions have been exacerbated by the rise of the internet and the current rapidly changing digital landscape.

I would like to clarify that my original blog post was about copyright as a means for dealing with primary rights (and moral rights) with a particular emphasis on reproduction, attribution and integrity. I was not writing about remuneration, as I made clear in my comment responding to James’ comment on my post.1 I am not averse to PanLex reproducing Diyari materials and including them in their database2; what I object to is lack of attribution. I am also concerned that materials are being presented in an orthography that is 30 years old and does not represent the spelling that has been agreed with the Dieri Aboriginal Corporation for current use; if PanLex had consulted me, I would have suggested revising and updating the included data to the current spellings, thus improving the integrity of my and the Dieri community’s contribution to their work. (At the time I posted I also sent an email to Emily Bender who is a Board member of Utilika Foundation and she promised to take these issues up with its Director.) There may be similar problems with other data in the PanLex collection and I would encourage interested readers to check on their favourite languages from this perspective.3

Now for the sociology. Over the past several years, linguistics as a discipline has been undergoing a range of socio-cultural shifts, including changing emphases on the rôle of primary data collection and corpus curation, especially in terms of their relationship to linguistic analysis and theorisation. This is reflected in the impact of seminal works like Bird and Simons (2003),4 last year’s LSA motion (see also Jane Simpson’s report) on recognising the scholarly merit of language documentation, and the growth of digital language archives like ELAR at SOAS that now has terabytes of documentation materials available for registered users to browse and download (subject to the ELAR terms and conditions of use5). These changes and developments can be encouraged and protected so long as researchers feel that their work will be properly recognised and referenced, especially when employed by others in their own research and publications; indeed, I have heard on numerous occasions expressions of reluctance by scholars unwilling to deposit their data and analyses in an archive because of a fear of being “ripped off”. Such fears are real and need to be addressed. One way of doing so is to publicly highlight instances of apparent abuse, as I did in the Prof Parker case and Nick Thieberger did in the Stolen Grammars case. As Nick pointed out:

“there has to be a mechanism for recognising creative effort, otherwise no-one will put their work online. ‘Stolen Grammars’ did not link to existing open access resources, but copied them without proper attribution. … Linguistic archives rely on the good faith of those signing agreements about how they will use data from the archive. Depositors have a right to trust that the material they deposit will not be misused.”

So, I believe that Nick Thieberger, David Nash, and I (and possibly others6) are being vigilant and following up on cases of ‘data duffers’7 like these in a spirit of service to the field, aimed at ensuring that copyright and licensing agreements are respected, and that fears are allayed for existing and potential contributors/publishers about their rights being violated and materials being misused. I see it as a way to encourage more and better access to linguistic research, not what James calls “a greater sin” of “[u]sing copyright to stop or hinder other research projects”.

At yesterday’s workshop there were repeated calls for ‘creative thinking’ about IP and copyright. I applaud the President of the Utilika Foundation for setting aside “funds for legal services in the 2012 budget [that] reflects an assumption that intellectual-property issues, as well as contractual issues more generally, will likely become more complex as resource deployment progresses”8 and I encourage him to engage in creative thinking about new mechanisms for licensing and attribution arrangements for the PanLex project, rather than complaining that “creators of many resources assert rights that, taken literally, would prohibit a person reading a resource from later making use of what he or she had learned from it”. I am confident that if the right environment can be established, I and others will happily contribute to this and other projects. It’s the way of the future.


  1. In his post, James says I “would presumably not want anyone to make money from it (except his publisher, Cambridge University Press)”; in fact my 1981 book is now out of print and 20 years ago I asked CUP to cede to me the copyright on the book, which they did, so they currently have no rights in the work and remuneration is not an issue for them, or indeed, anyone.
  2. In fact I have contributed Sasak materials to the Austronesian Basic Vocabulary Database, which contribution is fully acknowledged in the relevant entry.
  3. Angela Terrill has commented to me via Facebook: “I had a quick look and found they have Lavukaleve data there, but in an orthography I have not seen. Don’t know where they got it from”.
  4. Bird, Steven and Gary Simons. 2003. Seven dimensions of portability for language documentation and description. Language 79(3): 557-582.
  5. written in 10 sentences of plain English, not “legalese gibberish”
  6. The ELAR Terms and Conditions of Use explicitly state: “I understand that ELAR may take legal action on behalf of owners of materials in the case of serious infringement of this agreement.”
  7. My use of “duffers” here is intended as a pun, relying on its polysemy in Australian English. For US readers, one sense, as in Australian “cattle duffers”, corresponds to your “cattle rustlers”.
  8. see here

Professor Austin and copyright

Peter Austin has raised his voice on this blog to ‘protect [his] legal rights and those of the Dieri people who have contributed to [his] knowledge of their language’ (source). He suggests that the PanLex project is guilty of ‘theft’ for using, without citation, data from a Dieri-English word list contained in his 1981 grammar of the Dieri language.1 He also implies that the PanLex project should not use the data without his permission.2

I think Austin ought to clarify exactly what he believes he owns and how he would justify the claim that it has been stolen. There are two aspects to copyright, commercial rights and moral rights. Austin has indicated that he is not interested in receiving royalties for the use of his data (source), although he would presumably not want anyone to make money from it (except his publisher, Cambridge University Press). There seems to be little danger of that, anyway: although I am not particularly familiar with the PanLex project and its financial backers, the administering Utilika Foundation appears to be a non-profit organisation (source).

Austin seems to be more concerned with asserting his moral rights. Under the Berne Convention, the main international copyright treaty that specifically mentions the moral rights of authors, the author of a work has

…the right to claim authorship of the work and object to any distortion, mutilation or other modification of, or other derogatory action in relation to, the said work, which would be prejudicial to his honor or reputation.

The PanLex project seems to have denied Austin his right to be recognised as the author of the word list by not providing a citation to his grammar, in which it originally appeared. This lack of citation appears to be a genuine oversight, however. The project maintains an extensive list of resources they are using in compiling their database. Although Austin’s 1981 grammar is not on the list, his and David Nathan’s online Kamilaroi/Gamilaraay dictionary is (a fact that Austin does not mention in his blog post). There is the possibility that the PanLex project has acquired the Dieri data through a secondary source that is listed but which does not acknowledge Austin’s original work properly.

The PanLex project clearly does not claim to be the original collector of the data and is not out to appropriate other people’s data with no acknowledgement, which is what Austin implies in his blog post. However, the project’s referencing definitely leaves something to be desired. As to possible distortion of Austin’s work that could be ‘prejudicial to his honor or reputation’, that is a separate issue about which he has not yet expressed an opinion.

What exactly is Austin claiming ownership over? Presumably not the entire Dieri language, but just the data contained in his word list. But what exactly is the data represented in his list? Is it just the equivalencies he has established between the Dieri and English words? It should be noted that raw data is not covered by copyright, although a particular representation of it is. This is a fact that Austin is aware of.3 The particular record of the equivalencies in his list is therefore protected by copyright. But we could also ask whether the list really is a creative work. A lot of effort certainly goes into acquiring and organising the knowledge required to produce such a list, but it could be considered merely sweat of the brow – that is, a work of diligence rather than creativity. If so, it would not be protected under US copyright law, but it would be under European copyright laws.4

Is Austin also claiming ownership over the actual words, as represented in his orthography? He uses the spelling of the Dieri word wadaŋaɲɟu to identify its origin in his book and the orthography is of his own devising. As a non-tangible system the orthography itself cannot be copyrighted, but perhaps particular instantiations of it could be. Should Noah Webster be cited every time someone writes the word ‘color’, because Webster was the first to propagate this spelling in his published work?

It should be pointed out that the PanLex project is not simply a copy of Austin’s 1981 word list. It is a new work that incorporates material from a large number of sources. It is what would be considered a ‘derivative work’. Under UK copyright law, the jurisdiction where Austin’s book was published and where he now resides and works, permission does appear to be required to use copyrighted material in a derivative work,5 although there are some possible exceptions where only excerpts are used. It could be argued that the Dieri words that appear in PanLex are excerpts from Austin’s book. They certainly do not represent a reproduction of the entire work. There may be no need to get permission in this case. But this is one for a judge to decide.

What are the potential implications of Austin’s assertion of ownership of the Dieri data? Since his publications contain a large amount of the Dieri linguistic data available outside the community, he could be seen as appointing himself as a gatekeeper to pretty much any non-primary research into the language. This is a point the President of the Utilika Foundation makes in the minutes of their 2011 AGM, which Austin cites as evidence of their ‘playing fast and loose’:

The creators of many resources assert rights that, taken literally, would prohibit a person reading a resource from later making use of what he or she had learned from it. From the beginning of the project, I have considered such usage prohibitions unenforceable, and I have considered our use of any resource to be the recording of facts asserted by it, in a novel form, not the creation of a copy of it and thus not copyright infringement.

Austin is not himself a fatcat publisher, movie studio or software company, but his wielding of the copyright bludgeon is reminiscent of their current practices. When we want to install software or sign up for an online service, we are confronted with ‘licence agreements’ consisting of several thousand words of legalese gibberish. We can’t go any further until we confirm that we have ‘read and agree’ to the terms.6 I can’t play region-coded DVDs that I have bought in Europe on my Australian DVD player. In the US, the publisher HarperCollins recently moved to force libraries to limit e-books to being borrowed only 26 times (source). The list goes on.

Using copyright to stop or hinder other research projects is perhaps a greater sin, however, even if we might not agree with the aims of the project or are not impressed by the quality of their work. Such abuses of copyright stifle innovation and the advancement of knowledge. If it were not for restrictive copyright, the underlying data that went into producing the ‘Culturomics’ database could have been made available to users, which would perhaps have improved its usefulness.

Now to help us all calm down, perhaps we should hear the message in Sesame Street style from Nina Paley:

Marcel Duchamp's L.H.O.O.Q., a derivative work

Thanks to David Nash for reading this post and suggesting some improvements, mainly restraint – you should have seen the first draft! Of course, what I have said here cannot be taken to reflect his views.


  1. Austin, Peter. 1981. A grammar of Diyari, South Australia. Cambridge: Cambridge University Press.
  2. This is not the first time he has raised such concerns. He made similar complaints in a somewhat different case in an earlier blog post. Since this earlier case is not exactly parallel to the current one, my comments here cannot necessarily be taken to apply there.
  3. Austin comments: ‘The World Intellectual Property Organisation (WIPO) defines intellectual property as “creations of the mind: inventions, literary and artistic works, and symbols, names, images, and designs used in commerce”. Here “creations of the mind” refers to something that somebody created, and hence does not cover general knowledge like the meanings of words, or forms of a morphological paradigm (a particular definition, e.g., that found in a printed dictionary, would however be subject to intellectual property rights).’ Austin, Peter. 2010. Communities and Ethics in Language Documentation, in Austin, Peter, ed, Language Documentation and Description Volume 7, p.41.
  4. Wikipedia contains an article that addresses the main points.
  5. See the UK copyright service fact sheet P-22.
  6. And what do we really surrender in these agreements?

They’re out to get you (or your data at least)

A couple of years ago I wrote a blog post about Professor Phillip M Parker PhD, a Professor of Marketing in France who had established a website called Webster’s Online Dictionary that contained materials on endangered languages taken from copyrighted sources.1 Parker also published a set of books based on materials taken from copyrighted websites, such as Webster’s Kamilaroi-English Thesaurus Dictionary.2

Well, it looks like someone else is also harvesting data on languages from copyrighted sources without attribution. This is the PanLex project funded by the Utilika Foundation that:

“gathers knowledge about all the words in all the languages of the world, so that any word may be translated into any language, a step toward panlingual communication. For this work we consult multilingual, bilingual, and monolingual resources named “dictionaries”, “thesauri”, “lexical databases”, “wordnets”, “glossaries”, “terminologies”, “vocabularies”, and “word lists”, as well as individuals.”

Although the website gives a list of “the resources we are now consulting”, a simple search using the TerraDict tool shows that in fact unlisted materials are also being used. I searched for “left” in the Dieri (Diyari) language (which I have worked on for the past 35 years) and got the following result:

This can only have come from the vocabulary list in my 1981 book A Grammar of Diyari, South Australia (published by Cambridge University Press) because it is only in that book that I used the letter d for the trill sound — in later publications I used rrh. This word would now be spelled as warrangantyu in the orthography that the Dieri Aboriginal community prefers. QED, the Diyari material has been nicked without attribution from my copyrighted book.

Johnathan Poole, President of the Utilika Foundation, realises they are playing fast and loose here as the following statements from the minutes of their 2011 Annual Meeting held just last month make clear (note the last sentence in particular):

“intellectual-property obstacles to the expansion of PanLex have not yet been a major problem. If they prevented us from using one resource, we could move on to the next. The creators of many resources assert rights that, taken literally, would prohibit a person reading a resource from later making use of what he or she had learned from it. From the beginning of the project, I have considered such usage prohibitions unenforceable, and I have considered our use of any resource to be the recording of facts asserted by it, in a novel form, not the creation of a copy of it and thus not copyright infringement. … I believe that our normalization, structuring, and selective use of published data, combined with our provision of links to the original data, will satisfy most content creators. However, the inclusion of funds for legal services in the 2012 budget reflects an assumption that intellectual-property issues, as well as contractual issues more generally, will likely become more complex as resource deployment progresses.”

Well, as far as I can see there is no “complex[ity]” surrounding “intellectual-property issues” here — the Diyari materials (and possibly lots more on lots more languages) are copyright and subject to fair dealing. Anything else is theft.

PS: Thanks to David Nathan for passing on pointers to the PanLex project, including the Annual Meeting minutes quoted here. He bears no responsibility for the content of this blog post.


  1. After much discussion (see the 19 comments on my post), Professor Parker appeared to take on board feedback from a number of researchers about apparent violations of intellectual property rights and moral rights, and wrote that he was planning to write an “Open Letter to Field Linguists” and a guide to “Copyrights and Moral Rights for Languages and their Translations” — neither has so far been published.
  2. This is now showing as “out of print” but is still available as an e-Book from the Book Depository among other sources.