They’re out to get you (or your data at least)

A couple of years ago I wrote a blog post about Professor Phillip M Parker PhD, a Professor of Marketing in France who had established a website called Webster’s Online Dictionary that contained materials on endangered languages taken from copyrighted sources.1 Parker also published a set of books based on materials taken from copyrighted websites, such as Webster’s Kamilaroi-English Thesaurus Dictionary.2

Well, it looks like someone else is also harvesting data on languages from copyrighted sources without attribution. This is the PanLex project funded by the Utilika Foundation that:

“gathers knowledge about all the words in all the languages of the world, so that any word may be translated into any language, a step toward panlingual communication. For this work we consult multilingual, bilingual, and monolingual resources named “dictionaries”, “thesauri”, “lexical databases”, “wordnets”, “glossaries”, “terminologies”, “vocabularies”, and “word lists”, as well as individuals.”

Although the website gives a list of “the resources we are now consulting”, a simple search using the TerraDict tool shows that in fact unlisted materials are also being used. I searched for “left” in the Dieri (Diyari) language (which I have worked on for the past 35 years) and got the following result:

This can only have come from the vocabulary list in my 1981 book A Grammar of Diyari, South Australia (published by Cambridge University Press) because it is only in that book that I used the letter d for the trill sound — in later publications I used rrh. This word would now be spelled as warrangantyu in the orthography that the Dieri Aboriginal community prefers. QED, the Diyari material has been nicked without attribution from my copyrighted book.

Jonathan Poole, President of the Utilika Foundation, realises they are playing fast and loose here, as the following statements from the minutes of their 2011 Annual Meeting held just last month make clear (note the last sentence in particular):

“intellectual-property obstacles to the expansion of PanLex have not yet been a major problem. If they prevented us from using one resource, we could move on to the next. The creators of many resources assert rights that, taken literally, would prohibit a person reading a resource from later making use of what he or she had learned from it. From the beginning of the project, I have considered such usage prohibitions unenforceable, and I have considered our use of any resource to be the recording of facts asserted by it, in a novel form, not the creation of a copy of it and thus not copyright infringement. … I believe that our normalization, structuring, and selective use of published data, combined with our provision of links to the original data, will satisfy most content creators. However, the inclusion of funds for legal services in the 2012 budget reflects an assumption that intellectual-property issues, as well as contractual issues more generally, will likely become more complex as resource deployment progresses.”

Well, as far as I can see there is no “complex[ity]” surrounding “intellectual-property issues” here — the Diyari materials (and possibly lots more on lots more languages) are copyright and subject to fair dealing. Anything else is theft.

PS: Thanks to David Nathan for passing on pointers to the PanLex project, including the Annual Meeting minutes quoted here. He bears no responsibility for the content of this blog post.


  1. After much discussion (see the 19 comments on my post), Professor Parker appeared to take on board feedback from a number of researchers about apparent violations of intellectual property rights and moral rights, and wrote that he was planning to write an “Open Letter to Field Linguists” and a guide to “Copyrights and Moral Rights for Languages and their Translations” — neither has so far been published.
  2. This is now showing as “out of print” at but is still available as an e-Book from the Book Depository among other sources.

11 thoughts on “They’re out to get you (or your data at least)”

  1. I’m a little bit concerned about your wielding the copyright bludgeon to prevent the ‘theft’ of this language data. I’d like you to elaborate on exactly what you mean by ‘fair dealing’ in this case – do you just want a reference to your work, or a royalty as well?

    Remember that copyright law is currently being abused by greedy fatcat publishers and studios who keep intellectual and artistic works from their audiences unless huge ransoms are paid. The comments of the President of the Utilika Foundation are probably directed at this sort of abusive use of copyright law. Your rhetoric makes you sound like you belong to this camp.

    Although it does seem that the PanLex people are guilty of at least omission and at worst ‘theft’ in not listing your 1981 grammar among their sources, you failed to mention that they have in fact provided two references to your online Kamilaroi/Gamilaraay dictionary. Perhaps they’re not quite the thieving scoundrels you imply they are. There is the possibility that their Dieri data comes from an intermediate source that does not acknowledge your original work.

  2. Prompted by James’ last guess, I checked the Rosetta Project list and there it is, ‘left: wadaŋaɲɟu’. The list is linked from Dieri Swadesh List, which shows the correct attribution to the Austin 1981 CUP Grammar. Not that this explains why PanLex doesn’t declare the source for the Dieri.

  3. James

    Thanks for your comment, but I think you may have misunderstood a few basic points. Firstly, I am not “wielding the copyright bludgeon”: copyright is a legal concept that exists in national and international law and one that I believe should be respected. My interest is to protect my legal rights and those of the Dieri people who have contributed to my knowledge of their language, and to have those rights respected by others who wish to make use of that intellectual property. There exist simple mechanisms to do that which involve licensing and attribution, and they were not used in this case. I personally am not interested in demanding royalties, contrary to what your comment seems to suggest, but rather in having proper respect for copyright (and moral rights) and proper attribution.

    As for “fair dealing” (also called “fair use”, or “fair practice”), this too is a legal concept that is clearly laid out in copyright law. The definition from the UK Copyright Service makes clear that among the requirements is that “the source of the quoted material is mentioned, along with the name of the author”. PanLex does not do this, even though, as David Nash points out in his comment, they appear to have copied materials from another source which does reference the original source. Note that fair dealing allows one “to use quotations or excerpts” but it prohibits copying of the whole of a copyright work. This applies to written works but not to recorded sound or images, by the way.

    I do not claim to be an expert on copyright and intellectual property rights (though I have written about the topic in Language Documentation and Description, Volume 7 published last year), but I believe it is something that we researchers should know at least the basics of. Copyright and copyright law are quite complicated in our modern digital world. In my capacity as a member of the Steering Committee of the Beyond Text initiative of the Arts and Humanities Research Council, I am attending a workshop entitled “Beyond Copyright” here in London this coming Friday to update myself on current views (a doc file about the workshop can be found here). I may write about it in a future blog post.

  4. One of the issues discussed in this posting is the citation (or noncitation) of sources in general, and of the source of Dieri-English translations in particular, by the PanLex project.

    The project makes provisions for the citation of sources, but some aspects of the creation, modification, and use of the PanLex data pose challenges to source citation. The challenge most involved in this particular case is subsidiary sources. In addition to the general problem of subsidiary-source citation, PanLex obtained the data in question from an as-yet unpublished source that has not yet published a catalogue of its own sources.

    For more details, please see the “Example” section of my essay, “PanLex Source Citation”. Comments would be welcome.

  5. Jonathan

    Thanks for your comment and the new materials included today in your “PanLex Source Citation” document that appears to have been written in response to this post and my later one.

    Sadly, you are still getting simple citation information wrong. The name of the South Australian language under discussion is Diyari (as in the title of my 1981 book), not “Dieri”. The Aboriginal body which represents the community who identify with, and in some cases still speak, this language is the Dieri Aboriginal Corporation, which uses “Dieri” to refer to the people, not the language.

  6. “The name of the South Australian language under discussion is Diyari (as in the title of my 1981 book), not “Dieri””

    In this instance, the trouble is the reliance upon the Ethnologue for language names and identifications:

    While close to authoritative in some regions (e.g. those in which SIL is engaged in active fieldwork), its treatment of Australian languages is appalling; this “Dieri/Diyari” issue is the least of its problems.

  7. A rather more authoritative source for information on Australian Aboriginal languages is AUSTLANG, the Australian Indigenous Languages Database. There you will find not only that the standard name is given as “Diyari”, but also a range of information about documentation of each language in the database and (accurate) bibliographical references.

  8. The trouble is that it is nearly impossible to convince tools-driven projects to adopt a heterogeneous curation-driven system. What they all want is a coded index that they don’t need to maintain in-house. Even where it is manifestly mistaken, they will ask that attestations be twisted to fit it, rather than vice-versa. A large number of LEGO vocabularies are “unidentified” not because an Australianist would not know what they are, but because Ethnologue either omits (Victoria) or horribly mangles (Cape York) entire provinces.

  9. Thanks for noting that “Dieri” isn’t the accepted name of the language. I have changed it to “Diyari” in the above-mentioned essay and in PanLex.

