Open access and intimate fieldwork

A report on the Linguistics in the Pub discussion held on Tuesday 11th March at the Prince Alfred Hotel, Grattan St, Melbourne.

This Linguistics in the Pub discussion brought together fieldworkers who do research in Indigenous Australia, Africa, South Asia, Papua New Guinea and Nepal, as well as a computational linguist who has developed software to automate language documentation. The linguists were not all Australian; in fact, we were lucky to have four participants who identify as European and who are living in Australia, temporarily or permanently. The linguists’ experience in language documentation ranged from 6 to 30 years, and between them they had deposited materials in the digital archives DoBeS, Paradisec and ELAR. The timeliness of this discussion is demonstrated by David Nathan’s very recent ELAC post on the same topic.
Scam alert or how to make a lot of money really quickly

Felicity Meakins writes…

Just recently I was on Amazon, when I came across two potentially interesting books:

At first I berated myself for having never noticed these books before, let alone the authors. Surely these were important volumes that I should have referenced! However, a little further investigation revealed a scam that grew bigger (and actually more impressive) as I dug deeper.

I first became suspicious when I recognised some of the wording of the abstract of the first book. Sure enough, the entire abstract was a word-for-word copy of the Wikipedia entry on mixed languages. A loud excited outburst from me drew Myf Turpin into the fray. We had a look at the Alphascript publishing website only to find that ALL of their books were edited by Frederic P. Miller, Agnes F. Vandome and John McBrewster, with topics ranging from Japanese mythology and Franco-Belgian comics to cloud seeding and swine flu! And when I say “ALL of their books”, I mean all 1006 books. Who were these prolific authors?!

When we googled their names, we found a number of scam alerts, so we are certainly not the first to notice them. Unfortunately the University of Queensland library was drawn in for five books’ worth on topics including abalone and Mayan civilisations. Indeed, as Alphascript publishing proudly announce on their webpage, most of the major book distributors, including Amazon, list their books.

One can’t help being secretly impressed with the size of the scam. Most of the books are sold for AU$40.00. UQ Library would have spent around AU$200 on their books, and there is a good chance too that many other university libraries did the same before realising it was all a scam. In a single year, Frederic P. Miller, Agnes F. Vandome and John McBrewster probably had enough in the bank to buy a small island and disappear.

Aside from being impressed (or gobsmacked), it is probably worth checking your university library and alerting them to the scam, and letting other prospective buyers know if you come across their books on book seller pages.

LEGO blocks

Jeff Good has written a blog post about how citation metadata was dealt with in various file conversions for the Lexicon Enhancement via the GOLD Ontology (LEGO) project. His post was written in response to my discussion and follow up (plus James McElvenny’s contribution) about citation practices of data aggregators like LEGO and PanLex.

Jeff’s bottom line is that he and others acted in good faith, didn’t try to pass off other people’s data as their own, and that “[t]echnology was the problem, not people”.

Rights, responsibilities, and data duffers

In a recent blog post James McElvenny presents a broad-ranging discussion about copyright, in response to my earlier post about the use of materials from my published work without attribution by the PanLex project. James covers a lot of ground and brings in many different aspects, including his frustrations that he “can’t play region-coded DVDs that [he] … bought in Europe on [his] Australian DVD player” and “‘licence agreements’ consisting of several thousand words of legalese gibberish”. I won’t attempt to cover all the topics he mentions (indeed, I feel unqualified to do so) but I do want to go back to the more narrow issue of rights and responsibilities in relation to (linguistic) research that I originally raised, and to highlight a sociological aspect that seems to have been missed in the discussion so far.

Yesterday I was fortunate to be invited to attend a seminar organised by the UK Arts and Humanities Research Council (AHRC) Beyond Text project on the topic “Beyond Copyright”. The seminar brought together 20 scholars and researchers from the UK and elsewhere in Europe representing a range of disciplines in arts and humanities, media, archiving, and intellectual property (IP) law. I found it an incredibly stimulating day where I learned a lot, especially about the complexities of IP, copyright, and the massive economic and social changes currently going on in this area. It also helped me to clarify my thinking about what it is that I was trying to present in my original blog post published on 4th April.

Let’s start with copyright. Yesterday’s presentation by Jeremy Silver, self-described “entrepreneur, digital media adviser and thought-leader”, set out some of the basic issues that are relevant to our current discussion. He explained that copyright exists to:

  1. reward creators, and
  2. provide consumer access

through control of:

  • reproduction
  • attribution
  • remuneration
  • integrity

using mechanisms for encoding and managing:

  1. primary rights (and moral rights)
  2. secondary exploitation rights
  3. licensing

There are tensions between these three aspects that get played out through the economic system — these tensions have been exacerbated by the rise of the internet and the current rapidly changing digital landscape.

I would like to clarify that my original blog post was about copyright as a means for dealing with primary rights (and moral rights) with a particular emphasis on reproduction, attribution and integrity. I was not writing about remuneration, as I made clear in my comment responding to James’ comment on my post.[1] I am not averse to PanLex reproducing Diyari materials and including them in their database[2]; what I object to is lack of attribution. I am also concerned that materials are being presented in an orthography that is 30 years old and does not represent the spelling that has been agreed with the Dieri Aboriginal Corporation for current use; if PanLex had consulted me, I would have suggested revising and updating the included data to the current spellings, thus improving the integrity of my and the Dieri community’s contribution to their work. (At the time I posted I also sent an email to Emily Bender who is a Board member of Utilika Foundation and she promised to take these issues up with its Director.) There may be similar problems with other data in the PanLex collection and I would encourage interested readers to check on their favourite languages from this perspective.[3]

Now for the sociology. Over the past several years, linguistics as a discipline has been undergoing a range of socio-cultural shifts, including changing emphases on the rôle of primary data collection and corpus curation, especially in terms of their relationship to linguistic analysis and theorisation. This is reflected in the impact of seminal works like Bird and Simons 2003[4], last year’s LSA motion (see also Jane Simpson’s report) on recognising the scholarly merit of language documentation, and the growth of digital language archives like ELAR at SOAS that now has terabytes of documentation materials available for registered users to browse and download (subject to the ELAR terms and conditions of use[5]). These changes and developments can be encouraged and protected so long as researchers feel that their work will be properly recognised and referenced, especially when employed by others in their own research and publications. Indeed, I have heard on numerous occasions expressions of reluctance by scholars unwilling to deposit their data and analyses in an archive because of a fear of being “ripped off”. Such fears are real and need to be addressed. One way of doing so is to publicly highlight instances of apparent abuse, as I did in the Prof Parker case and Nick Thieberger did in the Stolen Grammars case. As Nick pointed out:

” there has to be a mechanism for recognising creative effort, otherwise no-one will put their work online. ‘Stolen Grammars’ did not link to existing open access resources, but copied them without proper attribution. … Linguistic archives rely on the good faith of those signing agreements about how they will use data from the archive. Depositors have a right to trust that the material they deposit will not be misused.”

So, I believe that Nick Thieberger, David Nash, and I (and possibly others[6]) are being vigilant and following up on cases of ‘data duffers’[7] like these in a spirit of service to the field, aimed at ensuring that copyright and licensing agreements are respected, and that fears are allayed for existing and potential contributors/publishers about their rights being violated and materials being misused. I see it as a way to encourage more and better access to linguistic research, not what James calls “a greater sin” of “[u]sing copyright to stop or hinder other research projects”.

At yesterday’s workshop there were repeated calls for ‘creative thinking’ about IP and copyright. I applaud the President of the Utilika Foundation for setting aside “funds for legal services in the 2012 budget [that] reflects an assumption that intellectual-property issues, as well as contractual issues more generally, will likely become more complex as resource deployment progresses”[8] and I encourage him to engage in creative thinking about new mechanisms for licensing and attribution arrangements for the PanLex project, rather than complaining that “creators of many resources assert rights that, taken literally, would prohibit a person reading a resource from later making use of what he or she had learned from it”. I am confident that if the right environment can be established, I and others will happily contribute to this and other projects. It’s the way of the future.


  1. In his post, James says I “would presumably not want anyone to make money from it (except his publisher, Cambridge University Press)”. In fact, my 1981 book is now out of print and 20 years ago I asked CUP to cede to me the copyright on the book, which they did, so they currently have no rights in the work and remuneration is not an issue for them, or indeed, anyone.
  2. In fact I have contributed Sasak materials to the Austronesian Basic Vocabulary Database, which contribution is fully acknowledged in the relevant entry.
  3. Angela Terrill has commented to me via Facebook: “I had a quick look and found they have Lavukaleve data there, but in an orthography I have not seen. Don’t know where they got it from”.
  4. Bird, Steven and Gary Simons. 2003. Seven dimensions of portability for language documentation and description. Language 79(3): 557-582.
  5. written in 10 sentences of plain English, not “legalese gibberish”
  6. The ELAR Terms and Conditions of Use explicitly state: “I understand that ELAR may take legal action on behalf of owners of materials in the case of serious infringement of this agreement.”
  7. My use of “duffers” here is intended as a pun, relying on its polysemy in Australian English. For US readers, one sense, as in Australian “cattle duffers”, corresponds to your “cattle rustlers”.
  8. see here

Professor Austin and copyright

Peter Austin has raised his voice on this blog to ‘protect [his] legal rights and those of the Dieri people who have contributed to [his] knowledge of their language’ (source). He suggests that the PanLex project is guilty of ‘theft’ for using, without citation, data from a Dieri-English word list contained in his 1981 grammar of the Dieri language.[1] He also implies that the PanLex project should not use the data without his permission.[2]

I think Austin ought to clarify exactly what he believes he owns and how he would justify the claim that it has been stolen. There are two aspects to copyright: commercial rights and moral rights. Austin has indicated that he is not interested in receiving royalties for the use of his data (source), although he would presumably not want anyone to make money from it (except his publisher, Cambridge University Press). There seems to be little danger of that, anyway: although I am not particularly familiar with the PanLex project and its financial backers, the administering Utilika Foundation appears to be a non-profit organisation (source).

Austin seems to be more concerned with asserting his moral rights. Under the Berne Convention, the main international copyright treaty that specifically mentions the moral rights of authors, the author of a work has

…the right to claim authorship of the work and object to any distortion, mutilation or other modification of, or other derogatory action in relation to, the said work, which would be prejudicial to his honor or reputation.

The PanLex project seems to have denied Austin his right to be recognised as the author of the word list by not providing a citation to his grammar, in which it originally appeared. This lack of citation appears to be a genuine oversight, however. The project maintains an extensive list of resources they are using in compiling their database. Although Austin’s 1981 grammar is not on the list, his and David Nathan’s online Kamilaroi/Gamilaraay dictionary is (a fact that Austin does not mention in his blog post). There is the possibility that the PanLex project has acquired the Dieri data through a secondary source that is listed but which does not acknowledge Austin’s original work properly.

The PanLex project clearly does not claim to be the original collector of the data and is not out to appropriate other people’s data with no acknowledgement, which is what Austin implies in his blog post. However, the project’s referencing definitely leaves something to be desired. As to possible distortion of Austin’s work that could be ‘prejudicial to his honor or reputation’, that is a separate issue about which he has not yet expressed an opinion.

What exactly is Austin claiming ownership over? Presumably not the entire Dieri language, but just the data contained in his word list. But what exactly is the data represented in his list? Is it just the equivalencies he has established between the Dieri and English words? It should be noted that raw data is not covered by copyright, although a particular representation of it is. This is a fact that Austin is aware of.[3] The particular record of the equivalencies in his list is therefore protected by copyright. But we could also ask whether the list really is a creative work. A lot of effort certainly goes into acquiring and organising the knowledge required to produce such a list, but it could be considered merely sweat of the brow – that is, a work of diligence rather than creativity. If so, it would not be protected under US copyright law, but it would be under European copyright laws.[4]

Is Austin also claiming ownership over the actual words, as represented in his orthography? He uses the spelling of the Dieri word wadaŋaɲɟu to identify its origin in his book and the orthography is of his own devising. As a non-tangible system the orthography itself cannot be copyrighted, but perhaps particular instantiations of it could be. Should Noah Webster be cited every time someone writes the word ‘color’, because Webster was the first to propagate this spelling in his published work?

It should be pointed out that the PanLex project is not simply a copy of Austin’s 1981 word list. It is a new work that incorporates material from a large number of sources. It is what would be considered a ‘derivative work’. Under the copyright law of the UK, the jurisdiction where Austin’s book was published and where he now resides and works, permission does appear to be required to use copyrighted material in a derivative work,[5] although there are some possible exceptions where only excerpts are used. It could be argued that the Dieri words that appear in PanLex are excerpts from Austin’s book. They certainly do not represent a reproduction of the entire work. There may be no need to get permission in this case. But this is one for a judge to decide.

What are the potential implications of Austin’s assertion of ownership of the Dieri data? Since his publications contain a large amount of the Dieri linguistic data available outside the community, he could be seen as appointing himself as a gatekeeper to pretty much any non-primary research into the language. This is a point the President of the Utilika Foundation makes in the minutes of their 2011 AGM, which Austin cites as evidence of their ‘playing fast and loose’:

The creators of many resources assert rights that, taken literally, would prohibit a person reading a resource from later making use of what he or she had learned from it. From the beginning of the project, I have considered such usage prohibitions unenforceable, and I have considered our use of any resource to be the recording of facts asserted by it, in a novel form, not the creation of a copy of it and thus not copyright infringement.

Austin is not himself a fatcat publisher, movie studio or software company, but his wielding of the copyright bludgeon is reminiscent of their current practices. When we want to install software or sign up for an online service, we are confronted with ‘licence agreements’ consisting of several thousand words of legalese gibberish. We can’t go any further until we confirm that we have ‘read and agree’ to the terms.[6] I can’t play region-coded DVDs that I have bought in Europe on my Australian DVD player. In the US, the publisher HarperCollins recently moved to force libraries to limit e-books to being borrowed only 26 times (source). The list goes on.

Using copyright to stop or hinder other research projects is perhaps a greater sin, however, even if we might not agree with the aims of the project or are not impressed by the quality of their work. Such abuses of copyright stifle innovation and the advancement of knowledge. If it were not for restrictive copyright, the underlying data that went into producing the ‘Culturomics’ database could have been made available to users, which would perhaps have improved its usefulness.

Now to help us all calm down, perhaps we should hear the message in Sesame Street style from Nina Paley:

Image: Marcel Duchamp’s L.H.O.O.Q., a derivative work, from Wikipedia

Thanks to David Nash for reading this post and suggesting some improvements, mainly restraint – you should have seen the first draft! Of course, what I have said here cannot be taken to reflect his views.


  1. Austin, Peter. 1981. A grammar of Diyari, South Australia. Cambridge: Cambridge University Press.
  2. This is not the first time he has raised such concerns. He made similar complaints in a somewhat different case in an earlier blog post. Since this earlier case is not exactly parallel to the current one, my comments here cannot necessarily be taken to apply there.
  3. Austin comments: ‘The World Intellectual Property Organisation (WIPO) defines intellectual property as “creations of the mind: inventions, literary and artistic works, and symbols, names, images, and designs used in commerce”. Here “creations of the mind” refers to something that somebody created, and hence does not cover general knowledge like the meanings of words, or forms of a morphological paradigm (a particular definition, e.g., that found in a printed dictionary, would however be subject to intellectual property rights).’ Austin, Peter. 2010. Communities and Ethics in Language Documentation, in Austin, Peter, ed, Language Documentation and Description Volume 7, p.41.
  4. Wikipedia contains an article that addresses the main points.
  5. See the UK copyright service fact sheet P-22.
  6. And what do we really surrender in these agreements?

They’re out to get you (or your data at least)

A couple of years ago I wrote a blog post about Professor Phillip M Parker PhD, a Professor of Marketing in France who had established a website called Webster’s Online Dictionary that contained materials on endangered languages taken from copyrighted sources.[1] Parker also published a set of books based on materials taken from copyrighted websites, such as Webster’s Kamilaroi-English Thesaurus Dictionary.[2]

Well, it looks like someone else is also harvesting data on languages from copyrighted sources without attribution. This is the PanLex project funded by the Utilika Foundation that:

“gathers knowledge about all the words in all the languages of the world, so that any word may be translated into any language, a step toward panlingual communication. For this work we consult multilingual, bilingual, and monolingual resources named “dictionaries”, “thesauri”, “lexical databases”, “wordnets”, “glossaries”, “terminologies”, “vocabularies”, and “word lists”, as well as individuals.”

Although the website gives a list of “the resources we are now consulting”, a simple search using the TerraDict tool shows that in fact unlisted materials are also being used. I searched for “left” in the Dieri (Diyari) language (which I have worked on for the past 35 years) and got the following result (click image to enlarge it):

This can only have come from the vocabulary list in my 1981 book A Grammar of Diyari, South Australia (published by Cambridge University Press) because it is only in that book that I used the letter d for the trill sound — in later publications I used rrh. This word would now be spelled as warrangantyu in the orthography that the Dieri Aboriginal community prefers. QED, the Diyari material has been nicked without attribution from my copyrighted book.

Jonathan Pool, President of the Utilika Foundation, realises they are playing fast and loose here, as the following statements from the minutes of their 2011 Annual Meeting held just last month make clear (note the last sentence in particular):

“intellectual-property obstacles to the expansion of PanLex have not yet been a major problem. If they prevented us from using one resource, we could move on to the next. The creators of many resources assert rights that, taken literally, would prohibit a person reading a resource from later making use of what he or she had learned from it. From the beginning of the project, I have considered such usage prohibitions unenforceable, and I have considered our use of any resource to be the recording of facts asserted by it, in a novel form, not the creation of a copy of it and thus not copyright infringement. … I believe that our normalization, structuring, and selective use of published data, combined with our provision of links to the original data, will satisfy most content creators. However, the inclusion of funds for legal services in the 2012 budget reflects an assumption that intellectual-property issues, as well as contractual issues more generally, will likely become more complex as resource deployment progresses.”

Well, as far as I can see there is no “complex[ity]” surrounding “intellectual-property issues” here — the Diyari materials (and possibly lots more on lots more languages) are copyright and subject to fair dealing. Anything else is theft.

PS: Thanks to David Nathan for passing on pointers to the PanLex project, including the Annual Meeting minutes quoted here. He bears no responsibility for the content of this blog post.


  1. After much discussion (see the 19 comments on my post), Professor Parker appeared to take on board feedback from a number of researchers about apparent violations of intellectual property rights and moral rights, and wrote that he was planning to write an “Open Letter to Field Linguists” and a guide to “Copyrights and Moral Rights for Languages and their Translations” — neither has so far been published.
  2. This is now showing as “out of print” at but is still available as an e-Book from the Book Depository among other sources.