At the Linguistic Society of America Summer Institute in Berkeley last week (17–19 July) the National Science Foundation sponsored Cyberling 2009, a workshop exploring how computational infrastructure (called “cyberinfrastructure” in the US, and e-Science or e-Humanities in the UK) can support linguistic research in a variety of fields. There was a panel discussion about data sharing that looked at the proposal:
“A cyberinfrastructure for linguistic data would allow unprecedented access [to] the empirical base of our field, but only if we collectively build that empirical base by contributing data. This panel addresses the benefits of data sharing and the obstacles to the widespread adoption of sharing practices, from the perspective of a variety of subfields”
But the bulk of the workshop was given over to closed discussion sessions by seven working groups looking at annotation standards, other standards, new multi-purpose software (so-called “killer apps”), data reliability and provenance, models from other fields, funding sources, and collaboration structure. The group discussions and resulting final day presentations are available on the Cyberling Wiki.
I was co-chair of Working Group 4 that was charged with discussing “protecting data reliability and provenance”, i.e. how to keep track of the creation of data and analysis and its passage through the electronic infrastructure as researchers access and use each other’s materials. As the Cyberling Wiki says, this is crucial
“for data creators (who need credit for the work they have done and the academic contribution of collecting, curating and annotating data) and the data users (who need to know where the data has come from so they can form an opinion of how much credence to give it and how to give proper credit to the originator of the data)”.
We also looked at how to establish a culture of data sharing and what mechanisms might be put in place to encourage people to share data. Clearly, for endangered language research where data are unique and fragile, these are very important issues.
After two and a half days of intense discussions our group came up with a set of proposals relating to data reliability and provenance that can be summarised as follows:
- Curated data as publication — the best way to ensure reliability and provenance would be to treat data that has been curated (selected, structured and analysed, with associated metadata) as a form of publication. The technology to do this is already available; however, to do it successfully there needs to be institutional and social engagement so that creators will receive recognition and credit, and users will properly use and cite other researchers’ materials.
- Handles — we need to set up a system of globally unique, persistent identifiers for entities (people, organisations, roles — similar to, but much broader than, the OpenID system for customer identification), documents (since URLs are volatile), and mashups and views (combinations of data, often generated on the fly from a range of sources, e.g. the forthcoming Rosetta Platform which will draw data about languages and speakers from Freebase, classification information from Ethnologue, and locations from Google Maps, using language codes and GPS references as the ‘glue’).
- Software as a Service — we need provision of software on the web to analyse, restructure and repurpose data, while keeping track of its provenance and reliability. Again, some of this service provision is currently available, but more would make collaboration and data sharing a real possibility.
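The handle proposal above can be sketched in code. This is a minimal illustration, not any real handle service: the class name, the `cyberling` prefix, and the in-memory registry are all assumptions made up for the example. The point it demonstrates is the separation between a permanent identifier and a volatile URL.

```python
import uuid

class HandleRegistry:
    """Toy registry for handle-style persistent identifiers (illustrative only)."""

    def __init__(self, prefix="cyberling"):
        self.prefix = prefix
        self._records = {}  # handle -> metadata dict

    def mint(self, kind, metadata):
        """Create a globally unique, persistent identifier for a person,
        document, or data view, and attach its metadata."""
        handle = f"{self.prefix}/{kind}/{uuid.uuid4().hex[:12]}"
        self._records[handle] = dict(metadata)
        return handle

    def resolve(self, handle):
        """Look up the current metadata (including location) for a handle."""
        return self._records.get(handle)

    def update_location(self, handle, url):
        """URLs are volatile: the location changes, the handle does not."""
        self._records[handle]["url"] = url

registry = HandleRegistry()
h = registry.mint("document", {"title": "Annotated texts",
                               "url": "http://old.example.org/texts"})
registry.update_location(h, "http://new.example.org/texts")
# Citations keep using `h`; only the registry's stored URL changed.
```

A real deployment would of course need a resolution service and governance behind it, which is exactly the institutional engagement the proposal calls for.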
Our group also proposed some first steps that could lead to more sharing and collaboration among linguists:
- proactive education to ensure that all linguists understand the value of data sharing (as well as the ways in which access and use can be controlled and proper citation ensured). In the case of endangered languages materials, there is the added importance of bringing out into the open materials that are unique and fragile
- mentors (e.g. PhD supervisors) should publish and share their data sets as models for the next generation
- websites should provide a “cite as” button with their data views so that proper referencing can be maintained — this could ideally be extended later using the new linguistic identifier (handle) system
- more service provision for data structure, integrity validation, and conversion — this already exists in some areas, e.g. ELAR at SOAS provides these services on a case-by-case basis to ELDP grantees
- publishers and editors could require provenance information when linguistic material, e.g. example sentences, is included in books and articles
- editors and funding agencies could encourage data sets to be published
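The “cite as” button in the list above is easy to picture concretely. The sketch below shows one way such a button might build its citation string from a data set’s metadata; the field names and the sample record are invented for illustration, and the persistent identifier is a placeholder of the kind the handle proposal would supply.

```python
def cite_as(metadata):
    """Build a human-readable citation string from data-set metadata.
    Field names here are assumptions, not a real schema."""
    authors = " & ".join(metadata["creators"])
    return (f"{authors} ({metadata['year']}). {metadata['title']} "
            f"[data set]. Retrieved from {metadata['handle']}")

record = {
    "creators": ["McGill, Stuart"],
    "year": 2009,
    "title": "Cicipu annotated texts",
    "handle": "hdl:cyberling/doc/abc123",  # hypothetical handle
}
print(cite_as(record))
# prints: McGill, Stuart (2009). Cicipu annotated texts [data set]. Retrieved from hdl:cyberling/doc/abc123
```

Because the citation is generated from the same metadata that accompanies the data, creators get credit in a consistent form without users having to guess at a reference.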
On the final point, publication of data sets currently exists in some areas of science, such as Earth System Science Data, which publishes articles on “the planning, instrumentation and execution of experiments or collection of data. Any interpretation of data is outside the scope of regular articles.” The first realisation of this in linguistics can be seen in the newly established on-line Journal of Experimental Linguistics which aims to publish:
“reproducible computational experiments on topics related to speech and language. These experiments may involve the analysis of previously published corpus data, or of experiment-specific data that is published for the occasion … In all cases, JEL articles will be accompanied by executable recipes for recreating all figures, tables, numbers and other results. These recipes will be in the form of source code that runs in some generally available computational environment” [emphasis mine, PKA]
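To make the idea of an “executable recipe” concrete, here is a toy example in that spirit: a self-contained script that regenerates a (wholly invented) word-frequency table from raw data, so that any reader can rerun the analysis and obtain the same numbers. Nothing here comes from an actual JEL article.

```python
from collections import Counter

# Invented sample data standing in for a published corpus extract.
raw_tokens = "na wa na ti wa na".split()

def frequency_table(tokens):
    """Recreate the word-frequency table a hypothetical paper reports,
    deterministically: sorted by descending count, then alphabetically."""
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

for word, n in frequency_table(raw_tokens):
    print(f"{word}\t{n}")
```

The recipe’s value is that the table in the article and the table the reader computes are produced by the same code from the same published data.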
Perhaps the time is ripe for this approach to be applied to endangered languages research, with full publication of media files and annotation sets. Some individual researchers are already doing this, e.g. Stuart McGill’s Cicipu texts website presents time-aligned, annotated and glossed texts that are available to other researchers to check the analyses presented in his recently submitted PhD thesis. However, the establishment of an edited journal that publishes endangered languages data could do much to promote collaborative research, and open up the field to replicability and testability of results in a way not seen so far on a large scale.