My colleagues (Jim Cowie and Steve Helmreich of New Mexico State University) and I just submitted a paper titled “Language Preservation: A case study in collecting and digitizing machine-tractable language data” to the Chicago Colloquium. The abstract is:
In this paper we describe a process for collecting and digitizing machine-tractable resources for lesser-studied languages. We illustrate this process by using examples from the Paraguayan indigenous language Guarani, and Uighur, an Altaic Turkic language spoken in the Xinjiang province of China. By ‘machine-tractable’ we mean that in addition to being readable by people, the resource can also be processed by a computational tool. Our goal in acquiring these resources is to use them for quick ramp-up machine translation. These resources are also useful to scholars who are studying these particular languages.
In previous work we developed a complex web-based acquisition system, Boas, that guided linguistically-naive language informants through the process of acquiring descriptive knowledge about the parameter inventory for a particular language. For example, through a set of guided examples, morphological parameters such as number, gender, and case would be elicited from the informant. The system was designed to be used for any of the world’s languages and, as a result, the elicitation process was complex. Most of our language informants understandably grew tired of the process, and the quality of the resources we collected suffered. These experiences led us to a different methodology for our current collection efforts. Instead of guiding acquirers through an elicitation process designed to gather knowledge about parameters, we ask acquirers to construct basic resources for a language, including:
a monolingual text corpus of at least 250,000 words
a parallel bilingual (with English) text corpus of at least 250,000 words
a bilingual lexicon of at least 10,000 headwords or lemmas
a small manually-annotated part of speech tagged corpus
a small manually-annotated named entity tagged corpus (a corpus where the proper names are tagged with their classification)
a morphological analyzer
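The size targets in this list lend themselves to a simple automated check. The following is a minimal sketch, not the tool we actually use: the names `TARGETS`, `count_words`, and `unmet_targets` are hypothetical, and counting whitespace-separated tokens is an assumption that will not suit every language or script.

```python
# Minimum sizes taken from the resource list above. The dictionary keys are
# illustrative names, not identifiers from our actual system.
TARGETS = {
    "monolingual_words": 250_000,
    "parallel_words": 250_000,
    "lexicon_headwords": 10_000,
}


def count_words(lines):
    """Count whitespace-separated tokens (an assumed, naive tokenization)."""
    return sum(len(line.split()) for line in lines)


def unmet_targets(counts, targets=TARGETS):
    """Return {resource: shortfall} for every resource below its target."""
    return {
        name: minimum - counts.get(name, 0)
        for name, minimum in targets.items()
        if counts.get(name, 0) < minimum
    }
```

A resource that has not been started simply counts as zero, so `unmet_targets({})` reports the full target for each entry.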
Boas, as described above, was a web-based application and as a result had all the benefits of a modern web application. This had a number of direct advantages for language preservation projects. For example, acquirers could use the software from the nearest browser. It enabled the developers to fix defects promptly without having to distribute updates which users would need to install. Finally, it facilitated the central storage of linguistic information. However, a significant downside was that it required acquirers to have an adequate connection to the Internet, and this could be a considerable obstacle for acquirers living in remote areas.
In this paper we contrast our previous work, which focused on developing a sophisticated tool, with our current work, which embodies a number of principles of language acquisition including:
1. Pragmatic use of language acquirers. The creation of a team to carry out the acquisition task is often tricky. The languages we are considering are not normally supported by large translation agencies. In addition, language acquirers for many of the languages we have been working on are rare in the United States. There may be a population of refugees, but their linguistic expertise and their grasp of English may be poor. This, however, is not always true, and we have carried out acquisition efforts for Uighur and Chechen by finding a few people willing to undertake what is a very heavy workload. There may also be an occasional academic who has the language and linguistic skills needed but is usually too busy for the chore of actual acquisition. This type of expert, however, may be a good choice for validating some subset of the acquired corpora. At this point we need to start considering acquirers in the country of use. These acquirers are in their linguistic community, but pose unique problems in terms of training. In the case of Guarani, for example, we worked with Idelguap, the Instituto de la Lingüística Guaraní del Paraguay, where it was possible to visit and carry out acquisition training. Concepts such as corpora were completely unknown to our team of acquirers; they imagined a corpus to be some sort of word list. Other issues, such as the part-of-speech inventory, had to be agreed upon by the entire geographically dispersed team. It is necessary to identify one person who has a good grasp of the ideas after training and who can support the remainder of the team in the acquisition effort. Our development teams have usually emerged serendipitously and are a combination of computer-capable linguists and bilingual speakers.
2. Use a bridge language only if absolutely necessary. For many lesser-studied languages, it is difficult to find native speakers who are bilingual in their language and English. In such cases we use a bridge language. Speakers of Guarani are primarily bilingual in Spanish, and none are bilingual in English and Guarani. In cases such as this, we use Spanish as a “bridge language” both for within-project communication and for acquisition. We expect that this will be an increasingly common way of acquiring resources as the languages for which resources are being acquired become less and less well-known and are spoken by fewer and fewer people. The main problem is addressing ambiguity issues when developing lexical resources. Some of this ambiguity can be resolved from the corpora by a speaker of the bridge language and English, but often it involves long discussions between the primary developer and the secondary (to English) developer. In these cases Skype becomes an essential tool of the acquisition process.
3. PC-based application. We opted to replace the “run from the nearest browser” approach used in the Boas system with a mobile solution that enabled the acquirer to be nearly untethered from the web. This aspect was particularly important for our Guarani acquirers, as high-bandwidth Internet connections are not widespread in Paraguay. This also allowed the system to be taken into the field. At times convenient to the acquirers, they can connect to our web server and upload the resources they have collected to a centralized store. The quality of the resources is automatically checked during the upload process.
4. Favor applications and interfaces known to the acquirers. From the acquirer’s perspective, there is a strong preference for tools that run on the operating system they are most familiar with. In the case of our Guarani acquirers this was Windows 98, and for our Chechen acquirers this was Windows XP SP2. Our design criterion is that all our applications will run on the acquirers’ native operating system.
5. Keep things simple. The people in our lab (us included) have a passion for building computational tools. The Boas system mentioned above embodied a substantial amount of linguistic information in its acquisition tools and took over 12 person-years to develop. Echoing point 4, acquirers prefer familiar applications and interfaces. From the acquirer’s perspective, the best tool is invisible, allowing them to focus on the acquisition task. Our current acquisition effort makes use of common applications (when possible) that are familiar to the acquirers. For example, instead of a lexicon acquisition tool which has knowledge of inherent features and irregular inflectional paradigms, our current lexicon acquisition is done using a standard spreadsheet template with a handful of macros.
6. Preference for open source solutions. We want the acquisition effort to continue and thrive after our initial funding expires, with local language communities taking over the work, so we prefer to use and develop tools and language resources that are open source. All the tools and resources that we develop in-house are under a Creative Commons Attribution Non-Commercial license, which allows others to tweak and build upon our work.
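Points 3 and 5 mention lexicon entries collected in a spreadsheet and quality checks run at upload time. As an illustration only, here is a sketch of the kind of check one could run on a CSV export of such a spreadsheet. It is not a description of our actual checks: the column names `headword`, `pos`, and `gloss` and the function `check_lexicon` are assumptions made for the example.

```python
import csv
import io

REQUIRED_COLUMNS = ("headword", "pos", "gloss")  # assumed column names


def check_lexicon(csv_text):
    """Return a list of (line_number, message) problems found in a CSV export."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [(1, "missing columns: " + ", ".join(missing))]

    problems = []
    seen = {}  # (headword, pos) -> line number of first occurrence
    for row in reader:
        line_number = reader.line_num  # line in the file where this row ends
        # Short rows yield None for absent cells, so normalize to "".
        cells = {c: (row[c] or "").strip() for c in REQUIRED_COLUMNS}
        for column, value in cells.items():
            if not value:
                problems.append((line_number, f"empty {column}"))
        key = (cells["headword"].lower(), cells["pos"].lower())
        if key[0] and key in seen:
            problems.append((line_number, f"duplicate of line {seen[key]}"))
        else:
            seen.setdefault(key, line_number)
    return problems
```

Keeping the check to plain CSV means the acquirer never sees it: they keep working in the spreadsheet they know, and problems are reported back only when a file is uploaded.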
If you have any comments or suggestions please email me at the address in the footer.