I just presented a half-day session on data and text mining at the Digital Jumpstart Workshop at the University of Kansas. I am grateful to the co-directors of the Institute for Digital Research in the Humanities for inviting me to this event, which was open to KU faculty, staff, and graduate students. Links to the resources I covered at the workshop are available at Resources for the Digital Jumpstart Workshop.
I just presented an invited paper “Don’t throw the analysis out with the bath water: Lessons learned from Modern Standard Arabic geographical classification” at the University of Kansas. The talk was sponsored by the departments of Linguistics and Slavic Languages. The abstract is as follows:
In corpus linguistics we throw out information. For example, in collecting corpora we necessarily omit some information about the extralinguistic context and only record that which we consider relevant for the purpose of our current research. In the analysis stage, we often remove data without thinking. One clear example of this is the routine practice of removing frequent words (commonly referred to as ‘stop words’) in a pre-processing step before analysis. In this talk I describe my work in Modern Standard Arabic geographical classification to illustrate the importance of being more mindful when we make these decisions about what to keep and what to discard. For example, I will show that it is possible to geographically classify text solely using words that some researchers have described as being fluff, superfluous, and non-significant. I will also describe how the paucity of metadata of commonly available Arabic corpora hampers research such as this.
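To make the idea of classifying with only ‘stop words’ concrete, here is a minimal Python sketch. It is not the method used in the talk: the word list, the function names (`profile`, `cosine`, `classify`), and the similarity measure are all assumptions chosen for illustration, and the sample words are English placeholders rather than Arabic data.

```python
from collections import Counter
import math

# Hypothetical function-word list (an assumption for illustration);
# a real study would use a curated list for the language in question.
FUNCTION_WORDS = ["the", "of", "and", "in", "to", "that"]

def profile(text):
    """Return the relative frequency of each function word in the text.

    Every token that is not in FUNCTION_WORDS is ignored, so the
    profile is built from 'stop words' alone.
    """
    tokens = text.lower().split()
    counts = Counter(t for t in tokens if t in FUNCTION_WORDS)
    total = sum(counts.values()) or 1  # avoid division by zero
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def classify(text, labeled_texts):
    """Return the label whose function-word profile is closest to the text's.

    labeled_texts maps a label (for example, a region name) to a
    sample of text from that source.
    """
    target = profile(text)
    return max(
        labeled_texts,
        key=lambda label: cosine(target, profile(labeled_texts[label])),
    )
```

Calling `classify(new_text, {"region A": sample_a, "region B": sample_b})` returns the label whose sample uses the function words in the most similar proportions. A serious experiment would need far more data, a proper tokenizer, and a held-out evaluation.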
I just attended, as an invited participant, the Maryland Institute for Technology in the Humanities’ API Workshop, which was held on February 25th and 26th. The workshop alternated between presentations, lightning talks, and what the organizers called ‘unconferencing’. The highlights for me were the talks given by Mano Marks on the Google Maps API, Google Fusion Tables, and Google Refine. I’ve spent more time than I care to remember cleaning up language data. For example, I spent weeks cleaning up a Guarani lexicon. Google Refine is a tool that helps automate that process. If I had used it for the Guarani lexicon, I would have been done in a day! Google Fusion Tables is an amazingly easy way to create map mashups using just a spreadsheet. The maps you create can be embedded on your web page.
I just got back from attending the Chicago Colloquium for Digital Humanities and Computer Science (November 21-22). I presented the paper “Language Preservation: A case study in collecting and digitizing machine-tractable language data.” The paper was about work I have done with Jim Cowie and Steve Helmreich of New Mexico State University on our efforts to collect resources for lesser-studied languages. It reported on work we have done on the Paraguayan indigenous language Guarani, and Uighur, an Altaic Turkic language spoken in the Xinjiang province of China.
I was an invited participant at THATCamp Chicago (The Humanities and Technology Camp), “a user-generated unconference where humanists and technologists work together for the common good”, which was held on November 20th. I participated in a number of great sessions. Of particular interest to me was the GeoTools/GIS session. Jo Guldi, a historian at Harvard, was interested in what she calls ‘geo-parsing’: identifying place names in text. She is interested in detecting subaltern agency in Britain by analyzing books published between 1848 and 1919. It sounds like a fun named entity extraction task and I volunteered to help her. I also attended sessions on Git and XML/TEI.
Jeanette Gundel, Nancy Hedberg, and I just finished a revision of our paper, Underspecification of Cognitive Status in Reference Production: Some Empirical Predictions, and resubmitted it to the journal Topics in Cognitive Science.
My colleagues (Jim Cowie and Steve Helmreich of New Mexico State University) and I just submitted a paper titled “Language Preservation: A case study in collecting and digitizing machine-tractable language data” to the Chicago Colloquium. The abstract is:
In this paper we describe a process for collecting and digitizing machine-tractable resources for lesser-studied languages. We illustrate this process by using examples from the Paraguayan indigenous language Guarani, and Uighur, an Altaic Turkic language spoken in the Xinjiang province of China. By ‘machine-tractable’ we mean that in addition to being readable by people, the resource can also be processed by a computational tool. Our goal in acquiring these resources is to use them for quick ramp-up machine translation. These resources are also useful to scholars who are studying these particular languages.
Jeanette Gundel (University of Minnesota), Nancy Hedberg (Simon Fraser University) and I just had our paper, Underspecification of Cognitive Status in Reference Production: Some Empirical Predictions, accepted for publication in the Cognitive Science Society journal, Topics in Cognitive Science. To quote Nancy: “Hallelujia!!! … I am ecstatic!!!” That mirrors my feelings. I am grateful to the reviewers for their wonderful comments. Now there is a moderate amount of work to do to address the reviewers’ comments. Here is the abstract.
For over 20 years I have been collaborating with Jeanette Gundel and Nancy Hedberg on research focusing on referring expressions. As part of this research we propose something we term the Givenness Hierarchy, a set of cognitive statuses that are on an implicational scale. Oddly enough, this research is mentioned in the just-published novel Starting from Scratch by Susan Gilbert-Collins.
Jeanette Gundel (University of Minnesota), Nancy Hedberg (Simon Fraser University) and I just submitted a paper to the new journal topiCS (Topics in Cognitive Science). This pretty much consumed my entire spring break. The title of the paper is Underspecification of Cognitive Status in Reference Production: Some Empirical Predictions. Here’s the abstract.