Jeanette Gundel presented a talk, “Underspecification of cognitive status in reference production: the grammar-pragmatics interface,” at the Workshop on Bridging the Gap between Computational, Empirical, and Theoretical Approaches to Reference at the Annual Cognitive Science Meeting in Boston. The talk was about our work on the Givenness Hierarchy.
There are two common indices of scientific productivity: the h-index and the g-index. I have gone up in both. My h-index is now 13, which means that I have 13 articles each of which has been cited at least 13 times. The Wikipedia entry for the g-index says that “a value for h of about 12 might be typical for advancement to tenure (associate professor) at major research universities.” To put that number in perspective, Sebastian Thrun, a robotics professor at Stanford and developer of the Google Self-Driving Car, has an h-index of 95. So while I am very happy with my 13, it isn’t that great in the scheme of things. My g-index is now 39: I have 39 publications whose combined citations are at least 39 squared. My Erdős number is 3.
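As a sketch of how the two indices are defined, both can be computed from a list of per-paper citation counts. This is an illustrative implementation, not anything from the post itself; note that this simple version caps the g-index at the number of papers, which some variants of the definition do not.

```python
def h_index(citations):
    """Largest h such that h papers each have at least h citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

def g_index(citations):
    """Largest g such that the top g papers together have at least g**2 citations."""
    ranked = sorted(citations, reverse=True)
    total = 0
    g = 0
    for rank, count in enumerate(ranked, start=1):
        total += count
        if total >= rank * rank:
            g = rank
        else:
            break
    return g

# Example: five papers with citation counts 10, 8, 5, 4, 3.
print(h_index([10, 8, 5, 4, 3]))  # 4 papers have at least 4 citations each
print(g_index([10, 8, 5, 4, 3]))  # top 5 papers have 30 >= 25 citations combined
```

Because cumulative citations grow faster than per-paper minimums, the g-index is always at least as large as the h-index for the same citation record, which is why the two numbers (13 and 39 above) can differ so much.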
Jeanette Gundel, Nancy Hedberg, and I finally got our paper, Underspecification of Cognitive Status in Reference Production: Some Empirical Predictions, accepted to Topics in Cognitive Science, a journal of the Cognitive Science Society. It will appear in the special issue on “Production of Referring Expressions: Bridging the Gap between Computational and Empirical Approaches to Reference.” Within the Givenness Hierarchy framework we outlined in our 1993 paper, lexical items included in referring forms are assumed to conventionally encode two kinds of information: conceptual information about the speaker’s intended referent and procedural information about the assumed cognitive status of that referent in the mind of the addressee. In the current paper we explore the role of underspecification of cognitive status in reference processing. We show how this framework accounts for a number of experimental results in the literature.
I just presented a half-day session on data and text mining at the Digital Jumpstart Workshop at the University of Kansas. I am grateful to the co-directors of the Institute for Digital Research in the Humanities for inviting me to this event, which was open to KU faculty, staff, and graduate students. Links to the resources I covered at the workshop are available at Resources for the Digital Jumpstart Workshop.
I just presented an invited paper “Don’t throw the analysis out with the bath water: Lessons learned from Modern Standard Arabic geographical classification” at the University of Kansas. The talk was sponsored by the departments of Linguistics and Slavic Languages. The abstract is as follows:
In corpus linguistics we throw out information. For example, in collecting corpora we necessarily omit some information about the extralinguistic context and only record that which we consider relevant for the purpose of our current research. In the analysis stage, we often remove data without much thought. One clear example of this is the routine practice of removing frequent words (commonly referred to as ‘stop words’) in a pre-processing step before analysis. In this talk I describe my work in Modern Standard Arabic geographical classification to illustrate the importance of being more mindful when we make these decisions about what to keep and what to discard. For example, I will show that it is possible to geographically classify text solely using words that some researchers have described as being fluff, superfluous, and non-significant. I will also describe how the paucity of metadata in commonly available Arabic corpora hampers research such as this.
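The routine stop-word removal step mentioned in the abstract can be sketched in a few lines. The word list here is a tiny illustrative English one (real pipelines use much larger, language-specific lists, e.g. for Arabic); the point of the talk is precisely that filtering on such a list silently discards words that may carry signal.

```python
# A tiny illustrative stop-word list; real lists are far larger and
# language-specific. Filtering on it discards these words entirely.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "is", "to"}

def remove_stop_words(tokens):
    """Drop tokens found in STOP_WORDS (case-insensitive comparison)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "the cat sat on the mat".split()
print(remove_stop_words(tokens))  # ['cat', 'sat', 'mat']
```

The discarded tokens are exactly the high-frequency function words; the talk argues these can themselves be strong features for geographical classification, so throwing them away in pre-processing is not a neutral choice.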
I just attended, as an invited participant, the Maryland Institute for Technology in the Humanities’ API Workshop, which was held on February 25th and 26th. The workshop alternated between presentations, lightning talks, and what the organizers called ‘unconferencing’. The highlights for me were the talks given by Mano Marks on Google’s Maps API, Google Fusion Tables, and Google Refine. I’ve spent more time than I care to remember cleaning up language data. For example, I spent weeks cleaning up a Guarani lexicon. Google Refine is a tool that helps automate that process. If I had used it for the Guarani lexicon, I would have been done in a day! Google Fusion Tables is an amazingly easy way to create map mashups using just a spreadsheet. The maps you create can be embedded in your web page.
I just got back from attending the Chicago Colloquium for Digital Humanities and Computer Science (November 21–22). I presented the paper “Language Preservation: A case study in collecting and digitizing machine-tractable language data.” The paper was about work I have done with Jim Cowie and Steve Helmreich of New Mexico State University on our efforts to collect resources for lesser-studied languages. It reported on our work on the Paraguayan indigenous language Guarani, and on Uighur, an Altaic Turkic language spoken in the Xinjiang province of China.
I was an invited participant at THATCamp Chicago (The Humanities and Technology Camp), “a user-generated unconference where humanists and technologists work together for the common good,” which was held on November 20th. I participated in a number of great sessions. Of particular interest to me was the GeoTools/GIS session. Jo Guldi, a historian at Harvard, was interested in what she calls ‘geo-parsing’: identifying place names in text. She is interested in detecting subaltern agency in Britain by analyzing books published between 1848 and 1919. It sounds like a fun named entity extraction task and I volunteered to help her. I also attended sessions on Git and XML/TEI.
Jeanette Gundel, Nancy Hedberg, and I just finished a revision of our paper, Underspecification of Cognitive Status in Reference Production: Some Empirical Predictions, and resubmitted it to the journal Topics in Cognitive Science.
My colleagues (Jim Cowie and Steve Helmreich of New Mexico State University) and I just submitted a paper titled “Language Preservation: A case study in collecting and digitizing machine-tractable language data” to the Chicago Colloquium. The abstract is:
In this paper we describe a process for collecting and digitizing machine-tractable resources for lesser-studied languages. We illustrate this process with examples from the Paraguayan indigenous language Guarani, and from Uighur, an Altaic Turkic language spoken in the Xinjiang province of China. By ‘machine-tractable’ we mean that in addition to being readable by people, the resource can also be processed by a computational tool. Our goal in acquiring these resources is to use them for quick ramp-up machine translation. These resources are also useful to scholars who are studying these particular languages.