I just presented a half-day session on data and text mining at the Digital Jumpstart Workshop at the University of Kansas. I am grateful to the co-directors of the Institute for Digital Research in the Humanities for inviting me to this event, which was open to KU faculty, staff, and graduate students. Links to the resources I covered at the workshop are available at Resources for the Digital Jumpstart Workshop.
Monthly Archives: March 2011
Kansas Corpus Linguistics Talk
I just presented an invited paper “Don’t throw the analysis out with the bath water: Lessons learned from Modern Standard Arabic geographical classification” at the University of Kansas. The talk was sponsored by the departments of Linguistics and Slavic Languages. The abstract is as follows:
In corpus linguistics we throw out information. For example, in collecting corpora we necessarily omit some information about the extralinguistic context and only record that which we consider relevant for the purpose of our current research. In the analysis stage, we often remove data without thinking. One clear example of this is the routine practice of removing frequent words (commonly referred to as ‘stop words’) in a pre-processing step before analysis. In this talk I describe my work in Modern Standard Arabic geographical classification to illustrate the importance of being more mindful when we make these decisions about what to keep and what to discard. For example, I will show that it is possible to geographically classify text solely using words that some researchers have described as being fluff, superfluous, and non-significant. I will also describe how the paucity of metadata of commonly available Arabic corpora hampers research such as this.
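
To make the contrast in the abstract concrete, here is a minimal sketch in Python of the two choices it describes: discarding stop words in pre-processing, and keeping only the stop words as features. It is not the method used in the talk. The stop-word list, the function names, and the cosine-similarity comparison are all assumptions made for illustration, and the list uses English words only to keep the example readable.

```python
from collections import Counter
import math

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = ["the", "of", "and", "in", "to", "that"]

def remove_stop_words(tokens):
    """The routine pre-processing step: drop the frequent words."""
    return [t for t in tokens if t not in STOP_WORDS]

def stop_word_profile(tokens):
    """The opposite choice: keep only stop words, as relative frequencies."""
    counts = Counter(t for t in tokens if t in STOP_WORDS)
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in STOP_WORDS]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def classify(tokens, labelled_profiles):
    """Return the label whose stop-word profile is most similar to the text's."""
    profile = stop_word_profile(tokens)
    return max(labelled_profiles,
               key=lambda label: cosine(profile, labelled_profiles[label]))
```

In this sketch, `labelled_profiles` would map each region label to a stop-word profile built from texts of known origin. A real experiment would need a proper Arabic tokenizer, a much larger word list, and a held-out test set.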