I just presented an invited paper “Don’t throw the analysis out with the bath water: Lessons learned from Modern Standard Arabic geographical classification” at the University of Kansas. The talk was sponsored by the departments of Linguistics and Slavic Languages. The abstract is as follows:
In corpus linguistics we throw out information. For example, in collecting corpora we necessarily omit some information about the extralinguistic context and only record that which we consider relevant for the purpose of our current research. In the analysis stage, we often remove data without thinking. One clear example of this is the routine practice of removing frequent words (commonly referred to as ‘stop words’) in a pre-processing step before analysis. In this talk I describe my work in Modern Standard Arabic geographical classification to illustrate the importance of being more mindful when we make these decisions about what to keep and what to discard. For example, I will show that it is possible to geographically classify text solely using words that some researchers have described as being fluff, superfluous, and non-significant. I will also describe how the paucity of metadata in commonly available Arabic corpora hampers research such as this.