Ron Zacharski | Paper submitted to the Arabic Script-based Languages Workshop

Ahmed Abdelali, Steve Helmreich and I just submitted a paper to CAASL3: Computational Approaches to Arabic Script-based Languages, to be held in Ottawa on August 26th. It reports on work we have done on geographical classification of Arabic text. We presented a paper on this topic at the Chicago Colloquium on Digital Humanities and Computer Science back in November 2008 (Linguistic Dumpster Diving: Geographical Classification of Arabic Text, pdf). At that colloquium a number of people gave us good suggestions and criticisms. Our work since then has included investigating those suggestions and addressing the criticisms. For example, one individual suggested we look at non-linear methods of classification.

One thing we did was to compare learning algorithms on this task. In our original work we used a support vector machine approach. We compared that approach to C4.5 decision trees, Bagging C4.5, Hyperpipes, nearest neighbor, K-nearest neighbors, Naive Bayes, Neural Network classifiers, SMO with a polynomial kernel and SMO with an RBF kernel. Of these, SMO with a polynomial kernel, neural nets, and Bagging C4.5 appear to perform the best. In addition, we investigated the performance improvement from adding data from new sources.

We are continuing work in this area. If you have any questions or suggestions please let us know.