I am grateful that the paper Ahmed Abdelali, Steve Helmreich, and I worked on was accepted at the Computational Approaches to Arabic Script-based Languages workshop, to be held August 26th in Ottawa (workshop program). I would also like to thank the three reviewers for their helpful comments. Here is the conclusion.
Our work focused on answering two questions: (1) can we geographically classify documents based solely on the frequency of common words, and (2) rather than dialects, can we classify regional variations within one dialect (for example, can we classify regional differences in Modern Standard Arabic)? We developed a series of studies aimed at answering these questions. These studies showed that it is possible to accurately classify newspaper documents using only the common words in the documents. One study compared the performance of 10 classifiers on this task and provided some evidence that Bagging C4.5, C4.5, and SMO with a polynomial kernel produce the most accurate classifiers. One major limitation of these studies is that they relied on a single data source for each country. Because a single newspaper source was used for each region, it could be argued that the classifiers were classifying the documents based on the newspaper rather than on geographical region. To examine this possibility, we evaluated the performance of the classifier on a different genre: forum posts. The results here are less than compelling; nevertheless, the classifier had moderate accuracy on classifying forum posts. We will examine this in more detail in future work using a larger corpus drawn from a wider range of sources. Finally, we examined the effect of document size on classification accuracy and found that we could get good classification accuracy even for relatively short documents. These studies suggest that the answer to both questions raised at the beginning of this paragraph is yes: we can geographically classify documents based on common word frequency, and we can classify regional differences in Modern Standard Arabic.
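To make the idea of classifying on common-word frequency concrete, here is a minimal sketch in Python. It is not the method used in the paper, which compared classifiers such as C4.5 and SMO; it swaps in a simple nearest-centroid rule to keep the example short. The word list, the region labels, and the toy English sentences are all hypothetical placeholders, and a real experiment would use Arabic common words and newspaper text.

```python
import math
from collections import Counter

# Hypothetical stand-in for a list of common words; the paper's list is not reproduced here.
COMMON_WORDS = ["the", "of", "and", "in", "to", "on"]


def features(text):
    """Return the relative frequency of each common word in the text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in COMMON_WORDS]


def train(labeled_docs):
    """Average the feature vectors of each region into one centroid per region."""
    sums = {}
    doc_counts = Counter()
    for region, text in labeled_docs:
        acc = sums.setdefault(region, [0.0] * len(COMMON_WORDS))
        for i, value in enumerate(features(text)):
            acc[i] += value
        doc_counts[region] += 1
    return {r: [v / doc_counts[r] for v in acc] for r, acc in sums.items()}


def classify(centroids, text):
    """Assign the text to the region whose centroid is closest (Euclidean distance)."""
    vec = features(text)
    return min(centroids, key=lambda r: math.dist(vec, centroids[r]))


# Toy training data with made-up region labels (assumption, for illustration only).
docs = [
    ("region_a", "the cat sat on the mat and the dog sat on the rug"),
    ("region_a", "the bird flew on the wind and the cloud"),
    ("region_b", "a trip to town in spring to shop in a market of note"),
    ("region_b", "out to sea in a boat of oak to fish in peace"),
]
centroids = train(docs)
print(classify(centroids, "the fox ran on the hill and the field"))
```

The only signal the sketch uses is how often each common word appears relative to document length, which is the kind of feature the conclusion describes; content words such as place names play no role.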
This work has direct practical application to intelligence tasks. It may help in determining the author of an anonymous document. For example, a geographical classifier can be used as one module of a system designed to detect cyber terrorist threats against the U.S. by aiding in the identification of the source of the threat. Finally, many Arabic scholars (Shukri B. Abed, p.c.) believe there are no regional variations of Modern Standard Arabic. The work reported here provides some support for the alternative view that there are regional variations (see, for example, Ibrahim and Ibrahim, 2009, and Abdelali, 2004). Future work using larger corpora from a broader range of sources may provide stronger evidence for this position. (PDF of the draft paper)