About an hour ago I presented the talk titled Linguistic Dumpster Diving: Geographical Classification of Arabic Text. I co-authored this paper with my colleagues at New Mexico State University, Ahmed, Jim, and Steve. I think the talk was well received, and I got a number of great comments and suggestions. Unfortunately, I don't know the names of all the people who made suggestions, so I can't credit them all by name.

In the talk, I primarily focused on a support vector machine approach to geographically classifying text. In passing I compared it to a naive Bayes approach, an approach based on character n-grams, and one based on word n-grams.

Patrick Juola suggested we look at other non-linear methods; he said he has had good luck with nearest-neighbor classifiers. That's a good suggestion and one we will look into.

Another person questioned how we knew the classifier was classifying geographically rather than picking up on the individual authors of the newspapers. That's another good point, and one I will look into. Some evidence that we are picking up geographical classes is that we were 87% accurate in categorizing English from India, the Philippines, and Singapore, and that English corpus represented a wide range of writers.

Someone asked if there were location names in the 1000-word common word list. I said probably, but I wasn't sure. (I need to check on that.) Another asked what the distinctive words were. I think my colleague Ahmed looked at this, but I need to check. Someone else asked how we can be sure that the people who posted in a forum were actually from that country. I said we were aware of this problem but had no way of knowing.

So, in sum, there is plenty of work to be done on this project.
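For readers curious what comparing these classifiers looks like in practice, here is a minimal sketch. It assumes scikit-learn, and the documents and labels are toy placeholders I made up for illustration — not our actual newspaper or forum corpora, and not the exact feature setup from the paper.

```python
# Sketch: comparing a linear SVM, naive Bayes, and a nearest-neighbor
# classifier on character n-gram features for region classification.
# Toy data only; the real corpora from the talk are not reproduced here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# Hypothetical placeholder documents, each labeled with a made-up region.
docs = [
    "the lorry was parked near the lift",
    "the truck was parked near the elevator",
    "kindly revert at the earliest",
    "please reply as soon as possible",
] * 5
labels = ["A", "B", "C", "D"] * 5

# Character n-grams (2-4) capture orthographic cues without needing
# word tokenization, which is one reason they are attractive for Arabic.
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = vec.fit_transform(docs)

for name, clf in [
    ("linear SVM", LinearSVC()),
    ("naive Bayes", MultinomialNB()),
    ("nearest neighbor", KNeighborsClassifier(n_neighbors=3)),
]:
    clf.fit(X, labels)
    acc = clf.score(X, labels)  # training accuracy on the toy data
    print(f"{name}: {acc:.2f}")
```

On real data you would of course hold out a test set (and, per the author-vs-geography question above, ideally split by author) rather than score on the training documents.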