Over the Christmas break I have been looking at words in Standard Arabic that are more common in one region than in another. This is a continuation of work I have been doing with Ahmed Abdelali and Steve Helmreich. Ahmed has collected a corpus of Standard Arabic texts from newspapers in Egypt, Sudan, Libya, Syria, and the UK. In previous work we looked at distinguishing texts from different regions using the frequency of common words (the equivalent of common English words such as at, on, and in). Over the break I was instead looking for differences in the frequency of content words (similar to Amazon's 'statistically improbable phrases'), that is, words that occur in a region's texts more frequently than you would expect by chance. I used two statistics, log likelihood and mutual information. Work by Ted Dunning suggests that log likelihood works better for statistically rare events than mutual information does. Currently I am not sure what to make of the results, but here are the top six 'statistically improbable' words from each region (using log likelihood):
Sudan المقاولون الساحل برانكو اعداده الموردة استعدادا
Egypt المقاولون الساحل برانكو الموردة التضامن اعداده
UK المقاولون الساحل برانكو اعداده التضامن الامل
Libya مدني الساحل استعدادا الامل الاولمبي حليم
Syria بشار البوكمال المادة النادي الفنان الشاعر
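For readers who want to try the two statistics mentioned above, here is a minimal sketch in Python. It is not the code used to produce the lists. It assumes each region's word counts are compared against a reference corpus (for example, the other regions pooled) in a 2x2 contingency table, which the description above does not spell out. The function names log_likelihood, pmi, and top_words are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood ratio (G2) for a 2x2 table.

    a: count of the word in the target corpus
    b: count of the word in the reference corpus
    c: total tokens in the target corpus
    d: total tokens in the reference corpus
    """
    observed = [[a, c - a], [b, d - b]]
    n = c + d
    row_totals = [c, d]
    col_totals = [a + b, n - a - b]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            e = row_totals[i] * col_totals[j] / n
            if o > 0:  # a zero cell contributes nothing to the sum
                g2 += o * math.log(o / e)
    return 2.0 * g2


def pmi(a, b, c, d):
    """Pointwise mutual information (base 2) between word and target corpus."""
    if a == 0:
        return float("-inf")
    return math.log2(a * (c + d) / ((a + b) * c))


def top_words(target_tokens, reference_tokens, k=5):
    """Rank words over-represented in the target corpus by G2 (assumed setup)."""
    target = Counter(target_tokens)
    reference = Counter(reference_tokens)
    c, d = sum(target.values()), sum(reference.values())
    scored = []
    for word, a in target.items():
        b = reference[word]
        # G2 is symmetric, so keep only words that are relatively
        # more frequent in the target than in the reference.
        if a / c > b / d:
            scored.append((log_likelihood(a, b, c, d), word))
    scored.sort(reverse=True)
    return [word for _, word in scored[:k]]
```

One way to see the point about rare events: in this sketch, a word seen once in a 100-token target and never in a 100-token reference gets the same mutual information score as a word seen ten times under the same conditions, while the log likelihood score is larger for the word with more evidence behind it.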