Over the Christmas break I have been looking at words in Standard Arabic that are more common in one region compared to another. This is a continuation of work I have been doing with Ahmed Abdelali and Steve Helmreich. Ahmed has collected a corpus of Standard Arabic texts from newspapers in Egypt, Sudan, Libya, Syria, and the UK. In previous work we looked at distinguishing texts from different regions using the frequency of common words (the equivalent of common English words such as at, on, and in). In this work, I was looking for differences in the frequency of content words (similar to Amazon's 'statistically improbable phrases'): words that occur in a text more frequently than you would expect by chance. Continue reading
News
Delivered paper at the Computational Approaches to Arabic Script-based Languages workshop
I presented the paper Investigations on standard Arabic geographical classification at the Computational Approaches to Arabic Script-based Languages workshop. Immediately before my talk, I convinced myself that the paper was not related to the conference topic and that it was simplistic. However, it seems that it was well received. One of the conference organizers, Ali Farghaly, said it was important work, which is nice to hear. I received perhaps a half dozen positive comments from people and I am extremely grateful for their kind words. Several people offered great suggestions for future work, most of them related to trying to identify content words that may help in the geographical classification. Several people suggested using Term Frequency Inverse Document Frequency (TFIDF). Prior to the workshop I had been thinking of using log likelihood or mutual information for a similar purpose, and several people suggested related approaches. I am thankful that people took the time to offer these suggestions.
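Since these suggestions all amount to some form of keyness measure, here is a minimal sketch of the log-likelihood approach I had in mind (Dunning's log-likelihood ratio), ranking words that are over-represented in one regional corpus compared to the rest. This is illustration only; the variable names at the bottom are hypothetical, and this is not the code we actually ran.

import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """a = word count in corpus 1, b = word count in corpus 2,
    c = total tokens in corpus 1, d = total tokens in corpus 2."""
    e1 = c * (a + b) / (c + d)   # expected count in corpus 1
    e2 = d * (a + b) / (c + d)   # expected count in corpus 2
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(corpus1_tokens, corpus2_tokens, top_n=20):
    """Rank words by how strongly they are over-represented in corpus1."""
    f1, f2 = Counter(corpus1_tokens), Counter(corpus2_tokens)
    n1, n2 = len(corpus1_tokens), len(corpus2_tokens)
    scores = {}
    for word, a in f1.items():
        b = f2.get(word, 0)
        if a / n1 > (a + b) / (n1 + n2):   # keep only over-represented words
            scores[word] = log_likelihood(a, b, n1, n2)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Hypothetical usage: tokens from the Egyptian newspaper vs. tokens from all the others
# print(keywords(egypt_tokens, other_tokens))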
Paper accepted to Arabic Script Languages Workshop
I am grateful that the paper Ahmed Abdelali, Steve Helmreich, and I worked on was accepted at the Computational Approaches to Arabic Script-based Languages workshop to be held August 26th in Ottawa (workshop program). I would also like to thank the three reviewers for their helpful comments. Here is the conclusion. Continue reading
Paper submitted to the Arabic Script-based Languages Workshop
Ahmed Abdelali, Steve Helmreich, and I just submitted a paper to CAASL3: Computational Approaches to Arabic Script-based Languages, to be held in Ottawa on August 26th. It reports on work we have done on geographical classification of Arabic text. We presented a paper on this topic at the Chicago Colloquium on Digital Humanities and Computer Science back in November 2008 (Linguistic Dumpster Diving: Geographical Classification of Arabic Text – pdf). At that colloquium a number of people gave us good suggestions and criticisms. Our work since then has included investigating those suggestions and addressing the criticisms. For example, one individual suggested we look at non-linear methods of classification. Continue reading
Paper submitted to the Machine Translation Summit
As I mentioned in previous posts, I developed (with tremendous help from Adam Zacharski) a cross-language instant messaging system using Adobe Flex. This system provides concurrent real-time translation for instant messaging using multiple machine translation engines. During this last academic year, Bill Ogden, my colleague in New Mexico, and several people in his lab (Sieun An and Yuki Ishikawa) used this system to evaluate the performance of machine translation systems based on how effective they were in helping people accomplish shared tasks. They used paid participants who worked in pairs (one Japanese speaker paired with a native English speaker) to accomplish a photo identification task using this instant messaging system. We just submitted a paper describing the results of this work to the Machine Translation Summit in Ottawa in August.
Amazing Remix
Okay. This is my first YouTube post. What this guy did was take individual performers on YouTube (many of them from instructional videos) and remix them into a band. Truly amazing!
Playing with the Stanford Log-linear Part-Of-Speech Tagger
I would like to create a part-of-speech tagger for Paraguayan Guarani. Initially I thought I would use the Brill part-of-speech tagger, but it seems to have vanished from the web. In my search, I ran across the Stanford Log-Linear Part-Of-Speech Tagger. It was developed by Chris Manning's group, and I figured anything developed by Chris Manning is probably exceptional. I downloaded it and ran the included English part-of-speech tagger on a 250k text (a public domain Tom Swift book). It took about half an hour on a newish Core Duo machine. Training a part-of-speech tagger is a bit more complex, mainly because of the lack of documentation. First you need a tagged corpus. There is some variability allowed in how this text is formatted. I simply used a text file where each word-tag pair is represented as word_tag. For example,
The_DT old_JJ Foger_NNP homestead_NN is_VBZ closed_VBN up_RP ,_, though_IN I_PRP did_VBD see_VB a_DT man_NN working_VBG around_IN it_PRP to-day_JJ as_IN I_PRP came_VBD past_NN ._.
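If your corpus starts out in a different format, a small script can produce the word_tag lines. Here is a minimal sketch that assumes the source corpus has one word<TAB>tag pair per line with a blank line between sentences; that input format is just an assumption for illustration, not something the tagger requires.

def convert(in_path, out_path):
    """Convert a word<TAB>tag-per-line corpus into one sentence per line of word_tag tokens."""
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        sentence = []
        for line in fin:
            line = line.strip()
            if not line:                      # a blank line ends the sentence
                if sentence:
                    fout.write(" ".join(sentence) + "\n")
                    sentence = []
                continue
            word, tag = line.split("\t")
            sentence.append(word + "_" + tag)
        if sentence:                          # flush the last sentence
            fout.write(" ".join(sentence) + "\n")

# Hypothetical usage:
# convert("guarani-tab.txt", "guarani.txt")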
In addition to this text file you need a props file (basically a configuration file). There is a sample configuration file in the models folder of the Stanford download; you need to edit it to match your local settings. Finally, you will need to increase the amount of memory allocated to Java.
The command I used that actually generated a part-of-speech tagger is:
java -mx500m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model guaranimodel -trainFile guarani.txt -prop models\mymodel.props
Once I play around with this more I will post how well this works.
Textbooks for data mining
I finally made a decision regarding which textbook to use for a data mining course I will be teaching in the spring. One challenge is that the course is cross-listed in a variety of departments: computer science, business, and information technology. As a result, the students taking the class will have diverse backgrounds: some strong in statistics, others in programming. My original plan was to have people do no programming at all and just use Weka, a free data mining tool. I was considering two textbooks: Introduction to Data Mining by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar; and Data Mining: Practical Machine Learning Tools and Techniques, by Ian Witten and Eibe Frank. Continue reading
Delivered presentation at the Chicago Digital Humanities Conference
About an hour ago I presented the talk titled Linguistic Dumpster Diving: Geographical Classification of Arabic Text. I co-authored this paper with my colleagues at New Mexico State University, Ahmed, Jim, and Steve. I think the talk was well received, and I got a number of great comments and suggestions. Unfortunately, I don't know the names of all the people who made suggestions, so I can't credit them all by name. In the talk, I primarily focused on a support vector machine approach to geographically classifying text. In passing I compared it to a naive Bayes approach, an approach based on character ngrams, and one based on word ngrams. Patrick Juola suggested we look at other non-linear methods; he said he has had good luck with nearest neighbor classifiers. This is a good suggestion and one we will look into. Another person questioned how we knew the classifier was classifying geographically rather than picking up on the individual authors of the newspapers. That is another good point and one I will look into. Some evidence that we are picking up geographical classes is that we were 87% accurate in categorizing English from India, the Philippines, and Singapore, and that English corpus represented a wide range of writers. Someone asked if there were location names in the 1000-word common word list. I said probably, but I wasn't sure. (I need to check on that.) Another asked what the distinctive words were. I think my colleague Ahmed looked at this, but I need to check. Someone asked how we can be sure that the people who posted in a forum were actually from that country. I said we were aware of this problem but had no way of knowing. So, in sum, there is plenty of work to be done on this project.
Linguistic Dumpster Diving
I am grateful to have the paper I wrote with my colleagues accepted at the Chicago Colloquium on Digital Humanities and Computer Science to be held Nov 1-3. Here’s the abstract.
In many text analysis tasks it is common to remove frequently occurring words as part of the pre-processing step prior to analysis. Frequent words are removed for two reasons: first, because they are unlikely to contribute in any meaningful way to the results; and, second, removing them can greatly reduce the amount of computation required for the analysis task. In the literature on information retrieval and text classification, such words have been called noise in the system, fluff words, and non-significant words. While the removal of frequent words is correct for many text analysis tasks, it is not correct for all tasks. There are many analysis tasks where frequent words play a crucial role. To cite just one example, Mosteller and Wallace in their seminal book on stylometrics noted that the frequencies of various function words could distinguish the writings of Alexander Hamilton and James Madison. We use a similar frequent word technique to geographically classify Arabic news stories. In representing a document, we throw away all content words and retain only the most frequent words. In this way, we represent each document by a vector of common word frequencies. In our study we used a collection of 4,167 Arabic documents from 5 newspapers (representing Egypt, Sudan, Libya, Syria, and the U.K.). We then train on this data using a sequential minimal optimization algorithm, and evaluate the approach using 10-fold cross-validation. Depending on the number of frequent words, results range from 92% classification accuracy to 99.8%.
Linguistic Dumpster Diving: Geographical Classification of Arabic Text Using Words People Commonly Throw Away. Ron Zacharski, Ahmed Abdelali, Stephen Helmreich, and James Cowie
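To make the pipeline in the abstract concrete, here is a minimal sketch of the general approach: represent each document by the frequencies of the most common words only, then train a linear support vector machine and evaluate it with 10-fold cross-validation. We used a sequential minimal optimization trainer; the scikit-learn LinearSVC below is just a stand-in for illustration, and documents, labels, and common_words are placeholders rather than our actual data.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def evaluate(documents, labels, common_words, folds=10):
    """documents: list of news-story strings; labels: region of each story;
    common_words: the most frequent words across the whole collection."""
    vectorizer = CountVectorizer(vocabulary=common_words)  # keep only the common words
    X = vectorizer.fit_transform(documents)                # common-word frequency vectors
    clf = LinearSVC()                                       # linear SVM stand-in for SMO
    scores = cross_val_score(clf, X, labels, cv=folds)      # k-fold cross-validation
    return scores.mean()

# Hypothetical usage, once the corpus and common word list are loaded:
# print(evaluate(documents, labels, common_words))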