Ron Zacharski | Playing with the Stanford Log-linear Part-Of-Speech Tagger

I would like to create a part-of-speech tagger for Paraguayan Guarani. Initially I thought I would use the Brill part-of-speech tagger, but it seems to have vanished from the web. In my search, I ran across the Stanford Log-Linear Part-Of-Speech Tagger. It was developed by Chris Manning’s group, and I figured anything developed by Chris Manning is probably exceptional. I downloaded it and ran the included English part-of-speech tagger on a 250k text (a public domain Tom Swift book). It took about half an hour on a newish Core Duo machine.

Training a part-of-speech tagger is a bit more complex, simply because of the lack of documentation. First, you need a tagged corpus. There is some flexibility in how this text is formatted. I simply used a text file where each word-tag pair is written as word_tag. For example,

The_DT old_JJ Foger_NNP homestead_NN is_VBZ closed_VBN up_RP ,_, though_IN I_PRP did_VBD see_VB a_DT man_NN working_VBG around_IN it_PRP to-day_JJ as_IN I_PRP came_VBD past_NN ._.
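
To make the word_tag format concrete, here is a minimal Python sketch that splits such a line into (word, tag) pairs. It is not part of the Stanford tagger: the helper name parse_tagged_line is made up for illustration, and it assumes the tag is whatever follows the last underscore in each whitespace-separated token.

```python
def parse_tagged_line(line, separator="_"):
    """Split a line of word_tag tokens into (word, tag) pairs.

    Hypothetical helper for illustration only. Assumes the tag follows
    the LAST separator in each whitespace-delimited token, so a word
    that itself contains an underscore still splits correctly.
    """
    pairs = []
    for token in line.split():
        word, sep, tag = token.rpartition(separator)
        # Reject tokens with no separator, an empty word, or an empty tag.
        if not sep or not word or not tag:
            raise ValueError(f"token is not in word{separator}tag form: {token!r}")
        pairs.append((word, tag))
    return pairs


if __name__ == "__main__":
    for word, tag in parse_tagged_line("The_DT old_JJ Foger_NNP homestead_NN"):
        print(word, tag)
```

Running something like this over the whole training file before training is a cheap way to catch tokens that are missing a tag.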

In addition to this text file, you need a props file (basically a configuration file). There is a sample configuration file in the models folder of the Stanford download; edit it to match your local settings. Finally, you will need to increase the amount of memory allocated to Java (the -mx500m option in the command below).
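
As a rough idea of what such a props file holds, the Python sketch below renders a handful of settings as key = value lines. The key names (model, trainFile, tagSeparator, encoding) are assumptions on my part and the render_props helper is hypothetical; the sample props file in the models folder is the authority on which keys the tagger actually reads.

```python
def render_props(settings):
    """Render a dict as 'key = value' lines in Java .properties style.

    Hypothetical helper for illustration; it does no escaping, so it is
    only suitable for simple values like the ones below.
    """
    return "".join(f"{key} = {value}\n" for key, value in settings.items())


# Assumed key names: compare them against the sample props file
# shipped in the models folder before relying on them.
EXAMPLE_SETTINGS = {
    "model": "guaranimodel",    # where the trained model is written
    "trainFile": "guarani.txt", # the word_tag training corpus
    "tagSeparator": "_",        # matches the word_tag format
    "encoding": "UTF-8",        # assumed encoding of the corpus
}

if __name__ == "__main__":
    print(render_props(EXAMPLE_SETTINGS), end="")
```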

The command I used that actually generated a part-of-speech tagger is:

java -mx500m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model guaranimodel -trainFile guarani.txt -prop models\mymodel.props
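
If you expect to retrain often, the same command can be assembled from a script so that the heap size and file names are easy to change. The following is only a sketch: build_train_command is a hypothetical helper, and the flags are exactly the ones from the command above, nothing more.

```python
def build_train_command(model, train_file, props,
                        heap="500m", jar="stanford-postagger.jar"):
    """Build the training command above as an argument list.

    Hypothetical helper for illustration; it only reuses the flags
    shown in the command above and does not run anything itself.
    """
    return [
        "java", f"-mx{heap}",           # maximum heap size for the JVM
        "-classpath", jar,
        "edu.stanford.nlp.tagger.maxent.MaxentTagger",
        "-model", model,                # output model file
        "-trainFile", train_file,       # word_tag training corpus
        "-prop", props,                 # props (configuration) file
    ]


if __name__ == "__main__":
    cmd = build_train_command("guaranimodel", "guarani.txt",
                              "models\\mymodel.props")
    print(" ".join(cmd))
```

The resulting list can be handed to Python's subprocess.run to start training, with the heap argument raised if Java runs out of memory.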

Once I have played around with this more, I will post how well it works.