In this paper we describe language recognition algorithms for mono- and multi-lingual documents that are based on mixed-order n-grams, Markov chains, maximum likelihood, and dynamic programming. We compare the monolingual algorithm to those suggested by other researchers; the comparison suggests that it significantly outperforms commonly used language recognition algorithms. We then describe the multilingual algorithm, which segments a multilingual document into single-language chunks and identifies the language of each chunk.
Cowie, Jim, Yevgeny Ludovik, and Ron Zacharski. 1999. Language recognition for mono- and multi-lingual documents. Proceedings of the Vextal Conference, 209-214. Venice, November 22-24, 1999.
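
The ingredients named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes fixed-order character bigrams with add-one smoothing in place of the paper's mixed-order n-grams, word-level segmentation, and a hypothetical `switch_penalty` parameter; all function names are invented for illustration.

```python
import math
from collections import Counter

def train(text, n=2):
    """Build a character n-gram Markov model (counts only) from sample text."""
    padded = " " * (n - 1) + text.lower()
    spans = range(len(padded) - n + 1)
    grams = Counter(padded[i:i + n] for i in spans)
    contexts = Counter(padded[i:i + n - 1] for i in spans)
    return {"n": n, "grams": grams, "contexts": contexts,
            "vocab": len(set(padded))}

def log_likelihood(model, text):
    """Log-probability of text under the model (add-one smoothing, assumed)."""
    n = model["n"]
    padded = " " * (n - 1) + text.lower()
    total = 0.0
    for i in range(len(padded) - n + 1):
        gram = padded[i:i + n]
        numerator = model["grams"][gram] + 1
        # "+ 1" reserves one bucket for characters never seen in training.
        denominator = model["contexts"][gram[:-1]] + model["vocab"] + 1
        total += math.log(numerator / denominator)
    return total

def identify(models, text):
    """Monolingual case: pick the maximum-likelihood language."""
    return max(models, key=lambda lang: log_likelihood(models[lang], text))

def segment(models, words, switch_penalty=2.0):
    """Multilingual case: dynamic programming (Viterbi-style) over words.

    Each word is scored under every language; changing language between
    adjacent words costs switch_penalty (a hypothetical tuning knob).
    """
    langs = list(models)
    score = {l: log_likelihood(models[l], words[0]) for l in langs}
    back = []
    for word in words[1:]:
        new_score, pointers = {}, {}
        for l in langs:
            def via(p, l=l):
                return score[p] - (switch_penalty if p != l else 0.0)
            best_prev = max(langs, key=via)
            new_score[l] = via(best_prev) + log_likelihood(models[l], word)
            pointers[l] = best_prev
        score = new_score
        back.append(pointers)
    # Trace back the best path from the best final language.
    last = max(langs, key=score.get)
    labels = [last]
    for pointers in reversed(back):
        last = pointers[last]
        labels.append(last)
    labels.reverse()
    return labels
```

In use, one would call `train` once per language on sample text, pass the resulting dictionary of models to `identify` for a monolingual document, and pass a list of words to `segment` to obtain one language label per word; runs of equal labels then form the single-language chunks.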