In one of my recent projects, there was a business requirement to automatically identify the language of text documents and sort them by language.
I did some research on the internet and found several open-source tools that can help identify a language. One popular tool is "Lingua", which is open source and written in Perl.
Language identification works by searching a text for patterns that are common in a given language. Those patterns can be prefixes, suffixes, common words, n-grams or even sequences of words. More information about n-grams can be found here.
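To make the n-gram idea concrete, here is a minimal Python sketch. It is not how Lingua works internally; it only illustrates one well-known approach, ranking character trigrams by frequency and comparing rank orders between a document and a per-language profile. The function names (ngram_profile, out_of_place, identify) and the tiny training sentences are my own assumptions for illustration; a real tool would build its profiles from much larger corpora.

```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """Return the `top` most frequent character n-grams, most frequent first."""
    # Normalise case and whitespace, and pad so word boundaries become n-grams too.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    return [gram for gram, _ in counts.most_common(top)]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(i - rank[g]) if g in rank else penalty
               for i, g in enumerate(doc_profile))

# Hypothetical, very small training samples (assumption: real profiles need far more text).
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "dutch": "de snelle bruine vos springt over de luie hond en de kat slaapt in het huis",
    "german": "der schnelle braune fuchs springt ueber den faulen hund und die katze schlaeft im haus",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}

def identify(text):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(PROFILES, key=lambda lang: out_of_place(doc, PROFILES[lang]))
```

Calling identify("some text") returns one of the keys of SAMPLES. With training text this small the guess is unreliable for short or unusual input, so treat it as a demonstration of the technique rather than a usable classifier.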
Other interesting links on the same subject:
http://staff.science.uva.nl/~jvgemert/mia_page/LangTools.html
http://odur.let.rug.nl/~vannoord/TextCat/Demo/
http://staff.science.uva.nl/~jvgemert/mia_page/demo.html#Lid
Thursday, June 15, 2006