Saturday, November 6, 2010

Working with source text

There are open-source ways to break text into sentences and to find terms. Those need to be part of the toolkit.

The splitta library is a sentence boundary finder. I have to incorporate it, as segmentation is an extremely important function of any translation system. So it should be Perl-ized here.
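
To give a feel for what sentence segmentation involves, here is a minimal sketch in Python. It is a naive rule-based splitter, not splitta's method (as far as I know, splitta relies on a trained classifier rather than hand-written rules). The function name `split_sentences` and the tiny `ABBREVIATIONS` list are my own assumptions for illustration only.

```python
import re

# Hypothetical, deliberately tiny abbreviation list; real text needs far more.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e.", "etc."}


def split_sentences(text):
    """Naively split text at ., ! or ? when followed by whitespace and a capital letter."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]+(?=\s+[A-Z])", text):
        end = match.end()
        candidate = text[start:end]
        # Skip boundaries that directly follow a known abbreviation.
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue
        sentences.append(candidate.strip())
        start = end
    # Whatever remains after the last boundary is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    for sentence in split_sentences("Dr. Smith arrived. He was late! Why?"):
        print(sentence)
```

Rules like these break down quickly on real text (initials, quotations, abbreviations at the end of a sentence), which is a large part of why a dedicated boundary finder is worth incorporating.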

The Topia term extractor is the other thing I wanted to point out here.

Also worth noting is that both of these libraries are written in Python. An awful lot of natural language work ends up in Python. That's kind of interesting, actually.

