Friday, August 27, 2010

Tesseract OCR

Tesseract, the open-source OCR engine that Google now sponsors (it started life at HP), seems to be just about the best OCR out there. It doesn't play well with others yet (it's written on the assumption that it's a standalone utility, not a library), but given that Google is behind it, it'll probably get a lot better fast.
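
For what it's worth, being a standalone utility doesn't rule out scripting it: the usual workaround is to shell out to the binary and read back the text file it writes. Here's a minimal Python sketch of that idea. It assumes a `tesseract` executable is on the PATH and that it's invoked as `tesseract <image> <outputbase> -l <lang>`, writing its result to `<outputbase>.txt`. The helper names `build_tesseract_command` and `ocr_image` are my own invention, not part of Tesseract.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list for: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_image(image_path, output_base, lang="eng"):
    """Run tesseract as a child process and return the recognized text.

    Assumption: a `tesseract` binary is on PATH and writes its result to
    <output_base>.txt, as the command-line tool does by default.
    """
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    # Append ".txt" by hand so an output base containing dots is left intact.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

Something like `ocr_image("scan.tif", "scan_out")` (both file names hypothetical) would then hand back the recognized text as a string. It's clunky next to a real library API, since every page costs a process launch and a round trip through the filesystem, but it's probably good enough for batch jobs.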

I should probably investigate. OCR is an important component of a lot of translation jobs, and all existing OCR sucks. Sigh. That's only partly hyperbole.

