Tuesday, November 16, 2010

Another workflow with PDFs

I have a set of scanned PDFs; each one contains several documents. The text to be translated has been highlighted - with a physical marker, I mean - and the pages scanned. Unfortunately, the PDFs were not encoded to allow comments. That "commenting enabled" feature is controlled by a digital signature, not by a flag in the PDF standard, and Adobe has not published the signing key - so from what I'm reading, there is no tool in the world that can flag a PDF to allow comments in Adobe Reader except the full paid version of Adobe Acrobat.

So my workflow is to go through the documents with the snapshot tool, copying each highlighted bit. Each snapshot goes into one column of a two-column Word table, with its translation in the other. It's nearly as good as comments in the PDF.

It seems to me that this would be a simple tool to implement: create the Word file, create the table, and then, every time I take a graphical selection, paste it into the Word file for me and bring Word to the front. It's not a huge help, but it's the principle of the thing.
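For what it's worth, the output side of that tool is easy to sketch even without hooking the snapshot event. Here's a minimal, stdlib-only Python sketch (my own hypothetical `rtf_table` helper, not any existing tool) that writes a two-column table Word can open - RTF rather than .docx, since RTF is plain text and lets you embed a PNG snapshot as hex data:

```python
import binascii

def rtf_table(rows):
    """Build a minimal RTF document containing a two-column table.

    rows: list of (png_bytes, translation_text) pairs - the snapshot
    image on the left, its translation on the right.
    """
    out = [r"{\rtf1\ansi"]
    for png, text in rows:
        hexdata = binascii.hexlify(png).decode("ascii")
        # one table row: two cells ending at 4500 and 9000 twips
        out.append(r"\trowd\cellx4500\cellx9000")
        # left cell: the snapshot, embedded as a hex-encoded PNG
        out.append(r"{\pict\pngblip " + hexdata + r"}\cell")
        # right cell: the translation (escape RTF special characters,
        # backslash first so we don't double-escape the braces)
        esc = text.replace("\\", "\\\\").replace("{", r"\{").replace("}", r"\}")
        out.append(esc + r"\cell\row")
    out.append("}")
    return "\n".join(out)
```

In practice Word also wants picture dimensions (`\picw`, `\pich`, `\picwgoal`, `\pichgoal`) before it will lay the image out properly; I've left those out to keep the sketch short.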

Saturday, November 6, 2010

Working with source text

There are open-source ways to break text into sentences and to find terms. Those need to be part of the toolkit.

The splitta library is a sentence boundary detector. I have to incorporate this, as segmentation is an extremely important function of any translation system - so it should be Perl-ized here.
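To show why a statistical splitter earns its keep, here's a naive rule-based baseline (my own sketch, nothing to do with splitta's actual API): split after sentence-final punctuation followed by a capital letter, with a hand-maintained abbreviation list to patch the obvious failure case. splitta learns these distinctions from data instead of relying on such a list:

```python
import re

# Hand-maintained abbreviation list - exactly the brittle part that a
# statistical splitter like splitta replaces with a learned model.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "vs", "etc", "e.g", "i.e"}

def split_sentences(text):
    """Split after . ! ? followed by whitespace and a capital letter,
    unless the token before the punctuation is a known abbreviation."""
    sentences, start = [], 0
    for m in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        words = text[start:m.start()].split()
        token = words[-1].lower().rstrip(".") if words else ""
        if token in ABBREVIATIONS:
            continue  # "Dr." is not a sentence boundary
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences
```

The abbreviation list is the weak point - every domain adds new ones - which is exactly the argument for a trained model.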

The Topia term extractor is the other thing I wanted to point out here.
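As a rough illustration of what a term extractor does (this is my own frequency-count sketch, not Topia's actual algorithm - Topia picks out noun phrases properly, using part-of-speech tagging):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "that"}

def candidate_terms(text, min_count=2):
    """Crude term extraction: treat repeated non-stopword tokens and
    adjacent token pairs as term candidates, ranked by frequency."""
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", text)]
    counts = Counter()
    for i, w in enumerate(words):
        if w not in STOPWORDS:
            counts[w] += 1
        # adjacent pair of content words = multiword term candidate
        if i + 1 < len(words) and w not in STOPWORDS and words[i + 1] not in STOPWORDS:
            counts[w + " " + words[i + 1]] += 1
    return [(t, c) for t, c in counts.most_common() if c >= min_count]
```

The `min_count` threshold is doing the work a POS tagger does in Topia: without it, every stray word in the document becomes a "term."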

Also worth noting: both of these libraries are in Python. An awful lot of natural language work ends up in Python. That's kind of interesting, actually.