Friday, October 8, 2010


This has been done to death, of course, but I need to start thinking about a terminology engine, and also about specific terminology - right now I'd like a database of titles of industrial standards in various languages. They come up rather a lot.

Wednesday, October 6, 2010

PDF reading

There are a couple of workflows where PDFs are needed.

First is where a series of pages have been scanned and need to be translated starting from the graphics. OCR can come in handy here (if it works, which it usually doesn't), but I want to highlight the fact that the pages (1) are very often disjoint (think medical records) and (2) sometimes carry Bates numbers (legal annotations identifying each individual page in a set of documents). This overall structure could do with some software support. I'm thinking of something that takes individual document segments and ties them back into a structured overall document using, say, the Bates numbers.
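To make the "tie them back together" idea a bit more concrete, here is a minimal sketch in Python. It assumes each page has already been turned into text somehow, and that the Bates stamp looks like a short uppercase prefix followed by zero-padded digits (e.g. SMITH000012). That pattern is my assumption; real stamps vary a lot, so the regular expression and the function names below are purely illustrative.

```python
import re

# Assumed stamp format: 2-6 uppercase letters, optional separator, 4-8 digits.
BATES_RE = re.compile(r"\b([A-Z]{2,6})[- ]?(\d{4,8})\b")


def bates_key(page_text):
    """Return (prefix, number) for the first stamp found, or None."""
    match = BATES_RE.search(page_text)
    if match is None:
        return None
    return (match.group(1), int(match.group(2)))


def assemble(pages):
    """Order pages by Bates number; pages without a stamp are set aside."""
    stamped, unstamped = [], []
    for text in pages:
        key = bates_key(text)
        if key is None:
            unstamped.append(text)
        else:
            stamped.append((key, text))
    stamped.sort(key=lambda pair: pair[0])
    return stamped, unstamped


def find_gaps(stamped):
    """List (prefix, first_missing, last_missing) ranges between stamped pages."""
    keys = [key for key, _ in stamped]
    gaps = []
    for (p1, n1), (p2, n2) in zip(keys, keys[1:]):
        if p1 == p2 and n2 - n1 > 1:
            gaps.append((p1, n1 + 1, n2 - 1))
    return gaps


pages = [
    "Discharge summary ... SMITH000012",
    "Lab results SMITH000010",
    "handwritten note",
]
stamped, unstamped = assemble(pages)
print([key for key, _ in stamped])  # [('SMITH', 10), ('SMITH', 12)]
print(find_gaps(stamped))           # [('SMITH', 11, 11)]
```

The gap list is the part I'd actually want: it tells you which pages of a disjoint set you were never given, which matters when a translation has to refer back to the page numbering.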

[With respect to that OCR: it would be nice to have a pre-OCR stage that finds and identifies pages that are similar - this could simplify finding letterhead, headings, and so on.]
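One cheap way to approach that pre-OCR similarity pass would be a perceptual hash of each page. The sketch below uses a plain "average hash": shrink the page to an 8x8 grid of brightness values, mark each cell as lighter or darker than the mean, and compare pages by how many cells differ. It assumes the scan has already been loaded as a 2D list of grayscale values (an imaging library would be needed for that, not shown here), and the distance threshold of 6 is an arbitrary guess, not a tuned value.

```python
def average_hash(pixels, size=8):
    """Hash a page given as a 2D list of grayscale values (rows of equal
    length, at least size x size) into size*size light/dark bits."""
    height, width = len(pixels), len(pixels[0])
    cells = []
    for r in range(size):
        r0, r1 = r * height // size, (r + 1) * height // size
        for c in range(size):
            c0, c1 = c * width // size, (c + 1) * width // size
            block = [pixels[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    return [1 if value > mean else 0 for value in cells]


def hamming(a, b):
    """Number of positions where two hashes differ."""
    return sum(x != y for x, y in zip(a, b))


def similar_pages(page_images, max_distance=6):
    """Return index pairs of pages whose hashes are within max_distance."""
    hashes = [average_hash(p) for p in page_images]
    pairs = []
    for i in range(len(hashes)):
        for j in range(i + 1, len(hashes)):
            if hamming(hashes[i], hashes[j]) <= max_distance:
                pairs.append((i, j))
    return pairs


# Toy 16x16 "scans": two pages with a dark letterhead band, one unrelated page.
page_a = [[0] * 16 if y < 4 else [255] * 16 for y in range(16)]
page_b = [row[:] for row in page_a]
page_b[10][10] = 200  # a speck of scanner noise
page_c = [[0] * 8 + [255] * 8 for _ in range(16)]

print(similar_pages([page_a, page_b, page_c]))  # [(0, 1)]
```

For letterhead specifically, hashing only the top strip of each page would probably work better than hashing the whole page, since the body text differs from page to page.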

The second workflow of interest is text PDFs. See, PDFs don't have document structure the way Word documents do. If a header appears on every page, then its text is simply repeated on every page of the extracted text. So it would be nice to be able to impose - to recognize - this sort of structure in order to take PDFs and translate them. (You could argue that a TM tool would do this for you - but I would prefer to abstract out the different document parts in order to translate them separately, when we start thinking about machine translation. The MT tool will need as much help as it can get.)
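As a rough illustration of recognizing that repeated structure, here is a small Python sketch. It assumes the PDF text has already been extracted into one string per page (by whatever tool), treats any line that shows up on most pages as a header or footer, and replaces digits with a placeholder so that "Page 1 of 3" and "Page 2 of 3" count as the same line. The 80% threshold and the function names are my own choices for the example.

```python
import re
from collections import Counter


def normalize(line):
    """Strip whitespace and mask digits so page counters compare equal."""
    return re.sub(r"\d+", "#", line.strip())


def split_repeated_lines(pages, threshold=0.8):
    """Separate lines that recur on most pages from the page bodies.

    Returns (repeated, bodies): the set of normalized repeated lines and
    the list of page texts with those lines removed.
    """
    counts = Counter()
    for page in pages:
        # Count each distinct line once per page.
        counts.update({normalize(l) for l in page.splitlines() if l.strip()})
    min_pages = max(2, int(threshold * len(pages)))
    repeated = {line for line, n in counts.items() if n >= min_pages}
    bodies = []
    for page in pages:
        kept = [l for l in page.splitlines()
                if l.strip() and normalize(l) not in repeated]
        bodies.append("\n".join(kept))
    return repeated, bodies


pages = [
    "ACME GmbH SOP-7\nPurpose of this procedure\nPage 1 of 3",
    "ACME GmbH SOP-7\nScope and responsibilities\nPage 2 of 3",
    "ACME GmbH SOP-7\nSupplier sample approval\nPage 3 of 3",
]
repeated, bodies = split_repeated_lines(pages)
print(sorted(repeated))  # ['ACME GmbH SOP-#', 'Page # of #']
print(bodies[0])         # Purpose of this procedure
```

The repeated lines would then be translated once and kept out of the body text handed to the MT tool.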

Anyway. Just a thought I'm too busy to follow up on right at the moment.

Friday, October 1, 2010


So I tried Systran on a new potential project in German and Italian (SOPs from the same company in both languages, for translation to English). I figured after the corporate charter, with its quite passable results, I'd try Systran on these as well.


Here's just the first sentence of the German:

The available SOP serves for the Sicherstellung of the requirements and conditions, which must fulfill the suppliers, so that them for the supply of a supplier sample to become certified to be able.

You can't edit that. All you can do is retranslate it - either directly from the German or from this intermediate not-German not-English near-gibberish. So maybe the corporate charter was a fluke, or maybe Systran performs better on French than on German (or Italian - the results were equally unreadable on my Italian sample). Either way, my initial hopes for being able to use Systran are pretty much shot. This does not speed me up, and it's clear that careful glossary work, while it might help a little, wouldn't be enough - Systran doesn't actually appear to understand or use syntax.