The Xlat Project: September 2010

Thursday, September 30, 2010

Thoughts on practical use of machine translation

I haven't had the time to get OpenLogos running (I swear, just when I started, the work came pouring in; I'm at 123,000 words for the month, phew). And since I was far, far behind schedule on a large and boring corporate charter in French, I decided to try Systran.

(Oh, no, he didn't go there!)

I hadn't looked at Systran since 2005, when I had some work post-editing its abysmal output for an agency in Italy. I came to the conclusion then that it was normally just as easy to translate a given text myself as to try to decipher what Systran had come up with and whip it into something comprehensible to an English speaker, and that translating it myself paid five times as well. So: no-brainer, and I actually lost my Systran install.

But, well, it's been five years. Surely they could do a better job by now, right? And hey, it's only 100 bucks for the home version, which now includes a whole raft of languages - in fact, with the exception of Hungarian, all the languages I work with. So.

Here's the workflow I used: I ran Systran on my file, then aligned it with the original, and loaded it into my TM. Then I started down the file sentence by sentence in the normal manner, with the aligned segments coming up as I went.

This worked pretty damn well, actually. OK, there were some Systranisms - mille, in a year, is generally not translated "millet" and I'm not sure why that would be the default. I dealt with these by loading the TM transfer file in my editor of choice and doing a global search-and-replace on them as I went. Then I'd import the edited segments back into the TM and proceed, so commonly mistranslated terms got better as I went. Since the file was 13,000 words, this approach had time to work.

I should note that nearly every sentence needed modification. There were some real screamers in terms of Martian word order, so this output should be considered a rock-bottom minimum for MT quality; what I wanted to know was whether it would accelerate my work even so.

My normal "fast" progress is 700 to 1000 words an hour. For this dreary text, I would probably have managed no more than 400 or 500 an hour. With this procedure, though, I managed a throughput of something between 1500 and 2500 words an hour. That ... that freaking works.
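To put those rates in terms of the job itself, here is a quick back-of-the-envelope calculation. It uses only the figures quoted in this post (13,000 words, 400 to 500 words an hour unaided, 1500 to 2500 with the Systran-seeded TM); the variable names are just for illustration.

```python
# Figures from the post; hours are simply words / (words per hour).
JOB_WORDS = 13_000
MANUAL_RATE = (400, 500)      # estimated words/hour on this text without MT
POSTEDIT_RATE = (1500, 2500)  # observed words/hour with the Systran-seeded TM


def hours(words, rate):
    """Time needed for `words` at `rate` words per hour."""
    return words / rate


# (best case, worst case) in hours for each way of working
manual_hours = (hours(JOB_WORDS, MANUAL_RATE[1]), hours(JOB_WORDS, MANUAL_RATE[0]))
postedit_hours = (hours(JOB_WORDS, POSTEDIT_RATE[1]), hours(JOB_WORDS, POSTEDIT_RATE[0]))

# (most pessimistic, most optimistic) speedup factor
speedup = (POSTEDIT_RATE[0] / MANUAL_RATE[1], POSTEDIT_RATE[1] / MANUAL_RATE[0])

print(f"manual:    {manual_hours[0]:.1f} to {manual_hours[1]:.1f} hours")
print(f"post-edit: {postedit_hours[0]:.1f} to {postedit_hours[1]:.1f} hours")
print(f"speedup:   {speedup[0]:.2f}x to {speedup[1]:.2f}x")
```

So even on the pessimistic end, the job shrinks from well over three working days to roughly one.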

I think quality suffered somewhat, although as it was a corporate charter, I don't think I would have done fantastic quality anyway, so it's hard to say. I should continue to give this a try - certainly the preliminary results on this one job were entirely convincing and I now have much more confidence that machine translation should be part of my toolkit.

How would I improve things, you ask? Pretty much using the same tools I want to implement anyway:

- Global search and replace for terms in a bilingual list. (This has two aspects: replacement should be sensitive to grammar in the target language, i.e. pluralizing correctly, but it should also be sensitive to the source phrase, sort of a "replace X with X' only if it's a translation for Y".)
- Automation of simple TRADOS tasks (e.g. reloading the TM after I do a global search and replace).
- A database of rewording rules. This is slowly taking shape in my mind - it would be a valuable tool for any proofreader. It could also "translate" between American and British, if you see what I mean. Kind of a spellchecker on steroids, if you will.
- Automation of Systran itself. The home version runs inside Word or with a standalone tool, and they don't really want you to automate it without giving them a lot of money for the Professional or Enterprise versions.
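The first item on that list, "replace X with X' only if it's a translation for Y", can be sketched in a few lines. This is a minimal illustration in Python, not a tool that exists yet: `TermRule`, `apply_rule` and `apply_rules` are hypothetical names, segments are assumed to be plain (source, target) string pairs, and the "grammar sensitivity" is nothing more than a naive trailing-s plural.

```python
import re
from dataclasses import dataclass


@dataclass
class TermRule:
    source_term: str  # Y: must appear in the source segment
    wrong: str        # X: the MT system's habitual mistranslation
    right: str        # X': the preferred translation


def _word(term):
    """Whole-word pattern for `term`, with an optional naive plural 's'."""
    return r"\b" + re.escape(term) + r"(s)?\b"


def apply_rule(source, target, rule):
    """Replace rule.wrong with rule.right in `target`, but only when
    `source` actually contains rule.source_term."""
    if not re.search(_word(rule.source_term), source, re.IGNORECASE):
        return target  # X here is not a translation of Y; leave it alone

    def swap(match):
        # Carry over the plural 's' and initial capital of the matched word.
        replacement = rule.right + ("s" if match.group(1) else "")
        if match.group(0)[0].isupper():
            replacement = replacement[0].upper() + replacement[1:]
        return replacement

    return re.sub(_word(rule.wrong), swap, target, flags=re.IGNORECASE)


def apply_rules(segments, rules):
    """Apply every rule to every (source, target) pair; return new pairs."""
    result = []
    for source, target in segments:
        for rule in rules:
            target = apply_rule(source, target, rule)
        result.append((source, target))
    return result


rules = [TermRule("mille", "millet", "thousand")]
segments = [
    ("en deux mille dix", "in two millet ten"),
    ("Le millet est une plante.", "Millet is a plant."),
]
for source, target in apply_rules(segments, rules):
    print(source, "=>", target)
```

The second segment is left untouched because its source never contains "mille" as a word, which is exactly what a blind search-and-replace in a text editor can't guarantee. Real pluralization (and anything beyond English) would need far more than an appended "s".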

Anyway, I wanted to post this while the job was fresh in my memory. Now it's back to work for me, this time without the Systran crutch.

The real takeaway for me was: even bad MT, if well managed, would augment my throughput, potentially by a lot. And the various accessories I would need for Systran work will also be applicable to work with OpenLogos, so it's not wasted work if I get around to writing some.

Wednesday, September 15, 2010

XLIFF

So I started a File::XLIFF module yesterday. XLIFF [spec] is an interesting format. Like many XML formats, it's overengineered to the point that I suspect nobody will ever use it to its fullest extent. It maps onto the much simpler TTX format only with a lot of folding, spindling, and mutilation.

The basic Xlat::File model of a file as a simple set of segments may turn out to be oversimplified when it comes to XLIFF. As the most obvious example, a single XLIFF file can contain multiple sections, each of which refers to content from a separate file, and thus each of which has its own header and its own body.

Under the assumption that an XLIFF file will usually correspond to a single source file, I'm going to define a "default section" (that being the first in the file) that will be the target for the API against the file object. Using XLIFF-specific functions, I'll expose a way to get a list of sections and to create a separate file object pointing to any section numbered 2 or above.
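Here is a rough sketch of that "default section" idea. It is written in Python with the standard library's ElementTree rather than as the actual File::XLIFF code, and it assumes XLIFF 1.2 (one `<file>` element per section, each holding `<trans-unit>` elements with `<source>` and `<target>`). The class and method names, and the sample document, are made up for illustration.

```python
import xml.etree.ElementTree as ET

# XLIFF 1.2 namespace
NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="charter.txt" source-language="fr" target-language="en" datatype="plaintext">
    <body>
      <trans-unit id="1"><source>Bonjour</source><target>Hello</target></trans-unit>
    </body>
  </file>
  <file original="annex.txt" source-language="fr" target-language="en" datatype="plaintext">
    <body>
      <trans-unit id="1"><source>Merci</source><target>Thank you</target></trans-unit>
    </body>
  </file>
</xliff>"""


class XliffSection:
    """One <file> element: its own header, its own body, its own segments."""

    def __init__(self, file_element):
        self._el = file_element

    @property
    def original(self):
        return self._el.get("original")

    def segments(self):
        # '//' also picks up trans-units nested inside <group> elements.
        return [
            (tu.findtext("x:source", default="", namespaces=NS),
             tu.findtext("x:target", default="", namespaces=NS))
            for tu in self._el.iterfind("x:body//x:trans-unit", NS)
        ]


class XliffFile:
    """Simple 'file is a set of segments' API, aimed at the first section."""

    def __init__(self, xml_text):
        root = ET.fromstring(xml_text)
        self._sections = [XliffSection(el) for el in root.findall("x:file", NS)]
        if not self._sections:
            raise ValueError("no <file> sections found")

    def segments(self):
        return self._sections[0].segments()  # the default section

    def sections(self):
        return list(self._sections)

    def section(self, number):
        """Sections are numbered from 1, so section(2) is the second <file>."""
        return self._sections[number - 1]


doc = XliffFile(SAMPLE)
print(doc.segments())
print(doc.section(2).original, doc.section(2).segments())
```

Callers that only care about the common single-file case never see sections at all; anyone who needs the second or third `<file>` asks for it explicitly.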

Each of the File:: modules should probably have an Xlat::File superclass. I don't want to introduce needless dependencies, though; perhaps I can test for installation of Xlat::File before superclassing? Or maybe this is a plate of beans.
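For what it's worth, "test for installation before superclassing" is a small amount of code in most dynamic languages. The sketch below shows the pattern in Python; `xlat_file` and `XlatFile` are hypothetical stand-ins for Xlat::File, and `optional_base` is a name invented here. In Perl the same idea is usually written by wrapping `require` in an `eval` block and adding to `@ISA` only on success.

```python
import importlib


def optional_base(module_name, class_name):
    """Return module_name.class_name if it can be imported, else `object`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return object  # dependency not installed: fall back to no superclass
    return getattr(module, class_name, object)


# Hypothetical names standing in for Xlat::File.
Base = optional_base("xlat_file", "XlatFile")


class XliffFile(Base):
    """Works standalone; inherits the shared API when the base is installed."""

    def segments(self):
        return []


print(XliffFile.__mro__)
```

The cost is that the class hierarchy differs from one machine to the next, which may well be the plate of beans.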

Tuesday, September 14, 2010

Patent language

Not software-related, but patent-related, I just wanted to link to this incredible example of clear exposition explaining the structure of patent claims.

Friday, September 3, 2010

OpenLogos on SourceForge

OpenLogos isn't really part of the xlat project, so I'll be transitioning its blog over to its own home on SourceForge. The real news being that it's now on SourceForge, with yours truly as maintainer.

Installing OpenLogos on Ubuntu 10.x (32-bit)

So, as noted below in the Fedora post, I'm giving up on 64-bit Fedora right now and falling back to an older machine, installing 32-bit Ubuntu on it so I can follow the instructions in Torsten Scheck's article without needing to work too hard. I'll post any discoveries here as I go; right now, I'm downloading the Ubuntu installer.

(10/17/2010) It's embarrassing, but I'm only now getting to this point. I spent some time getting Ubuntu installed on an old machine, but an update seems to have clobbered the boot sector or something - and frankly, that machine has been a problem for a while now. So this week I built a 32-bit Ubuntu virtual machine on my desktop box, and I'm chugging along.

After making my earlier changes again (the ones I made on Fedora), things are compiling well. Including logos_libs/ruleengine/rulebase.h produces a lot of warnings of the form: "%.2d" expects type 'int' but argument 3 has type 'long unsigned int'. Aside from those warnings, things worked fine, but I'm going to have to look into them.

Ah. In lgsentity.cpp, "warning: deprecated conversion from string constant to 'char*'", and in a couple of other files, as well.

I ended up adding various headers to about ten files all in all. Not too bad.

The installation routine failed to create /usr/local/share/openlogos/bin for some reason - acting as though it wasn't running as root under sudo. Strange, and something that should be examined.

But ... I seem to have installed OpenLogos at long last.
