Wednesday, September 15, 2010

XLIFF

So I started a File::XLIFF module yesterday. XLIFF [spec] is an interesting format. Like many XML formats, it's overengineered to the point that I suspect nobody will ever use it to its fullest extent. It maps onto the much simpler TTX format only with a lot of folding, spindling, and mutilation.

The basic Xlat::File model of a file as a simple set of segments may turn out to be oversimplified when it comes to XLIFF. As the most obvious example, a single XLIFF file can contain multiple sections, each of which refers to content from a separate file, and thus each of which has its own header and its own body.

Assuming that an XLIFF file will usually correspond to a single source file, I'm going to define a "default section" (the first one in the file) as the target of the generic API calls on the file object. XLIFF-specific functions will then expose a way to list the sections and to create a separate file object pointing at the second or any later section.
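To make the "default section" idea concrete, here is a minimal sketch. It is in Python rather than the Perl of the actual module, and the names `XliffSection`, `XliffFile`, `segments` and `sections` are made up for illustration, not the real File::XLIFF API. It assumes XLIFF 1.2, where each section is a `<file>` element with its own header and body.

```python
import xml.etree.ElementTree as ET

# XLIFF 1.2 namespace (assumption: the input is version 1.2).
NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

SAMPLE = """\
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="a.txt" source-language="en" target-language="de" datatype="plaintext">
    <body>
      <trans-unit id="1"><source>Hello</source><target>Hallo</target></trans-unit>
    </body>
  </file>
  <file original="b.txt" source-language="en" target-language="de" datatype="plaintext">
    <body>
      <trans-unit id="1"><source>Goodbye</source></trans-unit>
    </body>
  </file>
</xliff>
"""


class XliffSection:
    """One <file> element, seen as a flat list of segments (hypothetical API)."""

    def __init__(self, file_elem):
        self._elem = file_elem

    @property
    def original(self):
        # Name of the source file this section refers to.
        return self._elem.get("original")

    def segments(self):
        # (id, source, target) per trans-unit; target is None if absent.
        # Only plain text is read here; inline markup is ignored.
        return [
            (tu.get("id"),
             tu.findtext("x:source", None, NS),
             tu.findtext("x:target", None, NS))
            for tu in self._elem.iterfind(".//x:trans-unit", NS)
        ]


class XliffFile(XliffSection):
    """Whole document; the generic API acts on the first (default) section."""

    def __init__(self, xml_text):
        root = ET.fromstring(xml_text)
        self._sections = [XliffSection(f) for f in root.findall("x:file", NS)]
        if not self._sections:
            raise ValueError("no <file> sections found")
        super().__init__(self._sections[0]._elem)

    def sections(self):
        # XLIFF-specific: every section, including the default one.
        return list(self._sections)


doc = XliffFile(SAMPLE)
print(doc.original, doc.segments())
second = doc.sections()[1]
print(second.original, second.segments())
```

The file object behaves like a simple list of segments for the common single-file case, and anything past the first section is only reachable through the XLIFF-specific `sections()` call.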

Each of the File:: modules should probably have an Xlat::File superclass. I don't want to introduce needless dependencies, though; perhaps I can test for installation of Xlat::File before superclassing? Or maybe this is a plate of beans.
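One way the "test before superclassing" idea might look, again as a Python stand-in rather than Perl (where the rough equivalent would be wrapping `require Xlat::File` in an `eval` before touching `@ISA`). The module name `xlat_file` and class name `XlatFile` are placeholders for illustration only.

```python
import importlib


def optional_base(module_name, class_name):
    """Return the named class if its module is installed, else plain object."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return object
    return getattr(module, class_name, object)


# Hypothetical names: stands in for "use Xlat::File as the superclass if present".
Base = optional_base("xlat_file", "XlatFile")


class XliffFile(Base):
    """Works standalone; gains the shared superclass when it is available."""


print(XliffFile.__mro__)
```

The cost is that the class hierarchy then depends on what happens to be installed, which may be more cleverness than a single optional dependency deserves.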

2 comments:

  1. Maybe the mapping problem arises because TTX is a proprietary format that only corresponds to SDL Trados's needs? The TTX DTD is not very clear about what some tags really represent, and the difference between an unsegmented TTX and a segmented TTX is a big problem for XLIFF converters. Try Swordfish and Rainbow on files that aren't presegmented and you'll get two totally different XLIFF files because of the ambiguity of the DTD. Besides, Trados 2009 is slowly moving to (sdl)xliff and seems to have a good implementation of ITS, so instead of focusing on the lowest common denominator between TTX and XLIFF, wouldn't it be better to produce an XLIFF-compliant tool and see how other tools (those above, but not exclusively) handle the issue?

  2. Yeah, except I like the notion of a simple, cut-and-dried single source of segments akin to a table. Plenty of agencies still send out spreadsheets or tables for entry of the translation in a separate column, and it's a clean concept.

    I'm going to have to get used to the XLIFF format. I honestly just never considered a hierarchical set of segments.

    Your point about TTX being proprietary and destined for the scrapheap is true as far as it goes - although, as I say, the concept of just a "simple list of segments" is pretty sound. The full library is going to have an abstraction of just that - a list of segments - that will then map onto whatever interchange format is needed. So I don't really care about TTX per se, except insofar as I still work with it on a daily basis and so get plenty of trial-by-fire opportunities.

    The logical map in this case to XLIFF is from the abstraction onto some kind of query or virtual segment table defined over a given XLIFF. I don't know exactly how it's going to play out, but that's kind of the direction.

    Thanks for the comment! I'm amazed anybody has even found this blog!
