tag:blogger.com,1999:blog-39045944427918019692024-02-20T12:14:21.938-08:00The Xlat ProjectUnknownnoreply@blogger.comBlogger59125tag:blogger.com,1999:blog-3904594442791801969.post-26202998432440244482014-10-23T11:33:00.001-07:002014-10-23T11:33:11.388-07:00TransTools for VisioNot terribly Perl-specific, obviously, but here's the first tool I've ever found to help with translating Visio documents and embeds: <a href="http://www.translatortools.net/visio-about.html">TransTools for Visio</a>. It's not perfect (what in the Microsoft ecosystem can ever be?) but it's a damn sight better than slogging through by hand, and a point of departure should I run into more Visio in the future. For some reason, it's sadly rare.<br />
<br />
As things stand, with Word 2003 and Visio 2007, translating in TRADOS 2011, the workflow is:<br />
<br />
<ul>
<li>Open the TransTools for Visio file in Visio.</li>
<li>Find a Visio drawing. If it opens directly, great, otherwise:</li>
<li>Cut and paste the drawing into a new document and open it directly there.</li>
<li>Select All and copy out everything in the drawing.</li>
<li>Paste into a new unnamed drawing in the same process as the open TTV drawing with the macros.</li>
<li>Run the magical macro to search the unnamed drawing and build a Word table.</li>
<li>Copy that table into a new document, because the macro doesn't open the table in a full Word instance with menus (this could be fixed, obviously).</li>
<li>Drag that new document into the source language side of SDL.</li>
<li>Prepare that for translation.</li>
<li>Move to target language and translate. Don't use tabs, as they'll mess us up three steps from now.</li>
<li>Save the target document.</li>
<li>Open that document, select and copy the left column, paste it into the original interim document, and then copy the whole table.</li>
<li>Run the magical macro to search-and-replace the unnamed drawing. This uses a tab-delimited paste of the table, so tabs in your text will be Very Bad Indeed.</li>
<li>Select All in the unnamed drawing, and copy.</li>
<li>Delete everything in the embedded drawing, and paste the translated bits.</li>
<li>Jockey around to align properly. This is pretty manual. Sometimes you need to adjust text sizes as well, so actually this probably <i>has</i> to be manual.</li>
<li>If you copied the drawing out into a separate document, now copy it back in. Adjust for location and size if that step screwed things up, which it appears always to do (I don't have enough experience yet).</li>
</ul>
<div>
That's a pretty lengthy process, and yet in terms of both mental load and quality control it's <i>still</i> far superior to not using a real translation tool at all. Everything except the translation itself and the last two steps could be automated, some of it during job preparation (which is obviously a good thing, because otherwise your word counts will be off).</div>
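<div>
The extraction half of that automation looks tractable from Perl with Win32::OLE, for what it's worth. The sketch below is untested, and the Visio object-model names in it (GetActiveObject on Visio.Application, ActiveDocument, Pages, Shapes, Text) are from memory rather than checked against the typelib, but the idea is just to walk every shape on every page and dump its text, keyed by page name and shape ID, into a tab-delimited file:</div>
<pre>
#!/usr/bin/perl
# Rough sketch: walk an open Visio document and dump shape text to a
# tab-delimited file that can be dropped into Word/TRADOS. The Visio COM
# names used here are my best guesses at the object model and would need
# checking against the real typelib.
use strict;
use warnings;
use Win32::OLE;

my $visio = Win32::OLE->GetActiveObject('Visio.Application')
    or die "No running Visio instance found\n";
my $doc = $visio->ActiveDocument or die "No document open in Visio\n";

open my $out, '>:encoding(UTF-8)', 'visio_text.txt' or die $!;

my $pages = $doc->Pages;
for my $p (1 .. $pages->Count) {
    my $page   = $pages->Item($p);
    my $shapes = $page->Shapes;
    for my $s (1 .. $shapes->Count) {
        my $shape = $shapes->Item($s);
        my $text  = $shape->Text;
        next unless defined $text and $text =~ /\S/;
        $text =~ s/[\t\r\n]+/ /g;   # tabs would break the later paste-back
        printf {$out} "%s\t%s\t%s\n", $page->Name, $shape->ID, $text;
    }
}
close $out;
</pre>
<div>
The paste-back would be the mirror image - match on page name and shape ID and assign the translated text back to each shape - but the aligning and resizing at the end stays manual either way.</div>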
<div>
<br /></div>
<div>
So. Very nice tool.</div>
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3904594442791801969.post-55526061426824024412014-07-06T13:40:00.000-07:002014-09-26T14:03:19.782-07:00Marpa or ParZu?I spent most of May working through my old natural-language tokenizer, adding a vocabulary-driven lexer/lexicon for German, all in preparation for undertaking a Marpa-based German parser. That's looking halfway decent at this point (except I need to do much better stemming), and then I decided to do a general search on German parsers and found <a href="https://github.com/rsennrich/parzu">ParZu</a>.<br />
<br />
The unusual thing about ParZu, among parsers especially, is that it's fully open source. That is, it has a free license, not a free-for-academics-only license - and it's hosted on GitHub. Also, I can try it <a href="http://kitt.cl.uzh.ch/kitt/parzu/">online</a>. So I fed it some more-or-less hairy sentences from my current translation in progress - and it parsed them perfectly.<br />
<br />
So here's the thing. I kind of want to do my own work and come to terms with the hairiness of things myself. And then on the other hand, parsing German by any means would allow me to jump ahead and maybe start doing translation-related tasks directly....<br />
<br />
It's a dilemma.<br />
<br />
<i>Update 2014-09-26:</i> Maybe not such a dilemma. ParZu is written in Prolog and I'm just not sure I'm up for that. It honestly seems it would be <i>easier</i> to do it in Marpa...<br />
<br />
This is probably incorrect. But I think I'm going to start finding out, this week.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3904594442791801969.post-77480589266433930742014-06-08T04:18:00.003-07:002014-06-08T04:18:50.747-07:00Compiling corpiSo there's this <a href="http://homepages.inf.ed.ac.uk/pkoehn/publications/de-news/">German news corpus</a> obtained between 1996 and 2000 from online retrieval that I intend to use for some of my NLP work, and it occurred to me that I could build a similar corpus (well, the monolingual side of it, anyway) by doing my own periodic retrievals.<br />
<br />
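The retrieval side could be as simple as the sketch below, pointed at the feeds listed next. It's only a sketch - it assumes LWP::Simple and XML::RSS are installed, and it makes no attempt to pull article text out of the fetched pages - but run from cron it would pile up a dated archive to mine later:<br />
<br />
<pre>
#!/usr/bin/perl
# Sketch of a periodic corpus retrieval: fetch each RSS feed, then fetch
# each linked article and save the raw HTML under a dated directory.
# Cleanup (boilerplate removal, tokenization) happens in a later pass.
use strict;
use warnings;
use LWP::Simple qw(get);
use XML::RSS;
use File::Path qw(make_path);
use POSIX qw(strftime);

my @feeds = (
    'http://nol.hu/archivum/rss-2097',
    'http://nepszava.com/feed',
    # plus whichever Sueddeutsche feeds you pick from the page linked below
);

my $dir = 'corpus/' . strftime('%Y-%m-%d', localtime);
make_path($dir);

my $n = 0;
for my $feed_url (@feeds) {
    my $xml = get($feed_url);
    next unless defined $xml;
    my $rss = XML::RSS->new;
    eval { $rss->parse($xml); 1 } or next;

    for my $item (@{ $rss->{items} }) {
        my $html = get($item->{link});
        next unless defined $html;
        open my $fh, '>', sprintf('%s/%04d.html', $dir, ++$n) or die $!;
        print {$fh} $html;   # saved as fetched; encoding sorted out later
        close $fh;
    }
}
</pre>
<br />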
To that end, here are the RSS feed pages for the <a href="http://www.sueddeutsche.de/service/updates-mit-rss-igoogle-netvibes-mein-yahoo-unsere-feeds-fuer-sie-ueberall-live-dabei-1.393950">Süddeutsche Zeitung</a>, the <a href="http://nol.hu/archivum/rss-2097">Népszabadság Online</a>, and the <a href="http://nepszava.com/feed">Népszava</a> (published in New York for Hungarian-Americans).Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3904594442791801969.post-1110026414068802932014-06-08T04:11:00.004-07:002014-06-08T04:11:50.665-07:00Analysis of chemical namesTurns out the linguistic structure of chemical names is non-trivial. Unfortunately, as it's also quite profitable, the literature all seems to be behind paywalls, but I'm visiting Bloomington this summer and will have the opportunity to spend some time in the library, so this is one of the things I hope to make some headway on.<br />
<br />
In the meantime, here's a <a href="http://pubs.acs.org/doi/abs/10.1021/ci990062c">paywalled article</a> from the promisingly named Journal of Chemical Information and Modeling, which describes an early version of <a href="http://www.cambridgesoft.com/support/DesktopSupport/Documentation/N2S/index.htm">Name>Struct</a>, a closed-source interpreter for chemical names that strives to understand them in a way similar to a human chemist - that is, they attempt to model actual usage, not just reflect the official definitions of usage. Descriptive, not prescriptive, chemical linguistics.<br />
<br />
Anyway, the folks at CambridgeSoft who make Name>Struct have also highlighted some of the pitfalls of chemical linguistics <a href="http://www.cambridgesoft.com/support/DesktopSupport/Documentation/N2S/generalissues/index.htm">here</a>.<br />
<br />
Ah - silly me. A search on "Name>Struct open source" quickly returns <a href="http://pubs.acs.org/doi/abs/10.1021/ci100384d">OPSIN</a>, an open-source algorithm that I could probably adapt pretty easily. It's <a href="https://bitbucket.org/dan2097/opsin/">here at BitBucket</a>, and written in (shudder) Java. Nifty <a href="http://opsin.ch.cam.ac.uk/">Web interface here</a>.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3904594442791801969.post-49524448497680900122014-04-28T11:58:00.000-07:002014-04-28T11:58:32.044-07:00TRADOS 2007 has an OLE dispatch APII wish I'd known <i>this</i> years ago! TRADOS TagEditor 2007 exposes an OLE API. Workbench doesn't appear to. SDL doesn't support it any more, either. But here are some references to it I ran across the other day. I'm going to try to dump the typelib soon (I have no time at all until Thursday).<br />
<br />
<ul>
<li><a href="http://marc.info/?l=perl-win32-users&m=107210226212560">This</a> - from Perl</li>
<li><a href="https://github.com/now/dot-work/blob/master/GHISLER/tools/work/ttx2x.wsf">This</a> from VB</li>
<li><a href="https://groups.yahoo.com/neo/groups/TW_users/conversations/topics/22799">This</a> from VB again.</li>
</ul>
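As for dumping the typelib: if memory serves, Win32::OLE::Const can enumerate every registered type library, which should at least turn up the title and file of whatever TagEditor registers. A sketch (untested, and the callback arguments are from memory, so check them against the Win32::OLE::Const documentation):<br />
<br />
<pre>
#!/usr/bin/perl
# List registered type libraries and grep for anything TRADOS-ish.
# The EnumTypeLibs callback arguments here are from memory and should be
# verified against the Win32::OLE::Const docs before trusting the output.
use strict;
use warnings;
use Win32::OLE::Const;

Win32::OLE::Const->EnumTypeLibs(sub {
    my ($clsid, $title, $version, $langid, $filename) = @_;
    return unless $title =~ /trados|tageditor/i;
    print "$title ($version)\n    $clsid\n    $filename\n";
});
</pre>
<br />
From the filename it should then be possible to load the constants with Win32::OLE::Const->Load and start poking at the objects themselves.<br />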
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3904594442791801969.post-19256188534124951692014-04-18T15:37:00.004-07:002014-04-18T15:37:58.527-07:00GlobalSight and Microsoft TranslatorTwo links to software resources: GlobalSight is an open-source translation manager, and this <a href="http://www.microsoft.com/en-us/translator/for-translators.aspx">Microsoft Translator</a> page is probably a pretty decent definitive list of who to watch in the industry.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3904594442791801969.post-21908284138101270622014-03-19T06:24:00.000-07:002014-03-19T06:24:00.478-07:00Vocabulary in EnglishI've been trying to find resources pertaining to the frequency of words in English - surely there must be some kind of graded scale of "commonness" or something? But so far I can't find anything that organized.<br />
<br />
Instead, I've got two interesting links here:<br />
<br />
<ul>
<li><a href="http://www.uefap.com/vocab/vocfram.htm">English for Academic Purposes</a></li>
<li><a href="http://img.sparknotes.com/content/testprep/pdf/sat.vocab.pdf">1000 most common SAT words</a></li>
</ul>
<div>
If we consider each of those and their ilk to be "general vocabulary words", then we'll have that much more luck in identifying technical content in a given document.</div>
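<div>
As a first cut, that's just set subtraction: tokenize the document, throw out anything that's on the general-vocabulary list (or too short), and what's left, ranked by frequency, is a list of candidate technical terms. A sketch, assuming the vocabulary list is a plain file with one word per line:</div>
<pre>
#!/usr/bin/perl
# Crude technical-term candidate extraction: anything in the document
# that isn't on a "general vocabulary" list is a candidate.
# Usage: candidates.pl general_vocab.txt document.txt
use strict;
use warnings;

my ($vocab_file, $doc_file) = @ARGV;

my %general;
open my $vf, '<', $vocab_file or die "$vocab_file: $!";
while (my $w = readline $vf) {
    chomp $w;
    $general{lc $w} = 1 if length $w;
}
close $vf;

my %count;
open my $df, '<', $doc_file or die "$doc_file: $!";
while (my $line = readline $df) {
    for my $w ($line =~ /([A-Za-z][A-Za-z-]+)/g) {
        $w = lc $w;
        next if length($w) < 3;
        next if $general{$w};
        $count{$w}++;
    }
}
close $df;

printf "%5d  %s\n", $count{$_}, $_
    for sort { $count{$b} <=> $count{$a} } keys %count;
</pre>
<div>
For German or Hungarian the character class would need the accented letters and some stemming, but even this crude version gives you something to scan.</div>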
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3904594442791801969.post-32805041491592390492014-03-19T01:38:00.002-07:002014-03-19T01:38:06.097-07:00TaaS<a href="http://www.taas-project.eu/index.php?page=about-taas">Terminology as a Service</a>, as envisaged by the Europeans.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3904594442791801969.post-47335344950012637792013-12-26T13:02:00.002-08:002013-12-26T13:02:36.630-08:00Terminology resourcesA bit of a linkdump here, first:<br />
<br />
<ul>
<li><a href="http://wordnet.princeton.edu/wordnet/download/current-version/">WordNet </a>is now at 3.0 on Unix, still 2.1 on Windows. The database from Linux is probably more useful. Interestingly, it's also available in Prolog. The licensing is pretty open these days. I don't think it used to be. That's welcome news.</li>
<li>Here's something called <a href="http://www.cs.brandeis.edu/~paulb/CoreLex/corelex.html">CoreLex</a>.</li>
<li>A good overview of the <a href="http://www.olif.net/documents/NewOLIFstruct&content.pdf">OLIF format</a>.</li>
</ul>
<div>
I think I could do worse for termbase storage in Perl than simply a database schema that mirrors OLIF (at least partly). That could be part of a general OLIF-handling set of modules. OLIF is attractive because it's model-agnostic in terms of how terms are conceptualized, so an OLIF-based module should be able to do something reasonable with essentially any terminological source.</div>
<div>
<br /></div>
<div>
Lingua::OLIF?</div>
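<div>
In the meantime, the storage side really could be that simple. Here's a sketch of a minimal, OLIF-flavored schema in SQLite via DBI - the column names are my own shorthand for a few of the OLIF data categories (canonical form, language, part of speech, subject field, definition, plus a transfer link between entries), not OLIF's actual element names:</div>
<pre>
#!/usr/bin/perl
# Minimal termbase storage sketch: one table of term entries with a few
# OLIF-flavored fields, plus a transfer table linking equivalents.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=termbase.db', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do(<<'SQL');
CREATE TABLE IF NOT EXISTS entry (
    id         INTEGER PRIMARY KEY,
    term       TEXT NOT NULL,
    lang       TEXT NOT NULL,
    pos        TEXT,
    subjfield  TEXT,
    definition TEXT
)
SQL

$dbh->do(<<'SQL');
CREATE TABLE IF NOT EXISTS transfer (
    from_id INTEGER NOT NULL REFERENCES entry(id),
    to_id   INTEGER NOT NULL REFERENCES entry(id)
)
SQL

# A couple of illustrative entries and a transfer link between them.
my $ins = $dbh->prepare(
    'INSERT INTO entry (term, lang, pos, subjfield) VALUES (?, ?, ?, ?)');
$ins->execute('Dampfkessel', 'de', 'noun', 'engineering');
my $de = $dbh->last_insert_id(undef, undef, 'entry', 'id');
$ins->execute('steam boiler', 'en', 'noun', 'engineering');
my $en = $dbh->last_insert_id(undef, undef, 'entry', 'id');

$dbh->do('INSERT INTO transfer (from_id, to_id) VALUES (?, ?)', undef, $de, $en);
</pre>
<div>
Real OLIF carries a great deal more per entry, but because the format is model-agnostic, a Lingua::OLIF reader could populate a table like this with whatever subset a given terminological source actually provides.</div>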
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3904594442791801969.post-30421685113820730142013-05-04T15:06:00.001-07:002013-05-04T15:06:08.905-07:00TreeTaggerThe Perl module <a href="https://metacpan.org/module/Lingua::TreeTagger">Lingua::TreeTagger</a> provides an interface to the <a href="http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/">TreeTagger program</a> (<a href="http://www.smo.uhi.ac.uk/~oduibhin/oideasra/interfaces/winttinterface.htm">Win installation</a> here). The only problem with TreeTagger is that it's got a commercial-license requirement. That, and the approach isn't good for Hungarian - but you can't have everything. This would probably be the best possible intermediate structure for frame-based translation, which I still think should be a valid approach.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3904594442791801969.post-47070839121105278412013-04-27T04:30:00.001-07:002013-04-27T04:30:21.884-07:00Word listsI really need some kind of principled way to keep track of word lists and terminology. Ideally this would be a full-blown terminology management system with an online component and everything, but it would also be a word list source.<br />
<br />
Here are some good places to start with word lists:<br />
<br />
<ul>
<li><a href="http://wordlist.sourceforge.net/">http://wordlist.sourceforge.net/</a></li>
<li><a href="http://www.puzzlers.org/dokuwiki/doku.php?id=solving:wordlists:start">http://www.puzzlers.org/dokuwiki/doku.php?id=solving:wordlists:start</a></li>
</ul>
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3904594442791801969.post-69078946029328914982012-12-10T08:22:00.000-08:002012-12-10T08:22:08.086-08:00File::TMXI just started scratching the surface for TMX files. I'm going to end up with some generalized useful tools for XML files.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3904594442791801969.post-49971386745987952602012-12-10T08:21:00.001-08:002012-12-10T08:21:09.338-08:00Terminology source: EMAThe <a href="http://www.emea.europa.eu/ema/index.jsp?curl=pages/medicines/human/medicines/000471/human_med_000619.jsp&mid=WC0b01ac058001d124">European Medicines Agency</a> publishes patient information about European-approved (not nationally approved) drugs in all the European languages. This would be a useful corpus for terminological (and syntactic) analysis.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3904594442791801969.post-49621587430978151372012-10-17T12:36:00.003-07:002012-10-17T12:37:50.944-07:00OLIF<a href="http://www.olif.net/index.htm">Open Lexicon Interchange Format</a> (OLIF) is an XML terminology format that SDL Multiterm 2009 can import. (In other news, unlike the first time I bought them, TRADOS 2007 and Multiterm 2009 are now interoperable. Must have been an upgrade between then and when I bought this laptop. Bodes well for my everyday work!)<br />
<br />
So ... building on OLIF and my new SQP tool, maybe it's time to consider writing that terminology database thing with a nice Perly wrapper.<br />
<br />
By the way, OLIF was initially supported by SAP (and it's an SAP-related job I'm working on right now!) and the OLIF Consortium is like a who's-who of the big players in the translation industry. So it's probably worth grokking.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3904594442791801969.post-52126664351927630402012-09-11T03:51:00.001-07:002012-09-11T03:51:21.233-07:00Translation of chemical names<br />
Here's a pretty fascinating <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2659868/">survey of chemical name translation</a> (I've been doing a lot of pharma translation this month). Turns out it's pretty tricky - looking at it, I'm not 100% sure it's as tricky as people make it out to be, because it's typical of language people that they find software magical, and typical of programmers to find natural language unreasonably hairy. But still - I think there's probably a (small) market for this kind of tool.<br />
<br />
<a href="http://semantic-programming.blogspot.hu/2012/09/translation-of-chemical-names.html">Cross-posted</a>.<br />
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3904594442791801969.post-82852326966354778212012-06-15T12:40:00.002-07:002012-06-15T12:40:46.572-07:00Terminology from patent databasesIt should be relatively easy to automate a crawl of any patent database and extract terminology from the abstracts and translations of abstracts.<br />
<br />
Just a thought.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3904594442791801969.post-89754810761768496832012-03-08T18:03:00.003-08:002012-03-08T18:23:30.512-08:00Terminology identification<span><span style="font-size: 100%;">On term extraction: I checked Okapi; they're just doing the same up-to-n-gram extraction I've always done and found wanting. Their English tokenizer is also comparable to mine (better tested, to be sure). So all in all, maybe I'm capable of doing competent work here. I need to do better testing, though.</span></span><div style="font-size: 100%;"><br /></div><div style="font-size: 100%;">But anyway, I figured maybe NLTK might have something more suited to my needs, so I searched on "NLTK term extraction" and came upon this <a href="http://groups.google.com/group/nltk-dev/browse_thread/thread/418f58d8b3ff5f22?pli=1">miscategorized post</a> on the topic at nltk-dev. That post led me to the Wikipedia page for <a href="http://en.wikipedia.org/wiki/Named_entity_recognition">named-entity recognition</a> (not so fantastically relevant but interesting nonetheless) and - gold - a suggestion to check Chapter 6 of <a href="http://nlp.stanford.edu/IR-book/information-retrieval-book.html">Manning, Ragavan, and Schütze</a> (which is one of the texts for the upcoming Stanford online NLP course, actually).</div><div style="font-family: Georgia, serif; font-size: 100%; font-style: normal; font-variant: normal; font-weight: normal; line-height: normal; "><br /></div><div style="font-size: 100%;">Chapter 6 addresses term weighting. Finding relevant terms for indexing of documents is equivalent to finding interesting terms for terminology searches, and it turns out that the best way to weight terms is by <a href="http://nlp.stanford.edu/IR-book/html/htmledition/inverse-document-frequency-1.html">inverse document frequency</a>. (Which makes sense; clustering of terms in documents indicates that they're low-entropy and contain more information than, say, "the", whose inverse document frequency is 1.)</div><div style="font-size: 100%;"><br /></div><div style="font-size: 100%;">Long story short, a term in a document is interesting in proportion to the number of times it appears in the document and in inverse proportion to the number of documents in which it appears in a given relevant corpus. Given that my corpus is about five million words of translation memories, I have a good corpus, provided that I organize it into something document-like.</div><div style="font-size: 100%;"><br /></div><div style="font-size: 100%;">I'm going to consider a "document" to be each group of entries in the TM clustered by date. Since not all my translation goes through one TM, I can pretty much guarantee that all my work will be easy to cluster; from that I can derive an overall set of terminology for the entire corpus and calculate inverse document frequency for each term. From that, I can score each term found in a new document. If it's a known term for which I already have a gloss, I'm happy. If it's a new term with high relevance for which I have no gloss, I can research it. And if it's a known term for which I have no gloss, I can study my own corpus to try to extract a likely candidate for translation.</div><div style="font-size: 100%;"><br /></div><div style="font-size: 100%;">(This leads to a terminology extraction tool, too - working on both target and source languages and trying to correlate presence in segments, I'll bet I can come up with some pretty good guesses at glosses. 
Make that service a free one and you ought to do well.)</div><div><br /></div><div>So I've got a pretty good plan for terminology identification at this point. Just have to find the time to implement it. Here's the list of subtasks (a sketch of the scoring step follows it):</div><div><ul><li>Tool to convert Trados TM to TMX without my getting my hands dirty. I love Win32 scripting anyway - and this can run on the laptop for a day or two to chew its way through my five million words.</li><li>Do that TMX module for Perl.</li><li>For all my TMs, cluster the segments into "documents".</li><li>Polish up a terminology extraction tool (e.g. the same n-grams-between-stop-words strategy I've used in the past).</li><li>Run said tool on the five million. Might want to do some kind of proper database and index at this point so I never have to do this again.</li><li>Calculate inverse document frequencies for everything.</li><li>Take a target TTX, extract terms, and classify them. This is the actual task.</li></ul></div>
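<div>
The scoring behind those last couple of steps is the easy part once the clustering is done. A sketch - plain log-scaled tf-idf, with naive word-grabbing standing in for the real n-grams-between-stop-words extractor, and each date-clustered "document" assumed to be dumped to a plain text file beforehand:</div>
<pre>
#!/usr/bin/perl
# Score terms by tf-idf over the date-clustered "documents".
# Each document is a plain text file of segment text; the word-grab regex
# stands in for the real term extractor.
use strict;
use warnings;

my @docs = @ARGV;                  # one file per clustered "document"
my $N    = scalar @docs;
die "usage: $0 doc1.txt doc2.txt ...\n" unless $N;

my (%df, @tf);
for my $doc (@docs) {
    open my $fh, '<', $doc or die "$doc: $!";
    my %tf;
    while (my $line = readline $fh) {
        $tf{lc $1}++ while $line =~ /([A-Za-z][A-Za-z-]+)/g;
    }
    close $fh;
    $df{$_}++ for keys %tf;        # document frequency: in how many docs?
    push @tf, \%tf;
}

# Weight of term t in document d: tf(t,d) * log(N / df(t)).
for my $i (0 .. $#docs) {
    my $tf = $tf[$i];
    my @scored = sort { $b->[1] <=> $a->[1] }
                 map  { [ $_, $tf->{$_} * log($N / $df{$_}) ] }
                 keys %$tf;
    my $top = $#scored > 19 ? 19 : $#scored;
    print "== $docs[$i]\n";
    printf "  %8.3f  %s\n", $_->[1], $_->[0] for @scored[0 .. $top];
}
</pre>
<div>
A term that turns up in every document gets N/df = 1 and therefore weight zero - the "the" case above - while a term that clusters in just a few documents floats to the top of its document's list.</div>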
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3904594442791801969.post-22703246441334569782012-02-29T20:00:00.003-08:002012-02-29T20:02:05.870-08:00Okapi<a href="http://www.opentag.com/okapi/wiki/index.php?title=Main_Page">Okapi</a> (Java) is a pretty comprehensive set of open-source tools to facilitate the translation process - including a simple workflow manager. (You can group sets of steps together to define your own processes, a technique I'm going to steal.)<div><br /></div><div>One more thing to take note of.</div><div><br /></div><div>Incidentally, its <a href="http://www.opentag.com/okapi/wiki/index.php?title=Main_Page">token types</a> are rather similar to the ones I've proposed.</div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3904594442791801969.post-21102480889355267842012-02-13T11:30:00.000-08:002012-02-13T11:31:55.314-08:00Task: concordance-to-glossary toolI want to be able to look up one or more terms in a TM in the same way that concordances work now, then <i>make a decision</i> for a given document or customer, then have that decision checked globally.
I'm most of the way to having this ready to go.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3904594442791801969.post-87069381418566321402012-02-13T11:27:00.000-08:002012-02-13T11:32:09.799-08:00Task: find "actionable" terms in a given sourceThis is probably solved by NLTK somehow, but given a source text I want to be able to find probable glossary items to be researched and to be checked against a TM or glossary.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3904594442791801969.post-35883219836001602822012-01-24T19:32:00.000-08:002012-01-24T19:34:20.573-08:00Task: Generalize File::XLIFF to work on zipped XLIFFThe files Lionbridge uses in their XLIFF editor are actually zipped XLIFF (with a .xlz extension) and include a "skeleton" file that seems to have some kind of information about placeables.<div><br /></div><div>It would be nice to have a way of dealing with those for batch manipulation (global find-and-replace, etc.).</div>Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-3904594442791801969.post-32633307426043557582012-01-07T19:45:00.000-08:002012-01-07T19:51:28.249-08:00OpenTag, TMX, and translation memory manipulationHere's an interesting thing: <a href="http://opentag.com/">opentag.com</a>, including the format definitions for TMX and a few other rather fascinating XML interchange formats (including one for segmentation rules!).<div><br /></div><div>I'm off onto a new tangent: a TMX manipulation module. I still don't have a fantastic API for it, but you know, I think I'm going to dump the xmlapi for real now. It's been 12 years and I think it's time to move on. So I'm going to rewrite File::TTX to work with a different XML library (probably XML::Reader/XML::Writer) and do the same with TMX. This will allow me to choose between loading the file into memory in toto, or just writing a stream processor to filter things out on the fly for really large files.</div><div><br /></div><div>I envision an overarching Xlat::TM API that will work with File::TMX specifically, and perhaps with others if and when.</div>Unknownnoreply@blogger.com4tag:blogger.com,1999:blog-3904594442791801969.post-79573677775986936852011-11-07T08:16:00.000-08:002011-11-07T08:24:46.565-08:00TRADOS XML noncompliantSo I'm working on a command-line utility for doing things with TTX files and ran into an unpleasantness: TTX files that are generated with the Word converter from Word documents with soft hyphens contain hex 0x1F values - but those values are illegal in XML. And when the XML standard says "illegal" they actually mean you're not supposed to call any parser that accepts them an XML parser.<div><br /></div><div>This is really quite dismaying - and I can only imagine the discussions that must have gone on at TRADOS when they clearly subverted this restriction in their XML parser. It would have been far cleaner from an XML standpoint to have translated soft hyphens into a tag - but that would have made the editing experience far less clean. So they were stuck.</div><div><br /></div><div>And now, I am too - I have to preprocess all TTX before passing it through the XML parser, which is a performance hit (which doesn't bother me too much) - but far worse, non-preprocessed TTX will infect the TM, so if I now make changes to the sanitized file and write it back out, it won't match the TM.
This would be OK if we could be sure of sanitization before the TM were affected, but that's clearly too much to hope for in most real-world agency/freelancer workflows.</div><div><br /></div><div>It's also rather nasty that TMX dumps from an infected TM will also contain 0x1F characters - meaning non-TRADOS tools won't be able to parse those, either. And they <i>are</i> supposed to be interoperable.</div><div><br /></div><div>I think as a matter of policy I'm just going to sanitize and not worry overmuch about the rather small operability hit - at least until some actual project requires me to worry about it. Then I'll cross that bridge.</div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3904594442791801969.post-57529632185643349682011-11-03T17:46:00.000-07:002011-11-03T17:47:14.197-07:00EnchantThe future of open source spelling may be <a href="http://www.abisource.com/projects/enchant/">Enchant</a>. There's no Perl binding. Yet.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3904594442791801969.post-74575794507939551652011-11-03T16:40:00.000-07:002011-11-03T17:47:52.294-07:00Text::Aspell on Win32 - non-trivial<a href="http://aspell.net/">Aspell</a> is the default open-source spell checking engine; its Perl binding is <a href="http://search.cpan.org/~hank/Text-Aspell/Aspell.pm">Text::Aspell</a>. The problem is that both Aspell and Text::Aspell are developed on Unix, and Things Are Different under Windows and MINGW32. Not insuperably different, but different enough that if you're the first person to try something, you'll live to regret it.<div><br /></div><div>OK. So, first things first; the <a href="http://aspell.net/win32/">W32 installation</a> of Aspell is back a release version but very stable. It doesn't actually have the include and library files bundled, but they're readily available - the problem being that W32 Aspell is developed with MSVC, and Strawberry Perl (my Perl of choice) compiles with MINGW32. Joy. So the library files are useless; we have to build our own. But let's make <i>include</i> and <i>lib</i> directories under Aspell.</div><div><br /></div><div>Now, we set environment variables: CPATH should point (at least) to the Aspell include directory and LIBRARY_PATH to the Aspell lib directory. Don't forget that your PATH should also include Aspell's bin directory - which will make it easier to use Aspell's command line tools for your dictionary maintenance anyway. So do it!</div><div><br /></div><div>Figuring out those environment variables, by the way, cost me about three hours. The remainder of the day was occupied with the next step: building a .DEF file that dlltool likes (some help was had from <a href="http://www.emmestech.com/moron_guides/moron1.html">this page</a> in remembering how a .DEF file is supposed to work), and then finding the appropriate combination of dlltool parameters. Turns out this:</div><div><pre> dlltool -d libaspell.defined --dllname aspell-15.dll<br /> --output-lib libaspell.a --kill-at</pre>is the <i>only</i> incantation that will work. Leaving out the --dllname, <i>even though it is specified in the .DEF file</i>, will cause linkage failures at runtime. Not helpful ones, either. 
This took me four hours, ultimately culminating in <a href="http://www.willus.com/mingw/yongweiwu_stdcall.html">this page</a>, which at least mentions the --dllname parameter.</div><div><br /></div><div>When dynamically linked, Aspell assumes the location of the DLL linked is either the root for dictionary searches or is in a 'bin' directory which is itself in the root for dictionary searches - in either case, the 'dict' directory of that root is where dictionaries should be. I had placed a local DLL in the Text::Aspell directory while flailing around; it took me half an hour to remember that.</div><div><br /></div><div>Anyway, I finally managed to get it running. Next step: extract words from a TTX to throw against it.</div><div><br /></div>Unknownnoreply@blogger.com0
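<div>
A follow-up sketch for that last step, for the record: scrub the 0x1F bytes (the same illegal characters complained about a few posts back), crudely strip the markup, and throw every remaining word against Text::Aspell. Restricting the check to the target-language Tuv elements would be the obvious refinement; this is about exercising the speller, not about parsing TTX properly.</div>
<pre>
#!/usr/bin/perl
# Throw the words from a TTX file against Aspell and report the unknowns.
# The 0x1F bytes TRADOS leaves behind for soft hyphens are stripped first,
# since no honest XML parser will take them; tag stripping is deliberately
# regex-crude because this is a throwaway checker, not a TTX parser.
use strict;
use warnings;
use Text::Aspell;

my $file = shift or die "usage: $0 file.ttx\n";

my $speller = Text::Aspell->new or die "Can't create speller\n";
$speller->set_option('lang', 'en_US');
die $speller->errstr if $speller->errstr;

open my $fh, '<', $file or die "$file: $!";
my $ttx = do { local $/; readline $fh };
close $fh;

$ttx =~ s/\x1F//g;          # sanitize the illegal soft-hyphen bytes
$ttx =~ s/<[^>]*>/ /g;      # crude tag strip
$ttx =~ s/&[a-z]+;/ /gi;    # and entity strip

my %seen;
for my $word ($ttx =~ /([A-Za-z][A-Za-z'-]+)/g) {
    next if $seen{lc $word}++;
    next if $speller->check($word);
    my @sugg = grep { defined } ($speller->suggest($word))[0 .. 4];
    printf "%-25s %s\n", $word, join ', ', @sugg;
}
</pre>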