Thursday, October 23, 2014

TransTools for Visio

Not terribly Perl-specific, obviously, but here's the first tool I've ever found to help with translating Visio documents and embeds: TransTools for Visio. It's not perfect (what in the Microsoft ecosystem ever is?), but it's a damn sight better than slogging through by hand, and it's a point of departure should I run into more Visio in the future. For some reason, tooling like this is sadly rare.

As things stand, with Word 2003 and Visio 2007, translating in TRADOS 2011, the workflow is:

  • Open the TransTools for Visio file in Visio.
  • Find a Visio drawing. If it opens directly, great, otherwise:
  • Cut and paste the drawing into a new document and open it directly there.
  • Select All and copy out everything in the drawing.
  • Paste into a new, unnamed drawing in the same Visio process as the open TTV drawing that contains the macros.
  • Run the magical macro to search the unnamed drawing and build a Word table.
  • Copy that table into a new document, because the macro doesn't open the table in a full Word instance with menus (this could be fixed, obviously).
  • Drag that new document into the source language side of SDL.
  • Prepare that for translation.
  • Move to the target language and translate. Don't use tabs, as they'll cause trouble three steps from now.
  • Save the target document.
  • Open that document, select the left column, and paste into the original interim document, then copy the whole table.
  • Run the magical macro to search-and-replace the unnamed drawing. This uses a tab-delimited paste of the table, so tabs in your text will be Very Bad Indeed.
  • Select All in the unnamed drawing, and copy.
  • Delete everything in the embedded drawing, and paste the translated bits.
  • Jockey around to align properly. This is pretty manual. Sometimes you need to adjust text sizes as well, so actually this probably has to be manual.
  • If you copied the drawing out into a separate document, now copy it back in. Adjust for location and size if that step screwed things up, which it appears always to do (I don't have enough experience yet to say for sure).
That's a pretty lengthy process, and yet it's still far superior, in terms of mental load and quality control, to translating without a real tool. Everything except the translation itself and the last two steps could be automated, some of it during job preparation (which is obviously a good thing, because otherwise your word counts will be off).
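The tab-sensitive part of that round trip can be sketched as a standalone model. To be clear, these function names are my own invention - TransTools does all of this inside Visio and Word - and this only mimics what appears to be its tab-delimited exchange format:

```python
# Standalone model of the tab-delimited round trip that makes tabs in
# the translation dangerous: one segment per row, "id<TAB>text".

def export_segments(segments):
    """Build the tab-delimited table handed to the translator."""
    return "\n".join(f"{i}\t{text}" for i, text in enumerate(segments))

def import_segments(table):
    """Split the translated table back into (id, text) pairs.

    Splitting happens on the *first* tab in each row; a tab inside the
    translated text would shift everything after it, which is why tabs
    are Very Bad Indeed."""
    rows = []
    for line in table.splitlines():
        ident, _, text = line.partition("\t")
        rows.append((int(ident), text))
    return rows

source = ["Pumpe", "Ventil öffnen"]
table = export_segments(source)
translated = table.replace("Pumpe", "Pump").replace("Ventil öffnen", "Open valve")
print(import_segments(translated))  # [(0, 'Pump'), (1, 'Open valve')]
```

A tab in a translated segment would shift every field after it on that row, which is exactly the failure mode warned about in the steps above.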

So. Very nice tool.

Sunday, July 6, 2014

Marpa or ParZu?

I spent most of May working through my old natural-language tokenizer, adding a vocabulary-driven lexer/lexicon for German, all in preparation for undertaking a Marpa-based German parser. That's looking halfway decent at this point (except I need to do much better stemming), and then I decided to do a general search on German parsers and found ParZu.
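For concreteness, here's the sort of naive suffix-stripping stemmer I mean - a sketch of my own, not the actual tokenizer code, and nowhere near good enough, which is the point:

```python
# Naive German suffix-stripping stemmer. Real German stemming also has
# to handle umlaut alternation (Haus/Häuser) and compound splitting,
# none of which this does - hence "much better stemming" on the to-do list.

SUFFIXES = ["ungen", "ung", "heit", "keit", "en", "er", "es", "e", "n", "s"]

def stem(word):
    """Strip the first (longest-first) matching suffix, keeping a
    minimum stem length of three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["Übersetzungen", "Häuser", "Lösung"]:
    print(w, "->", stem(w))
```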

The unusual thing about ParZu, among parsers especially, is that it's fully open source. That is, it has a free license, not a free-for-academics-only license - and it's hosted on GitHub. Also, I can try it online. So I fed it some more-or-less hairy sentences from my current translation in progress - and it parsed them perfectly.

So here's the thing. I kind of want to do my own work and come to terms with the hairiness of things myself. And then on the other hand, parsing German by any means would allow me to jump ahead and maybe start doing translation-related tasks directly....

It's a dilemma.

Update 2014-09-26: Maybe not such a dilemma. ParZu is written in Prolog and I'm just not sure I'm up for that. It honestly seems it would be easier to do it in Marpa...

This is probably incorrect. But I think I'm going to start finding out, this week.

Sunday, June 8, 2014

Compiling corpora

So there's this German news corpus obtained between 1996 and 2000 from online retrieval that I intend to use for some of my NLP work, and it occurred to me that I could build a similar corpus (well, the monolingual side of it, anyway) by doing my own periodic retrievals.

To that end, here are the RSS feed pages for the Süddeutsche Zeitung, Népszabadság Online, and the Népszava (published in New York for Hungarian-Americans).
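The periodic-retrieval idea can be sketched like so. The feed below is an inline sample so the sketch runs without network access; in practice you'd fetch each feed URL (with urllib or whatever) on a cron schedule and append the output to a dated corpus file:

```python
# Sketch of a corpus harvester: parse an RSS feed, strip markup,
# emit one plain-text line per item.

import re
import xml.etree.ElementTree as ET

SAMPLE_FEED = """<rss version="2.0"><channel>
  <item><title>Beispieltitel</title>
        <description>Ein &lt;b&gt;kurzer&lt;/b&gt; Beispieltext.</description></item>
</channel></rss>"""

def harvest(feed_xml):
    """Yield one tab-separated line per item: title and description,
    with any embedded HTML tags stripped out."""
    root = ET.fromstring(feed_xml)
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        desc = re.sub(r"<[^>]+>", "", item.findtext("description", default=""))
        yield f"{title}\t{desc}"

for line in harvest(SAMPLE_FEED):
    print(line)
```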

Analysis of chemical names

Turns out the linguistic structure of chemical names is non-trivial. Unfortunately, as it's also quite profitable, it all seems to be behind paywalls, but I'm visiting Bloomington this summer and will have the opportunity to spend some time in the library, so this is one of the things I hope to make some headway on.

In the meantime, here's a paywalled article from the promisingly named Journal of Chemical Information and Modeling, which describes an early version of Name>Struct, a closed-source interpreter for chemical names that strives to understand them in a way similar to a human chemist - that is, they attempt to model actual usage, not just reflect the official definitions of usage. Descriptive, not prescriptive, chemical linguistics.

Anyway, the folks at CambridgeSoft who make Name>Struct have also highlighted some of the pitfalls of chemical linguistics here.

Ah - silly me. A search on "Name>Struct open source" quickly returns OPSIN, an open-source algorithm that I could probably adapt pretty easily. It's here at BitBucket, and written in (shudder) Java. Nifty Web interface here.
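Just to illustrate how these names decompose, here's a toy tokenizer - my own illustration, emphatically not OPSIN's algorithm, with a vocabulary of exactly five stems. Even at this scale you can see locants, multipliers, stems, and suffixes interleaving:

```python
# Toy tokenizer for systematic chemical names, illustrating why their
# linguistic structure is non-trivial. Covers only a handful of tokens;
# it silently drops anything it doesn't recognize (e.g. the elided "e"
# in "butan-2-ol").

import re

TOKEN = re.compile(r"""
    (?P<locants>\d+(?:,\d+)*) |        # position numbers: 2 or 2,3
    (?P<multiplier>di|tri|tetra) |     # a few Greek multipliers
    (?P<stem>meth|eth|prop|but|pent) | # a few alkane stems
    (?P<suffix>ane|ene|yne|yl|ol)      # saturation / substituent suffixes
""", re.VERBOSE)

def tokenize(name):
    """Return (token-class, text) pairs for every recognized token."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(name.lower())]

print(tokenize("2,3-dimethylbutan-2-ol"))
```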

Monday, April 28, 2014

TRADOS 2007 has an OLE dispatch API

I wish I'd known this years ago! TRADOS TagEditor 2007 exposes an OLE API. Workbench doesn't appear to. SDL doesn't support it any more, either. But here are some references to it I ran across the other day. I'm going to try to dump the typelib soon (I have no time at all until Thursday).

Friday, April 18, 2014

GlobalSight and Microsoft Translator

Two links to software resources: GlobalSight is an open-source translation manager, and this Microsoft Translator page is probably a pretty decent definitive list of who to watch in the industry.

Wednesday, March 19, 2014

Vocabulary in English

I've been trying to find resources pertaining to the frequency of words in English - surely there must be some kind of graded scale of "commonness" or something? But so far I can't find anything that organized.

Instead, I've got two interesting links here:

If we consider each of those and their ilk to be "general vocabulary words", then we'll have that much more luck in identifying technical content in a given document.
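That filtering idea is simple enough to sketch. The word list here is a made-up stand-in for a real frequency-ranked resource, which is the part I still haven't found:

```python
# Sketch: treat a frequency-ranked list of common words as "general
# vocabulary" and flag everything else as candidate technical content.
# The tiny GENERAL set is a placeholder for a real frequency list.

GENERAL = {"the", "of", "a", "to", "and", "in", "is", "that", "for", "it"}

def technical_candidates(text):
    """Return the deduplicated, sorted words not found in the
    general-vocabulary list."""
    words = [w.strip(".,;:").lower() for w in text.split()]
    return sorted({w for w in words if w and w not in GENERAL})

sample = "The titration of the analyte is performed in a buffered solution."
print(technical_candidates(sample))
# → ['analyte', 'buffered', 'performed', 'solution', 'titration']
```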