<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-3904594442791801969</id><updated>2012-02-13T11:32:09.788-08:00</updated><category term='Systran'/><category term='command-line utilities'/><category term='workflow'/><category term='PDF'/><category term='XLIFF'/><category term='terminology'/><category term='spell checking'/><category term='openlogos'/><category term='TTX'/><category term='links'/><category term='writing quality'/><category term='File::TTX'/><category term='roadmap'/><category term='NooJ'/><category term='tasks'/><category term='File::XLIFF'/><category term='machine translation'/><category term='configuration'/><category term='craft'/><category term='natural language'/><category term='Xlat::Declarative'/><category term='character encodings'/><category term='Xlat::File'/><category term='TM'/><category term='Text::Aspell'/><category term='TRADOS'/><category term='term extraction'/><category term='OCR'/><category term='segmentation'/><category term='syntax editor'/><title type='text'>The Xlat Project</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' 
src='http://www.vivtek.com/images/me.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>41</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-2110248088935526784</id><published>2012-02-13T11:30:00.000-08:00</published><updated>2012-02-13T11:31:55.314-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='terminology'/><category scheme='http://www.blogger.com/atom/ns#' term='tasks'/><title type='text'>Task: concordance-to-glossary tool</title><content type='html'>I want to be able to look up one or more terms in a TM in the same way that concordances work now, then &lt;i&gt;make a decision&lt;/i&gt; for a given document or customer, then have that decision checked globally.  I'm most of the way to having this ready to go.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-2110248088935526784?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/2110248088935526784/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2012/02/task-concordance-to-glossary-tool.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/2110248088935526784'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/2110248088935526784'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2012/02/task-concordance-to-glossary-tool.html' title='Task: concordance-to-glossary tool'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-8706938141856632140</id><published>2012-02-13T11:27:00.000-08:00</published><updated>2012-02-13T11:32:09.799-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='term extraction'/><category scheme='http://www.blogger.com/atom/ns#' term='tasks'/><title type='text'>Task: find "actionable" terms in a given source</title><content type='html'>This is probably solved by NLTK somehow, but given a source text I want to be able to find probable glossary items to be researched and to be checked against a TM or glossary.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-8706938141856632140?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/8706938141856632140/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2012/02/task-find-actionable-terms-in-given.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/8706938141856632140'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/8706938141856632140'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2012/02/task-find-actionable-terms-in-given.html' title='Task: find &quot;actionable&quot; terms in a given source'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' 
src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-3588321983600160282</id><published>2012-01-24T19:32:00.000-08:00</published><updated>2012-01-24T19:34:20.573-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='File::XLIFF'/><category scheme='http://www.blogger.com/atom/ns#' term='tasks'/><title type='text'>Task: Generalize File::XLIFF to work on zipped XLIFF</title><content type='html'>The files Lionbridge uses in its XLIFF editor are actually zipped XLIFF (with a .xlz extension) and include a "skeleton" file that appears to carry information about placeables.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It would be nice to have a way of dealing with those for batch manipulation (global find-and-replace, etc.).&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-3588321983600160282?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/3588321983600160282/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2012/01/task-generalize-filexliff-to-work-on.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/3588321983600160282'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/3588321983600160282'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2012/01/task-generalize-filexliff-to-work-on.html' title='Task: Generalize File::XLIFF to work on zipped XLIFF'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-3263330742604355758</id><published>2012-01-07T19:45:00.000-08:00</published><updated>2012-01-07T19:51:28.249-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='TM'/><title type='text'>OpenTag, TMX, and translation memory manipulation</title><content type='html'>Here's an interesting thing: &lt;a href="http://opentag.com/"&gt;opentag.com&lt;/a&gt;, including the format definitions for TMX and a few other rather fascinating XML interchange formats (including one for segmentation rules!)&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I'm off onto a new tangent: a TMX manipulation module.  I still don't have a fantastic API for it, but you know, I think I'm going to dump the xmlapi for real now.  It's been 12 years now and I think it's time to move on.  So I'm going to rewrite File::TTX to work with a different XML library (probably XML::Reader/XML::Writer) and do the same with TMX.  
This will allow me to choose between loading the file into memory in toto, or just writing a stream processor to filter things out on the fly for really large files.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I envision an overarching Xlat::TM API that will work with File::TMX in specific, and perhaps with others if and when.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-3263330742604355758?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/3263330742604355758/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2012/01/opentag-tmx-and-translation-memory.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/3263330742604355758'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/3263330742604355758'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2012/01/opentag-tmx-and-translation-memory.html' title='OpenTag, TMX, and translation memory manipulation'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://www.vivtek.com/images/me.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-7957367777598693685</id><published>2011-11-07T08:16:00.000-08:00</published><updated>2011-11-07T08:24:46.565-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='TTX'/><category scheme='http://www.blogger.com/atom/ns#' term='TRADOS'/><title type='text'>TRADOS XML noncompliant</title><content type='html'>So I'm working on a command-line utility for doing 
things with TTX files and ran into an unpleasantness: TTX files that are generated with the Word converter from Word documents with soft hyphens contain raw 0x1F control characters - but those characters are illegal in XML 1.0.  And when the XML standard says "illegal" it actually means you're not supposed to call any parser that accepts them an XML parser.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This is really quite dismaying - and I can only imagine the discussions that must have gone on at TRADOS when they clearly circumvented this restriction in their XML parser.  It would have been far cleaner from an XML standpoint to have translated soft hyphens into a tag - but that would have made the editing experience far less clean.  So they were stuck.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;And now, I am too - I have to preprocess all TTX before passing it through the XML parser, which is a performance hit (which doesn't bother me too much) - but far worse, non-preprocessed TTX will infect the TM, so if I now make changes to the sanitized file and write it back out, it won't match the TM.  This would be OK if we could be sure of sanitization before the TM were affected, but that's clearly too much to hope for in most real-world agency/freelancer workflows.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It's also rather nasty that TMX dumps from an infected TM will also contain 0x1F characters - meaning non-TRADOS tools won't be able to parse those, either.  And they &lt;i&gt;are&lt;/i&gt; supposed to be interoperable.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I think as a matter of policy I'm just going to sanitize and not worry overmuch about the rather small interoperability hit - at least until some actual project requires me to worry about it.  
Then I'll cross that bridge.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-7957367777598693685?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/7957367777598693685/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2011/11/trados-xml-noncompliant.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/7957367777598693685'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/7957367777598693685'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2011/11/trados-xml-noncompliant.html' title='TRADOS XML noncompliant'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-5752963218564334968</id><published>2011-11-03T17:46:00.000-07:00</published><updated>2011-11-03T17:47:14.197-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='spell checking'/><title type='text'>Enchant</title><content type='html'>The future of open source spelling may be &lt;a href="http://www.abisource.com/projects/enchant/"&gt;Enchant&lt;/a&gt;.  There's no Perl binding.  
Yet.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-5752963218564334968?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/5752963218564334968/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2011/11/enchant.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/5752963218564334968'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/5752963218564334968'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2011/11/enchant.html' title='Enchant'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-7457579450793955165</id><published>2011-11-03T16:40:00.000-07:00</published><updated>2011-11-03T17:47:52.294-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Text::Aspell'/><category scheme='http://www.blogger.com/atom/ns#' term='spell checking'/><title type='text'>Text::Aspell on Win32 - non-trivial</title><content type='html'>&lt;a href="http://aspell.net/"&gt;Aspell&lt;/a&gt; is the default open-source spell checking engine; its Perl binding is &lt;a href="http://search.cpan.org/~hank/Text-Aspell/Aspell.pm"&gt;Text::Aspell&lt;/a&gt;.  The problem is that both Aspell and Text::Aspell are developed on Unix, and Things Are Different under Windows and MINGW32.  
Not insuperably different, but different enough that if you're the first person to try something, you'll live to regret it.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;OK.  So, first things first; the &lt;a href="http://aspell.net/win32/"&gt;W32 installation&lt;/a&gt; of Aspell is one release behind the current version but very stable.  It doesn't actually have the include and library files bundled, but they're readily available - the problem being that W32 Aspell is developed with MSVC, and Strawberry Perl (my Perl of choice) compiles with MINGW32.  Joy.  So the library files are useless; we have to build our own.  But let's make &lt;i&gt;include&lt;/i&gt; and &lt;i&gt;lib&lt;/i&gt; directories under the Aspell installation directory.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now, we set environment variables: CPATH should point (at least) to the Aspell include directory and LIBRARY_PATH to the Aspell lib directory.  Don't forget that your PATH should also include Aspell's bin directory - which will make it easier to use Aspell's command line tools for your dictionary maintenance anyway.  So do it!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Figuring out those environment variables, by the way, cost me about three hours.  The remainder of the day was occupied with the next step: building a .DEF file that dlltool likes (some help was had from &lt;a href="http://www.emmestech.com/moron_guides/moron1.html"&gt;this page&lt;/a&gt; in remembering how a .DEF file is supposed to work), and then finding the appropriate combination of dlltool parameters.  Turns out this:&lt;/div&gt;&lt;div&gt;&lt;pre&gt;   dlltool -d libaspell.defined --dllname aspell-15.dll&lt;br /&gt;             --output-lib libaspell.a --kill-at&lt;/pre&gt;is the &lt;i&gt;only&lt;/i&gt; incantation that will work.  Leaving out the --dllname, &lt;i&gt;even though it is specified in the .DEF file&lt;/i&gt;, will cause linkage failures at runtime.  Not helpful ones, either.  
This took me four hours, ultimately culminating in &lt;a href="http://www.willus.com/mingw/yongweiwu_stdcall.html"&gt;this page&lt;/a&gt;, which at least mentions the --dllname parameter.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;When dynamically linked, Aspell assumes that the DLL it was loaded from sits either in the root for dictionary searches or in a 'bin' directory directly under that root - in either case, the 'dict' directory of that root is where dictionaries should be.  I had placed a local DLL in the Text::Aspell directory while flailing around, so the dictionary search started from the wrong place; it took me half an hour to remember that.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Anyway, I finally managed to get it running.  Next step: extract words from a TTX to throw against it.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-7457579450793955165?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/7457579450793955165/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2011/11/textaspell-on-win32-non-trivial.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/7457579450793955165'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/7457579450793955165'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2011/11/textaspell-on-win32-non-trivial.html' title='Text::Aspell on Win32 - non-trivial'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' 
src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-1820428863224979614</id><published>2011-09-06T14:03:00.001-07:00</published><updated>2011-09-06T14:07:34.557-07:00</updated><title type='text'>Blogs to follow</title><content type='html'>Here are a couple of blogs I'm going to be following:&lt;div&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.translationtribulations.com/"&gt;Translation Tribulations&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://kv-emptypages.blogspot.com/"&gt;eMpTy Pages&lt;/a&gt; (on machine translation and the industry thereof)&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;It's really shocking how little I know about my adoptive industry.&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-1820428863224979614?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/1820428863224979614/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2011/09/blogs-to-follow.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/1820428863224979614'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/1820428863224979614'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2011/09/blogs-to-follow.html' title='Blogs to follow'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' 
src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-8539712503388064031</id><published>2011-06-14T13:51:00.001-07:00</published><updated>2011-06-14T13:54:56.760-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Xlat::Declarative'/><category scheme='http://www.blogger.com/atom/ns#' term='command-line utilities'/><title type='text'>General TTX utility</title><content type='html'>So File::TTX may be slipping ever closer to irrelevance, but I'm still using it for a number of things.  The only problem is, it's a pain always having to write a special-purpose Perl script just to change, say, the source language of a TTX.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Obviously, a command-line program would be the first step towards usability.  (And way easier than a GUI program, obviously.)  Let this stand as my to-do for that command-line utility.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Also: I think it's time to admit that I'm going to write the UI portions of the Xlat project in Decl, not plain Perl.  This will probably require the definition of a Xlat::Declarative module.  
(That's a good thing.)&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-8539712503388064031?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/8539712503388064031/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2011/06/general-ttx-utility.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/8539712503388064031'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/8539712503388064031'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2011/06/general-ttx-utility.html' title='General TTX utility'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-3640847685497693471</id><published>2011-06-02T15:44:00.001-07:00</published><updated>2011-06-02T15:45:25.669-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='terminology'/><title type='text'>Automotive engineering glossary HU&gt;EN</title><content type='html'>Wow!  
&lt;a href="http://www.jarmuszotar.hu/jarmuszotar_hunen.php?"&gt;This is a great find&lt;/a&gt;!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-3640847685497693471?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/3640847685497693471/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2011/06/automotive-engineering-glossary-huen.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/3640847685497693471'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/3640847685497693471'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2011/06/automotive-engineering-glossary-huen.html' title='Automotive engineering glossary HU&gt;EN'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-2608346575018294155</id><published>2011-05-15T10:10:00.000-07:00</published><updated>2011-05-15T11:07:03.648-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='terminology'/><title type='text'>Terminology</title><content type='html'>Another non-Xlat post!&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Automotive terminology is kind of tricky and I'm finding it hard to find good references - although I'm seeing more demand.  
Here are a couple of links not to forget.&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;A &lt;a href="http://www.ats-group.net/glossaries/glossary-lexicon-automotive.html"&gt;list of glossaries&lt;/a&gt; from ATS Group.&lt;/li&gt;&lt;li&gt;&lt;a href="http://translationjournal.net/journal//04auto.htm"&gt;Translation Journal&lt;/a&gt; has a short glossary.&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;Second topic: I really want to mine the SAP help site for accounting terminology.  Here's &lt;a href="http://help.sap.com/saphelp_sbo88/helpdata/it/6b/915a9ad88345b580e84f2c6489b01d/content.htm"&gt;just a teaser link&lt;/a&gt; that's been open on my browser for a couple of weeks now - the technique is simple.  Google "site:help.sap.com xxx" for a likely term, then replace the language code in the link ("it" in the teaser link) with "en".  Then align your results.  It works!  A list of likely terms (from a tagger, perhaps) is the right place to start.&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A generalized terminology research framework would be useful.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-2608346575018294155?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/2608346575018294155/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2011/05/terminology.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/2608346575018294155'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/2608346575018294155'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2011/05/terminology.html' 
title='Terminology'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-1140241616796030083</id><published>2011-03-12T16:59:00.000-08:00</published><updated>2011-03-12T17:00:06.954-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='File::TTX'/><title type='text'>File::TTX 0.03 released</title><content type='html'>I haven't been moving very fast on this project, have I?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-1140241616796030083?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/1140241616796030083/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2011/03/filettx-003-released.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/1140241616796030083'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/1140241616796030083'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2011/03/filettx-003-released.html' title='File::TTX 0.03 released'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-2203691029522697416</id><published>2010-12-04T07:37:00.000-08:00</published><updated>2010-12-04T08:23:45.597-08:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='PDF'/><category scheme='http://www.blogger.com/atom/ns#' term='machine translation'/><category scheme='http://www.blogger.com/atom/ns#' term='workflow'/><title type='text'>That whole MT project</title><content type='html'>OK, so the post-editing project I foolishly agreed to help with consisted of:&lt;div&gt;&lt;ul&gt;&lt;li&gt;OCR with Able2Extract&lt;/li&gt;&lt;li&gt;MT with a mixture of (I think) Google Translate and Systran&lt;/li&gt;&lt;li&gt;First-pass proofreading&lt;/li&gt;&lt;li&gt;Second-pass post-editing&lt;/li&gt;&lt;/ul&gt;So let's talk about that.  A far, far better workflow would have been:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;OCR with whatever&lt;/li&gt;&lt;li&gt;Source-language spell checking and correction&lt;/li&gt;&lt;li&gt;Identification of key phrases and terminology as cues for MT&lt;/li&gt;&lt;li&gt;TRADOS or similar to avoid rework of existing sentences&lt;/li&gt;&lt;li&gt;MT with whatever&lt;/li&gt;&lt;li&gt;Target-language spell checking, feeding results back through MT until at least everything is English&lt;/li&gt;&lt;li&gt;First-pass post-editing&lt;/li&gt;&lt;li&gt;Second-pass proofreading&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;This workflow uses (or at least could use) the exact same tools as above, but without the introduction of errors at each step that make later steps impossible to manage.  First-pass post-editing should be done by a bilingual translator, using specialized post-editing tools (not yet written) plus a normal translation memory (and of course the TM should also be used before passing text off to the MT stage). 
Systematic errors should be documented and recycled through the MT process.&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;One key insight: terminology research really starts to get a lot more important in this workflow than in normal CAT.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-2203691029522697416?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/2203691029522697416/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2010/12/that-whole-mt-project.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/2203691029522697416'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/2203691029522697416'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2010/12/that-whole-mt-project.html' title='That whole MT project'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-2656593638999014531</id><published>2010-12-02T16:16:00.000-08:00</published><updated>2010-12-02T16:23:21.037-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='syntax editor'/><title type='text'>More thoughts on a non-stupid text editor</title><content type='html'>I'm doing some post-editing for Portuguese today (I know, I know, &lt;i&gt;never do MT post-editing&lt;/i&gt;, but this customer is a good one and I just couldn't say no).  
As usual with post-Systran work, there is a lot of dragging and dropping involved, and frankly?  Word freaking &lt;u style="font-style: italic; "&gt;sucks&lt;/u&gt; at dragging and dropping.  Why should that be?  Why can't I drag a word from the end of a punctuated sentence into its middle and have Word get the spacing right?&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The mind boggles.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So it looks like I'm just going to have to break down and address non-stupid text editing again.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-2656593638999014531?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/2656593638999014531/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2010/12/more-thoughts-on-non-stupid-text-editor.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/2656593638999014531'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/2656593638999014531'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2010/12/more-thoughts-on-non-stupid-text-editor.html' title='More thoughts on a non-stupid text editor'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-1400376628857683562</id><published>2010-11-16T09:51:00.001-08:00</published><updated>2010-11-16T09:55:32.605-08:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='workflow'/><title type='text'>Another workflow with PDFs</title><content type='html'>I have a set of PDFs, each of which consists of a set of documents that have been highlighted and scanned.  The text to be translated was highlighted - with a physical marker, I mean - before the documents were scanned.  Unfortunately, the PDFs were not encoded to allow comments (from what I'm reading, this is a flag of the digital signature, not a flag in the PDF standard, and Adobe has not provided the signing key - thus there is no tool in the world that can flag a PDF to allow comments from Adobe Reader except for the full paid version of Adobe Acrobat).&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So my workflow is to go through the documents and use the snapshot tool to copy the highlighted bits.  I put each bit into one column of a table in a Word file, and the translation in the other column.  It's nearly as good as comments in the PDF.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It seems to me that this would be a simple tool to implement: create the Word file, create the table, then every time I select something that's graphical, put it into the Word file for me and bring Word to the top.  
It's not a huge help, but it's the principle of the thing.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-1400376628857683562?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/1400376628857683562/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2010/11/another-workflow-with-pdfs.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/1400376628857683562'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/1400376628857683562'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2010/11/another-workflow-with-pdfs.html' title='Another workflow with PDFs'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-1690128933259266797</id><published>2010-11-06T08:04:00.000-07:00</published><updated>2010-11-06T08:07:58.762-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='segmentation'/><category scheme='http://www.blogger.com/atom/ns#' term='term extraction'/><title type='text'>Working with source text</title><content type='html'>There are open-source ways to break text into sentences and to find terms.  Those need to be part of the toolkit.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The &lt;a href="http://code.google.com/p/splitta/"&gt;splitta&lt;/a&gt; library is a sentence boundary finder.  
This I &lt;i&gt;have&lt;/i&gt; to incorporate, as segmentation is an extremely important function of any translation system.  So that should be Perl-ized here.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The &lt;a href="http://pypi.python.org/pypi/topia.termextract/"&gt;Topia&lt;/a&gt; term extractor is the other thing I wanted to point out here.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Also worth pointing out: both of these libraries are in Python.  An awful lot of natural language work ends up in Python.  That's kind of interesting, actually.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-1690128933259266797?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/1690128933259266797/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2010/11/working-with-source-text.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/1690128933259266797'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/1690128933259266797'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2010/11/working-with-source-text.html' title='Working with source text'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-5896124880999387661</id><published>2010-10-08T16:07:00.000-07:00</published><updated>2010-10-08T16:09:10.033-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' 
term='terminology'/><title type='text'>Terminology</title><content type='html'>This has been done to death, of course, but I need to start thinking about a terminology engine, and also about specific terminology - right now I'd like a database of titles of industrial standards in various languages.  They come up rather a lot.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-5896124880999387661?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/5896124880999387661/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2010/10/terminology.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/5896124880999387661'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/5896124880999387661'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2010/10/terminology.html' title='Terminology'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-8970686541524322073</id><published>2010-10-06T20:04:00.000-07:00</published><updated>2010-10-06T20:17:00.365-07:00</updated><title type='text'>PDF reading</title><content type='html'>There are a couple of workflows where PDFs are needed.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;First is where a series of pages have been scanned and need to be translated starting from the graphics.  
OCR can come in handy here (if it works, which it usually doesn't), but I want to highlight the fact that (1) the pages are very often disjoint (think medical records) and (2) sometimes have Bates numbers (legal annotations identifying each individual page in a set of documents).  This overall structure could do with some software support.  I'm thinking something that takes individual document segments and ties them back into a structured overall document with, say, the Bates numbers.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;[With respect to that OCR: it would be nice to have a pre-OCR stage that finds and identifies pages that are similar - this could simplify finding letterhead, headings, and so on.]&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The second workflow of interest is text PDFs.  See, PDFs don't have document structure like Word documents.  If a header appears on every page, well then it will be reproduced on every page in text.  So it would be nice to be able to impose - to &lt;i&gt;recognize -&lt;/i&gt; this sort of structure in order to take PDFs and translate them.  (You could argue that a TM tool would do this for you - but I would prefer to abstract out the different document parts in order to translate them separately, when we start thinking about machine translation.  The MT tool will need as much help as it can get.)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Anyway.  
Just a thought I'm too busy to follow up on right at the moment.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-8970686541524322073?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/8970686541524322073/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2010/10/pdf-reading.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/8970686541524322073'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/8970686541524322073'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2010/10/pdf-reading.html' title='PDF reading'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-5587523573170479007</id><published>2010-10-01T22:54:00.000-07:00</published><updated>2010-10-01T23:08:43.580-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='machine translation'/><category scheme='http://www.blogger.com/atom/ns#' term='Systran'/><title type='text'>Oops...</title><content type='html'>So I tried Systran on a new potential project in German and Italian (SOPs from the same company in both languages, for translation to English).  
I figured after the corporate charter, with its quite passable results, I'd try Systran on these as well.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Abysmal.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Here's just the first sentence of the German:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;blockquote&gt;&lt;div&gt;The available SOP serves for the Sicherstellung of the requirements and conditions, which must fulfill the suppliers, so that them for the supply of a supplier sample to become certified to be able.&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;/blockquote&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;You can't edit that.  All you can do is retranslate it - either directly from the German or from this intermediate not-German not-English near-gibberish.  So maybe the corporate charter was a fluke, or maybe Systran performs better on French than on German (or Italian - the results were equally unreadable on my Italian sample).  Either way, my initial hopes for being able to use Systran are pretty much shot.  
This does not speed me up, and it's clear that careful glossary work, while it might help a little, wouldn't be enough - Systran doesn't actually appear to understand or use syntax.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-5587523573170479007?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/5587523573170479007/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2010/10/oops.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/5587523573170479007'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/5587523573170479007'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2010/10/oops.html' title='Oops...'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-8762038336788913375</id><published>2010-09-30T14:14:00.000-07:00</published><updated>2010-09-30T14:36:08.195-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='machine translation'/><category scheme='http://www.blogger.com/atom/ns#' term='Systran'/><title type='text'>Thoughts on practical use of machine translation</title><content type='html'>So since I haven't had the time to get OpenLogos running (I swear, just when I started, the work just came pouring in - I'm at 123,000 words for the month, phew) and given that I was far, far behind schedule on a large and boring corporate charter in French, I decided to 
try Systran.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(Oh, no, he didn't go there!)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I hadn't looked at Systran since 2005, when I had some work post-editing its abysmal output for an agency in Italy.  I came to the conclusion then that it was normally just as easy to translate a given text myself as to try to decipher what Systran had come up with and whip it into something comprehensible to an English speaker, and that translating it myself paid five times as well.  So: no-brainer, and I actually lost my Systran install.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;But, well, it's been five years.  Surely they could do a better job by now, right?  And hey, it's only 100 bucks for the home version, which now includes a whole raft of languages - in fact, with the exception of Hungarian, &lt;i&gt;all&lt;/i&gt; the languages I work with.  So.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Here's the workflow I used: I ran Systran on my file, then aligned it with the original, and loaded it into my TM.  Then I started down the file sentence by sentence in the normal manner, with the aligned segments coming up as I went.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This worked pretty damn well, actually.  OK, there were some Systranisms - mille, in a year, is generally not translated "millet" and I'm not sure why that would be the default.  I dealt with these by loading the TM transfer file in my editor of choice, and doing global search-and-replace on them as I went.  Then I'd import the edited segments back into the TM, and proceed.  So commonly mistranslated terms got better as I went.  Since the file was 13,000 words, this approach had time to work.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I should note that nearly every sentence needed modification.  
There were some real screamers in terms of Martian word order - so this should be considered kind of a rock-bottom minimum; what I wanted to know is whether it would accelerate my work even so.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;My normal "fast" progress is 700 to 1000 words an hour.  For this dreary text, I would probably have managed no more than 400 or 500 an hour.  With this procedure, though, I managed a throughput of something between 1500 and 2500 words an hour.  That ... that freaking &lt;i&gt;works&lt;/i&gt;.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I think quality suffered somewhat, although as it was a corporate charter, I don't think I would have done fantastic quality anyway, so it's hard to say.  I should continue to give this a try - certainly the preliminary results on this one job were entirely convincing and I now have much more confidence that machine translation should be part of my toolkit.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;How would I improve things, you ask?  Pretty much using the same tools I want to implement anyway:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;Global search and replace for terms in a bilingual list.  (This has two aspects: replacement should be sensitive to grammar in the target language, i.e. pluralizing correctly, but it should also be sensitive to the source phrase, sort of a "replace X with X' only if it's a translation for Y".)&lt;/li&gt;&lt;li&gt;Automation of simple TRADOS tasks (e.g. reloading the TM after I do a global search and replace.)&lt;/li&gt;&lt;li&gt;A database of rewording rules.  This is slowly taking shape in my mind - it would be a valuable tool for any proofreader.  It could also "translate" between American and British, if you see what I mean.  
Kind of a spellchecker on steroids, if you will.&lt;/li&gt;&lt;li&gt;Automation of Systran itself; the home version runs inside Word or with a standalone tool and they don't really want you to do things like automating it without giving them a lot of money for the Professional or Enterprise versions.&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;Anyway, I wanted to post this while the job was fresh in my memory.  Now it's back to work for me, this time &lt;i&gt;without&lt;/i&gt; the Systran crutch.&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The real takeaway for me was: even bad MT, if well managed, would augment my throughput, potentially by &lt;i&gt;a lot&lt;/i&gt;.  And the various accessories I would need for Systran work will also be applicable to work with OpenLogos, so it's not wasted work if I get around to writing some.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-8762038336788913375?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/8762038336788913375/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2010/09/thoughts-on-practical-use-of-machine.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/8762038336788913375'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/8762038336788913375'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2010/09/thoughts-on-practical-use-of-machine.html' title='Thoughts on practical use of machine translation'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' 
src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-6843544995117356134</id><published>2010-09-15T06:53:00.000-07:00</published><updated>2010-09-15T18:29:48.309-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='XLIFF'/><category scheme='http://www.blogger.com/atom/ns#' term='File::XLIFF'/><category scheme='http://www.blogger.com/atom/ns#' term='Xlat::File'/><title type='text'>XLIFF</title><content type='html'>So I started a File::XLIFF module yesterday.  XLIFF [&lt;a href="http://docs.oasis-open.org/xliff/xliff-core/xliff-core.html"&gt;spec&lt;/a&gt;] is an interesting format.  Like many XML formats, it's overengineered to the point that I suspect nobody will ever use it to its fullest extent.  It maps onto the much simpler TTX format only with a lot of folding, spindling, and mutilation.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The basic Xlat::File model of a file as a simple set of segments may turn out to be oversimplified when it comes to XLIFF.  As the most obvious example, a single XLIFF file can contain multiple sections, each of which refers to content from a separate file, and thus each of which has its own header and its own body.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Under the assumption that an XLIFF file will &lt;i&gt;usually&lt;/i&gt; correspond to a single source file, I'm going to define a "default section" (that being the first in the file) that will be the target for the API against the file object; using XLIFF-specific functions, I'll expose a way to get a list of sections and create a separate file object pointing to a section that's numbered 2 or above.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Each of the File:: modules should probably have an Xlat::File superclass.  
I don't want to introduce needless dependencies, though; perhaps I can test for installation of Xlat::File before superclassing?  Or maybe this is a plate of beans.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-6843544995117356134?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/6843544995117356134/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2010/09/xliff.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/6843544995117356134'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/6843544995117356134'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2010/09/xliff.html' title='XLIFF'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://www.vivtek.com/images/me.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-7736717017644901403</id><published>2010-09-14T20:06:00.000-07:00</published><updated>2010-09-15T06:53:06.060-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='craft'/><title type='text'>Patent language</title><content type='html'>Not software-related, but patent-related, I just wanted to link to this incredible example of clear exposition explaining the &lt;a href="http://www.bpmlegal.com/howtopat5.html"&gt;structure of patent claims&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-7736717017644901403?l=xlat-perl.blogspot.com' 
alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/7736717017644901403/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2010/09/patent-language.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/7736717017644901403'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/7736717017644901403'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2010/09/patent-language.html' title='Patent language'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-8835081580346076808</id><published>2010-09-03T18:56:00.001-07:00</published><updated>2010-09-03T19:02:24.468-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='openlogos'/><title type='text'>OpenLogos on SourceForge</title><content type='html'>OpenLogos isn't really part of the xlat project, so I'll be transitioning its blog over to its &lt;a href="https://sourceforge.net/projects/openlogos-mt/"&gt;own home&lt;/a&gt; on SourceForge.  
The real news being that it's now &lt;i&gt;on&lt;/i&gt; SourceForge, with yours truly as maintainer.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-8835081580346076808?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/8835081580346076808/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2010/09/openlogos-on-sourceforge.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/8835081580346076808'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/8835081580346076808'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2010/09/openlogos-on-sourceforge.html' title='OpenLogos on SourceForge'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-6806117547211932602</id><published>2010-09-03T18:53:00.000-07:00</published><updated>2010-10-16T21:14:31.342-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='openlogos'/><title type='text'>Installing OpenLogos on Ubuntu 10.x (32-bit)</title><content type='html'>So, as noted below in the Fedora post, I'm giving up on 64-bit Fedora right now and falling back to an older machine, installing 32-bit Ubuntu on it so I can follow the instructions in &lt;a href="http://www.pro-linux.de/artikel/2/253/1,openlogos-101-installation-und-anwendung.html"&gt;Torsten Scheck's article&lt;/a&gt; without needing to work too hard.  
I'll post any discoveries here as I go; right now, I'm downloading the Ubuntu installer.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(&lt;i&gt;10/17/2010&lt;/i&gt;) It's embarrassing, but I'm only now getting to this point.  I spent some time getting Ubuntu installed on an old machine, but an update seems to have clobbered the boot sector or something - and frankly, that machine has been a problem for a while now.  So this week I built a 32-bit Ubuntu virtual machine on my desktop box, and I'm chugging along.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;After making my earlier changes again (the ones I made on Fedora), things are compiling well.  I'm getting a lot of warnings from including logos_libs/ruleengine/rulebase.h of the form: "%.2d" expects type 'int' but argument 3 has type 'long unsigned int', but aside from those warnings, things are working fine.  I'm going to have to look into those.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Ah. In lgsentity.cpp, "warning: deprecated conversion from string constant to 'char*'", and in a couple of other files, as well.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I ended up adding various headers to about ten files all in all.  Not too bad.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The installation routine failed to create /usr/local/share/openlogos/bin for some reason - acting as though it wasn't running as root under sudo.  Strange, and something that should be examined.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;But ... 
I seem to have installed OpenLogos at long last.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-6806117547211932602?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/6806117547211932602/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2010/09/installing-openlogos-on-ubuntu-10x-32.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/6806117547211932602'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/6806117547211932602'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2010/09/installing-openlogos-on-ubuntu-10x-32.html' title='Installing OpenLogos on Ubuntu 10.x (32-bit)'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-7435943806819212749</id><published>2010-08-27T20:11:00.000-07:00</published><updated>2010-08-27T20:13:23.774-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OCR'/><title type='text'>Tesseract OCR</title><content type='html'>Google's &lt;a href="http://code.google.com/p/tesseract-ocr/"&gt;Tesseract&lt;/a&gt; seems to be just about the best OCR out there.  It doesn't seem to play well with others yet (it's written on the assumption that it's a standalone utility, not a library) but given that it's Google, it'll probably get a lot better fast.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I should probably investigate.  
OCR is an important component of a lot of translation jobs, and all existing OCR sucks.  Sigh.  That's only partly hyperbole.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-7435943806819212749?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/7435943806819212749/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2010/08/tesseract-ocr.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/7435943806819212749'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/7435943806819212749'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2010/08/tesseract-ocr.html' title='Tesseract OCR'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-8397560193349038709</id><published>2010-08-19T19:06:00.001-07:00</published><updated>2010-08-19T19:24:37.141-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='NooJ'/><title type='text'>NooJ</title><content type='html'>So investigating some of the background for OpenLogos led me to the &lt;a href="http://www.nooj4nlp.net/"&gt;NooJ&lt;/a&gt; project, the brainchild of one Max Silberztein.  
Weirdly, it's in .NET, but aside from its choice of platform and the worryingly closed source, it appears to be manna from heaven and crack for my natural-language habit.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Lemme put it this way: 90% of the heavy lifting of the xlat project has now already been done.  All that's left is integrating all this stuff into something like a coherent toolset.  I fully intend to enjoy myself immensely.  (While chafing at the closed source - but them's the breaks, kid.)&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-8397560193349038709?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/8397560193349038709/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2010/08/nooj.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/8397560193349038709'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/8397560193349038709'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2010/08/nooj.html' title='NooJ'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-4872864787816319626</id><published>2010-08-19T18:18:00.000-07:00</published><updated>2010-09-03T18:46:00.227-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='openlogos'/><title type='text'>Compiling OpenLogos under Fedora Core 11</title><content type='html'>I've definitely gone down the 
rabbit hole with OpenLogos.  It is truly a thing of utter archaic beauty.  It's forty years old this year!  Which, in terms of software, makes it one of the oldest existing codebases on the planet - quite possibly the oldest open-source codebase in existence.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I'm trying to get it running on my Fedora Core 11 box, very much in my spare time.  I'll continue to update this post as I get things running.  There will be later posts on how to &lt;i&gt;run&lt;/i&gt; the thing once it's built.  Assuming I can; this is a 64-bit machine.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;1. Dependencies: Java and unixODBC&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Although most of the code is written in C++, there appear to be some Java components.  I've had nothing like time even to make a cursory survey of the codebase yet, so I don't know what's using Java and what isn't, but Java is definitely a prerequisite for the build.  Since Fedora ships with OpenJDK, not Sun's Java, the first thing to do is to get the java-devel package installed (my runtime is 1.6.0, the latest as of this writing, so I obtained the matching devel):&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;yum install java-1.6.0-openjdk-devel&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Once that's done, you'll specify the installation directory in your configure command to build the make environment for OpenLogos, like this:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;./configure --with-java=/usr/lib/jvm/java-openjdk/&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Don't run that yet, though, because the other compilation prerequisite is &lt;a href="http://www.unixodbc.org/"&gt;unixODBC&lt;/a&gt;.  
I tried installing it with yum, but it didn't work for me, so I fell back on the ancient technique of downloading and compiling it myself.  I'm going to assume you can manage that (otherwise, trust me here, you're going to have a hell of a time with OpenLogos) - the download is where you expect it, so get that, unpack it, do the configure-make-make install thing, and you're good to go.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;2. gcc 4.3 header cleanup&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;i&gt;Now&lt;/i&gt; you can run your configure.  At this point, this worked fine for me.  However, you're not quite done yet.  Assuming you're using the DFKI distro 1.03, like I am, and gcc 4.4.1, you'll find that as of 4.3, the &lt;a href="http://www.cyrius.com/journal/2007/05/10#gcc-4.3-include"&gt;gcc headers have been cleaned up&lt;/a&gt;, and so some includes that used to come along for free are now missing.  What compiled the last time DFKI built it (obviously with gcc 4.2 or earlier) needs patching now.  
That is the status as I write this post; I'll update as I go, and provide a patch file at some point.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;i&gt;Update 8/23/2010:&lt;/i&gt;&lt;/div&gt;&lt;div&gt;The errors here take the form of:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;error: 'xxxx' was not declared in this scope&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;They apply to the following functions:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;strchr&lt;/span&gt; was assumed to be in &lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;string.h&lt;/span&gt;, but now needs an explicit include of &lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;cstring&lt;/span&gt; (affects &lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;lgsstring.h&lt;/span&gt;).&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;atoi&lt;/span&gt; was assumed to be in &lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;string.h&lt;/span&gt;, but now needs an explicit include of &lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;cstdlib&lt;/span&gt; (affects &lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;lgsstring.h&lt;/span&gt;).&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;That might be it, actually.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The other bit of sloppy programming (not casting aspersions! I'm guilty of plenty of sloppiness, which is why I just gave up and decided to use Perl from now on in the first place) exposed by the move to gcc 4.3 is a duplicated parameter name in the declaration of rightTrim (two parameters named 's', oops!).  
I renamed the &lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;const char * s&lt;/span&gt; to '&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;t&lt;/span&gt;' to match the cpp file, but man, that looks like something &lt;i&gt;I&lt;/i&gt; would have done.  Weird that earlier compiler versions didn't flag that.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;3. 32-bit architectural assumptions&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Those fixes complete (&lt;i&gt;and it's still 8/23/2010&lt;/i&gt;), the next problem is:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;error: cast from 'const char*' to 'int' loses precision&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Whoops.  Did I mention I'm compiling on a 64-bit architecture?  Yeah.  So &lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;int&lt;/span&gt; is a 32-bit value, and addresses are 64 bits now.  The answer is to replace it with &lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;a href="http://linux.die.net/man/3/intptr_t"&gt;intptr_t&lt;/a&gt;&lt;/span&gt;, an integer type sized to hold a pointer, defined in &lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;stdint.h&lt;/span&gt; and standardized in C99, so really there was no excuse to be casting pointers to vanilla &lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;int&lt;/span&gt; in 2006 (not that I would have done differently, but I'm old and distracted and prefer Perl anyway, allowing the interpreter contributors to worry about this stuff).  Anyway, this little gem affects the parser, which uses addresses throughout as integer hash lookups.  
That's gotta go, but that's probably going to take some more thorough investigation and I've got deadlines for tomorrow morning, so that's it for August 23.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I wish more of the individual modules had unit tests.  I'm going to &lt;a href="http://www.thealmightyguru.com/Humor/Docs/ShootYourselfInTheFoot.html"&gt;shoot myself in the foot&lt;/a&gt; with this stuff sooner or later.  Perhaps I should write some (if I only knew what to test, that would probably work out great - and I have to admit, it would be a great way to start understanding internals).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Anyway, the &lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;int&lt;/span&gt; usage appears to be just in private members of the &lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;CParser&lt;/span&gt; class, but I worry that they're going to end up getting used to talk to PostgreSQL, and then where will I be?  I should probably worry about that if and when it comes up.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;i&gt;Update 9/3/2010&lt;/i&gt;:&lt;/div&gt;&lt;div&gt;I've been too busy to keep up with the 64-bit conversion, so I'm repurposing an older box I have as a 32-bit Ubuntu box (by which I mean, I pulled it out of the storage room, where it was gathering dust for just such an occasion), just so I can get a fresh compile and see this thing run once in my life.  
I may or may not get back to compiling under FC11 on the 64-bit machine.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-4872864787816319626?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/4872864787816319626/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2010/08/compiling-openlogos-under-fedora-core.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/4872864787816319626'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/4872864787816319626'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2010/08/compiling-openlogos-under-fedora-core.html' title='Compiling OpenLogos under Fedora Core 11'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://www.vivtek.com/images/me.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-4797787252155879730</id><published>2010-08-14T14:18:00.000-07:00</published><updated>2010-08-14T14:28:54.076-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='terminology'/><title type='text'>tf-idf weights</title><content type='html'>&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;Quoth &lt;/span&gt;&lt;/span&gt;&lt;a href="http://en.wikipedia.org/wiki/Tf%E2%80%93idf"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: 
medium;"&gt;Wikipedia&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;, "&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="line-height: 19px; "&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;The &lt;/span&gt;&lt;/span&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;tf–idf&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt; weight (term frequency–inverse document frequency) is a weight often used in &lt;/span&gt;&lt;/span&gt;&lt;a href="http://en.wikipedia.org/wiki/Information_retrieval" title="Information retrieval" style="text-decoration: none; color: rgb(6, 69, 173); background-image: none; background-attachment: initial; background-origin: initial; background-clip: initial; background-color: initial; background-position: initial initial; background-repeat: initial initial; "&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;information retrieval&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt; and &lt;/span&gt;&lt;/span&gt;&lt;a href="http://en.wikipedia.org/wiki/Text_mining" title="Text mining" style="text-decoration: none; color: rgb(6, 69, 173); background-image: none; background-attachment: initial; background-origin: initial; background-clip: initial; background-color: initial; background-position: initial initial; background-repeat: initial initial; "&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span 
class="Apple-style-span" style="font-size: medium;"&gt;text mining&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;."&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 19px; "&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 19px; "&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;The idea is that you determine the weights of terms based on their frequency in both the current document and in your overall corpus.  This lets you find documents based on terms they use that are less frequent overall, and thus that are likely to indicate what the document is about.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 19px; "&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;Terminology mining is a technique by means of which "interesting" terms can be found in a document.  
The interesting terms can then be researched in advance of the translation process, so that the translation itself can be both consistent and quick.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 19px; "&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 19px; "&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;There are lots of links I want to save that are tangentially related to this sort of textual analysis.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="line-height: 19px; "&gt;&lt;a href="http://nlp.fi.muni.cz/projekty/gensim/intro.html"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;Gensim &lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;is a textual analysis library in Python.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="line-height: 19px; "&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;An earlier &lt;/span&gt;&lt;/span&gt;&lt;a href="http://ctp.di.fct.unl.pt/~jmag/classic/1988.Term-weighting%20approaches%20in%20automatic%20text%20retrieval.pdf"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;paper&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span 
class="Apple-style-span" style="font-size: medium;"&gt; on term weighting.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="line-height: 19px; "&gt;&lt;a href="http://code.google.com/p/tfidf/"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;tfidf&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt; library in Python at Google Code.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="line-height: 19px; "&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;And &lt;/span&gt;&lt;/span&gt;&lt;a href="http://github.com/timtrueman/tf-idf/blob/master/tf-idf.py"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;another&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt; at GitHub.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-4797787252155879730?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/4797787252155879730/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2010/08/tf-idf-weights.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/4797787252155879730'/><link rel='self' 
type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/4797787252155879730'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2010/08/tf-idf-weights.html' title='tf-idf weights'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-7622566733739258996</id><published>2010-08-02T16:26:00.001-07:00</published><updated>2010-08-02T16:39:00.051-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='roadmap'/><title type='text'>Roadmap</title><content type='html'>So my roadmap, or to-do list, or what have you, is kind of like this:&lt;div&gt;&lt;/div&gt;&lt;ol&gt;&lt;li&gt;Word client&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Port Anaphraseus&lt;ul&gt;&lt;li&gt;Write OOo &lt;=&gt; Word Basic cross parser&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;Use an IP-based server connection to a TM of my own devising (below)&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;TTX/Xliff client&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Based on wxPerl and Wx::Declarative&lt;/li&gt;&lt;li&gt;Features can be taken largely from Xliff editor in Translation Workspace&lt;/li&gt;&lt;li&gt;Also talks to TM via IP-based protocol&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;TM server with IP-based protocol&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Basic database is easy&lt;/li&gt;&lt;li&gt;Fuzzy matching needs some examination&lt;/li&gt;&lt;li&gt;Also Wx::Declarative target&lt;/li&gt;&lt;/ul&gt;&lt;/ol&gt;Here's how I expect to increase my productivity:&lt;div&gt;&lt;ul&gt;&lt;li&gt;Simultaneous spell checking and terminology checking as I work; separate query window pops up queries unobtrusively after each segment committed&lt;/li&gt;&lt;li&gt;Decisions made in the query window are propagated back into the active document and any other documents in 
the same open project - this includes both terminology checks and spell checker dictionary additions.  (Terminology and the spell checker will share a database.)&lt;/li&gt;&lt;li&gt;Frequent words are identified for accelerators; accelerators for terminology in the open segment are displayed in a cheat sheet window.  Any repeated words in incoming segment translations are also identified as potential accelerators.&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;That's the first phase.&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The second phase will probably start to incorporate some MT.  Note OpenLogos especially in this regard; there's a library I could use with confidence.  Post-editing will include the syntax-aware editor in some way.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Well - this has definitely been a late-night post; it's really more note-taking than anything.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-7622566733739258996?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/7622566733739258996/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2010/08/roadmap.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/7622566733739258996'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/7622566733739258996'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2010/08/roadmap.html' title='Roadmap'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' 
src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-7177638086468122304</id><published>2010-08-02T16:15:00.001-07:00</published><updated>2010-08-19T18:39:26.365-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='machine translation'/><category scheme='http://www.blogger.com/atom/ns#' term='openlogos'/><title type='text'>OpenLogos</title><content type='html'>&lt;a href="http://logos-os.dfki.de/"&gt;Open-source machine translation&lt;/a&gt;.  &lt;i&gt;Open-source.  Machine.  Translation.&lt;/i&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-7177638086468122304?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/7177638086468122304/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2010/08/openlogos.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/7177638086468122304'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/7177638086468122304'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2010/08/openlogos.html' title='OpenLogos'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-3124446722131979879</id><published>2010-08-02T16:14:00.001-07:00</published><updated>2010-08-02T16:14:23.909-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='links'/><title 
type='text'>ATA Translation Tools overview</title><content type='html'>&lt;a href="http://www.slideshare.net/icotext/ata-2009-translation-tools-seminar"&gt;Hmm&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-3124446722131979879?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/3124446722131979879/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2010/08/ata-translation-tools-overview.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/3124446722131979879'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/3124446722131979879'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2010/08/ata-translation-tools-overview.html' title='ATA Translation Tools overview'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-8020401397992202686</id><published>2010-08-02T15:51:00.000-07:00</published><updated>2010-08-02T15:57:47.896-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='links'/><title type='text'>Useful catalog of translation-related software</title><content type='html'>&lt;a href="http://www.trans-k.co.uk/software_e.html"&gt;Software&lt;/a&gt; for translation.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I thought I remembered something like &lt;a href="http://sourceforge.net/projects/anaphraseus/"&gt;Anaphraseus &lt;/a&gt;for Word-native Basic, but I was apparently 
suffering from hopeful memory.  Anaphraseus uses OpenOffice.org only; I'm wondering, though, whether I could port it.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I really want to use Word.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-8020401397992202686?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/8020401397992202686/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2010/08/useful-catalog-of-translation-related.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/8020401397992202686'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/8020401397992202686'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2010/08/useful-catalog-of-translation-related.html' title='Useful catalog of translation-related software'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-7075654619041883741</id><published>2010-07-21T13:28:00.000-07:00</published><updated>2010-07-21T13:34:24.351-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='syntax editor'/><title type='text'>Syntactically savvy editors</title><content type='html'>So I just finished 15,000 words of a community impact study from Hungarian to English.  
It was a mind-expanding experience, as HU&gt;EN always is, and I found myself thinking hard about how Word is insufficient for my needs when translating between languages with radically different sentence structure.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This doesn't bother me so much any more with German (I compensate automatically, looking ahead in the German sentence for the structure I know will end up at the start of the English sentence), but in the Romance languages, I tend to backtrack a lot to insert adjectives that I hadn't noticed before starting to type a phrase.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;That much I could probably learn to use word navigation for (I've just never learned that because text editors don't have word navigation, so it's not built into my motor cortex like other navigation commands).  But in &lt;i&gt;Hungarian&lt;/i&gt;, things are freaking &lt;i&gt;different.&lt;/i&gt;&lt;/div&gt;&lt;div&gt;&lt;i&gt;&lt;br /&gt;&lt;/i&gt;&lt;/div&gt;&lt;div&gt;In Hungarian, it's not at all unusual for me to need to construct a sentence painfully, phrase by phrase, realizing again and again that the words I'm finding at the end of what I thought was the full phrase actually need to go at the front in English.  What I'd &lt;i&gt;like&lt;/i&gt; in a situation like this is an editor that understands the syntax of what I'm doing.  (Which, of course, is in general impossible - but sometimes, it could probably work.)  
Some way of keying into a separate tree mode for this sort of editing would really be very useful.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;More thought is required; I just wanted to mark the idea now.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-7075654619041883741?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/7075654619041883741/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2010/07/syntactically-savvy-editors.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/7075654619041883741'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/7075654619041883741'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2010/07/syntactically-savvy-editors.html' title='Syntactically savvy editors'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-5884592087009282241</id><published>2010-07-19T12:43:00.001-07:00</published><updated>2010-07-19T12:44:34.943-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='writing quality'/><title type='text'>3 writing-quality metric scripts</title><content type='html'>I'm not sure how relevant &lt;a href="http://matt.might.net/articles/shell-scripts-for-passive-voice-weasel-words-duplicates/"&gt;this is&lt;/a&gt; to translation, but it's still an interesting approach to handling natural language automatically.  
I'm guilty of using weasel words, so I find this interesting.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I think some sort of quality metric facility would be a useful tool in the kit.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-5884592087009282241?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/5884592087009282241/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2010/07/3-writing-quality-metric-scripts.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/5884592087009282241'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/5884592087009282241'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2010/07/3-writing-quality-metric-scripts.html' title='3 writing-quality metric scripts'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-4389948350572904389</id><published>2010-07-16T09:48:00.000-07:00</published><updated>2010-07-16T10:41:32.475-07:00</updated><title type='text'>Link dump: interesting natural language modules from CPAN</title><content type='html'>&lt;a href="http://search.cpan.org/~nids/Algorithm-WordLevelStatistics/"&gt;Algorithm::WordLevelStatistics&lt;/a&gt; - finds keywords in generic text.  
This should be a useful analysis tool for terminology research.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://search.cpan.org/~snowhare/Lingua-Stem/"&gt;Lingua::Stem&lt;/a&gt; - finds stems for a smallish set of languages.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://search.cpan.org/~jjoao/Lingua-StarDict-Gen/"&gt;Lingua::StarDict::Gen&lt;/a&gt; - generates StarDict dictionaries.  (Might be useful.)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://search.cpan.org/~lesv/Lingua-StarDict/"&gt;Lingua::StarDict&lt;/a&gt; itself (2004) and the &lt;a href="http://stardict.sourceforge.net/"&gt;StarDict&lt;/a&gt; project (2007) at SourceForge, hmm.  (&lt;a href="https://sourceforge.net/projects/sdcv/"&gt;console version&lt;/a&gt;, dates to 2006)  - This might be dead, but it's intriguing.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://search.cpan.org/~thhamon/Lingua-YaTeA/"&gt;Lingua::YaTeA&lt;/a&gt; - extracts noun phrase candidates from a corpus.  Definitely to be studied.  Seems to have a &lt;i&gt;lot&lt;/i&gt; of innards.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://search.cpan.org/~dbrian/Lingua-Wordnet/"&gt;Lingua::Wordnet&lt;/a&gt; - pure Perl WordNet.  Apparently.  Needs study.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://search.cpan.org/~samv/Lingua-Translate/"&gt;Lingua::Translate&lt;/a&gt; - interface to a Web-accessible machine translator, e.g. Babelfish.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://search.cpan.org/~achimru/Lingua-Sentence/"&gt;Lingua::Sentence&lt;/a&gt; - Hello, segmentation!  
Thanks, CPAN!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://search.cpan.org/~vlado/Text-Ngrams/"&gt;Text::Ngrams&lt;/a&gt;, &lt;a href="http://search.cpan.org/~kubina/Text-Ngramize/"&gt;Text::Ngramize&lt;/a&gt;, &lt;a href="http://search.cpan.org/~revmischa/Algorithm-NGram/"&gt;Algorithm::NGram&lt;/a&gt; - n-gram analysis of text.  Oh, and &lt;a href="http://search.cpan.org/~ambs/Text-WordGrams/"&gt;Text::WordGrams&lt;/a&gt;, too.  And maybe &lt;a href="http://search.cpan.org/~btmcinnes/Text-Positional-Ngram/"&gt;Text::Positional::Ngram&lt;/a&gt;.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-4389948350572904389?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/4389948350572904389/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2010/07/link-dump-interesting-natural-language.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/4389948350572904389'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/4389948350572904389'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2010/07/link-dump-interesting-natural-language.html' title='Link dump: interesting natural language modules from CPAN'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' 
src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-6389035641794017723</id><published>2010-07-16T09:40:00.001-07:00</published><updated>2010-07-16T09:44:31.873-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='natural language'/><title type='text'>Proposal: generic natural-language-smart string handling</title><content type='html'>... or something.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;There are operations I'd like to do on natural-language sentences that are weakened by the assumptions made for strings.  For instance, I'd like &lt;i&gt;not&lt;/i&gt; to do terminology checking on a case-insensitive basis, because there are words that are incorrect if not capitalized.  But checking case-sensitively simply means that every term that starts a sentence will register as misspelled, which is also wrong.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Spell checkers probably already take this into consideration, but I'd like to be able to point at a file, say definitively "this file contains textually encoded natural language", and do some smarter things than normal file I/O will allow.  
Even assumptions about character encoding are different if we know something is language.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Similarly, just some function to extract words and n-phrases correctly from a punctuated sentence would be a great help; this is one of the many things the current Xlat::Termbase is overly naive about.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So that's my thought for the day.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-6389035641794017723?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/6389035641794017723/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2010/07/proposal-generic-natural-language-smart.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/6389035641794017723'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/6389035641794017723'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2010/07/proposal-generic-natural-language-smart.html' title='Proposal: generic natural-language-smart string handling'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-1578285126224270903</id><published>2010-07-15T16:14:00.001-07:00</published><updated>2010-07-15T16:21:09.553-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='configuration'/><title type='text'>Configuration and the command 
line</title><content type='html'>OK, so I'll admit it - I've used computers for well over two decades now and I'm comfortable with Linux, and I &lt;i&gt;still&lt;/i&gt; don't prefer the command line.  Oh, for some stuff it's great (like, power tools) - but I just have no head for remembering commands and parameters.  I sucked at the Rocky Horror Picture Show, too.  Always inadvertently paraphrasing.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It makes me a good translator, actually: what is translation but reading a German sentence and paraphrasing it in English?  But for command-line manipulation, I'm not your man.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;However, in this initial stage and for the foreseeable future, Xlat will be a set of command-line tools.  (And those command-line tools are damned important, anyway, so they'll stick around.)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Here's an idea, though.  When I'm in a directory, I want the entire Xlat suite to know some important things about that directory, e.g. the termbase I want to use there, the customer's name and ID perhaps, I don't know what all, but I want an open-ended scheme to set it all up.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;And that scheme needs to cascade.  If a value isn't found in the context for a directory, we should check the parent directory (e.g. if it's not in the project, I want it in the customer's main directory).  And so on.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Moreover, I want to be able to override things on the command line if I want to use an alternative termbase or something.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In addition to this, I want a session context to be saved, e.g. the last file touched and things like that.  The next time I do a termcheck, if I don't give it a file, it'll pull the last file I used.  That kind of thing.  
Just a way to make this stuff easier to use, while preserving the power and convenience of command-line utilities.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I'm not sure what to call the module.  Config:: something, but both Cascading and Context have been taken for things that aren't entirely what I want.  So it deserves some thought.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-1578285126224270903?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/1578285126224270903/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2010/07/configuration-and-command-line.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/1578285126224270903'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/1578285126224270903'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2010/07/configuration-and-command-line.html' title='Configuration and the command line'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-1361812342982523027</id><published>2010-07-15T16:11:00.000-07:00</published><updated>2010-07-15T16:14:12.276-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='character encodings'/><title type='text'>A note on character encodings</title><content type='html'>Here's a sticky wicket (as character encodings always are).  
By default, Notepad (my text editor of choice for simple files) represents umlauted vowels as single bytes in an eight-bit character set (Windows-1252, which agrees with ISO 8859-1 for these characters).  Padre (my Perl IDE of choice) represents umlauted vowels within strings as UTF-8, which is much better.&lt;br /&gt;&lt;br /&gt;Here's the problem: if I edit a German word in a text file, and the same German word in a string, they don't compare as equal, because the same vowel is one byte in the file and two bytes in the string.  &lt;i&gt;This is a problem&lt;/i&gt;.  It's a widespread enough problem that I'm going to have to come up with a principled, central way to deal with it.  So watch this space, I guess.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-1361812342982523027?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/1361812342982523027/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2010/07/note-on-character-encodings.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/1361812342982523027'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/1361812342982523027'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2010/07/note-on-character-encodings.html' title='A note on character encodings'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-3318742837756278526</id><published>2010-07-15T16:08:00.001-07:00</published><updated>2010-07-15T16:11:36.204-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' 
term='terminology'/><title type='text'>Terminology checking v0.01</title><content type='html'>So.  I just posted v0.01 of a &lt;a href="https://sourceforge.net/apps/mediawiki/xlat/index.php?title=Termchecker0_01"&gt;terminology checker script&lt;/a&gt; to the Wiki.  It is painfully naive in its structure and coding, but it got the job done tonight for some terminology checking I wanted to do, and it illustrates just how simple these basic tools can be.  The core of it is this:&lt;br /&gt;&lt;pre&gt;foreach my $s ($ttx-&gt;segments()) {&lt;br /&gt;   # Check this segment's translation against the termbase&lt;br /&gt;   my $c = $t-&gt;check ($s-&gt;source, $s-&gt;translated);&lt;br /&gt;   if ($c) {&lt;br /&gt;      # Remember each missing term and collect the segments that lack it&lt;br /&gt;      foreach my $missing (keys %$c) {&lt;br /&gt;         $terms-&gt;{$missing} = $c-&gt;{$missing};&lt;br /&gt;         $bad-&gt;{$missing} = [] unless defined $bad-&gt;{$missing};&lt;br /&gt;         push @{$bad-&gt;{$missing}}, $s;&lt;br /&gt;      }&lt;br /&gt;   }&lt;br /&gt;}&lt;/pre&gt;&lt;br /&gt;Now, note that it's using a termbase module I haven't published yet (because it's even more terribly naive), but the key here is that this loop is really, really simple.&lt;br /&gt;&lt;br /&gt;This is what translation tools should look like.  
I'm pretty happy with this.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-3318742837756278526?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/3318742837756278526/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2010/07/terminology-checking.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/3318742837756278526'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/3318742837756278526'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2010/07/terminology-checking.html' title='Terminology checking v0.01'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-1628916247523804767</id><published>2010-07-14T12:22:00.000-07:00</published><updated>2010-07-15T12:25:06.399-07:00</updated><title type='text'>Announcing File::TTX</title><content type='html'>&lt;a href="http://search.cpan.org/~michael/File-TTX/"&gt;File::TTX&lt;/a&gt; is the first fruit of the project.  It's the early version of a Perl module that works with TRADOS TTX files.  
It will probably end up as a plugin for something like Xlat::Document, if all goes according to plan.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-1628916247523804767?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/1628916247523804767/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2010/07/announcing-filettx.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/1628916247523804767'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/1628916247523804767'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2010/07/announcing-filettx.html' title='Announcing File::TTX'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3904594442791801969.post-3689614333025692222</id><published>2010-07-10T12:14:00.000-07:00</published><updated>2010-07-15T12:21:54.129-07:00</updated><title type='text'>Introduction</title><content type='html'>The Xlat project is an open-source set of Perl tools and modules to facilitate translation.  A bit of background might be useful: I'm a technical translator, mostly German to English.  But before I ever started that career, I was a programmer, mostly C at the time.  Lately I lean towards Perl; CPAN is just so useful.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;At any rate, I tend to want to write tools to help me work.  
Until recently, I hadn't been organized enough to publish any of my scripts, and so every time I needed a script, I'd start from scratch.  Now that I've started publishing on CPAN, I'm no longer losing quite so much ground.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This &lt;i&gt;blog&lt;/i&gt; is where I organize my thoughts and plans, and announce new tools in the toolkit.  If you're reading it, great!  I would appreciate any and all feedback; as I'm sure you know, handling natural language is very hard indeed.  My philosophy is to release early and raw, then iterate.  That means that for any given project, these tools may well fail in egregious ways (character encodings are always a great way to get that to happen).  Caveat emptor - a full refund is always available.  (A little open-source joke.)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;If you've got comments, you know what to do.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3904594442791801969-3689614333025692222?l=xlat-perl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xlat-perl.blogspot.com/feeds/3689614333025692222/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://xlat-perl.blogspot.com/2010/07/introduction.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/3689614333025692222'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3904594442791801969/posts/default/3689614333025692222'/><link rel='alternate' type='text/html' href='http://xlat-perl.blogspot.com/2010/07/introduction.html' title='Introduction'/><author><name>Michael</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' 
width='28' height='32' src='http://www.vivtek.com/images/me.gif'/></author><thr:total>0</thr:total></entry></feed>
