Wednesday, October 6, 2010

PDF reading

There are a couple of translation workflows that involve PDFs.

The first is where a series of pages has been scanned and needs to be translated starting from the images. OCR can come in handy here (if it works, which it usually doesn't), but I want to highlight the fact that the pages (1) are very often disjoint (think medical records) and (2) sometimes carry Bates numbers (legal annotations identifying each individual page in a set of documents). This overall structure could do with some software support. I'm thinking of something that takes individual document segments and ties them back into a structured overall document using, say, the Bates numbers.
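To make that idea a little more concrete, here is a rough sketch of reassembling scanned segments by Bates number. It assumes each page's OCR text is already available, and that the Bates numbers look like an uppercase prefix followed by zero-padded digits (e.g. "ABC000123"). Real numbering schemes vary, so the pattern is a guess, and `Page`, `bates_key`, `assemble` and `find_gaps` are names I made up for illustration.

```python
import re
from dataclasses import dataclass

# Assumed Bates format: 2+ uppercase letters, optional "-" or "_", 4+ digits.
BATES_RE = re.compile(r"\b([A-Z]{2,})[-_]?(\d{4,})\b")


@dataclass
class Page:
    source: str  # hypothetical: the file or scan the page came from
    text: str    # OCR text of the page


def bates_key(page):
    """Return (prefix, number) for the first Bates-like token, or None."""
    match = BATES_RE.search(page.text)
    if match is None:
        return None
    return (match.group(1), int(match.group(2)))


def assemble(pages):
    """Split pages into a Bates-ordered list and a list of unnumbered pages."""
    numbered, unnumbered = [], []
    for page in pages:
        key = bates_key(page)
        if key is None:
            unnumbered.append(page)
        else:
            numbered.append((key, page))
    numbered.sort(key=lambda pair: pair[0])
    return numbered, unnumbered


def find_gaps(numbered):
    """Report missing number ranges between consecutive pages of one prefix."""
    keys = [key for key, _ in numbered]
    gaps = []
    for (prefix_a, num_a), (prefix_b, num_b) in zip(keys, keys[1:]):
        if prefix_a == prefix_b and num_b - num_a > 1:
            gaps.append((prefix_a, num_a + 1, num_b - 1))
    return gaps
```

The gap report is the part I would actually want as a translator: it shows which pages of the set never arrived, rather than just sorting what did.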

[With respect to that OCR: it would be nice to have a pre-OCR stage that finds and identifies pages that are similar - this could simplify finding letterhead, headings, and so on.]
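One cheap way to approximate that pre-OCR stage is an average hash: shrink each page to a tiny grayscale grid, mark each cell as lighter or darker than the page's mean, and group pages whose bit patterns differ in only a few cells. The sketch below assumes the downsampling has already been done by some imaging tool and that each page arrives as a small list of rows of grayscale values. The function names and the `max_distance` threshold are mine, not from any particular library.

```python
def average_hash(grid):
    """Bit pattern: True where a cell is brighter than the page's mean."""
    pixels = [value for row in grid for value in row]
    mean = sum(pixels) / len(pixels)
    return tuple(value > mean for value in pixels)


def hamming(hash_a, hash_b):
    """Number of cells in which two hashes disagree."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


def group_similar(grids, max_distance=2):
    """Greedily group page indices whose hashes are within max_distance."""
    groups = []  # list of (representative hash, [page indices])
    for index, grid in enumerate(grids):
        page_hash = average_hash(grid)
        for representative, members in groups:
            if hamming(representative, page_hash) <= max_distance:
                members.append(index)
                break
        else:
            groups.append((page_hash, [index]))
    return [members for _, members in groups]
```

Hashing only the top strip of each page instead of the whole page would probably be a better fit for spotting letterhead, since the body of each page differs.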

The second workflow of interest is text PDFs. See, PDFs don't have document structure the way Word documents do. If a header appears on every page, well then its text is simply repeated on every page. So it would be nice to be able to impose - or rather, recognize - this sort of structure in order to take PDFs and translate them. (You could argue that a TM tool would do this for you - but I would prefer to abstract out the different document parts in order to translate them separately, especially once we start thinking about machine translation. The MT tool will need as much help as it can get.)
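As a rough sketch of what recognizing that structure might look like: assume some extraction tool has already turned each page into a list of text lines. Lines that recur near the top or bottom of most pages, once digits are masked so that page numbers compare equal, are probably headers or footers and can be pulled out to be translated once. The names `find_repeated_lines` and `split_page` and the `edge` and `min_fraction` parameters are made up for this example.

```python
import re
from collections import Counter


def normalize(line):
    """Mask digits so 'Page 1 of 3' and 'Page 2 of 3' compare equal."""
    return re.sub(r"\d+", "#", line.strip())


def find_repeated_lines(pages, edge=2, min_fraction=0.8):
    """Return normalized lines found at the page edges of most pages."""
    counts = Counter()
    for lines in pages:
        edge_lines = lines[:edge] + lines[-edge:]
        # A set, so a line is counted at most once per page.
        counts.update({normalize(line) for line in edge_lines if line.strip()})
    threshold = min_fraction * len(pages)
    return {line for line, count in counts.items() if count >= threshold}


def split_page(lines, repeated, edge=2):
    """Separate one page into (header/footer lines, body lines)."""
    furniture, body = [], []
    last = len(lines)
    for index, line in enumerate(lines):
        at_edge = index < edge or index >= last - edge
        if at_edge and normalize(line) in repeated:
            furniture.append(line)
        else:
            body.append(line)
    return furniture, body
```

Only the edges of each page are checked, so a sentence that happens to repeat in the body text is left alone.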

Anyway. Just a thought I'm too busy to follow up on right at the moment.
