Friday, July 16, 2010

Proposal: generic natural-language-smart string handling

... or something.

There are operations I'd like to do on these sentences that are weakened by the assumptions made for ordinary strings. For instance, I don't want terminology checking to be case-insensitive, because some words are incorrect unless capitalized. But a strictly case-sensitive check means that every word that starts a sentence, being capitalized, will register as misspelled, which is also wrong.
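To make that concrete, here is a rough sketch in Python of a case-sensitive check that forgives a capital letter only on the first word of a sentence. Everything in it is an assumption for illustration: the `TERMS` set stands in for a real termbase, `check_terms` is a made-up name, and it expects to be handed one sentence at a time.

```python
import re

# Hypothetical termbase: canonical spellings, where case is significant.
TERMS = {"Python", "iPhone", "database"}

# Map each lowercased term to its canonical spelling for lookup.
_LOOKUP = {term.lower(): term for term in TERMS}

# A word is a run of letters, optionally joined by apostrophes or hyphens.
_WORD = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def check_terms(sentence):
    """Return (found, expected) pairs for terms written with the wrong case.

    The first word of the sentence may have its first letter capitalized
    without being reported, as long as the rest matches the termbase.
    """
    problems = []
    for index, word in enumerate(_WORD.findall(sentence)):
        expected = _LOOKUP.get(word.lower())
        if expected is None or word == expected:
            continue  # not a known term, or spelled exactly right
        if index == 0 and word == expected[0].upper() + expected[1:]:
            continue  # sentence-initial capital is acceptable
        problems.append((word, expected))
    return problems
```

With this, "Database access from python is easy." reports only "python", while "The Database is down." still flags the stray capital in mid-sentence. It doesn't find sentence boundaries on its own, and that is the genuinely hard part.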

Spell checkers probably already take this into consideration, but I'd like to be able to point at a file, say definitively "this file contains textually encoded natural language", and do some smarter things than normal file I/O will allow. Even assumptions about character encoding are different if we know something is language.

Similarly, just some function to extract words and n-word phrases correctly from a punctuated sentence would be a great help; this is one of the many things the current Xlat::Termbase is overly naive about.
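Something like the following is what I have in mind, sketched in Python rather than as a patch to Xlat::Termbase. The names `words` and `phrases` are hypothetical, and the rule that a phrase may not span punctuation is my own assumption about what "correctly" should mean here.

```python
import re

# A word is a run of letters or digits, optionally joined by
# apostrophes or hyphens (so "don't" and "well-known" stay whole).
_WORD = re.compile(r"[^\W_]+(?:['’-][^\W_]+)*")

# Punctuation that a phrase should not span (an assumption, not a standard).
_BREAK = re.compile(r"[.,;:!?()\"]+")


def words(text):
    """Return the words of text in order, with punctuation stripped."""
    return _WORD.findall(text)


def phrases(text, n):
    """Return every run of n consecutive words that does not cross punctuation."""
    result = []
    for chunk in _BREAK.split(text):
        chunk_words = words(chunk)
        for start in range(len(chunk_words) - n + 1):
            result.append(" ".join(chunk_words[start:start + n]))
    return result
```

For "Open the file, then save it." the two-word phrases come out as "Open the", "the file", "then save" and "save it", with no bogus "file then" across the comma. Splitting on every period is still naive: abbreviations and decimal numbers will be cut in the wrong place.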

So that's my thought for the day.
