There are operations I'd like to do on these sentences that are weakened by the assumptions made for plain strings. For instance, I'd rather not do terminology checking on a case-insensitive basis, because there are words that are incorrect if not capitalized. But a strictly case-sensitive check means that every word that starts a sentence will register as misspelled, which is also wrong.
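A minimal sketch of the idea, in Python for illustration: check case-sensitively, but let the first word of a sentence carry an initial capital. The function name `miscased_terms` and the plain set of terms are hypothetical, not anything Xlat::Termbase provides, and the sketch assumes single-word terms and input that has already been split into sentences.

```python
import re

# Runs of letters, allowing internal apostrophes and hyphens ("don't", "case-sensitive").
_WORD = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")

def miscased_terms(sentence, terms):
    """Return the words in `sentence` that match a term only when case is ignored.

    `terms` is a hypothetical set of canonical spellings, e.g. {"Perl", "termbase"}.
    """
    by_lower = {t.lower(): t for t in terms}
    flagged = []
    for i, word in enumerate(_WORD.findall(sentence)):
        canonical = by_lower.get(word.lower())
        if canonical is None or word == canonical:
            continue  # not a term at all, or spelled exactly right
        # Sentence-initial exception: accept the canonical form with its
        # first letter uppercased, and nothing looser than that.
        if i == 0 and word == canonical[:1].upper() + canonical[1:]:
            continue
        flagged.append(word)
    return flagged

terms = {"Perl", "termbase"}
print(miscased_terms("Termbase lookups in perl are slow.", terms))
print(miscased_terms("The Termbase is naive.", terms))
```

The first call should flag only `perl`; the second flags `Termbase`, because there the capital is not excused by sentence position.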
Spell checkers probably already take this into consideration, but I'd like to be able to point at a file, say definitively "this file contains textually encoded natural language", and do some smarter things than normal file I/O will allow. Even assumptions about character encoding are different if we know something is language.
Similarly, just some function to extract words and n-phrases correctly from a punctuated sentence would be a great help; this is one of the many things the current Xlat::Termbase is overly naive about.
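Roughly what such a function might look like, again sketched in Python rather than as part of Xlat::Termbase. The names `extract_words` and `n_phrases` are made up for illustration, and "n-phrase" is taken here to mean a run of n consecutive words that does not cross punctuation. Splitting on every period is itself naive (abbreviations and decimals will be cut apart), so treat this as a starting point only.

```python
import re

# Letters and digits, with internal apostrophes or hyphens kept inside the word.
_TOKEN = re.compile(r"[^\W_]+(?:['’-][^\W_]+)*")

# Punctuation assumed to end a phrase; an n-phrase never spans one of these.
_BREAK = re.compile(r"[,;:.!?()\"]+")

def extract_words(text):
    """Return the words in `text`, with surrounding punctuation stripped."""
    return _TOKEN.findall(text)

def n_phrases(sentence, n):
    """Return every run of `n` consecutive words that stays within one chunk."""
    phrases = []
    for chunk in _BREAK.split(sentence):
        ws = extract_words(chunk)
        # range() is empty when the chunk has fewer than n words.
        phrases.extend(" ".join(ws[i:i + n]) for i in range(len(ws) - n + 1))
    return phrases

print(extract_words("Don't over-split, please."))
print(n_phrases("Similarly, a function to extract words, correctly.", 2))
```

Note that the second call yields no phrase containing "Similarly" or "correctly": each is alone between punctuation marks, so no two-word phrase can be formed around it.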
So that's my thought for the day.