Saturday, August 14, 2010

tf-idf weights

Quoth Wikipedia, "The tf–idf weight (term frequency–inverse document frequency) is a weight often used in information retrieval and text mining."

The idea is to weight each term by how often it appears in the current document, discounted by how common it is across your overall corpus. Terms that are frequent in one document but rare overall get the highest weights, and those are the terms most likely to indicate what the document is about, so you can use them to find documents.
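To make the weighting concrete, here is a minimal sketch in plain Python. It assumes one common textbook variant (relative term frequency times the natural log of N over document frequency); real libraries differ in smoothing and normalization. The function name `tf_idf` and the toy corpus are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length,
    idf = ln(number of documents / documents containing the term).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus, already split into words.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
weights = tf_idf(corpus)
```

In this toy corpus "the" occurs in every document, so its weight is zero everywhere, while "mat" occurs in only one document and outweighs "cat", which occurs in two.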

Terminology mining is a technique for finding the "interesting" terms in a document. Those terms can then be researched before translation starts, so that the translation itself is both consistent and quick.
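One plausible way to do this kind of mining is to rank a document's terms by tf-idf weight and keep the top few. The sketch below assumes that approach; the helper names `tokenize` and `interesting_terms`, the add-one smoothing, and the sample sentences are all illustrative choices, not a standard method.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def interesting_terms(document, corpus, k=3):
    """Return the k terms of `document` with the highest tf-idf weight."""
    corpus_tokens = [set(tokenize(text)) for text in corpus]
    n_docs = len(corpus_tokens)
    counts = Counter(tokenize(document))
    total = sum(counts.values())
    scores = {}
    for term, count in counts.items():
        df = sum(1 for tokens in corpus_tokens if term in tokens)
        # Add-one smoothing (an assumption) avoids division by zero
        # for terms that never occur in the reference corpus.
        scores[term] = (count / total) * math.log((1 + n_docs) / (1 + df))
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

# Hypothetical reference corpus; the third text is the one to translate.
corpus = [
    "The contract is signed by the buyer and the seller.",
    "The seller ships the goods to the buyer.",
    "The escrow agent holds the escrow funds until the goods arrive.",
]
terms = interesting_terms(corpus[2], corpus, k=2)
```

Here "escrow" ranks first because it is repeated in the document and absent from the rest of the corpus, while "the" never makes the list.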

There are lots of links I want to save that are tangentially related to this sort of textual analysis.
  • Gensim is a textual analysis library in Python.
  • An earlier paper on term weighting.
  • tf-idf library in Python at Google Code.
  • And another at GitHub.
