Common stem–word difference penalty

Stemming allows the fuzzy matching process to identify match candidates that it previously could not find. Words that share a common stem arguably should be scored in a way that reflects their commonality, but the current scoring algorithm does not take stem commonality into account: any change in a word results in a full word penalty. For instance (as noted earlier), if the sentence "The cat in the hat" were compared to the sentence fragment "The cats in the hats", the current scoring process would report a score of 60% because three of the five words are the same. "cat" and "cats" are treated as two completely different words, as are "hat" and "hats".

The stemming penalty treats words with a common stem as related rather than as completely different. Instead of a full penalty, only a partial penalty is applied, reflecting the perceived effort of translating from one form of a word to another.

The TM property tm_score_same_stem_penalty allows you to configure this value, which should be between 0 and 1. A value of 0 means that no penalty is applied to the match when the stem is the same but the word is not exactly the same. A value of 1 means that the full penalty is applied, so different words with the same stem are treated as completely different words. A meaningful value lies somewhere between these two extremes, and customers should consider carefully what makes sense for them.
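
As a rough illustration of how the penalty value affects a score, the following Python sketch assumes a simple position-by-position word comparison: an exact match earns full credit, a same-stem match earns 1 minus the penalty, and any other difference earns nothing. The function names, the toy stemmer (which only strips a trailing "s"), and the scoring formula itself are assumptions made for this example. The actual scoring algorithm and stemmers may differ.

```python
def toy_stem(word):
    """Hypothetical stemmer: lowercases and strips one trailing 's'."""
    word = word.lower()
    if len(word) > 1 and word.endswith("s"):
        return word[:-1]
    return word


def score(source, candidate, same_stem_penalty=0.05):
    """Score two equal-length sentences word by word (assumed model)."""
    if not 0 <= same_stem_penalty <= 1:
        raise ValueError("same_stem_penalty must be between 0 and 1")
    src_words = source.split()
    cand_words = candidate.split()
    if not src_words or len(src_words) != len(cand_words):
        raise ValueError("this sketch only compares non-empty sentences of equal length")

    credit = 0.0
    for src, cand in zip(src_words, cand_words):
        if src == cand:
            credit += 1.0                      # identical word: no penalty
        elif toy_stem(src) == toy_stem(cand):
            credit += 1.0 - same_stem_penalty  # same stem: partial penalty
        # otherwise: full word penalty, no credit
    return credit / len(src_words)


if __name__ == "__main__":
    for penalty in (0, 0.05, 1):
        result = score("The cat in the hat", "The cats in the hats", penalty)
        print(penalty, round(result, 2))
```

Under these assumptions, the example sentences score 0.6 with a penalty of 1 (the 60% figure given earlier), about 0.98 with a penalty of .05, and 1.0 with a penalty of 0.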

This penalty is applied only when stemmers are used. The default value for this penalty is .05 (a 5% weighted penalty). Alternatively, it can default to 1, which forces the customer to assign a meaningful value that more appropriately represents the effort of translating from one form of a word to another.

The tm.properties file entry and its default value are:
tm_score_same_stem_penalty=.05
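
For readers who want to sanity-check a value before deploying it, here is a minimal Python sketch that reads the entry from key=value text and enforces the 0 to 1 range described above. The helper name read_same_stem_penalty is hypothetical and is not part of the product. The sketch also assumes a plain key=value layout with "#" comment lines, and that a missing entry falls back to .05.

```python
def read_same_stem_penalty(properties_text, default=0.05):
    """Return tm_score_same_stem_penalty from key=value text (hypothetical helper)."""
    for line in properties_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and lines without a key=value pair.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "tm_score_same_stem_penalty":
            penalty = float(value.strip())  # float(".05") parses as 0.05
            if not 0 <= penalty <= 1:
                raise ValueError("tm_score_same_stem_penalty must be between 0 and 1")
            return penalty
    return default  # assumed fallback when the entry is absent


if __name__ == "__main__":
    print(read_same_stem_penalty("tm_score_same_stem_penalty=.05"))
```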