What Is Stemming?
Stemming is the process of identifying or creating canonical forms (or roots) for words in order to normalize away certain differences between them. These canonical forms are referred to as stems. The primary goal of stemming is to identify words that are related. For instance, the words cat and cats are related: one is the singular form and the other is the plural form of the same word. While the two words may translate differently, you would expect the translations to be very close. When you compare these two words to each other, you would not expect a full penalty to be assessed, as you would when comparing two unrelated words such as cat and dog.
Stems allow hits to be found for different forms of the sought word. For instance, if you search for cats, you would find all of the entries containing cats, but you would not naturally find the entries containing the singular form cat. However, if the search process operated on the stems of words rather than on their actual forms, you would get the additional hits. For example, the stem for both cats and cat would most likely be cat (depending on the implementation). Therefore, searches for either cat or cats would yield the same results.
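A minimal Python sketch of stem-based lookup follows. The naive_stem function is a toy written for this illustration (it only strips a trailing plural s) and is not the stemmer used by any particular product; build_index, search, and the sample entries are likewise hypothetical.

```python
def naive_stem(word):
    """Toy stemmer (assumption): lowercase and strip a trailing plural 's'."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def build_index(entries):
    """Map each stem to the set of entry ids whose text contains it."""
    index = {}
    for entry_id, text in enumerate(entries):
        for word in text.split():
            index.setdefault(naive_stem(word), set()).add(entry_id)
    return index


def search(index, query):
    """Look up a single query word by its stem rather than its exact form."""
    return index.get(naive_stem(query), set())


entries = ["My cat sleeps", "Two cats play", "A dog barks"]
index = build_index(entries)

# Both forms resolve to the stem "cat", so both return the same entries.
print(sorted(search(index, "cats")))  # [0, 1]
print(sorted(search(index, "cat")))   # [0, 1]
```

Because both the indexed words and the query are reduced to stems, the singular and plural forms become interchangeable at lookup time.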
The cat in the hat
Generated shingles:
the|cat|in|, cat|in|the|, in|the|hat|, the|hat|the|, and hat|the|cat|
The cats in the hats
Generated shingles:
the|cats|in|, cats|in|the|, in|the|hats|, the|hats|the|, and hats|the|cats|
These shingles are used for finding match candidates based on the number of shared shingles. For the TM, this is used in the fuzzy matching process, where the candidates with the most shingles in common with the search string are returned. If we consider the two sentences above and the shingles that currently result from them, we notice that even though the sentences use nearly the same words, they do not share a single shingle. As a result, neither would be returned as a fuzzy match candidate for the other. If the sentence The cat in the hat is stored in the TM, and The cats in the hats is used for a new search, the search would return no match candidates.
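The shingle comparison can be sketched in Python as follows. The shingle size of three words and the wrap-around at the end of the sentence are assumptions inferred from the listings above (for example, hat|the|cat wraps back to the start); the function name shingles is hypothetical, and the trailing separator shown in the listings is omitted.

```python
def shingles(sentence, size=3):
    """Return the set of circular word shingles of the given size.

    Assumption: shingles wrap around the end of the sentence, as the
    listings in the text suggest.
    """
    words = sentence.lower().rstrip(".").split()
    n = len(words)
    return {
        "|".join(words[(i + j) % n] for j in range(size))
        for i in range(n)
    }


stored = shingles("The cat in the hat")
query = shingles("The cats in the hats")

print(sorted(stored))
# Every shingle contains cat/cats or hat/hats, so nothing is shared.
print(len(stored & query))  # 0
```

With three-word shingles, every window over these five-word sentences contains at least one of the two differing words, which is why the overlap is empty.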
With stemming, cat and cats would have the stem cat. If the shingling process in WorldServer used stemming, the above example would look more like this:
The cat in the hat
Generated shingles:
the|cat|in|, cat|in|the|, in|the|hat|, the|hat|the|, and hat|the|cat|
The cats in the hats
Generated shingles:
the|cat|in|, cat|in|the|, in|the|hat|, the|hat|the|, and hat|the|cat|
Notice that cat and cats share the same stem, as do hat and hats. As a result, the shingles produced for the two sentences are identical. If we were to initiate a search using either sentence, it would generate the same number of shingle hits against each of the stored strings. Both entries would be returned and then scored, and the resulting score would reflect the actual differences between the two strings. If the sentence The cat in the hat is stored in the TM, and The cats in the hats is used for a new search, the search would return a single match candidate. With stemming, the score would be a 60% match (reflecting the fact that 2 out of the 5 words in each sentence are different).
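The following Python sketch illustrates both steps: candidate retrieval on stemmed shingles, then scoring on the actual words. All names are hypothetical. The toy stemmer, the circular three-word shingles, and the position-by-position scoring are assumptions made for this illustration, not a description of how WorldServer computes its scores.

```python
def naive_stem(word):
    """Toy stemmer (assumption): lowercase and strip a trailing plural 's'."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def stemmed_shingles(sentence, size=3):
    """Circular word shingles built from stems instead of surface forms."""
    stems = [naive_stem(w) for w in sentence.rstrip(".").split()]
    n = len(stems)
    return {
        "|".join(stems[(i + j) % n] for j in range(size))
        for i in range(n)
    }


def word_score(a, b):
    """Fraction of positions holding identical words (equal lengths only)."""
    wa, wb = a.lower().split(), b.lower().split()
    if not wa or len(wa) != len(wb):
        raise ValueError("this sketch only scores non-empty, equal-length sentences")
    return sum(x == y for x, y in zip(wa, wb)) / len(wa)


stored = "The cat in the hat"
query = "The cats in the hats"

# Retrieval: with stems, all five shingles are shared.
print(len(stemmed_shingles(stored) & stemmed_shingles(query)))  # 5

# Scoring: still done on the real words, so 3 of 5 match.
print(f"{word_score(stored, query):.0%}")  # 60%
```

Stemming only affects which candidates are found; the score is still computed against the unstemmed text, which is why the match comes back at 60% rather than 100%.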
Stemming would therefore produce more match candidates, potentially allowing greater leverage of the content stored in the repository. In this example, we gain a 60% match, versus no match at all without the use of stems. Arguably, the 60% score is too low, because it is based on treating cat and cats (and hat and hats) as completely different words. Perhaps instead of applying a full word penalty, a lower stemming penalty should be applied to different words that share a common stem. This lower penalty would reflect the fact that the words are considered to be related. Depending on the penalty value, the score of The cat in the hat against a match of The cats in the hats could be 80%, 90%, 95%, or whatever the user deems appropriate (provided that the penalty is configurable).
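One way such a configurable penalty could work is sketched below in Python. The formula (one minus the summed penalties divided by the word count), the parameter names word_penalty and stem_penalty, and the toy stemmer are all assumptions chosen so that the figures mentioned above can be reproduced; the actual scoring model may differ.

```python
def naive_stem(word):
    """Toy stemmer (assumption): lowercase and strip a trailing plural 's'."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def score(a, b, word_penalty=1.0, stem_penalty=0.25):
    """Score two equal-length sentences position by position.

    Identical words cost nothing, different words sharing a stem cost
    stem_penalty, and unrelated words cost word_penalty.
    """
    wa, wb = a.lower().split(), b.lower().split()
    if not wa or len(wa) != len(wb):
        raise ValueError("this sketch only scores non-empty, equal-length sentences")
    total = 0.0
    for x, y in zip(wa, wb):
        if x == y:
            continue
        total += stem_penalty if naive_stem(x) == naive_stem(y) else word_penalty
    return 1 - total / len(wa)


stored = "The cat in the hat"
query = "The cats in the hats"

for penalty in (1.0, 0.5, 0.25, 0.125):
    print(penalty, f"{score(stored, query, stem_penalty=penalty):.0%}")
# 1.0 -> 60%, 0.5 -> 80%, 0.25 -> 90%, 0.125 -> 95%
```

Setting stem_penalty equal to word_penalty reproduces the 60% result, while smaller values move the score toward 100% in proportion to how closely related the user considers stem-sharing words to be.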