Stemming
On the translation memory (TM) definition page, you can select the indexing option for stemming.
Stemming is the process of reducing morphological variants of a word (for example, cats and cat) to a common root form or stem (for example, cat) that can be used to identify all variants of the word. Because entries are matched on stems rather than on exact word forms, stemming can find entries that contain word variations and can therefore return a larger number of match hits.
The WorldServer fuzzy match process is based on finding matches that share common word runs (or shingles). This is different from simply comparing segments on a word-by-word basis. The shingles can be created using the actual words found in the segment (standard shingles) or by using the stemmed form of the words found in the segment (stemmed shingles). Standard shingles favor exact matches, whereas stemmed shingles may help find matches that contain different grammatical forms of the words.
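The difference between the two kinds of shingles can be illustrated with a short Python sketch. This is not WorldServer code: the shingle size of two words, the function names, and the toy_stem helper (which only strips a plural "s") are assumptions made for illustration.

```python
def word_runs(words, size):
    """Return the set of contiguous runs of `size` words (shingles)."""
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a plural 's'."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def standard_shingles(segment, size=2):
    # Built from the actual words found in the segment.
    return word_runs(segment.lower().split(), size)


def stemmed_shingles(segment, size=2):
    # Built from the stemmed form of each word in the segment.
    return word_runs([toy_stem(w) for w in segment.lower().split()], size)


a = "Feed the cats daily"
b = "Feed the cat daily"

shared_standard = standard_shingles(a) & standard_shingles(b)
shared_stemmed = stemmed_shingles(a) & stemmed_shingles(b)

print(len(shared_standard))  # 1: only ("feed", "the") is shared
print(len(shared_stemmed))   # 3: every shingle is shared once "cats" becomes "cat"
```

With standard shingles the two segments share a single word run, while with stemmed shingles all runs coincide, which is why stemmed shingles tend to surface more candidates.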
If you have uploaded a custom stemmer component to improve the fuzzy matching and fuzzy scoring processes, the TM definition page displays Index Options, from which you can select Without Stemming (standard shingles, the default), With Stemming, or Both.
Searches based on standard shingles (the Without Stemming selection) are sensitive to the slightest change in a word. For example, standard searches consider run, ran, and running to be three different words and do not count them as matches for one another. Standard shingles are optimal for finding exact matches and fuzzy matches that share identical word runs. For example, a standard search would consider "Run to the store" and "Ran to the store" to be fuzzy matches because both contain the shingle "to the store", but it would not consider them 100% matches.
Stem-based matching (the With Stemming selection) changes the matching algorithm to ignore minor differences between word forms, such as tense or plurality. In the example above, stem-based matching may consider "Run to the store" and "Ran to the store" to be 100% matches because they vary only by tense, provided the stemmer resolves run and ran to the same root.
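The run/ran example can be worked through in a minimal Python sketch. Everything here is an assumption for illustration: the STEMS lookup table stands in for a real stemmer component, the shingle size of three words is arbitrary, and the overlap ratio is a simple set comparison, not the WorldServer scoring formula.

```python
# Hypothetical lookup-table stemmer; a real stemmer component would be
# uploaded to WorldServer and selected per language.
STEMS = {"ran": "run", "running": "run"}


def stem(word):
    return STEMS.get(word, word)


def shingles(segment, size=3, stemming=False):
    """Return the word runs of a segment, optionally built from stems."""
    words = segment.lower().split()
    if stemming:
        words = [stem(w) for w in words]
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def overlap(a, b, stemming):
    """Share of shingles the two segments have in common (illustrative only)."""
    sa = shingles(a, stemming=stemming)
    sb = shingles(b, stemming=stemming)
    return len(sa & sb) / len(sa | sb)


source = "Run to the store"
candidate = "Ran to the store"

# Without stemming, only ("to", "the", "store") is shared: a partial overlap.
print(overlap(source, candidate, stemming=False))

# With stemming, "ran" resolves to "run" and every shingle is shared.
print(overlap(source, candidate, stemming=True))  # 1.0
```

If the stemmer did not map ran to run, the second result would be the same as the first, which is why the outcome depends on the stemmer resolving both forms to the same root.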
The "generate both standard and stemmed shingles" option (Both) combines the advantages of standard and stemmed shingles, but requires the generation of two sets of shingles for every stored translation. As a result, the shingle tables for the TM, including their table indexes, grow twice as fast. The additional space consumption, coupled with the decrease in storage performance, might outweigh the leverage benefits.
A TM property, common_stem_penalty, allows you to configure a penalty value that reduces the full penalty that is otherwise assessed against a stem match. This penalty lets you treat words with a common stem as being related, and results in a higher match score than if they were treated as being completely different.
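The effect of a reduced penalty for stem matches can be sketched in Python as follows. This is only an illustration under stated assumptions, not the WorldServer scoring algorithm: the starting score of 100, the FULL_PENALTY value, the word-by-word comparison, the requirement that both segments have the same number of words, and the idea that the configured value is charged in place of the full penalty are all hypothetical.

```python
FULL_PENALTY = 10.0  # hypothetical cost of a completely different word

# Hypothetical lookup-table stemmer standing in for a real stemmer component.
STEMS = {"ran": "run", "running": "run"}


def stem(word):
    return STEMS.get(word, word)


def match_score(source, candidate, common_stem_penalty):
    """Score two segments, charging less for words that share a stem."""
    src = source.lower().split()
    cand = candidate.lower().split()
    if len(src) != len(cand):
        raise ValueError("this sketch only compares segments of equal length")

    score = 100.0
    for a, b in zip(src, cand):
        if a == b:
            continue  # identical words cost nothing
        if stem(a) == stem(b):
            score -= common_stem_penalty  # related words: reduced penalty
        else:
            score -= FULL_PENALTY  # unrelated words: full penalty
    return score


print(match_score("Run to the store", "Ran to the store", 2.0))   # 98.0
print(match_score("Run to the store", "Walk to the store", 2.0))  # 90.0
```

In this sketch, run and ran share a stem, so the pair scores higher than a pair whose differing words are unrelated.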
You are also presented with a drop-down list from which to pick a stemming algorithm for each language. For an explanation of how to develop or obtain stemmer components and add them (via ), see the WorldServer Software Development Kit (SDK) User Guide.