
Word Stemmer Support Framework

WorldServer allows you to create and configure stemmers to be used by WorldServer translation memory lookup operations. The following summarizes the word stemmer support:
  • Provides an SDK Stemmer Component that lets you develop and upload custom stemming implementations.
  • Supports stemmer use in locating TM fuzzy matches.
  • Provides the ability to associate stemmers with a specific language.
  • Allows stemmers to be associated with specific TMs.
  • Restricts a TM from being assigned multiple stemmers for the same language.
  • Allows the use of assigned stemmers to be enabled or disabled for a specific TM.
  • Generates standard shingles if stem use is disabled, or if no stemmer is assigned for the specific locale (whether source or target).
  • Supports a configurable stemming penalty that is applied during scoring to word mismatches that share a common stem. The assumption is that this penalty is smaller than the penalty for a completely different word.
  • Supports an option to generate both standard and stem-based shingles whenever a stemmer is provided. This produces larger shingle tables, but generates additional hits for tighter matches when stems are used.
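
The shingle and penalty behavior summarized above can be illustrated with a small sketch. This is not WorldServer code, and WorldServer's actual shingle format and scoring formula are not described here. The sketch is written in Python for brevity, and every name in it (toy_stem, shingles, word_score, segment_score), the shingle size of 2, the position-by-position word comparison, and the 0.25 penalty are assumptions made for illustration only.

```python
from typing import Callable, Optional, Sequence, Set, Tuple

Shingle = Tuple[str, ...]


def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips a few English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def shingles(
    words: Sequence[str],
    size: int = 2,
    stem: Optional[Callable[[str], str]] = None,
    include_standard: bool = False,
) -> Set[Shingle]:
    """Build word shingles (overlapping runs of `size` words).

    With no stemmer, only standard shingles are produced. With a stemmer,
    stem-based shingles are produced, plus the standard ones if requested.
    """

    def windows(tokens: Sequence[str]) -> Set[Shingle]:
        return {tuple(tokens[i : i + size]) for i in range(len(tokens) - size + 1)}

    if stem is None:
        return windows(words)
    result = windows([stem(w) for w in words])
    if include_standard:
        result |= windows(words)  # larger table, but exact-word shingles survive
    return result


def word_score(a: str, b: str, stem: Callable[[str], str], stem_penalty: float) -> float:
    """Score one word pair: identical, same stem (penalized), or different."""
    if a == b:
        return 1.0
    if stem(a) == stem(b):
        return 1.0 - stem_penalty  # smaller penalty than a different word
    return 0.0


def segment_score(
    source: Sequence[str],
    candidate: Sequence[str],
    stem: Callable[[str], str],
    stem_penalty: float = 0.25,
) -> float:
    """Average the word scores position by position (a deliberately naive model)."""
    length = max(len(source), len(candidate))
    if length == 0:
        return 1.0
    total = sum(word_score(a, b, stem, stem_penalty) for a, b in zip(source, candidate))
    return total / length
```

Under these assumptions, "translated documents" and "translating documents" share no standard shingle, but they share the stem-based shingle ("translat", "document"), so the candidate can still be found. The stemming penalty then keeps its score below that of an exact match without treating the mismatched word as entirely different.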

The purpose of the stemmer is to generate the stem for a supplied word. Because stemming algorithms and heuristics are language sensitive, each stemmer should be implemented for a specific language. To handle multiple languages, multiple stemmers must be provided.
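
The one-stemmer-per-language arrangement can be sketched as follows. The sketch is in Python for brevity and does not reproduce the SDK Stemmer Component interface, which is not shown in this topic. The Stemmer, SuffixStemmer, and StemmerRegistry classes, the suffix lists, and the language codes are hypothetical, and simple suffix stripping stands in for a real stemming algorithm.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable


class Stemmer(ABC):
    """Hypothetical contract: one stemmer instance serves exactly one language."""

    def __init__(self, language: str) -> None:
        self.language = language

    @abstractmethod
    def stem(self, word: str) -> str:
        """Return the stem for the supplied word."""


class SuffixStemmer(Stemmer):
    """Toy stemmer that strips the first matching suffix from a fixed list."""

    def __init__(self, language: str, suffixes: Iterable[str], min_stem_length: int = 3) -> None:
        super().__init__(language)
        self._suffixes = tuple(suffixes)
        self._min_stem_length = min_stem_length

    def stem(self, word: str) -> str:
        lowered = word.lower()
        for suffix in self._suffixes:
            if lowered.endswith(suffix) and len(lowered) - len(suffix) >= self._min_stem_length:
                return lowered[: -len(suffix)]
        return lowered


class StemmerRegistry:
    """Holds at most one stemmer per language."""

    def __init__(self) -> None:
        self._by_language: Dict[str, Stemmer] = {}

    def register(self, stemmer: Stemmer) -> None:
        if stemmer.language in self._by_language:
            raise ValueError(f"a stemmer is already registered for {stemmer.language!r}")
        self._by_language[stemmer.language] = stemmer

    def stem(self, language: str, word: str) -> str:
        """Stem the word if a stemmer exists for the language, else return it unchanged."""
        stemmer = self._by_language.get(language)
        return word if stemmer is None else stemmer.stem(word)


registry = StemmerRegistry()
registry.register(SuffixStemmer("en", ("ing", "ed", "es", "s")))
registry.register(SuffixStemmer("es", ("es", "s")))
```

With this registry, "Matching" and "matched" both stem to "match" for the "en" language, while a word in a language with no registered stemmer is returned unchanged, in the same way that standard shingles are generated when no stemmer is assigned for a locale.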

Stemmers can be used during ad hoc search operations and internal search operations (such as the leverage process). When deciding which stemmers to implement, note that the WorldServer leverage and scoping processes perform searches only from the source language. As a result, only source languages require stemming implementations in order to affect scoping results. Stemmers for any other language are used only during ad hoc searches, when the user performs a target-based search. Therefore, creating stemmers for languages that are not used as source languages adds limited value to a WorldServer environment.