Generating Standard and Stem Shingles
Searches based on standard shingles are sensitive to the slightest change in a word. They are optimal for finding exact matches and fuzzy matches with identical word runs. The higher the number of hits generated, the more likely the match is the one for which you are looking. Searches based on stemmed shingles are less sensitive to changes in words. Matches generating the same number of hits are less likely to be equally appropriate because certain (pre-scoring) differences are ignored. Stem-based matching is geared more toward finding pure fuzzy matches, as opposed to finding exact matches.
The cat in the hat
Standard shingles: the|cat|in|, cat|in|the|, in|the|hat|, the|hat|the, and hat|the|cat|
Stemmed shingles: the|cat|in|, cat|in|the|, in|the|hat|, the|hat|the, and hat|the|cat|
The cats in the hats
Standard shingles: the|cats|in|, cats|in|the|, in|the|hats|, the|hats|the, and hats|the|cats|
Stemmed shingles: the|cat|in|, cat|in|the|, in|the|hat|, the|hat|the, and hat|the|cat|
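The shingle lists in this example can be reproduced with a short sketch. It is only an illustration, not the product's implementation: the three-word window and the wrap-around at the end of the sentence are inferred from the example itself, `toy_stem` is a hypothetical stand-in for a real stemmer (it only strips a trailing "s"), and the key format omits the trailing separator shown in the listing.

```python
def tokenize(sentence):
    """Lowercase the sentence and split it on whitespace."""
    return sentence.lower().split()


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: drop one trailing 's'."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def shingles(words, size=3):
    """Build one word n-gram per starting word, wrapping around the end.

    The wrap-around is an assumption inferred from the example, where the
    last shingles reuse the first words of the sentence.
    """
    n = len(words)
    return ["|".join(words[(i + k) % n] for k in range(size)) for i in range(n)]


def standard_shingles(sentence):
    """Shingles built from the words exactly as written."""
    return shingles(tokenize(sentence))


def stemmed_shingles(sentence):
    """Shingles built from the stemmed form of each word."""
    return shingles([toy_stem(w) for w in tokenize(sentence)])


for text in ("The cat in the hat", "The cats in the hats"):
    print(text)
    print("  standard:", standard_shingles(text))
    print("  stemmed: ", stemmed_shingles(text))
```

With this sketch the two sentences share no standard shingles, while their stemmed shingles are identical, which is the behavior the example illustrates.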
- If we are using standard shingles, "The cat in the hat" and "The cats in the hats" produce completely different sets of shingles. As a result, a search for one of the sentences would never return the other sentence as a match candidate. This represents the potential loss of a high fuzzy match.
- If we are using stemmed shingles, both sentences generate exactly the same shingles. As a result, a search for one of the sentences returns the other sentence as a match candidate. Also, because the shingles are identical, the non-exact matching sentence results in the same number of hits as the exact match. If the number of candidates returned from the database is restricted (as it is), it is possible that the best match would not be returned, depending on how many equal-hit candidates exist in the database and how many are pulled back for processing. (This is already possible without stemming; stemming makes it more likely.) To compensate, the allowed number of candidates during the candidate selection process may need to be increased.
Only a subset of 100% matches relies on the fuzzy lookup process to be found. The leverage process searches for ICE and exact matches before it even engages the fuzzy lookup process. The exact match lookup requires everything to be exactly the same, so it is possible to have 100% matches that the exact match lookup misses. These typically include entries with textual differences that are not penalized (such as extra whitespace). As a result, increasing the allowed number of candidates might not make an appreciable difference to the leverage results. (This is a good thing.)
- If we are using both standard and stemmed shingles, then we will get hits against both sentences regardless of the sentence we use for our search. However, the more appropriate match will have more hits.
In the above example, a search with either sentence generates five more hits for the exact match than for the fuzzy match. This gives you the best of both options, but it comes at a price. This option requires the generation of two sets of shingles for every stored translation. That means that the shingle tables for the TM grow twice as fast, as do the table indexes. The additional space consumption, coupled with the decrease in storage performance, may outweigh the leverage benefits.
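The three configurations described above can be compared with a small hit-counting sketch. It is a simplified model under stated assumptions: the helper names are hypothetical, `toy_stem` stands in for a real stemmer, a "hit" is counted as one shared shingle key, and the `std:` / `stem:` prefixes are an illustrative way of keeping the two kinds of keys apart, not the product's storage format.

```python
def tokenize(sentence):
    """Lowercase the sentence and split it on whitespace."""
    return sentence.lower().split()


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: drop one trailing 's'."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def shingles(words, size=3):
    """Wrap-around word n-grams, one per starting word (assumed scheme)."""
    n = len(words)
    return ["|".join(words[(i + k) % n] for k in range(size)) for i in range(n)]


def index_keys(sentence, mode):
    """Keys stored or searched for a sentence: 'standard', 'stemmed', or 'both'."""
    words = tokenize(sentence)
    keys = set()
    if mode in ("standard", "both"):
        keys |= {"std:" + s for s in shingles(words)}
    if mode in ("stemmed", "both"):
        keys |= {"stem:" + s for s in shingles([toy_stem(w) for w in words])}
    return keys


def hits(query, stored, mode):
    """Number of shingle keys the query shares with a stored sentence."""
    return len(index_keys(query, mode) & index_keys(stored, mode))


query = "The cat in the hat"
for mode in ("standard", "stemmed", "both"):
    exact = hits(query, "The cat in the hat", mode)
    fuzzy = hits(query, "The cats in the hats", mode)
    stored = len(index_keys(query, mode))
    print(f"{mode:8} exact={exact} fuzzy={fuzzy} keys stored per sentence={stored}")
```

In this model the standard configuration never sees the fuzzy candidate, the stemmed configuration cannot tell the two candidates apart by hit count, and the combined configuration separates them at the cost of storing twice as many keys per sentence.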
Having the Generate both standard and stemmed shingle option provides the customer with all of the above possibilities. Customers need to assess which configuration is optimal for their data profile.
This section mostly applies to terminology databases (TDs) as well, though the TD does not have the table size considerations described in the third point above. The TD would not double, or even grow, in the number of entries. Instead, it would need to support an additional search key field to allow for the storage of both standard and stem-based search keys.