Word breaking
Word breaking attempts to identify word boundaries within textual content and presents that information to WorldServer for further processing. Word breaking is critical to translation memory (TM) and terminology database (TD) searches, excluding ICE and pure exact-match TM searches. It defines the word units that are then used to create shingles for match searches.
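As a rough illustration of how word units feed into shingles, the sketch below builds overlapping runs of consecutive words from a list that has already been word-broken. The function name `word_shingles` and the default shingle size of 2 are assumptions made for this example; the shingle size and matching logic that WorldServer actually uses are not described in this section.

```python
def word_shingles(words, size=2):
    """Return overlapping runs of `size` consecutive words (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    # A segment shorter than the shingle size yields no shingles.
    return [tuple(words[i:i + size]) for i in range(len(words) - size + 1)]


# Example: four word units produce three two-word shingles.
units = ["the", "quick", "brown", "fox"]
shingles = word_shingles(units)
```

Because shingles are built from the word units, any error in word breaking carries straight through to the shingles that match searches compare.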
Before WorldServer 8.0, WorldServer supported a single word breaker implementation that was used across all supported languages. The process analyzed alphanumeric characters, whitespace, and punctuation to determine where a word began, continued, and ended. This core implementation was designed with a bias toward English and was not sophisticated enough to handle all languages effectively. In particular, it did not handle character-based languages, such as Japanese, where a character represents a word. Additionally, its word-breaking strategy may not be equally effective for all Western or word-based languages.
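The limitation can be seen with a minimal sketch of a whitespace-and-punctuation word breaker. This is not WorldServer's code; `naive_break` is a hypothetical stand-in that treats every run of letters and digits as one word, which is an assumption about the general approach described above.

```python
import re


def naive_break(text):
    """Hypothetical English-biased breaker: runs of letters/digits are words."""
    # Whitespace and punctuation act as separators and are discarded.
    return re.findall(r"[^\W_]+", text)


english = naive_break("Word breaking, in WorldServer!")
japanese = naive_break("日本語のテキスト")
```

The English sentence splits into four words, while the Japanese text, which contains no spaces or punctuation, comes back as a single unit.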
The core implementation is generally sufficient for WorldServer customers who use English as the source language for translation work, but there are scenarios where it falls short. Ad hoc searches against certain target languages may perform poorly, especially for character-based languages, where each character should represent a word. However, the statistics that drive a customer’s ROI are based on the source language. As long as the source language is English, the impact of the English-centric word breaker is minimized.
The issue becomes more critical for customers who use a source language other than English. For instance, using Japanese as the source language dramatically exposed the limitations of the core implementation: for Japanese and other character-based languages, it almost completely eliminated fuzzy matches. To address this, the core implementation has been updated to identify Chinese, Japanese, and Korean characters and treat each one as a separate word, which improves fuzzy matching support for these languages. Additionally, WorldServer has introduced a framework that allows a new word breaking implementation to be uploaded and used within WorldServer.
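The updated behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the WorldServer implementation: the function name `cjk_aware_break` is hypothetical, and the character ranges (CJK Unified Ideographs, Hiragana, Katakana, and Hangul Syllables) are an assumed approximation of "Chinese, Japanese and Korean characters".

```python
import re

# Assumed ranges: CJK Unified Ideographs, Hiragana, Katakana, Hangul Syllables.
_CJK = "\u4e00-\u9fff\u3040-\u309f\u30a0-\u30ff\uac00-\ud7af"

# Either a single CJK character, or a run of non-CJK letters/digits.
_TOKEN = re.compile(rf"[{_CJK}]|[^\W_{_CJK}]+")


def cjk_aware_break(text):
    """Hypothetical breaker: each CJK character is a word; other text breaks on separators."""
    return _TOKEN.findall(text)


mixed = cjk_aware_break("WorldServer は便利")
japanese = cjk_aware_break("日本語のテキスト")
```

With each character returned as its own word unit, two Japanese segments that differ by a few characters still share most of their units, which is what makes a fuzzy match possible.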
The following sections discuss WorldServer support for word breaking.