Word counting algorithm
WorldServer uses a simple and efficient word counting scheme. During the scoping process, WorldServer breaks an asset into segments and then runs each segment through the word-breaking process described above. That process produces a list of word and number elements, which are then counted using the following rules:
- Each word is counted as one point to be added to the total scoped value.
- For Chinese and Japanese, WorldServer counts each character as a word, so for these languages it is effectively counting characters. When the WorldServer UI shows "Words" (for example, in scoping) for a Chinese or Japanese source language, it actually means "Characters". If content mixes Chinese or Japanese with Latin-based languages, the appropriate word counting scheme is used for each language. For example, "WorldServer " is counted as 6 words.
- In Korean, word counting is based on whitespace, not characters.
- Numbers are ignored unless a segment has no words.
- If a segment has no words but has at least one number, the whole segment is counted as one word.
- Wordless and numberless segments do not contribute to the scoped result.
This number counting scheme was chosen to reflect the lower cost of translating numbers. In most cases numbers do not require any translation. Counting every number in numerically oriented content might increase the translation cost disproportionately to the actual translation effort.
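
The counting rules listed above can be illustrated with a minimal Python sketch. This is not WorldServer's implementation: the regular-expression tokenizer is a crude stand-in for the real word-breaking process, the Unicode ranges used to detect Chinese and Japanese characters are an approximation, and the names `TOKEN`, `count_segment` and `count_asset` are hypothetical.

```python
import re

# Assumption: a rough approximation of "Chinese or Japanese character"
# (CJK Unified Ideographs, Hiragana, Katakana). The real product may differ.
CJK = "[\u4e00-\u9fff\u3040-\u30ff]"

# Hypothetical stand-in for the word breaker. It recognizes three token kinds:
#   cjk  - a single Chinese/Japanese character (one word each)
#   num  - a run of digits, optionally with . or , separators
#   word - a run of other letters (Latin, Hangul, ...), split at non-letters
TOKEN = re.compile(
    rf"(?P<cjk>{CJK})"
    r"|(?P<num>\d+(?:[.,]\d+)*)"
    rf"|(?P<word>(?:(?!{CJK})[^\W\d_])+)"
)


def count_segment(segment: str) -> int:
    """Return the scoped word count for one segment."""
    words = 0
    numbers = 0
    for match in TOKEN.finditer(segment):
        if match.lastgroup == "num":
            numbers += 1
        else:
            # Both ordinary words and single CJK characters count as one.
            words += 1
    if words:
        return words          # numbers are ignored when words are present
    if numbers:
        return 1              # numbers only: the whole segment counts as one
    return 0                  # no words and no numbers: contributes nothing


def count_asset(segments: list[str]) -> int:
    """Sum the per-segment counts to get the total scoped value."""
    return sum(count_segment(s) for s in segments)


if __name__ == "__main__":
    for text in ["Hello world", "Version 2 of 3", "12 345", "---"]:
        print(repr(text), count_segment(text))
```

In this sketch, `count_segment("Version 2 of 3")` counts only the two words, while `count_segment("12 345")` counts the whole segment once because it contains numbers but no words. Korean text needs no special branch here: Hangul runs are treated like any other letter run and are therefore delimited by whitespace.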