WorldServer standard word breaking

Character classes
WorldServer introduces an internal character classification used for word breaking. Each character is placed into one of the following categories, based on character properties.
  • Letter Characters (L)

    A character is considered to be a letter if and only if it is specified as a letter by the Unicode 2.0 standard (category Lu, Ll, Lt, Lm, or Lo in the Unicode specification data file). WorldServer uses the Java Character.isLetter() implementation to determine if a character is a letter.

  • Digit Characters (D)

    A character is considered to be a digit if it is not in the range '\u2000' <= ch <= '\u2FFF' and its Unicode name contains the word "DIGIT". WorldServer uses the Java Character.isDigit() implementation to determine if a character is a digit.

  • Middle-Word Characters (W)

    A character is considered to be a middle-word character if it is not a letter but should be treated as one when it is encountered in the middle of a word. The set of middle-word characters is configurable in WorldServer. The default configuration includes common characters such as dash (-), apostrophe ('), and "at" (@).

  • Middle-Number Characters (N)

    A character is considered to be a middle-number character if it is not a digit but it should be treated as one when it is encountered in the middle of a number. The set of middle-number characters is configurable in WorldServer. The default configuration includes common characters such as dot (.) and comma (,).
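The four classes above can be sketched in Java as follows. This is a minimal illustration, not WorldServer code: the class name, the enum, the OTHER class, and the two character sets are hypothetical. The sets contain only the default examples named above, whereas the real sets are configurable and may hold more characters. Note also that on current JDKs, Character.isLetter() and Character.isDigit() follow a newer Unicode version than the 2.0-era definitions quoted above.

```java
import java.util.Set;

// Hypothetical sketch: names here are illustrative, not WorldServer APIs.
final class CharClassSketch {
    enum CharClass { LETTER, DIGIT, MIDDLE_WORD, MIDDLE_NUMBER, OTHER }

    // Assumed sets, taken from the examples in the text; the real sets are configurable.
    static final Set<Character> MIDDLE_WORD = Set.of('-', '\'', '@');
    static final Set<Character> MIDDLE_NUMBER = Set.of('.', ',');

    // Letters and digits are checked first, so the configurable sets only
    // ever apply to characters that are neither.
    static CharClass classify(char ch) {
        if (Character.isLetter(ch)) return CharClass.LETTER;
        if (Character.isDigit(ch)) return CharClass.DIGIT;
        if (MIDDLE_WORD.contains(ch)) return CharClass.MIDDLE_WORD;
        if (MIDDLE_NUMBER.contains(ch)) return CharClass.MIDDLE_NUMBER;
        return CharClass.OTHER;
    }

    public static void main(String[] args) {
        // Print the class assigned to a few sample characters.
        for (char ch : "a7-. ".toCharArray()) {
            System.out.println("'" + ch + "' -> " + classify(ch));
        }
    }
}
```

Whether a middle-word or middle-number character actually counts as part of a word or number depends on its neighbors, so this classification is only the first step of word breaking.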

During the word-breaking process, WorldServer scans an array of characters and determines word boundaries based on character classes. This information is then used to define the basic elements for word counting and segment comparisons: words, numbers, and placeholders. Everything around and between these elements is categorized as punctuation (which includes whitespace).
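The scan described above might look roughly like the following Java sketch. It rests on several assumptions that this page does not confirm. A middle-word character is kept inside a word only when a letter follows it, and a middle-number character is kept inside a number only when a digit follows it. Letters and digits never share a token. Placeholders are not modeled at all. The class and method names and the two character strings are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a class-based scan; not WorldServer's actual algorithm.
final class WordBreakSketch {
    // Assumed sets from the text's examples; the real sets are configurable.
    static final String MIDDLE_WORD = "-'@";
    static final String MIDDLE_NUMBER = ".,";

    // Returns tokens tagged WORD, NUMBER, or PUNCT (placeholders are not modeled).
    static List<String> scan(char[] text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length) {
            int start = i;
            String kind;
            if (Character.isLetter(text[i])) {
                kind = "WORD";
                i = endOfRun(text, i, true);
            } else if (Character.isDigit(text[i])) {
                kind = "NUMBER";
                i = endOfRun(text, i, false);
            } else {
                // Everything between words and numbers, whitespace included.
                kind = "PUNCT";
                while (i < text.length && !Character.isLetter(text[i])
                        && !Character.isDigit(text[i])) {
                    i++;
                }
            }
            tokens.add(kind + ":" + new String(text, start, i - start));
        }
        return tokens;
    }

    // True if ch is a letter (word mode) or a digit (number mode).
    static boolean isCore(char ch, boolean word) {
        return word ? Character.isLetter(ch) : Character.isDigit(ch);
    }

    // Extends a run over core characters, accepting a middle character only
    // when it is immediately followed by another core character (assumption).
    static int endOfRun(char[] text, int i, boolean word) {
        String middle = word ? MIDDLE_WORD : MIDDLE_NUMBER;
        while (i < text.length) {
            if (isCore(text[i], word)) {
                i++;
            } else if (middle.indexOf(text[i]) >= 0 && i + 1 < text.length
                    && isCore(text[i + 1], word)) {
                i += 2;
            } else {
                break;
            }
        }
        return i;
    }

    public static void main(String[] args) {
        scan("don't pay 1,000.50 now.".toCharArray()).forEach(System.out::println);
    }
}
```

Under these assumptions, "don't" and "1,000.50" each stay whole as a single word and a single number, while the final period after "now" is classified as punctuation because no digit or letter follows it.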