Segmentation rules

Segmentation rules are customizable language processing rules that use regular expressions to specify what constitutes a segment. Often, a segment is a sentence, in which case the regular expressions describe the text patterns that delimit a sentence.

Language processing rules are specified in the translation engine for the entire project. All translation memories (TMs) selected in a translation engine must use the same language processing rule. If multiple segmentation rules are needed, they must be defined in the same language processing rule.

Segment break

A segment break is defined in two parts:

  • Before break: A pattern for the text immediately before the segment break.
  • After break: A pattern for the text immediately after the segment break.

A segment break is created only if some text matches the before break pattern and is immediately followed by text that matches the after break pattern.
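
The two-part test can be sketched in Python as follows. This is a minimal illustration, not the product's implementation: the function names find_breaks and split_segments, and the sample patterns, are assumptions made for the demonstration.

```python
import re


def find_breaks(text, before, after):
    """Return the offsets at which a segment break would be created.

    A break is placed at an offset only if `before` matches text that
    ends exactly there and `after` matches text that starts exactly there.
    """
    after_re = re.compile(after)
    breaks = []
    for m in re.finditer(before, text):
        if after_re.match(text, m.end()):
            breaks.append(m.end())
    return breaks


def split_segments(text, breaks):
    """Cut `text` at the given offsets, dropping empty pieces."""
    bounds = [0] + breaks + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:]) if a < b]


# Hypothetical rule: a period, then whitespace and an uppercase letter.
text = "It rained. We stayed in. the end"
breaks = find_breaks(text, r"\.", r"\s+[A-Z]")
segments = split_segments(text, breaks)
print(breaks)    # [10]
print(segments)  # ['It rained.', ' We stayed in. the end']
```

The second period does not produce a break: the before break pattern matches, but the text that follows (a space and a lowercase letter) does not match the after break pattern.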

Multiple rules

You might want a number of segmentation rules; for example, one rule to define the segmentation where there is a colon and another rule to cover the case where there is a period. In any one project, for the same language pair, you can use multiple (main) TMs with different segmentation rules.

Other language resources that affect segmentation

  • List of Abbreviations - Contains abbreviations that end with a period (.), such as etc. The period at the end of etc. does not necessarily mark the end of a sentence, although it can when the abbreviation happens to be the last word of the sentence.
  • List of Ordinal followers - Like abbreviations, ordinal followers identify cases where a period does not necessarily mark the end of a segment. When followed by certain nouns, a set of digits followed by a period (for example, 23.) denotes an ordinal (23rd), not the end of a sentence. For example, 23. April can mean 23rd April.
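
One way such lists could be consulted is sketched below in Python. Everything here is an assumption for illustration: the list entries, the function name keeps_segment_open, and the deliberately simple policy that a period never ends a segment after a listed abbreviation or between digits and a listed ordinal follower. A real engine also has to handle an abbreviation that genuinely ends a sentence, which this sketch does not attempt.

```python
import re

# Hypothetical list entries; real lists are language-specific resources.
ABBREVIATIONS = {"etc.", "approx."}
ORDINAL_FOLLOWERS = {"April", "Mai"}


def keeps_segment_open(text, pos):
    """Return True if the period ending at offset `pos` should not end a segment.

    Assumed policy for this sketch only: no break after a listed
    abbreviation, and no break between "digits + period" and a listed
    ordinal follower.
    """
    head, tail = text[:pos], text[pos:]

    # Case 1: the word that ends at the period is a listed abbreviation.
    last_word = re.search(r"\S+$", head)
    if last_word and last_word.group() in ABBREVIATIONS:
        return True

    # Case 2: digits plus period, followed by a listed ordinal follower.
    next_word = re.match(r"\s+(\w+)", tail)
    if (re.search(r"\d+\.$", head)
            and next_word
            and next_word.group(1) in ORDINAL_FOLLOWERS):
        return True

    return False


print(keeps_segment_open("Am 23. April kam er.", 6))  # True
print(keeps_segment_open("He left. She stayed.", 8))  # False
```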

Rules specifying exceptions

Even if text satisfies a segmentation rule, a segment break is not created if the text also matches an exception rule. For example, you might define an exception to cater for a period that is used within a sentence.
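
The interaction between rules and exceptions can be sketched in Python as follows. This is an illustrative model under stated assumptions, not the engine's actual algorithm: the helper names and the ASCII-only sample patterns are hypothetical, and each rule or exception is represented as a (before break, after break) pair.

```python
import re


def break_offsets(text, before, after):
    """Offsets where `before` ends and `after` begins at the same position."""
    after_re = re.compile(after)
    return {m.end() for m in re.finditer(before, text)
            if after_re.match(text, m.end())}


def segment_breaks(text, rules, exceptions):
    """Apply every rule, then drop any offset that an exception also matches."""
    candidates = set()
    for before, after in rules:
        candidates |= break_offsets(text, before, after)
    suppressed = set()
    for before, after in exceptions:
        suppressed |= break_offsets(text, before, after)
    return sorted(candidates - suppressed)


# Hypothetical rule: break after a period that is followed by whitespace.
rules = [(r"\.", r"\s")]
# Hypothetical exception: no break if a lowercase ASCII letter comes next.
exceptions = [(r"\.", r"\s[a-z]")]

text = "Use approx. two cups. Then stir."
print(segment_breaks(text, rules, []))          # [11, 21]
print(segment_breaks(text, rules, exceptions))  # [21]
```

Without the exception, the period in "approx." produces a break; with it, only the break after "cups." remains.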

Example

The following rule is an exception to the segmentation rule that ends a segment at a period (full stop). Because it is an exception, the TM does not create a segment break in text that matches this pattern, even if the text also matches the more general pattern that defines a segment break.

This rule matches one or more periods (optionally followed by other closing punctuation), followed by a whitespace character and then a lowercase letter:

Before break

\.+[\p{Pe}\p{Pf}\p{Po}"]*

Close, final, and other punctuation are Unicode categories, specified with the following codes:
  • \p{Pe} specifies close punctuation.
  • \p{Pf} specifies final quote punctuation.
  • \p{Po} specifies other punctuation.

After break

\s\p{Ll}

This regular expression matches a whitespace character followed by a lowercase letter.
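
The behavior of this exception can be approximated in Python. Python's built-in re module does not support \p{...} classes, so this sketch checks the Unicode categories with the standard unicodedata module instead of running the patterns directly. The function names are assumptions, and str.isspace() is used as a stand-in for \s, which may differ slightly from the engine's definition of whitespace.

```python
import unicodedata

# Pe = close punctuation, Pf = final quote punctuation, Po = other punctuation.
TRAILING_CATEGORIES = {"Pe", "Pf", "Po"}


def is_trailing_punct(ch):
    # Stand-in for the character class in the before break pattern.
    return ch == '"' or unicodedata.category(ch) in TRAILING_CATEGORIES


def before_matches(text, pos):
    # True if one or more periods, optionally followed by trailing
    # punctuation, end exactly at `pos`. Walk back over trailing
    # punctuation until a period is found.
    i = pos
    while i > 0 and is_trailing_punct(text[i - 1]):
        i -= 1
        if text[i] == ".":
            return True
    return False


def after_matches(text, pos):
    # True if a whitespace character and then a lowercase letter (Ll)
    # start exactly at `pos`.
    return (pos + 1 < len(text)
            and text[pos].isspace()
            and unicodedata.category(text[pos + 1]) == "Ll")


def exception_offsets(text):
    """Offsets where the exception applies, so no break is created there."""
    return [pos for pos in range(1, len(text))
            if before_matches(text, pos) and after_matches(text, pos)]


print(exception_offsets('He said "wait." and left.'))  # [15]
print(exception_offsets("Stop. Go now."))              # []
```

In the first sentence the period and closing quotation mark are followed by a space and a lowercase letter, so the exception applies at offset 15. In the second, an uppercase letter follows the period, so the exception does not apply and the general rule is free to create a break.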

For more information about Unicode categories, see the Microsoft documentation.