Segmentation rules
Segmentation rules are customizable language processing rules that use regular expressions to specify what constitutes a segment. Often, a segment represents a sentence, in which case the regular expressions specify the text patterns that constitute a sentence.
Language processing rules are specified in the translation engine for the entire project. All the TMs selected in a translation engine must have the same language processing rule. If multiple segmentation rules are needed, they must be defined in the same language processing rule.
Segment break
A segment break is defined in two parts:
- Before break: A pattern for the text immediately before the segment break.
- After break: A pattern for the text immediately after the segment break.
A segment break is created only if some text matches the before break pattern and is immediately followed by text that matches the after break pattern.
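The two-part test above can be illustrated with a minimal Python sketch. The function name find_breaks and the two sample patterns (a period before the break, any whitespace after it) are assumptions chosen for illustration; they are not the product's built-in rules or its implementation.

```python
import re

def find_breaks(text, before, after):
    """Return offsets where a `before` match ends and an `after` match begins."""
    after_re = re.compile(after)
    offsets = []
    for match in re.finditer(before, text):
        # A break needs both halves: text matching `before` that is
        # immediately followed by text matching `after`.
        if after_re.match(text, match.end()):
            offsets.append(match.end())
    return offsets

text = "It works. Try it now. Version 2.5 is out."
breaks = find_breaks(text, before=r"\.", after=r"\s")
print(breaks)  # [9, 21]
```

The period in 2.5 and the final period produce no break, because neither is followed by text that matches the after break pattern.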
Multiple rules
You might need several segmentation rules; for example, one rule that breaks segments at a colon and another that breaks them at a period. In any one project, for the same language pair, you can use multiple (main) TMs with different segmentation rules.
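As a rough sketch of how a colon rule and a period rule might work side by side, the Python example below collects the break offsets from every rule and then splits the text. The RULES list, the function names, and the choice to trim whitespace from each segment are assumptions for illustration only.

```python
import re

# Hypothetical rule set: each rule is a (before break, after break) pair.
RULES = [
    (r"\.", r"\s"),  # period followed by whitespace
    (r":", r"\s"),   # colon followed by whitespace
]

def break_offsets(text, rules):
    """Collect the break offsets produced by any rule in `rules`."""
    offsets = set()
    for before, after in rules:
        after_re = re.compile(after)
        for match in re.finditer(before, text):
            if after_re.match(text, match.end()):
                offsets.add(match.end())
    return sorted(offsets)

def segment(text, rules):
    """Split `text` at every break offset and trim surrounding whitespace."""
    pieces, start = [], 0
    for offset in break_offsets(text, rules):
        pieces.append(text[start:offset].strip())
        start = offset
    pieces.append(text[start:].strip())
    return [piece for piece in pieces if piece]

segments = segment("Note: save your work. Then close the file.", RULES)
print(segments)  # ['Note:', 'save your work.', 'Then close the file.']
```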
Other language resources that affect segmentation
- List of Abbreviations - The list contains abbreviations that end with a period (.), such as etc. The period at the end of etc. does not necessarily mark the end of a sentence, although it does when the abbreviation happens to be the last word of the sentence.
- List of Ordinal followers - Similar to abbreviations, ordinal followers provide cases where a period does not necessarily mark the end of a segment: when followed by certain nouns, a set of digits followed by a period (for example, 23.) signifies the ordinal (23rd), not the end of a sentence. For example, 23. April can mean 23rd April.
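The following Python sketch shows one way such lists could suppress a break. The sample list entries, the function names, and the decision to always suppress a break after a listed abbreviation are assumptions made for illustration; how the product actually applies these lists is not described here.

```python
import re

# Hypothetical sample entries; real lists are language specific.
ABBREVIATIONS = {"etc.", "e.g.", "Dr."}
ORDINAL_FOLLOWERS = {"April", "Mai", "Juni"}

# Candidate break: a period that is followed by whitespace.
CANDIDATE = re.compile(r"\.(?=\s)")

def break_offsets(text):
    """Return candidate break offsets that the two lists do not suppress."""
    offsets = []
    for match in CANDIDATE.finditer(text):
        end = match.end()
        token = text[:end].split()[-1]   # word that ends with the period
        following = text[end:].split()
        next_word = following[0].strip(".,;:") if following else ""
        if token in ABBREVIATIONS:
            continue  # the period belongs to a listed abbreviation
        if token[:-1].isdigit() and next_word in ORDINAL_FOLLOWERS:
            continue  # digits + period + listed noun: treated as an ordinal
        offsets.append(end)
    return offsets

def segment(text):
    """Split `text` at the remaining break offsets."""
    pieces, start = [], 0
    for offset in break_offsets(text):
        pieces.append(text[start:offset].strip())
        start = offset
    pieces.append(text[start:].strip())
    return [piece for piece in pieces if piece]

text = "The meeting is on 23. April in Berlin. Bring pens, paper, etc. and arrive early."
print(segment(text))
```

In this sketch, only the period after Berlin produces a break; the periods after 23 and etc are suppressed by the lists.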
Rules specifying exceptions
Example
The following rule is used as an exception to the segmentation rule that defines a segment demarcated by a period (full stop). Because it is used as an exception, the TM treats text that matches this pattern as containing no segment break, even if the text also matches the more general pattern that defines a segment break.
This rule matches one or more periods (optionally followed by other closing punctuation), followed by a whitespace character and then a lowercase letter:
Before break
\.+[\p{Pe}\p{Pf}\p{Po}"]*
- \p{Pe} specifies close punctuation.
- \p{Pf} specifies final quote punctuation.
- \p{Po} specifies other punctuation.
After break
\s\p{Ll}
This regular expression matches a whitespace character (\s) followed by a lowercase letter (\p{Ll}).
For more information about Unicode categories, see the Microsoft documentation.
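To experiment with the exception rule above outside the product, the following Python sketch may help. Python's standard re module does not support \p{...}, so the sketch builds equivalent character classes from the unicodedata module; the helper category_class, the other function names, and the general rule (a period followed by any whitespace) are assumptions for illustration, not the product's implementation.

```python
import re
import sys
import unicodedata

def category_class(*categories):
    """Build a regex character-class body from Unicode general categories."""
    wanted = set(categories)
    return "".join(
        re.escape(chr(cp))
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) in wanted
    )

# Stand-ins for [\p{Pe}\p{Pf}\p{Po}"] and \p{Ll}.
CLOSERS = category_class("Pe", "Pf", "Po")
LOWER = category_class("Ll")

BEFORE_BREAK = re.compile(r"\.+[" + CLOSERS + r'"]*')
AFTER_BREAK = re.compile(r"\s[" + LOWER + "]")
GENERAL_AFTER = re.compile(r"\s")  # assumed general rule: whitespace after the period

def exception_offsets(text):
    """Offsets where the exception rule matches, so no break is allowed."""
    return [
        m.end()
        for m in BEFORE_BREAK.finditer(text)
        if AFTER_BREAK.match(text, m.end())
    ]

def break_offsets(text):
    """Offsets from the general rule, minus those covered by the exception."""
    exceptions = set(exception_offsets(text))
    return [
        m.end()
        for m in BEFORE_BREAK.finditer(text)
        if GENERAL_AFTER.match(text, m.end()) and m.end() not in exceptions
    ]

text = "The file is approx. two pages long. Print it."
print(exception_offsets(text))  # offset just after "approx."
print(break_offsets(text))      # offset just after "long."
```

Because the period after approx is followed by a lowercase letter, the exception applies and the only remaining break is after long. Building the character classes scans every Unicode code point once per call, which is acceptable for a sketch but would normally be cached.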