
Segmentation rules

Segmentation rules define how a translation memory (TM) or a project divides source text into segments.

Segmentation rules are customizable language resources, defined by the regular expressions that specify a segment. Often a segment is identical to a sentence, in which case the regular expression specifies the text patterns that constitute a sentence.

A segment break is defined in two parts:

  • Before break: A pattern for the text immediately before the segment break.
  • After break: A pattern for the text immediately after the segment break.

A segment break is created only if some text matches the before break pattern and is immediately followed by text that matches the after break pattern.
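
The two-part definition can be illustrated with a minimal Python sketch. The patterns and the function name split_segments are hypothetical, chosen only for illustration; they are not the product's default rules, and the product's own regular expression syntax may differ from Python's.

```python
import re

# Hypothetical patterns for illustration only.
BEFORE_BREAK = r"[.!?]"     # text immediately before the break
AFTER_BREAK = r"\s+[A-Z]"   # text immediately after the break


def split_segments(text, before=BEFORE_BREAK, after=AFTER_BREAK):
    """Break wherever `before` is immediately followed by `after`."""
    # The lookahead checks the after break pattern without consuming it,
    # so the break position is the end of the before break match.
    rule = re.compile(f"(?:{before})(?={after})")
    segments = []
    start = 0
    for match in rule.finditer(text):
        segments.append(text[start:match.end()].strip())
        start = match.end()
    segments.append(text[start:].strip())
    return [s for s in segments if s]


print(split_segments("Version 2.5 is out. Install it now! Then restart."))
```

In this sketch the period in 2.5 does not create a break, because the text after it does not match the after break pattern.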

Multiple rules

You might need several segmentation rules: for example, one rule that defines segmentation at a colon, and another rule that covers segmentation at a period.
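
As a rough sketch of how several rules could work together, the Python example below keeps a list of (before break, after break) pairs and creates a break wherever any one of them matches. The rule list and the function names are assumptions made for this illustration, not the product's actual rule format.

```python
import re

# Hypothetical rule set: each rule is a (before break, after break) pair.
RULES = [
    (r"\.", r"\s+[A-Z]"),  # period, then whitespace and a capital letter
    (r":", r"\s+"),        # colon, then whitespace
]


def break_positions(text, rules=RULES):
    """Collect every position where at least one rule creates a break."""
    positions = set()
    for before, after in rules:
        pattern = re.compile(f"(?:{before})(?={after})")
        positions.update(m.end() for m in pattern.finditer(text))
    return sorted(positions)


def segment(text, rules=RULES):
    """Cut the text at the collected break positions."""
    bounds = [0, *break_positions(text, rules), len(text)]
    pieces = (text[a:b].strip() for a, b in zip(bounds, bounds[1:]))
    return [p for p in pieces if p]


print(segment("Note: save your work first. Then close the editor."))
```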

In any one project, for the same language pair, you can use multiple (main) TMs with different segmentation rules.

Other language resources that affect segmentation

  • List of Abbreviations - Contains abbreviations that end with a period (.), for example, etc. The period at the end of etc. does not necessarily mark the end of a sentence, although it can when the abbreviation happens to be the last word of the sentence.
  • List of Ordinal followers - Like abbreviations, ordinal followers cover cases where a period does not necessarily mark the end of a segment: when followed by certain nouns, a set of digits followed by a period (for example, 23.) signifies an ordinal (23rd), not the end of a sentence. For example, 23. April can mean 23rd April.
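
The effect of these two lists can be sketched in Python as follows. The sample lists and the function name segment are hypothetical, and the sketch simplifies in one respect: it never breaks after a listed abbreviation, whereas in practice such a period can still end a sentence.

```python
import re

# Hypothetical sample lists; real lists are language-specific resources.
ABBREVIATIONS = {"etc.", "e.g.", "Dr."}
ORDINAL_FOLLOWERS = {"April", "May", "Street"}

# Candidate break: a period followed by whitespace and more text.
CANDIDATE = re.compile(r"\.(?=\s+\S)")


def segment(text):
    segments, start = [], 0
    for match in CANDIDATE.finditer(text):
        end = match.end()
        last_word = text[:end].split()[-1]               # includes the period
        next_word = text[end:].split()[0].strip(",.;:")
        if last_word in ABBREVIATIONS:
            continue  # simplification: never break after an abbreviation
        if re.fullmatch(r"\d+\.", last_word) and next_word in ORDINAL_FOLLOWERS:
            continue  # digits + period + follower: treat as an ordinal
        segments.append(text[start:end].strip())
        start = end
    segments.append(text[start:].strip())
    return [s for s in segments if s]


print(segment("We meet on 23. April in Berlin. Bring pens, paper etc. for notes."))
```

A number followed by a period still ends a segment when the next word is not in the list of ordinal followers.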

Rules specifying exceptions

Even if the text satisfies the segmentation rules, a segment break is not created if the text also matches an exception rule. You might want to define an exception to allow for a period used inside a sentence, for example:

You should not use periods (.) in file names.
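
A minimal Python sketch of an exception for this sentence is shown below. Both rules are hypothetical: the break rule is deliberately broad (it allows a closing bracket or quotation mark after the period), and the exception cancels the break when the period stands alone inside parentheses. The function names are assumptions made for this illustration.

```python
import re


def positions(text, before, after):
    """Return the positions where `before` is immediately followed by `after`."""
    pattern = re.compile(f"(?:{before})(?={after})")
    return {m.end() for m in pattern.finditer(text)}


# Hypothetical break rule: a period, optionally followed by a closing
# bracket or quotation mark, then whitespace and more text.
BREAK_RULE = (r"\.", r"""[)"']?\s+\S""")
# Hypothetical exception rule: a period enclosed directly in parentheses.
EXCEPTION_RULE = (r"\(\.", r"\)")


def segment(text):
    # A break survives only if no exception matches at the same position.
    breaks = positions(text, *BREAK_RULE) - positions(text, *EXCEPTION_RULE)
    bounds = [0, *sorted(breaks), len(text)]
    pieces = (text[a:b].strip() for a, b in zip(bounds, bounds[1:]))
    return [p for p in pieces if p]


print(segment("You should not use periods (.) in file names. Use hyphens instead."))
```

Without the exception, the break rule in this sketch would also split the sentence right after the period in (.).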