About segmentation rules
Segmentation rules define how a TM or a project divides source text into segments.
They are written as regular expressions that specify what constitutes a segment.
Segmentation rules are a language resource, so you add, edit and delete them under the appropriate language resource.
Often a segment is identical to a sentence, in which case the regular expression specifies the text patterns that constitute a sentence.
A segment break is defined in two parts:
- Before break: A pattern for the text immediately before the segment break.
- After break: Another pattern of text that defines the text immediately following the break.
A segment break is created only if some text matches the before break pattern and is immediately followed by text that matches the after break pattern.
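The two-part definition can be illustrated with a minimal Python sketch. This is not the product's own rule syntax or engine: the function name split_segments and the two sample patterns are assumptions chosen for illustration. The sketch places a break only where the before break pattern matches and the after break pattern matches immediately afterwards.

```python
import re

def split_segments(text, before_break, after_break):
    """Split text wherever before_break is immediately followed by after_break."""
    # Match the "before" pattern, then require the "after" pattern with a
    # lookahead so that the following text stays in the next segment.
    rule = re.compile(f"(?:{before_break})(?={after_break})")
    segments = []
    start = 0
    for match in rule.finditer(text):
        segments.append(text[start:match.end()])
        start = match.end()
    if start < len(text):
        segments.append(text[start:])
    return segments

# Hypothetical rule: a period plus whitespace before the break,
# an uppercase letter after the break.
BEFORE = r"\.\s+"
AFTER = r"[A-Z]"

segments = split_segments("It works. Try it now. see the log.", BEFORE, AFTER)
# The period before "see" matches the before pattern, but the lowercase "s"
# does not match the after pattern, so no break is created there.
```

With these sample patterns the text is split into two segments, not three, because both halves of the rule must match at the same position.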
Multiple rules
You might want several segmentation rules: for example, one rule that breaks segments at a colon, and another that breaks them at a period.
In any one project, for the same language pair, you can use multiple main TMs with different segmentation rules.
Other language resources that affect segmentation
- List of Abbreviations. This contains abbreviations that end with a period (.), such as "etc.". The period at the end of "etc." does not necessarily mark the end of a sentence, although it can happen to do so.
- List of Ordinal followers. Like abbreviations, ordinal followers cover cases where a period does not necessarily mark the end of a segment: when followed by certain nouns, a set of digits followed by a period (for example, "23.") signifies an ordinal (23rd), not the end of a sentence. For example, "23. April" can mean 23rd April. The list of ordinal followers is the list of such nouns.
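The effect of these two lists can be sketched as follows. This is a simplified illustration, not the product's implementation: the function is_break, the sample list contents, and the decision to treat a listed abbreviation as never ending a segment are all assumptions.

```python
import re

# Hypothetical sample lists; real lists are maintained as language resources.
ABBREVIATIONS = {"etc.", "e.g.", "Dr."}
ORDINAL_FOLLOWERS = {"April", "Mai", "Juni"}

def is_break(text, period_index):
    """Decide whether the period at period_index should end a segment."""
    before = text[:period_index + 1]
    after = text[period_index + 1:].lstrip()
    last_token = before.split()[-1]
    next_word = after.split()[0] if after else ""
    # Simplification: a listed abbreviation never ends a segment here.
    if last_token in ABBREVIATIONS:
        return False
    # Digits plus a period, followed by a listed noun, are read as an ordinal.
    if re.fullmatch(r"\d+\.", last_token) and next_word.strip(".,;:") in ORDINAL_FOLLOWERS:
        return False
    return True

date_text = "The meeting is on 23. April in Berlin."
list_text = "Bring pens, paper etc. to the room. Start early."
```

For example, is_break(date_text, date_text.index("23.") + 2) is False because "April" is in the sample list of ordinal followers, while the period after "room" in list_text is still treated as a break.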
Rules specifying exceptions
Even if text satisfies a segmentation rule, no segment break is created if the text also matches an exception rule. You might want to define an exception to cater for the use of a period within a sentence, for example:
You should not use periods (.) in file names.
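A minimal sketch of how an exception rule could suppress a break in this sentence is shown below. The patterns, the names BREAK_RULE and EXCEPTION_RULES, and the function segment are hypothetical and only illustrate the idea; they are not the product's rule syntax.

```python
import re

# Hypothetical break rule: a period, optional closing brackets and whitespace
# before the break; any non-space character after the break.
BREAK_RULE = (r"\.[)\]]*\s+", r"\S")
# Hypothetical exception rule: a period alone in parentheses before the
# position, a lowercase letter after it.
EXCEPTION_RULES = [(r"\(\.\)\s+", r"[a-z]")]

def segment(text):
    """Split text at break-rule matches unless an exception rule also matches."""
    before, after = BREAK_RULE
    rule = re.compile(f"(?:{before})(?={after})")
    exceptions = [re.compile(f"(?:{b})(?={a})") for b, a in EXCEPTION_RULES]
    segments, start = [], 0
    for match in rule.finditer(text):
        # Skip the candidate break if an exception ends at the same position.
        if any(e.end() == match.end() for exc in exceptions for e in exc.finditer(text)):
            continue
        segments.append(text[start:match.end()])
        start = match.end()
    if start < len(text):
        segments.append(text[start:])
    return segments

result = segment("You should not use periods (.) in file names. Use hyphens instead.")
```

In this sketch the period inside "(.)" satisfies the break rule but also matches the exception, so the first sentence stays in one segment and the only break falls before "Use".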