About segmentation settings: How a TM segments text
Segmentation settings define how a TM or a project divides source text into segments.
Segmentation rules
Segmentation rules are defined by the regular expressions that specify a segment.
Often a segment is identical to a sentence, in which case the regular expression specifies the text patterns that constitute a sentence.
In any one project, for the same language pair, you can use multiple main TMs with different segmentation rules.
Rules specifying exceptions
List of abbreviations. This list contains abbreviations that end with a period (.), for example "etc.". The period at the end of "etc." does not necessarily mark the end of a sentence, though it may happen to do so.
List of ordinal followers. Like abbreviations, ordinal followers identify cases where a period does not necessarily mark the end of a segment. When followed by certain nouns, a set of digits followed by a period (for example, 23.) signifies an ordinal (23rd), not the end of a sentence. For example, "23. April" can mean 23rd April. The list of ordinal followers is the list of such nouns.
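The following is a minimal Python sketch of how the two exception lists could be applied. It is not the product's actual implementation. The names ABBREVIATIONS, ORDINAL_FOLLOWERS, and split_segments are hypothetical, and the list contents are made up for illustration. The sketch also simplifies by never breaking after a listed abbreviation, although such a period can occasionally end a sentence.

```python
import re

# Hypothetical exception lists; a real TM would read these from its settings.
ABBREVIATIONS = {"etc.", "e.g.", "Dr."}
ORDINAL_FOLLOWERS = {"April", "May", "June"}


def split_segments(text):
    """Split after a period followed by whitespace, unless an exception applies."""
    segments = []
    start = 0
    for match in re.finditer(r"\.\s+", text):
        # Last whitespace-delimited token before the break, including its period.
        tokens = text[start:match.start() + 1].split()
        last_token = tokens[-1] if tokens else ""
        # First word after the break, with trailing punctuation removed.
        rest = text[match.end():].split(maxsplit=1)
        next_word = rest[0].strip(".,;:") if rest else ""

        if last_token in ABBREVIATIONS:
            continue  # abbreviation: the period does not close the segment
        if re.fullmatch(r"\d+\.", last_token) and next_word in ORDINAL_FOLLOWERS:
            continue  # ordinal such as "23. April": not a segment end
        segments.append(text[start:match.end()].strip())
        start = match.end()

    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments
```

For the input "We meet on 23. April. Bring pens, paper etc. and a laptop. Done." this sketch yields three segments: the periods after "23" and "etc" are skipped, and the text is split after "April.", "laptop.", and at the end.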
Example: A simple segmentation rule
\.+[\p{Pe}\p{Pf}\p{Po}"]*
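This regular expression specifies the end of a segment in a rather simplistic manner. It matches one or more periods, followed by any closing punctuation marks or quotation marks; the segment consists of all characters up to and including this match.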
Close, final, and other punctuation are Unicode categories, specified by the following codes:
\p{Pe} specifies close punctuation.
\p{Pf} specifies final quote punctuation.
\p{Po} specifies other punctuation.
For more information, see, for example, the UnicodeCategory Enumeration.
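To try out the example rule outside the application, it can be approximated in Python. Python's built-in re module does not support \p{...} classes, so this sketch builds the character class from the unicodedata module instead. The names chars_in_categories, SEGMENT_END, and split_at_rule are hypothetical. The sketch assumes the rule marks the end of a segment, and it ignores the exception lists.

```python
import re
import sys
import unicodedata


def chars_in_categories(categories):
    """Return every character in the given Unicode categories, regex-escaped."""
    return "".join(
        re.escape(chr(cp))
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) in categories
    )


# Approximation of \.+[\p{Pe}\p{Pf}\p{Po}"]* using an explicit character class.
SEGMENT_END = re.compile(
    r"\.+[" + chars_in_categories({"Pe", "Pf", "Po"}) + r'"]*'
)


def split_at_rule(text, pattern=SEGMENT_END):
    """Cut the text after each match of the segment-end pattern."""
    segments = []
    start = 0
    for match in pattern.finditer(text):
        segments.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments
```

With this pattern, the closing parenthesis in "(stop.)" and the closing guillemet in "left.»" stay attached to the segment they close, because ")" belongs to category Pe and "»" to category Pf.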