About segmentation rules
Segmentation rules define how SDL Trados Studio divides paragraphs of source text into segments. Often a segment is identical to a sentence, in which case the rules specify the text patterns that constitute a sentence.
In Studio, the segmentation rules are specified in terms of regular expressions. These regular expressions define patterns of characters that mark the end of sentences. See below for a description of the regular expressions used for the default segmentation rules. Segmentation rules are a language resource, so you add, edit and delete them under the appropriate language resource. A segment break is defined in two parts:
- Before break. A pattern for the text immediately before the segment break.
- After break. A pattern for the text immediately after the segment break.
A segment break is created only if some text matches the Before break pattern, is immediately followed by text that matches the After break pattern, and does not match any of the exception rules.
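The two-part definition above can be sketched in Python. This is only an illustration, not Studio's implementation: the names BEFORE_BREAK, AFTER_BREAK and split_segments are hypothetical, the two patterns are simplified assumptions rather than the default rules, and exception rules are left out.

```python
import re

# Hypothetical patterns, not the Studio defaults.
BEFORE_BREAK = re.compile(r"[.?!]+")  # text immediately before the break
AFTER_BREAK = re.compile(r"\s")       # text immediately after the break


def split_segments(paragraph):
    """Split a paragraph wherever BEFORE_BREAK is directly followed by AFTER_BREAK."""
    segments = []
    start = 0
    for match in BEFORE_BREAK.finditer(paragraph):
        break_pos = match.end()
        # The break is created only if the After break pattern matches
        # at the exact position where the Before break match ends.
        if AFTER_BREAK.match(paragraph, break_pos):
            # Surrounding whitespace is stripped here only for readability.
            segments.append(paragraph[start:break_pos].strip())
            start = break_pos
    tail = paragraph[start:].strip()
    if tail:
        segments.append(tail)
    return segments
```

With these assumed patterns, the period in "1.5" does not create a break because it is followed by a digit, not by whitespace.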
Exception rules
An exception rule has the same form as a segmentation rule. If the text matches an exception rule, a segment break is not created. A common exception rule is the Lower-case letter exception rule for the full stop segmentation rule. It states that no segment break is created if the next letter after a period is a lower-case letter.
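As a rough sketch of how an exception rule can veto a break, the Python below models both the rule and its exception as a pair of before and after patterns. All names and patterns are assumptions made for illustration; in particular, the lower-case check is limited to ASCII letters, which is narrower than a real lower-case test.

```python
import re

# Each rule is a (before, after) pair of patterns. Hypothetical values.
FULL_STOP_RULE = (re.compile(r"\."), re.compile(r"\s"))
# Exception: a period followed by whitespace and an ASCII lower-case letter.
LOWER_CASE_EXCEPTION = (re.compile(r"\."), re.compile(r"\s+[a-z]"))


def matches_at(rule, text, pos):
    """True if the before pattern ends at pos and the after pattern starts at pos."""
    before, after = rule
    ends_here = any(m.end() == pos for m in before.finditer(text, 0, pos))
    return ends_here and after.match(text, pos) is not None


def break_positions(text, rule, exceptions):
    """Return the positions where the rule matches and no exception matches."""
    before, _ = rule
    positions = []
    for m in before.finditer(text):
        pos = m.end()
        if not matches_at(rule, text, pos):
            continue
        if any(matches_at(exc, text, pos) for exc in exceptions):
            continue  # an exception rule suppresses this break
        positions.append(pos)
    return positions
```

For the text "The file is approx. ten pages. It is short.", the period after "approx" is followed by a lower-case "t", so only the break after "pages." survives when the exception is supplied.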
Multiple segmentation rules
You can have more than one segmentation rule. For example, you can have one rule that defines segmentation at a colon and another rule that covers segmentation at a period.
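A minimal sketch of several rules working together, assuming that a break is created wherever any one rule matches. The RULES dictionary, the segment function and both patterns are hypothetical and are not taken from Studio.

```python
import re

# Hypothetical rule set: name -> (before pattern, after pattern).
RULES = {
    "full stop": (re.compile(r"\."), re.compile(r"\s")),
    "colon": (re.compile(r":"), re.compile(r"\s")),
}


def segment(text, rules):
    """Split text at every position where at least one rule matches."""
    positions = set()
    for before, after in rules.values():
        for m in before.finditer(text):
            if after.match(text, m.end()):
                positions.add(m.end())
    segments, start = [], 0
    for pos in sorted(positions):
        segments.append(text[start:pos].strip())
        start = pos
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments
```

Passing only the "full stop" entry keeps "Note: save first." as one segment, while passing both entries also splits it after "Note:".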
In any one project, for the same language pair, you can use multiple main TMs with different segmentation rules.
Other settings that affect sentence level segmentation
- List of Abbreviations. This contains a list of abbreviations that end with a period (.), for example, etc. The period at the end of etc. does not necessarily mark the end of a sentence, although the abbreviation may happen to be the last word of one.
- List of Ordinal followers. Like abbreviations, ordinal followers provide cases where a period does not necessarily mark the end of a segment. For example, if April is an ordinal follower, the phrase 23. April means 23rd April, not 23 followed by a sentence that starts April....
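The effect of these two lists can be sketched as follows. The list contents, the names and the logic are assumptions for illustration only. In particular, this sketch never breaks after a listed abbreviation, which is a simplification: as noted above, an abbreviation can also end a sentence.

```python
import re

# Hypothetical list contents, not the lists shipped with Studio.
ABBREVIATIONS = {"etc.", "e.g.", "approx."}
ORDINAL_FOLLOWERS = {"April", "Mai"}

PERIOD_THEN_SPACE = re.compile(r"\.\s+")


def break_positions(text):
    """Return break positions after periods, skipping abbreviations and ordinals."""
    positions = []
    for m in PERIOD_THEN_SPACE.finditer(text):
        period_end = m.start() + 1
        last_word = text[:period_end].split()[-1]           # e.g. "etc." or "23."
        next_word = re.match(r"\w*", text[m.end():]).group()
        if last_word in ABBREVIATIONS:
            continue  # simplification: never break after a listed abbreviation
        if last_word[:-1].isdigit() and next_word in ORDINAL_FOLLOWERS:
            continue  # "23. April" is read as an ordinal, not a sentence end
        positions.append(period_end)
    return positions
```

In "It is 23. Everyone left." the word after the number is not an ordinal follower, so the sketch still breaks after "23.".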
About the regular expressions used in the default segmentation rules
The regular expressions used in the default segmentation rules make extensive use of Unicode categories. The ones that are used are as follows:
- \p{Ll}: Lowercase letters
- \p{Pe}: Any type of closing bracket
- \p{Pf}: Any closing quotation mark
- \p{Po}: Any punctuation character except dash, bracket, quotation mark or connector (underscore)
- \uFFFF: A Unicode character identified by its four-digit hexadecimal code, where FFFF stands for the code. For example, \u002C is a comma and \u003A is a colon.
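To check which category a particular character belongs to, a short Python sketch can help. Python's standard re module does not accept the \p{...} syntax, so this example looks up categories with the standard unicodedata module instead of evaluating the patterns themselves. The helper names and the sample characters are chosen purely for illustration.

```python
import unicodedata

# Short descriptions of the categories listed above.
CATEGORIES = {
    "Ll": "lowercase letter",
    "Pe": "closing bracket",
    "Pf": "closing quotation mark",
    "Po": "other punctuation",
}


def in_category(ch, code):
    """True if the character's Unicode general category equals code."""
    return unicodedata.category(ch) == code


def describe(ch):
    """Return the code point, category code and a short description."""
    code = unicodedata.category(ch)
    label = CATEGORIES.get(code, "other category")
    return f"U+{ord(ch):04X} {code} ({label})"


if __name__ == "__main__":
    # "\u002C" is a comma and "\u003A" is a colon, as in the list above.
    for ch in ["a", ")", "\u00BB", ".", "\u002C", "\u003A", "_", "-"]:
        print(describe(ch))
```

The underscore and the hyphen fall outside \p{Po} (they are connector and dash punctuation), which matches the exclusions named in the list.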