How source files get segmented and tokenized
Understanding how SDL Trados Studio segments and then tokenizes a source file is the key to using TMs effectively. When you open a document, the software uses the file type settings to do the initial segmentation into paragraphs. It then uses the segmentation rules from the TM (or, if you use a TM sequence, from the first TM in the sequence), taking some file type settings into account, to segment the paragraphs further. After segmentation, the software uses the TM settings to replace text with tokens where applicable ('tokenization').
Paragraph segmentation
When you open a file in Studio, the software breaks the file into paragraphs using rules given in the file type settings. Each file type has different rules for delimiting paragraphs. For example, in simple delimited text files, the newline character introduces a paragraph break, whereas in HTML files, tags such as <p> introduce paragraphs and a newline character is whitespace, equivalent to a space. For file types that use tags (for example, HTML and XML), you can specify that a tag is a 'structure' tag to indicate that it delimits paragraphs.
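The difference between the two file types can be sketched in a few lines of Python. This is only an illustration of the idea, not how Studio implements its file type filters: the function names and the STRUCTURE_TAGS list are hypothetical, and real HTML would need a proper parser rather than a regular expression.

```python
import re

# Hypothetical list of tags marked as 'structure' tags in the file type settings.
STRUCTURE_TAGS = ("p", "h1", "li")


def split_plain_text(text):
    """Plain text: every newline character introduces a paragraph break."""
    return [line.strip() for line in text.split("\n") if line.strip()]


def split_html(html):
    """HTML: structure tags delimit paragraphs; newlines are ordinary whitespace."""
    tag_names = "|".join(re.escape(tag) for tag in STRUCTURE_TAGS)
    # Split on opening or closing structure tags, with or without attributes.
    parts = re.split(r"</?(?:" + tag_names + r")\b[^>]*>", html, flags=re.IGNORECASE)
    paragraphs = []
    for part in parts:
        # Collapse newlines and repeated spaces into single spaces.
        normalized = " ".join(part.split())
        if normalized:
            paragraphs.append(normalized)
    return paragraphs
```

With these definitions, the plain text "First line\nSecond line" yields two paragraphs, while the HTML "<p>First\nline</p><p>Second</p>" yields "First line" and "Second": the newline inside the first <p> element is treated as a space.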
Sentence segmentation
Segmentation rules are properties of the TM (under language resources). You can specify paragraph-based segmentation, in which case Studio does no further segmentation. However, usually you specify sentence-based segmentation, in which case the software uses the segmentation rules to divide the paragraphs into segments.
The segmentation rules are regular expressions that recognize patterns of characters that mark the end of sentences. Example: a period followed by whitespace usually marks the end of a sentence.
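As a rough illustration of such a rule, the Python sketch below splits a paragraph wherever sentence-ending punctuation is followed by whitespace. The pattern and the function name are assumptions made for this example (including the choice to treat '?' and '!' like a period); the rules that ship with a TM are more elaborate.

```python
import re

# Assumed rule: a period, question mark or exclamation mark followed by
# whitespace marks the end of a sentence.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(paragraph):
    """Divide a paragraph into segments at each match of the rule."""
    return [segment for segment in SENTENCE_END.split(paragraph.strip()) if segment]
```

For example, split_sentences("Open the file. Save it. Done.") returns three segments. The rule is deliberately naive: "See Dr. Smith today." is split after "Dr.", which is why a period followed by whitespace only usually marks the end of a sentence.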
When the software is segmenting a paragraph, it also takes into account the following settings:
- Whether abbreviations and ordinal followers are recognized. If they are recognized, a period that belongs to an abbreviation or an ordinal number does not end the segment, so some segments can contain periods. (This recognition is a property of the TM language resources.)
- Segmentation hints. A segmentation hint indicates whether the software should treat the tag as an indication of a segment break ('exclude') or include the tag in a segment ('include'). Segmentation hints are a file type setting for a tag.
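The effect of the first setting can be sketched as follows. This is a simplified, word-by-word illustration under stated assumptions: the ABBREVIATIONS set and the function name are hypothetical, ordinal followers and segmentation hints are not modelled, and Studio's own recognition logic is not shown here.

```python
# Hypothetical abbreviation list; in Studio the list is part of the
# TM language resources.
ABBREVIATIONS = {"dr.", "e.g.", "etc."}


def split_sentences(paragraph, recognize_abbreviations=True):
    """Break after a word ending in a period, unless it is a known abbreviation."""
    segments = []
    current = []
    for word in paragraph.split():
        current.append(word)
        is_abbreviation = recognize_abbreviations and word.lower() in ABBREVIATIONS
        if word.endswith(".") and not is_abbreviation:
            segments.append(" ".join(current))
            current = []
    if current:
        # Keep trailing text that does not end in a period.
        segments.append(" ".join(current))
    return segments
```

With recognition switched on, "See Dr. Smith today. He is in." produces two segments, the first of which contains a period. With recognize_abbreviations=False, the same paragraph produces three segments because the period in "Dr." is treated as a sentence end.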
The segments produced by the sentence segmentation are the ones that the translator will see in the editor. The software compares each segment against the translation units in the TMs to see if there is a match.
Tokenization
Tokenization is a stage after segmentation. The software breaks a segment into tokens when it is about to search for a match for the segment. In the TM, text is stored as tokens.
The TM settings determine what constitutes a token. For example, if the TM treats dates as a recognized token, 12 January 1900 is treated as one token, but if dates are not recognized, it is treated as three tokens: 12, January and 1900.
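The date example can be reproduced with a small Python sketch. The patterns, the tokenize function and its recognize_dates flag are assumptions made for illustration; they cover only the "day month year" format used here and do not reflect Studio's actual date recognizer.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# Assumed date format: day, English month name, four-digit year.
DATE_PATTERN = r"\d{1,2} (?:" + MONTHS + r") \d{4}"
# Fallback: a run of word characters, or a single punctuation character.
WORD_PATTERN = r"\w+|[^\w\s]"


def tokenize(segment, recognize_dates=True):
    """Return the tokens of a segment, optionally treating a date as one token."""
    if recognize_dates:
        # The date alternative comes first so that it wins over single words.
        pattern = DATE_PATTERN + "|" + WORD_PATTERN
    else:
        pattern = WORD_PATTERN
    return re.findall(pattern, segment)
```

Calling tokenize("12 January 1900") returns a single token, whereas tokenize("12 January 1900", recognize_dates=False) returns the three tokens 12, January and 1900.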
Where a segment contains recognized tokens, the TM can find a match based on the pattern rather than the exact wording. For example, if dates are recognized as tokens, the following segments will be treated as identical:
He arrived on 1 January 1900
He arrived on 2 February 2012
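One way to picture this kind of pattern matching is to replace each recognized date with a placeholder before comparing segments. The sketch below is an assumption-laden illustration, not Studio's matching algorithm: the "<DATE>" placeholder, the function names and the single supported date format are all hypothetical.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# Assumed date format: day, English month name, four-digit year.
DATE_PATTERN = re.compile(r"\d{1,2} (?:" + MONTHS + r") \d{4}")


def normalize(segment):
    """Replace every recognized date with a generic placeholder token."""
    return DATE_PATTERN.sub("<DATE>", segment)


def same_pattern(first, second):
    """Two segments match if they are equal once dates are replaced."""
    return normalize(first) == normalize(second)
```

Both example segments normalize to "He arrived on <DATE>", so same_pattern returns True for them. A segment with different wording, such as "She arrived on 1 January 1900", does not match, because only the date is treated as interchangeable.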
Although the TM settings specify which text patterns are tokenized, the way in which the translation editor handles tokens (for example, whether it automatically localizes dates) depends on the project settings, not the TM settings.