File type configuration and segmentation
You can configure the settings for WorldServer file formats in . You can have multiple file type configurations for each format.
File type configurations customize two of the main jobs performed by file types: segmenting and recomposing.
- Segmenting (decomposing the asset into segments presented for translation) occurs when WorldServer users perform the following operations:
- Apply the Segment Asset automatic action
- Open an asset in the Browser Workbench
- Export the asset to a translation kit
- Scope the asset
- Recomposing (reassembling the translated target asset into its original format) occurs when you save the translated asset back to the repository (for example, using Browser Workbench).
Decomposing the Asset: Segmentation
When an asset is segmented, the file type has two basic considerations for determining when to create a new segment: markup and "delimiters." In Browser Workbench, you only see the source segments that are marked for translation by default.
Figure 1. Segments for Translation
Figure 2. Segments with Show Markup Enabled
WorldServer processes markup first and segments the asset accordingly; only then does it process the delimiters.
- Handling Markup
To determine what to present in translation segments, file types take into account things such as markup, formatting attributes, and metadata.
In some cases, as for HTML, you can add user-defined elements and attributes, and specify the conditions in which they can be translated.
- Other Options for Controlling Segmentation
WorldServer application file types give you control over what segments are presented for translation in other formats, such as Microsoft Office, Adobe FrameMaker, and Adobe InDesign. For example:
- For Microsoft Word and RTF file formats, you can specify text to exclude from translation based on the text style.
- For Adobe FrameMaker, you can extract and translate content and text from formats such as autonumbers, text insets, and conditional text.
- For Adobe InDesign, you can include the content of hidden layers for translation.
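The style-based exclusion described for Microsoft Word and RTF can be pictured as a simple filter. The sketch below is illustrative only: the `Paragraph` type, the style names, and `EXCLUDED_STYLES` are hypothetical and do not reflect how WorldServer stores or applies this setting.

```python
from dataclasses import dataclass

# Hypothetical set of style names configured as excluded from translation.
EXCLUDED_STYLES = {"Code", "DoNotTranslate"}


@dataclass
class Paragraph:
    style: str  # name of the text style applied to the paragraph
    text: str


def translatable(paragraphs, excluded_styles=EXCLUDED_STYLES):
    """Return only the paragraphs whose style is not excluded."""
    return [p for p in paragraphs if p.style not in excluded_styles]


doc = [
    Paragraph("Normal", "Click Save to store your changes."),
    Paragraph("Code", "save_changes(force=True)"),
    Paragraph("Heading 1", "Saving your work"),
]
kept = translatable(doc)
```

Here `kept` holds the "Normal" and "Heading 1" paragraphs, while the "Code" paragraph is withheld from translation.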
- Workbench Presentation of Segments
After an asset is segmented, if it is associated with a translation memory (TM), it is leveraged against that TM, and a match score is assigned to each segment. By default, all text segments are displayed when you open the asset in a workbench. However, you can narrow which segments are presented by selecting a view category such as "All except ICE and 100%", "All non-translated", "All with comments", or "All pending review". These "views" are sometimes referred to as filters because they filter out data. Do not confuse them with the file types that are applied during segmentation.
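One way to think about these views is as predicates applied to already-segmented, already-scored content. The following sketch assumes a hypothetical `Segment` record with a 0 to 100 match score and a few boolean flags; the field names and the predicate logic are assumptions made for illustration, not WorldServer's data model.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    match_score: int          # TM match percentage, assumed to range 0-100
    is_ice: bool = False      # assumed flag for an ICE match
    translated: bool = False
    has_comment: bool = False


# Each view is a predicate that decides whether a segment is shown.
VIEWS = {
    "All except ICE and 100%": lambda s: not s.is_ice and s.match_score < 100,
    "All non-translated": lambda s: not s.translated,
    "All with comments": lambda s: s.has_comment,
}


def apply_view(segments, view_name):
    """Return the segments that the named view would display."""
    keep = VIEWS[view_name]
    return [s for s in segments if keep(s)]


segments = [
    Segment("Welcome.", 100, is_ice=True, translated=True),
    Segment("Click Next.", 100, translated=True),
    Segment("Enter your name.", 85, has_comment=True),
    Segment("A brand-new sentence.", 0),
]
visible = apply_view(segments, "All except ICE and 100%")
```

The view only hides segments from display. The underlying segmentation, which the file type produced earlier, is unchanged.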
- Sentence Breaking
After the file type processes markup, it further segments the asset, looking for "sentence breaking" delimiters.
When you open a text file, which has minimal markup, segments are essentially sentences, delimited by periods, question marks, or exclamation marks. When the file type reaches one of these delimiters, it ends the segment. The Text File Type also lets you specify structure and inline patterns to use.
If the file contains markup in addition to text, the segmentation process first segments based on the markup, then it makes another pass based on the "sentence breaking" delimiters. For example, the HTML 4 File Type extracts everything in a paragraph (<p>) element first, then breaks up sentences in the paragraph if it contains more than one sentence.
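The two-pass order described above (markup first, then sentence-breaking delimiters) can be sketched with the Python standard library. This is a simplified illustration, not the HTML 4 File Type's actual algorithm: `ParagraphCollector`, `SENTENCE_BREAK`, and `segment` are hypothetical names, and the naive delimiter rule would mis-split abbreviations such as "e.g." that a real segmenter has to handle.

```python
import re
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Pass 1: collect the text content of each <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None means we are outside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.paragraphs.append("".join(self._buffer).strip())
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


# Pass 2: break after ".", "?", or "!" when followed by whitespace.
SENTENCE_BREAK = re.compile(r"(?<=[.?!])\s+")


def segment(html):
    """Segment by <p> elements first, then by sentence delimiters."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    segments = []
    for paragraph in collector.paragraphs:
        segments.extend(s for s in SENTENCE_BREAK.split(paragraph) if s)
    return segments


sample = "<p>Open the file. Is it valid? Save it!</p><p>Done.</p>"
result = segment(sample)
```

For the sample input, the first paragraph yields three segments and the second yields one, so a sentence never spans two paragraphs.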
Decomposing the Asset: Formatting Encodings
File types offer control over how formatting encodings such as entities are handled.
In the entities example, a configuration option (in the XML File Type family and HTML 4 File Type) lets you "register" entities. If you register an entity, it is always presented as a character (for example, "<"). To have it presented as an entity (for example, "&lt;"), do not register it. You can also control how these entities are handled when you save the asset. See the "XML Entity Conversion Settings" topic for more information about handling entities.
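The effect of registering an entity can be illustrated with a small sketch. It assumes that "registered" means "decode to the character for presentation" and uses Python's `html.entities.name2codepoint` table as a stand-in for the entity definitions; the `present` function and its `registered` parameter are hypothetical and are not part of any WorldServer interface.

```python
import re
from html.entities import name2codepoint

# Matches named entity references such as &lt; or &copy;
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def present(text, registered):
    """Show registered entities as characters; leave the rest as entities."""

    def replace(match):
        name = match.group(1)
        if name in registered and name in name2codepoint:
            return chr(name2codepoint[name])
        return match.group(0)  # unregistered: keep the entity as written

    return ENTITY.sub(replace, text)


source = "a &lt; b &amp;&amp; c &copy; 2024"
shown = present(source, registered={"lt", "copy"})
```

With `lt` and `copy` registered, those two entities appear as characters in `shown`, while the unregistered `&amp;` references are left as entities.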
Recomposing the Asset: Saved Targets
WorldServer also offers control over how the segments should be handled when it recomposes the target segments into a formatted asset after you save the asset in Browser Workbench. The following are just some of the options handled by file types:
- For the XML and XHTML formats, you can use the Writer Settings page to have the file type insert language information via the XML LANG attribute when generating target documents. The value of the LANG attribute is determined by the target locale.
- By default, the XML validation page checks the schema validation when it verifies the target asset. You can opt not to validate the XML, for example when the XML contains references to external entities that cause this check to fail and format processing to stop.
- The Adobe FrameMaker and Microsoft Office formats have a Font Mapping feature for specifying what fonts are used in the target asset. You enter the name of the font you wish to map, then enter the name of a corresponding font for every language in which you wish font translation to occur.
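Conceptually, the Font Mapping feature in the last item is a lookup from a source font to one replacement font per target language. The sketch below shows that shape only: the font names, the language codes, and the fallback to the source font when no mapping exists are all assumptions for illustration, not documented WorldServer behavior.

```python
# Hypothetical mapping: source font -> {target language: replacement font}.
FONT_MAP = {
    "Arial": {"ja": "MS Gothic", "zh-CN": "SimSun"},
    "Times New Roman": {"ja": "MS Mincho"},
}


def target_font(source_font, target_language, font_map=FONT_MAP):
    """Return the mapped font, or the source font when no mapping exists."""
    return font_map.get(source_font, {}).get(target_language, source_font)


japanese_font = target_font("Arial", "ja")
german_font = target_font("Arial", "de")  # no entry, so the source font is kept
```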