File type configuration and segmentation

You can configure the settings for WorldServer file formats in Management > Linguistic Tool Setup > File Types. You can have multiple file type configurations for each format.

Each file type is designed to recognize what is translatable text and what is formatting in an asset and to handle the formatting, subject to certain configuration options. The details in this topic can help you obtain the segmentation you need for translation and the proper format after saving.

File type configurations customize two of the main jobs performed by file types: segmenting and recomposing.

  • Segmenting (decomposing the asset into segments presented for translation) occurs when WorldServer users perform the following operations:
    • Apply the Segment Asset automatic action
    • Open an asset in the Browser Workbench
    • Export the asset to a translation kit
    • Scope the asset
  • Recomposing (reassembling the translated target asset into its original format) occurs when you save the translated asset back to the repository (for example, from Browser Workbench).

Decomposing the Asset: Segmentation

When an asset is segmented, the file type uses two basic criteria to decide when to create a new segment: markup and "delimiters." By default, Browser Workbench shows only the source segments that are marked for translation.

Figure 1. Segments for Translation

You can enable "Show Markup" in the workbench to display the markup segments, which are hidden by default. "Show Markup" is not available for all file types. Where it is unavailable, the markup is too verbose to provide useful information.

Figure 2. Segments with Show Markup Enabled

WorldServer processes markup first, segmenting the asset accordingly, and only then processes the delimiters.

  • Handling Markup

    To determine what to present in translation segments, file types take into account things such as markup, formatting attributes, and metadata.

    In some cases, as for HTML, you can add user-defined elements and attributes, and specify the conditions in which they can be translated.

  • Other Options for Controlling Segmentation

    WorldServer application file types give you control over which segments are presented for translation in other formats, such as Microsoft Office, Adobe FrameMaker, and Adobe InDesign. For example:
    • For Microsoft Word and RTF file formats, you can specify text to exclude from translation based on the text style.
    • For Adobe FrameMaker, you can extract and translate content and text from formats such as autonumbers, text insets, and conditional text.
    • For Adobe InDesign, you can include the content of hidden layers for translation.
  • Workbench Presentation of Segments

    After an asset is segmented, if it is associated with a translation memory (TM), it is leveraged against that TM, and a match score is assigned to each segment. By default, all text segments are displayed when you open the asset in a workbench. However, you can narrow the displayed segments by choosing a view category such as "All except ICE and 100%", "All non-translated", "All with comments", or "All pending review". These "views" are sometimes referred to as filters because they filter out data. Do not confuse them with the file types that are applied during segmentation.

  • Sentence Breaking

    After the file type processes markup, it further segments the asset, looking for "sentence breaking" delimiters.

    In a text file, which has minimal markup, segments are essentially sentences, delimited by periods, question marks, or exclamation marks. When the file type reaches one of these delimiters, it ends the segment. The Text File Type also lets you specify structure and inline patterns to use.

    If the file contains markup in addition to text, the segmentation process first segments based on the markup, then it makes another pass based on the "sentence breaking" delimiters. For example, the HTML 4 File Type extracts everything in a paragraph (<p>) element first, then breaks up sentences in the paragraph if it contains more than one sentence.
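
    The two passes described above can be sketched in a few lines of Python. This is only an illustration of the idea, not WorldServer's implementation: the names `ParagraphExtractor`, `split_sentences`, and `segment_html` are hypothetical, and the sentence rule (break after ".", "?", or "!" followed by whitespace) is a deliberately simple assumption.

```python
import re
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """First pass: collect the text content of each <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while the parser is outside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.paragraphs.append("".join(self._buffer).strip())
            self._buffer = None


def split_sentences(text):
    """Second pass: break after '.', '?' or '!' when whitespace follows."""
    parts = re.split(r"(?<=[.?!])\s+", text)
    return [part for part in parts if part]


def segment_html(html):
    """Segment on markup first, then on sentence-breaking delimiters."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    segments = []
    for paragraph in parser.paragraphs:
        segments.extend(split_sentences(paragraph))
    return segments
```

    Calling `segment_html` on a paragraph that holds several sentences returns one segment per sentence, while text outside a `<p>` element is ignored in this sketch.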

Decomposing the Asset: Formatting Encodings

File types offer control over how formatting encodings such as entities are handled.

For entities, a configuration option (in the XML File Type family and the HTML 4 File Type) lets you "register" entities. A registered entity is always presented as a character (for example, "<"). To have an entity presented as an entity reference (for example, "&lt;"), do not register it. You can also control how these entities are handled when you save the asset. See the "XML Entity Conversion Settings" topic for more information about handling entities.
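
The effect of registering an entity can be illustrated with a short Python sketch. It is not the product's code: the `ENTITY_CHARS` table and the `present_segment` function are hypothetical names, and the set passed as `registered` stands in for whatever the file type configuration holds.

```python
import re

# Hypothetical entity table for illustration; a real configuration
# would define its own entities.
ENTITY_CHARS = {"lt": "<", "gt": ">", "amp": "&", "copy": "\u00a9"}


def present_segment(text, registered):
    """Show registered entities as characters; leave the others as entities."""

    def replace(match):
        name = match.group(1)
        if name in registered and name in ENTITY_CHARS:
            return ENTITY_CHARS[name]
        return match.group(0)  # unregistered: keep the entity reference

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, text)
```

With only `lt` registered, `present_segment("a &lt; b &amp; c", {"lt"})` turns `&lt;` into a character and leaves `&amp;` as an entity reference.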

Recomposing the Asset: Saved Targets

WorldServer also offers control over how the segments should be handled when it recomposes the target segments into a formatted asset after you save the asset in Browser Workbench. The following are just some of the options handled by file types:

  • For the XML and XHTML formats, you can use the Writer Settings page to have the file type insert language information via the XML LANG attribute when generating target documents. The value of the LANG attribute is determined by the target locale.
  • By default, the XML validation page validates the target asset against its schema when it verifies the asset. You can opt not to validate the XML. This is useful, for example, when the XML contains references to external entities: such references cause the check to fail, which stops the format processing.
  • The Adobe FrameMaker and Microsoft Office formats have a Font Mapping feature for specifying what fonts are used in the target asset. You enter the name of the font you wish to map, then enter the name of a corresponding font for every language in which you wish font translation to occur.
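
As a rough illustration of the language-information option, the following Python sketch sets a language attribute on the root element of a target document. It assumes that the attribute in question is the standard `xml:lang` attribute and that the target locale arrives in a form such as `fr_FR`; both are assumptions, and `add_language` is a hypothetical helper, not a WorldServer API.

```python
import xml.etree.ElementTree as ET

# The standard XML namespace, which ElementTree serializes with the "xml" prefix.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def add_language(xml_text, target_locale):
    """Return xml_text with xml:lang set on the root element."""
    root = ET.fromstring(xml_text)
    # Assumption: locales look like "fr_FR" and are written as "fr-FR".
    root.set(XML_LANG, target_locale.replace("_", "-"))
    return ET.tostring(root, encoding="unicode")
```

For a target locale of `fr_FR`, the root element of the returned document carries `xml:lang="fr-FR"`.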