Data preparation for language pair adaptation

Valid *.tmx files are required to adapt language pairs.

General checks for language pair adaptation

Although not enforced within the application, RWS recommends that you perform some generic checks on the translation units (TUs) used to adapt a language pair.

Check that you have:
  • A valid *.tmx file format
  • UTF-8 encoding
  • Correct *.tmx language tags
  • Clean data
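
The first three checks can be automated before upload. The following is a minimal sketch in Python, not an RWS tool: the function name check_tmx is hypothetical, and it assumes that every TU should carry exactly one source and one target tag matching the language pair, compared case-insensitively. The exact tags your language pair expects may differ, so adjust the comparison as needed.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def check_tmx(data: bytes, source_lang: str, target_lang: str) -> list:
    """Return a list of problems found in raw TMX bytes (empty if none)."""
    # Check 1: the file content must decode as UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return [f"not valid UTF-8: {exc}"]

    # Check 2: the file must be well-formed XML with a <tmx> root element.
    try:
        root = ET.fromstring(data)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]

    problems = []
    if root.tag != "tmx":
        problems.append(f"root element is <{root.tag}>, expected <tmx>")

    # Check 3 (assumption): each TU holds exactly the two expected languages.
    expected = {source_lang.lower(), target_lang.lower()}
    for index, tu in enumerate(root.iter("tu"), start=1):
        langs = {
            (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            for tuv in tu.findall("tuv")
        }
        if langs != expected:
            problems.append(
                f"TU {index}: language tags {sorted(langs)} "
                f"do not match {sorted(expected)}"
            )
    return problems
```

For a file on disk, read it in binary mode and pass the bytes, for example check_tmx(open("memory.tmx", "rb").read(), "en-US", "fr-FR"), where the file name and language codes are placeholders. This sketch only tests well-formedness; it does not validate the file against the TMX DTD.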

Data cleaning for language pair adaptation

Data cleaning refers to preparing the input material so that it is compatible with the Language Weaver Edge trainer. The cleaning process aims to optimize the quality of the resulting model; it does not improve the data from a linguistic point of view.

Good-quality translation memories (TMs) are the starting point for language pair adaptation. Cleaning will not, however, improve the consistency of your data if the TMs include, for example, different translations for the same source text or a mix of styles.

RWS recommends that you remove the following from your TMs during data cleaning:
  • TUs in languages different from the language pair that you'd like to adapt.
  • TUs containing only non-semantic text, like symbols or punctuation marks.
  • Misaligned TUs, i.e., translations that don't correspond to the source text.
  • Corrupted data, normally caused by an incorrect encoding.
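
The removal criteria above can be approximated with simple filters. The following Python sketch is illustrative only and is not the method used by the Language Weaver Edge trainer. The TU class, the function names, and the max_ratio threshold are all hypothetical. The language check compares tags only and does not detect the language of the text itself. The length-ratio test is a crude heuristic that flags only some misalignments and may not suit language pairs with very different text lengths.

```python
import unicodedata
from dataclasses import dataclass
from typing import Optional


@dataclass
class TU:
    """A translation unit reduced to its language tags and segment texts."""
    source_lang: str
    source: str
    target_lang: str
    target: str


def has_semantic_text(text: str) -> bool:
    # Assumption: a segment is "semantic" if it has any letter or digit.
    return any(ch.isalnum() for ch in text)


def looks_corrupted(text: str) -> bool:
    # U+FFFD is the character decoders emit for undecodable bytes;
    # stray control characters are another common sign of bad encoding.
    return any(
        ch == "\ufffd"
        or (unicodedata.category(ch) == "Cc" and ch not in "\t\n\r")
        for ch in text
    )


def rejection_reason(
    tu: TU, source_lang: str, target_lang: str, max_ratio: float = 3.0
) -> Optional[str]:
    """Return why a TU should be removed, or None if it can be kept."""
    tags = (tu.source_lang.lower(), tu.target_lang.lower())
    if tags != (source_lang.lower(), target_lang.lower()):
        return "wrong language"
    if not (has_semantic_text(tu.source) and has_semantic_text(tu.target)):
        return "non-semantic"
    if looks_corrupted(tu.source) or looks_corrupted(tu.target):
        return "corrupted"
    # Hypothetical heuristic: very unequal lengths suggest misalignment.
    lengths = (len(tu.source.strip()), len(tu.target.strip()))
    if max(lengths) > max_ratio * min(lengths):
        return "possibly misaligned"
    return None
```

A cleaned list can then be built with [tu for tu in tus if rejection_reason(tu, "en", "fr") is None], where the language codes are placeholders. TUs flagged as possibly misaligned are best reviewed by a linguist before deletion, because the heuristic cannot judge whether a translation actually corresponds to its source.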