Training data for language pair adaptation

The training data is the content used to adapt a language pair. Language Weaver Edge uses this data to create a new adapted language pair that offers customized translations for a specific domain.

Consider the following points when preparing the training data for the language pair adaptation:
  • The data is stored in a translation memory (TM) in *.tmx format.
  • You can upload a single *.tmx file or a *.zip file containing multiple TMs. The maximum size of either file is 3 GB.
  • Each translation unit (TU) contains a single sentence or other meaningful unit, not an incomplete phrase, multiple sentences, or a paragraph.
  • There must be a minimum of 1,000 TUs, but RWS strongly recommends a minimum of 30,000 TUs.
  • The maximum number of TUs that can be used for training is 30 million.
  • Terminology is consistent across the TUs in the TMs.
  • Segments are correctly aligned: the target text is a proper translation of the source text.
  • Segments are representative of the content you will process with machine translation: similar terminology, similar style, similar domain.
  • Factual-style content is more likely to be translated well, while flowery expressions or idioms may not translate as well.
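
Some of the points above (file size, number of TUs, and whether each TU has both a source and a target segment) can be checked before uploading. The following Python sketch is one possible way to do that with the standard library only. It is not part of Language Weaver Edge: the function names are hypothetical, it assumes a standard TMX layout (`tu` elements containing `tuv`/`seg` elements, with no XML namespace), and it assumes that 3 GB means 3 × 1024³ bytes. It does not check alignment quality, terminology, or style.

```python
import os
import xml.etree.ElementTree as ET

# Limits taken from the list above.
MIN_TUS = 1_000
RECOMMENDED_TUS = 30_000
MAX_TUS = 30_000_000
# Assumption: "3 GB" is interpreted as 3 * 1024^3 bytes.
MAX_FILE_BYTES = 3 * 1024**3


def file_size_ok(path):
    """Return True if the file does not exceed the assumed 3 GB limit."""
    return os.path.getsize(path) <= MAX_FILE_BYTES


def count_translation_units(tmx_path):
    """Stream a TMX file and return (total TUs, incomplete TUs).

    A TU counts as incomplete here if it has fewer than two
    non-empty <seg> elements (for example, a missing target).
    """
    total = 0
    incomplete = 0
    for _event, elem in ET.iterparse(tmx_path, events=("end",)):
        if elem.tag == "tu":
            total += 1
            filled = [
                seg for seg in elem.iter("seg")
                if "".join(seg.itertext()).strip()
            ]
            if len(filled) < 2:
                incomplete += 1
            elem.clear()  # release the TU's children to limit memory use
    return total, incomplete


def classify_tu_count(total):
    """Compare a TU count with the documented limits."""
    if total < MIN_TUS:
        return "below minimum"
    if total > MAX_TUS:
        return "above maximum"
    if total < RECOMMENDED_TUS:
        return "below recommended"
    return "ok"
```

For example, `classify_tu_count(count_translation_units("my_memory.tmx")[0])` reports whether a hypothetical file `my_memory.tmx` meets the minimum and recommended TU counts. For a *.zip file, the counts of the contained TMs would need to be added up first.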