Automatic data cleaning

Automated pre-processing of TMX files used for training to improve the quality of the adapted and auto-adaptive models.

To ensure that customers get the best results when adapting models in Language Weaver, the bilingual data used for training is automatically pre-processed to filter out noise and normalize content. This article details the automated cleaning rules applied to TMX files before training starts.

Default cleaning steps

Language Weaver performs the following steps to filter out noisy data from a TMX file uploaded for adaptation.

Each step is described below, together with examples where relevant.
Discard training unit if either the source or target contains only spaces or is an empty string.
Discard training unit if any of the characters [] {} () <> 「」『』《》【】 do not match exactly between the source and target segments, both in count and order.
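The bracket-consistency check can be sketched as follows. This is an assumed implementation for illustration, not Language Weaver's actual code:

```python
# Paired characters checked for count and order (set taken from the rule above).
BRACKETS = set("[]{}()<>「」『』《》【】")

def brackets_match(source: str, target: str) -> bool:
    """True if both segments contain the same bracket characters,
    in the same order; otherwise the training unit is discarded."""
    def extract(segment: str) -> list:
        return [c for c in segment if c in BRACKETS]
    return extract(source) == extract(target)

brackets_match("see [1] and (2)", "voir [1] et (2)")  # True  -> keep
brackets_match("see [1]", "voir (1)")                 # False -> discard
```

Note that the rule compares the literal sequence of bracket characters, so a `[` in the source matched by a `(` in the target fails the check even though both are "opening" brackets.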
Discard training unit if the source and target segments contain a different number of bullet points.

Bullet points can be any symbols that might be used to introduce items in a list, such as bullets, dots, circles, stars, squares, rightwards arrows, etc.

Remove all bullet points from the source and target segments.
Discard training unit if the source and target segments are identical.
Discard training unit if the source and target segments contain a different number of email addresses.
Discard training unit if either the source or target segments contain only email addresses (and leading/trailing spaces, if present).
Detect and correct Unicode encoding errors by converting garbled text back to its intended Unicode format.
  • Common examples of broken Unicode text include:
    • Mojibake: Characters displayed as random symbols, question marks or garbled text because the encoding was misinterpreted. For example, "é" might appear as "Ã©" if UTF-8 text is read as ISO-8859-1.
    • Replacement characters: for example, the � (U+FFFD) symbol.
  • Example: "Ring meg nÃ¥" will be corrected to "Ring meg nå".
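Language Weaver's exact repair method is not documented here; a minimal sketch of the classic "UTF-8 read as Latin-1" repair is shown below. Production systems typically use a more robust library such as ftfy, which handles many more corruption patterns:

```python
def fix_mojibake(text: str) -> str:
    """Assumed heuristic: re-encode as Latin-1 and decode as UTF-8.
    If the round trip fails, the text was not this kind of mojibake
    and is returned unchanged."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

fix_mojibake("Ring meg nÃ¥")  # 'Ring meg nå'
```

Correctly encoded text survives the round trip or raises an error and is left alone, so the repair is safe to apply to every segment.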
Convert HTML special characters to their unescaped form.
  • Example: "a word &amp; another word" is converted to "a word & another word".
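HTML unescaping of this kind is available directly in Python's standard library, which is a reasonable stand-in for whatever Language Weaver uses internally:

```python
import html

# Named and numeric entities are converted back to literal characters.
html.unescape("a word &amp; another word")  # 'a word & another word'
html.unescape("&lt;tag&gt; &#169; 2024")    # '<tag> © 2024'
```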

Discard training unit if either the source or the target segments are longer than 500 characters.

Discard training unit if either the source or target segment is 100 words or longer. This filter is not applied if the source or target segments are in Chinese, Japanese or Thai (which do not separate words with spaces).
Discard training unit if either the source or target segment is two characters or shorter. Whitespace characters are not ignored.
Discard training unit if non-alphanumeric characters make up 50% or more of the total number of characters in either the source or target segment. Whitespace characters are ignored.
  • Examples of non-alphanumeric characters include Roman numerals, fractions, mathematical symbols, etc.
Discard training unit if numeric characters make up 50% or more of the total number of characters in either the source or target segment. Whitespace characters are ignored.
Discard training unit if whitespace characters make up 40% or more of the total number of characters in either the source or target segment.
  • This rule protects against ANTSPEAK data, such as text written "l i k e t h i s f o r e x a m p l e".
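The three character-ratio filters above can be sketched together. This is an assumed implementation; note in particular that Python's `str.isalnum()` treats some characters (e.g. Roman numeral code points) as alphanumeric, whereas the rule above counts them as non-alphanumeric, so a real implementation would use finer-grained Unicode categories:

```python
def should_discard(segment: str) -> bool:
    """Apply the non-alphanumeric, numeric and whitespace ratio filters."""
    no_space = [c for c in segment if not c.isspace()]
    if not no_space:
        return True  # nothing but whitespace

    # First two ratios ignore whitespace; the third counts every character.
    non_alnum_ratio = sum(1 for c in no_space if not c.isalnum()) / len(no_space)
    numeric_ratio = sum(1 for c in no_space if c.isdigit()) / len(no_space)
    space_ratio = sum(1 for c in segment if c.isspace()) / len(segment)

    return non_alnum_ratio >= 0.5 or numeric_ratio >= 0.5 or space_ratio >= 0.4

should_discard("l i k e t h i s f o r e x a m p l e")  # True (ANTSPEAK)
should_discard("A normal sentence with words.")         # False
```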
Replace single-character ligatures of Latin letters, such as the characters "æ" and "œ", with the characters they contain ("ae" and "oe" in this case).
  • "encyclopædia" will be updated to "encyclopaedia"
  • "ﬂuﬃest" will be converted to "fluffiest"
Replace "fullwidth" characters from the source segment with their standard form.
  • "ＬＯＵＤ ＮＯＩＳＥＳ" will be replaced with "LOUD NOISES"
  • "Ｕターン" will be replaced with "Uターン"
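Both the ligature and fullwidth replacements above largely correspond to Unicode NFKC normalization. A sketch, assuming that behaviour: NFKC folds compatibility ligatures (ﬂ, ﬃ) and fullwidth forms to their standard equivalents, but "æ" and "œ" are ordinary letters rather than compatibility characters, so they need an explicit mapping:

```python
import unicodedata

# æ/œ are not decomposed by NFKC, so map them explicitly.
LIGATURES = str.maketrans({"æ": "ae", "Æ": "AE", "œ": "oe", "Œ": "OE"})

def normalize_chars(text: str) -> str:
    """Fold compatibility ligatures and fullwidth forms, then
    expand the Latin-letter ligatures listed in the rules above."""
    return unicodedata.normalize("NFKC", text).translate(LIGATURES)

normalize_chars("encyclopædia")  # 'encyclopaedia'
normalize_chars("Ｕターン")       # 'Uターン'
```

Note the description says fullwidth characters are replaced in the source segment; whether the target is also normalized is not stated, so the sketch treats it as a per-string operation.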
Remove control characters from the source and target segments. Many of these control characters appear in the table of "Characters not suitable for use with markup" at http://www.unicode.org/reports/tr20/tr20-9.html.
  • ASCII control characters, except for the important whitespace characters (U+00 to U+08, U+0B, U+0E to U+1F, U+7F)
  • Deprecated Arabic control characters (U+206A to U+206F)
  • Interlinear annotation characters (U+FFF9 to U+FFFB)
  • The Object Replacement Character (U+FFFC)
  • The byte order mark (U+FEFF)
  • Right to Left Mark (U+200E)
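The ranges listed above translate directly into a regular-expression character class. A sketch of the removal step, assuming exactly these ranges:

```python
import re

# Ranges from the list above: ASCII controls (minus \t \n \f \r),
# deprecated Arabic controls, interlinear annotation characters,
# object replacement character, BOM, and the right-to-left mark.
CONTROL_CHARS = re.compile(
    "[\u0000-\u0008\u000B\u000E-\u001F\u007F"
    "\u206A-\u206F"
    "\uFFF9-\uFFFC"
    "\uFEFF\u200E]"
)

def strip_controls(text: str) -> str:
    """Remove the control characters listed above, keeping the
    important whitespace characters (tab, newline, etc.)."""
    return CONTROL_CHARS.sub("", text)

strip_controls("a\x01b\u200ec")  # 'abc'
```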
Remove emoji characters.
  • Examples: 😊, 👍, 💗
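Emoji removal can be approximated with a character-class over the most common emoji blocks. This is a rough sketch only; complete emoji detection follows Unicode UTS #51 and covers many more code points and sequences:

```python
import re

# Approximation: main emoji planes, miscellaneous symbols/dingbats,
# and the emoji variation selector. Not an exhaustive emoji set.
EMOJI = re.compile("[\U0001F000-\U0001FAFF\u2600-\u27BF\uFE0F]")

def strip_emoji(text: str) -> str:
    return EMOJI.sub("", text)

strip_emoji("Great 😊👍")  # 'Great '
```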
Replace one or more consecutive whitespace characters (as defined by the "\s" regex character class) with a single space in both the source and target segments.
  • Horizontal Tab: \t
  • Carriage Return + Line Feed: \r\n
  • Carriage Return: \r
  • Line Feed: \n
  • Form Feed: \u000C
  • Line Separator: \u2028
  • Paragraph Separator: \u2029
  • Next Line: \u0085
  • Non-breaking/mathematical/ideographic spaces
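The whole list above collapses to a single substitution in most regex engines. In Python 3, `\s` applied to a `str` matches the full Unicode whitespace set, including the non-breaking space, line/paragraph separators and NEL:

```python
import re

def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace (Unicode-aware) with one space."""
    return re.sub(r"\s+", " ", text)

collapse_whitespace("one\r\ntwo\u2028three\u00a0four")  # 'one two three four'
```

Whether the pipeline also trims leading/trailing spaces at this point is not stated, so the sketch leaves them in place.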
Remove tags, and any trailing whitespace that follows them, from both the source and target segments.
Discard training unit if the source or target segments contain characters from unexpected scripts (scripts unlikely to be part of the source/target segments alphabets).
Discard training unit if the source and target segments contain a different number of URLs/URIs.
Discard training unit if either the source or target segments contain text that is URL-encoded.
  • Example: "Hello%20World%21%20This%20is%20an%20example%2E"
Discard training unit if either the source or target segments contain only URLs or URIs (and leading/trailing spaces, if present).
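The URL-encoding rule can be approximated with a simple percent-escape detector. This is an assumed heuristic for illustration; the real filter may be stricter about how much of the segment must be encoded:

```python
import re

# A percent sign followed by two hex digits, e.g. %20, %2E.
PERCENT_ENCODED = re.compile(r"%[0-9A-Fa-f]{2}")

def looks_url_encoded(text: str) -> bool:
    """Heuristic: flag a segment containing any percent-escape."""
    return bool(PERCENT_ENCODED.search(text))

looks_url_encoded("Hello%20World%21")  # True  -> discard
looks_url_encoded("100% sure")         # False ('% s' is not a valid escape)
```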

Data deduplication

Language Weaver automatically removes duplicate segments from a dataset. The deduplication process scans every training unit and removes duplicates across all TMX files uploaded for adaptation, keeping a single copy of each.
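Treating each training unit as a (source, target) pair, deduplication amounts to keeping the first occurrence of each pair. A minimal sketch, assuming order is preserved:

```python
def deduplicate(units):
    """Keep the first occurrence of each (source, target) pair."""
    seen = set()
    unique = []
    for src, tgt in units:
        if (src, tgt) not in seen:
            seen.add((src, tgt))
            unique.append((src, tgt))
    return unique

deduplicate([("Hi", "Salut"), ("Hi", "Salut"), ("Bye", "Au revoir")])
# [('Hi', 'Salut'), ('Bye', 'Au revoir')]
```

Note that only exact duplicates are removed; two units with the same source but different targets both survive.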