Database size estimation
WorldServer can handle extremely large amounts of data. However, it is essential to estimate the amount of content that a WorldServer installation might be required to handle and to provide adequate hardware, including storage capacity. The estimation of storage requirements depends on factors such as the type of data to be used, the number of target languages supported, and typical user activity patterns. The three biggest contributors to storage are:
- Translation memories (TMs)
- Terminology databases (TDs)
- Segmented assets
Translation memories
You can measure translation memory size either in words or in entries, where an entry usually contains a single sentence. The average English sentence contains 15 to 20 words. An analysis of customer translation memories performed by SDL confirmed this range, with the average word count being 17 words per sentence. If you measure the size of a translation memory in the number of entries, you can use the following formula to calculate the number of words:
TM words = 17 * TM entries
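As a quick illustration, the conversion above can be sketched in Python. The function name and the sample entry count are hypothetical; only the factor of 17 words per entry comes from the analysis described above.

```python
# Average words per TM entry, taken from the analysis described above.
AVG_WORDS_PER_ENTRY = 17


def tm_words_from_entries(tm_entries: int) -> int:
    """Estimate the word count of a translation memory from its entry count."""
    return AVG_WORDS_PER_ENTRY * tm_entries


# Hypothetical example: a translation memory with 200,000 entries.
print(tm_words_from_entries(200_000))  # 3400000
```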
To estimate the disk space required for a translation memory, multiply the total translation memory size in thousands of entries by a space factor. The space factor depends on the database and must be determined experimentally.
If you do not have any existing translation memories, you can estimate the future translation memory size from the number of words you expect to translate by inverting the formula presented earlier: divide the expected word count by 17 to obtain the approximate number of entries.
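The two steps described above (deriving an entry count from an expected word count, then applying a space factor) might be sketched as follows. This is a minimal sketch, not an official sizing tool: the function names are hypothetical, the space factor value is a placeholder rather than a measured figure, and the unit (MB per thousand entries) is an assumption. Measure the real factor on your own database.

```python
# Average words per TM entry, taken from the analysis described earlier.
AVG_WORDS_PER_ENTRY = 17


def tm_entries_from_words(expected_words: int) -> float:
    """Estimate the number of TM entries from an expected word count."""
    return expected_words / AVG_WORDS_PER_ENTRY


def tm_disk_space(tm_entries: float, space_factor: float) -> float:
    """Multiply the TM size in thousands of entries by the space factor.

    The result is in whatever unit the space factor is expressed in
    per thousand entries (assumed here to be MB).
    """
    return (tm_entries / 1000) * space_factor


# Hypothetical input: 1.7 million words expected to be translated.
entries = tm_entries_from_words(1_700_000)

# Placeholder value only; the real factor must be measured experimentally.
PLACEHOLDER_SPACE_FACTOR_MB = 2.0

print(entries)                                              # 100000.0
print(tm_disk_space(entries, PLACEHOLDER_SPACE_FACTOR_MB))  # 200.0
```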
Customers using the file types introduced in WorldServer and processed by the File Type Support (FTS) Server should expect their translation memory size to increase compared to that of TMs created with legacy WorldServer filters. This size increase should be covered by the formula change described in the CONTENT entry for the estimate guidance presented later in this topic.
Terminology databases
Terminology database (TD or termbase) size is measured in number of entries. Entries usually contain one source term and one or more target terms. In most cases, there is a one-to-one correspondence, but a terminology database entry might contain several target terms in several languages.
Segmented assets
WorldServer segments all content into textual and markup elements. Usually, segmentation is done on a sentence level; each sentence becomes a separate segment. WorldServer stores this information in its database and uses it as the internal representation of the document. Each asset (file) becomes a separate segmented document structure. The size of this structure corresponds directly to the size of the document and the amount of textual content it contains. The estimating factor in the final formula is based on markup language formats (.html, .xml, and so on). For binary data formats (such as .doc and .ppt), the space factor is much smaller because most of the binary data deals with formatting and does not get loaded into WorldServer.
Overhead
In addition to the three biggest storage contributors described earlier, there is also the overhead of keeping all other WorldServer information, including users, projects, workflows, and so on. This overhead is difficult to estimate because it depends on how you are using WorldServer. However, 1 GB of space is usually more than enough to cover the overhead for most WorldServer installations.
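To show how the overhead allowance fits with the other contributors, the following is a rough sketch that simply adds the 1 GB overhead to per-contributor estimates. It is not the product's official sizing formula: the function name is hypothetical and the sample figures are made-up inputs, not measurements.

```python
# Overhead allowance from the guidance above (users, projects, workflows, ...).
OVERHEAD_GB = 1.0


def total_storage_gb(tm_gb: float, td_gb: float, assets_gb: float,
                     overhead_gb: float = OVERHEAD_GB) -> float:
    """Sum the three main storage contributors and the overhead allowance."""
    return tm_gb + td_gb + assets_gb + overhead_gb


# Hypothetical per-contributor estimates, in GB.
print(total_storage_gb(tm_gb=4.0, td_gb=0.5, assets_gb=10.0))  # 15.5
```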