Database size estimation

WorldServer is designed to handle very large amounts of data. Even so, you must estimate how much content a WorldServer installation will need to handle and provide adequate hardware, including storage capacity. Storage requirements depend on the type of data, the number of target languages supported, typical user activity patterns, the total system load (how many projects are turned around within a given time frame), and many other factors.

Areas in WorldServer that can require a large amount of storage space include:

  • Translation memories (TMs)
  • Terminology databases (TDs)
  • Segmented documents

Translation memories

Translation memory size can be measured either in words or in entries, where an entry usually contains a single sentence. The average English sentence contains 15 to 20 words. An analysis of customer translation memories performed by SDL confirmed this range, with an average of 17 words per sentence. If translation memory size is measured in entries, the number of words can be estimated with the following formula:


TMwords = 17 * TMentries
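
As an illustration, the formula above can be written as a small Python helper. This is only a sketch: the function name tm_words and the sample entry count are assumptions made for the example, not part of WorldServer.

```python
# Average words per TM entry, taken from the analysis cited above.
AVG_WORDS_PER_ENTRY = 17


def tm_words(tm_entries: int) -> int:
    """Estimate the word count of a translation memory from its entry count."""
    if tm_entries < 0:
        raise ValueError("tm_entries must be non-negative")
    return AVG_WORDS_PER_ENTRY * tm_entries


# Hypothetical example: a TM with 250,000 entries.
print(tm_words(250_000))  # 4250000
```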

To estimate the disk space required for a translation memory, multiply the total translation memory size, in thousands of entries, by a space factor. The space factor depends on the database and must be determined experimentally.
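
One possible way to obtain and apply a space factor is sketched below in Python. It assumes you import a sample translation memory, measure how much the database grows, and express the factor in megabytes per thousand entries. The function names, the unit, and all sample numbers are hypothetical; they are not values published for WorldServer.

```python
def space_factor_mb_per_k_entries(measured_mb: float, sample_entries: int) -> float:
    """Derive a space factor (MB per thousand entries) from a sample TM import.

    measured_mb is the database growth observed after importing the sample.
    """
    if sample_entries <= 0:
        raise ValueError("sample_entries must be positive")
    return measured_mb / (sample_entries / 1000)


def tm_disk_space_mb(tm_entries: int, factor: float) -> float:
    """Estimate TM disk space: size in thousands of entries times the space factor."""
    if tm_entries < 0:
        raise ValueError("tm_entries must be non-negative")
    return (tm_entries / 1000) * factor


# Hypothetical measurement: a 50,000-entry sample grew the database by 125 MB.
factor = space_factor_mb_per_k_entries(measured_mb=125.0, sample_entries=50_000)

# Apply the measured factor to a planned TM of 400,000 entries.
print(factor)                             # 2.5
print(tm_disk_space_mb(400_000, factor))  # 1000.0
```

Because the factor is database dependent, a measurement taken on one database system should not be reused for another.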

If you do not have any existing translation memories, the future translation memory size can be estimated from the number of words you expect to translate, using the formula presented earlier.
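
When no translation memory exists yet, the same relationship can be rearranged to go from expected words to entries. The Python sketch below assumes the 17 words per entry average given earlier; the function names and the sample word count are illustrative only.

```python
# Average words per TM entry, as stated earlier in this topic.
AVG_WORDS_PER_ENTRY = 17


def tm_entries_from_words(expected_words: int) -> int:
    """Estimate TM entries from an expected word count, rounding up."""
    if expected_words < 0:
        raise ValueError("expected_words must be non-negative")
    # Integer ceiling division avoids floating-point rounding.
    return -(-expected_words // AVG_WORDS_PER_ENTRY)


def in_thousands(entries: int) -> float:
    """Convert an entry count to thousands of entries."""
    return entries / 1000


# Hypothetical example: 3,400,000 words expected to be translated.
entries = tm_entries_from_words(3_400_000)
print(entries)                # 200000
print(in_thousands(entries))  # 200.0
```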

Customers using the file types introduced in WorldServer and processed by the FTS server should expect larger translation memories than those created with legacy WorldServer filters. The adjustment described under the CONTENT entry in the estimate guidance later in this topic should cover this increase.

Terminology databases

Terminology database (TD or termbase) size is measured in number of entries. Entries usually contain one source term and one or more target terms. In most cases, there is a one-to-one correspondence, but a terminology database entry might contain several target terms in several languages.

Segmented documents

WorldServer segments all content into textual and markup elements. Segmentation is usually done at the sentence level, so each sentence becomes a separate segment. WorldServer stores this data in the application database and uses it as the internal document representation. Each content asset (file) becomes a separate segmented document structure. The size of this structure corresponds directly to the size of the document and the amount of textual content it contains. The estimating factor in the final formula is based on markup language formats (HTML, SGML, XML, and so on). For binary data formats (DOC, PPT), the space factor is much smaller because most of the binary data describes formatting and is not loaded into WorldServer.

Overhead

In addition to the three biggest storage contributors described above, there is the overhead of storing all other WorldServer information, including users, projects, workflows, and so on. This overhead is difficult to estimate because it depends heavily on the particular usage pattern. However, SDL finds that 1 GB of space is more than enough to cover the overhead for most WorldServer installations.

Estimate guidance

Customers have used the following formula for estimating SQL Server database sizes for WorldServer 9.x and earlier. Take your own environment and data into account when you think about your own database needs.

Where:

  • TOTAL SPACE

    Total space required for the database, in gigabytes (GB)

  • LANGS

    Number of languages supported by the installation

  • TM

    Expected translation memory size, in thousands of entries

  • TD

    Expected terminology database size, in thousands of entries

  • CONTENT

    Total translatable content size, in megabytes (MB)

    Customers using the file types introduced in WorldServer and processed by the FTS server should expect to need more database space. These file types may require up to three times the CONTENT space required for assets processed by legacy WorldServer filters.
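
The CONTENT adjustment can be sketched in Python as follows. This is a worst-case calculation that counts FTS-processed content at the full three times; it assumes you can split your content into a legacy-filter portion and an FTS portion. The function name, the split, and the sample sizes are hypothetical, and the result is only an adjusted CONTENT input, not the total space estimate.

```python
# Upper bound from the guidance above: "up to three times" the legacy CONTENT space.
FTS_CONTENT_MULTIPLIER_MAX = 3


def content_upper_bound_mb(legacy_mb: float, fts_mb: float) -> float:
    """Worst-case CONTENT value in MB, counting FTS-processed content at 3x."""
    if legacy_mb < 0 or fts_mb < 0:
        raise ValueError("content sizes must be non-negative")
    return legacy_mb + FTS_CONTENT_MULTIPLIER_MAX * fts_mb


# Hypothetical example: 200 MB handled by legacy filters, 100 MB handled by FTS.
print(content_upper_bound_mb(legacy_mb=200.0, fts_mb=100.0))  # 500.0
```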